CSV and TSV: The Universal Data Exchange Format Explained

Why CSV Has Survived 50 Years

CSV is older than the personal computer. The format dates to at least the 1970s on IBM mainframes, predating VisiCalc, Excel, and the entire concept of a spreadsheet as we know it today. By the time structured data exchange became a web problem, CSV was already entrenched in every operating system, business application, and database export tool on the planet.

The reason it survived isn't elegance. CSV has no formal specification — the closest thing, RFC 4180, was written in 2005 as documentation of existing practice, not a design document. No types, no schema, no native support for nested data. Just text rows with values separated by commas.

It survived because spreadsheets survived. Excel opens CSV. Google Sheets opens CSV. Every database can export CSV. Every analytics tool can import it. Universal compatibility beats almost everything else in data exchange.

The Quoting Rules: RFC 4180

The basic case is obvious — comma-separated values, one record per line:

name,age,city
Alice,30,New York
Bob,25,London

But values that contain commas, newlines, or double-quote characters need to be quoted. That's where most hand-rolled CSV parsers break. RFC 4180 specifies:

  1. Fields may be enclosed in double quotes. If a field is not quoted, it must not contain commas, newlines, or double quotes.
  2. If a field contains commas, newlines, or double quotes, it must be enclosed in double quotes.
  3. Double quotes within a quoted field are escaped by doubling them: "" represents a literal ".

For example:

name,bio,score
Alice,"She said, ""hello world"" to the crowd",95
Bob,"Line one
Line two",87

That second record has a newline inside a quoted field — which means the record spans two physical lines. A naive line-by-line parser will corrupt this. Always use a proper CSV library.

// Node.js — don't hand-roll parsing
import { parse } from 'csv-parse/sync';

const records = parse(csvText, {
  columns: true,      // use first row as header names
  skip_empty_lines: true,
  trim: true
});

The equivalent in Python, using the standard library:

import csv
import io

reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    print(row['name'], row['age'])

Both Python's csv module and Node's csv-parse handle quoted fields, embedded newlines, and escaped quotes correctly. Any time you're tempted to split on commas yourself, reach for a library instead.
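To see why, here's a quick sketch contrasting a naive comma split with the csv module on a record containing an embedded comma and an escaped quote:

```python
import csv
import io

# A record with an embedded comma and a doubled-quote escape
csv_text = 'name,bio\nAlice,"She said, ""hi"""\n'

# Naive split: tears the quoted field apart
naive = csv_text.splitlines()[1].split(',')
print(naive)  # ['Alice', '"She said', ' ""hi"""'] — three mangled fields

# csv module: parses the same line correctly
row = next(csv.reader(io.StringIO(csv_text.splitlines()[1])))
print(row)  # ['Alice', 'She said, "hi"']
```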

TSV: A Simpler Alternative

TSV (Tab-Separated Values) uses a tab character (\t) as the delimiter instead of a comma. The appeal: tab characters almost never appear in real data, so you rarely need quoting. For flat data, TSV is often simpler to produce and parse.

name	age	city
Alice	30	New York
Bob	25	London

TSV is common in bioinformatics, genomics data, and system-generated exports where the data never goes near Excel. It's also the format MySQL's LOAD DATA INFILE and PostgreSQL's \copy prefer by default.
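Python's csv module reads TSV with the same API as CSV — only the delimiter argument changes. A minimal sketch:

```python
import csv
import io

tsv_text = "name\tage\tcity\nAlice\t30\tNew York\nBob\t25\tLondon\n"

# Same DictReader as for CSV, just with a tab delimiter
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
for row in reader:
    print(row['name'], row['city'])
```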

The tradeoff: tabs are invisible in most text editors. You can't tell whether something is tab-separated or space-separated at a glance, and copy-pasting TSV into Excel or a browser form sometimes converts tabs to spaces. For human-edited data, CSV is usually more practical.

Header Row Conventions

RFC 4180 explicitly makes the header row optional and leaves it to the application to decide. In practice, a header row is almost always present and expected. The convention:

  • First row contains field names
  • Names are lowercase with underscores or camelCase depending on your ecosystem
  • Names should be stable — downstream consumers hard-code them

If you're exporting CSV from a database, match the column names to the table column names unless there's a good reason not to. Consumers who write code against your CSV export will be grateful.

Some tools — particularly data science pipelines — also support a "no header" CSV where all rows are data. When sharing files with others, always clarify whether a header row is present.
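When a file has no header row, you can supply the field names yourself. A sketch using Python's DictReader (the column names here are illustrative):

```python
import csv
import io

# Headerless CSV: every row is data
headerless = "Alice,30,New York\nBob,25,London\n"

# fieldnames= tells DictReader not to consume the first row as a header
reader = csv.DictReader(io.StringIO(headerless),
                        fieldnames=['name', 'age', 'city'])
rows = list(reader)
print(rows[0]['name'])  # Alice
```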

The UTF-8 BOM Problem in Excel

Excel on Windows doesn't correctly recognize UTF-8 encoded CSV files unless they start with a BOM (byte order mark) — the three-byte sequence \xEF\xBB\xBF at the very beginning of the file. Reliably frustrating.

Without the BOM, Excel applies a legacy encoding (often Windows-1252) and any non-ASCII characters — accented letters, Chinese characters, emoji — appear as mojibake.

The fix when generating CSVs for Excel consumption:

// Add UTF-8 BOM so Excel opens it correctly
// '\uFEFF' is the Unicode BOM character (U+FEFF)
const BOM = '\uFEFF';
const csvContent = BOM + rows.join('\n');

# Python — specify utf-8-sig encoding (writes BOM automatically)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

The irony is that the BOM is not part of the RFC 4180 spec and can confuse non-Excel parsers. If your audience is a mix of Excel users and programmatic consumers, the safest approach is to offer both: a -excel.csv variant with BOM and a standard -utf8.csv without.
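On the read side, Python's utf-8-sig codec strips a leading BOM if one is present and leaves plain UTF-8 untouched, so it's a safe default when files might come from either source. A sketch:

```python
import csv
import io

# Simulate a BOM-prefixed file as Excel-oriented exports produce
raw = b'\xef\xbb\xbfname,age\nAlice,30\n'

# utf-8-sig strips the BOM; plain UTF-8 input decodes unchanged
text = raw.decode('utf-8-sig')
row = next(csv.DictReader(io.StringIO(text)))
print(row['name'])  # header is 'name', not '\ufeffname'
```

Decoding the same bytes as plain utf-8 would leave U+FEFF glued to the first header name, which is a classic source of "why doesn't row['name'] exist?" bugs.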

Common CSV Parsing Bugs

Splitting on commas. Already covered, but worth repeating: line.split(',') gives wrong results on any field that contains a comma. Even if your current data doesn't, the next export will.

Ignoring line ending differences. Windows line endings are \r\n, Unix are \n. CSV files from Windows sources often have \r appended to the last field in each row. Most libraries handle this; split('\n') doesn't.

Mishandling null vs empty string. CSV has no null type. An empty field (,,) might mean null, an empty string, zero, or missing — that's application-level semantics that CSV doesn't encode. Be explicit in your documentation.

Assuming consistent column count. Some CSV files have rows with different numbers of fields — malformed exports, trailing commas. Validate before processing if the data comes from an untrusted source.
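A defensive loop that checks column counts and makes the empty-vs-null decision explicit — a sketch, with "empty field means None" chosen here as an illustrative application-level rule:

```python
import csv
import io

csv_text = "name,age,city\nAlice,30,New York\nBob,25\n"  # second data row is short

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)

for lineno, row in enumerate(reader, start=2):
    # Reject rows whose field count doesn't match the header
    if len(row) != len(header):
        print(f"line {lineno}: expected {len(header)} fields, got {len(row)}")
        continue
    # Application-level choice: treat empty fields as None
    record = {k: (v if v != '' else None) for k, v in zip(header, row)}
    print(record)
```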

CSV vs JSON vs Excel: When to Use Which

Use CSV when: you need maximum interoperability, your data is flat tabular (no nesting), you're moving data between tools that both understand spreadsheets, or the recipient will open the file in Excel or Google Sheets. Data exports, report downloads, bulk import templates — CSV wins here.

Use JSON when: your data is nested or hierarchical, you're building an API, you need types (numbers, booleans, null), or you're working entirely within a JavaScript/TypeScript stack. For web-to-web data exchange, JSON is almost always better.

Use Excel (XLSX) when: cell formatting, formulas, or multiple sheets matter to the recipient. Raw data exchange should still prefer CSV even when the final consumer is Excel, because CSV is universally parseable and XLSX parsing libraries are heavy.

The CSV to JSON and JSON to CSV tools convert between these formats directly in your browser — nothing is uploaded to a server, so they're safe for sensitive data.

For a broader look at data format tradeoffs, see XML vs JSON: Which Format to Use and When. YAML is another common format for configuration data, covered in YAML Explained: Configuration Made Human-Readable.