CSV: the 50-year-old format that refuses to die
CSV is older than the personal computer, has never been fully standardised, and breaks in ways that are predictable but hard to fix. It also outcompetes every replacement. Here is the full story.
Comma-Separated Values is not a format anyone designed. It emerged from the practical needs of IBM mainframe operators in the early 1970s who needed a simple way to move tabular data between programs. There was no specification, no standards body, and no versioning: just a convention that spread because it was useful and required no special tools to read or write. Half a century later, CSV is ubiquitous in spreadsheets, databases, data science, and ETL pipelines, despite having accumulated a long list of well-documented problems.
Origins: IBM mainframes and the 1970s
The earliest documented use of comma-delimited data is in IBM operating systems of the late 1960s and early 1970s. Programs running on IBM System/360 and System/370 mainframes used comma-separated values to pass parameters and exchange tabular data between jobs in batch processing pipelines. The format was never formally specified: it was simply what the tools expected and what operators produced.
When personal computers arrived in the late 1970s and early 1980s, CSV came with them. VisiCalc, the first spreadsheet program for personal computers (1979), could import CSV data. Lotus 1-2-3 (1983) and dBASE II (1981) both supported CSV exchange. By the time Microsoft Excel appeared in 1985, CSV was already the de facto standard for moving tabular data between applications.
RFC 4180: the standard that isn't
RFC 4180, published by Yakov Shafranovich in 2005, is frequently described as the CSV specification. This description is technically accurate but misleading. RFC 4180 is an informational RFC: it documents existing practice, it does not mandate it. The RFC itself acknowledges the lack of a single prior specification and that existing implementations differ in ways the RFC cannot resolve.
The RFC specifies: fields separated by commas, records separated by CRLF line endings, an optional header row in the first line, fields containing commas or line breaks enclosed in double quotes, and double-quote characters inside quoted fields escaped by doubling them (""). These rules cover most cases. They do not cover: which character encoding to use, whether a BOM is permitted, how to handle leading or trailing whitespace, or what to do when producers use a different delimiter entirely, such as the semicolon Excel emits in locales where the comma is the decimal separator.
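Python's standard csv module follows these RFC 4180 conventions by default (comma delimiter, minimal double-quoting, doubled quotes for escaping), which makes it a convenient way to see the quoting rules in action. A minimal round-trip sketch:

```python
import csv
import io

# A row whose fields contain a comma, an embedded double quote,
# and a line break - the three cases RFC 4180 quoting covers.
row = ['plain', 'has, comma', 'has "quote"', 'has\nnewline']

buf = io.StringIO()
# lineterminator='\r\n' matches the CRLF record separator the RFC specifies.
writer = csv.writer(buf, lineterminator='\r\n')
writer.writerow(row)
encoded = buf.getvalue()
# Fields needing protection are quoted; the embedded " is doubled to "".
print(repr(encoded))

# Round-trip: the reader undoes the quoting and restores the fields,
# including the literal newline inside the last field.
decoded = next(csv.reader(io.StringIO(encoded)))
assert decoded == row
```

Note that the reader must consume a file-like object rather than pre-split lines; splitting on newlines first would break the quoted field that legitimately contains one.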
Excel and the BOM problem
Microsoft Excel's handling of CSV files has introduced two problems that affect international data exchange more than any specification gap. First, Excel on Windows defaults to saving CSV files in the system code page (typically Windows-1252 in Western European locales) rather than UTF-8. A CSV containing accented characters, CJK characters, or any non-Latin script exported from Excel and opened in another application will frequently display garbled text (mojibake) unless the encoding is explicitly matched.
Second, Excel requires a UTF-8 BOM (byte order mark: the three bytes EF BB BF at the start of the file) to correctly open a UTF-8 CSV. Without the BOM, Excel falls back to the system code page. Adding the BOM is not specified in RFC 4180 and causes parse errors in some non-Excel tools that do not expect a BOM. This creates an awkward choice: include the BOM for Excel compatibility or omit it for general compatibility.
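In Python, the 'utf-8-sig' codec handles both sides of this trade-off: it prepends the BOM on write and strips one (if present) on read. A sketch, assuming a hypothetical output file named export.csv:

```python
import csv

# Writing: 'utf-8-sig' emits the BOM bytes EF BB BF before the data,
# which is what Excel needs to recognise the file as UTF-8.
# newline='' lets the csv module control line endings itself.
with open('export.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['Søren', '東京'])

# The BOM really is the first three bytes of the file on disk.
with open('export.csv', 'rb') as f:
    bom = f.read(3)
assert bom == b'\xef\xbb\xbf'

# Reading: 'utf-8-sig' strips a leading BOM if one exists and behaves
# like plain UTF-8 if it does not, so one codec covers both kinds of file.
with open('export.csv', encoding='utf-8-sig', newline='') as f:
    rows = list(csv.reader(f))
```

Reading with 'utf-8-sig' everywhere is a common defensive choice, since it neutralises the BOM question for consumers regardless of what the producer decided.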
Why CSV persists despite its problems
CSV has survived because it has one property no alternative format has matched: zero-friction readability. A CSV file can be opened and meaningfully read in any text editor, any spreadsheet application, any database import tool, and any programming language without installing additional libraries or understanding a format specification. A non-technical user who has never heard of RFC 4180 can open a CSV file in Excel, read it, and edit it.
This accessibility has made CSV the lowest common denominator for data exchange between systems that would otherwise have no shared format. Banks export transactions as CSV. Government open data portals publish datasets as CSV. SaaS applications offer CSV export as the universal data escape hatch. Each of these use cases could theoretically use JSON, Parquet, or ORC, but those formats require tooling that not every recipient has.
Common pitfalls in practice
The practical problems that cause CSV to fail fall into a small number of recurring categories:
- Delimiter conflicts: Data containing commas requires quoting. When the quoting logic is implemented inconsistently between producer and consumer, values split incorrectly. The naive fix, switching to a tab delimiter (TSV), fails when data contains tabs.
- Encoding mismatches: As described above, UTF-8 vs Windows-1252 vs UTF-8 with BOM causes garbled data in international deployments.
- Line ending inconsistency: RFC 4180 specifies CRLF. Unix systems produce LF. Classic Mac OS (pre-2001) produced CR. Many CSV parsers handle all three; some do not.
- Leading zeros and scientific notation: Excel automatically reformats numeric-looking strings. A product code "007" becomes "7". A value like "1E+12" is interpreted as scientific notation. Phone numbers, postal codes, and product identifiers routinely lose leading zeros when opened in Excel.
- No schema: CSV carries no type information. Every parser must infer whether "12" is an integer, a float, or a string; whether "2024-01-15" is a date; whether "true" is a boolean or a word. Different parsers make different inferences.
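The last two pitfalls share a defensive pattern: treat every field as a string at parse time and convert explicitly, rather than letting the tool guess. Python's csv module already behaves this way, as a small sketch shows:

```python
import csv
import io

data = 'code,price,active\n007,1E+12,true\n'

# The csv module makes no type inferences: every field comes back as a
# string, so leading zeros and the literal text "1E+12" survive intact.
rows = list(csv.DictReader(io.StringIO(data)))
assert rows[0]['code'] == '007'   # leading zeros preserved, not "7"

# Any typing is then an explicit, caller-side decision:
price = float(rows[0]['price'])       # 1e12: a number because we chose so
active = rows[0]['active'] == 'true'  # boolean by convention, not by format
```

This is the opposite of Excel's behaviour: the trade-off is that the consumer must know the intended types, which is exactly the schema information CSV cannot carry.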
Why CSV will still be here in another 50 years
Every decade produces a new crop of formats intended to replace CSV: JSON in the 2000s, Avro and Parquet in the 2010s, Arrow in the 2020s. These formats solve real problems: schema enforcement, efficient columnar storage, binary encoding of types. They have captured important use cases in data engineering and analytics pipelines. They have not replaced CSV in the general population.
CSV's durability comes from the same property that made it ubiquitous: it requires no special knowledge to produce or consume. As long as there are spreadsheet users, which means as long as there are businesses, there will be CSV files. The format's flaws are well understood and can be worked around. Its advantage, universal accessibility, is structural and durable.
References
- Shafranovich, Y. (2005). RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. Internet Engineering Task Force.
- Knuth, D. E. (1968). The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley.
- Loukides, M. (2011). What is Data Science? O'Reilly Media.
- Microsoft. (2023). Import or export text (.txt or .csv) files. Microsoft Excel documentation. support.microsoft.com.
- Unicode Consortium. (2023). The Unicode Standard, Version 15.1. unicode.org.