From Smoke Signals to UTF-8: A Brief History of Binary Encoding
How humans went from semaphore flags to Baudot to ASCII to Unicode and UTF-8 — and why the modern web converged on a single encoding.
Encoding is the act of agreeing what symbols mean. Before computers, humans had been encoding language into compact signals for centuries — and the conceptual leap from a semaphore flag to a UTF-8 byte is smaller than it looks. Both are systems for mapping a finite alphabet to a finite set of physical states. The history of binary encoding is the history of squeezing more meaning into fewer bits.
Pre-Electric: Semaphore and Optical Telegraphs
Claude Chappe's optical telegraph network, built across France in the 1790s, used pivoting wooden arms on towers to transmit messages between Paris and the borders. Each arm position represented a code-book entry — not a letter, but a whole word or phrase. With 196 distinguishable positions, an operator could send a sentence in the time it took to read it. The principle is identical to modern encoding: a small alphabet of physical states, mapped through a shared lookup table to a much larger meaning space.
Morse: Variable-Length Before It Was Cool
Morse code (1837) was the first widely deployed binary-ish encoding for natural language. Two symbols (dot and dash) plus inter-symbol gaps; common letters got short codes (E is a single dot) and rare letters got long ones (Q is dash-dash-dot-dash). This was variable-length encoding: Huffman coding by hand, a century before Huffman. Shannon's 1948 source-coding theorem later explained why it works: matching code length to letter frequency pushes the average cost toward the theoretical minimum bits per symbol for a known character distribution.
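The arithmetic is easy to see in miniature. Here is a minimal Python sketch using a handful of real Morse codes; the letter frequencies are rough numbers chosen for illustration, not measured English statistics.

```python
# Variable-length coding, Morse-style: short codes for common letters.
# Frequencies below are rough illustrative guesses, not real English statistics.
morse = {"E": ".", "T": "-", "A": ".-", "N": "-.", "Q": "--.-", "Z": "--.."}
freq = {"E": 0.40, "T": 0.30, "A": 0.15, "N": 0.10, "Q": 0.03, "Z": 0.02}

avg = sum(freq[letter] * len(code) for letter, code in morse.items())
print(f"average symbols per letter: {avg:.2f}")
# 1.40 with these toy frequencies, versus 4.00 if every letter used a fixed
# four-symbol code; the saving comes entirely from making E and T cheap.
```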
Baudot: Fixed-Length, Five Bits
Émile Baudot's 1870 telegraph code traded efficiency for mechanical simplicity. Five bits per character, sent as five sequential pulses. Five bits gives 32 codes — not enough for letters and digits, so Baudot used shift codes (LETTERS and FIGURES) to flip between two character sets. This was the first time engineers chose a fixed-width encoding because the hardware demanded it. Telex networks ran on Baudot variants well into the 1980s.
ASCII: Seven Bits, English Only
ASCII was published in 1963 and reached its still-current form in 1968. Seven bits per character (128 codes), enough for the English alphabet in upper and lower case, digits, punctuation, and 32 control characters left over from the teletype era — TAB, LF, CR, BEL, and the still-mysterious DC1 through DC4. The eighth bit was reserved for parity on the noisy serial lines of the day.
ASCII's biggest design choice was making letters and digits contiguous. 'A' through 'Z' are 0x41 to 0x5A; 'a' through 'z' are 0x61 to 0x7A; the only difference is bit 5. That single property made case conversion a one-instruction operation and shaped the C standard library for fifty years.
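As a concrete illustration (a Python sketch, not the C library's actual code), flipping that one bit is all a case toggle needs for ASCII letters:

```python
def toggle_case_ascii(ch: str) -> str:
    """Toggle the case of an ASCII letter by flipping bit 5 (0x20)."""
    code = ord(ch)
    if 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:   # 'A'-'Z' or 'a'-'z'
        return chr(code ^ 0x20)
    return ch                                           # leave non-letters alone

print(toggle_case_ascii("A"), toggle_case_ascii("z"))   # prints: a Z
```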
The Code Page Wars
ASCII covered English. Everyone else was on their own. From the 1980s through the 2000s, the industry produced hundreds of incompatible 8-bit encodings — ISO-8859-1 for Western Europe, ISO-8859-5 for Cyrillic, Shift-JIS for Japanese, GB2312 for Simplified Chinese, KOI8-R for Russian. A document was just bytes; the meaning depended on which code page the receiver assumed. Mismatches produced mojibake — the legendary garbled text that anyone who used the early web has seen.
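You can still reproduce mojibake in one line today. The sketch below (Python, with an arbitrary example string) shows the same bytes read under two different assumptions:

```python
text = "café"
raw = text.encode("utf-8")       # b'caf\xc3\xa9' -- the bytes carry no label
print(raw.decode("latin-1"))     # cafÃ©  (receiver assumed ISO-8859-1)
print(raw.decode("utf-8"))       # café   (receiver assumed the right encoding)
```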
Unicode: One Codepoint Per Character, Forever
Unicode (1991) made the radical decision to give every character in every script a unique number, called a codepoint, with room for over a million. The first version had 7,161 characters; the current version 16.0 has over 154,000. Codepoints are written U+XXXX in hex.
But codepoints aren't bytes. The first attempt at storing them, UCS-2, used a fixed two bytes per character; its successor UTF-16 kept the two-byte unit and added surrogate pairs for everything beyond the first 65,536 codepoints. Both broke ASCII compatibility and doubled the size of every English document. The web revolted.
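A quick sketch of the size problem, using Python's built-in codecs (the string is just an example):

```python
text = "hello, world"                    # 12 ASCII characters
print(len(text.encode("utf-8")))         # 12 bytes: one byte per character
print(len(text.encode("utf-16-le")))     # 24 bytes: two per character, with a
                                         # NUL byte interleaved after each letter
```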
UTF-8: The Encoding That Won
Ken Thompson and Rob Pike designed UTF-8 in 1992 over dinner at a New Jersey diner. It has three properties that no other Unicode encoding had at once (a byte-level sketch follows the list):
- ASCII-compatible. Codepoints 0–127 encode to a single byte identical to ASCII. Existing English text and code is already valid UTF-8.
- Self-synchronising. The high bits of every byte tell you whether you're at the start of a character or in the middle of one. You can drop into a UTF-8 stream at any point and find the next character boundary in at most three bytes.
- Variable-length, 1 to 4 bytes per character. Common characters cost less; rare ones cost more. Documents in Latin scripts stay roughly the same size as ASCII; documents in Chinese or Japanese take three bytes per character but no shift codes.
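Here is a minimal sketch of all three properties using Python's built-in codecs; the characters are just examples, and the boundary-finding helper is mine, not part of any standard library:

```python
for ch in ["A", "é", "中", "🎉"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {bits}")
# U+0041   1 byte(s)  01000001                              <- identical to ASCII
# U+00E9   2 byte(s)  11000011 10101001
# U+4E2D   3 byte(s)  11100100 10111000 10101101
# U+1F389  4 byte(s)  11110000 10011111 10001110 10001001
# Lead bytes announce the sequence length in their high bits (0xxx, 110x, 1110, 11110);
# continuation bytes always start with 10, which is what makes the stream self-synchronising.

def next_boundary(buf: bytes, i: int) -> int:
    """Skip continuation bytes (10xxxxxx) until the next lead byte."""
    while i < len(buf) and (buf[i] & 0b11000000) == 0b10000000:
        i += 1
    return i
```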
UTF-8 was standardised as RFC 3629 in 2003. By 2010 it was the majority encoding on the web; by 2024 it accounted for over 98% of all web pages. The W3C, IETF, and most language standards now require or default to UTF-8.
What This Trajectory Tells You
Every successful encoding has done two things: minimised the cost of the common case, and made the rare case possible. Morse made E cheap. ASCII made English cheap. UTF-8 made ASCII cheap and Mandarin possible. The encodings that lost out for interchange (UTF-16 and UTF-32 on the wire, EBCDIC, the code-page zoo) failed one or both tests.
The next encoding, if there is one, will probably emerge from a domain we don't yet think of as text — neural network weights, biological sequences, sensor streams. The pattern, though, will be the same: a stable lookup, a self-synchronising stream, and a path forward when the alphabet grows.
References
- Cerf, V. (1969). RFC 20: ASCII format for Network Interchange. Internet Engineering Task Force.
- Yergeau, F. (2003). RFC 3629: UTF-8, a transformation format of ISO 10646. Internet Engineering Task Force.
- The Unicode Consortium. (2024). The Unicode Standard, Version 16.0.
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.
- Davies, D. W. & Barber, D. L. A. (1973). Communication Networks for Computers. John Wiley & Sons.