Unicode: How One Standard Encodes 149,000 Characters

Unicode is the attempt to create a single character encoding that covers every writing system in the world, past and present. As of version 15.1, it defines 149,813 characters. Here is how it works.

The Problem Unicode Solved

Before Unicode, the world had hundreds of incompatible character encodings. ASCII covered English; ISO 8859-1 extended it for Western European languages; ISO 8859-2 for Central and Eastern European languages; ISO 8859-5 for Cyrillic; Shift-JIS and EUC-JP for Japanese; GB2312 and Big5 for Chinese; KOI8-R for Russian; and so on across dozens of national and regional standards.

When documents or software crossed language boundaries, encoding chaos ensued. Text encoded in one system produced gibberish when decoded with another. Multilingual documents containing English, French, and Japanese in the same file were essentially impossible in this fragmented landscape, because no single encoding supported all three languages.
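The failure mode described above is easy to reproduce. The following is a minimal Python sketch, assuming only the codecs that ship with the standard library; the sample strings and the helper name `can_encode` are illustrative choices, not part of any standard.

```python
# Sketch: the same bytes read with the wrong decoder produce gibberish.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" (hello) in Cyrillic
koi8_bytes = russian.encode("koi8_r")

right = koi8_bytes.decode("koi8_r")    # decoded with the encoding that wrote it
wrong = koi8_bytes.decode("latin-1")   # decoded as ISO 8859-1: no error, wrong text

print(right)
print(wrong)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# One string mixing English, French, and Japanese.
mixed = "hello, fran\u00e7ais, \u65e5\u672c\u8a9e"
for name in ("ascii", "latin-1", "shift_jis", "utf-8"):
    print(name, can_encode(mixed, name))
```

The wrong decoder raises no error here, which is what made these bugs hard to spot: the bytes are valid in both encodings and only the meaning changes.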

Unicode was conceived in 1987 by Joe Becker at Xerox and Lee Collins and Mark Davis at Apple. The Unicode Consortium was formally incorporated in 1991. The goal was ambitious and necessary: a single universal encoding that would cover every character used in human writing, historical and contemporary, with no ambiguity.

Code Points: The Fundamental Unit

Unicode assigns a unique integer, called a code point, to every character it defines. Code points are written in hexadecimal with the prefix U+. For example:

  • U+0041: Latin Capital Letter A (the ASCII "A")
  • U+00E9: Latin Small Letter E with Acute (é)
  • U+4E2D: CJK Unified Ideograph 中 (the Chinese character for "middle")
  • U+1F600: Grinning Face (😀)
  • U+1D11E: Musical Symbol G Clef (𝄞)

Unicode code points range from U+0000 to U+10FFFF, a space of 1,114,112 possible code points. Of these, 149,813 are assigned to characters as of Unicode 15.1 (2023). Most of the remainder are either reserved for future use or designated as private use areas (for applications to define their own characters); a small number are surrogates and noncharacters, which never represent text on their own.
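The mapping between characters and code points can be inspected directly. This is a small Python sketch using the built-in `ord` and `chr` and the standard `unicodedata` module; the helper name `describe` is an illustrative choice, and the names returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using the U+ hex convention."""
    cp = ord(ch)  # character -> integer code point
    return f"U+{cp:04X} {unicodedata.name(ch, '<unnamed>')}"


# The five examples from the list above, written as escapes.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600", "\U0001D11E"]:
    print(describe(ch))

# chr() goes the other way: integer code point -> character.
middle = chr(0x4E2D)

# The code space runs from U+0000 to U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
TOTAL_CODE_POINTS = MAX_CODE_POINT + 1
print(f"{TOTAL_CODE_POINTS:,}")
```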

The Unicode Planes

Unicode organises its code point space into 17 planes of 65,536 code points each:

  • Plane 0 (Basic Multilingual Plane, BMP): U+0000–U+FFFF, containing the characters of virtually all modern writing systems, plus many symbols, technical characters, and compatibility characters
  • Plane 1 (Supplementary Multilingual Plane): U+10000–U+1FFFF, covering historic scripts, musical notation, mathematical symbols, emoji
  • Plane 2 (Supplementary Ideographic Plane): U+20000–U+2FFFF, holding CJK (Chinese-Japanese-Korean) unified ideograph extensions
  • Plane 3 (Tertiary Ideographic Plane): U+30000–U+3FFFF, holding further CJK unified ideograph extensions (Extensions G and H)
  • Planes 4–13: Unassigned, reserved for future use
  • Plane 14 (Supplementary Special-purpose Plane): U+E0000–U+EFFFF, holding tag characters and supplementary variation selectors
  • Planes 15–16: Private use areas, with no standard assignments, free for applications to define

Most text encountered in everyday web use is in the BMP. Emoji (many of which are in Plane 1) became the primary reason most developers encountered supplementary planes.
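Because each plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point divided by that size. A minimal Python sketch, with `plane_of` and `is_bmp` as illustrative helper names:

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) that a single character belongs to."""
    return ord(ch) // PLANE_SIZE


def is_bmp(ch: str) -> bool:
    """True if the character lives in the Basic Multilingual Plane."""
    return plane_of(ch) == 0


print(plane_of("A"))            # Latin letter
print(plane_of("\u4e2d"))       # CJK ideograph U+4E2D
print(plane_of("\U0001F600"))   # emoji U+1F600
print(plane_of("\U00020000"))   # first code point of the ideographic plane

# 17 planes of 65,536 code points give the full code space.
print(PLANE_COUNT * PLANE_SIZE)
```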

UTF-8: The Dominant Encoding

Unicode defines code points: the abstract numbers. Encodings define how those numbers are stored as bytes. Several Unicode encodings exist:

  • UTF-8: Variable-width encoding, 1–4 bytes per code point. The dominant encoding on the web (used by over 98% of websites). ASCII-compatible: any ASCII byte sequence is valid UTF-8.
  • UTF-16: Variable-width, 2 or 4 bytes per code point. Dominant in Windows internals, Java, and .NET string types.
  • UTF-32: Fixed-width, 4 bytes per code point. Rarely used for storage (4× larger than ASCII for English text) but convenient for processing (direct index access to code points).
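The size differences between the three encodings can be measured with Python's built-in codecs. The sketch below uses the little-endian variants (`utf-16-le`, `utf-32-le`) so that no byte order mark is added to the counts. The `surrogate_pair` helper is an illustrative name; it shows the arithmetic UTF-16 uses for the 4-byte case, where a code point above U+FFFF is split across two 16-bit units.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))


def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
```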

UTF-8 was invented by Ken Thompson and Rob Pike in September 1992, famously designed on a placemat at a New Jersey diner. Its key design decisions:

  • All ASCII characters (U+0000–U+007F) are encoded as single bytes identical to their ASCII encoding. Any ASCII file is valid UTF-8.
  • Non-ASCII code points are encoded in 2–4 bytes, with the first byte indicating the sequence length.
  • No valid multi-byte sequence can contain bytes in the ASCII range, which prevents false matches when scanning for ASCII patterns in UTF-8 text.
  • The encoding is self-synchronising: you can find the start of a character from any position by scanning backward at most 3 bytes.
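The design decisions above can be made concrete with a hand-written encoder. This is an illustrative Python sketch of the bit layout defined in RFC 3629, not a replacement for the built-in codec; `encode_utf8` and `char_start` are hypothetical helper names. Lead bytes carry the length marker (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) and every following byte has the form 10xxxxxx, which is what makes backward scanning possible.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:        # 1 byte, identical to ASCII
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def char_start(data: bytes, i: int) -> int:
    """Step back from index i to the first byte of the character containing it.

    Continuation bytes match 10xxxxxx, so at most 3 steps are needed
    in well-formed UTF-8.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# Check the sketch against Python's own encoder.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")

data = "a\U0001F600b".encode("utf-8")   # 1 + 4 + 1 bytes
print(data.hex(" "))
print(char_start(data, 4))              # index of the emoji's lead byte
```

Every byte the multi-byte branches emit has its high bit set, so none of them can be mistaken for an ASCII character.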

The Scope: 161 Scripts and Counting

Unicode 15.1 covers 161 scripts (writing systems), alongside large sets of symbols. Its repertoire includes:

  • All major living scripts: Latin, Cyrillic, Greek, Arabic, Hebrew, Devanagari, Hangul, Hiragana, Katakana, CJK ideographs, and many more
  • Historic scripts: Linear B, Cuneiform, Egyptian Hieroglyphs, Phoenician, Runic, Gothic, Ogham
  • Constructed scripts: Shavian, Deseret
  • Mathematical notation, musical notation, braille
  • Currency symbols, technical symbols, arrows, box drawing characters
  • Emoji: more than 3,600 emoji, counting sequences, as of 2023

The Great Unification Controversy: Han Unification

Unicode's most contested decision was Han Unification: treating Chinese, Japanese, and Korean characters as unified code points where they have the same historical origin, even when the contemporary forms in each language are visually distinct. A single code point might render differently in Chinese, Japanese, and Korean fonts.

Han Unification was a pragmatic decision that significantly reduced the number of code points required for CJK characters (which number in the tens of thousands). The Unicode Consortium argued that the differences between national variants were analogous to font differences: rendering style rather than distinct characters. Critics, particularly in Japan, argued that the distinct forms had developed into genuinely distinct characters in their respective languages and should have separate code points.

The controversy remains active. Language tagging (for example, the lang attribute in HTML, which lets software choose a language-appropriate font) and variation selectors can now specify national rendering variants, partially addressing the concern while maintaining unification.

Unicode and HTML Entities

HTML entity encoding maps to Unicode. Named entities like &copy; (©, U+00A9), &euro; (€, U+20AC), and &mdash; (em dash, U+2014) are shorthand for specific Unicode code points. Numeric entities like &#x1F600; (😀, U+1F600) reference Unicode code points directly.

In modern HTML5 with UTF-8 encoding declared (<meta charset="utf-8">), most Unicode characters can be included directly in HTML without entity encoding, because the UTF-8 bytes are valid. Entity encoding remains necessary for characters with special meaning in HTML (<, >, &, ") regardless of the document encoding.
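The relationship between entities and code points can be checked with Python's standard `html` module. This is a minimal sketch: `html.escape` and `html.unescape` are real standard-library functions, while `to_numeric_entity` is an illustrative helper introduced here.

```python
import html


def to_numeric_entity(ch: str) -> str:
    """Write one character as a hexadecimal numeric entity, e.g. &#x1F600;."""
    return f"&#x{ord(ch):X};"


# Named and numeric entities both resolve to plain Unicode characters.
named = html.unescape("&copy; &euro; &mdash;")
print([f"U+{ord(c):04X}" for c in named if c != " "])

# Decimal and hexadecimal numeric forms point at the same code point.
assert html.unescape("&#128512;") == html.unescape("&#x1F600;") == "\U0001F600"
print(to_numeric_entity("\U0001F600"))

# Characters with special meaning in HTML still need escaping,
# whatever the document encoding is.
print(html.escape('<a & "b">'))
```

By default `html.escape` also escapes quote characters, which matters when the text is placed inside an attribute value.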

