
From SGML to HTML5: The Evolution of Web Character Encoding

HTML's character entity system (&amp;, &lt;, &gt;, and hundreds more) traces back to SGML, whose roots lie in IBM's markup work of the late 1960s. Here is how web character encoding evolved over five decades.

SGML: The Ancestor of HTML

HTML was not invented from scratch — it was defined as an application of SGML (Standard Generalised Markup Language), a standard for describing document structure that originated in the 1960s at IBM. Charles Goldfarb, Edward Mosher, and Raymond Lorie developed GML (Generalised Markup Language) at IBM in 1969; this evolved into SGML, which became an ISO standard (ISO 8879) in 1986.

SGML defined markup as text contained in angle brackets (<tag>), and character references as ampersand-delimited sequences (&name;). These two conventions — angle-bracket tags and ampersand entities — passed directly from SGML into HTML and remain unchanged in HTML5 today.

When Tim Berners-Lee designed HTML in 1990–1991 as a way to link research documents at CERN, he chose to base it on SGML precisely because SGML already had the apparatus for document structure markup. HTML was defined as an SGML application — a specific set of SGML rules for hypertext documents — with a Document Type Definition (DTD) that specified allowed elements, attributes, and character references.

HTML 2.0 and the Latin-1 Entity Set

HTML 2.0 (RFC 1866, 1995) was the first formal HTML specification. It inherited from SGML a small set of named character entities: the &lt;, &gt;, &amp;, and &quot; entities for HTML-special characters, plus a set of entities for the ISO 8859-1 (Latin-1) characters beyond ASCII, such as &eacute; (é), &ntilde; (ñ), &uuml; (ü), and so on for Western European accented characters.
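As a small illustration of how these names map onto Latin-1, the sketch below decodes a few such entities with Python's standard html module and checks that each resulting character fits in a single ISO 8859-1 byte. The sample list is chosen for illustration only, and html.unescape follows HTML5 rules rather than the HTML 2.0 DTD, so this shows the mapping rather than a period-accurate parser.

```python
import html

# A few Latin-1 entity names of the kind HTML 2.0 defined (illustrative sample).
samples = ["&eacute;", "&ntilde;", "&uuml;"]

# Map each reference to the character it stands for.
decoded = {ref: html.unescape(ref) for ref in samples}

for ref, char in decoded.items():
    # Each of these characters occupies exactly one byte in ISO 8859-1.
    latin1_byte = char.encode("latin-1")
    print(f"{ref} -> {char} (U+{ord(char):04X}, Latin-1 byte 0x{latin1_byte.hex().upper()})")
```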

The Latin-1 entity set reflected the ISO 8859-1 bias of early web development: the web began at CERN in Switzerland, spread primarily through English-speaking and Western European research communities, and the character set for those languages was all that was immediately needed. The 191 printable characters in ISO 8859-1 covered most Western European languages (English, French, German, Spanish, Portuguese, Italian, Dutch, and the Scandinavian languages), with a few gaps such as the French œ ligature.

HTML 3.2 and 4.0: Expanding the Entity Set

HTML 3.2 (1997) expanded the entity set modestly. HTML 4.0 (1997) and its successor HTML 4.01 (1999) made a significant expansion: they defined entities for mathematical symbols, arrows, and Greek letters (primarily for science and mathematics on the web), and "special" characters including typographic characters like em dash (—), en dash (–), smart quotes, and the Euro sign (€).

The HTML 4 entity expansion reflected two pressures: the scientific community's need to include mathematical notation in web documents, and the global web's need to represent characters beyond the Western European set. The Euro sign entity arrived as the Euro itself was being introduced (as an accounting currency in 1999, with notes and coins following in 2002), and web developers needed a way to represent it.

HTML 4.01 also specified that documents should declare their character encoding, either in the HTTP Content-Type header or with the <meta http-equiv="Content-Type" content="text/html; charset=..."> element. This was an attempt to address the encoding chaos that was producing widespread mojibake (text decoded with the wrong character set) as the web became global.
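To make the mojibake problem concrete, here is a minimal sketch of what happens when bytes written as UTF-8 are read back as Latin-1, which is the kind of mismatch an encoding declaration is meant to prevent. The sample word is arbitrary.

```python
# Text saved by an author whose editor writes UTF-8.
text = "café"
utf8_bytes = text.encode("utf-8")  # b'caf\xc3\xa9': é becomes two bytes

# A reader that wrongly assumes Latin-1 turns each byte into its own character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# The damage is reversible only if the wrong guess is known.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

With the correct charset declared, the reader decodes the same bytes as UTF-8 and the two-byte sequence is never split.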

XHTML: The XML Detour

XHTML 1.0 (2000) was an attempt to redefine HTML as an application of XML rather than SGML. The primary motivation: XML's strict parsing rules (well-formedness requirements, namespace support, self-closing tags) were seen as a path toward a more consistent, tool-friendly web markup language.

XHTML's character encoding position was identical to XML's: all XML documents are Unicode documents, with UTF-8 as the default encoding. XHTML documents were expected to declare UTF-8 encoding and could contain any Unicode character directly or via numeric character references (&#xNNNN; in hexadecimal, &#NNNN; in decimal).
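A short sketch of how such numeric references can be produced and decoded, using only Python's standard library. The helper name to_hex_refs is hypothetical and introduced here for illustration; the decimal form relies on the built-in "xmlcharrefreplace" error handler.

```python
import html


def to_hex_refs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


original = "café 中 😀"

encoded = to_hex_refs(original)
print(encoded)  # caf&#xE9; &#x4E2D; &#x1F600;

# The standard library can emit the decimal form directly.
decimal = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal)  # caf&#233; &#20013; &#128512;

# Both forms decode back to the same text.
print(html.unescape(encoded) == original)
print(html.unescape(decimal) == original)
```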

XHTML achieved significant adoption: many websites of the 2000s declared an XHTML doctype. However, because these pages were almost always served as text/html, browsers continued to parse them using HTML parsing rules rather than XML parsing rules. The difference matters: an XML parser must reject a malformed document with a fatal error, while HTML parsers are designed to recover from errors. As a result, XHTML's strict parsing rules were rarely enforced in practice, and most "XHTML" websites were actually HTML parsed by the HTML parser.
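The contrast between the two parsing models can be sketched with Python's standard library. This is an illustration under stated assumptions: xml.etree.ElementTree stands in for a strict XML parser, and html.parser.HTMLParser is a lenient tokenizer, not an implementation of a browser's HTML parsing algorithm. The TagCollector class and xml_accepts helper are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical "tag soup": <br> is opened but never closed.
markup = "<p>line one<br>line two</p>"


def xml_accepts(source: str) -> bool:
    """Return True if the source is well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record start tags as the lenient HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(markup)
collector.close()

print(xml_accepts(markup))                          # False: fatal well-formedness error
print(xml_accepts("<p>line one<br/>line two</p>"))  # True: <br/> is self-closed
print(collector.tags)                               # ['p', 'br']
```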

HTML5: Back to Pragmatism

HTML5, developed from 2004 onward by WHATWG and later W3C, represented a deliberate retreat from the XHTML approach. Rather than requiring XML well-formedness, HTML5 defines precise error-recovery rules for malformed HTML — rules that all conformant browsers must implement identically, ensuring that even invalid HTML produces predictable results.

HTML5's character encoding position is the most decisive of any HTML version:

UTF-8 is the required encoding for all new HTML documents
The encoding declaration was simplified to <meta charset="utf-8">
Browsers do not assume UTF-8 when no encoding is declared: they fall back to a legacy, locale-dependent default (commonly windows-1252), which is why the declaration still matters
HTML5 defines 2,231 named character references, covering all 252 entities from HTML 4 plus a large number of additional Unicode characters; the full list is published in the named character references section of the HTML specification
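Python's html.entities module ships both the HTML 4 entity table and the HTML5 named character reference table, so the growth of the entity set can be inspected directly. The sketch below prints the sizes rather than asserting them, since the exact figures depend on the tables bundled with the interpreter.

```python
from html.entities import html5, name2codepoint

# html5 keys include the trailing semicolon ("gt;"); a few legacy names
# also appear without it, because the standard still accepts them.
print(len(name2codepoint), "HTML 4 entity names")
print(len(html5), "HTML5 named character reference entries")

with_semicolon = [name for name in html5 if name.endswith(";")]
print(len(with_semicolon), "HTML5 names ending in a semicolon")

# Some HTML5 references expand to more than one code point.
multi = [name for name, value in html5.items() if len(value) > 1]
print(len(multi), "references that expand to multiple code points")

# Spot checks: the same character through both tables.
print(html5["eacute;"], chr(name2codepoint["eacute"]))
```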

The Current State: HTML Entity Encoding in 2024

In modern HTML5 practice, the situation is clear:

Declare <meta charset="utf-8"> in every HTML document
Encode your files as UTF-8
Use literal Unicode characters for all content characters — é, ñ, 中, 😀 — they are valid UTF-8 and no entity encoding is needed
Use entity encoding only for characters with special meaning in HTML markup: &lt; (<), &gt; (>), &amp; (&), &quot; (")
Use named or numeric entities for characters that are hard to type but syntactically neutral in HTML: &copy; (©), &reg; (®), &mdash; (—)
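The practice above can be sketched with Python's html.escape, which touches only the markup-significant characters and leaves every other Unicode character as a literal. The wrapper name escape_for_html and the sample string are illustrative assumptions.

```python
import html


def escape_for_html(text: str) -> str:
    """Escape only markup-significant characters; leave other Unicode as-is."""
    # quote=True also escapes double and single quotes for use in attributes.
    return html.escape(text, quote=True)


sample = 'Crème brûlée <b>"5 < 6"</b> & 中文 😀'
escaped = escape_for_html(sample)
print(escaped)
# Crème brûlée &lt;b&gt;&quot;5 &lt; 6&quot;&lt;/b&gt; &amp; 中文 😀

# Decoding restores the original text exactly.
print(html.unescape(escaped) == sample)
```

Accented letters, CJK characters, and emoji pass through unchanged, which is safe as long as the page is saved and declared as UTF-8.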

The legacy of SGML's entity system remains embedded in HTML: the &-delimited entity syntax is unchanged from the SGML conventions it inherited. But the scope of what needs entity encoding has narrowed dramatically as UTF-8 adoption has become near-universal. The encoding chaos that motivated decades of entity workarounds is largely resolved, at least for newly created content.


References

Goldfarb, C.F. (1990). The SGML Handbook. Oxford University Press.
Berners-Lee, T., & Connolly, D. (1995). RFC 1866: Hypertext Markup Language — 2.0. IETF.
Raggett, D., Le Hors, A., & Jacobs, I. (1999). HTML 4.01 Specification. W3C.
Hickson, I., et al. (2021). HTML Living Standard. WHATWG.
Bray, T., et al. (1998). Extensible Markup Language (XML) 1.0. W3C.