
HTML Entity Encoder · 6 min read

Mojibake: The Encoding Disasters That Garbled the Internet

Mojibake (文字化け) is the Japanese term for garbled text: the visual noise produced when text is decoded with the wrong character encoding. Here is the history of encoding chaos and how it was resolved.

What Is Mojibake?

Mojibake (文字化け, literally "character transformation") is the Japanese term for the garbled text that appears when a document is decoded with a different character encoding than the one it was encoded with. The term has been adopted in English-language computing contexts because it succinctly describes a phenomenon that lacks a concise equivalent in English.

Mojibake appears as a sequence of characters that look wrong: accented characters replaced by strings of strange symbols, question marks, or boxes. The precise garbling depends on which encoding was used to write the text and which was used to read it. The pattern is not random noise; it is a systematic misinterpretation of valid bytes.

A classic example: the German word "Küche" (kitchen), encoded in ISO 8859-1 and decoded as if it were UTF-8, produces "K�che": the ü (the single byte 0xFC in ISO 8859-1) is not valid UTF-8, so the decoder substitutes the replacement character. Run the mismatch the other way, decoding UTF-8 bytes as ISO 8859-1, and "Küche" comes out as "KÃ¼che", each byte of the two-byte ü rendered as a separate Latin-1 character.
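The mismatch is easy to reproduce in any language that exposes explicit encode and decode steps; a minimal Python sketch of both directions of the "Küche" example:

```python
# Mojibake demo: encode text with one codec, decode it with another.

text = "Küche"

# ISO 8859-1 bytes read as UTF-8: 0xFC is not a valid UTF-8 byte,
# so a strict decoder raises; a lenient one substitutes U+FFFD.
latin1_bytes = text.encode("iso-8859-1")               # b'K\xfcche'
print(latin1_bytes.decode("utf-8", errors="replace"))  # K�che

# UTF-8 bytes read as ISO 8859-1: each byte of the two-byte
# ü sequence becomes a separate Latin-1 character.
utf8_bytes = text.encode("utf-8")                      # b'K\xc3\xbcche'
print(utf8_bytes.decode("iso-8859-1"))                 # KÃ¼che
```

Either direction is a deterministic transformation of valid bytes, which is why mojibake patterns are recognisable rather than random.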

The Era of Encoding Chaos (1985–2005)

Before Unicode adoption became widespread, the internet and personal computing were a patchwork of incompatible encodings. Different regions used different standards:

  • Western Europe: ISO 8859-1 (Latin-1) and its Windows variant CP1252
  • Central/Eastern Europe: ISO 8859-2 (Latin-2)
  • Russia/Eastern Europe (Cyrillic): KOI8-R, ISO 8859-5, CP1251 – multiple incompatible encodings for the same script
  • Japan: Shift-JIS (Windows), EUC-JP (Unix/Linux), and ISO-2022-JP (email) – three incompatible encodings for the same language
  • Traditional Chinese: Big5
  • Simplified Chinese: GB2312 and GBK
  • Korea: EUC-KR and CP949

Operating systems, email clients, web browsers, and web servers all had to guess or declare encodings. When they disagreed, the result was mojibake.
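The Cyrillic case is easy to reproduce, since the incompatible encodings assign the same byte values to different letters. A small Python sketch (the Russian word here is just an illustrative choice):

```python
# The same Russian text under two incompatible Cyrillic encodings.
word = "привет"  # "hello"

koi8_bytes = word.encode("koi8_r")

# Read with the right codec: intact.
print(koi8_bytes.decode("koi8_r"))   # привет

# Read with the wrong Cyrillic codec: every byte is still valid,
# but it maps to entirely different letters.
garbled = koi8_bytes.decode("cp1251")
print(garbled)

# The damage is systematic, so knowing both codecs makes it reversible.
print(garbled.encode("cp1251").decode("koi8_r"))  # привет
```

The round trip back to the original is what made Cyrillic mojibake merely annoying rather than destructive, provided the reader could guess which pair of encodings was involved.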

The Email Mojibake Era

Email was one of the worst-affected communication channels. Early email was limited to ASCII: the SMTP protocol (Simple Mail Transfer Protocol) was designed in 1982 for ASCII text. As email crossed language boundaries, encoding problems proliferated.

MIME (Multipurpose Internet Mail Extensions, 1992) added encoding declaration headers to email, allowing non-ASCII content to be specified with a declared encoding. But implementation was inconsistent. Email clients often ignored MIME headers, applied their own guesses, or displayed text using the operating system's default encoding regardless of declaration.
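MIME's declared-charset mechanism survives today in every mail library; a sketch of the round trip using Python's standard library (RFC 2047 encoded-word syntax, where the charset travels inside the header itself):

```python
from email.header import Header, decode_header, make_header

# A non-ASCII subject line, as a sender's client would encode it.
subject = "Grüße aus Köln"
wire = Header(subject, "utf-8").encode()
print(wire)  # an =?utf-8?...?= encoded word carrying its own charset

# A MIME-aware client reverses it using the declared charset.
decoded = str(make_header(decode_header(wire)))
print(decoded)  # Grüße aus Köln
```

Clients that honoured the declaration recovered the text exactly; the mojibake era came from clients that ignored it and decoded the raw bytes with a local default instead.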

The result was decades of international email that regularly arrived garbled. Japanese and Chinese users developed a particularly acute awareness of mojibake because their writing systems use far more code points than European scripts: the probability of a decoding error in a CJK-encoded message was high, and the result was completely unreadable rather than merely containing a few wrong characters.

The Web Encoding Wars

Early web pages were frequently misencoded or incorrectly declared. HTML 2.0 (1995) made charset declarations optional; most early web pages specified no encoding, leaving browsers to guess. Browsers developed heuristic encoding detection, scanning the document for byte patterns that suggested one encoding over another. The heuristics were imperfect and produced mojibake when they were wrong.
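A crude version of such a heuristic is easy to sketch. This is an illustrative simplification, not any browser's actual algorithm: UTF-8 has strict structural rules, so a successful strict decode is strong evidence for it, with a legacy single-byte encoding as the fallback:

```python
def sniff_decode(data: bytes) -> tuple[str, str]:
    """Guess an encoding: try strict UTF-8, fall back to CP1252.

    Illustrative only; real browsers used far richer heuristics
    (byte-frequency statistics, locale defaults, meta tags).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Almost any byte sequence is decodable as CP1252, so with
        # errors="replace" this branch cannot fail.
        return data.decode("cp1252", errors="replace"), "cp1252"

print(sniff_decode("Küche".encode("utf-8")))    # ('Küche', 'utf-8')
print(sniff_decode("Küche".encode("latin-1")))  # ('Küche', 'cp1252')
```

The weakness is visible even here: short legacy-encoded inputs can happen to be valid UTF-8, and the sketch would then guess wrong, which is exactly how real heuristics produced mojibake.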

Microsoft Internet Explorer and Netscape Navigator had different heuristics, causing pages that displayed correctly in one browser to show mojibake in the other. Web developers working with non-ASCII content had to test in multiple browsers and often had to insert explicit encoding hints to prevent garbling.

The particular chaos of the Japanese web in the 1990s spawned an entire genre of technical advice and debugging tools. Japanese web developers became expert in encoding problems by necessity β€” the three competing Japanese encodings (Shift-JIS, EUC-JP, ISO-2022-JP) and the poor-quality encoding detection in early browsers made encoding errors routine.

Famous Mojibake Patterns

Several mojibake patterns became recognisable enough to be identified by the encoding mismatch they represent:

  • ’ for ’ (right single quotation mark, U+2019): UTF-8 bytes decoded as Windows-1252 (the encoding browsers actually apply when ISO 8859-1 is declared). One of the most common mojibake patterns in English web content, produced when a smart quote is pasted from Microsoft Word into a CMS that assumes a legacy encoding
  • Ã¼ for ü: UTF-8 encoded German text decoded as ISO 8859-1
  • ??? or □□□ sequences: Characters with no equivalent in the decoded encoding, rendered as the fallback replacement character
  • � (U+FFFD, the Unicode replacement character): What Unicode-aware systems display for bytes that are not valid in the declared encoding
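Each of these patterns falls out of a one-line decode mismatch; reproducing them in Python:

```python
# U+2019 (’) through the classic Word-to-web mismatch:
print("’".encode("utf-8").decode("cp1252"))        # ’

# German ü through the same mismatch (plain Latin-1 suffices here):
print("ü".encode("utf-8").decode("latin-1"))       # Ã¼

# The replacement character, from a byte that is not valid UTF-8:
print(b"\xfc".decode("utf-8", errors="replace"))   # �
```

Because each pattern is deterministic, experienced debuggers could read the mojibake itself as a diagnosis of which two encodings had been confused.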

The Microsoft Quotation Mark Problem

One specific encoding problem became notorious in English-language web publishing: Microsoft Word's "smart quotes", the typographically curved quotation marks (“ ” ‘ ’), exist in the Windows-1252 (CP1252) code page but not in ISO 8859-1. CP1252 places them in the byte range 0x80–0x9F, which ISO 8859-1 reserves for C1 control codes.

When writers pasted text from Word into web content management systems, the smart quotes travelled as CP1252 bytes. A page decoded as strict ISO 8859-1 turned byte 0x92 into an invisible control character or an empty box; and wherever the text was later converted to UTF-8 and then mistakenly decoded as CP1252 again, a common CMS migration path, the single quote expanded into the familiar three-character sequence ’, the double quote into “, and so on.

This problem affected millions of web pages published in the early 2000s. Many content management systems added smart-quote-to-entity converters specifically to handle this Microsoft-Word-to-web pipeline. The lasting fix was declaring or converting to UTF-8, which encodes the smart quotes as multi-byte sequences that every UTF-8-aware system decodes correctly.
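That converter step can be approximated with a built-in Python error handler, which replaces any character outside the target encoding with a numeric HTML character reference, the same idea behind the CMS smart-quote converters:

```python
# Replace every non-ASCII character with a numeric character
# reference, so the output survives any legacy-encoding pipeline.
text = "It’s a “smart” quote"
entities = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(entities)  # It&#8217;s a &#8220;smart&#8221; quote
```

Entity-encoded output is pure ASCII, so no downstream encoding mismatch can garble it; the browser reconstructs the curly quotes from the numeric references.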

The Resolution: UTF-8 Everywhere

The encoding chaos was resolved gradually and incompletely by the adoption of UTF-8 as the universal web encoding. The HTML5 specification (2014) strongly encourages UTF-8 for all web content, and the current WHATWG HTML and Encoding standards require it outright for new documents, formats, and protocols. Browsers still fall back to locale-dependent legacy defaults when no encoding is declared, however, which is why an explicit UTF-8 declaration remains necessary. HTTP moved the same way: RFC 7231 (2014) removed HTTP/1.1's old default of ISO 8859-1 for text content types, leaving the charset to whatever the Content-Type header declares.

As of 2023, over 98% of web pages declare UTF-8 encoding. Modern web development frameworks and CMS platforms default to UTF-8 and make it difficult to accidentally use a different encoding. The era of routine mojibake on the web is largely over, though it persists in legacy systems, data migrations between old databases and modern applications, and anywhere that data crosses a boundary between UTF-8 and a legacy encoding without explicit conversion.
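When the mismatch is known, the damage from a single bad decode is mechanically reversible: re-encode with the codec that was wrongly applied, then decode with the right one. A minimal sketch, assuming the common CP1252-read-as-UTF-8 case:

```python
def undo_mojibake(garbled: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Reverse one round of mis-decoding.

    The underlying bytes were `right`, but someone decoded them as
    `wrong`; re-encoding with `wrong` recovers the original bytes.
    """
    return garbled.encode(wrong).decode(right)

print(undo_mojibake("KÃ¼che"))    # Küche
print(undo_mojibake("â€™"))      # ’
```

This only works when every garbled character survives the round trip through the wrong codec; after lossy steps (replacement characters, stripped controls), the original bytes are gone and no conversion can restore them.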

Encode HTML entities →
