Decoding Mojibake: How HTML Entities and Unicode Fix Garbled Text

Understand common character encoding problems like mojibake, and learn how HTML entities and Unicode (UTF-8) resolve them. Includes a worked example.

You've probably seen it: 锟斤拷, ä½ å¥½, or □□□. That's mojibake — garbled text caused by mismatched character encodings. This article explains why mojibake happens, how Unicode and UTF-8 solve the root problem, and how HTML entities provide a safe escape hatch for the web. You'll walk away with a concrete mental model and a worked example that ties it all together.


The Root Cause: Encoding Mismatch

At its core, mojibake is simple: the bytes that represent text were written with Encoding A, but read with Encoding B. Because different encodings map different characters to the same byte sequences, the reader interprets the bytes wrongly.

For example, the Chinese character 中 has:

GBK encoding: 0xD6 0xD0
UTF-8 encoding: 0xE4 0xB8 0xAD

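If a file saved as GBK is opened as UTF-8, the bytes 0xD6 0xD0 are not valid UTF-8, so the editor typically shows replacement characters (��). Opened as Latin-1 or Windows-1252 instead, the same bytes silently become ÖÐ. The notorious 锟斤拷 appears one step later, when those replacement characters are saved as UTF-8 and the result is read back as GBK.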
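A minimal Python sketch of this mismatch. The byte values are the ones listed above; using Latin-1 as a second "wrong" decoder is an illustrative assumption:

```python
# Encode the same character with two different encodings.
ch = "中"
gbk_bytes = ch.encode("gbk")      # D6 D0
utf8_bytes = ch.encode("utf-8")   # E4 B8 AD

# Read the GBK bytes with the wrong decoder.
# As UTF-8 the bytes are invalid, so they become U+FFFD replacement characters.
as_utf8 = gbk_bytes.decode("utf-8", errors="replace")

# As Latin-1 every byte maps to some character, so the damage is silent.
as_latin1 = gbk_bytes.decode("latin-1")

print(gbk_bytes.hex(), utf8_bytes.hex())
print(repr(as_utf8), as_latin1)
```

The UTF-8 decoder at least signals that something is wrong. Single-byte decoders such as Latin-1 accept any byte, which is why their mojibake often goes unnoticed.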

The Unicode Solution: One Numbering System to Rule Them All

Unicode assigns every character a unique code point (e.g., U+4E2D for 中), decoupling the character's identity from its byte representation. Then UTF-8 (and other encodings like UTF-16) defines how to store those code points as bytes.

Encoding	`中` (U+4E2D) bytes	Notes
UTF-8	E4 B8 AD	Variable-length, ASCII-compatible, dominant on the web
UTF-16	4E 2D (or 2D 4E)	2 or 4 bytes per code point; used by Windows/JavaScript
GBK	D6 D0	Legacy Chinese encoding; 1 byte for ASCII, 2 bytes for Chinese characters
ASCII	N/A (not supported)	Only 128 characters
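The rows of the table can be reproduced with a few lines of Python. This is only a sketch; the codec names are the ones Python's standard library uses:

```python
ch = "中"

# The code point is the character's identity, independent of any byte encoding.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Each encoding turns the same code point into different bytes.
for name in ("utf-8", "utf-16-be", "utf-16-le", "gbk"):
    print(name, ch.encode(name).hex())

# ASCII has no byte sequence for this character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ascii supported:", ascii_ok)
```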

Key insight: Unicode + UTF-8 eliminates mojibake at the byte level, as long as both ends agree on UTF-8. But what about text that must pass through channels that are not reliably UTF-8-clean, or that contains characters with special meaning in HTML?

HTML Entities: Escaping for the Web

HTML entities (e.g., &amp;amp;, &amp;#x4E2D;) allow you to represent any Unicode character using only ASCII-safe characters. This is crucial when:

You embed user-generated text in HTML (prevents XSS and encoding corruption)
You need to include characters that might not survive a legacy transport (e.g., email, old databases)
You want to make invisible or ambiguous characters explicit

There are two forms:

Named entities: &amp;amp; for &, &amp;lt; for <
Numeric entities: &amp;#x4E2D; (hex) or &amp;#20013; (decimal) for 中
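Both forms can be tried out with Python's standard html module. The helper to_numeric_refs below is hypothetical, written only for this illustration:

```python
import html

# Named entities for the characters that are special in HTML markup.
escaped = html.escape('a < b & "c"')

# html.unescape understands named, hexadecimal, and decimal references.
decoded = html.unescape("&amp; &lt; &#x4E2D; &#20013;")

def to_numeric_refs(text, hexadecimal=True):
    """Hypothetical helper: replace each non-ASCII character with a numeric reference."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                    # ASCII passes through unchanged
        elif hexadecimal:
            out.append(f"&#x{ord(ch):X};")    # hex form
        else:
            out.append(f"&#{ord(ch)};")       # decimal form
    return "".join(out)

print(escaped)
print(decoded)
print(to_numeric_refs("中"), to_numeric_refs("中", hexadecimal=False))
```

The hex and decimal forms are interchangeable. Hex is convenient because it matches the U+XXXX notation used for code points.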

Our companion tool, the HTML entities encoder/decoder, lets you instantly convert between plain text and HTML entities.

Worked Example: From GBK to UTF-8 via HTML Entities

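Let's trace a real-world scenario: you receive a CSV file exported from a legacy system, saved as GBK. You open it in a UTF-8 editor and see � replacement characters where the Chinese text should be. (If you see 锟斤拷 instead, the file has already been re-saved after a wrong decode and the original bytes are lost; go back to the source export.) Here's how to fix it.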

Step 1: Identify the original encoding

Check the file's metadata or use a tool like file on Linux:

$ file -bi data.csv
text/plain; charset=iso-8859-1  # often misdetected

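Better: use a detector such as chardet, or test-convert with iconv -f GBK and check that it finishes without errors. Note that iconv itself converts between encodings; it does not detect them.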
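When no detection library is available, trial decoding is a rough fallback. The sketch below is an assumption-laden heuristic, not a real detector: guess_encoding is a hypothetical helper, and the candidate list is an example you would tailor to the encodings your data source could plausibly use.

```python
def guess_encoding(raw, candidates=("utf-8", "gbk", "shift_jis")):
    """Return the first candidate that decodes raw without errors, else None.

    Order matters: put strict encodings such as UTF-8 first, and leave out
    Latin-1, which accepts every byte sequence and would always "match".
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

sample = "你好世界".encode("gbk")
print(guess_encoding(sample))
```

A successful decode only shows that the bytes are valid in that encoding, not that the result is the intended text, so check the output by eye before converting a whole file.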

Step 2: Re-encode to UTF-8

Use iconv to convert:

$ iconv -f GBK -t UTF-8 data.csv > data_utf8.csv

Now the Chinese characters display correctly.

Step 3: Escape for HTML (if needed)

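If this data will be embedded in a web page, escape the characters that are special in HTML (&, <, >, and quotes) to avoid XSS and broken markup. Non-ASCII characters only need numeric entities when the output must be ASCII-only. Using our HTML entities tool, paste 中 and get &amp;#x4E2D; or &amp;#20013;.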

End-to-end in Python:

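import html

# Read the raw bytes; nothing is decoded yet
with open('data.csv', 'rb') as f:
    raw = f.read()

# Wrong: decoding GBK bytes as UTF-8 yields U+FFFD replacement characters
garbled = raw.decode('utf-8', errors='replace')
print(garbled)  # e.g. a run of � characters

# Correct: decode as GBK to recover the text, then encode as UTF-8 for storage
text = raw.decode('gbk')
utf8_bytes = text.encode('utf-8')
print(text)  # e.g. 你好世界

# For HTML output, escape the markup-significant characters (&, <, >, quotes)
safe = html.escape(text)
print(safe)  # non-ASCII characters such as 你好世界 pass through unchanged

# Optional: ASCII-only output using numeric character references
ascii_safe = safe.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(ascii_safe)  # for 你好世界: &#20320;&#22909;&#19990;&#30028;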

Common Pitfalls

Assuming UTF-8 everywhere: Legacy systems (Windows ANSI, old databases) may use GBK, Shift-JIS, or Latin-1. Always verify.
Using .length to truncate strings: In JavaScript, '👨‍👩‍👧‍👦'.length is 11 (UTF-16 code units), not 1 (grapheme cluster). Truncating by length can break emoji or combined characters.
BOM (Byte Order Mark): UTF-8 with BOM (EF BB BF) can confuse parsers. Prefer UTF-8 without BOM.
Database charset: MySQL's utf8 is actually utf8mb3 (max 3 bytes). Use utf8mb4 to support emoji and rare characters.
HTML entity double-encoding: If you encode &amp;amp; again, it becomes &amp;amp;amp;. Only encode once.
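Two of these pitfalls, the BOM and double-encoding, are easy to reproduce in Python. The sample strings are made up for illustration:

```python
import codecs
import html

# BOM: plain "utf-8" keeps the mark as an invisible U+FEFF at the start of the text;
# the "utf-8-sig" codec strips it when present.
with_bom = codecs.BOM_UTF8 + "id,name".encode("utf-8")
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(kept), repr(stripped))

# Double-encoding: escaping already-escaped text escapes the ampersand again.
once = html.escape("Tom & Jerry")
twice = html.escape(once)
print(once)
print(twice)
```

Decoding input with utf-8-sig is a reasonable default when files may or may not carry a BOM, since it behaves like plain UTF-8 when the mark is absent.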

FAQ

What exactly is mojibake?

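Mojibake is garbled text caused by decoding bytes using the wrong character encoding. The classic Chinese example is 锟斤拷. It appears when GBK-encoded text is read as UTF-8, the invalid bytes are replaced with the replacement character (U+FFFD), those replacement characters are saved as UTF-8 (EF BF BD each), and the result is then read as GBK: two replacement characters, EF BF BD EF BF BD, decode to 锟斤拷.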
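The chain can be replayed step by step in Python. This sketch uses the single character 中 as a stand-in for the original GBK text:

```python
original = "中".encode("gbk")                        # GBK bytes D6 D0

# Step 1: read as UTF-8; the invalid bytes become U+FFFD.
step1 = original.decode("utf-8", errors="replace")

# Step 2: save the damaged text as UTF-8; each U+FFFD becomes EF BF BD.
step2 = step1.encode("utf-8")

# Step 3: read those bytes as GBK, two bytes per character.
step3 = step2.decode("gbk")

print(step2.hex(), step3)
```

After step 1 the original bytes are gone, so 锟斤拷 cannot be converted back to the source text; the fix has to start from an undamaged copy.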

When should I use HTML entities instead of UTF-8?

Use HTML entities when:

You need to embed text in HTML/XML and want to avoid XSS or parsing errors.
The transport channel (e.g., email, some databases) doesn't reliably support UTF-8.
You want to make invisible characters (like zero-width space) explicit.

Otherwise, UTF-8 is more efficient and readable.

How do I fix mojibake in my application?

Identify the original encoding (ask the data source, or use a detection library like chardet).
Decode using that encoding, then re-encode as UTF-8.
For web output, apply HTML entity encoding to user-generated content.
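When the wrong decode was lossless, as with single-byte encodings such as Latin-1, text that is already garbled can sometimes be repaired by reversing the wrong step. A minimal sketch, assuming the text was UTF-8 mistakenly decoded as Latin-1:

```python
# "你好" stored as UTF-8 but decoded as Latin-1 produces six accented characters.
garbled = "你好".encode("utf-8").decode("latin-1")

# Reverse the wrong decode to get the original bytes back, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled), repaired)
```

This only works if no bytes were lost along the way. Once replacement characters (U+FFFD) or question marks appear, the original bytes have to be re-read from the source.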

What's the difference between Unicode, UTF-8, and HTML entities?

Unicode is a character set: a mapping of characters to code points (numbers).
UTF-8 is an encoding: a way to store those code points as bytes.
HTML entities are a text-level escape mechanism: they represent code points using ASCII-safe sequences like &amp;#x4E2D;.

Why does `'👨‍👩‍👧‍👦'.length` return 11 in JavaScript?

Because JavaScript's .length counts UTF-16 code units. The family emoji is composed of 4 emoji (2 code units each, since they lie outside the Basic Multilingual Plane) joined by 3 zero-width joiners (ZWJ, 1 code unit each): 4 × 2 + 3 = 11 code units. To count visible graphemes, use Intl.Segmenter or a library like grapheme-splitter.
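The same arithmetic can be checked from Python, which counts code points rather than UTF-16 code units. This sketch spells the emoji out with escapes so the individual parts are visible:

```python
# Man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                            # 4 emoji + 3 joiners
utf16_units = len(family.encode("utf-16-le")) // 2   # what JavaScript's .length reports
utf8_bytes = len(family.encode("utf-8"))             # 4 bytes per emoji, 3 per joiner

print(code_points, utf16_units, utf8_bytes)
```

None of these three numbers is 1: counting user-perceived characters needs grapheme segmentation, which Python's standard library does not provide.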

Summary

Mojibake is a symptom of encoding mismatch. The fix is to standardize on Unicode (UTF-8) everywhere, and use HTML entities as an escape hatch for the web. Remember the golden rule: never guess encoding, always declare it explicitly, and prefer UTF-8 without BOM. With this foundation, you can banish mojibake from your projects.

Try it in our HTML entities encoder/decoder to see how characters translate to entities.
