You've probably seen it: 锟斤拷, ä½ å¥½, or □□□. That's mojibake — garbled text caused by mismatched character encodings. This article explains why mojibake happens, how Unicode and UTF-8 solve the root problem, and how HTML entities provide a safe escape hatch for the web. You'll walk away with a concrete mental model and a worked example that ties it all together.
The Root Cause: Encoding Mismatch
At its core, mojibake is simple: the bytes that represent text were written with Encoding A, but read with Encoding B. Because different encodings map different characters to the same byte sequences, the reader interprets the bytes wrongly.
For example, the Chinese character 中 has:
- GBK encoding: 0xD6 0xD0
- UTF-8 encoding: 0xE4 0xB8 0xAD
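If a file saved as GBK is opened as UTF-8, the bytes 0xD6 0xD0 are not a valid UTF-8 sequence, so the decoder typically substitutes replacement characters (U+FFFD, shown as �). Other mismatches produce other gibberish: read as Latin-1, the same bytes display as ÖÐ, and if the replacement characters are saved as UTF-8 and later read back as GBK, they turn into the infamous 锟斤拷.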
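To make the mismatch concrete, here is a minimal Python sketch that reproduces each kind of gibberish with the standard codecs. The variable names are illustrative, and other decoders may handle invalid bytes slightly differently.

```python
# Encode the same character under two encodings.
gbk_bytes = "中".encode("gbk")      # b'\xd6\xd0'
utf8_bytes = "中".encode("utf-8")   # b'\xe4\xb8\xad'

# Scenario 1: GBK bytes read as Latin-1. Every byte maps to some character.
as_latin1 = gbk_bytes.decode("latin-1")

# Scenario 2: GBK bytes read as UTF-8. The bytes are invalid, so the decoder
# substitutes U+FFFD replacement characters.
as_utf8 = gbk_bytes.decode("utf-8", errors="replace")

# Scenario 3: the replacement characters are saved as UTF-8, then read as GBK.
round_tripped = as_utf8.encode("utf-8").decode("gbk")

print(as_latin1)       # ÖÐ
print(round_tripped)   # 锟斤拷
```

Scenario 3 is lossy: once the original bytes have been replaced by U+FFFD, the Chinese text cannot be recovered from the output.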
The Unicode Solution: One Numbering System to Rule Them All
Unicode assigns every character a unique code point (e.g., U+4E2D for 中), decoupling the character's identity from its byte representation. Then UTF-8 (and other encodings like UTF-16) defines how to store those code points as bytes.
| Encoding | 中 (U+4E2D) bytes | Notes |
|---|---|---|
| UTF-8 | E4 B8 AD | Variable-length, ASCII-compatible, dominant on the web |
| UTF-16 | 4E 2D (or 2D 4E) | 2 or 4 bytes per code point; used by Windows/JavaScript |
| GBK | D6 D0 | Legacy Chinese encoding; 1 byte for ASCII, 2 bytes for Chinese characters |
| ASCII | N/A (not supported) | Only 128 characters |
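The rows of this table can be checked directly in Python. This is a small sketch using the interpreter's built-in codecs; the variable names are illustrative.

```python
ch = "中"

# The code point is the character's identity, independent of any encoding.
code_point = ord(ch)                 # 0x4E2D

# Each encoding turns that one code point into different bytes.
utf8 = ch.encode("utf-8")            # E4 B8 AD
utf16_be = ch.encode("utf-16-be")    # 4E 2D
utf16_le = ch.encode("utf-16-le")    # 2D 4E
gbk = ch.encode("gbk")               # D6 D0

# ASCII has no byte sequence for this character at all.
try:
    ch.encode("ascii")
    ascii_supported = True
except UnicodeEncodeError:
    ascii_supported = False

print(f"U+{code_point:04X}", utf8.hex(), utf16_be.hex(), utf16_le.hex(), gbk.hex())
# U+4E2D e4b8ad 4e2d 2d4e d6d0
```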
Key insight: Unicode + UTF-8 eliminates mojibake at the byte level — as long as both ends agree on UTF-8. But what about text that must survive in environments with limited encoding support, like HTML?
HTML Entities: Escaping for the Web
HTML entities (e.g., &amp;, &#x4E2D;) allow you to represent any Unicode character using only ASCII-safe characters. This is crucial when:
- You embed user-generated text in HTML (prevents XSS and encoding corruption)
- You need to include characters that might not survive a legacy transport (e.g., email, old databases)
- You want to make invisible or ambiguous characters explicit
There are two forms:
- Named entities: &amp; for &, &lt; for <
- Numeric entities: &#x4E2D; (hex) or &#20013; (decimal) for 中
Our companion tool, the HTML entities encoder/decoder, lets you instantly convert between plain text and HTML entities.
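For readers who prefer code to a web tool, the two entity forms can also be sketched in Python. html.escape and html.unescape come from the standard library; to_numeric_entity is a hypothetical helper written only for this illustration.

```python
import html

def to_numeric_entity(ch: str, hex_form: bool = True) -> str:
    """Hypothetical helper: format one character as a numeric entity."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hex_form else f"&#{cp};"

# Named entities cover the characters that are significant in markup.
escaped = html.escape("a < b & c")                      # 'a &lt; b &amp; c'

# Numeric entities can express any code point, in hex or decimal.
hex_entity = to_numeric_entity("中")                    # '&#x4E2D;'
dec_entity = to_numeric_entity("中", hex_form=False)    # '&#20013;'

# Decoding accepts named, hex, and decimal forms alike.
decoded = html.unescape("&amp; &lt; &#x4E2D; &#20013;")  # '& < 中 中'
```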
Worked Example: From GBK to UTF-8 via HTML Entities
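Let's trace a real-world scenario: you receive a CSV file exported from a legacy system, saved as GBK. You open it in a UTF-8 editor and see garbage such as ��� instead of Chinese text. As long as nothing has re-saved the file (which is how the unrecoverable 锟斤拷 form arises), the original bytes are intact. Here's how to fix it.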
Step 1: Identify the original encoding
Check the file's metadata or use a tool like file on Linux:
$ file -bi data.csv
text/plain; charset=iso-8859-1 # often misdetected
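Better: run a detection library such as chardet, or check whether iconv can convert the file from GBK without errors (iconv itself converts; it does not detect).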
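When no detection library is at hand, trial decoding gives a rough first answer. The sketch below is an assumption-laden heuristic, not a real detector: guess_encoding and its candidate list are hypothetical, and a clean decode only shows that the bytes are valid in that encoding, not that the guess is right.

```python
from typing import Optional, Sequence

def guess_encoding(raw: bytes,
                   candidates: Sequence[str] = ("utf-8", "gbk", "shift_jis")) -> Optional[str]:
    """Return the first candidate that decodes raw without errors, else None.

    Hypothetical helper for illustration. Order matters: strict encodings such
    as UTF-8 go first. Single-byte encodings like Latin-1 accept any byte
    sequence, so they are left out of the default list.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

sample = "你好世界".encode("gbk")
print(guess_encoding(sample))   # gbk
```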
Step 2: Re-encode to UTF-8
Use iconv to convert:
$ iconv -f GBK -t UTF-8 data.csv > data_utf8.csv
Now the Chinese characters display correctly.
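If iconv is not available, the same conversion can be sketched in Python. convert_file is a hypothetical helper; it assumes the source really is GBK and raises UnicodeDecodeError otherwise.

```python
def convert_file(src: str, dst: str, src_encoding: str = "gbk") -> None:
    """Hypothetical helper: re-encode a text file to UTF-8 (no BOM)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

Calling convert_file("data.csv", "data_utf8.csv") would mirror the iconv command above. Streaming line by line keeps memory use flat for large exports.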
Step 3: Escape for HTML (if needed)
If this data will be embedded in a web page, convert special characters to HTML entities to avoid XSS and encoding issues. Using our HTML entities tool, paste 中 and get &#x4E2D; or &#20013;.
End-to-end in Python:
import html
# Read the raw bytes of the GBK-encoded file
with open('data.csv', 'rb') as f:
    raw = f.read()
# Wrong: decode the GBK bytes as UTF-8 (mojibake)
garbled = raw.decode('utf-8', errors='replace')
print(garbled)  # mostly U+FFFD replacement characters (�)
# Correct: decode as GBK, then encode as UTF-8
correct = raw.decode('gbk').encode('utf-8')
print(correct.decode('utf-8'))  # e.g. 你好世界
# For HTML output, escape the markup-significant characters
safe = html.escape(correct.decode('utf-8'))
print(safe)  # 你好世界 (html.escape rewrites only & < > " and '; CJK text passes through)
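Note that html.escape leaves non-ASCII text as it is, which is fine when the page is served as UTF-8. If the output must be pure ASCII, one option is to chain it with the xmlcharrefreplace error handler, as in this sketch (to_ascii_html is a hypothetical name).

```python
import html

def to_ascii_html(text: str) -> str:
    """Hypothetical helper: escape markup, then emit non-ASCII as numeric entities."""
    escaped = html.escape(text)   # handles & < > " '
    # xmlcharrefreplace turns each unencodable character into &#NNNN;
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_ascii_html("<b>中</b>"))   # &lt;b&gt;&#20013;&lt;/b&gt;
```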
Common Pitfalls
- Assuming UTF-8 everywhere: Legacy systems (Windows ANSI, old databases) may use GBK, Shift-JIS, or Latin-1. Always verify.
- Using .length to truncate strings: In JavaScript, '👨‍👩‍👧‍👦'.length is 11 (UTF-16 code units), not 1 (grapheme cluster). Truncating by length can break emoji or combined characters.
- BOM (Byte Order Mark): UTF-8 with BOM (EF BB BF) can confuse parsers. Prefer UTF-8 without BOM.
- Database charset: MySQL's utf8 is actually utf8mb3 (max 3 bytes). Use utf8mb4 to support emoji and rare characters.
- HTML entity double-encoding: If you encode &amp; again, it becomes &amp;amp;. Only encode once.
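Several of these pitfalls are easy to reproduce in Python. The sketch below uses only the standard library; fits_utf8mb3 is a hypothetical check that approximates the 3-byte limit and is not part of any MySQL client API.

```python
import html

# BOM: the utf-8-sig codec strips a leading EF BB BF; plain utf-8 keeps it
# as an invisible U+FEFF at the start of the text.
with_bom = b"\xef\xbb\xbfid,name"
kept = with_bom.decode("utf-8")          # '\ufeffid,name'
stripped = with_bom.decode("utf-8-sig")  # 'id,name'

# Database charset: anything needing 4 UTF-8 bytes does not fit in utf8mb3.
def fits_utf8mb3(text: str) -> bool:
    """Hypothetical check: True if every character needs at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

# Double-encoding: escaping already-escaped text escapes the ampersand again.
once = html.escape("Tom & Jerry")    # 'Tom &amp; Jerry'
twice = html.escape(once)            # 'Tom &amp;amp; Jerry'
```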
FAQ
What exactly is mojibake?
Mojibake is garbled text caused by decoding bytes using the wrong character encoding. The classic Chinese example is 锟斤拷: GBK-encoded text is read as UTF-8, the invalid bytes are replaced with the replacement character (U+FFFD), the result is saved as UTF-8 (EF BF BD for each replacement character), and those bytes are then read back as GBK.
When should I use HTML entities instead of UTF-8?
Use HTML entities when:
- You need to embed text in HTML/XML and want to avoid XSS or parsing errors.
- The transport channel (e.g., email, some databases) doesn't reliably support UTF-8.
- You want to make invisible characters (like zero-width space) explicit.
Otherwise, UTF-8 is more efficient and readable.
How do I fix mojibake in my application?
- Identify the original encoding (ask the data source, or use a detection library like chardet).
- Decode using that encoding, then re-encode as UTF-8.
- For web output, apply HTML entity encoding to user-generated content.
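When text has already been decoded with the wrong encoding but no bytes were lost, the wrong step can often be replayed backwards. A minimal sketch, assuming the bad decode went through Latin-1 (repair_mojibake is a hypothetical helper):

```python
def repair_mojibake(garbled: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Hypothetical helper: undo one wrong decode by replaying it backwards.

    Re-encoding with the wrong codec recovers the original bytes, which are
    then decoded with the right one. Raises UnicodeError if the text does not
    round-trip, e.g. when U+FFFD has already replaced the original bytes.
    """
    return garbled.encode(wrong).decode(right)

garbled = "你好".encode("utf-8").decode("latin-1")   # the 'ä½ å¥½' pattern
print(repair_mojibake(garbled))                      # 你好
```

This only works while the original bytes are still recoverable. Text that already contains replacement characters has lost information and has to be re-read from the source.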
What's the difference between Unicode, UTF-8, and HTML entities?
- Unicode is a character set: a mapping of characters to code points (numbers).
- UTF-8 is an encoding: a way to store those code points as bytes.
- HTML entities are a text-level escape mechanism: they represent code points using ASCII-safe sequences like &#x4E2D;.
Why does '👨‍👩‍👧‍👦'.length return 11 in JavaScript?
Because JavaScript strings are sequences of UTF-16 code units, and .length counts those units. The family emoji is four emoji joined by three zero-width joiners (ZWJ), with no variation selectors involved: each emoji takes 2 code units and each ZWJ takes 1, for a total of 4 × 2 + 3 = 11. To count visible graphemes, use Intl.Segmenter or a library like grapheme-splitter.
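The arithmetic can be checked from Python, whose len counts code points rather than UTF-16 code units. A short sketch with the sequence spelled out as explicit escapes (the variable names are illustrative):

```python
# The family emoji written out as explicit code points.
ZWJ = "\u200d"
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

code_points = len(family)                            # 7: four emoji + three ZWJ
utf16_units = len(family.encode("utf-16-le")) // 2   # 11: what JavaScript's .length counts
utf8_bytes = len(family.encode("utf-8"))             # 25: four 4-byte emoji + three 3-byte ZWJ

print(code_points, utf16_units, utf8_bytes)   # 7 11 25
```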
Summary
Mojibake is a symptom of encoding mismatch. The fix is to standardize on Unicode (UTF-8) everywhere, and use HTML entities as an escape hatch for the web. Remember the golden rule: never guess encoding, always declare it explicitly, and prefer UTF-8 without BOM. With this foundation, you can banish mojibake from your projects.
Try it in our HTML entities encoder/decoder to see how characters translate to entities.