
From Mojibake to UTF-8: A Complete Guide to Character Encoding

Understand ASCII, GBK, Unicode, and UTF-8 with practical examples. Learn why mojibake happens and how to avoid encoding issues in your projects.

You've probably seen gibberish like 锟斤拷 or ä½ å¥½ in old files or on some websites. This is mojibake — the result of mismatched character encodings. Understanding how computers store and interpret text is essential for every developer. This guide covers the evolution from ASCII to UTF-8, common pitfalls, and a practical example to make you encoding-savvy.


What Is Character Encoding?

Computers only understand binary: 0s and 1s. To display the letter A or the Chinese character 中, the machine needs a mapping between human-readable symbols and binary sequences. That mapping is a character encoding.

A character set assigns a unique number (a code point) to each character. For example, in Unicode, A is U+0041 and 中 is U+4E2D. An encoding then converts that code point into a specific byte sequence. UTF-8 and UTF-16 are two different encodings of the same character set (Unicode), while GBK encodes its own, smaller character set.
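To make the distinction concrete, here is a small sketch using only Python built-ins. It prints one code point and then the bytes that three different encodings produce for it; the variable names are illustrative.

```python
# Code point: the number Unicode assigns to a character.
ch = "中"
code_point = ord(ch)
print(f"U+{code_point:04X}")             # U+4E2D

# Encoding: the bytes used to store that code point.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")     # big-endian, no BOM
gbk_bytes = ch.encode("gbk")

print(utf8_bytes.hex(" "))               # e4 b8 ad
print(utf16_bytes.hex(" "))              # 4e 2d
print(gbk_bytes.hex(" "))                # d6 d0
```

The code point is the same in every case; only the stored bytes differ.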

A Brief History of Encodings

ASCII: The American Standard

ASCII (American Standard Code for Information Interchange) was introduced in 1963. It uses 7 bits to represent 128 characters: English letters, digits, punctuation, and control codes. Every byte with a value below 128 is an ASCII character.

Range (decimal)   Category             Examples
0–31              Control characters   Null (0), Line Feed (10)
32                Space
48–57             Digits               0–9
65–90             Uppercase letters    A–Z
97–122            Lowercase letters    a–z

Key fact: 'A' = 65, 'a' = 97. The difference is 32 — useful for case conversion.
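As a rough illustration of that offset, the sketch below uppercases a single ASCII letter by subtracting 32. The helper name ascii_upper is made up for this example; in real code, str.upper() is the right tool because it also handles non-ASCII letters.

```python
def ascii_upper(ch: str) -> str:
    """Uppercase one ASCII letter by subtracting 32; return anything else unchanged."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch

print(ord("A"), ord("a"))                           # 65 97
print(ascii_upper("a"))                             # A
print("".join(ascii_upper(c) for c in "utf-8"))     # UTF-8
```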

GBK: Chinese Character Encoding

ASCII cannot represent Chinese characters. In the 1980s, China developed GB2312 (2 bytes per character, 6,763 Chinese characters). Later, GBK (Guobiao Kuozhan) extended it to 21,003 Chinese characters, including traditional forms. GBK was the default ANSI encoding on Chinese Windows systems for years.

Unicode: One Set to Rule Them All

Unicode provides a unique code point for every character in every writing system — over 140,000 characters so far. The code space ranges from U+0000 to U+10FFFF (1,114,112 possible code points). Unicode itself is just a character set; it does not specify how to store code points as bytes.

UTF-8: The Dominant Encoding

UTF-8 is a variable-length encoding for Unicode. It uses 1 to 4 bytes per code point:

Code point range     UTF-8 bytes
U+0000–U+007F        1 byte  (0xxxxxxx)
U+0080–U+07FF        2 bytes (110xxxxx 10xxxxxx)
U+0800–U+FFFF        3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
U+10000–U+10FFFF     4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)

UTF-8 is backward-compatible with ASCII: any ASCII text is valid UTF-8. It is space-efficient for Latin scripts and self-synchronizing (a lost byte only corrupts the current code point). Today, over 98% of web pages use UTF-8.
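The bit layouts in the table can be written out by hand. The following is a minimal sketch for learning purposes, not a replacement for str.encode; the function name utf8_encode is an assumption of this example. It checks its own output against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110xxx + three continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# One sample character per row of the table: 1, 2, 3, and 4 bytes.
for ch in ["A", "\u00e9", "中", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```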

Common Pitfalls

  • Mojibake: Opening a GBK-encoded file as UTF-8 produces garbled text, typically runs of the replacement character �, which a later round of conversion can turn into 锟斤拷. The fix: always know the original encoding and decode with it.
  • BOM (Byte Order Mark): Some UTF-8 files start with \xEF\xBB\xBF. This BOM can confuse parsers (e.g., JSON). Prefer UTF-8 without BOM.
  • Database encoding: MySQL's utf8 character set only supports up to 3-byte UTF-8, missing emoji. Use utf8mb4 instead.
  • String length: String.length in JavaScript counts UTF-16 code units, not code points or grapheme clusters. An emoji like 😀 is 2 code units (surrogate pair). For user-facing length, use Intl.Segmenter or a library.
  • Normalization: Unicode allows multiple representations for some characters (e.g., é as U+00E9 or U+0065 U+0301). Normalize (NFC) before comparing strings.
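Three of the pitfalls above (BOM, string length, normalization) are easy to reproduce with Python's standard library. This is an illustrative sketch with made-up sample data. Note that Python's len() counts code points, so the UTF-16 code-unit count that JavaScript reports is computed explicitly here.

```python
import codecs
import unicodedata

# BOM: plain "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" strips it if present.
with_bom = codecs.BOM_UTF8 + '{"ok": true}'.encode("utf-8")
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(kept[0]))        # '\ufeff'
print(stripped)             # {"ok": true}

# String length: one emoji is 1 code point but 2 UTF-16 code units.
emoji = "😀"
code_points = len(emoji)
utf16_units = len(emoji.encode("utf-16-le")) // 2
print(code_points, utf16_units)     # 1 2

# Normalization: two spellings of é compare unequal until normalized.
composed = "\u00e9"         # é as one code point
decomposed = "e\u0301"      # e + combining acute accent
print(composed == decomposed)                                   # False
print(unicodedata.normalize("NFC", decomposed) == composed)     # True
```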

Worked Example: Decoding a Mojibake File

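Suppose you have a file data.txt saved with GBK encoding, but you open it in a UTF-8 terminal. Instead of 你好世界 you see a run of replacement characters, something like:

������

These are the GBK bytes for 你好世界 interpreted as UTF-8. They do not form valid UTF-8 sequences, so the terminal substitutes � (U+FFFD) for them. (Output such as ä½ å¥½ comes from the opposite mistake: UTF-8 bytes read as Latin-1 or Windows-1252.) Let's fix it step by step: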

  1. Read the file as bytes.
  2. Decode those bytes using GBK.
  3. Re-encode the result as UTF-8 for modern use.

In Python:

# Read raw bytes from the file
with open('data.txt', 'rb') as f:
    raw_bytes = f.read()

# Decode using the original encoding (GBK)
text = raw_bytes.decode('gbk')
print(text)  # Output: 你好世界

# Re-encode as UTF-8 for modern storage
utf8_bytes = text.encode('utf-8')
with open('data_utf8.txt', 'wb') as f:
    f.write(utf8_bytes)

If you don't know the original encoding, try common ones (GBK, Shift-JIS, ISO-8859-1) and inspect the output. Tools like chardet can help guess.
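A trial-and-inspect loop might look like the sketch below. The helper name try_decodings and the candidate list are assumptions of this example. Keep in mind that ISO-8859-1 maps every byte to a character, so it never fails; a successful decode is not proof that the guess is right, and you still need to read the output.

```python
def try_decodings(raw: bytes,
                  candidates=("utf-8", "gbk", "shift_jis", "iso-8859-1")):
    """Return {encoding: decoded text, or None if the bytes are invalid in it}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Sample input: bytes that were really written as GBK.
sample = "你好世界".encode("gbk")
results = try_decodings(sample)
for name, text in results.items():
    print(f"{name:>10}: {text!r}")
```

Only the GBK row shows readable Chinese; the ISO-8859-1 row decodes without error but is clearly wrong on inspection.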

When to Use Which Encoding?

Scenario                            Recommended encoding
Web pages, APIs, JSON               UTF-8 (no BOM)
Legacy Chinese text files           GBK (if original)
Windows system APIs (historical)    UTF-16
Internal processing                 UTF-8 or UTF-32
Databases                           utf8mb4 (MySQL), NVARCHAR (SQL Server)

FAQ

What is the difference between Unicode and UTF-8?

Unicode is a character set that assigns a unique number (code point) to every character. UTF-8 is an encoding that serializes those numbers into bytes. Unicode defines the "what", UTF-8 defines the "how to store".

Why do I see 锟斤拷?

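It's a classic mojibake chain: a file encoded in GBK is decoded as UTF-8, producing replacement characters (\uFFFD). Those are then re-encoded as UTF-8 (EF BF BD each) and decoded again as GBK, which reads the bytes in pairs. Two replacement characters, EF BF BD EF BF BD, become 锟 (0xEFBF), 斤 (0xBDEF), and 拷 (0xBFBD), repeated for as long as the run continues.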
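The chain can be reproduced in a few lines of Python. This is a sketch with illustrative variable names: it first shows the lossy step, then feeds exactly two replacement characters through the second half of the chain.

```python
# Step 1: GBK bytes decoded as UTF-8 with errors replaced. The original
# bytes are lost; only U+FFFD replacement characters remain.
damaged = "你好".encode("gbk").decode("utf-8", errors="replace")
print("\ufffd" in damaged)          # True

# Step 2: two replacement characters, saved as UTF-8 ...
utf8_bytes = "\ufffd\ufffd".encode("utf-8")
print(utf8_bytes.hex(" "))          # ef bf bd ef bf bd

# Step 3: ... and read back as GBK, two bytes per character.
garbled = utf8_bytes.decode("gbk")
print(garbled)                      # 锟斤拷
```

Because step 1 discards the original bytes, text that has reached the 锟斤拷 stage cannot be recovered; the fix has to happen at the source file.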

How do I convert a file from GBK to UTF-8?

Use iconv -f gbk -t utf-8 input.txt > output.txt on Linux/macOS, or use Python as shown in the worked example above.

Should I use UTF-8 with or without BOM?

Without BOM for most contexts (web, JSON, Unix). With BOM if you need to distinguish UTF-8 from other encodings in legacy Windows tools (e.g., Notepad).

What is the difference between utf8 and utf8mb4 in MySQL?

utf8 in MySQL is an alias for utf8mb3, which only supports up to 3-byte UTF-8 characters (BMP). It cannot store emoji or other characters above U+FFFF. utf8mb4 supports the full 4-byte UTF-8 range.
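If you want to check ahead of time whether some text would fit in a 3-byte column, one possible test is to look for code points above U+FFFF. The function name needs_utf8mb4 is hypothetical, and the sketch only inspects the string; it does not talk to MySQL.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("你好世界"))       # False: BMP characters, 3 bytes each
print(needs_utf8mb4("Hello 😀"))       # True: U+1F600 needs 4 bytes
```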

Summary

  • Character encoding maps characters to bytes. Mismatched encoding causes mojibake.
  • ASCII is the foundation (7-bit, 128 characters).
  • GBK handles Chinese text but is not universal.
  • Unicode provides a universal character set; UTF-8 is the dominant encoding.
  • Always know your encoding, prefer UTF-8 without BOM, and use utf8mb4 in MySQL.