Introduction
Every developer eventually faces the alphabet soup of text encoding: ASCII, Unicode, UTF-8, UTF-16, code points, surrogate pairs. It's easy to treat them as magic incantations, but understanding the fundamentals will save you from bugs, security holes, and performance issues. This guide walks you from the 7-bit ASCII days to modern Unicode, explaining what each layer is, why it matters, and how to use them correctly in your code.
ASCII: The 7-Bit Foundation
ASCII (American Standard Code for Information Interchange) is the grandfather of character encodings. It uses 7 bits to represent 128 characters: control characters (0–31), printable characters (32–126), and DEL (127). The key ranges every programmer should know:
| Category | Range (decimal) | Examples |
|---|---|---|
| Control characters | 0–31 | LF (10), CR (13) |
| Space | 32 | ' ' |
| Digits | 48–57 | '0'=48, '9'=57 |
| Uppercase letters | 65–90 | 'A'=65, 'Z'=90 |
| Lowercase letters | 97–122 | 'a'=97, 'z'=122 |
Memory trick: '0'=48, 'A'=65, 'a'=97. Lowercase = uppercase + 32.
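The offsets in the memory trick can be checked directly. The following is a minimal Python sketch; the helper names `ascii_upper` and `digit_value` are illustrative, not part of any library, and the arithmetic is only valid for ASCII letters and digits.

```python
def ascii_upper(ch: str) -> str:
    """Uppercase one ASCII letter by subtracting 32; return anything else unchanged."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch


def digit_value(ch: str) -> int:
    """Convert one ASCII digit character to its integer value."""
    if not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord("0")


print(ord("0"), ord("A"), ord("a"))  # 48 65 97
print(ascii_upper("q"))              # Q
print(digit_value("7"))              # 7
```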
Why ASCII Still Matters
- UTF-8 is backward-compatible with ASCII. Every ASCII string is a valid UTF-8 string.
- Many network protocols (HTTP, SMTP) still use ASCII for headers.
- Understanding ASCII helps you debug encoding issues: if a hex dump shows the byte 65 (0x41) where you expect 'A', the data is in ASCII, UTF-8, or another ASCII-compatible encoding.
Unicode: The Universal Character Set
ASCII's fatal flaw: 128 characters can't represent Chinese, Arabic, emoji, or even French accented letters. Unicode solves this by assigning a unique number (called a code point) to every character in every writing system, past and present.
- Code points are written in hex with a U+ prefix: U+0041 for 'A', U+4E2D for '中'.
- The Unicode codespace has 1,114,112 possible code points (U+0000 to U+10FFFF).
- Currently ~150,000 are assigned; the rest are reserved or private use.
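As a quick illustration of the notation, Python's built-in `ord` returns a character's code point as an integer. The helper name `u_plus` below is an assumption made for this sketch.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(u_plus("A"))    # U+0041
print(u_plus("中"))   # U+4E2D
print(0x10FFFF + 1)   # 1114112, the size of the codespace
```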
Planes and the BMP
The codespace is divided into 17 planes of 65,536 code points each. Plane 0 is the Basic Multilingual Plane (BMP), covering most modern scripts (Latin, Cyrillic, CJK, Arabic, etc.). Planes 1–2 contain historical scripts, emoji, and rare CJK. Planes 15–16 are private use.
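Because each plane holds 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16  # each plane spans 0x10000 code points


print(plane_of("A"))           # 0 (BMP)
print(plane_of("中"))          # 0 (BMP)
print(plane_of("\U0001F604"))  # 1 (the emoji U+1F604)
```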

Encoding Unicode: UTF-8, UTF-16, UTF-32
A code point is just an abstract number. To store or transmit it, we need an encoding. The three main ones:
UTF-32
- Every code point stored as a fixed 4-byte integer.
- Simple but wasteful: a 10 KB ASCII file becomes 40 KB.
- Rarely used for storage; sometimes used internally for processing.
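The size penalty is easy to confirm. This sketch uses Python's `utf-32-le` codec, which fixes the byte order so that no byte order mark is prepended:

```python
text = "a" * 10_240                 # 10 KB of ASCII text
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")  # explicit byte order, so no BOM is added

print(len(utf8_bytes), len(utf32_bytes))  # 10240 40960
```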
UTF-16
- Uses 2 or 4 bytes per code point.
- BMP code points (U+0000–U+FFFF) use 2 bytes.
- Code points above U+FFFF use a surrogate pair: two 16-bit units, a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF.
- Surrogates are not characters; they are encoding artifacts. They must never appear in UTF-8 or UTF-32.
- Used by JavaScript strings, Java, and Windows APIs.
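The surrogate pair for a code point can be computed by hand: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below assumes a hypothetical helper name, `to_surrogates`, and compares the result with Python's own UTF-16 encoder.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} does not need a surrogate pair")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F604)
print(f"{high:04X} {low:04X}")                   # D83D DE04
print("\U0001F604".encode("utf-16-be").hex())    # d83dde04
```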
UTF-8
- Variable length: 1–4 bytes per code point.
- ASCII characters (U+0000–U+007F) use 1 byte — identical to ASCII.
- Non-ASCII characters use multi-byte sequences with a specific prefix pattern:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Code point range |
|---|---|---|---|---|
| 0xxxxxxx | | | | U+0000–U+007F |
| 110xxxxx | 10xxxxxx | | | U+0080–U+07FF |
| 1110xxxx | 10xxxxxx | 10xxxxxx | | U+0800–U+FFFF |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | U+10000–U+10FFFF |
Key property: The bytes 0x00–0x7F never appear in multi-byte sequences, so ASCII-based string operations (null termination, searching for \n or ,) work on UTF-8 without modification.
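One way to see this property in practice is to split raw UTF-8 bytes on an ASCII delimiter without decoding first. The sample string below is made up for illustration.

```python
data = "naïve,日本語,\U0001F604".encode("utf-8")

# Splitting on the ASCII comma byte is safe: 0x2C cannot occur inside
# a multi-byte sequence, so no character is ever cut in half.
fields = [part.decode("utf-8") for part in data.split(b",")]
print(len(fields))  # 3

# Every byte of a non-ASCII character has its high bit set (>= 0x80).
high_bit_only = all(
    byte >= 0x80 for ch in "ï日\U0001F604" for byte in ch.encode("utf-8")
)
print(high_bit_only)  # True
```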
Worked Example: From Code Point to UTF-8 Bytes
Let's encode the emoji 😄 (U+1F604).
- Code point: U+1F604 = 0x1F604 = 128,516 decimal.
- It lies in the 4-byte range (U+10000–U+10FFFF).
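- Binary of 0x1F604, padded to 21 bits: 0 0001 1111 0110 0000 0100.
- UTF-8 4-byte pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
- Fill the x bits with the 21-bit value, split into groups of 3, 6, 6, and 6 bits (000 | 011111 | 011000 | 000100):
  - 11110 + 000 → 11110000 (0xF0)
  - 10 + 011111 → 10011111 (0x9F)
  - 10 + 011000 → 10011000 (0x98)
  - 10 + 000100 → 10000100 (0x84)
- Result: F0 9F 98 84 (hex).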
Try it in our text to unicode converter to see the bytes.
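The worked example above can also be reproduced in a few lines of code. This is a minimal sketch of the bit-packing from the table, not a replacement for a real codec; the function name `utf8_encode` is an assumption, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the x bits of each pattern."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x1F604).hex(" "))  # f0 9f 98 84
```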
Common Pitfalls
- Confusing code points with bytes: A "character" in a UTF-8 string may be 1–4 bytes. Never assume strlen() returns character count.
- Surrogate pairs in UTF-8: Never appear; if you see them, the data is mis-encoded.
- Byte order marks (BOM): UTF-16 files often start with U+FEFF to indicate endianness. UTF-8 BOM (EF BB BF) is optional but can break ASCII tools.
- Normalization: Characters like 'é' can be represented as a single code point (U+00E9) or as 'e' + combining accent (U+0065 U+0301). These are visually identical but byte-different. Use Unicode normalization (NFC, NFD) before comparing.
- Case conversion is locale-dependent: Turkish 'i' → 'İ' (dotted capital I), not 'I'. Don't rely on simple ±32 for non-ASCII.
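Two of these pitfalls can be demonstrated with Python's standard library alone. The sketch below shows the normalization mismatch using `unicodedata.normalize`, and shows that Python's strict UTF-8 encoder refuses a lone surrogate; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))                        # 2 3

# A lone surrogate is not a valid character, so strict UTF-8 encoding fails.
try:
    "\ud83d".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
print(lone_surrogate_rejected)                                # True
```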
When to Use Which Encoding
| Encoding | Best for | Avoid when |
|---|---|---|
| UTF-8 | Web, Unix/Linux, APIs, storage | You need random access by code point |
| UTF-16 | JavaScript, Java, Windows APIs | You want ASCII compatibility or space efficiency for ASCII-heavy text |
| UTF-32 | Internal processing (rare) | Storage or network transmission |
FAQ
What is the difference between Unicode and UTF-8?
Unicode is the character set — a mapping from numbers to characters. UTF-8 is one encoding of Unicode — a way to represent those numbers as bytes. Other encodings include UTF-16 and UTF-32.
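To make the distinction concrete, here is one code point serialized three ways. The sketch uses Python's big-endian codecs so that no byte order mark is added.

```python
ch = "中"  # a single code point, U+4E2D

as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_utf32 = ch.encode("utf-32-be")

print(f"U+{ord(ch):04X}")   # U+4E2D
print(as_utf8.hex(" "))     # e4 b8 ad
print(as_utf16.hex(" "))    # 4e 2d
print(as_utf32.hex(" "))    # 00 00 4e 2d
```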
Why does my string length seem wrong?
In many languages, len() or length returns the number of code units (bytes for UTF-8, 16-bit words for UTF-16), not code points or visible characters. For example, "😄".length in JavaScript returns 2 because it's a surrogate pair. Use library functions to count code points or grapheme clusters.
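The three possible "lengths" of the same string can be compared side by side. Note that Python's `len()` on a `str` happens to count code points, so the sketch derives the code unit counts by encoding first.

```python
s = "\U0001F604"  # the emoji U+1F604

code_points = len(s)                             # Python str counts code points
utf8_units = len(s.encode("utf-8"))              # bytes in UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2    # 16-bit units, as JavaScript counts

print(code_points, utf8_units, utf16_units)  # 1 4 2
```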
What is a BOM and should I use it?
A Byte Order Mark (U+FEFF) at the start of a UTF-16 file tells the reader whether the bytes are big-endian or little-endian. For UTF-8, the BOM (EF BB BF) is unnecessary because UTF-8 has no endianness, but some Windows tools add it. It can confuse Unix tools; avoid it unless required.
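If you have to read files that may or may not carry a UTF-8 BOM, Python's `utf-8-sig` codec strips it when present. A short sketch, with a made-up payload:

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")  # EF BB BF followed by the text

kept = raw.decode("utf-8")          # the BOM survives as U+FEFF at the start
stripped = raw.decode("utf-8-sig")  # the BOM is removed if present

print(repr(kept))      # '\ufeffhello'
print(repr(stripped))  # 'hello'

# The generic "utf-16" codec writes a BOM in the platform's byte order.
utf16_prefix = "hi".encode("utf-16")[:2]
print(utf16_prefix in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
```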
How do I handle emoji in my code?
Most emoji are code points above U+FFFF, so they require 4 bytes in UTF-8 or a surrogate pair in UTF-16. Many emoji are sequences of several code points (e.g., a base emoji followed by a skin tone modifier). Use a library that supports grapheme clusters (e.g., grapheme in Python, Intl.Segmenter in JavaScript) to count visible characters.
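To see why counting code points is not enough, consider a thumbs-up with a skin tone modifier. The sketch below uses only the standard library, which has no grapheme cluster support, so it shows the mismatch rather than the fix.

```python
# Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD).
thumbs = "\U0001F44D\U0001F3FD"

print(len(thumbs))                           # 2 code points, one visible emoji
print(len(thumbs.encode("utf-8")))           # 8 bytes in UTF-8
print(len(thumbs.encode("utf-16-le")) // 2)  # 4 UTF-16 code units
print([f"U+{ord(c):04X}" for c in thumbs])   # ['U+1F44D', 'U+1F3FD']
```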
Can I use UTF-8 everywhere?
Almost. UTF-8 is the dominant encoding on the web and in Unix/Linux. Windows has historically favored UTF-16, but recent versions support UTF-8 more fully. For new projects, UTF-8 is generally the safest choice.