What Is Unicode?
Unicode is the universal character encoding standard that assigns a unique number, called a code point, to each character it encodes, with the goal of covering every writing system in the world. From Latin letters to Chinese hanzi, Arabic script to emoji, Unicode 15.0 defines over 149,000 characters across 161 scripts.
Unicode solves a fundamental problem in computing: historically, different countries and companies created incompatible character encodings, making it impossible to reliably exchange text across systems.
Unicode Code Points
A Unicode code point is written as U+ followed by a hexadecimal number:
- U+0041 = A (Latin capital letter A)
- U+4E2D = 中 (Chinese character for "middle")
- U+1F600 = 😀 (Grinning Face emoji)
- U+0021 = ! (Exclamation mark)
The code point range spans from U+0000 to U+10FFFF, divided into 17 planes of 65,536 code points each.
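As a minimal sketch of the U+ notation described above, the following Python converts between characters and their code point labels. The helper names `to_u_notation` and `from_u_notation` are hypothetical, chosen only for illustration.

```python
def to_u_notation(ch: str) -> str:
    """Format one character as U+ followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+4E2D' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected U+ notation, got {label!r}")
    return chr(int(label[2:], 16))


for ch in "A\u4e2d\U0001F600!":
    print(to_u_notation(ch), ch)
```

The `04X` format pads short values to four digits but lets supplementary code points such as U+1F600 use five or six.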
Unicode Planes
Plane 0: Basic Multilingual Plane (BMP)
The most commonly used characters, including most modern scripts:
- Latin, Greek, Cyrillic, Hebrew, Arabic, Devanagari
- CJK (Chinese, Japanese, Korean) characters
- Most punctuation, symbols, and special characters
Plane 1: Supplementary Multilingual Plane
- Historic scripts (Linear B, Egyptian hieroglyphs, Cuneiform)
- Musical symbols
- Mathematical alphanumeric symbols
- Many emoji
Plane 2: Supplementary Ideographic Plane
- Additional CJK unified ideographs (rare characters)
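Plane 3: Tertiary Ideographic Plane
- Further rare and historic CJK unified ideographs (first populated in Unicode 13.0)
Planes 4-13: Unassigned
Currently unassigned.
Plane 14: Supplementary Special-purpose Plane
- Tags and variation selectors
Planes 15-16: Supplementary Private Use Areas
- Reserved for private use; the standard assigns no characters here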
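Because every plane holds exactly 65,536 (0x10000) code points, the plane number of a character is simply its code point shifted right by 16 bits. A small sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    # Each plane spans 0x10000 code points, so the plane is the
    # code point with its low 16 bits discarded.
    return ord(ch) >> 16


for ch in ["A", "\u4e2d", "\U0001F600", "\U00020000", "\U000E0001"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```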
Unicode Encodings
UTF-8
The dominant encoding on the web (used by over 98% of websites):
- ASCII characters use 1 byte
- Accented Latin letters and alphabets such as Greek, Cyrillic, Hebrew, and Arabic use 2 bytes
- Most CJK characters (those in the BMP) use 3 bytes
- Emoji and supplementary characters use 4 bytes
UTF-8 is backward compatible with ASCII — any ASCII file is valid UTF-8.
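The byte counts listed above can be checked directly in Python by encoding one sample character from each group. The sample characters are illustrative choices, not an exhaustive test.

```python
# One sample per group: ASCII, accented Latin, BMP CJK, supplementary emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, size in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {size} byte(s) in UTF-8")

# ASCII compatibility: pure ASCII text encodes to identical bytes.
ascii_text = "hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```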
UTF-16
Used by Windows and Java internally:
- Most characters use 2 bytes
- Supplementary plane characters use 4 bytes (surrogate pairs)
- Not backward compatible with ASCII
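To make the surrogate-pair mechanism concrete, here is a sketch of the arithmetic that splits a supplementary code point into two 16-bit code units. The function name `to_surrogate_pair` is hypothetical; the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check: big-endian UTF-16 bytes are the two units in order.
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())
```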
UTF-32
Fixed-width 4-byte encoding. Simple to index but memory-inefficient. Used internally by some programming languages.
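The trade-off between the three encodings can be illustrated by encoding the same short string each way. This sketch uses the little-endian codec variants so that no byte order mark is added to the counts; the sample string is an arbitrary choice.

```python
text = "A\u4e2d\U0001F600"  # one ASCII, one BMP CJK, one supplementary character

sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, size in sizes.items():
    print(f"{enc}: {size} bytes")

# Fixed width makes indexing trivial: code point n starts at byte 4 * n.
data = text.encode("utf-32-le")
third = data[4 * 2 : 4 * 3].decode("utf-32-le")
print(third == "\U0001F600")
```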
Unicode in Programming
JavaScript
JavaScript strings are UTF-16 internally. Working with supplementary plane characters requires care:
'A'.charCodeAt(0) // 65 (UTF-16 code unit; use codePointAt(0) for full code points)
'\u0041' // 'A' (Unicode escape)
'\u{1F600}' // '😀' (ES6 extended escape)
'😀'.length // 2 (two UTF-16 code units!)
[...'😀'].length // 1 (correct character count)
Python
Python 3 strings are sequences of Unicode code points:
ord('A') # 65
chr(65) # 'A'
'\u0041' # 'A'
'\U0001F600' # '😀'
len('😀') # 1 (correct in Python 3)
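When Python code has to agree with JavaScript about string lengths (for example, when validating a length limit on both sides), it can help to count UTF-16 code units explicitly. A minimal sketch, with `utf16_length` as a hypothetical helper name:

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's .length reports."""
    # Every UTF-16 code unit is 2 bytes; the -le codec adds no BOM.
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"
print(len(emoji))           # code points
print(utf16_length(emoji))  # UTF-16 code units
```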
HTML
Unicode characters in HTML:
&#65; <!-- A (decimal) -->
&#x41; <!-- A (hexadecimal) -->
&amp; <!-- & (named entity) -->
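Numeric character references can also be generated and decoded programmatically. The sketch below uses Python's standard `html` module for decoding and escaping; `to_decimal_ref` and `to_hex_ref` are hypothetical helper names.

```python
import html


def to_decimal_ref(ch: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


print(to_decimal_ref("A"), to_hex_ref("A"))

# html.unescape decodes numeric and named references alike.
decoded = html.unescape("&#65;&#x41;&amp;")
print(decoded)

# html.escape produces the named entity for the ampersand.
print(html.escape("&"))
```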
Unicode Normalization
The same visual character can sometimes be represented multiple ways:
- Precomposed: é = U+00E9 (single code point)
- Decomposed: é = U+0065 + U+0301 (e + combining accent)
Unicode defines normalization forms to standardize these representations:
- NFC (Canonical Decomposition, followed by Canonical Composition) — preferred for most uses
- NFD (Canonical Decomposition) — decomposed form
- NFKC/NFKD — compatibility normalization
Failing to normalize can cause string comparison bugs, search failures, and security issues.
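The comparison problem and its fix can be reproduced with Python's standard `unicodedata` module. The strings below are written with escapes so that the precomposed and decomposed forms cannot be confused in the source.

```python
import unicodedata

precomposed = "\u00e9"   # single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render the same but compare unequal.
print(precomposed == decomposed)

# Normalizing both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)

# Compatibility normalization also folds presentation variants,
# such as the "fi" ligature U+FB01, into plain letters.
folded = unicodedata.normalize("NFKC", "\ufb01")
print(folded)
```

A common pattern is to normalize text once at the boundary of a system (on input or before storage) so that later comparisons and lookups see a single form.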
Special Unicode Characters
Some useful Unicode code points for developers:
- U+FEFF: Byte Order Mark (BOM) / Zero Width No-Break Space
- U+200B: Zero Width Space (invisible, affects word breaking)
- U+200D: Zero Width Joiner (used in emoji sequences)
- U+FFFE: Noncharacter (the byte-swapped form of the BOM, so encountering it suggests the wrong byte order was assumed)
- U+202E: Right-to-Left Override (can be used for spoofing)
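One way to spot invisible characters like these in untrusted text is to look for the Unicode general category "Cf" (format), which covers the BOM, zero-width characters, and bidirectional overrides. This is a sketch under that assumption, not a complete spoofing detector; `find_format_chars` is a hypothetical helper name.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, U+ label) for every format-category character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


suspicious = "a\u200bb\u202ec"
print(find_format_chars(suspicious))
```

Noncharacters such as U+FFFE fall under a different category and would need a separate check.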
Using the Text-to-Unicode Tool
Our converter:
- Shows Unicode code points for every character in your text
- Displays multiple formats — U+ notation, decimal, hex, HTML entity
- Identifies script/block — shows which Unicode block each character belongs to
- Converts back — paste code points to decode to text
- Handles emoji — correctly processes multi-codepoint sequences
Use it for debugging encoding issues, learning about Unicode, preparing documentation about special characters, and inspecting suspicious text that might contain invisible or look-alike characters.
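For readers who want a similar breakdown in a script, the following is a rough sketch of the kind of per-character report described above. It is not the converter's actual implementation; `describe_text` and its output fields are hypothetical, and it reports the character name from Python's `unicodedata` module rather than the Unicode block, which the standard library does not expose.

```python
import unicodedata


def describe_text(text: str) -> list[dict]:
    """Describe each code point: U+ label, decimal, HTML reference, name."""
    rows = []
    for ch in text:
        cp = ord(ch)
        rows.append({
            "char": ch,
            "code_point": f"U+{cp:04X}",
            "decimal": cp,
            "html": f"&#x{cp:X};",
            # Some code points (controls, unassigned) have no name.
            "name": unicodedata.name(ch, "<unnamed>"),
        })
    return rows


for row in describe_text("A\u4e2d"):
    print(row["code_point"], row["decimal"], row["html"], row["name"])
```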