Unicode Explained: Characters, Code Points, and Encoding

Convert text to Unicode code points. Learn about Unicode, UTF-8, and how modern software handles international text.

What Is Unicode?

Unicode is the universal character encoding standard that assigns a unique number, called a code point, to each character across the world's writing systems. From Latin letters to Chinese hanzi, Arabic script to emoji, Unicode 15.0 defines more than 149,000 characters across 161 scripts.

Unicode solves a fundamental problem in computing: historically, different countries and companies created incompatible character encodings, making it impossible to reliably exchange text across systems.

Unicode Code Points

A Unicode code point is written as U+ followed by a hexadecimal number:

U+0041 = A (Latin capital letter A)
U+4E2D = 中 (Chinese character for "middle")
U+1F600 = 😀 (Grinning Face emoji)
U+0021 = ! (Exclamation mark)

The code point range spans from U+0000 to U+10FFFF, divided into 17 planes of 65,536 code points each.
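
As a minimal sketch of the notation and plane arithmetic described above, the following Python snippet formats a character as U+ notation and works out its plane. The helper names format_code_point and plane_of are made up for this illustration.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # Upper-case hex, padded to at least four digits.
    return f"U+{ord(ch):04X}"


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a character's code point."""
    # Each plane holds 0x10000 (65,536) code points.
    return ord(ch) // 0x10000


for ch in "A中😀!":
    print(format_code_point(ch), "plane", plane_of(ch))
# U+0041 plane 0
# U+4E2D plane 0
# U+1F600 plane 1
# U+0021 plane 0
```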

Unicode Planes

Plane 0: Basic Multilingual Plane (BMP)

The most commonly used characters, including nearly all scripts in modern use:

Latin, Greek, Cyrillic, Hebrew, Arabic, Devanagari
CJK (Chinese, Japanese, Korean) characters
Most punctuation, symbols, and special characters

Plane 1: Supplementary Multilingual Plane

Historic scripts (Linear B, Egyptian hieroglyphs, Cuneiform)
Musical symbols
Mathematical alphanumeric symbols
Many emoji

Plane 2: Supplementary Ideographic Plane

Additional CJK unified ideographs (rare characters)

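Plane 3: Tertiary Ideographic Plane

Further rare and historic CJK unified ideographs (Extensions G and H)

Planes 4-13: Unassigned

No characters have been assigned to these planes yet.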

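Plane 14: Supplementary Special-purpose Plane

Tags and variation selectors

Planes 15-16: Supplementary Private Use Areas

Reserved for private use; the standard assigns no characters here.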

Unicode Encodings

UTF-8

The dominant encoding on the web (used by over 98% of websites):

ASCII characters use 1 byte
Accented Latin letters, Greek, Cyrillic, Hebrew, and Arabic use 2 bytes
Most CJK characters (and the rest of the BMP) use 3 bytes
Emoji and supplementary characters use 4 bytes

UTF-8 is backward compatible with ASCII — any ASCII file is valid UTF-8.
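
The byte counts above can be checked with a short Python sketch. The sample characters and the helper name utf8_length are chosen only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "ASCII letter",
    "\u00E9": "accented Latin letter (é)",
    "中": "CJK ideograph in the BMP",
    "😀": "emoji in Plane 1",
}

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
# ASCII letter: 1 byte(s): 41
# accented Latin letter (é): 2 byte(s): c3 a9
# CJK ideograph in the BMP: 3 byte(s): e4 b8 ad
# emoji in Plane 1: 4 byte(s): f0 9f 98 80

# ASCII text produces identical bytes under both encodings.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```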

UTF-16

Used by Windows and Java internally:

Characters in the BMP use 2 bytes
Supplementary plane characters use 4 bytes (surrogate pairs)
Not backward compatible with ASCII
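
To make the surrogate pair mechanism concrete, here is a sketch of the arithmetic in Python. The function name to_surrogate_pair is hypothetical; in practice the language's UTF-16 codec does this work, and the last line cross-checks against it.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's built-in big-endian UTF-16 encoder.
assert "😀".encode("utf-16-be") == bytes.fromhex("D83DDE00")
```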

UTF-32

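Fixed-width encoding in which every code point takes exactly 4 bytes. Indexing by code point is simple, but text that is mostly ASCII takes four times the space it would in UTF-8. Used internally by some programming languages.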
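
The trade-off can be seen by encoding one short string three ways. This is an illustrative sketch: the sample text and the helper utf32_code_point_at are assumptions, and the little-endian codec names are used so that no byte order mark is added to the output.

```python
text = "A中😀"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, len(text.encode(encoding)), "bytes")
# utf-8 8 bytes
# utf-16-le 8 bytes
# utf-32-le 12 bytes


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Read the code point at a given position from little-endian UTF-32 bytes."""
    # Fixed width means the byte offset is simply index * 4.
    start = 4 * index
    return int.from_bytes(data[start:start + 4], "little")


encoded = text.encode("utf-32-le")
print(hex(utf32_code_point_at(encoded, 2)))  # 0x1f600
```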

Unicode in Programming

JavaScript

JavaScript strings are UTF-16 internally. Working with supplementary plane characters requires care:

'A'.charCodeAt(0)      // 65 (UTF-16 code unit)
'😀'.codePointAt(0)    // 128512 (full code point, 0x1F600)
'\u0041'               // 'A' (Unicode escape)
'\u{1F600}'            // '😀' (ES6 code point escape)
'😀'.length            // 2 (two UTF-16 code units!)
[...'😀'].length       // 1 (iterates by code point, not by grapheme cluster)

Python

Python 3 strings are sequences of Unicode code points:

ord('A')           # 65
chr(65)            # 'A'
'\u0041'           # 'A'
'\U0001F600'       # '😀'
len('😀')          # 1 (correct in Python 3)

HTML

Any Unicode character can be written in HTML as a numeric character reference, and some also have named entities:

&#65;       <!-- A (decimal) -->
&#x41;      <!-- A (hexadecimal) -->
&amp;       <!-- & (named entity) -->
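
When HTML is produced or read by a program, these references can be generated and decoded in code. The sketch below uses Python's standard html module for decoding; to_numeric_entity is a hypothetical helper written for this example.

```python
import html


def to_numeric_entity(ch: str, hexadecimal: bool = False) -> str:
    """Return the decimal or hexadecimal character reference for one character."""
    code_point = ord(ch)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


print(to_numeric_entity("A"))                # &#65;
print(to_numeric_entity("A", True))          # &#x41;
print(html.unescape("&#65; &#x41; &amp;"))   # A A &
```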

Unicode Normalization

The same visual character can sometimes be represented multiple ways:

Precomposed: é = U+00E9 (single code point)
Decomposed: é = U+0065 + U+0301 (e + combining accent)

Unicode defines normalization forms to standardize these representations:

NFC (Canonical Decomposition, followed by Canonical Composition) — preferred for most uses
NFD (Canonical Decomposition) — decomposed form
NFKC/NFKD — compatibility normalization

Failing to normalize can cause string comparison bugs, search failures, and security issues.
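
A short Python sketch of the comparison problem and one way to address it, using the standard unicodedata module. The helper same_text is an assumption for illustration, and normalizing both sides to NFC is one reasonable choice, not the only one.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render alike but compare as different.
print(precomposed == decomposed)  # False


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(precomposed, decomposed))               # True
print(len(unicodedata.normalize("NFD", precomposed)))   # 2

# Compatibility normalization also folds characters such as the fi ligature.
print(unicodedata.normalize("NFKC", "\uFB01"))          # fi
```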

Special Unicode Characters

Some useful Unicode code points for developers:

U+FEFF: Byte Order Mark (BOM) / Zero Width No-Break Space
U+200B: Zero Width Space (invisible, affects word breaking)
U+200D: Zero Width Joiner (used in emoji sequences)
U+FFFE: Noncharacter (a byte-swapped BOM; seeing it suggests the byte order was read incorrectly)
U+202E: Right-to-Left Override (can be used for spoofing)
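
Because most of these characters are invisible, it can help to scan for them programmatically. The following is a minimal sketch: the watch list contains only the code points listed above, and both WATCH_LIST and find_special_characters are names invented for this example. A real checker would likely cover many more characters.

```python
# Illustrative watch list built from the code points named above.
WATCH_LIST = {
    0xFEFF: "byte order mark / zero width no-break space",
    0x200B: "zero width space",
    0x200D: "zero width joiner",
    0xFFFE: "noncharacter",
    0x202E: "right-to-left override",
}


def find_special_characters(text: str) -> list:
    """Return (index, U+ notation, label) for each watched character in text."""
    hits = []
    for index, ch in enumerate(text):
        label = WATCH_LIST.get(ord(ch))
        if label is not None:
            hits.append((index, f"U+{ord(ch):04X}", label))
    return hits


# A file name that hides its real extension behind a direction override.
sample = "invoice\u202Efdp.exe"
for hit in find_special_characters(sample):
    print(hit)
# (7, 'U+202E', 'right-to-left override')
```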

Using the Text-to-Unicode Tool

Our converter:

Shows Unicode code points for every character in your text
Displays multiple formats — U+ notation, decimal, hex, HTML entity
Identifies script/block — shows which Unicode block each character belongs to
Converts back — paste code points to decode to text
Handles emoji — correctly processes multi-codepoint sequences

Use it for debugging encoding issues, learning about Unicode, preparing documentation about special characters, and inspecting suspicious text that might contain invisible or look-alike characters.
