Unicode Explained: Characters, Code Points, and Encoding

Convert text to Unicode code points. Learn about Unicode, UTF-8, and how modern software handles international text.

What Is Unicode?

Unicode is the universal character encoding standard. It assigns a unique number, called a code point, to each character it encodes, with the goal of covering every writing system in use. From Latin letters to Chinese hanzi, Arabic script to emoji, Unicode 15.0 defines over 149,000 characters across 161 scripts.

Unicode solves a fundamental problem in computing: historically, different countries and companies created incompatible character encodings, making it impossible to reliably exchange text across systems.

Unicode Code Points

A Unicode code point is written as U+ followed by four to six hexadecimal digits:

  • U+0041 = A (Latin capital letter A)
  • U+4E2D = 中 (Chinese character for "middle")
  • U+1F600 = 😀 (Grinning Face emoji)
  • U+0021 = ! (Exclamation mark)

The code point range spans from U+0000 to U+10FFFF, divided into 17 planes of 65,536 code points each.
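Because each plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The following is a minimal Python sketch of that arithmetic; the helper names `plane_of` and `u_notation` are illustrative and not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16  # each plane holds 0x10000 (65,536) code points


def u_notation(char: str) -> str:
    """Format one character in U+ notation, padded to at least 4 hex digits."""
    return f"U+{ord(char):04X}"


for ch in "A中😀!":
    print(u_notation(ch), "is in plane", plane_of(ord(ch)))
```

With the four examples listed above, only U+1F600 falls outside plane 0: it lands in plane 1.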

Unicode Planes

Plane 0: Basic Multilingual Plane (BMP)

The most commonly used characters, including nearly all scripts in modern use:

  • Latin, Greek, Cyrillic, Hebrew, Arabic, Devanagari
  • CJK (Chinese, Japanese, Korean) characters
  • Most punctuation, symbols, and special characters

Plane 1: Supplementary Multilingual Plane

  • Historic scripts (Linear B, Egyptian hieroglyphs, Cuneiform)
  • Musical symbols
  • Mathematical alphanumeric symbols
  • Many emoji

Plane 2: Supplementary Ideographic Plane

  • Additional CJK unified ideographs (rare characters)

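Plane 3: Tertiary Ideographic Plane

  • Further CJK unified ideographs (Extension G and later)

Planes 4-13: Unassigned

No characters have been assigned to these planes yet.

Plane 14: Supplementary Special-Purpose Plane

  • Tags and variation selectors

Planes 15-16: Supplementary Private Use Areas

  • Reserved for private use; Unicode assigns no characters here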

Unicode Encodings

UTF-8

The dominant encoding on the web (used by over 98% of websites):

  • ASCII characters use 1 byte
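  • Code points U+0080 to U+07FF use 2 bytes (most accented Latin letters, Greek, Cyrillic, Hebrew, Arabic)
  • The rest of the BMP uses 3 bytes (including most CJK characters)
  • Supplementary-plane characters, including most emoji, use 4 bytes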

UTF-8 is backward compatible with ASCII — any ASCII file is valid UTF-8.
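The byte counts above follow directly from the code point's numeric range. As a rough illustration, the sketch below predicts the UTF-8 length from the range and checks it against Python's built-in codec; `utf8_length` is a hypothetical helper written for this example, and it ignores the surrogate range U+D800 to U+DFFF, which cannot be encoded.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point, based on its range."""
    if code_point < 0x80:        # ASCII
        return 1
    if code_point < 0x800:       # U+0080..U+07FF
        return 2
    if code_point < 0x10000:     # rest of the BMP
        return 3
    return 4                     # supplementary planes


for ch in "A\u00E9中😀":
    encoded = ch.encode("utf-8")
    # The range-based prediction should match what the codec produces.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
assert "plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8")
```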

UTF-16

Used by Windows and Java internally:

  • BMP characters use 2 bytes (one 16-bit code unit)
  • Supplementary plane characters use 4 bytes (a surrogate pair of two code units)
  • Not backward compatible with ASCII
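A surrogate pair is derived by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two 10-bit halves. Here is a small Python sketch of that calculation; the function names are made up for this example, and the last lines compare the result against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against the built-in codec (big-endian, no BOM).
assert "😀".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```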

UTF-32

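Fixed-width 4-byte encoding. Every code point takes exactly 4 bytes, so the n-th code point always starts at byte offset 4n. That makes indexing simple but wastes memory: ASCII text takes four times its UTF-8 size. Used internally by some programming languages.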
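To make the trade-off between the three encodings concrete, the sketch below encodes the same four-character sample in each and then indexes into the UTF-32 bytes directly. The helpers `encoded_sizes` and `utf32_code_point_at` are illustrative only; the explicit `-le` codec names are used so that Python does not prepend a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Read the code point at a given index from little-endian UTF-32 bytes."""
    start = 4 * index  # fixed width: no scanning needed
    return int.from_bytes(data[start:start + 4], "little")


sample = "A\u00E9中😀"
print(encoded_sizes(sample))

data = sample.encode("utf-32-le")
print(f"U+{utf32_code_point_at(data, 3):04X}")
```

For this sample UTF-8 and UTF-16 happen to tie at 10 bytes, while UTF-32 needs 16; text that is mostly ASCII would favor UTF-8 much more strongly.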

Unicode in Programming

JavaScript

JavaScript strings are UTF-16 internally. Working with supplementary plane characters requires care:

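'A'.charCodeAt(0)       // 65 (UTF-16 code unit; equals the code point for BMP characters)
'😀'.charCodeAt(0)      // 55357 (0xD83D, only the high surrogate)
'😀'.codePointAt(0)     // 128512 (0x1F600, the full code point)
'\u0041'                // 'A' (Unicode escape)
'\u{1F600}'             // '😀' (ES6 code point escape)
'😀'.length             // 2 (two UTF-16 code units)
[...'😀'].length        // 1 (string iteration yields code points, not code units)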

Python

Python 3 strings are sequences of Unicode code points:

ord('A')           # 65
chr(65)            # 'A'
'\u0041'           # 'A'
'\U0001F600'       # '😀'
len('😀')          # 1 (correct in Python 3)

HTML

Unicode characters in HTML:

&#65;       <!-- A (decimal) -->
&#x41;      <!-- A (hexadecimal) -->
&amp;       <!-- & (named entity) -->

Unicode Normalization

The same visual character can sometimes be represented multiple ways:

  • Precomposed: é = U+00E9 (single code point)
  • Decomposed: é = U+0065 + U+0301 (e + combining accent)

Unicode defines normalization forms to standardize these representations:

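  • NFC (Canonical Decomposition, followed by Canonical Composition): preferred for most uses
  • NFD (Canonical Decomposition): fully decomposed form
  • NFKC/NFKD: compatibility normalization, which additionally replaces compatibility variants such as the ligature U+FB01 (ﬁ) with their plain equivalents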

Failing to normalize can cause string comparison bugs, search failures, and security issues such as two visually identical usernames being treated as distinct.
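The comparison problem is easy to reproduce with Python's standard `unicodedata` module. The sketch below assumes NFC is the form you want to compare in; `same_text` is a hypothetical helper, not a library function.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00E9"    # é as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

print(precomposed == decomposed)           # raw comparison sees different code points
print(same_text(precomposed, decomposed))  # normalized comparison treats them as equal

# NFD goes the other way and splits the precomposed form apart.
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Compatibility normalization also folds the fi ligature into two letters.
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"
```

Normalizing once at the point where text enters the system, rather than at every comparison, is usually the simpler design.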

Special Unicode Characters

Some useful Unicode code points for developers:

  • U+FEFF: Byte Order Mark (BOM) / Zero Width No-Break Space
  • U+200B: Zero Width Space (invisible, affects word breaking)
  • U+200D: Zero Width Joiner (used in emoji sequences)
  • U+FFFE: Noncharacter; it is U+FEFF with its bytes swapped, so encountering it signals that UTF-16 text was read with the wrong byte order
  • U+202E: Right-to-Left Override (can be used for spoofing, for example to disguise a file extension)
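Since most of these characters are invisible, scanning for them programmatically is often more reliable than reading the text. Below is a minimal Python sketch that reports where the code points listed above occur; the `SUSPICIOUS` table and `find_suspicious` function are assumptions made for this example and cover only these five code points, not every invisible character in Unicode.

```python
# Illustrative table limited to the code points discussed above.
SUSPICIOUS = {
    0xFEFF: "BYTE ORDER MARK",
    0x200B: "ZERO WIDTH SPACE",
    0x200D: "ZERO WIDTH JOINER",
    0xFFFE: "NONCHARACTER (byte-swapped BOM)",
    0x202E: "RIGHT-TO-LEFT OVERRIDE",
}


def find_suspicious(text: str) -> list:
    """Return (index, U+ notation, label) for each listed code point in text."""
    return [
        (i, f"U+{ord(ch):04X}", SUSPICIOUS[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in SUSPICIOUS
    ]


# A file name that displays differently from how it is stored.
for hit in find_suspicious("invoice\u202Efdp.exe"):
    print(hit)
```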

Using the Text-to-Unicode Tool

Our converter:

  1. Shows Unicode code points for every character in your text
  2. Displays multiple formats — U+ notation, decimal, hex, HTML entity
  3. Identifies script/block — shows which Unicode block each character belongs to
  4. Converts back — paste code points to decode to text
  5. Handles emoji — correctly processes multi-codepoint sequences

Use it for debugging encoding issues, learning about Unicode, preparing documentation about special characters, and inspecting suspicious text that might contain invisible or look-alike characters.
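For readers who want to reproduce the basic conversions in a script, here is a rough Python sketch of the first few features. It is not the tool's actual implementation: `describe` and `decode` are hypothetical names, and because the standard library has no block lookup, the character name from `unicodedata` stands in for the script/block column.

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point of the text in several common notations."""
    rows = []
    for ch in text:
        cp = ord(ch)
        rows.append({
            "char": ch,
            "u_plus": f"U+{cp:04X}",
            "decimal": cp,
            "hex": f"0x{cp:X}",
            "html": f"&#x{cp:X};",
            # Falls back to a placeholder for code points without a name.
            "name": unicodedata.name(ch, "<unnamed>"),
        })
    return rows


def decode(notation: str) -> str:
    """Turn whitespace-separated U+XXXX tokens back into text."""
    return "".join(chr(int(token[2:], 16)) for token in notation.split())


for row in describe("A中😀"):
    print(row)

print(decode("U+0041 U+4E2D U+1F600"))
```

Note that this works one code point at a time, so an emoji built from several code points joined by U+200D appears as several rows rather than one.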