Unicode And UTF-8
The Short Version
ASCII, Unicode, and UTF-8 are related, but they are not the same thing.
- ASCII is an older 7-bit character code that covers only 128 characters.
- Unicode is the modern universal character standard.
- A Unicode code point is an assigned value such as U+0041 for A.
- UTF-8 is a way to encode Unicode code points as bytes.
The key practical idea:
character -> Unicode code point -> UTF-8 bytes
For ASCII characters, UTF-8 preserves the same byte values. For many other characters, UTF-8 uses more than one byte.
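The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, assuming Python 3, where a `str` is a sequence of Unicode code points. The helper name `describe` is hypothetical and chosen only for this example.

```python
def describe(ch: str) -> tuple[str, list[str]]:
    """Return the code point label and the UTF-8 bytes (as hex) for one code point."""
    code_point = ord(ch)           # character -> Unicode code point
    utf8 = ch.encode("utf-8")      # code point -> UTF-8 bytes
    return f"U+{code_point:04X}", [f"{b:02X}" for b in utf8]

# Escapes are used so the example does not depend on the file's own encoding.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    label, hex_bytes = describe(ch)
    print(label, " ".join(hex_bytes))
# U+0041 41
# U+00E9 C3 A9
# U+20AC E2 82 AC
# U+1F600 F0 9F 98 80
```

The first line of output shows the ASCII case: `A` is the single byte 0x41 in both ASCII and UTF-8. The other three characters need two, three, and four bytes.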
Why This Matters
In from-bits-to-meaning, ASCII was enough to show how text can become bytes. But real software handles names, accents, symbols, non-Latin scripts, and emoji. That requires Unicode.
The most important beginner correction is:
one character is not always one byte
And soon after:
one visible character is not always one Unicode code point
That second sentence is an open door for later, not the main burden of this topic.
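Without going further into that later topic, both corrections can be seen in a short Python sketch. It assumes Python 3, where `len()` on a `str` counts code points. The variable names are illustrative only.

```python
# The same visible character "é", written two ways.
precomposed = "\u00e9"     # one code point: U+00E9
decomposed = "e\u0301"     # two code points: U+0065 followed by combining acute U+0301

for text in (precomposed, decomposed):
    code_points = [f"U+{ord(c):04X}" for c in text]
    byte_count = len(text.encode("utf-8"))
    print(len(text), byte_count, code_points)
# 1 2 ['U+00E9']
# 2 3 ['U+0065', 'U+0301']
```

Both strings typically display as one visible character, yet they differ in code point count and in byte count.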
Source-Grounded Claims
- The official Unicode pages describe Unicode as a universal character encoding for worldwide text interchange, processing, and display.
- The official Unicode FAQ states that modern Unicode is not simply a 16-bit encoding.
- The official Unicode FAQ defines UTFs as mappings from Unicode code points to byte sequences.
- The official Unicode FAQ identifies UTF-8 as the byte-oriented encoding form of Unicode and an ASCII-compatible choice.
- Petzold provides local historical grounding for ASCII and earlier character-set problems.
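To make the claim about UTFs as mappings concrete, here is a hand-written sketch of the UTF-8 mapping from one code point to its bytes. It is for illustration only, and the function name `utf8_encode` is hypothetical. Real code should rely on the built-in `str.encode("utf-8")`, which the final loop uses as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 by hand (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:        # 1 byte: 0xxxxxxx (same values as ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Cross-check the sketch against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The one-byte branch is where ASCII compatibility comes from: code points up to U+007F are emitted unchanged. The four-byte branch covers code points above U+FFFF, which a fixed 16-bit scheme could not represent directly.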
Source Anchors
- unicode-standard-about: “About the Unicode Standard”, Characters for the World; Latest Version.
- unicode-faq-basic-questions: Q: What is Unicode? Q: What is the scope of Unicode? Q: Does Unicode encode scripts or languages? Q: Where can I purchase the Unicode software or the Unicode font?
- unicode-faq-utf-bom: Q: Is Unicode a 16-bit encoding? Q: What is a UTF? Q: What are some of the differences between the UTFs? Q: What is the definition of UTF-8?
- petzold-code-hidden-language-computer-hardware-software-2e: Chapter 20, “ASCII and a Cast of Characters”, near pages 271 through 285.
Open Questions
- Full Unicode support involves more than UTF-8 bytes. Later topics should cover grapheme clusters, normalization, fonts, and bidirectional text.