Study Guide: Unicode And UTF-8
Learning Goals
- Explain the relationship between ASCII, Unicode, and UTF-8.
- Distinguish characters, code points, code units, and bytes.
- Explain why UTF-8 is ASCII-compatible but not limited to ASCII.
- Inspect UTF-8 bytes locally.
- Name what this topic does not yet cover.
Key Terms
- ASCII: an older coded character set that maps a limited set of characters and control codes to numeric codes.
- Unicode: a universal character encoding standard for text interchange, processing, and display across languages and disciplines.
- Character: an abstract text element encoded by a character standard.
- Code point: a Unicode value such as U+0041 or U+00E9.
- UTF: Unicode Transformation Format; an algorithmic mapping from Unicode code points to byte sequences and back.
- UTF-8: the byte-oriented Unicode encoding form using 8-bit code units.
- Code unit: the unit used by an encoding form; UTF-8 uses 8-bit code units.
- Font: software/data that knows how to draw glyphs for characters; Unicode is not itself a font.
1. ASCII Is A Small Earlier Code
Petzold’s ASCII chapter is still valuable because it shows the basic move: assign numeric codes to characters so text can be stored and transmitted as bytes.
ASCII works well for a limited English-centered character set. But Petzold also shows the pressure that appears immediately: other symbols, accented letters, non-Latin scripts, and ideographs do not fit comfortably in plain ASCII.
Source-grounded claim: Petzold’s Chapter 20 supports ASCII as a coded character set and shows why extensions and alternatives appeared.
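As a minimal sketch of that basic move, the Python snippet below maps a short string to its ASCII codes and checks whether other strings fit in ASCII at all. The helper name fits_in_ascii is hypothetical and chosen only for this illustration.

```python
# ASCII assigns each of its characters a numeric code from 0 to 127.
text = "Hi!"
codes = [ord(ch) for ch in text]       # numeric code of each character
ascii_bytes = text.encode("ascii")     # the same codes, stored as bytes


def fits_in_ascii(s: str) -> bool:
    """Return True if every character of s has an ASCII code (0-127).

    Hypothetical helper for this guide, not a standard library function.
    """
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(codes)                      # [72, 105, 33]
print(list(ascii_bytes))          # [72, 105, 33]
print(fits_in_ascii("cafe"))      # True
print(fits_in_ascii("caf\u00e9")) # False: the accented letter has no ASCII code
```

The failing case is the pressure described above: an accented letter has no code in plain ASCII.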
2. Unicode Is The Character Standard
Unicode is not “a font” and not “a software program.” It is a character encoding standard. The official Unicode FAQ describes it as the basis for processing, storage, and interchange of text data in modern software and protocols.
Unicode assigns characters to code points. A code point is usually written like this:
U+0041
U+00E9
U+1F600
These are not yet file bytes. They are values in the Unicode code space.
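To see code points without involving any file bytes, a short Python sketch can convert between characters and their numeric values. The function name code_point_label is an assumption made for this example; ord() and chr() are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in U+XXXX notation.

    Hypothetical helper for this guide. Pads to at least four hex digits.
    """
    return f"U+{ord(ch):04X}"


# Escapes are used so the exact code points are unambiguous in the source.
for ch in ["A", "\u00e9", "\U0001F600"]:
    print(ch, code_point_label(ch))

# chr() goes the other way: code point value -> character.
print(chr(0x0041))   # A
```

Nothing here has been encoded yet. The values are positions in the Unicode code space, not bytes.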
Source-grounded claim: official Unicode Basic Questions and About the Unicode Standard pages support Unicode’s role as a universal character standard.
3. Unicode Is Not Just 16 Bits
Petzold’s chapter describes Unicode historically as a 16-bit code. That was a reasonable historical snapshot for the book’s context, but it is not enough for modern practice.
The official Unicode FAQ says early Unicode was 16-bit, but modern Unicode uses code points from:
U+0000 through U+10FFFF
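A quick way to check this range locally is sketched below in Python. The constant MAX_CODE_POINT and the helper is_valid_code_point are names invented for this example; sys.maxunicode is a real Python attribute.

```python
import sys

MAX_CODE_POINT = 0x10FFFF   # upper end of the Unicode code space


def is_valid_code_point(value: int) -> bool:
    """True if value lies in the range U+0000 through U+10FFFF.

    Hypothetical helper for this guide.
    """
    return 0 <= value <= MAX_CODE_POINT


# Python 3 uses the same upper limit for chr().
print(sys.maxunicode == MAX_CODE_POINT)   # True

# Size of the code space, far more than the 65,536 values of a 16-bit code.
print(MAX_CODE_POINT + 1)                 # 1114112

# U+1F600 is a valid code point but does not fit in 16 bits.
print(is_valid_code_point(0x1F600), 0x1F600 > 0xFFFF)   # True True
print(is_valid_code_point(0x110000))                    # False
```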
Source-grounded claim: official Unicode UTF-8, UTF-16, UTF-32 & BOM FAQ, Q: Is Unicode a 16-bit encoding?.
4. UTF-8 Is One Way To Turn Code Points Into Bytes
Unicode code points need an encoding form to become bytes in a file or network stream. UTF-8 is one such encoding form.
The basic path is:
character -> code point -> UTF-8 bytes
UTF-8 is byte-oriented. It uses:
- 1 byte for ASCII-range code points (U+0000 through U+007F),
- 2 or 3 bytes for the remaining code points up to U+FFFF,
- 4 bytes for code points from U+10000 through U+10FFFF, and never more than 4.
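The byte counts above follow from the bit layout UTF-8 uses. The Python sketch below encodes one code point by hand and cross-checks the result against Python's built-in encoder. The function utf8_encode_code_point is a teaching aid written for this guide, not a library API, and the euro sign (U+20AC) is added here only as a 3-byte sample.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate code point, which UTF-8 does not encode.
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    manual = utf8_encode_code_point(ord(ch))
    # Cross-check the hand-written mapping against Python's own encoder.
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {manual.hex(' ').upper()}")
```

In everyday code, text.encode("utf-8") does this work. The manual version is only meant to make the "algorithmic mapping" visible.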
Source-grounded claim: official Unicode UTF FAQ defines UTFs as algorithmic mappings and identifies UTF-8 as the byte-oriented encoding form.
5. UTF-8 Preserves ASCII
For ASCII characters, UTF-8 uses the same byte values as ASCII. This is one reason UTF-8 works well with older ASCII-shaped systems, source files, markup, file paths, and protocols.
Example:
A
Unicode code point: U+0041
UTF-8 bytes: 41
For non-ASCII characters, UTF-8 uses multi-byte sequences.
Example:
é
Unicode code point: U+00E9
UTF-8 bytes: C3 A9
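Both examples can be reproduced locally. The sketch below assumes Python 3.8 or later (for the separator argument of bytes.hex), and utf8_hex is a hypothetical helper name.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex.

    Hypothetical helper for this guide.
    """
    return text.encode("utf-8").hex(" ").upper()


print(utf8_hex("A"))        # 41
print(utf8_hex("\u00e9"))   # C3 A9

# ASCII-only text produces identical bytes under ASCII and UTF-8.
sample = "Hello, world"
print(sample.encode("ascii") == sample.encode("utf-8"))   # True
```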
Source-grounded claim: official Unicode UTF FAQ says UTF-8 preserves ASCII for ASCII characters and is widely used for Unicode text files.
6. Bytes Are Not Characters
This is the practical consequence of multi-byte encoding.
In ASCII-only text, it is tempting to think:
1 byte = 1 character
In UTF-8, this is not generally true.
"A" -> 1 UTF-8 byte
"é" -> 2 UTF-8 bytes
"😀" -> 4 UTF-8 bytes
Inference from official UTF-8 properties: a byte count is not the same thing as a character count.
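The difference is easy to measure. In the Python sketch below, len() on a string counts code points, which matches the character count for these three samples, while len() on the encoded bytes counts UTF-8 bytes. The helper name counts is an assumption for this example.

```python
def counts(text: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for text.

    Hypothetical helper for this guide.
    """
    return len(text), len(text.encode("utf-8"))


print(counts("A"))            # (1, 1)
print(counts("\u00e9"))       # (1, 2)
print(counts("\U0001F600"))   # (1, 4)

# Three characters in a row, but seven bytes in UTF-8.
print(counts("A\u00e9\U0001F600"))   # (3, 7)
```

A practical consequence is that truncating text at a fixed byte length can cut through the middle of a multi-byte sequence.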
7. Display Is Another Layer
Even if bytes decode correctly into Unicode code points, your computer still needs a way to display the characters. The Unicode FAQ notes that display can fail because of font coverage, operating system support, application support, or language/script setup.
So there are several different problems:
- Are the bytes valid UTF-8?
- Are they decoded using the correct encoding?
- Is the resulting code point assigned to a character in Unicode?
- Does the font have a glyph?
- Does the renderer know how to lay out the script?
This topic handles the first two beginner layers. The rest are future topics.
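The first two questions can be explored with a short Python sketch. It assumes Python's strict UTF-8 decoder as the validity check and uses Latin-1 as an example of a wrong encoding; is_valid_utf8 is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 under Python's strict decoder.

    Hypothetical helper for this guide.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


good = "\u00e9".encode("utf-8")   # the two bytes C3 A9

# Question 1: are the bytes valid UTF-8?
print(is_valid_utf8(good))        # True
print(is_valid_utf8(b"\xc3"))     # False: lead byte with no continuation byte
print(is_valid_utf8(b"\xe9"))     # False: a lone Latin-1 byte, not UTF-8

# Question 2: are they decoded with the correct encoding?
# Valid UTF-8 bytes read as Latin-1 decode without error but give wrong text.
wrong = good.decode("latin-1")
print(wrong == "\u00c3\u00a9")    # True: two characters instead of one
print(good.decode("utf-8") == "\u00e9")   # True: the intended character
```

The second case raises no error at all, which is why a wrong decoding can go unnoticed until the text is displayed.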
Misconceptions To Avoid
- “Unicode is UTF-8.” Unicode is the standard; UTF-8 is an encoding form for Unicode.
- “Unicode is 16-bit.” That was true of early Unicode; modern Unicode code points run from U+0000 through U+10FFFF.
- “A character is a byte.” Not in general: in UTF-8 a single code point takes 1 to 4 bytes.
- “If a character does not display, the encoding is wrong.” Maybe, but it could also be a font, OS, app, or rendering issue.
- “ASCII is obsolete and irrelevant.” ASCII still matters because UTF-8 preserves ASCII byte values for ASCII characters.
Source Anchors
- unicode-standard-about: “About the Unicode Standard”, Characters for the World. Latest Version.
- unicode-faq-basic-questions:
  - Q: What is Unicode?
  - Q: What is the scope of Unicode?
  - Q: Does Unicode encode scripts or languages?
  - Q: Where can I purchase the Unicode software or the Unicode font?
  - Q: My computer cannot display some of the latest Unicode symbols...
- unicode-faq-utf-bom:
  - Q: Is Unicode a 16-bit encoding?
  - Q: What is a UTF?
  - Q: What are some of the differences between the UTFs?
  - Q: Is there a standard method to package a Unicode character so it fits an 8-Bit ASCII stream?
  - Q: Which method of packing Unicode characters into an 8-bit stream is the best?
  - Q: What is the definition of UTF-8?
- petzold-code-hidden-language-computer-hardware-software-2e: Chapter 20. ASCII and a Cast of Characters, near page 271 through page 285.
Open Questions
- Grapheme clusters: user-perceived characters can be made from multiple code points.
- Normalization: visually similar text can have different code point sequences.
- Bidirectional text: mixed left-to-right and right-to-left text needs layout rules.
- Fonts and glyphs: Unicode identifies characters, but fonts draw glyphs.