Lab: Inspect UTF-8 Bytes
Goal
See that Unicode text becomes bytes through UTF-8, and that byte length can differ from character count.
Setup
Use local macOS-friendly tools:
- python3
- xxd
- wc
Work in /tmp:
mkdir -p /tmp/unicode-and-utf-8-lab
cd /tmp/unicode-and-utf-8-lab
Step 1: ASCII In UTF-8
python3 - <<'PY'
from pathlib import Path
text = "A"
Path("a.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
PY
xxd -g 1 a.txt
Observe:
- A has code point U+0041.
- Its UTF-8 byte is 41, matching ASCII.
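The match with ASCII is not special to A. A minimal sketch, assuming the same python3 used above, that checks all 128 ASCII code points; the variable name ascii_matches is only for illustration:

```python
# Compare the UTF-8 and ASCII encodings of every ASCII code point
# (U+0000 through U+007F).
ascii_matches = all(
    chr(code_point).encode("utf-8") == chr(code_point).encode("ascii")
    for code_point in range(0x80)
)

# Each code point should also encode to one byte with the same numeric value.
single_byte = all(
    chr(code_point).encode("utf-8") == bytes([code_point])
    for code_point in range(0x80)
)

print("UTF-8 equals ASCII for all 128 code points:", ascii_matches)
print("each is a single byte with the same value:", single_byte)
```

Both lines should print True, which is why a pure-ASCII file is already a valid UTF-8 file.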
Step 2: A Non-ASCII Character
python3 - <<'PY'
from pathlib import Path
text = "é"
Path("e-acute.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
print("python character count:", len(text))
print("utf-8 byte count:", len(text.encode("utf-8")))
PY
xxd -g 1 e-acute.txt
Observe:
- é is one Unicode code point in this example.
- UTF-8 stores it using two bytes.
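The two bytes are not arbitrary. As a rough sketch, the code point's bits can be packed by hand into the two-byte UTF-8 pattern 110xxxxx 10xxxxxx and compared with Python's own encoder. The variable names here are illustrative, not part of the lab files:

```python
code_point = ord("é")  # 0xE9, binary 1110 1001

# Two-byte UTF-8 pattern: 110xxxxx 10xxxxxx (room for 11 payload bits).
lead = 0b11000000 | (code_point >> 6)                  # upper payload bits
continuation = 0b10000000 | (code_point & 0b00111111)  # lower 6 payload bits

manual = bytes([lead, continuation])
print("manual:  ", manual.hex(" "))                # c3 a9
print("built-in:", "é".encode("utf-8").hex(" "))   # c3 a9
```

The hand-built bytes should match what xxd shows for e-acute.txt.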
Step 3: A Character Outside The Basic Multilingual Plane
python3 - <<'PY'
from pathlib import Path
text = "😀"
Path("face.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
print("python character count:", len(text))
print("utf-8 byte count:", len(text.encode("utf-8")))
PY
xxd -g 1 face.txt
Observe:
- UTF-8 can use four bytes for one Unicode code point.
- This directly contradicts the beginner assumption that one character always equals one byte.
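The byte count follows from the size of the code point. The sketch below uses a hypothetical helper, utf8_length, that predicts the count from the standard UTF-8 ranges and checks it against Python's encoder. The € sample (U+20AC) is an addition here to show the three-byte case, which the lab steps do not cover:

```python
def utf8_length(code_point: int) -> int:
    """Predict the UTF-8 byte count for one code point (illustrative helper).

    Surrogate code points (U+D800 through U+DFFF) are not valid input.
    """
    if code_point < 0x80:      # U+0000..U+007F
        return 1
    if code_point < 0x800:     # U+0080..U+07FF
        return 2
    if code_point < 0x10000:   # U+0800..U+FFFF (the rest of the BMP)
        return 3
    return 4                   # U+10000..U+10FFFF

results = []
for ch in ["A", "é", "€", "😀"]:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    results.append((ch, predicted, actual))
    print(f"U+{ord(ch):04X}: predicted {predicted}, actual {actual}")
```

Every code point outside the Basic Multilingual Plane falls in the last range, so it always takes four bytes.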
Step 4: Compare Counts
python3 - <<'PY'
samples = ["ABC", "café", "😀", "Aé😀"]
for text in samples:
    print(repr(text))
    print(" code points:", [f"U+{ord(ch):04X}" for ch in text])
    print(" len(text):", len(text))
    print(" len(utf8 bytes):", len(text.encode("utf-8")))
    print(" bytes:", text.encode("utf-8").hex(" "))
PY
Observe:
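- ASCII-only text has the same code point count and byte count.
- Every non-ASCII code point takes two to four bytes, so text that contains one always has more UTF-8 bytes than code points.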
Step 5: Try An Invalid UTF-8 Byte
python3 - <<'PY'
from pathlib import Path
Path("invalid.bin").write_bytes(bytes([0xC3, 0x41]))
data = Path("invalid.bin").read_bytes()
print("bytes:", data.hex(" "))
try:
    print(data.decode("utf-8"))
except UnicodeDecodeError as error:
    print("decode error:", error)
print("with replacement:", data.decode("utf-8", errors="replace"))
PY
xxd -g 1 invalid.bin
Observe:
- Not every arbitrary byte sequence is valid UTF-8.
- A decoder can reject invalid bytes or use a replacement marker.
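To see why c3 41 is rejected, it can help to classify each byte by its high bits. The helper below, describe, is a hypothetical name for this lab. It looks only at the bit pattern, so it is a sketch and not a full validator (it does not catch overlong forms or lead bytes that never occur in valid UTF-8):

```python
def describe(byte: int) -> str:
    """Classify one byte by its leading bits (illustrative, not a validator)."""
    if byte < 0x80:
        return "single-byte (ASCII)"
    if (byte & 0b11000000) == 0b10000000:
        return "continuation byte"
    if (byte & 0b11100000) == 0b11000000:
        return "lead byte of a two-byte sequence"
    if (byte & 0b11110000) == 0b11100000:
        return "lead byte of a three-byte sequence"
    if (byte & 0b11111000) == 0b11110000:
        return "lead byte of a four-byte sequence"
    return "not used in UTF-8"

for byte in bytes([0xC3, 0x41]):
    print(f"{byte:02x}: {describe(byte)}")
```

The byte c3 announces a two-byte sequence, but 41 is an ASCII byte and not a continuation byte, so the sequence is incomplete and a strict decoder raises an error.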
Reflection
- Why does ASCII feel simpler than UTF-8?
- What changed when you moved from A to é to 😀?
- What did xxd show that your terminal display hides?
- Why is “character count” not always “byte count”?
- What would go wrong if a program sliced UTF-8 text by arbitrary byte positions?
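One way to explore the last question, after thinking it through first: the sketch below cuts the UTF-8 bytes of "café" in the middle of é and then tries to decode the result. The variable names are illustrative only:

```python
data = "café".encode("utf-8")  # 63 61 66 c3 a9 (5 bytes for 4 code points)

# Keep the first 4 bytes. The cut falls between c3 and a9, inside é.
cut = data[:4]
print("cut bytes:", cut.hex(" "))

decode_failed = False
try:
    cut.decode("utf-8")
except UnicodeDecodeError as error:
    decode_failed = True
    print("byte slice failed:", error)

# Slicing the str works on code points, so it never splits a character's bytes.
text_slice = "café"[:3]
print("text slice:", text_slice)
```

The byte slice ends with a lead byte that has no continuation byte, which is the same kind of invalid sequence as in Step 5.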
Source Anchors
unicode-faq-basic-questions
- Q: What is Unicode?
- Q: My computer cannot display some of the latest Unicode symbols...
unicode-faq-utf-bom
- Q: Is Unicode a 16-bit encoding?
- Q: What is a UTF?
- Q: What are some of the differences between the UTFs?
- Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
- Q: What is the definition of UTF-8?
petzold-code-hidden-language-computer-hardware-software-2e
- Chapter 20. ASCII and a Cast of Characters, near page 271 through page 285.
Open Questions
- This lab uses Python’s Unicode model as a practical inspection tool. It is not a full explanation of how every language runtime stores strings internally.