Unicode And UTF-8

type: lab
status: draft
id: lab.unicode-and-utf-8

Lab: Inspect UTF-8 Bytes

Goal

See that Unicode text becomes bytes through UTF-8, and that byte length can differ from character count.

Setup

Use local macOS-friendly tools: python3 and xxd.

Work in /tmp:

mkdir -p /tmp/unicode-and-utf-8-lab
cd /tmp/unicode-and-utf-8-lab

Step 1: ASCII In UTF-8

python3 - <<'PY'
from pathlib import Path
text = "A"
Path("a.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
PY

xxd -g 1 a.txt

Observe:

  • A has code point U+0041.
  • Its UTF-8 byte is 41, matching ASCII.
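
The ASCII observation generalizes beyond the letter A. The following is a minimal sketch, not part of the original lab steps, that checks every code point from U+0000 through U+007F and then looks at the first code point past that range. The variable names are illustrative only.

```python
# Check every ASCII code point (U+0000 through U+007F): each should encode
# to exactly one byte whose value equals the code point.
ascii_matches = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)
print("all ASCII code points encode to one identical byte:", ascii_matches)

# The first code point past ASCII already needs two bytes.
first_non_ascii = chr(0x80).encode("utf-8")
print("U+0080 ->", first_non_ascii.hex(" "))
```

This is why a pure ASCII file is also a valid UTF-8 file with identical bytes.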

Step 2: A Non-ASCII Character

python3 - <<'PY'
from pathlib import Path
text = "é"
Path("e-acute.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
print("python character count:", len(text))
print("utf-8 byte count:", len(text.encode("utf-8")))
PY

xxd -g 1 e-acute.txt

Observe:

  • é is one Unicode code point in this example: U+00E9.
  • UTF-8 stores it using two bytes, c3 a9.
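
The phrase "in this example" matters because the same visible é can also be written as a plain e followed by a combining acute accent. The sketch below is an optional aside, assuming the standard library's unicodedata module, that compares the two spellings. The labels composed and decomposed are illustrative names, not part of the lab files.

```python
import unicodedata

composed = "\u00e9"  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # e + combining acute accent

for label, text in [("composed", composed), ("decomposed", decomposed)]:
    print(label)
    print("  code points:", [f"U+{ord(ch):04X}" for ch in text])
    print("  len(text):", len(text))
    print("  utf-8 bytes:", text.encode("utf-8").hex(" "))
```

The decomposed form has two code points and three UTF-8 bytes, so even the code point count depends on how the text was written, not only on how it looks.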

Step 3: A Character Outside The Basic Multilingual Plane

python3 - <<'PY'
from pathlib import Path
text = "😀"
Path("face.txt").write_text(text, encoding="utf-8")
print("text:", text)
print("code point:", f"U+{ord(text):04X}")
print("utf-8 bytes:", text.encode("utf-8").hex(" "))
print("python character count:", len(text))
print("utf-8 byte count:", len(text.encode("utf-8")))
PY

xxd -g 1 face.txt

Observe:

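  • UTF-8 can use four bytes for one Unicode code point: U+1F600 is stored as f0 9f 98 80.
  • Python reports a length of 1 for this text, while the file holds 4 bytes, which contradicts the beginner assumption that one character always equals one byte.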
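
To see where those four bytes come from, the sketch below rebuilds them by hand from the code point's bits and compares the result with Python's own encoder. It is an illustration of the four-byte layout only, not a general encoder, and the name manual is made up for this example.

```python
cp = ord("😀")  # U+1F600

# Four-byte layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
# The x positions carry the code point's bits, highest bits first.
manual = bytes([
    0b11110000 | (cp >> 18),
    0b10000000 | ((cp >> 12) & 0b00111111),
    0b10000000 | ((cp >> 6) & 0b00111111),
    0b10000000 | (cp & 0b00111111),
])

print("manual: ", manual.hex(" "))
print("encoder:", "😀".encode("utf-8").hex(" "))
print("match:  ", manual == "😀".encode("utf-8"))
```

The first byte announces the sequence length, and every following byte starts with the bits 10, which marks it as a continuation byte.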

Step 4: Compare Counts

python3 - <<'PY'
samples = ["ABC", "café", "😀", "Aé😀"]
for text in samples:
    print(repr(text))
    print("  code points:", [f"U+{ord(ch):04X}" for ch in text])
    print("  len(text):", len(text))
    print("  len(utf8 bytes):", len(text.encode("utf-8")))
    print("  bytes:", text.encode("utf-8").hex(" "))
PY

Observe:

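  • ASCII-only text has matching code point and byte counts.
  • Any text that contains a non-ASCII character has more UTF-8 bytes than code points, because each such character needs two to four bytes.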
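
The byte count of each sample can be predicted from the code point ranges alone. The helper below is a hypothetical function written for this lab (utf8_length is not a standard library call); it is a sketch that assumes valid, non-surrogate code points.

```python
def utf8_length(code_point: int) -> int:
    """Return how many UTF-8 bytes a code point needs (hypothetical helper).

    Assumes a valid code point outside the surrogate range U+D800 to U+DFFF.
    """
    if code_point < 0x80:
        return 1  # U+0000 to U+007F (ASCII)
    if code_point < 0x800:
        return 2  # U+0080 to U+07FF
    if code_point < 0x10000:
        return 3  # U+0800 to U+FFFF (rest of the Basic Multilingual Plane)
    return 4      # U+10000 to U+10FFFF

for text in ["ABC", "café", "😀", "Aé😀"]:
    per_char = [utf8_length(ord(ch)) for ch in text]
    agrees = sum(per_char) == len(text.encode("utf-8"))
    print(repr(text), per_char, "total:", sum(per_char), "agrees:", agrees)
```

Comparing the predicted total with len(text.encode("utf-8")) is a quick sanity check that the ranges in the helper are right for these samples.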

Step 5: Try An Invalid UTF-8 Byte

python3 - <<'PY'
from pathlib import Path
Path("invalid.bin").write_bytes(bytes([0xC3, 0x41]))

data = Path("invalid.bin").read_bytes()
print("bytes:", data.hex(" "))
try:
    print(data.decode("utf-8"))
except UnicodeDecodeError as error:
    print("decode error:", error)

print("with replacement:", data.decode("utf-8", errors="replace"))
PY

xxd -g 1 invalid.bin

Observe:

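  • Not every byte sequence is valid UTF-8: c3 announces a two-byte sequence, so the next byte must be a continuation byte in the range 80 to bf, and 41 is not.
  • A decoder can reject invalid bytes or substitute the replacement character U+FFFD.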
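
Rejecting and replacing are two of several policies Python's decoder offers. The sketch below runs the same two bytes through a few of the built-in error handlers and then decodes a valid counterpart for contrast. The dictionary name results is illustrative only.

```python
data = bytes([0xC3, 0x41])  # same invalid sequence as in invalid.bin

# Each handler is a different policy for the byte that cannot be decoded.
results = {
    handler: data.decode("utf-8", errors=handler)
    for handler in ("replace", "ignore", "backslashreplace")
}
for handler, text in results.items():
    print(f"{handler:>16}:", repr(text))

# Swapping 0x41 for a continuation byte makes the sequence valid.
valid = bytes([0xC3, 0xA9]).decode("utf-8")
print("valid c3 a9 ->", valid)
```

Note that "ignore" silently drops data, so it hides the problem rather than reporting it; "replace" and "backslashreplace" at least leave a visible trace.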

Reflection

Source Anchors

Open Questions