# 1 - Tokenization

## 1.1 The Unicode Standard
Unicode is a text encoding standard that maps characters to integer code points.
Each character is associated with a unique integer:
- The character `"s"` has the code point 115 (written as U+0073, where `U+` is a prefix and `0073` is the hexadecimal representation).
- The character `"牛"` has the code point 29275 (U+725B).

In Python:

```python
ord("牛")   # 29275
chr(29275)  # '牛'
```

Thus:

- `ord()` converts a Unicode character into its integer code point.
- `chr()` converts an integer code point back into its corresponding character.
### Example: Null Character in Unicode

One special case is `chr(0)`, which returns the null character. This is represented as `'\x00'` in Python. It is an invisible control character that doesn’t correspond to any visible glyph.
### Printed vs. Represented Forms

When working with characters in Python, it is important to distinguish between two forms of string display:

- Printed representation (`__str__`): designed for human readability. Printing the null character simply produces nothing visible on screen.
- Developer representation (`__repr__`): designed for clarity and unambiguous reconstruction. For the null character, `repr(chr(0))` returns `'\x00'`, showing the escape sequence explicitly.
Example:

```python
>>> chr(0)
'\x00'
>>> print(chr(0))  # prints nothing visible
```
## 1.2 Unicode Encodings (Byte-level tokenizers)
- The full Unicode space is large (~150K code points), making direct training on code points impractical:
  - Vocabulary too large
  - Many characters are rare → sparsity problem
- Unicode encodings convert code points into sequences of bytes. The three standard encodings are:
  - UTF-8 (dominant on the web)
  - UTF-16
  - UTF-32
### Why UTF-8?
- Encodes characters into 1–4 bytes.
- ASCII characters (U+0000–U+007F) use only 1 byte → efficient for English text.
- Widely adopted, portable, and avoids out-of-vocabulary issues.
### Python Example
```python
test_string = "hello! こんにちは!"

# Encode to UTF-8 bytes
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
# b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'

# Check type
print(type(utf8_encoded))
# <class 'bytes'>

# Convert to integer byte values (0–255)
print(list(utf8_encoded))
# [104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
# 104 = 0x68 in hex = the UTF-8/ASCII encoding of 'h'; Japanese characters like こ → 3 bytes each

# Compare string length vs. encoded length
print(len(test_string))   # 13 characters
print(len(utf8_encoded))  # 23 bytes = "hello!" (6 bytes) + " " (1 byte) + "こんにちは" (5×3 = 15 bytes) + "!" (1 byte)

# Decode back to Unicode string
print(utf8_encoded.decode("utf-8"))
# hello! こんにちは!
```
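To make the efficiency argument behind UTF-8 concrete, here is a short comparison of the three encodings on the same string (a minimal sketch; note that Python's `"utf-16"` and `"utf-32"` codecs prepend a byte-order mark, which adds 2 and 4 bytes respectively):

```python
test_string = "hello! こんにちは!"

# Byte lengths under the three standard encodings
for encoding in ["utf-8", "utf-16", "utf-32"]:
    encoded = test_string.encode(encoding)
    print(encoding, len(encoded))

# utf-8  23  (ASCII characters take 1 byte each, the Japanese characters 3 bytes each)
# utf-16 28  (2-byte BOM + 13 characters × 2 bytes each)
# utf-32 56  (4-byte BOM + 13 characters × 4 bytes each)
```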
## 1.3 Subword Tokenization

### Motivation
**Word-level tokenizers**

- Short sequences (e.g., 10 words = 10 tokens).
- Problem: out-of-vocabulary (OOV) words are not handled well.

**Byte-level tokenizers**

- Fixed vocabulary size = 256 (all possible byte values).
- Solves the OOV problem.
- Problem: sequences become too long (10 words → 50+ tokens).
- Longer sequences = more computation + harder long-term dependencies.

**Subword tokenization = trade-off**

- Larger vocabulary than byte-level, but shorter sequences.
- Common byte sequences (like `"the"`) become a single token.
- Reduces token count → more efficient training (see the token-count sketch below).
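A quick way to see the trade-off in practice, assuming the third-party `tiktoken` package is available (exact token counts depend on the tokenizer version):

```python
import tiktoken  # third-party package for OpenAI's BPE tokenizers

text = "The quick brown fox jumps over the lazy dog."

# Byte-level view: one token per UTF-8 byte.
byte_tokens = list(text.encode("utf-8"))

# GPT-2's BPE vocabulary: frequent byte sequences become single tokens.
gpt2_tokens = tiktoken.get_encoding("gpt2").encode(text)

print(len(text.split()))  # 9 words
print(len(byte_tokens))   # 44 byte-level tokens
print(len(gpt2_tokens))   # about 10 BPE tokens: close to word level, with no OOV issue
```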
### Byte Pair Encoding (BPE)
Idea (Gage, 1994; Sennrich et al., 2016):
- Iteratively replace most frequent pair of bytes with a new token.
- Grows vocabulary by merging frequent patterns.
- Frequent words/subwords → represented as single tokens.
Result:
- Efficient compression of input sequences.
- No OOV problem (all text can still be broken down into bytes).
- Manageable input lengths.
## 1.4 BPE Tokenizer Training – Class Notes

### Main Steps
**Vocabulary Initialization**

- Start with the 256 byte values (all possible bytes).
- Add any special tokens (e.g., `<|endoftext|>`).
- Initial vocabulary size = 256 + number of special tokens (see the sketch below).
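A minimal sketch of this initialization, assuming the vocabulary maps integer token IDs to byte sequences (illustrative variable names, not a specific implementation):

```python
# IDs 0-255 cover every possible byte value.
vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}

# Special tokens are appended after the byte-level entries.
special_tokens = ["<|endoftext|>"]
for token in special_tokens:
    vocab[len(vocab)] = token.encode("utf-8")

print(len(vocab))  # 257 = 256 bytes + 1 special token
```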
**Pre-tokenization**

- Goal: avoid merging across arbitrary text boundaries + make counting efficient.
- Represent the corpus as pre-tokens (substrings) with counts.
- Example: `"text"` appears 10 times → count of the pair (`t`, `e`) = 10.
- BPE implementations:
  - Original (Sennrich et al., 2016): split on whitespace.
  - GPT-2 (Radford et al., 2019): regex-based pre-tokenizer. Example (note that the `\p{...}` character classes require the third-party `regex` module, not the standard-library `re`):

```python
import regex  # third-party module; supports \p{L} and \p{N}

PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
regex.findall(PAT, "some text that i'll pre-tokenize")
# ['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']
```
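To make the counting step concrete, here is a small sketch that builds pre-token counts with the GPT-2 pattern (illustrative names; each pre-token is stored as a tuple of single-byte tokens so merges can later be applied inside it):

```python
from collections import Counter
import regex  # third-party module

PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

corpus = "the cat in the hat sat on the mat"

# Pre-token counts: each pre-token becomes a tuple of its UTF-8 bytes.
pretoken_counts = Counter(
    tuple(bytes([b]) for b in match.encode("utf-8"))
    for match in regex.findall(PAT, corpus)
)

print(pretoken_counts[(b" ", b"t", b"h", b"e")])  # 2: " the" occurs twice as a pre-token
```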
**Compute BPE Merges**

- Count the frequency of adjacent byte pairs inside pre-tokens.
- Select the most frequent pair `(A, B)`.
- Merge it into a new token `AB` and add it to the vocabulary.
- Repeat until the desired vocabulary size is reached.
- Tie-breaking rule: pick the lexicographically greater pair (see the sketch below).
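A minimal sketch of this training loop, assuming pre-tokens are represented as tuples of bytes as in the counting snippet above (`train_bpe` is an illustrative name, not a specific library function). Running it on the pre-token counts of the worked example below reproduces the first six merges listed there.

```python
from collections import Counter

def train_bpe(pretoken_counts: Counter, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Greedy BPE training: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across all pre-tokens, weighted by pre-token frequency.
        pair_counts = Counter()
        for pretoken, count in pretoken_counts.items():
            for a, b in zip(pretoken, pretoken[1:]):
                pair_counts[(a, b)] += count
        if not pair_counts:
            break
        # Most frequent pair; ties broken by the lexicographically greater pair.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        # Rewrite every pre-token with the chosen pair merged into one token.
        new_counts = Counter()
        for pretoken, count in pretoken_counts.items():
            merged, i = [], 0
            while i < len(pretoken):
                if i + 1 < len(pretoken) and (pretoken[i], pretoken[i + 1]) == best:
                    merged.append(pretoken[i] + pretoken[i + 1])
                    i += 2
                else:
                    merged.append(pretoken[i])
                    i += 1
            new_counts[tuple(merged)] += count
        pretoken_counts = new_counts
    return merges

# Pre-token counts for the corpus of the worked example below.
counts = Counter({
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"w", b"i", b"d", b"e", b"s", b"t"): 3,
    (b"n", b"e", b"w", b"e", b"s", b"t"): 6,
})
print(train_bpe(counts, 6))
# [(b's', b't'), (b'e', b'st'), (b'o', b'w'), (b'l', b'ow'), (b'w', b'est'), (b'n', b'e')]
```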
### Example (Sennrich et al., 2016)

Corpus: `low low low low low lower lower widest widest widest newest newest newest newest newest newest`

Pre-tokenization (split on whitespace): `{low: 5, lower: 2, widest: 3, newest: 6}`
**Step 1: Count pairs**

`{lo: 7, ow: 7, we: 8, er: 2, wi: 3, id: 3, de: 3, es: 9, st: 9, ne: 6, ew: 6}`

- Most frequent = `(e, s)` with count 9 and `(s, t)` with count 9 → tie → choose `(s, t)` (lexicographically greater).
- Merge → `(w, i, d, e, st), (n, e, w, e, st)`
**Step 2: Merge again**

- Now `(e, st)` is the most frequent pair → merge → `(w, i, d, est), (n, e, w, est)`
Continue merging → final merges:

`['s t', 'e st', 'o w', 'l ow', 'w est', 'n e', 'ne west', 'w i', 'wi d', 'wid est', 'low e', 'lowe r']`

If we stop after 6 merges:

- Learned merges: `['s t', 'e st', 'o w', 'l ow', 'w est', 'n e']`
- Vocabulary = `[<|endoftext|>, (256 bytes), st, est, ow, low, west, ne]`
- Example: `"newest"` → `[ne, west]` (see the encoding sketch below)
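To see how the learned merges are used at encoding time, here is a minimal sketch that applies them in the order they were learned (`apply_merges` is an illustrative helper; tokens are shown as strings rather than bytes for readability):

```python
def apply_merges(pretoken: list[str], merges: list[tuple[str, str]]) -> list[str]:
    """Apply BPE merges to a single pre-token, in the order they were learned."""
    tokens = list(pretoken)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The six merges learned above, in order.
merges = [("s", "t"), ("e", "st"), ("o", "w"), ("l", "ow"), ("w", "est"), ("n", "e")]
print(apply_merges(list("newest"), merges))
# ['ne', 'west']
```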