Tokenization

1 - Tokenization #

1.1 The Unicode Standard #

Unicode is a text encoding standard that maps characters to integer code points.

Each character is associated with a unique integer:

  • The character "s" has the code point 115 (written as U+0073, where U+ is a prefix and 0073 is the hexadecimal representation).
  • The character "牛" has the code point 29275 (U+725B).

In Python:

  • ord("牛") → 29275
  • chr(29275) → "牛"

Thus:

  • ord() converts a Unicode character into its integer code point.
  • chr() converts an integer code point back into its corresponding character.

Example: Null Character in Unicode #

One case is chr(0), which returns the null character.
This is represented as '\x00' in Python. It is an invisible control character that doesn’t correspond to any visible glyph.


Printed vs. Represented Forms #

When working with characters in Python, it is important to distinguish between two forms of string display:

  • Printed representation (__str__)
    Designed for human readability. Printing the null character simply produces nothing visible on screen.

  • Developer representation (__repr__)
    Designed for clarity and unambiguous reconstruction.
    For the null character, repr(chr(0)) returns '\x00', showing the escape sequence explicitly.

Example:

>>> chr(0)
'\x00'

>>> print(chr(0))
# prints nothing visible

1.2 Unicode Encodings (Byte-level tokenizers) #

  • The Unicode standard defines roughly 150K characters (assigned code points), making direct training on code points impractical:
    • Vocabulary too large
    • Many characters are rare → sparsity problem
  • Unicode encodings convert code points into sequences of bytes. The three standard encodings are:
    • UTF-8 (dominant on the web)
    • UTF-16
    • UTF-32

Why UTF-8? #

  • Encodes characters into 1–4 bytes.
  • ASCII characters (U+0000–U+007F) use only 1 byte → efficient for English text.
  • Widely adopted, portable, and avoids out-of-vocabulary issues.
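
As a quick illustration of the efficiency argument, here is a minimal sketch using Python's built-in codecs (note that Python's "utf-16" and "utf-32" codecs prepend a byte-order mark, which accounts for the extra 2/4 bytes):

for text in ["hello", "牛"]:
    for encoding in ["utf-8", "utf-16", "utf-32"]:
        print(text, encoding, len(text.encode(encoding)))
# hello utf-8 5
# hello utf-16 12   (2-byte BOM + 2 bytes per ASCII character)
# hello utf-32 24   (4-byte BOM + 4 bytes per character)
# 牛 utf-8 3
# 牛 utf-16 4
# 牛 utf-32 8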

Python Example #

test_string = "hello! こんにちは!"

# Encode to UTF-8 bytes
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
# b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'

# Check type
print(type(utf8_encoded))
# <class 'bytes'>

# Convert to integer byte values (0–255)
print(list(utf8_encoded))
# [104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
# 104 = 0x68 in hex = the UTF-8/ASCII encoding of 'h'; Japanese characters like こ → 3 bytes each

# Compare string length vs. encoded length
print(len(test_string))      # 13 characters
print(len(utf8_encoded))     # 23 bytes = "hello!" (6 bytes) + " " (1 byte) + "こんにちは" (5×3 = 15 bytes) + "!" (1 byte)

# Decode back to Unicode string
print(utf8_encoded.decode("utf-8"))
# hello! こんにちは!

1.3 Subword Tokenization #

Motivation #

  • Word-level tokenizers

    • Short sequences (e.g., 10 words = 10 tokens).
    • Problem: out-of-vocabulary (OOV) words not handled well.
  • Byte-level tokenizers

    • Fixed vocabulary size = 256 (all possible byte values).
    • Solve OOV problem.
    • Problem: Sequences become too long (10 words → 50+ tokens).
    • Longer sequences = more computation + harder long-term dependencies.
  • Subword tokenization = trade-off

    • Larger vocabulary than byte-level, but shorter sequences.
    • Common byte sequences (like "the") become a single token.
    • Reduces token count → more efficient training.
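
To make the sequence-length trade-off concrete, here is a small sketch (the example sentence is arbitrary):

sentence = "the quick brown fox jumps over the lazy dog"

# Word-level: one token per whitespace-separated word
print(len(sentence.split()))           # 9 tokens

# Byte-level: one token per UTF-8 byte
print(len(sentence.encode("utf-8")))   # 43 tokens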

Byte Pair Encoding (BPE) #

  • Idea (Gage, 1994; Sennrich et al., 2016):

    • Iteratively replace most frequent pair of bytes with a new token.
    • Grows vocabulary by merging frequent patterns.
    • Frequent words/subwords → represented as single tokens.
  • Result:

    • Efficient compression of input sequences.
    • No OOV problem (all text can still be broken down into bytes).
    • Manageable input lengths.
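
As a first taste of the merge step, here is a minimal sketch that finds the most frequent adjacent byte pair in a toy byte sequence (a full training loop appears in the next section):

from collections import Counter

data = list("aaabdaaabac".encode("utf-8"))   # toy byte sequence
pair_counts = Counter(zip(data, data[1:]))   # count adjacent byte pairs
best = max(pair_counts, key=pair_counts.get)
print(best, pair_counts[best])
# (97, 97) 4 → the pair b"aa" occurs 4 times and would be replaced by a new token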

1.4 BPE Tokenizer Training #

Main Steps #

  1. Vocabulary Initialization

    • Start with 256 byte values (all possible bytes).
    • Add any special tokens (e.g., <|endoftext|>).
    • Initial vocabulary size = 256 + number of special tokens.
  2. Pre-tokenization

    • Goal: avoid merging across arbitrary text boundaries + make counting efficient.
    • Represent corpus as pre-tokens (substrings).
    • Example: "text" appears 10 times → count pair (t,e) = 10.
    • BPE implementations:
      • Original (Sennrich et al., 2016): split on whitespace.
      • GPT-2 (Radford et al., 2019): regex-based pre-tokenizer:
        PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
        
        Example (the \p{L}/\p{N} classes require the third-party regex module; the stdlib re module does not support them):
        import regex as re
        re.findall(PAT, "some text that i'll pre-tokenize")
        # ['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']
        
  3. Compute BPE Merges

    • Count frequency of adjacent byte pairs inside pre-tokens.
    • Select most frequent pair (A, B).
    • Merge into new token AB and add to vocab.
    • Repeat until desired vocab size is reached.
    • Tie-breaking rule: pick lexicographically greater pair.
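
These steps can be condensed into a short sketch. For readability it works on characters (matching the worked example below) rather than UTF-8 bytes, pre-tokenizes by splitting on whitespace, and omits special tokens; the function name train_bpe is just for illustration:

from collections import Counter

def train_bpe(text, num_merges):
    # Pre-tokenization: split on whitespace and count pre-token frequencies.
    word_counts = Counter(text.split())

    # Represent each pre-token as a tuple of symbols (characters here;
    # a byte-level BPE would start from the pre-token's UTF-8 bytes).
    words = {word: tuple(word) for word in word_counts}

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by pre-token frequency.
        pair_counts = Counter()
        for word, freq in word_counts.items():
            symbols = words[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Most frequent pair; ties broken by the lexicographically greater pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Apply the merge inside every pre-token.
        new_token = best[0] + best[1]
        for word, symbols in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[word] = tuple(merged)

    # The vocabulary would be the 256 byte values (or initial characters),
    # any special tokens, plus one new token per merge.
    return merges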

Example (Sennrich et al., 2016) #

Corpus: low low low low low lower lower widest widest widest newest newest newest newest newest newest

  • Pre-tokenization (split on whitespace):
    {low: 5, lower: 2, widest: 3, newest: 6}

  • Step 1: Count pairs
    {lo:7, ow:7, we:8, er:2, wi:3, id:3, de:3, es:9, st:9, ne:6, ew:6}

    • Most frequent pairs: ('e','s') and ('s','t'), both with count 9 → tie → ('s','t') is lexicographically greater, so merge it into 'st'.
    • Merge → (w,i,d,e,st), (n,e,w,e,st)
  • Step 2: Merge again

    • Now (e, st) most frequent → merge → (w,i,d,est), (n,e,w,est)
  • Continue merging → final merges:
    ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e', 'ne west', 'w i', 'wi d', 'wid est', 'low e', 'lowe r']

  • If we stop after 6 merges:

    • Learned merges: ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e']
    • Vocabulary = [<|endoftext|>, (256 bytes), st, est, ow, low, west, ne]
    • Example: "newest" → [ne, west]
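
Running the train_bpe sketch above on this corpus reproduces the first six merges:

corpus = (
    "low low low low low lower lower "
    "widest widest widest "
    "newest newest newest newest newest newest"
)

print(train_bpe(corpus, num_merges=6))
# [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow'), ('w', 'est'), ('n', 'e')]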