8 Tokenization
Choosing an Atomic Vocabulary
Intuition
Before a transformer can process text, it must convert that text into a sequence of discrete symbols drawn from a fixed vocabulary. These symbols are called tokens, and the process of converting text to tokens is tokenization.
8.1 Why We Need Tokens
Transformers as currently designed require a rigidly fixed vocabulary. Changing the vocabulary would change the shape of the embedding and unembedding matrices, which are core to the model's structure. Humans have a dynamic, fuzzy vocabulary: words fall in and out of use, subcultures invent new ones, language evolves. Transformers cannot do this mid-inference.
So we must choose a fixed vocabulary before training and stick with it.
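To make the constraint concrete, here is a minimal PyTorch sketch (not from diy_llm; the vocabulary size and embedding width are illustrative) showing how both matrices are shaped by the vocabulary size the moment they are created:

```python
import torch.nn as nn

vocab_size = 50_000  # chosen once, before training (illustrative number)
d_model = 512        # embedding width (illustrative number)

embedding = nn.Embedding(vocab_size, d_model)  # weight shape: (vocab_size, d_model)
unembedding = nn.Linear(d_model, vocab_size)   # maps hidden states to vocab logits

# Adding or removing even one token changes both shapes, so the vocabulary
# must be frozen before these weights exist.
```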
8.2 What Makes a Good Tokenization?
Two competing pressures shape the choice:
- Compression: We want to represent in-distribution text using as few tokens as possible. Every token requires an expensive forward pass through the model. A tokenizer that represents "running" as a single token is more efficient than one that splits it into r, u, n, n, i, n, g.
- Coverage: The vocabulary must cover the full range of text the model will encounter, including rare words, proper nouns, code, and multilingual content.
8.3 Natural Choices for Tokens
Words are the most intuitive unit, but word vocabularies are dynamic and incomplete. They also tend to be huge (hundreds of thousands of distinct words in English alone) before accounting for morphological variants.
Characters are the smallest complete basis: any text can be represented with a character vocabulary. But the compression ratio is 1.0 (the worst possible) and every word becomes a long sequence.
Byte Pair Encoding (BPE) finds a pragmatic middle ground. Starting from a character vocabulary, it repeatedly finds the most common adjacent pair of tokens and merges them into a new single token. After many merges, common words and word fragments emerge as tokens, while rare strings fall back to characters. The result: a vocabulary of user-defined size that compresses common text efficiently while remaining complete.
BPE learns a vocabulary by counting frequency; it is entirely data-driven. Train a BPE tokenizer on English and you'll get English words. Train it on code and you'll get common identifiers and syntax. The same algorithm, different outputs.
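The merge loop itself is short enough to sketch in full. The following is a minimal, illustrative implementation of the greedy merge procedure described above; it is not the code in diy_llm/tokenizer.py, and it ignores practical details such as pre-splitting on whitespace and byte-level fallback:

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent pair of tokens."""
    words = [list(word) for word in corpus]  # start from a character vocabulary
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[a, b] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the new single token.
        for word in words:
            i = 0
            while i < len(word) - 1:
                if (word[i], word[i + 1]) == best:
                    word[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# On a toy corpus, frequent fragments emerge quickly:
print(learn_bpe_merges(["low", "lower", "lowest"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```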
The Math
8.4 Compression Ratio
For a tokenizer \(T\) and a corpus \(C\), the compression ratio measures how efficiently the tokenizer represents the text:
\[ \text{CR}(T, C) = \frac{|\text{characters in } C|}{|\text{tokens in } T(C)|} \]
A higher ratio means fewer tokens per character, and therefore a more efficient tokenizer. A character-level tokenizer achieves CR = 1.0. A good BPE tokenizer on in-distribution text achieves CR ≈ 3–5.
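A direct way to check the definition is to compute it on a small example. The token splits below are illustrative, not the output of any particular tokenizer:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token: higher means more text packed into each token."""
    return len(text) / len(tokens)

text = "running quickly"                    # 15 characters
char_level = list(text)                     # 15 tokens, one per character
bpe_like = ["running", " quickly"]          # hypothetical 2-token split

print(compression_ratio(text, char_level))  # 1.0 (worst possible)
print(compression_ratio(text, bpe_like))    # 7.5
```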
8.5 Vocabulary Overlap (Jaccard Similarity)
To compare two tokenizer vocabularies \(V_1\) and \(V_2\):
\[ J(V_1, V_2) = \frac{|V_1 \cap V_2|}{|V_1 \cup V_2|} \]
A value near 1 means the vocabularies are nearly identical; near 0 means they share little.
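The set operations translate directly into Python; the toy vocabularies here are made up for illustration:

```python
def jaccard(v1: set[str], v2: set[str]) -> float:
    """Vocabulary overlap: 1.0 means identical vocabularies, 0.0 means disjoint."""
    return len(v1 & v2) / len(v1 | v2)

english_vocab = {"the", "ing", "run", "tion"}
code_vocab = {"def", "self", "ing", "tion"}

print(jaccard(english_vocab, code_vocab))  # 2 shared / 6 total ≈ 0.33
```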
The Code
The BPE tokenizer is implemented in diy_llm/tokenizer.py. The core components are:
# From diy_llm/tokenizer.py
# BPETokenizer.train()  - learn merge rules from a corpus
# BPETokenizer.encode() - apply merge rules to tokenize text
# BPETokenizer.decode() - convert token IDs back to text
Explore
8.6 Compression Ratio Comparison
Computed figures comparing compression ratios across tokenizers are in progress.
8.7 Tokenizer Playground
Select a tokenizer, enter any text, and see how it gets split into tokens.
The interactive tokenizer playground is in progress.