Tokenization Explained: How LLMs Understand Text

TLDR: Computers don't read words; they read numbers. Tokenization is the process of breaking text down into smaller pieces (tokens) and converting them into numerical IDs that a Large Language Model can process. It's the foundational first step for any NLP task.
1. What is Tokenization? (The "No-Jargon" Explanation)
Imagine you are a Chef preparing a meal. You don't throw a whole carrot into the pot. You first chop it into smaller, manageable pieces.
- The Text: The raw ingredient (the carrot).
- The Tokenizer: The knife.
- The Tokens: The small, chopped pieces.
A tokenizer is a "knife" for text. It breaks down sentences and words into smaller units that the AI can understand. The model doesn't see "Hello World"; it sees [15496, 2159].
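You can see this yourself with a tokenizer library. The sketch below uses the tiktoken package and its "gpt2" encoding (an assumption; the article isn't tied to any particular tokenizer) to turn text into IDs and back:

```python
# A minimal sketch using the tiktoken library (an assumption; the article
# doesn't name a specific tokenizer). "gpt2" is the BPE vocabulary used by GPT-2.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Hello World")
print(ids)              # e.g. [15496, 2159] -- one numerical ID per token
print(enc.decode(ids))  # "Hello World" -- decoding reverses the mapping
```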
2. Tokenization Strategies: The Good, The Bad, and The Ugly
There are three main ways to chop up text.
A. Word-Based Tokenization
- How: Split text on spaces (and punctuation). "Hello world!" -> ["Hello", "world", "!"]
- Problem: What about "don't"? Is it one word or two? What about rare words like "epistemology"? The vocabulary would be enormous.
B. Character-Based Tokenization
- How: Split text into individual characters.
"Hello" -> ['H', 'e', 'l', 'l', 'o']
- Problem: The meaning is lost. "H" has no semantic value on its own. The model has to work much harder to learn what words mean.
C. Subword-Based Tokenization (The Winner)
- How: A hybrid approach. Common words are single tokens ("the", "a"). Rare words are broken into meaningful sub-parts ("unbelievably" -> ["un", "believe", "ably"]).
- Benefit: It balances vocabulary size and meaning. It can represent any word, even ones it has never seen before.
- Algorithm: Byte-Pair Encoding (BPE) is the most common.
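To make the three strategies concrete, here is a rough side-by-side. The word and character splits are plain Python; the subword split is hand-written to mirror the "unbelievably" example above, not the output of any specific tokenizer:

```python
# Word-based: a naive split on whitespace. Punctuation stays attached
# ("world!"), which is one reason real word tokenizers also split on it.
print("Hello world!".split())       # ['Hello', 'world!']

# Character-based: every character becomes its own token.
print(list("Hello"))                # ['H', 'e', 'l', 'l', 'o']

# Subword-based: rare words break into meaningful pieces. This split is
# illustrative, not produced by a real tokenizer.
print(["un", "believe", "ably"])    # ['un', 'believe', 'ably']
```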
3. Deep Dive: How Byte-Pair Encoding (BPE) Works
BPE is a simple but powerful algorithm that learns the most efficient way to "chop" text.
The Goal: Start with characters and merge the most frequent pairs until you reach a desired vocabulary size.
Toy Example: Learning a Vocabulary
Let's say our entire training data is:
"hug"(5 times)"pug"(3 times)"pun"(4 times)"bun"(2 times)
Step 1: Initial Vocabulary (Characters)
Our starting vocabulary is ['b', 'g', 'h', 'n', 'p', 'u'].
Step 2: Find the Most Frequent Pair
The pair "ug" appears 5 + 3 = 8 times. This is the most common pair.
Step 3: Create a Merge Rule
We create a new token "ug" and add it to our vocabulary.
- New Vocabulary:
['b', 'g', 'h', 'n', 'p', 'u', 'ug']
Step 4: Repeat
Now, the most frequent pair is "un", which appears 4 + 2 = 6 times (in "pun" and "bun").
- New Vocabulary:
['b', 'g', 'h', 'n', 'p', 'u', 'ug', 'un']
The Result:
After training, when we see a new word like "bug", the tokenizer knows to split it into ["b", "ug"]. It can handle a new word without having seen it before!
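The whole walkthrough fits in a few lines of Python. This is a minimal sketch of the training loop on the toy corpus above; real BPE implementations add pre-tokenization, byte-level fallback, and special tokens, but the core idea is the same:

```python
from collections import Counter

# Toy training data from the example: word -> how often it appears.
corpus = {"hug": 5, "pug": 3, "pun": 4, "bun": 2}

# Step 1: start with every word split into characters.
splits = {word: list(word) for word in corpus}
merges = []

def most_frequent_pair(splits, corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, symbols in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += corpus[word]
    return pairs.most_common(1)[0][0]

def apply_merge(symbols, pair):
    """Replace every occurrence of the pair with the merged symbol."""
    i = 0
    while i < len(symbols) - 1:
        if (symbols[i], symbols[i + 1]) == pair:
            symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
        else:
            i += 1
    return symbols

# Steps 2-4: repeatedly find the best pair and merge it (two merges here,
# matching the walkthrough; real vocabularies use tens of thousands).
for _ in range(2):
    pair = most_frequent_pair(splits, corpus)
    merges.append(pair)
    for word in splits:
        apply_merge(splits[word], pair)

print(merges)  # [('u', 'g'), ('u', 'n')]

# The Result: tokenize an unseen word by replaying the learned merges in order.
def tokenize(word):
    symbols = list(word)
    for pair in merges:
        apply_merge(symbols, pair)
    return symbols

print(tokenize("bug"))  # ['b', 'ug']
```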
4. Real-World Application: Why Tokens Matter
Tokenization isn't just an academic detail; it has huge practical implications.
A. Context Window Limits
- The Problem: An LLM like Llama 3 has a context window of 8,192 tokens. This is not 8,192 words.
- Example: The phrase "Tokenization is unbelievably important" is only 4 words, but it could be 7 tokens: ["Token", "ization", " is", " un", "believe", "ably", " important"].
- Impact: Complex words use up your context window faster.
B. API Costs
- The Problem: OpenAI, Anthropic, and Google all charge you per token, not per word.
- Example: Sending a long, technical document with many rare words will cost more than sending a simple story of the same word count.
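A back-of-the-envelope estimate makes the point. The price below is purely illustrative, not any provider's actual rate, and the document is a stand-in string:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Stand-in for a long technical document; swap in your own text.
document = "Tokenization is unbelievably important for retrieval pipelines. " * 500

price_per_1k_tokens = 0.01  # illustrative price in dollars, not a real rate
n_tokens = len(enc.encode(document))

print(f"{n_tokens} tokens -> ${n_tokens / 1000 * price_per_1k_tokens:.2f}")
```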
C. Multilingual Challenges
- The Problem: A tokenizer trained mostly on English will be inefficient for other languages.
- Example: A Japanese character might be broken into many meaningless byte tokens, making the model perform poorly on Japanese text. This is why multilingual models use much larger vocabularies.
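The imbalance is easy to see with the same library. The exact counts depend entirely on the tokenizer's training data; the point is only that an English-heavy vocabulary spends far more tokens per character on Japanese:

```python
import tiktoken

# An English-heavy BPE vocabulary (GPT-2's) handles English compactly but
# falls back to many byte-level tokens for Japanese. Counts are illustrative.
enc = tiktoken.get_encoding("gpt2")

print(len(enc.encode("Hello")))       # typically 1 token for 5 characters
print(len(enc.encode("こんにちは")))    # typically several tokens for 5 characters
```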
Summary & Key Takeaways
- Tokenization: The process of breaking text into numerical IDs for an LLM.
- Subword Tokenization (BPE): The standard method. It balances vocabulary size and the ability to represent any word.
- Practical Impact: Tokens determine your API costs and how much text you can fit into a model's context window.
- Rule of Thumb: 1 word ≈ 1.3 tokens in English.
Practice Quiz: Test Your Knowledge
Scenario: Why is subword tokenization (like BPE) generally preferred over word-based tokenization?
- A) It is faster to train.
- B) It can handle rare or made-up words without failing.
- C) It always results in fewer tokens.
Scenario: You are using an LLM API that costs $0.01 per 1,000 tokens. You send a 1,000-word essay. What is the most likely cost?
- A) Exactly $0.01.
- B) Less than $0.01.
- C) More than $0.01.
Scenario: The word "antidisestablishmentarianism" is not in the tokenizer's vocabulary. What will a BPE tokenizer likely do?
- A) Return an "Unknown Word" error.
- B) Break it into smaller, known subwords like ["anti", "dis", "establish", ...].
- C) Ignore the word completely.
(Answers: 1-B, 2-C, 3-B)

