Tokenization and Vocabulary
Tokenization is the process of converting raw text into the numerical tokens that LLMs actually process. Understanding how tokenization works — and where it breaks down — is essential for writing effective prompts, estimating costs, and debugging unexpected model behavior.
What Is a Token?
LLMs don't operate on characters or words — they operate on tokens, which are chunks of text learned from the training data. Common words become single tokens; rare words are split into multiple subword tokens.
Examples (approximate, using GPT tokenization):
| Text | Tokens | Count |
|---|---|---|
| hello | hello | 1 |
| transformer | transform + er | 2 |
| tokenization | token + ization | 2 |
| supercalifragilistic | super + cal + if + rag + ilis + tic | 6 |
| 😀 | <0xF0> + <0x9F> + ... | 3-4 |
As a rough rule: 1 token ≈ 4 characters ≈ 0.75 words in English. Non-English text, code, and special characters often use significantly more tokens per word.
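The rule of thumb can be turned into a quick, tokenizer-free estimator. This is only a sketch for English prose — the 4-characters-per-token ratio is a heuristic, and accurate counts require running the actual tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Estimate token count without a tokenizer: ~4 characters per token (English prose only)."""
    return max(1, round(len(text) / 4))

# The same heuristic from the word side: ~0.75 words per token,
# so a 10,000-word document is roughly 13,000 tokens.
words = 10_000
print(round(words / 0.75))  # → 13333
```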
BPE: How Tokens Are Learned
The dominant tokenization algorithm is Byte Pair Encoding (BPE). It works as follows:
- Start with a vocabulary of all 256 possible byte values
- Count the frequency of every adjacent token pair in the training corpus
- Merge the most frequent pair into a new vocabulary item
- Repeat until the target vocabulary size is reached (typically 32K–128K tokens)
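The merge loop above can be sketched on a toy corpus. This is a minimal illustration only — it operates on characters rather than raw bytes and skips the efficiency tricks real tokenizers use:

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    tokens = list(corpus)  # start from individual symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-tokenize the corpus, fusing every occurrence of the best pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(bpe_merges("low low low lower lowest", 2))  # → [('l', 'o'), ('lo', 'w')]
```

After just two merges, "low" — the most common word in this toy corpus — has become a single token, which is exactly the behavior that gives frequent words their own vocabulary entries.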
This means the tokenizer learns which subwords are common enough to deserve their own token based on the training data. Code, medical text, and other domain-specific content may tokenize poorly if underrepresented in training.
Visualizing Tokenization in Python
# Install: pip install tiktoken
import tiktoken
# Load the tokenizer for GPT-4o
enc = tiktoken.get_encoding("o200k_base")
text = "The attention mechanism in transformers is surprisingly elegant."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")
Output (abridged):
Token count: 9
Decoded tokens: ['The', ' attention', ' mechanism', ' in', ' transformers', ' is', ' surprisingly', ' elegant', '.']
Why Tokenization Matters for Prompting
Cost estimation: API pricing is per token. gpt-4o costs ~$5 per million input tokens. A 10,000-word document ≈ 13,000 tokens ≈ $0.065 to process. Tokenization lets you estimate this precisely.
import tiktoken

def estimate_cost(text: str, price_per_million: float = 5.0) -> tuple[float, int]:
    """Return (estimated cost in dollars, token count) for the given text."""
    enc = tiktoken.get_encoding("o200k_base")
    token_count = len(enc.encode(text))
    cost = (token_count / 1_000_000) * price_per_million
    return cost, token_count
Surprising behaviors: Models sometimes struggle with character-level tasks (counting letters, reversing words) because they never see individual characters — only tokens. "How many r's in 'strawberry'?" is surprisingly hard because 'strawberry' may be a single token.
Token boundaries affect generation: The model generates one token at a time. A continuation that requires breaking a token boundary differently than the model expects can produce artifacts.
Context Window in Tokens
The model's context window — the maximum amount of text it can process at once — is measured in tokens:
| Model | Context window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
Knowing how to count tokens lets you stay within limits and manage costs when building systems that process large documents.
Code Tokenization
Code tokenizes differently from prose:
# This short Python snippet uses more tokens than you might expect
code = """
def fibonacci(n: int) -> list[int]:
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result
"""
enc = tiktoken.get_encoding("o200k_base")
print(f"Code tokens: {len(enc.encode(code))}")
# Output: ~55 tokens for ~120 characters
Indentation, special characters, and variable names all contribute to higher token counts in code. This matters when building coding agents that process large codebases.