How GPT (LLM) Works: The Next Word Predictor
ChatGPT isn't magic; it's math. We explain the core mechanism of Generative Pre-trained Transformers (GPT) and how they predict the next word.
TLDR: At its core, GPT asks one question, repeated: "Given everything so far, what is the most likely next token?" Tokens are not words; they're subword units. The Transformer architecture uses self-attention to weigh how much each token should influence the prediction. Sampling strategies (temperature, top-p) control how creative or deterministic the output is.
The One Question GPT Asks a Trillion Times
Everything ChatGPT, Claude, and Gemini do comes down to a single repeated operation:
Given the sequence of tokens seen so far, predict a probability distribution over the next token.
Sample one token from that distribution. Append it to the sequence. Repeat.
That's it. By doing this billions of times during training (and millions of times during inference), the model learns to generate coherent paragraphs, working code, and useful answers.
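The predict-sample-append loop above can be sketched in a few lines of Python. The `predict_distribution` function below is a hypothetical stand-in for a trained model (a real GPT computes this distribution with its Transformer stack), but the cycle around it is the same.

```python
import random

# Toy stand-in for a trained model: returns a probability distribution
# over a tiny vocabulary given the tokens so far. A real GPT computes
# this with stacked Transformer blocks; here it is hard-coded.
VOCAB = ["The", " cat", " sat", " down", "."]

def predict_distribution(tokens):
    # Hypothetical rule: strongly favor the vocabulary entry that
    # follows the last token seen.
    last = VOCAB.index(tokens[-1]) if tokens else -1
    probs = [0.02] * len(VOCAB)
    probs[(last + 1) % len(VOCAB)] = 1.0 - 0.02 * (len(VOCAB) - 1)
    return probs

def generate(prompt_tokens, max_new_tokens=4, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_distribution(tokens)          # 1. predict
        token = rng.choices(VOCAB, weights=probs)[0]  # 2. sample
        tokens.append(token)                          # 3. append, repeat
    return "".join(tokens)

print(generate(["The"]))
```

Every name here (`VOCAB`, `predict_distribution`, `generate`) is illustrative; only the loop structure carries over to real models.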
Tokenization: Words Are Not the Input
GPT doesn't process words; it processes tokens, which are roughly 3-4 characters each.
"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable" → ["un", "bel", "iev", "able"]
"ChatGPT" → ["Chat", "G", "PT"]
Subword tokenization (BPE, Byte Pair Encoding) allows the model to:
- Handle any word, including rare or invented ones
- Represent a vocabulary of 100K tokens that covers millions of possible words
- Share representations between related words ("run", "running", "runner")
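As a rough illustration (not the actual BPE algorithm, which learns merge rules from corpus statistics), here is a greedy longest-match tokenizer over a hypothetical toy vocabulary:

```python
# Simplified greedy longest-match tokenizer over a toy subword vocabulary.
# Real BPE applies learned merge rules; this only illustrates how a fixed
# set of subword units can cover words it has never seen whole.
TOY_VOCAB = {"un", "bel", "iev", "able", "run", "ning", "ner"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i;
        # fall back to a single character if nothing matches.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'bel', 'iev', 'able']
print(tokenize("running"))       # ['run', 'ning']
```

The single-character fallback is why any string, including invented words, still tokenizes to something.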
Practical impact: GPT-4 Turbo has a 128K-token context window. 1 token ≈ 0.75 words, so that's roughly 100K words, about one full novel.
Predicting the Next Token: Logits, Softmax, and Sampling
After processing the input, GPT outputs a logit for every token in its vocabulary:
Raw logits (not probabilities):
"Paris" โ 8.3
"London" โ 6.1
"apple" โ -1.2
...
These are converted to probabilities via softmax:
$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
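Plugging the example logits above into softmax takes a few lines of plain Python:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # the resulting probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the example above: "Paris", "London", "apple"
probs = softmax([8.3, 6.1, -1.2])
print([round(p, 4) for p in probs])
```

"Paris" ends up with roughly 90% of the probability mass; "apple" gets almost none, but not exactly zero.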
Then one token is sampled from that distribution. How you sample controls the trade-off between coherence and creativity:
| Strategy | How it works | Result |
| --- | --- | --- |
| Greedy | Always pick the highest-probability token | Repetitive, safe, boring |
| Temperature < 1 | Sharpen the distribution | More confident, less diverse |
| Temperature > 1 | Flatten the distribution | More creative, less reliable |
| Top-k | Sample only from the k highest-probability tokens | Cuts off the long tail |
| Top-p (nucleus) | Sample from the smallest set of tokens whose probabilities sum to at least p | Dynamic top-k |
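A minimal sketch of greedy decoding, temperature scaling, and top-p filtering, reusing the three-token example above (all function names are illustrative helpers, not a library API):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    # Always take the single highest-logit token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_with_temperature(logits, temperature, rng):
    # Dividing logits by T before softmax rescales the distribution:
    # T < 1 sharpens it, T > 1 flattens it.
    probs = softmax([z / temperature for z in logits])
    return rng.choices(range(len(logits)), weights=probs)[0]

def top_p_filter(logits, p):
    # Keep the smallest set of tokens whose probabilities sum to >= p.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

logits = [8.3, 6.1, -1.2]          # "Paris", "London", "apple"
print(greedy(logits))              # 0 ("Paris")
print(top_p_filter(logits, 0.95))  # [0, 1]: "apple" is cut off
```

With p = 0.95, "Paris" alone (≈0.90) is not enough mass, so "London" is kept too, while the long tail ("apple") is dropped; that adaptiveness is why top-p behaves like a dynamic top-k.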
The Transformer Under the Hood: Attention in Plain Language
GPT is built from stacked Transformer decoder blocks. Each block runs multi-head self-attention.
Self-attention answers: "For each position in the sequence, how much should I attend to every other position?"
Example: "The animal didn't cross the street because it was too tired."
- Self-attention figures out that "it" refers to "animal," not "street."
- It does this by learning attention weights between all token pairs.
```mermaid
flowchart LR
    Token1[The] --> Attn[Self-Attention Layer]
    Token2[animal] --> Attn
    Token3[didn't] --> Attn
    Token4[cross] --> Attn
    Token5[it] --> Attn
    Attn --> Rep[Contextual representation of each token]
    Rep --> FF[Feed-Forward Layer]
    FF --> Next[Next-token distribution]
```
The attention formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- $Q$ (Query): "What am I looking for?"
- $K$ (Key): "What do I offer?"
- $V$ (Value): "What information do I contain?"
- $\sqrt{d_k}$: scaling factor to prevent softmax saturation for large dimensions
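A pure-Python sketch of this formula for a handful of tokens (omitting GPT's causal mask, multiple heads, and the learned projection matrices that produce Q, K, and V):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Q, K, V: lists of d_k-dimensional row vectors, one per token.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # how much to attend to each position
        # Output is the attention-weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens, d_k = 2. The first query aligns with the second key,
# so its output leans toward the second value vector, and vice versa.
Q = [[0.0, 1.0], [1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Because the weights come from a softmax, each output row is a convex combination of the value vectors: every token's new representation is a blend of information from the whole context.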
GPT's Known Limitations
| Limitation | What it means in practice |
| --- | --- |
| Knowledge cutoff | Cannot know events that happened after the training data ended |
| Context window | Cannot "remember" earlier in a conversation once the window is full |
| Hallucination | Generates plausible-sounding but factually wrong content |
| No true reasoning | Pattern matching at scale, not symbolic logic |
| Token budget | Long inputs are expensive, in both cost ($) and latency |
Key Takeaways
- GPT = autoregressive next-token prediction, repeated at inference time.
- Input is tokenized into subwords (BPE); tokens ≈ 0.75 words.
- Raw logits โ softmax โ probability distribution โ sample a token.
- Self-attention allows GPT to model relationships between any two tokens in the context window.
- Key limitations: knowledge cutoff, context window ceiling, hallucination, and no symbolic reasoning.
Test Your Understanding
- GPT outputs a logit of 8.3 for "Paris" and 6.1 for "London." Which is more probable after softmax? Why isn't it just "pick 8.3"?
- What would temperature=0 mean operationally?
- In the self-attention formula, what do Q, K, and V represent in intuitive terms?
- Why doesn't GPT know about an event from last week?
Written by
Abstract Algorithms
@abstractalgorithms