How GPT (LLM) Works: The Next Word Predictor
ChatGPT isn't magic; it's math. We explain the core mechanism of Generative Pre-trained Transformers (GPT) and how they predict the next word.
TLDR: At its core, GPT asks one question, repeated: "Given everything so far, what is the most likely next token?" Tokens are not words; they're subword units. The Transformer architecture uses self-attention to weigh how much each token should influence the prediction. Sampling strategies (temperature, top-p) control how creative or deterministic the output is.
The One Question GPT Asks a Trillion Times
Everything ChatGPT, Claude, and Gemini do comes down to a single repeated operation:
Given the sequence of tokens seen so far, predict a probability distribution over the next token.
Sample one token from that distribution. Append it to the sequence. Repeat.
That's it. By doing this billions of times during training (and millions of times during inference), the model learns to generate coherent paragraphs, working code, and useful answers.
The Basics: How GPT Generates Text
GPT (Generative Pre-trained Transformer) is an autoregressive language model. "Autoregressive" means it generates output one token at a time, each new token conditioned on all the tokens that came before it. This is fundamentally different from how humans write (where you might plan a whole sentence), but it turns out to be enormously powerful at scale.
Training vs. Inference:
During training, GPT reads enormous amounts of text (hundreds of billions of tokens from books, websites, and code). It learns to predict the next token at every position in every sentence. The weights of the network are updated via gradient descent to make these predictions more accurate over time.
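The training signal itself is simple to state: at each position, the model pays a penalty of -log P(true next token). A toy illustration with a made-up distribution (not from any real model):

```python
import math

def next_token_cross_entropy(predicted_probs, true_token):
    """Loss for one position: the negative log-probability assigned to the true next token."""
    return -math.log(predicted_probs[true_token])

# Toy distribution over a 4-token vocabulary after seeing "the cat sat on the".
probs = {"mat": 0.7, "dog": 0.1, "the": 0.1, "ran": 0.1}

loss_good = next_token_cross_entropy(probs, "mat")  # confident and correct: low loss
loss_bad = next_token_cross_entropy(probs, "dog")   # true token only got 0.1: high loss
print(round(loss_good, 3), round(loss_bad, 3))
```

Gradient descent nudges the weights so that, averaged over billions of positions, this penalty shrinks; that is the entire training objective.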
During inference (when you actually use the model), the process flips: you provide a prompt, and GPT autoregressively extends it. At each step, it:
- Encodes the current token sequence into contextual representations
- Computes a probability distribution over the next token
- Samples one token from that distribution
- Appends the token to the sequence
- Repeats until it hits a stop condition or reaches the max length
This simple loop (predict → sample → append → repeat) is the entire mechanism behind every ChatGPT conversation, GitHub Copilot suggestion, and GPT-4-powered assistant you have ever used.
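The loop can be sketched in a few lines of Python. The bigram table here is a hypothetical stand-in for the real model, which would run a full Transformer forward pass to produce the distribution:

```python
import random

# Hypothetical bigram "model": maps the last token to candidate next tokens with probabilities.
BIGRAMS = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "sat": [("down", 1.0)],
}

def generate(prompt_tokens, max_new_tokens=10, stop_token="down"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = BIGRAMS.get(tokens[-1])
        if candidates is None:                                # stop condition: no known continuation
            break
        words, weights = zip(*candidates)
        next_token = random.choices(words, weights=weights)[0]  # sample from the distribution
        tokens.append(next_token)                             # append and repeat
        if next_token == stop_token:
            break
    return tokens

random.seed(0)
print(generate(["the"]))
```

Everything GPT adds is a better lookup: a learned function from the entire context, not just the last token, to a distribution over ~100K tokens.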
Why "Pre-trained"? The "P" in GPT stands for pre-trained on a massive general corpus. This base model can then be fine-tuned (RLHF, instruction tuning) to be a better assistant. The core architecture and inference loop remain the same.
Tokenization: Words Are Not the Input
GPT doesn't process words; it processes tokens, which for English text are roughly 3-4 characters each.
"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable" → ["un", "bel", "iev", "able"]
"ChatGPT" → ["Chat", "G", "PT"]
Subword tokenization (BPE, Byte Pair Encoding) allows the model to:
- Handle any word, including rare or invented ones
- Represent a vocabulary of 100K tokens that covers millions of possible words
- Share representations between related words ("run", "running", "runner")
Practical impact: GPT-4 has a 128K-token context window. At 1 token ≈ 0.75 words, that's roughly 100K words, about one full novel.
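One training round of BPE can be sketched directly: count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol. A toy illustration (real tokenizers like GPT's start from bytes and apply tens of thousands of learned merges):

```python
from collections import Counter

def most_frequent_pair(words):
    """words: list of symbol tuples, e.g. ('r', 'u', 'n', 'n', 'i', 'n', 'g')."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # two symbols become one
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("run"), tuple("running"), tuple("runner")]
pair = most_frequent_pair(corpus)   # one of the most frequent pairs, e.g. ('r', 'u')
print(pair, merge_pair(corpus, pair))
```

Repeating this merge step thousands of times is what produces shared pieces like "run" across "run", "running", and "runner".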
Predicting the Next Token: Logits, Softmax, and Sampling
After processing the input, GPT outputs a logit for every token in its vocabulary.
Raw logits (not probabilities):

```
"Paris"  →  8.3
"London" →  6.1
"apple"  → -1.2
...
```
These are converted to probabilities via softmax:
$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
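Plugging the example logits above into this formula (restricting the vocabulary to just three tokens for illustration):

```python
import math

logits = {"Paris": 8.3, "London": 6.1, "apple": -1.2}

total = sum(math.exp(z) for z in logits.values())
probs = {tok: math.exp(z) / total for tok, z in logits.items()}

print({tok: round(p, 3) for tok, p in probs.items()})
# → {'Paris': 0.9, 'London': 0.1, 'apple': 0.0}
```

The 2.2-point logit gap between "Paris" and "London" becomes roughly a 9× probability ratio (e^2.2 ≈ 9): softmax is exponential, so modest logit differences produce lopsided, but not certain, outcomes.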
Then one token is sampled from that distribution. How you sample controls the trade-off between coherence and creativity:
| Strategy | How it works | Result |
|---|---|---|
| Greedy | Always pick the highest-probability token | Repetitive, safe, boring |
| Temperature < 1 | Sharpen the distribution | More confident, less diverse |
| Temperature > 1 | Flatten the distribution | More creative, less reliable |
| Top-k | Sample only from the k highest-probability tokens | Cuts off the long tail |
| Top-p (nucleus) | Sample from the smallest set of tokens whose probabilities sum to p | Dynamic top-k |
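Each strategy in the table is only a few lines over a probability list. A minimal pure-Python sketch (the four logits are the earlier toy values, with an invented "Rome" added):

```python
import math, random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}    # renormalize the survivors

logits = [8.3, 6.1, 2.0, -1.2]                    # "Paris", "London", "Rome", "apple"

greedy = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax(logits, temperature=0.5)          # sharper than temperature=1
flat = softmax(logits, temperature=2.0)           # flatter: the long tail gets more mass
nucleus = top_p_filter(softmax(logits), p=0.95)

random.seed(0)
ids, weights = zip(*nucleus.items())
sampled = random.choices(ids, weights=weights)[0]
print(greedy, sampled, round(sharp[0], 3), round(flat[0], 3))
```

Note how the top token's probability moves with temperature: sharpening pushes it toward 1 (greedy-like), flattening pulls it down and spreads mass over the alternatives.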
Token Sampling: Logits to Next Token

```mermaid
sequenceDiagram
    participant TF as Transformer Stack
    participant LG as Logit Layer
    participant SM as Softmax
    participant SA as Sampling Strategy
    participant O as Output Token
    TF->>LG: final hidden state (d_model dims)
    LG->>SM: raw logits (one per vocab token)
    SM->>SA: probability distribution
    SA->>SA: apply temperature / top-p
    SA->>O: sample next token
    O-->>TF: append to context, repeat
```
This diagram shows the five-stage pipeline that converts a transformer's final hidden state into the next generated token. The logit layer projects the final hidden state into raw, unnormalized logits, one per token in the vocabulary; softmax normalizes these into a proper probability distribution; and the sampling strategy (greedy, temperature-scaled, or nucleus/top-p) selects the actual output token. That selected token is appended to the context and fed back as input, repeating the loop until a stop condition is met.
The GPT Inference Pipeline
To see how all the pieces connect, here is the full flow from prompt to generated token:
```mermaid
flowchart TD
    A[User Prompt] --> B[Tokenizer]
    B --> C["Token Embeddings + Positional Encoding"]
    C --> D["Transformer Layer 1 (Self-Attention + FFN)"]
    D --> E["Transformer Layer 2 (Self-Attention + FFN)"]
    E --> F["... N Transformer Layers ..."]
    F --> G[Final Hidden State]
    G --> H["Linear Projection to Logits (one per vocabulary token)"]
    H --> I[Softmax to Probabilities]
    I --> J["Sampling Strategy (greedy / temperature / top-p)"]
    J --> K[Next Token Selected]
    K --> L{Stop condition?}
    L -- No --> M[Append token to context]
    M --> B
    L -- Yes --> N[Final Output Text]
```
Each step in this pipeline is deterministic except the sampling step, where randomness (controlled by temperature) is introduced. Lowering temperature toward zero makes the model increasingly deterministic; raising it above 1.0 introduces more randomness and variety.
The loop between tokenizer and sampling is the engine of generation. A model with 96 Transformer layers (like GPT-3) runs the full stack, all 96 layers of self-attention and feed-forward computation, for every single token it generates. For a 500-token response, that means 96 × 500 = 48,000 passes through Transformer blocks.
Real-World Applications of GPT-Style LLMs
GPT-style large language models power a remarkable breadth of products across industries. Understanding the underlying mechanism helps you appreciate why they excel at some tasks and fail at others.
**Code Generation and Completion.** Tools like GitHub Copilot, Cursor, and Amazon CodeWhisperer use fine-tuned LLMs to autocomplete code, generate functions from comments, explain snippets, and write tests. Because code has regular structure and the training corpus contains billions of lines from GitHub, these models develop a strong prior over syntax and common patterns.
**Writing Assistance and Content Generation.** GPT models are used in marketing tools (Jasper, Copy.ai), document editors (Notion AI, Microsoft Copilot in Word), and email assistants (Gmail Smart Compose). The autoregressive nature makes them excellent at extending partial text in a stylistically consistent way.
**Question Answering and Search Augmentation.** Microsoft Bing AI and Google AI Overviews use LLMs to synthesize answers from retrieved documents. The LLM's job is to read context (provided via retrieval) and produce a coherent natural-language answer, a task perfectly suited to the next-token prediction paradigm.
**Summarization and Information Extraction.** LLMs can distill long documents into concise summaries or extract structured information (names, dates, clauses from contracts). Tools like Claude and GPT-4 are widely used in legal, financial, and medical fields for this purpose.
**Translation and Multilingual Tasks.** Because pre-training corpora span many languages, GPT-style models learn cross-lingual representations. They can translate, code-switch between languages in the same output, and even handle low-resource languages better than dedicated translation models from a few years ago.
**Conversational AI and Customer Support.** Chatbots built on GPT handle first-tier support, route tickets, and answer FAQs. The model's ability to maintain context across a multi-turn conversation (within its context window) makes it suitable for interactive workflows.
The key insight: all of these applications reduce to the same underlying operation of autoregressive token prediction on a shared pre-trained foundation, with task-specific fine-tuning or prompting on top.
Practical: Using GPT via API
The OpenAI API exposes GPT models over a simple HTTP interface. Here is a minimal Python example using the openai library:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how softmax turns logits into probabilities."},
    ],
    temperature=0.7,  # 0 → near-deterministic; higher → more random
    max_tokens=512,   # cap the response length
    top_p=0.95,       # nucleus sampling threshold
)

print(response.choices[0].message.content)
```
Parameter breakdown:
| Parameter | Effect |
|---|---|
| `temperature` | Controls randomness. 0 → greedy (always pick the top token). >1 → more random output. Typical range: 0.2-0.9. |
| `max_tokens` | Hard upper limit on tokens generated. Prevents runaway output and controls cost. |
| `top_p` | Nucleus sampling. 0.95 means the model samples only from the top tokens that together cover 95% of the probability mass. |
| `model` | Which GPT checkpoint to use (`gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`, etc.). Larger models cost more but reason better. |
Best practices:
- Use a low `temperature` (0.0-0.3) for factual tasks (summarization, extraction, classification).
- Use a higher `temperature` (0.7-1.0) for creative tasks (brainstorming, story writing).
- Always set `max_tokens` explicitly to avoid unexpected cost spikes.
- Pass a `system` message to give the model a persona or constraint set.
The `messages` array is how you build multi-turn context. The model doesn't have memory: you must resend the entire conversation history on each call. This is why context window size matters so much in practice.
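Since the API is stateless, multi-turn chat is just list bookkeeping on the client side. A sketch with no network calls; the `reply` argument stands in for whatever the API would return:

```python
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text, reply):
    """Append the user turn, then the assistant's reply.
    In real code, `reply` would come from client.chat.completions.create(messages=history),
    which receives the ENTIRE history on every call."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What is a token?", "A subword unit the model reads and writes.")
ask("Give an example.", '"unbelievable" might split into "un", "bel", "iev", "able".')

print(len(history))  # system + 2 user + 2 assistant = 5 messages
```

Every turn makes the next request longer, which is exactly why long conversations eventually hit the context window ceiling.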
Deep Dive: The Transformer Under the Hood (Attention in Plain Language)
GPT is built from stacked Transformer decoder blocks. Each block runs multi-head self-attention.
Self-attention answers: "For each position in the sequence, how much should I attend to every other position?"
Example: "The animal didn't cross the street because it was too tired."
- Self-attention figures out that "it" refers to "animal," not "street."
- It does this by learning attention weights between all token pairs.
```mermaid
flowchart LR
    Token1[The] --> Attn[Self-Attention Layer]
    Token2[animal] --> Attn
    Token3["didn't"] --> Attn
    Token4[cross] --> Attn
    Token5[it] --> Attn
    Attn --> Rep[Contextual representation of each token]
    Rep --> FF[Feed-Forward Layer]
    FF --> Next[Next-token distribution]
```
This diagram shows how all five input tokens flow into the Self-Attention layer simultaneously, which computes contextual representations for each token based on every other token in the sequence. The key insight is that after self-attention, the representation of "it" has been enriched with information from "animal": the model has learned where to look. The Feed-Forward layer then applies a non-linear transformation per token before the combined output is used to predict the next-token distribution.
GPT Decoder Block: Causal Attention Flow

```mermaid
flowchart LR
    IN["Input Tokens T1, T2, ... Tn"] --> ME["Token + Positional Embedding"]
    ME --> MSA["Masked Self-Attention (causal: no future tokens)"]
    MSA --> AN1["Add & Norm"]
    AN1 --> FFN["Feed-Forward Network (GELU activation)"]
    FFN --> AN2["Add & Norm"]
    AN2 --> OUT["Output Hidden State → next Decoder block"]
```

This diagram shows one complete Transformer decoder block, the fundamental repeating unit stacked N times in GPT. The causal masking in the Masked Self-Attention layer is what makes GPT a decoder: each token can only attend to tokens that came before it, never to future tokens. The Add & Norm layers stabilize gradient flow as information passes through dozens of stacked blocks, and the Feed-Forward Network applies a non-linear transformation that gives the model its capacity to encode complex patterns. Every generated token passes through this entire structure once per decode step.
The attention formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- $Q$ (Query): "What am I looking for?"
- $K$ (Key): "What do I offer?"
- $V$ (Value): "What information do I contain?"
- $\sqrt{d_k}$: scaling factor to prevent softmax saturation for large dimensions
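The formula, plus the causal mask from the decoder block above, fits in a few lines of NumPy (random matrices stand in for the learned Q/K/V projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k) + mask) V, with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # [seq, seq] similarity matrix
    mask = np.triu(np.ones_like(scores), k=1) * -1e9    # upper triangle = future tokens
    masked = scores + mask
    masked -= masked.max(axis=-1, keepdims=True)        # stabilize before exp
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                                     # 5 tokens, 8-dim head
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, w = causal_attention(Q, K, V)

print(out.shape)           # one contextual vector per token
print(np.round(w[2], 3))   # row 3: attends only to tokens 1-3
```

Row i of the weight matrix sums to 1 and is zero for all j > i: token i mixes information only from itself and earlier tokens, which is the causal mask in action.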
Internals
GPT is a decoder-only transformer: each token attends only to previous tokens via causal masking (the upper-triangular part of the attention matrix is set to −∞ before softmax). The residual stream carries information across layers: each attention head and MLP block adds a residual update. Tokenization uses BPE (byte-pair encoding), which merges frequent character pairs into subword units; vocabulary size is typically 50K-100K tokens.
Performance Analysis
GPT-3 (175B params) requires ~350 GB of GPU memory in FP16, which is impractical without model parallelism across 8+ A100s. GPT-2 (117M params) runs on a single consumer GPU in well under 500 ms per token. Modern quantized 7B models (4-bit) run at ~30 tokens/second on an RTX 3090, reaching GPT-3.5-level quality on many benchmarks at a small fraction of the hardware cost.
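The memory figures follow directly from parameter count × bytes per parameter (weights only; activations and the KV cache add more on top):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Rough GPU memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(175e9, 2))    # GPT-3 in FP16 (2 bytes/param): 350.0 GB
print(weight_memory_gb(7e9, 0.5))    # 7B model in 4-bit (0.5 bytes/param): 3.5 GB
```

The same arithmetic explains why 4-bit quantization turns a multi-GPU model into a single-GPU one: bytes per parameter drop 4× relative to FP16.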
Trade-offs & Failure Modes: GPT's Known Limitations

| Limitation | What it means in practice |
|---|---|
| Knowledge cutoff | Cannot know events that happened after the training data ended |
| Context window | Cannot "remember" earlier parts of a conversation once the window is full |
| Hallucination | Generates plausible-sounding but factually wrong content |
| No true reasoning | Pattern matching at scale, not symbolic logic |
| Token budget | Long inputs are expensive in both cost ($) and latency |
Decision Guide: When to Use GPT
Use GPT when you need flexible language generation: summarization, Q&A, code completion, or open-ended chat. Choose smaller models (GPT-3.5, local LLMs) for high-volume or cost-sensitive tasks. Use GPT-4 for complex reasoning. For structured classification or tabular prediction, classical ML still wins.
HuggingFace Transformers & vLLM: Running GPT-Style Inference in Python
HuggingFace Transformers is the standard Python library for loading and running open-weight language models (GPT-2, Mistral, LLaMA). Its pipeline("text-generation") abstraction handles tokenization, the autoregressive decode loop, and sampling โ mirroring every step in the GPT Inference Pipeline diagram from this post, from token embeddings through softmax and sampling to the stop condition check.
vLLM is a high-throughput inference server that wraps HuggingFace-compatible models behind an OpenAI-compatible REST API using PagedAttention and continuous batching, allowing you to swap a local model in place of gpt-4o with zero client-code changes.
```python
# pip install transformers torch

# ── HuggingFace pipeline: GPT-2 text generation ─────────────────────────────
from transformers import pipeline, set_seed

# pipeline() wraps: tokenizer → model forward pass → sampling → decode
generator = pipeline(
    "text-generation",
    model="gpt2",  # 117M params; runs on CPU for demos
)
set_seed(42)  # reproducibility

outputs = generator(
    "The best way to learn distributed systems is",
    max_new_tokens=60,   # cap output length (maps to max_tokens in the OpenAI API)
    temperature=0.7,     # < 1 → more focused; > 1 → more creative
    top_p=0.9,           # nucleus sampling: sample from top 90% of probability mass
    do_sample=True,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])

# ── Manual decode loop (mirrors the flowchart in this post step by step) ────
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("Distributed systems fail when", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):                            # generate 20 tokens
        logits = model(input_ids).logits           # → [1, seq_len, 50257]
        next_id = logits[:, -1, :].argmax(-1)      # greedy: highest logit
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)  # append

print(tokenizer.decode(input_ids[0]))
```
HuggingFace Transformers vs vLLM: which to use:

| | HuggingFace Transformers | vLLM |
|---|---|---|
| Best for | Prototyping, fine-tuning, research | High-throughput production serving |
| Batching | Manual | Continuous batching (PagedAttention) |
| Throughput | ~1-5 req/s per GPU | ~50-200 req/s on the same GPU |
| API interface | Python API | OpenAI-compatible REST (`/v1/completions`) |
For a full deep-dive on HuggingFace Transformers inference and vLLM deployment, a dedicated follow-up post is planned.
Key Lessons About How GPT Works
After walking through every component (tokenization, embeddings, self-attention, logits, and sampling), here are the five lessons that matter most:
1. Tokens, not words, are the unit of computation. GPT never sees words. It sees integer IDs mapped to subword pieces. This has practical consequences: long words, rare words, and non-English text cost more tokens and may be represented less reliably.
2. Self-attention is the core superpower. The ability to model arbitrary long-range dependencies between tokens is what separates Transformers from earlier RNN-based architectures. Each layer refines token representations by "looking at" every other token in the context.
3. Autoregression means each token depends on all previous tokens. There is no parallel decoding at inference time. GPT generates token 1, then token 2 conditioned on token 1, then token 3 conditioned on tokens 1 and 2, and so on. This is why longer outputs take longer to generate.
4. Hallucination is a structural feature, not a bug to be patched. GPT generates the statistically most likely continuation of the input. If a likely continuation contains a false fact, the model has no internal mechanism to detect the falsehood. Mitigations (RLHF, retrieval augmentation) reduce but do not eliminate hallucination.
5. Sampling parameters are tunable levers with real consequences. Temperature, top-p, and top-k are not obscure settings โ they directly determine whether your model's output is boring-but-accurate or creative-but-unreliable. Understanding them helps you prompt more effectively and choose the right configuration for your use case.
TLDR: Summary & Key Takeaways
- GPT = autoregressive next-token prediction, repeated at inference time.
- Input is tokenized into subwords (BPE); 1 token ≈ 0.75 words.
- Raw logits → softmax → probability distribution → sample a token.
- Self-attention allows GPT to model relationships between any two tokens in the context window.
- Key limitations: knowledge cutoff, context window ceiling, hallucination, and no symbolic reasoning.
Test Your Understanding
- GPT outputs a logit of 8.3 for "Paris" and 6.1 for "London." Which is more probable after softmax? Why isn't it just "pick 8.3"?
- What would temperature=0 mean operationally?
- In the self-attention formula, what do Q, K, and V represent in intuitive terms?
- Why doesn't GPT know about an event from last week?
Written by
Abstract Algorithms
@abstractalgorithms
More Posts
- RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)
- Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive
- Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs
- Watermarking and Late Data Handling in Spark Structured Streaming