How GPT (LLM) Works: The Next Word Predictor
ChatGPT isn't magic; it's math. We explain the core mechanism of Generative Pre-trained Transformers (GPT) and how they predict the next word.
TLDR: At its core, GPT asks one question, repeated: "Given everything so far, what is the most likely next token?" Tokens are not words; they're subword units. The Transformer architecture uses self-attention to weigh how much each token should influence the prediction. Sampling strategies (temperature, top-p) control how creative or deterministic the output is.
The One Question GPT Asks a Trillion Times
Everything ChatGPT, Claude, and Gemini do comes down to a single repeated operation:
Given the sequence of tokens seen so far, predict a probability distribution over the next token.
Sample one token from that distribution. Append it to the sequence. Repeat.
That's it. By doing this billions of times during training (and millions of times during inference), the model learns to generate coherent paragraphs, working code, and useful answers.
The Basics: How GPT Generates Text
GPT (Generative Pre-trained Transformer) is an autoregressive language model. "Autoregressive" means it generates output one token at a time, each new token conditioned on all the tokens that came before it. This is fundamentally different from how humans write (where you might plan a whole sentence), but it turns out to be enormously powerful at scale.
Training vs. Inference:
During training, GPT reads enormous amounts of text (hundreds of billions of tokens from books, websites, and code). It learns to predict the next token at every position in every sentence. The weights of the network are updated via gradient descent to make these predictions more accurate over time.
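The training signal itself is simple to state: at each position, the model pays a penalty of -log P(true next token). A toy illustration with a made-up distribution (not from any real model):

```python
import math

def next_token_cross_entropy(predicted_probs, true_token):
    """Loss for one position: the negative log-probability assigned to the true next token."""
    return -math.log(predicted_probs[true_token])

# Toy distribution over a 4-token vocabulary after seeing "the cat sat on the".
probs = {"mat": 0.7, "dog": 0.1, "the": 0.1, "ran": 0.1}

loss_good = next_token_cross_entropy(probs, "mat")  # confident and correct: low loss
loss_bad = next_token_cross_entropy(probs, "dog")   # true token only got 0.1: high loss
print(round(loss_good, 3), round(loss_bad, 3))
```

Gradient descent nudges the weights so that, averaged over billions of positions, this penalty shrinks; that is the entire training objective.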
During inference (when you actually use the model), the process flips: you provide a prompt, and GPT autoregressively extends it. At each step, it:
- Encodes the current token sequence into contextual representations
- Computes a probability distribution over the next token
- Samples one token from that distribution
- Appends the token to the sequence
- Repeats until it hits a stop condition or reaches the max length
This simple loop (predict → sample → append → repeat) is the entire mechanism behind every ChatGPT conversation, GitHub Copilot suggestion, and GPT-4-powered assistant you have ever used.
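The loop can be sketched in a few lines of Python. The bigram table here is a hypothetical stand-in for the real model, which would run a full Transformer forward pass to produce the distribution:

```python
import random

# Hypothetical bigram "model": maps the last token to candidate next tokens with probabilities.
BIGRAMS = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "sat": [("down", 1.0)],
}

def generate(prompt_tokens, max_new_tokens=10, stop_token="down"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = BIGRAMS.get(tokens[-1])
        if candidates is None:                                # stop condition: no known continuation
            break
        words, weights = zip(*candidates)
        next_token = random.choices(words, weights=weights)[0]  # sample from the distribution
        tokens.append(next_token)                             # append and repeat
        if next_token == stop_token:
            break
    return tokens

random.seed(0)
print(generate(["the"]))
```

Everything GPT adds is a better lookup: a learned function from the entire context, not just the last token, to a distribution over ~100K tokens.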
Why "Pre-trained"? The "P" in GPT stands for pre-trained on a massive general corpus. This base model can then be fine-tuned (RLHF, instruction tuning) to be a better assistant. The core architecture and inference loop remain the same.
Tokenization: Words Are Not the Input
GPT doesn't process words; it processes tokens, which for English text are roughly 3-4 characters each.
"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable" → ["un", "bel", "iev", "able"]
"ChatGPT" → ["Chat", "G", "PT"]
Subword tokenization (BPE, Byte Pair Encoding) allows the model to:
- Handle any word, including rare or invented ones
- Represent a vocabulary of 100K tokens that covers millions of possible words
- Share representations between related words ("run", "running", "runner")
Practical impact: GPT-4 has a 128K-token context window. At 1 token ≈ 0.75 words, that's roughly 100K words, about one full novel.
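One training round of BPE can be sketched directly: count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol. A toy illustration (real tokenizers like GPT's start from bytes and apply tens of thousands of learned merges):

```python
from collections import Counter

def most_frequent_pair(words):
    """words: list of symbol tuples, e.g. ('r', 'u', 'n', 'n', 'i', 'n', 'g')."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # two symbols become one
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("run"), tuple("running"), tuple("runner")]
pair = most_frequent_pair(corpus)   # one of the most frequent pairs, e.g. ('r', 'u')
print(pair, merge_pair(corpus, pair))
```

Repeating this merge step thousands of times is what produces shared pieces like "run" across "run", "running", and "runner".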
Predicting the Next Token: Logits, Softmax, and Sampling
After processing the input, GPT outputs a logit for every token in its vocabulary.
Raw logits (not probabilities):

```
"Paris"  →  8.3
"London" →  6.1
"apple"  → -1.2
...
```
These are converted to probabilities via softmax:
$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
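Plugging the example logits above into this formula (restricting the vocabulary to just three tokens for illustration):

```python
import math

logits = {"Paris": 8.3, "London": 6.1, "apple": -1.2}

total = sum(math.exp(z) for z in logits.values())
probs = {tok: math.exp(z) / total for tok, z in logits.items()}

print({tok: round(p, 3) for tok, p in probs.items()})
# → {'Paris': 0.9, 'London': 0.1, 'apple': 0.0}
```

The 2.2-point logit gap between "Paris" and "London" becomes roughly a 9× probability ratio (e^2.2 ≈ 9): softmax is exponential, so modest logit differences produce lopsided, but not certain, outcomes.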
Then one token is sampled from that distribution. How you sample controls the trade-off between coherence and creativity:
| Strategy | How it works | Result |
|---|---|---|
| Greedy | Always pick the highest-probability token | Repetitive, safe, boring |
| Temperature < 1 | Sharpen the distribution | More confident, less diverse |
| Temperature > 1 | Flatten the distribution | More creative, less reliable |
| Top-k | Sample only from the k highest-probability tokens | Cuts off the long tail |
| Top-p (nucleus) | Sample from the smallest set of tokens whose probabilities sum to p | Dynamic top-k |
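Each strategy in the table is only a few lines over a probability list. A minimal pure-Python sketch (the four logits are the earlier toy values, with an invented "Rome" added):

```python
import math, random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}    # renormalize the survivors

logits = [8.3, 6.1, 2.0, -1.2]                    # "Paris", "London", "Rome", "apple"

greedy = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax(logits, temperature=0.5)          # sharper than temperature=1
flat = softmax(logits, temperature=2.0)           # flatter: the long tail gets more mass
nucleus = top_p_filter(softmax(logits), p=0.95)

random.seed(0)
ids, weights = zip(*nucleus.items())
sampled = random.choices(ids, weights=weights)[0]
print(greedy, sampled, round(sharp[0], 3), round(flat[0], 3))
```

Note how the top token's probability moves with temperature: sharpening pushes it toward 1 (greedy-like), flattening pulls it down and spreads mass over the alternatives.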
Token Sampling: Logits to Next Token

```mermaid
sequenceDiagram
    participant TF as Transformer Stack
    participant LG as Logit Layer
    participant SM as Softmax
    participant SA as Sampling Strategy
    participant O as Output Token
    TF->>LG: final hidden state (d_model dims)
    LG->>SM: raw logits (one per vocab token)
    SM->>SA: probability distribution
    SA->>SA: apply temperature / top-p
    SA->>O: sample next token
    O-->>TF: append to context, repeat
```
This diagram shows the five-stage pipeline that converts a transformer's final hidden state into the next generated token. The logit layer projects the final hidden state into raw, unnormalized logits, one per token in the vocabulary; softmax normalizes these into a proper probability distribution; and the sampling strategy (greedy, temperature-scaled, or nucleus/top-p) selects the actual output token. That selected token is appended to the context and fed back as input, repeating the loop until a stop condition is met.
The GPT Inference Pipeline
To see how all the pieces connect, here is the full flow from prompt to generated token:
```mermaid
flowchart TD
    A[User Prompt] --> B[Tokenizer]
    B --> C["Token Embeddings + Positional Encoding"]
    C --> D["Transformer Layer 1 (Self-Attention + FFN)"]
    D --> E["Transformer Layer 2 (Self-Attention + FFN)"]
    E --> F["... N Transformer Layers ..."]
    F --> G[Final Hidden State]
    G --> H["Linear Projection to Logits (one per vocabulary token)"]
    H --> I[Softmax to Probabilities]
    I --> J["Sampling Strategy (greedy / temperature / top-p)"]
    J --> K[Next Token Selected]
    K --> L{Stop condition?}
    L -- No --> M[Append token to context]
    M --> B
    L -- Yes --> N[Final Output Text]
```
Each step in this pipeline is deterministic except the sampling step, where randomness (controlled by temperature) is introduced. Lowering temperature toward zero makes the model increasingly deterministic; raising it above 1.0 introduces more randomness and variety.
The loop between tokenizer and sampling is the engine of generation. A model with 96 Transformer layers (like GPT-3) runs the full stack, all 96 layers of self-attention and feed-forward computation, for every single token it generates. For a 500-token response, that means 96 × 500 = 48,000 passes through Transformer blocks.
Real-World Applications of GPT-Style LLMs
GPT-style large language models power a remarkable breadth of products across industries. Understanding the underlying mechanism helps you appreciate why they excel at some tasks and fail at others.
**Code Generation and Completion.** Tools like GitHub Copilot, Cursor, and Amazon CodeWhisperer use fine-tuned LLMs to autocomplete code, generate functions from comments, explain snippets, and write tests. Because code has regular structure and the training corpus contains billions of lines from GitHub, these models develop a strong prior over syntax and common patterns.
**Writing Assistance and Content Generation.** GPT models are used in marketing tools (Jasper, Copy.ai), document editors (Notion AI, Microsoft Copilot in Word), and email assistants (Gmail Smart Compose). The autoregressive nature makes them excellent at extending partial text in a stylistically consistent way.
**Question Answering and Search Augmentation.** Microsoft Bing AI and Google AI Overviews use LLMs to synthesize answers from retrieved documents. The LLM's job is to read context (provided via retrieval) and produce a coherent natural-language answer, a task perfectly suited to the next-token prediction paradigm.
**Summarization and Information Extraction.** LLMs can distill long documents into concise summaries or extract structured information (names, dates, clauses from contracts). Tools like Claude and GPT-4 are widely used in legal, financial, and medical fields for this purpose.
**Translation and Multilingual Tasks.** Because pre-training corpora span many languages, GPT-style models learn cross-lingual representations. They can translate, code-switch between languages in the same output, and even handle low-resource languages better than dedicated translation models from a few years ago.
**Conversational AI and Customer Support.** Chatbots built on GPT handle first-tier support, route tickets, and answer FAQs. The model's ability to maintain context across a multi-turn conversation (within its context window) makes it suitable for interactive workflows.
The key insight: all of these applications reduce to the same underlying operation of autoregressive token prediction on a shared pre-trained foundation, with task-specific fine-tuning or prompting on top.
Practical: Using GPT via API
The OpenAI API exposes GPT models over a simple HTTP interface. Here is a minimal Python example using the openai library:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how softmax turns logits into probabilities."},
    ],
    temperature=0.7,  # 0 → near-deterministic; higher → more random
    max_tokens=512,   # cap the response length
    top_p=0.95,       # nucleus sampling threshold
)

print(response.choices[0].message.content)
```
Parameter breakdown:
| Parameter | Effect |
|---|---|
| `temperature` | Controls randomness. 0 → greedy (always pick the top token). >1 → more random output. Typical range: 0.2-0.9. |
| `max_tokens` | Hard upper limit on tokens generated. Prevents runaway output and controls cost. |
| `top_p` | Nucleus sampling. 0.95 means the model samples only from the top tokens that together cover 95% of the probability mass. |
| `model` | Which GPT checkpoint to use (`gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`, etc.). Larger models cost more but reason better. |
Best practices:
- Use a low `temperature` (0.0-0.3) for factual tasks (summarization, extraction, classification).
- Use a higher `temperature` (0.7-1.0) for creative tasks (brainstorming, story writing).
- Always set `max_tokens` explicitly to avoid unexpected cost spikes.
- Pass a `system` message to give the model a persona or constraint set.
The `messages` array is how you build multi-turn context. The model doesn't have memory: you must resend the entire conversation history on each call. This is why context window size matters so much in practice.
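Since the API is stateless, multi-turn chat is just list bookkeeping on the client side. A sketch with no network calls; the `reply` argument stands in for whatever the API would return:

```python
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text, reply):
    """Append the user turn, then the assistant's reply.
    In real code, `reply` would come from client.chat.completions.create(messages=history),
    which receives the ENTIRE history on every call."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What is a token?", "A subword unit the model reads and writes.")
ask("Give an example.", '"unbelievable" might split into "un", "bel", "iev", "able".')

print(len(history))  # system + 2 user + 2 assistant = 5 messages
```

Every turn makes the next request longer, which is exactly why long conversations eventually hit the context window ceiling.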
Deep Dive: The Transformer Under the Hood (Attention in Plain Language)
GPT is built from stacked Transformer decoder blocks. Each block runs multi-head self-attention.
Self-attention answers: "For each position in the sequence, how much should I attend to every other position?"
Example: "The animal didn't cross the street because it was too tired."
- Self-attention figures out that "it" refers to "animal," not "street."
- It does this by learning attention weights between all token pairs.
```mermaid
flowchart LR
    Token1[The] --> Attn[Self-Attention Layer]
    Token2[animal] --> Attn
    Token3["didn't"] --> Attn
    Token4[cross] --> Attn
    Token5[it] --> Attn
    Attn --> Rep[Contextual representation of each token]
    Rep --> FF[Feed-Forward Layer]
    FF --> Next[Next-token distribution]
```
This diagram shows how all five input tokens flow into the Self-Attention layer simultaneously, which computes contextual representations for each token based on every other token in the sequence. The key insight is that after self-attention, the representation of "it" has been enriched with information from "animal": the model has learned where to look. The Feed-Forward layer then applies a non-linear transformation per token before the combined output is used to predict the next-token distribution.
GPT Decoder Block: Causal Attention Flow

```mermaid
flowchart LR
    IN["Input Tokens T1, T2, ... Tn"] --> ME["Token + Positional Embedding"]
    ME --> MSA["Masked Self-Attention (causal: no future tokens)"]
    MSA --> AN1["Add & Norm"]
    AN1 --> FFN["Feed-Forward Network (GELU activation)"]
    FFN --> AN2["Add & Norm"]
    AN2 --> OUT["Output Hidden State → next Decoder block"]
```

This diagram shows one complete Transformer decoder block, the fundamental repeating unit stacked N times in GPT. The causal masking in the Masked Self-Attention layer is what makes GPT a decoder: each token can only attend to tokens that came before it, never to future tokens. The Add & Norm layers stabilize gradient flow as information passes through dozens of stacked blocks, and the Feed-Forward Network applies a non-linear transformation that gives the model its capacity to encode complex patterns. Every generated token passes through this entire structure once per decode step.
The attention formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- $Q$ (Query): "What am I looking for?"
- $K$ (Key): "What do I offer?"
- $V$ (Value): "What information do I contain?"
- $\sqrt{d_k}$: scaling factor to prevent softmax saturation for large dimensions
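The formula, plus the causal mask from the decoder block above, fits in a few lines of NumPy (random matrices stand in for the learned Q/K/V projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k) + mask) V, with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # [seq, seq] similarity matrix
    mask = np.triu(np.ones_like(scores), k=1) * -1e9    # upper triangle = future tokens
    masked = scores + mask
    masked -= masked.max(axis=-1, keepdims=True)        # stabilize before exp
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                                     # 5 tokens, 8-dim head
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, w = causal_attention(Q, K, V)

print(out.shape)           # one contextual vector per token
print(np.round(w[2], 3))   # row 3: attends only to tokens 1-3
```

Row i of the weight matrix sums to 1 and is zero for all j > i: token i mixes information only from itself and earlier tokens, which is the causal mask in action.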
Internals
GPT is a decoder-only transformer: each token attends only to previous tokens via causal masking (the upper-triangular part of the attention matrix is set to −∞ before softmax). The residual stream carries information across layers: each attention head and MLP block adds a residual update. Tokenization uses BPE (byte-pair encoding), which merges frequent character pairs into subword units; vocabulary size is typically 50K-100K tokens.
Performance Analysis
GPT-3 (175B params) requires ~350 GB of GPU memory in FP16, which is impractical without model parallelism across 8+ A100s. GPT-2 (117M params) runs on a single consumer GPU in well under 500 ms per token. Modern quantized 7B models (4-bit) run at ~30 tokens/second on an RTX 3090, reaching GPT-3.5-level quality on many benchmarks at a small fraction of the hardware cost.
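The memory figures follow directly from parameter count × bytes per parameter (weights only; activations and the KV cache add more on top):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Rough GPU memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(175e9, 2))    # GPT-3 in FP16 (2 bytes/param): 350.0 GB
print(weight_memory_gb(7e9, 0.5))    # 7B model in 4-bit (0.5 bytes/param): 3.5 GB
```

The same arithmetic explains why 4-bit quantization turns a multi-GPU model into a single-GPU one: bytes per parameter drop 4× relative to FP16.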
Trade-offs & Failure Modes: GPT's Known Limitations

| Limitation | What it means in practice |
|---|---|
| Knowledge cutoff | Cannot know events that happened after the training data ended |
| Context window | Cannot "remember" earlier parts of a conversation once the window is full |
| Hallucination | Generates plausible-sounding but factually wrong content |
| No true reasoning | Pattern matching at scale, not symbolic logic |
| Token budget | Long inputs are expensive in both cost ($) and latency |
Decision Guide: When to Use GPT
Use GPT when you need flexible language generation: summarization, Q&A, code completion, or open-ended chat. Choose smaller models (GPT-3.5, local LLMs) for high-volume or cost-sensitive tasks. Use GPT-4 for complex reasoning. For structured classification or tabular prediction, classical ML still wins.
HuggingFace Transformers & vLLM: Running GPT-Style Inference in Python
HuggingFace Transformers is the standard Python library for loading and running open-weight language models (GPT-2, Mistral, LLaMA). Its pipeline("text-generation") abstraction handles tokenization, the autoregressive decode loop, and sampling โ mirroring every step in the GPT Inference Pipeline diagram from this post, from token embeddings through softmax and sampling to the stop condition check.
vLLM is a high-throughput inference server that wraps HuggingFace-compatible models behind an OpenAI-compatible REST API using PagedAttention and continuous batching, allowing you to swap a local model in place of gpt-4o with zero client-code changes.
```python
# pip install transformers torch

# ── HuggingFace pipeline: GPT-2 text generation ─────────────────────────────
from transformers import pipeline, set_seed

# pipeline() wraps: tokenizer → model forward pass → sampling → decode
generator = pipeline(
    "text-generation",
    model="gpt2",  # 117M params; runs on CPU for demos
)
set_seed(42)  # reproducibility

outputs = generator(
    "The best way to learn distributed systems is",
    max_new_tokens=60,   # cap output length (maps to max_tokens in the OpenAI API)
    temperature=0.7,     # < 1 → more focused; > 1 → more creative
    top_p=0.9,           # nucleus sampling: sample from top 90% of probability mass
    do_sample=True,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])

# ── Manual decode loop (mirrors the flowchart in this post step by step) ────
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("Distributed systems fail when", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):                            # generate 20 tokens
        logits = model(input_ids).logits           # → [1, seq_len, 50257]
        next_id = logits[:, -1, :].argmax(-1)      # greedy: highest logit
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)  # append

print(tokenizer.decode(input_ids[0]))
```
HuggingFace Transformers vs vLLM: which to use:

| | HuggingFace Transformers | vLLM |
|---|---|---|
| Best for | Prototyping, fine-tuning, research | High-throughput production serving |
| Batching | Manual | Continuous batching (PagedAttention) |
| Throughput | ~1-5 req/s per GPU | ~50-200 req/s on the same GPU |
| API interface | Python API | OpenAI-compatible REST (`/v1/completions`) |
For a full deep-dive on HuggingFace Transformers inference and vLLM deployment, a dedicated follow-up post is planned.
Key Lessons About How GPT Works
After walking through every component (tokenization, embeddings, self-attention, logits, and sampling), here are the five lessons that matter most:
1. Tokens, not words, are the unit of computation. GPT never sees words. It sees integer IDs mapped to subword pieces. This has practical consequences: long words, rare words, and non-English text cost more tokens and may be represented less reliably.
2. Self-attention is the core superpower. The ability to model arbitrary long-range dependencies between tokens is what separates Transformers from earlier RNN-based architectures. Each layer refines token representations by "looking at" every other token in the context.
3. Autoregression means each token depends on all previous tokens. There is no parallel decoding at inference time. GPT generates token 1, then token 2 conditioned on token 1, then token 3 conditioned on tokens 1 and 2, and so on. This is why longer outputs take longer to generate.
4. Hallucination is a structural feature, not a bug to be patched. GPT generates the statistically most likely continuation of the input. If a likely continuation contains a false fact, the model has no internal mechanism to detect the falsehood. Mitigations (RLHF, retrieval augmentation) reduce but do not eliminate hallucination.
5. Sampling parameters are tunable levers with real consequences. Temperature, top-p, and top-k are not obscure settings โ they directly determine whether your model's output is boring-but-accurate or creative-but-unreliable. Understanding them helps you prompt more effectively and choose the right configuration for your use case.
TLDR: Summary & Key Takeaways
- GPT = autoregressive next-token prediction, repeated at inference time.
- Input is tokenized into subwords (BPE); 1 token ≈ 0.75 words.
- Raw logits → softmax → probability distribution → sample a token.
- Self-attention allows GPT to model relationships between any two tokens in the context window.
- Key limitations: knowledge cutoff, context window ceiling, hallucination, and no symbolic reasoning.
Test Your Understanding
- GPT outputs a logit of 8.3 for "Paris" and 6.1 for "London." Which is more probable after softmax? Why isn't it just "pick 8.3"?
- What would temperature=0 mean operationally?
- In the self-attention formula, what do Q, K, and V represent in intuitive terms?
- Why doesn't GPT know about an event from last week?
Written by
Abstract Algorithms
@abstractalgorithms
More Posts
- RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)
- Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive
- Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs
- Watermarking and Late Data Handling in Spark Structured Streaming