Text Decoding Strategies: Greedy, Beam Search, and Sampling
How does an LLM choose the next word? It's not just random. We explore Greedy Search, Beam Search, and Sampling, plus the temperature dial that ties them together.
TLDR: An LLM doesn't "write" text — it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy decoding is fast but repetitive; beam search improves quality at compute cost; temperature plus top-p sampling trades determinism for creativity — choose based on your task.
📖 The "Finish the Sentence" Game
Imagine flashing the phrase "The dog jumps over the ___" to a crowd:
- 60% shout "Fence"
- 20% shout "Cat"
- 15% shout "River"
- 5% shout "Moon"
The LLM does this for every single token. The question is: do you always pick "Fence" (greedy), do you explore multiple branches (beam search), or do you randomly sample from the crowd (sampling)?
🔍 Core Concepts: Tokens, Logits, and Probability Distributions
Before diving into how decoding strategies differ, it helps to understand exactly what is happening inside the model at each generation step.
What is a token? A token is not the same as a word. Modern LLMs use sub-word tokenization — common words become single tokens, but rarer or longer words get split into smaller pieces. For example, "unbelievable" might be tokenized as ["un", "believ", "able"]. GPT-series models use a vocabulary of roughly 50,000 tokens. Every time the model generates text, it is choosing the next item from this ~50,000-item menu.
What are logits? After processing the input through all its transformer layers, the model produces one raw score — called a logit — for every token in the vocabulary. These are unnormalized numbers: a logit of 5.0 for "fence" and 2.0 for "cat" tells you the model leans toward "fence," but by how much depends on all 50,000 scores together.
What does softmax do? The softmax function converts the entire logit vector into a valid probability distribution — all values become positive and they sum to exactly 1.0. Only after softmax do we have probabilities we can actually sample from.
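To make this concrete, here is a minimal softmax in plain Python. The logit values are invented so that the output roughly reproduces the crowd percentages from the opening example; this is a sketch, not how real frameworks implement it (they operate on tensors over the full vocabulary):

```python
import math

def softmax(logits):
    # Shift by the max logit before exponentiating -- the standard
    # numerical-stability trick; it does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for ["fence", "cat", "river", "moon"]
probs = softmax([2.0, 0.9, 0.6, -0.5])
print([round(p, 2) for p in probs])  # → [0.6, 0.2, 0.15, 0.05]
```

Note that only the differences between logits matter: adding a constant to all of them leaves the probabilities unchanged.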
| Concept | Plain-English Meaning | Why It Matters |
|---|---|---|
| Token | Sub-word unit; ~50k in vocabulary | Decoding operates token-by-token |
| Logit | Raw unnormalized score from the model's last layer | Higher logit = more likely token |
| Softmax | Converts logits → valid probabilities summing to 1.0 | Enables probabilistic selection |
| Decoding strategy | Rule for picking the next token from the distribution | Controls quality, diversity, speed |
The key insight: the LLM doesn't "write" text like a human does. It iterates — take the current context, pass it through the model, get a probability distribution over all 50,000 tokens, pick one token according to the decoding strategy, append it to the context, and repeat. That choice rule — the decoding strategy — has an enormous impact on output quality, diversity, and coherence.
Why does the same prompt give different answers? Unless you use pure greedy decoding (temperature T=0), the model makes a probabilistic draw at every token position, so two runs of the same prompt can produce different outputs. This is intentional for creative tasks and something to disable for deterministic ones like SQL generation.
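The generate-one-token-and-repeat loop described above can be sketched in a few lines of Python. Everything here is a toy: `toy_logits` is a hypothetical stand-in for a transformer forward pass, and the four-token vocabulary is invented for illustration.

```python
import math
import random

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def generate(logits_fn, vocab, max_tokens=10, greedy=True, seed=0):
    # The core decode loop: context -> distribution -> pick -> append
    rng = random.Random(seed)
    context = []
    for _ in range(max_tokens):
        probs = softmax(logits_fn(context))
        if greedy:
            idx = max(range(len(vocab)), key=lambda i: probs[i])
        else:
            idx = rng.choices(range(len(vocab)), weights=probs)[0]
        if vocab[idx] == "<eos>":
            break
        context.append(vocab[idx])
    return " ".join(context)

# Hypothetical stand-in for the model's forward pass: returns one raw
# logit per vocabulary token, given the tokens generated so far.
VOCAB = ["dog", "jumps", "fence", "<eos>"]
def toy_logits(context):
    step_logits = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
    return step_logits[min(len(context), 3)]

print(generate(toy_logits, VOCAB))  # → dog jumps fence
```

Swapping `greedy=True` for `greedy=False` switches the same loop from greedy decoding to sampling; everything else stays identical.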
🔢 The Three Core Strategies
1. Greedy Decoding
Always pick the single highest-probability token at each step.
```text
Input: "The dog jumps over the"
Step 1: "fence" (highest probability token)
Output: "The dog jumps over the fence."
```
Pros: Fast, deterministic, reproducible. Cons: Locally optimal choices lead to globally suboptimal sequences. The model can paint itself into a corner — e.g., choosing "fence" first makes "and runs away" the natural continuation, even if "cat and runs away together" would have scored higher as a complete sentence.
2. Beam Search
Keep the top-K complete sequences (beams) in parallel. At each step, expand all K beams, score them, and keep only the top K.
```text
K=3 beams after step 1:
Beam 1: "... fence" (log-prob: -0.5)
Beam 2: "... cat"   (log-prob: -1.2)
Beam 3: "... river" (log-prob: -1.8)

After step 2 (expand each beam):
Beam 1a: "... fence and ..." (-0.5 + -0.3 = -0.8)
Beam 1b: "... fence to ..."  (-0.5 + -1.0 = -1.5)
Beam 2a: "... cat and ..."   (-1.2 + -0.2 = -1.4) ← beats 1b!
...
```
Pros: Better global sequence scores than greedy. Cons: Computationally K× more expensive. Beam outputs tend to be generic/safe — less creative.
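Beam search fits in a short sketch. The scoring function below is made up deliberately so that the greedy first choice ("fence") is a trap: continuing from "cat" is so much more certain that the "cat" beam wins on cumulative log-probability.

```python
import math

def log_softmax(zs):
    m = max(zs)
    log_total = m + math.log(sum(math.exp(z - m) for z in zs))
    return [z - log_total for z in zs]

def beam_search(logits_fn, vocab, k=3, steps=2):
    # Each beam is (token_list, cumulative_log_prob)
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for i, lp in enumerate(log_softmax(logits_fn(seq))):
                candidates.append((seq + [vocab[i]], score + lp))
        # Prune: keep only the top-k by cumulative log-probability
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# Made-up logits: "fence" is most likely at step 1, but the "cat"
# continuation is far more certain, so its sequence scores higher.
VOCAB = ["fence", "cat", "and", "runs"]
def toy_logits(seq):
    table = {
        (): [2.0, 1.5, -9.0, -9.0],
        ("fence",): [0.0, 0.0, 1.0, 1.0],
        ("cat",): [-9.0, -9.0, 3.0, 0.0],
    }
    return table.get(tuple(seq), [0.0, 0.0, 0.0, 0.0])

best_seq, best_score = beam_search(toy_logits, VOCAB, k=3, steps=2)[0]
print(best_seq)  # → ['cat', 'and'], even though greedy starts with 'fence'
```

A real implementation would also handle end-of-sequence tokens and length normalization, but the expand-score-prune structure is the same.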
3. Sampling
Draw a token randomly according to the probability distribution — not always the top one.
Pros: Diverse, creative outputs. Reduces repetition. Cons: Can produce incoherent text if the probability distribution is too flat.
Top-K Sampling
Only sample from the K highest-probability tokens. Prevents picking absurdly rare tokens.
Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability ≥ P.
```text
P=0.9: include tokens until you've covered 90% of probability mass

If vocab: [fence: 60%, cat: 20%, river: 15%, moon: 5%]
Top-P=0.9 set: [fence (60%), cat (20%), river (15%)] → sample from these 3
```
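Both filters are easy to state in code. This is a sketch over the four-token toy distribution above, not a production implementation (real ones operate on sorted logit tensors before softmax):

```python
def top_k_filter(probs, k):
    # Keep the k highest-probability token indices, then renormalize
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def top_p_filter(probs, p):
    # Smallest set of tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# fence 60%, cat 20%, river 15%, moon 5%
probs = [0.60, 0.20, 0.15, 0.05]
print(sorted(top_p_filter(probs, 0.9)))  # → [0, 1, 2]  (fence, cat, river)
print(sorted(top_k_filter(probs, 2)))    # → [0, 1]     (fence, cat)
```

The key difference shows here: top-k always keeps exactly `k` tokens, while the top-p set grows or shrinks per step depending on how peaked the distribution is.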
| Strategy | Diversity | Quality | Use Case |
|---|---|---|---|
| Greedy | None | High | Factual Q&A, code generation |
| Beam Search (K=5) | Low | High | Translation, summarization |
| Top-K Sampling | Medium | Medium | Creative writing |
| Top-P Sampling | High | Medium-High | Dialogue, creative tasks |
📊 Beam Search Step-by-Step Sequence
```mermaid
sequenceDiagram
    participant M as Model
    participant B as Beam Buffer
    M->>B: Position 1: generate top-K tokens
    B->>B: Keep K=3 beams with scores
    M->>B: Position 2: expand each beam
    B->>B: Score all candidates
    B->>B: Prune: keep top-K by cumulative score
    M->>B: Position 3: expand K beams again
    B->>B: Prune to top-K
    B->>M: Return highest-score complete sequence
```
This sequence diagram illustrates how beam search avoids the single-path trap of greedy decoding. At each position the Model expands every beam, the Beam Buffer scores the resulting candidate continuations (each of the K beams expanded with its top-K next tokens) and prunes back to K beams — so even if the initially most likely token leads to a poor continuation, a lower-probability first choice can still survive if its overall cumulative score is competitive. The takeaway: beam search finds globally better sequences, but at roughly K× the compute cost and with a tendency toward safe, generic output.
⚖️ Trade-offs & Failure Modes: When Strategies Go Wrong
Each strategy has failure patterns worth knowing before you deploy:
| Mistake | Strategy | Problem | Fix |
|---|---|---|---|
| Greedy for long creative text | Greedy | Repetitive loops, "dead ends" | Switch to Top-P sampling |
| High temperature for SQL/code | T > 1.0 | Invalid tokens inserted | Use T = 0 with structured schema |
| Beam search for chat responses | Beam K=5 | Generic, unnatural dialogue | Use Top-P with T=0.7 instead |
| Top-P too narrow (P=0.1) for stories | Sampling | Boring, repetitive output | Raise P to 0.85–0.95 |
| Temperature set once and never tuned | Any | Wrong quality/creativity balance | Tune per task type |
⚙️ Temperature: The Creativity Dial
Temperature $T$ reshapes the probability distribution before sampling:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Where $z_i$ are the raw logit scores.
| Temperature | Effect | Output Character |
|---|---|---|
| $T = 0$ | Greedy (argmax) — one token wins entirely | Deterministic, repetitive |
| $T = 1.0$ | Standard distribution (no change) | Balanced |
| $T < 1$ (e.g., 0.3) | Sharper distribution — top token dominates even more | Focused, precise |
| $T > 1$ (e.g., 1.8) | Flatter distribution — rare tokens become more likely | Creative, potentially incoherent |
Numerical example at $T = 0.5$ vs $T = 2.0$ for two tokens A (logit 2.0) and B (logit 1.0):
| Temperature | P(A) | P(B) | Interpretation |
|---|---|---|---|
| 1.0 | 73% | 27% | Standard |
| 0.5 | 88% | 12% | Very confident in A |
| 2.0 | 62% | 38% | B nearly as likely as A |
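You can check these numbers directly by applying temperature scaling to the two logits before softmax — a quick sketch:

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # tokens A and B
for T in (1.0, 0.5, 2.0):
    pa, pb = softmax_with_temperature(logits, T)
    print(f"T={T}: P(A)={pa:.0%}, P(B)={pb:.0%}")
```

With two tokens this reduces to a sigmoid of the logit gap divided by T, which is why halving T roughly squares A's dominance.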
```mermaid
flowchart LR
    Logits[Raw Logits z_i] --> Temp[Divide by T]
    Temp --> Softmax[Softmax Probabilities]
    Softmax --> Decode["Decoding Strategy (greedy / top-k / nucleus)"]
    Decode --> Token[Selected Token]
```
This flowchart shows the three-step temperature pipeline: raw logits are divided by T (sharpening or flattening the distribution), softmax converts them into valid probabilities, and the chosen decoding strategy selects a token. Temperature is applied before sampling, which means it amplifies or suppresses the effect of top-p and top-k filtering downstream. At T→0 the softmax output collapses to a one-hot vector (greedy); at T>1 the distribution flattens and even low-probability tokens become viable candidates.
📊 Decoding Strategy Selection: Pipeline Flow
Each token generation step applies the selected decoding strategy to transform raw logits into the next output token.
```mermaid
flowchart LR
    A[LLM Logits] --> B{Decoding Strategy?}
    B -->|Greedy| C[argmax token]
    B -->|Beam Search| D[Expand k beams]
    B -->|Sampling| E[Apply Temperature]
    E --> F{top-p or top-k filter}
    F --> G[Sample from distribution]
    D --> H[Prune to k beams]
    C --> I[Next Token]
    G --> I
    H --> I
```
🧭 Decision Guide: Decoding Strategy Selection
Choosing the right decoding strategy depends on three questions about your task: Does it need creativity? Does it need speed? Does it need globally optimal sequence quality? The decision tree below maps those constraints to concrete settings.
```mermaid
flowchart TD
    Task[Define Generation Task]
    Creative{Is creativity desired?}
    Speed{Is speed critical?}
    Quality{Is sequence quality critical?}
    Greedy["Greedy Decoding<br/>T=0, deterministic<br/>Fastest"]
    Beam["Beam Search K=5<br/>Better global quality<br/>Slower"]
    TopP["Top-P Sampling<br/>P=0.9, T=0.8<br/>Diverse and creative"]
    TopK["Top-K Sampling<br/>K=40, T=1.0<br/>Moderate diversity"]
    Task --> Creative
    Creative -->|No| Speed
    Creative -->|Yes| TopP
    Speed -->|Yes| Greedy
    Speed -->|No| Quality
    Quality -->|Yes| Beam
    Quality -->|No| TopK
```
Figure: Decision tree for selecting a decoding strategy based on your task's creativity, speed, and quality requirements.
How to use this tree in practice: Start at "Define Generation Task" and answer each question honestly. A SQL generator needs neither creativity nor beam quality — greedy wins. A chatbot needs creativity — go Top-P. An offline translation pipeline needs quality and can afford extra compute — beam search. A game dialogue system needs moderate diversity without the beam overhead — Top-K. Most production APIs default to Top-P sampling with temperature tuning and do not expose beam search directly; knowing where your use case sits on this tree helps you configure temperature and top-p values intelligently.
🌍 Real-World Applications of Decoding Strategies
Different applications sit in very different parts of the decision tree above. Here is how major production use cases map to strategies:
| Application | Recommended Strategy | Why |
|---|---|---|
| Machine translation | Beam Search (K=4–8) | Sequence-level quality matters; a near-correct reference answer exists |
| Creative story writing | Top-P=0.9, T=0.9–1.1 | Diversity and surprise are desirable; no single correct continuation |
| SQL generation | Greedy or T=0.0–0.1 | SQL is a deterministic task; randomness introduces syntax errors |
| Chatbot dialogue | Top-P=0.9, T=0.7 | Natural conversational variance; avoids robotic repetition |
| Code completion | Greedy or T=0.1, top-p=0.95 | Code has correctness constraints; focused output reduces hallucination |
| Factual Q&A | T=0.0–0.2 | Factual answers have a correct form; randomness only adds noise |
A note on production APIs: Most hosted LLM APIs (OpenAI, Anthropic, Google Gemini) do not expose beam search as a parameter. Their defaults are top-p sampling with temperature tuning. Beam search remains dominant in offline pipelines — machine translation frameworks (MarianMT, fairseq) and speech recognition (Whisper) — where running K beams is acceptable and sequence-level optimality is paramount.
One important misconception to clear up: setting temperature=0 in the OpenAI API gives you greedy-like output, not beam search. Many developers assume low temperature equals beam search; it does not. Beam search explores multiple sequence branches in parallel and scores them globally; temperature=0 just makes greedy decoding deterministic by collapsing the softmax to an argmax.
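A quick way to see why T=0 is "just greedy" and nothing more: dividing every logit by the same positive T never reorders them, so the argmax token is identical at every temperature — lowering T only concentrates probability on it. A minimal sketch with made-up logits:

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

logits = [2.0, 1.9, 0.5]  # made-up logits with a close top-2
argmaxes = []
for T in (0.01, 0.5, 1.0, 2.0):
    probs = softmax([z / T for z in logits])
    argmaxes.append(max(range(len(probs)), key=lambda i: probs[i]))
    print(f"T={T}: argmax={argmaxes[-1]}, probs={[round(p, 3) for p in probs]}")
# The winning index is the same at every T: temperature reshapes the
# distribution but never changes which single token is most likely.
```

Beam search, by contrast, can return a sequence whose first token is *not* the argmax — something no temperature setting can reproduce.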
📊 Decoding Strategy Latency vs Diversity
```mermaid
flowchart TD
    Token[Next Token Position]
    Greedy["Greedy: argmax of logits<br/>one token, fastest"]
    BeamK["Beam Search: K=5<br/>K parallel sequences<br/>K× slower, best quality"]
    TopK["Top-K: sample from K<br/>balanced diversity/speed"]
    TopP["Top-P: nucleus 0.9<br/>dynamic vocab, diverse"]
    Temp["Temperature T<br/>reshapes distribution<br/>applied before sampling"]
    Token --> Greedy
    Token --> BeamK
    Token --> TopK
    Token --> TopP
    TopK --> Temp
    TopP --> Temp
```
This diagram compares all four decoding strategies along the latency-diversity axis simultaneously. Greedy is fastest but produces no diversity; Beam Search is slowest and offers the best global sequence quality but also the least variety; Top-K and Top-P sampling occupy the middle ground — Top-P dynamically adjusts the candidate vocabulary per step while Top-K uses a fixed cutoff. Temperature feeds into both sampling methods, making it the shared dial that amplifies or dampens diversity within whatever nucleus or top-K constraint you have set.
🧪 Exploring Decoding Strategies in Practice
Exercise 1 — Greedy vs. Sampling Side-by-Side
Use the HuggingFace transformers library to compare strategies on the same prompt:
```python
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
prompt = "The future of renewable energy is"

# Greedy decoding — deterministic, same result every run
greedy = gen(prompt, max_new_tokens=50, do_sample=False)

# Top-P nucleus sampling — varied output each run
sampled = gen(prompt, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.8)

print("Greedy:", greedy[0]["generated_text"])
print("Sampled:", sampled[0]["generated_text"])
```
Run this five times. The greedy output is identical every run; the sampled output varies. Greedy outputs often start looping or repeating phrases after a few sentences — the classic symptom of locally optimal but globally poor token choices.
Exercise 2 — Temperature Spectrum
Generate five completions at T=0.1, T=0.7, and T=1.5 for the prompt "The future of AI is". Observe: at T=0.1 the output is near-deterministic and focused but may repeat phrases; at T=0.7 there is a balanced diversity-coherence trade-off; at T=1.5 the output becomes creative but risks incoherence or going off-topic. This exercise builds direct intuition for the creativity-vs-coherence dial that temperature controls.
Exercise 3 — Nucleus Size with the OpenAI API
Use the OpenAI API to test top_p=0.1 versus top_p=0.95 on a creative prompt such as "Write the opening line of a science fiction novel set on Europa". At top_p=0.1 sampling is restricted to the top 10% of probability mass — outputs are safe and repetitive. At top_p=0.95 the model ranges across 95% of probable continuations — outputs vary dramatically. Note how top-p and temperature interact: high top-p gives temperature room to explore; low top-p limits exploration regardless of temperature setting.
🧠 Deep Dive: Practical Defaults by Task
| Task | Recommended Settings |
|---|---|
| Code generation | Greedy or T=0.1, top-p=0.95 |
| Factual Q&A | T=0.0–0.2 (deterministic) |
| Creative writing | T=0.8–1.2, top-p=0.9 |
| API calls / JSON output | T=0.0 + strict output schema |
| Dialogue/chatbot | T=0.7, top-p=0.9 |
🔬 Internals
Greedy decoding selects the argmax at each step: $t_i = \arg\max_t P(t \mid t_1, \ldots, t_{i-1})$. Beam search maintains $k$ partial sequences ("beams"), expanding each at every step and pruning to keep only the top $k$ by cumulative log-probability. Sampling draws $t_i \sim P(\cdot \mid \text{context})$ after optional temperature scaling and vocabulary truncation (top-k/top-p), breaking the deterministic mode-seeking behavior of greedy and beam search.
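Putting the sampling pieces together, a single decode step — temperature scaling, softmax, nucleus truncation, draw — might look like the sketch below. Real implementations vectorize this over the full vocabulary on the GPU; this is an illustrative plain-Python version:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=random):
    # 1. Temperature scaling of the raw logits
    scaled = [z / temperature for z in logits]
    # 2. Numerically stable softmax to a probability distribution
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Nucleus truncation: smallest set of tokens covering top_p mass
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 4. Renormalize over the nucleus and draw one token index
    mass = sum(probs[i] for i in keep)
    return rng.choices(keep, weights=[probs[i] / mass for i in keep])[0]

# With a sharply peaked distribution and a small nucleus, the draw is
# effectively deterministic: only the argmax survives truncation.
print(sample_next_token([5.0, 1.0, 0.0], temperature=1.0, top_p=0.5))  # → 0
```

This also makes the interaction explicit: temperature reshapes `probs` *before* the nucleus is computed, so a high T widens the set of tokens that survive the top-p cutoff.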
⚡ Performance Analysis
Greedy decoding runs at peak throughput (~100–150 tokens/second on an A100 for a 7B model) since it processes one candidate per step. Beam search with k=5 is ~4× slower due to expanded batch size and is prone to repetition on open-ended tasks. Nucleus sampling (top-p=0.9) matches greedy throughput and scores 20–30% higher on human preference evaluations for creative and conversational tasks.
🛠️ HuggingFace Transformers: Controlling Decoding with GenerationConfig
HuggingFace Transformers is the leading open-source Python library for loading and running transformer models. Its GenerationConfig object centralizes every decoding parameter — do_sample, temperature, top_p, num_beams — into a single, versionable configuration that can be saved alongside the model checkpoint and swapped without reloading the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# --- Strategy 1: Greedy — deterministic, ideal for SQL / code generation ---
greedy_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,  # argmax at every step: T→0 behaviour
    temperature=1.0,
)

# --- Strategy 2: Beam Search — high sequence quality, good for translation ---
beam_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,
    num_beams=4,  # keep 4 candidate sequences in parallel
    early_stopping=True,
)

# --- Strategy 3: Nucleus Sampling — creative, good for chat / story generation ---
nucleus_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,  # slightly sharpen the distribution
    top_p=0.92,       # sample from top-92% probability mass
    top_k=0,          # disable top-k; use pure nucleus sampling
)

prompt = "Explain why the sky is blue in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for label, cfg in [("Greedy", greedy_cfg), ("Beam", beam_cfg), ("Nucleus", nucleus_cfg)]:
    with torch.inference_mode():
        output = model.generate(**inputs, generation_config=cfg)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\n[{label}]\n{text[len(prompt):]}")
```
GenerationConfig cleanly separates your decoding strategy from model loading — you can A/B test strategies in production by swapping configs without touching model weights. Save the winning configuration with cfg.save_pretrained("./configs/nucleus") so results are fully reproducible across environments and deployments.
For a full deep-dive on HuggingFace Transformers generation internals, constrained decoding, and fine-tuning workflows, a dedicated follow-up post is planned.
📚 Five Lessons Every LLM Developer Should Internalize
Greedy decoding is suboptimal for long sequences. Each locally best token constrains what follows — the globally best sentence may require starting with a slightly lower-probability token. This is why greedy outputs often feel stilted or start looping after a few sentences.
Beam search improves global sequence quality but produces "safe" text. It is the right tool for translation and summarization where a near-correct answer exists, but it actively penalizes diversity. Do not use it for open-ended creative generation.
Top-P (nucleus) sampling is the most common production default. It balances quality and diversity by concentrating sampling in the high-probability region while still allowing variation. Most API defaults sit at P=0.9–0.95.
Temperature and top-p are complementary, not alternatives. Temperature reshapes the entire probability distribution (sharpening or flattening it); top-p then truncates the distribution after reshaping. Use both together deliberately rather than treating them as the same knob.
For factual or structured output, use T=0–0.2. SQL, JSON, code, and factual Q&A all have correct answers. Randomness only introduces noise. Near-zero temperature is the appropriate default; beam search can be layered on top in offline pipelines where sequence-level quality is paramount.
📌 TLDR: Summary & Key Takeaways
- Greedy = always pick the best token; fast but suboptimal.
- Beam Search = keep top-K candidate sequences; better quality, more compute.
- Top-P (Nucleus) Sampling = sample from the smallest set covering P% of probability mass; best for creative tasks.
- Temperature reshapes the distribution: low $T$ → focused, high $T$ → creative/risky.
🔗 Related Posts
- Prompt Engineering Guide: Zero-Shot to Chain-of-Thought
- Embeddings Explained
- RAG Explained: How to Give Your LLM a Brain Upgrade
- How to Build Apps with LangChain and LLMs

Written by
Abstract Algorithms
@abstractalgorithms