Text Decoding Strategies: Greedy, Beam Search, and Sampling
How does an LLM choose the next word? It's not just random. We explore Greedy Search, Beam Search, and Sampling.
Abstract Algorithms
TLDR: An LLM doesn't "write" text; it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy, Beam Search, and Sampling are different rules for that choice. Temperature controls the creativity knob.
The "Finish the Sentence" Game
Imagine flashing the phrase "The dog jumps over the ___" to a crowd:
- 60% shout "Fence"
- 20% shout "Cat"
- 15% shout "River"
- 5% shout "Moon"
The LLM does this for every single token. The question is: do you always pick "Fence" (greedy), do you explore multiple branches (beam search), or do you randomly sample from the crowd (sampling)?
The Three Core Strategies
1. Greedy Decoding
Always pick the single highest-probability token at each step.
Input: "The dog jumps over the"
Step 1: "fence" (highest probability token)
Output: "The dog jumps over the fence."
Pros: Fast, deterministic, reproducible. Cons: Locally optimal choices can lead to globally suboptimal sequences. The model can paint itself into a corner: choosing "fence" first makes "and runs away" the natural continuation, even if "cat and runs away together" would have scored higher as a complete sentence.
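Greedy decoding is essentially one `argmax` per step. A minimal sketch over a toy next-token distribution (the vocabulary and probabilities below are illustrative, not real model output):

```python
import numpy as np

# Toy next-token distribution for "The dog jumps over the"
# (illustrative numbers, not from a real model)
vocab = ["fence", "cat", "river", "moon"]
probs = np.array([0.60, 0.20, 0.15, 0.05])

def greedy_pick(vocab, probs):
    """Greedy decoding: always return the highest-probability token."""
    return vocab[int(np.argmax(probs))]

print(greedy_pick(vocab, probs))  # fence
```

In a real loop this runs once per generated token, feeding each picked token back into the model.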
2. Beam Search
Keep the top-K complete sequences (beams) in parallel. At each step, expand all K beams, score them, and keep only the top K.
K=3 beams after step 1:
Beam 1: "... fence" (log-prob: -0.5)
Beam 2: "... cat" (log-prob: -1.2)
Beam 3: "... river" (log-prob: -1.8)
After step 2 (expand each beam):
Beam 1a: "... fence and ..." (-0.5 + -0.3 = -0.8)
Beam 1b: "... fence to ..." (-0.5 + -0.6 = -1.1)
Beam 2a: "... cat and ..." (-1.2 + -0.2 = -1.4), close behind 1b; a beam that starts weaker can overtake a stronger one later
...
Pros: Better global sequence scores than greedy. Cons: Roughly K× more compute, and beam outputs tend to be generic and safe rather than creative.
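The expand-score-prune loop above can be sketched in a few lines. The log-prob table here is a hypothetical toy, loosely matching the worked example (a real implementation would query the model for each beam instead):

```python
# Toy conditional log-probs: log P(next token | sequence so far).
# Purely illustrative numbers, not from a real model.
LOGPROBS = {
    (): {"fence": -0.5, "cat": -1.2, "river": -1.8},
    ("fence",): {"and": -0.3, "to": -0.6},
    ("cat",): {"and": -0.2, "naps": -1.0},
    ("river",): {"bank": -0.4},
}

def beam_search(k=3, steps=2):
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Expand every beam by every possible next token.
            for tok, lp in LOGPROBS.get(seq, {}).items():
                candidates.append((seq + (tok,), score + lp))
        if not candidates:
            break
        # Keep only the top-k scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 2))
```

With these numbers the final ranking is "fence and" (-0.8), "fence to" (-1.1), "cat and" (-1.4): the "cat" beam survives even though it lost step 1.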
3. Sampling
Draw a token at random according to the probability distribution, rather than always taking the top one.
Pros: Diverse, creative outputs. Reduces repetition. Cons: Can produce incoherent text if the probability distribution is too flat.
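Plain sampling draws tokens proportionally to their probabilities, so over many draws the frequencies track the distribution. A quick sketch with the toy crowd distribution from above:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
vocab = ["fence", "cat", "river", "moon"]
probs = [0.60, 0.20, 0.15, 0.05]

# Draw 1000 tokens; counts should land near 600 / 200 / 150 / 50.
draws = rng.choice(vocab, size=1000, p=probs)
counts = {t: int((draws == t).sum()) for t in vocab}
print(counts)
```

Even "moon" gets picked occasionally, which is exactly where diversity (and the risk of incoherence) comes from.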
Top-K Sampling
Only sample from the K highest-probability tokens. Prevents picking absurdly rare tokens.
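A sketch of top-K: keep only the K most likely tokens, renormalize their probabilities, then sample among that shortlist (toy numbers again):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(vocab, probs, k=2, rng=rng):
    """Sample only among the k highest-probability tokens."""
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[::-1][:k]        # indices of the k best tokens
    renorm = probs[top] / probs[top].sum()   # renormalize over the shortlist
    return vocab[rng.choice(top, p=renorm)]

vocab = ["fence", "cat", "river", "moon"]
probs = [0.60, 0.20, 0.15, 0.05]
print(top_k_sample(vocab, probs, k=2))  # always "fence" or "cat"
```

With k=2, "river" and "moon" can never be chosen, no matter how many times you sample.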
Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability is ≥ P.
P=0.9: include tokens until you've covered 90% of probability mass
If vocab: [fence: 60%, cat: 20%, river: 15%, moon: 5%]
Top-P=0.9 set: [fence(60%), cat(20%), river(15%)] → sample from these 3
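Selecting the nucleus is a cumulative-sum cutoff over the sorted distribution. A minimal sketch reproducing the example above:

```python
import numpy as np

def nucleus_set(vocab, probs, p=0.9):
    """Smallest prefix of tokens (by descending prob) with cumulative prob >= p."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(np.asarray(probs, dtype=float)[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # first index where cum >= p
    return [vocab[i] for i in order[:cutoff]]

vocab = ["fence", "cat", "river", "moon"]
probs = [0.60, 0.20, 0.15, 0.05]
print(nucleus_set(vocab, probs, p=0.9))  # ['fence', 'cat', 'river']
```

Unlike top-K, the shortlist size adapts: a peaked distribution yields a tiny nucleus, a flat one a large nucleus. Sampling then proceeds over the renormalized set, as in the top-K sketch.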
| Strategy | Diversity | Quality | Use Case |
| --- | --- | --- | --- |
| Greedy | None | High | Factual Q&A, code generation |
| Beam Search (K=5) | Low | High | Translation, summarization |
| Top-K Sampling | Medium | Medium | Creative writing |
| Top-P Sampling | High | Medium-High | Dialogue, creative tasks |
Temperature: The Creativity Dial
Temperature $T$ reshapes the probability distribution before sampling:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Where $z_i$ are the raw logit scores.
| Temperature | Effect | Output Character |
| --- | --- | --- |
| $T = 0$ | Greedy (argmax) → one token wins entirely | Deterministic, repetitive |
| $T = 1.0$ | Standard distribution (no change) | Balanced |
| $T < 1$ (e.g., 0.3) | Sharper distribution → top token dominates even more | Focused, precise |
| $T > 1$ (e.g., 1.8) | Flatter distribution → rare tokens become more likely | Creative, potentially incoherent |
Numerical example at $T = 0.5$ vs $T = 2.0$ for two tokens A (logit 2.0) and B (logit 1.0):
| Temperature | P(A) | P(B) | Interpretation |
| --- | --- | --- | --- |
| 1.0 | 73% | 27% | Standard |
| 0.5 | 88% | 12% | Much more confident in A |
| 2.0 | 62% | 38% | B nearly as likely as A |
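The table follows directly from the softmax formula; a quick numerical check (percentages rounded):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T (T > 0)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0]           # tokens A and B from the table
for T in (1.0, 0.5, 2.0):
    pA, pB = softmax_with_temperature(logits, T)
    print(f"T={T}: P(A)={pA:.0%}, P(B)={pB:.0%}")
# T=1.0: P(A)=73%, P(B)=27%
# T=0.5: P(A)=88%, P(B)=12%
# T=2.0: P(A)=62%, P(B)=38%
```

Note that only the logit *gap* divided by T matters here, which is why shrinking T exaggerates the gap and growing T washes it out.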
```mermaid
flowchart LR
    Logits["Raw Logits z_i"] --> Temp["Divide by T"]
    Temp --> Softmax["Softmax → Probabilities"]
    Softmax --> Decode["Decoding Strategy\n(greedy / top-k / nucleus)"]
    Decode --> Token["Selected Token"]
```
Practical Defaults
| Task | Recommended Settings |
| --- | --- |
| Code generation | Greedy or T=0.1, top-p=0.95 |
| Factual Q&A | T=0.0–0.2 (deterministic) |
| Creative writing | T=0.8–1.2, top-p=0.9 |
| API calls / JSON output | T=0.0 + strict output schema |
| Dialogue/chatbot | T=0.7, top-p=0.9 |
Summary
- LLMs produce a probability distribution over the vocabulary at each step; decoding strategy picks the token.
- Greedy = always pick the best token; fast but suboptimal globally.
- Beam Search = keep top-K candidate sequences; better quality, more compute.
- Top-P (Nucleus) Sampling = sample from the smallest set of tokens whose cumulative probability reaches P; best for creative tasks.
- Temperature reshapes the distribution: low $T$ → focused, high $T$ → creative/risky.
Practice Quiz
Why can greedy decoding produce poor-quality long text despite always choosing the highest-probability token?
- A) Because it picks the top token too slowly.
- B) Because each locally optimal choice constrains future choices; the globally best sentence may require a lower-probability first token.
- C) Because it ignores the model's attention weights.
Answer: B
Setting temperature $T = 0$ is equivalent to which decoding strategy?
- A) Beam Search with K=5.
- B) Greedy decoding: as $T \to 0$, the softmax gives all probability mass to the argmax token.
- C) Top-P sampling with P=0.5.
Answer: B
You are generating a SQL query from a natural language prompt. Which settings minimize hallucination?
- A) T=1.5 with top-p=0.99 for maximum exploration.
- B) T=0.0–0.2 (near-greedy) with a strict output schema; SQL has a correct answer, and randomness only adds errors.
- C) Beam Search with K=5 always produces the best SQL.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms
