
Text Decoding Strategies: Greedy, Beam Search, and Sampling

How does an LLM choose the next word? It's not just random. We explore Greedy Search, Beam Search...

Abstract Algorithms · 15 min read

AI-assisted content.

TLDR: An LLM doesn't "write" text — it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy decoding is fast but repetitive; beam search improves quality at extra compute cost; temperature plus top-p sampling trades determinism for creativity. Temperature is the creativity knob; choose settings based on your task.

📖 The "Finish the Sentence" Game

Imagine flashing the phrase "The dog jumps over the ___" to a crowd:

  • 60% shout "Fence"
  • 20% shout "Cat"
  • 15% shout "River"
  • 5% shout "Moon"

The LLM does this for every single token. The question is: do you always pick "Fence" (greedy), do you explore multiple branches (beam search), or do you randomly sample from the crowd (sampling)?


🔍 Core Concepts: Tokens, Logits, and Probability Distributions

Before diving into how decoding strategies differ, it helps to understand exactly what is happening inside the model at each generation step.

What is a token? A token is not the same as a word. Modern LLMs use sub-word tokenization — common words become single tokens, but rarer or longer words get split into smaller pieces. For example, "unbelievable" might be tokenized as ["un", "believ", "able"]. GPT-series models use a vocabulary of roughly 50,000 tokens. Every time the model generates text, it is choosing the next item from this ~50,000-item menu.

What are logits? After processing the input through all its transformer layers, the model produces one raw score — called a logit — for every token in the vocabulary. These are unnormalized numbers: a logit of 5.0 for "fence" and 2.0 for "cat" tells you the model leans toward "fence," but by how much depends on all 50,000 scores together.

What does softmax do? The softmax function converts the entire logit vector into a valid probability distribution — all values become positive and they sum to exactly 1.0. Only after softmax do we have probabilities we can actually sample from.
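A minimal plain-Python sketch of this conversion (the logit values are illustrative, echoing the "fence"/"cat" example above, not real model outputs):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is mathematically unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for four candidate tokens (illustrative values only)
logits = {"fence": 5.0, "cat": 2.0, "river": 1.5, "moon": 0.5}
probs = dict(zip(logits, softmax(list(logits.values()))))
for tok, p in probs.items():
    print(f"{tok:>6}: {p:.1%}")
```

Note how the raw gap of 3.0 between "fence" and "cat" becomes roughly a 20× probability ratio — softmax is exponential in the logit differences.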

| Concept | Plain-English Meaning | Why It Matters |
|---|---|---|
| Token | Sub-word unit; ~50k in vocabulary | Decoding operates token-by-token |
| Logit | Raw unnormalized score from the model's last layer | Higher logit = more likely token |
| Softmax | Converts logits → valid probabilities summing to 1.0 | Enables probabilistic selection |
| Decoding strategy | Rule for picking the next token from the distribution | Controls quality, diversity, speed |

The key insight: the LLM doesn't "write" text like a human does. It iterates — take the current context, pass it through the model, get a probability distribution over all 50,000 tokens, pick one token according to the decoding strategy, append it to the context, and repeat. That choice rule — the decoding strategy — has an enormous impact on output quality, diversity, and coherence.
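That iterate-and-append loop can be sketched in a few lines of plain Python. Everything here is a toy stand-in — `toy_model` is a hypothetical two-rule lookup, not a transformer — but the control flow is the same one a real decoder runs:

```python
import random

def toy_model(context):
    # Hypothetical stand-in for a transformer forward pass:
    # returns a probability distribution over a tiny vocabulary.
    if context[-1] == "the":
        return {"fence": 0.6, "cat": 0.2, "river": 0.15, "moon": 0.05}
    return {"the": 0.5, ".": 0.5}

def generate(context, choose, max_new_tokens=5):
    context = list(context)
    for _ in range(max_new_tokens):
        dist = toy_model(context)   # 1. get distribution over next tokens
        token = choose(dist)        # 2. pick one (the decoding strategy)
        context.append(token)       # 3. append to the context and repeat
        if token == ".":
            break
    return context

greedy = lambda dist: max(dist, key=dist.get)                       # always the top token
sample = lambda dist: random.choices(list(dist), weights=dist.values())[0]

print(generate(["the"], greedy))
```

Run as-is, the greedy strategy immediately falls into a "the fence the fence" repetition loop — a miniature version of the failure mode real greedy decoding shows on long outputs.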

Why does the same prompt give different answers? Unless you use pure greedy decoding (temperature T=0), the model makes a probabilistic draw at every token position, so two runs of the same prompt can produce different outputs. This is intentional for creative tasks and something to disable for deterministic ones like SQL generation.


🔢 The Three Core Strategies

1. Greedy Decoding

Always pick the single highest-probability token at each step.

Input: "The dog jumps over the"
Step 1: "fence" (highest probability token)
Output: "The dog jumps over the fence."

Pros: Fast, deterministic, reproducible. Cons: Locally optimal choices lead to globally suboptimal sequences. The model can paint itself into a corner — e.g., choosing "fence" first makes "and runs away" the natural continuation, even if "cat and runs away together" would have scored higher as a complete sentence.
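A tiny worked example of that corner-painting effect, with made-up probabilities (these numbers are illustrative, not from any model):

```python
# Hypothetical two-step continuation probabilities after "The dog jumps over the"
paths = {
    ("fence", "and runs away"):          0.60 * 0.30,
    ("fence", "to the other side"):      0.60 * 0.10,
    ("cat",   "and runs away together"): 0.20 * 0.95,
}

# Greedy commits to "fence" (0.60) at step 1 and can never reach the "cat" path.
best_greedy = max(p for path, p in paths.items() if path[0] == "fence")
best_global = max(paths.values())
print(f"best greedy path score: {best_greedy:.3f}")   # 0.180
print(f"best global path score: {best_global:.3f}")   # 0.190
```

The globally best sentence starts with the locally worse token — exactly the case greedy decoding can never recover from.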

2. Beam Search

Keep the top-K candidate sequences (beams) in parallel. At each step, expand all K beams, score the resulting continuations, and keep only the top K.

K=3 beams after step 1:
  Beam 1: "... fence" (log-prob: -0.5)
  Beam 2: "... cat"   (log-prob: -1.2)
  Beam 3: "... river" (log-prob: -1.8)

After step 2 (expand each beam):
  Beam 1a: "... fence and ..."  (-0.5 + -0.3)
  Beam 1b: "... fence to ..."   (-0.5 + -0.6)
  Beam 2a: "... cat and ..."    (-1.2 + -0.2)  ← might beat 1b!
  ...

Pros: Better global sequence scores than greedy. Cons: Computationally K× more expensive. Beam outputs tend to be generic/safe — less creative.
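A compact sketch of the expand-score-prune loop on a toy model (the `toy_step` lookup table is hypothetical; a real implementation would call the model and typically expand only the top few tokens per beam rather than the whole vocabulary):

```python
import math

def toy_step(prefix):
    # Hypothetical next-token distributions for a tiny example vocabulary.
    table = {
        ():         {"fence": 0.6, "cat": 0.2, "river": 0.2},
        ("fence",): {"and": 0.5, "to": 0.5},
        ("cat",):   {"and": 0.9, "sat": 0.1},
        ("river",): {"bank": 0.7, "and": 0.3},
    }
    return table.get(prefix, {"<eos>": 1.0})

def beam_search(k=2, steps=2):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for tok, p in toy_step(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # Prune: keep only the top-k sequences by cumulative log-prob.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))
```

Scoring with summed log-probabilities (rather than multiplying raw probabilities) is the standard trick to avoid numerical underflow on long sequences.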

3. Sampling

Draw a token randomly according to the probability distribution — not always the top one.

Pros: Diverse, creative outputs. Reduces repetition. Cons: Can produce incoherent text if the probability distribution is too flat.

Top-K Sampling

Only sample from the K highest-probability tokens. Prevents picking absurdly rare tokens.
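A minimal sketch of the filtering step, using the crowd-sourced distribution from the opening example:

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize to sum to 1.
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}
filtered = top_k_filter(probs, 2)
print(filtered)   # only "fence" and "cat" survive
```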

Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability ≥ P.

P=0.9: include tokens until you've covered 90% of probability mass
If vocab: [fence: 60%, cat: 20%, river: 15%, moon: 5%]
Top-P=0.9 set: [fence(60%), cat(20%), river(15%)] → sample from these 3
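The same worked example as a plain-Python sketch:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p,
    # then renormalize so the kept probabilities sum to 1.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(q for _, q in kept)
    return {tok: q / total for tok, q in kept}

probs = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)   # fence, cat, river — moon is dropped
```

Unlike top-K, the size of the kept set changes per step: a confident distribution may keep one token, a flat one may keep hundreds.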

| Strategy | Diversity | Quality | Use Case |
|---|---|---|---|
| Greedy | None | High | Factual Q&A, code generation |
| Beam Search (K=5) | Low | High | Translation, summarization |
| Top-K Sampling | Medium | Medium | Creative writing |
| Top-P Sampling | High | Medium-High | Dialogue, creative tasks |

📊 Beam Search Step-by-Step Sequence

sequenceDiagram
    participant M as Model
    participant B as Beam Buffer

    M->>B: Position 1: generate top-K tokens
    B->>B: Keep K=3 beams with scores
    M->>B: Position 2: expand each beam
    B->>B: Score all K candidates
    B->>B: Prune: keep top-K by cumulative score
    M->>B: Position 3: expand K beams again
    B->>B: Prune to top-K
    B->>M: Return highest-score complete sequence

This sequence diagram illustrates how beam search avoids the single-path trap of greedy decoding. At each position the Model generates candidate tokens, the Beam Buffer scores every continuation of the K current beams and prunes back to the top K — so even if the initially most likely token leads to a poor continuation, a lower-probability first choice can still survive if its overall cumulative score is competitive. The takeaway: beam search finds globally better sequences, but at roughly K× the compute cost and with a tendency toward safe, generic output.


⚖️ Trade-offs & Failure Modes: When Strategies Go Wrong

Each strategy has failure patterns worth knowing before you deploy:

| Mistake | Strategy | Problem | Fix |
|---|---|---|---|
| Greedy for long creative text | Greedy | Repetitive loops, "dead ends" | Switch to Top-P sampling |
| High temperature for SQL/code | T > 1.0 | Invalid tokens inserted | Use T = 0 with structured schema |
| Beam search for chat responses | Beam K=5 | Generic, unnatural dialogue | Use Top-P with T=0.7 instead |
| Top-P too narrow (P=0.1) for stories | Sampling | Boring, repetitive output | Raise P to 0.85–0.95 |
| Temperature set once and never tuned | Any | Wrong quality/creativity balance | Tune per task type |

⚙️ Temperature: The Creativity Dial

Temperature $T$ reshapes the probability distribution before sampling:

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Where $z_i$ are the raw logit scores.

| Temperature | Effect | Output Character |
|---|---|---|
| $T = 0$ | Greedy (argmax) — one token wins entirely | Deterministic, repetitive |
| $T = 1.0$ | Standard distribution (no change) | Balanced |
| $T < 1$ (e.g., 0.3) | Sharper distribution — top token dominates even more | Focused, precise |
| $T > 1$ (e.g., 1.8) | Flatter distribution — rare tokens become more likely | Creative, potentially incoherent |

Numerical example at $T = 0.5$ vs $T = 2.0$ for two tokens A (logit 2.0) and B (logit 1.0):

| Temperature | P(A) | P(B) | Interpretation |
|---|---|---|---|
| 1.0 | 73% | 27% | Standard |
| 0.5 | 88% | 12% | Much more confident in A |
| 2.0 | 62% | 38% | B nearly as likely as A |

flowchart LR
    Logits[Raw Logits z_i] --> Temp[Divide by T]
    Temp --> Softmax[Softmax → Probabilities]
    Softmax --> Decode["Decoding Strategy (greedy / top-k / nucleus)"]
    Decode --> Token[Selected Token]

This flowchart shows the three-step temperature pipeline: raw logits are divided by T (sharpening or flattening the distribution), softmax converts them into valid probabilities, and the chosen decoding strategy selects a token. Temperature is applied before sampling, which means it amplifies or suppresses the effect of top-p and top-k filtering downstream. At T→0 the softmax output collapses to a one-hot vector (greedy); at T>1 the distribution flattens and even low-probability tokens become viable candidates.
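The reshaping is easy to verify numerically. A plain-Python sketch using the two-token example from the table above (logits 2.0 and 1.0):

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # tokens A and B
for T in (0.5, 1.0, 2.0):
    pA, pB = softmax_with_temperature(logits, T)
    print(f"T={T}: P(A)={pA:.0%}, P(B)={pB:.0%}")
```

At T=0.5 the probabilities come out near 88%/12%, at T=1.0 near 73%/27%, and at T=2.0 near 62%/38% — the same logit gap, three very different distributions.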


📊 Decoding Strategy Selection: Pipeline Flow

Each token generation step applies the selected decoding strategy to transform raw logits into the next output token.

flowchart LR
    A[LLM Logits] --> B{Decoding Strategy?}
    B -->|Greedy| C[argmax token]
    B -->|Beam Search| D[Expand K beams]
    B -->|Sampling| E[Apply Temperature]
    E --> F{top-p or top-k filter}
    F --> G[Sample from distribution]
    D --> H[Prune to K beams]
    C --> I[Next Token]
    G --> I
    H --> I

🧭 Decision Guide: Decoding Strategy Selection

Choosing the right decoding strategy depends on three questions about your task: Does it need creativity? Does it need speed? Does it need globally optimal sequence quality? The decision tree below maps those constraints to concrete settings.

flowchart TD
    Task[Define Generation Task]
    Creative{Is creativity desired?}
    Speed{Is speed critical?}
    Quality{Is sequence quality critical?}

    Greedy["Greedy Decoding<br/>T=0, deterministic<br/>Fastest"]
    Beam["Beam Search K=5<br/>Better global quality<br/>Slower"]
    TopP["Top-P Sampling<br/>P=0.9, T=0.8<br/>Diverse & creative"]
    TopK["Top-K Sampling<br/>K=40, T=1.0<br/>Moderate diversity"]

    Task --> Creative
    Creative -->|No| Speed
    Creative -->|Yes| TopP
    Speed -->|Yes| Greedy
    Speed -->|No| Quality
    Quality -->|Yes| Beam
    Quality -->|No| TopK

Figure: Decision tree for selecting a decoding strategy based on your task's creativity, speed, and quality requirements.

How to use this tree in practice: Start at "Define Generation Task" and answer each question honestly. A SQL generator needs neither creativity nor beam quality — greedy wins. A chatbot needs creativity — go Top-P. An offline translation pipeline needs quality and can afford extra compute — beam search. A game dialogue system needs moderate diversity without the beam overhead — Top-K. Most production APIs default to Top-P sampling with temperature tuning and do not expose beam search directly; knowing where your use case sits on this tree helps you configure temperature and top-p values intelligently.


🌍 Real-World Applications of Decoding Strategies

Different applications sit in very different parts of the decision tree above. Here is how major production use cases map to strategies:

| Application | Recommended Strategy | Why |
|---|---|---|
| Machine translation | Beam Search (K=4–8) | Sequence-level quality matters; a near-correct reference answer exists |
| Creative story writing | Top-P=0.9, T=0.9–1.1 | Diversity and surprise are desirable; no single correct continuation |
| SQL generation | Greedy or T=0.0–0.1 | SQL is a deterministic task; randomness introduces syntax errors |
| Chatbot dialogue | Top-P=0.9, T=0.7 | Natural conversational variance; avoids robotic repetition |
| Code completion | Greedy or T=0.1, top-p=0.95 | Code has correctness constraints; focused output reduces hallucination |
| Factual Q&A | T=0.0–0.2 | Factual answers have a correct form; randomness only adds noise |

A note on production APIs: Most hosted LLM APIs (OpenAI, Anthropic, Google Gemini) do not expose beam search as a parameter. Their defaults are top-p sampling with temperature tuning. Beam search remains dominant in offline pipelines — machine translation frameworks (MarianMT, fairseq) and speech recognition (Whisper) — where running K beams is acceptable and sequence-level optimality is paramount.

One important misconception to clear up: setting temperature=0 in the OpenAI API gives you greedy-like output, not beam search. Many developers assume low temperature equals beam search; it does not. Beam search explores multiple sequence branches in parallel and scores them globally; temperature=0 just makes greedy decoding deterministic by collapsing the softmax to an argmax.

📊 Decoding Strategy Latency vs Diversity

flowchart TD
    Token[Next Token Position]
    Greedy["Greedy: argmax(logits)<br/>one token, fastest"]
    BeamK["Beam Search: K=5<br/>K parallel sequences<br/>K× slower, best quality"]
    TopK["Top-K: sample from K<br/>balanced diversity/speed"]
    TopP["Top-P: nucleus 0.9<br/>dynamic vocab, diverse"]
    Temp["Temperature T<br/>reshapes distribution<br/>applied before sampling"]

    Token --> Greedy
    Token --> BeamK
    Token --> TopK
    Token --> TopP
    TopK --> Temp
    TopP --> Temp

This diagram compares all four decoding strategies along the latency-diversity axis simultaneously. Greedy is fastest but produces no diversity; Beam Search is slowest and offers the best global sequence quality but also the least variety; Top-K and Top-P sampling occupy the middle ground — Top-P dynamically adjusts the candidate vocabulary per step while Top-K uses a fixed cutoff. Temperature feeds into both sampling methods, making it the shared dial that amplifies or dampens diversity within whatever nucleus or top-K constraint you have set.


🧪 Exploring Decoding Strategies in Practice

Exercise 1 — Greedy vs. Sampling Side-by-Side

Use the HuggingFace transformers library to compare strategies on the same prompt:

from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
prompt = "The future of renewable energy is"

# Greedy decoding — deterministic, same result every run
greedy = gen(prompt, max_new_tokens=50, do_sample=False)

# Top-P nucleus sampling — varied output each run
sampled = gen(prompt, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.8)

print("Greedy:", greedy[0]["generated_text"])
print("Sampled:", sampled[0]["generated_text"])

Run this five times. The greedy output is identical every run; the sampled output varies. Greedy outputs often start looping or repeating phrases after a few sentences — the classic symptom of locally optimal but globally poor token choices.

Exercise 2 — Temperature Spectrum

Generate five completions at T=0.1, T=0.7, and T=1.5 for the prompt "The future of AI is". Observe: at T=0.1 the output is near-deterministic and focused but may repeat phrases; at T=0.7 there is a balanced diversity-coherence trade-off; at T=1.5 the output becomes creative but risks incoherence or going off-topic. This exercise builds direct intuition for the creativity-vs-coherence dial that temperature controls.

Exercise 3 — Nucleus Size with the OpenAI API

Use the OpenAI API to test top_p=0.1 versus top_p=0.95 on a creative prompt such as "Write the opening line of a science fiction novel set on Europa". At top_p=0.1 sampling is restricted to the top 10% of probability mass — outputs are safe and repetitive. At top_p=0.95 the model ranges across 95% of probable continuations — outputs vary dramatically. Note how top-p and temperature interact: high top-p gives temperature room to explore; low top-p limits exploration regardless of temperature setting.


🧠 Deep Dive: Practical Defaults by Task

| Task | Recommended Settings |
|---|---|
| Code generation | Greedy or T=0.1, top-p=0.95 |
| Factual Q&A | T=0.0–0.2 (deterministic) |
| Creative writing | T=0.8–1.2, top-p=0.9 |
| API calls / JSON output | T=0.0 + strict output schema |
| Dialogue/chatbot | T=0.7, top-p=0.9 |

🔬 Internals

Greedy decoding selects the argmax at each step: $t_i = \arg\max_t P(t \mid t_1, \dots, t_{i-1})$. Beam search maintains $k$ partial sequences ("beams"), expanding each at every step and pruning to keep only the top $k$ by cumulative log-probability. Sampling draws $t_i \sim P(\cdot \mid \text{context})$ after optional temperature scaling and vocabulary truncation (top-k/top-p), breaking the deterministic mode-seeking behavior of beam search.

⚡ Performance Analysis

Greedy decoding runs at peak throughput (~100–150 tokens/second on an A100 for a 7B model) since it processes one candidate per step. Beam search with k=5 is ~4× slower due to expanded batch size and is prone to repetition on open-ended tasks. Nucleus sampling (top-p=0.9) matches greedy throughput and scores 20–30% higher on human preference evaluations for creative and conversational tasks.

🛠️ HuggingFace Transformers: Controlling Decoding with GenerationConfig

HuggingFace Transformers is the leading open-source Python library for loading and running transformer models. Its GenerationConfig object centralizes every decoding parameter — do_sample, temperature, top_p, num_beams — into a single, versionable configuration that can be saved alongside the model checkpoint and swapped without reloading the model.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# --- Strategy 1: Greedy — deterministic, ideal for SQL / code generation ---
greedy_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,          # argmax at every step: T→0 behaviour
    temperature=1.0,
)

# --- Strategy 2: Beam Search — high sequence quality, good for translation ---
beam_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,
    num_beams=4,              # keep 4 candidate sequences in parallel
    early_stopping=True,
)

# --- Strategy 3: Nucleus Sampling — creative, good for chat / story generation ---
nucleus_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,          # slightly sharpen the distribution
    top_p=0.92,               # sample from top-92% probability mass
    top_k=0,                  # disable top-k; use pure nucleus sampling
)

prompt  = "Explain why the sky is blue in one paragraph."
inputs  = tokenizer(prompt, return_tensors="pt").to(model.device)

for label, cfg in [("Greedy", greedy_cfg), ("Beam", beam_cfg), ("Nucleus", nucleus_cfg)]:
    with torch.inference_mode():
        output = model.generate(**inputs, generation_config=cfg)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\n[{label}]\n{text[len(prompt):]}")

GenerationConfig cleanly separates your decoding strategy from model loading — you can A/B test strategies in production by swapping configs without touching model weights. Save the winning configuration with cfg.save_pretrained("./configs/nucleus") so results are fully reproducible across environments and deployments.

For a full deep-dive on HuggingFace Transformers generation internals, constrained decoding, and fine-tuning workflows, a dedicated follow-up post is planned.


📚 Five Lessons Every LLM Developer Should Internalize

  1. Greedy decoding is suboptimal for long sequences. Each locally best token constrains what follows — the globally best sentence may require starting with a slightly lower-probability token. This is why greedy outputs often feel stilted or start looping after a few sentences.

  2. Beam search improves global sequence quality but produces "safe" text. It is the right tool for translation and summarization where a near-correct answer exists, but it actively penalizes diversity. Do not use it for open-ended creative generation.

  3. Top-P (nucleus) sampling is the most common production default. It balances quality and diversity by concentrating sampling in the high-probability region while still allowing variation. Most API defaults sit at P=0.9–0.95.

  4. Temperature and top-p are complementary, not alternatives. Temperature reshapes the entire probability distribution (sharpening or flattening it); top-p then truncates the distribution after reshaping. Use both together deliberately rather than treating them as the same knob.

  5. For factual or structured output, use T=0–0.2. SQL, JSON, code, and factual Q&A all have correct answers. Randomness only introduces noise. Near-zero temperature is the appropriate default; beam search can be layered on top in offline pipelines where sequence-level quality is paramount.


📌 TLDR: Summary & Key Takeaways

  • Greedy = always pick the best token; fast but suboptimal.
  • Beam Search = keep top-K candidate sequences; better quality, more compute.
  • Top-P (Nucleus) Sampling = sample from the smallest set covering P% of probability mass; best for creative tasks.
  • Temperature reshapes the distribution: low $T$ → focused, high $T$ → creative/risky.



Written by Abstract Algorithms (@abstractalgorithms)