
Prompt Engineering Guide: From Zero-Shot to Chain-of-Thought

Is Prompt Engineering a real skill? Yes. We explore the science behind talking to AI.

Abstract Algorithms · 12 min read

TLDR: Prompt Engineering is the art of writing instructions that guide an LLM toward the answer you want. Zero-Shot, Few-Shot, and Chain-of-Thought are systematic techniques — not guesswork — that can dramatically improve accuracy without changing a single model weight.


📖 The Super-Smart Intern With Zero Initiative

Imagine a brilliant intern who has read every book ever written but has no common sense. Give them a vague instruction ("Fix this") and they stare blankly. Give them a structured instruction and they execute perfectly.

Prompt Engineering is the art of writing the structured instruction.

Since LLMs predict the next token based on the input, how you frame the input directly steers which prediction paths are activated.


🔍 How Prompts Work: The Fundamentals

Every time you send text to an LLM, the entire input — system message, instructions, examples, and your question — is packed into a context window. The model reads it all at once and predicts the next most likely token. That prediction is directly shaped by every word you included.

Three core ingredients of a well-formed prompt:

| Ingredient | What It Does | Example |
|---|---|---|
| Role / Persona | Tells the model who it is | "You are a senior data engineer." |
| Task Description | Tells the model what to do | "Classify the sentiment of the review below." |
| Output Format | Tells the model how to respond | "Respond with one word: Positive or Negative." |

Missing any one of these is the most common source of vague or off-topic responses.
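Assembled mechanically, the three ingredients look like this. A minimal sketch; `build_prompt` is an illustrative helper, not a library function:

```python
def build_prompt(role: str, task: str, output_format: str, text: str) -> str:
    """Assemble the three core ingredients, then the input text."""
    return (
        f"{role}\n\n"           # Role / Persona: who the model is
        f"{task}\n\n"           # Task description: what to do
        f"{output_format}\n\n"  # Output format: how to respond
        f"Input:\n{text}"
    )

prompt = build_prompt(
    role="You are a senior data engineer.",
    task="Classify the sentiment of the review below.",
    output_format="Respond with one word: Positive or Negative.",
    text='"The acting was incredible and the plot kept me on the edge of my seat."',
)
print(prompt)
```

The point is not the helper itself but the habit: every prompt you ship should answer who, what, and how before the input text appears.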

Tokenization basics: LLMs don't read words — they read tokens (roughly 3–4 characters each). A prompt like "Explain recursion" becomes tokens like ["Explain", " rec", "urs", "ion"]. The model's attention mechanism weighs each token against every other token in the context window.
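For quick budgeting of context-window space, the rough characters-per-token figure above can be turned into a back-of-envelope estimator. This is a heuristic only, not a real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Back-of-envelope only: English text averages roughly 4
    characters per token. Real tokenizers use learned subword
    vocabularies, so actual counts vary by model."""
    return max(1, len(text) // 4)

print(rough_token_estimate("Explain recursion"))  # 17 chars -> 4
```

For exact counts, use the tokenizer shipped with your model; the heuristic is only good enough to catch prompts that are wildly over budget.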

In-context learning: Unlike fine-tuning, prompt engineering does not update model weights. You are not teaching the model — you are steering the probability distribution of its next-token predictions from within the context alone. This is why example quality and instruction precision matter so much: the model extrapolates your pattern, it does not memorize it.

Key insight: The model always answers the question "what token is most likely to follow this input?" Your prompt is the input. Better structure = higher probability of the right output.


🔢 The Four Core Techniques

1. Zero-Shot: Ask Directly

Give the task with no examples:

Classify this movie review as Positive or Negative:
"The acting was incredible and the plot kept me on the edge of my seat."

Classification:

Best when: The task is well-defined and the model likely saw similar tasks in pre-training.

2. Few-Shot: Teach by Example

Provide 2–5 examples before the actual question:

Review: "Disappointing. Slow pacing, weak script." → Negative
Review: "A masterpiece. Emotional and visually stunning." → Positive
Review: "Predictable but fun summer blockbuster." → ?

Why it works: The examples prime the model's "in-context learning." It extracts the pattern from the examples rather than relying on training data alone.

Tip: Select diverse examples that cover edge cases. Avoid examples biased toward one label.
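The example-then-query pattern above is easy to generate programmatically, which keeps formatting perfectly consistent across examples. A minimal sketch (the helper name is illustrative):

```python
EXAMPLES = [
    ("Disappointing. Slow pacing, weak script.", "Negative"),
    ("A masterpiece. Emotional and visually stunning.", "Positive"),
]

def few_shot_prompt(examples, query):
    """Render labeled examples in a fixed pattern, then the
    unanswered query so the model completes the pattern."""
    lines = [f'Review: "{review}" → {label}' for review, label in examples]
    lines.append(f'Review: "{query}" → ')
    return "\n".join(lines)

print(few_shot_prompt(EXAMPLES, "Predictable but fun summer blockbuster."))
```

Consistent formatting matters: if one example uses `→` and another uses `:`, the model may imitate the inconsistency instead of the labels.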

3. Chain-of-Thought (CoT): Think Step by Step

For complex reasoning tasks, instruct the model to show its work:

Q: A train travels 60 km/h. It departs at 9:00 AM and arrives at 11:00 AM.
How far did it travel?

A: Let me think step by step.
- Travel time = 11:00 AM - 9:00 AM = 2 hours.
- Distance = speed × time = 60 × 2 = 120 km.
Answer: 120 km.

Simply adding "Let's think step by step" to a math or logic prompt can improve accuracy by 15–30% on reasoning benchmarks. This zero-shot variant is due to Kojima et al. (2022); few-shot chain-of-thought prompting with worked examples was introduced by Wei et al. (2022).

4. System Role + Persona

Set a persona in the system message to constrain tone, expertise, and output format:

SYSTEM: You are a senior PostgreSQL DBA. Always respond with:
1. The SQL query
2. The query plan consideration
3. Index recommendations

USER: Find all users who haven't logged in for 90+ days.

| Technique | Best For | Token Cost |
|---|---|---|
| Zero-Shot | Simple, well-defined tasks | Low |
| Few-Shot | Classification, extraction, formatting | Medium |
| Chain-of-Thought | Math, logic, multi-step reasoning | High |
| System Persona | Consistent output format, expertise framing | Low overhead |

⚙️ The "Lost in the Middle" Problem

Research shows LLMs attend most to content at the beginning and end of the prompt; information buried in the middle receives markedly less attention (the "Lost in the Middle" finding, Liu et al., 2023).

[INSTRUCTION] [pages of context] [QUESTION]  ← High accuracy (instruction at start)
[pages of context] [INSTRUCTION] [QUESTION]  ← Moderate (instruction at end)
[half the context] [INSTRUCTION] [half the context] [QUESTION]  ← Lower accuracy

Practical rule: Put your most important instructions or the specific text to analyze at either the very start or the very end.
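The practical rule above can be enforced in code rather than by convention. A trivial but useful sketch (`order_prompt` is an illustrative helper, and the example strings are hypothetical):

```python
def order_prompt(instruction: str, context: str, question: str) -> str:
    """Keep the instruction at the very start and the question at the
    very end; bulk context goes in the middle, where attention is
    weakest anyway."""
    return f"{instruction}\n\n{context}\n\n{question}"

p = order_prompt(
    instruction="Answer using only the context below.",
    context="[pages of retrieved documents]",
    question="Question: When does the warranty expire?",
)
print(p)
```

Centralizing prompt assembly like this also stops teammates from accidentally appending context after the question in a long pipeline.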


🧠 Deep Dive: Self-Consistency and Tree-of-Thought

Self-Consistency

Instead of one CoT sample, generate 3–5 different reasoning chains and vote on the most common answer. Reduces reasoning variance on math problems.

from collections import Counter

# generate_cot_response(prompt) is a stand-in for one sampled
# chain-of-thought completion (e.g., temperature ≈ 0.7)
answers = [generate_cot_response(prompt) for _ in range(5)]
final = Counter(answers).most_common(1)[0][0]  # majority vote

Tree-of-Thought (ToT)

For planning problems, generate multiple "thought branches" at each step, evaluate which branch is most promising, and expand only that branch. Outperforms linear CoT on reasoning tasks that require backtracking.
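The branch-and-expand loop can be sketched in a few lines. Note this is a greedy, single-branch simplification of ToT (full ToT keeps several branches alive and can backtrack); `propose` and `score` stand in for LLM-backed generation and evaluation calls:

```python
def tree_of_thought(root, propose, score, depth=3, branches=3):
    """Greedy ToT sketch: at each step, propose several candidate
    thoughts, score them, and expand only the most promising one.
    `propose(state, n)` and `score(state)` are caller-supplied,
    typically wrapping LLM calls."""
    state = root
    for _ in range(depth):
        candidates = propose(state, branches)
        state = max(candidates, key=score)
    return state

# Toy stand-ins: each "thought" appends a digit; the score favors
# numerically larger states, so the best branch is always expanded
best = tree_of_thought(
    "1",
    propose=lambda s, n: [s + str(i) for i in range(n)],
    score=int,
    depth=2,
)
print(best)  # "122"
```

Swapping `max` for a top-k selection plus a frontier queue turns this greedy sketch into the beam-search form used in the original ToT work.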


⚖️ Trade-offs & Failure Modes: Common Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Negative instructions only ("Don't say X") | LLMs focus on what to say; negations are often ignored | Replace with positive framing: "Only say Y" |
| Vague length constraint ("Be brief") | "Brief" means different things | Be specific: "Respond in 2 sentences max" |
| Instructions buried in the middle | Lost in the middle; the LLM may miss them | Move to start or end |
| Temperature = 1.0 for factual tasks | High randomness → hallucinations | Use T = 0 or 0.2 for factual/structured output |
| No output format specified | LLM outputs free-form text when you needed JSON | Specify: "Respond in JSON with fields {label, confidence}" |
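The last fix is only half the job: once you ask for JSON, validate that you actually got it. A minimal validator for the `{label, confidence}` schema mentioned above (the function name is illustrative):

```python
import json

def parse_label(response_text: str) -> dict:
    """Check that the model honored the requested schema:
    {"label": str, "confidence": float}. Raise if it drifted
    back to free-form prose."""
    data = json.loads(response_text)
    if not isinstance(data.get("label"), str):
        raise ValueError("missing string field 'label'")
    if not isinstance(data.get("confidence"), float):
        raise ValueError("missing float field 'confidence'")
    return data

result = parse_label('{"label": "Positive", "confidence": 0.92}')
print(result["label"], result["confidence"])
```

A common pattern is to catch the `ValueError`, re-prompt once with the error message included, and only then fail the request.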

📊 Decision Guide: Prompting Technique Selection

Not sure which technique to reach for? Use this decision tree to match your scenario to the right approach before writing a single token.

flowchart TD
    Start["Define Your Task"]
    Simple{"Is the task\nwell-defined?"}
    Examples{"Do you have\nlabeled examples?"}
    Reasoning{"Multi-step\nreasoning needed?"}
    ZeroShot["Zero-Shot\n(direct instruction)"]
    FewShot["Few-Shot\n(2–5 examples)"]
    CoT["Chain-of-Thought\n(step-by-step)"]
    SelfConsistency["Self-Consistency\n(vote on 5 CoT runs)"]
    Start --> Simple
    Simple -->|Yes| Examples
    Simple -->|No - rephrase| Start
    Examples -->|No| ZeroShot
    Examples -->|Yes| Reasoning
    Reasoning -->|No| FewShot
    Reasoning -->|Yes| CoT
    CoT --> SelfConsistency

Start at the top with your task. If the task is ambiguous, rephrase it before choosing a technique — no prompting strategy rescues a vague objective. Once the task is clear, the branching logic guides you: Zero-Shot for simple well-defined tasks with no examples, Few-Shot when you have labeled examples, Chain-of-Thought when multi-step reasoning is required, and Self-Consistency when you need to reduce variance on high-stakes reasoning outputs.


🌍 Real-World Applications: Prompt Engineering in Production

Prompt engineering is not just a prototyping trick — it underpins real product features across industries. The table below shows common industry use cases and which technique typically drives the best results.

| Industry | Use Case | Primary Technique |
|---|---|---|
| Customer Support | Auto-classify tickets by urgency and topic | Few-Shot (labeled historical tickets) |
| Legal | Extract clause types from contracts | Few-Shot + System Persona |
| Healthcare | Summarize patient notes in plain language | Chain-of-Thought + System Persona |
| Finance | Classify earnings call sentiment | Zero-Shot or Few-Shot |
| Software | Generate SQL from natural language | Chain-of-Thought |
| Education | Create adaptive quiz questions | Zero-Shot + System Persona |

Case Study — SaaS Customer Triage: A mid-size SaaS company was spending 8 engineer-hours per week manually labeling incoming support tickets before routing them to the correct team. After building a Few-Shot classifier prompt with 4 labeled examples per category (billing, bug, feature request, account access), the model achieved 91% routing accuracy on a held-out test set. Manual triage dropped by 70%, and the prompt required no model fine-tuning — just careful example selection. The key lesson: the 4 examples took longer to curate than the prompt itself, and that curation time was the real investment.


🧪 Hands-On Exercises: Putting the Techniques to Work

The fastest way to internalize prompt engineering is to rewrite bad prompts. These three exercises cover the three most common failure modes.

Exercise 1 — Zero-Shot to Few-Shot upgrade

You have this Zero-Shot prompt that struggles with mixed-sentiment reviews:

Classify this review as Positive, Negative, or Mixed:
"The camera is stunning but battery life is terrible."

Rewrite it as a Few-Shot prompt with 3 examples that cover each label class. Make sure one example is genuinely mixed-sentiment so the model learns the boundary.

Exercise 2 — Adding Chain-of-Thought to a calculation

You have this prompt that keeps returning wrong discount totals:

A jacket costs $120. It's 30% off. What is the final price?

Add a Chain-of-Thought instruction that forces the model to show the subtraction step explicitly before giving the final answer. Then test it with a compound discount (e.g., 30% off then an extra 10% off the discounted price).
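To grade your CoT prompt's answers, the expected arithmetic can be verified directly. A small reference calculator (not part of the prompt itself):

```python
def discounted(price: float, *discounts: float) -> float:
    """Apply percentage discounts sequentially, mirroring the
    intermediate steps a CoT prompt should make explicit."""
    for pct in discounts:
        price *= 1 - pct / 100
    return round(price, 2)

print(discounted(120, 30))      # 84.0  (120 - 36)
print(discounted(120, 30, 10))  # 75.6  (84 - 8.4)
```

Note the compound case: the extra 10% comes off the already-discounted $84, not the original $120, which is exactly the step models skip when not prompted to reason step by step.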

Exercise 3 — Fix the vague System Persona

This System Persona prompt produces inconsistent, generic responses:

SYSTEM: You are helpful. Answer user questions.

Identify the three missing prompt ingredients and rewrite the System Persona for a customer support agent at a software company. Include the role, response format (numbered steps), tone (professional but approachable), and one explicit constraint (never speculate about timelines).


🛠️ LangChain and DSPy: Programmatic Prompt Engineering Frameworks

Two open-source frameworks take very different approaches to making prompt engineering systematic and repeatable.

LangChain is the most widely adopted LLM application framework — it provides PromptTemplate, FewShotPromptTemplate, and ChatPromptTemplate to turn prompt strings into typed, testable, composable objects that wire into chains and agents.

from langchain_core.prompts import PromptTemplate, FewShotPromptTemplate

# Few-Shot prompt template with labeled examples
examples = [
    {"review": "Terrible experience, never again.",   "sentiment": "Negative"},
    {"review": "Loved every moment, highly recommend!", "sentiment": "Positive"},
    {"review": "Good product but shipping took forever.", "sentiment": "Mixed"},
]

example_template = PromptTemplate(
    input_variables=["review", "sentiment"],
    template="Review: {review}\nSentiment: {sentiment}",
)

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_template,
    prefix="Classify the sentiment of each review as Positive, Negative, or Mixed.",
    suffix="Review: {input}\nSentiment:",
    input_variables=["input"],
)

# Render the fully assembled few-shot prompt
formatted = few_shot_prompt.format(input="Fast delivery but the item was damaged.")
print(formatted)
# Outputs the prefix + 3 examples + the target review, ready to send to an LLM

DSPy (Stanford) is a higher-level framework that treats prompts as programs rather than strings — you define modules with typed inputs and outputs, and DSPy compiles your program into optimized prompts by automatically selecting the best few-shot examples and instructions for a given LLM and metric.

import dspy

lm = dspy.LM("openai/gpt-4o", temperature=0)
dspy.configure(lm=lm)

# Define a structured reasoning module
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="Positive, Negative, or Mixed")

classifier = dspy.Predict(SentimentClassifier)

# DSPy handles prompt construction and LLM call automatically
result = classifier(review="Fast delivery but the item was damaged.")
print(result.sentiment)  # → Mixed

| Framework | Approach | Best for |
|---|---|---|
| LangChain | Template + chain composition | Application development, RAG, agents |
| DSPy | Compiled prompt programs | Research, automatic prompt optimization, evaluation-driven pipelines |

For a full deep-dive on LangChain and DSPy, dedicated follow-up posts are planned.


📚 Five Lessons From Building Prompt-Driven Features

These lessons come from the gap between "works in the playground" and "works reliably in production."

  1. Specificity beats cleverness. An explicit role + task + output format instruction outperforms any amount of creative framing. If your prompt reads like a riddle, rewrite it like a specification.

  2. Examples beat descriptions. Showing the model two doctor-style responses is more effective than writing "respond like a doctor." The model extrapolates behavior from examples far better than from adjective lists.

  3. Temperature is a dial, not a toggle. Use temperature=0 for factual retrieval, classification, and structured output. Use temperature=0.7–0.9 for creative writing, brainstorming, and ideation. High temperature on factual tasks is the single largest source of hallucinations in production systems.

  4. Lost in the Middle is real. If you have 10 pages of context and one question, put the question first or last. LLMs consistently under-attend to information buried in the middle of long prompts — this is not a model quirk, it is a documented research finding.

  5. Test your prompts empirically. Run your prompt against 20 representative inputs and measure accuracy before shipping. Prompt intuition is unreliable; a 5-example spot-check will miss systematic failure modes that only appear at scale.
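Lesson 5 can be operationalized in a few lines. A sketch of the simplest possible harness, where `predict` wraps whatever prompt + model call you are testing (the stand-in classifier below is purely for demonstration):

```python
def accuracy(predict, labeled_examples):
    """Tiny eval harness: run a predictor over labeled inputs and
    report the fraction it gets right. `predict` is any callable
    wrapping your prompt + LLM call."""
    correct = sum(predict(text) == label for text, label in labeled_examples)
    return correct / len(labeled_examples)

# Toy stand-in for a prompted LLM call, for demonstration only
def fake_classifier(text):
    return "Positive" if "love" in text.lower() else "Negative"

tests = [("I love this product", "Positive"), ("Broke after one day", "Negative")]
print(accuracy(fake_classifier, tests))  # 1.0
```

Swap `fake_classifier` for a real API call, grow the test set to 20+ representative inputs, and re-run it on every prompt change, exactly like a unit test suite.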


📌 TLDR: Summary & Key Takeaways

  • Few-Shot teaches by example — 2–5 diverse examples prime the model's pattern matching.
  • Chain-of-Thought ("step by step") dramatically improves reasoning accuracy on multi-step problems.
  • System Personas constrain output format and expertise level consistently.
  • Lost in the Middle: Put critical instructions at the start or end of the prompt.

📝 Practice Quiz

  1. Adding "Let's think step by step" to a prompt is an example of which technique?

    • A) Zero-Shot prompting.
    • B) Chain-of-Thought prompting — the model reasons through intermediate steps before answering.
    • C) Few-Shot prompting.
    • D) System Persona prompting.

    Correct Answer: B — Chain-of-Thought explicitly instructs the model to show its reasoning steps before reaching a conclusion.
  2. You have a prompt with 8 pages of document context. Where should you place your question for best accuracy?

    • A) In the middle of the context, surrounded by relevant text.
    • B) At the beginning or end — LLMs attend most to those positions (Lost in the Middle finding).
    • C) After a separator like "---" for visual clarity.
    • D) It doesn't matter; LLMs read every token equally.

    Correct Answer: B — Research shows LLM accuracy degrades for information buried in the middle of long prompts.
  3. Your LLM keeps returning free-form text when you need structured JSON output. What is the simplest fix?

    • A) Lower the temperature to 0.
    • B) Explicitly specify the output format: "Respond in JSON with fields: {label: string, confidence: float}".
    • C) Use Few-Shot examples with prose outputs.
    • D) Switch to a different LLM provider.

    Correct Answer: B — Explicit output format instructions in the prompt are the most reliable way to enforce structured responses.

Written by Abstract Algorithms (@abstractalgorithms)