
Prompt Engineering Guide: From Zero-Shot to Chain-of-Thought

Is Prompt Engineering a real skill? Yes. We explore the science behind talking to AI.

Abstract Algorithms · 12 min read

TLDR: Prompt Engineering is the art of writing instructions that guide an LLM toward the answer you want. Zero-Shot, Few-Shot, and Chain-of-Thought are systematic techniques — not guesswork — that can dramatically improve accuracy without changing a single model weight.


📖 The Super-Smart Intern With Zero Initiative

Imagine a brilliant intern who has read every book ever written but has no common sense. Give them a vague instruction ("Fix this") and they stare blankly. Give them a structured instruction and they execute perfectly.

Prompt Engineering is the art of writing the structured instruction.

Since LLMs predict the next token based on the input, how you frame the input directly steers which prediction paths are activated.


🔍 How Prompts Work: The Fundamentals

Every time you send text to an LLM, the entire input — system message, instructions, examples, and your question — is packed into a context window. The model reads it all at once and predicts the next most likely token. That prediction is directly shaped by every word you included.

Three core ingredients of a well-formed prompt:

| Ingredient | What It Does | Example |
|---|---|---|
| Role / Persona | Tells the model who it is | "You are a senior data engineer." |
| Task Description | Tells the model what to do | "Classify the sentiment of the review below." |
| Output Format | Tells the model how to respond | "Respond with one word: Positive or Negative." |

Missing any one of these is the most common source of vague or off-topic responses.
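Assembled mechanically, the three ingredients look like this. A minimal sketch; `build_prompt` is an illustrative helper, not a library function:

```python
def build_prompt(role: str, task: str, output_format: str, text: str) -> str:
    """Assemble the three core ingredients, then the input text."""
    return (
        f"{role}\n\n"           # Role / Persona: who the model is
        f"{task}\n\n"           # Task description: what to do
        f"{output_format}\n\n"  # Output format: how to respond
        f"Input:\n{text}"
    )

prompt = build_prompt(
    role="You are a senior data engineer.",
    task="Classify the sentiment of the review below.",
    output_format="Respond with one word: Positive or Negative.",
    text='"The acting was incredible and the plot kept me on the edge of my seat."',
)
print(prompt)
```

The point is not the helper itself but the habit: every prompt you ship should answer who, what, and how before the input text appears.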

Tokenization basics: LLMs don't read words — they read tokens (roughly 3–4 characters each). A prompt like "Explain recursion" becomes tokens like ["Explain", " rec", "urs", "ion"]. The model's attention mechanism weighs each token against every other token in the context window.
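For quick budgeting of context-window space, the rough characters-per-token figure above can be turned into a back-of-envelope estimator. This is a heuristic only, not a real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Back-of-envelope only: English text averages roughly 4
    characters per token. Real tokenizers use learned subword
    vocabularies, so actual counts vary by model."""
    return max(1, len(text) // 4)

print(rough_token_estimate("Explain recursion"))  # 17 chars -> 4
```

For exact counts, use the tokenizer shipped with your model; the heuristic is only good enough to catch prompts that are wildly over budget.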

In-context learning: Unlike fine-tuning, prompt engineering does not update model weights. You are not teaching the model — you are steering the probability distribution of its next-token predictions from within the context alone. This is why example quality and instruction precision matter so much: the model extrapolates your pattern, it does not memorize it.

Key insight: The model always answers the question "what token is most likely to follow this input?" Your prompt is the input. Better structure = higher probability of the right output.


🔢 The Four Core Techniques

1. Zero-Shot: Ask Directly

Give the task with no examples:

Classify this movie review as Positive or Negative:
"The acting was incredible and the plot kept me on the edge of my seat."

Classification:

Best when: The task is well-defined and the model likely saw similar tasks in pre-training.

2. Few-Shot: Teach by Example

Provide 2–5 examples before the actual question:

Review: "Disappointing. Slow pacing, weak script." → Negative
Review: "A masterpiece. Emotional and visually stunning." → Positive
Review: "Predictable but fun summer blockbuster." → ?

Why it works: The examples prime the model's "in-context learning." It extracts the pattern from the examples rather than relying on training data alone.

Tip: Select diverse examples that cover edge cases. Avoid examples biased toward one label.
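The example-then-query pattern above is easy to generate programmatically, which keeps formatting perfectly consistent across examples. A minimal sketch (the helper name is illustrative):

```python
EXAMPLES = [
    ("Disappointing. Slow pacing, weak script.", "Negative"),
    ("A masterpiece. Emotional and visually stunning.", "Positive"),
]

def few_shot_prompt(examples, query):
    """Render labeled examples in a fixed pattern, then the
    unanswered query so the model completes the pattern."""
    lines = [f'Review: "{review}" → {label}' for review, label in examples]
    lines.append(f'Review: "{query}" → ')
    return "\n".join(lines)

print(few_shot_prompt(EXAMPLES, "Predictable but fun summer blockbuster."))
```

Consistent formatting matters: if one example uses `→` and another uses `:`, the model may imitate the inconsistency instead of the labels.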

3. Chain-of-Thought (CoT): Think Step by Step

For complex reasoning tasks, instruct the model to show its work:

Q: A train travels 60 km/h. It departs at 9:00 AM and arrives at 11:00 AM.
How far did it travel?

A: Let me think step by step.
- Travel time = 11:00 AM - 9:00 AM = 2 hours.
- Distance = speed × time = 60 × 2 = 120 km.
Answer: 120 km.

Simply adding "Let's think step by step" to a math or logic prompt can improve accuracy by 15–30% on reasoning benchmarks. This zero-shot variant is due to Kojima et al. (2022); few-shot chain-of-thought prompting with worked examples was introduced by Wei et al. (2022).

4. System Role + Persona

Set a persona in the system message to constrain tone, expertise, and output format:

SYSTEM: You are a senior PostgreSQL DBA. Always respond with:
1. The SQL query
2. The query plan consideration
3. Index recommendations

USER: Find all users who haven't logged in for 90+ days.

| Technique | Best For | Token Cost |
|---|---|---|
| Zero-Shot | Simple, well-defined tasks | Low |
| Few-Shot | Classification, extraction, formatting | Medium |
| Chain-of-Thought | Math, logic, multi-step reasoning | High |
| System Persona | Consistent output format, expertise framing | Low overhead |

⚙️ The "Lost in the Middle" Problem

Research shows LLMs attend most to content at the beginning and end of the prompt; information buried in the middle receives markedly less attention (the "Lost in the Middle" finding, Liu et al., 2023).

[INSTRUCTION] [pages of context] [QUESTION]  ← High accuracy (instruction at start)
[pages of context] [INSTRUCTION] [QUESTION]  ← Moderate (instruction at end)
[half the context] [INSTRUCTION] [half the context] [QUESTION]  ← Lower accuracy

Practical rule: Put your most important instructions or the specific text to analyze at either the very start or the very end.
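The practical rule above can be enforced in code rather than by convention. A trivial but useful sketch (`order_prompt` is an illustrative helper, and the example strings are hypothetical):

```python
def order_prompt(instruction: str, context: str, question: str) -> str:
    """Keep the instruction at the very start and the question at the
    very end; bulk context goes in the middle, where attention is
    weakest anyway."""
    return f"{instruction}\n\n{context}\n\n{question}"

p = order_prompt(
    instruction="Answer using only the context below.",
    context="[pages of retrieved documents]",
    question="Question: When does the warranty expire?",
)
print(p)
```

Centralizing prompt assembly like this also stops teammates from accidentally appending context after the question in a long pipeline.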


🧠 Deep Dive: Self-Consistency and Tree-of-Thought

Self-Consistency

Instead of one CoT sample, generate 3–5 different reasoning chains and vote on the most common answer. Reduces reasoning variance on math problems.

from collections import Counter

# generate_cot_response(prompt) is a stand-in for one sampled
# chain-of-thought completion (e.g., temperature ≈ 0.7)
answers = [generate_cot_response(prompt) for _ in range(5)]
final = Counter(answers).most_common(1)[0][0]  # majority vote

Tree-of-Thought (ToT)

For planning problems, generate multiple "thought branches" at each step, evaluate which branch is most promising, and expand only that branch. Outperforms linear CoT on reasoning tasks that require backtracking.
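The branch-and-expand loop can be sketched in a few lines. Note this is a greedy, single-branch simplification of ToT (full ToT keeps several branches alive and can backtrack); `propose` and `score` stand in for LLM-backed generation and evaluation calls:

```python
def tree_of_thought(root, propose, score, depth=3, branches=3):
    """Greedy ToT sketch: at each step, propose several candidate
    thoughts, score them, and expand only the most promising one.
    `propose(state, n)` and `score(state)` are caller-supplied,
    typically wrapping LLM calls."""
    state = root
    for _ in range(depth):
        candidates = propose(state, branches)
        state = max(candidates, key=score)
    return state

# Toy stand-ins: each "thought" appends a digit; the score favors
# numerically larger states, so the best branch is always expanded
best = tree_of_thought(
    "1",
    propose=lambda s, n: [s + str(i) for i in range(n)],
    score=int,
    depth=2,
)
print(best)  # "122"
```

Swapping `max` for a top-k selection plus a frontier queue turns this greedy sketch into the beam-search form used in the original ToT work.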


⚖️ Trade-offs & Failure Modes: Common Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Negative instructions only ("Don't say X") | LLMs focus on what to say; negations are often ignored | Replace with positive framing: "Only say Y" |
| Vague length constraint ("Be brief") | "Brief" means different things | Be specific: "Respond in 2 sentences max" |
| Instructions buried in the middle | Lost in the middle; the LLM may miss them | Move to start or end |
| Temperature = 1.0 for factual tasks | High randomness → hallucinations | Use T = 0 or 0.2 for factual/structured output |
| No output format specified | LLM outputs free-form text when you needed JSON | Specify: "Respond in JSON with fields {label, confidence}" |
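The last fix is only half the job: once you ask for JSON, validate that you actually got it. A minimal validator for the `{label, confidence}` schema mentioned above (the function name is illustrative):

```python
import json

def parse_label(response_text: str) -> dict:
    """Check that the model honored the requested schema:
    {"label": str, "confidence": float}. Raise if it drifted
    back to free-form prose."""
    data = json.loads(response_text)
    if not isinstance(data.get("label"), str):
        raise ValueError("missing string field 'label'")
    if not isinstance(data.get("confidence"), float):
        raise ValueError("missing float field 'confidence'")
    return data

result = parse_label('{"label": "Positive", "confidence": 0.92}')
print(result["label"], result["confidence"])
```

A common pattern is to catch the `ValueError`, re-prompt once with the error message included, and only then fail the request.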

📊 Decision Guide: Prompting Technique Selection

Not sure which technique to reach for? Use this decision tree to match your scenario to the right approach before writing a single token.

flowchart TD
    Start["Define Your Task"]
    Simple{"Is the task\nwell-defined?"}
    Examples{"Do you have\nlabeled examples?"}
    Reasoning{"Multi-step\nreasoning needed?"}
    ZeroShot["Zero-Shot\n(direct instruction)"]
    FewShot["Few-Shot\n(2–5 examples)"]
    CoT["Chain-of-Thought\n(step-by-step)"]
    SelfConsistency["Self-Consistency\n(vote on 5 CoT runs)"]
    Start --> Simple
    Simple -->|Yes| Examples
    Simple -->|No - rephrase| Start
    Examples -->|No| ZeroShot
    Examples -->|Yes| Reasoning
    Reasoning -->|No| FewShot
    Reasoning -->|Yes| CoT
    CoT --> SelfConsistency

Start at the top with your task. If the task is ambiguous, rephrase it before choosing a technique — no prompting strategy rescues a vague objective. Once the task is clear, the branching logic guides you: Zero-Shot for simple well-defined tasks with no examples, Few-Shot when you have labeled examples, Chain-of-Thought when multi-step reasoning is required, and Self-Consistency when you need to reduce variance on high-stakes reasoning outputs.


🌍 Real-World Applications: Prompt Engineering in Production

Prompt engineering is not just a prototyping trick — it underpins real product features across industries. The table below shows common industry use cases and which technique typically drives the best results.

| Industry | Use Case | Primary Technique |
|---|---|---|
| Customer Support | Auto-classify tickets by urgency and topic | Few-Shot (labeled historical tickets) |
| Legal | Extract clause types from contracts | Few-Shot + System Persona |
| Healthcare | Summarize patient notes in plain language | Chain-of-Thought + System Persona |
| Finance | Classify earnings call sentiment | Zero-Shot or Few-Shot |
| Software | Generate SQL from natural language | Chain-of-Thought |
| Education | Create adaptive quiz questions | Zero-Shot + System Persona |

Case Study — SaaS Customer Triage: A mid-size SaaS company was spending 8 engineer-hours per week manually labeling incoming support tickets before routing them to the correct team. After building a Few-Shot classifier prompt with 4 labeled examples per category (billing, bug, feature request, account access), the model achieved 91% routing accuracy on a held-out test set. Manual triage dropped by 70%, and the prompt required no model fine-tuning — just careful example selection. The key lesson: the 4 examples took longer to curate than the prompt itself, and that curation time was the real investment.


🧪 Hands-On Exercises: Putting the Techniques to Work

The fastest way to internalize prompt engineering is to rewrite bad prompts. These three exercises cover the three most common failure modes.

Exercise 1 — Zero-Shot to Few-Shot upgrade

You have this Zero-Shot prompt that struggles with mixed-sentiment reviews:

Classify this review as Positive, Negative, or Mixed:
"The camera is stunning but battery life is terrible."

Rewrite it as a Few-Shot prompt with 3 examples that cover each label class. Make sure one example is genuinely mixed-sentiment so the model learns the boundary.

Exercise 2 — Adding Chain-of-Thought to a calculation

You have this prompt that keeps returning wrong discount totals:

A jacket costs $120. It's 30% off. What is the final price?

Add a Chain-of-Thought instruction that forces the model to show the subtraction step explicitly before giving the final answer. Then test it with a compound discount (e.g., 30% off then an extra 10% off the discounted price).
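To grade your CoT prompt's answers, the expected arithmetic can be verified directly. A small reference calculator (not part of the prompt itself):

```python
def discounted(price: float, *discounts: float) -> float:
    """Apply percentage discounts sequentially, mirroring the
    intermediate steps a CoT prompt should make explicit."""
    for pct in discounts:
        price *= 1 - pct / 100
    return round(price, 2)

print(discounted(120, 30))      # 84.0  (120 - 36)
print(discounted(120, 30, 10))  # 75.6  (84 - 8.4)
```

Note the compound case: the extra 10% comes off the already-discounted $84, not the original $120, which is exactly the step models skip when not prompted to reason step by step.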

Exercise 3 — Fix the vague System Persona

This System Persona prompt produces inconsistent, generic responses:

SYSTEM: You are helpful. Answer user questions.

Identify the three missing prompt ingredients and rewrite the System Persona for a customer support agent at a software company. Include the role, response format (numbered steps), tone (professional but approachable), and one explicit constraint (never speculate about timelines).


🛠️ LangChain and DSPy: Programmatic Prompt Engineering Frameworks

Two open-source frameworks take very different approaches to making prompt engineering systematic and repeatable.

LangChain is the most widely adopted LLM application framework — it provides PromptTemplate, FewShotPromptTemplate, and ChatPromptTemplate to turn prompt strings into typed, testable, composable objects that wire into chains and agents.

from langchain_core.prompts import PromptTemplate, FewShotPromptTemplate

# Few-Shot prompt template with labeled examples
examples = [
    {"review": "Terrible experience, never again.",   "sentiment": "Negative"},
    {"review": "Loved every moment, highly recommend!", "sentiment": "Positive"},
    {"review": "Good product but shipping took forever.", "sentiment": "Mixed"},
]

example_template = PromptTemplate(
    input_variables=["review", "sentiment"],
    template="Review: {review}\nSentiment: {sentiment}",
)

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_template,
    prefix="Classify the sentiment of each review as Positive, Negative, or Mixed.",
    suffix="Review: {input}\nSentiment:",
    input_variables=["input"],
)

# Render the fully assembled few-shot prompt
formatted = few_shot_prompt.format(input="Fast delivery but the item was damaged.")
print(formatted)
# Outputs the prefix + 3 examples + the target review, ready to send to an LLM

DSPy (Stanford) is a higher-level framework that treats prompts as programs rather than strings — you define modules with typed inputs and outputs, and DSPy compiles your program into optimized prompts by automatically selecting the best few-shot examples and instructions for a given LLM and metric.

import dspy

lm = dspy.LM("openai/gpt-4o", temperature=0)
dspy.configure(lm=lm)

# Define a structured reasoning module
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="Positive, Negative, or Mixed")

classifier = dspy.Predict(SentimentClassifier)

# DSPy handles prompt construction and LLM call automatically
result = classifier(review="Fast delivery but the item was damaged.")
print(result.sentiment)  # → Mixed

| Framework | Approach | Best for |
|---|---|---|
| LangChain | Template + chain composition | Application development, RAG, agents |
| DSPy | Compiled prompt programs | Research, automatic prompt optimization, evaluation-driven pipelines |

For a full deep-dive on LangChain and DSPy, dedicated follow-up posts are planned.


📚 Five Lessons From Building Prompt-Driven Features

These lessons come from the gap between "works in the playground" and "works reliably in production."

  1. Specificity beats cleverness. An explicit role + task + output format instruction outperforms any amount of creative framing. If your prompt reads like a riddle, rewrite it like a specification.

  2. Examples beat descriptions. Showing the model two doctor-style responses is more effective than writing "respond like a doctor." The model extrapolates behavior from examples far better than from adjective lists.

  3. Temperature is a dial, not a toggle. Use temperature=0 for factual retrieval, classification, and structured output. Use temperature=0.7–0.9 for creative writing, brainstorming, and ideation. High temperature on factual tasks is the single largest source of hallucinations in production systems.

  4. Lost in the Middle is real. If you have 10 pages of context and one question, put the question first or last. LLMs consistently under-attend to information buried in the middle of long prompts — this is not a model quirk, it is a documented research finding.

  5. Test your prompts empirically. Run your prompt against 20 representative inputs and measure accuracy before shipping. Prompt intuition is unreliable; a 5-example spot-check will miss systematic failure modes that only appear at scale.
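Lesson 5 can be operationalized in a few lines. A sketch of the simplest possible harness, where `predict` wraps whatever prompt + model call you are testing (the stand-in classifier below is purely for demonstration):

```python
def accuracy(predict, labeled_examples):
    """Tiny eval harness: run a predictor over labeled inputs and
    report the fraction it gets right. `predict` is any callable
    wrapping your prompt + LLM call."""
    correct = sum(predict(text) == label for text, label in labeled_examples)
    return correct / len(labeled_examples)

# Toy stand-in for a prompted LLM call, for demonstration only
def fake_classifier(text):
    return "Positive" if "love" in text.lower() else "Negative"

tests = [("I love this product", "Positive"), ("Broke after one day", "Negative")]
print(accuracy(fake_classifier, tests))  # 1.0
```

Swap `fake_classifier` for a real API call, grow the test set to 20+ representative inputs, and re-run it on every prompt change, exactly like a unit test suite.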


📌 TLDR: Summary & Key Takeaways

  • Few-Shot teaches by example — 2–5 diverse examples prime the model's pattern matching.
  • Chain-of-Thought ("step by step") dramatically improves reasoning accuracy on multi-step problems.
  • System Personas constrain output format and expertise level consistently.
  • Lost in the Middle: Put critical instructions at the start or end of the prompt.

📝 Practice Quiz

  1. Adding "Let's think step by step" to a prompt is an example of which technique?

    • A) Zero-Shot prompting.
    • B) Chain-of-Thought prompting — the model reasons through intermediate steps before answering.
    • C) Few-Shot prompting.
    • D) System Persona prompting.

    Correct Answer: B — Chain-of-Thought explicitly instructs the model to show its reasoning steps before reaching a conclusion.
  2. You have a prompt with 8 pages of document context. Where should you place your question for best accuracy?

    • A) In the middle of the context, surrounded by relevant text.
    • B) At the beginning or end — LLMs attend most to those positions (Lost in the Middle finding).
    • C) After a separator like "---" for visual clarity.
    • D) It doesn't matter; LLMs read every token equally.

    Correct Answer: B — Research shows LLM accuracy degrades for information buried in the middle of long prompts.
  3. Your LLM keeps returning free-form text when you need structured JSON output. What is the simplest fix?

    • A) Lower the temperature to 0.
    • B) Explicitly specify the output format: "Respond in JSON with fields: {label: string, confidence: float}".
    • C) Use Few-Shot examples with prose outputs.
    • D) Switch to a different LLM provider.

    Correct Answer: B — Explicit output format instructions in the prompt are the most reliable way to enforce structured responses.

Written by Abstract Algorithms (@abstractalgorithms)