
AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails

Production AI needs explicit routing, memory, execution, and evaluation layers rather than one loop.

Abstract Algorithms · 12 min read

TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceability far more than model choice alone. Production AI architecture is mostly a routing and control problem: send each request through only the layers it needs, then prove output quality before exposure.

A customer support copilot worked great in demos but hallucinated in 30% of live tickets. The fix was not a better model — it was adding an explicit routing layer (classify intent first, so billing questions never hit the expensive reasoning path), a memory layer (store resolved tickets so the model stops confabulating policy), and an evaluation layer (score every response before the user sees it, escalate failures to a human queue). Hallucination rate dropped from 30% to under 2% in six weeks.

Here is the pattern in three lines: request arrives → router classifies intent and picks the cheapest safe path → evaluator scores the answer before it leaves the system. Everything else in this post is how to build and operate those three steps reliably.
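Those three steps can be sketched in a few lines of Python. Everything here is an illustrative stand-in (the keyword classifier, the risk set, and the evidence check are placeholders, not a specific library's API):

```python
# Minimal three-step skeleton: route -> cheapest safe path -> evaluator gate.
# classify_intent and the rubric are hypothetical placeholders.

RISKY_INTENTS = {"account_action", "policy_sensitive"}

def classify_intent(request: str) -> str:
    # Placeholder: keyword match standing in for a fast fine-tuned classifier.
    if "password" in request or "refund" in request:
        return "account_action"
    return "faq"

def handle(request: str) -> str:
    intent = classify_intent(request)                   # 1. route
    answer = f"[{intent}] draft answer for: {request}"  # 2. cheapest safe path
    if intent in RISKY_INTENTS and "evidence:" not in answer:
        return "escalated_to_human"                     # 3. evaluator gate
    return answer
```

The point is the shape, not the logic: risky intents never leave the system without passing the gate, and cheap intents never pay for the expensive path.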

📖 Why AI Pattern Choice Matters More Than Prompt Tuning

Teams usually start with one model and one prompt. That works for demos, then fails in production for predictable reasons: request mix broadens, tool calls fail, costs spike, and bad answers become operational incidents.

Architecture patterns solve this by separating responsibilities:

  • routing chooses the cheapest safe path,
  • planning decomposes tasks that need multiple steps,
  • memory controls what context can be trusted,
  • evaluation guards output quality and policy safety.

| Production symptom | Pattern response |
| --- | --- |
| Every request is expensive | Add routing and cheaper direct paths |
| Tool-heavy tasks are brittle | Add planner-worker orchestration |
| Answers cite stale policy | Add layered memory freshness controls |
| Hallucinations reach users | Add inline evaluation and escalation |

🔍 When to Use Each AI Pattern (and When Not To)

| Pattern | Use when | Avoid when | First implementation move |
| --- | --- | --- | --- |
| Router | Request types and risk levels vary | Product has one narrow use case | Start with 3-5 route classes only |
| Planner-worker | Tasks need stepwise tool usage | Most tasks are one-shot Q&A | Restrict planner to bounded workflows |
| Layered memory | Multi-turn context and policy docs matter | Session-only Q&A with no persistence | Separate session memory from durable retrieval |
| Runtime evaluator | Wrong answers are costly or regulated | Low-stakes experimentation | Add pass/fail guard before final response |

Quick practical rule

  • Start with router + evaluator for most production copilots.
  • Add planner only for workflows with measurable multi-step value.
  • Add richer memory only after freshness and ownership are defined.

⚙️ How the AI Runtime Works in Practice

  1. Classify request intent and risk.
  2. Route to direct-answer path or workflow path.
  3. If workflow path, generate a bounded plan.
  4. Retrieve scoped memory with freshness checks.
  5. Execute tools/workers with trace logging.
  6. Evaluate answer quality and policy compliance.
  7. Return answer, fallback, or escalate to human.
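The seven stages above can be sketched as one traced pipeline. The stage logic is a placeholder; the part worth copying is that every stage writes a trace event, which is what makes the later incident-debugging order possible:

```python
# Sketch of the seven-stage runtime with per-stage trace logging.
# All stage internals are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class Trace:
    events: list = field(default_factory=list)
    def log(self, stage: str, detail: str) -> None:
        self.events.append((stage, detail))

def run_request(request: str) -> tuple[str, Trace]:
    trace = Trace()
    # 1-2. classify and route (placeholder keyword rule)
    intent = "complex_workflow" if "investigate" in request else "faq"
    trace.log("route", intent)
    if intent == "complex_workflow":
        plan = ["logs", "metrics"]                  # 3. bounded plan
        trace.log("plan", ",".join(plan))
        evidence = [f"{step}:ok" for step in plan]  # 4-5. retrieve and execute
        trace.log("execute", ",".join(evidence))
        answer = f"summary of {len(evidence)} evidence items"
    else:
        answer = f"direct answer: {request}"
    passed = "summary" in answer or intent == "faq"  # 6. evaluate
    trace.log("evaluate", "pass" if passed else "fail")
    return (answer if passed else "escalated", trace)  # 7. return or escalate
```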
| Stage | Practical control | Common failure |
| --- | --- | --- |
| Route | Intent + risk classifier | Overfitted route taxonomy |
| Plan | Max steps, allowed tools | Planner loop runs too long |
| Memory | Source trust tier + TTL | Stale documents outrank newer policy |
| Execute | Per-tool timeout and retry budget | Tool failures cascade into hallucinated answers |
| Evaluate | Rubric checks + policy checks | Evaluator too weak or too permissive |

🛠️ How to Implement: 10-Step Rollout Checklist

  1. Define request classes (faq, account_action, policy_sensitive, complex_workflow).
  2. Create router policy mapping each class to a path.
  3. Set latency and cost budget per path.
  4. Implement planner only for one complex class first.
  5. Split memory into session context, task memory, and durable retrieval.
  6. Add document freshness metadata (source, version, updated_at).
  7. Add evaluator with explicit pass/fail rubric and escalation reason codes.
  8. Instrument traces for route choice, tool calls, retrieval IDs, and evaluator decision.
  9. Run offline replay tests against historical incidents.
  10. Launch with kill switch and fallback model path.
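Step 6 (freshness metadata) is the easiest of these to get concrete about. A minimal sketch, assuming an illustrative document schema with `source`, `policy_version`, and `updated_at` fields and made-up thresholds:

```python
# Sketch of a document freshness gate: reject retrieval hits whose policy
# version or age fails the configured thresholds. Field names and limits
# are illustrative, not a specific vector-store schema.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)
CURRENT_POLICY_VERSION = "2024-Q2"

def is_fresh(doc: dict, now: datetime) -> bool:
    if doc["policy_version"] != CURRENT_POLICY_VERSION:
        return False
    return now - doc["updated_at"] <= MAX_AGE

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = {"source": "kb", "policy_version": "2024-Q2",
         "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc)}
stale = {"source": "kb", "policy_version": "2023-Q4",
         "updated_at": datetime(2023, 11, 1, tzinfo=timezone.utc)}
```

Running this gate before ranking, rather than after, keeps stale documents from ever competing with current policy.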

Done criteria:

GatePass condition
SafetyHigh-risk outputs are blocked or escalated
Costp50 cost per successful task remains in budget
ReliabilityTool failure does not produce fabricated final answers
ExplainabilityEvery final answer has a route + evidence trace

🧠 Deep Dive: Latency, Traceability, and Memory Quality

The Internals: Route Policy, Memory Boundaries, and Eval Enforcement

Routing should use explicit features: intent, risk class, required tools, and user tier. Avoid free-form prompt-only routing for critical paths.

Memory should be layered and owned:

  • Session memory: short-lived dialogue context.
  • Task memory: state for one ongoing workflow.
  • Durable retrieval: policy docs, runbooks, knowledge base.
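The three tiers differ in lifetime and ownership, which a sketch makes explicit. Class and field names here are illustrative, not a specific framework's memory API:

```python
# Sketch of the three memory tiers with distinct lifetimes and provenance.

class SessionMemory:
    """Short-lived dialogue context; dropped when the session ends."""
    def __init__(self):
        self.turns: list[str] = []

class TaskMemory:
    """State for one ongoing workflow; cleared when the task closes."""
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.state: dict = {}

class DurableRetrieval:
    """Policy docs and runbooks; every cited hit carries provenance."""
    def __init__(self):
        self.docs: dict[str, dict] = {}
    def add(self, doc_id: str, text: str, version: str) -> None:
        self.docs[doc_id] = {"text": text, "version": version}
    def cite(self, doc_id: str) -> str:
        d = self.docs[doc_id]
        return f"{doc_id}@{d['version']}"
```

Keeping the tiers as separate objects makes the ownership question unavoidable: someone has to decide who writes to durable retrieval and when task memory expires.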

Evaluation must run inline for risky paths. Treat it as a runtime gate, not a dashboard-only metric.

| Control | What good looks like |
| --- | --- |
| Route explainability | Logs include route decision and feature values |
| Memory provenance | Every cited fact links to source ID/version |
| Eval actionability | Fail result includes reason + fallback action |

Performance Analysis: What to Measure Weekly

| Metric | Why it matters |
| --- | --- |
| Route misclassification rate | Measures cost and behavior drift |
| End-to-end p95 latency by path | Prevents hidden latency stacking |
| Retrieval freshness failure rate | Detects stale-memory risk |
| Eval false-negative rate | Detects unsafe answers slipping through |
| Cost per accepted response | Measures architecture sustainability |

Debug order for incidents:

  1. Was route choice correct?
  2. Was retrieval scoped and fresh?
  3. Did tool execution succeed within budget?
  4. Did evaluator correctly gate output?
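That ordering can live in code rather than a runbook. A sketch, assuming an illustrative trace record with one boolean per stage:

```python
# Sketch of the incident debug order as an ordered check over a trace
# record. The trace field names are illustrative.

def first_failure(trace: dict) -> str:
    checks = [
        ("route",    trace["route_correct"]),
        ("memory",   trace["retrieval_fresh"]),
        ("execute",  trace["tools_within_budget"]),
        ("evaluate", trace["eval_gated_correctly"]),
    ]
    for stage, ok in checks:
        if not ok:
            return stage
    return "no_failure_found"
```

Checking in this fixed order matters because a misroute upstream can make every downstream stage look broken.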

📊 AI Runtime Flow: Route, Plan, Retrieve, Execute, and Guard

flowchart TD
    A[User request] --> B[Risk and intent router]
    B --> C{Direct path or workflow path?}
    C -->|Direct| D[Answer model with minimal context]
    C -->|Workflow| E[Planner with bounded steps]
    E --> F[Tool workers]
    F --> G[Layered memory retrieval]
    D --> H[Runtime evaluator]
    G --> H
    H --> I{Pass rubric and policy?}
    I -->|Yes| J[Return answer with trace metadata]
    I -->|No| K[Fallback model or human escalation]

🌍 Real-World Scenario: Support Copilot With Compliance Constraints

Constraints:

  • 600k monthly chats across billing and account security.
  • 2.5 second p95 response target for simple questions.
  • PII policy violations must be <0.1%.
  • Cost cap of $0.015 per accepted answer.
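Constraints like these are most useful when they are machine-checkable. A sketch that turns the list above into a budget config (field names are illustrative):

```python
# Sketch: the scenario constraints as a machine-checkable budget config.
# Thresholds come from the constraint list above; names are illustrative.

BUDGETS = {
    "p95_latency_s_simple": 2.5,    # seconds, simple-question path
    "pii_violation_rate":   0.001,  # <0.1% of chats
    "cost_per_accepted":    0.015,  # dollars per accepted answer
}

def breached_budgets(observed: dict) -> list[str]:
    """Return the names of any budgets the observed metrics exceed."""
    return [k for k, limit in BUDGETS.items() if observed.get(k, 0) > limit]
```

Wiring this into the weekly metrics review turns "are we still in budget?" from a meeting question into an alert.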

Practical architecture:

  • Router sends faq traffic to cheaper direct path.
  • account_security routes to workflow path with strict evaluator.
  • Planner used only for incident and account-action workflows.
  • Memory retrieval restricted to policy version matching current quarter.
  • Any failed evaluator check escalates to human queue.
| Constraint | Architecture decision | Why it helps |
| --- | --- | --- |
| Tight latency budget | Direct route for simple intents | Avoids planner/tool overhead |
| Compliance risk | Inline evaluator with policy rubric | Blocks unsafe output before user sees it |
| Cost cap | Path-specific model tiers | Prevents expensive model overuse |
| Audit need | Route + evidence trace logs | Makes incidents diagnosable |

⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks by Pattern Layer

| Layer | Pros | Cons | Key risk | Mitigation |
| --- | --- | --- | --- | --- |
| Router | Controls cost and latency | Extra classification complexity | Misrouting high-risk tasks | Keep route classes simple and monitored |
| Planner-worker | Better handling of complex tasks | Adds latency and orchestration work | Unbounded loops | Enforce max steps and tool allowlist |
| Layered memory | Better context relevance | More data governance work | Stale policy leakage | Freshness TTL + source version checks |
| Evaluator | Prevents unsafe or low-quality output | Additional runtime overhead | False confidence from weak rubric | Regularly calibrate with failure replay |

🧭 Decision Guide: What to Add First

| Situation | Recommendation |
| --- | --- |
| Mostly simple Q&A with occasional risky answers | Add runtime evaluator first |
| Many intents and uneven cost profile | Add router next |
| Complex workflows need tools and decomposition | Add planner-worker only for those paths |
| Stale citations and context drift incidents | Add layered memory governance |

If you can only ship one control in the next sprint, ship the evaluator on high-risk paths first.

🧪 Practical Example: Incident Assistant Architecture Slice

Minimal design for an SRE incident assistant:

  1. Router identifies incident_triage requests.
  2. Planner creates max 4-step plan (logs, metrics, runbook, recommendation).
  3. Workers query approved observability tools only.
  4. Memory is task-scoped and expires after incident closure.
  5. Evaluator rejects recommendations lacking supporting evidence links.
if route == "incident_triage":
    plan = planner.create(max_steps=4)
    evidence = workers.execute(plan, tool_allowlist)
    response = model.summarize(evidence)
    if evaluator.passes(response, evidence, policy):  # "pass" is a Python keyword
        return response
    return escalate_to_human(reason="insufficient evidence")

Operator Field Note: What Fails First in Production

A recurring pattern from postmortems is that incidents in layered AI runtimes start with weak signals long before a full outage.

  • Early warning signal: one guardrail metric drifts (error rate, lag, divergence, or stale-read ratio) while dashboards still look mostly green.
  • First containment move: freeze rollout, route to the last known safe path, and cap retries to avoid amplification.
  • Escalate immediately when: customer-visible impact persists for two monitoring windows or recovery automation fails once.

15-Minute SRE Drill

  1. Replay one bounded failure case in staging.
  2. Capture one metric, one trace, and one log that prove the guardrail worked.
  3. Update the runbook with exact rollback command and owner on call.

📚 Lessons Learned

  • Route fewer paths well instead of many paths poorly.
  • Planner value comes from bounded execution, not autonomous sprawl.
  • Memory quality is about freshness and ownership, not vector size.
  • Evaluation must block unsafe output in real time.
  • Traceability is the key to debugging AI incidents quickly.

🛠️ LangGraph and LangSmith: Stateful Agent Graphs with Built-In Evaluation

LangGraph is a Python library from LangChain that models AI agent workflows as directed graphs (StateGraph), where each node is a callable function and edges encode conditional branching — exactly the router → planner → evaluator topology described in this post. LangSmith provides observability and automated evaluation for LangGraph workflows in production.

How it solves the problem: Rather than writing custom orchestration code for routing, planning, memory, and evaluation, LangGraph encodes each layer as a typed graph node. Memory state flows between nodes via a shared TypedDict schema; LangSmith traces every node invocation, tool call, and evaluation decision — making the debugging workflow from the "debug order for incidents" table above practical rather than theoretical.

from typing import TypedDict, Literal
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage

# Note: llm, classify_intent, classify_risk, generate_plan, execute_tools,
# evidence_prompt, evaluate_answer, and queue_for_human are application-level
# placeholders below, not LangGraph or LangChain APIs.


# ── Shared agent state ────────────────────────────────────────────────────────
class AgentState(TypedDict):
    request:     str
    intent:      str           # router output: "faq" | "account_action" | "complex_workflow"
    risk_level:  str           # router output: "low" | "high"
    plan:        list[str]     # planner output: ordered steps (empty for direct path)
    evidence:    list[str]     # tool worker output: supporting facts
    answer:      str           # model output
    eval_pass:   bool          # evaluator output


# ── Node: intent + risk router ────────────────────────────────────────────────
def router_node(state: AgentState) -> AgentState:
    """Classify intent and risk class; choose direct or workflow path."""
    # In production, use a fast fine-tuned classifier or prompt
    intent = classify_intent(state["request"])   # returns "faq" | "account_action" | ...
    risk   = classify_risk(state["request"])      # returns "low" | "high"
    return {**state, "intent": intent, "risk_level": risk, "plan": []}


# ── Conditional edge: route to direct answer or planner ──────────────────────
def route_decision(state: AgentState) -> Literal["direct_answer", "planner"]:
    return "planner" if state["intent"] == "complex_workflow" else "direct_answer"


# ── Node: direct answer (low-cost path) ──────────────────────────────────────
def direct_answer_node(state: AgentState) -> AgentState:
    answer = llm.invoke([HumanMessage(content=state["request"])]).content
    return {**state, "answer": answer, "evidence": []}


# ── Node: planner (bounded step decomposition) ────────────────────────────────
def planner_node(state: AgentState) -> AgentState:
    plan = generate_plan(state["request"], max_steps=4)
    evidence = execute_tools(plan, tool_allowlist=["logs", "metrics", "runbook"])
    answer = llm.invoke(evidence_prompt(state["request"], evidence)).content
    return {**state, "plan": plan, "evidence": evidence, "answer": answer}


# ── Node: runtime evaluator ────────────────────────────────────────────────────
def evaluator_node(state: AgentState) -> AgentState:
    passes = evaluate_answer(
        answer   = state["answer"],
        evidence = state["evidence"],
        rubric   = ["no_pii", "evidence_linked", "policy_compliant"],
    )
    return {**state, "eval_pass": passes}


# ── Conditional edge: pass → return, fail → escalate ─────────────────────────
def eval_decision(state: AgentState) -> Literal["return_answer", "escalate"]:
    return "return_answer" if state["eval_pass"] else "escalate"


def escalate_node(state: AgentState) -> AgentState:
    queue_for_human(state["request"], reason="evaluator_failed")
    return {**state, "answer": "Your request has been escalated to our team."}


# ── Build the graph ────────────────────────────────────────────────────────────
workflow = StateGraph(AgentState)
workflow.add_node("router",        router_node)
workflow.add_node("direct_answer", direct_answer_node)
workflow.add_node("planner",       planner_node)
workflow.add_node("evaluator",     evaluator_node)
workflow.add_node("escalate",      escalate_node)

workflow.set_entry_point("router")
workflow.add_conditional_edges("router",    route_decision)
workflow.add_edge("direct_answer",          "evaluator")
workflow.add_edge("planner",                "evaluator")
workflow.add_conditional_edges(
    "evaluator",
    eval_decision,
    {"return_answer": END, "escalate": "escalate"},  # map labels to targets
)
workflow.add_edge("escalate", END)

agent = workflow.compile()

LangSmith traces every node call, tool invocation, and evaluator decision automatically when LANGCHAIN_TRACING_V2=true is set in the environment — providing the route + evidence audit trail required by the compliance constraints in the real-world scenario above.

For a full deep-dive on LangGraph and LangSmith in production AI systems, a dedicated follow-up post is planned.

📌 TLDR: Summary & Key Takeaways

  • Production AI patterns should be selected by risk, latency, and cost profile.
  • Use routers to control path selection and spending.
  • Use planner-worker only where decomposition materially improves outcomes.
  • Use layered memory with freshness metadata and provenance.
  • Use runtime evaluation as the final guard before answer exposure.

📝 Practice Quiz

  1. Which pattern usually delivers the fastest initial production safety improvement?

A) Unlimited planner autonomy
B) Inline runtime evaluator on risky paths
C) Storing all text in one memory store

Correct Answer: B

  2. What is a strong signal that planner-worker is overused?

A) Complex tasks now complete with evidence
B) Simple FAQ traffic is routed through multi-step tool workflows
C) Route logs include intent class

Correct Answer: B

  3. Why should durable memory include freshness metadata?

A) To increase embedding size
B) To prevent stale policies from being treated as current truth
C) To remove the need for evaluators

Correct Answer: B

  4. Open-ended challenge: your evaluator blocks too many valid answers and hurts latency. How would you redesign route thresholds, evaluation rubrics, and fallback paths without losing safety?

Written by Abstract Algorithms (@abstractalgorithms)