AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
Production AI needs explicit routing, memory, execution, and evaluation layers rather than one loop.
TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceability far more than model choice alone. Production AI architecture is mostly a routing and control problem: send each request through only the layers it needs, then prove output quality before exposure.
A customer support copilot worked great in demos but hallucinated in 30% of live tickets. The fix was not a better model — it was adding an explicit routing layer (classify intent first, so billing questions never hit the expensive reasoning path), a memory layer (store resolved tickets so the model stops confabulating policy), and an evaluation layer (score every response before the user sees it, escalate failures to a human queue). Hallucination rate dropped from 30% to under 2% in six weeks.
Here is the pattern in three lines: request arrives → router classifies intent and picks the cheapest safe path → evaluator scores the answer before it leaves the system. Everything else in this post is how to build and operate those three steps reliably.
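The three-step flow above can be sketched in a few lines of Python. Everything here is illustrative: the route names, the toy evaluator, and the fail-safe default are assumptions for this sketch, not a prescribed API.

```python
# Minimal sketch of the three-step control flow: route -> answer -> evaluate.
# All names here are illustrative, not a real framework API.

ROUTES = {
    "billing": "cheap_path",        # never hits the expensive reasoning model
    "policy_question": "rag_path",  # retrieval-grounded path
    "complex": "workflow_path",     # planner-worker path
}

def route(intent: str) -> str:
    """Pick the cheapest safe path for a classified intent."""
    return ROUTES.get(intent, "human_escalation")  # unknown intents fail safe

def evaluate(answer: str, evidence: list) -> bool:
    """Toy runtime gate: an answer passes only if it cites evidence."""
    return bool(answer) and len(evidence) > 0

def handle(intent: str, answer: str, evidence: list) -> str:
    """Route first, then gate the answer before it leaves the system."""
    path = route(intent)
    if path == "human_escalation" or not evaluate(answer, evidence):
        return "escalated"
    return answer
```

Note the fail-closed default: an intent the router has never seen goes to a human, not to the cheapest path.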
📖 Why AI Pattern Choice Matters More Than Prompt Tuning
Teams usually start with one model and one prompt. That works for demos, then fails in production for predictable reasons: request mix broadens, tool calls fail, costs spike, and bad answers become operational incidents.
Architecture patterns solve this by separating responsibilities:
- Routing chooses the cheapest safe path.
- Planning decomposes tasks that need multiple steps.
- Memory controls what context can be trusted.
- Evaluation guards output quality and policy safety.
| Production symptom | Pattern response |
| --- | --- |
| Every request is expensive | Add routing and cheaper direct paths |
| Tool-heavy tasks are brittle | Add planner-worker orchestration |
| Answers cite stale policy | Add layered memory freshness controls |
| Hallucinations reach users | Add inline evaluation and escalation |
🔍 When to Use Each AI Pattern (and When Not To)
| Pattern | Use when | Avoid when | First implementation move |
| --- | --- | --- | --- |
| Router | Request types and risk levels vary | Product has one narrow use case | Start with 3-5 route classes only |
| Planner-worker | Tasks need stepwise tool usage | Most tasks are one-shot Q&A | Restrict planner to bounded workflows |
| Layered memory | Multi-turn context and policy docs matter | Session-only Q&A with no persistence | Separate session memory from durable retrieval |
| Runtime evaluator | Wrong answers are costly or regulated | Low-stakes experimentation | Add pass/fail guard before final response |
Quick practical rule
- Start with router + evaluator for most production copilots.
- Add planner only for workflows with measurable multi-step value.
- Add richer memory only after freshness and ownership are defined.
⚙️ How the AI Runtime Works in Practice
- Classify request intent and risk.
- Route to direct-answer path or workflow path.
- If workflow path, generate a bounded plan.
- Retrieve scoped memory with freshness checks.
- Execute tools/workers with trace logging.
- Evaluate answer quality and policy compliance.
- Return answer, fallback, or escalate to human.
| Stage | Practical control | Common failure |
| --- | --- | --- |
| Route | Intent + risk classifier | Overfitted route taxonomy |
| Plan | Max steps, allowed tools | Planner loop runs too long |
| Memory | Source trust tier + TTL | Stale documents outrank newer policy |
| Execute | Per-tool timeout and retry budget | Tool failures cascade into hallucinated answers |
| Evaluate | Rubric checks + policy checks | Evaluator too weak or too permissive |
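The Execute row of the table (per-tool timeout and retry budget) can be sketched as a small wrapper. The names are illustrative, and the deadline check here is post hoc: a real system would enforce preemptive timeouts via async cancellation or a process pool.

```python
# Sketch of the "Execute" stage controls: a retry budget plus a deadline check,
# so a failing tool surfaces as an explicit error instead of cascading into a
# fabricated final answer. Illustrative names, not a real library API.
import time

class ToolBudgetExceeded(Exception):
    """Raised when a tool exhausts its retry budget."""

def call_with_budget(tool, retries=2, timeout_s=5.0):
    """Run a tool callable at most `retries + 1` times.

    The elapsed-time check is post hoc (it flags, not aborts, a slow call);
    preemptive enforcement needs async cancellation or a subprocess deadline.
    """
    last_err = None
    for _ in range(retries + 1):
        start = time.monotonic()
        try:
            result = tool()
            if time.monotonic() - start > timeout_s:
                raise TimeoutError(f"tool exceeded {timeout_s}s")
            return result
        except Exception as err:
            last_err = err
    # Budget exhausted: fail loudly so the evaluator/fallback path takes over.
    raise ToolBudgetExceeded(f"failed after {retries + 1} attempts: {last_err}")
```

The key property is the final `raise`: the model never sees a silent `None` it could paper over with a confident guess.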
🛠️ How to Implement: 10-Step Rollout Checklist
- Define request classes (`faq`, `account_action`, `policy_sensitive`, `complex_workflow`).
- Create router policy mapping each class to a path.
- Set latency and cost budget per path.
- Implement planner only for one complex class first.
- Split memory into session context, task memory, and durable retrieval.
- Add document freshness metadata (`source`, `version`, `updated_at`).
- Add evaluator with explicit pass/fail rubric and escalation reason codes.
- Instrument traces for route choice, tool calls, retrieval IDs, and evaluator decision.
- Run offline replay tests against historical incidents.
- Launch with kill switch and fallback model path.
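The freshness-metadata step of the checklist can be sketched as a small gate run before a retrieved document enters model context. The field names (`source`, `version`, `updated_at`) follow the checklist; the per-source TTL values are assumptions for illustration.

```python
# Hedged sketch: freshness gate for retrieved documents, checked before they
# are allowed into model context. TTL values per source tier are illustrative.
from datetime import datetime, timedelta, timezone

FRESHNESS_TTL = {
    "policy": timedelta(days=90),   # assumed quarterly policy refresh
    "runbook": timedelta(days=30),  # assumed monthly runbook review
}

def is_fresh(doc, now=None):
    """A document is fresh if its updated_at is within its source-tier TTL.

    Unknown source tiers fail closed: no TTL, no admission to context.
    """
    now = now or datetime.now(timezone.utc)
    ttl = FRESHNESS_TTL.get(doc["source"])
    if ttl is None:
        return False
    return now - doc["updated_at"] <= ttl
```

Failing closed on an unknown source tier is the design choice that matters: a document with no declared ownership never outranks a governed one.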
Done criteria:
| Gate | Pass condition |
| --- | --- |
| Safety | High-risk outputs are blocked or escalated |
| Cost | p50 cost per successful task remains in budget |
| Reliability | Tool failure does not produce fabricated final answers |
| Explainability | Every final answer has a route + evidence trace |
🧠 Deep Dive: Latency, Traceability, and Memory Quality
The Internals: Route Policy, Memory Boundaries, and Eval Enforcement
Routing should use explicit features: intent, risk class, required tools, and user tier. Avoid free-form prompt-only routing for critical paths.
Memory should be layered and owned:
- Session memory: short-lived dialogue context.
- Task memory: state for one ongoing workflow.
- Durable retrieval: policy docs, runbooks, knowledge base.
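The three layers above can be sketched as separate stores with separate lifetimes. The class and method names are assumptions for this sketch, not a specific library's API.

```python
# Illustrative sketch of the three memory layers, each with its own scope and
# lifetime. Names are assumptions, not a real memory framework's API.

class LayeredMemory:
    def __init__(self):
        self.session = {}  # short-lived dialogue context, keyed by session ID
        self.task = {}     # state for one ongoing workflow, keyed by task ID
        self.durable = {}  # policy docs / runbooks, keyed by doc ID

    def end_session(self, session_id):
        """Session memory dies with the conversation."""
        self.session.pop(session_id, None)

    def close_task(self, task_id):
        """Task memory expires when the workflow (e.g. an incident) closes."""
        self.task.pop(task_id, None)

    def retrieve(self, session_id, task_id, doc_ids):
        """Assemble context only from the layers this request is scoped to."""
        return {
            "session": self.session.get(session_id, []),
            "task": self.task.get(task_id, {}),
            "docs": [self.durable[d] for d in doc_ids if d in self.durable],
        }
```

Keeping the layers as distinct stores (rather than one vector index) is what makes ownership and expiry enforceable per layer.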
Evaluation must run inline for risky paths. Treat it as a runtime gate, not a dashboard-only metric.
| Control | What good looks like |
| --- | --- |
| Route explainability | Logs include route decision and feature values |
| Memory provenance | Every cited fact links to source ID/version |
| Eval actionability | Fail result includes reason + fallback action |
Performance Analysis: What to Measure Weekly
| Metric | Why it matters |
| --- | --- |
| Route misclassification rate | Measures cost and behavior drift |
| End-to-end p95 latency by path | Prevents hidden latency stacking |
| Retrieval freshness failure rate | Detects stale-memory risk |
| Eval false-negative rate | Detects unsafe answers slipping through |
| Cost per accepted response | Measures architecture sustainability |
Debug order for incidents:
- Was route choice correct?
- Was retrieval scoped and fresh?
- Did tool execution succeed within budget?
- Did evaluator correctly gate output?
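The four-question debug order above can be turned into a tiny triage helper that walks a trace in the same order and returns the first failing stage. The trace field names are assumptions for illustration.

```python
# The incident debug order as a triage helper: walk the stages in order and
# return the first one that failed. Trace field names are illustrative.

DEBUG_ORDER = [
    "route_correct",        # 1. Was route choice correct?
    "retrieval_fresh",      # 2. Was retrieval scoped and fresh?
    "tools_within_budget",  # 3. Did tool execution succeed within budget?
    "evaluator_gated",      # 4. Did the evaluator correctly gate output?
]

def first_failure(trace):
    """Return the first stage in debug order that failed, or None if all passed.

    Missing trace data counts as failure: an unlogged stage is an unproven one.
    """
    for stage in DEBUG_ORDER:
        if not trace.get(stage, False):
            return stage
    return None
```

Fixing the earliest failing stage first matters because a misroute makes every downstream signal (retrieval, tools, evaluation) misleading.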
📊 AI Runtime Flow: Route, Plan, Retrieve, Execute, and Guard
```mermaid
flowchart TD
    A[User request] --> B[Risk and intent router]
    B --> C{Direct path or workflow path?}
    C -->|Direct| D[Answer model with minimal context]
    C -->|Workflow| E[Planner with bounded steps]
    E --> F[Tool workers]
    F --> G[Layered memory retrieval]
    D --> H[Runtime evaluator]
    G --> H
    H --> I{Pass rubric and policy?}
    I -->|Yes| J[Return answer with trace metadata]
    I -->|No| K[Fallback model or human escalation]
```
🌍 Real-World Application: Support Copilot With Compliance Constraints
Constraints:
- 600k monthly chats across billing and account security.
- 2.5 second p95 response target for simple questions.
- PII policy violations must be <0.1%.
- Cost cap of $0.015 per accepted answer.
Practical architecture:
- Router sends `faq` traffic to the cheaper direct path; `account_security` routes to the workflow path with a strict evaluator.
- Planner used only for incident and account-action workflows.
- Memory retrieval restricted to policy version matching current quarter.
- Any failed evaluator check escalates to human queue.
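The routing decisions above can be expressed as a declarative policy table. The budget numbers mirror the stated constraints; the field names, tier labels, and latency values for the workflow path are illustrative assumptions.

```python
# Hedged sketch of a route policy table for this scenario. Only the faq p95
# target (2.5s) and cost cap ($0.015) come from the stated constraints; the
# rest (field names, model tiers, workflow latencies) is illustrative.

ROUTE_POLICY = {
    "faq": {
        "path": "direct", "model_tier": "small",
        "evaluator": "basic", "p95_latency_s": 2.5,
    },
    "account_security": {
        "path": "workflow", "model_tier": "large",
        "evaluator": "strict", "p95_latency_s": 8.0,
    },
}

COST_CAP_PER_ANSWER = 0.015  # dollars, from the stated cost constraint

def policy_for(intent):
    # Unknown intents get the strictest treatment, never the cheapest.
    return ROUTE_POLICY.get(intent, ROUTE_POLICY["account_security"])
```

Making the policy a plain data structure (not prompt text) is what lets you log, diff, and audit route decisions per the compliance requirement.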
| Constraint | Architecture decision | Why it helps |
| --- | --- | --- |
| Tight latency budget | Direct route for simple intents | Avoids planner/tool overhead |
| Compliance risk | Inline evaluator with policy rubric | Blocks unsafe output before user sees it |
| Cost cap | Path-specific model tiers | Prevents expensive model overuse |
| Audit need | Route + evidence trace logs | Makes incidents diagnosable |
⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks by Pattern Layer
| Layer | Pros | Cons | Key risk | Mitigation |
| --- | --- | --- | --- | --- |
| Router | Controls cost and latency | Extra classification complexity | Misrouting high-risk tasks | Keep route classes simple and monitored |
| Planner-worker | Better handling of complex tasks | Adds latency and orchestration work | Unbounded loops | Enforce max steps and tool allowlist |
| Layered memory | Better context relevance | More data governance work | Stale policy leakage | Freshness TTL + source version checks |
| Evaluator | Prevents unsafe or low-quality output | Additional runtime overhead | False confidence from weak rubric | Regularly calibrate with failure replay |
🧭 Decision Guide: What to Add First
| Situation | Recommendation |
| --- | --- |
| Mostly simple Q&A with occasional risky answers | Add runtime evaluator first |
| Many intents and uneven cost profile | Add router next |
| Complex workflows need tools and decomposition | Add planner-worker only for those paths |
| Stale citations and context drift incidents | Add layered memory governance |
If you can only ship one control in the next sprint, ship the evaluator on high-risk paths first.
🧪 Practical Example: Incident Assistant Architecture Slice
Minimal design for an SRE incident assistant:
- Router identifies `incident_triage` requests.
- Planner creates a max 4-step plan (logs, metrics, runbook, recommendation).
- Workers query approved observability tools only.
- Memory is task-scoped and expires after incident closure.
- Evaluator rejects recommendations lacking supporting evidence links.
```python
# Sketch of the gating logic; planner, workers, model, and evaluator are
# application-specific objects assumed to exist elsewhere.
if route == "incident_triage":
    plan = planner.create(max_steps=4)
    evidence = workers.execute(plan, tool_allowlist)
    response = model.summarize(evidence)
    # `pass` is a Python keyword, so the evaluator exposes `passes` instead
    if evaluator.passes(response, evidence, policy):
        return response
    return escalate_to_human(reason="insufficient evidence")
```
Operator Field Note: What Fails First in Production
A recurring pattern from postmortems is that incidents in these layered AI architectures start with weak signals long before a full outage.
- Early warning signal: one guardrail metric drifts (error rate, lag, divergence, or stale-read ratio) while dashboards still look mostly green.
- First containment move: freeze rollout, route to the last known safe path, and cap retries to avoid amplification.
- Escalate immediately when: customer-visible impact persists for two monitoring windows or recovery automation fails once.
15-Minute SRE Drill
- Replay one bounded failure case in staging.
- Capture one metric, one trace, and one log that prove the guardrail worked.
- Update the runbook with exact rollback command and owner on call.
📚 Lessons Learned
- Route fewer paths well instead of many paths poorly.
- Planner value comes from bounded execution, not autonomous sprawl.
- Memory quality is about freshness and ownership, not vector size.
- Evaluation must block unsafe output in real time.
- Traceability is the key to debugging AI incidents quickly.
🛠️ LangGraph and LangSmith: Stateful Agent Graphs with Built-In Evaluation
LangGraph is a Python library from LangChain that models AI agent workflows as directed graphs (StateGraph), where each node is a callable function and edges encode conditional branching — exactly the router → planner → evaluator topology described in this post. LangSmith provides observability and automated evaluation for LangGraph workflows in production.
How it solves the problem: Rather than writing custom orchestration code for routing, planning, memory, and evaluation, LangGraph encodes each layer as a typed graph node. Memory state flows between nodes via a shared TypedDict schema; LangSmith traces every node invocation, tool call, and evaluation decision — making the debugging workflow from the "debug order for incidents" table above practical rather than theoretical.
```python
from typing import TypedDict, Literal

from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage

# classify_intent, classify_risk, generate_plan, execute_tools, evidence_prompt,
# evaluate_answer, queue_for_human, and llm are application-specific helpers
# assumed to exist elsewhere in your codebase.

# ── Shared agent state ───────────────────────────────────────────────────────
class AgentState(TypedDict):
    request: str
    intent: str          # router output: "faq" | "account_action" | "complex_workflow"
    risk_level: str      # router output: "low" | "high"
    plan: list[str]      # planner output: ordered steps (empty for direct path)
    evidence: list[str]  # tool worker output: supporting facts
    answer: str          # model output
    eval_pass: bool      # evaluator output

# ── Node: intent + risk router ───────────────────────────────────────────────
def router_node(state: AgentState) -> AgentState:
    """Classify intent and risk class; choose direct or workflow path."""
    # In production, use a fast fine-tuned classifier or prompt
    intent = classify_intent(state["request"])  # "faq" | "account_action" | ...
    risk = classify_risk(state["request"])      # "low" | "high"
    return {**state, "intent": intent, "risk_level": risk, "plan": []}

# ── Conditional edge: route to direct answer or planner ──────────────────────
def route_decision(state: AgentState) -> Literal["direct_answer", "planner"]:
    return "planner" if state["intent"] == "complex_workflow" else "direct_answer"

# ── Node: direct answer (low-cost path) ──────────────────────────────────────
def direct_answer_node(state: AgentState) -> AgentState:
    answer = llm.invoke([HumanMessage(content=state["request"])]).content
    return {**state, "answer": answer, "evidence": []}

# ── Node: planner (bounded step decomposition) ───────────────────────────────
def planner_node(state: AgentState) -> AgentState:
    plan = generate_plan(state["request"], max_steps=4)
    evidence = execute_tools(plan, tool_allowlist=["logs", "metrics", "runbook"])
    answer = llm.invoke(evidence_prompt(state["request"], evidence)).content
    return {**state, "plan": plan, "evidence": evidence, "answer": answer}

# ── Node: runtime evaluator ──────────────────────────────────────────────────
def evaluator_node(state: AgentState) -> AgentState:
    passes = evaluate_answer(
        answer=state["answer"],
        evidence=state["evidence"],
        rubric=["no_pii", "evidence_linked", "policy_compliant"],
    )
    return {**state, "eval_pass": passes}

# ── Conditional edge: pass → return, fail → escalate ─────────────────────────
def eval_decision(state: AgentState) -> Literal["return_answer", "escalate"]:
    return "return_answer" if state["eval_pass"] else "escalate"

def escalate_node(state: AgentState) -> AgentState:
    queue_for_human(state["request"], reason="evaluator_failed")
    return {**state, "answer": "Your request has been escalated to our team."}

# ── Build the graph ──────────────────────────────────────────────────────────
workflow = StateGraph(AgentState)
workflow.add_node("router", router_node)
workflow.add_node("direct_answer", direct_answer_node)
workflow.add_node("planner", planner_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("escalate", escalate_node)

workflow.set_entry_point("router")
# route_decision returns actual node names, so no path map is needed here
workflow.add_conditional_edges("router", route_decision)
workflow.add_edge("direct_answer", "evaluator")
workflow.add_edge("planner", "evaluator")
# "return_answer" is a label, not a node, so map it explicitly to END
workflow.add_conditional_edges(
    "evaluator", eval_decision, {"return_answer": END, "escalate": "escalate"}
)
workflow.add_edge("escalate", END)

agent = workflow.compile()
```
LangSmith traces every node call, tool invocation, and evaluator decision automatically when `LANGCHAIN_TRACING_V2=true` is set in the environment — providing the route + evidence audit trail required by the compliance constraints in the real-world scenario above.
For a full deep-dive on LangGraph and LangSmith in production AI systems, a dedicated follow-up post is planned.
📌 TLDR: Summary & Key Takeaways
- Production AI patterns should be selected by risk, latency, and cost profile.
- Use routers to control path selection and spending.
- Use planner-worker only where decomposition materially improves outcomes.
- Use layered memory with freshness metadata and provenance.
- Use runtime evaluation as the final guard before answer exposure.
📝 Practice Quiz
- Which pattern usually delivers the fastest initial production safety improvement?
A) Unlimited planner autonomy
B) Inline runtime evaluator on risky paths
C) Storing all text in one memory store
Correct Answer: B
- What is a strong signal that planner-worker is overused?
A) Complex tasks now complete with evidence
B) Simple FAQ traffic is routed through multi-step tool workflows
C) Route logs include intent class
Correct Answer: B
- Why should durable memory include freshness metadata?
A) To increase embedding size
B) To prevent stale policies from being treated as current truth
C) To remove the need for evaluators
Correct Answer: B
- Open-ended challenge: your evaluator blocks too many valid answers and hurts latency. How would you redesign route thresholds, evaluation rubrics, and fallback paths without losing safety?
Written by
Abstract Algorithms
@abstractalgorithms