LLM Skill Registries, Routing Policies, and Evaluation for Production Agents
After tools and skills, this is the control plane: registry design, routing rules, and evaluation loops.
TLDR: If tools are primitives and skills are reusable routines, then the skill registry + router + evaluator is your production control plane. This layer decides which skill runs, under what constraints, and how you detect regressions before users do.
Why a Skill Registry Becomes the Agent Control Plane
In small demos, the model picks a tool and returns a decent answer. In production, that is not enough.
You need answers to operational questions:
- Which skills are currently active?
- Which team owns each skill and guardrail policy?
- Which skills are safe for high-risk intents?
- What changed between yesterday's and today's routing behavior?
That is what a skill registry solves. It is not just a list of skill names. It is the source of truth for execution behavior.
Example: one registry entry in practice:
```json
{
  "skill_id": "sql_query_v2",
  "input_schema": { "query": "string", "db": "string" },
  "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
  "eval_hook": "sql_accuracy_v1"
}
```
When a user asks "Show me all orders over $500", the router matches intent data_lookup, confirms risk_level == 'low', selects sql_query_v2, and tags the response for evaluation via sql_accuracy_v1. That four-field entry is the minimum viable registry contract.
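To make the matching step concrete, here is a minimal sketch of how a router could evaluate that `routing_condition` field. The AND-joined equality grammar is an assumption for illustration; a production system would use a real policy language such as CEL or Rego.

```python
# Minimal sketch: evaluate a registry routing_condition of the form
# "intent == 'data_lookup' AND risk_level == 'low'" against request context.
# The condition grammar here is a simplifying assumption, not a standard.
def matches_condition(condition: str, context: dict) -> bool:
    for clause in condition.split(" AND "):
        key, expected = clause.split(" == ")
        if str(context.get(key.strip())) != expected.strip().strip("'"):
            return False
    return True

entry = {
    "skill_id": "sql_query_v2",
    "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
}

print(matches_condition(entry["routing_condition"],
                        {"intent": "data_lookup", "risk_level": "low"}))   # True
print(matches_condition(entry["routing_condition"],
                        {"intent": "data_lookup", "risk_level": "high"}))  # False
```

The point is not the parser itself but the separation: the condition lives in registry data, so changing routing behavior is a data change with an audit trail, not a code deploy.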
| Capability | Without registry | With registry |
| --- | --- | --- |
| Skill discovery | Prompt memory or hardcoded list | Queryable metadata |
| Governance | Ad hoc docs | Owner, risk level, policy fields |
| Routing consistency | Prompt-dependent | Deterministic + scored selection |
| Incident triage | Slow transcript digging | Versioned skill and route traces |
A practical architecture has three pieces:
- Registry: skill metadata and contracts.
- Router: selects the best skill for a request.
- Evaluator: measures quality, safety, latency, and drift.
This post is the operational follow-up to LLM Skills vs Tools: The Missing Layer in Agent Design.
Designing a Registry That Humans and Routers Can Trust
A useful registry entry is both machine-readable and operator-readable.
Minimum fields per skill:
| Field | Example | Why it matters |
| --- | --- | --- |
| skill_id | incident_triage_v3 | Stable reference in traces |
| description | "Investigate alerts and create tickets" | Helps intent matching |
| input_schema | JSON schema | Prevents malformed runs |
| output_schema | JSON schema | Stabilizes downstream integrations |
| risk_level | low, medium, high | Enables policy gating |
| allowed_data_domains | logs, tickets | Limits data exposure |
| owner_team | sre-platform | Accountability |
| slo | p95<4s | Runtime expectations |
| version | 3.2.1 | Safe rollouts and rollbacks |
Larger systems should also include:
- deprecation status,
- fallback skill id,
- required approvals (for sensitive actions),
- evaluation baseline hash.
A registry is a product artifact. Treat it like API surface area, not internal trivia.
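If the registry is API surface area, it deserves contract tests. Below is a minimal sketch of one; the required field names follow the table above, and the shape of `validate_entry` is an assumption, not a reference to any specific registry product.

```python
# Contract-test sketch for registry entries. Field names mirror the
# "minimum fields per skill" table; the validator itself is illustrative.
REQUIRED_FIELDS = {
    "skill_id", "description", "input_schema", "output_schema",
    "risk_level", "allowed_data_domains", "owner_team", "slo", "version",
}

def validate_entry(entry: dict) -> list:
    """Return a list of contract violations; an empty list means the entry is valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if entry.get("risk_level") not in {"low", "medium", "high"}:
        errors.append("risk_level must be one of: low, medium, high")
    return errors
```

Running this in CI against every registry change is one cheap way to prevent the registry-drift failure mode discussed later.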
Routing Pipeline: From User Intent to Skill Selection
A production router should be explicit about stages.
```mermaid
flowchart TD
    A[User request] --> B[Intent and entity extraction]
    B --> C[Candidate skill retrieval from registry]
    C --> D[Policy gate: data/risk/compliance]
    D --> E[Score candidates: fit, cost, risk, freshness]
    E --> F{Confidence above threshold?}
    F -- Yes --> G[Select top skill]
    F -- No --> H[Fallback: safe default or human review]
    G --> I[Execute skill with trace]
    H --> I
    I --> J[Return response + route metadata]
```
This pipeline prevents a common failure mode: the model picks a "kind of related" skill because a keyword looked similar.
| Stage | Typical failure if skipped | Fix |
| --- | --- | --- |
| Candidate retrieval | Wrong skill family selected | Embedding + keyword hybrid retrieval |
| Policy gate | Unsafe skill selected | Hard allow/deny rules before scoring |
| Confidence threshold | Overconfident wrong execution | Fallback path when confidence is low |
| Trace capture | No root cause during outages | Persist route id, candidate scores, policy decisions |
Router quality is usually more important than incremental prompt tweaks once you scale beyond a few skills.
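As a sketch of the candidate-retrieval stage, the hybrid below blends a keyword signal with a semantic one. To stay self-contained it uses token overlap as a stand-in for the embedding score; in production that term would come from an embedding model, and the 0.5/0.5 blend weights are assumptions.

```python
# Hybrid candidate retrieval sketch. Token overlap stands in for an
# embedding-based semantic score so the example runs without a model.
def overlap_score(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve_candidates(query: str, registry: list, k: int = 3) -> list:
    scored = [
        (0.5 * overlap_score(query, s["description"])          # "semantic" signal
         + 0.5 * overlap_score(query, s["skill_id"].replace("_", " ")),  # keyword signal
         s)
        for s in registry
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]
```

Even this toy version illustrates the design point: retrieval only proposes candidates; the policy gate and scorer downstream decide what actually runs.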
Deep Dive: Scoring, Constraints, and Runtime Guarantees
Internals: hybrid routing usually beats single-strategy routing
Most robust systems combine three routing signals:
- Rule-based filters for non-negotiable constraints (risk, permissions, domain).
- Semantic match for intent-to-skill relevance.
- Operational priors from latency, error rate, and freshness.
| Router signal | Strength | Weakness |
| --- | --- | --- |
| Rules | Deterministic safety | Can be rigid |
| Semantic score | Flexible intent fit | Can over-match vague text |
| Operational priors | Production-aware decisions | Needs telemetry quality |
A pure LLM router is fast to prototype but hard to govern. A pure rules engine is predictable but brittle. The hybrid path tends to be the practical middle ground.
Mathematical model: route score with explicit penalties
A common scoring objective:
$$ RouteScore(s \mid q) = w_f \cdot Fit(s, q) - w_l \cdot Latency(s) - w_r \cdot Risk(s) + w_o \cdot Reliability(s) $$
Where:
- Fit: intent coverage confidence,
- Latency: normalized expected runtime,
- Risk: policy and safety risk score,
- Reliability: historical success and schema-valid output rate.
Add hard constraints before scoring:
$$ Allowed(s, q) = Permission(s, q) \land DataPolicy(s, q) \land RegionPolicy(s, q) $$
Then choose:
$$ s^* = \arg\max_{s \in S, Allowed(s,q)} RouteScore(s \mid q) $$
This separates policy from optimization, which keeps audits and incident reviews much cleaner.
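Plugging illustrative numbers into the objective shows how the latency penalty can outweigh a small fit advantage. The weights below are assumptions chosen for the example, not recommended values.

```python
# Worked example of the RouteScore objective. Weights are illustrative
# assumptions (w_f=0.6, w_l=0.2, w_r=0.1, w_o=0.1), not canonical.
def route_score(fit, latency, risk, reliability,
                w_f=0.6, w_l=0.2, w_r=0.1, w_o=0.1):
    return w_f * fit - w_l * latency - w_r * risk + w_o * reliability

# A slightly worse-fitting but much faster skill...
fast = route_score(fit=0.80, latency=0.20, risk=0.10, reliability=0.95)
# ...beats a better-fitting but slow one.
slow = route_score(fit=0.85, latency=0.90, risk=0.10, reliability=0.95)
print(fast, slow)  # fast > slow
```

Note the hard `Allowed` constraint is applied before this function is ever called, so a forbidden skill can never win on score alone.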
Performance analysis: what to measure in routing systems
| Metric | Why it matters | Target style |
| --- | --- | --- |
| Route accuracy | Correct skill chosen | Task-dependent baseline |
| Fallback rate | Router uncertainty / poor coverage | Low and stable |
| Schema-valid output rate | Downstream integration health | Very high |
| p95 route+execution latency | User experience and SLA risk | Within product SLO |
| Safety violation rate | Compliance and trust | Near zero |
A strong sign your registry is healthy: new skills can be added without increasing fallback and safety incidents disproportionately.
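These metrics fall directly out of the route traces the pipeline persists. A minimal aggregation sketch, with trace field names that are illustrative rather than from any specific tracing product:

```python
# Sketch: derive the health metrics above from persisted route traces.
# The trace record shape ("action", "schema_valid", "safety_violation")
# is an assumption for illustration.
def routing_health(traces: list) -> dict:
    total = len(traces)
    return {
        "fallback_rate": sum(t["action"] == "fallback" for t in traces) / total,
        "schema_valid_rate": sum(t["schema_valid"] for t in traces) / total,
        "safety_violation_rate": sum(t["safety_violation"] for t in traces) / total,
    }
```

Tracking these per skill and per registry version is what lets you say "fallback rate rose after skill X v3.2 shipped" instead of digging through transcripts.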
Evaluation Loop: Offline Replay, Shadow Routing, and Live Gates
Evaluation is not one number. It is a loop.
```mermaid
sequenceDiagram
    participant D as Dataset Store
    participant R as Router
    participant E as Evaluator
    participant P as Prod Traffic
    D->>R: historical requests replay
    R-->>E: selected skill + confidence + trace
    E->>E: compute quality/safety/latency metrics
    E-->>R: threshold updates and alerts
    P->>R: live requests (shadow mode)
    R-->>E: shadow route decisions
    E-->>R: promote or rollback recommendation
```
Recommended evaluation layers:
| Layer | Input | Output |
| --- | --- | --- |
| Offline replay | Curated request set | Route accuracy, regression diffs |
| Shadow mode | Live traffic copy | Real-world drift signals |
| Online canary | Small user slice | Business-safe rollout confidence |
For intermediate maturity, start with one robust offline suite and one shadow dashboard before touching canary automation.
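The offline-replay layer can start very small: a labeled request set and a diff of router decisions against expected skills. A minimal harness sketch, where the dataset shape and the `router` callable signature are assumptions:

```python
# Offline replay harness sketch: route accuracy plus a regression diff.
# `router` is any callable mapping a request string to a skill_id.
def replay(router, dataset: list) -> dict:
    wrong = [
        {"request": case["request"],
         "expected": case["expected_skill"],
         "got": got}
        for case in dataset
        if (got := router(case["request"])) != case["expected_skill"]
    ]
    return {"accuracy": 1 - len(wrong) / len(dataset), "regressions": wrong}
```

Run this on every router or registry change; the `regressions` list is what turns "accuracy dropped 3%" into an actionable diff of exactly which requests flipped.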
Real-World Applications: Rollout Patterns That Work in Real Teams
Pattern 1: New skill onboarding checklist
- Add skill metadata and policy fields to registry.
- Add at least 20 representative replay prompts.
- Verify schema-valid output rate and safety checks.
- Enable shadow routing before any user-facing traffic.
Pattern 2: Risk-tiered routing
| Intent class | Route policy |
| --- | --- |
| Informational Q&A | Standard skill routing |
| Data mutation | High-confidence threshold + stricter policy gate |
| Regulated output | Human approval or signed workflow |
Pattern 3: Progressive promotion
- `dev` registry namespace,
- `staging` with replay and shadow tests,
- `prod-canary` for 1-5% traffic,
- full promotion if metrics pass.
This avoids the all-or-nothing rollout trap that causes noisy incidents.
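The promotion step in that progression can be an explicit gate rather than a judgment call. A sketch of one, comparing canary metrics against the baseline; the threshold values are illustrative assumptions:

```python
# Canary promotion gate sketch: promote only when canary metrics stay
# within tolerance of the baseline. Thresholds here are illustrative.
def promotion_decision(baseline: dict, canary: dict,
                       max_fallback_delta: float = 0.02,
                       min_schema_valid: float = 0.99) -> str:
    if canary["schema_valid_rate"] < min_schema_valid:
        return "rollback"
    if canary["fallback_rate"] > baseline["fallback_rate"] + max_fallback_delta:
        return "rollback"
    return "promote"
```

Encoding the decision this way makes rollbacks boring: nobody has to argue about whether a 7-point fallback-rate jump is "probably fine."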
Trade-offs: Failure Modes and Mitigations in Skill Routing Systems
| Failure mode | Typical symptom | Mitigation |
| --- | --- | --- |
| Registry drift | Skill docs and behavior diverge | Contract tests + version pinning |
| Overlapping skills | Router flips between near-identical skills | Capability taxonomy + ownership boundaries |
| Silent policy gaps | Unexpected sensitive actions | Deny-by-default policy design |
| Score overfitting | Good replay metrics, bad live behavior | Shadow routing with live telemetry |
| Evaluation blind spots | Regressions after release | Include adversarial and long-tail test sets |
Also watch for this anti-pattern: using one global confidence threshold for every intent type. High-risk intents need stricter thresholds.
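The fix for that anti-pattern is a per-intent threshold table. A minimal sketch, where the intent names and cutoff values are assumptions for illustration:

```python
# Intent-aware confidence thresholds instead of one global cutoff.
# Intent names and numbers are illustrative assumptions.
THRESHOLDS = {
    "informational": 0.35,   # cheap to be wrong, low bar
    "data_mutation": 0.70,   # state changes need more certainty
    "regulated": 0.90,       # near-certainty or escalate to a human
}

def needs_fallback(intent: str, confidence: float) -> bool:
    # Unknown intents get a strict default rather than a permissive one.
    return confidence < THRESHOLDS.get(intent, 0.70)
```

Note the deny-by-default flavor: an intent the table has never seen inherits the strict threshold, not the lenient one.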
Decision Guide: What to Build First at Your Current Maturity
| Team situation | Build first | Build second |
| --- | --- | --- |
| 3-5 skills, early product stage | Basic registry with owners and schemas | Deterministic policy gate |
| 10-20 skills, multiple teams | Hybrid router with scoring + traces | Offline replay regression suite |
| Regulated domain or high-risk actions | Strict policy engine + approvals | Canary automation with rollback |
| Frequent model or prompt updates | Evaluation harness with drift alerts | Route score calibration tooling |

| Decision question | Recommendation |
| --- | --- |
| Should routing live in prompts only? | No, keep prompts as one signal, not the sole control plane |
| Should every skill have full autonomy? | No, route through centralized policy + registry metadata |
| Should evaluation be periodic only? | No, combine continuous shadow metrics with scheduled replays |
| Should fallback be generic? | No, define intent-aware fallbacks per risk tier |
Practical Examples: Registry and Router Skeleton
Example 1: Registry document shape (JSON)
```json
{
  "skill_id": "incident_triage_v3",
  "version": "3.2.1",
  "description": "Analyze outage alerts, summarize impact, and create incident tickets.",
  "risk_level": "medium",
  "owner_team": "sre-platform",
  "allowed_data_domains": ["logs", "incidents"],
  "input_schema": {
    "type": "object",
    "required": ["service", "time_range", "alert_id"]
  },
  "output_schema": {
    "type": "object",
    "required": ["summary", "severity", "ticket_id"]
  },
  "slo": {
    "p95_latency_ms": 4000,
    "schema_valid_rate": 0.99
  },
  "fallback_skill_id": "incident_triage_safe_v1"
}
```
This is enough to power both routing decisions and operator dashboards.
Example 2: Hybrid route selection sketch (Python)
```python
from typing import Dict, List

def allowed(skill: Dict, request: Dict) -> bool:
    # Hard policy gate: high-risk skills require a request explicitly
    # cleared for high risk; everything else is denied by default.
    if skill.get("risk_level") == "high" and request.get("risk_tier") != "high":
        return False
    required_domain = request.get("domain")
    return required_domain in skill.get("allowed_data_domains", [])

def route_score(skill: Dict, fit: float, latency_ms: float, reliability: float) -> float:
    risk_penalty = {"low": 0.05, "medium": 0.15, "high": 0.40}[skill.get("risk_level", "medium")]
    return 0.55 * fit - 0.20 * (latency_ms / 5000.0) - 0.15 * risk_penalty + 0.10 * reliability

def choose_skill(request: Dict, candidates: List[Dict]) -> Dict:
    scored = []
    for skill in candidates:
        if not allowed(skill, request):
            continue
        # Placeholder signals. In production these come from embedding match and telemetry.
        fit = skill.get("fit", 0.0)
        latency_ms = skill.get("p95_latency_ms", 3000)
        reliability = skill.get("schema_valid_rate", 0.95)
        scored.append((route_score(skill, fit, latency_ms, reliability), skill))
    if not scored:
        return {"action": "fallback", "reason": "no_allowed_candidate"}
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_skill = scored[0]
    if best_score < 0.35:
        return {"action": "fallback", "reason": "low_confidence", "score": best_score}
    return {"action": "execute", "skill_id": best_skill["skill_id"], "score": best_score}
```
Even this simple structure gives you deterministic policy handling and explainable route selection.
Lessons Learned from Scaling Agent Skill Systems
- Registry design quality determines routing quality more than people expect.
- If metadata ownership is unclear, incident resolution slows down dramatically.
- Route traces are as important as model logs for debugging production failures.
- Always separate policy eligibility from score ranking.
- Evaluation must include long-tail and adversarial queries, not only happy-path prompts.
- Fallback quality matters as much as primary route quality for user trust.
LangChain and LangGraph: Building the Routing and Execution Stack
LangChain is a Python/TypeScript framework providing composable building blocks for LLM applications: tools, chains, memory, and output parsers. LangGraph extends LangChain with stateful, cyclic graphs for multi-step agent workflows where routing decisions must react to intermediate outputs rather than executing a fixed sequence of steps.
```python
# pip install langchain langchain-openai langgraph
from typing import TypedDict, Literal, Optional

from langchain_core.tools import tool
from langgraph.graph import StateGraph, END

# --- Registry-backed tools (one action each) ---
@tool
def fetch_service_metrics(service: str, window_minutes: int = 15) -> dict:
    """Retrieve error rate and p95 latency for a service over the given window."""
    # Replace with real observability API call
    return {"service": service, "error_rate": 0.082, "p95_latency_ms": 1340}

@tool
def create_incident_ticket(summary: str, severity: str) -> str:
    """Create a production incident ticket and return the ticket ID."""
    return f"INC-{abs(hash(summary)) % 9999:04d}"

# --- LangGraph stateful skill: routes through nodes based on intermediate state ---
class IncidentState(TypedDict):
    service: str
    metrics: Optional[dict]
    incident_id: Optional[str]
    status: Literal["pending", "escalated", "done"]

def fetch_node(state: IncidentState) -> IncidentState:
    metrics = fetch_service_metrics.invoke(
        {"service": state["service"], "window_minutes": 15}
    )
    return {**state, "metrics": metrics}

def ticket_node(state: IncidentState) -> IncidentState:
    summary = (f"High error rate on {state['service']}: "
               f"{state['metrics']['error_rate']:.1%}")
    ticket_id = create_incident_ticket.invoke(
        {"summary": summary, "severity": "high"}
    )
    return {**state, "incident_id": ticket_id, "status": "escalated"}

def route(state: IncidentState) -> str:
    """Policy gate: only escalate when error rate exceeds threshold."""
    if state.get("metrics", {}).get("error_rate", 0) >= 0.05:
        return "ticket"
    return END

graph = StateGraph(IncidentState)
graph.add_node("fetch", fetch_node)
graph.add_node("ticket", ticket_node)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch", route)
graph.add_edge("ticket", END)
skill = graph.compile()

result = skill.invoke({"service": "payments-svc", "metrics": None,
                       "incident_id": None, "status": "pending"})
print(result)
```
LangGraph's StateGraph maps directly onto the registry-router-evaluator control plane described earlier: each node is a step, each edge is a conditional routing decision, and the route function encodes the policy gate, separate from execution logic, independently testable, and auditable in the state trace.
For a full deep-dive on LangChain tool schemas, LangGraph multi-agent orchestration, and checkpoint-based resumability, a dedicated follow-up post is planned.
TLDR: Summary & Key Takeaways
- Production agents need a control plane: registry, router, and evaluator.
- A strong registry captures contracts, ownership, risk, and runtime expectations.
- Hybrid routing (rules + semantic fit + telemetry priors) is usually the best practical approach.
- Policy constraints should be hard gates before scoring.
- Evaluation should be continuous: replay, shadow, and canary.
- Reliable fallbacks turn router uncertainty into safe user outcomes.
One-line takeaway: Great agent behavior is rarely accidental; it is routed, constrained, and continuously evaluated.
Practice Quiz
1. Why should policy checks run before route scoring?
A) To reduce code size
B) To enforce non-negotiable safety and compliance constraints
C) To improve tokenization speed
Correct Answer: B

2. Which set best represents a practical agent control plane?
A) Prompt + temperature
B) Tool docs + retries
C) Skill registry + router + evaluator
Correct Answer: C

3. A new skill has excellent replay metrics but poor live outcomes. What is the most likely missing layer?
A) More synonyms in the skill description
B) Shadow routing and drift monitoring
C) Longer system prompts
Correct Answer: B
Open-ended: Define one high-risk intent in your domain and design an appropriate routing policy, confidence threshold, and fallback flow.
Written by
Abstract Algorithms
@abstractalgorithms