
LLM Skill Registries, Routing Policies, and Evaluation for Production Agents

After tools and skills, this is the control plane: registry design, routing rules, and evaluation loops.

Abstract Algorithms · 13 min read

TLDR: If tools are primitives and skills are reusable routines, then the skill registry + router + evaluator is your production control plane. This layer decides which skill runs, under what constraints, and how you detect regressions before users do.


📖 Why a Skill Registry Becomes the Agent Control Plane

In small demos, the model picks a tool and returns a decent answer. In production, that is not enough.

You need answers to operational questions:

  • Which skills are currently active?
  • Which team owns each skill and guardrail policy?
  • Which skills are safe for high-risk intents?
  • What changed between yesterday's and today's routing behavior?

That is what a skill registry solves. It is not just a list of skill names. It is the source of truth for execution behavior.

Example โ€” one registry entry in practice:

{
  "skill_id": "sql_query_v2",
  "input_schema": { "query": "string", "db": "string" },
  "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
  "eval_hook": "sql_accuracy_v1"
}

When a user asks "Show me all orders over $500", the router matches intent data_lookup, confirms risk_level == 'low', selects sql_query_v2, and tags the response for evaluation via sql_accuracy_v1. That four-field entry is the minimum viable registry contract.

| Capability | Without registry | With registry |
|---|---|---|
| Skill discovery | Prompt memory or hardcoded list | Queryable metadata |
| Governance | Ad hoc docs | Owner, risk level, policy fields |
| Routing consistency | Prompt-dependent | Deterministic + scored selection |
| Incident triage | Slow transcript digging | Versioned skill and route traces |
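The routing_condition string from the entry above can be checked with a small predicate evaluator. This is a minimal sketch assuming a grammar of `field == 'value'` clauses joined by AND; production registries usually use a real policy language (e.g., CEL) rather than hand-rolled parsing, and `matches_condition` is a hypothetical helper.

```python
def matches_condition(condition: str, request: dict) -> bool:
    """Evaluate a conjunction of `field == 'value'` clauses against request fields."""
    for clause in condition.split(" AND "):
        field, _, expected = clause.partition(" == ")
        if str(request.get(field.strip())) != expected.strip().strip("'"):
            return False
    return True

entry = {
    "skill_id": "sql_query_v2",
    "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
}
# Matches the "orders over $500" request from the walkthrough above.
print(matches_condition(entry["routing_condition"],
                        {"intent": "data_lookup", "risk_level": "low"}))  # True
```

Keeping conditions declarative like this lets the same string drive both router behavior and operator dashboards.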

A practical architecture has three pieces:

  1. Registry: skill metadata and contracts.
  2. Router: selects the best skill for a request.
  3. Evaluator: measures quality, safety, latency, and drift.

This post is the operational follow-up to LLM Skills vs Tools: The Missing Layer in Agent Design.


🔍 Designing a Registry That Humans and Routers Can Trust

A useful registry entry is both machine-readable and operator-readable.

Minimum fields per skill:

| Field | Example | Why it matters |
|---|---|---|
| skill_id | incident_triage_v3 | Stable reference in traces |
| description | "Investigate alerts and create tickets" | Helps intent matching |
| input_schema | JSON schema | Prevents malformed runs |
| output_schema | JSON schema | Stabilizes downstream integrations |
| risk_level | low, medium, high | Enables policy gating |
| allowed_data_domains | logs, tickets | Limits data exposure |
| owner_team | sre-platform | Accountability |
| slo | p95 < 4s | Runtime expectations |
| version | 3.2.1 | Safe rollouts and rollbacks |

Mature systems should also include:

  • deprecation status,
  • fallback skill id,
  • required approvals (for sensitive actions),
  • evaluation baseline hash.

A registry is a product artifact. Treat it like API surface area, not internal trivia.
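Treating the registry as API surface implies validating entries before they are written. Below is a sketch of a write-time check against the minimum fields above; the field names mirror this post's table, and `validate_entry` is a hypothetical helper, not a library API.

```python
REQUIRED_FIELDS = {
    "skill_id", "description", "input_schema", "output_schema",
    "risk_level", "allowed_data_domains", "owner_team", "slo", "version",
}
VALID_RISK_LEVELS = {"low", "medium", "high"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if "risk_level" in entry and entry["risk_level"] not in VALID_RISK_LEVELS:
        problems.append(f"invalid risk_level: {entry['risk_level']!r}")
    return problems

# An incomplete entry is rejected with every problem listed, not just the first.
print(validate_entry({"skill_id": "sql_query_v2", "risk_level": "urgent"}))
```

Running this check in CI on every registry change is a cheap way to stop registry drift before it reaches the router.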


⚙️ Routing Pipeline: From User Intent to Skill Selection

A production router should be explicit about stages.

flowchart TD
    A[User request] --> B[Intent and entity extraction]
    B --> C[Candidate skill retrieval from registry]
    C --> D[Policy gate: data/risk/compliance]
    D --> E[Score candidates: fit, cost, risk, freshness]
    E --> F{Confidence above threshold?}
    F -- Yes --> G[Select top skill]
    F -- No --> H[Fallback: safe default or human review]
    G --> I[Execute skill with trace]
    H --> I
    I --> J[Return response + route metadata]

This pipeline prevents a common failure mode: the model picks a "kind of related" skill because a keyword looked similar.

| Stage | Typical failure if skipped | Fix |
|---|---|---|
| Candidate retrieval | Wrong skill family selected | Embedding + keyword hybrid retrieval |
| Policy gate | Unsafe skill selected | Hard allow/deny rules before scoring |
| Confidence threshold | Overconfident wrong execution | Fallback path when confidence is low |
| Trace capture | No root cause during outages | Persist route id, candidate scores, policy decisions |

Router quality is usually more important than incremental prompt tweaks once you scale beyond a few skills.
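The stages in the flowchart can be made explicit in code. This is a sketch with stub helpers: `extract_intent`, `policy_gate`, and `score` are placeholders you would replace with a classifier, a policy engine, and a real scoring model. The point is the stage ordering and the persisted trace, not the stub logic.

```python
def extract_intent(request: dict) -> str:
    # Stub: production systems use an intent classifier here (stage B).
    return request["intent"]

def policy_gate(skill: dict, request: dict) -> bool:
    # Hard allow/deny before any scoring (stage D).
    return request.get("domain") in skill.get("allowed_data_domains", [])

def score(skill: dict, intent: str) -> float:
    # Stub fit score: exact intent match only (stage E).
    return 1.0 if intent in skill.get("intents", []) else 0.0

def route_request(request: dict, registry: list, threshold: float = 0.6) -> dict:
    intent = extract_intent(request)
    candidates = [s for s in registry if score(s, intent) > 0]    # stage C
    allowed = [s for s in candidates if policy_gate(s, request)]  # stage D
    scored = sorted(((score(s, intent), s["skill_id"]) for s in allowed),
                    reverse=True)
    trace = {"intent": intent, "candidates": scored}
    if scored and scored[0][0] >= threshold:                      # stage F
        trace["decision"] = ("execute", scored[0][1])
    else:
        trace["decision"] = ("fallback", "low_confidence_or_no_candidate")
    return trace
```

Note that the trace captures candidate scores and the final decision in one record, which is exactly what the "trace capture" row in the table above asks for.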


🧠 Deep Dive: Scoring, Constraints, and Runtime Guarantees

Internals: hybrid routing usually beats single-strategy routing

Most robust systems combine three routing signals:

  1. Rule-based filters for non-negotiable constraints (risk, permissions, domain).
  2. Semantic match for intent-to-skill relevance.
  3. Operational priors from latency, error rate, and freshness.

| Router signal | Strength | Weakness |
|---|---|---|
| Rules | Deterministic safety | Can be rigid |
| Semantic score | Flexible intent fit | Can over-match vague text |
| Operational priors | Production-aware decisions | Needs telemetry quality |

A pure LLM router is fast to prototype but hard to govern. A pure rules engine is predictable but brittle. The hybrid path tends to be the practical middle ground.

Mathematical model: route score with explicit penalties

A common scoring objective:

$$ RouteScore(s \mid q) = w_f \cdot Fit(s, q) - w_l \cdot Latency(s) - w_r \cdot Risk(s) + w_o \cdot Reliability(s) $$

Where:

  • Fit: intent coverage confidence,
  • Latency: normalized expected runtime,
  • Risk: policy and safety risk score,
  • Reliability: historical success and schema-valid output rate.

Add hard constraints before scoring:

$$ Allowed(s, q) = Permission(s, q) \land DataPolicy(s, q) \land RegionPolicy(s, q) $$

Then choose:

$$ s^* = \arg\max_{s \in S, Allowed(s,q)} RouteScore(s \mid q) $$

This separates policy from optimization, which keeps audits and incident reviews much cleaner.

Performance analysis: what to measure in routing systems

| Metric | Why it matters | Target style |
|---|---|---|
| Route accuracy | Correct skill chosen | Task-dependent baseline |
| Fallback rate | Router uncertainty / poor coverage | Low and stable |
| Schema-valid output rate | Downstream integration health | Very high |
| p95 route+execution latency | User experience and SLA risk | Within product SLO |
| Safety violation rate | Compliance and trust | Near zero |

A strong sign your registry is healthy: new skills can be added without a disproportionate rise in fallback rate or safety incidents.
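Two of the table's metrics can be computed directly from persisted route traces. The trace record shape below is an assumption; match it to whatever your router actually logs.

```python
def routing_health(traces: list) -> dict:
    """Fallback rate over all routes; schema-valid rate over executed routes."""
    total = len(traces)
    fallbacks = sum(1 for t in traces if t["decision"] == "fallback")
    executed = [t for t in traces if t["decision"] == "execute"]
    valid = sum(1 for t in executed if t.get("schema_valid"))
    return {
        "fallback_rate": fallbacks / total if total else 0.0,
        "schema_valid_rate": valid / len(executed) if executed else 1.0,
    }
```

Tracking these two numbers per skill version, not just globally, is what makes the "new skill added without regressions" check above verifiable.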


📊 Evaluation Loop: Offline Replay, Shadow Routing, and Live Gates

Evaluation is not one number. It is a loop.

sequenceDiagram
    participant D as Dataset Store
    participant R as Router
    participant E as Evaluator
    participant P as Prod Traffic

    D->>R: historical requests replay
    R-->>E: selected skill + confidence + trace
    E->>E: compute quality/safety/latency metrics
    E-->>R: threshold updates and alerts
    P->>R: live requests (shadow mode)
    R-->>E: shadow route decisions
    E-->>R: promote or rollback recommendation

Recommended evaluation layers:

| Layer | Input | Output |
|---|---|---|
| Offline replay | Curated request set | Route accuracy, regression diffs |
| Shadow mode | Live traffic copy | Real-world drift signals |
| Online canary | Small user slice | Business-safe rollout confidence |

For intermediate maturity, start with one robust offline suite and one shadow dashboard before touching canary automation.
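The offline replay layer reduces to running labeled requests through the router and diffing against expected skills. A sketch under assumed formats: `router` is any callable mapping a request to a skill id, and the case records are hypothetical.

```python
def replay_accuracy(cases: list, router) -> dict:
    """Compare router decisions against labeled expected skills."""
    misses = []
    for case in cases:
        got = router(case["request"])
        if got != case["expected_skill"]:
            misses.append({"request": case["request"],
                           "expected": case["expected_skill"],
                           "got": got})
    accuracy = (1.0 - len(misses) / len(cases)) if cases else 1.0
    return {"route_accuracy": accuracy, "regressions": misses}
```

Persisting the `regressions` list alongside the score is what turns a replay run into a usable regression diff rather than a single opaque number.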


🌍 Real-World Applications: Rollout Patterns That Work in Real Teams

Pattern 1: New skill onboarding checklist

  • Add skill metadata and policy fields to registry.
  • Add at least 20 representative replay prompts.
  • Verify schema-valid output rate and safety checks.
  • Enable shadow routing before any user-facing traffic.

Pattern 2: Risk-tiered routing

| Intent class | Route policy |
|---|---|
| Informational Q&A | Standard skill routing |
| Data mutation | High-confidence threshold + stricter policy gate |
| Regulated output | Human approval or signed workflow |

Pattern 3: Progressive promotion

  1. dev registry namespace,
  2. staging with replay and shadow tests,
  3. prod-canary for 1-5% traffic,
  4. full promotion if metrics pass.

This avoids the all-or-nothing rollout trap that causes noisy incidents.
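Step 4 ("full promotion if metrics pass") works best as an explicit, fail-closed gate rather than a judgment call. The threshold values below are illustrative, not recommendations; tune them against your product SLOs.

```python
# Illustrative gates; a missing metric fails closed rather than passing silently.
PROMOTION_GATES = {
    "route_accuracy": lambda v: v >= 0.95,
    "fallback_rate": lambda v: v <= 0.05,
    "schema_valid_rate": lambda v: v >= 0.99,
    "safety_violations": lambda v: v == 0,
}

def promotion_decision(metrics: dict) -> tuple:
    """Return ("promote", []) or ("rollback", [failed gate names])."""
    failures = [
        name for name, gate in PROMOTION_GATES.items()
        if name not in metrics or not gate(metrics[name])
    ]
    return ("promote", []) if not failures else ("rollback", failures)
```

Because the gate names match the metrics table from the evaluation section, the same telemetry feeds both dashboards and the promotion decision.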


⚖️ Trade-offs: Failure Modes and Mitigations in Skill Routing Systems

| Failure mode | Typical symptom | Mitigation |
|---|---|---|
| Registry drift | Skill docs and behavior diverge | Contract tests + version pinning |
| Overlapping skills | Router flips between near-identical skills | Capability taxonomy + ownership boundaries |
| Silent policy gaps | Unexpected sensitive actions | Deny-by-default policy design |
| Score overfitting | Good replay metrics, bad live behavior | Shadow routing with live telemetry |
| Evaluation blind spots | Regressions after release | Include adversarial and long-tail test sets |

Also watch for this anti-pattern: using one global confidence threshold for every intent type. High-risk intents need stricter thresholds.
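The fix for that anti-pattern is a per-intent-class lookup with a fail-closed default. The numbers below are illustrative; calibrate them against replay data for each tier.

```python
# Illustrative per-tier thresholds; higher-risk intents demand more confidence.
CONFIDENCE_THRESHOLDS = {
    "informational": 0.50,
    "data_mutation": 0.80,
    "regulated": 0.95,
}

def passes_threshold(intent_class: str, confidence: float) -> bool:
    # Unknown intent classes fail closed to the strictest tier.
    required = CONFIDENCE_THRESHOLDS.get(
        intent_class, max(CONFIDENCE_THRESHOLDS.values())
    )
    return confidence >= required
```

The fail-closed default matters as much as the tiers themselves: an unclassified intent should never inherit the most permissive threshold.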


🧭 Decision Guide: What to Build First at Your Current Maturity

| Team situation | Build first | Build second |
|---|---|---|
| 3-5 skills, early product stage | Basic registry with owners and schemas | Deterministic policy gate |
| 10-20 skills, multiple teams | Hybrid router with scoring + traces | Offline replay regression suite |
| Regulated domain or high-risk actions | Strict policy engine + approvals | Canary automation with rollback |
| Frequent model or prompt updates | Evaluation harness with drift alerts | Route score calibration tooling |

| Decision question | Recommendation |
|---|---|
| Should routing live in prompts only? | No, keep prompts as one signal, not the sole control plane |
| Should every skill have full autonomy? | No, route through centralized policy + registry metadata |
| Should evaluation be periodic only? | No, combine continuous shadow metrics with scheduled replays |
| Should fallback be generic? | No, define intent-aware fallbacks per risk tier |

🧪 Practical Examples: Registry and Router Skeleton

Example 1: Registry document shape (JSON)

{
  "skill_id": "incident_triage_v3",
  "version": "3.2.1",
  "description": "Analyze outage alerts, summarize impact, and create incident tickets.",
  "risk_level": "medium",
  "owner_team": "sre-platform",
  "allowed_data_domains": ["logs", "incidents"],
  "input_schema": {
    "type": "object",
    "required": ["service", "time_range", "alert_id"]
  },
  "output_schema": {
    "type": "object",
    "required": ["summary", "severity", "ticket_id"]
  },
  "slo": {
    "p95_latency_ms": 4000,
    "schema_valid_rate": 0.99
  },
  "fallback_skill_id": "incident_triage_safe_v1"
}

This is enough to power both routing decisions and operator dashboards.

Example 2: Hybrid route selection sketch (Python)

from typing import Dict, List

def allowed(skill: Dict, request: Dict) -> bool:
    # Hard policy gate, evaluated before any scoring: deny the risky
    # combination outright, then require the request's data domain to be
    # on the skill's allow-list.
    if request.get("risk_tier") == "high" and skill.get("risk_level") == "high":
        return False
    required_domain = request.get("domain")
    return required_domain in skill.get("allowed_data_domains", [])

def route_score(skill: Dict, fit: float, latency_ms: float, reliability: float) -> float:
    risk_penalty = {"low": 0.05, "medium": 0.15, "high": 0.40}[skill.get("risk_level", "medium")]
    return 0.55 * fit - 0.20 * (latency_ms / 5000.0) - 0.15 * risk_penalty + 0.10 * reliability

def choose_skill(request: Dict, candidates: List[Dict]) -> Dict:
    scored = []
    for skill in candidates:
        if not allowed(skill, request):
            continue

        # Placeholder signals. In production these come from embedding match and telemetry.
        fit = skill.get("fit", 0.0)
        latency_ms = skill.get("p95_latency_ms", 3000)
        reliability = skill.get("schema_valid_rate", 0.95)

        scored.append((route_score(skill, fit, latency_ms, reliability), skill))

    if not scored:
        return {"action": "fallback", "reason": "no_allowed_candidate"}

    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_skill = scored[0]

    if best_score < 0.35:
        return {"action": "fallback", "reason": "low_confidence", "score": best_score}

    return {"action": "execute", "skill_id": best_skill["skill_id"], "score": best_score}

Even this simple structure gives you deterministic policy handling and explainable route selection.


📚 Lessons Learned from Scaling Agent Skill Systems

  • Registry design quality determines routing quality more than people expect.
  • If metadata ownership is unclear, incident resolution slows down dramatically.
  • Route traces are as important as model logs for debugging production failures.
  • Always separate policy eligibility from score ranking.
  • Evaluation must include long-tail and adversarial queries, not only happy-path prompts.
  • Fallback quality matters as much as primary route quality for user trust.

🛠️ LangChain and LangGraph: Building the Routing and Execution Stack

LangChain is a Python/TypeScript framework providing composable building blocks for LLM applications: tools, chains, memory, and output parsers. LangGraph extends LangChain with stateful, cyclic graphs for multi-step agent workflows where routing decisions must react to intermediate outputs rather than executing a fixed sequence of steps.

# pip install langchain langchain-openai langgraph
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal, Optional

# --- Registry-backed tools (one action each) ---
@tool
def fetch_service_metrics(service: str, window_minutes: int = 15) -> dict:
    """Retrieve error rate and p95 latency for a service over the given window."""
    # Replace with real observability API call
    return {"service": service, "error_rate": 0.082, "p95_latency_ms": 1340}

@tool
def create_incident_ticket(summary: str, severity: str) -> str:
    """Create a production incident ticket and return the ticket ID."""
    return f"INC-{abs(hash(summary)) % 9999:04d}"

# --- LangGraph stateful skill: routes through nodes based on intermediate state ---
class IncidentState(TypedDict):
    service:     str
    metrics:     Optional[dict]
    incident_id: Optional[str]
    status:      Literal["pending", "escalated", "done"]

def fetch_node(state: IncidentState) -> IncidentState:
    metrics = fetch_service_metrics.invoke(
        {"service": state["service"], "window_minutes": 15}
    )
    return {**state, "metrics": metrics}

def ticket_node(state: IncidentState) -> IncidentState:
    summary = (f"High error rate on {state['service']}: "
               f"{state['metrics']['error_rate']:.1%}")
    ticket_id = create_incident_ticket.invoke(
        {"summary": summary, "severity": "high"}
    )
    return {**state, "incident_id": ticket_id, "status": "escalated"}

def route(state: IncidentState) -> str:
    """Policy gate: only escalate when error rate exceeds threshold."""
    if state.get("metrics", {}).get("error_rate", 0) >= 0.05:
        return "ticket"
    return END

graph = StateGraph(IncidentState)
graph.add_node("fetch",  fetch_node)
graph.add_node("ticket", ticket_node)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch",  route)
graph.add_edge("ticket", END)

skill = graph.compile()
result = skill.invoke({"service": "payments-svc", "metrics": None,
                       "incident_id": None, "status": "pending"})
print(result)

LangGraph's StateGraph maps directly onto the registry-router-evaluator control plane described earlier: each node is a step, each edge is a conditional routing decision, and the route function encodes the policy gate, kept separate from execution logic, independently testable, and auditable in the state trace.

For a full deep-dive on LangChain tool schemas, LangGraph multi-agent orchestration, and checkpoint-based resumability, a dedicated follow-up post is planned.


📌 TLDR: Summary & Key Takeaways

  • Production agents need a control plane: registry, router, and evaluator.
  • A strong registry captures contracts, ownership, risk, and runtime expectations.
  • Hybrid routing (rules + semantic fit + telemetry priors) is usually the best practical approach.
  • Policy constraints should be hard gates before scoring.
  • Evaluation should be continuous: replay, shadow, and canary.
  • Reliable fallbacks turn router uncertainty into safe user outcomes.

One-line takeaway: Great agent behavior is rarely accidental; it is routed, constrained, and continuously evaluated.


📝 Practice Quiz

  1. Why should policy checks run before route scoring? A) To reduce code size B) To enforce non-negotiable safety and compliance constraints C) To improve tokenization speed

    Correct Answer: B

  2. Which set best represents a practical agent control plane? A) Prompt + temperature B) Tool docs + retries C) Skill registry + router + evaluator

    Correct Answer: C

  3. A new skill has excellent replay metrics but poor live outcomes. What is the most likely missing layer? A) More synonyms in the skill description B) Shadow routing and drift monitoring C) Longer system prompts

    Correct Answer: B

  4. Open-ended: Define one high-risk intent in your domain and design an appropriate routing policy, confidence threshold, and fallback flow.

