
LLM Skill Registries, Routing Policies, and Evaluation for Production Agents

After tools and skills, this is the control plane: registry design, routing rules, and evaluation loops.

Abstract Algorithms · 13 min read

TLDR: If tools are primitives and skills are reusable routines, then the skill registry + router + evaluator is your production control plane. This layer decides which skill runs, under what constraints, and how you detect regressions before users do.


📖 Why a Skill Registry Becomes the Agent Control Plane

In small demos, the model picks a tool and returns a decent answer. In production, that is not enough.

You need answers to operational questions:

  • Which skills are currently active?
  • Which team owns each skill and guardrail policy?
  • Which skills are safe for high-risk intents?
  • What changed between yesterday's and today's routing behavior?

That is what a skill registry solves. It is not just a list of skill names. It is the source of truth for execution behavior.

Example โ€” one registry entry in practice:

{
  "skill_id": "sql_query_v2",
  "input_schema": { "query": "string", "db": "string" },
  "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
  "eval_hook": "sql_accuracy_v1"
}

When a user asks "Show me all orders over $500", the router matches intent data_lookup, confirms risk_level == 'low', selects sql_query_v2, and tags the response for evaluation via sql_accuracy_v1. That four-field entry is the minimum viable registry contract.

| Capability | Without registry | With registry |
|---|---|---|
| Skill discovery | Prompt memory or hardcoded list | Queryable metadata |
| Governance | Ad hoc docs | Owner, risk level, policy fields |
| Routing consistency | Prompt-dependent | Deterministic + scored selection |
| Incident triage | Slow transcript digging | Versioned skill and route traces |
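The routing_condition string from the entry above can be checked with a small predicate evaluator. This is a minimal sketch assuming a grammar of `field == 'value'` clauses joined by AND; production registries usually use a real policy language (e.g., CEL) rather than hand-rolled parsing, and `matches_condition` is a hypothetical helper.

```python
def matches_condition(condition: str, request: dict) -> bool:
    """Evaluate a conjunction of `field == 'value'` clauses against request fields."""
    for clause in condition.split(" AND "):
        field, _, expected = clause.partition(" == ")
        if str(request.get(field.strip())) != expected.strip().strip("'"):
            return False
    return True

entry = {
    "skill_id": "sql_query_v2",
    "routing_condition": "intent == 'data_lookup' AND risk_level == 'low'",
}
# Matches the "orders over $500" request from the walkthrough above.
print(matches_condition(entry["routing_condition"],
                        {"intent": "data_lookup", "risk_level": "low"}))  # True
```

Keeping conditions declarative like this lets the same string drive both router behavior and operator dashboards.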

A practical architecture has three pieces:

  1. Registry: skill metadata and contracts.
  2. Router: selects the best skill for a request.
  3. Evaluator: measures quality, safety, latency, and drift.

This post is the operational follow-up to LLM Skills vs Tools: The Missing Layer in Agent Design.


🔍 Designing a Registry That Humans and Routers Can Trust

A useful registry entry is both machine-readable and operator-readable.

Minimum fields per skill:

| Field | Example | Why it matters |
|---|---|---|
| skill_id | incident_triage_v3 | Stable reference in traces |
| description | "Investigate alerts and create tickets" | Helps intent matching |
| input_schema | JSON schema | Prevents malformed runs |
| output_schema | JSON schema | Stabilizes downstream integrations |
| risk_level | low, medium, high | Enables policy gating |
| allowed_data_domains | logs, tickets | Limits data exposure |
| owner_team | sre-platform | Accountability |
| slo | p95 < 4s | Runtime expectations |
| version | 3.2.1 | Safe rollouts and rollbacks |

Mature systems should also include:

  • deprecation status,
  • fallback skill id,
  • required approvals (for sensitive actions),
  • evaluation baseline hash.

A registry is a product artifact. Treat it like API surface area, not internal trivia.
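Treating the registry as API surface implies validating entries before they are written. Below is a sketch of a write-time check against the minimum fields above; the field names mirror this post's table, and `validate_entry` is a hypothetical helper, not a library API.

```python
REQUIRED_FIELDS = {
    "skill_id", "description", "input_schema", "output_schema",
    "risk_level", "allowed_data_domains", "owner_team", "slo", "version",
}
VALID_RISK_LEVELS = {"low", "medium", "high"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if "risk_level" in entry and entry["risk_level"] not in VALID_RISK_LEVELS:
        problems.append(f"invalid risk_level: {entry['risk_level']!r}")
    return problems

# An incomplete entry is rejected with every problem listed, not just the first.
print(validate_entry({"skill_id": "sql_query_v2", "risk_level": "urgent"}))
```

Running this check in CI on every registry change is a cheap way to stop registry drift before it reaches the router.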


⚙️ Routing Pipeline: From User Intent to Skill Selection

A production router should be explicit about stages.

flowchart TD
    A[User request] --> B[Intent and entity extraction]
    B --> C[Candidate skill retrieval from registry]
    C --> D[Policy gate: data/risk/compliance]
    D --> E[Score candidates: fit, cost, risk, freshness]
    E --> F{Confidence above threshold?}
    F -- Yes --> G[Select top skill]
    F -- No --> H[Fallback: safe default or human review]
    G --> I[Execute skill with trace]
    H --> I
    I --> J[Return response + route metadata]

This pipeline prevents a common failure mode: the model picks a "kind of related" skill because a keyword looked similar.

| Stage | Typical failure if skipped | Fix |
|---|---|---|
| Candidate retrieval | Wrong skill family selected | Embedding + keyword hybrid retrieval |
| Policy gate | Unsafe skill selected | Hard allow/deny rules before scoring |
| Confidence threshold | Overconfident wrong execution | Fallback path when confidence is low |
| Trace capture | No root cause during outages | Persist route id, candidate scores, policy decisions |

Router quality is usually more important than incremental prompt tweaks once you scale beyond a few skills.
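The stages in the flowchart can be made explicit in code. This is a sketch with stub helpers: `extract_intent`, `policy_gate`, and `score` are placeholders you would replace with a classifier, a policy engine, and a real scoring model. The point is the stage ordering and the persisted trace, not the stub logic.

```python
def extract_intent(request: dict) -> str:
    # Stub: production systems use an intent classifier here (stage B).
    return request["intent"]

def policy_gate(skill: dict, request: dict) -> bool:
    # Hard allow/deny before any scoring (stage D).
    return request.get("domain") in skill.get("allowed_data_domains", [])

def score(skill: dict, intent: str) -> float:
    # Stub fit score: exact intent match only (stage E).
    return 1.0 if intent in skill.get("intents", []) else 0.0

def route_request(request: dict, registry: list, threshold: float = 0.6) -> dict:
    intent = extract_intent(request)
    candidates = [s for s in registry if score(s, intent) > 0]    # stage C
    allowed = [s for s in candidates if policy_gate(s, request)]  # stage D
    scored = sorted(((score(s, intent), s["skill_id"]) for s in allowed),
                    reverse=True)
    trace = {"intent": intent, "candidates": scored}
    if scored and scored[0][0] >= threshold:                      # stage F
        trace["decision"] = ("execute", scored[0][1])
    else:
        trace["decision"] = ("fallback", "low_confidence_or_no_candidate")
    return trace
```

Note that the trace captures candidate scores and the final decision in one record, which is exactly what the "trace capture" row in the table above asks for.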


🧠 Deep Dive: Scoring, Constraints, and Runtime Guarantees

Internals: hybrid routing usually beats single-strategy routing

Most robust systems combine three routing signals:

  1. Rule-based filters for non-negotiable constraints (risk, permissions, domain).
  2. Semantic match for intent-to-skill relevance.
  3. Operational priors from latency, error rate, and freshness.

| Router signal | Strength | Weakness |
|---|---|---|
| Rules | Deterministic safety | Can be rigid |
| Semantic score | Flexible intent fit | Can over-match vague text |
| Operational priors | Production-aware decisions | Needs telemetry quality |

A pure LLM router is fast to prototype but hard to govern. A pure rules engine is predictable but brittle. The hybrid path tends to be the practical middle ground.

Mathematical model: route score with explicit penalties

A common scoring objective:

$$ RouteScore(s \mid q) = w_f \cdot Fit(s, q) - w_l \cdot Latency(s) - w_r \cdot Risk(s) + w_o \cdot Reliability(s) $$

Where:

  • Fit: intent coverage confidence,
  • Latency: normalized expected runtime,
  • Risk: policy and safety risk score,
  • Reliability: historical success and schema-valid output rate.

Add hard constraints before scoring:

$$ Allowed(s, q) = Permission(s, q) \land DataPolicy(s, q) \land RegionPolicy(s, q) $$

Then choose:

$$ s^* = \arg\max_{s \in S, Allowed(s,q)} RouteScore(s \mid q) $$

This separates policy from optimization, which keeps audits and incident reviews much cleaner.

Performance analysis: what to measure in routing systems

| Metric | Why it matters | Target style |
|---|---|---|
| Route accuracy | Correct skill chosen | Task-dependent baseline |
| Fallback rate | Router uncertainty / poor coverage | Low and stable |
| Schema-valid output rate | Downstream integration health | Very high |
| p95 route+execution latency | User experience and SLA risk | Within product SLO |
| Safety violation rate | Compliance and trust | Near zero |

A strong sign your registry is healthy: new skills can be added without a disproportionate rise in fallback rate or safety incidents.
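Two of the table's metrics can be computed directly from persisted route traces. The trace record shape below is an assumption; match it to whatever your router actually logs.

```python
def routing_health(traces: list) -> dict:
    """Fallback rate over all routes; schema-valid rate over executed routes."""
    total = len(traces)
    fallbacks = sum(1 for t in traces if t["decision"] == "fallback")
    executed = [t for t in traces if t["decision"] == "execute"]
    valid = sum(1 for t in executed if t.get("schema_valid"))
    return {
        "fallback_rate": fallbacks / total if total else 0.0,
        "schema_valid_rate": valid / len(executed) if executed else 1.0,
    }
```

Tracking these two numbers per skill version, not just globally, is what makes the "new skill added without regressions" check above verifiable.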


📊 Evaluation Loop: Offline Replay, Shadow Routing, and Live Gates

Evaluation is not one number. It is a loop.

sequenceDiagram
    participant D as Dataset Store
    participant R as Router
    participant E as Evaluator
    participant P as Prod Traffic

    D->>R: historical requests replay
    R-->>E: selected skill + confidence + trace
    E->>E: compute quality/safety/latency metrics
    E-->>R: threshold updates and alerts
    P->>R: live requests (shadow mode)
    R-->>E: shadow route decisions
    E-->>R: promote or rollback recommendation

Recommended evaluation layers:

| Layer | Input | Output |
|---|---|---|
| Offline replay | Curated request set | Route accuracy, regression diffs |
| Shadow mode | Live traffic copy | Real-world drift signals |
| Online canary | Small user slice | Business-safe rollout confidence |

For intermediate maturity, start with one robust offline suite and one shadow dashboard before touching canary automation.
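The offline replay layer reduces to running labeled requests through the router and diffing against expected skills. A sketch under assumed formats: `router` is any callable mapping a request to a skill id, and the case records are hypothetical.

```python
def replay_accuracy(cases: list, router) -> dict:
    """Compare router decisions against labeled expected skills."""
    misses = []
    for case in cases:
        got = router(case["request"])
        if got != case["expected_skill"]:
            misses.append({"request": case["request"],
                           "expected": case["expected_skill"],
                           "got": got})
    accuracy = (1.0 - len(misses) / len(cases)) if cases else 1.0
    return {"route_accuracy": accuracy, "regressions": misses}
```

Persisting the `regressions` list alongside the score is what turns a replay run into a usable regression diff rather than a single opaque number.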


🌍 Real-World Applications: Rollout Patterns That Work in Real Teams

Pattern 1: New skill onboarding checklist

  • Add skill metadata and policy fields to registry.
  • Add at least 20 representative replay prompts.
  • Verify schema-valid output rate and safety checks.
  • Enable shadow routing before any user-facing traffic.

Pattern 2: Risk-tiered routing

| Intent class | Route policy |
|---|---|
| Informational Q&A | Standard skill routing |
| Data mutation | High-confidence threshold + stricter policy gate |
| Regulated output | Human approval or signed workflow |

Pattern 3: Progressive promotion

  1. dev registry namespace,
  2. staging with replay and shadow tests,
  3. prod-canary for 1-5% traffic,
  4. full promotion if metrics pass.

This avoids the all-or-nothing rollout trap that causes noisy incidents.
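Step 4 ("full promotion if metrics pass") works best as an explicit, fail-closed gate rather than a judgment call. The threshold values below are illustrative, not recommendations; tune them against your product SLOs.

```python
# Illustrative gates; a missing metric fails closed rather than passing silently.
PROMOTION_GATES = {
    "route_accuracy": lambda v: v >= 0.95,
    "fallback_rate": lambda v: v <= 0.05,
    "schema_valid_rate": lambda v: v >= 0.99,
    "safety_violations": lambda v: v == 0,
}

def promotion_decision(metrics: dict) -> tuple:
    """Return ("promote", []) or ("rollback", [failed gate names])."""
    failures = [
        name for name, gate in PROMOTION_GATES.items()
        if name not in metrics or not gate(metrics[name])
    ]
    return ("promote", []) if not failures else ("rollback", failures)
```

Because the gate names match the metrics table from the evaluation section, the same telemetry feeds both dashboards and the promotion decision.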


⚖️ Trade-offs: Failure Modes and Mitigations in Skill Routing Systems

| Failure mode | Typical symptom | Mitigation |
|---|---|---|
| Registry drift | Skill docs and behavior diverge | Contract tests + version pinning |
| Overlapping skills | Router flips between near-identical skills | Capability taxonomy + ownership boundaries |
| Silent policy gaps | Unexpected sensitive actions | Deny-by-default policy design |
| Score overfitting | Good replay metrics, bad live behavior | Shadow routing with live telemetry |
| Evaluation blind spots | Regressions after release | Include adversarial and long-tail test sets |

Also watch for this anti-pattern: using one global confidence threshold for every intent type. High-risk intents need stricter thresholds.
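The fix for that anti-pattern is a per-intent-class lookup with a fail-closed default. The numbers below are illustrative; calibrate them against replay data for each tier.

```python
# Illustrative per-tier thresholds; higher-risk intents demand more confidence.
CONFIDENCE_THRESHOLDS = {
    "informational": 0.50,
    "data_mutation": 0.80,
    "regulated": 0.95,
}

def passes_threshold(intent_class: str, confidence: float) -> bool:
    # Unknown intent classes fail closed to the strictest tier.
    required = CONFIDENCE_THRESHOLDS.get(
        intent_class, max(CONFIDENCE_THRESHOLDS.values())
    )
    return confidence >= required
```

The fail-closed default matters as much as the tiers themselves: an unclassified intent should never inherit the most permissive threshold.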


🧭 Decision Guide: What to Build First at Your Current Maturity

| Team situation | Build first | Build second |
|---|---|---|
| 3-5 skills, early product stage | Basic registry with owners and schemas | Deterministic policy gate |
| 10-20 skills, multiple teams | Hybrid router with scoring + traces | Offline replay regression suite |
| Regulated domain or high-risk actions | Strict policy engine + approvals | Canary automation with rollback |
| Frequent model or prompt updates | Evaluation harness with drift alerts | Route score calibration tooling |

| Decision question | Recommendation |
|---|---|
| Should routing live in prompts only? | No, keep prompts as one signal, not the sole control plane |
| Should every skill have full autonomy? | No, route through centralized policy + registry metadata |
| Should evaluation be periodic only? | No, combine continuous shadow metrics with scheduled replays |
| Should fallback be generic? | No, define intent-aware fallbacks per risk tier |

🧪 Practical Examples: Registry and Router Skeleton

Example 1: Registry document shape (JSON)

{
  "skill_id": "incident_triage_v3",
  "version": "3.2.1",
  "description": "Analyze outage alerts, summarize impact, and create incident tickets.",
  "risk_level": "medium",
  "owner_team": "sre-platform",
  "allowed_data_domains": ["logs", "incidents"],
  "input_schema": {
    "type": "object",
    "required": ["service", "time_range", "alert_id"]
  },
  "output_schema": {
    "type": "object",
    "required": ["summary", "severity", "ticket_id"]
  },
  "slo": {
    "p95_latency_ms": 4000,
    "schema_valid_rate": 0.99
  },
  "fallback_skill_id": "incident_triage_safe_v1"
}

This is enough to power both routing decisions and operator dashboards.

Example 2: Hybrid route selection sketch (Python)

from typing import Dict, List

def allowed(skill: Dict, request: Dict) -> bool:
    # Hard policy gate, evaluated before any scoring: deny the risky
    # combination outright, then require the request's data domain to be
    # on the skill's allow-list.
    if request.get("risk_tier") == "high" and skill.get("risk_level") == "high":
        return False
    required_domain = request.get("domain")
    return required_domain in skill.get("allowed_data_domains", [])

def route_score(skill: Dict, fit: float, latency_ms: float, reliability: float) -> float:
    risk_penalty = {"low": 0.05, "medium": 0.15, "high": 0.40}[skill.get("risk_level", "medium")]
    return 0.55 * fit - 0.20 * (latency_ms / 5000.0) - 0.15 * risk_penalty + 0.10 * reliability

def choose_skill(request: Dict, candidates: List[Dict]) -> Dict:
    scored = []
    for skill in candidates:
        if not allowed(skill, request):
            continue

        # Placeholder signals. In production these come from embedding match and telemetry.
        fit = skill.get("fit", 0.0)
        latency_ms = skill.get("p95_latency_ms", 3000)
        reliability = skill.get("schema_valid_rate", 0.95)

        scored.append((route_score(skill, fit, latency_ms, reliability), skill))

    if not scored:
        return {"action": "fallback", "reason": "no_allowed_candidate"}

    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_skill = scored[0]

    if best_score < 0.35:
        return {"action": "fallback", "reason": "low_confidence", "score": best_score}

    return {"action": "execute", "skill_id": best_skill["skill_id"], "score": best_score}

Even this simple structure gives you deterministic policy handling and explainable route selection.


📚 Lessons Learned from Scaling Agent Skill Systems

  • Registry design quality determines routing quality more than people expect.
  • If metadata ownership is unclear, incident resolution slows down dramatically.
  • Route traces are as important as model logs for debugging production failures.
  • Always separate policy eligibility from score ranking.
  • Evaluation must include long-tail and adversarial queries, not only happy-path prompts.
  • Fallback quality matters as much as primary route quality for user trust.

🛠️ LangChain and LangGraph: Building the Routing and Execution Stack

LangChain is a Python/TypeScript framework providing composable building blocks for LLM applications: tools, chains, memory, and output parsers. LangGraph extends LangChain with stateful, cyclic graphs for multi-step agent workflows where routing decisions must react to intermediate outputs rather than executing a fixed sequence of steps.

# pip install langchain langchain-openai langgraph
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal, Optional

# --- Registry-backed tools (one action each) ---
@tool
def fetch_service_metrics(service: str, window_minutes: int = 15) -> dict:
    """Retrieve error rate and p95 latency for a service over the given window."""
    # Replace with real observability API call
    return {"service": service, "error_rate": 0.082, "p95_latency_ms": 1340}

@tool
def create_incident_ticket(summary: str, severity: str) -> str:
    """Create a production incident ticket and return the ticket ID."""
    return f"INC-{abs(hash(summary)) % 9999:04d}"

# --- LangGraph stateful skill: routes through nodes based on intermediate state ---
class IncidentState(TypedDict):
    service:     str
    metrics:     Optional[dict]
    incident_id: Optional[str]
    status:      Literal["pending", "escalated", "done"]

def fetch_node(state: IncidentState) -> IncidentState:
    metrics = fetch_service_metrics.invoke(
        {"service": state["service"], "window_minutes": 15}
    )
    return {**state, "metrics": metrics}

def ticket_node(state: IncidentState) -> IncidentState:
    summary = (f"High error rate on {state['service']}: "
               f"{state['metrics']['error_rate']:.1%}")
    ticket_id = create_incident_ticket.invoke(
        {"summary": summary, "severity": "high"}
    )
    return {**state, "incident_id": ticket_id, "status": "escalated"}

def route(state: IncidentState) -> str:
    """Policy gate: only escalate when error rate exceeds threshold."""
    if state.get("metrics", {}).get("error_rate", 0) >= 0.05:
        return "ticket"
    return END

graph = StateGraph(IncidentState)
graph.add_node("fetch",  fetch_node)
graph.add_node("ticket", ticket_node)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch",  route)
graph.add_edge("ticket", END)

skill = graph.compile()
result = skill.invoke({"service": "payments-svc", "metrics": None,
                       "incident_id": None, "status": "pending"})
print(result)

LangGraph's StateGraph maps directly onto the registry-router-evaluator control plane described earlier: each node is a step, each edge is a conditional routing decision, and the route function encodes the policy gate, kept separate from execution logic, independently testable, and auditable in the state trace.

For a full deep-dive on LangChain tool schemas, LangGraph multi-agent orchestration, and checkpoint-based resumability, a dedicated follow-up post is planned.


📌 TLDR: Summary & Key Takeaways

  • Production agents need a control plane: registry, router, and evaluator.
  • A strong registry captures contracts, ownership, risk, and runtime expectations.
  • Hybrid routing (rules + semantic fit + telemetry priors) is usually the best practical approach.
  • Policy constraints should be hard gates before scoring.
  • Evaluation should be continuous: replay, shadow, and canary.
  • Reliable fallbacks turn router uncertainty into safe user outcomes.

One-line takeaway: Great agent behavior is rarely accidental; it is routed, constrained, and continuously evaluated.


📝 Practice Quiz

  1. Why should policy checks run before route scoring? A) To reduce code size B) To enforce non-negotiable safety and compliance constraints C) To improve tokenization speed

    Correct Answer: B

  2. Which set best represents a practical agent control plane? A) Prompt + temperature B) Tool docs + retries C) Skill registry + router + evaluator

    Correct Answer: C

  3. A new skill has excellent replay metrics but poor live outcomes. What is the most likely missing layer? A) More synonyms in the skill description B) Shadow routing and drift monitoring C) Longer system prompts

    Correct Answer: B

  4. Open-ended: Define one high-risk intent in your domain and design an appropriate routing policy, confidence threshold, and fallback flow.

