Confidence as a routing signal, not just a number
Most teams attach a confidence score to model output and stop there. The mature pattern uses it as a routing signal: high goes to the fast path, mid to human-in-the-loop, low to rejection with a reason. The thresholds are product-specific, the audit story is per-path, and calibration is a discipline.
Almost every model output I've seen ship in the last two years has a confidence score attached to it. Sometimes it's a logprob-derived number. Sometimes it's a self-rated 0-100 the model produces alongside the answer. Sometimes it's a downstream classifier's probability. Whatever the source, it's there, a single number meant to summarize how much the system trusts its own answer.
And then most teams stop. The number gets logged, maybe surfaced in a debug panel, occasionally referenced in a postmortem. But it doesn't change anything about how the output is treated. The high-confidence answer and the low-confidence answer take the same path through the system. The number is decoration.
That's the pattern I want to push back on. Confidence isn't a label you attach. It's a routing signal. The mature shape: high-confidence outputs go to the fast path, mid-confidence outputs go to a human-in-the-loop path, low-confidence outputs get rejected with a reason. Same model, same output, different downstream treatment based on what the confidence number actually says.
Here's how this fits with the rest of the series. It's the operational complement to the bounded autonomy framing and the six-rung autonomy ladder. The ladder defines how much rope the agent has earned for an operation. Confidence routing defines what the agent does with that rope on a given output. Both are about putting structure around the moments where the system is about to act on a model answer.
The three paths
The shape is simple enough to describe in three lines (a minimal routing sketch follows the list):
- High confidence → the output is acted on directly. Returned to the caller, written to the store, applied to the resource. The system trusts the answer enough to skip the gate.
- Mid confidence → the output is queued for human review before it lands. The reviewer sees the input, the proposed output, the confidence number, and whatever the model said about its own reasoning, then either approves, edits, or rejects.
- Low confidence → the output is rejected. Not silently: the caller gets a structured response that says "I'm not confident enough about this to act on it. Here's why." The "why" might be missing context, ambiguous input, an out-of-distribution question, or a constraint violation. The point is that the rejection is informative, not a shrug.
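Here's that sketch in Python. The names are mine, not a library's, and the two thresholds anticipate the policy discussion below; treat it as a minimal illustration of the three-way split, not a reference implementation:

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FAST_PATH = "fast_path"        # act on the output directly
    HUMAN_REVIEW = "human_review"  # queue for review before it lands
    REJECT = "reject"              # decline to act, with a reason


@dataclass
class Rejection:
    """Structured 'not confident enough' response returned to the caller."""
    reason: str        # e.g. "missing context", "ambiguous input"
    confidence: float  # surfaced so the caller can see why the gate closed


def route(confidence: float, fast_path_floor: float, rejection_ceiling: float) -> Route:
    """Three-way split on a single confidence number."""
    if confidence >= fast_path_floor:
        return Route.FAST_PATH
    if confidence < rejection_ceiling:
        return Route.REJECT
    return Route.HUMAN_REVIEW
```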
The shape is obviously useful once you see it. The reason most teams don't ship it is that it requires three things they skip: picking thresholds with intent, building the human-in-the-loop path as a real product surface, and maintaining calibration over time. Each is more work than logging the number.
Thresholds are product-specific, not universal
The first place teams go wrong is treating the confidence threshold as a system-wide constant. They pick 0.8 because it sounded reasonable, set it once, and apply it everywhere the model is invoked. This is the same category error as treating autonomy as a global property of the agent rather than a per-operation calibration.
The threshold is a function of the operation, not the model. A confidence-routed system for triaging support tickets has different thresholds than a confidence-routed system for proposing schema migrations. The triage system can probably let through anything above 0.6; the cost of misrouting a ticket is small and the human-in-the-loop is the team lead doing a daily sweep. The schema migration system shouldn't fast-path anything below 0.95; the cost of being wrong is days of recovery work and potentially a data integrity event.
The dimensions are familiar from the bounded-autonomy framing: blast radius, recoverability, reversibility, verifiability. A high blast-radius operation needs a higher fast-path threshold and a much lower rejection ceiling. A reversible operation can tolerate a lower fast-path threshold because the cost of being wrong is bounded by rollback speed. A hard-to-verify operation should be slow to fast-path because the confidence number is harder to trust when truth is harder to check.
The right form for thresholds is two numbers per operation class: the fast-path floor and the rejection ceiling. Above the floor, fast-path. Below the ceiling, reject. Between them, human-in-the-loop. Both numbers belong in the same centralized policy that holds the rest of the operational decisions; this is Decisions as Code applied to the threshold layer. Change the policy and every consumer picks it up. Audit the policy and you have the full picture of what the system trusts.
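A sketch of what that policy layer can look like, assuming the two-numbers-per-operation-class shape just described. The operation classes and the specific numbers here are illustrative, not recommendations:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    fast_path_floor: float    # at or above this, act without review
    rejection_ceiling: float  # below this, reject with a reason


# One centralized policy, versioned like any other code. The operation
# classes and numbers are placeholders for illustration.
POLICY_VERSION = "2025-06-01"
POLICY = {
    "ticket_triage":    Thresholds(fast_path_floor=0.60, rejection_ceiling=0.30),
    "schema_migration": Thresholds(fast_path_floor=0.95, rejection_ceiling=0.50),
}


def thresholds_for(operation_class: str) -> Thresholds:
    # No silent fallback: an unknown operation class is a policy gap,
    # not something to paper over with a global constant.
    return POLICY[operation_class]
```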
What "human-in-the-loop" actually means in production
The middle path is the one teams underbuild. They write the routing logic, they wire up the fast path, they handle the rejection, and then they hand-wave the human-in-the-loop step. "It goes to a queue." "An analyst reviews it." Six months later the queue has 40,000 unreviewed items and the system is effectively running fast-path-or-rejection with no middle.
Human-in-the-loop is a real product surface or it doesn't exist. The minimum viable shape: a queue with an SLA, an interface that gives the reviewer everything they need to evaluate the output in under a minute, an explicit accept/edit/reject control, and a feedback loop that turns the reviewer's decision into training data for the next iteration of the model and the next adjustment of the thresholds.
The "everything they need" part is where most middle paths fall apart. The reviewer needs the input, the proposed output, the confidence number, the model's reasoning if the model produced any, the relevant context the model was given, and a one-click way to see what the fast-path version of this would have done. If the reviewer is squinting at a JSON blob trying to reconstruct what the system was about to do, you've built a queue, not a review surface, and the queue will atrophy.
The SLA matters more than people expect. Mid-confidence outputs are often time-sensitive, a support ticket, a transaction, an alert. If the human-in-the-loop path takes a day, the fast path effectively becomes the only path because the alternative is too slow to use. The threshold drifts upward by selection pressure, the middle path shrinks, and you're back to two paths instead of three.
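A sketch of the minimum review-item payload, with field names invented for illustration. The point is structural: everything the reviewer needs travels with the item, including the deadline, so SLA breaches are queryable instead of invisible:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ReviewItem:
    item_id: str
    operation_class: str
    input_payload: dict              # what the model was asked
    proposed_output: dict            # what the model wants to do
    confidence: float
    model_reasoning: Optional[str]   # if the model produced any
    context_snapshot: dict           # what the model was shown
    fast_path_preview: str           # what acting directly would have done
    enqueued_at: datetime
    review_deadline: datetime        # the SLA, carried on the item itself


def enqueue(item_fields: dict, sla: timedelta) -> ReviewItem:
    """Stamp the item with its deadline at enqueue time."""
    now = datetime.now(timezone.utc)
    return ReviewItem(enqueued_at=now, review_deadline=now + sla, **item_fields)
```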
The audit story per path
Each of the three paths has a different audit story, and the difference matters when you're explaining the system to someone who isn't a model engineer.
The fast path's audit story is "the system acted alone because confidence exceeded the policy threshold." The audit record needs the input, the output, the confidence number, the policy version that was in effect, the threshold that was crossed, and the action taken. Anyone reviewing the trace later can reconstruct exactly why the system thought it was safe to act alone.
The human-in-the-loop path's audit story is "the system proposed, a named reviewer decided." The audit record needs everything from the fast-path record, plus the reviewer identity, the timestamp of the review, the decision (accept / edit / reject), and any edits the reviewer made. The reviewer's decision is the load-bearing event in the audit trail; the model's proposal is context.
The rejection path's audit story is "the system declined to act because confidence was below the rejection ceiling. Here's the reason." The record needs the input, the rejection reason, the confidence number, and what the caller did next. Rejections aren't failures, they're the system correctly declining to overreach. They're also a signal: a high rejection rate in an operation class means the model is poorly calibrated for that class, the input distribution shifted, or the thresholds need adjustment.
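One way to make the three records concrete is as shapes that share a base. This is a sketch, not a schema recommendation; the field names are mine:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AuditBase:
    input_payload: dict
    output_payload: dict      # for rejections, the proposed (unacted) output
    confidence: float
    policy_version: str       # which thresholds were in effect
    operation_class: str
    timestamp: datetime


@dataclass
class FastPathRecord(AuditBase):
    threshold_crossed: float  # the fast-path floor that was exceeded
    action_taken: str


@dataclass
class HumanReviewRecord(AuditBase):
    reviewer_id: str          # the load-bearing event is the reviewer's
    reviewed_at: datetime
    decision: str             # "accept" | "edit" | "reject"
    edits: Optional[dict]     # what the reviewer changed, if anything


@dataclass
class RejectionRecord(AuditBase):
    rejection_reason: str
    caller_followup: Optional[str]  # what the caller did next, if known
```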
The three audit streams together are the complete operational picture. An auditor (internal or external) can answer "what did this system actually do, and on whose authority" by reading them. A system that only routes but doesn't audit per-path is invisible to anyone who needs to evaluate its behavior, which is most of the people who'll be evaluating it.
Calibration is a discipline, not a milestone
The last failure mode is the one that takes the longest to show up. You set the thresholds, you wire the routing, you build the audit, and the system runs cleanly for a few months. Then the input distribution shifts (new product, new customer segment, new external pressure) and the model's confidence numbers stop meaning what they used to mean. The 0.85s that used to be reliable are now systematically wrong. The 0.6s that used to need review are now usually fine.
If nobody is watching, the thresholds drift out of calibration silently. The fast path starts passing through outputs that should have been reviewed. The rejection path starts catching outputs that the model handles fine. The middle path's volume changes shape, and the reviewers either get overloaded or run dry. The system is still running, the metrics still look plausible, but the routing is making the wrong cuts.
Calibration is the discipline of periodically checking that the confidence numbers still mean what the thresholds assume. The mechanics aren't exotic: take a sample of recent fast-path outputs and have humans grade them, do the same for a sample of rejections, compare the actual error rate at each confidence band to the expected error rate the thresholds were set for. If the bands have drifted, retune. If the model has drifted, retrain or re-prompt. If the input distribution has shifted, decide whether the thresholds need to move with it or whether the right answer is to add a new operation class with its own thresholds.
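The band-level check is a few lines of code. A sketch, assuming human-graded samples arrive as (confidence, was_correct) pairs; the band edges and expected accuracies are placeholders you'd set from your own thresholds:

```python
from collections import defaultdict

# Expected accuracy per confidence band, i.e. what the thresholds were
# set against. Placeholder numbers, not recommendations. Samples below
# the lowest band (already rejected) are ignored here.
EXPECTED = {(0.9, 1.0): 0.97, (0.7, 0.9): 0.88, (0.5, 0.7): 0.72}


def calibration_report(graded: list[tuple[float, bool]]) -> dict:
    """Compare observed accuracy per band to what the thresholds assume."""
    buckets: dict[tuple[float, float], list[bool]] = defaultdict(list)
    for confidence, correct in graded:
        for lo, hi in EXPECTED:
            if lo <= confidence < hi or (hi == 1.0 and confidence == 1.0):
                buckets[(lo, hi)].append(correct)
    report = {}
    for band, outcomes in buckets.items():
        observed = sum(outcomes) / len(outcomes)
        report[band] = {
            "observed": round(observed, 3),
            "expected": EXPECTED[band],
            "drift": round(observed - EXPECTED[band], 3),
            "n": len(outcomes),
        }
    return report
```

If a band's drift is large and negative, the fast path is passing through outputs it shouldn't; large and positive, the system is rejecting or escalating outputs the model handles fine.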
The cadence depends on the operation. High-volume, low-stakes operations can be calibrated quarterly. Low-volume, high-stakes operations should be calibrated continuously, every output graded, every drift caught early. The trap is treating calibration as a project that gets done once at launch instead of an ongoing operational practice. A system that's calibrated at launch and never again is a system that's silently miscalibrated by month six.
Why this is the bounded-autonomy version of routing
The connection back to the series is straightforward. Bounded autonomy says the agent gets as much rope as it has earned for the operation. Confidence routing says the agent's individual outputs get treated according to how much trust each one has earned. Both are gradient disciplines: not a single autonomy boundary, not a single confidence threshold, but a structured set of behaviors keyed off the properties of what's about to happen.
Confidence as a single number is the AI version of an unbounded agent: the system is capable of acting, but there's no structure around when it should. Confidence as a routing signal is the bounded version: same capability, but every use of it is shaped by a plain policy that someone can read, audit, and tune.
The number isn't decoration. It's the input to a decision that the system has to make every time it produces an output. Build the decision into the system, not into the head of whoever's reading the logs.
, Sid