The six rungs of agent autonomy

A practical six-rung ladder for AI agent autonomy: suggest, draft, execute-with-confirmation, execute-bounded, execute-with-rollback, execute-and-report. Each rung has different operational requirements. What promotes an agent from rung N to N+1, and what sends it back down.

The conversation about AI agent autonomy mostly happens in two registers. The first is the marketing version: "fully autonomous agents that handle the work end-to-end." The second is the operations version: "we can't let it touch production." Both are caricatures. Real agent operations live somewhere on a ladder between them, and the rungs of that ladder are well-defined enough that I want to lay them out.

I've been running agents in my own workflow, across coding, content, ops, and homelab orchestration on the small Helix cluster I run at home, for long enough to have watched them move up and down the ladder and to feel the boundaries between rungs. The boundaries are about operational requirements, not capability. The agent is technically capable of operating at rung 5 from day one. Whether you let it depends on what you've built underneath it.

Here are the six rungs. Each one is a coherent operating mode with its own requirements for audit, rollback, confidence, and blast radius.

Rung 0: Suggest

The agent proposes an action. It does nothing. A human reads the suggestion, decides whether to act on it, and either does so or doesn't. The agent's output is text, sometimes structured (a diff, a command, a checklist), sometimes prose ("I think you should restart the queue worker on engine-01"). The human is the executor.

Operational requirements: almost none. You need the agent to log what it suggested and what context it had. You don't need rollback because nothing happened. You don't need a confidence threshold because the human will re-evaluate. You don't need bounded blast radius because the blast radius is zero, the agent can't touch anything.
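The one requirement at this rung is small enough to sketch in a few lines. A minimal, hypothetical version (the `log_suggestion` name and the JSONL format are mine, not a prescribed interface):

```python
import json
from datetime import datetime, timezone

def log_suggestion(log_path: str, suggestion: str, context_refs: list[str]) -> None:
    """Append one rung-0 record: what was suggested, and what context the agent had."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "suggestion": suggestion,
        "context": context_refs,  # pointers to the inputs, not copies of them
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

An append-only log is enough here precisely because nothing executes: there's nothing to reverse, only a trail to consult when you later ask how often the suggestions were any good.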

This is where every agent starts. It's also where most agents should stay for longer than people instinctively believe, because rung 0 is the rung where the cost of being wrong is the lowest and the cost of moving too quickly to higher rungs is highest.

Rung 1: Draft

The agent doesn't just suggest, it produces the actual artifact. A draft pull request. A draft email. A draft Helm values change. A draft incident postmortem. The artifact is committed to a place where it can be reviewed (a PR branch, a draft folder, a staging environment), but nothing has been merged or sent or applied.

Operational requirements: a workspace that holds drafts safely, an interface for the human to review and accept or reject, and a record of which drafts the agent produced and what happened to them. The blast radius is still zero in the production sense (nothing has landed) but the draft artifacts themselves carry data, so they need to be stored with the same access controls as the production data they reference.

Rung 1 is where most enterprise teams should be running their agents in 2026. The agent does the labor; the human does the merge. The latency added by the human-in-the-loop step is real but the cost of an autonomous mistake at this stage is high enough that the latency is the right trade.

Rung 2: Execute with confirmation

The agent proposes the action and waits for explicit confirmation before doing it. The action exists, the agent has decided to take it, and it's pre-staged, but it hasn't fired. A confirmation dialog (or a Slack approval message, or a CLI prompt) gates the actual execution.

Operational requirements: the agent has to be able to describe what it's about to do in a form the human can evaluate. ("I'm about to delete the staging-old database. Are you sure?") The system needs an unambiguous confirmation mechanism that can't be triggered by accident. The confirmation needs to be logged with the identity of the human who confirmed.
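Those three requirements fit in a short sketch. The names are hypothetical; the property that matters is that the confirmation callback returns the confirming human's identity, a veto leaves the action unfired, and both outcomes are logged:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class PendingAction:
    description: str  # human-evaluable: "delete the staging-old database"
    execute: Callable[[], None]

@dataclass
class ConfirmationGate:
    """Pre-staged actions fire only after an explicit, attributed confirmation."""
    audit_log: list[dict] = field(default_factory=list)

    def run(self, action: PendingAction,
            confirm: Callable[[str], Optional[str]]) -> bool:
        # confirm() shows the description and returns the confirming
        # human's identity, or None to veto.
        who = confirm(action.description)
        self.audit_log.append({"action": action.description, "confirmed_by": who})
        if who is None:
            return False  # vetoed: nothing happened
        action.execute()
        return True
```

Note that the veto is logged too; a trail of what the agent wanted to do but wasn't allowed to is part of how trust gets built for the next rung.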

Rung 2 is the right rung for high-blast-radius actions even when the agent is otherwise trusted to operate higher. The production deletion. The financial transaction above a threshold. The customer-facing communication. Anything where the human's role isn't doing the work (the agent does the work) but signing off that the work should happen.

Rung 3: Execute bounded

The agent acts without confirmation, but the actions it can take are bounded. The bounds might be on the type of action ("you may modify draft documents but not published ones"), on the scope ("you may operate on staging but not production"), on the volume ("you may send up to 10 emails per day without confirmation"), or on the cost ("you may spend up to $50 of compute per day"). Within those bounds the agent acts freely. Outside those bounds it falls back to rung 2.

Operational requirements: a real authorization system with bounds expressed in code. This is where Decisions as Code starts paying off in earnest, because the bounds on agent action are themselves business decisions that should live in a centralized, readable, auditable place. The agent doesn't carry the bounds in its prompt; the platform carries the bounds in code, and the agent's actions are evaluated against them at execution time. The agent also needs reliable detection of when it's about to exceed a bound, so it can drop back to confirmation rather than fail mid-action.
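Here is roughly what bounds-in-code can look like. Everything in this sketch is invented for illustration (the action kinds, scope strings, and limits); the important property is that `authorize` is evaluated by the platform at execution time, and a `False` means drop back to rung-2 confirmation rather than fail:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str          # e.g. "restart_service", "send_email"
    scope: str         # e.g. "staging", "production"
    cost_usd: float = 0.0

@dataclass
class Bounds:
    """Centralized, readable, auditable. The agent never carries these in its prompt."""
    allowed_scopes: frozenset[str]
    daily_budget_usd: float
    daily_email_limit: int

class BoundedExecutor:
    def __init__(self, bounds: Bounds):
        self.bounds = bounds
        self.spent_today = 0.0
        self.emails_today = 0

    def authorize(self, action: Action) -> bool:
        """True: execute freely. False: fall back to rung-2 confirmation."""
        if action.scope not in self.bounds.allowed_scopes:
            return False
        if self.spent_today + action.cost_usd > self.bounds.daily_budget_usd:
            return False
        if action.kind == "send_email" and self.emails_today >= self.bounds.daily_email_limit:
            return False
        # Within bounds: record usage against the limits and allow.
        self.spent_today += action.cost_usd
        if action.kind == "send_email":
            self.emails_today += 1
        return True
```

Checking the budget before spending is the "reliable detection of when it's about to exceed a bound" requirement in miniature: the action drops to confirmation before it fires, not halfway through.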

This is where I run most of my homelab orchestration. The agent can restart services, deploy updates, rotate logs, scale workloads, all bounded to the cluster, all subject to per-action limits, all logged. Anything outside the bounds requires me to confirm.

Rung 4: Execute with rollback

The agent acts without confirmation, the action is large enough that it could matter, and the system around the agent guarantees that any action it takes can be reversed automatically if the action turns out to be wrong. Rung 4 is where the safety net moves from "ask before doing" to "do, then watch, then undo if it broke something."

Operational requirements: every agent action has to produce a reversible artifact. Database changes are wrapped in transactions or feature-flagged. Deployments are canaried. Configuration changes are reverted on health-check failure. The agent has to know what its rollback looks like at the moment it takes the action, not figure out the rollback after the action has already broken something. The system also needs reliable failure detection (health checks, error-rate monitors, whatever signals tell you the change went badly) and a fast path from "signal indicates failure" to "rollback executes."
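The "know your rollback at the moment you act" requirement can be made concrete by refusing to accept an action without its undo attached. A sketch, with hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReversibleAction:
    """An action paired with its rollback at the moment it's taken, not after."""
    apply: Callable[[], None]
    rollback: Callable[[], None]

def execute_with_rollback(action: ReversibleAction,
                          healthy: Callable[[], bool]) -> bool:
    """Do, then watch, then undo if the health signal says it broke something."""
    action.apply()
    if healthy():
        return True
    action.rollback()  # the fast path from failure signal to reversal
    return False
```

The type signature is the point: an action that can't state its rollback up front can't be constructed at this rung, which pushes it back down to rung 2 or 3 by construction.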

Rung 4 is the rung where the operational investment shifts from agent capability to system instrumentation. A capable agent at rung 4 with no rollback infrastructure is dangerous. A modest agent at rung 4 with great rollback infrastructure is safe.

Rung 5: Execute and report

The agent acts. The action is bounded by the rung-3 bounds and reversible by the rung-4 rollback infrastructure. The human is informed after the fact (a daily report, a Slack summary, a line in the audit log) but is not consulted before the action and is not on the hot path of the action.

Operational requirements: everything from the rungs below, plus a report stream that's actually read. The trap at rung 5 is the report nobody reads. The agent acts autonomously, produces a daily summary, and the summary lands in a channel that's been muted for six months. By the time someone notices the agent has been doing the wrong thing repeatedly, the wrong thing has accumulated into a real problem. Rung 5 requires a human in the after-the-fact loop, with an SLA on reading the reports.
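One way to make the read SLA enforceable rather than aspirational is to treat unread reports past the SLA as a first-class signal in their own right. A sketch, with invented names and structure:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class ReportStream:
    """After-the-fact reports with an SLA on a human actually reading them."""
    read_sla: timedelta
    reports: list[dict] = field(default_factory=list)

    def publish(self, summary: str) -> None:
        self.reports.append({"summary": summary,
                             "published_at": datetime.now(timezone.utc),
                             "read_at": None})

    def mark_read(self, index: int) -> None:
        self.reports[index]["read_at"] = datetime.now(timezone.utc)

    def sla_breaches(self, now: datetime) -> list[dict]:
        """Unread reports older than the SLA: a demotion trigger, not a footnote."""
        return [r for r in self.reports
                if r["read_at"] is None and now - r["published_at"] > self.read_sla]
```

Wiring `sla_breaches` into the demotion machinery closes the loop: if nobody is reading the reports, the agent hasn't earned rung 5, whatever its accuracy numbers say.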

This is where most marketing imagines all agents will eventually live. In practice it should be a small minority of agent operations, reserved for the actions that are routine enough, bounded enough, and reversible enough to warrant trusting the agent to act alone.

What promotes an agent from rung N to N+1

The promotion criteria are the same at every rung. An agent gets promoted when:

  • It's been operating successfully at the current rung for long enough to build trust (weeks of clean operations, not days).
  • The class of mistake the agent might make is bounded, observable, and recoverable at the higher rung.
  • The operational infrastructure for the higher rung is in place (rollback, bounds, audit, report streams).
  • The cost of an autonomous mistake is below the cost of the human-in-the-loop friction at the lower rung.

The fourth criterion is the one teams skip. They look at how often the agent has been right and decide it's earned more autonomy. The right question is the cost-of-being-wrong calculation: at this rung the agent will occasionally do the wrong thing; what does the wrong thing cost; is that less than the friction cost of the lower rung. If yes, promote. If no, don't.
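The four criteria, including the cost-of-being-wrong calculation, are simple enough to write down. A sketch with invented parameter names; the 21-day default is an arbitrary stand-in for "weeks of clean operations, not days":

```python
def should_promote(clean_days: int,
                   error_rate: float,        # expected wrong actions per period at the higher rung
                   cost_per_mistake: float,  # recovery cost of one wrong action there
                   friction_cost: float,     # human-in-the-loop cost per period at the current rung
                   infra_ready: bool,        # rollback, bounds, audit, report streams in place
                   min_clean_days: int = 21) -> bool:
    """Promote only when every criterion holds, including the one teams skip:
    the expected cost of being wrong must be below the friction cost of staying put."""
    expected_mistake_cost = error_rate * cost_per_mistake
    return (clean_days >= min_clean_days
            and infra_ready
            and expected_mistake_cost < friction_cost)
```

The hard part is not the arithmetic but honestly estimating `error_rate` at a rung the agent hasn't operated at yet; the track record at the current rung is a lower bound, not the answer.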

What sends an agent back down

Demotion criteria are also the same at every rung, but more sensitive than promotion. An agent gets demoted when:

  • It produces a wrong action with a real consequence at the current rung.
  • The class of wrong action wasn't anticipated by the rung's safety mechanisms (the rollback didn't fire, the bounds didn't catch it, the report wasn't read in time).
  • The agent's operating environment changes in a way that invalidates earlier trust (new foundation, new data, new prompt, new model version).

The third criterion is the one teams ignore. An agent that's been operating safely at rung 4 on Llama 3.1 isn't necessarily safe at rung 4 on Llama 4 or on a new fine-tune. The model change is a re-evaluation event. Drop the agent back to rung 2 or 3, observe for a few weeks, then re-promote.

The shape of the ladder

Rung 0 and rung 1 are where the agent and the human collaborate explicitly. The agent's contribution is labor; the human's contribution is judgment.

Rungs 2 through 4 are where the agent acts but operates inside a structure the human (and the platform) maintain around it. The agent's contribution grows from labor toward decision-making, but the structure constrains the decisions.

Rung 5 is where the agent operates independently, with the human in an after-the-fact monitoring role. The agent has full execution authority within tightly bounded, reversible actions.

The honest version of agent operations is that most agents in production today live at rungs 1 and 2, and most agents that are claimed to live at rung 5 are actually operating at rung 3 with insufficient rollback infrastructure. The fix isn't more capable agents. The fix is more rigor about which rung the agent actually lives at, what infrastructure that rung requires, and what events should send it back down.

Bounded autonomy isn't a constraint on what the agent can do. It's the discipline that makes it safe to let the agent do anything at all.

, Sid