The runaway tool-call: stories from a year of agentic IDEs

Twelve months of running agentic coding tools daily, and the failure mode that keeps repeating isn't the dramatic one. It's the quiet one: the agent doing exactly what it was asked, to a result nobody wanted.


It's been about a year now of agentic coding tools as a daily-use thing. Claude Code shipped in February, Cursor's agent mode matured around the same time, JetBrains caught up, and a handful of others filled out the rest of the field. A year into using these things constantly, the failure mode that keeps repeating isn't the dramatic one, the agent that deletes production and produces a war story. It's the quiet one: the agent doing exactly what it was asked, to a result nobody wanted.

Worth being specific about the patterns, because the dramatic failures get the attention and the quiet ones are what actually erode trust in the tool over time.

The categories

A year of journaling these (not formally, just noticing) gives me roughly five categories of runaway tool-call failures that keep showing up:

The over-eager scope expansion. I ask the agent to clean up imports in one file. It cleans up imports in that file, then notices the rest of the directory has the same pattern, and starts cleaning up imports across the whole module. Sometimes that's what I wanted; usually it isn't. The agent's bias toward doing more rather than less is the core mechanic. The fix is explicit scope in the prompt, plus a habit of erring toward narrower scope than feels natural.

The plausible-but-wrong refactor. I ask the agent to rename a variable across a file. It renames the variable across the file, including a string literal that happened to contain the same word. The change passes lint, sometimes passes type-check, and breaks at runtime in a way that isn't obvious until the production deploy. The agent's local correctness optimization doesn't account for whether the change is semantically right.
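
A toy illustration of the mechanism (a hypothetical snippet, not from any real codebase): a purely textual rename also rewrites the string literal, and neither lint nor a type-checker will object.

```python
# Hypothetical example: a textual rename of `status` -> `state` also rewrites
# a log message that happens to contain the same word.
source = (
    'status = "pending"\n'
    'log.info("status changed to pending")\n'
)

renamed = source.replace("status", "state")
print(renamed)
# state = "pending"
# log.info("state changed to pending")   <- string literal silently altered
```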

The convincing-explanation-of-broken-code. The agent writes code, runs the tests, the tests fail, the agent confidently explains why the failure is fine and the code is correct. It's wrong; the failure is real. The pattern is most dangerous when the agent's explanation is plausible enough that I'd accept it on a quick read. The fix is to read the failure output myself, not to trust the agent's interpretation of it.

The infinite tool loop. The agent calls a tool, doesn't like the result, calls it again with slightly different parameters, doesn't like that result either, calls it again, and so on. Sometimes recovers; often doesn't. Burns tokens, burns time, and produces nothing. The fix in the platforms that have it is a max-iteration limit; in the ones that don't, manual interruption.
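
A minimal sketch of that mitigation, assuming hypothetical call_tool / is_good_enough / adjust hooks supplied by the caller; the point is only the hard ceiling and the visible retries, not any particular platform's API.

```python
def run_with_limit(call_tool, params, is_good_enough, adjust, max_iterations=5):
    """Call a tool at most max_iterations times, surfacing every retry.

    call_tool, is_good_enough, and adjust are hypothetical hooks; real
    platforms wire the same ceiling into the agent runtime.
    """
    for attempt in range(1, max_iterations + 1):
        result = call_tool(params)
        if is_good_enough(result):
            return result
        # Make the retry visible instead of silently burning tokens.
        print(f"attempt {attempt}/{max_iterations} unsatisfactory; adjusting parameters")
        params = adjust(params, result)
    raise RuntimeError(f"gave up after {max_iterations} tool calls; hand back to the human")
```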

The silent regression. The agent makes a change, the tests pass, but a behavior that wasn't covered by tests is now subtly different. I notice three commits later. The agent's correctness model is "the tests pass", which is a lower bar than "the code does what I wanted." The fix is more aggressive test coverage of the behaviors that matter, plus reading the diffs more carefully than I'd read a teammate's PR.

What twelve months of these has actually taught me

A few patterns that have survived the year:

Plan mode is the most important workflow change. Letting the agent describe what it's about to do before doing it catches most of the over-eager scope and plausible-but-wrong cases before they hit the codebase. The cost is the friction of reading the plan; the benefit is not having to roll back the wrong work later.
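
The mechanism is nothing more than an approval gate between the plan and the edit. A sketch, assuming a hypothetical agent object with plan() and execute() methods:

```python
def run_task(agent, task: str) -> None:
    plan = agent.plan(task)          # hypothetical: agent describes its intended steps
    print(plan)                      # the friction: the human reads the plan
    if input("Execute this plan? [y/N] ").strip().lower() != "y":
        print("Stopped before anything touched the codebase.")
        return
    agent.execute(plan)              # only now do edits happen
```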

Diffs are for reading, not approving. Treating the agent's PR like a teammate's PR (reading every line, asking why this change, looking for the silent regression) is what makes the tool durably useful. The "looks fine, accept" pattern is what produces the medium-term trust erosion.

Tool scope discipline matters more than I thought. The agent should have access to exactly the tools it needs for the task and no more. The platforms that make this easy (per-task tool whitelisting) produce more reliable workflows than the platforms that just give the agent everything.
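
What per-task scoping can look like, with hypothetical task and tool names; the mechanism is just an allowlist checked before every tool call.

```python
# Hypothetical per-task tool allowlists; anything not listed is refused.
TOOL_ALLOWLIST = {
    "fix-bug":      {"read_file", "edit_file", "run_tests"},
    "new-project":  {"read_file", "edit_file", "create_file", "run_shell"},
    "explain-code": {"read_file"},
}

def check_tool_call(task: str, tool: str) -> None:
    allowed = TOOL_ALLOWLIST.get(task, set())
    if tool not in allowed:
        raise PermissionError(f"tool {tool!r} is out of scope for task {task!r}")

check_tool_call("fix-bug", "run_tests")   # fine
check_tool_call("fix-bug", "run_shell")   # raises PermissionError
```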

Auto-commit is the wrong default. The agent should propose changes; the human should commit. The platforms with auto-commit-on-success are faster on the green-path cases and produce worse outcomes in the cases where the agent was technically correct and substantively wrong.

Single-purpose sessions beat all-purpose sessions. A session that's "fix this bug" produces better behavior than a session that's "work on this codebase." The narrower scope reduces the over-eager scope expansion failure mode.

The framing that helps

The mental model that's actually worked for me over the year: the agent is a junior engineer who is fast, doesn't get tired, occasionally produces brilliant work, and occasionally produces work that's confidently wrong in ways a slightly-more-experienced reviewer would catch immediately. The right workflow treats the agent that way: review the work, set the scope, don't trust its explanation of a failure, ship the good output, don't ship the bad.

The framing that doesn't work: the agent is a senior engineer who can be trusted to get it right and can be supervised by exception. That framing produces the failure modes above. It also produces the dramatic ones (the deleted-production variety), but those at least come with their own correction. The quiet ones don't.

The platform patterns that help

A year of testing different platforms has made it clearer which platform-design patterns matter:

Plan-then-execute as the default UI, not as an opt-in. Cursor, Claude Code, and a few others ship this; the ones that don't tend to produce more runaway tool-calls.

Per-task tool scoping. The agent's tool surface for "fix this bug" is different from the agent's tool surface for "set up the new project." Whitelisting per task reduces the failure modes where the agent reaches for a tool it shouldn't.

Visible diffs before commit. The change-review surface should be where the diff lives, not buried behind an "accept all" button.

Iteration limits with clear surfacing. When the agent is about to hit its iteration limit, surface that fact so the human can decide whether to extend or stop.

Per-action audit. An agent action log that's queryable after the fact helps when something has gone wrong and you need to figure out what happened. The platforms with good audit surfaces are the ones I trust more.
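
The minimum useful version needs nothing beyond the standard library: append one JSON line per agent action, and "what did the agent do this morning" becomes a one-liner. A sketch, with a hypothetical log path and field names:

```python
import json
import time

AUDIT_LOG = "agent_actions.jsonl"   # hypothetical location

def record_action(tool: str, args: dict, result_summary: str) -> None:
    # One JSON object per line, append-only, so history is never rewritten.
    entry = {"ts": time.time(), "tool": tool, "args": args, "result": result_summary}
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def actions_since(ts: float) -> list[dict]:
    with open(AUDIT_LOG) as f:
        entries = [json.loads(line) for line in f]
    return [e for e in entries if e["ts"] >= ts]

record_action("edit_file", {"path": "app.py"}, "applied 2-line diff")
for entry in actions_since(time.time() - 8 * 3600):   # roughly "this morning"
    print(entry["tool"], entry["args"], entry["result"])
```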

These aren't rocket science. The platforms that ship them are the ones that produce reliable workflows; the platforms that don't are the ones that produce horror stories every few weeks. The differentiation in the IDE-agent category in mid-2025 is mostly along these axes, not along raw model capability.

What I'd recommend after a year

For someone thinking about adopting an agentic coding tool now:

  • Start with the platform that has the strongest plan-mode and tool-scoping surface. Capability differences across the leading tools are smaller than the workflow-design differences.
  • Set the default to plan-then-execute, not auto-execute, and keep it there.
  • Treat the diff like a teammate's diff. Don't accept-all.
  • Use single-purpose sessions for non-trivial work.
  • Build the audit habit early: be able to answer "what did the agent do this morning" without having to dig.

The tools are durably useful. The failure modes are predictable. The discipline that makes the tools work is mostly workflow discipline, not capability discipline. A year in, the runaway tool-call is the quiet failure that matters most. Catching it requires habit, not heroics.