The "long-running agent" problem nobody is solving

Agents that run for ten minutes mostly work. Agents that run for ten hours mostly don't. The middle is where the actual production work lives, and the patterns we have for it are weaker than the conversation pretends.

[Image: a vintage hourglass, most of its sand fallen to the bottom chamber, on an antique wooden desk in a dimly lit study]

The agentic systems that are durably useful in 2025 are mostly the short-running kind: ten-second tab completions, two-minute refactor loops, five-minute research summarizations. The ones that get the demo glory are the long-running kind: agents that work for an hour, an evening, or overnight on a task. The ones that work in production are the ones that finish quickly. There's a real category of work in the middle (multi-hour agents on real tasks) that most platforms don't have a credible story for.

It's worth being explicit about why, because the gap is going to start showing up in capability comparisons over the next few quarters, and the keynote framings keep glossing over it.

What "long-running" actually breaks

Three failure modes that compound the longer the agent runs:

Context decay. The agent's working memory is its context window. As the conversation grows, the older parts lose salience, both because attention degrades over distance and because the model has to reconcile increasingly contradictory state. The pattern is well documented: an hour into a long agentic session, the agent has stopped reliably remembering what it decided in the first ten minutes. By three hours, the model is operating mostly off the most recent few thousand tokens regardless of what came earlier. The Claude 4 generation improved this; improving it further is hard for fundamental architectural reasons.

Drift. The agent's plan accumulates small adjustments at every step, so the plan at hour three looks similar to the plan at hour zero but isn't it. Any single adjustment looks reasonable; compounded over a hundred steps, the agent has wandered well off the original plan, often without noticing. The plan-at-hour-three is what the agent is now executing; the user's intent from hour zero is no longer the operational frame.

No resumability. When a long-running agent fails (network error, model error, OOM, anything) most platforms have no way to resume from where it stopped. The agent restarts from scratch, the work-in-progress is lost, the cost of the prior hours is spent without recovery. Some platforms have checkpointing; very few have working resumability across model errors or restarts.

These three compound. A long agentic run that hits a transient error at hour two (resumability) loses the context the agent had built up (decay) and the plan adjustments it had made (drift). A retry from scratch starts fresh on context but inherits none of the work.

The partial fixes in the wild

A few patterns that real systems are using to mitigate the problem, none of which fully solves it:

Periodic re-grounding. The agent rewrites its current state and plan into a structured summary every N steps and uses the summary as the basis for the next N steps rather than the full history. Mitigates context decay; introduces summarization-quality risk. Works moderately well for tasks with stable structure; doesn't work for tasks where the early-conversation specifics matter throughout.
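The re-grounding loop can be sketched in a few lines. This is a minimal illustration, not any platform's API: `summarize()` and `act()` stand in for model calls, and the interval `REGROUND_EVERY` is an arbitrary choice.

```python
# Minimal sketch of periodic re-grounding: every N steps, collapse the
# transcript into a structured summary and continue from the summary
# instead of the full history. summarize() and act() are stand-ins for
# model-backed calls.

REGROUND_EVERY = 4  # N: steps between re-grounding passes

def summarize(history):
    # Stand-in for a model-backed summarization call. A real system would
    # prompt the model to emit the current plan, decisions made, open items.
    return {"summary_of": len(history), "last_step": history[-1]}

def act(step, grounding):
    # Stand-in for one agent step conditioned on the current grounding.
    return f"step-{step}"

def run(total_steps):
    history = []      # raw working context since the last re-grounding
    grounding = None  # structured summary carried across re-groundings
    for step in range(total_steps):
        history.append(act(step, grounding))
        if (step + 1) % REGROUND_EVERY == 0:
            grounding = summarize(history)
            history = []  # the summary replaces the raw transcript
    return grounding, history

grounding, tail = run(10)
```

The summarization-quality risk mentioned above lives entirely in `summarize()`: whatever it drops is gone for good, which is why this pattern struggles when early-conversation specifics matter throughout.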

Explicit checkpoints. The agent serializes its state (plan, intermediate results, tool call history) to durable storage at well-defined points. On error, the next run reloads from the most recent checkpoint. Solves the resumability problem when it works; the cost is that the checkpoint format is hard to design well and often misses some piece of state that matters.
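A hand-rolled checkpoint loop, of the kind the paragraph above describes, looks roughly like this. The state fields (`plan`, `results`, `step`) are illustrative; real checkpoint formats are exactly where the missing-state problem bites.

```python
import json, os, tempfile

# Minimal sketch of explicit checkpointing: serialize agent state at
# well-defined points; on failure, the next run resumes from the most
# recent checkpoint instead of restarting from scratch.

def save_checkpoint(path, state):
    # Write to a temp file and rename, so a crash mid-write can't
    # corrupt the checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"plan": [], "results": [], "step": 0}  # fresh start
    with open(path) as f:
        return json.load(f)

def run(path, total_steps, fail_at=None):
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if step == fail_at:
            raise RuntimeError("transient failure")
        state["results"].append(f"result-{step}")
        state["step"] = step + 1
        save_checkpoint(path, state)  # checkpoint after each committed step
    return state

path = os.path.join(tempfile.mkdtemp(), "agent.ckpt")
try:
    run(path, 6, fail_at=4)   # first run dies at step 4
except RuntimeError:
    pass
state = run(path, 6)          # second run resumes from step 4, not step 0
```

The hard part is not the plumbing shown here but deciding what belongs in `state`; anything the agent knows that isn't serialized is lost on restart regardless of how robust the save/load path is.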

Outer-loop supervision. A human (or sometimes another agent) reviews the long-running agent's progress every fixed interval and either approves continuation or intervenes. Catches drift before it compounds. The cost is the supervision burden, which often defeats the autonomy promise of long-running agents.
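The outer loop reduces to a reviewer callback invoked at a fixed interval. A hypothetical sketch; the lambda reviewer here is a toy rule standing in for a human or a second model:

```python
# Minimal sketch of outer-loop supervision: a reviewer inspects progress
# every REVIEW_EVERY steps and either approves continuation or halts.

REVIEW_EVERY = 3

def run(steps, reviewer):
    progress = []
    for step in range(steps):
        progress.append(f"step-{step}")
        if (step + 1) % REVIEW_EVERY == 0:
            if not reviewer(progress):
                return progress, "halted"   # intervention: stop the run
    return progress, "completed"

# Illustrative reviewer: approve until the run exceeds a progress budget.
progress, status = run(10, reviewer=lambda p: len(p) < 7)
```

The supervision burden is visible in the structure: `reviewer` is called on every interval, so the cheaper that call is made, the less it catches, which is the trade-off the paragraph above describes.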

Scope-tightening. The simplest mitigation: keep the agentic workload short by design. Break the long task into multiple shorter agentic runs with explicit human or system-level handoff between them. Solves the problems by not having them. The cost is loss of the elegant "set the agent on the task and come back" experience.
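Scope-tightening amounts to a pipeline of short runs joined by an explicit handoff object. A sketch under assumed names (`run_phase`, `approve` are illustrative, not any platform's API):

```python
# Minimal sketch of scope-tightening: split one long task into short
# phases with an inspectable handoff between them, so no single agentic
# run lives long enough to decay, drift, or need resumability.

def run_phase(name, handoff):
    # Stand-in for one short agentic run; consumes and extends the handoff.
    return {**handoff, name: f"{name}-done"}

def run_pipeline(phases, approve):
    handoff = {}
    for name in phases:
        handoff = run_phase(name, handoff)
        if not approve(handoff):          # explicit human/system gate
            return handoff, f"stopped after {name}"
    return handoff, "completed"

handoff, status = run_pipeline(["research", "draft", "review"],
                               approve=lambda h: True)
```

The handoff object is the point: because it crosses a run boundary, it has to be explicit and inspectable, which is precisely what the single long run never forces you to build.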

Each of these is partial. The combination (re-grounding plus checkpointing plus supervision plus scope-tightening) is what actually-shipping production systems use. None of them feel like a complete answer.

Why this matters for the keynote framings

The agent demos at Build, I/O, and the equivalent OpenAI / Anthropic events all show variants of the long-running case: the agent works overnight, the agent does the multi-hour task end-to-end, the agent operates with minimal supervision. That demo class lands well because it's the most aspirational shape of the agent pitch. The category that demos well is also the category that production has the worst answers for.

The shape of the gap: 30-second agents are basically solved (modulo the usual model-quality issues). 10-minute agents are solvable with good design and the patterns from the design-patterns piece. 60-minute agents are workable with significant infrastructure investment. 6-hour agents are barely working in any platform I've tested. The keynote demos all play in the 6-hour-plus territory; the production workloads almost all play in the 10-minute territory. The conversation skips over the gap.

The interesting research direction

What would actually solve this is some combination of:

  • Better long-context architectures that don't degrade with conversation length the way current attention-based ones do. The candidates exist (state-space models, retrieval-augmented architectures, hybrid approaches); none has clearly won.
  • Native resumability primitives at the platform level: agent state as a first-class persistable object, not a hand-rolled checkpoint format. This is mostly a platform-engineering problem, not a model-research one, but the platforms haven't shipped it.
  • Drift detection that flags when the agent has wandered far from the original plan and triggers an explicit re-grounding step. Some experimental work here; nothing standardized.
  • Better cost models for long runs: per-token billing means a 6-hour agent run costs six hours of tokens whether or not the value is materializing, and there's no built-in mechanism to abort when it isn't. Most teams add this themselves; the platforms could do better.
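The drift-detection idea in the list above can be sketched crudely. This uses token-set overlap purely to keep the sketch self-contained; real work in this direction would compare plans with embeddings or a judge model, and the threshold is an arbitrary assumption:

```python
# Minimal sketch of drift detection: score the overlap between the
# original plan and the agent's current plan, and flag re-grounding when
# it falls below a threshold. Jaccard similarity over word sets is a
# deliberately crude stand-in for a semantic comparison.

DRIFT_THRESHOLD = 0.5

def overlap(plan_a, plan_b):
    a, b = set(plan_a.lower().split()), set(plan_b.lower().split())
    return len(a & b) / len(a | b)

def check_drift(original_plan, current_plan):
    # True means the current plan has wandered far enough from the
    # original to warrant an explicit re-grounding step.
    return overlap(original_plan, current_plan) < DRIFT_THRESHOLD

drifted = check_drift(
    "migrate the billing service to the new database schema",
    "rewrite the billing service test suite and refactor the API layer",
)
```

Even this toy version captures the key property of the pattern: the check is against the hour-zero plan, not the previous step, which is what makes compounding small adjustments detectable at all.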

I'd guess the long-running agent problem gets a meaningful platform-level response in the next four to eight quarters. Not next quarter. The current state is that the demos imply it's solved, the production state is that it isn't, and the gap is more visible to practitioners than to the audiences the keynotes are aimed at.

What to do in the meantime

For shops thinking about deploying long-running agentic workloads in the back half of 2025: don't, unless you're willing to build the missing platform yourself. The combination of re-grounding, checkpointing, supervision, and scope-tightening is buildable, but it's not provided. The honest framing for what's actually deployable today is short and medium-running agents with explicit human handoff between phases. That's less exciting than the demo version, and it's the version that doesn't break the second a real workload hits it.

The long-running case is coming. It isn't here. Treating the gap as solved because the demo flow shows it solved is the same mistake as treating the governance layer as solved because the marketing slide shows it solved. Real production has a higher bar than either marketing layer is currently meeting.