OpenAI Operator and the browser-agent generation

Six months in on Operator, plus Mariner, plus Computer Use. The browser-agent category is real and the durable use cases are smaller than the demos suggested. Worth being concrete about what the category does and doesn't do well.


OpenAI Operator shipped at the end of January. Anthropic's Computer Use went into wider preview around the same time. Google's Mariner started getting more attention at I/O in May. Six months in, with the browser-agent category a real thing rather than a demo, it's worth being honest about what's holding up, what isn't, and which use cases are durably interesting.

The keynote framing for this category is "the agent does the thing in the browser the way a person would." The day-to-day reality is more constrained, more useful in narrow cases, and more frustrating in the broad cases the demos lean on.

What's actually working

Three categories of browser-agent workload that have been useful for me over the last few months.

Bounded data-collection tasks. "Go to these five sites, pull this specific information, paste it into this format." Operator and Mariner both handle this well as long as the sites cooperate. The agent does what a careful human would do, takes a fraction of the time, doesn't get bored. The pattern works because the task is bounded, the success criteria are clear, and the failure modes are visible (the data either matches the expected format or it doesn't).
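That "failure modes are visible" property can be made mechanical. A minimal sketch, assuming a hypothetical `Site | $price` row format (not any product's output schema): a plain regex check makes a bad pull fail loudly instead of slipping through.

```python
import re

# Hypothetical expected format for one collected row: "Site name | $price".
# The specific pattern doesn't matter; what matters is that the check is
# mechanical, so success or failure is visible without rereading the data.
ROW_PATTERN = re.compile(r"^[\w .-]+ \| \$\d+(\.\d{2})?$")

def validate(rows: list[str]) -> tuple[bool, list[str]]:
    """Return (all rows match, the rows that don't)."""
    bad = [r for r in rows if not ROW_PATTERN.match(r)]
    return (len(bad) == 0, bad)

ok, bad = validate(["Acme Hotel | $129.00", "Beta Inn | $95"])
stuck_ok, stuck_bad = validate(["Gamma Lodge | see site for price"])
```

A run that returns `ok` is done; a run that returns mismatched rows gets handed back for a retry or a human look, which is exactly why the bounded version of the task is the durable one.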

Repetitive form-filling against well-structured sites. The expense-report submission, the routine HR portal updates, the conference signups, the kind of work that's small individually and adds up over time. Browser agents handle these well. The latency is real (these aren't fast); the productivity recovery is also real because you weren't going to do them quickly anyway.

Light research scaffolding. "Go to these five sources and pull a one-paragraph summary from each." Different from real research (the depth isn't there) but as a scaffolding step before the real reading happens, it works. Operator's sourcing surface is reasonable; the citations are usable; the time saved is real.

That's most of the durable list. The bounded, well-defined, repeatable cases. The pattern is "the human knows what they want, the agent does it more patiently than the human would."

What's still breaking

The cases where browser agents still don't work well.

Anything requiring logged-in state across multi-step workflows. OAuth flows still trip them up. Multi-factor auth confuses them. Session timeouts catch them mid-task. Account and identity handling is the hardest part of browser-agent work, and it's the part the demos all skip past. Operator's handling here is the best of the three, helped by tighter integration with the user's saved credentials, but it's not solved.

Sites that don't cooperate. Modern web frameworks with heavy JS, anti-bot protection, captchas, single-page apps that don't expose stable selectors: all of these are still hard. The percentage of important sites that fall into one of these categories is meaningful. The agent that works against well-behaved sites doesn't work against the ones that resist automation.

Anything where the value is "do the unbounded thing." "Plan my trip" is the demo case. The real case is "go to this hotel, pick the cheapest room, book it, expense it." The unbounded version is where the long tail of failure modes lives: the agent picks the wrong hotel, the wrong room, expenses it under the wrong category. The bounded version works; the unbounded version doesn't.

Anything time-sensitive. Browser agents are slow. A task that takes a human five minutes takes the agent fifteen. For most async tasks that's fine. For "I need this done before the meeting in ten minutes," it isn't.

The three players, six months in

Specific reads on the three browser-agent products.

OpenAI Operator. The strongest of the three for general web tasks. Best at the form-filling category. The latency improved meaningfully since launch. Pricing is still a Pro-tier feature, which limits usage to the workloads where the value is high enough to justify it. The biggest weakness is the same one as the others: when the task is unbounded, it makes confidently wrong decisions you have to catch.

Anthropic Computer Use. More general-purpose, works on the desktop, not just the browser. Slower than Operator on pure browser tasks; more capable on cross-app workflows. The use cases that benefit from desktop access (move data between local apps, automate something that lives partly outside the browser) are where it pulls ahead. For pure browser work, Operator is faster and more reliable.

Google Mariner. Still feels earlier than the other two. The integration with Google's identity and payments is real, but the build feels less mature. By the end of the year I'd expect the gap to close; at this midpoint, Mariner is the third pick of the three for general work.

The design lesson

The most interesting takeaway from six months of using these things: the browser-agent category benefits enormously from the planner-executor pattern, and most of the products don't do it cleanly enough. The agent that picks a strategy and then executes it step-by-step is meaningfully better than the agent that decides on the fly. Operator is closest to this; the others lean toward more reactive behavior.
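To make the distinction concrete, here's a minimal sketch of the planner-executor split. The `plan` and `execute` names and the step shape are hypothetical, not any product's API; the property that matters is that the strategy is committed once, up front, and the executor only works through it rather than re-deciding after every observation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    done: bool = False

def plan(task: str) -> list[Step]:
    # Hypothetical planner: in a real agent this is one model call that
    # commits to a whole strategy before any browser action happens.
    return [Step(f"{task}: step {i}") for i in range(1, 4)]

def execute(steps: list[Step]) -> list[str]:
    # Executor walks the committed plan step by step. It can report a
    # step as failed, but it does not invent a new strategy mid-run --
    # that's the reactive behavior the planner-executor split avoids.
    log = []
    for step in steps:
        log.append(f"executing {step.description}")
        step.done = True
    return log

steps = plan("collect pricing from five sites")
log = execute(steps)
```

The reactive alternative collapses `plan` into the loop, which is exactly where the on-the-fly wrong turns come from.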

The other design lesson: the runaway tool-call problem shows up here at scale. When the browser agent is stuck in a loop trying to dismiss a popup that won't dismiss, or clicking through a flow that doesn't go anywhere, the visible failure mode is "agent took 20 minutes to do something that should have taken 2." The fix is the same as in the IDE-agent case: max-iteration limits, surface the failure, hand back to the human. Operator does this best of the three; all of them could do it better.
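A minimal sketch of that fix, with a hypothetical action-stream shape that mirrors no real product's internals: a hard step budget that stops the loop, records what happened, and hands back to the human instead of spending twenty minutes on a popup that won't dismiss.

```python
MAX_STEPS = 8  # assumed cap; a real product would tune this per task

def run_with_budget(actions, max_steps=MAX_STEPS):
    """Run an agent's action stream under a hard iteration cap.

    `actions` yields (name, succeeded) pairs -- a hypothetical shape for
    illustration. On hitting the cap we stop, surface the failure, and
    hand control back to the human rather than looping.
    """
    history = []
    for i, (name, ok) in enumerate(actions):
        if i >= max_steps:
            return {"status": "handed_back",
                    "reason": "step budget exhausted",
                    "history": history}
        history.append(name)
        if ok and name == "goal":
            return {"status": "done", "history": history}
    return {"status": "handed_back",
            "reason": "actions exhausted",
            "history": history}

# A stuck loop: the popup never dismisses, so the budget trips.
stuck = (("dismiss_popup", False) for _ in range(100))
result = run_with_budget(stuck)
```

The visible failure mode changes from "agent took 20 minutes" to "agent stopped after 8 steps and told me why," which is the whole point of the hand-back.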

What I'd watch for in the second half

Two things to watch in this category over the next two quarters.

Whether operating costs come down enough to expand the use case. Operator at the $200/month Pro tier is fine for the workloads where it's saving meaningful time. At $20/month it would be available for the much larger set of casual workloads where the value is real but not Pro-tier real. The pricing trajectory tells us how wide the category gets.

Whether the durably-useful use cases get explicit product-level support. Bounded data collection, form-filling, light research scaffolding, these could become first-class workflows with templates and presets, rather than open-ended "tell the agent what to do" sessions. The product that builds this surface well wins meaningful share of attention from people who actually use this stuff.

The browser-agent category is real. The durable use cases are smaller than the demos suggested. The product that converges on those use cases (bounded, well-defined, repeatable) and stops trying to be the demo version is the one that becomes part of the daily-work surface for non-technical users. None of the three is fully there yet. Operator is closest. The next six months will tell whether the category narrows to its strengths or keeps trying to be the unbounded demo.