Sonnet 4 makes Opus look expensive

A day after Claude 4 landed, the math is starting to settle. Opus 4 has its place. The honest read is that Sonnet 4 closes enough of the gap that the price-for-marginal-capability case for Opus needs more justification than I've been giving it.

Sid Smith

24 May 2025 • 4 min read

The first-impressions piece on Claude 4 yesterday landed on a tidy framing: Opus 4 for the small set of problems where capability is worth 5× the cost, Sonnet 4 for everything else. Another day of usage and a little more honest math suggests the framing is right but the line moved further than I gave it credit for. The set of problems where Opus 4 actually earns its premium is smaller than I described, and Sonnet 4 keeps shrinking it from below.

Worth being specific about why, because the price-for-marginal-capability calculus is the one most teams are going to be running over the next few weeks.

What twenty-four more hours of usage shifted

Three concrete observations from running the same workloads through both models in parallel:

On coding work, the gap is small enough to be noise on most tasks. I ran a batch of 30 real refactor tasks through both models, the kind of work I do every day, not the synthetic eval kind. Opus 4 was meaningfully better on three of them, marginally better on eight, and equivalent on the remaining nineteen. The three where Opus mattered were the ones with deep cross-module reasoning. On the other twenty-seven, Sonnet 4 was the right call on cost grounds and Opus would have been overspending. That's a rougher distribution than the marketing curve.

On structured analytical writing, the difference is mostly stylistic. Both produce credible output. Opus is slightly tighter on structure and slightly less likely to drift into filler. Slightly. For the kind of analytical writing I do (these posts, internal memos, technical documentation) the difference doesn't justify the price. Sonnet 4 is competent enough that the marginal reader can't tell.

On agentic tool-use sequencing, Opus's edge is real but narrower than I thought. Multi-tool agentic loops do go more cleanly with Opus on average, but the average hides the distribution: most loops are simple enough that Sonnet 4 handles them fine, and the ones where Opus's planning advantage shows up are the rare hard ones. For the rare hard ones, paying Opus rates is correct; for the routine ones, paying Opus rates is overspending.

The price-per-capability math

Stating it explicitly: $15/$75 per million tokens for Opus vs $3/$15 per million for Sonnet. That's 5× across both input and output. For the 5× to pay back, Opus has to be 5× more valuable on the workload, which on most workloads it isn't.
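A minimal sketch of that arithmetic, using the per-million-token prices quoted above. The model labels (`opus-4`, `sonnet-4`) are shorthand for this example rather than API model IDs, and the token counts in the usage lines are made-up illustrations, not measurements.

```python
# USD per million tokens, as quoted in the post.
PRICES = {
    "opus-4": {"input": 15.0, "output": 75.0},
    "sonnet-4": {"input": 3.0, "output": 15.0},
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def blended_cost(opus_share, input_tokens, output_tokens):
    """Average cost per request when a fraction `opus_share` is escalated to Opus."""
    opus = request_cost("opus-4", input_tokens, output_tokens)
    sonnet = request_cost("sonnet-4", input_tokens, output_tokens)
    return opus_share * opus + (1 - opus_share) * sonnet


# Hypothetical request shape: 20k input tokens, 2k output tokens.
opus = request_cost("opus-4", 20_000, 2_000)
sonnet = request_cost("sonnet-4", 20_000, 2_000)
print(f"Opus ${opus:.2f} vs Sonnet ${sonnet:.2f} per request")
print(f"10% escalated: ${blended_cost(0.10, 20_000, 2_000):.3f} per request")
```

Because the 5× holds on both input and output, the per-request ratio is 5× regardless of the input/output mix. What actually moves the bill is the share of traffic that gets escalated.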

The honest sub-cases where Opus 4 earns the premium:

The hardest reasoning problems where being right matters more than being cheap. Math, formal proofs, deep multi-step logical analysis. The places where one careful answer is genuinely worth fifty fast wrong ones.
Critical-path planning steps in long agentic workflows where the planning quality cascades into a lot of downstream work. Spending Opus tokens on the plan can avoid Sonnet wasted-work tokens later.
One-off high-stakes deliverables where the cost is rounding error against the value of being right. Strategy memos, technical reviews, the rare deliverable where you'd pay 10× for a 10% better outcome.
Long-context synthesis where the model needs to hold a lot in mind and produce a coherent structured output. Opus's coherence-at-length advantage is real for this.
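The planning case in that list is the one that reduces most directly to a break-even check. The sketch below compares the extra cost of running one planning step on Opus against the expected Sonnet rework it avoids. Every input here is an assumption for illustration, especially `rework_prob_drop`, which stands for how much a better plan lowers the chance of redoing downstream work and which I have not measured.

```python
# USD per million tokens, as quoted in the post.
OPUS_IN, OPUS_OUT = 15.0, 75.0
SONNET_IN, SONNET_OUT = 3.0, 15.0


def opus_plan_premium(plan_in, plan_out):
    """Extra USD spent by running the planning step on Opus instead of Sonnet."""
    return (plan_in * (OPUS_IN - SONNET_IN) + plan_out * (OPUS_OUT - SONNET_OUT)) / 1_000_000


def expected_rework_saved(rework_in, rework_out, rework_prob_drop):
    """Expected USD of Sonnet rework avoided, given a drop in rework probability."""
    rework_cost = (rework_in * SONNET_IN + rework_out * SONNET_OUT) / 1_000_000
    return rework_prob_drop * rework_cost


def plan_escalation_pays(plan_in, plan_out, rework_in, rework_out, rework_prob_drop):
    """True when the avoided rework is worth more than the Opus premium on the plan."""
    saved = expected_rework_saved(rework_in, rework_out, rework_prob_drop)
    return saved > opus_plan_premium(plan_in, plan_out)


# Hypothetical workflow: a 10k-in / 2k-out plan guarding 400k-in / 40k-out of execution.
print(plan_escalation_pays(10_000, 2_000, 400_000, 40_000, rework_prob_drop=0.20))
print(plan_escalation_pays(10_000, 2_000, 400_000, 40_000, rework_prob_drop=0.10))
```

This counts token spend only. It ignores the value of the result being right, which is the whole argument for the high-stakes items in the list, so treat it as a floor on when escalation makes sense rather than the full decision.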

That's the list. It's smaller than I made it sound yesterday. For most of what flows through my Claude usage (coding, drafting, agentic work in dev), the right routing is Sonnet 4. Opus is the surge option for the genuinely hard cases.

The pricing-pressure subtext

Anthropic shipping Sonnet 4 at the same $3/$15 as 3.5 and 3.7 Sonnet is the genuinely interesting pricing move. They didn't raise the workhorse-tier price to fund the model improvements. That's a deliberate choice: they're holding the line on workhorse pricing while letting Opus be the premium-tier price-discrimination surface.

The implication for shops doing real planning math: the workhorse-tier capability you can buy at $3/$15 is increasing, the price isn't. The Opus tier exists to capture willingness-to-pay from workloads where capability is the binding constraint. That's a clean segmentation, and most workloads sit on the workhorse side of it.

For cost-modeling purposes, the practical change is to default routing to Sonnet 4 for everything 3.7 Sonnet was already handling, and to be more selective about what gets escalated to Opus 4 than the equivalent escalations were to Claude 3 Opus a year ago. The escalation criteria need to be specific ("this is a hard reasoning problem on a critical path"), not vague ("this feels important").
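One way to keep the criteria specific is to make them explicit fields that somebody has to set, instead of a judgment made at call time. A rough sketch, where the `Task` fields and the model labels are hypothetical names for this example:

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical flags; a real setup would decide how these get set.
    hard_reasoning: bool = False
    critical_path: bool = False
    feels_important: bool = False  # recorded, but deliberately not a routing input


def choose_model(task):
    """Escalate only when both specific criteria hold; default to the workhorse tier."""
    if task.hard_reasoning and task.critical_path:
        return "opus-4"
    return "sonnet-4"


print(choose_model(Task(hard_reasoning=True, critical_path=True)))  # opus-4
print(choose_model(Task(feels_important=True)))                     # sonnet-4
```

The point of carrying `feels_important` and then ignoring it is that the vague signal stays visible in logs without ever being able to trigger the 5× price.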

What I'm doing in my own setup

The actual changes in my routing config after a day:

Default Claude Code model becomes Sonnet 4 rather than 3.7 Sonnet. Same price, better behavior. No reason not to.
Opus 4 stays available as a manual override for the cases where I know it matters: long synthesis tasks, hard planning, critical refactors with high blast radius if wrong.
The agentic loops in my personal automation default to Sonnet 4 with extended thinking on for the planning step and off for the execution steps. The per-turn routing matters more than the model choice.
Batch and async work stays on Sonnet 4 or cheaper-tier alternatives. The batch workloads don't justify Opus on cost grounds and don't benefit from its capability edge.
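Roughly, the changes above amount to a small routing table plus a manual override. This is an illustrative sketch and not my actual config: the step names, the model labels, and the `extended_thinking` flag are placeholder names, not real API parameters.

```python
# Hypothetical routing table: every step defaults to the workhorse tier.
ROUTES = {
    "interactive":   {"model": "sonnet-4", "extended_thinking": False},
    "agent_plan":    {"model": "sonnet-4", "extended_thinking": True},
    "agent_execute": {"model": "sonnet-4", "extended_thinking": False},
    "batch":         {"model": "sonnet-4", "extended_thinking": False},
}


def resolve(step, override_model=None):
    """Return the route for a step, applying a manual model override if given."""
    route = dict(ROUTES[step])  # copy so an override never mutates the defaults
    if override_model is not None:
        route["model"] = override_model
    return route


print(resolve("agent_plan"))
print(resolve("interactive", override_model="opus-4"))
```

Opus never appears in the table itself. It only enters through `override_model`, which keeps escalation an explicit per-call decision.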

Not a dramatic reconfiguration. The frontier moved in a way that mostly affects the default rather than the menu.

The pattern that's emerging across vendors: the workhorse tier keeps getting better at flat or falling prices; the premium tier keeps existing as a price-discrimination surface for the small set of workloads that actually need it. GPT-4.1 vs GPT-4.5 has the same shape. Sonnet 4 vs Opus 4 is the cleanest current example.

What this means for architecture decisions: the right default routing is workhorse-tier from any vendor that fits the workload, with premium-tier as the explicit-escalation path for the hard cases. The premium tier is real and earns its keep on the right problems. It's not the default; it doesn't pay back as the default.

A day in, Sonnet 4 isn't just "good enough"; it's good enough that the question of whether to default to Opus answers itself in the negative for most workloads. That's a credit to Anthropic's release execution, not a knock on Opus. The premium tier doing its job correctly looks like the workhorse tier handling 95% of the work and the premium tier earning the other 5%. Sonnet 4 makes that math more clearly true than 3.7 Sonnet did. Opus 4 doesn't lose; it just needs more justification per call than I was giving it yesterday.