Claude 4 / Opus 4: first impressions from a heavy user
Anthropic shipped the Claude 4 family yesterday: Opus 4 is the new top tier, Sonnet 4 the everyday workhorse, and both carry the toggleable extended thinking the 3.7 line introduced. Pricing is $15/$75 per million tokens for Opus 4 and $3/$15 per million for Sonnet 4. Sonnet 4 holds the same rate as 3.5 and 3.7 Sonnet; Opus 4 restores the premium tier the catalog has been missing since Opus 3.
Twenty-four hours of real work on it isn't enough for definitive takes. It's enough for honest first impressions, which is what I have. Worth being specific about what's actually different and what isn't, because the marketing layer for any major model release inflates the deltas in the early days.
What's actually new
Beyond the version number bump, three things stand out from the day-one experience:
Coding behavior is meaningfully better. I've been using Claude Code daily since it shipped in February, and the difference between 3.7 Sonnet and Sonnet 4 on real refactor work is noticeable from the first sessions. Specifically: it's better at multi-file changes where the right move depends on understanding the broader context. The pattern where 3.7 would make a locally-correct change that broke something three modules over is less frequent. It's not gone (no model has solved this yet), but the rate is meaningfully lower.
Tool-use sequencing is sharper. Agentic loops take fewer turns to reach the right answer. The model is better at picking the right tool on the first try, better at chaining tool calls when the next step depends on the previous result, and better at recovering when a tool returns an unexpected result. This is the dimension where I'd expect the day-one impression to hold up under more usage.
Long-context coherence improved. Both Sonnet 4 and Opus 4 hold the structure of a long conversation better than 3.7 did. Long agentic sessions where the conversation history grows past 100k tokens used to start showing degradation in the model's awareness of what was decided earlier. The new versions hold it together longer. The 200k context is the same advertised number; the effective context for complex multi-turn work is closer to the advertised maximum than it was on 3.7.
What's not as different as the marketing suggests
A few places where the day-one experience is more "incrementally better" than "step change":
Pure prose quality. For long-form writing, the difference between 3.7 Sonnet and Sonnet 4 is small. The voice is slightly more natural in places; the structural cohesion of long pieces is marginally better; the rate of weird stylistic tics is slightly lower. None of it is dramatic. If you were happy with 3.7 for writing, you'll be happy with Sonnet 4. If you weren't, Sonnet 4 isn't going to change your mind.
Hallucination rate on factual claims. Subjectively similar to 3.7, meaning the model still confidently asserts things that aren't true at roughly the same frequency, and the failure modes look the same. The reasoning improvements help on some classes of problem; they don't substantially fix the "made-up library function" or "wrong API signature" pattern.
Multimodal capabilities. Image-in is the same shape it was on 3.7. There's no audio or video story yet. If you need real multimodal capability beyond image input, Gemini 2.5 Pro is still the better pick, and the I/O announcements last week widened that gap rather than narrowed it.
Opus 4 specifically
The premium tier got a real refresh, and the question is what it's actually for.
At $15/$75 per million, Opus 4 costs 5× Sonnet 4 for the same input/output volume. For most workloads that price delta isn't justifiable. Sonnet 4 is competent enough that paying 5× for marginal capability gains doesn't pay back. The honest use case for Opus 4 is the small set of problems where the marginal capability matters a lot:
- Hard reasoning problems where being right is much more valuable than being cheap. Mathematical proofs, multi-step logical analysis, the kind of work where one careful answer is worth fifty fast wrong ones.
- Long-context synthesis where the model needs to hold a large amount of input and produce a structured analysis. The Opus 4 advantage on coherence at high context lengths is real and meaningful for this category.
- Critical-path agentic decisions where the agent's choice cascades into a lot of downstream work. Spending Opus-tier money on the planning step can pay back by avoiding wasted Sonnet-tier work later.
- High-stakes one-off analysis, research, strategy documents, the rare deliverable where the cost is rounding error compared to the value of being right.
For most workloads, Sonnet 4 is the right default. Opus 4 is the surge option for the problems where it matters.
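The 5× ratio is easy to make concrete with the list prices quoted above. A minimal cost sketch; the model names and the example job size are illustrative, not any vendor's API:

```python
# Per-million-token list prices quoted in the post: (input $/M, output $/M).
# Names are illustrative labels, not real API model identifiers.
PRICES = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the quoted list prices."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A hypothetical 50k-token-in / 5k-token-out analysis job:
opus = job_cost("opus-4", 50_000, 5_000)      # 0.75 + 0.375 = $1.125
sonnet = job_cost("sonnet-4", 50_000, 5_000)  # 0.15 + 0.075 = $0.225
print(f"Opus ${opus:.3f} vs Sonnet ${sonnet:.3f} ({opus / sonnet:.0f}x)")
```

At these rates the ratio is 5× regardless of the input/output mix, which is why the decision reduces to "is the marginal capability worth 5× on this specific problem" rather than anything more subtle.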
How this lands in the late-May menu
The frontier model menu now reads roughly:
- Opus 4, premium reasoning tier. Use for the few problems where it matters.
- Sonnet 4, workhorse. The new default for Anthropic-aligned shops.
- GPT-4.1 / mini / nano, workhorse competitor with a different shape. Cheaper than Sonnet 4 at the flagship tier, much cheaper at mini/nano.
- GPT-4.5, premium positioning, still expensive; from a value-for-money standpoint it has mostly ceded the prestige slot to Opus 4.
- Gemini 2.5 Pro, competitive workhorse with the long-context and multimodal advantages.
- OpenAI o-series (o3, o4-mini), reasoning specialty tier. Niche but real.
- Open-weights tier (DeepSeek V3-0324, Llama 4 Scout/Maverick), commodity workhorse for cost-sensitive workloads.
The shape of the menu is less interesting than what hasn't changed: there's no single best model, and the right pick still depends on the workload more than on the leaderboard. Claude 4 doesn't change that. It does shift Anthropic's default workhorse upward and gives them a credible premium tier again, which they've needed.
What I'm changing in my own setup
A few specific changes from the day-one experience:
- Sonnet 4 becomes the default for Claude Code instead of 3.7 Sonnet. The coding improvements are real and the price is identical.
- Extended thinking budget defaults stay roughly where they were. The model uses the budget more efficiently, so the same budget yields better results without me having to retune.
- Opus 4 gets reserved for hard one-offs. The price doesn't justify making it default; the capability justifies pulling it out for the specific problems where it earns its place.
- Multi-vendor routing stays. GPT-4.1 still wins on cost for some workloads; Gemini 2.5 Pro still wins on long-context retrieval; DeepSeek V3 still wins on raw cost-per-token for batch work. Claude 4 doesn't subsume any of those.
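The routing logic behind those bullets fits in a few lines. A sketch of the heuristic, with made-up model labels and thresholds that stand in for whatever a real router would actually tune:

```python
# Illustrative multi-vendor routing heuristic matching the setup described above.
# Model names are stand-in labels and the thresholds are assumed defaults,
# not tuned values or real API identifiers.

def pick_model(task: str, context_tokens: int, batch: bool) -> str:
    """Route a request to a model tier based on simple workload traits."""
    if batch:
        # Raw cost-per-token dominates for batch work.
        return "deepseek-v3"
    if context_tokens > 200_000:
        # Past Claude's window, long-context retrieval wins.
        return "gemini-2.5-pro"
    if task in {"proof", "planning", "strategy"}:
        # Surge tier: hard one-offs where being right pays for itself.
        return "opus-4"
    # Everyday default.
    return "sonnet-4"

print(pick_model("refactor", 40_000, batch=False))  # sonnet-4
```

The point of the sketch is the shape, not the thresholds: the default path lands on the workhorse, and everything else is an explicit exception with a reason attached.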
The day-one impression is positive. The seven-day impression and the month-out impression will tell the actual story. First impressions from major model releases tend to misjudge the deltas in both directions, and the remaining work is to see how the model holds up across the actual range of work I do, not just the first afternoon of trying the new thing.
What's clear from twenty-four hours is that Anthropic shipped a real update with meaningful capability improvements in the categories that actually matter for production work. That's a different bar than "the benchmarks moved," and Claude 4 is on the right side of it.