Claude 3.7 Sonnet: the new defaults of "extended thinking"

Anthropic shipped a model that knows when to think. The interesting question isn't the benchmark numbers, it's what happens when reasoning becomes a default behavior instead of a separate product tier.

A translucent glass human head with brass clockwork gears turning inside the skull

Anthropic released Claude 3.7 Sonnet today. The headline number is a step up on the coding benchmarks. The headline product change (and the more interesting one) is that "extended thinking" is now a toggleable default behavior of the same model rather than a separate product tier you pay extra for.

That structural change matters more than the benchmark deltas. Worth working through what it actually is, what it costs, and how it sits next to the rest of the field as of late February.

What's actually new

Claude 3.7 Sonnet ships as a single model with a hybrid behavior: by default it answers conversationally, the way Sonnet always has. When extended thinking is enabled, either explicitly via an API parameter or by the user clicking a toggle in the interface, the model produces visible reasoning tokens before the final answer. You can set a budget for how many thinking tokens the model is allowed to spend. The reasoning tokens count against context and against output cost, but at the standard Sonnet rate; there's no premium tier.
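
Here's roughly what that looks like through the Python SDK, a minimal sketch based on the thinking parameter as documented at launch: you enable it per request and cap the spend with budget_tokens, which has to be smaller than max_tokens.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,
    # Per-request opt-in: enable thinking and cap how many tokens it may spend.
    thinking={"type": "enabled", "budget_tokens": 16_000},
    messages=[{"role": "user", "content": "How many Fridays fall on the 13th in 2025?"}],
)

# The response interleaves thinking blocks (the visible scratch work)
# with text blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print(block.text)
```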

That phrasing matters because it puts Claude 3.7 in a different category than OpenAI's o-series. The o1 and o3-mini models are separate products: you pick one for "reasoning" workloads and a different one for "fast" workloads. Claude 3.7 collapses that choice into a single model and a per-request flag. Whether to think is now a request-level decision, not a procurement decision.

It is also priced like Sonnet: $3 per million input tokens, $15 per million output, unchanged from 3.5 Sonnet. Compared to o1 at $15 and $60, or the open-weights DeepSeek-R1 at $0.55 and $2.19, Claude 3.7 sits in the middle of the range: meaningfully cheaper than o1 for reasoning-style workloads, meaningfully more expensive than R1 for the same.
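
To make that concrete, here's back-of-envelope math for a single hypothetical reasoning-heavy request. The token counts are invented for illustration; the prices are the published per-million rates above.

```python
# Invented workload: 2,000 input tokens, and the model spends
# 10,000 thinking tokens plus 1,000 answer tokens. Thinking tokens
# bill as output tokens. Prices are $ per million tokens, late Feb 2025.
prices = {
    "claude-3.7-sonnet": (3.00, 15.00),
    "o1": (15.00, 60.00),
    "deepseek-r1": (0.55, 2.19),
}

input_tokens = 2_000
output_tokens = 10_000 + 1_000  # thinking + final answer

for model, (per_m_in, per_m_out) in prices.items():
    cost = input_tokens / 1e6 * per_m_in + output_tokens / 1e6 * per_m_out
    print(f"{model}: ${cost:.3f}")

# Prints: claude-3.7-sonnet: $0.171, o1: $0.690, deepseek-r1: $0.025
```

On this made-up workload the same request runs roughly 4x cheaper on Claude 3.7 than on o1, and roughly 7x more expensive than on R1. The ratios shift with the thinking-to-answer mix, but the ordering doesn't.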

What "thinking" actually does in practice

Extended thinking turns on a chain-of-thought pass that's structured to work for the model, not for human readability. The visible reasoning is real (Anthropic chose to surface it rather than hide it the way OpenAI does with o1) but it reads like notes-to-self, not like a polished explanation. Don't read it as documentation; read it as scratch paper.

Three observations from the first day of using it on real work:

It's noticeably better at multi-step problems. Tasks where 3.5 Sonnet would jump to a plausible-but-wrong answer (most often involving counting, comparing across many items, or synthesizing constraints from multiple sources) now hold together. The model uses the thinking budget to actually decompose the problem and check itself.

The thinking doesn't always help. For straightforward conversational turns, lookup-style questions, or tasks where the model already has the answer in one pass, the extended thinking is wasted output cost and added latency. The toggle exists for a reason. Default it off and turn it on for the work that needs it.

It has the same failure modes as 3.5 Sonnet, just less often. The model still hallucinates, still sometimes follows a bad chain of reasoning to a confident wrong answer, still occasionally gets stuck on a misunderstanding. Reasoning models compress how often these happen. They don't eliminate them.

The product-shape consequence

The bigger story is what happens to the product surface when the thinking-vs-not-thinking choice becomes a per-request flag instead of a model-selection step. A few things change:

  • Application code gets simpler. Instead of routing some requests to o1 and others to gpt-4o based on request type, you point at one model and pass a flag (see the sketch after this list). Less routing logic, less price-tier accounting, less "which model is this customer's request going to today."
  • Prompt engineering changes shape. When extended thinking is available on demand, the prompt-side question becomes "is this a thinking task" rather than "how do I get the cheap model to think." The work moves from coercing the model to choosing when to spend the budget.
  • Token-cost accounting becomes harder to predict ahead of time. Reasoning tokens are output tokens for billing purposes, but the amount of thinking the model decides to do for a given request is variable in a way that completion tokens for non-reasoning models aren't. Your monthly invoice gets more variance.
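
Here's a sketch of what that simplified routing can look like. The needs_thinking heuristic is entirely hypothetical, a stand-in for whatever classification your application actually uses: keyword rules, a cheap classifier call, or a per-feature flag.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-20250219"  # one model for both kinds of work

def needs_thinking(prompt: str) -> bool:
    # Hypothetical heuristic; replace with whatever signal your app has.
    return any(word in prompt.lower() for word in ("prove", "plan", "debug"))

def ask(prompt: str) -> str:
    kwargs = {}
    if needs_thinking(prompt):
        # The runtime decision: spend thinking budget only where it pays off.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8_000}
    response = client.messages.create(
        model=MODEL,
        max_tokens=16_000,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    # Return only the text blocks; thinking blocks are scratch work.
    return "".join(b.text for b in response.content if b.type == "text")
```

No model routing table, no per-tier billing code: the same call path handles both kinds of request, and the only branch is whether to pass the flag.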

These aren't huge changes individually. Together they shift the "shape" of how a sensible application gets built. The thing that used to be a deployment-time decision (which model variant) becomes a runtime decision (whether this turn deserves thinking).

Where it fits in the late-Feb 2025 lineup

The reasoning-model field as of late February has three clear shapes:

  • OpenAI's o-series: separate products, premium pricing, the reasoning trace hidden from the user. Strongest at the very hardest math/proof-style problems, currently.
  • DeepSeek-R1: open weights, MIT license, dramatically cheaper inference, the reasoning trace fully visible. Strongest at the "I want the capability without the platform tax" use case.
  • Claude 3.7 Sonnet: hybrid model, standard Sonnet pricing, the reasoning trace visible by default. Strongest at the "I want one model that does both jobs and one bill" use case.

Each of those shapes is a defensible product choice. None of them is going to be the only answer. What 3.7 demonstrates (alongside R1 a month earlier) is that "reasoning" is no longer a premium tier you pay extra for. It's becoming a feature that ships with the standard model. The o1-pricing era looks increasingly like a transitional phase rather than the new equilibrium.

The interesting test for the next month is whether GPT-4.5 and Gemini's next reasoning variant follow Claude 3.7's bundled-by-default approach or hold the line on separate-tier pricing. Anthropic's competitive position is now defined partly in opposition to whatever OpenAI ships next. The market is telling them which way to go. Whether they listen is the next datapoint worth watching.