OpenAI's o-series: when reasoning beats raw scale

OpenAI's o1 and o3-mini are the most public proof yet that scaling test-time compute beats scaling parameters at a category of problems. Worth being clear about which category.

The o-series (o1 in December, o1-pro through ChatGPT Pro, and o3-mini at the end of January) is the product line OpenAI built around a thesis that "scale test-time compute, not parameters" is the real next move in capability. A few months on, the thesis has held up well enough at the category of problem it's actually optimized for, and noticeably less well outside that category. Worth being clear about both halves.

What the o-series actually is

The o-series models think before they answer. Internally, they generate a long chain of intermediate reasoning tokens that the user doesn't see, then produce a final response that's typically much shorter than the reasoning that produced it. The compute spent on the hidden reasoning is real: it shows up in latency, on the bill, and in capability on the problems the technique benefits.
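That billing effect is easy to put numbers on. A minimal sketch of the arithmetic, assuming hidden reasoning tokens are billed at the output-token rate (which is how OpenAI's pricing describes it); the token counts are made up for illustration:

```python
def billed_cost_usd(input_tokens: int, reasoning_tokens: int,
                    visible_output_tokens: int,
                    input_rate: float, output_rate: float) -> float:
    """Cost of one request, with rates in USD per million tokens.
    Hidden reasoning tokens are billed at the output rate even though
    the user never sees them."""
    billed_output = reasoning_tokens + visible_output_tokens
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000

# A short question that triggers a long hidden chain, at o1's $15/$60 rates:
cost = billed_cost_usd(input_tokens=500, reasoning_tokens=8_000,
                       visible_output_tokens=400,
                       input_rate=15.0, output_rate=60.0)
# The 8,000 hidden tokens dominate: they account for over 90% of the bill.
```

The point of writing it out is the shape of the formula: a 400-token answer to a 500-token question can still cost like a 9,000-token exchange.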

In practice that means o1-class models do well at:

  • Multi-step math and proof-style problems where each step depends on the last
  • Code problems where the right answer requires planning through control flow before writing
  • Logic puzzles and constraint problems where the model has to consider and discard candidates
  • Scientific question-answering where multiple facts have to be combined deliberately

And noticeably worse at:

  • Simple chat turns where the answer is one paragraph and the model has nothing to reason about
  • Style-sensitive writing tasks where the reasoning trace doesn't help the prose
  • Latency-sensitive applications where the user notices five-to-ten seconds of staring at a spinner
  • Anything cost-sensitive at o1 pricing; $15/$60 per million tokens is not what you reach for in a tool call

The pattern is consistent enough that "reasoning model vs not" has become a per-request decision in any application that mixes both kinds of work.
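That per-request decision can be a few lines of routing code. Everything in this sketch (the task labels, the ten-second latency threshold, the routing policy itself) is illustrative, not any vendor's API or a production-tested policy:

```python
# Task shapes where hidden chain-of-thought has historically paid off
# (labels are hypothetical application-level categories).
REASONING_TASKS = {"multi_file_refactor", "proof", "constraint_solving",
                   "document_synthesis"}

def pick_model(task_type: str, latency_budget_s: float) -> str:
    """Route to a reasoning model only when the task shape benefits
    AND the caller can tolerate the thinking pause."""
    if task_type in REASONING_TASKS:
        # Full o1 only makes sense if the user can wait out the
        # multi-second reasoning phase; otherwise drop to the cheap tier.
        return "o1" if latency_budget_s >= 10 else "o3-mini"
    # Chat turns, extraction, style rewrites: reasoning compute buys nothing.
    return "gpt-4o"
```

The interesting part is that both inputs to the decision (task shape and latency budget) are properties of the request, not of the model, which is why this ends up in application code.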

The thesis vs the benchmarks

The thesis OpenAI's been advancing (that test-time compute is its own dimension to scale) has gotten more concrete since the start of the year. Not because the o-series ran the table on benchmarks (it didn't; DeepSeek-R1 in late January showed that comparable reasoning capability exists at a fraction of the inference price), but because the category of "models that think before they answer" stopped being a research demo and became a product line. As of the end of February, every major frontier-model vendor is shipping in this shape: OpenAI's o-series, Anthropic's Claude 3.7 with extended thinking, Google's Gemini 2.0 Flash Thinking variant. The disagreement isn't about whether to ship reasoning models. It's about how to expose them.

OpenAI's product choice is to make reasoning a separate model line. Buy o1 if you want the reasoning behavior; buy GPT-4o if you don't. That cleanly separates the cost decision from the capability decision, but it forces application code to route by request type. Anthropic's choice with 3.7 collapses both into one model with a per-request flag, which is friendlier for application authors but trickier for cost forecasting. DeepSeek's choice is to publish the recipe and weights and let everyone else figure out the product surface. None of these is the wrong answer; they're different bets about what the integrator wants to manage.
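The two exposure styles show up directly in the request payload. A sketch of the shapes as plain dicts (field names follow the public APIs, but treat the exact values, model IDs, and budgets as illustrative rather than authoritative):

```python
# OpenAI style: reasoning is a product decision, encoded in the model name.
openai_request = {
    "model": "o1",  # vs "gpt-4o" for the non-reasoning path
    "messages": [{"role": "user", "content": "Refactor this module."}],
}

# Anthropic style: one model, reasoning toggled per request via a flag.
# Field names follow the extended-thinking API; values are illustrative.
anthropic_request = {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 16_000,  # must leave room above the thinking budget
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
    "messages": [{"role": "user", "content": "Refactor this module."}],
}
```

The flag version keeps one procurement line item and pushes the cost variance into runtime; the model-line version does the reverse. That's the whole trade in two dicts.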

Where the o-series is actually winning

In my use, the place where the o-series most consistently outperforms a non-reasoning frontier model is one specific shape of code task: refactoring across multiple files where the right move depends on understanding the broader architecture before making any change. Non-reasoning models tend to make a change that's locally correct but architecturally wrong; o1 is willing to spend a few thousand reasoning tokens deciding what to actually do before touching the code. The result is fewer "looks right at first glance, breaks something three modules over" outcomes.

The other place is anything that resembles a research-style question, where the user wants a thorough answer and is patient. Multi-step planning. "Read these three documents and tell me what's inconsistent between them." Synthesis problems where the answer benefits from being chewed on.

The places where the o-series doesn't help (interactive chat, simple data extraction, anything where the answer is a known shape of output you want fast) are the same places extended thinking on Claude 3.7 doesn't help. That's not a coincidence. The benefit of reasoning compute is bounded by whether the problem has reasoning in it.

The cost asymmetry that matters

The cost story is the awkward part. o1 at $15/$60 per million tokens is workable for high-stakes single-turn problems where being right matters more than being cheap. It's not workable for any application that runs reasoning on every turn. The o3-mini release at the end of January helped: at $1.10/$4.40 per million, the math becomes reasonable for medium-stakes workloads. But the headline o-series price is still where the marketing message lands, and the marketing message is "reasoning is a premium tier."

The DeepSeek pricing argues against that framing. R1 at $0.55/$2.19 per million is well over an order of magnitude cheaper than o1 for a comparable capability class, and unlike o1 the weights are downloadable. That means the question OpenAI's product team is implicitly answering with the o-series price isn't "what's the right price for reasoning capability in 2025." It's "what's the right price for reasoning capability that's also pre-integrated into ChatGPT, comes with the OpenAI ecosystem and reliability story, and doesn't require us to acknowledge the open-weights pressure exists." Both questions have valid answers; they're not the same question.
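The gap is easy to check against the list prices quoted above. A sketch with a hypothetical medium-size reasoning request:

```python
# List prices quoted in this post, USD per million tokens: (input, output).
PRICES = {
    "o1":      (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
    "r1":      (0.55, 2.19),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at list prices (reasoning tokens count as output)."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A medium reasoning request: 2k tokens in, 10k out (mostly hidden reasoning).
costs = {m: request_cost(m, 2_000, 10_000) for m in PRICES}
# At these rates o1 comes out roughly 27x more expensive than R1 per request,
# with o3-mini sitting in between.
```

On output-heavy reasoning traffic the output rate dominates, so the per-request ratio tracks the $60-vs-$2.19 output gap almost exactly.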

The interesting test going forward

The thing the next quarter will settle is whether reasoning ends up bundled into every standard-tier model (the Claude 3.7 approach) or stays as a separate product (the o-series approach). The market signals point in the bundled direction: once "thinking on demand" becomes a per-request flag rather than a procurement decision, the appeal of a separate product line shrinks. But OpenAI has historically been willing to keep premium tiers premium when the alternative is admitting the market has moved.

Whichever way that resolves, the underlying fact the o-series demonstrated has stuck: spending more compute at inference, on the right kind of problem, beats spending more compute at training. That's a real change in the engineering economics of the field, even when the specific product line that proved it ends up replaced by something cheaper.