OpenAI's GPT-5 and what it actually changes

GPT-5 dropped this week. The benchmark deltas are real, the marketing layer is loud, and the question for shops with already-working AI stacks is which routing decisions actually need to change. Honest answer: fewer than the launch suggests.

[Image: An ornate wooden box, partially opened on a dark surface, with bright warm light spilling out from inside]

GPT-5 landed earlier this week, ending the longest "is it nearly ready" cycle in OpenAI's release history. The headline benchmarks are strong, the marketing layer is doing its job, and the Twitter takes are going through the predictable phases (overshoot → undershoot → settled). Worth being specific about what the launch actually changes for shops with already-working AI stacks, because the keynote framing is louder than the practical delta.

What's actually new

Three things stand out from the day-three usage:

Reasoning quality is meaningfully better, especially on hard problems: the class where GPT-4.1 would get the structure right but miss a step in the middle. GPT-5 catches that step more often. The improvement is real on the hardest 10-20% of analytical work; on easier work, it's small enough to be noise.

Tool-use sequencing is sharper. Similar to what I noted with Claude 4: agentic loops take fewer turns to reach the right answer, the model is better at picking the right tool on the first try, and recovery from unexpected tool returns is cleaner. This is the dimension where the day-three impression is most likely to hold up under longer use.

Cost positioning lands at workhorse-tier. GPT-5 prices in roughly the same band as GPT-4.1, meaningfully cheaper than a "premium-tier" model would have been. OpenAI ceded the premium-pricing slot to Opus 4 and is using GPT-5 as the better workhorse rather than the more expensive top-tier.

What's not as different as the launch suggests

A few places where the day-three experience is more "incrementally better" than "step change":

Pure prose quality. Marginally better than 4.1. Not dramatically. If you were happy with GPT-4.1 for writing, you'll be happy with 5. If you weren't, 5 doesn't change your mind.

Hallucination rate. Subjectively similar. The reasoning improvements help on some classes of problem; they don't substantially fix the made-up-library-function or wrong-API-signature pattern.

Multimodal capabilities. Still behind Gemini 2.5 Pro in the same ways the prior generation was. Image input is competent. There's no audio-out story in this release. If you need real multimodal, Gemini still leads.

Long context. The advertised window most users see is still 128K; effective context for complex multi-turn work is improved, but not dramatically. Claude 4's 200K window is still ahead for genuinely long conversations.

How this lands in the model menu

The frontier model menu in early June 2025:

  • GPT-5, new workhorse-tier flagship from OpenAI. Reasoning is competitive with Claude 4 Sonnet. Cost is competitive too. The new default for OpenAI-aligned shops.
  • GPT-4.1 / mini / nano, still around, still meaningfully cheaper at the smaller tiers. The right pick for cost-sensitive workloads where 5's reasoning isn't needed.
  • GPT-4.5, its prestige slot now fully ceded. Probably gets deprecated within a couple of quarters.
  • Claude Opus 4, premium reasoning tier. Still earns its keep on the small set of problems where capability is worth the price.
  • Claude Sonnet 4, workhorse. Closes the gap with Opus enough that Sonnet stays the default for most workloads.
  • Gemini 2.5 Pro, credible workhorse with the long-context and multimodal advantages.
  • OpenAI o-series, reasoning specialty tier, narrower niche now that GPT-5 absorbed some of what o3 was for.
  • Open-weights tier. DeepSeek V3, Llama 4, Qwen variants. Cost-sensitive batch work continues to live here.

The shape of the menu is roughly the same as it was a month ago. GPT-5 slots in as a new workhorse competitor to Sonnet 4, the premium tier (Opus 4) is unchanged, the cheap tier is unchanged. The decision shape continues to be "match workload to model," not "pick the best."
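The "match workload to model" decision shape is easy to make concrete. A minimal sketch: the model names below come from the menu above, but the workload categories and tier assignments are illustrative assumptions, not anyone's production routing table.

```python
# Minimal workload-to-model router. Tier assignments mirror the menu above;
# the workload keys and mappings are illustrative assumptions only.
ROUTING_TABLE = {
    "default_coding": "gpt-5",          # workhorse tier, new OpenAI default
    "hard_analysis": "claude-opus-4",   # premium reasoning tier
    "long_context": "claude-sonnet-4",  # 200K context advantage
    "multimodal": "gemini-2.5-pro",     # image-input strengths
    "cheap_batch": "gpt-4.1-mini",      # mini tier / open weights
}

def route(workload: str) -> str:
    """Match a workload to a model; fall back to the workhorse default."""
    return ROUTING_TABLE.get(workload, "gpt-5")
```

The point of the table shape is that GPT-5's arrival changes one or two rows, not the structure: you swap values, the keys stay.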

Routing decisions that actually need to change

For shops with an already-working multi-vendor routing setup:

Default coding model for OpenAI-aligned teams shifts from GPT-4.1 to GPT-5. Same price band, better behavior on the hard cases. No reason not to.

Hard analytical work gets a credible second option. Where I'd reach for Opus 4 a week ago, GPT-5 is now a reasonable substitute on a subset of cases. Not all (the very hardest reasoning still favors Opus) but enough that the routing isn't binary anymore.

Mini-tier and nano-tier work doesn't change. GPT-5's pricing is workhorse-tier, not mini-tier. The cheap-end work continues to be GPT-4.1-mini, GPT-4.1-nano, or open-weights.

Multi-vendor routing stays. GPT-5 doesn't subsume Claude 4 or Gemini 2.5 Pro for the cases where those models win. The workload-routing pattern continues to be the right one; the menu of options just got marginally better.

Spend modeling gets revisited. GPT-5 is in roughly the same cost band as GPT-4.1 but uses tokens at slightly different rates because of internal reasoning behavior. Worth re-running the workload-cost model after a couple of weeks of usage to see where the actual landed costs are.
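A first-pass re-run of that cost model is simple arithmetic. The sketch below uses the low end of the workhorse price band quoted later in the post; the token volumes are made-up placeholders, there to show how a shifted input/output mix moves landed cost even at the same per-token price.

```python
def monthly_cost(input_tokens_m: float, output_tokens_m: float,
                 price_in: float, price_out: float) -> float:
    """Dollar cost for a month's traffic; volumes and prices are per million tokens."""
    return input_tokens_m * price_in + output_tokens_m * price_out

# Hypothetical volumes: internal reasoning behavior can grow the output
# side of the mix, so compare the same traffic under two output assumptions.
baseline = monthly_cost(500, 120, price_in=2.0, price_out=8.0)  # GPT-4.1-like mix
shifted = monthly_cost(500, 150, price_in=2.0, price_out=8.0)   # heavier output mix
```

Same price band, roughly 12% higher landed cost in this made-up example, which is exactly the kind of delta that only shows up after a couple of weeks of real usage data.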

The pricing shape that's emerging

Three vendors have now shipped a flagship workhorse-tier release in the last month: Sonnet 4, Gemini 2.5 Pro GA, GPT-5. All three landed at roughly the same price band ($2-3 per million input, $8-15 per million output). All three are competitive on capability. The premium tier (Opus 4 and the OpenAI o-series) sits above this, the cheap tier (open-weights and the mini variants) sits below.

The pattern: the workhorse tier is getting more crowded and more competitive without the prices moving much. The capability you can buy at $2-3/$8-15 per million is increasing with each release. The implication for cost modeling is good news (capability per dollar improves); the implication for vendor differentiation is harder (the workhorse-tier vendors look more like substitutes than they used to).

What I'm watching from here

Two things over the next quarter:

Whether GPT-5 holds up on coding work over real usage time. Day-three impressions inflate the deltas. The seven-day and thirty-day impressions tell the actual story. The Claude Code-equivalent workflows on GPT-5 will tell us whether the coding improvements are durable or whether 4.1 stays the better fit for the kind of multi-file refactor work that matters.

How the OpenAI Agents SDK evolves with GPT-5 underneath. The agent surface is where the practical agentic experience gets shaped. If OpenAI doubles down on the agent SDK with GPT-5 as the default, that pulls the agent-framework center of gravity OpenAI-ward. If they don't, the multi-vendor pattern continues to be the right one.

The launch is real. The routing changes for shops with working stacks are smaller than the launch suggests. That's a sign of a market in the consolidation phase rather than the discovery phase, and that's where the AI tooling market actually is in mid-2025.