Mid-2025 model leaderboard: who wins on cost AND quality

Halfway through the year. The frontier moved, the workhorse tier got crowded, the price floor dropped. The honest leaderboard isn't a single ranking, it's a routing decision per workload, and the menu in July looks meaningfully different from the menu in May.

[Image: a racing podium with three metallic computer chips of different sizes on first, second, and third place.]

Halfway through 2025. Let me take stock of where the model menu actually lands now versus where it was at the start of the year. The shape has shifted enough that the answer for any given workload has changed two or three times since January, and the version that's stable now is more interesting than any single point in the trajectory.

[Figure: the mid-2025 model menu by tier — premium reasoning (Opus 4, OpenAI o-series), workhorse (Sonnet 4, GPT-5, Gemini 2.5 Pro), mid (Sonnet 3.7, GPT-4.1 mini, Gemini 2.5 Flash), open-weights (Llama 4 Scout, DeepSeek V3-0324, Qwen 2.5/3), specialty (o3/o4-mini, FLUX/Imagen 4, embedding models).]

The honest framing isn't "which model wins", it's "which model wins for which workload." That gives less satisfying answers but more useful ones. Worth being concrete about what sits in each cell of that menu.

[Chart: cost versus quality, mid-2025 — GPT-5, Claude 4 Opus, Gemini 2 Pro, Claude Sonnet, DeepSeek V3, Llama 4 405B, GPT-4.1, Haiku 4, Gemini Flash, and Qwen 2.5 72B plotted by cost per million tokens against quality.]

Top-right is where you want to live. Open-weight models moved up and to the right faster than anyone planned for.

The tiers as of July

Five tiers, with the picks in each as of the past month.

Premium reasoning tier. Claude Opus 4 still owns this slot for the workloads where it matters. The OpenAI o-series sits adjacent, narrower, more reasoning-specialty. Gemini hasn't shipped a proper premium-tier release; their answer is "Gemini 2.5 Pro at the workhorse tier is good enough." For the small set of workloads where the marginal capability of Opus actually pays back (the harder reasoning, the planning steps in agentic loops, the high-stakes one-offs), this is where it lives.

Workhorse tier. This is where the action is: Sonnet 4, GPT-5, and Gemini 2.5 Pro, three credible competitors all in roughly the same price band (roughly $2-3 per million input tokens, $10-15 per million output).

For most workloads the right answer is "use whichever workhorse-tier model best fits the tools you already run." The capability differences between them are small enough to be noise on most jobs; the fit-to-your-stack differences are large enough to matter.

What it costs to run each one (per million tokens, input + output):

| Model | $/M tokens |
|---|---|
| GPT-5 | $15.00 |
| Claude Opus 4 | $15.00 |
| Gemini 2 Pro | $7.00 |
| GPT-4.1 | $2.50 |
| Claude Sonnet | $3.00 |
| DeepSeek V3 | $0.27 |
| Llama 4 405B* | $0.50 |
| Qwen 2.5 72B* | $0.30 |

\* open weights, self-host cost only (no API)
An order of magnitude separates the top tier from the open-weight tier, and the open side is closing the quality gap.
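
To make the per-token math concrete, here's a rough sketch of the job-level arithmetic. The prices mirror the table above; the dictionary keys and the example job size are illustrative, not anyone's official price sheet.

```python
# Rough job-level cost math using the blended per-million-token prices above.
# Prices and the example job size are illustrative, not an official price sheet.
PRICE_PER_MTOK = {
    "gpt-5": 15.00,
    "claude-opus-4": 15.00,
    "gemini-2-pro": 7.00,
    "gpt-4.1": 2.50,
    "claude-sonnet": 3.00,
    "deepseek-v3": 0.27,
    "llama-4-405b": 0.50,   # open weights, self-host estimate
    "qwen-2.5-72b": 0.30,   # open weights, self-host estimate
}

def job_cost(model: str, total_tokens: int) -> float:
    """Dollar cost for a job of total_tokens (input + output) on a given model."""
    return PRICE_PER_MTOK[model] * total_tokens / 1_000_000

# A 50M-token monthly workload, priced across three tiers.
for model in ("gpt-5", "claude-sonnet", "deepseek-v3"):
    print(f"{model:>14}: ${job_cost(model, 50_000_000):,.2f}")
```

At 50M tokens a month, that's the difference between $750 on a premium-priced model and about $13.50 on the cheapest open-weight option, which is the order-of-magnitude gap in practice.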

Mid tier. GPT-4.1 and the smaller variants (mini, nano), Sonnet 3.7 (still around for cost-sensitive shops), Gemini 2.5 Flash. These are the defaults for cost-sensitive workloads where workhorse-tier capability is overkill. Honest read: the mid tier is increasingly underused because the workhorse tier got cheap enough to absorb most workloads. The mid tier hangs on for batch-heavy and high-volume jobs where the per-token math still matters.

Open-weights tier. DeepSeek V3-0324 at the cheap end, Llama 4 Scout and Maverick at the credible end, Qwen 2.5 / 3 variants for specific tasks. This tier got seriously competitive in the last six months, workhorse-tier capability at a fraction of the price if you're willing to run the inference yourself. Hosted open-weights via the various providers also expanded.

Specialty tier. OpenAI o3 / o4-mini for hard reasoning at moderate cost, image-and-video models (FLUX, Imagen, Veo, Runway), embedding models (Cohere, Voyage, OpenAI's text-embedding-3), code-specific models (Qwen 2.5-Coder). These are the workloads where a general workhorse model is the wrong tool and a specialty model wins.
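
Put together, the five tiers are really a routing table. A minimal sketch of what that looks like in code, with the workload labels and model identifiers as placeholders rather than any provider's actual API names:

```python
# Illustrative workload-to-tier routing table for the mid-2025 menu.
# Workload labels and model identifiers are placeholders, not provider API names.
ROUTES: dict[str, str] = {
    "hard-reasoning":    "claude-opus-4",     # premium reasoning tier
    "agent-planning":    "claude-opus-4",
    "coding":            "claude-sonnet-4",   # workhorse tier
    "general":           "gpt-5",
    "long-context":      "gemini-2.5-pro",
    "high-volume-batch": "deepseek-v3-0324",  # open-weights tier
    "classification":    "gemini-2.5-flash",  # mid tier
    "embeddings":        "text-embedding-3",  # specialty tier
    "image-generation":  "flux-1-schnell",
}

def pick_model(workload: str) -> str:
    """Route a workload label to a default model; fall back to a workhorse pick."""
    return ROUTES.get(workload, "gpt-5")
```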

The routing decisions that have shifted

A few specific routing calls that changed since January.

Default coding model. January: Sonnet 3.7 for Anthropic shops, GPT-4.1 for OpenAI shops. July: Sonnet 4 or GPT-5, both meaningfully better, no price increase. The Anthropic-or-OpenAI question still hinges more on what you already run than on capability.

Default agentic-loop planner. January: GPT-4.1 with extended thinking. July: Opus 4 for the hard ones, Sonnet 4 or GPT-5 for the rest, with the planner-executor split as the dominant pattern. Premium-tier for planning, workhorse-tier for execution is the split that survived the first two quarters.
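
A minimal sketch of that planner-executor split, assuming a generic call_model wrapper around whatever client you actually run; the function and the model names here are placeholders, not a specific provider API.

```python
# Planner-executor split: a premium model writes the plan, a workhorse model
# executes each step. call_model() is a stub for your own client wrapper.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider client here")

def run_task(task: str) -> list[str]:
    # Premium tier handles the one expensive planning call.
    plan = call_model("claude-opus-4", f"Break this task into numbered steps:\n{task}")
    steps = [line for line in plan.splitlines() if line.strip()]

    # Workhorse tier handles the many cheaper execution calls.
    return [call_model("claude-sonnet-4", f"Execute this step:\n{step}")
            for step in steps]
```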

Default open-weights pick for batch work. January: Llama 3.3 70B at INT4. July: DeepSeek V3-0324 hosted (~$0.30/M tokens) for the cheap-batch cases, Llama 4 Scout for the cases needing more recent training. The price floor dropped meaningfully.

Default for long-context retrieval. January: Gemini 1.5 Pro. July: Gemini 2.5 Pro for the long-context cases that need to be a single hosted call, locally-run hybrid for the cases that benefit from chunking plus smaller models. The single-call answer is still Gemini's; the where-this-is-going answer is more interesting.
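
For the chunk-and-retrieve hybrid, the shape is roughly the sketch below. The relevance scoring here is a naive word-overlap stand-in for a real embedding search, and call_model is again a placeholder for your own client; this shows the structure, not a production retriever.

```python
# Hybrid long-context retrieval: chunk locally, keep only the relevant chunks,
# and answer with a smaller model instead of one giant single-call context.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider client here")

def answer_over_corpus(question: str, corpus: str,
                       chunk_size: int = 2_000, k: int = 8) -> str:
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    q_words = set(question.lower().split())
    # Naive relevance score: shared-word count. A real system would use embeddings.
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    context = "\n---\n".join(ranked[:k])
    return call_model("gemini-2.5-flash",
                      f"Answer from the context only.\n\n{context}\n\nQ: {question}")
```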

Default for image generation. January: SDXL or Midjourney depending on use case. July: FLUX-1-schnell for the speed and cost cases, Imagen 4 for the quality cases, Midjourney for the aesthetic cases. The category fragmented further; the right answer is more workload-specific than it was.

Where I think the second half goes

A few predictions worth being plain about, with the usual caveat that these are guesses rather than bets, but worth saying out loud.

The workhorse tier consolidates further. Three competitors in the same price band is unstable; one or two will pull ahead on capability or cost and pressure the others. My bet is GPT-5 ends up the price-pressure leader, Sonnet 4 ends up the capability leader, Gemini 2.5 Pro stays the multimodal-and-context leader. Consolidation around two-of-three winners by year-end.

The premium tier widens its capability lead. Opus 4 is the only proper premium-tier release of the year. Whatever Anthropic ships next (Opus 4.5? Opus 5?) probably extends the gap. OpenAI's response is more likely the o-series than a new GPT-tier flagship. Whether that gap materializes is the most-watched question in the closed-frontier shops.

Open-weights closes another tranche of the gap. DeepSeek V4 (rumored) and Llama 4.5 / 5 (eventually) will narrow the closed-vs-open gap further. The cheap-tier inference market keeps compressing.

Specialty fragmentation continues. Image, video, embedding, code, each of these is a separate market with separate winners. General-purpose model plus specialty-model routing is the pattern that's emerging.

The decision shape that survives

Here's the thing the leaderboard format hides: the right model choice is rarely "the best model." It's "the best fit for this workload, this stack, this cost budget, this latency budget, this team's expertise." A team running on AWS with a Bedrock contract should usually default to whatever Bedrock exposes well, even if a slightly better model exists elsewhere. A team with strong MLX skills should weight Apple-Silicon-friendly options. A shop with a privacy mandate should weight what fits in their hosting envelope.
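
One way to keep that honest is to score candidates on fit rather than on capability alone. The factors and weights below are illustrative, not a benchmark; the point is that stack fit and team expertise get real weight next to raw capability.

```python
# Illustrative fit score: raw capability is only one factor among several.
# All numbers are placeholders for a team's own judgment, not measured data.
WEIGHTS = {"capability": 0.3, "stack_fit": 0.3, "cost": 0.2,
           "latency": 0.1, "team_expertise": 0.1}

def fit_score(candidate: dict[str, float]) -> float:
    """Each factor scored 0-1 by the team; returns the weighted total."""
    return sum(weight * candidate.get(factor, 0.0)
               for factor, weight in WEIGHTS.items())

# A slightly weaker model your stack already exposes can beat the best-on-paper pick.
bedrock_default = {"capability": 0.80, "stack_fit": 1.0, "cost": 0.7,
                   "latency": 0.8, "team_expertise": 0.9}
best_on_paper   = {"capability": 0.95, "stack_fit": 0.4, "cost": 0.6,
                   "latency": 0.7, "team_expertise": 0.5}
print(fit_score(bedrock_default), fit_score(best_on_paper))  # ~0.85 vs ~0.645
```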

The leaderboard is a useful starting point for the conversation. The decision is downstream of the conversation. Halfway through 2025, the conversation has more good answers than it had in January, and the right way to use the leaderboard is as a menu rather than as a ranking.