Bedrock model selection: pick on evidence, not vibes

Sonnet, Haiku, Opus, Llama. Picking the right Bedrock model per use case using evals, not gut feel, and knowing when to switch.

Sid Smith

27 May 2026 • 7 min read

The 18-article MVP series wrapped last week with a piece about what I'd cut and what I'd keep. This one starts the year-one series, the part where the MVP is alive, customers are using it, and the questions stop being "what do I build" and start being "what do I run, and how do I make it cheaper without making it worse."

First question, every time: which model.


If you've shipped anything on Bedrock, you already know the trap. There's a default in your code somewhere, anthropic.claude-sonnet-4-something, and it stays there for six months because nobody wanted to touch it. Then your bill triples, or a competitor ships something faster, or you read a benchmark that makes Haiku look like a steal, and you panic-swap to a different model and break a corner case nobody had test coverage for.

This piece is about not doing that. Pick on evidence. Switch on evidence. The evidence is the eval harness from piece #11, and the picking is one decision per loop, not one decision per app.

What "the right model" actually means

Three knobs. Cost, quality, latency. You get to pick two and the third comes along for the ride. That's the whole story.

What changes per use case is which two matter.

Routing the inbound query to the right pipeline? Latency and cost. Quality is a binary (did it pick the right bucket or not) and the buckets are coarse. You can do this with a model the size of a postage stamp.

Diagnosing what a customer actually needs help with, given their context and the consultant's body of work? Quality. Quality. Quality. Latency is fine in the 3-5 second band because the customer is already waiting. Cost matters but not on the same axis.

Then there are the hard-edge cases: the contract clause that's almost-but-not-quite the standard, the medical-second-opinion query where the symptom set is unusual, the financial diagnosis where the portfolio doesn't fit any of the standard patterns. Here quality is everything, and you're willing to pay 10x per call because the case happens 1% of the time but it's the 1% the consultant put their name on.

So the model picker isn't "what's the best model." It's "what's the right model for this loop."

The four-way split I actually use

Bedrock gives you a menu. I run four models concurrently and route between them.

Haiku, for triage and routing. Inbound query comes in, Haiku decides which pipeline it belongs in. Sales-discovery prompt or onboarding-fit prompt? IT-ops triage or feature request? Marketing-positioning question or copy-edit request? It's a classifier dressed up as a chat model. Latency is sub-second, cost is rounding error, quality is high enough on coarse buckets that I trust it.

This is the triage loop from piece #9. Haiku is what makes that loop cheap enough to run on every inbound message instead of every fifth one.

Sonnet, for diagnosis. This is the workhorse. The query, the retrieved consultant context (from RAG, which is retrieval-augmented generation if you want to look it up later), the persona shape, the conversation history. Sonnet pulls it together and writes the answer. Or, more often, drafts the answer and sends it to the consultant for approval. 80%+ of my Bedrock spend is here.

Opus, for the hard ones. Two ways into Opus. First, Sonnet flags low confidence and the router hands the query up. Second, the case carries a tag ("high-stakes" or "novel" or "consultant-flagged-for-quality") and goes straight to Opus regardless. A legal-pro tenant doing contract review against their playbook routes 5-8% of clauses to Opus because that's the band where the playbook doesn't quite cover it and the consultant wants the model to think harder.

Llama on Bedrock, for cost-sensitive batch: summarization, re-embedding the corpus when chunking changes, generating eval candidates. All of that runs fine overnight on the Mac Studio side, but sometimes the Mac Studio is busy fine-tuning and I want it in the cloud. Llama 3.x or whatever's current on Bedrock at the time. Quality is good enough for the work, and the per-token price is meaningfully lower than Sonnet.

That's the spread. Haiku at the door, Sonnet for the bulk, Opus for the corners, Llama for the back office.
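Here's roughly what that spread looks like as a routing function. It's a minimal sketch, not the production router: the `Query` shape, the loop names, the `sonnet_confidence` field, and the 0.6 threshold are all assumptions made up for illustration. The three escalation tags are the ones named above, and the tier names are placeholders rather than real Bedrock model IDs.

```python
from dataclasses import dataclass, field
from typing import Optional

# Tags that send a case straight to Opus (the three named in the text).
ESCALATION_TAGS = {"high-stakes", "novel", "consultant-flagged-for-quality"}

# Hypothetical threshold: below this, a Sonnet answer gets handed up to Opus.
LOW_CONFIDENCE = 0.6


@dataclass
class Query:
    text: str
    loop: str  # assumed loop names: "triage", "diagnose", "batch"
    tags: set = field(default_factory=set)
    sonnet_confidence: Optional[float] = None  # filled in after a Sonnet attempt


def pick_tier(q: Query) -> str:
    """Return a tier name for one query (placeholder names, not model IDs)."""
    if q.loop == "triage":
        return "haiku"
    if q.loop == "batch":
        return "llama"
    # Diagnose loop: Sonnet by default, Opus on a tag or on low confidence.
    if q.tags & ESCALATION_TAGS:
        return "opus"
    if q.sonnet_confidence is not None and q.sonnet_confidence < LOW_CONFIDENCE:
        return "opus"
    return "sonnet"
```

The point of keeping this as one small function is that the decision is made per query and per loop, so changing a tier means touching one branch rather than the whole app.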

How I actually pick, not by reading benchmarks

Here's the part nobody wants to hear. Public benchmarks are useful for narrowing the field. They are useless for the final pick.

Benchmark says Model X beats Model Y by 4 points on MMLU. Cool. My consultant's body of work is none of MMLU. The only thing that tells me whether Model X is right for a portfolio-diagnosis prompt against this financial-advisor's corpus is running both models against my eval set and looking at the pass rate.

The eval harness from piece #11 is the unlock. Golden examples, structured grading rubric, regression detection. Per-model scorecard. When I'm picking between Sonnet and Opus for the diagnose loop, I run the same 200-example set through both, score the outputs, look at the gap, look at the cost-per-pass.

Three numbers come out. Pass rate. Median latency. Cost per query. I write them in a tiny markdown table per loop and I keep that table in the repo. When somebody asks why we're on Sonnet not Opus for the marketing-positioning pipeline, I point at the table.
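Those three numbers, plus cost-per-pass, fall out of a few lines of code once the harness has graded each example. A minimal sketch, assuming each graded example is reduced to a pass flag, a latency, and a cost; the `EvalResult` shape and both function names are hypothetical, not the harness's actual interface.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EvalResult:
    passed: bool
    latency_s: float
    cost_usd: float


def scorecard(results):
    """Reduce one model's eval run to pass rate, median latency, cost per query."""
    if not results:
        raise ValueError("empty eval run")
    n = len(results)
    passes = sum(r.passed for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "pass_rate": passes / n,
        "median_latency_s": median(r.latency_s for r in results),
        "cost_per_query": total_cost / n,
        # Cost per pass: what each answer that cleared the rubric really cost.
        "cost_per_pass": total_cost / passes if passes else float("inf"),
    }


def markdown_table(loop, cards):
    """Render {model_name: scorecard} as a small per-loop markdown table."""
    lines = [
        f"| {loop} | pass rate | median latency | cost/query | cost/pass |",
        "|---|---|---|---|---|",
    ]
    for model, c in cards.items():
        lines.append(
            f"| {model} | {c['pass_rate']:.1%} | {c['median_latency_s']:.2f}s "
            f"| ${c['cost_per_query']:.4f} | ${c['cost_per_pass']:.4f} |"
        )
    return "\n".join(lines)
```

Run `scorecard` once per model on the same example set, pass the results to `markdown_table`, and commit the output next to the prompt it describes.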

Want to go deeper on the harness mechanics? The eval setup itself is in "The eval harness, how you know it's working," and the prompt-versioning discipline that lets you compare apples to apples is in "Prompts as code."

The cost/quality/latency curve, in numbers I've actually seen

Rough shape, your mileage will vary, do your own evals, but for a triage-diagnose-resolve product running on a consultant's body of work, the numbers I've seen come out something like this.

Haiku for triage: ~300ms median, fractions of a cent per call, 96-98% bucket accuracy on coarse intent classification once you've tuned the prompt. Cheap, fast, good enough.

Sonnet for diagnose: ~2-3s median, low single-digit cents per call (depending on how much retrieved context you cram in, and you'll cram in more than you think), 88-92% pass rate on a well-graded eval set against the consultant's corpus. The number that pays the bills.

Opus for hard cases: 5-8s median, 5-10x the per-call cost of Sonnet, but on the hard-case subset (where Sonnet was struggling at ~88%) the pass rate jumps to ~96%. That gap is the reason Opus exists in your pipeline.

Llama on Bedrock for batch: latency doesn't matter because it's batch, cost is meaningfully under Sonnet, quality on the back-office tasks (summarization, eval generation, re-chunking) is fine.

The thing I want you to internalize: the difference between Sonnet and Opus on the easy 80% of cases is small enough that paying 10x for it is wasteful. The difference on the hard 5-10% is huge. So you route by case, not by app.
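The arithmetic behind that is short enough to write down. A rough sketch in units of one Sonnet call, using the 10x multiple and the 5-10% hard-case share from the text; the function name and the `sonnet_first` flag (for escalations that already paid for a Sonnet attempt) are assumptions added for illustration.

```python
def blended_cost(opus_share, opus_multiple=10.0, sonnet_first=False):
    """Average cost per query, measured in units of one Sonnet call.

    opus_share: fraction of queries routed to Opus.
    opus_multiple: Opus cost as a multiple of Sonnet (5-10x in the text).
    sonnet_first: True if escalated queries also paid for a Sonnet attempt.
    """
    sonnet_part = 1.0 if sonnet_first else (1.0 - opus_share)
    return sonnet_part + opus_share * opus_multiple
```

At a 5% Opus share and a 10x multiple the blend comes out around 1.45x an all-Sonnet bill, and about 1.9x at a 10% share. Sending everything to Opus would be 10x.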

Picking by use case, three quick verticals

A sales consultant running discovery-call prep. Haiku triages the inbound: prep request vs. follow-up vs. objection-handling. Sonnet diagnoses: pulls the prospect's company context, the consultant's framework, the prior call notes, drafts the prep brief. Opus rarely fires here unless the deal is flagged as strategic. Most of the spend is Sonnet, latency tolerance is generous because the consultant is reading the brief asynchronously.

An IT-ops consultant doing infrastructure triage. Haiku routes by symptom class. Sonnet diagnoses against the consultant's runbook corpus and the customer's ticket history. Opus fires when the symptom set doesn't match a known runbook. That's the "this is novel, think harder" path. Cost-sensitive because the volume is high; Llama runs the overnight pattern-mining of resolved tickets to find new auto-resolve candidates.

A career coach doing resume + positioning review. Haiku triages: full resume review vs. single-section edit vs. positioning question. Sonnet handles the bulk. Opus fires when the candidate's background is non-standard and the coach has flagged the case for extra care. Latency is generous, quality is everything because the coach's name goes on the output.

Same architecture spine. Different routing thresholds per vertical. The picker is configuration, not code rewrite.
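"Configuration, not code rewrite" might look something like the sketch below. Every key, tag name, and threshold here is a made-up placeholder; in practice the values would come out of each vertical's eval table.

```python
# Hypothetical per-vertical routing config. All numbers are placeholders.
ROUTING = {
    "sales-discovery": {"opus_below": 0.50, "opus_tags": {"strategic"}},
    "it-ops": {"opus_below": 0.65, "opus_tags": {"no-runbook-match"}},
    "career-coaching": {"opus_below": 0.60, "opus_tags": {"coach-flagged"}},
}


def diagnose_tier(vertical, confidence, tags=frozenset()):
    """One code path for every vertical; only the config row differs."""
    cfg = ROUTING[vertical]
    if set(tags) & cfg["opus_tags"] or confidence < cfg["opus_below"]:
        return "opus"
    return "sonnet"
```

With these placeholder rows, the same 0.62-confidence answer stays on Sonnet for sales discovery and escalates to Opus for IT-ops, with no change to the function.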

When to switch

Three triggers. Just three.

Eval scores drift. You re-run the eval set after a prompt change or a model update and the pass rate moves. If it dropped on Sonnet, maybe Opus is now the right pick for that loop. If it climbed on Haiku, maybe you can demote work down a tier. Re-running evals on a schedule (I do it weekly during active development, monthly after) is what makes this trigger fire when it should.

Cost shape changes. Anthropic ships a new Sonnet, the per-token price drops, the new model beats your current pick on your eval set, you switch. Or your usage shape moves and the model that was cheap at 10k queries/day is no longer cheap at 100k. The cost-model piece, #15, is the place you watch this from.

Customer complaint pattern. This is the one that doesn't show up in evals. Customers report the same kind of bad answer over and over. You go look. Often it's a class your eval set didn't have. Add it to the eval set, re-grade across models, switch if the data says switch. The complaint becomes a permanent test.

What's not on the list: a benchmark blog post made you feel behind, a competitor announced something, your CTO wants to "move to the new thing." Those are signals to test, not signals to switch. Run the eval. Look at the table. Then decide.
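The first trigger, eval drift, is the easiest one to check mechanically. A minimal sketch, assuming you keep per-model pass rates from the last run and the current one as plain dicts; the function name and the 2-point tolerance are assumptions, not values from the harness.

```python
def drift(baseline, current, tolerance=0.02):
    """Compare per-model pass rates between two eval runs.

    Returns {model: delta} for each model whose pass rate moved by more than
    `tolerance` in either direction. The default tolerance is an assumption.
    """
    flagged = {}
    for model, old in baseline.items():
        if model not in current:
            continue  # model wasn't in this run; nothing to compare
        delta = current[model] - old
        if abs(delta) > tolerance:
            flagged[model] = round(delta, 4)
    return flagged
```

A flagged model is a reason to re-open that loop's table, not an automatic switch. A drop suggests testing the tier above; a climb suggests testing whether work can move down a tier.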

The default I ship with

For anyone starting from the architecture in the MVP series and trying to figure out where to begin: ship with Haiku for triage, Sonnet for diagnose, route the lowest-confidence 5% to Opus, and put Llama on whatever batch work the Mac Studio doesn't pick up. That's the default. It's not the right answer for your product. It's the right starting answer.
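"The lowest-confidence 5%" needs a concrete cutoff before a router can use it. One way to get it, sketched under the assumption that you log a confidence score per Sonnet answer (the function name and the score itself are hypothetical):

```python
def confidence_cutoff(confidences, share=0.05):
    """Return the score at or below which roughly `share` of queries fall."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    ranked = sorted(confidences)
    k = max(1, round(len(ranked) * share))  # how many queries to escalate
    return ranked[k - 1]
```

Recompute the cutoff from a recent window of traffic and route anything at or below it to Opus. Ties at the cutoff can push the escalated share slightly above 5%.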

Then build the eval harness. Then the data tells you what to change.

If you're running this and only get to do one thing this week, do this: pick the loop that costs you the most per month, run it through three models with the same 100-example eval set, and put the numbers in a table. The picker decision after that writes itself.

The next piece in this series is about the other end of the year-one problem: when a brand-new consultant signs up, how do you get them from "I have secret sauce on a shared drive" to "my AI surface is live and answering questions" in five minutes instead of five days.