The eval harness: how you know it's working before customers tell you

Test sets, golden examples, regression detection. The eval harness is how you find out the AI broke before a customer does, and it has to be a permanent part of the stack, not a side project.

Sid Smith

19 May 2026 • 7 min read

The first time I shipped a prompt change without an eval suite, I found out it broke about forty hours later, when a paying customer sent a screenshot of an answer that was confidently, articulately wrong. The fix was a one-line tweak. The damage was already done. The customer still uses the product. They also still bring it up.

In an AI product, "is it working?" doesn't have an obvious answer. The thing runs. It returns something. It looks plausible. The output is sentences, not stack traces, so your eyes slide right past the bug. Without a deliberate way to check, the only feedback loop you have is the customer's annoyance, and by the time that loop closes, you've already shipped the regression to everyone.

[Figure: Eval matrix]

The deliberate way to check is called an eval harness. It's a small, boring piece of infrastructure that sits next to your AI product and answers one question on demand: given a known input, did the AI give a known-acceptable output? The whole game is having that question answerable in seconds, every time anyone changes anything.

Layman version. You hire a contract reviewer. On day one, you don't just hand them every incoming agreement and hope. You assemble a folder of agreements where you already know the right answer (this one has a bad indemnity, this one has a sneaky auto-renewal, this one is fine to sign) and you give them the folder before they touch anything live. If they spot the issues, they get the keys. If they miss things, you find out now, not three months in when a customer signs something they shouldn't have. The folder is the eval set. The check is the eval run. Everything else here is the engineering version of that idea.

What "working" actually means

The harness lives or dies by what you decide "correct" means. For a deterministic system, this is easy: the function returns 4 or it doesn't. For an AI system, correct is a spectrum, and you have to pin it down before you write a single test case.

For a legal pro running a contract-review product, correct probably means: the model flagged every clause my playbook says should be flagged, didn't fabricate clauses that aren't there, classified each flag at the right severity, and gave a one-line reason a human can verify in fifteen seconds. Four checks per case. Each one is a column in the eval matrix.

For a financial advisor running a portfolio-diagnosis product, correct might mean: the model identified concentration risk above my threshold, didn't recommend a specific security (the playbook says don't), used the right tone, and produced a structured output the report generator can consume. Different four checks. Same shape.
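One way to picture that shape: each golden case is a row and each check is a column. The following is a minimal Python sketch, assuming the four contract-review criteria above; the column names and the `CaseResult` type are hypothetical, made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical column names, paraphrasing the contract-review criteria.
CHECKS = (
    "flagged_all_playbook_clauses",
    "no_fabricated_clauses",
    "severity_correct",
    "reason_verifiable",
)


@dataclass
class CaseResult:
    case_id: str
    checks: dict[str, bool]  # one entry per column in CHECKS

    @property
    def passed(self) -> bool:
        # A case passes only when every column passes.
        return all(self.checks[name] for name in CHECKS)


def column_pass_rates(results: list[CaseResult]) -> dict[str, float]:
    """Pass rate per check, so you can see which criterion is failing."""
    return {
        name: sum(r.checks[name] for r in results) / len(results)
        for name in CHECKS
    }


# Two made-up rows: one clean pass, one with a wrong severity.
results = [
    CaseResult("nda-001", dict.fromkeys(CHECKS, True)),
    CaseResult("msa-002", {**dict.fromkeys(CHECKS, True), "severity_correct": False}),
]
rates = column_pass_rates(results)
```

Reading the matrix by column rather than by row is the useful part: a single overall pass rate hides which criterion is slipping.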

Notice what's not on either list: "the answer sounds nice." Sounding nice is not a test. The whole reason we need a harness is that sounding nice is the default failure mode of a language model: it produces a plausible, fluent, wrong answer and we nod along. The eval has to test the thing the consultant cares about, in the consultant's words, not the model's.

Write the criteria first. What would a great answer look like? A bad one? Which differences are acceptable variation, which are bugs? You can't write a useful eval if you can't answer those three out loud. Most teams skip this step and end up with an eval that grades on style.

Golden examples, the boring core asset

The heart of a harness is a set of golden examples: input/expected-output pairs the consultant has personally signed off on. Cases where you've stared at the answer and said "yes, that one. That's right."

Golden examples are not generated, not crowd-sourced, not pulled from production logs without review. They are the consultant's own judgment, captured. The product is, in the end, a bet that the model can produce answers the consultant would endorse. The eval set is the formal version of that bet.

Start with twenty to fifty. Twenty is the floor; below that, the regressions you catch are luck. Fifty is comfortable. A hundred is the ceiling for the manual phase; past that, you should be pulling examples semi-automatically from production with a human review step.

The shape of each example is boring on purpose: an input, the expected behavior (a checklist, not a fixed string; you almost never want exact-match for AI outputs), and a category tag. Store them in your repo, treat them like code, and review changes the same way.
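As a rough illustration of that shape, here is a sketch that stores one example per line as JSON and loads it into a small record type. The field names (`case_id`, `category`, `input_text`, `expected`) and the checklist keys are assumptions for the example, not a fixed format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    case_id: str
    category: str      # the category tag
    input_text: str    # what gets sent to the model
    expected: dict     # a checklist of expected behavior, not an exact string


def load_examples(jsonl_text: str) -> list[GoldenExample]:
    """Parse one JSON object per non-blank line."""
    return [
        GoldenExample(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Made-up file contents; in a repo this would live next to the prompts.
SAMPLE = """
{"case_id": "nda-001", "category": "indemnity", "input_text": "Vendor shall indemnify Customer against all claims.", "expected": {"must_flag": ["indemnity"], "must_not_flag": ["auto_renewal"], "severity": "high"}}
{"case_id": "msa-002", "category": "clean", "input_text": "Either party may terminate with 30 days notice.", "expected": {"must_flag": [], "must_not_flag": ["indemnity"], "severity": "none"}}
"""
examples = load_examples(SAMPLE)
```

One line per case keeps diffs readable in review, which is most of the point of treating the set like code.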

Want to go deeper on why prompts and golden examples both belong in git? See "prompts as code": the same discipline that catches a bad prompt change catches a bad eval-set change.

How the eval actually runs

The harness is a small program. It loads the golden examples, runs each input through the current model and prompt, applies the per-case checks, and produces a pass/fail-with-reasons report. None of this is exotic. The first version I shipped was a Python script and a CSV. The current one is a Lambda that pulls examples from S3, runs them through Bedrock, writes results to Postgres, and posts a summary to Slack. Same shape, more plumbing.
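The load, run, check, report loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Lambda version: `run_model` is a stand-in you would replace with the real model call, and the example dictionaries and check names are hypothetical.

```python
from typing import Callable

# (model output, expected checklist) -> did the check pass?
Check = Callable[[str, dict], bool]


def run_suite(
    examples: list[dict],
    run_model: Callable[[str], str],
    checks: dict[str, Check],
) -> list[dict]:
    """Run every golden example through the model and apply each check."""
    report = []
    for ex in examples:
        output = run_model(ex["input"])
        failures = [
            name for name, check in checks.items()
            if not check(output, ex["expected"])
        ]
        # "failures" is the reasons half of pass/fail-with-reasons.
        report.append(
            {"case_id": ex["case_id"], "passed": not failures, "failures": failures}
        )
    return report


def fake_model(text: str) -> str:
    # Stand-in only; a real harness calls the production model and prompt.
    return "FLAG: indemnity" if "indemnify" in text else "OK"


checks = {
    "flags_expected_clause": lambda out, exp: all(c in out for c in exp["must_flag"]),
    "no_unexpected_flag": lambda out, exp: not any(c in out for c in exp["must_not_flag"]),
}
examples = [
    {"case_id": "nda-001", "input": "Vendor shall indemnify Customer.",
     "expected": {"must_flag": ["indemnity"], "must_not_flag": ["auto_renewal"]}},
    {"case_id": "msa-002", "input": "Renews automatically each year.",
     "expected": {"must_flag": ["auto_renewal"], "must_not_flag": []}},
]
report = run_suite(examples, fake_model, checks)
```

Passing the model in as a function keeps the loop testable without network calls, and makes it harder to accidentally evaluate a different code path than the one you meant to.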

The checks come in three flavors, and you need all three.

Programmatic. Did the output contain the required structured field? Was the JSON valid? Did the classification land in one of the allowed enum values? Cheap, fast, and catches the most embarrassing failures: malformed output that breaks the rest of the product downstream.

Model-graded. You take the model's answer and ask another model (usually a stronger one) to grade it against a rubric. This is sometimes called LLM-as-judge, if you want to look it up later. Noisier than programmatic checks, and you have to validate the judge against your own opinion before you trust it, but it scales to checks you can't write as a regex.

Human spot-check. Five to ten percent of eval results get sampled and reviewed by the consultant whose judgment the product is supposed to encode. The checks themselves drift, and the only way to catch the drift is for the human to occasionally look at what the harness is calling correct and either nod or scowl.
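To make the three flavors concrete, here is one small function per flavor. It is a sketch under assumptions: the required fields, the severity enum, and the judge prompt wording are invented for the example, and `call_judge` is a placeholder for whatever client actually calls the judge model.

```python
import json
import random
from typing import Callable

REQUIRED_FIELDS = ("clause", "severity", "reason")  # hypothetical schema
ALLOWED_SEVERITIES = ("low", "medium", "high")      # hypothetical enum


def programmatic_problems(raw_output: str) -> list[str]:
    """Flavor 1: cheap structural checks on the raw model output."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    if "severity" in data and data["severity"] not in ALLOWED_SEVERITIES:
        problems.append(f"severity not in allowed values: {data['severity']!r}")
    return problems


def model_graded(
    answer: str, rubric: list[str], call_judge: Callable[[str], str]
) -> tuple[bool, str]:
    """Flavor 2: ask a judge model for PASS/FAIL plus a one-line reason."""
    criteria = "\n".join(f"- {item}" for item in rubric)
    prompt = (
        "Grade the answer against every criterion.\n"
        "Reply PASS or FAIL on the first line, then one line of reasoning.\n\n"
        f"Criteria:\n{criteria}\n\nAnswer:\n{answer}\n"
    )
    reply = call_judge(prompt).strip()
    verdict, _, reason = reply.partition("\n")
    verdict = verdict.strip().upper()
    if verdict not in ("PASS", "FAIL"):
        # Anything unparseable counts as a failure so a human looks at it.
        return False, f"unparseable judge reply: {reply[:80]!r}"
    return verdict == "PASS", reason.strip()


def spot_check_sample(result_ids: list[str], percent: int = 5, seed=None) -> list[str]:
    """Flavor 3: pick roughly `percent` of results for human review."""
    if not result_ids:
        return []
    k = max(1, (len(result_ids) * percent + 99) // 100)  # round up, at least one
    return sorted(random.Random(seed).sample(result_ids, k))
```

The judge parser fails closed on purpose: a reply it cannot read is treated as a failed case rather than silently skipped.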

Every prompt change triggers an eval run

Every change to a prompt, a model version, the retrieval corpus, or the chunking strategy is a change to the AI's behavior, and all of them get a full eval run before merge. No exceptions. No "small tweaks."

The mechanism is the same as any decent CI pipeline. The PR opens. The harness runs the full golden set against the proposed change, compares to baseline, and posts a comment: "47 cases run. 45 pass on baseline. 46 pass on this branch. Net improvement: +1." Or, more often than I'd like: "Net regression: -1. Failed cases attached." If it's the second one, you don't merge until you understand why and either fix it or write down a deliberate decision that the trade-off is acceptable.
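The baseline comparison behind that comment might look like the sketch below. The function names and the dictionary layout (case id mapped to pass/fail) are assumptions for illustration. Note that it tracks newly failing cases separately from the net number, because one fix and one regression net out to zero.

```python
def compare(baseline: dict[str, bool], branch: dict[str, bool]) -> dict:
    """Diff two runs of the same golden set, keyed by case id."""
    regressions = sorted(
        c for c, ok in baseline.items() if ok and not branch.get(c, False)
    )
    fixes = sorted(
        c for c, ok in branch.items() if ok and not baseline.get(c, False)
    )
    return {
        "cases": len(branch),
        "baseline_pass": sum(baseline.values()),
        "branch_pass": sum(branch.values()),
        "regressions": regressions,
        "fixes": fixes,
    }


def format_comment(diff: dict) -> str:
    """Render the summary a PR comment would carry."""
    net = diff["branch_pass"] - diff["baseline_pass"]
    label = "improvement" if net > 0 else "regression" if net < 0 else "change"
    lines = [
        f"{diff['cases']} cases run. {diff['baseline_pass']} pass on baseline. "
        f"{diff['branch_pass']} pass on this branch. Net {label}: {net:+d}."
    ]
    if diff["regressions"]:
        lines.append("Newly failing: " + ", ".join(diff["regressions"]))
    return "\n".join(lines)


# Made-up runs: baseline fails c0 and c1; the branch fixes c1.
baseline = {f"c{i}": i not in (0, 1) for i in range(47)}
branch = {f"c{i}": i != 0 for i in range(47)}
comment = format_comment(compare(baseline, branch))

# Same pass count as baseline, but a different case broke.
swapped = {f"c{i}": i not in (0, 2) for i in range(47)}
swapped_diff = compare(baseline, swapped)
```

Gating the merge on the `regressions` list rather than on the net figure is the safer choice, since the `swapped` run above shows a net change of zero with a real regression inside it.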

The deliberate-decision path matters. Sometimes the regression is an old golden example that was actually wrong. Fine, but update the example with a documented reason, don't ignore the regression. The audit trail is the whole point.

For the ops-minded reader. Every eval run costs money: model calls, judge calls, infrastructure. A 100-case suite on every PR is real spend. I track it with a CloudWatch metric and a monthly cap. If it gets too expensive, thin the cases, or run a sampled version on PR and the full suite nightly.
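The sampled-on-PR, full-suite-nightly split is simple arithmetic once you have a per-case cost. A sketch, where every figure is invented for illustration and costs are kept in integer cents to avoid float rounding:

```python
def cases_per_pr_run(
    total_cases: int,
    cost_per_case_cents: int,
    pr_runs_per_month: int,
    nightly_runs_per_month: int,
    monthly_cap_cents: int,
) -> int:
    """How many cases each PR run can afford after paying for nightly full runs."""
    nightly_spend = nightly_runs_per_month * total_cases * cost_per_case_cents
    remaining = monthly_cap_cents - nightly_spend
    if remaining <= 0:
        return 0  # the nightly suite alone exceeds the cap
    if pr_runs_per_month == 0:
        return total_cases
    affordable = remaining // (pr_runs_per_month * cost_per_case_cents)
    return min(total_cases, affordable)


# Hypothetical numbers: 100 cases at 4 cents each, 120 PR runs and
# 30 nightly runs a month, under a $300 cap.
pr_sample_size = cases_per_pr_run(100, 4, 120, 30, 30_000)
```

With those made-up inputs the nightly runs cost $120, leaving $180 for PRs, which covers 37 cases per PR run.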

Regression detection vs absolute quality

There are two questions an eval can answer, and people confuse them.

Regression detection: did this change make things worse? The easy one. You don't need the absolute number to be high, just to not get lower without you knowing.

Absolute quality: is the AI good enough to ship? Harder, because it requires a definition that lives outside the eval set. The set is, by construction, things you've seen before. Real customers send things you haven't. A 95% pass rate means almost nothing about a brand-new question from a brand-new customer.

Use the harness for regression detection; that's its job. For absolute quality, pull a random sample of production interactions weekly, have the consultant review them, and watch the score over time. When the production-sample score and the eval-suite score diverge, your eval set has gone stale and needs new cases.
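Spotting that divergence can be as simple as comparing the two weekly scores against a gap threshold. A minimal sketch; the ten-point threshold and the scores are made up, and the right gap is a judgment call.

```python
def stale_weeks(
    suite_scores: list[float],
    production_scores: list[float],
    max_gap: float = 0.10,  # hypothetical threshold
) -> list[int]:
    """Return 1-based week numbers where the two scores differ by more than max_gap."""
    return [
        week
        for week, (suite, prod) in enumerate(
            zip(suite_scores, production_scores), start=1
        )
        if abs(suite - prod) > max_gap
    ]


# Made-up weekly pass rates: the suite stays flat while production slides.
flagged = stale_weeks([0.95, 0.96, 0.95], [0.93, 0.90, 0.78])
```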

The eval set is a living artifact

Every customer-facing failure becomes a new golden example. Always.

A customer reports an answer they didn't like. You investigate. You decide whether the answer was actually wrong or just unwelcome. If it was wrong, you fix the prompt or the retrieval, but before you fix it, you add the case to the golden set with the expected-correct behavior. Now the fix has to make this case pass, and the case stays in the suite forever. Same pattern as a regression test in a normal codebase. The compounding payoff is enormous.

I now think of the golden set as one of the three or four most valuable assets a small AI product owns. The model is rented. The prompt is editable. But the eval set is a record of what the consultant has actually said is correct across the product's lifetime. The closest thing to institutional memory the product has.

The parts that will bite you

The eval set will rot. Examples become obsolete as the product changes scope or the model gets so good at a category that those cases stop catching anything. Audit the set quarterly. Drop dead weight. Add cases the production sample suggests.
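One input to that quarterly audit is a list of cases that never fail any more. The sketch below flags candidates only; whether to drop one is still a human call. The history layout (case id mapped to pass/fail per run, oldest first) and the 20-run window are assumptions.

```python
def always_passing(history: dict[str, list[bool]], min_runs: int = 20) -> list[str]:
    """Case ids that passed in each of their last `min_runs` runs."""
    return sorted(
        case_id
        for case_id, runs in history.items()
        if len(runs) >= min_runs and all(runs[-min_runs:])
    )


# Made-up history: one case that always passes, one that failed recently,
# and one that is too new to judge.
history = {
    "nda-001": [True] * 30,
    "msa-002": [True] * 25 + [False] + [True] * 4,
    "sow-003": [True] * 5,
}
candidates = always_passing(history)
```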

The judge model has to be evaluated too. LLM-as-judge has its own failure modes: sycophancy, grading on style, drift when the underlying model updates. Keep a mini-eval-of-the-judge: twenty cases you've already scored yourself, run them through the judge, confirm it agrees before trusting it on the wider set.
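The mini-eval-of-the-judge boils down to an agreement rate plus the list of cases where the judge and the human disagree. A sketch, assuming both sets of verdicts are stored as case id mapped to pass/fail; the function name is hypothetical.

```python
def judge_agreement(
    human: dict[str, bool], judge: dict[str, bool]
) -> tuple[float, list[str]]:
    """Fraction of shared cases where the judge matches the human, plus the misses."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no cases scored by both human and judge")
    disagreements = [c for c in shared if human[c] != judge[c]]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Made-up verdicts: twenty human-scored cases, the judge is wrong on two.
human = {f"j{i:02d}": i % 4 != 0 for i in range(20)}
judge = {**human, "j03": not human["j03"], "j08": not human["j08"]}
rate, misses = judge_agreement(human, judge)
```

The disagreement list matters more than the rate: reading the two or three cases the judge got wrong tells you whether it is grading on style.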

The harness has to fail loudly. A silent eval run is worse than no eval run: it gives you the comforting appearance of safety without any of the actual safety. The summary has to land somewhere a human will see it, and it should be impossible to merge a PR with a regression without acknowledging it in writing.
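In CI terms, failing loudly usually means a non-zero exit code unless someone has written down why the regression is acceptable. A sketch of that gate; how the acknowledgement text reaches the function (a PR label, a file in the branch, a comment) is left open, and the names are hypothetical.

```python
def merge_gate(regressions: list[str], acknowledgement: str = "") -> tuple[int, str]:
    """Return (exit_code, message); exit code 1 blocks the merge."""
    if not regressions:
        return 0, "eval: no regressions"
    note = acknowledgement.strip()
    if note:
        # Allowed through, but the written reason goes into the log.
        return 0, f"eval: {len(regressions)} regression(s) accepted in writing: {note}"
    return 1, "eval: BLOCKED, newly failing cases: " + ", ".join(regressions)


blocked_code, blocked_msg = merge_gate(["nda-001", "msa-002"])
accepted_code, accepted_msg = merge_gate(
    ["nda-001"], "golden example was wrong; expected severity updated"
)
```

A CI step would pass the returned code to `sys.exit` and post the message wherever the team actually reads.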

And the harness has to run on production-shaped infrastructure. If your eval calls a different model version, a different prompt loader, a different retrieval path than production does, you're testing a different product. The green tick lies to you.
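One cheap guard against testing a different product is to fingerprint the configuration on both sides and refuse to trust a run when the fingerprints differ. A sketch, assuming the relevant settings can be gathered into a plain dictionary; the keys and values shown are placeholders, not real identifiers.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable short hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values; a real config would be read from the deployed service.
production = {
    "model_id": "example-model-v2",
    "prompt_sha": "abc123",
    "retrieval": {"top_k": 5, "chunk_size": 800},
}
eval_same = {
    "retrieval": {"chunk_size": 800, "top_k": 5},
    "prompt_sha": "abc123",
    "model_id": "example-model-v2",
}
eval_drifted = {**production, "model_id": "example-model-v1"}

matches = fingerprint(eval_same) == fingerprint(production)
drifted = fingerprint(eval_drifted) != fingerprint(production)
```

Printing the fingerprint in the eval summary makes a mismatch visible in the same place the green tick shows up.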

If you're shipping an AI product and you don't have an eval harness, that's the next thing you build. Not the next prompt tune. Not the next feature. The harness, with twenty golden examples, hooked to every PR. Then fifty examples and a nightly run. Then model-graded checks for what programmatic checks can't see. The growth path is gentle. Skipping it is what makes the forty-hour customer screenshot possible, and that screenshot is the cheap version of the lesson. The expensive version is when the wrong answer was in a contract.