Prompts as code: versioning, A/B, rollback

Prompts in git, versioned, A/B-tested via feature flags, rollback when regressions hit. The single most-edited surface in an AI product needs the same discipline as code.

Here's the layman version. The most edited file in an AI product isn't the code. It's the prompt, the instruction text that gets sent to the model every time a customer asks a question. The prompt is what tells the model who it is, what to do, what to refuse, what tone to use, what shape to return. A small change to the prompt can change every answer the system gives.

Most teams treat the prompt like a sticky note on a monitor. Somebody opens the AWS console, edits the text, hits save, and the change goes live for every customer instantly. No history. No rollback. No test. No record of who changed what or why. Then a customer reports a weird answer the next day and nobody can figure out what changed.

[Figure: Prompts as code. A git repo of prompt files (prompts/triage_v1.txt, prompts/triage_v2.txt, prompts/diagnose_v1.txt, prompts/diagnose_v2.txt, prompts/resolve_v1.txt) feeds a CI eval step, then feature-flag A/B routing (10% → triage_v2, 90% → triage_v1) with rollback in <60s. Caption: Prompts in git. Versioned. A/B-tested. Rolled back like code.]

This is wild. The single most leveraged piece of text in the system gets edited like a sticky note. Imagine treating production code that way.

Prompts are code. They behave like code (small changes have big downstream effects). They fail like code. They need to be versioned, tested, deployed, and rolled back like code. None of this is exotic. It's the same git-based, CI-tested, feature-flagged shape you already use for your application code. You just have to apply it to the prompt.

Let me walk through what that looks like.

What "the prompt" actually is in a real AI product

Before we get into versioning, let's be clear about what we're versioning. "The prompt" sounds like one thing. It usually isn't.

In the kind of AI product I described in the three loops piece, there's a triage prompt, a diagnose prompt for each category, and a resolve prompt. Depending on how many categories you have, that's easily a dozen prompts. Each prompt has parts:

  • A system framing block: who the model is and what it's allowed to do.
  • A retrieved-context block: the slot where RAG chunks get inserted.
  • A rubric block: the consultant's playbook, when it's small enough to fit.
  • A response-shape block: the JSON schema or output structure.
  • The user's question slot.

Each can change independently. System framing doesn't change often, but when it does the impact is huge. The response-shape block changes whenever the downstream parser changes; the two ship together. The rubric changes whenever the consultant updates their playbook. The retrieval block changes when chunking shape changes upstream.

If you think of "the prompt" as one chunk of text, you'll edit one chunk of text. If you think of it as a structured artifact with named parts, you can version each part, test each part, and roll back each part on its own. The second is much easier to live with.
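To make the "named parts" idea concrete, here is a minimal sketch in Python. The `PromptPart` class, the `assemble` function, and the per-part version numbers are all hypothetical names and values chosen for illustration; the part names follow the list above. It assumes templates use `str.format`-style placeholders, so a template containing literal braces (a raw JSON schema, say) would need escaping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPart:
    """One named, independently versioned block of a prompt (hypothetical type)."""
    name: str
    version: int
    template: str  # str.format-style {placeholders}


def assemble(parts, **slots):
    """Render the parts in order; return the text and the part versions used."""
    text = "\n\n".join(p.template.format(**slots) for p in parts)
    versions = {p.name: p.version for p in parts}
    return text, versions


# Illustrative parts; the version numbers here are made up.
parts = [
    PromptPart("system_framing", 2, "You are an interview review assistant."),
    PromptPart("retrieved_context", 1, "Context:\n{context}"),
    PromptPart("rubric", 4, "Rubric:\n{rubric}"),
    PromptPart("response_shape", 3, "Return JSON matching schema {schema_name}."),
    PromptPart("user_question", 1, "Question:\n{question}"),
]

prompt_text, part_versions = assemble(
    parts,
    context="(retrieved chunks go here)",
    rubric="(consultant's playbook goes here)",
    schema_name="rubric_assessment_v2",
    question="How did the candidate handle the system design round?",
)
```

Returning the part versions alongside the rendered text means a log line or eval record can say exactly which rubric and which response shape produced a given answer.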

Get the prompt out of the console

The first move, before anything else, is to get every prompt out of the AWS console (or the OpenAI playground, or the Anthropic Workbench, or wherever you're editing it live) and into a file in your git repo.

This sounds obvious, but it is not done by default. The console is convenient, and the playground gives you instant feedback. People edit there and forget to copy the changes back. So the source-of-truth prompt drifts from what's actually running in production. Then someone tries to reproduce a bug locally and can't, because the local prompt is stale.

Pick a directory; prompts/ works fine. Put each prompt in its own file. I use a structured format: YAML or TOML metadata at the top (id, version, model, owner, last-changed-by) and the prompt body below as a multiline string. Something like:

id: diagnose.hr.interview-rubric.v3
model: claude-sonnet
owner: hr-vertical-team
description: "Apply the consultant's interview rubric to a candidate transcript."
inputs: [transcript, role_definition, rubric]
output_schema: rubric_assessment_v2
prompt: |
  You are an interview review assistant.
  ...

Now the prompt is a file. The file is in git. The file has a history. The file gets a code review when it changes. The file gets blamed when something breaks. This alone (just this one move) solves about half the problems most teams have with prompt changes.

For the HR consultant productizing their interview rubric: each rubric category (technical-skills, communication, role-fit) gets its own prompt file. When the consultant tweaks the rubric (adds a sub-skill, changes a weighting) the change shows up as a git diff. Someone reviews it. The eval suite runs. It ships.

For the career coach productizing resume positioning review: the prompts that score a resume against a target role, rewrite the summary line, and suggest reordering of experience are three files. They evolve independently. When the coach updates their positioning framework, only the positioning prompt changes.

For the product PM offering decision-coaching: the prompts that help frame a decision, pressure-test the framing, and generate "things to consider" are separate files. Coaching style shows up in the framing prompt; the pressure-test prompt is about logical rigor. Different changes touch different files.

Versioning that actually means something

Once prompts are in git, every commit that touches them is a version. That's the cheap version of versioning. It works.

What I do on top of that is give each prompt an explicit version number in its metadata. v1, v2, v3. The version bumps when there's a meaningful behavior change, not for whitespace, not for a typo fix, but for anything that could change the model's outputs in a way an eval would notice.

Why bother with explicit versions on top of git?

First, the version is a stable identifier the rest of the system refers to. The application doesn't say "use the prompt at this filepath." It says "use prompt diagnose.hr.interview-rubric.v3." A/B-test v3 against v4 by name. Roll back by flipping the reference. Git history is the audit; the version number is the handle.

Second, the version is what the eval harness records against. Every eval run is tagged with the prompt versions it ran against. When quality drops, you can see "quality dropped when we shipped diagnose.career-coach.positioning v5." Without explicit versions, the eval data is harder to read.

The packaging: at deploy time, all the prompt files get bundled into a versioned manifest (a JSON file mapping prompt IDs to versions and bodies). The application reads from this manifest at startup. Updating prompts means shipping a new manifest, which is a regular code deploy.

A prompt change is a deploy. It goes through the same CI as code. It runs the eval suite. It can be rolled back the same way.
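One way the manifest and the lookup-by-name could look, as a rough sketch. The manifest layout, the `build_manifest` function, the `PromptStore` class, and the release label are assumptions for illustration, not a prescribed format. The prompt files are stood in for by a list of dicts so the example runs without a YAML parser.

```python
import json

# Hypothetical stand-in for parsed prompt files; in practice these would be
# read from the prompts/ directory at build time.
prompt_files = [
    {"id": "diagnose.hr.interview-rubric", "version": 3,
     "model": "claude-sonnet", "prompt": "You are an interview review assistant. (v3 body)"},
    {"id": "diagnose.hr.interview-rubric", "version": 4,
     "model": "claude-sonnet", "prompt": "You are an interview review assistant. (v4 body)"},
]


def build_manifest(files, release):
    """Bundle prompt files into one manifest keyed by '<id>.v<version>'."""
    prompts = {}
    for f in files:
        key = f"{f['id']}.v{f['version']}"
        if key in prompts:
            raise ValueError(f"duplicate prompt version: {key}")
        prompts[key] = {"model": f["model"], "body": f["prompt"]}
    return {"release": release, "prompts": prompts}


class PromptStore:
    """Read-only lookup the application loads once at startup."""

    def __init__(self, manifest_json):
        self._prompts = json.loads(manifest_json)["prompts"]

    def get(self, ref):
        # Fail loudly on an unknown reference rather than falling back silently.
        if ref not in self._prompts:
            raise KeyError(f"prompt not in manifest: {ref}")
        return self._prompts[ref]["body"]


# "release-001" is a placeholder label for the deploy that produced the manifest.
manifest_json = json.dumps(build_manifest(prompt_files, release="release-001"), indent=2)
store = PromptStore(manifest_json)
body = store.get("diagnose.hr.interview-rubric.v3")
```

Keeping both v3 and v4 in the same manifest is what lets the application switch between them by name without a new build.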

A/B testing prompts behind a feature flag

The point of A/B testing prompts is to ship a change to a small slice of traffic, watch the eval and the production metrics, and either roll it out wider or roll it back. This is the same pattern you already use for code changes that have user-visible impact. Apply it to prompts.

The mechanism is the same as for any feature flag. Each request gets routed to one of two prompt versions based on some bucket key: tenant ID, user ID, request hash, whatever. The split starts small (5% on the new version, 95% on the current) and ramps up if the metrics look good. The bucketing is sticky for a session so a user doesn't get jarringly different behavior mid-conversation.
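A minimal sketch of sticky percentage bucketing, assuming you hash a session or tenant key yourself rather than relying on a flag service's built-in targeting. The function names, the experiment name, and the session ID are hypothetical.

```python
import hashlib


def bucket(key, experiment):
    """Map a bucket key to a stable integer in [0, 100)."""
    digest = hashlib.sha256(f"{experiment}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(key, experiment, new_version, current_version, new_percent):
    """Return the prompt version for this key; the same key always gets the same answer."""
    return new_version if bucket(key, experiment) < new_percent else current_version


served = choose_version(
    key="session-1234",              # hypothetical session ID
    experiment="triage-v2-rollout",  # hypothetical experiment name
    new_version="triage.v2",
    current_version="triage.v1",
    new_percent=5,
)
```

Because the bucket depends only on the key and the experiment name, the assignment is sticky without storing anything. Raising `new_percent` from 5 to 50 only adds keys to the new version; nobody who already saw it gets moved back. Salting the hash with the experiment name keeps the same users from always landing in the early slice of every test.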

What you watch during the A/B:

  • Eval scores on the golden set: the offline measure. Did the new prompt do at least as well on the held-out examples? This is the floor; if it failed here, you wouldn't have shipped the A/B at all.
  • Approval rates from the consultant. If the resolve loop has a human-approval gate, the rate at which the consultant approves the proposed action is a real-time quality signal. Drop in approval rate on the new version is a flag.
  • Rejection / regenerate rates from end users. If users have a "this isn't what I wanted, try again" button, the rate of that button getting clicked is a real-time signal too.
  • Latency and cost. A new prompt that's twice as long costs more and is slower. Sometimes worth it, sometimes not. Watch.
  • Free-text customer feedback. Less mechanical, more important. Look at it.

For the HR consultant: the new version of the interview-rubric prompt scores better on the golden set. Ship to 10% of new transcripts. Watch the consultant's approval rate, the regenerate rate, any feedback. If approval drops, roll back. If everything looks the same or better, ramp to 50%, then 100%.

For the career coach: a new positioning-analysis prompt adds a step where the model first identifies the candidate's three strongest signals before rewriting. Ship to 10% of resumes. Compare rewrite quality and "looks like me" feedback. Decide.

For the product PM: a new framing prompt that asks two clarifying questions before generating decision options. Ship to 10%. Watch whether users answer the clarifying questions or bounce. Decide.

The infrastructure cost is small. A feature flag service (LaunchDarkly, AWS AppConfig, your own Postgres-backed flag table), a tiny lookup at the start of each request, and a tag on every eval and metric so you can slice by which version got served. That's it.
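The "tag on every metric" part is what makes the comparison possible, so here is a small sketch of slicing by served version. The event shape and field names are assumptions, and the four records are toy values invented so the example runs, not measurements.

```python
from collections import defaultdict

# Toy event records; each one carries the prompt version that served it.
events = [
    {"prompt_version": "triage.v1", "regenerated": False, "latency_ms": 820},
    {"prompt_version": "triage.v1", "regenerated": True, "latency_ms": 790},
    {"prompt_version": "triage.v2", "regenerated": False, "latency_ms": 1010},
    {"prompt_version": "triage.v2", "regenerated": False, "latency_ms": 980},
]


def summarize(events):
    """Group events by prompt version; report regenerate rate and mean latency."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["prompt_version"]].append(e)
    summary = {}
    for version, rows in grouped.items():
        summary[version] = {
            "requests": len(rows),
            "regenerate_rate": sum(r["regenerated"] for r in rows) / len(rows),
            "mean_latency_ms": sum(r["latency_ms"] for r in rows) / len(rows),
        }
    return summary


by_version = summarize(events)
```

If the version tag is missing from even some events, the two arms blur together, so it is worth writing the tag at the same point in the request where the version is chosen.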

The eval suite is what makes A/B safe. Without offline evals, you can't tell if v4 is even worth A/B-testing. I'll go deep on the harness in "The eval harness: how you know it's working." Short version: every prompt change runs the suite before it gets near production traffic.

Rollback that's actually fast

A rollback should be one action, and it should take effect in under a minute. If your rollback story is "edit the prompt back to the old text and redeploy," you don't have a rollback story.

The shape that works for me:

The previous version of every prompt is still in the manifest. The flag service points at "current" by default but knows about "previous." When something goes wrong, one config change flips the flag from current to previous. The next request reads the new flag value and uses the previous prompt. No deploy, no build, no PR. Sub-minute.

Then, once the bleeding has stopped, the team figures out what went wrong with the new version. Maybe it was a prompt issue (fix and re-ship). Maybe it was a model-side change (Bedrock pushed a model update; need to recalibrate). Maybe it was an interaction with a retrieval change that landed at the same time. Investigate without the production fire.

The audit log captures the rollback. Who flipped the flag, when, against which prompts, with what justification. Same audit table that captures every other production decision. The point isn't to assign blame; it's so the next person who hits a similar issue can find the prior incident and learn from it.
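A sketch of the flip plus the audit record, using an in-memory class as a stand-in for a real flag service. `PromptFlags`, its method names, the actor address, and the reason string are all hypothetical; a real setup would persist both the flag state and the audit rows.

```python
from datetime import datetime, timezone


class PromptFlags:
    """In-memory stand-in for a flag service holding current/previous pointers."""

    def __init__(self):
        # prompt id -> {"current": ref, "previous": ref, "active": "current" | "previous"}
        self._flags = {}
        self.audit_log = []

    def register(self, prompt_id, current, previous):
        self._flags[prompt_id] = {"current": current, "previous": previous, "active": "current"}

    def active_ref(self, prompt_id):
        """Called per request, so a flip takes effect on the next request."""
        flag = self._flags[prompt_id]
        return flag[flag["active"]]

    def rollback(self, prompt_id, actor, reason):
        """Point the flag at the previous version and record who did it and why."""
        flag = self._flags[prompt_id]
        flag["active"] = "previous"
        self.audit_log.append({
            "action": "rollback",
            "prompt_id": prompt_id,
            "from": flag["current"],
            "to": flag["previous"],
            "actor": actor,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })


flags = PromptFlags()
flags.register("diagnose.career-coach.positioning",
               current="diagnose.career-coach.positioning.v5",
               previous="diagnose.career-coach.positioning.v4")
flags.rollback("diagnose.career-coach.positioning",
               actor="oncall@example.com",
               reason="rejection rate spiked after v5 ramp")
```

The request path only ever calls `active_ref`, so rolling back touches configuration, not code. Both versions have to be in the deployed manifest already for this to work.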

For the career coach product, a typical shape: a new positioning-rewrite prompt shipped to 25% of traffic. Within hours, rejection rate jumped from baseline 8% to 23%. Eval scores looked fine offline; the regression was something the offline evals didn't catch (the new prompt produced more aggressive rewrites that scored well on the rubric but felt wrong to users). Flag flipped. Within sixty seconds, traffic was back on the old prompt. Then a week of investigation, a "feels-like-me" eval dimension added, a new prompt, another A/B. This time it stuck.

That loop (incident, fast revert, investigate, add eval coverage, retry) is only possible if rollback is fast and prompts are versioned. Without it, you're choosing between "leave the bad prompt running" and "panicky deploy at 11pm." Neither is good.

What I keep in CI for prompt changes

Every PR that touches a prompt file runs:

  1. Schema validation: required fields present, version bumped if the body changed, output schema exists.
  2. The eval suite for the affected prompts. Compares scores against the previous version. Regression past a threshold blocks the merge.
  3. A diff renderer in the PR description: old prompt next to new, with changes highlighted.
  4. A cost / token estimate. If the new prompt is 30% longer, the comment notes the marginal cost.
  5. A sample-output comparison: the CI runs old and new prompts against five canned examples and posts both outputs.

A few minutes added to PR cycle time. Catches a startling fraction of regressions before they ship.
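The first two checks are simple enough to sketch. This assumes prompt files have already been parsed into dicts with a numeric `version` field; the required-field list, the function names, and the 0.02 regression threshold are illustrative choices, not recommendations.

```python
REQUIRED_FIELDS = ("id", "version", "model", "owner", "output_schema", "prompt")


def validate_prompt_change(old, new, known_schemas):
    """Return a list of problems; an empty list means the change passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in new]
    if problems:
        return problems
    if new["output_schema"] not in known_schemas:
        problems.append(f"unknown output schema: {new['output_schema']}")
    # A changed body must come with a higher version number.
    if old is not None and new["prompt"] != old["prompt"] and new["version"] <= old["version"]:
        problems.append("prompt body changed but version was not bumped")
    return problems


def eval_gate(old_score, new_score, max_drop=0.02):
    """Pass unless the new score falls more than max_drop below the old one."""
    return (old_score - new_score) <= max_drop


old = {
    "id": "diagnose.hr.interview-rubric", "version": 3, "model": "claude-sonnet",
    "owner": "hr-vertical-team", "output_schema": "rubric_assessment_v2",
    "prompt": "You are an interview review assistant.",
}
# Body edited, version left at 3: the check should flag this.
new = dict(old, prompt="You are a careful interview review assistant.")
problems = validate_prompt_change(old, new, known_schemas={"rubric_assessment_v2"})
```

This version compares bodies exactly, so a whitespace-only edit would also demand a bump. If you want to exempt those, normalize the body before comparing.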

CI itself is covered in "Deployment: IaC, CI/CD, environments, the minimum." The prompt-specific checks bolt on to whatever pipeline you already have for code.

What I'd do this week

Three concrete moves, in order, if you're shipping an AI product right now and your prompts live in the console:

  1. Copy every prompt out of the console into a file in your repo. Even if you don't have versioning or CI yet. Just having them in git with a history is a meaningful improvement.
  2. Wrap them in a thin loader. The application reads prompts from a manifest, not from the console. The manifest gets built from the files at deploy time. Now changing a prompt is a deploy.
  3. Add one A/B mechanism, even if it's a hand-coded if-statement. Not because you're going to A/B test today, but because you want the shape in place when you do. The first time you actually run an A/B, the infrastructure is already there.

The eval suite, the polished CI, the fast rollback: those come later. They're worth building. But you can ship the first three moves in an afternoon and they'll save you from the worst of the prompt-as-sticky-note failure mode immediately.

Prompts are code. Treat them like code and they'll behave like code: predictable, reviewable, recoverable. Treat them like sticky notes and they'll burn you on a Tuesday afternoon when you can least afford it.