DORA metrics in the AI era: DevEx, but with hallucinations

Deployment frequency, lead time, change-failure rate, MTTR. The DORA framework worked for a decade because it measured the right things. AI-augmented engineering is bending those metrics in interesting ways and exposing the next thing to measure.

[Cover image: a vintage analog dashboard with several round meters on a polished wooden panel, some needles trending up and some down, under warm tungsten light]

The DORA framework (deployment frequency, lead time for changes, change-failure rate, mean time to recovery) has been the backbone of engineering-effectiveness measurement for the better part of a decade. The framework worked because it measured the right things and because the metrics composed into a coherent story about engineering velocity and quality.
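For concreteness, here is a minimal sketch of how the four metrics reduce to arithmetic over deployment, change, and incident records. The record shapes and numbers are illustrative, not a standard schema; real pipelines pull these from CI/CD and incident tooling.

```python
# Minimal sketch: the four DORA metrics from simple records.
# Record shapes are illustrative, not a standard schema.
from datetime import datetime
from statistics import mean

deploys = [  # (deployed_at, caused_failure)
    (datetime(2024, 6, 3, 14, 0), False),
    (datetime(2024, 6, 4, 11, 30), True),
    (datetime(2024, 6, 6, 9, 15), False),
]
changes = [  # (committed_at, deployed_at)
    (datetime(2024, 6, 3, 9, 0), datetime(2024, 6, 3, 14, 0)),
    (datetime(2024, 6, 4, 10, 0), datetime(2024, 6, 4, 11, 30)),
]
incidents = [  # (detected_at, resolved_at)
    (datetime(2024, 6, 4, 12, 0), datetime(2024, 6, 4, 13, 45)),
]

window_days = 7
deployment_frequency = len(deploys) / window_days  # deploys per day
lead_time_hours = mean(
    (dep - commit).total_seconds() / 3600 for commit, dep in changes
)
change_failure_rate = sum(failed for _, failed in deploys) / len(deploys)
mttr_hours = mean(
    (res - det).total_seconds() / 3600 for det, res in incidents
)

print(f"deployment frequency: {deployment_frequency:.2f}/day")
print(f"lead time for changes: {lead_time_hours:.1f} h")
print(f"change-failure rate: {change_failure_rate:.0%}")
print(f"MTTR: {mttr_hours:.2f} h")
```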

AI-augmented engineering is bending the framework in interesting ways. Some of the metrics get noticeably better; some get worse; some get harder to measure. The patterns that are emerging from a year-plus of AI-augmented dev work suggest the framework still applies but needs companion metrics that didn’t exist before.

Worth being concrete about what’s bending and what’s becoming visible.


What goes up

A few DORA dimensions that AI-augmented engineering moves in the favorable direction:

Deployment frequency. Up. Easier to ship more changes when the IDE agent makes the small ones cheaper to author. Not a heroic change; a steady ~10-30% lift across the public DORA reporting I’ve read and the conversations I follow, which is a meaningful number compounded over a year.

Lead time for small changes. Down. The “I have an idea, make it real, ship it” cycle compresses when prompt-then-edit is faster than write-from-scratch. Most pronounced on routine changes (config tweaks, small refactors, dependency updates); less pronounced on novel architectural work.

Code-review throughput. Up. AI-assisted PR review (the IaC pattern I wrote about earlier) extends naturally to application code. Reviewers get pre-screening that catches the common categories; humans focus on the substantive cases. Throughput rises without sacrificing review quality if the AI review is calibrated correctly.

Documentation completeness. Up, in some teams. The friction of writing docs is lower when AI helps draft them. Whether the team actually keeps them updated is a separate question.

These are real wins. Deployments where AI is used well in the dev process show up better on these metrics.

What gets complicated

A few DORA dimensions where AI-augmented engineering produces mixed results:

Change-failure rate. Mixed. Two opposing forces. AI-assisted authoring reduces some classes of failure (typos, simple logic errors, missing edge-case handling). It introduces new ones: the hallucinated-file class, the plausible-but-wrong refactor, the silent regression where the tests still pass. Net effect varies by team and discipline. Teams with strong plan-mode discipline come out neutral or slightly positive; teams without it come out negative.

Mean time to recovery (MTTR). Mixed. Faster on the cases where the AI helps with diagnosis (suggesting hypotheses, summarizing logs, generating fix candidates). Slower on the cases where the AI’s suggestion sends the responder down a wrong path that takes a while to back out of. Net effect: slight positive on average, with higher variance than before.

Lead time for novel changes. Less affected than lead time for routine changes. AI-augmented authoring helps when the work resembles patterns the model has seen many times; it helps less when the work is genuinely new. Architectural and design work doesn’t compress the way routine implementation does.

These metrics show the technology is real but not magic.

What becomes harder to measure

A few things that DORA didn’t have to measure before but probably should now:

AI change attribution. What fraction of changes in a release came from AI-assisted authoring vs. human-only work? Teams want to know; the metric is hard to define cleanly because most changes are now hybrid (human-authored, AI-completed; AI-suggested, human-modified).
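A sketch of the tallying side, assuming each merged change carries an authorship label your tooling records; the label taxonomy here is hypothetical, not a standard.

```python
# Sketch: attribution tallies over merged changes.
# The "authorship" labels are hypothetical, not a standard taxonomy.
from collections import Counter

merged_changes = [
    {"id": 101, "authorship": "human_only"},
    {"id": 102, "authorship": "ai_drafted_human_edited"},
    {"id": 103, "authorship": "human_drafted_ai_completed"},
    {"id": 104, "authorship": "ai_drafted_human_edited"},
]

counts = Counter(c["authorship"] for c in merged_changes)
total = len(merged_changes)
for label, n in counts.most_common():
    print(f"{label}: {n}/{total} ({n / total:.0%})")
```

The arithmetic is trivial; the hard part is that most changes land in the hybrid buckets, so "what fraction came from AI" has no single clean answer.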

Trust calibration over time. How often is the AI’s suggestion being accepted, rejected, or modified? The acceptance rate is a leading indicator of trust calibration; teams that accept too eagerly get the silent-regression class of failures, teams that reject too eagerly aren’t getting the velocity benefit. The useful version of the metric is a moving average, typically landing somewhere around 60-75%, with the exact band calibrated per team.
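A sketch of acceptance-rate tracking as a moving average, assuming each suggestion's outcome can be logged; the 0.60-0.75 band is the rough range discussed above, not a universal constant.

```python
# Sketch: rolling acceptance rate over the last N AI suggestions.
# The 0.60-0.75 band is an assumed team-calibrated target, not a constant.
from collections import deque

WINDOW = 200          # last N suggestions
LOW, HIGH = 0.60, 0.75

outcomes = deque(maxlen=WINDOW)   # True = accepted (as-is or with edits)

def record(accepted: bool) -> None:
    outcomes.append(accepted)

def acceptance_rate() -> float:
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def calibration_signal() -> str:
    rate = acceptance_rate()
    if rate > HIGH:
        return "over-trust risk: suggestions accepted too eagerly"
    if rate < LOW:
        return "under-trust risk: velocity benefit being left on the table"
    return "within the calibrated band"

record(True); record(False); record(True)
print(f"{acceptance_rate():.0%} -> {calibration_signal()}")
```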

Time-to-detect regression. Specific to AI-assisted work: how long after a change shipped did the team notice the regression? The hallucinated-file class of failures specifically extends this time because the regression looks like a fix in the moment.
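A sketch of the measurement, assuming regressions can be linked back to the change that introduced them; the field names are illustrative.

```python
# Sketch: time-to-detect-regression, per regression linked to its change.
from datetime import datetime
from statistics import median

regressions = [
    {"shipped_at": datetime(2024, 6, 3, 14, 0), "detected_at": datetime(2024, 6, 3, 16, 30)},
    {"shipped_at": datetime(2024, 6, 4, 11, 30), "detected_at": datetime(2024, 6, 10, 9, 0)},
]

detect_hours = [
    (r["detected_at"] - r["shipped_at"]).total_seconds() / 3600 for r in regressions
]
# Median is more informative than mean here: the hallucinated-file class
# produces a long tail of late detections that skews the average.
print(f"median time-to-detect: {median(detect_hours):.1f} h")
print(f"worst case: {max(detect_hours):.1f} h")
```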

Cost-per-merged-change. When the AI assistance has a real per-change cost (model API calls, agentic tool use, evaluation overhead), the per-change economics are worth tracking. Usually cheap enough that nobody worries about any single change; meaningful enough in aggregate that it should be measured rather than ignored.
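A sketch of the arithmetic, assuming model-API and agent tool-use spend can be pulled for a period and matched to the changes merged in that same period; the figures are made up for illustration.

```python
# Sketch: per-change economics for a reporting period (illustrative numbers).
api_spend_usd = 1_840.00        # model API calls for the period
agent_tooling_usd = 560.00      # agentic tool-use / evaluation overhead
merged_changes = 410            # changes merged in the same period

cost_per_merged_change = (api_spend_usd + agent_tooling_usd) / merged_changes
print(f"cost per merged change: ${cost_per_merged_change:.2f}")
```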

Cycle time including AI-iteration. The traditional lead-time metric measures human-clock time. The new dimension is “time spent iterating with the AI before the change was ready to ship.” Sometimes this is short and the lead time is faster overall; sometimes it’s long and the apparent lead time obscures real iteration overhead.
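A sketch separating the two clocks, assuming both the ordinary lead-time timestamps and a log of AI-iteration time are available; the fields are illustrative.

```python
# Sketch: human-clock lead time vs. time spent iterating with the AI.
from datetime import datetime, timedelta

change = {
    "work_started": datetime(2024, 6, 3, 9, 0),
    "deployed_at": datetime(2024, 6, 3, 15, 0),
    "ai_iteration": timedelta(hours=2, minutes=30),  # prompt/review/re-prompt loops
}

lead_time = change["deployed_at"] - change["work_started"]
ai_share = change["ai_iteration"] / lead_time
print(f"lead time: {lead_time}, AI iteration: {change['ai_iteration']} ({ai_share:.0%})")
```

A change with a short lead time but a high AI-iteration share looks fast on the traditional metric while hiding real iteration overhead; tracking both clocks makes that visible.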

These are the companion metrics. The DORA framework still applies; these extend it for the AI-augmented case.

What the better-performing teams share

Patterns from deployments where AI is in the dev process and the metrics show up better:

Strong plan-mode discipline. Plan-then-execute as the default workflow, not opt-in. The teams without this default have higher change-failure rates and longer MTTR.

Explicit trust calibration cadence. Quarterly reviews of “where does the AI help us, where does it hurt us.” The teams that do this stay calibrated; the teams that don’t drift into either over-trust or under-trust.

AI-aware code review. Reviewers who know the difference between “AI-authored, human-approved” and “human-authored from scratch” review them differently. The teams that flatten this distinction miss the runaway-tool-call class of failures during review.

Conservative agentic scope. Agents in the dev process have narrow scopes: the IDE agent is for one task at a time, not for “do whatever needs doing.” Teams that scope agents narrowly show better metrics than teams that grant broad autonomy.

Test coverage as defense. The teams that increased test coverage as AI-assisted authoring grew see better change-failure rates than the teams that kept test coverage flat. The AI helps write the tests too; the higher coverage catches the silent-regression class.

These aren’t exotic disciplines. They’re the operational habits that the better-performing teams have built over the year. The teams that haven’t built them are the ones whose DORA metrics either don’t show the AI lift or show degradation.

The pattern in summary

DORA still works. The metrics still measure the right things. AI-augmented engineering moves the dial on each metric in ways that depend on team discipline more than on the technology itself. The teams that pair the technology with the operational discipline get the favorable shift; the teams that adopt the technology without the discipline get mixed-to-negative results.

The companion metrics that are emerging (AI attribution, trust calibration, AI-iteration time, cost-per-change) extend the framework rather than replacing it. The discipline that produces good DORA numbers extends naturally to producing good AI-augmented-DORA numbers.

A year-plus into AI being a real part of the engineering process, the conversation is mature enough to measure honestly rather than to take on faith. The teams that measure honestly improve faster than the teams that don’t. The framework that’s emerging is recognizable as DORA-with-extensions; that’s the right shape for it. Worth being explicit about the extensions because the marketing layer keeps offering simpler stories than the actual data supports.

The honest read on AI in engineering effectiveness, measured: meaningful positive when the discipline is in place; meaningful negative when it isn’t. Same shape the framework had for non-AI engineering work, applied to a new foundation. DORA, with hallucinations.