Why your IaC pipeline is the right place to put AI
The infrastructure-as-code review pipeline is one of the highest-leverage places to deploy AI in an engineering org, and almost nobody is doing it well.
The places where AI gets deployed in most engineering orgs in 2025 follow a predictable pattern: chat surfaces, code-completion in the IDE, document generation, customer-support automation. The places where AI doesn’t get deployed but probably should are less obvious. The infrastructure-as-code pipeline is the one that keeps coming up in the platform-engineering community, and the reasons it’s underused are mostly accidental rather than principled.
It's worth being explicit about why the IaC pipeline is the high-leverage spot, what the pattern looks like in practice, and what the failure modes are.
Why this is the leverage point
Three structural reasons IaC reviews are the right shape for AI assistance:
The diffs are bounded and reviewable. A Terraform PR, a Pulumi change, a Helm chart update: these are bounded, structured, machine-readable artifacts. The model has the entire change in front of it, can be given the prior state, and can be asked specific questions about what's different. The unstructured sprawl that makes general code review hard isn't a problem here.
The cost of catching issues pre-merge is enormous. Infrastructure changes that go wrong cost downtime, security exposure, surprise bills, or all three. Catching a misconfigured IAM role or an over-permissive security group before it merges is worth paying real money for. Catching a typo in a frontend component is worth less. The economics of false positives versus true positives lean strongly in favor of aggressive AI review on infra changes.
The patterns are highly repetitive. Most IaC reviews catch the same handful of issues over and over: missing tags, wrong region, over-broad IAM, missing encryption, missing backup config, drift from the standard module pattern. These are the kind of pattern-matching tasks AI is genuinely good at, even with workhorse-tier models.
The combination (bounded artifacts, high cost of misses, repetitive patterns) is exactly the workload shape AI assistance is supposed to be best at. And almost nobody is doing it.
What the pattern looks like in practice
A working IaC-AI pipeline shape that’s been described in several recent public writeups and practitioner conversations:
- Trigger: PR opened against the IaC repo. CI runs the usual lint/validate/plan steps, then triggers an AI review step.
- Input to the AI: the PR diff, the `terraform plan` output (or Pulumi preview, or `helm template` output), the relevant org standards as a system prompt, the prior state of any modified resources.
- The review prompt: structured to look for specific classes of issue (security posture, cost implications, operational concerns, compliance with org standards, drift from standard patterns).
- Output: a PR comment with categorized findings (blockers, warnings, suggestions) with the model’s reasoning for each. Optionally with suggested edits as a separate suggested-PR.
- The human loop: the human reviewer reads both the AI output and the diff, decides which findings are real, and either addresses them or marks them as accepted-as-is. The audit trail captures both the AI findings and the human dispositions.
The pattern is unsurprising in its shape. The leverage comes from running it consistently, on every IaC change, with the same standards loaded as system prompt, with the findings tracked over time. The compounding effect of catching the same class of issue 20 times in a quarter is what makes the deployment pay back.
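To make the shape concrete, here is a minimal sketch of the review step itself, assuming the Anthropic Python SDK for the model call and the GitHub REST API for the PR comment. The artifact paths, standards file, model name, and the `PR_NUMBER` environment variable are placeholders a real workflow would have to provide, not a prescribed setup.

```python
# ai_review.py - runs after the lint/validate/plan steps in CI (sketch, not a drop-in step)
import os
import pathlib
import requests
import anthropic

REVIEW_INSTRUCTIONS = """Review this infrastructure change. Report findings in three
categories: BLOCKER, WARNING, SUGGESTION. For each finding, name the resource,
the issue class (IAM, encryption, tagging, cost, drift-from-standard), and your reasoning."""

def build_prompt() -> tuple[str, str]:
    # Earlier CI steps are assumed to have written these artifacts to disk.
    diff = pathlib.Path("artifacts/pr.diff").read_text()
    plan = pathlib.Path("artifacts/plan.txt").read_text()  # e.g. `terraform plan -no-color` output
    standards = pathlib.Path("standards/org-iac-standards.md").read_text()
    system = f"You review IaC changes against these org standards:\n\n{standards}"
    user = f"{REVIEW_INSTRUCTIONS}\n\n## PR diff\n{diff}\n\n## Plan output\n{plan}"
    return system, user

def run_review() -> str:
    system, user = build_prompt()
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use whatever tier the workload justifies
        max_tokens=4000,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text

def post_comment(body: str) -> None:
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "org/iac-repo", set by GitHub Actions
    pr_number = os.environ["PR_NUMBER"]     # assumed to be exported by the workflow
    requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": body},
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    post_comment(run_review())
```

The design point worth noticing: the step only reads artifacts that earlier CI stages produced, and its only side effect is a comment. The merge decision stays with the human reviewer.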
The failure modes worth avoiding
Three patterns I’ve watched go wrong in public projects and in my own testing:
Treating the AI output as authoritative. The AI catches real issues; it also produces false positives. Treating its findings as automatically blocking creates frustration and erodes trust in the tool. The right shape is “AI suggests, human decides,” with the human’s decision logged. The deployments that try to make the AI an automated gate burn out within a quarter.
Not loading org standards. A generic prompt produces generic findings. The deployments that work load the actual org-specific standards (the standard module pattern, the required tags, the IAM patterns the security team has approved) into the system prompt and review against those. The findings get sharper; the false-positive rate drops; the value goes up.
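One hedged sketch of what "loading the actual standards" can look like in practice: keep them as checked-in files and assemble them into the system prompt at review time. The directory layout and file names here are assumptions, not a required structure.

```python
# Assemble org-specific standards into the review system prompt (illustrative layout).
import pathlib

STANDARDS_DIR = pathlib.Path("standards")

def org_system_prompt() -> str:
    sections = []
    # Each file is an org-maintained document: required tags, approved IAM
    # patterns, the standard module registry, encryption/backup requirements.
    for path in sorted(STANDARDS_DIR.glob("*.md")):
        sections.append(f"## {path.stem}\n{path.read_text()}")
    return (
        "You review infrastructure-as-code changes for this org. "
        "Flag anything that violates the standards below; do not flag "
        "generic best practices that the standards do not require.\n\n"
        + "\n\n".join(sections)
    )
```

The "do not flag generic best practices" instruction is doing real work here: it is the lever that pulls the false-positive rate down.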
Not closing the feedback loop. The first pass of the AI review pipeline catches a lot of stuff. The second pass catches less. Without a feedback loop where false positives get fed back into the prompt and missed issues get added to the standards, the value plateaus quickly. The teams that treat the prompt as a living document keep getting value; the teams that set it up once and walk away see diminishing returns.
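A minimal sketch of one way to close that loop, assuming a simple JSONL log that the prompt maintainer reviews periodically; the record fields and disposition labels are illustrative.

```python
# Log each finding's human disposition, and fold confirmed false positives
# back into the system prompt as "do not flag" guidance (sketch).
import json
import datetime
import pathlib

LOG = pathlib.Path("review-log.jsonl")

def record_disposition(pr: int, finding: str, category: str, disposition: str) -> None:
    """disposition: 'fixed', 'accepted-as-is', or 'false-positive'."""
    entry = {
        "pr": pr,
        "finding": finding,
        "category": category,  # e.g. "IAM", "tagging", "encryption"
        "disposition": disposition,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def false_positive_guidance() -> str:
    """Summarize confirmed false positives as extra system-prompt guidance."""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()] if LOG.exists() else []
    fps = {e["finding"] for e in entries if e["disposition"] == "false-positive"}
    if not fps:
        return ""
    return ("Do not flag the following patterns; they were reviewed and accepted:\n- "
            + "\n- ".join(sorted(fps)))
```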
Why this isn’t more widely deployed
Three accidental reasons, not principled ones:
The “AI for code” conversation is dominated by IDE-agent vendors. The marketing energy is on Cursor / Copilot / Claude Code / etc. The PR-review surface is a separate product category, less vendor-saturated, and gets less attention. Most teams hear “AI for code” and think IDE first, pipeline second.
The PR-review AI tools that exist are mostly general-purpose. GitHub Copilot has PR review, the various LangChain-style tools have PR review, but most are aimed at general-purpose code review rather than the specific shape of IaC review. The IaC-specific deployment requires assembly that the off-the-shelf tools don’t provide.
Platform teams are smaller than application-engineering teams. The team that owns the IaC pipeline is usually a small platform team. The capacity to build the AI review integration is limited. The application-engineering teams that have more capacity aren’t the ones deploying it because it’s not their pipeline.
The combination is that the workload-fit is excellent and the deployment energy is misdirected. The shops that have invested in the IaC-AI pattern have gotten outsized value. The pattern doesn't generalize to "AI everywhere in the pipeline" (application-code review is harder, runtime-monitoring AI is harder, and so on), but the IaC-specific spot is unusually well-suited.
The agent-design overlay
The IaC-AI pipeline benefits from the design patterns that work for agentic systems generally: bounded scope (review one PR), specific tool access (read the plan, read the diff, comment on the PR), human checkpoint before any action with side effects (the human merges, not the AI). It also benefits from the treat-the-AI-like-an-employee discipline: define what it's responsible for, define what it isn't, audit its decisions, evaluate its work.
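If the review step is built as a tool-using agent rather than a single prompt, the bounded scope can be made explicit in the tool surface itself. A sketch, using the common name/description/input-schema tool format; the tool names are hypothetical, and the point is what's deliberately missing.

```python
# Hypothetical tool surface for an IaC review agent: read-only inputs plus one
# side effect (posting a comment) that a human still has to act on.
REVIEW_TOOLS = [
    {
        "name": "read_diff",
        "description": "Return the PR diff for the change under review.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "read_plan",
        "description": "Return the terraform plan / pulumi preview output.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "post_review_comment",
        "description": "Post categorized findings as a PR comment. Never merges or applies.",
        "input_schema": {
            "type": "object",
            "properties": {"body": {"type": "string"}},
            "required": ["body"],
        },
    },
]
# Deliberately absent: terraform apply, merge_pr, or anything that writes state.
```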
The IaC pipeline is also a relatively safe place to deploy because the worst case is a noisy PR comment. The blast radius for a bad AI review is “human ignores it.” The blast radius for the runaway tool-call failure mode in an IDE-agent is “incorrect code merged”; the equivalent in the IaC pipeline is “incorrect comment posted.” The asymmetry favors aggressive deployment in the pipeline relative to the IDE.
Where to start
For a platform team thinking about deploying AI in their IaC pipeline:
- Pick one resource type to start with. IAM is usually the highest-leverage starting point because the cost of misses is highest and the patterns are well-documented. Add coverage for other resource types as the first one matures.
- Load your actual org standards into the prompt. Not generic AWS best practices, your standards. The difference matters.
- Track findings over time. Categories, frequencies, dispositions. The data is the basis for improving the prompt and for deciding where to expand coverage next (a small aggregation sketch follows this list).
- Don’t make it a gate. Make it a comment. Let the human reviewer decide what to act on. The trust gets built faster when the AI is an aid rather than an authority.
- Plan for the prompt to be a living document. The first version is the worst version. Plan for ten iterations over the first quarter.
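For the tracking bullet above, a small sketch of the aggregation that makes the data usable, assuming the per-finding JSONL log sketched earlier in the feedback-loop section.

```python
# Summarize finding frequency and disposition by category from the review log (sketch).
import json
import collections
import pathlib

def summarize(log_path: str = "review-log.jsonl") -> dict[str, collections.Counter]:
    by_category: dict[str, collections.Counter] = collections.defaultdict(collections.Counter)
    for line in pathlib.Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        by_category[entry["category"]][entry["disposition"]] += 1
    return by_category

if __name__ == "__main__":
    for category, dispositions in summarize().items():
        # e.g. "IAM: fixed=12 false-positive=3 accepted-as-is=1"
        print(category + ": " + " ".join(f"{d}={n}" for d, n in dispositions.most_common()))
```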
Done well, this is one of the highest-leverage AI deployments most platform teams can make in 2025. The combination of bounded scope, repetitive pattern detection, and high cost-of-misses is unusually well-suited. The reasons it's underused are accidental: vendor attention is elsewhere, off-the-shelf tools don't quite fit, the responsible team is small. None of those are good reasons to leave the leverage on the table.