Drift in cloud: what it actually looks like in production

Every IaC vendor talks about drift in the abstract. Here's what it actually looks like in a real cloud account: the security groups that no longer match the code, the manual fixes that never made it back into the repo, and the next terraform apply quietly fighting reality.

I’ve been running the same demo for a couple of months now. A small AWS account with a deliberately dirty production-shaped setup. A VPC, a couple of EC2 instances, a security group that I named acme-prod-drift-sg and tagged, unambiguously, with Environment=sandbox and Purpose=drifted-resources. The tags are the joke. The drift is real.

The point of the demo is to show people what drift actually looks like in a live cloud account, not in a slide. The reaction is always the same: a long pause, and then “oh, that’s what’s been happening to us.”

This is the piece I want to write about that pause.

What drift is, less abstractly than the vendors say

Every IaC vendor has a slide that says “drift is when your cloud state diverges from your IaC code.” That sentence is technically correct and almost completely useless, because it doesn’t tell you what the divergence looks like, how it got there, or what it costs.

In the customer scenarios I’ve seen this year (and these were everywhere, not edge cases), drift shows up in three flavors:

Manual changes. Somebody opened the AWS console at 2 a.m. to fix a production incident. They added an ingress rule to a security group. They bumped an RDS instance class. They reattached an EBS volume. They didn’t go back and update the Terraform repo, because by the time the incident was resolved they were exhausted and the moment had passed. The change is now real in AWS and invisible in code.

Side effects of other Terraform. Two repos manage overlapping pieces of infrastructure. Repo A owns the VPC. Repo B owns the security groups inside it. Repo B’s last apply added rules. Repo A’s terraform plan now sees those rules as drift, because from Repo A’s perspective nobody told it about them. This is the variant that surprises people the most, because it isn’t a human’s fault, it’s a tooling boundary that nobody redrew when the team split the repos.
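
This repo-boundary case is easiest to see in actual configuration. Here’s a minimal sketch, with hypothetical names, IDs, and CIDRs: Repo A declares the security group with inline rules, Repo B attaches a standalone rule to the same group, and because inline rules treat themselves as the complete picture, each repo’s next plan reports the other’s work as drift.

    # Repo A owns the VPC and the security group, with inline rules.
    # The inline form is authoritative: any rule Repo A didn't declare
    # shows up in its next plan as something to remove.
    resource "aws_security_group" "app" {
      name   = "acme-prod-app-sg"    # hypothetical
      vpc_id = aws_vpc.main.id

      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/16"]
      }
    }

    # Repo B, different repo and different state file, attaches one more
    # rule to the same group by ID. Repo B's apply succeeds; Repo A's next
    # plan sees an ingress rule it never declared and offers to delete it.
    resource "aws_security_group_rule" "vendor_access" {
      type              = "ingress"
      from_port         = 8443
      to_port           = 8443
      protocol          = "tcp"
      cidr_blocks       = ["203.0.113.0/24"]     # illustrative vendor range
      security_group_id = "sg-0123456789abcdef0" # same group, hardcoded ID
    }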

Things created outside IaC. A new team gets onboarded. They have AWS console access. They build a stack of resources to ship a feature. Nobody told them about the Terraform repo, or they did know but didn’t have time. Six months later their resources are running, and nobody in the platform team can explain where they came from or who maintains them.

The third category is the one I want to circle back to in the ghost-resources piece, because it actually overlaps with the inverse problem. For now: most of what teams call “drift” is some mixture of these three.

The security group that always tells the story

Pick one resource type to teach drift with, and it’s almost always a security group.

The reason: security groups have an unusually high rate of legitimate-feeling manual changes. Adding a CIDR for a new vendor IP. Opening port 22 to a developer’s home IP for a debugging session. Adding a rule to allow a new internal service to reach a database. Every one of those changes feels, in the moment, like a reasonable thing to do in the console, because the friction of “edit the Terraform, open a PR, get review, run the pipeline” is higher than just clicking the button.

And every one of those changes is now a drift event.

In the demo I built, acme-prod-drift-sg has three rules in Terraform and seven rules in reality. Two of the extras were added by an automation that touches the same SG from outside the IaC pipeline. One was added by a person, in the console, who is not on the team anymore. One I added myself just to make the point.
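
For context, the Terraform side of that resource looks roughly like this. The ports and CIDRs below are illustrative stand-ins rather than the exact demo values, but the shape is the point: three rules in code, seven in the account.

    # What the repo believes acme-prod-drift-sg looks like: three rules.
    # In the account there are seven; the extra four exist only in AWS
    # and in nobody's code.
    resource "aws_security_group" "drift_demo" {
      name   = "acme-prod-drift-sg"
      vpc_id = aws_vpc.demo.id

      ingress {
        description = "HTTPS from inside the VPC"
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/16"]
      }

      ingress {
        description = "App traffic from the internal subnet"
        from_port   = 8080
        to_port     = 8080
        protocol    = "tcp"
        cidr_blocks = ["10.0.1.0/24"]
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }

      tags = {
        Environment = "sandbox"
        Purpose     = "drifted-resources"
      }
    }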

When I run terraform plan against this account, the plan output is unmissable. Four lines of diff, in red, against a resource that the team thinks they control.

The audience reaction to that diff is the actual lesson. The diff isn’t surprising. The fact that it’s been there for months without anybody noticing is.

Why drift accumulates

The mechanics are not mysterious. Drift accumulates because:

  1. The cost of a manual change is paid by the person making it. The cost of the drift is paid by the future team six months later.
  2. Most teams don’t have continuous drift detection. They have terraform plan runs at apply time. Between applies, drift can grow indefinitely.
  3. The tools that do detect drift continuously are not in most teams’ standard toolkits. They are something a platform team has to choose to adopt, configure, and respond to.
  4. When drift is detected, the resolution is high-friction. Either you update the code to match reality (which legitimizes whatever change was made, including the ones that maybe shouldn’t have happened), or you re-apply to overwrite reality (which can break the system the manual change was fixing). Neither is fun. So teams kick the can.

That fourth bullet is the one I keep coming back to. Drift detection is not the hard part. Drift remediation is the hard part. Detection is a plan command and a webhook. Remediation is a judgment call about which version of the truth (the code or the cloud) should win, and most teams don’t have a policy for that.

The older shape of the same problem

There’s a class of drift the three categories above don’t quite capture, and it’s worth naming because it’s the one I pattern-matched to the hardest when I first started seeing it in customer Terraform repos. Call it configuration drift between platforms: vRA defines the “production environment” standard one way, the Terraform repo defines it another way, the Crossplane composition somewhere defines it a third way, and over time the three definitions drift apart because nothing forces them to stay aligned.

This is the kind of problem Decisions as Code (DaC) was built to solve. It’s the methodology behind nearly every self-service and automation system I’ve designed: extract the business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults. The remaining configuration becomes the platform’s responsibility, not the consumer’s. (I called this Property Toolkit during my OneFuse days; the shape of the idea hasn’t changed, only the foundation.)

Same root cause as platform-to-platform drift (business logic living in multiple consuming platforms with no standard source), same outcome: each copy drifts on its own timeline until the platforms disagree about what “production” even means. The fix is to centralize the standard into a single curated decision surface with per-platform projection, and let each consumer pull from one source. For the Terraform layer specifically, that’s a centralized standards module that every cloud-specific module pulls from, so there’s only one definition of the organizational standard to drift from.
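
Sketched at the Terraform layer, with hypothetical module and variable names, it looks something like this: one standards module owns the definition, and every cloud-specific module consumes its outputs instead of restating them.

    # --- modules/org-standards/outputs.tf ---
    # The single place the organizational standard lives.
    output "required_tags" {
      value = {
        Environment = "production"
        Owner       = "platform-team"
        CostCenter  = "eng-infra"
      }
    }

    output "allowed_ingress_cidrs" {
      value = ["10.0.0.0/8"]
    }

    # --- a cloud-specific module, e.g. aws/service/main.tf ---
    # Consumers pull the standard; they don't restate it.
    variable "vpc_id" { type = string }

    module "standards" {
      source = "../modules/org-standards"
    }

    resource "aws_security_group" "service" {
      name   = "acme-prod-service-sg"    # hypothetical
      vpc_id = var.vpc_id
      tags   = module.standards.required_tags

      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = module.standards.allowed_ingress_cidrs
      }
    }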

The runtime drift this article opens with (security groups edited in the console, RDS instances bumped at 2 a.m.) is the same structural problem one layer further down. Business logic that lives in multiple places drifts. The fix is the same in both cases: centralize the decision, project it onto the platform, and make the standard source the path of least resistance. When I see a security group with seven rules in reality and three in code, I see the DaC problem with a different vocabulary.

What drift actually costs

The honest answer is that the cost depends on what drifted. The full taxonomy:

Security risk. The most common and the least visible. A security group that’s been quietly opened to a wider range than it should be. An S3 bucket policy that no longer matches what the audit document says. An IAM policy that was relaxed to fix an incident and never tightened back up. These accumulate silently. They show up in an audit, or in a breach, and the team is shocked because “that’s not what our Terraform says.”

Mystery resources at audit time. This is the cost the compliance teams know about. When the auditor asks “show me the IaC for this resource,” and the answer is some combination of “we wrote it three years ago” and “we don’t think we own that one,” the cost is measured in hours of forensic work and credibility with the auditor.

The next terraform apply fighting reality. This is the operational cost. A team that has accumulated months of drift runs an apply against a perfectly reasonable change, and the plan output shows fifty modifications they didn’t expect. Some of those modifications would break production if applied. The team aborts. The change doesn’t ship. They open a Jira ticket to “reconcile drift” and the ticket sits in the backlog forever, because reconciling drift is nobody’s favorite week of work.

The cumulative effect of that third one is the worst, in my opinion. It is how teams stop trusting their IaC pipeline. And once a team stops trusting the pipeline, they stop using it. They start clicking in the console more, because the pipeline is “broken,” which means more drift, which makes the pipeline less trustworthy, and the loop closes.

The remediation playbook

When I walk customers through what to do about drift, the conversation has a consistent structure.

Detect continuously. Not just at apply time. Run a plan against every workspace on a schedule: daily is enough for most teams, hourly for the security-sensitive ones. Surface the results somewhere humans will see them. A Slack channel works. A dashboard works. An email that everybody filters works less well.

Categorize before you remediate. When drift shows up, the first question is “is this drift we want to keep, or drift we want to undo?” Building the answer to that question into the detection workflow is the difference between drift detection that’s useful and drift detection that becomes noise. A simple two-state label (“intentional, update code” vs “unintentional, restore from code”) is enough.

Make the easy path the right path. If the cost of “edit Terraform and open a PR” is higher than the cost of “click in the console,” you will get drift. Lower the first cost. Pre-built templates for common changes. A self-service PR generator for SG rule changes. Anything that makes the IaC path the path of least resistance.
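
One hedged sketch of what a pre-built template can look like, with made-up module and variable names: a tiny module that turns the common console edit into a short PR, so the requester only supplies the decision and none of the boilerplate.

    # --- modules/sg-ingress/main.tf ---
    # The requester supplies the decision (port, CIDR, why); the module
    # supplies everything else. Assumes the target group is managed
    # without inline rules, so standalone rules don't fight it.
    variable "security_group_id" { type = string }
    variable "port"              { type = number }
    variable "cidr"              { type = string }
    variable "reason"            { type = string }

    resource "aws_security_group_rule" "this" {
      type              = "ingress"
      from_port         = var.port
      to_port           = var.port
      protocol          = "tcp"
      description       = var.reason
      cidr_blocks       = [var.cidr]
      security_group_id = var.security_group_id
    }

    # --- the PR that uses it is a handful of lines ---
    module "vendor_webhook_access" {
      source            = "../modules/sg-ingress"
      security_group_id = "sg-0123456789abcdef0"   # illustrative
      port              = 8443
      cidr              = "203.0.113.0/24"
      reason            = "Vendor webhook access, ticket OPS-1234"
    }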

Treat drift remediation as a budgeted activity. Drift will accumulate. The question is whether you have time on the calendar to reconcile it. A one-hour weekly drift-review meeting eliminates 90% of the chronic accumulation. It is the cheapest hygiene practice in IaC.

Get the policy decisions written down. “When drift is detected, who decides which way to reconcile?” should not be a question the on-call engineer has to figure out in real time. Write it down. Some teams pick “code is standard, always overwrite cloud.” Some pick “cloud is standard, always update code.” Most pick a hybrid based on resource type. Pick something. Write it down.
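
“Write it down” can be as literal as a small map checked into the repo that the drift review reads from. Purely illustrative, and the resource types and choices below are stand-ins, not a recommendation:

    # drift-policy.tf: which side wins when drift is detected, by resource type.
    locals {
      drift_policy = {
        aws_security_group = "code-wins"  # re-apply; console-added rules get reverted
        aws_iam_policy     = "code-wins"  # relaxed permissions never stay by default
        aws_db_instance    = "review"     # instance-class bumps are often intentional
        aws_ebs_volume     = "cloud-wins" # update the code to match reality
      }
    }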

The longer thread

Drift is the operational tax of running IaC in environments where humans still have console access. You can lower the tax, but you cannot eliminate it without removing the humans, which most teams aren’t willing to do (and shouldn’t be, emergency access exists for a reason).

The flip side of drift (when the state file thinks a resource exists and the cloud says it doesn’t) is its own category, and it’s the one I want to write next. Ghost resources are weirder than drift, harder to detect, and cause a different shape of problem. That’s the December piece.

For now, if you have a terraform plan in your CI pipeline and you can’t remember the last time you ran it against your full production account, that’s the thing to do this week. Whatever it tells you, it’s better to know.

– Sid