Ghost resources in IaC: what's in the statefile that's no longer in the cloud

Drift gets all the attention. The inverse problem, where your state file says a resource exists and the cloud says it doesn't, is weirder, harder to detect, and causes a different shape of failure. Closing out 2023 with the ghost-resource problem and a note on where IaC observability has to go next.

Last month I wrote about drift, the case where your cloud has more, or different, resources than your IaC code describes. That’s the case everybody knows about. Every drift-detection product on the market is built around it. Every IaC vendor’s marketing leads with it.

The inverse is the one nobody talks about, and it’s worse.

The inverse is when your statefile says a resource exists (fully described, attributes populated, references intact) and the cloud says no such resource exists. Terraform thinks it owns something that’s not there. I have been calling these ghost resources in customer conversations, because it captures the feel of it: the state file is haunted by something that used to be real.

Closing out 2023 with this one because it’s been on my mind, and because it points at a category (IaC observability) that I expect to get a lot more attention in 2024.

How a resource becomes a ghost

The mechanics, in order of frequency:

Out-of-band deletes. Somebody opened the cloud console (or, more commonly, ran a one-off CLI command) and deleted the resource without going through Terraform. They had a reason at the time. Maybe the resource was costing money in a dev account. Maybe they were cleaning up after a failed deploy. Maybe a different automation deleted it. Either way: Terraform was never told. The state file still has the resource. The cloud doesn’t.
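
A minimal reproduction of that sequence, with a hypothetical bucket (my-logs-bucket, tracked at aws_s3_bucket.logs):

```sh
# Create the bucket through Terraform, then delete it out-of-band.
terraform apply                                   # bucket tracked in state
aws s3api delete-bucket --bucket my-logs-bucket   # nobody told Terraform

# The state file still describes the bucket; the cloud disagrees.
terraform plan   # reports the object as deleted outside of Terraform
                 # and proposes to create it again
```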

Region or account drift. This one is sneakier. A team has Terraform configured against an account ID and a region. Somebody migrates resources to a different account or region, manually, or via a script that doesn’t update the IaC. The cloud-side reality moves. The state file stays where it was. From Terraform’s perspective, the resources have vanished.
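
The coupling that makes this sneaky is visible right in the provider block. A sketch with hypothetical values:

```hcl
# Everything in this configuration only "exists" within this account
# and region. Move a resource elsewhere and, from here, it vanishes.
provider "aws" {
  region              = "us-east-1"
  allowed_account_ids = ["111122223333"]
}
```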

AWS Config and the cloud’s eventual-consistency model. A more technical variant. AWS in particular has resources whose existence is reported with a lag, or whose state varies between Config and the live API. Terraform queries the live API. If a resource was deleted recently enough, Terraform might still see it briefly, or might see partial deletion state, or might see the resource as “exists, just unreachable.” The state file can get out of sync in the gap.

Manual terraform state rm gone wrong. Somebody tried to fix an unrelated problem by removing a resource from state, intending to re-import it. They forgot to re-import. Now the resource exists in the cloud but not in the state file. (Strictly speaking that's the mirror-image case, present in cloud but absent in state, but it interacts with the ghost case in confusing ways, especially in shared-state environments.)
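
The failure sequence, sketched with a hypothetical IAM role:

```sh
# Step one happens; step two is forgotten.
terraform state rm aws_iam_role.ci           # removed from state, on purpose
# terraform import aws_iam_role.ci ci-role   # ...never run

# Result: the role still exists in the cloud, Terraform no longer
# tracks it, and the next plan proposes to create it again. The apply
# then fails on the name conflict.
```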

Cross-stack drift. Two Terraform workspaces both reference the same resource via data sources or remote state. Workspace A deletes the resource. Workspace B doesn’t know. Workspace B’s plan output is suddenly hallucinating references to a thing that isn’t there.
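
Workspace B's side of that coupling typically looks like this (hypothetical backend, names, and outputs):

```hcl
# Workspace B reads workspace A's outputs. If A deletes the subnet,
# this reference dangles, and B's plan breaks without B changing anything.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state"                  # hypothetical
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0abc1234def567890"     # hypothetical
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.subnet_id
}
```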

That last one is the variant that bit a customer engagement I was on earlier this year. The team had split their Terraform across five workspaces. Cross-references between workspaces were carrying assumptions about resources existing. One delete in one workspace cascaded into broken plans in the other four. Took them a week to figure out.

What it looks like in practice

The symptom is usually one of three things:

terraform plan proposes to recreate a resource you didn't change. This is the friendliest version. Terraform looks at the state file, looks at the cloud, sees the cloud is missing something the config still describes, and proposes to "fix" it by creating it again. Sometimes that's actually fine and the team accepts the recreation. Sometimes the recreated resource has different attributes than the original (a new IP address, a new ID, a new ARN) and downstream things break.

terraform apply fails with a “resource not found” error mid-apply. Less friendly. Terraform tries to update a resource that the state file says exists, queries the cloud, gets a 404, and the apply errors out partway. Now you have a partial state, and the apply has to be rerun, sometimes after manual cleanup.

terraform refresh produces a state file that’s smaller than the previous one. This is the one teams don’t notice until later. A refresh against a cloud that’s missing some of the resources will quietly remove them from the state file. If the team isn’t tracking the size and shape of their state across refreshes (which most teams aren’t) this happens silently.
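
One way to catch that third symptom instead of discovering it later is to snapshot the resource list around the refresh. A minimal sketch:

```sh
# Compare what the state tracks before and after a refresh.
terraform state list | sort > before.txt
terraform apply -refresh-only     # or the legacy: terraform refresh
terraform state list | sort > after.txt
diff before.txt after.txt         # lines marked "<" were dropped by the refresh
```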

The first symptom is the loud one. The third is the dangerous one.

State vs cloud: the reconciliation dance

When you find a ghost, you have three tools to bring state and cloud back into agreement:

terraform refresh. Pulls the cloud's view of the world into the state file. Resources that no longer exist in the cloud get removed from the state file. (Newer Terraform deprecates the standalone command in favor of terraform apply -refresh-only, which does the same thing but shows you the proposed state changes and asks for confirmation first.) This is the right answer when you trust the cloud and want to bring the state file in line.

terraform state rm. Surgical removal of a specific resource from state. Use this when you know exactly which resource is the ghost and you want to remove it without disturbing the rest of the state. Common after out-of-band deletes when you don’t want a full refresh.

terraform import (after-the-fact recreation). If you want the resource back, recreate it in the cloud, drop the stale entry with terraform state rm (import refuses an address that is already tracked), then re-import it under the same Terraform address. Continuity of references is preserved. This is the right answer when the cloud-side deletion was a mistake.
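
Side by side, with a hypothetical database instance as the ghost:

```sh
# Option 1: trust the cloud, sync state to reality.
terraform apply -refresh-only

# Option 2: surgically drop one ghost, touch nothing else.
terraform state rm aws_db_instance.reports

# Option 3: bring it back. Recreate the DB in the cloud first, then
# drop the stale entry and re-import under the same address.
terraform state rm aws_db_instance.reports
terraform import aws_db_instance.reports reports-db
```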

The flowchart in my head, when I’m walking a team through this:

  1. Is the resource supposed to exist? If yes, recreate and re-import. If no, remove from state.
  2. Is anything else referencing it? If yes, find and fix the references before you do anything (see the sketch after this list). Otherwise you'll cascade the problem.
  3. Are you confident the rest of the state file is correct? If no, run terraform state list and check each resource manually against the cloud before you refresh. A bulk refresh against a substantially out-of-sync state can do more damage than the ghosts.
  4. Document what you did. Especially the state rm cases. These are the changes that show up six months later in audits.
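
For step 2, a quick reference check before touching anything, assuming the ghost is a hypothetical aws_subnet.private:

```sh
terraform state list                               # what this workspace tracks
grep -rn 'aws_subnet.private' --include='*.tf' .   # config-side references
terraform graph | grep 'aws_subnet.private'        # dependency-graph edges
```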

Why this matters more than people think

The reason I’m writing this piece in December instead of June is that I’ve watched the consequences of ghost resources accumulate across the year, and they’re worse than the conventional drift case in a few specific ways.

Audit failures. A clean audit requires that the IaC, the state file, and the cloud all agree on what exists. Drift breaks the cloud-vs-IaC alignment. Ghosts break the state-vs-cloud alignment. Auditors notice both, and ghosts are weirder to explain. “We have state for resources that don’t exist anywhere” is not a sentence that lands well with a compliance reviewer.

Costs you can't see. Sort of. Ghosts don't directly cost cloud money; by definition, the resource isn't running. But they do cost engineering time. Every time a ghost resource shows up in a plan as a proposed recreation, somebody has to triage it. Every time an apply fails partway because of a ghost, somebody has to clean up. Multiply that across hundreds of resources and dozens of workspaces and the cost is real.

Plan noise. This is the one I think is most underrated. Every ghost resource is a block of noise in terraform plan output. If you have ten of them, your plan carries ten spurious resource changes. If you have a hundred, your plan output is unreadable, and your team stops reading it carefully. The team stops noticing the real issues because they're buried in the ghost noise. This is the same dynamic that kills alerting systems when the false-positive rate goes up.

Trust in the state file decays. Same loop I described in the drift piece. Once a team can’t trust the state file, they stop trusting the pipeline. Once they stop trusting the pipeline, they stop using it consistently. Once they stop using it consistently, the ghosts and the drift both grow. The loop closes.

What I’m telling teams as we head into 2024

The advice is consistent with the drift piece, with one addition:

Track state file size and shape over time. A simple line graph of “how many resources are in this state file” plotted weekly is enough to catch most ghost-creation events. If the line drops unexpectedly, something deleted resources you didn’t know were being deleted. If the line grows unexpectedly, something is creating resources outside your visibility.
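
The metric is cheap to produce. A minimal sketch, assuming jq and a state you can pull:

```sh
# One number per workspace per week: how many resource instances
# does the state file track?
count=$(terraform state pull | jq '[.resources[].instances[]] | length')
echo "$(date -u +%F) ${count}" >> state-size.log
```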

Run terraform refresh more often than you apply. A scheduled refresh against every workspace, with the diff posted somewhere human-readable, is the equivalent of running a git status against your infrastructure. It catches both drift and ghosts.
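
A scheduled, refresh-only version of that, sketched as a CI job; post_findings is a hypothetical notifier:

```sh
# -refresh-only reports state-vs-cloud differences without proposing
# config changes and without writing anything.
# -detailed-exitcode: 0 = in sync, 2 = differences found, 1 = error.
terraform plan -refresh-only -detailed-exitcode -no-color > refresh-diff.txt
case $? in
  0) echo "state matches cloud" ;;
  2) post_findings refresh-diff.txt ;;   # hypothetical: Slack, ticket, etc.
  *) echo "plan failed" >&2; exit 1 ;;
esac
```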

Treat state files as production data. Back them up. Version them. Don't state rm without a backup. Don't manually edit them without a backup. The state file is the only authoritative record of what Terraform thinks it owns; losing it or corrupting it costs more than losing the cloud resources themselves.
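
Concretely, a minimal backup habit before any state surgery:

```sh
# state pull works against local and remote backends alike.
terraform state pull > "state-backup-$(date -u +%Y%m%dT%H%M%SZ).tfstate"
terraform state rm aws_sqs_queue.orders    # hypothetical ghost
# If the surgery goes wrong, restore (push may need -force if the
# state serial has moved on):
#   terraform state push state-backup-<timestamp>.tfstate
```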

Make cross-workspace dependencies explicit and audited. Every data source or remote_state reference is a place where one workspace’s ghosts can affect another workspace’s plan. Inventory them. Document them. Re-evaluate them quarterly.
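
A blunt first pass at that inventory, run from the repo root:

```sh
# Every terraform_remote_state reference is a cross-workspace edge.
grep -rn 'terraform_remote_state' --include='*.tf' .
# Data sources more broadly, counted by type:
grep -rhoE 'data "[a-z0-9_]+"' --include='*.tf' . | sort | uniq -c | sort -rn
```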

Add a "state-vs-cloud reconciliation" item to your IaC hygiene checklist. Quarterly is enough for most teams. The same way you do an access review or a dependency audit, do a state review. It will surface ghosts and drift together, which is the right framing: they're two sides of the same observability problem.

The 2023 wrap

I’m going to land this piece on the broader thread, because it’s the year-end one and I get to.

The thing that became clear to me across the year, working through customer engagements on the multi-cloud Terraform pattern, codification projects, drift demos, and now this ghost-resource problem, is that IaC observability is the missing primitive. We have observability for applications. We have observability for infrastructure runtime: metrics, logs, traces. We don't have observability for the IaC pipeline itself. We don't have a clean way to say "here is what my state files contain, here is what my cloud actually has, here is the delta, here is the trend over time, here are the anomalies."

The vendors that crack that primitive (that turn IaC from a set of files into an observable system) are the ones that win the next era of the category. The category has been stuck for a while in the “write the HCL, run the plan, run the apply, hope” loop. The next move is making the whole loop visible.

One of the BSL / OpenTofu implications runs into this directly: a healthier, more pluggable IaC core lowers the cost of building observability on top of it. The fork might end up mattering more for what it enables in adjacent tooling than for the engine itself.

That’s the thread to watch in 2024. I’ll write more on it as the patterns shake out. For now, happy holidays, go check your state files.

- Sid