Lifecycle hooks for production Terraform: prevent_destroy and ignore_changes

Lifecycle hooks are the part of Terraform that looks trivial in the docs and saves you from a six-figure outage in practice. Here's how prevent_destroy and ignore_changes actually get used in production, what to put them on, what not to, and the operations cost of getting it right.

Every Terraform-in-production conversation eventually arrives at the same uncomfortable moment. Somebody runs terraform apply against an environment that has been running for two years, and the plan output proposes to destroy and recreate a resource that the entire business is sitting on top of.

The lifecycle block is the language feature that exists to prevent that moment from becoming an outage.

I’ve been writing variations of the same lifecycle { ... } snippet into customer demos for the last several months (same three patterns, slightly different resource types), and at some point you start to notice that the three patterns are doing very different jobs, and that most people use them interchangeably.

This piece is about that difference.

The three blocks that actually matter

There are four arguments inside a lifecycle block in current Terraform: create_before_destroy, prevent_destroy, ignore_changes, and replace_triggered_by. The first and the last are useful in specific situations and I’ll get to them. The middle two are the ones you reach for almost every week, and they are also the two that get used wrong most often.

Concrete examples first, from the acme-aws-prod module I run on demos:

On the IAM role that the GitHub Actions runner assumes via OIDC:

resource "aws_iam_role" "ci_deploy" {
  name = "acme-prod-ci-deploy"
  # ... assume_role_policy, etc.

  lifecycle {
    prevent_destroy = true
  }
}

On the security group that protects the application tier:

resource "aws_security_group" "app_tier" {
  name   = "acme-prod-app-tier"
  vpc_id = aws_vpc.main.id
  # ... ingress, egress, etc.

  lifecycle {
    ignore_changes = [revoke_rules_on_delete, timeouts]
  }
}

And on a Lambda function that ships behind an API Gateway:

resource "aws_lambda_function" "api_handler" {
  function_name = "acme-prod-api-handler"
  # ... role, handler, runtime, etc.

  lifecycle {
    ignore_changes = [publish, timeouts]
  }
}

Three different blocks, three different jobs. Worth taking them one at a time.

prevent_destroy, the seatbelt for the things you can’t replace

prevent_destroy = true does exactly what the name says. If a plan ever proposes to destroy this resource (for any reason, including a downstream change you didn’t expect) Terraform refuses to even produce the plan. It errors out. You cannot apply your way through it. You have to go back to the code, remove the lifecycle block, and try again.

That sounds annoying, and it is, and it is the entire point.
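
What that refusal looks like in practice, roughly (paraphrased from memory; exact wording and formatting vary by Terraform version), against the ci_deploy role from the earlier snippet:

$ terraform plan

Error: Instance cannot be destroyed

Resource aws_iam_role.ci_deploy has lifecycle.prevent_destroy set, but the
plan calls for this resource to be destroyed. To avoid this error and continue
with the plan, either disable lifecycle.prevent_destroy or reduce the scope of
the plan using the -target flag.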

The resources that earn a prevent_destroy in the customer engagements I’ve been on look like a short, boring list:

  • The production database. Whether it’s RDS, Cloud SQL, or something else, the resource that holds the data your company sells is the standard case (sketched just after this list). If a Terraform refactor proposes to destroy it because somebody changed a parameter that triggers a recreate, the right behavior is for the plan to fail loudly.
  • IAM roles that your CI assumes. Specifically the role that the pipeline itself uses to apply Terraform. If your CI’s deploy role gets destroyed, the next pipeline run can’t run, and the recovery path involves going outside Terraform to recreate the role manually, which is exactly the situation you set up Terraform to avoid.
  • State-bearing resources generally. S3 buckets that hold artifacts. ECR repositories with images. The KMS key that decrypts other secrets. Any resource whose identity is referenced by external systems that you can’t trivially update.
  • Network primitives that other accounts trust. The VPC peering connection that a partner account expects. The Route53 zone that has NS records propagated across the internet. The transit gateway attachment that a hundred routes depend on.
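
The database case from the first bullet, as a sketch (the instance name and settings here are illustrative, not from the demo module):

resource "aws_db_instance" "primary" {
  identifier     = "acme-prod-primary"
  engine         = "postgres"
  instance_class = "db.r6g.large"
  # ... storage, credentials, subnet group, etc.

  lifecycle {
    # The data in this instance cannot be recreated from code. Any plan that
    # wants to destroy it should fail loudly and force a human decision.
    prevent_destroy = true
  }
}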

What does not belong on this list:

  • Stateless compute (Lambda functions, ECS services, EC2 instances behind an autoscaler). The whole point of stateless compute is that you can replace it.
  • Anything in a dev environment. prevent_destroy in dev means a developer can’t terraform destroy their own sandbox, which forces them to learn how to comment out the block, which then becomes a habit, which then defeats the seatbelt entirely.
  • Resources you’ve never thought hard about. The block exists for the cases where you have deliberately decided this resource is irreplaceable. Sprinkling it on by default is worse than leaving it off, because the day you actually need to destroy something, you have a five-line code change instead of a one-line code change.

The operations cost of prevent_destroy is real and worth naming. When you genuinely need to destroy a resource that has the block on it, the process is: open a PR that comments out the lifecycle block, get it reviewed, merge it, run apply (which now succeeds), then open a second PR that uncomments the block on whatever-replaces-the-resource, get it reviewed, merge that. Two PRs. Two reviews. Annoying.

But the annoyance is the feature, not the bug. The two-PR process is what makes “destroy production database” something that takes thirty minutes of human attention instead of thirty seconds of accidental keyboard. I will take that trade every time.

ignore_changes, the truce with the provider

ignore_changes does a completely different thing. It tells Terraform: this attribute exists in the resource, but I don’t want plans to react to it. If the cloud-side value diverges from the code-side value on this attribute, don’t propose a change.

This sounds like a way to hide drift, and people sometimes use it that way, and it’s a bad idea. The right uses of ignore_changes are narrower than the wrong ones.

The good uses, all of which show up in the demo module:

Provider-introduced defaults. When you write an aws_security_group resource and don’t specify revoke_rules_on_delete, the provider materializes a default. When the provider version changes, the default sometimes changes. You don’t want your plan to suddenly propose to update an attribute you never wrote in the first place. Ignoring it is the right call.

Fields the cloud rewrites server-side. Lambda’s publish field is the standard example. If you have an external deploy pipeline that publishes new versions of the function, the publish value on the resource is being updated by something other than Terraform. The next plan you run will propose to set it back. The honest thing is to acknowledge that this attribute is owned by the deploy pipeline, not by the IaC, and tell Terraform to leave it alone.

Timeouts blocks. Specifically the case where the provider changes the default timeouts between versions. You don’t want a terraform plan that’s supposed to be a no-op to start showing a diff because the provider bumped its default deletion timeout from 10 minutes to 15.

Tags managed by external systems. A common one: a cost-allocation tool adds tags to every resource for billing attribution. Those tags are real, in AWS, but they aren’t in the Terraform. If you don’t ignore_changes = [tags] (or, more precisely, the specific tag keys), every plan will propose to strip them. Worse: every apply will actually strip them, the cost-allocation tool will put them back, and you’ve created a flap loop.
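
A minimal sketch of the narrow version, assuming a hypothetical billing tool that owns a CostCenter tag (the bucket and tag names are illustrative):

resource "aws_s3_bucket" "artifacts" {
  bucket = "acme-prod-artifacts"

  tags = {
    Environment = "prod"
  }

  lifecycle {
    # Ignore only the key the cost-allocation tool writes, not the whole map.
    ignore_changes = [tags["CostCenter"]]
  }
}

The AWS provider also has a provider-level ignore_tags block that does the same job across every resource in the configuration, which is usually the better home for this once more than one resource is involved.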

The bad uses:

ignore_changes = all. This argument exists. Almost nobody should use it. It tells Terraform to ignore every attribute on the resource, which means the resource is now effectively unmanaged after creation. If you’ve reached for ignore_changes = all, what you actually wanted was probably not to put the resource in Terraform at all. The honest version of “I don’t want Terraform to manage this resource” is “this resource is not in Terraform.”
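
If that is where you have landed, the cleaner exit on Terraform 1.7 or later is a removed block, which drops the resource from state without touching the real infrastructure (the queue name here is hypothetical); on older versions, terraform state rm does the same job imperatively:

removed {
  from = aws_sqs_queue.legacy_events

  lifecycle {
    # Forget the resource in state; do not destroy the real queue.
    destroy = false
  }
}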

Ignoring an attribute because the plan is annoying. If terraform plan keeps proposing to revert a manual change somebody made, the answer is usually to update the code to match reality, not to ignore the attribute. ignore_changes is for attributes you have decided, deliberately, are not yours to own. It is not a way to silence the plan output when reality and code disagree.

The distinction I try to teach on these calls: ignore_changes is for fields where ownership is shared with somebody else (the provider, another pipeline, a sidecar tool). It is not for fields where ownership is yours but you can’t be bothered.

The two I haven’t talked about

create_before_destroy is the third member of the lifecycle family. It does what the name says: when a resource needs to be replaced, create the new one first, then destroy the old one, instead of the default (destroy first, then create). This matters mostly for resources with name uniqueness constraints (IAM roles, security groups, load balancers) where you want the new version standing before the old version goes away, which in practice means pairing it with name_prefix (or no fixed name at all) so the replacement can coexist with the original for a moment. It’s a quality-of-life feature, not a safety feature.
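
A sketch of the usual shape, reusing the app-tier group from earlier; the switch to name_prefix is what lets the replacement coexist with the original while both exist:

resource "aws_security_group" "app_tier" {
  # A fixed name would collide with the old group during replacement, so let
  # the provider generate a unique suffix instead.
  name_prefix = "acme-prod-app-tier-"
  vpc_id      = aws_vpc.main.id
  # ... ingress, egress, etc.

  lifecycle {
    create_before_destroy = true
  }
}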

replace_triggered_by is newer. It lets you say: replace this resource whenever some other resource (or attribute) changes. Useful for things like “redeploy this Lambda whenever this layer version changes.” It is more elegant than the older null_resource + triggers pattern, and I am slowly transitioning customer demos over to it.
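
The Lambda-and-layer case from that sentence, sketched against the api_handler function from earlier (the layer resource is hypothetical, and the argument needs Terraform 1.2 or later):

resource "aws_lambda_layer_version" "deps" {
  layer_name = "acme-prod-api-deps"
  filename   = "deps.zip"
  # ... compatible_runtimes, etc.
}

resource "aws_lambda_function" "api_handler" {
  function_name = "acme-prod-api-handler"
  layers        = [aws_lambda_layer_version.deps.arn]
  # ... role, handler, runtime, etc.

  lifecycle {
    # Replace the function whenever the layer version resource changes; in the
    # real module this sits alongside the ignore_changes list shown earlier.
    replace_triggered_by = [aws_lambda_layer_version.deps]
  }
}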

Neither of these two is as load-bearing as the first two. If you only learn prevent_destroy and ignore_changes, and learn them well, you’ve covered ninety percent of the cases.

The pattern in a code review

What I look for when I’m reviewing somebody else’s production Terraform:

  1. Every state-bearing resource has prevent_destroy = true. Database, KMS key, S3 buckets that hold real data, IAM roles your CI assumes. If the list is empty, that’s a finding.
  2. No ignore_changes = all anywhere. Hard rule. If you see it, the right next question is “what do you actually want this resource to do,” and the answer is almost never “be in Terraform but unmanaged.”
  3. The ignore_changes lists are short and specific. [publish, timeouts] is fine. [publish, timeouts, role, runtime, handler, environment] means somebody gave up.
  4. The prevent_destroy blocks have a comment explaining why (there’s a sketch after this list). Not because Terraform requires it. Because the next person to touch this code, in six months, will not remember why this specific resource has a seatbelt. Write the reason down.
  5. No lifecycle blocks on resources where they don’t belong. A prevent_destroy on a Lambda function is almost always wrong. Lambdas are supposed to be replaced. A prevent_destroy on a data source is impossible; data sources can’t be destroyed. If you see these, somebody copy-pasted.
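
What points 1 and 4 look like together, sketched on a KMS key (the key and its description are illustrative):

resource "aws_kms_key" "secrets" {
  description         = "acme-prod secrets encryption"
  enable_key_rotation = true

  lifecycle {
    # Every secret in this account is encrypted under this key. A replacement
    # key cannot decrypt the old ciphertext, so destroying it means losing
    # every secret. Do not remove this without a documented migration plan.
    prevent_destroy = true
  }
}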

The fifth point is the one that takes longest to develop a feel for. Lifecycle hooks are not free. Every block you add is a constraint on future you. The right way to think about them is as a deliberate, narrow, intentional concession to the fact that not every resource is equally replaceable.

The close

The Terraform documentation describes lifecycle hooks in about a page and a half. That page and a half is doing more load-bearing work than almost any other part of the language, because it is the part of the language that admits something the rest of the language doesn’t: your code and your cloud are out of sync more often than you think, and the right behavior in the gap is usually “do nothing, and tell the human.”

prevent_destroy is the seatbelt. ignore_changes is the truce. Use them sparingly, write the reason down, and the next time terraform apply proposes something terrifying, the language will catch you before the cloud does.