Terraform state on AWS: S3 and DynamoDB, done right
The S3-plus-DynamoDB backend is the most common Terraform state setup in the world, and the most commonly misconfigured. Here are the versioning, encryption, lock-table, and cross-account patterns that hold up across customer engagements, and the failure modes that take teams a week to debug.
Every Terraform engagement I’ve worked on has eventually had a conversation about state backends, and most of them have the conversation a year too late. The default behavior of starting with a local state file, getting something working, and then migrating to a remote backend “when the project grows up” produces a predictable outcome: the migration happens under pressure, in the middle of an incident, with state corruption already in flight. The right time to set up the backend is the first hour of the first repo.
On AWS, the standard pattern is S3 for the state object plus DynamoDB for the lock. It’s been the standard pattern for so long that most teams treat it as solved, until they have to fix something. This piece is the writeup I wished I’d had the first time I had to fix something.
The patterns below are what I’ve ended up with after enough customer engagements that I now ship a backend.tf plus a small bootstrap module on the first day of every new AWS environment. The shape is consistent. The variations are small. The failure modes are the same every time.
The two files
The setup splits cleanly into two files that live in different repos.
backend.tf in every workload repo. This is the file that points the workload at its remote state. It’s tiny:
terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "workloads/app/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tfstate-lock"
    encrypt        = true
  }
}
That’s the operational artifact. Every workload repo has one. The key differs per workload. The bucket, region, and dynamodb_table are constants for the environment.
app_state.tf in the infrastructure-state repo. This is the file that creates the bucket and the lock table in the first place. It lives in a separate repo (not in any workload repo) and gets applied once per environment, bootstrapping the backend before any workload exists. It’s longer:
resource "aws_s3_bucket" "tfstate" {
bucket = "acme-tfstate-prod"
}
resource "aws_s3_bucket_versioning" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.tfstate.arn
}
bucket_key_enabled = true
}
}
resource "aws_s3_bucket_lifecycle_configuration" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
rule {
id = "expire-noncurrent-versions"
status = "Enabled"
noncurrent_version_expiration {
noncurrent_days = 90
}
}
}
resource "aws_s3_bucket_public_access_block" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_dynamodb_table" "tfstate_lock" {
name = "acme-tfstate-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
point_in_time_recovery {
enabled = true
}
}
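One resource is referenced above but not shown: the key behind kms_master_key_id. A minimal sketch of the aws_kms_key.tfstate I'd pair with that SSE block (the description, rotation, and deletion-window values are my defaults, not requirements):

resource "aws_kms_key" "tfstate" {
  description             = "Terraform state bucket encryption"
  enable_key_rotation     = true # automatic annual rotation
  deletion_window_in_days = 30   # maximum runway to catch an accidental key deletion
}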
These are the two files. Everything below is about why each line is the way it is, and what happens when one of them is missing.
S3 versioning is not optional
The single most expensive lesson teams learn about this setup is what happens when versioning is off and somebody corrupts the state file.
A corrupted state file can happen in several ways: a Terraform run that crashed mid-write, two engineers running apply at the same time (more on the lock table in a moment), an out-of-band edit, a partial restore from backup. Any of these can leave the state object in a form Terraform can't parse. The state file is JSON. JSON either parses or it doesn't.
If versioning is on, the recovery is pretty simple: list versions of the state object, find the one from before the corruption, restore it. Minutes of work.
If versioning is off, the recovery is “restore from yesterday’s backup, replay every change since yesterday.” Days of work, if the backup even exists.
Versioning costs essentially nothing for state-sized objects. The bucket-level lifecycle rule above expires noncurrent versions after 90 days, which bounds the cost. Turn versioning on the day you create the bucket. Do not “turn it on later.”
The related gotcha: if you do have to roll back to a previous version, you restore the old version as the current version via aws s3api copy-object, then run terraform refresh. There is no way to point Terraform at an older version directly: it always reads and writes the current version, so the old data has to become the current version.
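Concretely, the rollback looks roughly like this, reusing the bucket and key from the backend.tf above (KNOWN_GOOD_VERSION_ID is a placeholder for whichever version list-object-versions shows as the last good write):

# find the last known-good version of the state object
aws s3api list-object-versions \
  --bucket acme-tfstate-prod \
  --prefix workloads/app/terraform.tfstate

# copy that version over the current one; the copy becomes the new current version
aws s3api copy-object \
  --bucket acme-tfstate-prod \
  --key workloads/app/terraform.tfstate \
  --copy-source "acme-tfstate-prod/workloads/app/terraform.tfstate?versionId=KNOWN_GOOD_VERSION_ID"

# reconcile Terraform's view of the world against the restored state
terraform refresh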
Encryption: KMS, with a bucket key
The encryption story is straightforward, with one knob worth getting right: the bucket_key_enabled = true line in the SSE configuration.
S3 bucket keys are a relatively recent feature (late 2020) that dramatically cut the cost of KMS-encrypted S3 by caching the data-encryption-key derivation at the bucket level rather than the object level. Without a bucket key, every PUT and GET on the state file calls KMS, and KMS API calls are not free. With a bucket key, the per-object KMS cost drops by roughly an order of magnitude on busy buckets.
The state bucket is not "busy" in the high-throughput sense, but Terraform CI runs hit the state object many times per apply: read, plan, write, refresh, read again. The bucket key pays for itself within the first hundred runs.
The other encryption choice is customer-managed KMS keys vs AWS-managed. I default to customer-managed (aws_kms_key.tfstate above) because it lets you deny access to the state file at the KMS level (useful for revoking a specific role's access without touching the bucket policy), and because the auditability is cleaner. The downside is one more resource to manage. For low-blast-radius environments the AWS-managed key is fine.
DynamoDB lock table, on-demand billing, point-in-time recovery on
The lock table is small. It holds a single item per state file while a lock is held, and is empty the rest of the time. The whole table is in the kilobytes.
The two knobs that matter:
Billing mode. Use PAY_PER_REQUEST (on-demand). The table sees a few requests per minute at most. Provisioned capacity is more expensive at this scale, and its burst behavior is exactly wrong for state locking: you want lock acquisition to be instant, not throttled because the burst tokens ran out.
Point-in-time recovery. Turn it on. The lock table is small enough that PITR costs are negligible, and the recovery story matters: if the table is accidentally dropped (or, more commonly, an engineer running terraform destroy on the bootstrap module forgets to exclude the lock table), you can restore the table without losing the locks that were in flight. Without PITR, dropping the table while a Terraform run is in progress can leave the in-flight run unable to release its lock, and the next run unable to acquire one.
The smaller gotcha: the hash_key must be LockID and the attribute type must be S (string). Terraform writes the lock ID in a specific format that the backend code expects. Don’t customize the schema.
Why the lock table failure modes are the worst
Most state-backend incidents I’ve debugged on customer engagements trace back to the lock table. The state object itself is well-understood and well-protected. The lock table is the part teams treat as an afterthought, and it’s the part that bites.
The shapes I’ve seen:
Lock table missing. Somebody applied the bootstrap module without the DynamoDB resource, or ran a destroy that included the lock table, or the dynamodb_table line never made it into backend.tf. The last variant is the quiet one: without that line, the S3 backend doesn't error; it just doesn't lock. (If the line is present but the table is gone, runs at least fail loudly.) Two engineers run apply at the same time. The state file ends up reflecting one engineer's changes overwritten by the other's. Half the resources are in the wrong state. Recovery is hours.
Lock table in a different region from the state bucket. Possible to misconfigure. Some Terraform runs report “lock acquired,” some report “lock failed,” depending on which region the runner is in and how it resolves the DynamoDB endpoint. Symptom is inconsistent locking behavior that’s hard to reproduce.
Lock table dropped while a lock is held. The Terraform run that held the lock cannot release it. Once the table is recreated, subsequent runs see no lock and grab one. terraform force-unlock doesn't help because the lock item doesn't exist anymore. The state file is still valid; the locking guarantee was temporarily off.
Lock table without the right schema. Custom hash key, wrong type, table name typo. The backend errors loudly here, which is the friendly version of the failure mode. At least you find out before any state writes happen.
The rule of thumb: if your bootstrap module gets edited, the lock table should be the last thing touched and the most-tested. Treat it as a first-class resource, not as a one-liner companion to the bucket.
The cross-account state-bucket pattern
For any organization with more than one team or more than one environment, the next question is where the state buckets live. The two patterns are:
Per-team state buckets, in each team’s own account. Simple ownership story. Each team manages their own bucket, their own lock table, their own IAM. The downside is multiplication: ten teams means ten state buckets to monitor, ten lifecycle rules to keep current, ten places to look when something is wrong.
A shared infra-state account, holding state for everyone. One bucket per environment (prod, staging, dev), with prefixes per team or workload. Cross-account IAM grants the workload accounts read-write access to their own prefix. The infra-state account is small, security-focused, audited.
I default to the shared-infra-state-account pattern on every engagement, for reasons that have nothing to do with operational efficiency:
Recovery is the thing. When you need to restore state from a 60-day-old version because somebody applied a destroy they shouldn't have, you want exactly one place where the state lives. Not ten places, not ten different backup policies, not ten different teams' on-call rotations. One place.
Auditing is the thing. Compliance teams want one bucket to point CloudTrail at. One KMS key to track. One IAM policy to review. The per-team-bucket pattern multiplies the audit surface.
Cross-team state references work. If team A's workload needs to know team B's VPC ID, and both states are in the same bucket, the cross-state read is a one-account IAM problem (see the sketch after this list). With per-team buckets, you're crossing accounts for every reference, which works but is more IAM to maintain.
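The cross-state read itself is a terraform_remote_state data source. A sketch, assuming team B keeps its network state under a workloads/network prefix in the same bucket and declares vpc_id as an output (both names are hypothetical):

data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-tfstate-prod"
    key    = "workloads/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# usable anywhere team A needs team B's VPC
locals {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}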
The IAM shape for the shared pattern: the workload account assumes a role in the infra-state account, scoped tightly to its prefix. The role allows s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket on the workload’s specific key, plus dynamodb:GetItem, dynamodb:PutItem, dynamodb:DeleteItem on the lock table. Nothing else.
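A sketch of that policy as an aws_iam_policy_document, using the example names from earlier (the account ID and the workloads/app key are placeholders):

data "aws_iam_policy_document" "workload_state_access" {
  # read-write on this workload's state object, and nothing else in the bucket
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/workloads/app/terraform.tfstate"]
  }

  # listing, scoped to the workload's own prefix
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-tfstate-prod"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["workloads/app/*"]
    }
  }

  # lock operations on the shared lock table
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/acme-tfstate-lock"]
  }
}

On the workload side, backend.tf grows the role assumption (Terraform 1.6+ syntax; older versions take a top-level role_arn argument instead):

terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "workloads/app/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tfstate-lock"
    encrypt        = true

    assume_role = {
      role_arn = "arn:aws:iam::123456789012:role/tfstate-app" # hypothetical role name
    }
  }
}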
Things that break, the full list
In rough order of how often I’ve seen each on customer engagements:
The lock table is missing or dropped. Already covered above. Most common state-backend incident I’ve debugged.
S3 versioning is off. Sometimes because the bucket was created before the versioning resource, sometimes because someone disabled it for cost reasons years ago. Symptom: corruption recovery is much harder than it should be.
Cross-region replication on, with read-after-write surprises. If the state bucket has CRR enabled to a DR region, and a CI runner happens to read from the replication target rather than the source (this can happen with custom endpoint configurations), the runner can see stale state. The S3 eventual-consistency story changed in late 2020 (read-after-write is now strongly consistent within a region) but cross-region replication is still asynchronous. The fix is to make sure all readers go to the source bucket. The general rule: don’t enable CRR on the state bucket unless you have a specific DR requirement that justifies it.
Encryption changed mid-life. Someone switched from AES-256 to KMS, or rotated the KMS key, and the existing state object can’t be decrypted with the new configuration. The fix is usually re-encrypting the existing object via aws s3 cp with the new SSE settings. The painful version is when the old key was deleted, in which case you’re restoring from a version that predates the change.
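The re-encryption is an in-place copy with new SSE settings, roughly like this (the key ARN is a placeholder):

aws s3 cp \
  s3://acme-tfstate-prod/workloads/app/terraform.tfstate \
  s3://acme-tfstate-prod/workloads/app/terraform.tfstate \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:us-east-1:123456789012:key/NEW_KEY_ID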
Public access block missing. Catastrophic when it happens, easy to prevent. Apply the aws_s3_bucket_public_access_block resource the day you create the bucket. Have a Config rule or SCP that prevents state buckets from being public, full stop.
Lifecycle rule too aggressive. A 7-day noncurrent-version expiration looks fine until you need to roll back to a 60-day-old version. Set it long enough to recover from real incidents (90 days is my default), but bounded so the cost doesn't spiral.
Wrong account. This is the one nobody talks about. An engineer assumes a role in the wrong account, runs terraform init, and their credentials happen to reach the state buckets in both environments. Terraform writes the state for the prod workload into the dev bucket. Recovery is "copy the state object to the right bucket, fix the local config, never speak of this again." Prevention is naming the state buckets distinctly, and SCPs that deny s3:PutObject to state buckets from anywhere except the expected account.
What to do, concretely
If you’re setting up a new AWS environment, the moves I’d make on day one:
Create the state bucket and lock table before anything else. The first Terraform run in any environment should be the bootstrap module that creates these resources. Apply it from a local state file once, then migrate the bootstrap's own state into the bucket it just created (see the sketch after this list).
Turn on versioning and PITR from the start. Both are nearly-free insurance. The lifecycle rule on noncurrent versions bounds the cost.
Default to a shared infra-state account. Per-team buckets multiply the operational surface for no real benefit.
Use a customer-managed KMS key with a bucket key enabled. The auditability beats AWS-managed; the bucket key keeps the cost down.
Pay-per-request billing on the lock table, with PITR on. Provisioned capacity is wrong for this workload.
Write the bootstrap module once and copy it. Every customer engagement I do starts with essentially the same app_state.tf. The shape is solved. Don’t reinvent it.
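For the first move, the bootstrap-then-migrate dance looks roughly like this, run from the infrastructure-state repo:

# the first apply runs against a local state file; the backend doesn't exist yet
terraform init
terraform apply

# now add a backend "s3" block pointing at the bucket apply just created,
# then move the bootstrap's own state into it
terraform init -migrate-state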
The longer thread
The state backend is the most under-discussed piece of a Terraform setup. Most of the writing about Terraform is about resources, modules, plan output, drift, the language. The state file is treated as an implementation detail.
It isn’t. The state file is the database that records what your IaC believes about reality. If the database is unavailable, your IaC is blind. If it’s corrupted, your IaC will try to fix things that don’t need fixing. If it’s missing entries, your IaC will create duplicates. Everything else in Terraform (the language, the plan, the apply) sits on top of the state file’s integrity.
Treat the state backend as infrastructure in its own right. Bootstrap it deliberately. Version it. Encrypt it. Audit it. The day you need to restore from a 60-day-old version is not the day to discover the bucket policy was wrong.
— Sid