Provider version pinning: the audit nobody runs until something breaks

Provider version pinning is one of those Terraform topics nobody thinks about until the CI runner picks up a new minor release at 2 a.m. and a hundred plans go red. Here's the audit pattern I run for customers, the trap on both sides, and the constraint style I land on by default.

The phone call always sounds the same. It’s a Wednesday. The CI pipeline that has worked every day for two years suddenly produces a plan output that proposes to recreate two hundred resources. Nothing changed in the code. The team is staring at a diff that includes the word destroy twice as many times as anything else, and somebody on the call says “I think it might be a provider version thing.”

It is, almost always, a provider version thing.

This is the piece I want to write about pinning. Not because the topic is glamorous (it isn’t) but because every team I’ve worked with this year discovered, in a postmortem, that they didn’t actually understand the constraint syntax in their own required_providers block.

What’s in the block, what isn’t

The piece of Terraform that controls this is in every module:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.8.0"
    }
  }
}

That’s the acme-aws-prod module from a customer demo I’ve been running. The thing to notice about that snippet is the version line: version = "4.8.0". Not ~> 4.8. Not >= 4.0. An exact version. Pinned.

The Azure side of the same demo:

azurerm = {
  source  = "hashicorp/azurerm"
  version = "3.68.0"
}

And the GCP side:

google = {
  source  = "hashicorp/google"
  version = "4.47.0"
}

Three providers, three pinned versions. The reason they’re pinned that way (exact, no constraint operator) is that the demo is meant to be reproducible. I want every plan to look the same the second time as the first time. Pinning gives me that.

The reason your production setup probably shouldn’t look exactly like this is that pinning to an exact version has a cost too. We’ll get to that.

The other file that matters here is .terraform.lock.hcl. It’s the file Terraform generates the first time you run terraform init, and it contains the resolved versions of every provider (even transitively) along with the SHA256 hashes of the binaries Terraform downloaded. It’s the provider equivalent of a package-lock.json or a Cargo.lock. It commits to a specific set of provider binaries, and it stops anyone (including future you, on a fresh checkout, on a different machine, on a different OS) from quietly picking up a different version.
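For reference, a single provider entry in a committed .terraform.lock.hcl looks roughly like this. The hash values below are placeholders, not real checksums:

```hcl
provider "registry.terraform.io/hashicorp/aws" {
  version     = "4.8.0"
  constraints = "4.8.0"
  hashes = [
    # Placeholder values; real entries are h1:/zh: checksums
    # of the downloaded provider archives.
    "h1:PLACEHOLDER",
    "zh:PLACEHOLDER",
  ]
}
```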

The lockfile belongs in version control. This is the single most common thing I find wrong on engagements. About one in three teams I’ve talked to this year has .terraform.lock.hcl in their .gitignore. They almost always added it on purpose, because they were tired of merge conflicts in the file. The merge conflicts were trying to tell them something. Specifically: people on the team are getting different provider versions on their workstations than the CI runner is using.

Why pinning matters more than it sounds like it does

The Terraform provider tooling makes a distinction between major, minor, and patch releases that mostly tracks semver, except when it doesn’t. The AWS provider went from 3.x to 4.x in February 2022 and broke a non-trivial number of fields: argument names changed, computed attributes moved, validation got stricter. The 4.x to 5.x bump in July 2023 was milder but still broke certain S3 and EC2 resource shapes that older modules had relied on.

Minor versions are supposed to be additive. Mostly they are. But the providers are huge (the AWS provider is north of half a million lines of generated Go) and “mostly” is doing a lot of work in that sentence. A minor version bump can introduce a new default value for an attribute you never specified, which now produces a perpetual diff. It can add validation on a field that was previously permissive, which now blocks plans that used to apply cleanly. It can change the order in which the provider walks a resource’s attributes, which surfaces a subtle correctness bug in the resource code that hadn’t been hit before.

None of these are catastrophic when they happen on your laptop, during a planned upgrade, with the team paying attention. All of them are catastrophic when they happen at 2 a.m. on a CI runner that picked up the new version because nobody pinned anything.

The audit pattern

The actual workflow I run for customers, end to end:

Step 1. Find every required_providers block in the repo. Across a multi-repo estate, find them in every repo. The shape of the command is grep -rn "required_providers" . from the root of each Terraform tree, with a couple of language-specific tweaks if the teams have moved blocks into separate versions.tf files (which they should, but often haven’t).
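Step 1 can be sketched as a small script. This is a hypothetical helper, not a tool from the post: it walks a Terraform tree with a regex rather than a real HCL parser, which is enough for well-formatted blocks like the ones above.

```python
import re
from pathlib import Path

# Matches a provider entry inside required_providers, e.g.
#   aws = { source = "hashicorp/aws"  version = "~> 5.40" }
# A regex is a rough proxy for real HCL parsing, but it covers
# the common, well-formatted case.
PROVIDER_RE = re.compile(
    r'(\w+)\s*=\s*\{[^}]*?source\s*=\s*"([^"]+)"[^}]*?version\s*=\s*"([^"]+)"',
    re.DOTALL,
)

def find_provider_pins(root: str) -> list[tuple[str, str, str]]:
    """Return (local name, source address, version constraint)
    for every provider entry found under `root`."""
    pins = []
    for tf_file in sorted(Path(root).rglob("*.tf")):
        pins.extend(PROVIDER_RE.findall(tf_file.read_text()))
    return pins
```

Feeding it a tree containing the pinned aws block from earlier yields one triple: ("aws", "hashicorp/aws", "4.8.0").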

Step 2. For each provider, capture three things: the source address (hashicorp/aws, hashicorp/azurerm, etc.), the version constraint (the actual string in the version = line), and the resolved version from .terraform.lock.hcl. The constraint and the lock are often different (the constraint says ~> 4.0 while the lock says 4.51.0, for example), and the difference matters.

Step 3. Compare each resolved version against the latest published version of that provider. The Terraform Registry’s API exposes this. A small script that hits https://registry.terraform.io/v1/providers/hashicorp/aws/versions returns the list in a few hundred bytes of JSON.
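A sketch of that script, assuming the registry’s documented /versions response shape ({"versions": [{"version": "..."}]}). The fetch needs network access, so the version-picking logic takes the decoded payload separately:

```python
import json
import urllib.request

REGISTRY = "https://registry.terraform.io/v1/providers"

def fetch_versions(source: str) -> dict:
    """Fetch the published version list for e.g. "hashicorp/aws"."""
    with urllib.request.urlopen(f"{REGISTRY}/{source}/versions") as resp:
        return json.load(resp)

def latest_version(payload: dict) -> str:
    """Pick the highest stable release from a /versions payload,
    skipping pre-releases like "5.0.0-beta1"."""
    releases = [
        v["version"] for v in payload["versions"] if "-" not in v["version"]
    ]
    return max(releases, key=lambda v: tuple(int(p) for p in v.split(".")))
```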

Step 4. Compute the gap. For each provider, in each module, how many minor versions behind is it? How many major versions? When was the resolved version released? When was the latest? The gap is your audit finding.

Step 5. Score the gap by impact. A provider that’s two minor versions behind on a module that only manages a couple of resources is low priority. A provider that’s two major versions behind on the module that owns your production network is the meeting you’re calling tomorrow.
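Steps 4 and 5 reduce to a few lines once you have the resolved and latest versions in hand. The thresholds below are illustrative, not a standard:

```python
def parse(v: str) -> tuple[int, int, int]:
    major, minor, patch = (int(p) for p in v.split("."))
    return major, minor, patch

def version_gap(resolved: str, latest: str) -> tuple[int, int]:
    """Return (major versions behind, minor versions behind)."""
    r, l = parse(resolved), parse(latest)
    major_gap = l[0] - r[0]
    # A minor gap only compares cleanly within the same major line;
    # across majors, use the latest minor as a rough proxy.
    minor_gap = l[1] - r[1] if major_gap == 0 else l[1]
    return major_gap, minor_gap

def score(major_gap: int, minor_gap: int) -> str:
    """Bucket a finding by how far behind the module is."""
    if major_gap >= 2:
        return "critical"   # the meeting you call tomorrow
    if major_gap == 1 or minor_gap > 10:
        return "high"
    if minor_gap > 2:
        return "medium"
    return "low"
```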

I have run this audit on three different customer estates this year. Every time, the same shape comes out: a handful of modules are pinned tightly to versions from years ago, a much larger set use loose constraints (>= 3.0) that have quietly resolved to whatever was the latest at the last init, and the modules that have been touched recently are inconsistent: somebody updated some of them and not others.

The pattern is not “everybody is wrong.” It’s “nobody is consistent.” Which is its own kind of wrong, because it means there’s no shared mental model for how the team relates to provider versions at all.

The trap on both sides

Pinning is a spectrum, and the trap is on both ends.

Pin too loosely. version = ">= 3.0" on the AWS provider, committed to a repo three years ago, with no .terraform.lock.hcl in version control. Every terraform init on every machine resolves to whatever the latest 3.x, 4.x, or 5.x is at that moment. The CI runner gets one version. A developer running locally gets another. Your plans aren’t reproducible. The day the provider ships a breaking change in a minor release, you find out about it at the worst possible moment, on the worst possible runner, on the worst possible module.

Pin too tightly. version = "4.8.0", committed three years ago, never touched. You’re now using a provider that has a known CVE in its EC2 metadata handling, has missed years of bug fixes, and refuses to manage the half-dozen new resource types your team would like to use. Worse, the longer you sit on the old pin, the more painful the upgrade gets, because the diff between 4.8 and 5.30 is much larger than the diff between 4.8 and 4.10.

The version-pinning trap is one of those areas where the wrong-on-both-ends shape is genuinely real, and the right answer is in the middle.

The constraint style I land on

What I actually recommend on engagements, and what the next iteration of my own demo modules will use:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}

The ~> 5.40 constraint (the “pessimistic” operator) says “any version that is greater than or equal to 5.40 and less than 6.0.” It accepts patch releases and newer minor releases within the same major version. It refuses major version bumps.
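That behavior can be modeled in a few lines. A toy sketch of the pessimistic operator, handling only the two-part "~> major.minor" form used above (Terraform’s real implementation also accepts "~> x.y.z", which floats only the patch level):

```python
def _parse(v: str) -> tuple[int, ...]:
    return tuple(int(p) for p in v.split("."))

def matches_pessimistic(constraint: str, version: str) -> bool:
    """True when `version` satisfies a two-part "~> X.Y" constraint:
    at least X.Y, and still within major version X."""
    floor = _parse(constraint.removeprefix("~>").strip())
    v = _parse(version)
    return v >= floor and v[0] == floor[0]
```

So 5.40.0 and 5.57.1 match ~> 5.40, while 5.39.9 (below the floor) and 6.0.0 (over the ceiling) do not.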

Combined with .terraform.lock.hcl committed to version control, this gives you:

  • A floor (5.40 minimum) that documents what you’ve actually tested against.
  • A ceiling (less than 6.0) that prevents an accidental major-version jump.
  • A specific resolved version (whatever’s in the lockfile) that every developer and every CI runner uses identically.

The discipline that goes with it: a quarterly PR that updates the floor of the ~> constraint to the latest minor release, regenerates the lockfile, and runs the plan against every environment. If the plan output is clean, merge. If it isn’t, the PR becomes a controlled investigation instead of a 2 a.m. firefight.

That cadence (pinned-but-moving) is the trade-off that has actually held up under contact with real teams. The constraint moves often enough that you don’t end up two majors behind. It moves slowly enough that the move is always intentional. And the lockfile makes every individual run reproducible regardless.

What this connects to

This piece sits in the same conversation as the lifecycle hooks article from a couple of weeks ago; both are examples of the same broader idea, which is that production Terraform requires you to be explicit about the things the language lets you be vague about. Vague is fine in a sandbox. In production, vague becomes a phone call at 2 a.m.

The provider lockfile is also why the OpenTofu 1.6 migration went as smoothly as it did for the teams I’ve moved over so far. The lockfile format is compatible. Providers are the same. The pins port across cleanly.

Pin loose, lock tight, audit quarterly. That’s the shape.