Cloud waste at the IaC layer: catching it before merge

Most cloud-bill surprises were already visible at PR time. The plan output knows the resource shape, the region, the size, and what the cloud charges for it, and you can read all of that out of the plan-json before anything ships. Here's the pattern, what it catches, and what it can't.

Every time I sit down with a team that just got an uncomfortable cloud bill, the conversation has a familiar rhythm. Somebody pulls up the AWS Cost Explorer view. We sort by service. We look at the top three line items. We click into the largest one. And then, almost without fail, somebody on the call says something like: “we should never have built that the way we built it.”

Almost every “we should never have built it that way” decision was made in a pull request that nobody reviewed for cost.

This piece is about that gap: the gap between the PR review you actually do (which checks correctness, security, and style) and the PR review you don’t do (which would have caught the cost mistake before it shipped).

The shape of the FinOps-at-IaC-layer pattern

The general idea has been floating around the IaC community for a couple of years. Pieces of it have shipped. Infracost has been around since 2020, the OpenTofu/Terraform plan-json output stabilized over the last couple of cycles, and the policy-as-code tools (OPA, Sentinel) have been picking up the FinOps use case for a while. What’s changed in the last twelve months is that the components have matured enough that a small team can actually wire them together without writing a research paper.

The shape, end to end:

  1. Pull request opens. A developer changes Terraform, adds a new resource, resizes an existing one, modifies a config that affects pricing.
  2. CI runs terraform plan -out=tfplan against the relevant environment, then terraform show -json tfplan to produce the plan JSON. (Note that plan -json on its own streams machine-readable logs, not the plan representation the downstream tools consume.) The plan JSON goes to a file.
  3. A cost-estimation tool reads the plan JSON, looks up each resource’s pricing in a maintained cost catalog, and produces a structured estimate. Infracost is the most popular open-source tool here; there are commercial options as well.
  4. The estimate gets posted back into the PR as a comment. Specifically: the monthly cost of the resources being added, removed, or modified, with line-item breakdowns.
  5. A policy gate decides whether the PR is allowed to merge. The policy might be “any change over $500/month requires platform-team approval,” or “any change that increases the monthly bill by more than 10% blocks the PR,” or “any new resource in this list of expensive types needs a comment explaining why.” (A minimal sketch of such a gate appears just below.)
  6. The PR author either merges, escalates, or revises based on the cost signal.

The whole loop closes before anything ships. The cost surprise becomes a cost conversation. The cost conversation happens at the cheapest possible time, when the change is still a few lines of code, not a running production resource that’s already racked up charges.
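
To make step 5 concrete: a minimal Rego sketch of the dollar gate, runnable with conftest or opa eval. It assumes Infracost's JSON diff output as the input document, where the top-level diffTotalMonthlyCost field carries the monthly delta as a string; the threshold and message wording are illustrative, not prescriptive.

```rego
package finops.gate

import rego.v1

# Illustrative threshold: a monthly delta above this blocks the merge
# pending platform-team approval.
monthly_delta_limit := 500

deny contains msg if {
    # Infracost's diff JSON reports the delta as a string, e.g. "742.16".
    delta := to_number(input.diffTotalMonthlyCost)
    delta > monthly_delta_limit
    msg := sprintf("estimated increase of $%v/month exceeds the $%v/month gate; platform-team approval required", [delta, monthly_delta_limit])
}
```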

What plan-time analysis catches

The categories of waste that are visible at PR time are not the categories most teams expect. They are also, conveniently, the categories where most teams are bleeding the most money.

Oversized compute. Somebody writes instance_type = "m5.4xlarge" in a Terraform module that’s going to run continuously. The cost catalog knows that m5.4xlarge in us-east-1 is roughly $500/month. The plan-json analysis flags the cost. The PR author either has a reason (and writes a comment explaining the workload), or doesn’t (and downsizes to m5.large for $63/month, an order of magnitude difference for an identical-looking line of code).
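
As a Rego sketch over the plan JSON's resource_changes array (the set of “expensive” types is a stand-in for whatever list your team maintains):

```rego
package finops.compute

import rego.v1

# Stand-in list; in practice this would live in a shared data document.
expensive_instance_types := {"m5.4xlarge", "m5.8xlarge", "r5.4xlarge", "c5.9xlarge"}

deny contains msg if {
    rc := input.resource_changes[_]
    rc.type == "aws_instance"
    "create" in rc.change.actions
    expensive_instance_types[rc.change.after.instance_type]
    msg := sprintf("%s creates a %s; add a justification comment or downsize", [rc.address, rc.change.after.instance_type])
}
```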

Missing lifecycle policies on S3. A new S3 bucket without a lifecycle policy is a bucket that, over time, accumulates objects with no expiration. The cost is small per object but grows monotonically forever. Plan-time analysis can flag any aws_s3_bucket resource that doesn’t have a corresponding aws_s3_bucket_lifecycle_configuration and ask whether that’s intentional. This is policy-as-code more than cost-estimation, but they overlap.
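
The precise bucket-to-lifecycle pairing is often unknowable at plan time, because the lifecycle resource's bucket reference may be computed, so a workable sketch does the coarse version: warn when a plan creates buckets but contains no lifecycle configuration at all.

```rego
package finops.s3

import rego.v1

new_buckets contains rc.address if {
    rc := input.resource_changes[_]
    rc.type == "aws_s3_bucket"
    "create" in rc.change.actions
}

lifecycle_configs contains rc.address if {
    rc := input.resource_changes[_]
    rc.type == "aws_s3_bucket_lifecycle_configuration"
}

warn contains msg if {
    count(new_buckets) > 0
    count(lifecycle_configs) == 0
    msg := sprintf("bucket(s) %v created with no lifecycle configuration anywhere in the plan; intentional?", [new_buckets])
}
```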

RDS without storage autoscaling caps. RDS instances can be configured to autoscale storage upward, which is correct behavior, until the workload changes shape and the database starts growing without bound, at which point you discover that “autoscaling” means “the bill grows automatically too.” Plan-time analysis can flag any aws_db_instance whose max_allocated_storage is set far out of proportion to its allocated_storage. (In Terraform’s AWS provider, leaving max_allocated_storage unset disables storage autoscaling entirely, so the check is about caps that are present but meaninglessly large, not about the attribute being missing.)
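
A sketch of that check; the 10x growth factor is an arbitrary illustrative policy, not a recommendation.

```rego
package finops.rds

import rego.v1

# Illustrative policy: a cap more than 10x the initial allocation is
# treated as "meaninglessly large".
max_growth_factor := 10

deny contains msg if {
    rc := input.resource_changes[_]
    rc.type == "aws_db_instance"
    "create" in rc.change.actions
    cap := rc.change.after.max_allocated_storage
    cap > max_growth_factor * rc.change.after.allocated_storage
    msg := sprintf("%s can autoscale storage to %vGB from an initial %vGB; set a deliberate cap", [rc.address, cap, rc.change.after.allocated_storage])
}
```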

Idle ELBs. A common waste pattern: a load balancer was created for a service that has since been decommissioned, but the LB is still around in the Terraform. Plan-time analysis can’t tell you the LB is idle (it doesn’t see traffic) but it can tell you when an LB is being created that doesn’t have any target group attachments, which is the IaC-time signature of a soon-to-be-orphaned resource.

Dev-environment forever-running compute. The single biggest source of dev waste in customer engagements I’ve been on this year: someone built a dev environment with the same instance types as production, then never turned it off. Plan-time analysis can flag this by environment label: a var.env = "dev" resource with a production-sized instance type is a finding.
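
A sketch keyed off an env tag rather than var.env itself, since tags are what survive into the plan JSON; both the tag convention and the allow-list are assumptions.

```rego
package finops.devenv

import rego.v1

# Assumed tagging convention (env = "dev") and an illustrative
# allow-list of dev-appropriate sizes.
dev_allowed_types := {"t3.micro", "t3.small", "t3.medium"}

deny contains msg if {
    rc := input.resource_changes[_]
    rc.type == "aws_instance"
    "create" in rc.change.actions
    rc.change.after.tags.env == "dev"
    not dev_allowed_types[rc.change.after.instance_type]
    msg := sprintf("%s is tagged env=dev but uses %s; dev gets burstable types", [rc.address, rc.change.after.instance_type])
}
```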

Multi-AZ NAT gateways in single-AZ workloads. NAT gateways are billed per gateway-hour (plus per GB processed), and the standard pattern deploys one per AZ. A workload that only runs in one AZ does not need three NAT gateways. The Terraform plan can show you when the gateway count is higher than the AZ count of the actual workload.
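
The count comparison as a sketch; a real pipeline would pass the workload's AZ count in as data rather than hardcoding it.

```rego
package finops.nat

import rego.v1

# In a real pipeline this would arrive via --data from CI; hardcoded
# here for the sketch.
workload_az_count := 1

nat_gateways contains rc.address if {
    rc := input.resource_changes[_]
    rc.type == "aws_nat_gateway"
    "create" in rc.change.actions
}

deny contains msg if {
    count(nat_gateways) > workload_az_count
    msg := sprintf("plan creates %v NAT gateways for a %v-AZ workload: %v", [count(nat_gateways), workload_az_count, nat_gateways])
}
```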

Idle EBS volumes. Less common than the rest but easy to catch: a volume declared in Terraform that isn’t attached to any instance resource. The plan output has the topology. The cost catalog has the rate.
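
A per-volume join is possible by tracing configuration references in the plan JSON, but the coarse version is a few lines and catches the common case:

```rego
package finops.ebs

import rego.v1

volumes contains rc.address if {
    rc := input.resource_changes[_]
    rc.type == "aws_ebs_volume"
    "create" in rc.change.actions
}

attachments contains rc.address if {
    rc := input.resource_changes[_]
    rc.type == "aws_volume_attachment"
}

warn contains msg if {
    count(volumes) > 0
    count(attachments) == 0
    msg := sprintf("volume(s) %v created with no aws_volume_attachment in the plan", [volumes])
}
```

The same shape covers the idle-ELB case above: load balancers created in a plan that contains no target group attachments.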

Most teams I work with discover that 60-80% of their cloud waste was visible at PR time. The number sounds high enough to land as a punchline in demos, but I’ve never seen the percentage come in lower than that once we actually walk through the bill.

What plan-time analysis can’t catch

The honest piece. The pattern has real limits.

Post-deploy idle. A correctly-sized resource that nobody is using. Plan-time analysis sees the resource being created; it doesn’t see whether anybody touches it after it exists. An RDS instance with the right size on day one becomes waste on day ninety if the application stops calling it. Plan-time analysis won’t see that.

Query-time costs in serverless. Lambda invocations, DynamoDB on-demand reads, Cloud Run requests. The cost of these resources is mostly determined by usage, not by configuration. The Terraform plan shows that the function exists and is configured at 512 MB of memory; it can’t tell you whether the function is going to be called a thousand times a day or a million.

Egress traffic. AWS data transfer charges are notoriously the line item nobody anticipates. They’re proportional to traffic, not to resource shape. Plan-time analysis sees that an S3 bucket exists in us-east-1 and a Lambda function exists in eu-west-1; it doesn’t know that the Lambda is going to read terabytes from that bucket every day until the workload runs.

Reserved capacity vs on-demand. A team that’s already purchased Reserved Instances or Savings Plans gets different effective rates than the public pricing the cost catalog uses. The catalog estimates are accurate as on-demand estimates; if you have significant reserved coverage, the actual bill impact is smaller than the estimate.

Cross-region duplication. A Terraform module that’s parameterized by region, deployed to three regions, costs three times the per-region estimate. The plan-output analysis on one apply sees one region’s worth of cost. The total is implicit.

These are real limits. They are not arguments against the pre-merge pattern; they are arguments for layering something else on top of it.

Decisions as Code, not policy after the fact

The deeper version of this argument is that most cloud waste is downstream of inconsistent standards: tags that don’t agree across modules, naming that doesn’t roll up to a useful cost report, lifecycle metadata that’s set in one place and missing in another. Plan-time analysis catches the symptom. The actual fix is upstream: every Terraform module pulling its tags, its naming, and its lifecycle defaults from the same authoritative source, so “is this resource tagged correctly” is not a thing a policy gate has to enforce, because it’s a thing the module can’t get wrong.

This is Decisions as Code (DaC). The methodology behind nearly every self-service and automation system I’ve designed: extract business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults the platform owns. (I called this Property Toolkit during my OneFuse days; the shape of the idea hasn’t changed, only the foundation.)

Applied to Terraform, that means one centralized, structured set of decisions, with per-platform adapters projecting it into the consuming systems’ primitives. Today it’s Terraform modules pulling from a shared module.standards that exposes tags, naming, lifecycle_defaults, and the rest as outputs. When the organizational standard for prod tagging changes, you change it in one place; every module picks it up on the next plan.

OPA is the enforcement complement to DaC. Where DaC specifies the decisions, OPA verifies the deployment matches them. The Rego rules can reference the same structures the standards module exposes: same vocabulary, same shape, two roles. The pre-merge pattern in this article is the OPA half of that. The standards-module half is what makes the OPA half cheap to write, because the rules get a clean structured input to assert against rather than ad-hoc resource-by-resource string matching.
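
Here's what that looks like when the standards module's decisions are also exported as a JSON data document the policy can reference; the data.standards.required_tags path is an assumed layout, not a fixed convention.

```rego
package finops.tags

import rego.v1

# data.standards is assumed to be a JSON export of the shared standards
# module, loaded with conftest --data or opa eval --data.
deny contains msg if {
    rc := input.resource_changes[_]
    "create" in rc.change.actions
    tags := object.get(rc.change.after, "tags", {})
    missing := {t | some t in data.standards.required_tags; not tags[t]}
    count(missing) > 0
    msg := sprintf("%s is missing required tags %v", [rc.address, missing])
}
```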

The teams that have both halves wired up stop having recurring “we forgot to tag this” conversations. The teams that only have OPA without the standards layer end up with Rego that’s increasingly defensive, because it’s compensating for the lack of a standard source. Get both. Specify, then verify.

What to layer on top

The pre-merge pattern is the floor, not the ceiling. The full FinOps picture needs at least three layers:

Pre-merge. What we’ve been talking about. Plan-time analysis, cost estimation, policy gates. This is where the configuration mistakes get caught.

Post-deploy monitoring. A daily or weekly view of actual resource usage vs the configured size. AWS Trusted Advisor and Cost Explorer’s recommendations are the cheap starting points. The cloud-cost commercial tools (Vantage, CloudHealth, the rest) build on top. This is where the “correctly-sized at first, idle later” cases get caught.

Bill anomaly detection. A higher-level view that watches for unexpected jumps in spend at the service level. This is what catches the “egress traffic suddenly tripled because of a bug in the application” cases that nothing at PR time could have predicted.

The pre-merge layer is the highest-leverage of the three because it’s the cheapest to act on. A cost signal in a PR is a five-minute conversation between two engineers. The same cost surprise three months later, in production, is a meeting with finance.

The customer pattern, generalized

I’ve watched the same shape play out on several engagements this year. An organization gets surprised by a cloud bill. The team digs in. They discover that the bulk of the unexpected spend traces back to a small number of configuration decisions that were merged without a cost conversation. They build (or buy) the pre-merge analysis. They wire it into the CI. The next quarter, the bill is meaningfully smaller, not by huge percentages, usually 10 to 20%, but consistently smaller, and the variance is much lower.

The thing that’s worth being explicit about: the savings are not what makes the pattern valuable. The variance reduction is.

A 15% reduction in monthly spend, on a $200K/month cloud bill, is real money. But it’s not major. What’s major is that the team stops getting bill surprises. The finance partner stops asking what happened in month X. The CFO stops requesting deep-dives on month Y. The conversation moves from “explain this bill” to “plan this bill,” which is a different and much more productive conversation.

The variance reduction is what the pre-merge layer specifically does. The post-deploy and anomaly layers reduce average spend; the pre-merge layer reduces the standard deviation. Both matter. The variance one is the one that gets the FinOps team off the back foot.

The implementation

If I were starting today, on a Terraform-using team that had never done pre-merge cost analysis, the minimum viable implementation would be:

  1. Pick a tool. Infracost is the obvious first choice: open-source, well-maintained, supports both Terraform and OpenTofu, and has good GitHub Actions integration. Try it.
  2. Wire it into CI. A GitHub Actions step that runs after terraform plan, takes the plan JSON (via terraform show -json on the saved plan), runs the cost tool against it, and posts the result as a PR comment.
  3. Set a threshold. Start permissive. “Comment on PRs that increase the estimated bill by more than $100/month.” See what the volume looks like.
  4. Tighten over a quarter. As the team gets used to seeing cost comments, lower the threshold or add policy gates. “PRs that increase cost by more than $500/month require platform-team approval.” “PRs that introduce new RDS instances require a database-team review.” (A warn-then-deny sketch of this progression follows the list.)
  5. Don’t gate aggressively at the start. A policy gate that blocks merges before the team has internalized the cost discipline will produce frustration and workarounds. Gates work after the signal has been visible long enough that the team treats it as obviously useful.
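
A sketch of steps 3 and 4 as a warn-then-deny pair over the same Infracost diff JSON as before; conftest treats warn results as non-blocking, which matches the permissive-first posture of item 5.

```rego
package finops.gate

import rego.v1

comment_threshold := 100 # surface a warning in the PR comment
block_threshold := 500   # fail the check until someone approves

delta := to_number(input.diffTotalMonthlyCost)

warn contains msg if {
    delta > comment_threshold
    delta <= block_threshold
    msg := sprintf("heads up: this change adds $%v/month", [delta])
}

deny contains msg if {
    delta > block_threshold
    msg := sprintf("a $%v/month increase needs platform-team approval", [delta])
}
```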

The thing I want to flag: this is one of the rare automation patterns where the implementation is pretty easy and the politics are hard. Wiring up the CI step takes an afternoon. Getting the team to actually treat the cost comment as something to read and respond to takes months. The work is cultural, not technical.

The good news is that the cultural work is mostly self-reinforcing. Once the first cost-aware PR catches a $2K/month mistake before it ships, the team’s posture changes. The comment becomes the thing everybody looks at first. And the people who used to ship oversized instances without thinking about it now think about it, because the conversation is right there in the PR, in plain English, before merge.

The savings happen as a byproduct. The variance reduction is the real prize. The pre-merge cost gate is the cheapest place in the entire FinOps stack to install discipline. Install it there, and the rest of the stack works better.