Multi-cloud Terraform: same workload, three providers
Most multi-cloud Terraform writeups are hypothetical. The customer engagements I keep doing aren't: they're the same workload running on AWS, Azure, and GCP in parallel, and the lessons about where abstraction actually helps and where it bites are not what the architecture diagrams suggest.
On every customer engagement I’ve done in the last six months, the demo environment looks the same. Three repos sitting next to each other: acme-aws-prod, acme-azure-prod, acme-gcp-prod. Same shape of workload in each. Compute. Storage. IAM. Networking. Three Terraform stacks doing structurally the same thing on three completely different clouds.
I keep building this environment because it’s the cleanest way to have an honest conversation about multi-cloud. The marketing version of multi-cloud is a slide with a single architecture diagram and three logos along the bottom, as if the workload moves between clouds like a chess piece. The real version is three stacks that each had to be written with the local cloud’s primitives in mind, glued together with conventions, and operated as siblings rather than clones.
I want to write down what I’ve learned from doing this enough times that the patterns are starting to repeat, because the gap between “the talk track on multi-cloud” and “what multi-cloud actually looks like in HCL” is wider than most teams expect.
Why anyone runs multi-cloud in the first place
Before the technical part, the framing part. Most teams that show up to a multi-cloud conversation are doing it for one of three reasons, and the reason matters because it changes what “good” looks like.
Acquisition. They bought a company that was on a different cloud than they were. Now they have AWS and Azure, not because they chose to, but because the acquired engineering team has six years of Azure-native services they can’t realistically rewrite. The goal isn’t symmetry. The goal is “stop the bleeding, run both, eventually consolidate or don’t, but operate both safely in the meantime.”
Compliance or sovereignty. A regulator says certain workloads have to run in a specific cloud in a specific region. Healthcare, finance, defense, EU data-residency rules. They didn’t choose to be multi-cloud. The legal constraint chose for them.
Vendor leverage. A genuine, deliberate decision to keep the option to move. Usually driven by an executive who got burned on a previous lock-in story, or by procurement, or by a board-level risk lens. This is the rarest of the three and the only one where teams actually try to build for portability.
The hypothetical version of multi-cloud (“we should be cloud-agnostic so we can switch providers if pricing changes”) is almost never the real driver. When teams build for that hypothetical, they end up with the worst of both worlds: a vendor-neutral wrapper that doesn’t quite fit any cloud well, plus the maintenance burden of three providers’ worth of code anyway.
Once you know which reason you’re in, the Terraform shape follows.
The shape of the demo
Picture the parallel repos. Each one is structured the same way at the top level:
```
acme-aws-prod/
  backend.tf
  providers.tf
  network.tf
  compute.tf
  storage.tf
  iam.tf
  modules/

acme-azure-prod/
  backend.tf
  providers.tf
  network.tf
  compute.tf
  storage.tf
  iam.tf
  modules/

acme-gcp-prod/
  ...
```
Same filenames. Same directory layout. Same workload shape: a small web tier, a managed database, an object store, a couple of service identities, and the networking to connect them. Three clouds.
The reason for the parallel structure is not portability. The workloads don’t move. The reason is operability. When an engineer who normally works in the AWS repo has to debug something in the Azure repo, the file they need to open is the file they expect to open. The cognitive cost of context-switching between clouds is the single biggest tax on multi-cloud teams. Matching the directory shape, naming conventions, variable names, and module interface across all three repos cuts that tax in half before anyone writes a line of HCL.
This is the first lesson, and it’s the one teams skip: interface conventions are worth more than abstraction. You don’t need a magic module that takes a cloud variable and dispatches. You need three modules (one per cloud) that take the same-shaped inputs and produce the same-shaped outputs, named the same way.
Where the modular pattern actually works
For the layers where each cloud has roughly the same primitive, parallel modules do hold up well.
Compute. AWS EC2, Azure VM, GCP Compute Engine. Different APIs, different attribute names, but the same essential thing: a machine with a size, an image, networking, an identity, and tags. A per-cloud module that takes instance_size, image_ref, subnet_id, identity, and tags as inputs maps cleanly onto each.
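To make that concrete, here's roughly what the call site looks like in two of the repos. The module sources, variable names, and values are illustrative rather than lifted from a real engagement; the point is the symmetry, not the specifics.

```hcl
# acme-aws-prod/compute.tf -- illustrative
module "web_compute" {
  source        = "./modules/compute"           # AWS implementation inside
  instance_size = "t3.small"
  image_ref     = var.app_image                 # an AMI ID in this repo
  subnet_id     = module.network.app_subnet_id
  identity      = module.iam.web_identity
  tags          = local.common_tags
}

# acme-azure-prod/compute.tf -- same call shape, cloud-local values
module "web_compute" {
  source        = "./modules/compute"           # Azure implementation inside
  instance_size = "Standard_B2s"
  image_ref     = var.app_image                 # a shared-image reference here
  subnet_id     = module.network.app_subnet_id
  identity      = module.iam.web_identity
  tags          = local.common_tags
}
```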
Object storage. S3, Azure Blob, GCS. Different consistency models under the hood, different access-control vocabularies, but conceptually the same: a bucket with a name, a region, encryption settings, lifecycle rules, and access policies. The per-cloud modules look almost identical at the call site.
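The insides of the per-cloud storage modules differ, but they honor the same variables. A sketch with hypothetical variable and resource names, showing the AWS and GCP implementations of the same interface:

```hcl
# modules/storage/variables.tf -- identical in all three repos (illustrative)
variable "bucket_name" { type = string }
variable "region"      { type = string }
variable "force_destroy" {
  type    = bool
  default = false
}

# AWS implementation (acme-aws-prod/modules/storage/main.tf)
resource "aws_s3_bucket" "this" {
  bucket        = var.bucket_name
  force_destroy = var.force_destroy
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}

# GCP implementation (acme-gcp-prod/modules/storage/main.tf)
resource "google_storage_bucket" "this" {
  name          = var.bucket_name
  location      = var.region
  force_destroy = var.force_destroy
  versioning {
    enabled = true
  }
}
```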
Service identities. IAM roles, Azure managed identities, GCP service accounts. The auth flows differ, but the resource is similar: a non-human identity with attached permissions and a lifecycle tied to the workload that uses it.
For these three categories, the same-shaped-inputs pattern delivers what it promises. An engineer reading the compute.tf in any of the three repos can predict what they’ll see. The module call looks the same, the variables look the same, and even the outputs look the same. That’s the lift.
Where it breaks
The clouds diverge sharply on the primitives that don’t have direct equivalents. The breaking points I keep hitting:
Networking. This is the biggest one and the one teams underestimate. AWS gives you VPCs, subnets, route tables, internet gateways, NAT gateways, transit gateways, VPC endpoints. Azure gives you VNets, subnets, NSGs, route tables, NAT gateways, virtual network gateways, private endpoints, service endpoints, and the relationships between them are different. GCP gives you VPCs that are global by default, subnets that are regional, firewall rules at the VPC level, and a routing model that does not match either AWS or Azure.
You cannot write a network module that takes a shape like { vpc, subnets[], routes[] } and produces an equivalent thing on all three clouds. The primitives don’t line up. The relationships between them don’t line up. The cross-region story doesn’t line up. You can write three networking modules that take inputs appropriate to the cloud, and you can pick conventions so the outputs all expose a subnet_id or a network_self_link or a vnet_id for downstream modules to consume, but the inside of the module is genuinely different on each cloud.
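What sharing the output convention looks like in practice, as a sketch with assumed resource names. Here I've used the same output name in both modules, per the convention above; everything upstream of the outputs is genuinely cloud-specific.

```hcl
# acme-aws-prod/modules/network/outputs.tf (illustrative)
output "app_subnet_id" {
  value = aws_subnet.app.id                    # regional VPC, AZ-scoped subnet
}
output "network_id" {
  value = aws_vpc.main.id
}

# acme-gcp-prod/modules/network/outputs.tf (illustrative)
output "app_subnet_id" {
  value = google_compute_subnetwork.app.id     # global VPC, regional subnet
}
output "network_id" {
  value = google_compute_network.main.id
}
```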
Identity federation. Wiring AWS IAM to Azure AD to GCP service accounts so a workload running in one cloud can call services in another is, even in 2024, mostly hand-rolled. AWS has OIDC trust to GitHub and to GCP. Azure has its own federated-identity story. GCP has workload identity federation. The wiring between any two of them is a custom artifact every time. No module pattern saves you here.
Managed databases. RDS, Azure SQL Database / Azure Database for PostgreSQL, Cloud SQL. The lifecycle and parameter surface for each is different enough that a unified module ends up either lowest-common-denominator (you lose the cloud-specific features that were the reason you chose that cloud) or full of escape hatches (it stops being a module and becomes a leaky wrapper).
Anything cloud-native. Step Functions vs Logic Apps vs Cloud Workflows. SQS vs Service Bus vs Pub/Sub. Lambda vs Functions vs Cloud Run. The conceptual overlap is high. The Terraform surface is not. If your workload uses any of these, accept that the per-cloud module is going to look genuinely different inside.
The bad abstraction trap
The tempting move when you see this divergence is to write a vendor-neutral wrapper. A module called compute that takes a cloud enum and dispatches internally to per-cloud implementations. It feels like the right thing because it pushes the divergence below the call site.
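For concreteness, the wrapper usually ends up shaped something like this. This is a hypothetical sketch, not anyone's real module, and it's the shape I recommend against:

```hcl
# A "vendor-neutral" compute wrapper that dispatches on a cloud enum
variable "cloud" {
  type = string   # "aws" | "azure" | "gcp"
}

module "aws" {
  count  = var.cloud == "aws" ? 1 : 0
  source = "./aws"
  # every input now has to make sense on all three clouds, or be nullable
}

module "azure" {
  count  = var.cloud == "azure" ? 1 : 0
  source = "./azure"
}

module "gcp" {
  count  = var.cloud == "gcp" ? 1 : 0
  source = "./gcp"
}

# outputs need gymnastics to pick whichever implementation actually ran
output "instance_id" {
  value = coalesce(
    one(module.aws[*].instance_id),
    one(module.azure[*].instance_id),
    one(module.gcp[*].instance_id)
  )
}
```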
I have watched several teams try this. Without exception, it goes badly.
The problems compound:
- The wrapper has to keep up with three clouds’ provider updates. Each cloud ships new resource types and new attributes on roughly a monthly cadence. You’re now maintaining a meta-abstraction that lags every one of them.
- Cloud-specific features that don't have an analog get reduced to null or to an optional cloud-specific block. The wrapper accumulates escape hatches until it's no longer abstracting anything.
- The wrapper's interface becomes the limiting factor. New work has to fit through the wrapper. Engineers stop using it and write raw resources alongside it. Now you have two patterns in the same repo.
- When something breaks, you have an extra layer to debug through. The error message is in the wrapper, the bug is in the cloud provider, the engineer is staring at HCL that is two abstractions removed from the resource that actually failed.
The replacement pattern, which is what I now recommend to every team I work with, is per-cloud modules with shared interface conventions. Three modules (one per cloud) that:
- Take inputs with the same names and the same shapes wherever possible.
- Expose outputs with the same names where conceptually equivalent.
- Are otherwise free to use the cloud’s native primitives idiomatically.
This gives you the operability win (the call site looks the same) without the abstraction tax (no wrapper to maintain). The cost is duplicated HCL inside the modules. The benefit is that each cloud’s module can use the cloud’s full feature set, can be debugged without reading through an abstraction layer, and can evolve at its own pace.
The pattern is older than Terraform
The thing I keep wanting to say out loud on these engagements is that this isn’t a new pattern, and I keep biting my tongue because the people in the room don’t have the background to care. So I’ll say it here.
This is Decisions as Code (DaC), the methodology behind nearly every self-service and automation system I've designed: extract business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults the platform owns. (I called this Property Toolkit during my OneFuse days; the foundation has changed, the methodology hasn't.)
In the OneFuse era, one structured decision surface with reserved per-platform namespaces drove the same business standard into multiple consuming platforms. Same standard value for “small VM” or “production OS template”; each platform pulled the slice shaped the way that platform needed it. vRA7 saw VirtualMachine.CPU.Count, vRA8 saw flavor, and Terraform saw vcpu. One source of truth, three platform-correct projections.
The Terraform-per-cloud pattern I just described is the same idea projected onto a different vocabulary. Centralize the business decisions. Expose them through per-platform adapters with predictable interface conventions. Let each adapter use its platform's native primitives idiomatically. The primitives a few years ago were vRA blueprints and orchestrators; the primitives in 2024 are AWS / Azure / GCP Terraform modules. The methodology is unchanged. Centralization, platform-aware shapes, variable interpolation, composition, and a discovery convention: those five primitives are what make the pattern work in either decade. Nested decision surfaces are Terraform's nested modules. The reserved-namespace pattern is the per-cloud module's input contract. If anything, Terraform's locals and templatefile() make the interpolation cleaner than I had it before.
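If it helps to see the projection in Terraform terms, here's a minimal sketch with hypothetical names: one decision value, three platform-correct projections.

```hcl
variable "size" {
  type = string   # the business decision: "small" | "medium" | "large"
}

locals {
  # Per-platform projections of the same decision
  size_map = {
    small  = { aws = "t3.small",  azure = "Standard_B2s",  gcp = "e2-small" }
    medium = { aws = "t3.medium", azure = "Standard_B2ms", gcp = "e2-medium" }
    large  = { aws = "t3.large",  azure = "Standard_B4ms", gcp = "e2-standard-4" }
  }
}

# Each repo's compute.tf consumes only its own projection:
#   instance_size = local.size_map[var.size].aws     (in acme-aws-prod)
#   instance_size = local.size_map[var.size].azure   (in acme-azure-prod)
#   instance_size = local.size_map[var.size].gcp     (in acme-gcp-prod)
```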
I bring this up because the multi-cloud Terraform pattern people are converging on in 2024 is not novel. It is the right answer for the same structural reason it was the right answer for multi-platform automation a few years ago: when the same business logic has to be expressed in multiple consuming systems, the only stable place to keep it is one standard decision surface with platform-aware projections. Everything else drifts.
State and backends per cloud
A subtler thing the parallel-repo pattern forces is the state-backend question. Each cloud has its own standard backend. AWS pushes you to S3 + DynamoDB. Azure pushes you to a storage account with state locking through blob leases. GCP pushes you to GCS with object versioning.
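Concretely, the three backend blocks look something like this; bucket, account, and table names are illustrative.

```hcl
# acme-aws-prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-tfstate-aws"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tfstate-locks"
    encrypt        = true
  }
}

# acme-azure-prod/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "acme-tfstate"
    storage_account_name = "acmetfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

# acme-gcp-prod/backend.tf
terraform {
  backend "gcs" {
    bucket = "acme-tfstate-gcp"
    prefix = "prod"
  }
}
```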
You can run all three states out of one cloud’s backend (store all your state in S3, even for Azure and GCP workloads) and there’s a real argument for it (one place to back up, one place to audit, one set of access policies). The argument against is operational: if the AWS state bucket has an outage, your Azure and GCP pipelines stop too. That’s coupling you didn’t have to take on.
The pattern I’ve settled on for engagements where each cloud has a substantive workload: each cloud’s state lives in that cloud’s backend. The state-bucket setup is its own bootstrap module per cloud, with versioning, encryption, lifecycle rules for old state versions, and a lock table or equivalent.
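The bones of the AWS version look roughly like this; names are illustrative, and the encryption and lifecycle pieces are where the gotchas live.

```hcl
# Bootstrap for the AWS state backend -- a sketch, not the full treatment
resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate-aws"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "locks" {
  name         = "acme-tfstate-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# Server-side encryption config and lifecycle rules for old state versions
# belong here too; that's where most of the gotchas are.
```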
I’ll write that up separately. The S3-plus-DynamoDB setup specifically has enough gotchas that it’s a piece on its own.
What to do, concretely
If you’re walking into multi-cloud Terraform work, the moves that have held up across customer engagements:
Name the reason. Acquisition, compliance, vendor leverage. The pattern that fits each is different. Don’t build for vendor leverage if the real driver is an acquisition you have to operate today.
Match the directory shape across repos. Same filenames, same module names, same variable names where possible. The operability win is real and free.
Write per-cloud modules. Do not write a vendor-neutral wrapper. Share the interface conventions. Don’t share the implementation.
Accept divergence on networking, identity federation, managed databases, and cloud-native services. These are not the place to enforce symmetry. They will hurt you if you try.
Pick state backends per cloud. Unified-state is tempting; the coupling cost is not worth it for substantive workloads.
Resist the urge to “consolidate later.” Most acquisition-driven multi-cloud setups never consolidate. Build the parallel setup to be operable for years, not as a migration intermediate that may never migrate.
The longer thread
Finally, the thing I keep coming back to is that multi-cloud is a question about people more than it is about clouds. The technical part is solvable. The expensive part is the engineers who have to operate across all three, and the cognitive tax of keeping three providers’ worth of vocabulary in their heads.
The Terraform layer is where you can either pay that tax or save it. Same-shaped repos, same-named modules, same-shaped inputs: that's where the operability comes from. Vendor-neutral wrappers don't save anyone. Interface conventions do.
, Sid