OIDC for Terraform CI/CD: kill the access keys

Long-lived AWS access keys in GitHub Actions secrets are the wrong default. OIDC federation gives every workflow a scoped, short-lived role assumption with no secret to leak. The trust-policy shape, the GitHub Actions wiring, and the gotchas that make it harder than the blog posts suggest.

Sid Smith

27 Feb 2024 • 9 min read

I have lost count of how many customer GitHub repos I’ve opened in the last year that have an AWS access key sitting in their secrets, named AWS_ACCESS_KEY_ID, valid for “indefinitely,” used by every Terraform workflow in the repo. Sometimes the key is a few months old. Sometimes it’s three years old. Sometimes nobody in the room can tell me which IAM user it belongs to or who created it.

That’s the failure mode I want to write about. It is one of the most common security weaknesses in IaC pipelines, and it has been completely solved by OIDC federation between GitHub and AWS for about two years now. The reason teams still have the keys is not that the alternative is hard. The alternative is genuinely easy. The reason is pretty simple: nobody walked through the wiring start-to-finish, and the existing writeups skip the gotchas that make the first attempt fail.

So: the wiring, start-to-finish, including the parts that take an extra hour the first time.

What OIDC federation actually does

The short version: GitHub Actions can issue a short-lived OIDC token to a workflow run. The token carries claims about the run: the repo, the branch, the environment, the workflow file, the actor. AWS IAM can be configured to trust this token and to exchange it for an AWS session scoped to a specific IAM role.

What this gives you:

No long-lived credentials. Nothing stored in GitHub Actions secrets. Nothing to rotate. Nothing to leak.
Per-run identity. Every workflow run gets its own token and its own session. CloudTrail logs the role assumption with the token’s sub claim attached, so you can see which repo, on which branch or in which environment, assumed the role; put the run ID in the session name and every AWS API call traces back to the exact run.
Scoped trust. The IAM role’s trust policy can restrict assumption to specific repos, specific branches, specific environments. A misconfigured workflow in a different repo cannot assume the role.
Auditable. The assume-role event in CloudTrail records the token’s subject and audience alongside the role and session name. After-the-fact forensics is dramatically easier.

The cost is: one IAM OIDC identity provider per AWS account, one IAM role per workflow-and-permission-scope, a few lines of YAML in each workflow. That’s it. There is no service to run, no rotation to schedule, no secret to manage.

The three pieces

Piece one: the IAM OIDC identity provider. This is the AWS-side trust anchor for GitHub. You create it once per AWS account. The URL is https://token.actions.githubusercontent.com, the audience is sts.amazonaws.com, and the thumbprint is GitHub’s TLS cert thumbprint (AWS now manages this automatically, but older setups have it hardcoded).

In Terraform:

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

The thumbprint is mostly a historical artifact at this point (AWS validates GitHub’s cert chain independently) but the field is required.

Piece two: the IAM role with a scoped trust policy. This is the role that the workflow will assume. The trust policy is the load-bearing piece. It dictates which GitHub identities can assume the role.

data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:acme-org/acme-aws-prod:ref:refs/heads/main",
        "repo:acme-org/acme-aws-prod:environment:production",
      ]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name               = "GitHubActions"
  assume_role_policy = data.aws_iam_policy_document.github_actions_trust.json
}

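The sub claim is the one that matters. It identifies exactly which workflow context can assume the role. The format is repo:<owner>/<repo>:<context>, where <context> can be ref:refs/heads/<branch>, environment:<environment-name>, pull_request, or ref:refs/tags/<tag>. You list every context that should be allowed. One detail that trips people up: a job that declares an environment gets the environment: form of the claim instead of the ref: form, so a job running under environment: production on main matches the second value above, not the first.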

The permissions attached to the role are the actual AWS permissions Terraform needs, typically broad on the workload account, narrow on the state-bucket account.
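As a hedged sketch of the narrow end of that range, here is roughly what a state-only policy could look like when attached to the role above. The bucket name, state key, and lock-table name are placeholders invented for illustration, and the action list assumes an S3 backend with DynamoDB locking; the broad workload permissions would be a separate policy.

```hcl
# Hypothetical names: replace with your own state bucket, key, and lock table.
locals {
  state_bucket = "acme-terraform-state"
  state_key    = "prod/terraform.tfstate"
  lock_table   = "acme-terraform-locks"
}

data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "state_access" {
  # Listing the bucket lets Terraform locate the state object.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${local.state_bucket}"]
  }

  # Read and write only the one state object, not the whole bucket.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::${local.state_bucket}/${local.state_key}"]
  }

  # State locking through the DynamoDB table.
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:${data.aws_caller_identity.current.account_id}:table/${local.lock_table}"]
  }
}

# Inline policy on the role that the workflow assumes.
resource "aws_iam_role_policy" "state_access" {
  name   = "terraform-state-access"
  role   = aws_iam_role.github_actions.id
  policy = data.aws_iam_policy_document.state_access.json
}
```

If the state bucket lives in a different account, the same statements would go on a role in that account instead, reached through a second assume-role hop.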

Piece three: the GitHub Actions workflow. The YAML side is small:

name: terraform-apply

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/GitHubActions
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.7
          terraform_wrapper: false

      - run: terraform init
      - run: terraform plan -out=tf.plan
      - run: terraform apply tf.plan

The two lines that almost everyone gets wrong on the first attempt:

permissions: id-token: write. This is the GitHub-side permission to request an OIDC token for the run. Without it, the configure-aws-credentials step will fail with a confusing error about being unable to retrieve a token. The permission has to be set at the job level or workflow level explicitly. It is not implied by anything else. Note that once you declare a permissions block, anything you don’t list drops to none, which is why contents: read sits next to it for the checkout step.

terraform_wrapper: false. By default, setup-terraform wraps the Terraform binary in a Node.js shim that captures stdout and stderr. The wrapper is useful for some workflows. It breaks terraform plan -json and terraform show -json in subtle ways. Disable it unless you specifically need the wrapper’s features. (We’ll come back to plan output in a separate piece.)

Why this is structurally better than access keys

The access-key pattern has four problems that OIDC solves directly.

There is a secret to leak. Long-lived access keys are valuable forever. If a workflow log ever prints the key (which happens more than anyone admits: bash scripts that echo env, debug output that includes the AWS SDK’s logging, a careless set -x), the key is now in GitHub’s log retention, and anyone who later gains read access to those logs can fetch it. With OIDC, there is no stored secret. The OIDC token is minted per run and is short-lived, and the AWS session it is exchanged for expires after an hour by default (you can request as little as fifteen minutes).

Rotation is somebody’s job. Best-practice says rotate access keys every 90 days. In practice, nobody rotates keys until an auditor asks. The keys sit there, valid, used by an unknown number of workflows, occasionally also embedded in someone’s local environment. OIDC has nothing to rotate. The trust is structural.

Identity is fuzzy. Access keys belong to IAM users. IAM users belong to humans, in theory. In practice, the IAM user is named something like ci-deploy-user-2, was created by someone who has left the company, and has fifteen workflows in seven repos using it. When CloudTrail shows that user made a specific API call, you cannot tell which workflow did it. With OIDC, the assume-role event in CloudTrail records the token’s sub claim, which names the repo and the branch or environment. Add the run ID to the session name and you can trace every API call back to the exact workflow run that produced it.

Trust scope is binary. An IAM user either has access to the role or doesn’t. You cannot say “this key is only usable from workflows in repo X on branch main.” With OIDC trust conditions, scoping the role’s assumability to a specific repo, branch, and environment is the default mode. The role is unusable from anywhere else.

These four properties are not incremental improvements. They are a different category of security guarantee. Once a team is on OIDC, the conversation about how to handle a leaked AWS key goes from “rotate everything and audit every CloudTrail log for the last 90 days” to “the token already expired.”

The gotchas

Things that bit me, in roughly the order they bit me:

The trust policy is too permissive. The first version of every trust policy I write is wrong. The temptation is to scope the sub claim with a wildcard (repo:acme-org/*:ref:refs/heads/*, which also means switching the condition from StringEquals to StringLike) to avoid editing the policy every time a new repo or branch needs the role. Don’t. Each wildcard widens the trust to potentially many workflows, including workflows that haven’t been written yet, and including a workflow an attacker adds by landing a PR in some unrelated repo. Be specific.

The right approach is: one role per scope. If three repos need to deploy to the same AWS account, that’s three specific sub values in one trust policy. Or three roles, depending on whether the permissions differ. The role is cheap. The wildcard is dangerous.
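One way to keep "one role per scope" from turning into copy-paste is to drive the roles from a map. This is a sketch, not the only layout: the role names and the two extra repos are hypothetical, and it assumes the aws_iam_openid_connect_provider.github resource from earlier is in the same configuration.

```hcl
# Hypothetical scopes: role name => the exact sub claims allowed to assume it.
locals {
  github_role_subjects = {
    "gha-acme-aws-prod" = ["repo:acme-org/acme-aws-prod:environment:production"]
    "gha-acme-network"  = ["repo:acme-org/acme-network:environment:production"]
    "gha-acme-data"     = ["repo:acme-org/acme-data:environment:production"]
  }
}

data "aws_iam_policy_document" "scoped_trust" {
  for_each = local.github_role_subjects

  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # StringEquals compares each value literally, so a stray "*" here
    # would match nothing rather than silently widening the trust.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = each.value
    }
  }
}

# One role per map entry, each with its own trust policy.
resource "aws_iam_role" "scoped" {
  for_each = local.github_role_subjects

  name               = each.key
  assume_role_policy = data.aws_iam_policy_document.scoped_trust[each.key].json
}
```

Adding a repo becomes a one-line change to the map, which is also a one-line diff for a reviewer to approve or reject.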

Branch wildcards are dangerous in their own way. Even a single-repo wildcard (repo:acme-org/acme-aws-prod:ref:refs/heads/*) covers every branch in the repo, which means any contributor who can push a branch can change the workflow on that branch and assume the role from it. For anything that touches production, use a GitHub environment and the environment: form of the sub claim, not the ref: form. A token only carries the environment claim when the job runs under that GitHub environment, and the environment can enforce branch restrictions, required reviewers, and deployment-protection rules.
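The environment itself can be managed as code too. The sketch below assumes the integrations/github Terraform provider with credentials supplied through the environment, and a hypothetical team with the slug platform acting as the required reviewer; adjust both to your setup.

```hcl
terraform {
  required_providers {
    github = {
      source = "integrations/github"
    }
  }
}

provider "github" {
  owner = "acme-org"
}

# Hypothetical team that must approve production deployments.
data "github_team" "platform" {
  slug = "platform"
}

resource "github_repository_environment" "production" {
  repository  = "acme-aws-prod"
  environment = "production"

  # A job targeting this environment waits for approval from this team.
  reviewers {
    teams = [data.github_team.platform.id]
  }

  # Only branches covered by branch protection may deploy here.
  deployment_branch_policy {
    protected_branches     = true
    custom_branch_policies = false
  }
}
```

With this in place, a trust policy that lists only repo:acme-org/acme-aws-prod:environment:production inherits the environment’s reviewer and branch rules as preconditions for assuming the role.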

id-token: write is opt-in. Already mentioned, but worth repeating. The default in GitHub Actions is no token. The token permission has to be granted explicitly in the workflow file. The most common form of this bug is “I copied the workflow from the docs and it doesn’t work”, and the docs sometimes show the step without the permission, leaving the reader to discover the requirement.

terraform_wrapper issues. The setup-terraform wrapper interferes with anything that parses Terraform’s stdout. The plan-output-in-PR pattern, terraform show -json, custom scripts that grep the plan: all of them get extra or mangled output through the wrapper. Disable it. The cost is losing the wrapper’s stdout, stderr, and exitcode step outputs, which most workflows don’t use anyway.

Audience mismatch. The aud claim has to match between the trust policy and what GitHub’s OIDC token actually carries. The default audience is sts.amazonaws.com if you don’t override it. Some early writeups suggested customizing the audience to the repo owner, which is technically supported but produces friction for no real security benefit. Stick with sts.amazonaws.com. The repo/branch/environment scoping in the sub claim is where the security lives.

Region pinning. configure-aws-credentials requires the aws-region input. Easy to miss. Without it, some downstream tools default to us-east-1 and others default to whatever’s set in ~/.aws/config, which on the runner is unset. The error messages from this are unhelpful.

Multi-account assumption chains. If your Terraform manages resources in multiple accounts (workload account, state-bucket account, network account), the OIDC role typically lives in one of them and the others are reached via sts:AssumeRole from there. That second hop is a regular IAM role chain (not OIDC) and has to be wired separately. The pattern is straightforward but it’s worth realizing the OIDC trust is only for the first hop.
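A minimal sketch of that second hop, under stated assumptions: the account ID 444455556666 and the role name TerraformNetwork are made up, and the two halves below live in different Terraform configurations (the first in the target account’s bootstrap, the second in the pipeline’s own config).

```hcl
# --- In the target account's bootstrap configuration ---
# Trust the OIDC-assumed role from the first account (plain IAM, no OIDC).
data "aws_iam_policy_document" "second_hop_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111122223333:role/GitHubActions"]
    }
  }
}

resource "aws_iam_role" "terraform_network" {
  name               = "TerraformNetwork"
  assume_role_policy = data.aws_iam_policy_document.second_hop_trust.json
}

# --- In the pipeline's Terraform configuration ---
# Default provider: uses the session set up by configure-aws-credentials.
provider "aws" {
  region = "us-east-1"
}

# Aliased provider: hops from that session into the second account.
provider "aws" {
  alias  = "network"
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::444455556666:role/TerraformNetwork"
    session_name = "terraform-network"
  }
}
```

The GitHubActions role also needs an identity policy allowing sts:AssumeRole on the target role’s ARN; cross-account assumption has to be permitted on both sides. Resources for the second account then select the aliased provider with provider = aws.network.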

The CloudTrail story

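The piece of this that doesn’t get enough airtime is what shows up in CloudTrail after OIDC is in place. Every AssumeRoleWithWebIdentity event records who the token said it was. Specifically:

requestParameters.roleSessionName: the session name, which you can set in the workflow to something descriptive. Setting the action’s role-session-name input to a value that includes ${{ github.run_id }} puts the run ID into the assumed-role ARN on every event the session produces.
responseElements.subjectFromWebIdentityToken: the full sub claim from the OIDC token.
responseElements.audience and responseElements.provider: the aud claim and the identity provider that vouched for the token.

The rest of the token’s claims (workflow file path, commit SHA, actor) are not copied into CloudTrail. The run ID in the session name is the join key back to GitHub, where they live.

With the run ID in the session name, every AWS API call made by a Terraform run can be traced back to: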

The exact workflow run in GitHub.
The exact commit being applied.
The exact branch or environment.
The exact actor who triggered the run.

The forensics story this enables is the part I find most underrated. When something breaks in production and you need to know which Terraform apply caused it, the session name on the CloudTrail event gives you the run ID (provided the workflow put it there), and the run ID gives you the GitHub run URL. You click through, see the diff, see the plan output, and see who approved the PR. The whole chain is a couple of clicks.

With long-lived access keys, that chain doesn’t exist. CloudTrail tells you “user ci-deploy-user-2 did this thing at 14:32 UTC.” Which workflow? Which run? Which commit? Nobody knows.

What to do, concretely

If you’re walking into a repo with AWS access keys in GitHub Actions secrets, the moves I’d make this week:

Set up the OIDC identity provider once per AWS account. Five-line Terraform resource. Apply it from your bootstrap module.

Create one IAM role per workflow scope. Not one per workflow: one per (repo, branch, environment, permission-scope) tuple. If five workflows need the same permissions and the same scope, they share a role.

Write tight trust conditions. No wildcards. Specific repos. Specific branches or environments. Use environments for anything production-adjacent.

Update workflows to use configure-aws-credentials@v4 with role-to-assume. Add permissions: id-token: write. Remove the access-key secrets from the workflow.

Delete the access keys. Find the IAM users they belonged to. Decide whether the user still needs to exist (probably not). Audit CloudTrail for unexpected usage during the transition.

Audit your trust policies after a quarter. New repos, new workflows, new branches will have piled up. The trust policy is a list of specific sub claims; that list will drift toward being too permissive over time. Treat the trust policy as a piece of code that needs review.
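The "no wildcards" rule from the list above can also be checked at plan time rather than left to review alone. This is one possible guard, not a standard pattern: the variable name is hypothetical, and it assumes Terraform 1.5 or later for strcontains.

```hcl
variable "allowed_subjects" {
  description = "Exact sub claims allowed to assume the GitHub Actions role."
  type        = list(string)

  validation {
    # Every entry must be a literal sub claim with no wildcard characters.
    condition = alltrue([
      for s in var.allowed_subjects :
      !strcontains(s, "*") && !strcontains(s, "?")
    ])
    error_message = "Trust-policy sub claims must be exact values; wildcards (* or ?) are not allowed."
  }

  validation {
    # Every entry must be scoped to the expected organization.
    condition = alltrue([
      for s in var.allowed_subjects : startswith(s, "repo:acme-org/")
    ])
    error_message = "Every sub claim must start with repo:acme-org/."
  }
}
```

Feeding var.allowed_subjects into the values of the sub condition means a wildcard fails terraform plan with a readable message instead of quietly shipping. It does not replace the quarterly audit; it only stops the most obvious kind of drift.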

The longer thread

Long-lived credentials are the original sin of CI/CD security. In every breach postmortem I’ve read where the entry point was a credential leak, the leak was a long-lived key sitting in a secret store, sometimes for years, often forgotten. OIDC federation removes the entire category. The token can’t leak meaningfully because it’s already expired by the time you find it.

The reason it took the industry as long as it did to move is that the alternative (short-lived credentials from a per-run identity) needed two pieces of infrastructure: the CI provider publishing OIDC tokens, and the cloud accepting them as trust principals. AWS IAM had supported OIDC identity providers for years; GitHub Actions shipped its token issuer in late 2021. Both pieces have been in place ever since. The migration is overdue.

If your Terraform pipelines are still on access keys, this is the week to fix it. The wiring is a couple of hours. The security improvement is a category change.

Sid