The five questions every audit trail must answer

If your audit trail can't answer all five questions (what happened, why was it allowed, under what rules, who is accountable, who coordinated), you don't have an audit trail. You have logs. This piece covers the five questions, what populating them takes, and the predictable failure mode for teams that miss each one.

I've sat in enough rooms with enough auditors and enough on-call engineers to have a strong opinion about this. Most teams I work with believe they have an audit trail because they have logs. They don't. They have logs. The two look similar from twenty feet away and behave very differently the moment you need one of them.

Here's the test I use. There are five questions every audit trail must answer for any single action the system took. Not most actions. Any action. Pick a row at random (an automated refund, a model invocation, an agent's tool call, a permission grant, a config change), and if the trail can't answer all five, you don't have a trail. You have a debugging breadcrumb that got promoted into a compliance story by hopeful naming.

The five: what happened, why was it allowed, under what rules, who is accountable, who coordinated. They sound mundane. They are. The work of populating them honestly is what separates platforms that survive an audit from platforms that survive an audit the first time and spend a quarter rebuilding before the next one.

Question one: what happened

The cheapest one. Also the only one most teams can answer.

What happened means the action, the inputs the system saw at decision time, the outputs it produced, the side effects that propagated, timestamps, and the identifiers that correlate the row to everything else that touched it. Not a stack trace. Not a printf. A structured record of what the system did, in a shape that survives being queried six months later by someone who wasn't on the team when it happened.
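
To make the shape concrete, here's a minimal sketch of what such a row might look like. The AuditEvent dataclass, the field names, and the emit_audit_event sink are hypothetical, not a reference to any particular library; what matters is the shape: action, inputs, outputs, side effects, timestamps, and a correlation identifier, structured and queryable.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid

@dataclass
class AuditEvent:
    """One structured row: what the system did, in a queryable shape."""
    action: str            # e.g. "refund.issue"
    actor: str             # service or user that performed it
    inputs: dict           # what the system saw at decision time
    outputs: dict          # what it produced
    side_effects: list     # downstream effects that propagated
    correlation_id: str    # ties this row to everything else that touched it
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def emit_audit_event(event: AuditEvent) -> None:
    # Hypothetical sink: in practice this goes to your event store, not stdout.
    print(json.dumps(asdict(event)))

emit_audit_event(AuditEvent(
    action="refund.issue",
    actor="service:billing-worker",
    inputs={"order_id": "ord_9182", "amount_cents": 4500},
    outputs={"refund_id": "rf_2231", "status": "settled"},
    side_effects=["ledger.entry_created", "email.receipt_sent"],
    correlation_id="corr_7f3a",
))
```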

Populating it takes the discipline I covered in why traceability dies in most platforms: structured logs from day one, a shared schema, correlation IDs that propagate through the framework rather than depending on every developer to remember them, and smoke tests that verify the deployed service still emits in-schema. The mundane work that's always one sprint from done.
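
One way to make propagation a framework concern rather than an act of per-developer memory is a context variable that's set once at the service boundary and read by every emit. A minimal sketch, with hypothetical names:

```python
import contextvars
import uuid

# Framework-owned context: set once at the service boundary, read by every emit.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def begin_request(incoming_id=None):
    """Called by edge middleware, not by feature code; reuses an upstream ID if present."""
    cid = incoming_id or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid

def current_correlation_id():
    cid = _correlation_id.get()
    if cid is None:
        # The smoke test that checks in-schema emission should catch this path.
        raise RuntimeError("audit emit outside a request context")
    return cid

begin_request("corr_7f3a")
print(current_correlation_id())   # "corr_7f3a", without any handler having to pass it along
```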

The predictable failure mode is the one I've watched play out a dozen times. The on-call engineer pulls up the row, finds three log shapes from three versions of the service, reconstructs a partial story, and writes a postmortem that begins "we believe what happened was…" The auditor reads "we believe" and asks for the row that proves it. There is no row. The team learns the difference, late.

Question two: why was it allowed

The question most teams get wrong, because the system can answer "what was returned" and they conflate that with "why."

Why was it allowed means: at the moment of the action, what authorization decision permitted it? Not "the API returned 200." The actual authority (policy evaluation, role check, rule that fired) recorded with a stable identifier that points back to the rule's definition at the time. The row should carry an identifier like policy.refund.automatic.v4#tier-a-window. The identifier resolves to the rule's text. The text lives in version control, with an author, an approver, and a PR that explains the intent.
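
A sketch of what that can look like in code, with names and thresholds I've made up for illustration. The essential property is that the decision comes back carrying the identifier of the rule that fired, and the row stores the whole decision rather than a boolean:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AuthorityDecision:
    """The 'why was it allowed' half of the row: the rule that fired, not the HTTP status."""
    allowed: bool
    rule_id: str       # stable and resolvable, e.g. "policy.refund.automatic.v4#tier-a-window"
    evaluated_at: str

def evaluate_refund_policy(tier: str, amount_cents: int) -> AuthorityDecision:
    # Stand-in for a real policy engine; what matters is that the decision
    # carries the identifier of the rule that made it.
    within_window = tier == "A" and amount_cents <= 5_000
    return AuthorityDecision(
        allowed=within_window,
        rule_id="policy.refund.automatic.v4#tier-a-window",
        evaluated_at=datetime.now(timezone.utc).isoformat(),
    )

decision = evaluate_refund_policy(tier="A", amount_cents=4_500)
# The audit row stores the whole decision object, not just a boolean or a 200.
```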

This is the bidirectional link (outcome back to rule) that I've argued is half the value of traceability and the half most platforms leave on the table. Forward traces tell you what the system did. The "why allowed" pointer tells you what the system was supposed to do, recorded at the moment it did it.

Populating it takes treating every authority decision as a first-class event with a stable rule identifier attached. Policy engines do this naturally. Hand-rolled permission checks usually don't, because the engineer who wrote the check didn't think of themselves as authoring a rule, they thought of themselves as writing an if statement. The fix is the one that runs through this whole series: treat rules as code, centralize them, give them IDs, project them into every consumer.
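
Here's roughly what that promotion looks like, with hypothetical names. The check itself doesn't change; it just becomes a named, versioned rule that records its own decision:

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    is_support: bool

# The same hand-rolled check, but promoted: it has a name, a version, and it
# records its own decision instead of being an anonymous if statement.
RULE_ID = "rule.support.refund-limit@3"   # hypothetical identifier in a central registry

def can_issue_refund(user: User, amount_cents: int, decisions: list) -> bool:
    allowed = user.is_support and amount_cents <= 5_000
    decisions.append({
        "rule_id": RULE_ID,
        "allowed": allowed,
        "subject": user.id,
        "context": {"amount_cents": amount_cents},
    })
    return allowed

decision_log: list = []
can_issue_refund(User(id="u_812", is_support=True), 4_500, decision_log)
```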

The predictable failure mode is the conversation I've had to have more than once. Finance flags an unexpected automated refund. Engineering can confirm it was processed. Engineering cannot confirm it was supposed to be processed. The team spends a week reconstructing by hand what the rule should have been and whether the customer's state matched. The reconstruction is a guess. The guess is the audit answer. The auditor knows it's a guess.

Question three: under what rules

Question two's older sibling, and the one teams underestimate most.

Under what rules means: the version of the rule that was in force at the time of the action. Not the current rule. The rule active when the action happened. A rule that was permissive in Q1 and restrictive in Q3 should reconstruct a Q1 decision against the Q1 version. The audit answer for "was this allowed" depends entirely on which version was in scope.

Populating it takes versioned rules, in source control, with a stable identifier like name@version that the row carries inline. The rule store queryable across history. The deploy pipeline recording which version was active in each environment. When someone asks "under what rules was this decision made," the answer has to be a hash, a tag, a commit ID, something resolvable, not "the rules we had at the time, probably."
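
A minimal sketch of what that resolution might look like, with an in-memory dict standing in for version control or a history-preserving rule store; the identifiers and values are made up:

```python
# The row carries name@version inline; the store can answer for any version in
# history, not just the current one. The dict stands in for version control.
RULE_HISTORY = {
    ("policy.refund.automatic", "v3"): {"max_cents": 10_000, "approved_by": "j.doe"},
    ("policy.refund.automatic", "v4"): {"max_cents": 5_000,  "approved_by": "a.chen"},
}

def resolve_rule(identifier: str) -> dict:
    """identifier is the inline form the row carries, e.g. 'policy.refund.automatic@v3'."""
    name, _, version = identifier.partition("@")
    return RULE_HISTORY[(name, version)]

# A Q1 decision resolves against the Q1 version, even after the rule tightened in Q3.
row = {"action": "refund.issue", "rule": "policy.refund.automatic@v3"}
print(resolve_rule(row["rule"]))   # {'max_cents': 10000, 'approved_by': 'j.doe'}
```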

This is one of the places where the Decisions as Code discipline pays for itself in a way that's hard to fake. Rules in version control with proper versioning: the answer is mechanical. Rules in a database edited in place: the answer is "we'll have to ask the person who edited it, if they remember."

The predictable failure mode is the rule that got changed three months after the disputed action, in a way that retroactively makes the original action look wrong. The team can't show what the rule was at the time. The conversation becomes about people instead of about the system, and everyone walks away frustrated.

Question four: who is accountable

The question teams sometimes claim to answer with "the user ID."

Accountability isn't the user ID. It's the chain. For a human action: user, role, delegation, policy that scoped the role. For an automated action: service, deploy, owning team. For an agent action: agent, policy that bounded it, human who authorized the agent's autonomy at this scope, instruction that initiated the run.

Accountability means: if this action was wrong, the trail tells me whose decision was wrong. Not whose fingers were on the keyboard. Whose authority allowed it, at which layer. A user acting within their delegation isn't the accountable party, the delegation is. The delegation traces back to the policy. The policy traces back to whoever approved it. The chain stops at a person who agreed in writing that this class of action was acceptable.
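
A sketch of walking that chain from a row, with hypothetical stores and names. The point is that the answer comes from resolving the delegation and the policy, not from reading the actor field:

```python
from dataclasses import dataclass

@dataclass
class Delegation:
    id: str
    granted_by_policy: str    # e.g. "policy.support.refunds@7"

@dataclass
class Policy:
    id: str
    owner: str                # a real field, not a convention
    approved_by: str          # the person who agreed in writing

# Hypothetical lookups standing in for the role and policy stores.
DELEGATIONS = {"del_22": Delegation("del_22", granted_by_policy="policy.support.refunds@7")}
POLICIES = {"policy.support.refunds@7": Policy(
    "policy.support.refunds@7", owner="payments-platform", approved_by="m.ortiz")}

def accountable_party(row: dict) -> str:
    """Walk the chain: actor -> delegation -> policy -> approver."""
    delegation = DELEGATIONS[row["delegation_id"]]
    policy = POLICIES[delegation.granted_by_policy]
    return policy.approved_by

row = {"actor": "user:u_812", "delegation_id": "del_22"}
print(accountable_party(row))   # "m.ortiz", not whoever's fingers were on the keyboard
```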

Populating it takes treating roles, delegations, and policies as first-class versioned artifacts with explicit owners, not implicit conventions buried in code. The "owner" field is a real field. The PR that introduced the rule is the intent record. The accountable party is whoever's name is on the most recent approval.

The predictable failure mode is the action that was technically permitted, materially wrong, and politically unowned. The user did what the system let them do. The system let them because nobody noticed the policy was over-broad. The audit answer becomes a finger-pointing exercise. The fix becomes a meeting about who should have caught it. Neither is an audit answer. The trail should have already named the policy author and approver, and the conversation should have moved straight to whether the policy needs to change.

Question five: who coordinated

The one nobody asks about until they need it.

Most non-trivial actions aren't single events. They're chains. A refund involves the customer-service tool, policy engine, payment processor, notification service, ledger update, email send. An agent's tool call involves the agent, the framework, the policy gate, the tool, the downstream system, the response back to the agent. Five to ten participants is normal.

Who coordinated means: which orchestrator owned the chain, which participants it called, in what order, with what handoffs, with what failures and retries along the way. Not "we have logs from each service." The coordination story, who handed work to whom, recorded in a way that lets you reconstruct the chain from any participant's perspective.

Populating it takes a coordination identifier that propagates through every participant, plus a record of the coordination itself, the orchestrator's view of the run, with each participant's role labeled. The forward trace gives you the spans. The audit angle asks: which participant was the coordinator, which were consulted, which were informed? RACI, applied to the runtime.
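
A sketch of what the orchestrator's record might look like, with illustrative names; the coordination ID is the same one that shows up in every participant's own rows:

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    name: str
    role: str     # "coordinator", "consulted", or "informed"
    status: str   # "ok", "retried", "failed"

@dataclass
class CoordinationRecord:
    """The orchestrator's view of the run, recorded as a first-class event of its own."""
    coordination_id: str                               # propagates into every participant's rows
    participants: list = field(default_factory=list)   # ordered: this is the handoff sequence

record = CoordinationRecord(coordination_id="coord_51c9")
record.participants += [
    Participant("refund-orchestrator", role="coordinator", status="ok"),
    Participant("policy-engine",       role="consulted",   status="ok"),
    Participant("payment-processor",   role="consulted",   status="retried"),
    Participant("notification-svc",    role="informed",    status="ok"),
]
```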

The predictable failure mode is the multi-step action that goes wrong in step four, and the postmortem that reconstructs steps one through three from one log source and steps five and six from another but never quite gets step four because the service that was supposed to coordinate the handoff didn't tie upstream and downstream together. For human-only systems this is annoying. For agent-driven systems, where the agent itself is a participant and may have made an autonomous routing decision mid-chain, this is disqualifying.

What separates a trail from a pile of logs

The five questions aren't a checklist you can satisfy with a logging library. They are a shape the platform has to be designed to. Each demands a specific discipline: structured emit with schema contracts gated at deploy; every authority decision recorded with a stable rule identifier; rules versioned with the version carried inline in the row; roles, delegations, and policies as first-class owned artifacts; coordination identifiers that propagate across participants with the orchestrator's view recorded as a first-class event.
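
Putting the five together, a single row that answers all of them might look something like this. The field names are illustrative; what matters is that every answer is present, structured, and resolvable:

```python
# A single row that answers all five questions, with hypothetical field names.
row = {
    # 1. What happened
    "action": "refund.issue",
    "inputs": {"order_id": "ord_9182", "amount_cents": 4500},
    "outputs": {"refund_id": "rf_2231", "status": "settled"},
    "occurred_at": "2025-03-14T09:21:07Z",
    # 2. Why it was allowed
    "authority": {"rule_id": "policy.refund.automatic.v4#tier-a-window", "allowed": True},
    # 3. Under what rules
    "rule_version": "policy.refund.automatic@v4",
    # 4. Who is accountable
    "accountability": {"actor": "user:u_812", "delegation": "del_22",
                       "policy_owner": "payments-platform", "approved_by": "m.ortiz"},
    # 5. Who coordinated
    "coordination_id": "coord_51c9",
}
```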

None of this is novel. All of it is the mundane work that gets deprioritized exactly when the team is busy. And the questions compound: a trail that answers four and not the fifth fails on the action that needed the fifth, which is always the action the auditor pulls.

Want to go deeper on the companion pieces? The reverse-direction primitive is in outcome back to the rule, and the decay modes that kill the answers are in why traceability dies in most platforms.

The cost of getting it wrong arrives late. The team that has logs and believes it has a trail will get through the first conversation with a friendly auditor. They will get to the second conversation, with a less friendly auditor on a deal worth caring about, and they will spend the next quarter retrofitting what should have been built in. I have watched that quarter happen. It is more expensive than the year of discipline it would have replaced.

The five questions are the test. Apply them to a random row. If the trail answers all five, you have an audit trail. If it can't, you have logs, and somewhere in the queue is the action that will surface the gap, on a schedule you don't get to choose.

- Sid