Tenant-scoped policies without tenant-scoped code

Policies vary per tenant. Code does not. The architectural payoff of OPA-style, tenant-aware policy bundles that don't require a tenant-aware codebase: bundle resolution, defaults, overrides, and the audit story that makes 'what happened for tenant X' answerable without grep.

The biggest architectural mistake I have watched teams make in year two of multi-tenancy is the same one every time: they start branching the codebase on tenant. Not in some grand explicit way; there is no if tenant.id == "acmecorp": check anyone would admit to writing. It happens by accretion. A feature flag, a per-tenant config block, a class hierarchy that quietly grew a RetailTenantPipeline and a BankTenantPipeline because the bank wanted retention to work differently. By month nine the diff between the "shared" path and the special-cased one is six hundred lines and nobody can answer "what does the system do for tenant X" without reading both branches and a JIRA ticket from last spring.

The escape from that pattern is the same one I described in the migration piece, generalized to the steady state. Policies vary per tenant. Code does not. Every behavior that needs to differ between tenants is a policy decision, and policy decisions live outside the code path, in a bundle the tenant owns, evaluated by an engine the platform owns. The code path is one path. The policy bundle is the per-tenant differentiator.

This piece is about how to actually do that. The OPA-shaped pattern (bundle resolution, defaults, overrides, audit) applied to the tenant boundary. None of it is hypothetical. I have shipped it.

Here's the principle

State the principle plainly so the rest has somewhere to land. Tenant-scoped policy, tenant-agnostic code. The code path takes inputs, calls the policy engine with (tenant, action, resource, context), and acts on the structured decision. The code path does not know what the policy says, what tenant it is for, or which knobs the tenant turned. It knows that an answer came back and what to do with each answer.

The decision is structured: not just allow or deny, but an answer that carries everything the code path needs in order to act. Allowed-with-redaction. Allowed-but-route-to-region-eu. Denied-with-reason-X. The code path consumes the structured decision and proceeds without ever growing a tenant-aware branch.
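To make that concrete, here is a minimal sketch of a structured decision and a code path that consumes it. Everything named here (Decision, Effect, handle_export, stub_engine, and the canned answers) is invented for illustration; a real engine would evaluate a bundle rather than look up a dict. The point is only the shape: the handler branches on the decision and never on the tenant.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Decision:
    effect: Effect
    reason: str = ""
    redact_fields: Tuple[str, ...] = ()   # allowed-with-redaction
    route_region: Optional[str] = None    # allowed-but-route-to-region-X


# Hypothetical engine signature: (tenant, action, resource, context) -> Decision.
Engine = Callable[[str, str, dict, dict], Decision]


def handle_export(engine: Engine, tenant: str, record: dict, context: dict) -> dict:
    """The one code path. It branches on the decision, never on the tenant."""
    decision = engine(tenant, "export", record, context)
    if decision.effect is Effect.DENY:
        return {"status": "denied", "reason": decision.reason}
    # Drop whatever fields the decision says to redact.
    payload = {k: v for k, v in record.items() if k not in decision.redact_fields}
    return {"status": "ok", "region": decision.route_region or "default", "payload": payload}


def stub_engine(tenant: str, action: str, resource: dict, context: dict) -> Decision:
    """Stand-in for a real engine; the canned answers are invented for the demo."""
    canned = {
        "acme": Decision(Effect.ALLOW, redact_fields=("ssn",), route_region="eu"),
        "globex": Decision(Effect.DENY, reason="export requires admin role"),
    }
    return canned.get(tenant, Decision(Effect.ALLOW))
```

Note that the tenant identifier is passed through to the engine and nowhere else. If a tenant name ever shows up in a conditional inside handle_export, the pattern has already started to rot.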

The architectural payoff: the code path is one path. The same one. Forever. Every tenant runs through it. Every regression test is a test of the actual production code path, not the version of it that happened to be true for the tenant the test was written against. The variation lives in the bundle, the bundle is small and inspectable, and "what does the system do for tenant X" reduces to "show me tenant X's bundle and the request log." That is a tractable question. The branched-codebase version is not.

Tenant-bundle resolution

The bundle is a set of policies, scoped to a tenant, packaged as a unit. In OPA it is a tarball with data.json and .rego files. In a hand-rolled engine it might be a row in tenant_policy_bundles with a JSON column. The shape doesn't matter; the lifecycle does.

Resolution: a request arrives, the platform extracts the tenant context (the token-claim work the migration piece covered), and before the code path acts on anything that could vary per tenant, it asks the engine for a decision. Same call, same shape, every time, regardless of tenant.

Two implementation details matter more than they look like they should.

Bundles are cached, not loaded per request. The engine fetches a tenant's bundle once on cold-start (or on a push from the bundle registry), keeps it in memory, and re-evaluates locally on every request. A request that crosses a network hop to fetch a policy is a request that will time out under load, and "the policy fetch flaked" becomes a paged incident. Bundles are kilobytes. Cache them. Refresh on a push signal or a TTL. The engine is a sidecar to the code path, not a remote service it begs for an answer.
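A rough sketch of that caching behavior, assuming an in-process cache with a TTL and a push hook. BundleCache, the fetch callable, and the 300-second default are all placeholders I made up for the example, not any engine's real API; OPA has its own bundle download and polling machinery that you would use instead of rolling this yourself.

```python
import time
from typing import Callable, Dict, Tuple


class BundleCache:
    """Keeps each tenant's bundle in memory; refetches on TTL expiry or push."""

    def __init__(
        self,
        fetch: Callable[[str], dict],          # hypothetical registry fetch
        ttl_seconds: float = 300.0,            # assumed default, tune to taste
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, tenant: str) -> dict:
        now = self._clock()
        entry = self._entries.get(tenant)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                    # hot path: no network hop
        bundle = self._fetch(tenant)           # cold start or expired entry
        self._entries[tenant] = (now, bundle)
        return bundle

    def on_push(self, tenant: str) -> None:
        """Called when the registry signals a new bundle for this tenant."""
        self._entries.pop(tenant, None)
```

The clock is injectable so the expiry logic can be tested without sleeping. One thing this sketch deliberately leaves out is what to do when a refetch fails; serving the stale bundle and alerting is usually a better answer than failing the request, but that is a policy decision of its own.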

Bundles are versioned and pinned. Every decision the engine makes is tagged with the bundle hash that produced it. Rolling out a bundle change is a rollout, not an edit-in-place; rollback is a pin to the prior hash. This is the same supply-chain discipline I made the case for in the OPA Gatekeeper piece: pin the version, log the hash, never let a silent change go out. Policy is software. Treat it like software.
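One way to picture hash-pinning, as a sketch under assumptions: the bundle is treated as a plain JSON-serializable dict, the hash is SHA-256 over a canonical serialization, and BundleRegistry is a toy in-memory registry I invented for the example. A production registry would persist history and record who published each version.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def bundle_hash(bundle: dict) -> str:
    """Content hash over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class BundleRegistry:
    """Toy registry: every publish is a new pinned version, never an edit."""

    def __init__(self) -> None:
        self._versions: Dict[str, dict] = {}      # hash -> bundle content
        self._history: Dict[str, List[str]] = {}  # tenant -> hashes, newest last

    def publish(self, tenant: str, bundle: dict) -> str:
        digest = bundle_hash(bundle)
        self._versions[digest] = bundle
        self._history.setdefault(tenant, []).append(digest)
        return digest

    def active(self, tenant: str) -> Tuple[str, dict]:
        digest = self._history[tenant][-1]
        return digest, self._versions[digest]

    def rollback(self, tenant: str) -> str:
        """Re-pin the tenant to the previous hash and return it."""
        history = self._history[tenant]
        if len(history) < 2:
            raise ValueError(f"no prior bundle to roll back to for {tenant}")
        history.pop()
        return history[-1]
```

The hash returned by active() is what gets stamped on every decision record, which is what lets you later say exactly which version of the policy produced a given answer.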

Defaults: where most of the policy actually lives

The temptation is to imagine every tenant authoring a bundle from scratch. That is wrong, and the architectures that imagine it that way collapse under a hundred bundles all reinventing the same wheel.

What actually works: a platform default bundle defines the entire policy surface, with sensible defaults for every decision. Every tenant inherits from that default. A tenant's bundle is a thin overlay, usually a handful of overrides on top of the platform default. Tenants do not author from scratch. They override.

This collapses ninety percent of the cost of a tenant-policy system. A new tenant is provisioned with an empty overlay; they get the default for everything; their behavior is identical to every other tenant on defaults; the platform team maintains the default in one place and a change there propagates to everyone who has not overridden it. The first time a tenant needs different behavior (a longer retention class, a stricter redaction list, an EU-only routing rule) they get a single-key override, not a forked bundle.

The default bundle is also where the Decisions as Code discipline lives at the tenant boundary. The set of decisions a tenant can turn is finite, named, documented, and small. Five real decisions, not eighty-nine raw config keys. Plan tier. Region constraint. Retention class. SSO provider. Default member role. Each one a policy primitive, each one with a typed value, each one with a default. The override surface is the contract. The contract is the product.
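As an illustration of how thin the overlay can be, here is a sketch that resolves a tenant's effective policy from the platform default plus an overlay. The five keys mirror the decisions named above; the specific default values ("standard", "any", and so on) are placeholders I picked for the example, not recommendations.

```python
from typing import Dict, Tuple

# The whole policy surface, with a default for every decision.
# Values are illustrative assumptions.
PLATFORM_DEFAULTS: Dict[str, str] = {
    "plan_tier": "standard",
    "region_constraint": "any",
    "retention_class": "standard",
    "sso_provider": "none",
    "default_member_role": "member",
}


def effective_policy(overlay: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {primitive: (value, source)}; source is 'default' or 'override'."""
    resolved: Dict[str, Tuple[str, str]] = {}
    for key, default in PLATFORM_DEFAULTS.items():
        if key in overlay:
            resolved[key] = (overlay[key], "override")
        else:
            resolved[key] = (default, "default")
    return resolved


# A new tenant has an empty overlay and inherits everything.
new_tenant = effective_policy({})

# A tenant with one override differs from the default in exactly one key.
bank = effective_policy({"retention_class": "extended"})
```

Carrying the source alongside the value is a small design choice that pays off later: when someone asks why a tenant behaves the way it does, the resolved policy already says which values the tenant chose and which they inherited. This function only iterates the default's keys, so an unknown key in the overlay is silently ignored here; rejecting it is the job of schema validation at admission.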

Overrides: how a tenant changes behavior without changing code

An override is a tenant-authored statement: "for this decision, use this value instead of the default." Mechanically, a key-value entry in the tenant's overlay. Operationally, a record in the audit log of who set it, when, why, and what bundle hash it produced.

The right shape for the override surface is the same shape the platform default declares. If the default says retention_class: "standard", the overlay says retention_class: "extended". Same key, different value. The overlay is type-checked at compile time, schema-validated at admission, and rejected if the value is not in the allowed set. A tenant cannot introduce a new policy primitive by editing their overlay; they can only set known primitives to known values.
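A minimal sketch of that admission check, assuming the schema is just a map from primitive name to its allowed values. OVERRIDE_SCHEMA, its contents, and validate_overlay are hypothetical names for this example; in practice you might express the same thing as JSON Schema or as a typed enum per primitive.

```python
from typing import Dict, FrozenSet, List

# The override surface: known primitives and their allowed values.
# Contents are illustrative assumptions.
OVERRIDE_SCHEMA: Dict[str, FrozenSet[str]] = {
    "plan_tier": frozenset({"standard", "enterprise"}),
    "region_constraint": frozenset({"any", "eu-only", "us-only"}),
    "retention_class": frozenset({"standard", "extended"}),
}


def validate_overlay(overlay: Dict[str, str]) -> List[str]:
    """Return a list of admission errors; an empty list means the overlay is accepted."""
    errors: List[str] = []
    for key, value in overlay.items():
        allowed = OVERRIDE_SCHEMA.get(key)
        if allowed is None:
            # A tenant cannot invent a new primitive.
            errors.append(f"unknown primitive: {key}")
        elif value not in allowed:
            errors.append(f"{key}: {value!r} not in {sorted(allowed)}")
    return errors
```

With this in place, validate_overlay({"retention_class": "extended"}) returns no errors, while an overlay that tries to add retention_days is rejected as an unknown primitive. Adding retention_days to the surface means editing OVERRIDE_SCHEMA, which is a reviewed change to platform code rather than a line in somebody's YAML.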

This matters more than it looks. The thing you are protecting against is not a tenant typing a malicious value (the schema catches that). The thing you are protecting against is the platform team accidentally widening the override surface every time a customer asks for something. That is how you end up with eighty-nine knobs. The overlay schema is the gate. New overrides are a deliberate platform decision, made centrally, made with the audit trail to prove it. They are not a thing that happens because a sales engineer YAML'd in a new key on a Tuesday.

The audit story: "what happened for tenant X" without grep

Here is the test that decides whether your tenant-policy architecture is real or a story. A customer success engineer pages you on a Friday afternoon. Tenant ACME says they got rejected on an action they expect to be allowed. What do you tell them in the next twenty minutes?

In the branched-codebase world, the answer is "let me read the code." You grep for tenant.id, find three special cases, read the JIRA ticket from last spring, ask whether ACME's pipeline is the v2 path or the v1 path, ask whether a flag has been flipped, and eventually shrug and ask the customer to retry while you keep digging. The answer takes hours. Sometimes days. Sometimes you never get to a clean answer and the resolution is "we changed something and it works now."

In the tenant-policy world, the answer is "let me read the decision log." Every evaluation emits a structured record: tenant, request inputs, bundle hash, rule that fired, decision, reason. You filter by tenant ACME and timestamp. You see the rejected request, the bundle hash that produced it, the rule that fired, the input value that triggered the rule, and the override (or default) that set the threshold. You answer the customer in five minutes with a sentence that starts "your bundle hash AB12 has retention_class set to standard, and the action you tried requires extended; here is when that override was set and by whom." The customer either agrees with the policy or they don't, and if they don't you change one overlay value and ship a new bundle hash.
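For a sense of what that record and that filter might look like, here is a sketch. The field names follow the list above (tenant, inputs, bundle hash, rule, decision, reason); DecisionRecord, denials_for, and explain are names I made up, and a real deployment would query a log store rather than filter a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class DecisionRecord:
    timestamp: datetime
    tenant: str
    action: str
    inputs: dict        # the request inputs the engine saw
    bundle_hash: str    # which pinned bundle produced this decision
    rule: str           # the rule that fired
    decision: str       # "allow" or "deny"
    reason: str


def denials_for(
    log: Iterable[DecisionRecord], tenant: str, since: datetime, until: datetime
) -> List[DecisionRecord]:
    """Filter the decision log down to one tenant's denials in a time window."""
    return [
        r for r in log
        if r.tenant == tenant and r.decision == "deny" and since <= r.timestamp <= until
    ]


def explain(record: DecisionRecord) -> str:
    """One sentence a support engineer can paste into a ticket."""
    return (
        f"{record.timestamp.isoformat()} {record.action} denied by rule "
        f"{record.rule} (bundle {record.bundle_hash}): {record.reason}"
    )
```

The sketch assumes timezone-aware timestamps throughout; mixing naive and aware datetimes in the comparison would raise a TypeError, which is arguably the behavior you want from an audit tool.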

The audit story is the architectural payoff most worth optimizing for. It is the difference between an enterprise that buys you and an enterprise that doesn't, because enterprise buyers ask the "what happened for tenant X" question every week and they can tell from the cadence of your answer whether you have the architecture or you have a story.

The chain from "this request was rejected" back to "the platform team decided in March that retention requests over 90 days require an admin role" is end-to-end, with timestamps and authors at every step. That is what a SOC 2 auditor wants. That is what a future-you debugging a regression wants.

The post-migration steady state

The migration piece ended at the deprecation ceremony. This piece is what the next eighteen months look like, if you do them right. The code path stays single. The policy surface grows. New tenants get the default bundle. Custom enterprise tenants get an overlay. The platform team maintains one code path and one default policy and adjudicates new override primitives with deliberation.

The teams I have watched fail in year two are the ones that did the migration well and then started branching the code anyway, because the policy engine felt like overhead and the per-tenant overlay felt like ceremony. They were right that it was overhead. They were wrong about what it was buying them. The overhead is the thing that makes the question "what does the system do for tenant X" answerable. Without it, the answer is always "let me read the code, give me a few hours."

Tenant-scoped policy, tenant-agnostic code. One code path serves all tenants. Policy is the differentiator. The bundle is the contract. The decision log is the audit trail. That is the architecture that holds up in year three when you have forty enterprise tenants, a SOC 2 audit, and a customer success team that needs five-minute answers on Friday afternoons. The discipline is the one the series has been arguing for: pay the architectural tax deliberately, and refuse to let the codebase absorb variation that belongs in the policy layer.

- Sid