Auth and multi-tenancy from day one

The cheapest day-one decision is also the most expensive one to defer: auth, tenant scoping, and row-level security wired in before you have a single customer.

Here's the layman version. You're building an AI tool for a single consultant, let's say a sales coach who's packaged her discovery framework into something her clients can talk to directly. Day one. It's just her and a handful of pilot customers. The temptation is enormous to skip the boring identity stuff. "It's just one tenant. I'll add multi-tenancy later." That sentence is the most expensive sentence in software. I've watched companies eat six- and seven-figure rebuilds because they believed it.

The technical version: you need a real identity provider, you need a tenant scope on every row in your database, and you need the database itself to enforce that scope. All three. On day one. Before you have customers. Before you have data worth protecting. Especially before then, because once you have data the only safe migration is "stop the world."

[Diagram: the tenant scope chain. Cognito signs a tenant_id claim into the JWT; the Lambda handler reads the claim and sets a database session variable; Postgres RLS policies match each row's tenant_id against that variable, so Tenant A, B, and C rows stay scoped. Three layers: tenant scope set at auth, enforced in the DB.]

This article is about the cheapest version of that day-one setup that actually holds. It's the version I'd build for a single consultant and grow without rewrites to a hundred. Same shape, more rows.

Why "later" is so expensive

When people say "I'll add multi-tenancy later," what they usually mean is one of three things, and all three are wrong in the same direction.

The first version: "I'll wrap every query with a WHERE tenant_id = ? later." You won't. By the time you have ten endpoints and twenty queries and a background job and an export script, you will miss one. The one you miss will leak one tenant's data to another. You will not find out from your monitoring; you will find out from the customer who saw the wrong logo.

The second version: "I'll move to per-tenant databases later." This is even more expensive. Migrating one shared database into N per-tenant databases is a project that touches the application layer, the data layer, your backups, your migrations, your secrets, and your billing. It's a quarter, minimum. And during that quarter, you are not shipping features.

The third version: "I'll bolt auth on later, for now I'll use a shared API key." This one is the worst because it's the one that ships fastest and looks fine in demos. You'll be reading about your own breach on Hacker News before you finish writing the auth ticket.

All three failure modes have the same fix, and the fix is cheap if you do it on day one. So let's do it on day one.

The minimum that holds

The shape I start with for any AI-integrated MVP looks like this. It's three things stacked.

Cognito at the front. AWS Cognito, their hosted identity service, does the boring identity work: signup, login, password reset, MFA. At the end of it, it hands you a JWT, which is just a signed token that proves who you are. I don't write password hashing code in 2026. I don't write password reset flows. Cognito gives me a user pool, optional MFA, and a token I can pass through the rest of the stack.

The trick is what I put in the Cognito user attributes. Every user gets a tenant_id claim baked into their token. That claim is set when the user is created and never changes. It's not editable from the client. It's the thing the rest of the system trusts.
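As a sketch, here's what reading that claim looks like in a handler, assuming a REST API with a Cognito authorizer (which puts verified claims at requestContext.authorizer.claims, with custom Cognito attributes prefixed custom:). The function name and event shape details are illustrative:

```python
import uuid


def tenant_id_from_event(event: dict) -> str:
    """Read tenant_id from the claims API Gateway already verified.

    Never from the request body, never from a query string: only from the
    signed token. Assumes a REST API Cognito authorizer, which places the
    verified claims at requestContext.authorizer.claims.
    """
    claims = event["requestContext"]["authorizer"]["claims"]
    tenant_id = claims["custom:tenant_id"]  # raises KeyError if the claim is missing
    uuid.UUID(tenant_id)  # reject anything that isn't a well-formed UUID
    return tenant_id
```

The UUID check is cheap insurance: the value is about to be handed to the database as a session variable, so it should never be free-form text.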

Tenant_id on every row. Every table that holds anything customer-facing has a tenant_id uuid not null column. Every single one. The clients table, sure, but also the conversations, the embeddings, the audit log, the feature flags, the generated artifacts. If a row exists, it belongs to someone. If you can't say which tenant owns a row, you have a bug.
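A minimal sketch of that shape, with table and column names that are illustrative rather than a prescribed schema:

```sql
-- Every customer-facing table carries the same NOT NULL tenant column.
CREATE TABLE tenants (
  id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  name text NOT NULL
);

CREATE TABLE conversations (
  id         uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id  uuid        NOT NULL REFERENCES tenants (id),
  user_id    uuid        NOT NULL,
  body       jsonb       NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);
```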

Row-level security in Postgres. This is the piece most teams skip and most teams regret. Postgres has a feature called row-level security (RLS): the database applies a filter to every query automatically, driven by a session variable. You set a policy on each table that says "you can only see rows where tenant_id matches the current session's tenant." Then in your Lambda handler, before you run any query, you set the session variable from the JWT claim.
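A minimal sketch of such a policy, assuming a conversations table with a tenant_id column (names illustrative):

```sql
ALTER TABLE conversations ENABLE ROW LEVEL SECURITY;
ALTER TABLE conversations FORCE ROW LEVEL SECURITY;  -- apply it even to the table owner

-- Rows are visible only when the row's tenant matches the session variable.
-- current_setting() errors if the variable was never set, which fails closed.
CREATE POLICY tenant_isolation ON conversations
  USING (tenant_id = current_setting('app.current_tenant')::uuid);
```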

Now even if your application code forgets the WHERE clause (and it will, somewhere, eventually) the database refuses to return cross-tenant data. The leak path is closed at the layer that actually owns the data.

Want the data-layer version of this story? I'll cover the rest of the database setup (pgvector for embeddings, Secrets Manager for credentials, KMS for encryption) in the data layer piece tomorrow. The auth and tenant scoping in this piece are the foundation it sits on.

How the request actually flows

Let me walk through one request end to end, because the wiring matters.

A user, say, a customer of an HR consultant who's packaged her interview-rubric review as a service, logs into the customer-facing app. Cognito hands their browser a JWT. The browser puts that JWT in the Authorization header on every API call.

The API Gateway in front of my Lambda functions is configured with a Cognito authorizer. That means before the request reaches my code, API Gateway has already verified the JWT signature, checked the expiration, and pulled out the claims. If the token's bad, the request never reaches Lambda. That's free defense-in-depth, the auth check runs before I pay for compute.

Inside the Lambda, my handler grabs the tenant_id claim out of the verified token. Not out of the request body. Not out of a query string. Out of the verified, signed token. Then before any database work, the handler runs SET LOCAL app.current_tenant = '<tenant_id>' on the database connection. That sets a session variable that the row-level security policies read.
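In SQL terms, the handler's preamble looks roughly like this; set_config(..., true) is the parameterizable equivalent of SET LOCAL, and the UUID here is a placeholder:

```sql
BEGIN;
-- third argument true = transaction-local, same lifetime as SET LOCAL,
-- but callable with a bind parameter from application code
SELECT set_config('app.current_tenant', '11111111-1111-1111-1111-111111111111', true);
-- every query until COMMIT is now scoped by the RLS policies
SELECT id, body FROM conversations;
COMMIT;
```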

From that point forward, every query (even ones I write badly) only sees rows belonging to that tenant. The HR consultant's customer can ask the AI for a candidate evaluation, and there is no path, even with a SQL injection bug, even with a missing WHERE clause, even with a confused developer, to see anyone else's candidates.

The audit trail is part of auth, not a separate thing

Here's the piece I see skipped almost universally: the audit log has to know which tenant did what, and it has to be tamper-evident from the moment you turn on the lights.

Every state-changing action in my system writes a row to an audit_events table. That row includes the tenant_id, the user_id, the action, the resource, the timestamp, and a hash of the request payload. The audit table has its own RLS policy, a tenant's admin can read their own tenant's audit, and nobody can write directly to that table from application code. Writes happen through a stored procedure that adds the tenant from the session variable, not from a parameter.
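A sketch of that write path, with illustrative names; the key moves are the absence of any INSERT policy (so application code can't write directly) and a SECURITY DEFINER function that stamps the tenant from the session variable, not from a caller-supplied parameter:

```sql
CREATE TABLE audit_events (
  id             bigserial   PRIMARY KEY,
  tenant_id      uuid        NOT NULL,
  user_id        uuid        NOT NULL,
  action         text        NOT NULL,
  resource       text        NOT NULL,
  payload_sha256 text        NOT NULL,
  created_at     timestamptz NOT NULL DEFAULT now()
);

ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;

-- Tenant admins can read their own tenant's audit; no write policies exist,
-- so direct INSERT/UPDATE/DELETE from application roles is denied.
CREATE POLICY audit_tenant_read ON audit_events FOR SELECT
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- SECURITY DEFINER runs with the function owner's rights, which is what
-- lets the insert through while the calling role stays locked out.
CREATE FUNCTION log_audit_event(p_user_id uuid, p_action text,
                                p_resource text, p_payload_sha256 text)
RETURNS void
LANGUAGE sql
SECURITY DEFINER
AS $$
  INSERT INTO audit_events (tenant_id, user_id, action, resource, payload_sha256)
  VALUES (current_setting('app.current_tenant')::uuid,
          p_user_id, p_action, p_resource, p_payload_sha256);
$$;
```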

This sounds like overkill on day one. It is not. The first time a customer asks "did anyone on your side look at our data last Tuesday," you want to be able to answer that without spending a week reconstructing it from CloudWatch logs. The audit table makes that question a one-line SQL query.

Audit lives across the whole stack, and the topic gets its own piece on observability and audit. What I'm describing here is just the auth-shaped slice of it: the tenant_id and user_id stamped on every action. The rest of the audit story builds on the foundation in this article.

Two consultant scenarios, the shape generalizes

Pick any vertical and the shape is the same. Two examples I've sketched recently.

A financial advisor offers portfolio diagnosis as a productized service. Each client uploads their holdings and chats with an AI that knows the advisor's playbook. The advisor sees a queue of every interaction, can step in to override the AI, and exports compliance-friendly logs to her broker-dealer once a month. Tenant_id on day one is the financial advisor herself. Each client is a user under that tenant. The advisor's playbook is a corpus row owned by the tenant. The chat history is owned by the tenant. The compliance export is one query, SELECT * FROM audit_events WHERE tenant_id = ? AND created_at >= ?, and it works because the tenant_id was always there.

Now a marketing strategist productizes her brand-positioning approach. Same product shape, different secret sauce. She onboards her own clients. Each client interacts with the AI version of her positioning approach. She sees the queue, intervenes when the AI wanders, and watches the AI get better as she corrects it. Tenant_id on day one is her. When her business grows and she licenses her platform to two other strategists who run their own clienteles, the new tenants drop in as new rows in the tenants table. No migration. No code change. Same RLS policies do the work.

That's the whole point of doing this on day one. The financial-advisor version and the second-strategist-onboarded version are the same code path. You did the work once.

The parts that will bite you

A few things I've stepped on or watched others step on.

Background jobs forget to set the session variable. Your async worker pulls a job off SQS, runs a query, and the RLS policy slams the door because there's no app.current_tenant set. This is the right behavior, the job should have a tenant scope. The fix is a small wrapper that pulls the tenant_id off the job payload and sets the session variable before any DB work. Make it the only way jobs touch the database. No raw connections allowed.
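That wrapper can be a few lines. A sketch under stated assumptions: the connection is any DB-API-style connection (psycopg-style %s placeholders), and the function names are illustrative:

```python
import uuid
from contextlib import contextmanager


@contextmanager
def tenant_scope(conn, tenant_id: str):
    """The only sanctioned way for a job to touch the database.

    Validates the tenant, opens a transaction, and sets the
    transaction-local variable the RLS policies read; set_config(..., true)
    is the parameterizable equivalent of SET LOCAL.
    """
    uuid.UUID(tenant_id)  # fail fast on a missing or malformed tenant
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        yield cur
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise


def handle_job(conn, payload: dict):
    # The tenant comes off the job payload, stamped there at enqueue time.
    with tenant_scope(conn, payload["tenant_id"]) as cur:
        cur.execute("SELECT id FROM conversations")  # RLS scopes this automatically
```

Make tenant_scope the only code path that hands out cursors and the "forgot the session variable" class of bug disappears.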

Connection pooling can leak the session variable across requests. If you use a connection pool, a connection set to tenant A can come back into the pool and get reused by tenant B. SET LOCAL (note the LOCAL) ties the variable to the current transaction, not the session, so as soon as the transaction ends, the variable is gone. Use SET LOCAL always. Wrap the request in a transaction.

The "service role" temptation. You'll want a back-office tool that can see across all tenants, for support, for debugging, for reports. Don't give that role a magic bypass key. Make it a real role with its own audit trail, its own MFA requirement, and its own RLS policy that says "if the user has the support role, allow." Now your support team's reads also go in the audit log. When something goes wrong, you can answer "who looked at what."
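Because Postgres ORs permissive policies together, the support path can be one additional policy rather than a bypass. A sketch, assuming an illustrative support_role database role:

```sql
-- Support reads are allowed alongside the tenant policy, not instead of it,
-- and they still flow through the same connections, so they still get audited.
CREATE POLICY support_read_all ON conversations FOR SELECT
  USING (pg_has_role(current_user, 'support_role', 'member'));
```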

Cognito groups vs. tenant_id. Cognito has a "groups" concept that some people try to use as tenant scoping. Don't. Groups are a permissions concept; tenants are a data ownership concept. Keep them separate. Tenant_id is a custom attribute on the user, set at creation, immutable. Groups are for "is this user a tenant admin or a regular member."

What this costs

Almost nothing. Cognito's free tier covers tens of thousands of monthly active users (the exact cap has shifted with pricing changes, so check the current number). RDS Postgres has had RLS since version 9.5; it costs you nothing extra. The audit table is one more table. The SET LOCAL call is microseconds.

The only real cost is the discipline. You have to insist that every new table gets tenant_id. You have to insist that every new query path goes through the wrapper that sets the session variable. You have to insist that no one writes a "temporary" admin endpoint that bypasses RLS "just for this one report."

Insist anyway. The day a prospect asks you for their SOC 2 evidence (and that day comes faster than you expect) you'll have answers ready. The day you sign your second consultant tenant, it'll be a row insert, not a project. The day a customer asks "can you prove no one else saw our data," you'll show them the policy and the audit log and they'll believe you, because the proof has been there since day one.

If you're building this and you remember nothing else: tenant_id on every row, RLS on every table, JWT claim flowing into the session variable, audit log carrying the tenant. Wire those four things on day one and the rest of the architecture has somewhere to stand.