The single-tenant → multi-tenant migration I actually shipped
The playbook for a single-tenant to multi-tenant migration done well: additive schema, identity backfill, dual-path policy enforcement, row-level security, observability for both paths, traffic shift, deprecation. What I wish I'd known on day 1.
What I wish I'd known on day 1 of a single-tenant to multi-tenant migration is that the migration is not a project. It's a posture. The teams I've watched fail treated it as a release: a cutover weekend, a feature flag, a Slack victory lap on Monday. The teams I've watched succeed treated it as a six-to-nine-month period of running two architectures in parallel, with discipline about which path is authoritative for which class of read and write, and a deprecation calendar with real dates on it.
I've shipped one of these end-to-end and watched two others up close. The shape rhymes every time. The pieces I'm going to lay out here are the ones that, if you skip any of them, you eventually find yourself in a postmortem explaining how a query meant for tenant A returned three rows that belonged to tenant B. None of this is theoretical. It's the playbook I'd hand to a team starting tomorrow, with no ceremony.
A note before the recipe. The argument I made in "why multi-tenant is twice as expensive as you estimated" still stands: multi-tenancy is a cost shape, not a default. But if you've decided you need it, usually because the sales motion has changed and you're now signing customers faster than you can stamp out per-customer deployments, the migration itself is tractable. It's just patient.
Day 1: schema additive, never destructive
The first rule, and the one I see broken most often: you do not alter or drop a single column during the migration. Every schema change is additive. New tenant_id columns get added with a sensible default (more on the default in a second). New tables for tenant-scoped resources get created alongside the old ones. New indexes get built concurrently. Nothing that exists today gets renamed, dropped, or retyped. Nothing.
The reason is simple. The single-tenant code path has to keep working unchanged for the entire migration window. Every destructive change is a coupling between the migration and the deploy that's reading the old shape, and every coupling is a reason for a rollback to fail. The whole point of additive-only is that, at any moment in the migration, you can roll the application back to last week's build and the database is still readable.
The default for the new tenant_id column is the implicit tenant, usually 1 or a UUID you've decided represents "the original customer." Every existing row gets backfilled with that value in the same migration that adds the column. New writes from the unchanged single-tenant code path also write the implicit tenant, because the column has a default. This is the bridge state: schema is multi-tenant, code is still single-tenant, behavior is identical to before.
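Here's a minimal sketch of that bridge state. It uses SQLite from the Python standard library as a stand-in for the real database, and the invoices table and the "tenant-original" identifier are hypothetical names I made up for illustration. The point it shows: one additive column with a default, and both old rows and old-style writes end up in the implicit tenant.

```python
import sqlite3

# Hypothetical identifier for "the original customer".
IMPLICIT_TENANT = "tenant-original"


def add_tenant_column(conn: sqlite3.Connection, table: str) -> None:
    """Additive change only: add tenant_id with a default, touch nothing else."""
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "tenant_id" in columns:
        return  # already applied, so the migration is safe to re-run
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN tenant_id TEXT NOT NULL "
        f"DEFAULT '{IMPLICIT_TENANT}'"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO invoices (amount) VALUES (100)")  # pre-migration row

add_tenant_column(conn, "invoices")

# The unchanged single-tenant code path still writes without naming tenant_id.
conn.execute("INSERT INTO invoices (amount) VALUES (250)")

rows = conn.execute(
    "SELECT amount, tenant_id FROM invoices ORDER BY id"
).fetchall()
```

Both rows come back tagged with the implicit tenant, and the insert statement the old code uses never had to change.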
If you take one habit from this whole piece, take this one. Additive schema, defaulted backfill, no drops until the deprecation ceremony months later. The discipline costs nothing and saves you every time.
Day 14: identity backfill
The next piece, and the one that tends to be both more painful and more interesting than the schema work, is identity. In the single-tenant world, a user is a user. In the multi-tenant world, a user belongs to one or more tenants through a membership relation, and a request is authenticated as a user but authorized within a tenant.
The backfill is mechanical. Every existing user gets a membership row tying them to the implicit tenant with whatever role they currently have (most commonly: everyone is an admin, because single-tenant products rarely build a real role system). The membership table is the new authority for "what can this user do where," and every existing permission check eventually has to start consulting it. But (and this is the second rule) you don't switch the consultation over yet.
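A sketch of the membership backfill, again with SQLite standing in for the real database. The users and memberships table shapes, the implicit tenant id, and the blanket "admin" role are all assumptions for illustration; a product with real roles would map them instead of hard-coding one.

```python
import sqlite3

IMPLICIT_TENANT = "tenant-original"  # hypothetical implicit tenant id
DEFAULT_ROLE = "admin"  # assumption: everyone was effectively an admin


def backfill_memberships(conn: sqlite3.Connection) -> int:
    """Give every existing user a membership in the implicit tenant.

    Returns the number of membership rows created by this run.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS memberships (
               user_id   INTEGER NOT NULL,
               tenant_id TEXT    NOT NULL,
               role      TEXT    NOT NULL,
               PRIMARY KEY (user_id, tenant_id)
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM memberships").fetchone()[0]
    # OR IGNORE makes the backfill idempotent: re-running skips existing rows.
    conn.execute(
        """INSERT OR IGNORE INTO memberships (user_id, tenant_id, role)
           SELECT id, ?, ? FROM users""",
        (IMPLICIT_TENANT, DEFAULT_ROLE),
    )
    after = conn.execute("SELECT COUNT(*) FROM memberships").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)

created_first = backfill_memberships(conn)   # creates one row per user
created_second = backfill_memberships(conn)  # creates nothing the second time
```

Nothing reads the memberships table yet. The backfill only makes sure the new authority exists and is complete before anything consults it.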
You also need to decide, on day 14, what your tenant identity primitive is. A UUID? A short slug? A subdomain? An organization-level claim in the JWT? Whatever you pick, it has to survive every later decision, because once it's in URLs and tokens and audit logs you can't change it without breaking integrations. I've seen teams pick a numeric ID and regret it the first time a customer asks for a vanity subdomain. Pick the slug, or pick the UUID and treat the slug as a derived field. Don't pick the auto-incrementing integer.
The identity backfill is also where you decide what tenant a session is "in." A user with one membership is unambiguous. A user with multiple memberships needs a tenant selector, usually a tenant_id claim added to the access token at login or at a tenant-switch action. The token shape is part of the migration. Plan for it being wrong on the first try.
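One way the "which tenant is this session in" decision could look, as a sketch. The function names, the exception, and the claim layout (sub plus tenant_id) are my own illustration, not a standard; the real token shape depends on your identity provider.

```python
from typing import Optional, Sequence


class TenantSelectionRequired(Exception):
    """Raised when a user has several memberships and has not chosen one."""


def resolve_session_tenant(
    memberships: Sequence[str], requested: Optional[str] = None
) -> str:
    """Pick the tenant a session is scoped to, or refuse."""
    if requested is not None:
        # Tenant-switch action: only allow tenants the user belongs to.
        if requested not in memberships:
            raise PermissionError(f"not a member of tenant {requested!r}")
        return requested
    if not memberships:
        raise PermissionError("user has no tenant memberships")
    if len(memberships) == 1:
        return memberships[0]  # unambiguous, no selector needed
    raise TenantSelectionRequired("user must pick a tenant")


def build_claims(
    user_id: str, memberships: Sequence[str], requested: Optional[str] = None
) -> dict:
    """Claims for the access token; tenant_id is the new, migration-era claim."""
    return {
        "sub": user_id,
        "tenant_id": resolve_session_tenant(memberships, requested),
    }
```

During the bridge state every user has exactly one membership, so the selector branch never fires. It starts mattering the day the second tenant signs up.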
Day 30: dual-path policy enforcement
This is the heart of the migration and the part most teams get wrong. For some window (weeks, often months) both the old single-tenant code path and the new multi-tenant code path coexist in production. Both paths have to enforce authorization correctly, and the enforcement has to converge to the same answer for the same request.
The pattern that works: a single, library-level authorization function that every code path calls. The function takes a user, a tenant context, and an action, and returns allow/deny. The single-tenant code path calls it with the implicit tenant. The new multi-tenant code path calls it with the tenant the request is scoped to. The function consults the membership table, the role definitions, and (eventually) the row-level security policies that the database itself will enforce.
The dual-path part is not "we have an if statement." The dual-path part is "we have one enforcement function and two callers, and the function is the only place authorization decisions are made." If you find yourself writing if request.is_v2: check_new_way() else: check_old_way(), you have built a leak factory. The two paths have to share the enforcement primitive, or the divergence between them becomes a CVE waiting for an audit.
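To make "one enforcement function, two callers" concrete, here's a sketch. The in-memory membership and role tables, the action names, and the two wrapper functions are hypothetical stand-ins for whatever your real storage and handlers look like.

```python
IMPLICIT_TENANT = "tenant-original"  # hypothetical implicit tenant id

# Stand-ins for the role definitions and the membership table.
ROLE_ACTIONS = {
    "admin": {"invoice:read", "invoice:write"},
    "viewer": {"invoice:read"},
}
MEMBERSHIPS = {
    ("alice", "tenant-original"): "admin",
    ("bob", "tenant-b"): "viewer",
}


def authorize(user_id: str, tenant_id: str, action: str) -> bool:
    """The only place an allow/deny decision is made."""
    role = MEMBERSHIPS.get((user_id, tenant_id))
    if role is None:
        return False  # no membership in this tenant means deny
    return action in ROLE_ACTIONS.get(role, set())


def legacy_can(user_id: str, action: str) -> bool:
    """Single-tenant caller: always passes the implicit tenant."""
    return authorize(user_id, IMPLICIT_TENANT, action)


def tenant_can(user_id: str, request_tenant_id: str, action: str) -> bool:
    """Multi-tenant caller: passes the tenant the request is scoped to."""
    return authorize(user_id, request_tenant_id, action)
```

Neither caller contains any decision logic. If a rule changes, it changes in authorize, and both paths pick it up together.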
The other piece of dual-path enforcement is at the data layer. Postgres row-level security is the right tool here, and I'd reach for it earlier in the migration than most teams do. RLS gives you a defense-in-depth layer that doesn't depend on every query in your codebase remembering to filter by tenant_id. You set the session variable on connection (SET app.current_tenant = '...'), or per transaction if a pool reuses connections across tenants, and every policy-bound table refuses to return rows that don't match. The single-tenant code path sets the implicit tenant; the new code path sets the request's tenant. Same enforcement, two callers, same shape as the application-layer authorization function.
RLS doesn't replace application-layer checks. It catches the bugs the application-layer checks have. That's the whole point of defense in depth.
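For a sense of what the RLS rollout involves per table, here's a sketch that only generates the Postgres statements; it doesn't connect to a database. The policy name tenant_isolation, the assumption that tenant_id is a text column, and the helper itself are illustrative. The %s placeholder in the last statement assumes a psycopg-style driver.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def rls_statements(table: str) -> list:
    """Postgres statements that bind one table to the current tenant."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"refusing unsafe table name: {table!r}")
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        # FORCE applies the policy to the table owner as well.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY;",
        # Assumes tenant_id is text; add a cast if the column is a uuid.
        f"CREATE POLICY tenant_isolation ON {table} "
        "USING (tenant_id = current_setting('app.current_tenant'));",
    ]


# Run at the start of each transaction. The third argument (true) scopes the
# setting to the current transaction, so a pooled connection cannot carry one
# tenant's context into the next request.
SET_TENANT_SQL = "SELECT set_config('app.current_tenant', %s, true);"

statements = rls_statements("invoices")
```

The single-tenant path would run SET_TENANT_SQL with the implicit tenant and the new path with the request's tenant, which keeps the data layer on the same one-function, two-callers shape.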
Day 45: observability for both paths
You cannot run two code paths in production without instrumenting both of them, and you cannot trust the migration is working without a dashboard that shows you, in real time, what fraction of requests are flowing through each path and whether their behavior matches.
Concretely: every authorization decision emits a structured log with the path that made it (single-tenant vs. multi-tenant), the user, the tenant context, the action, and the outcome. Every database query carries a tenant tag in its trace. Every metric pipeline gets a path dimension. You build one dashboard that shows the request volume by path, the deny rate by path, the latency distribution by path, and any divergence between the two paths for the same logical operation.
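A sketch of the structured decision log, using only the standard library. The field names and the two path labels are assumptions; what matters is that every decision carries the same fields so the dashboard can slice by path.

```python
import json
import logging

logger = logging.getLogger("authz")

VALID_PATHS = ("single_tenant", "multi_tenant")


def log_decision(
    *, path: str, user_id: str, tenant_id: str, action: str, allowed: bool
) -> str:
    """Emit one JSON line per authorization decision and return it."""
    if path not in VALID_PATHS:
        raise ValueError(f"unknown code path: {path!r}")
    line = json.dumps(
        {
            "event": "authz_decision",
            "path": path,
            "user_id": user_id,
            "tenant_id": tenant_id,
            "action": action,
            "outcome": "allow" if allowed else "deny",
        },
        sort_keys=True,
    )
    logger.info(line)
    return line


example = log_decision(
    path="single_tenant",
    user_id="alice",
    tenant_id="tenant-original",
    action="invoice:read",
    allowed=True,
)
```

Rejecting unknown path labels up front keeps the path dimension clean, which is what the by-path panels on the dashboard depend on.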
Then (and this is the part teams skip) you build a shadow mode. The new multi-tenant code path runs alongside the old one, sees the same request, makes its own authorization decision, and logs whether it agrees with the old path's decision. It doesn't actually serve the request yet. It just shadows. Every disagreement is a bug, and you have a queue of disagreements to drain to zero before you flip a single byte of real traffic.
The shadow phase is where you find every implicit assumption the single-tenant code made about the user, the session, the global state. It's tedious. It's also the only way I've seen this migration done without a customer-visible incident in the cutover.
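Shadow mode can be sketched as a wrapper around the two decision functions. The signature (user, tenant, action) and the list used as a disagreement queue are illustrative; in practice the queue would be a log stream or a table.

```python
from typing import Callable, List

Decision = Callable[[str, str, str], bool]


def shadowed(old: Decision, new: Decision, disagreements: List[dict]) -> Decision:
    """Serve the old path's decision; record where the new path disagrees."""

    def decide(user_id: str, tenant_id: str, action: str) -> bool:
        served = old(user_id, tenant_id, action)
        request = {"user_id": user_id, "tenant_id": tenant_id, "action": action}
        try:
            shadow = new(user_id, tenant_id, action)
        except Exception as exc:
            # A crash in the shadow path is a disagreement, never an outage.
            disagreements.append({**request, "served": served, "shadow": repr(exc)})
            return served
        if shadow != served:
            disagreements.append({**request, "served": served, "shadow": shadow})
        return served  # the shadow result is never served

    return decide


# Toy decision functions that disagree on writes, to show the queue filling.
def old_path(user_id: str, tenant_id: str, action: str) -> bool:
    return True


def new_path(user_id: str, tenant_id: str, action: str) -> bool:
    return action.endswith(":read")


queue: List[dict] = []
decide = shadowed(old_path, new_path, queue)
read_result = decide("alice", "tenant-original", "invoice:read")
write_result = decide("alice", "tenant-original", "invoice:write")
```

Both calls return what the old path decided. Only the write lands in the queue, and that entry is the bug to chase before any traffic moves.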
Day 90: traffic shift
By the time you start moving real traffic, the new path should have been shadowing the old path for at least a couple of weeks with a divergence rate at or near zero. The shift itself is undramatic: a percentage rollout per route, starting with read-only operations on internal endpoints, expanding to read-only on customer endpoints, expanding to writes on internal, expanding to writes on customer. Each expansion stays at its new percentage long enough for the dashboards to settle and for any tenant-specific weirdness to surface.
The right granularity for the rollout flag is per-route, per-percentage, with an optional per-tenant override. You'll want the per-tenant override the first time a friendly design-partner tenant volunteers to be on the new path early, and you'll want it the first time an enterprise tenant has a regression and needs to be parked on the old path while you debug.
The wrong granularity is global. A single boolean that flips the whole product to the new path is the same kind of cutover-weekend trap that started the failure pattern in the first place.
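Here's a sketch of a flag with that granularity: per route, per percentage, with a per-tenant override that wins in both directions. The class and function names are hypothetical, and hashing a request key into a stable bucket is one common way to do percentage rollouts, not the only one.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RouteRollout:
    percent: int = 0  # share of requests on the new path, 0 to 100
    tenant_overrides: Dict[str, bool] = field(default_factory=dict)


def bucket(route: str, request_key: str) -> int:
    """Stable bucket in [0, 100) so the same key always lands the same way."""
    digest = hashlib.sha256(f"{route}:{request_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_path(
    rollouts: Dict[str, RouteRollout], route: str, tenant_id: str, request_key: str
) -> bool:
    rollout = rollouts.get(route)
    if rollout is None:
        return False  # routes with no rollout entry stay on the old path
    override = rollout.tenant_overrides.get(tenant_id)
    if override is not None:
        return override  # design partner pinned on, or enterprise parked off
    return bucket(route, request_key) < rollout.percent


rollouts = {
    "GET /invoices": RouteRollout(
        percent=25,
        tenant_overrides={"design-partner": True, "big-enterprise": False},
    ),
}
```

Raising percent for one route never touches another route, and parking a tenant is a one-line change rather than a rollback.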
Day 150: deprecation ceremony
The final phase, and the one that requires the most discipline because the pressure to declare victory is enormous. The new path is serving 100% of traffic. The dashboards are clean. The on-call rotation has stopped getting paged about tenant-related weirdness. Now you can start the deprecation.
Deprecation looks like this: the old code path gets a feature flag wrapping it, defaulted off, with logging that screams every time the flag gets flipped on. After two weeks of nothing flipping it on in production, the code path gets deleted. Then (and only then) the additive schema starts becoming subtractive. Columns that the old path required and the new path doesn't can be dropped. Default values that backfilled the implicit tenant can be removed. Indexes that supported the old query patterns can be retired.
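The first half of that, the flag and the two-week quiet period, might look like the sketch below. The class, the logger name, and the idea of passing the clock in explicitly are my own choices for illustration.

```python
import logging
from datetime import datetime, timedelta, timezone
from typing import Optional

logger = logging.getLogger("legacy_path")

QUIET_PERIOD = timedelta(days=14)


class LegacyPathFlag:
    """Wraps the old code path: off by default, loud whenever it is turned on."""

    def __init__(self, parked_at: datetime) -> None:
        self.enabled = False
        self.parked_at = parked_at
        self.last_enabled_at: Optional[datetime] = None

    def enable(self, who: str, reason: str, now: datetime) -> None:
        logger.critical("LEGACY PATH RE-ENABLED by %s: %s", who, reason)
        self.enabled = True
        self.last_enabled_at = now  # restarts the quiet-period clock

    def disable(self) -> None:
        self.enabled = False

    def safe_to_delete(self, now: datetime) -> bool:
        """True once the flag is off and nothing has flipped it on for 14 days."""
        if self.enabled:
            return False
        quiet_since = self.last_enabled_at or self.parked_at
        return now - quiet_since >= QUIET_PERIOD


start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
flag = LegacyPathFlag(parked_at=start)
```

Any use of enable pushes the delete date out by another two weeks, which is the behavior you want from a safety net you're trying to prove you no longer need.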
This is also where the Decisions as Code discipline pays off: the surface of decisions a tenant gets to make (plan, regions, SSO provider, default role, retention class) was modeled cleanly during the migration, and the deprecation phase is where you confirm you didn't accidentally expose eighty-nine knobs because the migration was easier that way. (I called the underlying pattern Property Toolkit during my OneFuse days; the shape is the same here, applied to tenancy decisions instead of infrastructure decisions.)
The deprecation ceremony is boring and absolutely worth doing. Every line of single-tenant scaffolding you leave in the codebase is a future incident vector. Delete it.
What this looks like on a calendar
A real timeline for a real product, roughly: weeks 1-2, additive schema and backfill. Weeks 3-4, identity model and membership backfill. Weeks 5-8, dual-path enforcement function and RLS rollout. Weeks 7-10, shadow mode and divergence drain. Weeks 11-16, percentage rollout. Weeks 17-22, run at 100% on the new path while the old path remains as a parked safety net. Weeks 23-26, deprecation and schema subtraction.
Six months. Maybe nine if the product surface is wide. Possibly four if you're a small team with a small surface and enormous discipline. Not a weekend, ever.
The honest framing on multi-tenant migrations is the same as the honest framing on multi-tenancy itself: it's a cost shape with a long tail, and the tail is where the safety lives. Pay the tail deliberately. The teams that ship this well are the ones that treated the boring middle weeks (shadow mode, divergence drain, dashboard-watching) as the actual product, not the prologue to a cutover.
Sid