Multi-tenant SaaS starts as a neat cost-saving trick: one app, one database, many customers. Then you grow.
A big tenant runs a report at noon, p95 latency doubles, and your “shared everything” architecture turns into
shared suffering.
This piece is for the moment you realize “tenant isolation” isn’t a checkbox—it’s a system property that must survive:
growth, migrations, compliance audits, a bad deploy, and the occasional customer who thinks “Export All” is a lifestyle.
What tenant isolation really means in production
“Tenant isolation” is not just “tenant_id is in every table.” That’s data labeling. Isolation is the property that one
tenant cannot:
- Read another tenant’s data (obvious, but the fun bugs hide in joins, views, and background jobs).
- Corrupt or delete another tenant’s data (including via migration mistakes and shared sequences).
- Starve the system so other tenants time out (“noisy neighbor” is just a polite phrase for “my incident”).
- Force you into a single migration and scaling path (because your largest tenant always dictates your architecture).
- Break compliance boundaries (data residency, retention, encryption scope, access control boundaries).
Isolation is also operational. You need the ability to answer, quickly and confidently:
- Which tenant caused the load spike?
- Can I throttle or quarantine them without taking down everyone else?
- Can I migrate one tenant safely (schema change, shard move, restore) while others keep working?
That means isolation has layers: data isolation, query isolation, resource isolation,
fault isolation, and operational isolation. The database choice influences all five, but it won’t
save you from a tenancy model that’s structurally doomed.
Tenancy models that scale (and the ones that quietly don’t)
Model A: Shared tables, shared schema (tenant_id column)
One set of tables. Every row includes tenant_id. It’s the cheapest to start and the hardest to keep correct.
It can scale if you design for it from day one: strict access boundaries (RLS or vetted query paths),
partitioning, and a plan for moving tenants around.
When it works:
- Many small tenants, relatively uniform workloads.
- You need simple migrations (one schema).
- Your product doesn’t require tenant-level extensions or bespoke indexing.
How it fails:
- A few “elephant tenants” dominate I/O and lock contention.
- Every query must be perfect forever (spoiler: it won’t be).
- Backfills and analytics accidentally turn into global table scans.
Model B: Schema-per-tenant (one database, many schemas)
Each tenant has its own schema: tenant_123.orders, tenant_456.orders. This is a strong operational boundary
for object names, migrations, and partial moves. PostgreSQL is notably good here; MySQL’s “schema” is basically “database,”
so the shape differs.
When it works:
- Tenants need custom indexes, extensions, or periodic heavy operations.
- You want better blast-radius control for migrations and restores.
- You can tolerate higher catalog/object counts and manage migrations at scale.
How it fails:
- Too many objects (tables/indexes) can slow metadata operations and maintenance windows.
- Connection pooling becomes trickier if you rely on SET search_path.
- Cross-tenant analytics becomes more expensive and messy.
Model C: Database-per-tenant (or cluster-per-tenant)
The cleanest blast radius. The worst sprawl. This is isolation by multiplication.
It’s also how you end up running a small cloud provider inside your company.
When it works:
- High compliance requirements (hard boundaries, per-tenant encryption keys, residency).
- Tenants have wildly different workload sizes.
- You need per-tenant maintenance scheduling and restore guarantees.
How it fails:
- Operational overhead: migrations, monitoring, backups, failover, credentials.
- Capacity fragmentation: many small DBs waste memory (buffers, caches).
- Fleet-wide changes become slow and risky.
Model D: Sharded tenants (tenant-to-shard mapping)
Tenants are assigned to shards (multiple DB instances). Each shard can use shared-table or schema-per-tenant internally.
Sharding is what you do when you accept reality: you will outgrow one instance, and you need controlled distribution.
When it works:
- You have clear tenant boundaries and little need for cross-tenant joins.
- You can build and maintain a routing layer (app-level or proxy-level).
- You plan for tenant moves (rebalancing) as a routine operation.
How it fails:
- Cross-tenant analytics becomes a distributed query problem.
- Rebalancing is underestimated; it becomes a “special project” every time.
- Hot tenants still exist; sharding reduces odds, not physics.
MySQL vs PostgreSQL: isolation mechanics that matter
Both can run large SaaS fleets. Both can hurt you. The difference is how they push you toward safe defaults (or let you
build a foot-gun factory with excellent uptime).
Data isolation: enforcement primitives
PostgreSQL: Row Level Security (RLS) is a first-class isolation tool
PostgreSQL’s RLS lets you enforce tenant filters inside the database. Done right, it turns “we always add tenant_id”
into “the database will reject your query if you forget.” That’s not a luxury; it’s a survival trait.
RLS is not magic. Superusers and roles with the BYPASSRLS attribute skip policies entirely, table owners skip them unless
you set FORCE ROW LEVEL SECURITY, and badly designed policies can tank performance. But it gives you a declarative guardrail.
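A minimal sketch of that guardrail, assuming a bigint tenant_id column and an application that sets a custom setting
(app.tenant_id is a name chosen here for illustration) at the start of every transaction:

-- Enable RLS and bind every read and write to the current tenant setting.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- apply policies to the table owner too

CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::bigint)
  WITH CHECK (tenant_id = current_setting('app.tenant_id')::bigint);

-- The application scopes each transaction:
BEGIN;
SET LOCAL app.tenant_id = '42';
SELECT count(*) FROM orders;  -- implicitly limited to tenant 42
COMMIT;

With this shape, a query that forgets the tenant filter returns only the current tenant's rows, and a connection that
never set app.tenant_id fails loudly instead of leaking data.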
MySQL: isolation is mostly a discipline problem
MySQL does not have PostgreSQL-style RLS. You can approximate it with views, stored routines, definer rights, or a very
strict query layer. In practice, most teams rely on application-layer filtering and permissions.
That can be fine—until it isn’t. In incident postmortems, “we assumed all code paths add tenant_id” shows up the way
“we assumed backups worked” shows up. It’s a sentence that ages poorly.
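For completeness, here is a sketch of the view-plus-function approximation some teams use in MySQL. Everything named here
(the @tenant_id session variable, current_tenant_id(), orders_scoped, the appdb schema) is illustrative, and the pattern
only helps if the runtime account has no grants on the base tables:

-- Views cannot reference session variables directly, so wrap the variable in a function.
CREATE FUNCTION current_tenant_id() RETURNS BIGINT
  NO SQL
  RETURN CAST(@tenant_id AS SIGNED);

CREATE VIEW orders_scoped AS
  SELECT * FROM orders WHERE tenant_id = current_tenant_id();

-- Grant the runtime account the view only; base-table grants stay with admin/migration accounts.
GRANT SELECT ON appdb.orders_scoped TO 'app_user'@'%';

-- The application sets the variable when it takes a connection from the pool:
SET @tenant_id = 42;

This covers reads; writes still need updatable views WITH CHECK OPTION or a vetted write path, which is exactly why the
discipline framing above matters.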
Concurrency and locking: how noisy neighbors are born
PostgreSQL: MVCC with vacuum realities
PostgreSQL’s MVCC means reads don’t block writes, which is lovely for mixed workloads. But dead tuples accumulate and
must be vacuumed. Multi-tenant systems frequently create uneven churn: one tenant updates aggressively, and suddenly
autovacuum is doing overtime while everyone else wonders why latency spiked.
MySQL (InnoDB): MVCC plus different sharp edges
InnoDB also uses MVCC, but locking behaviors and gap locks under certain isolation levels can surprise teams, especially
with range queries and secondary indexes. Long transactions are the universal enemy, but the “shape” of the pain differs.
Operational isolation: moving tenants, restoring tenants, throttling tenants
PostgreSQL advantages
- Schema-per-tenant is natural and well-supported.
- RLS can make “shared tables” safer at scale.
- Logical replication enables selective replication/migration patterns (with careful design).
MySQL advantages
- Replication ecosystem maturity is excellent; operational tooling is widely understood.
- Performance predictability for certain OLTP patterns is strong, and many orgs have deep MySQL muscle memory.
- Database-per-tenant is straightforward when you already treat “schema = database.”
If you want a blunt recommendation: if you’re committed to shared-table multi-tenancy and you want the database to
enforce boundaries, PostgreSQL is the more forgiving partner. If you’re doing database-per-tenant with heavy operational
automation and a stable query layer, MySQL can be perfectly fine. The tenant model is the big decision; the engine is
the multiplier on how painful mistakes become.
How isolation fails as you grow
Failure mode 1: “We can always add tenant_id later”
Adding tenant boundaries after the fact is like adding seatbelts after you’ve started racing. You can do it, but you’ll
find out how many things were relying on the absence of constraints: global uniqueness assumptions, shared sequences,
background jobs, caches keyed incorrectly, and analytics queries that suddenly become expensive.
Failure mode 2: One migration becomes N migrations
With schema-per-tenant or database-per-tenant, schema changes become a fleet operation. The migration tool that worked
for one database now needs batching, retries, idempotency, and observability. This is not optional; it’s the price of
isolation.
Failure mode 3: The database becomes your scheduler
Multi-tenant “batch jobs” (billing runs, exports, report generation) often end up as “just a query.” Then it runs across
all tenants at once, and your database becomes a shared compute cluster with none of the admission control.
Joke #1: A database is a terrible queue, but it’s an excellent place to store the evidence that you tried anyway.
Failure mode 4: Tenant moves are treated as rare
If you shard, you must be able to move tenants routinely. Treat moves as routine, and you’ll build tooling, checksums,
dual-writes (if needed), cutover playbooks, and rollback paths. Treat moves as rare, and every move becomes a bespoke,
high-stress incident with executives “just joining to listen.”
Failure mode 5: Your isolation works, but your observability doesn’t
Even if data boundaries are correct, you still need to attribute load by tenant. Without tenant-tagged metrics and logs,
every performance issue is a guessing game. Guessing is expensive. It also tends to happen during outages, when everyone
is emotionally invested in being wrong quickly.
Interesting facts and historical context (the kind that changes decisions)
- PostgreSQL’s lineage traces back to the POSTGRES project at UC Berkeley in the 1980s; it inherited a research culture that shows in features like MVCC and extensibility.
- MySQL became a web default in the early LAMP era largely because it was easy to run and fast for common patterns, not because it had the strictest relational feature set.
- InnoDB wasn’t always the default; MySQL historically shipped with MyISAM as default, which lacked transactions—an origin story that still influences folklore and legacy deployments.
- PostgreSQL RLS arrived in v9.5 (mid-2010s), which is why older SaaS stacks often built their own isolation layers with views or ORM conventions.
- Autovacuum exists because of MVCC; PostgreSQL’s concurrency model is powerful, but the “garbage collection” is an operational tax you must budget for in multi-tenant churn.
- MySQL’s replication evolved in stages: statement-based to row-based to mixed formats, changing how safe and predictable tenant-level data moves are under different workloads.
- Partitioning matured over time in both engines; earlier implementations were more limited, which is why older designs often avoided partitions and later paid with painful table sizes.
- Connection handling differs culturally: PostgreSQL assumes heavier connections and encourages pooling; MySQL has a long history of many short connections in web stacks, which shapes defaults and tooling.
Three corporate mini-stories (and what they teach)
Mini-story #1: The incident caused by a wrong assumption
A mid-market B2B SaaS ran a shared-table model in MySQL. Every table had tenant_id. Their ORM also had a “default scope”
that automatically filtered by tenant. It felt safe. It was fast. Everyone moved on.
Then a background job was introduced to “cleanup old sessions.” It was written in a separate service that didn’t use the ORM.
The engineer used a simple delete query on a sessions table, assuming the tenant filter was “handled elsewhere.” It wasn’t.
The query deleted sessions across all tenants that matched the age condition.
The outage wasn’t data loss, but it was user-visible chaos: mass logouts, support tickets, and a day of leadership asking
why “multi-tenant isolation” didn’t prevent this. The real answer: isolation was a convention, not an enforced boundary.
The fix wasn’t just “be careful.” They implemented a tenant-safe database access layer for background services, added query
linting for tenant filters, and introduced a hard rule: any cross-tenant query must be explicitly named and reviewed.
They also made a plan to adopt PostgreSQL RLS for the highest-risk shared tables—because humans are creative, especially when tired.
Mini-story #2: The optimization that backfired
Another company used PostgreSQL with shared tables and RLS. One tenant grew quickly and began generating huge amounts of
event data. Engineering decided to “optimize” by adding a partial index tuned to that tenant’s common query pattern.
It looked smart: faster queries for the big customer, less load overall.
The partial index was defined on a predicate that included tenant_id and a status column. It did speed up the target queries.
But it also increased write amplification and maintenance cost for exactly the tables with the highest churn. Autovacuum started
falling behind, bloat increased, and other tenants—who didn’t even benefit from the index—saw rising latency.
The team tried to compensate with more aggressive autovacuum settings and bigger instances. That stabilized the symptoms but not the cause.
Eventually they removed the index and instead moved the elephant tenant to its own shard, where its indexing strategy could be as weird as it wanted.
Lesson: tenant-specific optimizations inside shared tables often become tenant-specific taxes paid by everyone. If the tenant is big enough to deserve custom indexing, it’s big enough to deserve its own blast radius.
Mini-story #3: The boring but correct practice that saved the day
A financial SaaS ran schema-per-tenant in PostgreSQL. It wasn’t glamorous, and it required a migration runner that could
apply changes tenant-by-tenant with strict ordering. Every schema had the same tables, the same constraints, and the same
extension set. No exceptions.
Their nightly backup strategy was equally boring: periodic full backups, frequent WAL archiving, and quarterly restore drills
where they would restore a single tenant schema into a staging cluster and run a verification suite. It felt like paperwork,
until it didn’t.
A deployment introduced a bug in a migration that dropped an index and rebuilt it concurrently, but a timeout and retry caused
an inconsistent state in a subset of tenant schemas. Some tenants saw slow queries; others were fine. The system was “up,” but
customer experience was not.
Because they had tenant-level restore drills and a schema-level verification checklist, they quickly identified affected tenants,
rebuilt indexes in a controlled way, and restored the worst cases from the most recent consistent point. No heroics. No guessing.
Just practiced procedures.
Practical tasks: commands, outputs, and the decision you make
You don’t get tenant isolation by arguing in a design review. You get it by being able to prove things at 02:00 with a shell
and a calm sense of betrayal. Below are practical tasks you can run today.
Task 1: Find the noisiest queries in MySQL
cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1000000000000 AS total_s, AVG_TIMER_WAIT/1000000000000 AS avg_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5;"
+--------------------------------------------+------------+----------+---------+
| DIGEST_TEXT | COUNT_STAR | total_s | avg_s |
+--------------------------------------------+------------+----------+---------+
| SELECT * FROM orders WHERE tenant_id = ? | 92831 | 1842.11 | 0.01984 |
| SELECT ... JOIN ... WHERE created_at > ? | 8421 | 1201.33 | 0.14266 |
| UPDATE events SET status = ? WHERE id = ? | 2419921 | 992.77 | 0.00041 |
| SELECT ... WHERE tenant_id = ? ORDER BY ? | 182003 | 774.55 | 0.00425 |
| DELETE FROM sessions WHERE expires_at < ? | 12044 | 701.62 | 0.05825 |
+--------------------------------------------+------------+----------+---------+
What it means: total time spent by digest. High total_s means big aggregate impact; high avg_s means slow per call.
Decision: If a query is high total and includes tenant filters, attribute it to tenants at the application layer (add tenant tags) and consider per-tenant rate limiting or moving the tenant.
Task 2: Identify lock waits in MySQL (who’s blocking whom)
cr0x@server:~$ mysql -e "SELECT r.trx_id waiting_trx, r.trx_mysql_thread_id waiting_thread, b.trx_id blocking_trx, b.trx_mysql_thread_id blocking_thread, TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) blocking_s FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx b ON b.trx_id=w.blocking_trx_id JOIN information_schema.innodb_trx r ON r.trx_id=w.requesting_trx_id ORDER BY blocking_s DESC LIMIT 5;"
+-------------+---------------+-------------+----------------+-----------+
| waiting_trx | waiting_thread| blocking_trx| blocking_thread| blocking_s|
+-------------+---------------+-------------+----------------+-----------+
| 4519281 | 3221 | 4519012 | 3189 | 97 |
+-------------+---------------+-------------+----------------+-----------+
What it means: a transaction has been blocking others for ~97 seconds. That’s usually a long transaction or a missing index. Note that this query uses the 5.7-era information_schema views; on MySQL 8.0, the same picture lives in sys.innodb_lock_waits or performance_schema.data_lock_waits.
Decision: If it’s tied to one tenant’s job, throttle or pause that job. If it’s systemic, inspect the blocked table and add the needed index or shorten transactions.
Task 3: Check InnoDB history length (undo pressure)
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -E "History list length|TRANSACTIONS"
TRANSACTIONS
History list length 138429
What it means: a high history list length suggests purge is behind, often due to long-running transactions.
Decision: Hunt long transactions; fix batch jobs and report queries that hold snapshots too long. This is a classic multi-tenant “one customer ran a report for an hour” scenario.
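To hunt those long transactions, information_schema.innodb_trx (still present in MySQL 8.0) shows their age and current
statement; a minimal query sketch:

SELECT trx_mysql_thread_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s,
       LEFT(trx_query, 80) AS q
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 5;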
Task 4: Spot tenant-skipping queries in PostgreSQL via pg_stat_statements
cr0x@server:~$ psql -d appdb -c "SELECT query, calls, total_exec_time::int AS total_ms, mean_exec_time::int AS mean_ms FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
query | calls | total_ms | mean_ms
----------------------------------------------------------------------------------------------------------------+-------+----------+---------
SELECT * FROM orders WHERE created_at > $1 ORDER BY created_at DESC LIMIT $2 | 12021 | 8123340 | 675
SELECT * FROM orders WHERE tenant_id = $1 AND created_at > $2 ORDER BY created_at DESC LIMIT $3 | 99311 | 4901221 | 49
UPDATE events SET status = $1 WHERE id = $2 | 89122 | 811220 | 9
SELECT ... JOIN ... WHERE tenant_id = $1 AND state = $2 | 24011 | 644331 | 26
DELETE FROM sessions WHERE tenant_id = $1 AND expires_at < now() | 8012 | 499112 | 62
What it means: the top query has no tenant filter. In shared-table tenancy, that’s a flashing red light.
Decision: If you rely on app-layer enforcement, patch immediately. If you use RLS, verify that RLS is enabled and actually applied to that table.
Task 5: Verify RLS is enabled and policies exist (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "\dp+ public.orders"
Access privileges
Schema | Name | Type | Access privileges | Column privileges | Policies
--------+--------+-------+-------------------+-------------------+-------------------------------
public | orders | table | app_user=arwdDxt/app_owner | | tenant_isolation (RLS)
What it means: a policy named tenant_isolation is attached to this table. A policy only takes effect once row security is enabled, so confirm the table flags as well (a quick check follows).
Decision: If the “Policies” column is empty for shared tables, you are trusting application code. Decide whether that’s acceptable under your risk model.
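That check is a short read from pg_class; both flags arrived with RLS:

-- relrowsecurity must be true for policies to apply; relforcerowsecurity also binds the table owner.
SELECT relname, relrowsecurity, relforcerowsecurity
FROM pg_class
WHERE relname = 'orders';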
Task 6: Check if your role can bypass RLS (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "SELECT rolname, rolbypassrls FROM pg_roles WHERE rolname IN ('app_user','app_owner');"
rolname | rolbypassrls
-----------+-------------
app_user | f
app_owner | t
What it means: app_owner can bypass RLS. That may be fine for migrations, disastrous for app runtime.
Decision: Ensure the application uses a role with rolbypassrls = false. Separate migration/admin roles from runtime roles.
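A sketch of that split, using the role names from the output above:

-- The runtime role must not bypass policies; bypass stays on the migration/owner role only.
ALTER ROLE app_user NOBYPASSRLS;
ALTER ROLE app_owner BYPASSRLS;

-- Keep the credentials separate too: the application's connection string never uses app_owner.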
Task 7: Find long-running transactions in PostgreSQL
cr0x@server:~$ psql -d appdb -c "SELECT pid, usename, now()-xact_start AS xact_age, state, left(query,80) AS q FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 5;"
pid | usename | xact_age | state | q
------+----------+--------------+--------+--------------------------------------------------------------------------------
8421 | app_user | 01:22:11 | active | SELECT * FROM orders WHERE created_at > $1 ORDER BY created_at DESC
What it means: a transaction has been open for 82 minutes. That can block vacuum and bloat tables.
Decision: If it’s a tenant report/export, move it to a replica, add timeouts, or redesign it (keyset pagination, incremental exports).
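Two of those mitigations sketched for PostgreSQL; the timeout values, page size, and the id tiebreaker column are
placeholders to adapt, not recommendations:

-- Cap runaway statements and forgotten transactions for the runtime role.
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

-- Keyset pagination: carry the last (created_at, id) seen instead of using OFFSET.
SELECT *
FROM orders
WHERE tenant_id = 42
  AND (created_at, id) < ('2025-12-31 10:00:00+00', 918273)
ORDER BY created_at DESC, id DESC
LIMIT 500;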
Task 8: Check autovacuum pressure and bloat signals (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_live_tup | n_dead_tup | last_autovacuum
-----------+------------+------------+-------------------------------
events | 81299311 | 22099122 | 2025-12-31 10:42:11.12345+00
orders | 9921121 | 2200122 | 2025-12-31 10:39:01.01234+00
What it means: high dead tuple counts mean heavy churn. In multi-tenant systems, it’s often dominated by a small set of tenants.
Decision: Tune autovacuum for hot tables, but also consider isolating the tenant causing churn (shard move, separate schema/DB).
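Per-table tuning is a storage-parameter change; the numbers below are placeholders to adjust against the table's actual churn:

-- Vacuum the hot table after roughly 2% dead rows instead of the default 20%.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 1000
);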
Task 9: Confirm a query uses the tenant index (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE tenant_id = 42 AND created_at > now()-interval '7 days' ORDER BY created_at DESC LIMIT 50;"
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..102.55 rows=50 width=128) (actual time=0.214..1.992 rows=50 loops=1)
Buffers: shared hit=391
-> Index Scan Backward using orders_tenant_created_at_idx on orders (cost=0.43..18233.22 rows=8931 width=128) (actual time=0.212..1.978 rows=50 loops=1)
Index Cond: ((tenant_id = 42) AND (created_at > (now() - '7 days'::interval)))
Planning Time: 0.311 ms
Execution Time: 2.041 ms
What it means: index scan using a composite index that includes tenant_id. Buffers are hits, not reads: cache-friendly.
Decision: If you see sequential scans on shared tables for tenant-scoped queries, fix indexes or partitioning before you scale tenant count.
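If the index is missing, build it without blocking writes; the name matches the plan above, and putting tenant_id first is the point:

-- Run outside a transaction block; CONCURRENTLY avoids long write locks.
CREATE INDEX CONCURRENTLY orders_tenant_created_at_idx
  ON orders (tenant_id, created_at);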
Task 10: Confirm a query uses the tenant index (MySQL)
cr0x@server:~$ mysql -e "EXPLAIN SELECT * FROM orders WHERE tenant_id=42 AND created_at > NOW() - INTERVAL 7 DAY ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
partitions: NULL
type: range
possible_keys: idx_tenant_created_at
key: idx_tenant_created_at
key_len: 12
ref: NULL
rows: 8500
filtered: 100.00
Extra: Using where
What it means: MySQL chose your composite index. type: range is expected for time-window scans. “Using where” means remaining filters are applied to the fetched rows; you would only see “Using index” (a covering read) if the index held every selected column, which SELECT * rules out.
Decision: If type: ALL shows up (full scan), you don’t have isolation at scale—you have a future outage.
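The MySQL fix mirrors the PostgreSQL one, assuming the same column names; online DDL keeps the table writable in most cases:

-- Composite index with tenant_id leading; INPLACE avoids a full table copy where supported.
ALTER TABLE orders
  ADD INDEX idx_tenant_created_at (tenant_id, created_at),
  ALGORITHM = INPLACE, LOCK = NONE;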
Task 11: Attribute load by user/host (MySQL) as a proxy for tenant
cr0x@server:~$ mysql -e "SELECT user, host, SUM_TIMER_WAIT/1000000000000 AS total_s FROM performance_schema.events_statements_summary_by_user_by_event_name ORDER BY SUM_TIMER_WAIT DESC LIMIT 5;"
+----------+-----------+----------+
| user | host | total_s |
+----------+-----------+----------+
| app_user | 10.0.2.% | 9921.22 |
| app_user | 10.0.9.% | 6211.55 |
+----------+-----------+----------+
What it means: a subset of app hosts are responsible for most DB time. Often those hosts are running a specific workload (exports, backfills).
Decision: Use this to narrow down the service/job responsible, then correlate with tenant-level application logs.
Task 12: Check connection pile-ups (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"
state | count
--------+-------
idle | 412
active | 62
What it means: you have hundreds of idle connections. That’s memory and overhead; it often signals poor pooling or per-tenant connection patterns.
Decision: Put a pooler in front (and configure it correctly), or reduce connection counts. Multi-tenant systems die by a thousand “just one more connection” cuts.
Task 13: Check for table-level hot spots by size (PostgreSQL)
cr0x@server:~$ psql -d appdb -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;"
relname | total_size
---------+------------
events | 412 GB
orders | 126 GB
What it means: one table dominates storage and likely I/O. In multi-tenant systems, the biggest table is usually where the biggest tenant lives.
Decision: Consider partitioning by tenant or time, or moving tenants generating most rows to a separate shard.
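A sketch of the time-partitioned shape in PostgreSQL; the column list is trimmed for illustration and the monthly boundary is a placeholder:

-- Range-partition the hot event table by time; tenant_id stays indexed inside each partition.
CREATE TABLE events (
  tenant_id  bigint      NOT NULL,
  created_at timestamptz NOT NULL,
  event_type text        NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_12 PARTITION OF events
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- An index on the parent cascades to every partition (PostgreSQL 11+).
CREATE INDEX ON events (tenant_id, created_at);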
Task 14: Check replication lag (MySQL) to decide where to run heavy reads
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Replica_IO_Running|Replica_SQL_Running"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 37
What it means: 37 seconds behind. Running tenant exports on this replica may give inconsistent “latest” results.
Decision: If the product promises near-real-time exports, don’t offload to a lagging replica; instead throttle or isolate the tenant workload.
Fast diagnosis playbook (find the bottleneck before it finds you)
When a multi-tenant SaaS slows down, you need to answer two questions fast:
(1) is the database actually the bottleneck? (2) if yes, which tenant or workload is causing it?
First: Is the database saturated or just waiting?
- CPU pinned (query inefficiency, missing indexes, too many parallel tasks).
- I/O pinned (buffer misses, heavy scans, checkpoints, bloat).
- Locking pinned (long transactions, schema changes, hot rows).
- Connection pinned (pool exhaustion, thread contention, too many idle connections).
Second: Identify the top offender class
- Top queries by total time (digest / pg_stat_statements).
- Top waits (locks, I/O, buffer contention).
- Long transactions and blocking chains.
Third: Attribute to tenant and decide containment
- If you have tenant tags in logs/metrics: isolate the tenant (throttle, move job, move shard).
- If you don’t: use user/host/job correlation, then fix observability permanently.
Joke #2: The fastest way to reduce database load is to stop running the query that’s causing it. Revolutionary, I know.
A practical containment ladder (least to most invasive)
- Cancel the query / kill the session that’s causing immediate harm.
- Throttle the tenant at the application edge (rate limit exports, batch jobs).
- Move heavy reads to a replica (if lag and consistency allow).
- Add the missing index or rewrite the query (with explain plans, not hope).
- Partition or shard to separate noisy tenants.
- Change the tenancy model (the painful but sometimes correct answer).
Common mistakes: symptom → root cause → fix
1) Symptom: occasional cross-tenant data exposure in reports
Root cause: one code path missed tenant filtering; app-layer conventions aren’t enforcement.
Fix: PostgreSQL: implement RLS and ensure runtime roles cannot bypass it. MySQL: enforce via a strict data access layer, prohibit raw SQL for tenant-scoped tables, and add query linting tests.
2) Symptom: one tenant causes p95 spikes for everyone during “export”
Root cause: export runs on primary, scanning large ranges without proper indexing or pagination.
Fix: run exports on replica (if acceptable), use keyset pagination, precompute exports, and rate-limit per tenant. If the tenant is consistently heavy, move them to a separate shard.
3) Symptom: PostgreSQL gets slower over weeks, then improves after maintenance
Root cause: bloat from high-churn tables; autovacuum not keeping up; long transactions blocking vacuum cleanup.
Fix: find and eliminate long transactions, tune autovacuum per hot table, and consider partitioning or isolating the churn-heavy tenant workload.
4) Symptom: MySQL replication lag increases during tenant batch jobs
Root cause: large transactions and write bursts; replica SQL thread can’t apply fast enough.
Fix: break batch jobs into smaller transactions, add appropriate indexes, and schedule heavy jobs. For large tenants, consider dedicated infrastructure.
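A sketch of the chunking idea in MySQL; the tenant, cutoff, and batch size are illustrative, and the job loops until no rows are affected:

-- One small transaction per batch keeps replica apply lag bounded.
DELETE FROM sessions
WHERE tenant_id = 42
  AND expires_at < NOW() - INTERVAL 30 DAY
LIMIT 5000;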
5) Symptom: migrations become risky and slow as tenant count grows
Root cause: schema-per-tenant or database-per-tenant without fleet-grade migration tooling; no batching, no observability, no rollback plan.
Fix: build a migration runner with: per-tenant state tracking, retries, timeouts, concurrency control, and post-migration verification queries. Treat it like a deployment system.
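The per-tenant state tracking can be as plain as a table the runner owns; a hypothetical PostgreSQL-flavored shape:

-- Hypothetical bookkeeping table for a fleet migration runner.
CREATE TABLE tenant_migrations (
  tenant_id  bigint      NOT NULL,
  version    text        NOT NULL,
  status     text        NOT NULL DEFAULT 'pending',  -- pending | applied | failed
  applied_at timestamptz,
  error      text,
  PRIMARY KEY (tenant_id, version)
);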
6) Symptom: “We scaled up the instance but it’s still slow”
Root cause: contention and inefficiency, not raw capacity. Bigger boxes don’t fix missing indexes or lock waits.
Fix: identify top queries and waits; fix root causes. Only scale up after you can explain why it helps.
Checklists / step-by-step plan
Step-by-step plan: choose a tenancy model that survives growth
- Classify tenants by workload: small/medium/elephant; read-heavy vs write-heavy; batch-heavy vs interactive.
- Decide your isolation target: data boundary only, or also resource/fault boundary. Compliance often forces the latter.
- Pick the initial model:
  - If you need strong in-DB enforcement with shared tables: PostgreSQL + RLS is the pragmatic default.
  - If you need per-tenant hard boundaries: database-per-tenant or shard-per-tenant, regardless of engine.
- Design keys for mobility: avoid global sequences that make tenant moves messy; prefer UUIDs or carefully scoped IDs when sharding is in your future.
- Make tenant moves routine: implement tooling for exporting/importing a tenant, verifying counts/checksums, and cutting over.
- Implement tenant-aware observability: every DB query path should be attributable to tenant (at least at request/job boundaries).
- Define per-tenant limits: rate limits, max export sizes, timeouts, concurrency caps.
- Establish boring recovery muscle: regular restore tests at tenant granularity.
Checklist: shared-table multi-tenancy (PostgreSQL recommended)
- RLS enabled on every tenant-scoped table.
- Runtime role cannot bypass RLS; admin/migration roles separated.
- Every tenant query has a composite index starting with tenant_id (or a partitioning strategy that makes tenant access cheap).
- pg_stat_statements enabled and monitored; top queries reviewed regularly.
- Long transaction alerts; statement timeouts set.
- Autovacuum tuned for hot tables; bloat tracked.
Checklist: schema-per-tenant (PostgreSQL sweet spot)
- Migration runner supports batching tenants, idempotency, and verification.
- Per-tenant schema naming conventions enforced; no manual snowflakes.
- Connection pooling strategy avoids per-request connection churn; careful use of search_path.
- Backups and restore drills can restore a single tenant schema cleanly.
Checklist: database-per-tenant (MySQL or PostgreSQL)
- Automated provisioning: create DB, users, grants, monitoring, backups.
- Fleet migration tooling; staged rollouts; canaries.
- Central service directory for routing (tenant → database mapping) with audit logging.
- Cost controls: avoid one tiny tenant per oversized instance unless compliance forces it.
FAQ
1) Should I use PostgreSQL RLS for multi-tenant SaaS?
If you’re doing shared tables and you’re serious about preventing cross-tenant access, yes. RLS turns missing tenant filters
from a latent security bug into a query error. It’s not free—policies must be designed and tested—but it’s a real enforcement tool.
2) Can MySQL safely do shared-table multi-tenancy?
Yes, but you’re relying more on application discipline and review processes. You can build safe patterns (views, stored procedures,
restricted accounts), but the database won’t naturally enforce tenant filters the way PostgreSQL can with RLS.
3) Is schema-per-tenant “too many objects” in PostgreSQL?
It can be. Thousands of schemas with many tables and indexes can stress catalogs and maintenance. If you go this route, invest in
migration tooling, avoid per-tenant snowflakes, and periodically evaluate whether sharding (fewer tenants per cluster) is simpler.
4) What’s the best model for compliance-heavy tenants?
Database-per-tenant or shard-per-tenant is the common answer, because it gives clear boundaries for backups, restores, encryption scope,
and access controls. Shared-table can pass audits, but it requires rigorous controls and clear evidence.
5) How do I prevent “noisy neighbor” incidents?
Start with attribution (tenant-aware metrics). Then implement containment: per-tenant rate limits, query timeouts, workload separation
(replicas for reads), and a path to isolate elephants (move them to their own shard/DB).
6) Does sharding solve tenant isolation?
Sharding helps resource and fault isolation by reducing blast radius, but it doesn’t automatically solve data isolation inside a shard.
You still need correct access boundaries and safe query paths.
7) What connection pooling strategy works best for multi-tenant PostgreSQL?
Use a pooler and keep connection counts sane. Be careful with session-level state (like search_path or per-tenant settings).
Transaction pooling is efficient but requires discipline; session pooling is simpler but uses more connections.
8) Should tenant_id be part of every primary key?
Often yes for shared-table designs, at least as part of the key strategy, because it improves locality and index selectivity.
But don’t blindly compound keys everywhere. Choose based on query patterns, uniqueness requirements, and future shard/move plans.
9) Is partitioning by tenant a good idea?
Sometimes. It can make per-tenant operations faster and improve vacuum/maintenance targeting in PostgreSQL. But it can explode partition counts
if you have many tenants. Time-based partitioning is often a better default for event tables, with tenant_id indexed inside partitions.
10) When should I move a tenant to its own database?
When tenant-specific workload or compliance needs repeatedly force tenant-specific tuning that harms others, or when you need independent scaling
and maintenance windows. If you’re debating it every week, you already know the answer.
Next steps you can actually execute
The database choice is important, but it’s not the main event. The main event is whether your isolation strategy is enforceable, observable,
and operationally movable.
- Pick a tenancy model you can operate, not one that looks elegant on a whiteboard.
- If you’re on PostgreSQL shared tables, implement RLS on the highest-risk tables first, and separate runtime roles from admin roles.
- If you’re on MySQL shared tables, formalize a tenant-safe query layer, add linting/tests for tenant predicates, and build per-tenant throttles.
- Instrument tenant attribution (logs, metrics, traces). If you can’t name the tenant causing load, you can’t isolate it.
- Make tenant moves routine. Write the playbook, automate it, and run drills when nobody is on fire.
- Schedule the boring work: restore tests, long-transaction alerts, index reviews, and migration canaries. These are not “later” tasks; they’re rent.
One quote worth keeping on your wall, a paraphrased idea from John Allspaw: reliability comes from the system, not from individual heroics.
Multi-tenancy is the ultimate proof of that.