MariaDB vs PostgreSQL for Multi-Tenant Hosting: Stop One Client Site from Killing Everyone

Multi-tenant hosting is where optimism goes to get audited. You sell “unlimited” sites, one client installs a plugin that turns every page view into a full-table scan, and suddenly your support queue looks like a denial-of-service report.

The problem isn’t that MariaDB or PostgreSQL is “slow.” The problem is that shared databases are a commons, and one tenant can absolutely graze it down to dirt—CPU, I/O, locks, connections, temp space, autovacuum/undo history, you name it. Your job is to make sure the blast radius of bad behavior is small, diagnosable, and boring.

What actually kills everyone in multi-tenant databases

Multi-tenant “outage” is rarely a single bug. It’s usually one tenant tripping a shared limit and then the system politely collapsing for everyone else.

The usual suspects (in production order)

  • Connection exhaustion: the app scales by “opening another connection,” and your database scales by “falling over.”
  • CPU runaway: unindexed predicates, pathological joins, JSON filtering with no GIN/functional indexes, and ORM-generated crimes.
  • I/O saturation: big sorts to disk, temp tables, checkpoint storms, and “let’s export the entire tenant every hour.”
  • Lock contention: one tenant doing large migrations at noon; everyone else waits, then retries, then dogpiles.
  • Maintenance debt: PostgreSQL autovacuum falling behind, or InnoDB history list length growing until everything starts paying interest.
  • Disk-full events: temp files, binary logs/WAL, runaway table growth—nothing makes databases more honest than a 100% full filesystem.

In multi-tenant hosting, the “one site kills everyone” pattern is not mysterious. It’s predictable. Which is great, because you can engineer against it.

Joke #1: The cloud is just someone else’s computer, and multi-tenant hosting is just someone else’s computer with your pager attached.

MariaDB vs PostgreSQL: the decision that matters

People compare MariaDB and PostgreSQL like it’s a feature checklist. In multi-tenant hosting, the real question is: which system gives you more effective levers to isolate tenants and diagnose the noisy neighbor quickly?

Opinionated summary

  • If you can standardize on a pooler and enforce timeouts, PostgreSQL is usually the cleaner multi-tenant platform. It has better primitives for query governance and per-role control, and its introspection story is excellent.
  • If you’re operating at “classic LAMP hosting” scale with lots of small tenants and a strong MySQL/MariaDB operational muscle memory, MariaDB can be perfectly solid—if you take resource limits seriously and stop pretending “max_connections=5000” is a plan.
  • If you are not willing to operationalize guardrails, neither database will save you. The database is not a parenting strategy.

Where PostgreSQL tends to win for tenant containment

  • First-class role model for scoping privileges; easy to map “tenant role” to policies.
  • Row Level Security (RLS) enables row-per-tenant in a way that’s enforceable at the database layer, not “we swear the app always filters tenant_id.”
  • Powerful observability: pg_stat_statements, pg_locks, pg_stat_activity, auto_explain, and generally fewer “what is it doing?” moments.
  • Statement timeouts and lock timeouts are straightforward and can be per-role or per-database.

Where MariaDB tends to win (or at least hurts less)

  • Operational familiarity in hosting environments: many teams already have tooling around it.
  • Per-user resource limits exist and are simple to apply (MAX_QUERIES_PER_HOUR, MAX_USER_CONNECTIONS, etc.). They’re blunt, but blunt tools are sometimes the only ones that get used.
  • Replication ecosystem can be straightforward for read scaling in common hosting patterns (with caveats around consistency expectations).

What not to do

Don’t choose a database because someone told you it’s “faster.” In multi-tenant hosting, you’re not tuning for a benchmark. You’re tuning for the worst customer at the worst time.

Isolation models: database-per-tenant, schema-per-tenant, row-per-tenant

Before you argue MariaDB vs PostgreSQL, decide what you are isolating: data, performance, operational blast radius, or compliance boundaries. You don’t get all four for free.

Model A: database-per-tenant (strong isolation, heavy ops)

This is the classic “each customer gets their own database.” It’s great for containment: one tenant’s bloat, locking, and migrations are less likely to impact others.

  • Pros: Clear blast radius, easier backups/restore per tenant, easier “move tenant to another host,” simpler GDPR/retention deletion.
  • Cons: More objects, more connections, more background work, more monitoring surface area. You need automation or you’ll drown.
  • PostgreSQL: Works well, but watch global connection count and autovacuum overhead across many databases.
  • MariaDB: Also works well; a MariaDB “database” and “schema” are the same object, so this maps naturally. Watch table cache and open file limits.

Model B: schema-per-tenant (middle ground, still messy)

One database, many schemas (or MariaDB databases as namespaces). You get some organizational separation, but performance isolation is weak unless you add guardrails.

  • Pros: Fewer server-level objects than DB-per-tenant, simpler cross-tenant reporting (sometimes).
  • Cons: Shared buffer pool, shared WAL/binlog, shared temp space. DDL can still cause pain.
  • Best use: Moderate tenant counts with similar workloads and tight SRE ownership.

Model C: row-per-tenant (maximum density, minimum forgiveness)

One schema, shared tables, tenant_id in every row. It’s efficient. It’s also where a single missing index becomes a shared-hosting horror story.

  • Pros: Simplifies rolling out schema changes; fewer objects; great for SaaS with consistent model.
  • Cons: Noisy neighbor risk is highest. Backups/restore per tenant is harder. You need strict query discipline.
  • PostgreSQL: RLS can make this safe-ish from a data-access standpoint (see the sketch after this list).
  • MariaDB: You’re relying on application correctness for tenant filters unless you do elaborate views/procedures patterns (which most teams don’t keep correct over time).
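
For PostgreSQL, the RLS pattern is worth seeing concretely. A minimal sketch, assuming tenant_id is an integer column and the application (or a pooler hook) sets an invented session setting named app.tenant_id; table and policy names are illustrative:

-- Enforce tenant filtering at the database layer, even for the table owner.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id')::int);

-- Each session or transaction declares its tenant context before querying.
SET app.tenant_id = '442';

This enforces data access, not performance: a tenant can still run an expensive query that is allowed but destructive, which is why the guardrails below still matter.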

The isolation reality check

“Tenant isolation” isn’t just permissions. It’s controlling shared resources: CPU time, I/O, memory, locks, and connections. If your model doesn’t let you throttle those, you don’t have isolation—you have hope.

Hard limits and guardrails (the “you can’t do that” layer)

Guardrails are what you put in place before you need them. After a tenant melts the cluster, you’ll add them anyway—just under stress, with customers yelling. Do it now.

PostgreSQL guardrails that actually work

  • statement_timeout per role or per database: kills runaway queries.
  • lock_timeout: prevents “wait forever” pileups.
  • idle_in_transaction_session_timeout: stops connections that hold locks while doing nothing.
  • work_mem per role: prevents one tenant’s sort from eating RAM (or forcing huge temp spills).
  • role-based settings via ALTER ROLE … SET: makes policy enforceable.
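
A minimal sketch of encoding that policy per role (role and database names are illustrative; the values mirror the role settings shown later in Task 4, plus an idle-in-transaction timeout):

-- Per-role defaults are applied at login; no application cooperation required.
ALTER ROLE tenant_442 SET statement_timeout = '30s';
ALTER ROLE tenant_442 SET lock_timeout = '2s';
ALTER ROLE tenant_442 SET idle_in_transaction_session_timeout = '60s';
ALTER ROLE tenant_442 SET work_mem = '16MB';

-- Or per database, if you run database-per-tenant (hypothetical database name):
ALTER DATABASE tenant_442_db SET statement_timeout = '30s';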

MariaDB guardrails that are underrated

  • Per-user limits (MAX_USER_CONNECTIONS, MAX_QUERIES_PER_HOUR): crude, effective.
  • max_statement_time (MariaDB’s analogue of MySQL’s max_execution_time): kills statements that run past a time budget when applied carefully (see the sketch after this list).
  • InnoDB settings that reduce damage from bursts (e.g., sane log file sizing, buffer pool sizing, flushing behavior).
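
A sketch of those knobs with illustrative values (the account name matches the examples later in this article; verify the per-account MAX_STATEMENT_TIME resource option against your MariaDB version):

-- Per-account caps: crude, but they hold under pressure.
ALTER USER 'tenant442'@'%' WITH MAX_USER_CONNECTIONS 20 MAX_QUERIES_PER_HOUR 100000;

-- Per-account statement time budget in seconds (MariaDB resource option).
ALTER USER 'tenant442'@'%' WITH MAX_STATEMENT_TIME 30;

-- Or a server-wide default in seconds (0 means no limit).
SET GLOBAL max_statement_time = 30;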

What neither database gives you natively

Neither MariaDB nor PostgreSQL will perfectly enforce “tenant gets 10% CPU and 50 IOPS” at the SQL level. If you need that level of hard multi-tenancy, you end up using:

  • Separate instances per tenant (or per tenant tier).
  • OS-level controls (cgroups) to cap the database process resources—useful, but coarse (see the sketch after this list).
  • Queueing at the app or pooler: you can’t run what you don’t accept.
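
If you do reach for cgroups, the least-painful route on systemd hosts is a drop-in on the database unit. A sketch with illustrative values (unit name and path depend on your distribution):

# /etc/systemd/system/mariadb.service.d/limits.conf
[Service]
CPUQuota=400%     # cap the whole instance at roughly 4 cores
MemoryMax=48G     # hard memory ceiling for the service cgroup
IOWeight=100      # relative I/O weight (cgroup v2)

Then systemctl daemon-reload and restart the service. Note the coarseness: this caps the entire database process tree, so it protects co-located services, not one tenant from another inside the same instance.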

Connection storms: pooling, caps, and queueing

The fastest way a single tenant kills everyone is connection churn. PHP-FPM scaling up, Node workers forking, cron jobs piling up, and suddenly the database is spending its life doing connection setup and context switching.

PostgreSQL: you almost always want a pooler

PostgreSQL handles concurrency well, but thousands of active backend processes are not free. In multi-tenant hosting with spiky traffic, you typically want PgBouncer (transaction pooling for most web apps), with per-tenant caps if you can map users/roles cleanly.
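
A minimal pgbouncer.ini sketch for that setup (database names and sizes are illustrative; pool sizing depends entirely on your hardware and workload):

; pgbouncer.ini (excerpt)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
pool_mode = transaction        ; the usual fit for short web transactions
default_pool_size = 20         ; server connections per user/database pair
max_client_conn = 2000         ; clients queue here instead of at the database
reserve_pool_size = 5          ; small burst headroom

The key property: the database sees a bounded number of backends while clients wait in the pooler, which is exactly the backpressure you want.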

MariaDB: pooling exists, but be honest about where it lives

MariaDB can handle a lot of connections, but don’t confuse “accepts TCP connections” with “performs well.” Often the practical strategy is:

  • Cap connections per user.
  • Use application-level pooling (depending on app stack).
  • Keep max_connections realistic and design for backpressure.

Joke #2: If your scaling strategy is “increase max_connections,” you don’t have a database; you have a very expensive queue.

The ugly truth about fairness

Without explicit fairness controls, the loud tenant wins. The database doesn’t know what “one customer” is. It knows sessions. You need a mapping: tenant → role/user → limits. If you can’t map it, you can’t enforce it.

I/O and storage: stop the disk from being the shared tragedy

In multi-tenant systems, disks fail in a very specific way: not by dying, but by being busy. Your CPUs are fine, your RAM is fine, and your database is “slow” because it is waiting for storage. Then every tenant times out and retries, and you get a retry storm that makes the original issue look quaint.

Containment patterns that work

  • Separate WAL/binlog from data when you can. It’s not magic, but it reduces contention and makes disk-full events less catastrophic.
  • Per-tenant datasets/volumes (where feasible): if you do database-per-tenant, placing each tenant database on its own dataset can let you enforce quotas and detect growth.
  • Temp space monitoring: sorts and temp tables are the silent disk eaters.
  • Checkpoint and flush tuning: both PostgreSQL and InnoDB can generate write bursts; in multi-tenant, bursts correlate with page load spikes, which is how you get “every site is slow.”

Storage engineering take: don’t share without telemetry

If you can’t attribute I/O to a tenant (or at least to a database/user), you’ll fight ghosts. At minimum, attribute to queries (pg_stat_statements / Performance Schema) and to databases.
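
If pg_stat_statements is not already loaded, enabling it is a restart-once change (file paths and the MariaDB config layout vary by distribution):

# postgresql.conf: load the module (requires a restart)
shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database you want to inspect:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

# my.cnf, [mysqld] section, for the MariaDB side (also requires a restart):
performance_schema = ON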

Observability: prove who’s noisy, fast

In hosting, “which tenant did it?” is not a philosophical question. It’s an on-call survival skill. You need:

  • Top queries by total time, mean time, calls, and rows.
  • Lock graphs: who is blocking whom.
  • Connection counts per tenant mapping (user/role/db).
  • Disk/WAL/binlog growth trends.
  • Maintenance lag: autovacuum delays, InnoDB history list length, etc.

One maxim worth keeping on a sticky note: you can’t improve what you don’t measure.

Interesting facts and historical context (because it shapes today’s tradeoffs)

  1. PostgreSQL descends from the POSTGRES project at UC Berkeley, designed with extensibility in mind long before “extensions” became fashionable.
  2. MySQL became the default of early web hosting because it was easy to deploy and plenty fast for the era’s workloads—lots of reads, simpler schemas, smaller data.
  3. MariaDB was forked from MySQL after Oracle acquired Sun Microsystems; the fork was as much about governance as technology.
  4. PostgreSQL’s MVCC model makes reads non-blocking by design, but it creates a maintenance requirement: vacuuming dead tuples is not optional.
  5. InnoDB (the dominant MySQL/MariaDB engine) brought transactions and row-level locking to the MySQL ecosystem and changed hosting reliability in the 2000s.
  6. PgBouncer became a de facto standard in high-concurrency Postgres environments because process-per-connection has real overhead at scale.
  7. Row Level Security in PostgreSQL became a mainstream multi-tenant feature only after teams got tired of “we promise the app always adds tenant_id.”
  8. MySQL/MariaDB’s per-user limits have existed for a long time, but many hosting stacks ignored them because “it might break a customer,” until customers broke everyone.

Practical tasks: commands, outputs, decisions (12+)

These are the kinds of checks you run when you’re trying to keep multi-tenant hosting stable. Each task includes: command, realistic output, what it means, and what decision you make.

Task 1: Find top PostgreSQL queries by total time (pg_stat_statements)

cr0x@server:~$ sudo -u postgres psql -d postgres -c "SELECT queryid, calls, round(total_exec_time)::int AS total_ms, round(mean_exec_time,2) AS mean_ms, left(query,120) AS sample FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
 queryid  | calls | total_ms | mean_ms |                         sample
----------+-------+----------+---------+----------------------------------------------------------
 99120311 |  2140 |   987654 |  461.52 | SELECT * FROM orders WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT $2
 77112002 |   120 |   501223 | 4176.86 | SELECT count(*) FROM events WHERE tenant_id = $1 AND payload::text ILIKE $2
 55110019 | 98000 |   210112 |    2.14 | SELECT id FROM sessions WHERE expires_at < now()
 44110070 |    42 |   180990 | 4309.29 | SELECT * FROM invoices WHERE status IN (...) AND tenant_id=$1 ORDER BY id
 99120355 |   300 |   150333 |  501.11 | UPDATE products SET stock = stock - 1 WHERE tenant_id=$1 AND id=$2
(5 rows)

Meaning: The second query is a classic tenant killer (ILIKE on payload text). High mean time, lower calls, big impact.

Decision: Identify the tenant via app logs/query parameters, add an index or redesign (e.g., GIN/trigram, extracted column), and set per-role statement_timeout to cap damage.
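
For the ILIKE-on-payload shape specifically, a hedged sketch of the index options (table and column names come from the sample output above; verify the real schema and test the plan before copying):

-- Option 1: trigram index so ILIKE '%term%' can use an index at all.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_events_payload_trgm
    ON events USING gin ((payload::text) gin_trgm_ops);

-- Option 2 (often better): extract the searched field into its own column
-- at write time and index that, instead of scanning serialized JSON.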

Task 2: See PostgreSQL sessions and who is waiting

cr0x@server:~$ sudo -u postgres psql -c "SELECT pid, usename, datname, state, wait_event_type, wait_event, now()-query_start AS age, left(query,80) AS q FROM pg_stat_activity WHERE state <> 'idle' ORDER BY age DESC LIMIT 10;"
 pid  |  usename   |  datname  | state  | wait_event_type | wait_event |   age    |                                         q
------+------------+-----------+--------+-----------------+------------+----------+--------------------------------------------------------------------------------
 8123 | tenant_442 | appdb     | active | Lock            | relation   | 00:03:14 | ALTER TABLE posts ADD COLUMN foo text
 8201 | tenant_101 | appdb     | active |                 |            | 00:02:59 | SELECT * FROM orders WHERE tenant_id=$1 ORDER BY created_at DESC LIMIT $2
 8210 | tenant_333 | appdb     | active | Lock            | tuple      | 00:02:10 | UPDATE posts SET status='x' WHERE tenant_id=$1 AND id=$2
(3 rows)

Meaning: A tenant is running DDL and holding locks; other tenants are waiting on relation/tuple locks.

Decision: Kill or postpone the DDL, enforce lock_timeout for tenant roles, and implement “maintenance window” policy for migrations.

Task 3: Detect PostgreSQL lock blockers

cr0x@server:~$ sudo -u postgres psql -c "SELECT blocked.pid AS blocked_pid, blocked.usename AS blocked_user, blocker.pid AS blocker_pid, blocker.usename AS blocker_user, left(blocker.query,80) AS blocker_query FROM pg_locks bl JOIN pg_stat_activity blocked ON blocked.pid=bl.pid JOIN pg_locks kl ON bl.locktype=kl.locktype AND bl.database IS NOT DISTINCT FROM kl.database AND bl.relation IS NOT DISTINCT FROM kl.relation AND bl.page IS NOT DISTINCT FROM kl.page AND bl.tuple IS NOT DISTINCT FROM kl.tuple AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid AND bl.classid IS NOT DISTINCT FROM kl.classid AND bl.objid IS NOT DISTINCT FROM kl.objid AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid AND kl.granted JOIN pg_stat_activity blocker ON blocker.pid=kl.pid WHERE NOT bl.granted LIMIT 5;"
 blocked_pid | blocked_user | blocker_pid | blocker_user |                         blocker_query
-------------+--------------+-------------+--------------+---------------------------------------------------------------
        8210 | tenant_333   |        8123 | tenant_442   | ALTER TABLE posts ADD COLUMN foo text
(1 row)

Meaning: You have a clear blocker. No guessing, no vibes.

Decision: Terminate the blocker session if it violates policy; add DDL controls in deployment pipeline.
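
Terminating the blocker is one built-in call away (the pids come from the output above):

-- Try cancelling the current query first; escalate to terminating the session.
SELECT pg_cancel_backend(8123);
SELECT pg_terminate_backend(8123);

-- pg_blocking_pids() is a quicker way to ask “who blocks this pid?” than the join above.
SELECT pg_blocking_pids(8210);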

Task 4: Verify PostgreSQL timeouts for tenant roles

cr0x@server:~$ sudo -u postgres psql -c "SELECT rolname, rolconfig FROM pg_roles WHERE rolname IN ('tenant_333','tenant_442');"
  rolname   |                        rolconfig
------------+----------------------------------------------------------
 tenant_333 | {statement_timeout=30000,lock_timeout=2000,work_mem=16MB}
 tenant_442 | {statement_timeout=30000,lock_timeout=2000,work_mem=16MB}
(2 rows)

Meaning: Policy is encoded at the database role, not in an internal wiki nobody reads.

Decision: If missing, add role defaults; if too strict, create tiers (small/medium/enterprise tenants).

Task 5: Check PostgreSQL autovacuum health

cr0x@server:~$ sudo -u postgres psql -c "SELECT relname, n_dead_tup, last_autovacuum, last_vacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
  relname  | n_dead_tup |        last_autovacuum        |          last_vacuum
-----------+------------+-------------------------------+-------------------------------
 events    |    5021132 |                               |
 sessions  |     901223 | 2025-12-29 08:11:02.12345+00  |
 orders    |     200111 | 2025-12-29 08:09:10.99431+00  |
 invoices  |     150991 | 2025-12-29 07:59:34.33121+00  |
 posts     |      90112 | 2025-12-29 08:00:20.11200+00  |
(5 rows)

Meaning: The events table has millions of dead tuples and no recent autovacuum. That’s a performance landmine.

Decision: Investigate why autovacuum isn’t running (thresholds, wraparound danger, long transactions). Consider per-table autovacuum tuning or partitioning by tenant/time.
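
Per-table autovacuum tuning is a storage-parameter change, not a restart (the values are illustrative starting points; tune against your actual churn rate):

-- Vacuum the hot table far earlier than the global default (20% of the table).
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_vacuum_cost_delay    = 2
);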

Task 6: MariaDB: identify top queries (Performance Schema digest)

cr0x@server:~$ sudo mariadb -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1000000000000,2) AS total_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 3\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT COUNT ( * ) FROM events WHERE tenant_id = ? AND payload LIKE ?
COUNT_STAR: 118
total_s: 632.11
*************************** 2. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE tenant_id = ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 9801
total_s: 201.44
*************************** 3. row ***************************
DIGEST_TEXT: UPDATE products SET stock = stock - ? WHERE tenant_id = ? AND id = ?
COUNT_STAR: 3002
total_s: 95.12

Meaning: You can see “query shapes” even if literal values differ. The COUNT with LIKE is the bully.

Decision: Add proper indexes, cap execution time, or rewrite feature. If the tenant insists, move them to their own instance.

Task 7: MariaDB: find active threads and who is hogging

cr0x@server:~$ sudo mariadb -e "SHOW FULL PROCESSLIST;"
Id	User	Host	db	Command	Time	State	Info
9112	tenant442	app1:44210	appdb	Query	210	Waiting for table metadata lock	ALTER TABLE posts ADD COLUMN foo TEXT
9201	tenant101	app2:51012	appdb	Query	45	Sending data	SELECT * FROM orders WHERE tenant_id=101 ORDER BY created_at DESC LIMIT 50
9303	tenant333	app3:39910	appdb	Query	39	Locked	UPDATE posts SET status='x' WHERE tenant_id=333 AND id=9

Meaning: Same story as Postgres: DDL causing metadata lock waits and pileups.

Decision: Kill the DDL, enforce online schema change tooling/policy, and schedule migrations.
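
Killing the offender uses the thread Id from the processlist above:

-- KILL QUERY aborts the statement but keeps the connection; plain KILL drops the session.
KILL QUERY 9112;
KILL 9112;

For the policy side, online schema change tooling such as pt-online-schema-change avoids holding the long blocking ALTER in the first place.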

Task 8: MariaDB: apply per-user connection limits

cr0x@server:~$ sudo mariadb -e "ALTER USER 'tenant442'@'%' WITH MAX_USER_CONNECTIONS 20;"
cr0x@server:~$ sudo mariadb -e "SHOW GRANTS FOR 'tenant442'@'%';"
Grants for tenant442@%
GRANT USAGE ON *.* TO `tenant442`@`%` WITH MAX_USER_CONNECTIONS 20

Meaning: Tenant 442 can’t open 500 connections anymore. They’ll queue at the app, not in your database.

Decision: Set tiered limits; monitor for “too many connections” errors and offer pooling guidance.

Task 9: PostgreSQL: see per-database connection consumption

cr0x@server:~$ sudo -u postgres psql -c "SELECT datname, count(*) AS conns FROM pg_stat_activity GROUP BY datname ORDER BY conns DESC;"
 datname | conns
---------+-------
 appdb   |   312
 postgres|     3
(2 rows)

Meaning: Most pressure is in appdb. Not shocking, but now you know.

Decision: If appdb is multi-tenant, consider splitting heavy tenants to separate databases or separate clusters.

Task 10: Check system I/O wait and top offenders

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0 (db01) 	12/29/2025 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    5.20   32.40    0.00   50.30

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
nvme0n1         220.0  18432.0     2.0   0.90    8.20    83.78   600.0  51200.0   120.0  16.67   25.10    85.33   17.3  99.20

Meaning: 32% iowait and the NVMe is near 100% utilized with high write await. This is a storage bottleneck right now.

Decision: Identify why writes are spiking (checkpoint, vacuum, binlog/WAL, temp spills). Consider tuning checkpoint/flush behavior and ensuring enough IOPS headroom.
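
On the PostgreSQL side, the usual knobs for smoothing write bursts look like this (illustrative values, not recommendations for your hardware; a reload applies them):

# postgresql.conf excerpt
max_wal_size = 8GB                    # fewer, larger checkpoints
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9    # spread checkpoint writes across the interval

On the MariaDB side, the analogous levers are InnoDB redo log sizing and innodb_io_capacity; change them deliberately, with measurements, not as a panic response.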

Task 11: PostgreSQL: check WAL generation rate symptoms

cr0x@server:~$ sudo -u postgres psql -c "SELECT now(), pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),'0/0')) AS wal_bytes_since_start;"
              now              | wal_bytes_since_start
------------------------------+------------------------
 2025-12-29 08:22:10.1122+00  | 842 GB
(1 row)

Meaning: That’s not “bad” by itself, but if WAL is growing fast and replicas lag, you’re heading for disk pressure.

Decision: Check replication lag and archiving; if one tenant is bulk-writing, throttle them, or isolate them.
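
Checking lag and archiving from the primary is two quick queries:

-- Byte lag per replica, as seen from the primary.
SELECT application_name, state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;

-- Archiver health: failures here mean WAL piles up on disk.
SELECT archived_count, failed_count, last_failed_time FROM pg_stat_archiver;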

Task 12: MariaDB: check InnoDB engine status for contention and history

cr0x@server:~$ sudo mariadb -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,70p'
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
=====================================
2025-12-29 08:23:01 0x7f3c6c1fe700 INNODB MONITOR OUTPUT
=====================================
...
History list length 412398
...
Trx id counter 928331120
Purge done for trx's n:o < 928300000 undo n:o < 0 state: running
...

Meaning: A large history list length can indicate purge lag, often due to long-running transactions. That can degrade performance across tenants.

Decision: Find and kill long transactions; adjust app behavior; ensure purge can keep up.
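
Finding the transactions behind purge lag is an information_schema query; trx_mysql_thread_id is what you pass to KILL:

SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 10;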

Task 13: PostgreSQL: find long-running transactions (vacuum blockers)

cr0x@server:~$ sudo -u postgres psql -c "SELECT pid, usename, now()-xact_start AS xact_age, state, left(query,80) AS q FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_age DESC LIMIT 5;"
 pid  |  usename   |  xact_age  |        state        |                                         q
------+------------+------------+---------------------+--------------------------------------------------------------------------------
 9001 | tenant_777 | 02:14:09   | idle in transaction | SELECT * FROM reports WHERE tenant_id=$1
(1 row)

Meaning: “Idle in transaction” is a vacuum killer and a lock risk. In multi-tenant, it’s a shared tax.

Decision: Enforce idle_in_transaction_session_timeout for tenant roles; fix the app path that forgets to commit/rollback.

Task 14: Verify disk usage and locate fast-growing areas

cr0x@server:~$ df -h /var/lib/postgresql /var/lib/mysql
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  900G  812G   43G  95% /var/lib/postgresql
/dev/nvme1n1p1  900G  620G  235G  73% /var/lib/mysql

Meaning: Postgres volume is at 95%. You’re one burst of WAL or temp files away from a very educational incident.

Decision: Emergency cleanup (WAL retention/archiving check, temp file hunting), then permanent quotas/alerting. Also: stop letting one cluster run at 95% in the first place.

Fast diagnosis playbook

This is the “someone just paged you and every tenant is screaming” sequence. The goal is not elegance. The goal is to find the bottleneck in minutes.

First: decide whether it’s CPU, I/O, locks, or connections

  1. CPU saturation? Check system load and CPU user/system.
    • If CPU is high and iowait is low: suspect bad queries, missing indexes, expensive sorts/joins.
  2. I/O wait high? If iowait is high and disk %util is near 100%: suspect checkpoint/flush storms, temp spills, WAL/binlog churn, vacuum/purge lag.
  3. Lock pileup? Look for sessions waiting on locks (pg_stat_activity wait_event_type=Lock; MariaDB processlist “Locked” / metadata locks).
  4. Connection exhaustion? Check active connections vs configured max; look for “too many connections” errors and thread/process explosion.

Second: identify the top offender and the blast radius

  1. Top queries (pg_stat_statements / Performance Schema digests): find the query shape consuming the most total time.
  2. Map to tenant: via role/user, database, application tags, or log correlation.
  3. Confirm impact: is it causing locks, I/O churn, or CPU burn?

Third: apply the least-bad mitigation

  • Kill the query/session if it’s actively harming others and violates policy.
  • Throttle at the edge: cap connections per tenant; reduce concurrency in pooler; rate limit heavy endpoints.
  • Temporarily raise fairness: lower statement timeouts for the noisy tenant role, not globally.
  • Stabilize storage: if disk is the bottleneck, reduce write pressure (postpone vacuum full/reindex, tune checkpoints cautiously, stop bulk jobs).

If you do these steps consistently, “multi-tenant meltdown” becomes a repeatable diagnosis instead of an all-hands séance.

Common mistakes: symptom → root cause → fix

1) Symptom: “All sites are slow” and database CPU is fine

Root cause: Storage saturation (temp spills, checkpoint bursts, WAL/binlog backlog).

Fix: Check iowait and disk %util; cap work_mem/sort sizes; tune checkpoints; separate WAL/binlog; add IOPS headroom.

2) Symptom: Random timeouts, lots of “waiting for lock”

Root cause: Tenant DDL or long transactions causing lock queues.

Fix: Enforce lock_timeout and idle-in-transaction timeout; require online schema change tooling; schedule migrations; kill blockers fast.

3) Symptom: Periodic “too many connections” and then recovery

Root cause: Connection storms from app scaling or cron bursts; no pooling/backpressure.

Fix: Deploy PgBouncer (Postgres) or app pooling; cap per-tenant connections (MariaDB limits, Postgres role-based pool limits); reduce worker counts.

4) Symptom: Postgres queries get slower over days, then “vacuum can’t keep up”

Root cause: Autovacuum falling behind due to long transactions, poor thresholds, or heavy churn tables (often shared multi-tenant tables).

Fix: Find long transactions; tune autovacuum per table; partition high-churn tables; ensure vacuum has resources; reduce bloat drivers.

5) Symptom: MariaDB performance degrades with lots of writes; undo/purge issues

Root cause: Long-running transactions and purge lag (history list length grows).

Fix: Identify long transactions; shorten transaction scopes; kill offenders; ensure purge can progress; revisit isolation levels and batch sizes.

6) Symptom: “It’s only one tenant’s report” but it wrecks everyone

Root cause: Heavy analytical query on the primary, no resource segregation, no read replica, no timeout.

Fix: Route reporting to replicas/warehouse; enforce statement_timeout; create tenant tier that includes separate reporting resources.

7) Symptom: Disk full, database crashes or goes read-only, panic ensues

Root cause: WAL/binlog growth, runaway temp files, or unbounded tenant data.

Fix: Enforce quotas (where possible), monitor growth, keep free space policy (not optional), and cap retention. Treat disk-full as a design failure, not a surprise.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They ran a shared PostgreSQL cluster for a fleet of small customer sites. Each tenant got their own database user, but all tenants shared the same database and schema. The product team insisted the application always filtered by tenant_id. The SRE team believed them, because the logs looked clean and everyone wanted the sprint to end.

A new customer onboarded with an “analytics plugin” that added a search feature across historical events. The plugin used a dynamic query builder, and in one code path it forgot to include tenant_id in the WHERE clause. Not always. Just on certain combinations of filters. That made it hard to catch in staging; staging never had enough data for it to hurt.

In production, the query went from “scan a few thousand rows for one tenant” to “scan a few hundred million rows for all tenants.” PostgreSQL did what it was told. CPU climbed, then I/O spiked because the plan started spilling sorts to disk. Meanwhile, other tenants’ requests piled up and timed out, and their retry logic helpfully increased load. The plugin did not just leak data risk—it caused a full-host performance incident.

The fix was not heroic. They implemented Row Level Security, forced all tenant queries through a SECURITY DEFINER function with tenant context, and added statement_timeout for tenant roles. The bigger change was cultural: “the app always filters” stopped being accepted as an architecture decision. It became an untrusted input.

Mini-story 2: The optimization that backfired

A hosting platform standardized on MariaDB. One quarter, someone decided to “optimize” by increasing max_connections significantly because customers occasionally saw connection errors during traffic spikes. The thinking was simple: fewer errors equals happier customers.

For about two days, it looked like a win. Then a marketing campaign hit one tenant’s site. Their PHP workers ramped up. Each worker opened its own connection, did a slow query, and sat around waiting on I/O. More connections meant more concurrent slow work. Buffer pool churn increased. InnoDB started flushing harder. Latency rose across the board. Other tenants retried, which created more connections, which created more flushing. Classic self-inflicted storm.

When they finally reduced max_connections back down, the immediate symptom improved but the team had learned the wrong lesson: “max_connections should be tuned.” The actual lesson was “you need backpressure and pooling.” They added per-user connection caps, fixed the offending query with an index, and implemented a small connection pool in the application layer for their managed stack.

The postmortem conclusion was dry but accurate: allowing more concurrency does not reduce demand; it increases the system’s ability to suffer more demand at once.

Mini-story 3: The boring but correct practice that saved the day

A different company ran PostgreSQL for multi-tenant SaaS. They weren’t fancy. They just had policies: per-role timeouts, PgBouncer, and strict migration windows. Every tenant role had a default statement_timeout and lock_timeout, and production migrations were executed by a separate “deploy” role with elevated limits and a change window.

One Friday, a tenant pushed a new version of their integration that accidentally performed a large UPDATE in a single transaction, touching a hot table. The query was valid. It was also disastrous at that time of day. The statement_timeout killed it automatically. The app retried a few times and then backed off because the integration had sane retry logic. The database stayed alive. Other tenants barely noticed.

The on-call engineer saw the errors, identified the tenant quickly from role-based logs, and worked with support to coordinate a fix. Nobody got paged into an all-hands. No one “heroically scaled” the cluster. The system behaved like a system with guardrails.

It wasn’t glamorous. That’s the point. The best SRE work looks like nothing happened.

Checklists / step-by-step plan

Step-by-step: harden a shared database against noisy tenants

  1. Choose the isolation model
    • If tenants have unpredictable plugins/custom SQL: prefer database-per-tenant or instance-per-tier.
    • If you control all queries and want density: row-per-tenant with PostgreSQL RLS is viable.
  2. Enforce identity mapping
    • Create a distinct DB role/user per tenant (or per tenant tier) so you can apply limits and attribute behavior.
  3. Set query timeouts by default
    • PostgreSQL: statement_timeout, lock_timeout, idle_in_transaction_session_timeout.
    • MariaDB: use per-user limits; consider max_statement_time (or per-account MAX_STATEMENT_TIME) where appropriate.
  4. Implement connection control
    • PostgreSQL: PgBouncer, plus pool sizing; avoid thousands of backends.
    • MariaDB: cap MAX_USER_CONNECTIONS and keep max_connections realistic.
  5. Build a “no DDL at noon” rule
    • Require migrations in a window; prefer online schema change patterns.
  6. Make storage boring
    • Alert on disk usage early (e.g., 70/80/90%).
    • Separate WAL/binlog if possible; monitor temp usage.
  7. Observability minimum bar
    • Top query shapes, lock waits, per-tenant session counts, and maintenance health (vacuum/purge).
  8. Create escalation paths
    • When a tenant repeatedly causes incidents, move them to a higher tier with dedicated resources—or enforce stricter caps. Don’t negotiate with physics.

Checklist: when adding a new tenant

  • Create tenant role/user with appropriate defaults (timeouts, work_mem, etc.); see the sketch after this checklist.
  • Set connection cap and confirm pooling behavior.
  • Confirm backup/restore process for that tenant model.
  • Enable query logging/attribution to identify the tenant quickly during incidents.
  • Validate “dangerous endpoints” (exports, reporting, bulk updates) are throttled or offloaded.
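
A consolidated sketch of the first two items (names, passwords, and limits are illustrative):

-- PostgreSQL: one role per tenant, capped and pre-configured.
CREATE ROLE tenant_990 LOGIN PASSWORD 'change-me' CONNECTION LIMIT 20;
ALTER ROLE tenant_990 SET statement_timeout = '30s';
ALTER ROLE tenant_990 SET idle_in_transaction_session_timeout = '60s';
GRANT CONNECT ON DATABASE appdb TO tenant_990;

-- MariaDB: same idea with per-account resource options.
CREATE USER 'tenant990'@'%' IDENTIFIED BY 'change-me' WITH MAX_USER_CONNECTIONS 20;
GRANT SELECT, INSERT, UPDATE, DELETE ON tenant990_db.* TO 'tenant990'@'%';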

Checklist: when a tenant wants “one big report”

  • Can it run on a replica or separate analytics store?
  • Is there a statement timeout override for the reporting role only?
  • Is the query indexed and tested on production-like data volume?
  • Is concurrency limited (one report at a time per tenant)?

FAQ

1) Should I use one database for all tenants or one database per tenant?

If tenants can run unpredictable workloads (plugins, custom reporting, migrations), database-per-tenant reduces blast radius. If you control queries tightly and need density, shared tables can work—but only with strict guardrails.

2) Is PostgreSQL “better” for multi-tenancy?

PostgreSQL generally gives you stronger database-layer governance (RLS, per-role settings, introspection). That doesn’t replace good application behavior, but it reduces how often you’re forced to trust it.

3) Can MariaDB enforce tenant fairness?

MariaDB can cap per-user connections and some per-user activity. It’s not a full fairness scheduler, but it’s enough to stop the worst “open 500 connections” behavior and buy you stability.

4) What’s the single most effective protection against noisy neighbors?

Connection control plus timeouts. Cap concurrency per tenant and kill runaway work. Everything else (indexing, tuning, partitioning) is important, but those two stop the bleeding.

5) Does Row Level Security in PostgreSQL solve multi-tenancy?

It solves data access enforcement. It does not solve performance isolation by itself. A tenant can still run an expensive query that is “allowed” but destructive.

6) Should I put heavy tenants on replicas?

Put read-heavy analytics and reporting on replicas if consistency requirements allow it. But watch for replica lag and ensure your app doesn’t silently fall back to primary when lag increases.

7) How do I stop tenants from running schema changes during peak hours?

Don’t rely on politeness. Enforce it: restrict DDL privileges for tenant roles, route migrations through a controlled deploy role, and use lock timeouts to prevent pileups.

8) What’s the fastest way to find the tenant causing trouble?

Use per-tenant database users/roles and query attribution. Without that mapping, you end up guessing from IPs and application logs while the cluster burns.

9) How do I decide when to split tenants onto separate instances?

Split when a tenant’s workload pattern is fundamentally different (heavy writes, huge data, constant reporting) or when they repeatedly trigger incidents. Isolation is cheaper than repeated outages.

10) Are OS-level controls (cgroups) useful for databases?

They’re useful as a coarse safety net, especially in containerized deployments. They are not a substitute for query-level and connection-level governance, because they don’t understand tenants or SQL.

Next steps you can actually do this week

  • Map tenant identity to database identity: one role/user per tenant or per tier. If you can’t do that, your incident response will stay expensive.
  • Put timeouts in place:
    • PostgreSQL: statement_timeout, lock_timeout, idle_in_transaction_session_timeout per tenant role.
    • MariaDB: per-user connection limits and execution controls where safe.
  • Implement pooling/backpressure: PgBouncer for Postgres; sane worker limits and pooling strategy for MariaDB stacks.
  • Turn on the right observability: pg_stat_statements or Performance Schema digest summaries, plus lock/session dashboards.
  • Write your “when to evict a tenant” policy: not as punishment, as engineering. If one tenant needs dedicated resources, give them a tier that matches reality.

If you want a single guiding principle: design so the worst tenant can’t take your entire business hostage. Databases are powerful. They’re not negotiators.
