Most “MySQL vs PostgreSQL” debates are cosplay: synthetic benchmarks, cargo-cult opinions, and one guy who read half a blog post in 2014. Meanwhile your website is timing out because a single query turned into a full table scan after a harmless-looking deploy. Customers don’t care which database “wins.” They care that checkout works and admin pages don’t take a coffee break.
This is the production-minded pick: which database will fail in which way, what you’ll see when it fails, and what you do at 2 a.m. when you’re on-call and your pager has opinions.
The thesis: pick the bottleneck you’d rather live with
If your site is a typical web application—CRUD tables, user sessions, orders, search-ish pages, a few background jobs—you can run either MySQL or PostgreSQL very successfully. The honest difference is not “which is faster.” It’s:
- Which failure mode is more likely in your org? (bad SQL? bad migrations? sloppy schemas? questionable index hygiene?)
- Which operational workflow fits your team? (replication, failover, backups, migrations, query analysis)
- Which bottleneck you’re willing to babysit? (bloat/vacuum, lock contention, replication lag, buffer pool pressure, connection storms)
My opinionated default for a new “website DB” in 2025: PostgreSQL, unless you have a concrete reason not to. Not “we’ve always used MySQL.” A reason like: “we already run MySQL at scale, we have battle-tested failover, we have people who can debug InnoDB, and our app doesn’t need the bits where Postgres shines.”
When do I push people toward MySQL? When the company already has MySQL operational muscle memory, when the workload is mostly simple reads/writes with predictable access patterns, and when you want fewer sharp edges around connection counts (not because MySQL is magically immune, but because Postgres’s per-connection overhead punishes sloppy pooling faster).
When do I push people toward Postgres? When data correctness and sane semantics matter, when you’re doing more than “SELECT by primary key,” when you anticipate analytics-ish queries, and when you want a planner and index ecosystem that rewards good modeling.
Joke #1: Picking a database by popularity is like picking a parachute by color—great right up until you need it.
Interesting facts and history (because it explains the sharp edges)
These are not trivia for trivia’s sake. They explain why certain defaults and behaviors exist.
- PostgreSQL grew out of Postgres (1980s) at UC Berkeley; it’s been shaped by academic ideas like MVCC and extensibility from day one.
- MySQL’s early success was “fast and simple” for web workloads—especially when you didn’t need transactions. That legacy still shows up in its ecosystem and usage patterns.
- InnoDB became the default storage engine (for MySQL) because the web demanded transactions and crash recovery; modern MySQL is essentially “InnoDB with SQL on top.”
- Postgres implements MVCC without InnoDB-style undo logs; instead, old row versions stay in the table until vacuum cleans them up. That’s the root of bloat and vacuum tuning.
- MySQL replication historically prioritized simplicity (binlog-based) and broad compatibility; this is why it’s everywhere, including in “we built it in 2012 and it still runs” stacks.
- Postgres extensions are a first-class culture (think: custom index types, full-text search, procedural languages). It’s not just “a database,” it’s a platform.
- JSON support diverged philosophically: Postgres’s JSONB is deeply indexable; MySQL’s JSON is usable and improving but tends to lead people into semi-structured schema drift sooner.
- Postgres’s query planner is famously opinionated and sometimes wrong in surprising ways; the fix is usually statistics and indexes, not prayer.
- MySQL’s ecosystem has multiple major branches and vendors (community builds, commercial offerings, forks). Great for options; also great for “what exact behavior are we on?” confusion.
What “website DB” actually means (and what it doesn’t)
Most websites are not pure OLTP. They’re a messy combo:
- Hot-path requests: login, session checks, product pages, cart operations. Latency sensitive.
- Background jobs: sending emails, indexing, periodic syncs. Throughput sensitive.
- Admin/reporting: “show me orders by status” with ten filters and sorting. Planner sensitive.
- Search and feeds: pagination, ordering by time, “recommended.” Index sensitive.
- Random spikes: marketing campaigns, crawlers, cron gone wild. Connection sensitive.
Both databases can do this. The trick is that they fail differently:
- MySQL tends to punish you with lock contention, replication surprises, and “it was fine until it wasn’t” buffer pool or I/O saturation.
- Postgres tends to punish you with connection storms, vacuum debt, bloat, and queries that go from 50ms to 50s when stats drift.
You don’t pick a database; you pick a set of operational chores. Choose chores your team will actually do.
Real bottlenecks: where each one hurts
1) Locking and concurrency: “why is everything waiting?”
Postgres uses MVCC and row-level locks. Reads typically don’t block writes, and writes don’t block reads, until you hit explicit locks, foreign key checks, long transactions, or schema changes. When it goes wrong, it’s often one long transaction that either holds back vacuum or holds a lock that a line of blocked queries piles up behind.
MySQL/InnoDB also uses MVCC, but its locking behavior (gap locks, next-key locks) can surprise you under certain isolation levels and access patterns. The common failure mode: a query that “should touch one row” takes locks over a range due to an index choice, and suddenly you’ve invented a traffic jam.
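One common mitigation on the MySQL side, assuming you’ve verified the application doesn’t depend on REPEATABLE READ semantics (variable names are the MySQL 8.0 ones), is dropping to READ COMMITTED, which takes far fewer gap locks:
-- per-session first, so you can test the behavior change:
SET SESSION transaction_isolation = 'READ-COMMITTED';
-- then globally for new connections once you're confident:
SET GLOBAL transaction_isolation = 'READ-COMMITTED';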
2) Query planning and indexing: “why did this query get dumb?”
Postgres shines when you have complex queries, multiple joins, partial indexes, and functional indexes. But it demands good statistics and disciplined schema evolution. When autovacuum/analyze falls behind, plans rot. You see it as sudden sequential scans or bad join orders.
MySQL can be extremely fast on simpler access paths and predictable indexing. But the optimizer historically had more “gotchas” around derived tables, subqueries, and complex join ordering. In practice: teams end up rewriting queries or denormalizing earlier.
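To make that concrete, here’s the kind of targeted index each side rewards; the column names match examples later in this article, 'open' is an illustrative status value, and the exact shape should come from your own top queries, not from this sketch:
-- Postgres: a partial index for the hot subset a page actually filters on
CREATE INDEX CONCURRENTLY idx_orders_open_created
    ON orders (created_at DESC)
    WHERE status = 'open';
-- MySQL: a composite index covering the equivalent access path
ALTER TABLE orders ADD INDEX idx_orders_status_created (status, created_at);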
3) Replication and failover: “the app says committed, but the read replica disagrees”
Both have mature replication. Both can hurt you.
- MySQL replication is ubiquitous and well-understood. The classic pain is replication lag under write bursts, plus the operational complexity of multi-source setups or topology changes you didn’t plan for.
- Postgres streaming replication is robust, but you must be explicit about synchronous vs asynchronous trade-offs. The pain is often “we assumed replicas are readable for everything,” followed by discovering hot standby conflicts or lag under long-running queries.
4) Storage engine behavior: buffer cache, I/O, and write amplification
MySQL/InnoDB wants a big, well-sized buffer pool. When it’s too small or your working set grows, you thrash. You’ll see read I/O climb, latency spikes, and a server that looks “fine” on CPU but is dying on storage.
Postgres depends heavily on the OS page cache plus shared_buffers. It can perform beautifully, but it is more sensitive to table and index bloat. Bloat means you’re reading pages full of dead tuples and stale index entries. Your storage subsystem pays for your sins.
5) Maintenance: vacuum vs purge, and what “boring” really means
Postgres requires vacuum. Autovacuum is good, but “good” is not “set and forget.” If you do frequent updates/deletes and you don’t tune vacuum, you will eventually hit a wall: bloated tables, slow queries, and possibly transaction ID wraparound emergencies.
MySQL does purge internally (undo logs) and doesn’t have a vacuum equivalent in the same way. That’s fewer knobs, but not a free lunch: you still manage indexes, fragmentation, and you still pay for poor schema decisions. Also, large ALTER TABLE operations and online DDL behavior can become their own special kind of weekend.
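On the MySQL side, the least painful habit is to tell the server which algorithm and lock level you expect, so a migration fails fast instead of silently rewriting or locking the table; a hedged sketch (the column name is illustrative, and INSTANT covers only simple additions on 8.0):
-- metadata-only change where supported:
ALTER TABLE orders ADD COLUMN gift_note VARCHAR(255), ALGORITHM=INSTANT;
-- online fallback for operations INSTANT can't do:
ALTER TABLE orders ADD INDEX idx_orders_user (user_id), ALGORITHM=INPLACE, LOCK=NONE;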
6) Connection handling: Postgres punishes sloppy pooling
Postgres uses a process per connection; too many connections means memory overhead, context switching, and pain. You almost always want a pooler (like pgbouncer) for web apps with bursty traffic.
MySQL’s thread model and typical defaults can tolerate more connections before faceplanting, but “tolerate” is not “safe.” If you let the app open thousands of connections because “it worked in staging,” you will eventually be staging in production.
7) Features that actually affect website architecture
- Postgres: strong extensions, rich index types, better constraints, better window functions and advanced SQL, excellent JSONB indexing, generally stricter semantics.
- MySQL: excellent ecosystem, wide hosting support, common operational patterns, very strong performance for many straightforward OLTP patterns.
Fast diagnosis playbook (first/second/third)
If your website is slow or timing out, do not start by changing random config. Start by determining which bottleneck class you’re in. Here’s the quickest triage that works in real incidents.
First: is the database waiting on CPU, I/O, or locks?
- CPU pinned: look for expensive queries (bad plans), missing indexes, runaway reports, or hash/aggregation spills.
- I/O pinned: look for buffer/cache misses, full scans, bloat, or storage latency regression.
- Locking: look for blocked queries, long transactions, DDL, or a hot row/table.
- Connections: if connection count spikes, everything else can look “fine” while the app melts.
Second: identify the top 1–3 queries by total time, not just slowest
The slowest query is often a one-off. The bottleneck is usually “this 20ms query runs 10,000 times per minute,” or “this 200ms query now runs 500 concurrent copies.”
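In Postgres, assuming the pg_stat_statements extension is installed and enabled, the “total time, not slowest” view is one query away (the column is total_exec_time on PostgreSQL 13+, total_time on older versions):
cr0x@server:~$ psql -X -c "SELECT calls, round(total_exec_time::numeric) AS total_ms, round(mean_exec_time::numeric,1) AS mean_ms, left(query,80) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"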
Third: validate the plan and the index path
In Postgres: EXPLAIN (ANALYZE, BUFFERS). In MySQL: EXPLAIN plus performance_schema and handler metrics. You’re looking for: unexpected scans, bad join order, temp tables, filesorts, or huge buffer reads.
Fourth: check replication and the app’s read/write assumptions
A lot of “db is slow” tickets are actually “replica lag + read-after-write assumption.” This is not philosophical. It is literally a user clicking “save” and then seeing old data.
Fifth: stop the bleeding, then fix the root cause
Stop-the-bleeding is often: kill the worst query, disable a report, add a missing index, or route traffic away from a sick replica. Then you fix schema and code with calm hands.
Quote (paraphrased idea), John Allspaw: Reliability comes from how you respond to failure, not from pretending failure won’t happen.
Hands-on tasks: commands, outputs, and the decision you make
Below are practical tasks I actually run during incidents or performance work. Each includes a command, a plausible output snippet, what it means, and what decision you make next.
Task 1 (Postgres): see who is running what, and who is waiting
cr0x@server:~$ psql -X -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS age, left(query,80) AS query FROM pg_stat_activity WHERE datname='appdb' ORDER BY age DESC LIMIT 10;"
pid | usename | state | wait_event_type | wait_event | age | query
------+--------+--------+-----------------+---------------+---------+--------------------------------------------------------------------------------
8123 | app | active | Lock | transactionid | 00:05:12| UPDATE orders SET status='paid' WHERE id=$1
7991 | app | active | | | 00:01:44| SELECT * FROM orders WHERE user_id=$1 ORDER BY created_at DESC LIMIT 50
7902 | app | idle | Client | ClientRead | 00:00:33|
What it means: PID 8123 is waiting on a transactionid lock for 5 minutes. That’s usually a long transaction elsewhere holding something open, or a hot-row contention pattern.
Decision: Find the blocker (next task), then decide whether to kill it, fix the app logic, or add an index/adjust isolation patterns.
Task 2 (Postgres): find the blocking query
cr0x@server:~$ psql -X -c "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, now()-blocker.query_start AS blocker_age, left(blocker.query,120) AS blocker_query FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid=blocked.pid JOIN pg_locks kl ON kl.locktype=bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid AND kl.classid IS NOT DISTINCT FROM bl.classid AND kl.objid IS NOT DISTINCT FROM bl.objid AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid AND kl.pid != bl.pid JOIN pg_stat_activity blocker ON blocker.pid=kl.pid WHERE NOT bl.granted;"
blocked_pid | blocker_pid | blocker_age | blocker_query
-------------+-------------+-------------+-------------------------------------------------------------
8123 | 7701 | 00:12:09 | BEGIN; UPDATE users SET last_seen=now() WHERE id=$1; -- no COMMIT yet
What it means: A transaction has been open for 12 minutes and is blocking others. It’s probably an app bug, a stuck worker, or a connection held open across a network call.
Decision: Kill blocker PID 7701 if it’s safe, then fix code to avoid long transactions; add timeouts; audit transaction scopes.
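For reference, a hedged version of that cleanup; on Postgres 9.6+ you can also shortcut the big join above with pg_blocking_pids, and the timeout values are starting points, not gospel (database-level settings apply to new sessions):
cr0x@server:~$ psql -X -c "SELECT pg_blocking_pids(8123);"   # who is blocking PID 8123
cr0x@server:~$ psql -X -c "SELECT pg_terminate_backend(7701);"   # kill the blocker, if safe
cr0x@server:~$ psql -X -c "ALTER DATABASE appdb SET idle_in_transaction_session_timeout = '60s';"
cr0x@server:~$ psql -X -c "ALTER DATABASE appdb SET lock_timeout = '5s';"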
Task 3 (Postgres): inspect vacuum health and bloat risk
cr0x@server:~$ psql -X -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_live_tup | n_dead_tup | last_autovacuum | last_autoanalyze
-------------+------------+------------+------------------------+------------------------
events | 24000000 | 9800000 | 2025-12-29 01:12:43+00 | 2025-12-29 00:55:21+00
sessions | 3100000 | 1200000 | 2025-12-28 22:41:10+00 | 2025-12-28 22:40:58+00
orders | 900000 | 12000 | 2025-12-29 01:10:03+00 | 2025-12-29 01:10:02+00
What it means: The events table has a large dead tuple count. If queries on that table are getting slower over time, bloat is a prime suspect.
Decision: Tune autovacuum for that table, consider partitioning, reduce update churn, and check whether indexes are bloated too.
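A minimal sketch of per-table autovacuum tuning for a high-churn table like events; the thresholds are illustrative starting points, and the proof is whether n_dead_tup stops climbing afterward:
cr0x@server:~$ psql -X -c "ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_analyze_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 2);"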
Task 4 (Postgres): confirm a bad plan with EXPLAIN ANALYZE
cr0x@server:~$ psql -X -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id=42 ORDER BY created_at DESC LIMIT 50;"
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=34210.48..34210.61 rows=50 width=512) (actual time=1832.504..1832.573 rows=50 loops=1)
   Buffers: shared hit=102 read=9180
   ->  Sort  (cost=34210.48..34215.48 rows=1999 width=512) (actual time=1832.501..1832.540 rows=50 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 85kB
         ->  Seq Scan on orders  (cost=0.00..34144.21 rows=1999 width=512) (actual time=0.041..1829.721 rows=1180 loops=1)
               Filter: (user_id = 42)
               Rows Removed by Filter: 898820
Planning Time: 0.412 ms
Execution Time: 1832.711 ms
What it means: The whole orders table is scanned and sorted just to return 50 rows, and the server read ~9k buffers from disk to do it. This screams “missing index” (likely (user_id, created_at)).
Decision: Add an index, then re-check. If the index exists, check statistics and whether the query matches the index order.
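The fix for this particular plan looks something like the following; CONCURRENTLY avoids blocking writes, at the cost of running longer and not working inside a transaction block:
cr0x@server:~$ psql -X -c "CREATE INDEX CONCURRENTLY idx_orders_user_created ON orders (user_id, created_at DESC);"
cr0x@server:~$ psql -X -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id=42 ORDER BY created_at DESC LIMIT 50;"   # expect an index scan now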
Task 5 (Postgres): check index usage and find unused indexes
cr0x@server:~$ psql -X -c "SELECT relname AS table, indexrelname AS index, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_stat_user_indexes ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC LIMIT 10;"
table | index | idx_scan | size
---------+--------------------------+----------+---------
events | events_payload_gin | 0 | 312 MB
orders | orders_status_created_at | 2 | 148 MB
users | users_email_key | 9012 | 42 MB
What it means: Big indexes with near-zero scans can be bloat magnets and write amplifiers. But do not drop them blindly—some are for rare admin queries or constraints.
Decision: Confirm with query logs and code search; if truly unused, remove to speed writes and vacuum.
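If an index survives the code search and query-log check, confirm nothing constraint-shaped depends on it, then drop it without blocking writes; a sketch:
cr0x@server:~$ psql -X -c "SELECT conname FROM pg_constraint WHERE conindid = 'events_payload_gin'::regclass;"   # empty result = no constraint uses it
cr0x@server:~$ psql -X -c "DROP INDEX CONCURRENTLY events_payload_gin;"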
Task 6 (MySQL): see active sessions and whether they’re waiting on locks
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;"
Id User Host db Command Time State Info
221 app 10.0.1.12:55342 appdb Query 48 updating UPDATE orders SET status='paid' WHERE id=12345
238 app 10.0.1.15:49810 appdb Query 15 Sending data SELECT * FROM orders WHERE user_id=42 ORDER BY created_at DESC LIMIT 50
250 app 10.0.1.16:51011 appdb Sleep 300 NULL
What it means: You have lock waits and also sleeping connections. InnoDB row-lock waits don’t get a dedicated processlist state; an UPDATE stuck in “updating” with Time climbing toward innodb_lock_wait_timeout is the usual tell. A growing queue of those means contention, usually from an index pattern or a transaction held open.
Decision: Identify the blocker via InnoDB lock tables, then fix transaction scope and indexing.
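If the sys schema is available (it is by default on MySQL 5.7+), there’s a shortcut that shows both sides of the wait before you dig into the raw InnoDB tables:
cr0x@server:~$ mysql -e "SELECT waiting_pid, waiting_query, blocking_pid, blocking_query, wait_age FROM sys.innodb_lock_waits\G"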
Task 7 (MySQL): find the InnoDB lock wait chain
cr0x@server:~$ mysql -e "SELECT * FROM information_schema.INNODB_TRX\G"
*************************** 1. row ***************************
trx_id: 54100912
trx_state: LOCK WAIT
trx_started: 2025-12-29 01:20:01
trx_mysql_thread_id: 221
trx_query: UPDATE orders SET status='paid' WHERE id=12345
*************************** 2. row ***************************
trx_id: 54100901
trx_state: RUNNING
trx_started: 2025-12-29 01:05:43
trx_mysql_thread_id: 199
trx_query: UPDATE orders SET shipping_label=... WHERE id=12345
What it means: Thread 199 has been running a transaction since 01:05. That’s not normal for a web request. It’s probably an app job holding a transaction while doing something slow (API call, file upload, retry loop).
Decision: Kill the offending transaction if safe; fix application to commit sooner; add timeouts; ensure queries use the right index to reduce lock footprint.
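The stop-the-bleeding commands, hedged: KILL rolls back the victim’s open transaction, which for a large transaction can itself take a while, and the timeout value is a starting point, not a recommendation:
cr0x@server:~$ mysql -e "KILL 199;"   # terminates the connection and rolls back its open transaction
cr0x@server:~$ mysql -e "SET GLOBAL innodb_lock_wait_timeout = 10;"   # fail lock waits fast instead of queueing (default 50; applies to new sessions)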
Task 8 (MySQL): check replication lag fast
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G"
Seconds_Behind_Source: 87
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Last_SQL_Error:
What it means: The replica is ~87 seconds behind. Reads from it will be stale. “Waiting for dependent transaction to commit” can indicate write bursts or locking on the replica apply thread. (On older MySQL versions the command is SHOW SLAVE STATUS and the lag field is Seconds_Behind_Master.)
Decision: Route critical reads to primary temporarily; reduce write amplification; check for large transactions; tune replica parallelization if applicable.
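If the applier is under-parallelized, a hedged sketch for MySQL 8.0.26+ (older versions use the slave_* variable names, and changing the worker count requires briefly stopping the SQL thread):
cr0x@server:~$ mysql -e "STOP REPLICA SQL_THREAD; SET GLOBAL replica_parallel_workers = 4; SET GLOBAL replica_preserve_commit_order = ON; START REPLICA SQL_THREAD;"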
Task 9 (Postgres): check replication lag and replay state
cr0x@server:~$ psql -X -c "SELECT client_addr, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
client_addr | state | sync_state | write_lag | flush_lag | replay_lag
-------------+-----------+------------+-----------+-----------+------------
10.0.2.21 | streaming | async | 00:00:01 | 00:00:02 | 00:00:24
What it means: Replay lag is 24 seconds. If your app reads from replicas, you’ll see read-after-write anomalies unless you code for it.
Decision: Implement read-your-writes routing (stickiness) or switch to synchronous replication for the subset of traffic that needs it.
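If only a small subset of writes needs read-your-writes on replicas, Postgres lets you pay the synchronous cost per transaction instead of globally; a sketch, assuming synchronous_standby_names is configured on the primary:
BEGIN;
SET LOCAL synchronous_commit = remote_apply;  -- commit returns only after the standby has applied it
UPDATE orders SET status = 'paid' WHERE id = 12345;
COMMIT;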
Task 10 (Postgres): detect connection storms and pooler need
cr0x@server:~$ psql -X -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"
state | count
--------+-------
idle | 420
active | 85
What it means: Hundreds of idle connections means your app or pool is holding connections open. That is memory overhead and can become a fork-bomb when traffic spikes.
Decision: Put a real pooler in front, set sane max connections, and tune application pool sizes.
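A minimal pgbouncer sketch, assuming transaction pooling is compatible with your app (no session-level state such as session-scoped prepared statements, temp tables, or advisory locks held across transactions); names and sizes are illustrative:
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20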
Task 11 (Linux): confirm whether you are I/O bound
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 820.0 210.0 51200.0 18400.0 9.80 0.45 89.0
What it means: await ~10ms on NVMe is high, and %util near 90% says the device is rarely idle (on NVMe, %util alone overstates saturation, so weight await more). If latency-sensitive queries spike at the same time, you’re I/O bound.
Decision: Reduce reads (fix scans, bloat), increase memory/cache hit ratio, or move to faster storage. Also check if a backup, vacuum, or batch job is saturating I/O.
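To see which process is actually generating that I/O (backup, vacuum worker, a batch job), per-process disk stats are the fastest answer, assuming sysstat is installed:
cr0x@server:~$ pidstat -d 1 5   # kB_rd/s and kB_wr/s per process, five one-second samples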
Task 12 (Postgres): see cache hit ratio and buffer reads for pain spotting
cr0x@server:~$ psql -X -c "SELECT datname, blks_hit, blks_read, round(blks_hit*100.0/nullif(blks_hit+blks_read,0),2) AS hit_pct FROM pg_stat_database WHERE datname='appdb';"
datname | blks_hit | blks_read | hit_pct
---------+-----------+-----------+---------
appdb | 190224455 | 9221440 | 95.38
What it means: 95% hit rate is decent, but if it drops during peak or if one query causes huge reads, you’ll feel it. “Hit rate” isn’t a trophy; it’s a clue.
Decision: If reads climb, investigate bloat and missing indexes. If hit rate is low overall, revisit memory sizing and query patterns.
Task 13 (MySQL): check InnoDB buffer pool effectiveness
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
Variable_name Value
Innodb_buffer_pool_read_requests 184455912
Innodb_buffer_pool_reads 2219441
What it means: Buffer pool misses = Innodb_buffer_pool_reads. If misses climb fast during traffic, the working set doesn’t fit, or queries are scanning.
Decision: Add indexes, stop scans, and size the buffer pool appropriately. If the dataset outgrew RAM, plan for read replicas, sharding, or better caching patterns.
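If the working set genuinely outgrew the pool and the host has memory to spare, the buffer pool can be resized online on MySQL 5.7+; the size here is illustrative, and the resize runs in the background:
cr0x@server:~$ mysql -e "SET GLOBAL innodb_buffer_pool_size = 12 * 1024 * 1024 * 1024;"   # 12 GiB; track progress via Innodb_buffer_pool_resize_status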
Task 14 (Both): check disk space growth and the “oops we’re full” trajectory
cr0x@server:~$ df -h /var/lib
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 1.8T 1.6T 120G 94% /var/lib
What it means: 94% full is not “fine.” It’s an incident prequel. Databases behave badly when they can’t extend files, can’t checkpoint comfortably, or can’t write temp data.
Decision: Free space immediately (logs, old backups, temp files), then fix the growth driver: bloat, runaway tables, missing retention, or oversized indexes.
Task 15 (Postgres): check transaction age and wraparound risk indicators
cr0x@server:~$ psql -X -c "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;"
datname | xid_age
----------+---------
appdb | 145000000
template1 | 4300000
What it means: High XID age means vacuum isn’t freezing tuples fast enough. If it climbs toward dangerous thresholds, you will face emergency autovacuum and performance collapse.
Decision: Investigate why vacuum can’t keep up (long transactions, mis-tuned autovacuum, giant tables) and fix before it becomes existential.
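The usual culprits are a long-open transaction or a forgotten replication slot (with hot_standby_feedback) pinning the xmin horizon; a quick look at both:
cr0x@server:~$ psql -X -c "SELECT pid, datname, state, age(backend_xmin) AS xmin_age, now()-xact_start AS xact_age FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY xmin_age DESC LIMIT 5;"
cr0x@server:~$ psql -X -c "SELECT slot_name, active, age(xmin) AS slot_xmin_age FROM pg_replication_slots;"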
Task 16 (MySQL): capture top query digests (performance_schema)
cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1000000000000,2) AS total_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 3;"
DIGEST_TEXT COUNT_STAR total_s
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT ? 120000 980.12
UPDATE users SET last_seen = NOW() WHERE id = ? 900000 410.33
SELECT * FROM products WHERE status = ? ORDER BY updated_at DESC LIMIT ? 80000 155.20
What it means: The biggest total time consumer is the first query, even if it’s “fast-ish” per call. It’s running constantly.
Decision: Index it, cache it, or reduce calls. Total time wins over “slowest query” for most web incidents.
Three corporate mini-stories (anonymized, plausible, and painfully familiar)
Mini-story 1: the incident caused by a wrong assumption (read replicas are “basically the same”)
A mid-size SaaS company had a clean architecture: primary database for writes, a read replica for “everything else.” The application developers treated the replica as interchangeable. A request would write a row and then immediately read it back—from whichever connection the ORM handed out.
It worked for months. Traffic grew. Writes got bursty: marketing emails, background jobs, and a new feature that updated counters more often than anyone admitted. Replica lag crept from “milliseconds” to “sometimes seconds.” Nobody noticed because most pages were tolerant of stale data.
Then came the incident: users updated billing information, got a “Saved!” toast, refreshed the page, and saw the old address. Support tickets lit up. Engineers blamed caching, then CDN, then “the frontend.” The database graphs looked “fine” because CPU wasn’t pinned and latency was only mildly elevated.
The root cause was boring: read-after-write inconsistency due to laggy asynchronous replication. The wrong assumption was fatal only when the business process demanded correctness in the UI.
The fix wasn’t heroic. They implemented stickiness: after a write, that user’s reads pin to primary for a short window (or until a session token expires). They also made replication lag visible on dashboards and routed certain endpoints (billing, auth) to primary always. After that, the “db incident” became a design constraint, not a surprise.
Mini-story 2: the optimization that backfired (indexing everything, then paying for it forever)
An e-commerce platform decided to “get ahead of performance.” A senior developer added indexes for every conceivable filter combination in the admin UI: status, date range, country, payment provider, SKU, marketing campaign code. The schema looked like a hedgehog.
Read queries were fast. Everyone congratulated themselves. Then Black Friday happened. Writes slowed, not by a little. Checkout latencies climbed. CPU was fine. The database storage started thrashing. Replication lag jumped. The platform survived, but in “we can’t deploy for two days” mode.
In the postmortem, they found the obvious thing nobody wanted to say: every insert/update had to maintain a pile of rarely-used indexes. The write path was paying rent for admin fantasies. Worse, vacuum/maintenance (Postgres) or background page management (MySQL) had more work. The system became “fast reads, slow everything else,” and the business bottleneck was checkout.
The fix was disciplined ruthlessness. They kept a small set of high-value indexes tied to actual top queries and dropped the rest. For admin reporting, they moved heavy queries to a replica with looser SLOs and built summary tables for common reports. Performance improved, but more importantly: the database stopped being a fragile crystal.
Mini-story 3: the boring but correct practice that saved the day (restore drills)
A content-heavy site ran Postgres. Nothing fancy: a primary, one replica, nightly backups. The team was not famous for process. But one staff engineer insisted on a quarterly restore drill, like flossing for infrastructure.
They had a checklist: pull the latest base backup, restore it into an isolated environment, replay WAL to a point in time, run sanity queries, and verify that the application can boot against it. They did it when nobody was on fire, which made it feel like busywork. It was also the only reason they knew their backups were real.
Months later, a migration accidentally dropped a column used by a background job. The job retried, failed, retried, and wrote garbage data into a different table. The blast radius was not immediate; it was slow corruption, the worst kind. By the time they noticed, the bad data had replicated and been backed up multiple times.
They performed point-in-time recovery to a timestamp just before the migration, verified with the same drill queries, and restored service with minimal drama. The incident still happened, but it didn’t become a career event.
Joke #2: Backup plans are like gym memberships—owning one doesn’t change anything until you actually use it.
Common mistakes: symptom → root cause → fix
1) Symptom: “DB CPU is low, but requests time out”
Root cause: Locking or connection pool saturation. The database is mostly waiting, not working.
Fix: Identify blocking queries/transactions; shorten transaction scopes; add timeouts; cap concurrency; introduce a pooler (Postgres) or fix thread/connection limits (MySQL).
2) Symptom: “One query is slow only in production”
Root cause: Different data distribution, missing statistics, or different indexes. Production has skew and hot spots; staging has wishful thinking.
Fix: Capture production EXPLAIN/ANALYZE; update stats; add targeted indexes; consider partial indexes (Postgres) or composite covering indexes (MySQL).
3) Symptom: “Read replica shows old data”
Root cause: Asynchronous replication lag and application read-after-write assumptions.
Fix: Read-from-primary after writes (stickiness), or implement synchronous replication for critical operations, or design UI to tolerate eventual consistency explicitly.
4) Symptom: “Disk usage grows forever even though traffic is stable”
Root cause: Postgres bloat from update/delete churn; MySQL table/index growth from fragmentation or unbounded retention.
Fix: Postgres: tune autovacuum, reduce update churn, partition big churn tables, consider periodic REINDEX/VACUUM FULL with downtime plan. MySQL: archive/purge old data, rebuild tables carefully, check for oversized indexes.
5) Symptom: “Deploy caused a full outage, then recovered”
Root cause: Locking DDL or a migration that rewrote a huge table, saturating I/O and blocking queries.
Fix: Use online schema changes, split migrations, add columns without defaults first, backfill in batches, then enforce constraints.
6) Symptom: “Everything is slow when a report runs”
Root cause: Long-running queries consuming I/O and buffers, or holding locks, or causing replication conflicts (Postgres hot standby).
Fix: Route reporting to a replica, add statement timeouts, precompute aggregates, and index for that report or ban it from production hours.
7) Symptom: “Performance gets worse over days, then temporarily improves after maintenance”
Root cause: Postgres vacuum debt and bloat, or MySQL buffer pool churn due to changing working set and index growth.
Fix: Postgres: tune autovacuum and analyze; monitor dead tuples and freeze age. MySQL: check buffer pool sizing, missing indexes, and remove write-amplifying indexes.
8) Symptom: “High latency spikes during traffic bursts”
Root cause: Connection storms, queueing at the database, or disk I/O saturation.
Fix: Put hard limits on concurrency, ensure pooling, use backpressure, and measure queue depth/await on storage.
Checklists / step-by-step plan
Decision checklist: pick MySQL or PostgreSQL for a website
- If your team has deep MySQL operational maturity (replication, upgrades, index tuning, backup restores) and your workload is straightforward OLTP: choose MySQL and move on.
- If you expect complex queries, richer constraints, and you want sane SQL semantics with an extension ecosystem: choose Postgres.
- If you cannot commit to connection pooling discipline: MySQL will usually forgive you longer, but you’re still building on debt. Fix pooling either way.
- If you do heavy updates/deletes on large tables: Postgres needs vacuum love; MySQL needs index and purge discipline. Choose based on which maintenance pattern you can execute reliably.
- If you rely heavily on read replicas: design for eventual consistency from day one, regardless of database.
First week setup: production hygiene (both databases)
- Enable query visibility: slow query logs (MySQL) or pg_stat_statements (Postgres). If you can’t see queries, you’re debugging vibes.
- Set timeouts: statement timeouts, lock timeouts, and transaction timeouts appropriate to web requests.
- Establish backup + restore testing: one scripted restore path that someone can run under stress.
- Decide replication semantics: which endpoints may read from replicas; enforce it in code.
- Define a migration policy: no table rewrites during peak, split backfills, batch updates.
Performance workflow: when a page is slow
- Find the endpoint and correlate with top queries by total time.
- Capture EXPLAIN/ANALYZE (or MySQL EXPLAIN + digest stats).
- Check if the query is missing an index or using the wrong one.
- Validate row counts and selectivity (skew happens).
- Fix the access path (index or query rewrite) before touching server knobs.
- Only then consider configuration and hardware changes.
Schema change plan: safe migrations without drama
- Additive first: add new columns nullable, no defaults, no constraints initially.
- Backfill in batches: small transactions, sleep between batches, measure replication lag (a sketch follows this list).
- Dual-write if needed: write both old and new fields during rollout.
- Add constraints last: validate carefully, preferably online/low impact if supported.
- Remove old fields only after verification: and only after you know rollback paths.
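A sketch of one backfill batch in Postgres, assuming a new nullable column orders.total_cents_v2 being populated from an existing total_cents column (both names are illustrative); run it in a loop with a short sleep, watch replication lag between iterations, and stop once it keeps reporting UPDATE 0:
WITH batch AS (
    SELECT id FROM orders
    WHERE total_cents_v2 IS NULL
    ORDER BY id
    LIMIT 5000
    FOR UPDATE SKIP LOCKED   -- skip rows a live transaction holds; they get picked up on a later pass
)
UPDATE orders o
SET total_cents_v2 = o.total_cents
FROM batch
WHERE o.id = batch.id;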
FAQ
1) Which is faster for a typical website: MySQL or PostgreSQL?
Neither, by default. The fastest database is the one with the right indexes, sane queries, and no self-inflicted replication/locking problems. For simple OLTP, both are fast. For complex queries, Postgres often wins—until you neglect stats and vacuum.
2) Which one is “more reliable”?
Reliability is an operational property: backups, tested restores, monitoring, and safe migrations. Both can be run reliably. Postgres tends to be stricter and more consistent in behavior; MySQL tends to be more forgiving of sloppy usage patterns until it isn’t.
3) Do I need a connection pooler with PostgreSQL?
For web apps with bursty traffic: yes, practically always. Without a pooler, you’ll eventually hit connection storms, memory overhead, and poor tail latency. With a pooler, Postgres becomes much calmer under load.
4) Is Postgres vacuum a dealbreaker?
No, but it is a responsibility. If your data is mostly inserts with few updates/deletes, vacuum is easy. If you do high-churn updates on big tables, you must monitor and tune autovacuum or partition. If your team won’t do that, you’re betting against physics.
5) Are read replicas safe to use for “normal reads”?
Safe for non-critical reads, yes—if you accept eventual consistency. Unsafe for read-after-write workflows unless you implement stickiness or synchronous replication for those code paths.
6) Which handles JSON better for a website?
Postgres JSONB is typically the better tool when you need indexing and queries over JSON fields. MySQL JSON works, but teams often drift into storing too much semi-structured data without guardrails. If you need JSON as a first-class queryable structure, Postgres is the safer bet.
7) What’s the number one cause of “database is slow” incidents?
Bad access paths: missing or wrong indexes, plus queries that accidentally scan. The runner-up is lock contention from long transactions. Hardware is rarely the first culprit, though it’s often blamed first.
8) If we already run one database, should we switch?
Usually no. Switching databases is expensive and risky. Instead, fix the actual bottleneck: indexing, query patterns, replication design, backups, and maintenance. Switch only if a specific capability or operational failure mode is repeatedly killing you.
9) How do I choose for an e-commerce site specifically?
Pick the one your team can operate without heroics. E-commerce needs correctness (inventory, payments) and predictable latency. Postgres is a strong default if you can handle pooling and vacuum. MySQL is great if you already have mature MySQL ops and you keep transactions and indexes disciplined.
10) What should I monitor on day one?
Query latency distribution (p95/p99), top queries by total time, lock waits, replication lag, connection counts, disk space growth, and storage latency (await). These find bottlenecks before customers do.
Next steps you can do this week
- Pick the database based on bottlenecks you can manage, not ideology. If you’re unsure, default to Postgres for new builds.
- Instrument query visibility (digests, slow logs, pg_stat_statements). No visibility, no truth.
- Implement connection discipline: poolers for Postgres, sane pool sizing for both.
- Write down replication semantics: which endpoints can tolerate stale reads; enforce it in code.
- Schedule a restore drill and run it. If this feels uncomfortable, you’ve found your real risk.
- Build a “top queries” routine: weekly review of the top time consumers and the top lock-wait contributors.
If you do those, you’ll stop treating the database as a mysterious box and start treating it like what it is: a predictable machine that punishes sloppy thinking and rewards boring competence.