MySQL vs PostgreSQL: “It Suddenly Got Slow”—a 15-Minute Diagnosis Plan for Both

It’s 10:17 a.m. A product manager pings “checkout is slow.” Grafana looks like a crime scene. You open a DB console and everything feels sticky: queries that used to return in 20 ms now take seconds, connection counts climb, and your app starts timing out like it’s allergic to success.

In these moments, theology (“MySQL vs Postgres”) is useless. You need a disciplined, fast sequence that finds the bottleneck, avoids self-inflicted wounds, and produces a safe next action. This is that sequence—built from operating both engines in production, on real disks, with real humans asking why it’s slow.

Fast diagnosis playbook (15 minutes)

This is the order that most reliably finds the bottleneck without making things worse. It’s designed for the “suddenly got slow” incident where you need signal fast.

Minute 0–2: Confirm it’s the database (and not your own app lying)

  1. Compare app latency vs DB latency. If your app has per-query timing, use it. If not, run a trivial DB query from the app host (not your laptop) to eliminate network and DNS nonsense (a quick check is sketched below).
  2. Check connection pool metrics. A saturated pool looks like a slow DB. It’s often just “too many threads waiting for a connection.”

Decision: If trivial queries are fast and pool wait is high, your DB may be fine and your pool size / app concurrency is the fire.
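
A minimal version of that trivial-query check, run from the app host. This is a sketch: the host name db01, database shop, and user app are placeholders, and it assumes credentials are already configured (for example via ~/.my.cnf and ~/.pgpass) so the timing isn't polluted by password prompts.

cr0x@app-host:~$ time mysql -h db01 -u app -e "SELECT 1;"
cr0x@app-host:~$ time psql "host=db01 dbname=shop user=app" -c "SELECT 1;"

If these come back in a few milliseconds while app requests take seconds, the engine is probably not your bottleneck; go look at pool wait time and app concurrency first.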

Minute 2–5: Look for the obvious resource wall (CPU, memory, I/O)

  1. CPU pegged? Think: bad query plan, missing index, sudden report query, or a thundering herd.
  2. Memory pressure? Swapping makes databases perform like they’re running underwater.
  3. Disk latency? If storage is slow, everything is slow. Don’t debate SQL while your disks are melting.

Decision: If storage latency is high, you triage the storage path and reduce write pressure. If CPU is pegged, you identify the top queries and stop the worst offender.

Minute 5–9: Check contention (locks, waits, pool starvation)

  1. Locks: find blocking sessions and the statement holding the lock. Kill the blocker only when you understand what it is (schema change? long transaction? batch job?).
  2. Wait events: in MySQL, InnoDB lock waits and metadata locks; in PostgreSQL, lock waits, LWLocks, and I/O waits.

Decision: If you see a single blocker causing a pile-up, resolve that first. Contention multiplies pain.

Minute 9–13: Identify the top “cost” queries right now

  1. Find the query fingerprints consuming the most total time, not just the most executions (see the sketch after the Decision below).
  2. Check plan regression (index not used, stats stale, parameter change, different bind values).

Decision: If one query dominates, mitigate (rate-limit, kill, add index, toggle a feature flag) before doing anything “clever.”
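
A sketch of how to pull "top queries by total time" mid-incident. The Postgres query assumes pg_stat_statements is installed (column names shown are PostgreSQL 13+; older versions use total_time/mean_time). The MySQL query assumes Performance Schema statement digests are enabled, which is the default on 5.7 and later.

cr0x@server:~$ psql -c "SELECT queryid, calls, round(total_exec_time) AS total_ms, round(mean_exec_time::numeric,1) AS mean_ms, left(query,80) AS q FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
cr0x@server:~$ mysql -e "SELECT LEFT(DIGEST_TEXT,80) AS q, COUNT_STAR AS calls, ROUND(SUM_TIMER_WAIT/1e12,1) AS total_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5;"

One fingerprint dominating total time is your mitigation target; a flood of different fingerprints points back at concurrency or the host.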

Minute 13–15: Choose a safe mitigation and capture evidence

  • Mitigate: stop the bleeding (cancel/kill runaway sessions, reduce concurrency, temporarily disable expensive feature paths, add a targeted index if safe, increase pool timeouts to avoid stampedes).
  • Capture: snapshot of processlist / pg_stat_activity, waits, top queries, iostat, and replication status. You will need this after the adrenaline wears off.
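
A minimal capture sketch, assuming the engines from the examples in this article run on the host (drop whichever does not apply); the /tmp paths are just placeholders.

cr0x@server:~$ DIR=/tmp/incident-$(date +%F); mkdir -p "$DIR"
cr0x@server:~$ (date; uptime; iostat -xz 1 3) > "$DIR/host-$(date +%H%M).txt"
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" > "$DIR/mysql-processlist-$(date +%H%M).txt"
cr0x@server:~$ psql -c "SELECT * FROM pg_stat_activity;" > "$DIR/pg-activity-$(date +%H%M).txt"

Paste the highlights into the incident channel as you go; the files are for the postmortem.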

Paraphrased idea from Gene Kim: “Improving reliability is largely about improving feedback loops and shortening time to diagnose.”

The shared reality: slowness is usually one of five things

MySQL and PostgreSQL have different guts, but production slowness tends to rhyme. Most incidents collapse into one of these categories:

  1. Resource saturation: CPU, memory, disk I/O, network, or file descriptors.
  2. Contention: locks, latch contention, row hotspots, or too many concurrent writers.
  3. Bad plans: missing indexes, stale stats, parameter changes, plan cache quirks, or data distribution shifts.
  4. Write amplification: checkpoints, flush storms, autovacuum/vacuum debt, or redo/WAL pressure.
  5. External dependencies: storage latency spikes, noisy neighbors, failing RAID cache, virtualization host issues, or a backup job doing something “innocent.”

That’s why the playbook starts with the host and waits. You can’t SQL your way out of a storage controller throwing a tantrum.

Joke #1: The fastest way to make a database slow is to declare, “It’s probably fine,” and then start a full-table report query during peak traffic.

Interesting facts & history that matter in incidents

These aren’t trivia night facts. They’re the kind that explain why the engine behaves the way it does when you’re on call.

  1. PostgreSQL’s MVCC is older than many production runbooks. Its multi-version concurrency control means readers don’t block writers, but it creates dead tuples that vacuum must clean—or bloat and slowness follow.
  2. InnoDB wasn’t always “MySQL.” MySQL historically had multiple storage engines; InnoDB became the default because transactional correctness and crash recovery became non-negotiable in real systems.
  3. MySQL metadata locks can halt the world. Some DDL and even certain “innocent” statements take metadata locks, which can block other queries in surprising ways during schema changes.
  4. PostgreSQL treats “VACUUM” as a first-class operational concern. Autovacuum isn’t optional decoration; it’s how Postgres pays its MVCC debt and prevents transaction ID wraparound.
  5. Replication is part of performance. MySQL replication lag can turn “read scaling” into a data correctness incident, and Postgres replication lag can grow when WAL generation overwhelms shipping or replay.
  6. Both engines have gotten dramatically better at introspection. MySQL’s Performance Schema and Postgres’s statistics views (plus extensions like pg_stat_statements) turned “guessing” into “measuring,” if you actually enable them.
  7. Default settings reflect compromises, not your workload. Both systems ship with settings meant to boot on modest machines. Production needs explicit memory, checkpoint, and connection decisions.
  8. PostgreSQL’s query planner is cost-based and sensitive to statistics. When stats are stale or data distribution changes, plans can flip from “index scan” to “sequential scan” and you’ll swear someone sabotaged the database.
  9. MySQL’s history includes sharp edges around implicit behavior. Things like silent truncation (depending on SQL mode) and different isolation defaults historically caused “it worked in staging” stories—often performance and correctness combined.

Practical tasks: commands, outputs, and decisions (MySQL + PostgreSQL + Linux)

You want commands you can run during an incident, plus what the output means and what decision you make next. Here are the ones I reach for first.

Task 1: Check host load, CPU steal, and run queue

cr0x@server:~$ uptime
 10:21:33 up 187 days,  3:12,  2 users,  load average: 28.14, 26.90, 19.77

Meaning: A load average far above the number of CPU cores suggests contention (CPU, I/O, or a backed-up runnable queue). On virtualized/cloud hosts, high load with modest CPU usage often means I/O waits or throttling.

Decision: If load is huge, immediately check iowait and disk latency (Task 3). Don’t start “optimization” in SQL blind.

Task 2: Check memory pressure and swapping

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        55Gi       1.2Gi       512Mi       5.8Gi       2.3Gi
Swap:           8.0Gi       6.4Gi       1.6Gi

Meaning: Heavy swap usage on a database host is usually a performance death spiral. “Available” memory is low, and swapping implies the OS is evicting working sets.

Decision: Stop new load (rate-limit traffic), reduce DB memory spikes (e.g., too many connections), and consider restarting only if you can do it safely and you’ve found the leak. Also verify MySQL/Postgres memory settings vs total RAM.
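
A quick way to sanity-check memory settings against the RAM shown above; a sketch, assuming you have the usual client access on the host.

cr0x@server:~$ mysql -e "SELECT @@innodb_buffer_pool_size/1024/1024/1024 AS buffer_pool_gb, @@max_connections AS max_conn;"
cr0x@server:~$ psql -c "SHOW shared_buffers;"
cr0x@server:~$ psql -c "SHOW work_mem;"
cr0x@server:~$ psql -c "SHOW max_connections;"

If buffer pool plus shared_buffers plus (max_connections times realistic per-session memory) does not fit comfortably under physical RAM, swapping is not bad luck; it is arithmetic.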

Task 3: Measure disk latency and saturation

cr0x@server:~$ iostat -xz 1 3
Linux 6.2.0 (db01)  12/29/2025  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.41    0.00    6.23   38.77    0.00   42.59

Device            r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
nvme0n1         120.0  1800.0   4096  98304   24.8   12.4    98.9

Meaning: %util near 100% with high await indicates the device is saturated and requests are queuing. iowait is high too. This is storage-bound slowness.

Decision: Find what is writing (Task 4), reduce write pressure (pause batch jobs, limit autovacuum intensity carefully, slow ingestion), and check for checkpoint/flush storms in the DB.
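
Two quick checks for checkpoint and flush pressure; a sketch, assuming PostgreSQL 16 or earlier (newer versions move the checkpoint counters to pg_stat_checkpointer) and a reasonably current MySQL.

cr0x@server:~$ psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend FROM pg_stat_bgwriter;"
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';"

checkpoints_req climbing much faster than checkpoints_timed means checkpoints are being forced by WAL volume; a dirty-page ratio that stays high while writes stall points at InnoDB flushing falling behind.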

Task 4: Identify top disk writers (quick and dirty)

cr0x@server:~$ pidstat -d 1 5
Linux 6.2.0 (db01)  12/29/2025  _x86_64_  (16 CPU)

12:22:11      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
12:22:12      106      2143      0.00  84210.00      0.00  mysqld
12:22:12      106      1987      0.00   5200.00      0.00  postgres
12:22:12        0      8891      0.00   1200.00      0.00  rsync

Meaning: You can see which process is hammering disk writes. If rsync/backup is writing hard during peak, congratulations: you found your villain wearing a “maintenance” badge.

Decision: If non-DB processes are heavy, stop/reschedule them. If DB is the writer, check checkpointing, flushing, and large transactions.

Task 5: Confirm filesystem space and inode pressure

cr0x@server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  900G  842G   13G  99% /var/lib

Meaning: 99% full can tank performance (fragmentation, allocation delays) and risks outright failure. Many databases behave badly near full disks.

Decision: Free space now (rotate logs, move backups), then plan capacity. Also check for runaway binary logs/WAL.

Task 6: Check network errors and retransmits (often “DB slow” is packet loss)

cr0x@server:~$ ss -s
Total: 1892 (kernel 0)
TCP:   1583 (estab 1242, closed 251, orphaned 0, timewait 251)

Transport Total     IP        IPv6
RAW	  0         0         0
UDP	  17        13        4
TCP	  1332      1210      122
INET	  1349      1223      126
FRAG	  0         0         0

Meaning: High established connections can imply pool misbehavior or traffic spikes. Pair this with DB connection counts.

Decision: If connections exploded, hunt for a deploy that changed pooling or timeouts. If connections are stable but latency is high, focus on waits and I/O.

Task 7 (MySQL): See what threads are doing right now

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" | sed -n '1,80p'
*************************** 1. row ***************************
     Id: 93218
   User: app
   Host: 10.0.4.21:51344
     db: shop
Command: Query
   Time: 42
  State: Waiting for table metadata lock
   Info: ALTER TABLE orders ADD COLUMN promo_code varchar(32)
*************************** 2. row ***************************
     Id: 93244
   User: app
   Host: 10.0.3.19:51002
     db: shop
Command: Query
   Time: 38
  State: Sending data
   Info: SELECT * FROM orders WHERE user_id=... ORDER BY created_at DESC LIMIT 20

Meaning: “Waiting for table metadata lock” is a big red arrow pointing at DDL blocking normal queries. “Sending data” for long times can indicate full scans, huge result sets, or I/O waits.

Decision: If metadata lock is blocking, pause/kill the DDL session (carefully) or wait for it to finish if it’s near completion. Then schedule schema change properly (online DDL where possible).
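
If you decide the DDL has to go, kill it by the Id shown in the processlist above (93218 in this example). KILL QUERY stops just the running statement; plain KILL ends the whole connection. Keep in mind that aborting an ALTER can itself take time to roll back.

cr0x@server:~$ mysql -e "KILL QUERY 93218;"
cr0x@server:~$ mysql -e "KILL 93218;"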

Task 8 (MySQL): Identify InnoDB lock waits and blockers

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
------------
TRANSACTIONS
------------
Trx id counter 89423311
Purge done for trx's n:o < 89423000 undo n:o < 0 state: running
History list length 19876
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 89423288, ACTIVE 61 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 93244, OS thread handle 140228, query id 5129312 10.0.3.19 app Sending data
UPDATE inventory SET qty=qty-1 WHERE sku='A-1841'
------- TRX HAS BEEN WAITING 61 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 421 page no 183 n bits 72 index PRIMARY of table `shop`.`inventory` trx id 89423288 lock_mode X locks rec but not gap waiting

Meaning: InnoDB is telling you a transaction is waiting on a record lock. The “History list length” being large can also suggest purge lag, which can correlate with long transactions.

Decision: Find the blocking transaction (lower in output, or via performance_schema) and decide: kill blocker, kill waiter, or reduce concurrency on that hot key (SKU, user row, etc.). If history list length stays high, hunt long transactions.
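
A sketch for finding the blocker without reading the whole InnoDB status wall of text, assuming MySQL 5.7+ with the sys schema available (blocking_query shows NULL if the blocker is idle in its transaction):

cr0x@server:~$ mysql -e "SELECT waiting_pid, blocking_pid, wait_age, LEFT(waiting_query,60) AS waiting, LEFT(blocking_query,60) AS blocking FROM sys.innodb_lock_waits;"

The blocking_pid is what you feed to KILL, once you know what it actually is.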

Task 9 (MySQL): Check replication lag and SQL thread state

cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running|Last_SQL_Error|Relay_Master_Log_File|Exec_Master_Log_Pos"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 487
Relay_Master_Log_File: binlog.003214
Exec_Master_Log_Pos: 98122314

Meaning: The replica is ~8 minutes behind. Reads from replicas may be stale; failover may lose data if you promote a lagging replica. (On MySQL 8.0.22 and later, the current syntax is SHOW REPLICA STATUS, with Seconds_Behind_Source as the lag field.)

Decision: If the app uses replicas for reads, consider routing critical reads to primary temporarily. Investigate why SQL thread can’t keep up (heavy writes, single-thread apply, long transactions).

Task 10 (PostgreSQL): Check active sessions and who’s waiting

cr0x@server:~$ psql -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS runtime, left(query,120) AS q FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 10;"
  pid  | usename | state  | wait_event_type |  wait_event  |  runtime  |                          q
-------+---------+--------+-----------------+--------------+-----------+----------------------------------------------------------
 18421 | app     | active | Lock            | relation     | 00:01:12  | ALTER TABLE orders ADD COLUMN promo_code text
 18502 | app     | active | Lock            | tuple        | 00:00:55  | UPDATE inventory SET qty=qty-1 WHERE sku='A-1841'
 18611 | app     | active | IO              | DataFileRead | 00:00:49  | SELECT * FROM orders WHERE user_id=$1 ORDER BY created_at
(3 rows)

Meaning: Postgres tells you whether sessions are waiting on locks or I/O. “relation” lock waits often point at DDL or long transactions holding locks.

Decision: If lock waits dominate, find the blocker (Task 11). If IO waits dominate, go back to iostat and check cache hit and checkpoints.

Task 11 (PostgreSQL): Find blocking PIDs (the person holding the door shut)

cr0x@server:~$ psql -c "SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocker.pid AS blocker_pid, blocker.query AS blocker_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked ON blocked.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocker_locks ON blocker_locks.locktype = blocked_locks.locktype AND blocker_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocker_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocker_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocker_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocker_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocker_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocker_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocker_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocker_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocker_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocker ON blocker.pid = blocker_locks.pid WHERE NOT blocked_locks.granted;"
 blocked_pid |              blocked_query              | blocker_pid |              blocker_query
------------+-----------------------------------------+------------+-----------------------------------------
      18502 | UPDATE inventory SET qty=qty-1 ...      |      18111 | BEGIN; SELECT * FROM inventory WHERE ...
(1 row)

Meaning: You have a specific blocker PID. Often it’s a transaction left open by a job, migration, or “temporary” debug session.

Decision: If blocker is clearly accidental and not doing critical work, terminate it. If it’s a migration, coordinate and pick the least damaging option.
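
In Postgres you have two levels of force, both standard functions, shown here with the blocker PID from the example above (18111): pg_cancel_backend cancels the current statement, pg_terminate_backend ends the session and rolls back its open transaction.

cr0x@server:~$ psql -c "SELECT pg_cancel_backend(18111);    -- cancel the running statement only"
cr0x@server:~$ psql -c "SELECT pg_terminate_backend(18111); -- end the session and roll back its transaction"

Prefer cancel first; reach for terminate when the blocker is idle in transaction and merely holding locks.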

Task 12 (PostgreSQL): Check vacuum/autovacuum pressure and dead tuples

cr0x@server:~$ psql -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
   relname   | n_live_tup | n_dead_tup |    last_autovacuum     |      last_vacuum
-------------+------------+------------+------------------------+------------------------
 orders      |    5821934 |    2149921 | 2025-12-29 09:41:12+00 | 2025-12-27 02:10:04+00
 inventory   |     402113 |     210993 | 2025-12-29 10:02:01+00 |
(2 rows)

Meaning: A high dead-tuple count suggests bloat and/or vacuum not keeping up, which increases I/O and degrades index efficiency.

Decision: If dead tuples are high and autovacuum is lagging, tune autovacuum for the hot tables (per-table settings) and investigate long-running transactions preventing cleanup.
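
Two follow-ups worth sketching: find the transactions preventing cleanup, and give the hot table from the output above more aggressive autovacuum settings. The thresholds here are illustrative, not a recommendation.

cr0x@server:~$ psql -c "SELECT pid, state, now()-xact_start AS xact_age, left(query,60) AS q FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_age DESC LIMIT 5;"
cr0x@server:~$ psql -c "ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_vacuum_cost_delay = 2);"

On recent versions the ALTER for autovacuum storage parameters takes a SHARE UPDATE EXCLUSIVE lock, so it does not block normal reads and writes; still, verify for your version before running it on a hot table mid-incident.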

Task 13 (PostgreSQL): Check replication lag via LSN

cr0x@server:~$ psql -c "SELECT client_addr, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
 client_addr |   state   | write_lag | flush_lag | replay_lag
-------------+-----------+-----------+-----------+-----------
 10.0.2.15    | streaming | 00:00:01  | 00:00:03  | 00:00:07
(1 row)

Meaning: Lag is low. If replay_lag is high, your replica may be slow to apply WAL (CPU or I/O bound, or hot standby conflicts).

Decision: If lag is high and you’re using replicas for reads, consider routing reads, or reduce write volume. Investigate WAL generation spikes and replica resources.

Task 14 (MySQL): Verify buffer pool health and read pressure

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| Innodb_buffer_pool_reads| 982134221 |
+-------------------------+-----------+
+---------------------------------+------------+
| Variable_name                   | Value      |
+---------------------------------+------------+
| Innodb_buffer_pool_read_requests| 3812234910 |
+---------------------------------+------------+

Meaning: If buffer_pool_reads grows quickly relative to read_requests, you’re missing cache and hitting disk. Absolute numbers matter less than the rate during the incident.

Decision: If cache misses surge, find the query that changed access patterns, or consider increasing buffer pool (long-term). Short-term, reduce load or add the index that prevents wide scans.

Task 15 (PostgreSQL): Check cache hit ratio and temp file spill

cr0x@server:~$ psql -c "SELECT datname, blks_hit, blks_read, temp_files, temp_bytes FROM pg_stat_database WHERE datname = current_database();"
 datname | blks_hit  | blks_read | temp_files | temp_bytes
---------+-----------+-----------+------------+------------
 shop    | 991233221 | 11233211  |      12933 |  987654321
(1 row)

Meaning: temp_files/temp_bytes rising means queries are spilling to disk (sorts, hashes). That’s often a per-query memory issue (work_mem) or a bad plan causing big intermediate results.

Decision: Identify the spilling queries (pg_stat_statements helps) and fix them. Avoid blanket “raise work_mem” on busy systems—it multiplies per connection.
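
If pg_stat_statements is available, the spillers are easy to rank; a sketch (total_exec_time is the PostgreSQL 13+ column name, older versions call it total_time):

cr0x@server:~$ psql -c "SELECT queryid, calls, temp_blks_written, round(total_exec_time) AS total_ms, left(query,80) AS q FROM pg_stat_statements ORDER BY temp_blks_written DESC LIMIT 5;"

Anything with large temp_blks_written and a rising call rate is a better target than a global work_mem change.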

Task 16: Capture a quick profile of DB threads (Linux perspective)

cr0x@server:~$ top -b -n 1 | sed -n '1,25p'
top - 10:23:02 up 187 days,  3:14,  2 users,  load average: 27.92, 26.80, 20.01
Tasks: 612 total,   3 running, 609 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.1 us,  5.9 sy,  0.0 ni, 41.2 id, 38.1 wa,  0.0 hi,  0.7 si,  0.0 st
MiB Mem :  63488.0 total,   1320.2 free,  56502.4 used,   5665.4 buff/cache
MiB Swap:   8192.0 total,   6553.7 used,   1638.3 free.  2401.9 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2143 mysql      20   0 44.1g   36.2g   154m S  182.4  58.3  884:12.11 mysqld
 1987 postgres   20   0 15.2g    9.1g    88m S   42.1  14.7  311:09.22 postgres

Meaning: High iowait (wa) says storage is limiting. High mysqld CPU suggests a CPU-bound query storm. High RES memory with swap indicates you’re overcommitted.

Decision: If iowait dominates, stop writing. If CPU dominates, isolate top queries and kill or mitigate. If swap dominates, reduce concurrency and memory footprint.

MySQL-specific: InnoDB, metadata locks, replication, and the usual traps

What “suddenly slow” looks like in MySQL

MySQL incidents often present as one of these patterns:

  • Metadata lock pile-up: one DDL or LOCK TABLES blocks a busy table; everything queues behind it.
  • Single hot row/key contention: counters, inventory rows, “last seen” updates—classic “one row, many writers.”
  • Buffer pool misses: data set grew or query pattern changed, turning memory hits into disk reads.
  • InnoDB flushing/checkpoint pressure: dirty pages hit a threshold, flushing storms appear, writes stall.
  • Replication lag: replicas are behind; reads are stale; failover safety drops.

Two MySQL diagnosis priorities people skip (then regret)

1) Metadata locks. If you only check row locks, you’ll miss the “one ALTER TABLE to rule them all” event. The processlist state spells it out. When you see it, don’t argue with it.
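
You can also ask Performance Schema directly who holds or waits for metadata locks; a sketch, assuming MySQL 8.0 where the MDL instrument is enabled by default (on 5.7 you may need to enable wait/lock/metadata/sql/mdl first).

cr0x@server:~$ mysql -e "SELECT object_type, object_schema, object_name, lock_type, lock_status, owner_thread_id FROM performance_schema.metadata_locks WHERE object_schema NOT IN ('mysql','performance_schema');"

PENDING rows are the queue; the GRANTED row on the same object is the session you need to deal with.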

2) Connection storms. MySQL is not thrilled when you spawn massive connection counts with high per-thread memory. “We’ll just increase max_connections” is how you turn a slow incident into an OOM incident.
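
A quick sanity check for a connection storm, comparing live threads against the configured ceiling:

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Threads_running'; SELECT @@max_connections AS max_connections;"

Threads_running in the hundreds is the real alarm; Threads_connected near max_connections plus large per-thread buffers is exactly how a slow incident becomes an OOM incident.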

MySQL mitigation moves that are usually safe

  • Kill clearly runaway SELECTs that are chewing CPU and I/O, especially ad hoc queries.
  • Rate-limit the offender at the app layer (feature flag, request shaping) instead of thrashing the DB.
  • Stop schema changes mid-day unless you’re using the right online approach and have rehearsed it.
  • Move heavy reads away (if replicas are healthy and consistent enough for the use case).

When “it’s just slow reads” is actually a write problem

On InnoDB, heavy writes can slow reads via I/O saturation and buffer pool churn. You’ll see it as “reads are slow,” but iostat shows the truth. Treat the disk as the shared resource it is.

PostgreSQL-specific: vacuum, bloat, WAL, and the usual traps

What “suddenly slow” looks like in PostgreSQL

Postgres slowness tends to cluster around these:

  • Lock waits due to long transactions: someone held an open transaction while doing… whatever, and now vacuum and DDL can’t progress.
  • Autovacuum debt and bloat: queries get slower over days/weeks, then tip into “sudden” when cache and I/O can’t hide it.
  • Checkpoint/WAL pressure: write bursts cause WAL spikes; checkpoints get expensive; I/O latency rises.
  • Temp file spills: a query’s sort/hash spills to disk and drags the system into I/O waits.
  • Plan changes: stats, parameter values, or data distribution changes lead to sequential scans where you expected index scans.

Two Postgres diagnosis priorities people skip (then regret)

1) Long-running transactions. A single idle-in-transaction session can prevent vacuum from reclaiming dead tuples and can hold locks. It’s the operational equivalent of leaving a car parked sideways in a narrow alley.
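
A sketch for spotting those sessions, plus the guardrail that prevents the next one. The timeout value is an example, not a recommendation, and the ALTER SYSTEM is something to apply deliberately, not in a panic.

cr0x@server:~$ psql -c "SELECT pid, usename, now()-xact_start AS xact_age, left(query,60) AS last_query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_age DESC;"
cr0x@server:~$ psql -c "ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';"
cr0x@server:~$ psql -c "SELECT pg_reload_conf();"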

2) Temp file activity. If temp_bytes climbs fast, you’re spilling to disk and you will feel it. Fix the query first; “raise work_mem” is sometimes correct but often reckless on high concurrency.

Postgres mitigation moves that are usually safe

  • Terminate the accidental blocker (idle in transaction, ad hoc session) after verifying it’s not a critical batch.
  • Cancel runaway queries rather than killing the whole database process.
  • Reduce concurrency at the pool to stop the herd. Postgres prefers fewer, well-behaved connections.
  • Target vacuum fixes per table rather than global “vacuum harder” settings that can thrash disks.

Joke #2: Autovacuum is like taking out the trash: skipping it doesn’t save time, it just changes the smell of your incident.

Three corporate-world mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

They had a clean migration process: blue/green deploys, pre-flight checks, and a “DB migrations are safe” mantra that got repeated so often it became physics. A team pushed an application release with a schema change that added a nullable column. “It’s fine,” they said. “It’s online.”

On MySQL, it was not fine. The ALTER took a metadata lock, and the table was hot. The lock didn’t block everything instantly; it blocked at the worst possible time: as other sessions queued behind it, they held connections longer, the pool saturated, and the app started timing out. The SRE on call saw “DB slow” in the dashboard, then saw CPU wasn’t pegged. They almost went down a rabbit hole of query tuning.

The turning point was running SHOW FULL PROCESSLIST and seeing “Waiting for table metadata lock” repeated like a horror movie jump scare. The “wrong assumption” wasn’t that online DDL exists. It was assuming “online” means “no operational impact.” Online DDL still contends for resources and can still block under load, depending on the exact operation and version.

The fix was boring: kill the migration session, roll back the deploy, and schedule the change using a rehearsed procedure with explicit maintenance windows and load checks. After the incident they added a guardrail: schema changes had to be proven on production-like load, and migrations were rate-limited behind feature flags. Not glamorous. It worked.

Mini-story 2: The optimization that backfired

A Postgres cluster had a classic performance problem: too many connections. The database handled it for a while, until it didn’t. Someone proposed the obvious: “Let’s raise max_connections. Hardware is cheap.” It got approved because it sounded like capacity planning.

They raised it. And the next peak period, latency got worse. Not slightly worse. “Customers can’t log in” worse. The box started swapping. The monitoring looked absurd: lots of sessions doing tiny things slowly, as if the database had collectively decided to be passive-aggressive.

Postgres allocates per-connection memory and incurs overhead per backend. Raising max_connections without changing the pool behavior turned the DB into a context-switch festival. Worse: with more concurrent queries, the chance of temp spills increased, because work_mem is budgeted per sort or hash operation per backend, so aggregate memory demand outran RAM and disk I/O rose.

The eventual resolution wasn’t a clever GUC tweak. It was implementing a proper connection pooler, lowering the number of active DB sessions, and aligning work_mem with realistic concurrency. The lesson was sharp: increasing a limit is not an optimization; it’s permission for your system to fail in new and more expensive ways.

Mini-story 3: The boring but correct practice that saved the day

A company ran both MySQL and PostgreSQL across different products. The unifying constraint: the storage layer was shared, and “storage incidents” were the silent killer. One team had been nagging for months to add a simple practice: capture periodic snapshots of key DB and OS metrics during normal operation and store them alongside deploy markers. Everyone nodded politely. Nobody wanted to pay for “boring.”

Eventually they did it anyway, because an SRE got tired of debugging blind. Every five minutes: iostat summaries, replication lag, buffer/cache stats, top waits, top query fingerprints. No raw URLs, no fancy tooling—just consistent snapshots that could be compared across time. They also recorded when backup jobs ran and when compaction happened on the storage array.

Then the “sudden slowness” day arrived. Latency spiked across multiple services. The first question was predictable: “Which database is broken?” The snapshots answered faster than opinions. Disk await jumped at the same minute across all DB hosts. Replication lag rose after, not before. CPU stayed mostly normal. The system wasn’t “slow SQL.” It was “slow storage.”

Because they had baselines, they escalated to the storage team with evidence: device await, queue depth, and the correlation across hosts. The storage team found a misbehaving maintenance task on the array and stopped it. The incident ended without random query killing and without one team blaming another. Boring practice. Correct practice. It paid for itself in one afternoon.

Checklists / step-by-step plan

Checklist A: The first five minutes (use this every time)

  1. From the DB host: run uptime, free -h, iostat -xz 1 3.
  2. Decide: CPU-bound vs I/O-bound vs memory/swap-bound. If you can’t classify it, you’re not looking at the right layer.
  3. Check connections: DB connection counts and app pool wait.
  4. Check waits: MySQL processlist + InnoDB status; Postgres pg_stat_activity wait_event.
  5. Capture evidence: paste outputs into the incident channel. Future you will be grateful.

Checklist B: If it’s I/O-bound

  1. Confirm disk saturation: iostat await, aqu-sz, %util.
  2. Identify writers: pidstat -d; check for backup/log rotation/rsync.
  3. DB-side suspects:
    • MySQL: flushing/checkpoint pressure, large transactions, binlog spikes.
    • Postgres: checkpoints, WAL spikes, vacuum activity, temp spills.
  4. Mitigate: stop batch writers, reduce concurrency, delay non-critical jobs.
  5. Post-incident: evaluate storage QoS, separate volumes, and baseline disk latency.

Checklist C: If it’s CPU-bound

  1. Find top queries by time:
    • MySQL: performance schema / processlist sampling.
    • Postgres: pg_stat_statements if enabled; otherwise pg_stat_activity + logs.
  2. Validate plan: is it scanning? sorting? joining too much?
  3. Mitigate: kill runaway queries, add the missing index if safe, reduce concurrency, apply feature flags.
  4. After: add a regression test for the query plan and track query fingerprints over deploys.

Checklist D: If it’s lock/contention-bound

  1. Find blockers: InnoDB status / performance_schema in MySQL; pg_locks join in Postgres.
  2. Classify the blocker: DDL, long transaction, hot row update, or batch job.
  3. Mitigate: terminate the blocker if safe, or pause the feature generating contention.
  4. After: redesign hotspots (shard counters, use queues, avoid single-row updates in high concurrency).

Common mistakes: symptom → root cause → fix

1) “DB is slow” but trivial queries are fast

Symptom: App requests time out; DB “looks okay.”

Root cause: Connection pool saturation or thread exhaustion in the app; queries are queued before they reach the DB.

Fix: Check pool wait time and max connections; reduce app concurrency; fix leaks (not returning connections); add backpressure.

2) Many MySQL sessions stuck on “Waiting for table metadata lock”

Symptom: Suddenly everything involving one table hangs; CPU may look normal.

Root cause: DDL or a session holding a metadata lock blocks other statements.

Fix: Identify the DDL session and stop/kill it; reschedule schema changes; use online schema change techniques appropriate to your MySQL version and operation.

3) Postgres queries slow down over time, then “suddenly” fall off a cliff

Symptom: Growing latency, increased disk reads, larger tables/indexes.

Root cause: Autovacuum can’t keep up; bloat grows; dead tuples remain due to long transactions or insufficient vacuum settings.

Fix: Find and eliminate long-running transactions; tune autovacuum per hot table; schedule manual vacuum when needed; ensure maintenance_work_mem and I/O capacity align.

4) Replicas lag and the app gets “slow reads”

Symptom: Read endpoints are slow or inconsistent; failover feels risky.

Root cause: Replication apply can’t keep up due to write spikes, I/O bottlenecks, or single-thread bottlenecks (engine/setting dependent).

Fix: Reduce write volume, optimize transactions, ensure replicas have equal or better I/O, and verify replication settings. Route critical reads to primary temporarily.

5) “We added an index and it got worse”

Symptom: Writes slow down, replication lag increases, disk fills faster.

Root cause: Index increased write amplification; maintenance overhead rose; queries didn’t actually use it, or it’s the wrong index.

Fix: Validate usage with query plans and stats; drop ineffective indexes; prefer narrow, selective indexes; consider partial indexes (Postgres) or composite indexes designed for the exact predicates.

6) Temp spills or sort storms in Postgres

Symptom: temp_bytes spikes; disk latency rises; sorts become slow.

Root cause: Bad plan producing huge intermediate sets; insufficient work_mem for that query; too many concurrent sorts.

Fix: Fix the query (add index, rewrite join/order), use LIMIT earlier, and tune work_mem carefully with concurrency in mind.

7) Swapping begins and the DB becomes unpredictable

Symptom: Latency variance explodes; CPU idle may look high; iowait rises.

Root cause: Too many connections, too-large per-session memory, or OS cache pressure; sometimes a colocation/noisy neighbor problem.

Fix: Reduce concurrency; enforce connection pooling; right-size memory settings; investigate host-level contention.

FAQ

1) Which is easier to diagnose under pressure: MySQL or PostgreSQL?

Postgres is generally more explicit about waits and blockers via system views. MySQL can be equally diagnosable if Performance Schema is enabled and you know where to look, but many shops leave it underused.

2) What’s the single fastest signal for “storage is the problem”?

iostat -xz showing high await and high %util, plus high CPU iowait. If that’s true, debates about indexes are premature.

3) Is killing queries safe?

Safer than letting everything time out. But do it deliberately: identify the session, confirm it’s not a critical migration/cleanup, and prefer canceling a statement (Postgres) over killing the whole process unless necessary.

4) Why do “idle in transaction” sessions matter so much in Postgres?

They can prevent vacuum from removing dead tuples and can hold locks longer than you think. They also inflate transaction horizons, which turns routine maintenance into an incident later.

5) Why does increasing max_connections often make Postgres slower?

Because it increases per-backend overhead and encourages unbounded concurrency. More active sessions means more contention, more cache churn, and more temp spills. Pooling is usually the correct move.

6) Why does MySQL sometimes “hang” during schema changes?

Metadata locks and DDL behavior. Even when an operation is “online,” it may still block at phases or under certain workloads. The queueing effect can look like the whole DB froze.

7) How do I tell if I’m dealing with a plan regression?

Symptoms: CPU spikes with a surge in logical reads, or a query that used to be fast becomes slow after a deploy or stats change. Confirm by checking the plan (EXPLAIN) and whether the query stopped using an index.
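
A sketch of the confirmation step, reusing the order-by-user query shape from the earlier examples with a placeholder value (42). Note that EXPLAIN ANALYZE actually executes the statement, so don't point it at an expensive write.

cr0x@server:~$ psql -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC LIMIT 20;"
cr0x@server:~$ mysql -e "EXPLAIN FORMAT=TREE SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC LIMIT 20;"

A plan showing a sequential/full scan where you expected an index range scan, with row estimates wildly off from actuals, is your regression. EXPLAIN FORMAT=TREE needs MySQL 8.0.16+; plain EXPLAIN works everywhere.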

8) Should I tune buffers/caches during the incident?

Rarely. Some tuning changes need a restart, most have delayed effects, and all are easy to get wrong under pressure. During the incident: mitigate load, stop blockers, kill runaways, and capture evidence. Tune afterwards, with a clear hypothesis.

9) What if both databases are slow at the same time?

Assume shared infrastructure first: storage, network, DNS, virtualization host, or a shared backup job. Correlated slowness across engines is usually not simultaneous query-plan bad luck.

10) If I can only enable one introspection feature, what should it be?

For Postgres: pg_stat_statements (with sane retention). For MySQL: Performance Schema with statement and wait instruments enabled at a level you can afford. The best incident is the one where you don’t have to guess.
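
A minimal enablement sketch for both. The Postgres half requires editing postgresql.conf and restarting so the library loads; the MySQL half just verifies what you already have, since performance_schema is on by default in 5.7 and later.

# postgresql.conf, then restart:
shared_preload_libraries = 'pg_stat_statements'

cr0x@server:~$ psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
cr0x@server:~$ mysql -e "SELECT @@performance_schema; SELECT COUNT(*) AS digests FROM performance_schema.events_statements_summary_by_digest;"

Do this before the incident; neither is something you want to be enabling at 10:17 a.m. on a bad day.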

Next steps you can do this week

If you run production databases, “it suddenly got slow” is not an if. It’s a when. Here’s how to make the next one shorter and less dramatic:

  1. Write down your 15-minute sequence (steal the playbook above) and make it visible to the on-call rotation.
  2. Enable query fingerprinting and waits in whichever engine you run, and verify you can answer: “what are the top 5 queries by total time right now?”
  3. Baseline disk latency during normal hours. Without a baseline, every graph is just modern art.
  4. Enforce connection discipline: pool connections, cap concurrency, and make pool wait time a first-class alert.
  5. Make schema changes boring: rehearse them, schedule them, and treat DDL as a production workload with blast radius.

When the next alert hits, you don’t need a debate about engines. You need a quick classification—CPU, memory, I/O, locks, or plan—and one safe action at a time.
