PostgreSQL vs Percona Server: debugging slowdowns—who gives better visibility

The pager goes off. The app is “up” but everything feels like it’s swimming through molasses. API latency spikes, queues build, and your CEO discovers the refresh button.
You have two problems: a slowdown, and uncertainty. The second is worse.

PostgreSQL and Percona Server (MySQL-compatible) can both run fast at scale. The difference, during an incident, is how quickly they let you prove what’s actually happening:
CPU, I/O, locks, checkpoint pressure, buffer pool misses, autovacuum debt, replication stalls, bad plans, or “someone changed one tiny thing”.

The punchline: visibility is a product feature

If you care about incident response, PostgreSQL generally gives you more “explainable truth” out of the box: wait events, robust per-query stats with
pg_stat_statements, and a culture of exposing internal state in plain SQL. You can get to “we’re blocked on locks” or “we’re I/O bound on reads” quickly.

Percona Server’s superpower is that it takes MySQL’s instrumentation story (which can be great but sometimes needs coaxing) and makes it more operational:
extra metrics, better defaults for observability, and a well-worn toolchain (Performance Schema, sys schema, slow log, pt-query-digest)
that is brutally effective in production.

So who gives better visibility? In practice:

  • PostgreSQL wins when you need a unified story across query execution, waits, locks, vacuums, and plan behavior—especially with SQL-only access.
  • Percona Server wins when you want a rich, operationally tuned MySQL ecosystem, enhanced instrumentation, and proven “take the slow log, digest it, fix the top offenders” workflows.

The uncomfortable truth: neither helps if you didn’t enable the right knobs before the incident. Visibility is not a vibe. It’s a config file and a retention policy.
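
For reference, “the right knobs” is a short list. Here’s a minimal sketch of both config files; the thresholds are placeholders to argue about, not recommendations, and the last MySQL line is a Percona Server slow-log extension, so skip it on stock MySQL.

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'   # per-query aggregation (restart required)
pg_stat_statements.max = 10000                    # keep enough fingerprints that eviction doesn't eat the evidence
track_io_timing = on                              # I/O time in stats views and EXPLAIN (ANALYZE, BUFFERS)
log_min_duration_statement = 250                  # ms; log anything slower than your latency budget

# my.cnf, [mysqld] section
performance_schema = ON                           # statement digests, waits, stages
slow_query_log = ON
long_query_time = 0.25                            # seconds; not the useless default of 10
log_slow_verbosity = full                         # Percona Server extension: richer slow log detail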

Interesting facts and historical context (8 things that matter)

  1. PostgreSQL’s lineage is research-first. It descends from POSTGRES (1980s), and it shows: internal state tends to be surfaced in tables/views that feel made for debugging.
  2. MySQL’s early performance story leaned on simplicity. For years, “it’s fast” sometimes meant “it’s opaque,” and the ecosystem compensated with logs and external tools.
  3. Performance Schema was a turning point. MySQL’s Performance Schema evolved into a proper instrumentation framework; it’s powerful, but not always enabled/retained correctly by default.
  4. Percona popularized “operational MySQL.” Percona Server and its tools made diagnosing MySQL production issues less of an art project and more of a checklist.
  5. pg_stat_statements changed how Postgres teams operate. Once you have normalized query stats, “top queries” becomes data, not an argument.
  6. Wait events in PostgreSQL are first-class. PostgreSQL’s wait event reporting (and related views) provides a crisp “what are we waiting on” narrative.
  7. InnoDB’s buffer pool is both a blessing and a trap. When it’s sized right, everything looks great. When it’s wrong, you chase “mystery I/O” for hours.
  8. Autovacuum is PostgreSQL’s tax collector. You can ignore it for a while, but it will collect—with interest—via bloat, table scans, and brutal latency spikes.

What “better visibility” means in real incidents

Visibility is not “I have a dashboard.” Visibility is the ability to answer four questions quickly, with evidence:

  • What changed? Query shape, data volume, plan, config, schema, workload mix, host behavior, network.
  • Where is time going? CPU, disk reads, disk writes, lock waits, fsync, replication, checkpoints, vacuum/purge, contention.
  • Who is responsible? Which query patterns, which users, which tables/indexes, which code path, which job.
  • What’s the safest mitigation? Kill, throttle, add index, change plan, tune memory, reduce durability, pause jobs, fail over.

PostgreSQL and Percona Server can both answer these. The real difference is how many steps it takes, and how often you’re forced to guess.

Two metrics matter more than your pride:

  • Time to first credible hypothesis. Not “it’s the database,” but “it’s lock contention on table X caused by query pattern Y.”
  • Time to safe mitigation. The fix doesn’t have to be perfect; it has to stop the bleeding without corrupting data or causing a second outage.

PostgreSQL: where the light is bright (and where it isn’t)

Postgres strengths: SQL-native introspection and wait truth

PostgreSQL exposes a lot of “what the engine is doing” via system views. In an incident, that means you can often stay inside SQL and still get answers:
sessions, locks, wait events, query text, I/O timing (if enabled), vacuum progress, replication, and query statistics.

Wait events are the underrated killer feature. If sessions are stuck on Lock waits, you immediately shift from “why is it slow” to “who is blocking whom.”
If waits show IO or LWLock pressure, you know whether it’s disk, cache churn, or internal contention.
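
That shift is one GROUP BY away. A minimal sketch using only stock pg_stat_activity columns (PostgreSQL 9.6+); output is illustrative:

cr0x@server:~$ psql -X -c "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state='active' AND pid <> pg_backend_pid() GROUP BY 1,2 ORDER BY count(*) DESC;"
 wait_event_type |  wait_event   | count
-----------------+---------------+-------
 Lock            | transactionid |    14
 IO              | DataFileRead  |     6
                 |               |     3

Rows with empty wait columns are backends actually running on CPU; everything else is waiting on exactly what it says.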

Postgres blind spots: you still need to pre-enable the good stuff

PostgreSQL is generous with introspection, but it won’t store per-query history forever unless you tell it to, and it won’t magically capture
every slow statement unless logging and stats are configured sanely.

  • pg_stat_statements must be installed and sized. Too small and the hottest queries evict the evidence.
  • I/O timing instrumentation costs something. You can enable it, but don’t pretend it’s free on heavily loaded disks.
  • Query-level wait breakdown is not as granular as you might wish. You’ll know what a backend is waiting on right now, but you won’t always get clean per-query wait attribution without extra tooling. A quick check for the first two items follows below.
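
Checking whether the evidence is actually being kept takes a minute. A sketch, assuming pg_stat_statements is loaded; pg_stat_statements_info and its dealloc counter exist in PostgreSQL 14+, output illustrative:

cr0x@server:~$ psql -X -c "SHOW pg_stat_statements.max;" -c "SHOW track_io_timing;"
 pg_stat_statements.max
-------------------------
 5000

 track_io_timing
-----------------
 off

cr0x@server:~$ psql -X -c "SELECT dealloc, stats_reset FROM pg_stat_statements_info;"
 dealloc |         stats_reset
---------+------------------------------
   81234 | 2025-12-01 00:00:02.114201+00

A large dealloc means hot fingerprints have been evicted: raise pg_stat_statements.max on a calm day. And with track_io_timing off, the I/O time columns in pg_stat_statements and pg_stat_database stay at zero.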

The practical Postgres debugging experience

On-call with Postgres often feels like: “Run a handful of SQL queries and the engine confesses.” That’s not always true, but it’s true often enough that teams build muscle memory around it.

The most common failure mode is not lack of visibility—it’s drowning in it. If you don’t have a workflow, you end up staring at
pg_stat_activity like it’s going to blink twice for “disk is full.”

Percona Server: where it shines (and its blind spots)

Percona strengths: MySQL ecosystem, enhanced metrics, and incident tooling

Percona Server is MySQL-compatible, but tuned for production: better instrumentation defaults, extra status variables, and a culture of “here’s the knob, here’s the metric, here’s the way.”
The Percona toolkit workflow—slow log + digest—remains one of the fastest paths from “it’s slow” to “here are the top query fingerprints and their impact.”

Performance Schema can be spectacular when configured well: statement summaries, wait events, lock instrumentation, stages, and metadata locks.
Combine it with sys schema views and you get a narrative that’s close to Postgres’ wait story—just with more footguns.
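
If the sys schema is installed (it ships by default with MySQL 5.7+ and the matching Percona Server releases), the lock narrative is one query away. A sketch with illustrative output:

cr0x@server:~$ mysql -e "SELECT waiting_pid, waiting_query, blocking_pid, blocking_query, wait_age FROM sys.innodb_lock_waits\G"
*************************** 1. row ***************************
   waiting_pid: 31251
 waiting_query: UPDATE orders SET status='shipped' WHERE id=?
  blocking_pid: 30988
blocking_query: NULL
      wait_age: 00:01:23

The NULL blocking_query is the classic tell: the blocker is idle inside an open transaction. For the metadata-lock flavor of the same standoff, sys.schema_table_lock_waits tells a similar story, provided the mdl instrument is enabled.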

Percona blind spots: instrumentation is optional, and optional things are often missing at 2 a.m.

MySQL/Percona visibility has historically depended on whether you turned it on. Performance Schema can be disabled, sized too small, or configured to not retain what you need.
Slow logs might be off because “disk space,” and then you’re debugging with vibes and a prayer.

There’s also a subtlety: MySQL has multiple ways to observe similar things (slow log, Performance Schema, INFORMATION_SCHEMA tables, status variables).
That’s flexibility—until it’s not. In an incident, you want one canonical workflow.
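
A sixty-second check that tells you whether you have evidence at all; variable names are stock MySQL/Percona, output illustrative:

cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('performance_schema','slow_query_log','long_query_time');"
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| long_query_time    | 10.000000 |
| performance_schema | ON        |
| slow_query_log     | OFF       |
+--------------------+-----------+

This is the “debugging with vibes” configuration: slow log off, threshold at the default 10 seconds. Also glance at performance_schema.setup_consumers; the statements_digest consumer must be enabled (it is by default) or the digest table in Task 11 stays empty.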

Joke #1: A slow query is like a corporate meeting—nobody knows why it exists, but everyone’s waiting on a lock.

Fast diagnosis playbook (check 1st/2nd/3rd/4th)

This is the playbook I want on the wall next to the “don’t reboot the database” sign.

First: is it the database, or the box?

  • Check host saturation: CPU, iowait, memory pressure, swapping, disk queue depth.
  • Check storage latency: read/write await, fsync spikes, stalled devices, full filesystems.
  • Check network: retransmits, saturation, weird latency to clients.

Decision: if the host is on fire, fix the host first. No amount of query tuning beats a disk doing 80ms reads.

Second: is the workload blocked (locks) or slow (resource)?

  • Postgres: wait events + lock views + pg_stat_activity.
  • Percona: Performance Schema waits + InnoDB lock/transaction status + processlist.

Decision: if blocked, identify the blocker and the lock type. Then decide whether to kill, throttle, or change the write pattern.

Third: identify the top query fingerprints causing pain

  • Postgres: pg_stat_statements by total time and mean time; correlate with execution counts and I/O.
  • Percona: slow log + pt-query-digest; Performance Schema statement summaries.

Decision: pick the smallest safe change that reduces total time: add index, rewrite query, fix parameterization, update stats, reduce concurrency, or temporarily disable a batch job.

Fourth: confirm it’s not maintenance or durability doing it

  • Postgres: checkpoint frequency, autovacuum activity, replication apply delay.
  • Percona: purge lag/history list length, checkpoint age, flush stalls, replication lag.

Decision: if maintenance is the culprit, you choose between performance and debt. Pay now (vacuum/purge/optimize), or pay later with worse interest.

Practical tasks (commands, outputs, decisions)

These are real tasks you can run during an incident. Each includes: command, what the output means, and the decision you make.
I’m mixing host-level checks with database-level checks because “the database is slow” is often “the disk is sad.”

Task 1 — Host CPU and iowait snapshot

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.1.0 (db01)  12/30/2025  _x86_64_ (32 CPU)

11:02:10 AM  CPU   %usr  %nice   %sys %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
11:02:11 AM  all  22.10   0.00   6.30   18.40  0.00   0.60    0.00    0.00    0.00  52.60
11:02:11 AM    7  85.00   0.00  10.00    0.00  0.00   0.00    0.00    0.00    0.00   5.00

Meaning: high %iowait on all suggests storage latency is gating progress; one hot CPU suggests a single-thread hotspot (compression, single query, or background flush).

Decision: if iowait is high, pivot to storage checks before you start rewriting queries.

Task 2 — Disk latency and queue depth

cr0x@server:~$ iostat -x 1 3
Linux 6.1.0 (db01)  12/30/2025  _x86_64_ (32 CPU)

Device            r/s     w/s   r_await   w_await   aqu-sz  %util
nvme0n1         820.0   210.0     18.40     22.10    14.20  98.50

Meaning: r_await/w_await in the tens of ms and %util near 100%: the device is saturated. aqu-sz high: deep queue, requests are piling up.

Decision: throttle workload (connection pool, batch jobs), confirm no runaway checkpoint/flush, consider failover if replica storage is healthier.

Task 3 — Memory pressure and swapping

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           128Gi       120Gi       1.2Gi       2.0Gi       6.8Gi       3.5Gi
Swap:           16Gi       9.5Gi       6.5Gi

Meaning: swap in use on a database host is rarely “fine.” It usually means your cache is getting evicted and your latency is about to become interpretive art.

Decision: reduce memory footprint (lower connection count, fix huge work_mem/sort buffers, check for OS cache starvation), and plan a restart only if you can do it safely.

Task 4 — PostgreSQL: find active queries and their waits

cr0x@server:~$ psql -X -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS age, left(query,120) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY age DESC LIMIT 10;"
 pid  | usename | state  | wait_event_type |  wait_event   |   age    |                         query
------+--------+--------+-----------------+---------------+----------+----------------------------------------------------------
 4821 | app    | active | Lock            | relation      | 00:01:42 | UPDATE orders SET status=$1 WHERE id=$2
 4977 | app    | active | IO              | DataFileRead  | 00:01:10 | SELECT * FROM orders WHERE customer_id=$1 ORDER BY created

Meaning: the first session is waiting on a relation lock; the second is blocked on data file reads. That’s two different incident tracks: lock contention vs I/O bound.

Decision: if you see Lock waits, go to blockers. If you see IO waits, go to disk and cache behavior.

Task 5 — PostgreSQL: show blockers and blocked sessions

cr0x@server:~$ psql -X -c "SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocker.pid AS blocker_pid, blocker.query AS blocker_query FROM pg_locks bl JOIN pg_stat_activity blocked ON blocked.pid=bl.pid JOIN pg_locks kl ON kl.locktype=bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid AND kl.classid IS NOT DISTINCT FROM bl.classid AND kl.objid IS NOT DISTINCT FROM bl.objid AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid AND kl.pid != bl.pid JOIN pg_stat_activity blocker ON blocker.pid=kl.pid WHERE NOT bl.granted AND kl.granted LIMIT 5;"
 blocked_pid |             blocked_query              | blocker_pid |              blocker_query
------------+----------------------------------------+------------+------------------------------------------
       4821 | UPDATE orders SET status=$1 WHERE id=$2 |       4602 | ALTER TABLE orders ADD COLUMN note text

Meaning: an ALTER TABLE is blocking production updates. This is not a “slow query,” it’s a DDL coordination failure.

Decision: decide whether to terminate the blocker, postpone schema change, or route writes away. If you kill it, confirm rollback time and impact.

Task 6 — PostgreSQL: top queries by total time (requires pg_stat_statements)

cr0x@server:~$ psql -X -c "SELECT calls, round(total_exec_time::numeric,1) AS total_ms, round(mean_exec_time::numeric,1) AS mean_ms, rows, left(query,120) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
 calls | total_ms | mean_ms |  rows  |                         query
-------+----------+---------+--------+----------------------------------------------------------
 92000 | 540000.0 |     5.9 | 120000 | SELECT * FROM orders WHERE customer_id=$1 ORDER BY created_at DESC LIMIT $2
  4100 | 210000.0 |    51.2 |   4100 | UPDATE orders SET status=$1 WHERE id=$2

Meaning: the first query dominates total time due to volume. The second is individually slow-ish but not necessarily the main culprit.

Decision: optimize the dominant fingerprint first (index, covering index, reduce selected columns, better pagination), then address tail latencies.

Task 7 — PostgreSQL: check autovacuum debt and bloat risk signals

cr0x@server:~$ psql -X -c "SELECT relname, n_dead_tup, n_live_tup, last_autovacuum, vacuum_count, autovacuum_count FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
  relname  | n_dead_tup | n_live_tup |     last_autovacuum     | vacuum_count | autovacuum_count
-----------+------------+------------+-------------------------+--------------+------------------
 orders    |   8420000  |   21000000 | 2025-12-29 02:14:11+00  |            1 |               14
 events    |   3200000  |    9000000 |                         |            0 |                0

Meaning: lots of dead tuples and missing autovacuum on events suggests bloat and poor visibility map coverage, which means more I/O and slower scans.

Decision: investigate autovacuum settings, long transactions preventing cleanup, and schedule targeted vacuuming (carefully, in production).

Task 8 — PostgreSQL: check checkpoint pressure

cr0x@server:~$ psql -X -c "SELECT checkpoints_timed, checkpoints_req, round(checkpoint_write_time/1000.0,1) AS write_s, round(checkpoint_sync_time/1000.0,1) AS sync_s, buffers_checkpoint FROM pg_stat_bgwriter;"
 checkpoints_timed | checkpoints_req | write_s | sync_s | buffers_checkpoint
------------------+-----------------+---------+--------+-------------------
             1200 |            3400 |  9800.2 |  420.1 |          81233421

Meaning: lots of requested checkpoints and huge checkpoint buffers can correlate with periodic latency spikes, especially on busy write workloads.

Decision: tune checkpoint settings (interval, completion target), and verify storage can absorb the write pattern. Don’t “fix” this by turning durability into a rumor.
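
The usual knobs, as a sketch; these are stock PostgreSQL settings, and the values are placeholders to show the direction, not recommendations:

# postgresql.conf
checkpoint_timeout = 15min              # longer interval, fewer requested checkpoints
max_wal_size = 16GB                     # large enough that WAL volume stops forcing early checkpoints
checkpoint_completion_target = 0.9      # smear checkpoint writes across the interval

If checkpoints_req keeps climbing afterward, the write volume itself is the problem, not the schedule.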

Task 9 — Percona/MySQL: see who’s running and who’s waiting

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G"
*************************** 1. row ***************************
     Id: 31244
   User: app
   Host: 10.0.12.34:52210
     db: prod
Command: Query
   Time: 87
  State: Waiting for table metadata lock
   Info: ALTER TABLE orders ADD COLUMN note TEXT
*************************** 2. row ***************************
     Id: 31251
   User: app
   Host: 10.0.12.18:50122
     db: prod
Command: Query
   Time: 83
  State: updating
   Info: UPDATE orders SET status='shipped' WHERE id=?

Meaning: metadata locks can freeze a whole application on MySQL/Percona. If you see “Waiting for table metadata lock,” you’re not optimizing queries—you’re ending a standoff.

Decision: find the lock holder and kill or wait; then fix your migration process (online DDL tooling, smaller steps, off-peak).

Task 10 — Percona/MySQL: InnoDB transaction and lock status snapshot

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
=====================================
2025-12-30 11:04:21 INNODB MONITOR OUTPUT
=====================================
TRANSACTIONS
------------
Trx id counter 845112334
Purge done for trx's n:o < 845100000 undo n:o < 0 state: running
History list length 987654
...
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 845112100, ACTIVE 12 sec updating or deleting
...

Meaning: very large “History list length” indicates purge is behind, often due to long-running transactions. That increases undo retention and can hurt performance.
Deadlocks are normal; repeated patterns are not.

Decision: identify long transactions (sleeping clients, batch jobs), fix application transaction scoping, and consider tuning purge/undo behavior if it’s chronic.

Task 11 — Percona/MySQL: top statements via Performance Schema

cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1000000000000,2) AS total_s, ROUND(AVG_TIMER_WAIT/1000000000000,4) AS avg_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE customer_id = ? ORDER BY created_at DESC LIMIT ?
 COUNT_STAR: 92510
   total_s: 532.44
    avg_s: 0.0058

Meaning: same idea as pg_stat_statements: normalized fingerprints with counts and time. If this table is empty, you probably didn’t configure Performance Schema retention.

Decision: optimize the top digest, and ensure Performance Schema consumers are enabled and sized for your workload.

Task 12 — Percona/MySQL: slow log digest (fast “who hurt us”)

cr0x@server:~$ pt-query-digest /var/log/mysql/mysql-slow.log --limit=5
# 200ms user time, 20ms system time, 30.00M rss, 120.00M vsz
# Rank Query ID           Response time Calls R/Call  V/M   Item
# ==== ================== ============= ===== ======= ===== ====
#    1 0xA1B2C3D4E5F6A7B8  540.2311 50.2% 92000 0.0059  0.02  SELECT orders
#    2 0x1111222233334444  210.1145 19.5%  4100 0.0512  0.10  UPDATE orders

Meaning: you get an immediate ranked list of what’s consuming latency. This is why the slow log is still relevant in 2025: it works under pressure.

Decision: fix rank #1 first unless you can prove rank #2 is causing lock amplification or cascading failures.

Task 13 — PostgreSQL: examine a plan with real execution timing

cr0x@server:~$ psql -X -c "EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM orders WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50;"
 Limit  (cost=0.43..25.88 rows=50 width=312) (actual time=120.114..120.220 rows=50 loops=1)
   Buffers: shared hit=120 read=1840
   ->  Index Scan using orders_customer_created_idx on public.orders  (cost=0.43..5120.33 rows=10000 width=312) (actual time=0.080..119.900 rows=50 loops=1)
         Index Cond: (orders.customer_id = 42)
 Planning Time: 0.320 ms
 Execution Time: 120.300 ms

Meaning: lots of read buffers implies cache misses; the index exists but still triggers I/O (maybe due to poor locality, cold cache, or too-wide rows).

Decision: consider a covering index, reduce selected columns, improve cache hit rate (memory sizing), or address storage latency.
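
One possible shape of that covering index, as a sketch: CREATE INDEX CONCURRENTLY avoids holding a long write lock, INCLUDE needs PostgreSQL 11+, and the included columns (status, total) are illustrative, not taken from the real schema:

cr0x@server:~$ psql -X -c "CREATE INDEX CONCURRENTLY orders_customer_created_cov ON orders (customer_id, created_at DESC) INCLUDE (status, total);"
CREATE INDEX

Paired with selecting only the columns you need instead of *, this can turn the plan into an index-only scan and make most of those 1,840 read buffers disappear. The price is write amplification, which is the index tax discussed in the FAQ.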

Task 14 — Percona/MySQL: explain with actuals (if supported) or analyze plan shape

cr0x@server:~$ mysql -e "EXPLAIN SELECT * FROM orders WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: orders
   partitions: NULL
         type: ref
possible_keys: idx_customer_created
          key: idx_customer_created
      key_len: 8
          ref: const
         rows: 10000
     filtered: 100.00
        Extra: Using where; Using filesort

Meaning: “Using filesort” with an ORDER BY can mean the index doesn’t satisfy the sort, or the optimizer chose not to use it for ordering. That’s a classic latency multiplier.

Decision: adjust index order, ensure the ORDER BY matches index, and verify statistics/cardinality.
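
On MySQL 8.0.18+ (and the matching Percona Server releases) you can also get actual timings. A sketch with illustrative output; note that EXPLAIN ANALYZE executes the query, so don’t aim it at an already-saturated primary:

cr0x@server:~$ mysql -e "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
EXPLAIN: -> Limit: 50 row(s)  (cost=2493.25 rows=50) (actual time=118.2..118.4 rows=50 loops=1)
    -> Sort: orders.created_at DESC, limit input to 50 row(s) per chunk  (cost=2493.25 rows=10000) (actual time=118.2..118.3 rows=50 loops=1)
        -> Index lookup on orders using idx_customer_created (customer_id=42)  (cost=1493.25 rows=10000) (actual time=0.3..95.1 rows=10000 loops=1)

The actuals confirm what the filesort hinted at: the index finds the customer quickly, but 10,000 rows get read and sorted to return 50.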

Task 15 — Replication lag check (Postgres)

cr0x@server:~$ psql -X -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
 application_name |   state   | write_lag | flush_lag | replay_lag
-----------------+-----------+-----------+-----------+------------
 replica01        | streaming | 00:00:00  | 00:00:02  | 00:00:15

Meaning: replay lag indicates the replica is applying WAL slowly. During incident response, this matters for failover safety and read scaling expectations.

Decision: if lag is growing, avoid failing over to that replica unless you accept more loss/latency. Investigate IO/CPU on replica.

Task 16 — Replication lag check (Percona/MySQL)

cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep 'Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Last_SQL_Error'
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 48
Last_SQL_Error:

Meaning: 48 seconds behind isn’t necessarily fatal, but it’s a warning for failover and for read-after-write expectations.

Decision: if lag correlates with storage or lock stalls, fix the underlying bottleneck first; don’t “solve” it by restarting replication blindly.

Joke #2: Nothing accelerates a root-cause analysis like realizing the “temporary” debug setting has been enabled since last quarter.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a multi-tenant SaaS on PostgreSQL. They had a tidy mental model: “Reads are cheap, writes are expensive.”
So when they saw latency spikes, they assumed a write-heavy background job was saturating disk.

The on-call opened host metrics: disk utilization was high, sure. But pg_stat_activity showed the loudest pain was on read queries.
Wait events weren’t “IO DataFileRead” everywhere either—many sessions were stuck on Lock, and the blocked queries were reads.

The wrong assumption was subtle: they believed reads don’t block. In PostgreSQL, reads can absolutely wait if they’re trying to access an object locked by DDL,
or if they need a buffer pinned by something else, or if they pile behind a lock queue. A schema migration had added an index with a locking mode they didn’t fully understand.

The fix wasn’t “add IOPS.” The fix was operational: change the migration process to use safer online patterns,
schedule invasive DDL for low traffic, and implement a preflight step that checks lock acquisition risk and expected runtime.

The lesson: the fastest way to waste an hour is to debug the problem you expected instead of the one your waits are reporting.

Mini-story 2: The optimization that backfired

Another team ran Percona Server with an aggressively tuned buffer pool and a proud “we never hit disk” stance.
They bumped the buffer pool to “as large as possible” after a traffic increase. The graphs looked heroic for a week.

Then came a slow-motion incident: tail latency rose, but throughput didn’t. CPU looked fine. Disk looked “busy but not insane.”
Performance Schema showed more time in InnoDB flushing and file I/O stages than before, and the slow log began reporting bursts of write-heavy stalls.

The backfire was OS memory starvation. With the buffer pool eating nearly everything, the OS page cache and filesystem metadata cache suffered,
and background flushing behavior became more erratic. The box started swapping under rare peak conditions—just enough to ruin the 99th percentile.

The fix was embarrassingly boring: reduce buffer pool a bit, give the OS breathing room, cap connection concurrency, and tune flush behavior to smooth spikes.
The “optimization” had improved averages while breaking the tail. Production lives in the tail.

The lesson: you can’t win by taking all memory for the database. The kernel gets a vote, and it votes with latency.

Mini-story 3: The boring but correct practice that saved the day

A large enterprise team ran both PostgreSQL and Percona across different products.
Their most valuable reliability work wasn’t exotic tuning; it was standardized instrumentation before incidents:
slow query logging with sane thresholds, statement aggregation (pg_stat_statements and Performance Schema digests), and consistent retention.

One afternoon, an application deployment caused a sudden DB slowdown. No one argued about whose fault it was because the data was already there.
In Postgres, the top query fingerprint’s total time doubled with the same call count; EXPLAIN (ANALYZE, BUFFERS) showed extra reads.
In Percona, pt-query-digest flagged an altered query shape and a new “Using filesort” behavior.

The “boring practice” was a pre-merge check that recorded query plans for critical endpoints and compared them after changes.
That made the rollback decision obvious. They reverted fast, then fixed the query and added the missing index in a controlled rollout.

The incident was short not because they were geniuses, but because they had receipts.

Common mistakes: symptom → root cause → fix

Here are failure modes that repeatedly show up in production, with specific fixes. This is the section you read when your coffee is shaking.

1) Symptom: sudden global slowdown, many sessions “waiting”

  • Root cause (Postgres): DDL blocking or lock queue behind one long transaction.
  • Fix: identify blocker via pg_locks/pg_stat_activity, terminate or finish it; change migration strategy.
  • Root cause (Percona): metadata lock (MDL) contention, often from DDL.
  • Fix: find lock owner in processlist/Performance Schema; kill session; use online schema change practices.

2) Symptom: periodic latency spikes every few minutes

  • Root cause (Postgres): checkpoint storms (too frequent or too spiky), sometimes combined with slow storage.
  • Fix: tune checkpoint interval and completion target; ensure WAL and data are on sane storage; verify background writer behavior.
  • Root cause (Percona): flush storms, checkpoint age pressure, fsync bursts.
  • Fix: tune InnoDB flushing and redo log sizing; smooth write workload; confirm device write latency.

3) Symptom: read queries slow down over days/weeks

  • Root cause (Postgres): table/index bloat from autovacuum debt or long-running transactions preventing cleanup.
  • Fix: find dead tuples and vacuum history; fix long transactions; tune autovacuum thresholds per table; schedule maintenance windows.
  • Root cause (Percona): fragmentation plus stale stats, or buffer pool churn from data growth.
  • Fix: update stats; adjust buffer pool; consider rebuild/optimize carefully; fix query/index usage.

4) Symptom: “CPU is low but queries are slow”

  • Root cause: I/O wait, lock waits, or internal contention. Low CPU is not a compliment; it’s a clue.
  • Fix: check waits (Postgres wait_event_type; MySQL stages/waits), then storage latency, then concurrency limits.

5) Symptom: replication lag grows during peak, then recovers

  • Root cause: replica is I/O bound applying changes; or long transactions/large batches cause apply stalls.
  • Fix: tune apply parallelism where appropriate, reduce big transactions, move heavy reads off replica, and confirm replica storage.

6) Symptom: slow query “fix” helps briefly, then degrades again

  • Root cause: you treated a symptom. Common culprits: cache warming effects, changing parameter values (plan instability), or stats drift.
  • Fix: confirm with query fingerprints over time, compare plans for different parameters, and set a stats/ANALYZE routine (or improve it) with monitoring.

Checklists / step-by-step plan for repeatable debugging

Checklist A: Pre-incident instrumentation (do this on a calm Tuesday)

  1. Enable and size query aggregation. Postgres: pg_stat_statements. Percona: Performance Schema statement digest + sys views.
  2. Enable slow query logging with a real threshold. Don’t set it to 10 seconds and call it observability.
  3. Retain enough history to span your incident. If you rotate logs every hour, you’ll lose the crime scene.
  4. Capture waits/locks metrics. Postgres wait events; MySQL waits/stages; lock views.
  5. Standardize EXPLAIN workflows. Postgres: EXPLAIN (ANALYZE, BUFFERS) for top queries in staging. MySQL: EXPLAIN plus runtime sampling via Performance Schema.
  6. Baseline storage latency. Know your normal r_await/w_await and fsync behavior.

Checklist B: During incident (15-minute containment loop)

  1. Confirm user-visible impact. Which endpoints? Read vs write? Single tenant or global?
  2. Host health first. CPU, iowait, memory/swap, disk saturation.
  3. Identify blocking. Postgres lock graph; MySQL MDL/InnoDB locks.
  4. Top query fingerprints. Postgres statements; MySQL digests or slow log digest.
  5. Mitigate safely. Kill the blocker, throttle concurrency, pause batch jobs, or fail over if replica is healthy and consistent enough.
  6. Prove improvement. Re-run the same checks. Don’t declare victory because the graph “looks calmer.”

Checklist C: After incident (turn pain into engineering)

  1. Write the one-page timeline. What changed, when, how detected, what fixed.
  2. Add guardrails. Migration controls, query plan regression tests, concurrency limits, circuit breakers.
  3. Instrument missing signals. If you had to guess, add a metric or log so you won’t guess next time.
  4. Make the fix boring. The best incident response is a script and a runbook, not a hero.

FAQ

1) If I can only pick one “must-have” visibility feature, what is it?

Query fingerprint aggregation with time and counts. In Postgres that’s pg_stat_statements. In Percona/MySQL that’s statement digests (Performance Schema) and/or slow log + digest.
Without it, you’ll chase individual queries instead of workload shape.

2) Which one is better at telling me “we are waiting on locks”?

PostgreSQL is usually clearer faster: wait_event_type plus straightforward lock joins give you an immediate story.
Percona can absolutely do it, but you need to know where to look (processlist states, Performance Schema waits, InnoDB status) and you need it enabled.

3) Is the slow query log obsolete if I have Performance Schema?

No. The slow query log is a low-tech black box recorder. It often survives when your fancy instrumentation wasn’t sized right.
Use both if you can afford the overhead and storage.

4) Why do I see “Using filesort” in MySQL when I thought I had the right index?

Because the index may not match the ORDER BY direction/columns, or the optimizer may choose a different access path based on stats.
Fix with an index that matches the sort, verify cardinality, and confirm you’re not selecting so many columns that it becomes expensive anyway.
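
The usual fix, sketched against the orders example from Task 14; descending index keys need MySQL 8.0, and the output is illustrative:

cr0x@server:~$ mysql -e "ALTER TABLE orders ADD INDEX idx_customer_created_desc (customer_id, created_at DESC);"
cr0x@server:~$ mysql -e "EXPLAIN SELECT * FROM orders WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50\G" | grep -E 'key:|Extra:'
          key: idx_customer_created_desc
        Extra: NULL

On 5.7, a plain ascending (customer_id, created_at) index can usually be read backward to satisfy the same ORDER BY. Either way, remember Task 9: even online DDL briefly needs a metadata lock, so don’t run it behind a long transaction.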

5) PostgreSQL feels slower after we added more indexes. Isn’t that backwards?

Indexes speed reads and tax writes. More indexes mean more write amplification, more vacuum work, and sometimes worse cache behavior.
Keep the indexes that pay rent. Drop the ones that don’t.

6) What’s the most common reason Postgres “randomly” slows down?

A mix of autovacuum debt and long transactions that block cleanup. It’s not random; it’s deferred maintenance becoming visible.
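
Finding the long transactions that block cleanup is one query; a sketch over stock pg_stat_activity columns, output illustrative:

cr0x@server:~$ psql -X -c "SELECT pid, usename, state, now()-xact_start AS xact_age, left(query,60) AS query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start LIMIT 5;"
  pid  | usename |        state        | xact_age |                    query
-------+---------+---------------------+----------+----------------------------------------------
  3910 | etl     | idle in transaction | 06:12:44 | SELECT * FROM staging_rows WHERE batch_id=$1
  4101 | app     | active              | 00:03:10 | UPDATE orders SET status=$1 WHERE id=$2

An “idle in transaction” session that is hours old pins the cleanup horizon, so autovacuum cannot remove dead tuples that the old snapshot might still need, no matter how aggressively you tune it.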

7) What’s the most common reason MySQL/Percona “randomly” slows down?

Flush/IO pressure plus lock contention (including MDL), often triggered by a workload shift or a migration. And yes, sometimes it’s an undersized buffer pool or an oversized one.
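
For the buffer pool question specifically, two stock InnoDB counters give a quick read (output illustrative):

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_buffer_pool_read_requests','Innodb_buffer_pool_reads');"
+----------------------------------+---------------+
| Variable_name                    | Value         |
+----------------------------------+---------------+
| Innodb_buffer_pool_read_requests | 981237744123  |
| Innodb_buffer_pool_reads         | 8123941       |
+----------------------------------+---------------+

Innodb_buffer_pool_reads counts reads that missed the buffer pool and went to disk; watch its rate over time rather than a single ratio. And keep an eye on free -h (Task 3), because an oversized buffer pool shows up as swap on the host, not as a database metric.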

8) Can I debug slowdowns safely without running heavy commands?

Yes. Use lightweight snapshots first: top statements tables, processlist/pg_stat_activity, and high-level I/O metrics.
Save heavy EXPLAIN ANALYZE runs for a controlled sample, and avoid running them on already-saturated systems unless you know the cost.

9) Which is easier to operate with limited database expertise on-call?

If you invest in runbooks, either works. Out of the box, PostgreSQL’s “ask SQL views, get truth” model is easier for generalists.
Percona can be just as diagnosable, but only if you standardize Performance Schema and slow log workflows ahead of time.

Next steps: how to choose and how to instrument

A single quote belongs here, because it’s the operational moral of the story. Werner Vogels’ widely repeated
reliability stance boils down to one line: “You build it, you run it.”
Debuggability is part of running it, not an optional add-on when things get spicy.

If your org’s biggest risk is “we don’t know what’s happening during incidents,” pick the system and configuration that makes truth easiest to retrieve.
My bias:

  • Choose PostgreSQL if you want strong SQL-native visibility, first-class wait reporting, and a straightforward path from sessions → waits → locks → statements → plans.
    It rewards disciplined vacuum and stats hygiene.
  • Choose Percona Server if you’re committed to the MySQL ecosystem, want operational enhancements, and you’ll actually run Performance Schema and slow logging properly.
    It rewards standardized tooling and careful concurrency/flush management.

Practical next steps you can do this week

  1. Enable query fingerprints and verify retention. Confirm you can answer “top 5 queries by total time in the last hour.”
  2. Turn on slow logging with a defensible threshold and rotation. Keep enough history to span your typical incident duration.
  3. Write the 15-minute slowdown runbook. Host checks, lock checks, top queries, mitigation options.
  4. Practice one game day. Inject load, simulate a lock, simulate I/O saturation, and time how fast on-call gets to a credible hypothesis.
  5. Make migrations boring. Most “database slowdowns” are schema change coordination failures wearing a trench coat.

The best visibility is the kind you don’t think about until you need it—because you already know exactly where it is, and it’s already collecting the evidence.
