03:07. Pager goes off. The app is “down,” the exec channel is “up,” and somebody is typing
“did we lose data?” in all caps. You don’t get style points for architecture right now.
You get points for restoring service and knowing what to trust.
This is a practical comparison of running PostgreSQL versus Percona Server (MySQL-compatible) when you’re tired,
under pressure, and the blast radius is real. Not which is “better.” Which is simpler to operate at 3AM,
with the knobs, logs, and recovery paths that actually exist in production.
What “simpler at 3AM” really means
“Simpler” isn’t fewer features. It’s fewer unknowns in the failure path. At 3AM you need:
(1) clear visibility into what is broken, (2) one or two reliable recovery levers,
(3) predictable performance behavior under load, and (4) backups that restore on the first try.
PostgreSQL and Percona Server can both be operated well. But they tend to punish different mistakes.
PostgreSQL punishes neglect (autovacuum, bloated tables, bad queries that spill to disk).
Percona Server punishes hand-wavy replication assumptions and the “it’s just MySQL” vibe when the InnoDB
subsystem is quietly screaming.
My bias, based on incidents: PostgreSQL is often simpler to recover correctly because durability semantics
are explicit and replication is coherent. Percona Server is often simpler to scale out quickly because its
ecosystem (plus Percona tooling) makes common MySQL patterns easy—until you hit subtle replication drift,
GTID confusion, or multi-writer fantasies.
Another bias, also earned: the database you already have good runbooks for is simpler. “Familiar” beats “best”
in the moment, which is why operational maturity matters more than benchmarks.
At 3AM, every database is a distributed system: your teammates are distributed across time zones and opinions.
Interesting facts and historical context (the bits that explain the scars)
- PostgreSQL’s lineage: PostgreSQL grew out of the POSTGRES project at UC Berkeley in the 1980s, and its “do it right” DNA shows in transactional semantics and extensibility.
- MySQL’s early default baggage: MySQL historically shipped with non-transactional defaults (like MyISAM), which trained a generation to treat durability as optional. InnoDB changed the game, but the cultural echoes remain.
- Percona Server’s origin story: Percona built a distribution around MySQL with instrumentation and performance improvements because operators wanted more visibility than upstream gave them.
- WAL vs binlog: PostgreSQL’s write-ahead log (WAL) is fundamental to crash safety and replication. MySQL’s binary log (binlog) is both replication fuel and a logical history—powerful, but it can also be a source of drift if you treat it casually.
- MVCC everywhere, but not the same: Both systems use MVCC ideas, but PostgreSQL’s vacuum requirement is a first-class operational chore, while InnoDB’s undo/redo and purge behavior tends to surface as “history list length” or undo tablespace pain.
- Replication evolution: MySQL replication started as statement-based, with row-based and mixed formats added later. That history matters because statement-based habits still leak into assumptions about determinism.
- PostgreSQL physical replication maturity: Streaming replication and replication slots made continuous archiving and replicas more robust, but they introduced a new footgun: slots can retain WAL forever if you forget them.
- Percona’s backup tooling impact: XtraBackup became the operator’s friend for hot physical backups in MySQL land, especially when logical dumps were too slow. But it also created a “backup succeeded, restore… maybe” gap if restore drills aren’t practiced.
- Default config conservatism: PostgreSQL defaults are intentionally conservative; you must tune memory and checkpoint behavior for real workloads. MySQL defaults have also improved, but you’ll still see production systems with dangerously large InnoDB buffers and risky flush settings because someone chased a graph.
Operational mental models: how each database fails
PostgreSQL: “everything is fine until autovacuum isn’t”
PostgreSQL usually fails in ways that are diagnosable: high load, long queries, lock contention, IO wait,
checkpoint spikes, or bloat making everything slower. When it crashes, recovery is generally deterministic:
replay WAL, come back. The tricky part is not crash recovery; it’s staying out of the slow spiral
where bloat and bad plans cause more I/O, which causes longer queries, which causes more dead tuples and worse vacuum.
PostgreSQL’s 3AM levers tend to be: cancel runaway queries, reduce lock contention, fix bad indexes/plans,
tune memory and checkpoints, and ensure WAL archiving is healthy. It’s a system that rewards
boring operational hygiene.
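Two of those levers are single function calls once you have a PID from pg_stat_activity; the PID below is illustrative:
cr0x@server:~$ psql -XAtc "select pg_cancel_backend(9231);"
cr0x@server:~$ psql -XAtc "select pg_terminate_backend(9231);"
pg_cancel_backend stops the current query but keeps the session; pg_terminate_backend drops the connection entirely. Start with cancel and escalate only if the session keeps holding locks.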
Percona Server: “replication is easy until it’s not”
Percona Server inherits the MySQL operational model: one primary (or “source”), replicas (“replica”/“slave”),
async replication, and a toolbox of options. Percona adds visibility (performance schema enhancements,
Percona Toolkit ecosystem compatibility, and operational patches depending on version).
At 3AM, your biggest enemy isn’t always “the database is down.” It’s “the database is up, but not consistent,”
or “replicas are behind,” or “we failed over and now writes are going to the wrong place.”
InnoDB can also get into states where it’s alive but effectively wedged on I/O, redo log pressure, or long-running
transactions preventing purge.
To paraphrase a recurring SRE maxim: hope is not a strategy; systems need feedback loops and tested recovery paths.
Fast diagnosis playbook (first/second/third)
First: is it CPU, memory, or I/O?
- Check load and saturation: high load doesn’t mean CPU is the problem. It can be I/O wait or runnable queue chaos.
- Look for I/O wait: if disk is saturated, query tuning won’t help until you stop the bleeding (throttle, kill offenders, add capacity, reduce checkpoint/flush spikes).
- Check memory pressure: swapping on a database host is a slow-motion outage.
Second: are we blocked on locks or waiting on storage?
- PostgreSQL: find blocking PIDs, long transactions, and autovacuum stuck behind locks. If replication is involved, check WAL sender/receiver and slot backlog.
- Percona Server: check active transactions, deadlocks, InnoDB row lock waits, and replication thread state. Validate you’re writing to the intended primary.
Third: is the system diverging (replication, corruption, or partial failure)?
- Replication lag changes your incident response. If your replica is 30 minutes behind, failing over might be data loss.
- Check error logs for CRC errors, fsync failures, full disk, or redo/WAL archiving failures. These are not “later” problems.
- Confirm backups are viable before you make destructive changes. In both ecosystems, it’s too easy to assume.
Hands-on ops tasks with commands (and what you decide from the output)
These are the kinds of commands you actually run when the world is on fire. Each task includes:
the command, what the output means, and what decision you make next.
Task 1 (host): confirm I/O wait vs CPU saturation
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db01) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.31 0.00 4.88 35.22 0.00 47.59
Device r/s w/s rkB/s wkB/s await aqu-sz %util
nvme0n1 220.0 780.0 8800.0 54000.0 18.4 7.12 98.7
Meaning: %iowait is high and the device is near 100% utilized. You’re I/O bound, not CPU bound.
Decision: stop creating more I/O: kill the worst queries, pause batch jobs, reduce checkpoint pressure (Postgres) or flush pressure (InnoDB), and check for disk-full and RAID/NVMe errors.
Task 2 (host): spot swap and memory pressure
cr0x@server:~$ free -m
total used free shared buff/cache available
Mem: 64000 54000 1200 600 8800 6200
Swap: 8192 3900 4292
Meaning: swap is in use. On a database host, that’s usually self-inflicted harm.
Decision: reduce memory consumers now (connection storms, huge work_mem/sort buffers, too-large buffer pool). If you can’t, move load off the node or add RAM. Swapping plus high I/O wait is a classic outage cocktail.
Task 3 (PostgreSQL): see what’s running and what’s waiting
cr0x@server:~$ psql -XAtc "select pid, usename, state, wait_event_type, wait_event, now()-query_start as age, left(query,80) from pg_stat_activity where state <> 'idle' order by age desc limit 10;"
9231|app|active|Lock|transactionid|00:12:33.18291|update orders set status='paid' where id=$1
8120|app|active|IO|DataFileRead|00:09:10.09121|select * from order_items where order_id=$1
...
Meaning: you have a lock wait (transactionid) and I/O waits (DataFileRead). The oldest query is likely blocking others.
Decision: find the blocker and decide whether to cancel/terminate it. Also check if the I/O wait is systemic (see iostat) or a single bad query/index.
Task 4 (PostgreSQL): find blockers fast
cr0x@server:~$ psql -XAtc "select blocked.pid as blocked_pid, blocker.pid as blocker_pid, now()-blocker.query_start as blocker_age, left(blocker.query,80) as blocker_query from pg_locks blocked join pg_locks blocker on blocker.locktype=blocked.locktype and blocker.database is not distinct from blocked.database and blocker.relation is not distinct from blocked.relation and blocker.page is not distinct from blocked.page and blocker.tuple is not distinct from blocked.tuple and blocker.virtualxid is not distinct from blocked.virtualxid and blocker.transactionid is not distinct from blocked.transactionid and blocker.classid is not distinct from blocked.classid and blocker.objid is not distinct from blocked.objid and blocker.objsubid is not distinct from blocked.objsubid and blocker.pid <> blocked.pid join pg_stat_activity blocked_act on blocked_act.pid=blocked.pid join pg_stat_activity blocker on blocker.pid=blocker.pid where not blocked.granted and blocker.granted limit 5;"
9231|7011|00:48:02.01123|alter table orders add column foo text
Meaning: an ALTER TABLE has been holding locks for 48 minutes. That can freeze your application.
Decision: terminate the DDL if it’s not safe, then put a policy in place: online schema changes, lock timeouts, and DDL windows.
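A minimal sketch of the lock-timeout policy: run the same DDL with a short lock_timeout so it fails fast instead of queueing behind (and in front of) the whole application. The timeout value is illustrative:
cr0x@server:~$ psql -Xc "set lock_timeout = '5s'; alter table orders add column foo text;"
If the lock isn't acquired within 5 seconds the statement errors out, and you retry in a quieter window instead of freezing production.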
Task 5 (PostgreSQL): check replication lag and WAL backlog
cr0x@server:~$ psql -XAtc "select application_name, state, sync_state, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) as byte_lag from pg_stat_replication;"
replica01|streaming|async|512 MB
Meaning: your replica is behind by ~512MB of WAL. That might be seconds or minutes depending on write rate.
Decision: don’t fail over blindly. Check if the lag is shrinking. If it’s growing, you have a downstream problem (network, disk, replay).
Task 6 (PostgreSQL): check replication slots for runaway WAL retention
cr0x@server:~$ psql -XAtc "select slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained from pg_replication_slots;"
logical_slot_1|f|97 GB
Meaning: an inactive logical slot is retaining 97GB of WAL. Disk-full is now a scheduled incident.
Decision: either re-enable the consumer or drop the slot if it’s truly dead. Then add monitoring/alerts for slot retention.
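If the consumer is confirmed dead, dropping the slot releases the retained WAL; the slot name below is the one from the output above:
cr0x@server:~$ psql -XAtc "select pg_drop_replication_slot('logical_slot_1');"
Be sure first: dropping a slot that a logical consumer still needs means rebuilding that consumer from scratch.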
Task 7 (PostgreSQL): check checkpoint pressure
cr0x@server:~$ psql -XAtc "select checkpoints_timed, checkpoints_req, round(checkpoint_write_time/1000.0) as write_s, round(checkpoint_sync_time/1000.0) as sync_s from pg_stat_bgwriter;"
120|980|8420|3110
Meaning: checkpoints requested (980) dwarf timed (120). The system is forcing frequent checkpoints, usually due to WAL volume and small max_wal_size.
Decision: increase max_wal_size and tune checkpoint_completion_target; then verify storage can handle steady writeback. Don’t “fix” this by turning fsync off unless you enjoy career development in the job market.
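A sketch of that change via ALTER SYSTEM so it survives restarts; the sizes are illustrative and should be chosen against your actual write rate and disk headroom:
cr0x@server:~$ psql -Xc "alter system set max_wal_size = '16GB';"
cr0x@server:~$ psql -Xc "alter system set checkpoint_completion_target = 0.9;"
cr0x@server:~$ psql -XAtc "select pg_reload_conf();"
Both settings take effect on reload; no restart needed.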
Task 8 (Percona Server): identify the primary and replication health
cr0x@server:~$ mysql -NBe "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_SQL_Error"
Slave_IO_Running: Yes
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
Last_SQL_Error: Error 'Duplicate entry' on query. Default database: 'app'. Query: 'INSERT INTO ...'
Meaning: IO thread is running, SQL thread stopped, lag is unknown. Replica is broken, not just behind.
Decision: do not promote this replica. Fix replication (skip/repair with extreme caution), or rebuild. Also diagnose why a duplicate happened—often a misconfigured multi-writer or non-deterministic statements.
Task 9 (Percona Server): confirm GTID mode and failover safety
cr0x@server:~$ mysql -NBe "SHOW VARIABLES LIKE 'gtid_mode'; SHOW VARIABLES LIKE 'enforce_gtid_consistency';"
gtid_mode ON
enforce_gtid_consistency ON
Meaning: GTID is enabled and consistency enforced. That makes failover tooling and replica re-pointing cleaner.
Decision: if you do planned/unplanned failover, prefer GTID-enabled topologies. If it’s off, expect manual binlog positions and more 3AM risk.
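For context, re-pointing a replica under GTID auto-positioning looks roughly like this; the host name is hypothetical, and newer releases spell it CHANGE REPLICATION SOURCE TO / START REPLICA:
cr0x@server:~$ mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_HOST='db02.internal', MASTER_AUTO_POSITION=1; START SLAVE;"
Without GTID you would be fishing binlog file names and positions out of logs at 3AM, which is exactly the kind of manual step that goes wrong under pressure.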
Task 10 (Percona Server): read InnoDB engine status for lock and purge pain
cr0x@server:~$ mysql -NBe "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
...
TRANSACTIONS
------------
Trx id counter 123456789
Purge done for trx's n:o < 123450000 undo n:o < 0 state: running but idle
History list length 987654
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 123456111, ACTIVE 1865 sec
...
Meaning: history list length is huge; purge can’t keep up, often due to long-running transactions.
Decision: find and end long transactions (or fix app behavior). Otherwise, undo grows, performance degrades, and you’ll end up “optimizing” by restarting—aka turning it off and on again with extra steps.
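Finding the long transactions is usually one query against information_schema; a sketch:
cr0x@server:~$ mysql -NBe "SELECT trx_mysql_thread_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s, LEFT(COALESCE(trx_query,''),80) AS query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5;"
From there it's KILL <thread_id>, or a conversation with whoever owns the connection pool that left the transaction open.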
Task 11 (Percona Server): detect deadlocks and decide what to kill
cr0x@server:~$ mysql -NBe "SHOW ENGINE INNODB STATUS\G" | egrep -n "LATEST DETECTED DEADLOCK|TRANSACTION|WAITING FOR THIS LOCK"
2345:LATEST DETECTED DEADLOCK
2361:*** (1) TRANSACTION:
2388:*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
2410:*** (2) TRANSACTION:
Meaning: deadlocks are happening. InnoDB usually resolves by rolling back one transaction.
Decision: if deadlocks spike, look for new code paths or missing indexes. Killing random threads is rarely the fix; correcting access patterns is.
Task 12 (MySQL/Percona): confirm buffer pool and hit rate pressure
cr0x@server:~$ mysql -NBe "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
Innodb_buffer_pool_reads 9876543
Innodb_buffer_pool_read_requests 1234567890
Meaning: buffer pool reads are significant; compare growth rates over time. If reads from disk climb rapidly, your cache isn’t holding the working set.
Decision: either increase buffer pool (if RAM allows), reduce working set (indexes, query fixes), or stop pretending spinning disks are “fine because it’s mostly reads.”
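Checking and (on recent versions) resizing the buffer pool online is straightforward; the target size below is illustrative and should leave real headroom for the OS, connections, and backups:
cr0x@server:~$ mysql -NBe "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
cr0x@server:~$ mysql -e "SET GLOBAL innodb_buffer_pool_size = 34359738432;"
That value is 32GB; on the 64GB host from Task 2, going much higher is how you end up back at the swap problem.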
Task 13 (PostgreSQL): find top queries by total time (pg_stat_statements)
cr0x@server:~$ psql -XAtc "select round(total_exec_time) as total_ms, calls, round(mean_exec_time,2) as mean_ms, left(query,90) from pg_stat_statements order by total_exec_time desc limit 5;"
983412|1203|817.33|select * from events where user_id=$1 order by created_at desc limit 50
...
Meaning: the top total-time query is both frequent and slow-ish. That’s your ROI.
Decision: add the right index, fix sorting/pagination, or adjust query shape. Then re-check. Don’t tune kernel parameters before you fix the obvious query.
Task 14 (PostgreSQL): check bloat-ish signals (dead tuples) and autovacuum
cr0x@server:~$ psql -XAtc "select relname, n_live_tup, n_dead_tup, round(100.0*n_dead_tup/nullif(n_live_tup+n_dead_tup,0),2) as dead_pct from pg_stat_user_tables order by n_dead_tup desc limit 5;"
orders|12000000|4800000|28.57
events|90000000|11000000|10.89
Meaning: large dead tuple counts. If autovacuum can’t keep up, query performance will decay and indexes bloat too.
Decision: tune autovacuum for hot tables, consider manual VACUUM (or VACUUM FULL in controlled windows), and fix transaction patterns that keep tuples from being vacuumed (long txns).
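Per-table autovacuum tuning is a storage-parameter change; a minimal sketch for the hot table above, with illustrative thresholds:
cr0x@server:~$ psql -Xc "alter table orders set (autovacuum_vacuum_scale_factor = 0.02, autovacuum_vacuum_cost_delay = 2);"
Lowering the scale factor makes autovacuum visit the table far more often instead of waiting for a fifth of it to be dead.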
Task 15 (Percona Server): see currently running queries and their time
cr0x@server:~$ mysql -NBe "SHOW FULL PROCESSLIST;" | head
12345 app 10.0.2.15:44210 appdb Query 35 Sending data SELECT ... FROM big_table ...
12346 app 10.0.2.16:55122 appdb Sleep 1200 NULL
Meaning: a query has been running for 35 seconds and is “Sending data” (often scanning or sorting). There are also long sleeping connections.
Decision: kill or optimize offenders, cap max connections or use pooling, and investigate why connections are left sleeping (application leaks, pool misconfig).
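The immediate moves, as a sketch (thread IDs are the ones from the processlist above; the timeout is illustrative):
cr0x@server:~$ mysql -e "KILL 12345;"
cr0x@server:~$ mysql -e "SET GLOBAL wait_timeout = 600;"
Note that a new global wait_timeout only applies to sessions created after the change; the existing 1200-second sleeper still needs an explicit KILL or an application-side fix.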
Task 16 (host): confirm disk-full before you chase ghosts
cr0x@server:~$ df -h /var/lib
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p3 900G 890G 10G 99% /var/lib
Meaning: you are essentially out of disk. Both PostgreSQL and InnoDB behave badly here, just in different dialects of pain.
Decision: free space immediately (old logs, old backups, rotated WAL/binlogs with caution), expand volume, and then do a postmortem on retention settings.
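Before deleting anything, confirm what is actually eating the space; WAL, binlogs, and forgotten backup copies are the usual suspects:
cr0x@server:~$ du -xh --max-depth=2 /var/lib | sort -h | tail -15
For WAL and binlogs specifically, fix the retention cause (archiving, slots, expire settings) rather than deleting files by hand.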
Backups & recovery: the 3AM truth
PostgreSQL: backups are straightforward, restores are where you earn trust
PostgreSQL gives you two mainstream backup modes:
logical (pg_dump) and physical (base backup + WAL). Logical backups are portable,
slow at scale, and great for migrations. Physical backups are fast to restore and good for disaster recovery,
but need WAL archiving (or continuous streaming) to get point-in-time recovery.
The operational simplicity win in Postgres is conceptual: if you understand WAL, you understand crash recovery,
replication, and PITR. It’s one mental model with different tools.
The operational footguns are also consistent: if WAL archiving breaks, PITR breaks. If replication slots retain WAL,
disks fill. If you don’t practice restore, your “backup” is a comforting folder of lies.
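A minimal physical-backup sketch with pg_basebackup; the host and user are illustrative, and real setups layer compression, retention, and verification on top:
cr0x@server:~$ pg_basebackup -h db01.internal -U backup -D /backups/base/20251230 -X stream -c fast -P
Pair it with healthy WAL archiving (archive_command or pg_receivewal), otherwise you have a snapshot, not point-in-time recovery.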
Percona Server: XtraBackup is great, but you must respect the restore pipeline
Percona Server operators typically use:
logical dumps (mysqldump/mysqlpump) and physical hot backups (XtraBackup).
XtraBackup is fast and usually the right answer for large datasets. It also introduces an operational reality:
a physical backup is not “one file.” It’s a directory tree, metadata, and a prepare/apply-log step.
If your restore path isn’t scripted and rehearsed, it will betray you in the moment you need it.
MySQL/Percona PITR often relies on binlogs plus a base backup. That’s powerful but operationally more fiddly:
you must track binlog retention, binlog format, GTID mode, and apply sequence.
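The shape of that pipeline, as a sketch; paths are illustrative and exact flags depend on your XtraBackup and server versions:
cr0x@server:~$ xtrabackup --backup --target-dir=/backups/full/20251230
cr0x@server:~$ xtrabackup --prepare --target-dir=/backups/full/20251230
cr0x@server:~$ xtrabackup --copy-back --target-dir=/backups/full/20251230
The copy-back step expects an empty data directory and correct file ownership afterward; small details like these are exactly what restore drills exist to catch.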
Who’s simpler at 3AM for recovery?
If your org is disciplined with WAL archiving and you test restores, PostgreSQL recovery is usually simpler and more predictable.
If your org is already deep in the MySQL world and you have hardened XtraBackup restore automation, Percona can be extremely smooth.
The deciding factor is not tool availability. It’s whether you have a practiced, documented restore runbook and a staging restore drill.
Replication & failover: what breaks and how loudly
PostgreSQL replication: fewer modes, fewer weird surprises
PostgreSQL’s streaming replication is physical: replicas replay WAL. That makes it consistent with how the primary
commits changes. Logical replication exists, but most operational failover topologies rely on physical replication.
Operationally, this is good news: fewer replication formats, fewer “this statement is non-deterministic” issues.
You still have to manage replication lag, network partitions, and promotion tooling. But the typical 3AM failure
mode is obvious: “replica is behind” or “replica can’t keep up because disk is slow.”
Percona Server replication: flexible, mature, and easy to misunderstand
MySQL replication has improved massively over the years: GTID, row-based replication, semi-sync options,
multi-threaded appliers. Percona Server tracks this ecosystem and often exposes more instrumentation.
The 3AM danger is human: people assume async replication is “basically sync,” or they assume read-after-write
consistency across replicas, or they assume failover is just a DNS change. Then they discover that replication
lag is not a suggestion; it’s physics.
If you run Percona Server, get religious about:
row-based binlog format, GTID everywhere, and automated failover that also fences the old primary.
The fencing part is how you avoid split-brain writes—one of the most expensive classes of incident.
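The minimum software-level fence on the old primary, as a sketch (network or load-balancer fencing is the real backstop, since a wedged host may not accept this command at all):
cr0x@server:~$ mysql -e "SET GLOBAL super_read_only = ON;"
super_read_only also blocks writes from users with elevated privileges, which is the point.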
Performance tuning: safe changes vs regret
PostgreSQL tuning that actually helps at 3AM
PostgreSQL performance incidents are often query-plan and I/O related. Safe levers:
adjust connection count (pooling), fix missing indexes, and tune checkpoint/WAL settings to avoid periodic write storms.
Memory settings like work_mem can help but are also a common self-own: the limit applies per sort or hash operation, so a single complex query can claim several multiples of it, and concurrent sessions multiply that again.
The most “3AM-safe” changes are those that reduce load without changing correctness: cancel a query, add a missing index
(carefully, maybe concurrently), adjust statement timeouts, and reduce concurrency.
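Two of those levers as sketches: a statement timeout scoped to the application role, and a concurrent index build for the hot query from Task 13. Names and values are illustrative:
cr0x@server:~$ psql -Xc "alter role app set statement_timeout = '30s';"
cr0x@server:~$ psql -Xc "create index concurrently if not exists idx_events_user_created on events (user_id, created_at desc);"
CONCURRENTLY avoids blocking writes but takes longer and can leave an invalid index behind if it fails, so check the index state afterward.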
Percona Server tuning that actually helps at 3AM
InnoDB is usually the center. Safe levers:
ensure buffer pool is sized sensibly, keep innodb_flush_log_at_trx_commit and sync_binlog at durable settings unless you have a written risk acceptance,
and make sure you’re not choking on too many connections or thread scheduling overhead.
Many MySQL “performance wins” are actually durability trades. They look great until a power event or kernel panic.
If you must change durability settings, do it as a conscious business decision, not as a midnight experiment.
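A quick audit of where you actually stand, so a "temporary" midnight change can't hide:
cr0x@server:~$ mysql -NBe "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog');"
The durable pairing is 1 and 1; anything else should exist only as a documented, signed-off trade.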
Turning off fsync is like removing the smoke detector because it's loud: briefly peaceful, then educational.
Storage and filesystem realities: I/O, durability, and surprises
Both databases are storage engines with opinions. If your storage lies, they will faithfully persist that lie.
The 3AM simplicity question often comes down to “how does this database behave under I/O pain?”
PostgreSQL under I/O pain
- Checkpoint spikes can create periodic latency storms if poorly tuned.
- Autovacuum can become either a hero (preventing bloat) or a villain (if it competes with workload on slow disks).
- WAL archiving is a hard dependency for PITR; broken archive is an incident, not a warning.
InnoDB under I/O pain
- Redo log pressure and flushing behavior can cause stalls if the log subsystem is constrained.
- Long transactions prevent purge and bloat undo/history, which then causes more I/O.
- Dirty page flushing and checkpointing can cause throughput collapse when disks saturate.
Operator advice: pick your storage battles
If you’re on cloud block storage, test your worst-case IOPS, not your average. If you’re on local NVMe,
plan for device failure and rebuild behavior. And regardless: monitor disk latency, queue depth, and filesystem fullness.
Disk-full incidents are embarrassingly common because storage problems are quiet until they aren’t.
Three corporate mini-stories (realistic, anonymized, technically accurate)
1) Incident caused by a wrong assumption: “replica reads are always fresh”
A mid-sized SaaS company ran Percona Server with one primary and two replicas. The architecture doc said:
“Reads go to replicas, writes go to primary.” The load balancer dutifully followed that rule.
Nobody wrote down the second rule: “Some reads must be consistent with the user’s most recent write.”
A product launch increased writes. Replication lag climbed from “usually negligible” to “noticeable.”
Support tickets started: customers saved settings, refreshed, and the settings reverted. The app wasn’t losing data.
It was reading old data from replicas. From the user’s perspective, that’s indistinguishable from data loss.
At 3AM the on-call tried the usual moves: restart app pods, increase DB connections, scale replicas.
It made things worse. More connections increased write contention on the primary, which increased lag, which increased inconsistency.
The system behaved exactly as designed—just not as assumed.
The fix was boring and effective. They routed “read-after-write” paths to the primary (or used session stickiness),
added lag-aware routing (stop sending reads to a replica above a threshold), and set clearer expectations in code:
“eventual consistency is acceptable here, not there.” They also implemented GTID-based failover fencing,
because the incident revealed how casually failover had been discussed.
Operational lesson: Percona’s replication can be solid, but it will not rescue you from treating async replication like magic.
Postgres has the same physics, but teams often design around it earlier because streaming replication lag is frequently monitored as “WAL bytes behind,” which feels more concrete than “seconds behind master.”
2) Optimization that backfired: “bigger memory settings for faster sorts”
A data platform team ran PostgreSQL for analytics-ish workloads on a shared cluster. They had a backlog of slow queries
with large sorts and hashes. Someone proposed raising work_mem significantly. The test environment looked better.
The graphs smiled. The change rolled out.
Two hours later the primary started swapping. Latency exploded. Autovacuum fell behind.
Then replication lag appeared because WAL replay on the replica fell behind due to disk contention.
The incident channel filled with the usual suspects: “network?”, “kernel bug?”, “did we get DDoSed?”
What happened was simple: work_mem is per operation. Under concurrency, large work_mem multiplied
into real memory consumption. The database didn’t “use more memory for one query.” It used more memory for hundreds.
Once it hit swap, the system spent the night thrashing, doing expensive disk I/O to support memory overcommit.
The rollback fixed the immediate pain. The actual improvement was more nuanced: they added the right indexes,
reduced concurrency for heavy reports, and used query timeouts and resource queues (at the app scheduler level)
so one noisy workload couldn’t take the whole node hostage.
Operational lesson: changes that look safe in isolation can be catastrophic under concurrency. In Postgres, memory
settings are deceptively sharp. In Percona/MySQL land, the equivalent backfire is often “make the buffer pool huge”
while forgetting about filesystem cache, OS headroom, or backup/restore memory needs.
3) Boring but correct practice that saved the day: restore drills and promotion checklists
A payments-adjacent company ran PostgreSQL with a strict policy: monthly restore drills to a staging environment,
and quarterly “promote a replica” game days. It was the kind of practice that never wins internal awards because
it doesn’t ship features. It also makes incidents shorter and less dramatic, which is deeply unpopular with people who like drama.
One night, a storage controller went unhealthy and started returning intermittent I/O errors. PostgreSQL began logging
fsync failures. The team didn’t debate whether the logs were “real.” Their runbook treated fsync failures as a
“stop-the-world” risk. They fenced the node, promoted a replica, and moved traffic.
The key moment: they already knew the replica could be promoted cleanly because they had done it repeatedly.
They also knew their PITR chain was intact because restore drills verified WAL archives and base backups.
When leadership asked “are we safe?” the on-call could answer with evidence, not optimism.
Operational lesson: practice turns a 3AM incident into a checklist. PostgreSQL’s tooling aligns well with this discipline,
but the same approach works for Percona too—especially if you validate XtraBackup restores and binlog apply procedures regularly.
Common mistakes: symptom → root cause → fix
1) Symptom: disk usage climbs forever on PostgreSQL
Root cause: inactive replication slot retaining WAL, or WAL archiving failing and WAL piling up.
Fix: inspect pg_replication_slots, drop unused slots, fix the consumer, and alert on retained WAL size. Validate archive_command and permissions.
2) Symptom: periodic latency spikes every few minutes (PostgreSQL)
Root cause: aggressive checkpointing due to small max_wal_size or mis-tuned checkpoint settings.
Fix: raise max_wal_size, set checkpoint_completion_target closer to 0.9, ensure storage can sustain steady writes, and confirm you’re not saturating I/O.
3) Symptom: queries get slower over days; indexes seem “less effective” (PostgreSQL)
Root cause: table and index bloat from insufficient vacuum or long-running transactions preventing cleanup.
Fix: tune autovacuum per hot table, eliminate long transactions, and schedule reindex/vacuum operations appropriately.
4) Symptom: replica shows Seconds_Behind_Master: NULL (Percona/MySQL)
Root cause: SQL thread stopped due to error (duplicate key, missing table, schema drift).
Fix: check Last_SQL_Error, decide whether to rebuild or carefully repair. Prefer GTID and consistent schema deployment to avoid drift.
5) Symptom: “deadlocks increased” after a release (Percona/MySQL)
Root cause: new query path changes lock order or missing index causes broader locks and longer lock hold times.
Fix: analyze deadlock logs, add indexes, and enforce consistent transaction ordering in application code.
6) Symptom: InnoDB stalls with high history list length (Percona/MySQL)
Root cause: long-running transactions preventing purge; sometimes large read transactions or idle transactions left open.
Fix: identify and terminate long transactions, set sensible timeouts, and fix app connection handling. Monitor history list length.
7) Symptom: after failover, writes happen on two nodes (either ecosystem)
Root cause: no fencing; clients still connected to old primary; split-brain risk.
Fix: implement fencing at the network/load balancer level; enforce single-writer via automation and health checks; block old primary from accepting writes.
8) Symptom: backups “succeed,” restores fail (either ecosystem)
Root cause: backups not tested; missing WAL/binlogs; permissions/paths wrong; encryption keys absent; restore steps undocumented.
Fix: schedule restore drills, automate restore, verify checksums, and treat restore time objectives as production requirements.
Checklists / step-by-step plan
Checklist A: 3AM triage (works for both)
- Confirm impact scope: single service or everything? Is it latency, errors, or data correctness?
- Check host saturation: CPU, memory, I/O wait, disk full.
- Check database liveness: can you connect? Are queries progressing or stuck?
- Identify top offenders: longest-running queries, lock wait chains, connection storms.
- Check replication state before any failover: lag, stopped threads, WAL retention, etc.
- Stop the bleeding: kill runaway queries, pause batch jobs, throttle traffic, shed load.
- Only then tune: cautious config changes that reduce pressure; avoid correctness/durability trades.
- Make a recovery decision: stay and stabilize vs failover vs restore.
Checklist B: PostgreSQL “safe stabilization” steps
- Find blockers and terminate if needed (especially long DDL holding locks).
- Cancel runaway queries to reduce I/O and lock contention.
- Check WAL archiving and slot retention to prevent disk-full cascades.
- Verify autovacuum health on the hottest tables; tune per-table if necessary.
- Confirm checkpoint pressure; adjust WAL/checkpoint settings during a calm window, not mid-panic unless the alternative is outage.
Checklist C: Percona Server “safe stabilization” steps
- Confirm which node is writable primary; verify the app is writing to it.
- Check replication threads; don’t promote a broken replica.
- Inspect InnoDB status: long transactions, purge pressure, deadlocks.
- Reduce connection storms; consider temporary connection caps or pooling fixes.
- Validate durability settings before changing them; if you change them, document the risk and revert plan.
Step-by-step: practice restores (the thing that makes 3AM survivable)
- Pick one production backup set and restore it into an isolated environment.
- For PostgreSQL: restore the base backup, replay WAL to a target timestamp (see the sketch after this list), and run integrity checks (queries + application smoke tests).
- For Percona/XtraBackup: restore directory, apply logs (prepare), start server, apply binlogs/GTID as needed for PITR, validate with smoke tests.
- Measure restore time and document it as your real RTO.
- Repeat until you can do it from a runbook without improvisation.
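A sketch of the PostgreSQL PITR step referenced above, assuming PostgreSQL 12 or newer and a WAL archive reachable from the restore host; paths and the timestamp are illustrative:
cr0x@server:~$ cat >> /restore/pgdata/postgresql.conf <<'EOF'
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2025-12-30 02:55:00'
EOF
cr0x@server:~$ touch /restore/pgdata/recovery.signal
cr0x@server:~$ pg_ctl -D /restore/pgdata start
By default recovery pauses when it reaches the target, so you can inspect the data before deciding to promote.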
FAQ
1) If I only care about 3AM operations, should I choose PostgreSQL?
If you’re starting from scratch and you value predictable recovery semantics, yes, PostgreSQL is often the safer bet.
But if your team already has mature MySQL/Percona runbooks and tooling, operational maturity beats theoretical simplicity.
2) Is Percona Server harder than “vanilla MySQL” to operate?
Usually no. Percona Server is typically chosen because it improves operator visibility and performance tooling compatibility.
The complexity comes from the MySQL replication and topology choices, not from Percona branding.
3) Which one is more likely to surprise me with disk usage?
PostgreSQL surprises people with WAL retention (especially replication slots) and bloat if vacuum is neglected.
Percona/MySQL surprises people with binlog retention, undo/history growth, and backups piling up.
Either way: disk is a first-class SLO.
4) What’s the simplest reliable backup approach for each?
PostgreSQL: physical base backups plus WAL archiving for PITR, with regular restore drills.
Percona: XtraBackup full/incremental strategy plus binlog retention for PITR, with regular restore drills and GTID validation.
5) Which is easier to fail over safely?
PostgreSQL failover is conceptually clean (promote a physical replica), but you must manage client routing and fencing.
Percona/MySQL failover can be smooth with GTID and automation, but split-brain and replication breakage are more common failure modes.
6) What’s the most common 3AM “self-own” in PostgreSQL?
Letting long transactions and vacuum issues accumulate until bloat and lock contention turn into an outage.
The second most common is misconfigured WAL archiving or forgotten replication slots filling the disk.
7) What’s the most common 3AM “self-own” in Percona Server?
Treating async replication like sync and promoting the wrong replica, or failing over without fencing the old primary.
Also: “optimization” via durability trade-offs that later become data loss during a crash.
8) Can I make either one “simple at 3AM” regardless of choice?
Yes. The recipe is boring: consistent monitoring, tested backups, standardized failover procedures, clear ownership of schema changes,
and an incident playbook that starts with saturation and correctness checks.
9) Which one is easier to debug query performance under pressure?
PostgreSQL with pg_stat_statements and clear wait events is excellent for targeted diagnosis.
Percona/MySQL has strong instrumentation too (performance schema, InnoDB status), but teams often underuse it.
The easier system is the one your team practices with monthly, not the one you admire quarterly.
Practical next steps
If you’re choosing between PostgreSQL and Percona Server based on “3AM simplicity,” make the decision on operational reality:
your team’s skills, your automation, and your tolerance for specific failure modes.
- Pick one “golden path” topology (one primary, defined replicas, defined failover method) and forbid ad-hoc variations.
- Write the 3AM runbook now: saturation checks, lock checks, replication checks, disk-full procedure, and a “don’t do this” list.
- Practice restore until it’s a routine chore. If it’s not practiced, it’s not a backup.
- Instrument the real risks: Postgres WAL/slots/vacuum; Percona replication threads/GTID/binlog retention/InnoDB purge.
- Fence your failovers so you can’t write to two primaries. This is the difference between an incident and a catastrophe.
Finally, be honest about what you want: if you want fewer moving parts and more coherent semantics, PostgreSQL tends to be calmer at 3AM.
If you want a MySQL ecosystem with strong tooling and your organization already runs it well, Percona Server can be just as calm—if you respect replication and durability.