MySQL vs PostgreSQL Backups and Restores: Who Gets You Back Online Faster

When the database is down, nobody cares how elegant your schema is. They care how quickly the checkout page returns, how fast the incident channel quiets down, and whether your “we have backups” claim survives contact with reality.

This is a practical, production-grade look at MySQL and PostgreSQL backups and restores through a single lens: time to get back online. Not “time to complete a restore job eventually,” but time to a service that can safely take traffic again—with the least data loss you can credibly promise.

The real question: what does “back online” mean?

Backups and restores are not a beauty contest. The only scoring system that matters is the one your business uses during an outage:

  • RTO (Recovery Time Objective): how long you can be down.
  • RPO (Recovery Point Objective): how much data you can lose.
  • Safety: whether the recovered system is consistent and won’t corrupt itself under load.
  • Confidence: whether the restore path is rehearsed or theoretical.

“Back online” also has tiers. Be explicit.

  • Tier 0: database process starts.
  • Tier 1: can accept read-only traffic safely.
  • Tier 2: can accept full read/write traffic, replication configured, monitoring green.
  • Tier 3: caught up to a target timestamp (PITR), application has reconnected, the queue is draining.

My strong opinion: if your runbooks don’t define the tier you’re restoring to, you don’t have runbooks—you have vibes.

Fast answer: who’s faster, when?

For most production systems, physical backups win

If you’re trying to minimize downtime, physical backups are the default for both MySQL and PostgreSQL:

  • MySQL: Percona XtraBackup (for InnoDB) or vendor equivalents; plus binlogs for PITR.
  • PostgreSQL: pgBackRest (or similar) plus WAL archiving for PITR; or filesystem/base backups with WAL.

Logical dumps (mysqldump, pg_dump) have their place—migrations, selective restores, smaller datasets, portability—but they are rarely your fastest way back after a crater-sized incident.

Where MySQL tends to get you online faster

  • When you can promote an existing replica: MySQL + semi-sync/async replication can make failover feel “instant” if you already have a healthy replica and you’re okay with some RPO risk. (The same is true for PostgreSQL, but the tooling and operational patterns differ.)
  • When your restore is mostly “copy files + crash recover” and binlog catch-up is short. In practice, many MySQL restores are dominated by I/O copy speed and then a bounded crash recovery step.
  • When your backup toolchain is mature (XtraBackup in particular is battle-tested and optimized for large InnoDB estates).

Where PostgreSQL tends to get you online faster

  • When you need precise, trustworthy PITR to a timestamp or transaction boundary. PostgreSQL’s WAL-based recovery is straightforward, well-defined, and the ecosystem (pgBackRest, WAL-G, etc.) is excellent.
  • When you can restore and start serving reads while replay continues (depending on architecture and tolerances). PostgreSQL recovery behavior and visibility can be more predictable when you invest in the right settings and monitoring.
  • When you rely on checksum + WAL discipline and want stronger guardrails against silent corruption. It doesn’t make restores faster by magic, but it can prevent “restore succeeded, data is trash” incidents that waste hours.

The uncomfortable truth

In real outages, the database engine is rarely the main delay. The clock is murdered by:

  • copying terabytes over a saturated link,
  • decompressing on the wrong CPU profile,
  • replaying logs on slow random I/O,
  • waiting for DNS/connection pools/app deploys to notice the new primary,
  • or discovering your “backups” were missing a crucial piece (WAL/binlogs, keys, configs, users, grants).

Choose the database you like. But for recovery speed, choose the restore architecture: replicas, physical backups, WAL/binlog retention, and rehearsals.

Interesting facts & historical context (short and useful)

  1. MySQL’s InnoDB became the default in MySQL 5.5, and with it came a long era where physical backup tooling mattered more than logical dumps for big systems.
  2. PostgreSQL introduced streaming replication in 9.0, which shifted many orgs from “restore from backup” thinking to “promote a standby” thinking.
  3. WAL (PostgreSQL) and binlogs (MySQL) solve similar problems, but operational expectations differ: WAL archiving is usually treated as a first-class PITR requirement; binlog handling is sometimes treated as “optional until it isn’t.”
  4. Percona XtraBackup popularized hot physical backups for MySQL at scale, especially where taking downtime for backup wasn’t acceptable.
  5. PostgreSQL’s base backup + WAL replay model has been stable for years, which is why backup tools mostly compete on usability, storage efficiency, and verification—less on correctness.
  6. Checksum support in PostgreSQL (cluster-level checksums) pushed more teams to catch corruption earlier; restoring quickly is nice, restoring correct data is nicer.
  7. MySQL GTIDs changed how teams reason about promotion and binlog-based recovery—less “what file/position,” more “what transaction set.” This can speed up failover decision-making when used correctly.
  8. Compression became a restore-time tax as backups moved to object storage. The cheapest storage tier is often the slowest thing between you and your RTO.

Backup types that matter for recovery speed

Physical backups (file-level): your main downtime weapon

Physical backups copy the actual database files (or page-level deltas) and restore by putting them back. Restore speed tends to scale with:

  • sequential read/write throughput (storage and network),
  • CPU for decompression and checksums,
  • log replay speed (WAL/binlog application),
  • and how many files you create (filesystem metadata overhead is a real villain).

In MySQL land, physical backups typically mean XtraBackup-produced datadir copies plus redo/undo handling. In PostgreSQL land, physical backups mean a base backup plus WAL segments to reach a target point.
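
To make that concrete, here is a minimal sketch of taking each kind of physical backup, assuming XtraBackup is installed on the MySQL host and a pgBackRest stanza named “prod” already exists (paths and the stanza name are illustrative):

cr0x@server:~$ # MySQL: hot physical backup of the datadir with XtraBackup
cr0x@server:~$ sudo xtrabackup --backup --target-dir=/backups/mysql/full-$(date +%F)

cr0x@server:~$ # PostgreSQL: full physical backup via pgBackRest (base backup plus archived WAL)
cr0x@server:~$ sudo -u postgres pgbackrest --stanza=prod --type=full backup

Both produce file-level copies whose restore speed is governed by the physics above, not by SQL.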

Logical backups (SQL dumps): the slow-but-portable option

Logical restores are slower because you’re rebuilding data through SQL. That means:

  • parsing SQL,
  • rebuilding indexes and constraints,
  • writing many small transactions unless tuned carefully,
  • and potentially hammering WAL/binlog generation during restore.

They’re still valuable for:

  • partial restores (one schema, one table),
  • cross-version/cross-engine migrations,
  • auditable, human-readable snapshots,
  • and as a hedge against “physical backup tool bug” nightmares.
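
A hedged sketch of the logical path, with database and path names that are purely illustrative:

cr0x@server:~$ # MySQL: consistent dump of InnoDB data without locking everything
cr0x@server:~$ mysqldump --single-transaction --routines --triggers --all-databases > /backups/mysql/logical-all.sql

cr0x@server:~$ # PostgreSQL: directory-format dump so the restore can run in parallel
cr0x@server:~$ pg_dump -Fd -j 4 -f /backups/pg/appdb.dump appdb
cr0x@server:~$ pg_restore -j 4 -d appdb /backups/pg/appdb.dump

Even with parallel restore you are still rebuilding indexes and constraints from scratch; parallelism shortens the pain, it doesn’t remove it.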

Replication-based recovery: the fastest restore is no restore

If you can promote a standby/replica, you can often beat any backup restore. But you pay for it with:

  • continuous operational care (replication lag, schema changes, maintenance),
  • failure modes where replicas share the same corruption or bad writes,
  • and the need for logical error recovery (PITR) when the data is wrong but consistent.
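
When the decision is “promote,” the database-side step is small; everything around it is the hard part. A minimal sketch, assuming a healthy PostgreSQL standby and a verified MySQL replica on 8.0.22+ (hostnames, paths, and service layout are illustrative):

cr0x@server:~$ # PostgreSQL: promote the standby (either form works on supported versions)
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_promote();"
cr0x@server:~$ # or: sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main

cr0x@server:~$ # MySQL: stop applying from the old primary and open the replica for writes
cr0x@server:~$ mysql -uroot -p -e "STOP REPLICA; SET GLOBAL super_read_only=OFF; SET GLOBAL read_only=OFF;"

The MySQL side is deliberately oversimplified: real promotion also means re-pointing the other replicas and fencing the old primary so it cannot come back and take writes.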

Joke #1: Backups are like insurance—expensive, boring, and the day you need them is always a terrible day to learn the policy details.

MySQL recovery path: what actually takes time

What “restore” means in a MySQL outage

For MySQL (InnoDB-heavy, which is most real systems), recovery usually breaks down like this:

  1. Get a consistent physical copy of the datadir (from backup storage) onto a target server.
  2. Prepare/apply redo if your backup tool requires it (XtraBackup “prepare”).
  3. Start mysqld and let crash recovery finish.
  4. Apply binlogs if you’re doing PITR beyond the backup snapshot time.
  5. Rebuild replication topology (GTID or file/pos), verify, and cut over traffic.

The biggest time sinks (MySQL)

  • Decompression + file writes: if your backup is compressed and your restore host has fewer cores (or worse CPU), you get surprise slowness.
  • Many small files: per-table tablespaces (ibd files) can create metadata storms on restore. XFS/EXT4 tuning helps, but this is physics and syscalls.
  • Redo log application: heavy write workloads generate more redo; “prepare” can take a while, and crash recovery after start can too.
  • Binlog catch-up: if you’re applying hours of binlogs, you’re basically redoing your traffic the hard way. Your bottleneck might be single-threaded apply (see the sketch after this list), disk latency, or just volume.
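
If binlog catch-up is your recurring bottleneck, parallel apply on the replica is worth checking before an incident, not during one. A minimal sketch, assuming MySQL 8.0.26+ (older versions use slave_parallel_workers); the worker count is an example, not a recommendation:

cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'replica_parallel_workers';"
cr0x@server:~$ mysql -uroot -p -e "SET PERSIST replica_parallel_workers = 4;"
cr0x@server:~$ mysql -uroot -p -e "STOP REPLICA SQL_THREAD; START REPLICA SQL_THREAD;"

Whether parallel apply actually helps depends on your workload’s transaction dependencies; measure it against your own binlogs, not someone else’s benchmark.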

What MySQL does well for recovery speed

MySQL can be brutally fast to get running if you restore a prepared backup to fast storage and accept that crash recovery plus buffer pool warmup will happen live. With the right topology, you can also dodge restores entirely by promoting a replica.

MySQL sharp edges you feel during restores

  • Binlog retention assumptions: “We keep binlogs for 24 hours” is a plan until you need 30 hours (a retention sketch follows this list).
  • GTID misconfigurations: a GTID-based failover is fast when correct and confusing when half-correct.
  • InnoDB settings drift: restoring onto a server with different innodb_log_file_size, encryption, or page size expectations can turn a clean restore into a long debugging session.
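
For the retention sharp edge, make the policy explicit instead of inheriting a default. A sketch assuming MySQL 8.0, where retention is expressed in seconds; the three-day value is an illustration tied to an RPO buffer, not a recommendation:

cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';"
cr0x@server:~$ mysql -uroot -p -e "SET PERSIST binlog_expire_logs_seconds = 259200;"

Retention on the server is only half the story; the binlog copies you ship off-host need their own retention and a continuity check.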

PostgreSQL recovery path: what actually takes time

What “restore” means in a PostgreSQL outage

PostgreSQL physical recovery is conceptually clean:

  1. Restore a base backup (files for the cluster).
  2. Provide WAL segments to bring it forward (from archive or streaming source).
  3. Replay WAL until the target (latest, timestamp, LSN, or named restore point).
  4. Promote and reattach clients/replicas.
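
If you are not driving recovery through a tool like pgBackRest, the manual PITR wiring on PostgreSQL 12+ looks roughly like this; the archive path, cluster layout, and timestamp are illustrative assumptions:

# appended to postgresql.conf (backup tools typically write the equivalent for you)
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2025-12-28 14:32:00+00'
recovery_target_action = 'promote'

cr0x@server:~$ sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal
cr0x@server:~$ sudo systemctl start postgresql@16-main

The recovery.signal file is what tells PostgreSQL to perform targeted recovery; without it, your carefully chosen recovery_target_time is simply ignored.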

The biggest time sinks (PostgreSQL)

  • Restoring the base backup: same physics—network, disk throughput, decompression.
  • WAL replay: write-heavy systems generate a lot of WAL; replay can be limited by random I/O and fsync behavior.
  • Checkpoint behavior: configuration and workload influence how painful crash recovery and replay are (a quick check follows this list).
  • Large objects and bloated indexes: restore may be fine, but “back online” might require vacuuming or reindexing after the fact.
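
Two settings worth knowing before you need them, because they shape both crash recovery and replay time. A read-only check, nothing here changes configuration:

cr0x@server:~$ sudo -u postgres psql -c "SHOW max_wal_size;" -c "SHOW checkpoint_timeout;"

Larger max_wal_size and longer checkpoint_timeout mean fewer checkpoints during normal operation but potentially more WAL to replay after a crash; make that trade on purpose.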

What PostgreSQL does well for recovery speed

PITR is a first-class citizen. You can choose “restore to the moment before the deploy” with fewer contortions, and the tooling ecosystem tends to validate WAL continuity aggressively. That reduces time lost to “we restored… why is it missing transactions?”

PostgreSQL sharp edges you feel during restores

  • WAL archive gaps: one missing segment and you’re not doing PITR; you’re doing archaeology (see the archiver check after this list).
  • restore_command correctness: one subtle quoting/path bug can stall recovery indefinitely.
  • Timeline confusion: repeated failovers create timelines; restoring the wrong timeline is a classic “it boots, but it’s wrong” failure.
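
For the archive-gap edge, the cheapest early warning is the archiver’s own bookkeeping on the primary. The view and columns below are standard; what counts as “too many failures” is your call:

cr0x@server:~$ sudo -u postgres psql -c "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal FROM pg_stat_archiver;"

A growing failed_count means WAL is being generated faster than it is being safely archived; that is an RPO problem today and a PITR failure later.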

One paraphrased idea from Werner Vogels (Amazon CTO): everything fails, all the time; resilient systems assume failure and recover quickly.

Practical tasks (commands, outputs, decisions)

These are the tasks I actually run during restores and drills. Each one includes: command, what the output means, and what decision you make.

Task 1: Measure disk throughput on the restore target (Linux)

cr0x@server:~$ sudo fio --name=restore-write --filename=/var/lib/db-restore.test --size=8G --bs=1M --rw=write --ioengine=libaio --iodepth=16 --direct=1
restore-write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
  write: IOPS=720, BW=720MiB/s (755MB/s)(8192MiB/11374msec)

Meaning: Your sequential write bandwidth is ~720MiB/s. That’s the ceiling for “restore base backup” if you’re I/O bound.

Decision: If BW is <200MiB/s for multi-TB restores, stop pretending your RTO is minutes. Move restore to faster disks, or restore from local snapshots, or promote a replica.

Task 2: Check filesystem free space and inode pressure

cr0x@server:~$ df -h /var/lib/mysql
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  3.5T  1.2T  2.3T  35% /var/lib/mysql
cr0x@server:~$ df -i /var/lib/mysql
Filesystem       Inodes  IUsed    IFree IUse% Mounted on
/dev/nvme0n1p2  244M    18M     226M    8% /var/lib/mysql

Meaning: You have space and plenty of inodes. On MySQL with per-table files, inode exhaustion is a real restore killer.

Decision: If inode use is high (>80%), restore onto a filesystem with more inodes or consolidate tablespaces (long-term). Don’t “hope it fits.”

Task 3: Verify backup integrity metadata (pgBackRest)

cr0x@server:~$ pgbackrest --stanza=prod check
stanza: prod
status: ok

Meaning: The repo and stanza are reachable and consistent enough for pgBackRest to be confident.

Decision: If this fails during an incident, you switch immediately to your secondary backup location or snapshot strategy. Don’t debug repo config while the CEO is refreshing dashboards.

Task 4: List available PostgreSQL backups and pick the right one

cr0x@server:~$ pgbackrest --stanza=prod info
stanza: prod
    status: ok
    cipher: none
    db (current)
        wal archive min/max (16): 0000001200000A1B000000C0/0000001200000A1C0000002F
        full backup: 20251228-010001F
            timestamp start/stop: 2025-12-28 01:00:01 / 2025-12-28 01:12:33
            database size: 2.1TB, backup size: 820GB
            repo1: backup set size: 820GB, backup size: 820GB

Meaning: You have a recent full backup, and WAL coverage exists between min/max.

Decision: Pick the newest backup that still has complete WAL to your target recovery time. If WAL max is behind your incident timestamp, your RPO just got worse—say it early.

Task 5: Restore PostgreSQL to a target timestamp (PITR)

cr0x@server:~$ sudo pgbackrest --stanza=prod --delta --type=time --target="2025-12-28 14:32:00" restore
INFO: restore command begin 2.53: --delta --stanza=prod --target=2025-12-28 14:32:00 --type=time
INFO: restore size = 2.1TB, file total = 2143
INFO: restore command end: completed successfully

Meaning: Files are restored; the database will replay WAL on startup until the target time. Note that by default recovery pauses once the target is reached (so you can inspect before promoting) unless you requested promotion, e.g. via --target-action=promote.

Decision: If restore is “successful” but startup hangs, your bottleneck is likely WAL retrieval or replay I/O. Move to the diagnosis playbook instead of re-running restores.

Task 6: Watch PostgreSQL recovery progress

cr0x@server:~$ sudo -u postgres psql -c "select pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
 pg_is_in_recovery | pg_last_wal_replay_lsn | pg_last_xact_replay_timestamp
-------------------+------------------------+------------------------------
 t                 | 0/A1C3F9B0             | 2025-12-28 14:29:17+00
(1 row)

Meaning: Still in recovery; you can quantify how far behind you are by timestamp/LSN movement.

Decision: If the timestamp isn’t advancing, suspect stalled WAL restore_command, missing WAL, or I/O saturation. If it is advancing slowly, tune WAL replay environment (faster disks, fewer bottlenecks) rather than touching SQL.

Task 7: Check for missing WAL segments (PostgreSQL logs)

cr0x@server:~$ sudo tail -n 20 /var/log/postgresql/postgresql-16-main.log
2025-12-28 14:35:02 UTC [21741] LOG:  entering standby mode
2025-12-28 14:35:02 UTC [21741] LOG:  restored log file "0000001200000A1B000000F1" from archive
2025-12-28 14:35:05 UTC [21741] LOG:  could not restore file "0000001200000A1B000000F2" from archive: No such file or directory
2025-12-28 14:35:05 UTC [21741] LOG:  waiting for WAL to become available at 0/A1BF0000

Meaning: Archive gap. Recovery is blocked.

Decision: Either locate the missing WAL (secondary archive, object storage lifecycle issue), or accept a worse RPO and restore to an earlier point. Don’t keep waiting “just in case.”

Task 8: Validate MySQL binlog availability for PITR

cr0x@server:~$ mysql -uroot -p -e "SHOW BINARY LOGS;"
+------------------+------------+
| Log_name         | File_size  |
+------------------+------------+
| binlog.000812    | 1073741824 |
| binlog.000813    | 1073741824 |
| binlog.000814    |  512345678 |
+------------------+------------+

Meaning: Binlogs exist on this server (or wherever you’re checking). File list suggests retention.

Decision: Confirm you have the binlogs covering the incident window. If binlog.000811 is required but missing, you can’t PITR across that boundary.

Task 9: Determine MySQL GTID state before re-pointing replication

cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'gtid_mode'; SHOW MASTER STATUS\G"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode     | ON    |
+---------------+-------+
*************************** 1. row ***************************
             File: binlog.000814
         Position: 83422119
Executed_Gtid_Set: 4a9b1c9c-8b60-11ee-9f4e-0242ac120002:1-987654321

Meaning: GTID is enabled; you have a specific executed set you can compare across servers. (On MySQL 8.2+ the statement is SHOW BINARY LOG STATUS; SHOW MASTER STATUS is deprecated and removed in 8.4.)

Decision: Use GTID-based auto-positioning if possible. If GTID is OFF in parts of your fleet, failover becomes slower and riskier—plan to fix that when you’re not on fire.

Task 10: Prepare a MySQL XtraBackup restore (apply redo)

cr0x@server:~$ sudo xtrabackup --prepare --target-dir=/backups/mysql/full-2025-12-28
xtrabackup: starting shutdown with innodb_fast_shutdown = 1
xtrabackup: completed OK!

Meaning: The backup is consistent and ready to copy back. If this step is slow, your redo apply is heavy.

Decision: If prepare takes forever, you either accept longer RTO or change backup cadence/strategy (incrementals, more frequent fulls, or replica-based recovery).

Task 11: Copy back MySQL datadir and set permissions correctly

cr0x@server:~$ sudo systemctl stop mysql
cr0x@server:~$ sudo rsync -aHAXh --info=progress2 /backups/mysql/full-2025-12-28/ /var/lib/mysql/
      1.23T  42%  610.23MB/s    0:35:12 (xfr#214, to-chk=1290/2143)
cr0x@server:~$ sudo chown -R mysql:mysql /var/lib/mysql

Meaning: Restore is I/O-bound; rsync speed gives you an ETA. Ownership is corrected so mysqld can read files.

Decision: If rsync speed is way below expected disk speed, suspect network path, throttling, or small-file overhead. Consider restoring from local snapshot or attached volume instead of across the network.

Task 12: Start MySQL and confirm crash recovery behavior

cr0x@server:~$ sudo systemctl start mysql
cr0x@server:~$ sudo tail -n 20 /var/log/mysql/error.log
2025-12-28T14:40:01.120345Z 0 [Note] InnoDB: Starting crash recovery.
2025-12-28T14:40:12.551231Z 0 [Note] InnoDB: Crash recovery finished.
2025-12-28T14:40:13.883412Z 0 [Note] mysqld: ready for connections.

Meaning: Crash recovery completed quickly; service is accepting connections.

Decision: If crash recovery takes a long time or loops, your logs/redo state may be mismatched (bad restore, wrong innodb settings, partial copy). Stop and validate the backup/restore steps rather than repeatedly restarting.

Task 13: Apply MySQL binlogs to reach a target point

cr0x@server:~$ mysqlbinlog --start-datetime="2025-12-28 14:00:00" --stop-datetime="2025-12-28 14:32:00" /var/lib/mysql/binlog.000812 /var/lib/mysql/binlog.000813 | mysql -uroot -p
cr0x@server:~$ echo $?
0

Meaning: Binlog apply succeeded (exit code 0). This is your PITR step.

Decision: If it errors, decide quickly: fix and re-run (if you can), or accept a different restore point. Don’t apply partial binlogs and call it “good enough” unless you enjoy data-integrity therapy sessions.

Task 14: Verify PostgreSQL can accept writes and is on the expected timeline

cr0x@server:~$ sudo -u postgres psql -c "select pg_is_in_recovery(); select timeline_id from pg_control_checkpoint();"
 pg_is_in_recovery
-------------------
 f
(1 row)

 timeline_id
------------
         19
(1 row)

Meaning: Recovery is finished, and you can track which timeline you’re on.

Decision: If clients are still failing, the database is probably fine and your bottleneck is elsewhere: DNS, load balancer, connection pool, secrets, or firewalling.

Task 15: Find replication lag / apply delay (PostgreSQL standby)

cr0x@server:~$ sudo -u postgres psql -c "select client_addr, state, sync_state, write_lag, flush_lag, replay_lag from pg_stat_replication;"
 client_addr |   state   | sync_state | write_lag | flush_lag | replay_lag
-------------+-----------+------------+-----------+-----------+------------
 10.0.2.41   | streaming | async      | 00:00:00  | 00:00:01  | 00:00:03
(1 row)

Meaning: Standby is nearly caught up; replay lag is tiny.

Decision: If replay lag is minutes/hours, promoting this standby may blow your RPO. Either pick another standby, or restore from backup with WAL to a safer point.

Task 16: Confirm MySQL replication status after failover

cr0x@server:~$ mysql -uroot -p -e "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Last_Error:

Meaning: Replica is healthy and caught up. (On MySQL 8.0.22+ prefer SHOW REPLICA STATUS, which reports Replica_IO_Running, Replica_SQL_Running, and Seconds_Behind_Source; SHOW SLAVE STATUS is removed in 8.4.)

Decision: If SQL thread is stopped with an error, don’t blindly skip transactions. Decide whether data correctness or fast recovery matters more, then document the compromise and plan cleanup.

Fast diagnosis playbook: find the bottleneck in minutes

Restores fail in boring ways. The goal is to identify which boring way you’re living today.

First: are you blocked on missing logs (WAL/binlogs)?

  • PostgreSQL: check logs for “could not restore file … waiting for WAL.” If present, you’re not slow—you’re stuck.
  • MySQL: confirm binlog coverage for the target interval. If you can’t find needed binlogs, PITR is impossible across that gap.

Second: is the bottleneck throughput (network/disk) or replay (CPU/random I/O)?

  • If the base backup copy is slow, measure network and disk throughput. Watch iostat, fio results, rsync progress.
  • If copy is fast but startup/recovery is slow, it’s replay/checkpoint/crash recovery. Measure WAL replay progress (PostgreSQL) or crash recovery logs and I/O latency (MySQL).
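
A quick way to split the blame between disks and CPU during either phase (iostat is from sysstat; the 5 is just a sampling interval in seconds):

cr0x@server:~$ iostat -xm 5

High %util with high r_await/w_await and modest CPU points at I/O; mostly idle disks with one or two cores pegged points at decompression, checksums, or single-threaded apply.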

Third: is the database “up” but the service isn’t?

  • Connection pools pointing at old hosts.
  • DNS TTLs and stale caches.
  • Secrets not deployed to the restore environment.
  • Firewall/security groups missing replication/app access.

Fourth: is performance the new outage?

  • Cold cache: buffer pools and shared buffers empty.
  • Autovacuum, checkpoints, or purge catching up.
  • Missing indexes because someone restored from a partial logical dump.

Joke #2: The fastest restore is the one you tested last week—strangely, the database enjoys being paid attention to.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had a MySQL primary and two replicas. They also had “binlog backups,” which meant a cron job copied binlog files to object storage. Everyone slept well.

Then a deploy went sideways. A batch job ran against production with the wrong WHERE clause. The data was consistent, replication faithfully spread the damage, and by the time anyone noticed, the problem was already in every node.

The team decided on PITR: restore last night’s physical backup, then apply binlogs up to “just before the deploy.” Easy. Except the binlogs were rotated and purged aggressively on the primary to save space, and the cron job had been copying only “current” binlogs. The older ones were gone. Not “hard to find.” Gone.

They restored last night’s backup and accepted an RPO they’d never written down. The painful part wasn’t the data loss; it was that the data loss was avoidable. The assumption was: “if we back up binlogs, we can PITR.” The reality was: “if we retain a continuous binlog chain, we can PITR.”

The fix was boring: explicit binlog retention based on RPO, plus a verification job that checked continuity and alerted on gaps. The next incident was less dramatic because it was less surprising.

Mini-story 2: The optimization that backfired

A PostgreSQL team wanted faster backups. They increased compression and moved backups to cheaper storage. Backup windows shrank, monthly bills went down, and someone got a round of applause in a planning meeting.

Months later, a storage failure took out the primary and the standby in the same rack. They restored to a fresh host and discovered the new truth: restore time is not backup time. Restore meant pulling terabytes from cold-ish storage, decompressing aggressively compressed archives on a smaller CPU instance, and then replaying a mountain of WAL.

They did get the database back. It just took long enough to make the business rethink what “high availability” was supposed to mean.

The postmortem decision was simple: keep a “fast restore tier” for the most recent full backup and the newest WAL, even if long-term retention stays cheap and compressed. They also pinned restore hosts to a compute profile that matched decompression needs. Compression is a tool, not a virtue.

Mini-story 3: The boring but correct practice that saved the day

A payments platform ran PostgreSQL with pgBackRest. Every week, an engineer ran a restore drill into an isolated environment. Not a full fire drill with executives—just a quiet exercise: restore last night’s backup, replay WAL to a chosen timestamp, run a validation query set, and then delete the environment.

It felt almost silly. The restore always worked. The queries always matched expectations. The tickets were repetitive.

Then an upgrade introduced an unexpected WAL archive permission issue. WAL was being generated, but the archive_command intermittently failed. The drill caught it within days because the restored instance couldn’t reach the target time. The fix happened before the first real incident depended on that WAL.

Later, a real outage hit: a human deleted a volume attachment in the wrong console. The restore drill playbook was already written, the restore host profile was already known, and the team didn’t have to rediscover their own system at 2 a.m. The recovery was fast precisely because it was boring.

Common mistakes: symptoms → root cause → fix

1) “Restore succeeded” but PostgreSQL won’t reach the target time

Symptoms: pgBackRest restore completes; Postgres starts; logs show waiting for WAL; pg_last_xact_replay_timestamp stops advancing.

Root cause: Missing WAL segment(s) due to archive gap, lifecycle deletion, or misconfigured archive_command/restore_command.

Fix: Find WAL in secondary archive; repair restore_command; adjust retention to cover RPO; add continuous WAL gap checks in monitoring.

2) MySQL is up, but crash recovery takes forever

Symptoms: mysqld starts but stays unavailable; error log shows extended crash recovery; disk latency spikes.

Root cause: Restored onto slow storage; redo logs heavy; backup wasn’t properly prepared; or innodb settings mismatch causing extra work.

Fix: Validate XtraBackup prepare step; restore to faster disks; ensure innodb_log_file_size and relevant settings match the backup environment; avoid repeated restarts that re-trigger recovery work.

3) Logical restore is “running” but ETA is days

Symptoms: pg_restore/mysqldump import crawls; CPU is busy parsing; WAL/binlog grows rapidly; replication falls behind.

Root cause: Wrong tool for RTO; restore running single-threaded; indexes/constraints rebuilt in the worst order; insufficient tuning.

Fix: Use physical backups for DR; if forced to use logical, parallelize pg_restore where possible, load data before indexes/constraints, disable unnecessary logging where safe, and restore onto an isolated instance.

4) Replicas exist, but failover still takes forever

Symptoms: A standby is “there,” but promotion is delayed; app errors continue; replication configuration is messy.

Root cause: No automated cutover; DNS TTL too high; connection pools don’t refresh; certificates/secrets tied to hostnames; or replication was lagging beyond RPO.

Fix: Practice failover; reduce TTL sensibly; use stable VIPs/proxies; ensure apps reconnect; monitor replication lag with alerting tied to RPO.

5) Restore is fast, but performance is so bad it’s effectively down

Symptoms: Database responds, but p95 latency is awful; CPU and I/O spike; caches cold; autovacuum/purge storms.

Root cause: Cold cache plus background maintenance plus heavy application load immediately after restore.

Fix: Stage traffic: read-only first, then ramp writes; prewarm critical tables/indexes; tune background maintenance; consider delaying non-critical jobs until stable.

Checklists / step-by-step plan

Decide your recovery strategy per incident type

  • Hardware/node loss: prefer failover to replica/standby (fastest), then rebuild from backups.
  • Logical corruption (bad deploy, accidental deletes): prefer PITR from physical backup + logs.
  • Silent corruption suspicion: restore to isolated environment, validate, then cut over; don’t promote a possibly corrupted replica.

Pre-incident checklist (do this while calm)

  1. Define RTO/RPO per service (write it down, put it in the runbook).
  2. Implement physical backups (XtraBackup / pgBackRest) with verification.
  3. Ensure continuous log retention: binlogs/WAL cover at least RPO + buffer.
  4. Keep a “fast restore tier” of recent backups + logs on storage that can actually deliver throughput.
  5. Provision restore hosts with enough CPU for decompression and enough IOPS for replay.
  6. Automate or at least script: restore, validation queries, and cutover steps.
  7. Run restore drills on a schedule; record actual times and bottlenecks.
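
Checklist item 3 only holds if the log chain is actually continuous, not merely present. Below is a minimal sketch of a continuity check for copied MySQL binlogs; the directory layout and file naming are assumptions, and the PostgreSQL analogue is usually “pgbackrest check” plus alerting on pg_stat_archiver failures.

#!/usr/bin/env bash
# Sketch: fail if the copied binlog sequence has a gap (naming assumed to be binlog.NNNNNN).
set -euo pipefail
BINLOG_DIR="${1:-/backups/mysql/binlogs}"   # hypothetical landing directory for binlog copies
prev=""
for f in $(ls "$BINLOG_DIR" | grep -E '^binlog\.[0-9]+$' | sort || true); do
  seq=$((10#${f#binlog.}))                  # strip the prefix and any leading zeros
  if [[ -n "$prev" ]] && (( seq != prev + 1 )); then
    echo "GAP: binlog sequence jumps from $prev to $seq in $BINLOG_DIR" >&2
    exit 1
  fi
  prev="$seq"
done
echo "Binlog chain continuous in $BINLOG_DIR (last sequence: ${prev:-none})"

Wire it into whatever alerts a human, and treat a gap as an incident-in-waiting, not a cleanup ticket.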

During-incident step-by-step plan (time-boxed)

  1. Time-box triage (5–10 minutes): decide if you’re failing over or restoring. Don’t do both at once unless you have enough people and discipline.
  2. Lock in your target: “latest possible” vs “timestamp before incident.” Communicate the RPO implication immediately.
  3. Validate log continuity: WAL/binlog coverage. If missing, adjust target and communicate.
  4. Start restore in parallel with app prep: prepare DNS/LB/config changes while bytes are moving.
  5. Watch progress metrics: copy throughput, replay timestamp/LSN, disk latency.
  6. Bring online in tiers: read-only or limited traffic first; then full load when stable.
  7. Rebuild redundancy: add replicas/standbys before declaring victory; a single recovered primary is a dare.

FAQ

1) Which restores faster: MySQL or PostgreSQL?

With physical backups and good storage, they’re both “fast enough” for many systems. The winner is usually the one with continuous logs, a tested runbook, and a restore path that avoids slow storage. If you force logical restores, both will feel slow; MySQL often suffers on schema/index rebuild overhead, PostgreSQL suffers on WAL generation and index build time. Don’t choose the engine based on dump speed.

2) Is promoting a replica always faster than restoring from backup?

Almost always, yes. But it may violate RPO if replication lag exists, and it doesn’t help for logical corruption that replicated. Also: if replicas share the same underlying corruption (storage/firmware bugs happen), you can promote garbage quickly.

3) What’s the single most common PITR failure?

Missing logs. WAL archive gaps in PostgreSQL; binlog retention/collection gaps in MySQL. Tools can’t recover what you didn’t keep.

4) Are logical backups useless?

No. They’re great for portability, partial restores, audits, and migrations. They’re just rarely the right tool for “we need to be back in 15 minutes.” Keep them as a secondary line of defense, not the primary RTO plan.

5) Does compression help or hurt recovery speed?

Both. It reduces bytes transferred and stored, but increases CPU during restore. The backfire scenario is common: cheap storage + high compression + small restore host = slowest possible restore. Measure restore time, not backup time.

6) How do I know if I’m I/O-bound or CPU-bound during restore?

Watch restore throughput alongside CPU utilization and disk latency. If CPU is pegged and disks are quiet, decompression/checksums are dominating. If CPU is moderate but disk latency is high and throughput is low, you’re I/O-bound. Then fix the right thing instead of yelling at the database.

7) For PostgreSQL, can I serve traffic while in recovery?

You can serve reads from a standby in recovery (hot standby) depending on configuration and use case. For a PITR restore aiming for a new primary, you generally wait for recovery to reach the target and then promote. If you need fast reads, keep a standby architecture; don’t improvise it mid-incident.

8) For MySQL, what’s the fastest safe restore pattern?

Promote a healthy replica for availability, then rebuild the failed primary from a physical backup in the background. If the incident is logical corruption, restore a physical backup to an isolated host and apply binlogs carefully to a safe point, then cut over.

9) How often should we run restore drills?

Often enough that your measured RTO is real. Weekly for critical systems is common; monthly might be acceptable if your environment is stable. After major version changes, storage migrations, backup tool upgrades, or retention policy updates, run a drill immediately.

10) What should we validate after restore besides “it starts”?

Run application-critical queries, validate row counts or checksums for key tables, confirm user/grant correctness, confirm replication can be rebuilt, and verify the system reaches your intended PITR timestamp. “Service started” is Tier 0; it’s not the finish line.

Conclusion: next steps you can execute

If your goal is getting back online fast, the MySQL vs PostgreSQL debate is mostly a distraction. The durable differences are in operational posture: continuous logs, physical backups, fast storage for restores, and rehearsed procedures.

Practical next steps:

  1. Pick your recovery tier (Tier 1/2/3) for each service and write it down with RTO/RPO.
  2. Standardize on physical backups (XtraBackup for MySQL InnoDB; pgBackRest for PostgreSQL) and automate verification.
  3. Prove log continuity: implement a job that checks WAL/binlog chains end-to-end, not just “files exist.”
  4. Time your restores on realistic hardware. Measure copy speed and replay speed separately.
  5. Keep a fast restore path: recent backups and logs on storage that can feed your RTO, plus restore hosts sized for decompression and I/O.
  6. Drill until it’s boring. Boring is fast. Boring is reliable. Boring is what your customers pay for.