The day you learn whether your backups work is never a calm day. It’s a day with a panicked Slack thread, a CFO who suddenly cares about retention policies, and an engineer who swears “we’ve always done it this way.”
Point-in-time recovery (PITR) is supposed to be your escape hatch: restore the database to just before the bad deploy, the dropped table, the accidental data fix that “only touched 12 rows.” But PITR is not a feature you own until you’ve restored it, timed it, and explained the result to someone who doesn’t want to hear excuses.
PITR is a drill, not a checkbox
PITR is simple in concept: take a base backup, then replay the transaction log until a target time, transaction ID, or log position. The database re-emerges, slightly annoyed, at the moment before the incident.
In practice, PITR is a chain of fragile assumptions: time synchronization, log retention, correct configuration, encryption keys available, permissions, and enough I/O bandwidth to do the replay before your business misses payroll. PITR drills are how you find the weak link when the stakes are low.
Here’s the operational truth: a backup you haven’t restored is just a compressed pile of hope.
One quote worth keeping on a sticky note (paraphrased idea): “Hope is not a strategy” — often attributed in operations circles to Gen. Gordon R. Sullivan.
Paraphrase aside, the sentiment is painfully exact.
Joke #1: The only thing more optimistic than an untested restore is the belief that the incident will wait for office hours.
Facts and small history that matter in production
The details behind MySQL binlogs and PostgreSQL WAL aren’t trivia. They explain why certain PITR habits work—and why others quietly rot until an incident.
- PostgreSQL introduced WAL in 7.1 (2001), enabling crash recovery and later streaming replication. PITR piggybacks on the same mechanism.
- MySQL’s binary log became core to replication early on, and PITR in MySQL is essentially “restore backup + apply binlogs.” Operationally, replication and PITR share failure modes.
- PostgreSQL’s “timeline” concept exists because recovery can fork history. If you restore and promote, you create a new timeline; ignoring that is how you replay the wrong future.
- MySQL supports statement-based, row-based, and mixed binlog formats. PITR reliability and determinism change drastically depending on which you use.
- PostgreSQL archive_command-based WAL archiving predates widespread cloud object storage, which is why many shops still wrap shell scripts around it—and why those scripts fail in creative ways.
- MySQL GTIDs (global transaction identifiers) were introduced to make replication and failover more robust; they also improve PITR reasoning when used consistently.
- PostgreSQL’s backup_label file and recovery.signal marker (the latter replacing the older recovery.conf in version 12) make backup and recovery state explicit. Old runbooks that reference recovery.conf still exist, and they still cause midnight confusion.
- binlog_row_image defaults and changes across MySQL variants influence how much data sits in binlogs. That affects replay time, storage costs, and whether you can reconstruct certain rows cleanly.
MySQL vs PostgreSQL PITR: what’s actually different
What “log replay” means in each system
MySQL PITR replays binary logs: logical changes at the SQL statement level (statement-based), row-level changes (row-based), or a mixture. The replay tool is typically mysqlbinlog.
PostgreSQL PITR replays WAL (Write-Ahead Log): physical-ish redo information, applied by the server during recovery. You don’t “apply WAL” with a client tool; you configure recovery, feed it WAL segments (from archive or streaming), and PostgreSQL does the replay.
Base backup expectations
MySQL base backups vary widely: filesystem snapshots (LVM/ZFS), xtrabackup, or logical dumps. PITR with binlogs depends heavily on having a consistent base backup that aligns with binlog coordinates.
PostgreSQL has a more standardized story: a pg_basebackup (or filesystem snapshot taken correctly) plus archived WAL is a common, well-trodden path.
Time is a liar (unless you make it behave)
MySQL PITR often uses binlog positions, file names, timestamps embedded in events, or GTIDs. If system clocks skew between machines, “restore to 14:03:00” becomes interpretive dance.
PostgreSQL has recovery_target_time, but it also depends on the cluster’s time zone handling and the WAL timeline you’re on. If you restore to a point that doesn’t exist on the chosen timeline, you’ll either fail or land somewhere surprising.
Observability during recovery
PostgreSQL exposes recovery state through functions like pg_is_in_recovery() and pg_last_wal_replay_lsn(), views like pg_stat_wal_receiver when streaming, and logs that tell you which WAL segment is being applied. It’s decently introspectable.
MySQL replay via mysqlbinlog | mysql is transparent in the “you can see the pipe” sense, but terrible in the “what’s my exact progress” sense unless you wrap it with your own instrumentation.
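If you want even crude progress visibility on the MySQL side, pipe the replay through pv. A minimal sketch, assuming pv is installed and you have already extracted the window to a file (as in Task 5 below):
# Byte-level progress and an ETA while replaying a pre-extracted binlog window.
# /tmp/pitr_replay.sql is the file produced by mysqlbinlog in Task 5.
pv /tmp/pitr_replay.sql | mysql --show-warnings
It’s not a percentage of transactions, but “38% of the file, ETA six minutes” beats staring at a silent pipe.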
One operational preference (opinionated)
If you run MySQL and you care about PITR, favor row-based binlogging unless you have a specific reason not to. Statement-based binlogs and non-deterministic functions are a divorce lawyer’s retirement plan.
RPO/RTO math you can’t hand-wave
PITR drills are not just “can it restore?” They’re “can it restore fast enough and accurately enough?”
- RPO (Recovery Point Objective): how much data you can lose. PITR usually targets near-zero, but only if logs are complete and available.
- RTO (Recovery Time Objective): how long you can be down. This is dominated by base restore time + log replay time + validation time.
The RTO killer is almost always I/O: decompressing backups, writing terabytes to disk, and replaying logs that were never designed to be read “fast.” If you do PITR drills on a tiny test VM with slow storage, you’ll either panic for no reason or, worse, believe an RTO that won’t hold in production.
Measure replay rate. Measure restore throughput. Measure how long it takes to run integrity checks and a handful of application queries. Then decide whether your current retention window and storage tier are compatible with your promises.
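If you want that math to come from measurements instead of vibes, a tiny timing harness is enough. A minimal sketch; the three phase scripts are placeholders for whatever your restore, replay, and validation steps actually are:
#!/usr/bin/env bash
# Hypothetical drill timer: wraps each phase and records wall-clock seconds.
set -euo pipefail
TIMINGS=/tmp/drill_timings.txt
: > "$TIMINGS"

phase() {                       # usage: phase <name> <command...>
  local name="$1"; shift
  local start end
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  echo "${name} $(( end - start ))" | tee -a "$TIMINGS"
}

phase base_restore ./restore_base_backup.sh     # placeholder: your base restore
phase log_replay   ./replay_to_target.sh        # placeholder: binlog/WAL replay
phase validation   ./run_validation_queries.sh  # placeholder: markers + smoke checks

awk '{total += $2} END {print "measured RTO (s):", total}' "$TIMINGS"
Run it during every drill and keep the numbers; the trend matters more than any single run.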
Designing a PITR drill that finds real bugs
Pick a target incident you can simulate
Don’t do abstract drills. Simulate a realistic failure:
- Accidental DELETE without a WHERE (classic, evergreen).
- Bad migration that drops an index and times out everything.
- Application bug that writes the wrong tenant ID for 15 minutes.
- Operator error: running a “safe” script on the wrong cluster.
Define success criteria in advance
Success is not “database started.” Success is:
- Restored to the correct point (verified by known marker queries; see the marker sketch after this list).
- Application-critical tables present and consistent.
- RTO and RPO numbers measured and recorded.
- Runbook updated based on what broke.
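The “known marker queries” in the first bullet only work if markers exist before you need them. A minimal sketch, MySQL-flavored; the table, database, and label are examples, and on PostgreSQL you’d run the equivalent through psql with BIGSERIAL/TIMESTAMPTZ types:
# Run at every deploy or risky change so restores have a known anchor to validate against.
mysql appdb <<'SQL'
CREATE TABLE IF NOT EXISTS ops_restore_markers (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  label VARCHAR(200) NOT NULL,
  created_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
);
INSERT INTO ops_restore_markers (label) VALUES ('pre-deploy: billing migration');
SQL

# After a restore: the newest marker should be the one from just before the incident,
# and nothing newer should exist.
mysql appdb -e "SELECT label, created_at FROM ops_restore_markers ORDER BY created_at DESC LIMIT 1;"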
Always include a “who has the keys?” step
If backups are encrypted (they should be), your drill must include retrieving the decryption key from wherever your organization hides it behind a ticket queue. If your restore requires a specific IAM role or a KMS permission, test that too.
Keep one thing boring: naming and metadata
Every backup artifact must include: cluster ID, start/end time, base backup label, binlog/WAL start position, and the software version. Metadata that’s “obvious” today becomes archaeology in six months.
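A minimal sketch of what that metadata can look like: a manifest file written by the backup job and stored next to the artifact. Every value here is an example; the point is that the restore side should never have to guess any of it.
# Hypothetical manifest written at backup time, alongside the backup itself.
BACKUP_DIR=/backups/mysql/base/2025-12-30_1200
cat > "${BACKUP_DIR}/MANIFEST.json" <<'EOF'
{
  "cluster_id": "prod-mysql-01",
  "backup_start_utc": "2025-12-30T12:00:03Z",
  "backup_end_utc": "2025-12-30T12:27:41Z",
  "base_backup_label": "2025-12-30_1200",
  "binlog_start": "mysql-bin.000421:154",
  "gtid_executed": "8b3f9b75-9c67-11ee-9b7d-0242ac120002:1-884210",
  "server_version": "8.0.36"
}
EOF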
Practical tasks: commands, outputs, decisions
The commands below are not decorative. They’re the exact kind of muscle memory you want before an incident. Each task includes: a command, what the output means, and the decision you make.
Task 1 — Verify MySQL binlog is enabled and sane
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'log_bin'; SHOW VARIABLES LIKE 'binlog_format'; SHOW VARIABLES LIKE 'binlog_row_image';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| binlog_row_image | FULL  |
+------------------+-------+
Meaning: PITR depends on binlogs; if log_bin=OFF, you can’t replay anything. binlog_format=ROW is the safer default for PITR.
Decision: If binlogs are off or statement-based, treat PITR as “best effort” and fix config before you claim a low RPO.
Task 2 — Check MySQL binlog retention (and whether it’s lying)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'binlog_expire_logs_seconds'; SHOW VARIABLES LIKE 'expire_logs_days';"
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| binlog_expire_logs_seconds | 259200 |
+----------------------------+--------+
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| expire_logs_days | 0     |
+------------------+-------+
Meaning: Binlogs expire after 3 days here. If your compliance requirement is 14 days of PITR, you’re not compliant; you’re just quiet.
Decision: Set retention based on business requirements and storage reality. If you ship binlogs elsewhere, retention on the primary can be shorter, but only if shipping is verified.
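If the decision is “extend retention,” the change itself is one line. A sketch, assuming MySQL 8.0+ (for SET PERSIST) and a 14-day requirement; pick the number your business actually promised:
# 14 days in seconds; PERSIST keeps the setting across restarts.
mysql -e "SET PERSIST binlog_expire_logs_seconds = 1209600;"
mysql -e "SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';"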
Task 3 — List MySQL binlog files and pick a replay window
cr0x@server:~$ mysql -e "SHOW BINARY LOGS;"
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000421 |  52428800 |
| mysql-bin.000422 |  52428800 |
| mysql-bin.000423 |   8388608 |
+------------------+-----------+
Meaning: You have binlog segments present on the server. You still don’t know if they cover the incident time.
Decision: Identify which binlog contains the incident window using mysqlbinlog timestamps (next task) or GTIDs.
Task 4 — Inspect MySQL binlog timestamps around an incident
cr0x@server:~$ mysqlbinlog --base64-output=DECODE-ROWS --verbose --start-datetime="2025-12-30 13:55:00" --stop-datetime="2025-12-30 14:10:00" /var/lib/mysql/mysql-bin.000423 | head -n 25
# at 4
#251230 13:55:02 server id 1 end_log_pos 123 CRC32 0x4f2b1a77 Start: binlog v 4, server v 8.0.36 created 251230 13:55:02
# at 123
#251230 14:02:11 server id 1 end_log_pos 456 CRC32 0x9c4d6e10 Query thread_id=812 exec_time=0 error_code=0
SET TIMESTAMP=1767103331/*!*/;
BEGIN
# at 456
#251230 14:02:11 server id 1 end_log_pos 892 CRC32 0x19a0f2cb Table_map: `app`.`users` mapped to number 108
Meaning: You can locate the time window and see which tables were touched. This helps you set an accurate stop point.
Decision: Choose --stop-datetime just before the damaging statement, or find exact position/GTID for precise control.
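To turn “just before the damaging statement” into an exact position, search the decoded events for the statement and take the nearest “# at <position>” line above it. A sketch using the table and window from this example:
# ROW-format events decode to pseudo-SQL comments (### DELETE FROM ...), and every
# event is preceded by a "# at <position>" line you can feed to --stop-position.
mysqlbinlog --base64-output=DECODE-ROWS --verbose \
  --start-datetime="2025-12-30 14:00:00" --stop-datetime="2025-12-30 14:05:00" \
  /var/lib/mysql/mysql-bin.000423 | grep -n -B3 -E 'DELETE FROM `app`\.`users`|DROP TABLE'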
Task 5 — Extract MySQL binlog events to a file (do not pipe blind)
cr0x@server:~$ mysqlbinlog --start-datetime="2025-12-30 13:55:00" --stop-datetime="2025-12-30 14:02:10" /var/lib/mysql/mysql-bin.000423 > /tmp/pitr_replay.sql
cr0x@server:~$ tail -n 5 /tmp/pitr_replay.sql
# at 3321
#251230 14:02:10 server id 1 end_log_pos 3456 CRC32 0x2f1b8c3a Xid = 991228
COMMIT/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /*!*/;
DELIMITER ;
# End of log file
Meaning: You have a deterministic replay script for that window. This is auditable and lets you search for “DROP” before you hit Enter.
Decision: Review for dangerous statements and confirm the stop point. For the real replay, regenerate the file starting at the base backup’s recorded binlog position or GTID set (Task 6), not an arbitrary timestamp. In high-stakes restores, “trust the pipe” is how you explain yourself later.
Task 6 — Restore MySQL base backup to a staging datadir
cr0x@server:~$ sudo systemctl stop mysql
cr0x@server:~$ sudo rsync -aH --delete /backups/mysql/base/2025-12-30_1200/ /var/lib/mysql/
cr0x@server:~$ sudo chown -R mysql:mysql /var/lib/mysql
cr0x@server:~$ sudo systemctl start mysql
cr0x@server:~$ mysqladmin ping
mysqld is alive
Meaning: Base restore succeeded and MySQL starts. This says nothing about whether the data matches the intended restore point.
Decision: Capture the binlog coordinates or GTID state from the backup metadata, then apply binlogs from there to target time.
Task 7 — Apply MySQL replay script and watch for errors
cr0x@server:~$ mysql --show-warnings < /tmp/pitr_replay.sql
Warning (Code 1287): 'SET @@SESSION.GTID_NEXT' is deprecated and will be removed in a future release.
Meaning: Warnings are not fatal, but errors are. A common fatal error is missing tables because the base backup doesn’t match the replay starting point.
Decision: If you see errors like “Table doesn’t exist,” stop. Your base backup + binlog range is inconsistent; fix the starting coordinates.
Task 8 — Verify MySQL GTID state after PITR (if you use GTIDs)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'gtid_mode'; SELECT @@global.gtid_executed\G"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode     | ON    |
+---------------+-------+
*************************** 1. row ***************************
@@global.gtid_executed: 8b3f9b75-9c67-11ee-9b7d-0242ac120002:1-889122
Meaning: GTIDs show what transactions exist. After PITR, this matters if you intend to replicate from/to this instance.
Decision: If you’re going to rejoin replication, ensure GTID sets align with your topology plan. Otherwise, you will create replication confusion that looks like data corruption.
Task 9 — Verify PostgreSQL archiving is configured and actually working
cr0x@server:~$ psql -d postgres -c "SHOW wal_level; SHOW archive_mode; SHOW archive_command; SHOW archive_timeout;"
wal_level
-----------
replica
(1 row)
archive_mode
--------------
on
(1 row)
archive_command
---------------------------------------------------------
test ! -f /wal_archive/%f && cp %p /wal_archive/%f
(1 row)
archive_timeout
-----------------
60s
(1 row)
Meaning: Archiving is enabled and the command copies WAL to a local archive directory. This is a start, not proof.
Decision: Confirm WAL files are actually appearing and that failures are alerted. “Configured” is not “working.”
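One quick “working, not just configured” check from inside the database. pg_stat_archiver is a standard view; the staleness threshold you alert on is your call:
# failed_count climbing or a stale last_archived_time means archiving is broken,
# regardless of what archive_command looks like in the config.
psql -d postgres -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_time FROM pg_stat_archiver;"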
Task 10 — Check PostgreSQL archive backlog and recent WAL activity
cr0x@server:~$ ls -lh /wal_archive | tail -n 5
-rw------- 1 postgres postgres 16M Dec 30 14:00 000000010000002A0000009F
-rw------- 1 postgres postgres 16M Dec 30 14:01 000000010000002A000000A0
-rw------- 1 postgres postgres 16M Dec 30 14:02 000000010000002A000000A1
-rw------- 1 postgres postgres 16M Dec 30 14:03 000000010000002A000000A2
-rw------- 1 postgres postgres 16M Dec 30 14:04 000000010000002A000000A3
Meaning: WAL segments are being archived, names are sequential, timestamps look healthy.
Decision: If the archive is stale, stop pretending PITR exists. Fix archiving first; base backups alone are not PITR.
Task 11 — Create a PostgreSQL base backup you can actually restore
cr0x@server:~$ sudo -u postgres pg_basebackup -D /backups/pg/base/2025-12-30_1200 -Fp -Xs -P -R
waiting for checkpoint
22994/22994 kB (100%), 1/1 tablespace
22994/22994 kB (100%), 1/1 tablespace
Meaning: You took a physical base backup and included WAL streaming (-Xs), plus wrote replication config (-R). For PITR, you still need archived WAL or streamed WAL availability.
Decision: Record the backup label and timestamp. Ensure the corresponding WAL archive range exists and is retained.
Task 12 — Restore PostgreSQL base backup into a new datadir
cr0x@server:~$ sudo systemctl stop postgresql
cr0x@server:~$ sudo rm -rf /var/lib/postgresql/16/main
cr0x@server:~$ sudo rsync -aH /backups/pg/base/2025-12-30_1200/ /var/lib/postgresql/16/main/
cr0x@server:~$ sudo chown -R postgres:postgres /var/lib/postgresql/16/main
Meaning: The cluster files are in place. Now you decide whether you’re doing PITR (recovery) or just starting a clone.
Decision: For PITR, configure recovery targets and restore_command; do not start it as a normal primary unless you intend to fork a timeline.
Task 13 — Configure PostgreSQL PITR: restore_command and target time
cr0x@server:~$ sudo -u postgres bash -c 'cat >> /var/lib/postgresql/16/main/postgresql.conf <<EOF
restore_command = '\''cp /wal_archive/%f %p'\''
recovery_target_time = '\''2025-12-30 14:02:10+00'\''
recovery_target_action = '\''pause'\''
EOF'
cr0x@server:~$ sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal
cr0x@server:~$ sudo systemctl start postgresql
Meaning: PostgreSQL will start in recovery mode, fetch WAL from the archive, replay until the target time, then pause.
Decision: Pausing is deliberate: you inspect data before promoting. If you auto-promote, you can accidentally enshrine the wrong point in time as “truth.”
Task 14 — Observe PostgreSQL recovery progress and confirm target reached
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
pg_is_in_recovery
-------------------
t
(1 row)
cr0x@server:~$ sudo -u postgres psql -c "SELECT now(), pg_last_wal_replay_lsn();"
now | pg_last_wal_replay_lsn
------------------------------+-------------------------
2025-12-30 14:02:12.123+00 | 2A/A100F2B0
(1 row)
Meaning: You’re in recovery and have a replay LSN. Logs will typically say you reached recovery_target_time and paused.
Decision: Run validation queries now. If correct, promote; if not, adjust target and restart recovery.
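If you want explicit confirmation that replay is actually paused at the target before you start poking at data, ask the server directly. pg_is_wal_replay_paused() is long-standing; pg_get_wal_replay_pause_state() assumes PostgreSQL 14 or newer:
# true / 'paused' means recovery stopped at the target and is waiting for you.
sudo -u postgres psql -c "SELECT pg_is_wal_replay_paused(), pg_get_wal_replay_pause_state();"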
Task 15 — Promote PostgreSQL after validation
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_promote(wait_seconds => 60);"
pg_promote
------------
t
(1 row)
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
pg_is_in_recovery
-------------------
f
(1 row)
Meaning: The server is now writable and a new timeline exists.
Decision: Decide how this instance re-enters the topology. A promoted PITR restore is not automatically safe to reconnect as if nothing happened.
Task 16 — Validate data at the application boundary (not just “row counts”)
cr0x@server:~$ psql -d appdb -c "SELECT COUNT(*) FROM users; SELECT MAX(updated_at) FROM users;"
count
---------
1284932
(1 row)
max
-----------------------
2025-12-30 14:02:05+00
(1 row)
Meaning: Counts and timestamps give you quick sanity. The max timestamp being before the incident is a good sign.
Decision: If the data looks wrong, don’t promote (PostgreSQL) or don’t cut over (MySQL). Adjust the target restore point and rerun.
Task 17 — Measure restore I/O bottlenecks (Linux basics that decide your RTO)
cr0x@server:~$ iostat -xm 2 3
Linux 6.5.0 (server) 12/31/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 6.20 38.70 0.00 42.00
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 1200.0 98000.0 0.0 0.0 1.2 81.7 2200.0 180000.0 0.0 0.0 6.8 81.8 16.2 99.0
Meaning: %util near 99% and high iowait means storage is the bottleneck. Your restore won’t get faster by adding CPU.
Decision: Move restores to faster volumes, use local NVMe scratch, parallelize decompression, or reduce replay by taking more frequent base backups.
Task 18 — Confirm time sync before time-based PITR
cr0x@server:~$ timedatectl
Local time: Wed 2025-12-31 10:12:03 UTC
Universal time: Wed 2025-12-31 10:12:03 UTC
RTC time: Wed 2025-12-31 10:12:03
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: Your system clock is synchronized. Without this, “restore to 14:02” becomes guesswork.
Decision: If clocks aren’t synced, prefer position-based recovery (MySQL) or LSN-based reasoning (PostgreSQL) and fix NTP before the next drill.
Three mini-stories from corporate life
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran MySQL on managed VMs. They “had PITR” because nightly base backups existed and binlog was enabled. The assumption: binlogs were being copied to object storage by an agent, and retention in the bucket was “forever-ish.”
A developer pushed a migration that rebuilt a large table incorrectly. It didn’t break the app immediately; it just started writing subtly wrong data. Four hours later, a customer noticed. The incident commander asked for PITR to “30 minutes before deploy.”
The restore began smoothly: base backup restored, MySQL started, and the team started applying binlogs. Then the replay hit a gap. The binlog file for a 40-minute window was missing. Not corrupt. Missing.
The root cause wasn’t exotic. The copy agent had been failing silently on permission errors after a bucket policy change. Binlogs on the primary had already expired due to a too-short retention setting. Their “PITR” had a four-hour hole in the middle of the day, like a bad memory.
The fix wasn’t a new vendor. It was boring: alert on archive failures, extend retention until shipping is provably reliable, and add a drill step: “list binlogs in archive for the incident window.”
Mini-story 2: The optimization that backfired
Another organization ran PostgreSQL with WAL archiving. Restore drills were “too slow,” so someone optimized by compressing WAL aggressively and pushing it through a single-threaded encryption+upload pipeline on the database node. CPU went up, but the archive size dropped. Everyone celebrated.
The celebration lasted until the first real PITR attempt. The restore cluster could fetch WAL, but decompression became the bottleneck, and the single-threaded pipeline meant the archive lagged behind peak write periods. During the incident, the most recent WAL segments were not yet archived. The recovery stopped short of the target time.
The team tried to “just wait a bit for the archive to catch up.” That’s not an RTO; that’s a prayer with a calendar invite.
Post-incident, the fix was counterintuitive: reduce compression level, move encryption/upload off the primary, parallelize the pipeline, and set explicit SLOs for “WAL archived within X seconds.” They accepted more storage cost and got predictability back.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company had a weekly PITR drill. It was not glamorous. It was a checklist, a ticket, and a short write-up. It also annoyed everyone just enough to be effective.
During a routine drill, the engineer noticed that the restore worked but always landed five to seven minutes after the requested timestamp. The data “looked fine,” but the mismatch bothered them. They dug in and found time zone confusion: the runbook used local time, but recovery_target_time expected UTC. They fixed the runbook, standardized on UTC, and added a marker table with an inserted timestamp during deploys.
Two months later, a production incident hit: a batch job updated the wrong partition for about eight minutes. The team restored to just before the marker and validated quickly. They hit their RTO because they’d already argued about the dumb time zone detail in a low-stakes drill.
Joke #2: Time zones are like replication lag—everyone agrees they exist, and then they ruin your afternoon anyway.
Fast diagnosis playbook
Restores fail or drag for a few predictable reasons. The trick is to identify which one in minutes, not hours. This is the order I use when someone says “PITR is slow/broken.”
First: are you missing required logs?
- MySQL: confirm binlog coverage for the target time. Do you have the binlog files locally or in the archive for that entire window?
- PostgreSQL: confirm WAL archive coverage, and that restore_command can fetch the next needed segment.
If logs are missing, stop optimizing. You don’t have a performance problem; you have a data-loss boundary.
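A minimal coverage check, reusing the example paths and file names from the tasks above; the binlog archive location and the 421–423 range are placeholders for your own:
# MySQL: every binlog number between the backup's starting file and the incident
# window must exist in the archive copy, with no gaps.
for n in $(seq -w 421 423); do
  f="/binlog_archive/mysql-bin.000${n}"
  [ -f "$f" ] || echo "MISSING: $f"
done

# PostgreSQL: check archive freshness and prove the restore side can fetch a segment
# the same way recovery will (same path, same permissions).
ls -lt /wal_archive | head -n 3
cp /wal_archive/000000010000002A000000A3 /tmp/fetch_test_segment && echo "fetch OK"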
Second: is the base backup consistent with replay start?
- MySQL: base backup must match the binlog coordinates/GTID set you start from. Mismatch yields missing tables or duplicate key errors during replay.
- PostgreSQL: base backup must be complete and include necessary files; WAL required from the backup start onward must exist.
Third: is it I/O, CPU, or network?
- I/O bound: high iowait, disks at high utilization, restore speed flatlines regardless of CPU.
- CPU bound: compression/decompression/encryption pegging cores, disks not fully utilized.
- Network bound: slow fetch from archive location, high latency, throughput caps.
Fourth: did you pick the right target and timeline?
- PostgreSQL: timeline mismatches after promotion are classic. Restoring from the wrong timeline gives “requested timeline does not contain minimum recovery point” or lands at an unexpected point.
- MySQL: time-based replay is vulnerable to clock skew and time zone confusion; position/GTID is safer when possible.
Fifth: are you validating the right thing?
“Database started” is not validation. Validate a small set of business queries, constraints, and application flows. If you can’t define those, your restore success metric is mostly vibes.
Common mistakes: symptoms → root cause → fix
1) “PITR stops early and can’t reach the target time”
Symptoms: PostgreSQL recovery pauses before target; logs mention missing WAL. MySQL replay ends without reaching the incident time.
Root cause: WAL/binlog archive gap, retention too short, or shipping failures.
Fix: Alert on archive_command failures, verify archive completeness daily, extend retention, and store logs independently from the primary.
2) “MySQL binlog replay throws missing table errors”
Symptoms: ERROR 1146 (42S02): Table '...' doesn't exist during replay.
Root cause: Base backup restored from time T, but replay started from binlogs generated before the table existed (or after it was dropped/recreated).
Fix: Start replay from the exact binlog position/GTID recorded with the base backup; don’t guess using timestamps alone.
3) “PostgreSQL recovery loops on the same WAL segment”
Symptoms: Logs repeatedly show attempts to restore the same %f.
Root cause: restore_command returns success but doesn’t actually place the file (or places a truncated file). Classic when scripts swallow errors.
Fix: Make restore_command fail loudly. Verify file size and checksum. Avoid “always exit 0” wrapper scripts.
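A minimal sketch of a wrapper that refuses to lie. The path and script name are examples; the parts that matter are the non-zero exits and the size check:
#!/usr/bin/env bash
# Hypothetical WAL fetch script; configure as:
#   restore_command = '/usr/local/bin/fetch_wal.sh %f %p'
set -euo pipefail

WAL_NAME="$1"   # %f: segment PostgreSQL asked for
DEST="$2"       # %p: where PostgreSQL expects it
SRC="/wal_archive/${WAL_NAME}"

# Missing segment: exit non-zero so recovery knows the archive ran out,
# instead of handing back an empty or stale file.
[ -f "$SRC" ] || { echo "fetch_wal: ${SRC} not found" >&2; exit 1; }

cp "$SRC" "$DEST"

# A truncated copy is worse than no copy: verify the sizes match.
[ "$(stat -c %s "$SRC")" -eq "$(stat -c %s "$DEST")" ] \
  || { echo "fetch_wal: size mismatch for ${WAL_NAME}" >&2; exit 1; }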
4) “Restore is painfully slow, but CPU is low”
Symptoms: Long restore times; iostat shows saturated disks; CPU mostly idle.
Root cause: Storage throughput/IOPS ceiling; too many small random writes during replay; WAL/binlog on slow media.
Fix: Use faster scratch volumes for restore, separate data from logs, consider more frequent base backups, and ensure the archive fetch path is not throttled.
5) “Restore is fast, then fails at the end with corruption complaints”
Symptoms: PostgreSQL complains about invalid checkpoint record; MySQL won’t start due to InnoDB log issues.
Root cause: Incomplete/corrupted base backup, or filesystem snapshot taken without proper quiescing.
Fix: Use tooling designed for consistent backups (pg_basebackup, tested snapshot procedures, xtrabackup), and verify backup integrity in the drill.
6) “PITR lands at the wrong time”
Symptoms: Restored state includes changes that should have been excluded, or misses changes that should be included.
Root cause: Time zone mismatch, clock skew, or misunderstanding of inclusive/exclusive boundaries in replay.
Fix: Standardize on UTC for incident timestamps, embed a marker transaction, and prefer position/LSN/GTID targets when possible.
7) “Everything restores, but replication afterwards is chaos”
Symptoms: PostgreSQL replicas can’t follow; MySQL replication errors about executed GTIDs or duplicate entries.
Root cause: You promoted a restored instance without planning the new topology, timeline, or GTID set alignment.
Fix: Treat PITR restore as a fork: decide authoritative node, rebuild replicas from it, and document the rejoin procedure.
Checklists / step-by-step plan
Weekly PITR drill (90 minutes, boring on purpose)
- Pick a scenario: choose a known “bad change” window from the last week (or create a harmless marker transaction).
- Pick a target point: define the restore point in UTC and record it.
- Confirm log coverage: verify binlogs/WAL exist for the entire window.
- Restore base backup: to isolated infrastructure with similar storage class to production.
- Replay logs to target: MySQL via mysqlbinlog; PostgreSQL via recovery configuration.
- Validate: 5–10 critical queries, one application smoke test, and one integrity check.
- Record timings: base restore time, replay time, validation time.
- Write up: what failed, what was slow, what changed in the runbook.
Before you claim “we have PITR” (minimum bar)
- Base backups are consistent and restorable without heroics.
- Log archiving is complete, monitored, and retained per requirement.
- Restore runbook includes IAM/KMS/key access steps.
- RTO/RPO numbers are measured, not wished into existence.
- Validation queries exist and are owned by someone.
MySQL-specific drill steps
- Record base backup’s binlog file/position or GTID set.
- Confirm binlog_format and retention.
- Extract replay SQL into a file and review for dangerous statements.
- Apply replay to the restored instance; stop on first error and fix the coordinate mismatch.
- Decide topology plan: is this a new primary, a temporary analytics clone, or a source for rebuilding replicas?
PostgreSQL-specific drill steps
- Confirm WAL archiving and archive freshness.
- Restore base backup; ensure recovery.signal and a correct restore_command.
- Use recovery_target_action = 'pause' for drills to validate before promotion.
- After promotion, note the new timeline and plan replica rebuilds accordingly.
- Verify that your archive contains WAL across timeline changes (a common long-term pitfall).
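For that last item, one concrete spot-check (same example archive path as above): timeline switches leave .history files behind, and a restore that crosses a promotion needs them.
# If a promotion ever happened upstream, the archive should contain the matching
# timeline history file alongside the WAL segments.
ls -l /wal_archive/*.history 2>/dev/null || echo "no timeline history files archived"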
FAQ
1) Is PITR the same as high availability?
No. HA keeps you running through node failures. PITR recovers from logical disasters: bad deploys, deletes, corruption caused by humans. You want both.
2) Which is “easier” for PITR: MySQL or PostgreSQL?
PostgreSQL’s PITR workflow is more standardized: base backup + WAL archive + recovery targets. MySQL PITR can be clean, but it varies more by backup method and binlog format.
3) Should I use timestamp-based targets or position-based targets?
Position-based (MySQL binlog position/GTID, PostgreSQL LSN) is usually more deterministic. Time-based targets are human-friendly but fragile unless clocks and time zones are disciplined.
4) How often should I take base backups if I have logs?
As often as your replay time tolerates. Longer gaps mean longer log replay. If your WAL/binlog replay for a day takes six hours, your RTO is already decided for you.
5) Can I do PITR from logical backups (mysqldump/pg_dump)?
Not reliably as PITR in the strict sense. Logical dumps capture a point, but replaying to an exact time is messy and slow. Use them for portability; use physical backups plus logs for PITR.
6) What’s the cleanest way to validate a PITR restore?
A small set of business invariants: key counts, last-updated markers, referential integrity checks, and a lightweight application smoke test. If your validation is “SELECT 1,” you’re practicing denial.
7) What about encryption and keys?
If backups or archives are encrypted, your drill must include retrieving keys and permissions. The most common “we can’t restore” failure is access, not data.
8) Can PITR restore across major version upgrades?
Typically not for physical backups: they’re tied to a major version’s on-disk format. Major version changes usually require logical migration or upgrade tooling. Your PITR plan should assume “restore on the same major version,” then upgrade afterward if needed.
9) How do I prevent the restored instance from accidentally reconnecting to production?
Put restores in isolated networks, change credentials, and explicitly disable outbound connectivity to production services. Also: rename the cluster and add loud banners in monitoring.
10) What’s the one metric you wish every team tracked?
“Time from incident start to validated restored data.” Not “backup succeeded.” Not “WAL archived.” The end-to-end outcome is what the business experiences.
Next steps that actually reduce risk
If you do nothing else this quarter, do this: schedule a PITR drill, run it on infrastructure that resembles production, and write down exactly where you got confused. That confusion is your future outage.
- Pick one database (the one that hurts the most) and run a restore to a known marker time.
- Instrument the pipeline: alert on binlog/WAL shipping failures and archive freshness.
- Standardize targets: UTC timestamps plus a position/LSN/GTID when possible.
- Measure RTO and decide whether to spend money on faster storage, more frequent base backups, or smaller replay windows.
- Update the runbook immediately after the drill, while the pain is fresh and the lessons are honest.
PITR isn’t a feature you enable. It’s a skill you practice. The restore doesn’t care how confident you sound in meetings.