MariaDB vs PostgreSQL: Replication vs PITR—What Actually Saves You After Human Error

The pager goes off, Slack lights up, and someone says the sentence that turns your coffee into a coping mechanism:
“I ran the delete on prod.”

This is where the mythology around “we have replication” dies. Replication is great for staying up while hardware fails.
It’s terrible at saving you from humans, because it replicates humans with impressive fidelity. What saves you is the
ability to rewind time: point-in-time recovery (PITR), plus the operational discipline to make it real.

Replication vs PITR: two different promises

Replication promise: stay up when something breaks

Replication is about availability and read scaling. It answers “what if a node dies?” and sometimes “what if a zone dies?”
It does not answer “what if we write garbage?” If you replicate writes, you replicate mistakes, schema drops, and bad deploys.

The painful phrase is “replication lag.” People treat it like a performance metric; during an incident it becomes a recovery tool.
Lag can give you a window to stop the blast radius. But it’s an unreliable seatbelt: sometimes it’s there, sometimes it’s not,
and it wasn’t designed for the crash you’re about to have.

PITR promise: rewind to a known-good time

PITR is a backup strategy, not a replication strategy. The concept is simple: take a base backup (a full snapshot) and then
retain the write-ahead logs (WAL in PostgreSQL; binary logs in MariaDB). To recover, you restore the base backup and replay logs
up to just before the mistake.

PITR is what you want when the problem is “we changed data” rather than “we lost a server.” It’s also what you want when a
“read-only” report job decided to be creative and ran an UPDATE without a WHERE clause.

One paraphrased idea from John Allspaw (operations and resilience): “Reliability comes from designing for failure, not pretending it won’t happen.”

Replication keeps the service online. PITR keeps your career online.

Human error modes: what actually happens

The boring taxonomy of disasters (that keeps repeating)

  • Accidental delete/update: a WHERE clause missing, wrong tenant ID, or “quick fix” in a console.
  • Schema changes gone wrong: dropping a column/table, bad migration order, adding NOT NULL before backfill.
  • Data import mistakes: CSV mapped wrong, ETL job uses the wrong environment, duplicated keys.
  • Privilege mistakes: granting more than intended, revoking access that something still depended on, or running maintenance as superuser.
  • Application bugs: “works in staging” and then the new code deletes the wrong rows at scale.

Why replication fails you here

Most replication in both MariaDB and PostgreSQL is designed to replicate committed changes. It’s deterministic enough
to keep copies consistent. That’s great until the change you committed is exactly the thing you want to un-commit.

The first trap is psychological: teams see multiple nodes and assume “we can always fail over.” Failover does not undo writes.
It just moves you to a different copy of the same mistake.

Joke #1: Replication is like group chat—everyone gets the message, including the embarrassing one you wish you could unsend.

What PITR can’t do either (so you don’t romanticize it)

PITR can’t reconstruct history you didn’t retain. If you don’t archive WAL/binlogs reliably, or you rotate them too aggressively,
your rewind button is cosmetic. PITR also won’t fix application-level corruption if you can’t identify a safe point in time.
You still need forensics: when did the bad write start, and when did it stop?

MariaDB recovery mechanics: binlogs, GTIDs, replicas

Replication basics: what you actually have in production

MariaDB (and MySQL lineage) replication commonly uses binary logs from a primary, applied by replicas. You’ll encounter
file/position replication and GTID-based replication. GTID is generally easier to operationalize, but it doesn’t magically
solve human error; it just makes failover and topology changes less error-prone.

If you run semi-sync, you reduce the chance of losing the last transactions during failover. That’s an availability/data-loss
knob, not a human-error knob.
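
For reference, a minimal sketch of pointing a MariaDB replica at a primary using GTID; the host, user, and password are placeholders, and a replication user is assumed to already exist on the primary:

CHANGE MASTER TO
  MASTER_HOST = 'db-primary.internal',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'change-me',
  MASTER_USE_GTID = slave_pos;
START SLAVE;

Once running, SHOW SLAVE STATUS\G should report Using_Gtid as Slave_Pos or Current_Pos, which is exactly the field checked in Task 3 later in this article.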

PITR in MariaDB: base backup + binlog replay

MariaDB PITR is typically: take a full backup (logical with mariadb-dump or physical with tools like mariabackup),
then keep binary logs and replay them until just before the bad statement. The operational edge is that binlog replay can be
filtered and targeted if you use row-based logging and have good timestamps/GTIDs.

Statement-based binlogs can be a footgun in recovery because replay may not be deterministic depending on functions, time,
or non-deterministic statements. Row-based logging is heavier but more predictable for recovery.
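
If you want row-based logging plus a retention window you can actually recover within, a minimal server config sketch (assuming MariaDB 10.6+ for binlog_expire_logs_seconds; the path and the 14-day window are placeholders to adapt):

[mysqld]
server_id                  = 101
log_bin                    = /var/log/mysql/mariadb-bin
binlog_format              = ROW
binlog_expire_logs_seconds = 1209600   # 14 days; size this to your detection time, not your disk anxiety

Shipping those binlogs off-host to storage you can monitor is what turns “logs exist” into PITR.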

The replica “delay” pattern (useful, but not a plan)

Some teams run a delayed replica (intentional lag) so that accidental deletes on primary don’t immediately apply.
This can save you if the mistake is noticed quickly and the delay window covers it.

It’s still not a complete recovery plan. Delayed replicas break in real life: someone restarts it and it catches up,
or the SQL thread stops and you don’t notice until it’s useless, or the delay is too small for slow-burn incidents.
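
If you do run one, a minimal sketch of configuring the delay on the replica (MariaDB 10.2.3+; the one-hour window is a placeholder):

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;  -- one hour of intentional lag
START SLAVE;

Then monitor the effective delay, not just the setting: SHOW SLAVE STATUS\G reports SQL_Delay (what you configured) alongside Seconds_Behind_Master and SQL_Remaining_Delay (what you actually have), and your alerting should fire when the real delay collapses.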

PostgreSQL recovery mechanics: WAL, timelines, standbys

Replication basics: streaming replication and its shape

PostgreSQL streaming replication ships WAL from primary to standbys. Standbys can be synchronous or asynchronous.
Like MariaDB, replication is about keeping a copy close enough to take over.

The operational nuance: PostgreSQL has timelines. When you promote a standby, you create a new timeline. That’s normal,
but it matters for PITR because your WAL archive must include timeline history files, and you need to understand what you’re restoring.
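
One quick way to see which timeline a server is on (PostgreSQL 9.6+):

cr0x@server:~$ psql -X -c "SELECT timeline_id FROM pg_control_checkpoint();"

After a promotion you should also see timeline history files (small *.history files) land in the WAL archive; if they are missing, restores that cross the promotion can fail in confusing ways.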

PITR in PostgreSQL: base backup + WAL archiving

PostgreSQL PITR is a first-class concept: a base backup plus continuous archiving of WAL segments. Recovery is configured by
restoring the base backup, then using a recovery signal plus target time/LSN, and replaying WAL from your archive.

The big win: PostgreSQL has strong tooling around WAL replay and a clean separation between streaming replication and WAL archiving.
But you still have to build the pipeline: archive_command reliability, retention, validation, and restore tests.

Joke #2: WAL is your database’s flight recorder—except it only helps if you didn’t store it in the same plane.
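
Concretely, a minimal sketch of the two halves, assuming the same /wal-archive mount used in the tasks below; the backup destination is a placeholder and connection options are omitted:

# postgresql.conf: prerequisites for PITR
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f'

# take a base backup (run as the postgres OS user)
pg_basebackup -D /backups/base-$(date +%F) -X stream -c fast -P

Real deployments usually point archive_command at a wrapper or a tool such as pgBackRest or WAL-G rather than a bare cp, but the contract is the same: both halves (base backup and continuous WAL archive) must exist, and both must be monitored.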

Logical replication isn’t your parachute

PostgreSQL logical replication (and similar change data capture systems) can help selectively replicate tables and avoid some
categories of schema change pain. It is not a substitute for PITR. Logical replication streams changes too, including mistakes,
and often lacks the ability to rewind without a separate snapshot strategy.

What saves you after human error: decision matrix

First principles: define the question before you pick a tool

After human error, you need answers to three questions:

  1. Blast radius: which data changed, and how much?
  2. Time bounds: when did the bad change start, and when was it detected?
  3. Recovery objective: do you need to restore the whole cluster, or just a schema/table/tenant?

Replication helps when:

  • A node died and you need to fail over with minimal downtime.
  • You need a read-only copy to offload queries.
  • You want quick replacement capacity while you rebuild the failed host.

PITR helps when:

  • Data was modified incorrectly and you need to go “back in time.”
  • Schema changes were applied and you need to recover to just before them.
  • Ransomware/compromise encrypted or altered data across all replicas.

Hard truth: you typically need both

Mature systems run replication for availability and PITR for recovery. If budget forces a choice, choose PITR first for
systems where data correctness matters more than uptime. That’s most business systems, even the ones whose owners insist
“we can recreate it.” They can’t. They just don’t want to pay for storage.

Practical tasks (commands, outputs, decisions)

These are the kind of tasks you run during a drill or an incident: quick, concrete, and tied to decisions. The outputs
shown are representative. Your environment will differ, but the meaning doesn’t.

Task 1 — MariaDB: confirm binlog is enabled and what format you’re in

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'log_bin'; SHOW VARIABLES LIKE 'binlog_format';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| binlog_format | ROW       |
+---------------+-----------+

What it means: binlogs exist, and they’re row-based (good for deterministic replay).
Decision: if log_bin is OFF, stop pretending you have PITR; you have backups only.
If binlog_format is STATEMENT, consider switching to ROW for recoverability (after checking workload impact).

Task 2 — MariaDB: find current binlog file/position (baseline for incident timeline)

cr0x@server:~$ mariadb -e "SHOW MASTER STATUS\G"
*************************** 1. row ***************************
File: mariadb-bin.001842
Position: 987654321
Binlog_Do_DB:
Binlog_Ignore_DB:

What it means: this identifies where the primary is “now.”
Decision: record it in the incident doc; you’ll use file/position or GTID ranges to bound recovery.

Task 3 — MariaDB: check replica health and lag

cr0x@server:~$ mariadb -e "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Using_Gtid"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 2
Using_Gtid: Current_Pos

What it means: replication is healthy and nearly caught up.
Decision: if human error just happened and lag is small, you probably can’t use lag as a safety window.
If Slave_SQL_Running is No, treat it as an incident—your “safety replica” might be silently dead.

Task 4 — MariaDB: immediately stop a replica to preserve a “pre-mistake” copy (if you caught it in time)

cr0x@server:~$ mariadb -e "STOP SLAVE SQL_THREAD; SHOW SLAVE STATUS\G" | egrep "Slave_SQL_Running|Seconds_Behind_Master"
Slave_SQL_Running: No
Seconds_Behind_Master: 57

What it means: SQL thread is stopped; IO thread may still fetch binlogs, but not apply them.
Decision: freeze this replica. Do not restart SQL thread until you’ve extracted needed rows or built a recovery plan.
This is a tactical move, not your final restore.
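
If you only need a handful of rows back, a sketch of pulling them from the frozen replica; the database, table, and tenant filter follow the example in Task 5, and the flags shown exist in mariadb-dump/mysqldump:

cr0x@server:~$ mariadb-dump appdb orders --where="tenant_id=42" \
    --no-create-info --single-transaction > /tmp/orders_tenant42.sql

Review the dump before replaying anything into production, and remember this only helps if the replica really stopped before the delete applied.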

Task 5 — MariaDB: locate a suspicious event in binlogs by time

cr0x@server:~$ mysqlbinlog --start-datetime="2025-12-30 09:55:00" --stop-datetime="2025-12-30 10:05:00" /var/lib/mysql/mariadb-bin.001842 | head -n 30
# at 456700001
#251230  9:58:12 server id 101  end_log_pos 456700321 CRC32 0x2a1b3c4d  GTID 0-101-998877
BEGIN
# at 456700321
#251230  9:58:12 server id 101  end_log_pos 456701234 CRC32 0x1c2d3e4f  Query   thread_id=8899 exec_time=0 error_code=0
use appdb/*!*/;
SET TIMESTAMP=1767088692/*!*/;
DELETE FROM orders WHERE tenant_id=42;
/*!*/;

What it means: you’ve found the offending statement and its time/GTID.
Decision: set your PITR stop point just before this GTID/time; or plan a table-level reinsert if feasible.
If binlog is row-based, the output will show row events rather than the SQL, which can still be used to replay/skip precisely.
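
For row-based binlogs, a sketch of decoding the row events into readable pseudo-SQL (the decode flags exist in both mysqlbinlog and mariadb-binlog):

cr0x@server:~$ mysqlbinlog --base64-output=decode-rows -vv \
    --start-datetime="2025-12-30 09:55:00" --stop-datetime="2025-12-30 10:05:00" \
    /var/lib/mysql/mariadb-bin.001842 | less

The decoded output shows the affected row images as commented pseudo-SQL, which is exactly what you need to bound the damage or build a targeted re-insert.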

Task 6 — MariaDB: replay binlogs into a restored copy up to a safe point

cr0x@server:~$ mysqlbinlog --stop-datetime="2025-12-30 09:58:11" /archives/mariadb-bin.* | mariadb appdb
Query OK, 0 rows affected (0.001 sec)
Query OK, 0 rows affected (0.000 sec)

What it means: you applied changes up to just before the bad statement. In practice, start the replay from the binlog file/position (or GTID) recorded by your base backup, not from the beginning of the archive.
Decision: validate row counts/business checks on this restored copy; then decide whether to cut over, extract rows, or run a controlled merge.

Task 7 — PostgreSQL: confirm WAL archiving is configured

cr0x@server:~$ psql -X -c "SHOW wal_level;" -c "SHOW archive_mode;" -c "SHOW archive_command;"
 wal_level
----------
 replica
(1 row)

 archive_mode
--------------
 on
(1 row)

                archive_command
------------------------------------------------
 test ! -f /wal-archive/%f && cp %p /wal-archive/%f
(1 row)

What it means: WAL archiving is on, and the archive_command is a simple copy to a local archive directory.
Decision: if archive_mode is off, you do not have PITR. If archive_command is fragile (no retries, no alerting),
treat it as a reliability bug and fix it before you need it.
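
A sketch of a sturdier archive_command, assuming a hypothetical wrapper script at /usr/local/bin/archive_wal.sh; real setups often use pgBackRest or WAL-G instead:

# postgresql.conf
archive_command = '/usr/local/bin/archive_wal.sh %p %f'

#!/bin/bash
# /usr/local/bin/archive_wal.sh: archive one WAL segment; exit 0 only on success
set -euo pipefail
src="$1"; name="$2"; dest="/wal-archive/$name"
if [ -f "$dest" ]; then
    cmp -s "$src" "$dest" && exit 0   # identical copy already archived: success
    exit 1                            # same name, different content: fail loudly
fi
cp "$src" "$dest.tmp"
mv "$dest.tmp" "$dest"                # write-then-rename avoids half-written segments

Whatever you use, the contract is the same: nonzero exit on any doubt, because PostgreSQL recycles WAL as soon as archive_command reports success. Dedicated tools also fsync, retry, and alert, which is why they exist.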

Task 8 — PostgreSQL: verify WAL is actually being archived (not just “configured”)

cr0x@server:~$ psql -X -c "SELECT now(), last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;"
              now              |     last_archived_wal     |     last_archived_time      | failed_count
-------------------------------+---------------------------+-----------------------------+--------------
 2025-12-30 10:06:40.12345+00  | 000000010000003A0000009F  | 2025-12-30 10:06:12+00      |            0
(1 row)

What it means: archiving is working recently; failures are zero.
Decision: if failed_count is increasing, stop. You’re building a false sense of security.
Fix archive failures first, or your restore will hit a missing segment and die halfway through.

Task 9 — PostgreSQL: check replication lag during an incident

cr0x@server:~$ psql -X -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
 application_name |   state   | write_lag | flush_lag | replay_lag
------------------+-----------+-----------+-----------+------------
 standby-a        | streaming | 00:00:00  | 00:00:00  | 00:00:02
(1 row)

What it means: the standby is 2 seconds behind in replay.
Decision: this lag is not a recovery plan. If the bad write already committed, it will likely hit the standby too.
If you need a “freeze point,” you must act fast (or rely on PITR).

Task 10 — PostgreSQL: find the approximate time of bad changes using logs/LSN

cr0x@server:~$ psql -X -c "SELECT now(), pg_current_wal_lsn();"
              now              | pg_current_wal_lsn
-------------------------------+--------------------
 2025-12-30 10:07:10.551+00    | 3A/9F123ABC
(1 row)

What it means: you have a current LSN marker.
Decision: during incident response, record LSNs alongside timestamps. They help reconcile what happened if clocks drift or logs are messy.

Task 11 — PostgreSQL: restore a base backup and configure PITR target time

cr0x@server:~$ sudo -u postgres tee -a /var/lib/postgresql/16/main/postgresql.auto.conf >/dev/null <<'EOF'
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2025-12-30 09:58:11+00'
recovery_target_action = 'promote'
EOF
cr0x@server:~$ sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal

What it means: on the next startup of this restored instance, recovery.signal puts PostgreSQL into targeted recovery; it pulls WAL from the archive via restore_command, stops replay at the target time, then promotes.
Decision: choose a target time just before the destructive transaction. If uncertain, pick earlier and then re-apply known-good changes carefully.

Task 12 — PostgreSQL: confirm recovery stopped where you asked

cr0x@server:~$ psql -X -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
 pg_is_in_recovery | pg_last_wal_replay_lsn
-------------------+------------------------
 f                 | 3A/9EFFFF00
(1 row)

What it means: recovery is finished (f) and you have the last replayed LSN.
Decision: run validation queries now. If data looks correct, you can plan cutover. If not, re-run recovery with a different target time/LSN.

Task 13 — Storage sanity check: verify archive volume has space and isn’t silently full

cr0x@server:~$ df -h /wal-archive
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  412G   63G  87% /wal-archive

What it means: WAL archive is nearing capacity.
Decision: if this hits 100%, WAL archiving fails and PITR breaks. Implement retention and alerting; do not “just add space” as a strategy.
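
If you need to trim the archive, a sketch using pg_archivecleanup (ships with PostgreSQL); the segment name is a placeholder and must be the oldest WAL still required by your oldest retained base backup, taken from that backup’s backup_label or .backup history file:

cr0x@server:~$ pg_archivecleanup -n /wal-archive 000000010000003A00000080
cr0x@server:~$ pg_archivecleanup /wal-archive 000000010000003A00000080

The -n run only prints what would be deleted; drop the flag only once you are sure of the boundary, because removing one segment too many breaks every PITR that needs it.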

Task 14 — MariaDB: verify binlog retention won’t eat your recovery window

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'expire_logs_days'; SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| expire_logs_days | 1     |
+------------------+-------+
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| binlog_expire_logs_seconds| 0      |
+---------------------------+--------+

What it means: binlogs expire after 1 day via expire_logs_days.
Decision: if your detection time for mistakes is longer than a day (it is), expand retention or ship binlogs to durable storage.
Otherwise PITR will fail with missing logs exactly when you need it.
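
A sketch of widening retention on a running server (MariaDB 10.6+; on older versions use expire_logs_days instead, and persist whichever you choose in the config as well):

cr0x@server:~$ mariadb -e "SET GLOBAL binlog_expire_logs_seconds = 1209600;"

That buys roughly two weeks, assuming the disk holding the binlogs can absorb it; run a Task 13-style capacity check before you flip it.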

Fast diagnosis playbook

When someone yells “data is gone,” you need to identify whether this is (a) a replication problem, (b) a backup/PITR problem,
or (c) an application bug still actively damaging data. This is triage, not philosophy.

First: stop the bleeding

  1. Freeze writes: disable the job, feature flag, or deploy. If you can’t, block the app user at the DB temporarily (see the sketch after this list).
  2. Preserve evidence: snapshot the affected volume if you can; stop a replica’s apply thread if you caught it early.
  3. Record markers: current time, current binlog file/position or WAL LSN, and who ran what.
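
A sketch of the “block the app user” move in both engines; the role, user, and database names are placeholders:

-- PostgreSQL: stop new logins for the app role, then end its existing sessions
ALTER ROLE app_user NOLOGIN;
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'app_user';

-- MariaDB: make the server read-only for regular users
SET GLOBAL read_only = ON;

Note that read_only does not stop users with SUPER (or READ ONLY ADMIN) privileges, and blocking the app is an outage by definition; that trade is usually still better than letting a bad job keep writing.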

Second: decide which recovery path is viable

  1. Is there a safe replica? Only if it’s definitely pre-mistake (intentional delay, stopped in time, or separate topology).
  2. Is PITR viable? Confirm log archiving is continuous, retention covers the time window, and you have a base backup.
  3. Can you do targeted repair? Sometimes you can reinsert rows from a snapshot/replica without full restore.

Third: find the bottleneck (RTO killer)

  • Restore throughput: base backup download + unpack + filesystem performance.
  • Replay speed: WAL/binlog apply rate; check if you’re I/O bound or CPU bound.
  • Validation & cutover: can you verify correctness quickly and switch traffic safely?

Common mistakes: symptoms → root cause → fix

“We failed over, but the bad delete is still there.”

Symptom: you promote a replica and the missing rows remain missing.

Root cause: replication propagated the delete; failover only changed which copy you’re looking at.

Fix: use PITR or a preserved pre-mistake replica/snapshot; implement PITR with tested restores; consider delayed replica as a supplement.

“PITR failed halfway with missing WAL/binlog files.”

Symptom: recovery stops with “requested WAL segment has already been removed” (PostgreSQL) or binlog file not found (MariaDB).

Root cause: retention too short, archiving pipeline failing, or archive stored on the same disk that died.

Fix: increase retention, alert on archiver failures, store archives on separate durable storage, and run periodic restore drills that prove continuity.

“We restored, but data is inconsistent / constraints fail.”

Symptom: restored DB won’t start cleanly, or app errors explode due to missing relations/constraints.

Root cause: mixing logical dumps with physical log replay incorrectly; or restoring schema from one time and data from another.

Fix: keep PITR coherent: base backup must match the log stream. For partial restores, export/import consistently (schema + data) and validate.

“Replica was supposed to be delayed, but it wasn’t.”

Symptom: delayed replica contains the mistake.

Root cause: delay configured incorrectly, replica restarted and caught up, or monitoring didn’t alert on delay not being applied.

Fix: monitor effective delay, not just configuration. Test the workflow: can you actually stop apply and extract data during a drill?

“Archive disk filled up silently; now we have no PITR.”

Symptom: archiving failures start; later you discover gaps.

Root cause: no retention, no alerting, or archive stored locally without capacity planning.

Fix: set retention policies, monitor free space and archiver stats, and treat WAL/binlog archive as production-critical storage with SLOs.

“Restore is too slow; RTO is fantasy.”

Symptom: base restore takes hours; WAL replay takes longer; business panics.

Root cause: backups stored on slow media, restore path untested, encryption/compression bottlenecks, or insufficient IOPS.

Fix: measure restore throughput quarterly, keep warm standby base backups, tune storage, and consider incremental/base backup strategies.

Checklists / step-by-step plan

Build a recovery posture that survives humans (90-day plan)

  1. Define RPO/RTO per database: what you can lose, and how long you can be down.
  2. Implement replication for availability: at least one standby/replica in a separate failure domain.
  3. Implement PITR:
    • MariaDB: physical base backup + binlog shipping and retention.
    • PostgreSQL: base backup + WAL archiving with monitored pg_stat_archiver.
  4. Separate storage: archive logs off the primary’s disks and ideally off the same blast radius.
  5. Retention by detection time: keep logs long enough that you can notice mistakes. Many teams need weeks, not days.
  6. Run restore drills monthly: a full PITR restore to a scratch cluster, with validation queries and a written timeline.
  7. Document the cutover: DNS/connection strings, read-only mode, application coordination, and rollback of the rollback.
  8. Add guardrails: limit prod console access, require reviewed migrations, protect dangerous tables (permissions), and use safer tooling.

Incident response steps: accidental delete/drop

  1. Stop writers: freeze the job/deploy; prevent further changes.
  2. Mark the timeline: exact detection time; identify approximate start time from app logs/audit logs.
  3. Choose recovery strategy:
    • Small blast radius: extract missing rows from PITR restore or stopped replica and reinsert.
    • Large blast radius: full PITR restore and controlled cutover.
  4. Execute recovery in parallel: one person builds restore environment; another validates scope; another handles comms.
  5. Validate like you mean it: row counts, business invariants, application smoke tests.
  6. Cut over: switch traffic, monitor error rates, keep old primary isolated until confident.

Three corporate mini-stories from the trenches

Mini-story 1: the incident caused by a wrong assumption

A mid-sized B2B platform ran MariaDB with a primary and two replicas. The ops deck said “HA: yes” and everyone relaxed.
A developer ran a cleanup script that was supposed to delete test tenants. It deleted real tenants. Quickly.

The on-call did what the runbook suggested: promote a replica. That took minutes, and the site came back. Everyone cheered,
until customer support reported the same missing tenants. Of course. The delete replicated. They’d just failed over to a
different copy of the same crime scene.

They did have nightly logical dumps, but no binlog retention beyond a few hours and no tested restore path for a “restore to 10:03.”
The team tried to reconstruct from application events and partial exports. It worked for some tenants, but not all, and it took days.

The lasting fix wasn’t “train devs better.” It was enabling proper binlog shipping off-host, increasing retention, and
running a monthly PITR drill where someone intentionally deletes a tenant in a sandbox and recovers it.

Mini-story 2: the optimization that backfired

Another company ran PostgreSQL with WAL archiving. Someone noticed the WAL archive volume growing quickly and “optimized”
retention by keeping only a couple of days. They also compressed archives aggressively and pushed them to slower storage.
It looked great on a cost dashboard.

Three weeks later, a long-lived application bug began corrupting a subset of rows. It wasn’t dramatic—just subtle enough
to pass casual inspection. When it was finally discovered, the team needed to restore to a point before the bug shipped.
That was well outside the WAL retention window.

They attempted to stitch together a recovery using weekly base backups plus partial logical exports, but the restored data
didn’t align cleanly with current schema migrations. They spent the night arguing about which “truth” to trust: the database
or the app’s derived caches. That’s not a fun debate at 3 a.m.

The fix was boring: keep WAL archives longer (based on detection time, not disk anxiety), store them on storage that can
sustain restore throughput, and monitor actual restore RTO as a metric. Cost still mattered, but they optimized with data,
not vibes.

Mini-story 3: the boring but correct practice that saved the day

A financial services team (regulatory pressure does wonders for good habits) ran PostgreSQL with streaming replication,
plus WAL archiving to a separate storage system. Every month, a scheduled job restored last week’s base backup into an
isolated environment and ran a set of “invariants”: balances sum correctly, foreign keys match, and a sample of customer
journeys can be replayed.

One afternoon, an engineer applied a migration that dropped an index and then ran an update that accidentally hit far more
rows than intended. The site stayed up—replication was fine—but customer-facing data was wrong. They immediately froze writes.

Because they had done restore drills, they already knew the restore time, the commands, and which storage path was fastest.
They restored to five minutes before the migration, validated invariants, then used a controlled replay of application events
for a narrow window. Cutover was tense but clean.

Nobody got a standing ovation. That’s the point. The reward for correctness is a quiet incident channel and everyone going to bed.

Facts & historical context worth knowing

  • PostgreSQL’s WAL heritage goes back decades: the core idea of write-ahead logging is older than many “modern” backup products.
  • MariaDB was forked from MySQL in 2009, in response to Oracle’s acquisition of Sun; that history shaped enterprise trust and ecosystem tooling choices.
  • MySQL/MariaDB binlog formats evolved: statement-based logging came first; row-based became the safer choice for deterministic recovery.
  • PostgreSQL timelines matter: every promotion creates a new timeline; forgetting timeline history files can break restores in confusing ways.
  • Replication was never designed as undo: both systems prioritize consistency of copies, not reversibility of transactions.
  • “Synchronous replication” doesn’t mean “safe from humans”: it reduces data loss on crash/failover, not logical corruption.
  • Logical replication is selective but not time travel: it’s great for migrations and partial replication, not for rewinding mistakes.
  • Many outages are detection failures: human error often happens minutes or hours before it’s noticed; retention must match reality.

FAQ

1) If I have replication, do I still need PITR?

Yes. Replication handles hardware and some failover scenarios. PITR handles bad writes, schema mistakes, and compromise that propagates.
If you only pick one for a business-critical DB, pick PITR and accept some downtime risk.

2) Can a delayed replica replace PITR?

No. It’s a useful supplement: a quick way to grab missing rows or preserve a pre-mistake copy. But it can fail silently,
and it won’t cover slow-burn corruption beyond its delay window.

3) What’s the biggest operational difference between MariaDB and PostgreSQL recovery?

PostgreSQL has a very explicit PITR workflow built around base backups and WAL archiving with strong introspection (like pg_stat_archiver).
MariaDB PITR is powerful but more “assemble your own adventure”: physical backups plus binlog management and careful replay.

4) Should I use logical dumps for PITR?

Logical dumps are great for portability and table-level restores, but they’re not PITR by themselves. PITR needs a consistent base + log stream.
Use logical dumps as a complement, not the backbone, unless your data size and RTO allow it.

5) How long should I retain WAL/binlogs?

Base it on detection time, not comfort. If your organization routinely discovers issues days later, keep at least weeks of logs.
Also ensure you can afford the restore time for that window.

6) What’s the best way to know my RTO is real?

Perform restores on a schedule and measure: time to fetch backup, time to restore, time to replay logs, time to validate,
time to cut over. If you don’t measure, your RTO is a wish.

7) Can I do PITR to “right before a transaction” reliably?

Usually yes, but precision varies. PostgreSQL can target timestamps and can also work with LSN targets. MariaDB can replay to a datetime
and often to GTID boundaries depending on tooling and logging format. The more you log and the more you practice, the more precise you get.

8) What about restoring just one table or one tenant?

The common pattern is: PITR restore into an isolated environment, extract the needed rows (or schema/table), then merge into prod carefully.
This is slower than a full cutover but avoids global rollback for localized mistakes.

9) Does encryption/compression of archives affect recovery?

Yes. It can crush restore throughput and turn “we have backups” into “we have slow backups.” If you compress, test restore speed under load.
Don’t optimize storage costs by making recovery impossible.

Conclusion: next steps you can do this week

If you take one operational lesson from the MariaDB vs PostgreSQL debate, make it this: replication is not a rewind button.
Both databases can recover beautifully after human error—but only if you treat PITR as a production feature, not a checkbox.

  1. Pick a realistic recovery window (RPO/RTO) and align log retention to your detection time.
  2. Prove PITR works: do one full restore drill end-to-end and time it.
  3. Instrument the archive pipeline: alert on WAL/binlog archiving failures and on archive storage capacity.
  4. Write the “stop the bleeding” runbook: freezing writes is often the difference between a 10-minute fix and a week-long reconstruction.
  5. Keep replication for uptime, but stop selling it internally as protection from mistakes. It’s not.

The day you need this, you won’t have time to become the person who knows it. Build the muscle now, while nobody is watching.
