MySQL vs PostgreSQL backups in containers: how to avoid fake backups

Containerized databases are a productivity miracle until the day you need a restore. Then you discover you’ve been lovingly shipping gzipped disappointment to object storage for months. It has a filename, a checksum, a retention policy, and absolutely no ability to bring your service back.

“Fake backups” are backups that complete successfully but cannot restore to a consistent, working database state within the RPO/RTO you promised your business (or yourself at 2 a.m.). In containers, the failure modes multiply: ephemeral filesystems, volume mounts you forgot about, missing WAL/binlogs, and snapshots taken at the worst possible moment.

What a fake backup looks like (and why containers make it worse)

A fake backup is anything that gives you confidence without giving you recoverability. The usual suspects:

  • Backups that “succeed” but restore fails (permissions, missing roles, missing extensions, corrupt archive, incompatible versions).
  • Backups that restore but data is wrong (inconsistent snapshot, partial dump, missing binlogs/WAL, timezone/encoding surprises).
  • Backups that restore but take 12 hours and your RTO was “less than one hour.” The backup exists; your job does not.
  • Backups stored inside the container filesystem, which disappears during reschedules, node drains, or image refreshes.
  • Backups that omit the only thing you needed: object permissions, routines, triggers, users, grants, config, or the WAL/binlog stream needed for point-in-time recovery.

Containers amplify these problems for two reasons. First, they make “where data lives” deceptively ambiguous. You can run mysqldump or pg_dump from a cron job in the container, see a file appear under /backups, and feel good. But if /backups isn’t a persistent volume, you just created a motivational poster.

Second, container platforms encourage stateless thinking. Great for app servers. Dangerous for databases. The platform will happily restart your database pod while you are taking a filesystem snapshot unless you set guardrails. It will also happily rotate secrets and break your backup job while still reporting the CronJob as “Completed.”

Your rule: a backup is only “real” if you can restore it, verify correctness, and do it within the time you promised, using automation you trust.

Joke #1: A backup that’s never been restored is like a parachute you bought on sale and never unfolded. You can guess how the first test goes.

MySQL vs PostgreSQL: what “consistent backup” actually means

Consistency is a contract, not a file

Consistency is about transactional correctness. Both MySQL (InnoDB) and PostgreSQL can provide consistent backups, but they do it differently and punish different mistakes.

MySQL: InnoDB, redo logs, and binlogs

MySQL has two big worlds: logical backups (mysqldump, or the now-deprecated mysqlpump) and physical backups (Percona XtraBackup, MySQL Enterprise Backup). Logical backups are portable and readable; physical backups are faster and better for big databases, but more sensitive to version and configuration mismatches.

For point-in-time recovery (PITR), MySQL relies on the binary log (binlog). A full backup without binlogs is often a “last night” restore, not “five minutes ago.” If your business expects “five minutes ago,” you need binlogs shipped and retained with a known mapping to the backup.
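
To make “five minutes ago” concrete: after restoring the full backup, PITR means replaying binlog events up to a chosen point with mysqlbinlog. A minimal sketch, assuming binlogs were shipped to a durable path such as /srv/binlogs/mysql01; the path, file names, and timestamps are placeholders you must take from your own backup metadata.

cr0x@server:~$ mysqlbinlog --start-datetime="2025-12-31 02:10:00" --stop-datetime="2025-12-31 09:40:00" /srv/binlogs/mysql01/binlog.000231 /srv/binlogs/mysql01/binlog.000232 | mysql -uroot -p

In practice you would start from the binlog coordinates recorded with the full backup (--start-position) rather than guessing a datetime.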

Common fake-backup pattern in MySQL containers: you run mysqldump but forget --single-transaction for InnoDB, or you dump from a replica with replication lag, or you don’t include routines/events/triggers, or you don’t capture users/grants. It restores, but the app falls over.

PostgreSQL: MVCC, base backups, and WAL

PostgreSQL’s world is MVCC and write-ahead logging (WAL). A consistent physical backup is a base backup plus the WAL segments needed to bring it to a consistent point (and optionally forward in time). Logical backups (pg_dump) are consistent at the database level, but they’re slower for large datasets and don’t capture cluster-wide objects unless you add pg_dumpall for roles and globals.

Common fake-backup pattern in PostgreSQL containers: you take a filesystem snapshot of the data directory without coordinating with PostgreSQL, don’t include WAL, and end up with a backup that fails with “invalid checkpoint record” when you try to start it. Another: you use pg_dump but forget roles, extensions, or schema search_path assumptions.

The operational takeaway

  • If you need fast restores and large databases: favor physical backups (XtraBackup / pg_basebackup) plus PITR logs.
  • If you need portability and smaller datasets: logical backups are fine, but you must include the full set of objects you need and you must test restores.
  • In containers: mount persistent volumes correctly, keep logs (WAL/binlog) outside ephemeral layers, and treat backup jobs as production systems.

One operational quote that ages well: “Hope is not a strategy.” The SRE crowd treats it as a traditional saying for good reason.

Interesting facts and historical context (because the past explains your pager)

  1. PostgreSQL’s lineage traces back to the POSTGRES project at UC Berkeley in the 1980s; WAL and MVCC matured as it grew into a reliability-first database.
  2. MySQL’s early popularity came from speed and simplicity in web stacks, but transactional reliability for serious workloads arrived with InnoDB becoming the default storage engine years later.
  3. WAL (write-ahead logging) is older than both MySQL and PostgreSQL; the principle dates to database research long before containers made everything feel disposable.
  4. MySQL replication historically leaned on statement-based binlogs; today row-based logging is common because it avoids subtle nondeterministic replay issues.
  5. PostgreSQL PITR became mainstream as WAL archiving tooling matured; without WAL retention, “backup” means “a time capsule,” not “recovery.”
  6. Filesystem snapshots became a practical backup primitive as copy-on-write filesystems and storage arrays improved; they’re powerful but easy to misuse without database coordination.
  7. Percona XtraBackup gained traction because it enabled hot physical backups for InnoDB without shutting down MySQL, changing the economics of backing up large datasets.
  8. Container orchestration normalized immutable images and ephemeral nodes; the mismatch with stateful databases created a decade-long industry of “stateful on Kubernetes” hard lessons.
  9. Checksums and “successful jobs” have always lied by omission: backup success is an I/O event, restore success is a correctness event. Operations learned this the loud way.

Backup types that work in containers (logical, physical, snapshots, and PITR)

Logical backups: portable, slower, easy to fake

MySQL: mysqldump is fine if you use it correctly and your dataset is not massive. You need --single-transaction for InnoDB consistency without locking tables, and you probably want --routines --triggers --events. But logical dumps can still be fake: they can be incomplete, silently truncated, or restore into a schema that doesn’t match production.
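
One cheap extra check worth wiring in: by default mysqldump writes a “-- Dump completed” comment as the last line of the dump, so a gzipped dump whose tail is missing it was almost certainly cut off mid-write. A quick sketch against the backup path used in the tasks below:

cr0x@server:~$ zcat /srv/backups/mysql01/appdb.sql.gz | tail -n 1 | grep -q "Dump completed" && echo "footer present" || echo "dump looks truncated"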

PostgreSQL: pg_dump produces consistent dumps thanks to MVCC, but it’s per-database. Cluster-wide objects (roles, tablespaces, some settings) are separate. If you restore into a clean cluster without recreating roles or extensions, your app breaks in inventive ways.
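
A cheap way to see what a custom-format dump actually contains, without restoring it, is to list its table of contents; if pg_restore cannot read the TOC, the real restore will not go better. A sketch, assuming pg_restore is installed wherever the file lives:

cr0x@server:~$ pg_restore -l /srv/backups/pg/appdb.dump | head -n 20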

Physical backups: fast, operationally strict

MySQL: physical backups with XtraBackup (or enterprise equivalent) are the grown-up choice for large datasets. But you must handle redo logs, prepare the backup, and keep version compatibility in mind.
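
The shape of the XtraBackup workflow, as a hedged sketch rather than a drop-in script (target directory and credentials are placeholders, and exact flags vary by XtraBackup version): take the hot copy, then apply the redo with --prepare. A copy that was never prepared is just another flavor of fake backup.

cr0x@server:~$ xtrabackup --backup --user=backup --password='...' --target-dir=/backups/xtra/2025-12-31
cr0x@server:~$ xtrabackup --prepare --target-dir=/backups/xtra/2025-12-31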

PostgreSQL: pg_basebackup plus WAL archiving gives you consistent, restorable physical backups. It’s straightforward, but you must ensure WAL is archived and retained, and you need a clean restore procedure (including recovery configuration or modern restore commands).
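
A minimal pg_basebackup sketch, assuming a role with replication privileges and a persistent target directory (host name, role, and path are placeholders): -X stream bundles the WAL generated during the backup, and -c fast avoids waiting for the next routine checkpoint.

cr0x@server:~$ pg_basebackup -h pg -U replicator -D /backups/base/2025-12-31 -X stream -c fast -P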

Snapshots: the “fastest” backup that loves to betray you

Volume snapshots (LVM, ZFS, Ceph, EBS, CSI snapshots) are excellent when you can guarantee consistency. The key word is guarantee. A crash-consistent snapshot might work, until it doesn’t, and you won’t know which backup is bad until you need it.

To do snapshots right, you coordinate with the database: flush/lock or use backup modes that ensure recoverability. PostgreSQL can recover from a crash, but a snapshot taken mid-write without the right WAL coverage can still fail. MySQL InnoDB can crash-recover, but mixing snapshots with missing logs or odd fsync settings gets ugly fast.
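
What “coordinate with the database” can look like for PostgreSQL 15+ (older releases use pg_start_backup/pg_stop_backup instead): start a non-exclusive backup, take the snapshot from the same session so the backup state stays alive, then stop it and keep the label output that pg_backup_stop() returns next to the snapshot. The snapshot script name below is hypothetical.

cr0x@server:~$ psql -U postgres <<'SQL'
SELECT pg_backup_start('csi-snapshot', true);
\! /usr/local/bin/take-volume-snapshot.sh
SELECT * FROM pg_backup_stop();
SQL

On the MySQL side, LOCK INSTANCE FOR BACKUP (MySQL 8.0) plays a similar role, but the lock only holds while the session that took it stays connected, so a fire-and-forget mysql -e call protects nothing.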

PITR: the difference between “we lost a day” and “we lost five minutes”

PITR is not optional if your RPO is small. For MySQL: binlogs. For PostgreSQL: WAL archiving. In containers, PITR breaks when log directories are not persisted, archiving is misconfigured, or your retention policy deletes logs before you notice.
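
On the restore side, PostgreSQL PITR (version 12 and later) boils down to: lay down the base backup in a fresh data directory, tell it where the WAL archive is and when to stop, and create recovery.signal before starting the server. A hedged sketch with placeholder data-directory path, archive path, and target time:

cr0x@server:~$ cat >> /restore/pgdata/postgresql.conf <<'EOF'
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2025-12-31 09:40:00+00'
recovery_target_action = 'promote'
EOF
cr0x@server:~$ touch /restore/pgdata/recovery.signal

For MySQL, the equivalent step is the mysqlbinlog replay sketched earlier; either way, the logs have to exist and be retained, which is exactly what breaks in containers.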

Joke #2: PITR is like a time machine, except it only goes back—usually to the exact moment before you ran the destructive migration. Convenient.

Practical tasks: 12+ commands to verify backups are real

These are tasks you can run today. Each includes: command, example output, what it means, and the decision you make.

Task 1: Confirm your database data directory is actually on a persistent mount (container host)

cr0x@server:~$ docker inspect mysql01 --format '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }} ({{ .Type }}){{ "\n" }}{{ end }}'
/var/lib/docker/volumes/mysql01-data/_data -> /var/lib/mysql (volume)
/srv/backups/mysql01 -> /backups (bind)

Meaning: /var/lib/mysql is a Docker volume (persistent), and backups are written to a host bind mount at /srv/backups/mysql01.

Decision: If you don’t see a persistent mount for the data directory and backup directory, stop. Fix storage before arguing about tools.

Task 2: Kubernetes: check the pod is using a PVC (not emptyDir)

cr0x@server:~$ kubectl get pod pg-0 -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\t"}{.emptyDir}{"\n"}{end}'
data	pgdata-pg-0	
tmp		map[]

Meaning: Volume data is backed by a PVC. tmp is ephemeral (emptyDir).

Decision: If your DB data is on emptyDir, you don’t have persistence; you have vibes.

Task 3: MySQL logical backup done right (and detect truncated dumps)

cr0x@server:~$ docker exec mysql01 sh -lc 'mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --single-transaction --routines --triggers --events --set-gtid-purged=OFF --databases appdb | gzip -1 > /backups/appdb.sql.gz && ls -lh /backups/appdb.sql.gz'
-rw-r--r-- 1 root root 412M Dec 31 02:10 /backups/appdb.sql.gz

Meaning: A dump exists and has a plausible size.

Decision: Size is not proof. Next, validate gzip integrity and the presence of required objects.

Task 4: Validate gzip integrity (catches partial uploads and disk-full disasters)

cr0x@server:~$ gzip -t /srv/backups/mysql01/appdb.sql.gz && echo OK
OK

Meaning: The file is a valid gzip stream (not truncated).

Decision: If this fails, treat the backup as missing. Investigate disk space, job logs, and upload steps.

Task 5: Quick sanity check inside a MySQL dump (do we have schema and data?)

cr0x@server:~$ zcat /srv/backups/mysql01/appdb.sql.gz | head -n 25
-- MySQL dump 10.13  Distrib 8.0.36, for Linux (x86_64)
--
-- Host: localhost    Database: appdb
-- ------------------------------------------------------
-- Server version	8.0.36
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
...

Meaning: Looks like a real dump with metadata. Also confirms MySQL version at backup time.

Decision: If the dump header is missing or the stream is empty, your backup job wrote an error page, not SQL.

Task 6: MySQL: confirm binlog is enabled (PITR requires it)

cr0x@server:~$ docker exec mysql01 sh -lc 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW VARIABLES LIKE \"log_bin\"; SHOW VARIABLES LIKE \"binlog_format\";"'
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+

Meaning: Binlog is on, and format is ROW (generally safest for replication and PITR).

Decision: If log_bin is OFF, accept “full backups only” RPO or turn it on and implement binlog shipping.

Task 7: MySQL: verify you actually have recent binlogs (not just enabled)

cr0x@server:~$ docker exec mysql01 sh -lc 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW BINARY LOGS;" | tail -n 5'
| binlog.000231 |  10485760 |
| binlog.000232 |   8912451 |

Meaning: Binlogs exist and are rotating.

Decision: If there are no binlogs or they stop days ago, PITR is not real. Fix retention and shipping.

Task 8: PostgreSQL: run a logical backup and capture globals (roles) separately

cr0x@server:~$ kubectl exec -it pg-0 -- bash -lc 'pg_dump -U postgres -Fc -d appdb -f /backups/appdb.dump && pg_dumpall -U postgres --globals-only > /backups/globals.sql && ls -lh /backups/appdb.dump /backups/globals.sql'
-rw-r--r-- 1 root root 2.1G Dec 31 02:12 /backups/appdb.dump
-rw-r--r-- 1 root root  19K Dec 31 02:12 /backups/globals.sql

Meaning: You have a custom-format dump plus global objects.

Decision: If you only dump the database and ignore globals, expect restore-time permission failures and missing roles.

Task 9: PostgreSQL: verify WAL archiving is enabled (PITR depends on it)

cr0x@server:~$ kubectl exec -it pg-0 -- bash -lc 'psql -U postgres -d postgres -c "SHOW wal_level;" -c "SHOW archive_mode;" -c "SHOW archive_command;"'
 wal_level 
-----------
 replica
(1 row)

 archive_mode 
--------------
 on
(1 row)

                 archive_command                 
-------------------------------------------------
 test ! -f /wal-archive/%f && cp %p /wal-archive/%f
(1 row)

Meaning: WAL is produced at replica level (good for PITR/replication). Archiving is on.

Decision: If archive_mode is off, you don’t have PITR. Either accept the risk or fix it before you claim an RPO.

Task 10: PostgreSQL: confirm WAL is actually being archived (not just configured)

cr0x@server:~$ kubectl exec -it pg-0 -- bash -lc 'psql -U postgres -d postgres -c "SELECT now(), last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"'
              now              | last_archived_wal |     last_archived_time     | failed_count | last_failed_wal | last_failed_time 
-------------------------------+-------------------+----------------------------+--------------+-----------------+------------------
 2025-12-31 02:14:01.12345+00  | 0000000100000000000000A7 | 2025-12-31 02:13:58+00 |            0 |                 | 
(1 row)

Meaning: Archiving is succeeding and recent.

Decision: If failed_count rises or last_archived_time is stale, your PITR stream is broken. Page someone before you need it.

Task 11: Restore test MySQL dump into a disposable container (the only test that matters)

cr0x@server:~$ docker run --rm --name mysql-restore -e MYSQL_ROOT_PASSWORD=restorepass -d mysql:8.0
...output...
cr0x@server:~$ sleep 15 && zcat /srv/backups/mysql01/appdb.sql.gz | docker exec -i mysql-restore mysql -uroot -prestorepass
...output...
cr0x@server:~$ docker exec mysql-restore mysql -uroot -prestorepass -e "SHOW DATABASES; USE appdb; SHOW TABLES;" | tail -n 10
Tables_in_appdb
users
orders
order_items

Meaning: The dump imports and schema exists.

Decision: If import fails, the backup is fake. If it imports but app queries fail, you’re missing objects (routines, triggers) or relying on environment-specific settings.

Task 12: Restore test PostgreSQL dump into a disposable instance

cr0x@server:~$ docker run --rm --name pg-restore -e POSTGRES_PASSWORD=restorepass -d postgres:16
...output...
cr0x@server:~$ sleep 10 && cat /srv/backups/pg/globals.sql | docker exec -i pg-restore psql -U postgres
...output...
cr0x@server:~$ docker exec -i pg-restore createdb -U postgres appdb
...output...
cr0x@server:~$ cat /srv/backups/pg/appdb.dump | docker exec -i pg-restore pg_restore --verbose -U postgres -d appdb
pg_restore: connecting to database for restore
pg_restore: creating TABLE "public.users"
pg_restore: creating TABLE "public.orders"
...output...

Meaning: The dump restores cleanly in a fresh cluster and replays object creation.

Decision: If it fails on roles/permissions, your globals capture is incomplete. If it fails on extensions, you need to preinstall them in the restore environment.

Task 13: Snapshot safety check: are you backing up the right path?

cr0x@server:~$ docker exec mysql01 sh -lc 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW VARIABLES LIKE \"datadir\";"'
+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| datadir       | /var/lib/mysql/ |
+---------------+-----------------+

Meaning: Confirms the actual data directory.

Decision: If your snapshot is taken from a different host path than the mounted datadir, you are snapshotting the wrong thing. Yes, it happens.

Task 14: Basic capacity check: are backups failing due to disk pressure?

cr0x@server:~$ df -h /srv/backups
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  1.8T  1.7T   62G  97% /srv/backups

Meaning: You’re at 97% usage. This is not a lifestyle choice.

Decision: If you routinely exceed ~85–90%, expect partial files, failed fsync, and “backup succeeded” logs that don’t match reality. Fix retention, compression, and storage sizing.

Task 15: Confirm backup jobs aren’t “successful” while actually failing (Kubernetes CronJob)

cr0x@server:~$ kubectl get job -n db -l app=pg-backup -o wide
NAME               COMPLETIONS   DURATION   AGE   CONTAINERS   IMAGES        SELECTOR
pg-backup-28911    1/1           9s         2h    backup       alpine:3.20   controller-uid=...
cr0x@server:~$ kubectl logs -n db job/pg-backup-28911 | tail -n 20
pg_dump: error: connection to server at "pg" (10.0.2.44), port 5432 failed: FATAL:  password authentication failed for user "postgres"

Meaning: The job completed from Kubernetes’ perspective, but the backup failed.

Decision: Treat logs as the source of truth. Ensure the job exits non-zero on failure (no swallowing errors in shell scripts).

Fast diagnosis playbook: find the bottleneck quickly

When backups are slow, missing, or un-restorable, you don’t have time for philosophy. You need a tight loop. Here’s the order that catches the most issues fastest.

First: is the backup artifact valid?

  • Check integrity: gzip test, tar listing, or dump format listing.
  • Check size trend: sudden drop or spike usually means failure or runaway data.
  • Check last successful restore test time: if you don’t have one, you don’t know anything.

Second: does the pipeline include the logs needed for PITR?

  • MySQL: binlog enabled, rotating, shipped, retained.
  • PostgreSQL: WAL archiving on, pg_stat_archiver healthy, retention covers your desired window.

Third: is the bottleneck CPU, disk, or network?

  • CPU-bound: compression level too high, encryption overhead, single-threaded dumps.
  • Disk-bound: slow PVC, saturated node disk, snapshot copy delays.
  • Network-bound: object storage throughput, throttling, or cross-AZ egress limits.

Fourth: are you taking consistent backups?

  • MySQL: --single-transaction for dumps; physical backup prepared; snapshot coordination.
  • PostgreSQL: base backup + WAL; or pg_dump plus globals; snapshot coordination.

Fifth: will the restore meet RTO?

  • Test restore time end-to-end: download + decrypt + decompress + import + index build + verification queries.
  • If RTO is missed, switch backup method, parallelize restore, or change the business promise. Pick one.

Three corporate mini-stories from the land of “it passed CI”

Incident: the wrong assumption (logical dump “includes everything,” right?)

A mid-sized SaaS company ran PostgreSQL in Kubernetes. Their backup job did pg_dump -Fc nightly, shipped the file to object storage, and rotated it with a clean retention policy. Green dashboards. No alerts. The kind of setup that makes managers nod approvingly.

Then they had an incident: a schema migration went wrong and they needed to restore to “yesterday.” They spun up a new cluster and restored the dump. It worked—technically. The database started. Tables existed. The app still failed to authenticate users and started emitting permission errors like a metronome.

The missing piece was boring: roles and grants weren’t included. In production, a few roles were created manually during an on-call firefight months earlier. The dump captured objects, not the cluster-level identity and privilege model. Their “backup” had no idea who was allowed to do what.

The fix was immediate: add pg_dumpall --globals-only to the pipeline, enforce role creation via migrations/IaC, and add a restore test that runs a minimal application login flow against the restored database. They stopped assuming “database backup” equals “application recovery.”

Optimization that backfired (compression and cleverness versus time)

An e-commerce platform migrated MySQL backups from physical to logical to “simplify.” The database was several hundred GB. Dumps were compressed aggressively to cut object storage costs. They also ran backups during business hours because “containers scale.” The backup job even ran at a polite nice level, and everything looked civilized.

What happened was classic: CPU spiked due to high compression, causing query latency. Then the backup ran longer, overlapping peak traffic. That extended overlap caused more contention, which made the dump run even longer. The database wasn’t down; it was worse—slow, unpredictable, and angry.

When they finally tested restore time, it was brutal: downloading and decompressing the dump took ages, then import took ages, then rebuilding indexes took ages. The team had optimized the wrong metric (storage cost) and ignored the one that matters during an incident (time-to-restore).

They switched to physical backups for MySQL, reduced compression to a sane level, and implemented binlog shipping for PITR. Storage cost went up, but the restore path went from “we’ll get back to you tomorrow” to something you can say out loud in a status update.

Boring but correct practice that saved the day (restore drills and version pinning)

A financial services team ran PostgreSQL with WAL archiving and weekly base backups. Nothing fancy: consistent config, strict retention, and a monthly restore drill into an isolated environment. They pinned Postgres minor versions for restore environments and kept a small matrix of “backup taken on version X, restored on version X/Y” that they actually tested.

One quarter, a storage incident corrupted a primary volume and took the database down hard. Replication didn’t save them; the replica had already replayed the corruption. The only path was restore.

The on-call runbook was painfully explicit. Fetch base backup, fetch WAL from the archive, restore into a fresh PV, replay to a target time, run verification queries, flip traffic. They followed it step by step with the calm energy of people who have rehearsed the boring thing.

They restored within their promised window. No heroics. No Slack archaeology. The post-incident review was short because the system did what it was designed to do. This is the standard to copy.

Common mistakes: symptom → root cause → fix

1) “Backup job says success, but the file is tiny”

Symptom: backups suddenly drop from hundreds of MB/GB to a few KB/MB.

Root cause: the job wrote an error message into the output stream, authentication failed, or disk filled mid-write; pipeline still returned exit code 0 due to sloppy shell scripting.

Fix: enable set -euo pipefail in scripts; validate artifact integrity (gzip test); alert on size anomalies; fail the job on stderr patterns or non-zero exit codes.
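
A minimal shape for that script, assuming the same dump command as Task 3; the point is that a failure anywhere in the pipeline, including mysqldump itself, fails the job, and a truncated artifact never gets published under the final name:

#!/usr/bin/env bash
set -euo pipefail

out=/backups/appdb.sql.gz   # placeholder path

mysqldump -uroot -p"${MYSQL_ROOT_PASSWORD}" \
  --single-transaction --routines --triggers --events \
  --databases appdb | gzip -1 > "${out}.tmp"

gzip -t "${out}.tmp"        # refuse to keep a truncated or corrupt artifact
mv "${out}.tmp" "${out}"    # publish only after it validates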

2) “Postgres restore fails: invalid checkpoint / WAL missing”

Symptom: restored data directory won’t start, complains about checkpoint records or missing WAL.

Root cause: crash-inconsistent snapshot, or base backup without corresponding WAL segments.

Fix: use pg_basebackup with WAL archiving; ensure WAL retention; if using snapshots, coordinate with PostgreSQL and include WAL needed to reach consistency.

3) “MySQL restore works, but routines/events are missing”

Symptom: application errors after restore; stored procedures or scheduled events absent.

Root cause: mysqldump without --routines --events --triggers.

Fix: add those flags; add restore verification queries that check for these objects.

4) “Backups exist, but you can’t do PITR”

Symptom: you can restore only to the backup time, not to a specific moment.

Root cause: binlogs/WAL not enabled, not archived, or not retained long enough; log path is ephemeral in the container.

Fix: enable and ship binlogs/WAL to durable storage; monitor archiving health; align retention with business RPO.

5) “Kubernetes reschedules the pod and you lose backups”

Symptom: backups disappear after node maintenance or pod restart.

Root cause: backups written inside container filesystem or to emptyDir.

Fix: write backups to a PVC or stream directly to object storage; verify mounts via manifests and runtime inspection.

6) “Restore is too slow to meet RTO”

Symptom: restore takes hours/days; incident response becomes hostage to import speed.

Root cause: logical dumps for very large datasets, over-compression, single-threaded restore path, slow storage class.

Fix: switch to physical backups; tune compression; test restore time; consider read replicas for faster cutover; use faster PVC/storage tiers for recovery environments.

7) “Replica-based backups restore stale data”

Symptom: restore is consistent but missing recent writes.

Root cause: you backed up from a lagging replica without checking lag, or replication was broken.

Fix: enforce lag checks before backup; alert on replication health; for PITR, rely on logs with known ordering and retention.

8) “Everything restores, but the app still fails”

Symptom: DB is up; app errors include auth failures, missing extensions, encoding issues, timezone oddities.

Root cause: you backed up data but not the environment contract: roles, extensions, collations, config assumptions.

Fix: codify DB bootstrap (roles/extensions) as code; add smoke tests that run real app queries; store “restore environment” image/version alongside backups.

Checklists / step-by-step plan (boring on purpose)

Step 1: Decide your recovery targets (RPO/RTO) and make them executable

  • Pick an RPO (max acceptable data loss) and RTO (max acceptable downtime).
  • Translate them into implementation: full backup frequency + PITR log retention + restore workflow time budget.
  • If you can’t afford PITR complexity, admit the RPO is “last backup.” Don’t improvise later.

Step 2: Choose the right backup method per database and size

  • MySQL small/medium: mysqldump with correct flags + binlog shipping if PITR is needed.
  • MySQL large: physical backups (XtraBackup/enterprise) + binlog shipping.
  • PostgreSQL small/medium: pg_dump -Fc + pg_dumpall --globals-only.
  • PostgreSQL large: pg_basebackup + WAL archiving, plus periodic restore tests.

Step 3: Make storage placement explicit in container manifests

  • Data directory on PVC/volume with known performance class.
  • Backup staging directory on PVC (or stream to object storage).
  • WAL/binlog on durable storage if local archiving is used.

Step 4: Build verification into the pipeline

  • Artifact integrity checks: gzip/tar validation, dump format introspection.
  • Metadata capture: DB version, schema hash, backup timestamp, tool version.
  • Restore test automation: restore into a disposable environment and run verification queries (a sketch follows this list).
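
A hedged sketch of what that automation can look like for the PostgreSQL artifacts from Tasks 8 and 12; the image tag, paths, and verification query are assumptions to adapt, and the point is that the drill is scripted, timed, and fails loudly:

#!/usr/bin/env bash
set -euo pipefail
start=$(date +%s)

docker run --rm -d --name restore-drill -e POSTGRES_PASSWORD=drill postgres:16
# The official image restarts once during first init; the extra sleep papers over that race in this sketch.
until docker exec restore-drill pg_isready -U postgres >/dev/null 2>&1; do sleep 2; done
sleep 5

docker exec -i restore-drill psql -U postgres < /srv/backups/pg/globals.sql
docker exec restore-drill createdb -U postgres appdb
docker exec -i restore-drill pg_restore -U postgres -d appdb < /srv/backups/pg/appdb.dump

# One representative query the application actually depends on (placeholder).
docker exec restore-drill psql -U postgres -d appdb -c "SELECT count(*) FROM users;"

echo "restore drill completed in $(( $(date +%s) - start ))s"
docker stop restore-drill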

Step 5: Monitor the backup system like production

  • Alert on missing backups, size anomalies, and stale WAL/binlog archiving.
  • Track restore time trends (download + decrypt + decompress + apply + verify).
  • Make backup failures page-worthy if you claim an RPO/RTO. Quiet failures are how fake backups breed.

Step 6: Practice restores (and version drift control)

  • Monthly restore drills, minimum. Weekly if the database is business-critical.
  • Pin restore tooling versions (mysqldump/mysql client, pg_restore, etc.).
  • Keep a known-good restore environment image for each major DB version you run.

FAQ

1) What exactly makes a backup “fake”?

Anything that can’t restore to a consistent, working state within your required time window. “File exists” and “job succeeded” don’t count.

2) Are logical backups (mysqldump/pg_dump) inherently unsafe?

No. They’re just easy to misconfigure and slow at scale. Logical backups are fine when you restore-test them and your RTO can tolerate the import time.

3) In MySQL, is --single-transaction always required?

For consistent InnoDB dumps without locking tables, yes. If you have non-transactional tables (e.g., MyISAM), consistency becomes trickier and you should reconsider the storage engine or backup method.

4) In PostgreSQL, why do I need pg_dumpall --globals-only?

pg_dump is per-database. Roles and other global objects live at the cluster level. Without them, restores frequently fail with permission problems.

5) Can I rely on volume snapshots as my only backup?

Only if you can guarantee consistency and have tested restores from snapshots. Snapshots are great as a component, not a religion. Most teams combine snapshots (fast) with PITR logs (granular recovery).

6) What’s the simplest container-safe backup pattern?

Stream the backup directly to durable storage (object storage or a backup server) and keep local staging minimal. Don’t write critical artifacts into ephemeral container layers.
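
One hedged illustration of that pattern, assuming the AWS CLI and an obviously-placeholder bucket name; the dump is piped straight from the container to object storage and never lands on the container’s writable layer:

cr0x@server:~$ docker exec mysql01 sh -lc 'mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --single-transaction --routines --triggers --events --databases appdb | gzip -1' | aws s3 cp - s3://example-db-backups/mysql01/appdb-$(date +%F).sql.gz

Pair it with a follow-up download-and-gzip -t check, so “upload finished” never quietly stands in for “artifact is valid.”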

7) How do I know if PITR is really working?

MySQL: confirm binlog is enabled, binlogs are present and recent, and you can replay to a target timestamp in a test restore. PostgreSQL: pg_stat_archiver shows recent successful archiving and you can restore to a target time using archived WAL.

8) Should I back up from the primary or a replica?

Replica backups reduce primary load, but you must enforce replication health checks and lag thresholds. Otherwise you get clean, consistent, wrong-time backups.

9) Why do restores fail only in production-like environments?

Because production has the messy parts: roles, grants, extensions, and app behaviors that test your assumptions. Your restore tests need to include a small application smoke test, not just “database starts.”

10) What’s the minimum restore test I should automate?

Spin up a disposable DB instance, restore the latest backup, run a handful of representative queries (including auth-critical paths), and record total time. If it fails, alert loudly.

Conclusion: next steps that actually reduce risk

If you do nothing else, do these three things this week:

  1. Restore-test the latest backup of MySQL and/or PostgreSQL into a disposable environment. Time it. Record it.
  2. Verify PITR prerequisites: MySQL binlogs or PostgreSQL WAL archiving, with retention that matches your RPO.
  3. Make storage explicit: confirm data, backups, and logs live on persistent volumes or durable remote storage, not on container ephemera.

Then do the grown-up move: wire those checks into automation and monitoring, so you stop relying on memory, hope, and that one person who “knows backups.” The goal isn’t a backup file. The goal is a practiced restore.
