Docker Volume Backups That Actually Restore

The painful truth: most Docker “backups” are optimistic file copies that nobody has ever restored under pressure. The first time you try is when a disk dies, an engineer is tired, and your VP learns what “RPO” means.

This guide is for running production. Not demos. We’ll back up Docker volumes in ways that survive reality, and we’ll prove restores work with repeatable drills and measurable checks.

What you are actually backing up

“Back up the container” is a sentence that sounds reasonable and is usually wrong.

Containers are disposable processes plus an image. The image is (typically) rebuildable. The real blast radius is in the data that lives outside the image:

  • Named volumes (managed by Docker; usually under /var/lib/docker/volumes).
  • Bind mounts (your host paths mounted into containers; often “just a folder,” until it isn’t).
  • Secrets/config (environment variables, mounted files, swarm secrets, Compose files, systemd unit files).
  • External services (managed databases, object storage) that your container depends on but doesn’t contain.

Backing up Docker volumes is mostly about filesystem-level integrity and application-level consistency. File integrity means the bits copy correctly. Consistency means the app can actually read those bits after you restore them.

For databases, “consistent” is not a vibe. It’s a state. You either use database-native backup tools, or you snapshot storage with the database properly quiesced.

Interesting facts (and why they matter)

  1. Docker volumes were designed to decouple data from container lifecycle. That’s why removing a container does not delete a named volume; the data survives until someone runs docker compose down -v (or clears anonymous volumes with docker rm -v) without thinking.
  2. AUFS/OverlayFS popularized copy-on-write layers for containers. Great for images; irrelevant for your persistent data, which lives in volumes or bind mounts.
  3. Early container users often backed up the entire /var/lib/docker directory. It “worked” until storage drivers changed or the restore host differed. Portability was the casualty.
  4. Database vendors have preached “logical backups” for decades because physical file copies during write activity can be silently corrupt without obvious errors at copy time.
  5. Filesystem snapshots (ZFS, LVM, btrfs) predate containers by years. Containers made snapshot-based backups fashionable again because they need fast, frequent capture with low overhead.
  6. Tar is older than most of your production fleet. It’s still here because it’s simple, streamable, and integrates with compression and encryption tools cleanly.
  7. RPO/RTO became boardroom vocabulary after high-profile outages. Containers didn’t change that; they just made it easier to confuse “rebuildable” with “recoverable.”
  8. Checksums are not optional in serious backup systems. Silent corruption exists in every layer: RAM, disk, controller, network, object store. Verify or get surprised.

Principles: do this, not that

1) Treat “backup” as a restore workflow you haven’t run yet

A backup file is not evidence. A successful restore into a clean environment is evidence. Your goal is to reduce uncertainty, not to generate artifacts.

2) Separate “data backup” from “service rebuild”

Keep two inventories:

  • Rebuild inventory: images, Compose files, system configs, TLS cert issuance process, secrets management.
  • Data inventory: volumes, database dumps/WAL/binlogs, uploaded files, queues, search indexes (and whether you can rebuild them).

3) Prefer application-native backups for databases

For PostgreSQL, use pg_dump or physical backups with pg_basebackup (and WAL). For MySQL/MariaDB, use mysqldump or physical methods appropriate to your engine. Snapshotting raw database files while writes are happening is gambling.
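
A minimal sketch of the MySQL/MariaDB side, since only PostgreSQL gets a full task below. The container name app-mysql is an assumption; MYSQL_ROOT_PASSWORD follows the official image’s convention:

docker exec app-mysql sh -c 'exec mysqldump --single-transaction --routines --all-databases -uroot -p"$MYSQL_ROOT_PASSWORD"' > /backup/mysql_all.sql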

4) When you do filesystem-level backups, control write activity

Either:

  • Stop the app container (or put it in maintenance/read-only mode), then copy; or
  • Use snapshots on the host filesystem; or
  • Use database quiesce hooks (flush/lock) and snapshot fast.

5) Make backups content-addressable (or at least checksum-verified)

At minimum: store a manifest with file list + sizes + hashes. “The file exists” is not verification.
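
One low-tech way to get there, reusing the app_uploads volume and /backup directory from the tasks below (slow for huge trees, but honest):

docker run --rm -v app_uploads:/data:ro -v /backup:/backup alpine:3.20 sh -c 'cd /data && find . -type f -exec sha256sum {} \; > /backup/app_uploads.manifest'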

6) Restore tests must be isolated and automated

Do not restore on the same host into the same paths and declare victory. Use a scratch host or a disposable VM. Run a health check that proves the service reads data correctly.

Joke #1: A backup you’ve never restored is like a parachute you’ve never packed—confidence is not a test plan.

One quote, because it’s still true

Hope is not a strategy. — widely repeated in ops circles; treat it as a paraphrased idea rather than an exact quote with a verifiable source.

Practical tasks: commands, outputs, decisions (12+)

These are intentionally mundane. Production failures are usually mundane too.

Task 1: List volumes and spot the ones that matter

cr0x@server:~$ docker volume ls
DRIVER    VOLUME NAME
local     app_postgres_data
local     app_uploads
local     monitoring_prometheus

What it means: three named volumes exist. Containers and images aren’t listed here because volumes are independent objects with their own lifecycle.

Decision: classify each volume: “must restore” (databases, uploads), “nice to have” (caches), “rebuildable” (Prometheus might be rebuildable depending on retention needs).

Task 2: Inspect a volume to find where it lives

cr0x@server:~$ docker volume inspect app_postgres_data
[
  {
    "CreatedAt": "2025-11-02T12:44:10Z",
    "Driver": "local",
    "Labels": {
      "com.docker.compose.project": "app"
    },
    "Mountpoint": "/var/lib/docker/volumes/app_postgres_data/_data",
    "Name": "app_postgres_data",
    "Options": null,
    "Scope": "local"
  }
]

What it means: you’re using the local driver; the data is a directory on the host. This is simple and dangerous: simple to back up, easy to corrupt if copied live.

Decision: decide whether you will do application-native backups (preferred for DB) or snapshot/cold copy.

Task 3: Find which containers use a volume (blast radius check)

cr0x@server:~$ docker ps --format '{{.Names}}\t{{.Mounts}}'
app-web        /var/lib/docker/volumes/app_uploads/_data
app-db         /var/lib/docker/volumes/app_postgres_data/_data
prometheus     /var/lib/docker/volumes/monitoring_prometheus/_data

What it means: each volume is currently mounted by one container. That’s good: fewer writers.

Decision: if multiple writers mount the same volume, your consistency plan needs to be more explicit (locks, quiesce, or redesign).

Task 4: Identify bind mounts hiding in plain sight

cr0x@server:~$ docker inspect app-web --format '{{json .Mounts}}'
[{"Type":"bind","Source":"/srv/app/config","Destination":"/etc/app","Mode":"ro","RW":false,"Propagation":"rprivate"},{"Type":"volume","Name":"app_uploads","Source":"/var/lib/docker/volumes/app_uploads/_data","Destination":"/var/www/uploads","Driver":"local","Mode":"z","RW":true,"Propagation":""}]

What it means: you have a bind mount at /srv/app/config. If you back up only Docker volumes, you’ll miss config—then restores “work” but the service won’t start.

Decision: add bind mount paths to backup scope, or migrate them into a managed config system.

Task 5: Check free space before you generate a giant archive

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  450G  380G   48G  89% /

What it means: only 48G free. A tarball of a large volume might fill disk and take Docker down with it.

Decision: stream backups off-host (pipe to storage), or snapshot and transfer, or free space first.
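
A sketch of the streaming option, so nothing large ever lands on the local disk. The hostname and target path are assumptions:

docker run --rm -v app_uploads:/data:ro alpine:3.20 tar -cpf - -C /data . | ssh backup@backup-host 'cat > /srv/backups/app_uploads.tar'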

Task 6: Get volume size quickly (rough, but useful)

cr0x@server:~$ sudo du -sh /var/lib/docker/volumes/app_postgres_data/_data
23G	/var/lib/docker/volumes/app_postgres_data/_data

What it means: ~23G on disk. Compression may help (or not, depending on data).

Decision: plan retention and transfer time. 23G nightly over a thin link becomes a weekly apology.

Task 7: Cold backup a named volume with tar (safe for non-DB or stopped DB)

cr0x@server:~$ docker stop app-web
app-web
cr0x@server:~$ docker run --rm -v app_uploads:/data:ro -v /backup:/backup alpine:3.20 sh -c 'cd /data && tar -cpf /backup/app_uploads.tar .'
cr0x@server:~$ docker start app-web
app-web

What it means: the backup runs in a throwaway container that mounts the volume read-only and writes a tar file to /backup (a host directory you must provision).

Decision: if stopping the app is unacceptable, move to snapshot-based backups or application-native backups.

Task 8: Add compression and a checksum manifest

cr0x@server:~$ docker run --rm -v app_uploads:/data:ro -v /backup:/backup alpine:3.20 sh -c 'cd /data && tar -cpf - . | gzip -1 > /backup/app_uploads.tar.gz'
cr0x@server:~$ sha256sum /backup/app_uploads.tar.gz | tee /backup/app_uploads.tar.gz.sha256
9c1e6311d2c51d6f9a9b8b3f5d65ed3db3f87e96a57c4e1b2f5c34b1b1a4d9a0  /backup/app_uploads.tar.gz

What it means: you now have an integrity check. Store the checksum next to the artifact in your backup repository.

Decision: if you can’t produce and verify checksums, you don’t have an operational backup—just a file.

Task 9: PostgreSQL logical backup from inside the container (preferred for portability)

cr0x@server:~$ docker exec app-db sh -c 'pg_dump -U postgres -Fc appdb' > /backup/appdb.dump
cr0x@server:~$ ls -lh /backup/appdb.dump
-rw-r--r-- 1 cr0x cr0x 3.2G Dec  7 02:10 /backup/appdb.dump

What it means: a custom-format dump that supports parallel restore and is resilient across minor version changes (within reason). Note there is no -t on docker exec: allocating a TTY can mangle the binary stream you’re redirecting.

Decision: if the dump is much smaller than expected, check if you dumped the right database and didn’t accidentally dump an empty schema.

Task 10: PostgreSQL restore test into a disposable container (proof, not theory)

cr0x@server:~$ docker run --rm --name pg-restore-test -e POSTGRES_PASSWORD=test -d postgres:16
2f4a6f88d3c8e6d0b0f14a27e8c2e6d84e8c4b7f6ddc5a8c1d2b3a4f5e6d7c8b
cr0x@server:~$ until docker exec pg-restore-test pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
cr0x@server:~$ cat /backup/appdb.dump | docker exec -i pg-restore-test sh -c 'createdb -U postgres appdb && pg_restore -U postgres -d appdb'
cr0x@server:~$ docker exec -t pg-restore-test psql -U postgres -d appdb -c 'select count(*) from users;'
 count 
-------
 10492
(1 row)
cr0x@server:~$ docker stop pg-restore-test
pg-restore-test

What it means: you restored into a clean database and ran a sanity query. That’s real evidence.

Decision: if counts don’t match expectations, stop calling it “verified.” Investigate before retention rotates away your last good copy.

Task 11: Restore a named volume tarball into a fresh volume

cr0x@server:~$ docker volume create app_uploads_restore_test
app_uploads_restore_test
cr0x@server:~$ docker run --rm -v app_uploads_restore_test:/data -v /backup:/backup alpine:3.20 sh -c 'cd /data && tar -xpf /backup/app_uploads.tar'
cr0x@server:~$ docker run --rm -v app_uploads_restore_test:/data alpine:3.20 sh -c 'ls -lah /data | head'
total 64K
drwxr-xr-x    5 root     root        4.0K Dec  7 02:24 .
drwxr-xr-x    1 root     root        4.0K Dec  7 02:24 ..
drwxr-xr-x   12 root     root        4.0K Dec  5 19:11 images
drwxr-xr-x    3 root     root        4.0K Dec  6 08:33 tmp

What it means: you can unpack the archive and see expected top-level directories.

Decision: if permissions/ownership matter (they do), validate them with a representative file and confirm your tar preserved them (extracting as root preserves ownership by default; -p preserves the permission bits).

Task 12: Verify archive integrity before restore (catch corrupted transfers)

cr0x@server:~$ sha256sum -c /backup/app_uploads.tar.gz.sha256
/backup/app_uploads.tar.gz: OK

What it means: the artifact matches the expected hash.

Decision: if you see FAILED, stop. Don’t restore garbage into production. Fetch a different backup.

Task 13: Detect open files to judge whether a “live copy” is risky

cr0x@server:~$ sudo lsof +D /var/lib/docker/volumes/app_postgres_data/_data | head
postgres  22114  999  15u   REG  259,2  16777216  393222 /var/lib/docker/volumes/app_postgres_data/_data/base/16384/2619
postgres  22114  999  16u   REG  259,2  16777216  393223 /var/lib/docker/volumes/app_postgres_data/_data/base/16384/2620

What it means: PostgreSQL is actively writing. Copying these files right now is not a backup; it’s a corruption generator.

Decision: use pg_dump/pg_basebackup or stop/quiesce + snapshot.

Task 14: Measure restore throughput (know your RTO, don’t guess)

cr0x@server:~$ time docker run --rm -v app_uploads_restore_test:/data -v /backup:/backup alpine:3.20 sh -c 'cd /data && rm -rf ./* && tar -xpf /backup/app_uploads.tar'
real	0m18.412s
user	0m0.812s
sys	0m3.951s

What it means: your restore took ~18 seconds for this dataset on this host. That’s the number you use for RTO planning (plus service warmup).

Decision: if restore is slow, don’t optimize randomly—go to the Fast diagnosis playbook.

Backup methods that hold up

Method A: Logical backups for databases (recommended)

If you have one database volume and you’re tar’ing it “because it’s easy,” stop. Use the database’s backup mechanism. You get:

  • Portability across hosts and storage drivers
  • Consistency guaranteed by the engine
  • Better troubleshooting: the restore will tell you what’s wrong

For PostgreSQL, a solid baseline is daily pg_dump -Fc plus WAL archiving if you need point-in-time recovery. For MySQL, a baseline is mysqldump or engine-specific physical backups with binlogs.
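
A minimal scheduling sketch for that baseline, assuming the app-db container and /backup path used elsewhere in this guide. Note the escaped %, which cron treats specially:

# /etc/cron.d/pg-backup
15 2 * * * root docker exec app-db pg_dump -U postgres -Fc appdb > /backup/appdb-$(date +\%F).dump 2>> /var/log/pg-backup.log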

Trade-off: logical backups can be slower and larger for some workloads, and restores can be slower than file-level restores. That’s a business decision—but make it explicitly.

Method B: “Cold” filesystem backup of a volume (stop the writer)

For uploads, configs, artifacts, and non-transactional data: stopping the container (or ensuring no writes) and copying the volume is straightforward.

  • Pros: simple, fast, easy to understand
  • Cons: requires downtime or a write freeze; must preserve ownership/ACLs/xattrs if relevant

If your app uses Linux capabilities, SELinux labels, or ACLs, your tar command needs to preserve that. Alpine tar is fine for basic permissions; if you need xattrs/ACLs, use a backup container with GNU tar and flags that match your environment.
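
A sketch with GNU tar in a Debian-based helper container for the xattr/ACL case. The image tag is an assumption; verify that your tar build and filesystem actually support these flags:

docker run --rm -v app_uploads:/data:ro -v /backup:/backup debian:12-slim sh -c 'tar --acls --xattrs --numeric-owner -cpf /backup/app_uploads_full.tar -C /data .'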

Method C: Snapshot-based backups on the host filesystem (fast, low downtime)

If your Docker data directory lives on ZFS, LVM-thin, or btrfs, you can snapshot the underlying dataset/volume quickly, then copy from the snapshot while production continues.

Important: snapshotting a filesystem does not magically make an application consistent. For databases, combine snapshots with proper quiescing or engine-level backup mode, otherwise you can snapshot a perfectly consistent filesystem containing perfectly inconsistent database state.
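
A sketch of the snapshot flow with classic LVM, assuming /var/lib/docker lives on an ext4 logical volume at /dev/vg0/docker. Names, sizes, and paths are assumptions; for databases you still need the quiescing described above:

sudo lvcreate --snapshot --name docker_snap --size 10G /dev/vg0/docker
sudo mkdir -p /mnt/docker_snap
sudo mount -o ro /dev/vg0/docker_snap /mnt/docker_snap
sudo tar -cpf /backup/app_uploads_snap.tar -C /mnt/docker_snap/volumes/app_uploads/_data .
sudo umount /mnt/docker_snap
sudo lvremove -y /dev/vg0/docker_snap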

Method D: Remote volume drivers / network storage

Some teams use NFS, iSCSI, Ceph, or cloud block storage behind a Docker volume driver. This can be fine, but it shifts the backup problem:

  • You now back up the storage system, not Docker.
  • Restores may require the same driver and configuration.
  • Latency and small-write behavior can hurt databases.

When you use a remote driver, record the driver name, options, and lifecycle. If your restore plan begins with “we’ll just reattach the volume,” you need a second plan for when that system is the one on fire.
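
One cheap habit that helps: capture the volume definitions themselves next to your data backups, so a restore host can recreate volumes with the same driver and options:

docker volume ls -q | xargs docker volume inspect > /backup/volume-definitions.json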

Method E: Image-based “backup” (the trap)

People will propose “docker commit the container.” That produces an image layer containing the container filesystem at that moment. It does not capture named volumes. It rarely captures bind mounts. It is also a good way to preserve secrets in an image forever. Don’t do it for data backups.

Joke #2: “We backed up the container” is how you get a beautiful image of a service that has forgotten everything it ever knew.

How to prove restores work (not just “it extracted”)

Define “works” as a testable contract

A restore “works” when:

  • The data restores into a clean environment without manual fiddling.
  • The service starts with the restored data.
  • A small set of behavior checks pass: queries return expected rows, uploads are readable, migrations behave, and logs don’t show corruption.
  • The team can do it under time pressure with a runbook.

Build a restore drill that runs on schedule

Pick a cadence you can sustain: weekly for critical data, monthly for less critical, after every major schema change. The drill should:

  1. Pull the latest backup artifact.
  2. Verify checksums.
  3. Restore into disposable infrastructure (VM, ephemeral host, or isolated Docker network).
  4. Run a short validation suite.
  5. Publish results somewhere visible (ticket, Slack channel, dashboard), including failure reasons.
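
A minimal sketch of such a drill for the PostgreSQL dump from the tasks above. The checksum file, row-count floor, and table name are assumptions to adapt:

#!/bin/sh
set -eu
ARTIFACT=/backup/appdb.dump

# verify integrity before touching anything
sha256sum -c "${ARTIFACT}.sha256"

# restore into a disposable container
docker run --rm -d --name pg-drill -e POSTGRES_PASSWORD=drill postgres:16
until docker exec pg-drill pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
docker exec -i pg-drill sh -c 'createdb -U postgres appdb && pg_restore -U postgres -d appdb' < "$ARTIFACT"

# validate with a row-count floor, not a vibe
ROWS=$(docker exec pg-drill psql -U postgres -d appdb -tAc 'select count(*) from users;')
docker stop pg-drill
[ "$ROWS" -ge 10000 ] || { echo "DRILL FAILED: users=$ROWS"; exit 1; }

# publish the result wherever your team actually looks
echo "DRILL OK: users=$ROWS artifact=$ARTIFACT"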

Validation suite: examples that actually catch problems

  • Database: run SELECT queries for row counts in key tables; verify migrations can run in a dry-run mode if available; confirm indexes exist; confirm recent timestamps exist.
  • Uploads: pick 10 known files and verify checksum or at least size; ensure permissions allow the app user to read them.
  • App startup: check the logs for known bad patterns (permission denied, missing config keys, schema mismatch).
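
For the uploads check above, a sketch that reads a sample of known files as the app’s runtime UID rather than as root. The sample list and UID 1000 are assumptions:

while read -r f; do
  docker run --rm -u 1000:1000 -v app_uploads_restore_test:/data:ro alpine:3.20 sh -c "test -r '/data/$f'" || echo "UNREADABLE: $f"
done < /backup/sample_files.txt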

Prove RPO and RTO, not just correctness

Correctness answers “can we restore?” RPO/RTO answers “can we restore in time, with acceptable data loss?” Measure it:

  • RPO: time between last successful backup and incident moment (use backup timestamps, not feelings).
  • RTO: time from “we start restoring” to “service meets SLO again.” Include time to fetch artifacts, decompress, verify, and warm up caches.
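
To turn RPO into a number instead of a feeling, even something this small works (the artifact path is an assumption):

LAST=$(stat -c %Y /backup/appdb.dump)
NOW=$(date +%s)
echo "current RPO: $(( (NOW - LAST) / 60 )) minutes since the last successful backup"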

Keep a “restore kit” with everything that isn’t the data

Most failed restores aren’t “bad data.” They’re missing glue:

  • Compose file version pinned and stored
  • Container images versioned (or build pipeline reproducible)
  • Secrets management path documented
  • Network/ports, reverse proxy config, TLS cert renewal process
  • Database version compatibility notes

Fast diagnosis playbook

Backups and restores fail in predictable ways. Don’t start with a blind rewrite of your scripts. Start with narrowing the bottleneck.

First: Is the failure about consistency or mechanics?

  • Mechanics: tar fails, checksum mismatch, permission denied, out of space, slow transfer.
  • Consistency: restore completes but app errors, DB reports corruption, missing recent data.

Second: Identify the slowest stage

  1. Artifact fetch (network/object store)
  2. Decompression (CPU bound)
  3. Extract/write (disk bound, inode bound)
  4. Application recovery (DB replay, migrations)

Third: Quick checks that give answers fast

Check disk saturation (restore is write-heavy)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 	12/07/2025 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.31    0.00    8.14   34.22    0.00   45.33

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1         12.0   1456.0     0.0    0.0    1.20   121.3     980.0  84224.0   120.0   10.9   18.50    85.9    18.2   98.0

What it means: %util near 100% and elevated w_await implies the disk is the limiter.

Decision: reduce restore parallelism, restore to faster storage, or avoid compression formats that explode write amplification.

Check CPU bound decompression

cr0x@server:~$ top -b -n 1 | head -n 15
top - 02:31:20 up 21 days,  4:12,  1 user,  load average: 7.92, 8.10, 6.44
Tasks: 212 total,   2 running, 210 sleeping,   0 stopped,   0 zombie
%Cpu(s): 92.1 us,  0.0 sy,  0.0 ni,  5.8 id,  0.0 wa,  0.0 hi,  2.1 si,  0.0 st
MiB Mem :  32158.5 total,   812.4 free,  14220.1 used,  17126.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  17938.4 avail Mem

What it means: CPU is pegged in user space; decompression or checksumming may be the hotspot.

Decision: use faster compression (gzip -1 vs heavy), parallel decompression tools, or store uncompressed on fast internal networks where disk is the limiter anyway.

Check inode exhaustion (classic for lots of small files)

cr0x@server:~$ df -ih /var/lib/docker
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2   28M   27M  1.0M   97% /

What it means: you can have free space but no inodes; restores fail with “No space left on device” while df -h looks fine.

Decision: move volumes to a filesystem with more inodes or tune filesystem creation parameters; reduce tiny-file churn; consider packing into object storage where appropriate.

Check for permission/ownership mismatch (common after tar restores)

cr0x@server:~$ docker exec -t app-web sh -c 'id && ls -ld /var/www/uploads'
uid=1000(app) gid=1000(app) groups=1000(app)
drwxr-xr-x    5 root     root        4096 Dec  7 02:24 /var/www/uploads

What it means: the app runs as UID 1000, but the directory is owned by root. Reads may work; writes will fail.

Decision: fix ownership in restore procedure (e.g., chown -R 1000:1000), or run backup/restore preserving ownership and ensure correct user mappings.
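
After fixing ownership, prove it with a write test as the runtime user (UID/GID taken from the id output above), not as root:

docker exec -u 1000:1000 app-web sh -c 'touch /var/www/uploads/.write_test && rm /var/www/uploads/.write_test && echo write OK'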

Common mistakes: symptom → root cause → fix

1) “Restore finished, but app is missing recent data”

Symptom: service starts, but last day/week of data is gone.

Root cause: you backed up the wrong volume, wrong database, or wrong environment; or your backup job silently failed and you kept rotating empty artifacts.

Fix: enforce backup inventory mapping (volume/database name → artifact name). Fail the job if the dump size is below a threshold. Run scheduled restore drills with sanity queries.
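
A sketch of that size guard; the threshold is an assumption, so pick one from your own artifact history:

SIZE=$(stat -c %s /backup/appdb.dump)
[ "$SIZE" -ge 1000000000 ] || { echo "BACKUP SUSPICIOUSLY SMALL: ${SIZE} bytes"; exit 1; }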

2) “Tar backup exists, but restore yields permission denied”

Symptom: app logs show permission errors writing to restored directories.

Root cause: restoring as root without matching UID/GID expectations; lost ACLs/xattrs; or running containers with non-root users.

Fix: capture and restore ownership explicitly; use a restore step that applies correct UID/GID; verify with a write test as the app user.

3) “Database won’t start after volume copy”

Symptom: PostgreSQL or MySQL refuses to start, complains about WAL/binlog or corrupted pages.

Root cause: file-level copy taken while DB was writing; incomplete snapshot; inconsistent state.

Fix: use logical backup or proper physical backup. If you must snapshot, quiesce properly and snapshot atomically.

4) “Backup job takes forever and causes latency spikes”

Symptom: production I/O latency rises during backups; app slows down.

Root cause: backup reads compete with production reads; compression is CPU-heavy; storage is saturated; too many small files.

Fix: throttle backup read rate; schedule off-peak; use snapshots; change compression level; redesign data layout.

5) “We restored the volume, but the app still points to old data”

Symptom: restore appears successful but service still serves stale content.

Root cause: you restored into a new volume but the Compose file still references the old one; or bind mount paths differ.

Fix: explicitly swap volume references; use unique restore test project names; confirm container mounts with docker inspect.

6) “Checksum verification fails occasionally”

Symptom: random hash mismatches across days.

Root cause: partial uploads, non-atomic writes to the backup store, or unstable network transfers.

Fix: write to a temp name then rename atomically; store and verify manifests; ensure your upload tool uses multipart verification; retry on mismatch and alert.
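
A sketch of the temp-then-rename pattern for a volume archive, following the paths used earlier:

TMP=/backup/.app_uploads.tar.partial
FINAL=/backup/app_uploads.tar
sudo tar -cpf "$TMP" -C /var/lib/docker/volumes/app_uploads/_data .
HASH=$(sha256sum "$TMP" | cut -d' ' -f1)
mv "$TMP" "$FINAL"
printf '%s  %s\n' "$HASH" "$FINAL" > "$FINAL.sha256"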

Three corporate mini-stories (anonymized)

Mini-story 1: An incident caused by a wrong assumption

The team ran a customer-facing app with a PostgreSQL container and a named volume. Their backup script tar’d the volume nightly. It looked clean. It even had timestamps and retention. Everyone slept well.

Then an on-call got paged for a host failure. The recovery plan was simple: provision a new VM, restore the tarball into a fresh volume, start the DB container. It started… and immediately crashed. The logs mentioned WAL problems and a database state that “looks like it was copied while running.” That’s because it was.

The wrong assumption was subtle: “a filesystem copy is a backup.” The script ran at 2 a.m. when load was low, but not zero. PostgreSQL was still writing. The copy produced an archive that was internally inconsistent in a way tar could never detect.

They eventually recovered from an older backup that happened to be consistent by luck (the quietest night of the month). Afterward, they moved to pg_dump plus a weekly restore drill into a disposable container. The boring part—testing restores—became the part that executives cared about.

Mini-story 2: An optimization that backfired

A different organization had a mountain of user uploads. Backups were too slow, so someone “optimized” them with maximum compression to save bandwidth and storage. The artifacts shrank impressively. Everyone liked the cost graph.

Restores were never tested at full scale. The first real restore happened during a security incident where they had to rebuild hosts quickly. They pulled the backup and started decompressing. CPU hit the ceiling. Disk writes queued. The restore pipeline took hours longer than the RTO they had confidently promised in a slide deck.

The technical issue wasn’t that compression is bad. It was that they optimized a single metric (stored bytes) without measuring restore time. They also decompressed on the same production-grade but CPU-constrained nodes used for recovery, turning recovery into a compute-bound job.

The fix was refreshingly unsexy: switch to fast compression (or none) for hot-tier backups, and keep a longer-term, more compressed copy for archival. They also started measuring restore throughput as a first-class metric. Costs went up slightly. Pager fatigue went down a lot.

Mini-story 3: A boring but correct practice that saved the day

A finance-ish SaaS ran multiple services with Docker Compose. Their runbooks included a weekly “restore rehearsal” ticket. The on-call would restore the latest DB dump into a disposable container, bring up the app stack in an isolated network, and run a handful of API calls.

No heroics. Just repetition. They logged timings: download time, restore time, first successful request time. If numbers drifted, they investigated while nobody was panicking.

One weekend, a storage incident corrupted a host filesystem. They rebuilt a node, restored volumes and DB dumps, and were back online without drama. What looked like luck was just muscle memory. The team already knew which artifacts were valid, how long the restore took, and which commands would fail if something was off.

The real win: they didn’t have to improvise. They executed a practiced workflow and got back to being mildly annoyed at their monitoring alerts, which is the ideal emotional state for on-call.

Checklists / step-by-step plan

Checklist A: Build your backup inventory (one afternoon)

  1. List volumes: docker volume ls.
  2. List bind mounts: docker inspect on each container and extract Mounts.
  3. Classify data: database / uploads / cache / rebuildable.
  4. Define RPO and RTO per class (even rough is better than silent).
  5. Write down ownership/permissions requirements (UID/GID, ACLs, SELinux).

Checklist B: Implement backups (repeatable scripts)

  1. For databases: implement logical backup (or proper physical) and store artifacts off-host.
  2. For file volumes: choose cold copy or snapshot-based copy; avoid live copy for writers.
  3. Create a manifest per artifact: timestamp, source, size, checksum, tool version.
  4. Make backups atomic: write temp then rename; never leave partial artifacts with “final” names.
  5. Alert on failures and suspiciously small outputs.

Checklist C: Prove restores (weekly or monthly drill)

  1. Fetch latest artifact.
  2. Verify checksum.
  3. Restore into a clean environment (new volume/new container/new Compose project).
  4. Run validation checks (queries, file checks, app health endpoints).
  5. Record timing and results; file a ticket for any deviation.

Checklist D: Incident restore runbook (when it’s already bad)

  1. Stop the bleeding: prevent writers from continuing (maintenance mode, stop containers).
  2. Identify “last known good” backup from restore drill logs, not from hope.
  3. Restore into new volumes; don’t overwrite evidence unless you must.
  4. Bring up services in dependency order (DB first, then app, then workers).
  5. Validate externally (synthetic checks) and internally (logs, DB integrity checks).
  6. After recovery: preserve the broken disk/volume for forensics if needed.

FAQ

1) Should I back up /var/lib/docker?

Generally no. It’s not portable across storage drivers, Docker versions, and host layouts. Back up the data (volumes and bind mounts) and the definitions (Compose files, configs), separately.

2) Is a tar of a volume always safe?

Safe if the data is not being modified or if the application tolerates crash-consistent copies. For databases, assume it is not safe unless you quiesce properly or use DB-native backups.

3) What’s the difference between a named volume and a bind mount for backups?

Named volumes are managed by Docker and live under Docker’s data directory. Bind mounts are arbitrary host paths. From a backup perspective, bind mounts are easier to integrate with traditional host backup tools—until someone changes the path and forgets to update the backup scope.

4) How do I back up volumes with Docker Compose?

Compose is just orchestration. The backup mechanics are the same: use docker exec for logical DB backups, and docker run --rm with volume mounts for filesystem backups. The important part is naming consistency so your scripts find the right volumes in each environment.

5) Can I use docker commit as a backup?

No for persistent data. It won’t include named volumes, and it can capture secrets into an image layer. It’s occasionally useful for debugging a container filesystem state, not for disaster recovery.

6) How often should I test restores?

As often as your business can afford to be wrong. Weekly for core databases is common; monthly for less critical datasets. Also test after major schema changes, storage migrations, or Docker host rebuilds.

7) Do I need encryption for volume backups?

If backups contain customer data, credentials, or proprietary code, yes. Encrypt at rest and in transit, and manage keys separately from the backup storage. “It’s in a private bucket” is not a control, it’s a hope.

8) How do I handle UID/GID differences across hosts?

Prefer stable numeric IDs for service users across hosts. If that’s not possible, include a post-restore ownership fix step. Verify by performing a write as the container’s runtime user, not as root.

9) What about incremental backups for huge volumes?

Doable, but complexity tax applies. For file volumes, snapshot-based send/receive (ZFS/btrfs) or rsync-based incrementals can work. For databases, use WAL/binlogs or vendor tooling. Whatever you choose, your restore drill must include reconstructing from incrementals—otherwise your “incremental strategy” is theoretical.
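
A sketch of the rsync-based incremental for a file volume, hard-linking unchanged files against the previous run. Paths and layout are assumptions; databases still belong with WAL/binlogs or vendor tooling:

TODAY=$(date +%F)
sudo rsync -a --delete --link-dest=/backup/app_uploads/latest /var/lib/docker/volumes/app_uploads/_data/ "/backup/app_uploads/$TODAY/"
ln -sfn "$TODAY" /backup/app_uploads/latest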

10) What’s the single most reliable improvement I can make?

Automate restore verification into an isolated environment. It turns “we think” into “we know,” and it catches the quiet failures like empty dumps, wrong targets, and permission mismatches.

Conclusion: next steps you can actually do today

  1. Inventory your data: volumes, bind mounts, and “stuff outside Docker” (secrets, configs).
  2. Pick correct backup primitives: DB-native for databases; cold/snapshot for file data.
  3. Add integrity: checksums and manifests, stored with the artifacts.
  4. Schedule a restore drill: restore into a disposable container/stack and run real checks.
  5. Measure RTO: time the restore end-to-end, then decide if it matches reality you can live with.

If you do only one thing: restore a backup into a clean environment this week and make it a habit. Your future incident bridge will be quieter, shorter, and far less theatrical.
