Docker: Backups You Never Tested — How to Run a Restore Drill Properly

You have backups. You even have a green checkmark in some dashboard. Then a node dies, the on-call starts a restore,
and suddenly the only thing you’re restoring is your respect for Murphy’s Law.

Docker makes it easy to ship apps. It also makes it easy to forget where the data actually lives: volumes, bind mounts,
secrets, env files, registries, and a few “temporary” directories someone once hard-coded at 2 a.m.

A restore drill is a product, not a ritual

A “backup” is a promise. A restore drill is where you prove you can keep that promise under pressure.
The deliverable isn’t a tarball in object storage. It’s a repeatable recovery process with known time bounds.

Your restore drill has one job: convert assumptions into measurements. What’s your RPO (how much data you can lose)
and RTO (how long you can be down)? Which parts are slow? Which parts are fragile? Which parts require a specific
person’s memory and caffeine?

The most valuable outcome of a drill is often boring: a list of missing files, wrong permissions, undiscoverable secrets,
and “we thought this was in the backup” surprises. Boring is good. Boring is how you survive outages.

One quote to keep on your desk: Hope is not a strategy. (attributed to Gen. Gordon R. Sullivan)

What you’re actually restoring in Docker

Docker doesn’t “contain” state. It just makes state easier to misplace. For restore drills, treat your system as layers:
host state, container state, data state, and deployment state. Then decide what you’re promising to restore.

1) Data state

  • Named volumes (managed by Docker): usually under /var/lib/docker/volumes.
  • Bind mounts: anywhere on the host filesystem; often not in the same backup policy as volumes.
  • External storage: NFS, iSCSI, Ceph, EBS, SAN LUNs, ZFS datasets, LVM, etc.
  • Databases: Postgres/MySQL/Redis/Elastic/etc. The backup method matters more than where it sits.

2) Deployment state

  • Compose files, environment files, and overrides.
  • Secrets and their delivery mechanism (Swarm secrets, files, SOPS, Vault templates, etc.).
  • Image tags: “latest” is not a restore plan.
  • Registry access: if you can’t pull, you can’t start.
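
If you want to audit that list before an outage does it for you, a few commands go a long way. A minimal sketch, assuming the /opt/myapp/docker-compose.yml path used later in this article:

# Show the image references the Compose file actually resolves to.
docker compose -f /opt/myapp/docker-compose.yml config | grep -E '^[[:space:]]*image:'

# Flag anything that relies on "latest" (rough check; a missing tag also means "latest").
docker compose -f /opt/myapp/docker-compose.yml config | grep -E '^[[:space:]]*image:.*:latest' \
  && echo "WARNING: unpinned image reference" || echo "no :latest references found"

# Record the tags and digests of what is running right now, so a restore can pin to exactly that.
for img in $(docker ps --format '{{.Image}}' | sort -u); do
  docker image inspect --format '{{.RepoTags}} {{.RepoDigests}}' "$img"
done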

3) Host state

  • Docker Engine config, storage driver, daemon flags.
  • Kernel + filesystem details: overlay2 expectations, xfs ftype, SELinux/AppArmor.
  • Networking: firewall rules, DNS, routes, MTU.

4) Runtime state (usually not worth “restoring”)

Container layers and ephemeral runtime files can be recreated. If you are backing up the entire Docker root directory
(/var/lib/docker) hoping to resurrect containers byte-for-byte, you’re signing up for subtle breakage.
The correct target is almost always data volumes plus deployment config, and rebuilding containers cleanly.

Joke #1: If your recovery plan starts with “I think the data is on that one node,” congratulations—you’ve invented a single point of surprise.

Facts & historical context (so you stop repeating it)

  • Fact 1: Docker’s early “AUFS era” normalized the idea that containers are disposable; a lot of teams mistakenly made data disposable too.
  • Fact 2: The shift from AUFS to overlay2 wasn’t just performance—restore semantics and filesystem requirements changed (notably XFS ftype=1 expectations).
  • Fact 3: The industry’s move toward “immutable infrastructure” reduced host restores but increased the need to restore externalized state (volumes, object stores, managed DBs).
  • Fact 4: Compose became the default app description for many orgs, even when the operational rigor (secrets rotation, pinned versions, healthchecks) didn’t keep up.
  • Fact 5: Many outages blamed on “Docker” are really storage coherency problems: crash-consistent filesystem copies taken from under a busy database.
  • Fact 6: Ransomware shifted backup strategy from “can we restore?” to “can we restore without trusting the attacker didn’t encrypt our backup keys?”
  • Fact 7: Container image registries became critical infrastructure; losing a private registry or its credentials can block restores even if data is safe.
  • Fact 8: Filesystem snapshots (LVM/ZFS) made fast backups easier—but they also encouraged overconfidence when apps weren’t snapshot-safe.
  • Fact 9: The rise of rootless containers changed backup paths and permission models; restoring data as root can quietly break rootless runtimes later.

Pick the drill scope: host, app, or data tier

A restore drill can be three different things. If you don’t declare which one you’re doing, you’ll “succeed” at the easy
one and fail the one that matters.

Scope A: Data restore drill (most common, most valuable)

You restore volumes/bind-mount data and re-deploy containers from known images and config. This is the right default
for most Docker Compose production setups.

Scope B: App restore drill (deployment + data)

You restore the exact app stack: Compose files, env/secrets, reverse proxy, certificates, plus data. This validates
the “everything needed to run” assumption. It also exposes the “we kept that config on someone’s laptop” disease.

Scope C: Host rebuild drill (rare, but do it at least annually)

You assume the node is gone. You provision a fresh host and restore onto it. This is where you discover dependency on
old kernels, missing packages, custom iptables rules, weird MTU hacks, and storage driver mismatches.

Fast diagnosis playbook (find the bottleneck fast)

During a restore, you’re typically blocked by one of four things: identity/credentials, data integrity,
data transfer speed, or application correctness. Don’t guess. Triage in this order.

First: Can you even access what you need?

  • Do you have the backup repository credentials and encryption keys?
  • Can the restore host reach object storage / backup server / NAS?
  • Can you pull container images (or do you have an air-gapped cache)?
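
A minimal preflight sketch for this first gate; the password file path and backup endpoint below are placeholders, and the image is the one used in the examples later:

# 1) Are the repository credentials / encryption keys actually on this host?
test -s /etc/backup/repo-password || echo "MISSING: backup repository password file"

# 2) Can this host reach the backup store at all?
curl -fsS --max-time 10 https://backup-store.internal/ >/dev/null \
  && echo "backup endpoint reachable" || echo "UNREACHABLE: backup endpoint"

# 3) Can we pull at least one critical image?
docker pull registry.local/api:1.42.0 >/dev/null \
  && echo "registry OK" || echo "registry pull FAILED"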

Second: Is the backup complete and internally consistent?

  • Do you have all expected volumes/bind-mount paths for the app?
  • Do checksums match? Can you list and extract files?
  • For databases: do you have a logical backup or only a crash-consistent filesystem copy?
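
If your backup job writes a checksum manifest next to the artifacts (the SHA256SUMS file name below is an assumption, not something shown elsewhere in this article), the two-minute version of this check looks like:

# Verify artifact checksums against the manifest written at backup time.
cd /backups/myapp && sha256sum -c SHA256SUMS-2026-01-02

# Confirm the volume subtree you care about is actually inside the archive.
tar -I zstd -tf /backups/myapp/myapp-volumes-2026-01-02.tar.zst | grep -c 'volumes/myapp_pgdata/'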

Third: Where is time going?

  • Network throughput (object storage egress, VPN constraints, throttling)?
  • Decompression and crypto (single-threaded restore tooling)?
  • IOPS and small-file restore storms (millions of tiny files)?

Fourth: Why won’t the app come up?

  • Permissions/ownership/SELinux labels on restored data.
  • Config drift: env vars, secrets, changed image tags.
  • Schema mismatch: restoring old DB data into new app version.

If you only remember one thing: measure transfer speed and verify keys early. Everything else is secondary.

Build a realistic restore environment

A restore drill on the same host that produced the backups is a comforting lie. It shares the same cached images,
the same credentials already logged in, and the same hand-tuned firewall rules. Your goal is to fail honestly.

What “realistic” means

  • Fresh host: new VM or bare metal, same OS family, same major versions.
  • Same network constraints: same route to backup storage, same NAT/VPN, same DNS.
  • No hidden state: don’t reuse old /var/lib/docker; don’t mount production volumes directly.
  • Timeboxed: you’re testing RTO; stop admiring the logs and start a timer.

Define success criteria up front

  • RPO validated: you can point at the newest successful backup and show its timestamp and contents.
  • RTO measured: from “host provisioned” to “service responds correctly”.
  • Correctness verified: not just “containers are running” but “data is right”.

Hands-on tasks: commands, outputs, decisions

These are restore-drill tasks I expect to see in a runbook. Each one includes a command, what the output means, and the
decision you make from it. Run them on the restore target host unless noted.

Task 1: Inventory running containers and their mounts (source environment)

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
NAMES               IMAGE                         STATUS
api                 registry.local/api:1.42.0     Up 3 days
postgres            postgres:15                   Up 3 days
nginx               nginx:1.25                    Up 3 days

Meaning: This is the minimal “what exists” list. It’s not enough, but it’s a start.
Decision: Identify which containers are stateful (here: postgres) and which are stateless.

cr0x@server:~$ docker inspect postgres --format '{{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}}{{"\n"}}{{end}}'
volume pgdata -> /var/lib/postgresql/data
bind /srv/postgres/conf -> /etc/postgresql

Meaning: You have both a named volume and a bind mount. Two backup policies, two failure modes.
Decision: Your restore plan must capture both pgdata and /srv/postgres/conf.
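
If you want that inventory as an artifact instead of tribal knowledge, a loop like this is enough to start (the output path is hypothetical):

# Dump every running container's mounts into a dated inventory file.
for c in $(docker ps --format '{{.Names}}'); do
  docker inspect "$c" --format '{{.Name}}: {{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}}; {{end}}'
done > /var/backups/docker-mount-inventory-$(date +%F).txt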

Task 2: List Docker volumes and map them to projects

cr0x@server:~$ docker volume ls
DRIVER    VOLUME NAME
local     myapp_pgdata
local     myapp_redisdata
local     shared_uploads

Meaning: Volume names often encode Compose project names. That’s useful during restores.
Decision: Decide which volumes are critical and which can be rebuilt (e.g., caches).

Task 3: Identify where volumes live on disk (restore host)

cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/var/lib/docker

Meaning: Default Docker root directory. Volumes will be under this path unless configured otherwise.
Decision: Confirm this matches your backup expectations; mismatches cause “restore succeeded, data missing.”

Task 4: Verify filesystem and free space before restoring

cr0x@server:~$ df -hT /var/lib/docker /srv
Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/sda2      ext4   200G   32G  158G  17% /
/dev/sdb1      xfs    800G  120G  680G  15% /srv

Meaning: You have capacity headroom. Also note filesystem types; some behaviors differ for overlay and permissions.
Decision: If available space is tight, you don’t “try anyway.” You resize first or pick a larger restore target.

Task 5: Confirm Docker storage driver and kernel compatibility

cr0x@server:~$ docker info | grep -E 'Storage Driver|Backing Filesystem'
 Storage Driver: overlay2
  Backing Filesystem: extfs

Meaning: overlay2 on ext4 (Docker reports “extfs”). If your original host used a different driver, don’t assume portability of /var/lib/docker.
Decision: Prefer restoring only volumes and config; rebuild containers from images.

Task 6: Verify the backup artifact exists and is recent

cr0x@server:~$ ls -lh /backups/myapp/
total 4.1G
-rw------- 1 root root 1.9G Jan  2 01:05 myapp-volumes-2026-01-02.tar.zst
-rw------- 1 root root 2.2G Jan  2 01:06 myapp-bindmounts-2026-01-02.tar.zst
-rw------- 1 root root  12K Jan  2 01:06 myapp-compose-2026-01-02.tgz

Meaning: Separate artifacts for volumes, bind mounts, and deployment config is healthy. It makes partial restores possible.
Decision: If the newest file is older than your RPO, you stop and escalate. Restoring stale data without telling anyone is how incidents become careers.
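
A sketch of turning that decision into a check instead of a judgment call; the 24-hour threshold is an invented example, substitute your real RPO:

# Alert if the newest volume archive is missing or older than the RPO (1440 minutes = 24h).
newest=$(ls -t /backups/myapp/myapp-volumes-*.tar.zst 2>/dev/null | head -n 1)
if [ -z "$newest" ] || [ -n "$(find "$newest" -mmin +1440)" ]; then
  echo "STOP: newest artifact '$newest' violates the RPO - escalate before restoring"
fi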

Task 7: Validate archive integrity before extraction

cr0x@server:~$ zstd -t /backups/myapp/myapp-volumes-2026-01-02.tar.zst
/backups/myapp/myapp-volumes-2026-01-02.tar.zst: OK

Meaning: The compressed stream is not corrupted.
Decision: If this fails, do not extract “some of it.” Locate another backup set or redo the backup pipeline.

Task 8: Dry-run list of files inside the backup (spot missing paths)

cr0x@server:~$ tar -I zstd -tf /backups/myapp/myapp-bindmounts-2026-01-02.tar.zst | head
srv/postgres/conf/postgresql.conf
srv/postgres/conf/pg_hba.conf
srv/myapp/env/.env.production
srv/nginx/conf.d/app.conf

Meaning: You can see expected bind-mounted config files.
Decision: If key directories are missing, stop and fix the backup definition. Restore drills are not magic tricks.

Task 9: Restore bind mounts to a staging prefix first (avoid clobbering)

cr0x@server:~$ mkdir -p /restore-staging
cr0x@server:~$ tar -I zstd -xpf /backups/myapp/myapp-bindmounts-2026-01-02.tar.zst -C /restore-staging
cr0x@server:~$ ls -la /restore-staging/srv/postgres/conf
total 24
drwxr-xr-x 2 root root 4096 Jan  2 01:06 .
drwxr-xr-x 3 root root 4096 Jan  2 01:06 ..
-rw-r--r-- 1 root root  980 Jan  2 01:06 pg_hba.conf
-rw-r--r-- 1 root root 3150 Jan  2 01:06 postgresql.conf

Meaning: Files restored with permissions preserved (-p). Ownership matters later.
Decision: Compare staging to target layout. Only then move into place.

Task 10: Restore named volume data using a helper container

For named volumes, don’t hand-copy into Docker’s internals. Use a temporary container that mounts the volume. Note that BusyBox tar in a stock Alpine image can’t drive an external compressor, so install GNU tar and zstd inside the throwaway container first.

cr0x@server:~$ docker volume create myapp_pgdata
myapp_pgdata
cr0x@server:~$ docker run --rm -v myapp_pgdata:/data -v /backups/myapp:/backup alpine:3.20 sh -c "apk add --no-cache tar zstd >/dev/null && cd /data && tar -I zstd -xpf /backup/myapp-volumes-2026-01-02.tar.zst --strip-components=2 ./volumes/myapp_pgdata"
tar: removing leading './' from member names

Meaning: You’re extracting only the sub-tree for that volume into the mounted volume path.
Decision: If the archive layout doesn’t match what you expect, stop and re-check the backup script; don’t improvise your way into partial restores.
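
For reference, here is one way an archive with that ./volumes/<name> layout could have been produced on the source host. It is a sketch of the backup side, not necessarily what your pipeline does, and for a busy database it is only crash-consistent (more on that below):

# Create the archive from a throwaway container that sees the volume read-only.
docker run --rm \
  -v myapp_pgdata:/volumes/myapp_pgdata:ro \
  -v /backups/myapp:/backup \
  alpine:3.20 \
  sh -c "apk add --no-cache tar zstd >/dev/null && cd / && tar -I zstd -cpf /backup/myapp-volumes-$(date +%F).tar.zst ./volumes/myapp_pgdata"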

Task 11: Sanity-check restored volume contents and ownership

cr0x@server:~$ docker run --rm -v myapp_pgdata:/data alpine:3.20 sh -c "ls -la /data | head"
total 128
drwx------    19 999      999           4096 Jan  2 01:04 .
drwxr-xr-x     1 root     root          4096 Jan  2 02:10 ..
-rw-------     1 999      999              3 Jan  2 01:04 PG_VERSION
drwx------     5 999      999           4096 Jan  2 01:04 base

Meaning: Ownership is 999:999, typical for the official Postgres image. Good.
Decision: If ownership is wrong (e.g., root), fix it now (chown) or Postgres may refuse to start.
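
If ownership did come back wrong, fix it the same way you restored: through a throwaway container, not by reaching into /var/lib/docker. A sketch, using the 999:999 owner noted above:

# Repair ownership on the restored volume without touching Docker's internals directly.
docker run --rm -v myapp_pgdata:/data alpine:3.20 chown -R 999:999 /data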

Task 12: Restore deployment config and pin image versions

cr0x@server:~$ mkdir -p /opt/myapp
cr0x@server:~$ tar -xpf /backups/myapp/myapp-compose-2026-01-02.tgz -C /opt/myapp
cr0x@server:~$ ls -la /opt/myapp
total 40
drwxr-xr-x 3 root root 4096 Jan  2 02:13 .
drwxr-xr-x 3 root root 4096 Jan  2 02:13 ..
-rw-r--r-- 1 root root 2241 Jan  2 01:06 docker-compose.yml
-rw------- 1 root root  412 Jan  2 01:06 .env.production

Meaning: Config is present, including env file. Treat it as sensitive.
Decision: Ensure images are pinned to tags or digests you trust. If the Compose file uses latest, fix it as part of the drill.

Task 13: Validate images can be pulled (or are already available)

cr0x@server:~$ docker compose -f /opt/myapp/docker-compose.yml pull
[+] Pulling 3/3
 ✔ postgres Pulled
 ✔ api      Pulled
 ✔ nginx    Pulled

Meaning: Your registry path, credentials, and network are functional.
Decision: If pulls fail, your restore plan must include a registry mirror, offline image tarballs, or credential recovery steps.

Task 14: Bring the stack up and watch for fast failures

cr0x@server:~$ docker compose -f /opt/myapp/docker-compose.yml up -d
[+] Running 3/3
 ✔ Container myapp-postgres-1  Started
 ✔ Container myapp-api-1       Started
 ✔ Container myapp-nginx-1     Started

Meaning: Containers started, but this is not proof of correctness.
Decision: Immediately check logs and health endpoints. “Started” can still mean “broken quietly.”

Task 15: Verify health and read the first errors, not the last ones

cr0x@server:~$ docker compose -f /opt/myapp/docker-compose.yml ps
NAME                IMAGE                         COMMAND                  SERVICE    STATUS          PORTS
myapp-api-1          registry.local/api:1.42.0     "gunicorn -c ..."        api        Up 20s (healthy) 0.0.0.0:8080->8080/tcp
myapp-nginx-1        nginx:1.25                    "/docker-entrypoint…"    nginx      Up 20s          0.0.0.0:80->80/tcp
myapp-postgres-1     postgres:15                   "docker-entrypoint…"     postgres   Up 20s (healthy) 5432/tcp

Meaning: Health checks are passing. That’s a strong signal, not a guarantee.
Decision: If health is failing, check the earliest relevant logs (DB first, then app, then proxy).

cr0x@server:~$ docker logs --tail=50 myapp-postgres-1
2026-01-02 02:14:12.101 UTC [1] LOG:  database system is ready to accept connections

Meaning: Postgres came up cleanly. If you see “invalid checkpoint record” or “permission denied,” your restore is not correct.
Decision: For DB errors, decide whether you need a logical restore instead of filesystem copy.
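
If the decision is “logical restore instead,” the shape of it is roughly this. The dump path and the myapp database name are assumptions, not something shown earlier:

# Custom-format dump restored via pg_restore (drops and recreates objects first).
docker exec -i myapp-postgres-1 pg_restore -U postgres -d myapp --clean --if-exists \
  < /backups/myapp/myapp-db-2026-01-02.dump

# Plain SQL dump restored via psql.
docker exec -i myapp-postgres-1 psql -U postgres -d myapp \
  < /backups/myapp/myapp-db-2026-01-02.sql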

Task 16: Prove correctness with an application-level query

cr0x@server:~$ curl -fsS http://127.0.0.1:8080/health
{"status":"ok","db":"ok","version":"1.42.0"}

Meaning: Your app says it’s healthy. Now validate data, not just liveness.
Decision: Run a known query or business check (e.g., “a specific customer record exists”).

cr0x@server:~$ docker exec -i myapp-postgres-1 psql -U postgres -tAc "select now(), count(*) from users;"
2026-01-02 02:14:35.812396+00|1842

Meaning: You have data, and it looks plausible.
Decision: Compare against an expected range or a checksum report. If count is zero, you restored the wrong thing or pointed to an empty volume.
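
A drill should end with a pass/fail, not a shrug. A tiny gate like this is enough; the 1500 threshold is an invented example, use a number your team actually expects:

# Fail the drill loudly if a known table is implausibly small.
count=$(docker exec -i myapp-postgres-1 psql -U postgres -tAc "select count(*) from users;")
if [ "$count" -ge 1500 ]; then
  echo "PASS: users=$count"
else
  echo "FAIL: users=$count (expected at least 1500) - wrong volume or stale backup"
fi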

Task 17: Measure restore throughput to predict RTO

cr0x@server:~$ mkdir -p /restore-staging-voltest
cr0x@server:~$ /usr/bin/time -f 'elapsed=%E cpu=%P' tar -I zstd -xpf /backups/myapp/myapp-volumes-2026-01-02.tar.zst -C /restore-staging-voltest
elapsed=0:02:41 cpu=380%

Meaning: CPU is a bottleneck (380% indicates ~4 cores busy). Good to know.
Decision: If restore is CPU-bound, you can reduce RTO by adding cores, tuning compression level, or switching tools—before the real outage.
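
To separate “decompression is slow” from “the disk is slow,” time the decompression alone by throwing the output away (same artifact as above):

# Decompress to /dev/null: this isolates CPU cost from tar extraction and disk writes.
/usr/bin/time -f 'decompress-only elapsed=%E cpu=%P' \
  zstd -dc /backups/myapp/myapp-volumes-2026-01-02.tar.zst > /dev/null
# If this is close to the full extraction time, compression level / CPU is your bottleneck.
# If it is much faster, look at IOPS and small-file overhead instead.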

Task 18: Verify file permissions and SELinux/AppArmor context (where applicable)

cr0x@server:~$ getenforce
Enforcing

Meaning: SELinux is on. Bind mounts may need labels, or containers will get “permission denied” despite correct UNIX perms.
Decision: If you use SELinux, ensure your Compose uses :z or :Z where appropriate, and relabel restored paths.

cr0x@server:~$ ls -lZ /srv/postgres/conf | head -n 3
-rw-r--r--. 1 root root unconfined_u:object_r:default_t:s0  980 Jan  2 01:06 pg_hba.conf
-rw-r--r--. 1 root root unconfined_u:object_r:default_t:s0 3150 Jan  2 01:06 postgresql.conf

Meaning: Default SELinux type on config files may not match what your container expects.
Decision: If containers fail to read bind mounts under SELinux, you must relabel or adjust mount options; do not disable SELinux as a “fix.”
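
On SELinux hosts the fix is usually one of these two. The container_file_t type is the one shipped by container-selinux; confirm it exists on your distribution:

# Option 1: relabel the restored path so containers may read it.
chcon -R -t container_file_t /srv/postgres/conf

# Option 2: let Docker relabel per-mount by adding :Z (private) or :z (shared)
# to the bind mount in your Compose file, e.g. /srv/postgres/conf:/etc/postgresql:Z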

Task 19: Confirm the restore didn’t silently swap volumes

cr0x@server:~$ docker inspect myapp-postgres-1 --format '{{range .Mounts}}{{.Name}} {{.Source}} -> {{.Destination}}{{"\n"}}{{end}}'
myapp_pgdata /var/lib/docker/volumes/myapp_pgdata/_data -> /var/lib/postgresql/data

Meaning: Container is using the expected volume.
Decision: If you see an unexpected volume name, you’re restoring into one volume but running against another. That’s a classic drill failure.

Task 20: Capture evidence and timings (so the drill improves)

cr0x@server:~$ journalctl -u docker --since "today" | tail -n 5
Jan 02 02:13:55 server dockerd[1023]: API listen on /run/docker.sock
Jan 02 02:14:03 server dockerd[1023]: Loading containers: done.

Meaning: You have timestamps for Docker daemon start and container loading.
Decision: Record these in the drill report along with restore start/end. If you don’t measure, you’ll argue during the incident instead.

Joke #2: A restore drill is like flossing—everyone claims they do it, and the evidence is usually bleeding.

Three corporate mini-stories (how this fails in real life)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran Docker Compose on a couple of beefy VMs. Their backups were “simple”: nightly tar of
/srv plus a weekly snapshot of the VM disk. The assumption was that everything important lived in /srv.

The outage started with a mundane storage failure. The VM wouldn’t boot cleanly after a host incident. The team spun up
a new VM and restored /srv from the nightly backup. Compose came up. Nginx served pages. The API returned 500s.

Postgres logs showed a fresh empty database cluster had been initialized. Nobody had restored it—because nobody had backed
it up. The DB used a named Docker volume, sitting in Docker’s root under /var/lib/docker/volumes, outside the
backup scope. The weekly VM snapshot contained it, but it was too old for the company’s implicit RPO, and it lived in
a different system managed by a different team.

The postmortem wasn’t dramatic. It was worse: it was obvious. They had conflated “our app data directory” with “where Docker
stores state.” The fix wasn’t fancy either: inventory mounts, explicitly back up named volumes, and run a quarterly restore drill
on a fresh host. Also: stop calling it “simple backups” if it doesn’t include your database.

Mini-story 2: The optimization that backfired

Another org got serious about speed. Their restore time was too slow for leadership’s patience, so they optimized.
They moved from logical DB dumps to crash-consistent filesystem snapshots of the database volume. It was faster and produced
smaller incremental transfers. Everyone celebrated.

Six months later, they needed the restore. A bad deploy corrupted application state and they rolled back. The restore “worked”
mechanically: the snapshot extracted, containers started, healthchecks went green. Then traffic ramped, and the DB began throwing
errors: subtle index corruption, followed by query planner weirdness, followed by a crash loop.

The root cause was dull but deadly: the snapshot was taken while the database was under write load, without coordinating a checkpoint
or using a DB-native backup mechanism. The volume backup was consistent at the filesystem level, not necessarily at the database level.
It restored fast and failed late—exactly the kind of failure that wastes the most time.

The fix was a compromise: keep fast snapshots for short-term “oops” recovery, but also take periodic DB-native backups (or run the DB’s
supported base backup procedure) that can be validated. They also added a verification job that starts a restored DB in a sandbox and runs
integrity checks. Optimizations are allowed. Unverified optimizations are just performance-themed risk.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company ran several customer-facing services in Docker. Their SRE lead was not a romantic. Every quarter, they ran a
restore drill using an isolated VPC, a clean VM image, and a copy of the backup repository. The drill had a checklist and a stopwatch.

The drill always included the same tedious steps: verify encryption keys are accessible to on-call, validate backup manifests, restore volumes
into staging first, then swap into place, then run a handful of application-level sanity checks. Finally, document timing and update the runbook.
Nobody loved it. Nobody put it on a slide deck.

Then a real incident arrived: an operator error wiped a production volume and replicated quickly. The on-call followed the runbook without improvising.
They already knew the slowest step was decompression and they had already tuned the restore host size for it. They already knew exactly which secrets
had to be present, and where they lived. They had already fought the SELinux labeling fight—in the drill, not during the outage.

The restore finished within the expected window. Not because the team was heroic, but because they were boring on purpose. In ops, boring is a feature.

Checklists / step-by-step plan

Restore drill plan (repeatable, not “let’s see what happens”)

  1. Declare scope and success criteria.

    • Which services? Which data sets? What RPO/RTO are you validating?
    • What does “correct” mean (queries, checksums, UI actions, message counts)?
  2. Freeze the inventory.

    • Export Compose files and env/secrets references.
    • List volumes and bind mounts per container.
    • Record image references (tags or digests).
  3. Provision a fresh restore target.

    • Same OS family, similar CPU/memory, same filesystem choices.
    • Same network path to backups and registries (or explicitly different, if testing DR region).
  4. Fetch backup artifacts and validate integrity.

    • Checksum, decrypt, list contents, verify timestamps.
    • Confirm you have keys and passwords in the access model you expect during an incident.
  5. Restore to staging first.

    • Bind mounts into /restore-staging.
    • Volumes via helper containers into freshly created volumes.
  6. Apply permissions, labels, and ownership.

    • DB volumes must match the container’s UID/GID expectations.
    • SELinux/AppArmor: ensure correct labels and mount options.
  7. Bring up the stack pinned to known-good images.

    • Pull images; if pull fails, use cached/offline images.
    • Start DB first, then app, then edge proxies.
  8. Verify correctness.

    • Health endpoint + at least one data query per critical service.
    • For queues/caches: verify you don’t need to restore (often you don’t).
  9. Measure timings and write the drill report.

    • Restore start/end, transfer throughput, bottlenecks, failures, fixes.
    • Update runbook and automate the fragile steps.

What to automate after your first honest drill

  • Inventory export: mounts, volumes, images, Compose configs.
  • Backup manifest generation: expected paths and volumes, sizes, timestamps.
  • Integrity checks: checksums, archive tests, periodic restore to sandbox.
  • Permissions normalization: known UID/GID mapping per service.
  • Image retention: keep required images for your RPO window (or export tars).
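
A starting point for the manifest piece, as a sketch; the file names and paths follow this article’s examples and will differ in yours:

# Write a manifest of what today's backup is supposed to contain, plus checksums.
{
  echo "generated=$(date -Is)"
  echo "--- volumes ---";    docker volume ls --format '{{.Name}}'
  echo "--- containers ---"; docker ps --format '{{.Names}} {{.Image}}'
} > /backups/myapp/manifest-$(date +%F).txt

sha256sum /backups/myapp/*-$(date +%F).* > /backups/myapp/SHA256SUMS-$(date +%F)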

Common mistakes: symptom → root cause → fix

1) “Containers are up, but the app is empty”

Symptom: Healthchecks pass, but user data is missing or reset to defaults.
Root cause: Restored into the wrong volume name, or Compose created a new empty volume due to project-name mismatch.
Fix: Inspect mounts (docker inspect), ensure volume names match, and explicitly name volumes in Compose rather than relying on implicit project scoping.
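
It’s cheap to check what volume names Compose will actually use on the restore host before starting anything (a sketch, using the paths from the tasks above):

# What does the Compose file think its volumes are called?
docker compose -f /opt/myapp/docker-compose.yml config | sed -n '/^volumes:/,$p'

# What actually exists on this host?
docker volume ls --format '{{.Name}}' | grep -i myapp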

2) “Permission denied” on restored bind mounts

Symptom: Containers crash with file access errors; files look fine on the host.
Root cause: SELinux labels wrong, or rootless container expects different ownership than the restore produced.
Fix: Use :z/:Z mount options where appropriate, relabel restored paths, and restore ownership matching container UID/GID.

3) Postgres/MySQL starts, then behaves strangely under load

Symptom: DB comes up, then you see corruption-like errors or crashes later.
Root cause: Crash-consistent filesystem backup taken without DB coordination; inconsistent WAL/checkpoint state.
Fix: Prefer DB-native backup methods for durable restores; if using snapshots, coordinate with the DB’s supported backup mode and validate in a sandbox.

4) Restore is “slow for no reason”

Symptom: Hours of restore time, CPU pegged, disks underutilized.
Root cause: Single-threaded decompression/encryption or too-high compression level; millions of tiny files amplifying metadata operations.
Fix: Benchmark decompression, consider lower compression or parallel tools, and restructure backups (e.g., per-volume archives) to reduce metadata thrash.

5) You can’t pull images during restore

Symptom: Registry auth fails, DNS fails, or images are gone.
Root cause: Credentials stored only on old host; registry retention garbage-collected tags you relied on; dependency on public registry rate limits.
Fix: Store registry creds in a recoverable secret manager, pin by digest or immutable tags, and keep an offline cache/export for critical images.
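
The offline cache part is genuinely simple; a sketch using the images and paths from this article’s examples:

# Export the images your restore depends on while the registry is healthy...
docker save -o /backups/myapp/images-$(date +%F).tar \
  registry.local/api:1.42.0 postgres:15 nginx:1.25

# ...and load them on a restore host that cannot reach any registry.
docker load -i /backups/myapp/images-$(date +%F).tar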

6) Compose “works on prod” but fails on restore host

Symptom: Same Compose file, different behavior: ports, DNS, networks, MTU issues.
Root cause: Hidden host configuration drift: sysctls, iptables, kernel modules, custom daemon.json, or cloud-specific networking.
Fix: Codify host provisioning (IaC), export and version daemon settings, and include a “fresh host” restore drill annually.

7) Backup is present, but keys are not

Symptom: You can see the backup file but cannot decrypt or access it during incident response.
Root cause: Encryption keys/passwords gated behind a person, a dead laptop, or a broken SSO path.
Fix: Practice key recovery during drills, store break-glass access properly, and verify the procedure with a least-privilege on-call role.

8) You restored config but not the boring dependencies

Symptom: App starts but can’t send email, can’t reach payment provider, or callbacks fail.
Root cause: Missing TLS certs, firewall rules, DNS records, webhook secrets, or outbound allowlists.
Fix: Treat external dependencies as part of “deployment state” and test them in the drill (or stub explicitly and document it).

FAQ

1) Should I back up /var/lib/docker?

Usually no. Back up volumes and any bind-mounted application directories, plus Compose config and secrets references.
Backing up the whole Docker root directory is fragile across versions, storage drivers, and host differences.

2) What’s the safest way to back up a database in Docker?

Use the database’s supported backup mechanism (logical dumps, base backups, WAL archiving, etc.), and validate by restoring into a sandbox.
Filesystem-level backups can work if coordinated correctly, but “it seemed fine once” is not a method.

3) How often should I run restore drills?

Quarterly for critical systems is a sane baseline. Monthly if the system changes constantly or if RTO/RPO are tight.
Also run a drill after major changes: storage migration, Docker upgrade, database upgrade, or backup tooling change.

4) Can I run a restore drill without duplicating production data (privacy concerns)?

Yes: use masked datasets, synthetic fixtures, or restore to an encrypted isolated environment with strict access controls.
But you still need to restore realistic structure: permissions, sizes, file counts, schema, and runtime behavior.

5) What’s the #1 thing that makes restore time explode?

Small files and metadata-heavy trees, especially when combined with encryption and compression. You can have plenty of bandwidth
and still be blocked by CPU or IOPS.

6) Should I compress backups?

Usually yes, but pick compression that matches your restore constraints. If you’re CPU-bound during restore, heavy compression
hurts RTO. Measure it with a timed extraction during drills and adjust.

7) How do I know if I restored the right thing?

Don’t trust container status. Use application-level checks: run DB queries, verify record counts, validate a known customer/account,
or run a read-only business transaction. Automate these checks in the drill.

8) Do I need to restore Redis or other caches?

Typically no—caches are rebuildable and restoring them can reintroduce bad state. But you must confirm the app can tolerate empty cache
and that cache configuration (passwords, TLS, maxmemory policies) is backed up.

9) What about secrets in environment variables?

If your production depends on an env file, that file is part of deployment state and must be recoverable. Better: migrate secrets to a
secret manager or Docker secrets-equivalent, and include break-glass retrieval in the drill.

10) Can I do this with Docker Compose and still be “enterprise-grade”?

Yes, if you treat Compose as an artifact with versioning, pinned images, tested restores, and disciplined state management.
“Enterprise-grade” is a behavior, not a tool choice.

Conclusion: next steps you can do this week

If you only do one thing, schedule a restore drill on a fresh host and time it. Not in production, not on your laptop, not “sometime.”
Put it on the calendar and invite whoever owns backups, storage, and the app. You want all the failure modes in the room.

Then do these next steps, in order:

  1. Inventory mounts for every stateful container and write down the authoritative paths and volume names.
  2. Split artifacts into data (volumes), bind mounts, and deployment config so you can restore surgically.
  3. Validate integrity of the newest backup set and prove you have the keys to decrypt it under on-call permissions.
  4. Restore into a sandbox and run app-level correctness checks, not just “container is running.”
  5. Measure RTO, identify the slowest step, and fix that one thing before you optimize anything else.

Backups you never restored are not backups. They’re compressed optimism. Run the drill, write down what broke, and make it boring.
