The storage migration that hurts isn’t the one that fails loudly. It’s the one that “worked” and quietly shaved off a few files, a few permissions, or a few milliseconds of durability until your database learns what entropy tastes like.
Docker volumes are supposed to be boring. Then you move them: new disk, new host, new driver, new Compose stack name, or “just a quick rebuild.” This is where data loss happens—usually because we treated a volume like a folder, or treated a folder like a volume.
The trick: migrate via a controlled container mount (not the host filesystem)
Here’s the move that avoids most volume-migration disasters: don’t copy “whatever is under /var/lib/docker” and don’t trust that a bind mount path is the same as a named volume. Instead, attach the old volume and the new target (a new volume, a bind mount, or an SSH pipe) to a one-shot utility container and copy inside that controlled environment.
Why it works: the container mount gives you a stable view of the volume contents exactly as your app sees them. You don’t accidentally traverse Docker’s internal metadata. You don’t miss files because you copied the wrong directory. And you can standardize the process with repeatable commands.
My default is a two-phase approach:
- Phase 1 (warm copy): keep the app running, do an initial sync to cut downtime.
- Phase 2 (cutover copy): stop the app (or at least quiesce it), then do a final sync and verify.
If you only take one thing from this piece: treat migration as a deployment. You need a plan, a rollback, verification, and a clean cutover.
Interesting facts and short history that explain today’s foot-guns
- Docker volumes predate Docker Compose. Volumes were the first “official” persistence mechanism; Compose later made them easy to declare, and even easier to misunderstand.
- Named volumes are managed objects, not paths. They map to paths on the host, but Docker reserves the right to choose where and how. Copying “the folder” often copies the wrong folder.
- Overlay filesystems are not where you store state. Docker’s overlay2 is fantastic for image layers and ephemeral container writes. It is not your database’s happy place.
- Volumes outlive containers by design. Container deletion does not remove named volumes unless you explicitly do so. This is a reliability feature, and an operational trap when old volumes accumulate.
- Permissions mismatches got worse with rootless Docker. Rootless engines change UID/GID mappings; migration can succeed and still break at runtime with “permission denied.”
- macOS/Windows Desktop has a virtualization layer. On Desktop, volumes live inside a VM. “Copying from the host” isn’t what you think it is, and performance characteristics differ drastically.
- Storage drivers have changed default behaviors across versions. What was aufs became overlay2 for most Linux distros; volume semantics stayed stable, but “where data is” keeps confusing people.
- Some database engines treat timestamps as part of correctness. If your migration changes mtime in weird ways (or you restore with the wrong options), certain workloads get slower or behave oddly.
- Network filesystems are still a compatibility minefield. NFS options, locking, and fsync behavior can make a “successful” migration turn into subtle corruption later.
Mental model: what a Docker volume is (and isn’t)
Named volume vs bind mount vs “stuff in the container”
Three storage categories show up in real incidents:
- Named volume: Docker-managed persistent storage. The Docker engine creates it and mounts it into your container. It typically lives under /var/lib/docker/volumes/<name>/_data on Linux, but don’t build automation that assumes this path.
- Bind mount: a host path mounted into a container. Great for dev and for some production cases where you own the path layout, backups, and permissions.
- Container writable layer: the “diff” on top of the image. It dies with the container. If you have business data there, you don’t have persistence—you have vibes.
Why copying “/var/lib/docker” is a bad hobby
Docker stores images, layers, build cache, networks, container metadata, and volumes under its data-root. Copying it while Docker is running can produce a consistent-looking directory tree that’s internally inconsistent. You’ll only discover it when containers refuse to start, or worse, start with partial state.
If you must relocate Docker’s entire data-root, you do it like a surgical procedure: stop Docker, copy, verify, update data-root, then start Docker. But that’s not “volume migration.” That’s “moving the whole engine’s guts.”
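If you do end up needing the full data-root move, the sequence can be sketched as a script you run by hand, step by step. This is a sketch under assumptions: systemd manages the daemon, /data/docker is a hypothetical empty target, and daemon.json has no other keys to preserve.

```shell
# Sketch only -- review before running. This relocates the ENTIRE engine
# data-root, not a single volume. Assumes systemd and an empty /data/docker.
relocate_docker_root() {
    new_root="/data/docker"                      # hypothetical target

    systemctl stop docker docker.socket          # stop the engine first
    rsync -aHAX --numeric-ids /var/lib/docker/ "$new_root"/

    # Point the daemon at the new root (merge by hand if daemon.json
    # already has other keys).
    printf '{\n  "data-root": "%s"\n}\n' "$new_root" > /etc/docker/daemon.json

    systemctl start docker
    docker info --format '{{.DockerRootDir}}'    # verify before touching the old tree
}
```

Only after images, containers, and volumes all validate should the old tree be archived or removed.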
The reliability principle that matters
When you migrate state, you’re moving a contract: content, permissions, ownership, extended attributes, hardlinks, symlinks, and sometimes sparse file behavior. Your tool needs to preserve the contract.
One quote that operations folks keep relearning:
Paraphrased idea — Richard Cook (systems safety): “Success hides system complexity; failures reveal it.”
Jokes are usually a bad reliability strategy, but here’s one anyway: A Docker volume migration is like moving a fish tank—you can do it fast, or you can do it twice.
Fast diagnosis playbook: find the bottleneck in minutes
When a migration is slow or risky, you don’t need a week of benchmarking. You need a fast, disciplined triage.
Check these in order:
1) Are you I/O bound on the source, the destination, or the network?
- Run iostat/vmstat on both ends.
- Watch for high await, high util, low throughput, and paging.
- If you’re copying over SSH, check CPU too; encryption can be the bottleneck.
2) Are you accidentally copying the wrong thing (or way too much)?
- Confirm the volume(s) and mountpoints via docker volume inspect and docker inspect.
- List what’s actually inside the volume using a utility container.
- Measure size with du -x from inside the mount.
3) Are you dealing with an app that must be quiesced?
- Databases, queues, and anything with write-ahead logs need a stop-the-world (or snapshot) moment.
- If you can’t stop it, your “migration” is a replication project. Don’t pretend otherwise.
4) Are permissions/ownership going to bite you after cutover?
- Check UID/GID inside the running container.
- Check whether you’re running rootless Docker or user namespaces.
- Expect SELinux/AppArmor differences between hosts.
5) Are you verifying the result with more than hope?
- Count files, compare checksums for a sample, and validate app-level health (not just container “Up”).
- Keep the old volume intact until you’ve run through a full business cycle.
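Checks 4 and 5 can be scripted so nobody skips them at 2 a.m. A minimal sketch as a function you would call manually; the container name and in-container data path are placeholders you pass in.

```shell
# Sketch of the permission pre-checks; "$1" is your container name and
# "$2" the in-container data path (both placeholders).
check_perms() {
    ctr="$1"; data="$2"

    docker exec "$ctr" id                 # UID/GID the app actually runs as
    docker exec "$ctr" ls -ldn "$data"    # numeric ownership of the data dir

    # Rootless engines remap UIDs on the host; flag it early.
    docker info --format '{{.SecurityOptions}}' | grep -q rootless \
        && echo "rootless engine: expect remapped UIDs on the host"

    getenforce 2>/dev/null || echo "SELinux tooling not present"
}
```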
Hands-on tasks: commands, outputs, and what you decide from them
These are real production tasks. Each one includes: the command, what “normal” output looks like, and what decision you make.
Use them as building blocks for your own runbooks.
Task 1: Identify whether you’re using a named volume or a bind mount
cr0x@server:~$ docker inspect -f '{{range .Mounts}}{{println .Type .Source "->" .Destination}}{{end}}' app
volume /var/lib/docker/volumes/pgdata/_data -> /var/lib/postgresql/data
bind /srv/app/config -> /etc/app
What it means: You have a named volume (pgdata) and a bind mount (/srv/app/config).
Decision: migrate pgdata via volume-to-volume copy; migrate bind mounts via normal filesystem copy with explicit ownership rules.
Task 2: Inspect volume metadata (driver, mountpoint)
cr0x@server:~$ docker volume inspect pgdata
[
{
"CreatedAt": "2026-01-20T10:22:14Z",
"Driver": "local",
"Labels": {},
"Mountpoint": "/var/lib/docker/volumes/pgdata/_data",
"Name": "pgdata",
"Options": {},
"Scope": "local"
}
]
What it means: local driver, host-local scope. It’s not a cluster volume.
Decision: plan a host-level copy (tar/rsync) or use an intermediate container; don’t expect another node to see it.
Task 3: Confirm which containers are using the volume
cr0x@server:~$ docker ps --format '{{.Names}} {{.Mounts}}' | grep pgdata
db pgdata
What it means: only db uses this volume.
Decision: you can schedule downtime for just that service; no hidden consumers.
Task 4: Measure volume size from a utility container (trust but verify)
cr0x@server:~$ docker run --rm -v pgdata:/v alpine:3.20 sh -lc 'du -sh /v && df -h /v | tail -n +2'
12.4G /v
/dev/sda2 220G 96G 113G 46% /v
What it means: about 12.4G of data, destination filesystem has headroom.
Decision: you can use a single tar stream; no need for chunking, but plan for temporary space if you stage archives.
Task 5: Check filesystem type and mount options (performance and correctness)
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/docker
/dev/sda2 ext4 rw,relatime
What it means: ext4, standard options.
Decision: rsync/tar should behave predictably. If you saw NFS/CIFS here, you’d slow down and validate locking and fsync behavior.
Task 6: Stop the writer cleanly (or you’re copying a moving target)
cr0x@server:~$ docker stop -t 60 db
db
What it means: container stopped within the timeout.
Decision: proceed with final cutover sync. If it doesn’t stop, you need to diagnose shutdown hooks before you migrate anything.
Task 7: Warm copy between two volumes on the same host (rsync)
cr0x@server:~$ docker volume create pgdata_new
pgdata_new
cr0x@server:~$ docker run --rm -i \
-v pgdata:/from:ro \
-v pgdata_new:/to \
alpine:3.20 sh -lc 'apk add --no-cache rsync >/dev/null && rsync -aHAX --numeric-ids --info=stats2 /from/ /to/'
Number of files: 14832 (reg: 12110, dir: 2711, sym: 11)
Number of created files: 14832 (reg: 12110, dir: 2711, sym: 11)
Total file size: 13,274,991,224 bytes
Total transferred file size: 13,274,991,224 bytes
Literal data: 13,274,991,224 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.210 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 13,278,112,901
Total bytes received: 412,220
sent 13,278,112,901 bytes received 412,220 bytes 34,201,351.55 bytes/sec
total size is 13,274,991,224 speedup is 1.00
What it means: rsync preserved attributes (-aHAX), used numeric IDs, created expected file counts.
Decision: safe to proceed to verification. If “created files” is surprisingly low, you likely pointed at the wrong source path or volume.
Task 8: Verify file counts match (cheap sanity check)
cr0x@server:~$ docker run --rm -v pgdata:/a -v pgdata_new:/b alpine:3.20 sh -lc 'cd /a && find . | wc -l; cd /b && find . | wc -l'
14832
14832
What it means: counts match.
Decision: proceed. If they don’t, stop. Find out what was skipped (permissions, special files, mount issues).
Task 9: Spot-check checksums for a representative sample (catch silent corruption)
cr0x@server:~$ docker run --rm -v pgdata:/a -v pgdata_new:/b alpine:3.20 sh -lc '
apk add --no-cache coreutils >/dev/null
cd /a
find . -type f | head -n 200 | while read f; do
sha256sum "$f"
done | sort -k2 > /tmp/a.sha
cd /b
find . -type f | head -n 200 | while read f; do
sha256sum "$f"
done | sort -k2 > /tmp/b.sha
diff -u /tmp/a.sha /tmp/b.sha | head
'
What it means: no output from diff means samples match.
Decision: accept migration with higher confidence. If you see mismatches, suspect underlying disk errors, bad RAM, or a copy tool that didn’t preserve content.
Task 10: Cut over a Compose service to the new volume
cr0x@server:~$ docker compose up -d
[+] Running 1/1
✔ Container db Started
What it means: service is up.
Decision: don’t celebrate yet; validate app-level health and logs. “Started” is not “correct.”
Task 11: Validate logs for permission and corruption hints immediately after cutover
cr0x@server:~$ docker logs --tail=80 db
PostgreSQL Database directory appears to contain a database; Skipping initialization
LOG: database system was shut down at 2026-02-04 09:12:30 UTC
LOG: database system is ready to accept connections
What it means: clean startup, recognizes existing data dir.
Decision: proceed to a smoke test query. If you see “permission denied” or “invalid checkpoint record,” stop and roll back to the old volume.
Task 12: Confirm container sees expected disk usage (avoid “empty volume” surprises)
cr0x@server:~$ docker exec db sh -lc 'du -sh /var/lib/postgresql/data | cat'
12.4G /var/lib/postgresql/data
What it means: the new volume is mounted and populated.
Decision: keep monitoring, then retire the old volume only after a safe window.
Task 13: Migrate a volume between hosts with a tar stream over SSH (no staging file)
cr0x@server:~$ docker run --rm -v pgdata:/from alpine:3.20 sh -lc 'cd /from && tar -cpf - . ' | ssh ops@newhost 'docker volume create pgdata && docker run --rm -v pgdata:/to alpine:3.20 sh -lc "cd /to && tar -xpf -"'
pgdata
What it means: data streamed directly; no intermediate archive.
Decision: use this when you need a straightforward, low-dependency transfer. If you need resume/partial retries, prefer rsync.
Task 14: Check Docker data-root if you’re relocating the entire engine storage
cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/var/lib/docker
What it means: engine’s data-root location.
Decision: if your real problem is “the disk is full,” you might need to move data-root or prune—don’t confuse that with a single-volume migration.
Task 15: Diagnose storage driver and sanity-check you’re not mixing concerns
cr0x@server:~$ docker info --format 'Driver={{.Driver}}'
Driver=overlay2
What it means: overlay2 is the storage driver.
Decision: keep your state in volumes; don’t attempt to “migrate a database” by copying overlay2 layer directories.
Task 16: Detect SELinux enforcement that may break post-migration access
cr0x@server:~$ getenforce
Enforcing
What it means: SELinux is enforcing policies.
Decision: if you migrate to a bind mount, you’ll likely need proper labels (e.g., :Z/:z) or relabeling. Named volumes usually avoid this issue, but host differences still matter.
The safe migration patterns (same host, new disk, new host)
Pattern A: same host, volume-to-volume migration (the boring winner)
If you can keep the Docker engine on the same host and just need to move data to a different backing store, create a new volume and copy into it. This keeps the engine metadata stable.
The copy method can be rsync (best for incremental syncs) or tar (best for simplicity).
Use rsync if:
- You want a warm copy + final cutover copy.
- You expect retries or partial progress.
- You want stats and easy diff behavior.
Use tar if:
- You want minimal dependencies (tar is everywhere).
- You are streaming over SSH and don’t want temp files.
- You can afford a single all-or-nothing pass.
A detail people skip: run your copy tool inside a container so it sees the mounts exactly as containers see them. That avoids host path confusion and reduces “works on my host” drift.
Pattern B: migrate to a bind mount (when you want explicit control)
This is what you do when you want your data at /srv/data/postgres because your backup system, monitoring, and auditors already understand that path.
It can be correct. It can also be a permission festival.
The safest route is:
- Create the destination directory.
- Set ownership and permissions to match the container’s expected UID/GID.
- Copy data with rsync preserving numeric IDs.
- Mount it with SELinux-aware options if applicable.
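Those four steps can be sketched as one function. The destination path and UID/GID 999 are assumptions (999 is common for official postgres images, but verify against your own container with docker exec ... id).

```shell
# Sketch of Pattern B prep; dest and the UID/GID are assumptions.
prepare_bind_mount() {
    dest="/srv/data/postgres"        # hypothetical destination
    app_uid=999; app_gid=999         # verify with: docker exec db id

    # Create the directory with the ownership the container expects.
    install -d -m 700 -o "$app_uid" -g "$app_gid" "$dest"

    # Copy with numeric IDs so ownership survives name mismatches.
    docker run --rm -v pgdata:/from:ro -v "$dest":/to alpine:3.20 \
        sh -lc 'apk add --no-cache rsync >/dev/null &&
                rsync -aHAX --numeric-ids /from/ /to/'

    # On SELinux hosts, mount with a label when you cut over, e.g.:
    #   -v /srv/data/postgres:/var/lib/postgresql/data:Z
}
```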
This is where rootless Docker complicates your life: the UID inside the container may not map the way you think on the host. Don’t “chmod 777” in production unless your threat model is “none.”
Pattern C: migrate between hosts (tar stream or rsync over SSH)
Between-host migration is where you choose between simplicity and resumability.
- tar over SSH is simple and fast to set up, with low moving parts.
- rsync over SSH is your friend when the transfer is large, the network is flaky, or you want to warm-sync then final-sync.
For rsync between hosts without exposing the host’s volume path, you can still do it container-to-container by mounting the volume into a utility container and running rsync out over SSH. It’s slightly more typing, much less confusion.
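A hedged sketch of that container-to-container shape: newhost, the ops user, and the staging path are all placeholders, the remote side also needs rsync installed, and a second copy from the staging directory into the destination volume still happens on the far end.

```shell
# Warm sync out of a named volume over SSH without touching Docker
# internals. All remote names are placeholders.
warm_sync_to_newhost() {
    docker run --rm \
        -v pgdata:/from:ro \
        -v "$HOME/.ssh:/root/.ssh:ro" \
        alpine:3.20 sh -lc '
            apk add --no-cache rsync openssh-client >/dev/null
            # Remote host must have rsync too; data lands in a staging dir,
            # then gets copied into the destination volume on that host.
            rsync -aHAX --numeric-ids -e ssh /from/ ops@newhost:/srv/staging/pgdata/
        '
}
```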
Pattern D: moving the entire Docker data-root (last resort, do it properly)
Sometimes the real requirement is “move Docker storage off the root disk.” That’s not a volume migration; that’s relocating the engine’s storage root. The correct sequence is:
stop Docker, copy data-root with preservation, update daemon config, start Docker, then validate images/containers/volumes.
Don’t do this to “fix” a single application volume unless you enjoy creating a bigger blast radius than necessary.
Verification that actually catches bad migrations
Storage people are paid to distrust success signals. A container being “Up” means the init process didn’t crash yet. It does not mean your data is correct.
Verification needs layers:
Layer 1: filesystem-level sanity
- Size check: expected order of magnitude, not exact bytes.
- File count check: catches “copied empty directory” errors quickly.
- Permissions/ownership check: catches UID/GID mismatches and missing xattrs.
Layer 2: content-level sampling
Full checksums on multi-terabyte volumes can be unreasonable during a maintenance window. Sample smartly:
- Hash the most recently modified files.
- Hash known-critical directories (WAL, indexes, metadata).
- Hash random samples across the tree.
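A runnable sketch of that sampling, written against any two directory trees so you can point it at the old and new volumes mounted in a utility container. GNU find/xargs are assumed; the sample size and temp file are arbitrary choices. Note that it samples by recent mtime, so mtime drift between sides will surface as a mismatch, which is usually a finding in itself.

```shell
# sample_hashes SRC DST [N]: hash the N most recently modified files on
# each side and fail unless every (hash, path) pair appears on both sides.
sample_hashes() {
    src="$1"; dst="$2"; n="${3:-200}"

    for d in "$src" "$dst"; do
        ( cd "$d" &&
          find . -type f -printf '%T@ %p\n' | sort -rn | head -n "$n" |
          cut -d' ' -f2- | xargs -r -d '\n' sha256sum | sort -k2 )
    done > /tmp/sample.all

    # Every "hash  path" line must appear exactly twice (once per side).
    sort /tmp/sample.all | uniq -c | awk '$1 != 2 {print; bad=1} END {exit bad}'
}
```

A nonzero exit means a sampled file's content or presence differs between the two trees.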
Layer 3: application-level truth
For databases, run a read query and a write query. For object stores, fetch and re-put an object. For queues, enqueue and dequeue. Validate the thing the business actually cares about.
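For a Postgres cutover, the read-plus-write check might look like this sketch; the container name db, the postgres user, and the table names are all assumptions, not a fixed recipe.

```shell
# Sketch of a layer-3 smoke test for a hypothetical Postgres container "db".
smoke_test_db() {
    # Read: a query against data that must exist after migration.
    docker exec db psql -U postgres -c 'SELECT count(*) FROM customers;' || return 1

    # Write: prove the volume accepts and persists new data.
    docker exec db psql -U postgres -c \
        'CREATE TABLE IF NOT EXISTS migration_smoke(ts timestamptz);
         INSERT INTO migration_smoke VALUES (now());
         DROP TABLE migration_smoke;'
}
```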
Rollback discipline
Keep the old volume intact and disconnected. Do not “clean up” until you’re beyond the point where silent errors would surface. If the business cycle is weekly reporting, “tomorrow” is not enough.
Second joke (and last one): Backups are like parachutes—if you need one and don’t have it, you will not need it again.
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a customer-facing API on a single Linux host with Docker Compose. The database was in a named volume.
A platform engineer decided to “simplify backups” by copying /var/lib/docker/volumes nightly to a NAS using a generic file sync tool.
The assumption: a volume is a directory, and copying directories is the same as backing up. It worked for weeks. The backup job reported success. The NAS filled steadily. Everyone moved on.
Then a disk issue forced a restore. They copied the directory back, restarted the database container, and the DB came up with a subset of tables missing recent writes.
The root cause was painfully normal. The backup had been running while the database was writing. The copy tool produced a point-in-time view that never existed: some files from “before” and some from “after,” plus a few partially-copied segments.
There was no database-level snapshot, no quiescing, and no verification restore test. The “backup” was a pile of files with plausible names.
Fixing it required rebuilding from logical exports that were thankfully still available for part of the data. The hard lesson wasn’t “don’t use Docker.”
It was: don’t conflate “file copy succeeded” with “data is consistent.”
Mini-story 2: The optimization that backfired
Another org ran CI workloads that produced lots of artifacts. They migrated build caches into a shared network filesystem and mounted it into containers to reduce local disk churn.
The pitch was great: fewer local SSD upgrades, centralized management, easier cleanup.
Performance looked fine on day one. Then a few subtle bugs appeared: builds that occasionally failed in strange ways, tests that timed out, and a rise in “file not found” errors that vanished on rerun.
Engineers blamed flaky tests. The SREs blamed the network. Everyone had a point.
The backfire came from filesystem semantics. The network share had caching and locking behaviors that didn’t match local disk assumptions.
Some tools expected atomic renames and reliable fsync; the share sometimes delayed or reordered visibility under load. It wasn’t “down.” It was “different.”
The fix was to keep truly shared artifacts in object storage (with explicit publish semantics), and keep per-job scratch on local disk.
The “optimization” didn’t fail because NFS is evil; it failed because the workload was latency-sensitive and assumed local filesystem guarantees.
Mini-story 3: The boring practice that saved the day
A financial services team had a rule: any stateful container must have a written cutover plan, a rollback plan, and a verification step that includes a real transaction.
Nobody loved the paperwork. But it made migrations repeatable.
They needed to move a Postgres volume to a new host due to a maintenance contract change. The migration plan used a warm rsync while the DB was running, then a scheduled stop, then a final rsync, then application-level checks.
The plan also included “keep old volume read-only and disconnected for seven days.”
On cutover night, everything looked correct. Logs were clean. Health checks were green.
The next morning a reporting job found one inconsistency in a derived table. Not catastrophic, but suspicious.
Because they kept the old volume, they could mount it read-only and compare a small set of files and timestamps.
They found the issue: a last-minute schema change had happened during the warm sync window, and the cutover process missed one ancillary file created by a sidecar job outside the DB container.
They re-ran a targeted sync of that directory and revalidated. No drama, no prolonged outage, no “we deleted the old data because we were confident.”
Boring won. Again.
Common mistakes: symptom → root cause → fix
1) Symptom: new container starts “fresh” as if data is missing
Root cause: you mounted an empty volume (new name) or you swapped a named volume for a bind mount path that points to an empty directory.
Fix: inspect mounts with docker inspect. Confirm the volume name in Compose. Populate the correct target, then restart.
2) Symptom: “permission denied” after migration
Root cause: UID/GID mismatch, rootless Docker mapping issues, or SELinux label mismatch on bind mounts.
Fix: use --numeric-ids with rsync; match ownership to the container’s runtime user; apply correct SELinux mount labeling for bind mounts.
3) Symptom: database starts, then crashes with corruption-like errors
Root cause: you copied while the DB was writing, producing an inconsistent on-disk state.
Fix: stop/quiesce the DB for final sync, or use DB-native backup/snapshot tooling. Do not rely on a raw file copy of a live datastore.
4) Symptom: migration is extremely slow, CPU is pegged
Root cause: SSH encryption overhead or compression set incorrectly, especially on small-core instances.
Fix: measure CPU on both ends; consider disabling compression; consider a faster cipher or moving the copy onto a private network. Or stage locally and transfer via a better channel.
5) Symptom: rsync completes, but file count differs
Root cause: excluded patterns, permission errors, special files skipped, or you copied from the wrong mount path.
Fix: rerun rsync with verbose and itemized changes, check stderr, run as root in the utility container if necessary, and validate source/target mounts.
6) Symptom: after cutover, app is “healthy” but performance collapses
Root cause: destination storage has different latency characteristics (HDD vs SSD, network storage vs local, different mount options).
Fix: run a small read/write benchmark before cutover; check filesystem and mount options; for databases, confirm fsync and barrier behavior match requirements.
7) Symptom: data looks present, but timestamps/ownership changed unexpectedly
Root cause: copy tool defaults didn’t preserve metadata (e.g., missing -a), or you tarred without preserving permissions.
Fix: use rsync with -aHAX where appropriate; use tar with -p and verify extraction behavior; validate with stat sampling.
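That stat sampling can be a small runnable helper that compares owner, mode, and mtime for the same sampled paths on both sides; it takes two directory trees, so it works equally well on mounted volumes. GNU stat is assumed.

```shell
# compare_meta SRC DST: compare owner:group, mode, and mtime (seconds) for
# the first 100 files (sorted for a stable sample). Prints one line per
# mismatch; empty output means the sampled metadata matches.
compare_meta() {
    src="$1"; dst="$2"
    ( cd "$src" && find . -type f | sort | head -n 100 ) |
    while read -r f; do
        a=$(stat -c '%u:%g %a %Y' "$src/$f")
        b=$(stat -c '%u:%g %a %Y' "$dst/$f" 2>/dev/null) || b="MISSING"
        [ "$a" = "$b" ] || echo "META DIFF $f: src=[$a] dst=[$b]"
    done
}
```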
8) Symptom: container cannot mount volume after host migration
Root cause: volume driver mismatch or missing plugin on the destination host (common with third-party drivers).
Fix: confirm volume driver and options; install the same plugin/driver; if migrating from local to plugin-backed volume, treat it as a new architecture decision, not a file copy.
Checklists / step-by-step plan
Checklist A: Pre-flight (before you touch data)
- Inventory mounts for the service: named volumes, bind mounts, tmpfs.
- Classify the workload: database/queue/object store vs “static files.” If it’s transactional, plan quiescence or native backup.
- Measure size of the volume and confirm destination free space.
- Decide copy method: rsync for incremental, tar for simplicity/streaming.
- Decide downtime window and rollback timebox.
- Write verification steps that include an app-level check.
Checklist B: Warm sync (optional but recommended)
- Create destination volume (or destination directory for bind mount).
- Run rsync from old to new while app is running.
- Record file counts and rough sizes.
- Do not delete anything yet.
Checklist C: Cutover sync (the part that prevents data loss)
- Stop the application cleanly (or pause writes via app-level mechanism).
- Run a final rsync (or tar copy) to capture the last changes.
- Verify file counts and sample checksums.
- Update Compose/service definition to point at the new volume/bind mount.
- Start the service and watch logs for the first few minutes.
- Run an app-level smoke test (read and write).
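The cutover checklist condenses into one hedged script skeleton. The service and volume names (db, pgdata, pgdata_new) are placeholders, and the Compose file is assumed to already reference the new volume before the final `up`.

```shell
# Sketch of a cutover run; each step aborts the sequence on failure
# instead of plowing ahead into a half-migrated state.
cutover() {
    set -e

    docker compose stop -t 60 db                 # quiesce the writer

    # Final sync: --delete makes the target match the source exactly.
    docker run --rm -v pgdata:/from:ro -v pgdata_new:/to alpine:3.20 \
        sh -lc 'apk add --no-cache rsync >/dev/null &&
                rsync -aHAX --numeric-ids --delete /from/ /to/'

    # Verify before switching: entry counts must match.
    docker run --rm -v pgdata:/a:ro -v pgdata_new:/b:ro alpine:3.20 \
        sh -lc 'test "$(find /a | wc -l)" = "$(find /b | wc -l)"'

    docker compose up -d db                      # Compose must now reference pgdata_new
    docker logs --tail=40 db
}
```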
Checklist D: Rollback plan (write it before you start)
- If verification fails, stop the service.
- Repoint to the old volume.
- Start service, confirm health.
- Preserve failed migrated volume for forensics; don’t “fix” by overwriting evidence.
Checklist E: Post-cutover hygiene (don’t do this too early)
- Monitor for a full business cycle (batch jobs, reports, backups).
- Only then archive or delete the old volume.
- Update runbooks so the next migration isn’t a one-off hero story.
FAQ
1) What’s the safest way to migrate a Docker named volume?
Stop the writer for the final sync, then copy from old to new using a utility container that mounts both volumes.
Prefer rsync for two-pass migrations; verify with file counts and an app-level check.
2) Can I just copy /var/lib/docker/volumes?
You can, but you usually shouldn’t. It’s easy to copy the wrong thing, and copying live data is inconsistent for databases.
If you must, stop Docker entirely and treat it as a Docker data-root relocation, not “a volume copy.”
3) Is tar better than rsync?
Tar is simpler and great for streaming. Rsync is better for incremental copies, retries, and a warm+final approach.
For avoiding data loss, the key isn’t tar vs rsync; it’s quiescing writes and verifying results.
4) How do I migrate volumes between hosts without knowing Docker’s internal paths?
Use a utility container with the volume mounted and stream tar over SSH to another utility container that extracts into a destination volume.
That keeps you out of Docker’s internals.
5) What about database containers—should I file-copy the data directory?
Only if you can guarantee a consistent on-disk state (service stopped, or filesystem snapshot with correct guarantees).
DB-native backup/replication is often safer for zero/low downtime, but it’s a bigger project than a “copy.”
6) Why did my container lose permissions after moving to a bind mount?
Because bind mounts expose host filesystem ownership/labels directly. Named volumes tend to be simpler.
Fix by matching UID/GID, using numeric ID preservation during copy, and handling SELinux labels if enforcing.
7) Can I rename a Docker volume?
Not directly. The practical “rename” is: create a new volume with the desired name, copy data into it, update references, and remove the old volume later.
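That rename-by-copy dance, sketched as a function with placeholder names:

```shell
# Sketch of a volume "rename": create, copy, then repoint and clean up later.
rename_volume() {
    old="$1"; new="$2"

    docker volume create "$new"
    docker run --rm -v "$old":/from:ro -v "$new":/to alpine:3.20 \
        sh -lc 'cd /from && tar -cpf - . | (cd /to && tar -xpf -)'

    # Update Compose/service references to "$new", verify the app, and only
    # then:  docker volume rm "$old"
}
```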
8) How do I avoid downtime entirely?
For truly stateful systems, “no downtime” usually means replication: set up a new instance, replicate/stream changes, then switch traffic.
A raw volume migration is typically a short downtime operation unless the workload is read-only.
9) What’s the biggest red flag during migration?
When the new service comes up “clean” and initializes a fresh data directory. That usually means it didn’t see the migrated data at all.
Stop immediately and confirm mounts.
10) When is it okay to delete the old volume?
After verification and after you’ve passed a full business cycle that would surface subtle issues.
Keep it longer if compliance or forensics matter; storage is cheaper than incident response.
Next steps you can do this week
If you run stateful containers in production, do these practical moves:
- Write one migration runbook using the utility-container approach (mount old + new, rsync/tar, verify, cutover, rollback). Make it the default.
- Add verification to your definition of done: file count + sampled checksums + app-level read/write.
- Label your volumes (in Compose or via naming conventions) so you know which are stateful and which are disposable caches.
- Schedule a restore test from whatever you call “backup.” The fastest way to find out it’s fake is on a Tuesday afternoon, not during an outage.
- Decide your position on bind mounts vs named volumes and document it. Waffling is how you end up with three persistence patterns and no consistent backup story.
The “volume migration trick” isn’t magic. It’s discipline: mount the data the way the app mounts it, copy with metadata preservation, stop the writer for the final pass, and verify like you don’t trust yourself. You shouldn’t.