Docker Compose Stack Migration: Move to a New Host Without Downtime Myths


You’ve got a Docker Compose stack that has been “fine for years,” and now you need to move it to a new host because the old one is dying, out of warranty, out of space, or out of everyone’s patience.
The business asks for “no downtime,” and someone says, “Just rsync the volumes and start it over there.” That sentence is how incidents are born.

Compose can absolutely be migrated cleanly. But if you want users to experience “no downtime,” you must define what that means (minutes? seconds? a few errors?) and pick a migration technique that matches your data and traffic patterns.
Your web containers are easy. Your databases are the part that sends you to therapy.

The “no downtime” myth: what you can and can’t promise

“No downtime” is not a binary. It’s a contract. You need to decide what the contract is, then build a migration that meets it.
For Docker Compose migrations, the limiting factor is almost always stateful data: databases, queues, and any writable volume that matters.
If your stack is purely stateless, you can do a clean blue/green swap with near-zero user impact.
If your stack writes to disk, you have to either:

  • Replicate state (database replication, object storage replication, dual-writing), then cut over, or
  • Freeze writes (maintenance window, read-only mode) and copy consistent data, or
  • Accept downtime and make it short and predictable.

The honest SRE answer is: if you’re copying volumes that are being written to, you do not have “no downtime.” You have “eventual corruption with a side of denial.”
The acceptable version of “no downtime” is usually “no planned maintenance window,” which can still allow a few seconds of blips during DNS or load balancer changes.

Here’s the practical definition I like: no downtime for users, but a brief, acknowledged burn of error budget.
For example: a 30-second spike of 502s at cutover is tolerable if you tell people and your clients retry.
A two-hour silent data inconsistency because you rsynced a live database volume is not.

Short joke #1: “We’ll do it with zero downtime” is often shorthand for “we haven’t looked at the database yet.”

A few facts and bits of history (so you stop believing in magic)

You don’t need to become a container archaeologist to migrate a Compose stack. But a little context helps you predict failure modes.
Here are some concrete facts that matter in real migrations:

  1. Docker volumes were designed for local persistence, not portability. Named volumes live under Docker’s data root and aren’t inherently “moveable” without copying the underlying filesystem.
  2. Docker Compose started as an external Python tool. It later became a Docker CLI plugin (Compose V2), which changed behaviors and output formats enough to confuse runbooks.
  3. Overlay networking is an orchestrator feature (Swarm mode, in Docker’s case), not a Compose feature. Compose networking is single-host bridge networking unless you bring your own network fabric.
  4. IP addresses inside Docker networks are not stable contracts. If an application depends on container IPs, it’s already a minor incident waiting for a calendar invite.
  5. Rsync of a live database directory is not a backup. Most databases need either snapshots, filesystem freeze, or native backup tooling to get a consistent copy.
  6. Healthchecks exist because “container started” is not the same as “service is ready.” Use them in cutovers or enjoy mystery 502s.
  7. DNS TTLs are advisory, not absolute. Some resolvers cache longer than you asked. Design cutovers assuming stragglers.
  8. Compose isn’t a scheduler. It won’t reschedule your failed container on another host. If you need that, you’re in Swarm/Kubernetes territory (or you’re building your own automation, which is… a hobby).
  9. File ownership and UID/GID mapping are the silent killers. Image changes and host differences break permissions on bind mounts and volumes in ways that look like “app bugs.”

One quote, because it’s still the job: Hope is not a strategy. — traditional SRE saying

Migration models that actually work

Model A: Stateless blue/green (the dream case)

If your stack has no persistent writes on the host (or all writes go to managed services), do this:
bring up the same Compose project on the new host, verify healthchecks, then switch traffic via DNS, reverse proxy, or load balancer.
Keep the old host running as a fallback until you trust the new one.

The user-visible impact can be close to zero if sessions are not pinned to a host (or you use shared session storage: Redis, database sessions, JWTs)
and your cutover method doesn’t strand half your traffic on stale endpoints.

Model B: Stateful with replication (the grown-up approach)

For PostgreSQL/MySQL/Redis, replication is the cleanest way to approach “no downtime.”
You run the new database as a replica, let it catch up, then promote it.
This works well if you can tolerate a brief write freeze at promotion time or your application supports fast failover.

Replication-based migrations shift risk from “data copy correctness” to “replication correctness and cutover choreography.”
That’s a good trade: replication tools were built for this; your rsync script was built for confidence.

Model C: Stateful with snapshot + short freeze (the realistic middle)

If replication is too heavy (legacy apps, no time, no expertise), aim for a consistent snapshot:
stop writes, take a snapshot (LVM/ZFS/btrfs or database-native backup), transfer it, start on new host, cut over.
Downtime is the “stop writes” interval plus cutover time.

Model D: “Just copy /var/lib/docker” (the trap)

This can work in narrow conditions: same Docker version, same storage driver, same filesystem semantics, no live writes, and you’re prepared to debug Docker internals at 3 a.m.
If you’re migrating because the old host is fragile, doubling down on fragile techniques is an aesthetic choice.

Pre-flight: what to inventory before you touch anything

Migration failures rarely come from the obvious. They come from the invisible coupling between your Compose file and your host:
filesystem paths, kernel settings, firewall rules, and “temporary” cron jobs that became production.

Before you build the new host, inventory these:

  • Compose version and Docker Engine version on the old host (match or intentionally upgrade).
  • Volumes: named volumes vs bind mounts; which services write; how much data.
  • Secrets/config: .env files, mounted config directories, TLS certs, API keys.
  • Ingress: reverse proxy, published ports, firewall, any NAT rules.
  • External dependencies: DNS records, allowlists, upstream webhooks, SMTP relays.
  • Observability: logs location, metrics endpoints, alerting integrations.
  • Backup/restore strategy that you can test without gambling the company’s patience.

Practical tasks with commands (and how to interpret them)

This section is deliberately hands-on: commands you can run today, outputs you should expect, and the decision you make from each.
Use them on the old host first, then repeat on the new host as validation.

Task 1: Identify Docker and Compose versions (compat risk)

cr0x@server:~$ docker version --format 'Engine={{.Server.Version}}'
Engine=26.1.4
cr0x@server:~$ docker info --format 'StorageDriver={{.Driver}}'
StorageDriver=overlay2

Meaning: You’ve got Docker Engine 26.x using overlay2. That’s good and common.
Decision: On the new host, match the major version (or test the upgrade). Also ensure overlay2 is supported (kernel + filesystem).

cr0x@server:~$ docker compose version
Docker Compose version v2.27.1

Meaning: Compose V2 plugin, not legacy docker-compose V1.
Decision: Use the same major/minor if you want less surprise in network names, project naming, and CLI output parsing.

Task 2: See what’s actually running (and under what project name)

cr0x@server:~$ docker compose ls
NAME            STATUS              CONFIG FILES
payments        running(7)          /srv/payments/compose.yaml

Meaning: Your Compose project is named payments, not whatever folder name you guessed.
Decision: Keep the project name stable across hosts (use --project-name or name: in Compose) to avoid surprises in network/volume names.
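If you’d rather pin the name in the file than rely on flags or folder names, Compose V2 supports a top-level name: key. A minimal sketch, assuming the stack described above (the service shown is illustrative):

# /srv/payments/compose.yaml — hypothetical excerpt pinning the project name
name: payments

services:
  web:
    image: nginx:1.27-alpine

With the name pinned, the volume and network prefixes (payments_dbdata, payments_default) stay identical on the new host even if the directory is called something else.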

Task 3: Enumerate services, images, and ports (ingress impact)

cr0x@server:~$ docker compose -p payments ps
NAME                 IMAGE                        COMMAND                  SERVICE   STATUS    PORTS
payments-web-1       nginx:1.27-alpine            "/docker-entrypoint.…"   web       running   0.0.0.0:443->443/tcp
payments-api-1       ghcr.io/acme/api:3.14.2      "/app/start"             api       running   0.0.0.0:8080->8080/tcp
payments-db-1        postgres:15                  "docker-entrypoint.s…"   db        running   5432/tcp

Meaning: Only web and api publish ports to the host. The database is internal-only (good).
Decision: Your cutover surface is 443 and 8080, not 5432. That changes firewall and load balancer planning.

Task 4: List volumes and identify what’s stateful

cr0x@server:~$ docker volume ls
DRIVER    VOLUME NAME
local     payments_dbdata
local     payments_redisdata
local     payments_prometheus

Meaning: Named volumes exist. Expect them under Docker’s data root unless you used driver_opts.
Decision: Treat dbdata and redisdata as stateful and require consistency guarantees.

Task 5: Determine where Docker stores data (so you don’t copy the wrong disk)

cr0x@server:~$ docker info --format 'DockerRootDir={{.DockerRootDir}}'
DockerRootDir=/var/lib/docker

Meaning: Default Docker root directory.
Decision: If you’re planning filesystem snapshots, this path must be on a snapshot-capable filesystem, and you must know its backing device/pool.

Task 6: Inspect mounts used by a container (named volume vs bind mount truth)

cr0x@server:~$ docker inspect payments-db-1 --format '{{json .Mounts}}'
[{"Type":"volume","Name":"payments_dbdata","Source":"/var/lib/docker/volumes/payments_dbdata/_data","Destination":"/var/lib/postgresql/data","Driver":"local","Mode":"z","RW":true,"Propagation":""}]

Meaning: The database is writing to a named volume under /var/lib/docker/volumes.
Decision: If you copy it, you need a consistent method (snapshot or database-native backup), not a casual file copy while it’s running.

Task 7: Check volume size and growth (transfer time and risk)

cr0x@server:~$ sudo du -sh /var/lib/docker/volumes/payments_dbdata/_data
58G	/var/lib/docker/volumes/payments_dbdata/_data

Meaning: You’re moving ~58 GB of database data, plus WAL, plus overhead.
Decision: Plan transfer time based on real throughput. If you can’t replicate and you can’t afford a long copy window, you need a different model.
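Rough math, assuming the network is the bottleneck: 58 GB at an effective 100 MB/s (a healthy 1 Gbps path) is roughly ten minutes of raw copy; at 12 MB/s over a congested VPN it’s well over an hour, before checksums, retries, or a final delta pass.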

Task 8: Find hidden bind mounts referenced in Compose

cr0x@server:~$ grep -nE '^\s*-\s*/' /srv/payments/compose.yaml
42:      - /srv/payments/nginx/conf.d:/etc/nginx/conf.d:ro
43:      - /srv/payments/tls:/etc/nginx/tls:ro
77:      - /srv/payments/uploads:/var/lib/app/uploads

Meaning: You’re using bind mounts for config, TLS, and uploads. Uploads are stateful and not inside Docker volumes.
Decision: Add those directories to your migration plan, including ownership and permissions.

Task 9: Validate image availability (prevent cutover-day “pull denied”)

cr0x@server:~$ docker compose -p payments config --images
ghcr.io/acme/api:3.14.2
nginx:1.27-alpine
postgres:15
redis:7-alpine

Meaning: These exact image tags are what you’ll run.
Decision: On the new host, pre-pull images. If you’re using a private registry, confirm auth works before cutover.
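Pre-pulling is one command on the new host, and it surfaces registry auth failures while you can still fix them calmly. A sketch, assuming the same compose.yaml path exists there:

cr0x@server:~$ docker login ghcr.io
cr0x@server:~$ docker compose -p payments -f /srv/payments/compose.yaml pull

The login step only matters for private registries; public images pull anonymously until you hit rate limits.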

Task 10: Check container health and readiness gates (cutover correctness)

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}' | head
NAMES             STATUS
payments-web-1    Up 5 days
payments-api-1    Up 5 days (healthy)
payments-db-1     Up 5 days

Meaning: API reports healthy. Web and DB might not have healthchecks configured (common).
Decision: Add healthchecks for any service you depend on for cutover (especially reverse proxy and API). Healthchecks are not decorations.
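A minimal sketch of what a real healthcheck looks like, assuming the API image ships curl and exposes /health (endpoint and timings are assumptions to adapt):

# compose.yaml excerpt — hypothetical healthcheck for the api service
services:
  api:
    image: ghcr.io/acme/api:3.14.2
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 30s

If the image has no curl, use something the image actually ships (an app CLI, or wget in busybox-based images) rather than skipping the healthcheck.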

Task 11: Confirm open ports and what’s bound (avoid stealth conflicts)

cr0x@server:~$ sudo ss -lntp | grep -E ':(443|8080)\s'
LISTEN 0      4096        0.0.0.0:443       0.0.0.0:*    users:(("docker-proxy",pid=15422,fd=4))
LISTEN 0      4096        0.0.0.0:8080      0.0.0.0:*    users:(("docker-proxy",pid=15457,fd=4))

Meaning: Ports are published via docker-proxy. On some systems you’ll see iptables DNAT instead, but the end result is the same: those ports must be free on the new host.
Decision: Reserve those ports. If something else (like an old Nginx on the host) is binding them, you’ll fail at deployment time.

Task 12: Measure disk and inode headroom (migrations fail on boring math)

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  500G  410G   65G  87% /

Meaning: You’re tight on space already.
Decision: The new host must have comfortable headroom. If you’re migrating to “same size but newer,” you’re just buying time, not fixing the problem.

Task 13: Check filesystem type and mount options (snapshot and performance implications)

cr0x@server:~$ findmnt -no FSTYPE,OPTIONS /var/lib/docker
ext4	rw,relatime,errors=remount-ro

Meaning: ext4, no native snapshots.
Decision: If you want consistent filesystem snapshots without stopping services, you’ll need LVM underneath, or move Docker root onto ZFS/btrfs on the new host, or use database-native backups.

Task 14: Validate DNS and TTL behavior you actually get

cr0x@server:~$ dig +noall +answer app.example.internal
app.example.internal.  300  IN  A  10.20.30.40

Meaning: TTL is 300 seconds (5 minutes).
Decision: Plan for up to several minutes of mixed traffic unless you’re using a load balancer/VIP cutover. Lower TTL in advance if DNS is your switch.

Task 15: Test application-level write freeze readiness (if you need it)

cr0x@server:~$ curl -sS -o /dev/null -w '%{http_code}\n' https://app.example.internal/health
200

Meaning: Health endpoint is reachable.
Decision: If you need a maintenance mode, implement and test it now. A migration is a terrible time to discover your app can’t go read-only cleanly.

Storage and state: the part you can’t hand-wave

Compose migrations are usually framed as “moving containers.” That’s cute. You’re moving data.
The containers are cattle; your volumes are pets with legal standing.

Named volumes vs bind mounts: migration implications

Named volumes are inside Docker’s control plane. That’s convenient but makes migrations opaque:
you must locate the volume data path, copy it safely, and preserve ownership.
Bind mounts are explicit: you can see the path in Compose, you can back it up with standard tooling, and you can apply filesystem-level practices.
The drawback: bind mounts are host-coupled. Your directory layout becomes an API contract.

Consistency rules (a blunt version)

  • If it’s a database: use replication or native backup tooling; treat filesystem copies as suspicious unless you have snapshots or the DB is shut down cleanly.
  • If it’s an object store directory (uploads): rsync is fine, but you must handle concurrent writes (two-phase sync: initial copy, then final sync after write freeze).
  • If it’s Redis: decide whether it’s a cache (rebuild) or a datastore (replicate or snapshot).
  • If it’s Prometheus: you can copy, but expect WAL churn. Snapshotting works; live rsync is shaky unless you stop it.

Two reliable patterns for moving stateful data

Pattern 1: Database-native backup + restore (predictable, usually fastest)

For PostgreSQL: use pg_basebackup (replication) or pg_dump/pg_restore (logical backup).
For MySQL: use replication or a consistent dump with appropriate flags.
The key advantage: you’re moving data in a format the database understands, not files the database is actively changing.
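As an illustration of the replication route for PostgreSQL, a hedged sketch: seed the new host with pg_basebackup against the old primary, then let it stream until cutover. It assumes a replication-enabled user named replicator, network reachability on 5432, and a hostname that is mine, not yours:

cr0x@newhost:~$ pg_basebackup -h old-db.example.internal -U replicator \
    -D /srv/payments/pgdata -X stream -R -P

The -R flag writes the standby configuration so the copy starts as a replica; at cutover you freeze writes, wait for lag to reach zero, and promote (pg_ctl promote, or SELECT pg_promote() on modern versions).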

Pattern 2: Snapshot + send (fast for big datasets)

If your Docker root or volume directories live on ZFS or LVM, snapshots let you take a point-in-time consistent copy at the filesystem layer.
Then you transfer the snapshot to the new host and mount it as the volume source.
This is extremely fast for large datasets and reduces downtime. It also requires that you planned your storage layout like an adult.
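A sketch of the snapshot-and-send flow on ZFS, assuming the volume data lives on a dataset named tank/docker/payments_dbdata (your pool and dataset names will differ):

cr0x@oldhost:~$ sudo zfs snapshot tank/docker/payments_dbdata@premigrate
cr0x@oldhost:~$ sudo zfs send tank/docker/payments_dbdata@premigrate | \
    ssh newhost sudo zfs receive tank/docker/payments_dbdata

After you freeze writes, take a second snapshot and send only the increment (zfs send -i), which shrinks the downtime window to the size of the delta.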

Short joke #2: Storage is the only place where “it worked on my machine” means “your machine is now my problem.”

Networking and cutover: DNS, VIPs, and reverse proxies

Most Compose stacks have one of these ingress shapes:

  • Ports published directly (e.g., 443 on the host) and clients connect to the host IP/DNS.
  • A reverse proxy container (nginx/Traefik/Caddy) publishing 80/443, routing to internal services.
  • An external load balancer in front, forwarding to the host(s).

For migrations, the best cutover lever is the one with the fastest rollback:
a load balancer pool change, a VIP move, or a DNS update you can revert quickly.
Publishing ports directly to a single host with hard-coded DNS is easy until it’s time to move.

DNS cutover: fine, but plan for stragglers

DNS cutover is common because it’s accessible. It’s also messy:
caches ignore TTLs, clients reuse TCP connections, and some software “pins” to IPs until restart.
If you do DNS, lower TTLs at least a day ahead (not five minutes ahead).
Then keep the old service running until you’re confident you’ve drained.

VIP/keepalived cutover: clean if you can run it

A virtual IP moved between hosts can be near-instant and rollback-friendly.
But it requires network support and the willingness to run VRRP/keepalived correctly.
In corporate networks with strict change control, this can be harder than it should be.
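For reference, a minimal keepalived sketch of the VRRP side; the interface name, virtual_router_id, priority, and address are assumptions to adapt, and the standby host runs the same block with state BACKUP and a lower priority:

# /etc/keepalived/keepalived.conf — hypothetical minimal VRRP instance
vrrp_instance payments_vip {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    virtual_ipaddress {
        10.20.30.40/24
    }
}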

Reverse proxy cutover: put the switch where it belongs

If you already have a reverse proxy, consider making it external to the app host(s).
A small dedicated proxy tier can route to either old or new backend based on config changes.
That’s effectively blue/green without needing Compose to be a scheduler.
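In nginx terms, the switch on that proxy tier can be one upstream edit plus a reload. A hedged sketch, with illustrative backend hostnames and certificate paths mirroring the TLS bind mount from earlier:

# nginx.conf excerpt on the proxy tier — hypothetical blue/green upstream
upstream payments_backend {
    server old-app.example.internal:8080;    # comment out at cutover
    # server new-app.example.internal:8080;  # uncomment at cutover
}

server {
    listen 443 ssl;
    server_name app.example.internal;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;
    ssl_certificate_key /etc/nginx/tls/privkey.pem;
    location / {
        proxy_pass http://payments_backend;
    }
}

nginx -s reload applies the change gracefully, and rolling back is the same edit in reverse.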

Fast diagnosis playbook

When a migration “works” but performance tanks or errors spike, you need a short path to the bottleneck.
Don’t debate architecture in the incident channel. Check the basics in order.

First: is traffic reaching the right place?

  • Verify DNS resolution from multiple networks (corp network, VPN, and from the new host itself).
  • Confirm the new host is actually receiving connections on expected ports (ss, firewall counters).
  • Check the reverse proxy upstream targets and health status.

Second: is the app healthy or just “running”?

  • Check container health status, not just uptime.
  • Tail logs at the edge (proxy) and core (API) at the same time; correlate timestamps.
  • Look for connection pool exhaustion, timeouts, and “permission denied” on mounts.

Third: is storage the bottleneck?

  • Check disk latency and saturation (iostat, nvme smart stats, dmesg for resets).
  • Confirm the database is on the expected disk/pool and not on a slow boot volume.
  • Validate filesystem options and free space; near-full disks behave badly.

Fourth: is the network path different than you think?

  • Check MTU mismatches (especially across VPNs and VLAN boundaries).
  • Verify firewall rules and conntrack limits on the new host.
  • Look at SYN retransmits and TCP resets (ss stats, proxy logs).

Common mistakes: symptom → root cause → fix

1) API returns 502/504 right after cutover

Symptom: Reverse proxy is up, but upstream calls time out.

Root cause: Healthchecks were missing or too weak; proxy started routing before API warmed up or migrations finished.

Fix: Add real healthchecks (DB connectivity, not just process alive). Gate proxy routing on health. Consider startup ordering with depends_on plus health conditions.
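A sketch of that gate in Compose terms, assuming the database service gets a healthcheck too (pg_isready ships in the official postgres image):

# compose.yaml excerpt — hypothetical health-gated startup ordering
services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    image: ghcr.io/acme/api:3.14.2
    depends_on:
      db:
        condition: service_healthy

This only gates startup; the proxy still needs its own upstream health logic for steady-state routing.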

2) Database starts, then crashes with corruption errors

Symptom: Postgres complains about invalid checkpoints or WAL segments; MySQL reports InnoDB errors.

Root cause: Filesystem-level copy of a live database directory (rsync while running) produced an inconsistent dataset.

Fix: Restore from a consistent backup or snapshot. Re-migrate using replication or database-native backup tooling. Stop the DB if you must do file copies.

3) Everything is “up,” but writes fail with permission denied

Symptom: Logs show EACCES on mounted paths; uploads fail; database can’t write.

Root cause: UID/GID mismatch between old and new host, or bind mount directory ownership changed.

Fix: Align ownership to the container user, not your admin account. Confirm with stat and container runtime user IDs. Use explicit user: in Compose if appropriate.
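A quick way to see the mismatch and fix it, shown here for the uploads bind mount; the UID in the output is illustrative, so read the real one from your container rather than trusting mine:

cr0x@server:~$ docker exec payments-api-1 id
uid=10001(app) gid=10001(app) groups=10001(app)
cr0x@server:~$ stat -c '%u:%g %a %n' /srv/payments/uploads
1000:1000 755 /srv/payments/uploads
cr0x@server:~$ sudo chown -R 10001:10001 /srv/payments/uploads

The same check applies to named volume paths under /var/lib/docker/volumes if you copied them with tooling that reset ownership.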

4) Performance is half of what it used to be

Symptom: Latency spikes after migration; CPU is fine; database feels slow.

Root cause: Storage moved from SSD to slower disk (or misconfigured RAID), filesystem options differ, or I/O scheduler differences.

Fix: Benchmark disk on the new host. Put the DB volume on the right device. Fix mount options. Verify no accidental encryption/compression overhead you didn’t plan.
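A hedged baseline with fio, pointed at the filesystem that will hold the database volume (path, size, and block size are assumptions; run it before real data lands there, and run the same job on the old host for comparison):

cr0x@newhost:~$ sudo fio --name=dblatency --directory=/var/lib/docker/volumes \
    --rw=randrw --bs=8k --size=2G --ioengine=libaio --direct=1 \
    --iodepth=16 --numjobs=1 --runtime=60 --time_based --group_reporting

If the latency percentiles differ wildly between hosts, you’ve found the culprit before touching the application.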

5) Clients intermittently hit old host for hours

Symptom: Mixed logs, mixed behavior; some users see old version.

Root cause: DNS caching beyond TTL, hard-coded resolvers, long-lived connections, or pinned IPs in clients.

Fix: Use a load balancer/VIP when possible. If DNS-only, lower TTL well in advance and keep old host serving until the tail drains.

6) Compose up fails: “port is already allocated”

Symptom: New host refuses to start web/proxy container.

Root cause: Another process is using the port (often a host-level nginx, apache, or leftover container).

Fix: Identify the listener with ss -lntp. Stop/disable the conflicting service, or adjust published ports and upstream routing.

7) Unexpected data loss in uploads or files

Symptom: New host is missing recent uploads.

Root cause: One-pass copy; writes continued during transfer; no final sync.

Fix: Two-phase rsync: initial sync while live, then freeze writes and do a final sync, then cut over.
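A sketch of the two-phase sync for the uploads bind mount, assuming SSH access between hosts and the same path on both sides:

# Phase 1: bulk copy while the app is still serving writes (repeatable)
cr0x@oldhost:~$ rsync -aHAX /srv/payments/uploads/ newhost:/srv/payments/uploads/

# Phase 2: enable maintenance mode / freeze writes, then one final delta pass
cr0x@oldhost:~$ rsync -aHAX --delete /srv/payments/uploads/ newhost:/srv/payments/uploads/

--delete on the final pass stops the destination from keeping files you removed on the source; use it only once writes are frozen.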

Checklists / step-by-step plans

Plan 1: Near-zero downtime for mostly stateless stacks (blue/green + DNS/LB)

  1. Match runtime: Install Docker Engine and Compose plugin on the new host. Keep versions close to reduce surprises.
  2. Provision storage: Create directories for bind mounts; plan volume locations. Ensure enough disk and IOPS.
  3. Pre-pull images: Pull all required images on the new host to avoid cutover-day registry outages.
  4. Deploy in parallel: Bring up the stack on the new host on alternate ports or behind a separate LB target group.
  5. Validate: Health endpoints, database connectivity, background jobs, and any scheduled tasks.
  6. Shadow traffic (optional): Mirror read-only traffic or run synthetic checks against the new stack.
  7. Cutover: Switch LB targets or DNS record to new host.
  8. Monitor: Watch error rates, latency, and logs. Keep old host ready for rollback.
  9. Rollback plan: If error budget burns, flip back quickly. Don’t “debug in production” unless you’re out of options.
  10. Decommission later: After a stable window, shut down old host and archive configs/backups.

Plan 2: Short downtime with consistent snapshots (freeze + snapshot + restore)

  1. Prepare new host: Same images, same config, same secrets, same directory structure for bind mounts.
  2. Lower TTL: If using DNS cutover, reduce TTL at least 24 hours ahead.
  3. Initial sync: For uploads and other file trees, do an rsync while the app is live.
  4. Freeze writes: Put app in maintenance mode or stop write-heavy services. Confirm no writes are occurring.
  5. Take snapshot/backup: Use DB-native backup or filesystem snapshot if supported.
  6. Final sync: Rsync again to capture last changes.
  7. Restore on new host: Import database, place files, verify ownership.
  8. Start stack: Bring up Compose and wait for healthchecks to pass.
  9. Cutover traffic: DNS or LB switch. Keep old host stopped or read-only to avoid split-brain writes.
  10. Unfreeze: Exit maintenance mode and monitor.

Plan 3: Stateful “no downtime-ish” with replication (database-first migration)

  1. Stand up new DB: Configure as replica of old DB. Verify replication lag stays low under normal load.
  2. Deploy app on new host: Point it at new DB replica for read-only verification if possible, or keep it idle.
  3. Plan promotion: Decide the exact cutover minute and what you’ll do with writes (brief freeze, or app-level failover).
  4. Cutover: Freeze writes briefly, let replica catch up, promote new DB to primary.
  5. Flip app traffic: Switch LB/DNS to new host and new DB endpoint.
  6. Keep old DB: Leave it as a replica (if supported) for rollback window, but don’t allow writes to it.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption (rsync as “backup”)

A mid-size SaaS company decided to move a Compose stack from a “temporary” VM to a new host with more CPU.
The stack was typical: Nginx, API, PostgreSQL, Redis, and a worker. The team had done stateless cutovers before, so the migration plan felt familiar:
stand up the new host, rsync /var/lib/docker/volumes, start Compose, flip DNS.

They ran rsync while everything was still serving traffic. It took a while. They ran it again “to catch up,” felt proud of the delta size, and cut over.
The new stack came up and looked fine. Users logged in. Requests succeeded. The incident didn’t start with alarms; it started with support tickets.

The first symptom was subtle: a small percentage of users saw stale data after updates. Then a few background jobs failed with unique constraint violations.
Eventually Postgres started emitting WAL-related complaints, not immediately fatal but increasingly unhappy. The team tried to restart the database container. That’s when it cratered.

The wrong assumption was that rsync of the Postgres data directory is “close enough” if you do it twice.
It isn’t. Postgres expects a consistent checkpoint and WAL sequence; copying files mid-write can create a dataset that starts but contains landmines.
They ended up restoring from an older logical backup, then manually repairing the gap with application logs and a pile of careful SQL.

The lesson that stuck: if your migration plan doesn’t include an explicit consistency mechanism for the database, it’s not a plan.
It’s optimism with a command line.

Mini-story 2: The optimization that backfired (faster storage… on paper)

A large enterprise team migrated a Compose-hosted internal platform to new hardware. The “optimization” was to consolidate storage:
put Docker root, database volumes, and application uploads all on a single high-capacity RAID array.
The rationale sounded great in a meeting: fewer mount points, simpler backups, and “RAID is fast.”

After cutover, everything worked, but latency doubled during peak hours. The API wasn’t CPU-bound. The network was fine.
The database had periodic stalls. Workers took longer to complete jobs. Engineers started blaming the new host’s kernel, Docker version, and even “container overhead.”

It turned out the RAID controller was configured for capacity-first behavior with a write cache policy that was safe but slow for their workload.
On the old host, the database lived on SSDs with low latency. On the new host, it shared spindles with large sequential writes from uploads and backup jobs.
The consolidated layout created I/O contention that didn’t show up in synthetic benchmarks but showed up immediately in production.

The fix was boring: separate the database onto low-latency storage and keep bulk file storage elsewhere, with explicit bandwidth limits for background copying.
They also added simple I/O monitoring that would have made the root cause obvious on day one.
The “optimization” wasn’t malicious; it was just a failure to respect that databases don’t negotiate with slow disks.

Mini-story 3: The boring but correct practice that saved the day (rehearsal + rollback)

Another team, smaller and less glamorous, had to migrate a Compose stack that handled payroll approvals.
The system wasn’t high-traffic, but it was high-stakes. They couldn’t afford data loss, and they couldn’t keep everyone locked out for long.
No one was excited. That’s usually a good sign.

They did a rehearsal. Not a theoretical one—an actual dry run on a staging clone with a realistic data snapshot.
They documented every command they ran, including the ones that felt obvious. They verified restore procedures, not just backups.
Then they did something even less exciting: they wrote a rollback plan that included DNS reversion, container stop/start order, and a clear “stop the world” point.

On migration day, the new host came up, but one service failed due to a missing CA certificate bundle that had been quietly present on the old host.
Because they had rehearsed, the failure looked familiar. They fixed the package, re-ran the start sequence, and proceeded.
The downtime was short, predictable, and explainable. Users complained a little, then forgot about it.

The day was “successful” because they treated it like a change with sharp edges: rehearsal, validation, rollback.
Nothing heroic. Nothing clever. Just competence.

FAQ

1) Can I migrate a Compose stack with literally zero downtime?

If the stack is stateless, close to yes. If it has state, “literally zero” usually requires replication or an externalized data layer.
Most teams can achieve “no maintenance window” with a small blip, but not zero in the mathematical sense.

2) Is it safe to copy Docker named volumes with rsync?

For static data or data you can quiesce, yes. For live databases, no. If you must copy, stop the service or use snapshots, or use database-native backup/replication.

3) Should I copy /var/lib/docker to the new host?

Avoid it unless you have a strong reason and a test environment. It couples you to storage driver details, Docker internals, and version compatibility.
Prefer migrating application data and redeploying containers cleanly.

4) What’s the safest cutover mechanism: DNS, load balancer, or VIP?

Load balancer target changes and VIP moves are typically the fastest to roll back and least dependent on client behavior.
DNS works, but you must plan for caching and long-lived connections.

5) How do I handle TLS certificates during migration?

Treat them as first-class state. Inventory where they live (bind mounts, secrets files).
Copy them securely, verify file permissions, and validate the full chain on the new host before cutover.

6) Do I need to keep the same container IPs or network names?

You should not depend on container IPs at all. Use service names on the Compose network.
Network names matter only if external systems reference them (rare). Most of the time, stable project naming is sufficient.

7) How do I migrate uploads or other mutable files?

Do a two-phase sync: copy while live, then freeze writes briefly and do a final rsync.
If you can, move uploads to object storage and stop caring about host files forever.

8) What about background workers and scheduled jobs during cutover?

Pause them or ensure idempotency. During cutover, duplicate workers can double-process jobs, and that’s how finance systems invent new forms of excitement.
If you can’t pause, design deduplication and job locks.

9) Should I upgrade Docker/Compose during the migration?

Prefer not to combine a host migration with a runtime upgrade unless you have time to test.
If you must upgrade, do it intentionally with a rehearsal and validation, not as a side effect of “new server build.”

Conclusion: practical next steps

If you take one thing from this: Compose migrations fail when teams treat state like an implementation detail.
Define what “no downtime” means in business terms, then pick a migration model that matches your data reality.
Replicate when you can. Snapshot when you must. Freeze writes when you’re out of better options. Don’t rsync live databases and call it engineering.

Next steps that pay off immediately:

  • Run the inventory tasks above on the old host and write down what’s stateful.
  • Choose a cutover mechanism with a rollback lever you trust (LB/VIP beats DNS-only).
  • Rehearse the migration on a staging clone, including restore and rollback.
  • Add healthchecks and readiness gating so “up” means “ready.”
  • Make storage placement explicit on the new host; databases don’t belong on mystery disks.

Migrations aren’t glamorous. They’re operational truth serum. Do it like you want to sleep afterward.
