You have a small “production” stack on Docker Compose. It’s been fine for months. Then a routine update turns into a 2‑minute outage,
your chat fills with “is it down?”, and you’re staring at docker compose up -d like it personally betrayed you.
Compose can absolutely run real workloads. But “zero-downtime updates on Compose” is not a feature you enable—it’s a set of operational
choices you implement, test, and occasionally regret. Let’s separate the marketing myth from the engineering reality, and then build patterns
that actually hold up when customers are connected, requests are in flight, and your database is having feelings.
Myth vs reality: what Compose can and cannot do
The myth: “Compose does rolling updates”
Docker Compose, by itself, is not an orchestrator. It does not natively do rolling updates with traffic-aware draining, service discovery
across hosts, or automatic replacement in the face of failure. When you run docker compose up -d, Compose reconciles the desired
state on one host. It may recreate containers. It may stop and start them. It will not coordinate “keep old, start new, shift traffic,
then retire old” unless you build those mechanics around it.
The reality: “Compose can do zero-downtime-ish with a reverse proxy and discipline”
You can get close to zero downtime for many web workloads. The trick is to stop treating “the app container” as the thing that owns the socket.
Your stable entrypoint should be a reverse proxy (or L4 proxy) that stays up while you swap backends behind it. Then you add:
- Healthchecks that reflect readiness, not just liveness.
- Graceful shutdown (stop signals and timeouts) so in-flight requests finish.
- A deployment method that starts new instances before stopping old ones.
- Database change discipline (expand/contract, backward compatible, avoid “stop the world” migrations).
If your stack is a single container listening on port 443 on the host, you’re trying to do “zero downtime” while repeatedly ripping out the
floorboards. It’s possible. It’s also a weird hobby.
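Here is a minimal Compose sketch of that shape: a stable proxy owning the host ports, and two app versions living side by side on an internal network. Image names, paths, and ports are illustrative, not prescriptive.
services:
  proxy:
    image: nginx:1.25-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro   # config is a mounted file, so changes are a reload, not a recreate
    networks: [edge]
  app-blue:
    image: registry/app:1.9.3   # previous release stays runnable until the new one proves itself
    expose: ["8080"]            # internal only; no host port published
    stop_grace_period: 60s
    networks: [edge]
  app-green:
    image: registry/app:1.9.4   # candidate release
    expose: ["8080"]
    stop_grace_period: 60s
    networks: [edge]
networks:
  edge: {}
Healthchecks and shutdown behavior are covered below; this only shows the topology.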
One quote worth keeping in your head: “Hope is not a strategy.” It’s often attributed to reliability circles; treat it as a paraphrased idea, not a courtroom exhibit.
Interesting facts and historical context (why this is confusing)
- Compose started as “Fig” (2014). It was designed for developer workflows, not production deployments with SLOs.
- Docker introduced “Swarm mode” later with rolling updates, health-driven rescheduling, and service abstractions: features Compose itself never gained.
- depends_on never meant “wait until ready”. It’s ordering, not readiness gating. This misunderstanding has caused more “it works on my laptop” outages than most people admit.
- Healthchecks came later (Docker 1.12 era). Before that, people used “sleep 10” as a readiness strategy. It remains popular, for reasons that are mostly psychological.
- Restart policies are not orchestration. restart: always is a seatbelt, not an autopilot.
- Nginx has supported graceful reload for ages, a major reason it became the default “front door” for DIY zero-downtime deploys.
- Linux added SO_REUSEPORT (mainstream usage around 2013), enabling multiple processes to bind the same port in some designs, but it doesn’t magically fix deployment coordination for containers.
- Blue/green deployment predates containers. Ops teams did it with pairs of VMs and load balancers long before Docker made it trendy.
- Database migrations are the real downtime factory. App containers are easy; schema changes with exclusive locks are where dreams go to die.
Define “zero downtime” like an adult
“Zero downtime” is a phrase that means wildly different things depending on who is sweating. Nail down the definition before you change anything.
Here are the common versions:
- No TCP accept outage: the port never stops accepting connections. Clients may still see errors if your backend isn’t ready.
- No 5xx spike: requests keep succeeding. Some latency increase may be acceptable.
- No user-visible interruption: sessions persist, websockets survive, long polls continue. This is harder than it sounds.
- No deployment-induced error budget burn: the change might still cause issues, but your deployment process doesn’t.
For Compose-based stacks, the realistic target is usually: no 5xx spike and no outage on the public endpoint, with bounded latency
increase during the swap. If you’re serving websockets, redefine success: you can be “zero downtime” and still disconnect clients unless you build
explicit connection draining and sticky routing behavior.
Joke #1: If someone tells you they have “true zero downtime” on Compose with a single container, ask what they use for time travel.
The failure modes you actually hit in production
1) Port binding is a single point of pain
If your app container binds 0.0.0.0:443 on the host, you can’t start the new version until the old one releases the port.
That means a gap. Even if it’s 200 ms, it’s still a gap. Under load, clients back off poorly, retries cascade, and suddenly “milliseconds” becomes
“why did our checkout fail.”
2) “Container started” ≠ “service ready”
Many apps start their process quickly, then spend 5–30 seconds doing migrations, warming caches, or waiting on dependencies. If your proxy routes
traffic early, you’ll serve errors. If you block routing until ready, you’ll be fine—assuming readiness is measured correctly.
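One way to encode “ready, not just started” is a Compose healthcheck that probes a real readiness endpoint and tolerates a warm-up window. A sketch, assuming the app serves /ready on 8080 and curl exists in the image; tune the timings to your app:
services:
  app-green:
    image: registry/app:1.9.4
    healthcheck:
      # Probe a readiness endpoint that verifies dependencies, not just "the process exists".
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/ready"]
      interval: 5s
      timeout: 2s
      retries: 3
      start_period: 30s   # failures during warm-up don't count toward "unhealthy"
The proxy should only route to backends that pass this check; the healthcheck on its own changes nothing about traffic.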
3) SIGTERM handling is not optional
Docker will send SIGTERM (by default) then SIGKILL after the stop timeout. If your app ignores SIGTERM, you will drop in-flight requests.
If your stop timeout is too short, you will also drop in-flight requests. If you terminate the proxy first, you’ll drop everything.
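Two boring fixes cover most of this: make sure the app process itself receives SIGTERM (exec-form entrypoints, or exec in a wrapper script), and give it enough time to drain. A sketch, where /app/server is a stand-in for your binary:
#!/bin/sh
# entrypoint.sh (sketch): "exec" replaces the shell, so the app becomes PID 1
# and receives SIGTERM directly instead of the shell swallowing it.
exec /app/server --port 8080
Pair that with a stop_grace_period in the compose file that exceeds your worst-case drain time, so Docker waits before escalating to SIGKILL.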
4) Database migrations that lock tables
The usual Compose outage is not Docker’s fault. It’s a migration that takes an exclusive lock, or a column rewrite, or an index build without
concurrency. The app stops responding, the healthcheck fails, Compose restarts it, and now you’ve got a self-inflicted denial of service.
5) Stateful services behind “stateless” patterns
You can blue/green your web tier. You cannot casually blue/green a single-instance database by “starting another container” unless your storage,
replication, and failover story is already mature. Compose can run Postgres. Compose is not a Postgres HA solution.
6) The silent killer: connection pools and stale DNS
If you rotate backends by swapping container IPs and expect clients to “just reconnect,” you’ll discover that connection pools and DNS caching
have opinions. Some drivers cache resolved IPs longer than you expect. Some apps never reconnect unless you bounce them. This is why stable service
names (proxy frontends) matter.
Workable patterns for near-zero downtime on Compose
Pattern A: Stable reverse proxy + versioned app services (blue/green-ish)
This is the pattern I recommend most often because it matches Compose’s strengths: simple, local, deterministic. You keep a stable proxy container
bound to the host ports (80/443). Your app runs behind it on a user-defined network. During deploy, you start a new app service (green) alongside
old (blue), verify readiness, then switch the proxy upstream and gracefully retire the old.
Key characteristics:
- The proxy is the only thing binding host ports.
- Both app versions can run simultaneously on the internal network.
- Switching traffic is a config reload, not a container recreation.
- Rollback is flipping the proxy back and killing the bad version.
You can implement the proxy with Nginx, HAProxy, Traefik, Caddy. Pick one you can operate. “Operate” means: you can reload it safely, you can
read its logs, and you can explain what happens when a backend fails.
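A sketch of the cutover mechanics in Nginx terms, using the container names that appear elsewhere in this article; marking the old version as backup takes it out of normal rotation while keeping it one reload away from rollback:
upstream app_upstream {
    server prod-app-green-1:8080 max_fails=2 fail_timeout=5s;   # active version
    server prod-app-blue-1:8080 backup;                         # old version, only used if green fails
}
server {
    listen 80;
    location / {
        proxy_pass http://app_upstream;
    }
}
Cutover is: edit this file (swap which server carries backup), run nginx -t, then nginx -s reload. Rollback is the same edit in reverse. Nginx resolves these names when the config is loaded, so both containers must exist at reload time.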
Pattern B: Scale-out with --scale + drain + recreate (limited rolling)
Compose supports scaling a service to multiple replicas on one host. That alone doesn’t give rolling updates, but it gives you room to maneuver:
bring up additional replicas of the new version, route traffic to them, then remove old replicas. The limitation is that Compose doesn’t do
“update order” natively the way an orchestrator does. You’re writing the playbook.
This works best when:
- Your app is stateless or session state is externalized.
- Your proxy/load balancer can detect backend health and stop routing.
- You’re okay with manual sequencing or a small deploy script.
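A sketch of that playbook with a single app service, assuming the image tag has already been bumped in the compose file and your proxy balances across replicas (for example via Docker DNS on the service name) and drops unhealthy ones:
cr0x@server:~$ docker compose up -d --no-recreate --scale app=4 app   # old replicas keep the old image, new ones get the new tag
cr0x@server:~$ docker compose ps app                                  # wait until the new replicas report healthy
cr0x@server:~$ docker stop --time 60 prod-app-1 prod-app-2            # then retire the old replicas gracefully
The container names and replica counts are illustrative. The point is the ordering: add capacity first, verify, then remove, and never let the proxy route to a backend you are about to stop.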
Pattern C: Socket activation / host-level port ownership (advanced, sharp edges)
If you’re determined to avoid a proxy container, you can let systemd own the socket and hand it to whichever app instance is current (socket
activation). This can work. It can also become an artisanal outage generator if your app doesn’t support it cleanly or you don’t test reload
behavior under load.
For most teams running Compose, the stable proxy pattern is the sweet spot. Less clever. More reliable.
Pattern D: “Boring migrations” + app deploy (the hidden requirement)
Even perfect container swapping doesn’t help if your migration locks the database for 30 seconds. The deploy pattern must include database
discipline:
- Expand/contract schema changes (add new columns/tables first, deploy code, then remove old).
- Backwards compatible reads and writes during the transition.
- Online index builds where possible.
- Feature flags when a change can’t be instantaneous.
Compose doesn’t prevent you from doing this. Compose also doesn’t remind you. You remind you.
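In Postgres terms, the expand phase might look like the sketch below; table, column, and index names are made up, and the point is that each statement avoids long exclusive locks:
-- Expand: additive and backward compatible.
SET lock_timeout = '2s';                                  -- fail fast instead of queueing behind long transactions
ALTER TABLE orders ADD COLUMN fulfillment_status text;    -- nullable, no default: a metadata-only change
-- CONCURRENTLY builds without blocking writes; it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_fulfillment ON orders (fulfillment_status);
-- Backfill in batches from a worker, switch reads/writes behind a flag,
-- and drop the old column only in a later deploy (the contract phase).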
Joke #2: A migration with an exclusive lock is the only thing that can take down your app faster than your CEO trying to “help with deployment.”
Practical tasks: commands, outputs, and decisions (12+)
These are the checks I actually run on a Compose host when I’m trying to make “zero downtime” real. Each task includes: the command, what typical
output looks like, what it means, and what decision you make next.
Task 1: Confirm what Compose thinks is running
cr0x@server:~$ docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
prod-proxy-1 nginx:1.25-alpine "/docker-entrypoint.…" proxy 2 weeks ago Up 2 weeks 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
prod-app-blue-1 registry/app:1.9.3 "/app/start" app-blue 2 weeks ago Up 2 weeks (healthy)
prod-app-green-1 registry/app:1.9.4 "/app/start" app-green 3 minutes ago Up 3 minutes (healthy)
What it means: You have two app versions live and a stable proxy.
Decision: If green is healthy, you’re ready to shift traffic by changing proxy upstream, not by stopping blue.
Task 2: Inspect healthcheck status and timing
cr0x@server:~$ docker inspect --format '{{json .State.Health}}' prod-app-green-1
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2026-01-03T10:12:01.123Z","End":"2026-01-03T10:12:01.187Z","ExitCode":0,"Output":"ok\n"}]}
What it means: The container is reporting readiness (assuming your healthcheck is meaningful).
Decision: If health is starting or unhealthy, do not shift traffic. Fix readiness or app startup first.
Task 3: Verify the proxy sees both backends and which one is active
cr0x@server:~$ docker exec -it prod-proxy-1 nginx -T | sed -n '1,120p'
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
# configuration content:
upstream app_upstream {
server prod-app-blue-1:8080 max_fails=2 fail_timeout=5s;
server prod-app-green-1:8080 max_fails=2 fail_timeout=5s;
}
server {
listen 80;
location / {
proxy_pass http://app_upstream;
}
}
What it means: The proxy is configured to load balance across both.
Decision: If you want a strict cutover (not load balancing), configure weights or switch to a single upstream target for the cut.
Task 4: Validate Nginx reload is graceful (no dropped workers)
cr0x@server:~$ docker exec -it prod-proxy-1 nginx -s reload && docker exec -it prod-proxy-1 tail -n 5 /var/log/nginx/error.log
2026/01/03 10:15:22 [notice] 1#1: signal process started
2026/01/03 10:15:22 [notice] 1#1: reconfiguring
2026/01/03 10:15:22 [notice] 1#1: using the "epoll" event method
What it means: Reload succeeded; Nginx did not restart from scratch.
Decision: If reload errors out, do not deploy. Fix config generation and run nginx -t before reload.
Task 5: Confirm stop timeout and signal behavior (graceful shutdown)
cr0x@server:~$ docker inspect --format 'StopSignal={{.Config.StopSignal}} StopTimeout={{.Config.StopTimeout}}' prod-app-blue-1
StopSignal=SIGTERM StopTimeout=30
What it means: Docker will SIGTERM then wait 30s before SIGKILL.
Decision: If your app needs 60s to drain, set stop_grace_period: 60s. If it ignores SIGTERM, fix the app. No YAML will save you.
Task 6: Watch in-flight connections during a drain window
cr0x@server:~$ docker exec -it prod-app-blue-1 ss -Hnt state established '( sport = :8080 )' | wc -l
47
What it means: 47 established TCP connections to the app.
Decision: If the number doesn’t fall after you remove the backend from the proxy, you may have long-lived connections (websockets) and need a longer drain plan.
Task 7: Confirm which container is actually receiving traffic
cr0x@server:~$ docker logs --since=2m prod-app-green-1 | tail -n 5
10.0.2.5 - - [03/Jan/2026:10:16:01 +0000] "GET /ready HTTP/1.1" 200 2 "-" "curl/8.5.0"
10.0.2.5 - - [03/Jan/2026:10:16:04 +0000] "GET /api/orders HTTP/1.1" 200 431 "-" "Mozilla/5.0"
What it means: Green is receiving real requests.
Decision: If only health checks hit green, you haven’t shifted production traffic; change proxy routing or weights deliberately.
Task 8: Detect whether you’re accidentally recreating the proxy (the outage generator)
cr0x@server:~$ docker compose up -d --no-deps proxy
[+] Running 1/0
✔ Container prod-proxy-1 Running
What it means: Compose did not recreate the proxy container.
Decision: If this prints “Recreated,” you’re bouncing the front door. Stop. Pin proxy configuration changes and reload in-container instead.
Task 9: Compare running image digests (avoid “latest” surprises)
cr0x@server:~$ docker images --digests registry/app | head -n 5
REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE
registry/app 1.9.4 sha256:8b9a2f6d3c1e8f... 4a1f2c3d4e5f 2 days ago 156MB
registry/app 1.9.3 sha256:1c2d3e4f5a6b7c... 7b6a5c4d3e2f 2 weeks ago 155MB
What it means: You can identify exactly what’s deployed.
Decision: If you’re using :latest without digests, stop doing that. Tag your releases and keep a known rollback image.
Task 10: Check Docker events during deploy (find who restarted what)
cr0x@server:~$ docker events --since 10m --until 0m | tail -n 12
2026-01-03T10:12:10.000000000Z container start 2f1a... (name=prod-app-green-1, image=registry/app:1.9.4)
2026-01-03T10:14:02.000000000Z container health_status: healthy 2f1a... (name=prod-app-green-1)
2026-01-03T10:15:22.000000000Z container exec_start: nginx -s reload 9aa2... (name=prod-proxy-1)
What it means: Timeline of what actually happened, not what you remember doing.
Decision: If you see proxy restart/recreate events during deploy, your “zero downtime” story has a hole in it.
Task 11: Verify kernel-level resource pressure (CPU throttling and IO wait)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 51200 21000 320000 0 0 120 80 600 1200 25 10 55 10 0
3 1 0 48000 20500 318000 0 0 2200 300 900 1700 30 12 38 20 0
What it means: Second sample shows higher IO wait (wa) and blocked processes (b).
Decision: If IO wait spikes during deploy (image pull, migrations), schedule deploy off-peak, move images to local registry/cache, or fix storage performance.
Task 12: Identify which process owns the host port (catch accidental binds)
cr0x@server:~$ sudo ss -Htlpn '( sport = :443 )'
LISTEN 0 511 0.0.0.0:443 0.0.0.0:* users:(("docker-proxy",pid=2143,fd=4))
What it means: Docker proxy owns the published port. If the container is recreated, the bind will flap.
Decision: Keep a stable proxy container. Don’t publish app ports directly if you want seamless swaps.
Task 13: Confirm network attachments (is the proxy on the right network?)
cr0x@server:~$ docker inspect --format '{{range .NetworkSettings.Networks}}{{.NetworkID}} {{end}}' prod-proxy-1
a8c1f0d3e2b1c4d5e6f7a8b9c0d1e2f3
What it means: Proxy is on at least one user-defined network.
Decision: If the proxy and app aren’t on the same network, name resolution like prod-app-green-1 won’t work. Fix networks before blaming Docker.
Task 14: Check database locks during migration (the downtime smoking gun)
cr0x@server:~$ docker exec -it prod-db-1 psql -U postgres -d app -c "select pid, wait_event_type, wait_event, state, query from pg_stat_activity where state <> 'idle' order by pid;"
pid | wait_event_type | wait_event | state | query
------+-----------------+---------------+--------+----------------------------------------
2412 | Lock | relation | active | ALTER TABLE orders ADD COLUMN foo text;
2550 | Lock | transactionid | active | UPDATE orders SET foo = 'x' WHERE ...
What it means: Active sessions waiting on locks. Your “deploy outage” may be a schema lock.
Decision: Stop the migration if it’s unsafe, or redesign it (online approach, batching, concurrent indexes). Do not keep restarting app containers hoping it resolves.
Task 15: Confirm Compose didn’t silently change containers due to config drift
cr0x@server:~$ docker compose config | sed -n '1,120p'
name: prod
services:
app-blue:
image: registry/app:1.9.3
healthcheck:
test:
- CMD
- /bin/sh
- -c
- curl -fsS http://localhost:8080/ready || exit 1
interval: 5s
timeout: 2s
retries: 12
stop_grace_period: 60s
proxy:
image: nginx:1.25-alpine
ports:
- mode: ingress
target: 80
published: "80"
protocol: tcp
- mode: ingress
target: 443
published: "443"
protocol: tcp
What it means: This is the fully rendered config Compose is applying.
Decision: If config output differs from what you think you deployed, fix your env var management and pin values. “Surprise config” is downtime’s best friend.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption (“depends_on means ready”)
A mid-sized SaaS team ran their production stack on a single VM with Docker Compose. It was a deliberate choice: they wanted fewer moving parts,
and their traffic fit comfortably on one host. Sensible. The system had an API, a worker, Redis, and Postgres.
They did a routine update: new API image, minor migration, restart. The deploy playbook was a single line: docker compose up -d.
They assumed their depends_on settings meant the API would wait for Postgres. It did not.
Postgres took longer than usual to come up because the VM was also pulling images and doing filesystem writes. The API started, failed to connect,
exited, restarted, failed again. Their restart policy turned a slow start into a tight loop. Meanwhile the reverse proxy routed traffic to a
flapping backend. Users saw intermittent 502s for several minutes—long enough for support tickets and internal blame bingo.
The fix wasn’t exotic. They added a real readiness endpoint to the API, wired a healthcheck to it, and configured the proxy to only route to
healthy backends. They also added a start period so transient startup failures wouldn’t instantly mark the service unhealthy.
The lesson was uncomfortable because it was boring: Compose ordered container starts; it didn’t guarantee dependency readiness. They stopped
assuming “started” meant “ready,” and their deploys stopped being suspenseful.
Mini-story 2: The optimization that backfired (fast deploys, slow disks)
Another org ran Compose on a beefy host with NVMe—until procurement replaced it with “similar” hardware that had plenty of CPU and RAM but
mediocre storage. The team optimized deploys by always pulling fresh images right before the update. Faster releases, fewer “works on CI”
mismatches. On paper.
Under load, the deploy pulled a large image layer set, saturating IO. Postgres checkpointing slowed down. API latencies climbed. Healthchecks
started timing out. The proxy marked backends unhealthy. Suddenly “deploy faster” became “deploy causes cascading brownout.”
They responded by tightening healthcheck thresholds to “be more sensitive,” which was exactly the wrong instinct. It made the proxy eject
backends sooner, amplifying the outage. Their SLO burn rate spiked, and everyone learned the difference between “detect failure” and “cause
failure.”
The eventual fix: pre-pull images during quiet periods, cap IO impact (using host-level scheduling and sometimes just human discipline), and
make healthchecks tolerant of short spikes. They also moved the Docker data directory to faster storage and separated database IO from image
extraction as much as practical.
The optimization lesson: speed isn’t free. If you make deploys faster by moving work into the critical window, your users pay the interest.
Mini-story 3: The boring but correct practice that saved the day (two versions + easy rollback)
A regulated enterprise team—heavy process, lots of paperwork—ran a customer portal on Compose. They were not trendy. They were also rarely down.
Their deployment method looked old-school: stable proxy, two app services (“blue” and “green”), explicit health gates, and a manual traffic switch.
One Friday, the new release passed tests but had a subtle memory leak triggered by a rare request pattern. Twenty minutes after cutover, RSS
started creeping up. Latency climbed. The on-call watched it unfold with the calm of someone who has rehearsed this.
They flipped traffic back to blue by reloading the proxy config, then stopped green. The rollback took less time than it took to explain why
they were rolling back. Customers barely noticed: a small latency bump, no outage banner, no panic.
Later, in the postmortem, nobody praised the YAML. They praised the boring discipline: always keep the previous version running until the new one
proves itself, and make rollback a single reversible action, not a heroic multi-step dance.
The lesson: reliability is mostly unglamorous repetition. If your deployment requires courage, you’ve already lost.
Fast diagnosis playbook: find the bottleneck fast
When a “zero-downtime deploy” causes a blip, you don’t have time for philosophical debates. You need a quick triage sequence that narrows the
culprit to: proxy/traffic switch, app readiness, database, or host resources.
First: Is the front door stable?
- Check whether the proxy container was recreated or restarted during deploy (docker events, docker ps uptime).
- Check host port bind continuity (ss -ltnp for 80/443).
- Check proxy logs for upstream failures and reload errors.
If the proxy flapped, that’s your outage. Fix the process so the proxy stays up and only reloads config.
Second: Are backends healthy and actually ready?
- Inspect health status for new containers.
- Hit readiness endpoint from inside the proxy network.
- Confirm the proxy is routing to the intended backend set.
If healthchecks say “healthy” but the app still fails under traffic, your healthcheck is lying. Make it more representative.
Third: Is the database blocking the world?
- Check active queries and locks during migration windows.
- Look for connection errors/timeouts in app logs.
- Check database IO wait and checkpoint behavior (host metrics help).
If locks are piling up, stop redeploying. Fix migrations. Schema locks don’t respond to optimism.
Fourth: Is the host under resource pressure?
- CPU steal/throttle, IO wait, memory pressure.
- Docker image pulls and layer extraction activity during deploy.
- Disk saturation on the Docker data directory and database volume.
If the host is choking, the best deployment pattern in the world won’t save you. Reliability starts with boring capacity.
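If you want the first-pass checks in one place, a small script helps. This is a sketch that assumes the container names used throughout this article and that the app containers define healthchecks:
#!/bin/sh
# triage.sh (sketch): front door, backends, database -- in that order.
echo "== Did the front door move? =="
docker events --since 15m --until 0m --filter container=prod-proxy-1
docker ps --filter name=prod-proxy-1 --format '{{.Names}} {{.Status}}'
echo "== Are the backends ready? =="
docker inspect --format '{{.Name}} {{.State.Health.Status}}' prod-app-blue-1 prod-app-green-1
echo "== Is the database blocked? =="
docker exec prod-db-1 psql -U postgres -d app -c \
  "select pid, wait_event_type, state, left(query, 60) from pg_stat_activity where state <> 'idle';"
Host pressure still needs vmstat, iostat, or your monitoring stack; no script replaces knowing what “normal” looks like on that machine.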
Common mistakes: symptom → root cause → fix
1) Symptom: brief outage on every deploy (a few seconds of 502)
Root cause: The app container owns the host port; restarting/recreating it unbinds the port.
Fix: Put a stable proxy on host ports; route to app on an internal network. Deploy by adding new backend first.
2) Symptom: new version starts, immediately gets traffic, returns 500 for 10–30 seconds
Root cause: No readiness gating; healthcheck only tests “process exists” or is missing.
Fix: Implement a real /ready endpoint that checks dependencies; use Docker healthcheck and proxy gating based on it.
3) Symptom: deploy triggers a restart loop; logs show DB connection failures
Root cause: DB not ready, migrations running, or lock contention; restart policy amplifies the problem.
Fix: Add start periods; avoid restart loops during migrations; separate migration job from app startup; make migrations online and incremental.
4) Symptom: connections drop during deploy even though proxy stays up
Root cause: App doesn’t handle SIGTERM; stop timeout too short; long-lived connections not drained.
Fix: Implement graceful shutdown; increase stop_grace_period; configure proxy to stop routing before stopping containers.
5) Symptom: rollback takes longer than deploy and is risky
Root cause: No parallel run; deployment replaces in-place; database changes are not backward compatible.
Fix: Blue/green backend services; keep previous version alive; use expand/contract migrations and feature flags.
6) Symptom: “It worked in staging” but production deploy causes latency spikes
Root cause: Resource contention in production (IO, CPU) during image pulls/migrations; healthchecks too aggressive.
Fix: Pre-pull images; schedule heavy operations; tune healthcheck timeouts/retries; separate IO paths for DB and Docker where possible.
7) Symptom: proxy routes to dead backends after deploy
Root cause: Proxy upstream config uses static IPs or stale container names; network attachment mismatch.
Fix: Use service discovery via Docker DNS on a user-defined network; reference service names; ensure proxy is on the same network.
8) Symptom: Compose “up -d” recreates more than expected
Root cause: Config drift (env var changes, volume changes, image tag changes) triggers recreation; proxy is not pinned.
Fix: Lock down env; use docker compose config to inspect final config; avoid changing proxy container unless necessary; reload config instead.
Checklists / step-by-step plan
Checklist 1: Minimum viable “near-zero downtime” Compose stack
- Stable proxy container publishes host ports 80/443.
- App backends not published to host ports; only exposed on internal network.
- Healthchecks for readiness (not “process exists”).
- Graceful shutdown: SIGTERM handling + sufficient stop grace period.
- Database migrations separated from app startup and designed for online changes.
- Rollback plan: keep previous version runnable and routable until the new one proves itself.
Checklist 2: Step-by-step deploy (blue/green with proxy switch)
Step 1: Pre-pull the image to avoid IO spikes during the critical window.
cr0x@server:~$ docker pull registry/app:1.9.4
1.9.4: Pulling from app
Digest: sha256:8b9a2f6d3c1e8f...
Status: Downloaded newer image for registry/app:1.9.4
Decision: If pulling takes too long or spikes IO, do it earlier or fix storage.
Step 2: Start the new backend alongside the old.
cr0x@server:~$ docker compose up -d app-green
[+] Running 1/1
✔ Container prod-app-green-1  Started
Decision: If this recreates blue or the proxy, your Compose model is wrong. Stop and isolate services.
Step 3: Wait for green readiness.
cr0x@server:~$ docker inspect --format '{{.State.Health.Status}}' prod-app-green-1
healthy
Decision: If unhealthy, roll back by stopping green and investigating the logs.
Step 4: Switch proxy routing (weights or a single-target upstream), then reload.
cr0x@server:~$ docker exec -it prod-proxy-1 nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ docker exec -it prod-proxy-1 nginx -s reload
2026/01/03 10:20:12 [notice] 1#1: signal process started
Decision: If nginx -t fails, do not reload. Fix config generation first.
Step 5: Observe traffic and errors for a soak period.
cr0x@server:~$ docker exec -it prod-proxy-1 tail -n 10 /var/log/nginx/access.log
10.0.1.10 - - [03/Jan/2026:10:20:15 +0000] "GET /api/orders HTTP/1.1" 200 431 "-" "Mozilla/5.0"
10.0.1.11 - - [03/Jan/2026:10:20:16 +0000] "POST /api/pay HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
Decision: If you see upstream 502/504, check backend readiness, DB locks, and proxy timeouts.
Step 6: Drain and stop blue after confidence.
cr0x@server:~$ docker compose stop app-blue
[+] Running 1/1
✔ Container prod-app-blue-1  Stopped
Decision: If you still need instant rollback, don’t remove blue yet: keep it stopped but available, or keep it running but unrouted.
Checklist 3: Database change discipline for Compose deploys
- Never combine risky migrations with a deploy you can’t roll back.
- Prefer additive changes first (new nullable column, new table, new index concurrently if supported).
- Backfill in batches using a job/worker with rate limiting.
- Switch reads/writes to new schema via feature flag or versioned code path.
- Only then remove old columns/tables in a later deploy.
FAQ
1) Can Docker Compose do true zero-downtime deployments?
Not as a built-in orchestrator feature. You can achieve effectively zero user-visible downtime for many web apps by keeping a stable proxy and
swapping healthy backends behind it, plus graceful shutdown and sane migrations.
2) Why not just use docker compose up -d and trust it?
Because Compose reconciles desired state by recreating containers when it detects changes. If the container owns the host port, recreation equals
a port flap. It’s not malice; it’s design.
3) Does depends_on ensure my app waits for the database?
No. It enforces start order, not readiness. Use healthchecks, explicit wait logic in your app, or a deploy process that verifies dependency readiness.
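If your Compose version supports the long form of depends_on, you can at least tie start order to a healthcheck. A sketch, assuming a Postgres healthcheck; this gates startup, not deployments:
services:
  api:
    image: registry/app:1.9.4
    depends_on:
      db:
        condition: service_healthy   # wait for the db healthcheck, not just container creation
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10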
4) What’s the simplest workable pattern?
Stable reverse proxy on host ports + two app services (blue/green) on a user-defined network. Start green, verify health, reload proxy to route to
green, then drain and stop blue.
5) Should I use Traefik for this?
Traefik is fine if you already know how to operate it. It shines at dynamic configuration via container labels. But “dynamic” doesn’t mean “safe”;
you still need healthchecks, drain behavior, and rollback planning.
6) What about websockets and long-lived connections?
Plan for connection draining. Many websocket clients will disconnect on backend restart. You can reduce pain by increasing grace periods, stopping
routing before stop, and designing clients to reconnect cleanly. “Zero downtime” may still mean “some reconnects.”
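On the proxy side, websockets also need the upgrade headers and generous timeouts. A sketch in Nginx terms; the path and timeout are assumptions:
location /ws/ {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;   # pass the websocket handshake through
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;                 # long-lived connections; otherwise idle sockets get closed
}
A graceful Nginx reload keeps old workers serving existing connections, but stopping a backend container still disconnects its clients, so plan for client-side reconnects.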
7) Can I do rolling updates with --scale?
You can do a manual rolling update playbook: scale up new version, route traffic, then scale down old. Compose won’t coordinate it for you, so you
must write and test the steps.
8) What’s the biggest hidden cause of downtime during Compose deploys?
Database migrations that lock tables or saturate IO. The container swap is usually easy. The schema change is the boss fight.
9) Is it safer to run migrations at container startup?
Usually no. It couples deploy success to migration success, encourages “restart until it works,” and can cause thundering herds if multiple
replicas start. Prefer a controlled, explicit migration step.
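One way to make that explicit in Compose is a one-off service you run on demand instead of at app startup. A sketch; the migrate command and profile name are illustrative:
services:
  migrate:
    image: registry/app:1.9.4
    command: ["/app/migrate", "--apply"]   # hypothetical migration entrypoint inside the app image
    profiles: ["ops"]                      # not started by a plain "docker compose up"
    depends_on:
      db:
        condition: service_healthy
cr0x@server:~$ docker compose run --rm migrate
Run it, watch it finish, and only then roll the app forward.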
10) When should I stop using Compose for production?
When you need multi-host scheduling, self-healing across machines, automatic rolling updates with traffic management, or robust secrets/config
distribution. At that point, you want an orchestrator, not a very disciplined bash script.
Conclusion: practical next steps
Compose doesn’t give you zero downtime. It gives you a clean, readable definition of what should run on a host. The rest—traffic shifting, readiness,
draining, and schema discipline—is on you. That’s not a complaint. It’s the contract.
If you want workable near-zero downtime updates on Compose, do these next:
- Put a stable reverse proxy in front of everything and stop publishing app ports directly.
- Add a real readiness endpoint and wire it into Docker healthchecks and proxy routing behavior.
- Implement graceful shutdown: SIGTERM handling, adequate stop grace period, and drain-before-stop.
- Split migrations from app startup and adopt expand/contract schema changes.
- Write a deployment playbook you can run at 3 a.m.—and rehearse rollback until it’s boring.
Then test under load. Not in theory. Not in staging with three requests per minute. In something that resembles production, where the system
is already busy doing its job while you try to replace parts of it.