You shipped a Compose change, the healthchecks started screaming, and now your on-call brain is doing
that thing where it can’t remember if docker compose down deletes volumes by default (it doesn’t—unless you tell it to).
Meanwhile, the business wants a timeline, the dashboard is red, and your deploy notes say “small refactor.”
This is the rollback guide you want when you’re tired, impatient, and trying to get production stable
before Slack decides you’re the new incident commander by default.
The principle: rollback is a change, not a prayer
Docker Compose is a “desired state” tool for a single host (or a small set of hosts if you’re disciplined).
It’s not a full orchestration system with built-in rollbacks, replica sets, or deployment history.
Compose will happily replace containers, keep volumes, reuse networks, and move on with its life. That’s great—until you need to go backwards quickly.
Rolling back on Compose boils down to three things:
- Containers: get the previous known-good image running again.
- Config: restore the previous Compose file (and any env files) that produced that known-good state.
- State: decide what to do about volumes and schema/data changes.
If you only handle the containers, you’ll “rollback” and still be broken because the DB migrations already ran,
or because your reverse proxy config changed, or because the app reads a new env var that no longer exists.
Here’s the operational truth: the fastest rollback is usually the one that avoids touching state.
If your deploy included a DB migration that isn’t backward compatible, rolling back the app image might not restore functionality.
Then your “rollback” is a partial rollback. Those are the ones that ruin evenings.
One idea worth keeping in your head during incidents comes from John Allspaw (paraphrased): reliability is about how systems behave under stress, not how they look on diagrams.
Rollbacks are stress tests you didn’t schedule.
Interesting facts and context (why Compose rollbacks are weird)
- Compose started life as “Fig” (2013): it was built to make multi-container dev environments sane, not to be your production deployment history.
- “docker-compose” (Python) vs “docker compose” (CLI plugin): the modern plugin integrates better with Docker contexts and is where new features land first.
- Compose has no built-in “undo”: unlike some orchestrators, there’s no native rollback controller. Your “history” is your Git repo and your image registry.
- Image tags are not immutable unless you force them: tags like latest can be moved, overwritten, or re-pushed. Digests don’t lie.
- Container IDs are disposable; volumes are not: Compose’s default behavior keeps named volumes around, which is good for persistence and terrifying for rollback surprises.
- Compose uses project naming for scoping: the project name controls container/network/volume names. A project name change can “orphan” the previous stack.
- Healthchecks are recent compared to “it runs on my laptop”: plenty of stacks still “start” even if they are functionally dead. Healthchecks make rollbacks faster because they give you a stop/go signal.
- Registry pull behavior depends on what’s local: if the old image exists on disk, rollback can be instant; if you need to pull from a slow registry, you just found your bottleneck.
The takeaway: Compose can absolutely run production systems, but it doesn’t carry your hand through rollback safety.
You have to build the muscle.
The fastest rollback paths (pick one)
Path A: revert to the previous Git commit and redeploy
This is the cleanest when your failure is caused by Compose config, env vars, entrypoints, resource limits,
or a bad image tag introduced in the Compose file.
It’s also the best for auditability: your rollback is a commit, not a series of panicked terminal commands.
If you’re deploying from a repo on the host, rolling back is usually a git checkout and a docker compose up -d.
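A minimal sketch, assuming the host deploys from a Git checkout in /srv/payments and the bad deploy is the latest commit (the hash here is illustrative, reused from the tasks below):
cr0x@server:~$ cd /srv/payments
cr0x@server:~$ git log --oneline -3
cr0x@server:~$ git revert --no-edit a12b9f3
cr0x@server:~$ docker compose up -d
Using git revert keeps the rollback in history as its own commit, which is exactly the audit trail you’ll want afterwards.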
Path B: pin to a known-good image digest and redeploy
If the Compose file is “fine” but the image tag points to a bad build, do not waste time negotiating with tags.
Pin the exact digest of the last known good image. Digests are the closest thing to truth you’ll get at 3 AM.
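What that looks like in the Compose file, a sketch reusing the illustrative registry and digest from the tasks below:
services:
  api:
    image: registry.local/payments@sha256:bb22cc33dd44ee55ff66...
A digest reference is content-addressed, so nobody can quietly re-push it; docker compose up -d --no-deps api then recreates just that service on the pinned bits.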
Path C: stop the bleeding—run the previous container image manually
This is the “break glass” path. It’s faster than fixing the Compose file when your deploy tooling is broken,
or when you need service back now and you’ll clean it up after.
You trade elegance for speed. You also create drift, so take notes and schedule cleanup. Your future self is not your employee.
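A break-glass sketch, assuming default Compose network naming (payments_default) and an env file at /srv/payments/.env; both are assumptions to adjust:
cr0x@server:~$ docker stop payments-api-1
cr0x@server:~$ docker run -d --name payments-api-breakglass \
    --network payments_default --network-alias api \
    --env-file /srv/payments/.env \
    registry.local/payments:1.9.1
The --network-alias keeps the service name resolvable, so a reverse proxy on the same network can still find “api” while you regroup. Write down every command; this container is invisible to Compose.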
Path D: restore data/state (only if you must)
This is for when the bad deploy mutated state in a way you can’t tolerate:
destructive migrations, incorrect background jobs, or a queue consumer that “helpfully” processed messages into the void.
Restoring state is slower, riskier, and requires coordination. If you can avoid it, avoid it.
Joke #1: Rollback plans are like fire extinguishers—everyone likes having them, and nobody wants to discover they’re decorative.
Fast diagnosis playbook (find the bottleneck in minutes)
The goal is not to become a forensic scientist. The goal is to answer: “Is rollback the right move, and what’s blocking it?”
Work top-down, and stop when you have a confident plan.
First: is the failure in the app, the container, or the host?
- Check container health and restart loops: if containers are restarting, you’re in crash-loop land. Rollback is often correct.
- Check logs for immediate errors: missing env vars, migrations failing, connection refused, config syntax errors.
- Check host resources: disk full, OOM kills, exhausted inodes. Rollback won’t fix a full disk.
Second: can you get the old bits back quickly?
- Is the previous image still local? If yes, rollback is minutes. If not, you need pulls and registry access.
- Do you know the last good digest? If yes, pin it. If no, you’re about to do archaeology in CI logs or registry metadata.
- Did the deploy touch state? If yes, evaluate backward compatibility and whether you need data rollback or forward fixes.
Third: decide your rollback strategy
- Config regression: revert Compose commit.
- Bad image build: pin digest.
- Host/resource issue: fix host first, then decide if rollback is still needed.
- State mutation: coordinate with whoever owns the data; app rollback alone may not work.
Treat this like triage, not debate club. If you spend 20 minutes arguing whether it’s “really the deploy,” you’ve already chosen downtime.
Practical tasks with commands, outputs, and decisions
These are the commands I actually use during incidents. Each task includes (a) the command, (b) what output means,
and (c) the decision you make.
Task 1: Confirm what Compose thinks the project is
cr0x@server:~$ docker compose ls
NAME STATUS CONFIG FILES
payments running(6) /srv/payments/compose.yaml
What it means: You have a Compose project named payments and Docker sees it as running.
If your project name changed recently, you might be looking at the wrong stack.
Decision: If the expected project isn’t listed, you’re in “wrong directory / wrong context / wrong project name” territory. Fix that before touching anything.
Task 2: Identify which services are restarting or unhealthy
cr0x@server:~$ docker compose -p payments ps
NAME IMAGE COMMAND SERVICE STATUS PORTS
payments-api-1 registry.local/payments:1.9.2 "/entrypoint.sh" api Restarting (1) 8s ago
payments-web-1 registry.local/payments-web:1.9.2 "nginx -g 'daemon off;'" web Up 3 minutes (healthy) 0.0.0.0:443->443/tcp
payments-db-1 postgres:15.6 "docker-entrypoint.s…" db Up 2 days (healthy) 5432/tcp
What it means: The API is crash-looping. Web and DB look healthy. This is a candidate for app image rollback.
Decision: If only one service is failing, roll back that service first (or pin its image) instead of tearing down the whole stack.
Task 3: Grab the last 200 lines of logs from the failing service
cr0x@server:~$ docker compose -p payments logs --no-color --tail=200 api
api-1 | ERROR: missing required env var PAYMENTS_SIGNING_KEY
api-1 | FATAL: cannot start
What it means: This is config/env regression, not a code bug. Someone added a required env var.
Decision: Roll back Compose config (env file or compose.yaml), or restore the missing secret. Rolling back the image might not help if the image expects the variable too.
Task 4: Verify what config Compose is actually using (rendered)
cr0x@server:~$ docker compose -p payments config
services:
api:
environment:
PAYMENTS_SIGNING_KEY: ""
image: registry.local/payments:1.9.2
restart: always
What it means: Compose resolved the env var to an empty string. That’s why the app refuses to start.
Decision: Fix env injection (env_file, exported env, secrets) or revert the change. Do not “hotfix” by editing containers; it won’t survive a restart.
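A quick verification loop, assuming the secret is supposed to come from /srv/payments/.env (path illustrative):
cr0x@server:~$ grep -c '^PAYMENTS_SIGNING_KEY=' /srv/payments/.env
0
cr0x@server:~$ docker compose -p payments config | grep PAYMENTS_SIGNING_KEY
      PAYMENTS_SIGNING_KEY: ""
A zero match count means the env file never got the secret; re-run the config render after fixing it to confirm the value actually lands.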
Task 5: Check whether the previous image is still local (fast rollback potential)
cr0x@server:~$ docker images --digests | grep registry.local/payments | head
registry.local/payments 1.9.2 sha256:aa11... 3d2f1c9d0b2a 45 minutes ago 312MB
registry.local/payments 1.9.1 sha256:bb22... 9a8e7d6c5b4f 2 days ago 311MB
What it means: The old image (1.9.1) is already on disk. That’s your golden path: rollback without waiting on pulls.
Decision: Prefer switching back to 1.9.1 (or its digest) immediately, then do deeper debugging once stable.
Task 6: Pin the service back to a known-good tag and redeploy only that service
cr0x@server:~$ sed -i 's|registry.local/payments:1.9.2|registry.local/payments:1.9.1|g' /srv/payments/compose.yaml
cr0x@server:~$ docker compose -p payments up -d --no-deps api
[+] Running 1/1
✔ Container payments-api-1 Started
What it means: Only the API container was recreated. Dependencies were left alone.
Decision: Use --no-deps during rollback when dependencies are healthy. Don’t churn working containers during an incident unless you enjoy compounding variables.
Task 7: Confirm the rollback “took” (image and health)
cr0x@server:~$ docker compose -p payments ps api
NAME IMAGE COMMAND SERVICE STATUS PORTS
payments-api-1 registry.local/payments:1.9.1 "/entrypoint.sh" api Up 20 seconds
What it means: The container is running on 1.9.1. If you have healthchecks, wait for healthy.
Decision: If it stays up and serves traffic, you’ve bought time. If it still fails, the problem may be config/state, not code.
Task 8: Pin by digest (stronger than tags)
cr0x@server:~$ docker inspect --format '{{index .RepoDigests 0}}' registry.local/payments:1.9.1
registry.local/payments@sha256:bb22cc33dd44ee55ff66...
What it means: You have the immutable digest for the known-good build.
Decision: Update Compose to use the digest when you need certainty, especially if tags can be re-pushed.
Task 9: Switch Compose to digest and redeploy (prevents tag drift)
cr0x@server:~$ sed -i 's|registry.local/payments:1.9.1|registry.local/payments@sha256:bb22cc33dd44ee55ff66...|g' /srv/payments/compose.yaml
cr0x@server:~$ docker compose -p payments up -d --no-deps api
[+] Running 1/1
✔ Container payments-api-1 Recreated
What it means: You redeployed the service referencing a content-addressed image.
Decision: For post-incident stabilization, pin digests for critical services. You can return to semver tags when you have better controls.
Task 10: Detect whether your rollback is blocked by host disk pressure
cr0x@server:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 118G 117G 0G 100% /
What it means: Your root filesystem is full. Pulling images, writing logs, and starting containers can fail in chaotic ways.
Decision: Free disk space before anything else. Rollbacks need disk too—especially if they require pulling an older image.
Task 11: Quick cleanup without deleting volumes (safe-ish)
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 42 18 28.6GB 11.2GB (39%)
Containers 35 9 1.3GB 1.1GB (84%)
Local Volumes 14 10 96.4GB 0B (0%)
Build Cache 0 0 0B 0B
cr0x@server:~$ docker container prune -f
Deleted Containers:
c8a1...
Total reclaimed space: 1.1GB
What it means: Containers were taking space; volumes are the big chunk but are not reclaimable via this command.
Decision: Prune containers first. Avoid volume pruning during a rollback unless you have verified backups and you like high-stakes gambling.
Task 12: Identify whether you accidentally created orphaned services
cr0x@server:~$ docker compose -p payments up -d
[+] Running 6/6
✔ Container payments-web-1 Running
✔ Container payments-api-1 Running
✔ Container payments-db-1 Running
! Found orphan containers ([payments-worker-1]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
What it means: Something was renamed/removed and an old container is still around.
Decision: During rollback, don’t remove orphans blindly. That orphan might still be doing important work (like queue draining). Evaluate, then clean up deliberately.
Task 13: Confirm container exit reason (OOM vs app failure)
cr0x@server:~$ docker inspect -f '{{.State.Status}} {{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' payments-api-1
exited 137 OOMKilled=true
What it means: Exit code 137 with OOMKilled indicates the kernel killed your container. That’s not a code regression until proven otherwise.
Decision: Rollback may “work” if the old version uses less memory, but the real fix is memory limits, host capacity, or a leak. Triage memory now.
Task 14: Check host memory pressure and recent OOM logs
cr0x@server:~$ free -m
total used free shared buff/cache available
Mem: 15984 15410 120 210 454 260
Swap: 2047 2047 0
cr0x@server:~$ journalctl -k --since "30 min ago" | tail -n 5
kernel: Out of memory: Killed process 28411 (payments-api) total-vm:2456120kB, anon-rss:1220040kB, file-rss:0kB, shmem-rss:0kB
What it means: The host is out of memory and swap is exhausted. Any version might die; rollback alone is a bandage.
Decision: Reduce load, scale down other services, increase memory, or set sensible limits. Then redeploy. Otherwise you’ll “rollback” into the same OOM.
Task 15: Verify what changed between two Compose revisions (config diff)
cr0x@server:~$ cd /srv/payments
cr0x@server:~$ git log --oneline -5
a12b9f3 bump api to 1.9.2 and add signing key
6c0dd21 pin postgres and tune healthcheck
b19e8aa add worker concurrency env var
cr0x@server:~$ git show a12b9f3 -- compose.yaml | sed -n '1,120p'
diff --git a/compose.yaml b/compose.yaml
index 31e2c0a..8f19c11 100644
--- a/compose.yaml
+++ b/compose.yaml
@@ -12,6 +12,7 @@ services:
api:
image: registry.local/payments:1.9.2
environment:
+ - PAYMENTS_SIGNING_KEY=${PAYMENTS_SIGNING_KEY}
What it means: The new version added a required secret. Your env injection didn’t follow.
Decision: Roll back the commit or supply the secret correctly. If the secret must exist going forward, fix the secret plumbing and redeploy forward rather than roll back.
Task 16: Roll back by Git to the last known good config
cr0x@server:~$ cd /srv/payments
cr0x@server:~$ git checkout 6c0dd21 -- compose.yaml .env
cr0x@server:~$ docker compose -p payments up -d
[+] Running 3/3
✔ Container payments-api-1 Recreated
✔ Container payments-web-1 Running
✔ Container payments-db-1 Running
What it means: You reverted the configuration and applied it. The API container was recreated accordingly.
Decision: Use Git as your rollback ledger. If you can’t reproduce “last known good” from source control, you don’t have a rollback plan—just nostalgia.
Task 17: Validate external reachability quickly (service-level check)
cr0x@server:~$ curl -fsS -o /dev/null -w "%{http_code}\n" https://localhost/health
200
What it means: Your local endpoint returns 200. This is not a full integration test, but it’s a fast signal.
Decision: If health is good locally but users still see failures, the problem might be DNS, load balancer, firewall, or upstream dependencies.
Task 18: Snapshot a named volume before you touch it (cheap insurance)
cr0x@server:~$ docker volume ls | grep payments
local payments_dbdata
cr0x@server:~$ mkdir -p /var/backups/payments
cr0x@server:~$ docker run --rm -v payments_dbdata:/data -v /var/backups/payments:/backup alpine sh -c "cd /data && tar -czf /backup/payments_dbdata_$(date +%F_%H%M).tgz ."
What it means: You created a tarball backup of the volume contents. It’s not perfect for live databases, but it’s better than “we didn’t.”
Decision: For real databases, prefer native backup tools. But when you’re about to do something risky, take a snapshot or a backup first.
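For the Postgres service in this stack, a DB-native dump is one line; the user and database name are assumptions, swap in your own:
cr0x@server:~$ docker compose -p payments exec -T db \
    pg_dump -U postgres -Fc payments > /var/backups/payments/payments_$(date +%F_%H%M).dump
The -T disables the pseudo-TTY so the binary dump doesn’t get mangled on the way out.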
Joke #2: “Just roll it back” is a lovely phrase until you discover your database didn’t get the memo.
Three mini-stories from corporate life
1) The incident caused by a wrong assumption
A mid-sized company ran a payments API on a single beefy host using Docker Compose.
The deployment process was “pull new images, run docker compose up -d.”
They had monitoring, alerts, and a rotating on-call. They also had one assumption: tags were immutable enough.
A developer pushed payments-api:1.8.4 to the registry, realized a build flag was wrong, and re-pushed the same tag with a corrected image.
No malice; just habit from dev environments. The registry allowed it. Nobody noticed.
Two days later, an incident hit after a routine host restart. The node came back up, Compose started containers,
and it pulled 1.8.4—but not the image everyone had been running. Same tag, different bits.
The on-call did what the runbook said: roll back to 1.8.4. The rollback “worked” in the sense that it changed nothing.
They burned an hour chasing phantom config differences and suspected kernel upgrades.
The actual issue was simpler: their rollback target wasn’t a target. It was a moving label.
The fix wasn’t glamorous. They began pinning digests for production deploys and kept human-friendly tags for convenience,
but not as the source of truth. They also locked down the registry so “overwrite an existing tag” required explicit admin action.
Nobody loves this in the moment. Everyone loves it the first time it saves them.
2) The optimization that backfired
A data platform team wanted faster deploys. Their Compose stack had half a dozen services and a shared network.
Pulling images during deploy was slow, so they optimized: they kept images around aggressively and avoided pulling unless necessary.
They also set restart policies to always everywhere, because availability, right?
A bad release shipped with a subtle memory leak in a worker service. At first, it looked fine.
Over a few hours, memory climbed, the kernel began reclaiming aggressively, and eventually OOM killed containers.
With restart=always, the workers restarted instantly, reloaded, leaked again, and thrashed.
Meanwhile the host was so memory-starved that the “good” services started failing too.
The on-call attempted rollback. But the host was under such pressure that even starting the old image was flaky.
Logs were truncated. Exec into containers timed out. “Rollback” became a race between the kernel and the operator.
The very restart policy that helped for transient failures turned a controlled incident into a noisy one.
The eventual improvement: they set memory limits and reserved headroom, tuned restart policies per service,
and added a deploy-time canary check for memory growth under load. The optimization—avoid pulls—wasn’t the real villain.
The missing guardrails were.
3) The boring but correct practice that saved the day
An internal tools team ran Compose on a couple of hosts, each with a “project” directory.
Their practice was painfully boring: every deploy was a Git commit, every commit recorded image digests in the Compose file,
and every release had a “last known good” note in the repo. They also took nightly volume backups and tested restores monthly.
Yes, really.
A release introduced a schema migration that was supposed to be additive.
It wasn’t. It renamed a column used by an older code path and broke a reporting job.
The production API was still mostly fine, but the job spammed errors and created load.
The incident response was almost dull. They reverted to the previous Git commit for the job service only,
redeployed with --no-deps, and confirmed the reporting system recovered.
Then they scheduled a forward-fix migration with proper backward compatibility.
No heroics, no war room that turned into a philosophy seminar.
The key detail: because they pinned digests, they didn’t have to guess what “previous version” meant.
Because they practiced restores, they knew what would happen if they needed data rollback.
They didn’t end up restoring data. They didn’t need to. But the option existed, and that changes how calmly people behave.
Common mistakes: symptom → root cause → fix
1) “Rollback redeployed, but nothing changed”
Symptom: You change the tag back, run docker compose up -d, and the service is still broken.
Root cause: The “old” tag points to new bits (tag overwritten), or Compose didn’t recreate the container because it didn’t detect a change.
Fix: Pin by digest and force recreation if necessary.
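If Compose insists nothing changed, make it recreate the service anyway; a sketch reusing the project from the tasks above:
cr0x@server:~$ docker compose -p payments up -d --no-deps --force-recreate api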
2) “Container starts, then exits immediately”
Symptom: Crash-looping; logs show missing env vars or config parsing errors.
Root cause: Config drift, missing secret injection, or an env var now required by the app.
Fix: Use docker compose config to see resolved values. Restore previous env/compose commit or supply the missing secret properly.
3) “Rollback made it worse; now multiple services are down”
Symptom: You rolled back one change and suddenly DB, proxy, and workers are restarting.
Root cause: You ran a full docker compose down (or removed networks) and recreated everything, causing dependency churn or changed IPs on a fragile stack.
Fix: Prefer docker compose up -d --no-deps <service> and avoid tearing down networks mid-incident.
4) “Old version can’t talk to the database anymore”
Symptom: After rollback, app errors mention missing columns/tables or incompatible schema.
Root cause: Non-backward-compatible migrations were applied by the bad deploy.
Fix: Either roll forward with a fixed app version that supports the new schema, or perform a coordinated data rollback (backup/restore) if feasible.
5) “Rollback is slow because pulls take forever”
Symptom: docker compose up stalls pulling images; recovery time stretches.
Root cause: Old images aren’t cached locally, registry bandwidth is limited, or DNS/network issues.
Fix: Keep last known good images locally (or pre-pull during deploy), verify registry reachability, and consider an internal mirror/cache.
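A cheap version of pre-pulling, sketched with the illustrative digest from earlier: warm the cache before recreating anything, and keep the known-good image warm too.
cr0x@server:~$ docker compose -p payments pull
cr0x@server:~$ docker pull registry.local/payments@sha256:bb22cc33dd44ee55ff66...
Both are safe while the old containers keep running; pulling doesn’t touch running services.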
6) “We rolled back, but users still see errors”
Symptom: Healthchecks pass, but real traffic fails intermittently.
Root cause: External dependencies changed (LB config, TLS cert, DNS), or partial rollback left a mixed-version system.
Fix: Confirm end-to-end behavior via curl from outside, verify reverse proxy routes, and ensure all interdependent services are on compatible versions.
7) “Everything is restarting; logs are empty”
Symptom: Containers restart quickly and you can’t catch logs.
Root cause: OOM kills, disk full, or a crash before stdout flushes.
Fix: Check docker inspect for OOMKilled, check journalctl -k, and fix host pressure first.
8) “Rollback deleted data”
Symptom: DB comes up empty after “rollback.”
Root cause: Someone ran docker compose down -v or removed named volumes, or changed volume names by changing project name.
Fix: Stop. Identify volumes with docker volume ls. Restore from backups. Prevent by locking down runbooks and using explicit volume names.
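The restore half of the Task 18 backup, as a sketch; stop writers first, and note the tarball name is whatever your snapshot produced:
cr0x@server:~$ docker compose -p payments stop db
cr0x@server:~$ docker run --rm -v payments_dbdata:/data -v /var/backups/payments:/backup alpine \
    sh -c "cd /data && tar -xzf /backup/payments_dbdata_SNAPSHOT.tgz"
cr0x@server:~$ docker compose -p payments up -d db
For a live database, prefer the engine’s own restore tooling; untarring under a running Postgres is how incidents get longer.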
Checklists / step-by-step plan
Checklist: the 10-minute rollback (when state is compatible)
- Freeze deployments: stop CI/CD from pushing more changes while you triage.
- Identify failing services: docker compose ps. Pick the smallest blast radius first.
- Read logs: docker compose logs --tail=200 <service>. Decide: config vs code vs host.
- Confirm host health: disk (df -h), memory (free -m), kernel OOM logs.
- Find last known good image: local cache (docker images), CI record, or registry metadata.
- Roll back one service: change tag/digest, then docker compose up -d --no-deps <service>.
- Verify health: docker compose ps, then a real request (curl).
- Announce status: “Service stable on X version; root cause investigation pending.”
- Capture evidence: logs, digests, commit IDs. Future you needs receipts.
- Plan forward fix: don’t live on rollback forever; schedule proper repair.
Checklist: rollback when migrations may have changed state
- Determine what ran: check app logs for migration execution messages; check the DB migration table if applicable (see the sketch after this list).
- Assess compatibility: can old app run against new schema? If not, rollback app won’t restore service.
- Choose strategy:
- Prefer forward fix if you can ship a compatible app quickly.
- Restore data if the deploy corrupted or deleted data, or if forward fix is too slow.
- Protect current state: snapshot/backup volumes before you touch them.
- Coordinate downtime: app+DB restore is not a solo activity in a busy company.
- Validate restore: schema version, row counts sanity checks, and app functional tests.
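The migration-table check from the first step, sketched for a Postgres db service; the user, database, and schema_migrations table name are assumptions about your framework:
cr0x@server:~$ docker compose -p payments exec db \
    psql -U postgres -d payments -c 'SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 5;'
If the newest version matches the bad release, the migration ran, and plain app rollback is no longer a complete answer.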
Checklist: prevent “Compose rollback theater” (it looks like rollback, but it isn’t)
- Stop using floating tags (latest) in production Compose files.
- Record and deploy by digest for critical services.
- Keep “last known good” in Git (compose + env + any config mounted into containers).
- Make sure healthchecks exist and represent actual readiness (see the sketch after this list).
- Practice a rollback quarterly on a staging host that resembles production.
- Back up volumes with a method appropriate to the data (DB-native preferred).
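A readiness-shaped healthcheck sketch; the endpoint, port, and timings are assumptions to adapt:
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 15s
The test command must exist inside the image (here, curl), and it should check something the app actually needs, not merely that a port is open.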
FAQ
1) Does Docker Compose have a built-in rollback command?
No. Compose applies whatever you declare. Your rollback is “declare the previous state” and apply it again,
typically via Git revert and/or pinning an older image digest.
2) What’s the fastest safe rollback if only one service is broken?
Roll back only that service: update its image reference and run docker compose up -d --no-deps <service>.
Don’t churn healthy dependencies.
3) Should I use docker compose down during a rollback?
Usually no. down removes containers and networks, and can increase blast radius.
Use it when you need a clean slate and you understand the consequences (especially around networking and downtime).
4) Will rollback delete my database?
Not by default. Named volumes persist across up/down. Data loss usually happens when someone runs
docker compose down -v or changes volume/project names and “loses” the old volume.
5) Tags vs digests: what should production use?
For production rollbacks, digests are superior because they are immutable identifiers of image content.
Tags are fine for humans, but don’t treat them as evidence.
6) How do I find the last known good digest?
Best case: your deploy pipeline records it. Next best: it’s already in your Compose file from the last deployment.
Otherwise, inspect locally cached images, or consult registry metadata (internally) and correlate with release notes.
7) What if the bad deploy ran migrations that aren’t backward compatible?
Rolling back the app might not work. You either ship a forward fix that supports the new schema,
or you coordinate a data restore. This is why “expand/contract” migrations and backward compatibility are not optional in production.
8) Why does Compose sometimes not recreate a container after I change things?
If Compose doesn’t detect a meaningful change (or you edited the wrong file/path), it may keep the existing container.
Confirm with docker compose config and docker compose ps. If needed, force recreation for that service.
9) What’s the safest way to test a rollback without impacting production traffic?
Run the old image as a parallel service on a different port or project name, validate it against a staging dependency set,
then switch traffic at the proxy. Compose doesn’t give you traffic shifting; you have to build it (often in Nginx/Traefik/HAProxy).
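A sketch of that parallel stack, using a second project name and an override file; the file name, old tag, and port mapping are all illustrative:
# compose.canary.yaml
services:
  api:
    image: registry.local/payments:1.9.1
    ports:
      - "8081:8080"
cr0x@server:~$ docker compose -p payments-canary -f compose.yaml -f compose.canary.yaml up -d api
The separate project name gives the canary its own containers and network, so it can’t trample the live stack; point its env at staging dependencies in the same override, and tear it down when you’re done.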
10) How many old images should I keep locally to make rollback fast?
Keep at least the last known good plus one more prior release for each critical service, assuming disk allows.
The right number depends on image sizes and disk budget, but “zero” is the wrong number.
Next steps you can do today
Compose rollback doesn’t need to be dramatic. It needs to be deterministic. If you take one thing from this:
stop trusting tags in emergencies, and stop treating state changes as an afterthought.
- Write down “last known good” as a digest and a Git commit for each deploy.
- Add healthchecks that reflect readiness, not just “process exists.”
- Practice a one-service rollback with --no-deps on a non-production host.
- Decide your state policy: which services are allowed to run destructive migrations, and how you back out.
- Budget disk and memory headroom so rollback isn’t blocked by the host being on fire.
The fastest rollback on Compose is the one you can execute with your eyes half-closed: pin digest, redeploy the one broken service,
verify health, and keep moving. Drama is optional. Discipline isn’t.