Docker: The Compose Pattern That Prevents 90% of Production Outages

If you’ve ever watched a Compose stack “come up” while your actual application stays down, you already know the dirty secret:
containers starting is not the same as services being ready.
In production, that gap is where pagers are born.

The failures are usually boring: a database that needs 12 more seconds, a migration that ran twice, a stale volume, a log disk filling,
a “helpful” optimization that silently removed guardrails. Compose isn’t the villain. The default patterns are.

The pattern: Compose as a gated, observable service graph

The Compose pattern that prevents most production outages isn’t a single line in docker-compose.yml. It’s a stance:
treat your stack as a service graph with explicit readiness, controlled startup, bounded resources, durable state, and
predictable failure behavior.

Here’s the core idea:

  1. Every critical service has a healthcheck that matches real readiness (not “process exists”).
  2. Dependencies are gated on health, not on container creation.
  3. One-shot jobs are explicit (migrations, schema checks, bootstrap) and are idempotent.
  4. State is isolated into named volumes (or well-managed bind mounts) with backup/restore paths.
  5. Configuration is immutable per deployment (env + files) and changes are deliberate.
  6. Restart policy is a decision, not a default. Crashing forever is not “high availability.”
  7. Logs are bounded so “debug mode” doesn’t become “disk full.”
  8. Resource limits exist so one service can’t starve the host and take the rest with it.
  9. You have a fast diagnosis loop: a handful of commands that identify the bottleneck in a couple of minutes.

The practical result: fewer cascading failures, fewer restart storms, fewer “it works on my laptop” incidents, and far fewer
3 a.m. mysteries caused by invisible ordering issues.

One quote worth keeping taped to your monitor: Hope is not a strategy. — James Greene (commonly attributed in ops circles).
If you’re not sure it’s exact, treat it as a paraphrased idea and move on; the point stands.

Joke #1: The nice thing about “works on my machine” is that it’s true. The bad thing is that your machine isn’t production.

What this pattern is not

  • Not “turn everything into Kubernetes.” Compose is fine for plenty of production systems.
  • Not “just add depends_on.” Without health gating, it’s ordering theater.
  • Not “restart: always.” That’s how you turn a misconfiguration into an infinite loop with excellent uptime metrics for the container runtime.

Why it works: you collapse uncertainty

Most outages in Compose deployments are uncertainty masquerading as convenience:
“It probably starts fast enough,” “the network will be ready,” “the migration will only run once,” “the log file won’t grow,”
“the volume is the same as last time,” “that env var is set somewhere.”
This pattern removes “probably.” It replaces it with checks and gates you can inspect.

Interesting facts and historical context

  • Compose began as Fig (2013–2014 era), a developer tool to define multi-container apps; production hardening came later via patterns, not defaults.
  • Docker healthchecks were introduced after operators kept inventing “wait-for-it” scripts; the platform eventually admitted readiness is a first-class need.
  • depends_on does not mean “ready” by default; it historically means “start this container before that one,” which is rarely the real requirement.
  • Restart policies are not application-level retries; restart: always means “keep trying forever.” Many postmortems include a restart storm that hid the first useful error.
  • Containers don’t contain the kernel; noisy neighbors are still a thing. Without CPU/memory limits, one service can degrade the host and everything on it.
  • Local volumes are easy; portability is not. Named volumes are portable in definition, but the data lifecycle is still your job.
  • Logging drivers matter; JSON-file logging is convenient until a chatty app turns disk into a slowly expanding incident.
  • Compose isn’t a scheduler; it won’t redistribute workloads across nodes or handle node failures like an orchestrator. Your design needs to assume a single host unless you build otherwise.
  • “Init containers” existed as a pattern long before Kubernetes popularized the term; one-shot bootstraps are a universal need in distributed systems.

Why Compose stacks fail in production (the repeat offenders)

1) Startup order is mistaken for dependency readiness

Your API container starts. It tries to connect to Postgres. Postgres is “up” in Docker terms (the PID exists), but it’s still replaying WAL,
performing recovery, or simply not listening yet. The API fails, restarts, fails again. You now have an outage that looks like an API problem,
but is actually a readiness problem.

This is where healthchecks + gating pay for themselves. You don’t want your API to be the database’s readiness probe.
It’s bad for uptime and worse for logs.
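
The shape of the gate, in miniature. This is a sketch using the same service names as the annotated reference file later in this article; it is not new machinery, just the two pieces that matter side by side:

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d appdb -h 127.0.0.1"]
      interval: 5s
      timeout: 3s
      retries: 20
  api:
    image: ghcr.io/example/app:1.9.3
    depends_on:
      postgres:
        condition: service_healthy   # gate on "healthy", not on "container exists"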

2) Migrations are treated like a side effect instead of a job

A common anti-pattern: app container entrypoint runs migrations, then starts the server.
In a single container, on a single host, with a single replica, maybe fine.
In real life, the app restarts, the migration re-runs, locks tables, or partially applies changes.

Make migrations a dedicated one-shot service. Make it idempotent. Make it block app startup until it finishes.
Your future self will send you a thank-you note. Probably in the form of fewer pages.

3) Volumes are treated like “some directory”

Stateful services don’t fail politely. They fail by corrupting, filling, or being mounted with the wrong permissions.
Named volumes help because they decouple the data path from random filesystem layout, but they don’t replace:
backups, restores, capacity checks, and change control.
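
If you want a crude but useful volume-level snapshot while the stack is stopped, a throwaway container can tar the named volume to the host. A sketch, assuming the stack_pgdata volume name that appears later in this article; for a running Postgres you want a logical or physical backup instead (see Task 17), because raw copies of a live database directory are not consistent:

cr0x@server:~$ docker run --rm \
    -v stack_pgdata:/source:ro \
    -v /var/backups:/backup \
    alpine tar czf /backup/pgdata_$(date +%F).tar.gz -C /source .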

4) Restart policies mask real failures

restart: always is a blunt instrument. It will dutifully restart a container with a typo in an env var, a missing secret file,
or a failing migration. Your monitoring sees flapping. Your logs become a blender. Meanwhile, the root cause scrolls by once every three seconds.

Use restart: unless-stopped or on-failure intentionally. Combine with healthchecks so “running” isn’t a lie.

5) No resource limits, then surprise host death

Compose will happily let one container consume all memory, trigger the OOM killer, and take down unrelated services.
This is not theoretical. It’s the oldest trick in the “why did everything die?” book.

6) Logs eat the disk

The default json-file logging driver doesn't rotate unless you configure it, so logs can grow without bounds. If your host disk fills,
your database may stop writing, your app may stop creating temp files, and Docker itself can become unstable.

Bounded logging is not a nice-to-have; it’s a seatbelt.

7) “Convenient” networking choices create invisible coupling

Publishing every port to the host feels pragmatic. It’s also how you end up with port conflicts, unintended exposure,
and a “quick debug” that becomes permanent architecture.

Use internal networks. Publish only what humans or upstream systems need. Keep east-west traffic inside the Compose network.
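
Compose can enforce that boundary instead of relying on discipline: a network marked internal gets no routing in or out of the host. A sketch, reusing the service names from the reference file below; the edge network is hypothetical and exists only so the API's published port still works:

networks:
  backend:
    internal: true             # east-west only: no external routing at all
  edge: {}

services:
  api:
    networks: [backend, edge]
    ports:
      - "127.0.0.1:8080:8080"  # published via the non-internal network, localhost only
  postgres:
    networks: [backend]        # reachable by other services, never from outside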

Reference Compose file (annotated, production-minded)

This is a pattern, not a sacred text. Adapt it. The important part is the interplay:
healthchecks, gating, explicit jobs, bounded logs, and durable volumes.

cr0x@server:~$ cat docker-compose.yml
version: "3.9"

x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "5"

networks:
  appnet:
    driver: bridge

volumes:
  pgdata:
  redisdata:

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: app
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
    secrets:
      - pg_password
    volumes:
      - pgdata:/var/lib/postgresql/data
    networks: [appnet]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d appdb -h 127.0.0.1"]
      interval: 5s
      timeout: 3s
      retries: 20
      start_period: 10s
    restart: unless-stopped
    logging: *default-logging

  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redisdata:/data
    networks: [appnet]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 20
    restart: unless-stopped
    logging: *default-logging

  migrate:
    image: ghcr.io/example/app:1.9.3
    command: ["./app", "migrate", "up"]
    environment:
      DATABASE_URL_FILE: /run/secrets/db_url
    secrets:
      - db_url
    networks: [appnet]
    depends_on:
      postgres:
        condition: service_healthy
    restart: "no"
    logging: *default-logging

  api:
    image: ghcr.io/example/app:1.9.3
    environment:
      DATABASE_URL_FILE: /run/secrets/db_url
      REDIS_URL: redis://redis:6379/0
      PORT: "8080"
    secrets:
      - db_url
    networks: [appnet]
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
    ports:
      - "127.0.0.1:8080:8080"
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:8080/healthz | grep -q ok"]
      interval: 10s
      timeout: 3s
      retries: 10
      start_period: 10s
    restart: unless-stopped
    logging: *default-logging
    deploy:
      resources:
        limits:
          memory: 512M

secrets:
  pg_password:
    file: ./secrets/pg_password.txt
  db_url:
    file: ./secrets/db_url.txt

What to steal from this file

  • Healthchecks match real readiness: pg_isready, redis-cli ping, and an HTTP endpoint that validates the app is serving.
  • Gating includes “completed successfully” for migrations. If migrations fail, the API stays down, loudly, with an actionable error.
  • Secrets via files so passwords don’t end up in docker inspect output or shell history.
  • Ports bound to localhost for safety. Put a reverse proxy in front if you need external access.
  • Bounded logs via log rotation. This prevents “debug log ate the disk” incidents.
  • Named volumes for stateful services. It’s not magic, but it’s at least explicit.

What you should customize immediately

  • Memory/CPU limits based on your host size and service behavior.
  • Healthcheck logic to match your application’s real “ready” state (e.g., DB connectivity + migrations applied).
  • Backup schedule and restore procedure for volumes. If you can’t restore, you don’t have backups; you have expensive wishes.

Practical tasks: commands, outputs, and the decision you make

These are not “nice diagnostics.” These are the moves you make while the incident clock is ticking, and the moves you make on calm Tuesdays
to prevent the incident in the first place.

Task 1: See what Compose thinks is running

cr0x@server:~$ docker compose ps
NAME                IMAGE                         COMMAND                  SERVICE     STATUS                    PORTS
stack-postgres-1     postgres:16                   "docker-entrypoint.s…"   postgres    Up 2 minutes (healthy)    5432/tcp
stack-redis-1        redis:7                       "docker-entrypoint.s…"   redis       Up 2 minutes (healthy)    6379/tcp
stack-migrate-1      ghcr.io/example/app:1.9.3      "./app migrate up"       migrate     Exited (0) 90 seconds ago
stack-api-1          ghcr.io/example/app:1.9.3      "./app server"           api         Up 2 minutes (healthy)    127.0.0.1:8080->8080/tcp

What it means: “Up” is not enough; you want (healthy) for long-running services and Exited (0) for one-shot jobs like migrations.

Decision: If API is Up but not (healthy), you debug readiness (healthcheck logic, dependencies, boot time). If migrate is non-zero, you stop and fix migrations first.

Task 2: Identify the first failure in logs (not the loudest)

cr0x@server:~$ docker compose logs --no-color --timestamps --tail=200 api
2026-02-04T01:18:42Z api  | ERROR: could not connect to postgres: connection refused
2026-02-04T01:18:45Z api  | INFO: retrying in 3s
2026-02-04T01:18:48Z api  | ERROR: migration state not found

What it means: You’re seeing symptoms. The first error is “connection refused,” implying Postgres wasn’t listening yet or network/DNS failed.
The later “migration state not found” might be fallout.

Decision: Check Postgres health and logs next; don’t tunnel-vision on the app.

Task 3: Inspect container health status in detail

cr0x@server:~$ docker inspect --format '{{json .State.Health}}' stack-postgres-1
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2026-02-04T01:18:21.112Z","End":"2026-02-04T01:18:21.189Z","ExitCode":0,"Output":"/var/run/postgresql:5432 - accepting connections\n"}]}

What it means: Healthcheck is passing and reporting “accepting connections.” Good sign.

Decision: If app still can’t connect, investigate networking (DNS, network attachment) or wrong connection string.

Task 4: Validate service discovery from inside the network

cr0x@server:~$ docker exec -it stack-api-1 getent hosts postgres
172.22.0.2   postgres

What it means: DNS resolves postgres to a container IP on the Compose network.

Decision: If this fails, you likely have a network misconfiguration (service not on same network, custom network typo, or using host networking incorrectly).

Task 5: Test TCP connectivity to the dependency from the app container

cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'nc -vz postgres 5432'
Connection to postgres (172.22.0.2) 5432 port [tcp/postgresql] succeeded!

What it means: Network path is open. If the app still fails, it’s likely credentials, SSL mode, database name, or connection parameters.

Decision: Verify secret file contents and parsing (carefully), and check Postgres auth logs.

Task 6: Confirm what config the container actually received

cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'ls -l /run/secrets && sed -n "1p" /run/secrets/db_url'
total 4
-r--r----- 1 root root 74 Feb  4 01:17 db_url
postgres://app:REDACTED@postgres:5432/appdb?sslmode=disable

What it means: The secret exists, permissions look reasonable, and the URL targets postgres.

Decision: If the file is missing or empty, fix your secret mount and deployment process. If the URL points to localhost, that’s your outage.

Task 7: Check Postgres logs for auth and recovery issues

cr0x@server:~$ docker compose logs --tail=120 postgres
postgres  | LOG:  database system is ready to accept connections
postgres  | FATAL:  password authentication failed for user "app"

What it means: Postgres is up; credentials are wrong.

Decision: Rotate/fix the password secret, then restart the affected services. Do not “just restart everything” without fixing the root cause.

Task 8: Spot restart loops quickly

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
NAMES              STATUS                          RUNNING FOR
stack-api-1         Restarting (1) 2 seconds ago    3 minutes
stack-postgres-1    Up 5 minutes (healthy)          5 minutes
stack-redis-1       Up 5 minutes (healthy)          5 minutes

What it means: The API is flapping; your logs might be truncated between restarts.

Decision: Temporarily disable restart for the failing service or scale it to zero, capture logs, fix, then re-enable. Restart loops waste time and hide the first error.
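
One way to disable restarts without editing the Compose file is docker update, which changes the policy on the live container (Compose will put it back the next time it recreates the service):

cr0x@server:~$ docker update --restart=no stack-api-1
stack-api-1

Reproduce the failure once, capture the first real error, fix it, then docker compose up -d restores the declared policy.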

Task 9: Get the exit code and last error for a crashed container

cr0x@server:~$ docker inspect --format '{{.State.ExitCode}} {{.State.Error}}' stack-api-1
1

What it means: Exit code 1 is generic; you must use logs and the app’s own output to pinpoint the failure.

Decision: If the exit code is consistently the same, you’re likely dealing with deterministic misconfig (secrets, env, migration) rather than transient infra.

Task 10: Check host disk pressure before you do anything clever

cr0x@server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  200G  192G  8.0G  97% /

What it means: 97% used. You’re in the danger zone. Databases and Docker both behave badly when disk is tight.

Decision: Stop log growth (rotate, truncate carefully), prune unused images, or expand the disk. Don’t redeploy repeatedly and make it worse.
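
If one container's json-file log is the immediate problem, a careful emergency option is to truncate it in place; the container keeps logging because the file handle stays open. LogPath is the path Docker itself reports for that container's log:

cr0x@server:~$ sudo truncate -s 0 "$(docker inspect --format '{{.LogPath}}' stack-api-1)"

You lose that container's local log history, so capture anything you need first; the durable fix is the bounded logging shown in the reference file.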

Task 11: Check Docker’s own disk usage breakdown

cr0x@server:~$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          28        6         35.2GB    23.4GB (66%)
Containers      14        4         1.1GB     920MB (83%)
Local Volumes   7         2         96.0GB    22.0GB (22%)
Build Cache     0         0         0B        0B

What it means: Volumes are large (expected for databases). Images are reclaimable.

Decision: Prune unused images/containers first; do not touch volumes without a backup/restore plan and a clear understanding of what you’re deleting.

Task 12: Safely prune unused images (when you’ve confirmed)

cr0x@server:~$ docker image prune -a
Deleted Images:
deleted: sha256:4c2c6b1f8b7c...
Total reclaimed space: 18.6GB

What it means: You got disk back by deleting unused images.

Decision: If disk pressure persists, address logs and volumes next. If you’re still near full, you need capacity expansion or a cleanup policy.

Task 13: Identify who is eating memory and triggering OOM

cr0x@server:~$ docker stats --no-stream
CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
c0ffee12ab34   stack-api-1        180.23%   508MiB / 512MiB       99.2%     120MB / 95MB      2.1GB / 45MB    42
bada55aa9876   stack-postgres-1    35.01%    900MiB / 15.6GiB      5.6%     80MB / 110MB      8.2GB / 3.1GB   19

What it means: The API is pinned against its memory limit and is a prime candidate for OOM kills; Postgres has no limit, so docker stats shows the host's total memory as its limit.

Decision: Increase API memory if it’s legitimate, fix memory leaks, and set sane limits for Postgres too—aligned with host RAM and cache needs.

Task 14: Check kernel OOM kills on the host

cr0x@server:~$ dmesg -T | tail -n 12
[Mon Feb  4 01:22:13 2026] Out of memory: Killed process 21134 (app) total-vm:2104820kB, anon-rss:682312kB, file-rss:104kB, shmem-rss:0kB, UID:0 pgtables:1820kB oom_score_adj:0

What it means: The kernel killed your app. That’s not “the container crashed”; that’s “the host ran out of memory.”

Decision: Add memory, add limits, reduce concurrency, or fix the app. If you ignore OOM signals, your next incident will be louder.

Task 15: Verify log rotation is actually applied

cr0x@server:~$ docker inspect --format '{{.HostConfig.LogConfig.Type}} {{json .HostConfig.LogConfig.Config}}' stack-api-1
json-file {"max-file":"5","max-size":"10m"}

What it means: The container is using bounded json-file logs.

Decision: If you see empty config, your logging anchor wasn’t applied or you’re relying on daemon defaults. Fix it at the Compose layer.
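
If you want a safety net below the Compose layer, the daemon default can also be bounded. This is a host-level assumption, not something the reference file does; it requires a daemon restart, only applies to containers created afterwards, and per-service Compose settings still override it:

cr0x@server:~$ cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}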

Task 16: Confirm volumes and mountpoints (the “where is my data?” check)

cr0x@server:~$ docker volume ls
DRIVER    VOLUME NAME
local     stack_pgdata
local     stack_redisdata
cr0x@server:~$ docker volume inspect stack_pgdata | sed -n '1,12p'
[
    {
        "CreatedAt": "2026-02-03T21:10:44Z",
        "Driver": "local",
        "Name": "stack_pgdata",
        "Mountpoint": "/var/lib/docker/volumes/stack_pgdata/_data",
        "Scope": "local"
    }
]

What it means: Data lives under Docker’s volume mountpoint on this host.

Decision: If you expected data elsewhere (like a bind mount), reconcile it now—before a host replacement turns into an accidental data deletion event.

Task 17: Take a quick, consistent Postgres logical backup (for small/medium DBs)

cr0x@server:~$ docker exec -t stack-postgres-1 pg_dump -U app -d appdb | gzip -c > /var/backups/appdb_$(date +%F).sql.gz

What it means: You created a gzipped SQL dump on the host.

Decision: If the database is large, this may be too slow for incident response. Plan physical backups or replication; don’t discover this during an outage.
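
The restore half is the part worth rehearsing. A minimal sketch using the dump from Task 17 (the filename is an example; note -i without -t, because stdin is a pipe). Run drills like this against a scratch host or an empty database, because restoring into a database that already has objects will conflict:

cr0x@server:~$ gunzip -c /var/backups/appdb_2026-02-03.sql.gz | docker exec -i stack-postgres-1 psql -U app -d appdb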

Task 18: Verify the health endpoint from the host

cr0x@server:~$ curl -fsS http://127.0.0.1:8080/healthz
ok

What it means: The service is reachable from the host and returns the expected body.

Decision: If this fails but container is “healthy,” your healthcheck is lying or your port binding is wrong.

Fast diagnosis playbook

When production is down, you don’t need wisdom. You need a short loop that identifies the bottleneck quickly and stops you from
“fixing” the wrong thing at high speed.

First: Is this a dependency readiness problem or an app bug?

  1. Check the graph status:
    cr0x@server:~$ docker compose ps
    NAME                IMAGE                         COMMAND               SERVICE   STATUS                     PORTS
    stack-api-1          ghcr.io/example/app:1.9.3      "./app server"        api       Up 1 minute (unhealthy)    127.0.0.1:8080->8080/tcp
    stack-postgres-1     postgres:16                   "docker-entrypoint"   postgres  Up 1 minute (healthy)      5432/tcp
    

    Decision: If dependencies are healthy but API is unhealthy, focus on API config and its own readiness path.

  2. Read the last 200 lines of the failing service:
    cr0x@server:~$ docker compose logs --tail=200 api
    api  | ERROR: missing required setting: JWT_PUBLIC_KEY
    

    Decision: Missing config/secret. Stop redeploying. Fix config injection.

Second: Is the host sick (disk, memory, CPU, IO)?

  1. Disk:
    cr0x@server:~$ df -h /
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme0n1p2  200G  199G  1.0G  100% /
    

    Decision: Treat “100%” as “nothing works.” Free space before anything else.

  2. Memory pressure:
    cr0x@server:~$ free -h
                   total        used        free      shared  buff/cache   available
    Mem:            16Gi        15Gi       210Mi       120Mi       790Mi       420Mi
    Swap:            0B          0B          0B
    

    Decision: If available memory is low and swap is absent, expect OOM kills. Reduce load or add memory/limits.

  3. Who is consuming resources:
    cr0x@server:~$ docker stats --no-stream
    CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O        BLOCK I/O      PIDS
    c0ffee12ab34   stack-api-1       220.14%   480MiB / 512MiB       93.8%     140MB / 98MB   1.8GB / 22MB  55
    

    Decision: If a container is pegging CPU, suspect tight loops, retries, or thundering herd from failed dependencies.

Third: Is it networking (DNS, ports, exposure)?

  1. DNS from inside the container:
    cr0x@server:~$ docker exec -it stack-api-1 getent hosts postgres redis
    172.22.0.2   postgres
    172.22.0.3   redis
    

    Decision: If resolution fails, you have network attachment or service name issues.

  2. Connectivity tests:
    cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'nc -vz postgres 5432; nc -vz redis 6379'
    Connection to postgres (172.22.0.2) 5432 port [tcp/postgresql] succeeded!
    Connection to redis (172.22.0.3) 6379 port [tcp/redis] succeeded!
    

    Decision: Network path is good; focus shifts to auth/config/migrations.

The playbook’s goal is to avoid the classic incident arc: “restart everything,” then “it’s still broken,” then “why are the logs gone,”
then “we changed three things at once.” Don’t be that arc.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

A mid-sized SaaS company ran a single-host Compose stack: Postgres, Redis, API, and a worker. They were careful—mostly.
They used depends_on because they’d heard “it helps ordering.” They assumed ordering meant readiness.
That assumption lived in production for months because most restarts were manual and spaced out.

One day the host rebooted after routine kernel patching. Postgres came up slower than usual because it had to replay more WAL than normal.
The API container started immediately, tried to connect, failed, and restarted. The worker did the same. Both were configured with
restart: always.

Their external monitor saw HTTP 500s and timeouts. Internally, the logs were a machine-gun burst of connection errors.
The engineering lead initially suspected a Postgres corruption event because “it takes too long.” The database was fine; it was just busy.
But the API and worker hammered it with retries, making it busier.

They fixed it in a single change: add a real Postgres healthcheck and gate app start on it. Second change: cap retry concurrency at the app layer.
The next reboot was uneventful. The outage wasn’t caused by Compose. It was caused by treating a start event as a readiness guarantee.

The lesson wasn’t subtle: the dependency graph exists whether you model it or not. If you don’t model it, production will.

Mini-story #2: The optimization that backfired

A financial services team was under pressure to reduce disk usage. Their hosts were running hot, and Docker data directories kept growing.
Someone noticed that JSON logs were huge. They “optimized” by switching several services to minimal logging and aggressive pruning.
The change looked responsible: lower log retention, more pruning, more automation.

Weeks later, a subtle bug surfaced: a background job occasionally failed to renew a token, causing intermittent downstream failures.
The incident was real but intermittent enough to be confusing. The on-call engineer reached for logs—only to find that the relevant container
logs had been pruned by their own automation before anyone noticed the pattern.

The team tried to compensate by increasing verbosity temporarily. That caused another problem: disk pressure spiked, because the log rotation
settings were not consistently applied across services. Some containers rotated. Others didn’t. The “optimization” had created a mix of policies,
which is how production turns “reasonable” into “chaos.”

The eventual fix was boring: enforce a uniform log policy via Compose anchors, keep enough history to cover the monitoring detection window,
and send critical application logs to a central system instead of relying on local container logs. They also removed automatic pruning during business hours.

The lesson: disk is a resource, logs are a resource, and “optimize” without observability is just blindfolded cost cutting.

Mini-story #3: The boring but correct practice that saved the day

An e-commerce company ran Compose for internal services: catalog updates, price ingestion, and a small API. Nothing glamorous.
But they had one thing many teams skip: a written restore drill for volumes, tested quarterly.
The drill was not “we have backups.” It was “we restored a backup to a clean host and proved the service worked.”

During a routine change, an engineer cleaned up a directory on the host—confident it was “just old Docker stuff.”
It wasn’t. It was a bind-mounted directory used by a stateful service. The container still started. It even looked fine for a few minutes.
Then it began returning partial data and timing out.

They didn’t argue about blame. They executed the drill. They stopped the stack, restored from the most recent backup into a fresh named volume,
and brought the services up with the same Compose file. They validated healthchecks, then ran the app’s consistency check job.
Total impact was limited because they knew exactly what “restore” meant in their environment.

The postmortem wasn’t heroic. It was clinical. They migrated that service off bind mounts to a named volume with clearer lifecycle management,
and they added a pre-change checklist item: verify mountpoints for stateful services before host-level cleanup.

The lesson: the boring practice—restore drills—turns data incidents from existential to inconvenient.

Common mistakes: symptom → root cause → fix

1) API flaps (restarts every few seconds)

Symptom: Restarting (1) in docker ps, logs show repeated connection errors.

Root cause: Dependencies not ready; migration failures; missing secrets; restart policy hides the first error.

Fix: Add healthchecks; gate depends_on with health; split migrations into a one-shot service; temporarily disable restarts to capture the first failure.

2) Everything “Up” but users get 502/timeout

Symptom: Compose reports services running; external proxy returns 502 or timeouts.

Root cause: Healthcheck is too shallow (process exists) or points to wrong interface; app is running but not serving.

Fix: Make healthcheck hit a real endpoint and validate a real response; ensure service binds to correct address; align proxy upstream with container port.

3) Database is healthy, app can’t authenticate

Symptom: Postgres logs show password authentication failed.

Root cause: Secret file wrong, stale, or formatted with a trailing newline that your app mishandles; wrong DB user; wrong DB name.

Fix: Use _FILE env patterns; standardize secret formatting; validate secrets inside the container; rotate credentials deliberately.
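
A quick way to spot the trailing-newline variant from inside the container, using the same secret path as the reference file (od prints every byte, so a stray \n at the end is visible):

cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'od -c /run/secrets/db_url | tail -n 2'

If the last character is \n and your app passes the value through verbatim, that is your authentication failure.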

4) Sudden outage after enabling debug logging

Symptom: Disk fills; Docker and DB become unstable; writes fail.

Root cause: Unbounded json-file logs; log rotation missing on one service; debug verbosity too high.

Fix: Enforce log rotation via anchors; cap verbosity; add disk monitoring; do not rely on manual cleanup.

5) After reboot, services come up but data is “gone”

Symptom: Application acts like a fresh install; DB has no tables; Redis cache resets unexpectedly.

Root cause: Bind mount path changed, permissions prevent reads, or the service is using a different volume than expected.

Fix: Prefer named volumes for state; verify mounts with docker inspect; document volume names; run restore drill.

6) One service causes system-wide slowdown

Symptom: High load, OOM kills, IO wait; multiple containers become unhealthy.

Root cause: No resource limits; runaway query; batch job collides with peak traffic.

Fix: Set memory/CPU limits; schedule heavy jobs; cap concurrency; monitor host metrics and container stats.

7) “It works locally, fails in prod” after switching images

Symptom: New image version fails immediately; older version fine.

Root cause: Config drift; missing env var; incompatible defaults; migrations required but not run.

Fix: Pin image tags; treat config as versioned; require migration job completion before app start; keep a rollback path that doesn’t mutate state.

Joke #2: The only thing worse than an outage is an outage with “helpful” auto-restarts—like a smoke alarm that politely resets itself.

Checklists / step-by-step plan

Production Compose pattern checklist (do this before you call it “production”)

  1. Healthchecks exist for every service that matters (DB, cache, API, proxy).
  2. Healthchecks are honest: they validate readiness, not just liveness.
  3. Dependencies are gated with condition: service_healthy.
  4. Migrations are a one-shot job with restart: "no" and gating via service_completed_successfully.
  5. Secrets are files (or injected securely), not pasted into shell history.
  6. Named volumes for state, unless you have a managed storage reason to bind mount.
  7. Backups exist and restores are tested. Schedule a restore drill.
  8. Logs are bounded (size + files) on every service.
  9. Resource limits are set so one service can’t take the host down.
  10. Port publishing is minimal; internal traffic stays on internal networks.
  11. Image tags are pinned; upgrades are deliberate, not surprise “latest.”
  12. Rollback plan exists and accounts for database schema changes.
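
A cheap gate worth bolting onto that checklist (or your CI): have Compose parse and render the file before anything starts. It catches YAML typos, broken anchors, and schema mistakes early; the echo is just for readability:

cr0x@server:~$ docker compose config --quiet && echo "compose file OK"
compose file OK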

Step-by-step: harden an existing Compose stack in one week

  1. Day 1: Inventory and graph
    • List services, dependencies, and which are stateful.
    • Decide which endpoints represent readiness.
  2. Day 2: Add healthchecks
    • DB: pg_isready (or equivalent).
    • API: a /healthz endpoint that checks critical dependencies.
    • Cache: redis-cli ping or actual read/write test if needed.
  3. Day 3: Gate dependencies
    • Replace naive depends_on ordering with health conditions.
    • Introduce a migrate one-shot service and gate the API on it.
  4. Day 4: Stabilize state
    • Move stateful services to named volumes if feasible.
    • Document volume names and mountpoints.
  5. Day 5: Make logging boring
    • Set log rotation using anchors.
    • Confirm via docker inspect that every container picked it up.
  6. Day 6: Add resource limits and test load
    • Start with conservative memory limits for chatty apps; ensure DB has enough headroom.
    • Watch for OOM and throttling under realistic traffic.
  7. Day 7: Rehearse failure
    • Reboot the host in a maintenance window and watch the stack come back.
    • Simulate dependency delay and ensure gating prevents flaps.
    • Run a restore drill for the database volume.
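
A simple way to run the Day 7 rehearsal without touching the host itself: tear the stack down, bring it back, and watch the gates resolve in order (down without -v keeps named volumes, so this rehearses ordering, not data loss; watch is assumed to be installed on the host):

cr0x@server:~$ docker compose down && docker compose up -d
cr0x@server:~$ watch -n 2 docker compose ps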

Incident response checklist (when you’re already down)

  1. Run docker compose ps and identify the first unhealthy or exited service.
  2. Check df -h and free -h before changing configs.
  3. Pull logs for the failing service and its dependencies (--tail=200 with timestamps).
  4. Validate DNS and connectivity from inside the failing container (getent, nc).
  5. If restart loop: disable restarts temporarily, reproduce once, capture the first error, then fix.
  6. Don’t prune volumes during an incident unless you are restoring from known-good backups.

FAQ

1) Does depends_on guarantee that my database is ready?

Not by default. It only influences startup order. Use healthchecks and gate on condition: service_healthy, or implement explicit readiness logic.

2) Are healthchecks enough to prevent startup issues?

They’re necessary, not sufficient. Healthchecks prevent “start too early” flaps, but you still need idempotent migrations, correct secrets, sane retries, and resource limits.

3) Why not just put migrations in the app startup command?

Because restarts happen. When the app restarts, migrations run again unless you’ve made them explicitly safe and idempotent. A one-shot migration service makes the lifecycle visible and gateable.

4) Should I use named volumes or bind mounts for databases?

Default to named volumes for clarity and portability within Docker’s lifecycle. Use bind mounts only when you have a strong operational reason and you’re disciplined about permissions, backups, and host path management.

5) How do I keep secrets out of docker inspect?

Avoid plain environment variables for raw secrets. Use file-based secrets and have your app read from *_FILE env vars (or equivalent). Also avoid putting secrets in Compose labels or command lines.
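
If the application doesn't understand *_FILE variables natively, a small entrypoint shim is a common workaround. A sketch, not part of the reference image; it assumes the secret is mounted where the Compose file in this article puts it:

#!/bin/sh
# entrypoint.sh: resolve DATABASE_URL_FILE into DATABASE_URL before starting the app
set -eu
if [ -n "${DATABASE_URL_FILE:-}" ] && [ -z "${DATABASE_URL:-}" ]; then
  DATABASE_URL="$(cat "$DATABASE_URL_FILE")"
  export DATABASE_URL
fi
exec "$@"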

6) Is restart: always ever a good idea?

Sometimes—for crash-only services where failures are genuinely transient and you have strong monitoring. In most business apps, it hides deterministic misconfigurations and creates noisy restart storms.

7) How do I do “rolling updates” in Compose?

Compose isn’t a full orchestrator. You can approximate safe updates by running multiple instances behind a proxy, updating one at a time, and using healthchecks to gate traffic. If you need true rolling updates across nodes, you want a scheduler.

8) Why bind API ports to 127.0.0.1?

Because most internal services don’t need to be publicly reachable. Bind to localhost and put a reverse proxy (or firewall rules) in front. This reduces accidental exposure and port collisions.

9) My containers are healthy but the app is slow. What now?

Health is binary; performance is not. Check host IO wait, disk space, memory pressure, and container stats. Then profile the app and database queries. Most “slow” incidents are resource contention, not a Compose issue.

10) Can I run Compose in production on a single host responsibly?

Yes—if you accept the failure domain and build accordingly: backups, restore drills, host monitoring, capacity planning, and a documented rebuild process. Compose won’t save you from single-host physics.

Conclusion: next steps you can do this week

The Compose pattern that prevents most outages is not glamorous. That’s the point. You win production reliability by removing ambiguity:
readiness is explicit, dependencies are gated, state is managed, logs are bounded, and resource use is constrained.

Next steps that actually move the needle:

  1. Add honest healthchecks to every critical service and confirm they fail when the service is not truly ready.
  2. Convert migrations into a one-shot Compose service and gate app startup on its success.
  3. Enforce log rotation everywhere using a Compose anchor and verify it with docker inspect.
  4. Set memory limits for the biggest offenders and watch for OOM signals; adjust based on real load.
  5. Run a restore drill for your database volume on a clean host. If you can’t do that, stop calling it “backed up.”

Do those five things and you’ll prevent the majority of the outages that feel like “Docker problems” but are really just
“we didn’t specify the system we thought we had.”
