It’s 09:12. Your deploy is “green” because containers are running, but the API is returning 500s. Logs show connection refused to Postgres. You add depends_on, redeploy, and… nothing changes, except your confidence level, which drops.
This is the Docker Compose dependency trap: depends_on controls start order, not readiness. It’s a convenience feature, not a reliability contract. If you treat it like one, your system will eventually teach you humility—usually during a demo.
What depends_on really does (and what it never promised)
Compose has to decide the order it starts containers. That’s all depends_on is: a directed graph that says “start A before B.” It does not say “A is accepting connections,” “A has finished migrations,” “A has warmed caches,” or “A won’t crash two seconds later.”
When people say “depends_on doesn’t work,” they usually mean: it worked exactly as designed, and the design wasn’t what they assumed.
Start order vs readiness: the blunt distinction
- Start order: the container process has been launched (or at least Docker has attempted to launch it).
- Readiness: the service is usable for its purpose (listening socket, successful auth, schema present, upstream reachable, etc.).
Those are separate problems, and Compose only solves the first by default.
The “service_healthy” wrinkle (and why it’s not a universal fix)
Some Compose implementations support conditional dependencies like condition: service_healthy, which gates dependent startup on a dependency’s healthcheck. That’s helpful, but still not a complete contract:
- Healthchecks can be wrong (too shallow, too slow, too optimistic).
- Healthy once doesn’t mean healthy forever.
- Your app still needs retries because networks and storage don’t care about your YAML.
Here’s the operational truth: even with healthchecks, you design your application as if dependencies can be late, flaky, or briefly unavailable. Compose helps you orchestrate; it doesn’t absolve you of resilience.
One quote that belongs on the wall, paraphrasing Werner Vogels: “Everything fails all the time; build systems that expect it.”
Why “ready” is hard: what actually happens during startup
On a clean laptop with warm caches and no load, it’s easy to believe readiness is instantaneous. In real environments, startup is a mess of I/O, DNS, CPU scheduling, storage latency, and sometimes a surprise fsck you didn’t ask for.
Typical dependency startup phases (the parts you forget exist)
- Container created: filesystem layers mounted, namespaces set up, network attached.
- Entrypoint starts: process begins; might fork; might wait on config templates.
- Service initializes: reads configs, allocates memory, checks permissions.
- Storage readiness: volume mounts, journal replay, crash recovery, WAL replay.
- Networking readiness: DNS propagation, service binds to sockets, firewall rules.
- Application readiness: migrations, cache warm-up, seeding, leader election.
Any of those can delay “ready” by milliseconds or minutes. And yes, I’ve seen minutes.
Joke #1: “It worked on my machine” is just another way of saying “my machine has lower standards.”
Storage makes it worse (especially on first boot)
Databases are not “up” when the process exists; they’re up when they can accept a connection and handle a query reliably. Postgres may be replaying WAL. MySQL may be upgrading system tables. Redis may be loading an RDB snapshot. If you use network storage, you add another layer of timing variance.
For SREs, the key is to model dependency availability as stochastic, not deterministic. Your app either handles that reality gracefully, or it becomes a pager.
Facts and historical context (because this didn’t happen by accident)
Some context helps because Compose behavior is frequently confused with Swarm/Kubernetes semantics, and the ecosystem evolved in awkward steps. Here are concrete facts that matter operationally:
- Compose’s original goal was developer ergonomics, not high-availability orchestration. It optimized for “run the stack locally,” not “manage brownouts.”
- depends_on historically only enforced start order. Readiness was explicitly out of scope for a long time because it’s application-specific.
- Healthchecks came to Docker later than many people assume; early Compose setups used ad-hoc “wait-for” scripts because there was no native primitive.
- Swarm and Kubernetes popularized explicit health/readiness concepts, which led teams to expect similar semantics everywhere—even where they don’t exist.
- Docker’s HEALTHCHECK runs inside the container namespace, which is great for testing internal service state, but can miss external reachability problems.
- Compose v2 is a plugin, not the old Python binary; the implementation details and supported features differ across environments, which fuels “it works for me” confusion.
- Startup order is not restart order; a crashed dependency can come back later, and Compose won’t magically re-sequence the world for you.
- DNS inside Compose networks is usually stable but not instantaneous; early connection attempts can fail with name resolution errors in fast-starting clients.
- Database “accepts TCP” is not the same as “schema ready”; migrations can still be running, causing timeouts or missing-table errors.
That’s the trap: the tool’s scope is narrower than the operational problem, and our brains autocomplete the missing features.
Failure modes you’ll see in production (even if you swear you won’t)
1) Connection refused at boot, then “magically” OK
The database container starts quickly. The DB process binds late. Your app tries once, fails, and exits. Compose restarts it, or you do. On the second attempt, it works.
Diagnosis: the app has no retry/backoff, and you confused start order with service readiness.
2) “No such host” or transient DNS failures
Fast clients can attempt to resolve a service name before the embedded DNS is fully ready or before the network is attached. It’s rarer now, but it still happens under load or on slow nodes.
3) Schema missing / migrations running
Postgres is accepting connections, but your migration job hasn’t run yet. Your app boots, runs queries, and dies. You get a brief outage and a pile of useless alerts.
4) Healthy container, unhealthy system
Your DB healthcheck is pg_isready, which returns success. But disk is full, the database is read-only, or connections are maxed out. Healthchecks are only as good as their definition.
5) Backpressure and timeouts masquerading as startup problems
Your dependency is “up” but painfully slow: cold caches, high I/O wait, CPU steal. The app times out and exits, and everyone blames Compose because it’s standing nearby.
Joke #2: depends_on is like saying “I arrived at the restaurant first” and assuming dinner is already cooked.
Patterns that work: healthchecks, retries, and sane sequencing
If you want reliability, you need layers. Compose can help, but the application has to do the grown-up part: retry, backoff, and fail in a controlled way.
Pattern A: Add explicit healthchecks to dependencies
For common services, define a healthcheck that tests meaningful readiness. Not just “process exists.” Prefer a real query or ping that exercises the right subsystem.
Example: Postgres healthcheck that validates TCP, auth, and basic query ability:
cr0x@server:~$ cat docker-compose.yml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
      POSTGRES_USER: app
      POSTGRES_DB: appdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d appdb -h 127.0.0.1 || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 20
      start_period: 10s
  api:
    image: myorg/api:latest
    depends_on:
      db:
        condition: service_healthy
Operational note: even if you gate startup on health, you still need app-level retries for restarts, failovers, and mid-flight dependency resets.
Pattern B: Make the application resilient (retries with backoff)
The best place to handle dependency readiness is the client. Your API should tolerate the DB being late by 30–120 seconds without exiting. Make it log clearly, retry with exponential backoff, and keep a separate liveness signal so it doesn’t accept traffic prematurely.
When people avoid this because “it hides failures,” what they mean is “I prefer outages to slow starts.” You can still alert on slow readiness; you don’t need to crash-loop to feel something.
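The retry loop described above can be sketched in a few lines. This is a minimal illustration, not your client library’s API: `connect` stands in for whatever zero-argument connect call your app makes, and the defaults are assumptions you should tune to your own startup envelope.

```python
import random
import time

def connect_with_backoff(connect, deadline_s=120.0, base_s=0.5, cap_s=10.0):
    """Retry `connect` with exponential backoff and full jitter.

    `connect` is a hypothetical zero-argument callable that raises on
    failure. We keep retrying until `deadline_s` has elapsed, then fail
    loudly so a truly dead dependency still surfaces as an incident.
    """
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return connect()
        except Exception as exc:
            elapsed = time.monotonic() - start
            if elapsed >= deadline_s:
                raise RuntimeError(
                    f"dependency not ready after {elapsed:.0f}s"
                ) from exc
            # Full jitter: sleep a random amount up to the capped exponential,
            # so a fleet of restarting clients doesn't stampede the DB in sync.
            sleep_s = random.uniform(0, min(cap_s, base_s * (2 ** attempt)))
            print(f"connect failed ({exc}); retrying in {sleep_s:.2f}s")
            time.sleep(sleep_s)
            attempt += 1
```

The jitter matters as much as the backoff: without it, every replica retries on the same schedule and you convert a slow start into a connection storm.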
Pattern C: Separate migrations/initialization from app startup
Run migrations as a one-shot job that blocks deployment completion, not app startup. In Compose terms, that might be a dedicated service you run explicitly, or an entrypoint that does migrations with a lock and clear observability.
Don’t run migrations in 10 app replicas simultaneously unless you enjoy “relation already exists” errors and arguing about who started first.
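In Compose terms, the one-shot migration job can look like the sketch below. The `migrate` subcommand is hypothetical (substitute whatever your app or migration tool exposes), and `service_completed_successfully` support varies by Compose version, so verify it in your environment:

```yaml
services:
  migrate:
    image: myorg/api:latest
    # Hypothetical entrypoint: run migrations once, exit 0 on success.
    command: ["/bin/api", "migrate"]
    depends_on:
      db:
        condition: service_healthy
    restart: "no"   # one-shot: a failed migration should fail loudly, not loop
  api:
    image: myorg/api:latest
    depends_on:
      migrate:
        condition: service_completed_successfully
```

If your Compose version doesn’t support completion conditions, run the job explicitly (`docker compose run --rm migrate`) as a deploy step and gate the rest of the rollout on its exit code.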
Pattern D: Use restart policies intentionally
restart: always can mask real issues by turning them into a perpetual crash-loop. Sometimes that’s acceptable during bootstrapping; it’s not acceptable as a steady-state operating mode.
My preference for most services:
- Use restart: unless-stopped for long-running services in dev/test.
- In production-ish environments, pair restart with sane logging, backoff in the app, and clear healthchecks so “restarting” doesn’t become “working.”
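As a minimal sketch of that preference in Compose terms:

```yaml
services:
  api:
    image: myorg/api:latest
    # Restarts on crash, but stays down if an operator deliberately stops it.
    restart: unless-stopped
```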
Practical tasks: commands, outputs, and the decision you make
This section is deliberately hands-on. These are the tasks you run at 02:00 when you’re trying to answer one question: what is actually not ready, and why?
Task 1: Confirm what Compose thinks is running
cr0x@server:~$ docker compose ps
NAME IMAGE COMMAND SERVICE STATUS PORTS
stack-db-1 postgres:16 "docker-entrypoint.s…" db running (healthy) 5432/tcp
stack-api-1 myorg/api:latest "/bin/api" api running 0.0.0.0:8080->8080/tcp
What it means: Containers are running; DB is “healthy” per its healthcheck.
Decision: If API still errors, this is not a start-order issue; move to logs and real connectivity tests.
Task 2: Inspect dependency graph and merged config
cr0x@server:~$ docker compose config
services:
  api:
    depends_on:
      db:
        condition: service_healthy
    image: myorg/api:latest
  db:
    environment:
      POSTGRES_DB: appdb
      POSTGRES_PASSWORD: example
      POSTGRES_USER: app
    healthcheck:
      interval: 5s
      retries: 20
      start_period: 10s
      test:
        - CMD-SHELL
        - pg_isready -U app -d appdb -h 127.0.0.1 || exit 1
      timeout: 3s
    image: postgres:16
What it means: You’re validating what Compose will actually run (after merges, overrides, and environment interpolation).
Decision: If the healthcheck or dependency condition is missing here, you’re debugging the wrong file or the wrong Compose implementation.
Task 3: Read the API logs with timestamps
cr0x@server:~$ docker compose logs --timestamps --tail=200 api
api-1 2026-02-04T08:12:09.441Z ERROR db connect failed: dial tcp 172.22.0.2:5432: connect: connection refused
api-1 2026-02-04T08:12:09.443Z INFO exiting with code 1
api-1 2026-02-04T08:12:11.012Z INFO starting api version=1.9.3
What it means: The client tried once and exited. This is classic “no retry/backoff” behavior.
Decision: Fix the app startup logic, not Compose. Add retries and only fail hard after a bounded time.
Task 4: Read the DB logs around initialization
cr0x@server:~$ docker compose logs --timestamps --tail=200 db
db-1 2026-02-04T08:12:03.118Z PostgreSQL init process complete; ready for start up.
db-1 2026-02-04T08:12:04.002Z database system is ready to accept connections
What it means: The DB was ready at 08:12:04Z, but API tried at 08:12:09Z and still got refused. That mismatch suggests networking, wrong address, or DB restart.
Decision: Validate the API’s connection target (host, port, TLS) and test connectivity from within the network namespace.
Task 5: Validate DNS and network reachability from the API container
cr0x@server:~$ docker compose exec api getent hosts db
172.22.0.2 db
What it means: DNS resolution inside the Compose network works.
Decision: If this fails, you have a network attachment or name issue (wrong network, wrong service name, or container not on the same network).
Task 6: Check TCP connect to DB from the API container
cr0x@server:~$ docker compose exec api bash -lc 'timeout 2 bash -c "</dev/tcp/db/5432" && echo "TCP OK"'
TCP OK
What it means: TCP is reachable now.
Decision: If TCP is OK but auth/query fails, your readiness probe must be deeper than “port open.”
Task 7: Perform an actual DB query from the API container
cr0x@server:~$ docker compose exec api bash -lc 'PGPASSWORD=example psql -h db -U app -d appdb -c "select 1;"'
?column?
----------
1
(1 row)
What it means: Auth and basic query path works.
Decision: If your app still errors, the issue is likely application config (wrong DSN), migrations, or connection pool limits—not Compose.
Task 8: Inspect container health details (don’t guess)
cr0x@server:~$ docker inspect --format '{{json .State.Health}}' stack-db-1
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2026-02-04T08:12:25.011Z","End":"2026-02-04T08:12:25.042Z","ExitCode":0,"Output":"/var/run/postgresql:5432 - accepting connections\n"}]}
What it means: Healthcheck is returning “accepting connections.” It does not validate schema readiness or that the API user has privileges beyond connect.
Decision: If you need schema readiness, build a probe that checks for a known table or migration version.
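A deeper probe can replace pg_isready with an actual query against a known table. This is a sketch: the schema_migrations table name is an assumption (use whatever your migration tool writes), and depending on your auth setup the check may also need PGPASSWORD or a .pgpass file inside the container:

```yaml
healthcheck:
  # Connect, authenticate, and confirm the (hypothetical) migrations
  # table exists before reporting healthy - not just "port open".
  test: ["CMD-SHELL", "psql -U app -d appdb -h 127.0.0.1 -tAc 'SELECT 1 FROM schema_migrations LIMIT 1' || exit 1"]
  interval: 5s
  timeout: 3s
  retries: 20
  start_period: 15s
```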
Task 9: Verify restart loops and exit codes
cr0x@server:~$ docker compose ps --all
NAME SERVICE STATUS
stack-api-1 api restarting (1) 3 seconds ago
stack-db-1 db running (healthy)
What it means: The API is crash-looping. This is not “waiting,” it’s failing and restarting.
Decision: Pause the loop to preserve logs/state, then fix the immediate failure. Crash-loops can also DoS your dependency.
Task 10: Confirm environment variables and the actual DSN used
cr0x@server:~$ docker compose exec api env | egrep 'DATABASE_URL|PGHOST|PGPORT|PGUSER'
DATABASE_URL=postgres://app:example@db:5432/appdb?sslmode=disable
What it means: The app is configured to connect to db inside the Compose network, not localhost.
Decision: If you see localhost here, that’s your bug. In containers, localhost points at the container, not the DB service.
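You can catch that bug at boot instead of after the first failed query. A minimal sketch of a DSN sanity check, assuming the app reads a standard `DATABASE_URL` (the function name is illustrative, not a real library API):

```python
from urllib.parse import urlparse

def check_dsn(dsn: str) -> str:
    """Fail fast at boot if the DSN points at the container itself.

    Inside a Compose network the DB lives at its service name (e.g. "db"),
    so "localhost" or a loopback address is almost always a config bug.
    """
    host = urlparse(dsn).hostname
    if host in (None, "localhost", "127.0.0.1", "::1"):
        raise ValueError(
            f"DATABASE_URL host {host!r} points at this container, not the DB service"
        )
    return host
```

Run it once during startup and crash with a clear message; “refused to boot with a loopback DSN” is a far cheaper failure than a crash-loop at 02:00.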
Task 11: Identify resource bottlenecks on the host (CPU, memory, IO)
cr0x@server:~$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
a12b3c4d5e6f stack-db-1 215.32% 1.2GiB / 2GiB 60.00% 2.1MB / 3.4MB 1.2GB / 900MB
b98c7d6e5f4a stack-api-1 0.32% 55MiB / 512MiB 10.74% 800KB / 700KB 12MB / 4MB
What it means: DB is CPU-heavy and doing lots of block IO. That can delay readiness and make clients time out.
Decision: If DB is pegged during startup, tune start_period, increase timeouts, and consider storage performance (volume type, host IO contention).
Task 12: Prove that the issue is startup timing by delaying app start
cr0x@server:~$ docker compose stop api
[+] Stopping 1/1
✔ Container stack-api-1 Stopped
cr0x@server:~$ sleep 15
cr0x@server:~$ docker compose start api
[+] Starting 1/1
✔ Container stack-api-1 Started
What it means: If this “fixes” the problem, you’ve confirmed a boot race.
Decision: Don’t keep the sleep. Implement readiness gating (healthcheck + condition) and app retries.
Task 13: Check events timeline to catch restarts and health transitions
cr0x@server:~$ docker events --since 10m --filter 'container=stack-db-1' --filter 'container=stack-api-1'
2026-02-04T08:12:01.004Z container create a12b3c4d5e6f (name=stack-db-1)
2026-02-04T08:12:01.210Z container start a12b3c4d5e6f (name=stack-db-1)
2026-02-04T08:12:04.120Z container health_status: healthy a12b3c4d5e6f (name=stack-db-1)
2026-02-04T08:12:04.300Z container start b98c7d6e5f4a (name=stack-api-1)
2026-02-04T08:12:09.443Z container die b98c7d6e5f4a (name=stack-api-1, exitCode=1)
What it means: You get an exact timeline: DB became healthy before API started, yet API still died. That points away from naive readiness and toward config, permissions, TLS, or something the healthcheck didn’t test.
Decision: Expand the healthcheck or add explicit app-side readiness logging that states exactly what it’s waiting for.
Task 14: Validate volume mounts and permissions for stateful dependencies
cr0x@server:~$ docker compose exec db bash -lc 'ls -ld /var/lib/postgresql/data; df -h /var/lib/postgresql/data | tail -1'
drwx------ 19 postgres postgres 4096 Feb 4 08:12 /var/lib/postgresql/data
overlay 80G 78G 2.0G 98% /
What it means: Disk is 98% full. Postgres can “start,” pass simplistic healthchecks, then misbehave under write pressure.
Decision: Treat disk pressure as a top-tier readiness requirement for stateful services. Fix capacity before you touch YAML.
Fast diagnosis playbook
If you only remember one section, make it this one. When a Compose stack “starts” but doesn’t work, you want to find the bottleneck quickly—not write fan fiction about depends_on.
First: establish the failure class (app crash-loop vs app up but failing)
cr0x@server:~$ docker compose ps
NAME SERVICE STATUS
stack-api-1 api restarting (1) 2 seconds ago
stack-db-1 db running (healthy)
If restart loop: focus on app logs and exit reason. Don’t chase readiness until you know what’s failing.
Second: read the logs from both sides, aligned in time
cr0x@server:~$ docker compose logs --timestamps --tail=100 api
...application errors...
cr0x@server:~$ docker compose logs --timestamps --tail=100 db
...db startup and readiness...
Decision: If the DB clearly wasn’t ready when the API tried, you need gating or retries. If the DB was ready, your failure is likely config/auth/schema/resource.
Third: test from inside the failing container’s network namespace
cr0x@server:~$ docker compose exec api getent hosts db
...ip...
cr0x@server:~$ docker compose exec api bash -lc 'timeout 2 bash -c "</dev/tcp/db/5432" && echo "TCP OK"'
...tcp connect result...
Decision: DNS fail → network wiring. TCP fail → dependency not listening or wrong port. TCP OK but app fails → auth/schema/TLS/pool/timeouts.
Fourth: look for host-level resource contention
cr0x@server:~$ docker stats --no-stream
...cpu/mem/io...
Decision: If DB is IO-bound, you’ll see “readiness” flapping because the world is slow, not because YAML is wrong.
Fifth: validate what you’re actually running
cr0x@server:~$ docker compose config
...resolved config...
Decision: If the config isn’t what you thought, stop. Fix the source of truth (wrong file, wrong override, wrong environment) before debugging symptoms.
Common mistakes: symptoms → root cause → fix
1) Symptom: API exits immediately with “connection refused”
Root cause: The client makes a single connection attempt during boot, fails fast, and exits. depends_on didn’t help because it doesn’t wait for readiness.
Fix: Add retry/backoff in the app, or gate start on a meaningful healthcheck (service_healthy) plus an app-side bounded wait.
2) Symptom: API can resolve “db” but can’t connect
Root cause: DB is listening on a different port, bound to a different address, or is crash-looping itself. Sometimes the DB is starting, then restarting due to storage corruption or config issues.
Fix: Check DB logs, inspect port bindings, ensure DB listens on expected interface. Test TCP from inside the API container.
3) Symptom: Works on second restart; fails on fresh deploy
Root cause: Boot race. The system is accidentally dependent on timing.
Fix: Stop “fixing” it with manual restarts. Make readiness explicit with healthchecks and client retries. Add a startup deadline so it fails loudly if it truly can’t recover.
4) Symptom: “relation does not exist” or “table not found” during startup
Root cause: Migrations are not complete when the app starts, or multiple replicas run migrations concurrently.
Fix: Run migrations as a dedicated job/step. If you must run them from the app, use advisory locks or a single-runner pattern and log migration state clearly.
5) Symptom: DB healthcheck is healthy, but app times out
Root cause: Healthcheck tests a shallow condition (socket open) but not performance, auth, or schema readiness. Or the DB is overloaded (CPU/IO) and slow.
Fix: Make healthcheck meaningful (e.g., a query). Increase timeouts cautiously. Investigate host resource contention and storage latency.
6) Symptom: Everything is “up,” but requests fail intermittently
Root cause: Mid-flight dependency resets, connection pool exhaustion, ephemeral DNS/network glitches, or restart policy hiding recurring crashes.
Fix: Add circuit breakers and retries with jitter. Monitor restarts and health flaps. Don’t use restart policies as a substitute for fixing crash causes.
7) Symptom: Using localhost in the app config works outside Docker, fails inside
Root cause: Inside a container, localhost is the container itself.
Fix: Use the Compose service name (db) as the host, or use an explicit network alias.
Three corporate-world mini-stories (anonymized, plausible, technically accurate)
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company had a “simple” Compose stack in a staging environment that started Postgres, a migration container, and an API. The migration container depended on Postgres. The API depended on the migration container. It looked like a neat chain of responsibility.
During a Monday morning release rehearsal, the API came up and immediately started erroring. The team did what teams do: restarted everything. It worked the second time. They shrugged and moved on.
Two weeks later, they rebuilt the staging hosts. Clean disks, slower storage, a slightly different kernel. On the first boot after deploy, Postgres took longer to replay WAL. The migration container started (because Postgres container was started), attempted to connect, failed once, and exited with a non-zero code. The API started anyway because the dependency chain only encoded start order, not “migrations succeeded.” It then died on missing tables.
The outage wasn’t dramatic, but it was loud: a flurry of alerts, confused engineers, and executives asking why “staging is down again.” The root cause was painfully simple: they had modeled correctness as container start order, not as explicit readiness and success criteria.
The fix was also simple, but required discipline: migrations became an explicit deployment step with a clear pass/fail. The API gained retry/backoff and a startup deadline. Compose was still used, but as a runner—not as a reliability engine.
Mini-story 2: The optimization that backfired
A data platform team wanted faster developer feedback. They shortened healthcheck intervals and retries to make failing services “fail fast.” In theory, a failing dependency would surface quickly and developers would fix it sooner.
In practice, they created a flapping machine. On laptops, the DB was slow during cold starts because of Docker Desktop resource constraints. The strict healthcheck failed early, Compose reported unhealthy, and dependent services never started. Developers began “fixing” it by increasing CPU limits locally or disabling healthchecks entirely.
Then the pattern leaked into CI. CI runners were resource-constrained and noisy neighbors. Healthchecks frequently failed during start_period, making pipelines intermittently red. Engineers lost trust in the signal and retried pipelines until they passed. The organization ended up with slower delivery, more wasted compute, and fewer useful alerts.
They walked it back by admitting a boring truth: healthchecks are not a race to the bottom. They must match expected startup behavior under realistic contention. They increased start_period, kept intervals reasonable, and used app-side retries to smooth variance. “Fail fast” became “fail clearly, with context.”
Mini-story 3: The boring but correct practice that saved the day
A payments-adjacent service had a Compose-based integration environment used by multiple teams. Nothing fancy: API, worker, Postgres, Redis. The team running it was allergic to cleverness, which is a compliment.
They enforced three rules: every dependency had a meaningful healthcheck; every client had retry/backoff with a maximum startup wait; and every deploy ran a smoke test from inside the network namespace after startup. The smoke test wasn’t extensive—just enough to prove the critical path.
One morning, a host reboot coincided with a storage slowdown. Postgres came up but was sluggish; healthchecks took longer yet still passed within the configured envelope. The API took longer to declare itself ready because its internal readiness check waited for a successful query plus a migration version check. It didn’t crash-loop, so it didn’t hammer Postgres with repeated cold connection storms.
The result was deeply unsexy: startup took longer, and everything still worked. Teams noticed a delay but not an outage. The difference wasn’t heroics; it was that the system was designed for the world where startup is variable and dependencies can be late.
Checklists / step-by-step plan
A step-by-step plan to get out of the dependency trap
- Stop using depends_on as readiness. Keep it for start order only.
- Add healthchecks to stateful services. Make them meaningful (not just “port open”).
- If supported, gate with condition: service_healthy. Treat it as a convenience, not a guarantee.
- Implement client retries with exponential backoff + jitter. Include a maximum startup deadline (e.g., 2–5 minutes).
- Separate migrations from app boot. Run them as an explicit step with clear logging and failure behavior.
- Design readiness around the user journey. “DB ping works” might not mean “schema is ready.”
- Instrument startup. Log what you’re waiting for, how long it took, and why it failed.
- Validate from inside containers. Test DNS, TCP, and an actual query from the app container.
- Watch resources. If DB startup is IO-bound, fix storage/host contention, not YAML.
- Run a post-start smoke test. A tiny, fast test catches boot races before users do.
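The smoke-test step in the plan above can start from something as small as a bounded TCP wait. A minimal sketch (the function name is illustrative; a real smoke test should go on to run an actual query on the critical path):

```python
import socket
import time

def wait_for_tcp(host: str, port: int, deadline_s: float = 60.0) -> float:
    """Block until a TCP connect to host:port succeeds, or raise after deadline.

    Returns how long readiness took, so you can log it and alert on
    slow starts instead of discovering them from user reports.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=2):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= deadline_s:
                raise TimeoutError(
                    f"{host}:{port} not accepting connections after {deadline_s}s"
                )
            time.sleep(0.5)
```

Run it from inside the app’s network namespace (e.g. via docker compose exec) so it tests the same path the app uses, then follow it with one real query.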
Operational checklist for a Compose file you won’t regret
- Every stateful service has a healthcheck with sane start_period, timeout, and retries.
- Clients use service names, not localhost, for intra-stack connectivity.
- Restart policies are chosen intentionally; crash-loops are treated as incidents, not “self healing.”
- Migrations/initialization are single-runner and observable.
- Logs include timestamps and enough context to reconstruct a startup timeline.
FAQ
1) Does depends_on wait for the DB port to open?
No. By default it only ensures Compose starts the dependency container before starting the dependent container. Port-open readiness is not implied.
2) If I add a healthcheck to Postgres, am I done?
You’re less wrong, not done. A healthcheck can help gate initial startup (if you use service_healthy), but your app still needs retries for restarts and transient failures.
3) Why not just use a “wait-for-it” script everywhere?
Because it often checks only TCP connect, which is the shallowest possible definition of “ready.” It also tends to become tribal glue you forget to maintain. Prefer app-side retries and meaningful healthchecks; use wait scripts only when you must.
4) My DB is healthy but migrations aren’t finished. How should I model that?
Separate concerns: DB health means “DB can serve requests.” Migration completion is a deployment state. Run migrations as an explicit step or a dedicated one-shot service and gate app readiness on a schema/version check.
5) Can I rely on condition: service_healthy in all environments?
No. Feature support varies with Compose versions and tooling. Always verify with docker compose config and test in the environment you deploy.
6) Why does it only fail on fresh machines or after host reboot?
Cold starts amplify variability: caches are cold, disks are busy, services run crash recovery, and CPU scheduling is noisier. If your system relies on “it usually starts fast,” it will fail exactly when things are cold and slow.
7) Is it bad to use restart: always?
It’s not morally bad; it’s operationally risky. It can hide real failures, create dependency hammering, and make logs harder to interpret. Pair restart policies with backoff, good logs, and real health checks.
8) How do I tell the difference between readiness and performance problems?
Readiness problems fail early with “connection refused,” DNS errors, or auth failures. Performance problems show timeouts, high latency, and resource saturation (docker stats shows high CPU/IO). Treat them differently.
9) What’s the simplest reliable approach for small stacks?
Healthcheck the DB, add client retries with backoff, and run a smoke test after startup. That trio prevents most boot-race incidents without turning your Compose file into a screenplay.
Next steps that actually reduce incidents
If your stack occasionally needs “just restart it,” you don’t have a Compose problem. You have a readiness contract problem. depends_on is fine for ordering. It is not a handshake, not a promise, and not a substitute for resilience.
Do these next, in this order:
- Add meaningful healthchecks to stateful dependencies (DB, queues, caches).
- Make clients retry with exponential backoff, jitter, and a max startup deadline.
- Stop mixing migrations into random app startup; run them explicitly and observe success.
- Build a timeline during incidents using logs with timestamps and docker events.
- Prove connectivity from inside containers before you rewrite config.
Compose will still do what it always did: start containers. Your job is to make “started” mean something useful. That’s reliability engineering: making the obvious failure modes boring.