Docker Compose: depends_on Lied to You — Proper Readiness Without Hacks

You start your stack. The database container is “up.” The API container starts. Then it crashes because it can’t connect.
You add depends_on. It still crashes. You add sleep 10. It works… until Monday.

If you’ve ever watched a Compose stack flap like a dying neon sign, this is why: depends_on was never a readiness gate.
It’s a start-order hint. Treating it like a readiness guarantee is how you get intermittent failures that only reproduce during demos.

What depends_on actually does (and what it doesn’t)

Compose has two separate ideas that people keep blending into one:
start order and readiness.
depends_on only addresses the first—sort of.

The plain truth

  • It can start containers in a given order. That’s it.
  • It does not wait for your service to be ready. “Container started” is not “database accepting queries.”
  • It does not validate network reachability. Your dependency can be “up” yet unreachable due to DNS, firewall rules, or wrong hostname.
  • It does not prevent race conditions. If your app does migrations at boot and the DB is still initializing, you can still lose.

Some Compose implementations and versions support depends_on with conditions like service_healthy.
That’s closer to what people want, but even then: it’s only as good as your healthcheck.
A bad healthcheck is just sleep 10 with more paperwork.

Here’s the mindset shift: readiness is an application-level contract.
Docker can run processes. It can’t know when your DB has replayed WAL, your app has warmed caches,
or your schema migrations have finished. You have to define those signals.

Joke #1: Using sleep 10 as readiness is like fixing packet loss by yelling at the router. It feels productive, and it isn’t.

Why “container started” is a useless milestone

If you run Postgres, “started” might mean it’s still running init scripts, creating users, or replaying logs.
For Elasticsearch, “started” might mean the JVM exists but the cluster is red.
For object stores, “started” might mean credentials aren’t loaded yet.

Compose doesn’t know your semantics. And even if it did, multi-stage readiness is common:
DNS ready, TCP port open, TLS handshake possible, auth working, schema present, migrations complete, background workers running.
Pick the stage you actually depend on, then test for that.

Facts and history you can use in arguments

When you’re trying to convince a team to stop shipping “wait-for-it.sh” welded to their app startup,
it helps to know where this mess came from.

  1. Compose originally targeted developer workflows, not production orchestration. Start order was “good enough” for laptops.
  2. “Healthy” is not a Docker-native runtime state in the same way as “running”; it’s a healthcheck result, optional, and app-defined.
  3. Classic Compose file versions changed semantics over time; some features existed in v2 syntax but got muddier under v3 (especially in Swarm-era thinking).
  4. Swarm and Kubernetes pushed different models: Swarm leaned on container lifecycle; Kubernetes made readiness/liveness first-class, but still app-defined.
  5. Ports can be open long before services are usable. Many daemons bind early, then perform internal init.
  6. DNS inside Docker networks is eventually consistent during rapid restarts; name resolution failures during startup bursts are a real thing.
  7. Restart policies can create thundering herds: an app failing fast can hammer a DB that’s already struggling to start.
  8. Healthchecks were designed for “is it alive?” and got repurposed for “is it ready?”, which is not always the same question.

The takeaway: you’re not “doing it wrong” because Compose is bad. You’re doing it wrong because you’re asking Compose to be Kubernetes.
Compose can still be made reliable. You just have to be explicit.

The failure modes you keep misdiagnosing

1) Connection refused at boot, then it works later

That’s usually a race. The target process hasn’t bound the port yet, or it bound it on a different interface.
Sometimes it’s the opposite: the port is open but the protocol isn’t ready (TLS not loaded, DB not accepting auth).

2) “Temporary failure in name resolution”

Docker’s embedded DNS is generally solid, but under rapid container churn you can still get transient resolution failures.
If your app treats one DNS hiccup as fatal, you’ve built a fragile startup.
Your readiness plan should include retries with backoff for network name lookup and connect attempts.

3) Healthcheck says “healthy” but app still fails

Healthcheck is too shallow. A TCP connect check is not the same as “schema exists.”
A curl / returning 200 might mean “web server is up,” not “application can talk to DB.”
Healthchecks should reflect the dependency boundary.

4) Everything works locally, fails in CI

CI hosts have different CPU, disk, and entropy behavior.
Slow I/O makes DB initialization longer. Slow DNS makes early resolution fail.
Timeouts tuned for your laptop become garbage in a throttled runner.
If your solution is “add a 30-second sleep,” you just moved the flake.

5) Cascading restarts

A dependent service fails fast and restarts aggressively. Each restart triggers retries, migrations, cache warmups.
Meanwhile the DB is still booting and now also under load.
You get a feedback loop: the dependent service becomes a denial-of-service tool against its own dependency.

Proper readiness patterns (no nap-based engineering)

Pattern A: Use healthchecks that test what you actually need

Don’t test that a port is open. Test that the system can complete the minimum operation your dependent service requires.
For a database, that might be “can authenticate and run a trivial query.”
For an HTTP service, that might be “returns 200 on a readiness endpoint that checks downstream dependencies.”

Example: Postgres healthcheck should run pg_isready and ideally a query if you depend on a specific database existing.
For Redis, redis-cli ping is fine. For Kafka, it’s more complicated.
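A minimal sketch of that Postgres healthcheck, assuming the official postgres:16 image with an app user and app database (matching the config shown later); exact commands depend on your image and auth setup.
The extra psql query is what catches “server up, database missing”:

services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
    healthcheck:
      # pg_isready proves the server accepts connections; the psql query
      # proves the "app" database actually exists and answers queries.
      test: ["CMD-SHELL", "pg_isready -U app -d app && psql -U app -d app -c 'SELECT 1' >/dev/null"]
      interval: 5s
      timeout: 3s
      retries: 20
      start_period: 30s

For Redis, the same idea is simply test: ["CMD", "redis-cli", "ping"].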

Pattern B: Gate startup using health status (when available), not container start

If your Compose supports depends_on conditions like service_healthy, use it.
But treat it as an enforcement mechanism, not the core design.
The core design is still: healthcheck must represent readiness.
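Where conditions are supported, the gate looks like this; a sketch reusing the service names from the rest of this article:

services:
  api:
    image: myapi:latest
    depends_on:
      db:
        # api is not started until db's healthcheck reports healthy.
        condition: service_healthy
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 5s
      timeout: 3s
      retries: 20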

Pattern C: Build retry/backoff into your application

This is the one engineers resist because it feels like “papering over” infrastructure issues.
It isn’t. Networks are unreliable. Startup races happen. Dependencies restart.
If your app cannot retry a DB connection for 30–60 seconds with jittered backoff, it’s not production-grade.

There’s a difference between “retry because the world is messy” and “retry forever because we refuse to fix configuration.”
Put an upper bound. Emit structured logs. Fail after a sane timeout.
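A minimal Python sketch of exactly that bounded, jittered retry; connect_with_backoff and the driver call in the final comment are illustrative names, not any specific library's API:

import logging
import random
import time

log = logging.getLogger("startup")

def connect_with_backoff(connect, deadline_s=90, base_s=0.5, cap_s=10):
    """Call connect() until it succeeds or deadline_s elapses.

    Exponential backoff with full jitter keeps multiple replicas from
    hammering a struggling dependency in lockstep.
    """
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return connect()
        except Exception as exc:  # broad on purpose: DNS, TCP, auth can all fail during init
            attempt += 1
            elapsed = time.monotonic() - start
            if elapsed >= deadline_s:
                log.error("dependency not ready after %.0fs (%d attempts): %s",
                          elapsed, attempt, exc)
                raise
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            log.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)

# Usage (hypothetical driver call):
# conn = connect_with_backoff(lambda: psycopg.connect(os.environ["DATABASE_URL"]))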

Pattern D: Separate “init work” from “serve traffic”

Schema migrations, bucket creation, search index templates, and “create admin user” should not run inside the main service process
unless you’re prepared for the concurrency and idempotency issues.

In Compose, a clean pattern is: a one-shot “init” service that runs and exits successfully, and your app depends on it.
Your init container should be idempotent: safe to run multiple times, safe if partially completed.
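In Compose terms, the pattern can look like the sketch below; /app/migrate is a hypothetical command, and service_completed_successfully requires a Compose implementation that supports it:

services:
  migrate:
    image: myapi:latest
    command: ["/app/migrate"]   # one-shot: applies migrations idempotently, then exits 0
    restart: "no"
    depends_on:
      db:
        condition: service_healthy
  api:
    image: myapi:latest
    command: ["/app/start"]
    depends_on:
      migrate:
        # api starts only after the migration job exited with code 0.
        condition: service_completed_successfully
      db:
        condition: service_healthy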

Pattern E: Prefer explicit readiness endpoints for HTTP services

If your API needs DB + queue + object store, expose a /ready endpoint that checks those dependencies.
Then your healthcheck calls that. Now your “ready” definition matches real dependency requirements.
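The healthcheck side is then short. A sketch, assuming curl is available in the image and the app listens on 8080 as in the tasks below:

  api:
    image: myapi:latest
    healthcheck:
      # curl -f exits non-zero on HTTP errors, so a 503 from /ready counts
      # as a failed check until the app actually reports ready.
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/ready"]
      interval: 10s
      timeout: 3s
      retries: 6
      start_period: 30s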

Pattern F: Avoid brittle “wait-for scripts” glued to ENTRYPOINT

People love dropping a script that loops on a port check.
It’s easy. It’s also often wrong: port open ≠ ready, and the script becomes an unowned mini-platform.

If you must wait, make it a real client check (e.g., run a DB query). And keep it minimal.
Better: use healthchecks + dependency gating + app retries.
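If you do keep a wait script, this is about as much as it should do. A sketch, assuming psql exists in the image and DATABASE_URL is set as elsewhere in this article:

#!/bin/sh
# wait-for-db.sh: real client check, bounded to roughly 60 seconds, then fail loudly.
attempts=0
until psql "$DATABASE_URL" -c 'SELECT 1' >/dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then
    echo "database not ready after $attempts attempts, giving up" >&2
    exit 1
  fi
  sleep 2
done
exec "$@"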

As Werner Vogels has repeatedly put it (paraphrased): “everything fails, all the time.” Systems must assume failure and recover automatically.

Practical tasks: commands, outputs, decisions

These are the moves I actually use when a Compose stack won’t come up cleanly.
Each task includes: command, typical output, what it means, and the decision you make from it.

Task 1: Prove what Compose thinks the config is

cr0x@server:~$ docker compose config
services:
  api:
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgres://app:***@db:5432/app
  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 5s
      timeout: 3s
      retries: 20

Meaning: This is the normalized config Compose will run, after merges and env substitution.
If condition: service_healthy disappears here, your Compose implementation isn’t honoring it.

Decision: If the config output doesn’t show what you expect, stop debugging runtime.
Fix the YAML and version/implementation mismatch first.

Task 2: Watch startup order and exit reasons in one shot

cr0x@server:~$ docker compose up --detach && docker compose ps
[+] Running 2/2
 ✔ Container stack-db-1   Started   0.7s
 ✔ Container stack-api-1  Started   0.2s
NAME          IMAGE             COMMAND                  SERVICE   STATUS
stack-db-1    postgres:16       "docker-entrypoint.s…"   db        Up 2 seconds (health: starting)
stack-api-1   myapi:latest      "/app/start"             api       Restarting (1) 2 seconds ago

Meaning: DB is “starting” healthwise; API is already restarting. That’s your race, visible.

Decision: If a dependent service restarts while dependencies are “starting,” you need gating and/or retries.

Task 3: Inspect health status precisely

cr0x@server:~$ docker inspect --format '{{json .State.Health}}' stack-db-1
{"Status":"starting","FailingStreak":2,"Log":[{"Start":"2026-01-02T10:01:01.123Z","End":"2026-01-02T10:01:01.456Z","ExitCode":1,"Output":"/bin/sh: pg_isready: not found\n"}]}

Meaning: Your healthcheck command doesn’t exist in the image. Postgres images have pg_isready, but slim derivatives might not.

Decision: Fix the healthcheck to use available tooling, or install the client tools. A healthcheck that errors is worse than none.

Task 4: Confirm the process is actually listening

cr0x@server:~$ docker exec -it stack-db-1 ss -lntp
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
LISTEN 0      244    0.0.0.0:5432      0.0.0.0:*    users:(("postgres",pid=1,fd=6))

Meaning: Postgres has bound TCP 5432. This is necessary, not sufficient.

Decision: If not listening, check DB logs and config. If listening but clients fail, move up the stack: auth, DNS, TLS, schema.

Task 5: Test name resolution from the dependent container

cr0x@server:~$ docker exec -it stack-api-1 getent hosts db
172.20.0.2   db

Meaning: DNS resolution works at that moment.

Decision: If resolution fails intermittently, add connection retries with backoff and consider slowing restart loops.

Task 6: Test connectivity at the TCP level (fast, shallow)

cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'timeout 2 bash -c "echo > /dev/tcp/db/5432" && echo "TCP OK"'
TCP OK

Meaning: A TCP handshake is possible.

Decision: If TCP fails, it’s network/name/firewall/listen issues. If TCP works, you need protocol-level checks.

Task 7: Test readiness using a real client operation (Postgres)

cr0x@server:~$ docker exec -it stack-api-1 bash -lc 'psql "postgres://app:app@db:5432/app" -c "select 1;"'
 ?column?
----------
        1
(1 row)

Meaning: Auth works, DB exists, queries work. That’s real readiness.

Decision: If this fails, stop blaming Compose. Fix credentials, DB init, or migrations.

Task 8: Read the logs with timestamps and without scrolling madness

cr0x@server:~$ docker compose logs --timestamps --tail=80 api
2026-01-02T10:01:03.002Z api-1  ERROR db connect failed: dial tcp: lookup db: temporary failure in name resolution
2026-01-02T10:01:03.540Z api-1  ERROR exiting after 1 attempt

Meaning: This is not “DB is slow,” it’s DNS resolution transient + app exiting after one attempt.

Decision: Add retry logic. Also consider reducing restart aggressiveness so DNS/daemon churn settles.

Task 9: Inspect restart policy and current restart loop

cr0x@server:~$ docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' stack-api-1
always

Meaning: The container will restart forever, even if it fails instantly.

Decision: Use on-failure for some services during development, or add backoff/timeout inside the app to prevent self-DoS.
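For example, a hedged fragment using the same service name:

  api:
    image: myapi:latest
    # on-failure restarts only non-zero exits, so a deliberate clean shutdown
    # stays down; pair it with bounded in-app retries so failures aren't instant.
    restart: on-failure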

Task 10: Confirm the containers are on the same network

cr0x@server:~$ docker network inspect stack_default --format '{{json .Containers}}'
{"a1b2c3d4":{"Name":"stack-db-1","IPv4Address":"172.20.0.2/16"},"e5f6g7h8":{"Name":"stack-api-1","IPv4Address":"172.20.0.3/16"}}

Meaning: They share the default project network.

Decision: If a service is on a different network, your hostname might not resolve or route. Fix networks before touching timeouts.

Task 11: Validate that your healthcheck is actually executing

cr0x@server:~$ docker inspect --format '{{range .State.Health.Log}}{{.ExitCode}} {{.Output}}{{end}}' stack-db-1 | tail -n 3
0 /var/run/postgresql:5432 - accepting connections
0 /var/run/postgresql:5432 - accepting connections
0 /var/run/postgresql:5432 - accepting connections

Meaning: Healthcheck is running and passing. That’s a prerequisite if you rely on service_healthy.

Decision: If healthchecks aren’t firing, check that the image supports HEALTHCHECK behavior and Compose has it configured correctly.

Task 12: Measure cold-start time of the dependency

cr0x@server:~$ time docker compose up --detach --wait db
[+] Running 1/1
 ✔ Container stack-db-1  Healthy                                        18.2s

real	0m18.412s
user	0m0.071s
sys	0m0.052s

Meaning: With --wait, Compose blocks until the healthcheck reports healthy, so the wall-clock time (~18 seconds here) is the real cold-start-to-healthy time for this run.

Decision: Set dependent timeouts/retry windows based on measured reality, not vibes. If CI is slower, measure there too.

Task 13: Verify your app’s readiness endpoint from inside the network

cr0x@server:~$ docker exec -it stack-api-1 curl -fsS http://localhost:8080/ready
{"status":"ready","db":"ok","queue":"ok"}

Meaning: The app is declaring itself ready and checking its own dependencies.

Decision: If this endpoint lies, fix it. Your orchestration is only as trustworthy as the signal you provide.

Task 14: Catch “port is open but service is not ready” with HTTP status

cr0x@server:~$ docker exec -it stack-api-1 curl -i http://localhost:8080/
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Content-Length: 53

{"error":"warming up","details":"migrations running"}

Meaning: The server is alive but not ready. That’s good behavior.

Decision: Have your healthcheck call /ready, not /. Keep / for user-facing behavior if you want.

Task 15: Identify slow storage as the real “readiness bug”

cr0x@server:~$ docker exec -it stack-db-1 bash -lc 'dd if=/dev/zero of=/var/lib/postgresql/data/.bench bs=1M count=256 conv=fsync'
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 9.82 s, 27.3 MB/s

Meaning: If you’re seeing tens of MB/s with fsync, your “DB readiness” might simply be “disk is slow.”
Containers don’t fix physics.

Decision: If storage is slow, increase readiness timeouts and fix the underlying disk (or move volumes), rather than sprinkling sleeps in app code.

Fast diagnosis playbook

When the stack won’t come up, don’t thrash. Run this in order. The goal is to find the bottleneck in under five minutes.

First: confirm whether you have a readiness signal at all

  • Run docker compose config and look for healthcheck blocks and any dependency conditions.
  • Run docker compose ps and check if dependencies are (health: starting), (health: unhealthy), or have no health at all.

If there’s no healthcheck, your “readiness” is wishful thinking. Add one.

Second: determine if the failure is network/DNS vs protocol/auth

  • From the failing container: getent hosts <service> (DNS)
  • Then: TCP check to port (connectivity)
  • Then: real client operation (protocol/auth)

This sequence prevents the classic error: spending an hour on DB tuning when the hostname is wrong.

Third: stop restart storms before they hide the real error

  • Check restart policy. If it’s flapping, temporarily scale down the dependent service: docker compose stop api.
  • Bring up the dependency alone. Make it healthy first.
  • Then start the app and observe the first failure, not the 50th.

Fourth: check storage and CPU contention

  • DB “starting” forever often means slow disk, fsync stalls, or memory pressure.
  • Measure with a quick fsync write test or inspect host metrics if available.

Joke #2: Compose doesn’t have a “wait for SAN” flag, because admitting you have a SAN is already a form of readiness check.

Common mistakes: symptoms → root cause → fix

“I used depends_on, why is it still failing?”

Symptom: Dependent service starts and immediately errors connecting to DB/queue.

Root cause: depends_on does start order, not readiness.

Fix: Add a real healthcheck to the dependency; gate with service_healthy if supported; add retry/backoff in the app.

“Healthcheck says healthy but app errors on migrations”

Symptom: DB healthy, app fails with “relation does not exist” or “database does not exist.”

Root cause: Healthcheck only validated connectivity, not schema/data readiness.

Fix: Add an init job that runs migrations idempotently; or make readiness depend on migration completion; or have healthcheck test the existence of required objects.

“Temporary failure in name resolution” at startup

Symptom: DNS lookup fails once; app exits; restarts; sometimes works.

Root cause: Startup DNS race + app doesn’t retry.

Fix: Retry DNS/connect with backoff for a bounded time window; reduce restart aggressiveness; avoid crashing on first lookup failure.

“Connection refused” even though the service is up

Symptom: Target container is running; clients see ECONNREFUSED.

Root cause: Service not listening yet, wrong port, wrong interface binding, or security config rejecting early.

Fix: Check ss -lntp inside the container; verify port mapping vs internal port; confirm listen address; use a protocol-aware readiness check.

“It works after adding sleep 30, so we’re done”

Symptom: Flakes disappear locally; CI still flakes; production redeploys are slow.

Root cause: Sleep is a guess; startup time is variable with I/O, CPU, and init paths.

Fix: Remove sleeps. Replace with healthchecks + gating + retries. Measure dependency warm-up time and tune timeouts to observed behavior.

“Everything is healthy but requests fail for 2 minutes”

Symptom: Healthchecks pass, but the app returns 500 because it’s warming caches or building indexes.

Root cause: Your readiness definition is wrong; you’re checking liveness.

Fix: Implement a readiness endpoint that checks critical dependencies and internal warm-up completion; healthcheck that endpoint.

“Restarting fixes it”

Symptom: First boot fails; second boot works.

Root cause: You have a hidden initialization ordering problem (users/DB/schema created on first run).

Fix: Pull init out into a one-shot job, or make it idempotent and safely repeatable. Ensure the app waits for init completion.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran an internal “integration environment” using Docker Compose on a beefy VM. It wasn’t production,
but it was the place engineers validated changes before shipping. The stack included a Postgres container and an API container.
Someone added depends_on: [db] and felt responsible. They were.

The API had a startup path that applied migrations automatically. On most days, Postgres initialized quickly enough that the API’s first connection attempt succeeded.
On some days—after host reboots or when the VM’s disk cache was cold—Postgres took longer to accept connections.
The API tried once, failed, and exited. Restart policy brought it back. That second attempt usually worked.

Then a change landed that made Postgres startup slower: extra extensions installed at first boot, plus more init scripts.
The API now failed three or four times before succeeding. Engineers saw flapping logs, reran docker compose up, and moved on.
The day it mattered, the environment was used for a customer-facing demo. The API never stabilized because the restart storm caused repeated migration attempts,
each one locking tables and extending startup time further.

The wrong assumption wasn’t “depends_on works.” It was subtler: “if it eventually comes up, it’s fine.”
That’s how intermittent startup failures become full outages under load or during the most time-sensitive moments.
The fix ended up being boring: a Postgres healthcheck, gating on service_healthy, and migration moved to a one-shot job
that ran exactly once per deployment and logged loudly when it failed.

Mini-story 2: The optimization that backfired

Another org chased faster CI builds. They trimmed container images aggressively: smaller base images, fewer packages, fewer “unnecessary” utilities.
Someone removed Postgres client tools from an application image because “we don’t need psql in production.”
True, mostly. But they also used psql in a startup readiness script to verify schema existence.

The pipeline started failing. Not consistently—because caching meant some runners still had old layers, and some jobs used different build paths.
On failing runs, the API container would start, attempt to run psql, and error psql: command not found.
The container exited, restart policy retried, and the job timed out. People blamed the database. It was innocent.

The “optimization” got worse: to reduce log noise, someone changed the entrypoint to swallow the error and fall back to a port-based check.
Now the container “waited” for TCP 5432 and started the app.
The app then immediately hit “relation missing” because migrations weren’t guaranteed, and the failure moved later in the boot sequence.

Eventually they did what should have happened first: they replaced the ad-hoc script with a proper dependency healthcheck on Postgres,
and the app itself gained a bounded retry loop for DB connections.
If they truly needed a schema check, they added a dedicated migration job container that included the right tools and had one job in life.
The CI got faster, but more importantly it got predictable.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran multiple Compose stacks for test environments on shared hosts.
The environments were not “toy”: they were used for incident rehearsal and rollback validation.
The team had a rule: every dependency must have a healthcheck that matches a real client operation,
and every app must retry critical dependencies during startup.

It made their Compose files a bit longer. It also made their incidents shorter, in a good way.
They had readiness endpoints for HTTP services, database healthchecks that performed authentication, and one-shot init services for migrations.
They also set restart policies deliberately: databases didn’t restart on every transient failure, and apps didn’t hammer dependencies with instant retries.

One morning, after a host patch cycle, several environments came up slower than usual.
Storage was degraded briefly after a RAID resync. Postgres containers took longer to become ready.
The stacks didn’t collapse into restart storms. Apps waited. Healthchecks stayed “starting” until the DB was actually usable.

The team noticed, because they monitored health status and startup time, not just “container is running.”
They delayed a scheduled rehearsal by 20 minutes instead of spending two hours arguing with ghosts.
Boring practices don’t look heroic. They just prevent you from needing heroes.

Checklists / step-by-step plan

Step-by-step: converting a flaky Compose stack into a reliable one

  1. Define readiness per dependency.
    For DB: “can auth and run query.” For HTTP services: “/ready returns ok and downstream ok.”
  2. Add healthchecks to every stateful dependency.
    Avoid pure port checks unless that’s truly your only requirement.
  3. Gate dependent startup on health when supported.
    If your Compose supports service_healthy, use it. If not, rely on app retry and consider an init job pattern.
  4. Add bounded retries with backoff in the application.
    Include DNS lookup failures, connect timeouts, and auth failures that can happen during initialization.
  5. Separate init work from serve work.
    Migrations, bucket creation, index templates go into a one-shot service that can be rerun safely.
  6. Make init idempotent.
    Use “create if not exists,” transactional migrations, and safe re-runs. Assume it will run twice.
  7. Tune restart policies to avoid storms.
    If a service fails because dependencies aren’t ready, it shouldn’t restart 20 times per minute.
  8. Instrument startup time.
    Log “starting,” “connected to db,” “migrations complete,” “ready.” You can’t fix what you don’t time.
  9. Test cold starts in CI.
    Purge caches occasionally or run on clean runners. Measure the slow path.
  10. Stop using sleeps as a control mechanism.
    Replace them with checks that represent real readiness, or remove them entirely and rely on retries.

Checklist: what a “good” healthcheck looks like

  • Runs quickly (ideally < 1s) when healthy.
  • Fails reliably when the service is not usable for dependents.
  • Uses a real client protocol when possible (SQL query, HTTP request).
  • Doesn’t require external network access or flaky dependencies.
  • Has sensible intervals and retries based on measured startup time.
  • Has clear failure output in docker inspect.

Checklist: what your app should do during startup

  • Retry dependency connections for a bounded window (e.g., 60–180 seconds depending on environment).
  • Use exponential backoff with jitter to avoid synchronized retry storms.
  • Log each failed attempt with reason, but don’t spam: aggregate or rate-limit if needed.
  • Exit with a clear error if the dependency is not reachable after the window.
  • Expose a readiness endpoint that reflects actual ability to serve.

FAQ

1) Does depends_on ever wait for readiness?

Not by itself. Some Compose implementations support conditions like service_healthy, which can gate on a healthcheck.
But the healthcheck must exist and must reflect real readiness.

2) Is a TCP port check a valid healthcheck?

Sometimes. If your dependent service only requires “port open,” fine. That’s rare.
Most services need authentication, routing, schema, or internal initialization—so a protocol-level check is safer.

3) Why not just increase sleep to 60 seconds?

Because startup time is variable. You’ll make fast paths slower and still fail on slow paths.
Also, sleeps hide real problems: wrong credentials, wrong hostnames, missing migrations.

4) Should I do migrations in app startup?

If you do, you must handle concurrency, idempotency, and failure cleanly.
In Compose stacks, a one-shot migration service is usually cleaner and easier to observe.

5) What’s the difference between liveness and readiness here?

Liveness: “the process is alive and not wedged.” Readiness: “it can serve requests correctly right now.”
Compose healthchecks are often used as liveness; you can use them for readiness, but only if you define them that way.

6) My service is “healthy” but still not reachable from another container. How?

Healthcheck runs inside the container. It can succeed even if the service isn’t reachable over the network (wrong bind address, wrong network, firewall rules).
Verify listen address (ss -lntp) and network membership (docker network inspect).

7) What about using restart: always—good or bad?

It’s neither. It’s a tool. For dependencies that might crash, it can help.
For apps that fail fast because dependencies aren’t ready, it can create restart storms and hide root causes.
Pair it with sane retry logic and good logs.

8) Can I rely on docker compose up order for databases and caches?

You can rely on “Compose will try to start containers in that order.”
You cannot rely on “dependency is usable when the dependent starts.”
If your app needs a usable DB, you need readiness checks and retries.

9) How do I handle multiple dependencies (DB + queue + object store)?

Define readiness at the app boundary: create a readiness endpoint that checks all critical dependencies.
Gate on that readiness for traffic, and ensure each dependency has its own healthcheck where possible.
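A sketch of such an endpoint in Python; the check functions are stubs you would replace with real client calls, and the port matches the 8080 used in the tasks above:

import http.server
import json

def check_db():      # stub: replace with a real "SELECT 1" against your database
    return True

def check_queue():   # stub: replace with a real ping against your broker
    return True

CHECKS = {"db": check_db, "queue": check_queue}

class ReadyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_error(404)
            return
        results = {name: ("ok" if check() else "fail") for name, check in CHECKS.items()}
        ready = all(v == "ok" for v in results.values())
        body = json.dumps({"status": "ready" if ready else "not ready", **results}).encode()
        # 503 tells healthchecks and load balancers not to send traffic yet.
        self.send_response(200 if ready else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), ReadyHandler).serve_forever()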

Conclusion: next steps you should actually do

Stop asking depends_on to do a job it was never hired for. Use it for start order if you want.
But for reliability, you need real readiness signals, and you need systems that can tolerate startup races.

  1. Add healthchecks to every dependency that matters (DB, cache, queue, object store gateways).
  2. Make healthchecks represent real usability, not “port is open.”
  3. If supported, gate startup on service_healthy. If not, treat it as a nice-to-have and rely on app retries.
  4. Move migrations and one-time initialization into a dedicated, idempotent job container.
  5. Implement bounded retry with backoff in every service that talks to a dependency at startup.
  6. Use the fast diagnosis playbook when it still breaks, because it will—just less dramatically.

The goal isn’t perfection. It’s boring starts. Boring starts are what let you spend your attention on product problems instead of boot-time roulette.
