The worst kind of outage is the one that looks like it fixed itself. Your stack “comes up,” dashboards go green,
then five minutes later the app starts throwing 500s because the database wasn’t actually ready—just technically “started.”
Meanwhile, your orchestrator did exactly what you told it to do. That’s the part that hurts.
Docker and Docker Compose are blunt tools: they can start containers in a particular order, but they cannot magically
know when a dependency is safe to use unless you teach them what “ready” means. The cure is to separate start order
from readiness, wire in evidence-based checks, and make “false starts” impossible or at least loudly visible.
Start order vs readiness: the difference that matters
Start order answers: “Did the process get launched?” That’s it. Docker can do this: container A starts,
then container B starts. Nice and tidy. Also dangerously incomplete.
Readiness answers: “Is the service usable by its clients right now?” That includes: listening sockets,
completed migrations, warmed caches, TLS keys loaded, leader election done, permissions correct, and dependencies reachable.
Readiness is a contract between a service and the world. Docker does not infer this contract. You implement it.
False starts happen when we treat start order as a proxy for readiness. It’s like declaring a restaurant open because
the lights are on while the chef is still negotiating with a box of frozen fries.
A reliable stack does two things:
- It starts components in a sensible order when that helps.
- It gates dependent components on proven readiness, not optimism.
You will occasionally hear “just add sleeps.” That’s not readiness; that’s superstition with a timer.
Your future self will hate you, and your incident channel will hate you sooner.
Why false starts happen (and why they’re so common)
“Started” is a cheap state. A process can start while still being useless. In modern systems, “useful” often depends on:
network reachability, filesystem state, credentials, schema migrations, and upstream services. Any one of those can lag
behind process start by seconds or minutes.
Classic failure modes that produce false starts
- Socket isn’t listening yet. The app boots, then binds to a port later. Clients try immediately and fail.
- Port is listening, but service isn’t ready. HTTP returns 503 because migrations or cache warm-up are still running.
- Database accepts TCP but not queries. PostgreSQL is accepting connections while replaying WAL or running recovery.
- DNS isn’t settled. Container name exists, but resolver caches/sidecars aren’t ready.
- Dependency is ready, but with the wrong schema. The app starts before migrations, then fails with “relation does not exist.”
- Volume not mounted or wrong permissions. Service starts, writes nowhere, then collapses under its own lies.
- Rate limits and thundering herds. Ten replicas all “start,” all stampede a dependency, and you learn what “retry storm” means.
Here’s the uncomfortable truth: many apps are written assuming a human will restart them if something goes wrong at startup.
Containers remove the human and replace them with automation that will happily retry the same failure forever.
Joke #1: If you think “depends_on” means “works_on,” Docker has a bridge to sell you—and it’s probably down for maintenance.
Facts & historical context (short, concrete)
- Docker’s early “link” feature (pre-Compose maturity) tried to wire dependencies by injecting environment variables; it didn’t solve readiness.
- The Compose v2 file format popularized depends_on for start order; many teams misread it as readiness gating.
- The Compose v3 file format shifted focus toward Swarm and removed some conditional startup semantics; people kept assuming the old behavior existed.
- Kubernetes introduced readiness probes as a first-class concept because “container running” was never enough for service routing.
- systemd has had dependency ordering for ages, and even it distinguishes “started” from “ready” via notification mechanisms.
- PostgreSQL can accept TCP connections before it is fully available for workload (recovery, crash replay, checkpoints).
- MySQL variants may listen early but reject auth or lock internal tables during initialization, creating a perfect false start trap.
- Healthchecks were added to Docker to move beyond “process exists” as the only signal; they’re still underused or misused.
What Docker Compose actually does (and doesn’t)
depends_on: ordering, not readiness
In Compose, depends_on controls startup/shutdown order. It does not, by default, wait for the dependency to be
ready. It ensures that Docker has attempted to start the container. That’s it.
Compose can use healthchecks to gate startup in some modes, but the operational reality is messy:
people run different Compose versions, different Docker engines, and different expectations shaped by old blog posts.
If you want reliability, you build readiness logic into your stack in a way that is explicit and testable.
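For concreteness, here is a minimal sketch of the two modes side by side. The condition syntax is part of the current Compose specification; some older v3-era toolchains dropped it, which is exactly why you verify the resolved config with docker compose config (Task 13 below does that). Service names are placeholders.

services:
  app:
    depends_on:
      redis:
        condition: service_started    # plain ordering: Docker attempted the start, nothing more
      db:
        condition: service_healthy    # gating: wait until db's healthcheck passes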
Healthcheck: useful, but you must design it
A healthcheck is a periodic test run by the engine. If it fails, Docker marks the container as unhealthy. That’s a signal.
What you do with it—restarts, gating, alerting—is a separate choice.
If your healthcheck is “curl localhost:8080,” but the service returns 200 while it’s still failing all external requests,
you’ve built a liar. Liars pass tests and fail customers.
Restart policies: not readiness, just persistence
Restart policies are about recovering from crashes. They are not a dependency strategy. If your app exits because the DB
wasn’t ready, restart policies will turn a brief DB warm-up into a restart loop. That loop can also amplify load on the DB.
The only readiness signal that matters: “Can the client succeed?”
Readiness should be defined from the client’s perspective. If a service depends on a database and a queue, “ready” means
it can connect, authenticate, run a simple query, and publish/consume a small message—or whatever “minimal viable work”
looks like for your system.
One idea to keep you honest, paraphrased from John Allspaw: reliability comes from learning and feedback loops, not from pretending failures won’t happen.
Readiness patterns that work in production
Pattern 1: Put real healthchecks on dependencies
If you run PostgreSQL, Redis, or an HTTP API in a container, give it a healthcheck that reflects actual usability.
For Postgres, that can be pg_isready plus a real query if you need schema validation. For HTTP, hit the endpoint
that checks dependencies, not a static “OK” route.
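A minimal sketch for a Postgres dependency, assuming the official postgres image and the default superuser; adjust the user, database, and timings to your stack:

services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -h 127.0.0.1 -p 5432"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 30s    # tolerate normal warm-up before counting failures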
Pattern 2: Gate dependent services on readiness (explicitly)
There are three common gating approaches:
- Compose gating via health status (when supported in your environment): depends_on + health conditions.
- Entry-point wait scripts inside the dependent container: wait for TCP + app-level validation; then start.
- Application-native retries with backoff and jitter: the best long-term answer, because it works everywhere, not just Docker.
The most durable approach is: app retries properly and the orchestrator has health checks. Belt and suspenders.
This is ops. We dress for the weather we have, not the weather we deserve.
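If you go the entrypoint route, the script can be tiny. A sketch, assuming pg_isready is installed in the dependent image (swap in a TCP or HTTP probe otherwise) and that DB_HOST, DB_PORT, and WAIT_TIMEOUT are environment variables you control:

#!/bin/sh
# wait-for-db.sh: block until the database answers a readiness probe, then exec the real app.
set -eu
: "${DB_HOST:=db}"
: "${DB_PORT:=5432}"
: "${WAIT_TIMEOUT:=60}"
elapsed=0
until pg_isready -h "$DB_HOST" -p "$DB_PORT" >/dev/null 2>&1; do
  if [ "$elapsed" -ge "$WAIT_TIMEOUT" ]; then
    echo "db not ready after ${WAIT_TIMEOUT}s, giving up" >&2
    exit 1
  fi
  sleep 2
  elapsed=$((elapsed + 2))
done
exec "$@"   # hand off to the real application command

Used as the image's ENTRYPOINT with the normal app command as CMD, it fails loudly after the deadline instead of starting an app that cannot work.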
Pattern 3: Make migrations a first-class job
Schema migrations are not a side effect. Treat them as a separate, explicit step: a one-shot container/job that runs
migrations and exits successfully. Then and only then do you start app containers. This prevents the “five apps race to
migrate the schema” chaos.
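A hedged Compose sketch of the one-shot shape, with placeholder image and command names; the service_completed_successfully condition is supported by current docker compose:

services:
  migrate:
    image: example/app:latest              # placeholder: same app image, different command
    command: ["./migrate", "up"]           # placeholder migration command
    restart: "no"                          # one-shot: run once, exit with a clear status
    depends_on:
      db:
        condition: service_healthy
  app:
    image: example/app:latest
    depends_on:
      migrate:
        condition: service_completed_successfully
      db:
        condition: service_healthy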
Pattern 4: Use a readiness endpoint that checks dependencies
For HTTP services, expose:
- /healthz (liveness): “is the process alive?”
- /readyz (readiness): “can I serve real traffic?” (DB reachable, queue reachable, critical configs loaded)
Even in Docker Compose, this helps because your healthcheck can hit /readyz rather than guessing.
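For example, a container healthcheck aimed at the readiness endpoint instead of a vanity route, assuming curl (or an equivalent) exists in the image and the app listens on 8080:

services:
  app:
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:8080/readyz"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 45s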
Pattern 5: Bound your retries, then fail loudly
Infinite retries can hide real outages. Bounded retries with clear logs give you a controlled startup delay when things are
slow, but still fail if the world is actually broken.
Pattern 6: Add jitter and backoff
If 20 containers start at once and all hammer the DB every 100ms, you create a self-inflicted DDoS. Backoff and jitter
turn stampedes into trickles.
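A shell sketch of bounded retries with exponential backoff and jitter; the probe (pg_isready against a host named db) and the limits are placeholders for your real dependency check:

#!/usr/bin/env bash
# Bounded retries with exponential backoff and jitter.
max_attempts=8
delay=1
for attempt in $(seq 1 "$max_attempts"); do
  if pg_isready -h db -p 5432 >/dev/null 2>&1; then
    echo "dependency ready after ${attempt} attempt(s)"
    exit 0
  fi
  jitter_ms=$((RANDOM % 1000))                            # up to 1s of random spread per attempt
  sleep "${delay}.$(printf '%03d' "$jitter_ms")"          # e.g. sleep 2.347
  delay=$((delay * 2)); [ "$delay" -gt 30 ] && delay=30   # exponential backoff, capped at 30s
done
echo "dependency not ready after ${max_attempts} attempts" >&2
exit 1

Twenty replicas running this spread their probes out instead of synchronizing into a stampede, and they still fail loudly when the dependency is genuinely broken.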
Joke #2: “Just add a 30-second sleep” is how you end up with a 31-second outage and a 3-hour postmortem.
Practical tasks: commands, outputs, decisions (12+)
These are the checks I actually run when a stack “started” but the app behaves like it’s allergic to Mondays.
Each task includes: command, what the output means, and the decision you make.
Task 1: See container state and health at a glance
cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
NAMES STATUS PORTS
app Up 18 seconds (health: starting) 0.0.0.0:8080->8080/tcp
db Up 22 seconds (healthy) 5432/tcp
redis Up 21 seconds (healthy) 6379/tcp
Meaning: app is running but not ready; DB and Redis are healthy.
Decision: Don’t route traffic yet. If the app health stays “starting” too long, inspect its healthcheck and startup logs.
Task 2: Inspect why a healthcheck is failing
cr0x@server:~$ docker inspect --format '{{json .State.Health}}' app
{"Status":"unhealthy","FailingStreak":3,"Log":[{"Start":"2026-01-03T10:10:01.123Z","End":"2026-01-03T10:10:01.456Z","ExitCode":1,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused\n"}]}
Meaning: The app isn’t listening yet, or it’s listening on a different port/interface.
Decision: Check app logs and its listen address/port. If the app binds only to 127.0.0.1 inside the container, a localhost healthcheck can still pass while published ports and other containers get connection refused; bind to 0.0.0.0 (or the container’s interface) instead.
Task 3: Read startup logs with timestamps
cr0x@server:~$ docker logs --since 10m --timestamps app | tail -n 30
2026-01-03T10:09:44.001234567Z boot: loading config
2026-01-03T10:09:44.889012345Z db: connection failed: dial tcp db:5432: connect: connection refused
2026-01-03T10:09:45.889045678Z db: connection failed: dial tcp db:5432: connect: connection refused
2026-01-03T10:09:46.889078901Z boot: giving up after 2 retries
Meaning: The app tried twice and quit too early. DB wasn’t ready at that moment.
Decision: Increase retries/backoff or gate startup until DB readiness is verified. Also evaluate whether “exit on missing DB” is the right behavior.
Task 4: Confirm DB container logs indicate actual readiness, not just startup
cr0x@server:~$ docker logs --since 10m --timestamps db | tail -n 30
2026-01-03T10:09:30.100000000Z PostgreSQL init process complete; ready for start up.
2026-01-03T10:09:31.200000000Z database system is ready to accept connections
Meaning: DB declared readiness at 10:09:31; the app started failing at 10:09:44. That mismatch suggests either wrong hostname, network issue, or the DB restarted.
Decision: Check container networking and DB restarts; verify the app uses the correct service name and port.
Task 5: Verify container restart loops
cr0x@server:~$ docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
NAMES STATUS RUNNING FOR
app Restarting (1) 3 seconds 2 minutes
db Up 2 minutes (healthy) 2 minutes
Meaning: The app is crashing and restarting.
Decision: Stop trying to “fix” it by restarting. Inspect the exit code and error; then fix readiness or config.
Task 6: Get the app’s exit code and last failure reason
cr0x@server:~$ docker inspect --format 'ExitCode={{.State.ExitCode}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}' app
ExitCode=1 Error= FinishedAt=2026-01-03T10:10:02.002002002Z
Meaning: Non-zero exit with no engine-level error; the process chose to exit.
Decision: Treat it as an application behavior problem (dependency handling, config validation), not a Docker problem.
Task 7: Confirm service discovery (DNS) inside the network
cr0x@server:~$ docker exec -it app getent hosts db
172.20.0.3 db
Meaning: DNS for db resolves correctly within the container.
Decision: Move on to connectivity and auth checks; DNS is not your bottleneck today.
Task 8: Test TCP connectivity to a dependency from the dependent container
cr0x@server:~$ docker exec -it app bash -lc 'nc -vz db 5432'
nc: connect to db (172.20.0.3) port 5432 (tcp) failed: Connection refused
Meaning: DB is reachable by IP but not accepting TCP. Either DB isn’t listening, or it’s on a different port/interface, or it’s restarting.
Decision: Inspect DB listening sockets and restart history; also check whether DB is bound to localhost only.
Task 9: Confirm the DB is listening on the expected port inside its container
cr0x@server:~$ docker exec -it db bash -lc 'ss -lntp | grep 5432 || true'
LISTEN 0 244 0.0.0.0:5432 0.0.0.0:* users:(("postgres",pid=1,fd=7))
Meaning: PostgreSQL is listening on all interfaces.
Decision: If clients still see connection refused, suspect timing (DB restarting) or network policy/iptables rules.
Task 10: Verify healthcheck command and timing parameters
cr0x@server:~$ docker inspect --format '{{json .Config.Healthcheck}}' db
{"Test":["CMD-SHELL","pg_isready -U postgres -h 127.0.0.1 -p 5432"],"Interval":3000000000,"Timeout":1000000000,"Retries":3,"StartPeriod":0}
Meaning: Interval 3s, timeout 1s, retries 3, and StartPeriod is 0. On slow disks or crash recovery, that can mark DB unhealthy prematurely.
Decision: Add a start period (e.g., 30–60s) and increase timeout. Healthchecks should detect real failure, not normal warm-up.
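In Compose terms (the inspect output is in nanoseconds: 3000000000 = 3s), the adjusted check could look like this sketch; tune the numbers to your slowest normal startup, not your fastest:

  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -h 127.0.0.1 -p 5432"]
      interval: 5s
      timeout: 5s
      retries: 5
      start_period: 60s    # ignore failures during crash recovery and cold-cache warm-up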
Task 11: Detect slow storage causing delayed readiness
cr0x@server:~$ docker exec -it db bash -lc 'dd if=/var/lib/postgresql/data/pg_wal/000000010000000000000001 of=/dev/null bs=4M count=16 status=none; echo $?'
0
Meaning: Basic read succeeded. This does not prove performance, but it rules out obvious I/O errors.
Decision: If startup is still slow, check host-level I/O saturation and filesystem latency; readiness delays often trace back to storage.
Task 12: Check host resource pressure (CPU, memory, I/O) that prolongs warm-up
cr0x@server:~$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
a1b2c3d4e5f6 app 180.12% 512MiB / 1GiB 50.00% 1.2MB / 800KB 12MB / 2MB
b2c3d4e5f6g7 db 95.33% 1.8GiB / 2GiB 90.00% 900KB / 1.1MB 2.3GB / 1.9GB
Meaning: DB is near memory limit and heavy on block I/O. That’s a recipe for slow readiness and intermittent failures.
Decision: Increase memory, reduce shared buffers, move volumes to faster storage, or reduce startup parallelism. Fix the bottleneck before “tuning” healthchecks to lie.
Task 13: See Compose dependency graph and resolved configuration
cr0x@server:~$ docker compose config
services:
app:
depends_on:
db:
condition: service_healthy
environment:
DATABASE_URL: postgres://postgres:postgres@db:5432/app
db:
healthcheck:
test:
- CMD-SHELL
- pg_isready -U postgres -h 127.0.0.1 -p 5432
Meaning: The intended gating is present in config output (good), assuming your Compose implementation honors it.
Decision: If behavior contradicts config, verify Compose version and engine; then consider moving gating into entrypoint or app retries for portability.
Task 14: Confirm Compose and Docker versions (behavior depends on this)
cr0x@server:~$ docker version --format 'Client={{.Client.Version}} Server={{.Server.Version}}'
Client=27.2.0 Server=27.2.0
Meaning: You know what runtime you’re actually on. This matters when someone swears “it used to work.”
Decision: Align versions across environments or bake readiness into the application so your stack doesn’t depend on tool quirks.
Task 15: Time the path to readiness
cr0x@server:~$ time docker compose up -d
[+] Running 3/3
✔ Container db Started
✔ Container redis Started
✔ Container app Started
real 0m2.114s
user 0m0.082s
sys 0m0.061s
Meaning: Compose reports “Started” fast. That does not mean ready.
Decision: Measure readiness separately (health status, ready endpoints, synthetic checks). Don’t treat the orchestration output as truth.
Task 16: Run a synthetic “can I do real work” check
cr0x@server:~$ docker exec -it app bash -lc 'curl -fsS http://127.0.0.1:8080/readyz && echo READY'
READY
Meaning: The app claims readiness for real traffic (assuming your /readyz is honest).
Decision: Only now is it reasonable to put the service behind a load balancer, open firewall rules, or declare deployment done.
Fast diagnosis playbook
When “containers are up” but the system isn’t usable, speed matters. The trick is to avoid guessing which dependency is lying.
Here’s the order that finds the bottleneck quickly, with minimal thrash.
First: identify which component is not ready (not which is “down”)
- Run docker ps and look for (health: starting) or (unhealthy).
- If no healthchecks exist, that’s your first problem. But you still diagnose with logs and synthetic checks.
Second: correlate timestamps across logs
- Grab the last 5–10 minutes of logs with timestamps for app and dependencies.
- Look for: connection refused, timeouts, auth failures, migration errors, disk errors.
- Decide whether it’s a timing issue (dependency warming) or a hard misconfig (wrong hostname/credentials).
Third: test from the client’s network namespace
- From inside the dependent container, test DNS, then TCP, then app-level protocol.
- Don’t test from the host and assume it’s equivalent. Different network path, different truth.
Fourth: check resource pressure and storage latency
- Use docker stats to spot CPU/memory/disk saturation.
- If DB is slow to come ready, suspect I/O first. Apps are impatient; disks are eternal.
Fifth: decide which layer should own the fix
- App fix: retries with backoff/jitter; readiness endpoint; better error handling.
- Compose fix: healthchecks; gating; migration job ordering.
- Platform fix: storage performance; resource limits; avoid noisy neighbors.
Three corporate-world mini-stories
Incident caused by a wrong assumption: “Started” meant “ready”
A mid-size company ran a Docker Compose stack for an internal billing portal: web app, API service, PostgreSQL, and a
background worker. Deployments were “simple”: pull images, docker compose up -d, done. For months it looked fine.
Then a routine host reboot turned into a half-day incident.
After reboot, Compose started containers in the expected order. The API container came up, tried to run a migration on startup,
and immediately failed because Postgres was still replaying WAL. The API exited with a non-zero code. Restart policy brought it back.
It failed again. And again. The migration never ran, the API never stayed up long enough to accept requests, and the worker
hammered the message queue with retries that weren’t rate-limited.
The dashboard said “containers running” because the database container was up and the restarting containers spent most of each loop in the “Up” state, which was enough to keep the panel green.
The on-call engineer initially focused on the load balancer because users saw 502s. Classic misdirection.
Only after comparing timestamps across logs did it click: the API had a startup dependency on a DB state that wasn’t guaranteed.
The fix was boring and decisive: migrations moved into a one-shot container that ran after the DB was healthy, and the API
gained exponential backoff on DB connect. They also added a readiness endpoint that failed until migrations were complete.
The next reboot was a non-event. The billing team didn’t send flowers, but they also didn’t send angry emails, which is the SRE equivalent.
Optimization that backfired: healthchecks “tuned” into lying
Another org had a Compose-based dev environment that resembled production: microservices, a central database, and a search engine.
Developers complained that bringing the stack up took too long, especially on laptops. Someone “optimized” by making healthchecks
very aggressive: one-second intervals, one-second timeouts, and no start period. It made the UI show “healthy” faster on a good day.
On a mediocre day—like when the search engine needed extra time to initialize indexes—the container was marked unhealthy early.
A wrapper script interpreted unhealthy as “broken,” killed and restarted it. That created a loop where the service never got
enough uninterrupted time to finish initializing. The optimization didn’t reduce startup time; it prevented startup entirely.
The team then chased ghosts: DNS, ports, Java heap settings. Everything except the obvious: their own healthcheck policy
was actively sabotaging the system. They had built a readiness test that punished normal initialization.
The eventual resolution: start periods were added, timeouts increased, and the healthcheck for the search engine was changed
from “port open” to “cluster state is yellow/green” (a signal that initialization passed a meaningful threshold).
Startup took slightly longer. It also started every time. That’s what “faster” looks like in production: fewer retries, fewer loops, fewer lies.
Boring but correct practice that saved the day: a synthetic “ready” gate in CI
A regulated company ran nightly integration tests against a Compose environment. They had a habit that looked painfully dull:
every pipeline had an explicit “wait for readiness” stage using a small script that polled service readiness endpoints and
executed a minimal DB query. Only when that passed did tests begin. Engineers sometimes grumbled about the extra minute.
Then a database image update landed. The new image performed an extra initialization step when it detected a particular
filesystem condition. On some runners, that step took long enough that app containers started and immediately failed their initial DB connection.
Without gating, tests would have started during the flapping and produced random failures.
Instead, the pipeline simply waited. When readiness didn’t arrive within the configured deadline, it failed clearly with:
“DB not ready after N seconds.” No flaky tests. No half-broken artifacts. The team rolled back the image and filed an internal
issue to pin versions until they understood the new behavior.
The boring practice paid off: the failure was deterministic, localized, and fast to diagnose. That’s the dream.
It’s also why I keep saying: treat readiness as a first-class signal, not a vibe.
Common mistakes: symptom → root cause → fix
1) “Connection refused” on boot, then works after a manual restart
Symptom: app fails immediately with connection refused to DB/Redis, then works if you restart the app container later.
Root cause: dependency was started but not listening yet; app has no retries or insufficient retry window.
Fix: add exponential backoff retries in the app; add a readiness gate (healthcheck + gating or entrypoint wait) for dependencies.
2) App container is “Up,” but every request fails
Symptom: container state is running; healthcheck is green; users see 500s.
Root cause: healthcheck only tests liveness (port open), not readiness (dependency success).
Fix: implement /readyz that verifies critical dependencies; point the container healthcheck at it.
3) “Unhealthy” during normal warm-up, causing restart loops
Symptom: service never stabilizes; logs show startup steps repeating.
Root cause: healthcheck start period too short or absent; wrapper script restarts on unhealthy.
Fix: configure healthcheck start_period and reasonable timeout; restart only on crash, not on early unhealthy, unless you truly mean it.
4) Random failures that disappear when you serialize startup
Symptom: starting services one-by-one works; starting them together fails intermittently.
Root cause: thundering herd on a shared dependency (DB, auth service, secrets provider) plus aggressive retries.
Fix: add jitter/backoff; stagger startup; increase dependency capacity; run migrations separately.
5) “Authentication failed” at boot, then fine later
Symptom: transient auth failures for DB or API keys right after startup.
Root cause: secret injection sidecar/agent not ready; file-based secrets not yet written; IAM token not yet available.
Fix: readiness should include “credentials present and valid”; gate startup on that condition, not merely on process start.
6) App says ready, but migrations are still running
Symptom: readiness endpoint returns OK while schema changes are in progress; clients get SQL errors.
Root cause: app doesn’t treat migrations as a readiness dependency, or migrations run in parallel across replicas.
Fix: move migrations into a dedicated job; app readiness should fail until schema version is compatible.
7) “Works on my machine,” fails on a slower host
Symptom: dev laptops fine; CI runners or small VMs fail on startup.
Root cause: timing assumptions baked into startup; no backoff; healthchecks too strict; storage slower.
Fix: increase tolerance windows; measure actual time-to-ready; fix the slowest dependency rather than hiding it.
8) Containers show healthy, but the network path for real clients is broken
Symptom: internal healthchecks green; external clients time out.
Root cause: healthcheck only tests localhost; service binds wrong interface; port publishing/ingress misconfigured.
Fix: test readiness through the same path clients use, or include a secondary synthetic check from outside the container network.
Checklists / step-by-step plan
Step-by-step plan: fix false starts in an existing Compose stack
- Inventory dependencies. For each service, write down what it truly requires to serve traffic: DB connectivity, queue connectivity, cache, secrets, filesystem writable, schema version.
- Add a readiness endpoint (or equivalent) to each app service. If it’s not HTTP, implement a small CLI check that performs a minimal real operation (e.g., a DB query).
- Define Docker healthchecks that reflect readiness. Don’t curl a vanity endpoint; curl the one that checks dependencies.
- Set healthcheck timing realistically. Add start_period for known warm-ups; use timeouts that match your slowest normal startup.
- Choose gating strategy. If your environment supports Compose gating on health, use it. Otherwise gate in entrypoint or app logic.
- Implement retries with backoff and jitter in apps. This is non-negotiable for systems that must survive restarts and deploys.
- Split migrations into a dedicated job. Run it once, with a lock, and fail loudly if it can’t complete.
- Add a synthetic “stack ready” check. Something that verifies the user journey: login, fetch data, write a record (see the sketch after this list).
- Measure time-to-ready. Capture timestamps in logs and track median/p95 startup time; tune healthchecks based on data.
- Test the worst day. Reboot the host, throttle CPU, and simulate slow disk. If your readiness model survives that, it’ll survive Tuesday.
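The synthetic “stack ready” check does not need to be fancy. A CI-style sketch with placeholder endpoint, port, and credentials; it assumes the app publishes 8080 on the runner and the db service is Postgres:

#!/usr/bin/env bash
# wait-for-stack.sh: gate tests (or a deploy step) on real readiness, with a hard deadline.
set -euo pipefail
deadline=$((SECONDS + 120))
until curl -fsS http://localhost:8080/readyz >/dev/null 2>&1 \
      && docker compose exec -T db pg_isready -U postgres >/dev/null 2>&1; do
  if (( SECONDS >= deadline )); then
    echo "stack not ready after 120s" >&2
    exit 1
  fi
  sleep 3
done
echo "stack ready; starting tests"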
Checklist: what “good” looks like
- Every service has a meaningful readiness signal.
- Every dependency has a healthcheck that matches usability.
- No service exits at first connection failure; retries are bounded and logged.
- Migrations run once, explicitly, not as a side effect of “app start.”
- Healthchecks tolerate normal warm-up; they catch real deadlocks and misconfigs.
- The stack has at least one end-to-end synthetic check used in CI and/or deployment.
Checklist: what to avoid (because it always comes back to bite)
- Hard-coded sleeps as “dependency management.”
- Healthchecks that only verify “port open.”
- Restart loops as your default recovery strategy.
- Multiple replicas racing migrations at boot.
- Timeouts tuned to your fastest laptop instead of your slowest normal environment.
FAQ
1) Isn’t depends_on enough for most stacks?
No. It addresses process start sequencing, not service usability. You’ll still get races on slow disks, after crashes,
or when dependencies do internal recovery.
2) Should I gate startup in Compose or inside the app?
Inside the app is more portable and more correct. Compose gating is a helpful layer, but your application will run in more places
than Compose: different hosts, CI, Kubernetes, systemd, maybe bare metal. Retrying dependencies is an application responsibility.
3) What’s the difference between liveness and readiness?
Liveness means “the process is alive.” Readiness means “the service can do its job for clients.” They are not interchangeable.
If you conflate them, you’ll either kill healthy-but-busy services or route traffic to broken-but-running ones.
4) If my service retries forever, isn’t that reliable?
It’s resilient but not necessarily reliable. Infinite retries can mask outages and create sustained load on dependencies.
Use backoff, jitter, and a maximum wait with clear error reporting.
5) What is a “false start” in this context?
A false start is when the orchestrator reports services started (or even healthy) but the system cannot actually serve correct traffic.
It often resolves “by itself,” which makes it easy to ignore until it burns you in production.
6) How do I write a good healthcheck for a database?
Prefer a database-native readiness tool (like pg_isready) and, when needed, a minimal query that exercises authentication
and the correct database. Be careful: a query can be expensive if it runs too frequently.
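As a sketch combining both, assuming the official postgres image (where local socket connections from inside the container are trusted) and a placeholder database named app:

  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d app && psql -U postgres -d app -Atc 'SELECT 1' >/dev/null"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s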
7) Why do my containers start fine in dev but fail in CI?
CI runners are often slower, more contended, and more variable. Timing assumptions collapse there first. Add start periods,
use readiness endpoints, and measure time-to-ready in both environments.
8) Is a TCP check (nc) enough for readiness?
It’s a useful first step, not a finish line. TCP open means a socket exists; it does not mean auth works, schema is correct,
or the service isn’t returning errors.
9) Can healthchecks hurt performance?
Yes. A frequent, heavy healthcheck (like a slow SQL query) can become self-inflicted load. Keep checks lightweight, reduce frequency,
and use start periods rather than aggressive probing.
10) What’s the simplest improvement with the biggest payoff?
Add a real readiness endpoint to the app and point the healthcheck to it. Then implement connection retries with backoff for DB and queues.
Those two changes eliminate most boot-time flakiness.
Conclusion: next steps you can do this week
Treat “started” as a mechanical event, not a success condition. If you want fewer incidents, stop asking Docker to infer readiness
and start providing it with signals that match real usability.
Practical next steps:
- Add readiness endpoints (or minimal real-operation checks) to your app services.
- Upgrade healthchecks to test readiness, not just “port open,” and tune start periods to match reality.
- Implement app-level retries with backoff and jitter for every external dependency.
- Split schema migrations into a dedicated, one-shot step with clear success/failure semantics.
- Adopt the fast diagnosis playbook and make it muscle memory: health, logs, in-container checks, then resource pressure.
False starts aren’t bad luck. They’re a design gap. Close it, and your “it works after I restart it” era can finally end.