Docker Healthchecks That Actually Catch Failures (Not Just “Process Running”)


You’ve seen it: the container is “Up”, the PID is alive, and the dashboards are green. Meanwhile customers are getting 502s,
the queue is growing teeth, and your on-call is learning new swear words in three languages.

Docker healthchecks are supposed to prevent this particular flavor of self-deception. Too often, they’re written like a nervous
intern’s first monitoring script: “is the process running?” That’s not a healthcheck. That’s a pulse check on a patient who
may already be clinically useless.

Define “healthy” like you mean it

“Healthy” is not “running.” A container can be running while doing absolutely nothing useful: stuck on a mutex, wedged on DNS,
out of file descriptors, unable to reach its database, returning only error pages, or silently dropping messages. Your job is
to choose a definition of health that aligns with your user-visible success.

Think in layers:

  • Process health: the binary is alive, not crash-looping, not OOM-killed.
  • Service health: it can accept requests and produce correct responses within bounds (latency, status codes, payload sanity).
  • Dependency health: it can reach what it must reach (DB, cache, broker, DNS) and degrade intentionally when it can’t.
  • Capacity health: it’s not “alive but saturated” (thread pool exhausted, queue full, disk full).

For Docker, healthchecks are most useful for answering one question: Should this container be considered eligible to serve?
That’s a readiness question. Docker doesn’t have a separate readiness probe like Kubernetes, so you either:

  1. Use Docker healthchecks as “readiness,” and don’t make them so aggressive that they kill a container for a transient blip.
  2. Or treat them as “liveness,” but then accept you’ll miss “alive but broken” cases.

My opinionated take: in Docker-only stacks (Compose, Swarm, plain docker run), make healthchecks behave like readiness:
“can I do the minimum useful work right now?” That catches the failures users care about.

Interesting facts and a little history

  • Fact 1: Docker healthchecks (the HEALTHCHECK instruction) arrived in Docker 1.12, largely to support orchestrated behavior without requiring external tooling.
  • Fact 2: Health status is stored per container and visible via docker inspect; it’s not a first-class metric unless you export it.
  • Fact 3: Compose didn’t initially gate depends_on on health; modern Compose can, but people still assume it always did.
  • Fact 4: Swarm uses health status for service scheduling decisions, but “healthy” doesn’t automatically mean “in the load balancer” in every setup.
  • Fact 5: A healthcheck runs inside the container’s namespaces, so it sees container DNS, container routing, and container filesystem—good and bad.
  • Fact 6: The healthcheck command is executed with /bin/sh -c when you use CMD-SHELL; quoting bugs and PATH surprises are common.
  • Fact 7: Exit code matters, not stdout: 0 means healthy, 1 means unhealthy, and 2 is reserved—don’t use it. Humans love pretty output; Docker loves the return code.
  • Fact 8: Early “health endpoints” in web apps became popular because load balancers needed a cheap yes/no signal; they were never meant to be a full monitoring suite.
  • Fact 9: A “check” that takes too long is effectively a denial-of-service against your own container if it stacks up or consumes scarce resources.

How Docker actually evaluates healthchecks

Docker healthchecks are a small state machine: start in starting, then move to healthy if checks pass,
and to unhealthy after a configured number of failures. It’s simple, which is both the appeal and the trap.

What the knobs really do

  • interval: how often to run the check.
  • timeout: how long to wait before considering the check failed.
  • retries: consecutive failures required to mark unhealthy.
  • start_period: grace window where failures don’t count (but checks still run).
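These knobs compose into a detection window. A back-of-the-envelope sketch (illustrative numbers, and an approximation—exact timing depends on where in the schedule the first failure lands):

```shell
# Back-of-the-envelope: how long a real failure can go unflagged. Each failing
# attempt costs up to (interval + timeout), and `retries` consecutive failures
# are needed before Docker flips the status to unhealthy.
interval=10; timeout=2; retries=3          # illustrative values, in seconds
echo "worst-case detection: ~$(( retries * (interval + timeout) ))s"
# prints: worst-case detection: ~36s
```

With those numbers, a wedged container can serve garbage for roughly half a minute before anything notices. Budget your thresholds with that window in mind.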

A healthcheck should be:

  • Cheap (milliseconds to tens of milliseconds when things are OK).
  • Bounded (hard timeouts; no hanging).
  • Representative (tests the thing users actually need).
  • Hard to fake (doesn’t just check a file exists).

One operational quote worth keeping on a sticky note:
“Hope is not a strategy.” — often attributed to Gene Kranz, and a staple of reliability engineering.

Design principles that catch real failures

1) Probe the critical path, not the happy path

If your service exists to serve HTTP backed by a database, then the critical path is “accept request → execute a cheap DB query → return.”
Your healthcheck should exercise that path at a minimal cost.

Avoid a check that hits only an in-memory endpoint that never touches dependencies. That’s how you get green healthchecks while
the database is on fire.

2) Prefer cheap synthetic transactions over full “self tests”

A good check is a tiny transaction with a bounded budget: one lightweight query, one ping, one small GET that exercises routing, auth, and serialization.
A bad check is a full integration suite inside production containers.

3) Decide what failures should trigger a restart vs. just removal from traffic

Docker healthchecks don’t restart containers on their own: plain restart policies ignore health status, so acting on “unhealthy” takes Swarm, an orchestrator, or external tooling. Even then,
restarting on every dependency blip is a classic way to turn “a transient DB hiccup” into “a full outage.”

If the dependency is down, you often want to stay up and serve degraded responses, or at least keep your process warm while waiting.
Healthchecks can still be used to pull you out of rotation (or keep you from being added too early), not necessarily to kill you.

4) Fail fast on resource exhaustion

“Alive but saturated” is one of the most expensive failure modes because it looks like partial success. Your check should catch:
thread pools pinned, request queue full, disk full, or inability to create files/sockets.

A neat trick is to check that you can perform a small allocation or open a socket, without turning the healthcheck into a benchmark.
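A minimal sketch of that trick, assuming a Linux container where PID 1 is the service process; the 90% threshold is an arbitrary assumption you should tune:

```shell
# Sketch: fail the check when a process nears its open-file limit.
# PID 1 default and the 90% threshold are assumptions; tune both.
fd_check() {
  pid=${1:-1}                                             # container's main process
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits") || return 1
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  if [ "$used" -ge $(( limit * 9 / 10 )) ]; then
    echo "fd saturation: $used/$limit"                    # one line for the health log
    return 1
  fi
  echo "fd ok: $used/$limit"
}

# Example: fd_check || exit 1   (after sourcing this file in a CMD-SHELL check)
```

It costs one /proc read, and its single output line lands in the health log where the on-call can actually see it.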

5) Make the check deterministic and local, but not clueless

It should run quickly and consistently. That means:

  • Use fixed timeouts (curl --max-time, timeout).
  • Minimize network hops.
  • Avoid calling third-party APIs from healthchecks. If you do, you’re outsourcing your availability to their rate limits.

Joke #1: A healthcheck that calls an external API is like asking your neighbor if your house is on fire—by text message—during a tornado.

6) Return meaningful exit codes and log enough context

Docker records the last few healthcheck outputs. Use that. Print one-line context on failure: what dependency failed, what timeout, what status.
Don’t print megabytes. This isn’t a flight recorder.
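One way to get that one-line context is a tiny wrapper around whatever probe you run. This is a sketch, not a standard tool:

```shell
# Sketch: run any probe with its output captured; on failure, emit a single
# bounded line of context for Docker's health log.
probe_wrap() {
  out=$("$@" 2>&1)
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "probe failed rc=$rc: $out" | cut -c1-160       # one line, bounded size
    return 1
  fi
  echo ok
}

# Example: probe_wrap curl -fsS --max-time 1 http://127.0.0.1:8080/healthz
```

The exit code still drives Docker’s decision; the captured text is purely for the human reading `docker inspect` later.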

Healthcheck patterns that work in production

Pattern A: HTTP health endpoint with dependency sampling

Build a /healthz endpoint that checks:

  • the app can accept requests
  • the DB connection pool can borrow a connection and run a SELECT 1 (or equivalent)
  • cache/broker connectivity if they are hard requirements

Keep it cheap. If you need a deep check, put it on a different endpoint (e.g., /readyz vs /healthz),
but Docker gives you one lever, so you’ll likely pick the “readiness-ish” version.

Pattern B: Direct dependency probe from the container

When you can’t change the app, do the next best thing: probe the service over its real interfaces from inside the container’s namespaces.
For example: check that the local HTTP port returns 200, and that the DB TCP port is reachable.

This is less semantically rich than app-level checks, but still catches the “process is alive, but socket isn’t listening” class of failures.
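A Compose-level sketch of this pattern; the image name, port, and `db` hostname are assumptions:

```yaml
services:
  legacy-app:
    image: example/legacy:1.0        # hypothetical image you can't modify
    healthcheck:
      # Two bounded probes: local HTTP answers, and the DB port accepts TCP.
      test: ["CMD-SHELL", "curl -fsS --max-time 1 http://127.0.0.1:8080/ >/dev/null && nc -z -w 1 db 5432"]
      interval: 10s
      timeout: 3s
      retries: 3
```

Remember the check runs inside the container, so `curl` and `nc` must actually exist in that image.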

Pattern C: Queue lag / backlog threshold (carefully)

For consumers, “healthy” may mean “I’m making forward progress.” That can be:

  • message offset is advancing
  • queue depth below a threshold
  • dead letter rate not exploding

The danger is flapping: queue depth can spike naturally. Use thresholds with retries and time windows, not a single sample.
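A sketch of windowed thresholding in shell; the threshold, window size, and state path are all assumptions. State must live on disk because Docker runs each check as a fresh process:

```shell
# Sketch: flag a consumer unhealthy only after several consecutive bad samples.
backlog_check() {
  depth=$1                         # real use: read this from your broker's CLI/metrics
  threshold=${2:-10000}
  window=${3:-3}
  state=${BACKLOG_STATE:-/tmp/backlog.fails}
  if [ "$depth" -gt "$threshold" ]; then
    prev=$(cat "$state" 2>/dev/null); prev=${prev:-0}
    fails=$(( prev + 1 ))                                # one more bad sample in a row
  else
    fails=0                                              # any good sample resets the streak
  fi
  echo "$fails" > "$state"
  if [ "$fails" -ge "$window" ]; then
    echo "backlog $depth > $threshold for $fails checks"
    return 1
  fi
  echo "ok depth=$depth fails=$fails"
}
```

This stacks with Docker’s own `retries`, so a single spike has to survive both filters before anyone gets paged.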

Pattern D: Detect deadlocks and event loop stalls

Some of the ugliest outages are deadlocks and stalls: the process is alive, the port is open, but requests never complete.
Your healthcheck should include a strict response time budget. A slow “success” is a failure in disguise.

Pattern E: Disk and filesystem sanity for stateful-ish containers

If the container writes to a volume, you need to know when the filesystem is full, mounted read-only, or permissions broke after
an image change. Check you can write and fsync a tiny file to the intended path—once in a while, not every second.
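A sketch of that write-and-fsync probe; the default path is an assumption:

```shell
# Sketch: confirm the data path accepts a small write plus fsync.
# /data is an assumption; point it at the path your service actually writes.
disk_check() {
  dir=${1:-/data}
  f="$dir/.healthcheck.$$"
  if dd if=/dev/zero of="$f" bs=512 count=1 conv=fsync 2>/dev/null; then
    rm -f "$f"
    echo "disk ok: $dir"
  else
    rm -f "$f" 2>/dev/null
    echo "disk write failed: $dir"     # covers full, read-only, or broken permissions
    return 1
  fi
}
```

Call it from a lower-frequency check (or gate it behind a modulo counter) so you’re not fsyncing every few seconds.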

Pattern F: DNS sanity (because DNS failures look like “everything is broken”)

A service that can’t resolve internal names will fail in ways that look like dependency outages. A simple getent hosts
or nslookup against a required name can save time.
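A bounded version of that lookup; “db” is an assumption—probe the name your service actually resolves:

```shell
# Sketch: bounded name-resolution check, so a slow resolver can't hang the probe.
dns_check() {
  name=${1:-db}
  if timeout 2 getent hosts "$name" >/dev/null 2>&1; then
    echo "dns ok: $name"
  else
    echo "dns failed: $name"
    return 1
  fi
}
```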

Baseline examples (Dockerfile / Compose)

In a Dockerfile, use healthchecks only if the image truly owns the runtime contract. If health depends on deployment-specific
dependencies, Compose-level healthchecks are often a better fit.

cr0x@server:~$ cat Dockerfile
FROM alpine:3.20
RUN apk add --no-cache curl
HEALTHCHECK --interval=10s --timeout=2s --retries=3 --start-period=20s \
  CMD curl -fsS --max-time 1 http://127.0.0.1:8080/healthz || exit 1
cr0x@server:~$ cat compose.yaml
services:
  api:
    image: example/api:1.9.3
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS --max-time 1 http://127.0.0.1:8080/healthz || exit 1"]
      interval: 10s
      timeout: 2s
      retries: 3
      start_period: 25s
    depends_on:
      db:
        condition: service_healthy
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -h 127.0.0.1 || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10
      start_period: 20s

Notice the bias: the API checks itself at localhost (no networking ambiguity), and Postgres uses its own cheap readiness tool.
If you can use purpose-built probes (pg_isready, redis-cli ping), do it.

Practical tasks: commands, outputs, decisions (12+)

Healthchecks fail for reasons that are rarely mysterious. They’re usually “network,” “DNS,” “timeouts,” “permissions,” or “resource exhaustion.”
Here are concrete tasks that let you stop guessing.

Task 1: See container health status and last check output

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}'
NAMES           STATUS
api             Up 2 hours (unhealthy)
db              Up 2 hours (healthy)

What it means: Docker thinks api healthcheck is failing repeatedly.

Decision: Inspect health logs immediately; don’t restart blindly yet.

cr0x@server:~$ docker inspect --format '{{json .State.Health}}' api | jq
{
  "Status": "unhealthy",
  "FailingStreak": 7,
  "Log": [
    {
      "Start": "2026-02-04T09:44:00.123456789Z",
      "End": "2026-02-04T09:44:01.126789012Z",
      "ExitCode": 1,
      "Output": "curl: (28) Operation timed out after 1000 milliseconds with 0 bytes received\n"
    }
  ]
}

What it means: Your check is timing out, not returning non-200. That’s a stall, a bind issue, or severe saturation.

Decision: Check if the service is listening and whether localhost requests complete within budget.

Task 2: Validate the healthcheck command exactly as Docker runs it

cr0x@server:~$ docker inspect --format '{{json .Config.Healthcheck.Test}}' api
["CMD-SHELL","curl -fsS --max-time 1 http://127.0.0.1:8080/healthz || exit 1"]

What it means: It’s executed via shell, so shell semantics apply.

Decision: Run that same command inside the container to confirm environment and tooling match assumptions.

Task 3: Run the probe manually inside the container

cr0x@server:~$ docker exec -it api sh -lc 'curl -fsS --max-time 1 http://127.0.0.1:8080/healthz; echo "rc=$?"'
curl: (28) Operation timed out after 1000 milliseconds with 0 bytes received
rc=28

What it means: The endpoint is not responding within 1 second from inside the container itself.

Decision: Determine whether it’s not listening, stuck, or too slow. Move to socket checks and app logs.

Task 4: Check if the port is listening (without assuming netstat exists)

cr0x@server:~$ docker exec -it api sh -lc 'ss -lntp | head'
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
LISTEN 0      4096   0.0.0.0:8080      0.0.0.0:*    users:(("api",pid=1,fd=7))

What it means: The service is listening on 8080.

Decision: If it listens but healthcheck times out, suspect application stall, thread exhaustion, or downstream blocking.

Task 5: Measure latency and status code with a hard budget

cr0x@server:~$ docker exec -it api sh -lc 'curl -s -o /dev/null -w "code=%{http_code} time=%{time_total}\n" --max-time 2 http://127.0.0.1:8080/healthz'
code=200 time=1.873

What it means: It’s returning 200, but it’s slow—nearly 2 seconds.

Decision: Either raise the healthcheck budget (carefully) or fix the slowness. Don’t “fix” this by bumping timeouts to 30 seconds.

Task 6: Check container logs around the failure window

cr0x@server:~$ docker logs --since 10m --tail 200 api
2026-02-04T09:41:12Z WARN db pool exhausted; waiting for connection
2026-02-04T09:41:13Z WARN db query timeout after 1500ms
2026-02-04T09:41:15Z ERROR /healthz dependency=db timeout=1500ms

What it means: Health endpoint is doing the right thing: it’s reporting DB issues. The service isn’t healthy enough to serve.

Decision: Stop tuning healthchecks. Fix DB or connection pool sizing. Investigate DB reachability and saturation.

Task 7: Validate DB connectivity from inside the app container

cr0x@server:~$ docker exec -it api sh -lc 'nc -vz -w 1 db 5432; echo "rc=$?"'
db (172.20.0.3:5432) open
rc=0

What it means: TCP connectivity exists; this is not a basic network partition.

Decision: Look at DB load, auth, pool settings, DNS latency, or slow queries—things above TCP.

Task 8: Check DNS resolution time (hidden killer)

cr0x@server:~$ docker exec -it api sh -lc 'time getent hosts db'
172.20.0.3      db

real    0m0.003s
user    0m0.000s
sys     0m0.002s

What it means: DNS/hosts lookup is fast here.

Decision: If this is slow (hundreds of ms), fix resolver config, Docker DNS, or search domains. Don’t blame the DB first.

Task 9: Check for file descriptor exhaustion

cr0x@server:~$ docker exec -it api sh -lc 'cat /proc/1/limits | grep -i "open files"'
Max open files            1024                 1024                 files

What it means: Low FD limit. Under load, you can hit this and become “alive but useless.”

Decision: Increase ulimit in the service definition; also check for FD leaks.

Task 10: Look for OOM kills and memory pressure

cr0x@server:~$ docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' api
OOMKilled=false ExitCode=0

What it means: Not an OOM kill in this container lifecycle.

Decision: If true, fix memory limits, leaks, or JVM/node heap sizing. Healthchecks won’t save a process the kernel keeps murdering.

Task 11: Confirm the healthcheck tool exists in the image

cr0x@server:~$ docker exec -it api sh -lc 'command -v curl || echo "curl missing"'
/usr/bin/curl

What it means: The binary exists where expected.

Decision: If missing, don’t “fix” by using busybox weirdness. Add the tool or rewrite the check in what you actually ship.

Task 12: Verify healthcheck isn’t chewing CPU or spawning zombies

cr0x@server:~$ docker exec -it api sh -lc 'ps -o pid,ppid,stat,comm | head -n 10'
PID  PPID STAT COMMAND
1    0    S    api
42   1    S    worker
77   1    S    worker
201  1    S    curl

What it means: If you see a growing pile of curl processes, your healthcheck may be hanging and piling up.

Decision: Add timeouts (--max-time) and ensure the check command cannot block forever.

Task 13: Use events to correlate “unhealthy” with restarts and deploys

cr0x@server:~$ docker events --since 30m --filter container=api
2026-02-04T09:31:00Z container health_status: healthy
2026-02-04T09:40:10Z container health_status: unhealthy
2026-02-04T09:40:11Z container exec_start: curl -fsS --max-time 1 http://127.0.0.1:8080/healthz || exit 1

What it means: It went unhealthy at a specific time. That’s your pivot point.

Decision: Check deploy history, config reloads, DB maintenance windows, certificate rotations, DNS changes right then.

Task 14: Inspect Compose dependency gating (are you relying on a lie?)

cr0x@server:~$ docker compose ps
NAME            IMAGE             COMMAND                  SERVICE   STATUS
stack-api-1     example/api       "..."                    api       running (unhealthy)
stack-db-1      postgres:16       "docker-entrypoint..."   db        running (healthy)

What it means: Compose started the API, and it’s still running but flagged unhealthy.

Decision: If you expected Compose to delay API start until DB is ready, verify your depends_on conditions are supported and correct.

Task 15: Verify that “unhealthy” affects traffic (it might not)

cr0x@server:~$ docker inspect --format '{{.State.Health.Status}} {{.Name}}' api
unhealthy /api

What it means: Docker knows it’s unhealthy, but your reverse proxy may still send traffic.

Decision: Confirm your load balancer/proxy integrates with health status, or implement an explicit upstream health mechanism.

Joke #2: If your proxy ignores container health, that “HEALTHCHECK” line is basically a decorative plant—alive, green, and not solving problems.

Fast diagnosis playbook

When the pager hits and a container is “unhealthy,” your job is to find the bottleneck before you “fix” it into a worse outage.
Here’s the order that minimizes time-to-truth.

First: prove what exactly is failing

  1. Inspect the last healthcheck output (docker inspect ... .State.Health). Is it timeout, non-200, DNS error, auth error?
  2. Run the same command manually inside the container. If it passes interactively, you may have environment, PATH, or timing differences.
  3. Check whether the service listens on the expected port (ss -lntp).

Second: classify the failure mode

  1. Timeout: suspect deadlock, thread pool exhaustion, GC pause, downstream blocking, or kernel resource issues.
  2. Connection refused: process not listening, crashed, or bound to the wrong interface.
  3. DNS failure: resolver issues, wrong network, search domain bloat, or missing service name.
  4. Non-200: app-level logic, dependency failing, bad config, migrations incomplete.

Third: verify dependencies and resources

  1. TCP reachability to dependencies (nc -vz).
  2. App logs for pool exhaustion, timeouts, auth failures.
  3. FD limits and current usage (/proc/1/limits, lsof if available).
  4. Disk fullness and mount state (inside container if it writes to volumes).

When to restart vs. when to hold steady

  • Restart helps if the process is wedged (deadlock) and your system can tolerate loss of in-flight work.
  • Restart hurts if the dependency is down and the app needs warm caches, migrations, or backoff logic. You’ll amplify load and extend recovery.

Common mistakes: symptoms → root cause → fix

Mistake 1: Healthcheck is “ps | grep”

Symptoms: Health is always healthy until the process crashes, but users see errors for minutes/hours.

Root cause: You’re checking existence, not functionality.

Fix: Probe the actual service interface (HTTP/TCP) with a strict timeout; optionally include dependency sampling.

Mistake 2: Healthcheck hits a “static OK” endpoint

Symptoms: Health remains green during DB outages; error rates spike.

Root cause: Endpoint doesn’t validate critical dependencies.

Fix: Make /healthz include cheap dependency checks (borrow DB connection, ping cache). Keep it fast and bounded.

Mistake 3: No timeouts, so checks hang and pile up

Symptoms: Many stuck curl processes; CPU climbs; container becomes unstable.

Root cause: Healthcheck command blocks indefinitely; Docker keeps scheduling it.

Fix: Use curl --max-time or timeout; keep timeout lower than the Docker healthcheck timeout.
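As a sketch, the budgets should nest like this (numbers are illustrative):

```yaml
healthcheck:
  # curl gives up at 1s, inside Docker's 2s ceiling — so Docker records
  # curl's error text instead of just reporting that it killed a hung probe.
  test: ["CMD-SHELL", "curl -fsS --max-time 1 http://127.0.0.1:8080/healthz || exit 1"]
  timeout: 2s
  interval: 10s
  retries: 3
```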

Mistake 4: Healthcheck is too strict and causes flapping

Symptoms: Container oscillates healthy/unhealthy; restarts happen; traffic shifts constantly.

Root cause: Over-sensitive thresholds; interval too short; retries too low; start_period too small.

Fix: Increase retries, lengthen interval, add start_period. Also fix underlying latency spikes—don’t just pad timeouts.

Mistake 5: Healthcheck uses external DNS names not resolvable inside container

Symptoms: curl: (6) Could not resolve host; intermittent, environment-specific.

Root cause: Container DNS differs from host; missing network; wrong search domains.

Fix: Use service names on the Docker network; validate with getent hosts and explicitly set DNS if needed.

Mistake 6: Healthcheck success does not affect traffic

Symptoms: Container is unhealthy, but requests still land on it; users see errors.

Root cause: Proxy/load balancer isn’t wired to container health.

Fix: Configure proxy to use its own upstream checks or integrate with orchestrator behavior; don’t assume Docker health changes routing.

Mistake 7: Healthcheck includes expensive DB queries

Symptoms: DB load increases with replica count; healthchecks fail under load and worsen the incident.

Root cause: Healthchecks turned into a mini load test; they compete with production traffic.

Fix: Use cheap constant-time queries; cache results briefly in-app if needed; reduce frequency.

Mistake 8: Healthcheck relies on tools not present in minimal images

Symptoms: Healthcheck always fails with “command not found.”

Root cause: You used curl, bash, or nc but shipped scratch/distroless without them.

Fix: Add a tiny purpose-built probe binary, use app-native health endpoint, or include minimal tooling intentionally.

Checklists / step-by-step plan

Step-by-step: write a healthcheck that catches reality

  1. Pick the user-visible contract. “Can serve HTTP within 500ms and reach DB.” Write it down.
  2. Decide readiness vs liveness behavior. In Docker-only, default to readiness-style.
  3. Choose the probe method. Prefer app endpoint; otherwise local TCP/HTTP with bounded timeouts.
  4. Add strict timeouts. Every check must end quickly, even when the system is sick.
  5. Handle warm-up properly. Use start_period to avoid false negatives during migrations and cache warm-up.
  6. Keep it cheap. One DB ping, not a report query. One HTTP request, not a full login flow.
  7. Make failures explain themselves. Single-line output like db timeout or http 503.
  8. Load test your health endpoint. It should stay fast under load; if it slows first, it’s not a reliable signal.
  9. Wire health to traffic decisions. If your proxy ignores it, fix that or don’t pretend healthchecks do routing.
  10. Review after incidents. Every outage teaches you what your healthcheck didn’t catch.

Checklist: do not ship until these are true

  • Healthcheck command has a hard timeout shorter than Docker’s timeout.
  • Healthcheck is tested inside the running container image (not on your laptop).
  • Health endpoint checks at least one critical dependency if the service cannot function without it.
  • Start period covers worst-case startup time in production (migrations, cold caches).
  • Failures are actionable from the last health log line.
  • You understand what “unhealthy” triggers in your deployment (restart? removal from traffic? nothing?).

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized fintech ran a Docker Compose stack behind a reverse proxy. The team added healthchecks and congratulated themselves:
the API waited for the database using depends_on, so startup sequencing was “solved.”

Then a routine database patch rebooted the DB host. The API containers stayed up, continued to accept requests, and started
returning 500s. Healthchecks remained green because /health was a static “OK” endpoint. The proxy kept routing traffic
to all replicas, which were now mostly error factories.

The on-call assumed Docker would stop routing to unhealthy containers, because that’s what “health” means in human language.
But their proxy didn’t read Docker health status; it only cared whether the upstream TCP port was open.

The fix wasn’t dramatic. They changed the endpoint to include a cheap DB ping, and they configured the proxy to do its own
upstream health check against that endpoint. They also documented, in plain English, what health status affects in their stack.
The important part wasn’t the code. It was deleting the wrong assumption before it deleted their Saturday.

Mini-story 2: The optimization that backfired

A media company had a large fleet of stateless API containers. Someone noticed healthchecks were generating “unnecessary” DB
traffic: a quick query every 5 seconds per container adds up. The team “optimized” by changing the healthcheck to only verify
that the process was running and the port was open.

A month later, they had an incident where the DB connection pool configuration regressed. Under load, the API would accept
requests, then block waiting for a DB connection until client timeouts triggered. The port stayed open, and the process stayed alive.
Healthchecks were happy. Users were not.

The outage was long because symptoms were messy: CPU wasn’t pegged, memory looked fine, and error logs were noisy but not definitive.
The “optimization” removed the one automated signal that would have classified the failure quickly: dependency saturation.

They reintroduced a dependency-aware healthcheck, but with a smarter budget: lower frequency, a cheap DB ping, and some caching
inside the health endpoint to avoid stampeding the DB. They also learned a boring truth: if you optimize away observability, you
will pay the cost later with interest.

Mini-story 3: The boring but correct practice that saved the day

An enterprise SaaS provider had a habit that looked tedious: every service had a written “health contract” in the repo.
It specified what /healthz checks, what it doesn’t, and what time budget it must meet. It also included a runbook snippet:
the exact healthcheck command and how to reproduce it inside the container.

During a certificate rotation, a subset of containers began failing outbound TLS to a dependency. The healthcheck endpoint did
a minimal dependency call and started returning 503 with a one-line reason: tls handshake failed.

The on-call didn’t have to guess whether it was DNS, routing, or code. They ran the documented command inside a bad container,
saw the same TLS error, and compared environment differences. The issue was traced to an outdated CA bundle in a base image used
by one service line. Fix was to rebuild with the correct CA bundle and redeploy.

Nobody wrote a heroic postmortem about it because it wasn’t heroic. It was predictable, quickly classified, and quickly fixed.
The boring practice—standard contracts and reproducible checks—kept the incident from turning into a company-wide folklore story.

FAQ

1) Should my Docker healthcheck restart the container automatically?

Docker healthchecks themselves don’t restart containers. Restarts come from restart policies or orchestrators. Decide based on failure mode:
restart for deadlocks/crashes; avoid restart loops when dependencies are down.

2) What’s the difference between Docker healthchecks and Kubernetes liveness/readiness?

Kubernetes separates “should I restart?” (liveness) from “should I send traffic?” (readiness). Docker gives you one health status,
so you must choose what it represents and ensure your routing layer respects it.

3) How strict should timeouts be?

Strict enough to detect stalls, but not so strict that normal jitter marks you unhealthy. A common pattern is 1–2 seconds timeout,
10 seconds interval, 3 retries, with a start period that matches worst-case startup.

4) Should healthchecks include database queries?

If the service cannot function without the DB, yes—use a cheap query or ping with a strict timeout. If the service can degrade
gracefully, consider a lighter check and expose dependency status separately for alerting.

5) My health endpoint causes load. What do I do?

Make the work constant-time, cache results briefly, and reduce frequency. Don’t remove dependency checks entirely; tune them.
Also ensure health endpoints are cheap to compute (no heavy auth, no giant JSON serialization).

6) Why does my healthcheck pass in docker exec but fail in Docker health status?

Common causes: different shell, PATH differences, missing environment variables, or timing (your manual test happens during a good moment).
Compare the exact command from .Config.Healthcheck.Test and run it with sh -lc.

7) Should I use CMD or CMD-SHELL for healthchecks?

Prefer CMD (exec form) when possible because it avoids shell quoting issues. Use CMD-SHELL when you need pipes,
conditionals, or multiple checks. If you use shell, be explicit and careful.
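For reference, a sketch of the exec form (endpoint and port are assumptions):

```dockerfile
# Exec form: no shell involved, so no quoting or PATH-via-shell surprises.
HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
  CMD ["curl", "-fsS", "--max-time", "1", "http://127.0.0.1:8080/healthz"]
```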

8) How do I prevent flapping during startup migrations?

Use start_period long enough to cover worst-case migrations and cold starts. Also make the health endpoint return a
clear “starting” failure reason during warm-up if possible.

9) Can I make healthchecks validate disk space or volume mounts?

Yes, and you should for anything writing to volumes. Check that the path is writable and not full. Do it cheaply and not too often.

10) Is it okay for a healthcheck to call other services?

Only if those services are hard requirements for correctness. Keep calls minimal and bounded. Avoid third-party calls and avoid
expensive “deep” checks on the hot path.

Next steps you can do this week

If your current healthchecks are “process running,” you’re not checking health—you’re checking whether the lights are on.
Replace them with probes that reflect the minimum useful work your service must do, with strict timeouts and meaningful failure output.

Practical next steps:

  1. Pick one critical service and rewrite its healthcheck to hit localhost HTTP with a 1–2 second budget.
  2. Update the health endpoint to sample at least one critical dependency (cheaply) and return actionable failure text.
  3. Add start_period based on real startup time, not optimism.
  4. Verify what “unhealthy” does in your routing layer; make it matter, or stop pretending it matters.
  5. After the next incident, update the health contract and runbook so future-you doesn’t have to rediscover the same failure mode at 3 a.m.