Docker Healthchecks Done Right — Stop Deploying “Green” Failures

The most expensive outage is the one that looks healthy. Dashboards are green, deploys are “successful,”
and meanwhile customers are staring at timeouts because the container is up, but the service is not.
Docker happily reports running. Your orchestrator shrugs. You get paged anyway.

Healthchecks are supposed to be the lie detector. Too often they’re a sticker on the forehead that says “OK”
because someone ran curl localhost once and called it a day. Let’s fix that—with checks that match
real failure modes, don’t create new ones, and actually help you make decisions under pressure.

What Docker healthchecks are (and are not)

A Docker healthcheck is a command Docker runs inside the container on a schedule.
The command exits 0 for healthy, non-zero for unhealthy. Docker tracks a state:
starting, healthy, or unhealthy.

That’s it. No magic. No distributed consensus. No “is my app good for customers” guarantee.
It’s a local probe. Useful, but only if you aim it at the right target.
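
Here's the whole contract in one minimal, hypothetical example. The image name and endpoint are placeholders from this article, and the same knobs exist as HEALTHCHECK options in a Dockerfile and as healthcheck keys in Compose.

# Sketch: the check is just a command Docker runs inside the container.
# Exit 0 means healthy, non-zero means unhealthy. Everything else is timing.
docker run -d --name api-demo \
  --health-cmd='curl -fsS -m 1 http://127.0.0.1:8080/healthz || exit 1' \
  --health-interval=30s \
  --health-timeout=2s \
  --health-start-period=15s \
  --health-retries=3 \
  myorg/api:1.9.3

# Read back the state Docker tracks (starting, healthy, unhealthy):
docker inspect --format '{{.State.Health.Status}}' api-demo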

What a Docker healthcheck can do well

  • Detect deadlocks and wedged processes that still have a PID.
  • Detect local dependency failures (database socket not accepting, cache auth failing).
  • Gate startup sequencing in Docker Compose so you don’t stampede the database.
  • Provide a signal to orchestrators and humans: “this container is lying to you.”

What it can’t do (stop asking it to)

  • Prove end-to-end user success. A local check can’t validate DNS, routing, external ACLs, or client behavior.
  • Replace metrics. “Healthy” is a boolean. Latency, saturation, and error budget are not.
  • Fix bad rollouts. If your check is wrong, it will confidently certify the wrong thing.

Here’s the operational truth: a healthcheck is a contract. You’re defining “healthy enough to receive traffic”
or “healthy enough to keep running.” Write that contract like your pager depends on it, because it does.

Interesting facts and context (the stuff that explains today’s mess)

  • Docker added HEALTHCHECK in 2016 with Docker 1.12, the same release that introduced Swarm mode, largely to support orchestration patterns before Kubernetes dominated.
  • Healthchecks run inside the container’s namespace, so they see container DNS, local sockets, and localhost differently than the host does.
  • Docker tracks health state in container metadata and exposes it via docker inspect; it’s not a log line unless you look for it.
  • Compose’s original depends_on did not wait for readiness; later versions introduced conditional health-based dependencies, but many stacks still run the old mental model.
  • Kubernetes popularized separate “liveness” and “readiness” probes, which made people realize a single health endpoint often mixes two incompatible goals.
  • Early “health endpoints” were often just “return 200 OK” because load balancers only needed a heartbeat; modern systems need dependency-aware readiness and fast-fail behavior.
  • cgroups and CPU throttling can make healthy apps look dead; a 1s timeout under CPU pressure is basically a coin flip.
  • Restart policies predate healthchecks in many setups; operators glued checks to restarts and accidentally built self-inflicted denial-of-service loops.

Failure modes that “green” deployments hide

1) The process is alive; the service is dead

The classic: your main process still runs, but it’s stuck. Deadlock, infinite GC, waiting on a broken dependency,
blocked on disk, or wedged on a mutex. Docker reports “Up.” Your reverse proxy times out.

A good healthcheck tests the service behavior, not the existence of a PID.

2) Port open, app not ready

TCP listen sockets come up early. Frameworks do that. Your app is still warming caches, running migrations,
loading models, or waiting for the database.

A port check is a liveness-ish signal. Readiness requires application-level confirmation.

3) Partial dependency failure

Your service can respond to /health, but it can’t talk to Redis due to auth mismatch,
can’t write to Postgres due to permission changes, or can’t reach an external API because egress rules changed.

If you don’t include dependency checks, you will deploy a beautifully healthy failure.

4) Slow failure: it works, but not in time

Under load or noisy neighbors, latency blows past client timeouts. Your health endpoint still returns 200—eventually.
Meanwhile, users see errors.

Your healthcheck should have a latency budget and enforce it with timeouts. A “healthy after 30 seconds” endpoint is just optimism.

5) The check itself causes the outage

Checks that hit expensive endpoints, run migrations, or open new DB connections every second are a great way to
take a system down while congratulating yourself on “adding reliability.”

Joke #1: A healthcheck that DDoSes your own database is still technically “testing production.” It’s just not the kind of testing you wanted.

Design principles: checks that tell the truth

Define what “healthy” means operationally

Pick one of these and be explicit:

  • Ready for traffic: can serve real requests within SLO-ish latency and has required dependencies.
  • Safe to keep running: process isn’t wedged; it can make progress; it can shut down gracefully.

Docker gives you one health state. That’s annoying. You can still model both by choosing what you care about for that container.
For edge proxies, “ready for traffic” matters. For a background worker, “safe to keep running” often matters more.

Fail fast, but not stupidly

A good check fails quickly when the container is genuinely broken, but it doesn’t flap during normal startup
or transient dependency blips.

  • Use start_period to avoid punishing slow warmups.
  • Use timeouts so a wedged call doesn’t block the healthcheck forever.
  • Use retries to avoid one-packet-loss turning into a restart storm.

Prefer local, cheap, deterministic probes

Healthchecks run frequently. Make them:

  • Local: hit localhost, a UNIX socket, or in-process state.
  • Cheap: avoid heavy queries, avoid creating new connection pools.
  • Deterministic: same input, same output; no randomness; no “sometimes it’s slow.”

Include dependencies, but choose the right depth

If your service cannot operate without Postgres, your check should confirm it can authenticate and run a trivial query.
If it can degrade gracefully (serve cached content), don’t fail the container just because Redis is down.

Operationally: your healthcheck should mirror your intended failure behavior.

Use exit codes intentionally

Docker only cares about the exit code (0 means healthy, 1 means unhealthy; 2 is reserved, so don't use it), but humans care why. Make your check print a short reason
to stderr/stdout before exiting non-zero. That reason shows up in docker inspect.
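
A minimal sketch that combines both ideas, assuming a shell check with pg_isready and redis-cli available in the image and illustrative Compose service names db and redis: Postgres failure fails the check, Redis failure only downgrades the status string, and every path prints a one-line reason.

#!/bin/sh
# Sketch: required vs optional dependencies, plus a human-readable reason.
# Assumes pg_isready and redis-cli exist in the image; "db" and "redis" are
# illustrative Compose service names.
set -eu

# Postgres is required: unreachable means unhealthy.
if ! pg_isready -h db -p 5432 -q; then
  echo "db unreachable"
  exit 1
fi

# Redis is optional (we can serve degraded/cached responses): report it, don't fail.
if [ "$(redis-cli -h redis ping 2>/dev/null)" = "PONG" ]; then
  redis_state=ok
else
  redis_state=degraded
fi

echo "ok: db=ok redis=$redis_state"
exit 0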

Make checks test the same path your traffic uses

If real traffic goes through Nginx to your app, healthchecking the app directly may skip a whole failure class:
broken Nginx config, exhausted worker connections, bad upstream DNS, TLS issues. Sometimes you want the proxy to check its upstream.
Sometimes you want an external LB to check the proxy. Layer your checks like you layer failures.

Quote (paraphrased idea) — Jim Gray: “Treat system failure as normal; engineer as if components will fail at any time.”

Fast diagnosis playbook: find the bottleneck fast

When a container is “healthy” but users are failing, you don’t need philosophy. You need a sequence.
This is the order that tends to surface the real constraint fastest.

First: is the health signal lying, or did the system change?

  • Check Docker’s health status and the last few health logs for that container.
  • Confirm your check is testing the right thing (dependency, path, latency).
  • Compare before/after deploy config for healthcheck changes, timeouts, and start periods.

Second: is the service actually serving on the expected interface?

  • Inside the container: check listening ports, DNS, and local connectivity.
  • From the host: check port mapping and firewall rules.
  • From a peer container: check service discovery and network policy equivalents.

Third: are we blocked on CPU, memory, disk, or a dependency?

  • CPU throttling and load average spikes can turn a 200ms handler into a 5s timeout.
  • OOM kills can produce “it works… until it doesn’t” cycles.
  • Disk saturation can freeze I/O-heavy services while the process remains alive.
  • Dependency saturation (DB max connections) often looks like random app timeouts.

Fourth: is the healthcheck causing harm?

  • Check frequency and cost: is it hammering the DB or app thread pool?
  • Check concurrency: healthchecks shouldn’t pile up.
  • Check side effects: health endpoints must be read-only.

The trick is to treat healthchecks like any other production load generator. Because they are.

Practical tasks (commands, expected output, and what you decide)

These are real operator moves. Each task includes: a command, what the output means, and the decision you make.
Use them during incidents and during calm engineering time when you’re trying to prevent the next one.

Task 1: See health status at a glance

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'
NAMES           STATUS                          IMAGE
api-1           Up 12 minutes (healthy)         myorg/api:1.9.3
db-1            Up 12 minutes (healthy)         postgres:16
worker-1        Up 12 minutes (unhealthy)       myorg/worker:1.9.3

What it means: Docker is running healthchecks and reporting state. (unhealthy) is not subtle.

Decision: If a critical component is unhealthy, stop treating “Up” as success. Investigate before scaling or routing traffic.

Task 2: Inspect the health log and failure reason

cr0x@server:~$ docker inspect --format '{{json .State.Health}}' worker-1 | jq
{
  "Status": "unhealthy",
  "FailingStreak": 5,
  "Log": [
    {
      "Start": "2026-01-02T08:11:12.123456789Z",
      "End": "2026-01-02T08:11:12.223456789Z",
      "ExitCode": 1,
      "Output": "redis ping failed: NOAUTH Authentication required\n"
    }
  ]
}

What it means: The check is failing consistently, and it prints a useful reason. Bless whoever wrote that output.

Decision: Fix credentials/config rather than restarting blindly. Also consider whether Redis auth failure should mark the worker unhealthy or degraded.

Task 3: Run the healthcheck command manually inside the container

cr0x@server:~$ docker exec -it api-1 sh -lc 'echo $0; /usr/local/bin/healthcheck.sh; echo exit=$?'
sh
ok: http=200 db=ok redis=ok latency_ms=27
exit=0

What it means: You’re executing the same probe Docker executes. It succeeded and returned quickly.

Decision: If users still fail, the issue is probably outside this container’s local view (network, proxy, LB, DNS) or a mismatch between health criteria and user path.

Task 4: Confirm what healthcheck Docker is actually running

cr0x@server:~$ docker inspect --format '{{json .Config.Healthcheck}}' api-1 | jq
{
  "Test": [
    "CMD-SHELL",
    "/usr/local/bin/healthcheck.sh"
  ],
  "Interval": 30000000000,
  "Timeout": 2000000000,
  "StartPeriod": 15000000000,
  "Retries": 3
}

What it means: Interval 30s, timeout 2s, start period 15s, retries 3. These numbers are the behavior.

Decision: If you see timeout=1s on a JVM service under CPU limits, you’ve found a future incident. Tune it now.

Task 5: Watch health transitions live

cr0x@server:~$ docker events --filter container=api-1 --filter event=health_status --since 10m
2026-01-02T08:03:12.000000000Z container health_status: healthy api-1
2026-01-02T08:08:42.000000000Z container health_status: unhealthy api-1
2026-01-02T08:09:12.000000000Z container health_status: healthy api-1

What it means: The container is flapping: it goes unhealthy, then recovers. That’s either a real intermittent issue or an oversensitive check.

Decision: If flapping aligns with load, you likely have resource saturation. If it’s random, raise timeout/retries and investigate network/DNS.

Task 6: Validate container-to-container DNS and routing

cr0x@server:~$ docker exec -it api-1 sh -lc 'getent hosts db && nc -zvw2 db 5432'
172.20.0.3 db
db (172.20.0.3:5432) open

What it means: DNS resolves and TCP connect works from the app container to the database container.

Decision: If this fails, don’t waste time inside the app. Fix network, service name, or Compose network configuration.

Task 7: Prove the app is listening where you think it is

cr0x@server:~$ docker exec -it api-1 sh -lc 'ss -lntp | head'
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
LISTEN 0      4096   0.0.0.0:8080       0.0.0.0:*     users:(("java",pid=1,fd=123))

What it means: The service listens on 0.0.0.0:8080. If it were on 127.0.0.1 only, your port mapping could be “up” but unreachable externally.

Decision: If it’s bound wrong, fix the app bind address. Don’t paper over it with host networking unless you enjoy regret.

Task 8: Test the real request path from the host (port mapping)

cr0x@server:~$ curl -fsS -m 2 -D- http://127.0.0.1:18080/healthz
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 2

ok

What it means: Through the published port, the health endpoint responds within 2 seconds.

Decision: If host curl fails but in-container curl works, your issue is mapping, firewall, proxy, or the app binding to the wrong interface.

Task 9: Identify CPU throttling that makes checks time out

cr0x@server:~$ docker stats --no-stream api-1
CONTAINER ID   NAME    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O     PIDS
a1b2c3d4e5f6   api-1   198.32%   1.2GiB / 1.5GiB       80.12%    1.3GB / 1.1GB  25MB / 2MB   93

What it means: High CPU and memory usage. If you also set tight health timeouts, you’ll see false negatives under load.

Decision: Either allocate more CPU/memory, reduce work, or widen health timeout so “healthy under expected load” remains true.

Task 10: Check for OOM kills or restarts that masquerade as “flakiness”

cr0x@server:~$ docker inspect --format 'RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' api-1
RestartCount=2 OOMKilled=true ExitCode=137

What it means: Exit code 137 and OOMKilled=true: the kernel killed it. Your healthcheck didn’t fail; the container died.

Decision: Fix memory limits, leaks, or spikes. Also ensure your startup start_period isn’t too short, or you’ll stack failures during recovery.

Task 11: Detect disk I/O stalls that freeze “healthy” processes

cr0x@server:~$ docker exec -it db-1 sh -lc 'ps -o stat,comm,pid | head'
STAT COMMAND PID
Ss   postgres 1
Ds   postgres 72
Ds   postgres 73

What it means: Processes in D state are stuck in uninterruptible I/O wait. They won’t respond to your nice little queries.

Decision: Stop blaming the application. Investigate storage latency, host disk saturation, noisy neighbors, or volume drivers.

Task 12: Confirm a dependency’s readiness with a purpose-built probe (Postgres)

cr0x@server:~$ docker exec -it db-1 sh -lc 'pg_isready -U postgres -h 127.0.0.1 -p 5432; echo exit=$?'
127.0.0.1:5432 - accepting connections
exit=0

What it means: Postgres is accepting connections. This is better than a port-open test because it speaks the protocol.

Decision: If this fails intermittently, look at max connections, checkpoints, disk, or recovery. Don’t just increase health retries and hope.

Task 13: Validate that your health endpoint is fast enough (latency budget)

cr0x@server:~$ docker exec -it api-1 sh -lc 'time -p curl -fsS -m 1 http://127.0.0.1:8080/healthz >/dev/null'
real 0.04
user 0.00
sys 0.00

What it means: 40ms locally. Great. If you see 0.9–1.0s with a 1s timeout, you’re living on the edge.

Decision: Set a timeout with margin. Healthchecks should fail for real slowness, not for normal jitter under load.

Task 14: Catch bad assumptions about “localhost” (proxy vs app)

cr0x@server:~$ docker exec -it nginx-1 sh -lc 'curl -fsS -m 1 http://127.0.0.1:8080/healthz || echo "upstream unreachable"'
upstream unreachable

What it means: Inside the Nginx container, localhost is Nginx—not your app container. This is a top-10 cause of useless healthchecks.

Decision: Point checks at the correct upstream hostname/service, or run the check in the correct container. “It works on my container” is not a networking strategy.

Healthcheck patterns for common services

Pattern A: HTTP service with dependency-aware readiness

For an API, a good check usually validates:
the HTTP server thread can respond quickly, and core dependencies are reachable and authenticated.
Keep it minimal: one tiny query, one cache ping, a shallow internal status check.

Example: Dockerfile healthcheck calling a script

cr0x@server:~$ cat Dockerfile
FROM alpine:3.20
# coreutils provides GNU date; BusyBox date can't print the millisecond timestamps healthcheck.sh uses
RUN apk add --no-cache curl ca-certificates coreutils
COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=30s --timeout=2s --start-period=20s --retries=3 CMD ["/usr/local/bin/healthcheck.sh"]
cr0x@server:~$ cat healthcheck.sh
#!/bin/sh
set -eu

t0=$(date +%s%3N)

# Fast local HTTP check (service path)
code=$(curl -fsS -m 1 -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/healthz || true)
if [ "$code" != "200" ]; then
  echo "http failed: code=$code"
  exit 1
fi

# Optional dependency: DB shallow check via app endpoint (preferred) or direct driver probe
# Here we assume /readyz includes db connectivity check inside the app.
code2=$(curl -fsS -m 1 -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/readyz || true)
if [ "$code2" != "200" ]; then
  echo "ready failed: code=$code2"
  exit 1
fi

t1=$(date +%s%3N)
lat=$((t1 - t0))
echo "ok: http=200 ready=200 latency_ms=$lat"
exit 0

Why this works: The check enforces a time budget and validates the actual service interface.
It avoids deep dependency logic in shell when the app can do it better (and reuse existing pools).

Pattern B: Database containers: use native tools, not “port open”

If you’re healthchecking Postgres, use pg_isready. For MySQL, use mysqladmin ping.
For Redis, use redis-cli ping. These probes speak enough protocol to be meaningful.

Example: Postgres in Compose

cr0x@server:~$ cat compose.yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -h 127.0.0.1 -p 5432"]
      interval: 5s
      timeout: 3s
      retries: 10
      start_period: 10s

Why this works: It catches “process is up but not accepting connections,” including recovery or misconfig.
It doesn’t require a real query, so it’s cheap.

Pattern C: Background workers: progress-based checks

Workers often don’t have HTTP. Your check should validate:
the process can talk to the queue, and it’s not stuck.
If you can emit a heartbeat file or a lightweight “last job processed” timestamp, do it.

A worker that can connect to Redis but hasn’t made progress in 10 minutes is not healthy. It’s just online.
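
If the worker can emit a heartbeat, the check becomes trivial. A sketch, assuming the worker writes the current epoch-seconds timestamp to /tmp/worker.heartbeat after each processed job (the path and the 10-minute threshold are illustrative):

#!/bin/sh
# Sketch of a progress-based worker check: "recent heartbeat" instead of "PID exists".
set -eu

hb=/tmp/worker.heartbeat
max_age=600   # seconds without progress before we call it wedged

if [ ! -f "$hb" ]; then
  echo "no heartbeat yet"
  exit 1
fi

now=$(date +%s)
last=$(cat "$hb")
age=$((now - last))

if [ "$age" -gt "$max_age" ]; then
  echo "no progress for ${age}s (limit ${max_age}s)"
  exit 1
fi

echo "ok: last job ${age}s ago"
exit 0

Give it a start_period long enough for the first heartbeat to land, and have the worker touch the heartbeat on a successful queue poll too, so an empty queue doesn't read as a wedged worker.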

Pattern D: Reverse proxies: check upstream, not yourself

Nginx “healthy” while upstream is down is meaningless if the proxy is the traffic gateway.
You either check upstream reachability or configure the load balancer to check an endpoint that reflects upstream status.

Joke #2: A reverse proxy that answers 200 while upstream is on fire is like a receptionist saying “everyone’s in a meeting” during an evacuation.
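
A sketch of the first option using run-time flags, with myorg/edge:1.0 standing in for a hypothetical nginx-plus-curl image on an existing app-net network, and the proxy configured to forward /readyz to its upstream:

# The proxy's healthcheck travels through the proxy to its upstream,
# so a dead upstream turns the proxy unhealthy instead of green.
docker run -d --name edge \
  --network app-net \
  --health-cmd='curl -fsS -m 2 http://127.0.0.1/readyz || exit 1' \
  --health-interval=15s \
  --health-timeout=3s \
  --health-retries=3 \
  myorg/edge:1.0

Keep the timeout generous enough that a slow upstream degrades gracefully instead of flapping the proxy; otherwise you're rebuilding the death spiral described later.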

Compose orchestration, dependencies, and startup reality

Docker Compose is where healthchecks either save you or make you overconfident.
The trap: people assume depends_on means “wait until ready.” It historically meant “start in order.”
Those are not the same, and you can guess which one your database cares about.

Use health-based dependencies where it actually matters

If the API will crash-loop until the DB is up, you can gate API startup on DB health in Compose.
This reduces noise and avoids thundering herds on boot.

Example: Compose dependency on health

cr0x@server:~$ cat compose.yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -h 127.0.0.1 -p 5432"]
      interval: 5s
      timeout: 3s
      retries: 12
      start_period: 10s

  api:
    image: myorg/api:1.9.3
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS -m 1 http://127.0.0.1:8080/readyz || exit 1"]
      interval: 10s
      timeout: 2s
      retries: 3
      start_period: 30s

Reality check: This improves startup ordering, not long-term reliability. If DB dies later, Compose won’t magically orchestrate a graceful failover.
But it does stop the “everything starts at once, everything fails at once” boot storm.

Don’t let healthchecks turn into deployment gates you can’t reason about

If you make readiness include every external dependency, one flaky upstream can block deploys and rollbacks.
That’s not resilience; that’s coupling. Define what is required to serve your traffic.

Separate startup warmup from steady-state health

Startup is special. Migrations run. JIT compilers wake up. Certificates load. DNS caches are cold.
That’s why start_period exists: to prevent restarts and “unhealthy” states during expected warmup.

But don’t abuse it. A 10-minute start period hides real failures for 10 minutes. If you need that, you probably need better initialization design.
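
Take the guesswork out by measuring. A rough sketch against the api-1 container from the tasks above (run it somewhere you're allowed to restart things):

# Measure how long a fresh start takes to reach "healthy".
docker restart api-1 >/dev/null
start=$(date +%s)
until [ "$(docker inspect --format '{{.State.Health.Status}}' api-1)" = "healthy" ]; do
  sleep 1
done
echo "time-to-healthy: $(( $(date +%s) - start ))s"

Run it a few times under realistic load, take the worst case, add margin, and that's your start_period.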

Restarts, healthchecks, and the death spiral

Docker healthchecks don’t automatically restart containers by themselves. Something else does: your orchestrator, your scripts, your supervisors,
or “clever” automation that interprets unhealthy as “restart now.”

This is where good intentions go to die.

The classic death spiral

  • Dependency gets slow (DB latency spikes).
  • Healthcheck times out (too aggressive).
  • Automation restarts containers.
  • Restart storm increases load (cold caches, reconnects, migrations, replays).
  • Dependency gets slower.

How to avoid it

  • Healthchecks should be diagnostic first. Restart as a last resort, not a reflex.
  • Use backoff in whatever system decides to restart.
  • Make your check cheap and time-bounded, and tune timeouts to expected jitter under load.
  • Design for degraded mode: if Redis is down, can you keep serving read-only? Then don’t fail readiness on Redis.

A healthcheck that triggers restarts is a loaded weapon. Store it like one.
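
If something must restart on unhealthy, make it boringly deliberate. A sketch of external automation with backoff; the container name and thresholds are illustrative, and real orchestrators have better versions of this built in:

#!/bin/sh
# Restart on unhealthy (which already means the check failed "retries" times in a row),
# and back off between attempts so restarts can't stampede.
set -eu

name=api-1
backoff=10        # seconds; doubles after each restart
max_backoff=300

while true; do
  status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || echo unknown)
  if [ "$status" = "unhealthy" ]; then
    echo "$(date -u) $name unhealthy; restarting (next backoff ${backoff}s)"
    docker restart "$name" >/dev/null
    sleep "$backoff"
    backoff=$((backoff * 2))
    [ "$backoff" -gt "$max_backoff" ] && backoff=$max_backoff
  else
    backoff=10    # healthy again: reset
    sleep 15
  fi
done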

Healthchecks vs monitoring vs metrics: stop mixing them up

A healthcheck is a binary local probe. Monitoring is a system that tells you when you’re violating objectives.
Metrics are the data that explain why.

When to use healthchecks

  • To prevent routing traffic to containers that cannot serve it.
  • To avoid starting dependent services too early.
  • To surface obvious broken states quickly and consistently.

When not to use healthchecks

  • As a substitute for latency percentiles and error rates.
  • As the sole signal to restart. That’s how you build elegant failure amplifiers.
  • To “test the whole world” from inside one container. That’s what integration tests and synthetic monitoring are for.

Two-tier pattern: internal health + external synthetic checks

The robust pattern is layered:

  • Internal healthcheck: cheap, fast, dependency-aware enough for local correctness.
  • External synthetic check: hits the real public path, exercises DNS/TLS/routing, and measures latency.

This catches the two big classes of lies: “container is fine but path is broken” and “path is fine but container is dead.”
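
A sketch of the external half, meant to run from outside the Docker host (cron, CI, a probe box); the URL and the latency budget are placeholders:

#!/bin/sh
# External synthetic check: exercises DNS, TLS, routing, and the published port,
# and enforces a latency budget instead of a bare 200.
set -eu

url="https://example.com/readyz"
budget="1.5"   # seconds

out=$(curl -sS -o /dev/null -m 5 -w '%{http_code} %{time_total}' "$url") || {
  echo "synthetic check failed: $url unreachable"
  exit 1
}
code=${out%% *}
elapsed=${out##* }

if [ "$code" != "200" ]; then
  echo "synthetic check failed: code=$code"
  exit 1
fi

# POSIX sh can't compare floats; let awk enforce the budget.
if awk -v t="$elapsed" -v b="$budget" 'BEGIN { exit !(t > b) }'; then
  echo "synthetic check slow: ${elapsed}s > ${budget}s budget"
  exit 1
fi

echo "ok: code=200 time=${elapsed}s"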

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A company ran a small internal platform: a reverse proxy container in front of several API containers.
They added a Docker healthcheck to the proxy that did curl http://127.0.0.1/ and returned 0 on 200.
The proxy always answered 200, even when upstreams were down, because the root path served a static “welcome” page.

During a routine deploy, an upstream service failed to start due to a missing environment variable.
The proxy kept returning 200 to the load balancer health probe. Traffic flowed. Users got a friendly 502 and a timeout.
Monitoring showed “proxy healthy.” The deploy pipeline showed “all containers running.” Everyone stared at graphs in disbelief,
as if disbelief could restore service.

The wrong assumption was subtle and very human: “If the proxy is up, the service is up.”
But for customers, the proxy is just the front door. If the room behind it is on fire, the front door being open is not success.

They fixed it by defining two endpoints:
/healthz for “proxy process is alive,” and /readyz for “critical upstreams reachable and returning expected status.”
The load balancer checked /readyz. The Docker healthcheck for the proxy also checked /readyz,
with sane timeouts to avoid cascading failure when upstream was slow.

The immediate result was less drama: misconfigured upstreams no longer became production traffic black holes.
The longer-term result was cultural: people stopped using “container running” as a synonym for “service works.”

Mini-story 2: The optimization that backfired

Another organization wanted “faster detection” of broken containers. They tightened healthchecks:
interval 2 seconds, timeout 200ms, retries 1. They congratulated themselves on being serious about reliability.
On a quiet day, it looked fine.

Then they hit a normal peak. CPU throttling kicked in because the containers were resource-limited to keep costs down.
GC pauses grew. Disk latency spiked a bit from background snapshots on the host.
Requests still succeeded, but sometimes in 400–700ms instead of 80ms. The healthchecks timed out.

Their automation treated “unhealthy” as “restart immediately.” A restart caused cache misses, which caused more DB traffic,
which caused more latency, which caused more healthcheck failures. Soon they had a synchronized restart parade.
The systems weren’t “down” at first; the health policy made them down.

They recovered by rolling back the healthcheck aggressiveness and adding backoff to restarts.
They also split the checks: a fast liveness check for “process still responding,” and a more forgiving readiness check
with a longer timeout and multiple retries.

The lesson wasn’t “never optimize.” It was “optimize the right thing.”
Faster detection is meaningless if your detector is more fragile than the service it’s supposed to protect.

Mini-story 3: The boring but correct practice that saved the day

A third team ran stateful services with persistent volumes. They had a habit that nobody celebrated:
every service had a documented health contract, and the healthcheck command printed a single-line status summary
including a timestamp and key dependency states.

One morning, a subset of containers went unhealthy after a host maintenance window.
The applications were up, but their healthchecks reported db=ok and disk=slow, with latency in milliseconds.
That “disk=slow” was not genius. It was a simple threshold: write a tiny temp file and measure how long it took.

The on-call didn’t have to guess. They checked host disk stats, found that one volume backend was misbehaving,
and drained the affected host. Traffic shifted. Errors dropped. No heroic debugging session inside the app.
Nobody rewrote any code at 3 a.m., which is always the real win.

Later, they tuned the healthcheck so disk slowness did not immediately mark containers unhealthy unless it persisted across several intervals.
The boring practice—consistent health output and a documented contract—turned a vague incident into a clean decision tree.

This is what “operational excellence” looks like in real life: mostly unsexy, occasionally life-saving.

Common mistakes (symptoms → root cause → fix)

1) Symptom: Container is “healthy” but users get 502/504

Root cause: Healthcheck only tests local process or a static page, not upstream dependencies or the real request path.

Fix: Check the same path as real traffic (proxy-to-upstream), or expose a /readyz that validates critical dependencies with time budgets.

2) Symptom: Containers flap between healthy/unhealthy during load

Root cause: Timeout too low, healthcheck competes with real traffic for CPU/threads, or dependency latency spikes.

Fix: Increase timeout, add retries, use start_period, and make the check cheaper. Validate CPU throttling and thread pool saturation.

3) Symptom: Deploys hang because services never become healthy

Root cause: Healthcheck requires external dependencies that are optional, or it depends on data migrations that take longer than start_period.

Fix: Narrow readiness to “can serve traffic safely” (not “the world is perfect”), and move long migrations to a one-shot job.
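
A minimal sketch of the one-shot job approach with plain Compose CLI, assuming a hypothetical migrate service defined in compose.yaml that runs the migration command and exits:

# Migrations run once, outside the api service's healthcheck and start_period.
docker compose run --rm migrate && docker compose up -d api

Readiness then answers "can I serve traffic," not "did a 20-minute migration finish."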

4) Symptom: DB gets hammered every few seconds even when idle

Root cause: Healthcheck runs heavy queries or opens new DB connections each interval; multiplied by replicas, it becomes real load.

Fix: Use native readiness tools (pg_isready), shallow queries, or app-level pooled checks. Increase interval.

5) Symptom: Healthcheck passes locally but fails in production only

Root cause: Assumptions about localhost, DNS, certificates, proxy headers, or environment-specific auth.

Fix: Run the check in the same network namespace it will run in (the container). Validate service discovery from peers, not just from your laptop.

6) Symptom: Unhealthy triggers restarts, restarts make it worse

Root cause: Restart automation with no backoff + overly strict check + cold-start amplification.

Fix: Add backoff, widen thresholds, separate liveness from readiness semantics, and avoid restarting on transient dependency jitter.

7) Symptom: Healthcheck itself causes latency spikes

Root cause: Check hits expensive endpoints (full dependency graph, cache rebuild, auth handshake) too frequently.

Fix: Make a dedicated cheap endpoint, cache health results briefly in-app, and ensure the check is side-effect free.

8) Symptom: Healthcheck always returns healthy, even when app is wedged

Root cause: Check is only “port open” or “process exists,” or it calls a handler that doesn’t exercise the wedged subsystem.

Fix: Include a trivial operation that requires progress (e.g., enqueue/dequeue noop, execute a tiny DB query, or verify event loop tick).

Checklists / step-by-step plan

Step-by-step: write a healthcheck that won’t embarrass you

  1. Write the contract: “Healthy means X. Unhealthy means Y.” Keep it in the repo next to the Dockerfile.
  2. Choose the target: liveness-ish (progress) or readiness-ish (safe to serve traffic). Don’t pretend one boolean does both perfectly.
  3. Pick a latency budget: choose a timeout that reflects real conditions plus margin (not your laptop on Wi-Fi).
  4. Start period: set start_period based on measured startup time, not hope.
  5. Retries: set retries to tolerate transient jitter but not hide persistent failure.
  6. Make it cheap: avoid heavy endpoints and connection storms. Prefer pooled in-app checks or native probes.
  7. Make it observable: print a one-line reason on failure and a concise summary on success.
  8. Test under stress: run healthchecks while CPU and I/O are constrained; see if they flap.
  9. Decide restart policy: who restarts on unhealthy, with what backoff? Document it. Implement it deliberately.
  10. Review regularly: healthchecks rot as systems change. Add them to your “production correctness” review list.

Checklist: pre-deploy sanity for Compose stacks

  • Do dependencies use condition: service_healthy where appropriate?
  • Do healthchecks use correct hostnames (service names), not 127.0.0.1 across containers?
  • Are intervals reasonable (not 1–2s across dozens of replicas unless the check is truly trivial)?
  • Are timeouts and retries tuned for expected load and jitter?
  • Do checks avoid side effects and heavy queries?
  • Do checks fail for the failure modes you actually care about?

Checklist: incident response when healthchecks are involved

  • Is the healthcheck failing because the service is broken, or because the check is too strict?
  • Is restart automation amplifying the problem?
  • Do health logs show a specific dependency failure (auth, DNS, timeout)?
  • Is the bottleneck CPU, memory, disk, or upstream saturation?
  • Can you degrade gracefully instead of failing hard?

FAQ

1) Should every container have a healthcheck?

No. Add healthchecks where they drive a decision: routing, dependencies, or rapid diagnosis.
For one-shot jobs or trivial stateless sidecars, a healthcheck can be noise. Don’t cargo-cult it.

2) What’s the difference between “liveness” and “readiness” in Docker?

Docker only exposes one health state, but you can choose your semantics.
“Liveness” is “the process can still make progress.” “Readiness” is “safe to receive traffic.”
For frontends and APIs, prioritize readiness. For workers, prioritize progress.

3) Why not just check that the port is open?

Because a port can be open while the app is broken: deadlocked handlers, exhausted thread pools,
failing dependencies, or a proxy serving a static page. Port-open checks catch only the laziest failures.

4) How often should I run a healthcheck?

Start with 10–30 seconds for most services. Faster checks increase load and flapping risk.
If you need sub-second detection, you’re usually solving the wrong problem or using the wrong tool.

5) What timeouts and retries should I use?

Timeouts should be less than client timeouts and reflect expected latency under load plus margin.
Retries should tolerate transient issues (1–3) without masking persistent failure. Measure startup and steady-state separately.

6) Should a healthcheck test dependencies like databases and caches?

If the service cannot function without them, yes—shallowly.
If the service can degrade gracefully, don’t fail readiness for optional dependencies. That’s how you turn partial outages into total outages.

7) Why does my healthcheck pass in one container but fail from another?

Namespaces. 127.0.0.1 inside a container is that container. Service DNS differs between networks.
Also, certificates and auth can be environment-specific. Always test from the same network path the check will run in.

8) Does Docker restart unhealthy containers automatically?

Not by default. Docker reports health state; restarts are driven by restart policies (on exit) or external automation.
Be very careful when wiring “unhealthy” to restarts. Add backoff and avoid restart storms.

9) Should my health endpoint return detailed diagnostics?

Internally, yes: a short reason string is gold during incidents. Externally, be cautious: don’t leak secrets or topology.
Many teams serve a minimal external /healthz and a protected internal diagnostics endpoint.

10) How do I keep healthchecks from overloading the database?

Use native probes (pg_isready) or app-level checks that reuse connection pools.
Increase interval. Avoid queries that scan tables. Your healthcheck should be cheaper than a real request.
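
If several probes hit the same expensive check (Docker, the load balancer, a curious human with curl), you can also cache the verdict briefly. A shell-level sketch of the "cache health results briefly" idea from the mistakes section; the cache path, TTL, and the pg_isready probe are illustrative:

#!/bin/sh
# Cache an expensive check result for a few seconds so stacked probes
# don't multiply load on the database.
set -eu

cache=/tmp/health.cache
ttl=10          # seconds a cached verdict stays valid
now=$(date +%s)

# Serve a recent cached verdict if we have one.
if [ -f "$cache" ]; then
  ts=$(cut -d' ' -f1 "$cache")
  verdict=$(cut -d' ' -f2 "$cache")
  if [ $((now - ts)) -lt "$ttl" ]; then
    echo "cached: $verdict"
    if [ "$verdict" = "ok" ]; then exit 0; else exit 1; fi
  fi
fi

# The real (expensive) probe: here a shallow Postgres readiness check.
if pg_isready -h db -p 5432 -q; then
  verdict=ok
else
  verdict=fail
fi
echo "$now $verdict" > "$cache"

echo "fresh: $verdict"
if [ "$verdict" = "ok" ]; then exit 0; else exit 1; fi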

Next steps you can ship this week

Stop letting “green” mean “good.” Healthchecks are production controls, not decorative YAML.
If your check doesn’t match real failure modes, it will certify failures with confidence.

  1. Audit your top 5 services: what exactly do their healthchecks test, and is it the same path your users take?
  2. Add one dependency-aware readiness signal where it matters (API, proxy, gateway). Keep it shallow and time-bounded.
  3. Tune timeouts and start periods using measured startup and load behavior, not guesses.
  4. Make output actionable: one-line failure reasons that show up in docker inspect.
  5. Review restart behavior: if “unhealthy” triggers restarts, add backoff and verify you’re not building a death spiral.

The goal isn’t to make healthchecks strict. It’s to make them honest. Honest checks don’t prevent every outage.
They prevent the worst kind: the one your systems swear isn’t happening.
