The worst Docker incidents don’t look dramatic at first. A container “works” until it doesn’t, and when it fails it often fails sideways:
latency climbs, retries spike, disks fill, nodes start swapping, and then everything is “unrelated” until it’s all related.
The fix is not “add more dashboards.” The fix is a ruthless minimum: a small set of metrics and logs that reliably light up before users do,
and a way to interrogate a host in minutes when the dashboards are lying or missing.
The minimum signal: what to observe and why
“Observability” in container land gets sold as an all-you-can-eat buffet: traces, profiles, eBPF, service maps, fancy flame graphs,
AI anomaly detection, and a dashboard with twelve shades of red. In production, the minimum viable observability is simpler:
detect imminent failure modes early, with signals that are hard to fake and cheap to collect.
For Docker on a single host or a fleet, you’re juggling three layers that all fail differently:
- Application layer: request errors, latency, retries, timeouts, queue depth. This is where users scream.
- Container layer: restarts, OOM kills, CPU throttling, file descriptor exhaustion, log volume.
- Host layer: disk fullness/inodes, IO saturation, memory pressure/swap, conntrack exhaustion, kernel issues.
The minimum set should give you early warning for the predictable stuff: disk filling, memory leaking, CPU throttling, crash loops,
and the silent killers like inode exhaustion and conntrack. If you nail those, you prevent most “it was fine yesterday” outages.
Rules of the minimum
- Prefer leading indicators. “Disk 98% full” is a trailing indicator. “Disk growing 2% per hour” is a leading one.
- Pick one source of truth per thing. For CPU, use cgroup CPU usage and throttling. Don’t mix five “CPU%” definitions.
- Keep cardinality under control. Label every metric with container IDs and you’ll DoS your monitoring with success.
- Alert on symptoms you can act on. “Load average high” is trivia unless it’s tied to CPU steal, IO wait, or throttling.
Two kinds of teams exist: those who alert on “container restarted” and those who enjoy being surprised by it at 2 a.m.
Also, a dry truth: if your “observability” stack goes down whenever the cluster is unhappy, it isn’t observability; it’s decoration.
Your minimum needs to be accessible from the host with a shell.
Interesting facts and historical context (the parts that still hurt)
- cgroups were introduced in Linux in 2007. Docker didn’t invent resource isolation; it packaged it and made it easy to misuse.
- The original Docker logging driver defaulted to json-file. It’s convenient, but on busy services it can become your stealth disk-eater.
- OverlayFS (overlay2) became the default storage driver for many distros in the 2016–2017 era. It’s fast enough, but “where did my disk go?” became a weekly question.
- OOM killer behavior predates containers by decades. Containers just made it easier to hit memory limits and harder to see why without the right signals.
- Conntrack has been a Linux netfilter feature since early 2.4 kernels. When NAT and lots of short-lived connections meet containers, conntrack becomes a capacity plan item, not a footnote.
- Healthchecks in Dockerfile were added after early production pain. Before that, “container is running” was treated as “service is healthy,” which is adorable.
- systemd-journald rate limiting exists for a reason. Logging is IO; IO is latency; latency becomes retries; retries become a storm. Logs can take you down.
- CPU throttling is not “CPU is high.” It’s “you asked for CPU and the kernel said no.” This distinction became more visible with widespread containerization.
Minimum metrics set (by failure mode)
Don’t start with “collect everything.” Start with how Docker systems actually die. Below is a minimum that catches failures early
and gives you a direction to debug. This assumes you can collect host metrics (node exporter or equivalent), container metrics
(cAdvisor or Docker API), and basic app metrics (HTTP/gRPC stats). Even if you don’t run Prometheus, the concepts hold.
1) Crash loops and bad deploys
- Container restart count rate (per service, not per container ID). Alert on sustained restarts over 5–10 minutes.
- Exit code distribution. Exit 137 means SIGKILL, usually an OOM kill (confirm with OOMKilled); exit 1 is usually an app crash; exit 143 is SIGTERM (often normal during deploys).
- Healthcheck failures (if you use HEALTHCHECK). Alert before restarts, because health failing is your lead indicator.
Decision you want to enable: rollback or stop the bleeding. If restart rate spikes after a deploy, don’t “wait for it to settle.”
It won’t. Containers are not houseplants.
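If the metrics pipeline isn’t there yet, the Docker event stream already carries this signal. A minimal sketch using only the stock CLI (the window and filter values are arbitrary):
cr0x@server:~$ docker events --since 30m --filter event=die \
    --format '{{.Time}} {{.Actor.Attributes.name}} exitCode={{.Actor.Attributes.exitCode}}'
It prints die events from the last 30 minutes and then keeps streaming. The same service name repeating with exitCode=137 is a crash loop with an OOM flavor; exitCode=143 during a deploy window is usually just the deploy.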
2) Memory leaks, memory pressure, and OOM kills
- Container memory working set (not just RSS, and not cache unless you know what you’re doing).
- OOM kill events at the host and container level.
- Host memory available and swap in/out. Swap activity is the slow-motion disaster movie.
Alert thresholds: working set approaching limit (e.g., > 85% for 10 minutes), OOM kills > 0 (immediate page), swap-in rate > 0 sustained (warn).
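For the “working set, not RSS” point: collectors such as cAdvisor approximate working set as total memory usage minus inactive file cache. A rough spot-check from the host, assuming cgroup v2 with the systemd cgroup driver (paths differ on cgroup v1 or the cgroupfs driver; the container name is illustrative):
cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' worker-2a11)            # full container ID
cr0x@server:~$ CG=/sys/fs/cgroup/system.slice/docker-$CID.scope          # cgroup v2 + systemd driver path
cr0x@server:~$ echo "$(( ($(cat $CG/memory.current) - $(awk '/^inactive_file /{print $2}' $CG/memory.stat)) / 1048576 )) MiB working set"
Compare that number with the limit in $CG/memory.max; if the gap shrinks steadily over hours, you are watching a leak, not a spike.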
3) Disk: the #1 boring outage generator
- Host filesystem free % for Docker root (often /var/lib/docker) and for log filesystems.
- Disk write bytes/sec and IO utilization (await, svctm-ish signals depending on tooling).
- Inodes free %. You can be “50% free” and still be dead because you ran out of inodes.
- Docker image + container writable layer growth (if you can track it). Otherwise track /var/lib/docker growth rate.
Alerting pattern that actually works: alert on time-to-full (based on growth rate) rather than raw percentage.
“Disk will be full in 6 hours” gets action. “Disk is 82%” gets ignored until it’s 99%.
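If you want that number without a metrics stack, two df samples are enough for a back-of-the-envelope estimate. A rough sketch (the 5-minute interval and mount point are arbitrary choices):
cr0x@server:~$ A=$(df -B1 --output=used /var/lib/docker | tail -1); sleep 300; B=$(df -B1 --output=used /var/lib/docker | tail -1)
cr0x@server:~$ AVAIL=$(df -B1 --output=avail /var/lib/docker | tail -1); RATE=$(( (B - A) / 300 ))   # bytes per second
cr0x@server:~$ [ "$RATE" -gt 0 ] && echo "time to full: ~$(( AVAIL / RATE / 3600 )) hours" || echo "not growing over this window"
time to full: ~6 hours
Real monitoring does the same arithmetic with a rate function over a longer window (for example predict_linear in Prometheus), which smooths out bursts.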
4) CPU: high usage vs throttling
- Container CPU usage (cores or seconds/sec).
- Container CPU throttled time and throttled periods. This is the “we’re starved” metric.
- Host CPU iowait. If iowait rises with latency, your bottleneck is disk, not CPU.
If CPU usage is moderate but throttling is high, you set limits too low or packed too much onto the node. Users experience it as
random latency spikes. Engineers misdiagnose it as “network flakiness” because the graphs look fine.
5) Network: when “it’s DNS” is actually conntrack
- TCP retransmits and socket errors on the host.
- Conntrack table usage (% used).
- DNS error rate from your resolver (SERVFAIL, timeout). DNS is often a victim, not the cause.
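For the host-side bullets, the kernel’s own counters are enough for a first look; a quick check, assuming iproute2 is installed (counter names are the standard SNMP/netstat MIB names):
cr0x@server:~$ nstat -az TcpRetransSegs TcpExtListenDrops TcpExtTCPTimeouts   # absolute kernel TCP counters since boot
cr0x@server:~$ ss -s                                                          # socket summary: totals, TIME-WAIT, orphans
Retransmits climbing together with conntrack usage is the usual “it is not actually DNS” signature.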
6) Application “golden signals” (the ones worth wiring)
- Request rate, error rate, latency (p50/p95/p99), saturation (queue depth, worker utilization).
- Dependency error rate (DB, cache, upstream APIs). Docker incidents often manifest as dependency storms.
Minimum alerts that won’t make you hate your pager
- Disk time-to-full < 12h (warn), < 2h (page)
- Inodes time-to-zero < 12h (warn), < 2h (page)
- OOM kill event (page)
- Restart loop: restarts/minute > baseline for 10m (page)
- CPU throttled time ratio > 10% for 10m (warn), > 25% for 5m (page) for latency-sensitive services
- Host swap-in sustained (warn), major page fault rate spike (warn or page, depending on impact)
- Conntrack utilization > 80% for 10m (warn), > 90% (page)
- App p99 latency + error rate both degrade (page). One without the other is usually noise.
Minimum logs set (and how not to drown)
Metrics tell you something is wrong. Logs tell you what kind of wrong. The minimum logging strategy is not “log everything.”
It’s “log the events that explain state transitions and failures, and keep them long enough to debug.”
Container stdout/stderr: treat it like a product interface
In Docker, stdout/stderr is the most convenient log path and the easiest to abuse. If your app logs JSON, good. If it logs stack traces
and full request bodies, also good—until it isn’t. Your minimum:
- Structured logs (JSON preferred) with timestamp, level, request ID, and error fields.
- Startup banner including version, config checksum, and listening ports.
- One-line summaries for request failures and dependency failures.
- Rate-limited noisy logs (timeouts, retries). Repeated logs should be aggregated, not spammed.
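Concretely, “structured” means one machine-parseable line per event. A representative example (field names are illustrative, not a standard):
{"ts":"2026-01-03T00:12:39.003Z","level":"error","request_id":"9f3ab2","msg":"dependency timeout","dependency":"orders-db","attempt":3,"err":"context deadline exceeded"}
Anything that can’t be grouped, counted, and rate-limited by fields like these will eventually be grepped by hand at 2 a.m. instead.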
Docker daemon and host logs: the boring stuff that explains everything
- dockerd logs: image pulls, graphdriver errors, exec failures, containerd issues.
- kernel logs: OOM killer messages, filesystem errors, network drops.
- journald/syslog: service restarts, unit failures, rate limiting warnings.
Log retention and rotation: pick a policy, not a hope
If you keep json-file logs without rotation, you will eventually fill disk. This is not a “maybe.” This is physics.
Joke #1: Unrotated Docker logs are like a junk drawer—fine until you try to close it and the kitchen stops working.
Minimum policy for json-file:
- Set max-size and max-file globally.
- Prefer external log shipping (journald, fluentd, or a sidecar/agent) if you need longer retention.
- Alert on log growth the same way you alert on disk growth. Logs are just disk writes with opinions.
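One caveat worth stating: daemon.json only applies to containers created after the change. For a single noisy service you can set the same limits at run time (the name, image, and values below are examples mirroring the global policy):
cr0x@server:~$ docker run -d --name api \
    --log-driver json-file --log-opt max-size=50m --log-opt max-file=5 \
    registry/app:1.8.2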
Practical tasks: commands, outputs, decisions (12+)
Dashboards are great until they aren’t. When a host is on fire, you need a small set of commands that tell you what’s failing:
CPU, memory, disk, network, Docker itself, or your app. Below are tasks you can run on any Docker host.
Each includes: command, representative output, what it means, and what decision you make.
Task 1: Confirm the blast radius (what’s running, what’s restarting)
cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
NAMES IMAGE STATUS PORTS
api-7c9d registry/app:1.8.2 Up 3 minutes (healthy) 0.0.0.0:8080->8080/tcp
worker-2a11 registry/worker:1.8.2 Restarting (1) 12 seconds ago
redis-01 redis:7 Up 14 days 6379/tcp
What it means: One container is in a restart loop. “Restarting (1)” is not a vibe; it’s a symptom.
Decision: Triage restart cause immediately (exit code, logs, OOM). Don’t chase latency elsewhere until restart loop is understood.
Task 2: See why a container died (exit code + OOM hint)
cr0x@server:~$ docker inspect -f 'Name={{.Name}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}' worker-2a11
Name=/worker-2a11 ExitCode=137 OOMKilled=true Error= FinishedAt=2026-01-03T00:12:41.981234567Z
What it means: Exit 137 with OOMKilled=true means the kernel killed the process for exceeding its cgroup memory limit.
Decision: Stop guessing about “random crashes.” Increase memory limit (carefully), reduce concurrency, or fix leak. Also check host memory pressure.
Task 3: Grab the last logs without drowning
cr0x@server:~$ docker logs --tail=80 --timestamps worker-2a11
2026-01-03T00:12:35.112Z level=info msg="starting worker" version="1.8.2" concurrency=64
2026-01-03T00:12:39.003Z level=error msg="failed processing job" err="context deadline exceeded" job_id=91311
2026-01-03T00:12:41.978Z level=info msg="shutdown requested"
What it means: The logs show a startup and some timeouts, but nothing about OOM (because OOM is outside the process).
Decision: Correlate with kernel logs for OOM. Don’t wait for the application to confess.
Task 4: Confirm OOM kills in the kernel log
cr0x@server:~$ sudo dmesg -T | tail -n 30
[Sat Jan 3 00:12:41 2026] Memory cgroup out of memory: Killed process 28144 (worker) total-vm:2145932kB, anon-rss:612344kB, file-rss:1200kB, shmem-rss:0kB
[Sat Jan 3 00:12:41 2026] oom_reaper: reaped process 28144 (worker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What it means: The kernel killed the process due to cgroup memory limit. This is definitive.
Decision: Treat as capacity/limit/config bug. Adjust limits or fix app memory behavior; add alert on OOMKill events.
Task 5: Check host memory pressure (the “everyone is slow” prelude)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 28Gi 620Mi 512Mi 2.4Gi 1.1Gi
Swap: 4.0Gi 1.9Gi 2.1Gi
What it means: Available memory is low and swap is in use. If swap is actively changing, latency is likely rising.
Decision: Reduce memory pressure: move workloads, increase host RAM, cap caches, tune limits. Consider paging risk as incident-worthy.
Task 6: Identify top memory consumers by container
cr0x@server:~$ docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}'
NAME                MEM USAGE / LIMIT     MEM %      CPU %
api-7c9d 420MiB / 1GiB 41.02% 85.31%
worker-2a11 980MiB / 1GiB 95.70% 12.01%
redis-01 210MiB / 2GiB 10.25% 1.10%
What it means: The worker is riding the memory limit. It will OOM again.
Decision: Reduce worker concurrency, add backpressure, or raise limit. If you raise limit, verify host headroom first.
Task 7: Check disk fullness where Docker lives
cr0x@server:~$ df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 200G 189G 11G 95% /
What it means: Docker’s root is at 95%. This is “soon-to-be-incident,” not “fine.”
Decision: Start cleanup (unused images, stopped containers, build cache). Also identify growth source (logs, overlay2, volumes).
Task 8: Find inode exhaustion (the sneaky disk outage)
cr0x@server:~$ df -i /var/lib/docker
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 13107200 13010000 97200 100% /
What it means: Disk space might exist, but inodes are basically gone. Creating files will fail; containers can break in weird ways.
Decision: Identify directories with huge file counts (often build cache, unpacked layers, or tiny log files). Cleanup and consider filesystem tuning.
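GNU du can count inodes instead of bytes, which makes the file-count hotspot obvious (representative output; your distribution will vary):
cr0x@server:~$ sudo du --inodes -xd1 /var/lib/docker | sort -n | tail -n 4
48210     /var/lib/docker/containers
812044    /var/lib/docker/volumes
9120455   /var/lib/docker/overlay2
13009210  /var/lib/docker
Here overlay2 holds the bulk of the inodes, which usually means many small files inside image layers or writable layers rather than logs.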
Task 9: Measure what’s eating Docker disk (volumes vs layers vs logs)
cr0x@server:~$ sudo du -xhd1 /var/lib/docker | sort -h
1.2G /var/lib/docker/containers
6.8G /var/lib/docker/volumes
29G /var/lib/docker/buildkit
150G /var/lib/docker/overlay2
189G /var/lib/docker
What it means: overlay2 is huge, build cache is non-trivial, and containers dir (logs) is small-ish here.
Decision: If overlay2 dominates, look for large writable layers and image sprawl. If buildkit is big, prune it. If containers dir is big, fix log rotation.
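To see which running containers carry heavy writable layers without walking overlay2 by hand, docker ps can report sizes; the first number is the writable layer, the “virtual” figure includes the image (output is representative):
cr0x@server:~$ docker ps --size --format 'table {{.Names}}\t{{.Size}}'
NAMES         SIZE
worker-2a11   9.8GB (virtual 10.3GB)
api-7c9d      12.1MB (virtual 842MB)
redis-01      0B (virtual 117MB)
A container whose writable layer keeps growing is writing where it shouldn’t; move that path to a volume with a retention policy.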
Task 10: Prune safely (and understand the blast radius)
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 47 9 68.2GB 52.4GB (76%)
Containers 21 12 3.4GB 1.1GB (32%)
Local Volumes 18 11 6.8GB 1.9GB (27%)
Build Cache 163 0 29.0GB 29.0GB
What it means: There’s a lot of reclaimable disk, especially build cache and old images.
Decision: Use targeted pruning first. Avoid nuking volumes unless you are very sure they’re not needed.
cr0x@server:~$ docker builder prune -f
Deleted build cache objects:
3yq9m3c2kz7qf2n2o6...
Total reclaimed space: 28.7GB
What it means: Build cache cleaned up successfully.
Decision: Re-check disk. If still high, prune unused images with care.
Task 11: Check log file sizes per container (json-file driver)
cr0x@server:~$ sudo ls -lh /var/lib/docker/containers/*/*-json.log | sort -k5 -h | tail -n 5
-rw-r----- 1 root root 1.2G Jan 3 00:10 /var/lib/docker/containers/9c1.../9c1...-json.log
-rw-r----- 1 root root 2.8G Jan 3 00:11 /var/lib/docker/containers/aa4.../aa4...-json.log
-rw-r----- 1 root root 3.5G Jan 3 00:11 /var/lib/docker/containers/b7d.../b7d...-json.log
-rw-r----- 1 root root 6.1G Jan 3 00:12 /var/lib/docker/containers/cc9.../cc9...-json.log
-rw-r----- 1 root root 9.4G Jan 3 00:12 /var/lib/docker/containers/f01.../f01...-json.log
What it means: One or more containers are generating huge logs. This can fill disk and also slow the host due to IO.
Decision: Fix log verbosity and set log rotation. In an emergency, truncate the worst offender carefully (and accept the loss).
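For the emergency path, truncate rather than delete: dockerd keeps the log file open, so rm won’t free the space until the daemon releases the handle. The path below is a placeholder, not a real container ID:
cr0x@server:~$ sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log
You lose that log history; say so in the incident channel and move on.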
Task 12: Verify Docker log rotation config (and apply it)
cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
What it means: json-file logs will rotate at 50MB, keeping 5 files. This is the minimum to avoid disk death-by-logging.
Decision: If missing, add it and restart Docker during a maintenance window. If present but still huge, your containers may predate the setting; recreate them.
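A quick way to verify both halves, assuming the settings above (on recent Docker versions the daemon defaults are merged into the container config at create time, so an empty map usually means the container predates the setting):
cr0x@server:~$ docker info --format 'driver={{.LoggingDriver}}'
driver=json-file
cr0x@server:~$ docker inspect -f '{{.HostConfig.LogConfig.Type}} {{.HostConfig.LogConfig.Config}}' api-7c9d
json-file map[max-file:5 max-size:50m]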
Task 13: Detect CPU throttling (the “limits are too tight” tell)
cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' api-7c9d); echo "$CID"
b3e2f0a1c9f7f4c2b3a0d0c2d1e9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f0
cr0x@server:~$ cat /sys/fs/cgroup/cpu/docker/$CID/cpu.stat
nr_periods 124030
nr_throttled 38112
throttled_time 921837239112
What it means: High nr_throttled and large throttled_time indicate the container frequently wanted CPU but was throttled.
Decision: Raise CPU limit, reduce container count per node, or optimize CPU hotspots. If latency spikes correlate, throttling is your smoking gun.
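Note that the path above assumes cgroup v1 with the cgroupfs driver. On modern distros running cgroup v2 with the systemd driver, the same counters live under the container’s scope unit and throttled time is reported in microseconds; a hedged variant using the same $CID (output is representative):
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/cpu.stat
usage_usec 48121033211
user_usec 39120794345
system_usec 9000238866
nr_periods 124030
nr_throttled 38112
throttled_usec 921837239
Either way, the ratio of nr_throttled to nr_periods is the number worth tracking.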
Task 14: Check host IO pressure (when everything “just gets slow”)
cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (server) 01/03/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.21 0.00 6.02 21.33 0.00 54.44
Device r/s w/s rkB/s wkB/s await aqu-sz %util
nvme0n1 85.1 410.2 8120.0 65211.1 18.4 2.31 94.7
What it means: High iowait and %util near saturation. Disk is the bottleneck; CPU is waiting for IO.
Decision: Reduce write amplification (logs, database flush storms), move workloads, or scale IO. Don’t “add CPU.” It won’t help.
Task 15: Check conntrack saturation (the “random network timeouts” generator)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 245112
net.netfilter.nf_conntrack_max = 262144
What it means: Conntrack is nearly full. New connections can fail in ugly, intermittent ways.
Decision: Raise conntrack max (with memory awareness), reduce connection churn, tune timeouts, and check for retry storms.
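Raising the ceiling is a two-line change; just remember that each tracked connection costs kernel memory (roughly a few hundred bytes per entry), so size it consciously rather than doubling forever (the value below is an example):
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
net.netfilter.nf_conntrack_max = 524288
cr0x@server:~$ echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/90-conntrack.conf
net.netfilter.nf_conntrack_max = 524288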
Joke #2: Conntrack is where connections go to be remembered forever, which is romantic until your node runs out of memory.
Fast diagnosis playbook: first/second/third checks
When latency is up or errors spike, you need to answer one question fast: what resource is saturated or failing first?
Here’s a playbook that works even when your monitoring is behind, missing, or lying.
First check: Is it a crash loop, OOM, or deploy regression?
- docker ps: look for Restarting, unhealthy, or recently started containers.
- docker inspect: exit codes, OOMKilled flag, health status.
- docker logs --tail: last 50–200 lines for fatal errors and startup config.
Fast decision: If you see restarts increasing after a change, freeze deploys and rollback. Debugging comes after stability.
Second check: Disk and inode pressure (because it breaks everything)
- df -h on /var/lib/docker and log mounts.
- df -i for inode exhaustion.
- du to find which Docker subdir is growing.
Fast decision: If disk > 90% and growing, start reclaiming space immediately. This is one of the few times “cleanup” is a valid incident response.
Third check: Host resource saturation (memory, IO, CPU throttling)
- free -h and swap activity: confirms memory pressure.
- iostat -xz: shows IO saturation and iowait.
- cpu.stat throttling: confirms CPU limits are too tight.
Fast decision: Saturation means you need to shed load, move containers, or change limits—debugging the app won’t fix physics.
Fourth check: Network table exhaustion and DNS symptoms
- conntrack count/max: near full causes timeouts.
- ss -s (not shown above): socket states; lots of TIME-WAIT can indicate churn.
- app logs: timeouts to dependencies; correlate with conntrack and retransmits.
Fifth check: Docker daemon health
- dockerd logs: storage driver errors, pull failures, “no space left” messages.
- docker info: confirms storage driver, cgroup driver, root dir.
One quote, because it’s still the best framing for incident work:
Hope is not a strategy.
— James Cameron
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a set of Docker hosts for internal APIs. The team assumed containers were “isolated enough” and treated the host as a dumb substrate.
They had alerts for HTTP 5xx and basic CPU. Disk? They checked it “sometimes.”
One Friday, a new feature shipped with verbose debug logs accidentally left enabled. The service stayed up; error rates were normal. But logs grew fast,
and by early morning the Docker root filesystem was nearly full. The first symptom wasn’t “disk is full.” It was slower deploys, because image pulls
and layer extractions started thrashing. Then database clients began timing out because the node’s IO was pinned.
The team spent hours arguing about “network flakiness” because pings were fine and CPU wasn’t pegged. Finally someone ran df -h and saw 99% used.
They truncated a huge json log and immediately saw service recover.
Post-incident, the wrong assumption was obvious: “If the app is healthy, the host is fine.” Docker doesn’t prevent disk from filling; it just gives you more
places to hide disk usage. The fix was also obvious: disk growth alerts and log rotation by default, enforced via config management.
Mini-story 2: The optimization that backfired
Another organization wanted better bin-packing and reduced cloud costs. They tightened CPU limits for several latency-sensitive services,
reasoning that average CPU usage was low. They got the cost win. Then they got the graphs.
After the change, customers reported “sporadic slowness.” The services weren’t down. Error rates were modest. The on-call saw CPU usage
hovering at 40–50% and declared the node healthy. Meanwhile p99 latency doubled during peak traffic and then “mysteriously” returned to normal.
The real culprit was CPU throttling. Bursty workloads hit their limits and got throttled precisely when they needed short bursts to keep queues empty.
The OS did exactly what it was told. The team had no throttling metrics, so it looked like a phantom.
The backfiring optimization wasn’t “limits are bad.” It was “limits without throttling visibility are bad.” They added throttled time alerts,
increased CPU limits for a couple of services, and adjusted concurrency to match what the containers were actually allowed to do.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent platform ran Docker hosts with a strict baseline: log rotation set in /etc/docker/daemon.json, nightly build cache pruning,
and alerts on time-to-full for /var/lib/docker. Nobody bragged about it. It was just “the rule.”
During a busy quarter-end batch run, a worker service began producing far more logs than usual due to an upstream dependency timing out. The service was noisy,
but it didn’t take the host down. Logs rotated. Disk usage increased, but the time-to-full alert fired early enough to investigate without panic.
The team traced the real issue to a dependency and fixed the retry storm. The important part is what didn’t happen: the Docker host didn’t hit 100% disk,
containers didn’t start failing to write temporary files, and the incident didn’t become a cascading outage.
“Boring but correct” doesn’t make for exciting postmortems. It does make for fewer postmortems.
Common mistakes: symptom → root cause → fix
1) Symptom: Containers restart every few minutes, no clear app error
Root cause: OOM kills due to too-low memory limit or memory leak; app logs don’t show it.
Fix: Confirm via docker inspect OOMKilled and kernel dmesg. Increase limit or reduce concurrency; add OOM alerts and track working set.
2) Symptom: “No space left on device” but df shows free space
Root cause: Inode exhaustion or per-filesystem reserved blocks; sometimes overlay2 metadata pressure.
Fix: Check df -i. Remove massive file-count directories (build cache, temp files), prune Docker artifacts, tune filesystem inode ratio where appropriate.
3) Symptom: Random latency spikes, CPU looks fine
Root cause: CPU throttling from strict limits; container wants CPU but is denied.
Fix: Monitor throttled time. Increase CPU limit or reduce workload concurrency; stop using “CPU%” alone as capacity signal.
4) Symptom: Deploys hang or image pulls are slow, then services degrade
Root cause: Disk IO saturation or Docker storage driver thrash; often coupled with nearly full disk.
Fix: Use iostat and disk space checks. Reduce IO (logs, rebuild storms), add disk headroom, and avoid doing heavy builds on production nodes.
5) Symptom: Network timeouts across many containers, especially outbound
Root cause: conntrack table near full; NAT-heavy traffic churn.
Fix: Check conntrack count/max; tune nf_conntrack_max and timeouts; reduce connection churn (keepalives, pooling), and avoid retry storms.
6) Symptom: One container’s logs are gigantic; node disk keeps filling
Root cause: json-file logs without rotation; excessive log verbosity; or repeated error loops.
Fix: Set Docker log rotation options; reduce log volume; add rate limiting; ship logs off-host if retention is required.
7) Symptom: Container “Up” but service is dead or returns errors
Root cause: No healthcheck, or healthcheck is too weak (checks process not functionality).
Fix: Add HEALTHCHECK that validates dependency connectivity or a real readiness endpoint; alert on unhealthy status before restarts.
8) Symptom: Disk usage grows but pruning doesn’t help much
Root cause: Volumes accumulating data, or application writing to container writable layer; not just unused images.
Fix: Identify with docker system df and filesystem du. Move writes to volumes with lifecycle management; implement retention policies.
Checklists / step-by-step plan
Minimum observability implementation plan (one week, realistic)
- Pick the minimum alerts (disk time-to-full, inode time-to-zero, OOM kills, restart loops, CPU throttling, conntrack saturation, app latency+errors).
- Standardize Docker logging: set max-size/max-file in /etc/docker/daemon.json; enforce via config management.
- Tag services sanely: metrics labeled by service and environment, not container ID. Keep container ID for debugging, not alerting.
- Expose app golden signals: request rate, error rate, latency percentiles, saturation. If you can only do one, do error rate + p99 latency.
- Collect host metrics for disk, inodes, IO, memory, swap, conntrack. Without host metrics, container metrics will gaslight you.
- Collect container metrics (CPU usage, throttling, memory working set, restarts). Validate by comparing with docker stats during load.
- Write one runbook page per alert: what to check, commands to run, and the rollback/mitigation steps.
- Run a game day: simulate disk filling (in a safe env), simulate an OOM, simulate conntrack pressure. Validate alerts and playbook.
Disk safety checklist (because disk will betray you)
- Docker log rotation configured and verified on new containers
- Alert on disk growth rate/time-to-full for Docker root
- Alert on inode consumption rate/time-to-zero
- Nightly build cache pruning for build nodes (not necessarily prod)
- Explicit retention for volumes (databases and queues aren’t trash cans)
CPU/memory safety checklist
- Alert on OOM kills and memory working set approaching limits
- Alert on CPU throttling, not just CPU usage
- Limits set with real load tests; concurrency tied to CPU/memory budgets
- Host swap monitored; swap-in sustained is a warning sign you should respect
Logging checklist (minimum viable sanity)
- Structured logs with request IDs and error fields
- Rate-limited repetitive errors
- No request/response bodies by default in production
- Separate audit logs from application debug logs if required
FAQ
1) What is the single most important Docker alert?
Disk time-to-full (and inode time-to-zero as its equally annoying sibling). Disk outages cause cascading failures and corrupt the incident timeline.
If you only add one thing, make it “we will be full in X hours.”
2) Should I alert on container restarts?
Yes, but alert on restart rate per service, not every single restart event. A rolling deploy causes restarts; a crash loop causes a sustained rate.
Also include exit codes or OOMKilled signals in the alert context.
3) Why do my app logs not show OOM kills?
Because the kernel kills the process. Your app doesn’t get a chance to log a nice farewell message. Confirm with docker inspect and dmesg.
4) Is json-file logging okay in production?
It’s okay if you configure rotation and you understand that logs live on the host filesystem. If you need longer retention, ship logs elsewhere.
Unrotated json-file is a disk-filling machine with plausible deniability.
5) CPU is only 40%, why is latency terrible?
Check CPU throttling and IO wait. “CPU 40%” could mean half your CPU quota is being denied, or the CPU is idle because it’s waiting on disk.
Usage graphs don’t show denial; throttling does.
6) How do I distinguish disk space problems from disk IO problems?
Space problems show up in df -h and “no space left” errors. IO problems show up as high iowait and high device utilization in iostat.
They often travel together, but IO saturation can happen with plenty of free space.
7) Why does “docker system prune” not free much space?
Because your disk usage may be in volumes, container writable layers, or active images. docker system df tells you what is reclaimable.
If volumes are the issue, pruning won’t touch them unless you explicitly remove them (which can be catastrophic).
8) Do I need full distributed tracing for Docker observability minimum?
Not for the minimum. Tracing is great for complex request paths, but it won’t save you from full disks, OOM kills, or conntrack exhaustion.
Get the boring host/container signals first. Then add traces where they pay for themselves.
9) What’s a reasonable log rotation setting?
A common baseline is max-size=50m and max-file=5 for general services. High-volume services may need smaller sizes or different drivers.
The “right” setting is the one that prevents disk exhaustion and still leaves enough history to debug.
Conclusion: next steps that actually move the needle
Docker failures are rarely mysterious. They’re usually resource pressure, mis-set limits, runaway logs, or network table exhaustion—plus a thin layer of human denial.
The minimum observability set is how you cut through denial quickly.
Practical next steps:
- Implement Docker log rotation globally and verify new containers inherit it.
- Add alerts for disk time-to-full and inode time-to-zero on the Docker root filesystem.
- Alert on OOM kills and restart-loop rates per service.
- Start tracking CPU throttling; stop pretending CPU% tells the whole story.
- Add conntrack saturation visibility if you run NAT-heavy or high-connection workloads.
- Write a one-page runbook that includes the exact commands you’ll run under pressure (use the tasks above).
If you do only that, you’ll catch the majority of Docker outages early. And you’ll spend less time staring at dashboards that look calm while production quietly burns.