Docker: I/O wait from hell — throttle the one container killing your host

When Linux says your CPU is “idle” but your box feels like it’s underwater, you’re usually staring at the same villain: I/O wait. Your cores aren’t busy computing; they’re standing in line, waiting for storage to answer. Meanwhile, one Docker container is enthusiastically chewing through writes like it’s paid per fsync, and your entire host becomes a slow-motion tragedy.

This is the uncomfortable part of container density: one noisy neighbor can make everyone late. The good news is that Linux gives you the tools to pinpoint who’s responsible and then throttle them surgically—without rebooting, without guessing, and without “let’s just add more disks” as a first response.

A mental model: what iowait really means on a container host

On Linux, I/O wait is the time a CPU spends idle while at least one task is blocked waiting for outstanding I/O. It is not a measure of disk utilization by itself. It means “this CPU has nothing runnable to execute right now, because the tasks that would otherwise be running are stuck waiting on storage, so the kernel counts the idle time as wait.”

On a Docker host, that “blocked on I/O” often means:

  • Overlay filesystem writes amplifying into multiple real writes.
  • Log spam forcing synchronous writes to a hot filesystem.
  • A database doing heavy fsync in a tight loop.
  • A backup job streaming reads that starve writes (or vice versa) depending on the scheduler and device.
  • Metadata storms: millions of tiny files, directory lookups, inode updates, journal pressure.

Containers don’t have magic storage. They share the same host block devices, the same queue, and often the same filesystem journal. If one workload hammers the queue with deep I/O and no fairness controls, every other workload starts queueing too. The kernel will attempt fairness, but fairness isn’t the same as “your p99 latency stays sane.”

The two flavors of “I/O wait from hell”

Latency hell: IOPS aren’t that high, but latency spikes. Think: NVMe firmware hiccups, RAID controller cache flushes, a filesystem journal choking, or sync-heavy writes. Users feel this as “everything is stuck” even though throughput graphs look modest.

Queue depth hell: One container keeps the device saturated with many requests in flight. The device is “busy” and average throughput looks impressive. Meanwhile, interactive tasks and other containers wait behind a mountain of requests.

One quote to keep in your pocket

Paraphrased idea, after Gene Kranz’s mission-operations mindset: “Hope is not a strategy; you need a plan and feedback loops.”

That’s the attitude you want here. Don’t hand-wave. Measure, attribute, then apply a control.

Fast diagnosis playbook (check these first)

If your host is melting, you don’t have time for interpretive dance. Do this in order.

1) Confirm it’s storage latency/queueing, not CPU or memory

  • Check system load and iowait.
  • Check disk latency and queue depth.
  • Check memory pressure (swap storms can look like I/O storms because they are).

2) Identify the top I/O processes on the host

  • Use iotop (per-process read/write rate).
  • Use pidstat -d (per-process I/O stats over time).
  • Use lsof to see what files are being hammered (logs? database files? overlay diff?).

3) Map those processes to containers

  • Find the container ID via cgroups (/proc/<pid>/cgroup).
  • Or map file paths back to /var/lib/docker/overlay2 diff directories.

4) Apply the least-worst throttle

  • If you’re on cgroup v2: io.max and io.weight.
  • If you’re on cgroup v1: blkio throttling and weights (works best on direct block devices, less magical on layered filesystems).
  • Also reduce damage: cap logs, move hot paths to separate disks, and fix the app behavior (flush frequency, batching, etc.).

Joke #1: If your I/O wait is 60%, your CPUs aren’t “lazy”—they’re just stuck in the world’s slowest checkout line.

Interesting facts and context (why this problem keeps happening)

  1. Linux “iowait” is not time spent by the disk. It’s CPU idle time while I/O is pending, which can rise even on fast devices if tasks block synchronously.
  2. cgroups for I/O control arrived later than CPU/memory controls. Early container setups were great at CPU quotas and memory limits, and terrible at storage fairness.
  3. blkio throttling historically worked best with direct block device access. When everything goes through a filesystem layer (like overlay2), attribution and control get murkier.
  4. The Completely Fair Queueing (CFQ) scheduler used to be the “fairness” default for rotational media; it was removed in kernel 5.0. Modern kernels lean toward schedulers like BFQ or mq-deadline depending on device class and goals.
  5. OverlayFS write amplification is real. A “small write” inside a container can trigger copy-up operations and metadata writes on the host filesystem.
  6. Ext4 journaling can become a bottleneck under metadata-heavy workloads. Lots of file creates/deletes can stress the journal even if data throughput is low.
  7. Logging has been a repeat offender since the dawn of daemons. The only thing more infinite than human optimism is an unbounded log file on a shared disk.
  8. NVMe isn’t immune to latency spikes. Firmware, thermal throttling, power management, and device-level garbage collection can still produce painful tail latency.
  9. “Load average” includes tasks in uninterruptible sleep (D state). Storage stalls can inflate load even when CPU usage looks fine.

Practical tasks: commands, outputs, and decisions (12+)

These are field tools. Each task includes: a command, typical output, what it means, and the decision you make next. Run them as root or with sufficient privileges where needed.

Task 1: Confirm iowait and load symptoms

cr0x@server:~$ uptime
 14:22:05 up 21 days,  6:11,  2 users,  load average: 28.12, 24.77, 19.03
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server)  01/02/2026  _x86_64_  (32 CPU)

14:22:07     CPU    %usr   %nice    %sys %iowait   %irq   %soft  %steal  %idle
14:22:08     all    3.21    0.00    1.18   62.44   0.00    0.09    0.00   33.08
14:22:09     all    2.97    0.00    1.11   64.02   0.00    0.07    0.00   31.83
14:22:10     all    3.45    0.00    1.29   61.88   0.00    0.08    0.00   33.30

What it means: load average is huge, but user+system CPU is low and iowait is ~60%. Many tasks are blocked on storage.

Decision: treat this as a storage latency/queue problem. Move to disk-level metrics; don’t waste time tuning CPU.

Task 2: Check disk latency and queue depth with iostat

cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server)  01/02/2026  _x86_64_  (32 CPU)

Device            r/s   w/s   rkB/s   wkB/s  await  r_await  w_await  aqu-sz  %util
nvme0n1         120.0 980.0  5120.0 82400.0  86.32   14.22   95.01    22.41  99.50

What it means: device is basically pegged (%util ~99.5). Average write latency is awful (~95ms) and queue depth (aqu-sz) is high.

Decision: you need attribution. Find which processes generate the I/O.

Task 3: Check if tasks are stuck in D state (storage wait)

cr0x@server:~$ ps -eo state,pid,comm,wchan:32 --sort=state | head -n 15
D  18342 postgres         io_schedule
D  19011 containerd-shim  io_schedule
D  19102 python3          ext4_sync_file
D  24410 java             io_schedule
R   1322 ksoftirqd/3      -
R  31201 iostat           -
S      1 systemd          ep_poll

What it means: D state processes are blocked in kernel I/O paths (io_schedule, ext4_sync_file).

Decision: find which of these belong to the noisy container(s).

Task 4: Get per-process I/O rates (quick blame list)

cr0x@server:~$ iotop -oP -d 1 -n 3
Total DISK READ: 5.00 M/s | Total DISK WRITE: 90.00 M/s
PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>    COMMAND
19102 be/4  root     0.00 B/s  45.00 M/s   0.00 % 95.00 % python3 /app/worker.py
18342 be/4  postgres 0.00 B/s  20.00 M/s   0.00 % 80.00 % postgres: checkpointer
24410 be/4  app      0.00 B/s  12.00 M/s   0.00 % 60.00 % java -jar service.jar

What it means: one Python worker is dominating writes; Postgres checkpointer is also active (possibly normal, possibly pushed hard by someone else).

Decision: map PID 19102 to a container and inspect what it writes.

Task 5: Map a PID to a Docker container via cgroups

cr0x@server:~$ cat /proc/19102/cgroup
0::/system.slice/docker-4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d.scope

What it means: the process is inside container 4c3f1f8f2a7b.... With the systemd cgroup driver (which this host uses; see docker info in Task 9) the container ID is embedded in the docker-<id>.scope name. With the cgroupfs driver you’d see /docker/<id> instead.

Decision: identify the human name, image, and purpose of that container.

Task 6: Resolve container ID to name and inspect settings

cr0x@server:~$ docker ps --no-trunc --filter id=4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d
CONTAINER ID                                                       IMAGE                 COMMAND                  CREATED        STATUS        PORTS     NAMES
4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d   acme/etl:2.8.1        "python3 /app/worker.py"   3 hours ago    Up 3 hours              etl-worker-07

cr0x@server:~$ docker inspect -f '{{.HostConfig.LogConfig.Type}} {{json .HostConfig.LogConfig.Config}}' etl-worker-07
json-file {}

What it means: the container is an ETL worker using the default json-file logging with no rotation options set, so its log can grow without bound.

Decision: suspect log-induced writes and/or hot data path writes. Check what files are busy.

Task 7: See what files the noisy PID is writing

cr0x@server:~$ lsof -p 19102 | awk '{print $4, $9}' | head -n 15
1u /dev/null
2u /dev/null
3w /var/lib/docker/containers/4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d/4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d-json.log
4w /data/spool/outgoing/events.ndjson
5w /data/spool/outgoing/events.ndjson

What it means: it’s writing to the Docker json log file and also to a host-mounted path /data/spool/outgoing/events.ndjson.

Decision: check filesystem usage, mount options, and whether that path shares the same device as everything else.

Task 8: Determine which filesystem/device backs the hot path

cr0x@server:~$ df -T /var/lib/docker /data/spool/outgoing
Filesystem     Type  1K-blocks       Used  Available Use% Mounted on
/dev/nvme0n1p2 ext4  960379012  712883120  198472344  79% /
/dev/nvme0n1p2 ext4  960379012  712883120  198472344  79% /

What it means: both Docker storage and the spool directory share the same ext4 filesystem on the same device.

Decision: any write storm will punish the entire host. If you can’t move paths to a different device immediately, throttling becomes your emergency brake.

Task 9: Check Docker’s storage driver (overlay2 behavior matters)

cr0x@server:~$ docker info | sed -n '1,40p'
Client:
 Version:           25.0.3
Server:
 Containers: 47
  Running: 39
 Storage Driver: overlay2
 Backing Filesystem: extfs
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2

What it means: overlay2 on ext4, json-file logs, cgroup v2 with systemd driver. This is a good baseline for using io.max/io.weight in cgroup v2.

Decision: throttle via cgroup v2 (preferred) and fix logging.

Task 10: Verify memory pressure and swap (don’t miss the obvious)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        96Gi        2.1Gi       1.2Gi        27Gi        18Gi
Swap:           16Gi        9.5Gi       6.5Gi

cr0x@server:~$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 18 9961472 2200000 210000 22400000  80  150  1200  92000 1200 3100  3  1 34 62  0
 1 20 9961600 2180000 210000 22390000  60  120  1100  89000 1180 3000  3  1 32 64  0

What it means: swap is in use and there’s swap I/O (si/so). That can amplify disk pain, but the dominant issue is huge block output (bo) and high wa.

Decision: keep swap in mind, but proceed with I/O attribution. If swap is thrashing, also consider limiting the offender’s memory.

Task 11: Identify the container’s cgroup path (systemd driver, cgroup v2)

cr0x@server:~$ systemctl status docker | sed -n '1,12p'
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-12-10 08:14:10 UTC; 3 weeks 2 days ago
       Docs: man:docker(1)
   Main PID: 1420 (dockerd)

cr0x@server:~$ cat /proc/19102/cgroup
0::/system.slice/docker-4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d.scope

What it means: with systemd + cgroup v2, the container lives in a systemd scope unit under /sys/fs/cgroup/system.slice/.

Decision: apply I/O limits to that scope unit via cgroup files (or systemd properties), not by guessing.

Task 12: Locate the block device major:minor for the filesystem

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE /var/lib/docker
/dev/nvme0n1p2 / ext4

cr0x@server:~$ lsblk -o NAME,MAJ:MIN,SIZE,TYPE,MOUNTPOINT /dev/nvme0n1p2
NAME        MAJ:MIN  SIZE TYPE MOUNTPOINT
nvme0n1p2   259:2    915G part /

What it means: your target block device is major:minor 259:2. cgroup v2 I/O controls require that identifier.

Decision: set io.max and/or io.weight for that device on the container’s cgroup.

Task 13: Apply a cgroup v2 bandwidth limit (immediate containment)

cr0x@server:~$ CG=/sys/fs/cgroup/system.slice/docker-4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d.scope
cr0x@server:~$ echo "259:2 wbps=20971520 rbps=10485760" | sudo tee $CG/io.max
259:2 wbps=20971520 rbps=10485760

cr0x@server:~$ cat $CG/io.max
259:2 rbps=10485760 wbps=20971520

What it means: reads capped at 10 MiB/s, writes capped at 20 MiB/s for that container against the root device.

Decision: watch latency and system responsiveness. If the host recovers, you’ve confirmed “noisy neighbor” and bought time for a real fix.

Task 14: Verify improvement immediately

cr0x@server:~$ iostat -x 1 3
Device            r/s    w/s    rkB/s   wkB/s  await  aqu-sz  %util
nvme0n1         110.0  260.0   4800.0 21000.0  12.40    2.10  78.00

cr0x@server:~$ mpstat 1 3 | tail -n 4
14:25:08     all    6.01    0.00    2.10    8.33   0.00    0.12    0.00   83.44

What it means: device latency dropped (await ~12ms) and iowait fell sharply (~8%). The host is breathing again.

Decision: keep the limit as a temporary guardrail, then fix the root cause: logging, batching, data placement, or app design.

Task 15: Put log rotation in place (stop the bleeding)

Log options are fixed when a container is created; docker update does not accept --log-opt. Recreate the container (or redeploy it through your orchestrator) with rotation enabled:

cr0x@server:~$ docker rm -f etl-worker-07
etl-worker-07

cr0x@server:~$ docker run -d --name etl-worker-07 \
>   --log-opt max-size=50m --log-opt max-file=3 \
>   [same image, mounts, and environment as before]

cr0x@server:~$ docker inspect -f '{{json .HostConfig.LogConfig}}' etl-worker-07
{"Type":"json-file","Config":{"max-file":"3","max-size":"50m"}}

What it means: logs will rotate and won’t grow without bounds. Setting the same options as a daemon-wide default (see the Hardening checklist) avoids repeating this per container.

Decision: if logs were a major contributor, you may reduce the I/O limit or remove it after confirming stability.

Task 16: Find which containers are generating the most writable-layer churn

cr0x@server:~$ docker ps -q | while read c; do
>   name=$(docker inspect -f '{{.Name}}' $c | sed 's#^/##')
>   rw=$(docker inspect -f '{{.GraphDriver.Data.UpperDir}}' $c 2>/dev/null)
>   if [ -n "$rw" ]; then
>     sz=$(sudo du -sh "$rw" 2>/dev/null | awk '{print $1}')
>     echo "$sz  $name"
>   fi
> done | sort -h | tail -n 8
3.1G  etl-worker-07
4.8G  api-gateway
6.2G  report-builder
9.7G  search-indexer

What it means: the writable layers (UpperDir) show heavy churn. Not all churn is bad, but it’s a smell.

Decision: move write-heavy paths to volumes/bind mounts, and reduce writes to the container’s writable layer.

Mapping I/O pain to the guilty container

The hardest part of “Docker caused I/O wait” is that the disk doesn’t know what a container is. The kernel knows PIDs, inodes, block devices, and cgroups. Your job is to connect those dots.

Start from the disk and work upward

If iostat shows a single device with horrible await and high %util, you’re looking at either saturation or device-level latency. From there:

  • Use iotop to identify top writing PIDs.
  • Map PID → cgroup → container ID (a scripted version of this mapping follows after this list).
  • Use lsof to understand the path: logs, volume, overlay UpperDir, database files.
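
If you want to script that chain, here is a minimal sketch. It assumes cgroup v2 with the systemd cgroup driver (so container IDs appear as docker-<id>.scope in /proc/<pid>/cgroup, as in Task 11); the 1 MB/s threshold, sample length, and head count are arbitrary starting points.

# Rank writers over a short pidstat sample, then map each PID back to a container name.
pidstat -d 1 5 | awk '$1=="Average:" && $5+0 > 1024 {print $3, $5}' | sort -k2 -nr | head -n 10 |
while read pid wr_kbs; do
  # Pull a 64-hex container ID out of the docker-<id>.scope cgroup path, if present.
  id=$(sed -n 's/.*docker-\([0-9a-f]\{64\}\)\.scope.*/\1/p' /proc/"$pid"/cgroup 2>/dev/null | head -n 1)
  if [ -n "$id" ]; then
    name=$(docker ps --filter "id=$id" --format '{{.Names}}')
    echo "$pid writes ${wr_kbs} kB/s in container ${name:-$id}"
  else
    echo "$pid writes ${wr_kbs} kB/s (not in a Docker container, or it already exited)"
  fi
done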

Start from the container and work downward

Sometimes you already suspect the culprit (“it’s the batch job, isn’t it?”). Validate without self-deception:

  • Inspect container log config and volume mounts.
  • Check the size of its json log file and writable layer.
  • Check whether it’s a sync-heavy application (databases, message brokers, anything that values durability).

Understanding overlay2’s special brand of chaos

overlay2 is efficient for many workloads, but write-heavy patterns can get expensive. Writes to files that exist in the lower (image) layer trigger a copy-up into the upper layer. Metadata operations can also explode, particularly for workloads that touch lots of small files.

If you see heavy writes under /var/lib/docker/overlay2, don’t “optimize overlay2.” The right move is usually to stop writing there. Put mutable data on volumes or bind mounts where you can control filesystems, mount options, and I/O isolation.
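
As a sketch of what “stop writing there” looks like: mount the hot path from a dedicated filesystem instead of letting it land in the writable layer. The host path below is hypothetical, and the container-side path assumes the app expects /data/spool; adapt both to your layout.

# Assumes a separate filesystem is already mounted at /mnt/etl-spool (hypothetical path).
docker run -d --name etl-worker-07 \
  --mount type=bind,source=/mnt/etl-spool,target=/data/spool \
  --log-opt max-size=50m --log-opt max-file=3 \
  acme/etl:2.8.1

A named volume (docker volume create, backed by a device or driver of your choice) achieves the same separation when you’d rather not manage host paths by hand.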

Joke #2: Overlay filesystems are like office politics—everything looks simple until you try to change one small thing and suddenly three departments are involved.

Throttling options that actually work (and what to avoid)

You have three classes of controls: container-level (cgroups), host-level (I/O scheduler and filesystem choices), and application-level (how the workload writes). If you only do one, do cgroups first to stop the blast radius.

Option A: cgroup v2 io.max (hard caps)

On modern distros and Docker setups using cgroup v2, io.max is your most direct lever: it caps read/write bandwidth and IOPS per block device.

When to use: the host is unresponsive and you need immediate containment; a batch job can be slowed without breaking correctness.

Trade-offs: it’s blunt. If you cap too hard, you can cause timeouts in the throttled container. That might still be better than taking down everything else.

Option B: cgroup v2 io.weight (relative fairness)

io.weight is nicer than a hard cap: it tells the kernel “this cgroup matters less/more.” If you have multiple workloads and want them to share fairly, weights can be better than strict limits.

When to use: you want to protect latency-sensitive services while letting batch jobs use leftover capacity.

Trade-offs: if the device is saturated by one job and others are light, weight may not save you enough; you may still need a cap.
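
A minimal sketch, reusing the $CG scope path from Task 13. Note that io.weight only takes effect when the kernel has a proportional I/O policy active for the device (BFQ as the scheduler, or the io.cost controller); otherwise the write succeeds but changes nothing, so verify on your stack.

# Halve this container's share relative to the default weight of 100.
cat "$CG/io.weight"                              # typically: default 100
echo "default 50" | sudo tee "$CG/io.weight"
grep -H . /sys/fs/cgroup/system.slice/*/io.weight 2>/dev/null | head   # compare against sibling scopes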

Option C: Docker legacy blkio flags (cgroup v1 era)

Docker’s --blkio-weight, --device-read-bps, --device-write-bps and friends were designed around cgroup v1. They can still help depending on your kernel, device, and driver, but they’re less predictable once layered filesystems and modern multi-queue block subsystems are involved.

Opinionated guidance: if you’re on cgroup v2, prefer io.max/io.weight at the cgroup filesystem. Treat Docker blkio flags as “works in a lab” unless you’ve validated on your kernel and storage stack.
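
If you do use the Docker flags, they are set at container creation time and look like this; the values are illustrative, and behavior depends on your kernel, cgroup version, and storage stack, so test before trusting it in anger.

docker run -d --name etl-worker-07 \
  --device-read-bps /dev/nvme0n1:10mb \
  --device-write-bps /dev/nvme0n1:20mb \
  --blkio-weight 300 \
  acme/etl:2.8.1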

Option D: Systemd properties for Docker scopes (cleaner automation)

If containers show up as systemd scope units, you can set properties via systemctl set-property. This avoids hand-writing to cgroup files, and it survives some operational workflows better.
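
A sketch using the scope name from Task 11. IOReadBandwidthMax, IOWriteBandwidthMax, and IOWeight are standard systemd resource-control properties; use --runtime unless you have confirmed that persisting properties for Docker-created scopes behaves sanely in your setup.

SCOPE=docker-4c3f1f8f2a7b6c7f2c7a1d3b8ad9b9d2f2b2a9c8e1d2a3b4c5d6e7f8a9b0c1d.scope
sudo systemctl set-property --runtime "$SCOPE" \
  IOReadBandwidthMax="/dev/nvme0n1p2 10M" \
  IOWriteBandwidthMax="/dev/nvme0n1p2 20M"
systemctl show "$SCOPE" -p IOReadBandwidthMax -p IOWriteBandwidthMax   # verify what systemd applied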

Option E: Fix the root cause (the only permanent win)

Throttling is for containment. Root cause wins look like:

  • Move write-heavy paths to dedicated volumes on separate devices.
  • Cap and route logs (rotate, compress, or ship off-host).
  • Reduce sync frequency: batch writes, use group commit settings (carefully), or change the durability posture explicitly.
  • Pick better filesystems for the workload (or at least mount options).
  • Stop writing millions of tiny files if you can store them as larger segments.

What to avoid

  • Don’t “fix” iowait by adding CPU. That’s like buying a faster car to sit in worse traffic.
  • Don’t disable journaling or durability features casually. You can absolutely improve performance—right up until you learn what a power loss feels like.
  • Don’t blame Docker as a concept. Docker is just the messenger. The real enemy is ungoverned shared storage.

Three corporate mini-stories from the I/O mines

1) Incident caused by a wrong assumption: “It’s in a container, so it’s isolated”

The team had a busy host: API services, a few background jobs, and a “temporary” importer container that pulled partner data nightly. The importer was deployed with no explicit limits. Nobody worried. It was in Docker, after all.

The first night it ran on the shared production host, latency alerts hit everything at once. Not just the importer. The APIs slowed, the metrics pipeline lagged, SSH sessions froze mid-command. CPU graphs looked “fine,” which is exactly what makes this failure mode so disorienting: the CPUs were waiting politely for storage to answer.

The on-call engineer chased application logs and upstream dependencies for twenty minutes because the symptoms looked like a distributed outage. Only after checking iostat did the picture snap into focus: one NVMe device pinned at near-100% utilization, with write latency spiking. Then iotop pointed at a single process churning writes at a steady pace.

The wrong assumption was subtle: “containers isolate resources by default.” They don’t. CPU and memory can be limited easily, but storage is shared unless you enforce controls or split the underlying devices. The fix that night was an emergency cgroup cap. The long-term fix was even less exciting: the importer got its own volume on a separate device, plus an I/O limit as a seatbelt.

2) Optimization that backfired: “More concurrency means faster imports”

A data platform team tried to speed up a transformation job. They doubled worker concurrency and switched to smaller file chunks to increase parallelism. On paper, it should have been faster: more workers, more throughput, less idle time.

In production, it became a latency grenade. The job produced thousands of small files and did frequent fsync calls to “be safe.” The filesystem journal started to dominate. Throughput didn’t double; it cratered. Worse, the rest of the host suffered because the job kept the storage queue saturated with tiny operations that are poison for tail latency.

The backfire wasn’t just “too much load.” It was load shape. Lots of small synchronous operations are a different beast than large sequential writes. On modern storage, you can still drown in metadata updates and barrier flushes. Your fast NVMe can become a very expensive spinning disk if you treat it like a random write punching bag.

They recovered by reducing concurrency, batching outputs into larger segments, and writing to a volume on a separate filesystem tuned for the workload. Then they added an I/O cap so that future “optimizations” couldn’t take the host hostage. The final postmortem note was blunt: optimizing a job without defining acceptable collateral damage is not optimization; it’s gambling.

3) Boring but correct practice that saved the day: per-service QoS and separate spindles

Another organization ran a mixed fleet: web services, caches, a couple of stateful databases, and periodic analytics. They had been burned before, so they built a boring rulebook.

Stateful services got dedicated volumes on separate devices. No exceptions. Batch workloads ran in containers with predefined I/O weights and caps. Logging was capped by default. The platform team even maintained a simple runbook that started with iostat, iotop, and “map PID to cgroup.” Nothing fancy.

One afternoon, a new container image shipped with debug logging enabled. It started writing aggressively, but the host didn’t melt. The logging container hit its cap, slowed itself down, and the rest of the fleet kept serving. The on-call still had to fix the misconfiguration, but it was a contained incident, not an outage.

The practice wasn’t glamorous: allocate storage intentionally, apply QoS, and enforce defaults. But that’s what reliability looks like—less heroics, more guardrails.

Common mistakes: symptom → root cause → fix

1) Symptom: load average is huge, CPU usage is low

Root cause: tasks blocked in D state on storage; load includes uninterruptible sleep.

Fix: confirm with ps/vmstat; identify top I/O PIDs; map to container; apply io.max or weights; then address the writing behavior.

2) Symptom: disk is 100% utilized but throughput isn’t impressive

Root cause: small random I/O, metadata storms, journal contention, or frequent flushes causing high per-operation overhead.

Fix: reduce file churn; batch writes; move hot paths to a tuned volume; consider BFQ for fairness on some devices; cap the offender.

3) Symptom: a single container “looks fine” on CPU/mem but host is unusable

Root cause: no I/O isolation; container hammers shared device (logs, temp files, database checkpoints).

Fix: enforce I/O QoS via cgroups; cap logs; put write-heavy paths on separate volumes.

4) Symptom: docker logs is slow and /var/lib/docker grows fast

Root cause: json-file logging without rotation; huge log file causing extra writes and metadata updates.

Fix: set max-size and max-file; ship logs elsewhere; disable debug logging in production by default.

5) Symptom: after “throttling,” app starts timing out

Root cause: hard caps too low for the workload’s latency/durability requirements, or the workload expects burst I/O.

Fix: raise limits until SLOs recover; prefer weights over strict caps for mixed workloads; fix application batching and backpressure.

6) Symptom: iowait is high but iostat looks normal

Root cause: I/O might be on a different device (loopback, network storage), or the bottleneck is in filesystem locks/metadata not shown clearly in simple device stats.

Fix: check mounts with findmnt; check other devices with iostat -x across all; use pidstat -d and lsof to locate paths; verify network-backed volumes separately.

7) Symptom: one container reads heavily and writes starve (or vice versa)

Root cause: scheduler behavior and queue contention; mixed read/write patterns can cause unfairness and tail spikes.

Fix: apply separate read/write caps in io.max; isolate workloads to separate devices; schedule batch reads off-peak.

Checklists / step-by-step plan

Emergency containment (15 minutes)

  1. Confirm iowait and device latency: mpstat, iostat -x.
  2. Find top I/O PIDs: iotop -oP, pidstat -d 1.
  3. Map PID → container via /proc/<pid>/cgroup and docker ps.
  4. Apply cgroup v2 io.max cap for the container on the offending device.
  5. Verify improvement: iowait down, await down, host responsiveness back.
  6. Communicate clearly: “We throttled container X to stabilize host; workload Y may run slower.”

Stabilization (same day)

  1. Fix logging: cap json-file logs or move to a driver/agent that doesn’t punish the root disk.
  2. Move write-heavy data off overlay writable layer to a volume or bind mount.
  3. Check filesystem fullness: free space and inode usage (see the quick check after this list). Full disks create weird performance grief.
  4. Confirm swap behavior; if swap is heavy, add memory limits to the offender or adjust workload.
  5. Document the exact cgroup settings applied so you can reproduce and revert.
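
For the fullness check above, two commands cover it; the paths are the ones from this walkthrough, so substitute your own.

df -h /var/lib/docker /data    # free space on the filesystems that matter
df -i /var/lib/docker /data    # inode usage; 100% here hurts even with free space left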

Hardening (this sprint)

  1. Set default logging limits for all containers (policy, not a suggestion); a daemon.json sketch follows this list.
  2. Define I/O classes: latency-sensitive services vs batch/ETL; apply weights/caps accordingly.
  3. Separate storage for stateful services; don’t co-locate database files with container runtime storage if you can avoid it.
  4. Build a runbook section: “map PID to container” with the exact commands for your cgroup mode.
  5. Add alerting on disk latency (await) and queue depth, not just throughput and %util.
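
For the logging default in item 1, here is a minimal sketch of /etc/docker/daemon.json. If the file already exists, merge these keys by hand rather than overwriting it, and note that the defaults apply only to containers created after the daemon restart.

# Overwrites daemon.json; merge manually if it already has content.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}
EOF
sudo systemctl restart docker   # restarting dockerd restarts containers unless live-restore is enabled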

FAQ

1) Why does high iowait make SSH and simple commands hang?

Because those commands need disk too: reading binaries, writing shell history, updating logs, touching files, paging memory. When the disk queue is deep or latency spikes, everything that needs I/O blocks.

2) Is iowait always a sign the disk is “slow”?

No. It can be a sign of sync-heavy application behavior, filesystem contention, swap activity, or device queue saturation. You confirm with latency/queue metrics like await and aqu-sz.

3) I capped a container’s write bandwidth and the host improved. Does that prove the container is the root cause?

It proves it’s a major contributor to the device queue. Root cause may still be architectural: shared disks, logging defaults, or a design that turns small writes into flush storms.

4) Should I use IOPS limits or bandwidth limits?

Bandwidth limits are a good blunt tool for streaming workloads. IOPS limits can be better for random I/O and metadata-heavy workloads. Use whichever matches the pain: if latency spikes on tiny ops, IOPS caps can help more.
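
On cgroup v2, io.max accepts riops/wiops keys alongside rbps/wbps, so you can cap operations per second instead of (or in addition to) bytes. A sketch reusing the $CG scope path and the 259:2 device from Task 13; the numbers are illustrative.

echo "259:2 riops=4000 wiops=2000" | sudo tee "$CG/io.max"
echo "259:2 wbps=max" | sudo tee "$CG/io.max"   # writing "max" for a key clears just that limit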

5) Do Docker blkio flags work with overlay2?

Sometimes, but not reliably enough to bet your outage budget on. With cgroup v2, prefer io.max/io.weight on the container scope. Validate in your environment.

6) If I move data to a volume, why does it help?

You avoid overlay write amplification and you gain control: you can place the volume on a different device, choose filesystem options, and isolate I/O more cleanly.

7) Can I fix this by switching the I/O scheduler?

Sometimes you can improve fairness or latency tails, but scheduler tweaks won’t save you from an unbounded writer on a shared disk. Throttle first; tune second.
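
Checking and switching the scheduler per device is cheap to try; a sketch for the device from this walkthrough. Available schedulers vary by kernel build (BFQ may need its module loaded), and the echo does not persist across reboots.

cat /sys/block/nvme0n1/queue/scheduler                    # e.g. [none] mq-deadline kyber bfq
echo bfq | sudo tee /sys/block/nvme0n1/queue/scheduler    # takes effect immediately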

8) Why do logs cause so much damage?

Because they are writes you didn’t budget for, often synchronous-ish, often endless, and they live on the same disk as everything else by default. Rotate them and ship them away.

9) How do I know whether the bottleneck is the filesystem journal?

You’ll see high latency with lots of small writes, many tasks in ext4_sync_file or similar wait channels, and heavy metadata activity. The fix is usually workload shape (batching) and data placement, not mystical kernel flags.

10) Is throttling “safe” for databases?

It depends. For a database serving production traffic, hard caps can cause timeouts and cascading failures. Prefer weights and proper storage isolation. If you must cap, do it carefully and monitor.

Next steps (what you do on Monday)

When one container drags your host into I/O wait purgatory, the winning move is not guessing. It’s attribution and control: measure device latency, identify the top I/O PIDs, map them to containers, and apply cgroup I/O limits that keep the host alive.

Then you do the grown-up part: stop writing to overlay layers, cap logging by default, and separate storage for workloads that have no business sharing a queue. Keep the throttles as guardrails, not as a substitute for architecture. Your future on-call self will be boringly grateful.
