Docker IOPS Starvation: Why One DB Container Makes Everything Lag

The symptom: you deploy a database container, and suddenly the whole host feels like it’s running off a USB thumb drive from 2009. SSH takes seconds to echo characters. Other containers time out. CPU is “fine.” Memory is “fine.” Yet everything is miserable.

The cause: storage contention. Specifically, one workload is consuming (or triggering) most of the IOPS budget and raising latency for everyone else. Docker didn’t “break” your server. You just learned, the hard way, that disks are shared, queueing is real, and database sync semantics don’t negotiate.

What IOPS starvation looks like on a Docker host

IOPS starvation is not “the disk is full.” It’s “the disk is busy.” More precisely: the storage path is saturated such that average request latency spikes, and every workload sharing that path pays the tax.

On Linux hosts running Docker, this usually presents as:

  • High iowait (but not always). If you look only at CPU percent, you can miss it.
  • Latency spikes for reads and/or writes. Databases hate write latency because fsync is a contract, not a suggestion.
  • Queue buildup. Requests pile up in the block layer or the device, and everything becomes “slow,” including unrelated services.
  • Odd, indirect symptoms: slow DNS inside containers, slow logging, systemd services timing out, Docker daemon hiccups, even “random” health checks failing.

Why does SSH lag? Because your terminal writes to a PTY, shells touch disk for history, logs flush, and the kernel is busy shepherding IO. The system isn’t dead. It’s waiting for the slowest shared resource to come back from lunch.

Why one container can hurt everything

Docker containers are process isolation plus some filesystem magic. They are not physical isolation. If multiple containers share:

  • the same block device (same root volume, same EBS disk, same RAID group, same SAN LUN),
  • the same filesystem,
  • the same I/O scheduler queues,
  • and often the same writeback and journaling behavior,

…then one container can absolutely degrade everyone else. This is “noisy neighbor” in its purest form.
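You can see the sharing directly. A quick sketch (db and api are placeholder container names; swap in your own):

# Resolve each container's writable layer to the filesystem that backs it.
# The same SOURCE for every container (and for your DB volume path) means one shared device.
for c in db api; do
  docker inspect -f '{{.GraphDriver.Data.UpperDir}}' "$c" \
    | xargs -I{} findmnt -no SOURCE,TARGET -T {}
done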

Databases are latency amplifiers

Most DB engines do a lot of small random IO. They also do fsync (or similar) to guarantee durability. If the storage is slow, the DB does not “catch up” by trying harder; it blocks and waits. That blocking can cascade into connection pools, application threads, and retry storms.

And here’s the part Docker makes worse: the default storage stack can add overhead when the DB writes lots of small files or churns metadata.

Overlay filesystems can add write amplification

Many Docker hosts use overlay2 for the container’s writable layer. Overlay filesystems are great for image layering and developer ergonomics. Databases do not care about your ergonomics; they care about predictable latency.

If your DB writes to the container’s writable layer (instead of a dedicated volume), you can trigger extra metadata operations, copy-up behavior, and less-friendly write patterns. Sometimes it’s fine. Sometimes it’s a performance crime scene with receipts.
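Two quick checks for whether a container is churning its writable layer instead of a volume (a sketch; db is a placeholder container name):

# Files created or changed in the writable layer; a DB data directory showing up here is the red flag.
docker diff db | head -n 20

# How much the writable layer has grown on disk.
du -sh "$(docker inspect -f '{{.GraphDriver.Data.UpperDir}}' db)"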

“But the disk is only at 20% throughput” is a trap

Throughput (MB/s) is not the whole story. IOPS and latency matter more for many DB workloads. You can have a disk doing 5 MB/s and still be completely saturated on IOPS with 4K random writes.

That’s why monitoring that only tracks “disk bandwidth” says everything is chill while the host quietly cries in iowait.
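If you want to see the trap in numbers, a short fio run makes the point (a sketch, assuming fio is installed; /mnt/scratch is a placeholder — point it at a scratch path on the disk you are testing, never at live DB data):

# 4K random writes with a sync after each one: low MB/s, brutal IOPS and latency.
fio --name=tiny-sync-writes --directory=/mnt/scratch \
    --rw=randwrite --bs=4k --size=1G --runtime=30 --time_based \
    --ioengine=libaio --iodepth=32 --fsync=1 --group_reporting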

Joke #1: When a database says it’s “waiting on disk,” it’s not being passive-aggressive. It’s being accurate.

Interesting facts and historical context

  1. IOPS became a mainstream metric because of OLTP: transactional databases pushed the industry to measure “how many small operations per second,” not just MB/s.
  2. Linux’s CFQ scheduler used to be the default for fairness on many distros; modern systems often use mq-deadline or none for NVMe, changing how “fair” contention feels.
  3. cgroups v1 had blkio controls early (weight and throttling). cgroups v2 moved to a different interface (io.max, io.weight), which catches people mid-migration.
  4. Overlayfs was built for union mounts and layers, not high-churn database files. It’s improved a lot, but the workload mismatch still shows up in the worst places: latency tails.
  5. Write barriers and flushes exist because caches lie: storage devices reorder writes for performance, so the OS uses flushes/FUA to enforce ordering when applications demand durability.
  6. Journaling filesystems trade extra writes for consistency: ext4 and XFS are reliable, but journal behavior can increase IO load under heavy metadata churn.
  7. Cloud block storage often has burst credits: you can be fast for a while, then suddenly slow. The container didn’t change; the storage tier did.
  8. NVMe improved parallelism massively with multiple queues, but it didn’t delete contention; it just moved it to different queues and limits.
  9. “fsync storms” are a classic failure mode: many threads call fsync, queues fill, latency spikes, and a system that looked fine at p50 falls apart at p99.

Fast diagnosis playbook

This is the “I have five minutes and production is on fire” sequence. Don’t debate architecture while the host is stalling. Measure, identify the bottleneck, then decide whether to throttle, move, or isolate.

First: confirm it’s IO latency, not CPU or memory

  • Check load, iowait, and run queue.
  • Check disk latency and queue depth.
  • Check if one process/container is the top IO consumer.

Second: map the pain to a device and mount

  • Which block device is slow?
  • Is Docker root on that device?
  • Are database files on overlay2 or a dedicated volume?

Third: decide on an immediate mitigation

  • Throttle the noisy container (IOPS or bandwidth) to save the rest of the host.
  • Move DB data to a dedicated volume/device.
  • Stop the bleeding: disable chatty logging, pause noncritical jobs, reduce concurrency.

Fourth: plan the real fix

  • Use volumes for DB data (not container writable layers).
  • Use IO isolation (cgroups v2 io.max / weights) where possible.
  • Pick storage and filesystem options appropriate for DB sync patterns.

Practical tasks: commands, outputs, decisions

Below are real tasks you can run on a Linux Docker host. Each includes: command, example output, what it means, and the decision you make from it. Run them as root or with sudo where needed.

Task 1: Check overall CPU, iowait, and load

cr0x@server:~$ top -b -n 1 | head -n 5
top - 12:41:02 up 21 days,  4:17,  2 users,  load average: 9.12, 7.84, 6.30
Tasks: 312 total,   4 running, 308 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.1 us,  2.4 sy,  0.0 ni, 61.7 id, 25.9 wa,  0.0 hi,  1.9 si,  0.0 st
MiB Mem :  64035.7 total,   3211.3 free,  10822.9 used,  499... buff/cache
MiB Swap:  8192.0 total,   8192.0 free,      0.0 used.  5... avail Mem

Meaning: 25.9% iowait combined with high load suggests lots of threads blocked on IO. CPU isn’t “busy,” it’s waiting.

Decision: Stop blaming the scheduler and start measuring disk latency and queueing.

Task 2: Identify which disks are suffering (latency, utilization, queue depth)

cr0x@server:~$ iostat -x 1 3
Linux 6.2.0 (server) 	01/03/2026 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          7.52    0.00    2.61   23.98    0.00   65.89

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz  %util
nvme0n1         95.0   820.0  3200.0  9870.0     0.0    15.0   0.00   1.80    3.2   42.7  18.90  98.50

Meaning: %util near 100% and w_await ~43ms means the device is saturated for writes. aqu-sz (average queue size) is huge, confirming backlog.

Decision: Find the writer. Don’t tune databases blind; identify the process/container generating these writes.

Task 3: See if the block layer is backlogged (per-device stats)

cr0x@server:~$ cat /proc/diskstats | grep -E "nvme0n1 "
259       0 nvme0n1  128930 0 5128032 12043  942110 0 9230016 390122  0 220010 402210  0 0 0 0

Meaning: The exact fields are dense, but a quick tell is high time spent doing IO compared to baseline. Combine with iostat -x for sanity.

Decision: If you see this spike only during DB load, you’re in a classic “one workload saturates device” situation.

Task 4: Find the top IO processes (host view)

cr0x@server:~$ sudo iotop -b -n 3 -o
Total DISK READ:         0.00 B/s | Total DISK WRITE:      74.32 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:    71.91 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
21491 be/4 postgres     0.00 B/s   55.12 M/s  0.00 %  89.15 % postgres: checkpointer
21510 be/4 postgres     0.00 B/s   10.43 M/s  0.00 %  63.20 % postgres: walwriter
 4321 be/4 root         0.00 B/s    3.01 M/s  0.00 %  12.10 % dockerd

Meaning: Postgres processes dominate write IO and spend high % of time in IO wait. dockerd writing a bit is normal (logs, layers), but it’s not the primary offender.

Decision: Confirm which container owns these processes, then decide whether to throttle or move its storage.

Task 5: Map a process to a container

cr0x@server:~$ ps -o pid,cgroup,cmd -p 21491 | sed -n '1,2p'
  PID CGROUP                                                          CMD
21491 0::/docker/8b6c3b7e4a3a9b7d2a7b55c4a1a2f9b9b0f6c0f9d1a7b1e3c9e3a2c1e5b  postgres: checkpointer

Meaning: The process is in a Docker cgroup named by container ID.

Decision: Inspect that container. Confirm its mounts and whether it uses overlay2 or a volume.

Task 6: Inspect container mounts and storage driver

cr0x@server:~$ docker inspect -f 'Name={{.Name}} Driver={{.GraphDriver.Name}} DataRoot={{json .GraphDriver.Data}} Mounts={{json .Mounts}}' 8b6c3b7e4a3a
Name=/db Driver=overlay2 DataRoot={"LowerDir":"/var/lib/docker/overlay2/l/..","MergedDir":"/var/lib/docker/overlay2/1d3.../merged","UpperDir":"/var/lib/docker/overlay2/1d3.../diff","WorkDir":"/var/lib/docker/overlay2/1d3.../work"} Mounts=[{"Type":"volume","Name":"pgdata","Source":"/var/lib/docker/volumes/pgdata/_data","Destination":"/var/lib/postgresql/data","Driver":"local","Mode":"z","RW":true,"Propagation":""}]

Meaning: Storage driver is overlay2 for the writable layer, but the DB data directory is mounted as a named volume. That’s good. If DB data were in UpperDir, it would be worse.

Decision: If DB data isn’t on a volume, fix that first. If it is, check where the volume physically lives (same device as everything else?).

Task 7: Check what device backs Docker data root and volumes

cr0x@server:~$ docker info --format 'DockerRootDir={{.DockerRootDir}} Driver={{.Driver}}'
DockerRootDir=/var/lib/docker Driver=overlay2
cr0x@server:~$ df -hT /var/lib/docker /var/lib/docker/volumes/pgdata/_data
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4  900G  612G  242G  72% /
/dev/nvme0n1p2 ext4  900G  612G  242G  72% /

Meaning: Docker root and the database volume are on the same ext4 filesystem on the same NVMe partition. So contention is guaranteed.

Decision: If you need the DB and the rest of the host to coexist, you must isolate IO (cgroup IO controls) or move DB data to a dedicated device.

Task 8: Check filesystem mount options (barriers, atime, etc.)

cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /
/ /dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro

Meaning: relatime is fine. Nothing obviously weird. If you see exotic options (or misapplied tuning), it’s a lead.

Decision: Don’t change mount options during an incident unless you’re certain; use throttling/migration as safer first aid.

Task 9: Check if the host is stuck in writeback congestion

cr0x@server:~$ cat /proc/meminfo | egrep 'Dirty|Writeback|WritebackTmp'
Dirty:              82456 kB
Writeback:         195120 kB
WritebackTmp:           0 kB

Meaning: Elevated Writeback can indicate lots of data being flushed to disk. It doesn’t prove DB guilt, but it supports “storage pipeline overloaded.”

Decision: If writeback is persistently high and latency is high, reduce write pressure and isolate the heavy writer.

Task 10: Look at per-process IO counters (sanity)

cr0x@server:~$ sudo cat /proc/21491/io | egrep 'write_bytes|cancelled_write_bytes'
write_bytes: 18446744073709551615
cancelled_write_bytes: 127385600

Meaning: Some kernels expose counters in ways that can be confusing (and some filesystems don’t report cleanly). Treat as directional, not absolute.

Decision: If this is inconclusive, rely on iotop, iostat, and device-level latency metrics.

Task 11: Check Docker container resource limits (CPU/mem) and note the absence of IO limits

cr0x@server:~$ docker inspect -f 'CpuShares={{.HostConfig.CpuShares}} Memory={{.HostConfig.Memory}} BlkioWeight={{.HostConfig.BlkioWeight}}' 8b6c3b7e4a3a
CpuShares=0 Memory=0 BlkioWeight=0

Meaning: No explicit limits. CPU and memory are commonly constrained; IO often isn’t. That’s how you get a single container flattening a host.

Decision: Add IO controls (Docker blkio on cgroups v1, or systemd/cgroups v2 controls), or isolate the workload on different storage.

Task 12: Apply a temporary IO throttle (bandwidth) to a container (incident mitigation)

Note: --device-write-bps and --device-write-iops are docker run/create options; docker update cannot add them to a live container. For a container that is already running, write the limit straight into its cgroup (cgroup v2 shown; the path matches the cgroup from Task 5, and will be system.slice/docker-<full-id>.scope instead if the host uses the systemd cgroup driver).

cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' 8b6c3b7e4a3a)
cr0x@server:~$ echo "259:0 wbps=20971520" | sudo tee /sys/fs/cgroup/docker/$CID/io.max
259:0 wbps=20971520

Meaning: Writes from that container to device 259:0 (the whole NVMe disk, not a partition) are now capped at roughly 20 MB/s. If the io.max file doesn’t exist, the io controller isn’t enabled for Docker’s cgroup subtree (check cgroup.subtree_control in the parent directories). This is a blunt tool. It may protect the host at the expense of DB latency and throughput.

Decision: Use this to stop collateral damage. Then move the DB to dedicated storage or implement fair sharing with proper IO controllers.

Task 13: Apply an IOPS throttle (more relevant for small IO)

cr0x@server:~$ echo "259:0 wiops=2000" | sudo tee /sys/fs/cgroup/docker/$CID/io.max
259:0 wiops=2000

Meaning: Limits the container to 2000 write IOPS on that device (reusing CID from Task 12; the earlier wbps cap stays in place until you set it back to max). IOPS caps typically align better with DB pain than MB/s throttles. If you can afford to recreate the container, --device-write-iops on docker run/create is the persistent equivalent.

Decision: If throttling improves host responsiveness, you’ve confirmed “noisy neighbor IO” as the main failure mode. Now architect it out.

Task 14: Check cgroup v2 IO controller status

cr0x@server:~$ stat -fc %T /sys/fs/cgroup
cgroup2fs
cr0x@server:~$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids

Meaning: The host uses cgroups v2 and supports the io controller.

Decision: Prefer cgroups v2 IO control where available; it’s clearer and generally the direction Linux is going.

Task 15: Inspect a container’s IO limits via its cgroup (v2)

cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' 8b6c3b7e4a3a); cat /sys/fs/cgroup/docker/$CID/io.max
259:0 rbps=max wbps=20971520 riops=max wiops=2000

Meaning: io.max only lists devices that have at least one limit configured; an empty file means no limits at all. Here you can see the caps applied in Tasks 12 and 13, with “max” meaning unlimited for that key. Note that the cgroup path uses the full container ID, and lives under system.slice/docker-<full-id>.scope instead if the host uses the systemd cgroup driver.

Decision: If you’re using systemd units or a runtime that integrates with cgroups v2, set io.max or weights for the service/container scope instead of hand-editing files during every incident.
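If the host does use the systemd cgroup driver, the container lives in a docker-<full-id>.scope unit and you can let systemd apply the limits. A minimal sketch (the device path and numbers are examples):

# Runtime-only (not persisted) write bandwidth cap plus a lower IO weight for the container's scope.
CID=$(docker inspect -f '{{.Id}}' 8b6c3b7e4a3a)
sudo systemctl set-property --runtime docker-$CID.scope \
    IOWriteBandwidthMax="/dev/nvme0n1 20M" IOWeight=50

IOWeight only has an effect when a proportional IO mechanism (BFQ or blk-iocost) is active for the device; the bandwidth and IOPS maxima apply regardless.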

Task 16: Check device major:minor for correct throttling target

cr0x@server:~$ lsblk -o NAME,MAJ:MIN,SIZE,TYPE,MOUNTPOINT | sed -n '1,6p'
NAME        MAJ:MIN   SIZE TYPE MOUNTPOINT
nvme0n1     259:0   953.9G disk 
├─nvme0n1p1 259:1     512M part /boot/efi
└─nvme0n1p2 259:2   953.4G part /

Meaning: If you configure io.max you must specify the right major:minor (e.g., 259:0 for the disk, or 259:2 for a partition depending on setup).

Decision: Target the actual device experiencing contention. Throttling the wrong major:minor is how you “fixed” nothing with maximum confidence.

Root causes that show up in real incidents

1) DB writes living on the container writable layer

If your database stores its data inside the container filesystem (overlay2 upperdir), you’re stacking a write-heavy workload onto a mechanism designed for layered images. Expect worse latency and more metadata churn. Use a Docker volume or bind mount to a dedicated path.
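A minimal sketch of the right shape, assuming a dedicated disk is already mounted at /mnt/dbdisk (pgdata, db, and the postgres:16 image are placeholders):

# Back a named volume with the dedicated disk, then mount it at the DB data directory.
mkdir -p /mnt/dbdisk/pgdata
docker volume create --driver local \
    --opt type=none --opt o=bind --opt device=/mnt/dbdisk/pgdata pgdata
docker run -d --name db -v pgdata:/var/lib/postgresql/data postgres:16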

2) One shared device, zero IO fairness controls

Even if DB data is on a volume, if that volume is just a directory on the same root filesystem, you’re still sharing the device. Without weights or throttles, the busiest writer wins. Everyone else loses.

3) Latency cliffs from bursty cloud storage

Many cloud volumes provide “baseline + burst” performance. A busy DB burns through burst capacity, then the volume drops to baseline. Your incident begins exactly when the credits end. Nothing about Docker explains the timing, which is why teams waste hours looking at the wrong layer.

4) Journaling + fsync + high concurrency = tail latency hell

Databases often do many concurrent writes, plus WAL/redo logs, plus checkpointing. Add journaling filesystem behavior and device cache flushes, and you can get “perfectly average” throughput with catastrophic p99 write latency.

5) Logging on the same device as the DB

When the disk is saturated, logs don’t just “write slower.” They can block application threads, fill buffers, and create more IO at the worst possible time. JSON logs are cute until they become your primary write workload during an outage.

Joke #2: If you put a database and verbose debug logs on the same disk, you’ve invented a new distributed system: “latency.”

Common mistakes (symptom → root cause → fix)

This is the part where you stop repeating the same outage with different names.

1) Symptom: CPU is low, but load average is high

  • Root cause: threads stuck in uninterruptible IO sleep; load includes them.
  • Fix: check iostat -x for await and %util; identify top IO processes via iotop; isolate or throttle.

2) Symptom: every container becomes slow, including “unrelated” services

  • Root cause: shared block device saturated; kernel writeback and metadata operations impact everyone.
  • Fix: move DB to dedicated device/volume; apply cgroup IO controls; separate Docker root, logs, and DB data onto different devices where possible.

3) Symptom: database latency spikes during checkpoints or compaction

  • Root cause: bursty write phases cause queue buildup and flush storms.
  • Fix: tune DB checkpoint/compaction parameters carefully; cap IO for background writers; ensure storage has enough steady-state IOPS.

4) Symptom: “Disk throughput looks fine” in monitoring

  • Root cause: you’re monitoring MB/s, not IO latency or IOPS; small random IO saturates IOPS first.
  • Fix: monitor await, aqu-sz, device latency percentiles if available, and per-volume IOPS limits in cloud environments.

5) Symptom: DB container is fast alone, slow in production host

  • Root cause: test host had isolated storage or fewer noisy neighbors; prod host shares the root disk with everything.
  • Fix: performance-test on representative storage; enforce isolation via separate devices or IO controls.

6) Symptom: after “optimizing,” things got worse

  • Root cause: disabling durability, changing mount options, or increasing concurrency pushed the system into worse tail latency or data-risk territory.
  • Fix: prioritize predictable latency and durability; tune one variable at a time; validate with latency measurements, not vibes.

Three corporate mini-stories from the storage trenches

Mini-story 1: The incident caused by a wrong assumption

The team was migrating a monolith into “a few containers” on a big Linux VM. They started with the database because it was the scariest piece, and it “worked fine” in staging. In production, every deploy after the DB container landed came with a wave of timeouts from unrelated services: the API, the job runner, even the metrics sidecar.

The initial assumption was classic: “Containers isolate resources.” They limited CPU and memory for the database container, patted themselves on the back, and moved on. When the host load went to the moon with CPU mostly idle, the blame rotated through networking, DNS, Docker’s overlay network, and a brief but passionate argument about kernel versions.

It took one person running iostat -x to end the debate. The root disk was at ~100% utilization with write latency that looked like a mountain range. The DB data directory was a bind mount into /var/lib/docker on the same root filesystem as everything else, including container logs and image layers.

Once they accepted that “container” doesn’t mean “separate disk,” the fix was straightforward: attach a dedicated volume for the DB, mount it at the data directory, and move logs off the root disk. The host went from “mysterious systemic failure” to “boring computer,” which is the highest compliment in operations.

Mini-story 2: The optimization that backfired

A different company had a performance problem: their Postgres container was bottlenecked on writes during peak traffic. Someone proposed an optimization: reduce fsync pressure by relaxing durability settings, reasoning that “we have replication” and “the cloud storage is reliable.” The change improved throughput immediately in a synthetic benchmark, and it shipped.

Two weeks later, a node crash happened during a noisy storage event. Not a catastrophe, but the timing was perfect: high write load, replica lag, and a failover. They didn’t lose the entire database, but they did lose enough recent transactions to trigger a week of uncomfortable conversations. Meanwhile, the original symptom (host-level lag) returned during bursts, because the real bottleneck was device queue saturation and shared IO contention, not just fsync overhead.

They rolled back the durability compromise and did the adult fix: isolate the DB on a dedicated block device with predictable IOPS, add IO weights to keep the rest of the host usable, and cap the most abusive background maintenance. The “optimization” wasn’t evil; it was misapplied. It optimized the wrong layer and bought speed using your data as collateral.

The lesson that stuck: if you’re going to trade durability for performance, call it what it is, write it down, and get sign-off from people who will be paged when it backfires.

Mini-story 3: The boring but correct practice that saved the day

This one is less dramatic, which is exactly why it worked. A team ran multiple stateful containers on a small fleet: a DB, a queue, and a search engine. They had a policy: every stateful service must use a dedicated mount point backed by a dedicated volume class, and every service must have an explicit IO budget written into deployment notes.

It wasn’t fancy. They didn’t have bespoke kernel patches or artisanal schedulers. They just refused to put persistent data on the Docker root filesystem, and they kept application logs on a separate path with rotation and sane verbosity defaults.

One day, a batch job went rogue and started hammering the queue’s persistence. Latency spiked for that service, but the rest of the host stayed responsive. Their alerts pointed directly at the device used by the queue volume, not “the server is slow.” They throttled the batch job’s container IO, stabilized, and did a postmortem without anyone having to explain why SSH was unusable.

Sometimes the “boring” practices are just pre-paid incident response.

Fixes and guardrails that actually work

1) Put DB data on a real volume, not the container writable layer

If you remember one thing, make it this: databases should write to a dedicated mount (named volume or bind mount), ideally backed by a dedicated device. That means:

  • DB data directory is a mount point you can see in findmnt (quick check after this list).
  • That mount maps to a device you can measure independently.
  • Docker root (/var/lib/docker) is not carrying your database’s durability burden.
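For example, confirming what actually backs the volume path from the tasks above:

# Which mount, device, and filesystem serve the DB data directory.
findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS -T /var/lib/docker/volumes/pgdata/_data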

2) Separate concerns: images/logs vs durable data

The Docker data root is busy: image extraction, layer downloads, overlay metadata, container log writes, and more. Combine that with a DB doing WAL writes and checkpoints and you’ve built a contention machine.

A practical split:

  • Disk A: OS + Docker root + container logs (fast enough, but not sacred).
  • Disk B: DB volume (predictable IOPS, low latency, monitored).

3) Use IO controls to enforce fairness

If you must share a device, enforce fairness. For cgroups v2, the io controller provides weight and throttling. Docker’s UX for this varies by version and runtime, but the principle is stable: do not let one container become a disk vacuum cleaner.

Throttling is not just punitive. It can be the difference between “DB is slow” and “everything is down.” In an incident, you often want to keep the rest of the host responsive while you decide what to do with the DB.
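If you’d rather bake the cap in at deploy time than firefight later, the run command from the earlier volume sketch can carry throttle flags (a sketch; the device path and numbers are examples):

docker run -d --name db \
    --device-write-iops /dev/nvme0n1:2000 \
    --device-write-bps /dev/nvme0n1:50mb \
    -v pgdata:/var/lib/postgresql/data postgres:16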

4) Measure latency, not just utilization

%util is helpful but not sufficient. A device can show less than 100% utilization and still have terrible latency, especially in virtualized or network-backed storage where the “device” is an abstraction.

What you want to know:

  • Average and tail latency for reads/writes (a measurement sketch follows this list).
  • Queue depth trends under load.
  • IOPS limits and whether you’re near them.
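Averages hide the tails. One way to see the actual latency distribution (a sketch, assuming the bcc tools are installed; on Debian/Ubuntu the binary is usually named biolatency-bpfcc):

# Per-disk histogram of block IO latency over one 10-second interval.
sudo biolatency-bpfcc -D 10 1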

5) Tune DB behavior only after you’ve fixed the storage geometry

Database tuning has a place. But if the DB is simply sharing the wrong disk with everything else, tuning is a tax on your time and a gift to future incidents.

First: isolate the device path. Then: evaluate DB checkpoints, WAL settings, and background work. Otherwise you’ll end up with a fragile configuration that works until the next peak.

6) Keep container logs from becoming an IO workload

If you log at high volume to JSON and let Docker write it to the same filesystem as your DB, you’re competing for the same queue. Use log rotation, reduce verbosity, and consider moving high-volume logs off-host or to a separate device.
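A minimal guardrail, assuming the default json-file logging driver (the image name and values are examples):

# Per container: cap and rotate logs.
docker run -d --name api --log-opt max-size=10m --log-opt max-file=3 my-api:latest

# Or host-wide, in /etc/docker/daemon.json (applies to new containers after a dockerd restart):
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}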

7) Don’t ignore the cloud volume’s performance model

If your volume has baseline/burst performance, model your workload accordingly. A DB that is “fine for 20 minutes then awful” is often not a mystery; it’s a credit bucket emptying. Provision for sustained performance or design around it.

One quote to remember

Hope is not a strategy. — General Gordon R. Sullivan

Checklists / step-by-step plan

Step-by-step: stabilize an incident (30 minutes)

  1. Confirm IO wait/latency: run top and iostat -x.
  2. Identify the writer: run iotop -o and map PIDs to containers via cgroups.
  3. Verify storage placement: check docker inspect mounts and df -hT for Docker root and volumes.
  4. Mitigate: throttle the offender’s write IOPS or bandwidth, or temporarily reduce its concurrency (DB connection limits, background jobs).
  5. Reduce collateral IO: cut log verbosity; ensure log rotation isn’t stuck; pause nonessential batch jobs.
  6. Communicate: state clearly “host disk latency is saturated” and the mitigation applied. Avoid vague “Docker is slow.”

Step-by-step: permanent fix (one sprint)

  1. Move DB data to dedicated storage: separate device or a volume class with guaranteed IOPS.
  2. Split Docker root from stateful volumes: keep /var/lib/docker off the DB device.
  3. Implement IO controls: use cgroup v2 io.max / io.weight (or Docker blkio options where supported).
  4. Set monitoring that catches this early: device latency, queue depth, and volume credit/burst state if applicable.
  5. Load test the actual stack: not “DB on my laptop,” but the real storage backend and container runtime.
  6. Write a runbook: include the exact commands from this article and decision points.

Pre-deploy checklist for any DB container

  • DB data directory is a dedicated mount (volume/bind mount), not overlay2 writable layer.
  • That mount is on a device with known sustained IOPS and latency characteristics.
  • Container has defined IO policy (weight or throttle), not “unlimited.”
  • Logs are rate-limited and rotated; high-volume logs don’t land on the DB disk.
  • Alerts exist for disk latency and queue depth, not just disk fullness.

FAQ

1) Is this a Docker bug?

Usually no. This is shared resource contention. Docker makes it easy to colocate workloads, which makes it easy to accidentally colocate their storage pain.

2) Why does only one DB container cause whole-host lag?

Because the block device is shared. When that container saturates IOPS or triggers high latency, the kernel queues fill and everyone waits behind it.

3) Will moving the DB to a named volume fix it?

Only if the volume is backed by different storage or a different performance class. A named volume stored under /var/lib/docker/volumes on the same filesystem is organizational, not isolation.

4) What’s the difference between IOPS and throughput in this context?

IOPS is operations per second (often small 4K reads/writes). Throughput is MB/s. Databases often bottleneck on IOPS and latency, not bandwidth.

5) Should I change the IO scheduler?

Sometimes, but it’s rarely your first fix. Scheduler tweaks won’t save you from a shared device with no isolation and a DB doing heavy sync writes.
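If you do want to look, checking the active scheduler is cheap (adjust the device name to yours):

# The scheduler shown in brackets is the active one.
cat /sys/block/nvme0n1/queue/scheduler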

6) Does overlay2 always make databases slow?

No. But putting hot DB data on the writable layer is a common mistake, and overlay behavior can worsen metadata-heavy write patterns. Use volumes for DB data.

7) How do I limit IO for a container reliably?

Use cgroups IO controls. Docker supports --device-read-bps, --device-write-bps, and the IOPS variants at container creation (docker run/create); they are not adjustable through docker update. On cgroups v2, you can also manage io.max directly or via the service manager/runtime integration, which is what Tasks 12 and 13 do.

8) Why does “disk utilization 100%” happen on NVMe? Aren’t they fast?

NVMe is fast, not infinite. Small sync writes and flushes can still saturate queues, and a single device still has a finite latency curve under load.

9) Why do my apps fail health checks during IO contention?

Health checks often do disk or network operations that depend on a responsive kernel and timely logging. Under heavy IO wait, everything gets delayed, including “simple” checks.

10) What’s the safest immediate mitigation if production is melting?

Throttle the noisy container’s IO to restore host responsiveness, then plan a data migration to dedicated storage. Stopping the DB may be necessary, but throttling buys time.

Conclusion: next steps you can do this week

If one DB container makes everything lag, believe the storage signals. Measure latency and queue depth, identify the container, and stop pretending shared disks will self-organize into fairness.

Concrete next steps:

  1. Add device latency and queue depth to your dashboards (not just disk % full or MB/s).
  2. Audit stateful containers: verify data directories are volumes/bind mounts and map to actual devices.
  3. Separate storage: move DB data off the Docker root filesystem onto dedicated storage with known sustained performance.
  4. Implement IO controls: set sane throttles or weights so one container can’t take the whole host hostage.
  5. Write the runbook: copy the “fast diagnosis playbook” and the tasks into your on-call wiki, then run a game day.