The symptom: you deploy a database container, and suddenly the whole host feels like it's running off a USB thumb drive from 2009. SSH takes seconds to echo characters. Other containers time out. CPU is "fine." Memory is "fine." Yet everything is miserable.
The cause: storage contention. Specifically, one workload is consuming (or triggering) most of the IOPS budget and raising latency for everyone else. Docker didn't "break" your server. You just learned, the hard way, that disks are shared, queueing is real, and database sync semantics don't negotiate.
What IOPS starvation looks like on a Docker host
IOPS starvation is not "the disk is full." It's "the disk is busy." More precisely: the storage path is saturated such that average request latency spikes, and every workload sharing that path pays the tax.
On Linux hosts running Docker, this usually presents as:
- High iowait (but not always). If you look only at CPU percent, you can miss it.
- Latency spikes for reads and/or writes. Databases hate write latency because fsync is a contract, not a suggestion.
- Queue buildup. Requests pile up in the block layer or the device, and everything becomes "slow," including unrelated services.
- Odd, indirect symptoms: slow DNS inside containers, slow logging, systemd services timing out, Docker daemon hiccups, even "random" health checks failing.
Why does SSH lag? Because your terminal writes to a PTY, shells touch disk for history, logs flush, and the kernel is busy shepherding IO. The system isn't dead. It's waiting for the slowest shared resource to come back from lunch.
Why one container can hurt everything
Docker containers are process isolation plus some filesystem magic. They are not physical isolation. If multiple containers share:
- the same block device (same root volume, same EBS disk, same RAID group, same SAN LUN),
- the same filesystem,
- the same I/O scheduler queues,
- and often the same writeback and journaling behavior,
…then one container can absolutely degrade everyone else. This is "noisy neighbor" in its purest form.
Databases are latency amplifiers
Most DB engines do a lot of small random IO. They also do fsync (or similar) to guarantee durability. If the storage is slow, the DB does not "catch up" by trying harder; it blocks and waits. That blocking can cascade into connection pools, application threads, and retry storms.
And here's the part Docker makes worse: the default storage stack can add overhead when the DB writes lots of small files or churns metadata.
Overlay filesystems can add write amplification
Many Docker hosts use overlay2 for the container's writable layer. Overlay filesystems are great for image layering and developer ergonomics. Databases do not care about your ergonomics; they care about predictable latency.
If your DB writes to the container's writable layer (instead of a dedicated volume), you can trigger extra metadata operations, copy-up behavior, and less-friendly write patterns. Sometimes it's fine. Sometimes it's a performance crime scene with receipts.
"But the disk is only at 20% throughput" is a trap
Throughput (MB/s) is not the whole story. IOPS and latency matter more for many DB workloads. You can have a disk doing 5 MB/s and still be completely saturated on IOPS with 4K random writes.
That's why monitoring that tracks "disk bandwidth" says everything is chill while the host is crying quietly in iowait.
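A quick back-of-the-envelope makes the trap concrete. The numbers below are illustrative, not measurements: at a 4 KiB write size, even a few MiB/s of throughput implies a substantial IOPS load.

```shell
# Hypothetical numbers: small random writes turn a trivial-looking
# throughput figure into a large IOPS load.
bs_kib=4      # 4 KiB random writes, a typical small-IO size
mbps=5        # observed "low" throughput in MiB/s
iops=$(( mbps * 1024 / bs_kib ))
echo "${mbps} MiB/s of ${bs_kib} KiB writes = ${iops} IOPS"
```

Run the same arithmetic against your DB's actual IO size and observed throughput before declaring the disk "mostly idle."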
Joke #1: When a database says it's "waiting on disk," it's not being passive-aggressive. It's being accurate.
Interesting facts and historical context
- IOPS became a mainstream metric because of OLTP: transactional databases pushed the industry to measure âhow many small operations per second,â not just MB/s.
- Linux's CFQ scheduler used to be the default for fairness on many distros; modern systems often use mq-deadline or none for NVMe, changing how "fair" contention feels.
- cgroups v1 had blkio controls early (weight and throttling). cgroups v2 moved to a different interface (io.max, io.weight), which catches people mid-migration.
- Overlayfs was built for union mounts and layers, not high-churn database files. It's improved a lot, but the workload mismatch still shows up in the worst places: latency tails.
- Write barriers and flushes exist because caches lie: storage devices reorder writes for performance, so the OS uses flushes/FUA to enforce ordering when applications demand durability.
- Journaling filesystems trade extra writes for consistency: ext4 and XFS are reliable, but journal behavior can increase IO load under heavy metadata churn.
- Cloud block storage often has burst credits: you can be fast for a while, then suddenly slow. The container didnât change; the storage tier did.
- NVMe improved parallelism massively with multiple queues, but it didnât delete contention; it just moved it to different queues and limits.
- "fsync storms" are a classic failure mode: many threads call fsync, queues fill, latency spikes, and a system that looked fine at p50 falls apart at p99.
Fast diagnosis playbook
This is the "I have five minutes and production is on fire" sequence. Don't debate architecture while the host is stalling. Measure, identify the bottleneck, then decide whether to throttle, move, or isolate.
First: confirm itâs IO latency, not CPU or memory
- Check load, iowait, and run queue.
- Check disk latency and queue depth.
- Check if one process/container is the top IO consumer.
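One cheap cross-check for the first step: count tasks in uninterruptible sleep (state D). That is where IO-blocked threads live, and it is what inflates load average without burning CPU. A minimal sketch:

```shell
# Count processes in uninterruptible sleep (state D). A persistently
# nonzero count during the incident points at blocked IO, not busy CPUs.
dstate=$(ps -eo state= | grep -c '^D' || true)
echo "tasks in D state: ${dstate}"
```

Sample it a few times; a single transient D-state task means nothing, a steady pile of them means the block layer is the bottleneck.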
Second: map the pain to a device and mount
- Which block device is slow?
- Is Docker root on that device?
- Are database files on overlay2 or a dedicated volume?
Third: decide on an immediate mitigation
- Throttle the noisy container (IOPS or bandwidth) to save the rest of the host.
- Move DB data to a dedicated volume/device.
- Stop the bleeding: disable chatty logging, pause noncritical jobs, reduce concurrency.
Fourth: plan the real fix
- Use volumes for DB data (not container writable layers).
- Use IO isolation (cgroups v2 io.max / weights) where possible.
- Pick storage and filesystem options appropriate for DB sync patterns.
Practical tasks: commands, outputs, decisions
Below are real tasks you can run on a Linux Docker host. Each includes: command, example output, what it means, and the decision you make from it. Run them as root or with sudo where needed.
Task 1: Check overall CPU, iowait, and load
cr0x@server:~$ top -b -n 1 | head -n 5
top - 12:41:02 up 21 days, 4:17, 2 users, load average: 9.12, 7.84, 6.30
Tasks: 312 total, 4 running, 308 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.1 us, 2.4 sy, 0.0 ni, 61.7 id, 25.9 wa, 0.0 hi, 1.9 si, 0.0 st
MiB Mem : 64035.7 total, 3211.3 free, 10822.9 used, 499... buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 5... avail Mem
Meaning: 25.9% iowait and a high load average suggest lots of threads blocked on IO. The CPU isn't "busy"; it's waiting.
Decision: Stop blaming the scheduler and start measuring disk latency and queueing.
Task 2: Identify which disks are suffering (latency, utilization, queue depth)
cr0x@server:~$ iostat -x 1 3
Linux 6.2.0 (server) 01/03/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
7.52 0.00 2.61 23.98 0.00 65.89
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz %util
nvme0n1 95.0 820.0 3200.0 9870.0 0.0 15.0 0.00 1.80 3.2 42.7 18.90 98.50
Meaning: %util near 100% and w_await ~43ms means the device is saturated for writes. aqu-sz (average queue size) is huge, confirming backlog.
Decision: Find the writer. Don't tune databases blind; identify the process/container generating these writes.
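If you want to turn that eyeball check into something scriptable, a small awk filter over iostat -x output works. The column numbers below ($11 for w_await, $13 for %util) match the sysstat layout in the Task 2 capture and may differ across sysstat versions; the sample here is that capture, used as canned input.

```shell
# Flag devices with high write latency or near-saturation utilization
# from `iostat -x`-style output (column positions per the sample header).
sample='Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz %util
nvme0n1 95.0 820.0 3200.0 9870.0 0.0 15.0 0.00 1.80 3.2 42.7 18.90 98.50'
flagged=$(echo "$sample" | awk 'NR>1 && ($11 > 20 || $13 > 90) {print $1}')
echo "saturated: ${flagged}"
```

For live use, replace the canned sample with `iostat -x 1 1` output and verify the column positions against your header line first.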
Task 3: See if the block layer is backlogged (per-device stats)
cr0x@server:~$ cat /proc/diskstats | grep -E "nvme0n1 "
259 0 nvme0n1 128930 0 5128032 12043 942110 0 9230016 390122 0 220010 402210 0 0 0 0
Meaning: The exact fields are dense, but a quick tell is high time spent doing IO compared to baseline. Combine with iostat -x for sanity.
Decision: If you see this spike only during DB load, you're in a classic "one workload saturates the device" situation.
Task 4: Find the top IO processes (host view)
cr0x@server:~$ sudo iotop -b -n 3 -o
Total DISK READ: 0.00 B/s | Total DISK WRITE: 74.32 M/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 71.91 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
21491 be/4 postgres 0.00 B/s 55.12 M/s 0.00 % 89.15 % postgres: checkpointer
21510 be/4 postgres 0.00 B/s 10.43 M/s 0.00 % 63.20 % postgres: walwriter
4321 be/4 root 0.00 B/s 3.01 M/s 0.00 % 12.10 % dockerd
Meaning: Postgres processes dominate write IO and spend a high % of time in IO wait. dockerd writing a bit is normal (logs, layers), but it's not the primary offender.
Decision: Confirm which container owns these processes, then decide whether to throttle or move its storage.
Task 5: Map a process to a container
cr0x@server:~$ ps -o pid,cgroup,cmd -p 21491 | sed -n '1,2p'
PID CGROUP CMD
21491 0::/docker/8b6c3b7e4a3a9b7d2a7b55c4a1a2f9b9b0f6c0f9d1a7b1e3c9e3a2c1e5b postgres: checkpointer
Meaning: The process is in a Docker cgroup named by container ID.
Decision: Inspect that container. Confirm its mounts and whether it uses overlay2 or a volume.
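Extracting the short container ID from that cgroup path is easy to script. A sketch, assuming the /docker/&lt;id&gt; path format shown above (cgroup layouts vary by cgroup driver, so treat the pattern as an example):

```shell
# Pull the first 12 hex characters after "/docker/" out of a cgroup path.
# Canned input here; for live use read /proc/<PID>/cgroup instead.
cgroup_path='0::/docker/8b6c3b7e4a3a9b7d2a7b55c4a1a2f9b9b0f6c0f9d1a7b1e3c9e3a2c1e5b'
cid=$(echo "$cgroup_path" | sed -n 's|.*/docker/\([0-9a-f]\{12\}\).*|\1|p')
echo "container: ${cid}"
```

The 12-character prefix is enough to feed straight into docker inspect.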
Task 6: Inspect container mounts and storage driver
cr0x@server:~$ docker inspect -f 'Name={{.Name}} Driver={{.GraphDriver.Name}} DataRoot={{json .GraphDriver.Data}} Mounts={{json .Mounts}}' 8b6c3b7e4a3a
Name=/db Driver=overlay2 DataRoot={"LowerDir":"/var/lib/docker/overlay2/l/..","MergedDir":"/var/lib/docker/overlay2/1d3.../merged","UpperDir":"/var/lib/docker/overlay2/1d3.../diff","WorkDir":"/var/lib/docker/overlay2/1d3.../work"} Mounts=[{"Type":"volume","Name":"pgdata","Source":"/var/lib/docker/volumes/pgdata/_data","Destination":"/var/lib/postgresql/data","Driver":"local","Mode":"z","RW":true,"Propagation":""}]
Meaning: Storage driver is overlay2 for the writable layer, but the DB data directory is mounted as a named volume. That's good. If DB data were in UpperDir, it would be worse.
Decision: If DB data isnât on a volume, fix that first. If it is, check where the volume physically lives (same device as everything else?).
Task 7: Check what device backs Docker data root and volumes
cr0x@server:~$ docker info --format 'DockerRootDir={{.DockerRootDir}} Driver={{.Driver}}'
DockerRootDir=/var/lib/docker Driver=overlay2
cr0x@server:~$ df -hT /var/lib/docker /var/lib/docker/volumes/pgdata/_data
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 900G 612G 242G 72% /
/dev/nvme0n1p2 ext4 900G 612G 242G 72% /
Meaning: Docker root and the database volume are on the same ext4 filesystem on the same NVMe partition. So contention is guaranteed.
Decision: If you need the DB and the rest of the host to coexist, you must isolate IO (cgroup IO controls) or move DB data to a dedicated device.
Task 8: Check filesystem mount options (barriers, atime, etc.)
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /
/ /dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro
Meaning: relatime is fine. Nothing obviously weird. If you see exotic options (or misapplied tuning), it's a lead.
Decision: Don't change mount options during an incident unless you're certain; use throttling/migration as safer first aid.
Task 9: Check if the host is stuck in writeback congestion
cr0x@server:~$ cat /proc/meminfo | egrep 'Dirty|Writeback|WritebackTmp'
Dirty: 82456 kB
Writeback: 195120 kB
WritebackTmp: 0 kB
Meaning: Elevated Writeback can indicate lots of data being flushed to disk. It doesn't prove DB guilt, but it supports "storage pipeline overloaded."
Decision: If writeback is persistently high and latency is high, reduce write pressure and isolate the heavy writer.
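To watch this over time, you can reduce the two counters to one number. A sketch using the Task 9 values as canned input; point the same pipeline at /proc/meminfo for live use.

```shell
# Sum Dirty + Writeback (kB) into MiB for a single readable signal.
meminfo='Dirty: 82456 kB
Writeback: 195120 kB'
pending=$(echo "$meminfo" | awk '{s+=$2} END {printf "%.1f", s/1024}')
echo "pending writeback: ${pending} MiB"
```

Graph this alongside device latency: a climbing pending-writeback figure plus rising w_await is the overload signature.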
Task 10: Look at per-process IO counters (sanity)
cr0x@server:~$ sudo cat /proc/21491/io | egrep 'write_bytes|cancelled_write_bytes'
write_bytes: 18446744073709551615
cancelled_write_bytes: 127385600
Meaning: Some kernels expose counters in ways that can be confusing (and some filesystems don't report cleanly). Treat as directional, not absolute.
Decision: If this is inconclusive, rely on iotop, iostat, and device-level latency metrics.
Task 11: Check Docker container resource limits (CPU/mem) and note the absence of IO limits
cr0x@server:~$ docker inspect -f 'CpuShares={{.HostConfig.CpuShares}} Memory={{.HostConfig.Memory}} BlkioWeight={{.HostConfig.BlkioWeight}}' 8b6c3b7e4a3a
CpuShares=0 Memory=0 BlkioWeight=0
Meaning: No explicit limits. CPU and memory are commonly constrained; IO often isn't. That's how you get a single container flattening a host.
Decision: Add IO controls (Docker blkio on cgroups v1, or systemd/cgroups v2 controls), or isolate the workload on different storage.
Task 12: Apply a temporary IO throttle (bandwidth) to a container (incident mitigation)
cr0x@server:~$ docker update --device-write-bps /dev/nvme0n1:20mb 8b6c3b7e4a3a
8b6c3b7e4a3a
Meaning: Docker applied a write bandwidth throttle for that device to the container. This is a blunt tool. It may protect the host at the expense of DB latency and throughput.
Decision: Use this to stop collateral damage. Then move the DB to dedicated storage or implement fair sharing with proper IO controllers.
Task 13: Apply an IOPS throttle (more relevant for small IO)
cr0x@server:~$ docker update --device-write-iops /dev/nvme0n1:2000 8b6c3b7e4a3a
8b6c3b7e4a3a
Meaning: Limits write IOPS. This typically aligns better with DB pain than MB/s throttles.
Decision: If throttling improves host responsiveness, you've confirmed "noisy neighbor IO" as the main failure mode. Now architect it out.
Task 14: Check cgroup v2 IO controller status
cr0x@server:~$ stat -fc %T /sys/fs/cgroup
cgroup2fs
cr0x@server:~$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids
Meaning: The host uses cgroups v2 and supports the io controller.
Decision: Prefer cgroups v2 IO control where available; it's clearer and generally the direction Linux is going.
Task 15: Inspect a containerâs IO limits via its cgroup (v2)
cr0x@server:~$ CID=8b6c3b7e4a3a; cat /sys/fs/cgroup/docker/$CID/io.max
8:0 rbps=max wbps=max riops=max wiops=max
Meaning: No limits currently. Major:minor 8:0 is an example; your NVMe may be different. "max" means unlimited.
Decision: If you're using systemd units or a runtime that integrates with cgroups v2, set io.max or weights for the service/container scope.
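As a sketch of what that fix looks like, the fragment below composes an io.max line capping writes at roughly 100 MiB/s and 2000 IOPS on device 259:0, then prints the command you would run as root. The values and the cgroup path (the layout shown in this task) are examples; verify both on your host before applying.

```shell
# Compose an io.max line: device 259:0, ~100 MiB/s (104857600 bytes/s),
# 2000 write IOPS. Values and cgroup path are illustrative examples.
CID=8b6c3b7e4a3a9b7d2a7b55c4a1a2f9b9b0f6c0f9d1a7b1e3c9e3a2c1e5b
limit="259:0 wbps=104857600 wiops=2000"
# Print the privileged command instead of running it; apply as root
# after confirming the cgroup path for your cgroup driver.
echo "echo '$limit' > /sys/fs/cgroup/docker/$CID/io.max"
```

Writing "259:0 max" to the same file removes the limits again, which makes this a reversible incident mitigation.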
Task 16: Check device major:minor for correct throttling target
cr0x@server:~$ lsblk -o NAME,MAJ:MIN,SIZE,TYPE,MOUNTPOINT | sed -n '1,6p'
NAME MAJ:MIN SIZE TYPE MOUNTPOINT
nvme0n1 259:0 953.9G disk
├─nvme0n1p1 259:1 512M part /boot/efi
└─nvme0n1p2 259:2 953.4G part /
Meaning: If you configure io.max you must specify the right major:minor (e.g., 259:0 for the disk, or 259:2 for a partition depending on setup).
Decision: Target the actual device experiencing contention. Throttling the wrong major:minor is how you "fixed" nothing with maximum confidence.
Root causes that show up in real incidents
1) DB writes living on the container writable layer
If your database stores its data inside the container filesystem (overlay2 upperdir), you're stacking a write-heavy workload onto a mechanism designed for layered images. Expect worse latency and more metadata churn. Use a Docker volume or bind mount to a dedicated path.
2) One shared device, zero IO fairness controls
Even if DB data is on a volume, if that volume is just a directory on the same root filesystem, you're still sharing the device. Without weights or throttles, the busiest writer wins. Everyone else loses.
3) Latency cliffs from bursty cloud storage
Many cloud volumes provide "baseline + burst" performance. A busy DB burns through burst capacity, then the volume drops to baseline. Your incident begins exactly when the credits end. Nothing about Docker explains the timing, which is why teams waste hours looking at the wrong layer.
4) Journaling + fsync + high concurrency = tail latency hell
Databases often do many concurrent writes, plus WAL/redo logs, plus checkpointing. Add journaling filesystem behavior and device cache flushes, and you can get "perfectly average" throughput with catastrophic p99 write latency.
5) Logging on the same device as the DB
When the disk is saturated, logs don't just "write slower." They can block application threads, fill buffers, and create more IO at the worst possible time. JSON logs are cute until they become your primary write workload during an outage.
Joke #2: If you put a database and verbose debug logs on the same disk, you've invented a new distributed system: "latency."
Common mistakes (symptom â root cause â fix)
This is the part where you stop repeating the same outage with different names.
1) Symptom: CPU is low, but load average is high
- Root cause: threads stuck in uninterruptible IO sleep; load includes them.
- Fix: check iostat -x for await and %util; identify top IO processes via iotop; isolate or throttle.
2) Symptom: every container becomes slow, including âunrelatedâ services
- Root cause: shared block device saturated; kernel writeback and metadata operations impact everyone.
- Fix: move DB to dedicated device/volume; apply cgroup IO controls; separate Docker root, logs, and DB data onto different devices where possible.
3) Symptom: database latency spikes during checkpoints or compaction
- Root cause: bursty write phases cause queue buildup and flush storms.
- Fix: tune DB checkpoint/compaction parameters carefully; cap IO for background writers; ensure storage has enough steady-state IOPS.
4) Symptom: âDisk throughput looks fineâ in monitoring
- Root cause: you're monitoring MB/s, not IO latency or IOPS; small random IO saturates IOPS first.
- Fix: monitor await, aqu-sz, device latency percentiles if available, and per-volume IOPS limits in cloud environments.
5) Symptom: DB container is fast alone, slow in production host
- Root cause: test host had isolated storage or fewer noisy neighbors; prod host shares the root disk with everything.
- Fix: performance-test on representative storage; enforce isolation via separate devices or IO controls.
6) Symptom: after âoptimizing,â things got worse
- Root cause: disabling durability, changing mount options, or increasing concurrency pushed the system into worse tail latency or data-risk territory.
- Fix: prioritize predictable latency and durability; tune one variable at a time; validate with latency measurements, not vibes.
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
The team was migrating a monolith into "a few containers" on a big Linux VM. They started with the database because it was the scariest piece, and it "worked fine" in staging. In production, every deploy after the DB container landed came with a wave of timeouts from unrelated services: the API, the job runner, even the metrics sidecar.
The initial assumption was classic: "Containers isolate resources." They limited CPU and memory for the database container, patted themselves on the back, and moved on. When the host load went to the moon with CPU mostly idle, the blame rotated through networking, DNS, Docker's overlay network, and a brief but passionate argument about kernel versions.
It took one person running iostat -x to end the debate. The root disk was at ~100% utilization with write latency that looked like a mountain range. The DB data directory was a bind mount into /var/lib/docker on the same root filesystem as everything else, including container logs and image layers.
Once they accepted that "container" doesn't mean "separate disk," the fix was straightforward: attach a dedicated volume for the DB, mount it at the data directory, and move logs off the root disk. The host went from "mysterious systemic failure" to "boring computer," which is the highest compliment in operations.
Mini-story 2: The optimization that backfired
A different company had a performance problem: their Postgres container was bottlenecked on writes during peak traffic. Someone proposed an optimization: reduce fsync pressure by relaxing durability settings, reasoning that "we have replication" and "the cloud storage is reliable." The change improved throughput immediately in a synthetic benchmark, and it shipped.
Two weeks later, a node crash happened during a noisy storage event. Not a catastrophe, but the timing was perfect: high write load, replica lag, and a failover. They didn't lose the entire database, but they did lose enough recent transactions to trigger a week of uncomfortable conversations. Meanwhile, the original symptom (host-level lag) returned during bursts, because the real bottleneck was device queue saturation and shared IO contention, not just fsync overhead.
They rolled back the durability compromise and did the adult fix: isolate the DB on a dedicated block device with predictable IOPS, add IO weights to keep the rest of the host usable, and cap the most abusive background maintenance. The "optimization" wasn't evil; it was misapplied. It optimized the wrong layer and bought speed using your data as collateral.
The lesson that stuck: if you're going to trade durability for performance, call it what it is, write it down, and get sign-off from people who will be paged when it backfires.
Mini-story 3: The boring but correct practice that saved the day
This one is less dramatic, which is exactly why it worked. A team ran multiple stateful containers on a small fleet: a DB, a queue, and a search engine. They had a policy: every stateful service must use a dedicated mount point backed by a dedicated volume class, and every service must have an explicit IO budget written into deployment notes.
It wasn't fancy. They didn't have bespoke kernel patches or artisanal schedulers. They just refused to put persistent data on the Docker root filesystem, and they kept application logs on a separate path with rotation and sane verbosity defaults.
One day, a batch job went rogue and started hammering the queue's persistence. Latency spiked for that service, but the rest of the host stayed responsive. Their alerts pointed directly at the device used by the queue volume, not "the server is slow." They throttled the batch job's container IO, stabilized, and did a postmortem without anyone having to explain why SSH was unusable.
Sometimes the "boring" practices are just pre-paid incident response.
Fixes and guardrails that actually work
1) Put DB data on a real volume, not the container writable layer
If you remember one thing, make it this: databases should write to a dedicated mount (named volume or bind mount), ideally backed by a dedicated device. That means:
- DB data directory is a mount point you can see in findmnt.
- That mount maps to a device you can measure independently.
- Docker root (/var/lib/docker) is not carrying your database's durability burden.
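Putting those pieces together, a container launch with the right shape might look like the sketch below. The docker flags are standard options; the image tag, device path, and IOPS value are illustrative and need tailoring to your host.

```shell
# Sketch: DB data on a named volume (not the writable layer), plus an
# explicit write-IOPS cap so this container cannot monopolize the device.
# Image tag, device path, and the 2000 IOPS figure are example values.
docker volume create pgdata
docker run -d --name db \
  -v pgdata:/var/lib/postgresql/data \
  --device-write-iops /dev/nvme0n1:2000 \
  postgres:16
```

Remember that a named volume like this still lives under /var/lib/docker/volumes unless you configure the volume driver otherwise, so pair it with dedicated storage for real isolation.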
2) Separate concerns: images/logs vs durable data
The Docker data root is busy: image extraction, layer downloads, overlay metadata, container log writes, and more. Combine that with a DB doing WAL writes and checkpoints and you've built a contention machine.
A practical split:
- Disk A: OS + Docker root + container logs (fast enough, but not sacred).
- Disk B: DB volume (predictable IOPS, low latency, monitored).
3) Use IO controls to enforce fairness
If you must share a device, enforce fairness. For cgroups v2, the io controller provides weight and throttling. Dockerâs UX for this varies by version and runtime, but the principle is stable: do not let one container become a disk vacuum cleaner.
Throttling is not just punitive. It can be the difference between "DB is slow" and "everything is down." In an incident, you often want to keep the rest of the host responsive while you decide what to do with the DB.
4) Measure latency, not just utilization
%util is helpful but not sufficient. A device can show less than 100% utilization and still have terrible latency, especially in virtualized or network-backed storage where the "device" is an abstraction.
What you want to know:
- Average and tail latency for reads/writes.
- Queue depth trends under load.
- IOPS limits and whether you're near them.
5) Tune DB behavior only after youâve fixed the storage geometry
Database tuning has a place. But if the DB is simply sharing the wrong disk with everything else, tuning is a tax on your time and a gift to future incidents.
First: isolate the device path. Then: evaluate DB checkpoints, WAL settings, and background work. Otherwise you'll end up with a fragile configuration that works until the next peak.
6) Keep container logs from becoming an IO workload
If you log at high volume to JSON and let Docker write it to the same filesystem as your DB, you're competing for the same queue. Use log rotation, reduce verbosity, and consider moving high-volume logs off-host or to a separate device.
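One concrete guardrail, assuming the default json-file log driver: cap per-container log size and file count in /etc/docker/daemon.json. The keys below are real dockerd log options; the sizes are example values to tune.

```shell
# Build a daemon.json fragment capping per-container json-file logs.
# max-size / max-file are dockerd log-opts; values here are examples.
daemon_json='{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}'
printf '%s\n' "$daemon_json"
```

Note that log options generally take effect for containers created after the daemon restart; existing containers keep their old logging config until recreated.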
7) Donât ignore the cloud volumeâs performance model
If your volume has baseline/burst performance, model your workload accordingly. A DB that is "fine for 20 minutes then awful" is often not a mystery; it's a credit bucket emptying. Provision for sustained performance or design around it.
One quote to remember
Hope is not a strategy.
— General Gordon R. Sullivan
Checklists / step-by-step plan
Step-by-step: stabilize an incident (30 minutes)
- Confirm IO wait/latency: run top and iostat -x.
- Identify the writer: run iotop -o and map PIDs to containers via cgroups.
- Verify storage placement: check docker inspect mounts and df -hT for Docker root and volumes.
- Mitigate: throttle the offender's write IOPS or bandwidth, or temporarily reduce its concurrency (DB connection limits, background jobs).
- Reduce collateral IO: cut log verbosity; ensure log rotation isn't stuck; pause nonessential batch jobs.
- Communicate: state clearly "host disk latency is saturated" and the mitigation applied. Avoid vague "Docker is slow."
Step-by-step: permanent fix (one sprint)
- Move DB data to dedicated storage: separate device or a volume class with guaranteed IOPS.
- Split Docker root from stateful volumes: keep /var/lib/docker off the DB device.
- Implement IO controls: use cgroup v2 io.max / io.weight (or Docker blkio options where supported).
- Set monitoring that catches this early: device latency, queue depth, and volume credit/burst state if applicable.
- Load test the actual stack: not "DB on my laptop," but the real storage backend and container runtime.
- Write a runbook: include the exact commands from this article and decision points.
Pre-deploy checklist for any DB container
- DB data directory is a dedicated mount (volume/bind mount), not overlay2 writable layer.
- That mount is on a device with known sustained IOPS and latency characteristics.
- Container has a defined IO policy (weight or throttle), not "unlimited."
- Logs are rate-limited and rotated; high-volume logs don't land on the DB disk.
- Alerts exist for disk latency and queue depth, not just disk fullness.
FAQ
1) Is this a Docker bug?
Usually no. This is shared resource contention. Docker makes it easy to colocate workloads, which makes it easy to accidentally colocate their storage pain.
2) Why does only one DB container cause whole-host lag?
Because the block device is shared. When that container saturates IOPS or triggers high latency, the kernel queues fill and everyone waits behind it.
3) Will moving the DB to a named volume fix it?
Only if the volume is backed by different storage or a different performance class. A named volume stored under /var/lib/docker/volumes on the same filesystem is organizational, not isolation.
4) Whatâs the difference between IOPS and throughput in this context?
IOPS is operations per second (often small 4K reads/writes). Throughput is MB/s. Databases often bottleneck on IOPS and latency, not bandwidth.
5) Should I change the IO scheduler?
Sometimes, but it's rarely your first fix. Scheduler tweaks won't save you from a shared device with no isolation and a DB doing heavy sync writes.
6) Does overlay2 always make databases slow?
No. But putting hot DB data on the writable layer is a common mistake, and overlay behavior can worsen metadata-heavy write patterns. Use volumes for DB data.
7) How do I limit IO for a container reliably?
Use cgroups IO controls. Docker supports --device-read-bps, --device-write-bps, and IOPS variants for throttling. On cgroups v2, you can also manage io.max via the service manager/runtime integration.
8) Why does "disk utilization 100%" happen on NVMe? Aren't they fast?
NVMe is fast, not infinite. Small sync writes and flushes can still saturate queues, and a single device still has a finite latency curve under load.
9) Why do my apps fail health checks during IO contention?
Health checks often do disk or network operations that depend on a responsive kernel and timely logging. Under heavy IO wait, everything gets delayed, including "simple" checks.
10) Whatâs the safest immediate mitigation if production is melting?
Throttle the noisy container's IO to restore host responsiveness, then plan a data migration to dedicated storage. Stopping the DB may be necessary, but throttling buys time.
Conclusion: next steps you can do this week
If one DB container makes everything lag, believe the storage signals. Measure latency and queue depth, identify the container, and stop pretending shared disks will self-organize into fairness.
Concrete next steps:
- Add device latency and queue depth to your dashboards (not just disk % full or MB/s).
- Audit stateful containers: verify data directories are volumes/bind mounts and map to actual devices.
- Separate storage: move DB data off the Docker root filesystem onto dedicated storage with known sustained performance.
- Implement IO controls: set sane throttles or weights so one container canât take the whole host hostage.
- Write the runbook: copy the "fast diagnosis playbook" and the tasks into your on-call wiki, then run a game day.