Your Debian 13 host “freezes,” SSH sessions hang, Prometheus graphs look like a heart monitor flatlining, and top shows CPU at 2%… except iowait is pinned at 100%. It feels like the machine is powered on but emotionally unavailable.
This is rarely a mystery. It’s usually one noisy process or VM doing something “perfectly reasonable” at the worst possible time, plus storage that can’t keep up. The job is not to meditate on it. The job is to name the culprit fast, decide whether to kill, throttle, or fix, and get your host back.
The mental model: what 100% iowait really means
iowait is not “your disk is 100% busy.” It’s “your CPUs are idle only because the tasks that want to run are blocked waiting for I/O completion.” If you see high iowait with low CPU usage, the kernel is doing you a favor by not burning cycles. Your users interpret it as “the server is dead,” because from their perspective, it is.
On Debian 13, the usual pattern is:
- Some workload starts issuing lots of I/O (writes are the classic trigger).
- Queue depth spikes on a block device.
- Latency explodes (milliseconds become seconds).
- Threads block in D-state (uninterruptible sleep), and the system feels frozen.
What you want to answer in under 10 minutes:
- Which device? (nvme0n1? md0? dm-crypt? NFS mount?)
- Which kind of pressure? (latency? saturation? retries? flush/sync?)
- Which source? (PID? container? VM/guest?)
- What action? (kill, throttle, move, fix hardware, change config)
Dry truth: “iowait 100%” is a symptom, not a diagnosis. The diagnosis is a queue and a source.
One quote worth keeping on your wall, because it’s the whole job: “Hope is not a strategy.”
— often attributed to Gene Kranz.
Joke #1: iowait is the Linux way of saying, “I’d love to help, but I’m waiting on storage to finish its long novel.”
Fast diagnosis playbook (10 minutes)
Minute 0–1: confirm it’s I/O and not a different kind of misery
- Check load, iowait, and blocked tasks.
- If you see many tasks in D-state and load rising with low CPU usage, you’re in I/O land.
Minute 1–3: identify the device that’s choking
- Use iostat or sar -d to find high await and high utilization.
- Validate with cat /proc/diskstats if tools aren’t installed or are hanging.
Minute 3–6: identify the source (PID, cgroup, or VM)
- Use pidstat -d to find which PIDs are doing I/O.
- Use iotop for live visibility if it works (it can miss buffered writeback).
- On virtualization hosts, map QEMU threads to specific guests via libvirt/QEMU process args and per-disk stats.
- Use PSI (/proc/pressure/io) to confirm a system-wide I/O stall.
Minute 6–8: decide: kill, throttle, or isolate
- If it’s a backup job, an index rebuild, or a migration: throttle it.
- If it’s a runaway process: stop it now, root-cause later.
- If it’s a single VM: apply I/O limits at the hypervisor or storage layer.
Minute 8–10: confirm recovery and capture evidence
- Watch await and utilization drop.
- Grab journalctl, kernel logs, and a short iostat/pidstat sample so you can fix it permanently.
If you do nothing else, do this: find the device, then find the source. Everything else is decoration.
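If you want that drill captured as a single artifact while you work, a minimal evidence-grab sketch (assuming procps and sysstat are installed; the output directory is just an example) looks like this:
#!/bin/bash
# Minimal triage snapshot: dump the usual iowait evidence into one timestamped directory.
# Assumes vmstat (procps), iostat/pidstat (sysstat), and journalctl are available.
out="/var/tmp/io-triage-$(date +%Y%m%d-%H%M%S)"   # example path; ideally not on the sick disk
mkdir -p "$out"
vmstat 1 5 > "$out/vmstat.txt"
iostat -xz 1 3 > "$out/iostat.txt"
cat /proc/pressure/io > "$out/psi-io.txt"
pidstat -d 1 5 > "$out/pidstat.txt"
ps -eo pid,state,comm,wchan:32 --sort=state | head -n 40 > "$out/dstate.txt"
journalctl -k -n 200 --no-pager > "$out/kernel.txt"
echo "evidence saved to $out"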
Hands-on tasks: commands, output, and decisions (12+)
These are the moves I actually use when a host is “frozen” but not dead. Each task includes what the output means and what decision you make from it. Run them in order until you have a name (PID or VM) and a device.
Task 1 — See iowait, load, and blocked tasks in one glance
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 12 0 312844 98124 4021180 0 0 0 8240 220 410 2 1 3 94 0
0 18 0 311200 98124 4019900 0 0 0 12032 210 390 1 1 0 98 0
0 16 0 310980 98124 4019500 0 0 0 11024 205 372 1 1 1 97 0
0 20 0 310500 98124 4018400 0 0 0 14080 198 360 1 1 0 98 0
0 17 0 310300 98124 4018100 0 0 0 13056 200 365 1 1 0 98 0
Meaning: the b column (blocked processes) is high, and wa (iowait) is ~98%. That’s a classic “storage is the bottleneck” signature.
Decision: stop debating. Move immediately to per-device latency (iostat).
Task 2 — Find the hot device with latency and utilization
cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0-1-amd64 (server) 12/29/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.20 0.00 1.10 92.40 0.00 5.30
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 0.20 8.00 0.00 0.00 2.10 40.0 420.00 67136.00 12.00 2.78 280.30 159.8 118.2 99.70
nvme1n1 18.00 2048.00 0.00 0.00 1.80 113.8 15.00 1024.00 0.00 0.00 2.40 68.3 0.1 4.10
Meaning: nvme0n1 is at ~100% util with a huge w_await and a massive queue (aqu-sz). That’s your bottleneck device.
Decision: identify who is writing to nvme0n1. Also consider whether this is a latency spike (firmware, thermal throttling) or saturation (legitimate load).
Task 3 — Confirm system-wide I/O stall using PSI
cr0x@server:~$ cat /proc/pressure/io
some avg10=78.43 avg60=65.12 avg300=40.02 total=184563218
full avg10=52.10 avg60=44.90 avg300=22.31 total=105331982
Meaning: “full” I/O pressure means tasks are stalled because I/O can’t complete. This is not “a few slow queries,” it’s systemic.
Decision: you’re allowed to take aggressive mitigation (throttle/kill), because the whole host is paying the price.
Task 4 — Find the top I/O PIDs fast (and don’t trust your gut)
cr0x@server:~$ pidstat -d 1 5
Linux 6.12.0-1-amd64 (server) 12/29/2025 _x86_64_ (32 CPU)
03:14:01 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
03:14:02 0 21492 0.00 65432.00 0.00 48 qemu-system-x86
03:14:02 0 31811 0.00 12160.00 0.00 20 rsync
03:14:02 0 1092 0.00 4096.00 0.00 10 jbd2/nvme0n1p2-8
03:14:02 0 25170 0.00 2048.00 0.00 8 postgres
03:14:02 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
03:14:03 0 21492 0.00 70000.00 0.00 52 qemu-system-x86
03:14:03 0 31811 0.00 11008.00 0.00 18 rsync
Meaning: the top writer is qemu-system-x86 PID 21492. That usually means “a VM is the noisy neighbor,” not that QEMU woke up and chose violence.
Decision: map this QEMU process to a guest name and disk. Then either throttle the guest I/O or fix what the guest is doing.
Task 5 — If it’s not obvious, look for D-state pileups
cr0x@server:~$ ps -eo pid,state,comm,wchan:32 --sort=state | head -n 20
PID S COMMAND WCHAN
8921 D qemu-system-x86 blk_mq_get_tag
8922 D qemu-system-x86 io_schedule
8923 D qemu-system-x86 bit_wait_io
31811 D rsync io_schedule
25170 D postgres io_schedule
1092 D jbd2/nvme0n1p2-8 bit_wait_io
Meaning: D-state threads are blocked in the block layer (blk_mq_get_tag) or waiting on I/O scheduling. That’s consistent with queue saturation on the device.
Decision: you are not dealing with a CPU runaway. You need to reduce I/O load or increase I/O capacity, and quickly identify the source.
Task 6 — Map the QEMU PID to a VM name (libvirt)
cr0x@server:~$ ps -p 21492 -o pid,cmd --no-headers | sed 's/ -/\n-/g' | head -n 20
21492 /usr/bin/qemu-system-x86_64
-name guest=acct-db-03,debug-threads=on
-uuid 3bce8b54-8d13-4c4a-bd19-5e3dfe7e8a0e
-drive file=/var/lib/libvirt/images/acct-db-03.qcow2,format=qcow2,if=virtio,cache=none,aio=native
-drive file=/var/lib/libvirt/images/acct-db-03-data.raw,format=raw,if=virtio,cache=none,aio=native
Meaning: now you have a VM identity: acct-db-03, and at least one disk path. This is gold when you’re trying to explain impact to humans.
Decision: confirm which disk is hottest and whether the backing storage is on nvme0n1. If yes, throttle that guest or pause the workload inside it.
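If several QEMU processes are in play and you want the PID-to-guest mapping for all of them at once, a minimal sketch (it only parses -name guest=... from each process command line, so it works with or without libvirt tooling) is:
# Print every qemu-system PID together with the guest name from its command line.
for pid in $(pgrep -f qemu-system); do
  name=$(tr '\0' '\n' < "/proc/$pid/cmdline" | grep -oP 'guest=[^,]+' | head -n 1)
  echo "$pid ${name:-unknown}"
done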
Task 7 — Find which block device backs the VM disk file
cr0x@server:~$ df -h /var/lib/libvirt/images
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 1.8T 1.2T 526G 70% /var
Meaning: the VM images are on /dev/nvme0n1p2, which is on the hot NVMe. The mapping is consistent.
Decision: mitigation can happen at the host: throttle I/O to that VM (cgroup or libvirt tuning), or at the guest: stop/reduce the job.
Task 8 — Check per-domain block stats (libvirt) to find the loudest disk
cr0x@server:~$ virsh domblklist acct-db-03
Target Source
------------------------------------------------
vda /var/lib/libvirt/images/acct-db-03.qcow2
vdb /var/lib/libvirt/images/acct-db-03-data.raw
cr0x@server:~$ virsh domblkstat acct-db-03 vdb --human
rd_bytes 1.2 GiB
wr_bytes 842.6 GiB
rd_operations 12034
wr_operations 55210321
flush_operations 431920
wr_total_times 10421863423
Meaning: vdb is the heavy writer with many flushes. Flushes are a common “latency amplifier” because they force ordering and completion semantics.
Decision: if the guest is doing sync-heavy writes (databases, fsync storms, journaling), consider throttling or moving it to faster/less contended storage. Also check whether host settings (write cache, barriers, dm-crypt) are making flushes expensive.
Task 9 — Check whether the block device is rejecting you with errors or resets
cr0x@server:~$ journalctl -k -n 80 --no-pager
Dec 29 03:12:48 server kernel: nvme nvme0: I/O 1234 QID 6 timeout, aborting
Dec 29 03:12:48 server kernel: nvme nvme0: Abort status: 0x371
Dec 29 03:12:49 server kernel: nvme0n1: I/O error, dev nvme0n1, sector 1937422336 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Dec 29 03:12:50 server kernel: EXT4-fs warning (device nvme0n1p2): ext4_end_bio:345: I/O error 10 writing to inode 5513245 starting block 242177834)
Meaning: timeouts and I/O errors are not “load.” They’re a device or transport problem. Under error conditions, latency can go vertical while throughput drops. Your iowait will still look the same.
Decision: stop blaming workloads. Switch to hardware/firmware triage: check SMART/NVMe log, cabling/backplane, controller logs. Consider failing the device out (RAID/ZFS) if possible.
Task 10 — Quick NVMe health and error counters
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 78 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 3%
media_errors : 12
num_err_log_entries : 45
Meaning: temperature is high, and there are media errors. Thermal throttling and internal error recovery can cause huge latency spikes.
Decision: treat this as both a performance and reliability incident. Plan device replacement if errors continue. Improve cooling if temp is consistently high.
Task 11 — Distinguish “device saturated” vs “device idle but slow”
cr0x@server:~$ iostat -xz /dev/nvme0n1 1 2
Linux 6.12.0-1-amd64 (server) 12/29/2025 _x86_64_ (32 CPU)
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.0 380.00 62000.00 0.00 0.00 310.00 163.2 120.0 99.90
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.0 390.00 64000.00 0.00 0.00 295.00 164.1 118.5 99.80
Meaning: this is saturation (util ~100%, big queue). If util were low but await high, you’d suspect intermittent stalls, firmware hiccups, or upstream layers (dm-crypt, RAID resync, NFS server) rather than pure load.
Decision: if it’s saturation, reduce load or add capacity (different disk, more spindles, better NVMe, separate noisy workloads). If it’s stalls, find the stalling layer.
Task 12 — Find which files are being hammered (when you must)
cr0x@server:~$ sudo lsof -p 21492 | grep -E '/var/lib/libvirt/images/|deleted' | head
qemu-system 21492 root 103u REG 259,2 34359738368 1310720 /var/lib/libvirt/images/acct-db-03-data.raw
qemu-system 21492 root 104u REG 259,2 21474836480 983040 /var/lib/libvirt/images/acct-db-03.qcow2
Meaning: confirms which image files are in play. It won’t show you which blocks are hot, but it anchors the blame to a concrete path.
Decision: if you need to relocate: you now know what to move. If you need to snapshot: you know what to snapshot. If you need to explain impact: you have a file and a VM name.
Task 13 — Identify major writeback and dirty page pressure
cr0x@server:~$ grep -E 'Dirty|Writeback|MemAvailable' /proc/meminfo
MemAvailable: 10432124 kB
Dirty: 982432 kB
Writeback: 210944 kB
WritebackTmp: 0 kB
Meaning: elevated Dirty/Writeback suggests the kernel is pushing buffered writes to disk and may be stuck behind slow I/O. If dirty memory skyrockets and the disk can’t keep up, everything becomes “slow in weird ways.”
Decision: consider whether a workload is generating sustained writes that exceed device throughput, and whether throttling (ionice, cgroup io.max) is the right mitigation.
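If writeback debt is the suspect, it helps to see the kernel’s dirty thresholds next to the current Dirty/Writeback numbers; a quick read-only check (standard sysctls, nothing is changed here):
# Current dirty/writeback pages vs. the thresholds that trigger background and blocking writeback.
grep -E 'Dirty|Writeback' /proc/meminfo
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_background_bytes vm.dirty_bytes vm.dirty_expire_centisecs
# The *_ratio values are percentages of dirtyable memory; the *_bytes values, if non-zero, override them.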
Task 14 — Check filesystem-level pain (EXT4/XFS) and journal behavior
cr0x@server:~$ mount | grep ' on /var '
/dev/nvme0n1p2 on /var type ext4 (rw,relatime,errors=remount-ro)
cr0x@server:~$ dmesg | tail -n 20
[12345.678901] EXT4-fs (nvme0n1p2): Delayed block allocation failed for inode 5513245 at logical offset 123456 with max blocks 2 with error 5
[12345.678910] EXT4-fs (nvme0n1p2): This should not happen!! Data will be lost
Meaning: filesystem warnings plus I/O errors are “stop the world” events. Even without errors, journal threads (like jbd2) showing up in pidstat is a clue that writes are forcing metadata updates and commits.
Decision: if errors exist, treat as storage failure. If not, and you have many flushes/commits, focus on the workload pattern (fsync storms, small random writes) and storage configuration (write cache, barriers, virtualization caching mode).
Task 15 — When you need hard evidence: sample block I/O with blktrace
cr0x@server:~$ sudo blktrace -d /dev/nvme0n1 -w 10 -o - | sudo blkparse -i - | head -n 20
259,0 0 1 0.000000000 21492 Q WS 1937422336 + 128 [qemu-system-x86]
259,0 0 2 0.000012345 21492 G WS 1937422336 + 128 [qemu-system-x86]
259,0 0 3 0.000056789 21492 I WS 1937422336 + 128 [qemu-system-x86]
259,0 0 4 0.310123456 21492 D WS 1937422336 + 128 [qemu-system-x86]
259,0 0 5 0.650987654 21492 C WS 1937422336 + 128 [qemu-system-x86]
Meaning: you can attribute I/O at the block layer to a PID (here, QEMU). Notice the time gap between issue and completion; that’s your latency in the raw.
Decision: if you need to convince someone that “the VM really did it,” this is courtroom-grade. Also helpful when multiple PIDs are involved and userland tools disagree.
Task 16 — Quick mitigation: throttle a noisy cgroup (systemd) with io.max
cr0x@server:~$ systemctl status libvirtd | head -n 5
● libvirtd.service - Virtualization daemon
Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled)
Active: active (running) since Mon 2025-12-29 02:11:03 UTC; 1h 03min ago
cr0x@server:~$ sudo systemctl set-property --runtime 'machine-qemu\x2dacct\x2ddb\x2d03.scope' 'IOReadBandwidthMax=/dev/nvme0n1 10M' 'IOWriteBandwidthMax=/dev/nvme0n1 20M'
Meaning: you just applied runtime I/O bandwidth limits to that VM’s systemd scope (naming varies; confirm the exact scope name on your host). This reduces the VM’s ability to starve the host.
Decision: use this when you need the host stable more than you need one guest fast. Then schedule a permanent fix: isolate workloads, tune DB settings, or move the guest to dedicated storage.
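If you want to see (or set) the raw cgroup v2 limit behind that systemd property, the equivalent io.max write looks like the sketch below; the scope path is an example (match it to whatever systemd-cgls shows on your host) and the MAJ:MIN must come from the real device:
# Find the device's major:minor, then write an io.max rule into the VM scope's cgroup (cgroup v2).
lsblk -d -o NAME,MAJ:MIN /dev/nvme0n1
scope='/sys/fs/cgroup/machine.slice/machine-qemu\x2dacct\x2ddb\x2d03.scope'   # example; confirm the real scope name
echo "259:0 rbps=10485760 wbps=20971520" | sudo tee "$scope/io.max"           # replace 259:0 with the MAJ:MIN printed above
cat "$scope/io.max"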
Joke #2: Throttling a noisy VM is like putting a shopping cart wheel back on. It won’t make it elegant, but at least it stops screaming.
Facts and history that help you debug faster
Knowing a few “why it’s like this” facts keeps you from chasing ghosts. Here are some that actually matter when you’re staring at iowait graphs at 03:00.
- Linux iowait is a CPU accounting state, not a disk metric. It can be high even when the disk isn’t fully utilized, especially with intermittent stalls or deep kernel queues.
- The old svctm field from iostat became unreliable with modern devices and multi-queue. People still quote it like gospel. Don’t.
- blk-mq (the multi-queue block layer) changed the “one queue per device” intuition. NVMe and modern storage can have many queues; saturation behavior looks different than old SATA.
- CFQ died, BFQ arrived, and mq-deadline became a default-ish choice for many SSDs. Scheduler choice can change tail latency under mixed loads.
- PSI (Pressure Stall Information) is relatively new compared to classic tools. It tells you “how stalled” the system is, not just “how busy” it is, which is closer to user pain.
- Writeback caching can hide the start of an incident. You can buffer writes in RAM until you can’t, then you pay the debt with interest: long flush storms.
- Virtualization adds an extra scheduler stack. Guest thinks it’s doing “reasonable” fsync; host turns it into real flushes; underlying storage decides it’s time for garbage collection.
- NVMe thermal throttling is a real production failure mode. High sequential writes can push drives into throttling, causing dramatic latency increases without obvious errors at first.
- RAID resync and scrubs can create perfectly legitimate I/O denial of service. No bug required. Just timing.
If it’s a VM: mapping host I/O to the noisy guest
On a virtualization host, the most common wrong move is to treat QEMU as the culprit. QEMU is the messenger. Your real question: which guest workload is saturating which host device, and is it an expected batch job or an accidental foot-gun?
What “noisy VM” looks like on the host
- pidstat -d shows one or a few qemu-system-* PIDs dominating writes.
- iostat shows one device with a huge queue (aqu-sz) and high await.
- Load average climbs, but CPU is mostly idle except iowait.
- Other VMs complain: “disk is slow” or “database is stuck” even though they aren’t doing much.
How to connect QEMU threads to a domain cleanly
If you use libvirt, you can map by:
- the QEMU process command line (-name guest=...)
- virsh domblkstat per disk device
- systemd scope units under machine.slice (names vary)
Once you have a guest name, you have options:
- Guest-side mitigation: pause the backup, reduce concurrency, tune database checkpoints, stop an index rebuild, limit compaction.
- Host-side mitigation: apply I/O limits using cgroup v2 (io.max / systemd properties), or adjust libvirt I/O tuning if you already have it in place; see the blkdeviotune sketch after this list.
- Structural fix: move that VM to dedicated storage, split the OS disk from the data disk, or separate “batch jobs” from “latency-sensitive” VMs.
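For the libvirt route specifically, per-disk throttling can be applied live with virsh blkdeviotune; the domain, target, and byte limit below are just the values from this article’s example:
# Throttle one virtual disk of one guest at runtime (20 MiB/s total), without editing its XML.
virsh blkdeviotune acct-db-03 vdb --total-bytes-sec 20971520 --live
# Show what is currently applied to that disk:
virsh blkdeviotune acct-db-03 vdb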
When the VM isn’t doing “that much,” but still kills the host
This is where flushes, sync writes, and metadata-heavy patterns matter. A VM doing a modest number of transactions with fsync can create enormous pressure if the underlying stack turns each fsync into expensive flushes across dm-crypt, RAID, and a drive doing internal garbage collection.
Watch for:
- high flush_operations in virsh domblkstat (see the sampling sketch after this list)
- database config changes that increase durability guarantees
- changes to guest filesystem mount options
- host caching mode changes (e.g., cache=none vs writeback)
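To check whether flushes are growing right now rather than just cumulatively, sample the counter twice and take the difference; a minimal sketch using the domain and disk names from the earlier tasks:
# Rough flushes-per-10-seconds for one guest disk, from the cumulative flush_operations counter.
f1=$(virsh domblkstat acct-db-03 vdb --human | awk '/flush_operations/ {print $2}')
sleep 10
f2=$(virsh domblkstat acct-db-03 vdb --human | awk '/flush_operations/ {print $2}')
echo "flushes in the last 10s: $((f2 - f1))"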
Storage layer triage: NVMe, RAID, dm-crypt, ZFS, network
After you identify the noisy process/VM, you still need to decide whether the real fix is “stop doing that” or “storage stack is unhealthy.” The fastest way is to recognize common failure modes by their telemetry.
NVMe: fast until it isn’t
NVMe incidents often fall into two buckets: saturation (legit load) and latency spikes (hardware/firmware/thermal). Saturation is boring. Latency spikes are dramatic.
What to check:
- Temperature and errors via nvme smart-log and the controller error log (see the sketch after this list)
- Kernel logs for timeouts/resets
- Util vs await: if util is low but await is high, suspect stalling rather than throughput limitation
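Beyond the smart-log, the controller error log and a quick temperature watch are cheap to collect while the incident is live (nvme-cli commands; the device name is the one from this example):
# Controller error log entries; growing counts support the "hardware, not load" theory.
sudo nvme error-log /dev/nvme0 | head -n 40
# Watch the composite temperature while the workload runs; sustained high values suggest thermal throttling.
watch -n 5 'sudo nvme smart-log /dev/nvme0 | grep -i temperature'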
mdadm RAID: resync is a scheduled incident if you let it be
Software RAID can behave well for months, then a rebuild or scrub starts and suddenly your “stable host” becomes a lesson in shared resources. Rebuild I/O competes with everything.
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
976630336 blocks super 1.2 [2/2] [UU]
[===>.................] resync = 18.3% (178944000/976630336) finish=123.4min speed=107000K/sec
unused devices: <none>
Meaning: resync is active. That can drive iowait even when your application workload hasn’t changed.
Decision: throttle rebuild speed if it’s hurting production, or schedule rebuild windows. Don’t “optimize” by letting rebuild run wild on shared disks.
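Throttling a resync looks like this in practice. The limits are in KB/s per device, and the numbers below are deliberately conservative examples, not recommendations:
# Global ceiling for md resync/rebuild speed (KB/s per device).
echo 20000 | sudo tee /proc/sys/dev/raid/speed_limit_max
# Or per array, which is usually what you want on a shared host:
echo 20000 | sudo tee /sys/block/md0/md/sync_speed_max
cat /proc/mdstat   # confirm the reported resync speed drops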
dm-crypt: encryption is not free, and flush semantics matter
dm-crypt adds CPU cost and can affect latency under sync-heavy workloads. It’s usually fine on modern CPUs, but it can make tail latency uglier in certain patterns.
cr0x@server:~$ lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINTS
nvme0n1 disk 1.8T
├─nvme0n1p1 part vfat 512M /boot/efi
└─nvme0n1p2 part crypto_LUKS 1.8T
└─cryptvar crypt ext4 1.8T /var
Meaning: your hot filesystem is on top of dm-crypt. That’s not automatically bad. It is a hint that flush-heavy workloads may pay extra overhead.
Decision: if flush storms are killing you, test with realistic workloads and consider separating sync-heavy workloads onto dedicated devices or revisiting caching/IO tuning. Don’t disable encryption in a panic unless you enjoy compliance meetings.
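A quick way to see whether dm-crypt itself is adding meaningful latency is to compare await at the crypt layer with the physical device underneath it; a minimal sketch assuming the layout above (dm-0 is an example kernel name, check lsblk for yours):
# Map the crypt device to its dm-* kernel name, then watch both layers side by side.
lsblk -o NAME,KNAME,TYPE /dev/nvme0n1
# If cryptvar shows up as dm-0, compare its latency with the raw device:
iostat -xz /dev/nvme0n1 /dev/dm-0 1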
ZFS: the “txg sync” shaped punch in the face
ZFS can deliver excellent performance and also teach you about write amplification. A common iowait pattern is periodic stalls during transaction group (txg) sync when the pool is overloaded or mis-tuned for the workload.
cr0x@server:~$ zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
rpool 1.20T 600G 5 950 200K 120M
mirror 1.20T 600G 5 950 200K 120M
nvme0n1p3 - - 2 480 100K 60M
nvme1n1p3 - - 3 470 100K 60M
Meaning: heavy writes; if latency is bad, check for sync write patterns and SLOG use, recordsize, compression, and whether the pool is near-full.
Decision: if a VM database is doing sync writes, consider a separate SLOG device (with power-loss protection) or adjust the application behavior. Don’t “fix” this by turning off sync unless you accept data loss.
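Before changing anything on ZFS, look at what the pool and the dataset backing the VM disks are actually configured to do; these are standard properties, and rpool/vm is a placeholder dataset name:
# Pool fill level and fragmentation: near-full pools make txg syncs much more painful.
zpool list -o name,size,allocated,free,capacity,fragmentation rpool
# Sync behavior, recordsize, compression, and logbias for the dataset holding the VM images (placeholder name).
zfs get sync,recordsize,compression,logbias rpool/vm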
Network storage (NFS/iSCSI): when the “disk” is actually a network problem
NFS can produce iowait that looks like local disk saturation. The host is waiting on I/O completions, but those completions depend on network latency, server load, or retransmits.
cr0x@server:~$ mount | grep nfs
10.0.0.20:/export/vm-images on /var/lib/libvirt/images type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)
cr0x@server:~$ nfsstat -c | head -n 20
Client rpc stats:
calls retrans authrefrsh
1234567 3456 123
Client nfs v4:
null read write commit open
0 234567 345678 4567 1234
Meaning: retransmits are non-zero and rising; commits exist; a “hard” mount means tasks can hang until the server responds.
Decision: check network errors, NFS server load, and consider isolating latency-sensitive VMs away from network-backed storage. If your host “freezes” due to NFS, treat it as a dependency failure, not a local disk issue.
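Per-mount NFS latency is easier to see with nfsiostat (typically shipped in nfs-common) than with iostat; a quick sample against the mount from this example, plus the raw counters as a fallback:
# Per-operation RTT, ops/s, and retransmits for the NFS mount backing the VM images.
nfsiostat 5 3 /var/lib/libvirt/images
# Raw per-mount counters if nfsiostat isn't installed:
grep -A 20 'vm-images' /proc/self/mountstats | head -n 30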
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “iowait means the disk is maxed out”
A platform team had a Debian-based virtualization cluster that occasionally went into “everyone is slow” mode. The on-call saw iowait at 90–100%, glanced at a dashboard showing moderate disk throughput, and assumed the storage array was fine because “it’s not even pushing bandwidth.” So they focused on CPU and memory. They tuned kernel swappiness. They restarted a few services. They even blamed a recent app deploy because that’s what humans do under stress.
Meanwhile the cluster kept freezing in bursts: SSH sessions stuck, VMs timing out, then magically recovering. It looked like a transient external issue. The team started a ticket with the storage vendor. The vendor asked for latency numbers. Nobody had them in the alert.
The real issue was tail latency from intermittent stalls on an NVMe device due to thermal throttling. Utilization and throughput were not the problem; completion time was. During stalls, queues built up, tasks blocked, iowait soared, and then the backlog drained. Bandwidth graphs stayed “fine” because average MB/s didn’t capture the seconds-long pauses.
Once they looked at iostat -xz and NVMe smart logs, it was obvious: temps high, error logs increasing, await in hundreds of ms. Fix was unglamorous: improve airflow in the chassis, update firmware, and replace the worst offender drive. They also added alerting on latency and PSI, not just throughput. The next time iowait spiked, the alert said “IO full pressure is high; nvme0 await 280ms.” The argument ended before it started.
2) Optimization that backfired: “Let’s make backups faster”
An operations group wanted shorter backup windows for VM images. They moved from a slower rsync-based method to a new approach that used multiple parallel streams and aggressive block sizes. On paper, it was excellent: faster backups, less wall time, more “modern.” They rolled it out to all hypervisors on a Friday, because of course.
The first weekend, the cluster entered a weird state: the backups finished quickly, but during the backup window, customer-facing VMs became unresponsive. Latency-sensitive services reported timeouts. The hypervisors showed iowait pegged. A few guests were marked “paused” by their watchdogs because the virtual disks stopped responding in time.
The “optimization” was effectively a distributed denial-of-service against their own storage. The parallel streams saturated the underlying device queue depth and pushed the array/controller into a worst-case behavior for mixed workloads. The backups finished fast because the cluster sacrificed everything else to make them fast.
The fix was to treat backups as a background workload, not a race. They implemented I/O throttling for the backup jobs and enforced a concurrency limit per host. Backups took longer again, but the business stopped losing weekends. They also separated “golden image backups” from “hot DB VMs” onto different storage tiers. The lesson was blunt: making one job faster by borrowing latency from everything else is not optimization, it’s redistribution of pain.
3) The boring but correct practice that saved the day: per-host I/O guardrails
A different company ran a Debian virtualization fleet with a hard rule: every VM scope had defined I/O guardrails. Nothing dramatic, just sane defaults in cgroup v2: a bandwidth cap and an IOPS cap, tuned per class of workload. The team got some eye-rolls early on. Developers wanted “no limits.” Management wanted “maximum utilization.” The SREs wanted “no surprises.” The SREs won, quietly.
One afternoon, a team launched a data migration inside a VM. It started doing a large write-heavy transform with frequent fsync calls. In most environments, this is where the host becomes a brick and everyone learns new vocabulary. Here, the migration VM slowed down, because it hit its cap. Other VMs remained normal. The host stayed responsive. Nobody paged.
The migration finished later than the developer expected, so they asked what happened. The SRE pulled up the cgroup stats and showed the I/O throttling. They explained that the alternative was the entire hypervisor going into iowait hell and taking unrelated services down with it.
Nothing heroic happened. That’s the point. The boring practice was guardrails, and it turned a potential cluster incident into a mildly annoyed engineer and a slower migration. This is the kind of win you only appreciate after you’ve been burned.
Common mistakes: symptoms → root cause → fix
This is where most teams waste hours. The patterns are repeatable, and so are the fixes.
1) “iowait is 100%, so the disk is maxed out”
Symptom: iowait high, throughput moderate, system “freezes” in bursts.
Root cause: latency spikes (thermal throttling, firmware hiccup, path resets) rather than sustained bandwidth saturation.
Fix: look at await, aqu-sz, PSI, kernel logs; check NVMe temp/errors; update firmware and cooling; replace failing media.
2) “iotop says nothing is writing, so it can’t be I/O”
Symptom: iowait high, iostat shows writes, iotop looks quiet.
Root cause: buffered writes being flushed by kernel threads (writeback), or I/O attributed to different threads than you expect (QEMU, jbd2, kswapd).
Fix: use pidstat -d, check /proc/meminfo Dirty/Writeback, and inspect kernel threads in D-state.
3) “Load average is high, so CPU is the issue”
Symptom: load climbs, CPU idle is high, iowait is high.
Root cause: load includes tasks blocked in uninterruptible I/O sleep; it’s not a CPU-only metric.
Fix: correlate load with blocked tasks (vmstat b column) and D-state (ps state). Treat it as I/O.
4) “It’s the database” (said without evidence)
Symptom: database queries time out during incident; people point at the DB.
Root cause: DB is a victim of underlying disk queue saturation caused by another workload (backup, migration, log rotation) or shared VM neighbor.
Fix: prove the source with pidstat and per-VM stats; isolate DB storage; implement I/O limits for non-DB batch jobs.
5) “We’ll fix it by switching the I/O scheduler”
Symptom: random latency spikes under mixed workload; team debates schedulers.
Root cause: scheduler choice can help tail latency, but won’t fix a failing drive, a resync storm, or a single VM saturating the device.
Fix: first identify device and culprit; then test scheduler changes with realistic workloads and rollback plan.
6) “We can just pause the VM and everything will recover”
Symptom: pausing/killing the top writer doesn’t immediately restore responsiveness.
Root cause: backlog still draining; journal commits; RAID resync; writeback flush; or filesystem error recovery.
Fix: wait for queues to drain while watching iostat; check for ongoing resync/scrub; confirm kernel logs for errors.
7) “Turning off sync is fine, it’s faster”
Symptom: performance improves when disabling sync in storage/app settings.
Root cause: you traded durability for speed; crash consistency may be gone.
Fix: use appropriate hardware (PLP NVMe, SLOG), tune workload, or accept slower I/O. Don’t sneak durability changes into production without an explicit decision.
Checklists / step-by-step plan
Checklist A: 10-minute “name the culprit” drill
- Run vmstat 1 5: confirm high wa and high blocked tasks (b).
- Run iostat -xz 1 3: identify the hot device by await, aqu-sz, and %util.
- Check /proc/pressure/io: confirm system impact (high “full”).
- Run pidstat -d 1 5: identify top read/write PIDs.
- If the top PID is QEMU: map it to a VM name via the ps command line and/or libvirt tools.
- Validate storage mapping: df on the image path, and/or lsblk for the stack (dm-crypt, LVM).
- Mitigate:
  - Stop or throttle the workload.
  - Apply cgroup I/O limits for the scope.
  - Pause the VM only if necessary, and be ready for backlog drain time.
- Confirm recovery: iostat await and util normalize; PSI “full” drops.
- Capture evidence: iostat/pidstat sample, kernel logs, VM stats.
Checklist B: “Is it hardware or workload?” decision tree
- Kernel logs show timeouts/resets/I/O errors? Treat as hardware/path issue first.
- Device util ~100% with high queue? Likely saturation; find the top writer and throttle/isolate.
- Util low but await high? Look for stalls: thermal throttling, firmware issues, RAID events, network storage hiccups.
- Flushes are huge? Look for sync-heavy apps, barriers, dm-crypt overhead, ZFS sync behavior.
- Only one VM impacted? Guest-specific issue; still check whether it is starving others via shared device queue.
- Everything impacted including unrelated services? Host-level shared bottleneck; implement guardrails and isolate critical workloads.
Checklist C: Permanent prevention (what to implement after the incident)
- Add monitoring for: per-device await, queue depth, PSI I/O “full,” and NVMe temperature (a minimal PSI poll sketch follows this list).
- Set sane cgroup v2 I/O defaults for VMs and batch jobs.
- Separate storage for “latency-critical” vs “bulk write” workloads.
- Schedule RAID rebuild/scrub and ZFS scrub outside peak hours; throttle if necessary.
- Document which workloads are allowed to run heavy I/O and when.
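For the PSI part of that monitoring, you don’t need a full exporter to get value; a minimal poll sketch (threshold and interval are arbitrary examples, and logger is just a stand-in for your real alerting path) is enough to catch sustained “full” pressure:
#!/bin/bash
# Log whenever the 10-second "full" I/O pressure average exceeds a threshold (percent).
threshold=25   # example value; tune per fleet
while sleep 10; do
  full=$(awk '/^full/ {sub("avg10=", "", $2); print $2}' /proc/pressure/io)
  # Compare as floats with awk, since bash arithmetic is integer-only.
  if awk -v f="$full" -v t="$threshold" 'BEGIN {exit !(f > t)}'; then
    logger -t io-pressure "full avg10=${full}% exceeds ${threshold}%"
  fi
done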
FAQ
1) Is 100% iowait always a disk problem?
No. It’s an I/O completion problem. Disk can be local SSD, RAID, dm-crypt stack, network storage, or even a controller intermittently stalling. Always identify the device and check latency.
2) Why does the host “freeze” even though CPU is mostly idle?
Because the threads you need are blocked in uninterruptible sleep waiting for I/O. CPU being idle doesn’t help when the kernel is waiting on the block layer to complete requests.
3) What’s the fastest way to find the culprit PID?
pidstat -d 1 is the quickest reliable tool for per-PID I/O rates. iotop is useful, but can miss writeback behavior and sometimes lies by omission.
4) How do I find which VM is causing it on a KVM/libvirt host?
Use pidstat to identify the QEMU PID, then inspect the QEMU command line for -name guest=..., and confirm with virsh domblkstat to see which disk is writing.
5) iostat shows high await but low %util. How is that possible?
Because the device may be stalling or completing I/O in bursts, or the bottleneck is upstream (network storage, controller resets, error recovery). Low util with high latency is a red flag for intermittent stalls, not “too much load.”
6) Should I change the I/O scheduler to fix iowait freezes?
Not as a first move. Scheduler tweaks can help tail latency under mixed workloads, but they won’t fix failing hardware, RAID rebuild storms, or one VM saturating the device. Diagnose first, then test changes with rollback.
7) What’s the safest immediate mitigation when one VM is noisy?
Throttle it via cgroup I/O limits or libvirt I/O tuning so the host stays responsive. Killing the VM works too, but it’s a higher-risk move if you don’t know what it was doing (databases, in-flight writes).
8) Can RAM cache hide I/O problems?
Yes. Buffered writes can make things look fine until writeback kicks in and the disk can’t keep up. Then you get a flush storm and the whole host pays the bill.
9) Why do flushes/fsync cause such dramatic slowdowns?
They force ordering and durability semantics. If the stack below (RAID, dm-crypt, SSD firmware) makes flush expensive, the workload becomes latency-bound quickly, even at moderate throughput.
10) How do I know if I should replace the drive?
If kernel logs show timeouts/resets, or NVMe SMART logs show growing media errors and error log entries, treat it as a reliability issue. Performance incidents often become data-loss incidents when ignored.
Conclusion: practical next steps
When Debian 13 hits 100% iowait and the host “freezes,” don’t treat it like an unsolved mystery. Treat it like an identification problem with a stopwatch. Name the device. Name the source. Mitigate. Capture evidence. Then fix the structure that allowed one workload to monopolize shared I/O.
Do these next:
- Add alerting on per-device await and PSI I/O “full,” not just throughput.
- Implement VM and batch-job I/O guardrails (cgroup v2) so one guest can’t brick a host.
- Separate bulk-write workloads from latency-sensitive services at the storage layer.
- Audit kernel logs and NVMe health counters regularly; replace suspicious drives before they teach you humility.
The best iowait incident is the one where nothing exciting happens because you throttled the noisy neighbor months ago. Aim for boring.