Debian 13: 100% iowait host freezes — find the noisy process/VM in 10 minutes

Your Debian 13 host “freezes,” SSH sessions hang, Prometheus graphs look like a heart monitor flatlining, and top shows CPU at 2%… except iowait is pinned at 100%. It feels like the machine is powered on but emotionally unavailable.

This is rarely a mystery. It’s usually one noisy process or VM doing something “perfectly reasonable” at the worst possible time, plus storage that can’t keep up. The job is not to meditate on it. The job is to name the culprit fast, decide whether to kill, throttle, or fix, and get your host back.

The mental model: what 100% iowait really means

iowait is not “your disk is 100% busy.” It’s “your CPUs have runnable work, but that work is blocked waiting for I/O completion.” If you see high iowait with low CPU usage, the kernel is doing you a favor by not burning cycles. Your users interpret it as “the server is dead,” because from their perspective, it is.

On Debian 13, the usual pattern is:

  • Some workload starts issuing lots of I/O (writes are the classic trigger).
  • Queue depth spikes on a block device.
  • Latency explodes (milliseconds become seconds).
  • Threads block in D-state (uninterruptible sleep), and the system feels frozen.

What you want to answer in under 10 minutes:

  1. Which device? (nvme0n1? md0? dm-crypt? NFS mount?)
  2. Which kind of pressure? (latency? saturation? retries? flush/sync?)
  3. Which source? (PID? container? VM/guest?)
  4. What action? (kill, throttle, move, fix hardware, change config)

Dry truth: “iowait 100%” is a symptom, not a diagnosis. The diagnosis is a queue and a source.

One quote worth keeping on your wall, because it's the whole job: "Hope is not a strategy" (a line usually attributed to NASA flight director Gene Kranz).

Joke #1: iowait is the Linux way of saying, “I’d love to help, but I’m waiting on storage to finish its long novel.”

Fast diagnosis playbook (10 minutes)

Minute 0–1: confirm it’s I/O and not a different kind of misery

  • Check load, iowait, and blocked tasks.
  • If you see many tasks in D-state and load rising with low CPU usage, you’re in I/O land.

Minute 1–3: identify the device that’s choking

  • Use iostat or sar -d to find high await and high utilization.
  • Validate with cat /proc/diskstats if tools aren’t installed or are hanging.

Minute 3–6: identify the source (PID, cgroup, or VM)

  • Use pidstat -d to find which PIDs are doing I/O.
  • Use iotop for live visibility if it works (it can miss buffered writeback).
  • On virtualization hosts, map QEMU threads to specific guests via libvirt/QEMU process args and per-disk stats.
  • Use PSI (/proc/pressure/io) to confirm system-wide I/O stall.

Minute 6–8: decide: kill, throttle, or isolate

  • If it’s a backup job, an index rebuild, or a migration: throttle it.
  • If it’s a runaway process: stop it now, root-cause later.
  • If it’s a single VM: apply I/O limits at the hypervisor or storage layer.

Minute 8–10: confirm recovery and capture evidence

  • Watch await and utilization drop.
  • Grab journalctl, kernel logs, and a short iostat/pidstat sample so you can fix it permanently.
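
If you want a canned version of that evidence grab, here is a minimal sketch; the output directory name and sample counts are arbitrary choices, adjust to taste:

# Rough evidence capture: ~10 seconds of samples plus recent kernel logs.
ts=$(date +%Y%m%d-%H%M%S)
mkdir -p /root/iowait-$ts
iostat -xz 1 10 > /root/iowait-$ts/iostat.txt &
pidstat -d 1 10 > /root/iowait-$ts/pidstat.txt &
cat /proc/pressure/io > /root/iowait-$ts/psi.txt
journalctl -k -n 200 --no-pager > /root/iowait-$ts/kernel.log
wait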

If you do nothing else, do this: find the device, then find the source. Everything else is decoration.

Hands-on tasks: commands, output, and decisions (12+)

These are the moves I actually use when a host is “frozen” but not dead. Each task includes what the output means and what decision you make from it. Run them in order until you have a name (PID or VM) and a device.

Task 1 — See iowait, load, and blocked tasks in one glance

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 12      0 312844  98124 4021180   0    0     0  8240  220  410  2  1  3 94  0
 0 18      0 311200  98124 4019900   0    0     0 12032  210  390  1  1  0 98  0
 0 16      0 310980  98124 4019500   0    0     0 11024  205  372  1  1  1 97  0
 0 20      0 310500  98124 4018400   0    0     0 14080  198  360  1  1  0 98  0
 0 17      0 310300  98124 4018100   0    0     0 13056  200  365  1  1  0 98  0

Meaning: the b column (blocked processes) is high, and wa (iowait) is ~98%. That’s a classic “storage is the bottleneck” signature.

Decision: stop debating. Move immediately to per-device latency (iostat).

Task 2 — Find the hot device with latency and utilization

cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0-1-amd64 (server) 	12/29/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          1.20    0.00    1.10   92.40    0.00    5.30

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1          0.20      8.00     0.00   0.00    2.10    40.0  420.00  67136.00    12.00   2.78  280.30   159.8  118.2  99.70
nvme1n1         18.00   2048.00     0.00   0.00    1.80   113.8   15.00   1024.00     0.00   0.00    2.40    68.3    0.1   4.10

Meaning: nvme0n1 is at ~100% util with a huge w_await and a massive queue (aqu-sz). That’s your bottleneck device.

Decision: identify who is writing to nvme0n1. Also consider whether this is a latency spike (firmware, thermal throttling) or saturation (legitimate load).

Task 3 — Confirm system-wide I/O stall using PSI

cr0x@server:~$ cat /proc/pressure/io
some avg10=78.43 avg60=65.12 avg300=40.02 total=184563218
full avg10=52.10 avg60=44.90 avg300=22.31 total=105331982

Meaning: “full” I/O pressure means tasks are stalled because I/O can’t complete. This is not “a few slow queries,” it’s systemic.

Decision: you’re allowed to take aggressive mitigation (throttle/kill), because the whole host is paying the price.
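
If you want to script this check (or wire it into an alert later), a minimal sketch that pulls the full avg10 value; the threshold of 30 is an arbitrary starting point, not a standard:

# Print a warning when "full" I/O pressure (10s average) exceeds a chosen threshold.
awk '/^full/ { split($2, a, "="); if (a[2] + 0 > 30) print "I/O full pressure high:", a[2] }' /proc/pressure/io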

Task 4 — Find the top I/O PIDs fast (and don’t trust your gut)

cr0x@server:~$ pidstat -d 1 5
Linux 6.12.0-1-amd64 (server) 	12/29/2025 	_x86_64_	(32 CPU)

03:14:01      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
03:14:02        0     21492      0.00  65432.00      0.00      48  qemu-system-x86
03:14:02        0     31811      0.00  12160.00      0.00      20  rsync
03:14:02        0      1092      0.00   4096.00      0.00      10  jbd2/nvme0n1p2-8
03:14:02        0     25170      0.00   2048.00      0.00       8  postgres

03:14:02      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
03:14:03        0     21492      0.00  70000.00      0.00      52  qemu-system-x86
03:14:03        0     31811      0.00  11008.00      0.00      18  rsync

Meaning: the top writer is qemu-system-x86 PID 21492. That usually means “a VM is the noisy neighbor,” not that QEMU woke up and chose violence.

Decision: map this QEMU process to a guest name and disk. Then either throttle the guest I/O or fix what the guest is doing.

Task 5 — If it’s not obvious, look for D-state pileups

cr0x@server:~$ ps -eo pid,state,comm,wchan:32 --sort=state | head -n 20
  PID S COMMAND         WCHAN
 8921 D qemu-system-x86 blk_mq_get_tag
 8922 D qemu-system-x86 io_schedule
 8923 D qemu-system-x86 bit_wait_io
31811 D rsync           io_schedule
25170 D postgres        io_schedule
 1092 D jbd2/nvme0n1p2-8 bit_wait_io

Meaning: D state threads are blocked in the block layer (blk_mq_get_tag) or waiting on I/O scheduling. That’s consistent with queue saturation on the device.

Decision: you are not dealing with a CPU runaway. You need to reduce I/O load or increase I/O capacity, and quickly identify the source.

Task 6 — Map the QEMU PID to a VM name (libvirt)

cr0x@server:~$ ps -p 21492 -o pid,cmd --no-headers | sed 's/ -/\n-/g' | head -n 20
21492 /usr/bin/qemu-system-x86_64
-name guest=acct-db-03,debug-threads=on
-uuid 3bce8b54-8d13-4c4a-bd19-5e3dfe7e8a0e
-drive file=/var/lib/libvirt/images/acct-db-03.qcow2,format=qcow2,if=virtio,cache=none,aio=native
-drive file=/var/lib/libvirt/images/acct-db-03-data.raw,format=raw,if=virtio,cache=none,aio=native

Meaning: now you have a VM identity: acct-db-03, and at least one disk path. This is gold when you’re trying to explain impact to humans.

Decision: confirm which disk is hottest and whether the backing storage is on nvme0n1. If yes, throttle that guest or pause the workload inside it.

Task 7 — Find which block device backs the VM disk file

cr0x@server:~$ df -h /var/lib/libvirt/images
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  1.8T  1.2T  526G  70% /var

Meaning: the VM images are on /dev/nvme0n1p2, which is on the hot NVMe. The mapping is consistent.

Decision: mitigation can happen at the host: throttle I/O to that VM (cgroup or libvirt tuning), or at the guest: stop/reduce the job.

Task 8 — Check per-domain block stats (libvirt) to find the loudest disk

cr0x@server:~$ virsh domblklist acct-db-03
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/acct-db-03.qcow2
vdb        /var/lib/libvirt/images/acct-db-03-data.raw

cr0x@server:~$ virsh domblkstat acct-db-03 vdb --human
rd_bytes              1.2 GiB
wr_bytes              842.6 GiB
rd_operations         12034
wr_operations         55210321
flush_operations      431920
wr_total_times        10421863423

Meaning: vdb is the heavy writer with many flushes. Flushes are a common “latency amplifier” because they force ordering and completion semantics.

Decision: if the guest is doing sync-heavy writes (databases, fsync storms, journaling), consider throttling or moving it to faster/less contended storage. Also check whether host settings (write cache, barriers, dm-crypt) are making flushes expensive.

Task 9 — Check whether the block device is rejecting you with errors or resets

cr0x@server:~$ journalctl -k -n 80 --no-pager
Dec 29 03:12:48 server kernel: nvme nvme0: I/O 1234 QID 6 timeout, aborting
Dec 29 03:12:48 server kernel: nvme nvme0: Abort status: 0x371
Dec 29 03:12:49 server kernel: nvme0n1: I/O error, dev nvme0n1, sector 1937422336 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Dec 29 03:12:50 server kernel: EXT4-fs warning (device nvme0n1p2): ext4_end_bio:345: I/O error 10 writing to inode 5513245 (starting block 242177834)

Meaning: timeouts and I/O errors are not “load.” They’re a device or transport problem. Under error conditions, latency can go vertical while throughput drops. Your iowait will still look the same.

Decision: stop blaming workloads. Switch to hardware/firmware triage: check SMART/NVMe log, cabling/backplane, controller logs. Consider failing the device out (RAID/ZFS) if possible.

Task 10 — Quick NVMe health and error counters

cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0x00
temperature                         : 78 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 3%
media_errors                        : 12
num_err_log_entries                 : 45

Meaning: temperature is high, and there are media errors. Thermal throttling and internal error recovery can cause huge latency spikes.

Decision: treat this as both a performance and reliability incident. Plan device replacement if errors continue. Improve cooling if temp is consistently high.

Task 11 — Distinguish “device saturated” vs “device idle but slow”

cr0x@server:~$ iostat -xz /dev/nvme0n1 1 2
Linux 6.12.0-1-amd64 (server) 	12/29/2025 	_x86_64_	(32 CPU)

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1           0.00      0.00     0.00   0.00    0.00     0.0  380.00  62000.00     0.00   0.00  310.00   163.2  120.0  99.90

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1           0.00      0.00     0.00   0.00    0.00     0.0  390.00  64000.00     0.00   0.00  295.00   164.1  118.5  99.80

Meaning: this is saturation (util ~100%, big queue). If util were low but await high, you’d suspect intermittent stalls, firmware hiccups, or upstream layers (dm-crypt, RAID resync, NFS server) rather than pure load.

Decision: if it’s saturation, reduce load or add capacity (different disk, more spindles, better NVMe, separate noisy workloads). If it’s stalls, find the stalling layer.

Task 12 — Find which files are being hammered (when you must)

cr0x@server:~$ sudo lsof -p 21492 | grep -E '/var/lib/libvirt/images/|deleted' | head
qemu-system 21492 root  103u   REG  259,2 34359738368  1310720 /var/lib/libvirt/images/acct-db-03-data.raw
qemu-system 21492 root  104u   REG  259,2 21474836480   983040 /var/lib/libvirt/images/acct-db-03.qcow2

Meaning: confirms which image files are in play. It won’t show you which blocks are hot, but it anchors the blame to a concrete path.

Decision: if you need to relocate: you now know what to move. If you need to snapshot: you know what to snapshot. If you need to explain impact: you have a file and a VM name.

Task 13 — Identify major writeback and dirty page pressure

cr0x@server:~$ grep -E 'Dirty|Writeback|MemAvailable' /proc/meminfo
MemAvailable:   10432124 kB
Dirty:           982432 kB
Writeback:       210944 kB
WritebackTmp:         0 kB

Meaning: elevated Dirty/Writeback suggests the kernel is pushing buffered writes to disk and may be stuck behind slow I/O. If dirty memory skyrockets and the disk can’t keep up, everything becomes “slow in weird ways.”

Decision: consider whether a workload is generating sustained writes that exceed device throughput, and whether throttling (ionice, cgroup io.max) is the right mitigation.
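
If you'd rather talk to cgroup v2 directly instead of going through systemd (Task 16 shows that route), a hedged sketch; the cgroup path placeholder and the MAJ:MIN value are examples you must confirm on your host:

# Find the VM's cgroup path from the QEMU PID identified earlier (21492 in this example).
cat /proc/21492/cgroup                      # e.g. 0::/machine.slice/machine-qemu...scope
# Find the device's MAJ:MIN (259:0 is common for nvme0n1, but check).
lsblk -d -o NAME,MAJ:MIN /dev/nvme0n1
# Cap reads to ~10 MB/s and writes to ~20 MB/s for that cgroup (values in bytes/s).
echo "259:0 rbps=10485760 wbps=20971520" | sudo tee "/sys/fs/cgroup/machine.slice/<that-scope>/io.max"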

Task 14 — Check filesystem-level pain (EXT4/XFS) and journal behavior

cr0x@server:~$ mount | grep ' on /var '
/dev/nvme0n1p2 on /var type ext4 (rw,relatime,errors=remount-ro)

cr0x@server:~$ dmesg | tail -n 20
[12345.678901] EXT4-fs (nvme0n1p2): Delayed block allocation failed for inode 5513245 at logical offset 123456 with max blocks 2 with error 5
[12345.678910] EXT4-fs (nvme0n1p2): This should not happen!! Data will be lost

Meaning: filesystem warnings plus I/O errors are “stop the world” events. Even without errors, journal threads (like jbd2) showing up in pidstat is a clue that writes are forcing metadata updates and commits.

Decision: if errors exist, treat as storage failure. If not, and you have many flushes/commits, focus on the workload pattern (fsync storms, small random writes) and storage configuration (write cache, barriers, virtualization caching mode).

Task 15 — When you need hard evidence: sample block I/O with blktrace

cr0x@server:~$ sudo blktrace -d /dev/nvme0n1 -w 10 -o - | sudo blkparse -i - | head -n 20
259,0    0        1     0.000000000  21492  Q   WS 1937422336 + 128 [qemu-system-x86]
259,0    0        2     0.000012345  21492  G   WS 1937422336 + 128 [qemu-system-x86]
259,0    0        3     0.000056789  21492  I   WS 1937422336 + 128 [qemu-system-x86]
259,0    0        4     0.310123456  21492  D   WS 1937422336 + 128 [qemu-system-x86]
259,0    0        5     0.650987654  21492  C   WS 1937422336 + 128 [qemu-system-x86]

Meaning: you can attribute I/O at the block layer to a PID (here, QEMU). Notice the time gap between issue and completion; that’s your latency in the raw.

Decision: if you need to convince someone that “the VM really did it,” this is courtroom-grade. Also helpful when multiple PIDs are involved and userland tools disagree.

Task 16 — Quick mitigation: throttle a noisy cgroup (systemd) with io.max

cr0x@server:~$ systemctl status libvirtd | head -n 5
● libvirtd.service - Virtualization daemon
     Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled)
     Active: active (running) since Mon 2025-12-29 02:11:03 UTC; 1h 03min ago

cr0x@server:~$ sudo systemctl set-property --runtime 'machine-qemu\x2dacct\x2ddb\x2d03.scope' IOReadBandwidthMax="/dev/nvme0n1 10M" IOWriteBandwidthMax="/dev/nvme0n1 20M"

Meaning: you just applied runtime I/O bandwidth limits to that VM’s systemd scope (naming varies; confirm the exact scope name on your host). This reduces the VM’s ability to starve the host.

Decision: use this when you need the host stable more than you need one guest fast. Then schedule a permanent fix: isolate workloads, tune DB settings, or move the guest to dedicated storage.
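
To confirm the limit actually landed, and to clear it once the incident is over, something like this (same caveat: the scope name below is the one from this example, yours will differ):

# Show what systemd currently enforces for that scope.
systemctl show 'machine-qemu\x2dacct\x2ddb\x2d03.scope' -p IOReadBandwidthMax -p IOWriteBandwidthMax
# Empty assignments reset these list-type properties, removing the runtime limits.
sudo systemctl set-property --runtime 'machine-qemu\x2dacct\x2ddb\x2d03.scope' IOReadBandwidthMax= IOWriteBandwidthMax=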

Joke #2: Throttling a noisy VM is like putting a shopping cart wheel back on. It won’t make it elegant, but at least it stops screaming.

Facts and history that help you debug faster

Knowing a few “why it’s like this” facts keeps you from chasing ghosts. Here are some that actually matter when you’re staring at iowait graphs at 03:00.

  1. Linux iowait is a CPU accounting state, not a disk metric. It can be high even when the disk isn’t fully utilized, especially with intermittent stalls or deep kernel queues.
  2. The old svctm field from iostat became unreliable with modern devices and multi-queue. People still quote it like gospel. Don’t.
  3. blk-mq (multi-queue block layer) changed the “one queue per device” intuition. NVMe and modern storage can have many queues; saturation behavior looks different than old SATA.
  4. CFQ died, BFQ arrived, and mq-deadline became a default-ish choice for many SSDs. Scheduler choice can change tail latency under mixed loads.
  5. PSI (Pressure Stall Information) is relatively new compared to classic tools. It tells you “how stalled” the system is, not just “how busy” it is, which is closer to user pain.
  6. Writeback caching can hide the start of an incident. You can buffer writes in RAM until you can’t, then you pay the debt with interest: long flush storms.
  7. Virtualization adds an extra scheduler stack. Guest thinks it’s doing “reasonable” fsync; host turns it into real flushes; underlying storage decides it’s time for garbage collection.
  8. NVMe thermal throttling is a real production failure mode. High sequential writes can push drives into throttling, causing dramatic latency increases without obvious errors at first.
  9. RAID resync and scrubs can create perfectly legitimate I/O denial of service. No bug required. Just timing.

If it’s a VM: mapping host I/O to the noisy guest

On a virtualization host, the most common wrong move is to treat QEMU as the culprit. QEMU is the messenger. Your real question: which guest workload is saturating which host device, and is it an expected batch job or an accidental foot-gun?

What “noisy VM” looks like on the host

  • pidstat -d shows one or a few qemu-system-* PIDs dominating writes.
  • iostat shows one device with huge queue (aqu-sz) and high await.
  • Load average climbs, but CPU is mostly idle except iowait.
  • Other VMs complain: “disk is slow” or “database is stuck” even though they aren’t doing much.

How to connect QEMU threads to a domain cleanly

If you use libvirt, you can map by:

  • the QEMU process command line (-name guest=...)
  • virsh domblkstat per disk device
  • systemd scope units under machine.slice (names vary)
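
A rough sketch for mapping running domains to their QEMU PIDs in one pass; it assumes the stock Debian/libvirt pidfile location under /run/libvirt/qemu/, which may differ on your build:

# Print "domain-name  qemu-pid" for every running libvirt domain.
for dom in $(virsh list --name); do
    pid=$(cat "/run/libvirt/qemu/${dom}.pid" 2>/dev/null)
    printf '%-24s %s\n' "$dom" "${pid:-unknown}"
done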

Once you have a guest name, you have options:

  • Guest-side mitigation: pause the backup, reduce concurrency, tune database checkpoints, stop an index rebuild, limit compaction.
  • Host-side mitigation: apply I/O limits using cgroup v2 (io.max / systemd properties), or adjust libvirt I/O tuning (if you already have it in place).
  • Structural fix: move that VM to dedicated storage, or split OS disk from data disk, or separate “batch jobs” from “latency-sensitive” VMs.

When the VM isn’t doing “that much,” but still kills the host

This is where flushes, sync writes, and metadata-heavy patterns matter. A VM doing a modest number of transactions with fsync can create enormous pressure if the underlying stack turns each fsync into expensive flushes across dm-crypt, RAID, and a drive doing internal garbage collection.

Watch for:

  • high flush_operations in virsh domblkstat
  • database config changes that increase durability guarantees
  • guest filesystem mount options changes
  • host caching mode changes (e.g., cache=none vs writeback)
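
Lifetime counters don't tell you whether flushes are the problem right now, so diff two samples; the domain and disk names below are the ones from Task 8, and the field name is matched loosely in case your libvirt prints it slightly differently:

# Flush operations per second over a 10-second window for acct-db-03's vdb disk.
f1=$(virsh domblkstat acct-db-03 vdb | awk '/flush_operations/ {print $NF}')
sleep 10
f2=$(virsh domblkstat acct-db-03 vdb | awk '/flush_operations/ {print $NF}')
echo "flush ops/sec: $(( (f2 - f1) / 10 ))"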

Storage layer triage: NVMe, RAID, dm-crypt, ZFS, network

After you identify the noisy process/VM, you still need to decide whether the real fix is “stop doing that” or “storage stack is unhealthy.” The fastest way is to recognize common failure modes by their telemetry.

NVMe: fast until it isn’t

NVMe incidents often fall into two buckets: saturation (legit load) and latency spikes (hardware/firmware/thermal). Saturation is boring. Latency spikes are dramatic.

What to check:

  • Temperature and errors via nvme smart-log
  • Kernel logs for timeouts/resets
  • Util vs await: if util is low but await is high, suspect stalling rather than throughput limitation

mdadm RAID: resync is a scheduled incident if you let it be

Software RAID can behave well for months, then a rebuild or scrub starts and suddenly your “stable host” becomes a lesson in shared resources. Rebuild I/O competes with everything.

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks super 1.2 [2/2] [UU]
      [===>.................]  resync = 18.3% (178944000/976630336) finish=123.4min speed=107000K/sec

unused devices: <none>

Meaning: resync is active. That can drive iowait even when your application workload hasn’t changed.

Decision: throttle rebuild speed if it’s hurting production, or schedule rebuild windows. Don’t “optimize” by letting rebuild run wild on shared disks.
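
Throttling the resync is a pair of sysctls; the values below are a deliberately conservative example (KiB/s), and the usual default maximum is 200000, but check what yours reports first:

# See current resync speed limits, then cap them while production is hurting.
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
sudo sysctl -w dev.raid.speed_limit_max=20000    # ~20 MB/s ceiling for rebuild I/O
# Restore afterwards, using the value you saw above.
sudo sysctl -w dev.raid.speed_limit_max=200000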

dm-crypt: encryption is not free, and flush semantics matter

dm-crypt adds CPU cost and can affect latency under sync-heavy workloads. It’s usually fine on modern CPUs, but it can make tail latency uglier in certain patterns.

cr0x@server:~$ lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINTS
nvme0n1         disk         1.8T
├─nvme0n1p1     part vfat     512M /boot/efi
└─nvme0n1p2     part crypto_LUKS 1.8T
  └─cryptvar    crypt ext4    1.8T /var

Meaning: your hot filesystem is on top of dm-crypt. That’s not automatically bad. It is a hint that flush-heavy workloads may pay extra overhead.

Decision: if flush storms are killing you, test with realistic workloads and consider separating sync-heavy workloads onto dedicated devices or revisiting caching/IO tuning. Don’t disable encryption in a panic unless you enjoy compliance meetings.
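
To see exactly what the crypt layer is configured as (cipher, sector size, and any performance flags), using the mapping name from the lsblk output above:

# Inspect the active dm-crypt mapping; flags such as no_write_workqueue (if set)
# change how writes are queued, which matters for flush-heavy workloads.
sudo cryptsetup status cryptvar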

ZFS: the “txg sync” shaped punch in the face

ZFS can deliver excellent performance and also teach you about write amplification. A common iowait pattern is periodic stalls during transaction group (txg) sync when the pool is overloaded or mis-tuned for the workload.

cr0x@server:~$ zpool iostat -v 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
rpool                       1.20T   600G      5    950   200K  120M
  mirror                    1.20T   600G      5    950   200K  120M
    nvme0n1p3                   -      -      2    950   100K  120M
    nvme1n1p3                   -      -      3    950   100K  120M

Meaning: heavy writes; if latency is bad, check for sync write patterns and SLOG use, recordsize, compression, and whether the pool is near-full.

Decision: if a VM database is doing sync writes, consider a separate SLOG device (with power-loss protection) or adjust the application behavior. Don’t “fix” this by turning off sync unless you accept data loss.
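
Two quick checks that cover the usual suspects, using the pool name from the sample above; the property names are standard ZFS, but confirm they match your layout:

# Near-full or heavily fragmented pools get slow long before they get "full".
zpool list -o name,size,alloc,free,frag,cap rpool
# Confirm nobody quietly changed durability or tuning on the datasets that matter.
zfs get sync,recordsize,compression,logbias rpool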

Network storage (NFS/iSCSI): when the “disk” is actually a network problem

NFS can produce iowait that looks like local disk saturation. The host is waiting on I/O completions, but those completions depend on network latency, server load, or retransmits.

cr0x@server:~$ mount | grep nfs
10.0.0.20:/export/vm-images on /var/lib/libvirt/images type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)

cr0x@server:~$ nfsstat -c | head -n 20
Client rpc stats:
calls      retrans    authrefrsh
1234567    3456       123

Client nfs v4:
null         read         write        commit       open
0            234567       345678       4567         1234

Meaning: retransmits are non-zero and rising; commits exist; a “hard” mount means tasks can hang until the server responds.

Decision: check network errors, NFS server load, and consider isolating latency-sensitive VMs away from network-backed storage. If your host “freezes” due to NFS, treat it as a dependency failure, not a local disk issue.
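
Aggregate RPC counters hide which mount hurts; per-mount latency is more useful. nfsiostat (typically shipped with nfs-common on Debian) and /proc/self/mountstats are the usual sources, a sketch:

# Per-operation RTT/exe times for the images mount, 3 samples at 1-second intervals.
nfsiostat 1 3 /var/lib/libvirt/images
# Raw per-mount counters (events, per-op timings) if you prefer to parse them yourself.
grep -A 30 '/var/lib/libvirt/images' /proc/self/mountstats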

Three corporate mini-stories from the trenches

1) Incident caused by a wrong assumption: “iowait means the disk is maxed out”

A platform team had a Debian-based virtualization cluster that occasionally went into “everyone is slow” mode. The on-call saw iowait at 90–100%, glanced at a dashboard showing moderate disk throughput, and assumed the storage array was fine because “it’s not even pushing bandwidth.” So they focused on CPU and memory. They tuned kernel swappiness. They restarted a few services. They even blamed a recent app deploy because that’s what humans do under stress.

Meanwhile the cluster kept freezing in bursts: SSH sessions stuck, VMs timing out, then magically recovering. It looked like a transient external issue. The team started a ticket with the storage vendor. The vendor asked for latency numbers. Nobody had them in the alert.

The real issue was tail latency from intermittent stalls on an NVMe device due to thermal throttling. Utilization and throughput were not the problem; completion time was. During stalls, queues built up, tasks blocked, iowait soared, and then the backlog drained. Bandwidth graphs stayed “fine” because average MB/s didn’t capture the seconds-long pauses.

Once they looked at iostat -xz and NVMe smart logs, it was obvious: temps high, error logs increasing, await in hundreds of ms. Fix was unglamorous: improve airflow in the chassis, update firmware, and replace the worst offender drive. They also added alerting on latency and PSI, not just throughput. The next time iowait spiked, the alert said “IO full pressure is high; nvme0 await 280ms.” The argument ended before it started.

2) Optimization that backfired: “Let’s make backups faster”

An operations group wanted shorter backup windows for VM images. They moved from a slower rsync-based method to a new approach that used multiple parallel streams and aggressive block sizes. On paper, it was excellent: faster backups, less wall time, more “modern.” They rolled it out to all hypervisors on a Friday, because of course.

The first weekend, the cluster entered a weird state: the backups finished quickly, but during the backup window, customer-facing VMs became unresponsive. Latency-sensitive services reported timeouts. The hypervisors showed iowait pegged. A few guests were marked “paused” by their watchdogs because the virtual disks stopped responding in time.

The “optimization” was effectively a distributed denial-of-service against their own storage. The parallel streams saturated the underlying device queue depth and pushed the array/controller into a worst-case behavior for mixed workloads. The backups finished fast because the cluster sacrificed everything else to make them fast.

The fix was to treat backups as a background workload, not a race. They implemented I/O throttling for the backup jobs and enforced a concurrency limit per host. Backups took longer again, but the business stopped losing weekends. They also separated “golden image backups” from “hot DB VMs” onto different storage tiers. The lesson was blunt: making one job faster by borrowing latency from everything else is not optimization, it’s redistribution of pain.

3) The boring but correct practice that saved the day: per-host I/O guardrails

A different company ran a Debian virtualization fleet with a hard rule: every VM scope had defined I/O guardrails. Nothing dramatic, just sane defaults in cgroup v2: a bandwidth cap and an IOPS cap, tuned per class of workload. The team got some eye-rolls early on. Developers wanted “no limits.” Management wanted “maximum utilization.” The SREs wanted “no surprises.” The SREs won, quietly.

One afternoon, a team launched a data migration inside a VM. It started doing a large write-heavy transform with frequent fsync calls. In most environments, this is where the host becomes a brick and everyone learns new vocabulary. Here, the migration VM slowed down, because it hit its cap. Other VMs remained normal. The host stayed responsive. Nobody paged.

The migration finished later than the developer expected, so they asked what happened. The SRE pulled up the cgroup stats and showed the I/O throttling. They explained that the alternative was the entire hypervisor going into iowait hell and taking unrelated services down with it.

Nothing heroic happened. That’s the point. The boring practice was guardrails, and it turned a potential cluster incident into a mildly annoyed engineer and a slower migration. This is the kind of win you only appreciate after you’ve been burned.

Common mistakes: symptoms → root cause → fix

This is where most teams waste hours. The patterns are repeatable, and so are the fixes.

1) “iowait is 100%, so the disk is maxed out”

Symptom: iowait high, throughput moderate, system “freezes” in bursts.
Root cause: latency spikes (thermal throttling, firmware hiccup, path resets) rather than sustained bandwidth saturation.
Fix: look at await, aqu-sz, PSI, kernel logs; check NVMe temp/errors; update firmware and cooling; replace failing media.

2) “iotop says nothing is writing, so it can’t be I/O”

Symptom: iowait high, iostat shows writes, iotop looks quiet.
Root cause: buffered writes being flushed by kernel threads (writeback), or I/O attributed to different threads than you expect (QEMU, jbd2, kswapd).
Fix: use pidstat -d, check /proc/meminfo Dirty/Writeback, and inspect kernel threads in D-state.

3) “Load average is high, so CPU is the issue”

Symptom: load climbs, CPU idle is high, iowait is high.
Root cause: load includes tasks blocked in uninterruptible I/O sleep; it’s not a CPU-only metric.
Fix: correlate load with blocked tasks (vmstat b column) and D-state (ps state). Treat it as I/O.

4) “It’s the database” (said without evidence)

Symptom: database queries time out during incident; people point at the DB.
Root cause: DB is a victim of underlying disk queue saturation caused by another workload (backup, migration, log rotation) or shared VM neighbor.
Fix: prove the source with pidstat and per-VM stats; isolate DB storage; implement I/O limits for non-DB batch jobs.

5) “We’ll fix it by switching the I/O scheduler”

Symptom: random latency spikes under mixed workload; team debates schedulers.
Root cause: scheduler choice can help tail latency, but won’t fix a failing drive, a resync storm, or a single VM saturating the device.
Fix: first identify device and culprit; then test scheduler changes with realistic workloads and rollback plan.

6) “We can just pause the VM and everything will recover”

Symptom: pausing/killing the top writer doesn’t immediately restore responsiveness.
Root cause: backlog still draining; journal commits; RAID resync; writeback flush; or filesystem error recovery.
Fix: wait for queues to drain while watching iostat; check for ongoing resync/scrub; confirm kernel logs for errors.

7) “Turning off sync is fine, it’s faster”

Symptom: performance improves when disabling sync in storage/app settings.
Root cause: you traded durability for speed; crash consistency may be gone.
Fix: use appropriate hardware (PLP NVMe, SLOG), tune workload, or accept slower I/O. Don’t sneak durability changes into production without an explicit decision.

Checklists / step-by-step plan

Checklist A: 10-minute “name the culprit” drill

  1. Run vmstat 1 5: confirm high wa and high blocked tasks b.
  2. Run iostat -xz 1 3: identify the hot device by await, aqu-sz, and %util.
  3. Check /proc/pressure/io: confirm system impact (high “full”).
  4. Run pidstat -d 1 5: identify top read/write PIDs.
  5. If top PID is QEMU: map to VM name via ps command line and/or libvirt tools.
  6. Validate storage mapping: df on image path, and/or lsblk for stack (dm-crypt, LVM).
  7. Mitigate:
    • Stop or throttle the workload.
    • Apply cgroup I/O limits for the scope.
    • Pause the VM only if necessary, and be ready for backlog drain time.
  8. Confirm recovery: iostat await and util normalize; PSI full drops.
  9. Capture evidence: iostat/pidstat sample, kernel logs, VM stats.

Checklist B: “Is it hardware or workload?” decision tree

  1. Kernel logs show timeouts/resets/I/O errors? Treat as hardware/path issue first.
  2. Device util ~100% with high queue? Likely saturation; find the top writer and throttle/isolate.
  3. Util low but await high? Look for stalls: thermal throttling, firmware issues, RAID events, network storage hiccups.
  4. Flushes are huge? Look for sync-heavy apps, barriers, dm-crypt overhead, ZFS sync behavior.
  5. Only one VM impacted? Guest-specific issue; still check whether it is starving others via shared device queue.
  6. Everything impacted including unrelated services? Host-level shared bottleneck; implement guardrails and isolate critical workloads.

Checklist C: Permanent prevention (what to implement after the incident)

  • Add monitoring for: per-device await, queue depth, PSI I/O “full,” and NVMe temperature.
  • Set sane cgroup v2 I/O defaults for VMs and batch jobs.
  • Separate storage for “latency-critical” vs “bulk write” workloads.
  • Schedule RAID rebuild/scrub and ZFS scrub outside peak hours; throttle if necessary.
  • Document which workloads are allowed to run heavy I/O and when.

FAQ

1) Is 100% iowait always a disk problem?

No. It’s an I/O completion problem. Disk can be local SSD, RAID, dm-crypt stack, network storage, or even a controller intermittently stalling. Always identify the device and check latency.

2) Why does the host “freeze” even though CPU is mostly idle?

Because the threads you need are blocked in uninterruptible sleep waiting for I/O. CPU being idle doesn’t help when the kernel is waiting on the block layer to complete requests.

3) What’s the fastest way to find the culprit PID?

pidstat -d 1 is the quickest reliable tool for per-PID I/O rates. iotop is useful, but can miss writeback behavior and sometimes lies by omission.

4) How do I find which VM is causing it on a KVM/libvirt host?

Use pidstat to identify the QEMU PID, then inspect the QEMU command line for -name guest=..., and confirm with virsh domblkstat to see which disk is writing.

5) iostat shows high await but low %util. How is that possible?

Because the device may be stalling or completing I/O in bursts, or the bottleneck is upstream (network storage, controller resets, error recovery). Low util with high latency is a red flag for intermittent stalls, not “too much load.”

6) Should I change the I/O scheduler to fix iowait freezes?

Not as a first move. Scheduler tweaks can help tail latency under mixed workloads, but they won’t fix failing hardware, RAID rebuild storms, or one VM saturating the device. Diagnose first, then test changes with rollback.

7) What’s the safest immediate mitigation when one VM is noisy?

Throttle it via cgroup I/O limits or libvirt I/O tuning so the host stays responsive. Killing the VM works too, but it’s a higher-risk move if you don’t know what it was doing (databases, in-flight writes).

8) Can RAM cache hide I/O problems?

Yes. Buffered writes can make things look fine until writeback kicks in and the disk can’t keep up. Then you get a flush storm and the whole host pays the bill.

9) Why do flushes/fsync cause such dramatic slowdowns?

They force ordering and durability semantics. If the stack below (RAID, dm-crypt, SSD firmware) makes flush expensive, the workload becomes latency-bound quickly, even at moderate throughput.

10) How do I know if I should replace the drive?

If kernel logs show timeouts/resets, or NVMe SMART logs show growing media errors and error log entries, treat it as a reliability issue. Performance incidents often become data-loss incidents when ignored.

Conclusion: practical next steps

When Debian 13 hits 100% iowait and the host “freezes,” don’t treat it like an unsolved mystery. Treat it like an identification problem with a stopwatch. Name the device. Name the source. Mitigate. Capture evidence. Then fix the structure that allowed one workload to monopolize shared I/O.

Do these next:

  1. Add alerting on per-device await and PSI I/O “full,” not just throughput.
  2. Implement VM and batch-job I/O guardrails (cgroup v2) so one guest can’t brick a host.
  3. Separate bulk-write workloads from latency-sensitive services at the storage layer.
  4. Audit kernel logs and NVMe health counters regularly; replace suspicious drives before they teach you humility.

The best iowait incident is the one where nothing exciting happens because you throttled the noisy neighbor months ago. Aim for boring.
