Your Proxmox node is “up,” but it isn’t alive. The web UI crawls. SSH takes ages to echo a character. VMs pause like they’re thinking about their life choices. Top says CPU is “idle,” yet the box is unusable because wa is pinned at 100%.
This is the classic storage hostage situation. The job is not to stare at load average until it confesses. The job is to identify which VM/container is hammering storage (or which disk is dying), prove it with evidence, and apply a fix that stops host-level freezes—without guessing and without “reboot as a monitoring strategy.”
What 100% I/O wait actually means (and what it doesn’t)
I/O wait is the CPU telling you: “I have nothing to run, because the tasks that want to run are blocked waiting for I/O completions.” On a Proxmox host that usually means the kernel is waiting on block I/O (disks, SSDs, NVMe, SAN, Ceph, ZFS vdevs), and all the higher-level stuff (QEMU processes, LXC workloads, journald, pvestatd, even the web UI) gets stuck behind the same bottleneck.
High iowait is not inherently “bad” during a big sequential write to a healthy array. What’s bad is the combination of high iowait and user-visible stalls. That typically comes from one of these:
- Latency explosion: IOPS demand exceeds what the storage can serve, queue depth rises, and everything starts waiting.
- Device pathology: a disk starts timing out, resetting, or remapping sectors; the block layer waits, and so does your hypervisor.
- Write amplification: small random writes plus copy-on-write layers (ZFS, QCOW2, thin provisioning) plus sync semantics equals pain.
- Competing workloads: backup jobs, scrubs, resilvers, and “someone running fio on production” (yes, it happens).
Two important clarifications:
- 100% iowait does not mean the CPU is “busy.” It means the CPU sits idle while the threads that want to run are blocked on I/O. You can have a quiet CPU and a completely frozen node.
- Load average can skyrocket while CPU usage stays low. Tasks stuck in uninterruptible sleep (often shown as D state) still count toward load.
Dry reality: if storage latency goes from 1–5 ms to 500–5000 ms, your host becomes a single-threaded drama regardless of how many cores you paid for.
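If you want to see those uninterruptible tasks for yourself, a one-liner is enough; this is a sketch using standard procps columns (run it on the host as root so every task is visible):
ps -eo pid,state,wchan:32,cmd --no-headers | awk '$2 == "D"'
Anything that keeps appearing here with a wchan pointing into the block layer or a filesystem is waiting on storage, not CPU.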
Fast diagnosis playbook (first/second/third)
This is the “stop the bleeding” flow. You can do it in 5–15 minutes, even while the host is sluggish.
First: confirm it’s storage latency, not CPU or RAM
- Check iowait and the run queue (top or vmstat).
- Check per-disk latency and utilization (iostat -x). If one device is at ~100% util with a big await, you found the choke point.
- Look for blocked tasks (dmesg “blocked for more than” or ps for D state). That’s the signature of I/O waits that freeze the system.
Second: identify the noisy guest (VM/CT) causing the load
- Find which QEMU PID is doing I/O (pidstat -d, iotop).
- Map that PID to a VMID (the QEMU command line includes -id, or use qm list plus the process args).
- Check Proxmox per-guest I/O stats (pvesh get /nodes/..., stats from qm monitor, and storage metrics).
Third: choose the least-risky containment action
- Throttle I/O (preferred): set VM disk limits (qm set) or cgroup I/O limits for LXC (a cgroup sketch follows this list).
- Pause/suspend the guest (fast, reversible): qm suspend, or qm stop --skiplock if needed.
- Kill the specific backup/scrub/resilver job: stop the job that’s burning the array, not the whole node.
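For LXC there is no single “throttle this container” command, so containment goes through cgroup v2 io.max. A minimal sketch, assuming a cgroup v2 host, a hypothetical CTID 108, and that the container’s cgroup lives under /sys/fs/cgroup/lxc/<CTID> (verify the exact path for your PVE version via /proc/<PID>/cgroup); 259:0 is the device’s major:minor taken from lsblk:
lsblk -o NAME,MAJ:MIN,TYPE /dev/nvme0n1
echo "259:0 wbps=41943040 wiops=800" > /sys/fs/cgroup/lxc/108/io.max
Caveat: how well block-level limits bite depends on the backend; ZFS in particular issues much of its I/O from kernel threads outside the container’s cgroup, so treat this as pressure relief, not a guarantee.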
Rule of thumb: if the node is freezing, your first goal is to regain control, not to be correct in a philosophical sense.
Interesting facts and context (the stuff that explains the weirdness)
- Linux iowait is a CPU accounting bucket, not a storage metric. It counts otherwise-idle CPU time during which at least one task is blocked on I/O, so it’s not a direct “disk utilization” number.
- The “D” state (uninterruptible sleep) has been the horror movie villain since early Unix. Processes waiting on I/O can become effectively unkillable until the I/O completes or the device resets.
- ZFS was designed for data integrity first. Its copy-on-write model makes “small sync writes on fragmented pools” a recurring theme in performance postmortems.
- QCOW2 snapshots are convenient but can become latency multipliers. Every write can traverse metadata trees, and long snapshot chains age like milk.
- Ceph’s “slow ops” are often an early warning system. When OSDs or networks are unhappy, latency rises well before complete failure; ignoring slow ops is basically opting into downtime.
- Write barriers and flushes exist because data loss is worse than slowness. If your storage lies about durability (misconfigured cache, missing power-loss protection), you’ll have a fast system right up until the day you don’t.
- “Rebuild” events are performance events. RAID resilvers, ZFS scrubs, and Ceph backfill change IO patterns and can starve foreground workloads if not tuned.
- Hypervisors amplify storage issues. One dying SSD can stall dozens of guests because their IO shares the same kernel and block queue.
- The block layer has evolved a lot. Multi-queue (blk-mq) and modern schedulers improve throughput, but they can’t fix a device returning 2-second latency.
Practical tasks: commands, outputs, decisions (12+)
These are real commands you can run on a Proxmox host. Each task includes what to look for and what decision you make next.
Task 1: Confirm iowait and blocked tasks quickly
cr0x@server:~$ top -b -n1 | head -n 15
top - 11:24:31 up 47 days, 3:12, 2 users, load average: 54.12, 49.88, 41.03
Tasks: 412 total, 3 running, 409 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 0.6 sy, 0.0 ni, 0.1 id, 98.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 128642.5 total, 12211.7 free, 44210.2 used, 72220.6 buff/cache
MiB Swap: 8192.0 total, 8170.5 free, 21.5 used. 84044.3 avail Mem
What it means: CPU is mostly waiting. Load is massive. That’s consistent with storage latency or stuck I/O.
Decision: Move to per-device latency (iostat -x) and per-process I/O (pidstat/iotop). Don’t chase CPU tuning.
Task 2: See run queue and iowait trend
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 38 0 1249876 89244 72133840 0 0 118 9432 712 1843 1 1 0 98 0
1 41 0 1249804 89244 72133912 0 0 104 10288 680 1721 1 1 0 98 0
2 39 0 1249812 89244 72133980 0 0 112 9816 705 1802 1 1 0 98 0
1 40 0 1249820 89244 72134012 0 0 96 11024 690 1760 1 1 0 98 0
What it means: The b column (blocked processes) is huge. That’s your host freeze in numeric form.
Decision: Check disk utilization and await next. If a single device is the bottleneck, isolate it. If it’s distributed, suspect backend saturation (Ceph/network) or a workload pattern (sync writes, backup).
Task 3: Find the hot device with extended iostat
cr0x@server:~$ iostat -x 1 3
Linux 6.8.12-4-pve (server) 12/26/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 12.0 2100.0 512.0 89200.0 0.0 220.0 0.0 9.5 3.2 740.5 1555.2 42.7 42.5 0.45 99.9
nvme1n1 10.0 14.0 480.0 620.0 0.0 1.0 0.0 6.7 2.8 4.1 0.1 48.0 44.3 0.18 0.4
What it means: nvme0n1 is pinned at 99.9% util with ~740 ms write await and a massive queue. That’s not “busy”; that’s “suffering.”
Decision: Identify which workload is generating those writes. Also check kernel logs and SMART for device issues. Latency this high can be overload, but it can also be a dying drive.
Task 4: Confirm if the kernel is reporting I/O errors or resets
cr0x@server:~$ dmesg -T | tail -n 30
[Thu Dec 26 11:23:12 2025] nvme nvme0: I/O 332 QID 4 timeout, aborting
[Thu Dec 26 11:23:12 2025] nvme nvme0: Abort status: 0x371
[Thu Dec 26 11:23:13 2025] nvme nvme0: resetting controller
[Thu Dec 26 11:23:16 2025] EXT4-fs warning (device dm-3): ext4_end_bio:351: I/O error 10 writing to inode 262148 starting block 11534336)
[Thu Dec 26 11:23:18 2025] task jbd2/dm-3-8:219 blocked for more than 120 seconds.
[Thu Dec 26 11:23:18 2025] Tainted: P O 6.8.12-4-pve #1
What it means: Timeouts and controller resets are not workload tuning problems. They are “hardware or firmware is melting” problems. The blocked task warning explains the freeze.
Decision: If you see timeouts/resets, treat the device as suspect immediately: reduce IO, migrate guests off, and plan replacement. Don’t spend an hour tuning elevator schedulers on a disk that’s actively timing out.
Task 5: Check SMART/NVMe health (quick triage)
cr0x@server:~$ smartctl -a /dev/nvme0n1 | egrep -i 'critical warning|media and data|error information|percentage used|power on|temperature'
Critical Warning: 0x00
Temperature: 73 Celsius
Percentage Used: 89%
Power On Hours: 31211
Media and Data Integrity Errors: 18
Error Information Log Entries: 294
What it means: High wear (Percentage Used), media errors, and a growing error log count point to a device nearing retirement.
Decision: Prioritize evacuating critical VMs. Plan a replacement window. If this is a ZFS mirror/RAID, start a controlled replacement rather than waiting for a dramatic failure during peak hours.
Task 6: Identify which processes are doing disk I/O right now
cr0x@server:~$ pidstat -d 1 5
Linux 6.8.12-4-pve (server) 12/26/2025 _x86_64_ (32 CPU)
11:25:44 AM UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
11:25:45 AM 0 2911 0.00 68420.00 11200.00 98765 qemu-system-x86
11:25:45 AM 0 4187 0.00 10240.00 1800.00 12001 vzdump
11:25:45 AM 0 1332 0.00 220.00 12.00 210 pvestatd
What it means: One QEMU process is the main writer; vzdump is also active (backup). iodelay is enormous, meaning the process is spending a lot of time waiting on IO completion.
Decision: Map PID 2911 to a VMID. If backups are running, decide whether to pause/stop them to restore host responsiveness.
Task 7: Map a QEMU PID to a VMID
cr0x@server:~$ ps -p 2911 -o pid,cmd --no-headers
2911 /usr/bin/kvm -id 113 -name vm113-db01,debug-threads=on -m 16384 -smp 8,sockets=1,cores=8 -drive file=/dev/zvol/rpool/data/vm-113-disk-0,if=none,id=drive-scsi0,cache=none,aio=io_uring,discard=on,format=raw
What it means: It’s VM 113. The disk is a ZVOL. Cache mode is none, async I/O is io_uring. None of that matters if the device is timing out, but it matters if you’re just saturating.
Decision: Pull VM 113 stats and decide: throttle, pause, or migrate. Meanwhile, confirm if VM 113 is also involved in the backup job.
Task 8: Check Proxmox-level VM disk throughput and latency indicators
cr0x@server:~$ qm status 113 --verbose
status: running
name: db01
cpus: 8
memory: 16384
uptime: 192773
net0: virtio=DE:AD:BE:EF:11:30,bridge=vmbr0
diskread: 0.00
diskwrite: 68.42
diskread_bytes: 0
diskwrite_bytes: 73408602112
What it means: Proxmox is reporting high disk write throughput. It doesn’t show latency, but it confirms the “who.”
Decision: Contain VM 113. If it’s a database or log pipeline, it might be doing fsync-heavy writes that punish a degraded pool.
Task 9: Check if a backup job is the secondary culprit
cr0x@server:~$ pgrep -af 'vzdump|zstd'
4187 /usr/bin/perl /usr/bin/vzdump --all 1 --mode snapshot --compress zstd --storage backup-nfs --mailnotification failure
4231 /usr/bin/zstd --threads=4
What it means: A node-wide vzdump is running. Snapshot mode can still generate heavy reads and metadata writes, especially with QCOW2 or busy ZFS datasets.
Decision: If the node is freezing, stop the backup first (it’s discretionary load) before killing production guests.
Task 10: Stop or pause the backup cleanly (containment)
cr0x@server:~$ kill -TERM 4187
cr0x@server:~$ tail -n 5 /var/log/vzdump/vzdump.log
INFO: vzdump job finished
ERROR: job aborted by signal
INFO: cleanup temporary directory '/var/tmp/vzdumptmp4187'
What it means: You aborted the backup. That should reduce IO pressure quickly, but if the array/device is unhealthy you may still be stuck.
Decision: Re-check iostat -x. If await drops and the host recovers, you’ve bought time to do root-cause analysis without users yelling in chat.
Task 11: Throttle the noisy VM (preferred containment)
cr0x@server:~$ qm set 113 --scsi0 rpool:vm-113-disk-0,iops=800,mbps=40
update VM 113: -scsi0 rpool:vm-113-disk-0,iops=800,mbps=40
What it means: You’re capping the VM’s disk rate, giving the host and other guests a chance to breathe. It’s not a performance “fix,” it’s blast-radius control.
Decision: If throttling stabilizes latency, keep it while you investigate the guest workload (e.g., runaway logging, DB checkpoint storm) and storage health.
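If only writes are the problem (log storms, checkpoint bursts), the drive options also take per-direction caps; a variation on the same command, assuming the same volume spec:
qm set 113 --scsi0 rpool:vm-113-disk-0,mbps_wr=40,iops_wr=800
Write the throttle down somewhere visible and remove it once the incident is over; a forgotten cap is tomorrow’s mystery ticket.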
Task 12: For LXC containers, find the top I/O writer and map to CTID
cr0x@server:~$ iotop -boP -n 5
Total DISK READ: 0.00 B/s | Total DISK WRITE: 92.15 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2911 be/4 root 0.00 B/s 68.42 M/s 0.00 % 99.99 % kvm -id 113 -name vm113-db01
9211 be/4 root 0.00 B/s 18.11 M/s 0.00 % 95.00 % [kworker/u64:3]
10222 be/4 root 0.00 B/s 4.10 M/s 0.00 % 60.00 % lxcfs /var/lib/lxcfs
What it means: QEMU is the monster. But if you saw a container process here, you’d next map it using cgroups or by correlating mounts and PIDs.
Decision: If it’s a container, use pct status and pct config plus cgroup I/O stats to locate and throttle.
Task 13: Check ZFS pool health and ongoing maintenance that can steal I/O
cr0x@server:~$ zpool status -v
pool: rpool
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error.
action: Replace the device using 'zpool replace'.
scan: scrub in progress since Thu Dec 26 09:12:02 2025
2.31T scanned at 652M/s, 1.98T issued at 559M/s, 6.12T total
0B repaired, 32.35% done, 2:11:44 to go
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
nvme0n1 FAULTED 2 18 0 too many errors
nvme1n1 ONLINE 0 0 0
errors: No known data errors
What it means: The pool is degraded, a scrub is running, and one NVMe is faulted. That’s a perfect recipe for latency spikes and host stalls.
Decision: Stop or postpone the scrub if it’s harming availability, then replace the failed device. Availability first; integrity work resumes once the system can breathe.
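The scrub controls themselves are one-liners (standard OpenZFS flags):
zpool scrub -p rpool    # pause; resume later with 'zpool scrub rpool'
zpool scrub -s rpool    # stop the scrub entirely
Pausing is usually enough: you shed the background I/O now and let the integrity work finish once the pool is healthy again.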
Task 14: Confirm pool-level write pressure with zpool iostat
cr0x@server:~$ zpool iostat rpool 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       5.01T  1.11T    118  2.18K  4.7M  87.1M
rpool       5.01T  1.11T    120  2.20K  4.8M  89.2M
rpool       5.01T  1.11T    116  2.15K  4.6M  86.4M
What it means: The first line is the average since import; the live samples confirm the pool is absorbing roughly the same ~2.2K writes/s and ~90 MB/s you saw at the device level. Combined with the QEMU command line from Task 7, which pointed at the zvol rpool/data/vm-113-disk-0, that confirms the noisy guest with storage-level telemetry.
Decision: Focus on the VM 113 workload and consider moving its disk to a less contended pool, changing its write pattern (DB tuning), or keeping a sustained throttle in place.
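If you want per-dataset numbers straight from ZFS rather than inferring them from the QEMU process, recent OpenZFS releases expose per-objset counters under /proc/spl/kstat/zfs/<pool>/objset-*; a rough sketch, assuming that interface and those field names (dataset_name, nwritten) exist on your build:
grep -HE 'dataset_name|nwritten' /proc/spl/kstat/zfs/rpool/objset-*
The counters are cumulative, so take two samples a few seconds apart and diff them to see which dataset or zvol is currently hot.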
Task 15: If Ceph is involved, check health and slow ops
cr0x@server:~$ ceph -s
cluster:
id: 3f0c2c2d-7f8b-4d17-9e21-2a7d3c3b2a91
health: HEALTH_WARN
2 osds down
154 slow ops, oldest one blocked for 37 sec, osd.12 has slow ops
services:
mon: 3 daemons, quorum mon1,mon2,mon3
mgr: mgr1(active)
osd: 12 osds: 10 up, 10 in
data:
pools: 6 pools, 256 pgs
objects: 4.12M objects, 15 TiB
usage: 46 TiB used, 18 TiB / 64 TiB avail
pgs: 12 pgs degraded, 9 pgs undersized
What it means: Ceph is warning about slow ops and down OSDs. That will manifest as iowait on hypervisor nodes using RBD, because reads/writes block on the cluster.
Decision: Ceph problem beats VM problem. Stabilize the cluster (OSDs, network, backfill limits) before blaming a guest.
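Typical containment knobs while you stabilize (treat these as a sketch, not a tuning guide; exact behavior depends on your Ceph release, and mClock-based schedulers may override some values):
ceph osd set nobackfill        # keep backfill from competing with client I/O
ceph osd set norecover         # same for recovery
ceph config set osd osd_max_backfills 1    # or throttle instead of stopping
Unset the flags (ceph osd unset ...) as soon as the cluster can afford to catch up; running degraded longer than necessary is its own risk.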
Finding the noisy VM or container (for real)
“Noisy neighbor” is a polite phrase for “one guest is eating the storage subsystem and everyone else is paying the bill.” On Proxmox, you have three practical angles to find it:
- Process angle: which PID is doing IO? (qemu-system-x86, backup tools, zfs, ceph-osd).
- Storage object angle: which volume is hottest? (ZVOL name, RBD image, LVM LV, qcow2 file).
- Proxmox guest angle: which VMID/CTID reports high disk throughput?
The process angle: QEMU is usually the smoking gun
On a typical node, each VM maps to one qemu-system-x86 process. If that process is writing 60–200 MB/s on a device that can’t sustain it, iowait climbs. If the device starts timing out, the host freezes. Your goal is to attach the IO to a VMID fast.
When the host is sluggish, avoid expensive “wide” commands. Prefer narrow queries:
pidstat -d 1 5 is cheap and gives you a ranked list. ps -p PID -o cmd gives you the VMID quickly. iostat -x tells you whether this is saturation or pathology.
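If you want one cheap pass that ranks QEMU processes by cumulative writes and tags each with its VMID, a sketch like this works (it reads /proc/<pid>/io, so run it as root; the counters are totals since each process started, so sample twice for a rate):
for pid in $(pgrep -f '^/usr/bin/kvm'); do
  echo "$(awk '/^write_bytes/ {print $2}' /proc/$pid/io) bytes written, pid=$pid, $(ps -p $pid -o cmd= | grep -o -- '-id [0-9]*')"
done | sort -rn | head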
One joke, as a treat: If your host is at 100% iowait, your CPU isn’t “idle.” It’s practicing mindfulness while the disks panic.
The storage object angle: map hot volumes to guests
Depending on your storage backend, “the hot thing” looks different:
- ZFS zvol: rpool/data/vm-113-disk-0 (beautifully obvious).
- QCOW2 on directory storage: a file like /var/lib/vz/images/113/vm-113-disk-0.qcow2.
- LVM-thin: an LV like /dev/pve/vm-113-disk-0.
- Ceph RBD: an image like vm-113-disk-0 in a pool.
The trick is to work backwards: identify the hot device, then the hot logical volume, then the VMID, then the workload inside the guest.
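Two commands cover most of that walk: lsblk shows the LVM/device-mapper stack sitting on a physical disk (Proxmox volume names embed the VMID, e.g. pve-vm--113--disk--0), and for ZFS the /dev/zvol symlinks map anonymous zd* devices back to dataset names:
lsblk -o NAME,MAJ:MIN,TYPE,SIZE /dev/nvme0n1
ls -l /dev/zvol/rpool/data/
From there the VMID is in the name, and the workload question moves inside the guest.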
The Proxmox guest angle: quick per-guest stats
Proxmox can show disk throughput per guest, which is helpful but not sufficient. Throughput isn’t the villain; latency is. Still, per-guest throughput is a great “who should I interrogate” list.
cr0x@server:~$ qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
105 app01 running 8192 120 2144
113 db01 running 16384 500 2911
127 ci-runner01 running 4096 80 3382
What it means: You have the VMIDs and PIDs; you can now marry this with pidstat/iotop output.
Decision: Pick the guest with the hottest QEMU PID, then verify at the storage layer which volume it’s hitting. Don’t “randomly stop the biggest VM” unless you enjoy unnecessary outages.
Storage backend failure modes: ZFS, Ceph, LVM-thin, NFS/iSCSI
ZFS on Proxmox: the usual suspects
ZFS is excellent at telling the truth about data. It is less forgiving about workloads that are sync-heavy and random-write-heavy on consumer SSDs. On Proxmox, ZFS performance incidents tend to fall into these buckets:
- Degraded vdev or failing device: latency spikes and timeouts, especially under write load.
- Scrub/resilver contention: background work competes with guests for IO and can starve latency-sensitive workloads.
- Sync write pressure: databases and some filesystems issue frequent flushes; without a proper SLOG (and with correct expectations), writes become serialized.
- Fragmentation and small blocks: long-lived pools with lots of churn can degrade random IO, especially on HDDs.
Operationally, ZFS is friendly because it gives you zpool status and zpool iostat, which are actually useful during an incident. Use them.
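When sync pressure is the suspect, check what the hot zvol is actually configured to do before reaching for tuning; the properties are standard OpenZFS, and the dataset name is the one from this article’s example:
zfs get sync,logbias,volblocksize,compression rpool/data/vm-113-disk-0
sync=standard plus an fsync-heavy guest and no SLOG is a legitimate design, just a slow one; changing sync is a durability decision, not a performance tweak.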
Ceph: when the network is part of your disk
With Ceph, high iowait on the hypervisor may have nothing to do with local disks. It can be:
- OSDs down or flapping causing degraded PGs and slow ops.
- Backfill/recovery storms saturating network or disks.
- Uneven OSD utilization (hotspotting) due to CRUSH issues or a small number of large images.
- Network problems (drops, MTU mismatch, buffer exhaustion) turning “storage” into retransmit city.
If Ceph says “slow ops,” listen. iowait is merely the messenger. Don’t shoot it; fix the cluster.
LVM-thin: metadata pain is real
LVM-thin can be fast and simple. It can also ruin your afternoon when thin-pool metadata gets tight or the pool fills up. When thin provisioning is near capacity, you can see sudden stalls, allocation pauses, and guest IO latency spikes.
Practical rule: keep thin pools comfortably below 80–85% unless you have strong monitoring and a tested extension procedure.
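Checking that takes one command; “pve” is the default volume group on a stock Proxmox install, so adjust the name if yours differs:
lvs -o lv_name,lv_size,data_percent,metadata_percent pve
Watch metadata_percent as closely as data_percent: a thin pool can stall on metadata exhaustion while data usage still looks comfortable.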
NFS/iSCSI: “the disk is fine” (on somebody else’s host)
Network storage failures are great because they let you argue with a different team while your node freezes.
Watch for:
- Single mount point saturation: one NFS server head becomes the bottleneck.
- Network congestion: microbursts can inflate latency dramatically.
- HBA/driver timeouts on iSCSI/Fibre Channel showing up as kernel logs and blocked tasks.
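You can see client-side NFS latency without waiting for the other team; nfsiostat ships with the standard NFS client tooling and reads /proc/self/mountstats (a sketch; output columns vary slightly between versions):
nfsiostat 1 3
Look at the per-mount avg RTT and avg exe columns; if they explode while the local disks are calm, the argument belongs on the network/storage side.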
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a modest Proxmox cluster running a mix of web apps and a single “important” database VM. The storage was local ZFS mirrors on NVMe. Everything had been stable for months, so the assumption solidified into gospel: “NVMe means storage can’t be the bottleneck.”
One Monday morning the node started freezing. The team chased CPU, then RAM, then network. They restarted a few services. They even blamed the kernel version. Meanwhile, iowait sat at 90–100% and the VMs stuttered. Nobody looked at dmesg early because “hardware is new.”
When someone finally ran dmesg -T, it was a wall of NVMe timeouts and controller resets. The drive wasn’t “slow.” It was effectively disappearing and reappearing under load, creating long I/O stalls that wedged tasks in D state. The database VM wasn’t the root cause; it was just the workload that triggered the failures.
The fix was boring: evacuate critical VMs, replace the NVMe, and update firmware on the remaining drives. The postmortem was more interesting: their monitoring only looked at throughput and space, not latency and error rates. They had mistaken “fast interface” for “reliable device.”
Mini-story 2: The optimization that backfired
Another org wanted faster backups. They changed their vzdump schedule to run more frequently and cranked compression to “maximum” because storage was “expensive.” They also enabled more snapshots on QCOW2 disks because it made rollback easy for developers.
At first, it looked like a win: backup sizes dropped, and restores were “fine.” Then, a few weeks later, performance complaints appeared during office hours. Not constant, but enough to be annoying. The graph showed recurring latency spikes, and every spike lined up with backups.
Here’s what happened: the combination of snapshot-heavy QCOW2 plus aggressive backup compression created a perfect storm of random reads and metadata churn. The backup job wasn’t just reading; it was forcing the storage to navigate complex snapshot metadata while other VMs did their normal write workloads. Under load, the host spent more time waiting on I/O than doing work.
The fix was to back off the “optimization”: reduce snapshot chains, switch key VMs to raw where appropriate, cap backup concurrency, and schedule heavy backups outside business hours. Compression stayed, but not at the cost of production latency. They learned the hard way that “optimizing for backup size” can be indistinguishable from “optimizing for outages.”
Mini-story 3: The boring but correct practice that saved the day
A team running a Proxmox cluster for internal services had a simple rule: every storage backend change required a latency baseline and a rollback plan. Not a long one. A one-pager with “before numbers,” “after numbers,” and “how to undo it.”
One afternoon, iowait climbed on a node. The on-call didn’t debate theory. They ran the same baseline commands they always ran: iostat -x, zpool status, zfs iostat, and a quick check for backups. In five minutes they saw: scrub running, one vdev showing elevated latency, and a single VM zvol dominating writes.
They throttled the VM, paused the scrub, and migrated two latency-sensitive guests off the node. The host recovered immediately. Then they replaced the suspect SSD in a planned window.
No heroics. No guesswork. The boring part—the habit of keeping a baseline and a short incident checklist—prevented a “we should reboot it” spiral and kept user impact contained.
Common mistakes: symptoms → root cause → fix
- Symptom: iowait 100%, node “frozen,” load average huge.
Root cause: Blocked tasks due to disk timeouts/resets (hardware/firmware issue).
Fix: Check dmesg, run smartctl, evacuate guests, replace or firmware-update the device. Stop scrubs/resilvers during the incident.
- Symptom: The node becomes unresponsive only during backups.
Root cause: vzdump concurrency too high, snapshot-heavy QCOW2, slow backup storage, or an NFS bottleneck.
Fix: Reduce backup concurrency, stagger jobs, cap I/O, consider raw disks for hot VMs, and make sure the backup target has enough IOPS and network bandwidth.
- Symptom: One VM is slow, and others degrade too.
Root cause: A noisy neighbor saturating the shared storage queue (single vdev, single SSD, or thin-pool metadata).
Fix: Throttle the VM, move it to faster/isolated storage, fix the guest workload (log storms, DB checkpoints).
- Symptom: Ceph-backed VMs have random multi-second pauses.
Root cause: Ceph slow ops from OSDs down, backfill/recovery, or network issues.
Fix: Address Ceph health first: restore OSDs, tune recovery/backfill, verify network MTU and drops, reduce client I/O temporarily.
- Symptom: Latency spikes during ZFS scrub/resilver.
Root cause: Background work competing with guest I/O; a degraded mirror/RAID amplifies it.
Fix: Schedule scrubs off-hours, consider tuning scrub priority, and don’t run scrubs while the pool is degraded unless you must.
- Symptom: Plenty of free space, but I/O is awful; small writes are particularly bad.
Root cause: A sync-write workload (fsync) without appropriate design; consumer SSDs can choke under sustained sync writes; ZFS transaction group pressure.
Fix: Verify guest app settings (journaling, DB durability settings), use hardware suited to sync workloads, and avoid “just set sync=disabled” unless you’re intentionally accepting data loss.
Checklists / step-by-step plan
Incident containment checklist (10–30 minutes)
- Run top and vmstat to confirm high iowait and blocked tasks.
- Run iostat -x 1 3 to identify the hottest device and observe await/util.
- Check dmesg -T | tail for timeouts, resets, and I/O errors.
- Identify top I/O processes with pidstat -d or iotop.
- Map the heavy QEMU PID to a VMID via ps and/or qm list.
- Stop discretionary load first: backups (vzdump), scrubs/resilvers, batch jobs.
- Throttle the noisy VM/CT (I/O limits) rather than stopping it, if possible.
- If hardware errors exist: migrate critical guests off the node and mark the device for replacement.
- Re-check iostat -x: await should drop quickly if you removed the pressure.
- Record what you did and why. Future you is a stakeholder.
Root-cause checklist (same day)
- Was it device failure, backend saturation, or a workload spike?
- Which guest or host process was the biggest contributor?
- What changed recently (backups, snapshots, kernel/firmware, new workload)?
- Do you have latency metrics (not just throughput) per device and per backend?
- Is your storage layout appropriate (single vdev for many VMs, thin pool near full, Ceph degraded)?
- Do you need per-VM throttles by default on shared storage?
Prevention checklist (this week)
- Set sane I/O limits for “untrusted” workloads (CI runners, log shippers, anything developer-controlled).
- Schedule backups and scrubs away from peak hours; cap concurrency.
- Monitor latency and errors: per-disk await, SMART media errors, Ceph slow ops, ZFS degraded states.
- Keep snapshot sprawl under control; prune QCOW2 snapshots and avoid long chains.
- Plan storage headroom (IOPS and capacity). Running hot is not a cost optimization; it’s a reliability choice.
FAQ
- 1) Why does Proxmox “freeze” when storage is slow?
- Because the host and guests share the same kernel and I/O paths. When block I/O stalls, critical services (including QEMU and system daemons) block in D state, and the node becomes unresponsive.
- 2) Is 100% iowait always a disk problem?
- No. It’s an “I/O completion is late” problem. That can be local disk, RAID controller, Ceph cluster, NFS server, iSCSI path, or even severe memory pressure causing swap I/O. But it’s almost always “storage path latency.”
- 3) What’s the fastest way to find the noisy VM?
- Use pidstat -d or iotop to find the top I/O QEMU PID, then map it to a VMID via the ps command line (-id) or qm list.
- 4) Should I just reboot the Proxmox host?
- Only if you’ve accepted guest downtime and you can’t regain control. Reboots hide evidence and don’t fix failing devices or saturated backends. If you see timeouts in dmesg, a reboot may buy minutes, not a solution.
- 5) Can I fix this by changing the I/O scheduler?
- Sometimes it helps at the margins. It will not fix timeouts, failing hardware, or a backend that can’t meet required IOPS/latency. Diagnose first. Tune later.
- 6) Is it safe to set ZFS sync=disabled to reduce iowait?
- It’s “safe” only if you’re comfortable losing recent writes on power loss or crash. For many databases and filesystems, that’s not a theoretical risk. If you do it, document it as an explicit durability tradeoff.
- 7) Why do backups cause so much pain?
- Backups are deceptively heavy: they read a lot, they touch metadata, and they can interact badly with snapshots and copy-on-write layers. If the backup target is slow, the source side can stall too.
- 8) How do I prevent one VM from taking down the node?
- Use I/O throttling per VM disk, separate high-IO workloads onto dedicated storage, and enforce backup/scrub schedules. Also monitor latency and errors so you catch degradation before it becomes a freeze.
- 9) What if multiple devices show high await?
- Then the bottleneck may be above the device layer: a shared controller, a thin pool, Ceph/network, or system-wide writeback congestion. Confirm with backend-specific tools (ZFS/Ceph) and by checking for a single dominating process.
- 10) How do I know if it’s saturation vs. a dying disk?
- Saturation usually shows high util and high await, but no timeouts/resets in dmesg. A dying disk often shows timeouts, resets, I/O errors, and SMART media errors that trend upward.
Conclusion: next steps that actually reduce incidents
You don’t “fix iowait.” You fix the storage path and the workload behavior that created unacceptable latency. Start with the fast playbook: prove it’s I/O latency, identify the noisy guest, contain the blast radius, and then decide whether you’re dealing with saturation or failure.
Practical next steps:
- Add per-VM I/O limits for the usual suspects (CI, log ingestion, anything batchy).
- Make latency visible: iostat await/util, ZFS pool state, Ceph slow ops, SMART error counts.
- Stop treating backups like “free” work. Cap concurrency and schedule them like you schedule maintenance—because they are maintenance.
- Replace dying devices early. Waiting for total failure is not bravery; it’s paperwork.
A second and final joke: The only thing worse than a noisy neighbor VM is a quiet one, because it might already be stuck in D state and silently holding your host hostage.
Quote (paraphrased idea): Gene Kranz: “Be tough and competent.” In ops terms: measure first, then act decisively.