Nothing ruins your morning like a Linux VM that “works fine” until it doesn’t: shell prompts hang, MySQL commits pause, journald stalls, and your app’s latency chart grows teeth. You look at CPU and RAM—fine. Network—fine. Then you open iostat and see it: random write latency spiking into the seconds. The guest feels like it’s on a spinning disk powered by vibes.
This is usually fixable. Often it’s not “storage is slow,” it’s that you chose the wrong virtual disk controller, the wrong cache mode, or you accidentally put QEMU in a mode that turns a burst of small writes into a traffic jam. The good news: Proxmox gives you the levers. The bad news: the defaults don’t always match your workload.
Fast diagnosis playbook
If your VM “stutters,” you need to decide where latency is coming from: guest, QEMU, host kernel, storage backend, or the physical device/cluster. Don’t guess. Triage.
First: prove it’s storage latency (not CPU steal or memory pressure)
- In the guest: check whether stalls correlate with disk wait.
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  22012 903112    0    0   120   980  310  510  4  2 92  2  0
 0  2      0 810112  22020 904000    0    0     0  4096  280  420  2  1 58 39  0
 0  1      0 809980  22020 904120    0    0     0  2048  270  410  2  1 70 27  0
What it means: high wa indicates time spent waiting on IO. If st (steal) is high, you're CPU oversubscribed; fix that before obsessing over disk settings.
Decision: if wa spikes during stutter and st is low, keep digging into storage.
Second: find the slow layer (guest queue vs host device)
- On the Proxmox host: watch per-device latency and saturation.
cr0x@server:~$ iostat -x 1 5
Linux 6.8.12-pve (pve01)  12/26/2025  _x86_64_  (32 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.12    0.00    3.01    8.33    0.00   78.54
Device    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1  22.0  480.0    1.2   46.8    192.0      3.8    7.9     1.2     8.2    0.4   20.5
sda       0.0   95.0    0.0    6.0    128.0     18.2  191.3     0.0   191.3    1.1   99.8
What it means: await is request latency; %util near 100% with high await screams "saturated device." Here sda is the villain.
Decision: if latency is bad at the host device, controller/cache tweaks in the VM won't save you. Fix the backend (migration, faster disk, better pool layout) before tuning.
- On the host: inspect the QEMU command line for the VM's effective cache and AIO modes.
cr0x@server:~$ pgrep -a kvm | head -n 1
24891 /usr/bin/kvm -id 103 -name vm103 -m 8192 ... -drive file=/dev/zvol/rpool/vm-103-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=native ...
What it means: the QEMU command line tells you cache mode, aio mode, and whether you're using virtio-scsi/virtio-blk.
Decision: if you see cache=writeback on a host without a UPS and without stable write cache, you're flirting with data loss; if you see aio=threads with high CPU and latency, consider aio=native if supported.
Third: confirm the guest sees a sane virtual device and queueing model
- In the guest: identify the disk type and whether it's virtio.
cr0x@server:~$ lsblk -o NAME,MODEL,TYPE,SIZE,ROTA,DISC-MAX,DISC-GRAN,MOUNTPOINTS
NAME     MODEL          TYPE  SIZE ROTA DISC-MAX DISC-GRAN MOUNTPOINTS
vda      QEMU HARDDISK  disk  200G    0       2G        4K
├─vda1                  part    1G    0
└─vda2                  part  199G    0                     /
What it means: vda usually means virtio-blk. sda can be virtio-scsi (fine) or an emulated SATA/SCSI controller (higher overhead); confirm with lspci/lsmod or the Proxmox config rather than the device name alone.
Decision: if you're on emulated controllers, plan a maintenance window to migrate to virtio.
This playbook is intentionally short. The rest of the article explains the “why” and gives you the knobs that reliably reduce latency spikes without creating new ones.
What “disk stutter” really means in a VM
Stutter isn’t about average throughput. Your monitoring will show 50 MB/s and everyone will congratulate the storage. Meanwhile, your VM freezes for 800 ms every few seconds because small synchronous writes are backing up behind something expensive.
In virtualized storage, there are several queues and flush points:
- Guest page cache: buffered writes accumulate until the kernel decides to flush.
- Guest block layer: merges and schedules IO (less dramatic on newer kernels, still real).
- Virtual controller queue: virtio-blk and virtio-scsi have different queue models and interrupt behavior.
- QEMU block layer: cache mode controls whether QEMU uses the host page cache, and where flushes land.
- Host filesystem / volume manager: ZFS, LVM-thin, ext4-on-SSD, Ceph RBD each has different latency characteristics.
- Device/cluster: the actual disk, RAID controller, NVMe, SAN, or Ceph OSDs.
Stutter is usually one of these patterns:
- Flush storms: a batch of writes hits a flush barrier (fsync(), journal commit, database commit), and latency spikes.
- Single-queue choke: a disk model/controller uses a single queue, so parallel workloads serialize.
- Host memory pressure: host page cache thrashes, and IO becomes “surprise synchronous.”
- Backend write amplification: thin provisioning, copy-on-write, or replication makes small writes expensive.
One line you should keep in your head: the guest thinks it’s talking to a disk; you’re actually running a distributed system made of queues.
Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails eventually; design so failures are expected and manageable.”
Joke #1: Storage performance is like gossip—everything is fast until you ask for a sync.
Facts and history that explain today’s weird defaults
Some “mystery” Proxmox/QEMU options make a lot more sense once you know where they came from.
- Virtio was born to avoid emulating hardware. Early virtualization often emulated IDE/SATA; virtio came later as a paravirtualized interface to cut overhead.
- virtio-blk predates virtio-scsi. virtio-blk was simpler and widely supported; virtio-scsi appeared to offer features closer to real SCSI (multiple LUNs, hotplug, better scaling patterns).
- Write barriers and flushes got stricter over time. Filesystems and databases became less willing to “trust” caches after painful corruption stories in the 2000s; flush semantics matter more today.
- Host page cache used to be a cheap win. On spinning disks and small RAM systems, using host cache (writeback) could smooth IO; on modern SSD/NVMe with multiple VMs, it can cause noisy-neighbor cache contention.
- AIO in Linux has two personalities. “native” AIO (kernel AIO) behaves differently than thread-based emulation; which is best depends on backend and alignment.
- NCQ and multi-queue weren’t always common. A lot of tuning folklore comes from the era of single-queue SATA and limited parallelism.
- ZFS became popular in virtualization for snapshots/clones. The trade: copy-on-write and checksumming add overhead; it’s worth it, but you must respect sync write behavior.
- Ceph popularized “storage as a cluster feature.” Great for resilience and scale; latency is a product of quorum, network, and OSD load, not just a disk.
These aren’t trivia. They explain why one knob fixes stutter in your environment and makes it worse in your coworker’s.
Controller choice: virtio-scsi vs virtio-blk (and when SATA is still useful)
If you’re running Proxmox and your Linux VM’s disk device shows up as sda behind an “Intel AHCI” controller, you’re paying for hardware cosplay. Emulated controllers exist for compatibility. Performance isn’t their job.
What to use by default
- Use virtio-scsi-single for most modern Linux VMs. It’s a solid default: good feature support, good scaling, and predictable behavior with iothreads.
- Use virtio-blk for simple setups or when you want fewer moving parts. virtio-blk can be very fast. It’s also simpler, which sometimes matters for debugging and driver maturity in odd guests.
- Use SATA only for bootstrapping or weird OS install media scenarios. Once installed, switch to virtio.
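Before scheduling changes, it helps to know which VMs are still on emulated controllers. A minimal audit sketch, assuming the standard Proxmox config directory /etc/pve/qemu-server/ (adjust if your layout differs):
# Show the controller model and every disk attachment line for each VM on this node.
grep -H -E '^(scsihw|scsi[0-9]+|virtio[0-9]+|sata[0-9]+|ide[0-9]+):' /etc/pve/qemu-server/*.conf | sort
# Anything with scsihw: lsi, or disks attached as sataN/ideN, is a candidate for migration to virtio.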
virtio-scsi: what it buys you
virtio-scsi models a SCSI HBA with virtio transport. In Proxmox you’ll see options like:
- virtio-scsi-pci: a single controller that can attach multiple disks; those disks may share queueing and interrupt handling, depending on configuration.
- virtio-scsi-single: creates a separate controller per disk (or effectively isolates queueing), often reducing lock contention and improving fairness across disks.
In practice: virtio-scsi-single is frequently the easiest way to get predictable latency across multiple busy disks. If you’ve got a database disk and a log disk, you want them to stop mugging each other in a shared queue.
virtio-blk: what it buys you
virtio-blk is a paravirtualized block device. It’s lean. It can provide high throughput and low overhead. But it can be less flexible when you want SCSI-ish features, and historically some advanced behaviors (like certain discard/unmap patterns) have been more straightforward with virtio-scsi.
When controller choice actually fixes stutter
Controller changes reduce stutter when the bottleneck is in virtual device queueing or interrupt processing. Typical signs:
- Host device await is low (the backend is fine), but the guest has high IO wait and periodic pauses.
- One VM gets bursts of good throughput followed by total silence, even though the host storage isn’t saturated.
- Multiple busy disks in the same VM interfere with each other (logs stall DB commits, tempfiles stall app writes).
Pragmatic recommendation
If you have stutter and you’re not sure: switch the VM disk controller to virtio-scsi-single, enable iothread for that disk, and use cache=none unless you have a very specific reason not to.
This is not “the only correct setup.” It’s the one that most often fixes p99 latency spikes without turning your data integrity into a lifestyle choice.
Cache modes: the settings that make or break your p99 latency
Cache mode is where performance and safety have their awkward handshake. In Proxmox/QEMU terms, you’re deciding whether the host page cache sits in the middle and how flushes are handled.
The common modes (what they really mean)
- cache=none: QEMU uses direct IO (where possible). The host page cache is mostly bypassed. Guest cache still exists. Flushes map more directly to the backend. Often best for latency predictability and avoiding double-caching.
- cache=writeback: QEMU writes land in host page cache and are acknowledged quickly; later they flush to storage. Fast, until it isn’t. Risky without power protection or stable caches, because the guest believes data is safe earlier than it really is.
- cache=writethrough: writes go through host cache but are flushed before completion. Safer than writeback, usually slower, sometimes stuttery under sync-heavy workloads.
- cache=directsync: bypasses the host page cache and forces each write to stable storage before it completes. Generally a great way to learn patience.
- cache=unsafe: don’t. It exists for benchmarks and regrets.
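Before debating modes, find out what your VMs actually run. A small audit sketch, again assuming the standard /etc/pve/qemu-server/ path; disk lines without an explicit cache= use the Proxmox default, which behaves like no host caching:
# Count explicit cache modes across all VM configs on this node.
grep -h -o -E 'cache=[a-z]+' /etc/pve/qemu-server/*.conf | sort | uniq -c
# List the VM configs that use writeback so each one can be reviewed for power protection.
grep -l 'cache=writeback' /etc/pve/qemu-server/*.conf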
Why writeback can cause stutter (even when it “benchmarks fast”)
Writeback often turns your problem into a timing problem. The host page cache absorbs writes quickly—so the guest produces more writes. Then the host decides it’s time to flush. Flush happens in bursts, and those bursts can block new writes or starve reads. Your VM experiences this as periodic freezes.
On a lightly loaded host with a single VM, writeback can feel great. In production with multiple VMs and mixed workloads, it’s the IO equivalent of letting everyone merge into one lane at the same time.
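You can watch this pattern from the host: the dirty page pool grows while writes are absorbed, then drains in bursts when the kernel flushes. A minimal observation sketch using standard kernel counters, nothing Proxmox-specific assumed:
# Sample dirty and writeback counters once per second; spikes in Writeback that line up
# with guest stalls point at host page cache flush bursts.
while true; do date +%T; grep -E '^(Dirty|Writeback):' /proc/meminfo; sleep 1; done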
Why cache=none is boring in the best way
With cache=none, you reduce double-caching and you keep the guest’s view of persistence closer to reality. This often stabilizes latency. The guest still caches aggressively, so it’s not “no cache.” It’s “one cache, in one place, with fewer surprises.”
Flushes, fsync, and why databases expose bad settings
Databases call fsync() because they’re not into losing data. Journaling filesystems also issue barriers/flushes for ordering. If your cache mode and backend turn flushes into expensive global operations, you get a classic stutter pattern: the VM is fine until a commit point, then everyone stops.
A note on safety (because you like your weekends)
cache=writeback can be acceptable if you have:
- a UPS that actually works and is integrated (host will shut down cleanly),
- storage with power-loss protection (PLP) or a controller with battery-backed cache,
- and you understand the failure modes.
Otherwise, the “fast” setting is fast right up until it becomes a postmortem.
AIO and iothreads: turning parallel IO into actual parallelism
Even with the right controller and cache mode, you can still stutter because QEMU’s IO processing is serialized or contended. Two settings matter a lot: AIO mode and iothreads.
aio=native vs aio=threads
QEMU can submit IO using native Linux AIO or a thread pool that does blocking IO calls. Which wins depends on backend and kernel behavior, but a decent rule:
- aio=native: often lower overhead and better for direct IO paths; can reduce jitter when supported cleanly by the storage backend.
- aio=threads: more compatible; sometimes higher CPU and can introduce scheduling jitter under load.
If you’re on ZFS zvols or raw devices, native AIO is commonly a good idea. If you’re on certain file-backed images or unusual setups, threads might behave better. Measure, don’t vibe.
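You can also verify what Proxmox will hand to QEMU without parsing a live process. A quick sketch, assuming VM ID 103 from the earlier examples; qm showcmd only prints the generated command line, it doesn’t start anything:
# Print the generated QEMU command line and pull out the cache, AIO, and iothread settings.
qm showcmd 103 --pretty | grep -E 'aio=|cache=|iothread'
# If the effective aio mode is not what you configured, revisit the disk options in the VM
# config and the storage type before assuming the knob is broken.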
iothreads: the latency stabilizer you should actually use
Without iothreads, QEMU can process IO in the main event loop thread (plus whatever helpers), which means disk IO competes with emulation work and interrupt handling. With iothreads, each disk can have its own IO thread, reducing contention and smoothing latency.
On Proxmox, you can enable an iothread per disk. This is especially useful when:
- you have multiple busy disks in one VM,
- you have a single busy disk doing many small sync writes,
- you’re trying to keep p99 under control rather than win a sequential throughput contest.
How many iothreads?
Don’t create 20 iothreads because you can. Each thread is scheduling overhead. Create iothreads for the disks that matter: database volume, journal-heavy filesystem, queue-heavy log disk. For a VM with one disk, one iothread is usually enough.
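To see which disks already have iothreads enabled across the node, a quick config grep (same standard config path assumption as before):
# Disk lines with iothread enabled, per VM config file.
grep -H -E '^(scsi|virtio)[0-9]+:.*iothread=1' /etc/pve/qemu-server/*.conf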
Joke #2: Adding iothreads is like hiring more baristas—great until the café becomes a meeting about how to make coffee.
Storage backend implications (ZFS, Ceph, LVM-thin, files)
You can pick the perfect controller and cache settings and still get stutter because the backend is doing something expensive. Proxmox abstracts storage, but physics is not impressed.
ZFS: sync writes, ZIL/SLOG, and the “why is my NVMe still slow?” question
ZFS is excellent for integrity and snapshots. It’s also honest about sync writes. If your guest workload issues sync writes (databases, journaling, fsync-heavy apps), ZFS must commit them safely. Without a dedicated fast SLOG device with power-loss protection, sync-heavy workloads can stutter even on fast pools.
Key points:
- zvol vs dataset: VMs on zvols typically behave more predictably than file-backed qcow2 on datasets for heavy IO, though both can work.
- sync behavior: if a guest issues flushes, ZFS takes them seriously. If you “fix” it by setting sync=disabled, you’re trading durability for speed.
- recordsize/volblocksize: mismatch can increase write amplification. For VM zvols, volblocksize matters at creation time.
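A quick way to check these properties for one VM disk; this sketch assumes a zvol named rpool/vm-103-disk-0 (take the real name from qm config) and a pool named rpool:
# Inspect the sync policy and block size of the zvol backing the VM disk.
zfs get -o name,property,value sync,volblocksize,compression rpool/vm-103-disk-0
# Check whether the pool has a dedicated log (SLOG) device to absorb sync writes.
zpool status rpool | grep -A 2 -i logs || echo "no dedicated log device"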
Ceph RBD: latency is a cluster property
Ceph is resilient and scalable. It also has more places where latency can be introduced: network, OSD load, backfill/recovery, PG peering, and client-side queueing.
Stutter patterns on Ceph often come from:
- recovery/backfill saturating disks or network,
- OSDs with uneven performance (one slow disk drags a replicated write),
- client options that make flushes expensive,
- noisy neighbor on shared OSD nodes.
Controller/cache choices still matter, but they won’t overcome a cluster doing recovery at the worst possible time. (It always is.)
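Before touching VM settings on Ceph, check the cluster itself. A minimal triage sketch using standard Ceph client commands; run it on a node with admin credentials, and expect the exact output format to vary between releases:
# Overall health, plus whether recovery or backfill is running right now.
ceph -s
# Per-OSD latency; one consistently slow OSD drags down every replicated write it participates in.
ceph osd perf
# Any OSDs currently reporting slow operations?
ceph health detail | grep -i slow || echo "no slow ops reported"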
LVM-thin: metadata pressure and discard behavior
LVM-thin is fast and simple, but thin provisioning introduces metadata writes and can get spiky under heavy random writes or when the thin pool is near full. Discard/TRIM can help reclaim space but can also create bursts of work depending on configuration.
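A quick headroom check, as a sketch using standard LVM2 tooling on the host; thin pool and volume names will be whatever your storage configuration created:
# Show data and metadata usage for thin pools and the volumes inside them.
lvs -a -o lv_name,vg_name,lv_size,data_percent,metadata_percent,pool_lv
# Treat data_percent or metadata_percent creeping toward ~80% as a warning:
# thin pools get slow and spiky before they get full.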
File-backed qcow2: snapshots are nice, but watch the overhead
qcow2 is convenient: snapshots, compression, encryption features. It also adds copy-on-write metadata and fragmentation risk. Under random-write workloads, qcow2 can stutter noticeably compared to raw on a block device.
If you need performance and stable latency, prefer raw on a block backend. If you need features, use qcow2 but accept you might need to overprovision and defragment/convert periodically.
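If you decide to move a latency-sensitive disk off qcow2, the mechanics are routine. A hedged sketch with illustrative paths and storage IDs (yours will differ); do it with the VM stopped and a backup at hand, and note that the qm subcommand spelling varies slightly between Proxmox versions:
# Convert a qcow2 image to raw manually (paths are illustrative).
qemu-img convert -p -f qcow2 -O raw /var/lib/vz/images/103/vm-103-disk-0.qcow2 /var/lib/vz/images/103/vm-103-disk-0.raw
# On Proxmox it is usually simpler to let the storage layer handle conversion: moving the
# disk to a block-backed storage ("local-zfs" is an example storage ID) stores it as raw.
qm move-disk 103 scsi0 local-zfs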
Practical tasks: commands, outputs, and decisions (12+)
These are real tasks you can run today. Each one includes: a command, realistic output, what it means, and the decision you make.
Task 1: Identify the VM’s disk controller and cache mode in Proxmox config
cr0x@server:~$ qm config 103
boot: order=scsi0
cores: 4
memory: 8192
name: api-prod-03
net0: virtio=DE:AD:BE:EF:10:03,bridge=vmbr0
ostype: l26
scsihw: virtio-scsi-single
scsi0: rpool:vm-103-disk-0,cache=none,discard=on,iothread=1
agent: 1
What it means: scsihw is the controller model; disk line shows cache=none, discard enabled, and iothread enabled.
Decision: if you see sata0 or scsihw: lsi with heavy IO, plan a migration to virtio-scsi-single or virtio-blk. If cache is writeback, validate power protection and failure domain.
Task 2: Confirm what QEMU actually launched (trust but verify)
cr0x@server:~$ ps -p $(pgrep -f "kvm -id 103") -o pid,cmd --cols 200
PID CMD
24891 /usr/bin/kvm -id 103 -name vm103 -m 8192 -smp 4 ... -device virtio-scsi-pci,id=scsihw0 ... -device scsi-hd,drive=drive-scsi0 ... -drive file=/dev/zvol/rpool/vm-103-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=native ...
What it means: confirms the effective cache and AIO mode; sometimes GUI settings differ from reality due to older configs or overrides.
Decision: if you expected aio=native and you see threads, investigate backend compatibility and Proxmox version settings.
Task 3: Check host storage latency and saturation under load
cr0x@server:~$ iostat -x -d 1 3
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 40.0 620.0 2.8 62.5 205.0 5.2 8.4 0.3 19.0
What it means: device is not saturated (%util low), latency is acceptable.
Decision: if host device latency is low but guest is stuttering, suspect controller/cache/queueing inside virtualization layers.
Task 4: Identify which process is generating IO on the host
cr0x@server:~$ pidstat -d 1 3
Linux 6.8.12-pve (pve01) 12/26/2025 _x86_64_ (32 CPU)
# Time UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:01:01 0 24891 120.00 54000.00 0.00 12 kvm
12:01:01 0 1820 0.00 220.00 0.00 1 zfs
What it means: the VM’s QEMU process is responsible for the majority of writes; you’re looking at the right thing.
Decision: if IO is coming from pvestatd, backups, or replication jobs, schedule/limit those before tuning VM disks.
Task 5: Check guest-level disk latency and queue depth
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s aqu-sz await r_await w_await %util
vda 5.0 220.0 320.0 22000.0 8.4 38.0 2.0 38.8 98.0
What it means: the guest sees the disk nearly saturated with high write wait time. Could be real backend latency, or virtual queueing/flush behavior.
Decision: correlate with host iostat. If host latency is low but guest awaits are high, suspect queueing in controller/emulation or flush storm behavior.
Task 6: Check if the guest is issuing lots of flushes
cr0x@server:~$ grep -R . /sys/block/vda/queue/* 2>/dev/null | egrep "write_cache|nomerges|max_sectors_kb"
/sys/block/vda/queue/max_sectors_kb:1280
/sys/block/vda/queue/nomerges:0
/sys/block/vda/queue/write_cache:write through
What it means: write through suggests the device presents write-through caching; flush behavior matters.
Decision: if workloads are fsync-heavy and you’re seeing flush storms, prefer cache=none and ensure backend handles sync writes well (ZFS SLOG, Ceph tuning, etc.).
Task 7: Measure filesystem commit behavior in the guest (journaling pressure)
cr0x@server:~$ dmesg | tail -n 8
[ 9123.112233] EXT4-fs (vda2): re-mounted. Opts: (null)
[ 9450.774411] INFO: task jbd2/vda2-8:341 blocked for more than 120 seconds.
[ 9450.774419] Tainted: G W 6.5.0-28-generic #29-Ubuntu
[ 9450.774425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
What it means: the ext4 journal thread is blocked—classic symptom of storage stalls, often on flush/commit paths.
Decision: treat this as a severity signal. Fix IO latency first; don’t “tune ext4” to hide it.
Task 8: Check Proxmox host memory pressure (host cache thrash causes stutter)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 125Gi 98Gi 1.2Gi 2.0Gi 26Gi 18Gi
Swap: 16Gi 12Gi 4.0Gi
What it means: swap usage is high; host might be under memory pressure, making IO slower and more jittery.
Decision: reduce overcommit, add RAM, or stop using host page cache for VM disks. cache=none helps by reducing host cache dependence, but you still need enough memory.
Task 9: Inspect ZFS pool health and latency indicators (if using ZFS)
cr0x@server:~$ zpool status -v
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
errors: No known data errors
What it means: pool is healthy; no obvious failing device causing retries.
Decision: if you see degraded vdevs or growing error counts, stop tuning VMs and replace hardware / resilver first.
Task 10: Check ZFS sync settings and whether you’re cheating durability
cr0x@server:~$ zfs get -o name,property,value -s local,default sync rpool
NAME PROPERTY VALUE
rpool sync standard
What it means: sync writes are honored normally.
Decision: if someone set sync=disabled to “fix performance,” treat it as a risk item. If performance is unacceptable with sync=standard, add a proper SLOG or redesign storage.
Task 11: Confirm discard/TRIM is enabled end-to-end (thin pools and SSDs)
cr0x@server:~$ qm config 103 | grep -E "discard|ssd|iothread"
scsi0: rpool:vm-103-disk-0,cache=none,discard=on,iothread=1
What it means: discard is enabled at the virtual disk layer.
Decision: if you use thin provisioning (LVM-thin, ZFS thin-ish behavior, Ceph), consider discard to prevent space blowups—but test, because discard can introduce bursts depending on guest behavior.
Task 12: Check whether the guest is actually using virtio drivers and multi-queue (guest side)
cr0x@server:~$ lsmod | egrep "virtio_blk|virtio_scsi|scsi_mod" | head
virtio_scsi 28672 2
scsi_mod 274432 3 virtio_scsi,sd_mod,sg
virtio_pci 32768 0
virtio_ring 40960 2 virtio_net,virtio_scsi
What it means: virtio-scsi is loaded; you’re not on emulated SATA drivers.
Decision: if virtio modules aren’t present, you may be using the wrong controller or an old initramfs. Fix driver availability before changing other knobs.
Task 13: Measure flush-heavy behavior with a quick, honest fio test (host or guest)
cr0x@server:~$ fio --name=syncwrite --filename=/var/lib/testfile --size=1G --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 --sync=1 --time_based --runtime=20
syncwrite: (groupid=0, jobs=1): err= 0: pid=2123: Thu Dec 26 12:10:01 2025
write: IOPS=3200, BW=12.5MiB/s (13.1MB/s)(250MiB/20001msec)
clat (usec): min=120, max=42000, avg=310.42, stdev=900.12
What it means: max completion latency hits 42ms in this run; in stutter cases you’ll see hundreds of ms or seconds.
Decision: if max latency is huge, focus on sync path: cache mode, backend sync handling (ZFS SLOG), Ceph health, or host saturation.
Task 14: Check host block queue settings (sometimes a silent limiter)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] none kyber bfq
What it means: the host is using mq-deadline, a reasonable default for many SSD/NVMe workloads.
Decision: if you’re on rotational disks, bfq or deadline choices can change latency. For NVMe, scheduler changes usually aren’t the first fix, but they can help fairness under mixed loads.
Task 15: Find whether backups/replication are colliding with VM IO
cr0x@server:~$ systemctl list-timers --all | egrep "pve|vzdump" || true
Thu 2025-12-26 12:30:00 UTC 20min left Thu 2025-12-26 12:00:01 UTC 9min ago vzdump.timer vzdump backup job
What it means: a backup job is scheduled and may be running frequently; backups can create read storms and snapshot overhead.
Decision: if stutter aligns with backup windows, throttle/schedule backups, use snapshot modes appropriate for the backend, and isolate backup IO where possible.
Three corporate mini-stories from the IO trenches
Mini-story 1: The incident caused by a wrong assumption
The team inherited a Proxmox cluster running “mostly fine” for internal services. A new customer-facing API was deployed into a VM with a small PostgreSQL instance. Load wasn’t massive; it was just constant. Within a day, on-call started seeing periodic latency spikes and odd timeouts. CPU graphs looked polite. Network looked boring. The VM was “healthy.”
Someone assumed the storage backend was slow and pushed the usual fix: “enable writeback cache, it’ll smooth things out.” It did—briefly. Throughput improved. The stutter got less frequent, which made the change look brilliant. Then the host rebooted unexpectedly after a power event that was short enough to avoid a clean shutdown but long enough to ruin your evening.
The database came back with corruption symptoms: missing WAL segments and inconsistent state. Postgres did its job and refused to pretend everything was fine. Recovery worked, but it wasn’t quick, and it wasn’t fun. The uncomfortable part wasn’t the outage; it was realizing the “performance fix” had changed the durability contract without anyone explicitly agreeing to it.
What actually solved the stutter later wasn’t writeback. They moved the VM disk from an emulated controller to virtio-scsi-single, enabled iothread, used cache=none, and fixed backend sync latency with a proper write-optimized device. Latency flattened. The durability story stayed intact. The lesson stuck: never “assume” a cache mode is just a speed dial.
Mini-story 2: The optimization that backfired
A different org ran mixed workloads: CI runners, a log ingestion pipeline, and a few stateful services. They noticed CI jobs were slow on disk-heavy phases, so they tuned aggressively. They changed disks to virtio-blk, set high iodepth in the guest, and enabled discard everywhere to keep thin pools tidy.
Benchmarks looked better. The CI team celebrated. Two weeks later, the log ingestion system started stuttering during peak hours. It wasn’t CPU. It wasn’t network. It was IO latency spikes—sharp, periodic, and annoyingly hard to correlate. The worst part: it wasn’t a single VM. It was a pattern across many.
The root cause was a combination of “good ideas” that lined up badly: discard operations from many VMs were generating bursts of backend work, and the iodepth tuning caused bigger queues, so latency tails got longer under contention. The cluster wasn’t “slower,” it was more jittery. The p50 improved while p99 got ugly, which is how you fool yourself with averages.
The fix was boring: limit discard behavior (make it scheduled rather than constant), reduce queue depth for latency-sensitive VMs, reintroduce virtio-scsi-single for certain multi-disk guests, and use iothreads selectively. They kept the CI improvements without sacrificing the ingest pipeline. The real win was learning that performance tuning is a negotiation between workloads, not a single number.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company ran Proxmox with ZFS mirrors on NVMe. Not glamorous, not cheap, but sane. They had a habit that looked like bureaucracy: every VM had a small runbook entry listing controller type, cache mode, whether iothread was enabled, and why. Nothing fancy—just enough to stop improvisation.
One day, a kernel update plus a change in guest workload triggered intermittent disk stalls in a critical VM. The VM was a message broker and it hated latency spikes. Users saw delayed processing. Engineers started hunting for “what changed.” The runbook made it quick: no recent VM config changes. Controller and cache were as expected. That ruled out half the usual suspects.
They went straight to host metrics and saw elevated latency on one NVMe device during bursts. It wasn’t failing, but it was behaving oddly. A firmware bug was suspected. Because they had consistent VM settings, they could reproduce the issue with targeted tests and isolate it to the device rather than arguing about QEMU flags.
The mitigation was clean: migrate the VM disks off the suspect device, apply firmware, and validate with the same fio profiles they used in acceptance tests. The outage was contained. The postmortem was short. Nobody had to explain why a “quick” writeback change was made at 3 a.m. Boring practices don’t trend on social media; they keep revenue attached to reality.
Common mistakes: symptom → root cause → fix
1) Symptom: periodic 1–5 second freezes; average throughput looks fine
Root cause: flush storms triggered by fsync/journal commits + cache mode/backend that turns flush into a stall.
Fix: use cache=none, enable iothread, ensure backend handles sync writes (ZFS with proper SLOG, Ceph healthy and not recovering), avoid qcow2 for heavy sync workloads.
2) Symptom: guest shows high await, host shows low await
Root cause: virtual queueing/controller contention (shared queue), QEMU main-loop contention, or suboptimal AIO mode.
Fix: switch to virtio-scsi-single or verify virtio-blk multiqueue where appropriate; enable iothread; consider aio=native for direct IO paths.
3) Symptom: stutter starts after enabling writeback “for speed”
Root cause: host page cache flush and writeback throttling; memory pressure amplifies it.
Fix: revert to cache=none unless you have power-loss protection; add RAM or reduce VM memory overcommit; isolate IO-heavy VMs to dedicated storage.
4) Symptom: thin pool fills unexpectedly; VM gets slow near full
Root cause: LVM-thin metadata pressure and near-full behavior; discard not enabled or not effective.
Fix: monitor thin pool usage, keep headroom, enable discard thoughtfully, run periodic fstrim inside guests, and avoid overcommitting thin pools for write-heavy workloads.
5) Symptom: Ceph-backed VMs stutter during the day, fine at night
Root cause: recovery/backfill or uneven OSD performance during business-hour load; network contention.
Fix: adjust recovery scheduling/limits, fix slow OSDs, ensure dedicated storage network capacity, and verify client-side caching/flush settings aren’t exacerbating tail latency.
6) Symptom: multi-disk VM; one disk’s workload disrupts the other
Root cause: shared controller/queue and no iothread isolation; IO merge/scheduler effects.
Fix: use virtio-scsi-single, enable per-disk iothreads, and separate disks by purpose (DB vs logs) with clear priorities.
7) Symptom: benchmarks show great sequential throughput, app still stutters
Root cause: wrong benchmark profile; real workload is small random sync writes with strict latency requirements.
Fix: test with 4k/8k random writes, low iodepth, and sync/fsync patterns. Optimize for latency tails, not peak MB/s.
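Here is what a latency-honest profile looks like, as a hedged sketch: small random writes, queue depth 1, an fsync after every write, and judgment based on completion-latency percentiles instead of the bandwidth line. The file path is illustrative; run it where the application actually writes:
# Latency-tail oriented fio profile: judge it by the clat percentiles, not MB/s.
fio --name=committest --filename=/var/lib/app/fio-testfile --size=1G \
  --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
  --direct=1 --fsync=1 --time_based --runtime=60 \
  --percentile_list=50:95:99:99.9
# Remove the test file afterwards; it counts against thin pool space like any other data.
rm /var/lib/app/fio-testfile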
Checklists / step-by-step plan
Step-by-step: the “fix stutter without gambling data” plan
- Measure first. Capture guest iostat -x and host iostat -x during stutter. Save outputs in the ticket.
- Verify controller type. If you’re on SATA/IDE/emulated LSI without a reason, schedule a change to virtio.
- Pick a controller:
- Default to virtio-scsi-single for general-purpose Linux VMs.
- Use virtio-blk if you want simplicity and have a single busy disk, and you’ve tested latency.
- Set cache mode to none. It’s the safest performance option for most production systems.
- Enable iothread for the busy disk(s). Start with one per important disk.
- Confirm AIO mode. Prefer aio=native when supported and stable; otherwise accept threads and rely on iothreads.
- Check host memory. If the host swaps, you will have IO weirdness. Fix memory pressure.
- Backend-specific fixes:
- ZFS: ensure sync write path is fast enough; consider proper SLOG for sync-heavy VMs.
- Ceph: ensure the cluster is healthy and not recovering; find slow OSDs.
- LVM-thin: keep free space headroom; watch metadata; manage discard.
- Re-test with realistic IO. Use fio with sync patterns, not just sequential writes.
- Roll out gradually. Change one VM at a time, validate p95/p99 latency, then standardize.
Checklist: what to avoid when you’re tired and on-call
- Don’t enable cache=unsafe. Ever.
- Don’t set ZFS sync=disabled as a “temporary” fix without a written risk signoff.
- Don’t tune only for throughput; latency tails are what users feel.
- Don’t enable discard everywhere without understanding your thin backend behavior.
- Don’t assume “NVMe” means “low latency.” It means “faster at failing loudly when saturated.”
Quick change examples (with commands)
These examples are typical Proxmox CLI operations. Always do this in a maintenance window and with a backup/snapshot plan appropriate to your backend.
cr0x@server:~$ qm set 103 --scsihw virtio-scsi-single
update VM 103: -scsihw virtio-scsi-single
What it means: sets the SCSI controller model.
Decision: proceed if the guest supports virtio and you can reboot if needed.
cr0x@server:~$ qm set 103 --scsi0 rpool:vm-103-disk-0,cache=none,iothread=1,discard=on
update VM 103: -scsi0 rpool:vm-103-disk-0,cache=none,iothread=1,discard=on
What it means: enforces cache mode and enables iothread on that disk.
Decision: after reboot, validate guest latency and application behavior.
FAQ
1) Should I use virtio-scsi-single or virtio-blk for Linux VMs?
Default to virtio-scsi-single if you care about predictable latency and have multiple disks or mixed IO. Use virtio-blk for simple, single-disk VMs where you’ve measured and it behaves well.
2) Is cache=none always the best choice?
For most production Proxmox setups: yes. It avoids double caching and reduces host memory side effects. The exceptions are niche—usually when you’re deliberately using host cache for read-heavy workloads and you have enough RAM and operational discipline.
3) Why did cache=writeback make my VM “faster” but less stable?
Because it acknowledges writes earlier by buffering them in host RAM, then flushes later in bursts. That can create periodic stalls and increases risk on power loss or host crashes.
4) Do iothreads always help?
They often help latency and concurrency, especially under load. They can be neutral or slightly negative on tiny workloads where overhead dominates. Enable them for the disks that matter, not automatically for every disk in every VM.
5) How do I know if stutter is the backend (ZFS/Ceph) vs VM settings?
Compare host device latency (iostat -x on the host) with guest latency. If the host is fine but the guest is not, suspect virtual controller/queueing/cache mode. If host latency is bad, fix the backend first.
6) Can I fix ZFS VM stutter by setting sync=disabled?
You can “fix” it the way you can fix a smoke alarm by removing the battery. It trades durability for speed. If you need performance with sync-heavy workloads, use a proper SLOG device with power-loss protection or adjust architecture.
7) Does qcow2 cause stutter?
It can, especially under random writes and when fragmented. If you need stable latency for databases, prefer raw on a block backend (zvol, LVM LV, RBD) unless you truly need qcow2 features.
8) Should I enable discard/TRIM for VM disks?
Often yes for SSD-backed thin provisioning, but be deliberate. Continuous discard can create bursts of backend work. Many teams prefer periodic fstrim in the guest plus discard enabled at the virtual layer, then measure.
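On most systemd-based guests the periodic approach is already packaged as a weekly timer. A quick check-and-enable sketch inside the guest:
# Is periodic TRIM already scheduled?
systemctl status fstrim.timer --no-pager
# Enable it if present but inactive.
systemctl enable --now fstrim.timer
# One-off manual trim with verbose output; also a handy end-to-end test that discard works.
fstrim -av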
9) Why is my VM disk named sda even though I selected virtio?
Inside the guest, naming depends on the driver and controller. virtio-blk often appears as vda. virtio-scsi often appears as sda but still uses virtio transport. Verify using lspci/lsmod, not just the device name.
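A quick way to verify the transport inside the guest, as a sketch:
# Virtio devices appear as virtio PCI devices regardless of what the block device is named.
lspci | grep -i virtio
# virtio-blk disks (vdX) have no SCSI address; virtio-scsi disks (sdX) show an HCTL entry.
lsblk -o NAME,HCTL,MODEL,SIZE
# Confirm which virtio block modules the kernel actually loaded.
lsmod | grep -E 'virtio_blk|virtio_scsi'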
10) What’s the single most common reason for “random” stutter in Proxmox?
Host-level contention: backups, replication, scrubs, Ceph recovery, or a saturated disk. Second most common: writeback cache plus memory pressure creating flush storms.
Conclusion: next steps you can do today
If your Proxmox Linux VM has slow disk stutter, don’t start with folklore. Start with a short measurement loop: guest vmstat/iostat, host iostat, and the VM’s real QEMU flags. Then make the changes that reliably improve latency tails:
- Move off emulated controllers; use virtio-scsi-single or virtio-blk.
- Prefer cache=none unless you can justify writeback with power protection and operational guarantees.
- Enable iothread for the disks that carry your workload’s pain.
- Make backend-specific fixes instead of blaming the VM: ZFS sync path, Ceph health, thin pool headroom.
Do one VM, measure p95/p99 latency before and after, and then standardize the pattern. If it still stutters after that, congratulations: you’ve eliminated the easy causes, and now you get to do real engineering. That’s the job.