Nothing ruins your morning like a Linux VM that “works fine” until it doesn’t: shell prompts hang, MySQL commits pause, journald stalls, and your app’s latency chart grows teeth. You look at CPU and RAM—fine. Network—fine. Then you open iostat and see it: random write latency spiking into the seconds. The guest feels like it’s on a spinning disk powered by vibes.
This is usually fixable. Often it’s not “storage is slow,” it’s that you chose the wrong virtual disk controller, the wrong cache mode, or you accidentally put QEMU in a mode that turns a burst of small writes into a traffic jam. The good news: Proxmox gives you the levers. The bad news: the defaults don’t always match your workload.
Fast diagnosis playbook
If your VM “stutters,” you need to decide where latency is coming from: guest, QEMU, host kernel, storage backend, or the physical device/cluster. Don’t guess. Triage.
First: prove it’s storage latency (not CPU steal or memory pressure)
- In the guest: check whether stalls correlate with disk wait.
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  22012 903112    0    0   120   980  310  510  4  2 92  2  0
 0  2      0 810112  22020 904000    0    0     0  4096  280  420  2  1 58 39  0
 0  1      0 809980  22020 904120    0    0     0  2048  270  410  2  1 70 27  0
What it means: high wa indicates time spent waiting on IO. If st (steal) is high, you're CPU oversubscribed; fix that before obsessing over disk settings.
Decision: if wa spikes during stutter and st is low, keep digging into storage.
Second: find the slow layer (guest queue vs host device)
- On the Proxmox host: watch per-device latency and saturation.
cr0x@server:~$ iostat -x 1 5
Linux 6.8.12-pve (pve01)  12/26/2025  _x86_64_  (32 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.12    0.00    3.01    8.33    0.00   78.54
Device    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1  22.0  480.0    1.2   46.8    192.0      3.8    7.9     1.2     8.2    0.4   20.5
sda       0.0   95.0    0.0    6.0    128.0     18.2  191.3     0.0   191.3    1.1   99.8
What it means: await is request latency; %util near 100% with high await screams "saturated device." Here sda is the villain.
Decision: if latency is bad at the host device, controller/cache tweaks in the VM won't save you. Fix the backend (migration, faster disk, better pool layout) before tuning.
- On the host: inspect the QEMU command line for the VM's effective cache and AIO modes.
cr0x@server:~$ pgrep -a kvm | head -n 1
24891 /usr/bin/kvm -id 103 -name vm103 -m 8192 ... -drive file=/dev/zvol/rpool/vm-103-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=native ...
What it means: the QEMU command line tells you cache mode, aio mode, and whether you're using virtio-scsi/virtio-blk.
Decision: if you see cache=writeback on a host without a UPS and without stable write cache, you're flirting with data loss; if you see aio=threads with high CPU and latency, consider aio=native if supported.
Third: confirm the guest sees a sane virtual device and queueing model
- In the guest: identify the disk type and whether it's virtio.
cr0x@server:~$ lsblk -o NAME,MODEL,TYPE,SIZE,ROTA,DISC-MAX,DISC-GRAN,MOUNTPOINTS
NAME     MODEL          TYPE  SIZE ROTA DISC-MAX DISC-GRAN MOUNTPOINTS
vda      QEMU HARDDISK  disk  200G    0       2G        4K
├─vda1                  part    1G    0
└─vda2                  part  199G    0                     /
What it means: vda usually means virtio-blk. sda can be virtio-scsi (fine) or an emulated SATA/SCSI controller (higher overhead); confirm with lspci/lsmod or the Proxmox config rather than the device name alone.
Decision: if you're on emulated controllers, plan a maintenance window to migrate to virtio.
This playbook is intentionally short. The rest of the article explains the “why” and gives you the knobs that reliably reduce latency spikes without creating new ones.
What “disk stutter” really means in a VM
Stutter isn’t about average throughput. Your monitoring will show 50 MB/s and everyone will congratulate the storage. Meanwhile, your VM freezes for 800 ms every few seconds because small synchronous writes are backing up behind something expensive.
In virtualized storage, there are several queues and flush points:
- Guest page cache: buffered writes accumulate until the kernel decides to flush.
- Guest block layer: merges and schedules IO (less dramatic on newer kernels, still real).
- Virtual controller queue: virtio-blk and virtio-scsi have different queue models and interrupt behavior.
- QEMU block layer: cache mode controls whether QEMU uses the host page cache, and where flushes land.
- Host filesystem / volume manager: ZFS, LVM-thin, ext4-on-SSD, Ceph RBD each has different latency characteristics.
- Device/cluster: the actual disk, RAID controller, NVMe, SAN, or Ceph OSDs.
Stutter is usually one of these patterns:
- Flush storms: a batch of writes hits a flush barrier (fsync(), journal commit, database commit), and latency spikes.
- Single-queue choke: a disk model/controller uses a single queue, so parallel workloads serialize.
- Host memory pressure: host page cache thrashes, and IO becomes “surprise synchronous.”
- Backend write amplification: thin provisioning, copy-on-write, or replication makes small writes expensive.
One line you should keep in your head: the guest thinks it’s talking to a disk; you’re actually running a distributed system made of queues.
Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails eventually; design so failures are expected and manageable.”
Joke #1: Storage performance is like gossip—everything is fast until you ask for a sync.
Facts and history that explain today’s weird defaults
Some “mystery” Proxmox/QEMU options make a lot more sense once you know where they came from.
- Virtio was born to avoid emulating hardware. Early virtualization often emulated IDE/SATA; virtio came later as a paravirtualized interface to cut overhead.
- virtio-blk predates virtio-scsi. virtio-blk was simpler and widely supported; virtio-scsi appeared to offer features closer to real SCSI (multiple LUNs, hotplug, better scaling patterns).
- Write barriers and flushes got stricter over time. Filesystems and databases became less willing to “trust” caches after painful corruption stories in the 2000s; flush semantics matter more today.
- Host page cache used to be a cheap win. On spinning disks and small RAM systems, using host cache (writeback) could smooth IO; on modern SSD/NVMe with multiple VMs, it can cause noisy-neighbor cache contention.
- AIO in Linux has two personalities. “native” AIO (kernel AIO) behaves differently than thread-based emulation; which is best depends on backend and alignment.
- NCQ and multi-queue weren’t always common. A lot of tuning folklore comes from the era of single-queue SATA and limited parallelism.
- ZFS became popular in virtualization for snapshots/clones. The trade: copy-on-write and checksumming add overhead; it’s worth it, but you must respect sync write behavior.
- Ceph popularized “storage as a cluster feature.” Great for resilience and scale; latency is a product of quorum, network, and OSD load, not just a disk.
These aren’t trivia. They explain why one knob fixes stutter in your environment and makes it worse in your coworker’s.
Controller choice: virtio-scsi vs virtio-blk (and when SATA is still useful)
If you’re running Proxmox and your Linux VM’s disk device shows up as sda behind an “Intel AHCI” controller, you’re paying for hardware cosplay. Emulated controllers exist for compatibility. Performance isn’t their job.
What to use by default
- Use virtio-scsi-single for most modern Linux VMs. It’s a solid default: good feature support, good scaling, and predictable behavior with iothreads.
- Use virtio-blk for simple setups or when you want fewer moving parts. virtio-blk can be very fast. It’s also simpler, which sometimes matters for debugging and driver maturity in odd guests.
- Use SATA only for bootstrapping or weird OS install media scenarios. Once installed, switch to virtio.
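Before scheduling changes, it helps to know which VMs are still on emulated controllers. A minimal audit sketch, assuming the standard Proxmox config directory /etc/pve/qemu-server/ (adjust if your layout differs):
# Show the controller model and every disk attachment line for each VM on this node.
grep -H -E '^(scsihw|scsi[0-9]+|virtio[0-9]+|sata[0-9]+|ide[0-9]+):' /etc/pve/qemu-server/*.conf | sort
# Anything with scsihw: lsi, or disks attached as sataN/ideN, is a candidate for migration to virtio.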
virtio-scsi: what it buys you
virtio-scsi models a SCSI HBA with virtio transport. In Proxmox you’ll see options like:
- virtio-scsi-pci: a single controller that can attach multiple disks; those disks may share queueing and interrupt handling, depending on configuration.
- virtio-scsi-single: creates a separate controller per disk (or effectively isolates queueing), often reducing lock contention and improving fairness across disks.
In practice: virtio-scsi-single is frequently the easiest way to get predictable latency across multiple busy disks. If you’ve got a database disk and a log disk, you want them to stop mugging each other in a shared queue.
virtio-blk: what it buys you
virtio-blk is a paravirtualized block device. It’s lean. It can provide high throughput and low overhead. But it can be less flexible when you want SCSI-ish features, and historically some advanced behaviors (like certain discard/unmap patterns) have been more straightforward with virtio-scsi.
When controller choice actually fixes stutter
Controller changes reduce stutter when the bottleneck is in virtual device queueing or interrupt processing. Typical signs:
- Host device await is low (the backend is fine), but the guest has high IO wait and periodic pauses.
- One VM gets bursts of good throughput followed by total silence, even though the host storage isn’t saturated.
- Multiple busy disks in the same VM interfere with each other (logs stall DB commits, tempfiles stall app writes).
Pragmatic recommendation
If you have stutter and you’re not sure: switch the VM disk controller to virtio-scsi-single, enable iothread for that disk, and use cache=none unless you have a very specific reason not to.
This is not “the only correct setup.” It’s the one that most often fixes p99 latency spikes without turning your data integrity into a lifestyle choice.
Cache modes: the settings that make or break your p99 latency
Cache mode is where performance and safety have their awkward handshake. In Proxmox/QEMU terms, you’re deciding whether the host page cache sits in the middle and how flushes are handled.
The common modes (what they really mean)
- cache=none: QEMU uses direct IO (where possible). The host page cache is mostly bypassed. Guest cache still exists. Flushes map more directly to the backend. Often best for latency predictability and avoiding double-caching.
- cache=writeback: QEMU writes land in host page cache and are acknowledged quickly; later they flush to storage. Fast, until it isn’t. Risky without power protection or stable caches, because the guest believes data is safe earlier than it really is.
- cache=writethrough: writes go through host cache but are flushed before completion. Safer than writeback, usually slower, sometimes stuttery under sync-heavy workloads.
- cache=directsync: bypasses the host page cache and forces each write to stable storage before it completes. Generally a great way to learn patience.
- cache=unsafe: don’t. It exists for benchmarks and regrets.
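Before debating modes, find out what your VMs actually run. A small audit sketch, again assuming the standard /etc/pve/qemu-server/ path; disk lines without an explicit cache= use the Proxmox default, which behaves like no host caching:
# Count explicit cache modes across all VM configs on this node.
grep -h -o -E 'cache=[a-z]+' /etc/pve/qemu-server/*.conf | sort | uniq -c
# List the VM configs that use writeback so each one can be reviewed for power protection.
grep -l 'cache=writeback' /etc/pve/qemu-server/*.conf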
Why writeback can cause stutter (even when it “benchmarks fast”)
Writeback often turns your problem into a timing problem. The host page cache absorbs writes quickly—so the guest produces more writes. Then the host decides it’s time to flush. Flush happens in bursts, and those bursts can block new writes or starve reads. Your VM experiences this as periodic freezes.
On a lightly loaded host with a single VM, writeback can feel great. In production with multiple VMs and mixed workloads, it’s the IO equivalent of letting everyone merge into one lane at the same time.
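You can watch this pattern from the host: the dirty page pool grows while writes are absorbed, then drains in bursts when the kernel flushes. A minimal observation sketch using standard kernel counters, nothing Proxmox-specific assumed:
# Sample dirty and writeback counters once per second; spikes in Writeback that line up
# with guest stalls point at host page cache flush bursts.
while true; do date +%T; grep -E '^(Dirty|Writeback):' /proc/meminfo; sleep 1; done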
Why cache=none is boring in the best way
With cache=none, you reduce double-caching and you keep the guest’s view of persistence closer to reality. This often stabilizes latency. The guest still caches aggressively, so it’s not “no cache.” It’s “one cache, in one place, with fewer surprises.”
Flushes, fsync, and why databases expose bad settings
Databases call fsync() because they’re not into losing data. Journaling filesystems also issue barriers/flushes for ordering. If your cache mode and backend turn flushes into expensive global operations, you get a classic stutter pattern: the VM is fine until a commit point, then everyone stops.
A note on safety (because you like your weekends)
cache=writeback can be acceptable if you have:
- a UPS that actually works and is integrated (host will shut down cleanly),
- storage with power-loss protection (PLP) or a controller with battery-backed cache,
- and you understand the failure modes.
Otherwise, the “fast” setting is fast right up until it becomes a postmortem.
AIO and iothreads: turning parallel IO into actual parallelism
Even with the right controller and cache mode, you can still stutter because QEMU’s IO processing is serialized or contended. Two settings matter a lot: AIO mode and iothreads.
aio=native vs aio=threads
QEMU can submit IO using native Linux AIO or a thread pool that does blocking IO calls. Which wins depends on backend and kernel behavior, but a decent rule:
- aio=native: often lower overhead and better for direct IO paths; can reduce jitter when supported cleanly by the storage backend.
- aio=threads: more compatible; sometimes higher CPU and can introduce scheduling jitter under load.
If you’re on ZFS zvols or raw devices, native AIO is commonly a good idea. If you’re on certain file-backed images or unusual setups, threads might behave better. Measure, don’t vibe.
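You can also verify what Proxmox will hand to QEMU without parsing a live process. A quick sketch, assuming VM ID 103 from the earlier examples; qm showcmd only prints the generated command line, it doesn’t start anything:
# Print the generated QEMU command line and pull out the cache, AIO, and iothread settings.
qm showcmd 103 --pretty | grep -E 'aio=|cache=|iothread'
# If the effective aio mode is not what you configured, revisit the disk options in the VM
# config and the storage type before assuming the knob is broken.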
iothreads: the latency stabilizer you should actually use
Without iothreads, QEMU can process IO in the main event loop thread (plus whatever helpers), which means disk IO competes with emulation work and interrupt handling. With iothreads, each disk can have its own IO thread, reducing contention and smoothing latency.
On Proxmox, you can enable an iothread per disk. This is especially useful when:
- you have multiple busy disks in one VM,
- you have a single busy disk doing many small sync writes,
- you’re trying to keep p99 under control rather than win a sequential throughput contest.
How many iothreads?
Don’t create 20 iothreads because you can. Each thread is scheduling overhead. Create iothreads for the disks that matter: database volume, journal-heavy filesystem, queue-heavy log disk. For a VM with one disk, one iothread is usually enough.
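To see which disks already have iothreads enabled across the node, a quick config grep (same standard config path assumption as before):
# Disk lines with iothread enabled, per VM config file.
grep -H -E '^(scsi|virtio)[0-9]+:.*iothread=1' /etc/pve/qemu-server/*.conf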
Joke #2: Adding iothreads is like hiring more baristas—great until the café becomes a meeting about how to make coffee.
Storage backend implications (ZFS, Ceph, LVM-thin, files)
You can pick the perfect controller and cache settings and still get stutter because the backend is doing something expensive. Proxmox abstracts storage, but physics is not impressed.
ZFS: sync writes, ZIL/SLOG, and the “why is my NVMe still slow?” question
ZFS is excellent for integrity and snapshots. It’s also honest about sync writes. If your guest workload issues sync writes (databases, journaling, fsync-heavy apps), ZFS must commit them safely. Without a dedicated fast SLOG device with power-loss protection, sync-heavy workloads can stutter even on fast pools.
Key points:
- zvol vs dataset: VMs on zvols typically behave more predictably than file-backed qcow2 on datasets for heavy IO, though both can work.
- sync behavior: if a guest issues flushes, ZFS takes them seriously. If you “fix” it by setting sync=disabled, you’re trading durability for speed.
- recordsize/volblocksize: mismatch can increase write amplification. For VM zvols, volblocksize matters at creation time.
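A quick way to check these properties for one VM disk; this sketch assumes a zvol named rpool/vm-103-disk-0 (take the real name from qm config) and a pool named rpool:
# Inspect the sync policy and block size of the zvol backing the VM disk.
zfs get -o name,property,value sync,volblocksize,compression rpool/vm-103-disk-0
# Check whether the pool has a dedicated log (SLOG) device to absorb sync writes.
zpool status rpool | grep -A 2 -i logs || echo "no dedicated log device"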
Ceph RBD: latency is a cluster property
Ceph is resilient and scalable. It also has more places where latency can be introduced: network, OSD load, backfill/recovery, PG peering, and client-side queueing.
Stutter patterns on Ceph often come from:
- recovery/backfill saturating disks or network,
- OSDs with uneven performance (one slow disk drags a replicated write),
- client options that make flushes expensive,
- noisy neighbor on shared OSD nodes.
Controller/cache choices still matter, but they won’t overcome a cluster doing recovery at the worst possible time. (It always is.)
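Before touching VM settings on Ceph, check the cluster itself. A minimal triage sketch using standard Ceph client commands; run it on a node with admin credentials, and expect the exact output format to vary between releases:
# Overall health, plus whether recovery or backfill is running right now.
ceph -s
# Per-OSD latency; one consistently slow OSD drags down every replicated write it participates in.
ceph osd perf
# Any OSDs currently reporting slow operations?
ceph health detail | grep -i slow || echo "no slow ops reported"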
LVM-thin: metadata pressure and discard behavior
LVM-thin is fast and simple, but thin provisioning introduces metadata writes and can get spiky under heavy random writes or when the thin pool is near full. Discard/TRIM can help reclaim space but can also create bursts of work depending on configuration.
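A quick headroom check, as a sketch using standard LVM2 tooling on the host; thin pool and volume names will be whatever your storage configuration created:
# Show data and metadata usage for thin pools and the volumes inside them.
lvs -a -o lv_name,vg_name,lv_size,data_percent,metadata_percent,pool_lv
# Treat data_percent or metadata_percent creeping toward ~80% as a warning:
# thin pools get slow and spiky before they get full.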
File-backed qcow2: snapshots are nice, but watch the overhead
qcow2 is convenient: snapshots, compression, encryption features. It also adds copy-on-write metadata and fragmentation risk. Under random-write workloads, qcow2 can stutter noticeably compared to raw on a block device.
If you need performance and stable latency, prefer raw on a block backend. If you need features, use qcow2 but accept you might need to overprovision and defragment/convert periodically.
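If you decide to move a latency-sensitive disk off qcow2, the mechanics are routine. A hedged sketch with illustrative paths and storage IDs (yours will differ); do it with the VM stopped and a backup at hand, and note that the qm subcommand spelling varies slightly between Proxmox versions:
# Convert a qcow2 image to raw manually (paths are illustrative).
qemu-img convert -p -f qcow2 -O raw /var/lib/vz/images/103/vm-103-disk-0.qcow2 /var/lib/vz/images/103/vm-103-disk-0.raw
# On Proxmox it is usually simpler to let the storage layer handle conversion: moving the
# disk to a block-backed storage ("local-zfs" is an example storage ID) stores it as raw.
qm move-disk 103 scsi0 local-zfs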
Practical tasks: commands, outputs, and decisions (12+)
These are real tasks you can run today. Each one includes: a command, realistic output, what it means, and the decision you make.
Task 1: Identify the VM’s disk controller and cache mode in Proxmox config
cr0x@server:~$ qm config 103
boot: order=scsi0
cores: 4
memory: 8192
name: api-prod-03
net0: virtio=DE:AD:BE:EF:10:03,bridge=vmbr0
ostype: l26
scsihw: virtio-scsi-single
scsi0: rpool:vm-103-disk-0,cache=none,discard=on,iothread=1
agent: 1
What it means: scsihw is the controller model; disk line shows cache=none, discard enabled, and iothread enabled.
Decision: if you see sata0 or scsihw: lsi with heavy IO, plan a migration to virtio-scsi-single or virtio-blk. If cache is writeback, validate power protection and failure domain.
Task 2: Confirm what QEMU actually launched (trust but verify)
cr0x@server:~$ ps -p $(pgrep -f "kvm -id 103") -o pid,cmd --cols 200
PID CMD
24891 /usr/bin/kvm -id 103 -name vm103 -m 8192 -smp 4 ... -device virtio-scsi-pci,id=scsihw0 ... -device scsi-hd,drive=drive-scsi0 ... -drive file=/dev/zvol/rpool/vm-103-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=native ...
What it means: confirms the effective cache and AIO mode; sometimes GUI settings differ from reality due to older configs or overrides.
Decision: if you expected aio=native and you see threads, investigate backend compatibility and Proxmox version settings.
Task 3: Check host storage latency and saturation under load
cr0x@server:~$ iostat -x -d 1 3
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 40.0 620.0 2.8 62.5 205.0 5.2 8.4 0.3 19.0
What it means: device is not saturated (%util low), latency is acceptable.
Decision: if host device latency is low but guest is stuttering, suspect controller/cache/queueing inside virtualization layers.
Task 4: Identify which process is generating IO on the host
cr0x@server:~$ pidstat -d 1 3
Linux 6.8.12-pve (pve01) 12/26/2025 _x86_64_ (32 CPU)
# Time UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:01:01 0 24891 120.00 54000.00 0.00 12 kvm
12:01:01 0 1820 0.00 220.00 0.00 1 zfs
What it means: the VM’s QEMU process is responsible for the majority of writes; you’re looking at the right thing.
Decision: if IO is coming from pvestatd, backups, or replication jobs, schedule/limit those before tuning VM disks.
Task 5: Check guest-level disk latency and queue depth
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s aqu-sz await r_await w_await %util
vda 5.0 220.0 320.0 22000.0 8.4 38.0 2.0 38.8 98.0
What it means: the guest sees the disk nearly saturated with high write wait time. Could be real backend latency, or virtual queueing/flush behavior.
Decision: correlate with host iostat. If host latency is low but guest awaits are high, suspect queueing in controller/emulation or flush storm behavior.
Task 6: Check if the guest is issuing lots of flushes
cr0x@server:~$ grep -R . /sys/block/vda/queue/* 2>/dev/null | egrep "write_cache|nomerges|max_sectors_kb"
/sys/block/vda/queue/max_sectors_kb:1280
/sys/block/vda/queue/nomerges:0
/sys/block/vda/queue/write_cache:write through
What it means: write through suggests the device presents write-through caching; flush behavior matters.
Decision: if workloads are fsync-heavy and you’re seeing flush storms, prefer cache=none and ensure backend handles sync writes well (ZFS SLOG, Ceph tuning, etc.).
Task 7: Measure filesystem commit behavior in the guest (journaling pressure)
cr0x@server:~$ dmesg | tail -n 8
[ 9123.112233] EXT4-fs (vda2): re-mounted. Opts: (null)
[ 9450.774411] INFO: task jbd2/vda2-8:341 blocked for more than 120 seconds.
[ 9450.774419] Tainted: G W 6.5.0-28-generic #29-Ubuntu
[ 9450.774425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
What it means: the ext4 journal thread is blocked—classic symptom of storage stalls, often on flush/commit paths.
Decision: treat this as a severity signal. Fix IO latency first; don’t “tune ext4” to hide it.
Task 8: Check Proxmox host memory pressure (host cache thrash causes stutter)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 125Gi 98Gi 1.2Gi 2.0Gi 26Gi 18Gi
Swap: 16Gi 12Gi 4.0Gi
What it means: swap usage is high; host might be under memory pressure, making IO slower and more jittery.
Decision: reduce overcommit, add RAM, or stop using host page cache for VM disks. cache=none helps by reducing host cache dependence, but you still need enough memory.
Task 9: Inspect ZFS pool health and latency indicators (if using ZFS)
cr0x@server:~$ zpool status -v
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
errors: No known data errors
What it means: pool is healthy; no obvious failing device causing retries.
Decision: if you see degraded vdevs or growing error counts, stop tuning VMs and replace hardware / resilver first.
Task 10: Check ZFS sync settings and whether you’re cheating durability
cr0x@server:~$ zfs get -o name,property,value -s local,default sync rpool
NAME PROPERTY VALUE
rpool sync standard
What it means: sync writes are honored normally.
Decision: if someone set sync=disabled to “fix performance,” treat it as a risk item. If performance is unacceptable with sync=standard, add a proper SLOG or redesign storage.
Task 11: Confirm discard/TRIM is enabled end-to-end (thin pools and SSDs)
cr0x@server:~$ qm config 103 | grep -E "discard|ssd|iothread"
scsi0: rpool:vm-103-disk-0,cache=none,discard=on,iothread=1
What it means: discard is enabled at the virtual disk layer.
Decision: if you use thin provisioning (LVM-thin, ZFS thin-ish behavior, Ceph), consider discard to prevent space blowups—but test, because discard can introduce bursts depending on guest behavior.
Task 12: Check whether the guest is actually using virtio drivers and multi-queue (guest side)
cr0x@server:~$ lsmod | egrep "virtio_blk|virtio_scsi|scsi_mod" | head
virtio_scsi 28672 2
scsi_mod 274432 3 virtio_scsi,sd_mod,sg
virtio_pci 32768 0
virtio_ring 40960 2 virtio_net,virtio_scsi
What it means: virtio-scsi is loaded; you’re not on emulated SATA drivers.
Decision: if virtio modules aren’t present, you may be using the wrong controller or an old initramfs. Fix driver availability before changing other knobs.
Task 13: Measure flush-heavy behavior with a quick, honest fio test (host or guest)
cr0x@server:~$ fio --name=syncwrite --filename=/var/lib/testfile --size=1G --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 --sync=1 --time_based --runtime=20
syncwrite: (groupid=0, jobs=1): err= 0: pid=2123: Thu Dec 26 12:10:01 2025
write: IOPS=3200, BW=12.5MiB/s (13.1MB/s)(250MiB/20001msec)
clat (usec): min=120, max=42000, avg=310.42, stdev=900.12
What it means: max completion latency hits 42ms in this run; in stutter cases you’ll see hundreds of ms or seconds.
Decision: if max latency is huge, focus on sync path: cache mode, backend sync handling (ZFS SLOG), Ceph health, or host saturation.
Task 14: Check host block queue settings (sometimes a silent limiter)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] none kyber bfq
What it means: the host is using mq-deadline, a reasonable default for many SSD/NVMe workloads.
Decision: if you’re on rotational disks, bfq or deadline choices can change latency. For NVMe, scheduler changes usually aren’t the first fix, but they can help fairness under mixed loads.
Task 15: Find whether backups/replication are colliding with VM IO
cr0x@server:~$ systemctl list-timers --all | egrep "pve|vzdump" || true
Thu 2025-12-26 12:30:00 UTC 20min left Thu 2025-12-26 12:00:01 UTC 9min ago vzdump.timer vzdump backup job
What it means: a backup job is scheduled and may be running frequently; backups can create read storms and snapshot overhead.
Decision: if stutter aligns with backup windows, throttle/schedule backups, use snapshot modes appropriate for the backend, and isolate backup IO where possible.
Three corporate mini-stories from the IO trenches
Mini-story 1: The incident caused by a wrong assumption
The team inherited a Proxmox cluster running “mostly fine” for internal services. A new customer-facing API was deployed into a VM with a small PostgreSQL instance. Load wasn’t massive; it was just constant. Within a day, on-call started seeing periodic latency spikes and odd timeouts. CPU graphs looked polite. Network looked boring. The VM was “healthy.”
Someone assumed the storage backend was slow and pushed the usual fix: “enable writeback cache, it’ll smooth things out.” It did—briefly. Throughput improved. The stutter got less frequent, which made the change look brilliant. Then the host rebooted unexpectedly after a power event that was short enough to avoid a clean shutdown but long enough to ruin your evening.
The database came back with corruption symptoms: missing WAL segments and inconsistent state. Postgres did its job and refused to pretend everything was fine. Recovery worked, but it wasn’t quick, and it wasn’t fun. The uncomfortable part wasn’t the outage; it was realizing the “performance fix” had changed the durability contract without anyone explicitly agreeing to it.
What actually solved the stutter later wasn’t writeback. They moved the VM disk from an emulated controller to virtio-scsi-single, enabled iothread, used cache=none, and fixed backend sync latency with a proper write-optimized device. Latency flattened. The durability story stayed intact. The lesson stuck: never “assume” a cache mode is just a speed dial.
Mini-story 2: The optimization that backfired
A different org ran mixed workloads: CI runners, a log ingestion pipeline, and a few stateful services. They noticed CI jobs were slow on disk-heavy phases, so they tuned aggressively. They changed disks to virtio-blk, set high iodepth in the guest, and enabled discard everywhere to keep thin pools tidy.
Benchmarks looked better. The CI team celebrated. Two weeks later, the log ingestion system started stuttering during peak hours. It wasn’t CPU. It wasn’t network. It was IO latency spikes—sharp, periodic, and annoyingly hard to correlate. The worst part: it wasn’t a single VM. It was a pattern across many.
The root cause was a combination of “good ideas” that lined up badly: discard operations from many VMs were generating bursts of backend work, and the iodepth tuning caused bigger queues, so latency tails got longer under contention. The cluster wasn’t “slower,” it was more jittery. The p50 improved while p99 got ugly, which is how you fool yourself with averages.
The fix was boring: limit discard behavior (make it scheduled rather than constant), reduce queue depth for latency-sensitive VMs, reintroduce virtio-scsi-single for certain multi-disk guests, and use iothreads selectively. They kept the CI improvements without sacrificing the ingest pipeline. The real win was learning that performance tuning is a negotiation between workloads, not a single number.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company ran Proxmox with ZFS mirrors on NVMe. Not glamorous, not cheap, but sane. They had a habit that looked like bureaucracy: every VM had a small runbook entry listing controller type, cache mode, whether iothread was enabled, and why. Nothing fancy—just enough to stop improvisation.
One day, a kernel update plus a change in guest workload triggered intermittent disk stalls in a critical VM. The VM was a message broker and it hated latency spikes. Users saw delayed processing. Engineers started hunting for “what changed.” The runbook made it quick: no recent VM config changes. Controller and cache were as expected. That ruled out half the usual suspects.
They went straight to host metrics and saw elevated latency on one NVMe device during bursts. It wasn’t failing, but it was behaving oddly. A firmware bug was suspected. Because they had consistent VM settings, they could reproduce the issue with targeted tests and isolate it to the device rather than arguing about QEMU flags.
The mitigation was clean: migrate the VM disks off the suspect device, apply firmware, and validate with the same fio profiles they used in acceptance tests. The outage was contained. The postmortem was short. Nobody had to explain why a “quick” writeback change was made at 3 a.m. Boring practices don’t trend on social media; they keep revenue attached to reality.
Common mistakes: symptom → root cause → fix
1) Symptom: periodic 1–5 second freezes; average throughput looks fine
Root cause: flush storms triggered by fsync/journal commits + cache mode/backend that turns flush into a stall.
Fix: use cache=none, enable iothread, ensure backend handles sync writes (ZFS with proper SLOG, Ceph healthy and not recovering), avoid qcow2 for heavy sync workloads.
2) Symptom: guest shows high await, host shows low await
Root cause: virtual queueing/controller contention (shared queue), QEMU main-loop contention, or suboptimal AIO mode.
Fix: switch to virtio-scsi-single or verify virtio-blk multiqueue where appropriate; enable iothread; consider aio=native for direct IO paths.
3) Symptom: stutter starts after enabling writeback “for speed”
Root cause: host page cache flush and writeback throttling; memory pressure amplifies it.
Fix: revert to cache=none unless you have power-loss protection; add RAM or reduce VM memory overcommit; isolate IO-heavy VMs to dedicated storage.
4) Symptom: thin pool fills unexpectedly; VM gets slow near full
Root cause: LVM-thin metadata pressure and near-full behavior; discard not enabled or not effective.
Fix: monitor thin pool usage, keep headroom, enable discard thoughtfully, run periodic fstrim inside guests, and avoid overcommitting thin pools for write-heavy workloads.
5) Symptom: Ceph-backed VMs stutter during the day, fine at night
Root cause: recovery/backfill or uneven OSD performance during business-hour load; network contention.
Fix: adjust recovery scheduling/limits, fix slow OSDs, ensure dedicated storage network capacity, and verify client-side caching/flush settings aren’t exacerbating tail latency.
6) Symptom: multi-disk VM; one disk’s workload disrupts the other
Root cause: shared controller/queue and no iothread isolation; IO merge/scheduler effects.
Fix: use virtio-scsi-single, enable per-disk iothreads, and separate disks by purpose (DB vs logs) with clear priorities.
7) Symptom: benchmarks show great sequential throughput, app still stutters
Root cause: wrong benchmark profile; real workload is small random sync writes with strict latency requirements.
Fix: test with 4k/8k random writes, low iodepth, and sync/fsync patterns. Optimize for latency tails, not peak MB/s.
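Here is what a latency-honest profile looks like, as a hedged sketch: small random writes, queue depth 1, an fsync after every write, and judgment based on completion-latency percentiles instead of the bandwidth line. The file path is illustrative; run it where the application actually writes:
# Latency-tail oriented fio profile: judge it by the clat percentiles, not MB/s.
fio --name=committest --filename=/var/lib/app/fio-testfile --size=1G \
  --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
  --direct=1 --fsync=1 --time_based --runtime=60 \
  --percentile_list=50:95:99:99.9
# Remove the test file afterwards; it counts against thin pool space like any other data.
rm /var/lib/app/fio-testfile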
Checklists / step-by-step plan
Step-by-step: the “fix stutter without gambling data” plan
- Measure first. Capture guest iostat -x and host iostat -x during stutter. Save outputs in the ticket.
- Verify controller type. If you’re on SATA/IDE/emulated LSI without a reason, schedule a change to virtio.
- Pick a controller:
- Default to virtio-scsi-single for general-purpose Linux VMs.
- Use virtio-blk if you want simplicity and have a single busy disk, and you’ve tested latency.
- Set cache mode to none. It’s the safest performance option for most production systems.
- Enable iothread for the busy disk(s). Start with one per important disk.
- Confirm AIO mode. Prefer aio=native when supported and stable; otherwise accept threads and rely on iothreads.
- Check host memory. If the host swaps, you will have IO weirdness. Fix memory pressure.
- Backend-specific fixes:
- ZFS: ensure sync write path is fast enough; consider proper SLOG for sync-heavy VMs.
- Ceph: ensure the cluster is healthy and not recovering; find slow OSDs.
- LVM-thin: keep free space headroom; watch metadata; manage discard.
- Re-test with realistic IO. Use fio with sync patterns, not just sequential writes.
- Roll out gradually. Change one VM at a time, validate p95/p99 latency, then standardize.
Checklist: what to avoid when you’re tired and on-call
- Don’t enable cache=unsafe. Ever.
- Don’t set ZFS sync=disabled as a “temporary” fix without a written risk signoff.
- Don’t tune only for throughput; latency tails are what users feel.
- Don’t enable discard everywhere without understanding your thin backend behavior.
- Don’t assume “NVMe” means “low latency.” It means “faster at failing loudly when saturated.”
Quick change examples (with commands)
These examples are typical Proxmox CLI operations. Always do this in a maintenance window and with a backup/snapshot plan appropriate to your backend.
cr0x@server:~$ qm set 103 --scsihw virtio-scsi-single
update VM 103: -scsihw virtio-scsi-single
What it means: sets the SCSI controller model.
Decision: proceed if the guest supports virtio and you can reboot if needed.
cr0x@server:~$ qm set 103 --scsi0 rpool:vm-103-disk-0,cache=none,iothread=1,discard=on
update VM 103: -scsi0 rpool:vm-103-disk-0,cache=none,iothread=1,discard=on
What it means: enforces cache mode and enables iothread on that disk.
Decision: after reboot, validate guest latency and application behavior.
FAQ
1) Should I use virtio-scsi-single or virtio-blk for Linux VMs?
Default to virtio-scsi-single if you care about predictable latency and have multiple disks or mixed IO. Use virtio-blk for simple, single-disk VMs where you’ve measured and it behaves well.
2) Is cache=none always the best choice?
For most production Proxmox setups: yes. It avoids double caching and reduces host memory side effects. The exceptions are niche—usually when you’re deliberately using host cache for read-heavy workloads and you have enough RAM and operational discipline.
3) Why did cache=writeback make my VM “faster” but less stable?
Because it acknowledges writes earlier by buffering them in host RAM, then flushes later in bursts. That can create periodic stalls and increases risk on power loss or host crashes.
4) Do iothreads always help?
They often help latency and concurrency, especially under load. They can be neutral or slightly negative on tiny workloads where overhead dominates. Enable them for the disks that matter, not automatically for every disk in every VM.
5) How do I know if stutter is the backend (ZFS/Ceph) vs VM settings?
Compare host device latency (iostat -x on the host) with guest latency. If the host is fine but the guest is not, suspect virtual controller/queueing/cache mode. If host latency is bad, fix the backend first.
6) Can I fix ZFS VM stutter by setting sync=disabled?
You can “fix” it the way you can fix a smoke alarm by removing the battery. It trades durability for speed. If you need performance with sync-heavy workloads, use a proper SLOG device with power-loss protection or adjust architecture.
7) Does qcow2 cause stutter?
It can, especially under random writes and when fragmented. If you need stable latency for databases, prefer raw on a block backend (zvol, LVM LV, RBD) unless you truly need qcow2 features.
8) Should I enable discard/TRIM for VM disks?
Often yes for SSD-backed thin provisioning, but be deliberate. Continuous discard can create bursts of backend work. Many teams prefer periodic fstrim in the guest plus discard enabled at the virtual layer, then measure.
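On most systemd-based guests the periodic approach is already packaged as a weekly timer. A quick check-and-enable sketch inside the guest:
# Is periodic TRIM already scheduled?
systemctl status fstrim.timer --no-pager
# Enable it if present but inactive.
systemctl enable --now fstrim.timer
# One-off manual trim with verbose output; also a handy end-to-end test that discard works.
fstrim -av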
9) Why is my VM disk named sda even though I selected virtio?
Inside the guest, naming depends on the driver and controller. virtio-blk often appears as vda. virtio-scsi often appears as sda but still uses virtio transport. Verify using lspci/lsmod, not just the device name.
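A quick way to verify the transport inside the guest, as a sketch:
# Virtio devices appear as virtio PCI devices regardless of what the block device is named.
lspci | grep -i virtio
# virtio-blk disks (vdX) have no SCSI address; virtio-scsi disks (sdX) show an HCTL entry.
lsblk -o NAME,HCTL,MODEL,SIZE
# Confirm which virtio block modules the kernel actually loaded.
lsmod | grep -E 'virtio_blk|virtio_scsi'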
10) What’s the single most common reason for “random” stutter in Proxmox?
Host-level contention: backups, replication, scrubs, Ceph recovery, or a saturated disk. Second most common: writeback cache plus memory pressure creating flush storms.
Conclusion: next steps you can do today
If your Proxmox Linux VM has slow disk stutter, don’t start with folklore. Start with a short measurement loop: guest vmstat/iostat, host iostat, and the VM’s real QEMU flags. Then make the changes that reliably improve latency tails:
- Move off emulated controllers; use virtio-scsi-single or virtio-blk.
- Prefer cache=none unless you can justify writeback with power protection and operational guarantees.
- Enable iothread for the disks that carry your workload’s pain.
- Make backend-specific fixes instead of blaming the VM: ZFS sync path, Ceph health, thin pool headroom.
Do one VM, measure p95/p99 latency before and after, and then standardize the pattern. If it still stutters after that, congratulations: you’ve eliminated the easy causes, and now you get to do real engineering. That’s the job.