Proxmox: The Hidden Reason Your VM Feels Slow (Even on NVMe)

“But it’s on NVMe.” That sentence has ended more incident calls than it has solved. You bought fast media, you can benchmark it, and yet the VM feels like it’s thinking about the question before answering. Logins lag. Package installs crawl. Your database claims it’s “I/O bound,” which is both helpful and not.

The hidden reason is usually not raw throughput. It’s latency—especially write latency—created by a stack of defaults that are safe, generic, and occasionally cruel: sync semantics, caching modes, queueing, thin provisioning behavior, CoW amplification, storage replication, and host contention. NVMe doesn’t fix bad decisions; it just makes them happen faster.

The hidden reason: your VM is paying for latency you didn’t budget

NVMe marketing is a victory lap for throughput. Your users, however, experience tail latency: the 95th/99th percentile delay of “small synchronous operations that must complete before the next thing happens.” VM “slowness” is often just a thousand tiny sync writes waiting politely in line behind someone else’s idea of durability.

In Proxmox, the storage path is a layered cake:

  • Guest filesystem & application semantics (fsync, O_DIRECT, journaling, barriers)
  • Virtual device model (virtio-scsi vs virtio-blk, queues, iothreads)
  • QEMU cache mode (writeback/writethrough/none/directsync) and AIO
  • Host filesystem and volume manager (ZFS, LVM-thin, directory, Ceph RBD)
  • Host block layer (scheduler, merge behavior, multipath)
  • Physical device behavior (NVMe firmware, thermal throttling, SLC cache, power-loss protection)

The “hidden reason” is typically a mismatch between guest expectations and host guarantees. The guest issues sync writes, and somewhere below, the stack says: “Sure, I’ll confirm that when it’s actually safe.” That’s good engineering. It’s also why your install of a small package can feel like it’s doing taxes.

Opinionated guidance: stop chasing headline MB/s. Start measuring IOPS at low queue depth, fsync latency, and host wait time. Most “slow NVMe” incidents are either (a) sync write latency (ZFS, Ceph, write barriers), (b) contention/queueing (one noisy neighbor), or (c) an “optimization” like discard, compression, or snapshots interacting badly with your workload.
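Before reaching for a benchmark suite, you can get a first read on sync write latency with plain coreutils. A rough probe, not a benchmark: dd with oflag=dsync forces each 4 KiB block to be durably written before the next one, which approximates the fsync-heavy pattern that makes VMs feel sticky. The target path below is an example; point it at the filesystem you actually suspect.

```shell
# Rough probe: 200 x 4 KiB synchronous writes, then average ms per write.
# oflag=dsync makes every block wait for durable completion.
target="${TMPDIR:-/tmp}/syncprobe.$$"
count=200
start=$(date +%s%N)
dd if=/dev/zero of="$target" bs=4k count="$count" oflag=dsync status=none
end=$(date +%s%N)
rm -f "$target"
echo "avg ms per sync 4k write: $(( (end - start) / count / 1000000 ))"
```

Sub-millisecond to a few milliseconds is plausible for healthy local NVMe; tens of milliseconds per write points at the sync path, not the media.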

Joke #1: NVMe is like giving your intern a sports car—impressive until you realize they still take the scenic route to the data center.

Interesting facts and history (because defaults have backstories)

  1. QEMU’s cache modes exist because storage lies. In the early virtualization years, “writeback” could be fast but risky if the underlying stack didn’t actually persist data when it said it did.
  2. ZFS was designed around checksums and copy-on-write. That gives you integrity and snapshots, but it changes write patterns and can amplify small random writes under certain recordsize/volblocksize choices.
  3. NVMe’s big win is parallelism. The protocol supports many submission/completion queues, but a VM with a single queue can still bottleneck on CPU and locking.
  4. Linux I/O schedulers evolved for spinning rust. Some are still useful for fairness and latency control; others just add overhead to NVMe paths that don’t need reordering.
  5. Write barriers and flushes got stricter over time. Filesystems learned the hard way that “fast” without ordered durability becomes “fast corruption,” especially after power loss.
  6. Thin provisioning became popular because capacity is expensive. But thin pools can fragment, and their metadata can become a silent performance limiter when poorly monitored.
  7. TRIM/discard is not “free.” It helps SSD longevity and steady-state performance, but online discard can create periodic latency spikes depending on drive firmware and workload.
  8. Ceph’s strength is distributed durability. That durability adds network and replication latency; NVMe on the OSD nodes doesn’t magically remove quorum and acknowledgment costs.

Fast diagnosis playbook

You don’t need a week of tuning. You need 20 minutes of controlled observation. Here’s the order that finds the bottleneck fastest in production.

First: confirm the symptom is I/O latency, not CPU steal or memory pressure

  • Inside the guest: check iowait and disk latency. If CPU is pinned or memory is swapping, storage tuning is theater.
  • On the host: check for host CPU overcommit, KSM activity, memory ballooning, and swapping.

Second: identify the storage backend and its durability semantics

  • ZFS: sync behavior, SLOG presence/quality, recordsize/volblocksize, compression, special vdevs.
  • LVM-thin: data/metadata usage, discard behavior, fragmentation signs.
  • Ceph RBD: latency on the network path, OSD commit/apply, replication, client settings.
  • Directory/ext4/xfs: host mount options, write cache, barriers, filesystem health.

Third: check VM device model and queueing

  • virtio-scsi with iothread and multiple queues when appropriate.
  • Correct cache mode for backend (and honesty about the risk tradeoffs).

Fourth: find contention and tail latency

  • Look for one VM doing heavy writes, backups, or snapshot merges.
  • Measure disk utilization and queue depth on the host, not just inside the VM.

Fifth: verify the NVMe itself isn’t the bottleneck

  • Thermal throttling, firmware quirks, PCIe link speed, and power settings can turn “fast” into “polite.”

Decision rule: if you can’t point to one of (a) sync latency, (b) queueing/contention, (c) CPU/memory pressure, (d) backend misconfiguration, you’re still guessing. Keep measuring.

Practical tasks: commands, outputs, and decisions

These are the checks I actually run. Each task includes a realistic snippet of output and what decision to make from it.

Task 1: Identify the VM disk backend from Proxmox

cr0x@server:~$ qm config 104 | egrep -i 'scsi|virtio|ide|sata|boot|machine'
boot: order=scsi0;net0
machine: q35
scsi0: nvme-zfs:vm-104-disk-0,iothread=1,cache=none,discard=on,ssd=1,size=80G
scsihw: virtio-scsi-single

What it means: This VM uses a ZFS-backed volume, virtio-scsi-single, cache=none, iothread enabled, discard on.

Decision: If this VM is latency-sensitive and issues sync writes, your next checks are the ZFS sync path and whether the disk’s iothread/queue setup is limiting parallelism.

Task 2: Confirm host storage type (ZFS vs LVM-thin vs directory)

cr0x@server:~$ pvesm status
Name        Type     Status           Total            Used       Available        %
local       dir      active        19684272        4181248        15403024   21.24%
nvme-zfs    zfspool  active       960123904      412345344       547778560   42.95%
ceph-rbd    rbd      active               0               0               0    0.00%

What it means: The VM disk is on a ZFS pool called nvme-zfs.

Decision: Focus on ZFS dataset/zvol properties (sync, volblocksize, compression) and whether a proper SLOG exists.

Task 3: Check VM-visible latency inside the guest (quick and dirty)

cr0x@server:~$ iostat -x 1 5
Linux 6.1.0 (guest)  02/04/2026  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.02    0.00    3.11   22.70    0.00   66.17

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
vda              3.00   45.00   120.0  2480.0     114.7     2.80   58.20    8.20   61.60   2.10  10.20

What it means: w_await is ~62 ms. That’s not “NVMe fast.” That’s “something is forcing writes to wait.”

Decision: Look for sync write path constraints: ZFS sync, Ceph replication, cache mode, or host contention.

Task 4: Check host-side disk latency and queueing

cr0x@server:~$ iostat -x 1 5
Linux 6.5.13-pve (pve-host)  02/04/2026  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.40    0.00    4.80   18.20    0.00   66.60

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
nvme0n1          20.00  800.00   900.0 68000.0     167.5    45.20   55.10    9.80   56.30   0.45  99.00

What it means: %util is pegged and avgqu-sz is huge. The device is saturated or throttled, and latency is rising.

Decision: Find who is writing, whether it’s backups/scrubs, and whether the “fast” drive is hitting sustained-write limits or thermal throttling.

Task 5: Identify top I/O consumers on the host

cr0x@server:~$ pidstat -d 1 5
Linux 6.5.13-pve (pve-host)  02/04/2026  _x86_64_  (16 CPU)

#      Time   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
12:01:10     0      21456      0.00  82000.00  12000.00  vzdump
12:01:10     0      17890      0.00  12000.00      0.00  qemu-system-x86
12:01:10     0      17891      0.00  11800.00      0.00  qemu-system-x86

What it means: vzdump is hammering writes. The host is busy doing backups, likely causing VM latency spikes.

Decision: Rate-limit backups, move them off the primary pool, or schedule them away from peak. If you already do that, validate snapshot/merge behavior and storage backend.

Task 6: Check ZFS pool health and latency indicators

cr0x@server:~$ zpool status -v nvme-zfs
  pool: nvme-zfs
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        nvme-zfs    ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

What it means: No obvious errors. Health is not performance.

Decision: Move on to ZFS properties and sync behavior; also check whether there’s a SLOG and whether it’s appropriate.

Task 7: Check ZFS dataset and zvol properties that affect latency

cr0x@server:~$ zfs get -r compression,atime,recordsize,logbias,sync,primarycache,secondarycache nvme-zfs
NAME      PROPERTY        VALUE           SOURCE
nvme-zfs  compression     zstd            local
nvme-zfs  atime           off             local
nvme-zfs  recordsize      128K            local
nvme-zfs  logbias         latency         local
nvme-zfs  sync            standard        local
nvme-zfs  primarycache    all             default
nvme-zfs  secondarycache  all             default

What it means: sync=standard means sync writes are honored. Good. If your workload does lots of fsync, latency depends on the pool’s ability to commit safely.

Decision: If sync latency is high and you care about correctness, add a proper SLOG (power-loss protected) or redesign the write pattern. Don’t flip sync=disabled unless you enjoy explaining data loss to adults.

Task 8: Confirm whether a SLOG exists and is actually separate

cr0x@server:~$ zpool status nvme-zfs | sed -n '1,40p'
  pool: nvme-zfs
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        nvme-zfs    ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

What it means: No logs vdev is present. Sync writes must be committed to the main vdev.

Decision: If you run databases, mail servers, CI runners, or anything fsync-heavy, consider a dedicated SLOG device with power-loss protection. If you can’t, at least accept the latency as the cost of durability.
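For reference, adding a log vdev later is a one-line pool change. A sketch assuming a hypothetical spare device /dev/nvme1n1 with power-loss protection; the nice part is that a log vdev can also be removed live if measurements say it didn’t help.

```shell
# Add a dedicated SLOG (hypothetical PLP-backed device):
zpool add nvme-zfs log /dev/nvme1n1

# Confirm it appears as a separate "logs" vdev:
zpool status nvme-zfs

# Remove it live if it doesn't move the latency needle:
zpool remove nvme-zfs /dev/nvme1n1
```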

Task 9: Measure ZFS latency live

cr0x@server:~$ zpool iostat -v nvme-zfs 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
nvme-zfs     393G   519G     40    820  1.2M  70.1M
  nvme0n1    393G   519G     40    820  1.2M  70.1M
----------  -----  -----  -----  -----  -----  -----

What it means: High write ops and bandwidth. This doesn’t show latency directly, but it tells you the pool is busy and you should correlate with host iostat -x and the timing of backups/scrubs.

Decision: If “busy” aligns with user pain, you need workload isolation: separate pools, backup targets, or I/O throttling via cgroups / Proxmox limits.
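Proxmox can enforce those limits per disk in the VM config. A sketch assuming VM 104 and the disk options shown in Task 1; the numbers are illustrative, not recommendations. Re-specifying the disk keeps its existing options.

```shell
# Throttle VM 104's disk to 300 write IOPS and 100 MB/s of writes so one
# noisy guest can't monopolize the pool (illustrative values).
qm set 104 --scsi0 nvme-zfs:vm-104-disk-0,iothread=1,cache=none,discard=on,ssd=1,iops_wr=300,mbps_wr=100
```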

Task 10: Check NVMe health and throttling hints

cr0x@server:~$ nvme smart-log /dev/nvme0n1 | egrep -i 'temperature|warning|critical|media|power_cycles|unsafe_shutdowns'
critical_warning                    : 0x00
temperature                         : 72 C
available_spare                     : 100%
media_errors                        : 0
power_cycles                        : 38
unsafe_shutdowns                    : 2

What it means: 72°C is warm. Some drives throttle around this range depending on model and airflow. Unsafe shutdowns also matter for integrity if you rely on volatile caches.

Decision: Fix cooling and confirm the device isn’t downshifting under sustained write. If you see thermal issues, your “slow VM” is actually “slow physics.”
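To correlate throttling with latency dips, log the controller temperature during a sustained write. A minimal sketch assuming nvme-cli is installed and /dev/nvme0 is your controller; stop it with Ctrl-C when the test window ends.

```shell
# Timestamped temperature sample every 10 seconds, appended to a log.
while true; do
  temp=$(nvme smart-log /dev/nvme0 | awk -F: '/^temperature/ {print $2}')
  printf '%s %s\n' "$(date +%T)" "$temp"
  sleep 10
done >> /var/log/nvme-temp.log
```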

Task 11: Verify PCIe link speed/width (quietly common)

cr0x@server:~$ lspci -s 03:00.0 -vv | egrep -i 'LnkSta|NVMe'
Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller
LnkSta: Speed 8GT/s (ok), Width x2 (downgraded)

What it means: The drive is running at PCIe x2, not x4. That can cap throughput and sometimes increase contention effects.

Decision: Move the device to a slot with full lanes, fix BIOS settings, or stop sharing lanes with something that shouldn’t be there.

Task 12: Check the host I/O scheduler for NVMe

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] none kyber bfq

What it means: mq-deadline is active. That’s often a sane default for mixed workloads because it controls latency under load.

Decision: If you see latency spikes under contention, keep a latency-aware scheduler (mq-deadline or kyber). If you have a single workload and want minimal overhead, test none—but only with measurements.
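If testing favors a different scheduler, the change is a sysfs write plus a udev rule to survive reboots. A sketch assuming your measurements picked none; substitute mq-deadline or kyber as your numbers dictate, and note the rule file name is arbitrary.

```shell
# Runtime change (immediate, lost on reboot):
echo none > /sys/block/nvme0n1/queue/scheduler

# Persist for all NVMe namespaces (hypothetical rule file):
cat > /etc/udev/rules.d/60-ioscheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
EOF
```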

Task 13: Inspect VM’s virtio-scsi and queue configuration

cr0x@server:~$ qm config 104 | egrep -i 'scsihw|iothread|queues|numa|cpu|machine'
machine: q35
scsihw: virtio-scsi-single
scsi0: nvme-zfs:vm-104-disk-0,iothread=1,cache=none,discard=on,ssd=1,size=80G
cpu: host
numa: 0

What it means: Despite the name, virtio-scsi-single does not funnel the whole VM through one queue; it attaches each disk to its own SCSI controller, which is what allows iothread=1 to dedicate an I/O thread per disk. The usual bottlenecks on multi-vCPU guests are a missing iothread or too few effective queues, not the controller model itself.

Decision: Confirm iothread=1 on every latency-sensitive disk and measure parallel I/O inside the guest before switching controller models. Don’t change blindly: databases often care more about latency consistency than raw parallel queues.

Task 14: Check for host memory pressure (which masquerades as “slow disk”)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        54Gi       1.2Gi       2.1Gi       6.8Gi       3.0Gi
Swap:          8.0Gi       2.6Gi       5.4Gi

What it means: Swap is in use and available memory is low. Under pressure, host reclaim and swapping can stall I/O paths and VM vCPUs.

Decision: Fix overcommit, ballooning policy, or add RAM. Storage tuning won’t out-run swapping.
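On ZFS-backed Proxmox hosts, the ARC is often the largest single memory consumer, and capping it is the standard guardrail against this failure mode. A sketch with an illustrative 8 GiB cap; size it to your host, and note the modprobe value only applies at module load.

```shell
# Cap ZFS ARC at 8 GiB (8 * 1024^3 = 8589934592 bytes; illustrative value).
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u    # Proxmox loads ZFS early, so refresh the initramfs

# Apply immediately without a reboot:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
```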

Task 15: Detect TRIM/discard behavior and whether it’s causing spikes

cr0x@server:~$ systemctl status fstrim.timer | sed -n '1,12p'
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; preset: enabled)
     Active: active (waiting) since Mon 2026-02-03 00:00:00 UTC; 1 day 12h ago
    Trigger: Mon 2026-02-10 00:00:00 UTC; 5 days left

What it means: Weekly TRIM is scheduled. That’s usually better than continuous discard for latency consistency.

Decision: Prefer periodic TRIM over always-on discard for many server workloads. If you’re using discard=on in VM disk config, confirm it’s not creating latency spikes during heavy churn.
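Inside a guest, one line of /proc/mounts tells you whether the root filesystem runs continuous discard. A small self-contained check; if discard shows up, the periodic fstrim.timer above is usually the gentler choice.

```shell
# Print the mount options for / and flag continuous discard.
opts=$(awk '$2 == "/" {print $4; exit}' /proc/mounts)
echo "root mount options: $opts"
case "$opts" in
  *discard*) echo "continuous discard: ON" ;;
  *)         echo "continuous discard: off" ;;
esac
```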

Task 16: Identify snapshot/backup merges (a classic “why now?”)

cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s creation | tail -n 5
nvme-zfs/vm-104-disk-0@vzdump-2026_02_04-000001  1.2G  Tue Feb  4 00:00 2026
nvme-zfs/vm-107-disk-0@replica-2026_02_04-001500 800M  Tue Feb  4 00:15 2026
nvme-zfs/vm-104-disk-0@replica-2026_02_04-001500 650M  Tue Feb  4 00:15 2026
nvme-zfs/vm-104-disk-0@replica-2026_02_04-003000 700M  Tue Feb  4 00:30 2026
nvme-zfs/vm-104-disk-0@replica-2026_02_04-004500 720M  Tue Feb  4 00:45 2026

What it means: Frequent snapshots from backup/replication. Snapshot churn can increase write amplification on CoW systems and degrade locality.

Decision: Reduce snapshot frequency for high-churn volumes, separate backup targets, or tune retention to avoid long chains of old snapshots pinning blocks.

Deep dives: what actually makes NVMe feel slow

1) Throughput lies; latency tells the truth

Most “my VM is slow” tickets are a human reporting that interactions feel sticky: SSH takes a second longer, package managers pause, websites stutter. Those are small I/O operations with dependency chains. A single fsync that takes 40 ms doesn’t show up as a bandwidth problem, but it can ruin a transaction log, an apt install, or a journaling filesystem.

NVMe shines at high IOPS and high throughput, but only if the stack can keep the queues full and complete work consistently. A VM workload often runs at low queue depth with frequent flushes. If your backend turns those flushes into expensive waits, your NVMe becomes a very fast device that’s mostly waiting for permission.
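A toy illustration of the point, with ten hypothetical operation latencies: nine fast writes and one slow flush. The average looks tolerable; the tail is what the user feels.

```shell
# Hypothetical per-operation latencies in milliseconds.
lat="5 5 5 5 5 5 5 5 5 400"
summary=$(echo "$lat" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END { printf "avg=%.1fms p90=%dms max=%dms", sum/NR, v[int(NR*0.9)], v[NR] }')
echo "$summary"
# → avg=44.5ms p90=5ms max=400ms
```

An average of 44.5 ms sounds survivable; the 400 ms flush is the ticket.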

2) Sync writes: ZFS and the cost of being honest

ZFS’s default stance is: when an application asks for durability, ZFS takes it seriously. That means sync writes are committed to stable storage before acknowledging. On a pool without a dedicated ZIL/SLOG device, those commits land on the main vdev(s). With a single NVMe, that can still be fine—until you mix workloads, trigger sustained writes, or hit firmware behavior that makes flushes expensive.

The trap is assuming “single NVMe = instant sync.” Many consumer NVMe devices accelerate writes using volatile caches and folding mechanisms. If they lack power-loss protection, they must be conservative about flush semantics, and they may do internal housekeeping that spikes latency. ZFS doesn’t know your drive’s marketing; it only knows your drive’s promises.

Do: for sync-heavy workloads, use an enterprise NVMe with power-loss protection or a separate SLOG device that is explicitly designed for low-latency sync writes.

Avoid: setting sync=disabled as a “performance fix” unless you’re comfortable losing seconds of acknowledged writes on a crash. That’s not hypothetical; it happens in real outages.

3) QEMU cache modes: performance is a contract, not a vibe

In Proxmox, disk cache mode changes how QEMU interacts with the host page cache and flush behavior. The short version:

  • cache=none: direct I/O (O_DIRECT) bypasses host page cache. Often good for reducing double caching and making latency more predictable.
  • cache=writeback: can be fast, but relies on host caching. It can also increase risk if the underlying stack or hardware lies about durability.
  • cache=writethrough/directsync: more conservative; can amplify latency.

There is no “best.” There is only “best for your durability and performance requirements.” If you’re running databases and you value correctness, align cache mode and backend so that flushes mean what the guest thinks they mean.

4) Virtio queues and iothreads: where parallel I/O serializes

Despite the name, virtio-scsi-single is not a one-queue penalty box. It attaches each disk to its own SCSI controller, which is precisely what lets iothread=1 dedicate an I/O thread per disk. The real serialization risks are a disk running without an iothread—its I/O then shares the QEMU main loop—and too few effective queues for a multi-vCPU guest.

If your workload does multiple independent I/O streams—think build servers, multiple containers in one VM, or a busy file server—a starved queue path can create artificially high latency even when the NVMe is idle-ish.

Do: keep virtio-scsi-single with iothread=1 on disks that matter, and verify inside the guest that the block device actually has multiple hardware queues (ls /sys/block/sda/mq/).

Avoid: turning everything into “more queues” without observing. Some workloads get worse when you increase concurrency because tail latency spreads under contention.
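Whichever controller model you settle on, verify iothread is actually set per disk. A sketch assuming VM 104; re-specifying the disk keeps its other options, and the VM needs a full stop/start (not a guest-side reboot) to pick up device changes.

```shell
# Ensure the disk has a dedicated I/O thread (options as in Task 1):
qm set 104 --scsi0 nvme-zfs:vm-104-disk-0,iothread=1,cache=none,discard=on,ssd=1
qm stop 104 && qm start 104
```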

5) Thin provisioning: the slow creep of metadata

LVM-thin is perfectly serviceable, and it’s common in Proxmox setups. But thin pools have two performance foot-guns:

  • Metadata pressure: when metadata is tight or heavily updated, latency spikes.
  • Fragmentation: over time, allocation becomes scattered, especially under snapshot churn and random writes.

If you run thin pools like they’re magic, they will eventually remind you they are accounting systems with a block device hobby.

6) Snapshot and backup churn: CoW isn’t free

Proxmox makes backups and snapshots accessible, which is good. It also makes it easy to create a constant background write amplification machine:

  • Frequent snapshots pin old blocks.
  • Random writes must allocate new blocks (CoW), increasing fragmentation.
  • Backup processes read entire disks and can push caches out, increasing latency for interactive workloads.

Backups are necessary. But “every 15 minutes for everything” is a policy written by someone who doesn’t get paged.

7) Contention: the noisy neighbor is usually you

A single NVMe can handle impressive IOPS, but mixed read/write workloads with sync flushes can still saturate it. Worse: one VM doing large sequential writes (backup target, log aggregator) can ruin latency for another VM doing small random sync writes (database). Both are “fine” individually. Together, they fight.

Isolation beats hero tuning. Separate pools. Separate devices. Separate backup traffic. Or at least use Proxmox I/O limits to enforce fairness.

8) The NVMe itself: SLC cache, firmware, and thermals

Consumer NVMe drives often advertise burst performance aided by SLC caching. Under sustained writes, they fall off a cliff. You don’t notice during benchmarks that run for 30 seconds; you notice at 02:00 when your backups and replication run for 90 minutes.

Thermals are the other quiet killer. NVMe controllers will throttle to protect themselves, and your “fast storage” becomes “well-behaved storage.” This can produce the most annoying graph in operations: a perfect sawtooth of performance as the device heats and cools.

Joke #2: I’ve seen more performance “mysteries” solved by a fan than by a PhD, which is both humbling and loud.

9) Reliability mindset: a quote worth keeping

Paraphrased idea (with attribution): Gene Kim has often pushed the operations principle that improving flow requires reducing work-in-progress and feedback delay, not just adding horsepower.

Storage performance problems are frequently flow problems: too many concurrent background tasks, too much queueing, and too little visibility into latency.

Three corporate mini-stories from the real world

Mini-story 1: The incident caused by a wrong assumption (“NVMe means sync is cheap”)

A mid-sized SaaS company migrated a few critical PostgreSQL VMs to a new Proxmox cluster. The pitch was straightforward: local NVMe mirrors, ZFS for snapshots, and Proxmox replication for quick recovery. The first week was quiet. The second week featured a slow-motion outage during a routine peak.

The application tier started timing out. Not everywhere—just enough to be confusing. CPU was fine. Network was fine. From inside the database VM, iostat showed ugly write waits. The team’s first reaction was to blame the guest filesystem and tune mount options. That did nothing.

On the host, it became obvious: sync writes were stacking up behind other writes, and the NVMe drives showed long tail latency under sustained mixed load. The wrong assumption was that NVMe makes fsync basically free. It doesn’t. It makes it possible to be fast—if your device and configuration support low-latency durable commits.

The fix was boring: add proper power-loss protected devices as dedicated SLOG, reduce snapshot frequency on the database zvols, and schedule replication away from peak. The “NVMe is fast” story turned into “durability has a cost, pay it explicitly.”

Mini-story 2: The optimization that backfired (“turn on discard everywhere”)

An enterprise IT team standardized Proxmox templates. They enabled discard=on for every VM disk and switched guest filesystems to continuous discard because it looked clean on paper: better SSD wear leveling, better space reclamation, fewer storage surprises. They rolled it out steadily, with a satisfying sense of maturity.

Then the helpdesk started seeing “random freezes” in VMs. Not full hangs. Just periodic multi-second stalls. It was worst on busy Windows terminal servers and a few Linux VMs with heavy churn in temp directories. Users described it as “the keyboard stops working for a second.” Engineers described it as “not reproducible,” which is a nicer way of saying “not yet.”

Host metrics eventually correlated the stalls with discard bursts. Some NVMe models handled it gracefully; others turned discard into internal garbage collection events with noticeable latency spikes. Under virtualization, those spikes were amplified because multiple guests were doing the same “optimization” at once.

The rollback was surgical: disable continuous discard in guests, keep periodic fstrim, and only enable discard=on for volumes that truly needed online space reclamation. The optimization didn’t fail because discard is bad; it failed because “everywhere, always” is not a performance strategy.

Mini-story 3: The boring practice that saved the day (“measure tail latency and isolate backups”)

A financial services team ran Proxmox for internal tooling and a few latency-sensitive services. Nothing glamorous. They had a simple rule: backups must not share the same pool as production write-heavy databases unless there’s a measured and enforced I/O budget.

They implemented separate storage: one pool for VM disks and a second for backup targets. They also kept a small dashboard focused on 95th/99th percentile disk latency on hosts during business hours. Not just averages. Tail latency. The metric that matches human complaints.

One Tuesday, a new service deployed with chatty logging and an aggressive local retention policy. It was writing constantly. The latency dashboard lit up within minutes, and because backups were isolated, the blast radius stayed small. They throttled the noisy VM, adjusted logging, and life continued.

No incident review heroics. No “NVMe is broken” drama. Just guardrails that assumed someone, someday, would do something enthusiastic with disk writes. They were right.

Common mistakes: symptom → root cause → fix

  • Symptom: VM feels sluggish, package installs pause, databases report high fsync time.

    Root cause: Sync write latency on ZFS without suitable SLOG or on consumer NVMe under sustained mixed load.

    Fix: Add PLP-backed SLOG or enterprise NVMe; reduce snapshot churn; isolate noisy writers; keep sync=standard for correctness.

  • Symptom: Random multi-second stalls across multiple VMs at odd intervals.

    Root cause: Discard/TRIM bursts or background maintenance (fstrim, drive GC) causing latency spikes.

    Fix: Prefer periodic fstrim over continuous discard; test discard=on selectively; check NVMe thermals and firmware behavior.

  • Symptom: High disk utilization on host, but individual VMs show low throughput; everything “just slow.”

    Root cause: Contending workloads (backups, replication, scrubs, another VM) queueing small writes and flushes onto an already saturated device.

    Fix: Move backups off the pool; schedule heavy jobs; add I/O limits; separate pools for different latency classes.

  • Symptom: Single VM with multiple vCPUs can’t exceed modest IOPS; latency rises when parallel jobs run.

    Root cause: Disk running without a dedicated iothread, or too few effective virtio queues for the guest’s parallelism.

    Fix: Enable iothread=1 per disk (virtio-scsi-single makes this per-disk); verify queue counts in the guest; re-test with fio.

  • Symptom: Performance is great for 30–60 seconds, then collapses during long writes.

    Root cause: NVMe SLC cache exhaustion and sustained-write throttling.

    Fix: Choose drives rated for sustained writes; overprovision; improve cooling; separate backup/ingest workloads.

  • Symptom: After enabling compression/dedup “for efficiency,” latency gets worse.

    Root cause: CPU overhead and amplification; dedup is especially brutal without the right memory and workload characteristics.

    Fix: Use compression thoughtfully (often yes); avoid dedup for VM disks unless you have a proven case; measure CPU and latency impact.

  • Symptom: “NVMe is fast” but host shows swapping; VMs stall during memory pressure.

    Root cause: Host memory overcommit; swapping and reclaim stalls I/O and CPU scheduling.

    Fix: Fix RAM sizing, ballooning, and ZFS ARC limits; stop swapping on hypervisors unless you know why you want it.

  • Symptom: Storage benchmarks look fine, but real apps are slow.

    Root cause: Benchmarks test sequential throughput or deep queue; your app is sync-heavy at low queue depth.

    Fix: Benchmark the right thing: low-QD random reads/writes, fsync latency, and tail percentiles—inside the guest and on the host.
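A fio invocation that matches that advice, assuming fio is installed in the guest: queue depth 1, 4 KiB random writes, an fsync after every write. Read the completion-latency percentiles in the output, not the bandwidth line.

```shell
# Low-QD sync write test: the honest pattern for transactional workloads.
fio --name=sync-lat --filename=/var/tmp/fio.test --size=256M \
    --rw=randwrite --bs=4k --iodepth=1 --fsync=1 \
    --ioengine=psync --runtime=30 --time_based --group_reporting
rm -f /var/tmp/fio.test
```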

Checklists / step-by-step plan

Step-by-step: diagnose one slow VM without boiling the ocean

  1. Confirm the problem is consistent. Get a time window and correlate with backups, scrubs, replication, and deploys.
  2. Inside guest: check iostat -x and vmstat 1. If iowait is high, proceed. If swap is active, fix memory first.
  3. On host: run iostat -x for the NVMe device. Look for high await, high avgqu-sz, and pegged %util.
  4. Find the writer: pidstat -d and correlate to qemu-system-x86 PIDs or backup tools.
  5. Identify backend: pvesm status and qm config. ZFS? LVM-thin? Ceph? Different playbooks apply.
  6. Validate VM device model: virtio-scsi-single vs virtio-scsi-pci, iothread enabled, cache mode appropriate.
  7. Check NVMe reality: nvme smart-log (thermals), lspci -vv (lane width), and ensure it’s not throttling.
  8. Make one change at a time. Re-test the original user operation (not just a synthetic benchmark).

Checklist: “do this, avoid that” for Proxmox storage performance

  • Do: monitor disk latency percentiles on hosts. Avoid: relying on average throughput graphs.
  • Do: isolate backup traffic. Avoid: running heavy backups on the same pool as low-latency databases during peak.
  • Do: use PLP-backed devices for sync acceleration. Avoid: consumer drives for write-ahead log durability expectations.
  • Do: keep snapshot policies intentional. Avoid: high-frequency snapshots on high-churn volumes without measuring impact.
  • Do: choose virtio and queueing deliberately. Avoid: sticking with defaults when you have evidence they bottleneck.
  • Do: treat discard as a tool. Avoid: enabling continuous discard everywhere because it “sounds clean.”

FAQ

1) Why is my VM slow when the NVMe benchmarks at gigabytes per second?

Because your workload likely depends on low-queue, synchronous latency (fsync/flush), not sequential throughput. Measure await and fsync times, not just MB/s.

2) Should I set ZFS sync=disabled to fix latency?

Only if you accept losing acknowledged writes on a crash. For databases and most stateful services, that’s a bad trade. Fix the sync path with proper hardware (PLP-backed SLOG/enterprise NVMe) or workload isolation instead.

3) Is cache=none always best in Proxmox?

No. It’s often a good default for predictability and avoiding double caching, but the “best” mode depends on backend and durability requirements. Treat cache mode as part of a correctness contract.

4) What’s wrong with virtio-scsi-single?

Less than the name suggests. It gives each disk its own controller, which is exactly what lets iothread work per disk. Verify that iothread is actually enabled and that the guest sees enough queues for its parallelism; if IOPS plateau under parallel jobs, measure CPU/lock contention before blaming the controller model.

5) Are snapshots inherently slow?

Snapshots are not inherently slow, but frequent snapshots on high-churn volumes increase copy-on-write overhead and fragmentation. The pain often shows up during merges, backups, and sustained random writes.

6) Does enabling discard improve performance?

It can improve steady-state behavior and space reclamation, but it can also introduce latency spikes. Many environments do better with periodic fstrim than continuous discard, especially across many VMs.

7) How do I know if the NVMe is throttling?

Check temperature via nvme smart-log and correlate performance drops with rising thermals. Also watch for sawtooth patterns in latency and throughput under sustained writes.

8) Should I change the Linux I/O scheduler for NVMe?

Sometimes. none can reduce overhead, while mq-deadline or kyber can improve latency fairness under contention. Choose based on measured tail latency, not ideology.

9) Why do backups make interactive VMs slow even if backups are “read-heavy”?

Reads can evict caches and increase device queueing, and snapshot-based backups can trigger additional metadata work. If backups contend for the same pool, latency-sensitive workloads will feel it.

10) What’s the single most effective fix for “slow VMs on fast storage”?

Isolation. Separate pools (or at least strict I/O limits) for different latency classes, and stop mixing backup/scrub/replication storms with interactive transactional workloads.

Practical next steps

If you want the VM to stop feeling slow, do these in order:

  1. Measure latency, not throughput. Capture iostat -x on guest and host during the complaint window.
  2. Identify the backend. ZFS vs LVM-thin vs Ceph changes what “normal” means and what “fix” looks like.
  3. Find contention. Backups and snapshot churn are repeat offenders. Confirm with pidstat -d and timing correlation.
  4. Fix the durability path honestly. If you need fast sync writes, use hardware that can deliver them safely (PLP-backed SLOG/enterprise NVMe) or accept the latency as correctness cost.
  5. Tune VM queueing deliberately. If you’re on virtio-scsi-single and the workload is parallel, switch to a better device model and validate with real workload tests.
  6. Stop “optimizing” globally. Discard, aggressive snapshots, and overly clever caching are great ways to create fleet-wide jitter. Make changes per workload class.

The payoff is not just faster VMs. It’s fewer mystery tickets, fewer late-night blame games, and a storage stack that behaves like a system—not a rumor.
