Performance problems on Proxmox almost never show up as a single clean error. They show up as “the database feels sticky,” “RDP is laggy,” “Kubernetes nodes flap,” or “why is the host load 60 when CPU is 10%?” You’ll get contradictory graphs and confident theories from everyone in the room.
This is a field guide for breaking the stalemate. We’ll separate CPU steal from IO wait, map ZFS ARC behavior to real latency, and track down the noisy-neighbor VM that’s turning your node into a shared-office microwave.
Fast diagnosis playbook
You don’t need a 40-panel dashboard to find the bottleneck. You need the right three checks, in the right order, with the discipline to stop guessing.
First: confirm what kind of “slow” it is (CPU contention vs storage vs memory pressure)
- Host feels slow but CPU looks low: suspect IO wait, steal time, or memory reclaim (swap/kswapd).
- Guest CPU pegged but host CPU not: suspect CPU limits, steal time, or vCPU topology mismatch.
- Everything is “fine” until a backup/replication starts: suspect IO saturation and ZFS transaction group behavior.
Second: identify the limiter on the host (one-liners that don’t lie)
Run these on the Proxmox node:
- CPU + steal + iowait snapshot: mpstat -P ALL 1 5
- Memory pressure / swap / reclaim: vmstat 1 10
- Disk latency and queueing: iostat -x 1 10
- ZFS pool behavior: zpool iostat -v 1 10
Third: map it to the culprit VM or workload
- Top CPU VM(s): qm list plus ps -eo pid,cmd,%cpu --sort=-%cpu | head, then correlate QEMU PIDs
- Top IO VM(s): iotop -oPa and pvesh get /nodes/$(hostname)/qemu --output-format json
- Top memory pressure offender: ballooning and host swap metrics, plus “who is allocating page cache”
If you do only one thing: pick one host, one 10-minute window, and capture the snapshots above. That’s your ground truth. Everything else is opinion.
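If you want that window captured hands-off, here is a minimal sketch, assuming sysstat and the ZFS utilities are installed, your pool is named rpool, and /tmp has room; intervals and the output path are examples:

# Capture a 10-minute ground-truth window (5-second samples x 120)
OUT=/tmp/perf-$(hostname)-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"
mpstat -P ALL 5 120 > "$OUT/mpstat.txt" &          # CPU, iowait, steal per core
vmstat 5 120 > "$OUT/vmstat.txt" &                 # memory, swap, run/blocked queues
iostat -x 5 120 > "$OUT/iostat.txt" &              # per-device latency and utilization
zpool iostat -v rpool 5 120 > "$OUT/zpool.txt" &   # per-vdev ops and bandwidth
wait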
A mental model that keeps you honest
Proxmox performance problems are usually one of four things:
- CPU contention: vCPUs want to run; the scheduler can’t give them time. On a VM you’ll often see this as steal time.
- Storage latency: CPUs are idle because threads are waiting on disk. That’s IO wait, and it’s not a CPU problem—it’s a storage pipeline problem.
- Memory pressure: ARC, page cache, anonymous memory, and ballooning fight for RAM. If the host swaps, everything becomes “mysteriously slow.”
- Noisy neighbor effects: one VM saturates IO, floods the CPU with interrupts, or triggers pathological ZFS write amplification that punishes everyone.
The trick is to avoid mixing symptoms. Load average includes runnable tasks and tasks stuck in uninterruptible IO sleep. That’s why you can have a load of 40 and a mostly idle CPU. It’s not “Linux lying.” It’s you asking a vague question and getting an honest answer.
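You can watch that distinction directly; a quick sketch with plain ps, nothing exotic:

# Count runnable vs D-state (uninterruptible IO sleep) tasks right now
ps -eo state,pid,comm | awk '$1=="R"{r++} $1=="D"{d++} END{printf "runnable: %d  blocked-on-IO: %d\n", r, d}'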
One line worth taping to your monitor, paraphrasing W. Edwards Deming: without data, you’re just another person with an opinion.
It’s painfully relevant when three teams argue over whether “it’s networking.”
Interesting facts and history you can weaponize
- Steal time was born in the early virtualization era to quantify how often a guest was runnable but the hypervisor scheduled someone else. It’s a scheduling debt meter, not a guest bug.
- Linux load average predates modern IO stacks; it counts tasks in D-state (uninterruptible sleep), so heavy storage latency inflates load even if CPUs nap.
- ZFS ARC is not “just cache.” It’s a self-tuning memory consumer with multiple lists (MRU/MFU and friends), and it will happily take RAM until something forces it to behave.
- ZFS was engineered around data integrity first (copy-on-write, checksums, transactional semantics). Performance tuning is real, but it always pays an integrity tax.
- IOPS became a business metric because of virtualization: when many small random IO streams share disks, throughput stops being the bottleneck and latency dominates.
- Write amplification isn’t just for SSDs. Copy-on-write filesystems can amplify writes too, especially under small synchronous writes and fragmentation.
- Virtio wasn’t a default at first. Paravirtualized drivers were introduced to avoid emulation overhead; using the wrong disk/controller model still hurts.
- CPU frequency scaling can mimic “random performance.” If governors downclock aggressively, your “same workload” becomes different every minute.
- Ballooning solved overcommit economics but introduced a new failure mode: the host looks “fine” while guests thrash, because the pain is delegated.
CPU steal: when guests are ready but can’t run
CPU steal is time a VM’s vCPU wanted to run but couldn’t because the host scheduler (or another layer) didn’t schedule it. On bare metal, steal is basically zero. In virtualization, steal is a confession: “I am contended.”
How steal time shows up (and how it fools you)
- Inside a VM, top shows high %st, but %us isn’t crazy.
- Applications time out even though “CPU usage” graphs look moderate.
- Interactive latency is bad: SSH keystrokes echo slowly, cron jobs run late.
Steal time is not “the VM using too much CPU.” It’s “the VM not getting CPU when it needs it.” That distinction matters when you decide whether to add cores, move VMs, or stop overcommitting.
Common causes of steal on Proxmox
- Overcommit that’s too spicy: sum(vCPU) greatly exceeds physical cores, and workload peaks align.
- CPU limits and shares: cgroup quotas or Proxmox CPU units starve a VM during contention.
- NUMA/topology mismatch: huge VMs spanning sockets, remote memory access, and cache misses create “CPU is slow” behavior even without high steal.
- Host CPU frequency pinned low: looks like contention, acts like contention, but it’s just underclocking.
Practical advice: if you see sustained steal above a few percent during a customer-visible incident, treat it as a real problem. Bursty steal isn’t fatal; sustained steal is.
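If you want a number rather than a feeling, here is a rough sketch that samples steal straight from /proc/stat inside a Linux guest (the 8th value on the aggregate cpu line is steal ticks; the 5-second window is arbitrary):

# Two samples of the cpu line, 5 seconds apart; prints approximate steal%
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
tot1=$((u1+n1+s1+i1+w1+q1+sq1+st1)); tot2=$((u2+n2+s2+i2+w2+q2+sq2+st2))
echo "steal%: $(( 100 * (st2 - st1) / (tot2 - tot1) ))"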
Joke #1: If your VM has 30% steal time, it’s not “stealing”—it’s politely waiting while someone else steals its lunch money.
IO wait: the CPU is idle because storage is slow
IO wait means the CPU had nothing to run because threads were blocked on IO. It’s not the CPU being “busy.” It’s the CPU being unemployed while storage ruins everyone’s weekend.
What makes IO wait spike in Proxmox
- Backups/snapshots/replication creating heavy sequential reads mixed with random writes.
- Sync writes from databases or NFS configurations that force flushes.
- Queue depth overload on SATA SSDs or HDD arrays: latency blows up before throughput looks saturated.
- ZFS TXG behavior: bursts of writes at commit boundaries can look like periodic stalls.
- Thin-provisioned storage or nearly full pools leading to fragmentation and slower allocations.
Interpreting iowait without self-deception
A high iowait percentage tells you: “storage is limiting progress.” It does not tell you whether the culprit is the pool, the controller, the SSD firmware, the guest filesystem, or a single VM hammering sync writes.
Focus on latency metrics: await, queue size (aqu-sz), and ZFS per-vdev behavior; ignore svctm, which is deprecated and removed from modern sysstat. Throughput charts are comforting lies if your workload is latency-sensitive.
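On ZFS-backed nodes you can get latency, not just throughput, straight from the pool; a sketch, assuming a reasonably current OpenZFS and a pool named rpool:

# Per-vdev latency breakdown (total_wait, disk_wait, syncq_wait, asyncq_wait)
zpool iostat -l -v rpool 1 10
# Request-size histograms, handy for spotting floods of tiny sync writes
zpool iostat -r rpool 1 5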
ZFS ARC: cache, memory pressure, and the swap trap
ZFS ARC is a powerful performance feature and a frequent scapegoat. It caches reads, metadata, and can drastically reduce disk IO. But on a virtualization host, ARC is also a political actor: it competes with guests for RAM.
The ARC failure modes you actually see in production
- Host swapping: the kernel swaps because ARC + page cache + qemu processes + everything else exceed RAM. Once the host swaps, latency gets weird and “random.”
- ARC too small: constant cache misses cause real disks to do the work; IO wait rises; VMs stutter.
- ARC too large: VMs lose memory, ballooning kicks in, guests start swapping, and the host “looks fine” while the app burns.
- Metadata pressure: lots of small files or many datasets can bloat metadata; ARC becomes metadata-heavy and less effective for actual VM blocks.
Opinionated ARC sizing guidance for Proxmox
If you run many VMs and want predictable behavior, set an ARC max. Letting ARC auto-grow can be okay on a storage appliance; on a virtualization node with changing guest memory, it’s how you end up debugging “ghost slowness.”
There isn’t one magic percentage, but a common starting point is: cap ARC so the host always has headroom for guests + kernel + qemu overhead. Then watch real cache hit ratios and latency. Tune based on evidence, not vibes.
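A minimal sketch of capping ARC on a node; the 16 GiB figure is an example, not a recommendation, and assumes you have already done the headroom math for guests, kernel, and QEMU overhead:

# Runtime cap (takes effect without a reboot, though ARC shrinks gradually)
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# Persistent cap (note: this overwrites zfs.conf; merge by hand if one already exists)
echo "options zfs zfs_arc_max=$((16 * 1024 * 1024 * 1024))" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all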
SLOG/L2ARC footnote (because people will ask)
A separate SLOG device can help synchronous write latency if you have sync-heavy workloads and you understand the risks and endurance. L2ARC can help read-heavy workloads but consumes RAM for metadata and can backfire if you’re already memory-tight.
Noisy neighbors: finding the VM that ruins your day
“Noisy neighbor” is corporate-speak for “one workload is selfish and everyone else is paying for it.” On Proxmox, noisy neighbors typically manifest as:
- One VM generating extreme random writes (often sync writes).
- One VM doing a full-disk scan, indexing job, antivirus run, or backup inside the guest.
- One VM saturating CPU with high interrupt rates (packet floods, misconfigured polling, busy loops).
- One VM with too many vCPUs causing scheduling friction for smaller VMs.
The diagnostic move: stop looking at host averages and start attributing consumption to individual QEMU processes, then correlate to VMIDs. Most “mystery contention” becomes a name and an owner within 15 minutes.
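A small attribution sketch, assuming Proxmox’s usual pidfile layout under /run/qemu-server/; adjust the sorting and thresholds to your own pain:

# Rough per-VM CPU attribution from the pidfiles Proxmox maintains
for f in /run/qemu-server/*.pid; do
  vmid=$(basename "$f" .pid)
  pid=$(cat "$f")
  cpu=$(ps -o %cpu= -p "$pid" | tr -d ' ')
  name=$(qm config "$vmid" | awk -F': ' '/^name:/{print $2}')
  printf "%s %s %s%%\n" "$vmid" "${name:-unnamed}" "${cpu:-0}"
done | sort -k3 -nr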
Practical tasks: commands, outputs, decisions (12+)
Task 1: Snapshot CPU usage, iowait, and steal per core
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.12-pve (pve01) 12/28/2025 _x86_64_ (32 CPU)
12:01:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:01:12 PM all 12.10 0.00 4.20 9.80 0.00 1.10 6.50 66.30
12:01:12 PM 0 22.00 0.00 6.00 18.00 0.00 2.00 10.00 42.00
12:01:12 PM 1 8.00 0.00 3.00 1.00 0.00 0.50 0.00 87.50
What it means: %steal at 6.5% across all CPUs suggests CPU scheduling contention for guests. %iowait at 9.8% also suggests storage latency is limiting progress.
Decision: If steal is sustained > 5% during pain, treat CPU contention as real: reduce vCPU overcommit, move hot VMs, or adjust limits. If iowait is also high, run storage tasks next to determine which is primary.
Task 2: Confirm memory pressure and swapping at the host level
cr0x@server:~$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 3 262144 18432 10240 8123456 80 120 900 1400 9200 18000 12 4 66 10 8
3 2 262144 16000 10240 8012345 60 100 1100 1600 9400 17500 10 4 67 12 7
What it means: Non-zero si/so indicates active swapping. b (blocked processes) aligns with IO wait. Low free alone isn’t bad; swapping is.
Decision: If the host is swapping, fix memory pressure first: cap ARC, stop ballooning overcommit games, add RAM, or reduce guest allocations. Storage tuning won’t save you if the host is paging.
Task 3: Identify disk latency and saturation symptoms
cr0x@server:~$ iostat -x 1 5
Linux 6.8.12-pve (pve01) 12/28/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await r_await w_await
nvme0n1 1200 1800 48000 92000 0.0 0.0 98.5 18.2 6.1 26.4
nvme1n1 110 1600 3200 88000 0.0 0.0 91.0 22.8 4.5 24.1
What it means: High %util plus elevated await indicates your devices are saturated and latency is climbing. The write latency is especially high.
Decision: Move to ZFS-level breakdown (zpool iostat -v). If await spikes align with backups or replication, throttle/shape those jobs.
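If backups turn out to be the trigger, throttling is often a one-line config change; a sketch with example values (bwlimit is in KiB/s and applies node-wide via /etc/vzdump.conf):

# Node-wide backup defaults: cap bandwidth and lower IO priority (example values)
cat >> /etc/vzdump.conf <<'EOF'
bwlimit: 102400
ionice: 7
EOF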
Task 4: Break down ZFS IO by vdev to find the real limiter
cr0x@server:~$ zpool iostat -v rpool 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 3.12T 1.48T 1.20K 1.85K 46.2M 92.8M
mirror 3.12T 1.48T 1.20K 1.85K 46.2M 92.8M
nvme0n1 - - 620 980 23.1M 46.4M
nvme1n1 - - 580 870 23.0M 46.4M
What it means: Reads/writes are balanced across mirror members, so the pool isn’t “one disk dying” (yet). If one vdev lags, you’d see skew.
Decision: If bandwidth and ops are high but latency is still bad, look for sync writes, fragmentation, or memory pressure causing constant cache misses.
Task 5: Check pool health, errors, and slow devices
cr0x@server:~$ zpool status -xv
all pools are healthy
What it means: No obvious ZFS errors. That’s good. It doesn’t mean performance is good.
Decision: If performance is bad but health is good, focus on workload shape, latency, and contention rather than “a disk is failing.”
Task 6: Observe ARC size and hit ratio signals
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:03:01 4.2K 1.1K 26 210 19 540 49 350 32 52.1G 60.0G
12:03:02 4.0K 1.4K 35 260 19 700 50 440 31 52.4G 60.0G
What it means: ARC is large (52G) and capped at 60G. Miss rate 26–35% suggests disks are still doing work. Not necessarily bad, but it’s a clue.
Decision: If miss% is high and IO wait is high, consider whether ARC is too small for the working set. If the host is swapping, ARC is too big for your reality.
Task 7: Confirm host swap and memory accounting (don’t trust “free” alone)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 128Gi 109Gi 18Gi 2.1Gi 1.6Gi 15Gi
Swap: 16Gi 256Mi 15Gi
What it means: Some swap is in use. That’s not instantly fatal, but if it’s growing or if vmstat shows active swapping, you’re in trouble.
Decision: If swap usage is stable and si/so are zero, you might accept it. If swap is active, reduce memory pressure before tuning anything else.
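When swap is in use and you want names instead of a single number, a quick sketch that ranks processes by VmSwap using procfs only:

# Largest swap consumers, in kB, biggest first
for s in /proc/[0-9]*/status; do
  awk '/^Name:/{n=$2} /^VmSwap:/{print $2, n}' "$s" 2>/dev/null
done | sort -nr | head -15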
Task 8: Identify top IO processes on the host (often QEMU)
cr0x@server:~$ iotop -oPa
Total DISK READ: 65.20 M/s | Total DISK WRITE: 110.30 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
18342 be/4 root 2.10 M/s 38.70 M/s 0.00 % 92.10% kvm -id 107 -name vm107 ...
19110 be/4 root 0.40 M/s 26.80 M/s 0.00 % 88.30% kvm -id 112 -name vm112 ...
What it means: Two VMs are responsible for most writes. IO> indicates time waiting on IO.
Decision: Investigate VM 107 and 112: what are they doing, are they backing up, are they databases with sync writes, are they misconfigured?
Task 9: Map QEMU PID to VMID and confirm Proxmox sees the same “hot” VM
cr0x@server:~$ qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
107 db-prod-01 running 32768 256.00 18342
112 files-prod-02 running 16384 1024.00 19110
What it means: PIDs match iotop output. Now you have real attribution: db-prod-01 and files-prod-02 are heavy writers.
Decision: If the heavy writer is a database, evaluate sync behavior, cache settings, and SLOG suitability. If it’s a file server, check for scans, rsync storms, or snapshots.
Task 10: Check per-VM CPU limits, CPU units, and ballooning that can create contention
cr0x@server:~$ qm config 107
balloon: 16384
boot: order=scsi0;net0
cores: 16
cpu: x86-64-v2-AES
memory: 32768
name: db-prod-01
net0: virtio=DE:AD:BE:EF:10:07,bridge=vmbr0,firewall=1
scsi0: rpool:vm-107-disk-0,discard=on,iothread=1,ssd=1
sockets: 1
What it means: Ballooning is enabled and can reclaim memory under pressure. Cores=16 might be too many if the host is contended; large vCPU counts can increase scheduling friction.
Decision: If steal is high, consider reducing oversized vCPU allocations for VMs that don’t scale. If memory pressure exists, disable ballooning for critical low-latency workloads and instead right-size memory.
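If you act on that, both changes are single qm calls; a sketch using VMID 107 and an example core count (the cores change applies only after the guest is fully stopped and started again):

# Disable the balloon device so memory stays fixed at the configured 'memory' value
qm set 107 --balloon 0

# Right-size vCPUs for a workload that doesn't scale past 8 cores (example value)
qm set 107 --cores 8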
Task 11: Check CPU frequency and throttling (the “why is it slow today” classic)
cr0x@server:~$ lscpu | egrep 'Model name|Socket|Thread|CPU\(s\)'
CPU(s): 32
Model name: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Socket(s): 2
Thread(s) per core: 2
cr0x@server:~$ cpupower frequency-info | egrep 'current CPU frequency|governor'
current CPU frequency: 1200 MHz (asserted by call to kernel)
governor "powersave" may decide which speed to use
What it means: CPUs are running at 1.2GHz under a powersave governor. That can absolutely look like “mysterious contention.”
Decision: On virtualization nodes, prefer performance or a tuned governor policy unless you have a measured reason not to. If you’re power-capping, accept the performance trade explicitly.
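Switching the governor is a quick, reversible experiment; a sketch assuming linux-cpupower is installed (persistence across reboots needs a systemd unit or similar):

# Set all cores to the performance governor for the current boot
cpupower frequency-set -g performance

# Confirm the governor and current frequency actually changed
cpupower frequency-info | grep -E 'governor|current CPU frequency'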
Task 12: Verify interrupt/softirq pressure (network or storage storms)
cr0x@server:~$ sar -I SUM 1 3
Linux 6.8.12-pve (pve01) 12/28/2025 _x86_64_ (32 CPU)
12:05:12 PM INTR intr/s
12:05:13 PM SUM 82000.00
12:05:14 PM SUM 91000.00
What it means: Very high interrupt rate can burn CPU in softirq and create latency. It’s common with high PPS networking or some storage drivers.
Decision: If %soft in mpstat is high, investigate network bridges, virtio-net settings, and whether a VM is causing packet floods.
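To see where the interrupts and softirqs are landing, plain procfs is enough; a sketch (the device-name patterns are examples):

# Per-CPU softirq counters; watch for NET_RX/NET_TX or BLOCK climbing fast
watch -d -n1 'cat /proc/softirqs'

# Which devices the interrupts belong to, per CPU
grep -E 'virtio|nvme|enp|eno' /proc/interrupts | head -20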
Task 13: Check kernel logs for storage timeouts or resets (the “it’s fine” lie detector)
cr0x@server:~$ journalctl -k --since "30 min ago" | egrep -i 'nvme|blk|zfs|reset|timeout|error' | tail -n 20
Dec 28 11:41:03 pve01 kernel: nvme nvme0: I/O 123 QID 7 timeout, aborting
Dec 28 11:41:03 pve01 kernel: nvme nvme0: Abort status: 0x371
Dec 28 11:41:04 pve01 kernel: blk_update_request: I/O error, dev nvme0n1, sector 123456789 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
What it means: You have device-level issues. Performance tuning is irrelevant until hardware/firmware stability is fixed.
Decision: Treat this as an incident: check firmware, cabling, PCIe errors, drive health, and controller behavior. Plan replacement if needed.
Task 14: Measure ZFS dataset properties that impact VM IO behavior
cr0x@server:~$ zfs get -o name,property,value -r compression,atime,recordsize,sync,logbias rpool | head
NAME PROPERTY VALUE
rpool compression zstd
rpool atime off
rpool recordsize 128K
rpool sync standard
rpool logbias latency
What it means: atime=off is sensible. recordsize matters: VM volumes often behave best with smaller record sizes when random IO dominates, but you must measure. sync=standard honors guest sync semantics; changing it can be dangerous.
Decision: Don’t change sync to disabled for “performance” unless you accept data loss risk explicitly and have a business sign-off. Tune recordsize and special devices carefully, and test on representative IO.
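Testing on a throwaway dataset first is cheap; a sketch where rpool/rs-test, the 16K recordsize, and the fio parameters are all example values (fio must be installed, and the dataset mounts under /rpool by default):

# Scratch dataset with a candidate recordsize
zfs create -o recordsize=16K -o compression=zstd rpool/rs-test

# Random-write test with an fsync per write to mimic sync-heavy guests
fio --name=rs-test --directory=/rpool/rs-test --rw=randwrite --bs=16k \
    --size=2G --iodepth=16 --ioengine=libaio --fsync=1 \
    --runtime=60 --time_based --group_reporting

# Clean up afterwards
zfs destroy rpool/rs-test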
Task 15: Confirm what the VM disk actually is (virtio-scsi + iothread vs legacy)
cr0x@server:~$ qm config 112 | egrep 'scsi|sata|ide|virtio|iothread'
scsi0: rpool:vm-112-disk-0,discard=on,iothread=1,ssd=1
What it means: You’re on virtio-scsi with an IO thread. That’s typically a good baseline for performance and isolation.
Decision: If you see sata0 or IDE disks on performance-sensitive VMs, fix that. Emulated controllers are not cute in 2025.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “Low CPU usage means the host is fine”
A mid-sized company ran a Proxmox cluster hosting a mix of web apps and a couple of databases. One morning, the on-call got the usual complaint: “pages take 10 seconds to load.” The host CPU graphs looked calm. Someone concluded it was an application regression and started rolling back deployments.
Meanwhile, load average was high. The rollback didn’t help. People got louder. The network team got dragged in because that’s what happens when nobody has a theory they trust.
Eventually, someone ran iostat -x. The underlying SATA SSD array was showing high await with %util pinned. Then iotop pointed at a single VM doing huge write bursts. It turned out to be a “temporary” reporting job that started doing nightly full-table exports and compressing them inside the VM, writing a torrent of small sync-heavy blocks to the VM disk.
The wrong assumption was subtle: they equated “CPU not busy” with “system healthy.” In reality, the CPU was waiting on storage, and the storage was being hammered by one workload that nobody considered “production.” The fix was boring and effective: schedule the reporting job off-peak, rate-limit it, and move the VM to a pool with better write latency. The deployment rollback was just theater.
Optimization that backfired: “Disable sync on ZFS and watch it fly”
Another organization had a PostgreSQL VM complaining about write latency. A well-meaning engineer suggested setting the ZFS dataset property sync=disabled on the VM storage to “fix” it. Benchmarks looked great. Everyone high-fived. They rolled it into production.
Two weeks later, a node lost power hard. Not a clean shutdown, not a graceful panic—just dead. When it came back, the database VM booted, but the database had corruption symptoms. They restored from backups, but the recovery window was ugly and the team got to explain why a “performance change” increased data loss risk.
The backfire wasn’t that ZFS is bad. It did what it promised. The team changed the durability contract: guest sync writes were no longer durable. That’s fine for some workloads, catastrophic for others, and it shouldn’t be done casually.
They ultimately solved the original latency problem the hard way: moved the database VM to faster mirrored NVMe, validated write cache policies, and added a proper SLOG device with power-loss protection after doing the math on endurance. The lesson landed: performance hacks that rewrite correctness aren’t optimizations; they’re gambles with paperwork.
Boring but correct practice that saved the day: capacity headroom and predictable limits
A third team ran Proxmox for internal services. Not glamorous. They had a few rules that sounded like overkill: keep ZFS pools below a certain fill level, cap ARC on all nodes, and forbid “just give it 32 vCPUs” requests unless someone proved the workload scaled.
Then they had a real incident: a vendor VM started behaving badly after an update. It began writing logs at a ridiculous rate and rotating them constantly. This would normally have been a cluster-wide performance dumpster fire.
Instead, the blast radius was limited. Because they had reserved headroom and sane ARC limits, the host didn’t start swapping. Because they had reasonable CPU sizing and limits, the VM couldn’t steal the entire node. Because they had monitoring that tracked per-VM disk latency, the culprit was obvious within minutes.
They throttled the VM’s IO and moved it to a less critical node while the vendor fix was negotiated. Nobody cheered. Nobody wrote a post about “heroic mitigation.” But production stayed up, and that’s the whole point of boring practices.
Joke #2: The best performance fix is the one that doesn’t require a war room—mostly because war rooms run on coffee and denial.
Common mistakes: symptom → root cause → fix
1) Load average is huge, CPU is mostly idle
Symptom: load 30–80, CPU idle 60%+, users complain about latency.
Root cause: tasks stuck in D-state waiting on storage; IO wait is the real limiter.
Fix: verify with vmstat (high b, high wa) and iostat -x (high await, high %util); then identify top IO VM via iotop and shape/relocate the workload.
2) VM reports high CPU, but host CPU seems normal
Symptom: guest says CPU pegged, host dashboard doesn’t look scary.
Root cause: CPU steal or CPU quota/limit in cgroups; guest wants CPU but doesn’t get scheduled.
Fix: check guest %st and host mpstat steal; review VM CPU limits/units; reduce overcommit or move the VM.
3) “ZFS is eating all memory” panic
Symptom: low “free” memory, big ARC, people want to “disable ARC.”
Root cause: misunderstanding Linux memory accounting; the real issue is swapping or reclaim pressure, not low free RAM.
Fix: check vmstat for si/so, check available memory; cap ARC if the host swaps; otherwise ignore “free” and focus on latency and hit rate.
4) Random periodic stalls every few seconds/minutes
Symptom: apps freeze briefly, then recover, repeating.
Root cause: ZFS TXG commit spikes, snapshot/replication bursts, or controller-level cache flush behavior.
Fix: correlate with zpool iostat 1 and backup schedules; throttle jobs; consider faster vdevs or separate backup traffic; verify drive cache/firmware stability.
5) Performance got worse after “tuning recordsize”
Symptom: more IOPS, worse latency; or better sequential, worse random.
Root cause: recordsize mismatch to IO pattern; for VM volumes, changing it blindly can increase amplification and fragmentation.
Fix: revert to a sane baseline; measure with the real workload; tune per-dataset (and remember that zvols use volblocksize rather than recordsize, so the knob and its semantics differ).
6) Backups destroy production every night
Symptom: midnight incident pattern; daytime is fine.
Root cause: backup reads saturate pool and compete with write latency; snapshot chains and compression amplify work.
Fix: schedule and throttle; separate backup storage; limit concurrent backup jobs; consider offloading heavy reads to replication nodes.
7) “We added L2ARC and now it’s slower”
Symptom: latency increases, memory pressure rises after adding a cache device.
Root cause: L2ARC metadata overhead consumes RAM; cache warms slowly; device adds contention.
Fix: remove L2ARC unless you have a proven read-heavy working set that doesn’t fit in ARC and you have spare RAM.
Checklists / step-by-step plan
Checklist A: When users say “everything is slow”
- On the host: capture mpstat, vmstat, iostat -x, and zpool iostat for 2–5 minutes.
- If steal is high: reduce CPU contention (move VMs, lower vCPU, review limits, fix governor).
- If iowait is high: identify top IO VMs (iotop), then correlate to workload (backup, DB, scan, replication).
- If swap is active: stop the bleeding (cap ARC, reduce ballooning, free memory, migrate a VM), then fix capacity.
- Check kernel logs for storage errors/timeouts. If present, stop tuning and treat hardware/driver stability as priority one.
Checklist B: When one VM is slow but others are fine
- Inside the VM: check top for %st and %wa.
- On the host: map VMID to PID (qm list) and observe that PID’s CPU/IO (ps, iotop).
- Verify disk/controller type: virtio-scsi with iothread for IO-heavy VMs.
- Check for CPU limits/ballooning that might be silently constraining the VM.
- If it’s a DB: confirm whether the workload is sync-write heavy; consider storage that’s designed for low-latency writes, not just “fast sequential.”
Checklist C: Before you change ZFS settings on a production node
- Write down the current dataset/zvol properties you plan to change.
- Define success metrics: p99 latency, query time, fsync time, backup window, host iowait, ARC hit%.
- Change one thing at a time, ideally on one VM or one dataset.
- Have a rollback plan that takes minutes, not hours.
- Never trade durability for speed without explicit business sign-off (sync=disabled is not a “tuning option,” it’s a contract rewrite).
FAQ
1) What’s a “bad” CPU steal percentage?
There’s no universal threshold, but sustained steal above ~5% during user-visible latency is a strong sign of contention. Bursty steal is common; sustained steal is a capacity or scheduling problem.
2) Why is load average high when CPU usage is low?
Because load counts runnable tasks and tasks stuck in uninterruptible sleep (usually waiting on IO). High IO latency produces high load without high CPU.
3) Should I disable ZFS atime?
For VM storage and most server workloads, atime=off is usually the right call to avoid extra writes. If you have a workload that relies on atime semantics, keep it on for that dataset only.
4) Is ZFS ARC “stealing” memory from VMs?
ARC uses available memory aggressively by design, but it can be reclaimed. The real problem is when the host swaps or when ballooning pushes guests into swap. Cap ARC if the host is under memory pressure.
5) Is adding an L2ARC SSD a good idea for Proxmox?
Sometimes, but not as a default. L2ARC consumes RAM for metadata and helps mostly with read-heavy working sets that don’t fit in ARC. If you’re short on RAM, L2ARC is often a performance regression.
6) My IO wait is high. Does that mean I need faster disks?
Maybe, but first find out what kind of IO it is. One VM doing sync-heavy random writes can saturate excellent hardware. Identify the offender, the access pattern, and whether your storage layout matches it.
7) Should I give every VM lots of vCPUs “just in case”?
No. Oversizing vCPUs increases scheduling friction and can worsen tail latency for everyone. Start smaller, measure, and scale vCPU only when the workload proves it benefits.
8) Are Proxmox backups supposed to hurt performance?
Backups consume IO. The question is whether the system is designed to absorb it: separate backup storage, throttling, and concurrency limits. If backups regularly cause production incidents, your backup design is incomplete.
9) How do I quickly find the noisy-neighbor VM on a node?
Use iotop -oPa to find IO-heavy QEMU processes and map their PIDs to VMIDs via qm list. For CPU contention, use mpstat and ps sorted by CPU, then correlate.
10) Should I turn off ballooning?
For critical latency-sensitive VMs, often yes—ballooning can create unpredictable memory pressure. For less critical workloads, ballooning can increase density. The key is to avoid host swap at all costs.
Conclusion: next steps that actually move the needle
If your Proxmox node is slow, stop debating and start attributing: capture mpstat, vmstat, iostat -x, and zpool iostat. Decide whether the limiter is CPU scheduling (steal), storage latency (iowait + await), or memory pressure (swap/reclaim). Then name the top offender VM with iotop and qm list.
From there, do the unsexy fixes first: cap ARC to protect the host, stop host swapping, right-size vCPUs, and throttle backups. Only then consider structural changes like faster vdevs, separate pools for latency-sensitive workloads, or adding a properly-engineered SLOG. Performance tuning without diagnosis is just improvisational spending.