pveperf says your storage does 2,000 MB/s. Your VMs still stutter when you unzip a file and your database latency spikes like it’s trying to escape the graph. You’re not crazy. You’re benchmarking wrong—or rather, you’re relying on a tool that’s too small to be trusted as a decision-maker.
This is the pragmatic guide I wish came bundled with Proxmox: what pveperf actually measures, why it lies (sometimes honestly, sometimes by omission), and how to do benchmarking that survives contact with production.
What pveperf is (and why it disappoints)
pveperf is a quick smoke test. It’s designed to give you a rough signal: “is this host wildly underpowered or misconfigured?” It is not designed to answer questions like:
- Can my ZFS pool sustain sync writes for PostgreSQL?
- Will my Ceph cluster keep serving IO during peak if a failure domain goes down?
- Why do VMs pause when backups run?
- Is the issue CPU steal, IO latency, or network jitter?
pveperf runs small, simplistic tests (filesystem writes, CPU loops) and reports a number. Small tests love caches and hate reality. If you treat the number like a storage SLA, you’ll end up debugging at 2 a.m. while your users discover the concept of “refresh button endurance.”
Interesting facts and context you should know (so you stop benchmarking the wrong thing)
- Bonnie++ (and similar microbenchmarks) shaped early Linux perf culture: quick numbers were used to compare disks, but they were famously cache-sensitive and easy to game.
- Fio became the standard because it models IO patterns: it can do sync vs async, queue depth, random vs sequential, and report latency distributions—not just averages.
- Average latency is a liar: tail latency (p95/p99) is what makes databases cry. Modern benchmarking culture shifted hard toward percentiles.
- Writeback caching changed the world (and the failure modes): controllers and SSDs can acknowledge writes before they’re durable; benchmarks look great until you pull power or hit a cache flush wall.
- ZFS intentionally trades raw speed for correctness: copy-on-write and checksums cost something; pretending it should behave like ext4-on-RAID0 is how you create myths.
- fsync semantics are workload-defining: an OLTP database is basically a durability machine; a benchmark that doesn’t model fsync is mostly measuring optimism.
- NVMe brought queue depth into everyday life: SATA was often bottlenecked by a small command queue; NVMe can handle deep queues and punishes shallow tests.
- Virtio and paravirtual drivers are performance features: benchmarking inside a VM without virtio drivers is benchmarking your mistakes.
Benchmarking is not a number; it’s a question. Start with the question, then pick tools and parameters that match it.
Why pveperf results look like nonsense
pveperf commonly produces numbers that are simultaneously “true” and “useless.” Here are the main failure modes:
1) It measures the page cache, not the disk
If the test file fits in RAM, you’re benchmarking memory bandwidth and kernel buffering behavior. Your SSD didn’t become 10x faster; your RAM did what RAM does.
2) It hits a filesystem path with behavior you didn’t intend
On Proxmox, storage “types” matter. Testing /var/lib/vz on local-lvm vs local-zfs vs an NFS mount changes everything: caching, sync semantics, even metadata overhead.
3) It ignores the durability contract
Many “fast” write benchmarks are fast because they don’t wait for stable storage. If your workload needs durability (databases, journaling apps, VM images), you must benchmark with sync writes (or explicit fsync) to know what you’re buying.
4) It runs too short
Modern storage has multiple performance phases: SLC cache bursts, thermal throttling, garbage collection, ZFS transaction groups, Ceph recovery. A 10-second test is basically a greeting, not a measurement.
5) It hides contention and queueing
Real systems are shared: multiple VMs, background scrub, backups, replication. The “nonsense” number is often the best-case in a lab fantasy where nothing else is happening.
Joke #1: pveperf is like weighing your car by checking how fast it rolls downhill—technically a measurement, emotionally a mistake.
Benchmarking principles that keep you out of trouble
Decide what you’re benchmarking: device, filesystem, or workload
- Device-level (raw disk/NVMe): “Is the hardware healthy and configured right?” Use nvme tools, smartctl, fio with --filename=/dev/nvme0n1 (careful).
- Filesystem/pool-level (ZFS dataset, LVM-thin, ext4): “What does my storage stack deliver to VMs?” Use fio on a file in the target mount.
- Application-level (database, VM OS): “Will my service meet SLOs?” Use the app’s own benchmarks plus IO telemetry.
Prefer reproducibility over hero numbers
Benchmarking that cannot be repeated is not benchmarking; it’s storytelling. Control what you can:
- Run tests at known system state (no scrub, no resilver, no backup).
- Record kernel, Proxmox version, storage layout, drive models, firmware.
- Capture CPU frequency scaling state.
- Capture IO scheduler, multipath, and controller cache policy.
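If that sounds tedious, script it. Below is a minimal capture sketch (the output directory and file names are illustrative; the ZFS/Ceph lines assume those stacks are present, so drop what does not apply):
# Sketch: snapshot host state alongside every benchmark run
OUT=/root/bench-context-$(date +%Y%m%d-%H%M%S); mkdir -p "$OUT"
uname -a > "$OUT/kernel.txt"
pveversion -v > "$OUT/pveversion.txt"
pvesm status > "$OUT/storage.txt"
lsblk -o NAME,MODEL,SIZE,ROTA > "$OUT/disks.txt"
smartctl -i /dev/nvme0n1 > "$OUT/fw-nvme0n1.txt" 2>/dev/null   # repeat per device for firmware versions
cpupower frequency-info > "$OUT/cpufreq.txt"
cat /sys/block/*/queue/scheduler > "$OUT/schedulers.txt" 2>/dev/null
zpool status > "$OUT/zpool.txt" 2>/dev/null
ceph -s > "$OUT/ceph.txt" 2>/dev/null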
Latency percentiles beat throughput averages
Throughput is nice, but latency is what users feel. Benchmark outputs that include clat percentiles are the difference between “seems fine” and “why did the API time out?”
Use runtime long enough to hit steady state
If you don’t see the system settle, you didn’t measure it. For SSDs, give it time to warm up. For ZFS, let transaction groups cycle. For Ceph, let the cluster stop gossiping and start doing work.
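One practical trick: let fio run a warm-up period and exclude it from the statistics. A minimal sketch, assuming a scratch directory on the dataset you actually care about (path and job name are illustrative):
fio --name=steady4k --directory=/rpool/benchtmp --rw=randwrite --bs=4k --iodepth=16 --numjobs=2 --size=16G --direct=1 --time_based=1 --ramp_time=60 --runtime=300 --group_reporting
Everything before the 60-second ramp is discarded, so cache warm-up bursts don’t pollute the averages or percentiles.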
One paraphrased operations idea (because it’s still true)
Paraphrased idea (Werner Vogels, reliability/operations): “You build it, you run it” implies you benchmark it like you’ll be paged for it.
Fast diagnosis playbook: find the bottleneck quickly
This is the “you have 20 minutes before the incident call” sequence. The goal is to identify whether the primary limiter is CPU, disk, network, or a configuration footgun.
First: confirm the symptom is real and locate the layer
- On the Proxmox host: check load, CPU steal (if nested), memory pressure, and IO wait.
- On the storage layer: check disk utilization, queue depth, and latency.
- On the VM: confirm it’s actually IO latency and not guest memory swapping or application locks.
Second: determine whether you’re limited by throughput, IOPS, or latency
- High MB/s, low IOPS suggests sequential bandwidth limit.
- High IOPS, low MB/s suggests small-block workload (typical databases).
- Low utilization but high latency suggests sync penalties, misaligned cache, controller flushing, or network storage jitter.
Third: isolate variables with one controlled fio run
Run fio on the host against the exact storage path your VMs use. Use direct IO. Use a file size larger than RAM. Capture percentiles. Decide based on data, not vibes.
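A minimal sketch of that controlled run (the directory, size, and job name are illustrative; point --directory at the real datastore and pick --size larger than host RAM):
fio --name=diag --directory=/var/lib/vz --rw=randrw --rwmixread=70 --bs=4k --iodepth=16 --numjobs=2 --size=32G --direct=1 --time_based=1 --runtime=120 --group_reporting
Read the clat percentiles in the output, not just the IOPS line.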
Practical tasks: commands, output meaning, decisions
These are real operational tasks. Each one includes a command, what the output tells you, and the decision you make. Run them on the Proxmox host unless specified.
Task 1: See what pveperf is actually testing
cr0x@server:~$ pveperf /var/lib/vz
CPU BOGOMIPS: 59840.00
REGEX/SECOND: 2071670
HD SIZE: 222.22 GB (/dev/mapper/pve-root)
FSYNCS/SECOND: 1641.53
DNS EXT: 68.88 ms
DNS INT: 0.34 ms (pve1)
What it means: That “HD SIZE” line tells you what device is behind the path. If it’s pve-root on spinning disks, don’t expect miracles. If it’s on ZFS, it’s not telling you ZFS dataset settings.
Decision: Use this only as a sanity check. If results disagree with user pain, move on to fio and telemetry.
Task 2: Confirm what storage backs your Proxmox storages
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 222.22G 48.13G 162.60G 21.66%
local-lvm lvmthin active 1.82T 910.24G 929.76G 50.04%
rpool zfspool active 3.62T 1.23T 2.39T 33.98%
What it means: Proxmox “storage” is not one thing. A directory store behaves differently than LVM-thin, which behaves differently than ZFS.
Decision: Benchmark the exact backing store where VM disks live. If VMs are on local-lvm, do not benchmark /var/lib/vz and call it a day.
Task 3: Check live IO pressure and IO wait
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 421232 91264 882012 0 0 12 44 520 880 6 2 91 1 0
1 3 0 418904 91272 882140 0 0 0 3240 610 990 5 2 70 23 0
1 2 0 419120 91280 882188 0 0 0 2800 590 970 5 2 74 19 0
What it means: The wa column is CPU waiting on IO. High wa during complaints is a smoking gun, but not a verdict.
Decision: If wa is high, go straight to per-disk latency (iostat) and queueing; if wa is low, don’t obsess over disks first.
Task 4: Measure disk latency and utilization (the fastest truth serum)
cr0x@server:~$ iostat -x 1 3
Linux 6.8.12-4-pve (pve1) 12/26/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 85.0 120.0 5120 19456 2.10 0.18 37.0
sda 2.0 95.0 64 4096 45.20 0.80 82.0
dm-0 10.0 110.0 1024 16384 39.10 0.00 0.0
What it means: await is average request latency. %util near 100% means saturation. If sda shows await 45ms and 82% util, that disk is the problem whether pveperf likes it or not.
Decision: If await is consistently above ~10ms for SSD workloads or above ~30ms for HDD during normal load, you’re in “performance incident” territory. Investigate the stack behind that device.
Task 5: See who is doing IO right now
cr0x@server:~$ pidstat -d 1 3
Linux 6.8.12-4-pve (pve1) 12/26/2025 _x86_64_ (32 CPU)
01:12:44 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
01:12:45 PM 0 2123 0.00 82124.00 0.00 vzdump
01:12:45 PM 0 3891 0.00 11200.00 0.00 pveproxy
01:12:45 PM 0 15322 512.00 9200.00 0.00 kvm
What it means: This ties IO pressure to a process. If vzdump is hammering writes, your “random VM slowdown” is suddenly less mysterious.
Decision: Throttle backups, schedule them, or move them to separate storage. Don’t blame the hypervisor for your backup strategy.
Task 6: Verify CPU frequency scaling isn’t sabotaging benchmarks
cr0x@server:~$ cpupower frequency-info | sed -n '1,12p'
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
available cpufreq governors: performance powersave
current policy: frequency should be within 800 MHz and 3500 MHz.
The governor "powersave" may decide which speed to use
current CPU frequency: 900 MHz (asserted by call to hardware)
What it means: If you benchmark with the CPU parked at 900 MHz, you’ll get “nonsense” results and blame storage for CPU laziness.
Decision: For repeatable benchmarks, temporarily set performance governor. For production, choose intentionally; don’t let defaults decide.
Task 7: Switch to performance governor (temporarily) and confirm
cr0x@server:~$ sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
cr0x@server:~$ cpupower frequency-info | grep -E 'current policy|current CPU frequency'
current policy: frequency should be within 800 MHz and 3500 MHz.
current CPU frequency: 3401 MHz (asserted by call to hardware)
What it means: You removed a variable. This is not about “always run max clocks,” it’s about making benchmarks comparable.
Decision: If performance changes dramatically after this, your workload might be CPU-bound or sensitive to scheduling latency.
Task 8: Confirm ZFS pool health and spot obvious misalignment
cr0x@server:~$ zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:12:44 with 0 errors on Sun Dec 22 03:12:45 2025
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-SAMSUNG_MZVLB1T0 ONLINE 0 0 0
nvme-SAMSUNG_MZVLB1T0 ONLINE 0 0 0
errors: No known data errors
What it means: A degraded pool, resilver, or checksum errors make every benchmark suspect. Also, scrub timing hints at background activity.
Decision: If not ONLINE/healthy, fix storage first. Benchmarking a sick pool is just quantifying sadness.
Task 9: Inspect ZFS dataset settings that directly affect VM IO
cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,atime,sync,primarycache,logbias rpool/data
NAME PROPERTY VALUE
rpool/data recordsize 128K
rpool/data compression zstd
rpool/data atime off
rpool/data sync standard
rpool/data primarycache all
rpool/data logbias latency
What it means: recordsize and sync can make or break database-style IO. primarycache=all might evict useful cache under VM load, depending on memory sizing.
Decision: If VM disks are on ZVOLs, you’ll also check volblocksize. If you see sync=disabled in production for databases, stop and reassess your risk posture.
Task 10: Check ZVOL block size (common mismatch with workloads)
cr0x@server:~$ zfs list -t volume
NAME USED AVAIL REFER MOUNTPOINT
rpool/vm-101-disk-0 64G 1.1T 64G -
cr0x@server:~$ zfs get -H -o property,value volblocksize rpool/vm-101-disk-0
volblocksize 8K
What it means: volblocksize is set at ZVOL creation and cannot be changed later. 8K is common; sometimes it’s fine, sometimes it amplifies write overhead.
Decision: If you have heavy 4K random writes, 8K can still work but may increase write amplification. If you’re hosting large sequential workloads, consider larger blocks on new volumes.
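Since volblocksize is fixed at creation, the path forward is a new volume plus a disk migration. A minimal sketch, with a hypothetical volume for VM 102 (names and size are illustrative):
# Create a sparse 64G zvol with 16K blocks; volblocksize cannot be changed afterwards
zfs create -s -V 64G -o volblocksize=16K rpool/data/vm-102-disk-1
For disks that Proxmox provisions itself, set the block size on the zfspool storage definition first (the zfspool plugin has a blocksize option) so newly created volumes inherit it.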
Task 11: Check the IO scheduler (especially on non-NVMe)
cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none kyber bfq
What it means: The scheduler affects latency under contention. On many SSDs, mq-deadline is a sane default; on NVMe, often it’s none and that’s fine.
Decision: Don’t tune this because you’re bored. Tune because you saw a contention symptom and can validate the change.
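If you do have a symptom to validate, the change is a one-liner and does not persist across reboots, which makes it a safe experiment. A sketch for sda (substitute your device, and only pick a scheduler listed as available):
# The bracketed entry is the active scheduler
cat /sys/block/sda/queue/scheduler
# Switch temporarily; this does not survive a reboot
echo kyber | sudo tee /sys/block/sda/queue/scheduler
# Re-run the same fio profile and compare latency percentiles before keeping the change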
Task 12: Validate TRIM/Discard support (SSD sanity, long-term performance)
cr0x@server:~$ lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 512B 4G 0
nvme0n1 0 4K 2G 0
dm-0 0 0B 0B 0
What it means: Non-zero DISC-GRAN indicates discard capability. If you never discard on thin provisioning, performance can degrade as the SSD loses free blocks internally.
Decision: Enable discard where appropriate (ZFS autotrim, fstrim timers, thin-lvm discard), but understand the latency trade-offs in your environment.
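If you decide to enable it, the usual knobs look like this; treat it as a sketch and verify each layer actually passes discards down (the pool name, VM ID, and volume name are illustrative):
# ZFS: issue discards continuously (or run zpool trim on a schedule instead)
zpool set autotrim=on rpool
# Plain filesystems / LVM-thin: periodic trim via the systemd timer
systemctl enable --now fstrim.timer
# Guests: expose discard on the virtual disk so the guest can pass trims through
qm set 101 --scsi0 local-lvm:vm-101-disk-0,discard=on,ssd=1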
Task 13: Confirm network health for NFS/iSCSI/Ceph traffic
cr0x@server:~$ ip -s link show dev eno1 | sed -n '1,12p'
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 12 0 12345
TX: bytes packets errors dropped carrier collsns
8765432109 7345678 0 0 0 0
What it means: Drops matter. A few during bursts might be tolerable, a steady stream is not. Storage over the network is allergic to packet loss.
Decision: If errors/drops are climbing, fix the network before tuning storage. Otherwise you’ll benchmark retransmits.
Task 14: Run a quick host-level fio test on the actual datastore path
cr0x@server:~$ fio --name=randread --directory=/var/lib/vz --rw=randread --bs=4k --iodepth=32 --numjobs=4 --size=8G --time_based=1 --runtime=60 --direct=1 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
read: IOPS=72.4k, BW=283MiB/s (297MB/s)(16.6GiB/60001msec)
clat (usec): min=70, max=6200, avg=420.11, stdev=190.22
clat percentiles (usec):
| 1.00th=[ 110], 5.00th=[ 170], 10.00th=[ 210], 50.00th=[ 390]
| 90.00th=[ 690], 95.00th=[ 900], 99.00th=[ 1400], 99.90th=[ 2400]
What it means: This is already more honest than pveperf. You get IOPS, bandwidth, and latency percentiles. The 99th percentile tells you what “bad moments” look like.
Decision: If p99 is too high for your workload (databases often care), you need to reduce contention, change sync path, or improve media/controllers—not chase a bigger MB/s number.
Fio recipes that answer real questions
Fio is a chainsaw: powerful, dangerous, and occasionally used to cut bread by people who enjoy chaos. Use it deliberately.
Rule: test where the workload lives
If VMs run on local-lvm, test a file on that mount or a block device that maps to it (with extreme caution). If VMs run on ZFS zvols, test on the zvol or a file on the dataset that backs them.
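A quick way to confirm where a given VM’s disk actually lives (VM ID 101 and the volume name are illustrative):
# Which storage and volume does the VM reference?
qm config 101 | grep -E '^(scsi|virtio|sata|ide)[0-9]'
# Resolve that volume to a host path or block device
pvesm path local-lvm:vm-101-disk-0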
Recipe 1: Random reads (4k) for “is my SSD pool behaving?”
cr0x@server:~$ fio --name=rr4k --directory=/rpool --rw=randread --bs=4k --iodepth=64 --numjobs=4 --size=16G --time_based=1 --runtime=120 --direct=1 --group_reporting
rr4k: (g=0): rw=randread, bs=(R) 4096B-4096B, ioengine=libaio, iodepth=64
...
read: IOPS=110k, BW=431MiB/s (452MB/s)(50.5GiB/120001msec)
clat percentiles (usec):
| 90.00th=[ 520], 95.00th=[ 740], 99.00th=[ 1300], 99.90th=[ 2200]
Interpretation: Strong IOPS, acceptable p99. If p99.9 jumps into tens of milliseconds, you likely have contention or a sync/flush issue elsewhere.
Decision: If p99.9 is unacceptable, move to telemetry (iostat, zpool iostat, ceph health) and identify the source before “tuning.”
Recipe 2: Random sync writes (4k) to model database commits
cr0x@server:~$ fio --name=rsw4k --directory=/rpool --rw=randwrite --bs=4k --iodepth=1 --numjobs=4 --size=8G --time_based=1 --runtime=120 --direct=1 --fsync=1 --group_reporting
rsw4k: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, ioengine=libaio, iodepth=1
...
write: IOPS=8200, BW=32.0MiB/s (33.6MB/s)(3.75GiB/120001msec)
clat (usec): min=120, max=48000, avg=1850.22
clat percentiles (usec):
| 90.00th=[ 3200], 95.00th=[ 5800], 99.00th=[18000], 99.90th=[42000]
Interpretation: Sync writes expose the real durability path. Those tail latencies are what cause “every minute the app freezes.”
Decision: If this is bad: verify ZFS sync settings, SLOG presence/quality (if used), controller cache policy, and whether your “fast” storage is actually acknowledging writes safely.
Recipe 3: Sequential read throughput (1M) for backup/restore expectations
cr0x@server:~$ fio --name=sr1m --directory=/rpool --rw=read --bs=1M --iodepth=16 --numjobs=1 --size=32G --time_based=1 --runtime=120 --direct=1 --group_reporting
sr1m: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
read: BW=1710MiB/s (1793MB/s)(200GiB/120001msec)
Interpretation: This is where pveperf’s happy numbers sometimes come from. Useful for planning migration windows, not for predicting DB latency.
Decision: If sequential looks great but sync random writes are awful, you have a durability/latency problem, not a “disk is slow” problem.
Recipe 4: Mixed read/write (70/30) random to mimic VM general-purpose churn
cr0x@server:~$ fio --name=mix7030 --directory=/var/lib/vz --rw=randrw --rwmixread=70 --bs=16k --iodepth=32 --numjobs=4 --size=16G --time_based=1 --runtime=180 --direct=1 --group_reporting
mix7030: (g=0): rw=randrw, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=32
...
read: IOPS=18.2k, BW=284MiB/s (298MB/s)
write: IOPS=7810, BW=122MiB/s (128MB/s)
clat percentiles (usec):
| 90.00th=[ 1400], 95.00th=[ 2200], 99.00th=[ 5200], 99.90th=[12000]
Interpretation: Mixed IO is where storage stacks show their manners. Watch p99/p99.9; that’s your “VMs feel laggy” zone.
Decision: If latency spikes under mixed IO, tune scheduling, isolate noisy neighbors, or split workloads across pools.
Joke #2: If you run fio on the wrong path, it will faithfully prove your storage is faster than your ability to read your own mountpoints.
ZFS-on-Proxmox: the traps and the right tests
ZFS is a production-grade filesystem that assumes you care about integrity. It also assumes you’ll blame it for your benchmarks if you don’t understand how it commits data. Let’s fix that.
What makes ZFS benchmarking tricky
- Copy-on-write means small random writes can amplify into more IO than you expect, depending on recordsize/volblocksize and fragmentation.
- Transaction groups batch writes; performance can look bursty. A short benchmark may catch only the “good” part.
- ARC (RAM cache) makes reads look amazing until they don’t. If you want disk numbers, use --direct=1 and make the working set larger than RAM.
- Sync writes are either safely committed or they’re not. ZFS takes this seriously; benchmarks that ignore it are measuring a different contract.
Host-side visibility: zpool iostat is your friend
cr0x@server:~$ zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 1.23T 2.39T 8.20K 3.10K 220M 64.0M
mirror 1.23T 2.39T 8.20K 3.10K 220M 64.0M
nvme0n1 - - 4.10K 1.55K 110M 32.0M
nvme1n1 - - 4.10K 1.55K 110M 32.0M
---------- ----- ----- ----- ----- ----- -----
What it means: This tells you what ZFS is pushing to each device. If fio says “100k IOPS” but zpool iostat shows modest ops, you’re probably hitting cache or the test isn’t reaching disks.
Decision: Align fio parameters (direct IO, file size) until zpool iostat reflects meaningful device activity.
Sync semantics: don’t treat SLOG like a magic amulet
A dedicated SLOG can help only for sync writes. And only if it’s fast, low-latency, and power-loss safe. A random consumer SSD as SLOG is a great way to benchmark yourself into believing in miracles.
If you do use SLOG, benchmark sync writes explicitly and watch latency percentiles. If there is no improvement, your bottleneck is elsewhere (CPU, vdev layout, or you weren’t sync-bound to begin with).
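A quick sanity check before crediting (or blaming) a SLOG, assuming a pool named rpool:
# A dedicated log vdev shows up as a "logs" section in the pool layout
zpool status rpool
# Confirm the log device actually takes writes while a sync-heavy fio run is active
zpool iostat -v rpool 1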
Compression: great in production, confusing in benchmarks
Compression can make your storage look faster by writing fewer bytes. That’s not “cheating”—it’s a real benefit—but it makes comparisons tricky. If you benchmark random data, compression won’t help. If you benchmark zeros, it’ll look absurdly good.
Decide what you want to measure: raw media or effective workload performance. Then generate data accordingly.
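fio lets you choose which of the two you measure. A minimal sketch of both variants (the directory and job names are illustrative):
# Raw-media oriented: incompressible buffers, refreshed per IO
fio --name=raw --directory=/rpool/benchtmp --rw=randwrite --bs=16k --size=8G --direct=1 --refill_buffers=1 --buffer_compress_percentage=0 --time_based=1 --runtime=120 --group_reporting
# Effective-workload oriented: roughly 50% compressible data, closer to many real VM images
fio --name=effective --directory=/rpool/benchtmp --rw=randwrite --bs=16k --size=8G --direct=1 --refill_buffers=1 --buffer_compress_percentage=50 --time_based=1 --runtime=120 --group_reporting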
Ceph-on-Proxmox: benchmarking without self-deception
Ceph is distributed storage. That means your benchmark is really a test of network, OSD media, replication, CPU, and cluster health. If you run a benchmark while the cluster is backfilling, you’re benchmarking recovery behavior, not steady-state performance.
Check cluster health before you measure anything
cr0x@server:~$ ceph -s
cluster:
id: 2c1e3a6d-9e4d-4a4d-9b2e-1a2b3c4d5e6f
health: HEALTH_OK
services:
mon: 3 daemons, quorum pve1,pve2,pve3
mgr: pve1(active), standbys: pve2
osd: 9 osds: 9 up, 9 in
data:
pools: 3 pools, 256 pgs
objects: 1.12M objects, 4.3TiB
usage: 12TiB used, 18TiB / 30TiB avail
pgs: 256 active+clean
What it means: active+clean PGs and HEALTH_OK is your baseline. Anything else means your numbers will reflect maintenance behavior.
Decision: If not clean, postpone benchmarking or explicitly label it as “degraded state performance.”
Measure raw Ceph throughput and IOPS (and know what it represents)
cr0x@server:~$ rados bench -p rbd 60 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
...
Total written: 2388 MiB
Bandwidth (MB/sec): 39.8
Average IOPS: 9
Average Latency(s): 1.72
What it means: This is a cluster-level test using RADOS objects, not necessarily the same as RBD performance for VMs. Still, it exposes “is the cluster fundamentally slow?”
Decision: If bandwidth is low and latency is high, check network (MTU, drops), OSD disk latency, and CPU saturation. Don’t “optimize Proxmox” first.
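Two quick checks that usually narrow it down, using standard Ceph CLI commands from any admin node:
# Per-OSD commit/apply latency as seen by the cluster; one persistently slow OSD can drag the whole pool
ceph osd perf
# Map any slow OSDs back to their hosts, then check those disks locally with iostat -x
ceph osd tree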
Validate network path quality for Ceph (latency and loss)
cr0x@server:~$ ping -c 20 -i 0.2 pve2
PING pve2 (10.10.10.12) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.286 ms
...
20 packets transmitted, 20 received, 0% packet loss, time 3804ms
rtt min/avg/max/mdev = 0.251/0.302/0.401/0.038 ms
What it means: Ceph hates jitter. Low average is good; low variance is better.
Decision: If you see loss or high variance, fix switching, NICs, bonding, or congestion before touching Ceph tunables.
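Latency is only half the picture; confirm the links also deliver bandwidth without retransmits. A minimal sketch with iperf3, assuming it is installed on both nodes (pve2 is the peer from the ping test above):
# On pve2: run a temporary server
iperf3 -s
# On pve1: push traffic for 30 seconds and watch the retransmit column
iperf3 -c pve2 -t 30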
Ceph benchmarking deserves its own book, but the principle remains: measure health first, then measure the storage service, then measure the VM experience. If you skip layers, you’ll end up tuning symptoms.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
At a mid-sized company with a Proxmox cluster, a team replaced an aging SAN with local NVMe mirrors using ZFS. The migration went smoothly. pveperf looked fantastic. Everyone high-fived and went home at a reasonable hour, which should have been the first clue something was off.
Two weeks later, the customer portal started timing out during peak. Not constantly—just enough to make support tickets multiply. The on-call engineer checked CPU and memory. Fine. Network looked clean. Storage graphs showed “high throughput,” so storage got mentally acquitted.
The actual culprit: the most important workload was a database running on a VM image with heavy sync write behavior. ZFS was doing the right thing: honoring sync semantics. But the benchmark they used to sign off the migration was mostly sequential, mostly buffered, and short. It measured the easy path.
Once they ran fio with --fsync=1 and looked at p99 latency, the story changed. Tail latency was brutal under peak. They didn’t need “more MB/s”; they needed a better sync write path and less contention during commit bursts.
The fix was boring: separate the database VM storage onto a pool with predictable latency characteristics, tune the dataset for the workload, and stop treating a single pveperf number as a performance guarantee. After that, the portal stopped flapping.
Mini-story 2: The optimization that backfired
Another environment ran mixed workloads: CI runners, build caches, a couple of databases, and a pile of general-purpose VMs. Someone read that “disabling sync improves ZFS performance,” saw sync=disabled mentioned in a forum, and applied it to the dataset hosting VM disks.
The immediate benchmark results were spectacular. pveperf climbed. fio without fsync was basically a highlight reel. Management was pleased, because management always likes graphs that go up.
Then a host crashed during a power event. The UPS did what it could; reality did what it does. Several VMs came back with corrupted filesystems and inconsistent application state. The issue wasn’t that ZFS “lost data” randomly—it did exactly what it was told: acknowledge writes that were not guaranteed durable.
The optimization had also masked the real performance issue: the system was sync-bound and needed either better hardware for the sync path or application-level adjustments. Instead, they changed the contract and paid for it later in recovery time, loss of trust, and a week of uncomfortable meetings.
They reverted sync behavior, documented the rationale, and built a benchmark suite that explicitly tests sync write latency. The best performance trick is not lying to your future self.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent org ran Proxmox with Ceph for VM storage. Nothing glamorous. Lots of compliance pressure. The SRE team maintained a simple practice: every quarter, they ran the same benchmark set under controlled conditions and stored the results with cluster health snapshots.
It wasn’t fancy. A couple fio profiles on RBD-backed VMs, a rados bench sanity check, and host-level telemetry captures (iostat, ceph -s, NIC counters). They also ran the benchmarks after any major upgrade or hardware change. Same parameters. Same runtime. Same data retention.
One quarter, results changed subtly: sequential throughput was fine, but p99 latency on random writes crept upward. Not enough for a major incident. Enough to be suspicious. Because they had historical baselines, the change was undeniable rather than arguable.
The root cause was a firmware update on a batch of SSDs that altered flush behavior under certain queue depths. Without the boring benchmark routine, they would have found it only after an outage. Instead, they isolated affected nodes, rolled back firmware where possible, and adjusted maintenance plans.
The day was saved by a spreadsheet and consistency. No heroics. Just adult supervision.
Common mistakes: symptom → root cause → fix
1) “pveperf says 1500 MB/s but VMs are slow”
Symptom: Great pveperf numbers, sluggish interactive performance, DB spikes.
Root cause: pveperf hit cache or measured sequential throughput; your workload is random IO with sync requirements.
Fix: Run fio with --direct=1, file size > RAM, and include sync write tests (--fsync=1). Decide based on p99 latency, not peak MB/s.
2) “fio shows amazing numbers on Monday, terrible on Tuesday”
Symptom: Benchmark variance you can’t explain.
Root cause: Background activity: ZFS scrub/resilver, Ceph backfill, backups, TRIM, or SSD thermal throttling.
Fix: Capture system state (zpool status, ceph -s, backup jobs), extend runtime to reach steady state, and rerun at controlled times.
3) “Random write latency is awful, but disks are barely utilized”
Symptom: iostat shows modest %util, but fio p99 is ugly.
Root cause: Flush/sync behavior, write cache disabled, poor SLOG, or network storage jitter causing pauses not reflected as %util saturation.
Fix: Verify controller cache policy, ZFS sync behavior, SLOG hardware, and network drops/latency. Measure with sync-aware fio patterns.
4) “Ceph performance is inconsistent and gets worse after failures”
Symptom: VM IO dips during node outages or reboots.
Root cause: Recovery/backfill competes with client IO; network oversubscription; OSD disks saturated.
Fix: Benchmark only when active+clean. If operationally unacceptable, adjust recovery priorities and capacity planning, and isolate Ceph traffic.
5) “Upgrading to faster SSDs didn’t help database latency”
Symptom: Better sequential throughput, same commit stalls.
Root cause: The bottleneck is sync write path (flush latency), CPU scheduling, or contention on shared pool.
Fix: Measure sync write latency explicitly; separate workloads; ensure adequate CPU and avoid noisy neighbors; validate power-loss protection for the write acknowledgement path.
6) “Benchmark inside the VM looks fine; users still complain”
Symptom: Guest tools show okay performance, but real app suffers.
Root cause: Guest benchmark doesn’t match app IO pattern; or host contention causes tail latency spikes that averages hide.
Fix: Capture host-level latency percentiles (fio on host, iostat, zpool/ceph stats) during complaint windows. Compare p95/p99, not averages.
7) “After enabling compression, benchmarks doubled”
Symptom: Suddenly huge throughput for writes, suspiciously low disk bandwidth.
Root cause: Benchmark data compresses extremely well (zeros or repeated patterns), measuring CPU+compression rather than disk limits.
Fix: Use incompressible data for raw-media tests (fio --buffer_compress_percentage=0 and --refill_buffers=1), or explicitly state you’re measuring effective workload performance with compression.
Checklists / step-by-step plan
Step-by-step: build a trustworthy Proxmox benchmark baseline
- Pick the question. Example: “Can this host sustain 4k sync writes at p99 < 5ms for 10 VMs?”
- Identify the exact storage path. Use pvesm status and VM disk configuration to find it.
- Check health. ZFS: zpool status. Ceph: ceph -s. SMART/NVMe health if relevant.
- Stabilize the environment. Pause scrubs/backfills if possible, avoid backups during the run, and document what you couldn’t stop.
- Control CPU scaling for the test. Set governor intentionally (temporary).
- Run telemetry in parallel. Keep iostat -x and either zpool iostat or Ceph metrics running.
- Run fio with direct IO and realistic working set. Size > RAM, runtime 120–300s for SSD, longer for HDD/Ceph if needed.
- Capture percentiles. Record p50/p95/p99/p99.9. If your tool doesn’t show them, you’re flying blind.
- Repeat three times. If you can’t repeat it, you can’t trust it.
- Write down the config. ZFS properties, RAID layout, firmware versions, NIC speeds, MTU, Ceph replication size.
- Decide based on SLOs. Throughput for backups, latency percentiles for databases, mixed IO for VM clusters.
- Store results with context. A number without conditions is trivia.
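One low-effort way to store results with their context is to make fio emit JSON and file it next to a health snapshot. A sketch, with illustrative paths and the sync-write job from Recipe 2:
RUN=/root/bench-results-$(date +%Y%m%d-%H%M%S); mkdir -p "$RUN"
fio --name=rsw4k --directory=/rpool/benchtmp --rw=randwrite --bs=4k --iodepth=1 --numjobs=4 --size=8G --direct=1 --fsync=1 --time_based=1 --runtime=120 --group_reporting --output-format=json --output="$RUN/rsw4k.json"
pveversion -v > "$RUN/pveversion.txt"
zpool status > "$RUN/zpool.txt" 2>/dev/null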
Quick checklist: before you blame storage
- Is the VM swapping? (guest memory pressure looks like storage slowness)
- Is the host in IO wait? (vmstat)
- Are there packet drops on storage networks? (ip -s link)
- Is there a backup/scrub/backfill running? (pidstat, zpool status, ceph -s)
- Are you measuring the right datastore? (pvesm status)
- Is CPU scaling down? (cpupower frequency-info)
FAQ
1) Should I stop using pveperf?
No. Use it as a smoke test and inventory hint (“what device backs this path?”). Stop using it as a procurement tool or a performance SLO predictor.
2) Why does pveperf improve after I run it once?
Caches warm up. Page cache, ARC, device internal caches. The second run is often “faster” because you’re measuring memory and metadata caching, not disks.
3) What’s the single most useful metric for storage performance in Proxmox?
Latency under load, especially p95/p99, correlated with host-level device latency (iostat -x await) and queueing.
4) Should I benchmark on the host or inside a VM?
Both, but for different questions. Host benchmarks validate the storage stack capacity. VM benchmarks validate the guest driver path and the VM’s reality. If they disagree, the gap is usually in virtualization settings, caching modes, or contention.
5) How do I avoid benchmarking RAM by accident?
Use fio with --direct=1, pick a test size larger than RAM, and confirm with zpool iostat/iostat that devices are actually doing work.
6) Why does my sequential throughput look great but my database is slow?
Databases are typically sync-heavy and random IO heavy. Sequential throughput benchmarks mostly tell you how fast backups and restores might go, not how fast commits happen.
7) Is enabling ZFS sync=disabled ever acceptable?
It can be acceptable for non-critical, reconstructable workloads where durability isn’t required. For databases and important VM disks, it’s usually a risk decision that deserves a written justification and a rollback plan.
8) What fio runtime should I use?
Long enough to reach steady state: often 120–300 seconds for SSD tests, longer for HDD arrays and distributed storage. If your SSD has a big SLC cache, a 30-second test is a marketing demo.
9) How do I benchmark Ceph fairly?
Only benchmark when HEALTH_OK and PGs are active+clean, isolate Ceph traffic on a reliable network, and test both at the RADOS layer (rados bench) and at the VM/RBD layer (fio on RBD-backed storage).
10) My fio numbers are lower than pveperf. Which is correct?
Usually fio, if you configured it correctly (direct IO, realistic size, correct path, appropriate sync model). pveperf is often measuring a different, easier problem.
Conclusion: next steps that actually work
If pveperf “looks great” but users complain, believe the users. Then prove it with measurements that model your workload.
- Run the fast diagnosis playbook to identify whether you’re CPU-, disk-, or network-bound.
- Benchmark with fio on the real datastore using direct IO, realistic working set sizes, and percentiles.
- For ZFS, test sync writes explicitly and validate pool/dataset properties. Don’t confuse sequential bandwidth with durability performance.
- For Ceph, benchmark only when healthy and treat network quality as part of storage performance.
- Store your results with context and repeat them after changes. The boring baseline is how you catch regressions before they page you.
Benchmarking done right won’t give you a single magic number. It gives you a map of limits—and a way to stop arguing with graphs that never had to carry production traffic.