Your VM users don’t file tickets saying “IOPS are low.” They say “the database froze,” “Windows updates take forever,” or “that CI job is stuck in ‘copying artifacts’ again.”
Meanwhile, you run a quick fio test, get heroic numbers, and everyone goes home happy—until Monday.
ZFS makes this easier to mess up than most filesystems because it’s honest about durability and it has multiple layers that can cheat (ARC, compression, write coalescing, transaction groups).
The fix isn’t “run more fio.” The fix is running fio that looks like VMs actually behave: mixed I/O, sync semantics, realistic queue depths, and latency targets.
1) Reality check: what VM I/O really looks like
Most VM storage workloads aren’t “big sequential writes” and they aren’t “pure random 4K reads.”
They are an annoying cocktail:
- Small-block random reads and writes (4K to 64K) from databases, metadata, package managers, and Windows background services.
- Bursty sync writes (journals, WALs, fsync storms, VM guest flushes).
- Mixed read/write ratios that change by the hour (backup windows, log rotation, patch Tuesdays).
- Latency sensitivity more than throughput sensitivity. A VM can “feel slow” at 2,000 IOPS if p99 goes from 2 ms to 80 ms.
- Uneven concurrency: a few hot VMs dominate while most are quiet but still need consistent tail latency.
A realistic fio profile for VMs is not about maximizing a headline number. It’s about measuring the right failure modes:
sync write latency, queueing, write amplification, and whether your “fast” pool turns into a pumpkin when the TXG commits.
If your test doesn’t include fsync or a sync mode equivalent, it is not measuring the kind of pain that pages humans at 03:00.
Your pool might be fine for bulk ingest and still be terrible for VMs.
Joke #1: If your fio results look too good to be true, they probably came from ARC, not from your disks—like a résumé written by a cache layer.
2) Interesting facts (and a little history) that change how you benchmark
These aren’t trivia. Each one maps to a benchmark mistake I’ve seen in production.
- ZFS was designed around copy-on-write and transaction groups: writes are collected and committed in batches, which affects latency spikes during TXG sync.
- ARC (Adaptive Replacement Cache) is memory-first performance: a warm ARC can make fio read tests look like NVMe even when the pool is spindles.
- ZIL exists even when you don’t have a SLOG: without a separate device, the ZIL lives on the main pool and competes with everything else.
- SLOG accelerates sync writes, not all writes: async writes bypass the ZIL path; testing without sync semantics can make a SLOG look “useless.”
- volblocksize is set at zvol creation time: you don’t “tune it later” in any practical sense. This matters for VM block I/O alignment and write amplification.
- recordsize is a dataset property, not a zvol property: mixing datasets and zvols and expecting the same behavior is a classic benchmark error.
- Compression can increase IOPS (sometimes dramatically) by reducing physical writes—until your CPUs become the bottleneck or your data doesn’t compress.
- fio defaults can be dangerously “nice”: buffered I/O and friendly queue depths produce numbers that won’t survive real VM concurrency.
- NVMe write caches and power-loss protection matter: a “fast” consumer NVMe can be a liar under sync workloads if it lacks proper PLP behavior.
3) The ZFS + VM I/O stack: where your benchmark lies to you
VM I/O isn’t a single system. It’s a chain of decisions and caches. fio can test any link in that chain, and if you test the wrong one,
you’ll publish a benchmark for a system you don’t actually run.
Guest filesystem vs virtual disk vs host ZFS
The guest OS has its own cache and writeback behavior. The hypervisor has its own queueing. ZFS has ARC/L2ARC and its own write pipeline.
If you run fio inside the guest with buffered I/O, you’re mostly benchmarking guest memory and host memory bandwidth.
If you run fio on the host against a file, you’re benchmarking datasets and recordsize behavior, which may not match zvol behavior.
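A minimal way to see the split, assuming a scratch dataset mounted at /tank/vmtest and a zvol path like the ones used later in this article: point the same read job at each layer and compare. Only the target changes, and so does the thing you are measuring.
cr0x@server:~$ # Host -> zvol: the block path that zvol-backed VMs actually use
cr0x@server:~$ fio --name=layer-zvol --filename=/dev/zvol/tank/vm-101-disk0 \
--rw=randread --bs=8k --iodepth=4 --numjobs=4 --direct=1 \
--time_based --runtime=120 --ramp_time=15 --ioengine=libaio --group_reporting
cr0x@server:~$ # Host -> dataset file: recordsize and the dataset code path (file targets need --size)
cr0x@server:~$ fio --name=layer-dataset --directory=/tank/vmtest --size=8G \
--rw=randread --bs=8k --iodepth=4 --numjobs=4 --direct=1 \
--time_based --runtime=120 --ramp_time=15 --ioengine=libaio --group_reporting
cr0x@server:~$ # Guest -> filesystem: the same job run inside the VM against a file on the guest FS
If the three disagree wildly, that gap is itself the finding: it tells you which layer is adding latency or hiding it.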
Sync semantics: where the real pain lives
The defining difference between “demo fast” and “production steady” is durability. Databases and many guest filesystems issue flushes.
On ZFS, synchronous writes go through ZIL semantics; a separate SLOG device can reduce latency by providing a low-latency log target.
But the SLOG isn’t magic: it must be low latency, consistent, and safe under power loss.
The worst fio sin in VM benchmarking is running a 1M sequential write test, seeing 2–5 GB/s, and calling the platform “ready for databases.”
That’s not a VM profile. That’s a marketing slide.
Queue depth and parallelism: iodepth is not a virtue signal
VM workloads often have moderate queue depth per VM but high concurrency across VMs. fio can simulate that in two ways:
numjobs (multiple independent workers) and iodepth (queue depth per worker).
For VMs, prefer more jobs with modest iodepth rather than one job with iodepth=256 unless you’re specifically modeling a heavy database.
One reliable way to create fake performance is to choose an iodepth that forces the device into its best sequential merge path.
It’s like evaluating a car’s city driving by rolling it downhill.
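For illustration, here is a sketch of both shapes against the same scratch directory (the path and sizes are placeholders for your environment); compare p99 latency between the two runs, not just IOPS.
cr0x@server:~$ # Many modest queues: closer to a host full of ordinary VMs
cr0x@server:~$ fio --name=many-vms --directory=/tank/vmtest --size=4G \
--rw=randrw --rwmixread=70 --bs=8k --numjobs=16 --iodepth=4 --direct=1 \
--time_based --runtime=180 --ramp_time=30 --ioengine=libaio --group_reporting
cr0x@server:~$ # One deep queue: measures saturation throughput, not the service time users feel
cr0x@server:~$ fio --name=one-deep-queue --directory=/tank/vmtest --size=4G \
--rw=randrw --rwmixread=70 --bs=8k --numjobs=1 --iodepth=256 --direct=1 \
--time_based --runtime=180 --ramp_time=30 --ioengine=libaio --group_reporting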
4) Principles for fio profiles that match production
Principle A: test the thing users feel (latency), not just the thing vendors sell (throughput)
Capture p95/p99 latency, not just average. Your VM customers live in the tail.
fio can report percentiles; use them and treat them as first-class metrics.
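As a sketch (assuming a reasonably recent fio that supports --lat_percentiles; the zvol path is a placeholder), the relevant flags look like this:
cr0x@server:~$ fio --name=tail-read --filename=/dev/zvol/tank/vm-101-disk0 \
--rw=randread --bs=8k --iodepth=4 --numjobs=8 --direct=1 \
--time_based --runtime=180 --ramp_time=30 --ioengine=libaio --group_reporting \
--percentile_list=50:95:99:99.9 --lat_percentiles=1
Note that fio's percentile_list takes colon-separated values.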
Principle B: include sync write tests
Use --direct=1 to avoid guest page cache effects (when testing inside guests) and use a sync mechanism:
--fsync=1 (or --fdatasync=1) for file-based workloads, or --sync=1 for some engines.
On raw block devices, you can approximate by using --ioengine=libaio with --fsync=1 (fsync against a block device still forces a cache flush),
but the cleanest model for VM “flush storms” is to test with actual filesystem + fsync patterns.
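A minimal flush-path probe, assuming a scratch dataset mounted at /tank/vmtest: every write is followed by an fsync, which is roughly what databases and guest flushes do to your pool.
cr0x@server:~$ fio --name=sync-probe --directory=/tank/vmtest --size=2G \
--rw=randwrite --bs=8k --iodepth=1 --numjobs=4 --direct=1 --fsync=1 \
--time_based --runtime=120 --ramp_time=15 --ioengine=libaio --group_reporting \
--percentile_list=95:99:99.9
Watch the log device in zpool iostat -v while this runs; if it stays idle, your sync path isn't what you think it is.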
Principle C: ensure the working set defeats ARC when you intend to test disks
If you want to measure pool performance, your test size should be larger than ARC by a wide margin.
If ARC is 64 GiB, do not run a 10 GiB read test and call it “disk speed.”
Alternately, test in a way that focuses on writes (sync writes especially) where ARC cannot fully hide physical behavior.
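One way to size the working set, assuming OpenZFS on Linux where ARC counters live in /proc/spl/kstat/zfs/arcstats:
cr0x@server:~$ awk '/^c_max/ {printf "ARC max: %.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats
Then give fio a --size several times larger than that number whenever the goal is to measure disks rather than RAM.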
Principle D: match block sizes to what guests do
VM random I/O tends to cluster at 4K, 8K, 16K, and 32K. Large 1M blocks are for backup streams and media workloads.
Use multiple block sizes or a distribution if you can (see the sketch below). If you must narrow it down: 4K random and 128K sequential are the workhorses.
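If you want a distribution rather than a single size, fio's bssplit option can approximate one; the weights below are illustrative assumptions, not measurements of your guests, and the path and size are placeholders.
cr0x@server:~$ fio --name=mixed-bs --directory=/tank/vmtest --size=8G \
--rw=randrw --rwmixread=70 --bssplit=4k/40:8k/30:16k/20:32k/10 --direct=1 \
--iodepth=4 --numjobs=8 --time_based --runtime=180 --ramp_time=30 \
--ioengine=libaio --group_reporting --percentile_list=95:99:99.9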
Principle E: use time-based tests with ramp time
ZFS behavior changes as TXGs commit, ARC warms, metadata gets created, and free space fragments.
Run tests long enough to see a few TXG cycles. Use a ramp-up period to avoid measuring the first 10 seconds of “everything is empty and happy.”
Principle F: pin down the test environment
CPU governor, interrupt balancing, virtio settings, dataset properties, and zvol properties all matter.
Reproducibility is a feature. If you can’t rerun the test a month later and explain deltas, it’s not benchmarking—it’s vibes.
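A boring capture step like the one below is usually enough to explain deltas a month later; the output path is an example, and the cpufreq file assumes a frequency-scaling driver is loaded.
cr0x@server:~$ { date; uname -r; zfs version; zpool status -v tank; \
zfs get -r -o name,property,value recordsize,volblocksize,compression,sync tank; \
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor; } \
> /tank/vmtest/env-baseline-$(date +%F).txt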
One quote worth keeping on a sticky note near your monitoring wall:
Everything fails, all the time.
— Werner Vogels
5) Realistic fio profiles (with explanations)
These are not “best” profiles. They’re honest profiles. Use them as building blocks and tune to your VM mix.
For each profile, decide whether you’re testing inside the guest, on the host against a zvol, or on the host against a dataset file.
Profile 1: VM boot/login storm (read-heavy, small random, modest concurrency)
Models dozens of VMs booting, services starting, reading many small files. It’s mostly reads, but not purely random.
cr0x@server:~$ fio --name=vm-boot --filename=/dev/zvol/tank/vm-101-disk0 \
--rw=randread --bs=16k --iodepth=8 --numjobs=8 --direct=1 \
--time_based --runtime=180 --ramp_time=30 --group_reporting \
--ioengine=libaio --percentile_list=95:99:99.9
vm-boot: (groupid=0, jobs=8): err= 0: pid=21233: Sat Dec 21 11:02:20 2025
read: IOPS=42.1k, BW=658MiB/s (690MB/s)(115GiB/180s)
slat (usec): min=3, max=2100, avg=12.4, stdev=18.9
clat (usec): min=90, max=28000, avg=1480, stdev=2100
lat (usec): min=105, max=28150, avg=1492, stdev=2102
clat percentiles (usec):
| 95.00th=[ 3600], 99.00th=[ 8200], 99.90th=[18000]
What it means: 42k IOPS looks great, but the real signal is p99 and p99.9 latency.
Boot storms feel bad when p99 goes into tens of milliseconds.
Decision: if p99.9 is high, look for contention (other workloads), special vdev needs, or too-small/slow vdevs.
Profile 2: OLTP database-ish (mixed random, sync writes matter)
This is the profile that exposes whether your SLOG is real or cosplay.
Use it on a filesystem inside the guest if you can, because guests do fsync. On the host, you can run against a file on a dataset to model fsync.
cr0x@server:~$ fio --name=oltp-mix-fsync --directory=/tank/vmtest --size=16G \
--rw=randrw --rwmixread=70 --bs=8k --iodepth=4 --numjobs=16 \
--direct=1 --time_based --runtime=300 --ramp_time=60 \
--ioengine=libaio --fsync=1 --group_reporting --percentile_list=95:99:99.9
oltp-mix-fsync: (groupid=0, jobs=16): err= 0: pid=21901: Sat Dec 21 11:12:54 2025
read: IOPS=18.4k, BW=144MiB/s (151MB/s)(42.2GiB/300s)
clat (usec): min=120, max=95000, avg=2900, stdev=5200
clat percentiles (usec):
| 95.00th=[ 8200], 99.00th=[22000], 99.90th=[62000]
write: IOPS=7.88k, BW=61.6MiB/s (64.6MB/s)(18.0GiB/300s)
clat (usec): min=180, max=120000, avg=4100, stdev=7800
clat percentiles (usec):
| 95.00th=[12000], 99.00th=[34000], 99.90th=[90000]
What it means: With fsync, latency tails blow up first. Average can look “fine” while p99.9 ruins transactions.
Decision: if p99.9 write latency is ugly, validate SLOG, sync settings, and device write cache behavior.
Profile 3: Windows update / package manager (metadata-heavy, small random reads/writes)
This is where special vdevs for metadata and small blocks can be worth their cost—if you actually have the right kind of pool.
cr0x@server:~$ fio --name=metadata-chaos --directory=/tank/vmtest --size=8G \
--rw=randrw --rwmixread=60 --bs=4k --iodepth=16 --numjobs=8 \
--direct=1 --time_based --runtime=240 --ramp_time=30 \
--ioengine=libaio --group_reporting --percentile_list=95:99:99.9
metadata-chaos: (groupid=0, jobs=8): err= 0: pid=22188: Sat Dec 21 11:18:22 2025
read: IOPS=55.0k, BW=215MiB/s (226MB/s)(50.4GiB/240s)
clat percentiles (usec): 95.00th=[ 2400], 99.00th=[ 6800], 99.90th=[16000]
write: IOPS=36.0k, BW=141MiB/s (148MB/s)(33.0GiB/240s)
clat percentiles (usec): 95.00th=[ 3100], 99.00th=[ 9200], 99.90th=[24000]
What it means: If these percentiles degrade sharply when the pool is half full or fragmented,
you may have a layout/ashift issue, an overloaded mirror vdev, or you’re missing fast metadata paths.
Decision: compare performance at different pool fill levels and after sustained random writes.
Profile 4: Backup/restore stream (sequential, large blocks, checks “can we drain?”)
This profile is not a VM latency test. It answers: “Can we move big data without destroying everything?”
Use it to schedule backup windows and decide whether to throttle.
cr0x@server:~$ fio --name=backup-write --filename=/tank/vmtest/backup.bin \
--rw=write --bs=1m --iodepth=8 --numjobs=1 --direct=1 \
--size=50G --ioengine=libaio --group_reporting
backup-write: (groupid=0, jobs=1): err= 0: pid=22502: Sat Dec 21 11:24:10 2025
write: IOPS=1450, BW=1450MiB/s (1520MB/s)(50.0GiB/35s)
What it means: Great throughput doesn’t mean your pool is healthy for VMs.
Decision: use this to set backup throttles; then rerun a latency-sensitive profile concurrently to see interference.
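One way to run that interference check, assuming the same scratch paths as above: keep a backup-style stream running in one terminal, start the latency-sensitive profile in another, and compare its percentiles against the solo run.
cr0x@server:~$ # Terminal 1: sustained backup-style noise
cr0x@server:~$ fio --name=backup-noise --filename=/tank/vmtest/backup.bin \
--rw=write --bs=1m --iodepth=8 --numjobs=1 --direct=1 --size=50G \
--time_based --runtime=300 --ioengine=libaio
cr0x@server:~$ # Terminal 2: the OLTP profile from above, started at the same time
cr0x@server:~$ fio --name=oltp-under-noise --directory=/tank/vmtest --size=16G \
--rw=randrw --rwmixread=70 --bs=8k --iodepth=4 --numjobs=16 --direct=1 --fsync=1 \
--time_based --runtime=300 --ramp_time=60 --ioengine=libaio --group_reporting \
--percentile_list=95:99:99.9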
Profile 5: “No cheating” disk test (working set bigger than ARC, random reads)
Use this when someone claims the pool is “slow,” and you need to establish raw read capability without ARC masking the truth.
You must size the file beyond ARC and run long enough to avoid warm-cache artifacts.
cr0x@server:~$ fio --name=arc-buster --filename=/tank/vmtest/arc-buster.bin \
--rw=randread --bs=128k --iodepth=32 --numjobs=4 --direct=1 \
--size=500G --time_based --runtime=240 --ramp_time=30 \
--ioengine=libaio --group_reporting --percentile_list=95:99
arc-buster: (groupid=0, jobs=4): err= 0: pid=22791: Sat Dec 21 11:31:12 2025
read: IOPS=3100, BW=387MiB/s (406MB/s)(90.7GiB/240s)
clat percentiles (usec):
| 95.00th=[ 16000], 99.00th=[ 32000]
What it means: Lower IOPS and higher latency are normal here; you’re finally touching disks.
Decision: if this is unexpectedly awful, check vdev layout, ashift, and disk health before arguing about fio flags.
6) Practical tasks: commands, what output means, and what you decide
This is the part you’ll actually use during an incident or a capacity review.
Each task includes: a command, sample output, what it means, and the decision it drives.
Assume a Linux host running ZFS with a pool named tank.
Task 1: Identify whether you’re benchmarking a zvol or a dataset (and what properties apply)
cr0x@server:~$ zfs list -r -t filesystem,volume -o name,type,volblocksize,recordsize,compression,sync tank
NAME TYPE VOLBLOCKSIZE RECORDSIZE COMPRESS SYNC
tank filesystem - 128K lz4 standard
tank/vmdata filesystem - 128K lz4 standard
tank/vm-101-disk0 volume 16K - lz4 standard
What it means: zvols have volblocksize; datasets have recordsize.
Mixing their results is how you accidentally “optimize” the wrong thing.
Decision: choose fio target accordingly: /dev/zvol/... for zvol VM disks, or a file on the dataset if your VM storage uses files.
Task 2: Check pool topology (your vdev layout is your performance contract)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
logs
nvme4n1 ONLINE 0 0 0
errors: No known data errors
What it means: Mirrors behave differently from RAIDZ under random I/O. The logs section shows that a separate SLOG device is present.
Decision: if you’re testing sync writes, verify logs are present and healthy; if RAIDZ, expect lower small random write IOPS and plan accordingly.
Task 3: Check pool free space and fragmentation risk signals
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 42.8T 6.10T 192K /tank
What it means: Pools running hot on space tend to show worse allocation behavior and worse tail latency.
ZFS isn’t uniquely bad here; it’s just honest about the consequences.
Decision: if avail is tight, stop “benchmarking” and start capacity work. Any fio test now is measuring a system already in distress.
Task 4: Validate that sync settings aren’t quietly lying to you
cr0x@server:~$ zfs get -r -o name,property,value sync tank/vmdata
NAME PROPERTY VALUE
tank/vmdata sync standard
What it means: sync=standard means sync requests are honored. sync=disabled makes benchmarks pretty and audits angry.
Decision: if someone set sync=disabled “temporarily,” treat every performance result as contaminated.
Task 5: Verify ashift (because 4K disks don’t forgive 512-byte fantasies)
cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
120: ashift: 12
What it means: ashift=12 means 4K sectors. Wrong ashift can permanently degrade performance via read-modify-write.
Decision: if ashift is wrong, plan a migration. You don’t “tune” your way out of it.
Task 6: Check ARC size vs test size (are you benchmarking RAM?)
cr0x@server:~$ arcstat 1 1
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
11:40:22 128K 56K 43 10K 8% 40K 31% 6K 4% 64G 80G
What it means: ARC is 64G with 80G target. If your fio file is smaller than that, reads will “improve” over time.
Decision: for disk tests, use a file several times bigger than ARC, or focus on sync writes where ARC can’t fully mask latency.
Task 7: Watch ZFS I/O and latency at the pool level during fio
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 42.8T 6.10T 8.20K 3.10K 410M 220M
mirror-0 21.4T 3.05T 4.10K 1.55K 205M 110M
nvme0n1 - - 2.05K 780 102M 55M
nvme1n1 - - 2.05K 770 103M 55M
mirror-1 21.4T 3.05T 4.10K 1.55K 205M 110M
nvme2n1 - - 2.04K 780 102M 55M
nvme3n1 - - 2.06K 770 103M 55M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: You see whether load is spread across vdevs or one side is hot.
Decision: if one vdev is overloaded (or a disk is slower), investigate imbalance, firmware, or a failing device.
Task 8: Confirm SLOG is actually being used for sync writes
cr0x@server:~$ zpool iostat -v tank 1 2 | sed -n '1,18p'
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 42.8T 6.10T 2.10K 6.40K 120M 210M
mirror-0 21.4T 3.05T 1.05K 3.10K 60M 105M
mirror-1 21.4T 3.05T 1.05K 3.30K 60M 105M
logs - - 2 9.80K 512K 310M
nvme4n1 - - 2 9.80K 512K 310M
What it means: High write ops on the log device during sync-heavy fio indicates ZIL traffic is landing on SLOG.
Decision: if log writes don’t move during sync tests, either workload isn’t sync, sync=disabled is set somewhere, or SLOG isn’t configured.
Task 9: Check CPU bottlenecks and IRQ pressure during “fast” benchmarks
cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server) 12/21/2025 _x86_64_ (32 CPU)
12:01:10 PM CPU %usr %nice %sys %iowait %irq %soft %idle
12:01:11 PM all 18.2 0.0 22.9 0.8 0.0 6.1 52.0
12:01:11 PM 7 4.0 0.0 78.0 0.0 0.0 0.0 18.0
What it means: One CPU pinned at high %sys may indicate a single queue/IRQ bottleneck (NVMe, virtio, networking, or locking).
Decision: if you’re CPU-bound, stop tuning disks. Fix queueing, IRQ affinity, or increase parallelism.
Task 10: Verify guest flush behavior (why “sync” keeps showing up)
cr0x@server:~$ qemu-img info /tank/vmimages/vm-101.qcow2
image: /tank/vmimages/vm-101.qcow2
file format: qcow2
virtual size: 200 GiB (214748364800 bytes)
disk size: 36.1 GiB
cluster_size: 65536
What it means: qcow2 has its own metadata and allocation behavior. It can amplify writes and introduce extra flushes depending on settings.
Decision: if latency is bad under sync workloads, consider raw/zvol or tune qcow2 cache/IO mode carefully (and test again).
Task 11: Check ZFS compression ratio and decide whether CPU is helping or hurting
cr0x@server:~$ zfs get -o name,property,value,source compressratio,compression tank/vmdata
NAME PROPERTY VALUE SOURCE
tank/vmdata compression lz4 local
tank/vmdata compressratio 1.62x -
What it means: A real compressratio suggests your pool is writing less to disk than the VM thinks it is.
Decision: if compressratio is high and CPU is not saturated, compression is a net win. If CPU is pegged, benchmark with and without.
Task 12: Verify zvol block alignment expectations for VM I/O
cr0x@server:~$ lsblk -o NAME,PHY-SEC,LOG-SEC,MIN-IO,OPT-IO,ROTA /dev/zvol/tank/vm-101-disk0
NAME PHY-SEC LOG-SEC MIN-IO OPT-IO ROTA
zd0 4096 4096 4096 0 0
What it means: 4K logical/physical sectors align with modern expectations. Misalignment causes RMW and latency spikes.
Decision: if you see 512 logical sectors atop 4K devices, fix it in design time (ashift/volblocksize). Otherwise you’ll be “tuning” forever.
Task 13: Measure TXG sync pressure signals
cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs
1 0x01 0x00000000 136 13440 105155148830 0
What it means: The exact layout of this kstat varies between OpenZFS versions, but if TXG sync times or backlog grow during load, you’ll see latency waves.
Decision: if tail latency correlates with TXG sync behavior, investigate dirty data limits, vdev write latency, and SLOG effectiveness rather than chasing fio knobs.
Task 14: Check device error counters and latency outliers before blaming ZFS
cr0x@server:~$ smartctl -a /dev/nvme0n1 | sed -n '1,25p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0] (local build)
=== START OF INFORMATION SECTION ===
Model Number: ACME NVMe 3.2TB
Firmware Version: 1.04
Percentage Used: 2%
Data Units Read: 19,442,112
Data Units Written: 13,188,440
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
What it means: A single flaky device can turn p99 into a horror story while average looks decent.
Decision: if errors or high wear appear, replace the device before you “optimize” around failing hardware.
7) Fast diagnosis playbook: find the bottleneck in minutes
This is the “stop debating, start isolating” playbook. Use it when latency is high or fio results don’t match production.
The goal is to identify whether you’re bound by the guest, the hypervisor, ZFS, the vdev layout, or a single sick device.
First: decide what you are actually testing
- fio in guest, buffered I/O → mostly guest cache and memory behavior.
- fio in guest, direct I/O → closer to virtual disk behavior (still through hypervisor queues).
- fio on host against a zvol → tests ZFS block volume path, bypasses guest FS.
- fio on host against a file in dataset → tests ZFS dataset path and recordsize behavior.
If the test target doesn’t match your VM storage path, stop. Fix the test.
Second: ask “is it sync?”
- Run a sync-heavy fio profile (--fsync=1 or equivalent) and watch zpool iostat -v for log activity.
- Check zfs get sync at the dataset/zvol level.
If p99 write latency explodes only with sync, your problem is in ZIL/SLOG behavior, device cache safety, or write latency on the vdevs.
Third: determine whether ARC is masking reads
- Compare an ARC-busting read test with a small read test.
- Watch arcstat miss rates during the run.
If “disk reads” are fast but misses are low, you’re not reading disks. You’re reading RAM and calling it storage.
Fourth: locate the choke point with live stats
- zpool iostat -v 1 shows per-vdev distribution and whether logs are used.
- mpstat 1 shows CPU saturation and single-core pressure.
- iostat -x 1 shows device utilization and latency at the block layer.
If a single device is pegged or shows high await, isolate it. If CPU is pegged, stop shopping for faster SSDs.
Fifth: check pool health and allocation reality
- Pool near full? Expect worse behavior.
- Recent resilver? Scrub running? Expect interference.
- Errors? Stop performance work and fix integrity first.
8) Common mistakes: symptoms → root cause → fix
Mistake 1: “fio shows 1M IOPS but VMs are slow”
Symptoms: Massive read IOPS in fio, but real apps have high latency and stalls.
Root cause: ARC/page cache benchmark. Test file fits in RAM; fio is reading cache, not storage.
Fix: Use --direct=1, make the working set larger than ARC, and watch arcstat miss% during the run.
Mistake 2: “SLOG did nothing”
Symptoms: Adding a SLOG shows no improvement; sync write latency unchanged.
Root cause: Workload wasn’t sync (no fsync/flush), or sync=disabled set, or log device not active.
Fix: Run fsync-heavy fio, verify zpool status shows logs, and confirm log write ops in zpool iostat -v.
Mistake 3: “We increased iodepth and got better numbers, so we’re done”
Symptoms: Benchmark IOPS improved with iodepth=256; production still suffers.
Root cause: Artificial queueing hides latency. You’re measuring saturation throughput, not service time.
Fix: Use iodepth values that match VM behavior (often 1–16 per job) and track p99/p99.9 latency.
Mistake 4: “Random writes are terrible; ZFS is slow”
Symptoms: Small random write tests are bad, especially on RAIDZ.
Root cause: RAIDZ parity overhead plus COW allocation costs under small random writes. This is expected physics.
Fix: For VM-heavy random I/O, use mirrors (or special vdev designs) and size vdev count for IOPS, not raw capacity.
Mistake 5: “Latency spikes every so often like a heartbeat”
Symptoms: p99 latency jumps periodically during steady load.
Root cause: TXG sync behavior, dirty data throttling, or a slow device creating periodic stalls.
Fix: Correlate spikes with ZFS stats and disk await; validate device firmware and consider write latency improvements (better vdevs, better SLOG).
Mistake 6: “We tuned recordsize for VM disks”
Symptoms: Recordsize changes show no effect on VM zvol performance.
Root cause: recordsize doesn’t apply to zvols; volblocksize does.
Fix: Create zvols with appropriate volblocksize from the start; migrate if needed.
Mistake 7: “Compression made it faster in fio, so it must be better”
Symptoms: IOPS jump with compression on; CPU rises; under real load, latency worsens.
Root cause: CPU bottleneck or non-compressible data. Compression can help, but it’s not free.
Fix: Measure CPU headroom during realistic concurrency; check compressratio; keep compression if it’s actually reducing writes without pegging CPUs.
Joke #2: Changing sync=disabled to “fix performance” is like removing the smoke alarm because it keeps waking you up.
9) Three corporate mini-stories from the trenches
Story A: An incident caused by a wrong assumption (cache ≠ disk)
A mid-sized SaaS company rolled out a new VM cluster for internal CI and a couple of customer-facing databases.
The storage was ZFS on decent NVMe mirrors. The proof-of-readiness was a fio test that showed absurd random read IOPS.
Everyone relaxed. Procurement got a gold star.
Two weeks later, the incident channel lit up: database latency spikes, CI runners timing out, random “hung task” warnings in the guest kernels.
The on-call ran the same fio job and again got the big numbers. This created a special kind of misery: when metrics say “fast”
but humans say “slow,” you waste hours arguing about whose reality counts.
The wrong assumption was simple: “fio read IOPS equals disk performance.” The test file was small.
The ARC was huge. Under steady VM load, the hot working set wasn’t stable and sync writes were pushing TXG behavior into visible latency waves.
fio was benchmarking memory.
The fix wasn’t exotic. They rebuilt the fio suite: time-based tests, file sizes well above ARC, and a mixed workload with fsync.
The numbers got “worse,” which was the best thing that happened—now they matched production. They then found a single NVMe with inconsistent write latency.
Replacing it stabilized p99.9 and magically “improved the app,” which is the only benchmark anyone cares about.
Story B: An optimization that backfired (the sync shortcut)
A finance-adjacent platform had a VM farm running a message bus and a couple of PostgreSQL clusters.
During a peak season rehearsal, they saw elevated commit latency. Someone suggested a “temporary” ZFS change:
set sync=disabled on the dataset holding VM disks to make commits faster.
It worked immediately. Latency charts dropped. The rehearsal passed. The change stayed.
The team wasn’t reckless; they were busy, and the platform didn’t have a culture of config drift review.
Months later, a power event hit one rack. The hosts rebooted cleanly. The VMs came back. A few services didn’t.
What followed was a week of forensic work nobody enjoys: subtle database corruption patterns, missing acknowledged messages, and a slow rebuild of trust.
There wasn’t a single smoking gun log line. There rarely is. The “optimization” had turned durability from a contract into a suggestion.
ZFS did what it was told. The system failed exactly as configured.
The backfire wasn’t just the outage. It was the long-term operational debt:
they had to audit every dataset, re-baseline performance with sync enabled, validate SLOG hardware, and re-train teams to treat durability settings as production safety controls.
The eventual performance fix involved better log devices and more mirrors—not lying to the storage stack.
Story C: A boring but correct practice that saved the day (repeatable baselines)
Another org—this one with a painfully mature change process—kept a small suite of fio profiles versioned alongside their infrastructure code.
Same fio versions. Same job files. Same runtime. Same target datasets. Every storage-related change required a run and an attached report.
Nobody loved it. It wasn’t glamorous.
One quarter, they swapped an HBA firmware version during a maintenance window. Nothing else changed.
The next day, a few VMs started reporting occasional stalls. Not enough for a full incident, just enough to make people uneasy.
The team ran their standard fio suite and compared it to last month’s baseline. p99 write latency was meaningfully worse in sync-heavy profiles.
Because the baseline suite already existed, they didn’t debate methodology. They didn’t bikeshed iodepth.
They had a known-good “feel of the system” captured in numbers that mattered.
They rolled back firmware, and the stall reports disappeared.
The saving move here was boring: controlled, repeatable tests with latency percentiles and sync semantics.
It let them treat performance as a regression problem, not a philosophical argument.
10) Checklists / step-by-step plan
Step-by-step: build a VM-reality fio suite for ZFS
- Inventory your VM storage path. Are VM disks zvols, raw files, qcow2, or something else?
- Capture ZFS properties for the relevant datasets/zvols: compression, sync, recordsize/volblocksize.
- Pick three core profiles:
- 4K/8K mixed random with fsync (latency-focused)
- 16K random read storm (boot/login behavior)
- 1M sequential write (backup/restore throughput)
- Decide job structure: prefer numjobs for concurrency and keep iodepth moderate.
- Use time-based runs (3–10 minutes) with ramp time (30–60 seconds).
- Measure percentiles (95/99/99.9) and treat p99.9 as the “user pain proxy.”
- Size test files to exceed ARC if you intend to measure disk reads.
- Run tests in three modes:
- Host → zvol
- Host → dataset file
- Guest → filesystem (direct I/O and fsync)
- Record the environment: kernel/ZFS versions, CPU governor, pool topology, and whether any scrubs/resilvers were running.
- Repeat at least twice and compare. If results vary wildly, that variability is itself the finding.
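To make the suite concrete, here is a sketch of what a versioned job file might look like; the file name, sizes, and scratch directory are examples, and the jobs loosely mirror the profiles earlier in this article, running one after another via stonewall.
cr0x@server:~$ cat vm-baseline.fio
[global]
ioengine=libaio
direct=1
time_based
runtime=300
ramp_time=60
group_reporting
percentile_list=95:99:99.9
directory=/tank/vmtest

[oltp-mix-fsync]
rw=randrw
rwmixread=70
bs=8k
iodepth=4
numjobs=16
fsync=1
size=16G

[boot-storm]
stonewall
rw=randread
bs=16k
iodepth=8
numjobs=8
size=16G

[backup-stream]
stonewall
rw=write
bs=1m
iodepth=8
numjobs=1
size=50G
Run it with fio vm-baseline.fio --output-format=json --output=baseline-$(date +%F).json and keep the JSON next to the job file; diffs between runs become your regression report.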
Operational checklist: before trusting any fio number
- Is --direct=1 used when it should be?
- Does the profile include fsync/flush when modeling databases or VM durability?
- Is the test file bigger than ARC (for read tests)?
- Are you tracking p99/p99.9 latency?
- Are you watching zpool iostat -v and CPU during the test?
- Is the pool healthy (no errors, no degraded vdevs)?
- Is the pool not near-full?
- Did you run the test on the actual storage path used by VMs?
Change checklist: when tuning ZFS for VM workloads
- Don’t touch durability first. Leave sync alone unless you enjoy incident retrospectives.
- Prefer layout decisions over micro-tuning. Mirrors vs RAIDZ is a design choice, not a sysctl.
- Validate SLOG with sync-heavy fio and confirm it’s used.
- Align volblocksize to guest reality at zvol creation time.
- Measure regression risk with a baseline suite after every meaningful change.
11) FAQ
Q1: Should I run fio inside the VM or on the host?
Both, but for different reasons. Inside the VM tells you what the guest experiences (including hypervisor queues and guest filesystem behavior).
On the host isolates ZFS behavior. If they disagree, that’s a clue: your bottleneck is in the virtualization layer or caching.
Q2: What fio flags matter most for VM realism?
--direct=1, realistic --bs, moderate --iodepth, multiple --numjobs, time-based runs,
and --fsync=1 (or equivalent) for durability-sensitive workloads. Also: --percentile_list so you stop staring at averages.
Q3: Why does my random read test get faster over time?
ARC (or guest page cache) warming. You’re moving from disk to memory. If you’re trying to test disks, increase the working set and watch ARC miss rates.
Q4: How do I know if my SLOG is helping?
Run a sync-heavy fio profile and watch log device write ops in zpool iostat -v. Also compare p99 write latency with and without the SLOG.
If your workload isn’t sync, the SLOG shouldn’t help—and that’s not a failure.
Q5: Is RAIDZ “bad” for VM storage?
RAIDZ is not bad; it’s just not an IOPS monster for small random writes. For VM-heavy OLTP-like behavior, mirrors are usually the safer choice.
If you need RAIDZ for capacity efficiency, plan for the performance reality and test with sync + random writes.
Q6: Should I change recordsize for VM performance?
Only for datasets used as files (like qcow2/raw files). For zvol-backed VM disks, recordsize doesn’t apply; volblocksize does.
Q7: What’s a good target for p99 latency?
It depends on workload, but as a rule: if p99 sync write latency regularly enters tens of milliseconds, databases will complain.
Use your app SLOs to set a threshold; then tune design (vdevs, SLOG) to meet it.
Q8: How do I stop fio from destroying my pool performance for everyone else?
Run in maintenance windows, throttle with fewer jobs/iodepth, and monitor. fio is a load generator, not a polite guest.
If you must test in production, use shorter runs and prioritize latency profiles over saturating throughput.
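If you do have to share the pool while testing, fio can also cap its own rate; a hedged example, where the 500 read / 500 write IOPS caps are arbitrary placeholders you should tune to your actual headroom:
cr0x@server:~$ fio --name=polite-probe --directory=/tank/vmtest --size=4G \
--rw=randrw --rwmixread=70 --bs=8k --iodepth=2 --numjobs=4 --direct=1 --fsync=1 \
--rate_iops=500,500 --time_based --runtime=120 --ramp_time=15 \
--ioengine=libaio --group_reporting --percentile_list=95:99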
Q9: Does enabling compression always help VM workloads?
Often it helps, because VM data (OS files, logs) can compress and reduce physical writes. But if CPU becomes a bottleneck or data is incompressible,
compression can hurt tail latency. Check compressratio and CPU during realistic load.
Q10: Why do my fio results differ between zvols and dataset files?
Different code paths and properties. Datasets use recordsize and file metadata; zvols use volblocksize and present a block device.
VM platforms also behave differently depending on whether you use raw files, qcow2, or zvols.
12) Practical next steps
If you want your fio results to predict VM reality, do these next, in this order:
- Pick one VM disk (zvol or file) and build three fio profiles: boot storm, OLTP mixed with fsync, backup stream.
- Run them time-based with percentiles, and record p95/p99/p99.9, not just IOPS.
- During each run, capture zpool iostat -v, arcstat, and CPU stats.
- Validate sync path: confirm SLOG activity (if present) and verify no dataset has sync=disabled hiding problems.
- Turn the results into a baseline and rerun after every meaningful change: firmware, kernel, ZFS version, topology, and VM storage format.
The goal isn’t to get pretty numbers. The goal is to stop being surprised by production.
Once your fio suite makes the same things hurt that users complain about, you’re finally benchmarking the system you actually operate.