ZFS fio for VMs: Profiles That Match Reality (Not Marketing)

Your VM users don’t file tickets saying “IOPS are low.” They say “the database froze,” “Windows updates take forever,” or “that CI job is stuck in ‘copying artifacts’ again.”
Meanwhile, you run a quick fio test, get heroic numbers, and everyone goes home happy—until Monday.

ZFS makes this easier to mess up than most filesystems because it’s honest about durability and it has multiple layers that can cheat (ARC, compression, write coalescing, transaction groups).
The fix isn’t “run more fio.” The fix is running fio that looks like VMs actually behave: mixed I/O, sync semantics, realistic queue depths, and latency targets.

1) Reality check: what VM I/O really looks like

Most VM storage workloads aren’t “big sequential writes” and they aren’t “pure random 4K reads.”
They are an annoying cocktail:

  • Small-block random reads and writes (4K to 64K) from databases, metadata, package managers, and Windows background services.
  • Bursty sync writes (journals, WALs, fsync storms, VM guest flushes).
  • Mixed read/write ratios that change by the hour (backup windows, log rotation, patch Tuesdays).
  • Latency sensitivity more than throughput sensitivity. A VM can “feel slow” at 2,000 IOPS if p99 goes from 2 ms to 80 ms.
  • Concurrency is uneven: a few hot VMs can dominate, while most are quiet but still need consistent tail latency.

A realistic fio profile for VMs is not about maximizing a headline number. It’s about measuring the right failure modes:
sync write latency, queueing, write amplification, and whether your “fast” pool turns into a pumpkin when the TXG commits.

If your test doesn’t include fsync or a sync mode equivalent, it is not measuring the kind of pain that pages humans at 03:00.
Your pool might be fine for bulk ingest and still be terrible for VMs.

Joke #1: If your fio results look too good to be true, they probably came from ARC, not from your disks—like a résumé written by a cache layer.

2) Interesting facts (and a little history) that change how you benchmark

These aren’t trivia. Each one maps to a benchmark mistake I’ve seen in production.

  1. ZFS was designed around copy-on-write and transaction groups: writes are collected and committed in batches, which affects latency spikes during TXG sync.
  2. ARC (Adaptive Replacement Cache) is memory-first performance: a warm ARC can make fio read tests look like NVMe even when the pool is spindles.
  3. ZIL exists even when you don’t have a SLOG: without a separate device, the ZIL lives on the main pool and competes with everything else.
  4. SLOG accelerates sync writes, not all writes: async writes bypass the ZIL path; testing without sync semantics can make a SLOG look “useless.”
  5. volblocksize is set at zvol creation time: you don’t “tune it later” in any practical sense. This matters for VM block I/O alignment and write amplification.
  6. recordsize is a dataset property, not a zvol property: mixing datasets and zvols and expecting the same behavior is a classic benchmark error.
  7. Compression can increase IOPS (sometimes dramatically) by reducing physical writes—until your CPUs become the bottleneck or your data doesn’t compress.
  8. fio defaults can be dangerously “nice”: buffered I/O and friendly queue depths produce numbers that won’t survive real VM concurrency.
  9. NVMe write caches and power-loss protection matter: a “fast” consumer NVMe can be a liar under sync workloads if it lacks proper PLP behavior.

3) The ZFS + VM I/O stack: where your benchmark lies to you

VM I/O isn’t a single system. It’s a chain of decisions and caches. fio can test any link in that chain, and if you test the wrong one,
you’ll publish a benchmark for a system you don’t actually run.

Guest filesystem vs virtual disk vs host ZFS

The guest OS has its own cache and writeback behavior. The hypervisor has its own queueing. ZFS has ARC/L2ARC and its own write pipeline.
If you run fio inside the guest with buffered I/O, you’re mostly benchmarking guest memory and host memory bandwidth.
If you run fio on the host against a file, you’re benchmarking datasets and recordsize behavior, which may not match zvol behavior.

Sync semantics: where the real pain lives

The defining difference between “demo fast” and “production steady” is durability. Databases and many guest filesystems issue flushes.
On ZFS, synchronous writes go through ZIL semantics; a separate SLOG device can reduce latency by providing a low-latency log target.
But the SLOG isn’t magic: it must be low latency, consistent, and safe under power loss.

The worst fio sin in VM benchmarking is running a 1M sequential write test, seeing 2–5 GB/s, and calling the platform “ready for databases.”
That’s not a VM profile. That’s a marketing slide.

Queue depth and parallelism: iodepth is not a virtue signal

VM workloads often have moderate queue depth per VM but high concurrency across VMs. fio can simulate that in two ways:
numjobs (multiple independent workers) and iodepth (queue depth per worker).
For VMs, prefer more jobs with modest iodepth rather than one job with iodepth=256 unless you’re specifically modeling a heavy database.

One reliable way to create fake performance is to choose an iodepth that forces the device into its best sequential merge path.
It’s like evaluating a car’s city driving by rolling it downhill.
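
For a concrete contrast, here is a sketch of the two shapes against a hypothetical /tank/vmtest/shape.bin file; only the job structure differs:

cr0x@server:~$ fio --name=vm-shape --filename=/tank/vmtest/shape.bin --size=8G \
  --rw=randread --bs=8k --numjobs=12 --iodepth=4 --direct=1 \
  --time_based --runtime=120 --ioengine=libaio --group_reporting
cr0x@server:~$ fio --name=hero-shape --filename=/tank/vmtest/shape.bin --size=8G \
  --rw=randread --bs=8k --numjobs=1 --iodepth=256 --direct=1 \
  --time_based --runtime=120 --ioengine=libaio --group_reporting

The second run usually prints the bigger number and the less useful answer: it measures what the pool can do when saturated, not the service time any single VM actually experiences.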

4) Principles for fio profiles that match production

Principle A: test the thing users feel (latency), not just the thing vendors sell (throughput)

Capture p95/p99 latency, not just average. Your VM customers live in the tail.
fio can report percentiles; use them and treat them as first-class metrics.

Principle B: include sync write tests

Use --direct=1 to avoid page cache effects (especially when testing inside guests) and add a sync mechanism:
--fsync=1 (or --fdatasync=1) for file-based workloads, or --sync=1 to open the target with O_SYNC.
On raw block devices such as zvols, --sync=1 (or --fsync=1) is usually the closer analogue to guest flush behavior,
but the cleanest model for VM “flush storms” is still a real filesystem plus fsync patterns.
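
A minimal sync-latency probe, assuming a throwaway /tank/vmtest directory; it measures per-fsync service time rather than throughput, which is what database commits actually feel:

cr0x@server:~$ fio --name=sync-probe --directory=/tank/vmtest --size=1G \
  --rw=randwrite --bs=4k --numjobs=1 --fsync=1 --direct=1 \
  --time_based --runtime=60 --ioengine=psync

The completion latency line from this probe is the number to compare before and after any SLOG or sync-related change.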

Principle C: ensure the working set defeats ARC when you intend to test disks

If you want to measure pool performance, your test size should be larger than ARC by a wide margin.
If ARC is 64 GiB, do not run a 10 GiB read test and call it “disk speed.”
Alternately, test in a way that focuses on writes (sync writes especially) where ARC cannot fully hide physical behavior.

Principle D: match block sizes to what guests do

VM random I/O tends to cluster at 4K, 8K, 16K, and 32K. Large 1M blocks are for backup streams and media workloads.
Use multiple block sizes or a distribution if you can. If you must pick one: 4K random and 128K sequential are the workhorses.
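
If you want a single job to approximate that mix, fio’s bssplit option can weight several block sizes at once. A sketch, assuming a /tank/vmtest directory; the percentages are placeholders you should replace with what your guests actually issue:

cr0x@server:~$ fio --name=vm-bs-mix --directory=/tank/vmtest --size=8G \
  --rw=randrw --rwmixread=70 --bssplit=4k/30:8k/30:16k/20:32k/20 \
  --numjobs=8 --iodepth=4 --direct=1 --time_based --runtime=300 \
  --ramp_time=30 --ioengine=libaio --group_reporting

Note that bssplit replaces --bs; don’t specify both.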

Principle E: use time-based tests with ramp time

ZFS behavior changes as TXGs commit, ARC warms, metadata gets created, and free space fragments.
Run tests long enough to see a few TXG cycles. Use a ramp-up period to avoid measuring the first 10 seconds of “everything is empty and happy.”

Principle F: pin down the test environment

CPU governor, interrupt balancing, virtio settings, dataset properties, and zvol properties all matter.
Reproducibility is a feature. If you can’t rerun the test a month later and explain deltas, it’s not benchmarking—it’s vibes.

One quote worth keeping on a sticky note near your monitoring wall:
Everything fails, all the time. — Werner Vogels

5) Realistic fio profiles (with explanations)

These are not “best” profiles. They’re honest profiles. Use them as building blocks and tune to your VM mix.
For each profile, decide whether you’re testing inside the guest, on the host against a zvol, or on the host against a dataset file.

Profile 1: VM boot/login storm (read-heavy, small random, modest concurrency)

Models dozens of VMs booting, services starting, reading many small files. It’s mostly reads, but not purely random.

cr0x@server:~$ fio --name=vm-boot --filename=/dev/zvol/tank/vm-101-disk0 \
  --rw=randread --bs=16k --iodepth=8 --numjobs=8 --direct=1 \
  --time_based --runtime=180 --ramp_time=30 --group_reporting \
  --ioengine=libaio --percentile_list=95:99:99.9
vm-boot: (groupid=0, jobs=8): err= 0: pid=21233: Sat Dec 21 11:02:20 2025
  read: IOPS=42.1k, BW=658MiB/s (690MB/s)(115GiB/180s)
    slat (usec): min=3, max=2100, avg=12.4, stdev=18.9
    clat (usec): min=90, max=28000, avg=1480, stdev=2100
     lat (usec): min=105, max=28150, avg=1492, stdev=2102
    clat percentiles (usec):
     | 95.00th=[ 3600], 99.00th=[ 8200], 99.90th=[18000]

What it means: 42k IOPS looks great, but the real signal is p99 and p99.9 latency.
Boot storms feel bad when p99 goes into tens of milliseconds.
Decision: if p99.9 is high, look for contention (other workloads), special vdev needs, or too-small/slow vdevs.

Profile 2: OLTP database-ish (mixed random, sync writes matter)

This is the profile that exposes whether your SLOG is real or cosplay.
Use it on a filesystem inside the guest if you can, because guests do fsync. On the host, you can run against a file on a dataset to model fsync.

cr0x@server:~$ fio --name=oltp-mix-fsync --directory=/tank/vmtest --size=8G \
  --rw=randrw --rwmixread=70 --bs=8k --iodepth=4 --numjobs=16 \
  --direct=1 --time_based --runtime=300 --ramp_time=60 \
  --ioengine=libaio --fsync=1 --group_reporting --percentile_list=95:99:99.9
oltp-mix-fsync: (groupid=0, jobs=16): err= 0: pid=21901: Sat Dec 21 11:12:54 2025
  read: IOPS=18.4k, BW=144MiB/s (151MB/s)(42.2GiB/300s)
    clat (usec): min=120, max=95000, avg=2900, stdev=5200
    clat percentiles (usec):
     | 95.00th=[ 8200], 99.00th=[22000], 99.90th=[62000]
  write: IOPS=7.88k, BW=61.6MiB/s (64.6MB/s)(18.0GiB/300s)
    clat (usec): min=180, max=120000, avg=4100, stdev=7800
    clat percentiles (usec):
     | 95.00th=[12000], 99.00th=[34000], 99.90th=[90000]

What it means: With fsync, latency tails blow up first. Average can look “fine” while p99.9 ruins transactions.
Decision: if p99.9 write latency is ugly, validate SLOG, sync settings, and device write cache behavior.
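
If you plan to rerun this profile as a baseline, the same job reads better as a version-controlled job file than as a shell one-liner. A sketch that mirrors the command above; the file name profiles/oltp-mix-fsync.fio and the /tank/vmtest directory are placeholders:

cr0x@server:~$ cat profiles/oltp-mix-fsync.fio
[global]
directory=/tank/vmtest
ioengine=libaio
direct=1
time_based=1
runtime=300
ramp_time=60
group_reporting=1
percentile_list=95:99:99.9

[oltp-mix-fsync]
rw=randrw
rwmixread=70
bs=8k
size=8G
iodepth=4
numjobs=16
fsync=1

Run it with fio profiles/oltp-mix-fsync.fio. Keeping these files next to your infrastructure code is the boring practice that pays off in Story C later.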

Profile 3: Windows update / package manager (metadata-heavy, small random reads/writes)

This is where special vdevs for metadata and small blocks can be worth their cost—if you actually have the right kind of pool.

cr0x@server:~$ fio --name=metadata-chaos --directory=/tank/vmtest --size=8G \
  --rw=randrw --rwmixread=60 --bs=4k --iodepth=16 --numjobs=8 \
  --direct=1 --time_based --runtime=240 --ramp_time=30 \
  --ioengine=libaio --group_reporting --percentile_list=95:99:99.9
metadata-chaos: (groupid=0, jobs=8): err= 0: pid=22188: Sat Dec 21 11:18:22 2025
  read: IOPS=55.0k, BW=215MiB/s (226MB/s)(50.4GiB/240s)
    clat percentiles (usec): 95.00th=[ 2400], 99.00th=[ 6800], 99.90th=[16000]
  write: IOPS=36.0k, BW=141MiB/s (148MB/s)(33.0GiB/240s)
    clat percentiles (usec): 95.00th=[ 3100], 99.00th=[ 9200], 99.90th=[24000]

What it means: If these percentiles degrade sharply when the pool is half full or fragmented,
you may have a layout/ashift issue, an overloaded mirror vdev, or you’re missing fast metadata paths.
Decision: compare performance at different pool fill levels and after sustained random writes.

Profile 4: Backup/restore stream (sequential, large blocks, checks “can we drain?”)

This profile is not a VM latency test. It answers: “Can we move big data without destroying everything?”
Use it to schedule backup windows and decide whether to throttle.

cr0x@server:~$ fio --name=backup-write --filename=/tank/vmtest/backup.bin \
  --rw=write --bs=1m --iodepth=8 --numjobs=1 --direct=1 \
  --size=50G --ioengine=libaio --group_reporting
backup-write: (groupid=0, jobs=1): err= 0: pid=22502: Sat Dec 21 11:24:10 2025
  write: IOPS=1450, BW=1450MiB/s (1520MB/s)(50.0GiB/35s)

What it means: Great throughput doesn’t mean your pool is healthy for VMs.
Decision: use this to set backup throttles; then rerun a latency-sensitive profile concurrently to see interference.

Profile 5: “No cheating” disk test (working set bigger than ARC, random reads)

Use this when someone claims the pool is “slow,” and you need to establish raw read capability without ARC masking the truth.
You must size the file beyond ARC and run long enough to avoid warm-cache artifacts.

cr0x@server:~$ fio --name=arc-buster --filename=/tank/vmtest/arc-buster.bin \
  --rw=randread --bs=128k --iodepth=32 --numjobs=4 --direct=1 \
  --size=500G --time_based --runtime=240 --ramp_time=30 \
  --ioengine=libaio --group_reporting --percentile_list=95:99
arc-buster: (groupid=0, jobs=4): err= 0: pid=22791: Sat Dec 21 11:31:12 2025
  read: IOPS=3100, BW=387MiB/s (406MB/s)(90.7GiB/240s)
    clat percentiles (usec):
     | 95.00th=[ 16000], 99.00th=[ 32000]

What it means: Lower IOPS and higher latency are normal here; you’re finally touching disks.
Decision: if this is unexpectedly awful, check vdev layout, ashift, and disk health before arguing about fio flags.

6) Practical tasks: commands, what output means, and what you decide

This is the part you’ll actually use during an incident or a capacity review.
Each task includes: a command, sample output, what it means, and the decision it drives.
Assume a Linux host running ZFS with a pool named tank.

Task 1: Identify whether you’re benchmarking a zvol or a dataset (and what properties apply)

cr0x@server:~$ zfs list -r -o name,type,volblocksize,recordsize,compression,sync tank
NAME               TYPE        VOLBLOCK  RECSIZE  COMPRESS  SYNC
tank               filesystem         -     128K  lz4       standard
tank/vmdata        filesystem         -     128K  lz4       standard
tank/vm-101-disk0  volume           16K        -  lz4       standard

What it means: zvols have volblocksize; datasets have recordsize.
Mixing their results is how you accidentally “optimize” the wrong thing.
Decision: choose fio target accordingly: /dev/zvol/... for zvol VM disks, or a file on the dataset if your VM storage uses files.

Task 2: Check pool topology (your vdev layout is your performance contract)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            nvme0n1                 ONLINE       0     0     0
            nvme1n1                 ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            nvme2n1                 ONLINE       0     0     0
            nvme3n1                 ONLINE       0     0     0
        logs
          nvme4n1                   ONLINE       0     0     0

errors: No known data errors

What it means: Mirrors behave differently than RAIDZ under random I/O. The logs section indicates a separate SLOG device exists.
Decision: if you’re testing sync writes, verify logs are present and healthy; if RAIDZ, expect lower small random write IOPS and plan accordingly.

Task 3: Check pool free space and fragmentation risk signals

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank        42.8T  6.10T   192K  /tank

What it means: Pools running hot on space tend to show worse allocation behavior and worse tail latency.
ZFS isn’t uniquely bad here; it’s just honest about the consequences.
Decision: if avail is tight, stop “benchmarking” and start capacity work. Any fio test now is measuring a system already in distress.

Task 4: Validate that sync settings aren’t quietly lying to you

cr0x@server:~$ zfs get -r -o name,property,value sync tank/vmdata
NAME        PROPERTY  VALUE
tank/vmdata sync      standard

What it means: sync=standard means sync requests are honored. sync=disabled makes benchmarks pretty and audits angry.
Decision: if someone set sync=disabled “temporarily,” treat every performance result as contaminated.
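
To audit the whole pool instead of one dataset, a one-liner like this works (the pool name is yours; an empty result is the good outcome):

cr0x@server:~$ zfs get -r -t filesystem,volume -o name,value sync tank | grep -w disabled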

Task 5: Verify ashift (because 4K disks don’t forgive 512-byte fantasies)

cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
120:        ashift: 12

What it means: ashift=12 means 4K sectors. Wrong ashift can permanently degrade performance via read-modify-write.
Decision: if ashift is wrong, plan a migration. You don’t “tune” your way out of it.

Task 6: Check ARC size vs test size (are you benchmarking RAM?)

cr0x@server:~$ arcstat 1 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
11:40:22  128K   56K     43   10K   8%   40K  31%    6K   4%   64G   80G

What it means: ARC is 64G with 80G target. If your fio file is smaller than that, reads will “improve” over time.
Decision: for disk tests, use a file several times bigger than ARC, or focus on sync writes where ARC can’t fully mask latency.
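
If arcstat isn’t installed, the raw kstats answer the same question. A sketch assuming OpenZFS on Linux; size is the current ARC size and c_max is its ceiling, both in bytes:

cr0x@server:~$ awk '$1 == "size" || $1 == "c_max" {printf "%-8s %.1f GiB\n", $1, $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats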

Task 7: Watch ZFS I/O and latency at the pool level during fio

cr0x@server:~$ zpool iostat -v tank 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        42.8T  6.10T  8.20K  3.10K   410M   220M
  mirror-0                  21.4T  3.05T  4.10K  1.55K   205M   110M
    nvme0n1                      -      -  2.05K    780   102M    55M
    nvme1n1                      -      -  2.05K    770   103M    55M
  mirror-1                  21.4T  3.05T  4.10K  1.55K   205M   110M
    nvme2n1                      -      -  2.04K    780   102M    55M
    nvme3n1                      -      -  2.06K    770   103M    55M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: You see whether load is spread across vdevs or one side is hot.
Decision: if one vdev is overloaded (or a disk is slower), investigate imbalance, firmware, or a failing device.

Task 8: Confirm SLOG is actually being used for sync writes

cr0x@server:~$ zpool iostat -v tank 1 2 | sed -n '1,18p'
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        42.8T  6.10T  2.10K  6.40K   120M   210M
  mirror-0                  21.4T  3.05T  1.05K  3.10K    60M   105M
  mirror-1                  21.4T  3.05T  1.05K  3.30K    60M   105M
logs                             -      -     2  9.80K   512K   310M
  nvme4n1                         -      -     2  9.80K   512K   310M

What it means: High write ops on the log device during sync-heavy fio indicates ZIL traffic is landing on SLOG.
Decision: if log writes don’t move during sync tests, either workload isn’t sync, sync=disabled is set somewhere, or SLOG isn’t configured.

Task 9: Check CPU bottlenecks and IRQ pressure during “fast” benchmarks

cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server)  12/21/2025  _x86_64_  (32 CPU)

12:01:10 PM  CPU   %usr  %nice   %sys %iowait  %irq  %soft  %idle
12:01:11 PM  all   18.2   0.0   22.9    0.8    0.0   6.1   52.0
12:01:11 PM   7    4.0   0.0   78.0    0.0    0.0   0.0   18.0

What it means: One CPU pinned at high %sys may indicate a single queue/IRQ bottleneck (NVMe, virtio, networking, or locking).
Decision: if you’re CPU-bound, stop tuning disks. Fix queueing, IRQ affinity, or increase parallelism.

Task 10: Verify guest flush behavior (why “sync” keeps showing up)

cr0x@server:~$ qemu-img info /tank/vmimages/vm-101.qcow2
image: /tank/vmimages/vm-101.qcow2
file format: qcow2
virtual size: 200 GiB (214748364800 bytes)
disk size: 36.1 GiB
cluster_size: 65536

What it means: qcow2 has its own metadata and allocation behavior. It can amplify writes and introduce extra flushes depending on settings.
Decision: if latency is bad under sync workloads, consider raw/zvol or tune qcow2 cache/IO mode carefully (and test again).
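
If you decide to move a disk off qcow2, the conversion itself is routine. A sketch, assuming the target zvol already exists, is at least as large as the virtual size, and the VM is powered off:

cr0x@server:~$ qemu-img convert -p -n -f qcow2 -O raw \
  /tank/vmimages/vm-101.qcow2 /dev/zvol/tank/vm-101-disk0

The -n flag skips target creation because the zvol already exists. Rerun your latency profiles afterwards rather than assuming the move helped.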

Task 11: Check ZFS compression ratio and decide whether CPU is helping or hurting

cr0x@server:~$ zfs get -o name,property,value,source compressratio,compression tank/vmdata
NAME        PROPERTY       VALUE  SOURCE
tank/vmdata compression    lz4    local
tank/vmdata compressratio  1.62x  -

What it means: A real compressratio suggests your pool is writing less to disk than the VM thinks it is.
Decision: if compressratio is high and CPU is not saturated, compression is a net win. If CPU is pegged, benchmark with and without.

Task 12: Verify zvol block alignment expectations for VM I/O

cr0x@server:~$ lsblk -o NAME,PHY-SEC,LOG-SEC,MIN-IO,OPT-IO,ROTA /dev/zvol/tank/vm-101-disk0
NAME                     PHY-SEC LOG-SEC MIN-IO OPT-IO ROTA
zd0                        4096    4096   4096      0    0

What it means: 4K logical/physical sectors align with modern expectations. Misalignment causes RMW and latency spikes.
Decision: if you see 512 logical sectors atop 4K devices, fix it at design time (ashift/volblocksize). Otherwise you’ll be “tuning” forever.

Task 13: Measure TXG sync pressure signals

cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs
1 0x01 0x00000000 136 13440 105155148830 0

What it means: This file can change by implementation, but if TXG sync times or backlog grow during load, you’ll see latency waves.
Decision: if tail latency correlates with TXG sync behavior, investigate dirty data limits, vdev write latency, and SLOG effectiveness rather than chasing fio knobs.
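
One way to watch this live while fio runs; the path assumes OpenZFS on Linux and a pool named tank, and the columns vary by version:

cr0x@server:~$ watch -n 1 "tail -n 5 /proc/spl/kstat/zfs/tank/txgs"

If the per-TXG times balloon exactly when fio’s latency percentiles spike, you’ve found your wave generator.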

Task 14: Check device error counters and latency outliers before blaming ZFS

cr0x@server:~$ smartctl -a /dev/nvme0n1 | sed -n '1,25p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0] (local build)
=== START OF INFORMATION SECTION ===
Model Number:                       ACME NVMe 3.2TB
Firmware Version:                   1.04
Percentage Used:                    2%
Data Units Read:                    19,442,112
Data Units Written:                 13,188,440
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

What it means: A single flaky device can turn p99 into a horror story while average looks decent.
Decision: if errors or high wear appear, replace the device before you “optimize” around failing hardware.

7) Fast diagnosis playbook: find the bottleneck in minutes

This is the “stop debating, start isolating” playbook. Use it when latency is high or fio results don’t match production.
The goal is to identify whether you’re bound by the guest, the hypervisor, ZFS, the vdev layout, or a single sick device.

First: decide what you are actually testing

  • fio in guest, buffered I/O → mostly guest cache and memory behavior.
  • fio in guest, direct I/O → closer to virtual disk behavior (still through hypervisor queues).
  • fio on host against a zvol → tests ZFS block volume path, bypasses guest FS.
  • fio on host against a file in dataset → tests ZFS dataset path and recordsize behavior.

If the test target doesn’t match your VM storage path, stop. Fix the test.

Second: ask “is it sync?”

  • Run a sync-heavy fio profile (--fsync=1 or equivalent) and watch zpool iostat -v for log activity.
  • Check zfs get sync at the dataset/zvol level.

If p99 write latency explodes only with sync, your problem is in ZIL/SLOG behavior, device cache safety, or write latency on the vdevs.

Third: determine whether ARC is masking reads

  • Compare an ARC-busting read test with a small read test.
  • Watch arcstat miss rates during the run.

If “disk reads” are fast but misses are low, you’re not reading disks. You’re reading RAM and calling it storage.

Fourth: locate the choke point with live stats

  • zpool iostat -v 1 shows per-vdev distribution and whether logs are used.
  • mpstat 1 shows CPU saturation and single-core pressure.
  • iostat -x 1 shows device utilization and latency at the block layer.

If a single device is pegged or shows high await, isolate it. If CPU is pegged, stop shopping for faster SSDs.

Fifth: check pool health and allocation reality

  • Pool near full? Expect worse behavior.
  • Recent resilver? Scrub running? Expect interference.
  • Errors? Stop performance work and fix integrity first.

8) Common mistakes: symptoms → root cause → fix

Mistake 1: “fio shows 1M IOPS but VMs are slow”

Symptoms: Massive read IOPS in fio, but real apps have high latency and stalls.

Root cause: ARC/page cache benchmark. Test file fits in RAM; fio is reading cache, not storage.

Fix: Use --direct=1, make the working set larger than ARC, and watch arcstat miss% during the run.

Mistake 2: “SLOG did nothing”

Symptoms: Adding a SLOG shows no improvement; sync write latency unchanged.

Root cause: Workload wasn’t sync (no fsync/flush), or sync=disabled set, or log device not active.

Fix: Run fsync-heavy fio, verify zpool status shows logs, and confirm log write ops in zpool iostat -v.

Mistake 3: “We increased iodepth and got better numbers, so we’re done”

Symptoms: Benchmark IOPS improved with iodepth=256; production still suffers.

Root cause: Artificial queueing hides latency. You’re measuring saturation throughput, not service time.

Fix: Use iodepth values that match VM behavior (often 1–16 per job) and track p99/p99.9 latency.

Mistake 4: “Random writes are terrible; ZFS is slow”

Symptoms: Small random write tests are bad, especially on RAIDZ.

Root cause: RAIDZ parity overhead plus COW allocation costs under small random writes. This is expected physics.

Fix: For VM-heavy random I/O, use mirrors (or special vdev designs) and size vdev count for IOPS, not raw capacity.

Mistake 5: “Latency spikes every so often like a heartbeat”

Symptoms: p99 latency jumps periodically during steady load.

Root cause: TXG sync behavior, dirty data throttling, or a slow device creating periodic stalls.

Fix: Correlate spikes with ZFS stats and disk await; validate device firmware and consider write latency improvements (better vdevs, better SLOG).

Mistake 6: “We tuned recordsize for VM disks”

Symptoms: Recordsize changes show no effect on VM zvol performance.

Root cause: recordsize doesn’t apply to zvols; volblocksize does.

Fix: Create zvols with appropriate volblocksize from the start; migrate if needed.
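
A minimal sketch of “from the start”; the name, size, and volblocksize are placeholders to match your guest’s dominant I/O size:

cr0x@server:~$ zfs create -s -V 200G -o volblocksize=16k -o compression=lz4 tank/vm-102-disk0

The -s flag makes it a sparse volume; drop it if you prefer to reserve the space up front.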

Mistake 7: “Compression made it faster in fio, so it must be better”

Symptoms: IOPS jump with compression on; CPU rises; under real load, latency worsens.

Root cause: CPU bottleneck or non-compressible data. Compression can help, but it’s not free.

Fix: Measure CPU headroom during realistic concurrency; check compressratio; keep compression if it’s actually reducing writes without pegging CPUs.

Joke #2: Changing sync=disabled to “fix performance” is like removing the smoke alarm because it keeps waking you up.

9) Three corporate mini-stories from the trenches

Story A: An incident caused by a wrong assumption (cache ≠ disk)

A mid-sized SaaS company rolled out a new VM cluster for internal CI and a couple of customer-facing databases.
The storage was ZFS on decent NVMe mirrors. The proof-of-readiness was a fio test that showed absurd random read IOPS.
Everyone relaxed. Procurement got a gold star.

Two weeks later, the incident channel lit up: database latency spikes, CI runners timing out, random “hung task” warnings in the guest kernels.
The on-call ran the same fio job and again got the big numbers. This created a special kind of misery: when metrics say “fast”
but humans say “slow,” you waste hours arguing about whose reality counts.

The wrong assumption was simple: “fio read IOPS equals disk performance.” The test file was small.
The ARC was huge. Under steady VM load, the hot working set wasn’t stable and sync writes were pushing TXG behavior into visible latency waves.
fio was benchmarking memory.

The fix wasn’t exotic. They rebuilt the fio suite: time-based tests, file sizes well above ARC, and a mixed workload with fsync.
The numbers got “worse,” which was the best thing that happened—now they matched production. They then found a single NVMe with inconsistent write latency.
Replacing it stabilized p99.9 and magically “improved the app,” which is the only benchmark anyone cares about.

Story B: An optimization that backfired (the sync shortcut)

A finance-adjacent platform had a VM farm running a message bus and a couple of PostgreSQL clusters.
During a peak season rehearsal, they saw elevated commit latency. Someone suggested a “temporary” ZFS change:
set sync=disabled on the dataset holding VM disks to make commits faster.

It worked immediately. Latency charts dropped. The rehearsal passed. The change stayed.
The team wasn’t reckless; they were busy, and the platform didn’t have a culture of config drift review.
Months later, a power event hit one rack. The hosts rebooted cleanly. The VMs came back. A few services didn’t.

What followed was a week of forensic work nobody enjoys: subtle database corruption patterns, missing acknowledged messages, and a slow rebuild of trust.
There wasn’t a single smoking gun log line. There rarely is. The “optimization” had turned durability from a contract into a suggestion.
ZFS did what it was told. The system failed exactly as configured.

The backfire wasn’t just the outage. It was the long-term operational debt:
they had to audit every dataset, re-baseline performance with sync enabled, validate SLOG hardware, and re-train teams to treat durability settings as production safety controls.
The eventual performance fix involved better log devices and more mirrors—not lying to the storage stack.

Story C: A boring but correct practice that saved the day (repeatable baselines)

Another org—this one with a painfully mature change process—kept a small suite of fio profiles versioned alongside their infrastructure code.
Same fio versions. Same job files. Same runtime. Same target datasets. Every storage-related change required a run and an attached report.
Nobody loved it. It wasn’t glamorous.

One quarter, they swapped an HBA firmware version during a maintenance window. Nothing else changed.
The next day, a few VMs started reporting occasional stalls. Not enough for a full incident, just enough to make people uneasy.
The team ran their standard fio suite and compared it to last month’s baseline. p99 write latency was meaningfully worse in sync-heavy profiles.

Because the baseline suite already existed, they didn’t debate methodology. They didn’t bikeshed iodepth.
They had a known-good “feel of the system” captured in numbers that mattered.
They rolled back firmware, and the stall reports disappeared.

The saving move here was boring: controlled, repeatable tests with latency percentiles and sync semantics.
It let them treat performance as a regression problem, not a philosophical argument.

10) Checklists / step-by-step plan

Step-by-step: build a VM-reality fio suite for ZFS

  1. Inventory your VM storage path. Are VM disks zvols, raw files, qcow2, or something else?
  2. Capture ZFS properties for the relevant datasets/zvols: compression, sync, recordsize/volblocksize.
  3. Pick three core profiles:
    • 4K/8K mixed random with fsync (latency-focused)
    • 16K random read storm (boot/login behavior)
    • 1M sequential write (backup/restore throughput)
  4. Decide job structure: prefer numjobs for concurrency and keep iodepth moderate.
  5. Use time-based runs (3–10 minutes) with ramp time (30–60 seconds).
  6. Measure percentiles (95/99/99.9) and treat p99.9 as the “user pain proxy.”
  7. Size test files to exceed ARC if you intend to measure disk reads.
  8. Run tests in three modes:
    • Host → zvol
    • Host → dataset file
    • Guest → filesystem (direct I/O and fsync)
  9. Record the environment: kernel/ZFS versions, CPU governor, pool topology, and whether any scrubs/resilvers were running.
  10. Repeat at least twice and compare (a minimal runner sketch follows this list). If results vary wildly, that variability is itself the finding.
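
A runner sketch for step 10, assuming the job files from section 5 live in profiles/ and you want dated JSON baselines you can diff later:

cr0x@server:~$ day=$(date +%F); mkdir -p results/$day
cr0x@server:~$ for job in profiles/*.fio; do \
    fio --output-format=json --output="results/$day/$(basename "$job" .fio).json" "$job"; \
  done

JSON output keeps the percentiles machine-readable, which turns “is this worse than last month?” into a diff instead of a debate.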

Operational checklist: before trusting any fio number

  • Is --direct=1 used when it should be?
  • Does the profile include fsync/flush when modeling databases or VM durability?
  • Is the test file bigger than ARC (for read tests)?
  • Are you tracking p99/p99.9 latency?
  • Are you watching zpool iostat -v and CPU during the test?
  • Is the pool healthy (no errors, no degraded vdevs)?
  • Is the pool not near-full?
  • Did you run the test on the actual storage path used by VMs?

Change checklist: when tuning ZFS for VM workloads

  • Don’t touch durability first. Leave sync alone unless you enjoy incident retrospectives.
  • Prefer layout decisions over micro-tuning. Mirrors vs RAIDZ is a design choice, not a sysctl.
  • Validate SLOG with sync-heavy fio and confirm it’s used.
  • Align volblocksize to guest reality at zvol creation time.
  • Measure regression risk with a baseline suite after every meaningful change.

11) FAQ

Q1: Should I run fio inside the VM or on the host?

Both, but for different reasons. Inside the VM tells you what the guest experiences (including hypervisor queues and guest filesystem behavior).
On the host isolates ZFS behavior. If they disagree, that’s a clue: your bottleneck is in the virtualization layer or caching.

Q2: What fio flags matter most for VM realism?

--direct=1, realistic --bs, moderate --iodepth, multiple --numjobs, time-based runs,
and --fsync=1 (or equivalent) for durability-sensitive workloads. Also: --percentile_list so you stop staring at averages.

Q3: Why does my random read test get faster over time?

ARC (or guest page cache) warming. You’re moving from disk to memory. If you’re trying to test disks, increase the working set and watch ARC miss rates.

Q4: How do I know if my SLOG is helping?

Run a sync-heavy fio profile and watch log device write ops in zpool iostat -v. Also compare p99 write latency with and without the SLOG.
If your workload isn’t sync, the SLOG shouldn’t help—and that’s not a failure.

Q5: Is RAIDZ “bad” for VM storage?

RAIDZ is not bad; it’s just not an IOPS monster for small random writes. For VM-heavy OLTP-like behavior, mirrors are usually the safer choice.
If you need RAIDZ for capacity efficiency, plan for the performance reality and test with sync + random writes.

Q6: Should I change recordsize for VM performance?

Only for datasets used as files (like qcow2/raw files). For zvol-backed VM disks, recordsize doesn’t apply; volblocksize does.

Q7: What’s a good target for p99 latency?

It depends on workload, but as a rule: if p99 sync write latency regularly enters tens of milliseconds, databases will complain.
Use your app SLOs to set a threshold; then tune design (vdevs, SLOG) to meet it.

Q8: How do I stop fio from destroying my pool performance for everyone else?

Run in maintenance windows, throttle with fewer jobs/iodepth, and monitor. fio is a load generator, not a polite guest.
If you must test in production, use shorter runs and prioritize latency profiles over saturating throughput.

Q9: Does enabling compression always help VM workloads?

Often it helps, because VM data (OS files, logs) can compress and reduce physical writes. But if CPU becomes a bottleneck or data is incompressible,
compression can hurt tail latency. Check compressratio and CPU during realistic load.

Q10: Why do my fio results differ between zvols and dataset files?

Different code paths and properties. Datasets use recordsize and file metadata; zvols use volblocksize and present a block device.
VM platforms also behave differently depending on whether you use raw files, qcow2, or zvols.

12) Practical next steps

If you want your fio results to predict VM reality, do these next, in this order:

  1. Pick one VM disk (zvol or file) and build three fio profiles: boot storm, OLTP mixed with fsync, backup stream.
  2. Run them time-based with percentiles, and record p95/p99/p99.9, not just IOPS.
  3. During each run, capture zpool iostat -v, arcstat, and CPU stats.
  4. Validate sync path: confirm SLOG activity (if present) and verify no dataset has sync=disabled hiding problems.
  5. Turn the results into a baseline and rerun after every meaningful change: firmware, kernel, ZFS version, topology, and VM storage format.

The goal isn’t to get pretty numbers. The goal is to stop being surprised by production.
Once your fio suite makes the same things hurt that users complain about, you’re finally benchmarking the system you actually operate.
