Someone hands you a screenshot: “Our new ZFS box does 6 GB/s!” You ask one question—“From where?”—and the room goes quiet. Because the fastest storage device in the data center is the one you didn’t mean to test: RAM. The second fastest is the one you forgot you enabled: compression.
Benchmarks are how storage projects get funded, approved, and later blamed. In production, “almost right” is the same as wrong, just with nicer graphs. This is a set of rules and field methods for benchmarking ZFS without lying to yourself, your boss, or the next on-call.
What you’re actually measuring (and what you’re not)
ZFS benchmarking fails for one core reason: people benchmark “storage” as if it’s one thing. It isn’t. You are benchmarking a pipeline: application → libc → kernel VFS → ZFS DMU → ZIO scheduler → vdev queues → disks/flash → controller → firmware. Add in ARC, L2ARC, SLOG, compression, checksums, and metadata, and you’re benchmarking a system with more moving parts than a corporate reorg.
There are three numbers people care about:
- Throughput (MB/s or GB/s): bulk read/write speed. Great for backups, media, sequential scans, and lying in slide decks.
- IOPS: operations per second. Great for small blocks, random access, metadata churn, and real databases.
- Latency (especially p95/p99): the only number your users feel. ZFS can deliver high throughput while quietly murdering tail latency.
A benchmark that produces one number and no latency distribution is a vibe check, not engineering.
Benchmarking is inference, not measurement
Storage is hard to “measure” directly because the system adapts. ARC warms up. Transaction groups (TXGs) batch work. ZFS prefetch and readahead change patterns. Compression changes physical bytes. If you don’t control those variables, you are not testing the pool—you are testing whatever ZFS decided to do while you watched.
One dry rule: if you can’t explain why a result improved, it didn’t improve. You just got lucky.
A few facts and history that change how you benchmark
Some context helps you predict how ZFS will behave under test. Here are concrete facts that matter in the lab and in production:
- ZFS was designed with end-to-end data integrity as a first-class goal, not as an add-on. Checksums, copy-on-write, and self-healing influence write amplification and latency.
- Early ZFS work at Sun had to make commodity disks act “enterprise enough”. The whole point was to survive flaky hardware with smart software. Benchmarks that ignore integrity options miss the point.
- The ARC is not just a cache, it’s a memory policy engine. It competes with your application for RAM, and its behavior changes with workload. Benchmarking without controlling ARC is benchmarking memory management.
- Copy-on-write means “overwrite” is actually allocate-new + update-metadata. Random overwrites behave like random writes plus metadata churn. Filesystems that overwrite in place can look “faster” on some patterns, until they corrupt data.
- TXG batching is why short tests lie. ZFS groups writes into TXGs and commits periodically. If you benchmark for 10 seconds, you might never see steady-state behavior.
- SLOG exists because synchronous writes are expensive. ZIL is always there; SLOG is the separate device that can accelerate sync writes. If you never test sync behavior, you’re benchmarking the wrong risk.
- Recordsize is a performance contract. ZFS can store large records efficiently for sequential workloads, but large records can punish small random writes via read-modify-write patterns.
- Sector alignment (ashift) is forever. Pick wrong, and every write can become two writes. You can fix a lot in ZFS. You can’t “tune” your way out of a wrong ashift without rebuilding.
- Compression often makes ZFS “faster” by doing less I/O. That’s not cheating—unless your data won’t compress in production. Then it’s the benchmark equivalent of downhill racing.
One quote to keep you honest:
“Hope is not a strategy.” — General Gordon R. Sullivan
The rules: how to prevent fake ZFS results
Rule 1: State your question before you run a command
“How fast is this pool?” is not a question. Ask something testable:
- What is the sustained random read IOPS at 4k, p99 latency < 5 ms?
- What is the sequential write throughput for 1M blocks with compression off?
- What is the sync write latency with and without SLOG under concurrency 16?
- Does changing recordsize from 128K to 16K improve database write tail latency?
Write the question down. If you can’t phrase it, you can’t interpret the result.
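For example, the first question above translates almost directly into a fio invocation. A sketch, assuming the tank/bench dataset used in the tasks below and the libaio engine; the directory, size, and runtime are placeholders:
cr0x@server:~$ fio --name=q1-randread4k --directory=/tank/bench --ioengine=libaio --direct=1 \
  --rw=randread --bs=4k --numjobs=4 --iodepth=16 --size=200G \
  --time_based=1 --runtime=300 --ramp_time=30 --group_reporting --lat_percentiles=1
# Pass/fail is read off the clat percentiles: sustained IOPS with p99 under 5 ms, or the answer is "no".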
Rule 2: Pick the correct test surface: file dataset vs zvol
Testing a file dataset exercises the filesystem path, metadata, and recordsize. Testing a zvol exercises a block device, volblocksize, and a different I/O profile. Databases on raw block devices (or iSCSI) behave differently than databases on files. Don’t benchmark the wrong interface and then “optimize” based on that.
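If the production interface is a block device, benchmark a zvol, not a file in a dataset. A minimal sketch (volume name, size, and volblocksize are placeholders; /dev/zvol/... is the Linux OpenZFS device path):
cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=16K tank/benchvol
cr0x@server:~$ sudo fio --name=zvol-randwrite --filename=/dev/zvol/tank/benchvol --ioengine=libaio \
  --direct=1 --rw=randwrite --bs=16k --numjobs=4 --iodepth=16 \
  --time_based=1 --runtime=180 --group_reporting --lat_percentiles=1
# Compare against the same job on a file dataset; the difference you see is the interface, not the disks.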
Rule 3: Control caching or explicitly measure it
ARC can make reads look supernatural. That’s fine if your production working set fits in RAM. If it doesn’t, ARC-powered read benchmarks are performance fan fiction.
Decide which you are testing:
- Cold-cache performance: what happens after reboot or after a cache drop, and when working set >> RAM.
- Warm-cache performance: the steady reality of a hot dataset. Useful, but don’t confuse it with disk performance.
Second dry rule: if you don’t mention cache state in your report, the numbers are not admissible.
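Three hedged ways to control cache state on a Linux OpenZFS host (the ARC cap value and dataset name are examples; the ARC may take a little time to shrink after the cap is applied):
cr0x@server:~$ sudo zpool export tank && sudo zpool import tank   # evicts cached data for this pool between cold-cache runs
cr0x@server:~$ echo 8589934592 | sudo tee /sys/module/zfs/parameters/zfs_arc_max   # cap ARC at 8 GiB so the working set cannot fit
cr0x@server:~$ sudo zfs set primarycache=metadata tank/bench      # keep data blocks out of ARC on the test dataset only
Whichever you pick, write it in the report. That is the whole point.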
Rule 4: Sync writes are their own universe
Most production pain comes from sync writes: databases with fsync, NFS with sync semantics, VM storage that cares about durability. Async write benchmarks can look glorious while your real workload faceplants because it needs durability.
Benchmark sync explicitly, and include latency percentiles. If you have a SLOG, test with it enabled and disabled. Make the system show its real durability cost.
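A sketch of the A/B sync comparison, reusing the log device name from the example pool in Task 9 (your device name will differ):
cr0x@server:~$ sudo zfs set sync=always tank/bench
cr0x@server:~$ sudo zpool remove tank nvme-INTEL_SLOG0    # run the Task 8 sync job without the SLOG
cr0x@server:~$ sudo zpool add tank log nvme-INTEL_SLOG0   # reattach and run the same job again
cr0x@server:~$ sudo zfs inherit sync tank/bench           # restore the original sync setting when done
The delta between the two runs, in percentiles, is what the SLOG is actually worth.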
Rule 5: Use steady state; short tests lie
ZFS behavior changes over time: ARC warms, metaslabs allocate, fragmentation grows, and TXGs settle. Your benchmark should run long enough to hit steady state and to capture tail latency. As a starting point:
- Throughput tests: 2–5 minutes per configuration, after a short warm-up.
- Random latency tests: at least 5 minutes, collect p95/p99/p999.
- Recordsize / compression comparisons: repeat runs and compare distributions, not just averages.
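A steady-state oriented variant of the random write job (a sketch; size, runtime, and paths are placeholders):
cr0x@server:~$ fio --name=steady-randwrite --directory=/tank/bench --ioengine=libaio --direct=1 \
  --rw=randwrite --bs=4k --numjobs=4 --iodepth=16 --size=200G \
  --time_based=1 --runtime=600 --ramp_time=60 --group_reporting --lat_percentiles=1
# ramp_time discards the warm-up from the statistics; watch `zpool iostat tank 5` alongside and
# only trust the run once bandwidth and latency have flattened.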
Rule 6: Eliminate “helpful” variables
CPU frequency scaling, background scrubs, resilvers, SMART tests, and noisy neighbors will all contaminate results. Benchmarks are not a time to “share the box” politely. Be rude to the rest of the system. You can apologize later with uptime.
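A minimal quiesce pass as commands (the smartd unit name is an assumption; adjust for your distro and policy):
cr0x@server:~$ sudo cpupower frequency-set -g performance   # pin the CPU governor for the duration
cr0x@server:~$ sudo zpool scrub -p tank                     # pause a running scrub; resume later with `zpool scrub tank`
cr0x@server:~$ sudo systemctl stop smartd                   # stop scheduled SMART self-tests during the run
cr0x@server:~$ zpool status tank                            # confirm no scrub or resilver is still in progress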
Rule 7: Don’t benchmark empty pools and call it production
Fragmentation and metaslab allocation behavior change as pools fill. Performance at 5% full can be excellent; performance at 80% full can be a different personality. If you care about production, test at a realistic fill level or at least simulate it with preallocation and file churn.
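A rough way to fake occupancy and churn before testing (file counts and sizes are placeholders; pick values that match the fill level you care about):
cr0x@server:~$ fio --name=prefill --directory=/tank/bench --ioengine=libaio --direct=1 \
  --rw=write --bs=1M --iodepth=16 --nrfiles=64 --size=6T
cr0x@server:~$ ls /tank/bench | shuf | head -n 16 | xargs -I{} rm -f /tank/bench/{}   # delete a random subset to punch holes
cr0x@server:~$ zpool list -o name,capacity,fragmentation tank                         # confirm you are testing the fill level you meant to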
Rule 8: Validate your tool is actually doing I/O
Tools can fool you. Misconfigured fio can benchmark page cache. dd can benchmark read-ahead and cache effects. Your monitoring might be looking at the wrong device.
Trust but verify: during any test, confirm that the disks show real I/O and that ZFS counters move.
Rule 9: Measure at multiple layers
One number is a trap. Always capture at least:
- Application-level (fio output: IOPS, bw, clat percentiles)
- ZFS-level (zpool iostat, arcstat)
- Device-level (iostat -x: %util, await, queue size)
- CPU and interrupts (mpstat, top)
When results look “too good,” one of those layers will be suspiciously quiet.
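A crude but effective way to capture all layers for one run (log file names and the job file name are arbitrary):
cr0x@server:~$ sudo zpool iostat -v tank 5 > zpool.log &
cr0x@server:~$ iostat -x 5 > iostat.log &
cr0x@server:~$ mpstat -P ALL 5 > mpstat.log &
cr0x@server:~$ fio myjob.fio --output=fio.json --output-format=json
cr0x@server:~$ kill %1 %2 %3        # stop the background collectors once fio finishes
If fio.json is glorious and zpool.log is flat, you have your answer.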
Rule 10: Change one thing at a time
Benchmarking is not a tuning festival. If you change recordsize, compression, sync, and atime simultaneously, you’re doing random number generation with extra steps.
Joke #1: Benchmarking is like cooking—if you change the oven, the recipe, and the chef, don’t brag about the soufflé.
Match the workload: files, blocks, sync, and latency
Workload archetypes that matter
Most real ZFS use-cases fall into a few buckets. Your benchmark should mimic one of them:
- Bulk sequential reads: backups, media, analytics scans. Favor large blocks, throughput, and prefetch effects.
- Bulk sequential writes: ingest pipelines, backup targets. Watch TXG behavior and sustained bandwidth after caches are saturated.
- Random reads: key-value stores, VM boot storms. ARC can dominate here; cold-cache tests matter.
- Random writes (async): logging, temp stores. Often looks fine until sync enters the chat.
- Random writes (sync): databases, NFS sync, VM datastores with barriers. Latency and SLOG behavior are the story.
- Metadata-heavy: git servers, maildirs, CI artifact trees. Small files, directory ops, and the special vdev can matter.
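As one concrete example, the sync-random-write archetype can be expressed at the fio level instead of flipping the dataset's sync property (a sketch; block size, concurrency, and paths are placeholders):
cr0x@server:~$ fio --name=dblike --directory=/tank/bench --ioengine=libaio --direct=1 \
  --rw=randwrite --bs=8k --numjobs=4 --iodepth=4 --fsync=1 --size=20G \
  --time_based=1 --runtime=300 --group_reporting --lat_percentiles=1
# fsync=1 issues an fsync after every write, exercising the ZIL/SLOG path the way a database would.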
Dataset properties are not “tuning knobs,” they’re workload contracts
recordsize, compression, atime, primarycache, logbias, and sync are decisions about write amplification, caching priority, and durability semantics. Benchmarking them without understanding the workload is how you create a system that wins benchmarks and loses incidents.
Example: setting recordsize=1M on a dataset backing a database can inflate read-modify-write overhead for 8K updates. You might see great sequential throughput, and then watch latency spike when the database does random updates. The benchmark was “correct.” The question was wrong.
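The boring fix is structural: separate datasets with per-workload properties. Dataset names and values below are illustrative, not a recommendation for your data:
cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=zstd tank/analytics   # large sequential scans
cr0x@server:~$ sudo zfs create -o recordsize=16K tank/jobdb                          # small random updates, sync left at standard
cr0x@server:~$ zfs get -o name,property,value recordsize,compression,sync tank/analytics tank/jobdb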
Practical tasks: commands, outputs, decisions (12+)
These are real tasks you can run on a ZFS host. Each includes: the command, what output means, and the decision you make. The goal isn’t to collect trivia; it’s to keep you from trusting a number that’s lying.
Task 1: Confirm pool topology and ashift (the “forever” setting)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 0 days 00:12:41 with 0 errors on Tue Dec 24 03:11:03 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 ONLINE 0 0 0
ata-SAMSUNG_SSD_3 ONLINE 0 0 0
ata-SAMSUNG_SSD_4 ONLINE 0 0 0
ata-SAMSUNG_SSD_5 ONLINE 0 0 0
ata-SAMSUNG_SSD_6 ONLINE 0 0 0
errors: No known data errors
cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
67: vdev_tree:
92: ashift: 12
Meaning: ashift=12 implies 4K sectors. ashift=9 implies 512B, and on modern 4K-native drives that can cause write amplification and terrible latency.
Decision: If ashift is wrong for your media, stop “tuning.” Plan a rebuild or migration. Everything else is lipstick.
Task 2: Capture dataset properties that affect benchmarks
cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,primarycache,logbias tank/test
NAME PROPERTY VALUE
tank/test recordsize 128K
tank/test compression off
tank/test atime off
tank/test sync standard
tank/test primarycache all
tank/test logbias latency
Meaning: You now know whether “performance” came from compression, or whether sync was disabled, or if caching was restricted.
Decision: Freeze these settings for the benchmark run. Any change needs a new run and a clear reason.
Task 3: Verify you’re not accidentally benchmarking page cache
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256G 18G 190G 1.2G 48G 235G
Swap: 0B 0B 0B
cr0x@server:~$ sudo arcstat.py 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:01 219 3 1 0 0 3 100 0 0 18G 190G
12:01:02 241 2 0 0 0 2 100 0 0 18G 190G
12:01:03 210 4 1 0 0 4 100 0 0 18G 190G
Meaning: ARC size (arcsz) and cache hit/miss are visible. If your “disk read benchmark” shows near-zero misses, congratulations—you benchmarked RAM.
Decision: If you need cold-cache numbers, either reboot between runs, use a working set larger than ARC, or set primarycache=metadata on the test dataset (with caution).
Task 4: Confirm real device I/O is happening during a run
cr0x@server:~$ sudo zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------- ----- ----- ----- ----- ----- -----
tank 3.12T 8.77T 120 980 15.0M 980M
raidz2-0 3.12T 8.77T 120 980 15.0M 980M
ata-SAMSUNG_SSD_1 - - 20 165 2.6M 170M
ata-SAMSUNG_SSD_2 - - 19 166 2.4M 169M
ata-SAMSUNG_SSD_3 - - 21 162 2.7M 168M
ata-SAMSUNG_SSD_4 - - 20 161 2.5M 166M
ata-SAMSUNG_SSD_5 - - 20 164 2.4M 169M
ata-SAMSUNG_SSD_6 - - 20 162 2.4M 168M
---------------------------------------- ----- ----- ----- ----- ----- -----
Meaning: You can see per-vdev operations and bandwidth. If fio says “3 GB/s” and zpool iostat shows 0 MB/s, fio is hitting cache or doing nothing.
Decision: Trust zpool iostat to validate that your test is exercising the pool.
Task 5: Check for background work that will poison results
cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 25 11:48:03 2025
1.11T scanned at 4.52G/s, 220G issued at 898M/s, 3.12T total
0B repaired, 7.03% done, 00:53:12 to go
Meaning: A scrub is in progress. Your benchmark results are now a duet: your workload plus scrub I/O.
Decision: Pause benchmarking. Either wait, schedule tests outside scrub windows, or temporarily stop it if policy allows.
Task 6: Establish a baseline sequential write test (files)
cr0x@server:~$ sudo zfs create -o compression=off -o atime=off tank/bench
cr0x@server:~$ fio --name=seqwrite --directory=/tank/bench --rw=write --bs=1M --size=40G --numjobs=1 --iodepth=16 --direct=1 --time_based=1 --runtime=120 --group_reporting
seqwrite: (groupid=0, jobs=1): err= 0: pid=24819: Wed Dec 25 12:03:44 2025
write: IOPS=980, BW=981MiB/s (1029MB/s)(115GiB/120001msec)
clat (usec): min=320, max=18900, avg=2450.11, stdev=702.41
lat (usec): min=340, max=18940, avg=2475.88, stdev=704.13
Meaning: direct=1 avoids page cache. This is a throughput baseline, but still subject to ARC metadata and ZFS behavior. Latency shows the distribution; max spikes matter.
Decision: Use this baseline to compare changes (compression, recordsize, vdev layout). Don’t treat it as a universal “pool speed.”
Task 7: Random read IOPS with tail latency (cold-ish vs warm)
cr0x@server:~$ fio --name=randread4k --directory=/tank/bench --rw=randread --bs=4k --size=40G --numjobs=4 --iodepth=32 --direct=1 --time_based=1 --runtime=180 --group_reporting --lat_percentiles=1
randread4k: (groupid=0, jobs=4): err= 0: pid=24910: Wed Dec 25 12:09:12 2025
read: IOPS=182k, BW=712MiB/s (747MB/s)(125GiB/180002msec)
clat percentiles (usec):
| 1.00th=[ 118], 5.00th=[ 128], 10.00th=[ 136], 50.00th=[ 176]
| 90.00th=[ 260], 95.00th=[ 310], 99.00th=[ 560], 99.90th=[ 1400]
Meaning: If this is “cold cache,” those are astonishing numbers and likely ARC is involved. If it’s warm cache, it may be realistic for a read-heavy workload that fits in memory.
Decision: Repeat with a working set much larger than ARC if you need disk-limited read behavior. Compare percentiles, not just IOPS.
Task 8: Sync write test (the one that ruins happy plans)
cr0x@server:~$ sudo zfs set sync=always tank/bench
cr0x@server:~$ fio --name=syncwrite4k --directory=/tank/bench --rw=randwrite --bs=4k --size=10G --numjobs=4 --iodepth=8 --direct=1 --time_based=1 --runtime=180 --group_reporting --lat_percentiles=1
syncwrite4k: (groupid=0, jobs=4): err= 0: pid=25103: Wed Dec 25 12:15:41 2025
write: IOPS=9800, BW=38.3MiB/s (40.2MB/s)(6.73GiB/180002msec)
clat percentiles (usec):
| 50.00th=[ 780], 90.00th=[ 1900], 95.00th=[ 2600], 99.00th=[ 6800]
| 99.90th=[22000]
Meaning: This reflects durability cost. If you expected 100k IOPS because the SSD spec sheet said so, welcome to the difference between NAND speed and system semantics.
Decision: If p99/p999 is too high, evaluate SLOG device quality, queueing, and CPU saturation. Also consider whether the workload truly requires sync=always or if application-level durability is already correct.
Task 9: Check if SLOG is present and actually used
cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 ONLINE 0 0 0
ata-SAMSUNG_SSD_3 ONLINE 0 0 0
ata-SAMSUNG_SSD_4 ONLINE 0 0 0
ata-SAMSUNG_SSD_5 ONLINE 0 0 0
ata-SAMSUNG_SSD_6 ONLINE 0 0 0
logs
nvme-INTEL_SLOG0 ONLINE 0 0 0
cr0x@server:~$ sudo zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------- ----- ----- ----- ----- ----- -----
tank 3.15T 8.74T 10 1200 1.2M 120M
raidz2-0 3.15T 8.74T 10 900 1.2M 90M
logs - - 0 300 0 30M
nvme-INTEL_SLOG0 - - 0 300 0 30M
---------------------------------------- ----- ----- ----- ----- ----- -----
Meaning: During sync-heavy work, the log vdev should show writes. If it stays idle, your workload may not be issuing sync writes, or SLOG may be misapplied, or you’re not testing what you think you are.
Decision: If sync performance is critical and SLOG is not engaged, fix the test or configuration before blaming the pool.
Task 10: Identify CPU bottlenecks and compression effects
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (64 CPU)
12:19:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:19:02 AM all 18.2 0.0 22.9 0.4 0.0 2.1 0.0 0.0 0.0 56.4
12:19:03 AM all 19.0 0.0 24.1 0.5 0.0 2.2 0.0 0.0 0.0 54.2
cr0x@server:~$ sudo zfs set compression=zstd tank/bench
cr0x@server:~$ fio --name=seqwrite-cmpr --directory=/tank/bench --rw=write --bs=1M --size=40G --numjobs=1 --iodepth=16 --direct=1 --time_based=1 --runtime=120 --group_reporting
seqwrite-cmpr: (groupid=0, jobs=1): err= 0: pid=25218: Wed Dec 25 12:20:44 2025
write: IOPS=1500, BW=1501MiB/s (1574MB/s)(176GiB/120001msec)
clat (usec): min=350, max=24100, avg=2600.88, stdev=880.03
Meaning: Throughput improved with compression. That can be legitimate if your production data compresses similarly. CPU usage may rise; tail latency may shift.
Decision: Validate compressibility with real data samples (or representative synthetic). If production data is already compressed (media, encrypted blobs), this benchmark is irrelevant.
Task 11: Check real on-disk compression ratio
cr0x@server:~$ sudo zfs get -o name,property,value compressratio,logicalused,used tank/bench
NAME PROPERTY VALUE
tank/bench compressratio 1.72x
tank/bench logicalused 180G
tank/bench used 105G
Meaning: compressratio shows what happened, not what you hoped. 1.72x means you wrote less physical data than logical.
Decision: If compressratio is near 1.00x on representative data, stop counting on compression as a performance plan.
Task 12: Inspect device-level saturation and queueing
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (64 CPU)
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await
nvme0n1 0.0 300.0 0.0 30720.0 0.0 0.0 6.5 0.41
sda 25.0 165.0 2560.0 168960.0 0.0 1.0 78.2 2.90
sdb 24.0 166.0 2450.0 169100.0 0.0 1.0 79.0 2.88
sdc 25.0 162.0 2600.0 168100.0 0.0 1.0 77.5 2.95
Meaning: %util near 100% with increasing await indicates the device is saturated. If devices are not busy but latency is high, the bottleneck is elsewhere (CPU, locks, sync, network).
Decision: If disks are saturated, optimize layout or add vdevs. If disks are idle, stop buying disks and start profiling the stack.
Task 13: Confirm the test isn’t limited by a single file or small metadata path
cr0x@server:~$ fio --name=randread4k-many --directory=/tank/bench --rw=randread --bs=4k --size=40G --numjobs=8 --iodepth=32 --direct=1 --time_based=1 --runtime=180 --group_reporting --filename_format=job.$jobnum.file
randread4k-many: (groupid=0, jobs=8): err= 0: pid=25440: Wed Dec 25 12:28:11 2025
read: IOPS=240k, BW=938MiB/s (984MB/s)(165GiB/180004msec)
clat (usec): min=95, max=3200, avg=210.34, stdev=81.22
Meaning: Multiple files reduce single-file lock contention and can expose parallelism. If performance jumps massively, your earlier test might have benchmarked a bottleneck you don’t have in production—or one you absolutely do.
Decision: Choose file count to match reality: many files for VM images or sharded DBs; fewer for big logs; zvol for block workloads.
Task 14: Watch ZFS latency under load (quick sanity)
cr0x@server:~$ sudo zpool iostat -l tank 1 3
read write
pool ops bandwidth total_wait disk_wait ops bandwidth total_wait disk_wait
tank 120 15.0M 1ms 1ms 980 980M 4ms 3ms
tank 110 14.2M 1ms 1ms 995 990M 5ms 4ms
tank 118 15.1M 2ms 1ms 970 970M 4ms 3ms
Meaning: total_wait includes time in ZFS queues; disk_wait is device time. If total_wait is much larger than disk_wait, the bottleneck can be in ZFS scheduling, CPU, contention, or sync path.
Decision: Use the difference (total_wait – disk_wait) as a clue: are you waiting on disks or on software?
Task 15: Verify pool free space and fragmentation risk
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 3.15T 8.74T 128K /tank
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,fragmentation,capacity
NAME SIZE ALLOC FREE FRAG CAP
tank 11.9T 3.15T 8.74T 18% 26%
Meaning: Fragmentation and capacity matter. As pools fill and fragment, allocation becomes more complex, and latency can climb.
Decision: If you benchmark at 10–30% full and run production at 80% full, you’re benchmarking a different machine. Plan tests at realistic occupancy.
Fast diagnosis playbook
When a benchmark result looks wrong—or production is slow and you’re trying to reproduce it—don’t start randomly flipping ZFS properties. Do this instead, in order. Fast. Deterministic. Boring.
First: confirm the test is real I/O, not cache theater
- Check zpool iostat while the benchmark runs. If pool bandwidth is near zero, stop.
- Check arcstat miss rates. If misses are low, you are testing ARC behavior.
- Check fio direct=1 and verify file size exceeds RAM if cold-cache is required.
Second: determine whether you’re bottlenecked on disks, CPU, or sync semantics
- Run iostat -x. If %util is high and await climbs, devices are saturated.
- Run mpstat. If %sys is high and CPUs are busy while disks are not, you’re CPU-bound (checksums, compression, interrupts, locking).
- Force sync behavior on the dataset (sync=always) for a short test. If performance collapses and latency explodes, your real bottleneck is the sync path and/or SLOG.
Third: validate ZFS configuration assumptions
- Check topology (mirror/raidz) and ashift.
- Confirm dataset properties: recordsize/volblocksize, compression, logbias, primarycache.
- Check for background work: scrub, resilver, TRIM, snapshots sending/receiving.
If you follow that order, you can usually name the bottleneck within 5–10 minutes. Not fix it—name it. Naming is step one toward not thrashing at random.
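The playbook, compressed into commands you can type from memory (the dataset name is whatever you are actually testing):
cr0x@server:~$ zpool iostat tank 2                 # is the pool doing real I/O at all?
cr0x@server:~$ arcstat 2 5                         # near-zero miss% means you are measuring ARC, not disks
cr0x@server:~$ iostat -x 2                         # high %util with climbing await: devices saturated
cr0x@server:~$ mpstat -P ALL 2                     # high %sys with idle disks: CPU-bound somewhere in the stack
cr0x@server:~$ zpool status tank                   # scrub or resilver in progress?
cr0x@server:~$ zfs get sync,recordsize,compression,primarycache,logbias tank/bench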
Three corporate mini-stories (painfully plausible)
Incident caused by a wrong assumption: “It’s NVMe, so sync will be fine”
A mid-sized company rolled out a new ZFS-backed datastore for virtual machines. It was all-flash. It was expensive. It had a beautiful benchmark: random writes were “great,” and the graphs had enough colors to qualify as modern art.
Two weeks later, morning logins turned into a slow-motion disaster. VM consoles froze for seconds. Databases inside the VMs started timing out. The storage team did what teams do under pressure: they reran the benchmark they trusted, saw great numbers, and declared the problem must be “network.”
The wrong assumption was simple: “All-flash means sync writes are fast.” Their benchmark used async random writes with deep queues and measured average latency. Production had a ton of fsync-heavy behavior—journaling, metadata, and guest OS flushes. ZFS treated those as synchronous writes, and without a proper SLOG device, the latency path forced durable commits that bottlenecked on worst-case flash behavior.
The fix wasn’t magical. They added a power-loss-protected, low-latency SLOG device and reran the benchmark with sync=always and percentile latency enabled. The numbers were lower but honest. Morning login storms became boring again, which is the highest compliment you can give a storage system.
An optimization that backfired: recordsize and “free throughput”
An analytics team wanted faster scan performance. Someone read that “bigger recordsize = faster sequential reads,” and it’s often true. They changed recordsize to 1M on a dataset that held both analytics parquet files and a metadata service database that tracked jobs and small records.
Benchmarks immediately improved for bulk reads. They celebrated. Then the metadata service started spiking latency. The database did small random updates, and ZFS now had to manage large records for tiny modifications, causing read-modify-write amplification and extra metadata updates. The service didn’t need throughput; it needed predictable small-write latency.
They “fixed” it by increasing fio iodepth and concurrency until the benchmark looked good again. That made the graphs prettier while making the service worse, because queueing hid latency by turning it into a backlog.
The real fix was boring: split the workloads. Keep 1M recordsize for analytics files on one dataset. Put the database on its own dataset with a smaller recordsize and appropriate sync semantics. The throughput chart got less impressive; the incident queue got smaller. Choose your trophies.
A boring but correct practice that saved the day: measuring and documenting cache state
A finance company ran quarterly capacity and performance checks on their storage before reporting deadlines. It wasn’t glamorous. They had a runbook that specified: pool fill level, ARC size at start, dataset properties, fio job files, and how to record p95/p99 latency. Every run included “cold-ish” and “warm” tests.
One quarter, a change in firmware on a batch of SSDs introduced a latency quirk under certain queue depths. The sequential throughput numbers didn’t budge. The random read average latency looked fine. But p99 jumped in a way that only showed up when the cache was cold and the working set exceeded ARC.
Because they had consistent baselines, the deviation was obvious. They caught it before the reporting window. They rolled the firmware back and quarantined the batch. No heroic debugging at 2 a.m., no emergency procurement, no “is it the network?” arguments.
That team never won an innovation award. They did win something better: uninterrupted sleep.
Common mistakes: symptoms → root cause → fix
This is the section where your benchmark results go to be diagnosed like a sick server: cold facts, no vibes.
1) Symptom: “Reads are impossibly fast, faster than the SSDs”
- Root cause: ARC (or OS cache) is serving reads. Working set fits in RAM, or you repeated the run and warmed cache.
- Fix: Use direct=1, increase the dataset size beyond ARC, record the ARC miss rate, and explicitly label results “warm-cache” vs “cold-cache.”
2) Symptom: “Random write IOPS are great, but production databases are slow”
- Root cause: Benchmark used async writes; production requires sync writes (fsync). No SLOG, or weak SLOG, or you didn’t test sync path.
- Fix: Test with sync=always, measure p99 latency, add/validate a SLOG with power-loss protection, and confirm it’s used via zpool iostat.
3) Symptom: “Throughput is high for 30 seconds, then drops”
- Root cause: You benchmarked cache and write buffering (TXG) rather than sustained device throughput. Short runs also miss steady-state allocation behavior.
- Fix: Run longer tests, use time_based runs with enough duration, and watch pool bandwidth over time with zpool iostat 1.
4) Symptom: “Latency spikes every few seconds”
- Root cause: TXG commits, sync flush behavior, or SLOG device latency spikes. Sometimes CPU scheduling or interrupt storms.
- Fix: Capture fio latency percentiles, correlate with zpool iostat -l (or -w for full latency histograms), and check CPU/interrupt behavior. Validate SLOG device quality and firmware.
5) Symptom: “Changing compression makes everything faster, so we’ll rely on it”
- Root cause: Benchmark data compresses; production data may not (already compressed/encrypted). Compression masks I/O limits.
- Fix: Verify compressratio on representative data. If production is incompressible, benchmark with compression off or with realistic data.
6) Symptom: “One fio job shows great numbers; real app is slower”
- Root cause: Workload mismatch: block size, sync behavior, concurrency, access pattern, number of files, metadata mix.
- Fix: Build fio jobs that resemble the app: same block sizes, same fsync rate, similar concurrency, realistic file counts, and mixed read/write ratios.
7) Symptom: “Pool performance is inconsistent across runs”
- Root cause: Cache state changes, background work (scrub/resilver), pool fill level changes, thermal throttling, CPU governor changes.
- Fix: Control the environment: isolate host, fix governor, stop background tasks, run warm-up, and record fill level and ARC state each run.
8) Symptom: “IOPS look good but p99 latency is awful”
- Root cause: Queueing hides pain. Deep iodepth can inflate IOPS while increasing tail latency. Also possible contention or sync stalls.
- Fix: Benchmark with multiple iodepth values, plot IOPS vs p99, and choose a point that meets latency SLOs. Don’t optimize for IOPS alone.
Joke #2: If your benchmark only reports averages, it’s basically a horoscope with better formatting.
Checklists / step-by-step plan
Step-by-step: a repeatable ZFS benchmark session (production-grade)
- Write the benchmark question (throughput vs IOPS vs latency, sync vs async, file vs zvol).
- Record the environment: kernel version, ZFS version, CPU model, RAM, device models, controller, firmware status.
- Confirm pool topology and ashift (zpool status, zdb -C).
- Create a dedicated dataset for benchmarks and set properties explicitly (compression, atime, sync, recordsize, primarycache).
- Set the pool state: no scrub/resilver, stable temperature, stable CPU governor, minimal noise.
- Decide cache mode: cold-ish vs warm. Document ARC size and misses. If cold-ish, use a big working set and/or reboot between runs.
- Warm-up: run a short pre-test to stabilize allocation and prevent “first touch” anomalies.
- Run the benchmark long enough for steady state. Capture fio output, plus zpool iostat and iostat -x concurrently.
- Repeat at least 3 times per config. If variance is high, you have noise or a non-deterministic bottleneck.
- Change one variable (recordsize or compression or sync or logbias), rerun the same suite.
- Interpret using multiple layers: fio + zpool iostat + iostat -x + CPU stats.
- Decide based on your SLO: pick the configuration that meets p99 latency targets, not the one with the biggest peak throughput.
Benchmark suite template: minimal but honest
- Sequential write/read: 1M blocks, 1–4 jobs, iodepth 16, 2 minutes.
- Random read: 4k blocks, jobs 4–8, iodepth 16–32, 3 minutes, percentiles.
- Random write async: 4k blocks, jobs 4–8, iodepth 8–16, 3 minutes.
- Random write sync: same as above, but with sync=always; measure p99/p999.
- Mixed 70/30 read/write: 8k or 16k blocks, latency percentiles. (A job-file sketch of this suite follows.)
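The same suite as a version-controllable fio job file. A sketch: save it as, say, zfs-suite.fio (the name is arbitrary), adjust directory and sizes, and note that iodepth only takes effect with an asynchronous engine such as libaio, so it is set explicitly here. The sync leg is driven by setting sync=always on the dataset, as in Task 8.
[global]
directory=/tank/bench
ioengine=libaio
direct=1
time_based=1
runtime=180
ramp_time=15
group_reporting=1
lat_percentiles=1
size=40G

[seq-write]
rw=write
bs=1M
numjobs=1
iodepth=16
runtime=120

[rand-read]
stonewall
rw=randread
bs=4k
numjobs=4
iodepth=32

[rand-write-async]
stonewall
rw=randwrite
bs=4k
numjobs=4
iodepth=8

[mixed-70-30]
stonewall
rw=randrw
rwmixread=70
bs=8k
numjobs=4
iodepth=16
Run it with fio zfs-suite.fio --output-format=json --output=results.json and commit both the job file and the results next to the dataset properties you recorded.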
FAQ
1) Should I benchmark with compression on or off?
Both, but label them. Compression is a real feature that can improve throughput and reduce latency by doing less I/O. It’s only “fake” when your benchmark data compresses and production data doesn’t.
2) Is using dd acceptable for ZFS benchmarks?
For a quick sequential smoke test, yes. For anything involving latency, random I/O, concurrency, or sync semantics, use fio and capture percentiles. dd is too easy to accidentally benchmark cache and readahead.
3) What does sync=disabled do to benchmark results?
It can make writes look dramatically faster by acknowledging them before they are durable. That’s not a “tuning trick”; it’s changing the durability contract. If you use it for production, you’re accepting data loss on power events or crashes.
4) How big should my fio test files be?
For cold-cache behavior: bigger than ARC (often bigger than RAM). For warm-cache behavior: sized to match your realistic working set. Always state which one you tested.
5) Why do my numbers change after the pool fills up?
Allocation gets harder, fragmentation increases, and metaslab behavior changes. ZFS performance at 20% full is not the same as at 80% full. Benchmark at realistic capacity if you care about production behavior.
6) Do I need a SLOG device?
Only if your workload issues synchronous writes and you care about reducing sync latency. A SLOG won’t help async throughput and won’t fix a pool that’s already disk-saturated. It’s for sync latency, not general speed.
7) What iodepth should I use in fio?
Use a range. Shallow queues (iodepth 1–4) show latency and responsiveness; deeper queues show maximum throughput/IOPS but can destroy tail latency. Pick the iodepth that matches your application’s concurrency and latency SLOs.
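A hedged sweep to generate that plot, one data point per queue depth (paths, size, and runtime are placeholders):
cr0x@server:~$ for qd in 1 4 8 16 32 64; do \
    fio --name=qd-$qd --directory=/tank/bench --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --numjobs=1 --iodepth=$qd --size=40G \
        --time_based=1 --runtime=120 --ramp_time=15 --group_reporting --lat_percentiles=1; \
  done
# Plot IOPS against p99 from each run and pick the knee that still meets your latency SLO.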
8) How do I know if I’m CPU-bound in ZFS?
If disks are not busy (iostat -x %util low) but fio latency rises and mpstat shows high %sys or many cores pegged, you’re likely CPU-bound: checksums, compression, interrupts, encryption, or contention.
9) Should I benchmark raidz vs mirrors as if it’s just “more disks”?
No. Mirrors tend to have better random IOPS and latency; raidz tends to be efficient for capacity and sequential workloads, but parity and allocation can affect small random writes. Benchmark with your actual workload profile, especially for random writes and sync behavior.
10) Can L2ARC make benchmarks look better?
Yes, but it’s still a cache. If your workload is read-heavy and reuses data, L2ARC can help. If your workload is mostly write or streaming reads, it won’t. If you benchmark without stating cache layers, your results are incomplete.
Next steps you can actually do
If you want ZFS benchmark results that survive contact with production, do three things this week:
- Write and version-control your fio job definitions (even if they’re simple) and record dataset properties alongside results.
- Add a standard “truth check” to every run: zpool iostat + arcstat + iostat -x captured during the benchmark. If the layers don’t agree, the benchmark doesn’t count.
- Benchmark sync behavior with latency percentiles, even if you think you don’t need it. Sync is where systems stop being fast and start being honest.
ZFS is capable of excellent performance. It’s also capable of producing spectacularly misleading benchmarks if you let it. The rules above aren’t about winning. They’re about not being surprised later—when the surprise comes with a pager.