Your storage “looks fine.” The dashboard says 2 GB/s read throughput, the pool is green, and yet the database is wheezing like it just ran a marathon in a wool coat.
Everyone argues anyway. Somebody points at “disk bandwidth” and declares victory. Meanwhile, your p95 latency graph is quietly doing arson.
ZFS performance problems rarely start with a broken disk. They start with a broken mental model: reading the wrong metric for the workload you actually have.
This is the difference between “we’re IOPS-bound” and “we’re throughput-bound,” and it decides whether you buy more spindles, change recordsize, add a special vdev, or just stop benchmarking lies.
The metric mismatch: why 2 GB/s can still be slow
Throughput is seductively simple. It’s a big number. It looks good in a slide deck. It’s also the easiest metric to misuse.
A system can move gigabytes per second and still deliver awful user experience if it’s doing that work in large, happy sequential streams while your real workload is tiny random reads with strict latency budgets.
The core relationship is this:
Throughput = IOPS × IO size.
That’s not a slogan; it’s the math behind most storage arguments.
If your IO size is 4 KiB and you can do 20,000 IOPS, that’s about 80 MiB/s. Not impressive, but it might be exactly what a database needs.
If your IO size is 1 MiB and you can do 2,000 IOPS, that’s 2 GiB/s. Impressive, and completely irrelevant to that same database.
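If you want that arithmetic handy during the next storage argument, it fits in a one-liner. A minimal sketch using the illustrative numbers above, not measurements from any real pool:
cr0x@server:~$ python3 -c "print(20000 * 4 / 1024, 'MiB/s from 20,000 IOPS at 4 KiB')"
78.125 MiB/s from 20,000 IOPS at 4 KiB
cr0x@server:~$ python3 -c "print(2000 * 1 / 1024, 'GiB/s from 2,000 IOPS at 1 MiB')"
1.953125 GiB/s from 2,000 IOPS at 1 MiB
Same formula, wildly different stories. Which one describes your workload is the whole question.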
In ZFS land, this mismatch gets worse because the stack is honest but complicated:
ZFS has compression, checksums, copy-on-write, a big ARC cache, prefetching, and transaction groups.
These features are not “slow.” They’re just specific. They reward some IO patterns and punish others.
If you remember only one thing, make it this:
Stop using a single “MB/s” chart to decide storage health.
When latency is the user-visible symptom, measure latency first. Then map that latency to IOPS vs throughput and to the actual bottleneck (CPU, vdev queues, sync path, fragmentation, metadata, network, or app).
Joke #1: If you benchmarked your pool with a single sequential read test and declared it “fast,” congratulations—you’ve measured your ability to read the benchmark file.
IOPS, throughput, latency: the three numbers that matter (and how they relate)
IOPS
IOPS is “IO operations per second.” It counts completed IO requests, not bytes.
It matters most when the application does lots of small reads/writes: databases, VM disks, metadata-heavy workloads, mail spools, small-file CI caches.
IOPS is not free. Each IO has overhead: syscall, filesystem bookkeeping, checksum, allocation, vdev scheduling, actual media access, completion.
Your pool can have high throughput and low IOPS because it’s good at large sequential transfers but bad at lots of small random operations.
Throughput
Throughput (MB/s or GB/s) is total bytes moved per second. It matters most for streaming: backups, media processing, large-file ETL, object storage replication.
For sequential IO, a single queue with deep enough outstanding requests can saturate throughput with surprisingly few IOPS.
Latency
Latency is time per IO (average, p95, p99). It is usually the metric humans experience.
A database waiting 15 ms per read doesn’t care that your pool can hit 3 GB/s if it asked for 8 KiB and got it late.
In practical terms: latency tells you if you’re in trouble; IOPS/throughput tell you why.
The relationship you keep ignoring
If IO size is small, you need high IOPS to get decent throughput.
If IO size is large, you can get high throughput with moderate IOPS.
ZFS complicates IO size because application IO size, ZFS recordsize, and physical sector size don’t always match.
Also, a workload can be both throughput- and IOPS-bound at different times.
Example: a VM host can have random 8 KiB reads during steady state (IOPS/latency-sensitive), and then do large sequential writes during backup windows (throughput-sensitive).
One pool, two bottlenecks, one on-call.
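To make the block-size mismatch concrete: assume a database issuing 8 KiB random reads against a dataset left at the default 128 KiB recordsize, and assume those reads miss ARC (compression ignored for simplicity). A back-of-the-envelope sketch, not a measurement:
cr0x@server:~$ python3 -c "print(128 // 8, 'x more data read from disk than the application asked for')"
16 x more data read from disk than the application asked for
Every cache miss drags a full record off the vdev. That is why recordsize shows up later in this article as a real lever, not a superstition.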
Where ZFS spends time: the performance path in plain terms
ZFS performance diagnosis gets easier when you think in layers. Not “ZFS is slow.” More like: “this IO is waiting on which stage?”
Read path (simplified)
- ARC hit? If yes, you’re mostly in RAM and CPU. Latency is microseconds to low milliseconds depending on system load.
- ARC miss → vdev read. Now you’re at the mercy of disk latency, queue depth, and scheduling.
- Checksums & decompression. CPU cost, sometimes non-trivial with fast NVMe or heavy compression.
- Prefetch may help sequential reads, may harm random ones if it pollutes ARC.
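A quick way to see why ARC hit rate dominates perceived read latency: blend the two paths. The numbers below are assumptions for illustration (0.1 ms for an ARC hit under load, 8 ms for a miss that lands on busy HDDs), not measurements:
cr0x@server:~$ python3 -c "print(round(0.95*0.1 + 0.05*8.0, 3), 'ms average read latency at a 95% hit rate')"
0.495 ms average read latency at a 95% hit rate
cr0x@server:~$ python3 -c "print(round(0.80*0.1 + 0.20*8.0, 3), 'ms average read latency at an 80% hit rate')"
1.68 ms average read latency at an 80% hit rate
A 15-point drop in hit rate more than triples average latency, and the p99 story is uglier still. This is why "we added RAM and the disks got faster" is not actually a paradox.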
Write path (simplified, and where people get fired)
- Async writes land in memory and are committed to disk in transaction groups (TXGs). This is generally fast… until you fill memory or the pool can’t flush fast enough.
- Sync writes must be committed safely before acknowledgement. Without a dedicated log device (SLOG), the ZIL lives on the main pool vdevs, so sync latency is whatever those vdevs can deliver.
- Copy-on-write changes the “overwrite” story. ZFS allocates new blocks and updates metadata. Fragmentation can become a tax on random workloads over time.
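If you want to see the sync path in isolation, fio can force an fsync after every write. A minimal sketch, assuming /tank/app sits on the pool you care about; the filename, size, and runtime are placeholders, and the output is omitted because getting your own numbers is the point:
cr0x@server:~$ fio --name=syncwrite8k --filename=/tank/app/fio-sync-test --size=2G --rw=randwrite --bs=8k --fsync=1 --iodepth=1 --numjobs=1 --runtime=30 --time_based --group_reporting
Watch the completion latency percentiles, not the bandwidth. Run it with and without a SLOG in the pool and you will know whether the log device is earning its slot.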
Vdevs are the unit of parallelism
ZFS stripes across vdevs, not across individual disks the way some people vaguely assume.
A single RAIDZ vdev has a limited IOPS profile (especially for random writes) compared to multiple mirror vdevs.
If you want more random IOPS, you typically add more vdevs, not “bigger disks.”
Queueing is where your latency goes to die
Most performance incidents are queueing incidents.
Disks aren’t “slow” so much as “busy,” and busy means requests wait.
That wait shows up as latency. The pool might still show respectable throughput, because bytes keep moving, but every individual IO is waiting in line.
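Queueing math is mercifully short: Little's Law says outstanding IOs = IOPS × latency. Flip it around and a pool doing 2,000 IOPS with 16 requests in flight is averaging 8 ms per IO, no matter how healthy the bandwidth chart looks. A sketch with those illustrative numbers:
cr0x@server:~$ python3 -c "print(16 / 2000 * 1000, 'ms average latency at 2,000 IOPS with 16 IOs in flight')"
8.0 ms average latency at 2,000 IOPS with 16 IOs in flight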
Workload fingerprints: what “IOPS-bound” and “throughput-bound” look like
IOPS-bound (small random IO)
Fingerprints:
- High latency even though MB/s is modest.
- IO size small (4–16 KiB typical).
- Queue depth increases under load; disks show high await or svctm-like symptoms depending on the tool.
- CPU often not saturated; you’re waiting on storage.
- ZFS datasets with small recordsize or zvols with small volblocksize tend to amplify it.
Typical culprits:
VM hosts, OLTP databases, metadata-heavy file trees, containers with layered filesystems hammering small files.
Throughput-bound (large sequential IO)
Fingerprints:
- High MB/s and stable latency until you saturate.
- Large IO size (128 KiB to several MiB).
- CPU might become the bottleneck if compression/checksums run hot.
- Network often becomes the ceiling (10/25/40/100GbE), especially with NFS/SMB clients.
Typical culprits:
backup streams, media pipelines, large analytics scans, replication and resilver operations.
The “mixed workload” trap
Your pool can look amazing at night (backup throughput charts) and terrible at noon (interactive latency).
If you size for throughput only, you’ll ship a system that melts under random IO.
If you size for IOPS only, you might buy expensive SSDs when you actually needed more network or better sequential layout.
Interesting facts and historical context (that actually helps)
- ZFS was designed at Sun for data integrity first: end-to-end checksums and copy-on-write were core features, not bolt-ons.
- RAIDZ exists to avoid the RAID-5 write hole, trading some small-write performance for stronger guarantees.
- The ARC is not “just cache”; it’s a self-tuning algorithmic cache with metadata awareness, and it can change workload behavior dramatically.
- 4K sector drives changed everything: misaligned ashift values caused real-world performance collapses and unpredictable write amplification.
- Prefetch was built for streaming reads; on random IO-heavy workloads it can waste ARC and IO budget if you don’t understand when it triggers.
- SLOG devices became popular because sync writes got real when virtualization and databases started doing fsync-heavy patterns by default.
- Compression flipped from “CPU tax” to “performance feature” as CPUs got faster and storage got relatively slower; fewer bytes can mean fewer IOs.
- Special vdevs are a modern answer to an old problem: metadata and small blocks being latency-sensitive while bulk data can be slower.
- IOPS marketing came from the HDD era, where random access was brutally slow and “seeks per second” basically defined performance classes.
Fast diagnosis playbook
This is the order I use when paged. It’s biased toward getting the bottleneck right before touching tunables.
First: prove whether it’s latency, IOPS, throughput, CPU, or network
- Check latency and queueing on the server (not the client): iostat, zpool iostat, and application p95/p99.
- Check IO size and pattern: random vs sequential, sync vs async, read vs write. If you don’t know the pattern, your benchmark is fan fiction.
- Check ARC hit rate: if ARC is saving you, disks may be innocent.
Second: localize the bottleneck to a vdev, a dataset/zvol, or a path
- Which vdev is hot? Mirrors vs RAIDZ behave very differently under random writes.
- Is sync the problem? Look for sync write latency and SLOG health.
- Is metadata the problem? Small-file workloads often die on metadata IOPS, not data throughput.
Third: decide the lever
- If it’s IOPS-bound: add vdevs, shift to mirrors, use special vdev for metadata/small blocks, fix recordsize/volblocksize, reduce sync pressure or add proper SLOG.
- If it’s throughput-bound: check network, check compression CPU, widen stripes (more vdevs), and use bigger recordsize for streaming datasets.
- If it’s latency from queueing: reduce concurrency, isolate noisy neighbors, set sane primarycache/secondarycache values, and stop mixing workloads that hate each other.
Practical tasks: commands, what the output means, and what decision you make
These are real “do this now” tasks. Each one has a command, sample output, what it means, and the decision it drives.
Run them on the storage host whenever possible. Client-side metrics are useful, but they lie by omission.
Task 1: Identify pool topology (because vdev layout is destiny)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 0 days 02:11:09 with 0 errors on Sun Dec 22 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sda2 ONLINE 0 0 0
sdb2 ONLINE 0 0 0
sdc2 ONLINE 0 0 0
errors: No known data errors
Meaning: This pool mixes a mirror vdev and a RAIDZ1 vdev. That’s allowed, but performance characteristics differ wildly; allocation will stripe across vdevs and the slowest/most contended part can dominate latency.
Decision: If you care about consistent latency, avoid mixing vdev types in the same pool. If it already exists, consider separating workloads by pool.
Task 2: Observe pool IO at a high level (IOPS vs MB/s)
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 3.21T 5.89T 3.21K 1.87K 110M 42.3M
mirror-0 920G 1.81T 2.95K 210 98.4M 3.21M
nvme0n1p2 - - 1.48K 105 49.2M 1.60M
nvme1n1p2 - - 1.47K 105 49.2M 1.61M
raidz1-1 2.29T 4.08T 260 1.66K 11.7M 39.1M
sda2 - - 90 560 4.10M 13.0M
sdb2 - - 86 550 3.96M 13.1M
sdc2 - - 84 545 3.64M 13.0M
Meaning: The RAIDZ vdev is taking most writes (1.66K ops), while reads are dominated by the mirror. That likely means the workload is write-heavy and landing where latency will be worse.
Decision: If latency matters, move write-sensitive datasets to a pool composed of mirrors or SSD vdevs; or redesign vdevs. Don’t “tune” your way out of a topology mismatch.
Task 3: Check latency/queueing at the block layer
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.44 0.00 6.21 9.87 0.00 71.48
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %util await r_await w_await
nvme0n1 480.0 42.0 62.0 1.9 0.0 8.0 38.2 1.20 1.10 2.30
nvme1n1 475.0 41.0 61.8 1.9 0.0 7.0 37.9 1.18 1.08 2.25
sda 22.0 180.0 1.2 20.5 0.0 15.0 96.8 24.40 9.20 26.10
sdb 21.0 178.0 1.2 20.2 0.0 14.0 95.9 25.10 9.50 26.70
sdc 20.0 175.0 1.1 19.9 0.0 13.0 95.1 26.00 10.10 27.20
Meaning: HDDs are at ~95% util with ~25 ms await. NVMe is fine (~1.2 ms). This is classic queueing: small IOs to HDD vdevs will get punished.
Decision: If your workload needs low latency, stop sending it to saturated HDD vdevs. Reduce concurrency, move datasets, or add vdevs/SSDs. Don’t focus on MB/s.
Task 4: Identify whether sync writes are involved
cr0x@server:~$ sudo zfs get -o name,property,value -r sync tank/app
NAME PROPERTY VALUE
tank/app sync standard
Meaning: standard means the application decides sync vs async (fsync/O_DSYNC). Many databases and hypervisors will force sync semantics.
Decision: If you see high write latency and the app is fsync-heavy, investigate SLOG and sync path. Don’t “fix” it by setting sync=disabled unless you enjoy data loss postmortems.
Task 5: Check presence and health of SLOG (separate log)
cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sda2 ONLINE 0 0 0
sdb2 ONLINE 0 0 0
sdc2 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
nvme2n1p1 ONLINE 0 0 0
nvme3n1p1 ONLINE 0 0 0
errors: No known data errors
Meaning: A mirrored SLOG exists. Good: sync write latency might be gated by these devices, not by HDD RAIDZ.
Decision: Validate that SLOG devices are power-loss safe and not consumer NVMe pretending to be reliable. If SLOG is absent and sync is heavy, consider adding one—but only after confirming the workload is sync-bound.
Task 6: Check dataset recordsize (throughput vs random-read behavior)
cr0x@server:~$ sudo zfs get -o name,property,value recordsize tank/app
NAME PROPERTY VALUE
tank/app recordsize 128K
Meaning: 128K is a solid default for general files. For databases doing 8K pages, it can cause read amplification (reading more than needed) and impact latency under cache misses.
Decision: For database datasets, consider recordsize=16K or 8K depending on DB page size and access pattern. For streaming datasets, keep it larger (128K–1M).
Task 7: Check zvol volblocksize (VMs live and die here)
cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/vmstore/vm-001
NAME PROPERTY VALUE
tank/vmstore/vm-001 volblocksize 8K
Meaning: 8K matches many VM random IO patterns, but can reduce sequential throughput and increase metadata overhead. Too small also means more IO ops for the same bytes.
Decision: Choose volblocksize per workload before writing data (it’s fixed after creation in many implementations). For mixed VM workloads, 8K–16K is common; for bulk sequential, larger might help.
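Because volblocksize is locked in at creation, set it explicitly when you carve the zvol instead of hoping the default matches the guest. A minimal sketch; the name, size, and sparse flag are placeholders for whatever your hypervisor layout expects:
cr0x@server:~$ sudo zfs create -s -V 100G -o volblocksize=16K tank/vmstore/vm-002
cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/vmstore/vm-002
NAME                  PROPERTY      VALUE
tank/vmstore/vm-002   volblocksize  16K
Getting this right before the guest writes data is cheap. Getting it wrong means a migration later.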
Task 8: Check ARC stats (are you disk-bound or cache-bound?)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:10:01 3200 420 13 180 5 210 7 30 1 96G 110G
12:10:02 3400 600 17 210 6 360 11 30 1 96G 110G
12:10:03 3100 380 12 160 5 190 6 30 1 96G 110G
Meaning: Miss rate ~12–17%. Not terrible, but under a latency incident even “only” 15% misses can be catastrophic if misses go to slow vdevs.
Decision: If misses correlate with latency spikes, you’re disk-latency-bound. Consider more RAM, special vdev, or moving hot working set to faster vdevs.
Task 9: Check compression ratio (it can be free throughput)
cr0x@server:~$ sudo zfs get -o name,property,value,source compression,compressratio tank/app
NAME PROPERTY VALUE SOURCE
tank/app compression lz4 local
tank/app compressratio 1.62x -
Meaning: 1.62x means you’re writing fewer bytes than the app thinks. That often improves both throughput and IOPS (less physical IO), at the cost of some CPU.
Decision: If CPU is not saturated and data compresses, leave compression on. If CPU is pegged and storage is idle, evaluate changing compression level or moving heavy compression off hot paths.
Task 10: Confirm ashift (alignment) to avoid silent write amplification
cr0x@server:~$ sudo zdb -C tank | sed -n '/ashift/,+2p'
ashift: 12
asize: 7.99T
is_log: 0
Meaning: ashift=12 means 4K sectors. Good baseline. If you had ashift=9 on 4K drives, you’d risk read-modify-write penalties and ugly small-write behavior.
Decision: If ashift is wrong, you don’t “tune” it away. You rebuild vdevs correctly. Yes, that’s annoying. No, it’s not optional.
Task 11: Check metadata/small-block acceleration (special vdev)
cr0x@server:~$ sudo zpool status tank | sed -n '/special/,+15p'
special
mirror-3 ONLINE 0 0 0
nvme4n1p1 ONLINE 0 0 0
nvme5n1p1 ONLINE 0 0 0
Meaning: A special vdev exists. If configured with a small-block threshold, metadata and small blocks can live on fast media, boosting IOPS/latency for small-file workloads.
Decision: If your workload is metadata-heavy or 4–16K random IO-heavy, special vdev can be a game changer. But treat it like a first-class vdev: redundancy and monitoring are non-negotiable.
Task 12: Measure dataset-level IO behavior (who is noisy?)
cr0x@server:~$ sudo zfs iostat -r -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
------------------------- ----- ----- ----- ----- ----- -----
tank 3.21T 5.89T 3.20K 1.86K 110M 42.0M
tank/app 820G 1.20T 1.90K 920 52.0M 18.0M
tank/vmstore 1.1T 2.20T 1.10K 910 58.0M 22.0M
tank/backups 1.2T 1.50T 200 30 0.8M 2.0M
------------------------- ----- ----- ----- ----- ----- -----
Meaning: The app dataset and VM store dominate IO. Backups are quiet right now.
Decision: If you need isolation, put VMs and databases on separate pools or at least separate vdev classes. Trying to “fair share” storage latency in one pool is how you grow gray hair.
Task 13: Determine if you’re CPU-bound due to checksumming/compression
cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
12:12:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:12:02 PM all 62.0 0.0 25.0 1.0 0.0 1.0 0.0 11.0
12:12:02 PM 0 78.0 0.0 19.0 0.0 0.0 0.0 0.0 3.0
12:12:02 PM 1 80.0 0.0 17.0 0.0 0.0 0.0 0.0 3.0
Meaning: CPU is heavily utilized while iowait is low. That’s a hint the storage media may be fine and you’re burning CPU on compression, checksums, encryption, or simply the workload itself.
Decision: Before buying disks, profile CPU and check ZFS feature costs. If you’re CPU-bound, faster disks won’t help; more CPU or different compression/encryption choices might.
Task 14: Run a truthful fio test for random read latency
cr0x@server:~$ fio --name=randread4k --filename=/tank/app/fio-testfile --size=8G --rw=randread --bs=4k --iodepth=32 --numjobs=4 --direct=1 --runtime=30 --time_based --group_reporting
randread4k: (groupid=0, jobs=4): err= 0: pid=27144: Thu Dec 25 12:13:40 2025
read: IOPS=48210, BW=188MiB/s (197MB/s)(5640MiB/30001msec)
slat (nsec): min=1100, max=220000, avg=5200.4, stdev=3400.1
clat (usec): min=70, max=9200, avg=640.8, stdev=310.5
lat (usec): min=80, max=9300, avg=650.9, stdev=312.0
clat percentiles (usec):
| 1.00th=[ 160], 5.00th=[ 240], 10.00th=[ 290], 50.00th=[ 610]
| 90.00th=[ 1050], 95.00th=[ 1400], 99.00th=[ 2100], 99.90th=[ 4200]
Meaning: Random 4K reads: ~48K IOPS, ~188 MiB/s, p99 ~2.1 ms. That’s an IOPS/latency story, not a throughput story.
Decision: Compare this to application requirements. If the app needs sub-millisecond p99 and you’re not there, you need faster vdevs, more vdev parallelism, better caching, or less sync pressure—not “more GB/s.”
Task 15: Run a sequential throughput fio test (separate from random)
cr0x@server:~$ fio --name=seqread1m --filename=/tank/backups/fio-testfile --size=32G --rw=read --bs=1m --iodepth=16 --numjobs=2 --direct=1 --runtime=30 --time_based --group_reporting
seqread1m: (groupid=0, jobs=2): err= 0: pid=27201: Thu Dec 25 12:15:12 2025
read: IOPS=3020, BW=3020MiB/s (3168MB/s)(90615MiB/30005msec)
clat (usec): min=230, max=12000, avg=820.3, stdev=310.4
Meaning: You can do ~3 GiB/s sequential reads with 1 MiB IOs. Great for backups/streams.
Decision: Use this to size replication windows and backup performance. Do not use it to claim the database will be fast.
Task 16: Check TXG pressure (are writes piling up?)
cr0x@server:~$ grep -E 'dirty_delay|dirty_over_max|dirty_frees_delay' /proc/spl/kstat/zfs/dmu_tx
dmu_tx_dirty_delay              4    0
dmu_tx_dirty_over_max           4    0
dmu_tx_dirty_frees_delay        4    0
Meaning: All zero here, so ZFS isn’t currently throttling writers. These counters are cumulative since module load: dmu_tx_dirty_delay counts transactions deliberately slowed because dirty (unflushed) data approached zfs_dirty_data_max, and dmu_tx_dirty_over_max counts times the limit was actually hit. Sample the file twice during an incident; if the numbers are climbing, applications are being write-throttled and latency spikes usually follow.
Decision: If TXGs are struggling, you’re write-flush-bound. Look at vdev write latency, sync load, and whether a slow RAIDZ is backing up the pipeline.
Three corporate-world mini-stories (anonymized, plausible, and technically annoying)
1) The incident caused by a wrong assumption: “But the pool does 5 GB/s”
A mid-size SaaS company migrated a payment service from an old SAN to a shiny new ZFS appliance.
The migration checklist had the usual items: scrub schedule, snapshots, replication, alerting. The performance section had one line: “validated throughput with dd.”
They ran a big sequential read, got a heroic number, and shipped it.
Two weeks later, they rolled out a feature that increased transaction volume and added a couple of secondary indexes.
The database didn’t crash. It just got slow. Latency rose gradually until the API started timing out.
The incident channel filled with charts showing that the pool was only doing a few hundred MB/s. “We have headroom,” someone said, and everyone nodded because MB/s looked low.
The real problem was random read latency. The working set didn’t fit in ARC anymore, so cache misses went to a RAIDZ vdev of HDDs.
Each query needed dozens of 8–16 KiB reads, and those reads queued behind other random IO.
Throughput stayed modest because IO size was small; IOPS were the bottleneck and the queueing multiplied the pain.
They fixed it by moving the database dataset to a mirror-only SSD pool and setting an appropriate recordsize.
The throughput chart barely changed. The p95 latency dropped dramatically. The team learned the hard way that users don’t pay you in gigabytes per second.
2) The optimization that backfired: “Let’s force bigger records for performance”
Another company ran a fleet of VM hosts on ZFS with zvols.
Someone noticed that sequential throughput wasn’t stellar during backup windows and decided to “optimize”: they standardized everything on larger block sizes.
Datasets got bigger recordsize; some zvols were recreated with larger volblocksize; the change request proudly claimed “fewer IOs.”
Backups got a little faster. Then the ticket queue started filling with complaints: VMs felt sluggish, especially under patching and login storms.
Graphs showed higher latency but also higher throughput during peak hours, which looked like success to anyone stuck in bandwidth mode.
The issue was read amplification and cache inefficiency. Many VMs were doing 4–8 KiB random reads and writes.
Larger blocks meant each small access pulled in more data than needed, wasting ARC and generating more physical IO on cache misses.
Worse, random writes caused more work per operation, and the pool’s effective IOPS capacity dropped.
They rolled back for VM zvols: smaller volblocksize, and they separated backup streams to a different dataset/pool.
The lesson was simple: you don’t optimize storage by making everything “big.” You optimize it by matching block behavior to the workload.
3) The boring but correct practice that saved the day: “We kept sync honest”
A financial company ran NFS-backed storage for a few critical systems, including a database cluster and a message queue.
The storage team had a policy that annoyed developers: sync semantics stayed enabled, and “performance fixes” that involved disabling safety required a risk sign-off.
It wasn’t popular, but it was consistent.
During a hardware refresh, a vendor suggested a quick win: set sync=disabled on the hot datasets and “make it scream.”
The storage team refused and instead added a proper mirrored SLOG built on power-loss-protected devices, then measured fsync latency under load.
The result wasn’t magical, but it was stable. More importantly, it was predictable.
Months later, they had a power event that took out a rack PDU in a way that made UPS runtime shorter than everyone expected.
Some systems went down hard. The storage came back clean.
The postmortem was boring—no data loss, no corruption, no weird replay behavior—and boring was the goal.
If you want a reliability quote to hang over your desk, here’s one I trust because it’s operationally actionable:
“Hope is not a strategy.”
— General Gordon R. Sullivan
ZFS knobs that change IOPS vs throughput (and the ones that don’t)
Topology: mirrors vs RAIDZ
Mirrors generally win on random read IOPS and often on random write latency. RAIDZ tends to win on usable capacity and sequential throughput per dollar, but it pays a tax on small random writes (parity math + IO patterns).
If your workload is IOPS/latency-sensitive, mirrors are the default answer unless you have a strong reason otherwise.
recordsize (datasets) and volblocksize (zvols)
These settings influence IO amplification and caching efficiency.
Larger blocks help sequential throughput and reduce metadata overhead. Smaller blocks can reduce read amplification for small random reads.
Set them based on application IO size and access pattern. If you don’t know the app pattern, find out. Guessing is how you build expensive disappointment.
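Setting it is the easy part. Note that recordsize only applies to blocks written after the change; existing data keeps its old layout until it’s rewritten. A minimal sketch with a placeholder dataset name:
cr0x@server:~$ sudo zfs set recordsize=16K tank/db
cr0x@server:~$ sudo zfs get -o name,property,value recordsize tank/db
NAME     PROPERTY    VALUE
tank/db  recordsize  16K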
sync, SLOG, and the real meaning of “fast writes”
Sync writes are about durability guarantees. If the app demands it, ZFS must commit safely before acknowledging.
A good SLOG can reduce latency for sync writes by taking that durability hit on a fast low-latency device, then later flushing to main storage.
A bad SLOG (non-PLP consumer SSD, overloaded, or shared with other workloads) is worse than none: it becomes a bottleneck and adds failure risk.
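Mechanically, adding a mirrored SLOG is one command; the judgment calls are device choice (power-loss protection, endurance) and whether sync writes are actually your bottleneck. A sketch with placeholder device names:
cr0x@server:~$ sudo zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
After this, zpool status shows the devices under a logs section, as in Task 5, and sync-heavy workloads should be re-measured before anyone declares victory.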
special vdev for metadata and small blocks
If your pain is metadata IO (lots of small files, directory traversals, mail spools) or small-block random reads, special vdevs can move the hottest, most latency-sensitive parts to SSD.
This is often a cleaner fix than throwing more RAM at ARC when the working set is too large.
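Mechanically it’s another vdev plus a per-dataset threshold, and because losing the special vdev loses the pool, it must be redundant. A sketch with placeholder devices and an assumed 16K small-block cutoff; like recordsize, it only affects newly written blocks:
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme4n1 /dev/nvme5n1
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/app
Metadata goes to the special vdev by default; special_small_blocks additionally routes data blocks at or below the threshold there.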
Compression
Compression changes the math. If you compress 2:1, you halve physical bytes and often physical IO time.
But you spend CPU cycles. On HDD pools it’s usually a win. On NVMe pools at extreme throughput, CPU can become the ceiling.
What not to obsess over
Tiny “tunables” won’t fix a workload/topology mismatch.
If you have RAIDZ HDDs serving random 8K sync writes, no sysctl is going to rescue you. The disks will still have to do the work, one painful operation at a time.
Joke #2: Tuning without measurement is like debugging with your eyes closed—technically possible, but mostly a form of interpretive dance.
Common mistakes: symptoms → root cause → fix
1) “Throughput is low, storage must be fine” (while latency is awful)
Symptoms: App timeouts, high p95/p99 latency, but MB/s charts look modest.
Root cause: IOPS-bound workload with small IO size; queueing on vdevs; cache misses hitting slow media.
Fix: Measure IO size and latency; add vdev parallelism (more mirrors), move hot datasets to SSD, add special vdev, or increase ARC if it truly fits.
2) “We bought faster disks, still slow”
Symptoms: NVMe everywhere, but performance gains are small; CPU is hot; latency doesn’t improve as expected.
Root cause: CPU-bound (compression/encryption/checksums), or the bottleneck is network/NFS/SMB, or the app is serialized (single-threaded IO).
Fix: Check CPU utilization and network; profile app concurrency; validate that clients can issue parallel IO; tune at the right layer.
3) “Random writes collapsed after pool filled up”
Symptoms: Fresh pool was fast; after months, random IO latency spikes; freeing space helps a bit.
Root cause: Fragmentation + allocation pressure in copy-on-write, plus reduced free space leading to less efficient block allocation.
Fix: Keep healthy free space headroom, especially on HDD RAIDZ; use special vdev for metadata/small blocks; consider rewriting or migrating datasets; avoid pathological overwrite patterns on RAIDZ for hot random workloads.
4) “Sync writes are killing us, so we disabled sync”
Symptoms: Performance “improves” immediately; then a crash or power loss silently drops the last few seconds of acknowledged writes, and the application discovers its own corruption later.
Root cause: Trading durability for speed; the system was sync-bound and needed a proper SLOG or workload change.
Fix: Restore sync=standard; add a mirrored, power-loss-protected SLOG; or change application behavior knowingly (batch fsync, group commit), not by lying to it.
5) “We changed recordsize and got worse performance”
Symptoms: Sequential workloads improve, but interactive workloads degrade; cache hit rate drops; more IO ops for same work.
Root cause: Misaligned block sizing vs workload; read amplification; ARC pollution.
Fix: Match recordsize/volblocksize to access pattern: small for DB random reads, large for streaming. Separate workloads into different datasets and set properties per dataset.
6) “One pool for everything”
Symptoms: Backups make databases slow; scrubs make VMs jittery; performance is unpredictable.
Root cause: Competing IO patterns; no isolation; shared vdev queues.
Fix: Separate pools or at least separate vdev classes; schedule scrubs/resilvers; throttle bulk jobs; consider dedicated backup targets.
Checklists / step-by-step plan
Step-by-step: decide whether you’re IOPS-bound or throughput-bound
- Collect latency (p95/p99) from the app and from the storage host (iostat -x).
- Collect IO size distribution using fio tests that match the app (4K random, 8K sync writes, 1M sequential reads, etc.).
- Compute the implied throughput: IOPS × IO size. If your observed MB/s matches the math, the metrics are consistent and you can reason cleanly (see the sketch after this list).
- Check ARC miss rate. High miss rate with high disk await means you’re disk-bound; high miss rate with low disk await suggests something else (CPU/network/app).
- Inspect vdev utilization via zpool iostat -v. Identify the hottest vdev and why it’s hot.
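As promised above, the consistency check is one line of arithmetic: divide observed bandwidth by observed IOPS and you get the average IO size the pool is actually servicing. A sketch using roughly the pool-wide read numbers from the Task 2 example (assumed, not re-measured):
cr0x@server:~$ python3 -c "iops, mib_s = 3200, 110; print(round(mib_s * 1024 / iops, 1), 'KiB average IO size')"
35.2 KiB average IO size
If the app claims it does 1 MiB sequential reads and the math says 35 KiB, something between the app and the vdevs is slicing the IO, and your mental model needs updating before your config does.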
Checklist: designing for random IOPS (databases, VMs)
- Prefer multiple mirror vdevs over a small number of wide RAIDZ vdevs.
- Set recordsize/volblocksize to match expected IO sizes.
- Consider special vdev for metadata/small blocks.
- Keep free space headroom (don’t run pools “near full” and then act surprised).
- Ensure sync path is sane: either proper SLOG or accept the latency and design around it.
Checklist: designing for sequential throughput (backups, ETL)
- Use larger recordsize for large files.
- Measure network ceilings; storage may not be your bottleneck.
- Confirm CPU headroom if using compression/encryption.
- Separate sequential bulk jobs from interactive latency-sensitive datasets if possible.
Operational checklist: before you touch tunables
- Capture zpool status, zpool iostat -v, and iostat -x during the incident.
- Capture ARC stats and memory pressure indicators.
- Write down the workload pattern (sync? random? IO size?). If you can’t, stop and instrument.
- Make one change at a time and re-measure. Storage is not a vibes-based discipline.
FAQ
1) Which metric should I watch first on ZFS: IOPS or throughput?
Watch latency first (p95/p99), then map it to IOPS and IO size. Throughput alone is a feel-good number and often the wrong one.
2) Why does my pool show high throughput but my database is slow?
Your pool may be good at sequential reads/writes (high MB/s) while the database needs low-latency small random IO.
If random reads miss ARC and hit HDD RAIDZ, latency dominates and throughput doesn’t reflect user pain.
3) Are mirrors always faster than RAIDZ?
For random IOPS and latency, mirrors usually win.
For capacity efficiency and often sequential throughput per dollar, RAIDZ can be attractive.
Pick based on workload. “Always” is for marketing, not engineering.
4) Does adding more disks to a RAIDZ vdev increase IOPS?
It can increase sequential throughput, but random IOPS don’t scale linearly the way people hope.
If you need random IOPS, you typically add more vdevs (more independent queues), not just wider RAIDZ.
5) When does a SLOG help?
When you have a workload with significant sync writes (fsync-heavy DBs, NFS with sync, some VM patterns) and your main vdevs have higher latency.
It won’t help async writes, and it won’t fix random read latency.
6) Is sync=disabled ever acceptable?
It’s acceptable when you deliberately choose to lie about durability and you can tolerate data loss on power failure or crash.
Most production systems can’t, and the ones that claim they can often find religion after the first outage.
7) Should I change recordsize for my database?
Often yes, but only with understanding. If your DB page size is 8K and access is random, smaller recordsize can reduce read amplification on cache misses.
Benchmark with representative workloads and watch p95/p99 latency, not just MB/s.
8) How do I know if I’m ARC-bound or disk-bound?
If ARC hit rate is high and disks show low await/util, you’re likely cache/CPU/app-bound.
If ARC misses correlate with high disk await/util and rising latency, you’re disk-bound and need faster media or more vdev parallelism (or a smaller working set).
9) Can compression improve IOPS?
Yes. If data compresses, ZFS writes fewer bytes and can satisfy reads faster (less physical IO), which can improve both throughput and effective IOPS.
If CPU becomes the bottleneck, the gain disappears.
10) Why do benchmarks disagree with production?
Because benchmarks are often sequential, cache-friendly, and unrealistically deep-queued, while production is mixed, bursty, and latency-sensitive.
If your benchmark doesn’t match IO size, sync behavior, concurrency, and working set, it’s measuring a different universe.
Next steps you can do this week
If you want to stop arguing about “fast storage” and start shipping predictable performance, do these in order:
- Pick two representative fio profiles: one random small-block (4K or 8K) with realistic concurrency, and one sequential large-block (1M). Run them on the same datasets your apps use.
- Instrument latency: capture p95/p99 at the application and at the storage host during peak load.
- Audit topology vs workload: if you run latency-sensitive workloads on RAIDZ HDD vdevs, accept the physics or redesign with mirrors/SSD/special vdev.
- Fix block sizing: align recordsize/volblocksize to how the app actually does IO, not how you wish it did IO.
- Validate sync behavior: don’t disable safety in production; add a proper SLOG if sync writes are the gating factor.
- Separate noisy neighbors: backups, scrubs, and replication are necessary. They also bulldoze interactive latency if you let them share queues unchecked.
Then re-measure. If you can’t show “before vs after” in latency percentiles and IO shape, you didn’t do engineering—you did wishful thinking with root privileges.