You’ve got a pile of spinning disks, no budget for SSDs, and users who think “storage” is a button you press to make the app faster.
Meanwhile your ZFS pool is doing exactly what physics allows: slow random I/O, decent sequential I/O, and occasional drama during scrubs.
This is a field guide for making HDD-only ZFS feel surprisingly competent—without lying to yourself. We’ll tune what’s tunable, avoid the
traps that look fast until they corrupt your week, and build a diagnosis muscle that works at 3 a.m.
The mental model: what HDD pools are good at
HDD-only ZFS tuning is mostly the art of not asking disks to do what they can’t. Spinning disks are fantastic at sequential bandwidth and
terrible at random IOPS. ZFS is fantastic at correctness and flexibility, and pretty good at performance—as long as you align your workload
with the pool’s geometry and ZFS’s write/read behavior.
One rule to tattoo on your runbook
For HDD pools, you don’t “tune for IOPS.” You reduce random I/O, increase effective sequentiality, and prevent avoidable write amplification.
If your workload is truly random small-block writes, the most honest optimization is changing the workload or buying SSDs. Everything else is
negotiating with physics.
Why ZFS can feel slow on HDDs (and why it’s often your fault)
- Copy-on-write means overwrites become new writes, with metadata updates too. Fragmentation and scattered writes appear if you do lots of random updates.
- Transaction groups (TXGs) batch changes and flush periodically. This is good for throughput; it can be bad for latency if you misunderstand sync behavior.
- Checksumming adds CPU work, and forces full-block reads on partial writes in some cases (read-modify-write).
- RAIDZ is capacity-efficient, but small random writes can trigger parity overhead and RMW penalties.
Joke #1: HDDs are like a manager reading email—amazing at processing a long thread, and catastrophically slow when you interrupt every 8 milliseconds.
“Speed” means multiple things
Most tuning mistakes come from optimizing the wrong metric:
- Throughput (MB/s) for backups, media, object storage, big ETL reads.
- IOPS for metadata-heavy workloads, mail stores, VM random writes.
- Latency for databases, VM sync writes, NFS with sync semantics.
You can often improve throughput and also reduce tail latency by avoiding pathological patterns (tiny sync writes, mis-sized records,
over-wide RAIDZ, aggressive scrubs during peak). But you can’t make eight HDDs behave like eight SSDs. The win is making them behave like a
well-fed eight-disk array instead of a bag of sad seeks.
Facts & history that actually matter
Storage tuning is easier when you know which “old” ideas are still baked into modern behavior. Here are short, concrete facts that keep
paying rent:
- ZFS came out of Sun Microsystems in the mid-2000s to end silent corruption and simplify storage administration; performance was designed to be predictable under correctness constraints.
- Copy-on-write is why ZFS can always validate data with checksums, but it also means random overwrites tend to fragment over time.
- RAIDZ was designed to avoid the “write hole” that classic RAID-5 can hit on power loss; parity consistency is part of the design, not an afterthought.
- The ARC (Adaptive Replacement Cache) evolved to beat simplistic LRU caching by balancing recency and frequency; on HDD pools, ARC effectiveness is often your biggest “free” performance lever.
- 4K-sector disks changed the world and ashift exists because the OS can’t always trust the drive’s reported sector size; picking it wrong can permanently tax performance.
- LZ4 compression became the default favorite because it’s usually faster than disks at typical HDD throughput; it often increases effective bandwidth by writing less.
- ZIL (intent log) is not a write cache; it exists to commit synchronous semantics safely. Without a SLOG device, ZIL lives on HDDs and sync writes can become the latency kingpin.
- Scrubs are not optional in ZFS culture because checksums only detect corruption when you read; scrubs force reads to proactively find and heal latent errors.
- Wide RAIDZ got popular with big disks for capacity, but operationally it increases resilver time and the blast radius of poor performance during rebuilds—especially on busy HDD pools.
One paraphrased idea worth keeping, in the spirit of Gene Kranz and mission operations discipline: hope is not a strategy; measure the system you have, then change one thing at a time.
Pool layout decisions that change everything
If you’re tuning an existing pool, some layout choices are locked in. But you still need to understand what you built, because many “tuning”
problems are actually “you chose RAIDZ2 for a VM farm” problems.
Mirrors vs RAIDZ on HDDs
If your workload is IOPS-sensitive (VMs, databases, metadata storms), mirrors win. Not by a little. Mirrors give you more independent
actuators (spindles) for random reads and often better write behavior. RAIDZ gives you capacity efficiency and great sequential throughput,
but pays for it with parity overhead and read-modify-write on small writes.
- Mirrors: best random read IOPS, decent random write, fast resilver (copies only used blocks).
- RAIDZ2/3: best usable TB per disk, good sequential streaming, more complex small-write behavior, resilver can be heavy.
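To make the trade-off concrete, here is a minimal sketch of the two shapes using six hypothetical disks (pool names and short device names are placeholders for brevity; in real life use stable by-id paths like the wwn- names shown later in this guide):
cr0x@server:~$ sudo zpool create vmpool mirror sda sdb mirror sdc sdd mirror sde sdf
cr0x@server:~$ sudo zpool create bulkpool raidz2 sda sdb sdc sdd sde sdf
The mirror layout gives you three vdevs that seek independently for random IO; the RAIDZ2 layout gives you more usable capacity but behaves like roughly one vdev for random writes.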
Don’t go too wide
“Wider vdevs are faster” is true for sequential throughput, and often false for latency and rebuild behavior. A 12-wide RAIDZ2 vdev can
stream like a champ, then turn into a busy, fragile beast during resilver while your apps fight for seeks.
Recordsize alignment and vdev geometry
ZFS writes variable-sized blocks up to recordsize (datasets) or volblocksize (zvols). On RAIDZ, each block is split
across columns plus parity. If your block sizes don’t play nicely with the RAIDZ geometry, you can end up with more IO operations than you
expected, and more partial-stripe writes.
For HDD-only pools, the pragmatic goal is: write fewer, larger, well-compressed blocks; avoid rewriting them frequently; keep metadata and
small random write paths under control.
ashift: choose once, suffer forever
ashift sets the pool’s sector size exponent. In practice you almost always want ashift=12 (4K). Sometimes you want
ashift=13 (8K) for some advanced drives or to reduce overhead with big-sector devices. If you picked ashift=9 on
4K drives, you bought yourself permanent read-modify-write pain.
If you’re stuck with a bad ashift, the “tuning” fix is migrating to a new pool. ZFS is polite; it will not rewrite your pool’s foundations
just because you regret them now.
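If you do migrate, set ashift explicitly at creation instead of trusting autodetection; a minimal sketch, with the pool name and devices as placeholders:
cr0x@server:~$ sudo zpool create -o ashift=12 tank2 raidz2 sda sdb sdc sdd sde sdf
cr0x@server:~$ zpool get ashift tank2
On OpenZFS, zpool get ashift should report 12; verifying right after creation is cheaper than discovering the mismatch with zdb a year later.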
Dataset and zvol tuning (recordsize, compression, sync)
Most HDD-only ZFS wins come from setting dataset properties based on workload. ZFS lets you tune per dataset, so stop thinking “pool-wide”
and start thinking “this dataset holds VM disks, that dataset holds backups, this one holds logs.”
recordsize: stop making the disk do extra work
recordsize affects how ZFS chunks file data. Larger records tend to improve sequential throughput and compression efficiency.
Smaller records can reduce read amplification for random reads, and reduce rewrite size for small updates—sometimes.
Opinionated defaults:
- Backups, media, large objects: recordsize=1M is usually excellent on HDD pools.
- General file shares: the default 128K is fine; change only with evidence.
- Databases: often 16K or 32K, but test with your DB page size and access pattern.
- VM images as files: depends; many do better with 64K or 128K plus compression, but random-write workloads may still hurt on RAIDZ.
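Applying these is one command per dataset. A sketch (tank/db is a hypothetical dataset name; tank/backup appears elsewhere in this guide), keeping in mind that recordsize only affects blocks written after the change:
cr0x@server:~$ sudo zfs set recordsize=1M tank/backup
cr0x@server:~$ sudo zfs set recordsize=16K tank/db
cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/backup tank/db
Existing data keeps its old block size until it is rewritten, which is why “we changed recordsize and nothing happened” is a common non-mystery.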
volblocksize for zvols: set it at creation, not later
If you present iSCSI or use zvol-backed VM disks, volblocksize is your block size. It cannot be changed after the zvol is created.
Match it to the guest filesystem/DB page size when possible (often 8K or 16K). Too small increases metadata and IO ops; too large increases
write amplification for small random updates.
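A sketch of provisioning a zvol with the block size decided up front (the zvol name and the 16K value are illustrative; validate against the guest filesystem or database page size):
cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=16K -o compression=lz4 tank/vm/vm02-disk0
If you get it wrong, the fix is creating a new zvol and migrating the data, so it is worth a short benchmark before you standardize.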
Compression: the closest thing to a free lunch
On HDDs, compression is usually a win because it reduces physical IO. lz4 is typically the right baseline. The caveat is CPU: if
you’re CPU-starved or using heavier algorithms, you can shift bottlenecks.
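A sketch of the baseline; like recordsize, the setting applies to newly written data only:
cr0x@server:~$ sudo zfs set compression=lz4 tank
cr0x@server:~$ zfs get -o name,property,value compression,compressratio tank
Child datasets inherit the setting unless they override it, so setting it on the pool’s root dataset is usually enough.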
sync: where latency goes to die
Synchronous writes require durability confirmation. Without SSDs (no dedicated SLOG), your sync writes are committed to the on-pool ZIL, which
means head seeks and rotational latency. Apps that do lots of small sync writes will make an HDD pool look like it’s malfunctioning.
Hard truth: turning sync=disabled makes benchmarks look great and postmortems look expensive.
atime: stop writing because someone read a file
On HDD pools, atime=on can create pointless write churn for read-heavy datasets. Turn it off for most server workloads unless you
have a concrete requirement for access times.
xattr and dnodesize: metadata matters on spinners
Metadata-heavy workloads can be brutal on HDDs. Settings like xattr=sa and dnodesize=auto can reduce metadata IO for
some workloads, but they’re not universal magic. The better lesson: identify metadata storms early and isolate them into datasets you can tune.
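A sketch of isolating a metadata-heavy workload into its own dataset (tank/maildir is hypothetical; xattr=sa and dnodesize=auto are Linux OpenZFS properties, and dnodesize=auto depends on the large_dnode pool feature):
cr0x@server:~$ sudo zfs create -o xattr=sa -o dnodesize=auto -o atime=off tank/maildir
Measure before and after; on some workloads the difference is dramatic, on others it is noise.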
ARC, memory pressure, and why “more RAM” is not a meme
If you have no SSDs, RAM is your performance tier. The ARC is where hot reads go to avoid disk seeks. On HDD pools, a healthy ARC can be the
difference between “fast enough” and “why is ls slow.”
ARC sizing: don’t starve the OS, don’t starve ZFS
The right ARC size depends on your platform and workload. The wrong ARC size is easy: too large and you swap; too small and you thrash disks.
Swapping on a system that serves IO is like putting sand in the gearbox because you wanted more traction.
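On Linux OpenZFS, the ARC cap is a module parameter; a sketch assuming a 64 GB host where you want ARC limited to 32 GiB (the number is an example, not a recommendation):
cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ grep -E "^(size|c_max)" /proc/spl/kstat/zfs/arcstats
If the runtime change proves out, make it persistent with a zfs_arc_max option in /etc/modprobe.d/zfs.conf; FreeBSD and illumos have their own tunables for the same job.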
Know what’s in ARC: data vs metadata
HDD pools often benefit disproportionately from caching metadata (directory entries, indirect blocks, dnodes). If metadata misses force disk
seeks, your “random read” workload becomes “random metadata read” plus “random data read.” That’s a two-for-one deal nobody asked for.
Special note on dedup
Dedup on HDD-only pools is usually a performance horror story unless you have a very specific, measured case and plenty of RAM for the dedup
tables. If you want space savings, start with compression. If you still want dedup, bring a calculator and a rollback plan.
Prefetch, sequential reads, and streaming workloads
Prefetch is ZFS trying to be helpful by reading ahead when it detects sequential access. On HDD pools, sequential access is your happy place.
When prefetch works, throughput climbs and latency smooths out. When it misfires (common with some VM patterns), it can waste bandwidth and
evict useful ARC entries.
The tuning approach is not “disable prefetch because someone on a forum said so.” It’s: detect whether the workload is actually sequential,
then test. If you can reorganize IO to be more sequential—bigger blocks, fewer sync points, batched writes—you should do that before touching
prefetch knobs.
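Before touching prefetch knobs, check whether prefetch is actually paying off; a sketch assuming Linux OpenZFS (counter names vary slightly between versions):
cr0x@server:~$ grep -i prefetch /proc/spl/kstat/zfs/arcstats
Hits that dwarf misses mean prefetch is earning its keep. A zfs_prefetch_disable module parameter exists, but reach for it only after you have proven the readahead is hurting.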
Scrub and resilver: getting safety without murdering performance
Scrubs and resilvers are the moments your pool stops being a storage system and becomes a storage project. HDDs have finite bandwidth and
limited IOPS. If production is busy and a resilver starts, something will lose. Your job is deciding what loses, and making it predictable.
Scrub scheduling is performance tuning
Scrubs are essential. But scrubbing at noon on a busy NFS server is how you learn what your executives sound like when they discover latency.
Schedule scrubs off-peak and monitor their duration over time. Increasing duration is a smell: fragmentation, growing data set, failing disks,
or a workload shift.
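Mechanically, a scrub is one command, so scheduling is whatever your cron or timer setup already does (many ZFS packages ship a monthly scrub job; check before adding your own). A sketch:
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool scrub -p tank
The second command pauses a running scrub if it collides with unexpected peak load; running zpool scrub tank again resumes it.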
Resilver behavior differs by layout
Mirrors often resilver faster because they only copy allocated blocks. RAIDZ resilvers have improved significantly over the years (sequential
resilver), but they still stress the pool because reconstruction needs reads from many disks and writes to one, plus metadata updates.
Joke #2: A resilver during peak hours is the closest thing storage has to a live-fire drill—except the fire is your ticket queue.
Practical tasks: commands, outputs, and decisions (12+)
These are real operational tasks you can run on a ZFS host (Linux or illumos variants with similar tools). Each task includes what the output
tells you and what decision to make. Use a scratchpad. Don’t “tune” from memory.
Task 1: Identify pool health and obvious slow-paths
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all features.
scan: scrub repaired 0B in 05:12:41 with 0 errors on Sun Dec 22 03:10:14 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d5 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d6 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d7 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d8 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d9 ONLINE 0 0 0
errors: No known data errors
Meaning: Health is fine; scrub time is ~5 hours; no errors. “Features not enabled” is not a performance fire by itself.
Decision: If scrubs are running longer each month, you’ll investigate fragmentation and disk performance. Otherwise, move on.
Task 2: See real-time latency and identify whether reads or writes hurt
cr0x@server:~$ zpool iostat -v tank 2 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 48.2T 21.6T 120 980 12.8M 38.4M
raidz2-0 48.2T 21.6T 120 980 12.8M 38.4M
wwn-... - - 18 155 2.1M 6.2M
wwn-... - - 20 160 2.2M 6.4M
wwn-... - - 16 148 1.9M 6.0M
wwn-... - - 19 158 2.1M 6.3M
wwn-... - - 17 152 2.0M 6.1M
wwn-... - - 18 154 2.1M 6.1M
---------- ----- ----- ----- ----- ----- -----
Meaning: Writes dominate operations; bandwidth is moderate. If the workload is slow, it’s likely latency/IOPS bound, not
throughput bound.
Decision: Next, check sync writes and per-process IO patterns; also validate that the app isn’t doing tiny fsync storms.
Task 3: Check dataset properties that most often decide performance
cr0x@server:~$ zfs get -o name,property,value -r recordsize,compression,atime,sync tank
NAME PROPERTY VALUE
tank recordsize 128K
tank compression lz4
tank atime off
tank sync standard
tank/vm recordsize 128K
tank/vm compression lz4
tank/vm atime off
tank/vm sync standard
tank/backup recordsize 1M
tank/backup compression lz4
tank/backup atime off
tank/backup sync disabled
Meaning: Backup dataset is tuned for throughput but has sync=disabled, which is a data-loss risk for anything
that claims durability.
Decision: If that dataset is truly “recreatable backup staging,” maybe acceptable; otherwise set it back to
sync=standard and fix the writer to batch.
Task 4: Verify ashift (sector alignment) on each vdev
cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n
45: vdev_tree:
62: ashift: 12
Meaning: ashift=12 is good for modern 4K disks.
Decision: If you see ashift: 9 on 4K drives, plan a migration. Don’t waste time with micro-tuning.
Task 5: Measure ARC hit ratio and memory pressure
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:10:01 980 62 6 40 4% 22 2% 10 1% 22G 24G
12:10:02 1012 74 7 51 5% 23 2% 12 1% 22G 24G
12:10:03 990 69 6 44 4% 25 3% 9 1% 22G 24G
12:10:04 1005 80 8 55 5% 25 2% 13 1% 22G 24G
12:10:05 970 58 5 37 4% 21 2% 9 1% 22G 24G
Meaning: Miss rate ~5–8% is decent; ARC is near its target. If miss% is high and disks are busy, you’re doing more physical
IO than necessary.
Decision: If you have free RAM, increase ARC max (platform-specific). If you don’t, prioritize metadata caching via workload
separation or reduce working set (snapshots, dataset sprawl, etc.).
Task 6: Identify whether the workload is sync-write bound
cr0x@server:~$ cat /proc/spl/kstat/zfs/zil
5 1 0x01 107 4080 173968149 2130949321
zil_commit_count 4 84211
zil_commit_writer_count 4 84211
zil_itx_count 4 5128840
zil_itx_indirect_count 4 120
zil_itx_indirect_bytes 4 983040
Meaning: High zil_commit_count implies frequent sync transactions.
Decision: If users complain about latency and you see heavy ZIL commits, investigate which dataset is receiving sync IO and
which process is calling fsync(). The fix is often at the app/DB layer, not ZFS.
Task 7: Find which processes are generating IO
cr0x@server:~$ iotop -oPa
Total DISK READ: 12.31 M/s | Total DISK WRITE: 41.22 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
8421 be/4 postgres 1.12 M/s 9.65 M/s 0.00 % 23.11 % postgres: wal writer
9102 be/4 qemu 4.88 M/s 18.33 M/s 0.00 % 31.74 % qemu-system-x86_64 -drive file=/tank/vm/vm01.img
2210 be/4 root 0.00 B/s 6.10 M/s 0.00 % 9.12 % rsync -a /staging/ /tank/backup/
Meaning: PostgreSQL WAL and QEMU are write-heavy. WAL tends to be sync-sensitive.
Decision: For DB, consider dataset-level tuning (recordsize, log placement) and DB settings (commit batching). For VMs,
strongly consider mirrors for the VM dataset even if the rest of the pool is RAIDZ.
Task 8: Check fragmentation and capacity headroom
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 69.8T 48.2T 21.6T 69% 38% ONLINE
Meaning: 69% full, 38% fragmented. That’s not catastrophic, but it’s trending into “random writes will get worse” territory.
Decision: If CAP goes above ~80% on HDD pools, expect performance cliff behavior. Plan expansion, deletion, or migration
before you hit it.
Task 9: Verify compression ratio and whether you’re gaining IO reduction
cr0x@server:~$ zfs list -o name,used,compressratio -r tank | head
NAME USED COMPRESSRATIO
tank 48.2T 1.52x
tank/vm 18.4T 1.18x
tank/backup 22.1T 1.94x
Meaning: Backups compress well (good). VM dataset compresses poorly (normal for already-compressed OS images).
Decision: Keep lz4 anyway unless CPU is the bottleneck. If VM compression is near 1.00x and CPU is hot, you can
consider disabling compression only for that dataset.
Task 10: Inspect sync property per dataset and fix dangerous shortcuts
cr0x@server:~$ zfs get -o name,property,value sync tank/vm tank/backup
NAME PROPERTY VALUE
tank/vm sync standard
tank/backup sync disabled
Meaning: Backup is unsafe for sync semantics. Some backup tools rely on fsync for correctness.
Decision: If you can’t justify it in a risk review, change it:
cr0x@server:~$ sudo zfs set sync=standard tank/backup
Task 11: Determine if you’re doing small-block random IO (the HDD killer)
cr0x@server:~$ sudo fio --name=randread --filename=/tank/testfile --size=4G --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --runtime=20 --time_based
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.33
randread: IOPS=420, BW=1.64MiB/s (1.72MB/s)(32.8MiB/20001msec)
lat (usec): min=210, max=185000, avg=76000.15, stdev=22100.42
Meaning: 4K random reads are brutal: a few hundred IOPS and nasty tail latency. That’s normal for HDDs.
Decision: If your production workload resembles this, you need architectural changes: mirrors, more spindles, more RAM/ARC,
or make IO less random (batching, larger blocks, caching layer outside ZFS).
Task 12: Determine sequential throughput headroom
cr0x@server:~$ sudo fio --name=seqread --filename=/tank/testfile --size=8G --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=8 --runtime=20 --time_based
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
fio-3.33
seqread: IOPS=620, BW=620MiB/s (650MB/s)(12.1GiB/20003msec)
lat (msec): min=2, max=52, avg=12.9, stdev=4.2
Meaning: Sequential reads are strong. So if your app is slow, it’s probably doing non-sequential IO or forcing sync latency.
Decision: Tune recordsize and workload patterns toward sequential IO where possible (streaming writes, larger IO, fewer fsync).
Task 13: Check disk-level latency to find a single bad actor
cr0x@server:~$ iostat -x 2 3
avg-cpu: %user %nice %system %iowait %steal %idle
6.12 0.00 2.20 18.40 0.00 73.28
Device r/s w/s r_await w_await aqu-sz %util
sda 9.20 45.10 12.40 18.70 1.22 78.0
sdb 10.10 44.90 13.20 19.10 1.24 79.4
sdc 9.40 45.20 12.10 17.90 1.18 77.9
sdd 8.90 44.70 61.30 88.10 5.90 99.8
Meaning: One disk (sdd) has far worse await and is pegged at 99.8% util. That can drag the whole vdev down.
Decision: Pull SMART stats and consider proactive replacement. A “slow disk” is a real failure mode, not superstition.
Task 14: Check SMART for reallocated sectors and pending errors
cr0x@server:~$ sudo smartctl -a /dev/sdd | egrep -i "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|SMART overall"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   098   098   010   Pre-fail  Always   -   12
197 Current_Pending_Sector  0x0012   100   100   000   Old_age   Always   -   4
198 Offline_Uncorrectable   0x0010   100   100   000   Old_age   Offline  -   4
Meaning: “PASSED” is meaningless comfort; pending sectors and uncorrectables are bad. This disk will cause retries and latency.
Decision: Replace the disk, then resilver off-peak, and monitor error counters during the operation.
Task 15: Verify whether snapshots are ballooning metadata and working set
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation | head
NAME USED REFER CREATION
tank/vm@auto-2025-12-01 28G 18.4T Mon Dec 1 02:00 2025
tank/vm@auto-2025-12-02 31G 18.4T Tue Dec 2 02:00 2025
tank/vm@auto-2025-12-03 33G 18.4T Wed Dec 3 02:00 2025
Meaning: Many VM snapshots with growing “USED” indicates churn. That churn can increase fragmentation and pressure ARC.
Decision: Enforce retention. For HDD pools, treat snapshot sprawl as a performance bug, not just a capacity issue.
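Retention is normally a tool’s job, but the manual version is short. A sketch using one of the snapshots above, with -n and -v as a dry run so you can see what would be destroyed before committing:
cr0x@server:~$ sudo zfs destroy -nv tank/vm@auto-2025-12-01
cr0x@server:~$ sudo zfs destroy tank/vm@auto-2025-12-01
Space only returns once no other snapshot references the freed blocks, so pruning a single snapshot rarely reclaims as much as you hope.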
Fast diagnosis playbook
When someone says “the storage is slow,” you don’t have time for a philosophical debate about IO patterns. You need a fast funnel to locate
the bottleneck: disk, ZFS behavior, memory, or application semantics.
First: confirm whether it’s a pool problem or a host problem
- Check pool health and current activity. Run zpool status. If a scrub/resilver is active, that’s your headline. If there are checksum errors, stop tuning and start incident response.
- Check CPU iowait and swapping. If the host is swapping or CPU is saturated, storage symptoms can be secondary. High iowait often means disks are the bottleneck, but it can also mean sync-write stalls.
Second: determine if you’re latency-bound or throughput-bound
- Run zpool iostat -v 2. If ops are high and bandwidth is low, it’s IOPS/latency-bound (classic HDD pain).
- Run iostat -x 2. Look for a single disk with high await/%util. One limping disk can make a vdev look like a design failure.
Third: check for sync-write storms and metadata storms
- Look at ZIL commit counts and identify top writers. If you see heavy ZIL activity and a DB/WAL writer on top, your “storage slowness” is sync semantics meeting HDD latency.
- Check ARC stats. A poor hit ratio plus busy disks suggests a working set larger than RAM or poor locality; you’ll need workload changes or more memory.
- Check fragmentation and capacity. High CAP and rising FRAG are classic predictors of “it was fine last year.”
What you decide from the playbook
- If a scrub/resilver is running: throttle/schedule; communicate; don’t “tune” mid-rebuild unless you enjoy chaos.
- If one disk is slow: replace it; stop blaming ZFS for a dying HDD.
- If sync writes are dominating: fix app batching; isolate the dataset; do not knee-jerk to sync=disabled.
- If ARC misses are high: add RAM or reduce working set; separate datasets; fix snapshot retention.
- If you’re simply doing random 4K IO on RAIDZ: the “tuning” is changing layout (mirrors) or expectations.
Common mistakes: symptom → root cause → fix
1) “Writes are slow and spiky; graphs look like a saw blade”
Root cause: TXG flushing + sync-write bursts + HDD latency. Often worsened by an app doing frequent fsync.
Fix: Identify the sync-heavy dataset and process. Increase application batching; adjust DB commit settings; isolate to a dataset with appropriate recordsize; keep sync=standard unless you formally accept data-loss risk.
2) “Random reads are terrible; even small directory listings lag”
Root cause: Metadata misses in ARC, or fragmented metadata due to snapshot churn and small random updates.
Fix: Ensure enough RAM; reduce snapshot count; consider xattr=sa for relevant datasets; keep CAP below ~80%; split metadata-heavy workloads into separate datasets and review recordsize.
3) “After we filled the pool past 85%, everything got slower”
Root cause: Space maps, allocator pressure, fragmentation; HDD seeks increase; metaslabs become constrained.
Fix: Free space aggressively; add capacity; migrate to a larger pool; stop running “near full” as a steady state. Treat headroom as a performance requirement.
4) “Scrubs take forever now, and users complain during scrubs”
Root cause: Growing dataset, fragmentation, one slow disk, or scrubs scheduled during peak IO.
Fix: Move scrub window; check per-disk latency; replace slow disks; consider reducing pool width in future designs; monitor scrub duration trend.
5) “We disabled compression to ‘reduce CPU’ and performance got worse”
Root cause: You increased physical IO on HDDs; disks became the bottleneck again.
Fix: Use compression=lz4 as a baseline; measure CPU. If CPU truly bottlenecks, reduce other CPU costs first (encryption, heavy checksums) before giving up IO reduction.
6) “VM storage on RAIDZ is unpredictable under load”
Root cause: VM IO is often small-block random write + sync-like behavior from guest filesystems and hypervisors. RAIDZ amplifies the pain.
Fix: Use mirrors for VM vdevs; tune volblocksize; reduce fsync frequency in guests where safe; consider separate pool/dataset for VMs with stricter policies.
7) “One client is slow, others are fine”
Root cause: Network issues, client-side sync semantics, or a single dataset property mismatch.
Fix: Verify dataset properties for that share/volume; compare with a known-good dataset; check client mount options and application behavior.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “ZIL is a cache, right?”
A mid-sized company ran an internal analytics platform on an HDD-only ZFS pool. Everything was “fine” until a new team deployed a service
that wrote small state updates with synchronous semantics. Latency went from tolerable to absurd. The on-call engineer saw disk bandwidth was
low and concluded the pool was underutilized. “We have plenty of headroom,” they said, staring at MB/s graphs like they were truth tablets.
They toggled sync=disabled on the dataset to “test.” The service got fast immediately. Tickets stopped. The change stayed.
Nobody wrote it down. That’s how it always starts.
Weeks later, a power event took down the rack. Systems came back. The service came back too—except some state was inconsistent, and it took
a long time to figure out what was missing because the application believed it had durable commits. The post-incident review was
uncomfortable in a very corporate way: lots of “process improvements,” little naming of the actual technical mistake.
The wrong assumption wasn’t “sync=disabled is unsafe.” Everyone knows that. The wrong assumption was that the ZIL behaves like a write cache
that you can ignore. In reality, ZIL exists to make sync semantics true. On HDDs, that means rotational latency is part of your SLA.
The fix ended up being boring: move the sync-heavy state to a different storage tier (eventually SSDs), and in the meantime reduce fsync
frequency by batching updates and using a queue. They also added a simple policy: any dataset property changes require a ticket with a risk
statement. It didn’t make them faster. It made them less fragile.
Optimization that backfired: “Let’s shrink recordsize for everything”
Another org ran ZFS for a mixed workload: file shares, backup targets, and a few VM images. They noticed slow random access during peak and
found online advice suggesting smaller recordsize for “better performance.” Without profiling, they set recordsize=16K
recursively on the whole pool.
The immediate effect looked positive for one workload: a small database on NFS saw slightly better random read latency. Then everything else
got worse. Backup jobs took longer. CPU went up. Metadata ops increased. ARC effectiveness dropped because the cache had to track far more
blocks. Scrubs took longer because there were more blocks to walk, and the pool was busy for more hours.
The real kicker arrived a month later: fragmentation rose, CAP climbed, and users started experiencing intermittent pauses on large file
transfers. The storage team ended up in the classic loop: “tune more” to fix the tune that broke things. They almost disabled compression to
“save CPU,” which would have been the second self-inflicted wound.
The recovery was methodical. They reverted recordsize on backup and file-share datasets to 128K and 1M where appropriate, leaving the small
recordsize only on the database dataset. They split VM storage into its own dataset and later into a mirrored pool. Performance stabilized.
Lesson: recordsize is a scalpel, not a paint roller. If you apply it pool-wide, you’ll get pool-wide consequences.
Boring but correct practice that saved the day: “Trend scrub duration”
A finance-ish company (the kind that loves spreadsheets and hates surprises) ran a large HDD RAIDZ2 pool for compliance archives and internal
file storage. They didn’t have SSDs. They also didn’t have illusions about performance. Their storage lead enforced two habits: scrubs were
scheduled off-peak, and scrub duration was tracked every month in a simple dashboard. No fancy observability stack required.
Over several months, scrub duration crept upward. Not dramatically, but enough to be obvious: the trend line was wrong. Nobody was reporting
problems yet. That’s the point: you want to find storage problems when users are asleep.
They investigated. Per-disk latency showed one drive had higher await but no hard errors. SMART showed growing pending sectors. It still
reported “PASSED” overall health. They replaced it proactively during a planned window, resilvered quietly, and scrub times returned to the
previous baseline.
Weeks later, a similar drive in a different chassis failed outright. That team got hit with a noisy incident and a long resilver under load.
The finance-ish company didn’t. Their “boring” practice—trend scrub duration and check per-disk latency—paid for itself without a heroic
midnight.
Checklists / step-by-step plan
Step-by-step tuning plan for an existing HDD-only pool
1) Baseline the workload. Capture zpool iostat -v 2, iostat -x 2, ARC stats, and top IO processes. Save it somewhere.
2) Validate the non-negotiables. Pool health, ashift, no failing disks, no silent read errors. If a disk is slow, fix hardware first.
3) Separate datasets by workload. At minimum: tank/vm, tank/db, tank/backup, tank/home. Tuning is per dataset; design your namespace accordingly.
4) Set safe, opinionated dataset properties (see the sketch after this list):
   - Most datasets: compression=lz4, atime=off
   - Backup/media: recordsize=1M
   - Databases: start at recordsize=16K or 32K (test)
   - VM zvols: set volblocksize correctly at creation
5) Handle sync writes with engineering, not denial. Keep sync=standard. If latency is unacceptable, the primary fix is application batching or moving the sync-heavy path to a different tier (even if that’s “more RAM and fewer fsyncs”).
6) Enforce headroom. Set internal policy: keep HDD pools below ~80% CAP for consistent performance. If you can’t, you’re in a capacity incident waiting room.
7) Manage snapshots like they cost money (they do). Retention for VM churn should be short. Archive snapshots should be deliberate and limited.
8) Schedule scrubs and watch their trend. Scrub monthly (commonly) off-peak. Track duration. Investigate changes.
9) Re-measure after each change. One change at a time. Compare to baseline. Keep what helps. Revert what doesn’t.
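A sketch of steps 3 and 4 for a fresh namespace (tank/db and tank/home follow the naming in step 3; on an existing layout like this guide’s pool, where tank/vm and tank/backup already exist, you would use zfs set instead of zfs create):
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off tank/home
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off -o recordsize=1M tank/backup
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off -o recordsize=16K tank/db
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off tank/vm
Then re-run the baseline commands from step 1; if a property change does not show up in the numbers, revert it and keep the configuration boring.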
Checklist: when a team asks “Can we make it faster without SSDs?”
- Is the workload mostly sequential? If yes, tune recordsize and compression and you can win.
- Is it random small-block IO? If yes, mirrors and RAM help; RAIDZ won’t become magical.
- Is it sync-write heavy? If yes, fix application semantics or accept latency; don’t disable sync casually.
- Is the pool >80% full or highly fragmented? If yes, capacity management is performance work.
- Is any disk slow or erroring? If yes, replace it before you touch knobs.
Checklist: safe defaults for HDD-only pools (per dataset)
- General shares: compression=lz4, atime=off, recordsize=128K
- Backups: compression=lz4, recordsize=1M, sync=standard unless explicitly risk-accepted
- Databases (files): start recordsize=16K or 32K, benchmark, keep sync=standard
- VMs: prefer mirrored vdevs; for zvols set volblocksize correctly at creation
FAQ
1) Can I get “SSD-like” performance from HDD-only ZFS with tuning?
No. You can get “well-designed HDD array” performance. That’s still valuable. The win is eliminating self-inflicted IO and aligning the
workload with sequential behavior.
2) Should I use RAIDZ or mirrors for HDD-only pools?
Mirrors for IOPS-sensitive workloads (VMs, databases, metadata heavy). RAIDZ for capacity-efficient bulk storage and mostly sequential IO.
If you’re trying to run VMs on RAIDZ and it feels bad, that’s because it is.
3) Is compression=lz4 safe for production?
Yes, and it’s usually faster on HDD pools. It reduces physical IO. The main reason to disable it is if CPU is truly the bottleneck for that
dataset and compression ratio is near 1.00x.
4) What recordsize should I use?
Default 128K for general-purpose. 1M for backups/media/large objects. Smaller (16K/32K) for databases or workloads with small random IO,
but test. Recordsize is not a performance “max” knob; it’s an IO-shape knob.
5) Is it okay to set sync=disabled to improve performance?
Only if you can lose recent writes and you’ve explicitly accepted that risk. It can also break applications that rely on fsync for
correctness. For most production data: keep sync=standard and fix the workload.
6) Do I need to tune ZFS module parameters to get good performance?
Usually no. Most wins come from layout, dataset properties, ARC sizing, and operational practices. Kernel/module tuning is last-mile work
and easy to get wrong, especially across upgrades.
7) Why does performance drop as the pool fills up?
Allocator pressure and fragmentation increase, free space becomes less contiguous, and ZFS has fewer good choices. HDDs pay the seek penalty.
Keep headroom; treat it as part of the design.
8) How often should I scrub an HDD-only pool?
Commonly monthly, scheduled off-peak. The real answer depends on risk tolerance, drive quality, and rebuild windows. Track scrub duration and
error corrections; trend changes are more important than the exact schedule.
9) Does dedup help performance?
Rarely on HDD-only pools. It often hurts performance and memory. If you want performance, use compression. If you want space savings, measure
and validate before enabling dedup.
10) What’s the single biggest improvement without SSDs?
For read-heavy workloads: more RAM (ARC). For mixed workloads: correct dataset properties and avoiding sync-write storms. For VM-heavy: mirrors.
For “everything is slow”: stop running the pool near full and replace slow disks.
Next steps you can do this week
- Run the fast diagnosis playbook once during peak. Save outputs of zpool iostat, iostat -x, and ARC stats. Baselines are how you stop guessing.
- Split datasets by workload and set sane properties. Especially: recordsize for backups and atime=off for most server datasets.
- Hunt sync-write storms. Identify who is doing fsync and why. Fix it at the application layer if possible. Don’t “solve” it with sync=disabled.
- Check for a slow disk. Use iostat -x and SMART. One bad actor can look like a systemic performance regression.
- Enforce headroom and snapshot retention. If your pool is trending above 80% CAP, treat it like a capacity incident. It’s cheaper than treating it like a performance mystery.
- Track scrub duration. It’s dull. It works. It’s the kind of practice you only mock until it saves you.
HDD-only ZFS performance is a discipline: good layout, correct semantics, sane dataset boundaries, and relentless measurement. You’re not
trying to win benchmarks. You’re trying to win Tuesdays.