Your pool has plenty of “throughput” on paper. Yet the database stalls on tiny updates, the VM host reports storage latency spikes, and your app’s p99 goes from “fine” to “why is the CEO calling.” This is the classic ZFS small random write trap: the workload is IOPS- and latency-bound, while your design is bandwidth-optimized.
When people say “ZFS is slow,” they usually mean “I built parity vdevs and then asked them to behave like mirrors under tiny random writes.” ZFS didn’t betray you. Physics did. And parity math joined in for a group project.
What “small random writes” really mean in ZFS terms
Small random writes are the workload that punishes designs built for sequential bandwidth. Think 4K–32K writes scattered across a large working set, typically under concurrency. Databases, VM images, container overlay filesystems, mail queues, and metadata-heavy file trees all live here.
For ZFS, “small” and “random” are not just application characteristics. They’re also layout and transaction characteristics. ZFS is copy-on-write (CoW). That means overwriting an existing block is not an in-place update; it’s “allocate new blocks, write them, then update metadata to point to them.” So the actual write I/O is frequently more than “one 8K write.”
Now add parity vdevs (RAIDZ1/2/3). Each write may require touching multiple disks plus parity computation, sometimes requiring read-modify-write patterns if the write doesn’t align to full-stripe boundaries. That is where IOPS and latency go to be judged.
Two definitions you should keep straight
- IOPS-bound: throughput is low because each IO is small and each one costs you latency. More bandwidth won’t help.
- Latency-bound: your queue depth grows because each operation takes too long to complete; applications time out, logs show spikes, and “disk busy” is misleadingly high.
Mirrors and parity arrays can both offer lots of raw capacity and acceptable sequential bandwidth. Only one of them typically delivers predictable small random write latency without heroic tuning: mirrors.
Mirrors vs parity in one sentence (and the longer sentence you actually need)
One sentence: Mirrors usually beat parity vdevs on small random writes because a mirror write is “write to two places,” while parity is often “read some things, compute parity, write several things,” with more I/O operations and more waiting.
The longer sentence you actually need: ZFS parity vdevs do well when writes are large, aligned, and streaming, but degrade sharply when writes are small, fragmented, or sync-heavy, because they amplify I/O, increase per-operation latency, and reduce scheduling flexibility at the vdev level.
The parity write path: where your IOPS go to retire early
Parity (RAIDZ) is great when you want capacity efficiency and failure tolerance. It’s not great when you want low-latency random writes. The reason is not mystical; it’s arithmetic plus mechanics.
Parity writes and the “full-stripe” problem
In a RAIDZ vdev, data is striped across disks with parity. A full-stripe write means you’re writing a complete set of data columns plus parity columns for a stripe. If you can do full-stripe writes consistently, parity is fairly efficient: no need to read old data to recompute parity; you already have all the new data.
Small random writes are usually not full-stripe. They’re partial-stripe and often misaligned relative to the dataset’s recordsize, the pool’s sector size (ashift), and the RAIDZ geometry. When you do a partial-stripe update, parity may require a read-modify-write:
- Read old data blocks that are part of the stripe (or parity) to compute the delta
- Compute new parity
- Write new data blocks and new parity blocks
That means a small logical write can turn into several physical I/Os across multiple disks. Each disk has its own latency. Your operation completes when the slowest required I/O completes.
Why this hurts more with HDDs
With HDDs, random IOPS are limited by seek and rotational latency. If your parity operation fans out to several disks, you multiply the chance of hitting a slow seek. Mirrors also fan out writes (two disks), but parity can involve more disks plus reads, and it typically has worse “tail latency” under load.
Why this still matters with SSDs
SSDs reduce seek pain, but they don’t remove it. SSDs have internal erase block management, garbage collection, and write amplification. Parity’s amplified and scattered writes can increase device-level write amplification and cause long tail latency during GC. Mirrors write more bytes than a single disk, but parity can create more fragmented and smaller writes across more devices.
Queueing: the silent killer
ZFS schedules I/O per top-level vdev. A RAIDZ vdev is one logical device backed by multiple disks. With small random writes, the vdev’s internal mapping and parity math limit how many independent operations it can satisfy without contention. Mirrors, by contrast, can distribute reads across disks and, with multiple mirror vdevs, distribute writes across vdevs with less coordination overhead.
Joke #1: Parity arrays are like project teams—everyone has to be involved, so nothing finishes on time.
The mirror write path: boring, direct, fast
A mirror write is simple: ZFS allocates a block, writes it to both sides of the mirror, and considers it done when the write is durable according to your sync policy. There is no parity to compute and usually no need to read existing blocks for the purpose of updating parity.
Mirrors scale in the way random I/O actually scales
The practical performance unit in ZFS is the top-level vdev. A pool’s IOPS are roughly the sum of the IOPS of its top-level vdevs, especially for random workloads. Ten mirror vdevs means ten places ZFS can schedule independent writes (each mirrored pair doing its thing). One big RAIDZ vdev means one scheduling domain.
So yes: a single 2-disk mirror is not magic. A pool of multiple mirrors is.
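Here is what that looks like at creation time, as a minimal sketch: the pool name fastpool and the bare device names are placeholders (use /dev/disk/by-id paths in real deployments).
cr0x@server:~$ sudo zpool create -o ashift=12 fastpool \
    mirror sda sdb \
    mirror sdc sdd \
    mirror sde sdf   # three top-level mirror vdevs = three independent lanes for random I/O
Three mirror vdevs give ZFS three independent places to schedule random writes, and a later zpool add mirror grows both capacity and IOPS without rebuilding anything.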
Write amplification differences you can feel in production
- Mirror: write new data twice (or three times for 3-way mirrors), plus metadata updates due to CoW.
- RAIDZ: write new data plus parity, sometimes read old data/parity first, plus metadata updates due to CoW.
When your workload is dominated by many small updates, the system lives and dies by per-operation overhead and tail latency. Mirrors keep that overhead lower and more predictable.
ZFS-specific multipliers: CoW, txg, metadata, and sync
Even with mirrors, ZFS can get cranky under small random writes if you ignore the ZFS parts. Here’s what matters when diagnosing “mirrors vs parity” debates in the real world.
Copy-on-write: the block you overwrote is not the block you wrote
ZFS writes new blocks and then updates pointers. That means random overwrite workloads generate additional metadata writes. On parity vdevs, those metadata writes are just as subject to partial-stripe penalties as data writes.
Transaction groups (txg) and bursts
ZFS batches writes into transaction groups and flushes them periodically. Under heavy small-write workloads, you can see a pattern: the system buffers, then flushes, and latency spikes during the flush. Mirrors tend to flush more smoothly; parity can become “spiky” if the flush turns into a storm of RMW operations.
Sync writes: where latency becomes policy
If your workload issues sync writes (databases often do, VM hosts often do), ZFS must commit data to stable storage before acknowledging the write—unless you explicitly relax that with sync=disabled (don’t, unless you enjoy explaining data loss).
On HDDs, sync writes are brutal without a proper SLOG (separate log device) because each sync operation forces a latency-bound commit. Parity doesn’t help. Mirrors don’t magically help either, but mirrors reduce the extra parity overhead so your SLOG actually has a fighting chance to be the bottleneck instead of the array.
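If the sync path is the problem, a mirrored SLOG is one command away. A sketch with placeholder partition names; the devices should be low-latency and power-loss protected:
cr0x@server:~$ sudo zpool add tank log mirror nvme0n1p3 nvme1n1p3   # mirrored log vdev for ZIL commits
Mirroring the log matters because an unmirrored SLOG that dies at the wrong moment can take recent sync writes with it if the pool also goes down uncleanly.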
Metadata and the “special” vdev lever
ZFS metadata I/O can dominate small-file and random-update workloads. A special vdev (typically mirrored SSDs) can place metadata (and optionally small blocks) on faster media, drastically reducing latency. This helps both mirrors and RAIDZ—but it often saves RAIDZ pools from feeling unusable under metadata churn.
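Adding one to an existing pool is a single command. A sketch with placeholder device names; the special vdev must be redundant, because losing it loses the pool:
cr0x@server:~$ sudo zpool add tank special mirror nvme3n1p1 nvme4n1p1   # mirrored special vdev for metadata
Metadata lands there automatically for new writes; small data blocks follow only if you set special_small_blocks on the datasets that need it (Task 14 below touches on this).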
Recordsize, volblocksize, and actual I/O size
ZFS recordsize (for filesystems) and volblocksize (for zvols) influence how ZFS chunks data. If your DB writes 8K pages into a dataset with a 128K recordsize, ZFS might do more work than necessary, especially under fragmentation. But don’t blindly shrink recordsize everywhere; that can increase metadata overhead and reduce sequential efficiency. You tune it where the workload justifies it.
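When the workload does justify it, the change is usually made at dataset creation. A minimal sketch, with a hypothetical dataset name and 16K chosen purely as an example; match your database’s actual page size:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/pgdata   # page-sized records for a DB dataset
For zvols the equivalent knob is volblocksize, which can only be chosen at creation time; Task 16 below covers that case.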
Interesting facts and historical context (the stuff people forget)
- Fact 1: ZFS came out of Sun Microsystems in the mid-2000s with end-to-end checksumming and CoW as first-class ideas—years before “data integrity” became a default marketing checkbox.
- Fact 2: RAIDZ was designed to avoid the classic RAID-5 “write hole” by integrating parity with the filesystem’s transactional model.
- Fact 3: The “top-level vdev is the performance unit” guideline predates modern NVMe; it originally mattered even more on HDDs where seek latency dominated.
- Fact 4: Early ZFS deployments were often built for streaming workloads (home directories, media, backups). Databases and virtualization later became common, and that’s where parity pain became mainstream.
- Fact 5: The ZIL (ZFS Intent Log) exists even without a dedicated SLOG device. With no SLOG, the ZIL lives on the pool disks, and sync writes compete with everything else.
- Fact 6: Modern OpenZFS has introduced features like special vdevs and persistent L2ARC to target the exact “metadata and small random IO” problem that parity arrays struggle with.
- Fact 7: 4K-native drives and SSDs made ashift selection a permanent footgun; a wrong ashift can silently kneecap small I/O performance for the life of the vdev.
- Fact 8: RAIDZ expansion (adding a disk to an existing RAIDZ vdev) has historically been hard; operationally, mirrors were often favored simply because they were easier to grow without migration.
- Fact 9: “Parity is slower” is not a law. For large sequential writes, RAIDZ can be very competitive—sometimes faster than mirrors due to better usable bandwidth per disk.
Fast diagnosis playbook
This is the order I check things when someone says “ZFS small writes are slow.” The goal is to find the bottleneck in minutes, not to win a theoretical argument.
First: confirm the workload is actually small random writes
- Are we talking about a DB WAL + data pages? VM random writes? Metadata storm?
- Is latency the complaint (p95/p99), or throughput, or both?
Second: identify the vdev type and how many top-level vdevs you have
- One big RAIDZ vdev is a very different animal than eight mirrors.
- For random I/O, “more vdevs” usually beats “wider vdevs.”
Third: check sync behavior and SLOG health
- If sync writes are on and there is no fast SLOG (or it’s misconfigured), you’re latency-bound by stable storage commits.
- If the SLOG exists but is saturated, it becomes your choke point.
Fourth: check pool fullness and fragmentation
- Pools above ~80% full and heavily fragmented can turn small writes into allocator misery.
- Parity pools suffer more because they need better alignment to stay efficient.
Fifth: check ashift, recordsize/volblocksize, and special vdevs
- Wrong ashift: permanent tax.
- Mismatched block sizing: avoidable tax.
- Metadata on slow disks: self-inflicted tax.
Sixth: validate that the bottleneck is storage, not CPU or memory pressure
- Parity can burn CPU on checksumming/compression/parity math under heavy IOPS.
- ARC pressure can cause extra reads and metadata misses.
Hands-on tasks: commands, outputs, and what decisions to make
These are real checks you can run on a typical Linux/OpenZFS system. Each task includes: command, sample output, what it means, and the decision you make from it.
Task 1: Identify vdev layout and count top-level vdevs
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
config:
NAME           STATE     READ WRITE CKSUM
tank           ONLINE       0     0     0
  raidz2-0     ONLINE       0     0     0
    sda        ONLINE       0     0     0
    sdb        ONLINE       0     0     0
    sdc        ONLINE       0     0     0
    sdd        ONLINE       0     0     0
    sde        ONLINE       0     0     0
    sdf        ONLINE       0     0     0
logs
  nvme0n1p2    ONLINE       0     0     0
special
  mirror-1     ONLINE       0     0     0
    nvme1n1p1  ONLINE       0     0     0
    nvme2n1p1  ONLINE       0     0     0
errors: No known data errors
What it means: You have one top-level RAIDZ2 vdev (one performance “lane” for random writes), plus a SLOG and a special vdev mirror.
Decision: If random write latency is the problem, you either (a) accept parity’s limits and mitigate with special vdev/SLOG/tuning, or (b) redesign toward multiple mirror vdevs.
Task 2: Watch per-vdev IOPS and latency during the complaint
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 6.20T 1.80T 120 980 3.2M 12.4M
raidz2-0 6.20T 1.80T 120 980 3.2M 12.4M
sda - - 20 165 520K 2.1M
sdb - - 18 170 510K 2.0M
sdc - - 22 160 540K 2.1M
sdd - - 21 162 530K 2.0M
sde - - 19 163 510K 2.1M
sdf - - 20 160 520K 2.1M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: Writes are distributed, but the pool is doing ~980 write ops/s total. On HDDs, that’s already flirting with “seek-limited.”
Decision: If the app needs more IOPS or lower p99 latency, RAIDZ on HDDs is likely the wrong tool. Consider mirrors, SSDs, or pushing the hot set to special vdev / SSD pool.
Task 3: Check sync settings at the dataset/zvol level
cr0x@server:~$ sudo zfs get -r sync tank
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank/db sync standard inherited from tank
tank/vmstore sync standard inherited from tank
What it means: Sync writes are honored normally.
Decision: Good. Now ensure you have a fast, power-loss-protected SLOG if sync latency is high. Don’t “fix” this by disabling sync unless the data is disposable.
Task 4: Confirm SLOG presence and whether it’s actually being used
cr0x@server:~$ sudo zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 6.20T 1.80T 200 1100 4.1M 13.0M
raidz2-0 6.20T 1.80T 150 700 3.5M 10.8M
nvme0n1p2 - - 50 400 620K 2.2M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: The log device shows write activity. Sync traffic is landing there.
Decision: If the log device is slow or saturated, upgrade it. If it’s fast and still latency is bad, your pool writes are the issue (often parity + fragmentation).
Task 5: Check pool fullness (allocation pressure gets ugly)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 6.20T 1.80T 96K /tank
What it means: About 77% used. Not terrible yet.
Decision: If you’re above ~80–85% used and seeing random write pain, plan a capacity relief project. Mirrors tolerate “full” better than RAIDZ, but no pool likes allocator starvation.
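Before planning relief, it helps to see where the space actually went. A quick sorted listing, using the same pool name as above:
cr0x@server:~$ zfs list -r -o name,used,avail -s used tank | tail -n 5   # biggest consumers end up at the bottom
Snapshots quietly holding on to old data are a common culprit; zfs list -t snapshot shows them.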
Task 6: Check fragmentation (a proxy for allocator pain)
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,frag,health tank
NAME SIZE ALLOC FREE FRAG HEALTH
tank 8T 6.2T 1.8T 41% ONLINE
What it means: 41% fragmentation is not a crime, but under small random overwrites it can produce real latency.
Decision: If frag is high and workload is random-write heavy, plan to migrate/replicate to a fresh pool or add vdevs (mirrors help more) to give the allocator room.
Task 7: Verify ashift (misalignment tax is forever)
cr0x@server:~$ sudo zdb -C tank | egrep -i "ashift|vdev_tree" -n | head
45: vdev_tree:
78: ashift: 12
What it means: ashift=12 (4K sectors). Usually correct for modern disks.
Decision: If you see ashift=9 on 4K drives, you’ve built a performance problem you can’t tune away. The fix is rebuild/migrate.
Task 8: Check recordsize (filesystems) and volblocksize (zvols)
cr0x@server:~$ sudo zfs get recordsize,compression tank/db
NAME PROPERTY VALUE SOURCE
tank/db recordsize 128K local
tank/db compression lz4 local
What it means: DB dataset uses 128K recordsize. If the DB does 8K page writes, this might be too large depending on access patterns.
Decision: Consider setting recordsize to match DB page size only when the dataset is truly DB datafiles and the workload is random overwrite heavy. Measure before/after.
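If the measurement says yes, the change itself is one property set; 16K here is an example value, not a recommendation:
cr0x@server:~$ sudo zfs set recordsize=16K tank/db   # applies to newly written blocks only
Existing files keep their old block size until rewritten, so a send/receive or dump/restore into a fresh dataset is the clean way to apply the new recordsize to existing data.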
Task 9: Check whether compression is helping or hurting CPU
cr0x@server:~$ sudo zfs get compression,compressratio tank/db
NAME PROPERTY VALUE SOURCE
tank/db compression lz4 local
tank/db compressratio 1.68x -
What it means: Compression is effective; fewer bytes hit disk, which often helps random write performance.
Decision: Keep lz4 unless CPU is saturated. Compression frequently helps parity pools more than mirrors because it reduces parity work per logical write.
Task 10: Check whether sync latency is dominated by the log device
cr0x@server:~$ sudo iostat -x 1 3
Linux 6.8.0 (server) 12/26/2025 _x86_64_ (16 CPU)
Device r/s w/s r_await w_await aqu-sz %util
sda 20.0 165.0 12.1 35.8 3.2 92.0
sdb 18.0 170.0 11.7 36.5 3.3 93.5
sdc 22.0 160.0 12.8 34.2 3.1 90.1
nvme0n1 50.0 400.0 0.3 0.8 0.2 18.0
What it means: HDDs show high write await (~35 ms) and high utilization; NVMe log device is fine. The pool, not the SLOG, is the bottleneck.
Decision: This is where parity-on-HDD small random writes usually loses. Move workload to mirrors/SSD, add mirror vdevs, or split workloads across pools.
Task 11: Check ARC pressure (memory misses can trigger extra I/O)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep "c_max|c |size|hits|misses" | head
c 4 51539607552
c_max 4 51539607552
size 4 49821667328
hits 4 182736455
misses 4 32736455
What it means: ARC is near max and has misses. Misses aren’t automatically bad; the ratio matters and the workload matters.
Decision: If ARC is thrashing (miss rate spikes) and you’re read-modify-writing parity stripes, latency can worsen. Add RAM, tune workloads, or isolate datasets.
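For a quicker view of the hit/miss ratio during the incident, the arcstat helper shipped with most OpenZFS packages prints a per-interval summary (availability and column names vary a little by version):
cr0x@server:~$ arcstat 1 10   # one line per second for ten seconds: reads, misses, miss%, ARC size
A miss percentage that climbs in step with the latency complaints is the signal worth chasing.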
Task 12: Look for slow I/O and vdev-level timeouts in kernel logs
cr0x@server:~$ sudo dmesg | egrep -i "zfs|I/O error|blk_update_request|timeout" | tail -n 8
[12345.678901] zfs: vdev disk sdb: slow I/O, 35 sec, zio 0x0000000abcd123
[12347.112233] blk_update_request: I/O error, dev sdb, sector 123456789
[12347.445566] zfs: vdev disk sdb: repairing
What it means: One disk is slow or erroring. Parity vdevs are particularly sensitive to one laggard; the whole vdev’s tail latency worsens.
Decision: Replace or remediate the slow disk. Don’t performance-tune around failing hardware.
Task 13: Check ZFS write throttle behavior (txg pressure)
cr0x@server:~$ cat /proc/spl/kstat/zfs/dmu_tx | head -n 15
dmu_tx statistics:
dmu_tx_assigned: 9876543
dmu_tx_delay: 12345
dmu_tx_error: 0
What it means: dmu_tx_delay increments when ZFS must throttle writers because the pool can’t flush fast enough.
Decision: If delay climbs during the incident, you’re storage-bound on writes. Mirrors with more vdevs can reduce this; so can faster media or workload shaping.
Task 14: Validate special vdev is actually taking metadata/small blocks
cr0x@server:~$ sudo zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 8T 6.2T 1.8T - - 41% 77% 1.00x ONLINE -
raidz2-0 8T 6.2T 1.8T - - 41% 77%
special 1T 220G 804G - - 9% 21%
What it means: Special vdev has allocation, suggesting metadata (and perhaps small blocks) are landing there.
Decision: If special shows near-zero allocation despite being configured, check dataset properties like special_small_blocks. If metadata is on HDDs, you’re paying avoidable latency.
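Both checks are quick; the 32K threshold below is an example, and like recordsize it only affects newly written blocks:
cr0x@server:~$ sudo zfs get -r special_small_blocks tank          # see which datasets opt small blocks in
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/db      # blocks <= 32K now go to the special vdev
Watch the special vdev’s capacity afterwards: if it fills, allocations spill back to the slow vdevs and the benefit quietly disappears.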
Task 15: Confirm dataset properties that often sabotage small-write latency
cr0x@server:~$ sudo zfs get atime,logbias,primarycache,secondarycache tank/db
NAME PROPERTY VALUE SOURCE
tank/db atime on default
tank/db logbias latency default
tank/db primarycache all default
tank/db secondarycache all default
What it means: atime is on (extra metadata writes), logbias is latency (good for sync-heavy), caching is default.
Decision: For DB/VM datasets, set atime=off unless you truly need it. Consider logbias=throughput only if you understand your sync profile and you have a strong reason.
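The low-risk half of that decision is a one-liner, and it is reversible:
cr0x@server:~$ sudo zfs set atime=off tank/db   # stop writing metadata on every read
Leave logbias at latency unless you have measured a specific reason to change it.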
Task 16: Determine whether you’re using zvols for VMs and what block size they use
cr0x@server:~$ sudo zfs list -t volume -o name,volsize,volblocksize,used tank/vmstore
NAME VOLSIZE VOLBLOCKSIZE USED
tank/vmstore 2T 8K 1.4T
What it means: zvol uses 8K blocks, which can be reasonable for VM workloads that do small random writes.
Decision: If volblocksize is huge (e.g., 128K) for random-write VM images, you’re manufacturing write amplification. Changing volblocksize requires recreating the zvol—plan migration.
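A rough migration sketch, assuming the guest is shut down and all names are placeholders; zfs send/receive of the volume is the gentler alternative if you need incremental copies:
cr0x@server:~$ sudo zfs create -s -V 2T -o volblocksize=16K tank/vmstore-new   # example block size; pick to match guest I/O
cr0x@server:~$ sudo dd if=/dev/zvol/tank/vmstore of=/dev/zvol/tank/vmstore-new bs=1M conv=sparse status=progress   # raw copy with the guest offline
cr0x@server:~$ sudo zfs rename tank/vmstore tank/vmstore-old
cr0x@server:~$ sudo zfs rename tank/vmstore-new tank/vmstore
Keep the old zvol around until the guest has booted cleanly from the new one.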
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
The company: mid-size SaaS, one primary PostgreSQL cluster, a few read replicas, and a background job system that loved tiny updates. They had a new storage node with a big RAIDZ2 pool because “we need capacity and redundancy.” On the sizing spreadsheet, the sequential write throughput looked fantastic.
The assumption: “IOPS is IOPS, and RAIDZ2 has more disks, so it will be faster.” Nobody said it out loud quite that bluntly, but it lived in the architecture.
Launch day was fine. Then the dataset started fragmenting and the working set grew. The symptom was not “low throughput.” The symptom was periodic 1–3 second stalls on commits. Application threads piled up, p99 latency went sideways, and the on-call got the familiar alert pattern: CPU low, network low, disk busy, but not moving much data.
The postmortem turned up the real culprit: sync-heavy small writes landing on a parity vdev with HDDs. The ZIL was on-pool (no SLOG), so each fsync dragged the disks into the fight. Under concurrency, the array became a tail-latency generator.
The fix wasn’t clever. They migrated the primary DB to a pool built from multiple mirror vdevs on SSDs, kept RAIDZ2 for backups and bulk data, and suddenly the same application code looked “optimized.” Storage architecture had been the bug.
Mini-story 2: The optimization that backfired
Different place, different problem: virtualization cluster running dozens of mixed workloads. Someone noticed the parity pool was “wasting” potential and proposed a simple performance tweak: turn off sync on the VM dataset and add a larger recordsize to reduce overhead. It was sold as “safe because the hypervisor already caches writes.”
For a week it looked great. Latency graphs smoothed out. People congratulated the change ticket. Then a host crashed hard—power event plus a messy reboot. Several VMs came back with filesystem errors. One had a database that would not start cleanly.
The uncomfortable lesson: sync=disabled doesn’t mean “slightly less safe.” It means “ZFS is allowed to lie about durability.” For VM images and databases, that lie eventually becomes an incident, and it always happens at the least convenient time.
They rolled the dataset back to sync=standard, added proper mirrored SLOG devices with power-loss protection, and tuned the system the boring way: right-sized volblocksize for zvols, used mirrors for latency-sensitive tiers, and kept parity for capacity tiers.
Joke #2: Disabling sync is like removing smoke detectors because they keep beeping—quiet house, exciting future.
Mini-story 3: The boring but correct practice that saved the day
A finance org with strict audit requirements ran ZFS for a mix of file services and a transactional system. They weren’t fancy. They were disciplined. Every pool change required a performance baseline and a rollback plan. Every quarter they tested restore and failover. Every month they reviewed pool health and scrub results.
One morning, a parity pool started showing higher write latency and occasional checksum errors. Nothing was “down,” but the team treated it as a leading indicator, not background noise. They checked zpool status, saw a disk with rising errors, and replaced it in a controlled window.
During the resilver, they throttled batch jobs and shifted the transactional workload to a mirror-based pool they kept specifically for “hot” datasets. Because it was planned, not improvised, the business barely noticed.
After the incident, the team had the kind of postmortem that SREs like: short, calm, and mostly about what went right. The boring practice—scrubs, baselines, and workload segregation—was the entire reason it didn’t become a headline.
Common mistakes: symptoms → root cause → fix
Mistake 1: “RAIDZ has more disks, so it must have more IOPS”
Symptoms: p99 write latency spikes, low MB/s but high %util on disks, DB commits stalling.
Root cause: One RAIDZ vdev is one random-I/O lane; small writes trigger partial-stripe overhead and RMW patterns.
Fix: Use multiple mirrored vdevs for random-write tiers. Keep RAIDZ for capacity/streaming. If stuck with RAIDZ, add special vdev and ensure sync/SLOG is correct.
Mistake 2: Pool too full for parity geometry to stay happy
Symptoms: Gradual performance decline, allocator CPU time, increasing txg delays, fragmentation rising.
Root cause: High pool utilization reduces allocation choices; fragmentation increases, partial-stripe writes become more common.
Fix: Keep pools under ~80% for heavy random writes. Add capacity (preferably new vdevs), or migrate to a fresh pool.
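The “add new vdevs” option is one command; the device names are placeholders and fastpool is the hypothetical pool of mirrors from earlier. Remember that ZFS does not rebalance existing data, so only new writes benefit immediately:
cr0x@server:~$ sudo zpool add fastpool mirror sdg sdh   # new top-level vdev: more free space, one more random-I/O lane
If the existing pool is RAIDZ rather than mirrors, zpool add will flag the mismatched replication level and make you confirm the mix explicitly.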
Mistake 3: No SLOG for sync-heavy workloads (or using a cheap SSD as SLOG)
Symptoms: fsync latency is awful, DB WAL stalls, VM guests report disk flush latency.
Root cause: ZIL writes land on main pool or on a log device with poor latency or no power-loss protection.
Fix: Add mirrored, power-loss-protected SLOG for sync workloads. Validate with zpool iostat -v and device latency metrics.
Mistake 4: Wrong ashift
Symptoms: Chronic poor small I/O performance regardless of tuning; write amplification feels “mysterious.”
Root cause: ashift too small causes read-modify-write at the device level due to sector misalignment.
Fix: Rebuild/migrate vdevs with correct ashift. There is no magic sysctl to undo it.
Mistake 5: Treating recordsize as a universal “performance knob”
Symptoms: Some workloads improve, others get worse; metadata overhead spikes; backups slow down.
Root cause: recordsize affects on-disk layout and metadata behavior; shrinking it everywhere increases overhead.
Fix: Tune per dataset. DB datafiles might want 8K/16K; media archives want 1M. Measure with realistic load.
Mistake 6: Ignoring metadata as a first-class workload
Symptoms: “But we’re only writing a little data” while latency is awful; directory operations slow; VM snapshot operations slow.
Root cause: Metadata I/O dominates; parity vdevs handle metadata updates poorly under churn.
Fix: Use special vdev (mirrored SSD) for metadata and possibly small blocks; keep it redundant; monitor it like it’s production data (because it is).
Checklists / step-by-step plan
Design checklist: choosing mirrors vs parity for a new pool
- If it’s databases, VM storage, or anything sync-heavy: default to multiple mirror vdevs.
- If it’s backups, media, logs, analytics dumps: RAIDZ is fine; optimize for capacity.
- Count vdevs, not disks: random IOPS scale with top-level vdevs. Plan enough mirror vdevs to hit target IOPS with headroom.
- Plan the sync story: either accept latency on pool disks, or add a proper mirrored SLOG. Don’t “solve” it with sync=disabled.
- Plan metadata: if the workload is metadata-heavy, budget for a special vdev mirror.
- Keep space headroom: design for staying below ~80% used if random writes matter.
- Pick ashift deliberately: assume 4K sectors at minimum; err on the side of ashift=12 unless you have a reason.
- Decide on block size per dataset: set recordsize/volblocksize to match workload, not your mood.
Migration plan: moving from RAIDZ to mirrors without drama
- Build the new pool (mirrors, correct ashift, SSD tiering if needed) alongside the old pool.
- Set dataset properties on the new pool before copying (recordsize, compression, atime, sync, special_small_blocks).
- Use ZFS send/receive for filesystems (a sketch follows this list); for zvols, plan guest downtime or a replication strategy consistent with your hypervisor.
- Run a parallel performance baseline before cutover: measure fsync latency, random write IOPS, p99.
- Cut over with rollback: keep old pool read-only for a window; monitor errors and performance.
- After cutover re-check fragmentation and pool fill rate; fix the next bottleneck (often network or CPU).
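The send/receive sketch referenced above, assuming a filesystem dataset, a destination pool named newpool, and made-up snapshot names:
cr0x@server:~$ sudo zfs snapshot -r tank/db@migrate-base
cr0x@server:~$ sudo zfs send -R tank/db@migrate-base | sudo zfs receive -u newpool/db          # bulk copy while the source stays live
cr0x@server:~$ sudo zfs snapshot -r tank/db@migrate-final                                      # taken after quiescing writers at cutover
cr0x@server:~$ sudo zfs send -R -i @migrate-base tank/db@migrate-final | sudo zfs receive -u newpool/db   # small incremental to close the gap
Because -R carries source properties, use zfs receive -x or -o if you want the destination’s pre-set properties (recordsize, compression, and so on) to win.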
Operational checklist: keeping small random write performance from decaying
- Monitor pool utilization and fragmentation trends monthly.
- Scrub on schedule; treat checksum errors as urgent, not cosmetic.
- Track latency, not just throughput. p99 write latency is the truth.
- Keep firmware consistent across SSDs used for SLOG/special vdev.
- Do controlled load tests after major changes (kernel, ZFS version, drive replacements).
FAQ
1) Is RAIDZ always slower than mirrors?
No. For large sequential reads/writes and capacity-focused tiers, RAIDZ can be excellent. The pain is specifically small random writes and sync-heavy patterns.
2) Why do mirrors “scale” better for random I/O?
Because each top-level vdev is a scheduling unit. Many mirror vdevs give ZFS more independent lanes for random operations. A single wide RAIDZ vdev is still one lane.
3) If I add more disks to RAIDZ, do I get more IOPS?
You get more aggregate bandwidth and sometimes better concurrency, but you also widen stripes and increase parity coordination. For small random writes, it rarely scales the way you want.
4) Can a fast SLOG make RAIDZ good at small random writes?
A SLOG helps sync write latency by accelerating ZIL commits. It does not eliminate parity overhead for the main data writes that still land in the pool during txg flushes.
5) Should I set sync=disabled for performance?
Only for data you can lose without regret. For databases and VM images, it’s an outage waiting for a power event. Use proper SLOG and mirrors instead.
6) Does a special vdev replace the need for mirrors?
No. A special vdev can drastically improve metadata and small-block performance, which can make RAIDZ feel less awful under certain workloads. It doesn’t change parity math for normal data blocks.
7) Is SSD RAIDZ fine for small random writes?
Better than HDD RAIDZ, but still often behind multiple mirrors for tail latency, especially under sync-heavy or fragmented workloads. SSDs mask the problem; they don’t remove it.
8) What dataset settings most commonly improve small random write behavior?
compression=lz4, atime=off for DB/VM datasets, workload-appropriate recordsize/volblocksize, and proper sync/SLOG strategy. Special vdev for metadata-heavy workloads is a big lever.
9) What’s the most reliable way to prove parity is the bottleneck?
Correlate application latency spikes with zpool iostat -v, disk iostat -x await/util, and ZFS throttling indicators like dmu_tx_delay. If the pool disks are busy with high await and low MB/s, you’re IOPS/latency-bound.
10) What’s the one quote you want SREs to remember here?
“Hope is not a strategy.”
— General Gordon R. Sullivan
Next steps you can execute this week
- Classify your datasets: which ones are latency-sensitive (DB/VM), which ones are throughput/capacity (backups, archives).
- Run the fast diagnosis playbook during a real incident window: capture zpool iostat -v and iostat -x when users complain.
- Fix the cheap wins: atime=off where appropriate, verify compression=lz4, confirm sync policy and SLOG health.
- If you’re on parity for hot workloads: decide whether you’re going to (a) add a mirror-based “hot pool,” (b) add special vdev and SSD tiering, or (c) redesign to mirrors outright.
- Plan the migration like an operations change, not a weekend hobby: baseline, replicate, cut over with rollback.
If your workload is small random writes and your business cares about latency, mirrors are not a luxury. They’re the sensible default. Parity is for when capacity efficiency matters more than tail latency. Pick deliberately, and ZFS will do what it does best: keep your data correct while you sleep.