The first time ZFS hurts you, it’s usually not because it’s “unstable.” It’s because you treated it like a generic filesystem,
sprinkled a few “performance tips” from a forum thread, and promoted it to production with all the ceremony of moving a houseplant.
This roadmap is for building ZFS like you’re going to be paged at 03:12, the storage is full, the CEO is on the demo Wi‑Fi,
and you have to diagnose the bottleneck before your coffee cools down.
ZFS mental model: what you’re actually building
ZFS is not “a filesystem.” It’s a storage system with opinions. The pool (zpool) is your failure domain and your performance envelope.
Datasets are policy boundaries. Zvols are block devices with sharp edges. The ARC is your best friend until it becomes your most
expensive excuse for “it was fast in staging.”
The most important thing to internalize: ZFS is copy-on-write. It never overwrites blocks in place. That’s how it provides checksumming,
snapshots, and consistent on-disk state without journaling in the traditional sense. It’s also why fragmentation, metadata growth,
and write amplification can show up in surprising places if you don’t shape workloads.
Think in layers:
- vdev: a single redundancy group (mirror, raidz). If a top-level vdev dies, the pool is gone.
- pool: a set of vdevs striped together. Capacity and IOPS are aggregates—until they aren’t.
- dataset: an administrative boundary for properties (compression, recordsize, atime, quotas, reservations).
- snapshot: a point-in-time reference; it’s not a “backup,” it’s a time machine stuck in the same chassis.
- send/receive: how you get real backups, replication, migrations, and regrets into another system.
Your roadmap is mostly about selecting the right vdev geometry, setting dataset properties to match real I/O, and building an operational
rhythm: scrub, monitor, test restore, repeat.
Facts and context that change decisions
Storage engineering gets better when you remember that today’s “best practice” is usually yesterday’s incident report.
Here are a few context points worth keeping in your head:
- ZFS originated at Sun Microsystems in the mid‑2000s as an end-to-end storage system, not a filesystem bolt-on.
- Copy-on-write was a design choice for consistency: power loss during metadata updates shouldn’t require fsck theatrics.
- End-to-end checksumming means ZFS can detect silent corruption even when the disk happily returns the wrong data.
- RAIDZ is not “RAID5/6” in implementation details: it avoids the write hole by design, but pays with parity math and variable stripe behavior.
- Early ZFS had a reputation for RAM hunger; modern implementations are more configurable, but ARC still scales with ambition.
- lz4 compression became the default for a reason: it’s typically “free speed” because fewer bytes hit disk.
- 4K sector alignment (ashift) became a permanent decision: once you create a vdev with a too-small ashift, you can’t fix it in place.
- SLOG and L2ARC were historically oversold as magic performance buttons; in many real systems they do nothing or make it worse.
- OpenZFS became the cross-platform convergence point after the original licensing split; features land at different tempos per OS.
One paraphrased idea from John Allspaw (operations/reliability): reliability comes from enabling learning, not pretending failures won’t happen.
Build your ZFS setup so you can learn fast when it misbehaves.
Stage 0: decide what kind of failure you’re buying
Before commands, decide the three things that actually define your outcome:
failure tolerance, IO profile, and rebuild risk.
People love to talk about raw throughput. Production systems die of tail latency and operational panic.
Mirror vs RAIDZ: pick based on your worst day
- Mirrors: best small random I/O, fastest resilver (especially on large disks), easier future expansion. Costs more capacity.
- RAIDZ1: tempting on paper, frequently regretted on large disks. One disk failure away from a very exciting week.
- RAIDZ2: common default for capacity systems; decent protection, slower small random writes than mirrors.
- RAIDZ3: for very large vdevs and “rebuild windows are terrifying” environments.
If the pool supports latency-sensitive workloads (VMs, databases, CI runners), mirrors are usually the least-wrong answer.
If it’s a mostly-sequential object/archive workload, RAIDZ2 can be a good citizen.
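If you land on the capacity side of that trade, build the RAIDZ2 vdev deliberately at day 0 rather than hoping to convert later. A minimal sketch, assuming a hypothetical capacity pool named tank2 and six placeholder device IDs; substitute real /dev/disk/by-id paths:
cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off tank2 raidz2 \
  /dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_2 /dev/disk/by-id/ata-DISK_SERIAL_3 \
  /dev/disk/by-id/ata-DISK_SERIAL_4 /dev/disk/by-id/ata-DISK_SERIAL_5 /dev/disk/by-id/ata-DISK_SERIAL_6
Six disks, two parity: any two can fail before the vdev does. Rebuilds on large drives still hurt, which is exactly the trade you are buying.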
The “one vdev is one blast radius” rule
If any top-level vdev fails, the pool fails. This is why mixing device classes inside a vdev is a bad hobby.
It’s also why “I’ll just add one more disk later” is not a plan—vdev geometry matters.
Joke #1: ZFS doesn’t lose your data. It just schedules a meeting between your assumptions and physics.
Stage 1: create the first pool (correctly)
This stage is mostly about not baking irreversible mistakes into your pool: wrong devices, wrong ashift, wrong layout.
Treat pool creation like schema design. You don’t “just change it later.”
Task 1: identify disks by stable IDs (not /dev/sdX roulette)
cr0x@server:~$ ls -l /dev/disk/by-id/ | head
total 0
lrwxrwxrwx 1 root root 9 Dec 26 10:11 ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 26 10:11 ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 26 10:11 wwn-0x5000c500abcdef01 -> ../../nvme0n1
lrwxrwxrwx 1 root root 10 Dec 26 10:11 wwn-0x5000c500abcdef02 -> ../../nvme1n1
What it means: you have stable names (ata-*, wwn-*). ZFS will store paths; you want paths that survive reboots and HBA shuffles.
Decision: build vdevs using /dev/disk/by-id (or by-partuuid if you partition).
Task 2: check sector sizes before you commit ashift
cr0x@server:~$ sudo lsblk -d -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC
NAME MODEL SIZE PHY-SEC LOG-SEC
sda SAMSUNG MZ7L31T9HBLT-00A 1.8T 4096 512
sdb SAMSUNG MZ7L31T9HBLT-00A 1.8T 4096 512
nvme0n1 INTEL SSDPE2KX040T8 3.7T 4096 4096
nvme1n1 INTEL SSDPE2KX040T8 3.7T 4096 4096
What it means: some drives are 512e (logical 512, physical 4K). ZFS should use 4K alignment.
Decision: set ashift=12 for 4K. If you have 8K/16K-sector gear, consider higher ashift. Don’t guess.
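If you want a second opinion before committing, smartctl reports sector sizes too. A cross-check sketch, assuming smartmontools is installed; the output line is illustrative for a 512e drive:
cr0x@server:~$ sudo smartctl -i /dev/sda | grep -i 'sector size'
Sector Sizes:     512 bytes logical, 4096 bytes physical
When logical and physical disagree, the physical number is the one ashift should respect.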
Task 3: create a mirror pool with explicit ashift and sane defaults
cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa -O acltype=posixacl tank mirror \
/dev/disk/by-id/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX \
/dev/disk/by-id/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY
What it means: you created tank with a mirrored vdev. Properties apply to the root dataset.
Decision: if this pool will host general files and VMs, mirrors are a defensible start. If you need capacity, design RAIDZ2 now—don’t “migrate later.”
Task 4: verify pool health and layout
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY ONLINE 0 0 0
errors: No known data errors
What it means: no errors; topology matches what you intended.
Decision: if topology looks wrong, destroy and recreate now (zpool destroy tank) before data arrives and politics happen.
Task 5: confirm properties actually stuck
cr0x@server:~$ sudo zfs get -o name,property,value -s local compression,atime,xattr,acltype tank
NAME PROPERTY VALUE
tank compression lz4
tank atime off
tank xattr sa
tank acltype posixacl
What it means: properties are locally set on the root dataset.
Decision: keep “policy” properties local at the dataset level; avoid global tweaks you can’t explain later.
Task 6: set a mountpoint and confirm it mounted
cr0x@server:~$ sudo zfs set mountpoint=/tank tank
cr0x@server:~$ mount | grep tank
tank on /tank type zfs (rw,xattr,posixacl)
What it means: the dataset is mounted at a predictable location.
Decision: keep mountpoints boring and explicit. Nobody wants “where did ZFS mount that?” during a restore.
Stage 2: datasets, properties, and sane defaults
Pools are for hardware. Datasets are for humans. You create datasets to separate policies:
compression choices, recordsize, snapshot schedules, quotas, and “this is allowed to be noisy.”
Dataset layout that survives growth
A practical structure (adjust names to your world):
- tank/sys for OS-ish stuff and configs
- tank/home for humans (quotas help)
- tank/apps for application data
- tank/vm for VM disks (recordsize considerations)
- tank/backup for incoming replication (different snapshot policies)
Task 7: create datasets and apply properties per workload
cr0x@server:~$ sudo zfs create tank/apps
cr0x@server:~$ sudo zfs create tank/home
cr0x@server:~$ sudo zfs create tank/vm
cr0x@server:~$ sudo zfs set compression=lz4 tank/apps tank/home tank/vm
cr0x@server:~$ sudo zfs set atime=off tank/apps tank/vm
cr0x@server:~$ sudo zfs set recordsize=16K tank/vm
What it means: you created separate datasets and tuned recordsize for VM-ish random I/O.
Decision: don’t set recordsize=16K everywhere “for performance.” Use it where it matches I/O (VM disks, some databases).
Task 8: validate dataset properties inheritance and local overrides
cr0x@server:~$ sudo zfs get -r -o name,property,value,source compression,recordsize,atime tank | head -n 20
NAME PROPERTY VALUE SOURCE
tank compression lz4 local
tank recordsize 128K default
tank atime off local
tank/apps compression lz4 local
tank/apps recordsize 128K inherited from tank
tank/apps atime off local
tank/home compression lz4 local
tank/home recordsize 128K inherited from tank
tank/home atime off inherited from tank
tank/vm compression lz4 local
tank/vm recordsize 16K local
tank/vm atime off local
What it means: you can see inheritance and what you intentionally overrode.
Decision: keep overrides sparse. If everything is overridden, nothing is explainable.
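When an override stops earning its keep, revert it instead of re-setting it to match the parent; the source column then tells the truth about intent. A minimal sketch, assuming you decide tank/home should simply follow the pool-level compression policy:
cr0x@server:~$ sudo zfs inherit compression tank/home
cr0x@server:~$ sudo zfs get -o name,property,value,source compression tank/home
NAME       PROPERTY     VALUE  SOURCE
tank/home  compression  lz4    inherited from tank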
Task 9: use quotas and reservations to prevent noisy neighbors
cr0x@server:~$ sudo zfs set quota=500G tank/home
cr0x@server:~$ sudo zfs set reservation=200G tank/apps
cr0x@server:~$ sudo zfs get -o name,property,value quota tank/home
NAME PROPERTY VALUE
tank/home quota 500G
What it means: tank/home cannot grow past 500G; tank/apps keeps 200G reserved.
Decision: quotas stop runaway growth; reservations keep critical workloads from being squeezed by “temporary” logs.
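One accounting wrinkle: quota and reservation count snapshots and descendant datasets, while refquota and refreservation count only the dataset's own live data. A sketch of the alternative, assuming you want humans limited by what they can see rather than by snapshot history:
cr0x@server:~$ sudo zfs set refquota=500G tank/home
cr0x@server:~$ sudo zfs set refreservation=200G tank/apps
Pick one model per dataset and write down why; mixing them casually turns space accounting into a debugging exercise.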
Stage 3: performance tuning you can defend
ZFS performance tuning is 30% settings and 70% not lying to yourself about your workload.
Start by measuring. Then do the simplest thing that addresses the bottleneck.
“Tune everything” is how you create a system that only one person can operate—and that person is on vacation.
ARC, memory, and why “more RAM” is both true and lazy
ARC caches reads and metadata. It can mask slow disks and it can also compete with applications for memory.
If you’re running databases or hypervisors, you need to consciously decide where caching lives:
in the app, in the OS page cache, in ARC, or in a dedicated tier.
Task 10: inspect ARC and memory pressure (Linux example)
cr0x@server:~$ grep -E 'c_max|c |size|hits|misses' /proc/spl/kstat/zfs/arcstats | head
c_max 4 34359738368
c 4 26843545600
size 4 25769803776
hits 4 182736451
misses 4 24372611
What it means: ARC is ~24–25GiB, target is ~25GiB, max is 32GiB; hits vs misses tells you if cache is helping.
Decision: if ARC is huge and apps are swapping, cap ARC. If misses are high and disks are busy, more ARC might help.
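Capping ARC on Linux is a kernel module parameter, not a dataset property. A minimal sketch, assuming OpenZFS on Linux and an example cap of 16 GiB (17179869184 bytes); the first command takes effect at runtime, the second persists it across reboots:
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
17179869184
cr0x@server:~$ echo 'options zfs zfs_arc_max=17179869184' | sudo tee /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184
Depending on the distro you may also need to regenerate the initramfs for the boot-time setting to apply, and a live ARC may only shrink down to the new cap as memory pressure arrives.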
Recordsize: the quiet kingmaker
recordsize is for filesystems (datasets). It’s the maximum block size ZFS will use for file data.
Large recordsize is great for sequential reads and compression ratio. Small recordsize reduces read-modify-write overhead for small random I/O.
But too small recordsize can increase metadata overhead and fragmentation.
Zvols: when you want a block device and also want to suffer a little
Zvols can be fine for iSCSI or VM backends, but they require extra discipline: set volblocksize at creation time,
align guest partitions, and monitor write amplification. Don’t casually change block sizes after the fact—you can’t.
Task 11: create a zvol with an intentional volblocksize
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K -o compression=lz4 tank/vm/vm-001
cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/vm/vm-001
NAME PROPERTY VALUE
tank/vm/vm-001 volblocksize 16K
What it means: a 200G zvol backed by ZFS, with 16K blocks.
Decision: match volblocksize to expected I/O (often 8K–16K for many VM patterns). Don’t default blindly.
SLOG and sync writes: the part where people spend money and still lose
A SLOG device only helps synchronous writes. If your workload is mostly async, it won’t move the needle.
If your workload is sync-heavy (databases with fsync, NFS with sync, VM journaling), SLOG can reduce latency and protect intent logs on fast media.
But a bad SLOG (no power-loss protection) can turn “performance upgrade” into “mystery corruption story.”
Task 12: check whether your workload is actually issuing sync writes
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.02T 650G 210 1800 42.1M 155M
mirror-0 1.02T 650G 210 1800 42.1M 155M
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX - - 105 900 21.0M 77.5M
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY - - 105 900 21.1M 77.5M
What it means: you see write-heavy activity. This alone doesn’t prove sync vs async, but it tells you where load sits.
Decision: if latency-sensitive clients complain during sync-heavy operations, investigate sync property and add SLOG only if justified.
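If the evidence does point at sync latency and you have power-loss-protected devices available, adding a SLOG is one command. A sketch with placeholder NVMe device IDs; mirror the log so a dead log device during a crash window does not become its own incident:
cr0x@server:~$ sudo zpool add tank log mirror \
  /dev/disk/by-id/nvme-PLP_SSD_SERIAL_A /dev/disk/by-id/nvme-PLP_SSD_SERIAL_B
A log vdev can also be removed later with zpool remove, which makes SLOG one of the few reversible experiments in this roadmap.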
Task 13: inspect sync settings and avoid the “sync=disabled” trap
cr0x@server:~$ sudo zfs get -o name,property,value,source sync tank tank/apps tank/vm
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank/apps sync standard inherited from tank
tank/vm sync standard inherited from tank
What it means: you’re using normal POSIX semantics.
Decision: keep sync=standard unless you like explaining to auditors why “durable” meant “mostly vibes.”
Task 14: check fragmentation and capacity headroom before blaming ZFS
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,capacity,frag,health
NAME SIZE ALLOC FREE CAPACITY FRAG HEALTH
tank 1.81T 1.02T 650G 61% 18% ONLINE
What it means: 61% full, fragmentation 18%. Not scary.
Decision: if capacity is >80–85% and frag is high, expect performance cliffs. Fix fullness first; tuning comes second.
Compression: usually on, occasionally off
lz4 is the “default adult.” It reduces physical writes and often improves throughput.
Turn compression off only when data is already compressed (some media, some encrypted blobs) and you’ve verified the CPU overhead matters.
Task 15: estimate compression effectiveness from real data
cr0x@server:~$ sudo zfs get -o name,property,value -r compressratio tank/apps | head
NAME PROPERTY VALUE
tank/apps compressratio 1.62x
What it means: you’re saving ~38% space on average, often with fewer disk writes.
Decision: if compressratio is near 1.00x and CPU is a constraint, consider disabling compression for that dataset only.
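If you want better ratios on cold or backup data, OpenZFS 2.0+ also offers zstd at the cost of more CPU. A sketch, assuming a tank/backup dataset exists (see the layout above) and that you have measured CPU headroom:
cr0x@server:~$ sudo zfs set compression=zstd tank/backup
Changing compression only affects newly written blocks; existing data keeps whatever compression it was written with until it is rewritten.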
Stage 4: protection: scrubs, snapshots, replication
ZFS gives you checksums. It does not give you invincibility. Scrubs find latent disk errors. Snapshots give you rollback.
Replication gives you a second copy that doesn’t share your failure domain.
Scrubs: not optional, not a panic button
A scrub reads all data and verifies checksums, repairing from redundancy when possible. It’s how you find a slowly dying drive
before it graduates to “unreadable during resilver.”
Task 16: start a scrub and verify progress
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Fri Dec 26 10:42:01 2025
312G scanned at 3.20G/s, 120G issued at 1.23G/s, 1.02T total
0B repaired, 11.71% done, 0:11:23 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY ONLINE 0 0 0
errors: No known data errors
What it means: scrub is running; it shows scan rate, issued rate, and ETA.
Decision: schedule scrubs (monthly is common). If scrubs take “forever,” investigate disk performance, cabling, and pool layout.
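A scheduled scrub can be as boring as a cron entry. A sketch, assuming cron, a zpool binary at /usr/sbin/zpool, and that your distro does not already ship its own scrub timer (many do; check before stacking two):
cr0x@server:~$ echo '0 3 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank
0 3 1 * * root /usr/sbin/zpool scrub tank
That runs at 03:00 on the first of each month; pick a window where a few hours of extra read load won't collide with your busiest jobs.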
Snapshots: a scalpel, not a landfill
Snapshots are cheap at first. Then you keep them forever, rename datasets three times, and wonder why deletes don’t free space.
Snapshot strategy is a retention policy plus restore testing. Without both, it’s just a directory of false hope.
Task 17: create and list snapshots; interpret space usage
cr0x@server:~$ sudo zfs snapshot tank/apps@pre-upgrade-001
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -s creation | tail -n 3
NAME USED REFER CREATION
tank/apps@pre-upgrade-001 12M 220G Fri Dec 26 10:55 2025
What it means: USED is snapshot-exclusive space (blocks held because of this snapshot); REFER is referenced size.
Decision: if snapshots accumulate and space doesn’t free, inspect snapshot USED and prune by policy, not emotion.
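Before pruning, let ZFS tell you what a deletion would actually reclaim: zfs destroy -n is a dry run and -v prints the snapshots and space involved. A sketch using the snapshot created above:
cr0x@server:~$ sudo zfs destroy -nv tank/apps@pre-upgrade-001
Drop the -n only when the reported reclaim matches what your retention policy expects.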
Replication: the adult version of snapshots
If the pool controller fries, snapshots on that pool are as helpful as a spare key locked inside the same car.
Real protection means send/receive to another machine, another rack, or at minimum another failure domain.
Task 18: run an initial send/receive to a backup pool
cr0x@server:~$ sudo zfs snapshot tank/apps@replica-001
cr0x@server:~$ sudo zfs send -c tank/apps@replica-001 | ssh backup01 sudo zfs receive -uF backup/tank/apps
What it means: you sent a compressed stream (-c) to backup01 and received it into backup/tank/apps, not mounted (-u), force rolling back as needed (-F).
Decision: use automation later, but first do it by hand so you know what “success” looks like and how it fails.
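Subsequent runs are incremental: snapshot again, then send only the delta between the previous snapshot and the new one. A sketch continuing the example above; the @replica-002 name is hypothetical:
cr0x@server:~$ sudo zfs snapshot tank/apps@replica-002
cr0x@server:~$ sudo zfs send -c -i tank/apps@replica-001 tank/apps@replica-002 | ssh backup01 sudo zfs receive -u backup/tank/apps
Keep the common snapshot (@replica-001 here) on both sides until the next increment succeeds, or you are back to full sends.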
Task 19: verify receive-side dataset and last snapshot
cr0x@server:~$ ssh backup01 sudo zfs list -o name,used,avail,refer,mountpoint backup/tank/apps
NAME USED AVAIL REFER MOUNTPOINT
backup/tank/apps 220G 4.10T 220G none
What it means: the backup dataset exists and is not mounted (good for safety).
Decision: keep backup receives unmounted by default. Mount only for restore tests, and then unmount again.
Stage 5: observability and operational guardrails
ZFS is easy to operate when it fails loudly. The nastier failures are quiet: a marginal cable, a drive that times out once a day,
a pool that slowly fills until fragmentation becomes a personality.
Your job is to learn about these before users do.
Task 20: baseline error counters and watch for movement
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 0:26:41 with 0 errors on Fri Dec 26 11:08:49 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX ONLINE 0 0 0
ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY ONLINE 0 0 0
errors: No known data errors
What it means: READ/WRITE/CKSUM are zero. Great baseline.
Decision: alert when they move. One checksum error is “investigate.” A trend is “schedule maintenance.”
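On Linux, the ZFS Event Daemon (zed) is the usual hook for turning these counters into notifications. A sketch of the relevant lines once you've set them in /etc/zfs/zed.d/zed.rc; the variable names come from the stock file shipped with OpenZFS, the address is a placeholder, and the service name can vary by distro:
cr0x@server:~$ sudo grep -E '^ZED_(EMAIL_ADDR|NOTIFY_INTERVAL_SECS|NOTIFY_VERBOSE)' /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="oncall@example.com"
ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1
cr0x@server:~$ sudo systemctl restart zfs-zed
With verbose notifications on you also get mail for healthy scrub completions, which doubles as a cheap heartbeat for the alerting path itself.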
Task 21: check SMART health for the drives in the vdev
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i 'Model|Serial|Reallocated|Pending|CRC|Power_On_Hours'
Model Family: Samsung based SSDs
Serial Number: S4XXXXXXXXX
Power_On_Hours: 18422
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
UDMA_CRC_Error_Count: 2
What it means: CRC errors often point to cabling/backplane/HBA issues, not the NAND itself.
Decision: if CRC increments, reseat/replace cable or move bays before you replace a perfectly good drive.
Task 22: verify autotrim (SSDs) and decide if you want it
cr0x@server:~$ sudo zpool get -o name,property,value autotrim tank
NAME PROPERTY VALUE
tank autotrim off
What it means: autotrim is off. On SSD pools, TRIM can help sustained write performance.
Decision: consider zpool set autotrim=on tank for SSD-based pools after validating your drive firmware behaves.
Task 23: inspect dataset-level write amplification signals (logical vs physical)
cr0x@server:~$ sudo zfs get -o name,property,value logicalused,used tank/vm
NAME PROPERTY VALUE
tank/vm logicalused 380G
tank/vm used 295G
What it means: compression is helping (physical used is lower than logical). If it were reversed, you’d suspect copies, padding, or volblocksize mismatches.
Decision: when logical and used diverge in the wrong direction, re-check dataset properties and workload assumptions.
Task 24: rehearse a restore (the only test that counts)
cr0x@server:~$ ssh backup01 sudo zfs clone backup/tank/apps@replica-001 backup/tank/apps-restore-test
cr0x@server:~$ ssh backup01 sudo zfs set mountpoint=/mnt/restore-test backup/tank/apps-restore-test
cr0x@server:~$ ssh backup01 mount | grep restore-test
backup/tank/apps-restore-test on /mnt/restore-test type zfs (rw,xattr,posixacl)
What it means: you created a writable clone from a snapshot and mounted it.
Decision: schedule restore tests. If you don’t, your first restore will be during an outage, which is a bold choice.
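When the test passes, clean up so the clone doesn't quietly pin the snapshot forever. A short sketch on the backup host:
cr0x@server:~$ ssh backup01 sudo zfs destroy backup/tank/apps-restore-test
Clones keep their origin snapshot alive until they are destroyed or promoted, so leftover test clones are a classic "why won't this space free" surprise.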
Fast diagnosis playbook
When performance tanks, you want a path that converges quickly. Not a week-long interpretive dance with graphs.
This playbook assumes Linux/OpenZFS tooling, but the logic holds elsewhere.
First: is the pool healthy and is anything rebuilding?
- Check: zpool status -v
- Look for: scrub/resilver in progress, DEGRADED vdevs, checksum errors, slow devices
- Decision: if resilvering, expect degraded performance; prioritize finishing rebuild safely over “tuning.”
Second: are you out of space or heavily fragmented?
- Check: zpool list -o size,alloc,free,capacity,frag
- Look for: capacity > 80–85%, frag > ~50% (context-dependent)
- Decision: if full/fragged, free space and delete snapshots (carefully). Don’t chase arcane sysctls first.
Third: what’s the bottleneck: disk, CPU, memory, or sync latency?
- Disk: zpool iostat -v 1 shows one device pegged or much slower than peers.
- CPU: compression/checksum can be CPU-bound on small cores; validate with system CPU tools.
- Memory: ARC thrashing or system swapping: check ARC size and swap activity.
- Sync writes: latency spikes during fsync-heavy workloads; SLOG may help if properly designed.
Fourth: identify the dataset and workload pattern
- Check: which dataset is hot (application logs, VM disks, backup ingest)
- Look for: wrong recordsize for workload, too many snapshots holding space, unexpected sync behavior
- Decision: tune at dataset boundary. Avoid pool-wide changes unless you’re fixing a pool-wide problem.
Three corporate mini-stories (the kind you remember)
Incident: the wrong assumption (512-byte thinking in a 4K world)
A mid-size SaaS company built a new analytics cluster on shiny large HDDs behind a reputable HBA. The architect used ZFS because of checksums
and snapshots, and because the old storage stack had the personality of wet cardboard. Pool creation was scripted. It “worked.”
Six months later, write latency crept up. Not catastrophically—just enough that batch jobs missed their window. Then resilver time on a single disk
replacement turned into a multi-day event. During the resilver, performance fell off a cliff and stayed there. The team assumed “big disk rebuilds are slow”
and accepted the pain as the price of capacity.
Someone finally pulled a full baseline: sector sizes, ashift, and real physical alignment. The pool was built with ashift=9 because the drives
reported 512 logical sectors and nobody checked physical sector size. Every write got translated into a read-modify-write cycle on the drive.
ZFS was doing what it was told; the drives were doing what physics required.
They migrated data to a new pool with ashift=12. Performance normalized. Resilvers got dramatically faster.
The incident report was painfully simple: “We assumed the disk told the truth.” The corrective action was also simple:
“We will check PHY-SEC and set ashift explicitly.” The lesson: ZFS will faithfully preserve your mistakes.
Optimization that backfired: the “sync=disabled” era
A different company ran a VM farm on ZFS mirrors. Developers complained about occasional latency spikes during peak deploy hours.
Someone googled. Someone found the setting. Someone said, “We don’t need synchronous writes; we have a UPS.”
sync=disabled was applied at the dataset level for VM storage.
The spikes went away. Tickets closed. High-fives were exchanged in the shared Slack channel where optimism goes to die.
Two months later, a host rebooted unexpectedly after a kernel panic. The UPS was fine. The disks were fine. The VMs were not fine.
A handful came back with corrupted filesystems. Not all. Just enough to make the incident feel like a haunting.
The postmortem was grim but clean: synchronous semantics were explicitly disabled, so acknowledged writes weren’t necessarily durable.
The crash happened in a window where several guests believed their data was on stable storage. It wasn’t. ZFS did exactly what it was configured to do.
They reverted to sync=standard, measured again, and solved the real problem: a saturated write path plus poor queueing during deploy storms.
They added capacity and smoothed I/O bursts. The moral is not “never optimize.” It’s “optimize with a rollback plan and a clear definition of correctness.”
Joke #2: Disabling sync writes is like removing your smoke detector because it’s loud. Quieter, yes. Smarter, no.
Boring but correct practice that saved the day: monthly scrubs and alert hygiene
A financial services team ran a modest ZFS-backed file service. Nothing fancy: mirrored vdevs, lz4 compression, conservative dataset policies.
They had a habit that nobody bragged about: monthly scrubs, and alerts that fired on new checksum errors or degraded vdevs.
The on-call rotation hated many things, but not that.
One Thursday afternoon, an alert fired: a handful of checksum errors on one disk, then more. The pool stayed ONLINE. Users noticed nothing.
The engineer on duty didn’t “wait and see.” They checked SMART, saw CRC errors increasing, and suspected a cable or bay.
They scheduled a maintenance window and moved the drive to another slot. CRC errors stopped.
Two weeks later, a different disk started throwing real media errors, and ZFS repaired them during a scrub. The team replaced that disk during business hours.
No emergency. No extended outage. The system remained boring.
The secret wasn’t genius. It was a loop: scrub regularly, alert early, treat small error counters as smoke, and validate the path (cables, HBAs, firmware),
not just the drive. In storage, boring is a feature you can ship.
Common mistakes: symptoms → root cause → fix
1) “Deletes don’t free space”
- Symptoms: application deletes data, but pool usage stays flat; df doesn’t budge.
- Root cause: snapshots retain referenced blocks; sometimes clones do too.
- Fix: list snapshots by used space and prune by policy.
cr0x@server:~$ sudo zfs list -t snapshot -o name,used -s used | tail
tank/apps@daily-2025-12-20 18.2G
tank/apps@daily-2025-12-21 21.4G
tank/apps@daily-2025-12-22 25.7G
2) “Random I/O is awful on RAIDZ”
- Symptoms: VM latency spikes; IOPS lower than expected; writes feel “sticky.”
- Root cause: RAIDZ parity overhead plus small random writes; recordsize mismatch; pool too full.
- Fix: mirrors for latency-critical workloads, or separate RAIDZ for capacity; tune recordsize on the hot dataset; keep capacity headroom.
3) “Scrub takes forever and the system crawls”
- Symptoms: scrubs run for days; services slow down; iostat shows low throughput.
- Root cause: slow or failing disk, bad HBA/cabling, SMR drives in disguise, or heavy concurrent workload.
- Fix: identify slow device with zpool iostat -v; validate SMART; replace problem hardware; schedule scrubs off-peak.
4) “We added an L2ARC and nothing got faster”
- Symptoms: bought SSD cache; latency unchanged; ARC stats look similar.
- Root cause: workload isn’t read-cacheable, or L2ARC is too small/slow, or system is CPU/memory bound.
- Fix: measure cache hit rates; prioritize RAM/ARC and better vdev layout before adding L2ARC.
5) “Resilver is dangerously slow”
- Symptoms: disk replacement takes a long time; performance during resilver is terrible.
- Root cause: large HDDs, RAIDZ geometry, high pool utilization, SMR behavior, or a sick disk dragging the vdev.
- Fix: prefer mirrors for fast rebuild; keep pool below ~80%; replace suspect disks proactively; don’t mix slow/fast devices in a vdev.
6) “Checksum errors appear but disks ‘test fine’”
- Symptoms: zpool status shows CKSUM errors; SMART looks normal.
- Root cause: cabling/backplane/HBA/firmware issues; transient transport errors.
- Fix: check SMART CRC counters; reseat/replace cables; try different bays; update HBA firmware; then clear errors and monitor.
7) “We can’t expand our RAIDZ vdev the way we expected”
- Symptoms: added one disk, capacity barely changed or expansion isn’t possible.
- Root cause: top-level vdev geometry is fixed; you expand by adding whole vdevs (or via newer expansion features depending on platform/version, with constraints).
- Fix: plan vdev width at day 0; for growth, add another vdev of similar performance class; avoid Frankenstein pools.
Checklists / step-by-step plan
Plan A: first pool to “safe enough” production in 10 steps
- Inventory hardware: confirm drive type, sector sizes, HBA model, and whether you have power-loss protection on SSDs.
- Choose topology: mirrors for latency, RAIDZ2/3 for capacity; avoid RAIDZ1 on large disks unless you enjoy gambling.
- Name devices sanely: use /dev/disk/by-id paths; document bay mapping.
- Create pool explicitly: set ashift and root dataset properties (compression, atime, xattr).
- Create datasets per workload: apps vs VMs vs users; set recordsize and sync policy intentionally.
- Set capacity guardrails: quotas for “humans,” reservations for “must not fail.”
- Scrub schedule: monthly baseline; more often if drives are suspect or environment is harsh.
- Snapshot policy: e.g., hourly for 24h, daily for 30d, monthly for 12m—tune to business needs and storage budget.
- Replication: send/receive to a different system; test restores by cloning snapshots.
- Monitoring: alerts on DEGRADED, checksum errors, rising SMART CRC/realloc/pending, pool capacity thresholds, and scrub failures.
Plan B: migrate from “it exists” to “it’s operable” without downtime fantasies
- Stop making pool-wide tweaks. Start measuring and documenting current state: zpool status, zpool list, zfs get all (filtered).
- Split datasets by workload so you can tune and snapshot independently.
- Implement a retention policy and prune snapshots that no longer serve recovery goals.
- Set up replication and run a restore test. Prove it to yourself with a mounted clone.
- Plan the “irreversible fixes” (wrong ashift, wrong topology) as a migration to a new pool. There is no magic toggle.
Operational cadence (what you do every week/month/quarter)
- Weekly: review alerts, check for new error counters, confirm snapshot jobs are running, validate capacity projections.
- Monthly: scrub, review scrub duration trends, verify at least one restore test from replication.
- Quarterly: rehearse a “disk failure + restore” scenario, review dataset properties against workload changes, validate firmware baselines.
FAQ
1) Should I choose mirrors or RAIDZ for VM storage?
Mirrors, unless you have a strong reason and a tested workload profile. VMs tend to do small random I/O and punish RAIDZ parity overhead.
If you must use RAIDZ, keep vdev widths reasonable, maintain headroom, and tune recordsize for VM datasets.
2) Is RAIDZ1 ever acceptable?
On small disks and non-critical data, maybe. On large modern disks, RAIDZ1 increases the risk that a second issue during resilver takes the pool down.
If you can’t tolerate downtime and restore time, don’t run single-parity.
3) What compression should I use?
lz4 for almost everything. Turn it off only for datasets where data is already compressed or encrypted and you’ve measured CPU impact.
4) How full can I let a pool get?
Try to stay below ~80% for healthy performance, especially on RAIDZ and mixed workloads. Above that, fragmentation and allocation behavior can
increase latency. The exact cliff depends on workload, but “we ran it to 95%” is a familiar pre-outage sentence.
5) Do I need a SLOG?
Only if you have significant synchronous writes and you care about their latency. If your workload is mostly async, a SLOG won’t help.
If you do add one, use high-endurance devices with power-loss protection. Cheap consumer SSDs are not a journal device; they are a surprise generator.
6) Do I need L2ARC?
Usually no as a first move. Start with RAM (ARC) and correct pool topology. L2ARC can help read-heavy workloads with working sets larger than RAM,
but it also consumes memory for metadata and can add complexity.
7) Can I change ashift after creating the pool?
No. Not in place. You fix wrong ashift by migrating data to a new pool created correctly. This is why sector-size validation is a day-0 task.
8) How do I know if snapshots are the reason space won’t free?
List snapshots and sort by used. If big snapshots exist, they’re holding blocks alive. Delete the snapshots (carefully, in policy order),
then monitor space changes. Also check for clones.
9) Is ZFS “a backup” because it has snapshots?
No. Snapshots are local recovery points. Backups require separate failure domains. Use send/receive replication (or another backup system) and test restores.
10) What’s the single best habit to avoid regrets?
Do restore tests from replication on a schedule. Everything else is just probability management; restore testing is truth.
Conclusion: practical next steps
If you want “production without regret,” you don’t chase exotic flags. You make three good structural decisions (topology, ashift, dataset boundaries),
then you run a boring operational loop (scrub, snapshot, replicate, test restore, monitor errors).
Next steps you can do this week:
- Write down your pool topology and failure tolerance in one paragraph. If you can’t, you don’t have a design yet.
- Create datasets for your top three workloads and set properties intentionally (compression, recordsize, atime, quotas).
- Run a scrub and record how long it took. That duration trend will become a health signal.
- Set up replication with send/receive to another system and run one restore test via clone-and-mount.
- Turn your baseline checks into alerts: pool health, capacity, checksum errors, and SMART transport/media indicators.
ZFS will give you data integrity and operational leverage. It will also happily preserve every bad assumption you feed it.
Choose your assumptions like you’re the one who gets paged. Because you are.