ZFS Master Roadmap: From First Pool to Production Without Regret

The first time ZFS hurts you, it’s usually not because it’s “unstable.” It’s because you treated it like a generic filesystem,
sprinkled a few “performance tips” from a forum thread, and promoted it to production with all the ceremony of moving a houseplant.

This roadmap is for building ZFS like you’re going to be paged at 03:12, the storage is full, the CEO is on the demo Wi‑Fi,
and you have to diagnose the bottleneck before your coffee cools down.

ZFS mental model: what you’re actually building

ZFS is not “a filesystem.” It’s a storage system with opinions. The pool (zpool) is your failure domain and your performance envelope.
Datasets are policy boundaries. Zvols are block devices with sharp edges. The ARC is your best friend until it becomes your most
expensive excuse for “it was fast in staging.”

The most important thing to internalize: ZFS is copy-on-write. It never overwrites blocks in place. That’s how it provides checksumming,
snapshots, and consistent on-disk state without journaling in the traditional sense. It’s also why fragmentation, metadata growth,
and write amplification can show up in surprising places if you don’t shape workloads.

Think in layers:

  • vdev: a single redundancy group (mirror, raidz). If a vdev dies, the pool is gone.
  • pool: a set of vdevs striped together. Capacity and IOPS are aggregates—until they aren’t.
  • dataset: an administrative boundary for properties (compression, recordsize, atime, quotas, reservations).
  • snapshot: a point-in-time reference; it’s not a “backup,” it’s a time machine stuck in the same chassis.
  • send/receive: how you get real backups, replication, migrations, and regrets into another system.

Your roadmap is mostly about selecting the right vdev geometry, setting dataset properties to match real I/O, and building an operational
rhythm: scrub, monitor, test restore, repeat.

Facts and context that change decisions

Storage engineering gets better when you remember that today’s “best practice” is usually yesterday’s incident report.
Here are a few context points worth keeping in your head:

  1. ZFS originated at Sun Microsystems in the mid‑2000s as an end-to-end storage system, not a filesystem bolt-on.
  2. Copy-on-write was a design choice for consistency: power loss during metadata updates shouldn’t require fsck theatrics.
  3. End-to-end checksumming means ZFS can detect silent corruption even when the disk happily returns the wrong data.
  4. RAIDZ is not “RAID5/6” in implementation details: it avoids the write hole by design, but pays with parity math and
    variable stripe behavior.
  5. Early ZFS had a reputation for RAM hunger; modern implementations are more configurable, but ARC still scales with ambition.
  6. lz4 compression became the default for a reason: it’s typically “free speed” because fewer bytes hit disk.
  7. 4K sector alignment (ashift) became a permanent decision: once you create a vdev with a too-small ashift, you can’t
    fix it in place.
  8. SLOG and L2ARC were historically oversold as magic performance buttons; in many real systems they do nothing or make it worse.
  9. OpenZFS became the cross-platform convergence point after the original licensing split; features land at different tempos per OS.

One paraphrased idea from John Allspaw (operations/reliability): reliability comes from enabling learning, not from pretending failures won’t happen.
Build your ZFS setup so you can learn fast when it misbehaves.

Stage 0: decide what kind of failure you’re buying

Before commands, decide the three things that actually define your outcome:
failure tolerance, I/O profile, and rebuild risk.
People love to talk about raw throughput. Production systems die of tail latency and operational panic.

Mirror vs RAIDZ: pick based on your worst day

  • Mirrors: best small random I/O, fastest resilver (especially on large disks), easier future expansion. Costs more capacity.
  • RAIDZ1: tempting on paper, frequently regretted on large disks. One disk failure away from a very exciting week.
  • RAIDZ2: common default for capacity systems; decent protection, slower small random writes than mirrors.
  • RAIDZ3: for very large vdevs and “rebuild windows are terrifying” environments.

If the pool supports latency-sensitive workloads (VMs, databases, CI runners), mirrors are usually the least-wrong answer.
If it’s a mostly-sequential object/archive workload, RAIDZ2 can be a good citizen.
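
For the capacity case, here is a minimal sketch of creating a RAIDZ2 pool, assuming six disks, a hypothetical pool name (archive), and placeholder by-id paths you would replace with real ones; the property choices mirror Task 3 below.

cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off archive raidz2 \
/dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_2 /dev/disk/by-id/ata-DISK_SERIAL_3 \
/dev/disk/by-id/ata-DISK_SERIAL_4 /dev/disk/by-id/ata-DISK_SERIAL_5 /dev/disk/by-id/ata-DISK_SERIAL_6

The vdev width and parity level are fixed at creation. That is why this choice belongs in Stage 0 and not in a change ticket six months from now.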

The “one vdev is one blast radius” rule

If any top-level vdev fails, the pool fails. This is why mixing device classes inside a vdev is a bad hobby.
It’s also why “I’ll just add one more disk later” is not a plan—vdev geometry matters.
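
When you do grow a pool, you grow it by adding another whole vdev, not by bolting a disk onto an existing one. A minimal sketch, assuming two new disks with placeholder by-id names:

cr0x@server:~$ sudo zpool add tank mirror /dev/disk/by-id/ata-NEW_DISK_1 /dev/disk/by-id/ata-NEW_DISK_2

Treat zpool add as effectively permanent. Top-level vdev removal exists on newer OpenZFS for some layouts, but it comes with caveats, so read the command twice before pressing Enter.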

Joke #1: ZFS doesn’t lose your data. It just schedules a meeting between your assumptions and physics.

Stage 1: create the first pool (correctly)

This stage is mostly about not baking irreversible mistakes into your pool: wrong devices, wrong ashift, wrong layout.
Treat pool creation like schema design. You don’t “just change it later.”

Task 1: identify disks by stable IDs (not /dev/sdX roulette)

cr0x@server:~$ ls -l /dev/disk/by-id/ | head
total 0
lrwxrwxrwx 1 root root  9 Dec 26 10:11 ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 26 10:11 ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 26 10:11 wwn-0x5000c500abcdef01 -> ../../nvme0n1
lrwxrwxrwx 1 root root 10 Dec 26 10:11 wwn-0x5000c500abcdef02 -> ../../nvme1n1

What it means: you have stable names (ata-*, wwn-*). ZFS will store paths; you want paths that survive reboots and HBA shuffles.
Decision: build vdevs using /dev/disk/by-id (or by-partuuid if you partition).

Task 2: check sector sizes before you commit ashift

cr0x@server:~$ sudo lsblk -d -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC
NAME    MODEL                     SIZE PHY-SEC LOG-SEC
sda     SAMSUNG MZ7L31T9HBLT-00A  1.8T    4096     512
sdb     SAMSUNG MZ7L31T9HBLT-00A  1.8T    4096     512
nvme0n1 INTEL SSDPE2KX040T8       3.7T    4096    4096
nvme1n1 INTEL SSDPE2KX040T8       3.7T    4096    4096

What it means: some drives are 512e (logical 512, physical 4K). ZFS should use 4K alignment.
Decision: set ashift=12 for 4K. If you have 8K/16K-sector gear, consider higher ashift. Don’t guess.

Task 3: create a mirror pool with explicit ashift and sane defaults

cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa -O acltype=posixacl tank mirror \
/dev/disk/by-id/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX \
/dev/disk/by-id/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY

What it means: you created tank with a mirrored vdev. Properties apply to the root dataset.
Decision: if this pool will host general files and VMs, mirrors are a defensible start. If you need capacity, design RAIDZ2 now—don’t “migrate later.”

Task 4: verify pool health and layout

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                                                       STATE     READ WRITE CKSUM
        tank                                                       ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX             ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY             ONLINE       0     0     0

errors: No known data errors

What it means: no errors; topology matches what you intended.
Decision: if topology looks wrong, destroy and recreate now (zpool destroy tank) before data arrives and politics happen.

Task 5: confirm properties actually stuck

cr0x@server:~$ sudo zfs get -o name,property,value -s local compression,atime,xattr,acltype tank
NAME  PROPERTY     VALUE
tank  compression  lz4
tank  atime        off
tank  xattr        sa
tank  acltype      posixacl

What it means: properties are locally set on the root dataset.
Decision: keep “policy” properties local at the dataset level; avoid global tweaks you can’t explain later.

Task 6: set a mountpoint and confirm it mounted

cr0x@server:~$ sudo zfs set mountpoint=/tank tank
cr0x@server:~$ mount | grep tank
tank on /tank type zfs (rw,xattr,posixacl)

What it means: the dataset is mounted at a predictable location.
Decision: keep mountpoints boring and explicit. Nobody wants “where did ZFS mount that?” during a restore.

Stage 2: datasets, properties, and sane defaults

Pools are for hardware. Datasets are for humans. You create datasets to separate policies:
compression choices, recordsize, snapshot schedules, quotas, and “this is allowed to be noisy.”

Dataset layout that survives growth

A practical structure (adjust names to your world):

  • tank/sys for OS-ish stuff and configs
  • tank/home for humans (quotas help)
  • tank/apps for application data
  • tank/vm for VM disks (recordsize considerations)
  • tank/backup for incoming replication (different snapshot policies)

Task 7: create datasets and apply properties per workload

cr0x@server:~$ sudo zfs create tank/apps
cr0x@server:~$ sudo zfs create tank/home
cr0x@server:~$ sudo zfs create tank/vm
cr0x@server:~$ sudo zfs set compression=lz4 tank/apps tank/home tank/vm
cr0x@server:~$ sudo zfs set atime=off tank/apps tank/vm
cr0x@server:~$ sudo zfs set recordsize=16K tank/vm

What it means: you created separate datasets and tuned recordsize for VM-ish random I/O.
Decision: don’t set recordsize=16K everywhere “for performance.” Use it where it matches I/O (VM disks, some databases).

Task 8: validate dataset properties inheritance and local overrides

cr0x@server:~$ sudo zfs get -r -o name,property,value,source compression,recordsize,atime tank | head -n 20
NAME      PROPERTY     VALUE  SOURCE
tank      compression  lz4    local
tank      recordsize   128K   default
tank      atime        off    local
tank/apps compression  lz4    local
tank/apps recordsize   128K   inherited from tank
tank/apps atime        off    local
tank/home compression  lz4    local
tank/home recordsize   128K   inherited from tank
tank/home atime        off    inherited from tank
tank/vm   compression  lz4    local
tank/vm   recordsize   16K    local
tank/vm   atime        off    local

What it means: you can see inheritance and what you intentionally overrode.
Decision: keep overrides sparse. If everything is overridden, nothing is explainable.

Task 9: use quotas and reservations to prevent noisy neighbors

cr0x@server:~$ sudo zfs set quota=500G tank/home
cr0x@server:~$ sudo zfs set reservation=200G tank/apps
cr0x@server:~$ sudo zfs get -o name,property,value quota tank/home
NAME       PROPERTY  VALUE
tank/home  quota     500G

What it means: tank/home cannot grow past 500G; tank/apps keeps 200G reserved.
Decision: quotas stop runaway growth; reservations keep critical workloads from being squeezed by “temporary” logs.

Stage 3: performance tuning you can defend

ZFS performance tuning is 30% settings and 70% not lying to yourself about your workload.
Start by measuring. Then do the simplest thing that addresses the bottleneck.
“Tune everything” is how you create a system that only one person can operate—and that person is on vacation.

ARC, memory, and why “more RAM” is both true and lazy

ARC caches reads and metadata. It can mask slow disks and it can also compete with applications for memory.
If you’re running databases or hypervisors, you need to consciously decide where caching lives:
in the app, in the OS page cache, in ARC, or in a dedicated tier.

Task 10: inspect ARC and memory pressure (Linux example)

cr0x@server:~$ grep -E '^(hits|misses|c|c_max|size)[[:space:]]' /proc/spl/kstat/zfs/arcstats
hits                            4    182736451
misses                          4    24372611
c                               4    26843545600
c_max                           4    34359738368
size                            4    25769803776

What it means: ARC is ~24–25GiB, target is ~25GiB, max is 32GiB; hits vs misses tells you if cache is helping.
Decision: if ARC is huge and apps are swapping, cap ARC. If misses are high and disks are busy, more ARC might help.
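
If you decide to cap ARC on Linux, the knob is the zfs_arc_max module parameter. A minimal sketch, assuming a 16 GiB cap chosen from your own measurements rather than from this example, and assuming /etc/modprobe.d/zfs.conf does not already exist (edit it instead of overwriting if it does):

cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee /etc/modprobe.d/zfs.conf

The first line takes effect at runtime (ARC may take a while to shrink down to it); the second persists across reboots. On some distributions you also need to regenerate the initramfs before the boot-time setting sticks.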

Recordsize: the quiet kingmaker

recordsize is for filesystems (datasets). It’s the maximum block size ZFS will use for file data.
Large recordsize is great for sequential reads and compression ratio. Small recordsize reduces read-modify-write overhead for small random I/O.
But too small recordsize can increase metadata overhead and fragmentation.
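
A minimal sketch of matching recordsize to the workload rather than to folklore, assuming a hypothetical tank/media dataset that holds large sequential files next to the VM dataset from Stage 2:

cr0x@server:~$ sudo zfs set recordsize=1M tank/media
cr0x@server:~$ sudo zfs set recordsize=16K tank/vm

Note that recordsize only applies to blocks written after the change; existing files keep the block size they were written with until they are rewritten.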

Zvols: when you want a block device and also want to suffer a little

Zvols can be fine for iSCSI or VM backends, but they require extra discipline: set volblocksize at creation time,
align guest partitions, and monitor write amplification. Don’t casually change block sizes after the fact—you can’t.

Task 11: create a zvol with an intentional volblocksize

cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K -o compression=lz4 tank/vm/vm-001
cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/vm/vm-001
NAME            PROPERTY      VALUE
tank/vm/vm-001  volblocksize  16K

What it means: a 200G zvol backed by ZFS, with 16K blocks.
Decision: match volblocksize to expected I/O (often 8K–16K for many VM patterns). Don’t default blindly.

SLOG and sync writes: the part where people spend money and still lose

A SLOG device only helps synchronous writes. If your workload is mostly async, it won’t move the needle.
If your workload is sync-heavy (databases with fsync, NFS with sync, VM journaling), SLOG can reduce latency and protect intent logs on fast media.
But a bad SLOG (no power-loss protection) can turn “performance upgrade” into “mystery corruption story.”

Task 12: check whether your workload is actually issuing sync writes

cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.02T   650G    210   1800   42.1M  155M
  mirror-0  1.02T   650G    210   1800   42.1M  155M
    ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX     -      -    105    900   21.0M  77.5M
    ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY     -      -    105    900   21.1M  77.5M

What it means: you see write-heavy activity. This alone doesn’t prove sync vs async, but it tells you where load sits.
Decision: if latency-sensitive clients complain during sync-heavy operations, investigate sync property and add SLOG only if justified.
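
If the investigation does justify a SLOG, a minimal sketch of adding a mirrored one, assuming two power-loss-protected NVMe devices with placeholder by-id names:

cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP_SSD_A /dev/disk/by-id/nvme-PLP_SSD_B

Mirroring the log protects in-flight sync writes if one device dies at the wrong moment. Unlike data vdevs, a log vdev can be removed later with zpool remove if it turns out not to earn its slot.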

Task 13: inspect sync settings and avoid the “sync=disabled” trap

cr0x@server:~$ sudo zfs get -o name,property,value,source sync tank tank/apps tank/vm
NAME      PROPERTY  VALUE  SOURCE
tank      sync      standard  default
tank/apps sync      standard  inherited from tank
tank/vm   sync      standard  inherited from tank

What it means: you’re using normal POSIX semantics.
Decision: keep sync=standard unless you like explaining to auditors why “durable” meant “mostly vibes.”

Task 14: check fragmentation and capacity headroom before blaming ZFS

cr0x@server:~$ sudo zpool list -o name,size,alloc,free,capacity,frag,health
NAME  SIZE  ALLOC  FREE  CAPACITY  FRAG  HEALTH
tank  1.81T 1.02T  650G      61%   18%  ONLINE

What it means: 61% full, fragmentation 18%. Not scary.
Decision: if capacity is >80–85% and frag is high, expect performance cliffs. Fix fullness first; tuning comes second.

Compression: usually on, occasionally off

lz4 is the “default adult.” It reduces physical writes and often improves throughput.
Turn compression off only when data is already compressed (some media, some encrypted blobs) and you’ve verified the CPU overhead matters.

Task 15: estimate compression effectiveness from real data

cr0x@server:~$ sudo zfs get -o name,property,value -r compressratio tank/apps | head
NAME      PROPERTY       VALUE
tank/apps compressratio  1.62x

What it means: you’re saving ~38% space on average, often with fewer disk writes.
Decision: if compressratio is near 1.00x and CPU is a constraint, consider disabling compression for that dataset only.

Stage 4: protection: scrubs, snapshots, replication

ZFS gives you checksums. It does not give you invincibility. Scrubs find latent disk errors. Snapshots give you rollback.
Replication gives you a second copy that doesn’t share your failure domain.

Scrubs: not optional, not a panic button

A scrub reads all data and verifies checksums, repairing from redundancy when possible. It’s how you find a slowly dying drive
before it graduates to “unreadable during resilver.”

Task 16: start a scrub and verify progress

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Fri Dec 26 10:42:01 2025
        312G scanned at 3.20G/s, 120G issued at 1.23G/s, 1.02T total
        0B repaired, 11.71% done, 0:11:23 to go
config:

        NAME                                                       STATE     READ WRITE CKSUM
        tank                                                       ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX             ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY             ONLINE       0     0     0

errors: No known data errors

What it means: scrub is running; it shows scan rate, issued rate, and ETA.
Decision: schedule scrubs (monthly is common). If scrubs take “forever,” investigate disk performance, cabling, and pool layout.
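
A minimal scheduling sketch, assuming plain cron on Linux; many distributions ship their own scrub cron job or systemd timer with the ZFS packages, so check for an existing one before adding a duplicate. Add a line like this to root's crontab (sudo crontab -e), adjusting the zpool path to whatever command -v zpool reports:

# scrub tank at 03:00 on the first day of each month
0 3 1 * * /usr/sbin/zpool scrub tank

Whatever scheduler you use, record the scrub duration somewhere you can trend it. The trend is the health signal.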

Snapshots: a scalpel, not a landfill

Snapshots are cheap at first. Then you keep them forever, rename datasets three times, and wonder why deletes don’t free space.
Snapshot strategy is a retention policy plus restore testing. Without both, it’s just a directory of false hope.

Task 17: create and list snapshots; interpret space usage

cr0x@server:~$ sudo zfs snapshot tank/apps@pre-upgrade-001
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -s creation | tail -n 3
NAME                         USED  REFER  CREATION
tank/apps@pre-upgrade-001     12M  220G   Fri Dec 26 10:55 2025

What it means: USED is snapshot-exclusive space (blocks held because of this snapshot); REFER is referenced size.
Decision: if snapshots accumulate and space doesn’t free, inspect snapshot USED and prune by policy, not emotion.
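
When it is time to prune, a minimal sketch using the snapshot range syntax with a dry run first, assuming hypothetical daily snapshot names:

cr0x@server:~$ sudo zfs destroy -nv tank/apps@daily-2025-11-01%daily-2025-11-30
cr0x@server:~$ sudo zfs destroy -v tank/apps@daily-2025-11-01%daily-2025-11-30

The first command only reports what would be destroyed and how much space would be reclaimed; run the second only after that output matches your retention policy, not your mood.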

Replication: the adult version of snapshots

If the pool controller fries, snapshots on that pool are as helpful as a spare key locked inside the same car.
Real protection means send/receive to another machine, another rack, or at minimum another failure domain.

Task 18: seed a backup pool with a full send/receive

cr0x@server:~$ sudo zfs snapshot tank/apps@replica-001
cr0x@server:~$ sudo zfs send -c tank/apps@replica-001 | ssh backup01 sudo zfs receive -uF backup/tank/apps

What it means: you sent a compressed stream (-c) to backup01 and received it into backup/tank/apps, not mounted (-u), force rolling back as needed (-F).
Decision: use automation later, but first do it by hand so you know what “success” looks like and how it fails.
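
Subsequent runs are incremental: snapshot again and send only the delta since the last snapshot that exists on both sides. A minimal sketch, assuming replica-001 is still present on source and backup and using a hypothetical follow-up snapshot name:

cr0x@server:~$ sudo zfs snapshot tank/apps@replica-002
cr0x@server:~$ sudo zfs send -c -i tank/apps@replica-001 tank/apps@replica-002 | ssh backup01 sudo zfs receive -u backup/tank/apps

If the incremental fails because the destination diverged, that is your replication telling you something. Investigate before reaching for -F.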

Task 19: verify receive-side dataset and last snapshot

cr0x@server:~$ ssh backup01 sudo zfs list -o name,used,avail,refer,mountpoint backup/tank/apps
NAME             USED  AVAIL  REFER  MOUNTPOINT
backup/tank/apps  220G  4.10T  220G  none

What it means: the backup dataset exists and is not mounted (good for safety).
Decision: keep backup receives unmounted by default. Mount only for restore tests, and then unmount again.

Stage 5: observability and operational guardrails

When ZFS fails loudly, you will notice. The nastier failures are quiet: a marginal cable, a drive that times out once a day,
a pool that slowly fills until fragmentation becomes a personality.
Your job is to learn about these before users do.

Task 20: baseline error counters and watch for movement

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0:26:41 with 0 errors on Fri Dec 26 11:08:49 2025
config:

        NAME                                                       STATE     READ WRITE CKSUM
        tank                                                       ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4XXXXXXXXX             ONLINE       0     0     0
            ata-SAMSUNG_MZ7L31T9HBLT-00A07_S4YYYYYYYYY             ONLINE       0     0     0

errors: No known data errors

What it means: READ/WRITE/CKSUM are zero. Great baseline.
Decision: alert when they move. One checksum error is “investigate.” A trend is “schedule maintenance.”

Task 21: check SMART health for the drives in the vdev

cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i 'Model|Serial|Reallocated|Pending|CRC|Power_On_Hours'
Model Family:     Samsung based SSDs
Serial Number:    S4XXXXXXXXX
Power_On_Hours:   18422
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
UDMA_CRC_Error_Count: 2

What it means: CRC errors often point to cabling/backplane/HBA issues, not the NAND itself.
Decision: if CRC increments, reseat/replace cable or move bays before you replace a perfectly good drive.

Task 22: verify autotrim (SSDs) and decide if you want it

cr0x@server:~$ sudo zpool get -o name,property,value autotrim tank
NAME  PROPERTY  VALUE
tank  autotrim  off

What it means: autotrim is off. On SSD pools, TRIM can help sustained write performance.
Decision: consider zpool set autotrim=on tank for SSD-based pools after validating your drive firmware behaves.
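
If you prefer to keep autotrim off, you can still trim on a schedule. A minimal sketch, assuming OpenZFS 0.8 or newer:

cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank

The -t flag adds per-device TRIM progress to the status output, so you can confirm the operation actually ran instead of assuming it did.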

Task 23: inspect dataset-level write amplification signals (logical vs physical)

cr0x@server:~$ sudo zfs get -o name,property,value logicalused,used tank/vm
NAME     PROPERTY     VALUE
tank/vm  logicalused  380G
tank/vm  used         295G

What it means: compression is helping (physical used is lower than logical). If it were reversed, you’d suspect copies, padding, or volblocksize mismatches.
Decision: when logical and used diverge in the wrong direction, re-check dataset properties and workload assumptions.

Task 24: rehearse a restore (the only test that counts)

cr0x@server:~$ ssh backup01 sudo zfs clone backup/tank/apps@replica-001 backup/tank/apps-restore-test
cr0x@server:~$ ssh backup01 sudo zfs set mountpoint=/mnt/restore-test backup/tank/apps-restore-test
cr0x@server:~$ ssh backup01 mount | grep restore-test
backup/tank/apps-restore-test on /mnt/restore-test type zfs (rw,xattr,posixacl)

What it means: you created a writable clone from a snapshot and mounted it.
Decision: schedule restore tests. If you don’t, your first restore will be during an outage, which is a bold choice.
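
When the restore test is done, clean up so the next test starts from a known state. A minimal sketch; destroying the clone does not touch the received dataset or the snapshot it was cloned from:

cr0x@server:~$ ssh backup01 sudo zfs destroy backup/tank/apps-restore-test

If the destroy complains that the dataset is busy, unmount it first (zfs unmount backup/tank/apps-restore-test) and check for a forgotten shell still sitting in /mnt/restore-test.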

Fast diagnosis playbook

When performance tanks, you want a path that converges quickly. Not a week-long interpretive dance with graphs.
This playbook assumes Linux/OpenZFS tooling, but the logic holds elsewhere.

First: is the pool healthy and is anything rebuilding?

  • Check: zpool status -v
  • Look for: scrub/resilver in progress, DEGRADED vdevs, checksum errors, slow devices
  • Decision: if resilvering, expect degraded performance; prioritize finishing rebuild safely over “tuning.”

Second: are you out of space or heavily fragmented?

  • Check: zpool list -o size,alloc,free,capacity,frag
  • Look for: capacity > 80–85%, frag > ~50% (context-dependent)
  • Decision: if full/fragged, free space and delete snapshots (carefully). Don’t chase arcane sysctls first.

Third: what’s the bottleneck: disk, CPU, memory, or sync latency?

  • Disk: zpool iostat -v 1 shows one device pegged or much slower than peers.
  • CPU: compression/checksum can be CPU-bound on small cores; validate with system CPU tools.
  • Memory: ARC thrashing or system swapping: check ARC size and swap activity.
  • Sync writes: latency spikes during fsync-heavy workloads; SLOG may help if properly designed.

Fourth: identify the dataset and workload pattern

  • Check: which dataset is hot (application logs, VM disks, backup ingest)
  • Look for: wrong recordsize for workload, too many snapshots holding space, unexpected sync behavior
  • Decision: tune at dataset boundary. Avoid pool-wide changes unless you’re fixing a pool-wide problem.
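
A quick way to see which datasets carry the bulk and whether snapshots are the ones holding space, using standard dataset properties (used, usedbysnapshots, written); the biggest consumers land at the bottom because -s sorts ascending:

cr0x@server:~$ sudo zfs list -r -o name,used,usedbysnapshots,written -s used tank | tail

A large usedbysnapshots next to modest live data is a retention problem, not a recordsize problem; fix the policy before touching properties.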

Three corporate mini-stories (the kind you remember)

Incident: the wrong assumption (512-byte thinking in a 4K world)

A mid-size SaaS company built a new analytics cluster on shiny large HDDs behind a reputable HBA. The architect used ZFS because of checksums
and snapshots, and because the old storage stack had the personality of wet cardboard. Pool creation was scripted. It “worked.”

Six months later, write latency crept up. Not catastrophically—just enough that batch jobs missed their window. Then resilver time on a single disk
replacement turned into a multi-day event. During the resilver, performance fell off a cliff and stayed there. The team assumed “big disk rebuilds are slow”
and accepted the pain as the price of capacity.

Someone finally pulled a full baseline: sector sizes, ashift, and real physical alignment. The pool was built with ashift=9 because the drives
reported 512 logical sectors and nobody checked physical sector size. Every write got translated into a read-modify-write cycle on the drive.
ZFS was doing what it was told; the drives were doing what physics required.

They migrated data to a new pool with ashift=12. Performance normalized. Resilvers got dramatically faster.
The incident report was painfully simple: “We assumed the disk told the truth.” The corrective action was also simple:
“We will check PHY-SEC and set ashift explicitly.” The lesson: ZFS will faithfully preserve your mistakes.

Optimization that backfired: the “sync=disabled” era

A different company ran a VM farm on ZFS mirrors. Developers complained about occasional latency spikes during peak deploy hours.
Someone googled. Someone found the setting. Someone said, “We don’t need synchronous writes; we have a UPS.”
sync=disabled was applied at the dataset level for VM storage.

The spikes went away. Tickets closed. High-fives were exchanged in the shared Slack channel where optimism goes to die.
Two months later, a host rebooted unexpectedly after a kernel panic. The UPS was fine. The disks were fine. The VMs were not fine.
A handful came back with corrupted filesystems. Not all. Just enough to make the incident feel like a haunting.

The postmortem was grim but clean: synchronous semantics were explicitly disabled, so acknowledged writes weren’t necessarily durable.
The crash happened in a window where several guests believed their data was on stable storage. It wasn’t. ZFS did exactly what it was configured to do.

They reverted to sync=standard, measured again, and solved the real problem: a saturated write path plus poor queueing during deploy storms.
They added capacity and smoothed I/O bursts. The moral is not “never optimize.” It’s “optimize with a rollback plan and a clear definition of correctness.”

Joke #2: Disabling sync writes is like removing your smoke detector because it’s loud. Quieter, yes. Smarter, no.

Boring but correct practice that saved the day: monthly scrubs and alert hygiene

A financial services team ran a modest ZFS-backed file service. Nothing fancy: mirrored vdevs, lz4 compression, conservative dataset policies.
They had a habit that nobody bragged about: monthly scrubs, and alerts that fired on new checksum errors or degraded vdevs.
The on-call rotation hated many things, but not that.

One Thursday afternoon, an alert fired: a handful of checksum errors on one disk, then more. The pool stayed ONLINE. Users noticed nothing.
The engineer on duty didn’t “wait and see.” They checked SMART, saw CRC errors increasing, and suspected a cable or bay.
They scheduled a maintenance window and moved the drive to another slot. CRC errors stopped.

Two weeks later, a different disk started throwing real media errors, and ZFS repaired them during a scrub. The team replaced that disk during business hours.
No emergency. No extended outage. The system remained boring.

The secret wasn’t genius. It was a loop: scrub regularly, alert early, treat small error counters as smoke, and validate the path (cables, HBAs, firmware),
not just the drive. In storage, boring is a feature you can ship.

Common mistakes: symptoms → root cause → fix

1) “Deletes don’t free space”

  • Symptoms: application deletes data, but pool usage stays flat; df doesn’t budge.
  • Root cause: snapshots retain referenced blocks; sometimes clones do too.
  • Fix: list snapshots by used space and prune by policy.
cr0x@server:~$ sudo zfs list -t snapshot -o name,used -s used | tail
tank/apps@daily-2025-12-20   18.2G
tank/apps@daily-2025-12-21   21.4G
tank/apps@daily-2025-12-22   25.7G

2) “Random I/O is awful on RAIDZ”

  • Symptoms: VM latency spikes; IOPS lower than expected; writes feel “sticky.”
  • Root cause: RAIDZ parity overhead plus small random writes; recordsize mismatch; pool too full.
  • Fix: mirrors for latency-critical workloads, or separate RAIDZ for capacity; tune recordsize on the hot dataset; keep capacity headroom.

3) “Scrub takes forever and the system crawls”

  • Symptoms: scrubs run for days; services slow down; iostat shows low throughput.
  • Root cause: slow or failing disk, bad HBA/cabling, SMR drives in disguise, or heavy concurrent workload.
  • Fix: identify slow device with zpool iostat -v; validate SMART; replace problem hardware; schedule scrubs off-peak.

4) “We added an L2ARC and nothing got faster”

  • Symptoms: bought SSD cache; latency unchanged; ARC stats look similar.
  • Root cause: workload isn’t read-cacheable, or L2ARC is too small/slow, or system is CPU/memory bound.
  • Fix: measure cache hit rates; prioritize RAM/ARC and better vdev layout before adding L2ARC.

5) “Resilver is dangerously slow”

  • Symptoms: disk replacement takes a long time; performance during resilver is terrible.
  • Root cause: large HDDs, RAIDZ geometry, high pool utilization, SMR behavior, or a sick disk dragging the vdev.
  • Fix: prefer mirrors for fast rebuild; keep pool below ~80%; replace suspect disks proactively; don’t mix slow/fast devices in a vdev.

6) “Checksum errors appear but disks ‘test fine’”

  • Symptoms: zpool status shows CKSUM errors; SMART looks normal.
  • Root cause: cabling/backplane/HBA/firmware issues; transient transport errors.
  • Fix: check SMART CRC counters; reseat/replace cables; try different bays; update HBA firmware; then clear errors and monitor.
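
Once the transport problem is fixed, clear the counters so future movement means something. A minimal sketch; clearing repairs nothing, it just resets the baseline:

cr0x@server:~$ sudo zpool clear tank

You can also pass a specific device after the pool name to clear only that device's counters.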

7) “We can’t expand our RAIDZ vdev the way we expected”

  • Symptoms: added one disk, capacity barely changed or expansion isn’t possible.
  • Root cause: top-level vdev geometry is fixed; you expand by adding whole vdevs (or via newer expansion features depending on platform/version, with constraints).
  • Fix: plan vdev width at day 0; for growth, add another vdev of similar performance class; avoid Frankenstein pools.

Checklists / step-by-step plan

Plan A: first pool to “safe enough” production in 10 steps

  1. Inventory hardware: confirm drive type, sector sizes, HBA model, and whether you have power-loss protection on SSDs.
  2. Choose topology: mirrors for latency, RAIDZ2/3 for capacity; avoid RAIDZ1 on large disks unless you enjoy gambling.
  3. Name devices sanely: use /dev/disk/by-id paths; document bay mapping.
  4. Create pool explicitly: set ashift and root dataset properties (compression, atime, xattr).
  5. Create datasets per workload: apps vs VMs vs users; set recordsize and sync policy intentionally.
  6. Set capacity guardrails: quotas for “humans,” reservations for “must not fail.”
  7. Scrub schedule: monthly baseline; more often if drives are suspect or environment is harsh.
  8. Snapshot policy: e.g., hourly for 24h, daily for 30d, monthly for 12m—tune to business needs and storage budget.
  9. Replication: send/receive to a different system; test restores by cloning snapshots.
  10. Monitoring: alerts on DEGRADED, checksum errors, rising SMART CRC/realloc/pending, pool capacity thresholds, and scrub failures.
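
For step 10, a minimal alerting sketch, assuming Linux OpenZFS with ZED (the ZFS Event Daemon) installed; the path and variable names below are common defaults, so verify them against your distribution's zed.rc:

cr0x@server:~$ sudo grep -E '^#?ZED_EMAIL_ADDR|^#?ZED_NOTIFY_INTERVAL_SECS' /etc/zfs/zed.d/zed.rc

Uncomment and set ZED_EMAIL_ADDR (and optionally ZED_NOTIFY_INTERVAL_SECS), then restart the ZED service (often zfs-zed). Treat this as a floor, not a monitoring strategy: capacity thresholds and SMART trends still belong in your regular monitoring stack.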

Plan B: migrate from “it exists” to “it’s operable” without downtime fantasies

  1. Stop making pool-wide tweaks. Start measuring and documenting current state: zpool status, zpool list, zfs get all (filtered).
  2. Split datasets by workload so you can tune and snapshot independently.
  3. Implement a retention policy and prune snapshots that no longer serve recovery goals.
  4. Set up replication and run a restore test. Prove it to yourself with a mounted clone.
  5. Plan the “irreversible fixes” (wrong ashift, wrong topology) as a migration to a new pool. There is no magic toggle.

Operational cadence (what you do every week/month/quarter)

  • Weekly: review alerts, check for new error counters, confirm snapshot jobs are running, validate capacity projections.
  • Monthly: scrub, review scrub duration trends, verify at least one restore test from replication.
  • Quarterly: rehearse a “disk failure + restore” scenario, review dataset properties against workload changes, validate firmware baselines.

FAQ

1) Should I choose mirrors or RAIDZ for VM storage?

Mirrors, unless you have a strong reason and a tested workload profile. VMs tend to do small random I/O and punish RAIDZ parity overhead.
If you must use RAIDZ, keep vdev widths reasonable, maintain headroom, and tune recordsize for VM datasets.

2) Is RAIDZ1 ever acceptable?

On small disks and non-critical data, maybe. On large modern disks, RAIDZ1 increases the risk that a second issue during resilver takes the pool down.
If you can’t tolerate downtime and restore time, don’t run single-parity.

3) What compression should I use?

lz4 for almost everything. Turn it off only for datasets where data is already compressed or encrypted and you’ve measured CPU impact.

4) How full can I let a pool get?

Try to stay below ~80% for healthy performance, especially on RAIDZ and mixed workloads. Above that, fragmentation and allocation behavior can
increase latency. The exact cliff depends on workload, but “we ran it to 95%” is a familiar pre-outage sentence.

5) Do I need a SLOG?

Only if you have significant synchronous writes and you care about their latency. If your workload is mostly async, a SLOG won’t help.
If you do add one, use high-endurance devices with power-loss protection. Cheap consumer SSDs are not a journal device; they are a surprise generator.

6) Do I need L2ARC?

Usually no as a first move. Start with RAM (ARC) and correct pool topology. L2ARC can help read-heavy workloads with working sets larger than RAM,
but it also consumes memory for metadata and can add complexity.

7) Can I change ashift after creating the pool?

No. Not in place. You fix wrong ashift by migrating data to a new pool created correctly. This is why sector-size validation is a day-0 task.

8) How do I know if snapshots are the reason space won’t free?

List snapshots and sort by used. If big snapshots exist, they’re holding blocks alive. Delete the snapshots (carefully, in policy order),
then monitor space changes. Also check for clones.

9) Is ZFS “a backup” because it has snapshots?

No. Snapshots are local recovery points. Backups require separate failure domains. Use send/receive replication (or another backup system) and test restores.

10) What’s the single best habit to avoid regrets?

Do restore tests from replication on a schedule. Everything else is just probability management; restore testing is truth.

Conclusion: practical next steps

If you want “production without regret,” you don’t chase exotic flags. You make three good structural decisions (topology, ashift, dataset boundaries),
then you run a boring operational loop (scrub, snapshot, replicate, test restore, monitor errors).

Next steps you can do this week:

  • Write down your pool topology and failure tolerance in one paragraph. If you can’t, you don’t have a design yet.
  • Create datasets for your top three workloads and set properties intentionally (compression, recordsize, atime, quotas).
  • Run a scrub and record how long it took. That duration trend will become a health signal.
  • Set up replication with send/receive to another system and run one restore test via clone-and-mount.
  • Turn your baseline checks into alerts: pool health, capacity, checksum errors, and SMART transport/media indicators.

ZFS will give you data integrity and operational leverage. It will also happily preserve every bad assumption you feed it.
Choose your assumptions like you’re the one who gets paged. Because you are.
