ZFS has a reputation: either it’s the file system that saves your career, or the one that quietly eats your RAM while your database cries. Both stories can be true. ZFS is not a “filesystem” in the narrow sense; it’s a storage system that treats disks, redundancy, caching, and data integrity as a single design problem. That’s why it can feel magical when it fits—and stubborn when it doesn’t.
This is the operational view: what ZFS actually buys you on real hardware with real failure modes, how to interrogate it with commands you’ll use at 2 a.m., and the honest cases where ext4 or XFS is simply the better tool. If you’re looking for religion, you’re in the wrong room. If you’re looking to keep data correct and latency boring, keep reading.
Table of contents
- 1) What ZFS really is (and why it behaves differently)
- 2) What you actually get in practice
- 3) When ext4/XFS wins (and why that’s not heresy)
- 4) Interesting facts & historical context
- 5) Practical tasks: commands you’ll actually run
- 6) Fast diagnosis playbook (find the bottleneck quickly)
- 7) Three corporate-world mini-stories
- 8) Common mistakes: symptoms and fixes
- 9) Checklists / step-by-step plan
- 10) FAQ
- Conclusion
1) What ZFS really is (and why it behaves differently)
With ext4 or XFS, you typically have a block device (maybe a RAID volume, maybe LVM) and a filesystem that sits on top. ZFS flips the stack: it owns the “RAID,” the volume management, and the filesystem, so it can enforce end-to-end integrity. That’s why ZFS talks about pools, vdevs, and datasets more than “partitions.”
Core model in one paragraph
A ZFS pool (zpool) is made of one or more vdevs (virtual devices). A vdev might be a mirror, a RAIDZ group, a single disk (please don’t), or a special purpose device. ZFS stripes data across vdevs, not across disks inside a vdev. That distinction matters when you expand capacity or performance: adding vdevs increases throughput; changing a vdev type is… not really a thing.
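To make that concrete, here is a minimal sketch with placeholder device names: the first command builds a pool from one mirror vdev, the second adds a second mirror vdev that ZFS will stripe across. Nothing here converts an existing vdev into a different type; that part is locked in at creation.
cr0x@server:~$ sudo zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
cr0x@server:~$ sudo zpool add tank mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D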
Copy-on-write is the feature behind the features
ZFS uses copy-on-write (CoW). When you modify a block, it writes a new block elsewhere and then updates metadata pointers to reference it. This makes snapshots cheap (they’re just old pointers) and makes torn writes much less likely to yield inconsistency. But it also means fragmentation behaves differently, and synchronous write semantics are a first-class performance variable.
End-to-end checksums: the part you don’t notice until you do
Every block has a checksum stored in its parent metadata. Reads verify checksums; repairs happen automatically if redundancy exists. It’s not “bitrot paranoia.” It’s basic hygiene in a world where disks, controllers, firmware, and cables all lie occasionally.
Joke #1: ZFS is like a smoke detector—you mostly notice it when it’s screaming, and then you’re very glad it was installed.
2) What you actually get in practice
2.1 Data integrity you can operationalize
The honest pitch: ZFS catches silent corruption that other stacks may happily serve to your application as “valid data.” It does this with checksums on every block and self-healing when redundancy exists (mirror/RAIDZ). In practice, you operationalize this via scrubs, which walk the pool and verify everything.
What it changes in your day-to-day operations:
- You can schedule a scrub and treat checksum errors as an actionable signal rather than a ghost story.
- You can prove that a backup is consistent at the storage layer, not just “it completed.”
- You can replace flaky disks proactively because ZFS will tell you it is repairing, not just failing.
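Scheduling those scrubs is usually a one-liner. A minimal sketch, assuming a pool named tank and a root crontab (some distributions already ship a monthly scrub job with their ZFS packages, so check before adding a duplicate):
0 3 1 * * /usr/sbin/zpool scrub tank
Pair it with alerting on zpool status so a scrub that repaired or found errors becomes a ticket, not a surprise.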
2.2 Snapshots that are fast enough to be boring
ZFS snapshots are near-instant because they’re metadata markers. In practice, this changes your operational posture: you can snapshot before risky changes, snapshot frequently for ransomware recovery, and keep short retention without sweating I/O storms.
The catch is subtle: snapshots are cheap to create, not free to keep. If you keep thousands and then do lots of random overwrites, your metadata and fragmentation profile changes. ZFS will still work; your latency SLO might not.
2.3 Compression that usually helps more than it hurts
Modern ZFS compression (commonly lz4) is one of those rare features that is both performant and useful. You often get better effective throughput because you’re moving fewer bytes off disk. For many workloads, it’s a free lunch—except it’s not free; it’s “paid” in CPU, which you might already be short on.
In practice: enable compression=lz4 on most datasets by default unless you have a measured reason not to (e.g., already-compressed media, CPU-starved appliances).
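One low-friction way to make that the default, sketched for a pool named tank: set compression on the top-level dataset so children inherit it, then override only where you have measured that it does not pay (tank/media here is a hypothetical dataset of already-compressed video).
cr0x@server:~$ sudo zfs set compression=lz4 tank
cr0x@server:~$ sudo zfs set compression=off tank/media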
2.4 Dataset-level controls: quotas, reservations, and sane multi-tenant storage
ZFS datasets give you per-tree properties: compression, recordsize, atime, quotas, reservations, snap schedules, mountpoints. This is where ZFS feels like a storage platform rather than “a filesystem.” In corporate systems, this is often the difference between “shared NFS server chaos” and “predictable storage service.”
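A small sketch of what that looks like, with a hypothetical per-team dataset: quota caps how much a tree can consume, reservation guarantees it space even when noisy neighbors fill the pool.
cr0x@server:~$ sudo zfs create tank/home/team-analytics
cr0x@server:~$ sudo zfs set quota=500G reservation=100G tank/home/team-analytics
cr0x@server:~$ sudo zfs get -o name,property,value quota,reservation tank/home/team-analytics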
2.5 Predictable replication with send/receive
zfs send/zfs receive are a practical gift. They replicate snapshots as streams, and incrementals can be efficient. When it’s set up cleanly, it’s not just backup—it’s a rebuild path, a migration strategy, and a DR mechanism you can test without heroics.
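The remote shape of it, sketched with a hypothetical host called backup-host and a pool named backup on the far side (Task 15 below shows the incremental follow-up):
cr0x@server:~$ sudo zfs snapshot tank/home@daily-1
cr0x@server:~$ sudo zfs send tank/home@daily-1 | ssh root@backup-host zfs receive -u backup/home
Once that initial full stream lands, later runs send only the increments between snapshots.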
2.6 Where performance surprises come from
ZFS can be extremely fast, but it’s sensitive to workload patterns and configuration. The most common “surprise” in production is synchronous writes: if your workload uses fsync() heavily (databases, NFS, VM images), ZFS will honor it. Without a proper SLOG (and proper power-loss protection), that can turn into latency spikes that look like “ZFS is slow.” It’s not slow; it’s honest.
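If the answer turns out to be a SLOG, the command shape is below; the device names are placeholders, and the devices should be mirrored, genuinely low-latency, and carry real power-loss protection, or you have built an expensive way to lose transactions.
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-SLOG_A /dev/disk/by-id/nvme-SLOG_B
cr0x@server:~$ sudo zpool status tank | grep -A 3 logs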
Joke #2: ZFS doesn’t lose your data—it just makes you confront your assumptions, which is somehow worse.
3) When ext4/XFS wins (and why that’s not heresy)
3.1 When simplicity is your SLO
ext4 is the Toyota Corolla of Linux filesystems: not exciting, but it starts every morning. If your storage stack already has redundancy and integrity guarantees (good RAID controller with cache + BBU, enterprise SAN with checksums, cloud block storage with integrity assurances), ext4 can be the right choice because it has fewer moving parts and fewer tuning knobs to mis-set.
XFS is similarly “boring in a good way,” especially for large files, parallel I/O, and certain metadata-heavy workloads where its allocation groups scale nicely.
3.2 When you can’t afford CoW side-effects
Copy-on-write changes the write pattern. For some workloads—especially those that rewrite blocks in place expecting stable locality—CoW can create fragmentation and read amplification over time. You can mitigate, you can tune, you can design around it. But if you want “write bytes in place, keep locality,” XFS/ext4 may provide more predictable long-term behavior.
3.3 When you need the absolute minimum overhead
ZFS checksums, metadata, and features cost something. Often you get that cost back (compression, fewer corruptions, easier snapshots). But for extremely lean systems—embedded appliances, minimal VMs, or environments where RAM is precious—ext4 can win simply because the memory and CPU footprint is small and easy to reason about.
3.4 When your operational team is not staffed for ZFS
This isn’t a dig; it’s reality. ZFS is operationally friendly once you know it, but it does require understanding of vdev layout, ashift, scrubs, snapshot management, and the ARC. If you don’t have people who will own those details, ext4/XFS reduces the surface area for subtle mistakes.
3.5 When you need online growth patterns ZFS doesn’t offer
ZFS expansion is vdev-based. You can often grow a mirror by replacing disks with larger ones and letting it resilver, or add new vdevs to expand the pool. But you can’t trivially “convert RAIDZ1 to RAIDZ2” after the fact, and rebalancing data across vdevs isn’t like a traditional RAID reshape. If your procurement realities demand frequent, awkward reshapes, ext4/XFS on top of LVM/MDRAID may fit better.
4) Interesting facts & historical context (short, concrete)
- ZFS was designed at Sun Microsystems in the mid-2000s as part of Solaris to replace the “volume manager + filesystem” split and reduce corruption risk.
- Its “128-bit filesystem” claim is mostly a way to say “we’re not going to run out of space any time soon,” not a promise you’ll attach disks the size of Jupiter.
- ZFS’s end-to-end checksumming was a direct reaction to silent corruption in real storage stacks—not theoretical cosmic rays, but ordinary hardware and firmware misbehavior.
- Copy-on-write snapshots became a practical ops tool in ZFS years before “snapshotting” became mainstream in VM and container platforms.
- ZFS popularized the idea that compression can be a performance feature, not just a capacity trick—because reducing I/O can be faster than moving raw bytes.
- OpenZFS evolved as a multi-platform open implementation; on Linux it became a de facto standard for ZFS usage despite licensing constraints that keep it out of the kernel tree.
- RAIDZ is not “RAID5/6 but branded.” The implementation details (variable stripe width, block pointers) make its behavior different—especially under partial writes.
- ZFS scrubs are not the same as RAID “consistency checks.” They verify checksums end-to-end and can repair with redundancy.
- The ARC (Adaptive Replacement Cache) is more than a page cache; it’s a ZFS-managed cache with policies that can be tuned and observed.
5) Practical tasks: commands you’ll actually run
All examples assume Linux with OpenZFS installed. Replace pool/dataset names as appropriate. After each command, I’ll tell you what the output means operationally, not academically.
Task 1: Inventory pools and basic health
cr0x@server:~$ sudo zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.25T 3.10T 4.15T - - 12% 42% 1.00x ONLINE -
cr0x@server:~$ sudo zpool status -x
all pools are healthy
Interpretation: FRAG measures free-space fragmentation, so treat it as a hint, not a verdict; if it climbs while performance degrades, correlate it with workload and capacity. status -x is your quick “is anything on fire?” check.
Task 2: Get the detailed story during trouble
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 04:12:55 with 0 errors on Sun Dec 22 03:10:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 ONLINE 0 0 0
errors: No known data errors
Interpretation: READ/WRITE/CKSUM counters tell you whether you’re dealing with a dying disk, a flaky cable/controller, or actual on-disk corruption. “Scrub repaired” being non-zero is not automatically panic, but it is a ticket.
Task 3: See datasets and where space went
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint
NAME USED AVAIL REFER MOUNTPOINT
tank 3.10T 3.92T 192K /tank
tank/home 420G 3.92T 410G /home
tank/vm 1.90T 3.92T 1.20T /tank/vm
tank/backups 780G 3.92T 120G /tank/backups
Interpretation: USED includes snapshots and descendants. REFER is “live data in this dataset.” If USED is huge and REFER is modest, snapshots are your likely culprit.
Task 4: Find snapshot bloat quickly
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -s used | tail -n 5
tank/vm@auto-2025-12-20-0100 103G 1.17T Sat Dec 20 01:00 2025
tank/vm@auto-2025-12-21-0100 109G 1.18T Sun Dec 21 01:00 2025
tank/vm@auto-2025-12-22-0100 111G 1.18T Mon Dec 22 01:00 2025
tank/vm@auto-2025-12-23-0100 118G 1.19T Tue Dec 23 01:00 2025
tank/vm@auto-2025-12-24-0100 120G 1.20T Wed Dec 24 01:00 2025
Interpretation: USED here is the unique space held only by that snapshot; -s used sorts ascending, so the biggest offenders land at the bottom. If it’s growing daily, your churn rate is high (VM images, databases). Plan retention accordingly.
Task 5: Check key properties that change performance
cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,logbias tank/vm
NAME PROPERTY VALUE
tank/vm atime off
tank/vm compression lz4
tank/vm logbias latency
tank/vm recordsize 128K
tank/vm sync standard
Interpretation: recordsize wants to roughly match your dominant I/O size (databases often prefer 8K/16K to line up with their page size; VM images often 16K/64K depending on the hypervisor and guest pattern). sync and logbias decide how ZFS treats synchronous workloads.
Task 6: Safely enable compression on a dataset
cr0x@server:~$ sudo zfs set compression=lz4 tank/home
cr0x@server:~$ sudo zfs get -o name,property,value compression tank/home
NAME PROPERTY VALUE
tank/home compression lz4
Interpretation: This affects newly written blocks. It won’t rewrite old data unless you rewrite/copy it. Operationally, that’s good: you can enable without a big migration event.
Task 7: Measure whether compression is actually helping
cr0x@server:~$ sudo zfs get -o name,property,value compressratio,logicalused,used tank/home
NAME PROPERTY VALUE
tank/home compressratio 1.42x
tank/home logicalused 580G
tank/home used 420G
Interpretation: If compressratio hovers near 1.00x for a dataset full of already-compressed data, you’re mostly burning CPU for nothing. If it’s 1.3x+ on general-purpose datasets, it’s often a net win.
Task 8: Run and monitor a scrub
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 24 02:10:11 2025
1.25T scanned at 1.10G/s, 410G issued at 360M/s, 3.10T total
0B repaired, 13.2% done, 02:10:41 to go
Interpretation: “scanned” vs “issued” tells you how much data is actually being read vs metadata traversal. If scrubs take forever, you may be IOPS-limited or dealing with slow/SMR drives.
Task 9: Replace a failed disk in a mirror (the common case)
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 FAULTED 12 0 3 too many errors
cr0x@server:~$ sudo zpool replace tank ata-SAMSUNG_SSD_2 /dev/disk/by-id/ata-SAMSUNG_SSD_NEW
cr0x@server:~$ sudo zpool status tank
scan: resilver in progress since Wed Dec 24 03:01:10 2025
210G scanned at 2.5G/s, 88G issued at 1.0G/s, 3.10T total
88G resilvered, 2.8% done, 0:49:12 to go
Interpretation: Replacing by /dev/disk/by-id avoids device-name roulette after reboots. Monitor resilver speed; if it’s crawling, your pool is busy or one side is sick.
Task 10: Confirm ashift (sector size alignment) before you commit a pool design
cr0x@server:~$ sudo zdb -C tank | grep -nE "ashift|asize" | head
35: ashift: 12
36: asize: 7998634579968
Interpretation: ashift=12 means 4K sectors. Getting ashift wrong is a permanent tax. If you end up with ashift=9 on modern 4K/8K devices, you can pay for it forever in write amplification.
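Because ashift is fixed per vdev at creation time, the moment to be explicit is when you build the pool. A sketch with placeholder device names, forcing 4K alignment rather than trusting the drive to report its sector size honestly:
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B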
Task 11: Observe ARC behavior (is RAM helping or just missing?)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep "size|c_max|hit|miss" | head
hits 1802345678
misses 234567890
size 17179869184
c_max 25769803776
Interpretation: A high hit rate usually means reads are being served from RAM. If misses spike during working set access, you’re RAM-constrained or your access pattern is too large/random to cache effectively.
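If raw arcstats feels hostile, the arcstat utility that ships with OpenZFS prints a rolling per-second summary of hits, misses, and ARC size (exact columns vary by version, so treat this as a sketch):
cr0x@server:~$ arcstat 1
Watch it while the real workload runs; a benchmark that fits in ARC tells you nothing about Monday morning.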
Task 12: Watch pool I/O and latency in real time
cr0x@server:~$ sudo zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 3.10T 4.15T 820 1600 210M 380M
mirror-0 3.10T 4.15T 410 800 105M 190M
sda - - 205 400 52.5M 95.0M
sdb - - 205 400 52.5M 95.0M
Interpretation: If a single disk is doing disproportionate work in a mirror, suspect errors, firmware quirks, or a device that’s silently slower. If ops are high but bandwidth low, you’re IOPS-bound (random). If bandwidth is high but ops moderate, you’re streaming.
Task 13: Create a dataset for a database and set sane defaults
cr0x@server:~$ sudo zfs create tank/pg
cr0x@server:~$ sudo zfs set atime=off compression=lz4 recordsize=16K tank/pg
cr0x@server:~$ sudo zfs get -o name,property,value atime,compression,recordsize tank/pg
NAME PROPERTY VALUE
tank/pg atime off
tank/pg compression lz4
tank/pg recordsize 16K
Interpretation: This is not “the one true database tuning,” but it’s a reasonable start: smaller recordsize for DB pages, no atime churn, compression on.
Task 14: Snapshot and roll back safely (when you’re about to do something brave)
cr0x@server:~$ sudo zfs snapshot tank/home@pre-upgrade
cr0x@server:~$ sudo zfs list -t snapshot tank/home
NAME USED AVAIL REFER MOUNTPOINT
tank/home@pre-upgrade 0 - 410G -
cr0x@server:~$ sudo zfs rollback tank/home@pre-upgrade
Interpretation: Snapshot is instant. Rollback is fast but destructive to changes after the snapshot. In practice: snapshot, change, validate, then either keep or destroy the snapshot based on results.
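The “keep or destroy” step is one command. Dry-run it first, because destroy is immediate and does not ask twice (a sketch against the snapshot above):
cr0x@server:~$ sudo zfs destroy -nv tank/home@pre-upgrade
cr0x@server:~$ sudo zfs destroy tank/home@pre-upgrade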
Task 15: Replicate with send/receive (incremental)
cr0x@server:~$ sudo zfs snapshot tank/home@daily-1
cr0x@server:~$ sudo zfs snapshot tank/home@daily-2
cr0x@server:~$ sudo zfs send -i tank/home@daily-1 tank/home@daily-2 | sudo zfs receive -u backup/home
cr0x@server:~$ sudo zfs list -t snapshot backup/home | tail -n 2
backup/home@daily-1 0 - 410G -
backup/home@daily-2 0 - 412G -
Interpretation: Incremental streams move only the blocks changed between the two snapshots; the target must already hold daily-1 from an initial full send. The -u keeps the received dataset unmounted until you’re ready. Operationally: you can test restores without disturbing production mountpoints.
6) Fast diagnosis playbook (find the bottleneck quickly)
This is the “don’t flail” sequence. The goal is to decide whether your problem is hardware, pool layout, sync write path, memory/cache, or your workload.
Step 1: Is the pool healthy, or are you benchmarking a degraded system?
cr0x@server:~$ sudo zpool status -x
all pools are healthy
If not healthy: stop performance tuning. Replace the failing device, resolve cabling/controller errors, let resilver finish. Performance during resilver is not the baseline; it’s emergency mode.
Step 2: Is the pool out of space or heavily fragmented?
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,cap,frag,health
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 7.25T 6.70T 550G 92% 61% ONLINE
Interpretation: Pools above ~80–85% capacity often suffer. This isn’t a moral failing; it’s allocator reality. If CAP is high and writes are slow, free space is the first thing to fix.
Step 3: Are you sync-write bound (fsync latency)?
cr0x@server:~$ sudo zpool iostat -v tank 1
Interpretation: Look for high write ops with low bandwidth and high latency in the broader system. If your app is doing synchronous writes and you have no proper SLOG, ZFS will wait for stable storage. On HDD pools, that can be brutal.
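zpool iostat can also break out average latency per vdev, which is often the fastest way to see a sync-write stall for what it is. A sketch (column names vary a bit across OpenZFS versions):
cr0x@server:~$ sudo zpool iostat -l -v tank 1
If write wait times are high while bandwidth is modest, you’re looking at a latency problem, not a throughput problem.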
Step 4: Is ARC helping or are you missing RAM?
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep "hits|misses|size|c_max" | head
Interpretation: If your working set fits and hit rate is good, reads should be fast. If misses dominate and your workload is read-heavy, you might need more RAM—or accept that your pattern is not cacheable.
Step 5: Is a single vdev the limiter?
cr0x@server:~$ sudo zpool iostat -v tank 1
Interpretation: ZFS stripes across vdevs, so performance often scales with vdev count. One RAIDZ vdev is one performance unit. If you built “one giant RAIDZ2” and expected it to behave like “many disks worth of IOPS,” this is where reality arrives.
Step 6: Are snapshots and churn driving write amplification?
cr0x@server:~$ sudo zfs list -t snapshot -o name,used -s used | tail
Interpretation: Large snapshot unique space plus a rewrite-heavy workload means the pool is constantly allocating new blocks. That can look like “random write performance collapsed” when the real issue is retention policy vs churn rate.
7) Three corporate-world mini-stories (plausible, technically accurate)
Mini-story A: An incident caused by a wrong assumption (ZFS isn’t RAID controller cache)
A company I worked with moved a PostgreSQL cluster off a SAN onto a pair of local storage servers. The old SAN had battery-backed write cache; the database enjoyed low-latency fsync() and never thought about it. The new servers were built with HDDs in RAIDZ2 and “no need for fancy extras because ZFS is smart.”
The migration weekend went fine until Monday morning when customer traffic returned and commit latency went from “fine” to “the app is timing out.” The monitoring showed CPU mostly idle, disks not saturated in bandwidth, and yet the database was stalling. Classic case: the system was sync-write bound. Every transaction commit wanted stable storage, and stable storage on a RAIDZ HDD pool without a dedicated log device means waiting on rotational latency and parity bookkeeping.
The wrong assumption was subtle: “ZFS has a ZIL, so it’s like having a write cache.” The ZIL is a mechanism for correctness, not a magic accelerator. Without a SLOG device designed for low-latency sync writes (and with power-loss protection), ZFS is simply doing the safe thing slowly.
The fix wasn’t heroic. They added proper enterprise NVMe devices as mirrored SLOG, validated that the devices had real power-loss protection, and rolled back one ill-advised property tweak: someone had set sync=disabled during testing and assumed it was “fine” because nothing crashed. After the change, the database was back to predictable commit latency, and the incident postmortem had a refreshingly simple moral: “Correctness is a performance requirement.”
Mini-story B: An optimization that backfired (recordsize and VM images)
Another environment ran a private virtualization cluster with dozens of busy VM images on ZFS. An engineer read a tuning guide and decided “bigger blocks equals faster,” then set recordsize=1M on the VM dataset. The benchmark on a quiet system looked great for sequential reads. Everyone celebrated and moved on.
Two months later, latency complaints started showing up like weeds. Not constant, but spiky: a VM would “freeze” briefly under random write load. The pool wasn’t full, no disks were failing, ARC looked healthy. The problem was the write pattern: VM images do lots of small random writes. With a very large recordsize, small overwrites can trigger read-modify-write behavior and increase write amplification. Add snapshots into the mix (because VM snapshots were kept longer than they should have been), and CoW meant even more new allocations.
They reverted the dataset to a more reasonable recordsize (commonly 16K or 64K depending on the hypervisor and guest patterns), pruned snapshot retention, and the spikes largely disappeared. The lesson wasn’t “never tune.” It was “never tune in one dimension.” CoW + recordsize + snapshot retention is a three-body problem, and production always finds the unstable orbit.
The postmortem action item that actually mattered: any performance “optimization” required a workload-representative test (random write mix, snapshot churn, and concurrency) and a rollback plan. Nobody was banned from tuning; they just stopped doing it like a midnight ritual.
Mini-story C: A boring but correct practice that saved the day (scrubs + by-id + spares)
The most expensive outage I didn’t have was prevented by a calendar reminder. A team ran ZFS mirrors for a business-critical file service. Nothing exotic: good disks, modest load, steady growth. The practice was dull: monthly scrubs, alerts on checksum errors, and replacing disks by persistent IDs instead of /dev/sdX names.
One month, a scrub reported a small but non-zero number of repaired bytes on one mirror member. No user reports, no SMART screams, nothing obvious. The alert triggered a ticket anyway because the rule was simple: checksum repairs mean something lied. They pulled logs, saw intermittent link resets on a particular bay, reseated the drive, replaced a suspect cable, and swapped the disk during business hours.
Two weeks later, another disk in the same mirror started throwing hard read errors. If the earlier silent corruption hadn’t been found and addressed, they might have been one bad day away from losing blocks they couldn’t reconstruct. Instead, it was a routine replacement with no drama.
Nothing about the save was glamorous. No one “heroed” a 3 a.m. recovery. The system stayed boring because they did the boring things: scrubs, alerts that mattered, and hardware hygiene. In production, boring is a feature.
8) Common mistakes (specific symptoms and fixes)
Mistake 1: Building one giant RAIDZ vdev and expecting high IOPS
Symptom: Great sequential throughput, terrible random I/O latency under load; VM or database workloads feel “stuck.”
Why: A RAIDZ vdev behaves like a single IOPS unit in many patterns; parity work and disk count don’t magically multiply random IOPS.
Fix: Use mirrors for IOPS-heavy workloads, or multiple RAIDZ vdevs (more vdevs = more parallelism). If redesign isn’t possible, isolate workloads, reduce sync pressure, and be realistic about IOPS.
Mistake 2: Setting sync=disabled to “fix performance”
Symptom: Performance looks amazing; later, after a power loss or crash, the database or VM filesystem is corrupted or missing recent transactions.
Why: You told ZFS to lie to applications about durability.
Fix: Set sync=standard (default) and address the real issue: add a proper mirrored SLOG with power-loss protection if you need low-latency sync writes.
Mistake 3: Wrong ashift at pool creation
Symptom: Write performance is inexplicably bad; small writes cause large device writes; SSD endurance drops faster than expected.
Why: Sector alignment mismatch causes write amplification. You can’t change ashift after creation.
Fix: Recreate the pool correctly (often ashift=12 or 13 depending on devices). If you can’t, you live with the tax.
Mistake 4: Letting the pool run hot (too full)
Symptom: Everything gets slower over time; scrubs/resilvers take longer; metadata ops feel laggy.
Why: Allocator has fewer choices; fragmentation and write amplification increase.
Fix: Keep free space headroom. Add capacity (new vdevs), reduce retention, or move cold data out. Treat 80% as a planning threshold, not a hard law.
Mistake 5: Snapshot hoarding without churn awareness
Symptom: “We deleted data but space didn’t come back,” plus performance drift.
Why: Snapshots keep old blocks referenced. Heavy churn + long snapshot retention = space and fragmentation pressure.
Fix: Audit snapshots, implement retention policies by dataset, and align snapshot frequency with recovery objectives, not vibes.
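A minimal audit sketch, assuming the auto-* naming used earlier: find the heaviest snapshots, dry-run the destroy, then delete deliberately rather than wiring up a blind loop.
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,creation -s used tank/vm | tail
cr0x@server:~$ sudo zfs destroy -nv tank/vm@auto-2025-11-20-0100
The snapshot name here is hypothetical; the point is the dry run before the real destroy.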
Mistake 6: Mixing slow and fast devices in a way that amplifies the slowest
Symptom: Random latency spikes; one disk shows high utilization while others idle.
Why: Mirrors are only as fast as the slower member for many operations; heterogeneous vdevs complicate predictability.
Fix: Keep vdev members matched. If you must mix, do it intentionally (e.g., special vdev for metadata) and monitor carefully.
Mistake 7: Treating SMR drives like normal HDDs
Symptom: Resilvers and scrubs take ages; write performance collapses under sustained load.
Why: SMR write behavior can be punishing for ZFS rebuild patterns.
Fix: Avoid SMR for ZFS pools that need predictable rebuilds. If you’re stuck, reduce load during resilver and reconsider redundancy design.
9) Checklists / step-by-step plan
Design checklist (before you create the pool)
- Classify workload: mostly sequential (backup/media), mostly random (VMs/DB), mixed (home directories), sync-heavy (databases/NFS).
- Pick vdev type based on IOPS needs: mirrors for IOPS; RAIDZ for capacity/throughput with fewer IOPS expectations.
- Decide ashift up front: assume 4K+ sectors; verify, then create pool accordingly. Treat this as irreversible.
- Plan headroom: capacity planning that keeps you out of the 90% zone.
- Decide snapshot policy: frequency and retention per dataset, not one-size-fits-all.
- Decide sync strategy: if you have sync-heavy workloads, plan for SLOG (mirrored, PLP) or accept latency.
Deployment plan (first week in production)
- Create datasets per workload (VMs, DBs, home, backups) rather than one monolith.
- Set baseline properties: compression=lz4 and atime=off where appropriate, recordsize per workload.
- Implement monitoring: zpool status health, scrub results, checksum errors, capacity, and scrub duration trends.
- Schedule scrubs during low-traffic windows, but don’t treat them as optional.
- Test snapshot rollback on non-critical data so the team has muscle memory.
- Test zfs send/receive restore, not just backup creation.
Change checklist (before you “optimize”)
- State the hypothesis: which metric improves, and why.
- Measure baseline with production-like concurrency and I/O patterns.
- Change one knob at a time: recordsize, sync, special vdev, etc.
- Define rollback criteria and rehearse rollback steps.
- Observe for at least one business cycle if the workload is cyclical.
10) FAQ
Q1: Is ZFS “safer” than ext4/XFS?
A: ZFS provides end-to-end checksums and self-healing with redundancy. ext4/XFS generally rely on the underlying device layer for integrity. If you care about silent corruption detection and repair, ZFS has a real advantage.
Q2: Do I need ECC RAM for ZFS?
A: ECC is strongly recommended for any serious storage system, ZFS or not. ZFS is good at detecting disk-level corruption, but it can’t fix bad data created in RAM before it’s checksummed. Many run without ECC; fewer sleep well doing it.
Q3: Does ZFS always need “lots of RAM”?
A: ZFS will happily use RAM for ARC, and more RAM often improves read performance. But it’s not a hard requirement for correctness. The real question is whether your workload benefits from caching. If the working set doesn’t fit, RAM helps less, and ext4/XFS may feel similar.
Q4: Should I enable deduplication?
A: Usually no in general-purpose production. Dedup can be expensive in RAM and can create performance cliffs if underprovisioned. If you have a narrow, measured use case (like many identical VM images) and you’ve tested it, maybe. Otherwise, use compression first.
Q5: What’s the difference between ZIL and SLOG?
A: The ZIL is the on-disk intent log used to safely handle synchronous writes. A SLOG is a separate device where ZFS can place that log to make sync writes faster. Without a SLOG, the ZIL lives on the main pool devices.
Q6: When is sync=disabled acceptable?
A: Almost never for anything you’d be sad to lose. It may be acceptable for disposable scratch data or certain read-only ingest pipelines where the application already tolerates loss. If you’re not absolutely sure, treat it as “unsafe.”
Q7: Do snapshots replace backups?
A: No. Snapshots help you recover from logical errors quickly on the same system. Backups protect against pool loss, site loss, and admin mistakes that delete both data and snapshots. Use snapshots as a layer, not a substitute.
Q8: Why does my pool get slower as it fills up?
A: As free space shrinks, ZFS has fewer contiguous regions to allocate, which can increase fragmentation and metadata work. Also, CoW means it’s constantly finding new places to write blocks. Keep headroom and plan capacity expansions.
Q9: ext4/XFS have checksums too, right?
A: They have checksums for some metadata structures (journals, etc.) depending on features. ZFS checksums data and metadata end-to-end and validates on read. That’s a different level of coverage.
Q10: If ZFS is so good, why doesn’t everyone use it everywhere?
A: Because trade-offs are real: memory footprint, tuning surface area, CoW behavior, and operational expertise. Also, some environments already get the benefits elsewhere (SAN features, cloud storage guarantees), making ext4/XFS the simpler, safer choice.
Conclusion
ZFS is what you pick when you want the storage layer to be an adult: it verifies what it reads, it can repair what it finds, it makes snapshots routine, and it gives you clean primitives for replication. In practice, it changes the kinds of failures you see: fewer mysterious corruptions, more honest performance limits, and more “we can roll back” moments.
ext4 and XFS still win plenty of days. They win when you need simplicity, when your integrity story is handled elsewhere, when your team wants fewer knobs, and when predictable in-place write behavior matters more than snapshots and end-to-end checksums. The best choice isn’t the one with the most features; it’s the one that makes your production system boring in the specific ways your business needs.