There’s a certain kind of midnight page that teaches you more about filesystems than any benchmark ever will: “storage latency spiking,” “VMs freezing,” “backup window missed,” and the quiet dread that your snapshots have become a landfill. ZFS and btrfs both promise modern filesystem features—copy-on-write, snapshots, checksums, self-healing. In practice, they feel like two different philosophies of pain management.
This is an operator’s comparison: what breaks, how it breaks, what you can confidently automate, and what you should rehearse before production makes you rehearse it at 3 a.m. We’ll talk snapshots, RAID, recovery, performance, and the boring operational habits that keep you employed.
The thesis: which one bites less?
If your top priority is predictable integrity, mature tooling, and the least-surprising recovery story—especially for large pools, heavy virtualization, or long retention—ZFS bites less. It’s opinionated, strict, and usually right. You pay in RAM appetite, tuning complexity, and sometimes licensing politics, but you get a filesystem that behaves like it expects to be blamed—and came prepared with receipts.
If your top priority is native kernel integration, flexible subvolume layout, and “snapshots everywhere” workflows on Linux—especially on single-node systems and distributions that wrap it nicely—btrfs can bite less. It’s remarkably capable now, but its RAID story (especially parity RAID) has caveats that are not theoretical. It’s not “bad,” but it does demand that you know which features are boringly stable and which features are still a bit… adventurous.
My production bias, earned the hard way: ZFS for serious multi-disk redundancy and long-lived datasets. btrfs for root filesystems, workstation/server OS snapshots, and simpler mirrored setups where kernel-native integration and tooling convenience matter more than maximum RAID sophistication.
Joke #1: A filesystem is like a parachute—if you’re trying to find out whether it works, you’ve already made some interesting life choices.
Interesting facts & historical context (the parts people forget)
- ZFS was born at Sun Microsystems as an end-to-end storage stack: filesystem + volume manager + RAID + checksums, designed to eliminate “silent corruption” and operational guesswork.
- btrfs started at Oracle as a Linux-native CoW filesystem with snapshots and checksumming, aiming to compete with ZFS-like features while living inside the kernel.
- ZFS’s RAIDZ is not classic RAID-5/6; it’s designed to avoid the “write hole” with copy-on-write and transaction groups, reducing a whole class of partial-stripe corruption issues.
- btrfs’s early reputation was shaped by RAID5/6 issues. Many shops learned (painfully) to treat btrfs parity RAID as “don’t, unless you really know what you’re signing up for.”
- Both systems checksum data, but ZFS treats integrity as the core promise, while btrfs provides integrity within a Linux filesystem that also needs to play nice with the broader kernel ecosystem.
- Scrubs are not backups. Both ZFS scrub and btrfs scrub are integrity verification processes; they can repair only if there’s redundant good data available.
- ZFS send/receive became the blueprint for cheap, fast replication at the block/record level. btrfs send/receive exists too, and is great—when your subvolume/snapshot discipline is clean.
- Compression became mainstream because of CoW filesystems. ZFS popularized “turn it on, it’s often faster”; btrfs made compression accessible and kernel-native, with per-subvolume control.
- “Snapshots” are metadata tricks, not magical extra disks. Both systems can create snapshots instantly, but long retention without pruning becomes operational debt with interest.
Mental models: what each filesystem believes
ZFS: the storage system that wants to own the whole problem
ZFS doesn’t want to be “a filesystem sitting on top of someone else’s RAID.” It wants to be the filesystem and the RAID and the volume manager and the integrity layer. That’s why a ZFS pool (“zpool”) contains vdevs, and those vdevs define redundancy. The filesystem (“dataset”) sits on top and inherits properties. Once you internalize that, ZFS stops feeling weird and starts feeling like a very strict adult.
In ZFS land, the primary unit of risk is the vdev. Lose a vdev, you lose the pool. You don’t “add disks to RAIDZ later” without planning. You design vdev widths and redundancy upfront (or you accept what expansion options exist in your version and operational constraints). ZFS rewards patience and punishes improvisation.
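Property inheritance is the part worth seeing once. A minimal sketch, assuming a pool named tank with child datasets (names are illustrative; output trimmed):
cr0x@server:~$ sudo zfs set compression=zstd tank
cr0x@server:~$ sudo zfs get -r -o name,property,value,source compression tank
NAME PROPERTY VALUE SOURCE
tank compression zstd local
tank/vm compression zstd inherited from tank
Interpretation: set a property on the parent and every child gets it unless someone deliberately overrides it locally. That is how you standardize compression and atime across dozens of datasets without touching each one.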
btrfs: the Linux-native filesystem that wants to be flexible
btrfs sits closer to the “filesystem first” identity, with built-in multi-device support that can look like RAID. It uses subvolumes and snapshots as first-class organizational primitives, which makes it wonderful for OS snapshot/rollback patterns. It also plays nicely with the kernel, which matters for distros, bootloaders, and the “just update the box” reality of fleets.
btrfs’s flexibility is real: you can add devices, convert data profiles, and reorganize with balance. But flexibility cuts both ways. If ZFS is strict and saves you from yourself, btrfs sometimes lets you do the dangerous thing because it’s technically allowed.
Snapshots & clones: power and foot-guns
How snapshots actually behave
Both ZFS and btrfs snapshots are cheap to create and expensive to keep forever. The cost isn’t the snapshot itself; it’s the retained old blocks. If you snapshot a dataset and then rewrite a large VM image every day, you’ve basically signed up for “infinite history” unless you prune. Your free space won’t disappear linearly; it will vanish in awkward chunks, at emotionally inconvenient times.
ZFS snapshots: operationally predictable
ZFS snapshots are stable, widely used, and integrated with replication workflows. Clones are snapshots you can write to. ZFS also gives you clean primitives like holds (to prevent deletion), dataset properties for recursion, and send/receive streams that are well-understood.
The ZFS snapshot trap is usually not correctness—it’s capacity and fragmentation. A snapshot-heavy dataset with lots of small random writes can get messy. Still, the “messy” usually looks like “performance degraded and space is tight,” not “filesystem metadata is on fire.”
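Holds and clones deserve one concrete look, because they are the primitives that keep replication chains and experiments honest. A minimal sketch, reusing the pre-upgrade snapshot from the tasks below (names are illustrative):
cr0x@server:~$ sudo zfs hold keep tank/vm@pre-upgrade
cr0x@server:~$ sudo zfs destroy tank/vm@pre-upgrade
cannot destroy snapshot tank/vm@pre-upgrade: dataset is busy
cr0x@server:~$ sudo zfs clone tank/vm@pre-upgrade tank/vm-test
Interpretation: a held snapshot refuses deletion until you zfs release it, which is cheap insurance for replication parents. A clone is a writable branch of the snapshot; it costs almost nothing up front but pins the origin snapshot until the clone is destroyed or promoted.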
btrfs snapshots: fantastic for OS + subvolume discipline
btrfs snapshots shine when you treat subvolumes as boundaries: @ for root, @home, maybe @var or @docker, and snapshot the ones you actually want to roll back. This is where btrfs feels like cheating: instant snapshot, fast rollback, easy experimentation.
The btrfs snapshot trap is when people snapshot everything without thinking about churny directories (VM images, databases, containers). Snapshots retain extents, and high-churn workloads can turn your “nice rollback safety net” into a slow-motion space leak that only reveals itself during an incident.
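If you want to know which snapshots are actually pinning space, btrfs filesystem du gives a rough per-snapshot view. A sketch, using the snapshot layout from the tasks below (numbers are illustrative and the scan itself can take a while on big filesystems):
cr0x@server:~$ sudo btrfs filesystem du -s /mnt/data/@-snap-*
Total Exclusive Set shared Filename
18.20GiB 1.40GiB 16.80GiB /mnt/data/@-snap-001
18.35GiB 2.10GiB 16.25GiB /mnt/data/@-snap-002
Interpretation: Exclusive approximates what deleting that snapshot alone would free. Old snapshots with fat Exclusive numbers are the slow-motion space leak made visible.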
RAID: what’s built-in, what’s implied, what’s risky
ZFS RAID: mirrors and RAIDZ done the ZFS way
ZFS redundancy is defined at the vdev level: mirrors, RAIDZ1/2/3. Add vdevs to expand. Replace disks to heal. Scrub to verify. Resilver to reconstruct. It’s not “set and forget,” but it’s coherent.
Key operational truth: IOPS is mostly about the number of vdevs, especially with mirrors. A single wide RAIDZ vdev can deliver impressive sequential throughput and still feel sluggish for random IO compared to multiple mirror vdevs. The filesystem isn’t slow; your geometry is.
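The geometry point is easiest to see at creation time. A sketch with hypothetical disk names, showing two alternative layouts for the same six disks:
cr0x@server:~$ sudo zpool create fastpool mirror diskA diskB mirror diskC diskD mirror diskE diskF
cr0x@server:~$ sudo zpool create bigpool raidz2 diskA diskB diskC diskD diskE diskF
Interpretation: the first layout behaves like three parallel vdevs for random IO; the second behaves closer to one. Same hardware, very different latency under VM and database workloads, so pick geometry for the workload, not for the capacity spreadsheet.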
btrfs RAID: profiles, not classic arrays
btrfs talks about profiles: data profile (single, DUP, RAID1, RAID10, RAID5/6) and metadata profile (often RAID1 even if data is single). That’s powerful: you can prioritize metadata safety even on single-disk setups using DUP (on a single device, it stores two copies—useful against some bad-sector patterns, not a substitute for redundancy).
The big caveat remains parity RAID. Mirroring profiles (RAID1/RAID10) are widely used and generally trusted. Parity profiles (RAID5/6) have had a long history of edge cases and are still treated cautiously in many production environments. If you’re building “this must survive weird failures,” btrfs RAID1/10 is the safer lane.
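Staying in the safer lane is a one-liner at creation time. A sketch with hypothetical devices:
cr0x@server:~$ sudo mkfs.btrfs -L data -d raid1 -m raid1 /dev/sdb /dev/sdc
Interpretation: -d sets the data profile and -m the metadata profile. btrfs RAID1 means “two copies on two different devices,” not a fixed stripe layout, and it tolerates losing one device; it is not the classic mdadm-style whole-disk mirror.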
Joke #2: “RAID is not a backup” is the storage equivalent of “don’t touch the stove”—everyone nods, and then someone touches the stove anyway.
Recovery: what you can fix, what you can’t
ZFS recovery: scrubs, resilvers, and the art of not panicking
ZFS’s recovery story is cohesive: detect corruption via checksums, repair from redundancy, track errors per device, and provide clear health states. When it goes wrong, it often goes wrong in a way that’s diagnosable: a disk is throwing errors, a cable is flaky, a controller is lying. ZFS will tell you, loudly.
The most important operational advantage: you can trust the signal. If ZFS says it repaired X bytes and a device is faulting, you can act with high confidence. That confidence is worth a lot when you’re juggling stakeholders and the clock is yelling.
btrfs recovery: good tools, more nuance, and “know your mode”
btrfs has a solid set of recovery tools: scrub, device stats, check/repair utilities, and the ability to mount read-only and salvage data. But the operator burden is higher because outcomes can depend more on which features you used (profiles, compression, quotas, snapshots) and what kind of failure occurred (single-device corruption vs multi-device vs metadata damage).
btrfs can be extremely reliable in the configurations that are well-worn (single device, RAID1/10, disciplined snapshots). It becomes less “predictably boring” the further you push into parity RAID and complex conversion operations under pressure.
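When a btrfs filesystem does go sideways, the sane opening moves are a read-only rescue mount and, if needed, offline salvage. A sketch with hypothetical paths; the rescue= mount options need a reasonably recent kernel:
cr0x@server:~$ sudo mount -o ro,rescue=all /dev/sdb /mnt/rescue
cr0x@server:~$ sudo btrfs restore -D /dev/sdb /tmp/salvage
Interpretation: mount read-only and copy data off before attempting anything that writes. btrfs restore -D is a dry run that reports what could be salvaged without touching the disk. Treat btrfs check --repair as a last resort with a current backup in hand, not as a first move.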
Performance: latency, throughput, and the cache wars
ZFS performance: ARC, recordsize, sync, and the SLOG myth
ZFS is often accused of being “RAM-hungry.” More precisely: ZFS will happily use memory for ARC (adaptive replacement cache) because caching makes everything better until it doesn’t. On Linux, you must understand the interaction between ARC and the kernel page cache, and you should set sane limits in memory-constrained systems.
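Capping ARC on Linux is a module parameter, not a dataset property. A minimal sketch, assuming a 16 GiB cap purely as an illustration (the right number depends on what else the box runs; adjust the file path to your distro's conventions):
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee /etc/modprobe.d/zfs.conf
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
Interpretation: the first line persists the cap across reboots; the second applies it to the running module, with the live ARC shrinking toward the new ceiling over time rather than instantly.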
For databases and virtualization, ZFS tuning often revolves around a few practical levers (a quick way to audit them follows the list):
- recordsize: large for sequential (backups), smaller for random (VMs, DBs).
- compression: often improves effective throughput and latency.
- sync: do not “optimize” this unless you like explaining data loss.
- SLOG: helps sync writes only for certain workloads; it’s not a generic speed potion.
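Auditing those levers is one command, and the SOURCE column is the part that matters. A sketch against the tank/vm dataset used in the tasks below (values shown are illustrative):
cr0x@server:~$ sudo zfs get -o name,property,value,source recordsize,compression,sync,logbias tank/vm
NAME PROPERTY VALUE SOURCE
tank/vm recordsize 16K local
tank/vm compression zstd local
tank/vm sync standard default
tank/vm logbias latency default
Interpretation: “local” means a human set it on purpose; “default” means nobody has “optimized” it yet. Durability-affecting properties like sync should read default or a deliberately documented value, never a mystery.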
btrfs performance: CoW, compression, and fragmentation realities
btrfs can be fast, especially with compression and NVMe. But CoW has consequences for workloads that rewrite large files in place (VM disk images, some databases). btrfs provides ways to mitigate (like disabling CoW on specific files/directories via chattr +C), but that’s a trade: you’re opting out of some CoW benefits.
btrfs’s “balance” is both a capability and a trap: it can reorganize data to a new profile or improve distribution, but it is IO-heavy and can amplify pain if you run it casually on busy systems.
Practical tasks with commands (and what the output means)
The quickest way to learn these systems is to do real things: create pools, snapshot, scrub, simulate failure, and practice recovery paths. Below are tasks you can run on test systems. Commands assume Linux with ZFS tools installed (OpenZFS) and btrfs-progs for btrfs.
Task 1 (ZFS): Create a mirrored pool and verify layout
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-diskA /dev/disk/by-id/ata-diskB
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-diskA ONLINE 0 0 0
ata-diskB ONLINE 0 0 0
errors: No known data errors
Interpretation: mirror-0 is your redundancy unit. If you lose one disk, the pool stays online. If you lose the mirror vdev (both disks), the pool is gone.
Task 2 (ZFS): Create datasets with sane properties for VMs and backups
cr0x@server:~$ sudo zfs create -o compression=zstd -o atime=off tank/vm
cr0x@server:~$ sudo zfs set recordsize=16K tank/vm
cr0x@server:~$ sudo zfs create -o compression=zstd -o recordsize=1M tank/backup
cr0x@server:~$ sudo zfs get -o name,property,value compression,recordsize,atime tank/vm tank/backup
NAME PROPERTY VALUE
tank/vm compression zstd
tank/vm recordsize 16K
tank/vm atime off
tank/backup compression zstd
tank/backup recordsize 1M
tank/backup atime on
Interpretation: You’re matching recordsize to IO patterns. VMs: smaller records reduce read-modify-write amplification. Backups: large records boost sequential efficiency.
Task 3 (ZFS): Take a recursive snapshot and list it
cr0x@server:~$ sudo zfs snapshot -r tank@pre-upgrade
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation | head
NAME USED REFER CREATION
tank@pre-upgrade 0B 128K Tue Dec 24 00:10 2025
tank/backup@pre-upgrade 0B 96K Tue Dec 24 00:10 2025
tank/vm@pre-upgrade 0B 256K Tue Dec 24 00:10 2025
Interpretation: Snapshot USED grows as the live filesystem diverges. Don’t ignore it; it’s your “space you can’t reclaim until you delete snapshots.”
Task 4 (ZFS): Send an incremental snapshot to another pool (replication)
cr0x@server:~$ sudo zfs snapshot tank/vm@replica-001
cr0x@server:~$ sudo zfs send -w tank/vm@replica-001 | sudo zfs receive -u backup-pool/vm
cr0x@server:~$ sudo zfs snapshot tank/vm@replica-002
cr0x@server:~$ sudo zfs send -w -i tank/vm@replica-001 tank/vm@replica-002 | sudo zfs receive -u backup-pool/vm
Interpretation: The first send is full; the second is incremental. -w sends a raw stream: blocks go over the wire exactly as stored on disk, so on-disk compression and (when used) encryption are preserved end to end.
Task 5 (ZFS): Scrub and read the results
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub in progress since Tue Dec 24 00:14:02 2025
2.31G scanned at 410M/s, 812M issued at 144M/s, 120G total
0B repaired, 0.66% done, 0:13:50 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-diskA ONLINE 0 0 0
ata-diskB ONLINE 0 0 0
errors: No known data errors
Interpretation: Scrubs verify data against checksums. If repair happens, ZFS will tell you exactly how much and which device is suspect.
Task 6 (ZFS): Simulate a disk failure and replace it
cr0x@server:~$ sudo zpool offline tank /dev/disk/by-id/ata-diskB
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-diskA ONLINE 0 0 0
ata-diskB OFFLINE 0 0 0
errors: No known data errors
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-diskB /dev/disk/by-id/ata-diskC
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Tue Dec 24 00:20:11 2025
8.02G scanned, 1.94G issued, 120G total
1.93G resilvered, 1.62% done, 0:18:10 to go
Interpretation: “Resilver” is reconstruction. In mirrors it’s straightforward; in RAIDZ it’s more complex. Watch for slow resilvers—it often indicates hidden IO contention or a second sick disk.
Task 7 (btrfs): Create a filesystem on a single device and mount it
cr0x@server:~$ sudo mkfs.btrfs -L data /dev/nvme0n1p3
btrfs-progs v6.6
Label: data
UUID: 4b1a0c7e-1d2a-4a6f-9e3c-2e3a3a4e6b2f
Node size: 16384
Sector size: 4096
Filesystem size: 400.00GiB
Block group profiles:
Data: single
Metadata: DUP
System: DUP
cr0x@server:~$ sudo mount -o compress=zstd,space_cache=v2 /dev/nvme0n1p3 /mnt/data
cr0x@server:~$ sudo btrfs filesystem df /mnt/data
Data, single: total=1.00GiB, used=256.00MiB
Metadata, DUP: total=1.00GiB, used=112.00MiB
System, DUP: total=8.00MiB, used=16.00KiB
Interpretation: Metadata DUP on a single disk is a pragmatic integrity boost. It’s not redundancy against disk death, but it can help against some localized corruption.
Task 8 (btrfs): Create subvolumes and snapshot them (OS-style layout)
cr0x@server:~$ sudo btrfs subvolume create /mnt/data/@
Create subvolume '/mnt/data/@'
cr0x@server:~$ sudo btrfs subvolume create /mnt/data/@home
Create subvolume '/mnt/data/@home'
cr0x@server:~$ sudo btrfs subvolume snapshot -r /mnt/data/@ /mnt/data/@-snap-pre-upgrade
Create a readonly snapshot of '/mnt/data/@' in '/mnt/data/@-snap-pre-upgrade'
cr0x@server:~$ sudo btrfs subvolume list /mnt/data | head
ID 256 gen 12 top level 5 path @
ID 257 gen 13 top level 5 path @home
ID 258 gen 14 top level 5 path @-snap-pre-upgrade
Interpretation: Read-only snapshots are safer for rollback and send/receive. Treat subvolume boundaries like blast-radius boundaries.
Task 9 (btrfs): Use send/receive for incremental replication
cr0x@server:~$ sudo btrfs subvolume snapshot -r /mnt/data/@ /mnt/data/@-snap-001
cr0x@server:~$ sudo btrfs send /mnt/data/@-snap-001 | sudo btrfs receive /mnt/backup
At subvol /mnt/backup/@-snap-001
cr0x@server:~$ sudo btrfs subvolume snapshot -r /mnt/data/@ /mnt/data/@-snap-002
cr0x@server:~$ sudo btrfs send -p /mnt/data/@-snap-001 /mnt/data/@-snap-002 | sudo btrfs receive /mnt/backup
At subvol /mnt/backup/@-snap-002
Interpretation: The -p parent snapshot enables incremental sends. If you delete the parent snapshot prematurely, your chain breaks and you’ll need a new full send.
Task 10 (btrfs): Scrub and read device stats
cr0x@server:~$ sudo btrfs scrub start -Bd /mnt/data
Starting scrub on devid 1
Scrub device /dev/nvme0n1p3 (id 1) done
Scrub started: Tue Dec 24 00:35:12 2025
Status: finished
Duration: 0:01:07
Total to scrub: 18.00GiB
Rate: 274.62MiB/s
Error summary: read=0 write=0 csum=0 verify=0 no_csum=0 csum_discards=0 super=0 malloc=0 uncorrectable=0 corrected=0
cr0x@server:~$ sudo btrfs device stats /mnt/data
[/dev/nvme0n1p3].write_io_errs 0
[/dev/nvme0n1p3].read_io_errs 0
[/dev/nvme0n1p3].flush_io_errs 0
[/dev/nvme0n1p3].corruption_errs 0
[/dev/nvme0n1p3].generation_errs 0
Interpretation: Scrub verifies checksums and tries to repair when redundancy exists. Device stats help you identify a flaky device even before it fully fails.
Task 11 (btrfs): Convert to RAID1 with a second device (mirror-like behavior)
cr0x@server:~$ sudo btrfs device add /dev/nvme1n1p3 /mnt/data
cr0x@server:~$ sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data
Done, had to relocate 12 out of 12 chunks
cr0x@server:~$ sudo btrfs filesystem df /mnt/data
Data, RAID1: total=16.00GiB, used=12.00GiB
Metadata, RAID1: total=2.00GiB, used=512.00MiB
System, RAID1: total=32.00MiB, used=64.00KiB
Interpretation: btrfs uses profiles and chunks. After conversion, your data and metadata are mirrored across devices. The balance operation is the price of that flexibility—plan capacity and maintenance windows.
Task 12 (btrfs): Disable CoW for a VM directory (with trade-offs)
cr0x@server:~$ sudo mkdir -p /mnt/data/vmimages
cr0x@server:~$ sudo chattr +C /mnt/data/vmimages
cr0x@server:~$ lsattr -d /mnt/data/vmimages
---------------C------ /mnt/data/vmimages
Interpretation: New files created in that directory will be NOCOW, often reducing fragmentation and improving VM write patterns. Trade-off: you reduce CoW-based guarantees for those files (and checksumming behavior can be affected depending on kernel/features).
Task 13 (ZFS): Check ARC behavior and memory pressure (Linux)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|hits|misses)'
size 4 17179869184
c_max 4 25769803776
hits 4 1402382231
misses 4 32988312
Interpretation: ARC size and hit/miss give you a fast signal: are you cache-starved or fine? If the box is swapping, ARC limits need attention, but don’t knee-jerk—identify the real memory consumers.
Task 14 (ZFS): Identify what’s eating space (snapshots vs live data)
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,usedbysnapshots,usedbydataset -r tank
NAME USED AVAIL REFER USEDBYSNAPSHOTS USEDBYDATASET
tank 450G 1.2T 128K 0B 128K
tank/vm 330G 1.2T 250G 80G 250G
tank/backup 120G 1.2T 110G 10G 110G
Interpretation: If USEDBYSNAPSHOTS is large, deletion of old snapshots is the lever. If USEDBYDATASET is large, your live data grew and you need capacity or pruning policies.
Fast diagnosis playbook (find the bottleneck before it finds you)
This is the “what do I check first, second, third” flow I use when storage “feels slow” and everyone’s monitoring dashboard is accusing everyone else.
1) Establish: is it disk, CPU, memory pressure, or a single noisy tenant?
cr0x@server:~$ uptime
00:41:03 up 12 days, 4:22, 2 users, load average: 18.12, 17.90, 16.55
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 128G 92G 3.1G 2.2G 33G 29G
Swap: 0B 0B 0B
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
18 3 0 3211232 120000 28000000 0 0 1200 3400 9000 22000 25 15 30 30 0
Interpretation: High wa suggests IO wait. High load with low IO wait suggests CPU contention. Memory pressure shows up as shrinking available memory and potential swapping (if enabled).
2) Check filesystem/pool health first; performance second
Because degraded pools often look like “random slowness.”
cr0x@server:~$ sudo zpool status
cr0x@server:~$ sudo btrfs device stats /mountpoint
cr0x@server:~$ dmesg | tail -n 50
Interpretation: Any read/write/cksum errors, resets, or timeouts change your plan: you’re not tuning; you’re triaging hardware.
3) Identify the top IO consumers (per process and per device)
cr0x@server:~$ iostat -xz 1 5
cr0x@server:~$ pidstat -d 1 5
Interpretation: Look for devices at high utilization with rising await, and processes driving disproportionate writes. This is where you find the “one backup job” that is accidentally doing random IO on your VM pool.
4) For ZFS: check ARC and sync write behavior
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|mru_size|mfu_size|hits|misses)'
cr0x@server:~$ sudo zfs get -o name,property,value sync,logbias,primarycache,recordsize -r tank/vm
Interpretation: If you see heavy sync writes and no adequate latency path (or a saturated one), the system will “feel hung” under certain workloads even if throughput looks fine.
5) For btrfs: check for balance/scrub in progress and fragmentation patterns
cr0x@server:~$ sudo btrfs balance status /mountpoint
cr0x@server:~$ sudo btrfs scrub status /mountpoint
cr0x@server:~$ sudo btrfs filesystem usage /mountpoint | head -n 30
Interpretation: A balance running during peak hours can explain “sudden IO tax.” Also watch metadata usage; a metadata pinch can produce bizarre stalls.
Common mistakes (symptoms & fixes)
Mistake 1: Treating snapshots like backups
Symptom: “We have snapshots, why can’t we restore after pool loss?” followed by a long silence.
Fix: Snapshots protect against logical mistakes (rm -rf, bad upgrades) on the same storage. Backups require independent media or a remote system. Use ZFS send/receive or btrfs send/receive to another host/pool, or a separate backup system. Test restores, not just sends.
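Off-host in practice is one pipe per filesystem. A sketch reusing the snapshots from Tasks 4 and 9, with a hypothetical backup host named backup01 and hypothetical target paths; it assumes the receiving side has the tooling installed and appropriate privileges:
cr0x@server:~$ sudo zfs send -w tank/vm@replica-001 | ssh backup01 sudo zfs receive -u vault/vm
cr0x@server:~$ sudo btrfs send /mnt/data/@-snap-001 | ssh backup01 sudo btrfs receive /srv/backup
Interpretation: the important part is the separate chassis at the far end of the pipe and the recurring restore test, not the transport details.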
Mistake 2: ZFS pool designed as one huge RAIDZ vdev for VM workloads
Symptom: Great sequential benchmarks, terrible VM latency, “IOPS is cursed” complaints.
Fix: For random IO, use multiple mirror vdevs or multiple RAIDZ vdevs to increase parallelism. Also set appropriate recordsize (e.g., 16K) for VM datasets and enable compression.
Mistake 3: btrfs parity RAID used as if it’s as boring as mdadm RAID6
Symptom: Unexpected behavior during device failure/rebuild, confusing error states, recovery feels like archaeology.
Fix: Prefer btrfs RAID1/10 for multi-device reliability in production unless you have a strong reason and tested procedures for parity modes. If parity is required, consider using mdadm underneath with a simple btrfs profile on top—or pick ZFS RAIDZ.
Mistake 4: Running btrfs balance casually on busy production
Symptom: IO latency spikes, application timeouts, “nothing changed” except “someone ran maintenance.”
Fix: Use filtered balance (-dusage= / -musage=) and schedule it. Monitor. Treat balance like a storage migration job, not a cleanup script.
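A filtered balance looks like this; the thresholds are a common starting point, not a rule (sketch):
cr0x@server:~$ sudo btrfs balance start -dusage=50 -musage=30 /mnt/data
Done, had to relocate 4 out of 38 chunks
cr0x@server:~$ sudo btrfs balance status /mnt/data
No balance found on '/mnt/data'
Interpretation: -dusage=50 only rewrites data chunks that are at most half full, which reclaims slack without churning the whole filesystem. An unfiltered balance rewrites every chunk and charges you accordingly.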
Mistake 5: “Optimizing” ZFS sync settings to make benchmarks pretty
Symptom: Workload is lightning fast—until a power event or crash, then data integrity becomes a meeting.
Fix: Keep sync=standard unless you fully understand the application’s durability expectations. If you need faster sync, consider proper SLOG devices with power-loss protection, and validate latency improvements with real workloads.
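If a genuine low-latency sync path is justified, it is added as a log vdev. A sketch with hypothetical device names; the devices should have power-loss protection and be mirrored if you care about in-flight data:
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-slogA /dev/disk/by-id/nvme-slogB
cr0x@server:~$ zpool iostat -v tank 5
Interpretation: a log vdev only absorbs synchronous writes; async-heavy workloads will not get faster. Watch the log device in zpool iostat under the real workload before declaring victory.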
Mistake 6: Ignoring scrubs because “the pool is online”
Symptom: First scrub in a year finds errors; now you’re debugging corruption history instead of preventing it.
Fix: Schedule scrubs: monthly is common for large pools; more often for consumer disks or harsh environments. Review results. Replace flaky hardware early.
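Scheduling is the unglamorous part. Many distros already ship scrub timers for both filesystems, so check before rolling your own; the cron entries below are an illustrative sketch with hypothetical paths and names:
cr0x@server:~$ sudo crontab -l | grep scrub
0 3 1 * * /usr/sbin/zpool scrub tank
0 3 15 * * /usr/bin/btrfs scrub start -B /mnt/data
Interpretation: stagger the schedules so scrubs don't stack their IO tax, and make sure something reads zpool status and btrfs scrub status afterwards; an unread scrub result is just heat.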
Mistake 7: Snapshot sprawl without retention policy
Symptom: “Space disappeared,” deletions don’t free space, backups fail due to low space, performance degrades.
Fix: Enforce snapshot retention (hourly/daily/weekly) and prune automatically. Track space used by snapshots (ZFS: usedbysnapshots; btrfs: per-subvolume referenced space and qgroups if used).
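Retention can start as a scheduled prune; dedicated policy tools exist and are usually the better answer at fleet scale. A minimal sketch that keeps only the newest 14 snapshots of one ZFS dataset and deletes one stale btrfs snapshot (names and counts are illustrative; anything that prunes must also respect replication parents and holds):
cr0x@server:~$ zfs list -H -t snapshot -o name -s creation -d 1 tank/vm | head -n -14 | xargs -r -n1 sudo zfs destroy
cr0x@server:~$ sudo btrfs subvolume delete /mnt/data/@-snap-001
Interpretation: the list is sorted oldest-first and head -n -14 excludes the newest 14 snapshots, so only older ones reach zfs destroy.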
Checklists / step-by-step plan
Checklist A: Choosing between ZFS and btrfs (production decision)
- Define failure model: single disk? dual disk? controller lies? bit rot? operator error? ransomware blast radius?
- Define workload: VMs, databases, object storage, backups, developer home dirs, OS root.
- Decide redundancy approach: ZFS mirror/RAIDZ vs btrfs RAID1/10 (or mdadm + btrfs).
- Decide snapshot strategy: which datasets/subvolumes get snapshots; retention; replication chain; deletion protection.
- Plan observability: scrub schedules, alerting on device errors, capacity forecasting, snapshot growth alarms.
- Rehearse recovery: replace a disk; roll back a snapshot; restore a replicated dataset to a clean host.
Checklist B: ZFS deployment plan (boring, correct, repeatable)
- Choose vdev geometry: mirrors for IOPS/latency; RAIDZ2 for capacity + resilience; avoid heroic single wide vdev for mixed workloads.
- Set ashift correctly (usually 12) at pool creation; you don’t want to “discover” sector size reality later.
- Define datasets per workload; set recordsize, compression, atime.
- Set up snapshot policy and replication; verify incremental sends.
- Schedule scrubs and alert on zpool status changes and checksum errors.
- Document disk replacement procedure and keep spare capacity for resilvers.
Checklist C: btrfs deployment plan (safe lanes first)
- Decide if it’s single-disk, RAID1, or RAID10. Treat parity RAID as a special project, not a default.
- Use subvolumes as boundaries; snapshot the ones you can reasonably roll back.
- Enable compression (zstd is a common choice) and apply mount options consistently across every mount of the filesystem.
- Plan balance operations: filtered, scheduled, monitored.
- Scrub periodically; watch device stats; investigate IO errors early.
- Test send/receive restores; keep parent snapshots until children are safely replicated.
Three corporate-world mini-stories (plausible, technically accurate, and painful)
Mini-story 1: The incident caused by a wrong assumption
They had a neat setup: btrfs on a pair of SSDs, snapshots every hour, and a proud little dashboard that showed “snapshot count: healthy.” A junior engineer asked the obvious question—“Are we backing up off-host?”—and got waved off with “snapshots are basically backups; we can roll back anything.” That sentence is the storage equivalent of leaving your keys in the car because you have insurance.
The incident wasn’t dramatic at first. A firmware bug in an SSD started throwing intermittent read errors, then escalated to the drive dropping off the bus. The system kept running; redundancy covered the immediate failure. The team replaced the disk. Then a week later, the second SSD exhibited similar behavior—same batch, same firmware, different timing. Suddenly the box was down and the filesystem was unavailable.
The uncomfortable meeting wasn’t about the failure. Drives fail; that’s why we mirror. The meeting was about the discovery that their “backup plan” lived on the same chassis. Snapshots were immaculate and useless. The restore path was “hope the vendor can recover NAND,” which is not a plan so much as a prayer with a purchase order.
What changed afterward was delightfully boring: off-host replication using send/receive to a separate system, plus monthly restore tests. The team stopped arguing about whether snapshots are backups, because they had a calendar invite that proved the restore worked.
Mini-story 2: The optimization that backfired
A virtualization cluster was moved to ZFS on Linux because the team wanted clean snapshots and replication. Early testing showed decent performance. Then someone noticed sync-heavy workloads were “a bit slow,” and a well-meaning performance enthusiast proposed a fix: turn off sync semantics on the VM dataset. It made the graphs look incredible.
For a while, everyone was happy. Latency was down, throughput was up, and the storage team looked like wizards. The only dissent came from the person who kept asking, “What does the database think fsync means?” Nobody likes that person until they’re right.
Then came an unscheduled power event—nothing exotic, just a facility hiccup plus a UPS that did its best impression of a decorative box. Several VMs booted, but a subset had corrupted application state. One database recovered. Another didn’t. The root cause wasn’t mysterious: the dataset was configured in a way that allowed acknowledged writes to evaporate on crash. ZFS did exactly what it was told. The filesystem didn’t bite them; the configuration did.
The postmortem outcome was classic: restore from replicas, then revert to safe sync settings, then add a proper low-latency sync path where it mattered. They also learned a painful but useful lesson: benchmarks are not durability tests. If you change semantics, you have changed the contract—not just the speed.
Mini-story 3: The boring but correct practice that saved the day
One team ran ZFS for a file service that everyone forgot about until it broke—which is the highest compliment you can pay storage. They had two habits that felt annoyingly old-school: monthly scrubs and routine reviews of pool health alerts. Not “when we get around to it,” but actually scheduled, with an on-call runbook.
During one scrub, ZFS reported a small number of checksum errors on a single disk. Performance looked normal; users weren’t complaining. The temptation was to ignore it. But checksums don’t lie for fun, and ZFS is not prone to melodrama. They pulled SMART stats, found a growing reallocated sector count, and replaced the disk during business hours.
Weeks later, a different disk in the same chassis experienced a sudden failure. If they hadn’t proactively replaced the first disk, that second failure would have happened during a degraded window. Instead, the pool shrugged and carried on. The incident ticket was a non-event: “Disk failed; replaced; resilvered; no data loss.”
This is the secret truth of reliable storage: the heroic recovery story is rarely a sign of excellence. The best storage story is the one nobody tells because nothing interesting happened.
FAQ
1) Is ZFS “safer” than btrfs?
In many production configurations, ZFS has a more consistently predictable integrity and recovery story, especially with multi-disk RAIDZ/mirror setups. btrfs can be very safe too—particularly single-disk or RAID1/10 with disciplined operations—but its risk profile depends more heavily on which features you use.
2) Can I use btrfs like ZFS with snapshots and replication?
Yes: btrfs subvolume snapshots plus btrfs send/receive can provide an efficient replication workflow. The catch is operational discipline: you must manage snapshot parent chains and retention carefully, and you should test restores routinely.
3) What’s the practical difference between ZFS RAIDZ and btrfs RAID profiles?
ZFS RAIDZ is part of a unified pool design where redundancy is defined by vdevs and is central to the pool’s identity. btrfs uses per-chunk profiles for data and metadata that you can convert with balance. ZFS is more rigid but more deterministic; btrfs is more flexible but requires more careful operational choices.
4) Are scrubs mandatory?
If you care about integrity, yes. Scrubs are how you find latent errors while redundancy still exists. Without scrubs, you’re discovering corruption only when you read the affected data—often during restores or audits, which is the worst time to learn.
5) When should I use compression?
Often by default on both filesystems, especially with zstd. Compression frequently improves performance because fewer bytes hit the disk. The exceptions are already-compressed media and CPU-constrained systems; even then, test rather than assume.
6) Do I need ECC RAM for ZFS?
ECC is strongly recommended for any serious storage system, regardless of filesystem. ZFS benefits from ECC because it is aggressive about caching and integrity. But “no ECC” doesn’t automatically mean “don’t use ZFS”; it means understand your risk and monitor more tightly.
7) Why do my deletions not free space when I have snapshots?
Because snapshots retain references to old blocks. Deleting files removes them from the live view, but the blocks remain referenced by snapshots. On ZFS, inspect usedbysnapshots. On btrfs, inspect snapshot/subvolume usage and consider quotas (qgroups) if you need enforcement.
8) Should I put ZFS on hardware RAID?
Usually no. ZFS wants direct disk access to manage redundancy and error reporting correctly. If you must use a RAID controller, configure it as HBA/JBOD mode if possible. “Double RAID” often results in worse failure visibility and more confusing recovery.
9) What’s the simplest safe btrfs multi-disk setup?
btrfs RAID1 for both data and metadata on two devices is a common, relatively boring baseline. RAID10 is also strong when you have four or more devices and want better performance and resilience characteristics.
10) What’s the simplest safe ZFS setup?
Mirrors. A pool of mirror vdevs scales predictably for random IO and keeps recovery straightforward. RAIDZ2 is also common when capacity efficiency matters, but design it carefully for workload and rebuild risk.
Conclusion
If you want the filesystem that most consistently behaves like a disciplined SRE—tracking checksums, naming the guilty disk, making replication a first-class habit—ZFS is the safer bet, especially for multi-disk pools and long retention. It’s not effortless, but it’s coherent, and coherence is half of reliability.
If you want kernel-native integration, elegant subvolume snapshot workflows, and a filesystem that’s excellent for OS-level rollback and reasonably safe mirrored storage, btrfs is a strong choice—provided you stay in the well-lit paths (single/RAID1/RAID10) and treat maintenance operations like real production events.
Either way, the filesystem doesn’t save you from not practicing. Run scrubs. Test restores. Keep snapshot policies boring. And remember: the best storage system is the one that turns disasters into routine maintenance tickets.