ZFS vs btrfs: Where btrfs Feels Nice and Where It Hurts

The outage doesn’t start with a bang. It starts with a dashboard that looks “fine,” a latency graph that gets a little spiky,
and a storage layer that quietly stops doing what you assumed it was doing. Two hours later, you’re staring at I/O wait,
“no space left on device” errors with plenty of free GiB, and a replication job that’s now a crime scene.

ZFS and btrfs both promise modern filesystem features—checksums, snapshots, copy-on-write. In production they feel very different.
ZFS behaves like a stern adult who insists on rules. btrfs behaves like a talented intern: fast, flexible, occasionally brilliant,
and capable of surprising you at 2 a.m. if you didn’t set boundaries.

My stance: what I deploy where

If you need predictable behavior under load, boring recovery semantics, and you can tolerate the “ZFS is a storage stack,
not just a filesystem” worldview: pick ZFS.

If you need tight Linux integration, easy root-on-filesystem snapshots, fast iteration, and you’re willing to learn btrfs’s
failure modes: btrfs is genuinely nice—especially for workstations, build machines, and some single-node services with good backups.

If you’re thinking “I’ll do btrfs RAID5/6 because it’s built-in and convenient,” don’t. Use mdraid underneath, or use ZFS, or
use hardware RAID if you must (and accept the tradeoffs). btrfs RAID5/6 has improved over the years but still carries enough caveats
that you should treat it like experimental medicine: only with informed consent and a second opinion.

Interesting facts and historical context (the stuff that explains today’s sharp edges)

  • ZFS started at Sun Microsystems as an integrated volume manager + filesystem, designed to end silent corruption and “RAID write hole” headaches.
  • btrfs was created at Oracle as a next-gen Linux filesystem with pooling, snapshots, and checksumming, aiming at similar goals but within Linux’s ecosystem.
  • ZFS on Linux became OpenZFS, a cross-platform project that standardized features and behavior across illumos, Linux, and more.
  • btrfs landed in the Linux kernel and evolved in public, which is great for availability but also means you inherit kernel + tooling versions as part of ops reality.
  • Copy-on-write is the common ingredient, but ZFS’s transaction groups and btrfs’s B-trees and allocation strategies lead to different fragmentation and latency patterns.
  • ZFS has a strong reputation for scrub/resilver semantics, including “repair by redundancy” that’s straightforward when mirrors/RAIDZ are configured correctly.
  • btrfs RAID1 is not “two-disk RAID1” in the classic sense; it’s “two copies across devices” at the chunk level, and that nuance matters for capacity and failure expectations.
  • ZFS’s licensing (CDDL) kept it out of the Linux kernel, which impacts packaging and integration decisions for enterprises.
  • btrfs send/receive matured alongside snapshot-heavy workflows (think: immutable-ish OS deployments and rapid rollback), which is why some distros love it for root filesystems.

Mental models that stop bad decisions

1) ZFS wants to own the block device story

ZFS expects to see disks (or partitions) and handle redundancy, caching decisions, and integrity end-to-end. You can layer it on
top of hardware RAID, but you’re choosing to blind ZFS to individual disk failure behavior and you’re making your own life harder
when something goes weird. ZFS’s best features show up when ZFS has real visibility.

2) btrfs wants to be “the Linux filesystem that does a lot”

btrfs is in-kernel and plays nicely with systemd tooling, distro installers, and root filesystem snapshot workflows. It’s also happiest
when treated like a filesystem first. The volume management is powerful, but it’s not always the place you want to take risks.

3) Both are CoW; both can punish “random overwrite at scale”

Databases, VM images, and log-heavy workloads can trigger write amplification and fragmentation on any CoW system. ZFS has knobs
like recordsize, logbias, special vdevs, and a mature ARC/L2ARC story. btrfs has compression choices,
mount options, and the ability to disable CoW per-file (chattr +C). Neither gives you a free lunch.

One paraphrased idea from John Allspaw: reliability comes from treating operations as a design problem, not a firefight.

Where btrfs feels nice

Root filesystem snapshots that don’t feel like a science project

btrfs is fantastic when your OS is a set of subvolumes and snapshots are part of daily life. Rollback after a bad package upgrade?
Easy. Create a disposable test environment by snapshotting @home or a build tree? Easy.

Distros that lean into btrfs aren’t doing it for novelty. They want atomic-ish updates and quick rollbacks without bolting on
a separate volume manager. In the real world, “I can roll back quickly” is an uptime feature.
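
A minimal sketch of the habit, assuming your root is a subvolume and a /.snapshots directory exists (both are layout assumptions, not universal defaults): take a read-only snapshot before any risky change.

cr0x@server:~$ sudo btrfs subvolume snapshot -r / /.snapshots/pre-upgrade-2025-12-21

How you roll back varies by distro (btrfs subvolume set-default plus a reboot, or a snapshot-aware boot menu), but the snapshot itself is instant and costs almost nothing until data diverges.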

Send/receive is straightforward for many workflows

btrfs send/receive is a practical replication primitive. For backup targets, lab environments, or “ship snapshots to another box”
setups, it’s approachable and fast enough. You can wire it into timers, keep retention simple, and move on with your life.
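
A sketch of the pattern, with backup01 and the paths as placeholders; send works on read-only snapshots, and -p makes the stream incremental against a parent snapshot that already exists on the target:

cr0x@server:~$ sudo btrfs subvolume snapshot -r /home /home/.snap-2025-12-21
cr0x@server:~$ sudo btrfs send -p /home/.snap-2025-12-20 /home/.snap-2025-12-21 | ssh backup01 sudo btrfs receive /srv/backups/home

Wrap it in a timer, prune old snapshots on both ends, and you have replication a small team can reason about.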

Online resize and device management without rethinking your whole stack

Adding a disk, removing a disk, converting profiles (e.g., single → RAID1), resizing filesystems online—btrfs gives you tools for
this without forcing you into a separate LVM layer. That’s convenient when you’re running fleets of small nodes that change shape.
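
For instance, growing a single-device filesystem into a two-device RAID1 is a two-command sketch (device and mount point are placeholders); note that the conversion balance rewrites existing chunks, so schedule it for a quiet window:

cr0x@server:~$ sudo btrfs device add /dev/sdc /data
cr0x@server:~$ sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /data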

Compression that tends to be “worth it”

Transparent compression (especially zstd) is a quiet superpower. It can reduce write I/O and stretch SSD endurance. On mixed workloads,
it’s often a net win. On already-compressed media, it’s just wasted CPU—so measure, don’t vibe.
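
The usual way to enable it is a mount option in fstab; the level is a tradeoff you should benchmark (zstd:1 is cheaper on CPU), and the UUID here is a placeholder:

# /etc/fstab entry with transparent zstd compression (illustrative)
UUID=<fs-uuid>  /data  btrfs  defaults,noatime,compress=zstd:1  0  0

Remember that compress= applies to newly written data only; existing files stay as they are unless you rewrite them (for example with btrfs filesystem defragment -czstd, which has its own snapshot-related costs).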

Subvolumes are operationally useful

Subvolumes behave like separate filesystems for snapshots and quotas, but they share the same pool. That makes it easy to isolate
retention policies: keep aggressive snapshots for @, less aggressive for @var, and treat big churn directories
separately.
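
A sketch of the idea, with the subvolume names as conventions rather than requirements:

cr0x@server:~$ sudo btrfs subvolume create /mnt/pool/@var_log
cr0x@server:~$ sudo btrfs subvolume create /mnt/pool/@builds

# mount each with its own options and retention expectations (fstab, illustrative)
UUID=<fs-uuid>  /var/log  btrfs  subvol=@var_log,noatime,compress=zstd:1  0  0

Snapshot @ hourly if you like, snapshot @var_log rarely, and never snapshot @builds at all: same pool, different policies.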

Joke #1: btrfs subvolumes make you feel organized—right up until you realize your snapshot retention policy is a museum exhibit.

Where btrfs hurts (and why)

Metadata space exhaustion: “No space left” while df says you’re fine

btrfs allocates space in chunks for data and metadata. You can have plenty of free data space and still be dead in the water because
metadata chunks are full or badly distributed across devices. The symptoms are brutal: writes fail, snapshots fail, sometimes deletes
fail. It feels like a prank.

The fix is usually a combination of freeing snapshots, running a targeted balance, and sometimes converting metadata profiles.
The prevention is monitoring btrfs filesystem df and keeping headroom.

Balance is a maintenance tool, not a magic “defrag my life” button

A balance rewrites chunks. That can be expensive and can wreck performance if you run it at the wrong time or with the wrong filters.
It’s also sometimes necessary to fix space distribution or apply profile changes. Treat balance like a surgical instrument, not a broom.

RAID semantics can surprise people who learned RAID in 2005

btrfs RAID1 is chunk-based replication, not necessarily symmetric “half capacity” mirroring in all failure conditions. RAID10 has
its own layout realities. And RAID5/6 is the area where you should be extremely cautious: it’s where corner cases and recovery paths
get complicated and where “it worked in testing” can turn into “why is scrub finding uncorrectable errors.”

Fragmentation and CoW amplification: VM images and databases are the usual victims

CoW is great until you overwrite large files in-place over and over (QCOW2 images, database files, mail spools). btrfs can handle it,
but you often need to opt out of CoW for specific paths, pick the right mount options, and avoid snapshotting high-churn files blindly.

Recovery tooling is improving, but it can still feel like spelunking

When btrfs is healthy, it’s pleasant. When it’s not, you’ll spend time with btrfs check warnings, rescue modes, and very
careful decisions about whether to attempt repair offline. The tooling is real, but the “operator confidence” curve is steeper.

Joke #2: “No space left on device” on btrfs sometimes means “there is space, but it’s in metadata, and metadata is emotional.”

Where ZFS wins, bluntly

Predictable integrity model: checksums, scrub, self-heal (when redundancy exists)

ZFS’s claim to fame is end-to-end checksumming with automatic repair from redundancy. Scrub is first-class. When a disk lies or a cable
flakes, ZFS is more likely to tell you clearly what happened and what it did about it.

Operational clarity: pools, vdevs, and rules you can teach

ZFS forces you to think in vdevs. That sounds annoying until you’re on call. The model is consistent: mirrors mirror, RAIDZ is RAIDZ,
and you don’t get to “kinda-sort-of” your way into a layout. That rigidity prevents a lot of creative disasters.

ARC/L2ARC and SLOG are mature and widely understood

ZFS caching is not magic, but it is coherent. ARC sizing, special vdevs for metadata/small blocks, and separate intent log devices
have a lot of operational folklore behind them—which is another way of saying many people have already made the mistakes for you.

Replication is boring in a good way

ZFS send/receive is battle-tested. Incremental streams are reliable. The edge cases are known. If you build a snapshot strategy and
retention policy, it tends to keep working until you change something large (like recordsize or encryption strategy) without thinking.
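
A sketch of one incremental cycle, with pool names and the backup host as placeholders; -i sends only the delta between two snapshots, and receive -F rolls the target back to the last common snapshot before applying it:

cr0x@server:~$ sudo zfs snapshot tank/vmstore@2025-12-21
cr0x@server:~$ sudo zfs send -i tank/vmstore@2025-12-20 tank/vmstore@2025-12-21 | ssh backup01 sudo zfs receive -F backup/vmstore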

The trade: ZFS is heavier and more “storage-stack-ish”

ZFS has memory appetite, a different packaging story on Linux, and a strong preference to manage the disks itself. In exchange, you
get a system that behaves predictably when you treat it correctly. ZFS does not reward improvisation.

Performance tuning: the levers that actually matter

Workload fit beats micro-optimizations

Start with the access pattern. Large sequential writes? Random sync writes? Millions of small files? VM images? Your filesystem choice
and settings should match reality. If you don’t know reality, measure it first. Guessing is how you end up “optimizing” the wrong layer.
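
If you genuinely don’t know the pattern, a short fio probe on a scratch directory (the path and job parameters are illustrative; tune them to mimic the real application) tells you more than an afternoon of guessing:

cr0x@server:~$ fio --name=probe --directory=/data/fio-scratch --rw=randwrite --bs=4k --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=60 --time_based --group_reporting

Compare the latency percentiles against what the application actually needs before touching a single filesystem knob.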

btrfs knobs that change outcomes

  • Compression: compress=zstd is often a win; test CPU overhead. For latency-sensitive workloads, consider compress=zstd:1.
  • NOCOW for hot overwrite files: set chattr +C on directories for VM images/databases (and ensure no snapshots depend on CoW semantics there).
  • Autodefrag: helps for some small random write workloads; can hurt on large sequential I/O. Use sparingly and validate.
  • Space caching/free space tree: modern kernels handle this better; mismatched tooling versions can still create weirdness after upgrades.
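
Whatever combination you land on, verify what is actually in effect rather than what you think you configured; a quick check:

cr0x@server:~$ findmnt -t btrfs -o TARGET,OPTIONS

Options from fstab, options carried over from the filesystem’s first mount, and kernel defaults can disagree, and findmnt shows the live truth.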

ZFS knobs that change outcomes

  • recordsize: match to workload. Big files want bigger recordsize; databases often want smaller blocks.
  • compression: lz4 is the “default sane”; sometimes zstd is worth it if supported and CPU is available.
  • sync behavior: don’t disable sync writes to “go fast” unless you’re comfortable losing data you thought was safe.
  • special vdev: can transform metadata-heavy workloads, but it’s also a design commitment: losing it can lose the pool.
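
A sketch of applying the common ones to a VM dataset (the 16K recordsize is an assumption about the guest workload, not a universal recommendation); note that recordsize only affects newly written blocks:

cr0x@server:~$ sudo zfs set recordsize=16K compression=lz4 tank/vmstore
cr0x@server:~$ sudo zfs get recordsize,compression,sync tank/vmstore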

A note on SSDs, TRIM, and write amplification

Both stacks need sensible discard/TRIM behavior. Discarding too aggressively can hurt latency; never discarding can let long-term performance degrade.
For ZFS, periodic trim and device firmware quality matter. For btrfs, discard mount options versus periodic fstrim is
a tuning choice you should measure on your hardware.
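
A sketch of the periodic approach on both stacks (zpool trim assumes devices that support TRIM; fstrim.timer assumes a systemd distro that ships it):

cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank
cr0x@server:~$ sudo systemctl enable --now fstrim.timer

Run it during a quiet window first and watch latency; that measurement is the whole point.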

Practical tasks with commands, outputs, and decisions (12+)

These are real operator moves: what you run, what you expect to see, and what decision you make next. I’ll mix ZFS and btrfs so you
can compare the “shape” of operations.

Task 1: btrfs health check at a glance

cr0x@server:~$ sudo btrfs filesystem show
Label: 'rootfs'  uuid: 6d3b3fe1-0a9e-4e8d-9bf0-73a55c6a8f2f
	Total devices 2 FS bytes used 412.31GiB
	devid    1 size 931.51GiB used 520.00GiB path /dev/nvme0n1p2
	devid    2 size 931.51GiB used 520.00GiB path /dev/nvme1n1p2

What it means: devices, sizes, and “used” at the chunk allocation level, not the same as df.

Decision: if “used” is near device size, you’re chunk-saturated; plan balance/free space work before changes.

Task 2: btrfs data vs metadata usage (the classic trap detector)

cr0x@server:~$ sudo btrfs filesystem df /
Data, RAID1: total=480.00GiB, used=402.10GiB
Metadata, RAID1: total=16.00GiB, used=15.72GiB
System, RAID1: total=32.00MiB, used=16.00KiB

What it means: metadata is almost full. That’s how you get “ENOSPC” with free data space.

Decision: stop snapshot churn, delete old snapshots, then run a targeted balance for metadata (not a full one).

Task 3: targeted btrfs balance for metadata pressure

cr0x@server:~$ sudo btrfs balance start -musage=50 /
Done, had to relocate 12 out of 32 chunks

What it means: it relocated metadata chunks that were less than 50% used, consolidating and freeing chunk space.

Decision: re-check btrfs filesystem df. If still tight, repeat with a slightly higher threshold or
address snapshot count and small-file churn.

Task 4: btrfs scrub status and what “corrected” implies

cr0x@server:~$ sudo btrfs scrub status /
UUID:             6d3b3fe1-0a9e-4e8d-9bf0-73a55c6a8f2f
Scrub started:    Sat Dec 21 02:11:03 2025
Status:           finished
Duration:         0:32:44
Total to scrub:   410.23GiB
Rate:             213.45MiB/s
Error summary:    read=0, csum=2, verify=0, corrected=2, uncorrectable=0

What it means: checksums mismatched twice and redundancy repaired it.

Decision: treat corrected errors as a hardware signal. Check SMART, cabling/backplane, and kernel logs; don’t
just celebrate and move on.

Task 5: btrfs device stats (persistent counters)

cr0x@server:~$ sudo btrfs device stats /
[/dev/nvme0n1p2].write_io_errs   0
[/dev/nvme0n1p2].read_io_errs    0
[/dev/nvme0n1p2].flush_io_errs   0
[/dev/nvme0n1p2].corruption_errs 2
[/dev/nvme0n1p2].generation_errs 0
[/dev/nvme1n1p2].write_io_errs   0
[/dev/nvme1n1p2].read_io_errs    0
[/dev/nvme1n1p2].flush_io_errs   0
[/dev/nvme1n1p2].corruption_errs 0
[/dev/nvme1n1p2].generation_errs 0

What it means: corruption errors occurred on one device path.

Decision: run SMART extended tests and inspect dmesg for NVMe resets; plan a proactive replacement if this repeats.

Task 6: Disable CoW for a directory holding VM images (btrfs)

cr0x@server:~$ sudo mkdir -p /var/lib/libvirt/images
cr0x@server:~$ sudo chattr +C /var/lib/libvirt/images
cr0x@server:~$ lsattr -d /var/lib/libvirt/images
---------------C------ /var/lib/libvirt/images

What it means: new files created in that directory will be NOCOW, reducing fragmentation for overwrite-heavy images.

Decision: do this before creating the big image files; existing files won’t be magically rewritten. Also:
don’t snapshot those images expecting CoW-based efficiency.

Task 7: btrfs list subvolumes and decide snapshot retention

cr0x@server:~$ sudo btrfs subvolume list /
ID 256 gen 19234 top level 5 path @
ID 257 gen 19233 top level 5 path @home
ID 890 gen 19110 top level 256 path .snapshots/2025-12-01
ID 891 gen 19140 top level 256 path .snapshots/2025-12-02
ID 892 gen 19170 top level 256 path .snapshots/2025-12-03

What it means: snapshots accumulate quickly; each one pins extents and can increase metadata pressure.

Decision: set retention to a number you can defend (daily for 7–14 days, weekly for a few weeks), then enforce it.

Task 8: ZFS pool health and error triage

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear'.
  scan: scrub repaired 0B in 0 days 00:41:12 with 0 errors on Sat Dec 21 03:02:11 2025
config:

	NAME                         STATE     READ WRITE CKSUM
	tank                         ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    ata-SAMSUNG_SSD1         ONLINE       0     0     1
	    ata-SAMSUNG_SSD2         ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
/tank/vmstore/qemu/win10.qcow2

What it means: ZFS detected a checksum error tied to a file path. That’s gold for incident response.

Decision: restore/repair the affected file from backup; run SMART on the disk with CKSUM increment; consider replace.

Task 9: ZFS dataset properties that commonly explain performance

cr0x@server:~$ sudo zfs get -o name,property,value -s local recordsize,compression,atime,sync,logbias tank/vmstore
NAME          PROPERTY     VALUE
tank/vmstore  atime        off
tank/vmstore  compression  lz4
tank/vmstore  logbias      latency
tank/vmstore  recordsize   128K
tank/vmstore  sync         standard

What it means: these settings are often the difference between “fine” and “why are my VMs stuttering.”

Decision: for VM images, consider smaller recordsize and possibly logbias=throughput depending on sync pattern; change only with measurement.

Task 10: ZFS ARC pressure check

cr0x@server:~$ sudo arcstat -f time,read,miss,arcsz,c
    time  read  miss   arcsz      c
03:12:01   842    71   19.2G   24.0G
03:12:02   901    88   19.2G   24.0G
03:12:03   877    92   19.2G   24.0G

What it means: ARC size (~19G) vs target (~24G), misses rising. If misses are high under steady workload, you may be RAM-starved.

Decision: add RAM or adjust workload/dataset settings; don’t knee-jerk add L2ARC without understanding working set.

Task 11: ZFS scrub and what to do with results

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Sat Dec 21 03:20:10 2025
	12.3G scanned at 1.02G/s, 1.4G issued at 118M/s, 480G total
	0B repaired, 0.29% done, 1:17:44 to go
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0

errors: No known data errors

What it means: scrub is progressing; “repaired” and “errors” are the key lines for integrity.

Decision: if you see repairs or CKSUM errors, correlate with SMART and plan device replacement; if clean, move up-stack.

Task 12: btrfs: detect whether free space is fragmented across chunks

cr0x@server:~$ sudo btrfs filesystem usage -T /
Overall:
    Device size:                   1.82TiB
    Device allocated:              1.04TiB
    Device unallocated:          816.00GiB
    Device missing:                  0.00B
    Used:                        824.20GiB
    Free (estimated):            496.10GiB      (min: 248.05GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               512.00MiB      (used: 0.00B)

What it means: “Free (estimated)” vs “min” shows uncertainty due to chunk allocation; “min” is the conservative number.

Decision: size your safety margin using “min,” not the optimistic estimate, especially before large snapshot operations.

Task 13: ZFS: identify which dataset is consuming space and snapshots

cr0x@server:~$ sudo zfs list -o name,used,avail,refer,usedbysnapshots -S used
NAME                USED  AVAIL  REFER  USEDSNAP
tank                3.21T  5.07T   192K            0B
tank/vmstore        1.84T  5.07T  1.02T          820G
tank/backups        1.12T  5.07T  1.10T          18G
tank/home           256G  5.07T   241G          15G

What it means: snapshots are pinning 820G in tank/vmstore.

Decision: adjust snapshot retention or exclude high-churn VM datasets from aggressive snapshot schedules; consider replication frequency separately.

Task 14: btrfs: replace a failing device (conceptual “do it before it dies” move)

cr0x@server:~$ sudo btrfs replace start /dev/nvme1n1p2 /dev/nvme2n1p2 /
Started on Sat Dec 21 03:40:11 2025, estimated time: 2:10:00
cr0x@server:~$ sudo btrfs replace status /
Status: finished

What it means: btrfs copied data to the replacement device online.

Decision: after completion, run a scrub and confirm it comes back clean. The replace operation releases the old device from the filesystem automatically, so what remains is to pull and decommission the hardware.

Fast diagnosis playbook: find the bottleneck before you “tune” anything

First: decide if this is capacity/metadata, integrity, or latency

  • If writes fail (ENOSPC) but df looks fine: suspect btrfs metadata chunk exhaustion or snapshot pinning.
  • If apps see corruption or checksum errors: suspect device, controller, cabling, firmware, or power-loss behavior.
  • If latency spikes but throughput is okay: suspect sync writes, fragmentation, or cache misses.

Second: check the storage layer’s own truth

For btrfs: btrfs filesystem df, btrfs filesystem usage -T, btrfs scrub status, and kernel logs.
For ZFS: zpool status -v, zpool iostat -v 1, and dataset properties.

Third: isolate whether you’re bound by CPU, RAM, or the devices

  • High iowait + low disk utilization: often queueing, firmware, controller issues, or sync latency.
  • High CPU in kswapd or memory pressure: ZFS ARC sizing or general RAM shortage.
  • One device saturated: uneven btrfs allocation/balance, or a ZFS vdev design bottleneck (one slow disk poisons a mirror/RAIDZ vdev’s latency).

Fourth: only then tune

Tuning without diagnosis is how you accidentally turn a recoverable situation into a weekend. Establish the constraint, then adjust one
variable, then measure again.

Common mistakes: symptom → root cause → fix

1) “No space left on device” but there’s hundreds of GiB free

Symptom: snapshot creation fails, writes fail, deletes fail intermittently.

Root cause: btrfs metadata chunks full; free space trapped in data chunks or poorly distributed across devices.

Fix: delete snapshots, stop churn, run btrfs balance start -musage=50 (or similar), and monitor btrfs filesystem df. Keep headroom.

2) “Scrub corrected errors” gets ignored

Symptom: occasional corrected checksum errors during scrub; system seems fine otherwise.

Root cause: device or path instability: flaky NVMe, cabling, HBA issues, backplane, or firmware bugs.

Fix: treat corrected errors as early warning. Run SMART tests, check dmesg, consider proactive replacement.

3) btrfs performance tanks after snapshot-heavy period

Symptom: random write latency increases, metadata ops slow, commit times increase.

Root cause: fragmentation + pinned extents from many snapshots, plus metadata overhead.

Fix: trim snapshot retention; separate high-churn data into NOCOW directories or separate subvolumes not snapshotted frequently.

4) ZFS “optimized” by disabling sync, then data loss after power event

Symptom: workload is fast; after crash/power loss, database or VM image corrupts.

Root cause: sync=disabled broke the durability guarantees that applications relying on fsync were counting on.

Fix: set sync=standard (or always where needed), use a proper SLOG if sync latency is the problem.
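
A sketch of that fix, with the device paths as placeholders; the SLOG should be a power-loss-protected device, and mirroring it is the cautious choice:

cr0x@server:~$ sudo zfs set sync=standard tank/vmstore
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-SLOG_A /dev/disk/by-id/nvme-SLOG_B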

5) “btrfs RAID1 means I can lose any disk anytime” assumption

Symptom: device loss causes unexpected missing data or failed mount in certain profile/layout situations.

Root cause: misunderstanding chunk-based replication and allocation; also mixing device sizes or near-full conditions can reduce flexibility.

Fix: validate failure scenarios, keep free space, use homogeneous devices where possible, and test replace/remove workflows before you need them.

6) ZFS pool designed with a single wide RAIDZ vdev for random I/O

Symptom: terrible random read/write IOPS compared to expectations.

Root cause: vdev is the IOPS unit; one big RAIDZ vdev has limited IOPS relative to multiple mirrors.

Fix: for IOPS-heavy workloads, prefer mirrored vdevs (striped mirrors) or more vdevs; don’t expect RAIDZ width to create IOPS.
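
For comparison, a sketch of a pool built for IOPS rather than capacity (device IDs are placeholders); each additional mirror vdev adds roughly one device’s worth of random IOPS:

cr0x@server:~$ sudo zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D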

Three corporate mini-stories (anonymized, plausible, and painfully familiar)

Story 1: The incident caused by a wrong assumption

A medium-sized SaaS company migrated build agents to btrfs because snapshots made cleanup easy. Each build ran in a disposable snapshot,
then rolled back. The team loved it. Storage stayed clean. Deployments sped up.

Then the build cluster started failing with “No space left on device.” df -h showed hundreds of gigabytes free. The on-call
assumed it was a container overlay issue and restarted a bunch of agents, which helped exactly zero percent.

The wrong assumption: “free space is free space.” On btrfs, metadata allocation can be your real limit. They had huge numbers of small files,
plus aggressive snapshot retention “just in case,” and metadata RAID1 chunks were saturated.

The fix was not heroic. They deleted old snapshots, ran a targeted metadata balance, and reduced snapshot frequency. They also split build
artifact caches into a separate subvolume that wasn’t snapshotted at all. The cluster became boring again.

The lesson: btrfs is friendly until you treat it like ext4 with superpowers. It isn’t. Watch metadata like it’s a first-class resource,
because it is.

Story 2: The optimization that backfired

A finance-ish company ran ZFS for VM storage. Someone noticed sync write latency and proposed the classic fix: “temporarily” set
sync=disabled on the VM dataset. The graphs improved immediately. High fives were exchanged. A ticket was created to “revisit later.”
The ticket lived a long and meaningful life in the backlog.

A few weeks later, a power incident occurred—nothing dramatic, just a PDU reboot that took longer than expected. Several VMs came back
with corrupted filesystems. One database started but had silent data inconsistencies that showed up as application errors, which is the
worst kind: the kind that looks like a software bug.

Post-incident, they discovered that the storage had been configured to lie. The applications were doing the right thing (fsync),
and the storage layer was shrugging.

The corrective work was expensive: restore from backups, validate data, and rebuild trust. The actual performance fix was boring:
a small, reliable SLOG device for sync-heavy datasets, and tuning recordsize for the VM workloads. Latency improved without gambling
durability.

The lesson: performance “wins” that change durability semantics are not optimizations. They’re bets. Production systems remember your bets.

Story 3: The boring but correct practice that saved the day

A healthcare-adjacent company ran a ZFS storage cluster with a strict routine: weekly scrubs, alerting on any checksum increments,
and a policy that “corrected errors require a ticket.” People rolled their eyes at the process because most weeks nothing happened.

One scrub reported a small number of corrected errors on a single disk in a mirror vdev. The system stayed online; users noticed nothing.
The on-call opened the required ticket and scheduled a maintenance window to replace the disk.

During replacement they found the disk had borderline SMART attributes and intermittent link resets in logs. It wasn’t dead, just
untrustworthy. A week later, a similar model disk in another chassis failed outright. Because they’d replaced the “warning disk” early,
they avoided a double-fault scenario that could have turned into a long restore.

Meanwhile, their backups and replication were verified monthly with restore drills. When leadership asked why they needed “all this
extra procedure,” the answer was simply: because the day you need it is the day you don’t get to negotiate for it.

The lesson: scrubs, alerts, and disciplined replacement policies are dull. Dull is the sound of uptime.

Checklists / step-by-step plan

Choosing between ZFS and btrfs (decision checklist)

  1. Define the failure domain: single node, or multi-node with replication? If single node and high consequence: lean ZFS.
  2. Define the workload: VM images, databases, or small-file CI? Plan for CoW behavior explicitly.
  3. Decide on redundancy: mirrors/RAIDZ vs btrfs profiles vs mdraid under btrfs. Avoid btrfs RAID5/6 in serious environments.
  4. Define snapshot policy: retention, frequency, and where snapshots are forbidden (databases/VM images without a plan).
  5. Plan monitoring: scrub results, checksum errors, metadata usage (btrfs), pool health (ZFS), SMART health everywhere.
  6. Practice replace workflows: replace a disk in a test environment. Time it. Document it. Make it boring.

btrfs operational baseline (weekly/monthly)

  1. Weekly scrub (staggered across hosts).
  2. Weekly check: btrfs filesystem df and btrfs filesystem usage -T for metadata pressure.
  3. Enforced snapshot retention with a hard cap.
  4. Quarterly: test btrfs replace workflow in staging.
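
A minimal way to schedule the weekly scrub is a cron entry (the schedule and path are illustrative; some distros ship a btrfsmaintenance package or scrub timers that do this more cleanly):

30 3 * * 0  /usr/bin/btrfs scrub start -B /

The -B flag keeps scrub in the foreground so the job doesn’t return immediately and its output and exit status can feed your alerting.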

ZFS operational baseline (weekly/monthly)

  1. Regular scrubs (often monthly for large pools, weekly for smaller or more failure-prone hardware).
  2. Alert on any zpool status CKSUM increments and any “permanent errors” with file paths.
  3. Snapshot + replication with periodic restore drills.
  4. Capacity policy: don’t run pools hot; keep headroom for resilver and fragmentation.

FAQ

1) Is btrfs “production ready”?

For many uses, yes: single-disk, RAID1/10, root filesystems, snapshot-heavy workflows, and backup targets. But “production ready”
depends on whether your team understands metadata behavior, balance, and your chosen RAID profile limitations.

2) Is ZFS always safer than btrfs?

ZFS has a longer reputation for predictable integrity and recovery, especially with redundancy. But safety also depends on hardware,
monitoring, operational discipline, and not “optimizing” away durability.

3) Should I run btrfs RAID5/6?

If you need boring reliability, avoid it. Use ZFS RAIDZ, or use mdraid (5/6) with a simpler filesystem, or redesign with mirrors.
Convenience is not a feature during a rebuild.

4) Why does btrfs show free space estimates with “min”?

Because chunk allocation and profiles affect how much space is actually usable for new allocations. The conservative “min” value is
the one you should trust for safety planning.

5) Can ZFS be used as a root filesystem on Linux?

Yes, but it depends on distro support, initramfs integration, and packaging. Operationally, it’s doable, but btrfs usually feels
more native for root-on-Linux snapshot workflows.

6) What’s the fastest way to get better VM performance on btrfs?

Put VM images on NOCOW directories (set before creating files), avoid snapshotting them aggressively, and validate with latency metrics.
Also check compression and autodefrag choices; they can help or hurt depending on access patterns.

7) What’s the fastest way to get better VM performance on ZFS?

Use mirrors for IOPS, set dataset properties appropriately (recordsize, compression), keep sync behavior correct, and add a SLOG only
if you have real sync write latency and a device you trust.

8) Do I still need backups if I have snapshots?

Yes. Snapshots are not backups; they live in the same failure domain. Both ZFS and btrfs make snapshot-based replication feasible—
use that, and verify restores.

9) How much free space should I keep?

More than you want to. For btrfs, headroom reduces metadata crises and makes balance/relocation possible. For ZFS, headroom improves
performance and resilver safety. Running “hot” turns maintenance tasks into incidents.

10) Which is easier to operate with a small team?

If the team is Linux-first and wants root snapshots and simple tooling: btrfs can be easier. If the team wants a strict model and
predictable recovery semantics: ZFS tends to reduce surprises, at the cost of learning its vocabulary.

Next steps (practical conclusion)

Pick ZFS when you want the storage stack to behave like an adult: clear failure signals, strong integrity defaults, and a design that
forces good architecture. Pick btrfs when you want Linux-native snapshot workflows, flexible subvolumes, and you’re willing to monitor
metadata and treat balance/scrub as routine care.

Then do the unglamorous work:

  • Write down your snapshot retention policy. Enforce it automatically.
  • Schedule scrubs. Alert on corrected errors. Treat them as hardware smoke.
  • Practice disk replacement on a non-critical box. Time it. Document it.
  • For CoW pain points (VMs, databases), make an explicit plan: NOCOW directories on btrfs, dataset tuning and vdev design on ZFS.
  • Run a restore drill. If you can’t restore, you don’t have a backup—you have a wish.

Storage is where optimism goes to retire. Set it up to be boring, and it will repay you by letting you sleep.
