You don’t want “snapshots”. You want the confidence to roll back a broken upgrade at 02:00 and have the system come up cleanly—network, SSH, services, the whole deal—without you spelunking initramfs logs like a cave diver with a headlamp.
Most ZFS-on-root guides get you to “it boots once.” Production demands more: predictable dataset layout, bootloader integration that respects boot environments, and operational habits that don’t turn your snapshot history into a crime scene.
What “rollback” must mean on a real system
A rollback that “works” isn’t just a zfs rollback command that exits without errors. It means:
- The bootloader can enumerate and boot a previous root state (boot environment), not just mount some dataset if you hand-edit kernel parameters.
- The initramfs can import your pool (and unlock it if encrypted) consistently and automatically.
- /boot is coherent with the kernel+initramfs inside the root environment you’re booting—or deliberately separated and managed as a first-class exception.
- Service state and config are scoped so rolling back root doesn’t resurrect an incompatible on-disk state for apps that write constantly (databases, queues, metrics stores).
- The operator workflow is obvious: create BE, upgrade, reboot into BE, verify, keep or revert.
In plain English: you’re not installing “ZFS,” you’re installing a time machine with a bootloader as the control panel. And the control panel must keep working after you time travel.
Opinionated take: if you cannot list your boot environments from the boot menu or a predictable CLI and boot into them without editing anything, your rollbacks are theater. Theater is fun, but not at 02:00.
Short joke #1: A rollback plan that requires “some manual steps” is just a forward-only plan with denial baked in.
Interesting facts and context (because history explains today’s sharp edges)
- ZFS was born at Sun in the mid‑2000s with end-to-end checksumming and copy-on-write baked in; it was designed to make silent corruption less exciting.
- “Pooled storage” was a mindset shift: filesystems stopped being tied to one block device; you manage a pool and carve datasets and volumes from it.
- Boot environments became a ZFS cultural norm on Solaris/Illumos, where upgrading into a new BE and keeping the old one bootable was the expected workflow, not an exotic trick.
- Linux ZFS is out-of-tree due to licensing incompatibilities; this is why distro integration varies and why “it worked on my laptop” isn’t a plan.
- GRUB gained ZFS support years ago, but support depends on feature flags; enabling every shiny pool feature can strand your bootloader in the past.
- OpenZFS feature flags replaced on-disk version numbers; this made incremental evolution easier but created a new operational rule: boot components must support the flags you enable.
- ACL and xattr behaviors differ by OS expectations; ZFS can emulate, but mismatched settings can create subtle permission surprises after rollback.
- Compression stopped being controversial once modern CPUs got fast; today, lz4 is often a free win, but it still needs to be applied with intent.
- Snapshots are cheap until they aren’t: metadata and fragmentation can sneak up, especially when snapshots are kept forever and churny datasets live on the same root tree.
Rollbacks fail when you treat ZFS-on-root as “a filesystem choice” rather than “a boot architecture choice.” The filesystem is the easy part. Boot is where your confidence goes to die.
Design goals: the rules that keep rollbacks boring
Rule 1: Separate what changes together
Operating system bits (packages, libraries, config) should be in the boot environment. High-churn application data should not. If you roll back root and your database files roll back too, you’ve built a time machine that casually violates causality.
Do: keep /var/lib split into datasets by app, and put databases on dedicated datasets (or even separate pools) with their own snapshot policies.
Don’t: put everything under one monolithic rpool/ROOT dataset and call it “simple.” It is simple the way a single large blast radius is simple.
Rule 2: Boot sees what it needs, and no more
The initramfs and bootloader need a stable way to find and mount the right root. That typically means: a predictable bootfs, clear BE naming, and avoiding features the bootloader can’t read.
Rule 3: Your rollback boundary is a boot environment, not a snapshot
Snapshots are raw materials. A boot environment is a curated snapshot+clone (or dataset clone) with a coherent / and usually a coherent /boot story. Use the right tool. Stop trying to boot random snapshots like it’s a party trick.
Rule 4: Default properties should be defaults for a reason
Set properties at the right dataset level. Avoid setting everything on the pool root unless you want to debug inheritance for the rest of your life.
Dataset layout that makes boot environments behave
The dataset layout is where most “ZFS on root” installs quietly fail. They boot, sure. But later, you can’t roll back cleanly because half the state is shared, or because mountpoints fight each other, or because /boot is on an island that doesn’t match your BE.
A pragmatic layout (single-pool, GRUB, typical Linux)
Assume a pool called rpool. We create:
- rpool/ROOT (container, mountpoint=none)
- rpool/ROOT/default (the actual root BE, mountpoint=/)
- rpool/ROOT/<be-name> (additional boot environments)
- rpool/BOOT, or a separate non-ZFS /boot partition, depending on your distro and taste
- rpool/home (mountpoint=/home)
- rpool/var (mountpoint=/var, usually)
- rpool/var/log (log churn; different snapshot policy)
- rpool/var/lib (container, often)
- rpool/var/lib/docker or .../containers (if you must, but consider a separate pool)
- rpool/var/lib/postgresql, rpool/var/lib/mysql, etc. (separate on purpose)
- rpool/tmp (mountpoint=/tmp, with sync=disabled only if you like living dangerously; otherwise leave it alone)
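A minimal creation sketch for that layout, assuming the pool already exists and was created with -R /mnt so everything lands under the temporary root (the app dataset names are illustrative, not a prescription):

# containers hold structure, not data; they are never mounted themselves
zfs create -o mountpoint=none rpool/ROOT
# the actual boot environment
zfs create -o mountpoint=/ rpool/ROOT/default
# shared data that should NOT roll back with the OS
zfs create -o mountpoint=/home rpool/home
zfs create -o mountpoint=/var rpool/var
zfs create rpool/var/log               # inherits /var/log from the parent mountpoint
zfs create rpool/var/lib
zfs create rpool/var/lib/postgresql    # example app dataset; give it its own snapshot policy
zfs create -o mountpoint=/tmp rpool/tmp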
What about /boot?
Two viable patterns:
- /boot on a plain ext4 partition, with the ESP on vfat as usual: boring, compatible, easy. Rollbacks need to manage kernel/initramfs versions carefully, because /boot is shared across BEs. This is still fine if you use a BE tool that handles it.
- /boot on ZFS: possible with GRUB reading ZFS, but you must keep pool features compatible with GRUB and be very deliberate about updates. This can be clean if your distro’s tooling supports it; otherwise it’s an attractive nuisance.
My bias: if you’re doing this on fleet machines, keep /boot on a small ext4 partition and keep the ESP separate. Remove one moving part from early boot. You can still have strong rollback semantics with proper BE management; you just treat kernel artifacts carefully.
Install plan: a clean, rollback-friendly ZFS-on-root build
This is a generic plan that maps well to Debian/Ubuntu-style systems, and conceptually to others. Adjust package manager and initramfs tooling accordingly. The point isn’t the exact commands; it’s the architecture and verification steps.
Partitioning: choose predictability
Use GPT. Make an ESP. Make a dedicated /boot partition unless you know you need /boot on ZFS and you’ve tested GRUB compatibility with your pool features. Put the rest in ZFS.
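A sketch of that partitioning with sgdisk; the device name, sizes, and partition order are placeholders, not a prescription:

sgdisk --zap-all /dev/nvme0n1                        # wipes the partition table; make sure it's the right disk
sgdisk -n1:1M:+512M -t1:EF00 -c1:ESP  /dev/nvme0n1   # EFI system partition
sgdisk -n2:0:+1G    -t2:8300 -c2:boot /dev/nvme0n1   # dedicated ext4 /boot
sgdisk -n3:0:0      -t3:BF00 -c3:zfs  /dev/nvme0n1   # the rest goes to the pool
mkfs.vfat -F 32 /dev/nvme0n1p1
mkfs.ext4 /dev/nvme0n1p2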
Create the pool with boot in mind
Pick ashift correctly (almost always 12). Choose sane defaults: compression=lz4, atime=off, xattr=sa, and a consistent acltype for Linux (posixacl).
Encryption: if you encrypt root, decide how it unlocks (passphrase at console, keyfile in initramfs, or network-unlock). Don’t “figure it out later.” Later is when your system won’t boot.
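If you choose native ZFS encryption, setting it at pool creation means every dataset inherits it. A sketch assuming console passphrase unlock (your initramfs must know how to prompt for the key):

zpool create -f -o ashift=12 \
  -O encryption=aes-256-gcm -O keyformat=passphrase -O keylocation=prompt \
  -O compression=lz4 -O atime=off -O xattr=sa -O acltype=posixacl \
  -O mountpoint=none -R /mnt rpool /dev/nvme0n1p3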
Build datasets and mount them correctly
Set mountpoint=none on containers. Put the actual root dataset under rpool/ROOT/<be-name> with mountpoint=/. Then mount a temporary root at /mnt and install the OS into it.
Install bootloader + initramfs with ZFS support
You need the ZFS modules in initramfs and the pool import logic to be correct. That includes:
- A correct root=ZFS=rpool/ROOT/<be-name> kernel cmdline (or your distro’s equivalent).
- A pool cache file if your initramfs uses one (/etc/zfs/zpool.cache), plus a correct hostid.
- GRUB configured to find ZFS and list BEs (varies by distro/tooling).
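On a Debian/Ubuntu-style GRUB setup, the moving parts often look roughly like this; some distros generate the root=ZFS= argument for you during grub-mkconfig, so treat this as a sketch, not gospel:

# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/default"

# regenerate early-boot artifacts after changing it
update-initramfs -u
update-grub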
Boot environment tooling: pick one and use it
If you want rollbacks that “actually work,” use a BE tool that integrates with your distro’s bootloader update flow. On Linux, that might mean tooling like zsys (Ubuntu-era), zfsbootmenu (a different approach), or distro-specific hooks. I’m not prescribing one, but I am prescribing that you stop hand-rolling BEs without a repeatable boot integration story.
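If you must hand-roll while you evaluate tooling, at least make the workflow one repeatable script instead of shell-history archaeology. A minimal sketch for a Debian-style system; the script name and the update-initramfs/update-grub steps are assumptions, not a standard:

#!/bin/sh
# create-be.sh <new-be-name>: clone the running root into a new BE and mark it as next boot
set -eu
NEW="$1"
CUR="$(findmnt -n -o SOURCE /)"          # e.g. rpool/ROOT/default
SNAP="${CUR}@be-${NEW}"

zfs snapshot "${SNAP}"
zfs clone -o canmount=noauto -o mountpoint=/ "${SNAP}" "rpool/ROOT/${NEW}"
zpool set bootfs="rpool/ROOT/${NEW}" rpool
update-initramfs -u
update-grub
echo "Created rpool/ROOT/${NEW}; old root ${CUR} remains bootable"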
Reliability quote, because it’s still true: “Hope is not a strategy.”
— General Gordon R. Sullivan
Practical tasks: commands, outputs, and decisions (12+)
These are the checks you run when you’re installing, validating, or debugging. Each task includes: command, what the output means, and the decision you make next.
Task 1: Confirm sector size assumptions (so ashift won’t bite you)
cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC,ROTA
NAME MODEL SIZE PHY-SEC LOG-SEC ROTA
nvme0n1 Samsung SSD 1.8T 4096 512 0
sda ST4000NM0035 3.6T 4096 4096 1
Meaning: PHY-SEC is physical sector size. Most modern drives are 4K physical. If you choose ashift=12 (2^12=4096), you align well.
Decision: Use ashift=12 unless you have an unusual device that genuinely needs 8K (rare) or you’re on legacy 512B-only (also rare now).
Task 2: Create the pool with sane defaults (and show you what happened)
cr0x@server:~$ sudo zpool create -f -o ashift=12 \
-O compression=lz4 -O atime=off -O xattr=sa -O acltype=posixacl \
-O mountpoint=none -R /mnt rpool /dev/nvme0n1p3
cr0x@server:~$ zpool status -v rpool
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p3 ONLINE 0 0 0
errors: No known data errors
Meaning: Pool exists, ONLINE, no errors. The -R /mnt altroot means mounts go under /mnt for installation.
Decision: If you see anything but ONLINE, stop. Fix cabling, device path, partitioning, or drive health before installing an OS on top of a shaky foundation.
Task 3: Verify the pool feature flags won’t surprise your bootloader
cr0x@server:~$ zpool get all rpool | grep feature@ | head
rpool feature@async_destroy enabled local
rpool feature@bookmarks enabled local
rpool feature@embedded_data enabled local
rpool feature@extensible_dataset enabled local
rpool feature@filesystem_limits enabled local
rpool feature@hole_birth enabled local
rpool feature@large_blocks enabled local
rpool feature@lz4_compress enabled local
rpool feature@spacemap_histogram enabled local
rpool feature@enabled_txg active local
Meaning: Feature flags are enabled/active. Bootloaders (especially GRUB builds) may not support every feature on every distro vintage.
Decision: If you rely on GRUB reading ZFS for /boot, validate feature compatibility. If you’re not 100% sure, keep /boot off ZFS and let GRUB read ext4.
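If you do want a GRUB-readable pool (for example a small separate boot pool), OpenZFS 2.1+ can pin the feature set at creation time; a sketch with placeholder pool and device names:

# restrict enabled features to what GRUB2 builds are expected to read
zpool create -o ashift=12 -o compatibility=grub2 \
  -O mountpoint=/boot -R /mnt bpool /dev/nvme0n1p2

zpool get compatibility bpool   # should report grub2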
Task 4: Create container datasets with mountpoint discipline
cr0x@server:~$ sudo zfs create -o mountpoint=none rpool/ROOT
cr0x@server:~$ sudo zfs create -o mountpoint=/ rpool/ROOT/default
cr0x@server:~$ sudo zfs create -o mountpoint=/home rpool/home
cr0x@server:~$ sudo zfs create -o mountpoint=/var rpool/var
cr0x@server:~$ sudo zfs create -o mountpoint=/var/log rpool/var/log
cr0x@server:~$ zfs list -o name,mountpoint -r rpool | head -n 10
NAME MOUNTPOINT
rpool none
rpool/ROOT none
rpool/ROOT/default /mnt/
rpool/home /mnt/home
rpool/var /mnt/var
rpool/var/log /mnt/var/log
Meaning: With altroot /mnt, mountpoints appear under /mnt. The container datasets are not mounted.
Decision: If you see container datasets mounted, fix it now. Container datasets should usually be mountpoint=none to avoid mount conflicts and weird rollback coupling.
Task 5: Check what will mount at boot (and whether ZFS agrees)
cr0x@server:~$ zfs get -r canmount,mountpoint rpool/ROOT/default | head
NAME PROPERTY VALUE SOURCE
rpool/ROOT/default canmount on default
rpool/ROOT/default mountpoint / local
Meaning: Root dataset will mount at /.
Decision: For non-root datasets you want per-BE (rare but sometimes needed), set canmount=noauto and handle explicitly. Don’t wing it.
Task 6: Confirm the bootfs property points at the intended BE
cr0x@server:~$ sudo zpool set bootfs=rpool/ROOT/default rpool
cr0x@server:~$ zpool get bootfs rpool
NAME PROPERTY VALUE SOURCE
rpool bootfs rpool/ROOT/default local
Meaning: The pool’s default boot filesystem is set. Some boot flows use this as a hint.
Decision: When you create a new BE to upgrade into, set bootfs to that BE before reboot (or let your BE tool do it). If you forget, you’ll boot the old environment and wonder why nothing changed.
Task 7: Validate hostid and zpool.cache (so initramfs imports reliably)
cr0x@server:~$ hostid
9a3f1c2b
cr0x@server:~$ sudo zgenhostid -f 9a3f1c2b
cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache rpool
cr0x@server:~$ sudo ls -l /etc/zfs/zpool.cache
-rw-r--r-- 1 root root 2264 Jan 12 09:41 /etc/zfs/zpool.cache
Meaning: A stable hostid reduces “pool was last accessed by another system” drama. The cachefile helps early boot find devices reliably.
Decision: On cloned VMs, regenerate hostid per machine. Duplicate hostids are a classic “works in dev, explodes in prod” storyline.
Task 8: Check that initramfs includes ZFS and will import pools
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -E 'zfs|zpool' | head
usr/sbin/zpool
usr/sbin/zfs
usr/lib/modules/6.8.0-31-generic/kernel/zfs/zfs.ko
usr/lib/modules/6.8.0-31-generic/kernel/zfs/zcommon.ko
scripts/zfs
Meaning: Tools and modules are in initramfs. If they aren’t, early boot can’t import the pool, and you’ll land in an initramfs shell with a blank stare.
Decision: If missing, install the correct ZFS initramfs integration package for your distro and rebuild initramfs.
Task 9: Confirm kernel cmdline points at the right root dataset
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.0-31-generic root=ZFS=rpool/ROOT/default ro quiet
Meaning: The kernel is told to mount rpool/ROOT/default as root.
Decision: If this points at a snapshot or a wrong dataset, fix GRUB config or your BE tooling. Don’t try to “just mount it” post-boot; the boot boundary is wrong.
Task 10: Create a new boot environment (clone) and verify it’s independent
cr0x@server:~$ sudo zfs snapshot rpool/ROOT/default@pre-upgrade
cr0x@server:~$ sudo zfs clone rpool/ROOT/default@pre-upgrade rpool/ROOT/upgrade-test
cr0x@server:~$ zfs list -o name,origin,mountpoint -r rpool/ROOT | grep -E 'default|upgrade-test'
rpool/ROOT/default - /
rpool/ROOT/upgrade-test rpool/ROOT/default@pre-upgrade /
Meaning: upgrade-test is a clone from the snapshot and can serve as a BE. Note: both show mountpoint /; only one should be mounted at a time.
Decision: Set canmount=noauto on inactive BEs so they don’t mount accidentally if imported in maintenance contexts.
Task 11: Make only one BE mountable by default
cr0x@server:~$ sudo zfs set canmount=noauto rpool/ROOT/default
cr0x@server:~$ sudo zfs set canmount=noauto rpool/ROOT/upgrade-test
cr0x@server:~$ sudo zfs set canmount=on rpool/ROOT/upgrade-test
cr0x@server:~$ zfs get canmount rpool/ROOT/default rpool/ROOT/upgrade-test
NAME PROPERTY VALUE SOURCE
rpool/ROOT/default canmount noauto local
rpool/ROOT/upgrade-test canmount on local
Meaning: You’re controlling what auto-mounts. This reduces “two roots mounted somewhere weird” incidents.
Decision: For your fleet, standardize a BE property convention. Your future self will thank you quietly, which is the best kind of gratitude.
Task 12: Switch bootfs to the new BE and update bootloader configs
cr0x@server:~$ sudo zpool set bootfs=rpool/ROOT/upgrade-test rpool
cr0x@server:~$ zpool get bootfs rpool
NAME PROPERTY VALUE SOURCE
rpool bootfs rpool/ROOT/upgrade-test local
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-31-generic
cr0x@server:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-31-generic
Found initrd image: /boot/initrd.img-6.8.0-31-generic
done
Meaning: You’re preparing early boot artifacts and the boot menu. Some distros need additional hooks for BE enumeration; this shows the general shape.
Decision: Reboot and validate you can boot both the new BE and the old BE. If you can’t, stop calling them “boot environments.” They are “datasets you hope you can mount later.”
Task 13: Confirm what you actually booted into after reboot
cr0x@server:~$ findmnt -n -o SOURCE /
rpool/ROOT/upgrade-test
cr0x@server:~$ zfs list -r -o name,mounted,canmount rpool/ROOT
NAME MOUNTED CANMOUNT
rpool/ROOT no on
rpool/ROOT/default no noauto
rpool/ROOT/upgrade-test yes on
Meaning: Root is the new BE. The old BE is not mounted.
Decision: If root is still default, you didn’t actually switch; fix bootfs and/or bootloader config, then try again.
Task 14: Audit snapshots and space impact (so “rollback” doesn’t become “out of space”)
cr0x@server:~$ zfs list -r -o name,used,avail,refer,mountpoint rpool | head -n 8
NAME USED AVAIL REFER MOUNTPOINT
rpool 38.2G 1.67T 192K none
rpool/ROOT 14.1G 1.67T 192K none
rpool/ROOT/default 7.20G 1.67T 7.20G /
rpool/ROOT/upgrade-test 7.14G 1.67T 7.14G /
rpool/var 11.5G 1.67T 7.31G /var
rpool/var/log 1.84G 1.67T 1.84G /var/log
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -r rpool/ROOT | tail -n 5
rpool/ROOT/default@pre-upgrade 132M 7.20G
Meaning: Snapshot used shows space held due to changes since snapshot. High snapshot “used” often means churn in datasets you didn’t isolate.
Decision: If snapshots balloon because logs or container layers are inside the BE, split them out now. Space pressure is the #1 rollback killer because ZFS needs room to breathe.
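Before deleting anything, ask ZFS what a cleanup would actually reclaim; zfs destroy -n is a dry run, and the snapshot names here are just examples:

# dry run: report what would be destroyed and how much space it would free
zfs destroy -nv rpool/ROOT/default@pre-upgrade
# ranges work too (first%last, inclusive), useful for pruning automatic snapshots
zfs destroy -nv rpool/var/log@autosnap-old%autosnap-newer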
Task 15: Check pool health and error counters before blaming “ZFS weirdness”
cr0x@server:~$ zpool status -xv
all pools are healthy
Meaning: No known data errors, no degraded vdevs. Good.
Decision: If you see checksum errors, treat it as a hardware/transport incident until proven otherwise. “My rollback didn’t work” sometimes translates to “my SSD is lying.”
Task 16: Identify a boot-import delay (the silent rollback saboteur)
cr0x@server:~$ systemd-analyze blame | head -n 10
14.203s zfs-import-cache.service
6.412s dev-nvme0n1p3.device
3.901s systemd-udev-settle.service
2.311s network-online.target
1.988s zfs-mount.service
Meaning: Boot time is dominated by pool import waiting on devices or cache mismatch.
Decision: If zfs-import-cache is slow, fix device naming stability, rebuild zpool.cache, or switch to import-by-id. If udev-settle is slow, audit what’s triggering it.
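A sketch of moving a root pool to by-id imports; because you can’t export the root pool of a running system, the re-import happens from rescue media, and the initramfs rebuild command shown is the Debian/Ubuntu one:

# from rescue media: import with stable identifiers instead of sdX/nvmeX names
zpool export rpool
zpool import -d /dev/disk/by-id -R /mnt rpool

# then, inside the installed system (chroot or after reboot): refresh cache and initramfs
zpool set cachefile=/etc/zfs/zpool.cache rpool
update-initramfs -u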
Three corporate mini-stories from the land of consequences
1) Incident caused by a wrong assumption: “Snapshots equal boot environments”
A mid-size SaaS company rolled ZFS-on-root onto a set of API nodes. The engineer who championed it was competent, but time-crunched. They created a single dataset for /, enabled snapshots, and declared victory. The rest of the org heard “we can roll back upgrades now.” That sentence has consequences.
During a routine security patch cycle, one node came up with broken networking after reboot. The on-call did what they’d been told: “roll back.” They ran a rollback on the root dataset to the last snapshot, rebooted, and… got the same broken network. Then they tried older snapshots. Same result. Now they’re angry at ZFS, which is unfair but common.
The root issue was that their /boot lived on an ext4 partition shared across all “states,” and the initramfs had been regenerated with a module mismatch. The rollback restored root filesystem contents, but it did not revert the kernel+initramfs pair in /boot. They were effectively booting a new initramfs into an old root. Linux is many things, but “forgiving about ABI mismatches at boot” isn’t one of them.
The fix was not magical. They introduced actual boot environments: clones of rpool/ROOT/<be> and a workflow where upgrades happen in a new BE, and /boot is updated in a coordinated way. They also changed the internal language: they stopped calling snapshots “rollbacks” and started saying “boot environment switch.” Precision reduced the number of late-night surprises.
2) Optimization that backfired: “Let’s dedup root, it’s mostly identical”
A financial firm loved the idea of boot environments but hated the space overhead. Someone noticed that multiple BEs look similar: same libraries, same binaries, just a few updates. Deduplication looked like a neat trick: store identical blocks once, save space, and get “infinite” BEs.
They enabled dedup on the root dataset tree and ran happily for a while. Then a wave of updates hit: kernels, language runtimes, container tooling. The dedup table grew, memory pressure increased, and the boxes started thrashing. Not catastrophically at first—just a little latency, just a little weirdness. Then a reboot happened, and boot time exploded. The import was slow, services timed out, and a node fell out of the load balancer. Repeat across the cluster, one by one, as maintenance continued.
Dedup wasn’t “broken.” It did exactly what it does: trade memory and metadata work for space. On root filesystems with churn and many snapshots/BEs, that trade can be brutal. The worst part is that the failure mode isn’t a clean error; it’s death by a thousand slow operations.
The recovery was boring and expensive: they migrated off the dedup-enabled datasets to new ones without dedup, rebooted into clean BEs, and accepted that clones + compression were enough. They stopped optimizing for the storage team’s dashboard and started optimizing for reboot reliability. That’s the right direction of embarrassment.
3) Boring but correct practice that saved the day: “Always keep a known-good BE and test bootability”
A healthcare analytics company had a practice that sounded tedious: every month, they created a “known-good” boot environment, labeled it with the change ticket, and kept it for at least one full release cycle. They also had a requirement: after any upgrade, the operator must prove they can still boot the previous BE by selecting it in the boot menu and reaching multi-user mode. Then they’d reboot back into the new BE.
It cost them maybe ten minutes per upgrade window. People complained. People always complain about rituals that prevent problems they haven’t personally suffered yet.
One quarter, a vendor driver update broke disk enumeration order on a batch of servers. Nothing dramatic, just enough that the initramfs import logic slowed and occasionally picked the wrong path first. A couple machines failed to boot into the upgraded environment. But because the team had a tested known-good BE and an established procedure, the on-call selected the previous BE, the system came up, and they had time to fix the new BE without a full outage.
That’s what “correct” looks like in production: unglamorous steps that trade five minutes now for five hours you don’t spend later. Also, it makes postmortems shorter, which is a form of kindness.
Fast diagnosis playbook: find the bottleneck fast
When a ZFS-on-root rollback/boot environment fails, you can waste hours in the wrong layer. Don’t. Check in this order.
First: Are you booting the BE you think you’re booting?
- From a running system: findmnt -n -o SOURCE /
- Confirm the cmdline: cat /proc/cmdline
- Confirm the pool’s default: zpool get bootfs rpool
Interpretation: If your BE switch didn’t take, everything else is noise. Fix boot selection before debugging userland.
Second: Can early boot import the pool quickly and consistently?
- systemd-analyze blame | head for zfs-import-* delays
- journalctl -b -u zfs-import-cache.service for timeouts and missing devices
- Check that /etc/zfs/zpool.cache exists and matches reality
- Check hostid consistency: hostid
Interpretation: Import problems masquerade as “random boot failures.” They’re usually deterministic once you look at device discovery and cache.
Third: Is /boot coherent with the root BE?
- List kernels and initramfs images: ls -lh /boot
- Check that the initramfs contains ZFS modules: lsinitramfs ... | grep zfs
- Check GRUB menu generation and entries
Interpretation: If you share /boot across BEs, you need a policy. Without a policy you have “whatever the last update did,” which is not an engineering strategy.
Fourth: If it boots but rollback doesn’t “restore sanity,” check dataset boundaries
- Are logs in the BE? zfs list -o name,mountpoint | grep /var/log
- Is app state shared across BEs unexpectedly? Review the /var/lib datasets
- Snapshot space pressure: zfs list -t snapshot and zpool list
Interpretation: Most rollback disappointment is self-inflicted via poor dataset separation.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Rollback succeeded, but system still broken after reboot”
Root cause: /boot is shared and not rolled back; kernel/initramfs mismatch with rolled-back root. Or you rolled back a dataset but didn’t switch the BE the bootloader uses.
Fix: Use BEs (clone-based) and coordinate kernel/initramfs updates. Confirm root=ZFS=... points at the BE you intend. Consider keeping per-BE kernel artifacts if your tooling supports it, or implement strict kernel retention rules.
2) Symptom: Boot drops to initramfs shell: “cannot import ‘rpool’”
Root cause: Missing ZFS modules in initramfs, incorrect zpool.cache, unstable device paths, or hostid mismatch (common after cloning).
Fix: Rebuild initramfs with ZFS included; regenerate /etc/zfs/zpool.cache; ensure hostid is unique; prefer stable device identifiers. Validate by inspecting initramfs contents.
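On Debian/Ubuntu-style systems the repair sequence usually looks like this (package and tool names differ elsewhere; verify against your distro):

apt install zfs-initramfs                      # initramfs hooks and import scripts
zgenhostid                                     # only if /etc/hostid is missing; keep it unique per machine
zpool set cachefile=/etc/zfs/zpool.cache rpool
update-initramfs -u -k all
lsinitramfs /boot/initrd.img-$(uname -r) | grep -c zfs   # should be greater than zero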
3) Symptom: GRUB can’t see the pool or can’t read /boot on ZFS
Root cause: Pool features enabled that the GRUB build can’t read, or GRUB ZFS module missing/old.
Fix: Keep /boot off ZFS, or freeze pool features to what GRUB supports, or upgrade bootloader stack in a controlled way. Don’t “just zpool upgrade” on boot-critical pools without thinking.
4) Symptom: Boot environment list exists, but selecting an older one panics or hangs
Root cause: Old BE has an initramfs that can’t import pool with current naming, encryption config, or module set; or userland expects newer kernel.
Fix: Periodically test booting older BEs. Keep at least one known-good BE updated enough to boot on current hardware/firmware. If you keep ancient BEs, treat them as archives, not boot targets.
5) Symptom: “zfs rollback” fails with dataset busy, or rollback breaks mounts
Root cause: Trying to rollback the live root dataset, or mountpoint conflicts due to containers mounted, or canmount not managed.
Fix: Roll back by switching BEs, not rolling back the mounted root. Ensure container datasets have mountpoint=none. Set inactive BEs to canmount=noauto.
6) Symptom: Space keeps shrinking, snapshots “don’t delete”
Root cause: Snapshots hold blocks alive; clones depend on snapshots; deleting snapshots that are origins of clones won’t free space the way you expect.
Fix: Understand clone/snapshot dependencies. Promote clones when needed. Separate churny datasets. Implement retention policies per dataset.
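For example, with the BE names from the earlier tasks: the new BE is a clone of the old BE’s snapshot, so promote it before retiring the old one and the snapshot dependency moves with it:

# upgrade-test currently depends on rpool/ROOT/default@pre-upgrade
zfs promote rpool/ROOT/upgrade-test
# the origin relationship is now reversed; check it, then retire the old BE only when you're sure
zfs list -o name,origin -r rpool/ROOT
zfs destroy -r rpool/ROOT/default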
7) Symptom: Boot becomes slower over time
Root cause: Import delays due to device discovery, bloated snapshot metadata, or an “optimization” like dedup that increased metadata work.
Fix: Measure with systemd-analyze. Fix import method and cache. Reduce snapshot churn on root. Avoid dedup on root unless you have a very good reason and a lot of RAM.
Short joke #2: Dedup on root is like a sport bike in winter: impressive right up until it introduces you to physics personally.
Checklists / step-by-step plan
Checklist A: Before you install
- Decide: /boot on ext4 (recommended for fleet) or /boot on ZFS (only with verified GRUB compatibility).
- Pick a pool name (rpool is common) and a BE naming scheme (default, 2025q1-upgrade, etc.).
- Confirm sector sizes (lsblk) and set ashift=12.
- Decide the encryption method and the operational unlock story.
- Write down dataset boundaries: at minimum, split /var/log and database paths.
Checklist B: Create pool + datasets (rollback-friendly)
- Create the pool with mountpoint=none, compression=lz4, atime=off, xattr=sa, acltype=posixacl.
- Create the rpool/ROOT container and rpool/ROOT/default mounted at /.
- Create separate datasets for /home, /var, /var/log, and app data.
- Set bootfs to the BE you intend to boot.
- Set the cachefile and generate a stable hostid.
Checklist C: Install OS + make it bootable
- Install ZFS userspace and kernel modules.
- Ensure initramfs includes ZFS import scripts and modules.
- Install the bootloader and ensure the kernel cmdline includes root=ZFS=rpool/ROOT/<be>.
- If /boot is shared: define a kernel retention and update process; don’t rely on vibes.
Checklist D: Validate rollback behavior (the part everyone skips)
- Create a new BE (snapshot + clone) before your first real upgrade.
- Switch bootfs to the new BE and regenerate boot configs.
- Reboot into the new BE; confirm with findmnt.
- Reboot into the old BE from the boot menu; confirm it still boots.
- Only then: do the actual upgrade procedure and repeat the dance.
FAQ
1) Are ZFS snapshots enough for OS rollbacks?
No. Snapshots are building blocks. For OS rollback you want a boot environment that the bootloader can select and boot into cleanly.
2) Should I put /boot on ZFS?
If you’re operating a fleet and want fewer early-boot surprises, keep /boot on ext4. Put it on ZFS only if you’ve tested GRUB compatibility with your pool features and update workflow.
3) What’s the single most important dataset decision for rollbacks?
Separate churny state from the BE: at minimum /var/log and databases. Otherwise snapshots and clones become space anchors and rollback semantics get weird.
4) How many boot environments should I keep?
Keep at least two bootable ones: the current and one known-good. More is fine if you have a retention policy and you understand snapshot/clone dependencies.
5) Why does my pool import slowly at boot?
Usually device discovery instability or stale zpool.cache. Sometimes hostid duplication on cloned systems. Measure with systemd-analyze blame and inspect zfs-import-* logs.
6) Can I just run zfs rollback on rpool/ROOT/default while the system is running?
Don’t. Rolling back a mounted root is how you create new and exciting failure modes. Switch BEs instead, then reboot.
7) Is encryption on ZFS root safe operationally?
Yes, if you design the unlock workflow. Decide up front whether you unlock via console passphrase, keyfile in initramfs, or remote mechanism. “We’ll handle it later” often becomes “we can’t boot.”
8) Should I enable dedup to save space across BEs?
Generally no for root. Compression is usually the right default. Dedup can be valid in narrow cases with abundant RAM and careful measurement, but it’s not a casual toggle.
9) What property settings are the usual safe defaults for root datasets?
Commonly: compression=lz4, atime=off, xattr=sa, and acltype=posixacl. Validate application expectations for ACLs and xattrs, especially with container runtimes.
10) How do I know which BE I’m currently on?
findmnt -n -o SOURCE / should show something like rpool/ROOT/<be>. Also check cat /proc/cmdline for root=ZFS=....
Conclusion: next steps you can do this week
If you already have ZFS on root and you’re not sure rollbacks really work, don’t wait for the next bad upgrade to find out. Do three things:
- Prove your current boot identity: check /proc/cmdline, findmnt /, and zpool get bootfs. Make sure the story matches.
- Create a real boot environment and boot it: snapshot, clone, switch bootfs, regenerate boot artifacts, reboot. Then boot the old one too. If either fails, fix it now while you’re calm.
- Redesign dataset boundaries where churn lives: split /var/log and app data out of the BE. Add retention policies. Stop letting logs hold your rollbacks hostage.
ZFS-on-root is worth it when it’s installed like you plan to use it: as a controlled rollback system with a bootloader that understands your intent. The goal isn’t clever. The goal is boring recovery.