ZFS on Root: Installing So Rollbacks Actually Work

You don’t want “snapshots”. You want the confidence to roll back a broken upgrade at 02:00 and have the system come up cleanly—network, SSH, services, the whole deal—without you spelunking initramfs logs like a cave diver with a headlamp.

Most ZFS-on-root guides get you to “it boots once.” Production demands more: predictable dataset layout, bootloader integration that respects boot environments, and operational habits that don’t turn your snapshot history into a crime scene.

What “rollback” must mean on a real system

A rollback that “works” isn’t just a zfs rollback that completes without errors. A rollback that works means:

  • The bootloader can enumerate and boot a previous root state (boot environment), not just mount some dataset if you hand-edit kernel parameters.
  • The initramfs can import your pool (and unlock it if encrypted) consistently and automatically.
  • /boot is coherent with the kernel+initramfs inside the root environment you’re booting—or deliberately separated and managed as a first-class exception.
  • Service state and config are scoped so rolling back root doesn’t resurrect an incompatible on-disk state for apps that write constantly (databases, queues, metrics stores).
  • The operator workflow is obvious: create BE, upgrade, reboot into BE, verify, keep or revert.

In plain English: you’re not installing “ZFS,” you’re installing a time machine with a bootloader as the control panel. And the control panel must keep working after you time travel.

Opinionated take: if you cannot list your boot environments from the boot menu or a predictable CLI and boot into them without editing anything, your rollbacks are theater. Theater is fun, but not at 02:00.

Short joke #1: A rollback plan that requires “some manual steps” is just a forward-only plan with denial baked in.

Interesting facts and context (because history explains today’s sharp edges)

  1. ZFS was born at Sun in the mid‑2000s with end-to-end checksumming and copy-on-write baked in; it was designed to make silent corruption less exciting.
  2. “Pooled storage” was a mindset shift: filesystems stopped being tied to one block device; you manage a pool and carve datasets and volumes from it.
  3. Boot environments became a ZFS cultural norm on Solaris/Illumos, where upgrading into a new BE and keeping the old one bootable was the expected workflow, not an exotic trick.
  4. Linux ZFS is out-of-tree due to licensing incompatibilities; this is why distro integration varies and why “it worked on my laptop” isn’t a plan.
  5. GRUB gained ZFS support years ago, but support depends on feature flags; enabling every shiny pool feature can strand your bootloader in the past.
  6. OpenZFS feature flags replaced on-disk version numbers; this made incremental evolution easier but created a new operational rule: boot components must support the flags you enable.
  7. ACL and xattr behaviors differ by OS expectations; ZFS can emulate, but mismatched settings can create subtle permission surprises after rollback.
  8. Compression stopped being controversial once modern CPUs got fast; today, lz4 is often a free win, but still needs to be applied with intent.
  9. Snapshots are cheap until they aren’t: metadata and fragmentation can sneak up, especially when snapshots are kept forever and churny datasets live on the same root tree.

Rollbacks fail when you treat ZFS-on-root as “a filesystem choice” rather than “a boot architecture choice.” The filesystem is the easy part. Boot is where your confidence goes to die.

Design goals: the rules that keep rollbacks boring

Rule 1: Separate what changes together

Operating system bits (packages, libraries, config) should be in the boot environment. High-churn application data should not. If you roll back root and your database files roll back too, you’ve built a time machine that casually violates causality.

Do: keep /var/lib split into datasets by app, and put databases on dedicated datasets (or even separate pools) with their own snapshot policies.

Don’t: put everything under one monolithic rpool/ROOT dataset and call it “simple.” It is simple the way a single large blast radius is simple.
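
A minimal sketch of that split, assuming the rpool/var layout described later in this article; the dataset names are illustrative, and you should pick the apps that actually matter to you:

cr0x@server:~$ sudo zfs create -o canmount=off rpool/var/lib    # contributes the /var/lib path but never mounts itself
cr0x@server:~$ sudo zfs create rpool/var/lib/postgresql
cr0x@server:~$ sudo zfs create rpool/var/lib/mysql

Now a snapshot of rpool/ROOT/<be> no longer drags database files through time, and the databases get their own snapshot and retention policy.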

Rule 2: Boot sees what it needs, and no more

The initramfs and bootloader need a stable way to find and mount the right root. That typically means: a predictable bootfs, clear BE naming, and avoiding features the bootloader can’t read.

Rule 3: Your rollback boundary is a boot environment, not a snapshot

Snapshots are raw materials. A boot environment is a curated snapshot+clone (or dataset clone) with a coherent / and usually a coherent /boot story. Use the right tool. Stop trying to boot random snapshots like it’s a party trick.

Rule 4: Default properties should be defaults for a reason

Set properties at the right dataset level. Avoid setting everything on the pool root unless you want to debug inheritance for the rest of your life.

Dataset layout that makes boot environments behave

The dataset layout is where most “ZFS on root” installs quietly fail. They boot, sure. But later, you can’t roll back cleanly because half the state is shared, or because mountpoints fight each other, or because /boot is on an island that doesn’t match your BE.

A pragmatic layout (single-pool, GRUB, typical Linux)

Assume a pool called rpool. We create:

  • rpool/ROOT (container, mountpoint=none)
  • rpool/ROOT/default (actual root BE, mountpoint=/)
  • rpool/ROOT/<be-name> (additional boot environments)
  • rpool/BOOT or a separate non-ZFS /boot partition, depending on your distro and taste
  • rpool/home (mountpoint=/home)
  • rpool/var (mountpoint=/var, usually)
  • rpool/var/log (log churn; snapshot policy different)
  • rpool/var/lib (container, often)
  • rpool/var/lib/docker or .../containers (if you must, but consider a separate pool)
  • rpool/var/lib/postgresql, rpool/var/lib/mysql etc. (separate on purpose)
  • rpool/tmp (mountpoint=/tmp, with sync=disabled only if you like living dangerously; otherwise leave it alone)

What about /boot?

Two viable patterns:

  • /boot on a plain ext4 partition (with the ESP on its own vfat partition): boring, compatible, easy. Rollbacks need to manage kernel/initramfs versions carefully, because /boot is shared across BEs. This is still fine if you use a BE tool that handles it.
  • /boot on ZFS: possible with GRUB reading ZFS, but you must keep pool features compatible with GRUB and be very deliberate about updates. This can be clean if your distro’s tooling supports it; otherwise it’s an attractive nuisance.

My bias: if you’re doing this on fleet machines, keep /boot on a small ext4 partition and keep the ESP separate. Remove one moving part from early boot. You can still have strong rollback semantics with proper BE management; you just treat kernel artifacts carefully.

Install plan: a clean, rollback-friendly ZFS-on-root build

This is a generic plan that maps well to Debian/Ubuntu-style systems, and conceptually to others. Adjust package manager and initramfs tooling accordingly. The point isn’t the exact commands; it’s the architecture and verification steps.

Partitioning: choose predictability

Use GPT. Make an ESP. Make a dedicated /boot partition unless you know you need /boot on ZFS and you’ve tested GRUB compatibility with your pool features. Put the rest in ZFS.
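
A hedged sketch with sgdisk, assuming a single NVMe device at /dev/nvme0n1 and the layout above (ESP, ext4 /boot, rest for ZFS); the sizes, partition names, and the BF00 type code for the ZFS partition are common conventions, not requirements:

cr0x@server:~$ sudo sgdisk --zap-all /dev/nvme0n1
cr0x@server:~$ sudo sgdisk -n1:1M:+512M -t1:EF00 -c1:ESP   /dev/nvme0n1
cr0x@server:~$ sudo sgdisk -n2:0:+1G    -t2:8300 -c2:boot  /dev/nvme0n1
cr0x@server:~$ sudo sgdisk -n3:0:0      -t3:BF00 -c3:rpool /dev/nvme0n1
cr0x@server:~$ sudo mkfs.vfat -F32 /dev/nvme0n1p1
cr0x@server:~$ sudo mkfs.ext4 -L boot /dev/nvme0n1p2

Partition 3 is what the pool-creation step below runs against.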

Create the pool with boot in mind

Pick ashift correctly (almost always 12). Choose sane defaults: compression=lz4, atime=off, xattr=sa, and a consistent acltype for Linux (posixacl).

Encryption: if you encrypt root, decide how it unlocks (passphrase at console, keyfile in initramfs, or network-unlock). Don’t “figure it out later.” Later is when your system won’t boot.
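
If you choose native ZFS encryption with a console passphrase, a minimal sketch is a variant of the container-creation step below; the properties are standard OpenZFS, but whether the initramfs prompts for the key automatically depends on your distro’s ZFS boot hooks, so test the unlock path before you depend on it:

cr0x@server:~$ sudo zfs create -o mountpoint=none \
    -o encryption=aes-256-gcm -o keyformat=passphrase -o keylocation=prompt \
    rpool/ROOT
cr0x@server:~$ zfs get -r encryptionroot,keystatus rpool/ROOT

Datasets created under rpool/ROOT inherit the encryption, so every boot environment unlocks with the same key at the same point in early boot.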

Build datasets and mount them correctly

Set mountpoint=none on containers. Put the actual root dataset under rpool/ROOT/<be-name> with mountpoint=/. Then mount a temporary root at /mnt and install the OS into it.

Install bootloader + initramfs with ZFS support

You need the ZFS modules in initramfs and the pool import logic to be correct. That includes:

  • Correct root=ZFS=rpool/ROOT/<be-name> kernel cmdline (or your distro’s equivalent).
  • Pool cache if your initramfs uses it (/etc/zfs/zpool.cache), plus correct hostid.
  • GRUB configured to find ZFS and list BEs (varies by distro/tooling).
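
For the kernel cmdline item, a Debian/Ubuntu-flavored sketch; some distro GRUB hooks generate the root=ZFS= argument on their own, so treat pinning it in /etc/default/grub as one valid approach, not the only one:

cr0x@server:~$ grep '^GRUB_CMDLINE_LINUX=' /etc/default/grub
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/default"
cr0x@server:~$ sudo update-initramfs -u -k all
cr0x@server:~$ sudo update-grub
cr0x@server:~$ grep -m1 'root=ZFS' /boot/grub/grub.cfg    # confirm the generated config carries it before rebooting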

Boot environment tooling: pick one and use it

If you want rollbacks that “actually work,” use a BE tool that integrates with your distro’s bootloader update flow. On Linux, that might mean tooling like zsys (Ubuntu-era), ZFSBootMenu (a different approach), or distro-specific hooks. I’m not prescribing one, but I am prescribing that you stop hand-rolling BEs without a repeatable boot integration story.

Reliability quote, because it’s still true: “Hope is not a strategy.” — General Gordon R. Sullivan

Practical tasks: commands, outputs, and decisions (12+)

These are the checks you run when you’re installing, validating, or debugging. Each task includes: command, what the output means, and the decision you make next.

Task 1: Confirm sector size assumptions (so ashift won’t bite you)

cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC,ROTA
NAME    MODEL          SIZE PHY-SEC LOG-SEC ROTA
nvme0n1 Samsung SSD    1.8T    4096     512    0
sda     ST4000NM0035   3.6T    4096    4096    1

Meaning: PHY-SEC is physical sector size. Most modern drives are 4K physical. If you choose ashift=12 (2^12=4096), you align well.

Decision: Use ashift=12 unless you have an unusual device that genuinely needs 8K (rare) or you’re on legacy 512B-only (also rare now).

Task 2: Create the pool with sane defaults (and show you what happened)

cr0x@server:~$ sudo zpool create -f -o ashift=12 \
  -O compression=lz4 -O atime=off -O xattr=sa -O acltype=posixacl \
  -O mountpoint=none -R /mnt rpool /dev/nvme0n1p3
cr0x@server:~$ zpool status -v rpool
  pool: rpool
 state: ONLINE
config:

        NAME            STATE     READ WRITE CKSUM
        rpool           ONLINE       0     0     0
          nvme0n1p3     ONLINE       0     0     0

errors: No known data errors

Meaning: Pool exists, ONLINE, no errors. The -R /mnt altroot means mounts go under /mnt for installation.

Decision: If you see anything but ONLINE, stop. Fix cabling, device path, partitioning, or drive health before installing an OS on top of a shaky foundation.

Task 3: Verify the pool feature flags won’t surprise your bootloader

cr0x@server:~$ zpool get -H feature@* rpool | head
rpool    feature@async_destroy              enabled     local
rpool    feature@bookmarks                  enabled     local
rpool    feature@embedded_data              enabled     local
rpool    feature@extensible_dataset         enabled     local
rpool    feature@filesystem_limits          enabled     local
rpool    feature@hole_birth                 enabled     local
rpool    feature@large_blocks               enabled     local
rpool    feature@lz4_compress               enabled     local
rpool    feature@spacemap_histogram         enabled     local
rpool    feature@enabled_txg                active      local

Meaning: Feature flags are enabled/active. Bootloaders (especially GRUB builds) may not support every feature on every distro vintage.

Decision: If you rely on GRUB reading ZFS for /boot, validate feature compatibility. If you’re not 100% sure, keep /boot off ZFS and let GRUB read ext4.

Task 4: Create container datasets with mountpoint discipline

cr0x@server:~$ sudo zfs create -o mountpoint=none rpool/ROOT
cr0x@server:~$ sudo zfs create -o mountpoint=/ rpool/ROOT/default
cr0x@server:~$ sudo zfs create -o mountpoint=/home rpool/home
cr0x@server:~$ sudo zfs create -o mountpoint=/var rpool/var
cr0x@server:~$ sudo zfs create -o mountpoint=/var/log rpool/var/log
cr0x@server:~$ zfs list -o name,mountpoint -r rpool | head -n 10
NAME               MOUNTPOINT
rpool              none
rpool/ROOT         none
rpool/ROOT/default /mnt
rpool/home         /mnt/home
rpool/var          /mnt/var
rpool/var/log      /mnt/var/log

Meaning: With altroot /mnt, mountpoints appear under /mnt. The container datasets are not mounted.

Decision: If you see container datasets mounted, fix it now. Container datasets should usually be mountpoint=none to avoid mount conflicts and weird rollback coupling.

Task 5: Check what will mount at boot (and whether ZFS agrees)

cr0x@server:~$ zfs get -r canmount,mountpoint rpool/ROOT/default | head
NAME               PROPERTY   VALUE     SOURCE
rpool/ROOT/default canmount   on        default
rpool/ROOT/default mountpoint /         local

Meaning: Root dataset will mount at /.

Decision: For non-root datasets you want per-BE (rare but sometimes needed), set canmount=noauto and handle explicitly. Don’t wing it.

Task 6: Confirm the bootfs property points at the intended BE

cr0x@server:~$ sudo zpool set bootfs=rpool/ROOT/default rpool
cr0x@server:~$ zpool get bootfs rpool
NAME   PROPERTY  VALUE              SOURCE
rpool  bootfs    rpool/ROOT/default local

Meaning: The pool’s default boot filesystem is set. Some boot flows use this as a hint.

Decision: When you create a new BE to upgrade into, set bootfs to that BE before reboot (or let your BE tool do it). If you forget, you’ll boot the old environment and wonder why nothing changed.

Task 7: Validate hostid and zpool.cache (so initramfs imports reliably)

cr0x@server:~$ hostid
9a3f1c2b
cr0x@server:~$ sudo zgenhostid -f 9a3f1c2b
cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache rpool
cr0x@server:~$ sudo ls -l /etc/zfs/zpool.cache
-rw-r--r-- 1 root root 2264 Jan 12 09:41 /etc/zfs/zpool.cache

Meaning: A stable hostid reduces “pool was last accessed by another system” drama. The cachefile helps early boot find devices reliably.

Decision: On cloned VMs, regenerate hostid per machine. Duplicate hostids are a classic “works in dev, explodes in prod” storyline.

Task 8: Check that initramfs includes ZFS and will import pools

cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -E 'zfs|zpool' | head
usr/sbin/zpool
usr/sbin/zfs
usr/lib/modules/6.8.0-31-generic/kernel/zfs/zfs.ko
usr/lib/modules/6.8.0-31-generic/kernel/zfs/zcommon.ko
scripts/zfs

Meaning: Tools and modules are in initramfs. If they aren’t, early boot can’t import the pool, and you’ll land in an initramfs shell with a blank stare.

Decision: If missing, install the correct ZFS initramfs integration package for your distro and rebuild initramfs.

Task 9: Confirm kernel cmdline points at the right root dataset

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.0-31-generic root=ZFS=rpool/ROOT/default ro quiet

Meaning: The kernel is told to mount rpool/ROOT/default as root.

Decision: If this points at a snapshot or a wrong dataset, fix GRUB config or your BE tooling. Don’t try to “just mount it” post-boot; the boot boundary is wrong.

Task 10: Create a new boot environment (clone) and verify it’s independent

cr0x@server:~$ sudo zfs snapshot rpool/ROOT/default@pre-upgrade
cr0x@server:~$ sudo zfs clone rpool/ROOT/default@pre-upgrade rpool/ROOT/upgrade-test
cr0x@server:~$ zfs list -o name,origin,mountpoint -r rpool/ROOT | grep -E 'default|upgrade-test'
rpool/ROOT/default       -                               /
rpool/ROOT/upgrade-test  rpool/ROOT/default@pre-upgrade  /

Meaning: upgrade-test is a clone from the snapshot and can serve as a BE. Note: both show mountpoint /; only one should be mounted at a time.

Decision: Set canmount=noauto on inactive BEs so they don’t mount accidentally if imported in maintenance contexts.

Task 11: Make only one BE mountable by default

cr0x@server:~$ sudo zfs set canmount=noauto rpool/ROOT/default
cr0x@server:~$ sudo zfs set canmount=noauto rpool/ROOT/upgrade-test
cr0x@server:~$ sudo zfs set canmount=on rpool/ROOT/upgrade-test
cr0x@server:~$ zfs get canmount rpool/ROOT/default rpool/ROOT/upgrade-test
NAME                     PROPERTY  VALUE   SOURCE
rpool/ROOT/default       canmount  noauto  local
rpool/ROOT/upgrade-test  canmount  on      local

Meaning: You’re controlling what auto-mounts. This reduces “two roots mounted somewhere weird” incidents.

Decision: For your fleet, standardize a BE property convention. Your future self will thank you quietly, which is the best kind of gratitude.
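
ZFS user properties (any property name containing a colon) are the natural home for that convention; the org.example names and ticket value below are hypothetical, the mechanism is not:

cr0x@server:~$ sudo zfs set org.example:be-status=known-good rpool/ROOT/default
cr0x@server:~$ sudo zfs set org.example:change-ticket=CHG-4821 rpool/ROOT/upgrade-test
cr0x@server:~$ zfs get -r -s local org.example:be-status,org.example:change-ticket rpool/ROOT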

Task 12: Switch bootfs to the new BE and update bootloader configs

cr0x@server:~$ sudo zpool set bootfs=rpool/ROOT/upgrade-test rpool
cr0x@server:~$ zpool get bootfs rpool
NAME   PROPERTY  VALUE                   SOURCE
rpool  bootfs    rpool/ROOT/upgrade-test local
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-31-generic
cr0x@server:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-31-generic
Found initrd image: /boot/initrd.img-6.8.0-31-generic
done

Meaning: You’re preparing early boot artifacts and the boot menu. Some distros need additional hooks for BE enumeration; this shows the general shape.

Decision: Reboot and validate you can boot both the new BE and the old BE. If you can’t, stop calling them “boot environments.” They are “datasets you hope you can mount later.”

Task 13: Confirm what you actually booted into after reboot

cr0x@server:~$ findmnt -n -o SOURCE /
rpool/ROOT/upgrade-test
cr0x@server:~$ zfs list -r -o name,mounted,canmount rpool/ROOT
NAME                     MOUNTED  CANMOUNT
rpool/ROOT               no       on
rpool/ROOT/default       no       noauto
rpool/ROOT/upgrade-test  yes      on

Meaning: Root is the new BE. The old BE is not mounted.

Decision: If root is still default, you didn’t actually switch; fix bootfs and/or bootloader config, then try again.

Task 14: Audit snapshots and space impact (so “rollback” doesn’t become “out of space”)

cr0x@server:~$ zfs list -r -o name,used,avail,refer,mountpoint rpool | head -n 8
NAME                      USED  AVAIL  REFER  MOUNTPOINT
rpool                    32.0G  1.67T   192K  none
rpool/ROOT               7.85G  1.67T   192K  none
rpool/ROOT/default       7.20G  1.67T  7.20G  /
rpool/ROOT/upgrade-test   652M  1.67T  7.14G  /
rpool/var                11.5G  1.67T  7.31G  /var
rpool/var/log            1.84G  1.67T  1.84G  /var/log
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -r rpool/ROOT | tail -n 5
NAME                            USED  REFER
rpool/ROOT/default@pre-upgrade  132M  7.20G

Meaning: Snapshot used shows space held due to changes since snapshot. High snapshot “used” often means churn in datasets you didn’t isolate.

Decision: If snapshots balloon because logs or container layers are inside the BE, split them out now. Space pressure is the #1 rollback killer because ZFS needs room to breathe.

Task 15: Check pool health and error counters before blaming “ZFS weirdness”

cr0x@server:~$ zpool status -xv
all pools are healthy

Meaning: No known data errors, no degraded vdevs. Good.

Decision: If you see checksum errors, treat it as a hardware/transport incident until proven otherwise. “My rollback didn’t work” sometimes translates to “my SSD is lying.”

Task 16: Identify a boot-import delay (the silent rollback saboteur)

cr0x@server:~$ systemd-analyze blame | head -n 10
14.203s zfs-import-cache.service
 6.412s dev-nvme0n1p3.device
 3.901s systemd-udev-settle.service
 2.311s network-online.target
 1.988s zfs-mount.service

Meaning: Boot time is dominated by pool import waiting on devices or cache mismatch.

Decision: If zfs-import-cache is slow, fix device naming stability, rebuild zpool.cache, or switch to import-by-id. If udev-settle is slow, audit what’s triggering it.
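
A sketch of the cache-rebuild path on a Debian/Ubuntu-style initramfs; setting the cachefile property rewrites /etc/zfs/zpool.cache with the device paths currently in use, and the initramfs rebuild bakes that copy into early boot:

cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache rpool
cr0x@server:~$ sudo update-initramfs -u -k all
cr0x@server:~$ systemd-analyze blame | grep -E 'zfs-import|udev-settle'    # re-measure after the next reboot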

Three corporate mini-stories from the land of consequences

1) Incident caused by a wrong assumption: “Snapshots equal boot environments”

A mid-size SaaS company rolled ZFS-on-root onto a set of API nodes. The engineer who championed it was competent, but time-crunched. They created a single dataset for /, enabled snapshots, and declared victory. The rest of the org heard “we can roll back upgrades now.” That sentence has consequences.

During a routine security patch cycle, one node came up with broken networking after reboot. The on-call did what they’d been told: “roll back.” They ran a rollback on the root dataset to the last snapshot, rebooted, and… got the same broken network. Then they tried older snapshots. Same result. Now they’re angry at ZFS, which is unfair but common.

The root issue was that their /boot lived on an ext4 partition shared across all “states,” and the initramfs had been regenerated with a module mismatch. The rollback restored root filesystem contents, but it did not revert the kernel+initramfs pair in /boot. They were effectively booting a new initramfs into an old root. Linux is many things, but “forgiving about ABI mismatches at boot” isn’t one of them.

The fix was not magical. They introduced actual boot environments: clones of rpool/ROOT/<be> and a workflow where upgrades happen in a new BE, and /boot is updated in a coordinated way. They also changed the internal language: they stopped calling snapshots “rollbacks” and started saying “boot environment switch.” Precision reduced the number of late-night surprises.

2) Optimization that backfired: “Let’s dedup root, it’s mostly identical”

A financial firm loved the idea of boot environments but hated the space overhead. Someone noticed that multiple BEs look similar: same libraries, same binaries, just a few updates. Deduplication looked like a neat trick: store identical blocks once, save space, and get “infinite” BEs.

They enabled dedup on the root dataset tree and ran happily for a while. Then a wave of updates hit: kernels, language runtimes, container tooling. The dedup table grew, memory pressure increased, and the boxes started thrashing. Not catastrophically at first—just a little latency, just a little weirdness. Then a reboot happened, and boot time exploded. The import was slow, services timed out, and a node fell out of the load balancer. Repeat across the cluster, one by one, as maintenance continued.

Dedup wasn’t “broken.” It did exactly what it does: trade memory and metadata work for space. On root filesystems with churn and many snapshots/BEs, that trade can be brutal. The worst part is that the failure mode isn’t a clean error; it’s death by a thousand slow operations.

The recovery was boring and expensive: they migrated off the dedup-enabled datasets to new ones without dedup, rebooted into clean BEs, and accepted that clones + compression were enough. They stopped optimizing for the storage team’s dashboard and started optimizing for reboot reliability. That’s the right direction of embarrassment.

3) Boring but correct practice that saved the day: “Always keep a known-good BE and test bootability”

A healthcare analytics company had a practice that sounded tedious: every month, they created a “known-good” boot environment, labeled it with the change ticket, and kept it for at least one full release cycle. They also had a requirement: after any upgrade, the operator must prove they can still boot the previous BE by selecting it in the boot menu and reaching multi-user mode. Then they’d reboot back into the new BE.

It cost them maybe ten minutes per upgrade window. People complained. People always complain about rituals that prevent problems they haven’t personally suffered yet.

One quarter, a vendor driver update broke disk enumeration order on a batch of servers. Nothing dramatic, just enough that the initramfs import logic slowed and occasionally picked the wrong path first. A couple machines failed to boot into the upgraded environment. But because the team had a tested known-good BE and an established procedure, the on-call selected the previous BE, the system came up, and they had time to fix the new BE without a full outage.

That’s what “correct” looks like in production: unglamorous steps that trade five minutes now for five hours you don’t spend later. Also, it makes postmortems shorter, which is a form of kindness.

Fast diagnosis playbook: find the bottleneck fast

When a ZFS-on-root rollback/boot environment fails, you can waste hours in the wrong layer. Don’t. Check in this order.

First: Are you booting the BE you think you’re booting?

  • From a running system: findmnt -n -o SOURCE /
  • Confirm cmdline: cat /proc/cmdline
  • Confirm zpool get bootfs rpool

Interpretation: If your BE switch didn’t take, everything else is noise. Fix boot selection before debugging userland.

Second: Can early boot import the pool quickly and consistently?

  • systemd-analyze blame | head for zfs-import-* delays
  • journalctl -b -u zfs-import-cache.service for timeouts and missing devices
  • Check /etc/zfs/zpool.cache exists and matches reality
  • Check hostid consistency: hostid

Interpretation: Import problems masquerade as “random boot failures.” They’re usually deterministic once you look at device discovery and cache.

Third: Is /boot coherent with the root BE?

  • List kernels and initramfs: ls -lh /boot
  • Check initramfs contains ZFS modules: lsinitramfs ... | grep zfs
  • Check GRUB menu generation and entries

Interpretation: If you share /boot across BEs, you need a policy. Without a policy you have “whatever the last update did,” which is not an engineering strategy.

Fourth: If it boots but rollback doesn’t “restore sanity,” check dataset boundaries

  • Are logs in the BE? zfs list -o name,mountpoint | grep /var/log
  • Is app state shared across BEs unexpectedly? Review /var/lib datasets
  • Snapshot space pressure: zfs list -t snapshot, zpool list

Interpretation: Most rollback disappointment is self-inflicted via poor dataset separation.

Common mistakes: symptoms → root cause → fix

1) Symptom: “Rollback succeeded, but system still broken after reboot”

Root cause: /boot is shared and not rolled back; kernel/initramfs mismatch with rolled-back root. Or you rolled back a dataset but didn’t switch the BE the bootloader uses.

Fix: Use BEs (clone-based) and coordinate kernel/initramfs updates. Confirm root=ZFS=... points at the BE you intend. Consider keeping per-BE kernel artifacts if your tooling supports it, or implement strict kernel retention rules.

2) Symptom: Boot drops to initramfs shell: “cannot import ‘rpool’”

Root cause: Missing ZFS modules in initramfs, incorrect zpool.cache, unstable device paths, or hostid mismatch (common after cloning).

Fix: Rebuild initramfs with ZFS included; regenerate /etc/zfs/zpool.cache; ensure hostid is unique; prefer stable device identifiers. Validate by inspecting initramfs contents.
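
On Debian/Ubuntu-style systems the fix is usually a few package and rebuild steps; package names and initramfs hooks differ on dracut-based distros, so treat this as a sketch:

cr0x@server:~$ sudo apt install zfs-initramfs       # the initramfs import scripts and hooks
cr0x@server:~$ sudo zgenhostid -f                   # unique /etc/hostid for this machine
cr0x@server:~$ sudo update-initramfs -u -k all
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -c zfs    # nonzero means the ZFS bits made it in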

3) Symptom: GRUB can’t see the pool or can’t read /boot on ZFS

Root cause: Pool features enabled that the GRUB build can’t read, or GRUB ZFS module missing/old.

Fix: Keep /boot off ZFS, or freeze pool features to what GRUB supports, or upgrade bootloader stack in a controlled way. Don’t “just zpool upgrade” on boot-critical pools without thinking.
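
If GRUB must read a pool, newer OpenZFS releases (2.1+) can enforce the freeze for you with the compatibility pool property; check that your install ships the grub2 feature-set file, and remember that compatibility only blocks enabling new features, it does not disable ones already active. The bpool name and partition below are illustrative, assuming you chose the /boot-on-ZFS pattern:

cr0x@server:~$ ls /usr/share/zfs/compatibility.d/ | grep -i grub
cr0x@server:~$ sudo zpool create -o ashift=12 -o compatibility=grub2 bpool /dev/nvme0n1p2
cr0x@server:~$ zpool get compatibility bpool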

4) Symptom: Boot environment list exists, but selecting an older one panics or hangs

Root cause: Old BE has an initramfs that can’t import pool with current naming, encryption config, or module set; or userland expects newer kernel.

Fix: Periodically test booting older BEs. Keep at least one known-good BE updated enough to boot on current hardware/firmware. If you keep ancient BEs, treat them as archives, not boot targets.

5) Symptom: “zfs rollback” fails with dataset busy, or rollback breaks mounts

Root cause: Trying to rollback the live root dataset, or mountpoint conflicts due to containers mounted, or canmount not managed.

Fix: Roll back by switching BEs, not rolling back the mounted root. Ensure container datasets have mountpoint=none. Set inactive BEs to canmount=noauto.

6) Symptom: Space keeps shrinking, snapshots “don’t delete”

Root cause: Snapshots hold blocks alive; clones depend on snapshots; deleting snapshots that are origins of clones won’t free space the way you expect.

Fix: Understand clone/snapshot dependencies. Promote clones when needed. Separate churny datasets. Implement retention policies per dataset.
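
Promotion reverses the clone/origin dependency, which is how you retire an old BE that your active clone still depends on; a minimal sketch using the earlier example names:

cr0x@server:~$ zfs list -o name,origin -r rpool/ROOT
cr0x@server:~$ sudo zfs promote rpool/ROOT/upgrade-test
cr0x@server:~$ zfs list -o name,origin -r rpool/ROOT    # default is now the clone; the snapshot moved to the promoted BE
cr0x@server:~$ sudo zfs destroy -r rpool/ROOT/default   # only when you are sure you will never boot it again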

7) Symptom: Boot becomes slower over time

Root cause: Import delays due to device discovery, bloated snapshot metadata, or an “optimization” like dedup that increased metadata work.

Fix: Measure with systemd-analyze. Fix import method and cache. Reduce snapshot churn on root. Avoid dedup on root unless you have a very good reason and a lot of RAM.

Short joke #2: Dedup on root is like a sport bike in winter: impressive right up until it introduces you to physics personally.

Checklists / step-by-step plan

Checklist A: Before you install

  • Decide: /boot on ext4 (recommended for fleet) or /boot on ZFS (only with verified GRUB compatibility).
  • Pick pool name (rpool is common) and BE naming scheme (default, 2025q1-upgrade, etc.).
  • Confirm sector sizes (lsblk) and set ashift=12.
  • Decide encryption method and operational unlock story.
  • Write down dataset boundaries: at minimum split /var/log and database paths.

Checklist B: Create pool + datasets (rollback-friendly)

  • Create pool with mountpoint=none, compression=lz4, atime=off, xattr=sa, acltype=posixacl.
  • Create rpool/ROOT container and rpool/ROOT/default mounted at /.
  • Create separate datasets for /home, /var, /var/log, and app data.
  • Set bootfs to the BE you intend to boot.
  • Set cachefile and generate a stable hostid.

Checklist C: Install OS + make it bootable

  • Install ZFS userspace and kernel modules.
  • Ensure initramfs includes ZFS import scripts and modules.
  • Install bootloader and ensure kernel cmdline includes root=ZFS=rpool/ROOT/<be>.
  • If /boot is shared: define kernel retention and update process; don’t rely on vibes.

Checklist D: Validate rollback behavior (the part everyone skips)

  • Create a new BE (snapshot + clone) before your first real upgrade.
  • Switch bootfs to the new BE and regenerate boot configs.
  • Reboot into the new BE; confirm with findmnt.
  • Reboot into the old BE from boot menu; confirm it still boots.
  • Only then: do the actual upgrade procedure and repeat the dance.

FAQ

1) Are ZFS snapshots enough for OS rollbacks?

No. Snapshots are building blocks. For OS rollback you want a boot environment that the bootloader can select and boot into cleanly.

2) Should I put /boot on ZFS?

If you’re operating a fleet and want fewer early-boot surprises, keep /boot on ext4. Put it on ZFS only if you’ve tested GRUB compatibility with your pool features and update workflow.

3) What’s the single most important dataset decision for rollbacks?

Separate churny state from the BE: at minimum /var/log and databases. Otherwise snapshots and clones become space anchors and rollback semantics get weird.

4) How many boot environments should I keep?

Keep at least two bootable ones: the current and one known-good. More is fine if you have a retention policy and you understand snapshot/clone dependencies.

5) Why does my pool import slowly at boot?

Usually device discovery instability or stale zpool.cache. Sometimes hostid duplication on cloned systems. Measure with systemd-analyze blame and inspect zfs-import-* logs.

6) Can I just run zfs rollback on rpool/ROOT/default while the system is running?

Don’t. Rolling back a mounted root is how you create new and exciting failure modes. Switch BEs instead, then reboot.

7) Is encryption on ZFS root safe operationally?

Yes, if you design the unlock workflow. Decide up front whether you unlock via console passphrase, keyfile in initramfs, or remote mechanism. “We’ll handle it later” often becomes “we can’t boot.”

8) Should I enable dedup to save space across BEs?

Generally no for root. Compression is usually the right default. Dedup can be valid in narrow cases with abundant RAM and careful measurement, but it’s not a casual toggle.

9) What property settings are the usual safe defaults for root datasets?

Commonly: compression=lz4, atime=off, xattr=sa, and acltype=posixacl. Validate application expectations for ACLs and xattrs, especially with container runtimes.

10) How do I know which BE I’m currently on?

findmnt -n -o SOURCE / should show something like rpool/ROOT/<be>. Also check cat /proc/cmdline for root=ZFS=....

Conclusion: next steps you can do this week

If you already have ZFS on root and you’re not sure rollbacks really work, don’t wait for the next bad upgrade to find out. Do three things:

  1. Prove your current boot identity: check /proc/cmdline, findmnt /, and zpool get bootfs. Make sure the story matches.
  2. Create a real boot environment and boot it: snapshot, clone, switch bootfs, regenerate boot artifacts, reboot. Then boot the old one too. If either fails, fix it now while you’re calm.
  3. Redesign dataset boundaries where churn lives: split /var/log and app data out of the BE. Add retention policies. Stop letting logs hold your rollbacks hostage.
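
A combined check for step 1, assuming the rpool naming used throughout this article; if the three answers disagree about which BE you are on, fix that before touching anything else:

cr0x@server:~$ findmnt -n -o SOURCE /
cr0x@server:~$ tr ' ' '\n' < /proc/cmdline | grep '^root='
cr0x@server:~$ zpool get -H -o value bootfs rpool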

ZFS-on-root is worth it when it’s installed like you plan to use it: as a controlled rollback system with a bootloader that understands your intent. The goal isn’t clever. The goal is boring recovery.
