ZFS rollback: The Safe Way to Undo Mistakes Without Collateral Damage

Some outages don’t start with a disk failure or a kernel panic. They start with a perfectly healthy system and one well-intentioned command: a config tweak, a package upgrade, a “cleanup” that deletes the wrong directory. Suddenly your app is broken, your data is weird, and your stakeholders are making the sort of noises humans make when they realize their weekend plans are canceled.

ZFS rollback is one of the few tools in production that can turn time backwards with surgical precision. It’s also one of the easiest ways to vaporize the wrong snapshots, wreck replication chains, or roll back a dataset while leaving dependent systems in a subtly corrupt state. This is the safe way: minimize blast radius, preserve evidence, and keep your future self out of therapy.

What rollback really does (and what it destroys)

zfs rollback moves the “live” dataset back to the exact contents of a snapshot. It does not “merge” or “undo” in a gentle way. It rewinds the dataset head pointer, which implicitly discards changes made after that snapshot.

Here’s the part that bites people: by default, ZFS won’t let you roll back to a snapshot that has newer snapshots after it. If you add -r, you are explicitly asking ZFS to destroy every snapshot (and bookmark) newer than the target; -R goes further and also destroys clones of those newer snapshots. That might be the right move. It might also be the exact moment you cut off your replication history and make the backup team stop returning your calls.

Operationally, rollback is best treated like a controlled demolition: you can do it safely, but only if you’ve already verified what’s attached to the thing you’re about to move.

The snapshot family tree matters

Snapshots form a timeline. Rollback rewinds to an earlier point. If you have snapshots after that point, those snapshots refer to blocks that represent history after the target. When you rollback with destruction flags, ZFS is clearing the future you no longer want. That’s not “dangerous,” it’s simply final.

Rollback is not a file restore

If you only need a few files or a directory, mounting or browsing a snapshot is safer than rolling back the whole dataset. Rollback is for “the dataset is now wrong, and we want it back exactly how it was.” File-level recovery is for “I regret one command, not my entire personality.”
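
If you only regret a file or two, the hidden .zfs/snapshot directory on the dataset’s mountpoint is usually all you need. A minimal sketch, assuming tank/app is mounted at /srv/app and the file to recover is config.yml (the path and snapshot name are illustrative):

cr0x@server:~$ cp -a /srv/app/.zfs/snapshot/pre_upgrade/config.yml /srv/app/config.yml

The .zfs directory is reachable even when the snapdir property is set to hidden; zfs set snapdir=visible tank/app makes it show up in directory listings if you prefer.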

One quote to tape above your terminal

Paraphrased idea (Gene Kim, reliability/DevOps author): “Improvement comes from making work visible and reducing the cost of recovery.” Rollback reduces the cost of recovery—if you keep it visible and controlled.

Interesting facts and historical context

  • ZFS shipped with snapshots from the start (early 2000s Solaris development), when many filesystems still treated “backup” as someone else’s problem.
  • Snapshots are essentially free to create (metadata-only at creation time). They become “expensive” later only as diverging blocks are retained.
  • Rollback is instantaneous for metadata, but the operational impact is not: services see a sudden state change, caches become lies, and applications may need recovery steps.
  • Clones exist because rollback is too blunt: a clone is a writable branch from a snapshot, built for safe testing and selective recovery.
  • ZFS replication (send/receive) is snapshot-based, which means rollback decisions can break incremental chains if you delete snapshots the receiver expects.
  • “Used by snapshots” confuses almost everyone at first: it’s not “snapshot size,” it’s “how much would be freed if that snapshot disappeared.”
  • ZFS holds were introduced to prevent accidental deletion of critical snapshots, especially in automated environments where “cleanup” scripts roam.
  • Bookmarks exist for replication hygiene: they record a send point without keeping all blocks pinned like a full snapshot.
  • Root pool rollback became a mainstream operational pattern with boot environments (notably on illumos and some Linux setups), making “upgrade then revert” less terrifying.

A mental model that prevents panic

When you’re mid-incident, you don’t want poetry. You want a model you can execute.

Think in three layers:

  1. Data layer (ZFS): datasets, snapshots, clones, holds, bookmarks.
  2. Consistency layer (apps): databases, message queues, files that must match each other, WAL/redo logs, crash recovery semantics.
  3. Dependency layer (ops): replication, backups, monitoring expectations, consumers of mounted paths.

Rollback is purely layer 1. Your incident is usually caused by layer 2 or 3. That’s why “rollback fixed the files” sometimes still leaves you with a broken system: the application’s world includes caches, external state, and transactional invariants.

Rule of thumb: if an application can’t tolerate a power loss at any time, it can’t tolerate rollback without explicit coordination. ZFS will happily rewind; your database will happily explain why it’s angry.

Safe defaults: clone first, rollback later

If you remember nothing else: when you are not 100% sure, do not reach for rollback first. Clone the snapshot, mount it somewhere harmless, compare, extract, and only then consider rollback.

Rollback is great for:

  • Reverting a bad package upgrade or configuration change on a dedicated dataset or boot environment.
  • Undoing destructive file operations when the dataset’s current state is not worth preserving.
  • Restoring a known-good service image where the application is stateless or has its own journaling recovery.

Rollback is risky for:

  • Databases with external dependencies (replicas, PITR, binlog/WAL shipping).
  • Datasets used by multiple services where only one is “broken.”
  • Systems with tight replication chains where snapshot deletion breaks incrementals.

Joke #1: Rollback is like a time machine: it works great until you realize you also deleted the timeline where you wrote down what you changed.

Practical tasks with commands, outputs, and decisions

These are real operational moves you can run under pressure. Each task includes: command, what the output means, and the decision you make from it.

Task 1: Confirm pool health before you touch anything

cr0x@server:~$ zpool status -x
all pools are healthy

Meaning: no known device errors or degraded vdevs. If you see “DEGRADED” or checksum errors, your problem might be hardware or corruption, not “oops we changed a file.”

Decision: If not healthy, stop and stabilize: scrub, replace disks, or at least capture status output for incident notes before any rollback.

Task 2: Identify the exact dataset behind the mountpoint

cr0x@server:~$ zfs list -r -o name,mountpoint tank
NAME                 MOUNTPOINT
tank                 /tank
tank/app             /srv/app
tank/app/logs        /srv/app/logs
tank/db              /var/lib/postgresql

Meaning: You now know what dataset a path maps to. People rollback “tank” when they meant “tank/app.” That’s how you invent new incidents.

Decision: Target the smallest dataset that contains the broken state.
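
If you’d rather go from a path straight to its dataset, findmnt reports the backing dataset as the mount source. A quick sketch, assuming the broken path is /srv/app:

cr0x@server:~$ findmnt -T /srv/app -n -o SOURCE,FSTYPE
tank/app zfs

The -T flag accepts any path, not just the mountpoint itself, so it also works for subdirectories deep inside the dataset.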

Task 3: Enumerate snapshots for the dataset and find your candidate

cr0x@server:~$ zfs list -t snapshot -o name,creation,used -s creation tank/app
NAME                                CREATION               USED
tank/app@autosnap_2025-12-26        Fri Dec 26 01:00 2025  120M
tank/app@autosnap_2025-12-26_02-00  Fri Dec 26 02:00 2025    8M
tank/app@pre_upgrade                Fri Dec 26 02:12 2025    0B
tank/app@post_upgrade               Fri Dec 26 02:20 2025   35M

Meaning: Snapshot names and creation times. USED is how much unique space would be freed if the snapshot were deleted (not “size of snapshot contents”).

Decision: Pick the snapshot that matches your “last known good” point. Prefer explicit snapshots like @pre_upgrade over periodic ones when available.
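
To gauge how much you would be discarding, the written@<snapshot> property reports how much data has been written to the live dataset since that snapshot. A sketch against the candidate above (the number shown is illustrative):

cr0x@server:~$ zfs get -H -o value written@pre_upgrade tank/app
41.3M

A small value suggests little legitimate work happened after the snapshot; a large one means rollback will throw away real changes.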

Task 4: Check for dependent clones (rollback may be blocked or destructive)

cr0x@server:~$ zfs list -t snapshot -o name,clones tank/app@pre_upgrade
NAME                    CLONES
tank/app@pre_upgrade    tank/app-test

Meaning: There is a clone, tank/app-test, based on that snapshot. Clones only get in the way of rollbacks that would destroy their origin snapshot (for example, rolling back to a point before @pre_upgrade): -r refuses in that case, and -R destroys the clone along with the snapshot.

Decision: If clones exist, prefer “clone-and-restore files” over rollback, or plan the clone’s fate explicitly.
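
You can confirm the relationship from the clone’s side too: its origin property names the snapshot it depends on.

cr0x@server:~$ zfs get -H -o name,property,value origin tank/app-test
tank/app-test    origin    tank/app@pre_upgrade

If the clone must live on but the snapshot has to go, zfs promote can invert the dependency; that changes which dataset owns the shared history, so treat it as its own planned change.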

Task 5: Verify whether newer snapshots exist that would be destroyed

cr0x@server:~$ zfs list -t snapshot -o name -s creation tank/app | tail -n 5
tank/app@autosnap_2025-12-26_02-00
tank/app@pre_upgrade
tank/app@post_upgrade
tank/app@autosnap_2025-12-26_03-00
tank/app@autosnap_2025-12-26_04-00

Meaning: If you rollback to @pre_upgrade with the destroy flags (-r or -R), everything after it is on the chopping block.

Decision: If those later snapshots are needed for replication or audit, don’t destroy them; clone instead.

Task 6: Create a safety snapshot right now (yes, even if things are broken)

cr0x@server:~$ zfs snapshot tank/app@before_rollback_2025-12-26T0235
cr0x@server:~$ zfs list -t snapshot -o name,creation | grep before_rollback
tank/app@before_rollback_2025-12-26T0235  Fri Dec 26 02:35 2025

Meaning: You preserved the current broken state. This is your forensics and “undo the undo.”

Decision: If rollback goes sideways, you can return to this point or extract evidence.

Task 7: Put a hold on critical snapshots so automation can’t delete them mid-incident

cr0x@server:~$ zfs hold incident tank/app@before_rollback_2025-12-26T0235
cr0x@server:~$ zfs holds tank/app@before_rollback_2025-12-26T0235
NAME                                      TAG       TIMESTAMP
tank/app@before_rollback_2025-12-26T0235  incident  Fri Dec 26 02:36 2025

Meaning: The snapshot cannot be destroyed until the hold is released.

Decision: Use holds whenever cleanup jobs exist, or when multiple humans are “helping.” It’s cheaper than arguing.
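
When the incident is closed, release the hold deliberately; until then, any attempt to destroy the held snapshot is refused.

cr0x@server:~$ zfs release incident tank/app@before_rollback_2025-12-26T0235

Pair every zfs hold in a runbook with the zfs release step that undoes it, or the snapshot will sit there pinning space long after anyone remembers why.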

Task 8: Clone the snapshot to inspect and recover without changing prod

cr0x@server:~$ zfs clone -o mountpoint=/mnt/app_recover tank/app@pre_upgrade tank/app_recover
cr0x@server:~$ zfs list -o name,mountpoint,mounted tank/app_recover
NAME              MOUNTPOINT        MOUNTED
tank/app_recover  /mnt/app_recover  yes

Meaning: You now have a writable branch with the old contents. You can diff, copy, or run checks without risking the live dataset.

Decision: If you can fix prod by copying a handful of files from the clone, do that and avoid rollback entirely.
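
To see exactly what diverged between the snapshot and the live dataset, zfs diff gives a file-level change list: M modified, + created, - removed, R renamed. The paths below are illustrative:

cr0x@server:~$ zfs diff tank/app@pre_upgrade tank/app
M       /srv/app/config.yml
+       /srv/app/releases/2025-12-26
-       /srv/app/releases/2025-12-12

If the diff is a handful of files, copy them back from the clone (or from .zfs/snapshot) and skip rollback entirely.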

Task 9: Compare dataset properties that might affect behavior after rollback

cr0x@server:~$ zfs get -H -o property,value compression,recordsize,atime,xattr,acltype tank/app
compression  lz4
recordsize   128K
atime        off
xattr        sa
acltype      posixacl

Meaning: Properties impact performance and semantics. Rollback doesn’t change properties (they’re not per-snapshot), but if someone “fixed” something by changing a property after the snapshot, rollback won’t undo it.

Decision: If the incident was caused by a property change, fix the property instead of rolling back data, and keep dataset properties in configuration management so you always have a baseline to compare against.

Task 10: Check space pressure (snapshots can block frees; rollback can amplify it)

cr0x@server:~$ zfs list -o name,used,avail,refer,mounted tank
NAME  USED  AVAIL  REFER  MOUNTED
tank  8.21T  640G  96K    yes

Meaning: Only 640G available. If you create clones, receive streams, or keep many snapshots, you might hit 100% and trigger performance collapse or allocation failures.

Decision: If free space is tight, avoid operations that increase referenced space (big clones, large receives). Consider deleting non-critical snapshots (carefully) or adding capacity before doing “clever” recovery.
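
zpool list adds the pool-level view, including capacity percentage and free-space fragmentation, both of which matter before you clone or receive anything large (numbers illustrative):

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME   SIZE  ALLOC  FREE   CAP  FRAG  HEALTH
tank  8.84T  8.21T  640G   92%   61%  ONLINE

Past roughly 80-90% capacity, and especially with high fragmentation, allocations slow down and recovery work gets riskier; treat that threshold as part of the decision.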

Task 11: See what’s pinning space via snapshots

cr0x@server:~$ zfs list -t snapshot -o name,used,refer -S used tank/app | head
NAME                                      USED  REFER
tank/app@autosnap_2025-12-25_23-00         88G   612G
tank/app@autosnap_2025-12-26_00-00         74G   618G
tank/app@autosnap_2025-12-26_01-00        120M   620G
tank/app@before_rollback_2025-12-26T0235    0B   621G

Meaning: Older snapshots with large USED values are retaining lots of blocks. That’s usually “expected,” but it’s how you end up with “df says free, ZFS says no.”

Decision: If capacity is the bottleneck, delete or offload the snapshots that retain the most space—after verifying they aren’t required for replication or compliance.
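
Before deleting anything, let zfs destroy tell you what it would do: -n makes it a dry run and -v prints the snapshots involved and the space it would reclaim. The % syntax names a range of snapshots on the same dataset (figures illustrative):

cr0x@server:~$ zfs destroy -nv tank/app@autosnap_2025-12-25_23-00%autosnap_2025-12-26_00-00
would destroy tank/app@autosnap_2025-12-25_23-00
would destroy tank/app@autosnap_2025-12-26_00-00
would reclaim 162G

Drop the -n only after confirming none of those snapshots anchor replication or carry holds.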

Task 12: Dry-run your rollback impact by listing what would be destroyed

cr0x@server:~$ zfs list -t snapshot -o name -s creation tank/app | awk 'seen{print} /@pre_upgrade$/{seen=1}'
tank/app@post_upgrade
tank/app@autosnap_2025-12-26_03-00
tank/app@autosnap_2025-12-26_04-00
tank/app@before_rollback_2025-12-26T0235

Meaning: This is the “future” relative to @pre_upgrade. If you rollback with destruction flags, these are in scope.

Decision: If any snapshot listed is needed (replication base, audit point, “before_rollback”), do not destroy. Use clone-based recovery or a more surgical approach.
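
ZFS will also enumerate the blast radius for you: a plain rollback refuses when newer snapshots exist and lists what -r would have to destroy (exact wording varies by OpenZFS version):

cr0x@server:~$ zfs rollback tank/app@pre_upgrade
cannot rollback to 'tank/app@pre_upgrade': more recent snapshots or bookmarks exist
use '-r' to force deletion of the following snapshots and bookmarks:
tank/app@post_upgrade
tank/app@autosnap_2025-12-26_03-00
tank/app@autosnap_2025-12-26_04-00
tank/app@before_rollback_2025-12-26T0235

Treat that refusal as a free second opinion, not an obstacle.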

Task 13: Perform the rollback (single dataset), only after you’ve made it boring

cr0x@server:~$ sudo systemctl stop app.service
cr0x@server:~$ zfs rollback tank/app@pre_upgrade
cr0x@server:~$ sudo systemctl start app.service
cr0x@server:~$ systemctl status --no-pager app.service
● app.service - Example App
     Loaded: loaded (/etc/systemd/system/app.service; enabled)
     Active: active (running)

Meaning: Service stopped to avoid writes during rollback. Rollback succeeded and the service is back. Note that the plain form only succeeds when the target is the most recent snapshot; if newer snapshots still exist (as in Task 12), ZFS refuses, and you either remove them deliberately first or use the -r form in Task 14.

Decision: If the service still fails, you’re now in application consistency territory (migrations, caches, schema drift). Move to app-layer recovery, not more ZFS thrashing.

Task 14: If you must destroy newer snapshots during rollback, do it explicitly and understand the blast radius

cr0x@server:~$ zfs rollback -r tank/app@pre_upgrade
cr0x@server:~$ zfs list -t snapshot -o name -s creation tank/app | tail -n 3
tank/app@autosnap_2025-12-26_01-00
tank/app@autosnap_2025-12-26_02-00
tank/app@pre_upgrade

Meaning: Newer snapshots are gone. That may include replication anchors. This is why you verify first.

Decision: Only accept this if you’ve confirmed you can re-seed replication, and you have a preserved “before_rollback” snapshot (possibly on another system) if you need it later.
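
If the “before_rollback” snapshot must survive a destructive rollback, copy it off the dataset first. A minimal sketch, assuming a reachable host named backuphost with a pool called backup (both hypothetical):

cr0x@server:~$ zfs send tank/app@before_rollback_2025-12-26T0235 | ssh backuphost zfs receive backup/app_incident

This is a full, non-incremental send, so check free space on both ends before starting it mid-incident.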

Task 15: Validate replication implications by checking bookmarks (if used)

cr0x@server:~$ zfs list -t bookmark -o name,creation tank/app
NAME                         CREATION
tank/app#replica_base_01      Fri Dec 20 03:00 2025
tank/app#replica_base_02      Wed Dec 25 03:00 2025

Meaning: Bookmarks can preserve send points even if snapshots are removed. If your replication strategy relies on bookmarks, you have more flexibility.

Decision: If you have a suitable bookmark, you may be able to keep incrementals alive. If not, plan for a full re-seed after destructive rollback.
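
If a suitable bookmark exists, an incremental send can resume from it instead of from a snapshot you no longer have. A sketch, again assuming the hypothetical backuphost/backup target:

cr0x@server:~$ zfs send -i tank/app#replica_base_02 tank/app@pre_upgrade | ssh backuphost zfs receive backup/app

This only works if the receiving side still has the snapshot the bookmark was created from; if it doesn’t, plan for a full re-seed.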

Task 16: Check encryption-related constraints before cloning or sending

cr0x@server:~$ zfs get -H -o property,value encryption,keylocation,keystatus tank/app
encryption   aes-256-gcm
keylocation  file:///etc/zfs/keys/tank_app.key
keystatus    available

Meaning: Encrypted datasets need keys loaded for mount/receive operations depending on your design. Rollback itself doesn’t “need” keys in the same way, but mounting clones does.

Decision: Ensure key material is available before cloning/mounting for recovery. If keys are missing, fix key management first—rolling back blindly won’t bring keys back.
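
If keystatus reports unavailable on the recovery host, load the key before trying to mount anything; clones of an encrypted dataset’s snapshots normally share the same encryption root, so loading the key for tank/app covers them too.

cr0x@server:~$ zfs load-key tank/app
cr0x@server:~$ zfs get -H -o value keystatus tank/app
available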

Fast diagnosis playbook

When rollback is on the table, you’re usually asking one of two questions: “Can I safely rewind?” and “If I rewind, will it actually fix the system?” This playbook is optimized for speed and minimal collateral damage.

First: Confirm you’re fixing the right layer

  1. Is the failure state purely on disk? Missing files, wrong config, corrupted deploy artifact. If yes, ZFS can help directly.
  2. Is the failure state in the application’s external world? Database schema mismatch, cached secrets, downstream API incompatibility, queued messages. If yes, rollback might be irrelevant or harmful.
  3. Is this actually hardware or pool health? If reads are failing or latency is huge, rollback won’t change physics.

Second: Identify the smallest safe rollback unit

  1. Map mountpoints to datasets (zfs list -o name,mountpoint).
  2. Pick the most specific dataset that contains the problem.
  3. Check whether child datasets exist; zfs rollback only touches the named dataset, so decide explicitly which children (if any) need their own rollback.

Third: Check the “don’t break the future” constraints

  1. Replication chain: Will deleting newer snapshots break incremental sends?
  2. Clones: Are there dependent clones that will block or be destroyed?
  3. Space: Are you within a safe free-space margin to clone/migrate if needed?
  4. Coordination: Can you stop writers cleanly? If not, you’re rolling back into a moving target.

Quick bottleneck check if things are slow (common during snapshot-heavy incidents)

cr0x@server:~$ zpool iostat -v 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        8.21T   640G    120    980  32.1M  210M
  raidz2-0                  8.21T   640G    120    980  32.1M  210M
    sda                         -      -     15    130  4.0M   28.0M
    sdb                         -      -     14    126  3.9M   27.5M
    sdc                         -      -     46    340  12.2M  72.1M
    sdd                         -      -     45    340  12.0M  72.0M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Uneven per-disk load can suggest a failing disk, a hot vdev, or pathological IO patterns. Snapshot churn plus low free space can also turn writes into misery.

Decision: If IO is the bottleneck, stabilize performance (free space, pause heavy jobs, scrub scheduling) before doing big snapshot operations.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a fleet of application servers with ZFS for local state: configs, small data files, and some “temporary” artifacts that were never temporary. They had a habit of snapshotting the entire tank pool hourly because it was easy. And to be fair, it worked for months.

Then an engineer tried to revert a botched deploy. They saw tank@pre_deploy and assumed rolling back tank would neatly undo the change. It did undo the change. It also rolled back unrelated data that lived directly on that same dataset: a message queue’s spool directory and a monitoring agent’s state. The spool now contained messages that had already been processed. The monitoring agent forgot what it had seen. The system didn’t crash. It lied.

The truly expensive part was the lag: the incident didn’t announce itself with a bang. It showed up as duplicated work, inconsistent metrics, and “why is the queue replaying?” questions three hours later. Everyone had to reconstruct what happened from logs—logs that were also on ZFS and had been rolled back.

The fix was not magical. They split datasets so the unit of rollback matched the unit of ownership: tank/app, tank/queue, tank/monitoring. They also stopped treating pool-level rollback as a normal tool. Now “rollback” required naming the dataset explicitly in the change plan.

Mini-story 2: The optimization that backfired

A different org wanted faster recoveries. Their idea: keep more frequent snapshots, keep them longer, and rely on rollback for most “oops” moments. They tuned their autosnap tooling to keep many snapshots per day. It made managers happy because it sounded like resilience.

What they didn’t model was write amplification under space pressure. Snapshots retain old blocks; deletes don’t free space until snapshots drop; the pool’s free space shrank; allocations got more fragmented; performance degraded. The platform team noticed increased latency during peak writes, but it looked like an application problem because the pool was “healthy.”

Then a real incident hit: a bad schema migration. They wanted to rollback the database dataset. But they couldn’t easily, because the replication receiver expected a chain of snapshots that now spanned a huge timeline, and the pool was too tight on space to clone for investigation. Their “faster recoveries” design forced them into the slowest possible recovery: a full re-seed plus application-level reconciliation.

Afterward they kept the frequent snapshots, but shortened retention on the hottest datasets, introduced holds only for pre-change snapshots, and enforced a minimum free-space policy. Their best optimization ended up being a boring alert: “pool < 20% free.”

Mini-story 3: The boring but correct practice that saved the day

A payments-adjacent service (no names, no drama) ran on ZFS with strict change control. Before any deploy, their pipeline took a snapshot named @pre_change_$ticket. The snapshot was held with a tag and automatically replicated to a secondary system. Nothing fancy. Just consistent.

One evening a dependency update shipped with a subtle config default change. The service started rejecting valid requests. SRE got paged, and within minutes they had a clean decision: the problem began at deploy time, and their pre-change snapshot existed, was held, and was already off-host.

They didn’t even rollback immediately. They cloned @pre_change, diffed configs, and found the default change. They hotfixed the config in place. No rollback needed, no replication chain broken, and the incident report had a clear timeline because the “broken state” snapshot existed too.

It wasn’t glamorous. It was the systems equivalent of brushing your teeth. Also, it worked.

Common mistakes: symptoms → root cause → fix

1) “Rollback failed: dataset has dependent clones”

Symptoms: Rollback command errors mentioning clones or “dataset is busy” or clone dependencies.

Root cause: A snapshot you’re trying to roll back past is the origin of a clone, or the dataset is currently mounted and in active use.

Fix: List clones of the snapshot, decide whether to keep them, and pivot to clone-based recovery if you can’t destroy them. Stop services and unmount if it’s just “busy.” Use zfs list -t snapshot -o name,clones and lsof/fuser on the mountpoint if needed.

2) “We rolled back and the app is still broken”

Symptoms: Files look correct, but application errors persist; database complains; caches mismatch.

Root cause: App-level state outside the dataset: schema changes, external queues, remote dependencies, or journaling expectations.

Fix: Treat rollback as only the first step. Coordinate application recovery: database crash recovery, undo/redo logs, schema version checks, cache flush, or rolling back only the artifact dataset rather than the database.

3) “Rollback destroyed snapshots and replication is now stuck”

Symptoms: Incremental zfs send fails; receiver complains about missing snapshots; replication tooling errors after rollback.

Root cause: Destructive rollback (-r/-R) deleted snapshots that were the basis for incrementals.

Fix: Re-seed replication with a new full send, or use bookmarks if your strategy supports them. For the future: never destroy snapshots used as replication anchors; hold them until the receiver confirms.

4) “Space didn’t come back after deleting data”

Symptoms: You delete gigabytes, but zfs list shows little change; pool remains full; writes slow down.

Root cause: Snapshots are retaining the deleted blocks.

Fix: Identify which snapshots have high USED and prune according to retention/replication requirements. Consider splitting datasets so high-churn directories don’t pin space across the whole service.

5) “We rolled back the wrong dataset”

Symptoms: Unrelated services regress; logs disappear; monitoring shows time travel; multiple teams show up.

Root cause: Mountpoint-to-dataset mapping was assumed, not verified. Or rollback was run at pool level.

Fix: Always map the mountpoint to the dataset and rollback only that dataset. Use explicit dataset naming in runbooks. If you need to undo across multiple datasets, snapshot them together (see checklists) and rollback as a coordinated set.

6) “Rollback completed, but permissions/ACL behavior is weird”

Symptoms: Access errors or ACL evaluation changes after rollback; files “look” right but behave differently.

Root cause: Dataset properties (ACL type, xattr mode) or OS-level ACL handling changed. Snapshots don’t revert dataset properties.

Fix: Compare zfs get outputs against baseline; restore properties via config management. Treat properties as code.

7) “Rollback on encrypted dataset fails to mount clone”

Symptoms: Clone exists but won’t mount; key status unavailable; tooling errors around keys.

Root cause: Keys not loaded, wrong keylocation, or operational flow assumes keys are present on recovery host.

Fix: Ensure keys are loaded and accessible on the host where you’re cloning/mounting. Verify with zfs get keystatus. If you replicate encrypted datasets, have a key management plan that works during incidents.

8) “We used rollback for file recovery and made everything worse”

Symptoms: One missing file becomes a full service rollback, losing legitimate writes since snapshot time.

Root cause: Using dataset rollback when you needed file-level restore.

Fix: Mount/browse snapshot or clone and copy back specific paths. Rollback is last resort for broad reversions, not your default undelete.

Joke #2: If your rollback plan is “we’ll just roll back,” congratulations—you’ve invented a backup strategy with the confidence of a fortune cookie.

Checklists / step-by-step plan

Checklist A: The “I need to restore a few files” plan (no rollback)

  1. Identify dataset for the path (zfs list -o name,mountpoint).
  2. List snapshots around the incident time.
  3. Create an incident snapshot of current state and hold it.
  4. Clone the target snapshot to a recovery dataset.
  5. Copy files from the clone to live, preserving ownership and ACLs (see the sketch after this list).
  6. Validate app behavior. Keep the clone until you’re sure.
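
A minimal sketch for step 5, assuming the recovery clone from Task 8 and a config directory worth restoring (paths illustrative); rsync’s -A and -X flags carry ACLs and extended attributes across:

cr0x@server:~$ sudo rsync -aHAX /mnt/app_recover/config/ /srv/app/config/

The trailing slashes matter: they copy the directory’s contents rather than nesting the directory one level deeper.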

Checklist B: The “Rollback this dataset safely” plan

  1. Stop writers. Systemd service stop, application maintenance mode, or at least block writes.
  2. Confirm pool health. If degraded, don’t add more chaos.
  3. Confirm dataset boundaries. Don’t roll back the pool unless you want to own everyone’s problems.
  4. Snapshot current state. Name it clearly and put a hold on it.
  5. Check clones. If dependent clones exist, decide whether rollback is feasible.
  6. Check “newer snapshot” consequences. Determine whether you can keep or must destroy snapshots after the target.
  7. Consider clone-first rehearsal. If you’re not sure, clone and test app start on the clone.
  8. Rollback. Use the least destructive form that works. Avoid -R unless you have to.
  9. Start services and validate. Validate not just “service running,” but correctness and data sanity.
  10. Clean up responsibly. Release holds when safe, remove temporary clones, and document the snapshot used.

Checklist C: Coordinated rollback across multiple datasets

This is where teams get burned. Multi-dataset state (app + db + queue) needs coordinated snapshots and coordinated rollback, or you get “internally consistent but mutually inconsistent.”

  1. Stop the entire stack or put it in a quiesced mode.
  2. Create a snapshot on each dataset with the same name token (example: @pre_change_TICKET); see the sketch after this list.
  3. Hold those snapshots under an “incident” or “change” tag.
  4. When rolling back, rollback the set in a defined order (usually queue/consumers first, then db, then app), and validate each layer.
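
A minimal sketch for step 2, using dataset names from earlier in this article plus a hypothetical tank/queue dataset and ticket number:

cr0x@server:~$ TICKET=CHG-1042
cr0x@server:~$ for ds in tank/queue tank/db tank/app; do
>   zfs snapshot "${ds}@pre_change_${TICKET}"
>   zfs hold "change_${TICKET}" "${ds}@pre_change_${TICKET}"
> done

If the datasets share a common parent, zfs snapshot -r parent@pre_change_TICKET takes the whole set atomically in one operation, which is usually the better starting point.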

FAQ

1) Does ZFS rollback delete data permanently?

Rollback discards the live dataset’s changes after the snapshot. If you also destroy newer snapshots (via -r/-R), then yes, that history is gone unless replicated elsewhere.

2) What’s the difference between rollback and cloning a snapshot?

Rollback rewinds the existing dataset in place. A clone creates a new writable dataset based on the snapshot. Clones are safer when you’re unsure, need file recovery, or want to test boot/app start without touching prod.

3) Can I rollback just one directory?

No. Rollback applies to a dataset. If you want “just one directory,” use a snapshot browser (like .zfs/snapshot where available) or clone and copy the directory back.

4) Why is rollback blocked by “newer snapshots exist”?

ZFS prevents accidental destruction of the snapshot timeline. You can override it, but ZFS wants you to explicitly admit you’re deleting the future.

5) Will rollback restore dataset properties like compression or recordsize?

No. Properties are not snapshot state in the way file contents are. If a property change caused the issue, fix the property directly and snapshot the change policy in configuration management.

6) How does rollback interact with replication?

Replication depends on snapshot continuity. If you delete snapshots that were sent (or expected as incremental bases), you can break incrementals and require a full re-seed. Use holds for replication anchors, and plan rollback so you don’t delete them casually.

7) Is rollback safe for databases?

Sometimes. It depends on whether the database can recover from a crash-consistent image and whether you coordinate with replicas, PITR tooling, and schema migrations. If you can’t articulate those dependencies, prefer clone-and-validate or database-native restore paths.

8) What’s the safest “undo” if I’m not sure what changed?

Snapshot current state (and hold it), then clone a known-good snapshot and compare. If the diff reveals a small set of files, restore those. Rollback is for when you’re confident the entire dataset should be rewound.

9) When should I use holds?

Use holds for any snapshot that must not disappear: pre-change snapshots, incident evidence snapshots, replication anchors, and compliance snapshots. Release holds deliberately when the window of risk has passed.

10) Does rollback require unmounting the dataset?

Not always, but stopping writers is the real requirement. In practice, stop the service and ensure nothing is actively writing. If the dataset is busy, unmount or identify the process holding it.

Conclusion: next steps you can actually do

Rollback is a sharp tool. It can save a night, a quarter, or your credibility. But the safe way isn’t “type rollback faster.” The safe way is to reduce uncertainty and narrow blast radius.

Do these next:

  • Split datasets by ownership and rollback domain. If two services share a dataset, your rollback risk is already priced in.
  • Standardize pre-change snapshots with holds. Make them automatic and boring, and replicate them if you can.
  • Practice clone-first recovery. Make “clone, diff, copy back” the default muscle memory.
  • Write down your destructive flags policy. When is -r allowed? When is it forbidden? Who signs off?
  • Run a game day: break a config, recover with snapshots, then do it again under “low space” conditions. That’s the real test.

If your incident response relies on heroics, you don’t have an incident response. You have a recurring appointment with chaos. ZFS rollback, done safely, is how you cancel that appointment.
