ZFS Ransomware Response: The Snapshot Playbook That Saves You

Ransomware doesn’t announce itself politely. It shows up as a Slack message: “Why are all the PDFs named .locked?” or as a graph: write IOPS pegged, latency climbing, users complaining that “files are corrupt.” Then you find the ransom note, and everyone suddenly becomes very interested in your backup strategy.

If you run ZFS, you have a weapon most filesystems can only cosplay: snapshots with cheap copy-on-write semantics, plus replication that can move clean history off the box. That only helps if you can prove you have clean restore points, choose the right one under pressure, and roll forward without re-infecting yourself. This is the playbook for that moment.

What ransomware looks like on ZFS (and what it doesn’t)

Ransomware isn’t one thing. On storage, it typically lands in one of these buckets:

  • File-encrypting ransomware: walks a file tree and rewrites files (or writes new encrypted copies then deletes originals). On ZFS this is mostly random writes and metadata churn.
  • Wiper behavior: deletes, truncates, overwrites. Same effect: live dataset becomes garbage.
  • Credentialed attacker: the worst kind. They don’t need fancy malware. They log in as you (or your automation) and run destructive commands: deleting snapshots, destroying datasets, killing replication.
  • Hypervisor/VM compromise: encrypts inside the guest. Your ZFS sees big writes to zvols or VM images, not individual files.
  • Backup-target compromise: attacker hits the system that stores your “backups,” encrypts those too. If your ZFS box is the backup target and it’s writable from the infected environment, congratulations: you built a high-performance self-own.

ZFS snapshots help most with the first two. They help with the third only if snapshot deletion is constrained and replication has a safety gap. If attackers can delete snapshots and replicated copies, your snapshots are just a comforting story you tell yourself.

One line to keep in your head when things get loud: “Hope is not a strategy.” The attribution is fuzzy (it gets credited to various operations leaders), but the point stands either way: hope doesn’t pass audits, and it doesn’t restore data.

Snapshot semantics: why this works

ZFS snapshots are point-in-time views of a dataset. They’re cheap because ZFS is copy-on-write: new writes allocate new blocks; existing blocks referenced by a snapshot stay put. This is the core advantage: ransomware can rewrite the live dataset all it wants; your snapshot still points at the old blocks.

Two gotchas: snapshots are not magic and they are not immutable by default. If an attacker (or a misconfigured automation tool) can run zfs destroy pool/ds@snap, your “immutable backups” become “a historical curiosity.” Also: snapshots preserve corrupted-but-consistent state. If ransomware encrypts your files cleanly, snapshots keep the pre-encryption view, but only if you have a snapshot from before encryption began.
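
A hold is the built-in speed bump for exactly this. A minimal sketch with generic names (the exact error text varies by OpenZFS version):

cr0x@server:~$ zfs hold legal pool/ds@snap
cr0x@server:~$ zfs destroy pool/ds@snap
cannot destroy snapshot pool/ds@snap: dataset is busy
cr0x@server:~$ zfs release legal pool/ds@snap

A hold raises the bar; it doesn’t build a wall. Anyone with hold/release privileges can release it and then destroy, so it stops accidents and lazy scripts, not an attacker with full admin rights.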

Threat model decisions that matter

  • Assume credentials are compromised until proven otherwise. Your restore host should not trust the infected host.
  • Prefer restore-elsewhere over in-place rollback for business-critical systems when you can afford it. It gives you forensics and reduces re-infection risk.
  • Replication is your “air gap” only if it’s not continuously writable and deletable from the same control plane.

Facts and history that change how you respond

These are not trivia. They are the small truths that determine whether your recovery is a victory lap or a post-mortem.

  1. ZFS snapshots arrived early and stayed weirdly underused. Sun introduced ZFS in the mid-2000s with snapshots as a first-class feature, long before “ransomware” was a board-level word.
  2. Copy-on-write is both the superpower and the performance trap. It makes snapshots cheap, but sustained random overwrites (hello ransomware) can fragment free space and hammer latency.
  3. Snapshot deletions can be expensive. Destroying large, old snapshots can trigger heavy space map work and I/O, which matters in the middle of an incident when you’re trying to restore fast.
  4. ZFS “send/receive” can be a forensic time machine. You can replicate incremental state and reconstruct timelines, not just restore data.
  5. Ransomware operators shifted to double extortion. Modern incidents often involve data theft plus encryption. Snapshots fix encryption; they don’t undo exfiltration.
  6. Attackers learned to target backups. The last decade pushed them toward deleting shadow copies, VSS snapshots, backup catalogs—and yes, ZFS snapshots if they can reach them.
  7. Immutable snapshot policies exist, but they’re operational choices. On some platforms you can configure snapshot holds, delegated permissions, or separate admin domains. None is automatic; all require discipline.
  8. Encryption can help and hurt. Native ZFS encryption prevents raw disk theft and can limit some forensics, but it doesn’t stop ransomware running as a user who can read/write files.
  9. Fast restores require free space planning. The best snapshot in the world is useless if you can’t mount it, clone it, or receive it because your pool is at 95% and angry.

Joke #1 (short, relevant): Ransomware is the only workload that makes everyone suddenly appreciate “boring” storage engineering. It’s like a surprise audit, but with more profanity.

The first hour: contain, preserve, decide

In the first hour you’re not “restoring.” You’re buying options.

1) Contain the blast radius

If encryption is ongoing, every minute matters. Cut off the write path. That might mean:

  • Disconnect SMB/NFS exports from the network.
  • Stop application services that write to affected datasets.
  • Disable compromised accounts or revoke tokens.
  • Quarantine the host at the switch or firewall.

Containment beats cleverness. If you can’t stop the writes, you can’t trust the timeline.
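
If your shares are ZFS-managed and the host is a typical Linux box running Samba and kernel NFS, containment can be as blunt as the sketch below; the service names (smbd, nfs-server) and the use of ZFS share properties are assumptions about your setup:

cr0x@server:~$ zfs set sharesmb=off tank/share
cr0x@server:~$ zfs set sharenfs=off tank/share
cr0x@server:~$ systemctl stop smbd nfs-server

If shares are defined outside ZFS (smb.conf, /etc/exports), stop the services or firewall the ports instead. The mechanism doesn’t matter; ending writes from untrusted clients does.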

2) Preserve evidence without slowing recovery to a crawl

You need enough evidence to answer “how did they get in?” and “how far did this go?” without turning this into a forensic science fair. Practical middle ground:

  • Take a new snapshot of affected datasets immediately (yes, even if the data is bad). You’re freezing the current state for later analysis.
  • Export logs (auth logs, SMB logs, application logs) off the host.
  • If you replicate, pause and copy the current replication state for later review; don’t overwrite good history.
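
A minimal log-export sketch, assuming a Linux host and a hypothetical forensics destination; adjust paths and log locations for your platform:

cr0x@server:~$ tar czf /root/ir-2025-12-26-logs.tar.gz /var/log/auth.log* /var/log/samba /var/log/syslog*
cr0x@server:~$ scp /root/ir-2025-12-26-logs.tar.gz analyst@forensics-host:/cases/2025-12-26/

Copy evidence to a host the stolen credentials cannot reach, not to another share in the same blast radius.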

3) Decide your recovery mode

You have three main patterns:

  • Rollback in place: fast, risky. Good for simple file shares if you’re confident the host is clean and you know the right snapshot.
  • Clone snapshot and serve clone: safer. You can mount the clone read-only initially, validate, then swap.
  • Restore to a clean system elsewhere: safest. Costs more time and capacity, but reduces reinfection and preserves the compromised system for investigation.

My bias: for anything that runs code (CI servers, app servers with writable volumes, VM datastores), restore elsewhere. For dumb file shares with good identity hygiene, clones are a sweet spot.

Fast diagnosis playbook (bottleneck-first triage)

This is the “what do I check first, second, third” list when everyone is staring at you and you need to stop guessing.

First: is the attacker still writing?

  • Look for sustained write I/O and file churn on the affected datasets.
  • Confirm whether SMB/NFS clients are still connected and actively modifying data.
  • Decision: if writes are ongoing, isolate or stop services before doing anything else.

Second: do you still have clean snapshots?

  • List recent snapshots and verify you have a timeline spanning before suspected encryption started.
  • Check whether snapshots were deleted recently (unexpected gaps).
  • Decision: if snapshot history is missing, switch to replication/offsite or alternate backups immediately; don’t waste time planning a rollback that can’t happen.
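
If snapshots did disappear, the pool’s own command history often says who and when; zpool history records successful state-changing commands, and -l adds the user and host:

cr0x@server:~$ zpool history -l tank | grep -E 'zfs destroy|zfs hold|zfs allow' | tail -n 20

Treat it as a lead, not a verdict: the history lives on the same box the attacker may control, so cross-check against centrally collected logs.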

Third: what’s your restore bottleneck—CPU, disk, network, or space?

  • Space: pool usage and fragmentation; if you’re near full, clones/receives may fail.
  • Disk: high latency, resilvering, or errors; restoring on a dying pool is a slow tragedy.
  • Network: replication or restore across links; measure throughput and packet loss.
  • CPU: encryption/compression can cap send/receive rates; check if you’re CPU-bound.
  • Decision: choose rollback vs clone vs restore-elsewhere based on the constraint. If space is tight, rollback may be the only move; if the host is suspect, restore elsewhere even if it’s slower.

Fourth: validate cleanliness without re-infecting

  • Mount candidate snapshots/clones read-only first.
  • Scan for known ransom note patterns, new extensions, and recently modified executables/scripts in shared paths.
  • Decision: pick the restore point where business data is intact and malicious artifacts are absent (or at least understood).
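
A rough validation pass over a read-only view, assuming GNU find and the snapshot path used in the tasks below; tune the patterns to whatever extension and ransom note your incident actually uses:

cr0x@server:~$ find /srv/share/.zfs/snapshot/auto-2025-12-26-1200 \( -name '*.locked' -o -iname 'READ_ME*' \) | head
cr0x@server:~$ find /srv/share/.zfs/snapshot/auto-2025-12-26-1200 -type f -newermt '2025-12-26 09:00' | head

Empty output from the first command and a normal-looking trickle (not a flood) from the second is what a clean restore point tends to look like.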

Hands-on tasks: commands, outputs, decisions (12+)

These are the commands you actually run, with realistic outputs and what you do next. Hostnames and pool names are examples; keep your own consistent. The examples assume a pool named tank with datasets tank/share and tank/vm.

Task 1: confirm pool health (don’t restore onto a burning platform)

cr0x@server:~$ zpool status -x
all pools are healthy

What it means: No known device errors or resilvering in progress.

Decision: Proceed. If you see DEGRADED, FAULTED, or resilvering, plan for slower restores and consider restoring to different hardware.

Task 2: check pool capacity and fragmentation risk

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,health
NAME  SIZE  ALLOC   FREE  FRAG  CAP  HEALTH
tank  54.5T  41.8T  12.7T   41%  76%  ONLINE

What it means: 76% full, moderate fragmentation. Not great, not fatal.

Decision: Clones and receives should work. If CAP is > 90%, expect failures and prioritize rollback or add capacity first.

Task 3: identify which datasets are getting hammered

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank
NAME        USED   AVAIL  REFER  MOUNTPOINT
tank        41.8T  11.9T   128K  /tank
tank/share   8.4T  11.9T   8.4T  /srv/share
tank/vm     31.2T  11.9T  31.2T  /srv/vm

What it means: tank/vm is huge; if VMs are impacted, restore time will be dominated there.

Decision: Split the incident: file shares may be quick to recover; VM datastore may need a different approach (restore subset, prioritize critical VMs).

Task 4: watch real-time I/O to confirm encryption is active

cr0x@server:~$ zpool iostat -v tank 2 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        41.8T  12.7T    410   8120  32.1M  1.02G
  raidz2-0                  41.8T  12.7T    410   8120  32.1M  1.02G
    sda                         -      -    103   2030   8.1M   261M
    sdb                         -      -    102   2030   8.0M   260M
    sdc                         -      -    103   2030   8.0M   261M
    sdd                         -      -    102   2030   8.0M   260M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Heavy sustained writes. If this is not a known workload window, assume active encryption or mass rewrite.

Decision: Contain: disable shares, stop services, quarantine host. Do not start rollback while writes are ongoing.

Task 5: list snapshots and spot gaps

cr0x@server:~$ zfs list -t snapshot -o name,creation,used -s creation | tail -n 8
tank/share@auto-2025-12-26-0700  Fri Dec 26 07:00 2025   56M
tank/share@auto-2025-12-26-0800  Fri Dec 26 08:00 2025   60M
tank/share@auto-2025-12-26-0900  Fri Dec 26 09:00 2025   58M
tank/share@auto-2025-12-26-1000  Fri Dec 26 10:00 2025   62M
tank/share@auto-2025-12-26-1100  Fri Dec 26 11:00 2025   59M
tank/share@auto-2025-12-26-1200  Fri Dec 26 12:00 2025   61M
tank/share@auto-2025-12-26-1300  Fri Dec 26 13:00 2025   63M
tank/share@auto-2025-12-26-1400  Fri Dec 26 14:00 2025   65M

What it means: Regular hourly snapshots exist up to 14:00. Good sign.

Decision: Identify when encryption started; pick snapshot just before. If there’s a gap (missing hours/days), suspect snapshot deletion or broken automation—switch to replication/offsite.

Task 6: take a “crime scene” snapshot right now

cr0x@server:~$ zfs snapshot -r tank/share@incident-2025-12-26T1425Z
cr0x@server:~$ zfs list -t snapshot -o name,creation | grep incident | tail -n 2
tank/share@incident-2025-12-26T1425Z  Fri Dec 26 14:25 2025

What it means: You froze current state for later analysis.

Decision: Keep it. Do not delete “bad” snapshots during incident response; you’ll want them when legal/security asks questions.

Task 7: check snapshot holds (and add one when you must preserve)

cr0x@server:~$ zfs hold keep tank/share@auto-2025-12-26-1200
cr0x@server:~$ zfs holds tank/share@auto-2025-12-26-1200
NAME                             TAG   TIMESTAMP
tank/share@auto-2025-12-26-1200  keep  Fri Dec 26 14:28 2025

What it means: The keep tag is a user hold on that snapshot. ZFS refuses to destroy a held snapshot until the hold is explicitly released with zfs release.

Decision: Put holds on your candidate “last-known-good” snapshots before you start manipulating anything.

Task 8: confirm what changed recently (high-signal triage)

cr0x@server:~$ zfs diff -H tank/share@auto-2025-12-26-1200 tank/share@auto-2025-12-26-1400 | head
M	/srv/share/finance/Q4.xlsx
+	/srv/share/finance/Q4.xlsx.locked
+	/srv/share/READ_ME_NOW.txt
M	/srv/share/projects/roadmap.docx
+	/srv/share/projects/roadmap.docx.locked

What it means: You’re seeing typical ransomware markers: new ransom note, rewritten files, new extensions.

Decision: Pick a snapshot before those artifacts appear. If artifacts exist across many snapshots, encryption started earlier than you thought.

Task 9: safely inspect a snapshot without changing anything

cr0x@server:~$ ls -la /srv/share/.zfs/snapshot/auto-2025-12-26-1200 | head
total 12
drwxr-xr-x   9 root root    9 Dec 26 12:00 .
drwxr-xr-x  10 root root   10 Dec 26 14:00 ..
drwxr-xr-x  12 root root   12 Dec 20 09:10 finance
drwxr-xr-x  34 root root   34 Dec 24 17:03 projects

What it means: Snapshots are visible via the .zfs directory (if enabled and not hidden by share settings).

Decision: Use snapshot browsing for quick validation. If you’re on a platform where .zfs isn’t exposed, mount a clone instead.

Task 10: clone a snapshot for read-only validation

cr0x@server:~$ zfs clone -o mountpoint=/mnt/restore_share -o readonly=on tank/share@auto-2025-12-26-1200 tank/share-restore
cr0x@server:~$ zfs get -o name,property,value readonly,mountpoint tank/share-restore
NAME              PROPERTY    VALUE
tank/share-restore  readonly    on
tank/share-restore  mountpoint  /mnt/restore_share

What it means: You now have a clone mounted read-only for validation and scanning.

Decision: Validate content and spot-check critical directories. If clean, plan the cutover: either promote the clone or rsync data out to a clean destination.

Task 11: rollback in place (only when you accept the risk)

cr0x@server:~$ zfs rollback -r tank/share@auto-2025-12-26-1200
cr0x@server:~$ zfs list -o name,used,refer,mountpoint tank/share
NAME        USED  REFER  MOUNTPOINT
tank/share  8.1T  8.1T  /srv/share

What it means: Dataset state reverts to that snapshot. The -r flag does not mean “recursive into child datasets”; it destroys any snapshots and bookmarks newer than the rollback target, which ZFS requires before it will roll back past them.

Decision: Use this only when you’ve stopped the attacker’s access path and you’re confident you selected the right snapshot. Rollback is fast, but it discards all newer changes and, with -r, the newer snapshots too, including the @incident-* evidence snapshot from Task 6, so send or clone that evidence somewhere safe first. Make sure the business agrees on the data loss window.

Task 12: identify what’s holding space (why “df” lies during snapshots)

cr0x@server:~$ zfs list -o name,used,usedbysnapshots,usedbydataset,usedbychildren tank/share
NAME        USED  USEDSNAP  USEDDS  USEDCHILD
tank/share  8.4T      1.1T    7.3T         0B

What it means: Snapshots are pinning 1.1T of blocks. That’s not “wasted”; it’s your safety net.

Decision: Do not delete snapshots to “free space” mid-incident unless you have verified off-box backups and you’re about to hit a hard failure.

Task 13: verify replication state (is offsite clean and current?)

cr0x@server:~$ zfs get -H -o value receive_resume_token tank/share
-
cr0x@server:~$ zfs list -t snapshot -o name -s creation -r tank/share | tail -n 3
tank/share@auto-2025-12-26-1200
tank/share@auto-2025-12-26-1300
tank/share@auto-2025-12-26-1400

What it means: No resume token indicates no interrupted receive on this side; snapshot chain exists locally.

Decision: On your replication target, confirm it has snapshots that predate infection and that deletion permissions are restricted. If the target is “too in sync,” you might have replicated encrypted data as well—look for older points.

Task 14: do a controlled send/receive restore to a new dataset

cr0x@server:~$ zfs create tank/restored
cr0x@server:~$ zfs send tank/share@auto-2025-12-26-1200 | zfs receive -u tank/restored/share
cr0x@server:~$ zfs list -o name,mountpoint -r tank/restored
NAME                 MOUNTPOINT
tank/restored        /tank/restored
tank/restored/share  /tank/restored/share

What it means: You restored a point-in-time copy into a separate dataset tree, not overwriting the original. The parent dataset must exist before the receive, and -u keeps the result unmounted until you’re ready. Add -R to the send only if you need child datasets and their snapshots; it also carries properties like mountpoint, which is another reason to keep -u.

Decision: Prefer this when you need clean separation and you have space. You can then point exports to the restored dataset after validation.

Task 15: confirm dataset properties that impact recovery speed and correctness

cr0x@server:~$ zfs get -o name,property,value compression,recordsize,atime,sync tank/share
NAME       PROPERTY     VALUE
tank/share compression  lz4
tank/share recordsize   1M
tank/share atime        off
tank/share sync         standard

What it means: These settings affect performance and workload behavior. For VM-like workloads, recordsize and sync are especially relevant.

Decision: Don’t change these during the incident unless you know exactly why. “Tuning during an outage” is a hobby for people who enjoy surprise consequences.

Task 16: check for permission model risks (who can destroy snapshots?)

cr0x@server:~$ zfs allow tank/share
---- Permissions on tank/share ---------------------------------------
Local+Descendent permissions:
user backupsvc create,destroy,mount,snapshot,send,receive
user appsvc   mount

What it means: The backupsvc user can destroy snapshots. That is a high-value credential.

Decision: In the post-incident hardening phase, remove destroy unless absolutely required, and separate roles for snapshot creation vs deletion.

Restore strategies: rollback, clone, and “restore elsewhere”

Strategy A: rollback in place (fastest, sharpest edges)

Rollback is a sledgehammer. It’s also a lifesaver when the business is bleeding and you have a well-understood dataset (like a file share) and reliable snapshot cadence.

Use when:

  • You have high confidence the compromise vector is contained (credentials rotated, shares disabled, malware removed).
  • You can tolerate losing changes since the snapshot.
  • The dataset isn’t hosting code execution paths that might be re-triggered (e.g., shared scripts executed by servers).

Avoid when:

  • You suspect the host is still compromised.
  • The dataset is a VM datastore or database where rollbacks can collide with application consistency expectations.
  • You need forensics on the encrypted state and can’t afford to overwrite evidence.

Strategy B: clone a snapshot and cut over (safer, still quick)

A clone lets you validate and serve clean data without destroying the current dataset. You can also keep the “infected” dataset intact for investigation. Operationally, you can change mountpoints, exports, or SMB shares to point to the clone.

Watch out: clones consume space as they diverge. If you keep serving the clone for weeks, it becomes “production,” and the original becomes “that thing we’ll delete someday.” That day never comes, and your pool cap quietly climbs. Plan the cleanup.
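
A cutover sketch using the clone from Task 10; the exact order matters less than doing it with exports disabled and users locked out, and zfs promote is optional but makes the clone the “real” dataset going forward:

cr0x@server:~$ zfs set readonly=off tank/share-restore
cr0x@server:~$ zfs set mountpoint=none tank/share
cr0x@server:~$ zfs set mountpoint=/srv/share tank/share-restore
cr0x@server:~$ zfs promote tank/share-restore

After the promote, snapshots older than the clone’s origin move to tank/share-restore and the infected tank/share becomes the dependent dataset you keep around for forensics; many teams then zfs rename so the production name points at the clean data.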

Strategy C: restore elsewhere (the grown-up move)

Restore to a separate host/pool when you don’t trust the system that got hit. This is the default for VM datastores, databases, and anything with privileged automation tied to it.

Operational benefits:

  • Clean control plane: new SSH keys, new credentials, new exports.
  • Reduced reinfection risk: attackers often persist in the original environment.
  • Parallel work: one team restores, another does forensics on the compromised host.

Costs:

  • Capacity and time. You need space to receive and validate.
  • Network throughput can become the bottleneck.
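
Mechanically it’s ordinary send/receive pointed at a clean box; restore-host and the tank2 pool are placeholders here, and -s on the receive makes an interrupted transfer resumable instead of restart-from-zero:

cr0x@server:~$ zfs send tank/share@auto-2025-12-26-1200 | ssh restore-host zfs receive -s -u tank2/share

If you must push from the compromised side, use a key that can do nothing on the target except receive; better still, have the clean side pull (see the replication design section below).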

Joke #2 (short, relevant): The only thing scarier than restoring from backups is discovering your backups have been doing “interpretive dance” instead of replicating.

Three corporate mini-stories from the trenches

Mini-story 1: the incident caused by a wrong assumption

A mid-size company ran a ZFS-backed file share for departments. They were proud of their hourly snapshots. The helpdesk could restore files by browsing .zfs/snapshot. It felt modern. It was also built on a single assumption: “Only admins can delete snapshots.”

During a phishing-driven compromise, the attacker landed on a Windows workstation and later obtained credentials that happened to belong to a service account used for “self-service restores.” That account had far more power than the name suggested. It could snapshot, mount, and—because someone didn’t want to deal with retention scripts failing—destroy.

The attacker didn’t need custom ZFS knowledge. They found shell history and automation scripts, then ran a handful of ZFS commands to wipe the snapshot chain before encrypting the live share. The encryption was loud; the snapshot deletion was quiet.

Recovery was possible, but not from local snapshots. The team had to pull from an older offsite copy that was replicated weekly (not hourly), rehydrate terabytes over a constrained link, and then answer the question nobody likes: “So what did we lose?”

The fix was simple and humiliating: delegated permissions were split. Snapshot creation and replication were separated from snapshot destruction, and holds were used on the most recent daily points. They also stopped exposing .zfs to users and replaced self-service restores with a constrained workflow. The system became slightly less convenient and dramatically more survivable.

Mini-story 2: the optimization that backfired

A different shop hosted VM images on ZFS. They wanted faster performance for their busiest workload and made a set of aggressive tuning changes: larger recordsize, relaxed sync behavior in some places, and a trimmed-down snapshot schedule “to reduce overhead.” It was all justified with benchmarks and a few happy graphs.

Then ransomware hit a jump host that had access to the VM management network. The attacker didn’t encrypt files on the ZFS host; they encrypted inside the guests. The storage saw a storm of random writes across multiple large zvols. The team’s “reduced snapshot schedule” meant their last clean restore point for several critical VMs was many hours old.

The performance tuning made things worse during recovery. With fewer snapshots, they attempted to replicate entire large images from the last good point. The network link became the bottleneck, and CPU overhead from compression/encryption on the send side added insult. The restores were correct, but slow enough to breach internal recovery objectives.

The post-mortem wasn’t about ZFS being “slow.” It was about optimizing for the steady state while forgetting the failure mode. They reintroduced more frequent snapshots for VM datasets, separated “performance tuning” from “recovery design,” and put hard requirements around restore points for tier-0 systems. Steady-state graphs improved. But more importantly, the next incident would end with a restore, not a negotiation.

Mini-story 3: the boring but correct practice that saved the day

A regulated business ran ZFS for departmental shares and a small analytics cluster. Their snapshot policy was dull: 15-minute snaps for 24 hours, hourly for a week, daily for a month. Replication ran to an offsite target under a different administrative domain, with snapshot deletion permissions not delegated to production automation.

They also did a quarterly restore drill. Not a tabletop exercise. An actual restore of a representative dataset to a clean host, with a stopwatch and a checklist. Engineers hated it in the way you hate flossing: you know it’s correct, but it still feels like an accusation.

When ransomware hit via a compromised user workstation, the file share took damage. They contained access quickly, then cloned the last good snapshot to a quarantine mount and validated the absence of ransom artifacts. Cutover took hours, not days.

The offsite replication was the quiet hero. Even if the attacker had managed to delete local snapshots, the offsite target had older points protected by policy and permissions. The incident report was still painful, but recovery was procedural. No improvisation, no late-night heroics, no “we think we have everything.”

What saved them wasn’t genius. It was refusing to treat snapshots as a feature and instead treating them as a product with requirements: retention, protection, and test restores.

Hardening snapshots so the next time is boring

Snapshot cadence: match it to how ransomware behaves

Ransomware often runs fast, but not always. Sometimes it encrypts opportunistically, sometimes it crawls to avoid detection. Your snapshot schedule should assume you might only notice after hours.

  • Tier 0 (shared business data): 15-min for 24h, hourly for 7d, daily for 30d is a sane baseline.
  • VM datastores: frequent snapshots help, but don’t pretend they’re application-consistent. Use them as crash-consistent restore points and pair with guest-level backups for critical databases.
  • Home directories: high churn; set expectations with retention to avoid snapshot space blowups.
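
Most shops drive this with a snapshot tool (sanoid, zfs-auto-snapshot, a NAS platform’s scheduler), but the mechanism underneath is not exotic. A bare-bones hourly example as an /etc/cron.d entry, with names, paths, and schedule purely illustrative (cron needs the % signs escaped):

0 * * * * root /sbin/zfs snapshot -r tank/share@auto-$(date +\%Y-\%m-\%d-\%H00)

Whatever generates the snapshots, retention is policy and policy needs an owner; orphaned snapshot scripts are how you end up with either no history or a full pool.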

Protect snapshots from deletion (this is where adults live)

Defense is mostly about who can do what. Snapshots are vulnerable if deletion rights live in the same credential pool as day-to-day operations.

  • Use holds on daily/weekly “anchor” snapshots, especially on replication targets.
  • Split privileges: snapshot creation and replication are common; snapshot destruction should be rare and gated.
  • Separate admin domains: replication targets should not accept interactive logins from production admin accounts if you can avoid it.
  • Make the backup target boring and stingy: minimal packages, minimal services, strict firewall rules, read-only exports where possible.
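
Concretely, against the permission set from Task 16, the split looks something like this sketch (the daily anchor snapshot name is an assumption):

cr0x@server:~$ zfs unallow -u backupsvc destroy tank/share
cr0x@server:~$ zfs allow tank/share
---- Permissions on tank/share ---------------------------------------
Local+Descendent permissions:
user backupsvc create,mount,snapshot,send,receive
user appsvc   mount
cr0x@server:~$ zfs hold -r anchor tank/share@auto-2025-12-26-0000

Destruction then requires an account you don’t hand to automation, and the anchor snapshots need an explicit zfs release before anyone can remove them.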

Replication design: build an “oops gap” on purpose

If you replicate every minute and also propagate deletions immediately, you’ve built a high-availability pipeline for disasters. Add deliberate friction:

  • Delay destructive actions: retention and snapshot deletion on the target should be independent from source.
  • Keep longer retention offsite: the target is where you keep the history you hope you never need.
  • Consider pull-based replication: target initiates pulls from source, using limited credentials, rather than source pushing into target with broad rights.
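
A pull-based sketch, run from the replication target; the target host, the replsvc account, the key path, and the backup pool are all placeholders, and the source-side delegation gives that account enough to send but not to destroy:

cr0x@server:~$ zfs allow -u replsvc send,snapshot,hold tank/share
cr0x@target:~$ ssh -i /root/.ssh/pull_key replsvc@server "zfs send -i tank/share@auto-2025-12-26-1300 tank/share@auto-2025-12-26-1400" | zfs receive -u backup/share

The property that matters: the credential that can log into the target and delete its history never exists on the production side.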

Test restores: stop lying to yourself

Snapshots are not backups unless you can restore them under stress. Run drills. Time them. Verify content. Document the steps. Put the checklist where it can be found at 3 a.m. by someone who didn’t build the system.

Common mistakes: symptom → root cause → fix

1) “We rolled back but users still see encrypted files”

Symptom: After rollback, some directories still contain .locked files or ransom notes.

Root cause: You rolled back the wrong dataset (or wrong snapshot), or users are viewing a different export path (DFS namespace, SMB share mapping, client-side offline cache).

Fix: Confirm dataset mountpoints and share definitions; verify snapshot selection with zfs diff. Flush or invalidate client caches where applicable. Ensure you didn’t restore only a child dataset while the parent remains infected.

2) “Rollback fails with ‘dataset is busy’”

Symptom: ZFS refuses rollback due to active mounts.

Root cause: Processes still have files open, or the dataset is exported via NFS/SMB, or a jail/container is using it.

Fix: Stop services, unexport shares, and unmount if needed. Use clones/receive-elsewhere if you can’t safely stop everything.
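
A typical unwind, assuming Linux with fuser available and ZFS-managed NFS sharing; substitute your own service stack:

cr0x@server:~$ zfs set sharenfs=off tank/share
cr0x@server:~$ fuser -vm /srv/share
cr0x@server:~$ zfs unmount tank/share
cr0x@server:~$ zfs rollback -r tank/share@auto-2025-12-26-1200

Stop or kill whatever fuser reports before the unmount; if you can’t quiesce the dataset safely, fall back to a clone or a receive-elsewhere rather than forcing it.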

3) “We can’t clone or receive: no space left”

Symptom: Clone/receive operations fail; pool near full.

Root cause: High CAP or snapshot retention pinning large amounts of data, plus the incident’s write amplification.

Fix: Add capacity (best), migrate non-critical datasets off, or perform in-place rollback (if safe). Avoid mass snapshot deletions during peak I/O; they can worsen performance.

4) “Replication target also has encrypted data”

Symptom: Offsite snapshots include ransomware artifacts.

Root cause: Replication ran after infection; retention too short; deletions propagated; no protected anchor points.

Fix: Keep longer retention on target, use holds, decouple deletion policies, and consider delayed or pull-based replication. During incident, stop replication immediately to avoid overwriting good history.

5) “Restores are painfully slow”

Symptom: Send/receive throughput far below expectations.

Root cause: CPU-bound (encryption/compression), network bottleneck, pool fragmentation, or concurrent workloads (scrubs, resilvers, active ransomware writes).

Fix: Measure: check iostat, CPU usage, link throughput. Pause non-essential tasks. Restore to a different pool if the original is overloaded or unhealthy.

6) “Snapshots are gone, but nobody admits deleting them”

Symptom: Snapshot chain missing; logs unclear.

Root cause: Over-privileged automation or compromised service account executed destructive commands, or a retention script misfired under edge conditions.

Fix: Audit delegated permissions and automation. Gate deletion. Put holds on anchors. Log ZFS administrative actions centrally and immutably.

7) “We restored, then got reinfected”

Symptom: Encryption resumes after recovery.

Root cause: You restored data but not trust. Compromised credentials, persistent malware, or exposed shares remained.

Fix: Rotate credentials before cutover, rebuild compromised hosts, and reintroduce access gradually. Treat restoration as a change with controls, not a magical undo button.

Checklists / step-by-step plan

Incident response checklist (storage/operator view)

  1. Contain: disable SMB/NFS exports or firewall them; stop write-heavy apps; quarantine compromised endpoints.
  2. Freeze: take an @incident-* snapshot of affected datasets for evidence.
  3. Verify health: zpool status, capacity, errors. If pool is unhealthy, plan restore elsewhere.
  4. Stop replication: pause jobs so you don’t replicate encrypted state over good history.
  5. Find last-known-good snapshot: use zfs diff, spot-check content, look for ransom artifacts.
  6. Protect it: apply zfs hold to candidate snapshots.
  7. Choose recovery mode: rollback, clone and cutover, or restore elsewhere.
  8. Validate in quarantine: mount clone read-only; scan for artifacts; confirm business-critical files open correctly.
  9. Cut over: re-point shares/mounts; re-enable access in stages; monitor writes and auth logs.
  10. Post-restore hardening: rotate keys, adjust zfs allow, enforce retention, protect target.
  11. Document timeline: snapshot times, chosen restore point, data loss window, actions taken.
  12. Run a restore drill later: yes, later. But schedule it before everyone forgets.

Recovery cutover checklist (clone-based)

  1. Create clone from last-known-good snapshot, mount read-only.
  2. Validate directory structure, file integrity checks for critical samples, and absence of obvious ransomware artifacts.
  3. Create a writable clone (or promote workflow) if you need to serve data actively.
  4. Update SMB/NFS exports to point to the clone mountpoint (or swap mountpoints atomically where feasible).
  5. Bring service back for a small pilot group first.
  6. Watch I/O, auth failures, and new file creation patterns for at least one business cycle.
  7. Only then decide what to do with the infected dataset: keep for forensics, archive, or destroy after approvals.

Post-incident hardening checklist (snapshot resilience)

  1. Split ZFS delegated permissions: remove snapshot destruction from routine accounts.
  2. Set holds on anchor snapshots (daily/weekly) on the replication target.
  3. Decouple retention policies between source and target; do not propagate deletion blindly.
  4. Centralize logging for administrative commands and auth events; alert on snapshot destruction.
  5. Reduce writable access paths: least-privilege shares, separate admin networks, MFA where possible.
  6. Schedule restore tests and measure RTO/RPO with real data sizes.

FAQ

1) Are ZFS snapshots “immutable” backups?

No. Snapshots are durable point-in-time references, not immutable by default. If someone can run zfs destroy on them, they are deletable. Use holds, privilege separation, and offsite replication under a different admin domain if you want practical immutability.

2) Should I rollback or clone?

Clone when you can; rollback when you must. Clones preserve evidence and let you validate before cutover. Rollback is faster and simpler but discards newer changes and can reintroduce risk if the host is still compromised.

3) How do I know which snapshot is clean?

Use a combination: incident timeline (when symptoms started), zfs diff between snapshots, and direct inspection of a read-only clone or snapshot tree. Look for ransom notes, new extensions, and mass file modification patterns.

4) What if ransomware encrypted VM disks (zvols) instead of files?

You can still roll back or restore the zvol dataset, but the restore point must predate encryption. Expect large data movement and slower validation. For critical databases, pair ZFS restores with application-level recovery (transaction logs, consistent backups).

5) Will enabling ZFS encryption stop ransomware?

No. ZFS encryption protects data at rest and can reduce certain theft scenarios, but ransomware running with legitimate access can still read/write and encrypt your files inside the dataset.

6) Can attackers delete snapshots through SMB/NFS?

Not directly via file operations. But if they compromise credentials that have shell/API access to the ZFS host—or compromise automation that manages snapshots—they can delete snapshots using ZFS commands. Your risk is identity and privilege, not the SMB protocol itself.

7) Should I keep exposing .zfs/snapshot for self-service restores?

It’s convenient and often abused operationally (people treat snapshots as a recycle bin). If you keep it, restrict who can access it and ensure snapshot deletion privileges are tightly controlled. Many environments are better off with a controlled restore workflow.
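
Visibility of the snapshot directory is a per-dataset property; a quick sketch:

cr0x@server:~$ zfs set snapdir=hidden tank/share
cr0x@server:~$ zfs get -o name,property,value snapdir tank/share
NAME        PROPERTY  VALUE
tank/share  snapdir   hidden

Note that hidden means “not listed,” not “not reachable”: local users and many clients that know the path can still enter .zfs/snapshot. Real access control lives in share permissions and in who can reach the host at all.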

8) Why did freeing space get harder after the incident?

Because snapshots pin blocks. After mass rewrites, the “old” clean data is still referenced by snapshots, and the “new” encrypted data consumes additional space. That’s expected. Capacity planning must include incident headroom, not just steady-state growth.

9) Is it safe to delete the encrypted dataset after restoring?

Technically yes, operationally maybe. Get approvals from security/legal first if the incident triggers reporting requirements. If you need forensics, keep the encrypted state (or a snapshot of it) isolated and access-controlled.

10) What’s the single best improvement if we only do one thing?

Protect offsite snapshots from deletion by compromised production credentials. That means separate admin domain, holds on anchors, and retention that can’t be “cleaned up” by an attacker.

Next steps you can do today

When ransomware hits, ZFS snapshots can turn a career-limiting event into a bad day with a checklist. But only if you’ve built the path ahead of time: clean restore points, protected history, and a recovery routine that doesn’t depend on heroics.

Do these next:

  1. Audit zfs allow on critical datasets. Remove snapshot destruction from routine accounts and automation that doesn’t absolutely need it.
  2. Add holds to daily/weekly anchor snapshots on the replication target.
  3. Write down your “last-known-good selection” method (timeline + zfs diff + read-only clone validation).
  4. Run a restore drill on a representative dataset. Time it. Capture the steps. Fix what surprises you.
  5. Decide, in advance, when you will rollback vs clone vs restore elsewhere. During an incident is a bad time to invent philosophy.

Ransomware is an engineering problem wearing a criminal costume. Solve it with engineering: constraints, separation of duties, and tested recovery paths.
