ZFS Immutable Backups: Readonly + Snapshot Policies That Actually Hold Up

Backups don’t fail dramatically. They fail quietly, over months, until the day you need them and discover your “immutable” copies were writable, your snapshots were deletable, and your replication job was “green” because it only replicated emptiness.

ZFS can build backups that are stubborn in exactly the right ways—readonly datasets, snapshot retention, holds, and replication patterns that survive fat-fingers and ransomware. But you have to assemble the pieces correctly. ZFS will not save you from creative misuse.

What “immutable” means in ZFS land (and what it doesn’t)

In backup marketing, “immutable” often means “we set a flag and it felt comforting.” In production, immutable means something narrower and more testable:
the data you care about cannot be altered or deleted within a defined retention window, even if an attacker or an operator gains the usual level of access.

With ZFS, you don’t get a single magic immutability switch. You get layers:

  • Snapshots give you point-in-time consistency and cheap versioning.
  • Readonly datasets reduce accidental writes and break whole classes of “oops” moments.
  • Holds and delegated permissions can prevent snapshot deletion by default operators.
  • Replication topologies can make the backup server “dumber” (good) and the blast radius smaller.
  • Offline or delayed-delete controls (often outside ZFS) are what make ransomware cry.

Here’s what ZFS immutability is not:

  • Not protection from root on the backup host. If an attacker owns root and has the keys, they can destroy your pool. Your job is to make that harder, rarer, and detectable.
  • Not a substitute for offsite copies. Fire and flood don’t care about snapshots.
  • Not guaranteed by “readonly=on” alone. You can still destroy snapshots and datasets unless you design against it.

The operational goal is boring: ensure you can always restore some recent, clean copy inside your RPO window, and that you can prove it routinely.
“Boring” is a compliment in backup engineering.

Facts and historical context that shape real backup designs

  • ZFS was born at Sun to kill the “volume manager + filesystem” split. That’s why snapshots and checksumming are first-class, not bolt-ons.
  • Copy-on-write is the enabling trick. ZFS snapshots are cheap because blocks are shared until changed.
  • ZFS checksums data and metadata end-to-end. This made “silent corruption” a solvable problem, not an urban legend.
  • Snapshots are not backups—by themselves. They are stored on the same pool. Pool-loss equals snapshot-loss. This distinction has been ignored since snapshots were invented.
  • Early ZFS adopters learned that “scrub” is a schedule, not an emotion. Regular scrubs find bad sectors while you still have redundancy.
  • Incremental send/receive became the workhorse of ZFS-based DR. It’s basically a journal of changed blocks between snapshots.
  • Snapshot naming conventions became an industry of their own. Because when you’re panicking during an incident, you will pick the wrong snapshot unless the naming is obvious.
  • “Immutable backups” got popular after ransomware matured. The attackers learned to delete backups first, then encrypt. The defenders learned to stop making deletion easy.

One paraphrased idea worth keeping on your wall: “Hope is not a strategy,” a line attributed to James Cameron and repeated endlessly in operations circles because it’s painfully true.

A practical model: writable source, append-only-ish destination

The simplest reliable architecture is asymmetric:

  1. Source (production): writable datasets, frequent snapshots, replication sender.
  2. Backup target (vault): receives snapshots, keeps them readonly, retains them, and minimizes who can delete them.
  3. Optional second target (offsite): receives from the vault or directly from source, depending on trust boundaries.

The point is not to make the source immutable. The source is where applications run. It changes constantly. Your job is to make the destination hard to tamper with,
while keeping restore practical. If restores are hard, people won’t test them. If people don’t test them, you’re roleplaying disaster recovery.
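
A minimal sketch of the vault-side layout this article assumes (the names vault/backups and vault/restore are examples; adapt them to your pools):

sudo zfs create -o canmount=off vault/backups   # parent container for received, readonly datasets
sudo zfs create vault/restore                   # writable area reserved for restore-test clones

Keeping restores in their own subtree makes it obvious which datasets are ever allowed to be writable.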

Threat model, bluntly

  • Accidental deletion: an admin destroys snapshots or datasets, or a cleanup job runs wild.
  • Ransomware with credentials: attacker gets SSH keys, API tokens, or domain admin and targets backup infrastructure.
  • Compromised backup host: attacker gets root on the vault and tries to destroy retention.
  • Corruption: bad RAM, dying drives, flaky HBAs, or misbehaving firmware—ZFS can detect, but only if you scrub and monitor.

Your defenses should match the threat. Readonly protects against accidental writes. Holds and delegation protect against “normal” users deleting snapshots.
For “root on vault,” you need separation (different auth domain, MFA, limited keys), monitoring, and ideally an offline/offsite copy.

Readonly datasets vs snapshots: the difference that matters at 2 a.m.

Readonly dataset means you can’t modify files in the live filesystem view of that dataset (mountpoint). It’s a guardrail.
It’s excellent for backup targets: you receive replicated snapshots and then keep the dataset readonly so someone doesn’t “just quickly edit a config” inside the backup copy.

Snapshots are immutable views of the dataset at a time. But “immutable” here means blocks won’t change; it does not mean a privileged user can’t destroy the snapshot object.
Deletion is the real enemy in ransomware scenarios.

What readonly does protect you from

  • Random edits to the backup dataset by curious humans.
  • Misconfigured services that accidentally write into the backup mountpoint.
  • Some classes of malware that just encrypts mounted writable filesystems.

What readonly does not protect you from

  • zfs destroy on snapshots/datasets by someone with ZFS admin privileges.
  • Pool destruction (zpool destroy), device removal, or sabotaging importability.
  • Replication streams that overwrite your expectations (e.g., forced receives that roll back).

If you only remember one thing: readonly is for writes; holds/delegation are for deletes.

Joke #1: Readonly is like putting your leftovers in the office fridge with your name on it. It helps, but it doesn’t stop someone determined and hungry.

Snapshot policy that doesn’t rot: naming, cadence, retention

Snapshot policies rot when they’re clever. Avoid clever. The policy must be:
predictable, searchable, and easy to reason about during a restore.

Naming convention: pick one and never improvise

Use a prefix that encodes “managed by policy” and a timestamp that sorts lexicographically. Example:
auto-YYYYMMDD-HHMM for frequent snapshots, and daily-YYYYMMDD for longer retention tiers.

You can keep multiple tiers on the same dataset as long as your retention tool understands them. If you do it manually, keep it simple:
hourly for 48 hours, daily for 30 days, weekly for 12 weeks, monthly for 12 months. Adjust for RPO/RTO and capacity reality.
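
A cron job or systemd timer can implement those tiers with nothing fancier than this (a sketch; the dataset name and tiers are examples):

sudo zfs snapshot tank/prod@auto-$(date +%Y%m%d-%H%M)   # frequent tier, pruned after 48 hours
sudo zfs snapshot tank/prod@daily-$(date +%Y%m%d)       # daily tier, pruned after 30 days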

Cadence: match business rhythms, not your personal preferences

  • Databases: frequent snapshots matter, but consistency matters more. Coordinate with app-level quiesce or use replicas.
  • Home directories: hourly is often enough. People delete files during daylight hours; you want easy rollbacks.
  • VM images: snapshot near change windows and before patching. Also snapshot frequently enough to cover “oops we upgraded the wrong thing.”

Retention: what you keep is what you can restore

Retention is a storage budgeting exercise with teeth. Your retention policy must fit the pool with headroom, or it will self-destruct:
snapshots will fill the pool, performance will sag, allocations will fail, and someone will “temporarily” delete snapshots to recover space.
“Temporarily” is how backup policies die.

Don’t confuse snapshot count with safety

Ten thousand snapshots aren’t safer than fifty if you don’t replicate them, verify them, and protect them from deletion.
Large snapshot counts can also slow operations like snapshot listing and some administrative tasks. ZFS handles many snapshots well,
but your humans and your tooling may not.

Preventing deletion: holds, delegation, and admin blast radius

If your backup threat includes ransomware, you must assume the attacker will try to delete snapshots. You need at least one of:
holds, permission delegation that prevents snapshot destruction, or a separate system that enforces retention out of band.
Ideally: all of the above, in layers.

ZFS holds: the underrated seatbelt

A hold prevents snapshot destruction until the hold is released. That’s not “absolute immutability,” but it is a strong defense against
accidental deletion and against attackers who lack the specific privilege set (or don’t notice holds).

Holds are operationally nice because they’re explicit and inspectable. You can tag holds with names like policy or vault.
Use them on the vault side to enforce retention windows.
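
A hedged example of applying a policy hold across a whole tier on the vault (the daily- naming and the policy tag match the conventions used in this article; re-running it complains about holds that already exist, which is harmless):

zfs list -H -o name -t snapshot -r vault/backups/prod | grep '@daily-' | \
    xargs -r -n1 sudo zfs hold policy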

Delegation: fewer people should be able to destroy backups

Too many shops run the backup target with the same admin group as production. That’s convenient. It’s also how attackers get a two-for-one deal.
Split the roles:

  • Replication user can receive snapshots but cannot destroy old ones.
  • Backup operators can list and restore, but snapshot deletion is guarded.
  • A break-glass role can remove holds, with MFA and audit logging.
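
Expressed as zfs allow delegations, that split might look like this (a sketch; the account names are examples, run on the vault):

sudo zfs allow -u repl-user create,mount,receive vault/backups/prod
sudo zfs allow -u backup-ops mount,snapshot,send vault/backups/prod
sudo zfs allow -u breakglass destroy,hold,release vault/backups/prod
sudo zfs allow vault/backups/prod   # verify: only the break-glass account should hold destroy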

Readonly on the vault is still useful

Even with holds, set the received dataset to readonly. Most ransomware and most accidents are boring: they encrypt files they can write.
Make “write” hard, and you shrink the casualty list.

Replication done right: send/receive patterns and failure modes

ZFS replication is powerful because it copies snapshots exactly, including properties (when you want it to), and it can be incremental.
It’s also sharp enough to cut you.

Pattern A: Source snapshots, vault receives (recommended default)

The source takes snapshots on schedule. The vault pulls or the source pushes them. The vault retains snapshots for longer than the source.
The vault dataset is readonly and protected from deletion. This is the standard and it works.

Pattern B: Vault snapshots the received dataset (useful for “vault-side history”)

Sometimes you want vault-side snapshots (e.g., “daily-cold” copies) regardless of what the source does.
That’s fine, but it complicates restores: you now have two snapshot namespaces, and you must be precise about what you’re sending and what you’re holding.

Failure mode: forced receives and unexpected rollbacks

Replication tooling sometimes uses flags that can roll back the target to match the source stream. This can delete newer target-side snapshots
or destroy vault-local history if you’re sloppy. Your vault should be boring: mostly receive, retain, and serve restores.
Avoid “clever” bidirectional tricks unless you truly need them.
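
The boring version, sketched with the example names from this article: an incremental receive with no -F, so divergence surfaces as an error you investigate instead of a silent rollback.

sudo zfs send -i tank/prod@auto-20251226-0400 tank/prod@auto-20251226-0500 | \
    ssh backup-vault sudo zfs receive -u vault/backups/prod   # no -F: a diverged target fails loudly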

Failure mode: incremental chain broken

Incremental send requires the base snapshot to exist on both sides. If someone deletes a needed snapshot on the vault, your next replication fails,
and the “fix” becomes a full resend. Full resends are expensive, slow, and often happen at the worst possible time.

Joke #2: Incremental replication is like office gossip—delete one key detail and suddenly everyone has to start from the beginning.

Practical tasks with commands: what to run, what it means, what you decide

These are real operational tasks. Run them on the right host (source or vault), and treat the output as a decision point.
Each task includes: command, sample output, what it means, and what you do next.

Task 1: Confirm dataset readonly on the vault

cr0x@server:~$ zfs get -H -o property,value readonly vault/backups/prod
readonly	on

Meaning: The dataset’s live mount is not writable.
Decision: If it’s off, set it now on the vault dataset (not on production datasets that apps need to write).

Task 2: Enforce readonly (vault side)

cr0x@server:~$ sudo zfs set readonly=on vault/backups/prod
cr0x@server:~$ zfs get -H -o property,value readonly vault/backups/prod
readonly	on

Meaning: New writes through the mountpoint are blocked.
Decision: If some process needs write access, it should not be using the vault dataset. Fix the workflow, not the flag.

Task 3: List snapshots with creation time to verify cadence

cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation -r tank/prod | tail -5
tank/prod@auto-20251226-0100  Fri Dec 26 01:00 2025
tank/prod@auto-20251226-0200  Fri Dec 26 02:00 2025
tank/prod@auto-20251226-0300  Fri Dec 26 03:00 2025
tank/prod@auto-20251226-0400  Fri Dec 26 04:00 2025
tank/prod@auto-20251226-0500  Fri Dec 26 05:00 2025

Meaning: Snapshot cadence is consistent; names sort by time.
Decision: If there are gaps, check cron/systemd timers, snapshot tool logs, and pool health (failed snapshots can be a symptom).

Task 4: Check retention pressure via snapshot space usage

cr0x@server:~$ zfs list -t snapshot -o name,used,refer -S used -r tank/prod | head -5
NAME                         USED  REFER
tank/prod@auto-20251220-0100  48G   1.2T
tank/prod@auto-20251218-0300  41G   1.2T
tank/prod@auto-20251222-0900  39G   1.2T
tank/prod@auto-20251224-1800  35G   1.2T

Meaning: Some snapshots pin large amounts of old blocks (high USED).
Decision: Large USED snapshots are often “before a massive rewrite.” Keep them if they’re within policy; otherwise prune by tier, not by vibes.

Task 5: Verify holds on vault snapshots

cr0x@server:~$ zfs holds vault/backups/prod@daily-20251201
NAME                               TAG     TIMESTAMP
vault/backups/prod@daily-20251201   policy  Mon Dec  1 00:05 2025

Meaning: A hold named policy prevents deletion.
Decision: If there are no holds and you rely on holds for immutability, add them systematically (ideally via automation).

Task 6: Apply a hold to a snapshot you must not lose

cr0x@server:~$ sudo zfs hold policy vault/backups/prod@daily-20251201
cr0x@server:~$ zfs holds vault/backups/prod@daily-20251201
NAME                               TAG     TIMESTAMP
vault/backups/prod@daily-20251201   policy  Fri Dec 26 05:40 2025

Meaning: The snapshot is now protected from zfs destroy until the hold is released.
Decision: Use holds for retention tiers, legal holds, or incident freezes. Document who can release them.

Task 7: Prove a snapshot cannot be destroyed due to holds

cr0x@server:~$ sudo zfs destroy vault/backups/prod@daily-20251201
cannot destroy snapshot vault/backups/prod@daily-20251201: snapshot has holds

Meaning: The protection works.
Decision: If this does not fail as expected, your “immutability” is mostly a feeling. Fix it before the next incident.

Task 8: Check who can destroy snapshots (delegation)

cr0x@server:~$ sudo zfs allow vault/backups/prod
---- Permissions on vault/backups/prod --------------------------------
Local+Descendent permissions:
user repl-user create,mount,receive
user backup-ops mount,snapshot,send
user breakglass destroy,hold,release

Meaning: Delegated permissions restrict destructive actions to a specific account.
Decision: If your replication user has destroy, you are one compromised key away from regret. Remove it.

Task 9: Verify pool health (because replication won’t fix bad disks)

cr0x@server:~$ zpool status -x
all pools are healthy

Meaning: No known faults right now.
Decision: If you see degraded/faulted devices, fix hardware first. Backups on a sick pool are just slower corruption delivery.

Task 10: Check recent scrub results

cr0x@server:~$ zpool status vault
  pool: vault
 state: ONLINE
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Sun Dec 21 03:30:12 2025
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

Meaning: Scrub ran, repaired nothing, found no checksum errors.
Decision: If scrubs are not scheduled, schedule them. If scrub finds errors repeatedly, investigate drives, cabling, and RAM.

Task 11: Confirm replication snapshots exist on both sides

cr0x@server:~$ zfs list -t snapshot -o name -r tank/prod | grep auto-20251226-0500
tank/prod@auto-20251226-0500
cr0x@server:~$ zfs list -t snapshot -o name -r vault/backups/prod | grep auto-20251226-0500
vault/backups/prod@auto-20251226-0500

Meaning: The point-in-time exists at both ends, so incrementals can continue.
Decision: If it’s missing on the vault, replication is behind or failing. If it’s missing on the source, your snapshot job is failing.

Task 12: Run an incremental send/receive (manual, explicit)

cr0x@server:~$ sudo zfs send -I tank/prod@auto-20251226-0400 tank/prod@auto-20251226-0500 | ssh backup-vault sudo zfs receive -u -F vault/backups/prod
cr0x@server:~$ ssh backup-vault zfs list -t snapshot -o name -r vault/backups/prod | tail -2
vault/backups/prod@auto-20251226-0400
vault/backups/prod@auto-20251226-0500

Meaning: Incremental replication succeeded; the vault has the new snapshot. Flags matter:
-u receives without mounting; -F can roll back the target; use it only when you fully understand the target-side consequences.
Decision: If you need -F regularly, you likely have a snapshot chain hygiene problem. Fix snapshot deletion and naming.

Task 13: Estimate incremental stream size before sending

cr0x@server:~$ sudo zfs send -nv -i tank/prod@auto-20251226-0400 tank/prod@auto-20251226-0500
send from @auto-20251226-0400 to tank/prod@auto-20251226-0500 estimated size is 17.2G
total estimated size is 17.2G

Meaning: Dry-run estimate shows how big the incremental is.
Decision: If estimates are constantly huge, your workload is rewrite-heavy. Tune snapshot frequency, exclude churny datasets, or accept the capacity/network reality.

Task 14: Detect “snapshots pinning space” causing low free space

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint vault/backups/prod
NAME               USED  AVAIL  REFER  MOUNTPOINT
vault/backups/prod  38T   1.2T  1.4T   /vault/backups/prod

Meaning: The dataset is near capacity (AVAIL small). Performance and allocations may suffer.
Decision: Do not “solve” this by deleting random snapshots. Apply retention policy, add capacity, or move tiers off the pool.

Task 15: Check reservation and refreservation surprises

cr0x@server:~$ zfs get -H -o property,value reservation,refreservation vault/backups/prod
reservation	none
refreservation	none

Meaning: No space is artificially reserved.
Decision: If you find large reservations on a backup target, confirm they’re intentional. Reservations can starve other datasets and trigger panic-deletes.

Task 16: Verify you can actually restore (clone a snapshot)

cr0x@server:~$ sudo zfs clone vault/backups/prod@auto-20251226-0500 vault/restore/prod-test
cr0x@server:~$ zfs list -o name,mountpoint,readonly vault/restore/prod-test
NAME                     MOUNTPOINT                 RDONLY
vault/restore/prod-test  /vault/restore/prod-test   off

Meaning: You can materialize a point-in-time view for restore testing. Clones are writable by default.
Decision: Restore workflows should operate on clones, not on the immutable dataset. Keep the vault dataset readonly; make restores separate.

Fast diagnosis playbook: find the bottleneck first, not last

When backups fall behind, teams love to argue about “network” vs “storage” vs “ZFS being slow.” Don’t argue. Measure in order.

First: is the pool healthy and not dying?

  • Run zpool status. If you have READ/WRITE/CKSUM errors, you have a reliability incident, not a tuning problem.
  • Check whether a scrub is running. Scrub + replication can crush I/O on small pools.

Second: is capacity the hidden constraint?

  • Check zfs list and pool free space. Very full pools fragment and slow down, and snapshot deletion can become “the only fix” people reach for.
  • Inspect which snapshots pin space (zfs list -t snapshot -o used).

Third: is replication blocked on snapshot chain state?

  • Confirm base snapshots exist on both sides.
  • Look for “cannot receive incremental stream” errors in job logs; it usually means missing snapshots or diverged history.

Fourth: is it I/O, CPU, or network?

  • If sends are estimated huge, the workload is high churn. No amount of tuning makes “17G per hour” turn into “200M per hour.”
  • If you use encryption or compression on the stream, CPU can be real. Profile before changing settings.
  • If the vault is on slower disks than production (common), receives will lag. That’s not a bug; it’s your architecture asking for patience.

Fifth: are you shooting yourself with properties?

  • Check sync, recordsize, and compression on the vault datasets. Don’t cargo-cult production settings onto backups.
  • Be careful with atime on backup datasets; it can generate useless metadata writes if something scans the tree.
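
One way to review those properties in a single pass (a sketch; adjust the dataset name, or add -r to walk a whole subtree):

zfs get -o property,value,source sync,recordsize,compression,atime vault/backups/prod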

Common mistakes: symptoms → root cause → fix

Mistake 1: “Readonly means immutable”

Symptoms: Snapshots disappear during an incident; dataset is still readonly.

Root cause: Someone with ZFS privileges destroyed snapshots/datasets; readonly doesn’t stop that.

Fix: Use holds for retention tiers, remove destroy from replication/operator roles, and isolate the vault admin boundary.

Mistake 2: Replication “succeeds” but restores are missing data

Symptoms: Target has snapshots, but application data is inconsistent or missing expected recent files.

Root cause: Snapshot timing doesn’t align with app consistency, or you replicated the wrong dataset subtree.

Fix: Snapshot the correct dataset(s), coordinate with app quiesce if needed, validate restores with automated checks, and keep a restore test dataset.

Mistake 3: Full resends every week

Symptoms: Incremental receive fails; jobs fall back to full sends; network gets hammered.

Root cause: Base snapshot deleted on one side, or target diverged due to local snapshots/changes and forced rollbacks.

Fix: Protect base snapshots via holds, standardize naming, avoid local writes on received datasets, and stop using -F as a lifestyle.

Mistake 4: Pool hits 95% and everything becomes “slow ZFS”

Symptoms: Replication lag, high latency, occasional ENOSPC, snapshot deletions take forever.

Root cause: Capacity planning ignored snapshot growth; retention too aggressive for the pool.

Fix: Reduce retention by tier, exclude high-churn datasets, add capacity, or move long-term retention to a different pool class.

Mistake 5: Backup dataset mounted and scanned by antivirus/indexers

Symptoms: Unexpected metadata I/O, receives slow down, backup host load spikes.

Root cause: Vault datasets are mounted and treated like live filesystems; scanners touch everything.

Fix: Receive with -u, keep vault datasets unmounted by default, mount only for restores, and isolate restore clones.
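
A quick hedged check that nothing under the vault subtree is mounted when it shouldn’t be (the dataset name is the example used above):

zfs get -r -t filesystem -o name,value mounted vault/backups | grep -w yes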

Mistake 6: Holds exist but retention doesn’t work

Symptoms: Snapshots accumulate forever; pool fills; nobody can delete anything.

Root cause: Holds applied without a lifecycle; no automated release after retention window.

Fix: Implement hold tags per tier (e.g., policy), and build a controlled process to release holds when snapshots age out.
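
A hedged sketch of that controlled release process, assuming the daily- naming and the policy tag used earlier (GNU date; dry-run it with echo before letting it destroy anything):

cutoff=$(date -d '30 days ago' +%s)
zfs list -H -p -t snapshot -o name,creation -r vault/backups/prod | \
    awk -v c="$cutoff" '$1 ~ /@daily-/ && $2 < c {print $1}' | \
    while read -r snap; do
        sudo zfs release policy "$snap"   # drop the retention hold first
        sudo zfs destroy "$snap"          # then delete cleanly, by policy
    done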

Mistake 7: “We replicated, so we’re safe”

Symptoms: Both source and vault contain encrypted/corrupted data after compromise.

Root cause: Replication faithfully copied bad changes quickly; no delay, no detection, no offline tier.

Fix: Add delayed replication, longer retention tiers, anomaly detection (unexpected churn), and a second copy with a separate trust boundary.

Three corporate mini-stories from the backup trenches

1) Incident caused by a wrong assumption: “Readonly is immutable”

A mid-size SaaS shop ran ZFS on both production and a backup server. The backup datasets were set readonly=on, and everyone felt smug about it.
They even had a slide in the quarterly risk review: “immutable backups enabled.”

Then a privileged account got compromised. The attacker didn’t bother encrypting the vault dataset. They just deleted the snapshots, because deleting is faster than encrypting.
Production got hit next, and the team discovered their retention window was now “since the attacker arrived.”

The postmortem was uncomfortable in the best way: nobody lied, but several people admitted they assumed readonly implied undeletable.
It doesn’t. ZFS is precise; humans are not.

The remediation was equally precise: snapshot holds on the vault for retention tiers, delegated permissions so replication users couldn’t destroy anything,
and a separate break-glass account with audit logging and MFA. They also made a policy decision: backup hosts don’t join the same identity domain as production.

2) Optimization that backfired: “Let’s save space by pruning aggressively”

A large enterprise team was under pressure to cut storage costs. They reduced snapshot retention on the vault dramatically and also turned on a “cleanup” script
that deleted snapshots beyond a certain count, not age. The script was fast and the charts looked great. For a while.

Then a slow-burn data corruption issue surfaced in an application. It wasn’t obvious: a few records were wrong, then more, then the app started making “helpful” corrections
that were actually making the dataset worse. By the time anyone noticed, the only clean restore point was older than the new retention window.

The interesting bit: ZFS wasn’t at fault. Their optimization removed the only thing that can save you from slow failures—time.
Ransomware is loud. Corruption and bad logic are quiet.

They rolled back the retention change, but not to the old numbers. They did something more grown-up: tiered retention with a small set of longer-lived “gold” snapshots
that were held, plus a periodic restore test. Storage costs went up modestly. Recovery confidence went up massively.

3) Boring but correct practice that saved the day: routine restore drills

A financial services team ran weekly restore drills. Every Friday, someone cloned a snapshot from the vault into a restore sandbox and ran a small suite of checks:
file counts, application-level sanity checks, and a quick integrity read pass on a few representative files.

It was not glamorous. People complained it took time. Leadership still funded it because it created a simple artifact: a ticket saying “restore tested from snapshot X.”
That ticket was boring, repeatable evidence.

When they got hit with a credential-theft incident, the attacker tried to delete snapshots and hit holds. They then tried to tamper with the mounted dataset and hit readonly.
Meanwhile, the team already knew which snapshot to restore because they had tested it recently. They restored cleanly and moved on.

The lesson wasn’t “we had better tech.” It was “we did the dull work.” In operations, dull work is how you buy sleep.

Checklists / step-by-step plan

Step-by-step: build an “immutable-ish” ZFS vault with sane defaults

  1. Define datasets and boundaries.
    Production datasets are writable; vault datasets are receive-only and readonly.
  2. Implement snapshot cadence on source.
    Keep names consistent. Ensure snapshots cover your RPO.
  3. Replicate to vault.
    Prefer explicit snapshot-based replication. Avoid workflows that require frequent -F.
  4. Set vault datasets readonly and ideally unmounted by default.
    Mount restore clones instead.
  5. Apply holds on vault snapshots based on retention tiers.
    Use a tag like policy so it’s searchable and consistent.
  6. Delegate ZFS permissions.
    Replication account: receive, not destroy. Operators: restore actions, not retention overrides.
  7. Schedule scrubs and monitor errors.
    Scrubs are for catching latent errors before they become missing blocks during restore.
  8. Capacity plan with headroom.
    Don’t run the vault pool near full. Snapshot retention needs slack.
  9. Test restores routinely.
    Clone a snapshot, mount it, validate, and record the result.
  10. Practice break-glass.
    Define who can release holds, how it’s audited, and how you avoid “everyone is root” culture on the vault.
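
Taken together, steps 2 through 6 might look roughly like this on initial seeding (a hedged sketch; hostnames, accounts, and snapshot names are examples, not a turnkey script):

sudo zfs snapshot tank/prod@auto-20251226-0600                                  # first policy snapshot on the source
sudo zfs send tank/prod@auto-20251226-0600 | \
    ssh backup-vault sudo zfs receive -u vault/backups/prod                     # full seed, left unmounted
ssh backup-vault sudo zfs set readonly=on vault/backups/prod                    # the vault copy is not for editing
ssh backup-vault sudo zfs set canmount=noauto vault/backups/prod                # and not mounted by default
ssh backup-vault sudo zfs hold policy vault/backups/prod@auto-20251226-0600     # protect the incremental base
ssh backup-vault sudo zfs allow -u repl-user create,mount,receive vault/backups/prod   # receive-only account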

Operational checklist: before you claim “immutable backups”

  • Can a non-breakglass account destroy vault snapshots? If yes, you’re not immutable.
  • Are holds applied and visible? If no, you’re trusting humans to behave under stress.
  • Is the vault mounted and writable anywhere? If yes, expect encryption attempts to succeed.
  • Do you have at least one copy outside the primary trust boundary? If no, you’re one compromised domain admin away from pain.
  • Have you restored from the vault in the last 30 days? If no, you have beliefs, not backups.

FAQ

Is a ZFS snapshot truly immutable?

Its content is immutable (copy-on-write), but the snapshot object can be destroyed by someone with the right privileges.
If you need “cannot be deleted,” use holds and restrict destroy.

Does readonly=on stop ransomware?

It stops straightforward encryption of mounted filesystems. It does not stop snapshot deletion, pool destruction, or a privileged attacker.
Use readonly as a layer, not a finish line.

Should I take snapshots on the vault or on the source?

Default to source snapshots and vault retention. Vault-side snapshots can be useful for additional tiers, but they add complexity and can confuse restores.

What’s the safest way to do restores without weakening immutability?

Clone the snapshot into a separate restore dataset and mount that. Keep the received vault dataset readonly and ideally unmounted.

How do holds interact with retention policies?

Holds block deletion until released. A retention system should apply holds for snapshots inside the retention window and release them when they age out,
then delete snapshots cleanly.

Can replication users be restricted to only receive?

Yes. Use zfs allow to delegate only what’s needed (typically receive, maybe create), and avoid granting destroy.
Then verify with zfs allow output.

Why does replication sometimes require a full resend?

Incremental streams require a common base snapshot. If the base snapshot is missing on either side—or the target history diverged—incremental receive fails and a full send is required.
Protect base snapshots, and stop deleting snapshots ad hoc.

How many snapshots is “too many”?

It depends on workload and tooling. ZFS can handle many snapshots, but admin tooling, replication jobs, and humans can struggle.
Keep tiers reasonable and prune with policy, not panic.

What’s the single best indicator my backup vault is unhealthy?

Persistent checksum errors or repeated scrub repairs. ZFS is telling you data integrity is compromised.
Treat it as urgent: investigate hardware, firmware, cabling, and memory.

Is offsite still required if my vault is “immutable”?

Yes. Immutability is about tampering and deletion, not about site loss, theft, or disasters.
You want at least one copy in a different failure domain.

Conclusion: next steps you can actually do this week

If you want ZFS immutable backups that hold up under pressure, stop chasing a single knob and start layering controls:
readonly for writes, holds for deletion resistance, delegation to shrink blast radius, and replication that doesn’t depend on heroics.

  1. On the vault, set received datasets readonly=on and receive with -u so they aren’t casually mounted.
  2. Implement holds for snapshots inside your retention window and prove you can’t destroy held snapshots.
  3. Remove destroy from replication and operator accounts; reserve it for a break-glass role with audit.
  4. Run a restore drill: clone a snapshot, mount it, and validate something meaningful (not just “it mounted”).
  5. Check pool health and scrub scheduling on both source and vault. If integrity is shaky, immutability is academic.

Do those five things and your backup posture will shift from “hope and screenshots” to “measured and survivable.” That’s the whole game.
