ZFS zfs hold: The Safety Pin That Blocks Accidental Deletion


ZFS has a reputation for being “hard to mess up,” which is mostly true until you meet the one command that can make history disappear: snapshot destruction. Snapshots are cheap, fast, and dangerously easy to delete in bulk—especially when someone is chasing space pressure and the pager is already screaming.

zfs hold is the tiny mechanism that turns snapshots from “oops” into “not today.” It doesn’t encrypt anything, it doesn’t move data, and it won’t save you from every possible failure. But it does one job exceptionally well: it makes a snapshot undeletable unless the hold is deliberately released. In production, that’s the difference between a recoverable incident and a career update.

What zfs hold really does (and what it doesn’t)

A ZFS snapshot is immutable, but not indestructible. You can destroy it, and if it was the only thing keeping old blocks referenced, those blocks become free to be reused. In a busy pool, “reused” can mean “gone” quickly.

zfs hold attaches one or more “holds” (think: tags) to a snapshot. As long as at least one hold exists, the snapshot cannot be destroyed. Not by you, not by a script, not by a well-meaning colleague with a wildcard and a deadline. The destroy command will fail with a message pointing at holds.

What it does not do:

  • It does not prevent the snapshot from consuming space. In fact, it can keep space tied up longer (that’s the point).
  • It does not prevent someone from destroying the entire dataset or pool by other means if they’re sufficiently privileged and determined.
  • It does not replace real retention policy, backups, replication verification, or access control.

Here’s the mental model I use:

  • Snapshot = a frozen view of dataset blocks at a point in time.
  • Hold = a sticky note on that snapshot saying “do not delete until these conditions are met.”
  • Release = removing a sticky note; once all sticky notes are removed, deletion becomes possible again.
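
In command form, that mental model maps onto three subcommands (shown here as bare syntax; tag and snapshot names are placeholders):

zfs hold <tag> <snapshot>       # attach a sticky note (a hold named <tag>)
zfs holds <snapshot>            # list the sticky notes currently attached
zfs release <tag> <snapshot>    # remove that sticky note
zfs destroy <snapshot>          # refused until the last hold is gone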

First joke (keep it short, like your outage budget): A hold is the one “sticky” thing in storage that you actually want to be sticky.

How holds work under the hood

Holds are implemented as metadata on the snapshot. Each hold has a name (often called a tag) and is associated with that snapshot. Multiple holds can exist simultaneously—common in organizations where backups, replication, and legal retention all want a vote.

Tag semantics: why names matter

The tag is not just decorative. It becomes your operational handle for audits, troubleshooting, and safe automation. If you name holds like “keep,” you’ll regret it when you’re staring at 40TB of pinned snapshots and no one remembers which “keep” was for what.

A good tag includes:

  • Owner/system: replication, backup, legal, migration
  • Scope or target: to-dr, to-s3-gw, pre-upgrade
  • Optional ticket or change ID: chg12345 (if your environment has them)
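
If automation places holds, it is worth rejecting sloppy tags before they ever reach the pool. A minimal sketch, assuming a bash wrapper; the place_hold function and the owner:purpose regex are illustrative, not a standard tool:

#!/usr/bin/env bash
# place_hold: refuse to apply a hold unless the tag follows the owner:purpose convention
place_hold() {
  local tag="$1" snap="$2"
  if [[ ! "$tag" =~ ^[a-z]+:[a-z0-9._-]+$ ]]; then
    echo "refusing hold: tag '$tag' does not match owner:purpose" >&2
    return 1
  fi
  zfs hold "$tag" "$snap"
}

place_hold "replication:to-dr" "tank/app@auto-2025-12-25_0000"   # accepted
place_hold "keep" "tank/app@auto-2025-12-25_0000"                # rejected before ZFS is touched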

Holds vs. properties vs. “just don’t do that”

Yes, you can limit who can run zfs destroy. You should. But privilege boundaries blur under pressure: emergency access, break-glass accounts, automation running as root, on-call engineers with elevated rights. A hold adds an extra intentional step: “I must release the hold first.” That extra step is where many disasters quietly die.
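
Delegation and holds also combine nicely: you can let an automation account place and release holds without giving it destroy rights at all. A minimal sketch, assuming a service account named backupsvc (the account and dataset names are illustrative):

# Let backupsvc manage holds on tank/app and its children, but not destroy anything
sudo zfs allow -u backupsvc hold,release tank/app

# Review what has been delegated
sudo zfs allow tank/app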

Holds and replication

Holds are especially useful with replication because replication creates a dependency chain: you often need a specific snapshot to exist long enough to complete incremental sends, to seed a new target, or to guarantee a consistent restore window. Automation that pins “the last successfully replicated snapshot” is a classic, boring, correct practice.
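
The dependency is visible in the send syntax itself: an incremental stream is defined relative to a base snapshot that must still exist on the source and be present on the target. A minimal sketch; the host, dataset, and snapshot names are made up for illustration:

# Pin the base so retention cannot delete it mid-cycle
sudo zfs hold replication:last-good tank/app@base-0300

# The incremental send only works while the base snapshot survives on both sides
sudo zfs send -i tank/app@base-0300 tank/app@base-0400 | \
    ssh dr-host zfs receive -F backup/app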

What happens when you try to destroy a held snapshot

ZFS doesn’t argue; it refuses. Destruction fails with an error referencing holds. This is good. It’s also a frequent source of confusion when someone expects their cleanup job to reclaim space immediately.

Second joke: zfs hold is like putting a “Do Not Unplug” tag on the server power cable—annoying until the day it isn’t.

Interesting facts and context

Some short, concrete context points that help explain why holds exist and why they’re used the way they are:

  1. ZFS snapshots aren’t copies. They’re a set of block pointers; the “old data” remains referenced until nothing points to it.
  2. Holds are per-snapshot and tag-based. Multiple teams can independently pin the same snapshot without coordinating—until it’s time to delete it.
  3. Space pressure makes humans dangerous. In real operations, most catastrophic deletions happen during a storage emergency, not during calm planning.
  4. Replication depends on lineage. Incremental sends require a common snapshot base; delete the wrong base snapshot and your next replication turns into a full resend.
  5. Auto-snapshot tools can create “snapshot storms.” If retention misfires, you get thousands of snapshots—then someone reaches for a wildcard destroy. Holds are the seatbelt.
  6. Holds are not the same as bookmarks. Bookmarks can preserve incremental send points without preserving all data blocks like snapshots do; holds keep the snapshot itself alive.
  7. Holds are cheap until they aren’t. The metadata is tiny; the cost is the referenced blocks you cannot free while the snapshot is pinned.
  8. “Cannot destroy” is a feature, not a bug. Most systems treat deletion as final; ZFS gives you a deliberate “are you really sure” mechanism you can script around.

Practical tasks: commands you will actually run

The goal here isn’t to show off syntax. It’s to build operational muscle memory: how to apply holds, find them, interpret errors, and integrate holds into retention and replication without pinning your pool into a corner.

Task 1: Create a snapshot with an operationally meaningful name

cr0x@server:~$ sudo zfs snapshot tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,mountpoint -r tank/app | tail -n 3
NAME                                   USED  REFER  MOUNTPOINT
tank/app@auto-2025-12-25_0000           12M   48G    -
tank/app@auto-2025-12-25_0030           8M    48G    -
tank/app@pre-upgrade-2025-12-25_0100    0B    48G    -

Interpretation: The snapshot is created instantly. USED may show 0B at creation because no blocks have diverged yet.

Task 2: Put a hold on that snapshot

cr0x@server:~$ sudo zfs hold change:pre-upgrade tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs holds tank/app@pre-upgrade-2025-12-25_0100
NAME                                   TAG                 TIMESTAMP
tank/app@pre-upgrade-2025-12-25_0100    change:pre-upgrade  Fri Dec 25 01:00 2025

Interpretation: The snapshot now has a tag. It cannot be destroyed until this tag is released (and any other tags too).

Task 3: Prove the hold blocks destruction

cr0x@server:~$ sudo zfs destroy tank/app@pre-upgrade-2025-12-25_0100
cannot destroy snapshot tank/app@pre-upgrade-2025-12-25_0100: snapshot has holds

Interpretation: That error is your safety pin doing its job. Your cleanup automation should treat this as “skip” not “retry forever.”

Task 4: Add a second hold (multiple stakeholders)

cr0x@server:~$ sudo zfs hold replication:to-dr tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs holds tank/app@pre-upgrade-2025-12-25_0100
NAME                                   TAG                 TIMESTAMP
tank/app@pre-upgrade-2025-12-25_0100    change:pre-upgrade  Fri Dec 25 01:00 2025
tank/app@pre-upgrade-2025-12-25_0100    replication:to-dr   Fri Dec 25 01:02 2025

Interpretation: Destruction is blocked until both tags are released. This is how teams avoid stepping on each other.

Task 5: Release exactly one hold and confirm it’s still protected

cr0x@server:~$ sudo zfs release change:pre-upgrade tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs holds tank/app@pre-upgrade-2025-12-25_0100
NAME                                   TAG                TIMESTAMP
tank/app@pre-upgrade-2025-12-25_0100    replication:to-dr  Fri Dec 25 01:02 2025

cr0x@server:~$ sudo zfs destroy tank/app@pre-upgrade-2025-12-25_0100
cannot destroy snapshot tank/app@pre-upgrade-2025-12-25_0100: snapshot has holds

Interpretation: Releasing one tag doesn’t remove protection if another tag remains. This is the mechanism that makes shared retention sane.

Task 6: Release the remaining hold and destroy the snapshot

cr0x@server:~$ sudo zfs release replication:to-dr tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs destroy tank/app@pre-upgrade-2025-12-25_0100
cr0x@server:~$ sudo zfs list -t snapshot -r tank/app | grep pre-upgrade || echo "snapshot removed"
snapshot removed

Interpretation: Once the last hold is released, deletion behaves normally.

Task 7: Apply holds recursively to a dataset tree (carefully)

cr0x@server:~$ sudo zfs snapshot -r tank/projects@quarterly-freeze-2025Q4
cr0x@server:~$ sudo zfs hold -r legal:q4-retention tank/projects@quarterly-freeze-2025Q4
cr0x@server:~$ sudo zfs holds -r tank/projects@quarterly-freeze-2025Q4 | head
NAME                                               TAG                 TIMESTAMP
tank/projects@quarterly-freeze-2025Q4              legal:q4-retention  Fri Dec 25 02:00 2025
tank/projects/alpha@quarterly-freeze-2025Q4        legal:q4-retention  Fri Dec 25 02:00 2025
tank/projects/beta@quarterly-freeze-2025Q4         legal:q4-retention  Fri Dec 25 02:00 2025
tank/projects/beta/builds@quarterly-freeze-2025Q4  legal:q4-retention  Fri Dec 25 02:00 2025

Interpretation: This pins snapshots across the tree. It’s powerful and potentially expensive in space retention. You do this on purpose, not by accident.

Task 8: Find which holds block deletion (the “why won’t it die?” command)

cr0x@server:~$ sudo zfs destroy tank/projects/alpha@quarterly-freeze-2025Q4
cannot destroy snapshot tank/projects/alpha@quarterly-freeze-2025Q4: snapshot has holds

cr0x@server:~$ sudo zfs holds tank/projects/alpha@quarterly-freeze-2025Q4
NAME                                         TAG                 TIMESTAMP
tank/projects/alpha@quarterly-freeze-2025Q4  legal:q4-retention  Fri Dec 25 02:00 2025

Interpretation: This is the first stop in any cleanup incident: identify the exact tag and decide whether you’re allowed to remove it.

Task 9: Audit holds across a dataset tree

cr0x@server:~$ sudo zfs holds -r tank/projects | awk 'NR==1 || $2 ~ /legal:|replication:|backup:/ {print}'
NAME                                               TAG                 TIMESTAMP
tank/projects@quarterly-freeze-2025Q4              legal:q4-retention  Fri Dec 25 02:00 2025
tank/projects/alpha@quarterly-freeze-2025Q4        legal:q4-retention  Fri Dec 25 02:00 2025
tank/projects/beta@quarterly-freeze-2025Q4         legal:q4-retention  Fri Dec 25 02:00 2025

Interpretation: You’re building an inventory of “pinned” snapshots. In large environments, this is the difference between a manageable retention policy and a haunted attic.

Task 10: Estimate space impact: find heavy snapshot deltas

cr0x@server:~$ sudo zfs list -t snapshot -o name,used,creation -s used -r tank/projects | tail -n 10
tank/projects/beta@auto-2025-12-20_0000     18G  Sat Dec 20 00:00 2025
tank/projects/beta@auto-2025-12-21_0000     22G  Sun Dec 21 00:00 2025
tank/projects/beta@quarterly-freeze-2025Q4  35G  Fri Dec 25 02:00 2025

Interpretation: Snapshots with high USED are pinning lots of unique blocks. If they’re held, that space is likely non-negotiable until the hold is released.

Task 11: Use holds to protect “last good replication base” snapshots

cr0x@server:~$ sudo zfs snapshot tank/app@replica-base-2025-12-25_0300
cr0x@server:~$ sudo zfs hold replication:last-good tank/app@replica-base-2025-12-25_0300
cr0x@server:~$ sudo zfs holds tank/app@replica-base-2025-12-25_0300
NAME                                   TAG                   TIMESTAMP
tank/app@replica-base-2025-12-25_0300   replication:last-good Fri Dec 25 03:00 2025

Interpretation: This is a common replication pattern: always keep one known-good base snapshot pinned until the next replication confirms success.

Task 12: Cleanly rotate a pinned “last good” snapshot after success

cr0x@server:~$ sudo zfs holds -r tank/app | grep replication:last-good
tank/app@replica-base-2025-12-25_0300   replication:last-good Fri Dec 25 03:00 2025

cr0x@server:~$ sudo zfs snapshot tank/app@replica-base-2025-12-25_0400
cr0x@server:~$ sudo zfs hold replication:last-good tank/app@replica-base-2025-12-25_0400

cr0x@server:~$ sudo zfs release replication:last-good tank/app@replica-base-2025-12-25_0300
cr0x@server:~$ sudo zfs destroy tank/app@replica-base-2025-12-25_0300
cr0x@server:~$ sudo zfs holds tank/app@replica-base-2025-12-25_0400
NAME                                   TAG                   TIMESTAMP
tank/app@replica-base-2025-12-25_0400   replication:last-good Fri Dec 25 04:00 2025

Interpretation: Note the order: create new base → hold it → release old base hold → destroy old base. Flip that order, or release the old hold before the new base is confirmed at the target, and you’ll eventually pay for it with a full resend.
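
That ordering is worth encoding in the automation itself rather than trusting memory under stress. A sketch of the rotation, assuming the replication step has already reported success; the dataset and tag come from the example above, the script itself is illustrative:

#!/usr/bin/env bash
set -euo pipefail

DS="tank/app"
TAG="replication:last-good"
NEW="${DS}@replica-base-$(date +%Y-%m-%d_%H%M)"

# 1. Find the currently pinned base, if any (tab-separated output: name, tag, timestamp)
OLD=$(zfs holds -H -r "$DS" | awk -v t="$TAG" '$2 == t {print $1; exit}')

# 2. Create and pin the new base *before* touching the old one
zfs snapshot "$NEW"
zfs hold "$TAG" "$NEW"

# 3. Only now release and retire the previous base
if [ -n "$OLD" ]; then
  zfs release "$TAG" "$OLD"
  zfs destroy "$OLD"
fi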

Task 13: Destroy snapshots while skipping held ones (safe cleanup pattern)

cr0x@server:~$ for s in $(sudo zfs list -H -t snapshot -o name -r tank/app | head -n 5); do
>   if out=$(sudo zfs destroy "$s" 2>&1); then echo "[$s] destroyed"; else echo "[$s] $out"; fi
> done
[tank/app@auto-2025-12-25_0000] cannot destroy snapshot tank/app@auto-2025-12-25_0000: snapshot has holds
[tank/app@auto-2025-12-25_0030] destroyed
[tank/app@auto-2025-12-25_0100] destroyed
[tank/app@auto-2025-12-25_0130] destroyed
[tank/app@auto-2025-12-25_0200] destroyed

Interpretation: Cleanup scripts should treat “has holds” as an expected state. Log it, report it, move on. Don’t fail the entire job and certainly don’t try to “fix” it automatically.

Task 14: Find snapshots with holds and sort by creation time

cr0x@server:~$ sudo zfs holds -r tank | awk 'NR==1{next} {print $1}' | sort -u | while read snap; do
>   sudo zfs get -H -o value creation "$snap" | awk -v s="$snap" '{print $0 " " s}'
> done | sort | head
Fri Dec 20 00:00 2025 tank/projects/beta@auto-2025-12-20_0000
Fri Dec 25 02:00 2025 tank/projects@quarterly-freeze-2025Q4
Fri Dec 25 02:00 2025 tank/projects/alpha@quarterly-freeze-2025Q4

Interpretation: This helps answer the compliance question: “What have we pinned, and how long has it been pinned?” The tool doesn’t do it in one neat table, so you build one.
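
If you want that sort to be chronological rather than alphabetical-by-weekday, parseable output helps: zfs get -p reports creation as seconds since the epoch, which sorts numerically. A variant sketch of the same inventory:

# List held snapshots oldest-first using epoch timestamps (tab-separated: epoch, snapshot)
sudo zfs holds -H -r tank | awk '{print $1}' | sort -u | while read -r snap; do
  printf '%s\t%s\n' "$(sudo zfs get -Hp -o value creation "$snap")" "$snap"
done | sort -n | head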

Three corporate-world mini-stories

1) The incident caused by a wrong assumption

The team had a tidy rule: “We keep 48 hours of snapshots, and everything older gets destroyed.” It was enforced by a cron job written years ago. Nobody loved it, but it kept the pool from turning into a museum.

Then they migrated a noisy workload—CI artifacts and build caches—onto the same pool as a handful of critical databases. The pool started filling faster than expected. Under pressure, someone ran the cleanup job manually, twice, and then decided to “speed it up” by broadening the destroy pattern. Wildcards were used. Nothing exploded immediately, which is how storage incidents lure you into false confidence.

Later that day, replication to the DR site failed. Not “temporary glitch” failed—“incremental base not found” failed. The assumption was that if a snapshot was older than 48 hours, it was safe to delete. But the replication schedule had drifted during a maintenance window and was now days behind. The incremental chain depended on a snapshot that the cleanup job considered “expired.”

They ended up doing a full resend across a link that was sized for incrementals, not full datasets. The business impact wasn’t just the delayed DR posture. The resend competed with production traffic and pushed latency into user-visible territory.

The postmortem fix wasn’t “tell people not to use wildcards” (good luck). It was operationally enforceable: the replication job applied a hold tag to the last snapshot confirmed at the target. The retention job destroyed freely, but held snapshots were skipped. The wrong assumption was replaced with a mechanism that made the safe behavior the default.

2) The optimization that backfired

A different company had a clever idea: “Let’s hold all snapshots for seven days, then release them in one batch. That way we only have to think about retention once a week.” It felt efficient: fewer moving parts, fewer chances to get it wrong.

It worked—until the workload changed. A new analytics pipeline started rewriting large datasets daily. Snapshots began to accumulate large deltas. Holds pinned those deltas for a full week, and the pool’s free space started to whipsaw. Every week, they’d release holds and do a mass destroy, getting a brief rush of free space… followed by fragmentation-like allocation pressure and performance cliffs as the pool scrambled to rewrite hot blocks while also freeing old ones.

The optimization made retention “simple,” but it concentrated write amplification and deletion work into a predictable weekly storm. Worse, when the storm landed during a busy business period, latency spikes became regular. It wasn’t a mystery; it was scheduled pain.

The fix was boring: staggered retention. They still used holds, but only for the snapshots that truly needed protection (replication bases, pre-change points, and compliance). Everything else followed a rolling window with gradual deletions. They also watched pool free space more conservatively, because holds turn “free space” into a promise you can’t necessarily cash today.

The lesson: holds are a safety mechanism, not a blanket policy. If you optimize for fewer decisions, you can accidentally optimize for periodic chaos.

3) The boring but correct practice that saved the day

This one is my favorite because it doesn’t involve heroics, just discipline.

A finance-adjacent application had quarterly close procedures that were always high stress. The storage team had a runbook step: before the close begins, take a recursive snapshot of a specific dataset subtree and apply a legal hold tag. Everyone rolled their eyes at it because it took five extra minutes and never seemed to “do” anything.

During one close, a deployment tool misread a configuration value and ran a cleanup step against the wrong mountpoint. It wasn’t malicious; it was the kind of automation bug that only shows up when a variable is empty and someone didn’t quote it. Files disappeared. The application team tried to roll back at the app layer, but the deletion had already happened on disk.

They went to snapshots, as planned—except the first restoration attempt failed because the on-call engineer (sleep deprived, but well-intentioned) started deleting “older snapshots” to create space for the restore. The pool was tight. This is where normal organizations lose the thread: you end up destroying the very snapshots you need because the system is pressuring you to act.

The quarterly snapshots were held. They could not be destroyed in the heat of the moment. That forced the team to choose a safer path: temporarily expand capacity and restore from the pinned snapshot set. The incident still hurt, but it stayed in the category of “expensive and annoying,” not “irreversible.”

The practice wasn’t glamorous. It was a mechanical step that prevented a panicked person from making a permanent mistake.

Fast diagnosis playbook

This is the playbook for the common production moment: “We need space, we tried to delete snapshots, ZFS refuses, and performance is getting weird.” The goal is to find the bottleneck and the decision point quickly, not to admire the filesystem.

1) First check: what exactly is blocking destroy?

Pick a specific snapshot that refuses to die and inspect holds.

cr0x@server:~$ sudo zfs destroy tank/app@auto-2025-12-24_0000
cannot destroy snapshot tank/app@auto-2025-12-24_0000: snapshot has holds

cr0x@server:~$ sudo zfs holds tank/app@auto-2025-12-24_0000
NAME                               TAG                  TIMESTAMP
tank/app@auto-2025-12-24_0000       replication:last-good Thu Dec 24 00:05 2025

Decision: If the tag is expected (replication/compliance), don’t remove it under pressure without confirming the downstream dependency.

2) Second check: are holds widespread or isolated?

Inventory holds recursively to see whether you’re dealing with a few pinned snapshots or an entire subtree.

cr0x@server:~$ sudo zfs holds -r tank/app | wc -l
128

Interpretation: Big number? You likely have a policy or automation problem, not a one-off.
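
A quick way to tell whether it is one runaway tag or many: group the inventory by tag. A small sketch; the dataset root is whatever you are investigating:

# Count held snapshots per tag, busiest tags first
sudo zfs holds -H -r tank/app | awk '{count[$2]++} END {for (t in count) print count[t], t}' | sort -rn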

3) Third check: where is the space actually going?

Look at dataset and snapshot space usage, focusing on what is pinned.

cr0x@server:~$ sudo zfs list -o name,used,avail,refer,compressratio,mounted tank
NAME   USED  AVAIL  REFER  RATIO  MOUNTED
tank   68T   1.2T   256K   1.42x  yes

cr0x@server:~$ sudo zfs list -t snapshot -o name,used -s used -r tank/app | tail -n 5
tank/app@auto-2025-12-23_0000   210G
tank/app@auto-2025-12-24_0000   260G
tank/app@auto-2025-12-25_0000   300G
tank/app@quarterly-freeze       420G
tank/app@legal-freeze           1.1T

Interpretation: Large USED snapshots are prime suspects. If they’re held, you can’t reclaim that space quickly without changing the business decision.

4) Fourth check: are you hitting a pool-level performance bottleneck?

Space issues and deletion attempts can coincide with poor latency. Check pool health and I/O quickly.

cr0x@server:~$ sudo zpool status -x
all pools are healthy

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0 (server)  12/25/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.40    0.00    4.10    8.20    0.00   75.30

Device            r/s     w/s   rkB/s   wkB/s  await  %util
nvme0n1         220.0   180.0  18000   26000   6.20  92.0
nvme1n1         210.0   170.0  17500   24000   6.50  89.0

Interpretation: High %util and rising await suggest you’re I/O bound. Snapshot deletions may not be the culprit, but the same underlying write load that created large deltas often is.

5) Fifth check: confirm you’re not blocked by a long-running send/receive or scrub

cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:10:21 with 0 errors on Fri Dec 25 01:30:11 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvme0n1 ONLINE       0     0     0
            nvme1n1 ONLINE       0     0     0

Interpretation: If a scrub/resilver is active, the pool will be busy. It doesn’t directly block destroys, but it changes the risk calculation when you’re also short on space.

Common mistakes: symptoms and fixes

Mistake 1: Treating “snapshot has holds” as an error condition to auto-remediate

Symptom: Cleanup job fails nightly, retries aggressively, pages someone, or (worse) automatically releases holds to “fix” it.

Fix: Make “has holds” a first-class state: log and report held snapshots separately. Releasing holds should require explicit ownership checks (replication success, change window closure, legal approval).

Mistake 2: Using vague hold tags that don’t identify an owner

Symptom: You find tags like keep, hold, or important. Nobody can safely remove them; they linger for years.

Fix: Adopt a naming convention: system:purpose (optionally with a change ID). Example: replication:last-good, change:pre-upgrade, legal:q4-retention.

Mistake 3: Holding snapshots “just in case” without measuring space impact

Symptom: Pool free space trends down steadily, and deleting unheld snapshots barely moves the needle.

Fix: Identify the high-USED held snapshots. Decide if they’re truly required. If compliance requires retention, plan capacity accordingly; don’t pretend you can clean your way out.

Mistake 4: Recursive holds applied to the wrong scope

Symptom: Suddenly thousands of snapshots across many child datasets are held; storage consumption spikes; retention stops working.

Fix: Before -r, list the dataset tree and confirm the intended root. In automation, whitelist dataset prefixes. Consider holding only specific snapshot names rather than applying to every child automatically.
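
In automation, that prefix check can be a few lines in front of the recursive hold. A minimal sketch; the allowed roots and the hold_recursive wrapper are illustrative:

#!/usr/bin/env bash
# hold_recursive: only allow -r holds under explicitly approved dataset roots
ALLOWED_ROOTS=("tank/projects" "tank/app")

hold_recursive() {
  local tag="$1" snap="$2" ds="${2%@*}" ok=0
  for p in "${ALLOWED_ROOTS[@]}"; do
    [[ "$ds" == "$p" || "$ds" == "$p"/* ]] && ok=1
  done
  if [ "$ok" -ne 1 ]; then
    echo "refusing recursive hold: $ds is not an approved root" >&2
    return 1
  fi
  zfs hold -r "$tag" "$snap"
}

hold_recursive legal:q4-retention tank/projects@quarterly-freeze-2025Q4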

Mistake 5: Confusing holds with permissions and access control

Symptom: People assume holds stop privileged users from destructive actions across the pool.

Fix: Use delegated ZFS permissions and operational controls. Holds protect snapshots from destruction, not your whole storage estate from root.

Mistake 6: Replication scripts that set holds but never release them

Symptom: replication:last-good exists on dozens or hundreds of snapshots; you only meant one.

Fix: On successful replication, rotate the hold: apply to the new base snapshot, release from the old base snapshot, then delete old if policy allows. Add monitoring that alerts if more than N snapshots are tagged replication:last-good.
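
The monitoring piece can be very small. A sketch, assuming one pinned base per dataset is the intended state; the tag and threshold are placeholders:

# Alert if more than one snapshot carries the rotation tag
TAG="replication:last-good"
count=$(sudo zfs holds -H -r tank/app | awk -v t="$TAG" '$2 == t' | wc -l)
if [ "$count" -gt 1 ]; then
  echo "ALERT: $count snapshots under tank/app are pinned with $TAG (expected 1)"
fi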

Mistake 7: Assuming holds are the only reason a snapshot can’t be destroyed

Symptom: Destroy fails and people tunnel vision on holds, but the issue is elsewhere (typo, dependent clones, permission problems).

Fix: Read the full error message and also check for clones and permissions. Holds are common, but not exclusive.

Checklists / step-by-step plan

Checklist A: Introducing holds safely into an existing environment

  1. Define ownership tags. Decide which systems are allowed to place holds (replication, backup, change management, legal).
  2. Pick a naming convention. Prefer owner:purpose and optionally append a short identifier in the snapshot name itself.
  3. Decide scope rules. When are recursive holds allowed? On which dataset roots? Document the allowed prefixes.
  4. Update cleanup jobs. Treat “has holds” as a skip condition. Produce a separate report of held snapshots.
  5. Add visibility. Create a scheduled inventory job that counts holds by tag and lists the oldest held snapshot per tag (a sketch follows this checklist).
  6. Run a fire drill. Practice the “free space emergency” procedure without actually being low on space, so you don’t learn under stress.
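
For item 5, the inventory job can be a short scheduled script rather than a product. A sketch; the output format and the pool name are illustrative:

#!/usr/bin/env bash
# Daily hold inventory: count pinned snapshots per tag and show the oldest one for each
sudo zfs holds -H -r tank | while IFS=$'\t' read -r snap tag _; do
  printf '%s\t%s\t%s\n' "$tag" "$(sudo zfs get -Hp -o value creation "$snap")" "$snap"
done | sort -t$'\t' -k1,1 -k2,2n | awk -F'\t' '
  !seen[$1]++ { oldest[$1] = $3 }   # first row per tag is the oldest (rows sorted by epoch)
  { count[$1]++ }
  END { for (t in count) printf "%-28s count=%-4d oldest=%s\n", t, count[t], oldest[t] }'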

Checklist B: Pre-change “safety snapshot” using holds

  1. Take snapshot(s) of the dataset(s) involved, ideally recursive if the change spans multiple child datasets.
  2. Apply a change-specific hold tag.
  3. Verify holds exist.
  4. Proceed with change.
  5. After validation window closes, release the hold tag.
  6. Let normal retention delete the snapshot later (or explicitly destroy it if policy requires).
cr0x@server:~$ sudo zfs snapshot -r tank/app@chg-pre-2025-12-25_0500
cr0x@server:~$ sudo zfs hold -r change:chg-pre tank/app@chg-pre-2025-12-25_0500
cr0x@server:~$ sudo zfs holds -r tank/app@chg-pre-2025-12-25_0500 | head -n 5
NAME                                     TAG              TIMESTAMP
tank/app@chg-pre-2025-12-25_0500         change:chg-pre   Fri Dec 25 05:00 2025
tank/app/db@chg-pre-2025-12-25_0500      change:chg-pre   Fri Dec 25 05:00 2025
tank/app/uploads@chg-pre-2025-12-25_0500 change:chg-pre   Fri Dec 25 05:00 2025

Checklist C: Replication retention with “last good base” holds

  1. Create a new replication snapshot.
  2. Send/receive (or whatever replication mechanism you use).
  3. Verify success on the target side.
  4. Hold the new base snapshot on the source.
  5. Release hold on the previous base snapshot.
  6. Optionally destroy old base snapshots if they’re outside retention.

FAQ

1) Is a hold the same as setting a dataset property like readonly=on?

No. readonly=on affects write behavior to a dataset. A hold affects only whether a specific snapshot can be destroyed.

2) Can I hold a dataset, not a snapshot?

Holds are applied to snapshots. You can snapshot recursively and then hold those snapshots, which is often what people mean operationally by “holding a dataset tree.”

3) Why does ZFS allow multiple holds on one snapshot?

Because real environments have multiple stakeholders. Replication might need a snapshot for incremental chains, while legal retention might need it for governance. Multiple holds let each system assert its requirement independently.

4) If a snapshot is held, does it keep consuming more space over time?

The snapshot itself doesn’t change, but it can keep old blocks referenced while the live dataset changes. The more churn in the dataset, the more unique old blocks remain pinned by that snapshot.

5) How do I find what’s preventing snapshot deletion?

Run zfs holds <snapshot>. If you’re dealing with many datasets, use zfs holds -r to inventory and then narrow down to the specific snapshots blocking your policy.

6) Can I use holds as a compliance retention mechanism?

You can use holds to enforce “cannot delete until explicitly released,” which is a useful primitive. Whether it satisfies compliance depends on your controls around who can release holds, auditability, and whether privileged users can bypass your intended process.

7) What’s the difference between a hold and a bookmark?

A hold pins a snapshot (and therefore all the referenced blocks). A bookmark preserves an incremental send point without retaining the full snapshot’s referenced blocks the same way. Holds are about undeletability; bookmarks are about replication lineage without full snapshot retention.
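
The two can be combined: bookmark the base for lineage, and hold only what genuinely needs its blocks kept. A minimal sketch of the bookmark side; the snapshot names follow the earlier examples, while the _0500 snapshot and dr-host are made up for illustration:

# A bookmark preserves the incremental send point, not the snapshot's data blocks
sudo zfs bookmark tank/app@replica-base-2025-12-25_0400 tank/app#replica-base-2025-12-25_0400

# Later, an incremental send can start from the bookmark even if the snapshot is gone
sudo zfs send -i tank/app#replica-base-2025-12-25_0400 tank/app@replica-base-2025-12-25_0500 | \
    ssh dr-host zfs receive backup/app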

8) My retention tool is deleting snapshots but not reclaiming space. Are holds to blame?

Possibly, but not always. Holds prevent deletion of specific snapshots; they don’t directly explain why deleting others doesn’t free much. Common causes: the remaining snapshots (held or not) still reference the old blocks, the dataset churn pattern keeps blocks referenced by newer snapshots, or you’re looking at the wrong dataset hierarchy.

9) Should I put holds on every snapshot?

Usually no. That’s how you turn a retention policy into a capacity crisis. Use holds selectively: last-good replication bases, pre-change safety points, and explicit compliance snapshots.

10) Who should be allowed to release holds?

In mature environments: only the owning automation or a small set of privileged roles with an audit trail. Operationally, “release hold” is equivalent to “permit permanent deletion,” and it should be treated with the same seriousness.

Conclusion

zfs hold is a small feature with outsized consequences. It doesn’t make snapshots “more snapshotted.” It makes them harder to destroy accidentally, and that’s a surprisingly rare superpower in production systems—where most failures are not exotic kernel bugs but ordinary humans moving quickly.

If you adopt holds with clear tag ownership, integrate them into retention and replication, and build a habit of auditing what’s pinned, you get a system that fails safer under stress. And when the day comes that someone tries to delete the wrong thing to make the red alert go away, ZFS will do what you trained it to do: refuse.
