ZFS Snapshot Deletion: Why Snapshots Refuse to Die (and How to Fix It)

Deleting a ZFS snapshot should be boring: zfs destroy pool/fs@snap, done. But in real systems it’s often a scene from a crime show: the snapshot is “gone” yet the space doesn’t come back, or the destroy fails because something—somewhere—still “needs” it.

This is a field guide to those moments. Not a brochure. We’ll cover the technical reasons snapshots refuse to die (holds, clones, busy datasets, deferred destroys, replication, space accounting) and the operational fixes that work when your pager is already warm.

What a ZFS snapshot really is (and why deletion is not “free”)

A ZFS snapshot is not a copy of your files. It’s a frozen view of a dataset at a transaction group boundary: a set of block pointers that says “these blocks define the filesystem at that time.” When you modify live data, ZFS writes new blocks elsewhere (copy-on-write). The old blocks remain referenced by the snapshot. Delete the snapshot, and those old blocks might become unreferenced and eligible for freeing—if nothing else still points at them.
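
You can watch this accounting happen in real time. A minimal illustration, assuming a hypothetical dataset pool/app holding about 1.2T of live data (outputs are representative):

cr0x@server:~$ sudo zfs snapshot pool/app@before-rewrite
cr0x@server:~$ zfs list -t snapshot -o name,used,refer pool/app | grep before-rewrite
pool/app@before-rewrite     0B  1.20T
cr0x@server:~$ # ...the application rewrites roughly 50G of existing files...
cr0x@server:~$ zfs list -t snapshot -o name,used,refer pool/app | grep before-rewrite
pool/app@before-rewrite    50G  1.20T

The snapshot starts at 0B because it shares every block with the live dataset. As live data is overwritten, the displaced blocks become uniquely chargeable to the snapshot, and deleting the snapshot is what hands them back.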

This is why snapshot deletion has two different failure modes:

  • The destroy is blocked outright (holds, clones, “dataset is busy”). You can’t even remove the snapshot metadata yet.
  • The destroy succeeds but the space doesn’t come back, because the blocks are still referenced by other snapshots or clones, or because the space accounting is being misread.

It also means deletion can be expensive. ZFS may need to walk metadata to determine which blocks become free. Large snapshot trees with years of churn can make destroys take minutes—or hours—depending on pool load, storage latency, and feature flags.

One operational truth: when people say “snapshots are cheap,” they mean “creating snapshots is cheap.” Deleting them is where the bill shows up.
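
You can at least see the bill before paying it. OpenZFS supports dry-run destroys (-n with -v) and a % range syntax for snapshots; the names and the reclaim estimate below are representative:

cr0x@server:~$ sudo zfs destroy -nv pool/app@auto-2025-10-01%auto-2025-10-31
would destroy pool/app@auto-2025-10-01
would destroy pool/app@auto-2025-10-02
...
would destroy pool/app@auto-2025-10-31
would reclaim 124G

Nothing is deleted; ZFS reports which snapshots a real destroy would remove and an estimate of the space that would come back.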

Interesting facts and historical context (you’ll feel these in production)

  1. ZFS snapshots date back to Sun Microsystems and were designed as a primitive for cloning, rollback, and replication—long before “immutable backups” became marketing copy.
  2. Copy-on-write is why snapshots exist without pausing writes: you can take a snapshot of a live database dataset without freezing I/O, because new writes go to new blocks.
  3. Snapshot names are part of the dataset namespace: pool/fs@snap is not a separate block device the way an LVM snapshot is; it’s deeply tied to the dataset’s own bookkeeping.
  4. Clones are writable snapshots: a clone is a dataset that initially shares all blocks with a snapshot; that dependency is what makes some snapshots “undeletable.”
  5. “Holds” were introduced to prevent accidental deletion during workflows like replication, backup verification, and snapshot-based provisioning.
  6. Deferred destroy exists because deletion can be slow: ZFS can mark a snapshot for later cleanup so the command returns quickly while freeing happens asynchronously.
  7. Space is not a single number in ZFS: “used,” “refer,” “logicalused,” “written,” “usedbysnapshots,” and “usedbychildren” answer different questions. People routinely pick the wrong one under pressure.
  8. Compression changes the intuition: a snapshot can “hold” blocks that look huge logically but tiny physically—or vice versa with recordsize changes and rewrites.

And one small joke, because we’ve earned it: ZFS snapshots are like office chairs—easy to add, strangely hard to get rid of, and you only notice the cost when the hallway is blocked.

Fast diagnosis playbook (check this first, second, third)

This is the sequence I use when someone says, “We deleted snapshots but space didn’t come back,” or “Destroy fails and we don’t know why.” The goal is to find the bottleneck quickly, not to admire the pool’s philosophical complexity.

1) Is the snapshot actually gone, or is it deferred?

First check whether you’re looking at a deferred destroy situation. ZFS may accept the destroy but free blocks later.

2) Are there holds?

Holds are the #1 “it refuses to die” culprit in disciplined environments (backup/replication software loves them). They are also the #1 “nobody remembers setting them” culprit.
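
A quick way to surface holds across an entire pool, rather than one snapshot at a time, is to feed every snapshot name into zfs holds. A rough sketch (GNU xargs shown; dataset names are illustrative, and on pools with thousands of snapshots the header will repeat as xargs batches arguments):

cr0x@server:~$ zfs list -H -t snapshot -o name -r pool | xargs -r sudo zfs holds
NAME                        TAG             TIMESTAMP
pool/app@auto-2025-12-24    repl-inflight   Thu Dec 25 02:05 2025
pool/db@manual-pre-upgrade  keep-incident   Mon Dec 22 11:40 2025

Every tag you see here is something that will block a destroy until someone releases it.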

3) Is there a clone dependency?

If any clone was created from a snapshot, you cannot destroy that snapshot until the clone is promoted or destroyed. In many shops, one developer “temporarily” cloned production data for a test and quietly turned snapshot cleanup into a hostage negotiation.

4) If the snapshot is destroyed, why is space not returning?

Check usedbysnapshots, and check whether other snapshots still reference the same overwritten blocks. Also verify you’re looking at the right dataset and not confusing pool-level and dataset-level usage.

5) Is the pool unhealthy or under extreme load?

A degraded pool, heavy fragmentation, or saturated IOPS can make destroys crawl. Snapshot deletion is metadata-heavy; it competes with your real workloads.
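
zpool iostat gives a fast read on whether the pool is already saturated before you pile metadata work on top. A quick sample (5-second interval, three samples; output is representative and trimmed):

cr0x@server:~$ zpool iostat pool 5 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool        13.1T  8.72T    180  1.90K  11.2M   164M
pool        13.1T  8.72T    210  2.30K  13.5M   190M
pool        13.1T  8.72T    195  2.10K  12.8M   177M

If write operations and bandwidth are already pegged during business hours, schedule bulk destroys for a quieter window.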

6) Did you delete snapshots on the wrong side of replication?

Replication topologies can keep snapshots alive: the sender retains them to satisfy incremental chains, the receiver retains them because you pinned them, and both sides blame each other.

Why snapshots refuse to die: the real causes

Cause A: Holds (user holds, tool holds, “it’s for your own good”)

A hold is a tag attached to a snapshot that prevents its destruction. You’ll see it when zfs destroy fails with a message about holds, or when a deletion tool “skips protected snapshots.” Holds are excellent—until they’re orphaned by a crashed job, a half-migrated backup system, or a script that tags snapshots but never untags them.

In the wild, holds show up as:

  • Replication pipelines holding snapshots until receive completes.
  • Backup verification jobs holding snapshots until checksums finish.
  • “Safety” holds added by admins during an incident and forgotten.

Cause B: Clone dependencies (the quietest blocker)

If a snapshot has a clone, that snapshot is part of the clone’s ancestry. ZFS will refuse to destroy it because it would sever the clone’s block references. You can identify this by checking the snapshot’s clones property. The fix is to destroy the clone, or promote the clone so it becomes the new origin and the dependency flips.

This is the most common “we don’t understand why it fails” case in mixed teams, because the person deleting snapshots often doesn’t know someone created a clone months ago.
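
A pool-wide sweep for snapshots that have clones takes the guesswork out of it. A sketch, assuming tab-separated -H output (the clones value is empty or “-” when a snapshot has no clones):

cr0x@server:~$ zfs get -r -t snapshot -H -o name,value clones pool | awk -F'\t' '$2 != "" && $2 != "-"'
pool/app@auto-2025-12-24	pool/dev/app-clone

Every line is a snapshot that will refuse to die until the listed clone is destroyed or promoted.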

Cause C: Dataset “busy” errors (mounted, in use, or special cases)

Most of the time, you can destroy snapshots of mounted datasets with no issue. But “dataset is busy” can occur when you are destroying a filesystem or volume with active references, or when you are trying to destroy snapshots during certain operations (like ongoing receive in some workflows, depending on platform and flags).

Also, be careful with zfs destroy -r or -R: you may be destroying datasets (not just snapshots) and hitting mountpoint usage, NFS exports, jails/zones, or container runtimes that pinned mountpoints.

Cause D: Deferred destroy (it “deleted,” but the pool didn’t breathe)

Deferred destroy allows ZFS to quickly remove the snapshot from the namespace while postponing the actual block freeing work. This is not magic; it’s a scheduling decision. If the pool is under load, the cleanup can lag. Operators see “snapshot is gone” and assume space should return instantly. It might not.

Deferred destroy is requested explicitly with zfs destroy -d. Separately, background (asynchronous) freeing means that even an ordinary destroy can return before all of the space has been reclaimed; the details depend on platform and feature flags.

Cause E: Space accounting misconceptions (you freed space—just not where you’re looking)

ZFS reports space at multiple levels:

  • pool-level free space: what the pool can still allocate.
  • dataset-level “used”: the dataset itself plus its children, snapshots, and refreservation.
  • usedbysnapshots: the space that would be freed if every snapshot of that dataset were destroyed.
  • referenced: the data reachable from the live dataset right now (blocks that may also be shared with snapshots).

A common reality: you delete a bunch of snapshots, and zfs list still shows “used” barely changing because the live dataset (or other snapshots) still references most of the blocks. This is especially common when the workload is append-heavy (logs) rather than overwrite-heavy.
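
The single most useful view here is zfs list -o space, which splits “used” into its components per dataset. Figures are representative and line up with the pool/app example used in the tasks below:

cr0x@server:~$ zfs list -o space -r pool/app
NAME           AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
pool/app       4.10T  3.10T      420G   2.68T             0B       96M
pool/app/logs  4.10T    96M       12M     84M             0B        0B

If USEDSNAP is small relative to USEDDS, no amount of snapshot pruning will rescue the pool; the live data is the problem.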

Cause F: Replication chains and incremental dependencies

Incremental zfs send needs a common snapshot between source and destination. If you delete “the wrong” snapshot on either side, you break the chain and force a full send next time. To prevent that, many replication tools hold or retain snapshots. If you override retention manually, you might win a few gigabytes and lose a weekend to re-seeding.
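
Before deleting anything in a replicated dataset, confirm which snapshot is the common base. Comparing the guid property on both sides is a reliable check; backup-host and backup/app below are hypothetical names for the receiving system:

cr0x@server:~$ zfs get -H -o value guid pool/app@repl-2025-12-24-0200
16236354560796875188
cr0x@server:~$ ssh backup-host zfs get -H -o value guid backup/app@repl-2025-12-24-0200
16236354560796875188

Matching guids mean the snapshot is the same object on both sides and can anchor the next incremental. Delete it on either side and the chain needs a new common snapshot, or a full send.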

Cause G: You’re deleting too much, too fast (operationally correct but physically painful)

Destroying thousands of snapshots at once can create a nasty burst of metadata work. On busy pools—especially HDD-based ones—this can look like a performance incident. ZFS isn’t being petty; it’s doing the accounting you asked for. The fix is to throttle, batch, or defer.

Second joke (and last): Snapshot retention policies are like diets—easy to start, hard to maintain, and everyone lies about how many they have.

Practical tasks: commands, outputs, and what they mean

Below are concrete tasks I’ve run in anger. The commands are standard ZFS tooling; outputs are representative. Your platform (OpenZFS on Linux, FreeBSD, illumos) may differ slightly, but the workflow holds.

Task 1: Confirm the snapshot exists (and you’re spelling it right)

cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -s creation pool/app
NAME                     USED  REFER  CREATION
pool/app@auto-2025-12-24  12G  1.20T  Wed Dec 24 02:00 2025
pool/app@auto-2025-12-25  14G  1.21T  Thu Dec 25 02:00 2025

Interpretation: You have two snapshots. USED is how much space this snapshot uniquely consumes compared to the dataset head and other snapshots.

Task 2: Try a normal destroy and read the error literally

cr0x@server:~$ sudo zfs destroy pool/app@auto-2025-12-24
cannot destroy snapshot pool/app@auto-2025-12-24: snapshot has holds

Interpretation: This is not a mystery. It has holds. Don’t start guessing about “busy.” Go check holds.

Task 3: List holds on a snapshot

cr0x@server:~$ sudo zfs holds pool/app@auto-2025-12-24
NAME                     TAG             TIMESTAMP
pool/app@auto-2025-12-24  repl-inflight   Thu Dec 25 02:05 2025
pool/app@auto-2025-12-24  backup-verify   Thu Dec 25 02:10 2025

Interpretation: Two different systems think they still need this snapshot. That’s good news: you have names to chase.

Task 4: Release a hold (safely, with intent)

cr0x@server:~$ sudo zfs release backup-verify pool/app@auto-2025-12-24
cr0x@server:~$ sudo zfs holds pool/app@auto-2025-12-24
NAME                     TAG             TIMESTAMP
pool/app@auto-2025-12-24  repl-inflight   Thu Dec 25 02:05 2025

Interpretation: You removed one tag. If a hold is owned by a replication or backup job, releasing it prematurely can break guarantees. Coordinate or confirm the job is dead/stuck.

Task 5: Force destroy (not recommended, but know what it does)

cr0x@server:~$ sudo zfs destroy -f pool/app@auto-2025-12-24
cannot destroy snapshot pool/app@auto-2025-12-24: snapshot has holds

Interpretation: -f is not “ignore holds.” Holds exist specifically to prevent this. You must release the holds or remove the dependent conditions.

Task 6: Check whether a snapshot has clones

cr0x@server:~$ zfs get -H -o value clones pool/app@auto-2025-12-24
pool/dev/app-clone

Interpretation: This snapshot is the origin for a clone dataset. You cannot destroy the snapshot until you deal with the clone.

Task 7: Inspect the clone and its origin

cr0x@server:~$ zfs get -o name,property,value origin pool/dev/app-clone
NAME               PROPERTY  VALUE
pool/dev/app-clone  origin    pool/app@auto-2025-12-24

Interpretation: Clear lineage. If this clone is still needed, consider promoting it. If it’s disposable, destroy it.

Task 8: Promote a clone to break dependency (when you need the clone, not the origin snapshot)

cr0x@server:~$ sudo zfs promote pool/dev/app-clone
cr0x@server:~$ zfs get -o name,property,value origin pool/dev/app-clone
NAME               PROPERTY  VALUE
pool/dev/app-clone  origin    -

Interpretation: After promotion, the clone’s origin is cleared and it becomes an independent dataset. Snapshots up to and including the old origin move to the promoted clone, and the former parent dataset typically ends up as a clone whose origin points at one of those migrated snapshots. Re-check the clones property on the snapshot you want to delete.

Task 9: Destroy a snapshot now that holds/clones are resolved

cr0x@server:~$ sudo zfs release repl-inflight pool/app@auto-2025-12-24
cr0x@server:~$ sudo zfs destroy pool/app@auto-2025-12-24

Interpretation: If this succeeds and the snapshot disappears from zfs list -t snapshot, you removed the namespace entry. Space reclamation may still take time.

Task 10: Verify whether space is actually tied up by snapshots

cr0x@server:~$ zfs get -o name,property,value used,usedbysnapshots,usedbydataset,referenced pool/app
NAME      PROPERTY         VALUE
pool/app  used             3.10T
pool/app  usedbysnapshots  420G
pool/app  usedbydataset    2.68T
pool/app  referenced       2.68T

Interpretation: If usedbysnapshots is large, deleting snapshots can help. If it’s small, snapshot deletion won’t rescue you; the dataset itself is the space hog.

Task 11: Identify which snapshots consume the most space

cr0x@server:~$ zfs list -t snapshot -o name,used,refer -s used pool/app | tail -n 5
pool/app@auto-2025-10-01   38G  1.05T
pool/app@auto-2025-11-01   45G  1.10T
pool/app@auto-2025-11-15   52G  1.12T
pool/app@auto-2025-12-01   61G  1.18T
pool/app@auto-2025-12-15   85G  1.20T

Interpretation: These are good candidates for pruning if policy allows. Usually the “used” spikes correlate with big rewrites, compactions, VM image churn, or database maintenance.

Task 12: Batch-delete snapshots by pattern (carefully)

cr0x@server:~$ zfs list -H -t snapshot -o name -s creation pool/app | grep '@auto-2025-10' | head
pool/app@auto-2025-10-01
pool/app@auto-2025-10-02
pool/app@auto-2025-10-03
pool/app@auto-2025-10-04
pool/app@auto-2025-10-05
cr0x@server:~$ zfs list -H -t snapshot -o name -s creation pool/app | grep '@auto-2025-10' | xargs -n 1 sudo zfs destroy

Interpretation: This deletes snapshots one by one (less bursty than a single recursive destroy across many datasets). If any snapshot fails due to holds/clones, you’ll see which one and why. In large environments, add throttling between destroys.
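
A minimal throttled variant of the same pipeline; the five-second pause is an arbitrary starting point, so tune it to your pool and maintenance window:

cr0x@server:~$ zfs list -H -t snapshot -o name -s creation pool/app | grep '@auto-2025-10' | \
> while read -r snap; do
>   sudo zfs destroy "$snap" && echo "destroyed: $snap"
>   sleep 5   # breathing room between destroys to avoid a metadata I/O burst
> done
destroyed: pool/app@auto-2025-10-01
destroyed: pool/app@auto-2025-10-02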

Task 13: Use deferred destroy when you need the command to return quickly

cr0x@server:~$ sudo zfs destroy -d pool/app@auto-2025-12-15
cr0x@server:~$ zfs list -t snapshot pool/app | grep auto-2025-12-15
# (no output)

Interpretation: The snapshot is removed from the namespace quickly, but the pool may still be freeing blocks in the background. Watch pool I/O and space over time.
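
On OpenZFS you can watch the background freeing directly via the pool’s freeing property, which counts space still waiting to be reclaimed (values are representative):

cr0x@server:~$ zpool get -H -o value freeing pool
1.21T
cr0x@server:~$ sleep 60; zpool get -H -o value freeing pool
1.08T

A value trending toward zero means cleanup is progressing; a value stuck high under heavy load means the space will arrive, just not on your schedule.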

Task 14: Check pool health and obvious red flags before blaming ZFS deletion logic

cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zpool status pool
  pool: pool
 state: ONLINE
  scan: scrub repaired 0B in 03:21:10 with 0 errors on Sun Dec 21 03:00:11 2025
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

Interpretation: If the pool is degraded or resilvering, snapshot deletions can drag, and you might want to postpone bulk cleanup until after recovery work.

Task 15: Confirm you’re not trapped by replication expectations

cr0x@server:~$ zfs list -t snapshot -o name,creation pool/app | tail -n 3
pool/app@repl-2025-12-24-0200  Wed Dec 24 02:00 2025
pool/app@repl-2025-12-25-0200  Thu Dec 25 02:00 2025
pool/app@repl-2025-12-25-1400  Thu Dec 25 14:00 2025
cr0x@server:~$ zfs holds pool/app@repl-2025-12-24-0200
NAME                         TAG              TIMESTAMP
pool/app@repl-2025-12-24-0200  zfs-send-chain   Thu Dec 25 14:05 2025

Interpretation: A replication tool is intentionally keeping this snapshot. If you remove it, the next incremental may fail or fall back to a full transfer.

Task 16: See snapshot counts and detect “snapshot explosions”

cr0x@server:~$ zfs list -H -t snapshot -o name pool/app | wc -l
1827

Interpretation: A dataset with thousands of snapshots is not automatically wrong, but it changes the cost model of deletion and rollback. Plan cleanup like you’d plan an index rebuild: schedule it, throttle it, and monitor it.

Three corporate-world mini-stories

1) Incident caused by a wrong assumption: “Deleting snapshots will instantly free terabytes”

In a large enterprise VM cluster, a storage team got a scary alert: the primary pool had slipped below the free-space threshold. An engineer did what many of us have done: deleted “old snapshots” on the busiest dataset, expecting a quick drop in usage. The delete completed, but the pool free space barely moved. In the next hour, write latency rose, and some VMs started timing out.

The wrong assumption wasn’t that snapshots cost space; it was that those snapshots were the reason. The dataset had a churn-heavy workload: VM images were being compacted and rewritten daily, but new snapshots were also being taken hourly. Deleting a handful of old ones didn’t help because the overwritten blocks were still referenced by newer snapshots.

The team doubled down: more deletes, faster. That’s how they triggered the second-order effect: mass snapshot destruction created metadata pressure. It didn’t “break ZFS,” but it competed with VM I/O and made a marginal situation worse. Now they had both low free space and degraded performance.

The fix was a mix of humility and arithmetic. They measured usedbysnapshots, identified the worst snapshot ranges by USED, and deleted in controlled batches. They also adjusted the snapshot schedule for that dataset: fewer hourly snapshots, more daily ones, plus a shorter retention window for the high-churn images. The pool stabilized, and performance recovered—slowly, like any system that’s been forced to do a lot of bookkeeping in a hurry.

2) Optimization that backfired: “Let’s turn everything into clones for dev speed”

A company with a strong platform team wanted to accelerate developer environments. The pitch was elegant: take a nightly snapshot of production-like datasets, then create per-team clones in seconds. It worked. Developers were happy. The platform team got praise for “leveraging storage primitives.”

Months later, the storage pool hit a wall. Snapshot deletion started failing with clone dependencies. Worse, those clones were no longer “temporary.” Teams had installed packages, dropped test data, and built workflows around their clones. Destroying them wasn’t a cleanup task; it was a political process.

The backfire was subtle: the optimization pushed lifecycle complexity into the storage layer. Snapshots became undeletable not because ZFS is stubborn, but because the organization made snapshots part of an implicit contract. Every clone anchored a snapshot; every anchored snapshot held history; and history held space.

The recovery plan was not a heroic one-liner. They introduced a policy: dev clones must be promoted within a time window (or rebuilt from a newer snapshot). They also implemented tagging and reporting: clone age, origin snapshot age, and snapshot retention exceptions. The win was cultural as much as technical: the cost of “instant clones” became visible, and the platform team stopped treating snapshot cleanup as an afterthought.

3) Boring but correct practice that saved the day: “Holds with owner tags, and a cleanup contract”

Another environment had two replication systems during a migration—old and new—running in parallel for a while. This is the kind of situation where snapshots quietly pile up, because nobody wants to delete something that might be needed for incremental sends.

The team’s practice was unglamorous: every hold tag included the owning system name and a run identifier, and holds were time-bounded by policy. A daily job reported snapshots with holds older than a threshold, broken down by tag. When a pipeline died, it left evidence instead of mystery.

During the migration, one of the replication jobs started failing intermittently. Snapshots were accumulating, but they didn’t become “immortal.” The report showed a single hold tag growing stale. The on-call engineer didn’t have to guess which system owned it; the tag said so.

They paused the faulty pipeline, confirmed a safe baseline snapshot, released stale holds, and resumed with an updated chain. No drama, no speculative deletes. The storage pool never dipped into the danger zone, and the migration completed without the usual “why is retention not working?” incident. The practice was boring. It also worked—which in production is the highest compliment.

Common mistakes (specific symptoms and fixes)

Mistake 1: Confusing “snapshot destroyed” with “space immediately reclaimed”

Symptom: Snapshot no longer appears in zfs list -t snapshot, but pool free space doesn’t increase.

Fix: Check usedbysnapshots and whether other snapshots still exist. Consider deferred destroy lag and pool load. Measure over time, not seconds.

Mistake 2: Deleting snapshots while ignoring clone dependencies

Symptom: cannot destroy snapshot ...: snapshot has dependent clones or snapshot property clones is non-empty.

Fix: Identify clones via zfs get clones. Destroy unneeded clones or zfs promote the clone if it must live.

Mistake 3: Releasing holds without understanding who set them

Symptom: Replication or backup jobs start failing after “cleanup.” Incremental sends complain about missing snapshots.

Fix: Before releasing holds, identify the owning process/tool and confirm it’s safe. If you must break the chain, plan for a full re-seed and the bandwidth/time cost.

Mistake 4: Using recursive destroys too broadly

Symptom: You run zfs destroy -r pool/fs@snap and unexpectedly target a huge subtree; deletes take forever or fail in surprising places.

Fix: List what will be affected first. Prefer batch deletion on explicit snapshot names when the blast radius matters.
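
A dry run shows the blast radius before you commit: -n and -v make the recursive destroy report what it would remove, across child datasets, without touching anything (output is representative):

cr0x@server:~$ sudo zfs destroy -rnv pool/app@auto-2025-10-01
would destroy pool/app@auto-2025-10-01
would destroy pool/app/logs@auto-2025-10-01
would reclaim 38.2G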

Mistake 5: Measuring the wrong dataset or the wrong property

Symptom: You delete snapshots on pool/app but the pool is still full; later you realize the space is in pool/app/logs or a different dataset.

Fix: Use zfs list -o space style views (or explicit property queries) across the dataset tree. Verify the dataset actually contributes to the pool pressure.

Mistake 6: Deleting “big USED” snapshots first without considering incrementals

Symptom: After deleting a big snapshot, your next replication run becomes a full send or fails.

Fix: Understand which snapshots are anchor points for incremental chains. Coordinate deletes with replication schedule and destination state.

Mistake 7: Running mass snapshot destroys during peak I/O

Symptom: Latency spikes, iowait climbs, users report slowness, but nothing is “down.”

Fix: Throttle deletes, batch them, run during low-traffic windows, or use deferred destroy. Snapshot deletion is metadata-heavy; treat it as maintenance work.

Checklists / step-by-step plan

Checklist A: “Destroy fails” step-by-step

  1. Confirm the exact snapshot name exists with zfs list -t snapshot.
  2. Attempt destroy once and capture the error text.
  3. If it mentions holds: run zfs holds, identify tags, decide whether to release.
  4. If it mentions clones: run zfs get clones, then inspect and either destroy or promote clones.
  5. If it mentions “busy”: confirm whether you are destroying only snapshots or also datasets (-r/-R flags); check for mounts, exports, containers, and ongoing receives.
  6. Re-run destroy for a single snapshot to verify you’ve cleared the blocker before scaling up.

Checklist B: “Deleted snapshots but space didn’t return” step-by-step

  1. Verify snapshot count actually decreased: zfs list -t snapshot | wc -l (or dataset-scoped).
  2. Check dataset space breakdown: zfs get used,usedbysnapshots,usedbydataset,referenced.
  3. List remaining snapshots and sort by USED to see who still holds blocks.
  4. Consider that the dataset head may still reference the blocks: snapshot deletion won’t help if the live data is the bulk.
  5. If using deferred destroy, allow time and monitor pool free space trend.
  6. Check whether a different dataset is the real source of pool usage.

Checklist C: Safe bulk cleanup plan (when you need to delete hundreds/thousands)

  1. Pick a deletion window (off-peak). Announce it like any other maintenance.
  2. Dry-run the selection: list snapshot names you intend to delete, review for patterns (replication anchors, month-ends).
  3. Check for holds and clone dependencies before you start.
  4. Delete in batches; monitor latency and pool activity between batches.
  5. Log what you deleted and what failed (and why). “We ran a command” is not an audit trail.
  6. After cleanup, verify replication/backups still have their required base snapshots.

FAQ

1) Why does zfs destroy say “snapshot has holds”?

Because one or more hold tags are set on the snapshot. Holds are explicit protection. Use zfs holds pool/fs@snap to list them, then zfs release TAG pool/fs@snap to remove a specific tag when it’s safe.

2) Why can’t I destroy a snapshot that has dependent clones?

A clone is a dataset that uses that snapshot as its origin. Destroying the snapshot would break the clone’s reference graph. Either destroy the clone, or zfs promote the clone to flip the dependency.

3) I destroyed snapshots and df still shows the filesystem full. Is ZFS lying?

Usually it’s not lying; you’re comparing different accounting systems. df shows what the mounted filesystem thinks is available, which can be influenced by reservations, quotas, refreservations, and pool-wide allocation pressure. Use zfs get used,available,usedbysnapshots and pool-level zpool list to see what’s actually happening.
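
A quick side-by-side of the two accounting views, with representative numbers (the raidz2 pool from Task 14, so raw ALLOC/FREE are roughly twice the usable figures):

cr0x@server:~$ zpool list -o name,size,allocated,free,capacity pool
NAME   SIZE  ALLOC   FREE   CAP
pool  21.8T  13.1T  8.72T   60%
cr0x@server:~$ zfs get -o name,property,value used,available,usedbysnapshots,refreservation pool/app
NAME      PROPERTY         VALUE
pool/app  used             3.10T
pool/app  available        4.10T
pool/app  usedbysnapshots  420G
pool/app  refreservation   none

zpool list counts raw allocation (including parity), while the dataset properties show usable space after redundancy, reservations, and quotas; df reflects the latter.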

4) Do snapshots slow down my workload?

Not by existing alone. The cost comes from churn: if you overwrite lots of data while holding many snapshots, the pool retains more old blocks, increasing space pressure and potentially fragmentation. Deleting large numbers of snapshots can also create a burst of metadata work.

5) What’s the difference between USED on a snapshot and usedbysnapshots on a dataset?

USED on a snapshot is what that snapshot uniquely contributes compared to others. usedbysnapshots on a dataset is the total space consumed by snapshots associated with that dataset. They answer different questions: “which snapshot is expensive?” versus “how much is snapshots overall costing me here?”

6) Is deferred destroy safe?

Yes, in the sense that it’s a supported mechanism: the snapshot is removed and block freeing happens later. The tradeoff is operational visibility—people expect instant space return. Use it when you need responsiveness and can tolerate delayed reclamation.

7) Why does destroying snapshots sometimes take forever?

ZFS may need to traverse metadata to free blocks, and that competes with normal I/O. Pools with lots of snapshots, high churn, slow disks, or heavy concurrent workloads will feel this more. Batch deletes and schedule them during quieter periods.

8) Can I delete snapshots on the source without touching the destination (or vice versa)?

You can, but replication incrementals depend on having common snapshots. Deleting an “anchor” snapshot on either side can force a full re-seed or break automation. If you’re not sure, inspect which snapshots are being used for replication and whether they’re held/pinned.

9) How do I prevent snapshot “immortality” from happening again?

Use explicit naming conventions, tag holds with ownership, monitor snapshots with holds older than expected, and avoid long-lived clones unless you have a lifecycle policy (promotion window, TTL, or automatic cleanup).
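
A sketch of what an owner-tagged hold looks like in practice; the tag name is hypothetical, but it encodes the owning system and run, which is exactly what you’ll want at 3 a.m.:

cr0x@server:~$ sudo zfs hold backupd-run-20251225 pool/app@auto-2025-12-25
cr0x@server:~$ zfs holds pool/app@auto-2025-12-25
NAME                      TAG                   TIMESTAMP
pool/app@auto-2025-12-25  backupd-run-20251225  Thu Dec 25 14:20 2025
cr0x@server:~$ sudo zfs release backupd-run-20251225 pool/app@auto-2025-12-25

The release belongs to the same job that set the hold; a daily report then only has to flag tags older than the policy window.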

Conclusion

ZFS snapshots don’t “refuse to die” out of spite. They persist because ZFS is doing exactly what you hired it to do: preserve consistency, honor dependencies, and prevent unsafe deletion. When a snapshot won’t delete, it’s almost always one of three things—holds, clones, or a misunderstanding of space accounting—and the fix is to identify which it is before you start swinging commands.

When you treat snapshot deletion like maintenance (scoped, monitored, and coordinated with replication/backup realities), it becomes boring again. And boring storage is the kind you only notice when you’re bragging about uptime instead of explaining it.
