Snapshot hell doesn’t arrive with sirens. It shows up as “why is the pool 94% full again?” and “why does deleting old snapshots free nothing?”
Then the backup window stretches, replication lags, and a routine restore turns into a forensic excavation.
ZFS snapshots are one of the best ideas storage ever shipped. They’re also the easiest way to quietly build a time machine you can’t afford to run.
The fix isn’t “take fewer snapshots” or “buy more disks.” It’s a retention policy that treats snapshots like production data: budgeted, observable, and
routinely garbage-collected with receipts.
What “snapshot hell” really is
Snapshot hell is not “too many snapshots.” It’s snapshots without governance.
You can run tens of thousands of snapshots and be fine if you understand how space is retained and how replication and holds work,
and if you delete with intent rather than vibes.
Here’s what it looks like in production:
- Capacity panic: pool usage climbs even though “we deleted old data.”
- Restore uncertainty: no one knows which snapshot is safe, consistent, or relevant.
- Replication drag: incremental sends grow because snapshots are misaligned, or because you kept the wrong ones.
- Operational fear: “don’t delete anything, it might be needed.” Congratulations, you now run a museum.
The core problem is that snapshots preserve old blocks. If you mutate data aggressively (VMs, databases, CI caches),
snapshots keep those old versions around. Your “deleted” data isn’t deleted; it’s memorialized.
Joke #1: Snapshots are like receipts—useful until you keep them for seven years and discover you bought 300 identical USB cables.
Interesting facts and history you can use
- ZFS was built around copy-on-write (CoW), meaning it never overwrites live blocks; snapshots are a natural consequence, not a bolt-on feature.
- Snapshots are almost free to create (metadata work), which is why they’re so easy to overuse; the cost shows up later as retained blocks.
- ZFS snapshots are consistent at the filesystem level; for application consistency (databases), you still need coordination (freeze, flush, or replication tooling).
- Space accounting for snapshots is not “snapshot size”; it’s “unique blocks referenced only by that snapshot.” This confuses nearly everyone once.
- Clones are writable snapshots—and they create dependency chains where you can’t delete “the old thing” because a clone is still using it.
- Holds exist because humans delete the wrong thing; ZFS made snapshot deletion reversible in spirit (preventable), not in fact (still destructive).
- Incremental send/receive relies on snapshot lineage; deleting “intermediate” snapshots can break incrementals unless you plan around it.
- Auto-snapshot tooling became popular because manual snapshotting fails silently: it works until the one week you forget—right before you need it.
A mental model: snapshots, blocks, and why deletion disappoints
Snapshots in ZFS are pointers to a consistent view of a dataset at a moment in time. They don’t “contain files.”
They contain references to blocks that existed at snapshot creation time.
When data changes after a snapshot, ZFS writes new blocks (CoW). The old blocks remain because the snapshot still points to them.
So the snapshot cost is proportional to how much changed since it was taken, not the dataset’s nominal size.
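If you want to see the accounting firsthand before trusting it in production, a five-minute experiment makes it concrete. A minimal sketch, assuming a pool named tank with some free space and a Linux-style dd; tank/demo is a throwaway dataset you create and destroy for the test.

# Scratch dataset for the experiment (destroy it afterwards).
zfs create tank/demo
dd if=/dev/urandom of=/tank/demo/blob bs=1M count=512     # ~512M of incompressible data
zfs snapshot tank/demo@before                             # snapshot now references those blocks
dd if=/dev/urandom of=/tank/demo/blob bs=1M count=512     # rewrite the same file; CoW writes new blocks
sync                                                      # let the rewrite land before reading accounting
zfs get -o name,property,value used,usedbydataset,usedbysnapshots tank/demo
# Expect usedbysnapshots to approach 512M: the snapshot is now the only reference to the old blocks.
zfs destroy tank/demo@before                              # deleting the snapshot is what frees them
zfs destroy -r tank/demo                                  # clean up the experiment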
Three behaviors that make operators miserable
- Free space returns late: deleting a file doesn’t free blocks if any snapshot still references them.
- “Used” is contextual: the “used” of a snapshot is unique block ownership; it can grow as the dataset churns.
- Dependencies are invisible unless you look: clones and holds will prevent deletions and confuse cleanup scripts.
What retention policy must solve
A retention policy is not a schedule. It’s a compact between risk and cost:
“We will always be able to restore within X time window and Y granularity, and we will never exceed Z% pool usage because of snapshots.”
You need:
- Clear snapshot classes (hourly/daily/weekly/monthly).
- A naming scheme that encodes intent.
- A pruning algorithm that keeps representative points in time.
- Guardrails: holds for special snapshots, and alerts when retention violates capacity budgets.
- Runbooks for the “pool is filling” day—because it will happen.
One paraphrased idea, attributed because it has shaped ops culture:
Build systems that assume humans will make mistakes, then constrain the blast radius.
— James Hamilton (reliability engineering)
The retention policy that actually works
The policy below is boring. That’s why it survives contact with production.
It’s designed around three truths: most restores are recent, compliance wants long horizons, and capacity is finite.
1) Pick a recovery goal, then buy it with snapshots
Decide your RPO (how much data you can lose) and RTO (how quickly you must restore). Then map them to snapshot frequency.
If your business says “we can tolerate 1 hour of loss,” don’t take daily-only snapshots and hope.
Practical default that fits many teams:
- Hourly: keep 48 (2 days)
- Daily: keep 35 (5 weeks)
- Weekly: keep 12 (3 months)
- Monthly: keep 18 (18 months)
That sounds like a lot. It’s not, if you prune on schedule and keep your most-churny datasets separate (VMs, build caches, database WALs).
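If you want this policy to be more than tribal knowledge, write it down where automation can read it. A minimal sketch, assuming a hypothetical shell helper named keep_for_class; tools like sanoid or zfs-auto-snapshot encode the same table in their own config formats.

#!/bin/sh
# Hypothetical policy table: snapshot class -> how many to keep.
# Mirrors the defaults above; tune the counts per dataset class, not per pool.
keep_for_class() {
  case "$1" in
    hourly)  echo 48 ;;
    daily)   echo 35 ;;
    weekly)  echo 12 ;;
    monthly) echo 18 ;;
    *)       echo 0  ;;   # unknown class: keep nothing, so automation flags it
  esac
}

The pruning sketch later in the checklists section consumes exactly this kind of table.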
2) Separate datasets by churn and by value
Snapshot retention must be per dataset class, not “one rule for the pool.”
Put high-churn stuff in its own dataset so it doesn’t poison retention economics for everything else.
- Databases: consistent snapshots coordinated with the DB; short retention locally, longer in replica/backups.
- VM disks: short frequent snapshots; aggressively prune; consider replication for longer horizons.
- Home directories: longer retention, lower churn, good candidate for long horizons.
- Scratch / CI: minimal retention or none; snapshots here are how you slowly set money on fire.
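In practice the split is a handful of zfs create calls plus per-dataset properties. A sketch with hypothetical dataset names; the recordsize values are common starting points, not gospel, and your workloads may want different ones.

# Parents first (skip any that already exist on your system).
zfs create tank/vmstore
zfs create tank/db
# High-churn VM images: one dataset per VM keeps the snapshot blast radius small.
zfs create -o atime=off -o recordsize=64K tank/vmstore/vm-web01
# Databases: their own dataset so local retention stays short and DB-coordinated.
zfs create -o atime=off -o recordsize=16K tank/db/pg01
# Low-churn user data: comfortable with long retention.
zfs create -o atime=off tank/home
# Scratch / CI: leave it out of snapshot automation entirely.
zfs create tank/scratch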
3) Budget snapshots as a percentage of pool capacity
Pick a hard cap, like: “Snapshots may not cause the pool to exceed 80% usage under normal operation.”
Your budget is a policy constraint, not a suggestion.
When the cap is breached, pruning becomes more aggressive automatically (drop monthly first? or drop hourly first?).
My opinion: drop the high-frequency layers first (hourly), because they cost the most in churny datasets and buy the least in long-term recovery.
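A budget without a check is a wish. A minimal guardrail sketch, assuming a pool named tank and the 80% cap above; hook the non-zero exit status into whatever alerting you already run.

#!/bin/sh
# Page (exit 1) when the pool crosses the snapshot capacity budget.
POOL="tank"
LIMIT=80   # percent, the policy cap

cap="$(zpool list -H -o capacity "$POOL" | tr -d '%')"
if [ "$cap" -ge "$LIMIT" ]; then
  echo "ALERT: ${POOL} at ${cap}% (budget ${LIMIT}%); escalate snapshot pruning" >&2
  # Show the biggest snapshot consumers so triage starts with data, not vibes.
  zfs get -H -o name,value usedbysnapshots -r "$POOL" | sort -h -k2 | tail -5 >&2
  exit 1
fi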
4) Require holds for special snapshots, but never for routine ones
Holds are for exceptional moments: pre-upgrade checkpoints, legal holds, “we’re about to run a destructive migration.”
Routine retention should be fully automated and fully deletable.
If you let holds creep into the default snapshot set, your retention policy becomes a polite fiction.
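The whole workflow is three commands and a ticket ID. A sketch, assuming a hypothetical change ticket CHG-1234 and a pre-upgrade checkpoint on tank/db; the hold tag is what future humans will search for.

# Take the checkpoint and pin it; the hold tag carries the ticket ID.
zfs snapshot tank/db@pre-upgrade-CHG-1234
zfs hold CHG-1234 tank/db@pre-upgrade-CHG-1234

# Verify it is pinned (and discoverable later).
zfs holds tank/db@pre-upgrade-CHG-1234

# After sign-off: release the hold and let normal retention delete the snapshot.
zfs release CHG-1234 tank/db@pre-upgrade-CHG-1234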
5) Align retention with replication
If you do ZFS send/receive replication, decide which side is the authority for long-term retention.
Common winning pattern:
- Source: short retention, frequent snapshots for fast local restore and small incrementals.
- Destination: longer retention, fewer snapshots retained, plus periodic “anchors” (weekly/monthly) for compliance.
This avoids the classic “we retained everything everywhere” mistake. Disk is cheaper than time—until it isn’t.
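Mechanically, the anchor pattern is an incremental send between two snapshots both sides have agreed to keep. A minimal sketch, assuming a destination host called dr-host and a receiving dataset backup/db, both placeholders.

# Incremental stream from the previous anchor to the new one.
# -I also carries the intermediate snapshots between the two anchors.
zfs send -I tank/db@daily-2026-02-03T00:00Z tank/db@daily-2026-02-04T00:00Z \
  | ssh dr-host zfs receive -u backup/db

# Sanity check: the new anchor must now exist on the destination,
# or the next incremental has nothing to start from.
ssh dr-host zfs list -H -t snapshot -o name -d 1 backup/db | tail -3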
Naming and metadata: future-you deserves better
Snapshot names are part of your control plane. They’re not decoration.
If you don’t encode intent, you’ll end up with names like auto-2025-thing and a human guessing which ones matter.
A naming scheme that scales
Use: <dataset>@<class>-<utc timestamp> plus optional suffixes for application coordination.
Examples:
tank/vmstore@hourly-2026-02-04T00:00Z
tank/home@daily-2026-02-04T00:00Z
tank/db@hourly-2026-02-04T00:00Z-pgfreeze
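Generating names like these is a one-liner once you standardize on UTC. A sketch using date -u; the -pgfreeze suffix is reserved for snapshots your database tooling actually coordinated.

# UTC timestamp in the format above, e.g. 2026-02-04T00:00Z
stamp="$(date -u +%Y-%m-%dT%H:%MZ)"

zfs snapshot "tank/home@daily-${stamp}"
zfs snapshot "tank/vmstore@hourly-${stamp}"
# Application-coordinated snapshots carry an explicit suffix:
zfs snapshot "tank/db@hourly-${stamp}-pgfreeze"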
Use properties to tag intent
ZFS user properties can carry policy metadata. For example:
com.example:retention=hourly
com.example:owner=payments
com.example:tier=gold
Names help humans. Properties help tooling.
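Setting and reading those tags is plain zfs set and zfs get; the com.example prefix is a placeholder for your own reverse-DNS namespace.

# Tag policy metadata on the dataset; children inherit unless they override.
zfs set com.example:retention=hourly tank/db
zfs set com.example:owner=payments tank/db
zfs set com.example:tier=gold tank/db

# Tooling reads it back; SOURCE shows whether the value is local or inherited.
zfs get -o name,property,value,source com.example:retention,com.example:owner,com.example:tier tank/db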
Practical tasks (commands, output, decisions)
These are not “toy” commands. They’re the ones you run when someone is asking why the pool is full and your calendar is about to be canceled by physics.
Each task includes: the command, what the output means, and the decision you make from it.
Task 1: See pool capacity pressure and fragmentation
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,health,frag
NAME SIZE ALLOC FREE CAP HEALTH FRAG
tank 10.9T 9.2T 1.7T 84% ONLINE 38%
Meaning: CAP at 84% is already in the danger zone for performance and allocation behavior; FRAG indicates how chopped up free space is.
Fragmentation isn’t “bad” by itself, but high CAP + rising FRAG is where latency starts to get spicy.
Decision: If CAP > 80%, stop creating new long-retention snapshots, prioritize pruning, and investigate who is holding space (snapshots, clones, or actual data).
Task 2: Identify which datasets are actually consuming space
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -S used
NAME USED AVAIL REFER MOUNTPOINT
tank/vmstore 4.3T 820G 1.1T /tank/vmstore
tank/home 2.7T 820G 2.2T /tank/home
tank/db 1.6T 820G 540G /tank/db
Meaning: USED includes snapshots and children; REFER is what the live filesystem is currently referencing.
Big gaps between USED and REFER are your snapshot tax (or children).
Decision: Focus on datasets where USED ≫ REFER. They’re prime suspects for snapshot bloat.
Task 3: Quantify snapshot overhead per dataset
cr0x@server:~$ zfs get -H -o name,property,value used,usedbysnapshots,usedbydataset tank/vmstore
tank/vmstore used 4.30T
tank/vmstore usedbysnapshots 3.10T
tank/vmstore usedbydataset 1.10T
Meaning: 3.10T is pinned by snapshots. That’s not “snapshot count,” that’s retained blocks.
Decision: This dataset needs retention review immediately. Either reduce snapshot frequency/length or split churny subpaths into separate datasets.
Task 4: List snapshots with space impact, sorted
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -S used -d 1 tank/vmstore | head
NAME USED REFER CREATION
tank/vmstore@hourly-2026-02-02T10:00Z 120G 1.05T Sun Feb 2 10:00 2026
tank/vmstore@hourly-2026-02-02T11:00Z 118G 1.06T Sun Feb 2 11:00 2026
tank/vmstore@daily-2026-01-28T00:00Z 90G 1.02T Tue Jan 28 00:00 2026
Meaning: The USED column for snapshots is “unique data held by this snapshot.” Large values are your cleanup leverage.
Decision: Start pruning large hourly snapshots first—unless they’re part of a replication chain you still need.
Task 5: Check if snapshots are blocked by holds
cr0x@server:~$ zfs holds tank/vmstore@hourly-2026-02-02T10:00Z
NAME TAG TIMESTAMP
tank/vmstore@hourly-2026-02-02T10:00Z pre-upgrade Mon Feb 3 09:12 2026
Meaning: This snapshot cannot be destroyed until the hold is released.
Decision: Confirm the “pre-upgrade” event is complete and signed off. Then release the hold; otherwise keep it and prune other snapshots.
Task 6: Release a hold (safely) and delete a snapshot
cr0x@server:~$ zfs release pre-upgrade tank/vmstore@hourly-2026-02-02T10:00Z
cr0x@server:~$ zfs destroy tank/vmstore@hourly-2026-02-02T10:00Z
cr0x@server:~$ zfs list -t snapshot -o name,used -S used -d 1 tank/vmstore | head -3
NAME USED
tank/vmstore@hourly-2026-02-02T11:00Z 118G
tank/vmstore@daily-2026-01-28T00:00Z 90G
Meaning: The snapshot is gone; the ordering changed. Space may not show up immediately as “free” if other snapshots still reference those blocks.
Decision: If free space doesn’t move, keep deleting down the chain (or identify clones/other snapshots holding the same blocks).
Task 7: Detect clones that block snapshot deletion
cr0x@server:~$ zfs get -H -o name,property,value origin -r tank/vmstore
tank/vmstore origin -
tank/vmstore/clone-win11 origin tank/vmstore@daily-2026-01-28T00:00Z
Meaning: tank/vmstore/clone-win11 depends on that daily snapshot. Deleting the snapshot will fail until the clone is removed or promoted.
Decision: Either destroy the clone (if disposable) or zfs promote the clone to break dependency before pruning.
Task 8: Promote a clone to unblock retention
cr0x@server:~$ zfs promote tank/vmstore/clone-win11
cr0x@server:~$ zfs get -H -o name,property,value origin tank/vmstore/clone-win11
tank/vmstore/clone-win11 origin -
Meaning: The clone is now independent; its origin is cleared.
Decision: Now you can destroy old snapshots that previously acted as origin—after verifying the clone’s new snapshot lineage and replication plan.
Task 9: Check snapshot count and distribution by class
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank/home | awk -F@ '{print $2}' | cut -d- -f1 | sort | uniq -c
48 hourly
35 daily
12 weekly
18 monthly
Meaning: This matches the policy. Counts drifting upward point to an automation failure or a blocked destroy (holds/clones).
Decision: If counts exceed policy, inspect holds and clones first; then inspect whether your pruning job actually runs and has permissions.
Task 10: Inspect what’s preventing incremental replication
cr0x@server:~$ zfs list -t snapshot -o name,creation -d 1 tank/db | tail -5
tank/db@hourly-2026-02-03T20:00Z Mon Feb 3 20:00 2026
tank/db@hourly-2026-02-03T21:00Z Mon Feb 3 21:00 2026
tank/db@hourly-2026-02-03T22:00Z Mon Feb 3 22:00 2026
tank/db@hourly-2026-02-03T23:00Z Mon Feb 3 23:00 2026
tank/db@hourly-2026-02-04T00:00Z Tue Feb 4 00:00 2026
Meaning: Replication incrementals require a common snapshot on both sides. If the destination is missing older hourlies because you pruned them there,
your source can’t do incremental from that point.
Decision: Define which snapshots are “replication anchors” (often dailies or weeklies) and ensure both sides keep them long enough.
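A quick way to confirm both sides still share an anchor, assuming the placeholder destination dr-host and dataset backup/db from your replication setup:

# Snapshot names only (no headers) on each side, then intersect.
zfs list -H -t snapshot -o name -d 1 tank/db | awk -F@ '{print $2}' | sort > /tmp/src.snaps
ssh dr-host zfs list -H -t snapshot -o name -d 1 backup/db | awk -F@ '{print $2}' | sort > /tmp/dst.snaps
comm -12 /tmp/src.snaps /tmp/dst.snaps | tail -5   # shared names; pick the newest as the incremental base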
Task 11: Estimate send size before you commit
cr0x@server:~$ zfs send -nvP -I tank/db@hourly-2026-02-03T20:00Z tank/db@hourly-2026-02-04T00:00Z
send from @hourly-2026-02-03T20:00Z to tank/db@hourly-2026-02-04T00:00Z estimated size is 36.2G
total estimated size is 36.2G
Meaning: This is what your replication would push. If this is unexpectedly huge, your dataset churn is high or your interval is too large.
Decision: If estimated size is too big for your window, increase snapshot frequency (smaller deltas), reduce churn (separate WAL/temp),
or move to more appropriate replication timing.
Task 12: Confirm compression and recordsize choices that affect snapshot churn
cr0x@server:~$ zfs get -o name,property,value compression,recordsize,atime tank/vmstore
NAME PROPERTY VALUE
tank/vmstore compression zstd
tank/vmstore recordsize 128K
tank/vmstore atime off
Meaning: Recordsize affects how much data changes when small writes happen; atime=on can create needless metadata churn (and thus snapshot growth).
Decision: For VM images and databases, consider recordsize and dataset layout intentionally. Disable atime unless you have a real use case.
Task 13: Find “space that won’t free” due to deletes under snapshots
cr0x@server:~$ zfs list -o name,used,refer,usedbysnapshots -r tank/vmstore
NAME USED REFER USEDBYSNAPSHOTS
tank/vmstore 4.3T 1.1T 3.1T
tank/vmstore/images 3.9T 820G 3.0T
tank/vmstore/iso 120G 115G 2G
Meaning: The churn is concentrated under images. That’s where deletes/rewrites happened while snapshots existed.
Decision: Tune retention specifically for tank/vmstore/images or separate further (per-VM datasets) to limit blast radius.
Task 14: Dry-run a destructive cleanup by listing what would be deleted
cr0x@server:~$ zfs list -t snapshot -o name,creation -S creation -d 1 tank/home | awk 'NR==1 || /@hourly-/' | head -5
NAME CREATION
tank/home@hourly-2026-02-04T00:00Z Tue Feb 4 00:00 2026
tank/home@hourly-2026-02-03T23:00Z Mon Feb 3 23:00 2026
tank/home@hourly-2026-02-03T22:00Z Mon Feb 3 22:00 2026
tank/home@hourly-2026-02-03T21:00Z Mon Feb 3 21:00 2026
Meaning: You’ve confirmed naming pattern and creation timestamps. Cleanup scripts should operate on these patterns, not on “whatever looks old.”
Decision: If snapshots don’t match patterns, stop and fix naming first. Garbage collection without consistent names is just creative deletion.
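ZFS can also do the dry run for you: zfs destroy accepts -n (no-op) and -v (verbose), and the @first%last range syntax expands to every snapshot created between the two, which makes a prune reviewable before it becomes permanent. A sketch with hypothetical snapshot names for the range:

# Dry run: show what would be destroyed and how much space it would reclaim.
# Note: the % range is by creation order, so snapshots of other classes created
# in between are included too; read the -nv output before committing.
zfs destroy -nv tank/home@hourly-2026-02-01T00:00Z%hourly-2026-02-02T23:00Z

# Only after reviewing the output, drop -n to actually delete:
# zfs destroy -v tank/home@hourly-2026-02-01T00:00Z%hourly-2026-02-02T23:00Z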
Task 15: Observe I/O pressure when snapshots are heavy
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 9.2T 1.7T 420 1150 62.1M 210M
raidz2-0 9.2T 1.7T 420 1150 62.1M 210M
sda - - 52 140 7.8M 26.5M
sdb - - 50 138 7.5M 25.9M
sdc - - 54 145 8.0M 27.1M
Meaning: High write ops plus high pool usage often coincides with snapshot-heavy workloads (more metadata churn, more fragmentation).
Decision: If your workload is latency-sensitive and CAP is high, prune snapshots and reduce churn. If you’re out of headroom, plan expansion—don’t bargain with entropy.
Fast diagnosis playbook
When someone says “snapshots are killing us,” they might be right—or they might be blaming the only visible feature.
This playbook finds the bottleneck fast, in the order that usually pays off.
First: are you actually capacity-constrained?
- Check pool usage and health:
cr0x@server:~$ zpool list -o name,cap,health,frag
NAME  CAP  HEALTH  FRAG
tank  84%  ONLINE  38%
Decision: If CAP > 80%, treat it as an incident. Your options shrink rapidly above this line.
- Find snapshot contribution:
cr0x@server:~$ zfs get -H -o name,value usedbysnapshots -r tank | sort -h -k2 | tail -5
tank/db       310G
tank/home     420G
tank/vmstore  3.10T
Decision: Attack the top contributor first. Heroics elsewhere are performative.
Second: is deletion blocked by dependencies (holds/clones/replication)?
- Check for holds on “should be deletable” snapshots:
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank/vmstore | xargs zfs holds | head
NAME                                  TAG    TIMESTAMP
tank/vmstore@daily-2026-01-28T00:00Z  legal  Thu Jan 30 14:10 2026
Decision: Holds imply process. Confirm ownership, then release or accept the capacity cost.
- Check for clones and origins:
cr0x@server:~$ zfs get -H -o name,value origin -r tank/vmstore | awk '$2 != "-"'
tank/vmstore/clone-win11  tank/vmstore@daily-2026-01-28T00:00Z
Decision: If clones exist, promote or destroy them before pruning their origin snapshots.
Third: is performance the complaint, not capacity?
- Check I/O and latency symptoms:
cr0x@server:~$ zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         9.2T   1.7T    380   1320  58.0M   225M
Decision: If writes are high and CAP is high, pruning helps. If CAP is fine, look at recordsize, sync settings, and workload shape.
- Check dataset properties that create churn (atime, recordsize):
cr0x@server:~$ zfs get -o name,property,value atime,recordsize -r tank/vmstore
NAME                 PROPERTY    VALUE
tank/vmstore         atime       off
tank/vmstore         recordsize  128K
tank/vmstore/images  atime       off
tank/vmstore/images  recordsize  128K
Decision: Tune per dataset. Don’t “optimize” the whole pool because one workload screams the loudest.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran ZFS for VM storage and home directories. They had a nightly job that “deleted old backups” in the guest OS.
The storage team expected the pool to drift downward after each cleanup. It never did.
When the pool crossed into the high-80s, latency spiked. Tickets arrived: “VMs are slow,” “backup takes longer,” “deploys are stuck.”
The on-call engineer did the obvious thing: delete more files. The pool got… more full. That’s when panic acquires a sense of humor.
The wrong assumption was simple: “Deleting files frees space.” On ZFS with snapshots, that’s only true when no snapshot references the deleted blocks.
Their VM dataset had hourly snapshots for 30 days. Every nightly cleanup deleted large log files and rotated VM images. The snapshots kept all the old blocks.
The fix wasn’t mystical. They measured usedbysnapshots, discovered it was the majority of the dataset, and shortened hourly retention from 30 days to 2 days.
They kept longer dailies and weeklies for point-in-time recovery. They also split high-churn VM images into per-VM datasets so one noisy tenant didn’t pin space for all.
The best part: restores improved. Instead of scrolling through a month of hourlies, they had a clear recent window for fast rollback and a longer window for audit needs.
Fewer snapshots. Better outcomes. The universe occasionally rewards basic accounting.
Mini-story 2: The optimization that backfired
A finance org wanted “more safety” and decided to snapshot every five minutes. They ran it on a busy database dataset without application coordination.
The snapshot job was fast. The replication wasn’t. The pool filled faster than their procurement cycle, which is the slowest storage tier known to science.
They then optimized: “Let’s keep the five-minute snapshots only for 24 hours, and keep dailies for a year.”
Reasonable on paper. In practice, their pruning script removed “intermediate” snapshots needed for incremental send chains to the DR site,
forcing frequent full sends. Bandwidth went up, replication lag grew, and the DR dataset started missing the very recovery points everyone thought they had.
The twist was that they were optimizing the wrong metric. Snapshot frequency wasn’t the problem; alignment was.
The source created snapshots at :00, :05, :10. The destination retention job pruned based on its local time and load, deleting anchors unpredictably.
The fix: they defined explicit replication anchors (hourly and daily) and retained those on both sides. Five-minute snapshots stayed local-only, short-lived.
DR kept hourlies for a few days and dailies for months. Replication became incremental again, predictable, and boring—exactly what you want from DR.
Joke #2: Nothing says “high availability” like discovering your DR plan depends on a cron job’s mood.
Mini-story 3: The boring but correct practice that saved the day
A healthcare company had strict change control. Before any major upgrade, they took a manual snapshot and applied a hold with a ticket ID in the tag.
It was procedural, slightly annoying, and universally mocked until it mattered.
During a storage firmware rollout, a misconfigured multipath setting caused intermittent I/O errors on a subset of hosts.
The application team rolled back code. The errors continued. The problem wasn’t software; it was storage pathing.
They needed to restore a set of configuration directories and a small database that had been “cleaned up” during troubleshooting.
Because the pre-upgrade snapshot was held, it survived the regular pruning cycle. The restore was surgical: clone the snapshot, extract the needed files,
and keep operations running while they fixed multipath.
It didn’t require genius. It required a boring discipline: snapshot + hold + documented owner. The hold prevented “helpful cleanup” from destroying the escape hatch.
They later removed the hold once the incident was closed, and the system returned to its normal retention policy without accumulating permanent barnacles.
Common mistakes: symptoms → root cause → fix
1) “We deleted old data but space didn’t come back”
Symptom: Dataset REFER shrinks, pool CAP doesn’t.
Root cause: Snapshots still reference the deleted blocks.
Fix: Measure usedbysnapshots, then prune snapshots according to policy. If you need the snapshots, split datasets to isolate churn.
2) “Destroying snapshots fails with ‘dataset is busy’ or similar dependency errors”
Symptom: zfs destroy refuses, or script skips snapshots.
Root cause: Holds or clones exist; the snapshot is an origin.
Fix: Use zfs holds and zfs get origin. Release holds if authorized; destroy or promote clones.
3) “Replication suddenly became huge or slow”
Symptom: Incremental sends balloon; replication lag grows.
Root cause: Missing common snapshots due to mismatched retention or pruning on one side.
Fix: Establish replication anchors retained on both sides. Validate with zfs list -t snapshot on source and destination.
4) “We have thousands of snapshots and no one knows what any of them are”
Symptom: Names are inconsistent, ownership unclear.
Root cause: No naming convention; no metadata; multiple tools creating snapshots.
Fix: Standardize names by class + UTC timestamp. Add user properties for owner and retention class. Stop unauthorized snapshot creators.
5) “Performance got worse as we added snapshots”
Symptom: Latency increases; pool feels ‘heavy’ under load.
Root cause: High pool utilization + fragmentation + churn from frequent snapshots amplifies CoW metadata and allocation costs.
Fix: Keep CAP under control (target < 80%). Reduce high-frequency retention on churny datasets. Consider dataset redesign and capacity expansion.
6) “Pruning runs, but snapshot counts only increase”
Symptom: Automation reports success, but old snapshots remain.
Root cause: Script filters don’t match names; snapshots held; permissions; or pruning operates on wrong dataset scope.
Fix: Validate by listing snapshots targeted for deletion before deleting. Add monitoring on snapshot count per class and holds count.
7) “Restores are unreliable or inconsistent for databases”
Symptom: Restore boots but DB needs recovery, or data is corrupted logically.
Root cause: Filesystem snapshot without application coordination.
Fix: Use DB-native mechanisms (freeze/flush, consistent checkpoints) around snapshot timing, or replicate using DB tools and snapshot the replicas.
Checklists / step-by-step plan
Step-by-step: implement a retention policy without creating a new incident
- Inventory datasets and classify by churn/value. Group into: databases, VM images, user data, logs, scratch. If it’s everything in one dataset, you’ve already found problem #1.
- Define RPO/RTO per class. Don’t let “we want everything forever” into the room unless they brought a purchase order.
- Pick retention tiers (hourly/daily/weekly/monthly) and counts per class.
- Adopt a naming scheme using UTC timestamps and the class prefix.
- Implement automation for snapshot creation with the correct scope and consistent naming.
- Implement automation for pruning (see the sketch after this checklist) that:
  - keeps the newest N per class
  - never deletes held snapshots
  - is clone-aware (or at least reports clones blocking deletion)
  - logs what it deleted and what it couldn’t
- Set capacity guardrails: alert at 75% and 80% pool usage; treat 85% as paging-worthy depending on workload.
- Test restores quarterly, not during incidents. Restore is a product feature. If you don’t test it, you don’t have it.
- Align replication with anchors and retention authority (source vs destination).
- Document “special snapshot” procedure: take snapshot, apply hold tag with ticket ID, set an expiry/owner, and require removal when done.
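A minimal pruning sketch under those constraints, assuming the naming scheme above; it keeps the newest N per class, reports held snapshots instead of fighting them, and logs failures instead of hiding them. Treat it as a starting point, not a replacement for battle-tested tooling.

#!/bin/sh
# Minimal pruning sketch: keep the newest N snapshots of one class on one dataset.
# Usage: prune.sh <dataset> <class> <keep>   e.g. prune.sh tank/home hourly 48
set -eu
ds="$1"; class="$2"; keep="$3"

list_class() {
  # Oldest first, names only, this dataset's snapshots only (-d 1).
  zfs list -H -t snapshot -o name -s creation -d 1 "$ds" | grep "@${class}-" || true
}

total="$(list_class | wc -l)"
excess=$(( total - keep ))
[ "$excess" -gt 0 ] || exit 0

list_class | head -n "$excess" | while read -r snap; do
  # Never delete held snapshots; report them so a human follows up.
  if [ -n "$(zfs holds -H "$snap")" ]; then
    echo "SKIP (held): $snap"
    continue
  fi
  # zfs destroy fails if the snapshot is a clone origin; log it, don't hide it.
  if zfs destroy "$snap"; then
    echo "DESTROYED: $snap"
  else
    echo "FAILED (clone origin or other dependency): $snap" >&2
  fi
done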
Operational checklist: the “pool is filling” day
- Confirm CAP and identify top datasets by usedbysnapshots.
- List largest snapshots by USED and check for holds/clones.
- Prune high-frequency snapshots first on churny datasets (hourlies), while preserving replication anchors.
- If space still won’t free, hunt clones and long-lived holds.
- If you’re near 90% CAP, plan emergency capacity relief (move data, expand pool) while pruning—don’t choose one and hope.
Design checklist: prevent snapshot hell structurally
- Separate high-churn paths into dedicated datasets.
- Disable atime where it’s not required.
- Decide which system is the “long retention” store (often the replication target).
- Use holds only with an owner and an expiry date.
- Monitor snapshot counts and snapshot space, not just pool CAP.
- Keep naming consistent across hosts to simplify tooling.
FAQ
Do snapshots slow down ZFS?
Not by existing. They slow you down indirectly by increasing retained blocks, which pushes pool usage up, increases allocation work, and can amplify fragmentation.
Also, more metadata and more complex block sharing can make some operations heavier.
Why does a snapshot’s USED number change over time?
Because USED is “unique blocks that would be freed if this snapshot were destroyed.” As the live dataset changes, the snapshot becomes the last reference to older blocks,
so its unique ownership can increase.
How many snapshots is too many?
The wrong question. The right question is: how much space do snapshots retain, and can you restore what you need inside your RPO/RTO?
You can have thousands of snapshots with low churn and be fine; you can have 200 snapshots on a high-churn VM dataset and be broke.
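If you want the number rather than the count, the parsable flags of zfs get make it a one-liner (assuming a pool named tank):

# Total space pinned by snapshots across the pool, in GiB.
# -p prints raw bytes; usedbysnapshots is per dataset, so the sum does not double count.
zfs get -Hp -o value usedbysnapshots -r tank | awk '{s += $1} END {printf "%.1f GiB\n", s / 1024 / 1024 / 1024}'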
Should I keep long retention on the source or the replication target?
Prefer long retention on the replication target. Keep the source lean for performance and for small incremental sends.
Keep explicit anchors on both sides to preserve incremental replication.
What’s the safest way to delete snapshots?
Safest means: you know why the snapshot exists, you know what depends on it (holds/clones/replication), and you delete according to a policy.
Start by listing snapshots by USED, check holds, check origin/clone relationships, then delete from oldest to newest within the class you’re pruning.
Can I snapshot a database safely without coordination?
Filesystem consistency isn’t the same as application consistency. Some databases can recover cleanly; others can restore into lengthy recovery or subtle logical issues.
Coordinate snapshots with the database (flush/freeze), or snapshot a replica that is designed for backups.
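One hedged illustration, assuming PostgreSQL with its data directory and WAL on the same tank/db dataset and a local psql superuser: a CHECKPOINT right before the snapshot does not make the snapshot more consistent, but it shortens crash recovery if you ever roll back to it. Anything stronger, like quiescing writes or snapshotting a dedicated backup replica, belongs to your database tooling.

# Hypothetical PostgreSQL coordination: flush dirty buffers, then snapshot atomically.
psql -U postgres -c "CHECKPOINT;"
zfs snapshot "tank/db@hourly-$(date -u +%Y-%m-%dT%H:%MZ)-pgfreeze"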
Why didn’t pruning free as much space as expected?
Because the snapshots you deleted didn’t uniquely own the blocks you thought they did, or other snapshots still reference the same blocks.
Also check for clones: a clone can retain blocks even after snapshot pruning.
Are holds a good long-term retention mechanism?
No. Holds are a guardrail for exceptional snapshots, not a policy engine. If you use holds as retention, you’ll accumulate permanent snapshots with unclear ownership,
and your pruning system becomes a liar.
What retention policy should I start with if I’m overwhelmed?
Start with: hourly 48, daily 35, weekly 12, monthly 18—then adjust by dataset churn. Put VM images and databases in separate datasets and shorten their local retention.
Make the replication target your long-term store.
How do I prevent tools from creating random snapshots?
Standardize snapshot automation in one place, limit administrative access, and monitor snapshot creation rates and naming compliance.
If snapshots appear with unknown prefixes, treat it as a configuration drift problem, not a curiosity.
Conclusion: practical next steps
Snapshot hell is a policy failure wearing a storage costume. ZFS gave you the mechanism; you still owe it governance.
If you want to stop living in fear of zfs destroy, do the work once and keep it boring.
- Measure: collect usedbysnapshots per dataset and alert on growth trends.
- Classify: split datasets by churn and value so one workload can’t bankrupt the pool.
- Standardize: adopt a naming scheme and add user properties for ownership and retention class.
- Automate: create and prune snapshots on schedule, with clone/hold awareness and real logging.
- Align replication: keep anchor snapshots on both sides; push long retention to the target.
- Practice restores: test the path that matters before you need it at 3 a.m.
Do this and snapshots become what they were always meant to be: a fast, reliable undo button—not an archaeological site.