Rollbacks are supposed to feel boring. You hit the button, your heart rate stays flat, and the post-incident writeup is a single sentence: “Restored from snapshot.”
In practice, the panic arrives earlier—at the moment you have to choose which snapshot is “the good one.” If your snapshot names look like a junk drawer, you’re not doing backups; you’re collecting surprises.
Why naming is a production control (not bikeshedding)
Snapshot naming is one of those topics that attracts two kinds of people: those who think it’s aesthetics, and those who have had to roll back a dataset while a VP watches the incident channel refresh. The second group is right.
A ZFS snapshot is a point-in-time view. It’s cheap to create, fast to list, and brutally literal: it captures blocks, not your intent. The only place your intent reliably lives is the snapshot name and the metadata around it (properties, holds, bookmarks). That’s why naming is an operational control—like labeling breakers in a data center. Sure, you can guess. You also can’t afford to.
Here’s the operational reality:
- You will take more snapshots than you plan.
- You will forget why you took at least 30% of them.
- You will need the right one under pressure.
- You will have at least one snapshot taken during a partial outage (and it will look “fresh”).
So the goal isn’t “pretty names.” The goal is to make the correct rollback the path of least resistance, and the wrong rollback annoyingly hard.
One dry truth: if your snapshot list doesn’t instantly tell you when, why, who/what, and whether it’s safe, you’re not running a snapshot system. You’re running a screenshot folder.
Joke #1: A snapshot without a naming scheme is like a password manager full of “password123”—technically functional, spiritually doomed.
Interesting facts and historical context
Some context helps because ZFS naming sits at the intersection of file systems, volume management, and operations culture. A few concrete facts:
- ZFS was designed as an end-to-end system: filesystem + volume manager + checksumming + snapshots, originally developed at Sun Microsystems in the early 2000s.
- Snapshot names are not global: they’re scoped to a dataset or zvol (e.g., pool/app@name). Two datasets can each have @daily and ZFS won’t care.
- Snapshots are writable only through clones: a snapshot is read-only; you can zfs clone it to get a writable dataset. That workflow shapes how you name “candidate restore” points.
- Sending snapshots is name-sensitive: incremental zfs send -i depends on a common ancestor snapshot. Messy naming makes ancestry ambiguous to humans even when ZFS can compute it.
- Snapshots are cheap, until they aren’t: they don’t duplicate blocks immediately, but they retain old blocks. A snapshot policy can quietly become your largest “consumer” of space.
- ZFS supports “holds”: you can apply a hold tag so a snapshot won’t be destroyed. This is basically your “legal hold” / “don’t delete, I mean it” mechanism.
- Bookmarks exist: they’re lightweight references to snapshot points for replication workflows, and they can reduce retention overhead when you need “replication anchors” without keeping full snapshots around.
- Names are part of your audit trail: ZFS itself won’t store your ticket number or reason unless you put it in the name or properties. Operations teams typically use both.
- There are multiple ZFS lineages: OpenZFS is the community-driven implementation used on illumos, FreeBSD, Linux, and more. Naming conventions are your portability layer across tooling differences.
What a good snapshot name does in the real world
A good snapshot name is an index entry. It should let you answer these questions without scrolling:
- When was it taken (timezone included, or unambiguous UTC)?
- What generated it (auto timer, CI job, manual operator, upgrade runbook)?
- Why (pre-upgrade, pre-migration, post-backup, incident capture)?
- What safety level (can this be deleted; should it be held; is it “golden”)?
- Which workflow it belongs to (replication, VM management, DB quiesce, app deploy)?
The naming system also must be:
- Sortable in chronological order.
- Stable across teams and automation tools.
- Safe to type under stress (no spaces, no shell footguns, no “clever” punctuation).
- Compatible with how you select snapshots in scripts (grep, awk, or ZFS property filters).
And it should anticipate failure modes. The big one: you’ll have multiple snapshots taken around the same event, but not all of them are equally good restore points. Example: database datasets. A “crash-consistent” snapshot might be fine, or it might replay journals for 20 minutes and freak everyone out. Your naming must distinguish “quiesced” from “not quiesced,” or your on-call will learn the difference live.
One paraphrased idea from Werner Vogels (Amazon CTO): Everything fails all the time; design and operate as if failure is normal.
The snapshot naming system: conventions that scale
Rule 0: Use UTC in snapshot names
Local time is for calendars. Restore operations are distributed systems problems: teams in multiple time zones, servers with drifting clocks, and logs already in UTC. Put UTC in the snapshot name and you remove a whole class of “wait, was that before the deploy?” arguments.
Format recommendation: YYYYMMDDThhmmssZ (ISO-ish, no colons). It sorts lexicographically and is obvious to humans.
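That token is trivial to produce in shell. A minimal sketch, assuming POSIX sh (the snap_ts helper name is my own, not a standard tool):

```shell
# Emit a UTC timestamp token like 20260204T010000Z.
# date -u formats in UTC regardless of the host's local timezone;
# the trailing Z is a literal character in the format string.
snap_ts() {
    date -u +%Y%m%dT%H%M%SZ
}

snap_ts
```

Big-to-small field order (year, month, day, then time) is exactly what makes plain lexicographic sorting equal chronological sorting.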
Rule 1: Separate “schedule” from “intent”
Many teams start with @daily, @hourly, @weekly. It’s cute until you need to restore “the snapshot from right before the schema migration.” Now you’re counting dailies. Don’t do that.
Instead, encode:
- cadence (hourly/daily/adhoc)
- intent (predeploy, preupgrade, postbackup, incident)
- creator (auto/manual/ci)
Rule 2: Use a predictable token order
Pick a canonical order and never deviate. Humans scan left-to-right. Scripts parse left-to-right. Your future self will be tired and unimpressed by creativity.
A practical, production-friendly pattern:
<prefix>.<cadence>.<creator>.<intent>.<ts>[.<ticket>][.<flags>]
Example names:
snap.hourly.auto.base.20260204T010000Z
snap.daily.auto.base.20260204T000000Z
snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
snap.adhoc.ci.predeploy.20260204T015912Z.PR4812
snap.adhoc.manual.incident.20260204T021100Z.INC7781.hold
snap.adhoc.auto.quiesced-db.20260204T020000Z
Notes:
- Prefix (snap) lets you filter your own snapshots from vendor tools.
- Dots are boring and shell-friendly. Avoid spaces. Avoid colons. Avoid “@prod:good”.
- Ticket IDs are optional but valuable. Don’t put sensitive info in names; they show up in logs, monitoring, and replication streams.
- Flags are small words like hold, gold, quiesced. Keep them from becoming prose.
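To keep the token order canonical, generate names with a wrapper instead of trusting tired humans to type them. A sketch under the pattern above (snap_name and its validation rules are illustrative, not a standard tool):

```shell
# Build a snapshot name from canonical tokens:
#   snap.<cadence>.<creator>.<intent>.<ts>[.<ticket>]
# Rejects tokens containing anything but alphanumerics and hyphens,
# because dots are the separator and spaces are shell footguns.
snap_name() {
    cadence=$1 creator=$2 intent=$3 ts=$4 ticket=$5
    for tok in "$cadence" "$creator" "$intent" "$ts" $ticket; do
        case "$tok" in
            *[!A-Za-z0-9-]*) echo "bad token: $tok" >&2; return 1 ;;
        esac
    done
    name="snap.$cadence.$creator.$intent.$ts"
    [ -n "$ticket" ] && name="$name.$ticket"
    echo "$name"
}

snap_name adhoc manual preupgrade 20260203T221530Z CHG12345
# -> snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
```

The wrapper is the enforcement point: if every creator (cron, CI, humans) goes through it, the token order can never drift.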
Rule 3: Put the “specialness” at the end, but keep timestamp before it
People want to tack on “-final” or “-good”. If you do that before the timestamp, sorting breaks. If you do it after, sorting stays chronological and you still get your hint.
Good: snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345.gold
Bad: snap.adhoc.manual.gold.preupgrade.CHG12345.20260203T221530Z
Rule 4: Encode quiescence explicitly for apps that care
There are datasets where crash-consistency is fine (static content, most VM disks if guest journaling is healthy). Then there are datasets where it’s a roulette wheel (busy databases, certain message queues, certain legacy apps). For those, your naming system should include a token like:
- quiesced-db (app confirmed flush / fsfreeze / coordinated)
- crash (no coordination; restore may require recovery)
This is not a moral judgment. It’s risk labeling.
Rule 5: Use ZFS properties for machine-readable metadata (and names for humans)
Snapshot names are visible. Properties are queryable. Use both. A practical pattern:
- Name carries cadence/intent/timestamp and maybe a ticket.
- Properties carry structured metadata like com.company:reason, com.company:owner, com.company:expiry.
Why both? Because humans pick snapshots by name during incidents, and automation needs reliable selectors that don’t depend on parsing a string you’ll inevitably “improve” someday.
Rule 6: Plan for replication and retention from day one
If you replicate snapshots, your names are part of the protocol between sites. A naming system that isn’t stable becomes a replication outage disguised as a “cleanup.”
Practical recommendation:
- Keep automated cadence snapshots on a strict pattern: easy to match, easy to expire.
- Keep “event” snapshots (preupgrade/incident) on a different pattern and protect them with holds.
- Use bookmarks for replication anchors when appropriate, but name them with the same timestamp discipline.
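To make “easy to match, easy to expire” concrete, here is a sketch of an expiry selector that can only ever match cadence snapshots. The expirable helper is illustrative: it filters names on stdin and deliberately does not call zfs, so the dangerous part stays separate.

```shell
# Select automated cadence snapshots eligible for expiry.
# The anchored pattern matches only snap.<hourly|daily>.auto.base.<ts>,
# so event snapshots (preupgrade, incident, ...) can never be selected.
expirable() {
    grep -E '^[^@]+@snap\.(hourly|daily)\.auto\.base\.[0-9]{8}T[0-9]{6}Z$'
}

printf '%s\n' \
    'tank/app@snap.hourly.auto.base.20260203T230000Z' \
    'tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345' \
    'tank/app@snap.daily.auto.base.20260204T000000Z' | expirable
```

The event snapshot falls through not because a human remembered it, but because it structurally cannot match the cadence pattern.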
Practical tasks: commands, outputs, and decisions (12+)
Task 1: List snapshots in a dataset, sorted by creation time
cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation tank/app
NAME CREATION
tank/app@snap.daily.auto.base.20260203T000000Z                     Tue Feb 3 00:00 2026
tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345    Tue Feb 3 22:15 2026
tank/app@snap.hourly.auto.base.20260203T230000Z                    Tue Feb 3 23:00 2026
What the output means: You have creation timestamps from ZFS itself, not inferred from the name. That matters when clocks drift or someone manually renames patterns.
Decision: If the ZFS creation time doesn’t align with the timestamp token in the name, you have a process problem (or a time sync problem). Fix NTP and your automation before trusting timestamps for incident work.
Task 2: Show space held by snapshots (why “cheap” becomes “expensive”)
cr0x@server:~$ zfs list -o name,used,refer,usedbysnapshots tank/app
NAME USED REFER USEDBYSNAPSHOTS
tank/app 850G 620G 210G
What the output means: USEDBYSNAPSHOTS is space retained because snapshots reference old blocks.
Decision: If USEDBYSNAPSHOTS is high and your pool is tight on free space, you should expire snapshots (preferably by policy and not by panic). If it’s high on a high-churn dataset, revisit cadence and retention immediately.
Task 3: Find the largest snapshots (useful for targeted cleanup)
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -S used tank/app
NAME USED REFER
tank/app@snap.hourly.auto.base.20260203T230000Z 18G 620G
tank/app@snap.hourly.auto.base.20260203T220000Z 14G 615G
tank/app@snap.adhoc.manual.incident.20260203T210100Z.INC7781.hold 9G 610G
What the output means: Snapshot USED is how much unique space the snapshot keeps alive compared to the live dataset.
Decision: Large hourly snapshots imply heavy churn. Consider lowering frequency, splitting datasets to isolate churn, or ensuring the app writes to a dedicated dataset with a different retention policy.
Task 4: Verify a snapshot exists before you do something irreversible
cr0x@server:~$ zfs list -t snapshot tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
NAME
tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
What the output means: If it prints the name, it exists. If it errors, your restore plan is already wrong.
Decision: If you can’t positively identify the snapshot, stop. Don’t “approximate” with the nearest daily. That’s how you rollback to the wrong schema.
Task 5: Create an “event snapshot” with a disciplined name
cr0x@server:~$ ts=$(date -u +%Y%m%dT%H%M%SZ); zfs snapshot tank/app@snap.adhoc.manual.preupgrade.${ts}.CHG12345
What the output means: No output is success. ZFS is like that.
Decision: Immediately follow with a hold for high-risk changes (next task). If you don’t, a cleanup job can delete the one snapshot you actually care about.
Task 6: Protect a snapshot with a hold tag (make deletion fail on purpose)
cr0x@server:~$ zfs hold keep tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cr0x@server:~$ zfs holds tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
NAME TAG TIMESTAMP
tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345 keep Tue Feb 3 22:16 2026
What the output means: The snapshot has a hold tag keep. It cannot be destroyed until holds are released.
Decision: For incident snapshots and pre-upgrade snapshots, holds are cheap insurance. Use them. Then create a process to release them when the change is proven safe.
Task 7: Attach structured metadata via user properties (queryable later)
cr0x@server:~$ zfs set com.acme:reason="preupgrade app 7.4" tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cr0x@server:~$ zfs set com.acme:ticket="CHG12345" tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cr0x@server:~$ zfs get -H -o property,value com.acme:reason,com.acme:ticket tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
com.acme:reason preupgrade app 7.4
com.acme:ticket CHG12345
What the output means: Snapshot now carries machine-readable metadata you can filter on without parsing the name.
Decision: If you run automation, pivot your scripts to properties for selection and keep names human-friendly. It reduces brittle parsing logic.
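A sketch of what property-based selection looks like in practice, using a canned sample of zfs get -H output so it runs without a pool. The com.acme:ticket property and the by_ticket helper follow the examples above; both are illustrative, not a standard.

```shell
# zfs get -H prints tab-separated fields: name, property, value, source.
# Selecting by property avoids parsing snapshot names in automation.
# The canned sample stands in for live output of something like:
#   zfs get -H -r -t snapshot -o name,property,value com.acme:ticket tank/app
sample=$(printf 'tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345\tcom.acme:ticket\tCHG12345\ntank/app@snap.hourly.auto.base.20260203T230000Z\tcom.acme:ticket\t-')

# Print snapshot names whose com.acme:ticket equals the given value.
by_ticket() {
    awk -F '\t' -v t="$1" '$2 == "com.acme:ticket" && $3 == t { print $1 }'
}

printf '%s\n' "$sample" | by_ticket CHG12345
# -> tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
```

Note the comparison is an exact field match, not a substring grep: a ticket like CHG123 will never accidentally select CHG12345.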
Task 8: Roll back a dataset (and understand what you’re overwriting)
cr0x@server:~$ zfs rollback tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cannot rollback to 'tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345': more recent snapshots or bookmarks exist
use '-r' to force deletion of the following snapshots and bookmarks:
tank/app@snap.hourly.auto.base.20260203T230000Z
tank/app@snap.daily.auto.base.20260204T000000Z
What the output means: zfs rollback has no dry-run flag, but it refuses to rewind past newer snapshots unless you pass -r. That refusal is your free dry run: it lists exactly what a forced rollback would destroy. If you need those newer snapshots, preserve them by cloning before you force anything.
Decision: If a forced rollback would destroy snapshots you need for replication or audit, stop and consider cloning instead (Task 9). Reading this refusal output before reaching for -r is non-negotiable in production.
Task 9: Clone a snapshot instead of rolling back (safer for forensics)
cr0x@server:~$ zfs clone tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345 tank/app-restore
cr0x@server:~$ zfs list -o name,used,refer,mountpoint tank/app-restore
NAME USED REFER MOUNTPOINT
tank/app-restore 1G 620G /tank/app-restore
What the output means: The clone is writable and initially small (USED) because it shares blocks with the snapshot.
Decision: For incident response, cloning is often the right first move. You can validate data, extract files, or run application checks without destroying the current state.
Task 10: Send/receive with clear ancestry (replication that doesn’t guess)
cr0x@server:~$ zfs send -nvP -i tank/app@snap.daily.auto.base.20260203T000000Z tank/app@snap.daily.auto.base.20260204T000000Z
send from @snap.daily.auto.base.20260203T000000Z to tank/app@snap.daily.auto.base.20260204T000000Z estimated size is 12.3G
total estimated size is 12.3G
What the output means: Dry-run send (-n) with verbose, parsable size estimates (-v, -P). It confirms the incremental base snapshot exists and shows the delta size.
Decision: If the estimated size is unexpectedly huge, it’s a churn signal (or compression/encryption changes). Recheck what changed between snapshots before you flood a WAN link.
Task 11: Use bookmarks to keep replication anchors lightweight
cr0x@server:~$ zfs bookmark tank/app@snap.daily.auto.base.20260204T000000Z tank/app#bmark.daily.20260204T000000Z
cr0x@server:~$ zfs list -t bookmark -o name,creation tank/app
NAME CREATION
tank/app#bmark.daily.20260204T000000Z Wed Feb 4 00:00 2026
What the output means: A bookmark references the point in the dataset history without keeping full snapshot retention semantics.
Decision: If retention pressure is high, bookmarks can help replication workflows while letting you delete older snapshots. But don’t treat bookmarks as restore points for humans; they’re not mounts.
Task 12: Verify holds before cleanup (why destroy sometimes “fails”)
cr0x@server:~$ zfs destroy tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cannot destroy snapshot tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345: snapshot has holds
cr0x@server:~$ zfs holds tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
NAME TAG TIMESTAMP
tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345 keep Tue Feb 3 22:16 2026
What the output means: Holds are doing their job: preventing deletion.
Decision: If this snapshot is still required, leave it. If it’s safe to delete, release the hold explicitly with a change record (Task 13).
Task 13: Release a hold and then destroy (make it deliberate)
cr0x@server:~$ zfs release keep tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
cr0x@server:~$ zfs destroy tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
What the output means: No output: success.
Decision: Releasing holds should be rarer than setting them. If you release holds casually, you’ll eventually delete the snapshot you needed. Treat holds like break-glass controls.
Task 14: Find snapshots by intent using name patterns (human triage)
cr0x@server:~$ zfs list -t snapshot -o name -s creation tank/app | grep preupgrade | tail -5
tank/app@snap.adhoc.manual.preupgrade.20260110T210002Z.CHG12111
tank/app@snap.adhoc.manual.preupgrade.20260124T210011Z.CHG12202
tank/app@snap.adhoc.manual.preupgrade.20260203T221530Z.CHG12345
What the output means: Your naming system makes intent searchable under pressure.
Decision: If you need complicated regex to find “the last preupgrade snapshot,” your names are too clever. Simplify tokens.
Task 15: Confirm dataset mountpoints before restore actions
cr0x@server:~$ zfs get -o name,property,value mountpoint,canmount tank/app tank/app-restore
NAME PROPERTY VALUE
tank/app mountpoint /srv/app
tank/app canmount on
tank/app-restore mountpoint /tank/app-restore
tank/app-restore canmount on
What the output means: You see where things will mount and whether they will auto-mount.
Decision: Before a rollback, verify you won’t accidentally mount a clone on top of production paths. For test restores, set a safe mountpoint or canmount=noauto.
Task 16: Check pool health before blaming snapshots for slowness
cr0x@server:~$ zpool status -x
all pools are healthy
What the output means: ZFS doesn’t see errors or degraded vdevs. That’s the baseline.
Decision: If the pool is not healthy, snapshot operations may be slow because the system is already struggling. Fix hardware or resilver first; don’t tune your way out of a degraded pool.
Joke #2: The only thing scarier than “we have no snapshots” is “we have 40,000 snapshots and nobody knows which ones are real.”
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption (“@daily means safe”)
At a mid-sized SaaS company, the storage team had a simple policy: hourly snapshots for critical datasets, daily for everything else. Names were short: @hourly, @daily. The automation kept only a week of hourlies and a month of dailies. The team felt responsible. They were. Just not for the right thing.
A backend deploy went sideways and corrupted a subset of user metadata. The incident commander asked for a restore to “yesterday midnight.” The on-call found the @daily snapshot, rolled back the dataset, and watched the app restart. It came up. Then it started emitting a new kind of error: missing records referenced by newer files. The rollback had restored the dataset, but not the system.
The wrong assumption was that “daily” implied “known good.” In reality, the snapshot was crash-consistent and captured mid-ingestion. The application expected a coordinated checkpoint between two datasets: one holding metadata, one holding blobs. The blob dataset had its own @daily snapshot, taken a few minutes later. The rollback restored one side of the relationship, not both. Congratulations, you invented data inconsistency with excellent confidence.
The fix wasn’t a heroic new tool. It was a naming and policy change: snapshots for multi-dataset applications were taken in an orchestrated batch with a shared timestamp token, and the name included quiesced only when the app confirmed it. The runbook was updated to restore a consistent set by matching the timestamp across datasets. Suddenly “which snapshot?” stopped being a debate and became a filter.
Mini-story 2: The optimization that backfired (snapshot frequency as a latency amplifier)
A financial services team ran ZFS on a fleet serving VM disks. They had a compliance requirement to retain frequent restore points. Someone got ambitious: “If snapshots are cheap, we should take them every five minutes.” Automation was updated. It worked. For a while.
Two months later, performance complaints became routine. Not catastrophic, just constant: periodic latency spikes, especially during peak write hours. Everyone blamed “the hypervisor” or “the network.” Storage graphs showed something more boring: bursts of metadata work around snapshot times, increased write amplification from retained blocks, and pools running closer to full because retention had silently expanded.
The backfire was subtle: the team optimized for RPO while ignoring the cost curve of churn. High-frequency snapshots on high-write VM datasets don’t just create more snapshot entries; they retain more divergent block histories. Combined with a pool that hovered near a high utilization threshold, allocation got harder, fragmentation got worse, and the system had fewer good options. The five-minute policy didn’t just add work; it magnified the worst-case behavior of a stressed pool.
The recovery plan was pragmatic: reduce frequency for high-churn datasets, split “OS disks” and “data disks” into separate datasets with different retention, and enforce a free-space floor. Snapshot naming helped here too: the team could accurately identify and expire the five-minute cadence (snap.5min.auto.base...) without touching event snapshots or replication anchors.
They still met compliance. They just stopped pretending the storage subsystem was a wish-granting machine.
Mini-story 3: The boring-but-correct practice that saved the day (dry runs + holds)
A healthcare platform had a culture of “do the boring thing first.” Their ZFS runbooks mandated two steps before any rollback: attempt the rollback without -r first, so ZFS refuses and lists the newer snapshots a forced rollback would destroy, and confirm holds on any snapshot associated with a change ticket. Engineers grumbled. It felt like paperwork with extra typing.
During a routine upgrade window, an engineer needed to restore a dataset after a migration tool wrote unexpected values. They picked what looked like the right snapshot: same day, close timestamp. The rollback attempt refused and listed several newer snapshots, used as replication incrementals, that a forced rollback would destroy. That’s not just data loss; it’s a replication backlog turning into a weekend.
The engineer paused. They cloned instead, validated the cloned dataset’s application state, and then chose a different snapshot that preserved the replication chain. The boring step avoided a second incident: failed replication followed by a frantic re-seed.
Later, cleanup automation tried to prune “adhoc” snapshots older than a threshold. It couldn’t delete the pre-upgrade ones because holds were set with a standard tag. Nobody had to remember. The system did. The upgrade was rolled forward properly, holds were released after verification, and the incident ticket was closed with no drama.
Operational maturity often looks like this: fewer hero moments, more predictable outcomes.
Fast diagnosis playbook: find the bottleneck quickly
When snapshot operations feel slow—or replication is crawling—people tend to blame “ZFS snapshots” as if they’re a daemon with moods. Don’t do folklore. Do triage. Here’s a fast, high-signal sequence.
First: is the pool sick?
- Check: zpool status -x
- Why: degraded vdevs, resilvers, checksum errors, or offline devices will dominate everything else.
- Decision: If not healthy, fix that first. Tuning is a waste while your pool is limping.
Second: is the pool too full?
- Check: zpool list and dataset usedbysnapshots
- Why: high utilization reduces allocation flexibility; snapshots retain blocks; both can spike latency.
- Decision: If you’re approaching your operational free-space floor, expire snapshots (safely), add capacity, or reduce churn. Pick at least one today.
Third: is churn coming from one dataset?
- Check: zfs list -o name,used,usedbysnapshots across datasets; find offenders.
- Why: one log dataset can dominate snapshot deltas and replication volume.
- Decision: Split datasets by write profile. Stop snapshotting ephemeral logs at high frequency unless you enjoy paying for retained garbage.
Fourth: is replication slow because the deltas are huge?
- Check: zfs send -nvP -i between snapshots.
- Why: You can’t optimize your way out of a 10x delta unless you change what’s writing.
- Decision: If deltas are unexpectedly large, investigate app behavior, recordsize mismatches, compression changes, and snapshot frequency.
Fifth: are you suffering from “snapshot pile-up”?
- Check: snapshot count with zfs list -t snapshot | wc -l (rough, but quick).
- Why: very large snapshot counts increase management overhead and human error probability.
- Decision: If counts are runaway, fix retention automation and naming patterns so you can expire safely without regex roulette.
Common mistakes: symptoms → root cause → fix
Mistake 1: “We can’t tell which snapshot is safe to restore”
Symptoms: Long incident threads arguing about timestamps; multiple partial restores; fear of rollback.
Root cause: Names encode cadence only (@daily) but not intent, quiescence, or workflow grouping.
Fix: Adopt a tokenized naming scheme with UTC timestamps and intent markers (preupgrade, predeploy, incident, quiesced-db). For multi-dataset apps, take coordinated snapshots sharing the same timestamp token across all datasets.
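A sketch of such a coordinated run, assuming POSIX sh. The dataset names and the snap_set helper are illustrative; zfs snapshot does accept multiple snapshot arguments in one invocation and creates them atomically.

```shell
# Coordinated snapshot run for a multi-dataset application:
# one shared UTC timestamp token across every dataset, so a
# consistent set can later be restored by matching the timestamp.
# DRYRUN=1 prints the command instead of executing it (useful
# here, since this sketch runs without a pool).
snap_set() {
    ts=$(date -u +%Y%m%dT%H%M%SZ)
    set --
    for ds in tank/app-meta tank/app-blobs; do
        set -- "$@" "$ds@snap.adhoc.auto.quiesced-db.$ts"
    done
    if [ "${DRYRUN:-0}" = 1 ]; then
        # One zfs snapshot invocation creates all of them atomically.
        echo "zfs snapshot $*"
    else
        zfs snapshot "$@"
    fi
}

DRYRUN=1
snap_set
```

The timestamp is computed once, outside the loop; that single assignment is the entire “coordination” mechanism the runbook relies on.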
Mistake 2: “Cleanup deleted the one snapshot we needed”
Symptoms: Event snapshots vanish; postmortem includes “we thought it would be there.”
Root cause: No holds; cleanup scripts match “adhoc” broadly; no separation between retention classes.
Fix: Apply holds to change/incident snapshots by default. Split retention: snap.hourly.auto.base expires aggressively; snap.adhoc.manual.preupgrade is held until explicit release.
Mistake 3: “Rollback would destroy newer snapshots and replication breaks”
Symptoms: Dry-run shows snapshot destruction; replication pipelines fail after rollback; need to re-seed.
Root cause: Rolling back instead of cloning; misunderstanding that rollback rewinds dataset history and discards newer snapshots.
Fix: Always attempt zfs rollback without -r first; ZFS refuses and lists the snapshots a forced rollback would destroy. Prefer cloning for recovery validation and forensics. If you must rollback, plan replication re-anchoring explicitly and preserve required snapshots.
Mistake 4: “Snapshot space usage keeps growing even when we delete files”
Symptoms: Dataset USED doesn’t drop; pool stays full; deleting data doesn’t buy space.
Root cause: Snapshots retain old blocks, so deletions in the live dataset don’t free space until snapshots are destroyed.
Fix: Identify large snapshots (zfs list -t snapshot -s used). Reduce retention on high-churn datasets. Consider dataset layout changes to isolate churn sources.
Mistake 5: “Names are inconsistent across tools; scripts miss snapshots”
Symptoms: Some snapshots use underscores, some use dashes; timestamps differ; retention scripts skip some; incident snapshots aren’t found by grep.
Root cause: Multiple snapshot creators (cron, CI, admins) with no enforced convention.
Fix: Define one canonical token order and separators; enforce via wrapper scripts or policy. Add properties for structured metadata so automation can select by property, not string parsing.
Mistake 6: “We snapshot too often and performance gets spiky”
Symptoms: Latency spikes at snapshot times; replication deltas are huge; pool utilization climbs.
Root cause: High-frequency snapshots on high-churn datasets; retention not tuned; pool too full.
Fix: Reduce frequency for high-churn datasets; split datasets by write profile; enforce a free-space floor; verify deltas with zfs send -nvP before pushing changes globally.
Checklists / step-by-step plan
Checklist A: Adopt a naming system without breaking everything
- Inventory snapshot creators: cron jobs, CI pipelines, vendor tooling, humans. If you don’t know who’s creating snapshots, you don’t control naming.
- Choose a canonical name pattern: e.g., snap.<cadence>.<creator>.<intent>.<ts>.
- Standardize UTC timestamp formatting: YYYYMMDDThhmmssZ. No exceptions.
- Define retention classes: hourly.auto.base (short), daily.auto.base (medium), adhoc.*.preupgrade and adhoc.*.incident (held).
- Decide on quiescence tokens: at minimum, quiesced-db vs crash.
- Add properties for metadata: reason, ticket, expiry, owner.
- Update runbooks: restore by timestamp group for multi-dataset apps; clone-first policy for high-risk restores.
- Roll out gradually: start with new snapshots. Don’t rename old ones in place unless you’re sure tooling won’t break.
Checklist B: “I need to restore now” (operator steps)
- Confirm pool health: zpool status -x. If unhealthy, expect delays and risk.
- Identify the exact dataset(s) involved. Don’t restore a parent dataset if only a child is impacted.
- List snapshots sorted by creation and find the intent token (predeploy/preupgrade/incident).
- Verify quiescence for DB-like datasets. Choose quiesced-db when available.
- Check the blast radius: attempt zfs rollback without -r and read which newer snapshots it refuses to destroy.
- Prefer clone-first for validation. Mount it safely and verify application checks.
- Rollback only when you understand blast radius and you’ve communicated it (replication, newer snapshots, dependent services).
- After restore: take a new snapshot with postrestore intent and hold it briefly. It gives you a stable “known state” if follow-up changes go wrong.
Checklist C: Snapshot hygiene (weekly housekeeping)
- Review snapshot space usage: find top datasets by usedbysnapshots.
- Spot runaway cadence: too many snapshots per dataset usually means automation drift.
- Expire by class: delete old hourly.auto.base first; do not touch held event snapshots.
- Audit holds: if holds never get released, your “temporary” becomes permanent. Add an expiry property and a review loop.
- Test restores: not just “can we list snapshots?”—actually clone and validate data. Your naming system is only proven when it helps a restore.
FAQ
1) Should snapshot names include the dataset name?
No. The dataset is already part of the full snapshot name (tank/app@...). Duplicating it makes names longer and less scannable. Encode intent and time instead.
2) Can I rename a ZFS snapshot?
Yes, ZFS supports snapshot renaming in many implementations, but it can have operational consequences (replication expectations, scripts, monitoring). Treat renames as a migration task with testing, not as casual cleanup.
3) Why not just use @hourly and keep it simple?
Because “hourly” answers only one of your real questions. Under incident pressure you need “hourly of what intent, taken by what system, and is it quiesced?” Simplicity that removes meaning is not simplicity; it’s deferred pain.
4) How do I group snapshots across multiple datasets for a consistent restore?
Use a shared timestamp token across all datasets in the coordinated snapshot run. Then restore by matching that timestamp. Example: every dataset involved gets ...20260204T020000Z.... This is where UTC timestamp discipline pays rent.
5) Are snapshot names used by zfs send for correctness?
ZFS correctness depends on snapshot lineage, not the human-readable meaning of the name. But humans operate the system. Good naming prevents you from sending the wrong incrementals or selecting the wrong base under stress.
6) Should we put ticket numbers in snapshot names?
Usually yes, for change and incident snapshots. It’s lightweight traceability. Keep it short and non-sensitive. If ticket IDs are messy, store them in properties and keep names clean.
7) How many snapshots is “too many”?
There isn’t one magic number. The practical limit is where management overhead, retention space, and human selection error become problems. If listing snapshots is slow, cleanup is risky, or operators routinely pick the wrong one, you’re already over the line.
8) What’s the difference between a snapshot hold and just “don’t delete it”?
A hold makes deletion fail at the filesystem level, even if a cleanup script runs with good intentions. “Don’t delete it” is a hope. Holds are a control.
9) Should we keep “golden” snapshots forever?
Rarely. Long-lived snapshots can accumulate retained blocks and increase space pressure, especially on high-churn datasets. If you need long-term retention, replicate to a backup target and keep fewer local snapshots, or move the archive data to a dataset designed for it.
10) Do snapshot names need to be identical across environments (dev/stage/prod)?
The pattern should be identical. The dataset paths will differ. Consistent naming patterns help reuse runbooks and automation and reduce “works in staging” surprises.
Conclusion: next steps that actually reduce stress
If you want stress-free rollbacks, stop treating snapshots as a pile of points-in-time and start treating them as a system: naming, metadata, retention, and operator workflows. The best naming convention is the one your on-call can use half-awake without improvising.
Practical next steps:
- Pick the token order (snap.cadence.creator.intent.ts) and enforce UTC timestamps.
- Separate retention classes: base cadence vs event snapshots. Use holds for events.
- Add properties for reason/ticket/expiry so automation can be robust.
- Update the restore runbook to: list → dry-run → clone-first → validate → rollback only when understood.
- Run one restore drill this month using your new names. If the drill is confusing, the naming system is lying to you.
Snapshots are one of ZFS’s superpowers. Naming is how you aim it.