Thin provisioning is a budget meeting disguised as a storage feature. Everything looks great until reality files an expense report.
With ZFS sparse volumes (thin zvols), you can hand out more “disk” than you physically have—on purpose. That’s not inherently bad.
What’s bad is doing it casually, then acting surprised when a pool hits 100% and the blast radius looks like a datacenter-wide trust exercise.
This is a practical guide to the overcommit trap: what actually runs out first, what breaks, what the signals look like,
and exactly what to check when you smell smoke. You’ll get concrete commands, what the outputs mean, and the decisions you make from them.
What “sparse volumes” really mean in ZFS
In ZFS, a zvol is a block device backed by a dataset. You use it for VMs, iSCSI LUNs, databases that demand raw-ish block devices,
or any system that thinks in blocks, not files. A “sparse” zvol is thin-provisioned: its volsize (the size you present) can be larger than
the space currently allocated in the pool.
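At creation time, the difference is a single flag. A minimal sketch, assuming the same pool layout as the examples later in this article (the dataset names here are illustrative):

cr0x@server:~$ sudo zfs create -V 500G tank/vmstore/vm-demo-thick      # classic zvol: refreservation is set automatically, space is guaranteed
cr0x@server:~$ sudo zfs create -s -V 500G tank/vmstore/vm-demo-sparse  # sparse zvol: no refreservation, the 500G is only a promise

The -s flag is the entire thin-provisioning decision; the rest of this article is about living with it.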
Here’s the key: sparse does not mean free. It means “allocated on write.” The pool is still the pool. If consumers write enough,
ZFS must allocate real blocks. When it can’t, it returns errors. Those errors propagate to your guest filesystem, your database, or your hypervisor,
usually in the least polite way possible.
Why do people like sparse zvols? Because the alternative is pessimism: pre-allocating full capacity for every VM or LUN means you buy disks early, then
spend years celebrating the empty space you paid for. Thin provisioning is pragmatic. The problem is that it’s also a promise you can break.
The two numbers you must separate in your head
- Logical size: what you told the client it can use (volsize).
- Physical allocation: what ZFS has actually allocated in the pool for that zvol (plus metadata, parity, copies, etc.).
Overcommit happens when the sum of logical sizes exceeds physical capacity. It’s not immediately wrong.
It becomes wrong when write patterns (or the calendar) make that logical promise come due.
The overcommit trap: why it fails so abruptly
Thin provisioning fails like this: everything seems fine, then suddenly it isn’t. That’s because storage consumption is lumpy.
Backups, weekly batch jobs, log bursts, VM snapshot chains, database compaction, reindexing, and “we migrated some stuff” all allocate real blocks fast.
Pools that sit at 70% for months can jump to 95% in a day, then to 100% in an hour.
ZFS does not handle space exhaustion gently. Near-full pools bring fragmentation, allocator pain, and metadata pressure.
When the pool is truly out, you get ENOSPC and then cascading failures: VM disks go read-only, databases panic, or the hypervisor can’t write
logs needed to recover. Thin provisioning is the accelerant; the real fire is operational complacency.
There’s also a subtlety: even if the pool shows “some space free,” you can still fail allocations due to slop space,
fragmentation, recordsize/volblocksize effects, and metadata that must be written to stay consistent.
Free space is not one number. It’s “free space that is allocatable in the required shape and class.”
Joke #1: Thin provisioning is like ordering pants “one day I’ll fit into.” The day arrives at 2 a.m., and your belt is the pager.
What breaks first in real life
- Guest filesystems see I/O errors. Journaling helps until it doesn’t.
- Databases get partial writes or can’t extend files, then go into crash recovery at the worst time.
- Hypervisors may pause VMs, mark disks failed, or fail snapshots.
- Replication may stall because it can’t write receive buffers or new snapshots.
- ZFS itself struggles to free space if you need to delete snapshots but can’t write metadata updates reliably.
A pool going over ~80–85% isn’t a moral failure. It’s a signal. You can run higher if you know the workload and you’re disciplined.
But with thin provisioning, discipline means monitoring and headroom, not optimism and vibes.
Interesting facts and context (the stuff that explains the weirdness)
- ZFS introduced pooled storage as a first-class idea, so “filesystem full” became “pool full.” That shifts failure domains from a single mount to everything.
- Thin provisioning was popularized in enterprise SANs long before ZFS zvols were common; ZFS inherited both the benefits and the sharp edges.
- Space accounting in ZFS is intentionally conservative in several places (slop space, metadata reservations) to prevent catastrophic deadlocks.
- Copy-on-write means overwrites allocate new blocks first, then free old ones later. Near-full pools punish “in-place updates” workloads.
- Snapshots don’t “use space” when created, but they pin old blocks. Deletes and overwrites become new allocations instead of reuse.
- Compression changes the overcommit math because logicalused can be far larger than physical used—until the data becomes incompressible (hello, encrypted backups).
- RAIDZ parity overhead is workload-shaped: small random writes can amplify allocation due to padding and record alignment (varies by implementation and ashift).
- Special allocation classes (special vdevs) can move metadata/small blocks off spinning disks, but they create another capacity cliff if undersized.
- Discard/TRIM support took time to mature in the virtualization ecosystem; guests not issuing discard means ZFS can’t learn that blocks are free.
Mechanics that matter: volsize, referenced, logicalused, and the blocks you can’t avoid
Sparse zvols: the properties that decide your fate
ZFS gives you a handful of levers. Most people only pull one (create sparse) and ignore the rest. That’s how you end up with a “mysteriously” full pool.
- volsize: the exposed size of the block device. This is the promise.
- refreservation: guaranteed space for the dataset (zvol). This is the “we will not overcommit this portion” lever.
- reservation: similar concept for filesystems; for zvols you typically use refreservation.
- volblocksize: allocation granularity for the zvol; impacts performance and space efficiency.
- compression: can extend headroom dramatically—until your data isn’t compressible.
- sync: how ZFS handles synchronous writes; changing it can change performance and failure behavior.
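To see all of these levers for one zvol at a glance, a single zfs get is enough (shown against one of the example volumes used later; adjust the name):

cr0x@server:~$ zfs get volsize,refreservation,reservation,volblocksize,compression,sync tank/vmstore/vm-101-disk-0

One line per property; if refreservation says none, you are looking at a sparse zvol.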
Space accounting terms you must be able to read quickly
For zvols, USED and friends in zfs list are easy to misunderstand. Add these columns and the world gets clearer:
- used: actual space consumed in the pool (including metadata and copies, depending on view).
- referenced: space accessible by this dataset at its head (not counting snapshots).
- logicalused: uncompressed logical bytes referenced (useful for compression ratio reality checks).
- logicalreferenced: logical bytes referenced at the dataset head.
- usedbysnapshots: snapshot-pinned space; the “why isn’t delete freeing space” culprit.
The part nobody budgets: metadata, fragmentation, and “space you can’t spend”
ZFS needs room to breathe. When you get close to full, the allocator has fewer choices, metaslabs get fragmented,
and writes get slower. That slowdown itself can trigger more writes: timeouts, retries, log growth, crash recovery files.
It’s a feedback loop with a sense of humor you won’t share.
One relevant reliability idea comes from engineering culture rather than ZFS specifics:
Hope is not a strategy.
(paraphrased idea often attributed to operations and reliability training)
If you run sparse zvols without reservations, you’re running a system where hope is literally the capacity model.
You can do better.
Monitoring signals that actually predict failure
Monitoring thin provisioning is not “alert when pool is 90%.” That’s one alert, and it arrives late.
You want a small set of signals that explain both capacity and rate of change, plus “things that make reclamation impossible.”
Signals to track (and why)
- Pool allocated % (zpool list): the big obvious one. Track trend, not just current.
- Pool free space by class (zpool list -v on some platforms; also zpool status): special vdevs and log vdevs can have their own cliffs.
- Snapshot space (usedbysnapshots): a pool can be “full” because retention is quietly eating it.
- Overcommit ratio: sum of zvol volsize divided by physical pool size (and/or by “safe usable”).
- Dirty data / txg pressure (zfs-stats or platform equivalents): spikes often correlate with big write bursts.
- Write amplification indicators: high IOPS with modest throughput, rising latency, and allocator fragmentation (metaslab histograms).
- Compression ratio drift: when new workloads are incompressible, the pool consumption curve bends upward.
- Discard effectiveness: whether guests issue discard and whether it results in ZFS freeing space over time.
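Most of these signals fall out of two or three commands, so a tiny collector is enough to feed whatever monitoring you already run. A minimal sketch, assuming the pool and dataset tree used in the examples below and a key=value output format you would adapt to your own pipeline:

#!/bin/sh
# Emit thin-provisioning signals as key=value lines (adapt the format to your metrics pipeline).
POOL=tank
TREE=tank/vmstore

# Pool-level: capacity %, fragmentation %, allocated and total bytes.
zpool list -Hp -o capacity,fragmentation,allocated,size "$POOL" | tr -d '%' | \
  awk '{printf "pool_cap_pct=%s pool_frag_pct=%s pool_alloc_bytes=%s pool_size_bytes=%s\n", $1, $2, $3, $4}'

# Promised capacity: sum of zvol volsize under the tree.
zfs list -Hp -t volume -o volsize -r "$TREE" | \
  awk '{s+=$1} END{printf "promised_bytes=%d\n", s}'

# Snapshot-pinned space for the tree.
zfs get -Hp -o value usedbysnapshots "$TREE" | \
  awk '{printf "snapshot_pinned_bytes=%s\n", $1}'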
Alert thresholds that don’t lie (much)
Use staged alerts, not one big cliff:
- 75%: “heads up.” Verify growth trend and snapshot retention. Start planning.
- 85%: “action.” Freeze nonessential snapshots, validate replication room, ensure deletions actually reclaim.
- 90%: “change mode.” Stop creating new thin volumes. Move workloads. Add vdevs or reduce retention. Treat it as incident prevention.
- 95%: “incident.” You’re now paying fragmentation tax and allocation failure risk. Execute a runbook, not a debate.
These thresholds depend on workload and vdev layout. RAIDZ plus random writes wants more headroom than mirrors with mostly sequential writes.
But if you’re thin provisioning and you don’t know your workload shape, pick conservative thresholds and enjoy sleeping.
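A minimal cron-able sketch of the staged thresholds above, assuming the pool is named tank and that notify stands in for whatever pager or chat hook you actually use:

#!/bin/sh
# Staged capacity alerts for a thin-provisioned pool (thresholds match the list above).
POOL=tank
CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')

if   [ "$CAP" -ge 95 ]; then LEVEL="incident: execute the runbook"
elif [ "$CAP" -ge 90 ]; then LEVEL="change mode: stop new thin volumes, plan expansion"
elif [ "$CAP" -ge 85 ]; then LEVEL="action: freeze nonessential snapshots, verify reclaim works"
elif [ "$CAP" -ge 75 ]; then LEVEL="heads up: check growth trend and retention"
else exit 0
fi

# 'notify' is a placeholder for your alerting integration.
notify "zfs-capacity" "$POOL at ${CAP}%: $LEVEL"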
Practical tasks: commands, output meaning, decisions
This section is intentionally operational. The point is not to admire ZFS. The point is to know what to do at 3 a.m.
Each task includes a runnable command, what the output means, and what decision you make.
Task 1: Check pool capacity and health fast
cr0x@server:~$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 21.8T 18.9T 2.90T - - 41% 86% 1.00x ONLINE -
What it means: CAP 86% is in the “action” zone. FRAG 41% suggests allocator pain is coming.
Decision: Stop nonessential writes (snapshot storms, migrations). Start snapshot/retention review and capacity plan now.
Task 2: Confirm no devices are failing while you chase “space” ghosts
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
scan: scrub repaired 0B in 09:12:44 with 0 errors on Sun Dec 22 09:12:55 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 1
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/vmstore/zvol-104-disk-0
What it means: There’s a checksum error associated with a specific zvol. Not a capacity issue, but it can masquerade as one.
Decision: Triage data integrity separately: verify the guest disk, scrub results, consider replacing the disk, and restore affected VM data if needed.
Task 3: List zvols and spot the overcommit picture
cr0x@server:~$ zfs list -t volume -o name,volsize,used,referenced,logicalused,refreservation -r tank/vmstore
NAME VOLSIZE USED REFER LUSED REFRESERV
tank/vmstore/vm-101-disk-0 500G 412G 401G 603G none
tank/vmstore/vm-102-disk-0 2T 1.31T 1.28T 1.28T none
tank/vmstore/vm-103-disk-0 1T 74G 72G 92G none
tank/vmstore/vm-104-disk-0 4T 3.61T 3.55T 3.55T none
What it means: These are sparse (no refreservation). Logical size promises are big. Compression exists (vm-101 shows logical > physical).
Decision: If this is production, choose: set refreservation for critical zvols, or enforce strict pool headroom with alerts and capacity management.
Task 4: Calculate “promised capacity” (sum of volsize)
cr0x@server:~$ zfs list -H -p -t volume -o volsize -r tank/vmstore | awk '{s+=$1} END{printf("total_volsize_bytes=%d\n",s)}'
total_volsize_bytes=8246337208320
What it means: You’ve promised ~8.2 TB to guests in this subtree. That number alone is not evil; it’s context.
Decision: Compare to usable pool space (after parity, slop space policy, and growth). Define a maximum overcommit ratio and enforce it.
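One hedged way to turn that comparison into a number: divide the promised bytes by an approximation of usable space, and again by a headroom-adjusted figure. The 0.85 factor and the use of used+available on the pool root as “usable” are assumptions; RAIDZ padding and compression will blur the result.

#!/bin/sh
# Rough overcommit ratio: promised zvol capacity vs. approximate usable pool space.
# Treat the result as a trend to watch, not a verdict.
POOL=tank
TREE=tank/vmstore
HEADROOM=0.85   # plan against 85% of usable, not 100%

PROMISED=$(zfs list -Hp -t volume -o volsize -r "$TREE" | awk '{s+=$1} END{print s}')
# used + available on the pool root dataset approximates post-parity usable bytes.
USABLE=$(zfs list -Hp -o used,available "$POOL" | awk '{print $1+$2}')

awk -v p="$PROMISED" -v u="$USABLE" -v h="$HEADROOM" 'BEGIN{
  printf "promised=%.2fTiB usable=%.2fTiB overcommit=%.2fx vs_headroom=%.2fx\n",
         p/2^40, u/2^40, p/u, p/(u*h)
}'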
Task 5: See snapshot pressure for a dataset tree
cr0x@server:~$ zfs list -o name,used,usedbysnapshots,usedbydataset,usedbychildren -r tank/vmstore | head -n 8
NAME USED USEDSNAP USEDDS USEDCH
tank/vmstore 6.12T 2.04T 4.01T 77.6G
tank/vmstore/vm-101 622G 211G 401G 9.6G
tank/vmstore/vm-102 1.53T 243G 1.28T 12.4G
tank/vmstore/vm-103 119G 42.1G 72G 4.9G
tank/vmstore/vm-104 3.85T 1.52T 3.55T 18.6G
What it means: 2 TB in snapshots is not “free.” It’s allocated and pinned. Deleting in guests won’t reclaim that.
Decision: If capacity is tight, adjust retention. Delete snapshots strategically (oldest, largest deltas), and confirm replication dependencies first.
Task 6: Identify the top space consumers quickly
cr0x@server:~$ zfs list -o name,used,usedbysnapshots -s used -r tank/vmstore | tail -n 6
tank/vmstore/vm-103 119G 42.1G
tank/vmstore/vm-101 622G 211G
tank/vmstore/vm-102 1.53T 243G
tank/vmstore/vm-104 3.85T 1.52T
What it means: vm-104 is carrying huge snapshot ballast.
Decision: If you need a fast reclaim, snapshot-heavy workloads are the most “recoverable” space—after you validate business/backup requirements.
Task 7: Verify compression reality (and detect the “encrypted backup ruined it” moment)
cr0x@server:~$ zfs get -r -H -o name,property,value compression,compressratio tank/vmstore | head
tank/vmstore compression zstd
tank/vmstore compressratio 1.24x
tank/vmstore/vm-101 compression zstd
tank/vmstore/vm-101 compressratio 1.46x
tank/vmstore/vm-102 compression zstd
tank/vmstore/vm-102 compressratio 1.02x
What it means: vm-102 is basically incompressible. If it grows, it will eat raw space at almost 1:1.
Decision: Don’t count compression savings for workloads that encrypt at rest inside the guest or store pre-compressed blobs. Plan capacity like it’s 1.0x.
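Tracking that drift can be as unglamorous as appending a dated dump of the ratios to a file and diffing it week over week. A sketch; the log path is an assumption:

cr0x@server:~$ { date +%F; zfs get -r -H -o name,value compressratio tank/vmstore; } >> ~/compressratio-history.log

When a dataset trends toward 1.00x, adjust the capacity plan before the pool curve makes the point for you.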
Task 8: Check reservations/refreservations (a.k.a. “who is allowed to sink the ship”)
cr0x@server:~$ zfs get -r -H -o name,property,value reservation,refreservation tank/vmstore | head -n 12
tank/vmstore reservation none
tank/vmstore refreservation none
tank/vmstore/vm-101 reservation none
tank/vmstore/vm-101 refreservation none
tank/vmstore/vm-102 reservation none
tank/vmstore/vm-102 refreservation none
What it means: Nobody has guaranteed space. That means the last writer wins until the pool loses.
Decision: For “must not fail” volumes (databases, control plane, identity), set refreservation so they can keep writing even during pressure.
Task 9: Set a refreservation for a critical zvol (and understand the consequence)
cr0x@server:~$ sudo zfs set refreservation=600G tank/vmstore/vm-101-disk-0
cr0x@server:~$ zfs get -H -o name,property,value refreservation tank/vmstore/vm-101-disk-0
tank/vmstore/vm-101-disk-0 refreservation 644245094400
What it means: ZFS now reserves ~600G for this volume. Pool free space drops accordingly.
Decision: This is how you prevent critical workloads from being starved by someone else’s runaway job. Use it intentionally; it reduces flexibility.
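A related lever, if your OpenZFS version supports it (check the zfsprops man page on your platform before relying on this): refreservation=auto sizes the reservation to cover the full volsize plus metadata overhead, effectively converting a sparse zvol back to thick.

cr0x@server:~$ sudo zfs set refreservation=auto tank/vmstore/vm-101-disk-0

The manual 600G above is a guarantee you sized yourself; auto typically reserves somewhat more than volsize because it accounts for metadata and block overhead.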
Task 10: Check ashift and vdev layout (capacity math depends on it)
cr0x@server:~$ zdb -C tank | egrep 'ashift|raidz|mirror' | head -n 20
ashift: 12
type: 'raidz'
nparity: 2
What it means: ashift 12 implies 4K sectors. RAIDZ2 parity overhead exists. Small writes may amplify.
Decision: If you’re already tight on space, changing workload patterns (more random writes, smaller blocks) can accelerate consumption and fragmentation. Factor it into risk.
Task 11: Inspect free space fragmentation and metaslab histograms
cr0x@server:~$ sudo zdb -mm tank | head -n 40
Metaslab 0:
size 1090519040
alloc 1019215872
free 71303168
free segments: 142
histogram:
000: ------------------------------------------------------------
001: ******************* 19
002: ***************************** 29
003: ******************** 20
What it means: Many small free segments means “free space exists but it’s chopped up.” Allocation gets slower and can fail for larger contiguous needs.
Decision: Treat high fragmentation + high capacity as a risk multiplier. Create headroom by freeing space and avoiding churn (snapshots, rewrites).
Task 12: Confirm guest discard is enabled (thin provisioning without discard is a lie)
cr0x@server:~$ lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 512B 2G 0
sdb 0 512B 2G 0
What it means: The block devices expose discard granularity. That’s necessary but not sufficient.
Decision: Ensure the hypervisor passes discard/TRIM and the guest filesystem issues it (mount options, fstrim timer). Otherwise deletes won’t reclaim.
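Inside a Linux guest, the checks are equally mundane. A sketch, assuming systemd and a virtual disk that actually advertises discard; the prompts are from a hypothetical guest:

cr0x@guest:~$ lsblk -D                                  # confirm the guest sees nonzero DISC-GRAN / DISC-MAX
cr0x@guest:~$ sudo fstrim -av                           # one-off trim of all mounted filesystems that support it
cr0x@guest:~$ sudo systemctl enable --now fstrim.timer  # periodic trim instead of 'discard' mount options

If fstrim reports trimmed bytes but used on the host never drops, go back and look at snapshots.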
Task 13: Observe whether freeing data actually reclaims on the host
cr0x@server:~$ zfs get -H -o name,property,value written tank/vmstore/vm-101-disk-0
tank/vmstore/vm-101-disk-0 written 103215702016
What it means: written can help estimate recent churn. If guests “deleted a lot” but used doesn’t drop, suspect snapshots or missing discard.
Decision: If snapshots are pinning blocks, discard won’t help. If snapshots are minimal, enable discard end-to-end and run trims.
Task 14: Find datasets with huge snapshot deltas (fast reclaim candidates)
cr0x@server:~$ zfs list -t snapshot -o name,used,referenced -s used -r tank/vmstore | tail -n 5
tank/vmstore/vm-104@auto-2025-12-20 188G 3.55T
tank/vmstore/vm-104@auto-2025-12-21 201G 3.55T
tank/vmstore/vm-104@auto-2025-12-22 214G 3.55T
tank/vmstore/vm-102@auto-2025-12-22 61G 1.28T
tank/vmstore/vm-101@auto-2025-12-22 49G 401G
What it means: Snapshots with large USED are consuming lots of unique blocks. Deleting one can reclaim meaningful space.
Decision: If replication permits, delete the largest/oldest snapshots first. If replication depends on them, you need a different plan (expand capacity or adjust replication).
Task 15: When near-full, check slop space impact
cr0x@server:~$ cat /sys/module/zfs/parameters/spa_slop_shift
5
What it means: Slop space is not a dataset property; on Linux it is governed by the spa_slop_shift module parameter. ZFS keeps roughly 1/2^spa_slop_shift of the pool (1/32, about 3%, at the default of 5) effectively off-limits so metadata updates and frees still have room to land; newer OpenZFS releases also cap the absolute amount.
Decision: Don’t plan to use the last few GB/TB. Plan to never see 100% outside a lab. If you’re hitting slop space, you’re already late.
Task 16: Check if you can even delete (sometimes you can’t, and that’s the horror story)
cr0x@server:~$ sudo zfs destroy tank/vmstore/vm-104@auto-2025-12-20
cannot destroy 'tank/vmstore/vm-104@auto-2025-12-20': out of space
What it means: You’re in the “can’t free space because freeing space needs space” zone (metadata updates still require allocations).
Decision: Stop all writes, export noncritical datasets if possible, add space (temporarily attach a vdev) or free space from another pool via send/receive. This is an incident.
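A boring preventive measure that makes this scenario survivable, provided you set it up before the incident (the dataset name and size are illustrative): park a reservation whose only job is to be released under pressure.

cr0x@server:~$ sudo zfs create -o refreservation=16G -o canmount=off tank/emergency-slack
cr0x@server:~$ sudo zfs set refreservation=none tank/emergency-slack   # run this during the incident to hand back headroom

It is not elegant, but it converts “cannot destroy: out of space” into “release the slack, then clean up properly.”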
Joke #2: A 99% full pool is like a meeting that “will only take five minutes.” Everyone knows what happens next.
Fast diagnosis playbook
When a thin-provisioned environment gets weird, people argue about “storage” like it’s weather. Don’t.
Run a short sequence that tells you: are we out of space, out of performance, or out of truth?
First: confirm the failure mode (capacity vs integrity vs performance)
- Pool capacity and fragmentation: zpool list. If CAP > 85% and FRAG rising, treat as capacity-pressure incident even if you’re not “full.”
- Pool health: zpool status -v. If errors exist, fix integrity before you chase “thin provisioning math.”
- Snapshot pinning: zfs list -o usedbysnapshots across the relevant tree. If snapshot usage is large, deletes in guests won’t help.
Second: locate the consumer and the growth rate
- Top datasets/zvols by USED: zfs list -s used.
- Top snapshots by USED: zfs list -t snapshot -s used.
- Churn indicators: zfs get written on likely culprits; correlate with job schedules and VM activity.
Third: decide whether you can reclaim or must expand
- If reclaimable: delete snapshots (carefully), reduce retention, enable discard, compact inside guests where it actually results in freed blocks.
- If not reclaimable quickly: add capacity (new vdevs) or move workloads off-pool; do not attempt heroics on a 95–100% pool.
- Protect critical workloads: apply refreservation for essential zvols before you start deleting the wrong things.
Common mistakes (symptom → root cause → fix)
1) “Guests show free space, but writes fail with ENOSPC”
Symptom: VM filesystem shows space, applications error, hypervisor logs “No space left on device.”
Root cause: ZFS pool is full (or effectively full due to slop space / fragmentation). The guest can’t see host pool exhaustion until writes fail.
Fix: Free pool space immediately (delete snapshots, move datasets), add capacity, then enforce headroom alerts and optionally refreservation for critical disks.
2) “Deleting files in the guest doesn’t free space on the pool”
Symptom: Guest usage drops; ZFS USED doesn’t.
Root cause: Snapshots pin blocks, or discard isn’t working end-to-end, or the guest filesystem doesn’t issue trim.
Fix: Check usedbysnapshots. If high, delete/expire snapshots. If low, enable discard in hypervisor + guest and run periodic fstrim.
3) “Pool is 88% full and everything is slow”
Symptom: Latency spikes, VM pauses, sync writes crawl.
Root cause: Fragmentation and allocator contention at high fill levels; copy-on-write amplifies small overwrites; RAIDZ can magnify small random writes.
Fix: Create headroom (target <80–85%), reduce churn (snapshots, frequent rewrites), tune block sizes where appropriate, and consider layout changes for next build (mirrors for IOPS-heavy workloads).
4) “We set thin provisioning because compression would save us”
Symptom: The pool growth curve suddenly steepens.
Root cause: New data is incompressible (encrypted backups, media, already-compressed archives). Compression ratio assumptions became fantasy.
Fix: Track compressratio drift and plan with 1.0x for untrusted workloads. Move incompressible data to separate pools or tiers.
5) “Snapshot deletion takes forever, and space doesn’t come back”
Symptom: Destroying snapshots is slow; pool remains pressured.
Root cause: High fragmentation and massive snapshot chains mean frees are expensive; plus space returns gradually as metadata updates complete.
Fix: Delete snapshots in batches off-peak, avoid gigantic chains, keep headroom, and consider revising backup strategy (fewer long-lived snapshots, more incremental send retention elsewhere).
6) “We cannot delete snapshots because the pool is out of space”
Symptom: zfs destroy returns out-of-space.
Root cause: Pool is so full metadata operations can’t allocate; you’re beyond “cleanup” and into “stabilize.”
Fix: Stop writes, add temporary capacity (a new vdev is the clean approach), or evacuate datasets using send/receive if possible. Then set strict operational ceilings.
Three corporate mini-stories from the thin-provisioning trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran a tidy virtualization cluster with ZFS-backed zvols. The storage team was comfortable with thin provisioning because “VMs never fill their disks.”
That was true, until it wasn’t.
One quarter, finance wanted longer retention for audit logs. The application team complied by increasing log verbosity and keeping more history on the same VM disks.
Nobody told storage. Nobody thought they needed to.
Two weeks later, the pool hit the high-80s. Performance got mushy, but nothing screamed. Then a maintenance window kicked off: VM snapshots for backups, plus an application upgrade that rewrote large datasets.
Copy-on-write did its thing: overwrites required fresh allocations. The pool sprinted from “uncomfortable” to “full” faster than the monitoring could page.
The first visible symptom wasn’t “pool full.” It was database writes failing in a guest, followed by crash loops and “filesystem read-only” errors.
The incident channel filled with theories: SAN issues, kernel bug, “maybe the hypervisor is unstable.”
The fix was boring: delete old snapshots, reduce retention, and add a vdev. The lesson wasn’t “thin provisioning is evil.”
The lesson was that thin provisioning is a contract, and the contract was never written down.
Mini-story 2: The optimization that backfired
At a large enterprise, a platform team wanted better storage efficiency. They enabled aggressive snapshotting—hourly for a week, daily for a month—on VM zvol datasets.
The charts looked great: quick rollback, easy restores, fewer backup tickets.
Then they introduced a new CI workload with lots of short-lived VMs. Those VMs wrote, deleted, rewrote, and repeated. Inside the guests, files came and went constantly.
On ZFS, with snapshots, deletes don’t mean “free.” They mean “free later, if nothing references the old blocks.” Snapshots referenced everything.
Space usage grew steadily, but the team rationalized it: “Snapshots are worth it.” The pool stayed under 80% for a while.
The trap was that fragmentation was rising and the snapshot chain was turning every overwrite into net-new allocation.
When the pool crossed the high-80s, latency spiked. CI jobs slowed, which caused timeouts. Timeouts caused retries. Retries caused more writes.
It wasn’t just capacity pressure; it was a write-amplification spiral driven by an “optimization” nobody revisited.
They fixed it by separating workload classes: CI VMs moved to a pool with different snapshot policy (short retention), and the “golden rollback” snapshots were kept for long-lived stateful systems only.
Snapshotting wasn’t removed. It was made intentional.
Mini-story 3: The boring but correct practice that saved the day
A healthcare company ran ZFS zvols for critical systems. Their storage lead insisted on two rules:
keep pool utilization under a hard ceiling, and set refreservation on the handful of volumes that must never fail writes.
It wasn’t popular. Reservations make pools look “smaller,” and dashboards love large free numbers.
During a routine month-end, a reporting system went wild and started generating huge intermediate files on its VM disk.
It filled what it had been promised. On a thin-provisioned setup without guardrails, that would have been everyone’s problem.
Instead, the reporting VM hit I/O errors and stopped growing. That was still unpleasant, but it was contained.
The database volumes and the identity services kept their guaranteed space and stayed healthy.
The operations team had time to respond: they extended the reporting disk properly, rebalanced retention, and added capacity on schedule rather than in panic.
Nothing made headlines. That’s the point.
Checklists / step-by-step plan
Checklist A: Before you enable sparse zvols in production
- Define an overcommit policy: maximum ratio, which tenants can overcommit, and what happens when thresholds are crossed.
- Pick headroom targets: a hard ceiling for pool CAP (commonly 80–85% for mixed workloads; more conservative for RAIDZ + random writes).
- Decide which volumes are “must write”: set refreservation for those, and accept the visible loss of free space as the cost of reliability.
- Snapshot retention by class: don’t apply the same retention to CI scratch disks and databases.
- Discard/TRIM plan: confirm hypervisor and guests support it; schedule fstrim or equivalent.
- Monitoring & alerting: pool CAP, FRAG, snapshot usage, compressratio drift, and rate-of-change alerts.
Checklist B: Weekly operational review (15 minutes, saves a weekend)
- Check zpool list for CAP and FRAG trends.
- Check top snapshot usage with zfs list -o usedbysnapshots.
- Review top growth datasets/zvols: zfs list -s used.
- Validate scrubs are completing and clean: zpool status.
- Spot compressratio changes across major datasets.
- Confirm replication targets have headroom too (thin provisioning across both ends is how you get synchronized disasters).
Checklist C: When the pool hits 85% (treat as a controlled incident)
- Freeze churn: pause bulk migrations, suspend nonessential snapshot jobs, stop test pipelines that thrash disk.
- Identify reclaimable space: largest snapshots, largest snapshot-heavy datasets.
- Confirm discard path: if relying on reclaim-from-delete, verify it’s even possible.
- Protect critical writers: apply refreservation if missing, before the pool gets worse.
- Decide expansion vs reclamation: if you can’t reclaim enough safely, schedule capacity add immediately.
Checklist D: When the pool hits 95% (stop improvising)
- Stop writes wherever possible (application throttles, pause VM creation, halt backup snapshot bursts).
- Do not start “cleanup scripts” that delete random things; you’ll delete what’s easy, not what’s effective.
- Prioritize deleting large snapshots that are safe to remove (validate replication and restore requirements).
- Prepare to add a vdev. This is often the fastest safe exit.
- After stabilizing: postmortem the overcommit policy and implement rate-of-change alerts.
FAQ
1) Are sparse zvols the same as “thin provisioning” on a SAN?
Conceptually, yes: you present more logical capacity than you have physically allocated. Operationally, ZFS adds copy-on-write, snapshots, and pooled allocation,
which can make failure modes sharper if you run hot on space.
2) Is overcommit always bad?
No. Overcommit is a tool. It becomes bad when you don’t measure it, don’t cap it, and don’t keep headroom for bursts and copy-on-write behavior.
If you can’t explain your overcommit ratio and your reclaim plan, you’re not overcommitting—you’re gambling.
3) Should I use refreservation on every zvol?
Usually no. Reserving everything defeats thin provisioning. Use it on critical volumes where write failure is unacceptable (databases, control plane, shared services),
and rely on headroom + alerts for the rest.
4) Why does ZFS get slow when the pool is nearly full?
Because allocators like choice. Near-full pools have fewer free segments, higher fragmentation, and copy-on-write overwrites that require new allocations.
On RAIDZ, small random writes can add extra overhead. The result is latency inflation and throughput collapse.
5) Can I rely on deleting data in guests to reclaim pool space?
Only if two things are true: snapshots aren’t pinning those blocks, and discard/TRIM is working end-to-end.
Without discard, the host often can’t know blocks are free. With snapshots, they aren’t free even if the guest deletes them.
6) What’s the safest way to reclaim space quickly?
Deleting large, old snapshots that have significant USED is often the most predictable reclaim—assuming you’ve validated replication and restore requirements.
Randomly deleting files inside guests is less predictable unless you know discard is effective and snapshots aren’t holding you hostage.
7) Do compression and dedup change the overcommit risk?
Compression can help a lot, but it’s workload-dependent and can change over time. Dedup can be dangerous operationally due to memory pressure and complexity;
it also doesn’t fix the fundamental “writes require physical blocks” truth. Model risk as if compression might drop to 1.0x.
8) How do I set alerting for thin provisioning specifically?
Alert on pool CAP with multiple thresholds, track snapshot usage growth, and track overcommit ratio (sum of volsize vs usable pool).
Add rate-of-change alerts: “pool used +5% in 1 hour” catches bursts that static thresholds miss.
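A rate-of-change check fits in a few lines of shell run hourly from cron: remember the last sample, compare, and page when growth exceeds the budget. The state file path, the 5% budget, and the notify placeholder are assumptions:

#!/bin/sh
# Alert if pool allocation grew by more than GROWTH_PCT of total pool size since the last run.
POOL=tank
STATE=/var/tmp/zfs-alloc.last
GROWTH_PCT=5

ALLOC=$(zpool list -Hp -o allocated "$POOL")
SIZE=$(zpool list -Hp -o size "$POOL")
LAST=$(cat "$STATE" 2>/dev/null || echo "$ALLOC")
echo "$ALLOC" > "$STATE"

DELTA_PCT=$(awk -v a="$ALLOC" -v l="$LAST" -v s="$SIZE" 'BEGIN{printf "%d", (a - l) * 100 / s}')
if [ "$DELTA_PCT" -ge "$GROWTH_PCT" ]; then
  # 'notify' is a placeholder for your alerting integration.
  notify "zfs-growth" "$POOL allocation grew ${DELTA_PCT}% of pool size in the last interval"
fi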
9) What’s a reasonable overcommit ratio?
There isn’t one universal number. For predictable enterprise VM fleets with strong hygiene, 1.5–2.0x might be tolerable.
For mixed workloads with heavy snapshots and unknown teams, keep it closer to 1.0–1.2x or enforce reservations on critical subsets.
10) Is running at 90% ever acceptable?
Sometimes, for a short period, with known workloads, mirrors, low churn, and a rehearsed reclaim plan. In general, 90% is where ZFS starts charging interest.
If you’re thin provisioning and living at 90%, you’re choosing drama.
Conclusion: next steps that prevent the pager
Sparse zvols are not the villain. The villain is unmanaged promises. If you want thin provisioning without thin excuses,
treat capacity as an SLO with budgets, alerts, and guardrails.
Do these next:
- Instrument the basics: pool CAP, FRAG, snapshot usage, compressratio drift, and growth-rate alerts.
- Write an overcommit policy: define maximum overcommit and what triggers “stop creating new thin volumes.”
- Protect critical zvols: add refreservation where write failure is unacceptable.
- Fix reclaim mechanics: verify snapshots and discard behavior; don’t assume deletes reclaim.
- Rehearse the 95% runbook: know exactly how you’ll reclaim or expand before you need to.
Thin provisioning can be a competitive advantage: better utilization, faster provisioning, fewer emergency disk purchases.
But only if you’re honest about the bill that comes due. ZFS will collect, on time, every time.