ZFS gives you two knobs for controlling space that look similar, sound similar, and behave just differently enough to cause production incidents: quota and reservation. One is a ceiling. The other is a floor. And both are enforced by ZFS’s own accounting—not what du thinks, not what your application “feels,” and definitely not what your storage budget presentation promised.
If you run multi-tenant systems, CI builders, VM farms, database fleets, or anything where “someone will eventually fill the pool,” quota and reservation are the difference between a quiet on-call and an 02:00 war room. This piece is written the way you actually operate ZFS: with commands, symptoms, and the messy reality of snapshots, refquota, and “why is there no space when df says there is?”
1. Quota vs reservation: the real definitions
Quota: “You may not grow beyond this.”
A quota on a dataset is a hard cap on how much space that dataset (and its descendants) is allowed to consume. Once the dataset’s usage hits the quota, writes that require more space fail with EDQUOT or ENOSPC (surfacing in applications as some variant of “disk quota exceeded” or “no space left”).
Key behavior: ZFS enforces quota against its own internal space accounting (the dataset’s used space, measured after compression), with nuances you must respect, especially once snapshots exist. A quota is not “reserved” space. It’s permission to consume up to a limit.
Reservation: “This much space is mine even if everyone else is hungry.”
A reservation is a guarantee: ZFS will set aside pool space so that a dataset can continue writing up to that reserved amount, even if other datasets are competing for free space.
Key behavior: reservations reduce the pool’s available space for everyone else, immediately, even if the dataset isn’t currently using that space. This is the “floor” guarantee.
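A minimal sketch of both knobs side by side, assuming a pool named tank and two hypothetical datasets (tank/tenant-a and tank/logs); the numbers are illustrative:
cr0x@server:~$ sudo zfs set quota=100G tank/tenant-a       # ceiling: the subtree may not grow past 100G
cr0x@server:~$ sudo zfs set reservation=20G tank/logs      # floor: 20G of pool space is held for logs
cr0x@server:~$ zfs get -H -o name,property,value available tank
# Expect the pool's "available" to drop by roughly 20G as soon as the reservation is set,
# even though tank/logs hasn't written a byte of it; the quota changes nothing until
# tank/tenant-a actually tries to grow past 100G.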
Two one-liners that keep you sane
- Quota is a speed limit. You can drive up to it, but it doesn’t buy you fuel.
- Reservation is a fuel voucher. It doesn’t tell you where to go, but it guarantees you can get somewhere.
Joke #1 (short and relevant): A quota is the CFO saying “don’t spend more than this.” A reservation is the CFO actually putting the money in your cost center—rare, beautiful, and still somehow confusing.
2. The mental model: ceiling, floor, and who pays
Most teams get quota right on day one: “Each tenant gets 500G.” Then the first incident happens because quota doesn’t protect the pool. It only protects other datasets from that dataset growing too much. If you hand out quotas that sum to 200% of pool capacity (overcommit), you’re betting on behavior. Sometimes that’s fine. Sometimes it’s how you learn the meaning of “write amplification during compaction.”
Reservations are the opposite failure mode: they do protect the dataset, but they can quietly starve everything else. A reservation is you walking into a shared fridge and putting your name on half the shelves “just in case.”
Think in three numbers, not one
When you’re debugging space, you need three distinct concepts:
- Used: what ZFS accounts as in use by datasets and snapshots.
- Available: what ZFS says can be allocated (after slop space, reservations, etc.).
- Referenced vs logical vs physical: how much data is “yours,” how much is shared by snapshots, and how much is actually on disk after compression.
Quotas and reservations apply at the dataset boundary
ZFS datasets are the unit of enforcement. Quotas/reservations don’t apply to “a directory” unless you use project quotas (more on those later). For VM zvols, different knobs exist (volsize, refreservation) and the confusion multiplies.
What gets blocked when you hit quota?
Writes that need new blocks. Overwrites can also need new blocks because ZFS is copy-on-write. That means “I’m editing a file in place” can still allocate, and quotas can still bite.
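To see that in action, here’s a minimal sketch, assuming a hypothetical dataset tank/demo that already holds a roughly 1G file called data.bin:
cr0x@server:~$ sudo zfs snapshot tank/demo@before
cr0x@server:~$ dd if=/dev/urandom of=/tank/demo/data.bin bs=1M count=1024 conv=notrunc
cr0x@server:~$ zfs get -H -o property,value used,referenced,usedbysnapshots tank/demo
# The file count and file size never changed, but the overwrite allocated ~1G of new blocks
# because the snapshot pins the old ones; with a quota set just above steady-state usage,
# this is exactly the "editing in place" write that fails.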
What happens when the pool is full but you have a reservation?
If a dataset has a reservation and the pool becomes tight, ZFS tries to preserve that reserved amount for the dataset. That can mean other datasets see ENOSPC earlier than expected, because from their perspective, the reserved space was never truly “available.”
3. Interesting facts and historical context
Storage engineers love “simple.” ZFS loves “correct.” The gap between those two is where quotas and reservations live. Here are a few context points that help explain why the system behaves the way it does:
- ZFS was built around copy-on-write, which means overwrites allocate new blocks. Space accounting must consider “old” blocks held by snapshots, not just the live filesystem.
- Early ZFS emphasized end-to-end integrity (checksums, self-healing) long before it was fashionable; quota enforcement had to work with transactional semantics, not “best effort.”
- The “refquota” and “refreservation” concepts exist because snapshots complicated the naive idea of “a dataset uses X.” Referenced space and total space are different bills.
- ZFS “slop space” exists (a small unallocatable reserve at the pool level) to keep the system functioning when nearly full. This makes “why is there 5G missing?” a recurring mystery for newcomers.
- Compression changes the human perception of usage: quotas are enforced against ZFS’s own space accounting, which reflects compression, so the numbers rarely match what users compute from file sizes. Users don’t like being told “you’re out of space” when the disks aren’t full.
- Thin provisioning became mainstream, and ZFS embraced it with datasets and quotas—but reservations are the counterweight when you need guaranteed headroom.
- VM storage popularized zvols, and many operational mistakes come from treating zvols like filesystems. volsize is not a quota; it’s the device size.
- Containerization made multi-tenant FS layout normal. ZFS datasets became a clean boundary for quotas and delegation, but only if you understand descendants and snapshot behavior.
- Project quotas arrived to answer “I need per-directory limits” without dataset sprawl. They’re powerful, but they add another accounting layer that must be monitored.
4. The properties you will actually use (and their traps)
ZFS exposes a small constellation of space-control properties. Learn them as pairs, because that’s how they behave in practice.
quota and reservation: apply to dataset + descendants
quota limits the total space consumed by a dataset and all its children. Same story for reservation: it reserves for the dataset and its descendants.
This is great when you allocate “a tenant” as a top-level dataset and put everything under it. It’s not great when you intend to cap only the parent dataset but later someone creates children and wonders why the parent quota “does nothing.” It’s doing exactly what it promised: governing the whole subtree.
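A quick way to see the subtree behavior, assuming a hypothetical tenant root tank/tenants/acme with quota=2T already set and a child created afterwards:
cr0x@server:~$ sudo zfs create tank/tenants/acme/scratch
cr0x@server:~$ zfs list -r -o name,used,avail,quota tank/tenants/acme
# Expect the child to show no local quota of its own, yet its AVAIL shrinks in lockstep
# with the parent: everything written under scratch still counts against the 2T set on
# tank/tenants/acme.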
refquota and refreservation: apply to the dataset only (not descendants, not snapshots)
refquota is a quota on the dataset’s referenced space—typically meaning “live data,” not including snapshots and not including descendants. It is the “I want to limit what this dataset itself can reference” knob.
refreservation is the matching “floor” for referenced space.
Operationally, refquota is how you stop snapshot-heavy workflows from punishing tenants for history they didn’t ask for (or from punishing you because you keep seven days of history).
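To check which bill a tenant is actually being charged, compare the two accounting views side by side; a minimal sketch, assuming the same hypothetical tenant dataset:
cr0x@server:~$ zfs get -H -o property,value quota,used,refquota,referenced,usedbysnapshots tank/tenants/acme
# quota is judged against "used" (live data plus descendants plus snapshot-held blocks);
# refquota is judged against "referenced" (live data only). If referenced sits comfortably
# under refquota while used keeps climbing, the retention policy, not the tenant, is the
# consumer.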
Why “including snapshots” and “excluding snapshots” matters
Snapshot space is “real” space in the pool, but it is not necessarily “owned” the way your tenants think. A dataset can be at its quota while still needing to allocate space due to copy-on-write churn, especially when snapshots pin old blocks. That’s how you get the classic: “I deleted files but usage didn’t drop.” You didn’t delete the blocks; you deleted references. Snapshots kept the old references alive.
Reservations can be bigger than used (and that’s the point)
When you set a reservation, you are pre-allocating pool availability, not writing zeros. If you reserve 200G and only use 20G, the pool will still act like that 200G is not available to other datasets. This is intentional. It is also a common cause of “mysterious” low pool free space.
Delegation and multi-tenant reality
In corporate environments, it’s common to delegate dataset management to platform teams or even to tenants (CI infrastructure teams, game server ops, etc.). If you let tenants create snapshots, they can make quota enforcement feel unfair unless you choose refquota correctly. And if you let tenants set reservations, you’re effectively letting them pre-empt pool capacity. That’s not a technical problem. That’s an org chart problem.
5. Snapshots: the third party in every argument
Snapshots are the reason ZFS is a joy—and the reason space conversations get weird.
Snapshots don’t “take space” when created, but they can keep space forever
A snapshot is initially metadata. The space cost comes later, when the live dataset changes and the snapshot keeps references to old blocks. Deleting 100G of files from the live dataset doesn’t free those blocks if snapshots still reference them.
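A minimal way to watch this happen, assuming a hypothetical dataset tank/demo that has at least one snapshot:
cr0x@server:~$ rm /tank/demo/big-export.tar
cr0x@server:~$ zfs get -H -o property,value used,referenced,usedbysnapshots tank/demo
# Expect "referenced" to drop (the live view no longer includes the file) while "used"
# stays roughly where it was and "usedbysnapshots" grows: the blocks moved from "live"
# to "held by snapshots" rather than being freed.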
How snapshots interact with quotas
Here’s the nuance that trips people: depending on which quota type you use, snapshots may or may not count against the quota.
- quota counts the dataset and its descendants; snapshot-held space can still cause allocations to fail because the dataset cannot allocate new blocks without exceeding quota.
- refquota focuses on referenced space (live data). Snapshot space is not “referenced” by the dataset in the same way, so it’s a better fit when you run snapshot policies centrally.
Snapshot churn makes quotas feel “sticky”
In practice, the sticky feeling is copy-on-write plus retention. Databases that rewrite large files, compaction jobs, VM images, build caches—these are all “space churn” workloads. They can require temporary double-space during a rewrite. If you set quotas too close to steady-state usage, you create a system that works… until it needs to do maintenance.
Joke #2 (short and relevant): Snapshots are like taking photos of your closet. Deleting the socks later doesn’t make the photos any smaller, and ZFS is not impressed by your newfound minimalism.
6. Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption (quota as a pool safety net)
The platform team ran a shared ZFS pool for CI runners. Each project got its own dataset under tank/ci, and each dataset had a quota. Everyone felt responsible. Everyone felt safe. The pool was sized for typical load and “bursty” builds, and quotas were meant to prevent any one team from going feral.
Then a big refactor landed in a monorepo, and a new build step started producing artifacts twice: once uncompressed, once compressed, then uploading both. The dataset quota prevented infinite growth, sure—but the build still had enough headroom to expand quickly across many projects at once. Quotas didn’t prevent the pool from filling because the pool was shared and the sum of “allowed growth” was far above actual capacity.
At about the same time, snapshots were being taken every 15 minutes for “fast rollback” of runner images. Nobody had connected that policy to CI artifacts. Write churn plus frequent snapshots created a lot of pinned blocks. Build directories were deleted after completion, but snapshots kept the churn alive until retention expired.
The incident wasn’t dramatic at first. It started as flaky builds. Then it became package extraction failures. Then a few runner hosts went read-only in strange ways because applications behaved badly under ENOSPC. The team’s first reaction was to raise quotas, because “projects are hitting quota.” That made the pool fill faster.
The eventual fix was boring and correct: set a pool-level policy (monitor pool free space), move CI artifacts to a dataset with short snapshot retention (or no snapshots), and keep quotas as tenant fairness—not pool protection. They also introduced a small reservation for system datasets (logs, package caches) so the hosts could still function during emergencies.
Mini-story 2: The optimization that backfired (reservations everywhere “for reliability”)
A storage-minded engineer joined a team that had been burned by “pool full” outages. Their instinct was reasonable: guarantee capacity for critical datasets. They created reservations for databases, logs, and a handful of services that “must never fail.” The idea was to avoid noisy neighbors and avoid the dreaded near-full pool behavior.
It worked for a while. Then growth happened. New services arrived. Each asked for “just a small reservation” because it sounded responsible. No one wanted to be the service that didn’t reserve space and later caused an outage. This is how good intentions scale into bad math.
After a quarter, the pool looked half-empty in terms of raw usage, yet “available” was low. Teams kept filing tickets: “df shows free space but writes fail.” They weren’t wrong; they were just looking at the wrong accounting layer. Reservations had quietly pre-committed most of the pool.
The backfire was operational: during a traffic spike, the logging pipeline needed temporary growth and couldn’t get it. The data wasn’t large long-term, but the surge required burst capacity. The pool had physical space; it didn’t have allocatable space because it was reserved away. The result was dropped logs right when debugging mattered most.
The postmortem ended with a rule: reservations are for infrastructure survivability (OS datasets, critical DB WAL/headroom) and for strict SLA workloads. Everything else gets quotas and monitoring. The team also learned to document reservations as “capacity debt” that must be paid for with real disks.
Mini-story 3: The boring but correct practice that saved the day (refquota + snapshot boundaries)
A company ran a multi-tenant analytics platform. Tenants uploaded data, jobs processed it, and results were stored per-tenant. The platform team wanted two things: predictable tenant limits and strong rollback for operational safety. They also knew snapshots would be non-negotiable because “oops” moments happen weekly in data systems.
Instead of using quota everywhere, they used a pattern: tenant datasets got refquota set to enforce “live data” limits. Snapshots were managed centrally by the platform team, with retention policies tailored to workload (short for scratch, longer for curated results).
They created a separate dataset subtree for scratch/intermediate data, with aggressive snapshot pruning and lower recordsize tuning. Most importantly, they made snapshot boundaries align with ownership: tenants were not charged (via quota) for platform-level safety history, and scratch datasets were not permitted to retain snapshots beyond a short window.
Months later, a bad deployment triggered a wave of job retries that rewrote intermediate files aggressively. The churn was real. But the system stayed up because: (1) scratch datasets had snapshot retention that didn’t pin churn for long, and (2) a small reservation existed for system-critical datasets so logs and core services could still write while the team stabilized jobs.
No heroics. No “storage wizardry.” Just clean boundaries, conservative enforcement, and the humility to assume someone will do something stupid with disk.
7. Practical tasks: commands and interpretation (12+)
These are the commands I actually reach for when a pool is tight, a dataset hits limits, or someone claims “ZFS is lying.” Commands are shown with typical output. Adjust pool/dataset names for your environment.
Task 1: List pool capacity and health
cr0x@server:~$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.25T 5.61T 1.64T - - 22% 77% 1.00x ONLINE -
Interpretation: Pool is 77% full. That’s not an emergency by itself, but if this climbs into the high 80s/90s, allocations become fragile (and performance can wobble). This output does not show reservations directly.
Task 2: Check pool “available” space with slop considered
cr0x@server:~$ zfs get -H -o name,property,value available tank
tank available 1.52T
Interpretation: ZFS may report less than FREE due to slop space and other accounting. If zpool list shows free but zfs get available is low, you’re heading into the zone where ENOSPC appears “early.”
Task 3: Show quotas and reservations on a dataset tree
cr0x@server:~$ zfs get -r -t filesystem -o name,property,value quota,reservation,refquota,refreservation tank/tenants
NAME PROPERTY VALUE
tank/tenants quota -
tank/tenants reservation -
tank/tenants refquota -
tank/tenants refreservation -
tank/tenants/acme quota 2T
tank/tenants/acme reservation -
tank/tenants/acme refquota 1.5T
tank/tenants/acme refreservation 200G
tank/tenants/zephyr quota 1T
tank/tenants/zephyr reservation -
tank/tenants/zephyr refquota -
tank/tenants/zephyr refreservation -
Interpretation: This tells you who has limits and guarantees. Note the mix: quota caps the whole subtree; refquota caps live data for just that dataset; refreservation guarantees live headroom.
Task 4: Find the biggest datasets fast
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -S used | head -n 12
NAME USED AVAIL REFER MOUNTPOINT
tank 5.61T 1.52T 128K /tank
tank/tenants 4.90T 1.52T 96K /tank/tenants
tank/tenants/acme 1.92T 800G 1.44T /tank/tenants/acme
tank/tenants/zephyr 1.31T 1.52T 1.05T /tank/tenants/zephyr
tank/vm 420G 1.52T 96K /tank/vm
tank/logs 180G 1.52T 160G /tank/logs
Interpretation: USED includes snapshots and descendants. REFER is “live data” referenced by that dataset. When USED is much larger than REFER, snapshots/children are usually the reason.
Task 5: Identify snapshot-heavy datasets (USED vs REFER gap)
cr0x@server:~$ zfs list -t filesystem -o name,used,refer -S used | head -n 10
NAME USED REFER
tank/tenants/acme 1.92T 1.44T
tank/tenants/zephyr 1.31T 1.05T
tank/logs 180G 160G
tank/ci 140G 18G
tank/home 110G 45G
Interpretation: tank/ci is a red flag: 140G used, only 18G referenced. That’s usually snapshots holding churn, or lots of child datasets.
Task 6: Inspect snapshots and their space
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -S used | head -n 8
NAME USED REFER CREATION
tank/ci@autosnap_2025-12-24_0100 22.4G 0B Wed Dec 24 01:00 2025
tank/ci@autosnap_2025-12-24_0045 18.1G 0B Wed Dec 24 00:45 2025
tank/tenants/acme@daily_2025-12-23 12.7G 0B Tue Dec 23 02:00 2025
tank/home@hourly_2025-12-24_0100 6.2G 0B Wed Dec 24 01:00 2025
Interpretation: Snapshot USED is the unique space held by that snapshot (space that would be freed if it were destroyed, assuming no other snapshot references those blocks). A run of large USED snapshots indicates churn.
Task 7: Show space breakdown for a dataset
cr0x@server:~$ zfs get -o property,value -p used,usedbysnapshots,usedbydataset,usedbychildren,usedbyrefreservation tank/ci
PROPERTY VALUE
used 150323855360
usedbysnapshots 126406688768
usedbydataset 19327352832
usedbychildren 0
usedbyrefreservation 0
Interpretation: Snapshots are consuming ~126G unique space. This is where your pool went.
Task 8: Set a quota (cap) on a dataset
cr0x@server:~$ sudo zfs set quota=500G tank/tenants/zephyr
cr0x@server:~$ zfs get -H -o name,property,value quota tank/tenants/zephyr
tank/tenants/zephyr quota 500G
Interpretation: zephyr (and any children under it) cannot consume more than 500G total. Note that ZFS will refuse to set a quota below the space the subtree already uses, so if zephyr is already over 500G you’ll need to shrink usage (or prune snapshots) before the cap will stick.
Task 9: Set a reservation (guarantee) for critical headroom
cr0x@server:~$ sudo zfs set reservation=50G tank/logs
cr0x@server:~$ zfs get -H -o name,property,value reservation tank/logs
tank/logs reservation 50G
Interpretation: 50G is removed from “available” to others and reserved for tank/logs (and its descendants). This helps logs keep writing when the pool is under pressure.
Task 10: Use refquota to limit only live data (snapshot-safe tenant limits)
cr0x@server:~$ sudo zfs set refquota=1.5T tank/tenants/acme
cr0x@server:~$ zfs get -H -o name,property,value refquota tank/tenants/acme
tank/tenants/acme refquota 1.5T
Interpretation: This caps acme’s live referenced data. If platform snapshots inflate USED, refquota is less likely to punish the tenant for retention policy.
Task 11: Use refreservation to guarantee live write headroom
cr0x@server:~$ sudo zfs set refreservation=20G tank/tenants/acme
cr0x@server:~$ zfs get -H -o name,property,value refreservation tank/tenants/acme
tank/tenants/acme refreservation 20G
Interpretation: Guarantees 20G of referenced space for that dataset’s own writes. Useful for things like database WAL or scratch that must not stall during pool pressure.
Task 12: Confirm why a dataset shows low “avail” (quota/reservation pressure)
cr0x@server:~$ zfs list -o name,quota,refquota,reservation,refreservation,used,avail,refer tank/tenants/acme
NAME               QUOTA  REFQUOTA  RESERV  REFRESERV  USED   AVAIL  REFER
tank/tenants/acme  2T     1.5T      -       20G        1.92T  60G    1.44T
Interpretation: AVAIL 60G is the key. The pool has far more free space than that, and USED (1.92T) is still under the 2T quota, but refquota caps referenced data at 1.5T and REFER is already at 1.44T, so the dataset has only about 60G of live-data headroom left. The limit that bites is whichever one is closest, not the one you happen to be watching.
Task 13: Reduce snapshot pressure by deleting old snapshots (carefully)
cr0x@server:~$ zfs list -t snapshot -o name,used -S used | grep '^tank/ci@' | head
tank/ci@autosnap_2025-12-24_0100 22.4G
tank/ci@autosnap_2025-12-24_0045 18.1G
tank/ci@autosnap_2025-12-24_0030 15.9G
cr0x@server:~$ sudo zfs destroy tank/ci@autosnap_2025-12-24_0030
Interpretation: Destroying a snapshot frees space only if blocks are unique to it. Always confirm policies and replication dependencies before destroying snapshots in production.
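Before destroying anything, a dry run is cheap: zfs destroy accepts -n (no-op) and -v (verbose), which report what would be freed without touching data. A sketch against the same snapshot; the output looks roughly like this:
cr0x@server:~$ sudo zfs destroy -nv tank/ci@autosnap_2025-12-24_0030
would destroy tank/ci@autosnap_2025-12-24_0030
would reclaim 15.9G
The percent-range syntax (tank/ci@first%last) works with the same flags, which makes it easy to estimate how much a whole run of snapshots is pinning before you commit to pruning.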
Task 14: Reveal hidden space hogs with “written” (recent churn)
cr0x@server:~$ zfs get -H -o name,property,value written@autosnap_2025-12-24_0045 tank/ci
tank/ci written@autosnap_2025-12-24_0045 41234597888
Interpretation: This shows bytes written since that snapshot. High numbers + frequent snapshots is a recipe for snapshot-held growth.
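To rank recent churn across a subtree instead of one dataset, the bare written property (bytes written since each dataset’s most recent snapshot) can be pulled recursively and sorted; a sketch, assuming GNU sort for the -h flag:
cr0x@server:~$ zfs get -r -t filesystem -H -o name,value written tank | sort -h -k2 | tail -n 5
# The datasets at the bottom of this list are writing the most since their last snapshot;
# combined with frequent snapshots, they are where snapshot-held space will grow fastest.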
Task 15: Compare ZFS accounting to filesystem view
cr0x@server:~$ df -h /tank/ci
Filesystem Size Used Avail Use% Mounted on
tank/ci          19G   19G     0 100% /tank/ci
Interpretation: The mount shows zero available because ZFS reports dataset availability after quotas, reservations, and snapshot pressure; note that Size shrinks too, since for a ZFS dataset df’s size is roughly referenced + available rather than the pool size. If users say “df says full,” believe them, then go look at ZFS properties to understand why.
Task 16: Find reservations that are “stealing” pool availability
cr0x@server:~$ zfs get -r -H -o name,property,value reservation,refreservation tank | awk -F'\t' '$3 != "-"' | head -n 20
tank/logs reservation 50G
tank/system reservation 30G
tank/tenants/acme refreservation 20G
Interpretation: Every value that survives the filter is reducing allocatable pool space for everyone else. If “available” feels low, this list is often why.
8. Fast diagnosis playbook
This is the triage sequence that gets you to the root cause quickly when someone reports “dataset full,” “pool full,” or “writes failing.” The goal is to avoid the classic loop of raising quotas blindly or deleting random data under pressure.
Step 1: Is it the pool, the dataset, or a limit?
cr0x@server:~$ zpool list tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.25T 6.98T 270G - - 29% 96% 1.00x ONLINE -
cr0x@server:~$ zfs get -H -o name,property,value available tank
tank available 120G
Interpretation: If pool CAP is >90% and tank available is low, you have a pool capacity event. If pool looks fine but a dataset is out, it’s usually quotas, reservations, or snapshot pinning.
Step 2: Identify which dataset is constrained
cr0x@server:~$ zfs list -o name,used,avail,refer -S used | head -n 15
NAME USED AVAIL REFER
tank 6.98T 120G 128K
tank/tenants 5.80T 120G 96K
tank/tenants/acme 2.40T 0B 1.90T
tank/tenants/zephyr 1.70T 120G 1.65T
tank/ci 650G 0B 40G
Interpretation: Datasets with AVAIL 0B are where applications will fail first.
Step 3: Check whether it’s quotas/reservations or snapshots
cr0x@server:~$ zfs list -o name,quota,refquota,reservation,refreservation,used,usedbysnapshots,refer,avail tank/ci
NAME     QUOTA  REFQUOTA  RESERV  REFRESERV  USED  USEDSNAP  REFER  AVAIL
tank/ci  650G   -         -       -          650G  590G      40G    0B
Interpretation: The dataset has hit its 650G quota, and the USEDSNAP column (usedbysnapshots) tells you why: live data is only 40G, while snapshots pin 590G. Snapshots are the immediate lever here; raising the quota just postpones the same conversation.
Step 4: Decide the right remedy for the moment
- Emergency pool full: delete the safest-to-delete snapshots first (largest USED), or prune retention. Avoid “rm -rf” unless you understand snapshots are holding the space anyway.
- Tenant hitting quota: decide whether to raise the quota or reduce usage; check if snapshot retention is causing “invisible” space growth.
- Everything starved but pool not full: audit reservations/refreservations and undo the overcommit of guarantees.
9. Common mistakes, symptoms, and fixes
Mistake 1: Treating quota as “pool protection”
Symptom: Pool fills even though every dataset has a quota. Multiple tenants hit quota simultaneously during a churn event.
Why it happens: Quotas cap individuals, not the sum. If you overcommit quotas, the pool can still fill when everyone grows at once.
Fix: Monitor pool free space and set operational guardrails (alerts at 80/85/90%). Use reservations only for critical datasets. Keep quotas for fairness, not safety.
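The guardrail can be as boring as a cron job. A minimal sketch; the pool name, threshold, and the way the warning reaches a human are all placeholders for your environment:
cr0x@server:~$ cat /usr/local/bin/check-pool-capacity.sh
#!/bin/sh
# Warn when the pool crosses a capacity threshold (hypothetical script and values).
# Wire the echo into whatever actually pages you: mail, webhook, monitoring agent.
POOL="tank"
THRESHOLD=85
CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')
if [ "$CAP" -ge "$THRESHOLD" ]; then
  echo "WARNING: pool $POOL is ${CAP}% full (threshold ${THRESHOLD}%)"
fi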
Mistake 2: Setting reservations “just in case” across the board
Symptom: Pool appears to have free space, but many datasets show low AVAIL or writes fail unpredictably. Teams see conflicting numbers between tools.
Why it happens: Reservations pre-consume allocatable pool space. Too many reservations make the pool functionally full even when it isn’t physically full.
Fix: List reservations recursively, justify each one, and remove/resize them. Prefer refreservation for targeted headroom rather than broad subtree reservations.
Mistake 3: Using quota when you meant refquota (snapshots make it hurt)
Symptom: Tenants complain that space usage doesn’t drop after deletions; they hit quota despite “cleaning up.”
Why it happens: Snapshots retain old blocks. quota doesn’t separate live data from snapshot history the way tenants expect.
Fix: Use refquota for tenant limits if snapshots are managed centrally. Or move snapshotting to a parent dataset and keep tenant datasets snapshot-free, depending on your governance model.
Mistake 4: Applying quota at the wrong level of the dataset tree
Symptom: A quota “does nothing” or has surprising reach. A child dataset fills the parent’s quota unexpectedly.
Why it happens: quota applies to dataset + descendants. Setting it on the wrong parent changes who is included.
Fix: Visualize the dataset tree. Apply quotas at the tenant root. Use refquota if you only want to cap a single dataset’s referenced usage.
Mistake 5: Confusing zvol sizing with quotas
Symptom: VM disk “runs out of space” even though dataset quota looks generous, or the pool fills unexpectedly due to thin provisioning assumptions.
Why it happens: A zvol’s volsize defines device size; space is allocated as blocks are written, and snapshots can pin old blocks too. Quota/reservation behaviors can differ depending on how you provision.
Fix: For zvol-backed VMs, track zvol usage and snapshot policies carefully. Consider refreservation for critical zvols if overcommit is risky.
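When auditing zvol-backed VMs, look at the sizing and guarantee properties together; a sketch with a hypothetical zvol name:
cr0x@server:~$ zfs get -H -o property,value volsize,used,referenced,refreservation,usedbysnapshots tank/vm/web01-disk0
# volsize is what the guest sees; used/referenced is what the pool actually pays today;
# refreservation (set automatically for non-sparse zvols) is what you have promised; and
# usedbysnapshots is the history you are keeping on top of all of it.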
Mistake 6: Setting quotas too tight for copy-on-write workloads
Symptom: Database compaction, VM image updates, or build steps fail even though steady-state usage is below quota.
Why it happens: Rewrites need temporary extra allocation; snapshots amplify this. You need headroom for transactional rewriting.
Fix: Build in burst space. Use refreservation or just set quotas with slack. Reduce snapshot frequency on churny datasets.
10. Checklists / step-by-step plan
Checklist A: Designing tenant storage (the “don’t page me” plan)
- Create a dataset per tenant (or per environment) as the management boundary.
- Decide whether snapshots are tenant-owned or platform-owned. Don’t split ownership accidentally.
- If snapshots are platform-owned, prefer refquota for tenant limits.
- Set quota only when you explicitly want to include children (common for “tenant root”).
- Add a small reservation or refreservation only for workloads that must keep writing under pool pressure.
- Alert on pool capacity, not just dataset usage. Quotas don’t save you from aggregate growth.
- Document which datasets are allowed to have reservations, and why.
Checklist B: Responding to “no space left” in production
- Check pool capacity: zpool list.
- Check pool allocatable space: zfs get available tank.
- Find datasets at AVAIL 0B: zfs list -o name,used,avail,refer -S used.
- For the affected dataset, inspect limits and snapshot usage: zfs get usedbysnapshots,quota,refquota,reservation,refreservation.
- If snapshots are the culprit, delete/prune snapshots according to policy; don’t “rm” expecting miracles.
- If reservations are starving the pool, reduce/remove non-critical reservations.
- Only then consider raising quotas (and treat it as capacity planning, not a fix).
Checklist C: Quarterly hygiene (the boring practice that keeps working)
- Inventory all non-default quota/refquota/reservation/refreservation settings.
- Confirm reservations still match criticality and growth projections.
- Review snapshot policies on churn-heavy datasets (CI, scratch, temp DBs).
- Spot datasets where USED is much larger than REFER; investigate why.
- Verify monitoring: pool capacity alerts, snapshot growth alerts, and “datasets with AVAIL near zero.”
11. FAQ
Q1: If I set a quota, does ZFS “allocate” that space?
No. A quota is a limit, not a pre-allocation. Other datasets can consume the pool until the quota-bound dataset tries to write and finds the pool (or its quota) doesn’t allow it.
Q2: If I set a reservation, does it write zeros to disk?
No. It reserves allocatable space in the pool accounting. It’s a promise, not a prefill. The pool’s “available” space for others drops immediately, though.
Q3: Should I use quota or refquota for tenants?
If tenants should be responsible for snapshots and children, use quota. If snapshots are managed by the platform team (or you want tenant limits to reflect mostly live data), use refquota. The correct answer is about ownership and expectations as much as bytes.
Q4: Why did deleting files not free space?
Because snapshots (or clones) still reference the old blocks. Check usedbysnapshots and list snapshots by USED. Space is freed when the last reference is destroyed, not when a file disappears from the live view.
Q5: Why does df disagree with zpool list?
df reports dataset-level availability after quotas/reservations. zpool list reports pool-level raw allocation. Both are right, just answering different questions. When they disagree, look for quotas, reservations, and slop space.
Q6: Can reservations cause an outage?
Yes—by starving unreserved workloads even when physical space exists. Reservations are powerful and should be treated like capacity commitments. If you wouldn’t sign it as a contract, don’t set it as a reservation.
Q7: Do quotas account for compression?
Quotas are enforced against ZFS’s own space accounting, which is post-compression: well-compressing data counts for less against a quota than its apparent file size. Users can therefore fit more data under a quota than the raw number suggests, and the usage they see rarely matches what they compute from file sizes. This is good for predictability, annoying for explanations.
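To see the gap between what users think they stored and what the limits are judged against, compare the logical and allocated numbers; a minimal sketch against a tenant dataset:
cr0x@server:~$ zfs get -H -o property,value used,logicalused,referenced,logicalreferenced,compressratio tank/tenants/acme
# logicalused/logicalreferenced are pre-compression; used/referenced are post-compression
# and are what quota and refquota are enforced against. A 1.8x compressratio means the
# tenant is fitting noticeably more file data under the cap than the raw number suggests.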
Q8: What’s the simplest safe pattern for critical system datasets?
Give system datasets (logs, core services) a modest reservation so the OS can keep writing during a pool pressure event. Keep the reservation small, reviewed, and justified—enough for incident survival, not enough to turn into silent overcommit.
Q9: If I have reservations, can I still overcommit the pool with quotas?
Yes. Quotas can still sum to more than remaining pool capacity. Reservations reduce what’s truly available. Treat quota totals as “potential demand,” not “allocated supply.”
Q10: How do I tell whether snapshots are the main culprit quickly?
Compare USED vs REFER for the dataset, then check usedbysnapshots. If snapshots dominate, pruning snapshots (per policy) will usually free the most space fastest.
12. Conclusion
Quota and reservation are not competing features. They are a pair: quota prevents a dataset from taking too much, and reservation prevents a dataset from being squeezed out. In real production systems, you typically need both—just not everywhere, and not without understanding snapshots.
If you take one operational rule from this: don’t argue about “free space” until you’ve checked ZFS’s own accounting for usedbysnapshots, quotas, and reservations. ZFS is rarely lying; it’s usually answering a question you didn’t realize you asked.