Multi-tenant storage fails in the least poetic way possible: the pool hits 100%, metadata updates stall, and suddenly your “small issue”
becomes an outage with a meeting invite. One noisy tenant doesn’t need malice; a runaway build cache or a log loop will do.
ZFS gives you the tools to keep tenants in their lane. The trick is choosing the right kind of quota, placing it at the right boundary,
and understanding how snapshots and reservations bend your mental model. If you get any of those wrong, you haven’t enforced fairness—
you’ve just created a new and exciting failure mode.
Design goals: what “safe multi-tenant” actually means
Multi-tenant ZFS safety is not “everyone gets a quota.” It’s a set of explicit outcomes:
- One tenant cannot fill the pool (or if they can, you notice early and the blast radius is bounded).
- Pool free space stays above a safety floor so ZFS can allocate, flush, and keep latency sane.
- Tenants get predictable errors: ideally EDQUOT (quota exceeded), not ENOSPC (pool full), and not “everything is slow.”
- Operations can explain space usage without interpretive dance: “this dataset is big because snapshots” is a real answer.
- Deletion works when you need it. “Disk full and you can’t delete” is a classic storage horror story.
You’re designing boundaries. Datasets are those boundaries. Quotas enforce them. Reservations guarantee them. And snapshots are the
boundary-crossing gremlins you must account for or you’re just doing performance art.
Opinionated guidance: use datasets as tenant containers, not just directories. If you can’t put a ZFS property on it,
you can’t reliably govern it.
Interesting facts and historical context
- ZFS was built at Sun in the mid-2000s with end-to-end data integrity and pooled storage as first-class goals, not bolt-ons.
- Quotas arrived early because ZFS expected consolidation: multiple consumers sharing a pool, each needing predictable limits.
- Snapshots are cheap to create because they’re metadata-only at birth; the cost shows up later via referenced blocks.
- “Referenced” vs “used” in ZFS reporting exists specifically because snapshots complicate “how much space is mine?”
- Reservations were designed for fairness and availability: they keep critical datasets alive even when the pool is pressured.
- Zvols and filesystems are governed differently: quotas on filesystems don’t directly map to zvol consumers; provisioning strategies matter.
- Historically, ZFS wanted free space headroom (often 10–20%) to keep allocation efficient and avoid pathological fragmentation and latency spikes.
- OpenZFS evolved the tooling (like expanded quota reporting) as operators deployed it in larger, noisier multi-tenant environments.
Quota primitives: quota, refquota, reservations, and why names matter
Dataset boundaries are the policy boundary
ZFS doesn’t do “quotas on a directory tree” in the same native way traditional filesystems do. It does properties on datasets.
That’s a feature. It forces you to define real tenants. A tenant is a dataset. Everything else is an implementation detail.
quota: limits the dataset and its descendants
quota caps the total space a dataset can consume, including space used by descendants (child datasets).
This is the right tool when the tenant owns a subtree of datasets.
But it’s also the tool that surprises people because it interacts with snapshots. If your tenant’s dataset has snapshots,
the blocks held by snapshots count toward usage in a way that can be unintuitive. If you want “the tenant’s live data” capped,
you probably want refquota.
refquota: limits referenced space (live data), not snapshots
refquota caps the dataset’s referenced space: the blocks currently reachable from the dataset’s head.
Snapshots are not part of “referenced,” so tenants can’t get stuck because old snapshots are holding space hostage.
That sounds like magic. It’s not. The pool can still fill because snapshots still consume pool space. You’ve just moved the blast radius:
you prevented the tenant from getting random EDQUOT because of retention, but you did not prevent pool-wide ENOSPC.
reservation and refreservation: guaranteed space, but not free lunch
Reservations carve out space that cannot be used by others. They’re your “keep this service alive” lever.
reservation guarantees space for the dataset and all of its descendants. refreservation guarantees space for the dataset's referenced data alone, excluding snapshots and children.
Reservations can save you in a pool pressure event. They can also turn “we are low” into “we are dead” if overused, because they make
free space look available to the pool but unavailable to most datasets.
Why “one user killing the pool” still happens with quotas
Quotas stop a tenant from writing beyond a limit. They do not automatically enforce a pool-wide safety floor.
If you set quotas that sum to 200% of the pool, you’ve created oversubscription. That might be fine for many workloads.
It might also be how you end up learning what “space accounting under snapshots” means at high speed.
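The oversubscription check itself is one line of arithmetic, so there is no excuse not to run it. A minimal sketch with made-up quota values; in practice you would feed it numbers parsed from `zfs get -Hp quota`:

```python
# Sketch: audit quota oversubscription against pool size.
# The sample values are illustrative, not from a real pool.

def oversubscription_ratio(quotas_bytes, pool_size_bytes):
    """Return sum(quotas) / pool size. Above 1.0 means oversubscribed."""
    return sum(quotas_bytes) / pool_size_bytes

TIB = 1024 ** 4
# Hypothetical tenant quotas: 5T + 3T + 7T on a 21.8T pool.
quotas = [5 * TIB, 3 * TIB, 7 * TIB]
pool = 21.8 * TIB

ratio = oversubscription_ratio(quotas, pool)
print(f"oversubscription: {ratio:.2f}x")  # above 1.0x needs a written justification
```

Oversubscription is not automatically wrong; undocumented oversubscription is.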
Paraphrased idea: when you build systems, you trade easy problems for hard ones; reliability work is choosing the hard problems you can monitor.
— Charity Majors (paraphrased)
Also: quotas don’t reduce write amplification. A tenant can stay under quota and still destroy latency by forcing fragmentation,
sync-heavy workloads, or small-block churn. Quotas are about capacity governance, not performance governance. You need both.
Joke #1: A quota is like a diet—effective until you discover snapshots are the midnight snacks you didn’t log.
Dataset layout models that don’t hate you back
Model A: one dataset per tenant (the default winner)
Create pool/tenants/$tenant as a filesystem dataset. Put everything for that tenant there.
Apply quotas, compression, recordsize choices, snapshot policies, and mountpoints per tenant.
Pros: clean governance, easy reporting, low cognitive load. Cons: more datasets (which is fine until you get silly), and you need automation.
Model B: parent dataset with child datasets per service
Example: pool/tenants/acme/home, pool/tenants/acme/db, pool/tenants/acme/cache.
Put a quota on the parent to bound the total tenant footprint, and refquota on specific children to keep live data sane.
This model lets you tune properties per workload (database recordsize, logbias, compression) while still enforcing a tenant-level cap.
It’s a grown-up design when you operate platform services.
Model C: directory-per-tenant inside one dataset (avoid)
Traditional UNIX admins love this because it’s simple: /srv/tenants/acme, /srv/tenants/zenith.
On ZFS, it’s the wrong abstraction. You lose native governance and end up bolting on user/group quotas, project quotas, or external tooling.
There are valid reasons—like millions of tenants where dataset count becomes a management issue—but make that choice with eyes open.
For most corporate multi-tenant systems (dozens to thousands), dataset-per-tenant is both safer and simpler.
Model D: zvol-per-tenant (only when you must)
If tenants need block devices (VM disks, iSCSI LUNs), you’ll use zvols. A zvol’s capacity cap is its volsize property; filesystem quotas don’t apply.
Thin provisioning can oversubscribe a pool hard if you’re not careful. For multi-tenant, you must pair this with strict monitoring
and a pool safety floor.
Snapshots: the silent quota bypass
The two most common “quota surprises” are:
- The tenant hits their quota even after deleting a bunch of files.
- The tenant stays under quota but the pool still fills and everyone suffers.
How snapshots mess with deletion
If a snapshot references blocks that a file used, deleting the file from the live dataset doesn’t free those blocks. The snapshot still owns them.
This is why operators say “space is stuck in snapshots.” It’s not stuck; it’s correctly accounted to history.
If you used quota (not refquota), snapshot-held blocks contribute to “used” and can keep a tenant pinned at quota.
The tenant will swear they deleted things. They did. Your retention policy disagrees.
Why refquota helps users but can hurt pools
refquota is a user-experience improvement: it makes quota enforcement track the live dataset head.
But it shifts the risk: snapshots can grow until the pool is pressured. If you choose refquota, you must also choose:
snapshot limits, retention discipline, and pool-wide alerting.
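One way to operationalize that: rank datasets by USED minus REFER, the space owned by history and descendants rather than the live head. A sketch assuming input shaped like `zfs list -Hp -o name,used,refer` (tab-separated, bytes); the sample numbers are invented:

```python
# Sketch: rank datasets by USED - REFER, the space held outside the
# live head (snapshots plus descendants). Input mimics
# `zfs list -Hp -o name,used,refer`; the sample is made up.

SAMPLE = """\
tank/tenants/zenith\t6827933661184\t1121501859840
tank/tenants/acme\t4266146529280\t3980190015488
tank/tenants/blue\t2693841043456\t2638827906662
"""

def snapshot_overhead(listing):
    rows = []
    for line in listing.strip().splitlines():
        name, used, refer = line.split("\t")
        rows.append((name, int(used) - int(refer)))
    # Largest gap first: these are the datasets where history, not
    # live data, owns the space.
    return sorted(rows, key=lambda r: r[1], reverse=True)

for name, gap in snapshot_overhead(SAMPLE):
    print(f"{name}: {gap / 1024**4:.2f} TiB held outside the live head")
```

Run it from cron, alert on growth, and the refquota blind spot stops being blind.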
Snapshot retention is policy, not a backup strategy
Snapshots are great for short-term rollback, replication streams, and forensic recovery. They are not a license to keep everything forever
on your hottest pool. Treat retention like a budget: define it, enforce it, and review it when tenants change behavior.
Joke #2: Snapshots are like office junk drawers—nobody wants them, but everyone panics when you try to empty them.
Practical tasks (commands, output, decisions)
The fastest way to get quotas right is to run the same small set of commands every time, and interpret them consistently.
Below are real tasks you can execute on a ZFS host. Each includes: command, what the output means, and what decision to make.
Task 1: Confirm pool health and whether you’re already in trouble
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: no known pool errors. This does not mean you have free space, nor does it mean performance is fine.
Decision: if this is not “healthy,” fix hardware/pool errors first. Quotas won’t save a degraded pool from bad latency.
Task 2: Check pool capacity, fragmentation, and headroom
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 21.8T 18.9T 2.9T 86% 42% ONLINE
Meaning: 86% used, fragmentation rising. Many ZFS pools get unpleasant above ~85–90%, depending on workload.
Decision: if cap > 85%, treat quotas as secondary; you need a capacity plan (delete snapshots, add vdevs, move tenants).
Task 3: Identify the biggest datasets first (the usual suspects)
cr0x@server:~$ zfs list -o name,used,refer,avail,mountpoint -S used | head -n 10
NAME USED REFER AVAIL MOUNTPOINT
tank/tenants 12.5T 192K 2.90T /srv/tenants
tank/tenants/zenith 6.21T 1.02T 790G /srv/tenants/zenith
tank/tenants/acme 3.88T 3.62T 1.12T /srv/tenants/acme
tank/tenants/blue 2.45T 2.40T 550G /srv/tenants/blue
tank/backups 1.91T 1.88T 2.90T /tank/backups
Meaning: notice USED vs REFER. zenith has huge USED but small REFER: snapshots or descendants own the difference.
Decision: if USED ≫ REFER, investigate snapshots/children before yelling at the tenant.
Task 4: See quotas and reservations applied across tenants
cr0x@server:~$ zfs get -r -o name,property,value,source quota,refquota,reservation,refreservation tank/tenants | head -n 25
NAME PROPERTY VALUE SOURCE
tank/tenants quota none default
tank/tenants refquota none default
tank/tenants reservation none default
tank/tenants refreservation none default
tank/tenants/acme quota 5T local
tank/tenants/acme refquota none default
tank/tenants/acme reservation none default
tank/tenants/acme refreservation none default
tank/tenants/blue quota 3T local
tank/tenants/blue refquota 2500G local
tank/tenants/blue reservation none default
tank/tenants/blue refreservation none default
tank/tenants/zenith quota 7T local
tank/tenants/zenith refquota 1500G local
tank/tenants/zenith reservation 500G local
tank/tenants/zenith refreservation none default
Meaning: you can audit governance quickly. Mixed strategy is fine, but it must be intentional.
Decision: if tenants rely on “deletes free space,” favor refquota plus snapshot controls. If you want “all in,” use quota.
Task 5: Set a tenant quota (hard cap) and immediately verify
cr0x@server:~$ sudo zfs set quota=2T tank/tenants/acme
cr0x@server:~$ zfs get -o name,property,value quota tank/tenants/acme
NAME PROPERTY VALUE
tank/tenants/acme quota 2T
Meaning: writes that would exceed 2T for that dataset subtree will fail with quota errors.
Decision: if acme has child datasets, remember quota includes them. If you want only the head dataset capped, use refquota.
Task 6: Set refquota for “live data” and confirm refer behavior
cr0x@server:~$ sudo zfs set refquota=1500G tank/tenants/acme
cr0x@server:~$ zfs get -o name,property,value refquota tank/tenants/acme
NAME PROPERTY VALUE
tank/tenants/acme refquota 1500G
Meaning: the dataset head can’t exceed 1.5T referenced. Snapshots can still grow.
Decision: pair this with snapshot retention/limits or you’re just postponing the argument until the pool is full.
Task 7: Guarantee headroom for a critical service using reservation
cr0x@server:~$ sudo zfs set reservation=200G tank/tenants/platform
cr0x@server:~$ zfs get -o name,property,value reservation tank/tenants/platform
NAME PROPERTY VALUE
tank/tenants/platform reservation 200G
Meaning: 200G is carved out for that dataset tree. Other tenants can’t consume it.
Decision: use reservations sparingly. They are for “must keep running” datasets, not for political comfort.
Task 8: Spot snapshot-driven usage growth on a dataset
cr0x@server:~$ zfs list -d 1 -t snapshot -o name,used,refer,creation -S used tank/tenants/zenith | head -n 8
NAME USED REFER CREATION
tank/tenants/zenith@daily-2025-12-25 210G 1.02T Thu Dec 25 01:00 2025
tank/tenants/zenith@daily-2025-12-24 198G 1.01T Wed Dec 24 01:00 2025
tank/tenants/zenith@daily-2025-12-23 176G 1.00T Tue Dec 23 01:00 2025
tank/tenants/zenith@daily-2025-12-22 165G 1008G Mon Dec 22 01:00 2025
tank/tenants/zenith@daily-2025-12-21 152G 1004G Sun Dec 21 01:00 2025
tank/tenants/zenith@daily-2025-12-20 141G 1001G Sat Dec 20 01:00 2025
tank/tenants/zenith@daily-2025-12-19 135G 999G Fri Dec 19 01:00 2025
Meaning: each snapshot’s USED is the unique blocks held by that snapshot. Growth here often means churn (rewrites) in the live dataset.
Decision: if snapshot USED is ballooning, shorten retention, move churny workloads, or tune workload (e.g., stop rewriting giant files).
Task 9: Confirm what space is actually available to a tenant under quota
cr0x@server:~$ zfs list -o name,avail,used,quota,refquota tank/tenants/acme
NAME AVAIL USED QUOTA REFQUOTA
tank/tenants/acme 320G 1.68T 2T 1500G
Meaning: AVAIL reflects the tighter of pool free space and quota/refquota headroom. Here the 2T quota is the limiter: 2T minus 1.68T USED leaves 320G.
Decision: if AVAIL is unexpectedly tiny, check whether refquota is lower than intended, or whether snapshots/descendants are counted via quota.
Task 10: Find which children are consuming a parent tenant quota
cr0x@server:~$ zfs list -r -o name,used,refer,quota,refquota -S used tank/tenants/acme
NAME USED REFER QUOTA REFQUOTA
tank/tenants/acme 1.68T 1.45T 2T 1500G
tank/tenants/acme/cache 220G 210G none 250G
tank/tenants/acme/db 110G 108G none none
tank/tenants/acme/home 35G 34G none none
Meaning: the cache is large and close to its refquota. That’s often correct: caches should be bounded.
Decision: if the cache is unbounded, set a refquota. If db is spiky, consider separate quotas and reservation to keep it alive.
Task 11: Identify whether “space not freeing” is snapshots vs open files
cr0x@server:~$ sudo zfs destroy -nv tank/tenants/zenith@daily-2025-12-19
would destroy tank/tenants/zenith@daily-2025-12-19
would reclaim 135G
Meaning: a dry-run destroy tells you reclaimable space if you remove a snapshot. This is gold for decision-making.
Decision: if reclaimable space is large and you’re in trouble, delete snapshots (starting with oldest) per policy.
If reclaim is tiny, you’re not chasing snapshots—look for open-but-deleted files or other datasets.
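When the answer is “delete snapshots,” you can plan the deletions as a budget. A sketch with hypothetical sizes; note that per-snapshot USED undercounts blocks shared between adjacent snapshots, so deleting a run often reclaims more than the sum. Treat the estimate as a lower bound and confirm with the dry run before acting:

```python
# Sketch: pick the shortest oldest-first run of snapshot deletions
# whose summed USED meets a reclaim target. Per-snapshot USED is a
# lower bound (shared blocks are charged to neither neighbor), so
# always confirm the real number with a dry-run destroy.
# Snapshot names and sizes below are hypothetical.

GIB = 1024 ** 3
# (snapshot, USED bytes), oldest first.
snaps = [
    ("daily-2025-12-19", 135 * GIB),
    ("daily-2025-12-20", 141 * GIB),
    ("daily-2025-12-21", 152 * GIB),
]

def plan_deletions(snaps_oldest_first, target_bytes):
    """Return (names to delete, conservative reclaim estimate)."""
    total, plan = 0, []
    for name, used in snaps_oldest_first:
        if total >= target_bytes:
            break
        plan.append(name)
        total += used
    return plan, total

plan, est = plan_deletions(snaps, 250 * GIB)
print(plan, f"~{est / GIB:.0f}G lower-bound reclaim")
```

Oldest-first matters: deleting the newest snapshot first tends to reclaim less and breaks incremental replication chains.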
Task 12: Check for open-but-deleted files pinning space (classic ENOSPC bait)
cr0x@server:~$ sudo lsof +L1 /srv/tenants/acme | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
java 2714 acme 12w REG 0,118 2147483648 0 553211 /srv/tenants/acme/logs/app.log (deleted)
Meaning: the process still holds a file descriptor to a deleted file. Space won’t be freed until the process closes it.
Decision: restart or signal the process to reopen logs. Don’t delete more files; you’ll just create more “deleted but open.”
Task 13: Confirm mountpoints and avoid “writing outside the dataset you think”
cr0x@server:~$ zfs list -o name,mountpoint,canmount tank/tenants/acme
NAME MOUNTPOINT CANMOUNT
tank/tenants/acme /srv/tenants/acme on
Meaning: if mountpoints are wrong, tenants can write to the parent dataset (which has no quota) and bypass limits.
Decision: verify every tenant’s mountpoint and ensure parent datasets are not writable by tenants.
Task 14: Apply “boring guardrails” on the parent dataset
cr0x@server:~$ sudo zfs set readonly=on tank/tenants
cr0x@server:~$ zfs get -o name,property,value readonly tank/tenants
NAME PROPERTY VALUE
tank/tenants readonly on
Meaning: tenants can’t accidentally write to the parent mount (if it’s even mounted). Beware that readonly is inherited: give each tenant dataset an explicit readonly=off, or use canmount=off on the parent so it is never mounted at all.
Decision: for multi-tenant, make parents non-writable (or unmounted) and use explicit child mountpoints. It prevents accidental bypass.
Task 15: Monitor per-dataset logical space pressure (quota nearing)
cr0x@server:~$ zfs list -o name,used,quota,refquota,available -r tank/tenants | awk 'NR==1 || $3!="none" || $4!="none"{print}'
NAME USED QUOTA REFQUOTA AVAIL
tank/tenants/acme 1.68T 2T 1500G 320G
tank/tenants/blue 2.45T 3T 2500G 550G
tank/tenants/zenith 6.21T 7T 1500G 790G
Meaning: a quick view of governed datasets. AVAIL gives you a near-term “will writes fail soon?” indicator.
Decision: alert on %used of quota and also on pool cap. A tenant can be fine while the pool is not.
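The alerting logic is simple enough to sketch outright. The thresholds and sample numbers below are assumptions to adapt, not a recommendation for your pool:

```python
# Sketch: two-level alerting on per-tenant quota usage plus pool
# capacity. Thresholds and sample numbers are assumptions.

def alerts(tenants, pool_cap_pct,
           tenant_warn=0.80, tenant_crit=0.90,
           pool_levels=(80, 85, 90)):
    out = []
    # Per-tenant: fraction of quota consumed.
    for name, used, quota in tenants:
        frac = used / quota
        if frac >= tenant_crit:
            out.append(f"CRIT tenant {name} at {frac:.0%} of quota")
        elif frac >= tenant_warn:
            out.append(f"WARN tenant {name} at {frac:.0%} of quota")
    # Pool-wide: fire every threshold already crossed, so nobody can
    # miss the trend line.
    for level in pool_levels:
        if pool_cap_pct >= level:
            out.append(f"POOL at {pool_cap_pct}% (threshold {level}%)")
    return out

TIB = 1024 ** 4
sample = [("acme", 1.68 * TIB, 2 * TIB), ("blue", 2.45 * TIB, 3 * TIB)]
for line in alerts(sample, pool_cap_pct=86):
    print(line)
```

The point of the two loops: a tenant alert and a pool alert are different pages with different runbooks, and you need both.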
Task 16: For zvol tenants, verify thin provisioning risk
cr0x@server:~$ zfs list -t volume -o name,volsize,used,refer,logicalused,logicalrefer -S logicalused
NAME VOLSIZE USED REFER LOGICALUSED LOGICALREFER
tank/vm/tenant01 800G 120G 120G 640G 640G
tank/vm/tenant02 800G 160G 160G 790G 790G
Meaning: logicalused shows what the guest thinks it used; USED is what the pool actually allocated.
Thin provisioning hides risk until it doesn’t.
Decision: if logicalused approaches volsize across many tenants, treat it as real capacity pressure and budget space accordingly.
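That check is worth automating. A sketch flagging zvols whose guest-visible usage approaches volsize; the 90% threshold and the sample numbers (mirroring the illustrative listing above) are assumptions:

```python
# Sketch: flag thin-provisioned zvols whose guest-visible usage
# (logicalused) is closing in on volsize, i.e. space the pool may
# still owe. Threshold and sample numbers are assumptions.

GIB = 1024 ** 3

def thin_risk(zvols, threshold=0.90):
    """Return (name, worst-case bytes still owed) for risky zvols."""
    risky = []
    for name, volsize, used, logicalused in zvols:
        if logicalused / volsize >= threshold:
            # Worst case the pool still owes roughly volsize - used
            # (compression may soften this, but don't bank on it).
            risky.append((name, volsize - used))
    return risky

zvols = [
    ("tank/vm/tenant01", 800 * GIB, 120 * GIB, 640 * GIB),
    ("tank/vm/tenant02", 800 * GIB, 160 * GIB, 790 * GIB),
]
for name, owed in thin_risk(zvols):
    print(f"{name}: pool may still owe ~{owed / GIB:.0f}G")
```

Sum the “owed” column across all tenants and compare it to pool free space; that is your real thin-provisioning exposure.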
Fast diagnosis playbook
When a multi-tenant pool is in trouble, you don’t have time for philosophical purity. You need a fast path to: “what is filling what?”
and “is this capacity or performance?”
First: confirm whether you have a pool-wide emergency
- Pool capacity: zpool list -o name,alloc,free,cap,frag. If cap is > 90%, assume everything will get weird.
- Pool health: zpool status. If degraded, expect worse latency and slower deletes.
- Immediate reclaim candidates: zfs list -t snapshot -o name,used -S used.
Second: identify whether the pain is “quota hit” or “pool full”
- If tenants see errors like “Disk quota exceeded,” you’re dealing with dataset-level governance.
- If everyone sees “No space left on device,” you’re dealing with pool-level exhaustion or reservation starvation.
- Check zfs get available,quota,refquota on the impacted dataset and compare to pool free.
Third: decide snapshots vs open files vs a different dataset
- Snapshots: if USED ≫ REFER on the dataset, list snapshots and do a dry-run destroy to estimate reclaim.
- Open-but-deleted files: run lsof +L1 on the mount. If present, restart the offender.
- Wrong mountpoint / bypass: verify mountpoints and check whether writes landed in a parent dataset with no quota.
Fourth: if performance is the symptom, don’t confuse it with capacity
- High fragmentation + high cap can look like “quota issues” because writes time out or stall.
- Measure IO pressure with zpool iostat -v 1 and look for saturated vdevs.
- If you’re near full, your best “performance tuning” is freeing space.
Three corporate mini-stories from the quota trenches
Mini-story 1: the outage caused by a wrong assumption
A mid-sized company ran a shared ZFS pool for internal teams: analytics, build systems, a few web properties. They did the sensible thing:
dataset per team, quotas on each dataset. They were proud. The pool was stable. Then one Monday, half the CI jobs failed with ENOSPC.
The on-call assumed a team had exceeded its quota. But quotas were fine. Each team dataset still had headroom.
The pool, however, was at 98%, and ZFS was behaving like a storage system at 98%: allocation got expensive, and metadata updates slowed down.
The wrong assumption was subtle: “If every team has a quota, the pool can’t fill.” Quotas don’t sum themselves into safety.
They had oversubscribed—quietly—because quotas were set based on business expectations, not on actual pool capacity, and retention wasn’t bounded.
The real culprit: automated snapshots kept for “a while,” which slowly became “forever” because nobody wanted to delete history.
A single team with a high-churn workload (large artifacts rewritten daily) caused snapshot growth. Their live data stayed under refquota,
but snapshots steadily ate the pool.
The fix wasn’t heroic. They defined snapshot retention per tenant class, added snapshot count limits, and set a pool safety alert at 80/85/90%.
They also started a monthly review of datasets where USED-REFER exceeded a threshold. Boring, consistent, effective.
Mini-story 2: the optimization that backfired
Another company offered “developer sandboxes” on ZFS. They wanted a great developer experience, so they switched many tenant datasets
from quota to refquota. The goal: stop devs from complaining that deleting files didn’t restore their ability to write
because snapshots were holding space.
It worked. Complaints dropped. The platform team celebrated with the kind of quiet satisfaction you only get from removing a whole class
of tickets. And then the pool started filling faster than expected, but nobody noticed immediately because tenant dashboards looked fine.
The backfire came from visibility. With refquota, tenants never hit their “limit” because their live data stayed bounded,
while snapshots were allowed to grow under the radar. The system had shifted the failure from “tenant can’t write” to “pool is full,”
which is a much worse failure in multi-tenant land.
The incident ended the usual way: they deleted snapshots under pressure, replication lag spiked, and a few restores became impossible.
Not catastrophic, but painful and avoidable.
The fix was to treat snapshot retention as part of quota governance. They implemented:
per-dataset snapshot caps, per-tenant snapshot schedules, and a report that ranked tenants by “snapshot-only space.”
Refquota stayed—but only with guardrails and a pool-wide free-space floor.
Mini-story 3: the boring but correct practice that saved the day
A regulated org ran a multi-tenant ZFS cluster for application teams. The storage engineers were allergic to surprises,
so they did two unsexy things: they kept 20% free space as policy, and they reserved a small slice for platform datasets
(logging, auth, monitoring spools).
One quarter-end, an app team’s batch job started producing far more output than normal. The tenant dataset hit its quota.
The app failed loudly—exactly what you want. The pool stayed healthy, monitoring stayed online, and other teams didn’t notice.
The on-call got a clean alert: “tenant quota exceeded.” Not “pool full.” Not “IO latency 10x.” Not “everything is on fire.”
They increased the tenant quota temporarily, but only after moving older snapshots to a colder pool and trimming retention.
The key wasn’t the quota by itself. It was the combination: a pool safety floor, reservations for essential services, and consistent reporting.
The incident stayed tenant-scoped. That’s the whole point of multi-tenant engineering.
Common mistakes: symptoms → root cause → fix
1) Symptom: “I deleted 500GB but I’m still at quota”
Root cause: snapshots still reference the deleted blocks; quota enforcement counts them.
Fix: either delete/expire snapshots, or switch to refquota for that dataset and control snapshots separately.
2) Symptom: tenant is under quota, but pool hits 100% anyway
Root cause: refquota limits only live data; snapshots, other datasets, and zvol thin provisioning still consume pool space.
Fix: enforce snapshot retention/limits, monitor “snapshot-only” growth (USED-REFER), and keep a pool-wide free-space floor.
3) Symptom: random ENOSPC even though zpool list shows free space
Root cause: reservations or special allocation constraints mean the free space isn’t usable for that dataset.
Fix: audit reservation/refreservation; reduce or remove non-critical reservations; ensure critical datasets have the reservations, not everything.
4) Symptom: tenant can write outside quota somehow
Root cause: writes are landing in a parent dataset (wrong mountpoint, bind-mount confusion, or permissions on parent mount).
Fix: lock parent datasets (readonly=on, canmount=off where appropriate), verify mountpoints, and restrict permissions.
5) Symptom: pool is not full, but latency is awful and writes crawl
Root cause: high fragmentation, small-block churn, sync-heavy workload, or a degraded vdev; capacity governance doesn’t solve IO saturation.
Fix: keep headroom, separate churny workloads into their own vdevs/pools, and measure with zpool iostat. Consider SLOG/special vdevs where appropriate.
6) Symptom: “space not freeing” after deleting big files, no snapshots found
Root cause: open-but-deleted files held by processes.
Fix: lsof +L1 to find offenders; restart or signal log rotation properly.
7) Symptom: tenant replication grows without obvious live growth
Root cause: frequent rewrites create lots of snapshot deltas; send streams grow even if live data stays stable.
Fix: reduce churn (app changes), adjust snapshot frequency, or move that tenant to a pool designed for churn.
Checklists / step-by-step plan
Step-by-step: set up a new tenant safely
- Create a dataset per tenant (or per tenant/service if you need different properties).
- Set mountpoint explicitly and ensure parent datasets are not writable by tenants.
- Choose quota model:
  - Use quota if snapshots count as “their problem” and you want a strict total cap.
  - Use refquota if you want “live data” capped and you manage snapshots centrally.
- Decide snapshot policy: frequency and retention. Put it in code, not tribal memory.
- Add alerting: quota %used, pool cap thresholds, and snapshot-only growth.
- Document the failure mode the tenant will see: EDQUOT vs ENOSPC and what they should do.
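The checklist above belongs in code, and even the command generation can be boring. A sketch that emits (not executes) the zfs commands for review; the pool layout, mount root, and refquota-based policy are assumptions to adapt:

```python
# Sketch: generate, but do not execute, the provisioning commands for
# a new tenant per the checklist above. Pool/path names and the
# refquota-based policy are assumptions; review before running.

def provision_commands(tenant, refquota="500G",
                       parent="tank/tenants", mount_root="/srv/tenants"):
    ds = f"{parent}/{tenant}"
    return [
        # Explicit mountpoint so writes land in the governed dataset.
        f"zfs create -o mountpoint={mount_root}/{tenant} {ds}",
        # Live-data cap; snapshot retention is governed centrally.
        f"zfs set refquota={refquota} {ds}",
        # Keep the parent non-writable so nothing bypasses the quota.
        f"zfs set readonly=on {parent}",
    ]

for cmd in provision_commands("newco", refquota="1T"):
    print(cmd)
```

Generating commands instead of running them keeps a human (or a review step) between policy and the pool, which is exactly where you want one.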
Step-by-step: enforce pool safety floor (the “don’t page me” plan)
- Pick a target free-space floor (commonly 10–20% depending on workload and vdev layout).
- Alert early at multiple thresholds (e.g., 80/85/90%), not just at 95% when it’s already miserable.
- Audit oversubscription: sum of quotas vs pool size; accept oversubscription only if you can explain why it’s safe.
- Limit snapshot growth: retention limits and (where supported by your tooling) snapshot count/space caps per tenant.
- Keep platform datasets reserved: monitoring, logging spools, and auth metadata should not be competing with tenants during an incident.
Step-by-step: respond when the pool is near full
- Stop the bleeding: identify the fastest reclaim (usually snapshots) and confirm reclaim with zfs destroy -nv.
- If tenants are writing outside quotas, fix mountpoints and permissions immediately.
- Check for open-but-deleted files and restart offenders.
- Trim snapshot retention temporarily, then restore a sane policy with approvals.
- Schedule capacity expansion or data movement; “we’ll be careful” is not a capacity plan.
FAQ
1) Should I use quota or refquota for tenants?
If tenants manage their own snapshots or you want “total footprint including history” capped, use quota.
If you centrally manage snapshots and want user experience to reflect live data, use refquota, but then you must govern snapshot growth separately.
2) Can quotas prevent a pool from hitting 100%?
Not by themselves. Quotas limit datasets. Pool-level exhaustion still happens via snapshots, other datasets, zvol thin provisioning,
reservations, and oversubscription. You still need a pool headroom policy and alerting.
3) Why does USED differ so much from REFER?
REFER is the space referenced by the dataset head (live view). USED includes snapshot-held blocks and descendants.
A big gap usually means snapshots or child datasets.
4) What error will applications see when a quota is hit?
Typically “Disk quota exceeded” (EDQUOT). If the pool itself is out of space, they’ll see “No space left on device” (ENOSPC),
which affects everyone and is far worse operationally.
5) If I delete snapshots, will I always get space back immediately?
Usually yes, but the amount reclaimed depends on block sharing. Use zfs destroy -nv on the snapshot to estimate reclaim.
If reclaim is small, the snapshot isn’t your main issue.
6) Are reservations a good way to “protect” each tenant?
No. Reservations are for protecting critical services, not for making everyone feel safe. Overusing reservations can starve the pool
and cause confusing ENOSPC behavior even when the pool reports free space.
7) How do I stop tenants from bypassing quotas by writing elsewhere?
Use dataset-per-tenant mountpoints, make parent datasets non-writable, verify mountpoint and canmount,
and ensure permissions don’t allow writes to shared parents.
8) Do snapshots count against refquota?
No, that’s the point. Snapshots still count against the pool, though. Refquota is a per-dataset live-data cap, not a pool safety mechanism.
9) What’s the simplest multi-tenant pattern that works in production?
One dataset per tenant, a clear quota model (quota or refquota), automated snapshots with strict retention,
and alerts on both tenant limits and pool headroom. Keep it boring.
Conclusion: next steps that prevent the 2 a.m. page
ZFS quotas are not a nice-to-have; they’re how you prevent one tenant from turning shared storage into a shared incident.
But quotas only work when your dataset layout matches your tenancy model, and when snapshots and reservations are treated as first-class policy.
Practical next steps:
- Audit your tenant boundaries: if tenants are directories, plan a migration to dataset-per-tenant.
- Pick quota semantics intentionally: quota for total footprint, refquota for live-data UX, then implement the missing guardrails.
- Implement snapshot retention limits and a report for “snapshot-only space” growth.
- Set a pool free-space floor and alert before 85% usage; don’t wait for 95% to discover physics.
- Reserve space only for platform-critical datasets so you can still operate when tenants misbehave.
Do this well and “one user killed the pool” becomes a story you tell new hires as a warning, not a quarterly tradition.