Everything is fine until it isn’t: a backup job starts, a database stalls, and suddenly the VM hosts feel “slow” in a way nobody can quantify.
You stare at graphs, argue about “storage latency,” and somebody suggests rebooting things “to clear it out.”
The fix is rarely heroic. It’s boring structure: one ZFS dataset per workload, with explicit properties, explicit snapshot rules, and explicit limits.
That’s how you prevent the storage pool from turning into a shared kitchen where everyone cooks curry at the same time and nobody can find a clean pan.
The trick: datasets are management boundaries
ZFS is not just a filesystem. It’s a policy engine welded to an allocator, with checksumming and copy-on-write as the glue.
That policy engine is mostly controlled through dataset properties. And those properties matter only if you use datasets as boundaries.
A “dataset per workload” strategy means you stop treating your pool like one big directory tree where everything inherits whatever the last admin felt like.
Instead, you create datasets that map to real workloads: a database, a VM image store, an object cache, a build artifact repo, home directories,
container layers, a backup landing zone. Each gets deliberate settings.
The point isn’t “more datasets because ZFS can.” The point is operational control:
- Performance isolation: recordsize, logbias, atime, sync behavior, special_small_blocks can be tuned per workload.
- Safety isolation: snapshot cadence, retention, holds, and replication can be per dataset.
- Capacity discipline: quotas/reservations prevent one workload from eating your pool and calling it “temporary.”
- Blast-radius control: you can roll back or send/receive one workload without touching neighbors.
- Diagnosability: zfs get, zfs list, and per-dataset stats become meaningful instead of a soup.
If you remember one thing: your ZFS pool is the hardware budget; datasets are your contracts.
Without contracts, you get surprise bills.
One short joke, as a service: If your storage is “one big dataset,” it’s not architecture—it’s an emotion.
Interesting facts and historical context (because scars have history)
- ZFS originated at Sun Microsystems in the early-to-mid 2000s, designed to replace traditional volume managers and filesystems with one integrated system.
- The copy-on-write model was a direct answer to silent corruption and the “write hole” class of issues that haunted older RAID + filesystem stacks.
- ZFS snapshots were designed to be cheap because they’re just references to existing blocks—until you keep them forever and wonder why deletes don’t free space.
- Dataset property inheritance is intentional: it enables policy trees, but it also enables accidental policy inheritance—like a production database inheriting a “test” snapshot schedule.
- Compression became mainstream practice in ZFS because it often improves performance by reducing I/O, especially on spinning disks and saturated arrays.
- Recordsize exists because “one block size fits all” fails: large sequential files and small random I/O want different tradeoffs.
- ZVOLs are block devices backed by ZFS; they behave differently from datasets and bring their own tuning knobs like volblocksize.
- The ARC is not “just cache”; it’s a memory consumer with eviction behavior that can make or break performance, especially under mixed workloads.
- OpenZFS became the cross-platform continuation after Sun’s era, and modern features (like special vdevs and dRAID) are products of years of operational feedback.
All of that history points to the same operational truth: ZFS wants you to express intent. Datasets are how you do it.
Workload map: what belongs in its own dataset
“Per workload” is not “per directory.” It’s “per behavior.” You split when a workload has different needs for latency, throughput, safety, or retention.
You keep together what shares a lifecycle and policy.
Good dataset boundaries (pragmatic, not academic)
- Databases: Postgres/MySQL data directory, WAL/binlog, and backups often deserve separate datasets.
- VM storage: One dataset for VM images, optionally per cluster or per tenant. If using ZVOLs, treat them as workloads too.
- Containers: Image layers vs writable volumes vs logs. These have wildly different write patterns.
- Build/artifact caches: High churn, low value, lots of deletes—perfect for distinct snapshot/retention rules.
- Home directories: Quotas, different retention, and “oops I deleted it” restores.
- Backup landing zones: Receives, long retention, and “do not accidentally snapshot every 5 minutes.”
- Logs: Often compress well, but don’t always need snapshots; also can be huge and spiky.
Where people over-split
Don’t create a dataset for every application subdirectory because you got excited about knobs.
If your snapshot schedule is identical, retention is identical, and performance characteristics are identical, keep it together.
Managing 400 datasets with no naming standard is how you build a museum of forgotten intentions.
Naming: make it boring and searchable
Use a predictable hierarchy. Example:
tank/prod/db/postgres, tank/prod/db/postgres-wal, tank/prod/vm, tank/prod/containers,
tank/shared/homes, tank/backup/recv.
Put environment and function early. You will grep for it at 02:00. Don’t make Future You parse poetry.
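A minimal sketch of bootstrapping that kind of tree, assuming a pool named tank and reusing the dataset names and mountpoints from this article’s examples; adjust the properties to your own contracts:
cr0x@server:~$ sudo zfs create -p -o compression=zstd -o atime=off tank/prod/db
cr0x@server:~$ sudo zfs create -o recordsize=16K -o mountpoint=/var/lib/postgresql tank/prod/db/postgres
cr0x@server:~$ sudo zfs create -o recordsize=16K -o com.sun:auto-snapshot=false tank/prod/db/postgres-wal
The -p flag creates missing parents, and the children inherit compression and atime from tank/prod/db: the parent carries the defaults, the leaves carry the contracts.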
Properties that matter (and what they really do)
recordsize: the silent performance lever
recordsize controls the maximum block size for files in a dataset. Bigger is good for large sequential reads/writes (media, backups).
Smaller can reduce read-modify-write amplification for random I/O patterns (databases).
For many databases on datasets (not ZVOLs), recordsize=16K or recordsize=8K is a common starting point.
For VM images stored as files, recordsize=128K often works well. For backup streams, recordsize=1M can be appropriate.
But don’t cargo-cult this; measure.
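If “measure” sounds hand-wavy, a quick synthetic probe is one way to start; this sketch assumes fio is installed, uses a hypothetical throwaway directory (never the live data directory), and mimics a Postgres-style 8K random mix. Keep in mind that a warm ARC will flatter the read numbers.
cr0x@server:~$ fio --name=db-sim --directory=/tank/prod/db/fio-test --rw=randrw --rwmixread=70 --bs=8k --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=60 --time_based --group_reporting
Run it against datasets configured with different recordsize values and compare latency percentiles before committing to a number.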
volblocksize: for ZVOLs, not files
ZVOLs use volblocksize instead. It’s fixed at creation time; changing it later means creating a new zvol and migrating the data.
Match it to the guest/filesystem block size and expected I/O: e.g., 8K or 16K for DB-like random I/O; larger for sequential.
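Because volblocksize is fixed at creation, it belongs in the creation command. A sketch with a hypothetical zvol name, sized and tuned for a database guest:
cr0x@server:~$ sudo zfs create -s -V 200G -o volblocksize=16K -o compression=zstd tank/prod/vm/db-guest-disk0
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/prod/vm/db-guest-disk0
The -s flag makes the zvol sparse; drop it if you want the full size guaranteed up front via refreservation.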
compression: performance feature disguised as storage efficiency
Use compression=zstd for most datasets unless you have a clear reason not to.
It’s usually faster than your disks and reduces writes. It also reduces the “why is my pool full” drama.
atime: metadata writes you probably don’t want
atime=off is a standard move for most server workloads. If you truly need access time updates, make that dataset explicit and small.
sync and logbias: where honesty meets latency
sync=standard is the default: it honors exactly what applications ask for. sync=disabled can make benchmarks look great and production look like a post-mortem.
logbias=latency (the default) pushes synchronous writes through the ZIL, and through a SLOG if you have one; logbias=throughput skips the log device for data and optimizes for bandwidth, which can quietly make an expensive SLOG irrelevant for that dataset.
If you change sync behavior, treat it like changing your RPO: write it down, get sign-off, and assume you will be audited by reality.
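A quick audit sketch, assuming your pool is named tank: list every dataset whose sync value is anything other than the default, so datasets that are “lying” about durability can’t hide behind inheritance.
cr0x@server:~$ zfs get -r -H -o name,value,source sync tank | grep -vw standard
An empty result is the boring answer you want; anything that shows up should have a documented reason and an owner.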
primarycache / secondarycache: choose what you cache
For datasets that pollute ARC (like large sequential backup reads), consider primarycache=metadata.
It’s not about saving RAM; it’s about keeping ARC useful for latency-sensitive workloads.
special_small_blocks and special vdevs: fast metadata, fast small I/O
If you have a special vdev (typically SSDs) for metadata/small blocks, special_small_blocks can dramatically improve small-file workloads.
It can also cause dramatic regret if your special vdev is undersized. This is one of those “plan capacity like it’s production, because it is” features.
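A sketch, assuming the pool already has a special vdev and that 32K is a sensible cutoff for this workload; the value must be a power of two and should stay below the dataset’s recordsize, or all of its data will land on the special vdev:
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/prod/containers
cr0x@server:~$ zpool list -v tank
The second command shows per-vdev allocation, which is how you notice the special vdev filling up before it becomes everyone’s problem.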
quotas, reservations, refreservation: capacity as policy
quota caps growth. reservation guarantees space for a dataset and everything under it. refreservation guarantees space for the dataset’s own data, excluding snapshots and descendants; non-sparse zvols get one by default so the guest can always write. These are your guardrails against the Noisy Neighbor Storage Tax.
snapshots: your rollback lever, your replication unit, your space-time trap
Snapshots are cheap until they aren’t. A dataset with high churn and long snapshot retention will retain deleted blocks and inflate used space.
A dataset with frequent snapshots and fast churn can fill a pool while everyone swears nothing changed.
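To see where a dataset’s USED actually goes, the usedby* properties split it into live data, snapshots, and children; the dataset name here matches the examples used throughout this article.
cr0x@server:~$ zfs get -o name,property,value usedbydataset,usedbysnapshots,usedbychildren tank/prod/db/postgres
If usedbysnapshots dominates, retention is your problem, not the application.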
One quote, because it’s still true
Hope is not a strategy.
— often attributed to General H. Norman Schwarzkopf
Practical tasks: commands + what the output means + the decision you make
These are not “toy” commands. These are the ones you run when you’re tired, production is hot, and your storage pool is being blamed for everything.
Each task includes: command, example output, interpretation, and a decision.
Task 1: List datasets and see who is actually using space
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank/prod
NAME USED AVAIL REFER MOUNTPOINT
tank/prod 3.21T 5.87T 192K /tank/prod
tank/prod/db 1.44T 5.87T 192K /tank/prod/db
tank/prod/db/postgres 812G 5.87T 812G /var/lib/postgresql
tank/prod/db/postgres-wal 221G 5.87T 221G /var/lib/postgresql-wal
tank/prod/vm 1.55T 5.87T 1.55T /tank/prod/vm
tank/prod/containers 218G 5.87T 218G /var/lib/containers
What it means: USED includes snapshots/descendants; REFER is the live data in that dataset.
Decision: If USED is much larger than REFER, snapshots are hoarding space. Investigate snapshot retention for that dataset.
Task 2: Find snapshot bloat per dataset
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -s used -r tank/prod/db/postgres | tail -n 5
tank/prod/db/postgres@hourly-2025-12-25-00 18.4G 812G
tank/prod/db/postgres@hourly-2025-12-24-23 18.2G 812G
tank/prod/db/postgres@hourly-2025-12-24-22 17.9G 812G
tank/prod/db/postgres@hourly-2025-12-24-21 17.6G 812G
tank/prod/db/postgres@hourly-2025-12-24-20 17.4G 812G
What it means: Snapshot USED here is the unique blocks held by that snapshot.
Decision: If snapshots are huge and frequent, shorten retention or snapshot less frequently on high-churn datasets (WAL, caches, build artifacts).
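Before changing retention, you can preview what destroying a range of snapshots would free. The snapshot names reuse the illustrative output above, and the -n/-v combination is a dry run: nothing is deleted.
cr0x@server:~$ sudo zfs destroy -nv tank/prod/db/postgres@hourly-2025-12-24-20%hourly-2025-12-24-22
It prints each snapshot it would destroy plus a “would reclaim” estimate, which is the number to put in the ticket before anyone argues.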
Task 3: Check property inheritance to spot accidental defaults
cr0x@server:~$ zfs get -r -o name,property,value,source recordsize,compression,atime,sync,logbias tank/prod/db/postgres
NAME PROPERTY VALUE SOURCE
tank/prod/db/postgres recordsize 128K inherited from tank/prod
tank/prod/db/postgres compression zstd inherited from tank/prod
tank/prod/db/postgres atime on default
tank/prod/db/postgres sync standard default
tank/prod/db/postgres logbias latency local
What it means: This database is inheriting recordsize=128K and has atime=on.
Decision: Set recordsize explicitly for DB datasets and disable atime unless you need it.
Task 4: Fix recordsize and atime for a database dataset
cr0x@server:~$ sudo zfs set recordsize=16K atime=off tank/prod/db/postgres
What it means: Newly created files use 16K records; existing files keep their old block size until they are recreated (copied, restored, or reloaded), so the benefit arrives gradually.
Decision: Apply during a maintenance window if you plan to rewrite large portions; otherwise accept gradual improvement.
Task 5: Check compressratio to see if compression is helping
cr0x@server:~$ zfs get -o name,property,value compressratio -r tank/prod | head
NAME PROPERTY VALUE
tank/prod compressratio 1.42x
tank/prod/db compressratio 1.12x
tank/prod/db/postgres compressratio 1.06x
tank/prod/vm compressratio 1.68x
What it means: VM images compress well; the database compresses poorly (common for already-compressed pages or encrypted data).
Decision: Keep compression on anyway unless CPU is constrained. For the DB, compression won’t save much space but can still reduce writes slightly.
Task 6: Put a hard cap on a “temporary” workload
cr0x@server:~$ sudo zfs set quota=500G tank/prod/containers
cr0x@server:~$ zfs get -o name,property,value quota tank/prod/containers
NAME PROPERTY VALUE
tank/prod/containers quota 500G
What it means: The containers dataset cannot exceed 500G.
Decision: Use quotas for caches, build artifacts, and log spools. If it hits quota, you get a controlled failure instead of a pool-wide outage.
Task 7: Reserve space for the database so it can keep breathing
cr0x@server:~$ sudo zfs set reservation=1T tank/prod/db
cr0x@server:~$ zfs get -o name,property,value reservation tank/prod/db
NAME PROPERTY VALUE
tank/prod/db reservation 1T
What it means: You’re guaranteeing space for DB datasets under tank/prod/db.
Decision: Use reservations when the business cost of DB write failures is higher than the cost of “wasting” reserved capacity.
Task 8: Identify whether the pool is suffering from fragmentation pressure
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,capacity,health tank
NAME SIZE ALLOC FREE FRAG CAPACITY HEALTH
tank 9.08T 7.24T 1.84T 63% 79% ONLINE
What it means: 79% full and 63% fragmentation is a warning sign, not a crime. ZFS can run like this, but latency often rises.
Decision: Plan capacity: keep pools below ~80% for mixed random workloads. Reduce churn datasets or add vdevs before the “mystery latency” phase.
Task 9: Watch I/O latency at the pool level
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 7.24T 1.84T 640 1200 78.2M 94.6M
raidz2-0 7.24T 1.84T 640 1200 78.2M 94.6M
sda - - 80 150 9.70M 11.8M
sdb - - 79 151 9.66M 11.7M
sdc - - 81 148 9.74M 11.6M
sdd - - 80 149 9.69M 11.7M
What it means: You see load distribution. This view alone doesn’t show latency, but it shows whether you’re saturating devices.
Decision: If writes are consistently high and the workload is sync-heavy, evaluate SLOG and sync settings per dataset (not globally).
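If you want latency rather than just operation counts from the same tool, recent OpenZFS releases accept -l, which adds wait-time columns; a sketch:
cr0x@server:~$ zpool iostat -l tank 2 3
High disk wait points at the devices themselves; high sync/async queue waits with relatively idle disks points at the workload mix or the pool layout instead.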
Task 10: Inspect per-dataset logical space accounting (and spot refreservation traps)
cr0x@server:~$ zfs list -o name,used,refer,logicalused,logicalrefer,compressratio -r tank/prod/vm | head
NAME USED REFER LUSED LREFER RATIO
tank/prod/vm 1.55T 1.55T 2.31T 2.31T 1.68x
What it means: Logical usage is bigger than physical usage thanks to compression (and possibly sparse images).
Decision: If logical used is huge, ensure your monitoring alerts on physical pool capacity, not just guest-claimed capacity.
Task 11: See what’s hammering ARC and whether you should limit caching for a dataset
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
12:00:01 712 144 20 112 16 32 4 0 0 46.3G 48.0G
12:00:02 690 171 24 139 20 32 5 0 0 46.1G 48.0G
12:00:03 705 162 23 131 19 31 4 0 0 46.0G 48.0G
What it means: A 20–24% miss rate under load can be fine or awful depending on latency goals. It suggests the working set may not fit.
Decision: If large sequential jobs are thrashing ARC, set primarycache=metadata on those datasets (e.g., backup receive or media archives).
Task 12: Change cache policy for a backup dataset to protect latency-sensitive workloads
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backup/recv
cr0x@server:~$ zfs get -o name,property,value,source primarycache tank/backup/recv
NAME PROPERTY VALUE SOURCE
tank/backup/recv primarycache metadata local
What it means: ZFS will cache metadata, but not file data, for this dataset in ARC.
Decision: Use this for “streaming” datasets that don’t benefit from caching and actively hurt everyone else.
Task 13: Create a dataset with workload-appropriate defaults (the right kind of boring)
cr0x@server:~$ sudo zfs create -o compression=zstd -o atime=off -o recordsize=16K tank/prod/db/mysql
cr0x@server:~$ zfs get -o name,property,value,source compression,atime,recordsize tank/prod/db/mysql
NAME PROPERTY VALUE SOURCE
tank/prod/db/mysql compression zstd local
tank/prod/db/mysql atime off local
tank/prod/db/mysql recordsize 16K local
What it means: You’ve made intent explicit at creation time, preventing accidental inheritance surprises.
Decision: Standardize dataset creation via a runbook or automation. Humans are great; consistency is better.
Task 14: Check snapshot policy drift (are you snapshotting the wrong things?)
cr0x@server:~$ zfs get -r -o name,property,value,source com.sun:auto-snapshot tank/prod | head -n 8
NAME PROPERTY VALUE SOURCE
tank/prod com.sun:auto-snapshot true inherited from tank
tank/prod/db com.sun:auto-snapshot true inherited from tank
tank/prod/db/postgres com.sun:auto-snapshot true inherited from tank
tank/prod/db/postgres-wal com.sun:auto-snapshot true inherited from tank
tank/prod/vm com.sun:auto-snapshot true inherited from tank
tank/prod/containers com.sun:auto-snapshot true inherited from tank
tank/prod/logs com.sun:auto-snapshot true inherited from tank
What it means: Everything is being snapshotted, including WAL and logs, which may be high-churn and low-value.
Decision: Disable auto-snapshots on churn-heavy datasets (WAL/logs/caches) or give them a different retention class.
Task 15: Disable snapshots on a churn-heavy dataset (example: WAL)
cr0x@server:~$ sudo zfs set com.sun:auto-snapshot=false tank/prod/db/postgres-wal
cr0x@server:~$ zfs get -o name,property,value,source com.sun:auto-snapshot tank/prod/db/postgres-wal
NAME PROPERTY VALUE SOURCE
tank/prod/db/postgres-wal com.sun:auto-snapshot false local
What it means: Snapshot automation that respects this property will skip WAL.
Decision: Keep WAL protected via database-native backups/archiving, not filesystem snapshots every 15 minutes that retain churn forever.
Task 16: Confirm ashift and vdev layout (because you can’t tune your way out of wrong geometry)
cr0x@server:~$ zdb -C tank | egrep "ashift|vdev_tree|type:|path:"
ashift: 12
type: 'raidz'
type: 'disk'
path: '/dev/disk/by-id/ata-SAMSUNG_SSD_...'
What it means: ashift: 12 implies 4K sector alignment, typically correct for modern disks/SSDs.
Decision: If ashift is wrong (too small), performance and write amplification can be permanently harmed. Fixing it usually means rebuilding the pool.
Second short joke, because you’ve earned it: The nice thing about “temporary” datasets is they last forever—like tattoos, but with worse retention policies.
Fast diagnosis playbook: find the bottleneck without a week of meetings
This is the triage order that works in real life: start broad, then zoom in. Don’t start by changing recordsize.
Don’t start by blaming ZFS. And absolutely don’t start by “turning off sync” because someone saw a forum post in 2014.
First: Is the pool out of breathing room?
- Check pool capacity and fragmentation: zpool list
- Check snapshot bloat: zfs list -t snapshot
- Check if a single dataset is exploding: zfs list -r
Interpretation: Pools that live near full tend to become latency machines.
High churn + lots of snapshots + high fill level is a classic recipe for “it got slow” incidents.
Second: Is it IOPS/latency, throughput, or CPU?
- Look at device saturation and distribution: zpool iostat -v 1
- Look at system CPU steal/iowait: mpstat -P ALL 1, iostat -x 1
- Check compression CPU cost if using heavy algorithms: zfs get compression plus system CPU metrics
Interpretation: Random write latency bottlenecks feel like “everything is slow.” Throughput bottlenecks feel like “big jobs take forever.”
CPU bottlenecks feel like “storage is slow” because the storage stack is waiting on CPU to checksum, compress, and manage metadata.
Third: Which dataset is the noisy neighbor?
- Identify top writers/readers by process: iotop -o or pidstat -d 1
- Correlate to mountpoints/datasets: zfs list -o name,mountpoint
- Confirm dataset properties: zfs get -o name,property,value,source -r ...
Interpretation: If one dataset is doing sequential writes with snapshots and caching enabled, it can evict useful ARC and cause everyone else to miss.
If one dataset is sync-heavy and lacks proper intent/log configuration, it can dominate latency.
Fourth: Are you fighting your own policies?
- Snapshot automation scope: check com.sun:auto-snapshot or your local snapshot tooling flags
- Replication lag and receives: zfs list -t snapshot and check for “stuck” old snapshots kept for send/receive
- Quotas/reservations: zfs get quota,reservation,refreservation
Interpretation: Many “ZFS is eating space” tickets are actually “snapshots are doing exactly what we asked, and we asked badly.”
Three corporate mini-stories from the storage trenches
1) The incident caused by a wrong assumption: “snapshots are free”
A mid-sized company ran a ZFS pool for a mixed fleet: VMs, containers, and a couple of databases. Someone enabled an auto-snapshot policy
at the pool root. Hourly snapshots for everything. Daily retention for a month. It was pitched as “a safety net.”
The wrong assumption wasn’t malicious; it was optimistic. Snapshots are cheap to create, so the team assumed they were cheap to keep.
Meanwhile, the container dataset had aggressive churn: image pulls, layer deletions, CI jobs, log rotation, the usual entropy generator.
Deletions didn’t free space. The pool filled slowly, then suddenly.
The incident began as a database symptom: commits slowed. Then the VM cluster started experiencing guest timeouts.
Operations looked at graphs and saw disk utilization go up and never come down. Someone tried deleting old container images—no immediate space return.
The on-call ended up in the same pit many of us know: “why is nothing freeing space?”
The post-incident fix was not exotic tuning. They split datasets per workload, disabled snapshots on caches and churn-heavy datasets,
and gave each dataset a retention policy matching its value. The database got frequent snapshots with short retention; the VM store got nightly snapshots;
containers got none, with backups handled at the artifact registry level instead.
The lesson: snapshots are not free; they are deferred cost. A dataset boundary is where you decide what cost you’re willing to pay.
2) The optimization that backfired: “sync=disabled for speed”
Another shop had a performance problem: their Postgres workload was suffering high commit latency after a migration to new storage.
A well-meaning engineer ran a benchmark, saw poor numbers, and applied the classic “fix”: sync=disabled on the database dataset.
Benchmark graphs improved instantly. People celebrated. The ticket was closed with a confident comment about “ZFS overhead.”
Weeks later, they had an unplanned power event. Not a dramatic data center fire—just the kind of upstream electrical issue that makes you learn
how good your UPS maintenance really is. Systems came back. Postgres came back too, but not cleanly. The database needed recovery and showed signs
of corruption in a portion of recent transactions. The incident became a cross-team exercise in backups, WAL archives, and uncomfortable conversations.
The operational failure wasn’t just the property change; it was the lack of dataset boundaries and documentation around intent.
The DB dataset had inherited other defaults from a general-purpose dataset tree, and nobody tracked which datasets were “lying” about sync.
When it was time to audit risk, the team didn’t know what to search for.
The eventual fix was conservative: restore sync=standard, validate SLOG suitability (or accept the latency), and separate WAL into its own dataset
with deliberate properties. They also added a simple compliance check: zfs get -r sync on prod pools, flagged in monitoring.
The lesson: performance hacks that change correctness are not tuning; they are policy changes. Use dataset boundaries to make those policies explicit,
auditable, and rare.
3) The boring but correct practice that saved the day: quotas + reservations + clear dataset ownership
A company running an internal platform had a storage pool shared by multiple product teams. The platform team enforced a rule:
every team got a dataset, with a quota, and production databases had reservations. Everyone grumbled, because everyone grumbles when you add guardrails.
One quarter-end, a data science workload began dumping intermediate data into its dataset—perfectly reasonable in isolation, catastrophically large in aggregate.
The workload hit its quota. Jobs failed. The team paged the platform team with the usual “storage is broken” message.
But production stayed healthy. Databases kept committing. VM images kept moving. Nobody else noticed.
The platform team’s on-call did not need to do heroics. They had a clear failure domain: one dataset, one quota, one owner.
They worked with the team to either raise the quota with a plan or move the workload to a different pool designed for scratch data.
The capacity impact was discussed before it was inflicted.
The lesson: the most effective ZFS feature for multi-tenant sanity is not a fancy cache. It’s policy-as-properties, enforced through datasets,
with owners who can be paged when their dataset misbehaves.
Common mistakes: symptoms → root cause → fix
1) Symptom: “We deleted terabytes but the pool is still full”
Root cause: Snapshots are retaining deleted blocks; deletions only free space once no snapshot references the blocks.
Fix: Identify snapshot space users; shorten retention; exclude high-churn datasets from frequent snapshots; destroy snapshots deliberately.
2) Symptom: “Database latency spikes during backups/replication”
Root cause: Backup dataset reads pollute ARC, pushing out DB working set; or backup writes compete for IOPS.
Fix: Put backups in their own dataset; set primarycache=metadata for streaming datasets; schedule heavy jobs; consider separate pool for backups.
3) Symptom: “VMs stutter every hour on the hour”
Root cause: Snapshot/replication jobs on the VM dataset causing bursts of metadata activity, or scrubs/resilvers coinciding with peak load.
Fix: Tune snapshot cadence per dataset; spread schedules; ensure special vdev sizing if using one; run scrub off-peak and monitor impact.
4) Symptom: “Everything is slow once the pool hits ~85% used”
Root cause: Allocator has less freedom; fragmentation rises; write amplification increases, especially on RAIDZ with random writes.
Fix: Keep headroom; add vdevs; move churn workloads to separate pool; prune snapshots; enforce quotas.
5) Symptom: “After enabling compression, CPU jumps and latency worsens”
Root cause: Using an expensive compression level/algorithm for a CPU-bound system, or compressing already-compressed/encrypted data.
Fix: Use zstd at a sensible level (platform defaults); validate CPU headroom; consider leaving compression on but avoid extreme levels.
6) Symptom: “Application writes fail with ENOSPC while zpool shows free space”
Root cause: Dataset quota reached, or reservations elsewhere are consuming available space.
Fix: Check zfs get quota,reservation; adjust quotas/reservations; communicate ownership—this is policy working, not ZFS lying.
7) Symptom: “Small files are painfully slow”
Root cause: Metadata and small blocks on slow disks; no special vdev; recordsize not the main factor—IOPS is.
Fix: Use special vdev with enough capacity; set special_small_blocks appropriately; ensure atime is off; separate small-file workload dataset.
8) Symptom: “Replication can’t delete old snapshots”
Root cause: Snapshot holds or dependent incremental chains requiring them.
Fix: Check holds; validate replication tooling; break and restart replication carefully; don’t destroy required snapshots blindly.
Checklists / step-by-step plan
Step-by-step: migrating from “one big dataset” to per-workload datasets
- Inventory workloads and mountpoints. Decide boundaries based on behavior: DB, WAL, VM images, containers, logs, backups, homes.
- Decide the minimum policy set. For each workload: snapshot cadence/retention, compression, recordsize/volblocksize, atime, quotas/reservations, caching policy.
- Create datasets with explicit properties. Don’t rely on inheritance for critical settings; inheritance is fine for defaults, not for contracts.
- Move data with a plan. For live services, use rsync with a downtime window or application-aware migration; for large datasets, consider snapshot send/receive (a minimal sketch follows after this list).
- Update fstab/systemd mount units. Make mounts explicit and consistent; avoid mountpoint surprises.
- Implement snapshot tooling per dataset class. One schedule for DB data, another for VM images, none for caches unless proven valuable.
- Enforce quotas where “temporary” data lives. Containers, CI, caches, logs. If it can be regenerated, it must be capped.
- Reserve for workloads that must not fail. Databases and critical state stores.
- Add monitoring checks for drift. Alert on unexpected sync=disabled, on snapshot counts, on pool capacity, on unusual dataset growth.
- Run a game day. Practice restoring a file from snapshots; practice rolling back a dataset clone; practice receiving replication into the right dataset.
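For the “move data with a plan” step, a minimal send/receive sketch; the source and target names are hypothetical, and the final incremental is taken only after the application is quiesced:
cr0x@server:~$ sudo zfs snapshot tank/prod/old-app@migrate-1
cr0x@server:~$ sudo zfs send tank/prod/old-app@migrate-1 | sudo zfs receive -u tank/prod/db/new-app
cr0x@server:~$ sudo zfs snapshot tank/prod/old-app@migrate-2
cr0x@server:~$ sudo zfs send -i @migrate-1 tank/prod/old-app@migrate-2 | sudo zfs receive -u tank/prod/db/new-app
The -u flag keeps the target unmounted until you flip mountpoints, the bulk copy happens while the service is still running, and only the small final incremental needs downtime.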
Operational checklist: per-workload dataset defaults (starter kit)
- DB data (files): recordsize=16K, compression=zstd, atime=off, snapshots frequent but short retention.
- DB WAL/binlog: recordsize=16K (or workload-specific), atime=off, snapshots usually disabled or minimal retention.
- VM images (files): recordsize=128K, compression=zstd, snapshots daily/weekly depending on RPO/RTO.
- ZVOL for VM: set volblocksize at creation; document it; treat changes as a migration.
- Containers: quotas mandatory; snapshots optional and usually short; consider cache controls if churn is high.
- Backups/receive: recordsize=1M often sensible, primarycache=metadata, long retention but controlled replication.
- Logs: compression on, snapshots rarely needed, quotas/rotation essential.
Governance checklist: prevent chaos from returning
- Every dataset has an owner and a purpose in its name.
- Every prod dataset has explicit snapshot policy documented in code/runbooks.
- Quotas exist for anything non-critical or regenerable.
- Reservations exist for critical state where ENOSPC is unacceptable.
- Property drift is detectable: scheduled reports on sync, recordsize, compression, and snapshot counts (see the drift-check sketch below).
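A minimal, cron-able drift check, assuming production lives under tank/prod: the first line flags any dataset where sync has drifted to disabled, the second counts snapshots so explosions show up as a trend.
cr0x@server:~$ zfs get -r -H -o name,value sync tank/prod | awk '$2 == "disabled"'
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank/prod | wc -l
Feed both outputs into whatever monitoring you already have; the point is that drift gets noticed by a machine, not by the next incident.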
FAQ
1) Does “dataset per workload” mean one dataset per application?
Not necessarily. It means one dataset per behavior and policy. One application might need multiple datasets (DB data vs WAL vs uploads).
Ten apps might share one dataset if they truly share lifecycle and policy (rare, but possible).
2) How many datasets is too many?
When humans can’t answer “what’s this dataset for?” without archaeology. Operationally, ZFS handles many datasets fine.
Your naming, ownership, and automation are the limit.
3) Should I set recordsize to 8K for every database?
Don’t universalize. Many DB engines write in 8K/16K pages, but workload matters. Start with 16K for DB datasets, measure, and adjust.
If you’re using ZVOLs, recordsize isn’t the knob—volblocksize is.
4) Is compression safe for databases?
Generally yes, and often beneficial. The risk is CPU overhead on already CPU-saturated systems.
The fix is not “disable compression everywhere”; it’s “use sane compression and monitor CPU.”
5) Should I ever use sync=disabled?
Only when you explicitly accept losing recent synchronous writes on crash/power loss—typically for scratch data where correctness isn’t required.
If it’s production state, treat sync=disabled like removing seatbelts to improve commute time.
6) Why did deleting a huge directory not free space immediately?
Snapshots and clones can keep blocks referenced. Check snapshot usage on that dataset. Space returns when the last reference is gone.
7) Do quotas hurt performance?
Not meaningfully in typical environments. Quotas hurt feelings, because they force planning.
Use quotas to prevent pool-wide outages; it’s a good trade.
8) If I set recordsize now, will it rewrite existing data?
No. It affects newly created files. Existing files keep their block size until they are recreated through a copy, restore, or migration.
9) Can I use one snapshot schedule for everything and call it a day?
You can, the same way you can use one password for everything. The damage shows up later, and it’s always at the worst time.
Snapshot schedules should match value and churn.
10) What’s the quickest win if we’re already in chaos?
Identify the top three datasets by USED and snapshot usage, then apply quotas and snapshot policy fixes.
That reduces blast radius immediately, even before deep tuning.
Conclusion: next steps that actually work
ZFS chaos is rarely a technology failure. It’s usually a management failure expressed as a storage problem.
The “dataset per workload” approach gives you control points: performance intent, safety intent, and capacity intent.
And it gives you a map you can debug under pressure.
Practical next steps:
- Draw your workload map and mark where policies differ. Those are your dataset boundaries.
- Create or refactor datasets so that each boundary has explicit properties: recordsize/volblocksize, compression, atime, snapshots, quotas/reservations.
- Implement a fast triage routine (capacity → saturation → noisy neighbor → policy drift) and rehearse it.
- Add drift detection: alert if a prod dataset flips to sync=disabled, if snapshot counts explode, or if pool headroom shrinks past your threshold.
Do this, and your next “storage is slow” page turns from a guessing game into a checklist. That’s the whole trick.