Everything is fine until it isn’t: a backup job starts, a database stalls, and suddenly the VM hosts feel “slow” in a way nobody can quantify.
You stare at graphs, argue about “storage latency,” and somebody suggests rebooting things “to clear it out.”
The fix is rarely heroic. It’s boring structure: one ZFS dataset per workload, with explicit properties, explicit snapshot rules, and explicit limits.
That’s how you prevent the storage pool from turning into a shared kitchen where everyone cooks curry at the same time and nobody can find a clean pan.
The trick: datasets are management boundaries
ZFS is not just a filesystem. It’s a policy engine welded to an allocator, with checksumming and copy-on-write as the glue.
That policy engine is mostly controlled through dataset properties. And those properties matter only if you use datasets as boundaries.
A “dataset per workload” strategy means you stop treating your pool like one big directory tree where everything inherits whatever the last admin felt like.
Instead, you create datasets that map to real workloads: a database, a VM image store, an object cache, a build artifact repo, home directories,
container layers, a backup landing zone. Each gets deliberate settings.
The point isn’t “more datasets because ZFS can.” The point is operational control:
- Performance isolation: recordsize, logbias, atime, sync behavior, special_small_blocks can be tuned per workload.
- Safety isolation: snapshot cadence, retention, holds, and replication can be per dataset.
- Capacity discipline: quotas/reservations prevent one workload from eating your pool and calling it “temporary.”
- Blast-radius control: you can roll back or send/receive one workload without touching neighbors.
- Diagnosability: zfs get, zfs list, and per-dataset stats become meaningful instead of a soup.
If you remember one thing: your ZFS pool is the hardware budget; datasets are your contracts.
Without contracts, you get surprise bills.
One short joke, as a service: If your storage is “one big dataset,” it’s not architecture—it’s an emotion.
Interesting facts and historical context (because scars have history)
- ZFS originated at Sun Microsystems in the early-to-mid 2000s, designed to replace traditional volume managers and filesystems with one integrated system.
- The copy-on-write model was a direct answer to silent corruption and the “write hole” class of issues that haunted older RAID + filesystem stacks.
- ZFS snapshots were designed to be cheap because they’re just references to existing blocks—until you keep them forever and wonder why deletes don’t free space.
- Dataset property inheritance is intentional: it enables policy trees, but it also enables accidental policy inheritance—like a production database inheriting a “test” snapshot schedule.
- Compression became mainstream practice in ZFS because it often improves performance by reducing I/O, especially on spinning disks and saturated arrays.
- Recordsize exists because “one block size fits all” fails: large sequential files and small random I/O want different tradeoffs.
- ZVOLs are block devices backed by ZFS; they behave differently from datasets and bring their own tuning knobs like volblocksize.
- The ARC is not “just cache”; it’s a memory consumer with eviction behavior that can make or break performance, especially under mixed workloads.
- OpenZFS became the cross-platform continuation after Sun’s era, and modern features (like special vdevs and dRAID) are products of years of operational feedback.
All of that history points to the same operational truth: ZFS wants you to express intent. Datasets are how you do it.
Workload map: what belongs in its own dataset
“Per workload” is not “per directory.” It’s “per behavior.” You split when a workload has different needs for latency, throughput, safety, or retention.
You keep together what shares a lifecycle and policy.
Good dataset boundaries (pragmatic, not academic)
- Databases: Postgres/MySQL data directory, WAL/binlog, and backups often deserve separate datasets.
- VM storage: One dataset for VM images, optionally per cluster or per tenant. If using ZVOLs, treat them as workloads too.
- Containers: Image layers vs writable volumes vs logs. These have wildly different write patterns.
- Build/artifact caches: High churn, low value, lots of deletes—perfect for distinct snapshot/retention rules.
- Home directories: Quotas, different retention, and “oops I deleted it” restores.
- Backup landing zones: Receives, long retention, and “do not accidentally snapshot every 5 minutes.”
- Logs: Often compress well, but don’t always need snapshots; also can be huge and spiky.
Where people over-split
Don’t create a dataset for every application subdirectory because you got excited about knobs.
If your snapshot schedule is identical, retention is identical, and performance characteristics are identical, keep it together.
Managing 400 datasets with no naming standard is how you build a museum of forgotten intentions.
Naming: make it boring and searchable
Use a predictable hierarchy. Example:
tank/prod/db/postgres, tank/prod/db/postgres-wal, tank/prod/vm, tank/prod/containers,
tank/shared/homes, tank/backup/recv.
Put environment and function early. You will grep for it at 02:00. Don’t make Future You parse poetry.
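A minimal sketch of bootstrapping that kind of tree, assuming a pool named tank and reusing the dataset names and mountpoints from this article’s examples; adjust the properties to your own contracts:
cr0x@server:~$ sudo zfs create -p -o compression=zstd -o atime=off tank/prod/db
cr0x@server:~$ sudo zfs create -o recordsize=16K -o mountpoint=/var/lib/postgresql tank/prod/db/postgres
cr0x@server:~$ sudo zfs create -o recordsize=16K -o com.sun:auto-snapshot=false tank/prod/db/postgres-wal
The -p flag creates missing parents, and the children inherit compression and atime from tank/prod/db: the parent carries the defaults, the leaves carry the contracts.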
Properties that matter (and what they really do)
recordsize: the silent performance lever
recordsize controls the maximum block size for files in a dataset. Bigger is good for large sequential reads/writes (media, backups).
Smaller can reduce read-modify-write amplification for random I/O patterns (databases).
For many databases on datasets (not ZVOLs), recordsize=16K or recordsize=8K is a common starting point.
For VM images stored as files, recordsize=128K often works well. For backup streams, recordsize=1M can be appropriate.
But don’t cargo-cult this; measure.
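If “measure” sounds hand-wavy, a quick synthetic probe is one way to start; this sketch assumes fio is installed, uses a hypothetical throwaway directory (never the live data directory), and mimics a Postgres-style 8K random mix. Keep in mind that a warm ARC will flatter the read numbers.
cr0x@server:~$ fio --name=db-sim --directory=/tank/prod/db/fio-test --rw=randrw --rwmixread=70 --bs=8k --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=60 --time_based --group_reporting
Run it against datasets configured with different recordsize values and compare latency percentiles before committing to a number.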
volblocksize: for ZVOLs, not files
ZVOLs use volblocksize instead. It’s fixed at creation time; changing it later means creating a new zvol and migrating the data.
Match it to the guest/filesystem block size and expected I/O: e.g., 8K or 16K for DB-like random I/O; larger for sequential.
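Because volblocksize is fixed at creation, it belongs in the creation command. A sketch with a hypothetical zvol name, sized and tuned for a database guest:
cr0x@server:~$ sudo zfs create -s -V 200G -o volblocksize=16K -o compression=zstd tank/prod/vm/db-guest-disk0
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/prod/vm/db-guest-disk0
The -s flag makes the zvol sparse; drop it if you want the full size guaranteed up front via refreservation.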
compression: performance feature disguised as storage efficiency
Use compression=zstd for most datasets unless you have a clear reason not to.
It’s usually faster than your disks and reduces writes. It also reduces the “why is my pool full” drama.
atime: metadata writes you probably don’t want
atime=off is a standard move for most server workloads. If you truly need access time updates, make that dataset explicit and small.
sync and logbias: where honesty meets latency
sync=standard is the default: it honors exactly what applications ask for. sync=disabled can make benchmarks look great and production look like a post-mortem.
logbias=latency (the default) pushes synchronous writes through the ZIL, and through a SLOG if you have one; logbias=throughput skips the log device for data and optimizes for bandwidth, which can quietly make an expensive SLOG irrelevant for that dataset.
If you change sync behavior, treat it like changing your RPO: write it down, get sign-off, and assume you will be audited by reality.
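A quick audit sketch, assuming your pool is named tank: list every dataset whose sync value is anything other than the default, so datasets that are “lying” about durability can’t hide behind inheritance.
cr0x@server:~$ zfs get -r -H -o name,value,source sync tank | grep -vw standard
An empty result is the boring answer you want; anything that shows up should have a documented reason and an owner.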
primarycache / secondarycache: choose what you cache
For datasets that pollute ARC (like large sequential backup reads), consider primarycache=metadata.
It’s not about saving RAM; it’s about keeping ARC useful for latency-sensitive workloads.
special_small_blocks and special vdevs: fast metadata, fast small I/O
If you have a special vdev (typically SSDs) for metadata/small blocks, special_small_blocks can dramatically improve small-file workloads.
It can also cause dramatic regret if your special vdev is undersized. This is one of those “plan capacity like it’s production, because it is” features.
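A sketch, assuming the pool already has a special vdev and that 32K is a sensible cutoff for this workload; the value must be a power of two and should stay below the dataset’s recordsize, or all of its data will land on the special vdev:
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/prod/containers
cr0x@server:~$ zpool list -v tank
The second command shows per-vdev allocation, which is how you notice the special vdev filling up before it becomes everyone’s problem.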
quotas, reservations, refreservation: capacity as policy
quota caps growth. reservation guarantees space for a dataset and everything under it. refreservation guarantees space for the dataset’s own data, excluding snapshots and descendants; non-sparse zvols get one by default so the guest can always write. These are your guardrails against the Noisy Neighbor Storage Tax.
snapshots: your rollback lever, your replication unit, your space-time trap
Snapshots are cheap until they aren’t. A dataset with high churn and long snapshot retention will retain deleted blocks and inflate used space.
A dataset with frequent snapshots and fast churn can fill a pool while everyone swears nothing changed.
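To see where a dataset’s USED actually goes, the usedby* properties split it into live data, snapshots, and children; the dataset name here matches the examples used throughout this article.
cr0x@server:~$ zfs get -o name,property,value usedbydataset,usedbysnapshots,usedbychildren tank/prod/db/postgres
If usedbysnapshots dominates, retention is your problem, not the application.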
One quote, because it’s still true
Hope is not a strategy.
— often attributed to General H. Norman Schwarzkopf
Practical tasks: commands + what the output means + the decision you make
These are not “toy” commands. These are the ones you run when you’re tired, production is hot, and your storage pool is being blamed for everything.
Each task includes: command, example output, interpretation, and a decision.
Task 1: List datasets and see who is actually using space
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank/prod
NAME USED AVAIL REFER MOUNTPOINT
tank/prod 3.21T 5.87T 192K /tank/prod
tank/prod/db 1.44T 5.87T 192K /tank/prod/db
tank/prod/db/postgres 812G 5.87T 812G /var/lib/postgresql
tank/prod/db/postgres-wal 221G 5.87T 221G /var/lib/postgresql-wal
tank/prod/vm 1.55T 5.87T 1.55T /tank/prod/vm
tank/prod/containers 218G 5.87T 218G /var/lib/containers
What it means: USED includes snapshots/descendants; REFER is the live data in that dataset.
Decision: If USED is much larger than REFER, snapshots are hoarding space. Investigate snapshot retention for that dataset.
Task 2: Find snapshot bloat per dataset
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -s used -r tank/prod/db/postgres | tail -n 5
tank/prod/db/postgres@hourly-2025-12-25-00 18.4G 812G
tank/prod/db/postgres@hourly-2025-12-24-23 18.2G 812G
tank/prod/db/postgres@hourly-2025-12-24-22 17.9G 812G
tank/prod/db/postgres@hourly-2025-12-24-21 17.6G 812G
tank/prod/db/postgres@hourly-2025-12-24-20 17.4G 812G
What it means: Snapshot USED here is the unique blocks held by that snapshot.
Decision: If snapshots are huge and frequent, shorten retention or snapshot less frequently on high-churn datasets (WAL, caches, build artifacts).
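Before changing retention, you can preview what destroying a range of snapshots would free. The snapshot names reuse the illustrative output above, and the -n/-v combination is a dry run: nothing is deleted.
cr0x@server:~$ sudo zfs destroy -nv tank/prod/db/postgres@hourly-2025-12-24-20%hourly-2025-12-24-22
It prints each snapshot it would destroy plus a “would reclaim” estimate, which is the number to put in the ticket before anyone argues.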
Task 3: Check property inheritance to spot accidental defaults
cr0x@server:~$ zfs get -r -o name,property,value,source recordsize,compression,atime,sync,logbias tank/prod/db/postgres
NAME PROPERTY VALUE SOURCE
tank/prod/db/postgres recordsize 128K inherited from tank/prod
tank/prod/db/postgres compression zstd inherited from tank/prod
tank/prod/db/postgres atime on default
tank/prod/db/postgres sync standard default
tank/prod/db/postgres logbias latency local
What it means: This database is inheriting recordsize=128K and has atime=on.
Decision: Set recordsize explicitly for DB datasets and disable atime unless you need it.
Task 4: Fix recordsize and atime for a database dataset
cr0x@server:~$ sudo zfs set recordsize=16K atime=off tank/prod/db/postgres
What it means: Newly created files use 16K records; existing files keep their old block size until they are recreated (copied, restored, or reloaded), so the benefit arrives gradually.
Decision: Apply during a maintenance window if you plan to rewrite large portions; otherwise accept gradual improvement.
Task 5: Check compressratio to see if compression is helping
cr0x@server:~$ zfs get -o name,property,value compressratio -r tank/prod | head
NAME PROPERTY VALUE
tank/prod compressratio 1.42x
tank/prod/db compressratio 1.12x
tank/prod/db/postgres compressratio 1.06x
tank/prod/vm compressratio 1.68x
What it means: VM images compress well; the database compresses poorly (common for already-compressed pages or encrypted data).
Decision: Keep compression on anyway unless CPU is constrained. For the DB, compression won’t save much space but can still reduce writes slightly.
Task 6: Put a hard cap on a “temporary” workload
cr0x@server:~$ sudo zfs set quota=500G tank/prod/containers
cr0x@server:~$ zfs get -o name,property,value quota tank/prod/containers
NAME PROPERTY VALUE
tank/prod/containers quota 500G
What it means: The containers dataset cannot exceed 500G.
Decision: Use quotas for caches, build artifacts, and log spools. If it hits quota, you get a controlled failure instead of a pool-wide outage.
Task 7: Reserve space for the database so it can keep breathing
cr0x@server:~$ sudo zfs set reservation=1T tank/prod/db
cr0x@server:~$ zfs get -o name,property,value reservation tank/prod/db
NAME PROPERTY VALUE
tank/prod/db reservation 1T
What it means: You’re guaranteeing space for DB datasets under tank/prod/db.
Decision: Use reservations when the business cost of DB write failures is higher than the cost of “wasting” reserved capacity.
Task 8: Identify whether the pool is suffering from fragmentation pressure
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,capacity,health tank
NAME SIZE ALLOC FREE FRAG CAPACITY HEALTH
tank 9.08T 7.24T 1.84T 63% 79% ONLINE
What it means: 79% full and 63% fragmentation is a warning sign, not a crime. ZFS can run like this, but latency often rises.
Decision: Plan capacity: keep pools below ~80% for mixed random workloads. Reduce churn datasets or add vdevs before the “mystery latency” phase.
Task 9: Watch I/O latency at the pool level
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 7.24T 1.84T 640 1200 78.2M 94.6M
raidz2-0 7.24T 1.84T 640 1200 78.2M 94.6M
sda - - 80 150 9.70M 11.8M
sdb - - 79 151 9.66M 11.7M
sdc - - 81 148 9.74M 11.6M
sdd - - 80 149 9.69M 11.7M
What it means: You see load distribution. This view alone doesn’t show latency, but it shows whether you’re saturating devices.
Decision: If writes are consistently high and the workload is sync-heavy, evaluate SLOG and sync settings per dataset (not globally).
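If you want latency rather than just operation counts from the same tool, recent OpenZFS releases accept -l, which adds wait-time columns; a sketch:
cr0x@server:~$ zpool iostat -l tank 2 3
High disk wait points at the devices themselves; high sync/async queue waits with relatively idle disks points at the workload mix or the pool layout instead.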
Task 10: Inspect per-dataset logical space accounting (and spot refreservation traps)
cr0x@server:~$ zfs list -o name,used,refer,logicalused,logicalrefer,compressratio -r tank/prod/vm | head
NAME USED REFER LUSED LREFER RATIO
tank/prod/vm 1.55T 1.55T 2.31T 2.31T 1.68x
What it means: Logical usage is bigger than physical usage thanks to compression (and possibly sparse images).
Decision: If logical used is huge, ensure your monitoring alerts on physical pool capacity, not just guest-claimed capacity.
Task 11: See what’s hammering ARC and whether you should limit caching for a dataset
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
12:00:01 712 144 20 112 16 32 4 0 0 46.3G 48.0G
12:00:02 690 171 24 139 20 32 5 0 0 46.1G 48.0G
12:00:03 705 162 23 131 19 31 4 0 0 46.0G 48.0G
What it means: A 20–24% miss rate under load can be fine or awful depending on latency goals. It suggests the working set may not fit.
Decision: If large sequential jobs are thrashing ARC, set primarycache=metadata on those datasets (e.g., backup receive or media archives).
Task 12: Change cache policy for a backup dataset to protect latency-sensitive workloads
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backup/recv
cr0x@server:~$ zfs get -o name,property,value,source primarycache tank/backup/recv
NAME PROPERTY VALUE SOURCE
tank/backup/recv primarycache metadata local
What it means: ZFS will cache metadata, but not file data, for this dataset in ARC.
Decision: Use this for “streaming” datasets that don’t benefit from caching and actively hurt everyone else.
Task 13: Create a dataset with workload-appropriate defaults (the right kind of boring)
cr0x@server:~$ sudo zfs create -o compression=zstd -o atime=off -o recordsize=16K tank/prod/db/mysql
cr0x@server:~$ zfs get -o name,property,value,source compression,atime,recordsize tank/prod/db/mysql
NAME PROPERTY VALUE SOURCE
tank/prod/db/mysql compression zstd local
tank/prod/db/mysql atime off local
tank/prod/db/mysql recordsize 16K local
What it means: You’ve made intent explicit at creation time, preventing accidental inheritance surprises.
Decision: Standardize dataset creation via a runbook or automation. Humans are great; consistency is better.
Task 14: Check snapshot policy drift (are you snapshotting the wrong things?)
cr0x@server:~$ zfs get -r -o name,property,value,source com.sun:auto-snapshot tank/prod | head -n 8
NAME PROPERTY VALUE SOURCE
tank/prod com.sun:auto-snapshot true inherited from tank
tank/prod/db com.sun:auto-snapshot true inherited from tank
tank/prod/db/postgres com.sun:auto-snapshot true inherited from tank
tank/prod/db/postgres-wal com.sun:auto-snapshot true inherited from tank
tank/prod/vm com.sun:auto-snapshot true inherited from tank
tank/prod/containers com.sun:auto-snapshot true inherited from tank
tank/prod/logs com.sun:auto-snapshot true inherited from tank
What it means: Everything is being snapshotted, including WAL and logs, which may be high-churn and low-value.
Decision: Disable auto-snapshots on churn-heavy datasets (WAL/logs/caches) or give them a different retention class.
Task 15: Disable snapshots on a churn-heavy dataset (example: WAL)
cr0x@server:~$ sudo zfs set com.sun:auto-snapshot=false tank/prod/db/postgres-wal
cr0x@server:~$ zfs get -o name,property,value,source com.sun:auto-snapshot tank/prod/db/postgres-wal
NAME PROPERTY VALUE SOURCE
tank/prod/db/postgres-wal com.sun:auto-snapshot false local
What it means: Snapshot automation that respects this property will skip WAL.
Decision: Keep WAL protected via database-native backups/archiving, not filesystem snapshots every 15 minutes that retain churn forever.
Task 16: Confirm ashift and vdev layout (because you can’t tune your way out of wrong geometry)
cr0x@server:~$ zdb -C tank | egrep "ashift|vdev_tree|type:|path:"
ashift: 12
type: 'raidz'
type: 'disk'
path: '/dev/disk/by-id/ata-SAMSUNG_SSD_...'
What it means: ashift: 12 implies 4K sector alignment, typically correct for modern disks/SSDs.
Decision: If ashift is wrong (too small), performance and write amplification can be permanently harmed. Fixing it usually means rebuilding the pool.
Second short joke, because you’ve earned it: The nice thing about “temporary” datasets is they last forever—like tattoos, but with worse retention policies.
Fast diagnosis playbook: find the bottleneck without a week of meetings
This is the triage order that works in real life: start broad, then zoom in. Don’t start by changing recordsize.
Don’t start by blaming ZFS. And absolutely don’t start by “turning off sync” because someone saw a forum post in 2014.
First: Is the pool out of breathing room?
- Check pool capacity and fragmentation: zpool list
- Check snapshot bloat: zfs list -t snapshot
- Check if a single dataset is exploding: zfs list -r
Interpretation: Pools that live near full tend to become latency machines.
High churn + lots of snapshots + high fill level is a classic recipe for “it got slow” incidents.
Second: Is it IOPS/latency, throughput, or CPU?
- Look at device saturation and distribution: zpool iostat -v 1
- Look at system CPU steal/iowait: mpstat -P ALL 1, iostat -x 1
- Check compression CPU cost if using heavy algorithms: zfs get compression plus system CPU metrics
Interpretation: Random write latency bottlenecks feel like “everything is slow.” Throughput bottlenecks feel like “big jobs take forever.”
CPU bottlenecks feel like “storage is slow” because the storage stack is waiting on CPU to checksum, compress, and manage metadata.
Third: Which dataset is the noisy neighbor?
- Identify top writers/readers by process: iotop -o or pidstat -d 1
- Correlate to mountpoints/datasets: zfs list -o name,mountpoint
- Confirm dataset properties: zfs get -o name,property,value,source -r ...
Interpretation: If one dataset is doing sequential writes with snapshots and caching enabled, it can evict useful ARC and cause everyone else to miss.
If one dataset is sync-heavy and lacks proper intent/log configuration, it can dominate latency.
Fourth: Are you fighting your own policies?
- Snapshot automation scope: check com.sun:auto-snapshot or your local snapshot tooling flags
- Replication lag and receives: zfs list -t snapshot and check for “stuck” old snapshots kept for send/receive
- Quotas/reservations: zfs get quota,reservation,refreservation
Interpretation: Many “ZFS is eating space” tickets are actually “snapshots are doing exactly what we asked, and we asked badly.”
Three corporate mini-stories from the storage trenches
1) The incident caused by a wrong assumption: “snapshots are free”
A mid-sized company ran a ZFS pool for a mixed fleet: VMs, containers, and a couple of databases. Someone enabled an auto-snapshot policy
at the pool root. Hourly snapshots for everything. Daily retention for a month. It was pitched as “a safety net.”
The wrong assumption wasn’t malicious; it was optimistic. Snapshots are cheap to create, so the team assumed they were cheap to keep.
Meanwhile, the container dataset had aggressive churn: image pulls, layer deletions, CI jobs, log rotation, the usual entropy generator.
Deletions didn’t free space. The pool filled slowly, then suddenly.
The incident began as a database symptom: commits slowed. Then the VM cluster started experiencing guest timeouts.
Operations looked at graphs and saw disk utilization go up and never come down. Someone tried deleting old container images—no immediate space return.
The on-call ended up in the same pit many of us know: “why is nothing freeing space?”
The post-incident fix was not exotic tuning. They split datasets per workload, disabled snapshots on caches and churn-heavy datasets,
and gave each dataset a retention policy matching its value. The database got frequent snapshots with short retention; the VM store got nightly snapshots;
containers got none, with backups handled at the artifact registry level instead.
The lesson: snapshots are not free; they are deferred cost. A dataset boundary is where you decide what cost you’re willing to pay.
2) The optimization that backfired: “sync=disabled for speed”
Another shop had a performance problem: their Postgres workload was suffering high commit latency after a migration to new storage.
A well-meaning engineer ran a benchmark, saw poor numbers, and applied the classic “fix”: sync=disabled on the database dataset.
Benchmark graphs improved instantly. People celebrated. The ticket was closed with a confident comment about “ZFS overhead.”
Weeks later, they had an unplanned power event. Not a dramatic data center fire—just the kind of upstream electrical issue that makes you learn
how good your UPS maintenance really is. Systems came back. Postgres came back too, but not cleanly. The database needed recovery and showed signs
of corruption in a portion of recent transactions. The incident became a cross-team exercise in backups, WAL archives, and uncomfortable conversations.
The operational failure wasn’t just the property change; it was the lack of dataset boundaries and documentation around intent.
The DB dataset had inherited other defaults from a general-purpose dataset tree, and nobody tracked which datasets were “lying” about sync.
When it was time to audit risk, the team didn’t know what to search for.
The eventual fix was conservative: restore sync=standard, validate SLOG suitability (or accept the latency), and separate WAL into its own dataset
with deliberate properties. They also added a simple compliance check: zfs get -r sync on prod pools, flagged in monitoring.
The lesson: performance hacks that change correctness are not tuning; they are policy changes. Use dataset boundaries to make those policies explicit,
auditable, and rare.
3) The boring but correct practice that saved the day: quotas + reservations + clear dataset ownership
A company running an internal platform had a storage pool shared by multiple product teams. The platform team enforced a rule:
every team got a dataset, with a quota, and production databases had reservations. Everyone grumbled, because everyone grumbles when you add guardrails.
One quarter-end, a data science workload began dumping intermediate data into its dataset—perfectly reasonable in isolation, catastrophically large in aggregate.
The workload hit its quota. Jobs failed. The team paged the platform team with the usual “storage is broken” message.
But production stayed healthy. Databases kept committing. VM images kept moving. Nobody else noticed.
The platform team’s on-call did not need to do heroics. They had a clear failure domain: one dataset, one quota, one owner.
They worked with the team to either raise the quota with a plan or move the workload to a different pool designed for scratch data.
The capacity impact was discussed before it was inflicted.
The lesson: the most effective ZFS feature for multi-tenant sanity is not a fancy cache. It’s policy-as-properties, enforced through datasets,
with owners who can be paged when their dataset misbehaves.
Common mistakes: symptoms → root cause → fix
1) Symptom: “We deleted terabytes but the pool is still full”
Root cause: Snapshots are retaining deleted blocks; deletions only free space once no snapshot references the blocks.
Fix: Identify snapshot space users; shorten retention; exclude high-churn datasets from frequent snapshots; destroy snapshots deliberately.
2) Symptom: “Database latency spikes during backups/replication”
Root cause: Backup dataset reads pollute ARC, pushing out DB working set; or backup writes compete for IOPS.
Fix: Put backups in their own dataset; set primarycache=metadata for streaming datasets; schedule heavy jobs; consider separate pool for backups.
3) Symptom: “VMs stutter every hour on the hour”
Root cause: Snapshot/replication jobs on the VM dataset causing bursts of metadata activity, or scrubs/resilvers coinciding with peak load.
Fix: Tune snapshot cadence per dataset; spread schedules; ensure special vdev sizing if using one; run scrub off-peak and monitor impact.
4) Symptom: “Everything is slow once the pool hits ~85% used”
Root cause: Allocator has less freedom; fragmentation rises; write amplification increases, especially on RAIDZ with random writes.
Fix: Keep headroom; add vdevs; move churn workloads to separate pool; prune snapshots; enforce quotas.
5) Symptom: “After enabling compression, CPU jumps and latency worsens”
Root cause: Using an expensive compression level/algorithm for a CPU-bound system, or compressing already-compressed/encrypted data.
Fix: Use zstd at a sensible level (platform defaults); validate CPU headroom; consider leaving compression on but avoid extreme levels.
6) Symptom: “Application writes fail with ENOSPC while zpool shows free space”
Root cause: Dataset quota reached, or reservations elsewhere are consuming available space.
Fix: Check zfs get quota,reservation; adjust quotas/reservations; communicate ownership—this is policy working, not ZFS lying.
7) Symptom: “Small files are painfully slow”
Root cause: Metadata and small blocks on slow disks; no special vdev; recordsize not the main factor—IOPS is.
Fix: Use special vdev with enough capacity; set special_small_blocks appropriately; ensure atime is off; separate small-file workload dataset.
8) Symptom: “Replication can’t delete old snapshots”
Root cause: Snapshot holds or dependent incremental chains requiring them.
Fix: Check holds; validate replication tooling; break and restart replication carefully; don’t destroy required snapshots blindly.
Checklists / step-by-step plan
Step-by-step: migrating from “one big dataset” to per-workload datasets
- Inventory workloads and mountpoints. Decide boundaries based on behavior: DB, WAL, VM images, containers, logs, backups, homes.
- Decide the minimum policy set. For each workload: snapshot cadence/retention, compression, recordsize/volblocksize, atime, quotas/reservations, caching policy.
- Create datasets with explicit properties. Don’t rely on inheritance for critical settings; inheritance is fine for defaults, not for contracts.
- Move data with a plan. For live services, use rsync with a downtime window or application-aware migration; for large datasets, consider snapshot send/receive (a minimal sketch follows after this list).
- Update fstab/systemd mount units. Make mounts explicit and consistent; avoid mountpoint surprises.
- Implement snapshot tooling per dataset class. One schedule for DB data, another for VM images, none for caches unless proven valuable.
- Enforce quotas where “temporary” data lives. Containers, CI, caches, logs. If it can be regenerated, it must be capped.
- Reserve for workloads that must not fail. Databases and critical state stores.
- Add monitoring checks for drift. Alert on unexpected sync=disabled, on snapshot counts, on pool capacity, on unusual dataset growth.
- Run a game day. Practice restoring a file from snapshots; practice rolling back a dataset clone; practice receiving replication into the right dataset.
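For the “move data with a plan” step, a minimal send/receive sketch; the source and target names are hypothetical, and the final incremental is taken only after the application is quiesced:
cr0x@server:~$ sudo zfs snapshot tank/prod/old-app@migrate-1
cr0x@server:~$ sudo zfs send tank/prod/old-app@migrate-1 | sudo zfs receive -u tank/prod/db/new-app
cr0x@server:~$ sudo zfs snapshot tank/prod/old-app@migrate-2
cr0x@server:~$ sudo zfs send -i @migrate-1 tank/prod/old-app@migrate-2 | sudo zfs receive -u tank/prod/db/new-app
The -u flag keeps the target unmounted until you flip mountpoints, the bulk copy happens while the service is still running, and only the small final incremental needs downtime.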
Operational checklist: per-workload dataset defaults (starter kit)
- DB data (files): recordsize=16K, compression=zstd, atime=off, snapshots frequent but short retention.
- DB WAL/binlog: recordsize=16K (or workload-specific), atime=off, snapshots usually disabled or minimal retention.
- VM images (files): recordsize=128K, compression=zstd, snapshots daily/weekly depending on RPO/RTO.
- ZVOL for VM: set volblocksize at creation; document it; treat changes as a migration.
- Containers: quotas mandatory; snapshots optional and usually short; consider cache controls if churn is high.
- Backups/receive: recordsize=1M often sensible, primarycache=metadata, long retention but controlled replication.
- Logs: compression on, snapshots rarely needed, quotas/rotation essential.
Governance checklist: prevent chaos from returning
- Every dataset has an owner and a purpose in its name.
- Every prod dataset has explicit snapshot policy documented in code/runbooks.
- Quotas exist for anything non-critical or regenerable.
- Reservations exist for critical state where ENOSPC is unacceptable.
- Property drift is detectable: scheduled reports on sync, recordsize, compression, and snapshot counts (see the drift-check sketch below).
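A minimal, cron-able drift check, assuming production lives under tank/prod: the first line flags any dataset where sync has drifted to disabled, the second counts snapshots so explosions show up as a trend.
cr0x@server:~$ zfs get -r -H -o name,value sync tank/prod | awk '$2 == "disabled"'
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank/prod | wc -l
Feed both outputs into whatever monitoring you already have; the point is that drift gets noticed by a machine, not by the next incident.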
FAQ
1) Does “dataset per workload” mean one dataset per application?
Not necessarily. It means one dataset per behavior and policy. One application might need multiple datasets (DB data vs WAL vs uploads).
Ten apps might share one dataset if they truly share lifecycle and policy (rare, but possible).
2) How many datasets is too many?
When humans can’t answer “what’s this dataset for?” without archaeology. Operationally, ZFS handles many datasets fine.
Your naming, ownership, and automation are the limit.
3) Should I set recordsize to 8K for every database?
Don’t universalize. Many DB engines write in 8K/16K pages, but workload matters. Start with 16K for DB datasets, measure, and adjust.
If you’re using ZVOLs, recordsize isn’t the knob—volblocksize is.
4) Is compression safe for databases?
Generally yes, and often beneficial. The risk is CPU overhead on already CPU-saturated systems.
The fix is not “disable compression everywhere”; it’s “use sane compression and monitor CPU.”
5) Should I ever use sync=disabled?
Only when you explicitly accept losing recent synchronous writes on crash/power loss—typically for scratch data where correctness isn’t required.
If it’s production state, treat sync=disabled like removing seatbelts to improve commute time.
6) Why did deleting a huge directory not free space immediately?
Snapshots and clones can keep blocks referenced. Check snapshot usage on that dataset. Space returns when the last reference is gone.
7) Do quotas hurt performance?
Not meaningfully in typical environments. Quotas hurt feelings, because they force planning.
Use quotas to prevent pool-wide outages; it’s a good trade.
8) If I set recordsize now, will it rewrite existing data?
No. It affects newly created files. Existing files keep their block size until they are recreated through a copy, restore, or migration.
9) Can I use one snapshot schedule for everything and call it a day?
You can, the same way you can use one password for everything. The damage shows up later, and it’s always at the worst time.
Snapshot schedules should match value and churn.
10) What’s the quickest win if we’re already in chaos?
Identify the top three datasets by USED and snapshot usage, then apply quotas and snapshot policy fixes.
That reduces blast radius immediately, even before deep tuning.
Conclusion: next steps that actually work
ZFS chaos is rarely a technology failure. It’s usually a management failure expressed as a storage problem.
The “dataset per workload” approach gives you control points: performance intent, safety intent, and capacity intent.
And it gives you a map you can debug under pressure.
Practical next steps:
- Draw your workload map and mark where policies differ. Those are your dataset boundaries.
- Create or refactor datasets so that each boundary has explicit properties: recordsize/volblocksize, compression, atime, snapshots, quotas/reservations.
- Implement a fast triage routine (capacity → saturation → noisy neighbor → policy drift) and rehearse it.
- Add drift detection: alert if a prod dataset flips to sync=disabled, if snapshot counts explode, or if pool headroom shrinks past your threshold.
Do this, and your next “storage is slow” page turns from a guessing game into a checklist. That’s the whole trick.