You can run ZFS for years without ever thinking about the difference between a dataset and a zvol.
Then you virtualize something important, you add snapshots “just in case,” replication becomes a board-level requirement,
and suddenly your storage platform develops opinions. Loud ones.
The zvol vs dataset choice isn’t academic. It changes how IO is shaped, what caching can do, how snapshots behave,
how replication breaks, and which tuning knobs even exist. Pick wrong and you don’t just get slower performance—you get
operational debt that compounds every quarter.
Datasets and zvols: what they really are (not what people say in Slack)
Dataset: a filesystem with ZFS superpowers
A ZFS dataset is a ZFS filesystem. It has file semantics: directories, permissions, ownership, extended attributes.
It can be exported over NFS/SMB, mounted locally, and manipulated with normal tools. ZFS adds its own layer of features:
snapshots, clones, compression, checksums, quotas/reservations, recordsize tuning, and all the transactional safety that
makes you sleep slightly better.
When you put data in a dataset, ZFS controls how it lays out variable-size “records” (blocks, but not the fixed-size blocks of traditional filesystems). That matters because it changes amplification, caching efficiency, and IO patterns. The key knob is
recordsize.
Zvol: a block device carved out of ZFS
A zvol is a ZFS volume: a virtual block device exposed as /dev/zvol/pool/volume. It doesn’t understand files.
Your guest filesystem (ext4, XFS, NTFS) or your database engine sees a disk and writes blocks. ZFS stores those blocks as objects
with a fixed block size controlled by volblocksize.
Zvols exist for the cases where your consumer wants a block device: iSCSI LUNs, VM disks, some container runtimes, some hypervisors,
and occasionally application stacks that insist on raw devices.
The real-world translation
- Dataset = “ZFS is the filesystem; clients talk file protocols; ZFS sees the files and can optimize around them.”
- Zvol = “ZFS is providing a pretend disk; something else builds a filesystem; ZFS sees blocks and guesses.”
ZFS is extremely good at both, but they behave differently. The pain comes from assuming they behave the same.
One short joke, because storage needs humility: If you want to start a heated debate in a datacenter, bring up RAID levels.
If you want to end it, mention “zvol volblocksize” and watch everyone quietly check their notes.
The decision rules: when to use a dataset, when to use a zvol
Default stance: datasets are the boring choice—and boring wins
If your workload can consume storage as a filesystem (local mount, NFS, SMB), use a dataset. You get simpler operations:
easier inspection, easier copy/restore, straightforward permissions, and fewer edge cases around block sizes and TRIM/UNMAP.
You also get ZFS’s behavior tuned for files by default.
Datasets are also easier to debug because your tools still speak “file.” You can measure file-level fragmentation, look at
directories, reason about metadata, and keep a clean mental model.
When a zvol is the right tool
Use a zvol when the consumer requires a block device:
- VM disks (especially for hypervisors that want raw volumes, or when you want ZFS-managed snapshots of the virtual disk)
- iSCSI targets (LUNs are block by definition)
- Some clustered setups that replicate block devices or require SCSI semantics
- Legacy applications that only support “put database on raw device” (rare, but it happens)
The zvol model is powerful: ZFS snapshots of a VM disk are fast, clones are instant, replication works, and you can compress and
checksum everything.
But: block devices multiply responsibility
When you use a zvol, you now own the layering between a guest filesystem and ZFS. Alignment matters. Block size matters.
Write barriers matter. Trim/UNMAP behavior matters. Sync settings become a policy question, not a tuning detail.
A simple decision matrix you can defend
- Need NFS/SMB or local files? Dataset.
- Need iSCSI LUN or raw block for hypervisor? Zvol.
- Need per-file visibility, easy restore of a single file? Dataset.
- Need instant VM clones from a golden image? Zvol (or a dataset with sparse files, but know your tooling).
- Need consistent snapshot of an application that manages its own filesystem? Zvol (and coordinate flush/quiesce).
- Trying to “optimize performance” without knowing IO patterns? Dataset, then measure. Hero tuning comes later.
Recordsize vs volblocksize: where performance goes to get decided
Datasets: recordsize is the maximum size of a data block ZFS will use for files. Big sequential files (backups,
media, logs) love a larger recordsize such as 1M. Databases and random IO prefer smaller values (16K, 8K) because rewriting a small
region doesn’t force large-block churn.
Zvols: volblocksize is fixed at creation. Choose wrong and you can’t change it later without rebuilding the zvol.
That’s not “annoying.” That’s “we’re scheduling a migration in Q4 because latency charts look like a saw blade.”
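A minimal sketch of how these knobs are set, assuming a hypothetical pool named tank and placeholder dataset/zvol names; the sizes are illustrations, not recommendations:
# Dataset: recordsize can be changed at any time, but only blocks written
# after the change use the new value.
zfs set recordsize=1M tank/backups
zfs set recordsize=16K tank/db
# Zvol: volblocksize is chosen once, at creation, and cannot be changed later.
zfs create -V 80G -o volblocksize=16K tank/vm/app01-disk0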
Snapshots: deceptively similar, operationally different
Snapshotting a dataset captures filesystem state. Snapshotting a zvol captures raw blocks. In both cases, ZFS uses copy-on-write,
so the snapshot is cheap at creation time and expensive later if you keep rewriting blocks referenced by snapshots.
With zvols, that expense is easier to trigger, because VM disks rewrite blocks constantly: metadata updates, journal churn,
filesystem housekeeping, even “nothing happening” can mean something is rewriting. Snapshots left around too long become a tax.
Quotas, reservations, and the thin-provisioning trap
Datasets give you quota and refquota. Zvols give you a fixed size, but you can create sparse volumes
and pretend you have more space than you do. That’s a business decision masquerading as an engineering feature.
Thin-provisioning is fine when you have monitoring, alerting, and adult supervision. It is a disaster when used to avoid saying
“no” in a ticket queue.
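For reference, both ends of that spectrum in command form; a sketch with a hypothetical zvol name (tank/vm/try01), where -s creates a sparse volume and refreservation puts the guarantee back:
# Thin: nothing is reserved up front; the pool can fill up underneath the guest.
zfs create -s -V 200G -o volblocksize=16K tank/vm/try01
# Thick-ish: guarantee the advertised size for a critical guest.
zfs set refreservation=200G tank/vm/try01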
Second short joke (and the last one): Thin provisioning is like ordering pants one size smaller as “motivation.”
Sometimes it works, but mostly you just can’t breathe.
Failure modes you only meet in production
Write amplification from mis-sized blocks
A dataset with recordsize=1M backing a database that does 8K random writes can cause painful amplification: each small
update forces a read-modify-write of a full 1M record. ZFS does store files smaller than the recordsize in a single smaller block,
but once a file grows past that threshold every record is full-size, so don’t rely on that behavior to save you from a poor fit.
Meanwhile, a zvol with volblocksize=128K serving a VM filesystem that writes 4K blocks is similarly mismatched.
Symptom: decent throughput in benchmarks, miserable tail latency in real workloads.
Sync semantics: where latency hides
ZFS honors synchronous writes. If an application (or hypervisor) issues sync writes, ZFS must commit them safely—meaning to stable
storage, not just RAM. Without a dedicated SLOG device (fast, power-loss-protected), sync-heavy workloads can bottleneck on main pool
latency.
Zvol consumers often use sync writes more aggressively than file protocols. VMs and databases tend to care about durability.
NFS clients might issue sync writes depending on mount options and application behavior. Either way, if your latency spikes correlate
with sync write load, the “zvol vs dataset” debate becomes a “do we understand our sync write path” debate.
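If the sync path is the real problem, a SLOG is added as a pool-level log vdev. A sketch, assuming a hypothetical pair of power-loss-protected NVMe devices (the device paths are placeholders):
# Mirror the log vdev: losing an unmirrored SLOG at the wrong moment costs
# you the most recently acknowledged sync writes.
zpool add tank log mirror /dev/disk/by-id/nvme-PLP-A /dev/disk/by-id/nvme-PLP-B
# Then confirm nobody "fixed" latency by disabling sync somewhere.
zfs get sync tank/vm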
TRIM/UNMAP and the myth of “free space comes back automatically”
Datasets can free blocks when files are deleted. Zvols depend on the guest issuing TRIM/UNMAP (and your stack passing it through).
Without it, your zvol looks full forever, snapshots bloat, and you start blaming “ZFS fragmentation” for what is basically an absence
of garbage collection signaling.
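A hedged end-to-end check, assuming a Linux guest and the pool from the examples; whether discard actually reaches the zvol depends on your hypervisor's disk bus and discard settings:
# Inside the guest: confirm the virtual disk advertises discard, then trim.
lsblk -D        # non-zero DISC-GRAN/DISC-MAX means discard is exposed
fstrim -av      # one-off; a periodic trim job is the usual long-term answer
# On the host: pass frees down to the SSDs, or trim on demand.
zpool set autotrim=on tank
zpool trim tank
zpool status -t tank    # shows per-vdev trim state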
Snapshot retention explosions
Keeping hourly snapshots for 90 days of a VM zvol feels responsible until you realize you’re retaining every churned block across
every Windows update, package manager run, and log rotation. The math gets ugly fast. Datasets also suffer here, but VM churn is a
special kind of enthusiastic.
Replication surprise: datasets are friendlier to incremental logic
ZFS replication works for both datasets and zvols, but zvol replication can be larger and more sensitive to block churn.
A small change in a guest filesystem can rewrite blocks all over the place. Your incremental send can look suspiciously like a full
send, and your WAN link will send you a resignation letter.
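The rehearsal loop is two commands: a dry-run estimate, then the real incremental stream. A sketch with hypothetical snapshot names and a placeholder destination host, assuming the destination already holds the earlier snapshot:
# Estimate the incremental before you commit the WAN to it.
zfs send -nv -i tank/vm/db01@hourly-01 tank/vm/db01@hourly-02
# If the number is sane, send it for real.
zfs send -i tank/vm/db01@hourly-01 tank/vm/db01@hourly-02 | ssh backup-host zfs receive -u tank-dr/vm/db01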
Tooling friction: the humans are part of the system
Most teams have stronger operational muscle around filesystems than block devices. They know how to check permissions, how to copy
files, how to mount and inspect. Zvol workflows push you into different tools: partition tables, guest filesystems, block-level checks.
The friction shows up at 3 a.m., not in the design meeting.
Interesting facts and history you can weaponize in design reviews
- ZFS introduced end-to-end checksumming as a core feature, not an add-on; it changed how people argued about “silent corruption.”
- Copy-on-write was not new when ZFS arrived, but ZFS made it operationally mainstream for general storage stacks.
- Zvols were designed to integrate with block ecosystems like iSCSI and VM platforms, long before “hyperconverged” became a sales word.
- Recordsize defaults were chosen for general-purpose file workloads, not for databases; defaults are politics embedded in code.
- Volblocksize is immutable after zvol creation in most common implementations; that single detail drives many migrations.
- The ARC (Adaptive Replacement Cache) made ZFS caching behavior distinct from many traditional filesystems; it’s not “just page cache.”
- L2ARC arrived as a second-tier cache, but it never replaced the need to size RAM correctly; it mostly changes hit rates, not miracles.
- SLOG devices became a standard pattern because synchronous write latency dominates certain workloads; “fast SSD” without power-loss protection is not a SLOG.
- Send/receive replication gave ZFS a built-in backup primitive; it’s not “rsync,” it’s a transaction-stream of blocks.
Practical tasks: commands, outputs, and what to decide
These are not “cute demo” commands. These are the ones you run when you’re trying to choose between a dataset and a zvol, or when
you’re trying to prove to yourself you made the right choice.
Task 1: Inventory what you actually have
cr0x@server:~$ zfs list -o name,type,used,avail,refer,mountpoint -r tank
NAME TYPE USED AVAIL REFER MOUNTPOINT
tank filesystem 1.12T 8.44T 192K /tank
tank/vm filesystem 420G 8.44T 128K /tank/vm
tank/vm/web01 volume 80.0G 8.44T 10.2G -
tank/vm/db01 volume 250G 8.44T 96.4G -
tank/nfs filesystem 320G 8.44T 320G /tank/nfs
Meaning: You have both datasets (filesystem) and zvols (volume). Zvols have no mountpoint.
Decision: Identify which workloads are on volumes and ask: do they truly require block, or was this inertia?
Task 2: Check dataset recordsize (and whether it matches the workload)
cr0x@server:~$ zfs get -o name,property,value recordsize tank/nfs
NAME PROPERTY VALUE
tank/nfs recordsize 128K
Meaning: This dataset uses 128K records, a decent general default.
Decision: If this dataset hosts databases or VM images as files, consider 16K or 32K; if it hosts backups, consider 1M.
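If you do change it, a hedged note: recordsize only affects blocks written after the change, so existing files keep their old layout until they are rewritten or copied into a fresh dataset. The dataset name is the one from the example above:
zfs set recordsize=16K tank/nfs
# Existing 128K records stay as they are; a rewrite or send/receive into a new
# dataset is what actually re-blocks the data.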
Task 3: Check zvol volblocksize (the “can’t change later” knob)
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/vm/db01
NAME PROPERTY VALUE
tank/vm/db01 volblocksize 8K
Meaning: This zvol uses 8K blocks—commonly reasonable for database-heavy random IO and many VM filesystems.
Decision: If you see 64K/128K here for general VM boot disks, expect write amplification and consider rebuilding properly.
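Because volblocksize cannot be changed in place, “rebuilding properly” usually means a new zvol and a block-level copy during a maintenance window. A sketch with hypothetical names and sizes, assuming the guest is shut down for the copy:
# New zvol at the block size you actually want (volsize must match or exceed the old one).
zfs create -V 250G -o volblocksize=16K tank/vm/db01-new
# Copy while the guest is off, then swap the names and repoint the VM.
dd if=/dev/zvol/tank/vm/db01 of=/dev/zvol/tank/vm/db01-new bs=1M status=progress
zfs rename tank/vm/db01 tank/vm/db01-old
zfs rename tank/vm/db01-new tank/vm/db01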
Task 4: Check sync policy, because it controls durability vs latency
cr0x@server:~$ zfs get -o name,property,value sync tank/vm
NAME PROPERTY VALUE
tank/vm sync standard
Meaning: ZFS will honor sync writes but won’t force all writes to be sync.
Decision: If someone set sync=disabled “for performance,” schedule a risk conversation and a rollback plan.
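Rolling it back is a one-line property change; a sketch (zfs inherit clears a local override so the parent's value applies again):
zfs set sync=standard tank/vm
zfs inherit sync tank/vm    # alternative: drop the local override entirely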
Task 5: See compression status and ratio (this often decides cost)
cr0x@server:~$ zfs get -o name,property,value,source compression,compressratio tank/vm/db01
NAME PROPERTY VALUE SOURCE
tank/vm/db01 compression lz4 local
tank/vm/db01 compressratio 1.62x -
Meaning: LZ4 is enabled and helping.
Decision: Keep it. If compression is off on VM disks, turn it on unless you have a proven CPU bottleneck.
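Turning it on is cheap and only affects newly written blocks; a sketch:
zfs set compression=lz4 tank/vm
# Already-written blocks stay uncompressed until they are rewritten.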
Task 6: Check how many snapshots you’re dragging around
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -S used | head
NAME USED REFER
tank/vm/db01@hourly-2025-12-24-23 8.4G 96.4G
tank/vm/db01@hourly-2025-12-24-22 7.9G 96.4G
tank/vm/web01@hourly-2025-12-24-23 2.1G 10.2G
tank/nfs@daily-2025-12-24 1.4G 320G
Meaning: The USED column is snapshot space unique to that snapshot.
Decision: If hourly VM snapshots each consume multiple gigabytes, shorten retention or reduce snapshot frequency; snapshots are not free.
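When you trim retention, dry-run the destroy first. A sketch using the % range syntax for consecutive snapshots; the snapshot names here are hypothetical:
# -n = dry run, -v = show what would be destroyed and how much space returns.
zfs destroy -nv tank/vm/db01@hourly-2025-12-20-00%hourly-2025-12-22-23
# Re-run without -n once the list looks right.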
Task 7: Identify “space held by snapshots” on a zvol
cr0x@server:~$ zfs get -o name,property,value usedbysnapshots,usedbydataset,logicalused tank/vm/db01
NAME PROPERTY VALUE
tank/vm/db01 usedbysnapshots 112G
tank/vm/db01 usedbydataset 96.4G
tank/vm/db01 logicalused 250G
Meaning: Snapshots are holding more space than the current live data.
Decision: Your “backup policy” is now a capacity planning policy. Fix retention and consider guest TRIM/UNMAP behavior.
Task 8: Check pool health and errors first (performance tuning on a sick pool is comedy)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:14:33 with 0 errors on Sun Dec 22 03:10:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SSD1 ONLINE 0 0 0
ata-SSD2 ONLINE 0 0 0
errors: No known data errors
Meaning: Healthy pool, scrub clean.
Decision: If you see checksum errors or degraded vdevs, stop. Fix hardware/pathing before debating zvol vs dataset.
Task 9: Watch real-time IO by dataset/zvol
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 1.12T 8.44T 210 980 12.4M 88.1M
mirror-0 1.12T 8.44T 210 980 12.4M 88.1M
ata-SSD1 - - 105 490 6.2M 44.0M
ata-SSD2 - - 105 490 6.2M 44.1M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: You see read/write IOPS and bandwidth; this tells you if you are IOPS-bound or throughput-bound.
Decision: High write IOPS with low bandwidth suggests small random writes—zvol volblocksize and sync path matter a lot.
Task 10: Check ARC pressure (because RAM is your first storage tier)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
12:20:01 532 41 7 12 2 29 5 0 0 48.1G 52.0G
12:20:02 611 58 9 16 2 42 7 0 0 48.1G 52.0G
12:20:03 590 55 9 15 2 40 7 0 0 48.1G 52.0G
Meaning: ARC miss rates are low; caching is healthy.
Decision: If miss% is consistently high under load, adding RAM often beats clever storage layout changes.
Task 11: Check whether a zvol is sparse (thin) and whether it’s safe
cr0x@server:~$ zfs get -o name,property,value refreservation,volsize,used tank/vm/web01
NAME PROPERTY VALUE
tank/vm/web01 refreservation none
tank/vm/web01 volsize 80G
tank/vm/web01 used 10.2G
Meaning: No reservation: this is effectively thin from a “guaranteed space” perspective.
Decision: If running critical VMs, set refreservation to guarantee space or accept the risk of pool-full outages.
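Setting the guarantee is one command; a sketch using the zvol from the example, with the reservation matching volsize so it is effectively thick:
zfs set refreservation=80G tank/vm/web01
# Or consciously keep it thin and alert on pool capacity instead.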
Task 12: Confirm ashift (your physical sector alignment baseline)
cr0x@server:~$ zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
45: vdev_tree:
67: ashift: 12
Meaning: ashift=12 means 4K sectors. Usually correct for modern SSDs and HDDs.
Decision: If ashift is wrong (too small), performance can be permanently impaired; you rebuild the pool, not “tune it.”
Task 13: Evaluate snapshot send size before you replicate over a small link
cr0x@server:~$ zfs send -nv -i tank/vm/db01@hourly-2025-12-24-22 tank/vm/db01@hourly-2025-12-24-23
send from @hourly-2025-12-24-22 to tank/vm/db01@hourly-2025-12-24-23 estimated size is 18.7G
total estimated size is 18.7G
Meaning: Incremental replication is still 18.7G—block churn is high.
Decision: Reduce snapshot frequency/retention, improve guest TRIM behavior, or switch architecture (dataset-based app storage) if feasible.
Task 14: Check whether TRIM is enabled on the pool
cr0x@server:~$ zpool get -o name,property,value autotrim tank
NAME PROPERTY VALUE
tank autotrim on
Meaning: Pool is trimming freed blocks to SSDs.
Decision: If autotrim is off on SSD pools, consider enabling it—especially if you rely on zvol guests to return space.
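If it is off on an SSD pool, enabling it and running a one-off trim are both reasonable; a sketch:
zpool set autotrim=on tank
zpool trim tank          # one-off pass over space that was freed earlier
zpool status -t tank     # verify trim state per vdev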
Task 15: Check per-dataset properties that change behavior in sneaky ways
cr0x@server:~$ zfs get -o name,property,value atime,xattr,primarycache,logbias tank/vm
NAME PROPERTY VALUE
tank/vm atime off
tank/vm xattr sa
tank/vm primarycache all
tank/vm logbias latency
Meaning: atime off reduces metadata writes; xattr=sa stores xattrs more efficiently; logbias=latency favors sync latency.
Decision: For VM zvols, logbias=latency is typically reasonable. If logbias=throughput appears, validate it wasn’t a cargo-cult tweak.
Fast diagnosis playbook (find the bottleneck in minutes)
When performance is bad, the zvol vs dataset debate often becomes a distraction. Use this sequence to locate the real limit quickly.
First: prove the pool isn’t sick
- Run: zpool status -v. Look for: degraded vdevs, checksum errors, resilver in progress, slow scrub behavior. Interpretation: if the pool is unhealthy, everything else is noise.
- Run: dmesg | tail (and your OS-specific logs). Look for: link resets, timeouts, NVMe errors, HBA issues. Interpretation: a flapping drive path looks like “random latency spikes.”
Second: classify the IO (small random vs large sequential, sync vs async)
- Run: zpool iostat -v 2. Look for: high IOPS with low MB/s (random) vs high MB/s (sequential). Interpretation: random IO stresses latency, sequential stresses bandwidth.
- Run: zfs get sync tank/... and check application settings. Look for: sync-heavy workloads without SLOG or on slow media. Interpretation: sync writes will expose the slowest durable path.
Third: check memory and caching before buying hardware
- Run: arcstat 1. Look for: high miss%, ARC not growing, memory pressure. Interpretation: if you’re missing cache constantly, you’re forcing disk reads you could have avoided.
- Run: zfs get primarycache,secondarycache tank/... Look for: someone set caching to metadata-only “to save RAM.” Interpretation: that can be valid in some designs, but it’s often accidental self-harm.
Fourth: validate block sizing and snapshot tax
- Run: zfs get recordsize tank/dataset or zfs get volblocksize tank/volume. Interpretation: mismatch = amplification = tail latency.
- Run: zfs get usedbysnapshots tank/... Interpretation: if snapshots hold massive space, they also increase metadata and allocation work.
Three corporate mini-stories (anonymized, but painfully real)
Mini-story 1: An incident caused by a wrong assumption (“a zvol is basically a dataset”)
A mid-size SaaS company migrated from a legacy SAN to ZFS. The storage engineer—smart, fast, a bit too confident—standardized on zvols
for everything “because VM disks are volumes, and that’s what the SAN did.” The NFS-based app storage also got moved into zvol-backed
ext4 filesystems mounted on Linux clients. It worked in testing. It even worked in production for a while.
The first signs were subtle: backup windows started stretching. Replication began missing its RPO, but only on certain volumes.
Then a pool that had been stable for months suddenly hit a capacity cliff. “We have 30% free,” someone said, pointing at the pool
dashboard. “So why can’t we create a new VM disk?”
The answer was snapshots. The zvols were being snapshotted hourly, and the guest filesystems were churning blocks constantly. Deleted
files inside the guests did not translate into freed blocks unless TRIM made it through the whole stack. It didn’t, because the guest
OSes weren’t configured for it and the hypervisor path didn’t pass it cleanly.
Meanwhile, the NFS-like workload running inside the guest ext4 had no reason to be inside a zvol in the first place. They wanted file
semantics but built a file-on-block-on-ZFS layering cake. The on-call response was to delete “old snapshots” until the pool stopped
screaming, which worked briefly and then became an emergency ritual.
The fix wasn’t glamorous: migrate NFS-ish data to datasets exported directly, implement sane snapshot retention for VM zvols, and
validate TRIM end-to-end. It took a month of careful migration to unwind a design based on the wrong assumption that “volume vs
filesystem is just packaging.”
Mini-story 2: An optimization that backfired (“set sync=disabled, it’s fine”)
Another org, finance-adjacent and extremely allergic to downtime, ran a virtualized database cluster. Latency was creeping up during
peak business hours. Someone dug through forum posts, found the magic words sync=disabled, and proposed it as a quick win.
The change was made on the zvol hierarchy that backed the VM disks.
Latency improved immediately. Graphs looked great. The team declared victory and moved on to other fires. For a few weeks, everything
was calm, which is exactly how risk teaches you to ignore it.
Then there was a power event: not a clean shutdown, not a graceful failover—just a moment where the UPS plan met reality and reality
won. The hypervisor came back. Several VMs booted. A few didn’t. The database did, but it rolled back more transactions than anyone
liked, and at least one filesystem required repair.
The incident review was uncomfortable because no one could say, with a straight face, that they hadn’t traded durability for
performance. They had. That’s what the setting does. The rollback was to restore sync=standard and add a proper SLOG
device with power-loss protection. The long-term fix was cultural: no “performance fix” that changes durability semantics without a
written risk acceptance and a test that simulates power loss behavior.
Mini-story 3: The boring but correct practice that saved the day (testing send size and snapshot discipline)
A large internal platform team ran two datacenters with ZFS replication between them. They had a habit that looked tedious:
before onboarding a new workload, they would run a week-long “replication rehearsal” with snapshots and zfs send -nvP to
estimate incremental sizes. They also enforced snapshot retention policies like adults: short retention for churny volumes, longer for
datasets with stable data.
A product team requested “hourly snapshots for six months” for a fleet of VMs. The platform team didn’t argue philosophically. They
ran the rehearsal. The incrementals were huge and erratic, and the WAN link would have been saturated regularly. Instead of saying
“no,” they offered a boring alternative: daily long-retention, hourly short-retention, plus application-level backups for the critical
data. They also moved some data off VM disks into datasets exported over NFS, because it was file data pretending to be block data.
Months later, an outage in one site forced a failover. Replication was current, recovery was predictable, and the postmortem was
delightfully uneventful. The credit went to a practice nobody wanted to do because it wasn’t “engineering,” it was “process.”
It saved them anyway.
Common mistakes: symptoms → root cause → fix
1) VM storage is slow and spiky, especially during updates
- Symptoms: tail latency spikes, UI freezes, slow boots, periodic IO stalls.
- Root cause: zvol volblocksize mismatched with guest IO size; snapshots retained too long; sync writes bottlenecking on slow media.
- Fix: rebuild zvols with a sensible volblocksize (often 8K or 16K for general VMs), reduce snapshot retention, validate SLOG for sync-heavy workloads.
2) Pool shows plenty of free space, but you hit “out of space” behavior
- Symptoms: allocations fail, writes block, new zvols cannot be created, weird ENOSPC in guests.
- Root cause: thin provisioning with no refreservation; snapshots holding space; pool too full (ZFS needs headroom).
- Fix: enforce reservations for critical zvols, delete/expire snapshots, keep pool below a sane utilization threshold, and implement capacity alerts that include snapshot growth.
3) Replication incrementals are huge for zvol-based VMs
- Symptoms: send/receive runs forever, network saturation, RPO misses.
- Root cause: guest filesystem block churn; lack of TRIM/UNMAP; snapshot interval poorly chosen.
- Fix: enable and verify TRIM from guest to zvol, adjust snapshot cadence, move file-like data to datasets, and test estimated send sizes before committing policy.
4) “We disabled sync and nothing bad happened” (yet)
- Symptoms: amazing latency; suspiciously calm dashboards; no immediate failures.
- Root cause: durability semantics changed; you’re acknowledging writes before they’re safe.
- Fix: revert to sync=standard or sync=always as appropriate; add a proper SLOG; test power-loss scenarios and document risk acceptance if you insist on cheating physics.
5) NFS workload performs badly when stored inside a VM on a zvol
- Symptoms: metadata-heavy workloads are sluggish; backups and restores are awkward; troubleshooting is painful.
- Root cause: unnecessary layering: file workload placed inside a guest filesystem on top of a zvol, losing ZFS file-level optimizations and visibility.
- Fix: store and export as a dataset directly; tune recordsize and atime; keep the stack simple.
6) Snapshot rollback “works” but the app comes up corrupted
- Symptoms: after rollback, filesystem mounts but application data is inconsistent.
- Root cause: crash-consistency mismatch; zvol snapshots capture blocks, not application quiescence, and dataset snapshots are likewise only crash-consistent unless you coordinate with the application.
- Fix: quiesce applications (fsfreeze, database flush, hypervisor guest-agent hooks) before snapshots, as in the sketch below; validate restore procedures periodically.
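A minimal quiesce-then-snapshot sketch for a Linux guest, assuming a hypothetical mountpoint and zvol name and that you have shell access to the guest; hypervisor guest-agent hooks achieve the same thing with less manual choreography:
# Inside the guest: flush and freeze the filesystem so no writes land mid-snapshot.
fsfreeze -f /var/lib/app
# On the host: snapshot while the guest is frozen.
zfs snapshot tank/vm/app01@pre-change
# Inside the guest: thaw immediately afterwards.
fsfreeze -u /var/lib/app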
Checklists / step-by-step plan
Step-by-step: choosing dataset vs zvol for a new workload
- Identify the interface you need: file protocol (NFS/SMB/local mount) → dataset; block protocol (iSCSI/VM disk) → zvol.
- Write down IO pattern assumptions: mostly sequential? mostly random? sync-heavy? This decides recordsize/volblocksize and SLOG needs.
- Pick the simplest workable layer: avoid file-on-block-on-ZFS unless you must.
- Set compression to lz4 by default unless proven otherwise.
- Decide snapshot policy up front: frequency and retention; don’t let it grow as a “backup substitute.”
- Decide replication expectations: run a rehearsal with estimated send sizes if you care about RPO/RTO.
- Capacity guardrails: reservations for critical zvols; quotas for datasets; keep pool headroom.
- Document recovery: how to restore a file, a VM, or a database; include quiescing steps.
Checklist: configuring a zvol for VM disks (production baseline)
- Create with a sensible volblocksize (often 8K or 16K; match your guest and hypervisor realities).
- Enable compression=lz4.
- Keep sync=standard; add SLOG if sync latency matters.
- Plan snapshot retention to match churn; test zfs send -nvP for replication sizing.
- Verify TRIM/UNMAP end-to-end if you expect space reclamation.
- Consider refreservation for critical guests to prevent pool-full catastrophes (the whole baseline is pulled together in the sketch after this list).
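A hedged sketch of that baseline in one go; pool, names, and sizes are placeholders, and the values are the ones argued for above, not universal truths:
# Non-sparse by default, so space for the full volsize is reserved up front;
# add -s only if you consciously want thin provisioning.
zfs create -V 100G -o volblocksize=16K -o compression=lz4 -o sync=standard tank/vm/app02-disk0
# Rehearse replication sizing before promising an RPO.
zfs snapshot tank/vm/app02-disk0@baseline
zfs send -nv tank/vm/app02-disk0@baseline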
Checklist: configuring a dataset for app and file storage
- Choose recordsize based on IO: 1M for backups/media; smaller for DB-like patterns.
- Enable compression=lz4.
- Disable atime unless you truly need it.
- Use quota/refquota to prevent “one tenant ate the pool.”
- Snapshot with retention, not hoarding.
- Export via NFS/SMB with sane client settings; measure with real workloads (see the sketch after this list).
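And the dataset equivalent, again with placeholder names; the quota value depends entirely on your tenancy model, and sharenfs option syntax varies by platform:
zfs create -o recordsize=1M -o compression=lz4 -o atime=off -o quota=2T tank/backups/team-a
zfs set sharenfs=on tank/backups/team-a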
FAQ
1) Is a zvol always faster than a dataset for VM disks?
No. A zvol can be excellent for VM disks, but “faster” depends on sync behavior, block sizing, snapshot churn, and the hypervisor IO
path. A dataset hosting QCOW2/raw files can also perform very well with the right recordsize and caching behavior. Measure, don’t vibe.
2) Can I change volblocksize later?
Practically speaking: no. Treat volblocksize as immutable. If you picked wrong, the clean fix is migration to a new zvol
with the correct size and a controlled cutover.
3) Should I set recordsize=16K for databases on datasets?
Often reasonable, but not universal. Many databases use 8K pages; 16K can be a decent compromise. But if your workload is mostly
sequential scans or large blobs, larger recordsize can help. Profile your IO.
4) Are ZFS snapshots backups?
They are a powerful building block, not a backup strategy. Snapshots don’t protect against pool loss, operator mistakes on the pool,
or replicated corruption if you replicate too eagerly. Use snapshots with replication and/or separate backup storage and retention.
5) Why does deleting files inside a VM not free space on the ZFS pool?
Because ZFS sees a block device. Unless the guest issues TRIM/UNMAP and the stack passes it through, ZFS doesn’t know which blocks are
free inside the guest filesystem.
6) Should I use dedup on zvols to save space?
Usually no. Dedup is RAM-hungry and operationally unforgiving. Compression typically gives you safe wins with less risk. If you want
dedup, prove it with realistic data, and budget RAM like you mean it.
7) Does a SLOG help all writes?
No. A SLOG helps synchronous writes. If your workload is mostly asynchronous, a SLOG won’t move the needle much. If your workload is
sync-heavy, a proper SLOG can be the difference between “fine” and “why is everything on fire.”
8) When should I prefer datasets for containers?
If your container platform can use ZFS datasets directly (common on many Linux setups), datasets usually give better visibility and
simpler operations than stuffing container storage into VM disks on zvols. Keep layers minimal.
9) Can I safely use sync=disabled for VM disks if I have a UPS?
A UPS reduces risk; it does not eliminate it. Kernel panics, controller resets, firmware bugs, and human error still exist. If you
need durability, keep sync semantics correct and engineer the hardware path (SLOG with power-loss protection) to support it.
10) What’s the best default: zvol or dataset?
Default to dataset unless the consumer requires block. When you do need block, use zvols intentionally: choose volblocksize, plan
snapshots, and confirm TRIM and sync behavior.
Next steps you can actually do this week
Here’s the practical path that reduces future pain without turning your storage into a science experiment.
- Inventory your environment: list datasets vs volumes, and map them to workloads. Anything “file-like” living inside a zvol is a red flag to investigate.
- Audit the irreversible knobs: check volblocksize on zvols and recordsize on key datasets. Write down mismatches with workload patterns.
- Measure snapshot tax: identify which zvols/datasets have large usedbysnapshots. Align retention with business need, not anxiety.
- Validate sync behavior: find any sync=disabled and treat it as a change request needing explicit risk acceptance. If sync latency is a problem, engineer it with SLOG, not wishful thinking.
- Run a replication rehearsal: use zfs send -nvP to estimate incrementals for one week. If numbers look wild, fix churn drivers before promising tight RPOs.
One paraphrased idea from John Allspaw (operations/reliability): Incidents come from normal work in complex systems, not from one bad person having a bad day.
The zvol vs dataset choice is exactly that kind of “normal work” decision. Make it deliberately. Future-you will still have outages,
but they’ll be the interesting kind—the kind you can fix—rather than the slow, grinding kind caused by a single bad storage primitive
chosen years ago and defended out of pride.