If you run VMs on ZFS, you’ve probably tuned recordsize on datasets, argued about SLOGs in chat, and blamed “noisy neighbors” at least once. And then one day you discover volblocksize on zvols and realize you’ve been driving with the parking brake on—quietly, expensively, and with a confused look at your IOPS graphs.
volblocksize is one of those settings that feels too small to matter, like the “Advanced” tab nobody clicks. But it decides how ZFS chops up your VM’s virtual disk I/O, which decides write amplification, which decides latency, which decides whether your database thinks today is a good day to time out. This is not theory; it shows up as 99th percentile latency spikes in production at the worst possible time.
What volblocksize actually is (and what it is not)
A zvol is ZFS pretending to be a block device. You create it with zfs create -V ..., and it shows up as something like /dev/zvol/pool/vm-101-disk-0. Your hypervisor then formats it (or passes it raw), and the guest OS thinks it’s a disk.
volblocksize is the internal block size ZFS uses for that zvol’s data. When the guest writes 4K, 8K, 64K, or 1M chunks, ZFS ultimately has to map those writes into its own blocks. With a dataset, that knob is usually recordsize. With a zvol, the analogous knob is volblocksize.
It’s tempting to treat it as “just set it to 4K for VMs,” or “set it to 128K because ZFS likes big blocks.” Both are half-truths that age badly. The right answer depends on the type of I/O (random vs sequential), the guest filesystem, database page size, sync write behavior, and your hardware’s latency profile.
What it is not:
- It is not the guest’s filesystem block size. Your guest can use 4K blocks on a 16K volblocksize zvol; it will still “work,” it may just work expensively.
- It is not the pool’s ashift. Alignment matters, but ashift is about the physical sector size ZFS assumes for the vdevs.
- It is not a magical IOPS switch. It’s a trade-off knob between IOPS efficiency, metadata overhead, compression ratio, and latency tail risk.
First joke, because we’ve earned it: Changing volblocksize after you’ve put data on the zvol is like trying to change the size of a pizza after it’s been eaten—there are methods, but none of them feel like “resize.”
Why it decides IOPS and latency
IOPS and latency aren’t just “how fast the disks are.” In VM storage, they’re also about how many operations you force the storage stack to do for each guest operation. volblocksize changes that multiplication factor.
The write amplification you can actually reason about
Suppose your guest issues a 4K random write. What ZFS does depends on volblocksize:
- If volblocksize=4K, ZFS can update a single 4K block (plus metadata). This is the “literal” mapping.
- If volblocksize=16K, ZFS must update a 16K block. If only 4K changed, ZFS still writes a new 16K block (copy-on-write), which implies read-modify-write behavior at the logical level: it has to construct the new 16K content, then write it. Depending on caching and how that block is assembled, you’re burning bandwidth and adding latency risk.
- If volblocksize=128K, you’re now potentially rewriting 128K for a 4K change, again at least logically. If the data is compressible and the write covers mostly zeros, you might get lucky. But “maybe the data compresses” is not a strategy. (A quick back-of-envelope version of this multiplier follows below.)
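A back-of-envelope sketch of that multiplier, assuming the worst case where every 4K random write dirties one full 128K block (real behavior is softened by caching, txg aggregation, and compression):
cr0x@server:~$ echo "amplification: $((131072 / 4096))x; at 10000 write IOPS: $((10000 * 4096 / 1048576)) MiB/s of guest data vs $((10000 * 131072 / 1048576)) MiB/s of block rewrites"
amplification: 32x; at 10000 write IOPS: 39 MiB/s of guest data vs 1250 MiB/s of block rewrites
The exact numbers don’t matter; the ratio of volblocksize to the guest’s typical write size is the multiplier you’re signing up for in bandwidth, CPU, and latency.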
Now flip it for sequential I/O. If your VM is streaming large reads/writes (backup, log shipping, big file copies): big blocks can reduce overhead and improve throughput because ZFS does fewer I/O operations per megabyte.
Latency is a tail game
Average latency flatters you. Tail latency humiliates you. Larger blocks increase the amount of work per logical change and widen the latency distribution when the system is under pressure: more bytes to move, more time stuck behind other writes, more time waiting for txg commits, and more opportunities to collide with sync write requirements.
In production, the worst moments are predictable: snapshot storms, backup windows, scrubs, resilvers, and that one quarterly batch job nobody told SRE about. A volblocksize choice that’s “fine in a benchmark” can become your 99.9th percentile outage generator when the pool is 70% full and fragmented.
Metadata and CPU aren’t free, either
Smaller blocks mean more blocks, which means more metadata: more block pointers, more indirect blocks, more checksumming operations, more compression decisions, more work for ARC. You can absolutely create a system that is “IOPS fast” but CPU-bound in the storage layer, especially with encryption or heavy compression.
Interesting facts and historical context
Storage engineers love folklore; here are the concrete bits that actually matter:
- ZFS was born in a world of spinning disks, where sequential throughput mattered and seek penalties were brutal. Big blocks made a lot of sense.
- Copy-on-write is ZFS’s superpower and its tax collector. It enables snapshots and consistency, but it also means “small changes can rewrite big blocks” when you choose big blocks.
- zvols were built to provide block devices for iSCSI, VM disks, and swap-like use cases—workloads that don’t behave like big-file NAS traffic.
- 4K sectors won the industry, but the transition was messy (512e drives, Advanced Format). Misalignment could silently cut write performance in half.
- VM hypervisors changed the I/O pattern game. Thin provisioning, snapshotting at the hypervisor layer, and random write bursts became normal, not exceptional.
- Database page sizes are often 8K or 16K (varies by engine and config). When your storage block size fights the DB page size, you pay for it twice: once in I/O, once in WAL/redo behavior.
- SSDs made random I/O cheap compared to HDDs, but they didn’t make it free. Latency is lower, but write amplification (both at ZFS and inside the SSD FTL) still matters.
- NVMe reduced latency so much that “software overhead” became visible. Suddenly, block size choices show up as CPU time and lock contention, not just disk wait.
- ZFS compression became mainstream because CPUs got fast and storage stayed expensive. Compression interacts strongly with block size: larger blocks usually compress better, but can amplify small writes.
A mental model you can use at 3 a.m.
Think of volblocksize as the “minimum rewrite unit” for the zvol inside ZFS. The guest may write 4K, but if ZFS stores the data in 64K chunks, ZFS is responsible for producing a new 64K version of that chunk on every change.
There are three big consequences:
- Random write IOPS: smaller volblocksize generally helps, because you rewrite fewer bytes per write.
- Sequential throughput: larger volblocksize can help, because you amortize metadata and checksum work.
- Tail latency: smaller tends to be more predictable under mixed workloads, while larger can spike badly when you combine sync writes + pool pressure.
Second and final joke: Storage tuning is like making espresso: one notch too fine and everything stalls; one notch too coarse and it tastes like regret.
Sync writes, SLOGs, and why volblocksize changes the pain
VM workloads often generate sync writes even when you didn’t ask for them. Databases fsync. Journaling filesystems commit. Hypervisors may issue flushes. And if you export the zvol over iSCSI (or serve VM storage over NFS, or use certain hypervisor cache settings), sync can effectively become the default behavior.
On ZFS, sync writes have to be made durable before the system acknowledges them. Without a separate SLOG device, that means the main pool vdevs must commit enough intent log records to stable storage quickly. With a good SLOG, you can acknowledge quickly and later flush to the pool in txg commits.
Where does volblocksize enter? In two ways:
- How much data a “small” write turns into inside ZFS before it can be committed. Larger blocks can increase the amount of work to safely represent that change.
- How much fragmentation and churn you create in the main pool, which affects txg commit time and therefore sync latency during congestion.
A classic pain pattern: everything is fine until a backup window starts. The pool gets busy with large sequential reads/writes, txg commit times climb, and suddenly your database’s fsync latency goes from “fine” to “the app is down.” If your volblocksize is too large for your random-write VM disks, you’ve increased the amount of work per fsync under pressure.
Important nuance: sync write performance is not only about volblocksize. It’s also about sync property, logbias, SLOG quality, pool layout (mirrors vs RAIDZ), and whether your storage is saturated. But volblocksize is one of the few levers that changes the fundamental granularity of change.
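One hedged way to see whether the host is doing much ZIL commit work at all (counter names and paths here are from Linux OpenZFS and can vary by version):
cr0x@server:~$ grep -E "zil_commit_count|zil_itx_metaslab" /proc/spl/kstat/zfs/zil
If zil_itx_metaslab_slog_bytes climbs during your latency spikes, sync writes are landing on the SLOG; if zil_itx_metaslab_normal_bytes climbs instead, they’re being committed to the main pool vdevs and will compete with everything else the pool is doing.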
Compression, checksums, and the hidden CPU tax
Compression on zvols is not automatically wrong. But it’s not automatically free, either.
Larger blocks often compress better because compressors see more repeating patterns. This can reduce physical writes, which can offset the cost of rewriting larger logical blocks. In the real world, the mix matters:
- OS disks: often compress well (text, binaries, lots of zeros). Compression can help, and block size choice can be less punishing.
- Databases with already-compressed pages: may not compress well. Then larger blocks just mean more bytes rewritten without savings.
- Encrypted guests: if the guest encrypts the filesystem, compression may become ineffective at the ZFS layer. Don’t bank on compression to save you from a bad block size when the data is high-entropy.
Checksumming and (optional) encryption also scale with “number of blocks” and “bytes processed.” Too small a block size can become CPU-heavy at high IOPS; too big can become latency-heavy at random write workloads. You’re looking for the elbow in the curve for your environment.
ashift, sector sizes, and alignment
If volblocksize is the zvol’s logical rewrite unit, ashift is the pool’s physical alignment assumption. Most modern pools should be at least ashift=12 (4K). Some environments prefer ashift=13 (8K) for certain devices.
Misalignment is the silent killer: if ZFS thinks the device sector is smaller than reality, it can do read-modify-write at the drive level. That’s the kind of performance bug that looks like “random latency spikes” and survives months of meetings.
Rule of thumb that rarely betrays you: make sure your pool’s ashift is correct when you create it. You can’t change it later without rebuilding. Then choose volblocksize as a multiple of the sector size (4K, 8K, 16K…). Most deployments treat 4K or 8K as the safe baseline for VM disks that do random writes.
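Because of that, set ashift explicitly at pool creation and verify it immediately; a minimal sketch with hypothetical device names:
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
cr0x@server:~$ zpool get -H -o value ashift tank
12
The pool property only records what you asked for; the per-vdev truth is what zdb reports (see the ashift task later in this article).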
Picking volblocksize for real VM workloads
Let’s talk choices, not ideology.
Common starting points
- General-purpose VM OS disks: volblocksize=8K is a common compromise: not too metadata-heavy, not too amplification-prone. 4K can be excellent for latency-sensitive mixed workloads but may cost more metadata and CPU at scale.
- Database VM disks (random write heavy): start at 8K or 16K depending on DB page size and observed I/O. If you don’t know, 8K is often a safer default than 128K.
- Sequential-heavy volumes (backup targets inside VMs, media, large object stores): 64K or 128K can make sense if the workload is truly large-block sequential and not doing lots of small random rewrites. (A creation sketch for these profiles follows below.)
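A creation sketch for those profiles (zvol names and sizes are hypothetical; -b and -o volblocksize= are equivalent ways to set the block size):
cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=8K -o compression=lz4 tank/vm-general-disk0
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K -o compression=lz4 tank/vm-db-disk0
cr0x@server:~$ sudo zfs create -V 2T -o volblocksize=128K -o compression=lz4 tank/vm-backup-disk0
Whatever you pick, write it down; the cost of a wrong choice is a migration, not a zfs set.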
Here’s the part people skip: you don’t pick volblocksize based on what ZFS “likes,” you pick it based on what your guest actually does. If your guest does 4K random writes all day, giving it a 128K rewrite unit is basically signing up for unnecessary work and variance.
Mirrors vs RAIDZ changes the stakes
On mirrors, random writes are generally friendlier. On RAIDZ, small random writes are more expensive due to parity and read-modify-write patterns at the vdev layer. A too-small volblocksize on RAIDZ can be punishing. A too-large volblocksize can be punishing in a different way (amplification). The “right” value is more sensitive on RAIDZ.
If you’re running VM storage on RAIDZ and you care about latency, you’re already playing on hard mode. It can work, but the tuning and capacity headroom need to be boringly disciplined.
Three corporate-world mini-stories
1) Incident caused by a wrong assumption: “ZFS will handle it”
They were migrating a busy internal SaaS platform from a legacy SAN to a ZFS-backed virtualization cluster. The team did the sensible stuff: mirrored vdevs, decent RAM, SSDs, and a separate SLOG because the database was known to fsync aggressively. The pilot looked good. The cutover was scheduled, risk assessed, change ticket approved, coffee brewed.
The assumption was simple: “ZFS is smart; it will adapt.” The zvols were created with a large volblocksize because someone remembered that ZFS likes big blocks for throughput, and the vendor slide deck had a graph that rewarded sequential performance. Nobody mapped the VM’s actual I/O profile to the block size.
The incident didn’t happen immediately. It waited until the system had been running long enough to get snapshots, churn, and real-world fragmentation. Then, during a routine batch window, database commit latency started to wobble. Application threads piled up. Retries multiplied. The pool wasn’t “down,” it was just slow in the way that makes everything else look broken.
On-call looked at CPU (fine), network (fine), the SLOG (fine), and still saw sync write latency spikes. The breakthrough came when someone ran a quick zfs get volblocksize and compared it to the guest’s I/O sizes from iostat and a short fio run. The guests were doing lots of 8K and 16K writes; the zvol rewrite unit was far larger. Under load, each tiny commit dragged a larger block rewrite train behind it.
The fix was not a toggle. They had to create new zvols with a saner volblocksize and migrate disks live where possible, offline where not. It was a long night, but a useful one: ZFS is smart, yes. It is not clairvoyant.
2) An optimization that backfired: “Let’s go 4K everywhere”
Another place, different problem. They had a latency-sensitive fleet of small VMs (CI workers, build agents, test databases) and wanted better IOPS. Someone read that 4K volblocksize improves random write performance, and they decided to standardize on it. No exceptions. Consistency is comforting, especially in corporate environments where every exception becomes a meeting.
The first week looked great. Benchmarks improved. The dashboards smiled. People high-fived quietly in Slack with the kind of emoji you use when you don’t want management to notice you changed something important.
Then reality arrived in the form of CPU and memory pressure. The storage nodes started spending more time on metadata and checksum work. ARC churn increased because the working set contained many more blocks and pointers. Some hosts hit a wall where the pool wasn’t saturated on bandwidth, but I/O completion slowed because the software path was doing more per operation.
It didn’t fail catastrophically; it failed bureaucratically. Developers complained that builds were “sometimes slow.” Test jobs became jittery. The SREs spent days chasing phantom “noisy neighbors” before noticing the common thread: a uniform 4K block size on workloads that included plenty of sequential I/O and large file activity.
The eventual policy became more mature: 4K for specific zvols proven to be random-write heavy; 8K or 16K for general OS disks; larger for known sequential volumes. The lesson wasn’t “4K is bad.” The lesson was that dogma is expensive.
3) The boring but correct practice that saved the day: “Standard profiles + a migration path”
This team had been burned before, so they did something unfashionable: they created a small set of storage profiles and made them the default. Each profile had a documented volblocksize, compression setting, sync/logbias expectations, and a short rationale. Nothing exotic. Just decisions written down before the outage.
When a new VM was provisioned, the requester picked “general,” “db-heavy,” or “sequential.” If they didn’t pick, it defaulted to general. That alone prevented a lot of accidental bad fits.
But the real win was the migration path. They accepted that volblocksize is effectively immutable for existing data and built operational muscle for moving disks: create new zvol with the desired volblocksize, replicate or block-copy, cut over, validate, destroy old. They practiced it during normal hours, not as a desperate improvisation during a crisis.
So when a vendor appliance VM arrived with a nasty sync-heavy workload and started hurting the cluster, the fix was not a war room. It was a planned migration using a known-good “sync-latency” profile and a controlled cutover. The day was saved by boredom: a repeatable procedure and the discipline to use it.
Practical tasks: commands, outputs, and interpretation
These are the tasks I actually run when someone says “VM storage is slow” and the only hint is a screenshot of a red graph. Commands assume a Linux host with ZFS utilities. Adjust pool and zvol names as needed.
Task 1: Identify whether the VM disk is a zvol and find its ZFS name
cr0x@server:~$ ls -l /dev/zvol/tank/vm-101-disk-0
lrwxrwxrwx 1 root root 13 Dec 24 10:20 /dev/zvol/tank/vm-101-disk-0 -> ../../zd0
Interpretation: You’re dealing with a zvol. Good—volblocksize applies. The backing device here is /dev/zd0.
Task 2: Check volblocksize (and a few related properties)
cr0x@server:~$ zfs get -H -o property,value volblocksize,compression,logbias,sync,refreservation tank/vm-101-disk-0
volblocksize 128K
compression lz4
logbias latency
sync standard
refreservation none
Interpretation: 128K is large for a general VM disk unless you know it’s mostly sequential. Compression is on (lz4), which may help or not. sync=standard means guest flushes matter.
Task 3: Confirm pool ashift (alignment baseline)
cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
52: vdev_tree:
76: ashift: 12
104: ashift: 12
Interpretation: ashift=12 (4K sectors) is a sane baseline. If you see ashift=9 on modern disks, you’ve likely found a foundational problem.
Task 4: Watch host-side I/O sizes and latency quickly
cr0x@server:~$ iostat -x zd0 1 5
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.12 0.00 3.45 8.90 0.00 81.53
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
zd0 5.00 80.0 0.00 0.00 4.20 16.0 220.00 3520.0 10.00 4.35 18.50 16.0 3.20 92.00
Interpretation: Average write request size ~16K, not 128K. If volblocksize is 128K, expect extra churn on small writes, especially under sync pressure. w_await at ~18ms suggests latency pain already.
Task 5: Observe ZFS-level latency with iostat per vdev
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 4.2T 3.1T 80 900 12.0M 55.0M
mirror 2.1T 1.6T 40 450 6.0M 27.5M
nvme0n1 - - 20 230 3.0M 14.0M
nvme1n1 - - 20 220 3.0M 13.5M
mirror 2.1T 1.6T 40 450 6.0M 27.5M
nvme2n1 - - 20 225 3.0M 13.8M
nvme3n1 - - 20 225 3.0M 13.7M
Interpretation: Bandwidth isn’t huge, but ops are. That’s typical of small-block random I/O. If you’re seeing high ops and moderate MB/s, block size choices matter.
Task 6: Check txg sync pressure (are commits taking too long?)
cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs | head -n 3
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
7812345  9182736450123456 C     201326592    0            88080384     112      2140     5012345678   4521         98213        612345678
7812346  9182741461234567 C     260046848    0            117440512    96       2480     5009871234   3910         102334       1894321765
Interpretation: The path is per-pool and the exact columns vary by OpenZFS version, but the ones to watch are ndirty (dirty data carried by each txg) and the otime/qtime/wtime/stime timings (nanoseconds spent open, quiescing, waiting, and syncing). If stime rises during your latency spikes and ndirty keeps growing, you’re in “commit pressure” territory: dirty data is arriving faster than txgs can sync it.
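A related hedged check is the dmu_tx kstat, which counts how often writers were delayed or throttled by the dirty-data limits (counter names vary somewhat across OpenZFS versions):
cr0x@server:~$ grep -E "dmu_tx_dirty_delay|dmu_tx_dirty_over_max|dmu_tx_dirty_throttle" /proc/spl/kstat/zfs/dmu_tx
If these counters climb during the bad minutes, writers are being slowed down to let txg sync catch up, and amplified rewrites from a too-large volblocksize make that backlog grow faster.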
Task 7: Verify whether a zvol is using a separate log (SLOG) effectively
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
logs
mirror ONLINE 0 0 0
nvme4n1 ONLINE 0 0 0
nvme5n1 ONLINE 0 0 0
errors: No known data errors
Interpretation: There is a mirrored SLOG. Great. But remember: a SLOG helps only for sync writes. If your workload is async or your bottleneck is elsewhere, it won’t save you.
Task 8: Check if the guest is forcing sync (or if you accidentally forced it)
cr0x@server:~$ zfs get -H -o property,value sync tank/vm-101-disk-0
sync standard
Interpretation: standard means ZFS honors O_SYNC/fsync/flush semantics. If someone set sync=always, latency will increase dramatically for many VM workloads. If someone set sync=disabled, latency looks great right up until you lose power and have a very educational day.
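To audit a whole pool instead of one zvol at a time, list the same properties for every volume; a quick sweep like this finds the one zvol somebody “tuned” and forgot about:
cr0x@server:~$ zfs get -r -t volume -H -o name,property,value volblocksize,sync,logbias tank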
Task 9: Create a test zvol with a candidate volblocksize (do not change prod blindly)
cr0x@server:~$ sudo zfs create -V 50G -b 8K -o compression=lz4 tank/test-vm-disk-8k
cr0x@server:~$ zfs get -H -o property,value volblocksize tank/test-vm-disk-8k
volblocksize 8K
Interpretation: Use a test zvol to benchmark with fio from a VM or host. Choosing with data beats choosing with opinions.
Task 10: Run a quick fio random write test against the zvol (host-side)
cr0x@server:~$ sudo fio --name=randwrite --filename=/dev/zvol/tank/test-vm-disk-8k \
--direct=1 --ioengine=libaio --rw=randwrite --bs=8k --iodepth=32 --numjobs=1 --time_based --runtime=20 --group_reporting
randwrite: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.36
randwrite: write: IOPS=42.1k, BW=329MiB/s (345MB/s)(6580MiB/20001msec)
lat (usec): min=70, max=22450, avg=610.42, stdev=380.12
clat percentiles (usec):
| 1.00th=[ 150], 10.00th=[ 250], 50.00th=[ 520], 90.00th=[ 980], 99.00th=[ 1800], 99.90th=[ 4200]
cpu : usr=3.20%, sys=11.40%, ctx=820k, majf=0, minf=12
Interpretation: Look at percentiles, not just IOPS. If your 99.9th percentile is ugly, your application will eventually complain.
Task 11: Compare with a larger volblocksize test zvol (same fio job)
cr0x@server:~$ sudo zfs create -V 50G -b 128K -o compression=lz4 tank/test-vm-disk-128k
cr0x@server:~$ sudo fio --name=randwrite --filename=/dev/zvol/tank/test-vm-disk-128k \
--direct=1 --ioengine=libaio --rw=randwrite --bs=8k --iodepth=32 --numjobs=1 --time_based --runtime=20 --group_reporting
randwrite: write: IOPS=18.7k, BW=146MiB/s (153MB/s)(2920MiB/20001msec)
lat (usec): min=95, max=78410, avg=1380.12, stdev=1220.55
clat percentiles (usec):
| 1.00th=[ 220], 10.00th=[ 420], 50.00th=[ 1100], 90.00th=[ 2600], 99.00th=[ 6800], 99.90th=[ 32000]
Interpretation: Same 8K workload, much worse tail latency and fewer IOPS. That’s the amplification tax showing up.
Task 12: Check zvol logical/physical space use (compression and padding effects)
cr0x@server:~$ zfs list -o name,volsize,used,refer,compressratio tank/test-vm-disk-8k tank/test-vm-disk-128k
NAME VOLSIZE USED REFER COMPRESSRATIO
tank/test-vm-disk-8k 50G 3.2G 3.2G 1.45x
tank/test-vm-disk-128k 50G 6.8G 6.8G 1.10x
Interpretation: Bigger blocks didn’t compress better here; they used more space. Real workloads vary, but this is exactly why you measure.
Task 13: Confirm whether a zvol is sparse (thin) and whether refreservation is masking it
cr0x@server:~$ zfs get -H -o property,value volsize,volmode,refreservation tank/vm-101-disk-0
volsize 500G
volmode default
refreservation none
Interpretation: refreservation=none here means no space is reserved for this zvol, i.e. it is effectively thin (sparse): the pool allocates space only as the guest writes, and a full pool turns into guest I/O errors. Setting a refreservation avoids “out of space” surprises at the cost of capacity efficiency.
Task 14: Measure real guest I/O size distribution (host-side sampling)
cr0x@server:~$ sudo blktrace -d /dev/zd0 -w 10 -o - | blkparse -i - | head -n 12
230,0 3 1 0.000000000 1234 Q WS 12345678 + 16 [qemu-kvm]
230,0 3 2 0.000080000 1234 C WS 12345678 + 16 [0]
230,0 3 3 0.000100000 1234 Q WS 12345710 + 32 [qemu-kvm]
230,0 3 4 0.000125000 1234 C WS 12345710 + 32 [0]
230,0 3 5 0.000140000 1234 Q WS 12345790 + 16 [qemu-kvm]
230,0 3 6 0.000210000 1234 C WS 12345790 + 16 [0]
Interpretation: The “+ 16” is a count of 512-byte sectors, so these are 8K writes (and “+ 32” is 16K). This tells you what the VM is actually issuing, not what you wish it issued.
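If you want a rough write-size histogram instead of eyeballing individual lines, a small awk pass over the same blkparse output works (a sketch: it keys on queue events and assumes 512-byte sectors):
cr0x@server:~$ sudo blktrace -d /dev/zd0 -w 10 -o - | blkparse -i - | \
awk '$6 == "Q" && $7 ~ /W/ { sizes[$10 * 512]++ } END { for (s in sizes) print s, "bytes:", sizes[s], "writes" }' | sort -n
Compare the dominant sizes against the zvol’s volblocksize; that comparison is the whole argument in one line of output.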
Task 15: Plan a safe migration to a new volblocksize (create + copy)
cr0x@server:~$ sudo zfs create -V 500G -b 8K -o compression=lz4 tank/vm-101-disk-0-new
cr0x@server:~$ sudo dd if=/dev/zvol/tank/vm-101-disk-0 of=/dev/zvol/tank/vm-101-disk-0-new bs=16M status=progress conv=fsync
104857600000 bytes (105 GB, 98 GiB) copied, 310 s, 338 MB/s
Interpretation: This is the blunt instrument. It requires downtime (or at least a consistent snapshot strategy at the hypervisor layer). But it’s reliable and keeps the block device semantics straightforward.
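If you can’t keep the VM offline for the whole copy, one hedged variant is to copy from a snapshot of the zvol rather than the live device; the snapshot’s block device only appears if you make snapshot devices visible:
cr0x@server:~$ sudo zfs set snapdev=visible tank/vm-101-disk-0
cr0x@server:~$ sudo zfs snapshot tank/vm-101-disk-0@migrate
cr0x@server:~$ sudo dd if=/dev/zvol/tank/vm-101-disk-0@migrate of=/dev/zvol/tank/vm-101-disk-0-new bs=16M status=progress conv=fsync
The snapshot is crash-consistent, not application-consistent, and anything the guest writes after the snapshot is not in the copy; you still need a short stop-the-VM window for the final sync, or a fresh copy taken while the VM is off.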
Fast diagnosis playbook
When latency is high and people are yelling, you need a short sequence that narrows the search space fast. Here’s mine for ZFS zvol-backed VMs.
First: Is the pool actually saturated or just “slow”?
- Run zpool iostat -v 1 and look at ops and bandwidth.
- Check if one vdev is hotter than the others (imbalance can look like “random latency”).
- Look for high utilization with modest throughput: that often indicates small random I/O or sync pressure.
Second: Is this sync write pain?
- Check zfs get sync,logbias on the zvol.
- Check if a SLOG exists and is healthy (zpool status).
- Correlate latency spikes with txg commit pressure (platform-specific kstats; on Linux, SPL kstats and system logs help).
Third: Does volblocksize match the I/O profile?
- Check zfs get volblocksize for the affected zvol.
- Sample real I/O sizes with iostat -x on the zvol device and, if needed, blktrace.
- If guest writes are mostly 4K–16K and volblocksize is 64K–128K, assume amplification until proven otherwise.
Fourth: Are you capacity/fragmentation constrained?
- Check pool fullness: zfs list / zpool list. High usage increases fragmentation and slows allocations.
- Check for background work: scrub/resilver status in zpool status.
- Snapshot volume: large snapshot counts can increase metadata overhead and slow down some operations.
Fifth: Validate with a controlled micro-benchmark
- Create a test zvol with candidate volblocksize values.
- Run fio with block sizes that match the VM workload and measure tail latency.
- Do not benchmark on an already-burning pool unless you like self-inflicted tickets.
Common mistakes, symptoms, and fixes
Mistake 1: Using 128K volblocksize for random-write VM disks
Symptoms: Great sequential throughput, but databases time out under mixed load; high 99th/99.9th latency; IOPS lower than expected; “everything is fine until backups.”
Fix: Migrate to a new zvol with volblocksize=8K or 16K (depending on workload). Validate with fio and real app metrics. Do not expect an in-place property change to re-block existing data.
Mistake 2: Setting 4K volblocksize everywhere without checking CPU/metadata cost
Symptoms: Random IOPS improves, but storage hosts show higher system CPU; ARC churn; performance jitter on sequential workloads; “it’s fast but inconsistent.”
Fix: Use profiles. Keep 4K for truly random-write-heavy disks; use 8K/16K for general-purpose; use larger only for known sequential volumes.
Mistake 3: Confusing dataset tuning (recordsize) with zvol tuning (volblocksize)
Symptoms: Someone sets recordsize=16K on the parent dataset and expects VM disk performance to change. Nothing changes; blame travels upward.
Fix: recordsize applies to files in datasets; zvols use volblocksize, which is fixed at creation. Check the zvol itself with zfs get volblocksize.
Mistake 4: Using sync=disabled to “fix latency”
Symptoms: Latency graphs look amazing; leadership is happy; then an unclean shutdown happens and the database needs repair or loses recent transactions.
Fix: Keep sync=standard for correctness. If you need sync performance, fix the SLOG, pool layout, and volblocksize alignment with the workload.
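A small verification sketch for the aftermath: confirm sync is back to honoring flushes and that the zvol biases the ZIL toward latency (logbias=latency is the default, but check after people have been “tuning”):
cr0x@server:~$ zfs get -H -o property,value sync,logbias tank/vm-101-disk-0
cr0x@server:~$ sudo zfs inherit sync tank/vm-101-disk-0
zfs inherit resets a locally set sync=disabled back to the inherited default (standard) without you having to remember what the value “should” be.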
Mistake 5: Ignoring pool fullness and fragmentation
Symptoms: Same workload gets slower over months; write latency creeps up; allocations become expensive; “ZFS was fast at the start.”
Fix: Maintain capacity headroom. Plan expansions before panic. Treat snapshots and retention like a budget, not a hobby.
Mistake 6: Benchmarking with the wrong block size and calling it “proof”
Symptoms: fio shows huge MB/s with 1M sequential writes, but production is slow; arguments ensue.
Fix: Benchmark with the block sizes and sync semantics your VM workload uses. Include latency percentiles and mixed read/write patterns.
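A hedged example of a more representative job than 1M sequential writes: mixed random read/write at the size the guest actually issues, with periodic fsync pressure (the numbers are starting points, not gospel):
cr0x@server:~$ sudo fio --name=vm-mix --filename=/dev/zvol/tank/test-vm-disk-8k \
--direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 --bs=8k --iodepth=16 --fsync=32 \
--numjobs=2 --time_based --runtime=60 --group_reporting
Read the latency percentiles for both directions, then run the identical job against test zvols with different volblocksize values and compare tails, not averages.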
Checklists / step-by-step plan
Checklist A: New VM disk provisioning (zvol-backed)
- Classify the workload: general OS, DB-heavy random writes, or sequential-heavy.
- Pick a profile:
  - General: volblocksize=8K (often) with compression=lz4
  - DB-heavy: 8K or 16K; test if you can
  - Sequential-heavy: 64K or 128K if truly sequential
- Create the zvol explicitly (don’t rely on defaults you haven’t audited).
- Confirm properties with zfs get.
- Document the choice in the VM’s metadata/ticket so future you doesn’t have to archeologically dig it up.
Checklist B: When you suspect volblocksize is wrong
- Confirm the zvol’s current volblocksize.
- Measure real I/O sizes (iostat -x, optionally blktrace).
- Create a test zvol with a candidate smaller/larger volblocksize.
- Benchmark with fio using block sizes matching the workload.
- Make a migration plan:
- Downtime window or live migration tooling
- Block-level copy method (dd, replication at hypervisor layer, or guest-level copy)
- Validation steps (filesystem check, app checks, performance checks)
- Cut over, monitor tail latency, then decommission the old zvol.
Checklist C: “Boring safeguards” for VM storage on ZFS
- Keep pool utilization sane; avoid living near full.
- Schedule scrubs, but don’t overlap them with your heaviest write windows unless you mean to test resilience.
- Track latency percentiles, not only averages.
- Standardize a small set of volblocksize profiles and stick to them.
- Practice the migration procedure before you need it.
FAQ
1) What is the default volblocksize, and should I trust it?
Defaults vary by platform and version, and they may not be aligned with your VM workload. Treat defaults as “safe for someone,” not “optimal for you.” Always check what your environment uses and benchmark against your I/O profile.
2) Can I change volblocksize on an existing zvol without migration?
Not usefully. The block size cannot be changed once the volume has been written to, and there is no in-place re-blocking of existing data. In practice, if you need the performance behavior of a different volblocksize, plan on creating a new zvol and migrating the data.
3) Is 4K always best for VM disks?
No. 4K can be excellent for random-write-heavy workloads and predictable latency, but it increases metadata overhead and CPU work. For general-purpose VM disks, 8K or 16K is often a better balance.
4) How does volblocksize relate to the guest filesystem block size?
They’re independent layers. The guest block size influences the I/O it issues. The zvol volblocksize influences how ZFS stores and rewrites those writes. When they’re mismatched (e.g., guest writes 8K pages, zvol uses 128K blocks), you can create amplification and latency variance.
5) If I have a fast NVMe pool, does volblocksize still matter?
Yes. NVMe makes latency low enough that software overhead and amplification become more visible. You can absolutely waste NVMe performance with a poor block size choice, especially for sync-heavy or random-write workloads.
6) Does compression change the recommended volblocksize?
It can. Compression can reduce physical writes and sometimes soften the cost of larger blocks, but it’s workload-dependent. Encrypted or already-compressed data often won’t benefit. Measure compressratio and latency percentiles under realistic load.
7) What about RAIDZ—does that push me to larger or smaller volblocksize?
RAIDZ makes small random writes more expensive due to parity and read-modify-write behavior. That doesn’t mean “always use huge blocks,” but it does mean you should be extra cautious and test. Mirrors are generally friendlier for VM random I/O.
8) Can a SLOG fix bad volblocksize choices?
A good SLOG can dramatically improve sync write latency, but it won’t eliminate the amplification and fragmentation effects of a mismatched volblocksize. Think of SLOG as “helps you acknowledge sync safely,” not “fixes inefficient writes.”
9) My hypervisor uses QCOW2 or another image format—does volblocksize still matter?
Yes, but the stack gets more complex. Image formats can introduce their own allocation granularity and metadata overhead, which can compound with ZFS behavior. If you care deeply about predictable latency, raw volumes are simpler to reason about, and volblocksize becomes a clearer lever.
10) What’s the single safest volblocksize if I must standardize?
If forced to pick one for general VM disks without knowing workloads, 8K is a common “least-wrong” choice in many environments. But the honest answer is: standardize profiles, not a single value.
Conclusion
volblocksize is not glamorous, and that’s exactly why it bites. It’s a low-level decision that quietly shapes how much work your storage stack does per VM write, how stable your latency is under pressure, and how quickly your “fast pool” turns into an apology.
The operationally mature approach is also the simplest: measure real I/O sizes, pick a small set of sane profiles, benchmark tail latency, and accept that changing volblocksize is usually a migration project—not a toggle. Do that, and you get what ZFS is genuinely good at: predictable correctness, strong performance, and storage behavior you can explain to another human during an incident.