You inherited a pool. The VMs feel “fine” until patch night, backup night, or that one analytics job night. Someone says,
“Let’s just change volblocksize and we’ll get better IOPS.” You nod because it sounds like a setting.
Then you run the command, it accepts it, and… nothing gets better. Sometimes it gets worse. Then you learn the hard truth:
you changed a property, not the physics of already-written blocks.
ZVOLs are deceptively simple: a block device backed by ZFS. But their block size behavior is not like a filesystem dataset’s
recordsize. When you “migrate” volblocksize, you’re mostly changing how future writes
are laid out. Existing data stays laid out the old way unless you rewrite it. In production, “rewrite it” usually means
“recreate the ZVOL and move data onto it with intent.”
volblocksize is not a time machine
Here’s the headline: changing volblocksize on an existing ZVOL rarely changes performance meaningfully until
the data is rewritten. ZFS doesn’t go back and “repack” a ZVOL just because you asked nicely. You can set the property today,
but the blocks already on disk keep their original size, segmentation, and indirect block structure.
That’s not ZFS being stubborn. That’s ZFS being consistent. ZFS is copy-on-write: it writes new blocks elsewhere, updates
metadata, and then flips pointers. It doesn’t overwrite old blocks in place. If you want existing data to be rewritten using
a new block size, you need a workflow that causes full logical rewrite of the volume: replication to a new ZVOL, a restore
from backup, or a block-level copy onto a freshly created target.
The practical result: most “volblocksize migrations” are actually “recreate-and-move” projects. The rest are wishful thinking,
plus a change ticket that says “performance tuning” and a postmortem that says “no measurable impact.”
What volblocksize actually controls (and what it doesn’t)
volblocksize: the ZVOL’s allocation unit
A ZVOL is a ZFS dataset of type volume. It exports a block device (e.g., /dev/zvol/pool/vm-101-disk-0)
that consumers treat like a disk. For ZVOLs, volblocksize is the logical block size ZFS uses for the volume’s data.
Unlike a filesystem’s recordsize, it is effectively fixed: set volblocksize=16K and the ZVOL’s data blocks are 16KiB logically; set 8K and they’re 8KiB, and so on. Compression can shrink what gets physically allocated, but a write smaller than volblocksize doesn’t produce a smaller block; it produces a read-modify-write of a full block. That’s why the configured size so strongly shapes allocation and the resulting I/O pattern.
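If you want to know what’s actually on disk, as opposed to what the property says, zdb can dump the ZVOL’s block pointers. A minimal sketch, assuming the example ZVOL used later in this article and that the volume’s data lives in object 1 (typical for ZVOLs); the exact output format varies by ZFS version, and zdb needs root:
cr0x@server:~$ sudo zdb -dddddd pool/vm-101-disk-0 1 | grep ' L0 ' | head -n 5
Each L0 line carries a logical/physical size pair (for example, 20000L means a 128K logical block). If the property says 16K but the L0 entries say 128K, you’re looking at data written under the old policy.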
recordsize is for filesystems, not ZVOLs
On a filesystem dataset, recordsize controls the maximum record size for file data. It’s about file blocks.
It interacts with application I/O size, compression, and random access patterns. On ZVOLs, you don’t have file records;
you have a virtual disk. The guest filesystem decides its own block sizes and layouts, and ZFS sees writes to LBAs.
If you take one idea away: recordsize tuning is often forgiving because filesystems naturally rewrite
hot files over time. ZVOLs backing VM disks can sit there with old block layout for ages because many workloads don’t
rewrite the whole virtual disk—especially databases and long-lived VM images.
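A quick way to see the split in practice is to ask for both properties on both kinds of datasets. A minimal sketch, with pool/data standing in for a hypothetical filesystem dataset alongside the example ZVOL:
cr0x@server:~$ zfs get -o name,property,value recordsize,volblocksize pool/data pool/vm-101-disk-0
The filesystem reports a recordsize and shows a dash for volblocksize; the volume does the opposite. The dash is ZFS telling you the property doesn’t apply to that dataset type.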
Where the pain comes from: write amplification and I/O mismatch
The easiest way to think about a bad volblocksize is mismatch. Your workload issues 4K random writes, but you set
volblocksize=128K. ZFS now tends to allocate (and later read/modify/write) much bigger chunks. On sync-heavy workloads,
that mismatch can turn into latency spikes, bloated metadata, and an I/O profile that looks like a confused elephant on roller skates.
Another mismatch is between volblocksize and pool ashift (physical sector size alignment). You can “work”
with misalignment, but you pay for it with extra I/O, worse fragmentation behavior, and device-level write amplification. The bill
arrives as latency.
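If you want to see the mismatch rather than argue about it, a short synthetic test against a scratch or clone ZVOL makes the point quickly (never a production device; fio writes to the target). A minimal sketch, assuming fio is installed and pool/volblocksize-test is a hypothetical throwaway volume:
cr0x@server:~$ sudo fio --name=sync4k --filename=/dev/zvol/pool/volblocksize-test --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=8 --runtime=60 --time_based --group_reporting
Run it once on a 16K volume and once on a 128K volume and compare the completion-latency percentiles, not the average. The tail is where the read-modify-write penalty lives.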
Joke #1: Changing volblocksize on a busy ZVOL and expecting instant improvement is like repainting a forklift and calling it a turbo upgrade.
Why changing it rarely fixes existing data
ZFS properties are not always retroactive
ZFS properties fall into buckets. Some affect runtime behavior immediately (like compression for future writes),
some affect metadata decisions, and some are mostly about default policy for new allocations. volblocksize is in the
“new allocations” bucket. It doesn’t trigger a rewrite of existing blocks because that would be a massive, risky background operation.
Copy-on-write means “rewrite happens only when the workload rewrites”
ZFS allocates new blocks on modification. It doesn’t go touch old blocks to “normalize” them to a new preferred size. So when you
change volblocksize, only blocks written after that point are likely to reflect the new size. If your VM disk has a giant
mostly-static OS image and a small churny database file, you might see a small subset of blocks changing, while the majority stays
stuck in the old layout.
Even if the guest rewrites, snapshots can pin old blocks
Snapshots are the other trap. If you have frequent snapshots (VM backups, replication points, “just in case” snapshots that never die),
old blocks remain referenced and cannot be freed. The guest may rewrite data, but the old blocks remain pinned by snapshots, leaving
fragmentation and space usage unchanged. Your new layout exists alongside the old one. Congratulations, you now have a museum.
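You can put rough numbers on both effects before deciding anything. A minimal sketch using standard space-accounting properties; written is the churn since the most recent snapshot, and usedbysnapshots is how much old layout the snapshots are pinning:
cr0x@server:~$ zfs get -o property,value written,usedbydataset,usedbysnapshots pool/vm-101-disk-0
If written is tiny relative to the volume size, “natural rewrite” will take roughly forever. If usedbysnapshots is large, the old layout isn’t going anywhere until retention lets it.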
The “it accepted the property” trap
Depending on platform and ZFS implementation, zfs set volblocksize=... may be accepted when the ZVOL is not in active use, refused with “volume is busy,” or rejected outright. Either way, acceptance doesn’t mean you got the desired migration.
It means the dataset now has a property value that will guide future allocations. That’s it.
Why recreation is usually the clean answer
Recreating the ZVOL forces all blocks to be allocated fresh under the new policy. That’s the only deterministic way to guarantee the
whole volume gets the new volblocksize behavior. Everything else is probabilistic and slow, and gets derailed by snapshots,
thin provisioning patterns, and partial rewrites.
Interesting facts and historical context
- ZFS was born at Sun in the mid-2000s, designed around end-to-end data integrity, copy-on-write, and pooled storage—ideas that later became table stakes.
- ZVOLs were a deliberate “block device” bridge: they let ZFS serve workloads that demand a disk abstraction (VMs, iSCSI targets) without forcing a ZFS filesystem inside the guest.
- The original ZFS emphasis was correctness over micro-optimizations; properties like block sizing were policy knobs, not live-repacking mechanisms.
- Advanced Format drives (4K sectors) made alignment issues painfully mainstream; ashift became a critical pool design decision rather than trivia.
- VM storage pushed ZVOL popularity because hypervisors and SAN tooling speak block devices fluently, even when file-based images might be simpler.
- Early ZFS-on-Linux adoption accelerated operational patterns like zfs send/receive replication, making “recreate and stream data” a normal move.
- Compression became mainstream not just for capacity, but for performance: less I/O can beat more CPU, especially on SSD-backed pools.
- Snapshots changed operational culture: teams started snapshotting everything because it was cheap—until it wasn’t, especially with pinned blocks and long retention chains.
- As NVMe latency dropped, software overhead and I/O amplification got more visible; bad block sizing can waste the advantage of fast media.
Fast diagnosis playbook
When someone complains “the ZVOL is slow,” don’t start by changing volblocksize. Start by figuring out which layer is lying.
Here’s a tight sequence that finds the bottleneck quickly.
1) First: is it latency or throughput?
- If users complain about “VM sluggishness” and “DB timeouts,” assume latency.
- If backups or bulk copies are slow, assume throughput—unless they’re sync writes.
2) Second: is the workload sync-bound?
- Check sync behavior and whether you have a sane SLOG (or none, and that’s fine too).
- Sync-heavy random writes amplify every bad decision: block size, fragmentation, and device queueing.
3) Third: confirm the I/O size actually seen by the pool
- Observe I/O sizes at the ZFS layer (see the sketch after this list). If you’re seeing 128K writes but the app claims 4K, something is coalescing—or lying.
4) Fourth: check fragmentation and snapshot pinning
- High fragmentation plus long snapshot chains can kill random I/O even on SSDs.
- If blocks can’t be freed, “tuning” becomes placebo.
5) Fifth: only then decide if volblocksize is the lever
- If the workload is 4K/8K random I/O (databases, VM boot storms), smaller volblocksize often helps.
- If it’s large sequential writes/reads (video, backups inside the guest), larger blocks can help—assuming the guest issues big I/O.
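For step 3, OpenZFS can report the request sizes and latencies it actually services, with no guest-side guesswork. A minimal sketch, assuming a reasonably recent OpenZFS that supports the -r and -l flags on zpool iostat:
cr0x@server:~$ zpool iostat -r pool 5 1
cr0x@server:~$ zpool iostat -l pool 5 1
The first command prints per-vdev request size histograms (where your “4K workload” may turn out to be 128K by the time ZFS sees it); the second prints average wait times split across disk and queues, which is usually where sync pain shows up.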
Practical tasks: commands, outputs, and decisions
The point of commands isn’t to look busy. It’s to turn “I think” into “I know,” and then make a decision you can defend in a review.
Below are hands-on tasks I actually use when diagnosing or planning a ZVOL volblocksize migration.
Task 1: Identify the ZVOL and verify its current volblocksize
cr0x@server:~$ zfs list -t volume -o name,volsize,volblocksize,compression,sync pool/vm-101-disk-0
NAME VOLSIZE VOLBLOCKSIZE COMPRESS SYNC
pool/vm-101-disk-0 200G 128K lz4 standard
What it means: This ZVOL is configured to allocate up to 128K blocks for future writes. It’s also compressed with lz4.
Sync is standard (honors application requests).
Decision: If the guest workload is mostly 4K/8K random I/O, 128K is suspicious. Don’t change it yet. Measure first.
Task 2: Check if the property is inherited or explicitly set
cr0x@server:~$ zfs get -H -o property,value,source volblocksize pool/vm-101-disk-0
volblocksize 128K local
What it means: It’s locally set. Someone chose this.
Decision: Treat this as an intentional tuning attempt. Ask what workload and why. Assume the reason is lost to time.
Task 3: Confirm pool ashift and vdev layout (alignment reality check)
cr0x@server:~$ zdb -C pool | sed -n '1,120p'
MOS Configuration:
version: 5000
name: 'pool'
state: 0
txg: 1234567
pool_guid: 1234567890123456789
hostid: 1122334455
hostname: 'server'
vdev_tree:
type: 'root'
id: 0
guid: 9876543210987654321
children[0]:
type: 'mirror'
id: 0
guid: 1111111111111111111
ashift: 12
What it means: ashift: 12 implies 4K sectors (2^12). Good baseline for modern SSD/HDD.
Decision: If ashift were 9 (512B) on 4K media, you’d have deeper problems than volblocksize. Plan a pool migration, not a ZVOL tweak.
Task 4: Check ZVOL “in use” state and whether a live change is even allowed
cr0x@server:~$ zfs set volblocksize=16K pool/vm-101-disk-0
cannot set property for 'pool/vm-101-disk-0': volume is busy
What it means: The kernel is using the block device (mapped, exported via iSCSI, attached to a VM, etc.).
Decision: Don’t fight it. Plan a migration to a new ZVOL with the desired block size, then cut over.
Task 5: Identify what’s consuming the ZVOL (VM, iSCSI, multipath)
cr0x@server:~$ ls -l /dev/zvol/pool/vm-101-disk-0
lrwxrwxrwx 1 root root 13 Dec 26 02:10 /dev/zvol/pool/vm-101-disk-0 -> ../../zd0
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MOUNTPOINT,FSTYPE /dev/zd0
NAME TYPE SIZE MOUNTPOINT FSTYPE
zd0 disk 200G
What it means: The ZVOL is a raw disk device zd0. If you see it mounted, mapped to DM, or used by iSCSI,
that explains “busy.”
Decision: Track the dependency chain before planning any migration window. People get upset when you detach “just a disk.”
Task 6: Check snapshot count and retention risk (pinned blocks)
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -s creation | grep '^pool/vm-101-disk-0@' | tail -n 5
pool/vm-101-disk-0@auto-2025-12-26_0000 1.2G 180G
pool/vm-101-disk-0@auto-2025-12-26_0100 600M 180G
pool/vm-101-disk-0@auto-2025-12-26_0200 400M 180G
pool/vm-101-disk-0@auto-2025-12-26_0300 250M 180G
pool/vm-101-disk-0@auto-2025-12-26_0400 200M 180G
What it means: You have frequent snapshots. The used column suggests active churn. Old blocks are likely pinned.
Decision: If you were hoping “the workload will rewrite itself into the new volblocksize,” snapshots will sabotage you.
Migration to a new ZVOL with a clean snapshot policy is the sane route.
Task 7: Measure I/O and latency at the pool level
cr0x@server:~$ zpool iostat -v pool 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
pool 4.20T 3.10T 820 2400 45.2M 110M
mirror 4.20T 3.10T 820 2400 45.2M 110M
nvme0n1 - - 410 1200 22.6M 55M
nvme1n1 - - 410 1200 22.6M 55M
What it means: You’re write-heavy. This is the moment to ask: are these writes sync? Are they small random? Are they sequential?
Decision: If bandwidth is high but latency complaints persist, the issue is likely sync latency, queueing, or write amplification,
not raw throughput.
Task 8: Check per-dataset I/O (is the ZVOL really the hot spot?)
cr0x@server:~$ zfs iostat -v pool 1 3
capacity operations bandwidth
dataset alloc free read write read write
----------------------------- ----- ----- ----- ----- ----- -----
pool 4.20T 3.10T 0 0 0 0
pool/vm-101-disk-0 200G 3.10T 120 1600 3.2M 62M
pool/vm-102-disk-0 120G 3.10T 40 300 1.1M 10M
What it means: vm-101 is dominating writes.
Decision: Don’t start tuning the pool globally if one ZVOL is the offender. Fix the offender.
Task 9: Check compression ratio and whether smaller blocks will explode metadata
cr0x@server:~$ zfs get -H -o property,value compressratio pool/vm-101-disk-0
compressratio 1.45x
What it means: Compression is doing useful work. Smaller volblocksize can reduce read-modify-write on small writes,
but may increase metadata overhead and reduce compression efficiency in some patterns.
Decision: If the workload is DB-like random I/O, favor correctness and latency over theoretical compression gains.
Task 10: Confirm TRIM behavior (thin provisioning sanity)
cr0x@server:~$ zfs get -H -o property,value autotrim pool
autotrim on
What it means: The pool is issuing TRIM automatically (implementation-dependent behavior). For ZVOLs used by guests,
you also need discard support in the guest stack to reclaim space effectively.
Decision: If space usage keeps growing even after deletions in the guest, verify guest discard and snapshot retention before blaming volblocksize.
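The guest half of that check is quick, assuming a Linux guest and a hypothetical device name (your virtio or SCSI naming will differ):
cr0x@guest:~$ lsblk -D /dev/sda
cr0x@guest:~$ sudo fstrim -av
Non-zero DISC-GRAN/DISC-MAX values mean the virtual disk advertises discard all the way down; all zeros mean deletes in the guest will never reach the ZVOL, no matter what autotrim says on the pool.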
Task 11: Estimate rewrite feasibility with a quick logical copy test (offline or on a clone)
cr0x@server:~$ zfs snapshot pool/vm-101-disk-0@pre-migrate
cr0x@server:~$ zfs clone pool/vm-101-disk-0@pre-migrate pool/vm-101-disk-0-clone
What it means: You now have a clone that shares blocks. It’s great for testing workflows, not for forcing a rewrite.
Decision: Use the clone to validate export/attach steps, not to “fix” block layout. Shared blocks remain shared until rewritten.
Task 12: Create the new ZVOL with the desired volblocksize (the real migration step)
cr0x@server:~$ zfs create -V 200G -b 16K -o compression=lz4 -o sync=standard pool/vm-101-disk-0-new
cr0x@server:~$ zfs list -t volume -o name,volsize,volblocksize pool/vm-101-disk-0-new
NAME VOLSIZE VOLBLOCKSIZE
pool/vm-101-disk-0-new 200G 16K
What it means: You have a clean target ZVOL with the new block size. This is the only way to guarantee fresh allocation policy.
Decision: Commit to a cutover plan. The rest is data movement and operational discipline.
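If you want the new volume thin-provisioned, the -s flag creates a sparse volume with no refreservation. A hedged variant of the same creation step; the usual caveat applies that a sparse volume lets the pool run out of space underneath a guest that thinks it has 200G:
cr0x@server:~$ zfs create -s -V 200G -b 16K -o compression=lz4 -o sync=standard pool/vm-101-disk-0-new
Pick one deliberately. Thick reservations are boring and predictable; sparse volumes are efficient right up until capacity planning stops being someone else’s problem.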
Task 13: Block-level copy with progress (offline migration)
cr0x@server:~$ dd if=/dev/zvol/pool/vm-101-disk-0 of=/dev/zvol/pool/vm-101-disk-0-new bs=16M status=progress conv=sync,noerror
21474836480 bytes (21 GB, 20 GiB) copied, 42 s, 511 MB/s
What it means: You’re performing a logical rewrite onto the new ZVOL. The destination will be allocated using its volblocksize
policy as the stream lands.
Decision: Use this only when the source is quiesced (VM shut down, filesystem clean) unless you’re intentionally doing a crash-consistent copy.
Most of the time, shut it down.
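If the source disk is mostly empty, a variant worth considering is conv=sparse, which skips writing blocks that read back as all zeros so unused regions stay unallocated on the freshly created target. Same quiesced-source assumption as above, and don’t combine it with conv=sync:
cr0x@server:~$ dd if=/dev/zvol/pool/vm-101-disk-0 of=/dev/zvol/pool/vm-101-disk-0-new bs=16M status=progress conv=sparse
This only helps when the target starts empty; on a reused target, skipped regions would keep whatever stale data was already there.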
Task 14: Verify the guest sees the same disk identity expectations (UUIDs, partitions)
cr0x@server:~$ sgdisk -p /dev/zvol/pool/vm-101-disk-0-new
Disk /dev/zvol/pool/vm-101-disk-0-new: 419430400 sectors, 200.0 GiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 5A1B2C3D-4E5F-6789-ABCD-0123456789AB
Partition table holds up to 128 entries
Number Start (sector) End (sector) Size Code Name
1 2048 2099199 1024.0 MiB EF00 EFI System
2 2099200 419428351 199.0 GiB 8300 Linux filesystem
What it means: The partition table copied over and looks consistent. The “Sector size” output is useful: if your guest expects 4K,
verify you’re not surprising it.
Decision: If you see unexpected sector sizes or missing partitions, stop. Don’t proceed to cutover until you understand why.
Task 15: Cutover by swapping ZVOL names atomically (with careful sequencing)
cr0x@server:~$ zfs rename pool/vm-101-disk-0 pool/vm-101-disk-0-old
cr0x@server:~$ zfs rename pool/vm-101-disk-0-new pool/vm-101-disk-0
cr0x@server:~$ zfs list -t volume -o name,volblocksize pool/vm-101-disk-0
NAME VOLBLOCKSIZE
pool/vm-101-disk-0 16K
What it means: The canonical name now points to the new ZVOL, so any automation that references the dataset path keeps working.
Decision: Keep the old ZVOL around until validation is done. Then destroy it intentionally, not impulsively.
Migration strategies that actually work
Strategy A: Recreate + block copy (simple, blunt, effective)
This is the dd or imaging approach. It works well when you can take downtime and want a deterministic rewrite.
It is also the easiest to explain to change management: “We created a new volume with correct geometry and copied the disk.”
Trade-offs:
- Pros: Guarantees rewrite, respects new volblocksize, minimal ZFS magic.
- Cons: Needs downtime for consistency; copies free space too; slow if the disk is mostly empty.
Strategy B: Recreate + file-level migration inside the guest (cleaner data, but more moving parts)
If you can attach a second disk to the VM, you can migrate at filesystem level: create new ZVOL, attach to guest, format it,
copy data (rsync, database dump/restore), then switch boot configuration. This rewrites only used data and often results in less
fragmentation. It’s slower in human time, faster in storage time.
- Pros: Doesn’t copy unused blocks; can improve guest filesystem layout; often less downtime.
- Cons: Requires guest changes; more failure modes; needs careful application consistency handling.
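A minimal sketch of the in-guest copy step, assuming a Linux guest, a hypothetical application data path, and the new disk already formatted and mounted at /mnt/newdisk:
cr0x@guest:~$ sudo rsync -aHAX --numeric-ids --info=progress2 /var/lib/app/ /mnt/newdisk/app/
A common pattern is to run this once while the application is still up, then stop the application and run the exact same command again to pick up a short final delta before switching mount points.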
Strategy C: ZFS send/receive is for datasets, not direct “volblocksize conversion”
People reach for zfs send/receive because it’s the ZFS hammer. It’s great for moving datasets and preserving snapshots.
But for changing volblocksize, it’s not the magic bullet you want it to be. Replication preserves the block structure as represented
in the stream. In many practical cases, a receive will not magically “reblock” your data into the new volblocksize.
Use send/receive for:
- Moving the ZVOL to another pool or host.
- Preserving snapshots/history.
- Disaster recovery patterns where identity and point-in-time rollback matter.
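When moving (not reblocking) is the goal, a minimal replication sketch, assuming a hypothetical destination host and pool; note that the received ZVOL keeps the source’s block structure:
cr0x@server:~$ zfs snapshot pool/vm-101-disk-0@move
cr0x@server:~$ zfs send -R pool/vm-101-disk-0@move | ssh backuphost zfs receive tank/vm-101-disk-0
The -R flag carries snapshots and properties along, which is exactly what you want for DR and exactly what you don’t want if the whole point was to leave the old layout behind.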
Use recreate-and-copy when:
- You want a deterministic allocation policy change.
- You want to break away from legacy snapshot chains and fragmentation.
- You want a performance reset you can actually measure.
Strategy D: “Let it rewrite naturally” (usually a trap)
Sometimes teams change volblocksize and hope the workload will rewrite enough data over time for the new size to take effect.
This can work in narrow cases: ephemeral workloads, CI runners, caches, or VM templates that are frequently reimaged.
It fails when:
- Snapshots pin old blocks.
- The disk is mostly static (OS + installed apps) and churn is small relative to total.
- The workload is “hot in a small area” (databases), rewriting the same region and increasing fragmentation without changing the rest.
A single reliability quote to keep you honest
Brian Kernighan’s well-known line fits storage changes uncomfortably well: “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”
Storage migrations aren’t hard because commands are hard. They’re hard because assumptions multiply faster than IOps on a benchmark slide.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a virtualization cluster with ZVOL-backed VM disks. A new engineer noticed that volblocksize
was 8K on some older ZVOLs and 128K on some newer ones. They assumed “bigger blocks = better throughput,” and that “ZFS will handle the rest.”
A change ticket was filed to set everything to 128K across the board.
The first surprise was that some volumes refused to change because they were “busy.” The engineer worked around it by rebooting VMs in batches,
flipping the property when disks detached. The second surprise was subtle: a handful of database VMs started showing periodic latency spikes.
The apps didn’t crash. They just got slow enough to trip retry storms and pile up connection pools.
The assumption that bit them: changing the property would “convert” existing data layout. It didn’t. They had a mixed layout: old 8K-ish
behavior in older blocks, new 128K allocations in newer ones, plus a long snapshot history that pinned old blocks. The storage graphs looked
like a normal busy system—until you looked at tail latency and sync write times.
The fix wasn’t heroic. It was boring. They created new 16K ZVOLs for the database VMs, did controlled offline block copies, cut over, and
reduced snapshot retention for those particular volumes. Performance stabilized, and the cluster stopped having “mystery Tuesdays.”
Lesson: a property change is not a migration plan. If you need a migration, migrate.
Mini-story 2: The optimization that backfired
An enterprise team ran iSCSI targets backed by ZVOLs. They had a workload that looked sequential on paper: nightly ETL dumps and big file transfers.
They set volblocksize=256K for “maximum streaming performance,” and on the benchmark run it looked great.
Then production arrived. The workload wasn’t just dumps; it was also metadata-heavy and had a background agent doing small random writes
to “track progress,” plus a security tool constantly updating tiny files. The iSCSI initiators issued a mix of 4K sync writes and larger async writes.
Latency climbed, then bursty pauses showed up. The ETL jobs started missing windows. The dashboards showed bandwidth headroom, which made the
first responders chase the wrong problem: “But we’re not saturating the pool.”
What happened: with large volblocksize, small random sync writes triggered more read-modify-write behavior and metadata churn.
The sequential part got faster; the tail latency got worse. And tail latency is what users notice, and what distributed systems amplify.
The rollback was awkward because they tried to “just set volblocksize back.” That didn’t re-layout existing allocations; it merely changed future writes.
They ended up recreating the ZVOLs at 16K, restoring from application-level backups (not block-level), and accepting that the benchmark win was
never representative of production.
Lesson: optimizing for the average case is how you buy outages. Optimize for the ugly mixed workload that actually runs at 2 a.m.
Mini-story 3: The boring but correct practice that saved the day
A regulated shop had a standard: every VM disk ZVOL must be created from a thin wrapper script that sets
volblocksize based on workload class (general VM, database, log-heavy), and tags the dataset with a human-readable note.
No one loved the script. Everyone wanted to “just create a volume.”
Years later, a storage refresh forced them to migrate pools. During migration planning, they discovered a scattering of ZVOLs with
odd sizes and performance issues. The wrapper script’s tags made it obvious which volumes were meant to be DB-class, which were general,
and which were special snowflakes created by hand during incidents.
When a particularly sensitive billing database started showing latency spikes after the move, the team had a clear, documented baseline:
it was supposed to be 16K. It was. So they didn’t waste a week arguing about block sizes. They checked sync latency, verified SLOG health,
found a misconfigured initiator forcing cache flushes, and fixed the real issue.
The boring practice wasn’t the script itself. It was the insistence on repeatability and metadata: the system described what it was supposed to be.
That prevented a speculative tuning spiral under pressure.
Lesson: the best migration tool is a standard you enforced when you weren’t in a crisis.
Common mistakes: symptoms → root cause → fix
1) Symptom: “We changed volblocksize and nothing happened”
Root cause: Existing blocks were not rewritten; snapshot retention pinned old allocations.
Fix: Recreate the ZVOL with the desired -b and perform a logical rewrite (offline block copy or guest-level restore).
Review snapshots and retention; don’t carry a decade of snapshots into a performance migration unless you truly need them.
2) Symptom: “Random write latency spiked after increasing volblocksize”
Root cause: Read-modify-write amplification and metadata overhead on small sync writes.
Fix: For DB-like random workloads, choose 8K–16K volblocksize more often than not. Validate with workload-specific tests,
not generic sequential benchmarks. Recreate and rewrite if you need the layout to be consistent.
3) Symptom: “Space usage didn’t go down after deleting data in the VM”
Root cause: Guest discard not enabled, snapshots pin old blocks, or the workload rewrites without freeing due to snapshot chains.
Fix: Confirm guest discard/TRIM settings, remove unneeded snapshots, and ensure backup tooling isn’t snapshotting every hour forever.
4) Symptom: “zfs set volblocksize fails with ‘volume is busy’”
Root cause: The block device is attached, open, mapped, or exported. Many platforms prevent changing it when in use.
Fix: Don’t schedule a risky live tweak. Create a new ZVOL with correct size and migrate. If you need near-zero downtime, use
application-level replication or guest-level mirroring, then cut over.
5) Symptom: “Reads are fast, writes are slow, and CPU is low”
Root cause: Sync writes hitting slow latency path (no suitable SLOG, misconfigured sync settings, or storage that can’t commit fast enough).
Fix: Confirm sync behavior at application and ZFS level. If you deploy SLOG, do it properly (power-loss-protected media), or don’t do it.
Don’t use sync=disabled as a performance feature unless you’re okay with losing committed writes.
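Before redesigning the sync path, confirm what the dataset is actually set to and how latency is distributed. A minimal sketch; the -w histogram flag assumes a reasonably recent OpenZFS:
cr0x@server:~$ zfs get -o property,value,source sync pool/vm-101-disk-0
cr0x@server:~$ zpool iostat -w pool 5 1
If sync writes pile up in the slow buckets while async writes look healthy, the problem is the commit path (log device, flush behavior, media), not the block size.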
6) Symptom: “After migration, performance improved briefly then degraded”
Root cause: Fragmentation grew due to ongoing small random writes; snapshot schedule pinning prevented cleanup; guest filesystem may be heavily fragmented.
Fix: Revisit snapshot policy, consider periodic guest-level maintenance (e.g., DB maintenance, filesystem trim), and monitor fragmentation indicators.
Don’t treat volblocksize as a one-time miracle knob.
Joke #2: Snapshots are like corporate email retention—everyone loves them until they discover they’re storing every bad decision forever.
Checklists / step-by-step plan
Decision checklist: should you recreate the ZVOL?
- Is the workload dominated by small random I/O (4K–16K)? If yes, you probably want a smaller volblocksize.
- Do you have a long snapshot chain? If yes, “natural rewrite” won’t cleanly convert old blocks.
- Is the volume busy and can’t change property? That’s your answer: recreate-and-cutover.
- Do you need predictable, measurable change in a maintenance window? Recreate.
- Is the volume ephemeral and frequently rewritten end-to-end? You might get away with just setting it for future writes.
Step-by-step: safe recreate-and-cutover (offline, minimal drama)
- Baseline metrics. Capture pool and dataset iostat during the pain window. Confirm what “bad” looks like so you can verify “better.”
- Inventory properties. Record volsize, volblocksize, compression, sync, reservation/refreservation, and snapshot policy.
- Ensure you can roll back. Take a final snapshot and confirm backups are valid (not just “configured”).
- Schedule downtime or consistency method. Shut down the VM or quiesce the application. Decide what “consistent” means.
- Create the target ZVOL. Use zfs create -V ... -b ... with explicit properties. No inheritance surprises.
- Copy data. Use block-level copy for simplicity, or guest-level copy for efficiency. Validate the result.
- Cut over by rename or reattach. Prefer dataset rename for automation compatibility. Confirm device paths.
- Boot and validate. Check guest filesystem, application health, and storage latency.
- Monitor for 24–72 hours. Watch tail latency, sync writes, and snapshot growth.
- Retire the old ZVOL. Keep it for a defined rollback window, then destroy it intentionally. Document the change.
Step-by-step: if you insist on “change it in place”
Sometimes you can’t migrate immediately. If you’re going to set volblocksize without recreation, do it with eyes open:
- Confirm there are no snapshots that will keep old blocks forever (or accept the consequence).
- Set the property only during a quiet period; expect zero immediate improvement.
- Plan a later “rewrite event” (restore from backup, reimage, full disk copy) that actually forces the layout to change.
FAQ
1) Can I change volblocksize on an existing ZVOL?
Sometimes you can set the property, sometimes it will refuse if the volume is busy. Even when it succeeds, it typically affects only future allocations.
It does not reliably convert existing data layout.
2) Why does recreating the ZVOL work when changing the property doesn’t?
Recreation forces all blocks to be allocated anew under the new policy. Copying data onto a fresh ZVOL causes a full logical rewrite,
which is what you actually need to “migrate” block sizing behavior.
3) Does ZFS send/receive change volblocksize?
Don’t treat it as a conversion tool. Send/receive is excellent for replication and moving datasets, but it does not guarantee a reblocking
transformation of existing allocations. If you need deterministic block layout change, recreate and rewrite.
4) What volblocksize should I use for VM disks?
For general-purpose VM disks, 8K–16K is a common, practical range. Databases often prefer smaller. Large sequential workloads can justify larger.
Choose based on observed I/O sizes and sync behavior, not ideology.
5) Is smaller volblocksize always faster for random I/O?
Often, but not always. Smaller blocks can reduce read-modify-write on small writes, but they can increase metadata overhead and reduce compression efficiency.
If you’re CPU-bound on compression or metadata, smaller can hurt. Measure with production-like tests.
6) How do snapshots interfere with volblocksize migration?
Snapshots keep references to old blocks. Even if the guest rewrites data, old blocks remain allocated and fragmentation persists.
You end up with a mixed layout and fewer of the benefits you expected.
7) What about sync=disabled to “fix latency”?
That’s not a fix. That’s a business decision to accept possible loss of acknowledged writes on crash/power loss. If you can accept that risk,
fine—document it. Otherwise, solve the real sync path (device latency, SLOG design, initiator behavior).
8) Can I use zvol “thin provisioning” safely with a migration?
Yes, but understand reclaim. Ensure discard/TRIM works end-to-end (guest → hypervisor → ZVOL → pool → SSD). Also remember snapshots can defeat reclaim.
Migration by guest-level restore often reclaims space better than block copying.
9) How do I prove the new volblocksize helped?
Compare before/after latency (especially tail latency), sync write times, and application-level metrics. Pool bandwidth alone is a weak signal.
Use zpool iostat, zfs iostat, and workload-specific measurements.
10) If I already changed volblocksize, should I change it back?
If you changed it and saw regression, changing it back may not undo the layout changes already written. It can stop further drift,
but the clean fix is still recreate-and-rewrite for the affected volume.
Next steps you can take this week
If you’re sitting on ZVOL performance complaints and a stack of half-remembered tuning attempts, do this in order:
- Pick one problematic ZVOL and capture a 10-minute baseline during real load: zpool iostat -v 1 and zfs iostat -v 1.
- Inventory snapshots and retention for that ZVOL. If you have a snapshot museum, admit it and plan cleanup or a migration that doesn’t bring the museum along.
- Create a test ZVOL with the target volblocksize and run a controlled migration in a maintenance window. Measure tail latency and application behavior.
- Standardize creation: script it, document it, tag datasets. Future-you will be tired and angry; help them now.
The point isn’t to worship a number like 8K or 16K. The point is to make block layout a deliberate choice, then enforce it by recreating when you need a real change.
ZFS gives you sharp tools. Use them like you’re going to be on call when they cut.