You changed recordsize on a ZFS dataset and nothing got faster. Or worse: something got slower and you’re now staring at dashboards like they owe you money.
Welcome to the most common “but I set the knob” moment in ZFS operations.
Recordsize strategy is one of those topics where half the advice is correct, a quarter is outdated, and the remaining quarter is confidently wrong. The trick is learning
what changes instantly, what only affects newly-written blocks, and how to migrate without nuking your entire filesystem just to satisfy a tunable.
What recordsize really does (and what it doesn’t)
recordsize is ZFS’s maximum data block size for files in a filesystem dataset. That “maximum” word is the entire story.
ZFS will store a file in blocks up to recordsize, but it can (and will) use smaller blocks when the file is small, when the application writes
in smaller chunks, or when it can’t efficiently pack data into full records.
Here’s the non-negotiable operational truth: changing recordsize doesn’t rewrite existing blocks.
It changes how new data is laid out from that moment forward. Old blocks stay whatever size they were when written.
This is why “we set recordsize and it didn’t help” is not a mystery; it’s expected behavior.
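If you want to see the policy-versus-data distinction with your own eyes, here is a minimal sketch. The dataset name tank/media is hypothetical, and 1M records assume the large_blocks pool feature, which is enabled by default on modern OpenZFS:

# Policy change: affects only blocks written from this point on.
zfs set recordsize=1M tank/media

# Confirm the property took effect and where it is set from.
zfs get -o name,property,value,source recordsize tank/media

# Existing files keep whatever block sizes they were written with
# until something rewrites them.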
Recordsize is about trading off sequential efficiency versus random update cost. Large records are great when you stream big files (backups, media, logs you only append to, large object stores).
Small records are great when you do small random reads/writes (some databases, VM images, mail spools, metadata-heavy workloads).
The mistake is thinking there’s one correct value. There isn’t. There’s a correct value per dataset and workload.
Also: recordsize is for filesystems (zfs datasets). For block devices (zvol), you’re dealing with volblocksize, which is set at creation and is not casually changed later.
If you run VMs on zvols and you’re tuning recordsize… you’re polishing a bicycle to improve a truck’s fuel economy.
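For completeness, a hedged sketch of the zvol side; names and sizes are hypothetical, and it assumes the VM is powered off and there is room for a second copy. Because volblocksize is fixed at creation, "changing" it means creating a new zvol and copying data into it:

# Create a replacement zvol with the block size you actually want.
zfs create -V 100G -o volblocksize=16K tank/vm-01-new

# Copy at the block-device level, then repoint the hypervisor at the new volume.
dd if=/dev/zvol/tank/vm-01 of=/dev/zvol/tank/vm-01-new bs=1M status=progress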
How recordsize interacts with the real world
- IO amplification: updating 4K in the middle of a 128K record costs a read-modify-write of the whole record, not 4K of IO. ZFS is copy-on-write with checksums and (optionally) compression, so it rewrites a lot of structure to change a little data.
- ARC behavior: large blocks can consume ARC in bigger chunks; that can be good (fewer metadata lookups) or bad (cache churn) depending on access patterns.
- Compression: larger records often compress better. But compressing a 1M record to save space is not “free” when your workload is random reads of small ranges.
- Special vdevs: if you have special allocation classes for metadata/small blocks, recordsize changes can shift what qualifies as “small” and ends up on your special vdev. That can be either a gift or a fire.
- Snapshots: rewriting data to adopt new recordsize will keep old blocks alive via snapshots. Migration plans that ignore snapshot retention are how storage teams learn humility.
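Two quick checks cover those last two bullets before you commit to anything: how much space snapshots already pin, and whether a special vdev is quietly in the data path. A minimal sketch, with tank/app standing in for your dataset:

# Space held only because snapshots still reference old blocks.
zfs get -o name,property,value usedbysnapshots,usedbydataset tank/app

# Is a special allocation class present, and what is it absorbing?
zpool list -v tank
zfs get -o name,property,value,source special_small_blocks tank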
One quote to keep you honest
Paraphrased idea (Gene Kim): “Improving reliability comes from understanding flow and feedback loops, not heroics after the outage.”
Interesting facts and historical context
ZFS has been around long enough to accumulate folklore. Some of it is useful; some of it should be composted. Here are a few concrete facts and context points
that actually matter for recordsize migration decisions:
- Recordsize started life in Solaris-era ZFS as a pragmatic “big enough for sequential, not too big for memory” default—128K became common because it worked on real disks and real workloads.
- The default recordsize historically stayed stable precisely because it’s a compromise, not because it’s optimal. It’s designed to be “fine” for general file sharing, not “great” for your OLTP database.
- ZFS always supported variable block sizes up to recordsize; that’s why small files don’t get padded to 128K. This is also why recordsize is a ceiling, not a mandate.
- OpenZFS diverged and evolved across platforms. Features like large dnodes, special vdevs, and improved prefetch changed the performance landscape around recordsize, even if recordsize itself didn’t “change.”
- Compression algorithms improved (e.g., LZ4 became the practical default). That made “bigger records compress better” more operationally relevant because compression became cheaper to use everywhere.
- 4K sector drives pushed ashift realities. Mismatching ashift and underlying sectors can dominate performance; recordsize tuning can’t compensate for bad physical alignment.
- VM storage trends forced clarity: the community learned (sometimes loudly) that zvol tuning is different from filesystem tuning, and recordsize advice doesn’t transfer.
- SSD/NVMe changed the failure modes: random IOPS are less terrifying than on spinning disks, but write amplification and latency spikes still matter—especially with parity RAID, small updates, and sync writes.
Why you’re migrating: symptom-driven motivations
“We want to change recordsize” is not a goal. It’s a tactic. Your goal is usually one of these:
- Reduce latency on random reads or writes for applications that touch small regions.
- Increase throughput for large sequential streams (backup jobs, media processing, ETL pipelines, archive workloads).
- Reduce write amplification for overwrite-heavy workloads (some database patterns, VM images, container layers under churn).
- Fix cache churn where ARC is filled with big records but the workload wants small slices.
- Stop hammering a special vdev because too many blocks qualify for “special small blocks.”
- Make replication and send/receive behavior predictable by aligning new writes with a known block strategy.
You migrate recordsize strategy when you’ve measured a mismatch between what your workload does and what your dataset is optimized for.
If you’re doing it because “someone on the internet said databases like 16K,” you’re already halfway to an incident report.
Fast diagnosis playbook
When performance is bad and recordsize is being blamed (often unfairly), don’t start by rewriting data.
Start by proving what’s slow and where.
First: identify the workload pattern
- Is it mostly random reads? Random writes? Sequential reads/writes?
- What IO size does the application actually issue? 4K? 16K? 128K? Mixed?
- Is the slowness latency-bound or throughput-bound?
Second: check the storage path for “hard limits”
- Sync writes waiting on SLOG? (Or worse: no SLOG on a sync-heavy system.)
- Parity vdevs doing small random overwrites? That’s a tax collector that never sleeps.
- Special vdev saturated? Metadata IO can kneecap the whole pool.
- Ashift wrong? Misalignment can dominate everything else.
Third: validate what recordsize is doing today
- What is the dataset’s recordsize now?
- What block sizes exist on disk for the hot files?
- Are snapshots pinning old layouts so “migration” isn’t actually changing physical reality?
Recordsize is a lever. But you don’t pull levers until you’ve confirmed you’re pulling the right one.
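One cheap way to confirm the workload pattern before pulling anything is the histogram output in zpool iostat, assuming your OpenZFS build is recent enough to have it (most releases from the past several years are):

# Request-size histogram per vdev: is the pool seeing 4K-16K random IO,
# or large sequential requests?
zpool iostat -r tank 5 1

# Latency histograms answer the companion question: slow, or merely busy?
zpool iostat -w tank 5 1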
Practical tasks: commands, outputs, and decisions
These are real operational tasks you can run today. Each includes a command, a sample output, what it means, and the decision you make from it.
No magic, no vibes.
Task 1: Confirm current recordsize (and whether you’re even looking at the right dataset)
cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/app
NAME PROPERTY VALUE SOURCE
tank/app recordsize 128K local
Meaning: tank/app has recordsize 128K set locally. If it says inherited, the setting comes from a parent dataset.
Decision: If you intended to tune only a child dataset, stop and fix inheritance boundaries before you change anything.
Task 2: Verify whether you’re on a filesystem dataset or zvol
cr0x@server:~$ zfs list -r -o name,type,volblocksize,recordsize,mountpoint tank
NAME TYPE VOLBLOCK RECSIZE MOUNTPOINT
tank filesystem - 128K /tank
tank/app filesystem - 128K /tank/app
tank/vm-01 volume 16K - -
Meaning: tank/vm-01 is a zvol; recordsize doesn’t apply. volblocksize does.
Decision: If your issue is VM disk IO and it’s on zvols, stop tuning recordsize and evaluate volblocksize, zvol alignment, and guest IO patterns.
Task 3: Check ashift and vdev layout (recordsize can’t rescue bad fundamentals)
cr0x@server:~$ zdb -C tank | egrep "ashift|vdev_tree|type:|path:"
ashift: 12
type: 'raidz'
path: '/dev/disk/by-id/nvme-SAMSUNG_MZQLB1T9HAJR-00007_...'
Meaning: ashift=12 means 4K sectors. That’s usually sane. If you see ashift=9 on modern disks, assume misalignment pain.
Decision: If ashift is wrong, plan a pool rebuild. Do not waste a week “tuning recordsize” as a coping mechanism.
Task 4: Confirm compression and its interaction with IO size
cr0x@server:~$ zfs get -o name,property,value,source compression,compressratio tank/app
NAME PROPERTY VALUE SOURCE
tank/app compression lz4 local
tank/app compressratio 1.43x -
Meaning: LZ4 is enabled; data is compressing reasonably. Larger records often improve compression ratio, but can increase latency for small random reads.
Decision: If you are latency-sensitive and see high compressratio with large records, consider whether you’re paying decompression cost on reads of tiny slices.
Task 5: Observe real-time IO size patterns (are you guessing or measuring?)
cr0x@server:~$ iostat -x 1 5
avg-cpu: %user %nice %system %iowait %steal %idle
8.12 0.00 3.94 1.25 0.00 86.69
Device r/s w/s rMB/s wMB/s avgrq-sz await svctm %util
nvme0n1 820.0 610.0 12.8 9.6 32.0 1.9 0.4 58.0
Meaning: avgrq-sz is reported in 512-byte sectors, so ~32 sectors is roughly 16K per request (which matches ~12.8 MB/s across 820 reads/s). That’s not “big streaming.” It’s moderate-size IO. If your recordsize is 1M and the devices are seeing 16K requests, you should be suspicious.
Decision: If request sizes are small and random, consider smaller recordsize (or accept that your application is the limiting factor, not your dataset).
Task 6: Check ZFS-level IO and latency with zpool iostat
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 2.10T 5.15T 1.20K 900 85.0M 60.0M
raidz1-0 2.10T 5.15T 1.20K 900 85.0M 60.0M
nvme0n1 - - 400 300 28.0M 20.0M
nvme1n1 - - 400 300 28.0M 20.0M
nvme2n1 - - 400 300 28.0M 20.0M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: Roughly 2.1K ops/s moving about 145 MB/s works out to ~70K per op: mixed, mid-size IO rather than clean streaming. If ops climb while bandwidth stays modest, you’re likely IOPS/latency bound, not throughput bound.
Decision: If you’re IOPS-bound on RAIDZ, smaller recordsize might reduce read-modify-write amplification for overwrites in some patterns, but parity overhead may still dominate. Consider mirrors for random-write-heavy workloads.
Task 7: Check sync write pressure (recordsize isn’t the fix for fsync storms)
cr0x@server:~$ zfs get -o name,property,value,source sync,logbias tank
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank logbias latency local
Meaning: sync=standard is normal. logbias=latency prefers the SLOG for sync writes (if present).
Decision: If your app is fsync-heavy and latency is bad, confirm SLOG health/performance before touching recordsize. Recordsize doesn’t make a slow sync device faster.
Task 8: Verify you have (or don’t have) a SLOG device
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
logs
nvme3n1 ONLINE 0 0 0
Meaning: There is a separate log device. Good. Now verify it isn’t a consumer SSD without power-loss protection (PLP); those deliver miserable sync-write latency once they can’t safely cache writes.
Decision: If you don’t have a SLOG and you have sync workloads, don’t let recordsize become your scapegoat.
Task 9: Inspect file block sizes on disk (are your “new settings” actually present?)
cr0x@server:~$ zdb -ddddd tank/app | egrep "Object|plain file" | head -n 6
Object lvl iblk dblk dsize dnsize lsize %full type
7 2 128K 128K 6.31M 512 9.00M 100.00 ZFS plain file
12 3 128K 128K 742M 512 1.00G 100.00 ZFS plain file
Meaning: The dblk column shows the data block size these files were written with, which is 128K here. If you changed recordsize to 16K yesterday and still see 128K, those files were not rewritten.
Decision: If the hot files are still using old block sizes, you need a targeted rewrite/migration approach (or accept it’s not worth rewriting).
Task 10: Identify snapshot pinning before planning rewrites
cr0x@server:~$ zfs list -t snapshot -d 1 -o name,used,refer,creation -s creation tank/app | tail -n 5
tank/app@snap-2025-12-01 85.2G 120G Mon Dec 1 03:00 2025
tank/app@snap-2025-12-08 91.7G 120G Mon Dec 8 03:00 2025
tank/app@snap-2025-12-15 96.4G 120G Mon Dec 15 03:00 2025
tank/app@snap-2025-12-22 101.9G 120G Mon Dec 22 03:00 2025
tank/app@snap-2025-12-26 104.1G 120G Fri Dec 26 03:00 2025
Meaning: Snapshot “used” is growing. Rewriting will duplicate blocks and snapshots will keep the old ones, potentially doubling space.
Decision: If you plan to rewrite/migrate, either coordinate snapshot retention changes or ensure you have headroom. No headroom means no migration; it means panic later.
Task 11: Estimate dataset headroom realistically (not “df says we’re fine”)
cr0x@server:~$ zfs list -o name,used,avail,refer,logicalused,logicalreferenced tank/app
NAME USED AVAIL REFER LUSED LREFER
tank/app 2.1T 3.4T 1.9T 2.8T 2.5T
Meaning: Physical used vs logical used can diverge because of compression and snapshots. Headroom calculations must consider snapshot behavior during rewrite.
Decision: If avail is tight, avoid any approach that duplicates data (rsync within the same dataset, file copy churn, naive rewrite).
Task 12: Check whether special small blocks are in play
cr0x@server:~$ zfs get -o name,property,value,source special_small_blocks tank
NAME PROPERTY VALUE SOURCE
tank special_small_blocks 0 default
Meaning: special_small_blocks=0 means only metadata goes to special vdevs (if present). If this is set (e.g., 16K), then small file blocks may go to special devices.
Decision: If special_small_blocks is enabled and special vdev is hot, consider raising recordsize or adjusting special_small_blocks to avoid pushing too much data to special.
Task 13: Confirm what changed when you set recordsize (property sources matter)
cr0x@server:~$ zfs get -r -o name,property,value,source recordsize tank | head
NAME PROPERTY VALUE SOURCE
tank recordsize 128K default
tank/app recordsize 16K local
tank/app/logs recordsize 16K inherited from tank/app
Meaning: You changed tank/app and the child inherited it. That might be correct—or it might be collateral damage.
Decision: If logs are append-only and you just forced them to 16K, you probably reduced throughput and increased metadata overhead for no benefit.
Task 14: Do a controlled “rewrite a subset” test with a sacrificial directory
cr0x@server:~$ zfs create -o recordsize=16K tank/app_test
cr0x@server:~$ rsync -aHAX --info=progress2 /tank/app/hotset/ /tank/app_test/hotset/
sending incremental file list
3.42G 12% 95.11MB/s 0:00:32
28.40G 100% 102.88MB/s 0:04:38 (xfr#231, to-chk=0/232)
Meaning: You copied the hot set into a dataset with the desired recordsize. Now the data blocks for those files will be created under the new policy.
Decision: Benchmark the application against this test dataset. If it improves, you have evidence. If it doesn’t, stop blaming recordsize.
Joke #1: Changing recordsize and expecting old data to change is like changing the office coffee brand and expecting last quarter’s outages to disappear.
Migration strategies that don’t rewrite everything
The headline promise—“changing strategy without rewriting everything”—is only partially possible, because physics and on-disk reality exist.
You can change policy immediately. You can avoid rewriting cold data. You can target rewrites. You can also decide not to rewrite at all and still win.
Strategy A: Accept that only new writes matter (and make that enough)
This is the easiest and most underrated strategy. Many datasets are “mostly append” or “mostly churn” even if they look static.
If the hot working set is overwritten naturally, then a policy change will migrate you over time.
Where this works:
- Log datasets where files rotate and old logs age out.
- Temporary build artifacts and CI caches.
- Object stores where objects are replaced (not modified in place) and lifecycle policies expire old objects.
What you do:
- Set the new recordsize.
- Ensure new writes land in the right dataset (separate mountpoints if needed).
- Watch block size distribution on newly-written files over days/weeks.
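Watching that distribution doesn’t require rewriting anything. A sketch, assuming OpenZFS on Linux (where a file’s inode number matches the object id zdb expects) and a hypothetical file path:

# Pick a file you know was created after the policy change.
f=/tank/app/hotset/part-0001.dat

# The inode number doubles as the ZFS object id on OpenZFS on Linux.
obj=$(stat -c %i "$f")

# The dblk column shows the block size this file was actually written with.
zdb -ddddd tank/app "$obj" | head -n 20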
Strategy B: Split datasets by workload, not by org chart
Most recordsize failures happen because someone tried to make one dataset serve two masters: small random IO and large sequential throughput.
Don’t. Use multiple datasets with different properties and move data along boundaries that match how it’s used.
Practical splits that work:
- db/ (small random): recordsize 16K–32K often makes sense for many OLTP-ish patterns, but measure.
- wal-or-redo/ (sequential append, sync): usually wants larger records and very careful sync/SLOG attention.
- backups/ (big sequential): 1M recordsize can be great.
- home/ or shared/ (mixed): keep the default 128K unless you have proof.
Splitting datasets gives you: separate snapshots, separate quotas, separate reservation strategies, and—critically—separate tuning.
It’s how grown-ups run ZFS in production.
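What a split like that can look like on the command line; the dataset names and values below are illustrative, not a recommendation, and zstd assumes OpenZFS 2.0 or newer:

# One dataset per workload, each with properties matching how it is used.
zfs create -o recordsize=16K -o compression=lz4 tank/db          # small random IO
zfs create -o recordsize=128K -o compression=lz4 tank/db-wal     # sequential append, sync-heavy
zfs create -o recordsize=1M -o compression=zstd tank/backups     # large sequential streams
zfs create tank/shared                                           # mixed use: keep the 128K default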
Strategy C: Targeted rewrite of hot data only
If performance pain is concentrated in a hot subset (database files, VM images, a particular index directory),
rewrite just that subset into a new dataset with the right recordsize. Then flip the application path.
Techniques:
- Create a new dataset with the desired recordsize.
- Copy the hot directory tree over (rsync or application-level export/import).
- Switch mountpoints or symlink boundaries (prefer mountpoints; symlinks confuse people at 3am).
- Keep old data available for rollback, but don’t leave it mounted where the app can accidentally use it.
For databases, application-level tools (dump/restore, physical backup restore, tablespace relocation) often produce a cleaner rewrite than file-level copying.
They also let the database re-pack and reduce internal fragmentation. File-level rsync is blunt; sometimes blunt is fine.
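The cutover itself should stay boring. A sketch of the mountpoint flip with a rollback guard, assuming the rewritten copy lives in a hypothetical tank/app_new and the application is stopped:

# Park the old dataset out of the application's path and freeze it for rollback.
zfs set mountpoint=/tank/app-old tank/app
zfs set readonly=on tank/app

# Put the rewritten copy where the application expects its data.
zfs set mountpoint=/tank/app tank/app_new

# Rollback is the reverse: readonly=off on the old dataset, swap mountpoints back.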
Strategy D: Rewrite in place by copying file-to-file (use with caution)
The classic trick: copy a file to a temporary name and rename it over the original so ZFS allocates new blocks under the new recordsize.
This can work for a bounded set of files when you can tolerate extra space and IO.
The ugly parts:
- Snapshots keep old blocks alive. Space usage can explode.
- You can accidentally trigger massive write amplification and fragmentation.
- It’s hard to do safely at scale without an orchestration plan.
If you do this, do it in a dedicated maintenance window, on a known list of files, with monitoring, and with snapshot policy temporarily adjusted.
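If you go down this road anyway, keep the file list explicit and the script dumb. A cautious sketch; the list path is hypothetical, and it assumes nothing has the files open and hard links don’t matter:

#!/usr/bin/env bash
set -euo pipefail

# Rewrites an explicit list of files so their blocks are reallocated under the
# dataset's current recordsize. Old blocks pinned by snapshots are NOT freed;
# budget the space before running this.
while IFS= read -r f; do
    tmp="${f}.rewrite.$$"
    cp -a -- "$f" "$tmp"    # new blocks are allocated here, under the new policy
    mv -- "$tmp" "$f"       # rename over the original within the same dataset
done < /root/files-to-rewrite.txt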
Strategy E: ZFS send/receive as a migration engine
ZFS send/receive is fantastic for moving datasets between pools or reorganizing dataset trees. It is not a magical block-size rewriter by itself.
A send/receive reproduces the dataset content and preserves the existing block structure in the stream (use -L if you need to keep records larger than 128K intact), so it won’t “re-recordsize” your existing data.
What send/receive is great for in this context:
- Moving data to a new pool with a different vdev layout better suited to your IO pattern (mirrors vs RAIDZ).
- Reorganizing datasets so future writes follow correct policies.
- Providing a clean rollback point: if your rewrite plan goes sideways, you can revert by re-mounting the received dataset.
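A sketch of the send/receive path; the pool and snapshot names are hypothetical, and -L/-c assume the large_blocks and compressed-send support present in modern OpenZFS:

# Snapshot the source tree and replicate it to the new pool.
zfs snapshot -r tank/app@migrate-1

# -R carries child datasets and properties, -L keeps records larger than 128K
# intact, -c sends compressed blocks as-is, -u receives without mounting.
zfs send -R -L -c tank/app@migrate-1 | zfs receive -u newpool/app

# Received data keeps its original block sizes; only writes made after the
# cutover follow whatever recordsize is set on the destination.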
Strategy F: Don’t migrate recordsize; migrate the workload
Sometimes the bottleneck isn’t recordsize. It’s:
- A single-threaded application doing tiny synchronous writes.
- A database configured with an IO scheduler that’s hostile to SSDs.
- A VM guest issuing 4K random writes on a RAIDZ pool with no SLOG and heavy snapshots.
If you can change the workload (batch writes, change database page size, adjust checkpointing, move WAL to a separate dataset),
you can keep recordsize stable and still get a win that doesn’t require risky rewrites.
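A sketch of the “move the workload, not the blocks” idea, using a database WAL/redo directory as the example. The dataset name and mountpoint are hypothetical, and the application-side step depends entirely on your database:

# Give the sync-heavy append stream its own dataset and policy instead of
# forcing one recordsize to serve it and the random-IO data files.
zfs create -o recordsize=128K -o mountpoint=/var/lib/db/wal tank/db-wal

# Then relocate the WAL/redo directory with the database's own mechanism
# (config option, symlink, or rebuild), with the database stopped.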
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran an internal analytics platform on ZFS. The storage team noticed rising query latency and saw a chorus of blog posts:
“Databases like smaller recordsize.” Someone changed the dataset’s recordsize from 128K to 16K in the middle of a release cycle.
Nothing improved. Two days later, it got worse. The team concluded the “16K trick” didn’t work and rolled it back.
That rollback also did nothing. Now they had two changes and zero causality.
The actual issue was a set of large sequential scans colliding with a nightly job that wrote big parquet files and then immediately re-read them.
The hot files were mostly newly generated and streaming; 16K records increased metadata overhead and reduced read-ahead efficiency.
The latency spike correlated with that job, not with database random IO.
The wrong assumption was that “database = random IO,” but the workload wasn’t OLTP; it was an analytics mix with heavy sequential reads.
The fix was boring: keep recordsize at 128K (or larger for the parquet output dataset), isolate job outputs into their own dataset,
and schedule the competing jobs so they didn’t stomp the same vdev queues.
The post-incident action item that stuck: any recordsize change required two measurements—application IO size distribution and ZFS-level ops/bandwidth—before and after.
No measurement, no change.
Mini-story 2: The optimization that backfired
Another organization had a large backup repository on ZFS and wanted faster replication. Someone set recordsize to 1M on the backup dataset.
Replication throughput improved. Everyone celebrated and went home early, which is how you invite trouble.
A month later, restores for small files got noticeably slower. Not catastrophic, just annoying—until an incident forced a restore of millions of small config files
and package manifests. The restore SLA got missed, and now management cared.
The backfire wasn’t that 1M is “bad.” It was that the dataset wasn’t only backups; it also held a giant tree of small files that were frequently accessed during restores.
With huge records, reading a tiny file often dragged in more data than needed and churned ARC. The system did more work for the same result.
The resolution was to split the repository: one dataset optimized for large sequential backup streams (1M), another for small-file catalogs and indexes (128K or smaller).
Replication stayed fast, and the restore path stopped punishing small-file access.
The operational lesson: “Our backup dataset” is not a workload description. It’s an accounting label. Recordsize only respects workloads, not labels.
Mini-story 3: The boring but correct practice that saved the day
A regulated enterprise ran customer-facing services on ZFS with strict change control. They wanted to migrate a dataset containing VM images
from a legacy pool to a new pool with faster hardware. There was also a desire to “fix recordsize while we’re at it,” because there always is.
The team did something unfashionable: they built a staging dataset and ran a canary workload on it for a week.
They captured zpool iostat metrics, guest IO sizes, and application p99 latency. They also documented the rollback procedure
and tested it twice.
During canary, they learned that recordsize wasn’t even in play because these were zvol-backed disks. The correct tuning knob was volblocksize,
but changing it required zvol recreation. That’s the sort of detail that turns “simple change” into “planned migration.”
They migrated by creating new zvols with the correct volblocksize, doing storage-level copies, and switching the hypervisor to the new volumes.
Boring, careful, and undeniably correct. They also avoided an outage caused by someone trying to tune recordsize on the wrong object type.
The week-long canary felt slow. It saved them from a much slower kind of week later: the one where you’re rebuilding systems under pressure.
Common mistakes: symptoms → root cause → fix
These show up in real incident timelines. The symptoms are familiar. The root causes are usually avoidable. The fixes are specific.
1) “We changed recordsize, but performance didn’t change at all.”
- Symptoms: No improvement after recordsize change; benchmarks look identical; zdb shows old block sizes.
- Root cause: Existing data wasn’t rewritten; recordsize only affects newly written blocks.
- Fix: Targeted rewrite of hot data (copy to a new dataset), or accept gradual migration via churn. Verify with zdb on the hot files.
2) “Small random writes are slow; we lowered recordsize and it’s still slow.”
- Symptoms: High latency, high ops, low bandwidth; RAIDZ pool; frequent overwrites; maybe heavy snapshots.
- Root cause: Parity read-modify-write overhead dominates; sync writes may be forcing ZIL behavior; snapshots increase fragmentation.
- Fix: Consider mirrors for random-write-heavy datasets; isolate sync-heavy components; evaluate SLOG; reduce snapshot churn on hot overwrite data.
3) “Throughput tanked after we set recordsize to 16K.”
- Symptoms: Sequential jobs slower; more CPU in kernel; higher metadata IO; lower prefetch efficiency.
- Root cause: Dataset is sequential/streaming, but recordsize was shrunk, increasing block count and overhead.
- Fix: Restore larger recordsize (128K–1M) for streaming datasets; split workloads into separate datasets.
4) “We rewrote data and ran out of space.”
- Symptoms: Pool fills during migration; deletes don’t free space; everything gets worse quickly.
- Root cause: Snapshots pin old blocks; rewrite duplicates data; insufficient headroom.
- Fix: Adjust snapshot retention during migration; clone to a new dataset and cut over; add temporary capacity; plan headroom before rewriting.
5) “Special vdev is saturated after we tuned recordsize.”
- Symptoms: Special device shows high %util; metadata latency spikes; overall pool latency climbs.
- Root cause: special_small_blocks causes more data blocks to land on the special vdev; the recordsize/small-block distribution shifted.
- Fix: Re-evaluate the special_small_blocks threshold; move small-block-heavy data off special; ensure the special vdev has enough performance and endurance.
6) “We tuned recordsize for VMs and nothing changed.”
- Symptoms: VM performance unchanged; you changed recordsize on a dataset that stores zvols or sparse files with different IO patterns.
- Root cause: VMs are on zvols where volblocksize matters; or the hypervisor uses direct IO patterns not aligned with your assumption.
- Fix: Tune volblocksize at zvol creation; use separate datasets for VM images if file-based; measure guest IO sizes.
Joke #2: The fastest way to discover you don’t have enough pool headroom is to start a “quick” rewrite at 4:55pm on Friday.
Checklists / step-by-step plan
This is the plan I’d want in front of me during a change window: short enough to use, strict enough to prevent self-inflicted wounds.
Checklist 1: Decide whether recordsize is actually the lever
- Measure IO size distribution (iostat -x, application metrics, hypervisor metrics).
- Measure ZFS ops vs bandwidth (zpool iostat -v).
- Identify if you're sync-bound (look for fsync-heavy behavior; verify SLOG presence/perf).
- Confirm dataset type (filesystem vs zvol).
- Confirm snapshots and headroom for any rewrite plan.
Checklist 2: Choose a migration strategy
- Policy-only change (no rewrite): Use when data naturally churns or hot data is mostly new writes.
- Split datasets: Use when you have mixed workloads in one dataset.
- Targeted rewrite: Use when a specific directory/file set causes the pain.
- Workload migration: Use when the application can change (batching, page sizes, WAL separation).
- Pool layout migration: Use when parity overhead is the problem and you need mirrors or different hardware.
Checklist 3: Execute a safe targeted rewrite cutover
- Create new dataset with desired recordsize and other properties (compression, atime, xattr, etc.).
- Run a small canary copy and validate block sizes on a few hot files (zdb -ddddd).
- Benchmark the application against the new location (same query set, same VM workload, same backup job).
- Plan snapshot policy: reduce retention during migration or ensure sufficient headroom.
- Rsync/copy hot subset; verify checksums or application-level consistency.
- Cut over via mountpoint switch; keep old dataset read-only for rollback.
- Monitor p95/p99 latency, pool queue depth, special vdev, SLOG if present.
- After stabilization, retire old data and restore snapshot retention.
Checklist 4: Post-change validation (don’t skip the boring part)
- Confirm property sources: recordsize set where intended, inherited where intended.
- Confirm hot files are written under the new policy.
- Confirm no unexpected datasets inherited the change.
- Compare before/after metrics: ops, bandwidth, latency, CPU.
- Document the final dataset layout and why each recordsize exists.
FAQ
1) If I change recordsize, will my existing files be re-blocked automatically?
No. Existing blocks stay as-is. Only newly written blocks follow the new maximum block size.
If you need existing hot files to change, you must rewrite them (directly or indirectly) so ZFS allocates new blocks.
2) What recordsize should I use for databases?
It depends on access patterns and database page size. Many OLTP workloads benefit from smaller recordsize (often 16K–32K),
but analytics scans often prefer larger. Measure IO sizes and latency first; then test on a subset dataset.
3) Is 1M recordsize always best for backups?
Often, but not always. If the dataset is truly large sequential streams, 1M can improve throughput and compression.
If it also stores small frequently-accessed catalogs or indexes, split those into a different dataset.
4) Can ZFS send/receive rewrite data into a new recordsize?
Not by itself. Send/receive preserves the existing block structure in the stream.
Use send/receive to move datasets between pools or reorganize; use targeted rewrites or application-level rebuilds to change on-disk block sizing.
5) Why does lowering recordsize sometimes make things slower?
Smaller records increase block count, metadata overhead, and can reduce read-ahead efficiency for sequential workloads.
Lowering recordsize trades sequential throughput for cheaper random updates. If your workload is mostly streaming, small recordsize is self-sabotage.
6) How do snapshots affect recordsize migration?
Snapshots keep old blocks alive. If you rewrite data to force new block sizes, you’ll allocate new blocks while snapshots retain the old ones.
Space usage can spike dramatically. Plan headroom and snapshot retention before rewriting.
7) Should I tune recordsize for VM disks?
If your VM disks are zvols, recordsize is irrelevant; volblocksize matters and is usually chosen at zvol creation.
If VM disks are files in a filesystem dataset, recordsize can matter—but guest IO patterns still rule. Measure them.
8) How can I tell if my hot files are using the new recordsize?
Use zdb -ddddd on the dataset (pass the file’s object id, which matches its inode number on OpenZFS on Linux) and look at the dblk column (e.g., 16K vs 128K).
Also, copy a small test set into a new dataset with the desired recordsize and compare performance.
9) Does compression change the “right” recordsize?
It can. Larger records often compress better, which can reduce physical IO.
But if you frequently read small slices, you may pay decompression and read amplification costs. Test with real access patterns.
10) If recordsize is only a maximum, why do people obsess over it?
Because it strongly influences how ZFS lays out large files, and large files are where performance problems hide:
databases, VM images, big logs, backups, indexes. The obsession is justified; the simplistic advice is not.
Practical next steps
If you remember only three things: recordsize is a ceiling, it doesn’t rewrite old data, and the fastest path to a safe migration is dataset separation plus targeted rewrites.
Everything else is just implementation detail—and implementation detail is where outages breed.
- Measure first: capture IO size distribution, pool ops/bandwidth, and sync behavior.
- Scope the change: confirm dataset types and inheritance so you don’t “optimize” your logs into a regression.
- Pick the least risky migration: policy-only for churny data, targeted rewrites for hot subsets, dataset splits for mixed workloads.
- Validate with evidence: zdb for block sizes, application p99 latency for user impact, and zpool iostat for pool behavior.
- Write down what you did and why: future you will be tired, and tired people deserve good documentation.