ZFS has a reputation: “set it and forget it.” That’s mostly deserved—until it isn’t. One dataset property, recordsize, quietly decides whether your storage behaves like a well-oiled runway or like a baggage carousel full of angry rollers. It won’t fix every problem, but it will amplify the good choices and punish the bad ones.
If you run production systems—VM fleets, databases, NFS home directories, backup targets—recordsize is the knob you touch when performance is “mysteriously fine in dev” and then becomes “mysteriously expensive in prod.” This article is a field guide: what recordsize really controls, how it interacts with compression and caching, and how to pick values that don’t create hidden write amplification or latency spikes.
What recordsize actually is (and what it isn’t)
recordsize is a ZFS dataset property that controls the maximum size of a file data block ZFS will try to use for regular files in that dataset. Think of it as “how big a chunk ZFS prefers when it can.” It is not a promise that every block will be that size.
Important nuances that matter in production:
- It’s a maximum, not a minimum. ZFS will use smaller blocks when it has to (small files, tail blocks, fragmented allocations, or when compression makes blocks smaller).
- It applies to file data blocks. It is not the same as volblocksize (for zvols), and it does not directly set metadata block sizes.
- It affects how much data must be read/rewritten when part of a file changes. This is the performance fulcrum: random small writes into a large record can trigger a read-modify-write pattern at the block level.
- It does not retroactively rewrite existing blocks. Changing recordsize affects new writes going forward. Old blocks stay as they are until rewritten (via normal application churn or via rewrite/copy operations you initiate).
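Before acting on any of this, it helps to know what you currently have. A quick audit sketch, assuming a pool named tank (substitute your own), lists recordsize for every filesystem dataset along with where the value came from:
cr0x@server:~$ zfs get -r -t filesystem -o name,value,source recordsize tank
Anything marked local was set deliberately on that dataset; anything marked inherited is following its parent, which may or may not be intentional.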
Joke #1 (short, relevant): Setting the wrong recordsize is like buying a truck to deliver a single envelope—technically it works, but your fuel bill will develop feelings.
Why this one setting dominates file performance
For many workloads, “file performance” boils down to a few measurable behaviors:
- Latency for small random reads/writes (databases, VM images, mail spools)
- Throughput for large sequential reads/writes (media, backups, object-like blobs)
- Write amplification when small updates trigger large block rewrites
- Cache efficiency in ARC/L2ARC: are you caching the right granularity?
recordsize touches all four. Bigger records mean fewer pointers, fewer I/O operations for sequential scans, and better compression ratios (often). Smaller records mean less collateral damage when a small part of a file changes, and they can reduce “read more than you asked for” behavior when an app does random reads.
In operations terms: recordsize is one of the few ZFS settings that (a) is per-dataset, (b) is easy to change safely, and (c) produces immediate, measurable differences. That combination is rare, and it’s why it ends up deciding “80% of file performance” in practice—especially when the bottleneck is I/O pattern mismatch rather than raw disk bandwidth.
Facts & historical context that explain today’s defaults
Some concrete context helps explain why recordsize defaults and “folk wisdom” exist:
- ZFS inherited a world where 4K pages and 4K disk sectors were common. Many OS and database I/O paths naturally operate in 4K or 8K units, which influences “reasonable” block sizes for random I/O.
- The classic ZFS default recordsize was 128K. That default isn’t magic; it’s a compromise aimed at general-purpose filesystems with lots of sequential I/O mixed in.
- Early commodity drives often lied about sectors. The 512e era created painful alignment problems; ZFS's ashift was the "always align to a power of two" countermeasure.
- Copy-on-write filesystems made partial overwrite semantics expensive. Traditional filesystems could overwrite in place; ZFS generally writes new blocks, which changes the cost profile of small updates.
- Compression becoming “free enough” changed tuning priorities. Modern CPUs made lightweight compression practical, and larger records often compress better (fewer headers, more redundancy per block).
- VM sprawl created a new class of “files that behave like disks.” QCOW2/VMDK/raw images are files, but their access patterns look like random block devices—recordsize needs to follow that reality.
- SSD latency made small I/O more feasible, but also more sensitive. When the media is fast, overhead (checksums, metadata, write amplification) becomes a larger fraction of total time.
- ARC made caching more strategic than “more RAM equals faster.” Caching 128K chunks for random 8K reads can be wasteful; caching 16K chunks for sequential streaming can be overhead-heavy.
How ZFS turns your writes into on-disk reality
To tune recordsize responsibly, you need a mental model of what ZFS is doing with your bytes.
Recordsize and the “read-modify-write” trap
Suppose an app updates 8K in the middle of a large file. If the relevant on-disk block (record) is 128K, ZFS can’t just overwrite 8K “in place” the way old-school filesystems did. It generally has to:
- Read the old 128K block (unless it’s already in ARC)
- Modify the 8K portion in memory
- Write a new 128K block elsewhere (copy-on-write)
- Update metadata to point to the new block
That’s a simplified view, but the punchline holds: small random writes into large records create extra I/O and extra allocation churn. On HDD pools, the extra I/O becomes seek pain. On SSD pools, it becomes write amplification and latency jitter, especially under sync writes or high concurrency.
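If you'd rather see the trap than take it on faith, a minimal A/B sketch is to point the same random-write load at two scratch datasets that differ only in recordsize and compare what the pool writes against what fio reports. Pool and dataset names are examples, and compression is disabled only to keep the physical numbers easy to read:
cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=off tank/rmw1m
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=off tank/rmw16k
cr0x@server:~$ fio --name=rmw --directory=/tank/rmw1m --rw=randwrite --bs=8k --size=2G --iodepth=16 --numjobs=2 --time_based --runtime=60 --group_reporting
cr0x@server:~$ zpool iostat tank 5   # run in a second terminal while fio is going
Run the same fio line against /tank/rmw16k. If the 1M dataset drives pool write bandwidth several times higher than fio's own reported write rate while the 16K dataset tracks it closely, you are watching read-modify-write amplification in real time. Destroy the scratch datasets afterwards.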
Why bigger records can be spectacular for sequential workloads
If your workload reads and writes large contiguous ranges—backup streams, media encoding outputs, parquet files, log archives—bigger records reduce overhead:
- Fewer block pointers and metadata operations
- Fewer I/O requests for the same amount of data
- Often better compression efficiency
- Better prefetch behavior: ZFS can pull in meaningful chunks
On spinning disks, this is the difference between sustained throughput and a graph that looks like a seismograph. On SSDs, it’s the difference between saturating bandwidth and saturating CPU/IOPS overhead first.
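A quick sanity check for the sequential side, as a hedged sketch against a dataset meant for streaming data (paths are examples), is a large-block write while watching the pool:
cr0x@server:~$ fio --name=seqwrite --directory=/tank/backups --rw=write --bs=1M --size=8G --end_fsync=1 --group_reporting
cr0x@server:~$ zpool iostat tank 5   # in a second terminal: expect high MB/s with comparatively few write operations
On a 1M-recordsize dataset the operations column should stay modest while bandwidth climbs; on a tiny-recordsize dataset the same test turns into an IOPS and CPU exercise.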
ARC granularity: caching the right “shape” of data
ARC caches ZFS blocks. If your app does 8K random reads but your blocks are 128K, each cache miss pulls in 128K. That’s “read amplification.” Sometimes it helps (spatial locality), but for truly random patterns it just burns ARC and evicts useful data faster.
Conversely, if your workload is large sequential reads, tiny recordsize can increase overhead: more blocks to manage, more checksum operations, more metadata lookups, and less efficient prefetch.
Compression and recordsize: friends with boundaries
Compression happens per block. Larger blocks often compress better because redundancy is visible across a larger window. But compression also means your on-disk block size may be smaller than recordsize, which changes the I/O shape:
- With compressible data, large recordsize can be a free win: fewer blocks, less physical I/O.
- With incompressible data, large recordsize mainly changes I/O granularity and rewrite cost.
- With mixed data, you’ll see a mix of physical sizes; the “max” still matters for rewrite amplification.
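To see how far physical blocks diverge from the logical recordsize on your own data, compare logical and physical accounting; a sketch, with the dataset and file name as placeholders:
cr0x@server:~$ zfs get -o name,property,value recordsize,compressratio,logicalused,used tank/backups
cr0x@server:~$ du -h --apparent-size /tank/backups/archive.tar
cr0x@server:~$ du -h /tank/backups/archive.tar
The apparent size is what applications think they wrote; the second du shows what actually landed on disk after compression. A large gap means compression is doing real work and larger records are probably helping; no gap means recordsize is mostly about I/O shape and rewrite cost.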
Recordsize vs volblocksize (don’t mix them)
Datasets store files; zvols present block devices. For zvols, the equivalent tuning knob is volblocksize, which is set at creation time and is difficult to change without rebuilding the zvol. A common production mistake is tuning recordsize on a dataset that stores VM images inside a zvol—or tuning volblocksize for files. These are different layers.
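For completeness, here is what the zvol side looks like, as a sketch with example names and sizes. volblocksize is fixed at creation, which is exactly why it deserves more forethought than recordsize:
cr0x@server:~$ sudo zfs create tank/vols
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vols/vm01
cr0x@server:~$ zfs get -o name,property,value,source volblocksize tank/vols/vm01
If that value turns out to be wrong for the workload, the realistic fix is a new zvol and a data migration, not a property change.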
Mapping workloads to recordsize (with real numbers)
Here’s the practical mapping I use when I’m responsible for the pager.
General file shares (home directories, mixed office files)
Typical access: small documents, occasional large downloads, many small metadata ops. The default 128K is often fine. If you have heavy random access in large files (e.g., huge PST-like archives), consider lowering, but don’t pre-optimize; measure first.
Common choice: 128K. Sometimes 64K if lots of small random I/O inside large files.
Databases (Postgres, MySQL/InnoDB) stored as files on datasets
Most databases do 8K–16K reads/writes (varies by engine and configuration). If you store database data files on a dataset (not a zvol), large recordsize can cause read-modify-write and ARC inefficiency. A recordsize closer to the DB page size is usually better.
Common choice: 16K for PostgreSQL (its pages are 8K, but 16K is a common compromise that trades a little read amplification for better compression and lower metadata overhead); 16K for most InnoDB setups (default 16K pages). Sometimes 8K for very latency-sensitive random-read workloads, but watch the metadata and checksum overhead.
VM disk images as files (qcow2, raw, vmdk) on datasets
This is the classic “file that behaves like a disk.” Guest OS does 4K random writes; the hypervisor writes into a file. Large recordsize punishes you with write amplification and fragmentation. You want smaller blocks. Many shops land on 16K or 32K as a compromise; 8K can be appropriate but increases metadata overhead.
Common choice: 16K or 32K.
Backup targets, media archives, log archives
Sequential, streaming, append-heavy, usually not rewriting in place. Bigger is better: reduce overhead, increase streaming throughput, improve compression ratio. For modern OpenZFS, 1M recordsize is a legitimate choice for these datasets—when the workload truly is large sequential I/O.
Common choice: 1M (or 512K if you want a safer middle ground).
Container images and build caches
These can be nasty: lots of small files and metadata, but also large layers. Tuning recordsize alone won’t solve it; you’ll also care about special vdevs for metadata, compression, and maybe atime=off. For recordsize, default is usually fine; if the workload is dominated by many small random reads inside large blobs, consider 64K, but measure.
Common choice: 128K, sometimes 64K.
Analytics files (parquet, ORC, columnar data)
These are often read in large chunks but with some skipping. If your queries scan wide ranges, larger records help. If you do lots of “read a little from many places,” smaller can reduce read amplification. This one is workload-specific; start with 1M for pure scan, then adjust.
Common choice: 512K–1M for scan-heavy; 128K–256K for mixed.
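If you want the mappings above as concrete starting points, a creation sketch looks like this; pool and dataset names are examples, and every value is a starting point to measure against, not a verdict:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/vm
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/db
cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=zstd -o atime=off tank/backups
cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=lz4 tank/analytics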
Joke #2 (short, relevant): In storage, “one size fits all” is like “one timeout fits all”—it’s only true until your first incident review.
Three corporate-world mini-stories
1) Incident caused by a wrong assumption: “It’s a file, so 1M recordsize will be faster”
They had a virtualization cluster that stored VM disks as raw files on a ZFS dataset. Someone saw a blog post about big recordsize improving throughput for backups and decided the VM dataset should be recordsize=1M too. It sounded reasonable: bigger blocks, fewer I/Os, more speed.
The change rolled out during a routine maintenance window. The first warning wasn’t a benchmark; it was the helpdesk. “Windows VMs feel laggy.” Then the monitoring lit up: elevated IO wait, queue depths rising, and an odd pattern—reads looked okay, but writes had jitter. The hypervisor graphs showed bursts: a bunch of small writes suddenly turned into much larger backend write activity.
On investigation, the ZFS pool was doing a lot of read-modify-write. The workload was typical VM behavior: small random writes, filesystem metadata updates inside guests, journaling, and periodic syncs. With 1M records, each tiny update could rewrite a huge block (or at least trigger big-block allocation patterns). ARC hit rates weren’t saving them; the working set was bigger than RAM and the random pattern made caching less effective.
The resolution wasn’t heroic. They created a new dataset with recordsize=16K, migrated the VM images with a copy that forced block rewrite, and the symptoms largely vanished. The postmortem lesson was sharp: “file” is not a workload description. If your “file” is a virtual disk, treat it like block I/O.
The follow-up win was cultural: they added a lightweight storage design review step for new datasets. Not a bureaucracy—just a ten-minute checklist: access pattern, expected I/O sizes, sync behavior, and an explicit recordsize choice.
2) Optimization that backfired: shrinking recordsize everywhere to reduce latency
A different organization had a latency incident on a database system and walked away with a simple belief: “Smaller blocks are faster.” It’s an understandable conclusion when you’ve been burned by write amplification. So they standardized on recordsize=8K across most datasets: databases, home directories, artifact storage, backups, the lot.
The next month’s trend reports were weird. The DB latency improved a bit, sure. But the backup system started missing its window. The media processing pipeline began taking longer to read and write large files. CPU utilization on storage nodes climbed, and not because users were suddenly happy. It was overhead: more blocks, more metadata churn, more checksum work, more I/O operations to move the same amount of data.
Then came the real punch: their L2ARC device looked “busy” but wasn’t improving outcomes. Small recordsize meant more cache entries and metadata pressure. It increased the working set management overhead, and the cache became less efficient for the streaming workloads that previously benefited from large records and prefetch.
The fix was to stop treating recordsize as a global tuning knob. They split datasets by workload class: 16K for DB files, 128K for general shares, 1M for backup targets and large artifacts. The result wasn’t just performance—it was predictability. Systems that stream data got their throughput back; systems that do random I/O got lower amplification; and the storage CPU stopped pretending it was a crypto miner.
The lesson that stuck: a “universal” recordsize is a universal compromise, and compromises are where incidents breed.
3) A boring but correct practice that saved the day: dataset-per-workload with migration discipline
Here’s the kind of story nobody brags about, but it’s the reason some teams sleep at night.
This company ran a shared ZFS pool for multiple internal services: a PostgreSQL fleet, an NFS home directory service, and a backup repository. Years ago, they got into the habit of creating a dataset per workload class with explicit properties: recordsize, compression, atime, sync policy, and mountpoint conventions. It wasn’t fancy. It was just consistent.
When a new analytics service arrived—big columnar files, mostly sequential reads—the storage team didn’t touch existing datasets. They created a new dataset with recordsize=1M and appropriate compression, then onboarded the service there. Performance was good from day one, largely because they didn’t force an analytics workload to live on a dataset tuned for home directories.
Months later, a regression in the analytics pipeline caused it to start doing more random reads. Latency crept up. The on-call SRE ran a quick diagnosis: confirm I/O pattern shift, inspect block sizes, check ARC behavior. They reduced recordsize moderately (from 1M to 256K) in a staging run, validated with realistic queries, then rolled a controlled migration that rewrote the hot data. No emergency surgery on production. No “let’s change the pool-wide settings and hope.”
The day was saved by boredom: namespaced datasets, explicit properties, and a muscle memory for “if you change recordsize, plan how to rewrite data.” The incident ticket never became an incident call. That’s the best kind of success.
Hands-on tasks: commands, outputs, and interpretation
The following tasks are written the way I’d run them on a Linux host with OpenZFS. Adjust pool/dataset names as needed. Every command includes what to look for so you can make a decision, not just produce output.
Task 1: List datasets and recordsize
cr0x@server:~$ zfs list -o name,used,avail,recordsize,compression,mountpoint
NAME          USED   AVAIL  RECSIZE  COMPRESS  MOUNTPOINT
tank          3.21T  7.80T     128K  lz4       /tank
tank/vm       1.40T  7.80T      16K  lz4       /tank/vm
tank/db        820G  7.80T      16K  lz4       /tank/db
tank/backups   900G  7.80T       1M  zstd      /tank/backups
Interpretation: sanity-check that datasets match workload classes. If your VM dataset says 1M, you likely found your next performance ticket.
Task 2: Check a single dataset property (and inherited values)
cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/vm
NAME     PROPERTY    VALUE  SOURCE
tank/vm  recordsize  16K    local
Interpretation: if SOURCE is inherited, confirm the parent dataset is intentionally configured. Accidental inheritance is common in “quick fix” environments.
Task 3: Change recordsize safely (affects new writes only)
cr0x@server:~$ sudo zfs set recordsize=16K tank/vm
cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/vm
NAME     PROPERTY    VALUE  SOURCE
tank/vm  recordsize  16K    local
Interpretation: this does not rewrite existing blocks. If you need impact now, plan a rewrite/migration (see later tasks).
Task 4: Inspect actual block sizes used by an existing file
cr0x@server:~$ sudo zdb -dd tank/vm | head -n 20
Dataset tank/vm [ZPL], ID 56, cr_txg 124567, 1.40T, 1045320 objects
    Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
        96    1  128K   16K  1.40T    512B  1.40T   100%  ZFS plain file
Interpretation: look at dblk (data block size). If your dataset recordsize is 16K but existing files show 128K, they were written before the change or by a different workload shape.
Task 5: Check compression ratio and logical vs physical space
cr0x@server:~$ zfs get -o name,property,value -H compressratio,logicalused,used tank/backups
tank/backups compressratio 1.82x
tank/backups logicalused 1.63T
tank/backups used 900G
Interpretation: if compression is helping, larger recordsize may be amplifying that benefit for sequential workloads. If compressratio is ~1.00x on incompressible media, recordsize choice matters mostly for I/O shape and rewrite cost.
Task 6: Measure random vs sequential I/O quickly with fio (sanity check)
cr0x@server:~$ fio --name=randread --directory=/tank/vm --rw=randread --bs=4k --iodepth=32 --numjobs=4 --size=4G --time_based --runtime=30 --group_reporting
randread: (groupid=0, jobs=4): err= 0: pid=18432: Thu Dec 24 12:10:33 2025
read: IOPS=38.2k, BW=149MiB/s (156MB/s)(4468MiB/30001msec)
slat (usec): min=3, max=240, avg=12.7, stdev=4.9
clat (usec): min=60, max=3900, avg=820.2, stdev=210.5
Interpretation: run this on the dataset that matches your workload. If your real app does 4K/8K random IO and latency is bad, recordsize may be too large (or sync settings/log device are limiting). Use fio to confirm I/O shape first.
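If the pain is on the write path, a companion sketch is the same shape of test with random writes and periodic fsync, pointed at a scratch directory on the dataset under suspicion (flags are standard fio options; sizes are examples):
cr0x@server:~$ mkdir -p /tank/vm/fio-scratch
cr0x@server:~$ fio --name=randwrite --directory=/tank/vm/fio-scratch --rw=randwrite --bs=8k --iodepth=16 --numjobs=4 --size=4G --fsync=32 --time_based --runtime=30 --group_reporting
Watch zpool iostat in parallel; if the pool writes far more than fio reports, recordsize mismatch or sync behavior is amplifying the load.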
Task 7: Observe pool I/O and latency under real load
cr0x@server:~$ iostat -x 1
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
            9.2    0.0      6.1     18.7     0.0   66.0

Device    r/s    w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
nvme0n1   420.0  980.0  52.0   120.0  180.2     4.6       3.6    0.4    54.0
nvme1n1   410.0  970.0  51.3   119.1  179.6     4.8       3.8    0.4    55.0
Interpretation: if await climbs with small I/O, you may be hitting sync write latency, device saturation, or write amplification. Recordsize is a suspect when write bandwidth seems high relative to application write rate.
Task 8: Watch ZFS-level I/O with zpool iostat
cr0x@server:~$ zpool iostat -v 1
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        3.21T  7.80T  8.10K  12.5K   210M   380M
  mirror                    1.60T  3.90T  4.05K  6.20K   105M   190M
    nvme0n1                     -      -  2.02K  3.10K  52.5M  95.0M
    nvme1n1                     -      -  2.03K  3.10K  52.5M  95.0M
--------------------------  -----  -----  -----  -----  -----  -----
Interpretation: compare application throughput with pool bandwidth. If the pool is writing far more than the app claims, you’re seeing amplification (recordsize mismatch, sync behavior, metadata churn, or fragmentation).
Task 9: Check ARC stats for cache effectiveness signals
cr0x@server:~$ arcstat 1 5
    time  read   miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:11:10  185K  21.2K     11    9K    4   10K    5    2K    1    58G   64G
12:11:11  192K  30.1K     15   14K    7   12K    6    4K    2    58G   64G
Interpretation: rising miss% during a workload that should be cache-friendly can mean the blocks are too large for the access pattern (or the working set is too large). Recordsize can be part of that story.
Task 10: Confirm the workload’s I/O size (before you tune)
cr0x@server:~$ sudo pidstat -d -p 12345 1 5
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
12:12:01 UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
12:12:02 1001 12345 5120.0 2048.0 128.0 postgres
Interpretation: this shows rates, not sizes, but it’s your first hint of read/write dominance. Pair it with app knowledge (DB page size, VM block sizes). If you can’t describe your I/O sizes, you’re tuning blind.
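When you need actual request sizes rather than rates, the pool itself can tell you. A sketch, assuming a pool named tank: zpool iostat's request-size histogram breaks reads and writes down by size bucket over the sampling interval.
cr0x@server:~$ zpool iostat -r tank 10 1
Look at which buckets dominate for synchronous and asynchronous writes; if the application claims 8K I/O but the histogram is full of much larger requests, recordsize, aggregation, or something else in the stack is reshaping the workload at the pool level.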
Task 11: Force a rewrite of a file to apply new recordsize
This is the operationally useful part: changing recordsize doesn’t rewrite existing blocks. To apply it to existing files, you need to rewrite the file content. One simple method is “copy out and back,” ideally within the same pool but to a new dataset with the desired properties.
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/vm_new
cr0x@server:~$ sudo rsync -aHAX --inplace --progress /tank/vm/ /tank/vm_new/
cr0x@server:~$ sudo zfs set mountpoint=/tank/vm_old tank/vm
cr0x@server:~$ sudo zfs set mountpoint=/tank/vm tank/vm_new
Interpretation: the rsync forces new blocks to be allocated with the new dataset’s settings. The mountpoint swap is the “boring but safe” cutover. Test services before deleting the old dataset.
Task 12: Validate that rewritten data actually uses the new block size
cr0x@server:~$ sudo zdb -dd tank/vm_new | head -n 20
Dataset tank/vm_new [ZPL], ID 102, cr_txg 124890, 1.39T, 1039120 objects
    Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
        96    1  128K   16K  1.39T    512B  1.39T   100%  ZFS plain file
Interpretation: dblk should align with your intended recordsize for newly-written data. If it doesn’t, investigate: are files sparse? are you dealing with zvols? is the workload writing in larger chunks than expected?
Task 13: Spot-check a dataset’s “hot files” behavior with filefrag
cr0x@server:~$ sudo filefrag -v /tank/vm/disk01.raw | head -n 15
Filesystem type is: 0x2fc12fc1
File size of /tank/vm/disk01.raw is 107374182400 (26214400 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 255: 1200000.. 1200255: 256: last,eof
1: 256.. 511: 1310000.. 1310255: 256: 1200256:
Interpretation: fragmentation isn't purely a recordsize issue, but recordsize mismatch increases churn and can worsen allocation patterns. If you see extreme fragmentation plus random-write workloads, smaller records and better allocation headroom can help. Note that filefrag relies on FIEMAP, which OpenZFS has historically supported poorly or not at all; if it reports nothing useful, lean on zpool list's FRAG column and zdb spot checks instead.
Task 14: Check sync write behavior signals (not to blame recordsize for slog issues)
cr0x@server:~$ zfs get -o name,property,value,source sync tank/db
NAME PROPERTY VALUE SOURCE
tank/db sync standard inherited
Interpretation: if you have high-latency sync writes (databases often do), recordsize tuning won’t fix a slow SLOG or absent separate log device. Diagnose sync behavior separately; recordsize primarily changes block rewrite costs and I/O granularity.
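Two quick checks help separate the sync-latency story from the recordsize story; a sketch, assuming a pool named tank:
cr0x@server:~$ zpool status tank
cr0x@server:~$ zpool iostat -w tank 10 1
In zpool status, look for a logs section to confirm whether a separate log device exists at all. The -w output prints latency histograms; if sync writes are spending their time in long latency buckets, fix the log/device path first and revisit recordsize afterwards.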
Fast diagnosis playbook
This is the “what do I check first, second, third” plan when performance is off and you suspect ZFS tuning might be involved.
1) Confirm the workload’s I/O pattern before touching ZFS
- Is it random or sequential?
- Typical I/O size: 4K, 8K, 16K, 128K, 1M?
- Read-heavy or write-heavy? Sync-heavy?
Run basic observation: application metrics, iostat -x, zpool iostat. If the pattern is random small writes and your dataset is tuned for huge sequential blocks, you’re likely holding the smoking gun.
2) Check dataset properties and inheritance
- zfs get recordsize,compression,atime,primarycache,logbias,sync
- Confirm you're on a dataset (files) vs a zvol (block device)
If the dataset is inheriting from a parent “catch-all,” verify the parent wasn’t tuned for a different workload six months ago during a panic.
3) Verify what’s actually on disk (not what you hope)
- zdb -dd pool/dataset for per-object block sizes (the dblk column)
- Compression ratio and logical vs physical usage
This step avoids a common trap: you changed recordsize last week but never rewrote the hot data, so nothing changed in practice.
4) Look for amplification and bottlenecks that masquerade as recordsize problems
- Pool writes much higher than application writes (amplification)
- High sync write latency (SLOG or device latency)
- ARC misses due to oversized blocks for random reads
- CPU saturation due to checksum/compression overhead at tiny block sizes
5) Make a reversible change, then apply it correctly
Changing recordsize is reversible, but “feeling” the benefit requires data rewrite. Plan a migration (new dataset, rsync, cutover) rather than tweaking a property and hoping.
Common mistakes: symptoms and fixes
Mistake 1: Using huge recordsize for VM images
Symptoms: VM stalls during updates, periodic write bursts, storage writes far exceed guest writes, latency spikes under concurrent write load.
Fix: put VM images on a dataset tuned for random I/O: recordsize=16K or 32K. Migrate/rewrite images so existing blocks are recreated at the new size.
Mistake 2: Using tiny recordsize for backups and media
Symptoms: backup jobs can’t hit throughput targets, high CPU on storage nodes, lots of IOPS but low MB/s, prefetch seems ineffective.
Fix: create a backup/media dataset with recordsize=1M (or 512K), appropriate compression (often zstd or lz4), and rewrite/migrate incoming data streams.
Mistake 3: Changing recordsize and expecting immediate improvement
Symptoms: “We set recordsize to 16K and nothing changed.” Benchmarks and app behavior remain identical.
Fix: rewrite hot data. For existing large files, you need copy/rsync to a new dataset, or a controlled rewrite operation that forces new block allocation.
Mistake 4: Confusing recordsize with volblocksize
Symptoms: you tune recordsize, but the workload is on a zvol (iSCSI/LUN). No measurable impact; confusion in incident review.
Fix: for zvols, plan volblocksize at creation. If it’s wrong, the clean fix is often “new zvol with correct volblocksize, migrate data.” Don’t fight physics with wishful properties.
Mistake 5: Ignoring sync write behavior and blaming recordsize
Symptoms: database commits are slow, but reads are fine; latency correlates with fsync/commit rates; adding recordsize tweaks doesn’t help.
Fix: evaluate sync path: separate log device (SLOG) quality, latency of main vdevs, sync=standard expectations, and application settings. Recordsize helps with rewrite amplification, not with “my sync device is slow.”
Mistake 6: Tuning recordsize without considering compression
Symptoms: unpredictable CPU usage, or surprising physical write rates after changes.
Fix: if you use compression, understand that physical block sizes may shrink; measure compressratio, CPU headroom, and pool bandwidth. Larger recordsize can improve compression on some data, but it can also increase rewrite cost for random updates.
Checklists / step-by-step plan
Checklist A: Picking recordsize for a new dataset
- Name the workload. “VM images,” “Postgres,” “backup streams,” “mixed home dirs.” Not “files.”
- Identify typical I/O size. 4K/8K/16K/128K/1M. Use app docs or measurement.
- Decide if overwrites are common. Databases and VMs overwrite; backups mostly append.
- Choose an initial recordsize. Common starting points: 16K (DB/VM), 128K (general), 1M (backups/media).
- Set compression intentionally. lz4 is usually a safe default; heavier compression should be justified.
- Write it down in the dataset properties. Don't rely on tribal memory. At minimum: recordsize, compression, atime, sync expectations (a self-documenting sketch follows below).
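"Write it down" can be literal: set everything explicitly at creation and the dataset documents itself. A sketch with example names and values:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 -o atime=off tank/db02
cr0x@server:~$ zfs get -s local all tank/db02
The second command lists only locally-set properties, i.e., the decisions someone actually made, which is exactly the record you want in front of you during an incident review.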
Checklist B: Changing recordsize on an existing workload (safe operational approach)
- Confirm you’re tuning the right layer. Dataset vs zvol.
- Measure current behavior. Baseline: latency, throughput, pool bandwidth, ARC miss%. Capture before/after.
- Set recordsize on a new dataset. Avoid changing the live dataset first if the data needs rewrite anyway.
- Migrate with a rewrite. Use rsync/copy in a way that allocates new blocks (copy to new dataset).
- Cut over with mountpoint swap or service config change. Make rollback easy.
- Validate block sizes and performance. Use zdb spot checks and workload tests.
- Keep the old dataset temporarily. Rollback insurance; delete later when confidence is high.
Checklist C: “Is recordsize even the problem?”
- If latency spikes correlate with sync writes, investigate SLOG/device latency first.
- If throughput is low but CPU is high, check tiny recordsize overhead or heavy compression.
- If pool bandwidth is huge compared to app writes, suspect amplification (recordsize mismatch, fragmentation, metadata churn).
- If performance degrades as the pool fills, check fragmentation and free space; recordsize can worsen symptoms but isn’t the root cause.
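For the last two checks in this list, the fastest signal comes straight from the pool summary; a sketch, assuming a pool named tank:
cr0x@server:~$ zpool list -o name,size,allocated,free,fragmentation,capacity tank
High fragmentation combined with high capacity means allocation is already struggling; fix headroom before blaming (or re-tuning) recordsize.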
FAQ
1) Does changing recordsize rewrite existing data?
No. It affects new writes. Existing blocks keep their old sizes until the file regions are rewritten. If you need immediate benefit, migrate/rewrite the hot data.
2) What’s the best recordsize for PostgreSQL?
Commonly 16K on a dataset storing PostgreSQL data files, because it tends to behave well with typical DB I/O while keeping rewrite amplification reasonable. But measure: workload, RAM/ARC size, and whether you’re sync-limited matter more than slogans.
3) What’s the best recordsize for VM images stored as files?
Typically 16K or 32K. VM I/O is usually small and random; huge recordsize often causes amplification. If your environment is extremely IOPS-sensitive, 8K can work, but it increases overhead and can reduce sequential throughput for guest operations like full-disk scans.
4) Should I set recordsize to 1M for everything because it’s “faster”?
No. It’s faster for large sequential streaming and often worse for small random overwrites. A 1M recordsize on VM or database datasets is a common cause of “why are writes so spiky?” tickets.
5) How does compression change recordsize decisions?
Compression is per block. Larger blocks can compress better, which can reduce physical I/O for compressible data. But the maximum record still controls rewrite scope: a small change inside a large record can still trigger large-block rewrite work, even if the compressed output is smaller.
6) Is smaller recordsize always better for latency?
No. Smaller blocks can reduce read amplification for random reads and reduce rewrite size for small updates, but they also increase metadata overhead and CPU work. For sequential workloads, too-small recordsize can reduce throughput and increase CPU consumption.
7) How do I know what block sizes are actually being used?
Use zdb -dd pool/dataset to inspect per-object block sizes (the dblk column), and validate with spot checks after a migration. Don't assume that setting recordsize changes existing files.
8) What’s the difference between recordsize and ashift?
ashift is a pool/vdev-level sector alignment setting (typically 12 for 4K, 13 for 8K). It affects the minimum allocation granularity and alignment. recordsize is a dataset-level maximum file block size. They interact—misalignment can cause amplification—but they solve different problems.
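If you are not sure what ashift your vdevs actually ended up with, you can read it back from the pool configuration; a sketch, assuming a pool named tank:
cr0x@server:~$ sudo zdb -C tank | grep ashift
Every data vdev should report the value you intended (typically 12 on 4K-sector media); a vdev that came up with ashift 9 on 4K media is its own source of amplification, independent of recordsize.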
9) Can recordsize cause fragmentation?
Recordsize doesn’t “cause” fragmentation by itself, but a mismatch (large records with random overwrites) increases allocation churn and can worsen fragmentation over time. Keeping free space healthy and matching recordsize to I/O patterns helps.
10) Should I tune recordsize for NFS shares?
Tune for the application behavior behind NFS. NFS is transport; the real question is whether the clients do small random I/O inside large files (CAD projects, VM images, DB files) or mostly sequential/append workloads.
Conclusion
If you want a single ZFS tuning lever that reliably changes outcomes, recordsize is it—because it determines the granularity of data blocks, and granularity determines everything: rewrite cost, caching efficiency, I/O request rate, and how gracefully you handle random vs sequential access.
The operational trick is to stop treating recordsize as a tweak and start treating it as a design choice. Create datasets per workload, pick recordsize based on I/O patterns, and remember the part everyone forgets: changing it only matters if you rewrite the data. Do that, and you’ll get fewer performance mysteries and more boring graphs—exactly the kind of boring that keeps production alive.