ZFS Recordsize + Compression: The Combo That Changes CPU and Disk Math

You can buy faster disks, throw NVMe at the problem, and still watch latency charts look like a seismograph.
Then you flip two ZFS properties—recordsize and compression—and suddenly the same hardware feels like it got a promotion.

Or you flip them wrong and your CPUs start doing interpretive dance while your disks nap. This is the part where we stop guessing and do the math.

What recordsize really does (and what it doesn’t)

recordsize is the maximum size of a data block ZFS will use for a file on a filesystem dataset.
It is not “the block size.” It’s a ceiling, a hint, and—depending on your workload—a performance steering wheel.

Recordsize is a cap, not a mandate

For ordinary files, ZFS uses variable-sized records up to recordsize: a file smaller than recordsize is stored in a single block sized to fit it,
so a 4K file doesn’t balloon into a 128K block just because the dataset’s recordsize is 128K.
Once a file grows past recordsize, though, its blocks are all recordsize-sized; small blocks are for small files, not for small writes.
For workloads that produce big files and large sequential I/O, recordsize is your on-disk reality.

This is why recordsize is most visible on:

  • large streaming reads/writes (backups, object stores, media, logs that are appended in big chunks),
  • datasets with big files and read-ahead benefits,
  • workloads where read-modify-write penalties show up (random overwrites into big blocks).

Recordsize doesn’t apply to zvols the same way

If you’re serving iSCSI/LUNs or VM disks via zvols, the knob is volblocksize, set at zvol creation time.
That’s the fixed block size exposed to the consumer. Treat that as an API contract.
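
A minimal sketch of what setting it at creation time looks like (the volume name and size here are placeholders, not from any real setup):

cr0x@server:~$ zfs create -V 100G -o volblocksize=16K -o compression=lz4 rpool/data/vm-disk0
cr0x@server:~$ zfs get -o name,property,value volblocksize rpool/data/vm-disk0
NAME                 PROPERTY      VALUE
rpool/data/vm-disk0  volblocksize  16K

That 16K is now baked in; changing it later means creating a new zvol and copying the data over.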

The hidden villain: partial-block writes

The pain mode for “recordsize too big” isn’t that ZFS can’t do small I/O; it’s that overwrites of small pieces inside a large record
can trigger read-modify-write (RMW). ZFS must read the old block, modify part of it, and write a new full block (copy-on-write).
That costs extra I/O and extra latency—especially when the working set doesn’t fit in ARC and you’re doing synchronous writes.

Think of recordsize as picking the unit of “storage transaction size.” Bigger blocks reduce metadata overhead and can increase streaming throughput.
Bigger blocks also increase the blast radius of a small overwrite.

Joke #1: Setting recordsize=1M for a random-write database is like bringing a moving truck to deliver a house key. It’ll arrive, but nobody’s happy.

Compression changes the economics

ZFS compression is not a space-only feature. It’s a performance feature because it trades CPU cycles for reduced physical I/O.
That trade can be fantastic or terrible depending on what you’re bottlenecked on.

The core effect: logical bytes vs physical bytes

Every ZFS dataset has two realities:

  • Logical: what the application thinks it wrote/read.
  • Physical: what actually hit disk after compression (and potentially after padding effects, RAIDZ parity, etc.).

When compression is effective, you reduce physical reads/writes. That means:

  • lower disk bandwidth consumption,
  • potentially fewer IOPS if the same logical request maps to less physical work,
  • more effective ARC (because cached compressed blocks represent more logical data per byte of RAM).

LZ4 is usually the default for a reason

In production, compression=lz4 is the “safe fast” setting for most data: configs, logs, texty payloads, many databases, VM images.
It’s fast enough that CPU is rarely the limiting factor unless you’re at very high throughput or on underpowered cores.

More aggressive compressors (zstd levels, for example) can win big on space and sometimes on I/O reduction, but the CPU cost becomes real.
That CPU cost shows up in latency at inconvenient times: compaction bursts, backup windows, and “why is the API slow only during replication?” moments.
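
If you do reach for zstd, scope it per dataset rather than pool-wide. A hedged example (dataset names are illustrative, and zstd levels require a reasonably recent OpenZFS):

cr0x@server:~$ zfs set compression=lz4 rpool/data/vm
cr0x@server:~$ zfs set compression=zstd-3 rpool/data/archive
cr0x@server:~$ zfs get -o name,property,value,source compression rpool/data/vm rpool/data/archive
NAME                PROPERTY     VALUE   SOURCE
rpool/data/vm       compression  lz4     local
rpool/data/archive  compression  zstd-3  local

Like recordsize, a compression change only affects blocks written after the change; existing blocks keep whatever algorithm they were written with.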

Compression also interacts with recordsize

Compression ratio is sensitive to block size. Larger records often compress better because the compressor sees more redundancy.
That’s why increasing recordsize can improve compression ratios on log-like or column-like data.
But again: if your workload overwrites small parts of large blocks, you can pay RMW plus compression cost on every churned write.

The combo math: IOPS, bandwidth, and CPU

Recordsize and compression don’t just “optimize.” They decide which resource gets to be your limiting factor:
disk IOPS, disk bandwidth, or CPU.

Start with the three bottleneck regimes

  • IOPS-bound: latency dominated by the number of operations (random I/O, metadata heavy workloads, sync writes).
  • Bandwidth-bound: you can’t shove more bytes through disks/network (streaming, scans, backups).
  • CPU-bound: compression/decompression, checksums, encryption, or heavy ZFS bookkeeping consumes the cores.

Your tuning goal is not “maximum performance.” It’s “move the bottleneck to the cheapest resource.”
CPU is often cheaper than IOPS on flash, and vastly cheaper than IOPS on spinning disks. But CPU isn’t free when latency budgets are tight.

Recordsize changes IOPS math

Suppose you have a workload reading 1 GiB sequentially.
If your recordsize is 128K, that’s about 8192 blocks. If it’s 1M, it’s about 1024 blocks.
Fewer blocks means fewer metadata lookups, fewer checksum validations, fewer I/O operations, and better prefetch behavior.
For streaming reads, bigger blocks often win.

Now flip to random overwrites of 8K pages (hello databases).
If those pages live inside 128K records, a random page update can translate into:

  • read old 128K (if not in ARC),
  • modify 8K inside it,
  • write new 128K somewhere else (copy-on-write),
  • update metadata.

That’s more than “one 8K write.” That’s an amplified I/O pattern, plus fragmentation over time.
With recordsize=16K (or 8K depending on the DB page size), you reduce RMW amplification.
You may increase metadata overhead, but databases already live in that world.
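
Back-of-the-envelope, ignoring metadata and assuming a cold ARC:

  • recordsize=128K: read 128K + write 128K ≈ 256K of physical I/O to change 8K of data, roughly 32x amplification.
  • recordsize=16K: read 16K + write 16K ≈ 32K for the same 8K change, roughly 4x.
  • recordsize=8K (matching the page): the whole record is replaced, so no data-block read is needed at all.

The exact numbers shift with caching, compression, and RAIDZ layout, but the shape of the math doesn’t.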

Compression changes bandwidth math (and sometimes IOPS)

If your compression ratio is 2:1, a “100 MB read” becomes “50 MB physical read + decompression.”
If your disks were saturated, you just freed half the bandwidth. Latency drops, throughput rises, everyone looks clever.
If your disks weren’t saturated and your CPUs were already busy, you just bought yourself a new problem.

CPU math: decompression is on the read path

Writes pay compression cost. Reads pay decompression cost. In many systems, reads dominate tail latency.
That means the decompressor’s behavior matters more than you think.
LZ4 decompression is typically fast; higher-level compressors may not be.

ARC and compression: “more cache” without buying RAM

Modern ZFS keeps blocks compressed in ARC (implementation details vary by version and feature flags, but the practical effect holds:
compressed data makes the cache more effective).
Better compression can mean more logical data cached per GB of RAM.
This is one reason compression=lz4 is a default “yes” for general-purpose datasets.
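
On Linux OpenZFS you can eyeball the effect directly (kstat names can vary between versions, and the numbers below are just an example):

cr0x@server:~$ awk '/^(compressed_size|uncompressed_size)/ {printf "%-18s %.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
compressed_size    14.2 GiB
uncompressed_size  26.8 GiB

Here roughly 27 GiB of logical data is sitting in about 14 GiB of ARC. That’s cache you didn’t have to buy.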

When the combo is magic

The best-case scenario:

  • large-ish recordsize for the workload’s access pattern,
  • compression that’s cheap (lz4) and effective (lots of redundancy),
  • disks are the expensive resource, CPUs have headroom.

The result is fewer physical bytes, fewer disk stalls, and higher cache hit rates.

When the combo is a trap

The trap scenario:

  • recordsize bigger than the overwrite granularity,
  • workload does random updates,
  • compression adds CPU cost to every rewrite of a large record,
  • sync writes + small log device (or no SLOG) magnify latency.

The symptom is often “disk isn’t busy but latency is awful.” That’s CPU or sync path contention, not raw bandwidth.

Workload playbook: what to set for what

Defaults exist because they’re safe, not because they’re optimal. Your job is to be safely opinionated.

General-purpose fileshares, home directories, configs

  • recordsize: leave default (often 128K) unless you have a reason.
  • compression: lz4.
  • Why: mixed workloads benefit from a medium recordsize; compression reduces physical I/O for text-heavy files.

VM images on filesystem datasets (qcow2/raw files)

  • recordsize: commonly 16K or 32K; sometimes 64K if mostly sequential reads.
  • compression: lz4 (often helps more than people expect).
  • Why: VM I/O tends to be random-ish and in smaller blocks; reduce RMW amplification.

zvols for VM disks / iSCSI

  • volblocksize: match the guest/filesystem expectation (often 8K or 16K; sometimes 4K).
  • compression: lz4 unless CPU is tight.
  • Why: volblocksize is fixed; getting it wrong is a forever tax unless you migrate.
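
Before committing to a migration, audit what your existing zvols were actually created with (pool and volume names are placeholders):

cr0x@server:~$ zfs get -t volume -r -o name,property,value volblocksize rpool
NAME                 PROPERTY      VALUE
rpool/data/vm-disk0  volblocksize  16K
rpool/data/vm-disk1  volblocksize  8K

Anything that doesn’t match the guest’s I/O size goes on the migration list, not the tuning list.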

PostgreSQL / MySQL / databases on filesystem datasets

  • recordsize: match DB page size or a multiple that doesn’t cause churn (often 8K or 16K).
  • compression: lz4 is usually fine; test zstd only if you have CPU and want space.
  • Why: databases do small random overwrites; large records lead to RMW and fragmentation.

Backups, archives, media, object-like blobs

  • recordsize: 512K or 1M can make sense if reads/writes are streaming.
  • compression: zstd (moderate level) can be worth it, or lz4 if you want predictable CPU.
  • Why: sequential I/O rewards large blocks; compression ratio improves with bigger windows.

Logs (append-heavy)

  • recordsize: default is usually fine; if logs are large and mostly sequential reads later, consider 256K.
  • compression: lz4 almost always wins.
  • Why: logs compress extremely well; reducing physical writes also reduces wear on SSDs.

Small-file workloads (maildirs, source trees, package caches)

  • recordsize: not the main knob; small files already use small blocks.
  • compression: lz4 helps and can speed reads by reducing disk I/O.
  • Consider: special vdev for metadata/small blocks if you’re serious about latency.

Interesting facts and historical context

  • ZFS was built for end-to-end data integrity: checksums on every block are core, not an add-on feature.
  • Early ZFS defaults targeted big disks and mixed workloads: 128K recordsize became a pragmatic middle ground for streaming and general file use.
  • Compression in ZFS has long been “transparent”: applications don’t need to know; the filesystem decides on-disk representation.
  • LZ4 became the go-to because it’s cheap: it’s designed for speed, so it often improves performance by reducing I/O without spiking CPU.
  • Copy-on-write changes overwrite costs: ZFS never overwrites in place; this is why overwrite-heavy workloads are sensitive to block sizing.
  • RAIDZ write amplification is real: every block carries its own parity and padding, so small blocks waste proportionally more raw capacity and IOPS, compounding recordsize choices.
  • ARC made “RAM as storage” mainstream: ZFS aggressively caches, and compression effectively increases cache capacity for compressible data.
  • zvols were made for block consumers: but their fixed volblocksize means mistakes are harder to undo than recordsize on filesystems.

Three corporate mini-stories from the trenches

Incident: the wrong assumption (“recordsize doesn’t matter for databases, right?”)

A mid-sized company ran a customer analytics platform on ZFS. The primary database lived on a dataset that had been cloned from a backup target.
Nobody noticed, because it mounted fine, performance looked “okay,” and the graphs were quiet—until month-end.

Month-end meant heavy updates, index churn, and maintenance jobs. Latency climbed, then spiked. Application threads backed up.
The storage team looked at disk utilization: not pegged. The CPU graphs on the DB hosts: not pegged. So they blamed the network.
They tuned TCP buffers like it was 2009.

The actual issue was boring: the dataset had recordsize=1M, inherited from the backup profile.
The database wrote 8K pages. Under churn, ZFS had to do read-modify-write on large records and churn metadata.
The pool wasn’t “busy” in bandwidth terms; it was busy doing the wrong kind of work.

They fixed it by moving the database to a new dataset with recordsize=16K and keeping compression=lz4.
The migration was the painful part. The performance recovery was immediate, and the graphs stopped screaming.

Lesson: “It mounted, therefore it’s fine” is how you end up debugging physics at 2 a.m.

Optimization that backfired: the zstd enthusiasm phase

Another org had a storage cost problem: fast SSD pools were filling up with VM images and build artifacts.
Someone proposed stronger compression. They enabled compression=zstd at an aggressive level on a busy dataset.
Space savings were great. The ticket got closed with a satisfied sigh.

Two weeks later, their CI system started missing build SLAs. Not consistently—just enough to be maddening.
The cluster wasn’t out of CPU overall, but a subset of nodes showed elevated iowait and higher system CPU during peak hours.
The storage array wasn’t saturated. Network was fine. So the blame tour began: kernel updates, scheduler tuning, “maybe it’s DNS.”

The real culprit: CPU spent compressing and decompressing hot artifacts during bursty workloads.
Strong compression improved capacity but increased tail latency and amplified contention on the busiest compute nodes.
The system didn’t fail; it just became annoying and unpredictable—the worst kind of production regression.

They rolled back to lz4 on the hot dataset and kept zstd only for colder, write-once artifact archives.
The right outcome wasn’t “never use zstd.” It was “use it where the access pattern matches the CPU budget.”

Boring but correct practice that saved the day: per-workload datasets

A financial services shop had a clean rule: every workload gets its own dataset, even if it feels like paperwork.
Databases, logs, VM images, backups—separate datasets, explicit properties, and a short README in the mountpoint.
New engineers grumbled. Senior folks smiled politely and kept doing it.

One day, a performance regression showed up after a platform upgrade. VM boot storms were slower, and a subset of guests had high write latency.
Because the org had separate datasets, they could compare properties quickly and see which workloads were affected.
The VM dataset had a recordsize tuned for that environment, and compression was consistent.

The problem turned out to be elsewhere (a sync write path issue combined with a misbehaving SLOG device),
but isolating datasets prevented a messy “global tuning” attempt that would have broken their databases.
They fixed the SLOG problem, and everything else stayed stable because it was never touched.

Lesson: correctness is often just disciplined separation. It’s not glamorous. It’s how you avoid tuning one workload by accidentally sabotaging three others.

Practical tasks: commands, outputs, decisions (12+)

These are real operational moves. Each task includes a command, an example output, what it means, and what decision you make next.
Adjust pool/dataset names to match your environment.

Task 1: List dataset properties that matter (recordsize, compression)

cr0x@server:~$ zfs get -o name,property,value -s local,inherited recordsize,compression rpool/data
NAME        PROPERTY     VALUE
rpool/data  recordsize   128K
rpool/data  compression  lz4

Meaning: You see the effective settings. If they’re inherited, track where from.

Decision: If this dataset hosts a DB or VM images, 128K might be wrong. Don’t change yet—measure first.

Task 2: Check whether a dataset is actually compressing data

cr0x@server:~$ zfs get -o name,property,value compressratio,logicalused,used rpool/data
NAME        PROPERTY       VALUE
rpool/data  compressratio  1.85x
rpool/data  logicalused    1.62T
rpool/data  used           906G

Meaning: Compression is working. Logical is what apps wrote; used is physical-ish space consumed.

Decision: If compressratio is near 1.00x on hot data, compression might be wasted CPU. Consider leaving lz4 anyway unless CPU is tight.

Task 3: Find inherited properties quickly (who set this?)

cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression -r rpool
NAME                  PROPERTY     VALUE  SOURCE
rpool                 recordsize   128K   default
rpool                 compression  off    default
rpool/data            recordsize   128K   inherited from rpool
rpool/data            compression  lz4    local
rpool/data/db         recordsize   1M     local
rpool/data/db         compression  lz4    inherited from rpool/data

Meaning: rpool/data/db has a local 1M recordsize. That’s a red flag for many databases.

Decision: Confirm workload type. If it’s a DB with 8K pages, plan migration to a correctly sized dataset.

Task 4: Watch real-time I/O by dataset to find the noisy neighbor

cr0x@server:~$ zpool iostat -v rpool 2
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
rpool                       2.10T  1.40T    180    950  12.3M  88.1M
  mirror                    2.10T  1.40T    180    950  12.3M  88.1M
    nvme0n1                 -      -        90    480  6.2M   44.0M
    nvme1n1                 -      -        90    470  6.1M   44.1M

Meaning: High write ops vs modest read ops suggests write-heavy workload. Not enough alone—correlate with latency.

Decision: If bandwidth is low but ops are high and latency is bad, suspect small random writes and/or sync path issues.

Task 5: Check pool health and errors before performance tuning

cr0x@server:~$ zpool status -v rpool
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support the features.
  scan: scrub repaired 0B in 00:12:41 with 0 errors on Tue Dec 24 03:12:02 2025
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            nvme0n1 ONLINE       0     0     0
            nvme1n1 ONLINE       0     0     0

errors: No known data errors

Meaning: No errors, scrub clean. Good—performance issues are likely configuration or workload-driven.

Decision: Proceed to workload diagnostics. Don’t “tune” a pool that’s silently failing media.

Task 6: Measure compression CPU cost via system time under load

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server)  12/26/2025  _x86_64_  (16 CPU)

12:40:01 AM  CPU  %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:40:02 AM  all  18.1  0.0   9.7   2.4    0.0  0.3    0.0    0.0    0.0   69.5
12:40:03 AM  all  20.2  0.0  13.5   1.9    0.0  0.2    0.0    0.0    0.0   64.2
12:40:04 AM  all  19.6  0.0  14.1   1.8    0.0  0.2    0.0    0.0    0.0   64.3

Meaning: Elevated %sys suggests kernel work (ZFS included). Low iowait suggests disks aren’t the bottleneck.

Decision: If performance is poor with low iowait, investigate CPU-bound paths (compression level, checksumming, sync writes, contention).

Task 7: Observe ARC effectiveness and memory pressure

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:41:10   820   110     13    28    3    72    9    10    1   28.5G  31.8G
12:41:11   790    95     12    22    3    63    8    10    1   28.6G  31.8G
12:41:12   815   120     15    35    4    74    9    11    1   28.6G  31.8G

Meaning: Miss rate isn’t terrible. ARC size is near target. Compression can make ARC “hold more logical data,” but misses still hurt.

Decision: If miss% is high on a read-heavy workload, compression may help (less physical read), but you may also need more RAM or better locality.

Task 8: Check per-dataset logical vs physical written (is compression reducing write load?)

cr0x@server:~$ zfs get -o name,property,value -r logicalused,used,compressratio rpool/data/vm
NAME           PROPERTY       VALUE
rpool/data/vm  logicalused    800G
rpool/data/vm  used           610G
rpool/data/vm  compressratio  1.31x

Meaning: Moderate compression. VM images often compress a bit, sometimes a lot depending on OS and zero blocks.

Decision: Keep lz4. Even if compressratio sits near 1.00x it costs little; drop compression only if you’re chasing tail latency on CPU-tight hosts.

Task 9: Confirm the actual block sizes being written (not just recordsize)

cr0x@server:~$ zdb -bbbbb rpool/data/db | head -n 12
Dataset rpool/data/db [ZPL], ID 236, cr_txg 41292, 1.12G, 2948 objects
Indirect blocks:
               0 L0  16K   1.23G  11%  1.00x   79.3M

Meaning: This shows actual block distributions. Here you see 16K L0 blocks dominate, despite recordsize possibly being larger.

Decision: If you expected 128K blocks for a streaming workload but see lots of 8K/16K, the workload is not streaming (or is fragmented). Tune accordingly.

Task 10: Identify whether sync writes are killing you

cr0x@server:~$ zfs get -o name,property,value sync,logbias rpool/data/db
NAME          PROPERTY  VALUE
rpool/data/db sync      standard
rpool/data/db logbias   latency

Meaning: Sync is standard (safe). logbias latency prefers SLOG use when present.

Decision: If you have many fsync-heavy writes and no good SLOG, performance pain is expected. Don’t “fix” it by setting sync=disabled unless you enjoy explaining data loss.

Task 11: Check whether a special vdev exists (metadata/small blocks acceleration)

cr0x@server:~$ zpool status rpool | sed -n '1,40p'
  pool: rpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            sda        ONLINE       0     0     0
            sdb        ONLINE       0     0     0
        special
          mirror       ONLINE       0     0     0
            nvme0n1    ONLINE       0     0     0
            nvme1n1    ONLINE       0     0     0

Meaning: There’s a special vdev mirror. Metadata and optionally small blocks can land there.

Decision: If small-file and metadata latency is your issue, special vdev can be a big deal. But it’s also a reliability commitment—lose it and you can lose the pool.

Task 12: Verify special_small_blocks setting (is small data actually going to special?)

cr0x@server:~$ zfs get -o name,property,value special_small_blocks rpool/data
NAME        PROPERTY             VALUE
rpool/data  special_small_blocks 0

Meaning: Only metadata goes to special, not small data blocks.

Decision: If you want small blocks on special, set special_small_blocks=16K or similar—after thinking hard about capacity and failure domain.

Task 13: Test a recordsize change safely on a new dataset

cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 rpool/data/db16k
cr0x@server:~$ zfs get -o name,property,value recordsize,compression rpool/data/db16k
NAME            PROPERTY     VALUE
rpool/data/db16k recordsize  16K
rpool/data/db16k compression lz4

Meaning: You created a place to migrate/test without mutating the existing dataset in-place.

Decision: Migrate a representative slice of data and benchmark. Avoid “flip it in prod” heroics.

Task 14: Migrate the data so it actually picks up the new recordsize

cr0x@server:~$ zfs snapshot rpool/data/db@pre-migrate
cr0x@server:~$ rsync -aHX /rpool/data/db/ /rpool/data/db16k/
cr0x@server:~$ zfs list -o name,used rpool/data/db rpool/data/db16k
NAME              USED
rpool/data/db     1.12G
rpool/data/db16k  1.16G

Meaning: A file-level copy rewrites every block through the POSIX layer, so the data lands as 16K records. Plain zfs send/receive replicates blocks as-is and preserves the old sizes, which makes it great for moving snapshots and useless for reblocking. The snapshot is your rollback point.

Decision: Cut over by swapping mountpoints or updating service config during a maintenance window, quiescing the database (or doing a final resync) before the switch.

Task 15: Validate you’re actually reducing physical I/O after enabling compression

cr0x@server:~$ zfs get -o name,property,value used,logicalused,compressratio rpool/data/logs
NAME            PROPERTY       VALUE
rpool/data/logs used           145G
rpool/data/logs logicalused    312G
rpool/data/logs compressratio  2.15x

Meaning: Logical data is more than twice the physical space consumed, so compression is cutting physical writes (and SSD wear) roughly in half.

Decision: For append-heavy logs, keep compression on. It reduces disk wear and often improves read speed during investigations.

Fast diagnosis playbook

This is the “someone is yelling in Slack” workflow. The goal is to identify whether you’re IOPS-bound, bandwidth-bound, CPU-bound, or sync-bound—fast.

First: confirm it’s not a failing pool or a scrub/resilver situation

  • Run zpool status. If you see errors, degraded vdevs, or a resilver in progress, stop tuning and start fixing hardware or finishing recovery.
  • Check if a scrub is running. Scrubs can be polite or rude depending on tunables and workload.

Second: decide whether the bottleneck is disk or CPU

  • Disk-limited signs: high iowait, high device utilization, bandwidth near device limits, latency increases with throughput.
  • CPU-limited signs: low iowait but high system CPU, throughput plateaus early, latency spikes without saturating disks.

Third: test whether sync writes are the actual villain

  • Check dataset sync and whether the workload uses fsync/O_DSYNC.
  • If you have a SLOG, validate it’s healthy and fast. If you don’t, accept that sync-heavy workloads will be limited by main vdev latency.

Fourth: check recordsize/volblocksize versus I/O pattern

  • For DB/random overwrite workloads, too-large recordsize can cause RMW amplification.
  • For streaming workloads, too-small recordsize wastes IOPS and CPU on metadata/checksums.

Fifth: verify compression is helping, not just “enabled”

  • Look at compressratio and logical vs physical usage.
  • If ratio is poor and CPU is high, consider lz4 instead of heavier algorithms, or compression off for already-compressed data.
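
If you want the whole pass as a copy-paste sequence (outputs omitted; swap in your own pool and dataset names):

cr0x@server:~$ zpool status -x
cr0x@server:~$ zpool iostat -v rpool 2 5
cr0x@server:~$ mpstat 1 5
cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression,compressratio,sync,logbias rpool/data/db

Health first, then disk vs CPU, then the dataset properties that explain what you’re seeing.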

Common mistakes: symptom → root cause → fix

1) “Disks are idle but latency is high”

  • Symptom: low bandwidth, low iowait, but application latency spikes.
  • Root cause: CPU-bound path (compression level too high, checksum/encryption cost) or sync write contention.
  • Fix: downgrade compression to lz4, ensure CPU headroom, validate SLOG for sync workloads, and avoid oversized records for overwrite patterns.

2) “Database writes got slower after increasing recordsize for throughput”

  • Symptom: higher write latency, more unpredictable spikes during maintenance/vacuum/compaction.
  • Root cause: RMW amplification from overwriting small pages inside big records; fragmentation worsens over time.
  • Fix: migrate to a dataset with recordsize aligned to DB page size (often 8K/16K). Keep compression lz4 unless CPU is constrained.

3) “Compression enabled, but pool still fills up and performance didn’t improve”

  • Symptom: compressratio ~1.00x, no meaningful reduction in physical writes.
  • Root cause: data is already compressed/encrypted (media, zip, many VM images with encrypted filesystems).
  • Fix: leave lz4 if CPU is cheap and you want occasional wins; otherwise turn compression off for that dataset and stop paying CPU tax for nothing.

4) “Sequential reads are slower than expected on fast disks”

  • Symptom: throughput below hardware capability during scans/backups.
  • Root cause: recordsize too small, causing excessive IOPS and overhead; or data is fragmented into small blocks due to historical write pattern.
  • Fix: use a large-record dataset for streaming data (256K–1M). For existing data, reblock by rewriting the files into that dataset (file-level copy; send/recv keeps the old block sizes).

5) “Turning on strong compression saved space but broke SLAs”

  • Symptom: space wins, but tail latency/regressions under peak load.
  • Root cause: CPU contention from heavy compression/decompression on hot path.
  • Fix: use lz4 for hot data; reserve zstd for colder, less latency-sensitive datasets; benchmark with real concurrency.

6) “We changed recordsize and nothing changed”

  • Symptom: no observable difference after setting recordsize.
  • Root cause: recordsize affects new writes. Existing blocks keep their old size until rewritten.
  • Fix: rewrite data into the new dataset via a file-level copy or an application-level rewrite (send/recv alone keeps the old block sizes); validate with zdb block distribution.

Joke #2: If you change recordsize on a dataset and expect old blocks to reshape themselves, congratulations—you’ve invented storage yoga.

Checklists / step-by-step plan

Step-by-step: choosing recordsize + compression safely

  1. Classify the workload: random overwrite (DB), random read (VM), sequential (backup/media), mixed (home dirs).
  2. Find the I/O unit: DB page size, VM block size, typical object size, log chunk size.
  3. Pick a starting recordsize:
    • DB: 8K–16K (match pages).
    • VM files: 16K–64K.
    • Streaming: 256K–1M.
    • Mixed: 128K.
  4. Enable compression lz4 unless you have a specific CPU/tail-latency reason not to.
  5. Create a new dataset with those properties; don’t mutate the old one if you can avoid it.
  6. Migrate a representative sample and benchmark with production-like concurrency (see the fio sketch after this list).
  7. Validate outcomes:
    • compressratio improved?
    • Latency percentiles improved?
    • CPU headroom still healthy?
  8. Roll out gradually: migrate service by service, not “the whole pool at once.”
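
For step 6, here’s a minimal fio sketch for an 8K random-write test against the candidate dataset. The path, size, and concurrency are illustrative; shape them to match your real workload before trusting the numbers:

cr0x@server:~$ fio --name=db-randwrite --directory=/rpool/data/db16k \
    --rw=randwrite --bs=8k --size=4G --numjobs=4 \
    --ioengine=psync --fsync=32 --runtime=120 --time_based --group_reporting

Compare latency percentiles and CPU headroom against the same run on the old dataset, not against a synthetic ideal.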

Checklist: before you touch production knobs

  • Pool health clean (zpool status).
  • Recent scrub completed without errors.
  • You know whether the workload is sync-heavy.
  • You can roll back (snapshots + a tested migration plan).
  • You have baseline metrics (latency percentiles, CPU, IOPS, bandwidth).
  • You’re not mixing unrelated workloads in a single dataset with one-size-fits-none properties.

Checklist: compression policy by data type

  • Text/logs/configs: lz4 on.
  • Databases: lz4 on (usually), recordsize aligned.
  • Media (already compressed): optional; often off unless you’ve measured wins.
  • Encrypted-at-rest inside files: compression won’t help; don’t expect miracles.
  • Cold archives: consider zstd if CPU budget is separate from latency budget.

One reliability idea (paraphrased) worth keeping

Paraphrased idea: plan for failure, because everything fails eventually—design so failures are survivable. — Werner Vogels

FAQ

1) Should I always enable compression=lz4?

For most datasets, yes. It often reduces physical I/O and improves effective cache without noticeable CPU cost.
Exceptions: already-compressed/encrypted data on CPU-starved systems with tight latency budgets.

2) If I change recordsize, will it rewrite existing blocks?

No. Recordsize applies to new writes. Existing blocks keep their size until rewritten.
To “reblock,” you have to rewrite the files: copy them into a new dataset (cp/rsync) or rewrite them at the application layer. Plain zfs send/recv preserves the existing block sizes, so it won’t reblock for you.

3) What recordsize should I use for PostgreSQL?

Common starting points are 8K or 16K (PostgreSQL pages are typically 8K). 16K can be fine if you see some sequential behavior.
The correct answer is: align with page behavior and benchmark your workload, especially for update-heavy tables.

4) What about MySQL/InnoDB?

InnoDB commonly uses 16K pages by default. A 16K recordsize is a sane starting point for overwrite-heavy datasets.
If you run large sequential scans and mostly append, you might tolerate bigger. Measure before you get creative.

5) Why does big recordsize hurt random writes if ZFS can store small blocks?

The problem is not initial small writes—it’s partial overwrites and copy-on-write behavior that create read-modify-write cycles and fragmentation.
Big records increase the amount of data rewritten for a small change.

6) Is zstd worth it?

Sometimes. It can provide better compression than lz4, especially on cold or write-once data.
On hot, latency-sensitive datasets, it can backfire by increasing CPU contention and tail latency.

7) Does compression help IOPS, or only bandwidth?

It primarily reduces bandwidth (bytes). But reducing bytes can indirectly reduce IOPS pressure if operations complete faster,
and it improves cache density, which can reduce physical reads (and thus IOPS) when ARC hit rates improve.

8) Recordsize vs volblocksize: which do I tune for VM storage?

If you store VM disks as files on a dataset, tune recordsize. If you use zvols, tune volblocksize at creation time.
Don’t confuse them; zvol block size is harder to change later.

9) Can I set different recordsize values inside the same dataset?

Not per-directory via standard ZFS properties. It’s per dataset. The practical pattern is: create multiple datasets with different properties
and mount them where the application expects different I/O behavior.

10) How do I know if I’m CPU-bound because of compression?

You’ll typically see low iowait, elevated system CPU, and throughput plateauing before disks saturate.
Confirm by comparing performance with lz4 vs heavier compression on a test dataset, using production-like concurrency.

Conclusion: next steps you can do today

  1. Inventory your datasets: list recordsize, compression, and sources of inheritance. Find the “backup profile accidentally running production” situation.
  2. Pick one workload that’s either latency-sensitive or expensive (database, VM storage, build cache). Give it a dedicated dataset.
  3. Set sane defaults:
    • Hot general data: compression=lz4, recordsize=128K.
    • DB: compression=lz4, recordsize=8K or 16K.
    • Streaming backups: recordsize=512K or 1M, compression based on CPU budget.
  4. Migrate, don’t mutate: create a new dataset with the right settings and copy data into it; use a file-level copy when the point is a new recordsize (send/recv keeps old block sizes). This is how you avoid surprise regressions.
  5. Re-measure: logical vs physical, latency percentiles, CPU headroom, and the “are disks actually busy?” question.

Recordsize and compression are not “tuning tricks.” They’re architecture decisions expressed as two properties.
Make them deliberately, per workload, and your hardware budget will go further—without turning your CPUs into unpaid interns.
