ZFS Compression Benchmarks: Measuring Real Gains, Not Placebo

You turn on compression=lz4, run a “benchmark,” and the numbers look amazing. Great—until your busiest VM host starts stuttering at 10 a.m.,
your database latency gets weirdly spiky, and the only metric anyone can quote is compressratio from zfs get.
Compression didn’t “fail.” The benchmark did.

ZFS compression is usually a free lunch, but production systems don’t serve lunch; they serve SLAs.
If you want real gains—not placebo—you need to measure the right things, in the right order, under the right load, and then decide like an adult.

What “real gains” means for ZFS compression

“Compression makes things smaller” is true in the same way “exercise makes you healthier” is true. Yes, generally. Also: depends what you do,
how you measure it, and what you’re already bottlenecked on.

In production, ZFS compression is successful when it achieves one (or more) of these outcomes under your real constraints:

  • Lower p95/p99 latency (because you push fewer bytes through a slow device or a saturated HBA).
  • Higher throughput (because your storage is bandwidth-limited and compression trades CPU for fewer writes).
  • More effective cache (ARC/L2ARC holds more logical data per physical byte when blocks compress well).
  • Less write amplification (fewer physical blocks written for the same logical change).
  • More usable capacity without changing the failure domain or the pool layout.

Compression is not successful when the only “win” is a prettier number in zfs get compressratio.
If your CPU becomes the limiter, if your latency tail grows, or if your workload is already not I/O bound, you can absolutely make things worse.

The job of a compression benchmark isn’t to crown a winning codec. The job is to answer a specific question:
“With this workload, on this hardware, under these constraints, what changes in latency, throughput, CPU, and write volume when compression changes?”

Facts and history that actually matter

A little context keeps you from cargo-cult tuning. Here are concrete facts that show up in real benchmark interpretation:

  1. ZFS was designed for end-to-end data integrity (checksums everywhere), not for “max IOPS at any cost.” Compression lives inside that model.
  2. LZ4 became the default recommendation in many OpenZFS circles because it’s fast enough that CPU is rarely the limiter for typical server workloads.
  3. Modern OpenZFS added ZSTD because people wanted better ratios with configurable levels; it’s not “just a bit slower,” it’s a range.
  4. Compression is per-block and happens before writing to disk; blocks that don’t compress enough to clear ZFS’s minimum-savings threshold (by default, about 12.5%) are stored uncompressed.
  5. Recordsize matters because compression operates on record blocks; the same data can compress differently at 16K vs 128K record sizes.
  6. Copy-on-write changes the economics: rewriting small parts of big blocks can cause more churn; compression can reduce physical churn if data compresses.
  7. ARC caches blocks in their compressed form (compressed ARC is the default in modern OpenZFS): if data compresses 2:1, ARC can hold roughly twice the logical content, changing read behavior.
  8. Dedup and compression are not buddies by default: dedup amplifies metadata and RAM pressure; adding compression testing to dedup pools can mislead you.
  9. Special vdevs changed metadata behavior: metadata and small blocks can live on faster media, so “compression helped latency” may really be “special vdev saved you.”

One idea worth keeping on the wall, paraphrasing Werner Vogels: everything fails, all the time. Compression tuning is no exception:
test with failure modes and saturation in mind, not just empty-lab hero runs.

Benchmark principles (a.k.a. how not to lie to yourself)

1) Decide what you’re optimizing: latency tail, throughput, or capacity

If the business cares about p99 latency, stop showing average throughput graphs. If the business cares about capacity, stop showing “IOPS increased”
when your data set is incompressible. Pick the target, then pick the measurement.

2) Separate warm-cache behavior from cold-cache behavior

ZFS benchmarks without cache control are basically a Rorschach test. ARC can make “disk performance” look like “RAM performance” for a while.
That’s not bad—ARC is real—but it’s a different system than “how fast do my vdevs go under sustained load?”
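
If you need cold-cache numbers, two blunt but effective controls are sketched below; the dataset and pool names are placeholders, and both assume a test system you are allowed to disturb.

cr0x@server:~$ sudo zfs set primarycache=metadata tank/bench_lz4
cr0x@server:~$ sudo zpool export tank && sudo zpool import tank

The first stops ZFS from caching file data for the bench dataset, so data reads hit the vdevs; the second drops the pool’s ARC contents entirely between runs. Remember to set primarycache back to all afterward.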

3) Measure CPU cost explicitly

Compression isn’t free. LZ4 is cheap, ZSTD can be cheap or expensive depending on level and data. If you don’t measure CPU saturation,
you’ll misattribute slowdowns to “ZFS overhead” or “the network.”
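
A minimal way to capture this, assuming sysstat is installed and a 120-second measurement window (the log path is a placeholder): sample per-core CPU in the background for the whole run and keep the log next to the fio results.

cr0x@server:~$ mpstat -P ALL 1 120 > /tmp/cpu_bench_lz4.log &

Start it just before the benchmark job; afterward, compare %usr, %sys, and %idle per core between the LZ4 and ZSTD runs. ZFS compression work happens in kernel threads, so it shows up largely as system time rather than in fio’s own user time.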

4) Don’t benchmark on empty pools and declare victory

ZFS behavior changes as pools fill. Fragmentation, metaslab allocation, and write patterns evolve. A pool at 20% full is a different creature than one at 80%.
If you only test empty, you’re testing a best-case that you’ll never see again.

5) Use realistic I/O sizes and concurrency

If your real workload is 8K random reads at queue depth 32, don’t run 1M sequential writes at queue depth 1 and call it “database-like.”
Storage people love numbers; reality does not care.

6) Control what ZFS can legally change underneath you

Recordsize, volblocksize (for zvols), sync settings, atime, xattr storage, and special vdev policies all affect the result.
A compression benchmark that changes four other variables is a superstition ritual, not an experiment.
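
A minimal sketch of “write the knobs down before you run anything,” assuming a bench dataset named tank/bench_lz4 and a manifest path that are both placeholders:

cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression,sync,atime,xattr,primarycache,logbias,redundant_metadata tank/bench_lz4 > /tmp/manifest_bench_lz4.txt
cr0x@server:~$ zdb -C tank | grep ashift >> /tmp/manifest_bench_lz4.txt

If a later run disagrees with an earlier one, the first question is whether these two files differ.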

First joke (and yes, it’s short): A benchmark without controls is like a fire drill where you forget to pull the alarm.
Everyone feels prepared, right up until the building is on fire.

Metrics that matter (and which ones are vanity)

Compression metrics: useful, but limited

  • compressratio: dataset-level ratio of logical to physical space. Good for capacity planning. Bad as a performance predictor.
  • logicalused vs used: tells you how much space compression is saving. Still not performance.
  • per-algorithm savings (lz4 vs zstd-N): only meaningful if your data shape is stable.

Performance metrics: what decides the argument

  • p95/p99 latency of reads and writes (application-facing reality).
  • device service time (the zpool iostat -vl latency columns are gold).
  • CPU utilization and steal time (compression burns cycles; virtualization adds its own tax).
  • bytes written/read at the vdev level (the “fewer bytes” promise of compression is measurable here).
  • ARC hit ratio under steady state (compression can increase effective cache capacity).

Vanity metrics (or “context required” metrics)

  • Single-run throughput peak: often just cache warming.
  • Average latency: averages hide tail pain. Tail pain pages you at night.
  • IOPS without block size: meaningless. 4K IOPS vs 128K IOPS are different planets.
  • zpool list “capacity used” alone: doesn’t reveal how fragmented or write-amplified you are.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I actually run when someone says, “Compression made it slower,” or “ZSTD level 19 is a free win.”
Each task includes: command, what output means, and the decision you make.

Task 1: Confirm dataset compression setting (and inherited surprises)

cr0x@server:~$ zfs get -o name,property,value,source compression tank/vm
NAME     PROPERTY     VALUE     SOURCE
tank/vm  compression  zstd      local

Meaning: Compression is set to zstd locally, not inherited. If it were inherited, you’d see inherited from ....
Decision: If benchmarking, ensure the dataset under test is explicitly set to the codec you’re claiming, and record the source.

Task 2: Check actual achieved ratio, not just “enabled”

cr0x@server:~$ zfs get -o name,property,value -p used,logicalused,compressratio tank/vm
NAME     PROPERTY      VALUE
tank/vm  used          214748364800
tank/vm  logicalused   322122547200
tank/vm  compressratio 1.50x

Meaning: Logical is ~300G while physical used is ~200G, so compression is doing work (~1.5x).
Decision: If compressratio is ~1.00x, performance tests won’t show “fewer bytes” gains; focus on CPU overhead and latency instead.
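
If you want the savings as bytes rather than a ratio, a quick sketch using the same two properties (same dataset as above):

cr0x@server:~$ used=$(zfs get -Hp -o value used tank/vm); logical=$(zfs get -Hp -o value logicalused tank/vm)
cr0x@server:~$ echo "saved $(( (logical - used) / 1024 / 1024 / 1024 )) GiB"
saved 100 GiB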

Task 3: Check pool health and errors before blaming compression

cr0x@server:~$ zpool status -xv tank
pool 'tank' is healthy

Meaning: No known errors. If you see checksum errors or degraded vdevs, your “compression benchmark” is benchmarking failure recovery.
Decision: Fix pool health first; benchmarking on a degraded pool is like timing a marathon with one shoe missing.

Task 4: Check pool fullness (your “empty pool win” detector)

cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,fragmentation tank
NAME  SIZE  ALLOC   FREE  CAP  FRAG
tank  40T   28T     12T   70%  34%

Meaning: 70% full with moderate fragmentation. Performance at 70% can differ sharply from 20%.
Decision: If your benchmark was done on a fresh pool, rerun on a similar fill level or at least disclose the difference.

Task 5: Observe vdev-level bandwidth and latency during load

cr0x@server:~$ zpool iostat -vl tank 1
                              capacity     operations     bandwidth    total_wait     disk_wait
pool                        alloc   free   read  write   read  write   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank                         28T    12T    2200   1800  320M  290M    12ms  18ms     4ms  10ms
  raidz2-0                   28T    12T    2200   1800  320M  290M    12ms  18ms     4ms  10ms
    sda                          -      -     0    300    0B   48M       -     -     3ms  11ms
    sdb                          -      -     0    300    0B   48M       -     -     4ms  10ms
    sdc                          -      -     0    300    0B   48M       -     -     4ms  10ms
    sdd                          -      -     0    300    0B   48M       -     -     4ms  10ms
--------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Meaning: Look at total_wait and disk_wait. If disk_wait is high, storage is the bottleneck.
If total_wait is high but disk_wait is low, you may be queueing in software (CPU, locks, sync, etc.).
Decision: If compression reduces bandwidth but latency stays high, you’re not bandwidth-limited; chase CPU, sync, or fragmentation.

Task 6: Watch CPU saturation while toggling compression

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.6.0 (server) 	12/26/2025 	_x86_64_	(16 CPU)

12:10:01 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %idle
12:10:02 AM  all   62.1  0.0  18.4   0.9    0.0  1.1    0.0   17.5
12:10:02 AM    3   96.0  0.0   3.0   0.0    0.0  0.0    0.0    1.0

Meaning: Some cores are near pegged. Compression often concentrates work; a single hot core can throttle throughput.
Decision: If enabling a heavier codec pushes cores to saturation, expect latency spikes. Either lower ZSTD level, reduce concurrency, or add CPU headroom.

Task 7: Check ARC behavior (are you benchmarking RAM?)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:10:20   11K   120     1   110   1    10   0     0   0   48G   64G
12:10:21   12K   140     1   130   1    10   0     0   0   48G   64G

Meaning: Miss rate is tiny; reads are mostly from ARC. Great for applications, terrible if you’re trying to compare vdev read performance.
Decision: If you need cold-cache numbers, design the test to exceed ARC or use unique data per run.
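
To size the working set past ARC, read the ARC ceiling first. A minimal sketch, assuming a Linux host with OpenZFS (the kstat lives under /proc/spl/kstat/zfs):

cr0x@server:~$ awk '$1 == "c_max" {printf "ARC max: %.1f GiB\n", $3 / 2^30}' /proc/spl/kstat/zfs/arcstats

Then make fio’s total footprint (size × numjobs, across every bench dataset in play) comfortably larger than that number, or give each run unique files.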

Task 8: Verify recordsize/volblocksize matches workload

cr0x@server:~$ zfs get -o name,property,value recordsize tank/vm
NAME     PROPERTY    VALUE
tank/vm  recordsize  128K

Meaning: Files will tend to use 128K records. For VM images with 4K random I/O, that can be a mismatch.
Decision: For VM datasets, consider recordsize=16K (or zvol volblocksize=8K/16K) and benchmark again—compression interacts with this.
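
A minimal sketch of what “benchmark again at a smaller block size” looks like; names are placeholders, and note that volblocksize can only be set at zvol creation time:

cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/bench_lz4_16k
cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=16K -o compression=lz4 tank/vm-vol-16k

Existing files keep their old block size until rewritten, so copy the test data in fresh rather than reusing files created under 128K.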

Task 9: Confirm sync behavior (compression benchmark vs sync benchmark)

cr0x@server:~$ zfs get -o name,property,value sync tank/vm
NAME     PROPERTY  VALUE
tank/vm  sync      standard

Meaning: Sync writes are honored. If someone set sync=disabled, they didn’t speed up compression; they removed safety.
Decision: If you see sync=disabled in perf tests, discard the results unless the production workload also knowingly runs that way.

Task 10: Watch TXG and write throttling symptoms via iostat + latency

cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          55.40    0.00   17.10    1.20    0.00   26.30

Device            r/s     w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz await  r_await  w_await  svctm  %util
nvme0n1          0.0   1200.0    0.0   280.0    478.0     8.2    6.9    0.0      6.9     0.8   92.0

Meaning: Device is highly utilized, but service time is low; queueing is building. Compression may reduce wMB/s, which can lower queueing.
Decision: If compression drops bandwidth and await drops too, it’s a real win. If bandwidth drops but await rises, you’re CPU or sync bound.

Task 11: Inspect per-dataset logical vs physical bytes (what you pay the disks)

cr0x@server:~$ zfs get -o name,property,value -p referenced,logicalreferenced tank/vm
NAME     PROPERTY           VALUE
tank/vm  referenced         53687091200
tank/vm  logicalreferenced  96636764160

Meaning: The dataset holds ~90G of logical data in ~50G of physical space, so compression is roughly halving the bytes that actually hit the vdevs when data is written or rewritten.
Decision: On SSD endurance-sensitive systems, this is one of the most persuasive reasons to use compression even when performance is “fine.”

Task 12: Confirm ashift and layout (benchmark results don’t travel well)

cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
12:        ashift: 12
45:    vdev_tree:

Meaning: ashift=12 implies 4K sectors. If you compare to another pool with different ashift, you’re not comparing compression; you’re comparing alignment.
Decision: When you publish internal benchmark results, include ashift and vdev type. Otherwise the next person repeats your test and “discovers” different physics.

Task 13: Compare algorithms safely on a test dataset

cr0x@server:~$ sudo zfs create -o compression=lz4 -o recordsize=128K tank/bench_lz4
cr0x@server:~$ sudo zfs create -o compression=zstd -o recordsize=128K tank/bench_zstd
cr0x@server:~$ zfs get -o name,property,value compression,recordsize tank/bench_lz4 tank/bench_zstd
NAME             PROPERTY     VALUE
tank/bench_lz4   compression  lz4
tank/bench_lz4   recordsize   128K
tank/bench_zstd  compression  zstd
tank/bench_zstd  recordsize   128K

Meaning: Two datasets identical except compression. That’s what “benchmarking compression” looks like.
Decision: Always isolate the variable. If you also change recordsize, you’re benchmarking a bundle.
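
One cheap way to prove the isolation (to yourself and to the change review) is to diff everything except the codec. A minimal sketch, assuming bash process substitution:

cr0x@server:~$ diff <(zfs get -Ho property,value all tank/bench_lz4) <(zfs get -Ho property,value all tank/bench_zstd)

Expect compression plus per-dataset bookkeeping (used, creation, guid, createtxg) to show up in the diff; any tuning property beyond that means you’re benchmarking a bundle.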

Task 14: Measure real application-level latency (not just storage stats)

cr0x@server:~$ fio --name=randread --directory=/tank/bench_lz4 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --size=8G --time_based --runtime=60 --ioengine=libaio --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
  read: IOPS=128k, BW=500MiB/s (524MB/s)(29.3GiB/60001msec)
  lat (usec): min=80, max=5400, avg=245.10, stdev=110.12
  clat percentiles (usec):
   |  50.00th=[  230],  90.00th=[  320],  95.00th=[  410],  99.00th=[  980],  99.90th=[ 2900]

Meaning: The percentiles are the story. If switching to ZSTD raises 99.9th percentile while average stays similar, you just bought tail latency.
Decision: Use p99/p99.9 as the pass/fail gate for latency-sensitive workloads (databases, VM hosts).
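
If you want the gate to be scriptable instead of eyeballed, fio can emit JSON and jq can pull the percentiles out. A minimal sketch, assuming fio 3.x JSON layout and that jq is installed (latencies are in nanoseconds; the output path is a placeholder):

cr0x@server:~$ fio --name=randread --directory=/tank/bench_lz4 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --size=8G --time_based --runtime=60 --group_reporting --ioengine=libaio --output-format=json --output=/tmp/randread_lz4.json
cr0x@server:~$ jq '.jobs[0].read.clat_ns.percentile | {p99: ."99.000000", p999: ."99.900000"}' /tmp/randread_lz4.json

Compare those two numbers across codecs and let them, not the average, decide.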

FIO recipes that model real workloads

FIO is not “the truth.” It’s a controlled lie you tell the system to see how it reacts. Tell the right lie.
The point is to approximate access patterns: random vs sequential, sync vs async, block size, concurrency, working set size, and overwrite behavior.

Recipe A: VM-like random 4K reads/writes (mix)

cr0x@server:~$ fio --name=vm-mix --directory=/tank/bench_zstd --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=8 --size=16G --time_based --runtime=120 --direct=1 --ioengine=libaio --group_reporting
vm-mix: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
  read: IOPS=84.2k, BW=329MiB/s (345MB/s)
  write: IOPS=36.1k, BW=141MiB/s (148MB/s)
  clat percentiles (usec):
   |  99.00th=[ 2200],  99.90th=[ 6500],  99.99th=[17000]

How to use it: Run the same job on bench_lz4 and bench_zstd, compare p99.9 and CPU.
Decision: If ZSTD improves bandwidth but worsens p99.9 materially, LZ4 is the safer default for VM mixes.

Recipe B: Database-ish 8K random writes with fsync pressure

cr0x@server:~$ fio --name=db-sync --directory=/tank/bench_lz4 --rw=randwrite --bs=8k --iodepth=1 --numjobs=4 --size=8G --time_based --runtime=120 --fsync=1 --direct=1 --ioengine=libaio --group_reporting
db-sync: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
  write: IOPS=4200, BW=32.8MiB/s (34.4MB/s)
  clat percentiles (usec):
   |  95.00th=[ 1800],  99.00th=[ 4200],  99.90th=[12000]

How to use it: This is where sync and SLOG quality dominate. Compression may help by shrinking bytes, but CPU can also matter.
Decision: If codec choice barely moves numbers, stop arguing about compression and go look at sync path (SLOG, latency, queueing).
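
Two quick looks at the sync path, assuming the pool is named tank: check whether a dedicated log device exists at all, and watch the latency histograms while the sync-heavy job runs.

cr0x@server:~$ zpool status tank | grep -A 2 logs
cr0x@server:~$ zpool iostat -w tank 5 2

If there is no SLOG and the workload is fsync-heavy, the ZIL lands on the main vdevs and no codec choice will rescue your p99.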

Recipe C: Sequential writes with large blocks (backup targets, logs)

cr0x@server:~$ fio --name=seqwrite --directory=/tank/bench_zstd --rw=write --bs=1M --iodepth=8 --numjobs=2 --size=64G --time_based --runtime=120 --direct=1 --ioengine=libaio --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
...
  write: IOPS=680, BW=680MiB/s (713MB/s)
  cpu          : usr=78.2%, sys=8.5%, ctx=12000, majf=0, minf=400

How to use it: Large sequential writes can be bandwidth-bound. Compression can drastically reduce bytes if data is log/text-like.
Decision: If CPU saturates while BW stops increasing (check mpstat, not just fio’s usr%—ZFS compression runs in kernel threads and shows up as system time), lower the ZSTD level or use LZ4. If BW increases and CPU has headroom, ZSTD may be worth it.

Recipe D: “Incompressible data” sanity test (detect placebo)

cr0x@server:~$ fio --name=incompressible --directory=/tank/bench_lz4 --rw=write --bs=128k --iodepth=16 --numjobs=4 --size=16G --time_based --runtime=60 --direct=1 --ioengine=libaio --refill_buffers --buffer_compress_percentage=0 --group_reporting
incompressible: (g=0): rw=write, bs=(R) 131072B-131072B, (W) 131072B-131072B, (T) 131072B-131072B, ioengine=libaio, iodepth=16
...
  write: IOPS=5200, BW=650MiB/s (682MB/s)

How to use it: If data is incompressible, compression should not reduce bytes much and can only cost CPU.
Decision: If you see “massive speedup” on incompressible writes after enabling compression, you’re measuring something else (cache, alignment, or test artifact).

Second joke (last one, promise): ZSTD level 19 on a busy hypervisor is like giving everyone in the office a standing desk.
The posture improves; the complaints get louder.

Fast diagnosis playbook

When someone says “Compression changed performance,” you don’t have time for a week-long lab reenactment. Triage like an SRE.
The goal is to identify the bottleneck class quickly: disk, CPU, sync path, or cache illusion.

First: establish whether you’re I/O bound or CPU bound

  1. Check CPU saturation during the workload. Use mpstat and look for pegged cores and rising sys time.
    If CPU is maxed, compression level is a real knob with real consequences.
  2. Check device utilization and wait times. Use zpool iostat -vl 1 and iostat -x 1.
    If disks are pegged and wait rises with bandwidth, you’re I/O bound and compression often helps.

Second: determine whether the benchmark is just ARC

  1. Watch ARC misses. If miss% is tiny during reads, you’re mostly in RAM.
    Compression’s “read speedup” here may be cache effectiveness, not disk speed.
  2. Increase working set size. If your dataset fits in ARC, you can’t make disk conclusions.

Third: confirm the sync/write path isn’t the real limiter

  1. Check sync property and workload sync behavior. If you’re benchmarking a database, sync dominates.
  2. Observe latency during sustained sync writes. Compression may reduce write bytes, but SLOG latency and TXG behavior can still dominate.

Fourth: validate you didn’t change other variables

  1. Record dataset props: recordsize, atime, xattr, primarycache, logbias, redundant_metadata.
  2. Record pool props/layout: ashift, RAIDZ vs mirrors, special vdevs, SLOG, L2ARC.

Common mistakes: symptom → root cause → fix

1) “Compression enabled, throughput doubled” (but only on the second run)

Symptom: First run is slower, second run is blazing fast, compression “wins.”

Root cause: ARC warmed. You benchmarked RAM, not disks. Compression may also increase ARC effectiveness, exaggerating the effect.

Fix: Use a working set larger than ARC, use unique filenames per run, and compare steady-state after warmup. Track ARC miss%.

2) “ZSTD is slower than LZ4, so ZSTD is bad”

Symptom: Higher latency tails and lower IOPS after switching to ZSTD.

Root cause: CPU saturation or single-core bottleneck. ZSTD level too high for the concurrency and CPU headroom available.

Fix: Test ZSTD at lower levels; measure CPU. If you can’t afford CPU, choose LZ4 and move on with your life.

3) “compressratio is 2.5x, so we should get 2.5x performance”

Symptom: Great compression ratios, little to no performance change.

Root cause: Workload isn’t bandwidth-limited; it’s latency-bound, sync-bound, or CPU-bound elsewhere (checksums, metadata, app locks).

Fix: Use latency percentiles and vdev wait time to find the real limiter. Compression reduces bytes, not necessarily IOPS latency.

4) “We benchmarked with sync=disabled to see the max”

Symptom: Unreal write numbers, followed by “production doesn’t match.”

Root cause: You removed durability guarantees and changed the write path entirely.

Fix: Benchmark with production sync semantics. If you must test sync=disabled, label it as “unsafe mode” and don’t use it for decisions.

5) “Compression made snapshots expensive”

Symptom: Snapshot space growth looks worse after changing compression.

Root cause: Recordsize mismatch and rewrite patterns. Large recordsize + small overwrites increases churn; compression changes physical layout and can alter fragmentation patterns.

Fix: Align recordsize/volblocksize to overwrite size. For VM images and databases, smaller blocks are often more stable.

6) “We turned on ZSTD and now scrub is slower”

Symptom: Scrub time increases and competes with workload.

Root cause: Scrub reads blocks and verifies checksums against the data as stored, so the codec isn’t being decompressed during scrub; the real cost is contention. A workload that now spends more CPU on ZSTD leaves less headroom, so scrub and the workload fight harder over the same CPU and disk time.

Fix: Schedule scrubs for low-traffic windows, cap scrub impact, and avoid high compression levels on systems with no CPU slack.

Three corporate mini-stories from the compression trenches

Mini-story 1: The incident caused by a wrong assumption

A team rolled out a new VM cluster and standardized on ZSTD because the staging numbers looked great. They did what everyone does:
copy a golden image, run a couple of file-copy tests, then take a victory lap.

The wrong assumption was subtle: they assumed “compresses well” equals “runs fast.” Their golden image was mostly OS files—highly compressible,
mostly read-heavy, and very cache-friendly. Production was a mixed fleet with log-heavy services and a few chatty middleware boxes doing constant small writes.

Monday morning, the cluster hit peak load. CPU graphs went vertical, but not in the fun way. Latency climbed, then jittered, then climbed again.
The hypervisor team blamed the network. The network team blamed storage. Storage blamed “noisy neighbors.”
Meanwhile the pager didn’t care about the org chart.

The actual issue: ZSTD at an aggressive level was compressing a firehose of small random writes. The pool wasn’t saturating disks; it was saturating CPU.
The workload had enough concurrency that compression threads became a bottleneck, and tail latency punished everyone.

The fix was boring: drop to LZ4 for VM datasets and keep ZSTD for backup/archive datasets where throughput mattered more than micro-latency.
They also changed their benchmark suite to include a VM-mix FIO job with p99.9 as a gate.

Mini-story 2: The optimization that backfired

Another company tried to “optimize” compression by turning it off for their database dataset. The reasoning sounded clean:
“The DB pages are already compact; compression is wasted CPU.” They had a few graphs to prove CPU dropped slightly.

A month later, SSD wear indicators started moving faster than expected, and nightly maintenance windows grew.
Nothing was on fire, but everything felt heavier—like the pool had gained weight.

The investigation found that while the database pages weren’t dramatically compressible, they were compressible enough to reduce physical writes meaningfully.
Turning compression off increased actual bytes written, which increased device write amplification and pushed the pool closer to its bandwidth ceiling.
The “CPU savings” was real, but it wasn’t the limiting resource.

Even worse, disabling compression reduced effective ARC capacity for that dataset. Read amplification increased,
and the database started missing cache more frequently during churny periods.

They reverted to LZ4, not ZSTD, because it gave most of the write reduction with minimal CPU cost.
The optimization backfired because it optimized the wrong resource. Storage is a system; you don’t get to tune one part in isolation.

Mini-story 3: The boring but correct practice that saved the day

A platform team maintained a simple internal rule: any performance claim needs a “benchmark manifest.”
Not a giant document—just a checklist attached to the ticket: hardware, pool layout, ashift, dataset properties, fill level, workload profile,
warm/cold cache stance, and the exact commands used.

During a cost-cutting push, someone proposed switching all datasets to ZSTD at a higher level to “save storage.”
The change request came with a slide deck and a single number: “2.1x compression ratio.”
The manifest forced them to include p99 latency under mixed load and CPU utilization.

The manifest also forced a test on a pool filled to the same level as production and with the same special vdev arrangement.
That’s where the surprise appeared: CPU headroom was fine on half the hosts but tight on the older ones, which would become the weak link.

They shipped a targeted change: ZSTD for the archival and log datasets, LZ4 for VM and latency-sensitive services,
plus an explicit “do not exceed ZSTD level X on host class Y” note in configuration management.

Nothing dramatic happened afterward. That’s the point. The boring practice saved them from a very exciting outage.

Checklists / step-by-step plan

Step-by-step: a compression benchmark you can defend in a change review

  1. Clone the environment assumptions. Same pool type, similar fullness, similar hardware class, same OS/OpenZFS version.
  2. Create two datasets differing only in compression (and optionally ZSTD level).
  3. Lock down properties: recordsize/volblocksize, sync, atime, xattr, primarycache, logbias (if relevant).
  4. Pick 2–3 workload models that represent reality: VM mix, DB sync writes, sequential ingest/backup.
  5. Define success criteria before running: p99.9 latency threshold, minimum throughput, maximum CPU utilization, and capacity gain requirement.
  6. Run warmup and then steady-state measurement windows. Capture p95/p99/p99.9, CPU, and vdev wait.
  7. Repeat runs (at least 3) and compare variability. If variability is huge, your environment isn’t controlled enough (a minimal runner for this repeat loop is sketched after this list).
  8. Record a manifest: exact commands, dataset props, pool status, fill level, and observed bottleneck.
  9. Make a decision per dataset class (VM, DB, logs, backups), not “one codec to rule them all.”
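
A minimal runner covering the repeat-and-record loop, assuming the two bench datasets from Task 13, fio with JSON output, and jq; the /tmp paths, job parameters (borrowed from Recipe A), and any gate thresholds are placeholders to adapt:

#!/usr/bin/env bash
# Sketch: run the same fio job against two datasets, three repeats each,
# and keep the raw JSON so percentiles and variability can be compared later.
set -euo pipefail
for ds in bench_lz4 bench_zstd; do
  for run in 1 2 3; do
    out="/tmp/${ds}_vm-mix_run${run}.json"
    fio --name=vm-mix --directory="/tank/${ds}" --rw=randrw --rwmixread=70 \
        --bs=4k --iodepth=32 --numjobs=8 --size=16G --time_based --runtime=120 \
        --group_reporting --ioengine=libaio \
        --output-format=json --output="${out}"
    # Read-side p99 in nanoseconds; the gate you defined in step 5 goes here.
    jq '.jobs[0].read.clat_ns.percentile."99.000000"' "${out}"
  done
done

Capture mpstat and zpool iostat -vl in parallel (as in Tasks 5 and 6) and attach everything to the manifest.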

Operational checklist: before changing compression in production

  • Confirm CPU headroom at peak load (not at 2 a.m.).
  • Confirm the workload’s compressibility (sample real data, not synthetic zeros).
  • Confirm rollback plan: property changes are easy; rolling back performance regressions in a busy cluster is not.
  • Confirm monitoring: p99 latency, CPU, and device wait time dashboards exist and are watched.
  • Confirm blast radius: change one dataset class or one host group first.

FAQ

1) Should I enable compression on ZFS by default?

Yes—LZ4 on most general-purpose datasets is the default sane choice. It usually saves space and reduces physical writes with minimal CPU cost.
Treat “no compression” as an exception you justify with measurements.

2) Is ZSTD always better than LZ4?

No. ZSTD can provide better ratios, but it can cost more CPU and can worsen tail latency under small-random-write concurrency.
Use ZSTD where capacity savings matter and you have CPU headroom; keep LZ4 where latency is king.

3) Why does compressratio look great but performance doesn’t improve?

Because your bottleneck may not be bandwidth. Latency, sync semantics, metadata contention, CPU, or application locks can dominate.
Compression reduces bytes; it doesn’t magically remove the rest of the stack.

4) Can compression improve read performance?

Yes, especially if you’re bandwidth-limited or cache-limited. If compressed blocks mean fewer bytes from disk, reads can speed up.
Also, ARC can effectively hold more logical data when blocks compress well.

5) Do I need to recompress old data after changing the property?

ZFS compression applies to newly written blocks. Existing blocks keep their existing compression state until rewritten.
If you need old data recompressed, you usually rewrite it (e.g., send/receive to a new dataset or copy within).
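
A minimal sketch of the rewrite route, with placeholder names; note that a plain (non-raw, non-compressed) send lets the receiving dataset apply its own compression, whereas zfs send -c would preserve the original blocks as-is:

cr0x@server:~$ sudo zfs snapshot tank/vm@recompress
cr0x@server:~$ sudo zfs send tank/vm@recompress | sudo zfs receive -o compression=zstd tank/vm_zstd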

6) Is benchmarking with dd if=/dev/zero valid?

It’s valid for showing peak behavior on trivially compressible data, but it’s a terrible proxy for production: with compression enabled, ZFS detects all-zero blocks and writes them as holes, so you end up benchmarking almost no real I/O. The numbers are spectacular and meaningless.

7) How do I choose ZSTD level?

Start at the default zstd (which maps to zstd-3 in current OpenZFS) or explicitly pick a low level (e.g., zstd-3 or zstd-5)
for performance-sensitive systems. Increase only if you can prove CPU headroom and you actually gain meaningful capacity.
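
Setting an explicit level is just a property, and it only affects blocks written from that point on; the dataset name is a placeholder:

cr0x@server:~$ sudo zfs set compression=zstd-3 tank/bench_zstd
cr0x@server:~$ zfs get -o name,property,value compression tank/bench_zstd
NAME             PROPERTY     VALUE
tank/bench_zstd  compression  zstd-3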

8) Does recordsize affect compression outcomes?

Yes. Compression is per record block. Larger recordsize can improve compression ratio on large sequential data but can hurt overwrite-heavy workloads.
Tune recordsize for the workload first, then evaluate compression levels.

9) Why do my FIO results vary so much run to run?

Common causes: ARC state changes, background scrubs/resilvers, other tenants, pool fill level drift, and CPU frequency scaling.
If variability is high, you don’t have a controlled environment, so don’t make irreversible decisions from it.

10) Does compression interact with encryption?

Yes. Compression occurs before encryption in ZFS’s pipeline (so it can still compress), but encryption adds CPU cost too.
Benchmark with both enabled if production uses both, because CPU contention is cumulative.
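
A minimal sketch for building a test dataset that matches a compressed-and-encrypted production stack (placeholder names; keyformat=passphrase will prompt interactively):

cr0x@server:~$ sudo zfs create -o encryption=on -o keyformat=passphrase -o compression=lz4 tank/bench_enc

Run the same fio profiles against it and watch mpstat: compression and encryption draw from the same CPU budget, so headroom conclusions made with only one of them enabled don’t transfer.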

Next steps you can do this week

If you want compression results you can defend in a design review (and not regret in an incident review), do this:

  1. Create two test datasets identical except compression (lz4 vs zstd at a chosen level).
  2. Run three FIO profiles: VM mix (4K randrw), sync-heavy (8K + fsync), sequential ingest (1M writes).
  3. For each run, capture: p99/p99.9 latency, CPU (mpstat), vdev wait (zpool iostat), ARC miss% (arcstat), and physical vs logical writes.
  4. Decide per dataset class. Don’t standardize on one codec just because it wins one synthetic test.
  5. Roll out gradually with monitoring gates. If p99.9 moves the wrong way, roll back quickly and with no shame.

Compression isn’t a religion. It’s a trade. Measure the trade honestly and you’ll get the space savings and performance gains you were promised—without the placebo.
