ZFS Sequential Reads: Tuning for Maximum Streaming Throughput


Your ZFS pool “benchmarks fine” until the day you need to stream a few terabytes—media playout, backups, analytics scans, object-store rehydration—and throughput
collapses into a sad, sawtooth pattern. CPU looks bored. Disks look busy. Networking is blaming storage, storage is blaming networking, and someone suggests adding
more RAM like it’s holy water.

This is where sequential reads on ZFS get interesting: ZFS can be brutally fast at streaming, but it’s also a file system with strong opinions, multiple caches,
checksums everywhere, and a transaction model that occasionally turns “simple” into “complicated with style.” Let’s tune it like we actually run production systems.

What “sequential reads” means on ZFS (and what it doesn’t)

“Sequential read” is a workload description, not a guarantee. On spinning disks, sequential reads mean long contiguous runs on disk, minimal seeks, and big I/O.
On SSDs, it mostly means big I/O and high queue depth without tiny random reads polluting the pipeline.

On ZFS, sequential reads are shaped by three layers of reality:

  • Logical sequentiality: your application reads file offsets in order.
  • On-disk layout: blocks may not be physically contiguous due to copy-on-write and free-space fragmentation.
  • ZFS I/O behavior: recordsize, prefetch, checksum verification, compression, and caching decide how many and what kind of I/Os you actually issue.

You can have a perfectly sequential application read that still becomes “semi-random” on disk because the file was rewritten over months,
snapshots pinned old blocks, and free space turned into confetti. That’s not ZFS being bad; that’s ZFS being honest about physics.

Your goal for maximum streaming throughput is to make ZFS issue large reads, keep the queue deep enough to saturate the devices,
avoid needless cache churn, and ensure CPU isn’t the hidden choke point. The fastest pools aren’t the ones with the most tweaks; they’re the ones with the fewest surprises.

Facts & history that matter in practice

  • ZFS came out of Sun Microsystems in the mid-2000s with end-to-end checksums as a core feature. That checksum verification isn’t “overhead”; it’s the design.
  • The original ZFS pitch included “no fsck”—a big operational win. For streaming reads, it also means metadata integrity is guarded constantly, not occasionally.
  • Copy-on-write is why snapshots are cheap—and also why long-lived datasets can fragment. Streaming throughput often degrades over time if you rewrite big files in place.
  • “128K blocks” became culturally sticky because 128K recordsize works well for many sequential workloads. It’s not magical; it’s just a decent default.
  • L2ARC was introduced to extend cache to SSD when RAM was expensive. It helps random reads more than sustained streaming because streaming reads tend to be “read once.”
  • ZFS send/receive reshaped backup workflows by making replication block-aware. It’s also a sequential read workload generator that can expose pool bandwidth limits fast.
  • OpenZFS unified multiple forks so tuning guidance isn’t “Solaris-only folklore” anymore, but defaults still differ across platforms and versions.
  • Modern ZFS has special allocation classes (special vdev) to separate metadata/small blocks. That can indirectly boost streaming by reducing metadata contention.
  • Checksumming algorithms evolved (fletcher4, sha256, etc.). Stronger hashes cost CPU; fast hashes cost less. Streaming can become CPU-bound before you notice.

The streaming read data path: where throughput dies

If you want maximum throughput, you need to know which stage is limiting you. A sequential read on ZFS typically flows like this:

  1. Application issues reads (often 4K–1MB depending on the app/library).
  2. DMU maps file offsets to blocks based on recordsize and actual block sizes on disk.
  3. Prefetch decides whether to speculatively read ahead.
  4. ARC lookup hits or misses; misses schedule disk I/O.
  5. Vdev scheduler turns logical reads into physical reads across mirrors/RAIDZ, honoring queue limits.
  6. Device completes reads; data is checksum-verified, decompressed if needed, then placed into ARC and copied to userland.

Streaming throughput usually dies in one of four places:

  • Device bandwidth (you simply don’t have enough spindles, lanes, or SSD write/read bandwidth).
  • Too-small I/O size (lots of overhead per byte; you “win” on IOPS and lose on MB/s).
  • CPU bound on checksums/decompression (especially with fast NVMe pools).
  • Fragmentation and RAIDZ geometry (more I/O operations than expected, broken read coalescing, or parity math overhead).

Joke #1: If your “sequential reads” look random, congratulations—you’ve invented a new benchmark category: “interpretive storage.”

Fast diagnosis playbook (first/second/third)

First: prove what’s saturated

  • Check per-vdev throughput and latency with zpool iostat -v. If one vdev is pegged while others are idle, you have layout or queueing issues.
  • Check CPU (user/system, softirqs) during the stream. If CPU climbs with throughput capped, checksums/compression or IRQ handling may be limiting.
  • Check the network if this is remote. A “storage problem” that tops out at exactly 9.4 Gbps is often a networking problem wearing a storage hat.

Second: validate I/O size and prefetch behavior

  • Look at dataset recordsize and whether the files were written with it. Old files keep old block sizes.
  • Observe actual read sizes via zpool iostat and fio. If you see lots of 4K–16K reads, you’re paying overhead tax.
  • Confirm prefetch isn’t disabled (module params), and that it’s not being defeated by access patterns (e.g., many readers doing interleaved access).

Third: inspect pool health and “background” contention

  • Scrub/resilver in progress can cut streaming bandwidth hard. Same for heavy snapshot deletion (mass free).
  • Errors or slow devices cause retries and inflate latency; mirrors will often pick the “faster” side, but if both are struggling you’ll see it.
  • Space and fragmentation: a pool at 85–90% full is rarely a streaming champion. ZFS needs elbow room.

Practical tasks: commands, outputs, decisions

These are the tasks I actually run when someone says “ZFS sequential reads are slow.” Each includes: the command, what the output means, and what you decide next.
Assumptions: Linux + OpenZFS tooling; adapt paths and pool names.

Task 1: Identify pool topology and RAID level (you can’t tune what you don’t understand)

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:11:33 with 0 errors on Wed Dec 18 03:21:39 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

Meaning: You’re on a single RAIDZ2 vdev of six disks. Streaming reads will be limited by this vdev’s aggregate bandwidth and RAIDZ behavior.
One big vdev is one big “lane” from ZFS’s perspective.

Decision: If you need more throughput, you add vdevs (more lanes) or switch to mirrors for better parallelism. Tuning won’t create bandwidth out of air.
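
For scale-out reference, adding a vdev is a single command, but treat it as a design decision: the device names below are hypothetical, the new vdev starts empty (existing data is not rebalanced automatically), and a raidz vdev cannot be removed once added.

cr0x@server:~$ sudo zpool add tank raidz2 sdg sdh sdi sdj sdk sdl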

Task 2: See real-time read throughput, IOPS, and latency per vdev and disk

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                                              capacity     operations     bandwidth
pool                                        alloc   free   read  write   read  write
------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                        8.11T  2.77T    480      5   820M  1.20M
  raidz2-0                                  8.11T  2.77T    480      5   820M  1.20M
    sda                                         -      -     85      1   135M   256K
    sdb                                         -      -     76      1   130M   256K
    sdc                                         -      -     79      1   138M   256K
    sdd                                         -      -     80      1   140M   256K
    sde                                         -      -     78      1   139M   256K
    sdf                                         -      -     82      0   138M     0K
------------------------------------------  -----  -----  -----  -----  -----  -----

Meaning: You’re getting ~820 MB/s reads, evenly distributed. That’s plausibly near the sequential ceiling of these disks.
If you expected 2–3 GB/s, the topology is the explanation.

Decision: If the pool looks evenly loaded and bandwidth is “reasonable,” stop blaming ZFS and start capacity-planning. If one disk lags, investigate it.

Task 3: Confirm dataset settings that affect streaming reads

cr0x@server:~$ sudo zfs get -o name,property,value -s local,received recordsize,compression,atime,primarycache,secondarycache tank/media
NAME        PROPERTY        VALUE
tank/media  recordsize      1M
tank/media  compression     lz4
tank/media  atime           off
tank/media  primarycache    all
tank/media  secondarycache  all

Meaning: recordsize is 1M (good for big streaming files), compression is lz4 (usually good), atime is off (good; avoids write traffic on reads).

Decision: If recordsize is 128K but you stream huge files, consider 1M for that dataset before you ingest data. If you already have data, you’ll need a file-level rewrite to benefit.

Task 4: Verify what block sizes files actually have (not what you wish they had)

cr0x@server:~$ sudo zdb -ddddd tank/media | head -n 20
Dataset tank/media [ZPL], ID 52, cr_txg 11423, 2.31T, 98123 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        17    1   128K    1M   1.00M    512B  4.00M  100.00  ZFS plain file
        18    1   128K  128K  128.0K    512B  512.0K 100.00  ZFS plain file

Meaning: Some files are using 1M blocks, some are 128K. recordsize is an upper bound; existing data may be smaller due to how it was written.

Decision: If your key streaming objects aren’t using large blocks, plan a file-level rewrite: copy into a fresh dataset with the right recordsize (cp/rsync/tar) or re-ingest from the application. Note that zfs send|recv preserves existing block sizes, so it re-packs allocations but won’t re-block files.
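
A minimal rewrite sketch, assuming a hypothetical target dataset tank/media_1m and enough free space for a second copy; verify the result with the same zdb check afterwards:

cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=lz4 -o atime=off tank/media_1m
cr0x@server:~$ sudo rsync -aHAX --info=progress2 /tank/media/ /tank/media_1m/
cr0x@server:~$ sudo zdb -ddddd tank/media_1m | grep 'ZFS plain file' | head -n 5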

Task 5: Measure raw sequential read bandwidth at the file level (bypass “the app is weird”)

cr0x@server:~$ fio --name=seqread --filename=/tank/media/bigfile.bin --rw=read --bs=1M --ioengine=libaio --iodepth=32 --direct=1 --numjobs=1 --time_based --runtime=30
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
seqread: Laying out IO file (1 file / 0MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=1050MiB/s][r=1050 IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=1): err= 0: pid=21944: Thu Dec 26 11:08:14 2025
  read: IOPS=1048, BW=1048MiB/s (1099MB/s)(30.7GiB/30005msec)
    clat (usec): min=310, max=2200, avg=720.12, stdev=120.44

Meaning: With direct I/O, 1M blocks, and reasonable iodepth, you’re getting ~1.0 GiB/s. That’s close to your pool’s observed vdev throughput.

Decision: If fio is fast but the application is slow, the bottleneck is app behavior (small reads, single-threaded, sync points). If fio is slow, keep digging in storage.
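
To test the “app behavior” theory, a hedged counter-benchmark: rerun fio against the same hypothetical file with deliberately small, synchronous reads and compare.

cr0x@server:~$ fio --name=appread --filename=/tank/media/bigfile.bin --rw=read --bs=64k --numjobs=1 --direct=1 --time_based --runtime=30

If this lands near the application’s real-world number while the 1M run above is several times faster, the storage layer is fine and the reader is the bottleneck.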

Task 6: Verify ARC size and hit ratios (streaming is often miss-heavy, and that’s fine)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
11:09:01   820   790     96   790   96     0    0     0    0    42G   64G
11:09:02   835   810     97   810   97     0    0     0    0    42G   64G
11:09:03   810   785     96   785   96     0    0     0    0    42G   64G

Meaning: ARC miss rate is high during a stream. That’s normal: streaming is “read and move on.”
arcsz is 42G with target c=64G, so ARC is healthy.

Decision: Don’t panic and “increase ARC” just because misses are high. ARC helps repeated reads; streaming throughput is usually device-bound.

Task 7: Check whether prefetch is working or disabled at module level

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
0

Meaning: 0 means prefetch is enabled globally.

Decision: If it’s 1 and your workload is mostly sequential, set it back to 0 (and find who disabled it and why).
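
If it is 1, a minimal way to flip it back, live and persistently (parameter name per current OpenZFS on Linux; verify on your version before rolling it out):

cr0x@server:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
cr0x@server:~$ echo 'options zfs zfs_prefetch_disable=0' | sudo tee -a /etc/modprobe.d/zfs.conf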

Task 8: Check device-level saturation and queue behavior

cr0x@server:~$ iostat -x 1 3
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
sda             85.0   138240.0     0.0   0.00   8.40  1626.4     0.71  98.0
sdb             76.0   133120.0     0.0   0.00   8.10  1751.6     0.62  97.2
sdc             79.0   141312.0     0.0   0.00   8.55  1788.8     0.68  98.5
sdd             80.0   143360.0     0.0   0.00   8.60  1792.0     0.69  98.7
sde             78.0   142336.0     0.0   0.00   8.45  1824.8     0.66  98.3
sdf             82.0   141312.0     0.0   0.00   8.50  1723.3     0.70  98.6

Meaning: %util near 100% and ~8–9ms await indicates disks are saturated. rareq-sz is large (~1.6–1.8MB), which is good for streaming.

Decision: If devices are saturated, tuning above the vdev layer won’t help. Add spindles, add vdevs, or move to SSD/NVMe.

Task 9: Confirm you’re not fighting a scrub/resilver (silent throughput killers)

cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 10:41:12 2025
        2.11T scanned at 1.20G/s, 1.02T issued at 595M/s, 8.11T total
        0B repaired, 12.59% done, 03:56:18 to go

Meaning: Scrub is consuming ~595 MB/s of issued bandwidth. Your streaming reads are competing for the same disks.

Decision: If this is a production streaming window, pause/cancel scrub or schedule it properly. Scrubs are good; scrubs during peak are not.
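
If you need the bandwidth back right now, a scrub can be paused and resumed rather than thrown away:

cr0x@server:~$ sudo zpool scrub -p tank   # pause; run 'zpool scrub tank' later to resume
cr0x@server:~$ sudo zpool scrub -s tank   # or stop it entirely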

Task 10: Check ashift and sector alignment (bad alignment is forever)

cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree' -n | head -n 8
45:        vdev_tree:
61:            ashift: 12

Meaning: ashift=12 means 4K sectors, which is correct for most modern HDD/SSD. If you see ashift=9 on 4K drives, performance can suffer badly.

Decision: If ashift is wrong, the fix is rebuilding the vdev correctly. There is no “tune” that undoes misalignment pain.

Task 11: Evaluate fragmentation and space pressure (sequential reads hate a packed pool)

cr0x@server:~$ sudo zpool list -o name,size,alloc,free,capacity,fragmentation tank
NAME  SIZE  ALLOC  FREE  CAPACITY  FRAG
tank 10.9T  8.11T  2.77T      74%   33%

Meaning: 74% full is fine; 33% fragmentation is moderate. If capacity is 90% and frag is high, “sequential” will drift toward random.

Decision: If you’re above ~80–85% on a pool that must stream, plan expansion or migrate. If fragmentation is high, consider rewriting datasets or restoring from replication to re-pack blocks.

Task 12: Check per-dataset primarycache/secondarycache (stop caching what you won’t reread)

cr0x@server:~$ sudo zfs get -o name,property,value primarycache,secondarycache tank/backup
NAME        PROPERTY        VALUE
tank/backup primarycache    metadata
tank/backup secondarycache  none

Meaning: This dataset caches metadata only and avoids L2ARC. That’s often correct for one-pass backup scans and large sequential reads.

Decision: If streaming workloads are evicting useful cache for other services, set streaming datasets to primarycache=metadata. Don’t treat ARC like an all-you-can-eat buffet.
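
Applying the same policy to another streaming dataset (name hypothetical) is two property changes; it governs what gets cached from now on rather than evicting what is already in ARC:

cr0x@server:~$ sudo zfs set primarycache=metadata tank/restores
cr0x@server:~$ sudo zfs set secondarycache=none tank/restores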

Task 13: Validate that readahead isn’t sabotaging you at the Linux block layer

cr0x@server:~$ lsblk -o NAME,TYPE,RA,MOUNTPOINT | grep -E 'tank|zd|sd[a-z]'
sda   disk   256
sdb   disk   256
sdc   disk   256
sdd   disk   256
sde   disk   256
sdf   disk   256

Meaning: Readahead here is 256 (KiB units in many distros). ZFS does its own prefetch; Linux readahead typically matters more for non-ZFS.
Still, bizarre values (like multi-megabyte per disk) can cause wasted I/O.

Decision: If someone tuned readahead blindly and it’s massive, bring it back to sane defaults. Fix one layer at a time.

Task 14: Confirm your stream isn’t secretly a tiny-read workload

cr0x@server:~$ sudo zpool iostat tank 1 2
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        8.11T  2.77T   8500      3   420M  1.00M
----------  -----  -----  -----  -----  -----  -----
tank        8.11T  2.77T   8600      2   430M  1.10M
----------  -----  -----  -----  -----  -----  -----

Meaning: 8.5K read ops for ~420 MB/s implies average I/O size around ~50 KB, not great for “streaming.”

Decision: If your I/O size is small, fix the application read size, increase recordsize for future writes, and verify prefetch isn’t being disrupted by many interleaved readers.
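
A hedged one-liner to watch average read size live, assuming awk and the scripted/parsable zpool iostat flags (-H, -p); the first sample is a since-boot average, so let it run a few intervals:

cr0x@server:~$ sudo zpool iostat -Hp tank 1 | awk '$4 > 0 {printf "%.0f KiB avg read\n", $6/$4/1024}'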

Tuning levers that actually move throughput

1) recordsize: set it for the files you will store, not the ones you already stored

recordsize sets the maximum block size ZFS will use for files in a dataset. For sequential reads, bigger blocks generally mean fewer I/O operations,
fewer checksum computations per byte, fewer metadata lookups, and better device efficiency.

What to do:

  • Media, backups, large objects, analytics scans: recordsize=1M is often the right starting point.
  • Databases and VM images: don’t blindly crank recordsize. Those are usually mixed/random patterns; optimize separately.

The catch: recordsize doesn’t retroactively rewrite existing files. If your “streaming dataset” was created with 128K and populated for a year, changing to 1M today helps new writes only.
If you need the benefit now, you rewrite the data.
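
A minimal “set it before you ingest” sketch, using a hypothetical new dataset; setting the property at creation time avoids a window where early files land with the default:

cr0x@server:~$ sudo zfs create -o recordsize=1M tank/archive
cr0x@server:~$ sudo zfs get recordsize tank/archive
NAME          PROPERTY    VALUE  SOURCE
tank/archive  recordsize  1M     local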

2) atime: turn it off for streaming reads unless you enjoy unnecessary writes

atime updates on reads cause writes. Writes cause TXG work and potentially additional disk activity.
For a pure streaming read dataset, atime=off is a boring win.

3) primarycache / secondarycache: stop poisoning ARC with one-pass streams

ARC is precious because it’s in RAM and it’s hot. Streaming reads of large datasets can evict genuinely useful cache for other services (metadata, small hot files,
frequently-read config blobs, etc.).

For datasets that are mostly one-pass sequential reads (backup verification, archival rehydration, batch analytics), set:
primarycache=metadata. Often also set secondarycache=none if you have L2ARC and don’t want to burn it on junk.

This doesn’t necessarily increase the stream’s throughput; it prevents collateral damage. In real systems, that’s half the battle.

4) Compression: lz4 is usually “free,” until it isn’t

compression=lz4 often improves streaming read throughput because you read fewer bytes from disk and spend a bit of CPU decompressing.
That’s a great trade on HDD pools and many SSD pools.

But on fast NVMe, compression can move the bottleneck to CPU. You’ll see devices underutilized and CPU threads pegged doing decompression and checksums.
If your pipeline is CPU-bound, consider faster CPUs, verify you aren’t using an expensive checksum, and benchmark with compression on/off for your data.
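
A hedged A/B sketch on throwaway datasets (names hypothetical): copy a sample of your real data into each and read it back; use a sample larger than ARC, or the cache will decide the winner.

cr0x@server:~$ sudo zfs create -o compression=off -o recordsize=1M tank/bench_off
cr0x@server:~$ sudo zfs create -o compression=lz4 -o recordsize=1M tank/bench_lz4
cr0x@server:~$ sudo cp /tank/media/sample.bin /tank/bench_off/ && sudo cp /tank/media/sample.bin /tank/bench_lz4/
cr0x@server:~$ fio --name=ab --filename=/tank/bench_off/sample.bin --rw=read --bs=1M --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=30
cr0x@server:~$ fio --name=ab --filename=/tank/bench_lz4/sample.bin --rw=read --bs=1M --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=30

Run mpstat alongside: if the lz4 run caps lower while cores sit pegged, you have found your NVMe-era bottleneck.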

5) Don’t confuse “more ARC” with “more throughput”

ARC helps when you reread data. Streaming throughput is generally limited by the storage bandwidth and how effectively ZFS issues I/O.
Increasing ARC might help if your stream loops or if metadata caching is the hidden limiter, but for single-pass reads it often just changes who gets to keep RAM.

6) Parallelism: one vdev is one lane

ZFS pools scale throughput primarily by adding vdevs. Mirrors generally provide better read parallelism and simpler geometry.
RAIDZ can be bandwidth-efficient and capacity-efficient, but a single RAIDZ vdev is still a single vdev, and that caps concurrency.

7) Special vdev: metadata belongs on fast storage (sometimes)

If your “sequential read” job still trips over metadata (lots of files, lots of directory traversals, tiny sidecar files), moving metadata and small blocks to a special vdev
can reduce head-of-line blocking on HDDs. This is more about “time to first byte” and consistent streaming in file-heavy workloads than raw MB/s on one giant file.
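
A minimal sketch, with hypothetical NVMe device names: mirror the special vdev (losing it loses the pool), expect zpool to ask for -f if its replication level differs from your raidz data vdevs, and remember it only affects newly written metadata and small blocks.

cr0x@server:~$ sudo zpool add tank special mirror nvme0n1 nvme1n1
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/media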

Joke #2: L2ARC for streaming is like bringing a suitcase to a one-night trip—technically impressive, practically unnecessary.

Vdev layout and bandwidth math you can defend in a meeting

Sequential read tuning gets existential when people ask: “Why can’t this 12-disk pool do 4 GB/s?” Because topology matters more than optimism.

Mirrors: the boring high-throughput choice

Mirrors read from either side. With multiple mirror vdevs, ZFS can spread reads across vdevs and across disks. For large streaming reads, mirrors usually deliver:

  • high concurrency (many vdevs)
  • predictable performance under mixed load
  • simple rebuild behavior (resilver is straightforward)

The trade-off is capacity efficiency. Storage engineers can argue about it for hours, which is a recognized form of cardio in some organizations.

RAIDZ: capacity-efficient, but geometry-sensitive

RAIDZ can stream fast, especially with multiple RAIDZ vdevs, but there are two practical realities:

  • Single vdev bottleneck: One RAIDZ vdev is one vdev. If your pool has one RAIDZ2 vdev, you’re capped at that vdev’s performance envelope.
  • Small reads hurt more: RAIDZ has to reconstruct from parity for certain access patterns, and small blocks can create read amplification.

For maximum sequential throughput, prefer multiple vdevs of a size that matches your operational constraints. If you can’t add vdevs because the chassis is full,
your “tuning” options become mostly about not making it worse.

Wide vdevs: tempting on spreadsheets, messy in production

Very wide RAIDZ vdevs look great on capacity-per-dollar. They also increase rebuild times, increase exposure windows, and can create performance cliffs when a disk is slow.
If your core business is streaming throughput and predictable operations, wide vdevs are a negotiation with entropy.

Queue depth and concurrency: streaming is a pipeline, not a single request

Many applications read sequentially but single-threaded with small buffers. That can leave the pool underutilized because disks/SSDs want multiple outstanding I/Os.
When fio with iodepth=32 hits 1.5 GB/s but your application hits 400 MB/s, the application is throttling itself.

Fix it by increasing application read size, enabling async I/O, adding reader threads, or using tooling that reads large blocks efficiently.
ZFS can prefetch, but it’s not a replacement for an application that refuses to queue work.
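
A hedged way to see the effect is a queue-depth sweep with fio against the same hypothetical file, watching where throughput stops scaling:

cr0x@server:~$ for qd in 1 4 16 32; do echo "iodepth=$qd"; fio --name=qd --filename=/tank/media/bigfile.bin --rw=read --bs=1M --ioengine=libaio --iodepth=$qd --direct=1 --time_based --runtime=20 | grep READ:; done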

Compression, checksums, and CPU: friends until they aren’t

ZFS does end-to-end checksums. That’s not optional. The good news: checksum verification is usually fast enough on modern CPUs.
The bad news: “usually” ends right around the time you upgrade to fast NVMe and keep the same CPU from three years ago.

How to tell you’re CPU-bound on reads

  • zpool iostat shows devices not saturated but throughput is capped.
  • mpstat/top shows one or more CPU cores pegged in kernel/system time.
  • fio with direct I/O doesn’t increase throughput when you increase iodepth beyond a point.
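
A quick way to watch for it during a stream (mpstat ships with sysstat, perf with your kernel’s tools package):

cr0x@server:~$ mpstat -P ALL 1 5
cr0x@server:~$ sudo perf top

If a few cores sit near 100% in %sys while zpool iostat shows the vdevs loafing, and perf’s top entries are ZFS checksum or decompression symbols, the CPU is the ceiling.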

Compression trade-offs

lz4 is the default recommendation for good reason: low CPU cost, often better throughput, and it improves effective cache.
If your data is incompressible (already compressed video), lz4 won’t help much; it also generally won’t hurt much, but benchmark.

Stronger compression (zstd at high levels) can reduce disk I/O at the cost of CPU. That can help HDD pools for certain datasets, but for pure streaming throughput,
it’s easy to overdo it and move the bottleneck to CPU. “More compression” is not the same as “more performance.”

Checksum algorithm choice

Most deployments use fletcher4 (fast) or sha256 (stronger, more CPU). For read throughput, faster checksums can matter on NVMe-heavy systems.
If you’re in a regulated environment, your choice might be made for you. If not, choose pragmatically: keep integrity strong enough, keep performance predictable.

One quote that holds up in operations: “Hope is not a strategy.” — General Gordon R. Sullivan.
You don’t fix read throughput with hope; you fix it with measurements and controlled changes.

Prefetch, ARC, and why L2ARC won’t save your stream

Prefetch: what it does for sequential reads

ZFS prefetch attempts to detect sequential access and issue read-ahead I/O so the next blocks are already in flight.
For a single large file read sequentially, prefetch is typically beneficial. For many interleaved sequential readers, it can become noisy but still helpful.

The classic failure mode: someone disables prefetch to “fix random read latency” on a database dataset, then forgets it’s global or forgets they changed it,
and now media streams or backup restores are slower. This happens more often than anyone will admit.

ARC: what to cache for streaming systems

ARC is a combined data+metadata cache. For streaming workloads, caching the stream data itself usually provides little benefit unless you reread.
What does matter: metadata. If you have a workload that opens many files, does lots of stat calls, traverses directories, or reads sidecar indexes,
metadata caching can improve the “ramp-up” and smooth out performance.

L2ARC: when it helps

L2ARC is a second-level cache, usually on SSD. It helps when you have a working set bigger than RAM that gets reread.
Streaming reads of huge files are rarely reread soon enough to justify L2ARC; it can also add overhead because ZFS must manage L2ARC headers and feed the cache.

If your system is a multi-tenant file service and streaming jobs are evicting other tenants’ hot data, L2ARC can help those other tenants,
but it won’t necessarily make the streaming job faster. That’s a win, but be honest about which win you’re buying.

Scrub/resilver and other “background” jobs that aren’t

Streaming reads compete with anything that wants disk bandwidth: scrub, resilver, snapshot deletion, heavy metadata churn, and sometimes even SMART long tests
if your controller and drive firmware handle them poorly.

Operational rules that keep streaming predictable

  • Scrub scheduling: run scrubs when you can afford the bandwidth hit. If you can’t find a window, your capacity planning is lying to you.
  • Resilver posture: treat degraded mode as an incident. Streaming throughput during resilver is often reduced, and that’s expected.
  • Snapshot deletion: bulk snapshot destroys can cause intense free-space activity. Don’t do it at 10:00 on a Monday.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “It’s sequential, so it must be contiguous”

A media pipeline team had a ZFS-backed NAS serving large video files. For months it was fine: editors pulled footage, render nodes read clips,
and throughput stayed comfortably high. Then a new show landed and everything felt “sticky.” Playback stuttered. Batch transcodes ran overnight and still missed morning deadlines.

The assumption was simple: “The files are large and read sequentially, so disks are doing sequential reads.” The truth was uglier.
The dataset had been under constant churn: ingest, partial edits, re-exports, and frequent snapshotting for “safety.”
Files were rewritten in place at the application level, but ZFS copy-on-write turned each rewrite into new block allocations.

When we looked at zpool list, the pool was in the high 80% capacity range and fragmentation was high enough to matter.
zpool iostat showed lots of small-ish reads and higher latency than expected for the same disks.
The workload was sequential in the application’s mind but not physically sequential on disk.

The fix wasn’t a magical sysctl. They expanded the pool by adding vdevs, then copied the data into a fresh dataset with the intended recordsize.
That file-level rewrite re-blocked the files and effectively “defragmented” the data. Throughput returned. The lesson that stuck: sequential access patterns don’t guarantee sequential disk access,
especially after months of snapshots and rewrites.

Optimization that backfired: “Disable prefetch to reduce cache pollution”

A platform group ran a mixed fleet: databases, object storage, and a backup service on shared ZFS storage.
Someone had read that prefetch can hurt random read latency, especially for database workloads that do their own caching.
They disabled prefetch at the module level to “stabilize the DB.”

The DB team reported some improvement during peak. Everyone celebrated quietly and moved on.
Two weeks later, a restore test ran. Throughput was down sharply. Another team streaming analytics scans saw the same.
The symptoms were classic: the pool wasn’t issuing big, smooth read pipelines; it was lurching through demand reads.

The incident report was awkward because no one had connected the global nature of the prefetch setting to the variety of workloads.
The “optimization” solved one pain by injecting a different pain somewhere else.
We re-enabled prefetch and addressed DB behavior the right way: dataset-level choices, application caching settings, and isolating workloads where needed.

The broader moral: “global tuning” is a euphemism for “global blast radius.” When you change the file system’s behavior, you are changing it for everyone.

Boring but correct practice that saved the day: “We benchmarked the rebuild plan”

A storage refresh project migrated a backup/restore platform to new servers with faster NICs and more SSDs.
The team did something unfashionable: they wrote down expected throughput per vdev, per host, and per network path, then tested each layer independently with simple tools.
It wasn’t glamorous, but it was measurable.

During cutover week, one node underperformed. Panic tried to start a small fire: “ZFS is slow” and “maybe the new firmware is bad.”
But because baseline numbers existed, the team immediately compared the node’s zpool iostat -v against the others.
One SSD showed slightly higher latency and lower bandwidth. Not catastrophic, just enough to cap the node.

They swapped the suspect device, resilvered, and throughput matched the baseline again.
No heroic tuning. No late-night sysctl roulette. Just boring validation and a pre-agreed performance envelope.

The win wasn’t just speed; it was confidence. When you know what “normal” is, you can treat deviations as facts, not feelings.

Common mistakes: symptom → root cause → fix

1) Symptom: Throughput is capped far below disk specs, and iostat shows small average read sizes

Root cause: Application is reading in small chunks (4K–64K), defeating efficient streaming; or files are stored in small blocks due to recordsize at write time.

Fix: Increase application read size/queue depth; set recordsize=1M for the dataset and rewrite data to benefit; verify prefetch enabled.

2) Symptom: One disk in a vdev shows higher await and lower bandwidth, pool throughput sawtooths

Root cause: A slow or failing device, cabling/controller issues, or SMR drive behavior under sustained load.

Fix: Confirm with zpool iostat -v and iostat -x; replace the device; validate firmware/controller; avoid SMR for performance tiers.

3) Symptom: Streaming is fine until scrub starts, then everything drops

Root cause: Scrub competes for bandwidth; scheduling ignores peak usage windows.

Fix: Reschedule scrubs; consider per-pool policy; temporarily pause/cancel during critical streams; plan more headroom.

4) Symptom: ARC hit rate is low during stream, and someone calls it a “cache problem”

Root cause: Streaming is miss-heavy by nature; ARC isn’t failing.

Fix: Stop tuning based on miss% alone; focus on disk saturation and I/O size; protect ARC via primarycache=metadata where appropriate.

5) Symptom: NVMe pool not saturating devices, CPU is high, throughput plateaus

Root cause: CPU-bound on checksum verification and/or decompression; possibly single-threaded readers.

Fix: Benchmark with compression on/off; ensure enough parallelism (jobs/iodepth); consider faster CPU, NUMA pinning strategies at the application level, and avoid heavy compression levels.

6) Symptom: “We changed recordsize but nothing improved”

Root cause: Existing files retain their block sizes; recordsize isn’t retroactive.

Fix: Rewrite the data at the file level: copy into a fresh dataset with the right recordsize, re-ingest, or rewrite from the application. Validate with zdb.

7) Symptom: Pool got slower over months without hardware changes

Root cause: Fragmentation and high capacity due to CoW + snapshots + rewrite churn.

Fix: Keep pools below ~80–85% for performance tiers; add vdevs; migrate/replicate to re-pack; revisit snapshot retention.

Checklists / step-by-step plan

Step-by-step: tune a dataset for maximum streaming reads (new data)

  1. Create a dedicated dataset for streaming objects; don’t share with databases/VMs.
  2. Set recordsize (typically 1M for large files): zfs set recordsize=1M tank/media.
  3. Set atime off: zfs set atime=off tank/media.
  4. Use lz4 compression unless you’ve proven it hurts: zfs set compression=lz4 tank/media.
  5. Decide caching policy: for one-pass workloads, primarycache=metadata; for shared read-hot libraries, keep all.
  6. Ingest data once (avoid rewrite churn); validate block sizes with zdb.
  7. Benchmark with fio using realistic block sizes and iodepth; capture zpool iostat -v during the test.
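
Steps 1–5, plus a step-7-style benchmark, as one minimal shell sketch (pool/dataset names hypothetical; drop the primarycache override if the data will be reread):

cr0x@server:~$ sudo zfs create -o recordsize=1M -o atime=off -o compression=lz4 -o primarycache=metadata tank/stream
cr0x@server:~$ fio --name=verify --directory=/tank/stream --size=8G --rw=read --bs=1M --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=30

Keep sudo zpool iostat -v tank 1 running in a second terminal while fio runs, per step 7.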

Step-by-step: recover streaming throughput on an existing “slow” dataset

  1. Run the fast diagnosis playbook and determine if you’re device-bound, CPU-bound, or I/O-size-bound.
  2. Check pool capacity and fragmentation; if you’re too full, expand or migrate first.
  3. Confirm actual block sizes of the important files; don’t assume recordsize changed anything.
  4. Stop background contention during testing (scrub/resilver/snapshot destruction).
  5. Rewrite the dataset if block sizes/fragmentation are the issue: replicate to a fresh dataset to re-pack fragmented data, or copy at the file level if block sizes must change.
  6. Re-test with fio; compare per-vdev metrics and CPU utilization before/after.

Operational checklist: keep streaming performance from degrading

  • Keep performance pools below ~80–85% capacity if you care about consistent bandwidth.
  • Schedule scrubs outside streaming windows; treat scrub overlap as a planned performance reduction.
  • Avoid mixing workloads with incompatible tuning goals on the same dataset.
  • Baseline throughput after changes; keep one “known-good” benchmark procedure.
  • When changing global module parameters, document and review them like you would firewall rules.

FAQ

1) Should I set recordsize=1M everywhere to get better sequential reads?

No. Set it on datasets that store large, streaming-friendly files. Databases, VM images, and mixed random workloads often perform worse with oversized blocks.

2) Why didn’t changing recordsize improve my existing files?

Because recordsize doesn’t rewrite blocks. Existing files keep the block sizes they were written with. You need to rewrite the files (for example, a file-level copy into a dataset with the new recordsize) to benefit.

3) Is ZFS prefetch always good for sequential reads?

Usually yes for single sequential streams. It can be less effective with many interleaved readers. Disabling it globally is rarely a good idea unless you’ve proven a specific benefit.

4) Does L2ARC increase streaming throughput?

Typically not for one-pass streams. L2ARC helps when data is reread and the working set doesn’t fit in RAM. For streaming, it’s often overhead with little payoff.

5) Why is ARC miss% so high during streaming? Is that bad?

High miss% is normal for read-once workloads. What matters is whether disks/SSDs are saturated and whether read sizes are large enough.

6) RAIDZ or mirrors for maximum sequential reads?

Mirrors tend to deliver better parallelism and more predictable reads under mixed load. RAIDZ can stream well too, especially with multiple vdevs, but one wide RAIDZ vdev is a throughput limiter.

7) Can compression improve sequential read throughput?

Yes. lz4 often improves throughput by reducing bytes read from disk. On fast NVMe or CPU-constrained systems, compression can become the bottleneck; benchmark for your data.

8) How do I tell if I’m CPU-bound on ZFS reads?

If devices aren’t saturated, throughput plateaus, and CPU is busy (often in kernel/system time), you’re likely CPU-bound on checksum/decompression or limited by single-threaded reading.
Confirm with fio tests and system CPU metrics.

9) My pool is 90% full. Can tuning save streaming performance?

Tuning can reduce damage, but it won’t restore the laws of physics. High fullness increases allocation difficulty and fragmentation. Expand, add vdevs, or migrate.

10) Why does scrub impact my streaming reads so much?

Scrub is basically a structured, relentless read workload across the entire pool. It competes directly with your streaming reads for the same device bandwidth.

Next steps

If you want maximum streaming throughput on ZFS and you want it reliably—not just on benchmark day—do this in order:

  1. Measure the bottleneck: zpool iostat -v, iostat -x, CPU metrics, and a direct fio sequential read test.
  2. Fix the big levers: vdev count/topology, dataset recordsize for new data, application read size/queue depth.
  3. Protect the rest of the system: caching policies (primarycache=metadata), scrub scheduling, and avoiding global tuning with surprise blast radius.
  4. Rewrite when necessary: replication to a fresh dataset is often the cleanest “defrag,” and a file-level copy is what actually changes block sizes.

Then write down what “good” looks like for your pool: expected MB/s per vdev, typical latency, and what background activity is acceptable. Future-you will appreciate it,
even if present-you thinks documentation is what happens to other people.
