Prefetch is the kind of feature you only notice when it betrays you. Most days, it quietly turns sequential reads into smooth, efficient I/O. Then one day your ARC turns into a revolving door, your hit ratio tanks, latency spikes, and someone says the most dangerous sentence in storage: "But nothing changed."
This is a field guide to that betrayal: what ZFS prefetch actually does, how it interacts with ARC and L2ARC, how it can create cache thrash, and how to fix it without turning your storage into a science fair. I'll keep it grounded in what you can measure on production systems and what you can safely change when people are waiting.
Prefetch in plain English (and why it can hurt)
ZFS prefetch is an adaptive "read-ahead" mechanism. When ZFS sees you reading file data sequentially, it starts fetching the next chunks before you ask for them. That is the core idea: hide disk latency by keeping the next blocks ready in memory.
When it works, it's magic: streaming reads pull from ARC at memory speeds. When it fails, it becomes a particularly expensive form of optimism: it drags a lot of data into ARC that you never actually use, evicting the data you will use. That's cache thrash.
Cache thrash from prefetch tends to show up in three broad situations:
- Large "mostly-sequential" scans that don't repeat (analytics queries, backups, antivirus scans, media indexing, log shipping). Prefetch eagerly fills ARC with one-time data.
- Sequential reads across many files in parallel (multiple VMs, multiple workers, parallel consumers). Each stream looks sequential on its own, but collectively they become a firehose.
- Workloads that appear sequential but aren't beneficial to cache (e.g., reading huge files once, or reading compressed/encrypted blocks where CPU becomes the bottleneck and ARC residency just burns RAM).
Two jokes, as promised, and then we get serious:
Joke #1: Prefetch is like an intern who starts printing tomorrow's emails "to save time." Great initiative, catastrophic judgment.
How ZFS prefetch works under the hood
Let's map the territory. In OpenZFS, a normal read travels through the DMU (Data Management Unit), which finds the blocks (block pointers, indirect blocks, dnodes), then issues I/O for data blocks. ARC caches both metadata and data. ZFS prefetch adds an extra behavior: when ZFS detects a sequential access pattern, it will issue additional reads for future blocks.
ARC, MRU/MFU, and why prefetch changes eviction
ARC is not a simple LRU. It's adaptive: it balances between "recently used" (MRU) and "frequently used" (MFU). In a simplified mental model:
- MRU: "new stuff" you touched recently (often one-time reads).
- MFU: "hot stuff" you keep coming back to.
Prefetch tends to shove blocks into ARC that were never demanded by the application yet. Depending on implementation details and workload, those prefetched blocks can inflate the "recent" side, pushing out useful metadata and working-set data. The result is a worse hit ratio, more real disk reads, and sometimes a nasty feedback loop: more misses cause more reads, more reads create more prefetch, prefetch pushes out more cache, and so on.
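To make the eviction mechanics concrete, here is a deliberately tiny simulation. It uses a plain LRU, not real ARC, and every size in it is made up; the point is the mechanism, not the numbers. A small hot working set keeps hitting until a one-time scan's demand-plus-prefetch footprint outgrows what the cache can hold alongside it:

```python
from collections import OrderedDict

def simulate(prefetch_distance, rounds=100, capacity=100, hot=50, scan=40):
    """Toy LRU cache: a hot working set competes with a one-time sequential
    scan. Real ARC is adaptive (MRU/MFU), not plain LRU, but the pollution
    mechanism is the same: speculative blocks displace the working set."""
    cache = OrderedDict()

    def touch(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used
        return hit

    hot_hits = hot_refs = 0
    for r in range(rounds):
        for i in range(hot):               # latency-sensitive working set
            hot_refs += 1
            hot_hits += touch(f"hot-{i}")
        for i in range(scan):              # one-time sequential scan
            touch(f"scan-{r}-{i}")
            for d in range(1, prefetch_distance + 1):
                touch(f"scan-{r}-{i + d}")  # speculative read-ahead
    return hot_hits / hot_refs

print(f"hot-set hit ratio, prefetch off:        {simulate(0):.1%}")
print(f"hot-set hit ratio, aggressive prefetch: {simulate(30):.1%}")
```

With read-ahead disabled, the scan and the hot set coexist; with an aggressive read-ahead window, the same scan cycles the hot set out of cache every round.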
Prefetch is not the same as the OS page cache
On Linux, ZFS is outside the traditional page cache. That's a feature, not a bug: ZFS owns its cache logic. But it also means you can't assume the usual Linux readahead knobs will tell you the whole story. ZFS has its own prefetch behavior and tuning parameters, and the consequences land directly in ARC memory pressure, not "just" in VFS cache.
Metadata matters: prefetch can crowd out the boring stuff
The worst performance outages I've seen weren't caused by "data is slow," but by "metadata is missing." When ARC loses metadata (dnodes, indirect blocks), every operation turns into multiple I/Os: find the dnode, read indirect blocks, finally read the data. Prefetch that evicts metadata can turn "one read" into "a small parade of reads," and the parade is very polite but very slow.
L2ARC: not a get-out-of-jail-free card
L2ARC (SSD cache) is frequently misunderstood as "ARC but bigger." It isn't. L2ARC is populated by evictions from ARC; it has its own metadata overhead in RAM; and it historically didn't persist across reboots (newer OpenZFS supports persistent L2ARC, but it's still not magic). If prefetch is pushing junk into ARC, L2ARC can become a museum of things you once read but will never read again.
The knobs you'll hear about
Different platforms expose slightly different tunables (FreeBSD vs Linux/OpenZFS versions), but a few names show up repeatedly:
- zfs_prefetch_disable: the blunt instrument; disables prefetch entirely.
- zfetch_max_distance / zfetch_max_streams: the "how much" and "how many" style limits on prefetch behavior (names vary by version).
Don't memorize the names. Memorize the strategy: measure, confirm prefetch is the pressure source, then restrict it in the narrowest way that fixes the problem.
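The measurement step can be automated. On Linux the tunables live under /sys/module/zfs/parameters, and because the exact names differ across OpenZFS versions, it is safer to list what your kernel actually exposes than to hardcode one. The helper below is an illustrative sketch; the `root` argument defaults to the standard sysfs path but is overridable so it can be exercised anywhere:

```python
from pathlib import Path

def read_zfs_tunables(patterns=("zfs_prefetch", "zfetch"),
                      root="/sys/module/zfs/parameters"):
    """Collect current values of prefetch-related ZFS module parameters.
    Exact names vary across OpenZFS versions, so we match name prefixes
    instead of hardcoding a list. `root` is the usual Linux sysfs path,
    parameterized so the helper can be tested off-box."""
    values = {}
    base = Path(root)
    if not base.is_dir():
        return values  # module not loaded, or not Linux
    for param in sorted(base.iterdir()):
        if any(param.name.startswith(p) for p in patterns):
            values[param.name] = param.read_text().strip()
    return values

if __name__ == "__main__":
    for name, value in read_zfs_tunables().items():
        print(f"{name} = {value}")
```

Record the output before and after any change; a diff of this map is the minimum paper trail a prefetch experiment deserves.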
Why cache thrash happens: the failure modes
Failure mode 1: one-time sequential reads that look "helpful"
Backup jobs are the classic example. A backup reads entire datasets sequentially, usually once per day. Prefetch will happily stage ahead, ARC will happily fill, and your production working set will get evicted to make room for the least reusable bytes on the system.
Symptom pattern: backups start, ARC hit ratio drops, latency rises, and customer-facing queries slow down even though the disks aren't saturated on throughput; they're saturated on random IOPS because cache misses force extra seeks.
Failure mode 2: parallel sequential streams
One sequential stream is easy: it's basically a conveyor belt. Twenty sequential streams across twenty VMs is a traffic circle during a storm. Each stream triggers prefetch. The combined prefetch pulls in far more data than ARC can hold, and eviction becomes constant.
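A back-of-envelope way to reason about this: if prefetched blocks land in ARC once and are never reused, the combined stream throughput bounds how fast the entire cache gets replaced. All numbers below are hypothetical:

```python
def arc_turnover_seconds(streams, per_stream_mib_s, arc_gib):
    """If every prefetched block enters ARC once and is never reused,
    the combined inflow rate bounds how fast the whole cache is cycled."""
    inflow_mib_s = streams * per_stream_mib_s
    return arc_gib * 1024 / inflow_mib_s

# Hypothetical numbers: 20 sequential readers at 200 MiB/s each, 48 GiB ARC.
t = arc_turnover_seconds(streams=20, per_stream_mib_s=200, arc_gib=48)
print(f"worst case: ARC contents fully replaced every {t:.1f} seconds")
```

When the turnover time drops below the re-access interval of your hot working set, that working set can no longer stay resident, and the eviction rate looks constant because it is.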
Failure mode 3: prefetch competes with metadata and small hot blocks
ARC needs to hold the "index cards" of your filesystem: dnodes, indirect blocks, directory structures. Prefetch tends to inject large runs of file data. If ARC is small relative to working set, prefetch can tip the balance: you start missing metadata that used to be resident, and suddenly everything is slow, including operations that aren't part of the sequential scan.
Failure mode 4: L2ARC pollution and RAM overhead
L2ARC sounds like a cushion, but it costs RAM for its index and it's not free to populate. If prefetch is injecting throwaway data into ARC, you'll evict it to L2ARC, spending I/O and RAM to preserve something you'll never re-read. That's not caching; that's hoarding.
Failure mode 5: compression, encryption, and CPU bottlenecks
If reads are CPU-bound (decompression, decryption, checksumming), prefetch may still pull data into ARC faster than the CPU can consume it, or pull data that will be invalidated by other working-set needs. You'll see CPU pegged, ARC churning, and disks not obviously maxed out: an especially confusing triad in incident calls.
Facts & historical context
- ZFS was designed for end-to-end integrity: checksums, copy-on-write, self-healing. Performance features like prefetch were always layered atop a correctness-first design.
- ARC was created to outgrow simple LRU thinking: it adapts between recency and frequency, which is why it can survive mixed workloads, until you feed it too much "recency" via prefetch.
- Early ZFS deployments were often RAM-rich: prefetch behavior that was harmless with 256 GB RAM becomes dramatic on nodes squeezed into 64 GB footprints.
- L2ARC historically wasn't persistent: cold boots meant cold cache, which encouraged people to over-tune prefetch to "warm things up," sometimes creating self-inflicted churn.
- OpenZFS split and re-converged: Solaris lineage, illumos, FreeBSD, Linux; prefetch knobs and defaults evolved differently across ecosystems.
- "Recordsize" became a performance religion: tuning recordsize for workloads often dwarfs prefetch effects, but prefetch interacts with it; big records amplify the cost of speculative reads.
- SSDs changed the pain shape: prefetch was invented in a world where latency was expensive; with NVMe, the penalty of a miss may be smaller, but cache pollution can still murder tail latency.
- Virtualization multiplied sequential streams: a single storage pool now serves dozens of guests; prefetch algorithms that assume "a few streams" can misbehave under multi-tenant patterns.
- People love toggles: the existence of zfs_prefetch_disable is proof that enough operators hit real pain to justify a big red switch.
Three corporate-world mini-stories
Mini-story #1: An incident caused by a wrong assumption
The incident started as a "minor slowdown" ticket. A database-backed internal app, nothing glamorous, was suddenly timing out during peak hours. The storage graphs looked fine at first glance: bandwidth wasn't maxed, disks weren't screaming, and CPU wasn't pegged. Everyone's favorite conclusion appeared in chat: "It can't be storage."
But latency told a different story. 99th percentile read latency jumped, not average. That's the hallmark of caches lying: most requests are fine, a subset is falling off the cache cliff and taking the slow path. The wrong assumption was subtle: the team believed the nightly backup job was "sequential and therefore friendly."
In reality, the backup job ran late and overlapped with business hours. It read huge datasets in long sequential runs. Prefetch saw a perfect student (predictable reads) and aggressively staged ahead. ARC filled with backup-read data. Metadata and small, frequently accessed blocks for the app were evicted. The app didn't need high throughput; it needed low-latency hits on a small working set. The backup needed throughput but didn't need caching at all because it wasn't going to re-read the same blocks soon.
Fixing it wasn't heroic. The team first confirmed the ARC hit ratio collapse correlated with the backup window. Then they throttled the backup and adjusted scheduling. Finally, they constrained prefetch behavior for that node (not globally across the fleet) and resized ARC to protect metadata. The lesson wasn't "prefetch is bad." The lesson was "sequential doesn't always mean cacheable."
Mini-story #2: An optimization that backfired
A platform team wanted to speed up analytics jobs on a shared ZFS pool. They observed large table scans and concluded the best move was to "help" ZFS by increasing read parallelism: more workers, bigger chunks, more aggressive scanning. On paper, it looked like a throughput win.
It worked, on an empty cluster. In production, it collided with everything else. Each worker created a neat sequential stream, so prefetch kicked in for each. Suddenly the pool had a dozen prefetch streams competing for ARC. The ARC eviction rate rose until it looked like a slot machine. L2ARC write traffic increased too, turning the cache SSDs into unwilling journaling devices for speculative data.
And then came the punchline: the analytics queries got only slightly faster, but the rest of the company got slower. CI jobs timed out. VM boot storms took longer. Even directory listings and package installs felt "sticky" because metadata was missing from cache. The optimization was real but externalized the cost onto every neighbor.
The rollback was not "disable prefetch everywhere." The team reduced concurrency, introduced resource isolation (separate pool class / separate storage tier), and added guardrails: analytics nodes had their own tuning profile, including stricter prefetch limits. The backfire wasn't because the idea was stupid; it was because shared systems punish selfish throughput.
Mini-story #3: A boring but correct practice that saved the day
A storage incident once failed to become a disaster purely because someone was boring. The team had a routine: before any performance tuning, they captured a small bundle of evidence (ARC stats, ZFS I/O stats, pool health, and a 60-second snapshot of the latency distribution). It took five minutes and was considered "annoying process."
One afternoon, an engineer proposed flipping zfs_prefetch_disable=1 because "prefetch always hurts databases." That belief is common, occasionally true, and dangerously overgeneralized. The boring routine provided the counterexample: ARC was mostly metadata, hit ratio was high, and the dominant pain was actually synchronous writes from a misconfigured application. Reads weren't the bottleneck.
Instead of toggling prefetch and introducing a new variable, they fixed the write path (application fsync behavior, dataset sync settings aligned with business requirements). The outage ended quickly and predictably. A week later, they did a controlled experiment on prefetch and found it wasn't the villain on that system.
Operational moral: the "boring" habit of collecting the same baseline stats every time turns tuning from folklore into engineering. Also, it prevents you from winning the argument and losing the system.
Fast diagnosis playbook
This is the order I use when someone says "ZFS is slow" and the graphs are screaming in multiple colors.
First: confirm the symptom class (latency vs throughput vs CPU)
- Is it latency? Look at application p95/p99, then map to storage latency (device and ZFS layer). Cache thrash usually shows up as tail latency spikes.
- Is it throughput saturation? If you're maxing sequential bandwidth and everything is sequential, prefetch may be fine.
- Is it CPU? High CPU in kernel threads doing checksums/compression can make "storage" look slow.
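A toy percentile calculation shows why the cache cliff hides in averages. The latencies are synthetic, assuming roughly 0.1 ms ARC hits and 10 ms disk misses:

```python
import statistics

# Synthetic latencies: 95% ARC hits (~0.1 ms), 5% misses to disk (~10 ms).
latencies = [0.1] * 950 + [10.0] * 50

mean = statistics.mean(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
print(f"mean latency: {mean:.3f} ms")
print(f"p99 latency:  {p99:.3f} ms")
```

The mean stays under a millisecond while p99 sits at full disk latency, which is exactly why "the averages look fine" and "users are timing out" can both be true.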
Second: check ARC behavior in 60 seconds
- ARC hit ratio trend: did it fall off a cliff when the workload started?
- ARC size vs target: is ARC pinned at max with high eviction?
- Metadata vs data balance: is metadata getting crowded out?
Third: correlate with a sequential scanner
- Backups, scrubs, resilvers, replication reads, analytics scans, antivirus, search indexing, media transcodes.
- Look for a "job started" timestamp that matches ARC churn.
Fourth: validate whether prefetch is actually implicated
- Look for evidence of aggressive read-ahead (high read throughput with low cache reuse, high eviction).
- Try a narrow change: reduce prefetch distance/streams (or temporarily disable prefetch) on one host, during a controlled window, and measure.
Fifth: choose the least dangerous mitigation
- Throttle or reschedule the scanning workload.
- Reduce concurrency (especially parallel sequential readers).
- Constrain prefetch rather than kill it globally.
- Adjust ARC sizing to protect metadata if RAM allows.
Joke #2: Disabling prefetch in production because "it might help" is like fixing a rattle by removing the radio. The noise stops, sure, but you've learned nothing.
Practical tasks: commands, what they mean, what to do next
These tasks assume Linux with OpenZFS installed. Some commands require packages (e.g., zfsutils-linux) and root privileges. The goal is not to run everything always; it's to have a toolbox and know what each tool tells you.
Task 1: Identify pool and dataset basics
cr0x@server:~$ zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:33 with 0 errors on Sun Dec 22 03:14:10 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
errors: No known data errors
Interpretation: If the pool is degraded/resilvering/scrubbing, performance conclusions are suspect. Prefetch tuning won't fix a pool that's busy healing itself.
Task 2: See dataset properties that affect read patterns
cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,compression,primarycache,secondarycache,sync tank/data
NAME PROPERTY VALUE
tank/data recordsize 128K
tank/data compression lz4
tank/data primarycache all
tank/data secondarycache all
tank/data sync standard
Interpretation: recordsize and caching policies matter. Large recordsize + aggressive prefetch can mean big speculative reads. If primarycache=metadata, data won't stay in ARC regardless of prefetch.
Task 3: Check the prefetch knob state
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
Interpretation: 0 means prefetch is enabled. Do not change it yet; first prove it's the culprit.
Task 4: Capture ARC summary quickly
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:40:01 9234 1840 19 220 12 1620 88 95 5 48G 48G
12:40:02 10110 2912 28 240 8 2672 92 110 4 48G 48G
12:40:03 9988 3220 32 230 7 2990 93 105 3 48G 48G
12:40:04 9520 3011 31 210 7 2801 93 100 3 48G 48G
12:40:05 9655 3155 33 205 6 2950 94 99 3 48G 48G
Interpretation: Rising miss% during a scan is a classic thrash signal. If misses are mostly pmis (prefetch misses), that's suggestive, but not definitive. The key is trend and correlation with workload start.
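A quick way to summarize a capture like this is to total the read/miss/pmis columns and compute what share of the misses were speculative, treating pmis as prefetch-attributed misses as in the output above. The sample tuples are taken from that capture:

```python
def prefetch_miss_share(rows):
    """rows: (read, miss, pmis) samples from arcstat, treating pmis as
    misses attributed to prefetch. Returns (miss ratio, prefetch share)."""
    reads = sum(r for r, _, _ in rows)
    misses = sum(m for _, m, _ in rows)
    pmisses = sum(p for _, _, p in rows)
    return misses / reads, pmisses / misses

# The five samples captured above, as (read, miss, pmis):
samples = [(9234, 1840, 1620), (10110, 2912, 2672), (9988, 3220, 2990),
           (9520, 3011, 2801), (9655, 3155, 2950)]
miss_ratio, pf_share = prefetch_miss_share(samples)
print(f"overall miss ratio:       {miss_ratio:.1%}")
print(f"prefetch share of misses: {pf_share:.1%}")
```

In this sample roughly nine out of ten misses are speculative, which is the pattern that justifies a controlled prefetch experiment, not a fleet-wide toggle.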
Task 5: Check ARC size limits (Linux)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
51539607552
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_min
4294967296
Interpretation: ARC max ~48 GiB here. If ARC is small relative to workload, prefetch can churn it quickly. If ARC is huge and still thrashing, the workload is likely "no reuse" or too many streams.
Task 6: Watch ZFS I/O at pool level
cr0x@server:~$ zpool iostat -v 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 3.2T 10.8T 2400 110 780M 6.2M
mirror-0 3.2T 10.8T 2400 110 780M 6.2M
nvme0n1 - - 1200 60 390M 3.1M
nvme1n1 - - 1200 60 390M 3.1M
-------------------------- ----- ----- ----- ----- ----- -----
Interpretation: If read bandwidth is high during the slowdown but the application isn't benefiting, you may be reading ahead wastefully. Cross-check with ARC misses and app latency.
Task 7: Confirm whether a scrub/resilver is competing
cr0x@server:~$ zpool status | sed -n '1,25p'
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 25 12:10:01 2025
1.20T scanned at 6.9G/s, 210G issued at 1.2G/s, 3.20T total
0B repaired, 6.56% done, 00:38:21 to go
Interpretation: A scrub is a sequential reader too, and it can trigger prefetch and compete for ARC. If your thrash coincides with scrub, you have a scheduling problem first.
Task 8: Identify top readers (process-level)
cr0x@server:~$ sudo iotop -oP
Total DISK READ: 820.12 M/s | Total DISK WRITE: 6.41 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
18277 be/4 backup 612.55 M/s 0.00 B/s 0.00 % 98.00 % borg create ...
23110 be/4 postgres 84.12 M/s 2.40 M/s 0.00 % 12.00 % postgres: checkpointer
9442 be/4 root 32.90 M/s 0.00 B/s 0.00 % 4.00 % zfs send ...
Interpretation: If one process is a sequential hog, you can often fix the incident without touching prefetch: throttle it, move it, reschedule it.
Task 9: Verify whether prefetch is being disabled/enabled correctly (temporary)
cr0x@server:~$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
1
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
1
Interpretation: This is a runtime change for quick experiments. Treat it like a circuit breaker: log the time, measure for 5-15 minutes, and revert if it harms sequential throughput. Make it persistent only after you've proven the win and understood the tradeoff.
Task 10: Revert the change cleanly
cr0x@server:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
0
Interpretation: Always revert in the same incident window if the data doesn't support the change. Production tuning without rollback is gambling with better vocabulary.
Task 11: Check memory pressure that makes ARC eviction worse
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 128Gi 93Gi 2.1Gi 1.2Gi 33Gi 30Gi
Swap: 0B 0B 0B
Interpretation: If "available" is low, ARC may be fighting the rest of the system. Prefetch under memory pressure is more likely to evict useful pages quickly.
Task 12: Examine ARC memory breakdown (arcstats)
cr0x@server:~$ awk '$1 ~ /^(size|c|arc_meta_used|data_size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
size 51511234560
c 51539607552
arc_meta_used 8123456789
data_size 43200000000
Interpretation: If metadata usage collapses during a scan, you've found a likely mechanism for "everything gets slow." Protecting metadata (via ARC sizing and workload control) often restores tail latency.
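On Linux these counters come from /proc/spl/kstat/zfs/arcstats. A minimal sketch of parsing that file and computing the metadata share follows; field names and availability vary somewhat across OpenZFS versions, so treat the stat list as an assumption to verify on your release:

```python
def parse_arcstats(text):
    """Parse the /proc/spl/kstat/zfs/arcstats format: a header line,
    a column line, then 'name  type  value' rows."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats

# Embedded sample mirroring the numbers above; on a live box you would
# read the real file: open("/proc/spl/kstat/zfs/arcstats").read()
sample = """\
13 1 0x01 147 39984 8153649016 305760163432890
name                            type data
size                            4    51511234560
c                               4    51539607552
arc_meta_used                   4    8123456789
data_size                       4    43200000000
"""
s = parse_arcstats(sample)
meta_fraction = s["arc_meta_used"] / s["size"]
print(f"metadata: {meta_fraction:.1%} of ARC "
      f"({s['arc_meta_used'] / 2**30:.1f} GiB of {s['size'] / 2**30:.0f} GiB)")
```

Sampling this a few times per minute during an incident gives you the metadata-residency trend the interpretation above relies on.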
Task 13: Inspect L2ARC behavior (if present)
cr0x@server:~$ awk '$1 ~ /^l2_(hits|misses|size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
l2_hits 182334
l2_misses 992112
l2_size 214748364800
Interpretation: A giant L2ARC with a low hit rate during thrash may be polluted by one-time prefetched data. If L2ARC is missing too, it can't save you; it's just another slow tier behind ARC.
Task 14: Validate recordsize alignment for the workload (spot check)
cr0x@server:~$ zfs get recordsize tank/db tank/backup tank/vmstore
NAME PROPERTY VALUE SOURCE
tank/db recordsize 16K local
tank/backup recordsize 1M local
tank/vmstore recordsize 128K local
Interpretation: If your database is on 1M records, prefetch will speculatively pull big blocks that may include unused pages. If your backups are on tiny records, you'll do more I/O than needed. Get the basics right before blaming prefetch.
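The arithmetic behind that warning is simple. This sketch is illustrative only (prefetch normally won't trigger on truly random reads, but the record-granularity cost does), computing bytes read per useful byte:

```python
def read_amplification(request_kib, recordsize_kib, prefetch_records=0):
    """Bytes read per useful byte: a small read still pulls in the whole
    record, plus any speculatively prefetched records."""
    return (1 + prefetch_records) * recordsize_kib / request_kib

# 8 KiB database-page reads against different recordsizes:
for rs in (16, 128, 1024):
    base = read_amplification(8, rs)
    spec = read_amplification(8, rs, prefetch_records=2)
    print(f"recordsize {rs:>4}K: {base:>6.1f}x demand, "
          f"{spec:>6.1f}x if 2 records are prefetched")
```

An 8 KiB page read against a 1M record already costs 128x the useful bytes before any speculation; fix that mismatch first and prefetch often stops looking guilty.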
Task 15: Measure actual latency at the block device
cr0x@server:~$ iostat -x 1 5
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 1180 402000 0.0 0.00 4.10 340.7 60 3200 0.95 3.10 74.2
nvme1n1 1195 401000 0.0 0.00 4.05 335.6 62 3400 1.01 3.05 75.0
Interpretation: If device r_await jumps when ARC miss% jumps, you're actually going to disk. If device latency is stable but the app is slow, you might be CPU-bound or stuck on locks/queues above the device layer.
Task 16: Make prefetch setting persistent (only after proof)
cr0x@server:~$ echo "options zfs zfs_prefetch_disable=1" | sudo tee /etc/modprobe.d/zfs-prefetch.conf
options zfs zfs_prefetch_disable=1
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-40-generic
Interpretation: Persistent changes should be change-managed. If you can't articulate the downside (sequential reads slower) and the rollback path (remove file, rebuild initramfs, reboot), you're not ready to make it persistent.
Common mistakes, symptoms, and fixes
Mistake 1: "ARC hit ratio is low, so add L2ARC"
Symptom: You add a big SSD cache, hit ratio barely improves, and now the system has higher write amplification and less RAM available.
Why it happens: The workload has low reuse; prefetch is polluting ARC; L2ARC just stores the pollution with extra overhead.
Fix: First reduce speculative reads (prefetch limits or workload throttling). Validate reuse. Only then consider L2ARC, sized and monitored for real hit improvement.
Mistake 2: Disabling prefetch globally because "databases hate it"
Symptom: Reports and batch jobs get slower; backup windows extend; resilvers take longer; users complain about "slow exports."
Why it happens: Some workloads benefit enormously from prefetch. Global disable penalizes all sequential readers, including legitimate ones.
Fix: Prefer constraining the offending workload (scheduling/throttling) or using targeted tuning per host/pool role. If you must disable, document it and measure the impact on sequential tasks.
Mistake 3: Confusing "high bandwidth" with "good performance"
Symptom: Storage is reading 800 MB/s, yet the application is slow.
Why it happens: You're streaming data into ARC and immediately evicting it (or reading ahead beyond what's used). That's activity, not progress.
Fix: Check miss%, eviction, and application-level completion times. If app throughput doesn't scale with pool bandwidth, you're likely doing speculative reads or suffering metadata misses.
Mistake 4: Ignoring metadata residency
Symptom: "Everything" slows down during a scan: listings, small reads, VM responsiveness.
Why it happens: Metadata got evicted; operations now need extra reads to find data.
Fix: Ensure ARC is sized appropriately; avoid running cache-polluting scans during peak; consider caching policy changes (primarycache) only when you understand the consequences.
Mistake 5: Tuning recordsize blindly
Symptom: Random read workloads show high latency; sequential scans show huge I/O; prefetch seems "worse than usual."
Why it happens: Oversized records inflate read amplification; prefetch pulls in large blocks that contain little useful data for small reads.
Fix: Align recordsize to workload (e.g., smaller for databases, larger for backups/media). Re-test prefetch behavior after recordsize is sane.
Mistake 6: Making multiple changes at once
Symptom: After "tuning," performance changes but no one can explain why; later regressions are impossible to debug.
Fix: Change one variable, measure, keep a rollback. Prefetch tuning is sensitive to workload mix; you need attribution.
Checklists / step-by-step plan
Step-by-step plan: prove (or disprove) prefetch thrash
- Mark the incident window. Write down when the slowdown started and what jobs were running.
- Check pool health. If scrub/resilver is active, note it.
- Collect ARC stats for 2-5 minutes. Use arcstat 1 and capture the miss% trend.
- Collect pool I/O stats. Use zpool iostat -v 1 to see read load.
- Find top readers. Use iotop (or your platform's equivalent).
- Correlate. Did ARC misses spike when the sequential reader started?
- Choose the least invasive mitigation. Throttle/reschedule the reader first.
- If you must tune: Temporarily disable prefetch (or reduce its scope) and measure for a short window.
- Decide. If latency improves materially and sequential tasks remain acceptable, consider a persistent change with documentation.
- After action: Re-run the workload in a controlled window and validate the fix under realistic concurrency.
Checklist: safe production tuning hygiene
- Have a rollback command ready before applying the change.
- Change one thing at a time.
- Measure both system metrics (ARC/pool) and workload metrics (job runtime, query latency).
- Prefer per-role tuning (backup nodes vs latency-sensitive nodes) over fleet-wide toggles.
- Write down the "why," not just the "what." Future you will not remember the panic context.
FAQ
1) What exactly is ZFS prefetch?
It's ZFS's adaptive read-ahead. When ZFS detects sequential access, it issues additional reads for future blocks so they're in ARC by the time the application requests them.
2) Is prefetch the same as Linux readahead?
No. Linux readahead is tied to the page cache and block layer. ZFS uses ARC and its own I/O pipeline. You can have "reasonable" Linux readahead and still have ZFS prefetch causing ARC churn.
3) When should I disable prefetch?
When you have strong evidence it's polluting ARC and harming latency-sensitive workloads, typically during large, one-time sequential scans on a shared system with limited ARC headroom. Disable as a controlled experiment first.
4) When is disabling prefetch a bad idea?
If your environment depends on high-throughput sequential reads (backups, media streaming, large file reads, replication send streams) and you don't have other ways to schedule/throttle them. Disabling can turn smooth, cache-fed streaming into latency-bound per-block disk reads.
5) How do I know itâs prefetch thrash and not just ânot enough RAMâ?
Not enough RAM is often the underlying condition, but prefetch thrash is the trigger. You'll see ARC pinned near max, miss% rising during a sequential job, increased disk reads, and degraded tail latency for unrelated workloads. If the same memory footprint behaves fine without the scan, prefetch/speculative reads are part of the mechanism.
6) Does L2ARC fix prefetch thrash?
Usually not. L2ARC is fed by ARC evictions; if ARC is full of one-time prefetched blocks, L2ARC will store those too, with added RAM overhead. L2ARC helps when there's real reuse that doesn't fit in ARC.
7) Can I tune prefetch without disabling it?
Often yes, depending on OpenZFS version and platform. You can limit how far ahead ZFS reads or how many streams it tracks. The exact parameters vary, which is why it's important to inspect what your system exposes under /sys/module/zfs/parameters (Linux) or loader/sysctl interfaces (FreeBSD).
8) Why does prefetch sometimes reduce performance even on fast NVMe?
Because the cost isn't only device latency. Cache pollution evicts metadata and hot blocks, creating more misses and more I/O. Even if NVMe is fast, increased tail latency and extra I/O work can still hurt application response times.
9) Whatâs the relationship between recordsize and prefetch?
Prefetch reads ahead in units that ultimately map to on-disk blocks. Larger records can increase the amount of data pulled in speculatively. If your app reads 8-16K randomly and your recordsize is 1M, prefetch and read amplification become very expensive.
10) If I disable prefetch, will ARC stop caching reads?
No. ARC will still cache demand reads (what the application actually asked for). Disabling prefetch mainly reduces speculative reads ahead of demand.
Conclusion
ZFS prefetch is neither hero nor villain. It's an aggressive helper that assumes tomorrow looks like today: sequential reads will continue, and caching ahead will pay off. In stable streaming workloads, that assumption is correct and performance is excellent. In mixed, multi-tenant, scan-heavy corporate reality, that assumption breaks, and the bill comes due as ARC thrash and tail latency.
The winning move is not "flip the magic bit." The winning move is disciplined diagnosis: prove the correlation, identify the scanner, protect metadata and hot data, and apply the narrowest fix that restores predictability. When you treat prefetch as a hypothesis instead of a superstition, ZFS gets boring again. And boring storage is the best storage.