You’re on call. The database is slow. Someone posts a screenshot: “ARC hit ratio dropped to 72%!” and a dozen people decide that’s The Problem.
You can almost hear the budget request for more RAM drafting itself.
Sometimes they’re right. Often they’re not. ZFS cache hit rates are like a weather forecast: useful for planning, useless for blaming.
This is the field guide for knowing which is which—without turning your storage into a science fair.
What ZFS cache hit rates actually measure
ZFS has a primary in-memory cache called the ARC (Adaptive Replacement Cache). It also has an optional secondary cache on fast devices called the
L2ARC. Both caches track “hits” and “misses,” and tools happily compute a hit ratio that looks like a KPI.
Here’s the catch: a “hit” is not the same thing as “fast” and a “miss” is not the same thing as “slow.” A hit rate is a statement about where
data was served from, not whether the system met latency targets, not whether the workload is healthy, and not whether your pool is saturated.
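If you want to see where that KPI-looking number comes from on Linux, the raw counters live in /proc/spl/kstat/zfs/arcstats. A minimal sketch; note the counters are cumulative since the zfs module loaded, so this is a lifetime average, not "right now":
cr0x@server:~$ awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "overall ARC hit ratio: %.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats
That one number tells you where reads were served from. It says nothing about how long anything took.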
ARC is not a single bucket
ARC holds both data (file blocks) and metadata (dnode, indirect blocks, directory entries, etc.). Many workloads
live or die on metadata caching. For example, a filesystem with millions of small files can feel “fast” primarily because metadata stays hot,
even if data blocks are cold.
ARC also keeps multiple “lists” internally (MRU/MFU and their “ghost” variants) to balance recency vs frequency. That matters because a hit can be
“we saw this recently” or “we see this constantly,” which have different tuning implications.
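On Linux you can watch how hits split across those lists; these counters exist in arcstats on OpenZFS, though exact names can shift between versions:
cr0x@server:~$ grep -E '^(mru_hits|mfu_hits|mru_ghost_hits|mfu_ghost_hits) ' /proc/spl/kstat/zfs/arcstats
A steady stream of ghost hits means ARC keeps evicting blocks it soon wants back—a sign the working set doesn't fit, which is far more actionable than the headline ratio.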
L2ARC is not a magic RAM expansion
L2ARC stores blocks on SSD/NVMe for reuse. It sounds like “RAM but cheaper.” In practice, it’s “SSD cache with bookkeeping overhead.” L2ARC
needs ARC to index it. If you’re short on RAM, adding L2ARC can paradoxically make you more RAM-starved.
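You can put a number on that bookkeeping on Linux: l2_hdr_size in arcstats is the RAM the ARC spends indexing the L2ARC, in bytes.
cr0x@server:~$ grep -E '^(l2_hdr_size|l2_size|l2_asize) ' /proc/spl/kstat/zfs/arcstats
As a rough, version-dependent rule of thumb, each cached L2ARC block costs somewhere in the neighborhood of 70–100 bytes of ARC. An 800 GiB cache full of 8K blocks is roughly 100 million blocks—several GiB of RAM spent on the index alone—while the same device full of 1M blocks is a rounding error. Small-block workloads pay the tax.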
Hits are counted per request, not per business outcome
ZFS can satisfy a request from ARC, L2ARC, or disk. It can also read ahead (prefetch), compress, deduplicate (if you enabled that
particular footgun), and coalesce I/O. Your application experiences end-to-end latency shaped by CPU, locks, I/O queues, txg commits, and
sync behavior. Hit rate is one lens, not the truth.
A practical mantra: performance is latency under load, not cache percentages.
One quote worth keeping on a sticky note: “Hope is not a strategy.”
— General Gordon R. Sullivan.
Cache hit ratios are where hope goes to look like math.
When hit rates matter (and are predictive)
1) Read-heavy workloads with stable working sets
If your workload repeatedly reads the same blocks—VM boot storms, CI artifact reuse, web asset serving, analytics with repeated scans over the
same “hot” subset—then ARC hit rate correlates strongly with disk read pressure and latency.
In this world, improving hit rate (more RAM, better recordsize alignment, reducing churn) is often a direct performance win.
You can reason: fewer disk reads → lower queue depth → better latency.
2) Metadata-dominated workloads
List directories with millions of entries. Traverse deep trees. Build container layers. Run git operations on large monorepos.
These workloads benefit from ARC metadata hits even when data hits stay mediocre.
Here, the useful metric isn’t “overall hit rate,” it’s “are metadata misses causing synchronous random reads?” If yes, ARC sizing and
metadata placement (special vdev) can be transformative.
3) Pools where disks are the bottleneck
If your pool is rotational (HDD) or your SSDs are already at high utilization, ARC misses hurt because disk service time is your limiting factor.
Hit rates matter when the alternative is slow.
4) You’re evaluating an L2ARC or special vdev investment
Hit rates alone aren’t enough, but they’re part of the cost model:
if your ARC miss stream is mostly random reads and your working set slightly exceeds RAM, L2ARC may help.
If misses are sequential reads, backups, or streaming, L2ARC is mostly expensive heat.
Joke #1: L2ARC is like an intern with a forklift—occasionally brilliant, occasionally expensive, always requires supervision.
5) You’re diagnosing a regression and hit rates changed with it
If an upgrade, config change, or dataset property flip coincides with a sharp change in hit rates, that can be a strong lead. It’s not proof.
It’s a “follow the smoke” signal.
When hit rates don’t matter (and will mislead you)
1) Write-heavy workloads
ARC is not a write cache in the way people mean it in meeting rooms. Yes, ZFS uses memory for dirty data and transaction groups, and it uses
caches for reads that often follow writes. But if your pain is write latency or fsync storms, hit rates are a side show.
For writes, the important story is: sync vs async, SLOG (if present), pool latency, and txg behavior. A beautiful ARC hit rate will not save you
from a saturated pool doing small sync writes.
2) Streaming reads and one-pass scans
Backups, large file copies, media pipelines, log reprocessing, cold analytics scans: they read a ton once and move on.
A low hit rate is expected and healthy. ZFS prefetch may make it look like cache “works” (because it stages blocks in ARC briefly),
but it doesn’t change the physics: you are fundamentally bound by throughput.
3) When the bottleneck is CPU or lock contention
Decompression, checksumming, encryption, dedup tables, pathological metadata churn, or a single-threaded application can bottleneck on CPU.
In that case ARC hit rate can be great while the system still lags, because the time is spent above the I/O layer.
4) When your pool is already fast enough
If you’re on modern NVMe and your latency targets are met, you can tolerate misses. “More hits” becomes a vanity metric. You don’t get a prize
for 99% ARC hits if the service is already fast and stable.
5) When hit rate is inflated by the wrong thing
ARC can look “amazing” because it’s caching data you don’t care about: sequential prefetch, transient blocks, or warmed data from a benchmark
you ran once. Meanwhile, your real workload is missing on metadata and suffering.
If you must remember one rule: don’t tune based on a single aggregate hit ratio. Break it down: metadata vs data, demand vs prefetch,
latency vs throughput, sync vs async. Otherwise you’re just tuning your feelings.
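Here is a minimal breakdown sketch using the raw Linux arcstats counters; the math is just hits / (hits + misses) per bucket, and if a bucket is zero (say, prefetch disabled) awk will grumble about dividing by zero—it's a sketch, not a monitoring agent:
cr0x@server:~$ awk '
  /^demand_data_hits /         {ddh=$3}
  /^demand_data_misses /       {ddm=$3}
  /^demand_metadata_hits /     {dmh=$3}
  /^demand_metadata_misses /   {dmm=$3}
  /^prefetch_data_hits /       {pdh=$3}
  /^prefetch_data_misses /     {pdm=$3}
  /^prefetch_metadata_hits /   {pmh=$3}
  /^prefetch_metadata_misses / {pmm=$3}
  END {
    printf "demand data     : %5.1f%% hit\n", 100*ddh/(ddh+ddm)
    printf "demand metadata : %5.1f%% hit\n", 100*dmh/(dmh+dmm)
    printf "prefetch data   : %5.1f%% hit\n", 100*pdh/(pdh+pdm)
    printf "prefetch meta   : %5.1f%% hit\n", 100*pmh/(pmh+pmm)
  }' /proc/spl/kstat/zfs/arcstats
A 95% overall ratio can hide an ugly demand-metadata number, and demand metadata is usually the one users feel.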
ARC vs L2ARC: the practical differences
ARC: fastest, simplest, and still easy to mess up
ARC lives in RAM. It is extremely fast, and it can serve reads without touching disks. But ARC shares RAM with everything else:
applications, page cache (on some platforms), kernel structures, and ZFS metadata.
If ARC is too small relative to your working set, you thrash: lots of misses, frequent evictions, and disk reads spike.
If ARC is too large, you starve applications and the OS. That can look like “storage is slow” when it’s actually reclaim pressure.
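Capping ARC on Linux is a module parameter, not a pool or dataset property. A minimal sketch, assuming OpenZFS on Linux and the usual /etc/modprobe.d location; the 64 GiB value is an example, not a recommendation:
cr0x@server:~$ echo $((64 * 1024**3)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=$((64 * 1024**3))" | sudo tee -a /etc/modprobe.d/zfs.conf
The first line takes effect at runtime (a full ARC may take a while to shrink down to the new cap); the second persists it across reboots, and some distros also want an initramfs rebuild afterwards.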
L2ARC: sometimes great, sometimes a tax
L2ARC can be a win when:
- Your workload is read-heavy and reuses blocks.
- Your working set is larger than RAM but not absurdly larger.
- Your pool is slower than your cache device (HDD pool, or busy SSD pool).
- You have enough RAM to hold L2ARC headers and not drown.
L2ARC can be a loss when:
- Your workload is mostly streaming reads (cache churn).
- Your cache device is not consistently low-latency under write load.
- You’re already memory constrained.
- You expect it to accelerate sync writes (it doesn’t).
Why hit rates differ between ARC and L2ARC
ARC hit rate measures requests served from RAM. L2ARC hit rate measures requests served from the cache device.
A “good” L2ARC hit rate depends on what you’re trying to accomplish:
sometimes even 10–20% L2ARC hits can be meaningful if those hits are expensive random reads that would otherwise go to HDD.
Also: L2ARC is filled asynchronously and historically was not persistent across reboots. On many modern OpenZFS systems,
persistent L2ARC exists, but operational reality still matters: warm-up time, eviction behavior, and device wear.
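If you're unsure whether your build has it, the knob is a module parameter on OpenZFS 2.x (if the file doesn't exist, your version predates persistent L2ARC):
cr0x@server:~$ cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
A value of 1 means the on-device cache contents are re-indexed at pool import instead of starting cold.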
Interesting facts and historical context
- ARC was a headline feature of Sun’s ZFS design: it replaced the traditional “buffer cache vs page cache” split with one adaptive cache.
- The “ghost lists” idea in ARC (remembering recently evicted blocks) came from academic caching research and helps avoid thrashing.
- L2ARC arrived later as flash became viable; early SSDs were fast but fragile, which shaped conservative cache write behavior.
- Historically, L2ARC was not persistent across reboot: caches started cold, which mattered for “Monday morning login storms.”
- ARC stats became a culture: admins compared hit rates like gamers compare FPS, even when latency graphs were the real game.
- Special vdevs (metadata/small blocks on fast devices) changed the caching conversation by making “misses” less painful.
- Compression changed cache math: ARC stores compressed blocks in many configurations, effectively increasing cache capacity per GiB of RAM.
- Prefetch behavior has evolved: what looked like “cache pollution” in one release could be smarter readahead in another.
- Dedup’s reputation as a RAM-eater is earned: the dedup table can dominate memory needs and distort everything you think you know about ARC.
Fast diagnosis playbook
The goal is not to admire metrics. The goal is to identify the bottleneck in under 15 minutes, pick the next test, and avoid cargo-cult tuning.
First: is it IOPS/latency or throughput?
- If users complain about “spiky slowness” and timeouts: suspect latency/IOPS.
- If transfers are just slower end-to-end: suspect throughput limits, CPU, or network.
Second: is it reads or writes, sync or async?
- High read IOPS + high disk await: cache might matter.
- High write IOPS with sync: SLOG/pool latency matters more than ARC hit rates.
- High writes with large blocks: look at vdev throughput and fragmentation.
Third: is ARC under pressure or behaving normally?
- ARC size pinned and eviction high: you may be thrashing.
- ARC size stable, misses stable, latency still bad: bottleneck is elsewhere.
Fourth: is the pool saturated or unhealthy?
- Check zpool iostat -v for vdev imbalance, high queueing, and slow devices.
- Check for errors, resilvering, scrub, or a dying disk dragging the vdev.
Fifth: validate with one targeted experiment
- Drop caches? Usually a bad idea in production, but you can run a controlled read test on a non-critical dataset.
- Change one property (e.g., primarycache=metadata on a backup dataset) and watch ARC behavior.
- Use fio or application-level metrics to confirm improvement.
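A minimal fio probe for that last step—the path, size, and job count below are placeholders; point it at a non-critical dataset and read the latency percentiles, not the hit ratio:
cr0x@server:~$ fio --name=arc-probe --directory=/tank/scratch --rw=randread --bs=8k \
    --size=4g --numjobs=8 --ioengine=psync --runtime=60 --time_based --group_reporting
Run it once cold and once warm; the gap between the two latency distributions is what ARC is actually worth for that access pattern.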
Practical tasks: commands, outputs, decisions
These are the workhorse checks I expect an SRE to run before proposing hardware, and before touching tunables with a long stick.
Each task includes: command, sample output, what it means, and the decision you make.
Task 1: Confirm pool health and ongoing maintenance work
cr0x@server:~$ sudo zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:14:33 with 0 errors on Tue Dec 24 03:10:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
errors: No known data errors
Meaning: If you’re resilvering, scrubbing, or degraded, performance can tank regardless of cache hit rate.
Decision: If scan is running during peak, reschedule. If degraded, fix hardware first; do not “tune ARC” to compensate.
Task 2: Identify read vs write load and latency at the pool level
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 4.22T 7.58T 1.20K 3.40K 140M 220M
raidz2-0 4.22T 7.58T 1.20K 3.40K 140M 220M
sda - - 280 860 35.0M 56.0M
sdb - - 300 820 34.5M 55.0M
sdc - - 310 870 36.0M 56.5M
sdd - - 320 850 34.5M 55.0M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: This shows per-vdev and per-disk distribution. Large imbalance suggests a slow disk or queueing.
Decision: If one disk is lagging, investigate it. If the whole vdev is saturated, cache may help reads—but only if misses are random and reusable.
Task 3: Check dataset properties that directly change caching behavior
cr0x@server:~$ zfs get -o name,property,value,source recordsize,primarycache,secondarycache,compression,sync tank/data
NAME PROPERTY VALUE SOURCE
tank/data recordsize 128K local
tank/data primarycache all default
tank/data secondarycache all default
tank/data compression lz4 local
tank/data sync standard default
Meaning: primarycache/secondarycache decide what can enter ARC/L2ARC. recordsize affects I/O shape and cache efficiency.
Decision: For backup/streaming datasets, consider primarycache=metadata to prevent cache pollution. For DBs, evaluate recordsize and sync separately.
Task 4: Inspect ARC size and target
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:01 22K 1.8K 8 980 4 740 3 120 1 64G 64G
12:01:02 21K 2.0K 9 1.1K 5 760 3 140 1 64G 64G
12:01:03 20K 1.9K 9 1.0K 5 780 4 120 1 64G 64G
Meaning: arcsz is the current ARC size; c is the target size. miss% is the overall miss rate; demand misses (dmis/dm%), prefetch misses (pmis/pm%), and metadata misses (mmis/mm%) are broken out separately.
Decision: If arcsz pinned at c and misses high with rising disk read latency, you may be under-cached. If misses are low, cache isn’t your bottleneck.
Task 5: Separate demand vs prefetch to detect cache pollution
cr0x@server:~$ arcstat -f time,read,miss,miss%,pmis,pm%,arcsz,c 1 3
time read miss miss% pmis pm% arcsz c
12:02:10 18K 2.2K 12 1.9K 10 64G 64G
12:02:11 19K 2.1K 11 1.8K 10 64G 64G
12:02:12 18K 2.3K 12 2.0K 11 64G 64G
Meaning: A high prefetch miss rate (pm%) during streaming reads is normal; heavy prefetch activity can also evict useful data from ARC.
Decision: If a backup job creates huge prefetch activity and interactive latency spikes, isolate it (separate dataset, limit IO, set caching properties).
Task 6: Check L2ARC presence and effectiveness
cr0x@server:~$ arcstat -f time,l2read,l2miss,l2hit%,l2asize,l2size 1 3
time l2read l2miss l2hit% l2asize l2size
12:03:01 6K 4K 33 420G 800G
12:03:02 7K 5K 29 421G 800G
12:03:03 6K 4K 33 421G 800G
Meaning: L2ARC hit% can look “meh” and still be valuable if it saves slow disk reads. Also watch how full it actually is (l2asize).
Decision: If L2ARC is barely used or hit% is low and your misses are streaming, remove it or repurpose the device. If it’s helping random reads, keep it.
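Removing a cache device is online and non-destructive to data; the device name is whatever zpool status lists under cache (nvme3n1 below is a placeholder):
cr0x@server:~$ sudo zpool remove tank nvme3n1
The ARC also releases the headers it kept for those blocks, which is a small but real win on a memory-pressured host.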
Task 7: Verify memory pressure before blaming ARC size
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256Gi 210Gi 3.1Gi 2.0Gi 43Gi 20Gi
Swap: 16Gi 9.5Gi 6.5Gi
Meaning: Low “available” and active swap use means you’re memory pressured. ARC might be competing with apps.
Decision: If swapping, fix memory pressure first (reduce workloads, add RAM, cap ARC) before chasing hit ratios.
Task 8: Measure disk latency directly on Linux
cr0x@server:~$ iostat -x 1 3
Device r/s w/s r_await w_await aqu-sz %util
sda 280.0 860.0 18.2 12.4 9.10 98.0
sdb 300.0 820.0 17.9 12.1 8.80 97.5
sdc 310.0 870.0 18.5 12.7 9.30 98.3
sdd 320.0 850.0 18.0 12.2 9.00 97.9
Meaning: High await and high %util indicate device saturation/queueing. Cache hits won't fix saturated writes.
Decision: If latency is high due to writes, investigate sync behavior, the SLOG, and write amplification (recordsize, small blocks, fragmentation).
Task 9: See if sync writes are dominating
cr0x@server:~$ zfs get -o name,property,value,source sync tank/db
NAME PROPERTY VALUE SOURCE
tank/db sync standard default
Meaning: sync=standard means the application decides. Databases often do lots of sync writes.
Decision: If latency correlates with fsync, you need a proper SLOG (fast, power-loss protected) or app tuning—not ARC hit rate tuning.
Task 10: Confirm whether a SLOG exists and where
cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
logs
nvme0n1 ONLINE 0 0 0
Meaning: A log vdev is your SLOG. Its latency and durability characteristics matter for sync writes.
Decision: If you have no SLOG and sync write latency is killing you, consider one. If you have a cheap consumer SSD as SLOG, replace it before it replaces your weekend.
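Adding a SLOG is also an online operation; the device names are placeholders, and the drives should be power-loss protected and ideally mirrored:
cr0x@server:~$ sudo zpool add tank log mirror nvme4n1 nvme5n1
Only sync writes benefit. Async throughput won't move, which is worth saying out loud before the purchase order goes out.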
Task 11: Spot ARC eviction pressure (Linux /proc)
cr0x@server:~$ egrep '^(c_max|c_min|size|memory_throttle_count|deleted|evict_skip) ' /proc/spl/kstat/zfs/arcstats
c_max 4 68719476736
c_min 4 17179869184
size 4 68719476736
memory_throttle_count 4 0
deleted 4 119283
evict_skip 4 0
Meaning: size at c_max isn’t inherently bad. memory_throttle_count rising suggests the ARC is being forced to shrink or throttled.
Decision: If throttling increases during load, look for memory contention and consider capping ARC or moving workloads.
Task 12: Understand metadata vs data behavior in ARC
cr0x@server:~$ egrep 'demand_data_hits|demand_data_misses|demand_metadata_hits|demand_metadata_misses' /proc/spl/kstat/zfs/arcstats
demand_data_hits 4 99887766
demand_data_misses 4 5544332
demand_metadata_hits 4 22334455
demand_metadata_misses 4 112233
Meaning: Lots of data misses might be okay for streaming. Metadata misses are more suspicious for “UI feels slow” file workloads.
Decision: If metadata misses climb with latency spikes, consider special vdev for metadata or increase ARC (if memory allows).
Task 13: Check whether a “special” vdev exists for metadata/small blocks
cr0x@server:~$ sudo zpool status tank | sed -n '1,140p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
Meaning: Special vdev can offload metadata and optionally small blocks onto fast devices, reducing the pain of cache misses.
Decision: If metadata misses hurt and you don’t have special vdev, evaluate it. If you do have it, check those devices for wear/latency.
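If you go the special vdev route, it's a zpool add plus an optional per-dataset small-block threshold. Device names are placeholders; mirror it (losing the special vdev loses the pool), and remember that only newly written metadata and small blocks land there:
cr0x@server:~$ sudo zpool add tank special mirror nvme6n1 nvme7n1
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/home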
Task 14: Identify dataset-level cache policy mismatches
cr0x@server:~$ zfs get -r -o name,property,value primarycache tank | head
NAME PROPERTY VALUE
tank primarycache all
tank/backups primarycache all
tank/backups/daily primarycache all
tank/db primarycache all
tank/home primarycache all
Meaning: Backup datasets default to caching everything, which can evict hot data for no benefit.
Decision: Set backup datasets to primarycache=metadata (or even none in extreme cases) and leave interactive datasets alone.
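The fix is one property. Children inherit it, so setting it on tank/backups (the dataset from the listing above) covers the daily tree as well:
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups
cr0x@server:~$ zfs get -r -o name,property,value,source primarycache tank/backups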
Task 15: Validate that recordsize matches workload reality
cr0x@server:~$ zfs get -o name,property,value recordsize tank/db tank/vm tank/backups
NAME PROPERTY VALUE
tank/db recordsize 16K
tank/vm recordsize 64K
tank/backups recordsize 1M
Meaning: Small random DB I/O prefers smaller recordsize; backups like large. Wrong recordsize can waste ARC and cause read-modify-write amplification.
Decision: If hit rate is low because you’re caching giant records for small reads, adjust recordsize on the dataset (carefully, for new writes).
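If a dataset's recordsize doesn't match the workload (say, a database dataset still on the 128K default), the change itself is one property—but it only applies to newly written blocks, so existing data keeps its old layout until rewritten. Put that in the change ticket:
cr0x@server:~$ sudo zfs set recordsize=16K tank/db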
Task 16: Quick “is it the app?” sanity check with per-process I/O
cr0x@server:~$ sudo pidstat -d 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
12:05:10 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
12:05:11 PM 1001 24182 1200.00 9800.00 0.00 postgres
12:05:11 PM 0 19876 0.00 64000.00 0.00 zfs
Meaning: Confirms who is generating I/O. If a batch job is hammering writes, cache hit rates are just spectators.
Decision: If a noisy neighbor is responsible, throttle/reshard/relocate the job before touching ZFS tunables.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size company ran a multi-tenant analytics platform. During a launch week, dashboards slowed down and some requests timed out.
The storage team saw ARC hit ratio drop from the mid-90s to the low-70s. The conclusion formed instantly: “We need more RAM.”
They emergency-procured memory, scheduled a maintenance window, and meanwhile tried to “protect cache” by disabling prefetch and changing a few
dataset properties across the pool. The next day, the system was worse. Latency spikes got sharper, and internal error rates climbed.
The real issue was sync writes. A new ingestion path used a library that fsync’d aggressively. The pool had no SLOG and the vdevs were already
near saturation during peak. Reads were suffering because writes were queueing; the ARC ratio dropped because the system was simply slower at
servicing misses, not because caching suddenly “broke.”
Once they graphed sync write latency, it was obvious. They added a proper power-loss-protected SLOG, fixed the ingestion path, and rolled back
the cache-tuning changes. ARC hit ratio recovered somewhat, but more importantly, tail latencies dropped back into their SLO.
Lesson: a falling hit rate can be an effect, not a cause. Cache metrics are excellent witnesses and terrible suspects.
Mini-story 2: The optimization that backfired
A different org ran a ZFS-backed VM cluster. Someone proposed L2ARC on a shiny consumer NVMe drive: “We’ll get RAM-like performance without buying RAM.”
The change went in during a quiet period. Initial tests looked better. Everyone relaxed.
Two weeks later, the cluster hit a new workload pattern: lots of short-lived VMs doing package installs and CI builds—tons of reads, yes, but also huge churn.
L2ARC started filling with blocks that were never reused. The ARC had to maintain metadata for those cached blocks, and memory pressure increased.
The hypervisor began swapping intermittently. Latency got weird: not constantly bad, but unpredictable in the worst way.
The team stared at the L2ARC hit rate. It wasn’t great, but it wasn’t awful either. They argued over whether “30% hit rate” was “good.”
Meanwhile, customers were filing tickets.
The fix was painfully boring: remove L2ARC, cap ARC to leave headroom for the hypervisor and guests, and separate CI scratch space onto a dataset
with caching restricted to metadata. They also moved the noisiest build workloads to a different pool.
Lesson: L2ARC is not free. If your workload churns, you pay overhead for caching data you’ll never see again.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran ZFS for home directories and internal tooling. Nothing glamorous; lots of small files, lots of directory walking,
and occasional “why is my repo checkout slow” complaints. They had a habit: every quarter, they reviewed dataset properties and validated that
“bulk” datasets (backups, exports, media archives) weren’t allowed to flood ARC.
So their backups dataset had primarycache=metadata. Their archive dataset had recordsize tuned large. Their interactive datasets
stayed on defaults. The team also tracked pool latency and ARC metadata misses as first-class signals.
One day, a new backup job accidentally pointed at the interactive dataset instead of the backups dataset. It started streaming reads for hours
during business time. On most systems, that would have flushed ARC and triggered a storm of “storage is slow” complaints.
Here, the blast radius was limited. The wrong dataset was noisy, but the bulk of the pool stayed responsive. ARC hit ratio dipped briefly, but
metadata stayed hot and the user-facing latency didn’t collapse. The team caught the misconfiguration from the job logs and fixed it without a
major incident.
Lesson: small, consistent cache hygiene beats heroic tuning. The best outage is the one that looks like a rounding error.
Common mistakes (symptoms → root cause → fix)
1) “ARC hit ratio is low, so storage is slow”
Symptoms: Hit ratio drops during batch jobs; users complain; someone proposes doubling RAM.
Root cause: Streaming reads or one-time scans are bypassing reuse. Misses are expected; throughput is the constraint.
Fix: Isolate batch workloads (separate dataset/pool), set primarycache=metadata on bulk datasets, throttle with cgroups/ionice, and measure throughput/latency directly.
2) “L2ARC will solve our write latency”
Symptoms: Database fsync latency is bad; team adds L2ARC; nothing improves.
Root cause: L2ARC accelerates reads, not sync writes. Write latency is governed by SLOG/pool.
Fix: Add a proper SLOG for sync-heavy workloads, or change app sync behavior (if safe). Don’t buy SSDs for the wrong job.
3) Cache hit rate is high, but the app is still slow
Symptoms: ARC hit > 95%, users still see timeouts.
Root cause: CPU saturation (checksums/compression/encryption), lock contention, or a single hot dataset causing txg delays.
Fix: Profile CPU, check iowait vs user time, inspect txg and sync write behavior, and validate app-level bottlenecks.
4) “We’ll set primarycache=none everywhere to stop pollution”
Symptoms: Hit ratio drops, metadata misses explode, directory operations become painfully slow.
Root cause: Overcorrecting: you removed the very thing that makes filesystem workloads responsive.
Fix: Use primarycache=metadata for bulk datasets only. Keep all for interactive datasets unless you have a specific reason.
5) L2ARC makes the system less stable
Symptoms: Occasional swapping, jittery latency, kernel reclaim pressure.
Root cause: Not enough RAM for L2ARC headers and ARC itself; cache device encourages more caching churn.
Fix: Remove or shrink L2ARC, cap ARC, or add RAM. If you’re running hypervisors, protect the host memory first.
6) “We turned off prefetch and it got worse”
Symptoms: Sequential read throughput drops, applications do more synchronous reads.
Root cause: Prefetch was helping your actual workload; you assumed it was pollution because you saw prefetch stats.
Fix: Re-enable prefetch, and instead control the workloads that cause harmful streaming. Measure before/after.
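On Linux, prefetch is a module parameter rather than a dataset property; setting zfs_prefetch_disable back to 0 re-enables it at runtime:
cr0x@server:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
Also check for a stale "options zfs zfs_prefetch_disable=1" line in /etc/modprobe.d, or the tuning will quietly come back at the next reboot.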
7) “Our ARC is huge; why are we missing on metadata?”
Symptoms: Plenty of RAM, but metadata misses rise with directory traversals.
Root cause: Working set is truly enormous, or the special vdev is absent and metadata must be read from slow disks, making misses painful.
Fix: Consider special vdev for metadata/small blocks, reduce file count per directory, and ensure bulk datasets aren’t evicting metadata.
Joke #2: If your only tool is ARC hit ratio, every performance problem looks like “add RAM.” That’s how you get promoted to “person who orders RAM.”
Checklists / step-by-step plan
Step-by-step: Decide whether cache hit rates are actionable
- Define the pain: tail latency, throughput, jitter, or timeouts? Get an app graph first.
- Classify the workload: read-heavy, write-heavy, mixed; streaming vs reused working set.
- Check pool health: zpool status for degraded/resilver/scrub during peak.
- Check device latency: zpool iostat -v and iostat -x. If disks are pegged, caches won't fix writes.
- Check sync behavior: identify fsync-heavy apps; verify SLOG configuration if needed.
- Inspect ARC behavior: demand vs prefetch misses; metadata vs data misses; eviction/throttle counters.
- Check memory pressure: swapping or low available memory invalidates simplistic ARC tuning.
- Only then decide: add RAM, adjust dataset cache properties, add special vdev, add/remove L2ARC, or leave it alone.
Operational checklist: Keep caches from becoming a superstition
- Partition datasets by workload type: interactive vs bulk vs DB vs VMs.
- Set primarycache=metadata on bulk datasets (backups, exports) unless proven otherwise.
- Keep compression=lz4 by default unless CPU is truly your bottleneck.
- Review recordsize on DB/VM datasets; don't inherit "1M everywhere" because someone read a blog.
- Track latency at pool and device level, not just hit rates.
- Document why each tunable exists; remove it if nobody can explain it.
- Test L2ARC with real workload traces; don’t decide based on synthetic benchmarks only.
- Protect memory headroom on hypervisors and multi-tenant systems; cache greed is how you summon swap.
FAQ
1) What is a “good” ARC hit ratio?
There isn’t one number. For a stable read-heavy workload, higher usually means fewer disk reads and better latency.
For streaming reads, a low hit ratio is normal and not a problem. Judge by latency and disk utilization, not aesthetics.
2) Why did ARC hit ratio drop after I added more RAM?
Because the workload changed, because you started measuring differently, or because the system now does more prefetch/reads under higher load.
Also, adding RAM can increase concurrency (apps do more work), which changes cache dynamics. Look at misses and latency, not just ratio.
3) Does compression improve cache hit rates?
Often, yes—indirectly. With compression like lz4, ARC can hold more logical data per GiB of RAM (depending on implementation and workload).
But if CPU is constrained, compression can trade I/O for CPU and hurt latency.
4) Should I set primarycache=metadata on everything?
No. That’s a common overreaction. Use it on datasets that do large streaming reads (backups, archives) so they don’t evict useful cache.
Keep interactive datasets on all unless you’ve proven that caching data blocks is harmful.
5) Is L2ARC worth it on NVMe pools?
Sometimes, but the bar is higher. If your pool is already low-latency NVMe, L2ARC may not buy much and can add overhead.
L2ARC shines when it avoids expensive misses—like random reads to HDD or busy SSDs—especially when the working set is slightly larger than RAM.
6) Can L2ARC hurt performance?
Yes. It consumes RAM for headers and can increase memory pressure. It can also waste write bandwidth on the cache device filling with
non-reused blocks. If you see swapping or jittery latency after enabling it, treat that as a serious signal.
7) If my L2ARC hit rate is only 20%, should I remove it?
Not automatically. Ask: what are those 20% hits replacing? If they avoid slow random reads to HDD, 20% might be huge.
If they’re replacing already-fast reads or mostly prefetch, it might be pointless.
8) Does ARC help writes?
ARC is primarily a read cache. ZFS does buffer dirty data in memory before committing transaction groups, and writes can be coalesced.
But if your issue is sync write latency, ARC hit rates are not the lever. Look at SLOG and pool write latency.
9) Why do my hit rates look great but users still complain?
Because latency may be coming from somewhere else: sync writes, CPU saturation, network issues, application locks, or a single slow device in a vdev.
Validate with zpool iostat -v, iostat -x, and app metrics.
10) What’s the simplest safe tuning that usually helps?
Separate bulk datasets and set primarycache=metadata on them. Keep compression=lz4. Verify recordsize alignment for DB/VM datasets.
Everything else should be justified by measurements.
Conclusion: what to do next
Cache hit rates are not performance. They’re a clue. Sometimes a very good clue. But if you treat them like a scoreboard, you will tune the wrong
thing and still be slow—just with prettier graphs.
Practical next steps:
- Start tracking pool/device latency and utilization alongside ARC/L2ARC stats.
- Split datasets by workload and apply cache policies deliberately (bulk vs interactive vs DB/VM).
- Use the fast diagnosis playbook: health → latency → sync/write behavior → ARC breakdown → only then cache tuning.
- If you change anything, change one thing at a time and validate with application metrics, not just hit ratios.
You don’t need perfect hit rates. You need predictable latency, stable throughput, and a system that doesn’t surprise you at 2 a.m.
ZFS can do that—if you stop asking the cache to be your religion.