If you run ZFS long enough, someone will eventually panic over RAM “disappearing.”
They’ll show you a dashboard where “free memory” is near zero, and they’ll say the sentence every SRE hears in their sleep:
“ZFS is eating all our RAM.”
Here’s the punchline: ZFS is supposed to use your RAM. ZFS treats memory as an accelerator pedal, not a museum exhibit.
The trick isn’t to keep RAM “free.” The trick is to keep it useful—without starving your applications, without triggering swap storms,
and without confusing healthy caching for a leak.
ARC in one sentence
ARC (Adaptive Replacement Cache) is ZFS’s in-memory cache for recently and frequently used blocks—data and metadata—
designed to turn expensive disk reads into cheap RAM reads, adapting dynamically to your workload.
Joke #1: “Free RAM” is like an empty seat on a plane—comforting to look at, but you’re paying for it either way.
Why “free RAM” is a myth (in production terms)
On a modern OS, unused RAM is wasted opportunity. The kernel aggressively uses memory for caches: filesystem cache,
inode/dentry cache, slab allocations, and ZFS’s ARC. This is not a ZFS quirk; it’s basic economics.
RAM is the fastest storage tier you own.
The confusion comes from dashboards and simplistic metrics. Many graphs still treat “used memory” as suspicious and
“free memory” as safe. That’s backwards. What you want is:
- Low swap activity (especially sustained swap-in/out).
- Healthy page reclaim behavior (no thrashing, no OOM kills).
- Stable application latencies under load.
- Predictable ARC behavior relative to your workload.
ARC can be large and still harmless—if it can shrink when memory pressure arrives, and if it isn’t pushing your workloads into swap.
Conversely, you can have “free RAM” and still be slow, if your working set doesn’t fit cache and your disks are doing random I/O.
A practical way to think about it: free RAM is not a goal; reclaimable RAM is. ARC should be reclaimable under pressure,
and the system should remain stable when it is reclaimed.
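A quick way to see both numbers side by side on Linux with OpenZFS (a minimal sketch; it assumes the /proc/meminfo and arcstats layouts used throughout this article):
# what the kernel estimates it can hand out without swapping
awk '/^MemAvailable/ {printf "MemAvailable: %.1f GiB\n", $2/2^20}' /proc/meminfo
# how much RAM ARC currently occupies
awk '$1 == "size" {printf "ARC size:     %.1f GiB\n", $3/2^30}' /proc/spl/kstat/zfs/arcstats
A big ARC next to a healthy MemAvailable is normal; the interesting signal is how both numbers move when the workload spikes.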
What ARC actually is (and what it isn’t)
ARC is not the Linux page cache (but it competes with it)
If you’re on Linux with OpenZFS, you effectively have two caching systems in play:
the Linux page cache and the ZFS ARC. ZFS uses ARC for ZFS-managed storage. Linux uses page cache for file I/O
and anything else backed by the kernel’s caching mechanisms.
Depending on your workload and I/O path, you can end up double-caching or fighting for memory.
For example, a database that already maintains its own buffer pool can end up paying for the same data two or three times:
in the application cache, in ARC, and sometimes in the page cache, depending on whether its I/O is direct or buffered.
ARC is adaptive (and it’s picky about what it keeps)
ARC isn’t a dumb “keep the last N blocks” cache. It uses an adaptive replacement algorithm designed to balance:
- Recency: “I used this recently, I might use it again soon.”
- Frequency: “I use this repeatedly over time.”
That matters in real systems. Consider:
a nightly batch job that scans terabytes sequentially (recency-heavy, low reuse) versus a VM datastore with hot blocks
read repeatedly (frequency-heavy). ARC’s goal is to avoid being “polluted” by one-time scans while still capturing
genuinely hot data.
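On Linux with OpenZFS you can watch that balance directly: arcstats breaks ARC into recency (MRU) and frequency (MFU) lists, plus "ghost" lists of recently evicted blocks that steer the adaptation. Field names can vary slightly between releases, so treat this as a sketch:
cr0x@server:~$ grep -E '^(mru_size|mfu_size|mru_ghost_size|mfu_ghost_size) ' /proc/spl/kstat/zfs/arcstats
A scan-heavy workload keeps MRU fat and MFU thin; heavy re-reads grow MFU. Hits on the ghost lists are how ARC decides which side deserves more of the next byte.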
ARC caches metadata, which can matter more than data
In many production ZFS deployments, metadata caching is the difference between “snappy” and “why is ls hanging?”
Metadata includes dnodes, indirect blocks, directory structures, and various lookup structures.
If you’ve ever watched a storage system fall over under “small file workloads,” you’ve met metadata the hard way.
You can have plenty of disk throughput and still stall because the system is doing pointer-chasing on disk
for metadata that should have been in RAM.
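To see how much of ARC is metadata right now, the counters live in arcstats; the exact field names have changed across OpenZFS versions (older releases expose arc_meta_used and arc_meta_limit, newer ones split it differently), so a loose grep is the honest sketch:
cr0x@server:~$ grep -i meta /proc/spl/kstat/zfs/arcstats
If the metadata counters are being evicted hard while directory listings crawl, you've found the fight.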
What lives in ARC: data, metadata, and the stuff that surprises people
Data blocks vs metadata blocks
ARC contains cached blocks from your pool. Some are file data. Some are metadata.
The mix changes with workload. A VM farm tends to have a lot of repeated reads and metadata churn; a media archive
might mostly stream big blocks once; a Git server can be metadata-heavy with bursts.
Prefetch can help, and it can also set your RAM on fire
ZFS does read-ahead (prefetch) in various scenarios. When it works, it turns sequential reads into smooth throughput.
When it misfires—like when your “sequential” workload is actually many interleaved streams—it can flood ARC with
data that won’t be reused.
Real consequence: you can evict useful metadata to make room for useless prefetched data.
Then everything else gets slower, and people blame “ZFS overhead” when it’s actually cache pollution.
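OpenZFS counts prefetch activity separately in arcstats, which makes pollution easier to spot; the module parameter zfs_prefetch_disable exists as a blunt off switch. A sketch:
cr0x@server:~$ grep -E '^prefetch_' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
A pile of prefetch misses (prefetch reads that had to go to disk) alongside rising demand-metadata misses is the classic pollution signature. Prefer isolating or rescheduling the scan before disabling prefetch globally.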
Compressed blocks and the RAM math people get wrong
ZFS stores compressed blocks on disk, and modern OpenZFS keeps most blocks compressed in ARC as well ("compressed ARC").
The operational consequence: compression changes your mental model. If your data compresses well, the effective cache capacity
increases because more logical data fits per physical byte of RAM. But the CPU cost of decompressing on every hit
and the memory overhead for bookkeeping are real, and they show up at scale.
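A rough back-of-envelope, assuming a pool named tank as in the tasks below: multiply the current ARC size by the pool's compressratio to estimate how much logical data it can hold. It's an approximation (not every buffer stays compressed, and headers add overhead), but it keeps the mental model honest:
# pool-wide compression ratio, e.g. "1.85x" -> 1.85
ratio=$(zfs get -H -o value compressratio tank | tr -d 'x')
# current ARC size in bytes
arc_bytes=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
awk -v a="$arc_bytes" -v r="$ratio" 'BEGIN {printf "~%.0f GiB of logical data can fit in ARC\n", a*r/2^30}'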
Dedup: the “hold my beer” of RAM consumption
Deduplication in ZFS is famous for being both powerful and dangerous. The dedup table (DDT) needs fast access, and
fast access means memory. If you enable dedup without enough RAM for the DDT working set, you can turn a storage system
into a random I/O generator with a side hustle in misery.
Joke #2: Enabling dedup on a RAM-starved box is like adopting a tiger because you got a good deal on cat food.
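Before enabling dedup, do the napkin math. A commonly cited rule of thumb is a few hundred bytes of core memory per unique block; the 320 bytes and the 40 TiB of unique 128K-record data below are assumptions for illustration, not measurements:
# hypothetical: 40 TiB of unique data stored as 128K records
unique_blocks=$(( 40 * 1024 * 1024 * 1024 * 1024 / (128 * 1024) ))
# ~320 bytes of in-core DDT per unique block (rule of thumb; varies by version)
awk -v n="$unique_blocks" 'BEGIN {printf "~%.0f GiB of RAM just for the DDT\n", n * 320 / 2^30}'
On a pool that already has dedup enabled, zpool status -D reports real DDT entry counts and in-core size, which beats any rule of thumb.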
How ARC grows, shrinks, and sometimes refuses to “let go”
ARC sizing is a negotiation with the OS
ARC has a target size range, governed by tunables such as zfs_arc_min and zfs_arc_max.
The kernel will apply pressure when other subsystems need memory.
In a well-behaved system, ARC grows when RAM is available and shrinks when it’s needed elsewhere.
In a poorly understood system, people see ARC at “max” and assume it’s a leak. Usually it’s not.
ARC is behaving as designed: it found spare memory and used it to make reads faster.
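On Linux you can read the configured limits straight from the module parameters; a value of 0 means "use the built-in default." Compare them with c, c_min, and c_max from arcstats (Task 3 below) to see what's actually in effect:
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max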
Why ARC sometimes doesn’t shrink “fast enough”
There are cases where ARC shrink can lag behind sudden memory demand:
sudden container bursts, JVM heap expansions, or an emergency page cache expansion for a non-ZFS workload.
ARC is reclaimable, but not necessarily instantly reclaimable, and the path from “pressure” to “bytes freed”
can have latency.
When that latency meets aggressive workloads, you see swapping, kswapd CPU burn, and tail latency spikes.
That’s when ARC becomes politically unpopular.
The special pain of virtualization and “noisy neighbors”
In hypervisor setups (or large container hosts), memory accounting gets messy. Guests have their own caching.
The host has ARC. The host may also have page cache for other files. If you oversubscribe memory or allow
ballooning/overcommit without guardrails, ARC becomes the scapegoat for fundamentally bad capacity planning.
Facts & history: how we got here
- ZFS was born at Sun Microsystems as a “storage pool + filesystem” design, not a bolt-on volume manager.
- ARC is based on the Adaptive Replacement Cache algorithm, which improved on simple LRU by balancing recency and frequency.
- ZFS popularized end-to-end checksumming for data integrity, which increases metadata work—and makes caching metadata more valuable.
- The “use RAM as cache” philosophy predates ZFS; Unix kernels have long used spare memory for caching, but ZFS made it impossible to ignore.
- Early ZFS guidance was “RAM is king” partly because disks were slower and random I/O was brutally expensive compared to RAM.
- L2ARC (the secondary cache) arrived to extend caching onto fast devices, but it’s not free: it needs metadata in ARC to be useful.
- Dedup became notorious because it moved a traditionally offline storage optimization into a real-time, RAM-hungry code path.
- OpenZFS brought ZFS to Linux and other platforms, where it had to coexist with different VM and cache subsystems, changing tuning realities.
- NVMe changed the game: disks got fast enough that bad caching decisions are sometimes less obvious—until you hit tail latency.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption (“Free RAM is healthy”)
A mid-sized company ran a ZFS-backed NFS cluster serving home directories and build artifacts.
A new manager—smart, fast-moving, and freshly trained on a different stack—rolled out a memory “hardening” change:
cap ARC aggressively so that “at least 40% RAM stays free.” It sounded reasonable in a spreadsheet.
The first week looked fine. The dashboards were comforting: lots of green, lots of “free memory.”
Then the quarterly release cycle hit. Build jobs fanned out, small files exploded, and the NFS servers started
stuttering. Not down, just slow enough to make everything else feel broken.
The symptom that sent the incident into full bloom wasn’t “high disk utilization.” It was the ugly kind:
iowait climbing, latency for simple metadata ops spiking, and the NFS threads piling up.
The pool wasn’t saturated on throughput. It was drowning in random reads for metadata that used to sit happily in ARC.
The postmortem wasn’t about blame. It was about a wrong assumption: “free” memory is not a stability indicator.
The fix was equally unsexy: allow ARC to grow, but set a sane upper bound based on actual application headroom,
and add alerting on swap activity and ARC eviction rates—not on “free RAM.”
The lesson: if you treat RAM like a trophy, ZFS will treat your disks like a scratchpad.
Mini-story #2: The optimization that backfired (L2ARC everywhere)
Another shop had a ZFS pool supporting a virtualization cluster. Reads were the pain point, so someone proposed
adding L2ARC devices. They had spare SSDs, and the plan was simple: “Add L2ARC and the cache hit ratio will soar.”
It’s an easy sell because it’s tangible hardware.
They added a big L2ARC, watched the graphs, and… nothing magical happened. In fact, under certain workloads,
latency got worse. It wasn’t catastrophic; it was insidious. The VMs felt “sticky” during morning boot storms,
and random workloads got spikier.
The culprit wasn’t the SSDs. It was memory. L2ARC needs metadata in ARC to be effective. The larger the L2ARC,
the more ARC overhead you spend indexing it. On a host that was already RAM-tight, the extra pressure pushed ARC
into more frequent evictions of exactly the metadata the system needed most.
The rollback wasn’t dramatic. They reduced L2ARC size, added RAM on the next refresh cycle, and adjusted expectations:
L2ARC helps best when the working set is larger than RAM but still “cacheable,” and when you can afford the memory overhead.
Otherwise, you’ve built a very expensive way to make your cache less stable.
The lesson: caching is not additive; it’s a budget. If you spend it twice, you go broke in latency.
Mini-story #3: The boring practice that saved the day (measuring before tuning)
A financial services team ran ZFS for a file ingestion pipeline. They weren’t the loudest team, but they were disciplined.
Their practice was painfully boring: before any “tuning,” they captured a baseline bundle of metrics—ARC stats, IO latency,
swap activity, and per-dataset recordsize/compression settings. Every change came with a before/after comparison.
One afternoon, ingestion latency doubled. The easy blame target was ARC: “Maybe the cache is thrashing.”
But their baseline told a different story. ARC hit ratio was stable. Evictions weren’t unusual. What changed was
memory pressure: a new sidecar process had been deployed with an unbounded heap.
The system wasn’t failing because ARC was greedy; it was failing because the host was overcommitted.
ARC was doing what it could—shrinking under pressure—but the other process kept expanding, pushing the box into swap.
Their graphs showed it clearly: swap-in/out rose first, then latency followed, then CPU time in reclaim.
The fix wasn’t an arcane ZFS tunable. It was a resource limit and a rollback of a bad deployment.
The boring practice—capturing baselines and watching the right indicators—kept them from making it worse by blindly
strangling ARC.
The lesson: most “ZFS memory problems” are actually system memory problems wearing a ZFS hat.
Practical tasks: commands, outputs, and what they mean
The goal here is not to memorize commands. It’s to build muscle memory:
verify the workload, confirm memory pressure, then decide whether ARC is helping or harming.
Commands below assume a Linux system with OpenZFS installed; adjust paths for other platforms.
Task 1: See overall memory reality (not “free RAM” panic)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 251Gi 41Gi 2.1Gi 1.2Gi 208Gi 198Gi
Swap: 16Gi 0B 16Gi
Interpretation: “Free” is low, but “available” is huge. That usually means the system is caching aggressively and can reclaim memory.
If swap is quiet and “available” is healthy, this is probably fine.
Task 2: Check if swapping is actually happening
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 2212140 10240 189512000 0 0 12 44 520 880 6 2 91 1 0
1 0 0 2198840 10240 189625000 0 0 0 0 500 860 5 2 92 1 0
4 1 0 2101020 10240 189540000 0 0 220 180 1100 1600 10 4 74 12 0
3 0 0 2099920 10240 189520000 0 0 140 120 920 1300 8 3 81 8 0
2 0 0 2103200 10240 189610000 0 0 30 60 600 900 5 2 91 2 0
Interpretation: Watch si/so (swap in/out). Non-zero sustained values mean the box is under memory pressure.
A little I/O wait (wa) isn’t automatically ARC’s fault; correlate with ARC misses and disk latency.
Task 3: Read ARC size and limits directly
cr0x@server:~$ grep -E '^(c|size|c_min|c_max|memory_throttle_count) ' /proc/spl/kstat/zfs/arcstats
c 4 214748364800
c_min 4 10737418240
c_max 4 214748364800
size 4 198742182912
memory_throttle_count 4 0
Interpretation: size is current ARC size. c is the target. c_max is the cap.
If memory_throttle_count climbs, ARC has experienced memory pressure events worth investigating.
Task 4: Check ARC hit/miss behavior (is ARC helping?)
cr0x@server:~$ grep -E '^(hits|misses|demand_data_hits|demand_data_misses|demand_metadata_hits|demand_metadata_misses)' /proc/spl/kstat/zfs/arcstats
hits 4 18230933444
misses 4 1209933221
demand_data_hits 4 12055411222
demand_data_misses 4 902331122
demand_metadata_hits 4 5800122201
demand_metadata_misses 4 307602099
Interpretation: High hits relative to misses is good, but don’t worship hit ratio.
What matters is latency and disk load. A “good” ratio can still be too slow if your misses are expensive (random HDD reads),
and a “bad” ratio can be acceptable if your pool is NVMe and your workload is streaming.
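If you want the ratio as one number for trending, it's an awk one-liner away; with the counters above it works out to roughly 94%. Remember these counters are cumulative since module load, so compute deltas for live monitoring:
cr0x@server:~$ awk '$1 == "hits" {h=$3} $1 == "misses" {m=$3} END {printf "ARC hit ratio: %.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats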
Task 5: Watch ARC live with arcstat (when installed)
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:01 812 42 5 30 4 8 1 4 0 185G 200G
12:01:02 900 55 6 40 4 10 1 5 1 185G 200G
12:01:03 1100 220 20 190 17 18 2 12 1 184G 200G
12:01:04 980 180 18 150 15 20 2 10 1 184G 200G
12:01:05 860 60 7 45 5 10 1 5 1 184G 200G
Interpretation: A spike in misses during a batch scan is normal. A persistent miss storm during “steady state”
usually points to a working set that doesn’t fit, prefetch pollution, or a workload shift (new dataset, new access pattern).
Task 6: Check for memory reclaim stress (Linux)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.05 avg300=0.12 total=1843812
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Interpretation: PSI memory “some” indicates time spent stalled due to memory pressure; “full” is worse (tasks fully blocked).
Rising PSI alongside swap activity and ARC at cap is a signal to revisit memory budgets.
Task 7: Confirm pool health and obvious bottlenecks
cr0x@server:~$ zpool status -xv
all pools are healthy
Interpretation: Don’t tune caches on a sick pool. If you have errors, resilvering, or degraded vdevs, your “ARC problem”
may be a “hardware problem.”
Task 8: Observe I/O latency, not just throughput
cr0x@server:~$ zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
tank 48.2T 21.6T 4100 540 342M 41M
raidz2-0 48.2T 21.6T 4100 540 342M 41M
sda - - 1020 130 85M 10M
sdb - - 1015 135 86M 11M
sdc - - 1040 140 86M 10M
sdd - - 1030 135 85M 10M
Interpretation: This shows operations and bandwidth, but not latency. If things “feel slow,” pair this with tools like
iostat -x to see await/util, and correlate with ARC misses.
Task 9: Check device latency with iostat
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 1020.0 130.0 85.0 10.0 0.0 8.0 0.00 5.80 9.10 6.20 9.50 85.3 79.1 0.40 46.0
sdb 1015.0 135.0 86.0 11.0 0.0 9.0 0.00 6.20 9.40 6.10 9.60 86.7 83.1 0.41 46.5
Interpretation: Rising r_await and high %util during ARC miss spikes means your disks are paying for cache misses.
If latency is already low (e.g., NVMe), ARC misses may not be the villain.
Task 10: Identify which datasets are configured to behave expensively
cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,primarycache,secondarycache,compression tank
NAME PROPERTY VALUE
tank compression lz4
tank primarycache all
tank recordsize 128K
Interpretation: primarycache=all means both data and metadata are cached in ARC.
For some workloads (databases, large streaming), you might choose metadata to reduce ARC pressure.
Don’t do it by superstition—measure.
Task 11: Check whether a workload is bypassing cache expectations
cr0x@server:~$ zfs get -o name,property,value atime,sync,logbias,primarycache tank/vmstore
NAME PROPERTY VALUE
tank/vmstore atime off
tank/vmstore sync standard
tank/vmstore logbias latency
tank/vmstore primarycache all
Interpretation: Settings like sync and logbias won’t change ARC directly, but they change I/O behavior.
If writes are slow and causing backpressure, reads can suffer and the “ARC debate” becomes a distraction.
Task 12: Set a temporary ARC cap (carefully) for experiments
cr0x@server:~$ sudo sh -c 'echo $((64*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max'
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
68719476736
Interpretation: This sets ARC max to 64 GiB (value is bytes). Use this to test headroom hypotheses, not as a permanent “fix.”
Permanent tuning should be done via your distro’s module parameter configuration so it persists across reboots.
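A sketch of the persistent version: a module option in a modprobe.d file (the file name is a convention; some distros also need an initramfs rebuild, e.g. update-initramfs -u on Debian/Ubuntu, before the value applies at boot):
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736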
Task 13: Force the question—does performance track ARC size or disk latency?
cr0x@server:~$ sudo sh -c 'echo $((128*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max'
cr0x@server:~$ sleep 10
cr0x@server:~$ grep -E '^(size|c|c_max) ' /proc/spl/kstat/zfs/arcstats
c 4 137438953472
c_max 4 137438953472
size 4 132001234944
Interpretation: Raise ARC cap temporarily and watch whether latency improves and misses drop.
If nothing changes, your bottleneck may be write path, CPU, network, or the application’s access pattern.
Task 14: Spot ARC-related kernel messages
cr0x@server:~$ dmesg -T | grep -i -E 'arc|spl|zfs' | tail -n 10
[Thu Dec 25 09:44:10 2025] ZFS: Loaded module v2.2.4-1
[Thu Dec 25 10:02:31 2025] ZFS: ARC size 197G, target 200G, min 10G, max 200G
Interpretation: You’re looking for warnings about memory throttling, failures to allocate, or repeated reclaim events.
If logs are noisy, you’re beyond “ARC is large” and into “ARC is fighting the kernel.”
Task 15: Determine if your workload is dominated by metadata
cr0x@server:~$ grep -E '^(demand_metadata_hits|demand_metadata_misses|demand_data_hits|demand_data_misses)' /proc/spl/kstat/zfs/arcstats
demand_data_hits 4 12055411222
demand_data_misses 4 902331122
demand_metadata_hits 4 5800122201
demand_metadata_misses 4 307602099
Interpretation: If metadata misses are significant and correlate with slow directory ops, slow file opens, or high IOPS on disks,
prioritize keeping metadata warm: avoid cache pollution, avoid tiny ARC caps, and consider workload-specific dataset caching settings.
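One derived number that helps here: the metadata share of demand misses. With the counters above it's roughly 25%; on a small-file NFS server it can dominate. A sketch:
cr0x@server:~$ awk '$1 == "demand_data_misses" {d=$3} $1 == "demand_metadata_misses" {m=$3} END {printf "metadata share of demand misses: %.0f%%\n", 100*m/(d+m)}' /proc/spl/kstat/zfs/arcstats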
Fast diagnosis playbook
When someone pings you with “ZFS is using all the RAM” or “ZFS is slow,” you don’t have time for a philosophy seminar.
You need a short, reliable sequence that finds the bottleneck quickly.
Step 1: Is the host under memory pressure or just caching?
- Check free -h and focus on available, not free.
- Check vmstat 1 for sustained si/so > 0.
- Check PSI memory (/proc/pressure/memory) for rising “some/full.”
If swap is active and PSI is rising, you have a memory pressure problem. ARC may be involved, but it’s rarely the only actor.
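If you want Step 1 as one paste, here's a sketch (it assumes PSI is enabled and that si/so are the 7th and 8th vmstat columns, as in Task 2):
# kernel's view of total, available, and swap memory
grep -E '^(MemTotal|MemAvailable|SwapFree)' /proc/meminfo
# any swap in/out over a short sample (skip headers and the since-boot line)
vmstat 1 5 | awk 'NR > 3 && ($7 > 0 || $8 > 0) {s=1} END {print (s ? "swap activity observed" : "no swap activity in sample")}'
# pressure stall information for memory
cat /proc/pressure/memory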
Step 2: Are reads slow because ARC is missing, or because disks are slow anyway?
- Check ARC hits/misses (/proc/spl/kstat/zfs/arcstats or arcstat).
- Check disk latency (iostat -x 1) and pool behavior (zpool iostat 1).
If ARC misses spike and disk latency spikes, your cache isn’t covering your working set—or it’s being polluted.
If ARC misses spike but disks stay low-latency, your performance complaint might be elsewhere (CPU, network, app).
Step 3: Is the workload changing the cache economics?
- Look for large scans, backups, reindex jobs, VM boot storms, or replication.
- Identify whether metadata misses increased (small file workload, millions of inodes).
- Review dataset properties: primarycache, recordsize, compression, sync/logbias.
Many “ARC incidents” are really “a batch job happened” incidents. Your response should be to isolate or schedule
the batch job, not to permanently cripple caching.
Step 4: Decide: tune ARC limits, tune workload, or add RAM
- If the host is swapping: reserve headroom (cap ARC) and fix the memory hog.
- If disks are saturated by misses: increase effective cache (more RAM, better caching policy, reduce pollution).
- If latency is fine: stop touching it and move to the real bottleneck.
Checklists / step-by-step plan
Checklist A: “Is ARC harming my applications?”
- Confirm swap activity:
cr0x@server:~$ vmstat 1 10
Look for sustained si/so and rising wa.
- Confirm memory availability:
cr0x@server:~$ free -h
If available is low and dropping, you’re actually short on memory.
- Check ARC cap and size:
cr0x@server:~$ grep -E '^(size|c_max|c_min|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats
If ARC is at cap and throttle count is climbing, consider headroom changes.
- Correlate with app latency and OOM logs:
cr0x@server:~$ dmesg -T | tail -n 50
If you see OOM kills, ARC sizing is not the root cause; overcommit is.
Checklist B: “Is ARC too small for this workload?”
- Measure ARC misses during the complaint window:
cr0x@server:~$ arcstat 1 30
Persistent misses during normal steady load are a red flag.
- Check disk latency at the same time:
cr0x@server:~$ iostat -x 1 30
If awaits climb during miss storms, the pool is paying for it.
- Test a controlled ARC max increase (if you have headroom):
cr0x@server:~$ sudo sh -c 'echo $((192*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max'
Watch whether latency improves and misses drop. If yes, the fix is usually “more RAM or better isolation.”
Checklist C: “Keep metadata hot, stop cache pollution”
- Identify scan-like jobs (backup, scrub, rsync, reindex) and schedule them off-peak.
- Consider dataset primarycache=metadata for streaming datasets that don’t benefit from data caching:
cr0x@server:~$ sudo zfs set primarycache=metadata tank/archive
This can reduce ARC churn while keeping directory traversal fast.
- Validate with ARC stats: metadata misses should drop; disk IOPS should stabilize.
Common mistakes (symptoms and fixes)
Mistake 1: Alerting on “free RAM”
Symptoms: Constant pages to on-call, no actual performance issue, pressure to “fix ZFS memory.”
Fix: Alert on swap activity (vmstat si/so), PSI memory “full,” OOM events, and application latency.
Use “available” memory, not “free,” in dashboards.
Mistake 2: Capping ARC without measuring the workload
Symptoms: Disk IOPS jumps, metadata-heavy operations slow down, “everything feels laggy” during peak.
Fix: Restore a reasonable ARC cap; measure ARC misses and disk latency.
If you need headroom for apps, cap ARC based on a budget (apps + kernel + safety margin), not a percentage of “free.”
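The budget arithmetic is deliberately boring. A sketch with hypothetical numbers: a 256 GiB host whose applications peak around 96 GiB, with ~32 GiB reserved for the kernel, page cache, and safety margin:
# hypothetical budget: total - app peak - kernel/page-cache/safety = ARC cap
echo $(( (256 - 96 - 32) * 1024 * 1024 * 1024 ))   # 137438953472 bytes (128 GiB) for zfs_arc_max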
Mistake 3: Treating ARC hit ratio as the KPI
Symptoms: People celebrate a high hit rate while tail latency is terrible; or they panic over low hit rate on streaming workloads.
Fix: Prioritize latency and swap health. Hit ratio is context-dependent.
A media streamer can have a low hit ratio and still be fast; a metadata-heavy NFS server cannot.
Mistake 4: Enabling dedup because it “saves space”
Symptoms: Sudden performance collapse, high random reads, memory pressure, slow writes, DDT-related overhead.
Fix: Don’t enable dedup without a real capacity/performance model and memory budget.
If already enabled and suffering, plan a migration strategy; “turning it off” is not instant on existing blocks.
Mistake 5: Throwing L2ARC at the problem on a RAM-tight system
Symptoms: No improvement or worse latency; ARC pressure increases; metadata misses persist.
Fix: Ensure adequate RAM first; keep L2ARC appropriately sized; validate that the workload has reuse.
If your workload is mostly one-time reads, L2ARC is an expensive placebo.
Mistake 6: Ignoring write path issues and blaming ARC
Symptoms: Reads slow “sometimes,” but the real trigger is sync writes, commit latency, or a saturated SLOG/write vdev.
Fix: Measure end-to-end: zpool iostat, device latency, and application write patterns.
Fix write bottlenecks; don’t micromanage ARC to compensate.
Mistake 7: Running mixed workloads without isolation
Symptoms: Backup jobs ruin interactive workloads; VM boot storms crush file services; cache churn.
Fix: Isolate workloads by host, pool, or schedule. Use dataset-specific cache policy where appropriate.
Consider cgroups memory limits for noisy services on Linux.
FAQ
1) Is it bad if ZFS uses most of my RAM?
Not by itself. It’s bad if the system is swapping, reclaiming aggressively (high PSI “full”), or applications are losing memory and slowing down.
If “available” memory is healthy and swap is quiet, ARC using RAM is usually a feature.
2) Why doesn’t ARC release memory immediately when an app needs it?
ARC is reclaimable, but reclaim has mechanics and timing. Under sudden spikes, ARC may lag behind demand,
and the kernel may swap before ARC has shrunk enough. This is why you budget headroom and avoid operating at the cliff edge.
3) Should I set zfs_arc_max on every system?
If the host runs only ZFS workloads (like a dedicated NAS), defaults often work well.
If it’s a mixed-use host (databases, JVMs, containers), setting a cap can prevent surprise contention.
The right answer is a memory budget: what your apps need under peak, plus safety margin, plus what you can afford for ARC.
4) What’s a “good” ARC hit ratio?
Depends on workload. For streaming reads, a low hit ratio can still deliver high throughput.
For random reads and metadata-heavy workloads, low hit ratio usually means real pain.
Track hit/miss trends and correlate with disk latency and user-visible latency.
5) Is ARC the same as L2ARC?
No. ARC is in RAM. L2ARC is a secondary cache on fast storage (SSD/NVMe). L2ARC can extend caching,
but it needs RAM for metadata and doesn’t help much for one-time reads.
6) If I add more RAM, will ZFS always get faster?
Not always, but often. More RAM helps when your working set is cacheable and misses are expensive.
If you’re bottlenecked on writes, CPU, network, or application design, more ARC won’t save you.
7) Why does my system show low “free” memory even when idle?
Because the OS uses RAM for caches to speed up future work. Idle systems with lots of cache are normal.
Focus on “available” memory and swapping, not “free.”
8) Can I configure ZFS to cache only metadata?
Yes, per dataset with primarycache=metadata. It’s useful for datasets with large streaming reads
that don’t benefit from caching data blocks, while still keeping directory traversal and file lookups fast.
Measure before and after—this can backfire on workloads that actually reuse data.
9) How do I tell if ARC is thrashing?
Look for sustained high ARC misses during steady workload, rising eviction behavior, and disk latency spikes that correlate with misses.
If the system is also swapping, you can get into a vicious cycle: pressure causes ARC churn, which increases I/O, which increases latency.
10) Why did performance drop right after a big backup or scrub?
Large sequential reads can evict useful cached blocks (especially metadata) if the cache is not sized or tuned for mixed workloads.
The fix is usually scheduling, isolation, or preventing cache pollution—not permanently shrinking ARC.
Conclusion
ZFS ARC is not a memory leak wearing a filesystem costume. It’s a deliberate design choice: use RAM to avoid disk I/O,
and adapt to what the workload is doing. The operational mistake is treating “free RAM” as a health metric and
treating ARC size as a moral failure.
When performance is bad, don’t argue about philosophy—measure. Check memory pressure, check swap, check ARC misses,
check disk latency, and identify the workload that changed the game. Then decide: cap ARC for headroom, tune datasets
to avoid pollution, isolate workloads, or buy more RAM. The best ZFS tuning is often the simplest:
let ARC do its job, and make sure the rest of the system isn’t sabotaging it.