Cache wars are rarely won in benchmarks and frequently lost in production at 2 a.m., when the storage graph looks like a seismograph and your database team starts saying “it must be the network.”
ZFS has the ARC. Linux has the page cache. Both are very good at what they do, and both can ruin your day if you assume they’re the same thing with different names.
This piece is about the practical truth: what each cache actually stores, how they behave under pressure, why you sometimes get “double caching,” and how to decide who should hold memory in a mixed workload.
You’ll get history, mechanics, incident stories from real ops patterns, concrete commands, and a fast diagnosis playbook you can use before someone suggests “just add RAM.”
The two caches, in one sentence
The Linux page cache caches file data as the kernel sees it through the VFS, broadly and opportunistically; ZFS ARC caches ZFS objects and metadata inside the ZFS stack, aggressively and with its own rules.
In practice: the page cache is the OS’s default “keep it handy” brain; ARC is ZFS’s “I know what I’ll need next” brain—sometimes smarter, sometimes stubborn.
Joke #1 (quick, relevant): asking “which cache is better” is like asking whether a forklift beats a pickup truck. If you’re moving pallets, you’ll be very impressed with the forklift—until you try to drive it on the freeway.
Interesting facts and short history (that actually matters)
Caching isn’t new, but the details of where and how it’s implemented shape the operational failure modes. A few concrete points that tend to change decisions:
- ZFS was born in the Solaris world, where the filesystem and volume manager were tightly integrated. ARC is part of that integrated design, not a bolt-on “Linux-ish” cache.
- ARC is adaptive: it’s designed to balance between caching recently used data and frequently used data (roughly MRU vs MFU behavior). That adaptiveness is a feature—and a source of surprises.
- ZFS checksums everything and can self-heal with redundancy. That means reads aren’t just “read bytes”; there’s metadata and verification overhead that caching can amplify or reduce.
- The Linux page cache is older than many current filesystems. It’s deeply embedded in Linux performance expectations; lots of applications quietly rely on it even if they don’t talk about it.
- Historically, ZFS on Linux had to reconcile two memory managers: the Linux VM and ZFS’s own ARC behavior. Modern ZoL/OpenZFS is far better than early days, but the “two brains” model still matters.
- L2ARC came later as SSDs became viable read caches. It’s not “more ARC”; it’s an extension with its own costs, including metadata overhead and warm-up time.
- ZIL and SLOG are about sync writes, not read caching. People still buy “a fast SLOG” to fix read latency, which is like installing a better mailbox to improve your living room acoustics.
- Linux’s direct I/O and bypass paths evolved alongside databases. Many serious DB engines learned to bypass page cache deliberately; ZFS has its own ways to play in that space, but not all combos are happy.
- Modern Linux runs with cgroups everywhere. ARC historically didn’t always behave like a “good citizen” inside memory limits; newer releases improved, but ops teams still get bitten by mismatched expectations.
A mental model: what lives where
When an application reads a file, you can imagine a path of layers: app → libc → kernel VFS → filesystem → block layer → disk. Linux page cache sits near the VFS side and caches file pages keyed by inode/page offsets.
ZFS, however, has its own pipeline: DMU (Data Management Unit) objects, dnodes, ARC buffers, vdev I/O scheduler, and the pool. ARC caches at the ZFS level, not merely “file pages.”
That difference has three operational consequences:
- Metadata wins: ZFS metadata (dnodes, indirect blocks, spacemaps) can be cached in ARC, and caching metadata can radically change random read performance. Linux page cache does cache metadata too, but ZFS metadata is… more.
- Compression/encryption changes the economics: ARC may hold compressed buffers depending on settings; page cache typically holds decompressed file pages as applications see them.
- Eviction and pressure signals differ: Linux VM can reclaim page cache under pressure easily. ARC can shrink, but its tuning and heuristics can be misaligned with what the rest of the system thinks is “pressure.”
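If you want to see both brains at once, a minimal sketch (assuming OpenZFS on Linux with the standard kstat path) is to read ARC’s current size from arcstats and set it next to the page cache figure from meminfo:
cr0x@server:~$ awk '/^size / {printf "ARC: %.1f GiB\n", $3/2^30}' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ awk '/^Cached:/ {printf "Page cache: %.1f GiB\n", $2/2^20}' /proc/meminfo
Neither number is a verdict on its own, but watching how the split drifts during a workload change tells you which cache is absorbing the hit.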
ZFS ARC deep dive (what it caches and why it’s weird)
What ARC actually caches
ARC is not “a RAM disk for your files.” It caches ZFS buffers: data blocks and metadata blocks as they exist inside the ZFS pipeline. This includes file data, but also the scaffolding needed to find file data quickly:
dnode structures, indirect block pointers, and more.
In operational terms, ARC is why a ZFS pool that felt like it was “on fire” during a metadata-heavy scan suddenly calms down once warmed: the second pass is largely pointer chasing in RAM instead of on disk.
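You can see how much of ARC is scaffolding versus file data directly in arcstats. A minimal sketch, with the caveat that field names vary by OpenZFS release (older versions expose arc_meta_used, newer ones break it into metadata_size, dnode_size, and friends):
cr0x@server:~$ grep -E '^(data_size|metadata_size|dnode_size|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats
A large metadata share is not a problem by itself; it often means ARC is holding exactly the dnodes and block pointers that make the second pass cheap.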
ARC’s adaptive policy (MRU/MFU) and why you see “cache misses” that aren’t failures
ARC balances between “recently used” and “frequently used” buffers. It also tracks ghost lists—things it recently evicted—to learn whether it made the right call. This is clever and usually helpful.
But it means ARC can look like it’s doing work even when it’s “working as designed.”
Operators often misread ARC hit rate as a universal KPI. It’s not. A low hit rate can be fine if your workload is streaming large files once. A high hit rate can be deceptive if you’re accidentally caching a workload that should be sequentially streamed with minimal retention.
ARC sizing: the part that starts arguments
On Linux, ARC competes with everything else: page cache, anonymous memory, slab, and your applications. ARC has limits (zfs_arc_max and friends), and it can shrink under pressure, but the timing matters.
When memory pressure hits fast (batch job starts, huge sort, container spikes), ARC might not shrink fast enough, and the kernel will start reclaiming elsewhere—sometimes the page cache, sometimes anonymous pages—leading to latency spikes or even OOM events.
The practical tuning question is not “how big can ARC be?” It’s “how big can ARC be without destabilizing the rest of the box?”
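A quick way to see what bounds are actually in effect, assuming the zfs module is loaded (a value of 0 for zfs_arc_max means the built-in default, which on Linux has historically been around half of RAM):
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_min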
ARC and write behavior: not the same as page cache
Linux page cache is tightly coupled to writeback and dirty page accounting. ZFS has its own transaction groups (TXGs), dirty data limits, and a pipeline where data is staged, written, and committed.
ARC itself is primarily about reads, but the system memory story includes dirty data in-flight and metadata updates.
This is why “free memory” is the wrong metric on ZFS systems. You care about reclaimability and latency under pressure, not whether the box looks empty in a dashboard.
L2ARC: the SSD cache that behaves like a slow second RAM
L2ARC extends ARC onto fast devices, typically SSDs. It can help random reads when the working set is bigger than RAM and when the access pattern has reuse.
But it’s not magic: L2ARC must be populated (warm-up), it can increase memory pressure due to metadata tracking, and it adds I/O load to the cache device.
In production, the most common L2ARC disappointment is expecting it to fix a workload that is mostly one-time reads. You cannot cache your way out of “I read a petabyte once.”
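If you already run (or are evaluating) L2ARC, the relevant counters live in the same arcstats file. A sketch, assuming a cache vdev is attached; the pool and device names below are placeholders:
cr0x@server:~$ grep -E '^l2_(hits|misses|size|asize) ' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ sudo zpool add tank cache /dev/nvme1n1    # attach a cache vdev (L2ARC)
cr0x@server:~$ sudo zpool remove tank nvme1n1            # removing a cache device is non-destructive
If l2_hits stays near zero while l2_size keeps growing, you are feeding the cache without ever eating from it.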
Linux page cache deep dive (what it caches and why it’s everywhere)
The page cache’s job description
The Linux page cache caches file-backed pages: the exact bytes that would be returned to a process reading a file (after filesystem translation). Alongside it, the kernel keeps directory entries and inode metadata hot in the dentry and inode caches, which are separate slab caches but reclaimed under the same memory pressure.
Its power comes from being the default: almost every filesystem and almost every application benefits automatically.
Reclaim and writeback: page cache is designed to be sacrificed
Linux treats page cache as reclaimable. Under memory pressure, the kernel can drop clean page cache pages quickly. Dirty pages must be written back, which introduces latency and I/O.
That’s why “dirty ratio” tuning can change tail latencies in write-heavy systems.
The page cache is part of an ecosystem with the VM: kswapd, direct reclaim, dirty throttling, and per-cgroup accounting (in many setups). It’s why Linux often feels “self-balancing” compared to systems where caches are more isolated.
Read-ahead and sequential workloads
Page cache is good at recognizing sequential access patterns and doing read-ahead. This is a huge win for streaming reads.
It’s also why your benchmark that reads a file twice looks “amazing” the second time—unless you bypass page cache or the file is too large.
Direct I/O and database engines
Many databases use direct I/O to avoid double buffering and to control their own caching. That can make sense when the DB buffer cache is mature and the access pattern is random.
But it also moves the burden of caching correctness and tuning into the application, which is great until you run it in a VM with noisy neighbors.
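A quick way to feel the difference is to read the same file buffered and then with O_DIRECT. A sketch; the file path is a placeholder, and true direct I/O support on ZFS depends on the release, so test on the filesystem you actually care about:
cr0x@server:~$ dd if=/data/testfile of=/dev/null bs=1M                 # buffered: the second run should be dramatically faster
cr0x@server:~$ dd if=/data/testfile of=/dev/null bs=1M iflag=direct    # bypasses the page cache where the filesystem honors O_DIRECT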
So who wins?
The honest answer: neither “wins” globally. The operationally useful answer: ARC tends to win when ZFS metadata and ZFS-level intelligence dominate performance, while page cache tends to win when the workload is file-oriented, sequential, and fits the VFS model cleanly.
ARC “wins” when:
- Metadata matters: lots of small files, snapshots, clones, directory traversal, random reads that require deep block tree walks.
- ZFS features are active: compression, checksums, snapshots, recordsize tuning—ARC caches the “real” blocks and metadata in the format ZFS needs.
- You want ZFS to make the decisions: you accept that the filesystem is an active participant, not a passive byte store.
Linux page cache “wins” when:
- Workload is streaming: large sequential reads, media pipelines, backup reads, log shipping, big ETL scans.
- Applications expect Linux VM semantics: a lot of software is tuned assuming page cache reclaim and cgroup behavior.
- You’re not using ZFS: yes, obvious, but worth stating—the page cache is the default star of the show for ext4/xfs and friends.
The real match is “who cooperates better under pressure”
Most production incidents are not about steady-state performance. They’re about what happens when the workload changes suddenly:
a reindex starts, a backup kicks off, a container scales, a node begins resilvering, a rogue job reads the entire dataset once.
The “winner” is the caching system that degrades gracefully under that change. On Linux with ZFS, that means making ARC a good citizen so the kernel VM doesn’t have to panic-reclaim in the worst possible way.
The “double caching” trap (and when it’s fine)
If you run ZFS on Linux and access files normally, you can end up with caching at multiple layers: ZFS ARC caches blocks and metadata, and the Linux page cache can also hold the same file pages depending on how ZFS integrates with the VFS; mmap’d files are the classic case, because mapped pages live in the page cache and must be kept coherent with ARC.
The details depend on implementation and configuration, but the operational truth is consistent: you can spend RAM twice to remember the same content.
Double caching is not always evil. It can be harmless if RAM is plentiful and the working set is stable. It can be beneficial if each cache holds different “shapes” of data (metadata vs file pages) that help in different ways.
It becomes a problem when memory pressure forces eviction and reclaim in a feedback loop: ARC holds on; VM reclaims page cache; applications fault; more reads happen; ARC grows; repeat.
Joke #2: double caching is like printing the same report twice “just in case.” It sounds prudent until you’re out of paper and the CFO wants the budget spreadsheet now.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption (“Free memory means we’re fine”)
A team I worked with ran a multi-tenant analytics platform on ZFS-backed Linux nodes. The dashboards were comforting: memory usage looked high but stable, swap was low, and the service had been fine for months.
Then a quarterly workload shift hit: a client started running broader scans across older partitions, and a new batch process did aggressive file enumeration for compliance reporting.
Latency spiked. Not just a bit—tail latency went from “nobody complains” to “support tickets every five minutes.” The first response was classic: “But we have RAM. Look, there’s still free memory.”
The second response was also classic: reboot a node. It got better for an hour, then fell off the cliff again.
The wrong assumption was that “free memory” was the leading indicator, and that caches would politely yield when applications needed memory. In reality, the system entered a reclaim storm:
ARC held a large slice, page cache got reclaimed aggressively, and the application’s own memory pressure caused frequent minor faults and re-reads. Meanwhile, ZFS was doing real work: metadata walks, checksum verification, and random I/O that made the disks look guilty.
The fix wasn’t heroic. They capped ARC to leave headroom, watched memory pressure metrics instead of “free,” and changed the batch job schedule so it didn’t collide with the interactive peak.
The real lesson: in a mixed workload, “available memory” and “reclaim behavior under pressure” matter more than the absolute size of RAM.
Mini-story #2: The optimization that backfired (“Let’s add L2ARC, it’ll be faster”)
Another org had a ZFS pool on decent SSDs but still wanted more read performance for a search index. Someone proposed adding a large L2ARC device because “more cache is always better.”
They installed a big NVMe as L2ARC and watched hit rates climb in the first day. Everyone felt smart.
A week later, they started seeing periodic latency spikes during peak query hours. Nothing was saturating CPU. The NVMe wasn’t pegged. The pool disks weren’t maxed.
The spikes were weird: short, sharp, and hard to correlate with a single metric.
The backfire came from two places. First, their working set churned: the search workload had bursts of new terms and periodic full re-rank operations. L2ARC kept getting populated with data that wouldn’t be reused.
Second, the system paid memory overhead to track and feed L2ARC, which tightened memory margins. Under pressure, ARC and the kernel VM started fighting. The symptom wasn’t “slow always,” it was “spiky and unpredictable,” which is the worst kind in corporate SLAs.
They removed L2ARC, increased RAM instead (boring but effective), and tuned recordsize and compression for the index layout. Performance improved and, more importantly, became predictable.
The lesson: L2ARC can help, but it’s not free, and it doesn’t rescue a churn-heavy workload. If your cache device is busy caching things you won’t ask for again, you’ve built a very expensive heater.
Mini-story #3: The boring but correct practice that saved the day (“Measure before you tune”)
A finance-adjacent service (think: batch settlements, strict audit logs) ran on ZFS with a mix of small sync writes and periodic read-heavy reconciliation jobs.
They had a change-management culture that some engineers joked was “slow motion,” but it had one habit I’ve grown to love: before any performance change, they captured a standard evidence bundle.
When they hit a sudden slowdown after a kernel update, they didn’t start with tuning knobs. They pulled their bundle: ARC stats, VM pressure, block device latency, ZFS pool status, and top-level application metrics—same commands, same sampling windows as before.
Within an hour they had something rare in ops: a clean diff.
The diff showed ARC sizing was unchanged, but page cache behavior shifted: dirty writeback timing changed under the new kernel, which interacted with their sync-write pattern and increased write latency.
Because they had baseline evidence, they could say “it’s writeback behavior” rather than “storage is slow,” and they could validate fixes without superstition.
The fix was conservative: adjust dirty writeback settings within safe bounds and ensure the SLOG device health was verified (it had started reporting intermittent latency).
The service stayed stable, auditors stayed calm, and nobody had to declare a war room for three days. The lesson: boring measurement rituals beat exciting tuning guesses.
Practical tasks: commands and interpretation (at least 12)
The goal of these tasks is not to “collect stats.” It’s to answer specific questions: Is ARC dominating memory? Is the page cache thrashing? Are we bottlenecked on disk latency, CPU, or reclaim?
Commands below assume Linux with OpenZFS installed where relevant.
Task 1: See memory reality (not the myth of “free”)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 125Gi 42Gi 2.1Gi 1.2Gi 80Gi 78Gi
Swap: 8Gi 256Mi 7.8Gi
Interpretation: “available” is the key line for “how much could apps get without swapping,” but it’s still a simplification.
A system can show plenty of “available” and still hit reclaim storms if memory demand spikes quickly or if large parts aren’t reclaiming fast enough.
Task 2: Confirm ZFS ARC size, target, and limits
cr0x@server:~$ grep -E '^(size|c_min|c_max|c)\s' /proc/spl/kstat/zfs/arcstats
size 4 68719476736
c 4 75161927680
c_min 4 17179869184
c_max 4 85899345920
Interpretation: size is current ARC size. c is the current ARC target. c_min/c_max are bounds.
If size hugs c_max during busy periods and the box experiences memory pressure, you likely need a cap or more RAM—or a workload change.
Task 3: Watch ARC behavior over time (misses vs hits)
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:00:01 12K 2.4K 19 1.1K 9 1.0K 8 280 2 64G 70G
12:00:02 10K 2.0K 20 900 9 850 8 250 2 64G 70G
12:00:03 11K 2.2K 20 1.0K 9 950 9 260 2 64G 70G
12:00:04 13K 2.5K 19 1.2K 9 1.1K 8 290 2 64G 70G
12:00:05 12K 2.3K 19 1.1K 9 1.0K 8 270 2 64G 70G
Interpretation: Miss percentage alone is not a verdict. Look for changes: did misses jump when the job started? Are data misses high (streaming) or metadata misses high (tree-walking)?
If arcsz is stable but misses spike, your working set may exceed ARC or your access pattern is low-reuse.
Task 4: Identify memory pressure and reclaim storms (VM pressure)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.12 avg60=0.20 avg300=0.18 total=9382321
full avg10=0.00 avg60=0.01 avg300=0.00 total=10231
Interpretation: PSI (Pressure Stall Information) tells you if tasks are stalling because memory can’t be allocated quickly.
Rising some suggests reclaim activity; rising full means the system is frequently unable to proceed—this is where latency blows up.
Task 5: Catch kswapd/direct reclaim symptoms quickly
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 262144 2200000 120000 78000000 0 0 120 900 3200 6400 12 6 78 4 0
3 1 262144 1900000 115000 77000000 0 0 220 2100 4100 9000 14 9 62 15 0
4 2 262144 1700000 112000 76000000 0 64 450 9800 5200 12000 18 12 44 26 0
2 1 262144 1600000 110000 75000000 0 0 300 6000 4700 10500 16 10 55 19 0
1 0 262144 2100000 118000 77000000 0 0 160 1400 3600 7200 13 7 76 4 0
Interpretation: Watch si/so (swap in/out), wa (I/O wait), and the “b” column (blocked processes).
A reclaim storm often shows as rising blocked processes, increased context switching, and erratic I/O.
Task 6: See page cache, slab, and reclaimable memory
cr0x@server:~$ egrep 'MemAvailable|Cached:|Buffers:|SReclaimable:|Slab:' /proc/meminfo
MemAvailable: 81654312 kB
Buffers: 118432 kB
Cached: 79233412 kB
Slab: 3241120 kB
SReclaimable: 2015540 kB
Interpretation: Big Cached is not inherently bad. Big Slab can be normal too.
But if MemAvailable is low while ARC is high, you may be pinching the VM. If MemAvailable is fine but PSI is high, you might have fragmentation, writeback stalls, or cgroup contention.
Task 7: Check ZFS pool health and latent pain (scrub/resilver)
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:34 with 0 errors on Wed Dec 24 03:14:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
errors: No known data errors
Interpretation: A pool in resilver or a scrub during peak can convert “cache discussion” into “I/O scheduler firefight.”
Always check this early; otherwise you’ll tune ARC while the pool is busy doing the moral equivalent of rebuilding an engine while driving.
Task 8: Measure disk latency at the block layer
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
13.22 0.00 7.11 18.90 0.00 60.77
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sda 220.0 18432.0 2.0 0.90 9.20 83.78 70.0 9216.0 18.40 2.10 88.0
nvme0n1 1600.0 102400.0 120.0 6.98 0.40 64.00 800.0 51200.0 0.60 0.90 35.0
Interpretation: High r_await/w_await and high %util suggests device saturation or queueing.
If disks are fine but latency is high at the app, look at memory pressure, CPU, and ZFS internal contention.
Task 9: Inspect ZFS dataset properties that change caching dynamics
cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,primarycache,secondarycache,compression tank/data
NAME PROPERTY VALUE
tank/data compression lz4
tank/data primarycache all
tank/data recordsize 128K
tank/data secondarycache all
Interpretation: recordsize shapes I/O and caching granularity for files. For databases, you often want smaller records (e.g., 16K) to reduce read amplification.
primarycache controls what goes into ARC (all, metadata, none). This is a big lever.
Task 10: Cap ARC safely (persistent module option)
cr0x@server:~$ echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs-arc-max.conf
options zfs zfs_arc_max=34359738368
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-xx-generic
Interpretation: This caps ARC to 32 GiB (value is bytes). You do this when ARC growth threatens application memory or causes reclaim storms.
After reboot, confirm via arcstats. Don’t guess; verify.
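If you need the cap to take effect before a reboot window, the parameter can usually be adjusted at runtime as well. A sketch; ARC shrinks toward the new ceiling gradually rather than instantly, and the modprobe.d entry above is still what makes the cap survive a reboot:
cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats    # confirm the new ceiling, then watch size drift down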
Task 11: Adjust primarycache to protect RAM from streaming reads
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups
cr0x@server:~$ zfs get -o name,property,value primarycache tank/backups
NAME PROPERTY VALUE
tank/backups primarycache metadata
Interpretation: For backup targets that are mostly write-once/read-rarely or streaming, caching only metadata prevents ARC from being polluted by large, low-reuse data.
This single property has saved more mixed-workload boxes than many “performance guides.”
Task 12: Observe ZFS I/O and latency inside the pool
cr0x@server:~$ zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 4.20T 3.05T 1.20K 650 140M 62.0M
raidz2-0 4.20T 3.05T 1.20K 650 140M 62.0M
sda - - 310 160 36.0M 15.0M
sdb - - 290 170 34.0M 15.5M
sdc - - 300 160 35.0M 15.0M
sdd - - 300 160 35.0M 15.0M
Interpretation: This gives you pool-level visibility. If reads are high and ARC miss rate is high, you’re likely disk-bound.
If reads are high but disks aren’t, you might be bouncing in cache layers or throttled elsewhere.
Task 13: Spot dirty writeback and throttling (Linux VM)
cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
Interpretation: These affect when the kernel starts background writeback and when it forces processes to write.
On ZFS systems, you’re juggling ZFS’s own TXG behavior too, so changes here should be tested carefully. If you see periodic write latency spikes, this is a suspect.
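If you do decide to change them, a minimal sketch of making the values persistent; the numbers and file name here are only an illustration, not a recommendation:
cr0x@server:~$ printf 'vm.dirty_background_ratio = 5\nvm.dirty_ratio = 10\n' | sudo tee /etc/sysctl.d/90-writeback.conf
cr0x@server:~$ sudo sysctl --system    # reload sysctl configuration and apply the new values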
Task 14: Validate whether a workload is streaming or reusing (cache usefulness test)
cr0x@server:~$ sudo perf stat -e minor-faults,major-faults,cache-misses -a -- sleep 10
Performance counter stats for 'system wide':
482,120 minor-faults
2,104 major-faults
21,884,110 cache-misses
10.002312911 seconds time elapsed
Interpretation: Major faults indicate page cache misses that required disk I/O. A surge during a job suggests page cache isn’t helping (or is being reclaimed).
This doesn’t isolate ARC vs page cache by itself, but it flags whether the system is repeatedly faulting data into memory.
Fast diagnosis playbook
When performance tanks and everyone has a theory, you need a short sequence that converges quickly. This is the order I use in real incidents because it distinguishes “disk is slow” from “memory is fighting” in minutes.
Step 1: Are we unhealthy or rebuilding?
Check pool status and background work first. If you’re scrubbing/resilvering, you’re not diagnosing a clean system.
cr0x@server:~$ zpool status
cr0x@server:~$ zpool iostat -v 1 5
Decide: If resilver/scrub is active and latency is the complaint, consider rescheduling or throttling before touching ARC/page cache knobs.
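A sketch of what “rescheduling” can look like in practice; scrub pause/resume exists on reasonably recent OpenZFS, while a resilver cannot be paused, only tuned or waited out:
cr0x@server:~$ sudo zpool scrub -p tank    # pause a running scrub; running "zpool scrub tank" again resumes it
cr0x@server:~$ sudo zpool scrub -s tank    # or stop it entirely if the window is truly bad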
Step 2: Is it disk latency or memory pressure?
cr0x@server:~$ iostat -x 1 5
cr0x@server:~$ cat /proc/pressure/memory
cr0x@server:~$ vmstat 1 5
Decide:
If disk awaits are high and util is high, you’re I/O bound. If PSI memory is high and disk isn’t, you’re memory/reclaim bound.
If both are high, you may have a feedback loop: reclaim causes I/O, which causes stalls, which causes more reclaim.
Step 3: Is ARC oversized or misapplied?
cr0x@server:~$ grep -E '^(size|c|c_min|c_max|memory_throttle_count)\s' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ arcstat 1 10
Decide: If ARC is near max and memory pressure exists, cap ARC or restrict caching on streaming datasets (primarycache=metadata) to stop cache pollution.
Step 4: Is the workload fundamentally cacheable?
cr0x@server:~$ arcstat 1 10
cr0x@server:~$ zpool iostat -v 1 10
Decide: If ARC miss rate remains high while throughput is high and reuse is low, stop trying to “win” with caches. Focus on disk throughput, recordsize, and scheduling.
Step 5: Check dataset properties and application mismatch
cr0x@server:~$ zfs get -s local,default recordsize,atime,compression,primarycache tank/data
cr0x@server:~$ ps -eo pid,cmd,%mem --sort=-%mem | head
Decide: If a database is on a dataset with huge recordsize and caching “all,” you might be amplifying random reads and polluting ARC.
Fixing this is often more impactful than micromanaging ARC parameters.
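When the mismatch is real, the fix is a property change plus a data rewrite: recordsize only applies to blocks written after the change, so existing files keep their old layout until they are copied, restored, or rewritten. Dataset name and value below are illustrative:
cr0x@server:~$ sudo zfs set recordsize=16K tank/db
cr0x@server:~$ zfs get -o name,property,value recordsize tank/db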
Common mistakes, symptoms, fixes
Mistake 1: Treating ARC hit rate as the only truth
Symptoms: Team celebrates 95% hit rate while users still see latency spikes; or you panic at 20% hit rate on a backup job.
Fix: Classify the workload: streaming vs reuse. Pair arcstat with disk latency (iostat) and memory pressure (PSI). A low hit rate on streaming reads is normal; high latency is not.
Mistake 2: Letting streaming workloads pollute ARC
Symptoms: During backups or scans, ARC grows, interactive workloads slow down, cache “feels useless” afterward.
Fix: Put streaming/backup datasets on primarycache=metadata (or even none for extreme cases). Consider separate pools if contention is chronic.
Mistake 3: Capping ARC blindly and calling it “tuned”
Symptoms: After reducing ARC, metadata-heavy workloads slow down dramatically; disks become busier; CPU usage rises.
Fix: Cap ARC with intent: leave headroom for apps and page cache, but not so low that ZFS loses metadata working set. Validate with arcstat and end-to-end latency.
Mistake 4: Assuming L2ARC is a free lunch
Symptoms: Added L2ARC improves synthetic benchmarks, but production gets spiky; memory pressure increases; cache device shows constant activity.
Fix: Ensure the workload has reuse. Measure before/after with latency percentiles and disk reads. If churn dominates, remove L2ARC and spend budget on RAM or better primary storage.
Mistake 5: Confusing SLOG/ZIL with read caching
Symptoms: You buy a fast SLOG expecting reads to speed up; nothing changes; you declare ZFS “slow.”
Fix: Use SLOG to improve sync write latency where appropriate. For reads, focus on ARC/L2ARC, recordsize, and device latency.
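Two checks that keep this conversation honest (pool, dataset, and device names below are placeholders): confirm the dataset actually issues sync writes worth accelerating, and remember a SLOG is added as a log vdev, not a cache:
cr0x@server:~$ zfs get -o name,property,value sync tank/db    # standard/always/disabled; only sync writes benefit from a SLOG
cr0x@server:~$ sudo zpool add tank log /dev/nvme2n1           # log vdev (SLOG), not to be confused with "cache" (L2ARC)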
Mistake 6: Ignoring kernel memory pressure signals
Symptoms: Periodic stalls, kswapd activity, occasional OOM kills, but “available memory” looks okay.
Fix: Use PSI (/proc/pressure/memory), vmstat, and logs for OOM/reclaim. Cap ARC, fix runaway processes, and review cgroup memory limits if containers are involved.
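On cgroup v2 hosts you can check both the limit and the pressure for a specific service; a sketch, with the unit name as a placeholder:
cr0x@server:~$ systemctl show -p MemoryMax,MemoryCurrent postgresql.service
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/postgresql.service/memory.pressure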
Checklists / step-by-step plan
Checklist A: New ZFS-on-Linux host (mixed workloads)
- Decide your memory budget: how much RAM must be reserved for applications and how much can be given to filesystem caching? If you can’t answer this, you’re delegating architecture to default heuristics.
- Set an initial ARC cap (conservative, adjust later). Verify it after reboot via /proc/spl/kstat/zfs/arcstats.
- Classify datasets by access pattern: interactive, database, backup/streaming, VM images. Set primarycache accordingly.
- Set recordsize per dataset based on workload (large for sequential files, smaller for random DB-like patterns).
- Establish a baseline evidence bundle: free, meminfo, PSI, arcstat, iostat -x, zpool iostat, and app latency percentiles (a collection sketch follows this list).
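A minimal collection sketch for that bundle; the script path, output directory, and sampling windows are assumptions to adapt, not a standard tool:
cr0x@server:~$ cat /usr/local/bin/evidence-bundle.sh
#!/bin/sh
# Hypothetical helper: capture a point-in-time evidence bundle for later diffing.
OUT="/var/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
free -h                          > "$OUT/free.txt"
cat /proc/meminfo                > "$OUT/meminfo.txt"
cat /proc/pressure/memory        > "$OUT/psi-memory.txt"
cat /proc/spl/kstat/zfs/arcstats > "$OUT/arcstats.txt"
arcstat 1 10                     > "$OUT/arcstat.txt"
iostat -x 1 10                   > "$OUT/iostat.txt"
zpool status -v                  > "$OUT/zpool-status.txt"
zpool iostat -v 1 10             > "$OUT/zpool-iostat.txt"
echo "bundle written to $OUT"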
Checklist B: When someone says “ZFS is eating all my RAM”
- Check PSI memory and swap activity. If there’s no pressure, ARC isn’t your villain—it’s your unused RAM doing useful work.
- Confirm ARC size vs max and whether ARC is shrinking when load changes.
- Identify the polluter workload (often backups, scans, or analytics). Apply primarycache=metadata to that dataset.
- If pressure persists, cap ARC. Then validate that key workloads didn’t regress due to metadata misses.
Checklist C: Performance regression after a change
- Verify pool health, scrub/resilver status, and device errors.
- Compare disk latency (iostat -x) before/after. If latency increased, don’t argue about caches yet.
- Compare memory pressure (PSI) before/after. Kernel changes can shift reclaim/writeback behavior.
- Compare ARC behavior (size, target, misses). If ARC is stable but app is slower, look higher (CPU, locks, app config) or lower (devices).
FAQ
1) Is ARC just “ZFS page cache”?
No. ARC caches ZFS buffers (data and metadata) within the ZFS stack. Linux page cache caches file pages at the VFS level.
They can overlap, but they’re not the same layer or the same policy engine.
2) Why does Linux show “almost no free memory” on a healthy ZFS box?
Because RAM is meant to be used. ARC and the page cache will consume memory to reduce future I/O.
The metric that matters is whether the system can satisfy allocations without stalling or swapping—use “available,” PSI, and observed latency.
3) Should I always cap ARC on Linux?
On dedicated storage appliances, you often let ARC be large. On mixed-use servers (databases, containers, JVMs, build agents), a cap is usually prudent.
The correct cap is workload-specific: leave enough for applications and VM, but keep enough ARC to cache metadata and hot blocks.
4) Does adding more RAM always beat adding L2ARC?
Often, yes. RAM reduces latency and complexity and helps both ARC and the rest of the system.
L2ARC can help when the working set is larger than RAM and has reuse, but it adds overhead and won’t fix churn-heavy workloads.
5) If my database uses direct I/O, does ARC still matter?
It can. Even if user data reads bypass some caching paths, ZFS metadata still matters, and ZFS still manages block I/O, checksums, and layout.
Also, not all DB access patterns are purely direct I/O all the time; and background tasks can hit the filesystem in cacheable ways.
6) What’s the quickest way to tell if I’m I/O-bound or memory-pressure-bound?
Look at iostat -x for device await/util and at PSI (/proc/pressure/memory) plus vmstat for reclaim/swap.
High disk await/util suggests I/O bound; high memory PSI suggests reclaim pressure. If both are high, suspect a feedback loop.
7) When should I set primarycache=metadata?
When the dataset mostly serves streaming reads/writes (backups, media archives, cold logs) and caching file data in ARC would crowd out hotter workloads.
It’s especially useful on multi-tenant systems where one “read everything once” job can evict everyone else’s hot blocks.
8) Why do my benchmarks look amazing after the first run?
Because caches warmed up. That might be Linux page cache, ARC, or both. The second run often measures memory speed, not storage.
If you’re trying to measure storage, design the benchmark to control caching effects rather than pretending they don’t exist.
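One way to control for caches rather than hope: a sketch using fio, assuming it is installed; the file path, size, and runtime are placeholders, and direct=1 behavior on ZFS depends on the release:
cr0x@server:~$ fio --name=randread --filename=/data/fio.test --rw=randread --bs=4k --size=16G --direct=1 --runtime=60 --time_based --group_reporting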
9) Can I “drop caches” safely to test?
You can drop Linux page cache, but doing it on a production box is disruptive and usually a bad idea.
Also, dropping Linux caches doesn’t necessarily reset ARC the same way; you may still be measuring warmed ZFS behavior. Prefer controlled test hosts or synthetic workloads with clear methodology.
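For completeness, the Linux knob and its ZFS caveat; a sketch, not something to run on a production box:
cr0x@server:~$ echo 3 | sudo tee /proc/sys/vm/drop_caches              # drops clean page cache, dentries, and inodes; not an ARC reset
cr0x@server:~$ sudo zpool export tank && sudo zpool import tank        # the blunt way to start ZFS cold (disruptive; test hosts only)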
10) What’s the best single metric to monitor long-term?
For user experience: latency percentiles at the application layer. For system health: memory PSI and disk latency.
ARC stats are useful, but they’re supporting evidence, not the headline.
Conclusion
ZFS ARC and the Linux page cache are not competitors in the simplistic sense; they’re caching systems built at different layers, with different assumptions, and different failure modes.
ARC is brilliant at caching ZFS’s world—blocks, metadata, and the machinery that makes snapshots and checksums feel fast. The page cache is the Linux kernel’s default performance amplifier, designed to be reclaimable and broadly helpful.
What you should care about is not ideological purity (“ZFS should handle it” vs “Linux should handle it”), but operational stability under change.
If you run mixed workloads, leave headroom, prevent cache pollution, measure pressure rather than “free memory,” and tune with evidence instead of folklore.
That’s how you win the cache war: by not turning it into one.