ZFS ARC vs Linux Page Cache: Who Wins and Why You Should Care

Cache wars are rarely won in benchmarks and frequently lost in production at 2 a.m., when the storage graph looks like a seismograph and your database team starts saying “it must be the network.”
ZFS has the ARC. Linux has the page cache. Both are very good at what they do, and both can ruin your day if you assume they’re the same thing with different names.

This piece is about the practical truth: what each cache actually stores, how they behave under pressure, why you sometimes get “double caching,” and how to decide who should hold memory in a mixed workload.
You’ll get history, mechanics, incident stories from real ops patterns, concrete commands, and a fast diagnosis playbook you can use before someone suggests “just add RAM.”

The two caches, in one sentence

The Linux page cache caches file data as the kernel sees it through the VFS, broadly and opportunistically; ZFS ARC caches ZFS objects and metadata inside the ZFS stack, aggressively and with its own rules.
In practice: the page cache is the OS’s default “keep it handy” brain; ARC is ZFS’s “I know what I’ll need next” brain—sometimes smarter, sometimes stubborn.

Joke #1 (quick, relevant): asking “which cache is better” is like asking whether a forklift beats a pickup truck. If you’re moving pallets, you’ll be very impressed with the forklift—until you try to drive it on the freeway.

Interesting facts and short history (that actually matters)

Caching isn’t new, but the details of where and how it’s implemented shape the operational failure modes. A few concrete points that tend to change decisions:

  1. ZFS was born in the Solaris world, where the filesystem and volume manager were tightly integrated. ARC is part of that integrated design, not a bolt-on “Linux-ish” cache.
  2. ARC is adaptive: it’s designed to balance between caching recently used data and frequently used data (roughly MRU vs MFU behavior). That adaptiveness is a feature—and a source of surprises.
  3. ZFS checksums everything and can self-heal with redundancy. That means reads aren’t just “read bytes”; there’s metadata and verification overhead that caching can amplify or reduce.
  4. The Linux page cache is older than many current filesystems. It’s deeply embedded in Linux performance expectations; lots of applications quietly rely on it even if they don’t talk about it.
  5. Historically, ZFS on Linux had to reconcile two memory managers: the Linux VM and ZFS’s own ARC behavior. Modern ZoL/OpenZFS is far better than early days, but the “two brains” model still matters.
  6. L2ARC came later as SSDs became viable read caches. It’s not “more ARC”; it’s an extension with its own costs, including metadata overhead and warm-up time.
  7. ZIL and SLOG are about sync writes, not read caching. People still buy “a fast SLOG” to fix read latency, which is like installing a better mailbox to improve your living room acoustics.
  8. Linux’s direct I/O and bypass paths evolved alongside databases. Many serious DB engines learned to bypass page cache deliberately; ZFS has its own ways to play in that space, but not all combos are happy.
  9. Modern Linux runs with cgroups everywhere. ARC historically didn’t always behave like a “good citizen” inside memory limits; newer releases improved, but ops teams still get bitten by mismatched expectations.

A mental model: what lives where

When an application reads a file, you can imagine a path of layers: app → libc → kernel VFS → filesystem → block layer → disk. Linux page cache sits near the VFS side and caches file pages keyed by inode/page offsets.
ZFS, however, has its own pipeline: DMU (Data Management Unit) objects, dnodes, ARC buffers, vdev I/O scheduler, and the pool. ARC caches at the ZFS level, not merely “file pages.”

That difference has three operational consequences:

  • Metadata wins: ZFS metadata (dnodes, indirect blocks, spacemaps) can be cached in ARC, and caching metadata can radically change random read performance. Linux page cache does cache metadata too, but ZFS metadata is… more.
  • Compression/encryption changes the economics: ARC may hold compressed buffers depending on settings; page cache typically holds decompressed file pages as applications see them (a quick check follows this list).
  • Eviction and pressure signals differ: Linux VM can reclaim page cache under pressure easily. ARC can shrink, but its tuning and heuristics can be misaligned with what the rest of the system thinks is “pressure.”
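
To see how much the compression point above matters on a given box, recent OpenZFS releases expose compressed vs. uncompressed ARC sizes in arcstats (field names can vary slightly between versions; the output below is illustrative):

cr0x@server:~$ grep -E '^(compressed_size|uncompressed_size|overhead_size)\s' /proc/spl/kstat/zfs/arcstats
compressed_size                 4    41875931136
uncompressed_size               4    66433551872
overhead_size                   4    2147483648

If compressed_size sits well below uncompressed_size, ARC is effectively remembering more data than its RAM footprint suggests; the page cache never gets that benefit, because it holds pages in the form applications read them.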

ZFS ARC deep dive (what it caches and why it’s weird)

What ARC actually caches

ARC is not “a RAM disk for your files.” It caches ZFS buffers: data blocks and metadata blocks as they exist inside the ZFS pipeline. This includes file data, but also the scaffolding needed to find file data quickly:
dnode structures, indirect block pointers, and more.

In operational terms, ARC is why a ZFS pool that felt like it was “on fire” during a metadata-heavy scan suddenly calms down once warmed: the second pass is largely pointer chasing in RAM instead of on disk.

ARC’s adaptive policy (MRU/MFU) and why you see “cache misses” that aren’t failures

ARC balances between “recently used” and “frequently used” buffers. It also tracks ghost lists—things it recently evicted—to learn whether it made the right call. This is clever and usually helpful.
But it means ARC can look like it’s doing work even when it’s “working as designed.”

Operators often misread ARC hit rate as a universal KPI. It’s not. A low hit rate can be fine if your workload is streaming large files once. A high hit rate can be deceptive if you’re accidentally caching a workload that should be sequentially streamed with minimal retention.
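
If you want to see how ARC is currently splitting memory between recency and frequency, the MRU/MFU and ghost-list sizes are also exposed in arcstats. A hedged example (field names are from recent OpenZFS releases and may differ on older ones):

cr0x@server:~$ grep -E '^(mru_size|mru_ghost_size|mfu_size|mfu_ghost_size)\s' /proc/spl/kstat/zfs/arcstats
mru_size                        4    22548578304
mru_ghost_size                  4    8589934592
mfu_size                        4    41875931136
mfu_ghost_size                  4    12884901888

A large MFU share with busy ghost lists usually means ARC is protecting a hot set against a stream of one-time reads, which is exactly the "working as designed" behavior described above.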

ARC sizing: the part that starts arguments

On Linux, ARC competes with everything else: page cache, anonymous memory, slab, and your applications. ARC has limits (zfs_arc_max and friends), and it can shrink under pressure, but the timing matters.
When memory pressure hits fast (batch job starts, huge sort, container spikes), ARC might not shrink fast enough, and the kernel will start reclaiming elsewhere—sometimes the page cache, sometimes anonymous pages—leading to latency spikes or even OOM events.

The practical tuning question is not “how big can ARC be?” It’s “how big can ARC be without destabilizing the rest of the box?”

ARC and write behavior: not the same as page cache

Linux page cache is tightly coupled to writeback and dirty page accounting. ZFS has its own transaction groups (TXGs), dirty data limits, and a pipeline where data is staged, written, and committed.
ARC itself is primarily about reads, but the system memory story includes dirty data in-flight and metadata updates.

This is why “free memory” is the wrong metric on ZFS systems. You care about reclaimability and latency under pressure, not whether the box looks empty in a dashboard.

L2ARC: the SSD cache that behaves like a slow second RAM

L2ARC extends ARC onto fast devices, typically SSDs. It can help random reads when the working set is bigger than RAM and when the access pattern has reuse.
But it’s not magic: L2ARC must be populated (warm-up), it can increase memory pressure due to metadata tracking, and it adds I/O load to the cache device.

In production, the most common L2ARC disappointment is expecting it to fix a workload that is mostly one-time reads. You cannot cache your way out of “I read a petabyte once.”
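
Before blaming or crediting L2ARC, look at what it is actually doing. The relevant counters live in the same arcstats file (illustrative output; a box without L2ARC simply shows zeros):

cr0x@server:~$ grep -E '^l2_(hits|misses|size|hdr_size)\s' /proc/spl/kstat/zfs/arcstats
l2_hits                         4    184323201
l2_misses                       4    921616005
l2_size                         4    412316860416
l2_hdr_size                     4    3221225472

l2_hdr_size is the RAM you pay just to index the cache device; if l2_misses dwarfs l2_hits over days, the device is mostly storing things nobody asks for twice.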

Linux page cache deep dive (what it caches and why it’s everywhere)

The page cache’s job description

The Linux page cache caches file-backed pages: the exact bytes that would be returned to a process reading a file (after filesystem translation). Filesystem metadata and directory entries are cached alongside it in the kernel's dentry and inode (slab) caches.
Its power comes from being the default: almost every filesystem and almost every application benefits automatically.

Reclaim and writeback: page cache is designed to be sacrificed

Linux treats page cache as reclaimable. Under memory pressure, the kernel can drop clean page cache pages quickly. Dirty pages must be written back, which introduces latency and I/O.
That’s why “dirty ratio” tuning can change tail latencies in write-heavy systems.
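
To see writeback pressure directly rather than inferring it from latency, the dirty and writeback counters in /proc/meminfo are a cheap thing to watch (values are illustrative):

cr0x@server:~$ grep -E '^(Dirty|Writeback):' /proc/meminfo
Dirty:            842316 kB
Writeback:         18432 kB

Dirty climbing steadily toward the dirty_ratio ceiling while Writeback stays busy is the classic signature of writers outrunning the backing store. (Data written to ZFS datasets is tracked by ZFS's own dirty-data accounting, not here.)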

The page cache is part of a larger VM ecosystem: kswapd, direct reclaim, dirty throttling, and (in many setups) per-cgroup accounting. That’s why Linux often feels “self-balancing” compared to systems where caches are more isolated.

Read-ahead and sequential workloads

Page cache is good at recognizing sequential access patterns and doing read-ahead. This is a huge win for streaming reads.
It’s also why your benchmark that reads a file twice looks “amazing” the second time—unless you bypass page cache or the file is too large.
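
Read-ahead is also tunable per block device. A quick way to see the current setting (the value is in 512-byte sectors, so 256 means 128 KiB; this mainly matters for conventional filesystems, since ZFS does its own prefetching inside the ARC path):

cr0x@server:~$ sudo blockdev --getra /dev/sda
256

blockdev --setra raises it if a streaming workload needs more, but test it: larger read-ahead on random workloads just wastes cache.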

Direct I/O and database engines

Many databases use direct I/O to avoid double buffering and to control their own caching. That can make sense when the DB buffer cache is mature and the access pattern is random.
But it also moves the burden of caching correctness and tuning into the application, which is great until you run it in a VM with noisy neighbors.

So who wins?

The honest answer: neither “wins” globally. The operationally useful answer: ARC tends to win when ZFS metadata and ZFS-level intelligence dominate performance, while page cache tends to win when the workload is file-oriented, sequential, and fits the VFS model cleanly.

ARC “wins” when:

  • Metadata matters: lots of small files, snapshots, clones, directory traversal, random reads that require deep block tree walks.
  • ZFS features are active: compression, checksums, snapshots, recordsize tuning—ARC caches the “real” blocks and metadata in the format ZFS needs.
  • You want ZFS to make the decisions: you accept that the filesystem is an active participant, not a passive byte store.

Linux page cache “wins” when:

  • Workload is streaming: large sequential reads, media pipelines, backup reads, log shipping, big ETL scans.
  • Applications expect Linux VM semantics: a lot of software is tuned assuming page cache reclaim and cgroup behavior.
  • You’re not using ZFS: yes, obvious, but worth stating—the page cache is the default star of the show for ext4/xfs and friends.

The real match is “who cooperates better under pressure”

Most production incidents are not about steady-state performance. They’re about what happens when the workload changes suddenly:
a reindex starts, a backup kicks off, a container scales, a node begins resilvering, a rogue job reads the entire dataset once.

The “winner” is the caching system that degrades gracefully under that change. On Linux with ZFS, that means making ARC a good citizen so the kernel VM doesn’t have to panic-reclaim in the worst possible way.

The “double caching” trap (and when it’s fine)

If you run ZFS on Linux and access files normally, you can end up with caching at multiple layers: ZFS ARC caches blocks and metadata, and the Linux page cache can also hold file pages depending on the access path; memory-mapped (mmap) files in particular go through the page cache even on ZFS.
The details depend on implementation and configuration, but the operational truth is consistent: you can spend RAM twice to remember the same content.

Double caching is not always evil. It can be harmless if RAM is plentiful and the working set is stable. It can be beneficial if each cache holds different “shapes” of data (metadata vs file pages) that help in different ways.
It becomes a problem when memory pressure forces eviction and reclaim in a feedback loop: ARC holds on; VM reclaims page cache; applications fault; more reads happen; ARC grows; repeat.

Joke #2: double caching is like printing the same report twice “just in case.” It sounds prudent until you’re out of paper and the CFO wants the budget spreadsheet now.
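
Joke aside, when you suspect this feedback loop, watch both caches side by side for a few minutes instead of staring at either alone. A minimal sketch using only files that exist on any ZFS-on-Linux host (stop it with Ctrl-C):

cr0x@server:~$ while sleep 5; do
>   awk '$1 == "size" {printf "ARC %.1f GiB  ", $3/2^30}' /proc/spl/kstat/zfs/arcstats
>   awk '$1 == "Cached:" {c=$2} $1 == "MemAvailable:" {a=$2} END {printf "page cache %.1f GiB  available %.1f GiB\n", c/2^20, a/2^20}' /proc/meminfo
> done
ARC 64.0 GiB  page cache 75.6 GiB  available 77.9 GiB
ARC 64.2 GiB  page cache 74.1 GiB  available 76.3 GiB

If ARC holds steady while page cache and available sink together and PSI rises, you are watching the tug-of-war described above.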

Three corporate-world mini-stories

Mini-story #1: The incident caused by a wrong assumption (“Free memory means we’re fine”)

A team I worked with ran a multi-tenant analytics platform on ZFS-backed Linux nodes. The dashboards were comforting: memory usage looked high but stable, swap was low, and the service had been fine for months.
Then a quarterly workload shift hit: a client started running broader scans across older partitions, and a new batch process did aggressive file enumeration for compliance reporting.

Latency spiked. Not just a bit—tail latency went from “nobody complains” to “support tickets every five minutes.” The first response was classic: “But we have RAM. Look, there’s still free memory.”
The second response was also classic: reboot a node. It got better for an hour, then fell off the cliff again.

The wrong assumption was that “free memory” was the leading indicator, and that caches would politely yield when applications needed memory. In reality, the system entered a reclaim storm:
ARC held a large slice, page cache got reclaimed aggressively, and the application’s own memory pressure caused frequent minor faults and re-reads. Meanwhile, ZFS was doing real work: metadata walks, checksum verification, and random I/O that made the disks look guilty.

The fix wasn’t heroic. They capped ARC to leave headroom, watched memory pressure metrics instead of “free,” and changed the batch job schedule so it didn’t collide with the interactive peak.
The real lesson: in a mixed workload, “available memory” and “reclaim behavior under pressure” matter more than the absolute size of RAM.

Mini-story #2: The optimization that backfired (“Let’s add L2ARC, it’ll be faster”)

Another org had a ZFS pool on decent SSDs but still wanted more read performance for a search index. Someone proposed adding a large L2ARC device because “more cache is always better.”
They installed a big NVMe as L2ARC and watched hit rates climb in the first day. Everyone felt smart.

A week later, they started seeing periodic latency spikes during peak query hours. Nothing was saturating CPU. The NVMe wasn’t pegged. The pool disks weren’t maxed.
The spikes were weird: short, sharp, and hard to correlate with a single metric.

The backfire came from two places. First, their working set churned: the search workload had bursts of new terms and periodic full re-rank operations. L2ARC kept getting populated with data that wouldn’t be reused.
Second, the system paid memory overhead to track and feed L2ARC, which tightened memory margins. Under pressure, ARC and the kernel VM started fighting. The symptom wasn’t “slow always,” it was “spiky and unpredictable,” which is the worst kind in corporate SLAs.

They removed L2ARC, increased RAM instead (boring but effective), and tuned recordsize and compression for the index layout. Performance improved and, more importantly, became predictable.
The lesson: L2ARC can help, but it’s not free, and it doesn’t rescue a churn-heavy workload. If your cache device is busy caching things you won’t ask for again, you’ve built a very expensive heater.

Mini-story #3: The boring but correct practice that saved the day (“Measure before you tune”)

A finance-adjacent service (think: batch settlements, strict audit logs) ran on ZFS with a mix of small sync writes and periodic read-heavy reconciliation jobs.
They had a change-management culture that some engineers joked was “slow motion,” but it had one habit I’ve grown to love: before any performance change, they captured a standard evidence bundle.

When they hit a sudden slowdown after a kernel update, they didn’t start with tuning knobs. They pulled their bundle: ARC stats, VM pressure, block device latency, ZFS pool status, and top-level application metrics—same commands, same sampling windows as before.
Within an hour they had something rare in ops: a clean diff.

The diff showed ARC sizing was unchanged, but page cache behavior shifted: dirty writeback timing changed under the new kernel, which interacted with their sync-write pattern and increased write latency.
Because they had baseline evidence, they could say “it’s writeback behavior” rather than “storage is slow,” and they could validate fixes without superstition.

The fix was conservative: adjust dirty writeback settings within safe bounds and ensure the SLOG device health was verified (it had started reporting intermittent latency).
The service stayed stable, auditors stayed calm, and nobody had to declare a war room for three days. The lesson: boring measurement rituals beat exciting tuning guesses.

Practical tasks: commands and interpretation (at least 12)

The goal of these tasks is not to “collect stats.” It’s to answer specific questions: Is ARC dominating memory? Is the page cache thrashing? Are we bottlenecked on disk latency, CPU, or reclaim?
Commands below assume Linux with OpenZFS installed where relevant.

Task 1: See memory reality (not the myth of “free”)

cr0x@server:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125Gi        42Gi       2.1Gi       1.2Gi        80Gi        78Gi
Swap:            8Gi       256Mi       7.8Gi

Interpretation: “available” is the key column for “how much could apps get without swapping,” but it’s still a simplification. On ZFS-on-Linux hosts, ARC is not counted in “buff/cache” (it generally shows up as “used”), and “available” doesn’t account for ARC’s ability to shrink.
A system can show plenty of “available” and still hit reclaim storms if memory demand spikes faster than the caches give memory back.

Task 2: Confirm ZFS ARC size, target, and limits

cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_min|c_max|c)\s'
size                            4    68719476736
c                               4    75161927680
c_min                           4    17179869184
c_max                           4    85899345920

Interpretation: size is current ARC size. c is the current ARC target. c_min/c_max are bounds.
If size hugs c_max during busy periods and the box experiences memory pressure, you likely need a cap or more RAM—or a workload change.

Task 3: Watch ARC behavior over time (misses vs hits)

cr0x@server:~$ arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:00:01   12K  2.4K     19  1.1K   9  1.0K   8  280    2   64G    70G
12:00:02   10K  2.0K     20  900    9  850    8  250    2   64G    70G
12:00:03   11K  2.2K     20  1.0K   9  950    9  260    2   64G    70G
12:00:04   13K  2.5K     19  1.2K   9  1.1K   8  290    2   64G    70G
12:00:05   12K  2.3K     19  1.1K   9  1.0K   8  270    2   64G    70G

Interpretation: Miss percentage alone is not a verdict. Look for changes: did misses jump when the job started? Are data misses high (streaming) or metadata misses high (tree-walking)?
If arcsz is stable but misses spike, your working set may exceed ARC or your access pattern is low-reuse.

Task 4: Identify memory pressure and reclaim storms (VM pressure)

cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.12 avg60=0.20 avg300=0.18 total=9382321
full avg10=0.00 avg60=0.01 avg300=0.00 total=10231

Interpretation: PSI (Pressure Stall Information) tells you whether tasks are stalling because memory can’t be allocated quickly.
A rising “some” line means tasks are waiting on reclaim at least part of the time; a rising “full” line means the system is frequently unable to make progress at all, and that is where latency blows up.

Task 5: Catch kswapd/direct reclaim symptoms quickly

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 262144 2200000 120000 78000000  0    0   120   900 3200 6400 12  6 78  4  0
 3  1 262144 1900000 115000 77000000  0    0   220  2100 4100 9000 14  9 62 15  0
 4  2 262144 1700000 112000 76000000  0   64   450  9800 5200 12000 18 12 44 26  0
 2  1 262144 1600000 110000 75000000  0    0   300  6000 4700 10500 16 10 55 19  0
 1  0 262144 2100000 118000 77000000  0    0   160  1400 3600 7200 13  7 76  4  0

Interpretation: Watch si/so (swap in/out), wa (I/O wait), and the “b” column (blocked processes).
A reclaim storm often shows as rising blocked processes, increased context switching, and erratic I/O.

Task 6: See page cache, slab, and reclaimable memory

cr0x@server:~$ egrep 'MemAvailable|Cached:|Buffers:|SReclaimable:|Slab:' /proc/meminfo
MemAvailable:   81654312 kB
Buffers:          118432 kB
Cached:         79233412 kB
Slab:            3241120 kB
SReclaimable:    2015540 kB

Interpretation: Big Cached is not inherently bad. Big Slab can be normal too.
But if MemAvailable is low while ARC is high, you may be pinching the VM. If MemAvailable is fine but PSI is high, you might have fragmentation, writeback stalls, or cgroup contention.

Task 7: Check ZFS pool health and latent pain (scrub/resilver)

cr0x@server:~$ zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:11:34 with 0 errors on Tue Dec 24 03:14:11 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

Interpretation: A pool in resilver or a scrub during peak can convert “cache discussion” into “I/O scheduler firefight.”
Always check this early; otherwise you’ll tune ARC while the pool is busy doing the moral equivalent of rebuilding an engine while driving.

Task 8: Measure disk latency at the block layer

cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.22    0.00    7.11   18.90    0.00   60.77

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
sda             220.0   18432.0     2.0   0.90    9.20    83.78    70.0    9216.0   18.40   2.10  88.0
nvme0n1        1600.0  102400.0   120.0   6.98    0.40    64.00   800.0   51200.0    0.60   0.90  35.0

Interpretation: High r_await/w_await combined with high %util suggests device saturation or queueing.
If the disks look fine but latency is high at the app, look at memory pressure, CPU, and ZFS internal contention.

Task 9: Inspect ZFS dataset properties that change caching dynamics

cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,primarycache,secondarycache,compression tank/data
NAME       PROPERTY        VALUE
tank/data  compression     lz4
tank/data  primarycache    all
tank/data  recordsize      128K
tank/data  secondarycache  all

Interpretation: recordsize shapes I/O and caching granularity for files. For databases, you often want smaller records (e.g., 16K) to reduce read amplification.
primarycache controls what goes into ARC (all, metadata, none). This is a big lever.
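
As a hedged example of acting on that lever for a database dataset (tank/db is a hypothetical name; recordsize only applies to blocks written after the change, so existing data keeps its old block size until rewritten):

cr0x@server:~$ sudo zfs set recordsize=16K tank/db
cr0x@server:~$ zfs get -o name,property,value recordsize tank/db
NAME     PROPERTY    VALUE
tank/db  recordsize  16K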

Task 10: Cap ARC safely (persistent module option)

cr0x@server:~$ echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs-arc-max.conf
options zfs zfs_arc_max=34359738368

cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-xx-generic

Interpretation: This caps ARC to 32 GiB (value is bytes). You do this when ARC growth threatens application memory or causes reclaim storms.
After reboot, confirm via arcstats. Don’t guess; verify.
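
If you can’t schedule a reboot right away, the same cap can usually be applied at runtime through the module parameter; ARC then shrinks toward the new limit as it evicts (same 32 GiB value, still verify in arcstats):

cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
34359738368
cr0x@server:~$ grep -E '^(size|c_max)\s' /proc/spl/kstat/zfs/arcstats
size                            4    48318382080
c_max                           4    34359738368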

Task 11: Adjust primarycache to protect RAM from streaming reads

cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups
cr0x@server:~$ zfs get -o name,property,value primarycache tank/backups
NAME          PROPERTY     VALUE
tank/backups  primarycache metadata

Interpretation: For backup targets that are mostly write-once/read-rarely or streaming, caching only metadata prevents ARC from being polluted by large, low-reuse data.
This single property has saved more mixed-workload boxes than many “performance guides.”

Task 12: Observe ZFS I/O and latency inside the pool

cr0x@server:~$ zpool iostat -v 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        4.20T  3.05T  1.20K   650    140M   62.0M
  raidz2-0  4.20T  3.05T  1.20K   650    140M   62.0M
    sda         -      -   310    160    36.0M  15.0M
    sdb         -      -   290    170    34.0M  15.5M
    sdc         -      -   300    160    35.0M  15.0M
    sdd         -      -   300    160    35.0M  15.0M

Interpretation: This gives you pool-level visibility. If reads are high and ARC miss rate is high, you’re likely disk-bound.
If reads are high but disks aren’t, you might be bouncing in cache layers or throttled elsewhere.

Task 13: Spot dirty writeback and throttling (Linux VM)

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

Interpretation: These affect when the kernel starts background writeback and when it forces processes to write.
On ZFS systems, you’re juggling ZFS’s own TXG behavior too, so changes here should be tested carefully. If you see periodic write latency spikes, this is a suspect.
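
On large-memory hosts, ratio-based thresholds translate into tens of gigabytes of dirty data, so byte-based limits are often easier to reason about. A hedged example (values are illustrative, not recommendations; the *_bytes settings override their *_ratio counterparts when set to non-zero):

cr0x@server:~$ sudo sysctl -w vm.dirty_background_bytes=268435456 vm.dirty_bytes=1073741824
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824

Persist anything that survives testing via /etc/sysctl.d/, and remember that dirty data headed for ZFS datasets is governed separately by ZFS's TXG machinery.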

Task 14: Validate whether a workload is streaming or reusing (cache usefulness test)

cr0x@server:~$ sudo perf stat -e minor-faults,major-faults,cache-misses -a -- sleep 10
 Performance counter stats for 'system wide':

       482,120      minor-faults
         2,104      major-faults
    21,884,110      cache-misses

      10.002312911 seconds time elapsed

Interpretation: Major faults indicate page cache misses that required disk I/O; a surge during a job suggests the page cache isn’t helping (or is being reclaimed). Note that “cache-misses” here is a CPU hardware counter, not a page cache statistic, so treat it as background context.
This doesn’t isolate ARC vs page cache by itself, but it flags whether the system is repeatedly faulting data into memory.
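
A cruder but more direct reuse check is to read the same file twice and compare wall-clock times (the path below is hypothetical; pick a file representative of the workload and big enough to be meaningful):

cr0x@server:~$ time dd if=/tank/data/bigfile of=/dev/null bs=1M status=none

real    0m41.283s
user    0m0.412s
sys     0m12.907s

cr0x@server:~$ time dd if=/tank/data/bigfile of=/dev/null bs=1M status=none

real    0m6.114s
user    0m0.399s
sys     0m5.871s

A large second-run speedup means the data fit in cache and the workload rewards caching; near-identical times mean the file exceeds what ARC will keep, and no amount of cache tuning will rescue a low-reuse pattern.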

Fast diagnosis playbook

When performance tanks and everyone has a theory, you need a short sequence that converges quickly. This is the order I use in real incidents because it distinguishes “disk is slow” from “memory is fighting” in minutes.

Step 1: Are we unhealthy or rebuilding?

Check pool status and background work first. If you’re scrubbing/resilvering, you’re not diagnosing a clean system.

cr0x@server:~$ zpool status
cr0x@server:~$ zpool iostat -v 1 5

Decide: If resilver/scrub is active and latency is the complaint, consider rescheduling or throttling before touching ARC/page cache knobs.

Step 2: Is it disk latency or memory pressure?

cr0x@server:~$ iostat -x 1 5
cr0x@server:~$ cat /proc/pressure/memory
cr0x@server:~$ vmstat 1 5

Decide:
If disk awaits are high and util is high, you’re I/O bound. If PSI memory is high and disk isn’t, you’re memory/reclaim bound.
If both are high, you may have a feedback loop: reclaim causes I/O, which causes stalls, which causes more reclaim.

Step 3: Is ARC oversized or misapplied?

cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c|c_min|c_max|memory_throttle_count)\s'
cr0x@server:~$ arcstat 1 10

Decide: If ARC is near max and memory pressure exists, cap ARC or restrict caching on streaming datasets (primarycache=metadata) to stop cache pollution.

Step 4: Is the workload fundamentally cacheable?

cr0x@server:~$ arcstat 1 10
cr0x@server:~$ zpool iostat -v 1 10

Decide: If ARC miss rate remains high while throughput is high and reuse is low, stop trying to “win” with caches. Focus on disk throughput, recordsize, and scheduling.

Step 5: Check dataset properties and application mismatch

cr0x@server:~$ zfs get -s local,default recordsize,atime,compression,primarycache tank/data
cr0x@server:~$ ps -eo pid,cmd,%mem --sort=-%mem | head

Decide: If a database is on a dataset with huge recordsize and caching “all,” you might be amplifying random reads and polluting ARC.
Fixing this is often more impactful than micromanaging ARC parameters.

Common mistakes, symptoms, fixes

Mistake 1: Treating ARC hit rate as the only truth

Symptoms: Team celebrates 95% hit rate while users still see latency spikes; or you panic at 20% hit rate on a backup job.

Fix: Classify the workload: streaming vs reuse. Pair arcstat with disk latency (iostat) and memory pressure (PSI). A low hit rate on streaming reads is normal; high latency is not.

Mistake 2: Letting streaming workloads pollute ARC

Symptoms: During backups or scans, ARC grows, interactive workloads slow down, cache “feels useless” afterward.

Fix: Put streaming/backup datasets on primarycache=metadata (or even none for extreme cases). Consider separate pools if contention is chronic.

Mistake 3: Capping ARC blindly and calling it “tuned”

Symptoms: After reducing ARC, metadata-heavy workloads slow down dramatically; disks become busier; CPU usage rises.

Fix: Cap ARC with intent: leave headroom for apps and page cache, but not so low that ZFS loses metadata working set. Validate with arcstat and end-to-end latency.

Mistake 4: Assuming L2ARC is a free lunch

Symptoms: Added L2ARC improves synthetic benchmarks, but production gets spiky; memory pressure increases; cache device shows constant activity.

Fix: Ensure the workload has reuse. Measure before/after with latency percentiles and disk reads. If churn dominates, remove L2ARC and spend budget on RAM or better primary storage.

Mistake 5: Confusing SLOG/ZIL with read caching

Symptoms: You buy a fast SLOG expecting reads to speed up; nothing changes; you declare ZFS “slow.”

Fix: Use SLOG to improve sync write latency where appropriate. For reads, focus on ARC/L2ARC, recordsize, and device latency.

Mistake 6: Ignoring kernel memory pressure signals

Symptoms: Periodic stalls, kswapd activity, occasional OOM kills, but “available memory” looks okay.

Fix: Use PSI (/proc/pressure/memory), vmstat, and logs for OOM/reclaim. Cap ARC, fix runaway processes, and review cgroup memory limits if containers are involved.

Checklists / step-by-step plan

Checklist A: New ZFS-on-Linux host (mixed workloads)

  1. Decide your memory budget. How much RAM must be reserved for applications and how much can be given to filesystem caching?
    If you can’t answer this, you’re delegating architecture to default heuristics.
  2. Set an initial ARC cap (conservative, adjust later). Verify it after reboot via /proc/spl/kstat/zfs/arcstats.
  3. Classify datasets by access pattern: interactive, database, backup/streaming, VM images. Set primarycache accordingly.
  4. Set recordsize per dataset based on workload (large for sequential files, smaller for random DB-like patterns).
  5. Establish a baseline evidence bundle: free, meminfo, PSI, arcstat, iostat -x, zpool iostat, and app latency percentiles.
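
A minimal sketch of such a bundle collector, assuming the tools used throughout this article are installed (the script path and output location are arbitrary choices, not a standard):

cr0x@server:~$ cat /usr/local/bin/evidence-bundle
#!/bin/sh
# Capture the same evidence, the same way, every time (baseline or incident).
OUT="/var/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
free -h                    > "$OUT/free.txt"
cat /proc/meminfo          > "$OUT/meminfo.txt"
cat /proc/pressure/memory  > "$OUT/psi-memory.txt"
arcstat 1 10               > "$OUT/arcstat.txt"
iostat -x 1 10             > "$OUT/iostat.txt"
zpool status -v            > "$OUT/zpool-status.txt"
zpool iostat -v 1 10       > "$OUT/zpool-iostat.txt"
echo "evidence bundle written to $OUT"

Run it before any tuning change and at the start of any incident; as mini-story #3 showed, the value is in the diff.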

Checklist B: When someone says “ZFS is eating all my RAM”

  1. Check PSI memory and swap activity. If there’s no pressure, ARC isn’t your villain—it’s your unused RAM doing useful work.
  2. Confirm ARC size vs max and whether ARC is shrinking when load changes.
  3. Identify the polluter workload (often backups, scans, or analytics). Apply primarycache=metadata to that dataset.
  4. If pressure persists, cap ARC. Then validate that key workloads didn’t regress due to metadata misses.

Checklist C: Performance regression after a change

  1. Verify pool health, scrub/resilver status, and device errors.
  2. Compare disk latency (iostat -x) before/after. If latency increased, don’t argue about caches yet.
  3. Compare memory pressure (PSI) before/after. Kernel changes can shift reclaim/writeback behavior.
  4. Compare ARC behavior (size, target, misses). If ARC is stable but app is slower, look higher (CPU, locks, app config) or lower (devices).

FAQ

1) Is ARC just “ZFS page cache”?

No. ARC caches ZFS buffers (data and metadata) within the ZFS stack. Linux page cache caches file pages at the VFS level.
They can overlap, but they’re not the same layer or the same policy engine.

2) Why does Linux show “almost no free memory” on a healthy ZFS box?

Because RAM is meant to be used. ARC and the page cache will consume memory to reduce future I/O.
The metric that matters is whether the system can satisfy allocations without stalling or swapping—use “available,” PSI, and observed latency.

3) Should I always cap ARC on Linux?

On dedicated storage appliances, you often let ARC be large. On mixed-use servers (databases, containers, JVMs, build agents), a cap is usually prudent.
The correct cap is workload-specific: leave enough for applications and VM, but keep enough ARC to cache metadata and hot blocks.

4) Does adding more RAM always beat adding L2ARC?

Often, yes. RAM reduces latency and complexity and helps both ARC and the rest of the system.
L2ARC can help when the working set is larger than RAM and has reuse, but it adds overhead and won’t fix churn-heavy workloads.

5) If my database uses direct I/O, does ARC still matter?

It can. Even if user data reads bypass some caching paths, ZFS metadata still matters, and ZFS still manages block I/O, checksums, and layout.
Also, not all DB access patterns are purely direct I/O all the time; and background tasks can hit the filesystem in cacheable ways.

6) What’s the quickest way to tell if I’m I/O-bound or memory-pressure-bound?

Look at iostat -x for device await/util and at PSI (/proc/pressure/memory) plus vmstat for reclaim/swap.
High disk await/util suggests I/O bound; high memory PSI suggests reclaim pressure. If both are high, suspect a feedback loop.

7) When should I set primarycache=metadata?

When the dataset mostly serves streaming reads/writes (backups, media archives, cold logs) and caching file data in ARC would crowd out hotter workloads.
It’s especially useful on multi-tenant systems where one “read everything once” job can evict everyone else’s hot blocks.

8) Why do my benchmarks look amazing after the first run?

Because caches warmed up. That might be Linux page cache, ARC, or both. The second run often measures memory speed, not storage.
If you’re trying to measure storage, design the benchmark to control caching effects rather than pretending they don’t exist.

9) Can I “drop caches” safely to test?

You can drop Linux page cache, but doing it on a production box is disruptive and usually a bad idea.
Also, dropping Linux caches doesn’t necessarily reset ARC the same way; you may still be measuring warmed ZFS behavior. Prefer controlled test hosts or synthetic workloads with clear methodology.
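
On a disposable test host, a hedged example of the usual sequence: this drops the Linux page cache and reclaimable slab but not ARC; to start ZFS cold you generally need to export and re-import the pool (or reload the zfs module):

cr0x@server:~$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
3
cr0x@server:~$ sudo zpool export tank && sudo zpool import tank

Never do this dance on a box serving real traffic; the whole point of warm caches is that they're doing work you'd rather not repeat.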

10) What’s the best single metric to monitor long-term?

For user experience: latency percentiles at the application layer. For system health: memory PSI and disk latency.
ARC stats are useful, but they’re supporting evidence, not the headline.

Conclusion

ZFS ARC and the Linux page cache are not competitors in the simplistic sense; they’re caching systems built at different layers, with different assumptions, and different failure modes.
ARC is brilliant at caching ZFS’s world—blocks, metadata, and the machinery that makes snapshots and checksums feel fast. The page cache is the Linux kernel’s default performance amplifier, designed to be reclaimable and broadly helpful.

What you should care about is not ideological purity (“ZFS should handle it” vs “Linux should handle it”), but operational stability under change.
If you run mixed workloads, leave headroom, prevent cache pollution, measure pressure rather than “free memory,” and tune with evidence instead of folklore.
That’s how you win the cache war: by not turning it into one.
