ZFS ARC Sizing: When Too Much Cache Slows Down Everything Else

ZFS has a superpower: it turns RAM into fewer disk seeks, fewer round trips, and fewer regrets. The ARC (Adaptive Replacement Cache) is why a mediocre disk subsystem can feel like it’s had espresso. It’s also why a perfectly healthy server can suddenly feel like it’s running through wet cement—because ARC is “winning” a memory fight it shouldn’t be fighting.

This is the part most tuning guides skip: ARC doesn’t live alone. It competes with the kernel’s page cache, application heaps, VM ballooning, container limits, filesystem metadata, and plain old “stuff the OS needs to stay alive.” When ARC gets too large (or shrinks too reluctantly), it can cause swapping, direct reclaim stalls, latency spikes, and the kind of cascading failures that make you wonder if your monitoring graphs are performance art.

What ARC is (and what it is not)

ARC is ZFS’s in-memory cache for both file data and metadata. It is not “just a read cache” in the simplistic sense. It’s a multi-list cache (MRU/MFU plus “ghost” lists) designed to adapt to workloads that oscillate between streaming reads and re-reads, and between metadata-heavy and data-heavy access patterns.

ARC also plays a role in how ZFS avoids hitting disk for metadata constantly—think indirect blocks, dnodes, directory traversal, and the general “where is my file, really?” chain of lookups. If you’ve ever seen a pool with plenty of raw IOPS but a workload that still crawls, it’s often metadata latency, not data throughput.

What ARC is not: it is not the Linux page cache (even though it competes with it), it is not a substitute for application memory, and it is not a magic “give me all RAM” lever that makes every workload faster. ARC’s job is to reduce disk I/O. If your bottleneck isn’t disk I/O, ARC can become the world’s most expensive paperweight.

One-sentence joke #1: ARC is like an intern with a very large filing cabinet—useful until they wheel it into the only hallway and block the fire exit.

ARC, dirty data, and why write workloads complicate the story

Write behavior in ZFS is governed more by the transaction group (TXG) mechanism and the dirty data limits than by ARC sizing, but RAM is still the shared battleground. Heavy writers can push dirty data accumulation, ZIL/SLOG activity, and asynchronous writeback into memory-contention territory.

Oversized ARC doesn’t directly “eat dirty data,” but it reduces headroom for everything else, and that changes how the kernel behaves under pressure. When the box is memory-tight, writeback can become bursty, latency becomes spiky, and you end up debugging “storage” when the root cause is memory reclaim.
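If you want to see where the write-side memory budget sits, the dirty data ceiling is exposed as module parameters on Linux. The values below are illustrative (roughly 10% of RAM on a hypothetical 64 GiB host); paths assume OpenZFS on Linux.

cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_dirty_data_max
6871947673
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent
10

That ceiling of in-flight dirty data comes out of the same physical memory that ARC, guests, and applications are fighting over, which is why a "read cache" discussion keeps circling back to the total RAM budget.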

Why too much ARC hurts: the real mechanisms

Let’s be blunt: “more cache is always better” is only true if you have infinite RAM and no other consumers. Production systems are not fantasy novels. They have budgets.

Mechanism 1: ARC vs. the kernel’s own cache and reclaim behavior

On Linux, the kernel page cache does caching too. With ZFS, you can end up with two caches: ARC (inside ZFS) and the page cache (for binaries, libraries, files on non-ZFS filesystems, and notably mmap'd files on ZFS datasets, which are buffered in the page cache in addition to ARC). If ARC claims too much RAM, the kernel may reclaim aggressively elsewhere, leading to stalls.

Symptoms: CPU time in kernel reclaim, higher system load without actual CPU utilization, and processes stuck in uninterruptible sleep (D state). Users interpret this as “disk is slow” because everything waits on memory reclamation that looks like I/O wait.
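A quick way to confirm that picture is to sample the kernel's reclaim counters. Field names differ slightly between kernel versions and the output below is trimmed and illustrative; take two samples a few seconds apart and watch the rate of change, not the absolute numbers.

cr0x@server:~$ grep -E "pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct" /proc/vmstat
pgsteal_kswapd 48211093
pgsteal_direct 9312044
pgscan_kswapd 61422110
pgscan_direct 15873322

A climbing pgscan_direct rate means allocating processes are doing reclaim work themselves instead of making progress, which is exactly the "disk feels slow but isn't" pattern above.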

Mechanism 2: Swap thrashing: the performance tax you keep paying

Swap isn’t evil; uncontrolled swapping is. A box that begins swapping because ARC is too large can experience a feedback loop:

  • Memory pressure rises → kernel swaps out cold application pages.
  • Latency rises → applications stall and time out.
  • Retried work increases → more memory allocations, more pressure.
  • ARC may shrink, but not always fast enough, and not predictably enough for your SLOs.

If you’ve never seen swap thrash in production, you have not lived. It’s like watching a forklift try to do ballet: technically possible, emotionally upsetting.

Mechanism 3: VMs and containers: ARC can starve your guests

On virtualization hosts (Proxmox, bhyve, KVM on ZoL), ARC isn’t “free.” The hypervisor and guests need memory too. Ballooning and overcommit make this worse, because the host can appear fine until it suddenly isn’t, at which point the host starts reclaiming and swapping while the guests also scream for memory. Oversized ARC turns a manageable overcommit into a host-level incident.
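A back-of-the-envelope budget makes this concrete. The numbers below are entirely hypothetical (a 128 GiB host with 96 GiB committed to guests and 8 GiB reserved for the host OS and services); the point is the arithmetic, not the figures: guests and host overhead are reservations, and ARC gets what is left over.

cr0x@server:~$ # hypothetical: 128 GiB total, 96 GiB for guests, 8 GiB host OS/services
cr0x@server:~$ echo $(( (128 - 96 - 8) * 1024 * 1024 * 1024 ))
25769803776

Whether 24 GiB is the right cap depends on the workload; what matters is that it was derived from reservations rather than from whatever ARC felt like taking.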

Mechanism 4: Metadata dominance: ARC grows with what you touch

ARC isn’t uniformly “data.” Some workloads are metadata-heavy: millions of small files, containers unpacking layers, CI systems, language package managers, maildir-like patterns, backups walking trees. ARC will happily fill with metadata that makes those traversals fast—until it displaces memory your database or JVM needed to avoid garbage collection storms.

Mechanism 5: Misreading “free memory” leads to bad decisions

ZFS likes memory. Linux also likes using memory. “Free” memory being small is normal. The question is whether the system can reclaim memory quickly under demand without swapping or stalling. ARC can be tuned to be a good citizen—or it can act like it pays the rent and everyone else is subletting.

Facts & historical context that change how you tune

  1. ARC was designed to outperform classic LRU by adapting between “recently used” and “frequently used” data; it’s not a dumb cache you can reason about with one ratio.
  2. ZFS’s original ecosystem expected big RAM boxes. Early ZFS deployments often lived on systems where “a lot of memory” was the default, and tuning guidance reflected that culture.
  3. On Linux, ZFS lives outside the native VFS caching model in important ways; this is why ARC/page-cache interactions and reclaim behavior are central to performance.
  4. L2ARC exists because disks were slow and RAM was expensive. Today, NVMe changed the math: sometimes “add RAM” beats “add L2ARC,” and sometimes neither matters if you’re CPU-bound.
  5. ARC holds both data and metadata; in metadata-heavy workloads, a modest ARC can outperform a huge ARC that causes swapping.
  6. ARC has historically had tuning pain around shrink behavior under pressure. Improvements have happened over time, but you still shouldn’t assume it will always give memory back exactly when you need it.
  7. Dedup in ZFS is famous for memory hunger because the DDT (dedup table) wants RAM; people conflate “ZFS needs RAM” with “give ARC everything,” which is the wrong lesson.
  8. Virtualization changed the default ARC conversation. In 2010, a “storage server” was often a storage server. In 2025, it’s often a storage server that also runs a small city of VMs.
  9. Compression made caching more valuable because compressed blocks mean more logical data per byte of ARC—but it also means CPU can become the limiting factor first.

Three corporate-world stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

The ticket started innocently: “API latency spikes every day around 10:00.” It was a VM host with ZFS-backed volumes for several services. The on-call saw disks at 20% utilization and assumed “not storage.” The graphs showed memory “mostly used,” but that’s normal on Linux, right?

Then someone noticed the swap graph: a slow upward slope starting at 09:40, hitting a cliff around 10:05. At 10:10, the API pods were restarting due to timeouts. The host load average was high, but CPU wasn’t. That’s the smell of reclaim storms: tasks waiting on memory and I/O, not doing work.

The wrong assumption was subtle: “ARC will shrink when needed.” It did shrink—eventually. But “eventually” was measured in minutes, and minutes are an eternity for customer-facing APIs. The proximate trigger was a scheduled job that traversed a very large tree of small files (metadata buffet), inflating ARC with metadata just as the JVMs wanted heap growth.

The fix wasn’t heroic: cap ARC, and leave headroom for guests and the host. The important change was cultural: they stopped treating ARC as “free performance” and started treating it like a budget line item with an owner.

Mini-story #2: The optimization that backfired

A different team had a “speed up builds” initiative. CI runners were slow unpacking containers and dependencies from a ZFS dataset. Someone read that “ZFS loves RAM,” so they raised the ARC limit to the moon and watched cache hit rates improve in a benchmark.

In production, it was a disaster. Builds got faster for a day, then progressively more erratic. Why? Because the runners were also doing a lot of ephemeral work: short-lived processes, compiler caches, and temp files. The enlarged ARC pushed the host into memory pressure, and the kernel started reclaiming and swapping out exactly the pages the build system needed. The hit rate looked good, but end-to-end job time got worse. Cache hit rate is a vanity metric when your scheduler is paging.

The really painful part: the slowdown didn’t correlate cleanly with “ARC size.” It correlated with concurrency. With one build, the box was fine. With ten, it collapsed. That’s what makes memory pressure incidents so good at wasting senior engineers’ time: you can’t reproduce them on your laptop, and your dashboards don’t scream “ARC did this.”

They rolled back the ARC increase, then made a boring change: they split the CI workload so unpacking and compilation weren’t fighting on the same hosts that were also running other latency-sensitive services. The lesson wasn’t “never increase ARC.” It was: don’t optimize one stage while destabilizing the platform it runs on.

Mini-story #3: The boring but correct practice that saved the day

This one is the opposite of drama. A finance-ish enterprise (the kind that has change windows and feelings about risk) ran ZFS on a set of database hosts. The team had a practice: every host build included explicit ARC caps, swap policy review, and a “memory headroom SLO.” It wasn’t glamorous. It was written down and enforced in config management.

One quarter, the DB team pushed a new feature that increased working set size. On hosts without discipline, that would have meant surprise swapping. Here, the impact was modest: database caches expanded until they hit the headroom boundary, latency ticked up slightly, and alerts fired for “approaching memory budget.”

Because ARC was already capped, the system didn’t enter a death spiral. The DB team saw the alert, tuned their internal cache sizing, and planned a RAM upgrade as part of the next capacity cycle. There was no incident bridge call, no panic, no “why is load 200?” mystery.

That’s the unsexy truth: most great SRE outcomes are boring. The best ARC tuning is the one you did months ago, quietly, and forgot about until it prevented a 3 a.m. outage.

A mental model: what to measure, not what to believe

When people argue about ARC sizing, they’re usually arguing from ideology:

  • Storage folks: “Use RAM for cache!”
  • App folks: “Stop stealing my memory!”
  • SREs: “I don’t care who wins, I care that latency stops spiking.”

Here’s a model that stops the argument: ARC is valuable when it reduces expensive I/O without triggering more expensive memory behavior. Expensive I/O might be random reads on HDDs, or synchronous reads across a network, or metadata lookups that stall your workload. More expensive memory behavior is swapping, reclaim stalls, and cache churn that increases CPU overhead.

Signals that ARC is helping

  • High ARC hit rate and stable latency.
  • Lower disk read IOPS and lower read latency compared to baseline.
  • No swapping, no sustained reclaim pressure.
  • Apps have enough memory to keep their own caches/hot sets resident.

Signals that ARC is hurting

  • Swap in/out activity during normal load.
  • Major page faults rising with no corresponding throughput improvement.
  • Load average rising while CPU idle remains high.
  • I/O wait increases and disks aren’t saturated.
  • Frequent ARC evictions with low hit benefit (cache churn).

Fast diagnosis playbook

This is the “you have 15 minutes before the incident call turns into interpretive shouting” sequence. Do it in order.

1) Confirm it’s memory pressure, not disk saturation

Check swap, reclaim, and stalled tasks. If swap is active and you see reclaim indicators, treat ARC as a suspect immediately.
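On kernels with pressure stall information (PSI) enabled, /proc/pressure/memory answers this in seconds; if the file doesn't exist, PSI isn't available and vmstat (Task 4 below) is the fallback. Output is illustrative.

cr0x@server:~$ cat /proc/pressure/memory
some avg10=8.42 avg60=5.11 avg300=2.03 total=184223344
full avg10=3.90 avg60=2.45 avg300=0.88 total=92811231

Non-trivial "full" averages mean tasks are completely stalled waiting for memory. That is memory pressure, no matter what the disks are doing.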

2) Check ARC size vs. host headroom

Look at current ARC usage, ARC limits, and overall available memory. On VM hosts, also check guest memory allocation and ballooning.

3) Correlate ARC behavior with workload type

Is this metadata-heavy (many files, many stats, container unpack) or data-heavy (large sequential reads)? ARC sizing strategy differs.

4) Look for cache churn

Hit rates alone are not enough. If ARC is constantly evicting and reloading, you’re paying CPU and lock contention for little gain.
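The ARC ghost lists are a useful churn signal: ghost hits mean ZFS is re-requesting blocks it only recently evicted. A minimal sketch against arcstats (sample twice and compare deltas; values illustrative):

cr0x@server:~$ grep -E "mru_ghost_hits|mfu_ghost_hits|evict_skip" /proc/spl/kstat/zfs/arcstats
mru_ghost_hits                  4    1849203
mfu_ghost_hits                  4    922114
evict_skip                      4    48211

Rapidly climbing ghost hits alongside a "good" hit rate usually means the working set doesn't fit and ARC is churning. A bigger ARC might genuinely help, but only if the RAM actually exists to give.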

5) Only then touch knobs

Don’t tune while blind. Collect a small bundle: ARC stats, vmstat, iostat, top, and a 5-minute time window of behavior. Then adjust ARC max conservatively and observe.
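A minimal "before" bundle can be captured in a couple of lines; adjust paths and durations to taste, and assume the standard sysstat and ZFS utilities are installed.

cr0x@server:~$ mkdir -p /tmp/arc-incident && cd /tmp/arc-incident
cr0x@server:/tmp/arc-incident$ cp /proc/spl/kstat/zfs/arcstats arcstats.before
cr0x@server:/tmp/arc-incident$ free -h > free.before
cr0x@server:/tmp/arc-incident$ vmstat 5 60 > vmstat.log & iostat -x 5 60 > iostat.log &

Five minutes of vmstat and iostat at 5-second intervals is usually enough to tell whether a change helped or just moved the pain.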

Practical tasks: commands, outputs, interpretation (12+)

These are real commands you can run on Linux with OpenZFS or on FreeBSD (with minor path/sysctl differences). I’ll call out where it matters. Treat outputs as illustrative; your numbers will differ.

Task 1: Check current ARC size and limits (Linux)

cr0x@server:~$ grep -wE "c_min|c_max|size" /proc/spl/kstat/zfs/arcstats
c_min                            4    4294967296
c_max                            4    34359738368
size                             4    28776239104

Interpretation: ARC max is 32 GiB, min is 4 GiB, current size ~26.8 GiB. If this is a 32 GiB RAM system running databases and VMs, this is not “cache,” it’s a hostile takeover.

Task 2: Check ARC efficiency quickly (arcstat)

cr0x@server:~$ arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:00:01   920    40      4     2   0    38   4     0   0   26.9G  32.0G
12:00:02   880    55      6     1   0    54   6     0   0   27.0G  32.0G
12:00:03   910    48      5     2   0    46   5     0   0   27.1G  32.0G
12:00:04   940    60      6     3   0    57   6     0   0   27.1G  32.0G
12:00:05   905    42      5     1   0    41   5     0   0   27.2G  32.0G

Interpretation: Miss rate is low. That’s good. But if the host is swapping, “good hit rate” is not a free pass. A cache can be effective and still be too large.

Task 3: Check system memory headroom (Linux)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            64Gi        55Gi       1.2Gi       2.0Gi       7.8Gi       3.5Gi
Swap:           16Gi       2.6Gi        13Gi

Interpretation: “available” is 3.5 GiB with swap already used. On a busy system this is a warning sign. ARC might be sitting on memory that the OS and apps need for stability.

Task 4: Detect swap thrash and reclaim pressure (vmstat)

cr0x@server:~$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  1 2684352 120000  80000 900000   20   80   200   150  900 1400 12 18 58 12  0
 6  2 2684800 110000  78000 880000  120  260   400   300 1100 2100 10 22 49 19  0
 7  3 2686000  90000  76000 860000  300  600   800   500 1300 2600  9 25 40 26  0

Interpretation: Non-zero si/so sustained means active swapping. Rising b (blocked procs) with higher wa often means the system is stuck waiting. If this correlates with ARC near max, you’ve found a likely culprit.

Task 5: Confirm disk isn’t the real bottleneck (iostat)

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  %util
nvme0n1          12.0    30.0    900    2200      80.0     0.20    2.10   12.5
nvme1n1          10.0    28.0    850    2100      78.5     0.18    2.00   11.0

Interpretation: Low %util and low await suggest disks aren't saturated. If latency is bad anyway, suspect memory pressure or lock contention, not raw disk throughput.

Task 6: Check ARC breakdown (data vs metadata)

cr0x@server:~$ grep -wE "data_size|metadata_size|dnode_size|dbuf_size" /proc/spl/kstat/zfs/arcstats
data_size                        4    12884901888
metadata_size                    4    13743895347
dnode_size                       4    2147483648
dbuf_size                        4    1073741824

Interpretation: Metadata is huge here, slightly bigger than the cached file data, with dnode and dbuf overhead on top. That's not "wrong," but it hints at a workload like tree-walking, small files, or virtualization images with lots of metadata activity. If you're starving apps, consider capping ARC and/or reducing metadata churn sources.

Task 7: Check for ARC shrink behavior and throttling

cr0x@server:~$ grep -E "arc_no_grow|memory_throttle_count|memory_direct_count|memory_indirect_count" /proc/spl/kstat/zfs/arcstats
arc_no_grow                     4    1
memory_throttle_count           4    129
memory_direct_count             4    4821
memory_indirect_count           4    18342

Interpretation: arc_no_grow is a flag; 1 means ARC is currently refusing to grow because the system looks memory-constrained. memory_direct_count and memory_indirect_count count how often the ARC shrinker was invoked from direct reclaim and from kswapd, and memory_throttle_count indicates ZFS had to throttle due to memory pressure. If these climb during incidents, ARC is both contributing to and suffering from system-wide contention.

Task 8: Find your current ARC tunables (Linux module parameters)

cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_arc_max
34359738368
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_arc_min
4294967296

Interpretation: Values are in bytes. On Linux, a value of 0 means "use the built-in default" (a fraction of total RAM); a nonzero value means someone set it explicitly. Either way, defaults are not gospel.

Task 9: Temporarily lower ARC max (Linux) to stop a fire

cr0x@server:~$ sudo sh -c 'echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max'
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_arc_max
17179869184

Interpretation: This caps future growth; ARC doesn’t instantly drop to the new cap, but it will trend down as it evicts. Watch swap activity and latency. If you need immediate relief, you’re in “incident response” territory—reduce load, restart the worst offender, or move workloads.

Task 10: Persist ARC limits (Linux, modprobe.d)

cr0x@server:~$ sudo tee /etc/modprobe.d/zfs.conf >/dev/null <<'EOF'
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_min=2147483648
EOF
cr0x@server:~$ sudo update-initramfs -u

Interpretation: This makes limits survive reboots. Exact persistence mechanics vary by distro; the point is: don’t rely on “someone will remember the sysfs echo during the next outage.”

Task 11: FreeBSD: check and set ARC limits (sysctl)

cr0x@server:~$ sysctl kstat.zfs.misc.arcstats.size
kstat.zfs.misc.arcstats.size: 28776239104
cr0x@server:~$ sysctl vfs.zfs.arc_max
vfs.zfs.arc_max: 34359738368
cr0x@server:~$ sudo sysctl vfs.zfs.arc_max=17179869184
vfs.zfs.arc_max: 34359738368 -> 17179869184

Interpretation: FreeBSD uses different knobs, but the same operational truth applies: cap ARC to protect overall system health.
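To make that survive a reboot on FreeBSD, the traditional place is /boot/loader.conf. Tunable naming has shifted slightly across FreeBSD and OpenZFS versions (newer releases also expose vfs.zfs.arc.max), so verify the name on your release before persisting it; the line below follows the classic convention.

cr0x@server:~$ sudo sh -c 'echo vfs.zfs.arc_max="17179869184" >> /boot/loader.conf'
cr0x@server:~$ grep arc_max /boot/loader.conf
vfs.zfs.arc_max="17179869184"

The same rule as on Linux applies: a setting that only lives in a running kernel is a future incident with a countdown timer.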

Task 12: Check ZFS dataset recordsize and workload fit (because ARC isn’t magic)

cr0x@server:~$ zfs get recordsize,compression,primarycache tank/vmstore
NAME          PROPERTY      VALUE     SOURCE
tank/vmstore  recordsize    128K      default
tank/vmstore  compression   lz4       local
tank/vmstore  primarycache  all       default

Interpretation: Recordsize affects what’s cached and how efficiently. For VM images or databases, wrong recordsize can inflate cache footprint and increase read amplification. ARC sizing is not a substitute for basic dataset hygiene.

Task 13: Check primarycache/secondarycache policy (targeted relief)

cr0x@server:~$ zfs get primarycache,secondarycache tank/backup
NAME         PROPERTY        VALUE     SOURCE
tank/backup  primarycache    all       default
tank/backup  secondarycache  all       default
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backup
cr0x@server:~$ zfs get primarycache tank/backup
NAME         PROPERTY      VALUE     SOURCE
tank/backup  primarycache  metadata  local

Interpretation: If a dataset is mostly sequential backups that you’ll never reread hot, caching file data is often wasted RAM. Keeping only metadata can preserve directory traversal speed without pinning bulk data in ARC.

Task 14: Check for ZFS prefetch effects (useful or harmful)

cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
cr0x@server:~$ grep -E "prefetch_data_hits|prefetch_data_misses" /proc/spl/kstat/zfs/arcstats
prefetch_data_hits               4    105230
prefetch_data_misses             4    98210

Interpretation: Prefetch can help sequential reads but waste cache for random workloads. If prefetch misses rival or exceed prefetch hits, you might be caching guesses instead of facts.
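If you conclude that prefetch is polluting ARC for a mostly random workload, it can be switched off live and persisted like any other module parameter. Treat this as an experiment you measure before and after, not a default posture.

cr0x@server:~$ sudo sh -c 'echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable'
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
1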

Task 15: Detect “everything is slow” due to blocked tasks

cr0x@server:~$ ps -eo state,pid,comm,wchan:32 --sort=state | head
D   2314  postgres         io_schedule
D   9881  java             balance_pgdat
D  11202  python3          zio_wait
R  21011  top              -

Interpretation: D state with balance_pgdat points toward memory reclaim pressure. If you see this during incidents alongside ARC near max, you’re staring at the motive and the weapon.

Checklists / step-by-step plan

Step-by-step plan: right-size ARC without guesswork

  1. Classify the host: storage appliance, VM host, DB server, general purpose. ARC budgets differ.
  2. Set a headroom target: decide how much RAM must remain available for OS + apps under peak. Write it down.
  3. Measure baseline: collect 30 minutes of ARC stats, swap activity, iostat, and app latency during typical load.
  4. Pick an initial ARC cap: conservative first. For VM hosts and mixed workloads, leaving substantial headroom is usually correct (see the sketch after this list for the unit conversion).
  5. Apply cap temporarily: change zfs_arc_max live, watch for swap/reclaim improvements and any drop in hit rate.
  6. Validate workload impact: check p95/p99 latency, not just throughput. ARC tuning is about tail latency.
  7. Make it persistent: config management, boot-time settings, and a rollback plan.
  8. Re-evaluate after changes: new application versions, new datasets, new VM density—ARC sizing is not set-and-forget.
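For step 4, the annoying part is unit conversion: zfs_arc_max takes bytes. A minimal sketch for a 16 GiB cap (the number is an example derived from your own headroom target, not a recommendation):

cr0x@server:~$ echo $(( 16 * 1024 * 1024 * 1024 ))
17179869184
cr0x@server:~$ sudo sh -c 'echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max'

Then watch the signals from the diagnosis playbook before persisting it via modprobe.d (Task 10).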

Operational checklist: before you change ARC in production

  • Is swap currently active? If yes, reduce risk first: shed load or scale out.
  • Do you have remote access out-of-band? Memory incidents can make SSH “interactive art.”
  • Do you know how to revert the change quickly?
  • Do you have a 10-minute observation window with stable load?
  • Have you captured “before” metrics (ARC size, vmstat, iostat, latency)?

Common mistakes: symptoms and fixes

Mistake 1: “ARC should be as big as possible.”

Symptoms: Swap usage creeps up over days; p99 latency spikes during cron jobs; load average rises while CPU isn’t busy; random OOM kills.

Fix: Cap zfs_arc_max to leave real headroom. On VM hosts, be stricter than on dedicated storage boxes. Validate with vmstat and application latency.

Mistake 2: Tuning based on hit rate alone

Symptoms: Hit rate improves but overall performance worsens; more context switches; more kernel time; user-visible stalls.

Fix: Treat hit rate as a supporting metric. Prioritize swap activity, reclaim pressure, blocked tasks, and tail latency.

Mistake 3: Ignoring metadata-heavy workloads

Symptoms: ARC fills quickly during backup scans, container layer extraction, or file indexing; memory pressure correlates with “lots of file operations,” not with throughput.

Fix: Consider dataset-level controls like primarycache=metadata for bulk datasets, and cap ARC. Also examine workload patterns: can you schedule tree walks off-peak?

Mistake 4: Overcommitting VM hosts without budgeting ARC

Symptoms: Host swaps while guests also balloon; noisy-neighbor incidents; unpredictable pauses.

Fix: Treat ARC as a fixed reservation. Set a cap and keep it stable. Monitor host “available” memory, not just “used.”

Mistake 5: Making changes that don’t persist

Symptoms: System is fine after tuning, then after reboot the incident returns; nobody remembers why.

Fix: Persist via module options (Linux) or loader/sysctl config (FreeBSD), under config management with change history.

Mistake 6: Blaming disks when the real issue is reclaim

Symptoms: “Storage is slow” reports, but iostat shows low utilization and decent latency. Load is high. Many tasks are blocked.

Fix: Look at vmstat, swap, and D state tasks. If reclaim is the issue, ARC sizing is part of the fix.

Mistake 7: Setting ARC min too high

Symptoms: Even under pressure, ARC refuses to shrink enough; OOM risk increases; swap persists.

Fix: Keep zfs_arc_min modest unless you have a dedicated storage appliance with predictable needs.

One-sentence joke #2: Setting ARC min to “never shrink” is like bolting your office chair to the floor—stable, sure, but now you’re doing meetings from the hallway.

FAQ

1) How much RAM should I give to ARC?

There’s no universal number. Start by reserving enough for the OS and your primary workloads under peak (VMs, databases, JVMs). Then allocate what’s left to ARC with a conservative cap. Dedicated storage boxes can afford larger ARC; mixed-use hosts usually cannot.

2) Why does my server show almost no free memory? Is that bad?

Not necessarily. Linux uses RAM aggressively for caches. The danger sign is not “free is low,” it’s “available is low” plus swap activity, reclaim stalls, or latency spikes.

3) Should I disable swap on ZFS systems?

Usually no. A small, controlled swap can prevent abrupt OOM events. The goal is to avoid active thrashing. If swap is heavily used during normal load, fix memory budgeting (including ARC caps) rather than playing whack-a-mole with swap.

4) Does adding L2ARC let me shrink ARC safely?

L2ARC can help read-heavy workloads with a working set larger than RAM, but it’s not a free replacement for ARC. L2ARC still consumes memory for metadata and can add write traffic to SSDs. Shrink ARC to protect system stability; add L2ARC only when you’ve proven read misses are your bottleneck.
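For a rough sense of that memory cost: every block cached in L2ARC keeps a small header in ARC. The exact header size depends on the OpenZFS version (figures in the neighborhood of 70-100 bytes per block are commonly cited for recent releases; older releases were heavier), so treat this as order-of-magnitude arithmetic with an assumed 80 bytes per header, nothing more.

cr0x@server:~$ # hypothetical: 1 TiB of L2ARC filled with 128 KiB blocks, ~80 bytes of header each
cr0x@server:~$ echo $(( (1024 * 1024 * 1024 * 1024 / (128 * 1024)) * 80 / 1024 / 1024 )) MiB
640 MiB
cr0x@server:~$ # the same device filled with 8 KiB blocks
cr0x@server:~$ echo $(( (1024 * 1024 * 1024 * 1024 / (8 * 1024)) * 80 / 1024 / 1024 )) MiB
10240 MiB

Small-record workloads make L2ARC headers expensive, and that memory comes straight out of what ARC and everything else could have used.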

5) Why does ARC not shrink immediately after lowering zfs_arc_max?

ARC eviction is driven by activity and pressure. Lowering the cap changes the target, but ARC still needs to evict buffers over time. If you need immediate relief, reduce load, stop the workload that’s inflating cache, or plan a controlled restart of the heavy memory consumer—because you’re already in incident mode.

6) Is ARC sizing different for HDD pools vs NVMe pools?

Yes, because the value of caching depends on the cost of a miss. HDD random reads are expensive; ARC helps a lot. NVMe misses are cheaper; ARC still helps (especially metadata), but you can hit CPU or memory contention before storage becomes the limiter. Don’t starve the system just to avoid a 200-microsecond NVMe read.

7) How do I know if my workload is metadata-heavy?

Look at ARC breakdown (demand metadata bytes vs demand data bytes), and observe workload patterns: lots of stat(), directory walking, small file opens, container layers, package installs. Metadata-heavy workloads benefit from caching metadata, but they can also balloon ARC quickly.

8) Should I set primarycache=metadata for VM datasets?

Not by default. VM images often benefit from caching data because they re-read blocks. However, for backup datasets, archive datasets, or write-once/read-rarely datasets, primarycache=metadata can reclaim RAM without meaningful performance loss.

9) What’s the safest way to change ARC limits in production?

Change zfs_arc_max incrementally, during stable load, with clear rollback. Observe swap, vmstat, and latency for at least several minutes. Then persist the setting via configuration management and schedule a post-change review.

10) I capped ARC and performance got worse. Did I do something wrong?

Maybe, or maybe you exposed the real bottleneck (disk, network, CPU). If read latency rose and disks are now busier, ARC was masking slow storage. If performance got worse but swap stopped, you traded speed for stability—which might still be correct. The right next step is to address the newly visible constraint rather than re-inflating ARC blindly.

Conclusion

ARC is a performance tool, not a birthright. It should compete for memory, but it shouldn’t win by knocking the rest of the system unconscious. Oversized ARC doesn’t fail loudly; it fails sideways—through swap, reclaim stalls, blocked tasks, and tail latency that makes perfectly healthy disks look guilty.

Size ARC like you’d size anything else in production: define headroom, measure real bottlenecks, make small changes, and persist the boring settings that keep you out of incident bridges. The best cache is the one that accelerates your workload without turning your operating system into a memory arbitration court.
