ZFS ARC per TB: RAM Sizing Without Myths and Forum Religion


The call always starts the same: “We added disks, storage is huge now, and performance is worse. Do we need 1 GB of RAM per TB?”
Then someone forwards a forum screenshot like it’s a notarized capacity plan.

ARC sizing isn’t astrology. It’s cache economics, workload physics, and a little humility about what ZFS is actually doing. If you size RAM
by raw terabytes, you’ll overspend in some environments and still be slow in others. Let’s replace folklore with decisions you can defend in a change review.

Stop using “RAM per TB” as a sizing rule

“1 GB RAM per TB of storage” is the storage equivalent of “restart it and see if it helps.” Sometimes it accidentally works, which is why it survives.
But it’s not a sizing method. It’s a superstition with a unit attached.

ZFS ARC is a cache. Caches don’t size to capacity; they size to working set and latency targets. Your pool could be 20 TB of cold archives
that are written once and read never. Or it could be 20 TB of VMs doing 4K random reads with a small hot set that fits in memory.
Those two worlds have the same “TB” and wildly different “RAM that matters.”

Here’s what the terabyte rule misses:

  • Access pattern beats capacity. Sequential scans can blow through any ARC size; random reads can be transformed by ARC.
  • Metadata has different value than data. Caching metadata can make “slow disk” feel fast without caching much user data.
  • ZFS has knobs that change memory behavior. Recordsize, compression, special vdevs, dnodesize, and primarycache matter.
  • ARC competes with your applications. The best ARC size is the one that doesn’t starve the actual workload.

What you should do instead is boring: pick a performance target, observe your cache hit behavior, validate your working set, and size RAM to meet the target.
If you’re buying hardware, do it with a plan and a rollback.

Joke #1: If you size ARC by terabytes, you’ll eventually buy a server that’s basically a space heater with SATA ports.

ARC: what it is, what it isn’t

ARC is a memory-resident cache with memory pressure awareness

ARC (Adaptive Replacement Cache) lives in RAM. It caches both data and metadata, and it’s designed to adapt between
“recently used” and “frequently used” patterns. That “adaptive” part is real: ARC tracks multiple lists and uses ghost entries
to learn what it wished it had cached.

ARC is also supposed to play reasonably with the OS when memory pressure rises. On Linux, the ARC shrinks based on zfs_arc_max
and pressure signals; on FreeBSD it’s integrated differently but still aims to avoid total starvation. “Supposed to” is doing some work here:
you still need to verify that your OS and ZFS version behave correctly in your environment.

ARC is not “how ZFS keeps your pool consistent”

ZFS consistency comes from copy-on-write, transaction groups, and intent logging (ZIL/SLOG for sync behavior). None of that requires massive ARC.
Yes, ZFS uses memory for metadata and bookkeeping, but the mythical “ZFS needs tons of RAM to not corrupt your data” is nonsense in modern releases.
You can run ZFS with modest RAM and it will still be correct. It may just be slower.

ARC is not a substitute for bad I/O design

If your workload is 90% synchronous writes to a pool of slow disks and you don’t have an appropriate SLOG, doubling ARC won’t save you.
If your bottleneck is CPU (compression, checksums, encryption) or a single-threaded application, ARC won’t save you.
If your pool is 90% full and fragmented, ARC won’t save you. ARC is a cache, not a therapist.

ARC contents are shaped by dataset properties

ZFS caching is not monolithic. Dataset properties influence what gets cached and how painful misses are; a quick survey sketch follows the list:

  • recordsize changes I/O amplification and the granularity of cached blocks.
  • primarycache can be set to all, metadata, or none.
  • compression changes how much “logical” data fits per byte of ARC, and affects CPU.
  • atime and metadata churn can turn reads into write pressure.
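
A minimal, read-only sketch for that survey, assuming a pool named tank (substitute your own pool and dataset names):

# List caching-relevant properties for every filesystem under the pool.
# Read-only; safe to run in production.
zfs get -r -t filesystem -o name,property,value \
    recordsize,compression,atime,primarycache tank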

One quote, because it’s still the cleanest framing:
Hope is not a strategy. — James Cameron

Workload-first sizing: the only sizing model that survives production

Step 1: classify the workload you actually run

“File server” is not a workload description. You need at least this level of detail:

  • I/O type: mostly reads, mostly writes, mixed.
  • Write semantics: sync-heavy (databases, NFS with sync), async-heavy (bulk ingest).
  • Access pattern: random 4K/8K, sequential, many small files, big streaming files.
  • Hot set: roughly how much data is touched repeatedly per hour/day.
  • Latency target: “VMs feel snappy” is not a target; “p95 read latency under 5 ms” is.

Step 2: decide what you want ARC to do

ARC can deliver value in three common ways:

  1. Accelerate random reads by serving hot blocks from RAM.
  2. Accelerate metadata (directory traversals, file opens, small-file workloads, snapshot browsing).
  3. Reduce disk seeks by turning repeated reads into memory hits, freeing disks for writes.

The third one is underrated: sometimes you’re “write slow” because disks are busy doing avoidable reads.
ARC can indirectly speed writes by removing read pressure.

Step 3: pick a starting point that is intentionally conservative

Practical baseline (not per TB, and not a law):

  • Small VM host / small NAS: 16–32 GB RAM, then validate.
  • General virtualization node with SSD pool: 64–128 GB RAM if you expect read locality.
  • Large HDD pools serving active workloads: prioritize metadata caching and consider special vdevs; RAM alone may not scale.
  • Databases with sync writes: ARC helps reads; for writes focus on SLOG and pool topology first.

If you can’t articulate what ARC will cache and why that matters, don’t buy RAM yet. Measure first.

Step 4: understand why “more ARC” can lose

Bigger ARC isn’t free:

  • Memory pressure can push the OS into swapping or reclaim storms. If your hypervisor swaps, your “cache” becomes a slow-motion outage.
  • Warmup time increases after reboots or failovers. Huge ARC means longer time to reach steady-state behavior.
  • Cache pollution from scans/backups can evict what actually matters, especially if you allow data caching everywhere.
  • Kernel memory constraints and fragmentation can become your new weird problem.

A usable mental model: “hot set + metadata + safety margin”

ARC sizing that doesn’t embarrass you in a postmortem tends to follow this logic (a back-of-the-envelope sketch comes after the list):

  • Estimate hot data set: the portion of data repeatedly read within your latency window.
  • Add metadata headroom: directory entries, indirect blocks, dnodes, ZAP objects. This is highly workload dependent.
  • Leave memory for the OS and apps: page cache (if applicable), hypervisor overhead, containers, databases, monitoring agents, and the “stuff you forgot.”
  • Cap ARC: do not let it “win” every time. Your application pays the bill.
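
Here is that back-of-the-envelope math as a tiny bash sketch. Every number is an illustrative assumption, not a recommendation:

# All figures in GB and purely illustrative; replace with your own estimates.
HOT=200                  # data you expect to be re-read within the latency window
META=$((HOT / 8))        # rough metadata headroom; highly workload dependent
OSAPPS=48                # OS, hypervisor, agents, databases, the stuff you forgot
BASE=$((HOT + META + OSAPPS))
echo "RAM target: ~$((BASE * 12 / 10)) GB (base ${BASE} GB plus ~20% margin)"
# With these numbers: base 273 GB, target ~327 GB. Cap ARC below the total so
# the applications, not the cache, win when memory gets tight.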

Interesting facts and history (because myths have origin stories)

  • ARC predates ZFS adoption on Linux by years. The ARC algorithm was described in academia (IBM) before ZFS made it famous in storage circles.
  • The “1 GB per TB” rule likely started as a rough warning for metadata-heavy pools. Early deployments with large directories on slow disks felt awful without enough RAM.
  • Early ZFS versions were more memory hungry and less configurable. Modern OpenZFS has improved memory accounting and added knobs like persistent L2ARC and special vdevs.
  • L2ARC originally wasn’t persistent across reboot. That made it less valuable for systems that rebooted often; persistent L2ARC changed the economics for some shops.
  • ZIL and SLOG are commonly misunderstood as “write cache.” They’re about sync semantics, not accelerating all writes. This confusion fuels bad RAM decisions.
  • Recordsize defaults were chosen for general files, not VMs. A default recordsize like 128K made sense historically, but random I/O workloads often need a different setting.
  • Special vdevs were introduced to fix a specific pain. Putting metadata (and optionally small blocks) on SSD can outperform “just add RAM” on HDD pools.
  • Compression became a default recommendation because CPU got cheap. With modern CPUs, compression can increase effective ARC and reduce I/O—until it becomes your bottleneck.

Fast diagnosis playbook

When someone says “ZFS is slow,” you have about five minutes to avoid an hour of speculation. Here’s the order that tends to pay off.

First: decide if the bottleneck is reads, writes, or latency from sync

  • Check pool I/O and latency (are disks saturated, are ops waiting?).
  • Check if sync writes are dominating (NFS sync, database fsync).
  • Check if you’re seeing cache misses or cache thrash.

Second: verify memory pressure and ARC behavior

  • Is the OS swapping or reclaiming aggressively?
  • Is ARC near its max and still missing heavily?
  • Is ARC shrinking due to pressure?

Third: look for the classic configuration foot-guns

  • Pool too full, fragmentation, and slow HDD vdev topology.
  • Bad recordsize for the workload.
  • Backups/scrubs/resilver colliding with peak workload.
  • L2ARC abuse (huge SSD cache with too little RAM).

Fourth: only then talk about buying RAM

If the pool is IOPS-limited and you have a stable hot set, ARC helps. If the pool is sync-write-limited, fix SLOG/topology.
If the pool is CPU-limited, fix CPU. If the pool is “we filled it to 92%,” fix capacity.

Practical tasks: commands, outputs, and the decision you make

These are field tasks. They’re not theoretical. Run them, read the outputs, and make a decision.
Commands assume OpenZFS on Linux with typical tools; adjust for your distro.

Task 1: Check basic memory pressure (swapping kills more ZFS “performance” than any ARC setting)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        79Gi       3.2Gi       1.1Gi        42Gi        44Gi
Swap:          8.0Gi       2.6Gi       5.4Gi

What it means: Swap is in use. That doesn’t prove an incident, but it’s a smell.
Decision: If swap grows during load or latency spikes, cap ARC and/or add RAM only after confirming the workload needs caching.

Task 2: Identify whether the kernel is thrashing on reclaim

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 270336 3421120  91264 41277440   0   0   120   250 5400 7200 12  6 78  4  0
 5  1 270336 2897408  91264 41300224   0  64    90  1800 6100 9100 18 10 55 17  0
 6  2 271360 2019328  91264 41311232   0 512    60  4200 6900 9800 15 11 42 32  0

What it means: so (swap out) is rising and wa (I/O wait) is high. You’re paying for memory pressure with disk latency.
Decision: Reduce ARC max (or fix the real memory hog) before you do anything else. ZFS can’t cache effectively when the OS is evicting it.

Task 3: Confirm ARC size and whether it’s pinned near max

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:10   820    96     11    12    1    84   10     0    0   88.2G  92.0G
12:01:11   901   210     23    18    2   192   21     0    0   88.3G  92.0G
12:01:12   870   245     28    20    2   225   26     0    0   88.3G  92.0G

What it means: ARC size (arcsz) is close to target (c) and miss% is climbing.
Decision: If this is steady-state and the workload is read-heavy, adding RAM can help. If miss% spikes during backups/scans, fix cache pollution first.

Task 4: Look at detailed ARC breakdown (metadata vs data, and whether you’re paying for it)

cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|c_min|hits|misses|mfu_hits|mru_hits|metadata_size|data_size) '
size                            94701989888
c_min                           4294967296
c_max                           103079215104
hits                            182993948
misses                          23120291
mfu_hits                        119002331
mru_hits                        56211617
metadata_size                   22811942912
data_size                       70100215808

What it means: You’re caching a lot of metadata (~22 GB) and a lot of data (~70 GB). Hits greatly exceed misses overall.
Decision: If latency is still bad, your problem may not be read cache. Move on to pool latency and sync write checks.
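
If you want the overall hit ratio as one number, it can be computed from the same counters. They are cumulative since module load, so treat the result as a rough signal, not a live metric:

# Cumulative ARC hit ratio; the third column in arcstats holds the counter value.
awk '$1 == "hits" {h = $3}
     $1 == "misses" {m = $3}
     END {printf "ARC hit ratio: %.1f%%\n", 100 * h / (h + m)}' /proc/spl/kstat/zfs/arcstats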

Task 5: Check pool health and obvious vdev layout constraints

cr0x@server:~$ zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 04:12:33 with 0 errors on Sun Dec 22 03:10:19 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

What it means: For random IOPS, a RAIDZ2 vdev performs roughly like a single disk, no matter how many disks it contains. Great for capacity and sequential throughput, not for random IOPS.
Decision: If the workload is random I/O (VMs), don’t try to “ARC your way out” of topology. Add vdevs, use mirrors, or move hot workloads to SSD.

Task 6: Check pool fullness (performance cliff is real)

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  78.3T  4.1T   192K   /tank

What it means: The pool is effectively ~95% used. Allocation gets expensive; fragmentation and metaslab behavior get ugly.
Decision: Stop debating ARC. Add capacity or delete/migrate data. Then re-evaluate performance.

Task 7: Spot sync write pressure (the “why are my writes slow?” trap)

cr0x@server:~$ zfs get -o name,property,value sync tank
NAME  PROPERTY  VALUE
tank  sync      standard

What it means: Sync behavior is default. Apps that call fsync or clients that demand sync will force ZIL behavior.
Decision: If you’re running a database or NFS with sync-heavy workload, investigate SLOG and latency, not ARC size.

Task 8: Check whether you even have a SLOG device and what it is

cr0x@server:~$ zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        83.6T  78.3T  5.3T        -         -    41%    93%  1.00x  ONLINE  -
  raidz2-0  83.6T  78.3T  5.3T        -         -    41%  93.6%      -      -  -

What it means: No log vdev is listed. Sync writes land on the main pool.
Decision: If sync write latency is the pain, a proper low-latency SLOG (power-loss protected) may help more than any RAM upgrade.
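
If testing confirms sync latency is the pain, adding a mirrored log vdev is a small change. A sketch with placeholder device names; in real life use stable /dev/disk/by-id paths and devices with power-loss protection:

# Add a mirrored SLOG (device names are placeholders).
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
# Log vdevs can be removed later if they don't pay off;
# use the vdev name that zpool status shows, e.g. mirror-1.
zpool remove tank mirror-1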

Task 9: Measure real-time pool I/O and latency

cr0x@server:~$ zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        78.3T  5.3T   1200    980   210M  145M
  raidz2-0  78.3T  5.3T   1200    980   210M  145M
    sda         -      -    210    160  35.0M  23.5M
    sdb         -      -    195    165  33.1M  24.0M
    sdc         -      -    205    155  34.2M  22.8M
    sdd         -      -    180    170  30.8M  25.1M
    sde         -      -    195    165  33.0M  24.0M
    sdf         -      -    215    165  34.9M  23.9M
----------  -----  -----  -----  -----  -----  -----

What it means: The pool is doing a lot of operations. If these are small random IOs on HDDs, latency is probably high even if bandwidth looks fine.
Decision: If ops are high and the workload is latency sensitive, consider mirrors/more vdevs/SSD tiering before buying RAM.

Task 10: Check dataset properties that directly affect ARC efficiency

cr0x@server:~$ zfs get -o name,property,value recordsize,compression,atime,primarycache tank/vmstore
NAME         PROPERTY      VALUE
tank/vmstore recordsize    128K
tank/vmstore compression   lz4
tank/vmstore atime         on
tank/vmstore primarycache  all

What it means: 128K recordsize and atime=on for a VM store is a common self-own. atime updates add write load; big records inflate random I/O.
Decision: Consider atime=off and a VM-appropriate recordsize (often 16K) after testing. If the hot set is small, ARC will also behave better.
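
A hedged sketch of those per-dataset changes. Test on a non-critical dataset first, and remember that a new recordsize only applies to blocks written after the change:

# Per-dataset tuning; existing blocks keep their old record size until rewritten.
zfs set atime=off tank/vmstore
zfs set recordsize=16K tank/vmstore    # match the guest/application I/O size; 16K is not universal
# Verify what changed and where each value is inherited from.
zfs get -o name,property,value,source recordsize,atime tank/vmstore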

Task 11: See if compression is helping or hurting (and whether CPU is the real bottleneck)

cr0x@server:~$ zfs get -o name,property,value compressratio tank/vmstore
NAME         PROPERTY       VALUE
tank/vmstore compressratio  1.62x

What it means: Compression is effective: 1.62x means you’re saving I/O and fitting more logical data into ARC.
Decision: Keep it unless CPU is pegged. If CPU is saturated, compression may be the bottleneck and more RAM won’t fix it.

Task 12: Verify CPU saturation during “slow storage” complaints

cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

12:05:01 PM  CPU   %usr  %nice   %sys %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
12:05:02 PM  all   58.2    0.0   18.9     1.1   0.0    0.8     0.0     0.0     0.0   21.0
12:05:02 PM   7    96.0    0.0    3.8     0.0   0.0    0.0     0.0     0.0     0.0    0.2

What it means: One CPU is nearly pegged. That can be a single busy thread (checksums, compression, a VM, an interrupt path).
Decision: If storage latency correlates with CPU saturation, adding ARC won’t help. Investigate CPU hotspots and workload distribution.

Task 13: Check L2ARC presence and whether it’s reasonable for your RAM

cr0x@server:~$ zpool status tank | egrep -A3 'cache|special|logs'
        cache
          nvme1n1p1               ONLINE       0     0     0

What it means: You have an L2ARC device. L2ARC is not magic; it consumes ARC space (RAM) for its own headers and adds overhead of its own.
Decision: If RAM is small and L2ARC is large, you can end up slower. Validate with ARC/L2ARC stats before assuming it helps.
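
The numbers you need are in arcstats. A quick sketch to see whether the cache device earns its keep and how much RAM its headers cost (field names as in current OpenZFS; adjust if yours differ):

# L2ARC hit/miss counters plus the RAM consumed by its headers.
grep -E '^l2_(hits|misses|size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats
# If it isn't helping, the cache device can be removed online.
zpool remove tank nvme1n1p1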

Task 14: Check whether ARC is dominated by metadata (a hint that metadata acceleration is the goal)

cr0x@server:~$ arc_summary | egrep 'ARC Size|Most Recently Used|Most Frequently Used|Metadata Size|Data Size'
ARC Size:                                88.3 GiB
Most Recently Used Cache Size:           31.2 GiB
Most Frequently Used Cache Size:         56.1 GiB
Metadata Size:                           21.3 GiB
Data Size:                               65.8 GiB

What it means: A meaningful chunk is metadata. That’s good when your workload is “lots of files,” snapshots, and directory traversal.
Decision: If metadata is the pain and you’re on HDDs, consider special vdevs for metadata before throwing RAM at it indefinitely.
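
For reference, this is roughly what that looks like; device names are placeholders, and the special vdev must be redundant because losing it means losing the pool:

# Add a mirrored special vdev for metadata (placeholder devices).
zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1
# Optionally steer small blocks there too, per dataset (tank/projects is hypothetical).
zfs set special_small_blocks=16K tank/projects
# Only metadata and blocks written after the change land on the special vdev,
# and special vdevs generally can't be removed from pools with raidz vdevs, so plan it.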

Task 15: Find cache pollution culprits (sequential readers flattening your ARC)

cr0x@server:~$ iotop -oPa
Total DISK READ:         422.31 M/s | Total DISK WRITE:         12.05 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>    COMMAND
18322 be/4  backup    410.12 M/s    0.00 B/s  0.00 %  92.14 % tar -cf - /tank/vmstore | ...

What it means: A backup job is streaming reads. That can evict useful cache content unless managed.
Decision: Consider setting primarycache=metadata on backup datasets, using snapshot send/receive patterns, or scheduling/limiting backup I/O.
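
Of those options, the primarycache change is usually the lowest risk. A sketch, with tank/backups as a placeholder dataset:

# Keep metadata cached but stop streaming reads from filling ARC with data blocks.
zfs set primarycache=metadata tank/backups
# Confirm the setting and check it isn't inherited by datasets that should cache data.
zfs get -r -o name,property,value,source primarycache tank/backups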

Task 16: Confirm your ARC cap and adjust safely (temporary test)

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
103079215104
cr0x@server:~$ echo 68719476736 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
68719476736

What it means: You reduced ARC max to 64 GiB (value is bytes). This is a live change in many setups.
Decision: If application latency improves (less swap, less reclaim), keep a lower cap. If read latency worsens and memory pressure stays fine, increase RAM instead.
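
If the lower cap proves itself, make it survive a reboot. On most Linux distros that is a module option; the 64 GiB value below is just the example from this task:

# Persist the cap across reboots (value in bytes); append instead of overwriting
# if /etc/modprobe.d/zfs.conf already carries other options.
echo "options zfs zfs_arc_max=68719476736" | sudo tee /etc/modprobe.d/zfs.conf
# If ZFS loads from the initramfs, regenerate it so the option applies at boot
# (update-initramfs -u on Debian/Ubuntu, dracut -f on RHEL-family systems).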

Joke #2: L2ARC is like a second freezer—useful, until you realize you bought it because you forgot how groceries work.

Three corporate mini-stories (how this goes wrong in real life)

Mini-story 1: The incident caused by a wrong assumption (“RAM per TB” as a procurement spec)

A mid-sized company refreshed their storage nodes for a private cloud. The RFP literally specified memory as “1 GB per TB usable.”
No one challenged it because it sounded technical and came with a number. Procurement loves numbers.

The new nodes showed up with plenty of disk and a respectable amount of RAM—by the terabyte rule. The problem: the workload was VMs,
mostly random reads and writes, hosted on a pair of large RAIDZ2 vdevs per node. The hot working set was small and spiky, and the real limiter
was random write IOPS and latency during snapshot storms.

Within a week, tickets piled up: “VMs freeze,” “database timeouts,” “storage graphs look fine.” The graphs looked fine because they were bandwidth graphs.
Latency was the killer, and ZFS was doing exactly what you’d expect: it couldn’t magic IOPS out of parity RAIDZ vdevs under random load.

The postmortem was uncomfortable. More RAM would have helped some reads, sure. But the outage was driven by write latency and vdev topology.
The fix wasn’t “double memory.” It was “stop putting random-IO VM workloads on a small number of wide RAIDZ vdevs” and “separate workloads by performance class.”

The lesson: a sizing rule that ignores IOPS is a liability. “RAM per TB” is silent about the thing that usually hurts first.

Mini-story 2: The optimization that backfired (L2ARC everywhere, because SSDs are cheap)

Another shop ran a fleet of ZFS-backed virtualization nodes. Someone did the math: NVMe drives were inexpensive,
so they added L2ARC devices to every node. Bigger cache, faster reads. The change got merged with a nice Jira summary and zero measurements.

Within days, read latency got worse under load. Not catastrophically worse, but enough to annoy customers and cause periodic hiccups.
The team blamed the network, then blamed the hypervisor, then blamed “ZFS overhead.”

The actual issue was predictable: the nodes didn’t have enough RAM to support the L2ARC effectively. L2ARC consumes ARC resources for headers and metadata,
and it changes the I/O pattern. The cache devices were large relative to RAM, which increased churn and overhead while still not holding the right hot data.
Under mixed load, they were paying extra work for misses.

Rolling back L2ARC on some nodes improved stability immediately. Later, they reintroduced L2ARC only on nodes with enough RAM and on workloads with
confirmed read locality that exceeded ARC but benefited from a second-tier cache.

The lesson: “SSD cache” is not automatically good. If you don’t know your miss types and your working set, you’re just adding another moving part
that can disappoint you on schedule.

Mini-story 3: The boring but correct practice that saved the day (ARC cap + change discipline)

A team ran a ZFS-based file service for internal builds and artifacts. During a busy release, the service started timing out.
The first responder noticed swap activity and memory reclaim storms. Classic.

They had a boring runbook: “Check memory pressure; cap ARC temporarily; validate application recovery; then tune permanently.”
No heroics, no forum browsing. They reduced zfs_arc_max live, freeing memory for the services that were actually failing.

Latency dropped within minutes. Not because ARC was bad, but because the system had drifted: new agents, more containers, and a bigger build workload
were eating memory. ARC was doing its job—using what it could—until the OS started swapping.

The permanent fix wasn’t “turn off ARC.” It was setting a sane ARC cap, increasing RAM in the next hardware cycle, and adding monitoring
that alerted on swap and ARC shrink events. The service survived the release without another incident.

The lesson: boring guardrails beat clever guesses. ARC is powerful, but it should never be allowed to starve your actual business logic.

Common mistakes: symptom → root cause → fix

1) “ZFS is slow after we added more disks”

Symptom: More capacity, worse latency; bandwidth graphs look okay.

Root cause: New vdev layout increased parity overhead or widened RAIDZ without adding IOPS; fragmentation got worse; pool is now too full.

Fix: Check zpool iostat and pool fullness; add vdevs (IOPS), not just disks (capacity). Keep pool usage under control.

2) “ARC hit ratio is high, but apps still time out”

Symptom: ARC hit% looks decent; latency spikes persist.

Root cause: Sync write latency (ZIL/SLOG), CPU saturation, or a single slow vdev; ARC doesn’t fix write sync semantics.

Fix: Measure sync behavior; validate SLOG; check CPU and per-vdev latency. Solve the actual bottleneck.

3) “We added L2ARC and everything got worse”

Symptom: Higher read latency and more jitter after adding cache SSDs.

Root cause: L2ARC too large relative to RAM; increased overhead; cache churn; wrong workload (no locality).

Fix: Remove or reduce L2ARC; ensure sufficient RAM; confirm benefit with miss statistics before reintroducing.

4) “VMs stutter during backups”

Symptom: Predictable performance drops during backup windows.

Root cause: Sequential reads pollute ARC and compete for disk; scrub/resilver collides with production I/O.

Fix: Limit backup I/O, separate datasets, use primarycache=metadata where appropriate, schedule scrubs carefully.

5) “We have tons of RAM; why is ARC not huge?”

Symptom: ARC size seems capped or smaller than expected.

Root cause: zfs_arc_max set intentionally or by distro defaults; memory pressure; container limits; hugepages interactions.

Fix: Inspect ARC parameters and memory availability; change caps deliberately; don’t starve the OS.

6) “Small file operations are slow on HDD pool”

Symptom: Listing directories, untarring, git operations are painful.

Root cause: Metadata seeks on HDDs; insufficient metadata caching; no special vdev.

Fix: Ensure adequate RAM for metadata; consider special vdev for metadata/small blocks; verify with metadata hit behavior.

7) “Performance tanks when the pool gets near full”

Symptom: It was fine at 60%, awful at 90%.

Root cause: Allocation and fragmentation overhead; metaslabs constrained; RAIDZ write amplification hurts more.

Fix: Add capacity, migrate data, enforce quotas/reservations. Don’t treat 95% full as “normal operations.”

8) “We tuned recordsize and now random reads are worse”

Symptom: Latency increases after changing dataset settings.

Root cause: Recordsize misfit for workload; changed I/O pattern and cache behavior; mismatch with application block size.

Fix: Use workload-appropriate recordsize per dataset; test with representative load; don’t change it globally in panic.

Checklists / step-by-step plan

Plan A: You’re buying hardware and want to size RAM without religion

  1. Write down the workload. VMs? NFS home dirs? Object store? Database? Mixed?
  2. Pick two metrics that matter. Example: p95 read latency and p95 sync write latency.
  3. Decide the hot set hypothesis. “We think 200 GB is read repeatedly during business hours.” Put a number on it.
  4. Pick a conservative RAM baseline. Leave room for OS/apps; plan to cap ARC.
  5. Choose pool topology for IOPS first. Mirrors and more vdevs beat wide RAIDZ for random workloads.
  6. Decide whether metadata acceleration is needed. If yes, consider special vdevs (with redundancy) rather than infinite RAM.
  7. Plan for observability. ARC stats, latency, swap, CPU, and per-vdev metrics from day one.
  8. Run a load test that resembles production. If you can’t, you’re guessing—just admit it and build extra safety margin. A minimal fio sketch follows this list.
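
A minimal fio sketch for step 8, assuming you can carve out a scratch dataset; the path, sizes, job counts, and runtime are placeholders, not recommendations:

# Random-read test against a scratch directory on the pool; ARC effects are
# intentionally included, since that's the system you actually run.
fio --name=randread --directory=/tank/fiotest --rw=randread --bs=4k \
    --size=32G --numjobs=4 --iodepth=16 --ioengine=libaio \
    --time_based --runtime=300 --group_reporting

Compare fio’s reported p95/p99 latency against the target you picked in step 2, not against a synthetic best case.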

Plan B: Production is slow and you need a fix with minimal risk

  1. Check swap and reclaim. If swapping, cap ARC and stabilize.
  2. Check pool fullness. If dangerously full, stop and fix capacity. Everything else is lipstick.
  3. Check latency and sync behavior. Identify whether reads or writes are the pain.
  4. Identify cache pollution. Backups and scans are frequent offenders.
  5. Only then tune dataset properties. Do it per dataset, with rollback notes.
  6. Re-evaluate with the same metrics. If you didn’t measure before, measure after and don’t pretend you proved anything.

Plan C: You suspect you’re under-RAM’d for ARC (and want evidence)

  1. Confirm ARC is at/near max under normal load.
  2. Confirm miss% stays elevated during the slow period.
  3. Confirm the workload is read-heavy and has locality. Random reads repeatedly hitting the same working set, not a scan.
  4. Confirm disks are the bottleneck on misses. If SSDs are already fast enough, ARC benefit may be marginal.
  5. Add RAM or reallocate memory and re-test. Improvement should show in p95 latency, not just “feels better.”

FAQ

1) So how much RAM do I need for ZFS?

Enough to run your workload without swapping, plus enough ARC to materially reduce your expensive reads. Start with a sensible baseline (16–64 GB depending on role),
measure ARC misses and latency, then scale RAM if and only if it reduces your bottleneck.

2) Is “1 GB RAM per TB” ever a useful rule?

As a warning that “large pools with heavy metadata workloads on slow disks need memory,” sure. As a purchase spec, no.
It ignores IOPS, workload, dataset tuning, and the reality that cold data exists.

3) Does more ARC always improve performance?

No. If you’re write-latency bound, CPU bound, topology bound, or suffering cache pollution, more ARC can do nothing or make things worse by increasing memory pressure.

4) Should I cap ARC?

In mixed-use servers (hypervisors, container hosts, boxes running databases), yes—cap it deliberately.
On dedicated storage appliances, you may let it grow, but still validate behavior under pressure and after reboots.

5) What ARC hit ratio is “good”?

“Good” is when application latency meets target. Hit ratio is context. A streaming workload can have low hit% and still be fine.
A VM workload with low hit% and high random read latency will feel terrible.

6) When does L2ARC make sense?

When your working set is larger than RAM but still has locality, and your disks are slow enough that SSD hits matter.
Also when you have enough RAM to feed it. L2ARC is not a band-aid for bad pool topology.

7) Is ZFS “memory hungry” compared to other file systems?

ZFS will happily use available RAM for ARC because it’s useful. That can look “hungry” in dashboards.
The question isn’t “is ARC big,” it’s “is the system stable and is latency improved.”

8) Does compression change RAM needs?

Often, yes—in a good way. Compression can increase effective ARC capacity because more logical data fits per byte cached, and it reduces disk I/O.
But if CPU becomes constrained, you just moved the bottleneck.

9) What about deduplication?

Use it only if you can prove the savings and you understand the memory and performance implications for your version and workload.
If you enable dedup without a plan, you’re volunteering for an incident.

10) Why does performance change after reboot?

ARC starts cold. If your performance depends on a warm cache, you’ll see a post-reboot slow period.
That’s normal cache behavior. Solve it with adequate RAM, workload-aware design, and avoiding “cache-dependent” assumptions for critical paths.

Next steps you can actually do this week

If you want a defensible ARC sizing decision—one that won’t get laughed out of a review—do this in order:

  1. Baseline the system: collect free -h, vmstat, ARC stats, and zpool iostat during a slow period.
  2. Separate the problem: decide whether you’re read-limited, write-limited, or sync-latency-limited. Don’t guess.
  3. Remove obvious self-harm: fix pool fullness, stop swapping, and stop letting backups bulldoze your cache.
  4. Tune per dataset: align recordsize, primarycache, and atime with the workload.
  5. Only then buy RAM: if misses remain high under steady-state locality and disks are the bottleneck, add RAM and verify improvement with the same metrics.

If anyone insists on “RAM per TB,” ask them one question: “Which workload and which latency target?” If they can’t answer, they’re not sizing. They’re reciting.
