The call always starts the same: “We added disks, storage is huge now, and performance is worse. Do we need 1 GB of RAM per TB?”
Then someone forwards a forum screenshot like it’s a notarized capacity plan.
ARC sizing isn’t astrology. It’s cache economics, workload physics, and a little humility about what ZFS is actually doing. If you size RAM
by raw terabytes, you’ll overspend in some environments and still be slow in others. Let’s replace folklore with decisions you can defend in a change review.
Stop using “RAM per TB” as a sizing rule
“1 GB RAM per TB of storage” is the storage equivalent of “restart it and see if it helps.” Sometimes it accidentally works, which is why it survives.
But it’s not a sizing method. It’s a superstition with a unit attached.
ZFS ARC is a cache. Caches don’t size to capacity; they size to working set and latency targets. Your pool could be 20 TB of cold archives
that are written once and read never. Or it could be 20 TB of VMs doing 4K random reads with a small hot set that fits in memory.
Those two worlds have the same “TB” and wildly different “RAM that matters.”
Here’s what the terabyte rule misses:
- Access pattern beats capacity. Sequential scans can blow through any ARC size; random reads can be transformed by ARC.
- Metadata has different value than data. Caching metadata can make “slow disk” feel fast without caching much user data.
- ZFS has knobs that change memory behavior. Recordsize, compression, special vdevs, dnodesize, and primarycache matter.
- ARC competes with your applications. The best ARC size is the one that doesn’t starve the actual workload.
What you should do instead is boring: pick a performance target, observe your cache hit behavior, validate your working set, and size RAM to meet the target.
If you’re buying hardware, do it with a plan and a rollback.
Joke #1: If you size ARC by terabytes, you’ll eventually buy a server that’s basically a space heater with SATA ports.
ARC: what it is, what it isn’t
ARC is a memory-resident cache with memory pressure awareness
ARC (Adaptive Replacement Cache) lives in RAM. It caches both data and metadata, and it’s designed to adapt between
“recently used” and “frequently used” patterns. That “adaptive” part is real: ARC tracks multiple lists and uses ghost entries
to learn what it wished it had cached.
ARC is also supposed to play reasonably with the OS when memory pressure rises. On Linux, the ARC shrinks based on zfs_arc_max
and pressure signals; on FreeBSD it’s integrated differently but still aims to avoid total starvation. “Supposed to” is doing some work here:
you still need to verify that your OS and ZFS version behave correctly in your environment.
ARC is not “how ZFS keeps your pool consistent”
ZFS consistency comes from copy-on-write, transaction groups, and intent logging (ZIL/SLOG for sync behavior). None of that requires massive ARC.
Yes, ZFS uses memory for metadata and bookkeeping, but the mythical “ZFS needs tons of RAM to not corrupt your data” is nonsense in modern releases.
You can run ZFS with modest RAM and it will still be correct. It may just be slower.
ARC is not a substitute for bad I/O design
If your workload is 90% synchronous writes to a pool of slow disks and you don’t have an appropriate SLOG, doubling ARC won’t save you.
If your bottleneck is CPU (compression, checksums, encryption) or a single-threaded application, ARC won’t save you.
If your pool is 90% full and fragmented, ARC won’t save you. ARC is a cache, not a therapist.
ARC contents are shaped by dataset properties
ZFS caching is not monolithic. Dataset properties influence what gets cached and how painful misses are:
- recordsize changes I/O amplification and the granularity of cached blocks.
- primarycache can be set to all, metadata, or none.
- compression changes how much “logical” data fits per byte of ARC, and affects CPU.
- atime and metadata churn can turn reads into write pressure.
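All four are ordinary per-dataset properties, so look before you theorize. A quick survey sketch, assuming a pool named tank and a dataset name that is purely a placeholder:
cr0x@server:~$ zfs get -r -o name,property,value recordsize,primarycache,compression,atime tank
cr0x@server:~$ sudo zfs set atime=off tank/somedataset   # placeholder dataset; usually safe, but confirm nothing relies on access times first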
One quote, because it’s still the cleanest framing:
Hope is not a strategy.
— James Cameron
Workload-first sizing: the only sizing model that survives production
Step 1: classify the workload you actually run
“File server” is not a workload description. You need at least this level of detail:
- I/O type: mostly reads, mostly writes, mixed.
- Write semantics: sync-heavy (databases, NFS with sync), async-heavy (bulk ingest).
- Access pattern: random 4K/8K, sequential, many small files, big streaming files.
- Hot set: roughly how much data is touched repeatedly per hour/day.
- Latency target: “VMs feel snappy” is not a target; “p95 read latency under 5 ms” is.
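If you can’t answer those questions from memory, OpenZFS will tell you what it’s actually being asked to do. A sampling sketch, assuming a pool named tank; run it during a representative busy window, not at 2 a.m.:
cr0x@server:~$ zpool iostat -r tank 10 3   # request size histograms: small random vs large sequential
cr0x@server:~$ zpool iostat -l tank 10 3   # average per-vdev latencies over each 10-second window
cr0x@server:~$ zpool iostat -q tank 10 3   # queue depths: are operations piling up behind the disks?
Ten-second windows smooth out noise without hiding sustained pain; shorten the interval if your problem is bursty.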
Step 2: decide what you want ARC to do
ARC can deliver value in three common ways:
- Accelerate random reads by serving hot blocks from RAM.
- Accelerate metadata (directory traversals, file opens, small file workloads, snapshots browsing).
- Reduce disk seeks by turning repeated reads into memory hits, freeing disks for writes.
The third one is underrated: sometimes you’re “write slow” because disks are busy doing avoidable reads.
ARC can indirectly speed writes by removing read pressure.
Step 3: pick a starting point that is intentionally conservative
Practical baseline (not per TB, and not a law):
- Small VM host / small NAS: 16–32 GB RAM, then validate.
- General virtualization node with SSD pool: 64–128 GB RAM if you expect read locality.
- Large HDD pools serving active workloads: prioritize metadata caching and consider special vdevs; RAM alone may not scale.
- Databases with sync writes: ARC helps reads; for writes focus on SLOG and pool topology first.
If you can’t articulate what ARC will cache and why that matters, don’t buy RAM yet. Measure first.
Step 4: understand why “more ARC” can lose
Bigger ARC isn’t free:
- Memory pressure can push the OS into swapping or reclaim storms. If your hypervisor swaps, your “cache” becomes a slow-motion outage.
- Warmup time increases after reboots or failovers. Huge ARC means longer time to reach steady-state behavior.
- Cache pollution from scans/backups can evict what actually matters, especially if you allow data caching everywhere.
- Kernel memory constraints and fragmentation can become your new weird problem.
A usable mental model: “hot set + metadata + safety margin”
ARC sizing that doesn’t embarrass you in a postmortem tends to follow this logic:
- Estimate hot data set: the portion of data repeatedly read within your latency window.
- Add metadata headroom: directory entries, indirect blocks, dnodes, ZAP objects. This is highly workload dependent.
- Leave memory for the OS and apps: page cache (if applicable), hypervisor overhead, containers, databases, monitoring agents, and the “stuff you forgot.”
- Cap ARC: do not let it “win” every time. Your application pays the bill.
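Worked through, that model turns into arithmetic you can show in a review. A back-of-envelope shell sketch; every number here is a placeholder to replace with your own estimates:
cr0x@server:~$ HOT_GB=200        # estimated hot data set
cr0x@server:~$ META_GB=32        # metadata headroom guess
cr0x@server:~$ APP_GB=48         # OS, hypervisor, containers, agents, the stuff you forgot
cr0x@server:~$ MARGIN_GB=32      # safety margin for drift and bad days
cr0x@server:~$ echo "ARC target: $((HOT_GB + META_GB)) GiB, total RAM target: $((HOT_GB + META_GB + APP_GB + MARGIN_GB)) GiB"
ARC target: 232 GiB, total RAM target: 312 GiB
The point is not the exact figure; it’s that each term is something you can defend or revise.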
Interesting facts and history (because myths have origin stories)
- ARC predates ZFS adoption on Linux by years. The ARC algorithm was published by IBM researchers (Megiddo and Modha) before ZFS made it famous in storage circles.
- The “1 GB per TB” rule likely started as a rough warning for metadata-heavy pools. Early deployments with large directories on slow disks felt awful without enough RAM.
- Early ZFS versions were more memory hungry and less configurable. Modern OpenZFS has improved memory accounting and added knobs like persistent L2ARC and special vdevs.
- L2ARC originally wasn’t persistent across reboot. That made it less valuable for systems that rebooted often; persistent L2ARC changed the economics for some shops.
- ZIL and SLOG are commonly misunderstood as “write cache.” They’re about sync semantics, not accelerating all writes. This confusion fuels bad RAM decisions.
- Recordsize defaults were chosen for general files, not VMs. A default recordsize like 128K made sense historically, but random I/O workloads often need a different setting.
- Special vdevs were introduced to fix a specific pain. Putting metadata (and optionally small blocks) on SSD can outperform “just add RAM” on HDD pools.
- Compression became a default recommendation because CPU got cheap. With modern CPUs, compression can increase effective ARC and reduce I/O—until it becomes your bottleneck.
Fast diagnosis playbook
When someone says “ZFS is slow,” you have about five minutes to avoid an hour of speculation. Here’s the order that tends to pay off.
First: decide if the bottleneck is reads, writes, or latency from sync
- Check pool I/O and latency (are disks saturated, are ops waiting?).
- Check if sync writes are dominating (NFS sync, database fsync).
- Check if you’re seeing cache misses or cache thrash.
Second: verify memory pressure and ARC behavior
- Is the OS swapping or reclaiming aggressively?
- Is ARC near its max and still missing heavily?
- Is ARC shrinking due to pressure?
Third: look for the classic configuration foot-guns
- Pool too full, fragmentation, and slow HDD vdev topology.
- Bad recordsize for the workload.
- Backups/scrubs/resilver colliding with peak workload.
- L2ARC abuse (huge SSD cache with too little RAM).
Fourth: only then talk about buying RAM
If the pool is IOPS-limited and you have a stable hot set, ARC helps. If the pool is sync-write-limited, fix SLOG/topology.
If the pool is CPU-limited, fix CPU. If the pool is “we filled it to 92%,” fix capacity.
Practical tasks: commands, outputs, and the decision you make
These are field tasks. They’re not theoretical. Run them, read the outputs, and make a decision.
Commands assume OpenZFS on Linux with typical tools; adjust for your distro.
Task 1: Check basic memory pressure (swapping kills more ZFS “performance” than any ARC setting)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 125Gi 79Gi 3.2Gi 1.1Gi 42Gi 44Gi
Swap: 8.0Gi 2.6Gi 5.4Gi
What it means: Swap is in use. That doesn’t prove an incident, but it’s a smell.
Decision: If swap grows during load or latency spikes, cap ARC and/or add RAM only after confirming the workload needs caching.
Task 2: Identify whether the kernel is thrashing on reclaim
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 270336 3421120 91264 41277440 0 0 120 250 5400 7200 12 6 78 4 0
5 1 270336 2897408 91264 41300224 0 64 90 1800 6100 9100 18 10 55 17 0
6 2 271360 2019328 91264 41311232 0 512 60 4200 6900 9800 15 11 42 32 0
What it means: so (swap out) is rising and wa (I/O wait) is high. You’re paying for memory pressure with disk latency.
Decision: Reduce ARC max (or fix the real memory hog) before you do anything else. ZFS can’t cache effectively when the OS is evicting it.
Task 3: Confirm ARC size and whether it’s pinned near max
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:10 820 96 11 12 1 84 10 0 0 88.2G 92.0G
12:01:11 901 210 23 18 2 192 21 0 0 88.3G 92.0G
12:01:12 870 245 28 20 2 225 26 0 0 88.3G 92.0G
What it means: ARC size (arcsz) is close to target (c) and miss% is climbing.
Decision: If this is steady-state and the workload is read-heavy, adding RAM can help. If miss% spikes during backups/scans, fix cache pollution first.
Task 4: Look at detailed ARC breakdown (metadata vs data, and whether you’re paying for it)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|c_min|hits|misses|mfu_hits|mru_hits|metadata_size|data_size) '
size 94701989888
c_min 4294967296
c_max 103079215104
hits 182993948
misses 23120291
mfu_hits 119002331
mru_hits 56211617
metadata_size 22811942912
data_size 70100215808
What it means: You’re caching a lot of metadata (~22 GB) and a lot of data (~70 GB). Hits greatly exceed misses overall.
Decision: If latency is still bad, your problem may not be read cache. Move on to pool latency and sync write checks.
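If you want those counters collapsed into a single number, one awk line does it. It reads the same arcstats file; the counters are cumulative since module load, so a long-running host will smooth over recent bad hours:
cr0x@server:~$ awk '/^hits /{h=$NF} /^misses /{m=$NF} END {printf "ARC hit ratio: %.1f%%\n", h*100/(h+m)}' /proc/spl/kstat/zfs/arcstats
ARC hit ratio: 88.8%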
Task 5: Check pool health and obvious vdev layout constraints
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 04:12:33 with 0 errors on Sun Dec 22 03:10:19 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: A single RAIDZ2 vdev behaves like one vdev for IOPS. Great for capacity and sequential throughput, not for random IOPS.
Decision: If the workload is random I/O (VMs), don’t try to “ARC your way out” of topology. Add vdevs, use mirrors, or move hot workloads to SSD.
Task 6: Check pool fullness (performance cliff is real)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 78.3T 4.1T 192K /tank
What it means: The pool is effectively ~95% used. Allocation gets expensive; fragmentation and metaslab behavior get ugly.
Decision: Stop debating ARC. Add capacity or delete/migrate data. Then re-evaluate performance.
Task 7: Spot sync write pressure (the “why are my writes slow?” trap)
cr0x@server:~$ zfs get -o name,property,value sync tank
NAME PROPERTY VALUE
tank sync standard
What it means: Sync behavior is default. Apps that call fsync or clients that demand sync will force ZIL behavior.
Decision: If you’re running a database or NFS with sync-heavy workload, investigate SLOG and latency, not ARC size.
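To see whether sync writes are actually the traffic, rather than merely possible, the ZIL counters help. A sketch assuming OpenZFS on Linux; kstat names and locations shift between versions, so treat the path as a starting point:
cr0x@server:~$ grep -E '^zil_(commit_count|itx_metaslab)' /proc/spl/kstat/zfs/zil
cr0x@server:~$ zpool iostat -w tank 5 2    # latency histograms, including the sync write queues
If the slog counters stay at zero while commit counts climb, sync writes are landing on the main pool, which is exactly what the next task checks.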
Task 8: Check whether you even have a SLOG device and what it is
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 83.6T 78.3T 5.3T - - 41% 93% 1.00x ONLINE -
raidz2-0 83.6T 78.3T 5.3T - - 41% 93.6% - - -
What it means: No log vdev is listed. Sync writes land on the main pool.
Decision: If sync write latency is the pain, a proper low-latency SLOG (power-loss protected) may help more than any RAM upgrade.
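If you do add a SLOG, mirror it and insist on power-loss protection. A sketch with hypothetical device names; use stable /dev/disk/by-id paths in real life:
cr0x@server:~$ sudo zpool add tank log mirror nvme2n1 nvme3n1
cr0x@server:~$ zpool status tank | grep -A3 logs
Size it for a few seconds of sync write burst, not for capacity: the ZIL only holds records that haven’t yet been committed by a transaction group.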
Task 9: Measure real-time pool I/O and latency
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 78.3T 5.3T 1200 980 210M 145M
raidz2-0 78.3T 5.3T 1200 980 210M 145M
sda - - 210 160 35.0M 23.5M
sdb - - 195 165 33.1M 24.0M
sdc - - 205 155 34.2M 22.8M
sdd - - 180 170 30.8M 25.1M
sde - - 195 165 33.0M 24.0M
sdf - - 215 165 34.9M 23.9M
---------- ----- ----- ----- ----- ----- -----
What it means: The pool is doing a lot of operations. If these are small random IOs on HDDs, latency is probably high even if bandwidth looks fine.
Decision: If ops are high and the workload is latency sensitive, consider mirrors/more vdevs/SSD tiering before buying RAM.
Task 10: Check dataset properties that directly affect ARC efficiency
cr0x@server:~$ zfs get -o name,property,value recordsize,compression,atime,primarycache tank/vmstore
NAME PROPERTY VALUE
tank/vmstore recordsize 128K
tank/vmstore compression lz4
tank/vmstore atime on
tank/vmstore primarycache all
What it means: 128K recordsize and atime=on for a VM store is a common self-own. atime updates add write load; big records inflate random I/O.
Decision: Consider atime=off and a VM-appropriate recordsize (often 16K) after testing. If the hot set is small, ARC will also behave better.
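If you act on this, remember that recordsize only applies to blocks written after the change; existing VM images keep their old block size until rewritten or migrated. A sketch with placeholder dataset names:
cr0x@server:~$ sudo zfs set atime=off tank/vmstore
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/vmstore16k   # new dataset for migrated or new VM disks
cr0x@server:~$ zfs get -o name,property,value recordsize,atime tank/vmstore tank/vmstore16k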
Task 11: See if compression is helping or hurting (and whether CPU is the real bottleneck)
cr0x@server:~$ zfs get -o name,property,value compressratio tank/vmstore
NAME PROPERTY VALUE
tank/vmstore compressratio 1.62x
What it means: Compression is effective: 1.62x means you’re saving I/O and fitting more logical data into ARC.
Decision: Keep it unless CPU is pegged. If CPU is saturated, compression may be the bottleneck and more RAM won’t fix it.
Task 12: Verify CPU saturation during “slow storage” complaints
cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server) 12/26/2025 _x86_64_ (32 CPU)
12:05:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:02 PM all 58.2 0.0 18.9 1.1 0.0 0.8 0.0 0.0 0.0 21.0
12:05:02 PM 7 96.0 0.0 3.8 0.0 0.0 0.0 0.0 0.0 0.0 0.2
What it means: One CPU is nearly pegged. That can be a single busy thread (checksums, compression, a VM, an interrupt path).
Decision: If storage latency correlates with CPU saturation, adding ARC won’t help. Investigate CPU hotspots and workload distribution.
Task 13: Check L2ARC presence and whether it’s reasonable for your RAM
cr0x@server:~$ zpool status tank | egrep -A3 'cache|special|logs'
cache
nvme1n1p1 ONLINE 0 0 0
What it means: You have an L2ARC device. L2ARC is not magic; it consumes ARC metadata and can increase read amplification.
Decision: If RAM is small and L2ARC is large, you can end up slower. Validate with ARC/L2ARC stats before assuming it helps.
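Before blaming or crediting the cache device, read its counters. The l2_* entries live in the same arcstats file used earlier; field names drift a little between OpenZFS versions:
cr0x@server:~$ grep -E '^l2_(hits|misses|size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ awk '/^l2_hits /{h=$NF} /^l2_misses /{m=$NF} END {if (h+m) printf "L2ARC hit ratio: %.1f%%\n", h*100/(h+m)}' /proc/spl/kstat/zfs/arcstats
A large l2_hdr_size next to a low hit ratio is a strong hint that the L2ARC is costing RAM without earning it.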
Task 14: Check whether ARC is dominated by metadata (a hint that metadata acceleration is the goal)
cr0x@server:~$ arc_summary | egrep 'ARC Size|Most Recently Used|Most Frequently Used|Metadata Size|Data Size'
ARC Size: 88.3 GiB
Most Recently Used Cache Size: 31.2 GiB
Most Frequently Used Cache Size: 56.1 GiB
Metadata Size: 21.3 GiB
Data Size: 65.8 GiB
What it means: A meaningful chunk is metadata. That’s good when your workload is “lots of files,” snapshots, and directory traversal.
Decision: If metadata is the pain and you’re on HDDs, consider special vdevs for metadata before throwing RAM at it indefinitely.
Task 15: Find cache pollution culprits (sequential readers flattening your ARC)
cr0x@server:~$ iotop -oP
Total DISK READ: 422.31 M/s | Total DISK WRITE: 12.05 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
18322 be/4 backup 410.12 M/s 0.00 B/s 0.00 % 92.14 % tar -cf - /tank/vmstore | ...
What it means: A backup job is streaming reads. That can evict useful cache content unless managed.
Decision: Consider setting primarycache=metadata on backup datasets, using snapshots send/receive patterns, or scheduling/limiting backup I/O.
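If the backup reads a dedicated dataset, the least invasive fix is to stop caching its data blocks at all. A sketch, assuming the backups live under a dataset called tank/backups (a placeholder):
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups
cr0x@server:~$ zfs get -o name,property,value primarycache tank/backups
If the job can’t be isolated onto its own dataset, rate-limit or reschedule it instead.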
Task 16: Confirm your ARC cap and adjust safely (temporary test)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
103079215104
cr0x@server:~$ echo 68719476736 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
68719476736
What it means: You reduced ARC max to 64 GiB (value is bytes). This is a live change in many setups.
Decision: If application latency improves (less swap, less reclaim), keep a lower cap. If read latency worsens and memory pressure stays fine, increase RAM instead.
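The sysfs write does not survive a reboot. To make a validated cap permanent on most Linux distros, set the module parameter; the 64 GiB figure below is just the value from this test:
cr0x@server:~$ # note: this overwrites the file; merge by hand if you already manage zfs options there
cr0x@server:~$ echo "options zfs zfs_arc_max=68719476736" | sudo tee /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736
cr0x@server:~$ sudo update-initramfs -u   # Debian/Ubuntu; other distros rebuild their initramfs differently
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max   # re-check after the next reboot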
Joke #2: L2ARC is like a second freezer—useful, until you realize you bought it because you forgot how groceries work.
Three corporate mini-stories (how this goes wrong in real life)
Mini-story 1: The incident caused by a wrong assumption (“RAM per TB” as a procurement spec)
A mid-sized company refreshed their storage nodes for a private cloud. The RFP literally specified memory as “1 GB per TB usable.”
No one challenged it because it sounded technical and came with a number. Procurement loves numbers.
The new nodes showed up with plenty of disk and a respectable amount of RAM—by the terabyte rule. The problem: the workload was VMs,
mostly random reads and writes, hosted on a pair of large RAIDZ2 vdevs per node. The hot working set was small and spiky, and the real limiter
was random write IOPS and latency during snapshot storms.
Within a week, tickets piled up: “VMs freeze,” “database timeouts,” “storage graphs look fine.” The graphs looked fine because they were bandwidth graphs.
Latency was the killer, and ZFS was doing exactly what you’d expect: it couldn’t magic IOPS out of parity RAIDZ vdevs under random load.
The postmortem was uncomfortable. More RAM would have helped some reads, sure. But the outage was driven by write latency and vdev topology.
The fix wasn’t “double memory.” It was “stop putting random-IO VM workloads on a small number of wide RAIDZ vdevs” and “separate workloads by performance class.”
The lesson: a sizing rule that ignores IOPS is a liability. “RAM per TB” is silent about the thing that usually hurts first.
Mini-story 2: The optimization that backfired (L2ARC everywhere, because SSDs are cheap)
Another shop ran a fleet of ZFS-backed virtualization nodes. Someone did the math: NVMe drives were inexpensive,
so they added L2ARC devices to every node. Bigger cache, faster reads. The change got merged with a nice Jira summary and zero measurements.
Within days, read latency got worse under load. Not catastrophically worse, but enough to annoy customers and cause periodic hiccups.
The team blamed the network, then blamed the hypervisor, then blamed “ZFS overhead.”
The actual issue was predictable: the nodes didn’t have enough RAM to support the L2ARC effectively. L2ARC consumes ARC resources for headers and metadata,
and it changes the I/O pattern. The cache devices were large relative to RAM, which increased churn and overhead while still not holding the right hot data.
Under mixed load, they were paying extra work for misses.
Rolling back L2ARC on some nodes improved stability immediately. Later, they reintroduced L2ARC only on nodes with enough RAM and on workloads with
confirmed read locality that exceeded ARC but benefited from a second-tier cache.
The lesson: “SSD cache” is not automatically good. If you don’t know your miss types and your working set, you’re just adding another moving part
that can disappoint you on schedule.
Mini-story 3: The boring but correct practice that saved the day (ARC cap + change discipline)
A team ran a ZFS-based file service for internal builds and artifacts. During a busy release, the service started timing out.
The first responder noticed swap activity and memory reclaim storms. Classic.
They had a boring runbook: “Check memory pressure; cap ARC temporarily; validate application recovery; then tune permanently.”
No heroics, no forum browsing. They reduced zfs_arc_max live, freeing memory for the services that were actually failing.
Latency dropped within minutes. Not because ARC was bad, but because the system had drifted: new agents, more containers, and a bigger build workload
were eating memory. ARC was doing its job—using what it could—until the OS started swapping.
The permanent fix wasn’t “turn off ARC.” It was setting a sane ARC cap, increasing RAM in the next hardware cycle, and adding monitoring
that alerted on swap and ARC shrink events. The service survived the release without another incident.
The lesson: boring guardrails beat clever guesses. ARC is powerful, but it should never be allowed to starve your actual business logic.
Common mistakes: symptom → root cause → fix
1) “ZFS is slow after we added more disks”
Symptom: More capacity, worse latency; bandwidth graphs look okay.
Root cause: New vdev layout increased parity overhead or widened RAIDZ without adding IOPS; fragmentation got worse; pool is now too full.
Fix: Check zpool iostat and pool fullness; add vdevs (IOPS), not just disks (capacity). Keep pool usage under control.
2) “ARC hit ratio is high, but apps still time out”
Symptom: ARC hit% looks decent; latency spikes persist.
Root cause: Sync write latency (ZIL/SLOG), CPU saturation, or a single slow vdev; ARC doesn’t fix write sync semantics.
Fix: Measure sync behavior; validate SLOG; check CPU and per-vdev latency. Solve the actual bottleneck.
3) “We added L2ARC and everything got worse”
Symptom: Higher read latency and more jitter after adding cache SSDs.
Root cause: L2ARC too large relative to RAM; increased overhead; cache churn; wrong workload (no locality).
Fix: Remove or reduce L2ARC; ensure sufficient RAM; confirm benefit with miss statistics before reintroducing.
4) “VMs stutter during backups”
Symptom: Predictable performance drops during backup windows.
Root cause: Sequential reads pollute ARC and compete for disk; scrub/resilver collides with production I/O.
Fix: Limit backup I/O, separate datasets, use primarycache=metadata where appropriate, schedule scrubs carefully.
5) “We have tons of RAM; why is ARC not huge?”
Symptom: ARC size seems capped or smaller than expected.
Root cause: zfs_arc_max set intentionally or by distro defaults; memory pressure; container limits; hugepages interactions.
Fix: Inspect ARC parameters and memory availability; change caps deliberately; don’t starve the OS.
6) “Small file operations are slow on HDD pool”
Symptom: Listing directories, untarring, git operations are painful.
Root cause: Metadata seeks on HDDs; insufficient metadata caching; no special vdev.
Fix: Ensure adequate RAM for metadata; consider special vdev for metadata/small blocks; verify with metadata hit behavior.
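The special vdev route looks like the sketch below. Treat it with respect: it must be redundant, because losing it loses the pool, and on raidz pools you should assume you cannot remove it later. Device and dataset names are placeholders:
cr0x@server:~$ sudo zpool add tank special mirror nvme4n1 nvme5n1
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/projects   # optionally route small data blocks to the SSDs too
cr0x@server:~$ zpool list -v tank   # confirm the special mirror appears and watch its capacity over time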
7) “Performance tanks when the pool gets near full”
Symptom: It was fine at 60%, awful at 90%.
Root cause: Allocation and fragmentation overhead; metaslabs constrained; RAIDZ write amplification hurts more.
Fix: Add capacity, migrate data, enforce quotas/reservations. Don’t treat 95% full as “normal operations.”
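Enforcing that is one command per dataset; the names and sizes below are obviously placeholders:
cr0x@server:~$ sudo zfs set quota=60T tank/projects        # hard ceiling so one consumer cannot fill the pool
cr0x@server:~$ sudo zfs set reservation=500G tank/critical # guaranteed space for the dataset that must not fail
cr0x@server:~$ zfs get -o name,property,value quota,reservation tank/projects tank/critical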
8) “We tuned recordsize and now random reads are worse”
Symptom: Latency increases after changing dataset settings.
Root cause: Recordsize misfit for workload; changed I/O pattern and cache behavior; mismatch with application block size.
Fix: Use workload-appropriate recordsize per dataset; test with representative load; don’t change it globally in panic.
Checklists / step-by-step plan
Plan A: You’re buying hardware and want to size RAM without religion
- Write down the workload. VMs? NFS home dirs? Object store? Database? Mixed?
- Pick two metrics that matter. Example: p95 read latency and p95 sync write latency.
- Decide the hot set hypothesis. “We think 200 GB is read repeatedly during business hours.” Put a number on it.
- Pick a conservative RAM baseline. Leave room for OS/apps; plan to cap ARC.
- Choose pool topology for IOPS first. Mirrors and more vdevs beat wide RAIDZ for random workloads.
- Decide whether metadata acceleration is needed. If yes, consider special vdevs (with redundancy) rather than infinite RAM.
- Plan for observability. ARC stats, latency, swap, CPU, and per-vdev metrics from day one.
- Run a load test that resembles production. If you can’t, you’re guessing—just admit it and build extra safety margin.
Plan B: Production is slow and you need a fix with minimal risk
- Check swap and reclaim. If swapping, cap ARC and stabilize.
- Check pool fullness. If dangerously full, stop and fix capacity. Everything else is lipstick.
- Check latency and sync behavior. Identify whether reads or writes are the pain.
- Identify cache pollution. Backups and scans are frequent offenders.
- Only then tune dataset properties. Do it per dataset, with rollback notes.
- Re-evaluate with the same metrics. If you didn’t measure before, measure after and don’t pretend you proved anything.
Plan C: You suspect you’re under-RAM’d for ARC (and want evidence)
- Confirm ARC is at/near max under normal load.
- Confirm miss% stays elevated during the slow period.
- Confirm the workload is read-heavy and has locality. Random reads repeatedly hitting the same working set, not a scan.
- Confirm disks are the bottleneck on misses. If SSDs are already fast enough, ARC benefit may be marginal.
- Add RAM or reallocate memory and re-test. Improvement should show in p95 latency, not just “feels better.”
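For the evidence itself, capture ARC behavior across the whole slow period instead of eyeballing one sample. A minimal collection sketch; the field names match the arcstat columns used earlier, the interval is arbitrary, and your arcstat version may name fields slightly differently (arcstat -v lists them):
cr0x@server:~$ arcstat -f time,read,miss,miss%,dmis,mmis,arcsz,c 10 360 | tee arcstat-$(date +%F).log
Pair that log with the same window of application p95 latency; the case for RAM is the two lines moving together.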
FAQ
1) So how much RAM do I need for ZFS?
Enough to run your workload without swapping, plus enough ARC to materially reduce your expensive reads. Start with a sensible baseline (16–64 GB depending on role),
measure ARC misses and latency, then scale RAM if and only if it reduces your bottleneck.
2) Is “1 GB RAM per TB” ever a useful rule?
As a warning that “large pools with heavy metadata workloads on slow disks need memory,” sure. As a purchase spec, no.
It ignores IOPS, workload, dataset tuning, and the reality that cold data exists.
3) Does more ARC always improve performance?
No. If you’re write-latency bound, CPU bound, topology bound, or suffering cache pollution, more ARC can do nothing or make things worse by increasing memory pressure.
4) Should I cap ARC?
In mixed-use servers (hypervisors, container hosts, boxes running databases), yes—cap it deliberately.
On dedicated storage appliances, you may let it grow, but still validate behavior under pressure and after reboots.
5) What ARC hit ratio is “good”?
“Good” is when application latency meets target. Hit ratio is context. A streaming workload can have low hit% and still be fine.
A VM workload with low hit% and high random read latency will feel terrible.
6) When does L2ARC make sense?
When your working set is larger than RAM but still has locality, and your disks are slow enough that SSD hits matter.
Also when you have enough RAM to feed it. L2ARC is not a band-aid for bad pool topology.
7) Is ZFS “memory hungry” compared to other file systems?
ZFS will happily use available RAM for ARC because it’s useful. That can look “hungry” in dashboards.
The question isn’t “is ARC big,” it’s “is the system stable and is latency improved.”
8) Does compression change RAM needs?
Often, yes—in a good way. Compression can increase effective ARC capacity because more logical data fits per byte cached, and it reduces disk I/O.
But if CPU becomes constrained, you just moved the bottleneck.
9) What about deduplication?
Use it only if you can prove the savings and you understand the memory and performance implications for your version and workload.
If you enable dedup without a plan, you’re volunteering for an incident.
10) Why does performance change after reboot?
ARC starts cold. If your performance depends on a warm cache, you’ll see a post-reboot slow period.
That’s normal cache behavior. Solve it with adequate RAM, workload-aware design, and avoiding “cache-dependent” assumptions for critical paths.
Next steps you can actually do this week
If you want a defensible ARC sizing decision—one that won’t get laughed out of a review—do this in order:
- Baseline the system: collect free -h, vmstat, ARC stats, and zpool iostat during a slow period.
- Separate the problem: decide whether you’re read-limited, write-limited, or sync-latency-limited. Don’t guess.
- Remove obvious self-harm: fix pool fullness, stop swapping, and stop letting backups bulldoze your cache.
- Tune per dataset: align recordsize, primarycache, and atime with the workload.
- Only then buy RAM: if misses remain high under steady-state locality and disks are the bottleneck, add RAM and verify improvement with the same metrics.
If anyone insists on “RAM per TB,” ask them one question: “Which workload and which latency target?” If they can’t answer, they’re not sizing. They’re reciting.