ZFS L2ARC sizing: when 200GB helps more than 2TB

Nothing ruins a quiet on-call like “we bought two terabytes of NVMe cache and performance got worse.” You stare at graphs: disks look fine, CPUs look bored, and latency still spikes like it’s getting paid per millisecond.

L2ARC is the most tempting knob in ZFS because it looks like a simple trade: cheap, fast flash standing in for expensive RAM. In production it's messier. Sometimes 200GB of L2ARC is a clean win. Sometimes 2TB is a very expensive way to heat an NVMe drive while stealing your RAM.

The mental model: what L2ARC really is

ARC is your first-level cache: in-memory, fast, and fiercely optimized. L2ARC is not “ARC but on SSD.” It’s a second-level cache for blocks that have already been in ARC and got evicted. That detail is everything.

ARC and L2ARC are solving different problems

ARC is where ZFS wants to be. It caches recently and frequently used blocks and metadata, and it adapts between MRU (most recently used) and MFU (most frequently used) lists as the workload shifts. ARC hits are cheap.

L2ARC is a victim cache. It stores data that ARC couldn’t keep. That means L2ARC only helps when:

  • Your workload has a working set larger than RAM, but still small enough to fit meaningfully in L2ARC.
  • There’s reuse. L2ARC does nothing for one-and-done reads.
  • The miss penalty on disks is high enough that an SSD hit is materially better.

Here’s the part people forget: L2ARC spends RAM to store SSD pointers

L2ARC maintains metadata in RAM that maps ARC headers to blocks stored on the L2 device. Bigger L2ARC means more in-memory metadata. If that metadata squeezes ARC, you can lose more than you gain.

Also, L2ARC isn’t free to fill. It consumes I/O bandwidth and CPU cycles to write to the cache device. Under heavy churn, that can turn into self-inflicted pressure.
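
You can see both costs directly on Linux with OpenZFS. The kstat below reports how much RAM the L2ARC headers currently occupy (l2_hdr_size) next to the logical and allocated cache size; paths and exact field names vary by platform, and the numbers are only an example:

cr0x@server:~$ grep -E '^l2_(hdr_size|size|asize)' /proc/spl/kstat/zfs/arcstats
l2_hdr_size                     4    438304768
l2_size                         4    199947124736
l2_asize                        4    186254098432

The header cost scales with the number of cached blocks, not the number of gigabytes, which is why small-recordsize datasets are the expensive ones to cache.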

What you’re actually sizing

When you “size L2ARC,” you are balancing:

  • Hit rate gain (how many reads avoid slower media)
  • ARC shrinkage (RAM burned by L2ARC metadata and feed behavior)
  • Write amplification (L2ARC fill writes; plus any persistence features)
  • Latency profile (NVMe is fast, but not free; and not all misses are equal)
  • Failure and recovery behavior (cache warmup time; reboot storms; device loss)

Paraphrased idea from Werner Vogels (Amazon CTO): Everything fails, all the time—design and operate assuming that reality.

That applies to caches too. L2ARC is a performance optimization. Treat it like a feature flag with an exit plan.

Interesting facts and a bit of history

Some context makes the knobs less mystical. Short, concrete facts:

  1. L2ARC was introduced early in Solaris ZFS as a way to extend caching beyond RAM, originally targeting enterprise flash devices.
  2. ARC is based on the Adaptive Replacement Cache algorithm (Megiddo and Modha, 2003); ZFS built a coherent cache with its own eviction logic instead of leaning entirely on the OS page cache.
  3. Historically, L2ARC was non-persistent: reboot meant a cold cache and a long warmup; later OpenZFS work added L2ARC persistence options on some platforms.
  4. Early L2ARC designs were limited by SSD endurance; feeding L2ARC could write a lot more than people expected.
  5. Metadata overhead has always been the tax; as ARC got smarter (and bigger), L2ARC sometimes became less compelling for general workloads.
  6. L2ARC only caches reads; it does not accelerate synchronous writes (that’s SLOG territory, and even then it’s nuanced).
  7. Special vdevs changed the game: putting metadata (and optionally small blocks) on flash can beat L2ARC because it changes where data lives, not just where it’s cached.
  8. NVMe made “huge L2ARC” affordable, which is why we now see more cases where people oversize it and discover the RAM/metadata punchline.

Why 200GB can beat 2TB

The headline isn’t a trick. Smaller L2ARC often wins because caching is about reuse density, not raw capacity. If you cache a lot of blocks that won’t be reused, you’ve built a very fast landfill.

1) Your working set has a “hot core” and a “cold tail”

Most production workloads do. Think: VM boot storms versus long-tail user data; database indexes versus historical tables; build artifacts versus last week’s artifacts.

A 200GB L2ARC that holds the hot core of ARC evictions can yield a high hit rate. A 2TB L2ARC may pull in the cold tail too, wasting RAM metadata and feed bandwidth. When that happens, ARC shrinks, hot metadata falls out, and you lose the cheapest hits you had.

2) L2ARC competes with ARC for RAM (indirectly)

ZFS performance is usually about memory. Not “more memory good” in a bumper-sticker way—specifically: do you have enough ARC to keep metadata and your hot blocks resident?

Big L2ARC increases the amount of bookkeeping in memory. That reduces effective ARC space for actual cached data. If ARC misses increase as you enlarge L2ARC, you can regress.

3) L2ARC has a fill rate; huge caches can’t keep up

L2ARC is filled gradually. If your workload churns through data faster than L2ARC can populate, the cache never becomes representative. You keep paying to fill it, and you don’t get enough hits back.

Smaller L2ARC warms faster and stays relevant. In environments with frequent reboots, failovers, or maintenance, warmup time is not academic.
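
On Linux, the feed rate is controlled by OpenZFS module parameters; reading them is harmless and tells you what you're working with (values shown are typical defaults on recent OpenZFS, but they vary by version and platform):

cr0x@server:~$ grep . /sys/module/zfs/parameters/l2arc_{write_max,write_boost,noprefetch}
/sys/module/zfs/parameters/l2arc_write_max:8388608
/sys/module/zfs/parameters/l2arc_write_boost:8388608
/sys/module/zfs/parameters/l2arc_noprefetch:1

At roughly 8 MiB per feed interval, filling 2TB takes on the order of days of sustained eviction. That is the arithmetic behind "huge caches can't keep up."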

4) Not all “hits” are equal

Even a perfect L2ARC hit rate can be a bad deal if the bottleneck isn’t reads from slow media. If you’re CPU-bound on compression, encryption, checksums, or a single-threaded application lock, L2ARC capacity won’t fix it.

5) The “random read penalty” is where L2ARC earns its keep

If your pool is HDD-heavy, random reads are expensive, and L2ARC can be dramatic—if the data is reused. If your pool is already all-NVMe, L2ARC often becomes a rounding error. A large L2ARC on a fast pool can still hurt by consuming RAM metadata and generating extra writes.

Joke #1: Buying a bigger L2ARC to fix latency is like buying a bigger fridge to fix hunger—you may just end up with more leftovers.

A sizing method that doesn’t rely on vibes

If you remember only one rule: don’t size L2ARC by disk size; size it by measured reuse and RAM headroom.

Step 1: Prove you have a read-miss problem

Start by answering three questions:

  • Are we latency-bound on reads from the pool?
  • Are ARC hits high but still not enough, meaning the working set exceeds RAM?
  • Is there reuse in the eviction stream (i.e., would an L2ARC actually be hit)?

If you can’t demonstrate these, the default recommendation is boring: add RAM or fix the workload.

Step 2: Estimate the “L2-worthy” working set

You’re hunting for blocks that are:

  • Frequently accessed, but not frequent enough to stay in ARC under pressure
  • Expensive to fetch from the pool (random reads on HDD, or remote SAN latency)
  • Stable enough to benefit from cache warmup

In practice, that “L2-worthy” set is often much smaller than total dataset size. It might be the 50–300GB of VM OS blocks, or a few hundred gigabytes of database indexes, not “all 12TB of user files.”

Step 3: Ensure RAM headroom for ARC metadata and L2ARC headers

Oversized L2ARC can starve ARC. The tell is: ARC size stops growing, metadata misses climb, and the system gets “stuttery” under load.

Rule of thumb that’s actually usable: if you’re already fighting ARC pressure, don’t add L2ARC yet. Fix memory first, or move metadata to a special vdev, or reduce the working set.
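
A back-of-the-envelope helps here. Assuming on the order of ~100 bytes of in-RAM header per cached block (the exact figure depends on your OpenZFS version), a 2 TiB L2ARC full of 16K database records costs serious RAM:

cr0x@server:~$ # 2 TiB cache, 16K average block, ~100 bytes of header per block (assumed)
cr0x@server:~$ echo "$(( 2 * 1024**4 / (16 * 1024) * 100 / 1024**3 )) GiB of RAM just for L2ARC headers"
12 GiB of RAM just for L2ARC headers

The same 2 TiB of 128K records is closer to 1.5 GiB. Either way, that memory comes out of ARC's budget.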

Step 4: Choose device class and endurance like an adult

L2ARC writes. Not constantly like a log, but enough to matter. Consumer QLC can work for modest caches with conservative fill behavior; it can also die embarrassingly fast under churny workloads.

Pick an NVMe with power-loss protection if you care about consistency during brownouts. L2ARC is “just cache,” but sudden device misbehavior can still add failure modes and ugly timeouts.

Step 5: Start small, measure, then grow

Don’t install 2TB because it was on sale. Install 200–400GB. Measure hit rate, latency, ARC behavior, and write load for a week. Then decide whether to expand.
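
Adding a cache device is non-destructive and reversible, which is what makes "start small" cheap to try. A minimal sketch with a hypothetical 400GB partition on a spare NVMe (use stable /dev/disk/by-id paths in production):

cr0x@server:~$ sudo zpool add tank cache nvme4n1p1
cr0x@server:~$ zpool status tank | grep -A1 cache
        cache
          nvme4n1p1                  ONLINE       0     0     0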

Joke #2: L2ARC is like a company-wide Slack channel—useful when curated, a disaster when everyone dumps everything into it.

Practical tasks: commands, outputs, decisions (15)

These are the day-to-day checks that keep you honest. Outputs are examples; your numbers will differ. The point is what each tells you, and what decision you make next.

Task 1: Confirm pool and vdev layout (because nothing else makes sense without it)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:32:11 with 0 errors on Tue Dec 24 02:14:22 2025
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            sda                      ONLINE       0     0     0
            sdb                      ONLINE       0     0     0
            sdc                      ONLINE       0     0     0
            sdd                      ONLINE       0     0     0
            sde                      ONLINE       0     0     0
            sdf                      ONLINE       0     0     0
        cache
          nvme0n1                    ONLINE       0     0     0

errors: No known data errors

Meaning: You’re on RAIDZ2 HDDs with one NVMe cache device. Random read penalty on HDDs is real, so L2ARC might matter.

Decision: If this were all-NVMe or mirrored SSDs, I’d be skeptical about L2ARC value and would first focus on ARC and workload.

Task 2: Check ARC size, target, and pressure

cr0x@server:~$ arc_summary | head -n 25
ARC Summary:
        Memory Throttle Count:        0
        ARC Size:                     22.5 GiB
        Target Size:                  24.0 GiB
        Min Size (Hard Limit):        8.0 GiB
        Max Size (High Water):        24.0 GiB
        Most Recently Used Cache Size: 7.1 GiB
        Most Frequently Used Cache Size: 13.8 GiB

Meaning: ARC is near max. That can be fine or a sign you’re memory-limited.

Decision: If ARC is pinned at max and your OS is swapping or services are memory-starved, don’t grow L2ARC—add RAM or reduce ARC max only if you must.

Task 3: Measure ARC hit ratio and demand data misses

cr0x@server:~$ arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:01  8320   410      4   320    3    62    1    28    0   22G  23.8G
12:01:02  8011   398      4   310    3    59    1    29    0   22G  23.8G
12:01:03  8550   460      5   370    4    61    1    29    0   22G  23.8G
12:01:04  7902   402      5   322    4    55    1    25    0   22G  23.8G
12:01:05  8122   405      4   330    4    53    1    22    0   22G  23.8G

Meaning: ARC miss rate is ~4–5%, demand data misses dominate. That’s not terrible. Whether L2ARC helps depends on what those misses cost.

Decision: If miss% is low and latency is still bad, suspect non-storage bottlenecks. If miss% is high and reads are random, L2ARC might help.

Task 4: Check L2ARC presence and hit ratio

cr0x@server:~$ arc_summary | sed -n '60,110p'
L2ARC Summary:
        L2ARC Size:                           186.2 GiB
        L2ARC Evict Misses:                   0
        L2ARC Hits:                           1823124
        L2ARC Misses:                         9311042
        L2ARC Read Hit Ratio:                 16.4%
        L2ARC Writes:                         412019

Meaning: L2ARC hit ratio is ~16%. That can be meaningful on HDD pools, less so on SSD pools.

Decision: If hit ratio is in single digits and the device is busy, shrink or remove L2ARC and focus elsewhere. If it’s >15–30% and read latency improves, keep it and consider modest growth.

Task 5: Verify L2ARC feed behavior and whether the device is saturated

cr0x@server:~$ iostat -x nvme0n1 1 3
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.12    0.00    3.21    1.02    0.00   85.65

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await wareq-sz  aqu-sz  %util
nvme0n1          820.0  89200.0     0.0   0.00    0.35   108.78   290.0  38100.0    1.90   131.38    0.80  48.00

Meaning: NVMe is not pegged (%util 48%). Good. If it were near 100% with high w_await, L2ARC fill might be hurting.

Decision: If cache device is saturated, cap L2ARC fill (platform-specific tunables) or shrink L2ARC.
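
Capping the feed on Linux/OpenZFS is a module parameter change; here's a sketch that halves the default at runtime and persists it across reboots (the 4 MiB value is illustrative, not a recommendation):

cr0x@server:~$ echo 4194304 | sudo tee /sys/module/zfs/parameters/l2arc_write_max
4194304
cr0x@server:~$ echo "options zfs l2arc_write_max=4194304" | sudo tee -a /etc/modprobe.d/zfs.conf
options zfs l2arc_write_max=4194304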

Task 6: Confirm whether the pool is actually the latency source

cr0x@server:~$ zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        38.2T  12.1T  1400   520     180M  62.1M
  raidz2-0  38.2T  12.1T  1400   520     180M  62.1M
    sda         -      -   230    90    30.1M  10.2M
    sdb         -      -   240    85    29.8M  10.1M
    sdc         -      -   225    88    29.6M  10.3M
    sdd         -      -   235    86    30.2M  10.4M
    sde         -      -   240    86    30.5M  10.2M
    sdf         -      -   230    85    29.8M  10.1M
cache           -      -   710   260     86.0M  38.1M
  nvme0n1      -      -   710   260     86.0M  38.1M
----------  -----  -----  -----  -----  -----  -----

Meaning: Cache device is doing meaningful reads. That’s L2ARC paying rent.

Decision: If “cache” shows near-zero reads while misses are high, L2ARC isn’t being used (or can’t warm). Investigate churn, recordsize mismatch, or that the workload is streaming.

Task 7: Spot streaming reads that L2ARC won’t help

cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,primarycache,secondarycache tank/data
NAME       PROPERTY        VALUE
tank/data  primarycache    all
tank/data  secondarycache  all
tank/data  recordsize      128K

Meaning: Dataset caches both ARC and L2ARC and uses 128K records. Large sequential scans can bulldoze cache usefulness.

Decision: For streaming/backup datasets, set secondarycache=none (or even primarycache=metadata) to protect cache for latency-sensitive data.

Task 8: Apply per-dataset cache policy (safely, surgically)

cr0x@server:~$ sudo zfs set secondarycache=none tank/backups
cr0x@server:~$ zfs get -o name,property,value secondarycache tank/backups
NAME         PROPERTY        VALUE
tank/backups secondarycache  none

Meaning: Backup reads won’t pollute L2ARC. ARC may still cache based on primarycache setting.

Decision: Do this when you have mixed workloads and backups/rebuilds are crushing cache hit rates.

Task 9: Check metadata pressure and whether a special vdev is a better answer

cr0x@server:~$ arc_summary | sed -n '25,60p'
ARC Miscellaneous:
        Deleted:                            112.3 MiB
        Mutex Misses:                       0
        Demand Data Hits:                   51234123
        Demand Data Misses:                 9123421
        Demand Metadata Hits:               88123411
        Demand Metadata Misses:             923412

Meaning: Metadata misses exist but aren’t exploding. If metadata misses were huge and latency spiky during directory walks/VM operations, special vdev for metadata could beat L2ARC.

Decision: If your pain is metadata-heavy (VM-datastore patterns, lots of small files, directory traversals), consider a special vdev before growing L2ARC.
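
For reference, a special vdev is added like any other vdev, with two caveats: it holds real pool data (so it must be redundant, unlike L2ARC), and it only affects newly written blocks. Device names here are hypothetical:

cr0x@server:~$ # special vdev must be mirrored: losing it means losing the pool
cr0x@server:~$ sudo zpool add tank special mirror nvme2n1 nvme3n1
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/pg

special_small_blocks routes small data blocks there as well as metadata; existing data stays where it is until it's rewritten.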

Task 10: Verify you’re not confusing SLOG with L2ARC

cr0x@server:~$ zpool status tank | sed -n '1,25p'
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            sda                      ONLINE       0     0     0
            sdb                      ONLINE       0     0     0
            sdc                      ONLINE       0     0     0
            sdd                      ONLINE       0     0     0
            sde                      ONLINE       0     0     0
            sdf                      ONLINE       0     0     0
        logs
          nvme1n1                    ONLINE       0     0     0
        cache
          nvme0n1                    ONLINE       0     0     0

Meaning: You have a separate log device (SLOG) and a cache device (L2ARC). Good separation.

Decision: If your issue is sync write latency and you keep adding L2ARC, stop. Measure sync workload and SLOG behavior instead.

Task 11: Validate that the workload is actually reading (not waiting on locks or CPU)

cr0x@server:~$ pidstat -dru 1 3
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

12:10:11      UID       PID  %usr %system  %wait    %CPU   CPU  kB_rd/s  kB_wr/s  Command
12:10:12      999     18322  2.00    0.50   15.00    2.50     6  1200.00   200.00  postgres
12:10:12        0      1421  0.10    0.20    0.00    0.30     2     0.00     0.00  zfs

Meaning: The app shows %wait. That could be I/O wait, but it could also be lock wait depending on the stack. Still, it’s a signal to correlate with storage latency.

Decision: If CPU is pegged or the app is lock-bound, L2ARC size changes won’t fix it. Don’t do “cache theater.”

Task 12: Watch real I/O latency at the device level

cr0x@server:~$ iostat -x sda 1 3
Device            r/s     rkB/s   r_await  w/s     wkB/s   w_await  aqu-sz  %util
sda              220.0   30100.0   18.20  85.0   10200.0   6.40     4.10   98.00
sda              235.0   31000.0   21.50  92.0   11000.0   7.10     4.80   99.00
sda              210.0   29000.0   16.80  80.0    9800.0   5.90     3.70   96.00

Meaning: HDD is saturated and read await is high. If L2ARC can absorb some reads, you’ll lower this.

Decision: If HDDs are pegged on reads, L2ARC likely helps. If HDDs are idle and latency is still high, your issue is elsewhere.

Task 13: Confirm compression and recordsize alignment (cache efficiency depends on it)

cr0x@server:~$ zfs get -o name,property,value compression,recordsize tank/pg
NAME     PROPERTY     VALUE
tank/pg  compression  lz4
tank/pg  recordsize   16K

Meaning: 16K recordsize for a database dataset; that can improve locality and cache utility for OLTP patterns.

Decision: If you’re caching huge 1M records for random reads, you may waste cache. Tune recordsize per dataset.

Task 14: Evaluate memory headroom quickly

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           128Gi        41Gi        12Gi       1.0Gi        75Gi        78Gi
Swap:           16Gi       0.0Gi        16Gi

Meaning: Plenty of available memory; ARC can be large without starving userspace.

Decision: If available memory is low and swap is active, fix that first. L2ARC is not a substitute for RAM when the OS is gasping.
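
Also worth checking whether ARC is capped by a module parameter. On Linux, a zfs_arc_max of 0 means the built-in default (roughly half of RAM); a non-zero value is an explicit cap, like the 24 GiB below, which may be leaving memory on the table on a 128 GiB box:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
25769803776

If the cap is historical baggage, raising it is usually a better first move than buying cache.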

Task 15: Baseline L2ARC write rate (endurance reality check)

cr0x@server:~$ sudo smartctl -a /dev/nvme0 | sed -n '1,80p'
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Percentage Used:                    3%
Data Units Read:                    18,240,112
Data Units Written:                 9,882,441
Host Read Commands:                 312,114,901
Host Write Commands:                210,441,022

Meaning: “Data Units Written” trends upward with L2ARC feed and workload. Track it over time.

Decision: If your cache device endurance is being eaten quickly, reduce L2ARC churn (dataset policies, smaller L2ARC, better ARC sizing) or use a more durable drive.

Fast diagnosis playbook

You’re in an incident. You don’t have time for philosophical debates about caching. Here’s what to check first, second, third, to find the bottleneck without lying to yourself.

First: is it even storage?

  • Check host CPU saturation and run queue. If CPU is pegged, storage tuning is a distraction.
  • Check application-level waits (DB locks, thread pools). Storage graphs can look guilty by association.

Second: if it is storage, is it reads, writes, or metadata?

  • Reads: high read latency on HDD vdevs; ARC miss% rising; L2ARC reads meaningful.
  • Sync writes: latency spikes correlate with fsync; SLOG metrics and device latency matter.
  • Metadata: directory traversals slow; VM operations slow; metadata misses high; special vdev may be better.

Third: is cache helping or hurting right now?

  • ARC size near max with poor hit ratio: likely working set too big or streaming workload.
  • L2ARC hit ratio low with high device utilization: cache churn; shrink L2ARC or exclude datasets (see the arcstat one-liner after this list).
  • L2ARC hit ratio decent but no latency improvement: pool may already be fast, or bottleneck is elsewhere.
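
A quick way to watch ARC and L2ARC together during an incident is arcstat with L2 fields (field names vary slightly between OpenZFS versions; arcstat -v lists what your build supports; output is an example):

cr0x@server:~$ arcstat -f time,read,hit%,l2read,l2hit%,l2size,arcsz 5 3
    time  read  hit%  l2read  l2hit%  l2size  arcsz
13:02:10  8120    95     410      18    186G    22G
13:02:15  8244    95     398      17    186G    22G
13:02:20  7980    94     415      16    186G    22G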

Fourth: pick the right lever

  • Add RAM when ARC is constrained and metadata misses hurt.
  • Add or tune L2ARC when read misses are expensive and reused.
  • Add a special vdev when metadata and small blocks dominate.
  • Fix dataset policies when backups or scans are polluting cache.
  • Fix the workload when access patterns are pathological (huge random reads, tiny recordsize mismatches, unbounded scans).

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company ran a virtualization cluster on a ZFS-backed storage server: RAIDZ2 HDDs for capacity, a pair of NVMes for “performance stuff.” The team assumed L2ARC was basically “more RAM but cheaper,” so they carved most of an NVMe into a multi-terabyte L2ARC slice. Big number, big win. That was the assumption.

A month later, a routine maintenance reboot turned into a slow-motion outage. After boot, VM latency was awful for hours. Not “a little degraded.” Tenants filed tickets. Management asked if the storage team had changed anything. The answer was “no,” which was technically true and operationally useless.

The real story: L2ARC had been carrying a lot of warm blocks that the working set depended on during business hours. Reboot erased it. ARC alone couldn’t hold the set. The L2ARC took a long time to warm because the fill rate couldn’t keep up with the churn. Worse, during warmup it consumed I/O bandwidth writing cache content while tenants were already waiting on reads.

The fix wasn’t heroic. They shrank L2ARC to a few hundred gigabytes so it warmed quickly, excluded backup datasets from secondarycache, and documented that reboots imply a warmup window. The surprising result: smaller L2ARC delivered steadier performance and shorter “post-reboot ugliness.”

Mini-story 2: The optimization that backfired

A different shop hosted a database cluster on ZFS. They saw read misses and decided to “go all in”: massive L2ARC, aggressive settings to feed it faster, and a shiny new NVMe dedicated to cache. The graphs looked great for a day. Cache hit ratio went up. Everyone felt clever.

Then the write latency complaints started. Not even on the database dataset—across the host. The NVMe cache device showed high utilization and increased write latency. The HDD pool looked busier too, even though they were trying to reduce reads. The team had built a machine that was constantly moving blocks into L2ARC, evicting them, and repeating the process. Cache churn became a workload.

They were also burning RAM on L2ARC metadata, which quietly reduced ARC’s ability to keep the truly hot metadata resident. More ARC misses. More disk traffic. You can guess the rest.

The rollback plan saved them: disable the aggressive feed, cut L2ARC size, and move metadata-heavy datasets to a special vdev. Performance returned to baseline. After that, they treated L2ARC as a scalpel, not a lifestyle.

Mini-story 3: The boring but correct practice that saved the day

A media platform had a ZFS storage fleet with mixed workloads: user uploads, transcoding reads, analytics scans, and nightly backups. They never trusted caches blindly. They had a standing practice: every high-throughput dataset got explicit cache policy, and every cache device had endurance monitoring with alerting.

One night, analytics jobs changed and started doing large scans against a dataset that was still set to secondarycache=all. L2ARC hit ratio tanked, NVMe writes spiked, and interactive workloads saw higher latency. The system didn’t crash; it just became unpleasant.

The on-call followed their runbook: check arcstat, confirm L2ARC churn, identify the dataset by workload schedule, and flip the dataset policy. They set secondarycache=none on the analytics dataset and left primarycache alone. Within minutes, L2ARC stabilized and interactive performance recovered.

No heroics. No “we should redesign storage.” Just boring hygiene: per-dataset cache policies and watching the cache device like it’s production hardware—because it is.

Common mistakes: symptoms → root cause → fix

1) Symptom: performance got worse after adding a huge L2ARC

Root cause: L2ARC metadata overhead and feed churn reduce effective ARC and add write load; the cache stores low-reuse blocks.

Fix: Shrink L2ARC; exclude streaming datasets (secondarycache=none); confirm ARC has headroom; consider adding RAM instead.

2) Symptom: L2ARC hit ratio is low (single digits) and never improves

Root cause: Streaming/scan workload, low reuse, cache warms too slowly, or datasets aren’t eligible for L2ARC.

Fix: Identify scan datasets and disable secondarycache; keep L2ARC smaller; verify secondarycache settings; measure hit ratio over a representative window.
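
A quick audit of which datasets are still L2ARC-eligible (dataset names below are just the earlier examples):

cr0x@server:~$ zfs get -r -t filesystem -o name,value secondarycache tank | grep -vw none
NAME             VALUE
tank             all
tank/data        all
tank/pg          all

Anything on that list that's a backup target, scan source, or bulk-copy landing zone is a candidate for secondarycache=none.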

3) Symptom: great L2ARC hit ratio but latency doesn’t improve

Root cause: Bottleneck isn’t slow reads (CPU, locks, network, sync writes, fragmented vdevs). Or pool is already SSD-fast.

Fix: Correlate app latency with device r_await; check SLOG for sync workloads; check CPU and scheduling; don’t keep buying cache.

4) Symptom: after reboot/failover, system is slow for hours

Root cause: Non-persistent L2ARC (or ineffective persistence) plus oversized cache with slow warmup; working set depends on cached blocks.

Fix: Reduce L2ARC size; ensure workload is resilient to cold cache; schedule warmup reads if appropriate; document expected warmup behavior.
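
If you're on OpenZFS 2.0 or newer, persistent L2ARC softens the reboot penalty: the header table is rebuilt at import instead of starting cold. Whether rebuild is enabled is a module parameter (enabled by default on recent versions; not available on older platforms):

cr0x@server:~$ cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
1

Persistence doesn't remove the RAM cost of headers, so treat it as a mitigation for warmup pain, not a reason to oversize the cache.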

5) Symptom: NVMe cache device wears out faster than expected

Root cause: High churn and continuous feed writes; L2ARC used as a dumping ground for cold data.

Fix: Reduce cache size; disable L2ARC for scan datasets; pick a drive with suitable endurance; monitor SMART “Percentage Used” and bytes written trends.

6) Symptom: directory listings, VM operations, and “small IO” workloads are slow

Root cause: Metadata-bound workload; L2ARC helps some, but persistent placement (special vdev) often helps more.

Fix: Consider a special vdev for metadata/small blocks; confirm metadata miss behavior; ensure enough RAM for metadata in ARC.

7) Symptom: cache device is at 100% utilization with high write latency

Root cause: L2ARC fill competing with reads; cache device too small/slow for churn; aggressive feeding.

Fix: Cap feed rate/tunables where applicable; shrink L2ARC; move to faster NVMe; reduce eligible datasets.

Checklists / step-by-step plan

Decision checklist: should you add L2ARC at all?

  1. Pool media: HDD-heavy or high-latency backend? If yes, L2ARC may help. All-NVMe? Be skeptical.
  2. ARC maxed and still missing: Confirm ARC is at/near max and misses matter.
  3. Reuse exists: Confirm the workload revisits data (not pure streaming).
  4. RAM headroom: Ensure userspace isn’t swapping; ARC can afford some metadata overhead.
  5. Operational tolerance: Are you okay with warmup after reboot? If not, keep it small or plan for persistence.

Step-by-step: sizing L2ARC safely

  1. Baseline for 7 days. Capture: arcstat miss%, L2ARC hit%, pool device latency, and cache device utilization.
  2. Set dataset policies. Exclude backups/scans from L2ARC first (secondarycache=none).
  3. Start with 200–400GB. Prefer “small and warm” over “huge and stale.”
  4. Verify it’s being used. Check zpool iostat “cache” reads and L2ARC hit ratio.
  5. Watch RAM behavior. ARC size, metadata misses, and system memory availability.
  6. Watch endurance. Track NVMe writes and “Percentage Used” over time.
  7. Grow only with evidence. If hit ratio improves and latency drops without starving ARC, scale up in increments.
  8. Keep a rollback plan. Be able to detach the cache device and revert dataset policies quickly.
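
The rollback itself is short, because a cache device holds no pool data. A sketch using the cache device name from the earlier tasks:

cr0x@server:~$ sudo zpool remove tank nvme0n1
cr0x@server:~$ sudo zfs inherit secondarycache tank/backups

zpool remove detaches the cache device cleanly; zfs inherit reverts a dataset to its parent's cache policy.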

When to stop increasing L2ARC

  • ARC metadata misses rise or ARC shrinks noticeably.
  • Cache device utilization climbs without a corresponding latency drop.
  • L2ARC hit ratio plateaus: extra gigabytes are storing colder and colder blocks.
  • Warmup time becomes operationally painful.

FAQ

1) Is L2ARC “worth it” on an all-SSD or all-NVMe pool?

Usually not. If your pool already serves random reads in sub-millisecond territory, L2ARC adds complexity, RAM overhead, and write load for small gains. Spend effort on ARC sizing, recordsize, and workload tuning first.

2) Why does a smaller L2ARC warm faster?

Because the cache is filled over time. A smaller cache reaches a representative set of frequently evicted blocks sooner. A huge cache can remain partially irrelevant for long periods under churn.

3) Can L2ARC cause higher latency even if hit ratio improves?

Yes. If filling L2ARC competes with pool I/O or the cache device becomes a bottleneck, you can increase background work. Also, if L2ARC metadata reduces ARC effectiveness, you lose cheap ARC hits and end up doing more slow I/O.

4) Should I add RAM instead of L2ARC?

If you can, yes—especially when your ARC is constrained and metadata misses hurt. RAM improves ARC directly, helps metadata, and doesn’t have warmup or endurance issues. L2ARC is for when the working set is larger than RAM and read misses are expensive.

5) Does L2ARC help writes?

No. L2ARC is a read cache. If your pain is sync write latency, look at SLOG, sync settings, and application fsync patterns.

6) What dataset settings matter most for L2ARC behavior?

secondarycache controls L2ARC eligibility. primarycache controls ARC eligibility. recordsize affects cache granularity and efficiency. Compression can help by fitting more logical data per cached byte.

7) Is a special vdev better than L2ARC?

Different tool. A special vdev changes placement for metadata (and optionally small blocks), often improving latency predictably for metadata-heavy workloads. L2ARC caches what ARC evicts. If your bottleneck is metadata and small random reads, special vdevs frequently outperform “just add more L2ARC.”

8) What’s a good target L2ARC hit ratio?

There’s no magic number. On HDD pools, even 10–20% can be a big win if it hits the expensive reads. On fast SSD pools, you might need much higher to justify overhead. Always tie it to latency improvements and reduced HDD read await.

9) How do I prevent backups and scans from trashing my cache?

Put them in their own dataset(s) and set secondarycache=none. For especially abusive streams, consider primarycache=metadata too, depending on access patterns.

10) What happens if the L2ARC device fails?

ZFS can keep running; it’s a cache device, not data. But the failure can still create operational noise: I/O errors, timeouts, and performance shifts as cache hits disappear. Monitor it and be prepared to detach it cleanly.

Conclusion: practical next steps

If you’re about to buy 2TB of “cache” because your ZFS box feels slow, pause. Measure first. The best L2ARC size is rarely “as big as possible.” It’s “as small as needed to capture the reused misses without stealing ARC’s lunch money.”

Next steps you can do this week

  1. Run arcstat during peak and correlate miss% with user-visible latency.
  2. List datasets and mark the streamers: backups, analytics scans, bulk copies. Set secondarycache=none on them.
  3. If you already have L2ARC, check hit ratio and cache device utilization. If it’s low-hit and high-busy, shrink it.
  4. Make a call: if you’re memory-constrained, buy RAM before buying cache.
  5. If metadata is your pain, plan for a special vdev rather than a bigger L2ARC.

Do the boring measurement work. Then pick the smallest change that gives you a stable improvement. That’s how you keep caches from turning into a second job.
