ZFS Using NVMe as L2ARC: When ARC Isn’t Enough

At some point every ZFS admin meets the same villain: a working set that doesn’t fit in RAM. The symptom is familiar—latency spikes, disks chattering, users claiming “the storage is slow” with the moral certainty of someone who has never looked at iostat.

NVMe as L2ARC sounds like the obvious fix: add a fast cache device and move on with your life. Sometimes it’s brilliant. Sometimes it’s an expensive way to make your NVMe wear out while performance stays stubbornly mediocre. This piece is about knowing which world you’re in—before you learn it in production.

What L2ARC is (and what it isn’t)

ZFS has a primary cache called ARC (Adaptive Replacement Cache). It lives in RAM and it’s extremely good at making repeated reads cheap. When ARC is too small for your working set, ZFS can extend caching onto a fast device: L2ARC, the “Level 2 ARC”. L2ARC is read cache. Not a write buffer, not a journaling accelerator, not a magic wand.

What happens in practice:

  • ARC tracks what’s hot. If ARC can’t keep it, ZFS may push some of that data into L2ARC.
  • On a read miss in ARC, ZFS can look in L2ARC. If found, it reads from NVMe instead of spinning disks or slower SSDs.
  • L2ARC contents are populated over time. Historically it was non-persistent across reboots; newer implementations can persist it (more on that later).

What L2ARC doesn’t do:

  • It does not accelerate synchronous writes (that’s SLOG territory, and even then: only specific workloads).
  • It does not fix a pool that is slow because of fragmentation, poor vdev layout, or a CPU bottleneck.
  • It does not remove the need to size RAM properly. L2ARC still needs RAM metadata to be useful.

Dry operational truth: L2ARC is a tool for making a specific kind of read latency cheaper. If your problem is “random reads are killing me and the data is re-read,” it can shine. If your problem is “everything is random and never repeated,” it won’t.

One quote that belongs on every on-call rotation schedule: “Hope is not a method,” the title General Gordon R. Sullivan gave his book on leading change. Treat L2ARC the same way: measure, then decide.

Interesting facts and history (short, useful, and slightly opinionated)

  • ARC predates most “cloud storage” habits. ZFS and ARC came from Solaris-era engineering where memory was expensive and disks were slow, so caching policy was serious business.
  • L2ARC was designed for read caching, not durability. Early implementations lost the cache on reboot; that shaped operational expectations for years.
  • L2ARC needs RAM to describe what’s on it. Every cached block needs metadata in ARC. A gigantic L2ARC can quietly steal RAM that your ARC desperately needs.
  • NVMe made L2ARC less embarrassing. Early L2ARC devices were often SATA SSDs; latency improvements were real, but not always dramatic. NVMe’s low latency changes the tradeoffs.
  • Prefetch and L2ARC have a complicated relationship. Sequential reads can flood caches with stuff you’ll never touch again. ZFS added knobs (like l2arc_noprefetch) because people kept caching their regrets.
  • There’s a “warm-up tax.” L2ARC is filled gradually, not instantly. The first hours (or days) can look worse than “no L2ARC,” especially if you sized it wrong.
  • The cache feed rate is intentionally throttled. ZFS limits how aggressively it writes to L2ARC to avoid turning your system into a cache-writing factory.
  • L2ARC is not a victimless feature. It consumes CPU, memory, and writes to your NVMe. That wear is real; plan for it like you plan for log retention.
  • Persistent L2ARC exists now in modern ZFS implementations. It reduces the warm-up pain after reboots, but it’s not universal and it’s not free.

When NVMe L2ARC actually helps

L2ARC is for workloads with repeat reads that miss ARC but can be served faster from cache than from the pool itself.

Good fits

  • VM hosts with stable “hot” OS blocks across many guests: shared libraries, boot files, common binaries, patch repositories.
  • Databases with read-heavy patterns where the working set is just a bit too big for RAM (or where you refuse to buy more RAM because finance).
  • CI/build caches where the same toolchains are used repeatedly and your pool is HDD-based.
  • Analytics clusters that repeatedly scan the same “recent” partitions, but not in a purely sequential way.

The key condition: L2ARC must be faster than the thing it replaces

If your pool is already NVMe or a strong SSD mirror, L2ARC can be redundant. If your pool is HDD RAIDZ with random reads, L2ARC can look like a miracle—assuming you’re actually re-reading the same blocks.

Joke #1: L2ARC is like a second fridge—great if you keep leftovers, useless if you only eat takeout.

When L2ARC is a trap

Most “L2ARC was disappointing” stories boil down to one of these:

1) Your workload doesn’t re-read

Backup ingestion, cold analytics scans, one-time file copies, large sequential media processing—these tend to stream through cache and vanish. L2ARC becomes write traffic to NVMe with little hit rate payoff.

2) Your ARC is starved, and L2ARC makes it worse

L2ARC requires metadata in RAM. If your system already has memory pressure, adding a large L2ARC can reduce effective ARC size and increase misses. Congratulations, you cached more on NVMe while losing the thing that makes caching fast: RAM.

3) Your problem isn’t reads

If the pain is synchronous writes, small random writes, or metadata contention, L2ARC might do almost nothing. Often the correct fix is a different vdev layout, a special vdev for metadata/small blocks, more RAM, or tuning the application.

4) You just built a cache that thrashes

Thrash means blocks are written into L2ARC and evicted before they’re reused. That can happen because the working set is too large, the cache feed rate is too high, or the pattern is a rolling window larger than ARC+L2ARC.
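
A quick way to spot churn is to compare how much ZFS writes into the cache against how much it ever reads back. A minimal check, assuming Linux OpenZFS with kstats under /proc/spl/kstat (numbers representative):

cr0x@server:~$ awk '$1 ~ /^l2_(hits|misses|read_bytes|write_bytes)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
l2_hits 184022
l2_misses 912331
l2_read_bytes 24117248000
l2_write_bytes 512110190592

If l2_write_bytes keeps growing while l2_read_bytes stays nearly flat, you are paying NVMe wear for blocks nobody re-reads.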

Picking NVMe for L2ARC: latency, endurance, and why “fast” isn’t enough

For L2ARC, you want low latency under mixed load, not just heroic sequential throughput you’ll never hit in a cache. Marketing numbers don’t show tail latency when the device is warm, partially full, and doing background garbage collection. That’s your real world.

Latency and queue behavior

L2ARC hits are random reads. Your NVMe should handle high IOPS without turning p99 latency into a weekly surprise. Drives with consistent performance (often “enterprise-ish” models) matter here. Consumer drives can be fine, but the risk is spikier behavior when SLC cache is exhausted or under sustained writes from cache fill.
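
Before committing a drive, probe its random-read latency under queue depth. A hedged sketch using fio (randread is read-only, but triple-check the target device anyway); output trimmed to the interesting lines:

cr0x@server:~$ sudo fio --name=probe --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based --group_reporting | grep -E 'IOPS|clat \('
  read: IOPS=412k, BW=1609MiB/s (1687MB/s)(47.1GiB/30001msec)
     clat (usec): min=18, max=2104, avg=77.42, stdev=31.18

A sub-100-microsecond average with a bounded max is what you want from a cache device; a pretty average hiding multi-millisecond outliers is how p99 surprises happen.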

Endurance is a feature, not a footnote

L2ARC is populated by writes. Some workloads cause steady churn: blocks get written to cache, replaced, written again. That can burn through consumer TBW ratings faster than your procurement team expects.

Rule of thumb: if you can’t confidently explain your cache write rate, don’t buy the cheapest NVMe. Buy the one you can replace without a career discussion.
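
If the drive is already in service, you can estimate the cache write rate from SMART deltas. NVMe “Data Units” are 1000 × 512 bytes each, so the difference between two samples gives bytes written; a rough sketch:

cr0x@server:~$ sudo smartctl -A /dev/nvme0n1 | grep 'Data Units Written'
Data Units Written:                 14,512,331 [7.43 TB]
cr0x@server:~$ sleep 3600; sudo smartctl -A /dev/nvme0n1 | grep 'Data Units Written'
Data Units Written:                 14,541,500 [7.44 TB]

That delta (~29,000 units, roughly 15 GB in an hour, about 0.36 TB/day) is the number to hold against the drive’s TBW rating before deciding whether the cache is affordable.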

Power loss protection (PLP)

L2ARC is a cache; it’s not required to be durable. But power loss behavior still matters because a device that behaves badly during brownouts can hang the PCIe bus, wedge I/O paths, or trigger controller resets. PLP isn’t mandatory for L2ARC, but “boring enterprise behavior” is valuable in production.

Sizing L2ARC: simple math, real constraints

People size L2ARC like they size object storage: “more is better.” That’s how you end up with a 4 TB cache in front of a system with 64 GB of RAM and an ARC that wheezes.

The RAM overhead reality

L2ARC needs metadata in ARC for each cached block. Exact overhead varies by implementation and recordsize, but you should assume real, non-trivial RAM cost. If ARC is already tight, a big L2ARC can hurt. You want enough RAM to keep the most valuable metadata and frequently used data in ARC, while L2ARC catches what spills.
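
Back-of-envelope, assuming roughly 80 bytes of ARC header per cached block (the exact figure varies by OpenZFS version): a 1 TB cache full of 128K records is about 8.4 million blocks, or ~0.7 GB of RAM; the same cache full of 16K records is ~5 GB. You can read the real overhead off a live Linux OpenZFS host (values representative):

cr0x@server:~$ awk '$1 ~ /^l2_(size|asize|hdr_size)$/ {printf "%-12s %7.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
l2_size         195.3 GiB
l2_asize        158.2 GiB
l2_hdr_size       0.6 GiB

l2_hdr_size is ARC memory spent describing the cache; if it rivals your hot metadata, the cache is eating its own lunch.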

Start small and prove value

Operationally sane starting points:

  • Pick an L2ARC size that is no more than a few multiples of RAM (commonly cited guidance caps it around five to ten times RAM), unless you have measured and can justify more.
  • Prefer “right-sized” NVMe over “largest available.” A cache that never warms is a monument to optimism.
  • Measure hit ratios and latency improvements before scaling.

Recordsize and compression matter

ZFS caches blocks. If your datasets use a 1M recordsize for large sequential writes, that can change the caching behavior compared to a 16K recordsize typical of some databases. Compression also affects how much “logical” data fits in cache and can improve effective cache capacity—sometimes dramatically.
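
Both knobs are visible per dataset; the compressratio below is representative, and tank/vmstore is the same dataset the tasks use later:

cr0x@server:~$ zfs get -o name,property,value recordsize,compressratio tank/vmstore
NAME          PROPERTY       VALUE
tank/vmstore  recordsize     128K
tank/vmstore  compressratio  1.61x

A 1.61x ratio means the same NVMe holds roughly 1.6x the logical data, which is effectively free cache capacity.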

Tuning knobs that matter (and the ones that mostly don’t)

Most L2ARC tuning is about reducing useless caching and preventing cache fill from becoming its own workload.

Prefer caching demand reads over prefetch

Prefetch is ZFS trying to be helpful for sequential access. For L2ARC, it can be actively unhelpful because it fills cache with data that was read once, sequentially, and will never be read again.

Common approach: enable L2ARC but set it to avoid caching prefetch. On Linux/OpenZFS, this is often controlled via l2arc_noprefetch. Exact defaults and behavior can vary by version; measure on your build.
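
Checking the current value is cheap. On Linux OpenZFS it is a module parameter (FreeBSD exposes an equivalent in the vfs.zfs sysctl tree):

cr0x@server:~$ cat /sys/module/zfs/parameters/l2arc_noprefetch
1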

Control the feed rate

L2ARC is populated from ARC eviction. There are tunables controlling how much is written per interval and how quickly it ramps. If you push too hard, you:

  • Increase write load on NVMe (endurance hit).
  • Spend CPU on cache bookkeeping.
  • Potentially compete with real workload I/O.
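
Back-of-envelope with common defaults (l2arc_write_max of 8 MiB per one-second feed interval), the steady-state fill ceiling is modest, which is also why warm-up takes hours:

cr0x@server:~$ echo "$(( 8 * 86400 / 1024 )) GiB/day max steady-state L2ARC fill at 8 MiB/s"
675 GiB/day max steady-state L2ARC fill at 8 MiB/s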

Persistent L2ARC: nice, but don’t worship it

Persistent L2ARC reduces warm-up after reboot by keeping cache headers and content across restarts. It can be a big win for systems that reboot for patching and need predictable post-reboot performance. But persistence can increase complexity and boot-time scanning behavior. Treat it like any other feature: test, then deploy.
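
On OpenZFS 2.0 and later, persistence is governed by the l2arc_rebuild_enabled module parameter; it tells you whether the cache will be rebuilt from device headers after a reboot:

cr0x@server:~$ cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
1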

Practical tasks: commands, outputs, and the decisions you make

These are real tasks you can run on a Linux OpenZFS host. Outputs are representative. Your numbers will differ; your decisions shouldn’t be vibes-based.

Task 1: Verify pool health before touching cache

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:11:43 with 0 errors on Sun Dec 22 03:10:12 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

What it means: You’re not already on fire. If you see checksum errors or degraded vdevs, fix that first.

Decision: Only add L2ARC to a stable pool. A cache device won’t rescue a pool that’s sick.

Task 2: Check whether an L2ARC already exists

cr0x@server:~$ sudo zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        65.4T  41.2T  24.2T        -         -    22%    63%  1.00x  ONLINE  -
  raidz2-0 65.4T  41.2T  24.2T        -         -    22%    63%
    sda        -      -      -        -         -      -      -
    sdb        -      -      -        -         -      -      -
    sdc        -      -      -        -         -      -      -
    sdd        -      -      -        -         -      -      -
    sde        -      -      -        -         -      -      -
    sdf        -      -      -        -         -      -      -

What it means: No cache section shown, so no L2ARC.

Decision: Proceed to select and attach an NVMe, or stop if you can solve the issue with RAM or layout instead.

Task 3: Confirm the NVMe device identity and health (before ZFS touches it)

cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,MOUNTPOINT | grep -E 'nvme|NAME'
NAME        MODEL                 SIZE ROTA TYPE MOUNTPOINT
nvme0n1     Samsung SSD 980 PRO  931.5G    0 disk

What it means: You’re about to use /dev/nvme0n1, a nominal 1 TB drive (931.5G as lsblk reports it), non-rotational.

Decision: Confirm you are not pointing at your OS disk. If this NVMe also holds root, stop. Sharing is possible but rarely wise.

Task 4: Check NVMe SMART and endurance indicators

cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | sed -n '1,35p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0] (local build)
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 1TB
Serial Number:                      S5GXNF0R123456A
Firmware Version:                   5B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Written:                 14,512,331
Data Units Read:                    22,900,112

What it means: Drive is healthy, low wear (Percentage Used).

Decision: If wear is already high, don’t volunteer this device for cache churn. If you must, limit feed rate and monitor writes.

Task 5: Observe current ARC size and pressure

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz   c
12:10:01  8160  1432     17   902   11   420    5   110    1     46G  58G
12:10:02  7992  1470     18   961   12   401    5   108    1     46G  58G
12:10:03  8421  1522     18  1002   12   413    5   107    1     46G  58G

What it means: ARC is ~46G (arcsz) against a target (c) of ~58G. Miss rate is ~18%. That might be fine or awful depending on pool latency.

Decision: If ARC is far below its target because the OS is starving it, fix memory pressure first. L2ARC is not a substitute for RAM.

Task 6: Measure where your reads are coming from right now

cr0x@server:~$ sudo zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        41.2T  24.2T   8900    420   195M  6.8M
  raidz2    41.2T  24.2T   8900    420   195M  6.8M
    sda         -      -   1500     70  33.0M  1.2M
    sdb         -      -   1490     72  32.8M  1.1M
    sdc         -      -   1520     71  33.5M  1.1M
    sdd         -      -   1475     68  32.4M  1.1M
    sde         -      -   1460     69  32.1M  1.1M
    sdf         -      -   1455     70  31.9M  1.1M

What it means: Reads are hitting the HDD vdevs hard. If latency is high and the working set is re-read, L2ARC can help.

Decision: If disks are saturated and reads are random, a cache is plausible. If reads are small and metadata-heavy, consider a special vdev instead.

Task 7: Attach NVMe as L2ARC

cr0x@server:~$ sudo zpool add tank cache /dev/nvme0n1
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
        cache
          nvme0n1                   ONLINE       0     0     0

What it means: NVMe is now a cache device (L2ARC).

Decision: Don’t declare victory. Now you monitor warm-up, hit ratio, and NVMe writes. (In production, prefer a stable /dev/disk/by-id path over /dev/nvme0n1 when adding the device.)

Task 8: Verify L2ARC is being used (hit rates and size)

cr0x@server:~$ arcstat -f time,read,miss,miss%,l2hits,l2miss,l2miss%,l2size,arcsz,c 1 5
    time  read  miss  miss%  l2hits  l2miss  l2miss%  l2size  arcsz      c
12:18:01  9010  1210     13    2200     980       31    190G    47G    58G
12:18:02  9155  1198     13    2350     960       29    191G    47G    58G
12:18:03  9033  1160     12    2480     910       27    192G    47G    58G
12:18:04  9199  1205     13    2605     920       26    194G    47G    58G
12:18:05  9122  1188     13    2710     890       25    195G    47G    58G

What it means: L2ARC holds ~195G (l2size) and is serving hits (l2hits). The L2 miss% falls as the cache warms.

Decision: If l2size stays tiny for hours under steady load, something is off: feed rate, workload, or you’re simply not evicting useful blocks.

Task 9: Watch NVMe I/O to ensure cache fill isn’t clobbering latency

cr0x@server:~$ iostat -xm 1 3 | sed -n '1,10p'
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

Device            r/s   rMB/s  r_await    w/s   wMB/s  w_await  %util
nvme0n1        3200.0   125.0     0.25  180.0    40.0     0.70   68.0
sda             250.0     4.1    18.40   12.0     0.3     4.10   55.0
sdb             240.0     4.0    19.10   11.0     0.3     4.20   54.0

What it means: NVMe is busy, but its read latency (r_await) is sub-millisecond. The HDDs show much higher r_await, which is typical.

Decision: If NVMe r_await climbs into multiple milliseconds and competes with application I/O, reduce the L2ARC write feed, or reconsider. Your cache should not become your hottest device.

Task 10: Confirm dataset properties that influence cache behavior

cr0x@server:~$ sudo zfs get -o name,property,value,source recordsize,primarycache,secondarycache,compression tank/vmstore
NAME          PROPERTY        VALUE     SOURCE
tank/vmstore  recordsize      128K      local
tank/vmstore  primarycache    all       default
tank/vmstore  secondarycache  all       default
tank/vmstore  compression     lz4       local

What it means: Both ARC and L2ARC are allowed (primarycache=all, secondarycache=all).

Decision: If a dataset has secondarycache=none, L2ARC won’t cache it. That can be correct for streaming backups, or a silent misconfiguration.

Task 11: Exclude streaming datasets from L2ARC to prevent pollution

cr0x@server:~$ sudo zfs set secondarycache=none tank/backups
cr0x@server:~$ sudo zfs get -o name,property,value,source secondarycache tank/backups
NAME         PROPERTY        VALUE  SOURCE
tank/backups secondarycache  none   local

What it means: Backup dataset will not populate L2ARC.

Decision: Do this when you have big sequential readers/writers that would evict useful cached blocks. Keep the cache for the workloads that pay rent.

Task 12: Inspect L2ARC-related module parameters (Linux OpenZFS)

cr0x@server:~$ grep . /sys/module/zfs/parameters/l2arc_{noprefetch,write_max,write_boost} /sys/module/zfs/parameters/zfs_arc_m{ax,in}
/sys/module/zfs/parameters/l2arc_noprefetch:1
/sys/module/zfs/parameters/l2arc_write_max:8388608
/sys/module/zfs/parameters/l2arc_write_boost:33554432
/sys/module/zfs/parameters/zfs_arc_max:62277025792
/sys/module/zfs/parameters/zfs_arc_min:15569256448

What it means: Prefetched data is not cached to L2ARC (l2arc_noprefetch=1), and write limits are set in bytes per feed interval. On Linux, OpenZFS exposes these as module parameters under /sys/module/zfs/parameters rather than via sysctl; FreeBSD publishes equivalents in the vfs.zfs sysctl tree.

Decision: If l2arc_noprefetch is 0 and you have sequential workloads, consider enabling it. If cache fill is too aggressive, tune down l2arc_write_max carefully and validate.

Task 13: Change an L2ARC tunable safely (temporary) and validate impact

cr0x@server:~$ echo 4194304 | sudo tee /sys/module/zfs/parameters/l2arc_write_max
4194304

What it means: Reduced the max write rate to L2ARC (here to 4 MiB per feed interval; the interval is l2arc_feed_secs, one second by default). The change is live but does not survive a reboot; persist it via /etc/modprobe.d once validated.

Decision: Use this if NVMe writes are competing with latency-sensitive reads/writes. Re-check iostat and application p99 latency after the change.

Task 14: Confirm that you are not confusing SLOG with L2ARC

cr0x@server:~$ sudo zpool status tank | sed -n '1,35p'
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
        cache
          nvme0n1                   ONLINE       0     0     0

What it means: Only a cache vdev is present. No logs section means no SLOG.

Decision: If your pain is synchronous writes (NFS with sync, databases with fsync-heavy patterns), L2ARC won’t fix it. Consider SLOG—carefully—only after measuring.

Task 15: Remove L2ARC (if it’s not helping) without drama

cr0x@server:~$ sudo zpool remove tank nvme0n1
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

What it means: L2ARC detached. The pool is unchanged structurally (cache vdevs don’t hold unique data).

Decision: If performance doesn’t change or improves after removal, your cache was noise or harm. Stop trying to make it work and fix the real bottleneck.

Task 16: Evaluate memory pressure and reclaim behavior (Linux)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           128Gi        94Gi       4.1Gi       1.2Gi        30Gi        26Gi
Swap:           16Gi       2.3Gi        13Gi

What it means: RAM is heavily used, available is 26Gi, swap is in use.

Decision: Swap activity on a storage box is often a sign ARC is squeezed or the host is overcommitted (VMs, containers). Before adding L2ARC, consider adding RAM or reducing colocated workloads. L2ARC metadata overhead is not going to help a swapping host.

Fast diagnosis playbook: find the bottleneck in minutes

This is the order that saves time in production. The goal is not “collect all metrics.” The goal is “identify the limiting resource and the simplest fix.”

First: determine whether you have a read-latency problem or something else

  • Check pool I/O and latency proxies: zpool iostat -v 1, iostat -x 1.
  • If disks are not busy and latency is low, your problem may be CPU, network, application locks, or sync write behavior.

Second: check ARC effectiveness and memory pressure

  • Use arcstat to assess miss rate and ARC size stability.
  • Use free -h and observe swap. If the host is swapping, fix that. L2ARC is not a cure for memory starvation.

Third: confirm the workload has temporal locality

  • Look for repeat reads: DB buffer cache behavior, VM boot storms, repeated CI builds.
  • If it’s mostly streaming, L2ARC will be a wear generator.

Fourth: test a small L2ARC and measure

  • Add NVMe L2ARC and watch l2hit, l2miss, and application p95/p99 latency.
  • Warm-up matters; evaluate over a workload cycle, not 10 minutes.

Fifth: if it’s metadata-heavy, consider a different design

  • Metadata random I/O on HDD pools often responds better to a special vdev than L2ARC.
  • If sync write latency is the issue, measure and consider SLOG (with the correct device and expectations).

Three corporate-world mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They had a busy virtualization cluster on HDD RAIDZ2. Users complained about “random slowness,” mostly Monday mornings. The storage team saw read IOPS on the pool and decided to “fix it properly” with a big NVMe L2ARC—several terabytes, because procurement loves round numbers.

It went in cleanly. The graphs looked exciting: NVMe utilization jumped, and the pool read IOPS dropped. Everyone felt smart for about a day.

Then latency started creeping up. Not on the HDD pool—on the host. ARC size shrank. The hypervisor began swapping under load. VM boot storms got worse, not better, and the helpdesk learned new vocabulary.

The wrong assumption was simple: “L2ARC is extra cache, so it can’t make memory worse.” In reality, the giant L2ARC consumed ARC metadata and reduced effective RAM cache. The system traded fast RAM hits for slower NVMe hits and more misses. They removed the L2ARC, added RAM to the hosts, and reintroduced a smaller cache later, targeted to specific datasets. Performance normalized and stayed boring.

Mini-story 2: The optimization that backfired

A data platform team ran analytics on a ZFS pool backed by SSDs. They wanted faster dashboard queries, and someone suggested L2ARC on a spare consumer NVMe. “It’s just cache; if it fails, nothing breaks.” This was technically true in the narrowest possible sense.

They tuned the L2ARC write parameters aggressively to “warm it faster.” It did warm faster. It also wrote constantly, because the workload scanned large windows of data with little reuse. L2ARC became a write-heavy treadmill.

A few weeks later, the NVMe started throwing media errors. Not catastrophic, but enough to generate alerts and I/O hiccups. During one incident, the device reset caused temporary stalls that looked like a cluster-wide storage outage. People blamed ZFS. ZFS mostly shrugged; the device was having a bad day.

The postmortem was awkward because the “optimization” didn’t improve the dashboard p99 anyway. The fix was dull: remove L2ARC, tune queries to reduce churn, and later add a small special vdev for metadata and small blocks. The dashboards improved and the NVMe stopped trying to die for a feature nobody could quantify.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran ZFS for VM storage and internal services. They ran L2ARC across a pair of NVMe devices (ZFS stripes cache vdevs rather than mirroring them, but a dead cache device only costs performance, and they had the slots). Their change policy was painfully strict: every storage change required a pre-change baseline, a rollback plan, and an explicit success metric.

Quarterly patching required reboots, and they had previously seen post-reboot performance dips while caches warmed. They tested persistent L2ARC in staging, confirmed the reboot behavior, and rolled it out with carefully chosen tunables and a documented “disable persistence” path.

Then came an ugly week: an application deploy increased read amplification and made caches less effective. Because they had baselines, they could prove it wasn’t “the storage getting old.” They adjusted dataset caching policies for the noisy workload and protected the cache from pollution. No heroics. No vendor calls. Just competence.

It wasn’t glamorous, but it prevented the classic war-room loop: “storage is slow” → “add cache” → “still slow” → “add more cache.” Their boring process saved days of churn.

Joke #2: The only thing more persistent than L2ARC is a stakeholder asking if we can “just add more NVMe.”

Common mistakes: symptoms → root cause → fix

1) Symptom: L2ARC hit rate stays near zero

Root cause: Workload is streaming or has low re-read locality; or the dataset has secondarycache=none; or cache is too small and thrashes immediately.

Fix: Verify dataset properties with zfs get secondarycache. Exclude streaming datasets. If locality is low, remove L2ARC and invest elsewhere.

2) Symptom: ARC size shrinks after adding L2ARC, misses increase

Root cause: RAM overhead for L2ARC metadata reduces ARC capacity; host is memory pressured; L2ARC oversized.

Fix: Reduce L2ARC size (use a smaller partition/device), add RAM, or remove L2ARC. Confirm with arcstat and free -h.

3) Symptom: NVMe utilization high, latency spikes, application gets worse

Root cause: L2ARC feed rate too aggressive; NVMe is a consumer drive with poor sustained behavior; cache fill competes with workload.

Fix: Lower l2arc_write_max/l2arc_write_boost. Use a more consistent NVMe. Consider excluding noisy datasets from L2ARC.

4) Symptom: Performance great… until reboot, then awful for hours

Root cause: Non-persistent L2ARC requires warm-up; workload depends heavily on cache hits.

Fix: Evaluate persistent L2ARC support in your ZFS version. If not available or not desirable, plan reboots and pre-warm, or increase RAM to reduce reliance on L2ARC.
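
A crude pre-warm is just reading the hot data so ARC (and, via eviction, L2ARC) fills before users arrive. The path and pattern below are hypothetical; point it at whatever is actually hot:

cr0x@server:~$ find /tank/vmstore -type f -name '*.qcow2' -print0 | xargs -0 -P4 -n1 sh -c 'cat "$0" > /dev/null'

This warms ARC directly; L2ARC fills slowly from eviction, so pre-warming the second tier takes repeated passes and patience.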

5) Symptom: You expected faster writes, but nothing changed

Root cause: L2ARC is read cache. Writes are gated by ZIL/SLOG behavior, pool layout, and sync semantics.

Fix: Measure sync writes. If needed, evaluate SLOG with the right device and test. Don’t cargo-cult “NVMe = faster.”

6) Symptom: L2ARC helps some workloads, but backups slow everything down

Root cause: Cache pollution and eviction pressure from sequential backup reads/writes.

Fix: Set secondarycache=none (and sometimes primarycache=metadata) on backup datasets. Keep cache for interactive/hot workloads.

7) Symptom: Lots of CPU usage in kernel during load after enabling L2ARC

Root cause: Cache bookkeeping overhead plus elevated eviction activity; possibly too many small blocks, too aggressive feed.

Fix: Reduce L2ARC write rate; ensure you are not caching prefetch; consider increasing recordsize where appropriate or rethinking the workload’s I/O pattern.

8) Symptom: NVMe wears out faster than expected

Root cause: Constant churn due to low locality; over-sized cache with high turnover; aggressive write tunables.

Fix: Measure write rate, cap feed, exclude streaming datasets, or remove L2ARC. Buy higher endurance devices when the workload justifies it.

Checklists / step-by-step plan

Plan A: Decide whether you should add L2ARC at all

  1. Confirm it’s a read problem. If your pool is write-limited or sync-limited, stop and diagnose that instead.
  2. Measure ARC miss rate and RAM headroom. If the host is memory-pressured, prioritize RAM and workload placement.
  3. Validate re-read locality. L2ARC is for “reads that come back.” If they don’t, don’t cache them.
  4. Check pool layout and metadata pain. If metadata is the bottleneck, a special vdev can outperform L2ARC.

Plan B: Implement NVMe L2ARC safely

  1. Baseline before changes. Capture arcstat, zpool iostat, iostat -x, and application latency.
  2. Pick a sane NVMe. Favor consistent latency and endurance over peak throughput.
  3. Start with a modest cache size. Prove value before scaling up.
  4. Attach the device as cache. Verify with zpool status.
  5. Prevent cache pollution. Set secondarycache=none on streaming datasets.
  6. Monitor warm-up and device load. Watch l2hit/l2miss and NVMe await.
  7. Tune conservatively. Reduce L2ARC write rate if it competes with real I/O.
  8. Document rollback. Know how to zpool remove the cache device and what “success” means.

Plan C: Operational guardrails that keep you out of trouble

  • Alert on NVMe wear and media errors; a minimal alert sketch follows this list. L2ARC can be hard on drives; treat it as a consumable component.
  • Track cache hit ratios over time. A cache that was useful can become useless after application changes.
  • Reboot planning. If L2ARC isn’t persistent, schedule reboots when warm-up won’t hurt.
  • Keep changes small. If you change cache size, tunables, and dataset properties at once, you’ll never know what worked.
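
A minimal wear alert, assuming smartctl in a cron job; the 80% threshold is an arbitrary example, not a standard:

cr0x@server:~$ sudo smartctl -A /dev/nvme0n1 | awk -F: '/Percentage Used/ {gsub(/[ %]/,"",$2); if ($2+0 >= 80) print "WARN: L2ARC NVMe at " $2 "% of rated endurance"}'

No output means the drive is under the threshold; wire the WARN line into whatever pages you.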

FAQ

1) Is L2ARC the same thing as SLOG?

No. L2ARC is a read cache. SLOG is a separate log device used to accelerate synchronous writes by shortening the ZIL commit path. Different problem, different tool.

2) Should I mirror L2ARC devices?

You can’t, and you don’t need to: ZFS does not support mirrored cache vdevs, because L2ARC holds no unique data. If a cache device fails, you lose cache and performance falls back to the pool. If you have the slots, you can add several cache devices and ZFS will stripe across them.

3) Can L2ARC make performance worse?

Yes. The usual mechanisms are ARC shrink (metadata overhead), cache-fill write contention, and cache pollution from streaming workloads. “Cache” is not automatically “faster.”

4) How long does L2ARC take to warm up?

It depends on your working set, eviction rate from ARC, and L2ARC feed limits. For busy systems it can be hours; for calmer systems it can be days. If you need instant wins, buy more RAM first.

5) Should I cache prefetch in L2ARC?

Most of the time: no. Prefetch can flood L2ARC with sequential data that won’t be reread. If your workload is mostly sequential and rereads, test it—don’t assume.

6) How do I prevent backups from wrecking my cache?

Set secondarycache=none on backup datasets. In some environments, also consider primarycache=metadata for streaming datasets so ARC keeps metadata but not bulk data.
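
For completeness, both settings on the example backup dataset from Task 11:

cr0x@server:~$ sudo zfs set secondarycache=none tank/backups
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups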

7) Is persistent L2ARC worth enabling?

If your workloads are sensitive to post-reboot warm-up and your ZFS version supports it reliably, it can be a big quality-of-life improvement. Test reboots in staging and watch boot-time behavior and memory usage.

8) I have an all-NVMe pool. Do I still need L2ARC?

Rarely. ARC in RAM is still faster than NVMe. But if the working set is huge and reread-heavy, and your pool is busy enough that offloading reads helps, L2ARC can still be useful. Measure first; otherwise you’re just adding complexity.

9) What’s better: more RAM or more L2ARC?

More RAM, almost always. ARC hits are cheaper than L2ARC hits, and RAM doesn’t wear out because you looked at the same file twice. Use L2ARC when RAM is already reasonably sized and you still miss.

10) How do I know if L2ARC is paying for itself?

Watch application p95/p99 latency and throughput, not just cache hit ratios. A rising l2hit count is nice, but the business metric is “fewer slow requests.” If you can’t show improvement, remove it.

Practical next steps

  1. Baseline today. Capture arcstat, zpool iostat -v, and iostat -x during a known “slow” period.
  2. Decide if the workload rereads. If it’s streaming, don’t build a cache shrine.
  3. Fix memory pressure first. If you’re swapping or ARC can’t reach a stable size, L2ARC is a distraction.
  4. Start with a small NVMe L2ARC and protect it. Exclude streaming datasets, avoid caching prefetch, and keep feed rates reasonable.
  5. Measure outcomes, not feelings. If p99 latency improves and NVMe wear is acceptable, keep it. If not, remove it and move on to layout changes or RAM.