ZFS Hybrid Blueprint: HDD Data + SSD Metadata + NVMe Cache, Done Right


You bought big, cheap HDDs for capacity. Then production happened: a million small files, chatty databases, VM images that don’t line up with
recordsize, and users who treat “shared storage” as “infinite IOPS.” The pool is “healthy,” scrubs are fine, and yet latency spikes turn your
ticket queue into an archeological site.

A ZFS hybrid layout can fix this—if you stop thinking in marketing terms (“cache!”) and start thinking in failure domains, write paths, and what
ZFS actually does with metadata. This is the blueprint I’d deploy when I need HDD economics without HDD misery.

The mental model: ZFS IO paths, not vibes

ZFS performance work is 80% understanding where time is spent and 20% setting properties. People invert that, then wonder why their “NVMe cache”
didn’t fix a slow metadata workload. If you keep one mental model, keep this:

  • Reads: ARC in RAM first, then (optionally) L2ARC on fast media, then the main vdevs (your HDD pool).
  • Writes (async): land in memory, get aggregated into transaction groups (TXGs), then flushed to the main vdevs.
  • Writes (sync): must be acknowledged only after they’re durable. If you have a SLOG, “durable” means “in the SLOG,” then later flushed to HDDs.
  • Metadata: directory entries, indirect blocks, dnodes—this stuff can dominate IO for small files and random access patterns.

The hybrid blueprint is about putting the right kind of IO on the right kind of device:
bulk sequential data on HDDs; hot metadata (and optionally small blocks) on SSD via a special vdev; sync write intent on a low-latency NVMe SLOG;
and relief for HDD read pressure via L2ARC when ARC isn’t enough.

Here’s the boring truth: most “ZFS cache” problems are actually “not enough RAM” problems. ARC is the only cache that’s always correct, always
low-latency, and doesn’t require a novel to explain at change review.

One idea worth keeping on a sticky note, paraphrasing Werner Vogels: “You build it, you run it” isn’t culture; it’s how you learn what your system really does.

Interesting facts & historical context (short, concrete)

  1. ZFS was born at Sun to fix silent data corruption and admin pain; end-to-end checksums were a core feature, not an add-on.
  2. ARC predates the current cache hype; it’s a memory-resident adaptive replacement cache designed to outperform simple LRU under mixed workloads.
  3. Copy-on-write changed the failure story: ZFS doesn’t overwrite live blocks; it writes new ones and updates pointers, reducing “torn write” issues.
  4. SLOG isn’t a write cache; historically it was introduced to accelerate sync writes by journaling intent, not to absorb bulk throughput.
  5. L2ARC has a “header tax”: it needs metadata in ARC to index L2ARC contents, which is why adding L2ARC can reduce effective ARC.
  6. Special vdevs are comparatively new in OpenZFS history; they formalized the “put metadata on SSD” pattern without hacks.
  7. ashift became a career limiter: early ZFS admins learned that wrong sector alignment can permanently kneecap IOPS on 4K-sector drives.
  8. Compression became mainstream not because it saves space (it does), but because fewer bytes read often beats raw disk speed.

Reference architecture: HDD data + SSD metadata + NVMe cache

Let’s define the goal: a pool where large, cold data stays on HDDs, while the IO patterns that punish HDD latency—metadata lookups, small random
reads, sync write latency—are redirected to fast media in a way that does not compromise safety or operability.

Baseline assumptions (say them out loud)

  • HDDs are for capacity and sequential throughput. They are terrible at random IO latency. That is physics, not a vendor issue.
  • SSDs/NVMe are for latency and IOPS. They fail too; they just fail faster and sometimes more creatively.
  • RAM is the first performance lever. If you don’t know your ARC hit ratio, you’re tuning blind.
  • You’re using OpenZFS on Linux (examples assume Ubuntu/Debian-ish paths). Adjust as needed.

A sane “hybrid” layout (example)

This is a common, production-friendly pattern:

  • Data vdevs: HDD RAIDZ2 (or mirrors if you need IOPS more than capacity efficiency)
  • Special vdev: mirrored SSDs for metadata (and optionally small blocks)
  • SLOG: mirrored NVMe devices (or at least power-loss-protected enterprise NVMe) for sync write latency
  • L2ARC: optional NVMe (often same class as SLOG but not the same devices) for read-heavy workloads that exceed ARC

Why mirrored for special vdev and SLOG? Because those components are not “nice to have.” They become part of the pool’s availability story.
Lose the special vdev and you can lose the pool. Lose the SLOG and you lose sync write acceleration (and possibly in-flight sync semantics during a crash),
but with mirrored and PLP media you avoid turning a power blip into a resume-writing event.
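
For reference, creating this layout from scratch is one command. Treat the sketch below as a shape rather than a recipe: the by-id device names are placeholders, and ashift=12 assumes 4K-sector media (verify yours before you commit).

cr0x@server:~$ sudo zpool create -o ashift=12 -o autotrim=on \
    -O compression=lz4 -O atime=off \
    tank \
    raidz2 /dev/disk/by-id/ata-HDD-1 /dev/disk/by-id/ata-HDD-2 /dev/disk/by-id/ata-HDD-3 \
           /dev/disk/by-id/ata-HDD-4 /dev/disk/by-id/ata-HDD-5 /dev/disk/by-id/ata-HDD-6 \
    special mirror /dev/disk/by-id/nvme-META-1 /dev/disk/by-id/nvme-META-2 \
    log mirror /dev/disk/by-id/nvme-SLOG-1 /dev/disk/by-id/nvme-SLOG-2 \
    cache /dev/disk/by-id/nvme-L2ARC-1

Then run zpool status -v immediately and confirm the topology matches what you intended; Task 1 later in this article does exactly that.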

Joke #1: If someone proposes “a single consumer SSD special vdev to save budget,” ask if they also prefer single-parachute skydiving to reduce weight.

SSD metadata done right: special vdev design

What special vdevs actually do

A special vdev can store metadata, and optionally small file blocks, on faster media. That means directory traversals, file attribute lookups,
indirect block reads, and many small random reads stop hammering your HDDs.

The win is most dramatic when:

  • you have lots of small files (CI artifacts, container layers, source repos, mail spools, build caches)
  • you have deep directory trees and frequent stats/opens
  • your working set is bigger than ARC but metadata is relatively compact

The non-negotiable rule: mirror the special vdev

The special vdev is part of the pool. If it dies and you don’t have redundancy, you may not be “degraded.” You may be “done.”
Treat it like you treat your main vdev redundancy: no single points of failure.

How to decide whether to store small blocks on special vdev

ZFS can place blocks smaller than a threshold on special vdevs via special_small_blocks. This is powerful and dangerous—like giving
your metadata SSDs a side job as a random-read data tier.

Use it when:

  • your HDD vdevs are latency-bound on small random reads
  • you have a known small-block-heavy workload (many files under 64K, lots of tiny object reads)
  • your special vdev has enough endurance and capacity headroom

Avoid it when:

  • you can’t accurately estimate small-block footprint growth
  • your SSDs are consumer-grade with questionable endurance and no PLP
  • you already struggle to keep the special vdev under ~50–60% utilization
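
If you do enable it, do it per dataset rather than pool-wide, and remember it only affects blocks written after the change. A minimal sketch, assuming a hypothetical small-file-heavy dataset tank/builds:

cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/builds
cr0x@server:~$ zfs get -r -s local special_small_blocks tank

The second command lists only the datasets where the property was explicitly set, so surprises stand out.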

Sizing special vdevs without lying to yourself

Rough guidance that holds up in practice:

  • Metadata-only special vdev: often a few percent of pool used space, but can spike with snapshots, tiny records, and heavy churn.
  • Metadata + small blocks: can become “most of your hot data.” Plan capacity like you mean it.

Practical rule: size special vdevs so that, at steady state, they stay comfortably below 70% used. SSDs slow down when they’re full, ZFS metaslab
allocation gets pickier, and your “fast tier” becomes a throttled tier.
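
If you want more than a guess before buying hardware, zdb can walk an existing pool and print a breakdown of blocks by type; the metadata categories give a first-order estimate of what a metadata-only special vdev would need to hold. It reads essentially everything, so run it off-peak and expect hours on a large pool. A sketch (the output path is an arbitrary choice):

cr0x@server:~$ sudo zdb -bb tank | tee /var/tmp/tank-block-stats.txt

Skim the block-type table at the end, add up the ASIZE of the metadata-heavy types, then add generous headroom for snapshots and churn.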

NVMe SLOG: what it is, what it isn’t

What a SLOG does

A SLOG (separate log device) holds the ZFS Intent Log (ZIL) on dedicated fast media instead of carving it out of the main vdevs. It reduces the latency
of acknowledging sync writes by writing a minimal log record to fast storage; the full data is still committed to the main vdevs with the next TXG.

It is not a general write cache. If your workload is mostly async writes, SLOG won’t move the needle. If your workload is sync-heavy (NFS with
sync semantics, databases with fsync(), VM storage that forces sync), SLOG can be the difference between “usable” and “why is it 1998 again?”

SLOG media requirements (be picky)

  • Low latency under sync write (steady, not just benchmark spikes)
  • Power-loss protection (PLP) so acknowledged writes survive a power event
  • Endurance appropriate for write-heavy sync workloads
  • Mirror it if you care about availability and predictable behavior during device failures

SLOG sizing: small, fast, boring

You usually don’t need a huge SLOG. You need one that can sustain your peak sync write rate for the time between TXG commits (typically seconds)
and device flush behavior. Oversizing doesn’t buy you much. Under-speccing buys you latency cliffs.
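
The back-of-envelope sizing and the command to add one are both short. The sketch below assumes a 10GbE front end and the default 5-second TXG interval; the device names are placeholders:

# ~1.25 GB/s line rate x roughly two 5 s TXG intervals = ~12.5 GB of sync data in flight, worst case
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-SLOG-1 /dev/disk/by-id/nvme-SLOG-2

Even a modest slice of a fast PLP device covers that; spend the budget on latency and endurance, not capacity.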

Joke #2: A consumer NVMe without PLP used as SLOG is like a seatbelt made of spaghetti—technically present, spiritually absent.

NVMe L2ARC: when it helps and when it lies

The honest pitch for L2ARC

L2ARC is a second-level read cache. It helps when:

  • you have a read-heavy workload with a working set larger than ARC
  • your access pattern has reuse (cacheable reads, not one-and-done scans)
  • your bottleneck is HDD read latency/IOPS, not CPU or network

Why L2ARC disappoints people

L2ARC is not free. It:

  • consumes ARC memory for headers and indexing
  • warms up over time (it’s not instantly populated after reboot unless configured otherwise)
  • can be ineffective for streaming reads, large sequential scans, or datasets with low locality

Practical advice

If you’re short on RAM, buy RAM before you buy L2ARC. If you’re already well-provisioned on RAM and still read-latency bound, L2ARC can be a
solid lever. But measure, don’t assume.
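
One genuinely nice property: cache devices can be added and removed online without any risk to pool data, so you can trial L2ARC against real traffic. A sketch with a placeholder device name:

cr0x@server:~$ sudo zpool add tank cache /dev/disk/by-id/nvme-L2ARC-1
# ...measure for a few days (l2hits in arcstat, application latency), then keep it or pull it:
cr0x@server:~$ sudo zpool remove tank /dev/disk/by-id/nvme-L2ARC-1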

Dataset tuning that actually moves the needle

Compression: default-on unless you have a reason

compression=lz4 is usually a win: fewer bytes to read from HDDs, often faster overall. If your data is already compressed (media files,
encrypted blobs), it may not reduce size much—but it typically doesn’t hurt unless CPU is constrained.
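
Setting it and checking what it actually buys you is quick; compressratio only reflects data written after the change, so give it time before judging. A minimal sketch, assuming a dataset named tank/data:

cr0x@server:~$ sudo zfs set compression=lz4 tank/data
cr0x@server:~$ zfs get compression,compressratio tank/data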

recordsize: align with how you read, not how you store

For general filesystems, recordsize=128K is fine. For databases, you often want smaller records (like 16K) to reduce read amplification.
For VM images, consider 64K or 128K depending on workload. For zvols, tune volblocksize at creation time (you can’t change it later).
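
A sketch of both cases, with hypothetical dataset and zvol names; recordsize can be changed later (it affects newly written files only), volblocksize cannot:

cr0x@server:~$ sudo zfs create -o recordsize=16K tank/db
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=64K tank/vm-web01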

sync and logbias: don’t “fix latency” by deleting safety

Setting sync=disabled is not tuning. It’s a choice to lie to applications about durability. Sometimes people do it under pressure.
Sometimes they learn what “crash-consistent” means the hard way.

If you need to optimize sync workloads, use a real SLOG and consider logbias=latency for datasets where sync latency matters.
Use logbias=throughput only when you explicitly want ZFS to favor main pool writes over ZIL behavior—and you’ve tested it.
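
Two commands cover the safe version of this, assuming tank/db is the latency-sensitive dataset: audit for places where someone already overrode sync, then make the logbias choice explicit (latency is the default, but stating it documents intent):

cr0x@server:~$ zfs get -r -s local sync tank
cr0x@server:~$ sudo zfs set logbias=latency tank/db

If the first command shows sync=disabled anywhere, treat it as a finding to investigate, not a tuning to copy.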

atime: small win, low risk

Disabling atime (atime=off) reduces metadata writes for read-heavy workloads. It’s rarely controversial in modern environments.

Practical tasks with commands: what to run, what it means, what you decide

These are tasks I actually run on real systems. Each includes: the command, a plausible output snippet, what it means, and what decision you make.
Use these as building blocks for your own runbooks.

Task 1: Confirm the pool topology (do you even have what you think you have?)

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            ata-ST12000NM0008-1A-0001     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0002     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0003     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0004     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0005     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0006     ONLINE       0     0     0
          special
            mirror-1                      ONLINE       0     0     0
              nvme-SAMSUNG_MZ1LB960HAJQ-0001  ONLINE   0     0     0
              nvme-SAMSUNG_MZ1LB960HAJQ-0002  ONLINE   0     0     0
        logs
          mirror-2                        ONLINE       0     0     0
            nvme-INTEL_SSDPED1D280GA-0001  ONLINE       0     0     0
            nvme-INTEL_SSDPED1D280GA-0002  ONLINE       0     0     0
        cache
          nvme-SAMSUNG_MZ1LB960HAJQ-0003   ONLINE       0     0     0

errors: No known data errors

What it means: You have HDD RAIDZ2 for data, a mirrored special vdev, a mirrored SLOG, and an L2ARC device.

Decision: If special or logs are single devices, fix that before you call the system “production-ready.”

Task 2: Check ashift (alignment) and sector sizes

cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|path" -n | head
34:            ashift: 12
102:            path: '/dev/disk/by-id/ata-ST12000NM0008-1A-0001'

What it means: ashift: 12 = 4K sectors. Good for modern HDDs/SSDs. If you see 9 (512B) on 4K media, performance can be permanently bad.

Decision: Wrong ashift means rebuild/recreate. Don’t bargain with physics.

Task 3: See how much is on special vdev (capacity risk check)

cr0x@server:~$ sudo zpool list -v tank
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank       65.4T  40.1T  25.3T         -    18%    61%  1.00x  ONLINE  -
  raidz2   65.4T  39.4T  26.0T         -    18%    60%      -  ONLINE
  special   1.7T   0.7T   1.0T         -    12%    41%      -  ONLINE
  logs       279G   4.2G   275G        -     1%     1%      -  ONLINE
  cache      894G   812G    82G        -    22%    90%      -  ONLINE

What it means: Special vdev is 41% used—fine. L2ARC is 90% “used,” which is normal; it fills by design.

Decision: If special vdev creeps toward 70–80% with special_small_blocks enabled, plan expansion before it becomes your outage.

Task 4: Confirm dataset properties (are you accidentally sabotaging yourself?)

cr0x@server:~$ sudo zfs get -o name,property,value -s local,received compression,atime,recordsize,sync,logbias,special_small_blocks tank/data
NAME       PROPERTY              VALUE
tank/data  compression           lz4
tank/data  atime                 off
tank/data  recordsize            128K
tank/data  sync                  standard
tank/data  logbias               latency
tank/data  special_small_blocks  16K

What it means: Sensible defaults for many mixed workloads. Small blocks <=16K go to special vdev.

Decision: If special vdev endurance is a concern, raise or disable special_small_blocks. If sync latency is a problem, keep logbias=latency and invest in SLOG quality.

Task 5: Determine whether your workload is sync-heavy (SLOG relevance test)

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        40.1T  25.3T    220   1800  38.2M  145M
  raidz2                    39.4T  26.0T    120    240  34.1M  120M
  special                    0.7T   1.0T     90   1550   4.0M  25.0M
  mirror (logs)              4.2G   275G      0   1600      0  22.0M
  cache                         -      -      -      -      -      -
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Writes are landing heavily on logs and special vdev. That often indicates sync traffic and metadata/small-block activity.

Decision: If logs show zero while you have a “database latency problem,” you might not be sync-bound; chase something else (ARC misses, HDD queueing, CPU).

Task 6: Check ARC behavior (is RAM doing the job?)

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:10   420    35      8     9    2    26    6     0    0   62G   64G
12:01:11   410    40     10    12    3    28    7     0    0   62G   64G
12:01:12   395    33      8     8    2    25    6     0    0   62G   64G

What it means: ARC miss rate ~8–10% is decent. If miss% is consistently high, HDDs will get hammered and L2ARC might help—after RAM is adequate.

Decision: If arcsz is pinned at c and miss% is high, consider more RAM or reducing memory pressure elsewhere.

Task 7: Validate L2ARC effectiveness (is it actually serving reads?)

cr0x@server:~$ sudo arcstat -f time,read,miss,l2hits,l2miss,l2read 1 3
    time  read  miss  l2hits  l2miss  l2read
12:02:21   500    70     120      40     160
12:02:22   480    65     110      35     145
12:02:23   510    68     130      38     168

What it means: L2ARC is getting hits. If l2hits is near zero, your L2ARC isn’t helping (or it hasn’t warmed up, or your workload has no reuse).

Decision: If L2ARC isn’t helping, remove it or repurpose the device; don’t keep complexity as a decorative accessory.

Task 8: Inspect special vdev IO pressure (metadata tier saturation)

cr0x@server:~$ sudo iostat -x 1 3 /dev/nvme0n1 /dev/nvme1n1
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

Device            r/s     w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz   await  %util
nvme0n1          820.0   540.0    42.0    31.5     94.0     2.10    1.8   92.0
nvme1n1          815.0   530.0    41.5    30.9     93.5     2.05    1.9   91.0

What it means: Special vdev SSDs are near saturation (%util ~90+) but still low latency (await ~2ms). That’s okay… until it isn’t.

Decision: If await climbs (say 10–20ms) during load, special vdev is your bottleneck. Consider faster SSDs, more lanes, or raising special_small_blocks.

Task 9: Confirm sync write latency path (SLOG device health and pressure)

cr0x@server:~$ sudo iostat -x 1 3 /dev/nvme2n1 /dev/nvme3n1
Device            r/s     w/s   rMB/s   wMB/s  avgqu-sz   await  %util
nvme2n1            0.0  2200.0     0.0    28.0      0.60    0.4   55.0
nvme3n1            0.0  2180.0     0.0    27.8      0.58    0.4   54.0

What it means: SLOG writes are fast and steady. If await jumps, your sync write performance is capped by SLOG latency.

Decision: If SLOG latency is poor, replace with PLP enterprise NVMe. Don’t “tune around” bad media.

Task 10: Identify whether HDD vdevs are queueing (classic RAIDZ small IO pain)

cr0x@server:~$ sudo zpool iostat -v tank 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        40.1T  25.3T    900    600   110M   80M
  raidz2                    39.4T  26.0T    860    560   106M   76M
    ata-ST12000...0001         -      -    140     95  18.0M  12.7M
    ata-ST12000...0002         -      -    145     92  18.2M  12.4M
    ata-ST12000...0003         -      -    142     93  18.1M  12.5M
    ata-ST12000...0004         -      -    143     96  18.0M  12.8M
    ata-ST12000...0005         -      -    145     92  18.3M  12.4M
    ata-ST12000...0006         -      -    145     92  18.2M  12.4M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Lots of per-disk ops suggests random IO. RAIDZ2 can be fine, but small random writes are expensive (parity math + read-modify-write).

Decision: If this is a VM or DB pool and latency matters, mirrors often beat RAIDZ for IOPS. Or push small blocks/metadata onto special vdevs thoughtfully.

Task 11: Track TXG behavior and commit pressure (latency spikes clue)

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cr0x@server:~$ sudo cat /proc/spl/kstat/zfs/tank/txgs
txg      birth            state ndirty     nread  nwritten   reads  writes  otime       qtime   wtime    stime
9471284  7486474483901    C     188743680  0      176160768  10     1412    5004123981  11873   1987654  298412308
9471285  7491478607882    C     201326592  0      188743680  12     1480    5008403529  12583   2528753  312107710

What it means: zfs_txg_timeout is 5s, and the kstat times are in nanoseconds: otime (~5.0s) shows TXGs opening on the normal cadence, while stime (~312ms) is how long the last sync took. If stime balloons into seconds, the pool is struggling to flush.

Decision: If synctime spikes correlate with application latency, focus on write throughput and device latency—often HDD vdev saturation or special vdev pressure.

Task 12: Verify that autotrim is set appropriately (SSD longevity + steady performance)

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

What it means: TRIM is enabled, helping SSDs maintain performance and reducing write amplification.

Decision: For SSD-based special vdev, logs, and L2ARC, autotrim=on is typically the right call.

Task 13: Check device error counters and slow degradation

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Dec 21 04:00:31 2025
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            ata-ST12000NM0008-1A-0003     ONLINE       0     0     0
            ata-ST12000NM0008-1A-0004     ONLINE       0     0     0
          special
            mirror-1                      ONLINE       0     0     0
              nvme-SAMSUNG_MZ1LB960HAJQ-0001  ONLINE   0     0     0
              nvme-SAMSUNG_MZ1LB960HAJQ-0002  ONLINE   0     0     0

errors: No known data errors

What it means: Scrubs are clean; no growing error counts.

Decision: If READ/WRITE/CKSUM start incrementing on special vdev or SLOG devices, treat it as urgent. “It still works” is not a strategy.

Task 14: Confirm special allocation is actually used (avoid placebo designs)

cr0x@server:~$ sudo zdb -bbbs tank | grep -E "Special|special" | head
    Special allocation class: 734.22G used, 1.01T available

What it means: Data is being allocated in the special class. If it’s near zero on a metadata-heavy system, your design isn’t being exercised.

Decision: Validate special_small_blocks and confirm you created the special vdev correctly; don’t assume “SSD present” means “SSD used.”

Fast diagnosis playbook: what to check first/second/third

When a hybrid ZFS pool is “slow,” you don’t need a day of philosophical debate. You need a tight loop: identify which layer is saturated, then
map it to a device class and IO type. A small script that captures the loop is sketched after the steps below.

First: is it a cache/RAM problem?

  • Check ARC size and miss% (arcstat).
  • If miss% is high and ARC is capped: you’re likely disk-bound because RAM isn’t holding the working set.
  • Decision: add RAM, reduce memory pressure, or accept that HDDs will see reads (and then consider L2ARC).

Second: is it a sync write latency problem?

  • Check whether logs are active (zpool iostat -v shows log writes).
  • Check SLOG device latency (iostat -x on SLOG devices).
  • Decision: if sync-heavy and SLOG awaits are high, upgrade/replace SLOG media; don’t touch sync=disabled unless you enjoy postmortems.

Third: is it metadata/small-block pressure on special vdev?

  • Check special vdev utilization and IO (zpool list -v, iostat -x).
  • Check if special is near full or showing rising latency.
  • Decision: expand special vdev (add another mirrored special vdev), use faster devices, or adjust special_small_blocks.

Fourth: is it plain old HDD vdev saturation?

  • Check per-vdev and per-disk ops (zpool iostat -v).
  • High ops and low bandwidth on HDDs = random IO pain; RAIDZ will feel it.
  • Decision: change workload layout (separate pool for VMs/DBs), move hot datasets to SSD pool, or accept mirrors for IOPS-centric tiers.

Fifth: check the non-ZFS parts (because reality)

  • Network latency (for NFS/SMB), CPU steal, IRQ saturation, HBA queue depth, and virtualization limits can masquerade as “storage.”
  • Decision: verify with system metrics; don’t blame ZFS for a 1GbE bottleneck.
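
To keep this loop out of tribal memory, here is a minimal triage-snapshot sketch; the pool name tank and the output path are assumptions, and it only captures the first four checks (network and CPU need their own tools):

#!/bin/sh
# Quick ZFS triage snapshot: ARC behavior, per-vdev IO, capacity, and raw device latency.
POOL="tank"
OUT="/var/tmp/zfs-triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== ARC hit/miss =="
  arcstat 1 5
  echo "== Per-vdev IO (watch logs and special) =="
  zpool iostat -v "$POOL" 1 5
  echo "== Capacity and fragmentation =="
  zpool list -v "$POOL"
  echo "== Raw device latency =="
  iostat -x 1 5
} > "$OUT" 2>&1
echo "Triage snapshot written to $OUT"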

Common mistakes (symptoms → root cause → fix)

1) “We added NVMe cache and nothing changed.”

Symptoms: Same read latency, same HDD ops, L2ARC hit rate near zero.

Root cause: Workload has low reuse (streaming), or ARC is too small and L2ARC header overhead made it worse, or L2ARC never warmed.

Fix: Measure with arcstat. Add RAM first. If access pattern is streaming, remove L2ARC and focus on vdev layout and sequential throughput.

2) “Special vdev is filling up and we’re scared.”

Symptoms: Special class usage grows faster than expected; SSD wear climbs; latency spikes under metadata churn.

Root cause: special_small_blocks set too high, small blocks dominate, snapshots amplify metadata, or special vdev under-sized.

Fix: Lower special_small_blocks (future allocations only), add another mirrored special vdev, and keep special below ~70% used.

3) “Sync writes are slow even with a SLOG.”

Symptoms: High latency on database commits; SLOG device shows high await; application stalls during bursts.

Root cause: SLOG device lacks PLP, has poor steady-state latency, is shared with other workloads, or is not actually being used.

Fix: Verify log activity with zpool iostat -v. Use dedicated enterprise NVMe with PLP. Mirror it.

4) “We set sync=disabled and it got fast. Why not keep it?”

Symptoms: Performance improves immediately; management is happy; SRE is quietly updating their resume.

Root cause: You traded durability for speed. Apps expecting fsync durability are now being lied to.

Fix: Revert to sync=standard. Deploy proper SLOG and tune dataset properties to match the workload.

5) “Pool is healthy but latency spikes every few seconds.”

Symptoms: Periodic stalls; graphs show sawtooth write latency; user complaints during peaks.

Root cause: TXG sync pressure: HDDs can’t flush fast enough; special vdev saturated; fragmentation and near-full pool amplify it.

Fix: Check TXG synctime, vdev utilization, pool fullness. Add vdevs, reduce write amplification (compression helps), and keep pool below ~80% used for heavy-write environments.

6) “We used a single special vdev because it’s ‘just metadata.’”

Symptoms: Sudden pool failure or missing metadata blocks after SSD dies; recovery options are bleak.

Root cause: Special vdev is not optional; it’s in the data path.

Fix: Always mirror special vdevs. If you already deployed a single-disk special vdev, attach a second device to convert it into a mirror (zpool attach) as soon as you can; if the wider design also needs rework, plan a backup, rebuild, restore.
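
The retrofit itself is short. A minimal sketch, assuming the lone special device and its new partner are both visible under /dev/disk/by-id (names are placeholders) and the new device supports the pool's ashift:

cr0x@server:~$ sudo zpool attach tank /dev/disk/by-id/nvme-SPECIAL-OLD /dev/disk/by-id/nvme-SPECIAL-NEW
cr0x@server:~$ zpool status tank

Wait for the resilver to finish and confirm the special vdev now shows as a mirror before you relax.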

Three corporate mini-stories (anonymized, plausible, technically accurate)

Incident caused by a wrong assumption: “Metadata can’t take down the pool, right?”

A mid-sized SaaS company ran a file-heavy pipeline: build artifacts, container layers, and lots of tiny JSON. Their storage team did something
clever: HDD RAIDZ2 for bulk, plus a single “metadata SSD” because the budget meeting was a contact sport.

It worked brilliantly for months. Directory listings were snappy. Builds stopped timing out. People started using the same pool for more things
because success is a magnet for scope creep.

Then the SSD died. Not dramatically—no smoke, just a quiet drop from the PCIe bus. The pool didn’t politely “degrade.” It panicked. Metadata
blocks that ZFS expected to be there… weren’t. Services restarted into failures that looked like random filesystem corruption.

The team’s first instinct was to treat it like a cache failure: reboot, rescan, re-seat. That wasted hours. The actual fix was painful and honest:
restore from backups to a correctly designed pool with mirrored special vdevs. They got the system back, but they also got a new policy:
“if it’s in the pool topology, it’s redundant.”

The lesson wasn’t “ZFS is fragile.” The lesson was that ZFS is literal. If you tell it metadata lives on that device, ZFS will believe you with
the devotion of a golden retriever.

Optimization that backfired: “Let’s turn on special_small_blocks everywhere”

A large enterprise analytics platform had a mix of workloads: data lake files (large), ETL temp files (medium), and a nasty swarm of small logs
and indexes. They added mirrored SSD special vdevs and saw immediate improvements. So far, so good.

Someone then proposed setting special_small_blocks=128K across most datasets. The reasoning sounded clean: “If most blocks go to SSD,
HDDs become just capacity. SSDs are fast. Everyone wins.” The change sailed through because it produced a nice-looking latency graph in the first hour.

Weeks later, the SSDs were fuller than expected and write latency started wobbling. Not constant failure—worse: intermittent stalls that hit
during peak batch windows. Scrubs were still clean. SMART looked “fine.” The on-call rotation began to develop opinions.

The actual cause was predictable in hindsight: the special vdev had quietly become the primary data tier for most active datasets. It absorbed
write amplification, snapshot churn, and random reads. Once it approached high utilization, the SSDs’ internal garbage collection and ZFS allocation
behavior combined into latency spikes. The HDDs were innocent bystanders.

The fix was boring and effective: they reduced special_small_blocks to a conservative value for general datasets, reserved aggressive
values only for the few small-file-heavy trees, and added capacity to the special class. Performance stabilized. The graphs got less exciting.

Boring but correct practice that saved the day: “Measure, scrub, and rehearse”

A financial services team ran ZFS-backed NFS for internal applications. No one outside the team cared about the storage details, which is the best
possible state for storage: invisible.

They had a routine that felt almost quaint: monthly scrub windows, quarterly restore drills, and a standing dashboard that tracked ARC hit rate,
special vdev usage, SLOG latency, and pool fragmentation. They also had a policy that special vdev utilization should not exceed a conservative threshold.

One quarter, a new application rollout increased small-file churn. The dashboard caught it early: special vdev allocation rose steadily and SSD
write latency ticked up during batch. Nothing was “broken” yet; it was just trending wrong.

Because they saw it early, the fix was surgical: add another mirrored special vdev pair during a planned maintenance window, adjust
special_small_blocks for a couple of datasets, and keep the pool under the utilization line that they’d agreed to treat as a hard limit.

No outage. No incident bridge. No weekend. The team got exactly zero praise, which in operations is how you know you did it right.

Checklists / step-by-step plan

Step-by-step: build the hybrid pool (production-minded)

  1. Pick vdev layout first. If you need IOPS and consistent latency for VMs/DBs, prefer mirrors. If you need capacity efficiency and mostly sequential IO, RAIDZ2 is fine.
  2. Choose special vdev SSDs. Mirror them. Favor endurance and consistent latency over headline peak IOPS.
  3. Choose SLOG NVMe. Use PLP enterprise NVMe. Mirror it. Keep it dedicated.
  4. Decide on L2ARC last. Only after you’ve measured ARC misses and confirmed read locality.
  5. Create pool with correct ashift. Use persistent device names in /dev/disk/by-id. Don’t use /dev/sdX in production.
  6. Set baseline properties. compression=lz4, atime=off, sane recordsize per dataset.
  7. Enable autotrim. Especially when SSDs are in the topology.
  8. Configure monitoring. Track ARC miss%, SLOG latency, special vdev usage, scrub results, and device errors.
  9. Rehearse failure. Pull a device in staging. Confirm alerts. Confirm resilver behavior. Confirm your team’s muscle memory.

Operational checklist: keep it healthy

  • Keep pool utilization below ~80% for write-heavy workloads.
  • Scrub on a schedule and review results, not just “it ran.”
  • Watch special vdev wear and usage; treat it like a tier that can fill.
  • Don’t share SLOG devices with random workloads; latency is the product.
  • Keep firmware and drivers stable; storage stacks don’t enjoy surprise updates.

Migration checklist: converting an existing HDD pool to hybrid

  • Confirm you can add a special vdev without violating redundancy policy (you can, but it must be redundant).
  • Estimate metadata/small-block footprint; plan special capacity with headroom.
  • Add mirrored special vdev (command sketched after this checklist); then set special_small_blocks only where needed.
  • Add mirrored SLOG if sync latency is relevant.
  • Validate with before/after measurements (ARC misses, iostat, app latency).
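
The additions themselves are one-liners; the part that matters is the mirror keyword. A sketch with placeholder device names (the SLOG addition shown earlier in the SLOG section has the same shape):

cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-META-1 /dev/disk/by-id/nvme-META-2

New metadata (and small blocks where enabled) starts landing on the special vdev immediately; existing blocks stay on the HDDs until they are rewritten.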

FAQ

1) Do I need both SLOG and L2ARC?

No. SLOG helps sync writes. L2ARC helps reads when ARC isn’t enough and the workload has reuse. Many systems need neither if RAM and vdev layout are right.

2) Can I use one NVMe device for both SLOG and L2ARC?

You can, but you usually shouldn’t. SLOG wants predictable low latency; L2ARC wants bandwidth and can generate background writes. Mixing them can create jitter exactly where you least want it.

3) Is a special vdev “just a cache”?

No. It’s allocation, not cache. Data placed there is part of the pool. If you lose a non-redundant special vdev, you can lose the pool.

4) What should I set special_small_blocks to?

Start conservative: 0 (metadata-only) or 16K. Increase only for datasets proven to be small-block dominated, and only if special vdev capacity and endurance are sized for it.

5) Does SLOG improve throughput?

It improves latency for sync writes, which can improve application-level throughput when the app is gated by commit latency. It does not turn HDDs into NVMe for bulk writes.

6) Should I set sync=always for safety?

Only if you accept the performance cost and have a good SLOG. Many applications already issue fsync where needed. Forcing everything sync can be self-inflicted pain.

7) Mirrors vs RAIDZ for the HDD vdevs in a hybrid design?

Mirrors are the latency/IOPS play; RAIDZ is the capacity efficiency play. Special vdevs help RAIDZ pools with metadata/small reads, but they don’t erase parity write costs for random writes.

8) How do I know if L2ARC is hurting me?

Watch ARC pressure and miss%. If adding L2ARC reduces ARC and doesn’t produce meaningful L2 hits, you added complexity and stole RAM for nothing. Measure with arcstat.

9) Can I add a special vdev after pool creation?

Yes. But existing metadata won’t automatically migrate. You’ll see benefits as new allocations happen, and you can trigger rewrites by replication or file-level moves if needed.

10) Is dedup a good idea in this hybrid blueprint?

Usually no, unless you have a very specific dedup-friendly workload and the RAM/CPU budget to match. Compression is the safer default win.

Practical next steps

If you want the hybrid blueprint to work in production, don’t start by buying “cache.” Start by writing down your workload facts: sync vs async,
average IO size, read locality, metadata intensity, and growth. Then design the pool topology so failure doesn’t become a thriller novel.

  1. Run the topology and ARC tasks above on your current system. Identify whether you’re read-miss bound, sync-latency bound, or metadata bound.
  2. If metadata/small files hurt: add a mirrored special vdev and keep it comfortably under 70% used.
  3. If sync writes hurt: add a mirrored PLP NVMe SLOG and verify it’s actually used.
  4. If reads hurt after RAM is sane: consider L2ARC, then validate with hit rates and real app latency.
  5. Lock in boring practices: scrubs, monitoring, restore drills, and a written policy for pool utilization and device class redundancy.

Hybrid ZFS is not a bag of tricks. It’s an IO routing plan with consequences. Do it right, and HDDs go back to being what they’re good at: cheap,
boring capacity. Do it wrong, and you’ll learn which of your SSDs has the most dramatic personality.
