You bought big, cheap HDDs for capacity. Then production happened: a million small files, chatty databases, VM images that don’t line up with
recordsize, and users who treat “shared storage” as “infinite IOPS.” The pool is “healthy,” scrubs are fine, and yet latency spikes turn your
ticket queue into an archeological site.
A ZFS hybrid layout can fix this—if you stop thinking in marketing terms (“cache!”) and start thinking in failure domains, write paths, and what
ZFS actually does with metadata. This is the blueprint I’d deploy when I need HDD economics without HDD misery.
The mental model: ZFS IO paths, not vibes
ZFS performance work is 80% understanding where time is spent and 20% setting properties. People invert that, then wonder why their “NVMe cache”
didn’t fix a slow metadata workload. If you keep one mental model, keep this:
- Reads: ARC in RAM first, then (optionally) L2ARC on fast media, then the main vdevs (your HDD pool).
- Writes (async): land in memory, get aggregated into transaction groups (TXGs), then flushed to the main vdevs.
- Writes (sync): must be acknowledged only after they’re durable. If you have a SLOG, “durable” means “in the SLOG,” then later flushed to HDDs.
- Metadata: directory entries, indirect blocks, dnodes—this stuff can dominate IO for small files and random access patterns.
The hybrid blueprint is about putting the right kind of IO on the right kind of device:
bulk sequential data on HDDs; hot metadata (and optionally small blocks) on SSD via a special vdev; sync write intent on a low-latency NVMe SLOG;
read amplification relief via L2ARC when ARC isn’t enough.
Here’s the boring truth: most “ZFS cache” problems are actually “not enough RAM” problems. ARC is the only cache that’s always correct, always
low-latency, and doesn’t require a novel to explain at change review.
One quote worth keeping on a sticky note, a paraphrased idea from Werner Vogels: “You build it, you run it” isn’t culture—it’s how you learn what your system really does.
Interesting facts & historical context (short, concrete)
- ZFS was born at Sun to fix silent data corruption and admin pain; end-to-end checksums were a core feature, not an add-on.
- ARC predates the current cache hype; it’s a memory-resident adaptive replacement cache designed to outperform simple LRU under mixed workloads.
- Copy-on-write changed the failure story: ZFS doesn’t overwrite live blocks; it writes new ones and updates pointers, reducing “torn write” issues.
- SLOG isn’t a write cache; historically it was introduced to accelerate sync writes by journaling intent, not to absorb bulk throughput.
- L2ARC has a “header tax”: it needs metadata in ARC to index L2ARC contents, which is why adding L2ARC can reduce effective ARC.
- Special vdevs are comparatively new in OpenZFS history; they formalized the “put metadata on SSD” pattern without hacks.
- ashift became a career limiter: early ZFS admins learned that wrong sector alignment can permanently kneecap IOPS on 4K-sector drives.
- Compression became mainstream not because it saves space (it does), but because reading fewer bytes from disk often beats raw spindle speed.
Reference architecture: HDD data + SSD metadata + NVMe cache
Let’s define the goal: a pool where large, cold data stays on HDDs, while the IO patterns that punish HDD latency—metadata lookups, small random
reads, sync write latency—are redirected to fast media in a way that does not compromise safety or operability.
Baseline assumptions (say them out loud)
- HDDs are for capacity and sequential throughput. They are terrible at random IO latency. That is physics, not a vendor issue.
- SSDs/NVMe are for latency and IOPS. They fail too; they just fail faster and sometimes more creatively.
- RAM is the first performance lever. If you don’t know your ARC hit ratio, you’re tuning blind.
- You’re using OpenZFS on Linux (examples assume Ubuntu/Debian-ish paths). Adjust as needed.
A sane “hybrid” layout (example)
This is a common, production-friendly pattern:
- Data vdevs: HDD RAIDZ2 (or mirrors if you need IOPS more than capacity efficiency)
- Special vdev: mirrored SSDs for metadata (and optionally small blocks)
- SLOG: mirrored NVMe devices (or at least power-loss-protected enterprise NVMe) for sync write latency
- L2ARC: optional NVMe (often same class as SLOG but not the same devices) for read-heavy workloads that exceed ARC
Why mirrored for special vdev and SLOG? Because those components are not “nice to have.” They become part of the pool’s availability story.
Lose the special vdev and you can lose the pool. Lose the SLOG and you lose sync write acceleration (and possibly in-flight sync semantics during a crash),
but with mirrored and PLP media you avoid turning a power blip into a resume-writing event.
Joke #1: If someone proposes “a single consumer SSD special vdev to save budget,” ask if they also prefer single-parachute skydiving to reduce weight.
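To make the layout concrete, here is a minimal creation sketch. The pool name and every device path are placeholders; substitute your own /dev/disk/by-id names and sanity-check ashift against your actual drives before you commit.
# All device paths below are hypothetical; use your own /dev/disk/by-id names.
sudo zpool create -o ashift=12 -o autotrim=on \
  -O compression=lz4 -O atime=off \
  tank \
  raidz2 \
    /dev/disk/by-id/ata-HDD-SERIAL-1 /dev/disk/by-id/ata-HDD-SERIAL-2 \
    /dev/disk/by-id/ata-HDD-SERIAL-3 /dev/disk/by-id/ata-HDD-SERIAL-4 \
    /dev/disk/by-id/ata-HDD-SERIAL-5 /dev/disk/by-id/ata-HDD-SERIAL-6 \
  special mirror /dev/disk/by-id/nvme-META-SSD-1 /dev/disk/by-id/nvme-META-SSD-2 \
  log mirror /dev/disk/by-id/nvme-SLOG-PLP-1 /dev/disk/by-id/nvme-SLOG-PLP-2 \
  cache /dev/disk/by-id/nvme-L2ARC-1
If you skip L2ARC for now (a reasonable default), just drop the cache line; it can be added or removed later without touching the rest of the pool.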
SSD metadata done right: special vdev design
What special vdevs actually do
A special vdev can store metadata, and optionally small file blocks, on faster media. That means directory traversals, file attribute lookups,
indirect block reads, and many small random reads stop hammering your HDDs.
The win is most dramatic when:
- you have lots of small files (CI artifacts, container layers, source repos, mail spools, build caches)
- you have deep directory trees and frequent stats/opens
- your working set is bigger than ARC but metadata is relatively compact
The non-negotiable rule: mirror the special vdev
The special vdev is part of the pool. If it dies and you don’t have redundancy, you may not be “degraded.” You may be “done.”
Treat it like you treat your main vdev redundancy: no single points of failure.
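If you inherited a single-device special vdev, you can usually convert it to a mirror in place by attaching a second device of adequate capacity; the device paths below are placeholders, and the pool resilvers the new side.
# Attach a second SSD to the existing special device to form a mirror (hypothetical paths).
sudo zpool attach tank \
  /dev/disk/by-id/nvme-EXISTING-SPECIAL-SSD \
  /dev/disk/by-id/nvme-NEW-SPECIAL-SSD
sudo zpool status tank   # the special entry should now show a mirror and a resilver in progress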
How to decide whether to store small blocks on special vdev
ZFS can place blocks smaller than a threshold on special vdevs via special_small_blocks. This is powerful and dangerous—like giving
your metadata SSDs a side job as a random-read data tier.
Use it when:
- your HDD vdevs are latency-bound on small random reads
- you have a known small-block-heavy workload (many files under 64K, lots of tiny object reads)
- your special vdev has enough endurance and capacity headroom
Avoid it when:
- you can’t accurately estimate small-block footprint growth
- your SSDs are consumer-grade with questionable endurance and no PLP
- you already struggle to keep the special vdev under ~50–60% utilization
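When you do opt in, scope it to the datasets that actually need it rather than flipping it at the pool root. A minimal sketch, with hypothetical dataset names:
# Affects new allocations only; existing blocks stay where they were written.
sudo zfs set special_small_blocks=16K tank/ci-artifacts
sudo zfs get -r special_small_blocks tank   # confirm where it is set locally vs inherited
# Careful: if the value reaches the dataset's recordsize, effectively all of that
# dataset's data blocks will land on the special vdev.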
Sizing special vdevs without lying to yourself
Rough guidance that holds up in practice:
- Metadata-only special vdev: often a few percent of pool used space, but can spike with snapshots, tiny records, and heavy churn.
- Metadata + small blocks: can become “most of your hot data.” Plan capacity like you mean it.
Practical rule: size special vdevs so that, at steady state, they stay comfortably below 70% used. SSDs slow down when they’re full, ZFS metaslab
allocation gets pickier, and your “fast tier” becomes a throttled tier.
NVMe SLOG: what it is, what it isn’t
What a SLOG does
The Separate Intent Log (SLOG) is a device used to store the ZFS Intent Log (ZIL) for sync writes. It reduces the latency of acknowledging
sync writes by writing a minimal log record to fast storage, then later committing the full TXG to the main vdevs.
It is not a general write cache. If your workload is mostly async writes, SLOG won’t move the needle. If your workload is sync-heavy (NFS with
sync semantics, databases with fsync(), VM storage that forces sync), SLOG can be the difference between “usable” and “why is it 1998 again?”
SLOG media requirements (be picky)
- Low latency under sync write (steady, not just benchmark spikes)
- Power-loss protection (PLP) so acknowledged writes survive a power event
- Endurance appropriate for write-heavy sync workloads
- Mirror it if you care about availability and predictable behavior during device failures
SLOG sizing: small, fast, boring
You usually don’t need a huge SLOG. You need one that can sustain your peak sync write rate for the time between TXG commits (typically seconds)
and device flush behavior. Oversizing doesn’t buy you much. Under-speccing buys you latency cliffs.
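A back-of-envelope sketch, assuming a 5-second TXG window and a hypothetical 800 MB/s peak sync ingest; the real numbers are yours to measure:
# peak sync ingest (MB/s) x txg window (s) x safety factor
echo $(( 800 * 5 * 3 ))   # => 12000, i.e. ~12 GB of SLOG is already generous
# A small partition or namespace on a larger PLP NVMe is fine; the unused capacity
# effectively becomes overprovisioning for endurance, not capacity you need.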
Joke #2: A consumer NVMe without PLP used as SLOG is like a seatbelt made of spaghetti—technically present, spiritually absent.
NVMe L2ARC: when it helps and when it lies
The honest pitch for L2ARC
L2ARC is a second-level read cache. It helps when:
- you have a read-heavy workload with a working set larger than ARC
- your access pattern has reuse (cacheable reads, not one-and-done scans)
- your bottleneck is HDD read latency/IOPS, not CPU or network
Why L2ARC disappoints people
L2ARC is not free. It:
- consumes ARC memory for headers and indexing
- warms up over time (it’s not instantly populated after reboot unless configured otherwise)
- can be ineffective for streaming reads, large sequential scans, or datasets with low locality
Practical advice
If you’re short on RAM, buy RAM before you buy L2ARC. If you’re already well-provisioned on RAM and still read-latency bound, L2ARC can be a
solid lever. But measure, don’t assume.
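Treat L2ARC as removable by design: add it, measure for a few days, and take it back out if the hit rate doesn’t pay for the ARC headers. The device path is a placeholder.
sudo zpool add tank cache /dev/disk/by-id/nvme-L2ARC-CANDIDATE
# ...run the real workload, watch l2hits/l2miss with arcstat...
sudo zpool remove tank /dev/disk/by-id/nvme-L2ARC-CANDIDATE   # cache devices detach cleanly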
Dataset tuning that actually moves the needle
Compression: default-on unless you have a reason
compression=lz4 is usually a win: fewer bytes to read from HDDs, often faster overall. If your data is already compressed (media files,
encrypted blobs), it may not reduce size much—but it typically doesn’t hurt unless CPU is constrained.
recordsize: align with how you read, not how you store
For general filesystems, recordsize=128K is fine. For databases, you often want smaller records (like 16K) to reduce read amplification.
For VM images, consider 64K or 128K depending on workload. For zvols, tune volblocksize at creation time (you can’t change it later).
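A hedged sketch of per-workload dataset creation with hypothetical names; match the block size to the application’s dominant IO size rather than folklore:
sudo zfs create -o recordsize=16K tank/pgdata             # database files, small random IO
sudo zfs create -o recordsize=1M  tank/media              # large sequential files (needs large_blocks, on by default in modern pools)
sudo zfs create -V 200G -o volblocksize=16K tank/vm-db01  # zvol: volblocksize is fixed at creation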
sync and logbias: don’t “fix latency” by deleting safety
Setting sync=disabled is not tuning. It’s a choice to lie to applications about durability. Sometimes people do it under pressure.
Sometimes they learn what “crash-consistent” means the hard way.
If you need to optimize sync workloads, use a real SLOG and consider logbias=latency for datasets where sync latency matters.
Use logbias=throughput only when you explicitly want ZFS to favor main pool writes over ZIL behavior—and you’ve tested it.
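The safe version of “fix sync latency” looks like this, with a hypothetical dataset name; the last command is a quick audit for leftovers from past firefighting:
sudo zfs set sync=standard tank/pgdata
sudo zfs set logbias=latency tank/pgdata
sudo zfs get -r sync tank | grep disabled   # should print nothing in a healthy config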
atime: small win, low risk
Disabling atime (atime=off) reduces metadata writes for read-heavy workloads. It’s rarely controversial in modern environments.
Practical tasks with commands: what to run, what it means, what you decide
These are tasks I actually run on real systems. Each includes: the command, a plausible output snippet, what it means, and what decision you make.
Use these as building blocks for your own runbooks.
Task 1: Confirm the pool topology (do you even have what you think you have?)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST12000NM0008-1A-0001 ONLINE 0 0 0
ata-ST12000NM0008-1A-0002 ONLINE 0 0 0
ata-ST12000NM0008-1A-0003 ONLINE 0 0 0
ata-ST12000NM0008-1A-0004 ONLINE 0 0 0
ata-ST12000NM0008-1A-0005 ONLINE 0 0 0
ata-ST12000NM0008-1A-0006 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme-SAMSUNG_MZ1LB960HAJQ-0001 ONLINE 0 0 0
nvme-SAMSUNG_MZ1LB960HAJQ-0002 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
nvme-INTEL_SSDPED1D280GA-0001 ONLINE 0 0 0
nvme-INTEL_SSDPED1D280GA-0002 ONLINE 0 0 0
cache
nvme-SAMSUNG_MZ1LB960HAJQ-0003 ONLINE 0 0 0
errors: No known data errors
What it means: You have HDD RAIDZ2 for data, a mirrored special vdev, a mirrored SLOG, and an L2ARC device.
Decision: If special or logs are single devices, fix that before you call the system “production-ready.”
Task 2: Check ashift (alignment) and sector sizes
cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|path" -n | head
34: ashift: 12
102: path: '/dev/disk/by-id/ata-ST12000NM0008-1A-0001'
What it means: ashift: 12 = 4K sectors. Good for modern HDDs/SSDs. If you see 9 (512B) on 4K media, performance can be permanently bad.
Decision: Wrong ashift means rebuild/recreate. Don’t bargain with physics.
Task 3: See how much is on special vdev (capacity risk check)
cr0x@server:~$ sudo zpool list -v tank
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 65.4T 40.1T 25.3T - 18% 61% 1.00x ONLINE -
raidz2 65.4T 39.4T 26.0T - 18% 60% - ONLINE
special 1.7T 0.7T 1.0T - 12% 41% - ONLINE
logs 279G 4.2G 275G - 1% 1% - ONLINE
cache 894G 812G 82G - 22% 90% - ONLINE
What it means: Special vdev is 41% used—fine. L2ARC is 90% “used,” which is normal; it fills by design.
Decision: If special vdev creeps toward 70–80% with special_small_blocks enabled, plan expansion before it becomes your outage.
Task 4: Confirm dataset properties (are you accidentally sabotaging yourself?)
cr0x@server:~$ sudo zfs get -o name,property,value -s local,received compression,atime,recordsize,sync,logbias,special_small_blocks tank/data
NAME PROPERTY VALUE
tank/data compression lz4
tank/data atime off
tank/data recordsize 128K
tank/data sync standard
tank/data logbias latency
tank/data special_small_blocks 16K
What it means: Sensible defaults for many mixed workloads. Small blocks <=16K go to special vdev.
Decision: If special vdev endurance is a concern, raise or disable special_small_blocks. If sync latency is a problem, keep logbias=latency and invest in SLOG quality.
Task 5: Determine whether your workload is sync-heavy (SLOG relevance test)
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 40.1T 25.3T 220 1800 38.2M 145M
raidz2 39.4T 26.0T 120 240 34.1M 120M
special 0.7T 1.0T 90 1550 4.0M 25.0M
mirror (logs) 4.2G 275G 0 1600 0 22.0M
cache - - - - - -
-------------------------- ----- ----- ----- ----- ----- -----
What it means: Writes are landing heavily on logs and special vdev. That often indicates sync traffic and metadata/small-block activity.
Decision: If logs show zero while you have a “database latency problem,” you might not be sync-bound; chase something else (ARC misses, HDD queueing, CPU).
Task 6: Check ARC behavior (is RAM doing the job?)
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:10 420 35 8 9 2 26 6 0 0 62G 64G
12:01:11 410 40 10 12 3 28 7 0 0 62G 64G
12:01:12 395 33 8 8 2 25 6 0 0 62G 64G
What it means: ARC miss rate ~8–10% is decent. If miss% is consistently high, HDDs will get hammered and L2ARC might help—after RAM is adequate.
Decision: If arcsz is pinned at c and miss% is high, consider more RAM or reducing memory pressure elsewhere.
Task 7: Validate L2ARC effectiveness (is it actually serving reads?)
cr0x@server:~$ sudo arcstat -f time,read,miss,l2hits,l2miss,l2read 1 3
time read miss l2hits l2miss l2read
12:02:21 500 70 120 40 160
12:02:22 480 65 110 35 145
12:02:23 510 68 130 38 168
What it means: L2ARC is getting hits. If l2hits is near zero, your L2ARC isn’t helping (or it hasn’t warmed up, or your workload has no reuse).
Decision: If L2ARC isn’t helping, remove it or repurpose the device; don’t keep complexity as a decorative accessory.
Task 8: Inspect special vdev IO pressure (metadata tier saturation)
cr0x@server:~$ sudo iostat -xm /dev/nvme0n1 /dev/nvme1n1 1 3
Linux 6.8.0 (server) 12/26/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await %util
nvme0n1 820.0 540.0 42.0 31.5 94.0 2.10 1.8 92.0
nvme1n1 815.0 530.0 41.5 30.9 93.5 2.05 1.9 91.0
What it means: Special vdev SSDs are near saturation (%util ~90+) but still low latency (await ~2ms). That’s okay… until it isn’t.
Decision: If await climbs (say 10–20ms) during load, special vdev is your bottleneck. Consider faster SSDs, more lanes, or raising special_small_blocks.
Task 9: Confirm sync write latency path (SLOG device health and pressure)
cr0x@server:~$ sudo iostat -xm /dev/nvme2n1 /dev/nvme3n1 1 3
Device r/s w/s rMB/s wMB/s avgqu-sz await %util
nvme2n1 0.0 2200.0 0.0 28.0 0.60 0.4 55.0
nvme3n1 0.0 2180.0 0.0 27.8 0.58 0.4 54.0
What it means: SLOG writes are fast and steady. If await jumps, your sync write performance is capped by SLOG latency.
Decision: If SLOG latency is poor, replace with PLP enterprise NVMe. Don’t “tune around” bad media.
Task 10: Identify whether HDD vdevs are queueing (classic RAIDZ small IO pain)
cr0x@server:~$ sudo zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 40.1T 25.3T 900 600 110M 80M
raidz2 39.4T 26.0T 860 560 106M 76M
ata-ST12000...0001 - - 140 95 18.0M 12.7M
ata-ST12000...0002 - - 145 92 18.2M 12.4M
ata-ST12000...0003 - - 142 93 18.1M 12.5M
ata-ST12000...0004 - - 143 96 18.0M 12.8M
ata-ST12000...0005 - - 145 92 18.3M 12.4M
ata-ST12000...0006 - - 145 92 18.2M 12.4M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: Lots of per-disk ops suggests random IO. RAIDZ2 can be fine, but small random writes are expensive (parity math + read-modify-write).
Decision: If this is a VM or DB pool and latency matters, mirrors often beat RAIDZ for IOPS. Or push small blocks/metadata onto special vdevs thoughtfully.
Task 11: Track TXG behavior and commit pressure (latency spikes clue)
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cr0x@server:~$ sudo tail -n 3 /proc/spl/kstat/zfs/tank/txgs
(three most recent TXGs; the trailing otime/qtime/wtime/stime columns are open, quiesce, wait, and sync times in nanoseconds)
What it means: A TXG is forced to commit after at most 5 seconds (sooner under dirty-data pressure). If the sync time regularly climbs from a few hundred milliseconds into whole seconds, the pool is struggling to flush.
Decision: If synctime spikes correlate with application latency, focus on write throughput and device latency—often HDD vdev saturation or special vdev pressure.
Task 12: Verify that autotrim is set appropriately (SSD longevity + steady performance)
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
What it means: TRIM is enabled, helping SSDs maintain performance and reducing write amplification.
Decision: For SSD-based special vdev, logs, and L2ARC, autotrim=on is typically the right call.
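If autotrim is off, enabling it and running a one-shot manual TRIM are both standard zpool operations:
sudo zpool set autotrim=on tank   # continuous, low-intensity TRIM as space frees up
sudo zpool trim tank              # one-time pass across TRIM-capable vdevs
sudo zpool status -t tank         # -t shows per-device TRIM state and progress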
Task 13: Check device error counters and slow degradation
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Dec 21 04:00:31 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST12000NM0008-1A-0003 ONLINE 0 0 0
ata-ST12000NM0008-1A-0004 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme-SAMSUNG_MZ1LB960HAJQ-0001 ONLINE 0 0 0
nvme-SAMSUNG_MZ1LB960HAJQ-0002 ONLINE 0 0 0
errors: No known data errors
What it means: Scrubs are clean; no growing error counts.
Decision: If READ/WRITE/CKSUM start incrementing on special vdev or SLOG devices, treat it as urgent. “It still works” is not a strategy.
Task 14: Confirm special allocation is actually used (avoid placebo designs)
cr0x@server:~$ sudo zdb -bbbs tank | grep -E "Special|special" | head
Special allocation class: 734.22G used, 1.01T available
What it means: Data is being allocated in the special class. If it’s near zero on a metadata-heavy system, your design isn’t being exercised.
Decision: Validate special_small_blocks and confirm you created the special vdev correctly; don’t assume “SSD present” means “SSD used.”
Fast diagnosis playbook: what to check first/second/third
When a hybrid ZFS pool is “slow,” you don’t need a day of philosophical debate. You need a tight loop: identify which layer is saturated, then
map it to a device class and IO type.
First: is it a cache/RAM problem?
- Check ARC size and miss% (arcstat).
- If miss% is high and ARC is capped: you’re likely disk-bound because RAM isn’t holding the working set.
- Decision: add RAM, reduce memory pressure, or accept that HDDs will see reads (and then consider L2ARC).
Second: is it a sync write latency problem?
- Check whether logs are active (zpool iostat -v shows log writes).
- Check SLOG device latency (iostat -x on SLOG devices).
- Decision: if sync-heavy and SLOG awaits are high, upgrade/replace SLOG media; don’t touch sync=disabled unless you enjoy postmortems.
Third: is it metadata/small-block pressure on special vdev?
- Check special vdev utilization and IO (zpool list -v, iostat -x).
- Check if special is near full or showing rising latency.
- Decision: expand special vdev (add another mirrored special vdev), use faster devices, or adjust special_small_blocks.
Fourth: is it plain old HDD vdev saturation?
- Check per-vdev and per-disk ops (zpool iostat -v).
- High ops and low bandwidth on HDDs = random IO pain; RAIDZ will feel it.
- Decision: change workload layout (separate pool for VMs/DBs), move hot datasets to SSD pool, or accept mirrors for IOPS-centric tiers.
Fifth: check the non-ZFS parts (because reality)
- Network latency (for NFS/SMB), CPU steal, IRQ saturation, HBA queue depth, and virtualization limits can masquerade as “storage.”
- Decision: verify with system metrics; don’t blame ZFS for a 1GbE bottleneck.
Common mistakes (symptoms → root cause → fix)
1) “We added NVMe cache and nothing changed.”
Symptoms: Same read latency, same HDD ops, L2ARC hit rate near zero.
Root cause: Workload has low reuse (streaming), or ARC is too small and L2ARC header overhead made it worse, or L2ARC never warmed.
Fix: Measure with arcstat. Add RAM first. If access pattern is streaming, remove L2ARC and focus on vdev layout and sequential throughput.
2) “Special vdev is filling up and we’re scared.”
Symptoms: Special class usage grows faster than expected; SSD wear climbs; latency spikes under metadata churn.
Root cause: special_small_blocks set too high, small blocks dominate, snapshots amplify metadata, or special vdev under-sized.
Fix: Lower special_small_blocks (future allocations only), add another mirrored special vdev, and keep special below ~70% used.
3) “Sync writes are slow even with a SLOG.”
Symptoms: High latency on database commits; SLOG device shows high await; application stalls during bursts.
Root cause: SLOG device lacks PLP, has poor steady-state latency, is shared with other workloads, or is not actually being used.
Fix: Verify log activity with zpool iostat -v. Use dedicated enterprise NVMe with PLP. Mirror it.
4) “We set sync=disabled and it got fast. Why not keep it?”
Symptoms: Performance improves immediately; management is happy; SRE is quietly updating their resume.
Root cause: You traded durability for speed. Apps expecting fsync durability are now being lied to.
Fix: Revert to sync=standard. Deploy proper SLOG and tune dataset properties to match the workload.
5) “Pool is healthy but latency spikes every few seconds.”
Symptoms: Periodic stalls; graphs show sawtooth write latency; user complaints during peaks.
Root cause: TXG sync pressure: HDDs can’t flush fast enough; special vdev saturated; fragmentation and near-full pool amplify it.
Fix: Check TXG synctime, vdev utilization, pool fullness. Add vdevs, reduce write amplification (compression helps), and keep pool below ~80% used for heavy-write environments.
6) “We used a single special vdev because it’s ‘just metadata.’”
Symptoms: Sudden pool failure or missing metadata blocks after SSD dies; recovery options are bleak.
Root cause: Special vdev is not optional; it’s in the data path.
Fix: Always mirror special vdevs. If you already deployed single-disk special, plan a migration: backup, rebuild correctly, restore. There is no clean retrofit that fixes the risk without moving data.
Three corporate mini-stories (anonymized, plausible, technically accurate)
Incident caused by a wrong assumption: “Metadata can’t take down the pool, right?”
A mid-sized SaaS company ran a file-heavy pipeline: build artifacts, container layers, and lots of tiny JSON. Their storage team did something
clever: HDD RAIDZ2 for bulk, plus a single “metadata SSD” because the budget meeting was a contact sport.
It worked brilliantly for months. Directory listings were snappy. Builds stopped timing out. People started using the same pool for more things
because success is a magnet for scope creep.
Then the SSD died. Not dramatically—no smoke, just a quiet drop from the PCIe bus. The pool didn’t politely “degrade.” It panicked. Metadata
blocks that ZFS expected to be there… weren’t. Services restarted into failures that looked like random filesystem corruption.
The team’s first instinct was to treat it like a cache failure: reboot, rescan, re-seat. That wasted hours. The actual fix was painful and honest:
restore from backups to a correctly designed pool with mirrored special vdevs. They got the system back, but they also got a new policy:
“if it’s in the pool topology, it’s redundant.”
The lesson wasn’t “ZFS is fragile.” The lesson was that ZFS is literal. If you tell it metadata lives on that device, ZFS will believe you with
the devotion of a golden retriever.
Optimization that backfired: “Let’s turn on special_small_blocks everywhere”
A large enterprise analytics platform had a mix of workloads: data lake files (large), ETL temp files (medium), and a nasty swarm of small logs
and indexes. They added mirrored SSD special vdevs and saw immediate improvements. So far, so good.
Someone then proposed setting special_small_blocks=128K across most datasets. The reasoning sounded clean: “If most blocks go to SSD,
HDDs become just capacity. SSDs are fast. Everyone wins.” The change sailed through because it produced a nice-looking latency graph in the first hour.
Weeks later, the SSDs were fuller than expected and write latency started wobbling. Not constant failure—worse: intermittent stalls that hit
during peak batch windows. Scrubs were still clean. SMART looked “fine.” The on-call rotation began to develop opinions.
The actual cause was predictable in hindsight: the special vdev had quietly become the primary data tier for most active datasets. It absorbed
write amplification, snapshot churn, and random reads. Once it approached high utilization, the SSDs’ internal garbage collection and ZFS allocation
behavior combined into latency spikes. The HDDs were innocent bystanders.
The fix was boring and effective: they reduced special_small_blocks to a conservative value for general datasets, reserved aggressive
values only for the few small-file-heavy trees, and added capacity to the special class. Performance stabilized. The graphs got less exciting.
Boring but correct practice that saved the day: “Measure, scrub, and rehearse”
A financial services team ran ZFS-backed NFS for internal applications. No one outside the team cared about the storage details, which is the best
possible state for storage: invisible.
They had a routine that felt almost quaint: monthly scrub windows, quarterly restore drills, and a standing dashboard that tracked ARC hit rate,
special vdev usage, SLOG latency, and pool fragmentation. They also had a policy that special vdev utilization should not exceed a conservative threshold.
One quarter, a new application rollout increased small-file churn. The dashboard caught it early: special vdev allocation rose steadily and SSD
write latency ticked up during batch. Nothing was “broken” yet; it was just trending wrong.
Because they saw it early, the fix was surgical: add another mirrored special vdev pair during a planned maintenance window, adjust
special_small_blocks for a couple of datasets, and keep the pool under the utilization line that they’d agreed to treat as a hard limit.
No outage. No incident bridge. No weekend. The team got exactly zero praise, which in operations is how you know you did it right.
Checklists / step-by-step plan
Step-by-step: build the hybrid pool (production-minded)
- Pick vdev layout first. If you need IOPS and consistent latency for VMs/DBs, prefer mirrors. If you need capacity efficiency and mostly sequential IO, RAIDZ2 is fine.
- Choose special vdev SSDs. Mirror them. Favor endurance and consistent latency over headline peak IOPS.
- Choose SLOG NVMe. Use PLP enterprise NVMe. Mirror it. Keep it dedicated.
- Decide on L2ARC last. Only after you’ve measured ARC misses and confirmed read locality.
- Create pool with correct ashift. Use persistent device names in /dev/disk/by-id. Don’t use /dev/sdX in production.
- Set baseline properties. compression=lz4, atime=off, a sane recordsize per dataset.
- Enable autotrim. Especially when SSDs are in the topology.
- Configure monitoring. Track ARC miss%, SLOG latency, special vdev usage, scrub results, and device errors.
- Rehearse failure. Pull a device in staging. Confirm alerts. Confirm resilver behavior. Confirm your team’s muscle memory.
Operational checklist: keep it healthy
- Keep pool utilization below ~80% for write-heavy workloads.
- Scrub on a schedule and review results, not just “it ran.”
- Watch special vdev wear and usage; treat it like a tier that can fill.
- Don’t share SLOG devices with random workloads; latency is the product.
- Keep firmware and drivers stable; storage stacks don’t enjoy surprise updates.
Migration checklist: converting an existing HDD pool to hybrid
- Confirm you can add a special vdev without violating redundancy policy (you can, but it must be redundant).
- Estimate metadata/small-block footprint; plan special capacity with headroom.
- Add mirrored special vdev; then set special_small_blocks only where needed (see the sketch after this checklist).
- Add mirrored SLOG if sync latency is relevant.
- Validate with before/after measurements (ARC misses, iostat, app latency).
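A hedged sketch of the add steps, with placeholder device paths; the explicit ashift on add keeps a hot fix from quietly creating a misaligned vdev:
# Metadata tier first; it only affects new allocations.
sudo zpool add -o ashift=12 tank special mirror \
  /dev/disk/by-id/nvme-META-1 /dev/disk/by-id/nvme-META-2
# Then the sync-latency tier, if the workload is actually sync-bound.
sudo zpool add -o ashift=12 tank log mirror \
  /dev/disk/by-id/nvme-SLOG-1 /dev/disk/by-id/nvme-SLOG-2
sudo zpool status tank   # confirm both show up mirrored before you touch dataset properties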
FAQ
1) Do I need both SLOG and L2ARC?
No. SLOG helps sync writes. L2ARC helps reads when ARC isn’t enough and the workload has reuse. Many systems need neither if RAM and vdev layout are right.
2) Can I use one NVMe device for both SLOG and L2ARC?
You can, but you usually shouldn’t. SLOG wants predictable low latency; L2ARC wants bandwidth and can generate background writes. Mixing them can create jitter exactly where you least want it.
3) Is a special vdev “just a cache”?
No. It’s allocation, not cache. Data placed there is part of the pool. If you lose a non-redundant special vdev, you can lose the pool.
4) What should I set special_small_blocks to?
Start conservative: 0 (metadata-only) or 16K. Increase only for datasets proven to be small-block dominated, and only if special vdev capacity and endurance are sized for it.
5) Does SLOG improve throughput?
It improves latency for sync writes, which can improve application-level throughput when the app is gated by commit latency. It does not turn HDDs into NVMe for bulk writes.
6) Should I set sync=always for safety?
Only if you accept the performance cost and have a good SLOG. Many applications already issue fsync where needed. Forcing everything sync can be self-inflicted pain.
7) Mirrors vs RAIDZ for the HDD vdevs in a hybrid design?
Mirrors are the latency/IOPS play; RAIDZ is the capacity efficiency play. Special vdevs help RAIDZ pools with metadata/small reads, but they don’t erase parity write costs for random writes.
8) How do I know if L2ARC is hurting me?
Watch ARC pressure and miss%. If adding L2ARC reduces ARC and doesn’t produce meaningful L2 hits, you added complexity and stole RAM for nothing. Measure with arcstat.
9) Can I add a special vdev after pool creation?
Yes. But existing metadata won’t automatically migrate. You’ll see benefits as new allocations happen, and you can trigger rewrites by replication or file-level moves if needed.
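If you need existing data and metadata to flow through the new allocation classes, one hedged approach is to rewrite it via send/receive into a fresh dataset (names below are placeholders); budget the capacity and a cutover window before trying this on anything important.
sudo zfs snapshot tank/data@migrate
sudo zfs send tank/data@migrate | sudo zfs receive tank/data-new
# cut over applications to tank/data-new, verify, then retire the old dataset
# (for large datasets, do an incremental final sync during the cutover window)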
10) Is dedup a good idea in this hybrid blueprint?
Usually no, unless you have a very specific dedup-friendly workload and the RAM/CPU budget to match. Compression is the safer default win.
Practical next steps
If you want the hybrid blueprint to work in production, don’t start by buying “cache.” Start by writing down your workload facts: sync vs async,
average IO size, read locality, metadata intensity, and growth. Then design the pool topology so failure doesn’t become a thriller novel.
- Run the topology and ARC tasks above on your current system. Identify whether you’re read-miss bound, sync-latency bound, or metadata bound.
- If metadata/small files hurt: add a mirrored special vdev and keep it comfortably under 70% used.
- If sync writes hurt: add a mirrored PLP NVMe SLOG and verify it’s actually used.
- If reads hurt after RAM is sane: consider L2ARC, then validate with hit rates and real app latency.
- Lock in boring practices: scrubs, monitoring, restore drills, and a written policy for pool utilization and device class redundancy.
Hybrid ZFS is not a bag of tricks. It’s an IO routing plan with consequences. Do it right, and HDDs go back to being what they’re good at: cheap,
boring capacity. Do it wrong, and you’ll learn which of your SSDs has the most dramatic personality.