The special vdev is one of those ZFS features that feels like cheating: put metadata (and optionally small blocks) on fast SSDs and watch your pool wake up.
Then one day the SSDs are 92% full, allocations get weird, sync latency spikes, and your “fast pool” starts acting like it’s pulling data through a garden hose.
Sizing a special vdev is not a vibe. It’s math plus operational paranoia. Do it wrong and you don’t just lose performance—you can lose the pool.
Do it right and you get predictable latency, faster directory traversals, snappier VM boot storms, and fewer 2 a.m. tickets that start with “storage is slow.”
What the special vdev actually does
A ZFS pool normally stores data blocks and metadata blocks across its “normal” vdevs (HDDs, SSDs, whatever you built).
A special vdev is an additional allocation class that can hold metadata and, optionally, small data blocks.
The goal is straightforward: keep the random, latency-sensitive stuff on faster media.
What lands on a special vdev?
- Metadata: indirect blocks, dnodes, directory structures, block pointers, spacemaps, and other “bookkeeping” needed to find and manage data.
- Optionally, small data blocks, when special_small_blocks is set on a dataset or zvol.
It’s not an L2ARC. It’s not a SLOG. It’s not a “cache” you can lose without consequences.
If you put pool-critical metadata on a special vdev, losing it can mean losing the pool.
Here’s the operational truth: special vdevs are fantastic, but they are not forgiving.
The sizing question is not “what’s the minimum that works,” it’s “what’s the minimum that still works after two years of churn, snapshots, and bad ideas.”
Joke #1: The special vdev is like a junk drawer—if you make it too small, everything still goes in there, it just stops closing.
Metadata is small… until it isn’t
People underestimate metadata because they picture “filenames and timestamps.”
ZFS metadata includes the structures that make copy-on-write, snapshots, checksums, compression, and block pointer trees work.
More blocks and more fragmentation mean more metadata. Lots of small files mean more metadata. Lots of snapshots mean more metadata.
If you also route small blocks to the special vdev, your “metadata device” becomes a metadata-plus-hot-data device.
That’s fine—often great—but your sizing and endurance assumptions must change.
Facts and historical context (so the behavior makes sense)
A few concrete facts help explain why special vdev sizing feels counterintuitive and why the wrong default can quietly work for months before it explodes.
- ZFS special allocation classes arrived late relative to core ZFS features; early ZFS relied more on “all vdevs are equal” layouts, plus L2ARC/SLOG for performance shaping.
- Metadata I/O is often more random than data I/O. ZFS can stream large data reads, but metadata access tends to bounce around the on-disk tree.
- Copy-on-write multiplies metadata work. Every new write updates block pointers up the tree, and snapshots preserve old pointers.
- dnodes got bigger over time. Features like dnodesize=auto and “fat dnodes” exist because packing more attributes into dnodes avoids extra I/O, but it also changes metadata footprint and layout.
- Special vdev failure can be catastrophic if it stores metadata required to assemble the pool. This is not theoretical; it’s a design consequence.
- SSD latency gaps keep widening. Modern NVMe can serve random reads in tens of microseconds; HDDs are in milliseconds. A 100× delta on a metadata-heavy workload is not rare.
- Compression and small records shift the economics. When your pool writes many small compressed blocks, metadata and “small block hotness” start to look similar from an I/O perspective.
- Snapshots increase metadata churn. Each snapshot keeps old block pointer paths alive; deletes become “deadlist management” rather than immediate reclaim.
- Pool expansion is easy; special vdev remediation is not. You can add vdevs to increase capacity, but you can’t remove a special vdev from most production systems and walk away smiling.
Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails all the time—design for failure, not for best-case behavior.”
Special vdev sizing is exactly that: design for failure modes, not day-one benchmarks.
Sizing principles that don’t age badly
Principle 1: Decide what problem you’re solving
There are two legitimate reasons to add a special vdev:
- Metadata acceleration: faster directory listings, file opens, stat storms, snapshot operations, and “lots of small files” behavior.
- Small-block acceleration: store blocks smaller than a threshold on SSD to reduce random I/O on HDDs and improve latency for small reads/writes.
If you only want metadata acceleration, you can often keep it modest (but still not tiny).
If you want small-block acceleration, you’re building a partial tiering system and you should size accordingly.
Principle 2: “It’s only metadata” is how you end up paging your storage
Under-sizing the special vdev creates a perverse outcome: ZFS tries to place metadata/small blocks there, it fills up, allocations become constrained,
and suddenly the thing you added for performance becomes a bottleneck and sometimes a risk factor.
Principle 3: Mirrors for special vdevs in production
If you care about the pool, the special vdev should be mirrored (or better).
RAIDZ special vdevs exist in some implementations, but mirrors keep failure domains clean and rebuild behavior predictable.
Also: NVMe dies in exciting ways. Not always, but enough that you should assume it will.
Principle 4: Plan for growth and churn, not raw capacity
The special vdev is sensitive to churn: snapshots, deletes, rewrite-heavy workloads, and dataset reorganizations can increase metadata pressure.
Capacity planning should include a “future you is busier and less careful” tax.
Principle 5: Avoid the “one tiny special vdev for everything” pattern
The special vdev is pool-wide, but you can control small-block placement per dataset.
That means you can be selective. If you set special_small_blocks globally and forget about it, you’re committing to special vdev growth.
Being selective is not cowardice; it’s risk management.
Joke #2: Storage engineers don’t believe in magic—except when a “temporary” dataset lives on the special vdev for three years.
Sizing math you can defend in a change review
There’s no single perfect formula because metadata size depends on recordsize, compression, file count distribution, snapshots, and fragmentation.
But you can get to an answer that is conservative, explainable, and operationally safe.
Step 1: Decide whether small blocks are included
If you will not set special_small_blocks, you’re mostly sizing for metadata.
If you will set it (common values: 8K, 16K, 32K), you must budget for small data blocks too.
Step 2: Use rules of thumb (then validate)
Practical sizing starting points that won’t get you fired:
- Metadata-only special vdev: start at 0.5%–2% of raw pool capacity for general file workloads. If you have many small files, lots of snapshots, or heavy churn, aim higher.
- Metadata + small blocks: start at 5%–15% depending on your small-block threshold and workload. VM images with 8K–16K blocks can consume special capacity fast.
The sizing “range” isn’t indecision; it reflects that a pool storing 10 million 4K files behaves differently than a pool storing 20 TB of media files.
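To make that tradeoff concrete, here is a small sketch that estimates how much data a given special_small_blocks threshold would capture. The histogram is invented sample data (block size in KiB and total bytes stored at that size); on a real pool you would derive the distribution from zdb block statistics, whose exact output format varies by version.

```shell
# Sketch: estimate how much data a given special_small_blocks threshold
# would route to the special class. The histogram below is invented
# sample data (block size in KiB, then total bytes at that size).
threshold_kib=16

summary=$(awk -v t="$threshold_kib" '
  $1 + 0 <= t + 0 { small += $2 }   # blocks at or below the threshold
  { total += $2 }
  END {
    printf "small-block bytes: %.1f TiB (%.1f%% of data)",
      small / (2^40), 100 * small / total
  }
' <<'EOF'
4 2199023255552
8 4398046511104
16 6597069766656
32 8796093022208
128 87960930222080
EOF
)
echo "$summary"
```

If the fraction captured is a large slice of the pool, you are building a tier, and the 5%–15% range above is your planning baseline, not the metadata-only range.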
Step 3: Convert the rule of thumb into a hard number
Example: you have a 200 TB raw HDD pool.
- Metadata-only at 1%: 2 TB special.
- Metadata + small blocks at 10%: 20 TB special (now you’re building a tier, not a garnish).
If those numbers feel “too big,” that’s your brain anchored to the idea that metadata is tiny.
Your future outage doesn’t care how you feel.
Step 4: Apply a “don’t paint yourself into a corner” buffer
Special vdev fullness is operationally scary because it can constrain metadata allocation.
Treat it like a log filesystem: you don’t run it at 95% and call it a win.
A reasonable policy:
- Target steady-state usage: 50–70%
- Alert at: 75%
- Drop everything at: 85%
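A minimal sketch of that policy as a check, using canned zpool list -v output. In production you would pipe in the real command; column order can differ across OpenZFS versions, so validate the parsing on your platform before alerting on it.

```shell
# Sketch: grade special-class capacity against the 75%/85% policy.
# The zpool output below is canned sample text, not a live command.
sample='NAME        SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH
tank        200T   152T   48T        -         -   38%  76%  1.00x  ONLINE
  raidz2-0  200T   148T   52T        -         -   39%  74%      -  ONLINE
  special  3.50T  3.10T  410G        -         -   52%  88%      -  ONLINE'

# Pull the CAP column (field 8) from the special-class line.
cap=$(printf '%s\n' "$sample" | awk '$1 == "special" { gsub(/%/, "", $8); print $8 }')

if   [ "$cap" -ge 85 ]; then status="CRITICAL: expand the special class now"
elif [ "$cap" -ge 75 ]; then status="WARN: plan special expansion"
else                         status="OK"
fi
echo "special cap=${cap}% -> $status"
```

Wire this into whatever monitoring you already trust; the point is that the special class gets its own threshold, separate from pool CAP.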
Step 5: Account for special vdev redundancy overhead
A mirrored special vdev halves usable capacity. Two 3.84 TB NVMes give you ~3.84 TB usable (minus slop/overhead).
If your math says “need 2 TB,” you don’t buy “two 1.92 TB” and hope. You buy bigger. SSD endurance and write amplification are real.
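Steps 2 through 5 condense into one back-of-envelope calculation. The inputs below are example values (a 200 TB pool, the 1% metadata-only starting point, the 70% steady-state target, a two-way mirror); swap in your own numbers.

```shell
# Sketch: turn a rule-of-thumb ratio into a purchase-size number.
# All inputs are example values for a 200 TB pool.
raw_pool_tb=200
ratio=0.01          # 1% metadata-only starting point
target_util=0.70    # keep steady state out of the danger zone
mirror_ways=2       # mirroring halves usable capacity

plan=$(awk -v p="$raw_pool_tb" -v r="$ratio" -v u="$target_util" -v m="$mirror_ways" 'BEGIN {
  need = p * r          # metadata you expect to store
  usable = need / u     # usable SSD so steady state stays under target
  printf "need=%.2fTB usable=%.2fTB buy_raw=%.2fTB", need, usable, usable * m
}')
echo "$plan"
```

Note how the 2 TB “need” becomes nearly 6 TB of raw SSD once the utilization target and mirroring are applied; that gap is where under-buying usually happens.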
Step 6: Consider endurance (DWPD) if small blocks are on special
Metadata writes are not free, but they’re often manageable.
Small blocks can turn the special vdev into a write-heavy tier, especially with VM workloads, databases, and high churn.
If you’re routing small blocks, select SSDs with appropriate endurance and monitor write rates.
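A rough endurance check can be done the same way. Here, daily_writes_tb would come from your monitoring (for example, the 24-hour delta of SMART’s Data Units Written); the device capacity and rated DWPD are example values.

```shell
# Sketch: back-of-envelope DWPD check for a special vdev device.
# All three inputs are example values; feed in measured write rates.
daily_writes_tb=2.0
device_capacity_tb=3.84
rated_dwpd=1.0

verdict=$(awk -v w="$daily_writes_tb" -v c="$device_capacity_tb" -v r="$rated_dwpd" 'BEGIN {
  dwpd = w / c    # full drive writes per day, observed
  printf "observed %.2f DWPD vs rated %.2f -> %s", dwpd, r,
    (dwpd > r) ? "over budget: higher-endurance SSDs or narrower small-block scope" : "within budget"
}')
echo "$verdict"
```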
Special vdev sizing is also a risk decision
Under-sizing means:
- performance cliff when it fills,
- operational panic during growth spurts,
- and in worst cases, pool instability if metadata allocation is constrained.
Over-sizing means:
- some unused SSD space,
- a slightly higher procurement line item,
- and fewer emergency meetings.
Choose your pain.
Workload-specific sizing (VMs, fileservers, backup, object-ish)
General fileserver (home directories, shared engineering storage)
This is where special vdevs shine. Directory traversals and file opens are metadata-heavy.
For “normal” corporate file shares with a mix of file sizes, metadata-only special vdevs often deliver the biggest win per dollar.
Sizing guidance:
- Start at 1–2% raw capacity for metadata-only.
- If you expect millions of small files, push toward 2–4%.
- Keep special_small_blocks off unless you can justify it with observed I/O patterns.
VM storage (zvols, hypervisor backing store)
VM workloads are where people get greedy and turn on special_small_blocks.
It can work extremely well because a lot of VM I/O is small random reads/writes.
It can also eat your special vdev alive, because the threshold is a vacuum cleaner: it doesn’t care if the block is “important,” only whether it’s small.
Sizing guidance:
- Metadata-only: still helpful, but less dramatic.
- Metadata + small blocks: size 8–15% raw depending on volblocksize and I/O profile.
- Consider separate pools for VM latency-critical workloads instead of overloading a general-purpose pool.
Backup targets (large sequential writes, dedup sometimes)
Backup repositories tend to be big-block, streaming-friendly workloads.
Metadata acceleration helps for snapshot management and directory operations, but it’s not always a slam dunk.
- Metadata-only special vdev: modest sizing (0.5–1.5%) is usually fine.
- Small blocks: often unnecessary; avoid unless you have measured a real random I/O problem.
“Object storage-ish” layouts (many small objects, lots of listing)
If you’re abusing ZFS as an object store with millions of small files, metadata becomes a first-class citizen.
The special vdev can keep the system responsive under directory enumerations and stat storms.
- Plan for 2–5% metadata-only, depending on object size distribution.
- Watch snapshot strategy; object stores plus snapshots can multiply metadata retention.
Design choices: mirrors, RAIDZ, ashift, redundancy, and why you should be boring
Mirror it, and sleep
A special vdev is not a “cache.” Treat it as core pool storage.
Use mirrored SSDs (or triple mirrors if your risk model demands it).
The extra drive cost is cheaper than explaining to your CFO why “the pool won’t import.”
Choose sensible ashift
Use ashift=12 for most modern SSDs and HDDs; it aligns to 4K sectors.
Getting ashift wrong can waste space or hammer performance.
You generally can’t change ashift after vdev creation without rebuilding.
Don’t co-mingle questionable SSDs with critical roles
A consumer SSD with sketchy power-loss behavior and mediocre endurance is not automatically wrong,
but putting it in the special vdev is where “probably fine” becomes “incident report.”
If you must use consumer SSDs, over-provision and mirror aggressively, and monitor SMART like it owes you money.
Set expectations: special vdev improves latency, not bandwidth
Your sequential throughput may not change much.
What changes is the “small random” experience: metadata fetches, small block reads, and the chain of pointer walks to find data.
It’s latency work, which means it shows up as “everything feels faster,” not “we doubled GB/s.”
Understand what happens when the special vdev fills
When the special vdev runs out of space, ZFS can’t place new metadata there.
Behavior depends on implementation and feature flags, but you should assume:
- performance degrades (metadata goes to slower vdevs or allocation gets constrained),
- allocation becomes fragmented and weird,
- and you’re operating closer to a pool-level failure mode than you want.
The right posture is: do not let it get close to full. Expand early.
Practical tasks with commands (and how to decide)
These are not “cute demos.” These are the commands you run on a Tuesday afternoon when you’re trying to prevent a Thursday outage.
Each task includes what to look for and the decision you make based on it.
Task 1: Identify whether you even have a special vdev (and its layout)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
errors: No known data errors
Output meaning: You have a special class consisting of a mirrored NVMe pair. Good.
If the special vdev is a single disk, that’s a risk decision you should revisit immediately.
Decision: If special is not mirrored, schedule remediation (rebuild into mirrors via new pool or migration plan).
Task 2: Check special vdev capacity pressure fast
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 120T 48T 256K /tank
Output meaning: Pool space is not the same as special vdev space. This command doesn’t show special usage directly.
It tells you whether overall pool fullness is likely contributing to fragmentation and allocation stress.
Decision: If the pool itself is above ~80% used, expect worse allocation behavior everywhere and treat special sizing as more urgent.
Task 3: Show allocation by vdev, including special
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH
tank 200T 152T 48T - - 38% 76% 1.00x ONLINE
raidz2-0 200T 148T 52T - - 39% 74% - ONLINE
special 3.50T 3.10T 410G - - 52% 88% - ONLINE
Output meaning: The special vdev is at 88% capacity. That’s the danger zone. Fragmentation is also high.
Decision: Plan to expand special vdev now. Do not wait for 95%. Your margin is already gone.
Task 4: Confirm whether small blocks are being routed to special
cr0x@server:~$ zfs get -r special_small_blocks tank
NAME PROPERTY VALUE SOURCE
tank special_small_blocks 0 default
tank/vmstore special_small_blocks 16K local
tank/home special_small_blocks 0 default
Output meaning: Only tank/vmstore is routing blocks ≤16K to special. That dataset is probably the growth driver.
Decision: If special is filling, first identify which datasets have nonzero special_small_blocks and evaluate if they still need it.
Task 5: Check how many snapshots you’re carrying (metadata pressure indicator)
cr0x@server:~$ zfs list -t snapshot -o name,used -S used | head
NAME USED
tank/vmstore@hourly-2025-12-26 2.3T
tank/vmstore@hourly-2025-12-25 2.1T
tank/home@daily-2025-12-25 320G
tank/vmstore@hourly-2025-12-24 1.9T
Output meaning: Lots of snapshots with large “used” suggests retention is keeping old block trees alive.
Metadata and spacemap work increases with snapshot count and churn.
Decision: Tighten snapshot retention if it’s not aligned with recovery requirements, or plan more special capacity to support the real requirement.
Task 6: Spot dataset patterns that create metadata storms (many tiny files)
cr0x@server:~$ zfs get -r recordsize,compression,atime,xattr tank/home
NAME PROPERTY VALUE SOURCE
tank/home recordsize 128K default
tank/home compression lz4 local
tank/home atime off local
tank/home xattr sa local
Output meaning: Sensible defaults. xattr=sa can reduce extra I/O by storing xattrs in the dnode (where possible).
That can shift metadata patterns, often for the better, but it still lives in “metadata land.”
Decision: Keep these consistent; avoid random per-dataset tweaks unless you can justify them.
Task 7: Check for dedup (special vdev sizing changes drastically)
cr0x@server:~$ zpool get dedup tank
NAME PROPERTY VALUE SOURCE
tank dedup off default
Output meaning: Dedup is off. Good; dedup tables can be massive and change sizing/endurance math.
Decision: If dedup is on, revisit special sizing with extreme caution; you may need far more SSD capacity and RAM.
Task 8: See whether metadata is actually hitting the special vdev (iostat by vdev)
cr0x@server:~$ zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 152T 48T 1.20K 3.80K 95M 210M
raidz2-0 148T 52T 200 1.10K 70M 180M
sda - - 30 180 11M 30M
sdb - - 28 175 10M 30M
sdc - - 32 185 12M 32M
sdd - - 33 190 12M 32M
sde - - 35 185 12M 31M
sdf - - 42 185 13M 31M
special 3.10T 410G 1.00K 2.70K 25M 30M
mirror-1 - - 1.00K 2.70K 25M 30M
nvme0n1 - - 520 1.40K 12M 15M
nvme1n1 - - 480 1.30K 13M 15M
Output meaning: Special is doing most operations (IOPS), even if bandwidth is modest. That’s typical: metadata is IOPS-heavy, not bandwidth-heavy.
Decision: If special is IOPS-saturated (high await on NVMe, queueing), you may need more special vdev width or faster SSDs—not just more capacity.
Task 9: Check TRIM support and whether it’s enabled (affects SSD longevity and performance)
cr0x@server:~$ zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
Output meaning: Autotrim is enabled. This helps SSDs maintain performance and reduces write amplification in many cases.
Decision: If autotrim is off on SSD-backed special vdevs, consider enabling it (after validating your platform’s trim stability).
Task 10: Confirm special vdev devices are healthy at the disk level (SMART)
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep "Critical Warning|Percentage Used|Media and Data Integrity Errors|Data Units Written"
Critical Warning: 0x00
Percentage Used: 18%
Media and Data Integrity Errors: 0
Data Units Written: 62,114,928
Output meaning: No critical warning, 18% endurance consumed, no media errors. Good.
Decision: If Percentage Used is high or media errors appear, plan replacement proactively—special vdevs are not where you “wait and see.”
Task 11: Estimate how much space is tied up in metadata-heavy datasets (rough indicator)
cr0x@server:~$ zfs list -o name,used,logicalused,compressratio -S used tank | head
NAME USED LOGICALUSED COMPRESSRATIO
tank/vmstore 78T 110T 1.41x
tank/home 22T 26T 1.18x
tank/backup 18T 19T 1.05x
Output meaning: VM store is dominant and compressed, which often correlates with smaller on-disk blocks and more metadata activity.
Decision: Treat tank/vmstore as the primary driver for special vdev sizing and endurance planning.
Task 12: See if your special vdev is being used for normal data because of small-block settings
cr0x@server:~$ zdb -bbbbb tank/vmstore | head -n 30
Dataset tank/vmstore [ZPL], ID 54, cr_txg 12345, 4.20T, 10.2M objects
bpobj: 0x0000000000000000
flags:
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
features:
org.openzfs:spacemap_histogram
org.openzfs:allocation_classes
...
Output meaning: Presence of org.openzfs:allocation_classes indicates allocation classes are in use.
This doesn’t quantify usage, but it confirms the pool supports the feature set.
Decision: If you don’t see allocation class features and you thought you had a special vdev, you’re either on an older stack or a different implementation.
Adjust expectations and confirm with zpool status.
Task 13: Detect whether you’re near the “special full” cliff
cr0x@server:~$ zpool list -Ho name,cap,frag tank
tank 76% 38%
Output meaning: Overall pool is 76% full and 38% fragmented. Fragmentation increases metadata and allocation overhead.
Decision: If fragmentation is rising and special is also high-cap, prioritize expansion; you’re headed for compounding pain.
Task 14: Add a new mirrored special vdev (capacity expansion done the sane way)
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1
Output meaning: This adds another mirrored special vdev to the pool, increasing special capacity and IOPS.
Existing metadata doesn’t automatically migrate, but new allocations can use the added space depending on behavior and configuration.
Decision: If special is above 75–80%, expansion is a production change worth doing sooner rather than later.
Validate device model/firmware alignment and ashift implications before you commit.
Task 15: Confirm the pool now shows the additional special vdev
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
errors: No known data errors
Output meaning: Two mirrored special vdevs exist. Your special class is now wider and larger.
Decision: Update monitoring to track special class usage and device health across both mirrors.
Task 16: Verify which datasets are candidates for turning off small-block placement (risk control)
cr0x@server:~$ zfs get -r -s local special_small_blocks tank
NAME          PROPERTY              VALUE  SOURCE
tank/vmstore  special_small_blocks  16K    local
Output meaning: Only one dataset has it locally set.
Decision: If special capacity is the constraint, consider lowering the threshold (e.g., 16K → 8K) or disabling it on less critical datasets.
Do not flip it blindly; test latency impacts first.
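One low-risk way to act on that decision is to generate the property changes without executing them. The dataset list and the 8K target below are hypothetical; review the printed commands, then run them through your normal change process.

```shell
# Sketch: emit (not execute) special_small_blocks changes for review.
# Dataset names and the 8K threshold are hypothetical examples.
new_threshold=8K
datasets="tank/vmstore"

plan=$(for ds in $datasets; do
  printf 'zfs set special_small_blocks=%s %s\n' "$new_threshold" "$ds"
done)
echo "$plan"
```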
Fast diagnosis playbook
You get paged: “ZFS pool is slow.” You suspect the special vdev because the pool used to be fine and now directory listings crawl.
Here’s the fastest path to a root cause hypothesis without turning it into a weeklong archaeology project.
First: Is the special vdev near full?
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH
tank 200T 152T 48T - - 38% 76% 1.00x ONLINE
raidz2-0 200T 148T 52T - - 39% 74% - ONLINE
special 3.50T 3.10T 410G - - 52% 88% - ONLINE
If special is above ~80%, treat that as the primary suspect. Fullness correlates with allocation constraints and latency spikes.
Decision: expand special class or reduce special_small_blocks usage before chasing other tuning.
Second: Is the special vdev saturated on IOPS or latency?
cr0x@server:~$ zpool iostat -v tank 1 10
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 152T 48T 1.60K 5.10K 120M 260M
raidz2-0 148T 52T 220 1.30K 80M 190M
special 3.10T 410G 1.38K 3.80K 40M 70M
If special dominates ops and the system is slow, that’s consistent with metadata pressure.
Decision: if NVMe is pegged, consider widening the special class (another mirror) or using faster devices.
Third: Did someone route small blocks to special and forget?
cr0x@server:~$ zfs get -r special_small_blocks tank
NAME PROPERTY VALUE SOURCE
tank special_small_blocks 0 default
tank/vmstore special_small_blocks 16K local
tank/home special_small_blocks 0 default
If this is set on a busy dataset, it’s likely the real reason special grew faster than expected.
Decision: keep it, shrink it, or scope it to only the datasets that truly benefit.
Fourth: Is the problem actually pool-wide fragmentation or general capacity pressure?
cr0x@server:~$ zpool list -Ho cap,frag tank
76% 38%
If pool cap is high and fragmentation is climbing, performance can degrade even with a healthy special vdev.
Decision: free space, add capacity, and adjust retention/churn.
Fifth: Is the special device sick (SMART, errors, resets)?
cr0x@server:~$ dmesg | egrep -i "nvme|I/O error|reset|timeout" | tail -n 10
[123456.789] nvme nvme0: I/O 987 QID 6 timeout, aborting
[123456.790] nvme nvme0: Abort status: 0x371
[123456.800] nvme nvme0: reset controller
If you see resets/timeouts, don’t “tune ZFS.” Fix the hardware/firmware path.
Decision: replace device, update firmware, check PCIe topology, power management, and cabling (if U.2/U.3).
Common mistakes: symptoms → root cause → fix
1) Symptom: Directory listings and find are suddenly slow
Root cause: Special vdev is near full or saturated with metadata IOPS; metadata spills to HDDs or allocations get constrained.
Fix: Check zpool list -v. If special cap > 80%, expand special class (add another mirrored pair). If it’s saturated, widen it or use faster SSDs.
2) Symptom: Pool “looks healthy” but application latency spikes during snapshot windows
Root cause: Snapshot creation/deletion churn increases metadata operations; special vdev becomes the bottleneck.
Fix: Reduce snapshot frequency/retention for high-churn datasets; schedule snapshot pruning off-peak; expand special capacity and IOPS if snapshots are non-negotiable.
3) Symptom: Special vdev fills unexpectedly fast after enabling special_small_blocks
Root cause: Threshold too high (e.g., 32K) on a workload with many small writes (VMs, databases), effectively tiering a large fraction of data onto SSD.
Fix: Lower threshold (e.g., 16K → 8K), scope it to only the datasets that benefit, and expand special capacity before making changes in production.
4) Symptom: After adding a special vdev, performance didn’t improve much
Root cause: Workload is mostly large sequential I/O; metadata wasn’t the bottleneck. Or ARC misses are dominated by data, not metadata.
Fix: Validate with zpool iostat -v and application metrics. Don’t keep turning knobs; special vdevs aren’t a universal accelerator.
5) Symptom: Pool import fails or hangs after losing an NVMe
Root cause: Special vdev was non-redundant or redundancy wasn’t sufficient for device failure; metadata loss prevents pool assembly.
Fix: In production, use mirrored special vdevs. If you already built it wrong, the correct fix is migration to a properly designed pool.
6) Symptom: NVMe wear skyrockets
Root cause: Special vdev is receiving small-block writes plus metadata churn; autotrim is off; SSD endurance class is insufficient.
Fix: Enable autotrim if stable, choose higher-endurance SSDs, reduce small-block offload scope, and monitor SMART Percentage Used.
7) Symptom: “Random” performance regressions after upgrades or property changes
Root cause: Dataset properties changed (recordsize/volblocksize/special_small_blocks), shifting allocation patterns and moving hot I/O onto special unexpectedly.
Fix: Audit property changes with zfs get -r (local values). Treat special_small_blocks as a controlled change with rollback plan.
Three corporate mini-stories (anonymized, painfully plausible)
Incident caused by a wrong assumption: “Metadata is tiny”
A mid-sized company had a big HDD pool backing a monolithic file share and a build artifact repository.
They added a mirrored special vdev made of two small-but-fast NVMe drives. The change ticket said “metadata is small; 400 GB usable is plenty.”
Everybody nodded because everyone loves a cheap win.
For six months, it looked great. Builds sped up. Developer home directories felt snappy. Someone even wrote “ZFS special vdev = free performance”
in a chat thread, which should have been treated as a pre-incident indicator.
Then a compliance requirement arrived: increased snapshot frequency and longer retention “just in case.”
Snapshots piled up. Builds kept churning. Lots of tiny files got created and deleted. Special capacity crept upward, but nobody was watching the special class separately.
They watched pool capacity, saw plenty of free HDD space, and assumed all was well.
The first symptom wasn’t an alert. It was humans complaining: “ls is slow,” “git status hangs,” “CI jobs timeout during cleanup.”
By the time SRE looked, special was deep in the 90s, and the system was spending its life doing expensive allocation work.
The fix was straightforward but not pleasant: emergency addition of another mirrored special vdev pair, plus snapshot policy triage.
The actual root cause wasn’t “ZFS is weird.” It was believing metadata is a constant-size footnote. It’s not.
Metadata grows with churn and history. And snapshot history is literally stored history.
Optimization that backfired: Turning on small blocks everywhere
Another shop ran virtualization on ZFS with a hybrid pool: HDDs for capacity, a special vdev for metadata.
A performance tuning sprint happened because a few latency-sensitive VMs were unhappy during peak hours.
Someone read about special_small_blocks and decided to “make the pool faster” by setting it on the parent dataset so it inherited everywhere.
The immediate benchmark looked good. The problem VMs improved. The sprint ended with a tidy slide: “We used SSDs for small I/O.”
No one asked what fraction of total writes were under the threshold, or whether the special vdev had the endurance budget to be a partial tier.
Two quarters later, special was at high utilization, and NVMe wear was alarming.
Worse: the special class became a single chokepoint for random writes across the whole virtualization estate.
The pool wasn’t “faster.” It was “fast until it wasn’t,” which is the worst kind of fast.
They rolled back the setting for most datasets and kept it only for a small subset of VMs that truly needed it.
They also expanded the special class and standardized on higher-endurance SSDs.
The lesson stuck: special_small_blocks is not a free lunch; it’s a design choice that converts metadata devices into a tier.
Boring but correct practice that saved the day: Alerting on special class CAP and wear
A financial services team treated special vdevs like first-class infrastructure from day one.
They used mirrored enterprise NVMes, enabled autotrim, and—here’s the truly thrilling part—created alerts specifically for special vdev capacity,
not just overall pool capacity.
They also tracked NVMe wear indicators and set a replacement policy before devices got spicy.
When special cap hit their warning threshold, it triggered a normal change workflow, not a war room.
The team had a standing playbook: verify which datasets were using small-block placement, check snapshot growth, and forecast capacity.
One year, a new internal service started generating millions of tiny files in a shared dataset.
Alerts fired early. The storage team met with the service team, changed the layout, and expanded special before it became a user-visible incident.
Nobody outside infra noticed, which is the highest compliment you can get in operations.
The practice wasn’t clever. It was just attentive: separate monitoring for special, conservative thresholds, and routine capacity reviews.
Boring is a feature.
Checklists / step-by-step plan
Step-by-step: sizing a new special vdev
-
Classify the workload: metadata-only acceleration vs metadata + small blocks.
If you can’t answer this, stop and measure I/O patterns first. -
Pick a starting ratio:
- Metadata-only: 1–2% raw pool (2–4% if many small files/snapshots).
- Metadata + small blocks: 5–15% raw pool (higher if VM-heavy and small threshold).
- Add buffer: size so normal operation stays under 70% utilization for the special class.
- Choose redundancy: mirrored special vdevs. No exceptions for production unless you enjoy gambling with other people’s data.
- Choose SSD class: prefer PLP (power-loss protection) and adequate endurance, especially if small blocks are routed.
- Set monitoring: track special cap, special frag, NVMe errors/resets, SMART wear, and zpool status changes.
- Rollout carefully: if enabling small-block placement, do it per-dataset and validate before broadening.
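The ratios and the 70% headroom target above reduce to simple arithmetic. Here is a minimal POSIX shell sketch of that calculation; the pool size, ratio, and threshold are illustrative inputs, not measured values.

```shell
#!/bin/sh
# Back-of-the-envelope special vdev sizing (illustrative inputs).
RAW_POOL_TB=100     # raw pool capacity, TB
RATIO_PCT=2         # 1-2% metadata-only; 5-15% with small blocks
TARGET_UTIL_PCT=70  # keep steady state under 70% of the special class

# Estimated metadata footprint at the chosen ratio.
est=$(awk -v p="$RAW_POOL_TB" -v r="$RATIO_PCT" 'BEGIN { printf "%.2f", p * r / 100 }')
# Pad so that footprint lands at or below the utilization target.
need=$(awk -v e="$est" -v t="$TARGET_UTIL_PCT" 'BEGIN { printf "%.2f", e * 100 / t }')

echo "estimated metadata footprint: ${est} TB"
echo "provision special class (mirrored, usable): ${need} TB"
```

For a 100 TB raw pool at the 2% metadata-only ratio, this works out to about 2.86 TB of usable special capacity, so a mirrored pair of larger NVMe devices leaves comfortable headroom.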
Step-by-step: when special is already too small
- Confirm special cap with `zpool list -v`.
- Identify datasets driving growth via `zfs get -r special_small_blocks` and snapshot counts.
- Expand the special class by adding another mirrored special vdev pair.
- Reduce future pressure: lower the `special_small_blocks` threshold or limit it to critical datasets only.
- Fix retention: snapshot policies aligned to real RPO/RTO, not anxiety.
- Re-check wear: ensure SSD endurance budget still holds.
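In command form, the expansion step looks something like the sketch below. Pool name and device paths are illustrative; this is an ops fragment to adapt, not a script to paste. The `-n` dry run matters because a `zpool add` that omits the `special` keyword attaches the devices to the normal class, which is hard to undo.

```shell
# Dry run first: confirm the new pair lands in the special class, not the normal class.
zpool add -n tank special mirror /dev/nvme2n1 /dev/nvme3n1

# Commit the expansion.
zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1

# Reduce future pressure: disable small-block routing where it isn't needed
# (dataset name is illustrative).
zfs set special_small_blocks=0 tank/general
```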
Operational checklist: weekly (yes, weekly)
- Check `zpool status` for any device errors.
- Check `zpool list -v` and alert if special cap > 75%.
- Check SMART wear on special devices.
- Audit datasets with `special_small_blocks` enabled.
- Review snapshot counts and prune responsibly.
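That weekly `zpool list -v` check is easy to script. The sketch below parses a trimmed, illustrative sample of the output (pool and vdev names, and the exact column layout, are assumptions to verify against your own system) and warns when any device under the `special` section crosses the threshold; in production you would replace the here-doc with live `zpool list -v` output.

```shell
#!/bin/sh
# Warn when special class CAP exceeds a threshold (sample input below).
THRESHOLD=75
awk -v t="$THRESHOLD" '
  /^special/            { in_special = 1; next }   # start of the special class section
  in_special && /^[^ ]/ { in_special = 0 }         # next top-level line ends it
  in_special && /^  /   {                          # indented vdevs under "special"
    cap = $NF; sub(/%/, "", cap)
    if (cap + 0 > t) printf "WARN: %s at %s%% CAP (threshold %s%%)\n", $1, cap, t
  }
' <<'EOF'
NAME         SIZE  ALLOC   FREE  FRAG    CAP
tank         100T    60T    40T   12%    60%
  raidz2-0   100T    60T    40T   12%    60%
special         -      -      -     -      -
  mirror-1  1.86T  1.55T   317G   30%    83%
EOF
```

Wire the output into whatever pages you: the point is that the special class gets its own threshold, separate from overall pool capacity.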
FAQ
1) Can I treat the special vdev like a cache?
No. L2ARC is a cache. SLOG is a log device. Special vdevs can hold pool-critical metadata. Losing it can mean losing the pool.
Design it like primary storage.
2) How do I know if I should enable special_small_blocks?
Enable it only if you have a measured random-I/O problem on HDD vdevs and you have the SSD capacity/endurance budget.
VM stores are common candidates; general file shares often do fine with metadata-only special.
3) What value should I use for special_small_blocks?
Conservative defaults: 8K or 16K for targeted datasets. Higher thresholds capture more I/O but consume special capacity faster.
Start smaller, measure, then widen if needed.
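In practice that is a per-dataset setting (the dataset name below is illustrative). One constraint worth remembering: the threshold applies to blocks at or below the given size, so setting it equal to the dataset's recordsize routes essentially all of that dataset's data to the special class.

```shell
# Route blocks <=16K on one VM dataset to the special class (name illustrative).
zfs set special_small_blocks=16K tank/vmstore

# Confirm the local setting and what inherits it.
zfs get -r special_small_blocks tank/vmstore
```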
4) What happens when the special vdev fills up?
Best case: performance degrades as allocations become constrained and metadata placement becomes less optimal.
Worst case: you get instability or failures around allocations. Treat “special nearing full” as an urgent capacity issue.
5) Can I remove a special vdev later?
Plan as if the answer is “no.” In many production deployments, removing special vdevs is not supported or not practical.
Assume it’s a one-way door and size accordingly.
6) Should special vdevs be NVMe, SATA SSD, or something else?
NVMe is ideal for metadata IOPS and latency. SATA SSD can still help massively versus HDDs.
The more metadata-heavy your workload, the more NVMe’s low latency matters.
7) Does adding more special vdev mirrors improve performance?
Often, yes. Additional mirror pairs add IOPS headroom, reduce queueing, and grow special class capacity at the same time.
But it won’t fix a fundamentally overloaded pool or pathological dataset churn.
8) Is it safe to use RAIDZ for special vdevs?
Mirrors are the default recommendation for a reason: predictable rebuild behavior and simpler failure domains.
If you’re considering RAIDZ special, you should be able to explain rebuild time, failure tolerance, and operational impact in detail.
9) How does special vdev sizing relate to ARC/RAM?
ARC helps cache metadata and data in RAM. A special vdev reduces the penalty when metadata isn’t in ARC.
If ARC is undersized, special vdevs still help, but you may be masking a memory problem.
10) If I add special vdevs, will existing metadata move onto them?
Don’t assume it will “magically rebalance” in a way that fixes your past. Some new allocations will use the new special capacity,
but migration behavior depends on implementation details and workload churn. Plan expansions before you’re on fire.
Next steps you can do this week
If you already run special vdevs: check their capacity and health today. Special fullness is one of those problems that stays quiet until it’s loud.
- Run `zpool list -v` and record special CAP. If it’s above 75%, open a capacity change ticket.
- Run `zfs get -r special_small_blocks` and make a list of datasets using it. Validate that each one is intentional.
- Review snapshot counts and retention. If you’re keeping history “just because,” you’re paying for it in metadata.
- Add monitoring for special class utilization and NVMe wear. If you only monitor overall pool capacity, you’re looking at the wrong dashboard.
- If you’re designing a new pool, size special so steady-state stays under 70%, mirror it, and buy SSDs with endurance appropriate to your workload.
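For the `zfs get -r special_small_blocks` audit, a short filter keeps the list manageable. The sketch below runs on a fake, tab-separated sample in the shape produced by `zfs get -H -o name,value,source` (the dataset names are invented); swap the here-doc for the live command.

```shell
#!/bin/sh
# List datasets that locally enable small-block routing.
# Live input: zfs get -r -H -o name,value,source special_small_blocks <pool>
awk -F'\t' '$2 != "0" && $3 == "local" { print $1, "->", $2 }' <<'EOF'
tank	0	default
tank/vms	16K	local
tank/home	0	default
tank/scratch	64K	local
EOF
```

Anything this prints should map to a deliberate decision; if a dataset shows up that nobody remembers configuring, that is your next capacity conversation.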
The best special vdev sizing is the one you never have to explain during an incident.
Err on the side of boring, redundant, and slightly larger than your ego says you need.