ZFS Metadata-Only Reads: How Special VDEV Changes the Game

You don’t notice ZFS metadata until it’s the only thing your storage does all day. The classic symptom: the pool has plenty of bandwidth,
disks look “fine”, but directory listings crawl, backups spend minutes “scanning”, and every container runtime insists on stat’ing
the universe before it does any real work.

Special vdevs change the economics of that pain. They don’t make your rust faster; they move the hottest, most latency-sensitive parts of
the filesystem off rust. Done correctly, they turn “metadata storms” into SSD hits and restore sanity. Done incorrectly, they turn a
recoverable performance issue into a pool-killing outage. Both outcomes are popular.

What “metadata-only reads” actually mean in ZFS

“Metadata-only reads” is what you call it when your application workload looks like it’s reading files, but ZFS is mostly reading
the structures that describe files. Think directory traversal, permission checks, inode-like bookkeeping, snapshot trees,
and the block pointers that lead to actual payload data.

In ZFS terms, that’s dominated by blocks like:

  • Indirect blocks (block pointers describing where data lives).
  • Dnodes (the object metadata: type, size, block pointers, bonus buffer).
  • ZAP objects (name/value stores used heavily for directories and properties).
  • Space maps / metaslabs (allocation bookkeeping; can become hot under pathological allocation patterns).
  • DDT if dedup is enabled (more on that: please don’t, unless you truly mean it).

A metadata-heavy workload can have miserable “MB/s” while saturating IOPS and latency budgets. The pool looks idle to people who only
monitor throughput. Meanwhile, your application sees “storage slow” because each metadata lookup is a random read and random reads on
HDDs are basically a negotiation with physics.

The dirty secret: many “read” workloads are actually lookup workloads. A build farm doing millions of small file stats.
A mail server walking directories. A container platform enumerating layers. Backup software that does a pre-flight scan of everything
you own before reading a byte. These are metadata jobs that occasionally read data.

One idea worth keeping taped to the monitor, paraphrased from John Allspaw: everything fails; what matters is building
systems that fail predictably and recover quickly.

Special vdevs are one way to make metadata failure modes less dramatic: lower latency, fewer timeouts, fewer retries, fewer knock-on
incidents. But only if you respect what they are.

Special vdev: what it is, and what it is not

A special vdev is an allocation class in OpenZFS. It’s a place where ZFS can store certain categories of blocks,
typically metadata and (optionally) “small” file blocks, on a faster vdev class than the main pool.

The most important operational truth: a special vdev is not a cache. It’s not L2ARC. It’s not a “maybe we’ll put
something there and if it dies we’ll shrug.” If a special vdev is lost, the pool is typically lost, because those blocks are part of
the pool’s authoritative on-disk structure.

You’re moving critical filesystem organs onto different media. Treat it like moving a database’s WAL to a different disk: it can be
brilliant, and it can be fatal.

What goes to special by default

With a special vdev present, ZFS will allocate metadata there by default (subject to implementation details and
available space). That includes dnodes, indirect blocks, directory structures (ZAP), and other metadata objects.

What can go to special if you ask for it

The dataset property special_small_blocks lets you push small file data blocks to special as well.
That’s the knob that turns “metadata accelerator” into “small I/O accelerator”.

This is where people get excited and then get fired. We’ll cover how to do it without the second part.

Joke #1: Special vdevs are like espresso—life-changing in the right dose, and a terrible idea if you keep “optimizing” until 3 a.m.

How special vdev changes the read path

Without special vdev, metadata reads behave like any other reads: ZFS walks pointers, fetches blocks from the same vdevs that hold data,
and relies heavily on ARC to avoid repeated disk hits. ARC helps, but a cold cache, an oversized working set, or an aggressive scan can
evict metadata and force real I/O.

With special vdev, the blocks that describe the filesystem are placed on a vdev class that’s typically SSD or NVMe. The result:

  • Metadata random I/O shifts from HDD seeks to SSD reads.
  • Latency collapse: not just lower average latency, but fewer awful tail latencies.
  • Higher effective concurrency: operations that were serialized by disk head travel become parallelized.
  • ARC becomes more effective: metadata fetched quickly reduces stall time and improves cache warm-up behavior.

The important nuance: special vdev doesn’t magically eliminate metadata I/O. It makes it cheap enough that it stops dominating.
Your bottleneck might then move somewhere else: CPU (compression), network, application locks, or the main data vdevs.
That’s a good problem. At least it’s honest.
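
If your OpenZFS build is recent enough, zpool iostat can report per-vdev latency directly, which turns the before/after comparison
into numbers instead of anecdotes. A minimal sketch, assuming a pool named tank; flag support varies by version, so check your man page:

cr0x@server:~$ sudo zpool iostat -v -l tank 5 3   # per-vdev average latency columns
cr0x@server:~$ sudo zpool iostat -w tank          # per-vdev request latency histograms

Watch read latency on the special mirror versus the raidz data vdevs; that gap is the whole argument for the feature.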

Metadata-only reads in the wild

A few patterns that produce “mostly metadata reads”:

  • Large directory traversals: repeated readdir(), stat(), permission checks.
  • Snapshot-heavy datasets: snapshot lists, clones, and sends with heavy traversal cost.
  • Backup verification: “compare file list” operations that touch everything.
  • Container image unpack: many small files, many metadata updates and lookups.
  • Build systems: tiny files, lots of existence checks, metadata churn.

Facts and history that matter in production

Some context points that help you reason about special vdevs and metadata behavior, without turning your incident review into a
graduate seminar:

  1. ZFS was designed for end-to-end integrity: metadata is not optional; losing it means you can’t safely interpret data blocks.
  2. ARC predates “special vdev” as a mainstream feature: early ZFS performance tuning leaned heavily on RAM and layout, not allocation classes.
  3. L2ARC came before special vdev: L2ARC is a cache and can be dropped; special vdev is storage and cannot.
  4. The “metadata on SSD” idea existed in storage long before ZFS: filesystems and arrays have used mirrored NVRAM/SSD for maps and journals for decades.
  5. OpenZFS allocation classes matured over time: early feature sets were conservative; modern deployments use special vdevs routinely for mixed workloads.
  6. Dedup tables (DDT) are metadata on steroids: when dedup is enabled, the DDT can dominate random reads and memory usage.
  7. Indirect blocks can be a large fraction of I/O: especially with large files and random reads, you can be bottlenecked on pointer-chasing.
  8. Special vdevs changed the “small file problem”: pushing tiny file blocks to SSD is often the difference between “fine” and “why is ls taking seconds?”

The headline: special vdev is not a toy feature. It’s ZFS acknowledging that metadata latency is often the real enemy, and that throwing
more HDDs at it is a sad kind of cardio.

When special vdev is a big win (and when it’s pointless)

Big win scenarios

  • Millions of small files where directory traversal and stats dominate.
  • Virtualization or container hosts with lots of small configuration files and frequent lookups.
  • Backup targets where scan/verify phases are the bottleneck.
  • Snapshot-heavy datasets where metadata walk costs show up in send/list operations.
  • Dedup-enabled pools (again: only if you truly mean it) where DDT reads dominate; special can help, but RAM still matters more.

Pointless or marginal scenarios

  • Streaming workloads reading large contiguous files (media, archives). Your bottleneck is sequential bandwidth, not metadata.
  • ARC already holds the working set. If all metadata is resident, special vdev won’t show dramatic gains.
  • Workloads dominated by synchronous writes. That’s SLOG territory, not special vdev.

Where people mis-predict the win

The classic wrong assumption is “our pool is slow because HDDs are slow, so SSDs will fix it.” Sometimes true. Often false.
If your pain is latency on random metadata reads, special vdev is a surgical fix.
If your pain is “we’re CPU-bound on compression” or “the app does one operation at a time,” then special vdev mostly makes charts prettier.

Design sane special vdev layouts

Here’s the design stance that keeps you employed: mirror the special vdev. If you can’t afford redundancy for it,
you can’t afford the feature.

Redundancy rules that don’t bend

  • Use mirrors for special vdevs in most environments (a command sketch follows this list). RAIDZ special vdevs exist, but mirrors keep latency low and failure domains clear.
  • Match or exceed the pool’s reliability posture. If your data vdevs are RAIDZ2, your special vdev should not be a single SSD “because it’s enterprise.”
  • Plan for wear. Metadata can be write-heavy under churn. Pick SSDs with real endurance, not “whatever was on sale”.
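
What that looks like in practice: a minimal sketch of adding a mirrored special vdev to an existing pool, assuming a pool named
tank and two NVMe partitions (device names are placeholders for your hardware). Some versions warn about a mismatched replication
level when the data vdevs are raidz and the new vdev is a mirror; read that warning before reaching for -f.

cr0x@server:~$ # Dry run: show the resulting layout without changing the pool
cr0x@server:~$ sudo zpool add -n tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
cr0x@server:~$ # Looks right? Add it for real, then confirm the mirrored special section
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
cr0x@server:~$ sudo zpool status tank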

Sizing: the part everyone guesses wrong

Under-size special and you’ll hit 80–90% usage, at which point allocation gets constrained and performance becomes weird in non-obvious ways.
Over-size it and you waste money, but you also buy operational peace: space to grow, room for rebalancing, fewer cliff edges.
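
You don’t have to size blind, either. On an existing pool, zdb can walk the block tree and break allocated space down by block type,
which gives a rough metadata footprint to size against. A minimal sketch, assuming a pool named tank; the walk is long on big pools
and the numbers are best-effort on a busy pool, so run it off-peak and treat the result as an estimate:

cr0x@server:~$ # Per-type block statistics (dnodes, indirect blocks, ZAPs, data, ...)
cr0x@server:~$ sudo zdb -bb tank | less
cr0x@server:~$ # Sum the ASIZE of the metadata block types to estimate the special footprint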

Practical sizing guidance:

  • Metadata-only special vdev: size it generously enough that it stays comfortably under ~70% used in steady state.
  • Metadata + small blocks: assume it becomes a hot tier. Size it like you would size a real dataset, because it effectively is one.
  • Snapshot-heavy environments: metadata grows with history. Budget for it. “We’ll delete snapshots later” is not a strategy; it’s a bedtime story.

special_small_blocks: a sharp knife

Setting special_small_blocks to a non-zero value means any blocks at or below that size are allocated to special.
If you set it to 128K on a dataset with recordsize=128K, congratulations: you just told ZFS to put basically all data on
special. People do this accidentally. Then they learn new feelings.
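
Mechanically, it’s a per-dataset property and it only affects new writes. A minimal sketch, assuming a dataset named tank/projects
and a deliberately conservative 32K threshold (both are examples, not recommendations):

cr0x@server:~$ sudo zfs get special_small_blocks,recordsize tank/projects   # know the current values first
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/projects          # blocks <=32K go to special from now on
cr0x@server:~$ # Existing blocks stay where they were written; only new allocations move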

A conservative approach:

  • Start with 0 (metadata only), measure, then consider 16K or 32K for small-file workloads.
  • Use it per-dataset, not as a pool-wide “optimization”. Your VM images don’t need it; your source tree might.
  • Document it. Future-you is a stranger who will blame present-you.

Joke #2: Setting special_small_blocks=128K on a busy dataset is like putting your entire office on the guest Wi‑Fi because it “tested faster”.

Practical tasks: commands, output meaning, and decisions

These are the day-to-day checks that separate “I read about special vdevs” from “I can operate them at 2 a.m. without improvising.”
Each task includes a command, sample output, what it means, and the decision you make.

Task 1: Confirm pool layout and special class presence

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
        special
          mirror-1                  ONLINE       0     0     0
            nvme0n1p1               ONLINE       0     0     0
            nvme1n1p1               ONLINE       0     0     0

errors: No known data errors

Meaning: A special section exists and is mirrored. Good. If you don’t see special, you don’t have one.

Decision: If special is missing and metadata is a bottleneck, plan an add. If it’s present but single-disk, treat as a reliability bug.

Task 2: Verify allocation class stats (are bytes actually landing there?)

cr0x@server:~$ sudo zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        109T  72.4T  36.6T        -         -    18%    66%  1.00x  ONLINE  -
  raidz2-0  105T  70.9T  34.5T        -         -    18%    67%      -  ONLINE
  special   3.64T  1.52T  2.12T        -         -     9%    41%      -  ONLINE

Meaning: Special has allocation. If it’s near zero while you expect metadata there, something is off (no special vdev at creation time, or blocks predate it).

Decision: If special is underutilized but should be used, check dataset properties and consider whether existing data needs rewriting/migration to benefit.

Task 3: Check dataset settings that drive block placement

cr0x@server:~$ sudo zfs get -o name,property,value -s local,received special_small_blocks,recordsize,atime,compression tank/projects
NAME           PROPERTY              VALUE
tank/projects  special_small_blocks  32K
tank/projects  recordsize            128K
tank/projects  atime                 off
tank/projects  compression           zstd

Meaning: This dataset will push blocks ≤32K to special. With recordsize=128K, only genuinely small files/blocks go there.

Decision: If special is filling too fast, lower or reset special_small_blocks. If metadata latency is still bad, consider raising it cautiously.

Task 4: Find out whether your “slow read” is metadata (ARC stats view)

cr0x@server:~$ sudo arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:01   560    90     16    40   44    22   24    28   31   64G   96G
12:01:02   610   110     18    55   50    24   22    31   28   64G   96G
12:01:03   590   105     17    52   50    25   24    28   27   64G   96G
12:01:04   605   120     20    62   52    28   23    30   25   64G   96G
12:01:05   575   115     20    60   52    27   23    28   24   64G   96G

Meaning: High metadata miss rates (the mmis/mm% columns) indicate metadata isn’t staying in ARC and is being fetched from disk/special.

Decision: If metadata miss rate is high and latency hurts, special vdev helps. If special exists, verify it isn’t saturated or mis-sized.

Task 5: Measure vdev-level latency during a “metadata storm”

cr0x@server:~$ sudo zpool iostat -v tank 1 3
                 capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        72.4T  36.6T    820    190  18.2M  4.5M
  raidz2-0  70.9T  38.1T    260    160   6.1M  4.0M
    sda         -      -     45     26   1.1M   700K
    sdb         -      -     43     25   1.0M   680K
    sdc         -      -     44     27   1.1M   710K
    sdd         -      -     42     26   1.0M   690K
    sde         -      -     44     28   1.1M   720K
    sdf         -      -     42     28   1.0M   720K
  special   1.52T  2.12T    560     30  12.1M   520K
    mirror-1     -      -    560     30  12.1M   520K
      nvme0n1p1  -      -    280     15   6.0M   260K
      nvme1n1p1  -      -    280     15   6.1M   260K

Meaning: Reads are hitting special heavily. That’s expected if metadata/small blocks are there.

Decision: If special is doing the work and latency is good, you’re winning. If special is pegged and slow, you’ve moved the bottleneck onto the SSDs.

Task 6: Check special vdev health and error counters

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 07:21:10 with 0 errors on Sun Dec 22 03:10:11 2025
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
        ...
        special
          mirror-1    ONLINE       0     0     0
            nvme0n1p1 ONLINE       0     0     0
            nvme1n1p1 ONLINE       0     0     0

errors: No known data errors

Meaning: Clean. If you see errors on special, treat it as an urgent reliability incident, not a “performance issue”.

Decision: Replace failing special devices immediately. A degraded special mirror is a pool-outage countdown.

Task 7: Identify whether metadata is overwhelming the main vdevs anyway (queueing)

cr0x@server:~$ iostat -x 1 2
Linux 6.6.0 (server) 	12/26/2025 	_x86_64_	(32 CPU)

Device            r/s   w/s  rkB/s  wkB/s  await  svctm  %util
sda              35.0  22.0  1100.0  680.0  28.5   3.2   98.0
sdb              34.0  23.0  1050.0  690.0  27.8   3.1   97.5
nvme0n1          290.0  18.0  6200.0  260.0   1.1   0.3   22.0
nvme1n1          295.0  17.0  6300.0  260.0   1.0   0.3   21.0

Meaning: HDDs are near 100% utilized with high await, while NVMe has headroom. Either metadata still hits HDDs, or your workload is data-heavy.

Decision: If special exists, verify special_small_blocks and confirm metadata placement. Otherwise, plan special or reduce metadata churn.

Task 8: Verify ashift and sector alignment on special devices

cr0x@server:~$ sudo zdb -C tank | sed -n '1,80p'
MOS Configuration:
        version: 5000
        name: 'tank'
        vdev_children: 2
        vdev_tree:
            type: 'root'
            id: 0
            guid: 1234567890
            children[0]:
                type: 'raidz'
                ashift: 12
                ...
            children[1]:
                type: 'mirror'
                ashift: 12
                is_special: 1
                ...

Meaning: ashift: 12 (4K sectors) on special. Good. Wrong ashift can waste space and harm performance.

Decision: If ashift is wrong, you can’t “tune” your way out. Plan a migration or a rebuild for correctness.

Task 9: Check how full the special vdev is (and act before it hurts)

cr0x@server:~$ sudo zpool list -o name,size,alloc,free,cap,frag tank
NAME  SIZE   ALLOC  FREE   CAP  FRAG
tank  109T   72.4T  36.6T  66%  18%

Meaning: This is pool-level. It does not show special fullness by itself; use zpool list -v as in Task 2.

Decision: If special class is creeping above ~70–80% used, plan expansion early. Don’t wait for “it’s full” because that’s when your options get stupid.

Task 10: See what datasets are heavy on metadata (snapshot count as a clue)

cr0x@server:~$ sudo zfs list -t snapshot -o name -S creation tank/projects | head
NAME
tank/projects@auto-2025-12-26-0000
tank/projects@auto-2025-12-25-0000
tank/projects@auto-2025-12-24-0000
tank/projects@auto-2025-12-23-0000
tank/projects@auto-2025-12-22-0000
tank/projects@auto-2025-12-21-0000
tank/projects@auto-2025-12-20-0000
tank/projects@auto-2025-12-19-0000
tank/projects@auto-2025-12-18-0000

Meaning: Snapshot volume is a metadata multiplier. More snapshots often means more metadata traversal overhead.

Decision: If metadata storms coincide with snapshot churn, adjust retention, scheduling, or move that dataset’s metadata pressure onto special.

Task 11: Check scrub/resilver impact on special vdev

cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 01:20:44 2025
        9.21T scanned at 1.05G/s, 1.90T issued at 222M/s, 72.4T total
        0B repaired, 2.62% done, 3 days 17:12:00 to go

Meaning: Scrub walks metadata extensively. If your special vdev is the metadata home, scrub behavior and latency can change (often for the better).

Decision: If scrub crushes production latency, schedule it, throttle it (platform-dependent), and ensure special isn’t underpowered.
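
If a scrub is actively hurting production right now, you can pause it instead of waiting it out. A minimal sketch, assuming a pool
named tank and an OpenZFS version with scrub pause support:

cr0x@server:~$ sudo zpool scrub -p tank              # pause the running scrub
cr0x@server:~$ sudo zpool scrub tank                 # later: resume from where it paused
cr0x@server:~$ sudo zpool status tank | grep scan:   # confirm the scan state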

Task 12: Confirm that your “fast tier” isn’t being used as a write-amplifier

cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep -i 'percentage|data units|media|error'
Percentage Used:                    7%
Data Units Written:                 18,442,119
Data Units Read:                    21,104,553
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Meaning: Wear is measurable. Metadata churn plus small-block data can chew through consumer SSD endurance.

Decision: If “Percentage Used” climbs fast, change SSD class, reduce metadata churn, or stop pushing small blocks to special.

Task 13: Validate what’s actually on special (high-level inference via allocation)

cr0x@server:~$ sudo zfs get -r -o name,property,value special_small_blocks tank | head -n 12
NAME                 PROPERTY              VALUE
tank                 special_small_blocks  0
tank/projects        special_small_blocks  32K
tank/vm              special_small_blocks  0
tank/backups         special_small_blocks  0
tank/containers      special_small_blocks  16K

Meaning: Only some datasets push small blocks to special. That’s healthy: targeted use instead of pool-wide gambling.

Decision: Keep it targeted. If everything is set to 128K “for performance,” you’re about to run a very expensive special vdev.

Task 14: Spot metadata-driven application behavior (syscall-level hint)

cr0x@server:~$ sudo strace -f -tt -p $(pgrep -n rsync) -e trace=newfstatat,getdents64,openat -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 58.41    0.723601          12     59041       112 newfstatat
 31.22    0.386926          14     26874           getdents64
 10.37    0.128504          10     12245        45 openat
------ ----------- ----------- --------- --------- ----------------
100.00    1.239031                 98160       157 total

Meaning: The app is spending most calls on stat and directory enumeration. That’s metadata. Your bottleneck is likely metadata I/O and latency.

Decision: Prioritize special vdev and metadata tuning over “more disks” or “bigger recordsize”. The workload isn’t asking for bandwidth.

Fast diagnosis playbook

When someone says “storage is slow,” you don’t have time for interpretive dance. You need a quick triage path that narrows the bottleneck
to: ARC, special vdev, main vdevs, CPU, or the application doing something unfortunate.

First: determine whether it’s metadata-dominated

  • Check syscall patterns (strace sampling) and workload description: lots of stat(), readdir(), snapshot listing, scanning.
  • Check ARC misses (Task 4). Metadata miss rates are the tell.
  • Check throughput vs latency: low MB/s with high IOPS is usually metadata or small random reads.

Second: identify which vdev class is serving reads

  • Use zpool iostat -v (Task 5) during the slowdown.
  • If reads hammer special, confirm SSD latency and health.
  • If reads hammer HDDs, either metadata isn’t on special (or pre-existing blocks weren’t rewritten), or the workload is data-heavy.

Third: decide whether you’re bottlenecked by capacity pressure, queueing, or configuration

  • Check special allocation and capacity (Task 2). A too-full special vdev is a performance cliff edge.
  • Check device queueing (iostat -x, Task 7). High await on SSD suggests saturation or contention.
  • Verify dataset knobs (special_small_blocks, recordsize, compression) (Task 3).
  • Check whether a scrub/resilver is underway (Task 11).

The output of this playbook should be a single sentence you can say in Slack without getting roasted:
“We’re metadata-miss bound and reads are hitting HDDs; special isn’t carrying metadata for this dataset” or
“Special is saturated and at 85% used; we need to expand it and stop shoving 32K blocks there for that backup job.”
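
Put together, a minimal triage sequence you can paste during the slowdown, assuming a pool named tank; every command here is covered
in the tasks above:

cr0x@server:~$ sudo arcstat 1 10                        # metadata miss rate (mmis/mm%)
cr0x@server:~$ sudo zpool iostat -v tank 1 10           # which vdev class is serving reads
cr0x@server:~$ sudo zpool list -v tank                  # special class allocation and capacity
cr0x@server:~$ iostat -x 1 5                            # device-level queueing and await
cr0x@server:~$ sudo zpool status tank | grep -A2 scan:  # is a scrub/resilver in flight?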

Common mistakes (symptoms → root cause → fix)

1) “We added special and nothing got faster.”

Symptoms: Directory traversal still slow; HDDs still show random read pressure; special vdev shows little read activity.

Root cause: Existing metadata/data blocks weren’t moved. Special affects new allocations; old blocks remain where they were written.

Fix: For datasets where it matters, rewrite data (e.g., send/receive into a new dataset, or copy-then-swap) so metadata/small blocks reallocate onto special.
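
A minimal sketch of the send/receive route, assuming the dataset is tank/projects, you can tolerate a short cutover window, and the
snapshot and dataset names are placeholders for your own scheme:

cr0x@server:~$ sudo zfs snapshot tank/projects@move1
cr0x@server:~$ sudo zfs send tank/projects@move1 | sudo zfs receive tank/projects_new
cr0x@server:~$ # Quiesce writers, then ship the delta and swap names
cr0x@server:~$ sudo zfs snapshot tank/projects@move2
cr0x@server:~$ sudo zfs send -i @move1 tank/projects@move2 | sudo zfs receive -F tank/projects_new
cr0x@server:~$ sudo zfs rename tank/projects tank/projects_old
cr0x@server:~$ sudo zfs rename tank/projects_new tank/projects

Because the receive writes fresh blocks under the current pool layout, metadata (and any qualifying small blocks) for the new copy
lands on special.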

2) “Special is full and now everything feels haunted.”

Symptoms: Latency spikes, allocation oddities, unpredictable performance; alerts about capacity.

Root cause: Special vdev under-sized, often combined with aggressive special_small_blocks settings.

Fix: Add mirrors to special (expand allocation class) or migrate data to a bigger pool. Reduce special_small_blocks on datasets that don’t need it.

3) “We used a single ‘enterprise SSD’ for special. It failed. Now we’re learning restore procedures.”

Symptoms: Pool faulted or will not import; missing special device; panic.

Root cause: No redundancy for a vdev class that stores required blocks.

Fix: Always mirror special. If already broken, restore from backup or attempt professional recovery—don’t expect miracles.

4) “We set special_small_blocks too high and accidentally built an expensive all-SSD pool.”

Symptoms: Special allocation rises rapidly; SSD wear increases; HDDs are oddly idle; budget meetings get tense.

Root cause: Threshold set near recordsize (or workload writes many small blocks), so most data lands on special.

Fix: Reduce the threshold per dataset, and migrate or rewrite data if you need to move existing blocks off special. Monitor special fullness and SSD endurance.
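
Resetting the knob is one command per dataset; moving blocks that already landed on special is the expensive part (see the
send/receive sketch in mistake #1). A minimal sketch, assuming tank/backups should go back to metadata-only:

cr0x@server:~$ sudo zfs inherit special_small_blocks tank/backups        # drop the local setting
cr0x@server:~$ sudo zfs get -s local special_small_blocks tank/backups   # should no longer show a local value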

5) “Performance improved, then cratered during scrub.”

Symptoms: Latency and IO wait spikes during scrub; application timeouts; special vdev shows heavy read activity.

Root cause: Scrub is metadata-intensive; special is now the hot path and may be under-provisioned, or scrub schedule overlaps peak use.

Fix: Schedule scrubs off-peak, tune scrub behavior where supported, and ensure special mirrors have enough IOPS and endurance.

6) “We thought special replaces SLOG.”

Symptoms: NFS/VM synchronous write performance still bad; fsync-heavy workloads unchanged.

Root cause: Special vdev targets metadata/small blocks. SLOG (separate log) accelerates synchronous writes by hosting the ZIL.

Fix: Add an appropriate SLOG device (mirrored for serious workloads) and verify sync behavior. Keep roles separate.
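
For contrast, a SLOG is added as a log vdev, not a special vdev. A minimal sketch, assuming two small power-loss-protected SSD
partitions and a dataset named tank/nfs (all names are placeholders):

cr0x@server:~$ sudo zpool add tank log mirror /dev/nvme2n1p1 /dev/nvme3n1p1
cr0x@server:~$ sudo zpool status tank       # a separate "logs" section should now appear
cr0x@server:~$ sudo zfs get sync tank/nfs   # confirm the workload actually issues synchronous writes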

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “It’s a cache, right?”

A mid-sized company ran a ZFS pool for home directories and CI artifacts. They’d heard that “special vdev makes metadata fast” and
bought a single, high-end NVMe drive. It was added as special. Performance improved immediately: directory listings stopped timing out,
CI scans ran faster, everyone high-fived.

Months later, the NVMe started throwing media errors. The system kept running because the device didn’t die all at once; it just got
weird. Occasional checksum errors appeared, then increased. Someone treated it like “cache corruption” and scheduled a replacement
“next maintenance window.”

The replacement window came after the device dropped out completely. The pool wouldn’t import cleanly. The “metadata accelerator” had
been hosting required metadata. Their data vdevs were fine; they just couldn’t safely traverse the filesystem structures without the
missing blocks.

The recovery was exactly as fun as you think: restore from backups, rehydrate CI artifacts, rebuild developer machines, answer awkward
questions about why a single SSD was a single point of failure. The postmortem action item wasn’t “be careful,” it was
“mirror special or don’t use it.”

Optimization that backfired: the special_small_blocks “turbo” setting

Another organization ran a mixed workload: source repositories, container registries, and VM images. They deployed a mirrored special
vdev and got a solid improvement on metadata-heavy tasks. Then someone decided to “finish the job” and pushed
special_small_blocks=128K on a few datasets because “most IO is small anyway.”

At first it looked great. Latency graphs improved. HDD utilization dropped. The team declared victory and moved on.
The problem showed up slowly: special allocation climbed faster than expected, and wear indicators on the SSDs ticked upward in a way
that didn’t match their mental model.

The backfire was subtle: the datasets had recordsize=128K and a workload that rewrote files frequently.
They’d effectively created an SSD tier that hosted not just metadata, but a large fraction of data blocks too. The special vdev became
the new capacity constraint. Then it became the new performance bottleneck, because it was doing both metadata and a pile of data IO.

The fix was painful but clean: they lowered special_small_blocks to 16K for the datasets that benefited, reset it to 0 for
the rest, and migrated affected datasets via send/receive to reallocate blocks sanely. It cost time, but it replaced a creeping disaster
with a stable system.

Boring but correct practice that saved the day: capacity and wear budgets

A financial services shop ran ZFS for analytics logs and compliance archives. They were conservative: mirrored special vdev, conservative
special_small_blocks only on a few “tiny file” datasets, weekly scrubs, and a strict monitoring rule: alert at 60% and 70%
usage on special, plus SSD wear thresholds.

One quarter, a new application release increased file counts dramatically. Nothing obvious: no large increase in total bytes, just a
flood of small objects and snapshots. The main pool still had plenty of free space. But special allocation climbed week over week.

Because they watched special separately, they caught it early. They added another mirrored special vdev pair during a normal change
window and adjusted retention on one snapshot schedule. No outage, no drama, no “why is the pool offline.”

The most impressive part is that no one wrote a heroic Slack thread about it. That’s the point. Operations is supposed to be boring.
This is how you buy boring: budgets, thresholds, and a refusal to treat critical storage as an experiment.

Checklists / step-by-step plan

Step-by-step: deciding whether you need a special vdev

  1. Characterize the workload: is it dominated by file enumeration, stats, snapshots, tiny files, or dedup lookups?
  2. Measure ARC behavior: are metadata misses meaningful during the slowdown?
  3. Measure device queueing: are HDDs pegged with random read latency while throughput is low?
  4. Confirm you’re not chasing the wrong bottleneck: CPU-bound compression, sync writes, or single-threaded app behavior.
  5. Decide scope: pool-wide metadata acceleration vs targeted small-block acceleration per dataset.

Step-by-step: implementing special vdev safely

  1. Pick devices: mirrored SSD/NVMe with endurance appropriate for metadata churn.
  2. Plan redundancy: minimum mirror; consider hot spare policy in your operational context.
  3. Add special vdev: validate with zpool status and zpool list -v.
  4. Keep special_small_blocks at 0 initially: get the “metadata-only” win first.
  5. Measure again: verify reads shift and latency improves.
  6. Enable special_small_blocks selectively: start small (16K/32K), only on datasets with tiny-file pain.
  7. Document: dataset properties and rationale in your change log.

Step-by-step: avoiding capacity cliffs

  1. Monitor special allocation using zpool list -v (a cron-able sketch follows this checklist).
  2. Alert early: treat 70% as “plan expansion,” not “panic”.
  3. Watch SSD wear: SMART wear metrics and error logs.
  4. Be honest about snapshots: snapshot growth is metadata growth; set retention you can afford.
  5. Have a migration plan: if you must change placement meaningfully, plan send/receive or controlled rewrite.
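
A minimal cron-able sketch of that early-warning alert, assuming the special class shows up under a "special" row in zpool list -v
and that CAP is the eighth column; both are version-dependent details of the output format, so verify against your own pool before
trusting it:

#!/bin/sh
# Hypothetical helper: exit non-zero when the special class crosses a planning threshold.
POOL=tank
THRESHOLD=70
USED=$(zpool list -v "$POOL" | awk '
  /^[[:space:]]*special([[:space:]]|$)/ {in_special=1}
  in_special && $8 ~ /%$/ {gsub("%","",$8); print int($8); exit}')
if [ -n "$USED" ] && [ "$USED" -ge "$THRESHOLD" ]; then
  echo "$POOL special class at ${USED}% used - plan expansion" >&2
  exit 1
fi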

FAQ

1) Is a special vdev the same thing as L2ARC?

No. L2ARC is a read cache and can be lost without losing the pool. Special vdev stores real blocks (metadata and optionally small data).
Lose it and you can lose the pool.

2) Does adding a special vdev move existing metadata automatically?

Generally, no. It affects new allocations. Existing blocks stay where they are unless you rewrite/migrate data so ZFS allocates again.

3) Can I use a single SSD for special vdev if it’s “enterprise grade”?

You can. You also can run production without backups. In both cases, you’re choosing drama. Mirror the special vdev.

4) What’s a safe starting value for special_small_blocks?

Start at 0 (metadata only). If you have a proven small-file workload, try 16K or 32K on that dataset, then watch special allocation and SSD wear.

5) How is SLOG different from special vdev?

SLOG accelerates synchronous writes by hosting the ZIL intent log. Special vdev accelerates metadata reads/writes and optionally small blocks.
They solve different problems and are often used together in serious environments.

6) Will special vdev make scrubs faster?

Often, it improves scrub behavior because scrubs read metadata heavily. But if your special vdev is underpowered or saturated, scrub can
compete with production reads on the same SSDs.

7) Can special vdev help with dedup?

It can help because DDT lookups are metadata-like and latency-sensitive. But dedup is still dominated by memory needs and lookup behavior.
Special vdev is not a substitute for enough RAM and a correct dedup design.

8) What happens if special vdev fills up?

Best case: performance degrades and allocations behave differently as ZFS tries to cope. Worst case: you hit allocation failures and
operational chaos. Treat special capacity as a first-class resource with early alerts and an expansion plan.

9) Do I need special vdev on an all-NVMe pool?

Sometimes still useful, but the win is smaller because metadata and data already have low latency. It can help isolate metadata from bulk
reads, but measure first—don’t cargo-cult it.

10) How do I know if my slowdown is “metadata” versus “data”?

Look for low throughput with high operation counts, high metadata ARC misses, and syscall profiles dominated by stat() and directory reads.
Then confirm which vdev class is serving reads via zpool iostat -v.

Conclusion: practical next steps

Special vdev is one of the rare ZFS features that can feel like cheating: suddenly the filesystem stops arguing with physics during
metadata storms. But it’s only cheating if you pay the price upfront: redundancy, sizing, and discipline with
special_small_blocks.

Next steps that reliably improve real systems:

  • Prove it’s metadata with ARC miss data and vdev-level iostat during the incident window.
  • If you deploy special, mirror it and monitor it like it can take your pool down—because it can.
  • Start metadata-only, then selectively enable small-block placement where the workload earns it.
  • Budget for growth: snapshots, file counts, and churn will expand metadata over time.
  • Keep it boring: alerts at sane thresholds, scrubs scheduled, wear tracked, and changes documented.

If you do this right, you’ll spend less time explaining why “ls” is slow and more time doing the kind of work that gets you promoted:
predictable systems, predictable incidents, predictable recovery.
