ZFS Special VDEVs on SAS SSDs: The Pro Move for Metadata

You know the smell: the pool is “fine” on sequential throughput, but the system feels like it’s dragging a piano.
ls -l on a big directory stutters, backup verification takes forever, and your hypervisor reports high latency
during what should be boring background work.

Nine times out of ten, you don’t have a bandwidth problem. You have a metadata IOPS problem. And if you’re running big
HDD pools (or mixed workloads) and you’re not using ZFS special vdevs, you’re leaving a lot of performance—and predictability—on the table.
SAS SSDs make a particularly good home for that metadata. Not because they’re magical. Because they’re boring, consistent,
and built for servers that live in racks longer than some corporate strategies.

What special vdevs actually do (and what they don’t)

A ZFS “special vdev” is an allocation class that can store metadata and (optionally) small file blocks on a faster device class
than your main data vdevs. The common deployment is: big, cheap, resilient HDD vdevs for bulk data; mirrored SSDs in the special class
for metadata and small I/O. The impact is immediate in workloads that do lots of namespace activity: directory traversals, snapshots,
rsync/backup scans, VM image access patterns, container image layers, git repos, maildirs, and anything that “touches lots of tiny things.”

What goes to the special vdev?

  • Metadata by default: directory entries, object metadata, indirect blocks, allocation metadata.
  • Optionally small data blocks when special_small_blocks is set on a dataset.
  • Not a cache: this is not L2ARC. Data is placed there permanently (until rewritten/relocated by ZFS behavior).

What it is not

  • Not SLOG (separate log device): SLOG accelerates synchronous writes. Special vdev accelerates metadata and small blocks.
  • Not L2ARC: L2ARC is a read cache that can be dropped without losing the pool. Special vdev is part of the pool.
  • Not optional once used: lose it, lose the pool (unless it’s redundant and only one device fails, and even then: treat it as a fire).

That last bullet is the one that turns a clever performance feature into a career-limiting move when implemented casually.
If you take nothing else from this piece: special vdevs must be redundant, and their health must be monitored like a power feed.
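Given that stake, a guard rail is cheap. The sketch below flags a special class whose first member is a bare device rather than a mirror. It parses `zpool status`-style text (an embedded sample here, so the sketch is self-contained; in production you would pipe the live command), and the `check_special` helper plus its parsing assumptions are mine, not an official interface.

```shell
#!/bin/sh
# Sketch: warn when the special class is not a mirror.
# The sample below stands in for live `zpool status tank` output.
status_sample='        NAME        STATE
        tank        ONLINE
          raidz2-0  ONLINE
        special
          sdx       ONLINE'

check_special() {
  printf '%s\n' "$1" | awk '
    /^[ \t]*special$/ { in_special = 1; next }
    in_special {
      if ($1 ~ /^mirror/) print "OK: special class is mirrored"
      else                print "WARNING: special class is NOT mirrored: " $1
      exit
    }'
}

check_special "$status_sample"
```

On a healthy layout the first child under special starts with mirror-N, so the check prints OK; wire the WARNING path into whatever actually pages a human.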

Why SAS SSDs are a smart choice for special vdevs

You can build special vdevs with SATA SSDs, NVMe, even Optane back when it was easy to buy. But SAS SSDs hit a sweet spot for
server fleets: dual-porting, consistent firmware ecosystems, proper enclosures/backplanes, and fewer “consumer surprise” behaviors.
They’re not the fastest on paper. They’re fast enough where it counts: consistently low latency at queue depths 1–4, under steady metadata churn.

Traits that matter in the metadata business

  • Predictable latency: metadata workloads punish tail latency. SAS SSDs in good HBAs/backplanes are usually boring. Boring wins.
  • Better operational ergonomics: standardized sleds, enclosure management, and fewer “which M.2 is dying?” moments.
  • Dual-port (common in enterprise SAS SSDs): more resilient paths in HA setups.
  • Power-loss protection (typical): less drama during “someone kicked the wrong PDU” events.
  • Endurance headroom: metadata can be write-heavy in unexpected ways: snapshots, deletes, churny datasets.

The punchline: if your main pool is HDD, special vdevs are often the biggest “feel fast” upgrade you can make without buying a whole new storage system.
And SAS SSDs are a good compromise between “enterprise correct” and “finance didn’t faint.”

Interesting facts and historical context

Storage engineering is mostly the art of learning old lessons in new packaging. Here are a few context points that help you reason about special vdevs,
and why they exist.

  1. ZFS was designed with end-to-end integrity: checksums for everything means metadata isn’t “lightweight”; it’s part of correctness.
  2. Old-school filesystems often treated metadata as second-class; ZFS metadata is richer, and its access patterns show up in real latency.
  3. “Allocation classes” came later in OpenZFS to solve mixed-media pools without giving up a single coherent namespace.
  4. Before special vdevs, people used L2ARC as a band-aid for metadata-heavy reads; it helped sometimes, but placement wasn’t deterministic.
  5. Hybrid arrays have existed for decades: tiering hot data to flash isn’t new; special vdevs are ZFS doing it in a ZFS-shaped way.
  6. Directory traversal cost exploded with massive filesets long before “big data” was fashionable; mail spools and web caches taught that lesson.
  7. Enterprise SAS lived through multiple eras: from spinning disks to SSDs, the operational tooling (enclosures, SES, HBAs) remained mature.
  8. Metadata amplification is real: deleting a million small files is far more metadata work than writing one big one of equal size.

How metadata really hurts you: a practical mental model

On HDD pools, you can have plenty of MB/s and still feel slow because metadata is random I/O. A directory listing on a large tree is a storm of tiny reads:
dnodes, indirect blocks, directory blocks, ACLs, xattrs, and the “where is this block stored?” map work ZFS must do to remain correct.
HDDs are fine at streaming. They are miserable at random 4–16K reads with seeks.

Put metadata on SSD and two things happen: (1) your random IOPS ceiling rises by orders of magnitude; (2) your tail latency drops, which makes everything
above storage—applications, VMs, databases, backup tools—stop waiting in line.
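The "orders of magnitude" claim is plain arithmetic: the IOPS ceiling is roughly one over per-operation latency. The latencies below are illustrative round-number assumptions (a 7200 RPM seek-plus-rotation budget versus a typical SAS SSD random read), not measurements:

```shell
# Rough IOPS ceiling from per-op latency: iops is about 1 / latency.
awk 'BEGIN {
  hdd_us = 8000     # ~8 ms per random read on a 7200 RPM disk (assumption)
  ssd_us = 100      # ~0.1 ms per random read on a SAS SSD (assumption)
  printf "HDD: ~%d IOPS, SSD: ~%d IOPS, ratio: ~%dx per device\n",
         1000000 / hdd_us, 1000000 / ssd_us, hdd_us / ssd_us
}'
```

A metadata storm of tiny reads hits that per-spindle ceiling almost immediately; the SSD absorbs the same storm without breaking stride.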

Metadata vs small data: deciding what to place

The default behavior (“metadata only”) already changes user experience. But many real-world workloads are dominated by small files:
config repos, container layers, log shards, email, CI artifacts, package registries.
That’s where special_small_blocks earns its keep: it pushes small file blocks to the special vdev too.

There’s no free lunch. If you send small blocks to special vdevs, you are also sending more writes to those SSDs. That’s fine if you sized them,
mirrored them, and you’re watching endurance. It’s a faceplant if you sized them like cache drives and then loaded them with a million 32K writes per second.

One operational truth: users don’t file tickets saying “metadata latency is high.” They say “the app is slow.” Your job is to know when “the app is slow”
means “your rust disks are seeking for dnodes again.”

Design decisions that matter in production

1) Redundancy is mandatory

Treat special vdevs like the pool’s spine. If the special vdev is lost and it contains metadata, the pool is not “degraded.” It’s gone.
The sane default is a mirror (or multiple mirrored special vdevs). RAIDZ for special is possible but often awkward; mirrors keep rebuild behavior
and latency predictable.

2) Size it for the long haul, not the demo

Metadata grows with file count, snapshot count, and fragmentation patterns. If you set special_small_blocks, growth is faster.
Undersizing leads to a nasty cliff: once the special vdev fills, ZFS must place new metadata (and small blocks, if configured) on slower main vdevs.
That’s when your “fast pool” becomes “mysteriously inconsistent pool.” Users love inconsistency almost as much as they love surprise maintenance windows.
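To make "size it for the long haul" concrete, run a back-of-envelope model. The 1 KiB-per-file overhead below is a deliberately crude assumption (real overhead varies with dnode size, indirect blocks, xattrs, and snapshot count); the point is to compare projected footprint against a utilization ceiling, not to be exact:

```shell
# Back-of-envelope special vdev sizing: projected metadata vs. budget.
awk 'BEGIN {
  files        = 200e6    # projected file count at end of life
  per_file_kib = 1        # ASSUMED average metadata per file (very rough)
  special_gib  = 745      # usable capacity of the special mirror
  ceiling      = 0.60     # stay under 60% utilization

  need_gib = files * per_file_kib / 1024 / 1024
  printf "projected metadata: ~%.0f GiB, budget at %.0f%%: %.0f GiB\n",
         need_gib, ceiling * 100, special_gib * ceiling
}'
```

Here roughly 191 GiB of projected metadata fits comfortably inside a 447 GiB budget. If you also plan small block offload, redo the math with the small data included; it dominates quickly.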

3) Think failure domains: HBA, expander, enclosure

Mirroring two SSDs in the same backplane on the same expander on the same HBA is not redundancy; it’s optimism with extra steps.
Place mirror legs on different HBAs/enclosures when you can. If you can’t, at least use different bays and validate the enclosure pathing.

4) Use SAS SSDs you can actually replace

“We found two random enterprise SSDs in a drawer” is not a lifecycle plan. You want consistent models, firmware, and the ability to buy replacements without
playing eBay roulette.

5) Decide on small block offload deliberately

If your workload is mostly big files (media, backups, VM disks with large blocks), metadata-only special vdev is often enough.
If you have lots of small files or random reads on small blocks, use special_small_blocks—but set it at the dataset level,
and measure.
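Before choosing a cutoff, measure the distribution. The sketch below buckets file sizes into power-of-two bins; it builds a throwaway directory so it runs anywhere, but in practice you would point `find` at the dataset you’re considering:

```shell
# Bucket file sizes into power-of-two bins to see what a 16K/32K cutoff would catch.
dir=$(mktemp -d)
head -c 4096   /dev/zero > "$dir/a"   # 4K file
head -c 20000  /dev/zero > "$dir/b"   # ~20K file
head -c 200000 /dev/zero > "$dir/c"   # ~200K file

hist=$(find "$dir" -type f -printf '%s\n' | awk '
  { b = 4096; while (b < $1) b *= 2; count[b]++ }
  END { for (b in count) printf "<=%dK: %d\n", b / 1024, count[b] }' | sort -n -t'=' -k2)
printf '%s\n' "$hist"
rm -rf "$dir"
```

Whatever bin holds the bulk of your hot small files is your candidate cutoff; everything at or below it becomes eligible for the special class.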

6) Compression changes the effective “small block” cutoff

If you use compression (you should, usually), a logical 128K record might become a physical 32K block. The special_small_blocks cutoff is compared
against the physical (on-disk) size, not the logical record size. This can increase how much lands on special vdevs when small block offload is in play.
That’s great until your special vdev runs out of space and you discover the cliff edge.
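The interaction is easy to quantify. A sketch, assuming a 4x compression ratio on compressible data (the ratio is an assumption; check `zfs get compressratio` on your own datasets):

```shell
# Effective on-disk size of a 128K logical record under an assumed compression ratio.
awk 'BEGIN {
  recordsize_k = 128
  ratio        = 4.0    # ASSUMED compression ratio; see `zfs get compressratio`
  cutoff_k     = 32     # special_small_blocks setting

  phys = recordsize_k / ratio
  verdict = (phys <= cutoff_k) ? "eligible for special" : "stays on main vdevs"
  printf "physical ~%gK vs cutoff %dK: %s\n", phys, cutoff_k, verdict
}'
```

So a dataset you thought of as "large records, HDD-bound" can quietly become special vdev traffic once compression does its job.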

Paraphrased idea from Werner Vogels (Amazon CTO): everything fails; design and operate assuming it will. Special vdevs are exactly that: a design choice that demands operational maturity.

Implementation: creating and tuning special vdevs

Pick the topology

The standard move: add a mirrored special vdev to an existing pool, using two SAS SSDs. If you’re building a new pool, you can
include it at creation time, but adding later is common and safe if you do it right.
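If you are building fresh, the special class goes into the same `zpool create` invocation. A sketch (all device IDs below are placeholders; substitute your /dev/disk/by-id names):

```shell
# Sketch: new pool with a RAIDZ2 data vdev and a mirrored special class.
# Device IDs are placeholders, not real drives.
zpool create tank \
  raidz2 scsi-HDD0 scsi-HDD1 scsi-HDD2 scsi-HDD3 scsi-HDD4 scsi-HDD5 \
  special mirror scsi-SSD0 scsi-SSD1
```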

Decide your special_small_blocks policy

Setting this on the entire pool is a blunt instrument. Prefer dataset-level settings. Put small-file heavy datasets on it, keep large sequential ones off it.
This is how you avoid turning your metadata SSDs into “accidentally primary storage.”

Joke #1: If you set special_small_blocks=128K everywhere, congratulations—you built an SSD pool with HDD latency and SSD bills.

Plan monitoring before you flip the switch

Watch: special vdev capacity, read/write latency, error counts, and SMART endurance. Also watch pool-wide symptoms: zpool iostat latency spikes,
and application-level 99th percentile latency. You want to know you’re approaching a cliff weeks before you fall off it.

Hands-on tasks: commands, output, meaning, and decisions

The commands below assume a Linux host with OpenZFS and a pool named tank. Adjust names to your environment.
Each task includes: what you run, what the output suggests, and what decision you make next.

Task 1: Confirm your pool layout and whether a special vdev already exists

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            scsi-35000c500a1b2c3d4  ONLINE       0     0     0
            scsi-35000c500a1b2c3e5  ONLINE       0     0     0
            scsi-35000c500a1b2c3f6  ONLINE       0     0     0
            scsi-35000c500a1b2c4a7  ONLINE       0     0     0
            scsi-35000c500a1b2c4b8  ONLINE       0     0     0
            scsi-35000c500a1b2c4c9  ONLINE       0     0     0

errors: No known data errors

Meaning: No special class listed. This pool is all main vdevs (RAIDZ2 on SAS HDDs).

Decision: If your workload is metadata-heavy and latency-sensitive, you have a strong candidate for a mirrored special vdev.

Task 2: Check dataset properties that affect metadata/small block placement

cr0x@server:~$ sudo zfs get -o name,property,value,source -s local,default recordsize,compression,atime,special_small_blocks tank
NAME  PROPERTY              VALUE     SOURCE
tank  recordsize            128K      default
tank  compression           zstd      local
tank  atime                 off       local
tank  special_small_blocks  0         default

Meaning: No small blocks are being offloaded to special vdevs (even if you add them) because special_small_blocks=0.

Decision: Plan to set special_small_blocks per dataset later if you want small file acceleration.

Task 3: Identify candidate SAS SSDs by stable IDs

cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'sas|scsi-3' | head
lrwxrwxrwx 1 root root  9 Dec 26 10:02 scsi-35000c500d0e1a2b3 -> ../../sdc
lrwxrwxrwx 1 root root  9 Dec 26 10:02 scsi-35000c500d0e1a2b4 -> ../../sdd
lrwxrwxrwx 1 root root  9 Dec 26 10:02 scsi-35000c500a1b2c3d4 -> ../../sde
lrwxrwxrwx 1 root root  9 Dec 26 10:02 scsi-35000c500a1b2c3e5 -> ../../sdf

Meaning: You have stable identifiers. Use these, not /dev/sdX, when adding vdevs.

Decision: Pick two SSD IDs for a mirror; verify they’re not already in use and are the right capacity/model.

Task 4: Confirm the SSDs are actually SSDs and check key SMART health signals

cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep -i 'Vendor|Product|User Capacity|Rotation|endurance|grown defect'
Vendor:               SEAGATE
Product:              XS800LE10003
User Capacity:        800,166,076,416 bytes [800 GB]
Rotation Rate:        Solid State Device
Percentage used endurance indicator: 6%
Elements in grown defect list: 0

Meaning: SAS drives report wear as “Percentage used endurance indicator” and media health via the grown defect list. Low wear (6%), zero grown defects: suitable for a special vdev role.

Decision: Proceed if both SSDs show clean health. If wear is high, don’t “use what’s left”—buy proper drives.

Task 5: Verify ashift and general pool properties before extending

cr0x@server:~$ sudo zdb -C tank | egrep 'ashift|name:'
    name: 'tank'
        ashift: 12

Meaning: ashift=12 (4K sectors) is the sane default for modern disks/SSDs.

Decision: No action needed; proceed. If you see ashift=9 in 2025, you have bigger problems.

Task 6: Add a mirrored special vdev (the actual pro move)

cr0x@server:~$ sudo zpool add tank special mirror scsi-35000c500d0e1a2b3 scsi-35000c500d0e1a2b4

Meaning: You’ve added a special allocation class mirror to the pool.

Decision: Immediately confirm layout and start watching special vdev usage. Also document the change like an adult.

Task 7: Confirm the special vdev shows up and is ONLINE

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                          STATE     READ WRITE CKSUM
        tank                          ONLINE       0     0     0
          raidz2-0                    ONLINE       0     0     0
            scsi-35000c500a1b2c3d4    ONLINE       0     0     0
            scsi-35000c500a1b2c3e5    ONLINE       0     0     0
            scsi-35000c500a1b2c3f6    ONLINE       0     0     0
            scsi-35000c500a1b2c4a7    ONLINE       0     0     0
            scsi-35000c500a1b2c4b8    ONLINE       0     0     0
            scsi-35000c500a1b2c4c9    ONLINE       0     0     0
        special
          mirror-1                    ONLINE       0     0     0
            scsi-35000c500d0e1a2b3    ONLINE       0     0     0
            scsi-35000c500d0e1a2b4    ONLINE       0     0     0

errors: No known data errors

Meaning: The pool now has a special class with a mirror. This is what “correct” looks like.

Decision: Proceed to dataset tuning and validation. If it’s not ONLINE, stop and fix hardware/pathing first.

Task 8: Measure I/O and latency by vdev class under real workload

cr0x@server:~$ sudo zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        12.3T  45.8T    240    310  12.5M  18.2M
  raidz2-0  12.3T  45.8T    120    180  10.1M  16.9M
  special    4.2G   743G    120    130   2.4M   1.3M
    mirror-1 4.2G   743G    120    130   2.4M   1.3M

Meaning: Special vdev is actively serving ops. Even a few MB/s can represent massive metadata acceleration.

Decision: If ops are still hammering HDD vdevs during metadata-heavy tasks, consider enabling small block offload on targeted datasets.

Task 9: Check special vdev space usage and keep it away from the cliff

cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  12.3T  45.8T   128K  /tank

cr0x@server:~$ sudo zpool list -v tank
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank       58.1T  12.3T  45.8T        -         -    12%    21%  1.00x  ONLINE  -
  raidz2-0  58.1T  12.3T  45.8T        -         -    12%    21%
  special    745G  4.20G   741G        -         -     1%     0%

Meaning: Special vdev has plenty of headroom now. The key number is special class CAP; don’t let it creep near full.

Decision: Establish alerting thresholds (for example: warn at 60%, page at 75%). Also project growth based on file count and snapshots.
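Those thresholds are easy to script. A sketch that pulls the special class CAP out of `zpool list -v`-style text (an embedded sample keeps it self-contained; `check_cap` and the column position are my assumptions about the usual output layout, so validate against your own zpool version before trusting it):

```shell
#!/bin/sh
# Sketch: alert on special class utilization parsed from `zpool list -v` text.
# The sample stands in for live output of: zpool list -v tank
sample='tank       58.1T  12.3T  45.8T  -  -  12%  21%  1.00x  ONLINE  -
  raidz2-0 58.1T  12.3T  45.8T  -  -  12%  21%
  special   745G   506G   239G  -  -   1%  68%'

check_cap() {
  printf '%s\n' "$1" | awk -v warn="$2" -v page="$3" '
    $1 == "special" {
      cap = $8; sub(/%/, "", cap)          # CAP column, percent sign stripped
      if      (cap + 0 >= page) print "PAGE: special at " cap "%"
      else if (cap + 0 >= warn) print "WARN: special at " cap "%"
      else                      print "OK: special at " cap "%"
    }'
}

check_cap "$sample" 60 75
```

Run it from cron and route WARN/PAGE lines into your alerting; the goal is to see the trend weeks before the cliff.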

Task 10: Enable small block offload for a specific dataset (surgical, not global)

cr0x@server:~$ sudo zfs create tank/containers
cr0x@server:~$ sudo zfs set compression=zstd tank/containers
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/containers
cr0x@server:~$ sudo zfs get -o name,property,value special_small_blocks tank/containers
NAME            PROPERTY             VALUE
tank/containers special_small_blocks 32K

Meaning: Blocks whose physical (post-compression) size is ≤32K can be allocated on the special vdev for this dataset.

Decision: Start conservative (16K–32K) unless you’ve modeled capacity/endurance. Then measure. Expand scope only if it’s paying off.

Task 11: Validate that metadata is actually being served fast (look at latency)

cr0x@server:~$ sudo zpool iostat -vl tank 1 2
                              capacity     operations     bandwidth    total_wait     disk_wait
pool                        alloc   free   read  write   read  write   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank                        12.3T  45.8T    260    340  13.1M  19.4M   8ms   11ms    3ms    5ms
  raidz2-0                  12.3T  45.8T    120    210  10.2M  18.0M  14ms   22ms    9ms   14ms
  special                    4.5G   741G    140    130   2.9M   1.4M   1ms    2ms    1ms    1ms
    mirror-1                 4.5G   741G    140    130   2.9M   1.4M   1ms    2ms    1ms    1ms

Meaning: Special vdev disk_wait is ~1ms while HDD vdev is in the teens. That’s the whole point.

Decision: If special wait is high, you may be saturating the SSDs, the HBA queueing, or hitting a firmware/pathing issue. Investigate before tuning ZFS knobs blindly.

Task 12: Observe ARC behavior (because metadata loves memory too)

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:20:01   520    60     11    20   33    18   30    22   37   42G   64G
10:20:02   610    80     13    25   31    25   31    30   38   42G   64G
10:20:03   590    70     11    20   29    22   31    28   40   42G   64G

Meaning: ARC is doing its job. A modest miss rate is normal; the special vdev reduces penalty of metadata misses.

Decision: If ARC miss% is high during steady state, consider memory sizing or working set issues. Don’t expect SSDs to fix “no RAM.”

Task 13: Watch special vdev errors like it’s a payroll system

cr0x@server:~$ sudo zpool status -xv
pool 'tank' is healthy

Meaning: No known issues right now.

Decision: If you see read/write/checksum errors on special, escalate faster than you would for a single HDD in RAIDZ. Metadata errors are not “wait until next sprint.”

Task 14: Replace a failed special vdev device correctly (simulate the workflow)

cr0x@server:~$ sudo zpool offline tank scsi-35000c500d0e1a2b3
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME                          STATE     READ WRITE CKSUM
        tank                          DEGRADED     0     0     0
          raidz2-0                    ONLINE       0     0     0
            scsi-35000c500a1b2c3d4    ONLINE       0     0     0
            scsi-35000c500a1b2c3e5    ONLINE       0     0     0
            scsi-35000c500a1b2c3f6    ONLINE       0     0     0
            scsi-35000c500a1b2c4a7    ONLINE       0     0     0
            scsi-35000c500a1b2c4b8    ONLINE       0     0     0
            scsi-35000c500a1b2c4c9    ONLINE       0     0     0
        special
          mirror-1                    DEGRADED     0     0     0
            scsi-35000c500d0e1a2b3    OFFLINE      0     0     0
            scsi-35000c500d0e1a2b4    ONLINE       0     0     0

errors: No known data errors

Meaning: Pool is degraded because the special mirror lost a leg. You still have the pool, but you’re running without a safety net.

Decision: Replace immediately. Don’t wait for maintenance day. The next failure could be catastrophic.

cr0x@server:~$ sudo zpool replace tank scsi-35000c500d0e1a2b3 scsi-35000c500deadbeef
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Fri Dec 26 10:22:14 2025
        2.10G scanned at 420M/s, 1.05G issued at 210M/s, 4.20G total
        1.05G resilvered, 25.0% done, 0:00:15 to go

Meaning: Resilver is running. Note: special vdev resilvers are usually quick because the footprint is smaller than bulk data.

Decision: Keep load reasonable, watch for errors, and confirm it returns to ONLINE. If resilver stalls, suspect pathing/drive issues.

Task 15: Confirm TRIM behavior (helps SSD steady-state)

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

Meaning: Autotrim is enabled. Good for SSD longevity and performance in many environments.

Decision: If it’s off, consider enabling. If you’re on older kernels/SSDs where trim causes latency spikes, test carefully before enabling globally.

Task 16: Measure “user pain” directly: directory traversal timing before/after

cr0x@server:~$ time find /tank/containers -maxdepth 4 -type f -printf '.' > /dev/null

real    0m7.912s
user    0m0.332s
sys     0m1.821s

Meaning: This is a blunt but effective “does it feel faster?” test for metadata-heavy trees.

Decision: If it’s still slow, check whether the dataset is using special_small_blocks, whether metadata is cached in ARC, and whether the workload is actually limited by something else (CPU, single-threaded app, network).

Fast diagnosis playbook

When someone says “storage is slow,” you get about five minutes before the conversation becomes unproductive. Use this sequence to find the bottleneck quickly.
It’s optimized for ZFS pools with (or without) special vdevs.

First: is it an obvious health problem?

  • Run zpool status -xv. If not healthy, stop performance work and fix reliability first.
  • Check if the special vdev is degraded. If yes, treat as urgent: you’re one failure away from a bad day.

Second: is latency coming from rust, SSD, or somewhere else?

  • Run zpool iostat -vl pool 1 5 and compare disk_wait between main vdevs and special.
  • If HDD wait is high and special is low: your workload is still hitting HDDs. Maybe it’s large blocks, or special is too small/filled, or small blocks aren’t enabled.
  • If special wait is high: you may have saturated the SSD mirror, or you have HBA queueing/firmware issues, or the SSD is near full and suffering write amplification.

Third: is it a metadata-dominated workload?

  • Clues: slow directory listings, slow snapshot send/receive enumeration, high IOPS with low MB/s, user complaints around “opening folders.”
  • Run arcstat. If ARC misses are high and you see lots of small random reads, special vdevs help—if you have them and they’re sized right.

Fourth: are you accidentally writing too much to special?

  • Check dataset special_small_blocks settings. A too-high threshold can shove a lot of data onto the special vdev.
  • Check special vdev capacity. Approaching full means chaos: performance cliffs and unpredictable placement.

Fifth: validate the “boring stuff”

  • HBA errors, link resets, enclosure issues, multipath flaps.
  • CPU saturation (compression can be CPU-heavy; zstd is usually fine, but don’t guess).
  • Network bottlenecks (if clients are remote, storage may be innocent).

Joke #2: If you skip step one and tune performance on a degraded special vdev, the universe will schedule your outage for Friday afternoon.

Three corporate mini-stories (anonymized, plausible, and avoidable)

1) The incident caused by a wrong assumption

A mid-sized SaaS company had a ZFS pool backing CI artifacts and container images. Builds were slow, so they did the sensible thing: added two SSDs “for cache.”
Someone had read about special vdevs and added a single SSD as special, planning to “add redundancy later.”

It ran fine for weeks. Then the SSD started throwing medium errors. The pool went from “ONLINE” to “unavailable” fast—because a special vdev isn’t a cache.
It contained real metadata. Recovery was not a quick “remove the cache device and reboot.” It was restore-from-backup time, plus a long postmortem.

The worst part wasn’t the outage. It was the confusion. Half the team assumed it worked like L2ARC; the other half assumed ZFS would “keep a copy somewhere.”
ZFS did exactly what it promised: it used the special vdev for metadata placement. They had built a single point of failure into the pool.

The fix was boring and correct: mirrored special vdevs only, plus a policy that no one adds allocation-class devices without a change request
that explicitly states failure behavior. The rebuild cost more than two SSDs ever would.

2) The optimization that backfired

A large internal analytics platform had a pool with RAIDZ2 HDDs and a mirrored SAS SSD special vdev. Performance was good.
Then a well-meaning engineer noticed some datasets had tons of small files and decided to “supercharge” by setting special_small_blocks=128K
on a top-level dataset.

The result was immediate: random read latency improved, but within weeks the special vdev capacity climbed faster than expected.
Compression made more blocks “small enough,” and the effective placement shifted.
Snapshot churn amplified metadata writes, and the SSD mirror started living a harder life than planned.

Eventually the special vdev approached high utilization. Performance got weird: not always slow, but spiky.
Users reported intermittent slowness: some queries were instant, others stalled.
Storage latency graphs looked like a seismograph during a minor earthquake.

They rolled back by narrowing the setting to specific datasets and lowering the threshold to 16K–32K where it actually mattered.
They also added a second mirrored special vdev to distribute load. The lesson wasn’t “don’t optimize.”
The lesson was “optimize with a capacity model, not vibes.”

3) The boring but correct practice that saved the day

A finance-adjacent environment (read: audits and serious change control) ran ZFS for document storage and VM backups.
They added SAS SSD special vdevs early, mirrored across two HBAs.
Not exciting. But they wrote down the topology, labeled the bays, and set monitoring for special vdev utilization and SMART wear.

One morning, a batch job that did large-scale permission changes across millions of files kicked off. Metadata writes spiked.
The special vdev absorbed the churn with low latency. The HDDs stayed mostly busy with bulk reads and writes.
Users noticed nothing; the job completed; everyone went back to pretending storage is simple.

A week later, one SSD started reporting increasing corrected errors. Monitoring fired early because they were tracking the right counters.
They replaced the drive during business hours, resilvered quickly, and wrote a two-paragraph change note that made the auditors happy.

No heroics. No emergency calls. Just a system designed around failure and operated like it would actually fail.
That’s what “senior” looks like in storage.

Common mistakes: symptoms → root cause → fix

1) “Pool died when an SSD died”

Symptom: Pool won’t import after losing a special vdev device.

Root cause: Special vdev was not redundant (single disk), or both mirror legs were lost due to shared failure domain (HBA/enclosure).

Fix: Always mirror special vdevs. Split mirror legs across failure domains. Treat special vdev hardware like tier-0 infrastructure.

2) “Directory listings still slow after adding special”

Symptom: find/ls on large trees still crawls; HDD vdev shows high random reads.

Root cause: Workload is dominated by small data blocks, not just metadata; or the hot metadata is already in ARC and bottleneck is elsewhere.

Fix: Enable special_small_blocks on the relevant dataset (start at 16K or 32K), then re-test. Also confirm you’re not CPU/network bound.

3) “Special vdev filling up faster than planned”

Symptom: Special class utilization grows quickly; alerts fire; performance becomes spiky at higher utilization.

Root cause: Threshold too high; compression shrinks blocks under the cutoff; workload has heavy churn/snapshots; sizing model ignored file count growth.

Fix: Reduce special_small_blocks on broad datasets; limit it to targeted datasets. Add additional mirrored special vdev capacity if needed.

4) “Special vdev latency is high even though it’s SSD”

Symptom: zpool iostat -vl shows high disk_wait on special.

Root cause: SSD saturated (IOPS), near-full behavior, firmware quirks, HBA queueing, expander issues, or mixed SAS/SATA path instability.

Fix: Check device health and errors, ensure proper HBA firmware/driver, verify queue depths, ensure special has headroom, and consider adding another mirrored special vdev to spread load.

5) “After enabling small blocks, SSD wear climbs unexpectedly”

Symptom: SMART wear indicators increase faster than expected; write amplification visible in vendor tools.

Root cause: Small-block offload moved lots of write traffic to SSDs; workload includes frequent deletes/rewrites; insufficient overprovisioning/endurance.

Fix: Lower the cutoff, scope to fewer datasets, upgrade to higher-endurance SAS SSDs, and monitor wear with alerts tied to replacement planning.

6) “We added special vdevs and now scrub takes longer”

Symptom: Scrubs/resilvers behave differently; some parts finish quickly, others drag.

Root cause: Additional vdev class changes I/O patterns; special vdev resilvers quickly but main vdev scrub remains HDD-bound; contention from production load.

Fix: Schedule scrubs appropriately, cap scrub impact if needed, and don’t misattribute “HDD scrub time” to the special vdev feature.

Checklists / step-by-step plan

Planning checklist (before you touch the pool)

  1. Workload classification: Is the pain metadata-heavy (namespace ops) or big streaming? Get evidence with zpool iostat -vl and user-visible tests.
  2. Decide scope: Metadata-only special, or metadata + small blocks via special_small_blocks on selected datasets.
  3. Capacity model: Estimate metadata growth from file counts and snapshot behavior. Leave headroom; special vdev near-full is a performance and placement trap.
  4. Redundancy: Mirror special vdevs. Confirm failure domains are independent enough for your environment.
  5. Hardware vetting: SAS SSD models, firmware consistency, SMART health baseline, endurance class.
  6. Monitoring: Alerts on special class utilization, device errors, SMART wear, and pool health.
  7. Change management: Document what special vdevs do and the “loss means pool loss” property in plain language.

Deployment steps (safe and boring)

  1. Confirm pool health: zpool status -xv must be healthy.
  2. Identify SSDs by /dev/disk/by-id; verify SMART and capacity.
  3. Add mirrored special vdev: zpool add pool special mirror ...
  4. Confirm layout and ONLINE state with zpool status.
  5. Baseline performance: zpool iostat -vl under representative load.
  6. Enable special_small_blocks only on datasets that benefit; start small (16K–32K).
  7. Set alert thresholds and verify they page the right humans, not the entire company.

Operations checklist (ongoing)

  1. Weekly: check special vdev utilization and trend it.
  2. Weekly: check SMART wear for the special vdev SSDs.
  3. Monthly: review zpool status error counters; investigate any non-zero growth.
  4. Quarterly: validate restore procedures; special vdevs are not where you want to learn about your backup gaps.
  5. On every incident: capture zpool iostat -vl and arcstat during the event, not after it “mysteriously” clears.
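For that last point, keep a one-liner ready so the evidence gets captured during the event, not reconstructed afterward. A minimal sketch (the `capture` helper and log path are arbitrary names of my choosing):

```shell
#!/bin/sh
# Sketch: append timestamped diagnostic output to an incident log.
capture() {
  log_file=$1; shift
  {
    printf '=== %s | %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
    "$@" 2>&1
  } >> "$log_file"
}

# During an incident, for example:
#   capture /var/tmp/incident.log zpool iostat -vl tank 1 5
#   capture /var/tmp/incident.log arcstat 1 5
```

Each call records what was run, when, and what it printed, so the postmortem argues about facts instead of memories.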

FAQ

1) Do I really need a special vdev if I have lots of RAM?

RAM (ARC) helps a lot, but it's not a substitute. ARC misses still happen, and under heavy churn the metadata working set can exceed what memory holds. Special vdevs reduce the penalty of those misses
and stabilize tail latency when the working set doesn't fit.
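If you want to check how well ARC is already absorbing the load before buying hardware, arcstat (shipped with OpenZFS) shows miss rates. A sketch; field names vary somewhat by version, so list yours first:

```shell
# Watch overall and metadata miss rates every 5 seconds.
# Run `arcstat -v` to list the fields your version supports.
arcstat -f time,read,miss,miss%,mmis,mm%,arcsz,c 5
```

A persistently high metadata miss percentage under normal load is exactly the signal that a special vdev will help.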

2) Is a special vdev the same as L2ARC?

No. L2ARC is a read cache and can be removed (with consequences for performance, not data integrity). A special vdev is part of the pool's allocation and holds critical metadata.
Lose it and you can lose the pool.

3) Is a special vdev the same as SLOG?

No. SLOG accelerates synchronous writes (think NFS sync, databases with fsync patterns). Special vdev accelerates metadata and optionally small blocks. You can have both, and many systems do.

4) Why SAS SSD specifically? Why not NVMe?

NVMe can be fantastic, especially for extreme latency goals. SAS SSDs win on integration: hot-swap bays, dual-porting, consistent server management, and predictable procurement.
If you have a clean NVMe backplane story and solid ops tooling, NVMe is a strong choice too. Don’t pick based on marketing; pick based on replacement logistics and failure domains.

5) What should I set special_small_blocks to?

Start at 16K or 32K on datasets with lots of small files or random reads. Measure. If you set it too high, you’ll consume special capacity and endurance quickly.
Avoid blanket “128K everywhere” unless you intentionally want most data on SSD and you sized for it.
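Setting it is a per-dataset operation, which is exactly why "start small and measure" is practical. A sketch with a placeholder dataset name; blocks at or below the value land on the special class, so keep it below recordsize unless you really mean "everything":

```shell
# Conservative starting point on one small-file-heavy dataset ("tank/maildir"
# is a placeholder). Affects newly written blocks only, not existing data.
zfs set special_small_blocks=32K tank/maildir
zfs get special_small_blocks,recordsize,compression tank/maildir
</imports>
```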

6) Can I remove a special vdev later?

In practice, treat special vdevs as permanent. OpenZFS does support top-level device removal (zpool remove) for mirrored special vdevs in some configurations, notably pools without raidz top-level vdevs, but relying on removal is risky planning.
Plan as if you cannot remove it and must replace/expand instead.

7) How big should the special vdev be?

Big enough that you won’t hit high utilization under growth and snapshots. Metadata-only can be modest for some pools, but “modest” is workload-specific.
If you also offload small blocks, size significantly larger. In corporate life, buying too small is the expensive option because it forces disruptive rework.

8) Does enabling compression help or hurt special vdev usage?

Both. Compression usually helps performance and saves space, but it can cause more blocks to fall under the “small” cutoff if you use special_small_blocks.
That increases special vdev usage. Account for it in sizing and monitor utilization trends.
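To see how much data would actually fall under a given cutoff (after compression, which is what matters for placement), the block statistics from zdb are the usual tool. A sketch; exact flags and output format vary by OpenZFS version, and this reads the whole pool, so expect it to take a while on big HDD pools:

```shell
# Print block statistics including a block-size histogram; -L skips leak
# checking so it finishes faster. Sum the psize rows at or below your
# candidate special_small_blocks cutoff to estimate special usage.
zdb -Lbbbs tank
```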

9) What if my special vdev is mirrored but both SSDs are in one enclosure?

That’s better than a single disk, but it’s still exposed to enclosure/backplane/expander failures. If the environment is important, split mirror legs across enclosures or HBAs.
If you can’t, at least keep spares and practice replacement.

10) Will special vdevs speed up my database?

Sometimes. If the database workload is metadata-heavy at the filesystem layer (lots of small files, many tables, frequent fsync metadata updates),
you may see benefits. But databases are often dominated by their own I/O patterns and caching. Test with representative load, and don’t confuse “faster schema operations” with “faster queries.”

Conclusion: next steps you can execute this week

Special vdevs are one of the few ZFS features that can make a creaky HDD pool feel modern—without rewriting your storage strategy.
The trick is to treat them as first-class pool members, not as “some SSD cache thing.”

  1. Measure first: capture zpool iostat -vl and a real metadata-heavy test (find, backup scan, snapshot enumeration) during pain.
  2. Add a mirrored SAS SSD special vdev using stable /dev/disk/by-id paths, then confirm with zpool status.
  3. Start with metadata-only, then selectively enable special_small_blocks on the datasets that are actually small-file heavy.
  4. Set alerting on special class utilization and SSD health. Don’t let “almost full” be a surprise.
  5. Document the failure behavior clearly: special vdev loss can mean pool loss. This prevents future “helpful” changes.

If you want the pro version of this move: mirror the special vdev across independent failure domains, size it with growth in mind,
and keep it boring. Storage doesn’t reward bravado. It rewards paperwork and paranoia.
