ZFS Special Small Blocks: How to Turbocharge Small-File Workloads

If your storage system feels fast when streaming big files but starts wheezing the moment someone runs
find across a tree of tiny files, you’re meeting the oldest enemy in enterprise storage: metadata and
small-block I/O. ZFS is excellent at protecting data, but it can’t change physics—random reads from slow media are
still random reads from slow media. The good news is ZFS gives you an escape hatch: special vdevs and
special small blocks.

Done right, this is one of those rare optimizations that feels like cheating: directory traversals speed up, git
checkouts stop dragging, container image extraction gets punchier, and those “why is ls slow?” tickets
disappear. Done wrong, it’s also a spectacular way to learn how important metadata is to your pool’s survival.
This guide is about doing it right—in production, with the kinds of constraints that show up on call.

What “special small blocks” really means

In ZFS, “special small blocks” is shorthand for a policy: store blocks under a chosen size threshold on a
special vdev—typically SSD or NVMe—while leaving large file data on the main vdevs (often HDD).
It’s not a cache. It’s not best-effort. Those blocks live there.

The special vdev already has a job even with no threshold: it can hold metadata. Metadata includes things like
indirect blocks, dnodes, directory entries, and other structures that make filesystem operations possible.
When you set special_small_blocks on a dataset, you expand that job: small file data
blocks also land on the special vdev.

The magic is not that SSD is “faster.” The magic is that small-file workloads are dominated by random I/O, and
random I/O on HDD is basically a mechanical sympathy test you will lose. If you can push those reads to low-latency
media, the whole workload changes character: less seeking, fewer IOPS stalls, and a shorter queue at the wrong
place.

The one-liner version: special vdevs let ZFS separate “things that must be quick to find” from “things that are
big to read.”

Joke #1: Think of special small blocks as giving your metadata a sports car. Just remember: if you total the sports
car, you don’t just miss a meeting—you forget where the office is.

Interesting facts and context (why this exists)

Here are a few concrete bits of history and context that matter when you’re making design calls:

  1. ZFS started with an assumption of expensive RAM and slower disks. ARC was designed to be clever
    about caching because “just add RAM” was not always the answer in early deployments.
  2. The “metadata special allocation class” came later. Early ZFS didn’t have a clean way to pin
    metadata to a different vdev type; special vdevs added an explicit class for it.
  3. Small file performance has always been a filesystem reputation-maker. Users will tolerate slow
    backups, and even slow streaming, more readily than slow interactive work; slow ls and slow builds cause immediate anger.
  4. 4K sector sizes changed the ground rules. “Tiny files” still translate into multiple
    metadata blocks, and 4K-aligned reads mean you pay a minimum I/O size even for small content.
  5. Copy-on-write makes random write patterns more common. ZFS must write new blocks and update
    pointers; it avoids in-place updates, which is great for consistency and snapshots but can increase metadata churn.
  6. NVMe didn’t just add speed; it changed queueing behavior. The latency gap between NVMe and HDD
    means moving a small fraction of I/O (metadata + small blocks) can change end-to-end response times dramatically.
  7. “Special” is not “cache.” L2ARC is a cache and can be rebuilt; special vdev allocation is
    authoritative. This single fact drives most operational risk decisions.
  8. dnodes are a quiet villain. Dnodes are per-object metadata and can become a hotspot for
    million-file trees; putting dnodes on the special vdev can be a huge win.
  9. Space accounting matters more than you think. If the special vdev fills up, new allocations spill
    back to the normal (HDD) vdevs: writes keep working, but hot metadata and small blocks quietly land on slow media
    again, and they don’t move back on their own.

How it works under the hood (enough to make good decisions)

The special vdev allocation class

ZFS pools are composed of vdevs. A “special vdev” is a vdev assigned to a special allocation class. ZFS uses
allocation classes to decide where blocks go. When a pool has a special vdev, ZFS can allocate certain block types
there—particularly metadata. If you enable small block placement, small data blocks go there too.

What counts as “metadata”? It’s more than directories. It includes the tree of indirect blocks that point to file
content blocks, and those indirect blocks are accessed during reads. Put those on SSD and even large-file reads can
benefit—especially for fragmented files or deep block trees.

special_small_blocks: the policy knob

special_small_blocks is a per-dataset property. If set to a size (for example, 16K), then file data
blocks up to that size are allocated to the special vdev.

Three operational truths:

  • It’s not retroactive. Existing blocks stay where they are unless rewritten.
  • It interacts with recordsize. Files smaller than recordsize are stored as a single block sized to the
    data, so tiny files still produce blocks that qualify; set the threshold equal to (or above) recordsize and
    effectively every new data block in that dataset goes to the special vdev.
  • It can increase special vdev consumption fast. Small-file datasets are often “all small blocks.”
    That’s the point, but it’s also the trap.
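
A quick way to see where the policy is already in force (a sketch assuming a pool named tank, the example pool used
later in this guide): list datasets that set the property locally rather than inheriting it.

cr0x@server:~$ sudo zfs get -r -s local special_small_blocks tank

Datasets missing from that output inherit whatever their parent sets, which is usually 0, meaning their data blocks
stay on the main vdevs.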

Dnodes, dnodesize, and “metadata that behaves like data”

ZFS stores per-file metadata in dnodes. When extended attributes and ACLs are heavy (think: corporate file shares
with aggressive auditing and Windows clients), dnodes can bloat. If dnodes don’t fit in the “bonus buffer,” ZFS may
spill metadata into separate blocks, increasing reads. Settings like xattr=sa and
dnodesize=auto can reduce spill and amplify the benefit of the special vdev.
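
One practical pre-check: dnodesize=auto depends on the pool’s large_dnode feature, which is enabled by default on
modern OpenZFS pools but may be missing on older ones. A quick sketch against the tank pool:

cr0x@server:~$ sudo zpool get feature@large_dnode tank

If the value comes back disabled, enable pool features deliberately (zpool upgrade is a one-way operation) before
relying on the dataset property.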

Why this feels like a “turbo”

For many small-file workloads, the critical path is:

  1. Look up directory entries (metadata I/O).
  2. Read file attributes (more metadata I/O).
  3. Read small file data (small random reads).

If those steps happen on HDD, you’re bounded by seeks and queue depth. If they happen on NVMe, you’re bounded by
CPU, ARC hit rates, and the application’s own thread model. It’s a different problem, and generally a better one.

Joke #2: HDDs are great at teaching patience. Unfortunately, your CI pipeline did not sign up for the lesson.

Workloads that win (and ones that don’t)

Best candidates

Special small blocks shine when your working set includes lots of small files and metadata-heavy operations:

  • Source trees and build artifacts (millions of tiny files, repeated traversals, lots of stat calls).
  • Container image layers (untar storms, many small files, parallel extraction).
  • Maildir-style workloads (many small messages as individual files).
  • Web hosting and CMS content (thousands of small PHP files, plugins, thumbnails, metadata churn).
  • Corporate file shares with deep trees (heavy directory browsing, ACL checks, Windows clients).
  • Backup repositories with many small chunks (depends on tool; some create many tiny objects).

Mixed results

VM images and databases can benefit indirectly because metadata and indirect blocks are faster,
but these are usually dominated by medium-to-large random I/O and sync write behavior. If you’re trying to fix
database latency, you probably need to talk about SLOG, recordsize, ashift, and application tuning first.

Poor candidates

If your workload is mostly big sequential reads/writes and you already have healthy prefetch and large recordsize,
special small blocks won’t change your life. You might still want metadata on special for snappier management
operations, but don’t expect miracles for streaming workloads.

Designing a special vdev: redundancy, sizing, and failure domain

The non-negotiable rule: redundancy or regret

The special vdev is part of the pool. If it dies and you don’t have redundancy, your pool is not “degraded” in a
cute way. You can lose the pool because metadata may be missing. Treat special vdevs like top-tier storage, not
disposable accelerators.

In practice, that means mirror at minimum, and in some environments RAIDZ (for special class) if
you can tolerate the write amplification and want capacity efficiency. Mirrors are common because they keep latency
low and rebuild times simpler.

Sizing: the part everyone underestimates

Special vdev sizing is where the “turbo” project succeeds or becomes a quarterly incident generator. You must
estimate:

  • Metadata footprint for existing and future data.
  • Small file data footprint under your special_small_blocks threshold.
  • Snapshot growth (metadata and small blocks can multiply with churn).
  • Slack space for performance and allocator health.

A practical mental model: if you store millions of small files, the “data” might be small, but the
object count makes metadata non-trivial. A tree with 50 million files can be “only” a few terabytes, and
still be a metadata machine.

Another practical model: special vdev is where your filesystem’s brain lives. Don’t build a brain out of the
smallest parts you found in the procurement closet.
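
For a first rough number, a back-of-envelope sketch helps frame the conversation. The per-file metadata figure below
is an assumption, not a constant: dnode size, directory shape, and xattr usage swing it widely, so replace it with
measurements from your own pool.

cr0x@server:~$ files=50000000; per_file_kib=2   # assumed average metadata per file, in KiB
cr0x@server:~$ echo "$(( files * per_file_kib / 1024 / 1024 )) GiB rough metadata estimate"
95 GiB rough metadata estimate

Even at a couple of KiB per object, tens of millions of files translate into a special footprint you have to plan
for, before a single small data block lands there.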

Choosing a threshold

Typical thresholds are 8K, 16K, 32K, sometimes 64K. The right value depends on:

  • Your file size distribution.
  • How quickly you can grow special capacity.
  • Whether the dataset’s “small files” are truly latency-sensitive.

Start conservative. You can raise the threshold later; lowering it doesn’t move existing blocks back.
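
If you don’t know your file size distribution, measure it before arguing about the number. A rough sketch using GNU
find and awk, run during a quiet window against the tree you plan to migrate (the path below is the example dataset
used later); the bucket boundaries are just the usual threshold candidates:

cr0x@server:~$ sudo find /tank/work -type f -printf '%s\n' | awk '
  { n++; total += $1 }
  $1 <= 8192  { c8++;  s8  += $1 }
  $1 <= 16384 { c16++; s16 += $1 }
  $1 <= 32768 { c32++; s32 += $1 }
  $1 <= 65536 { c64++; s64 += $1 }
  END {
    printf "all:   %d files, %.1f GiB\n", n,   total/2^30
    printf "<=8K:  %d files, %.1f GiB\n", c8,  s8/2^30
    printf "<=16K: %d files, %.1f GiB\n", c16, s16/2^30
    printf "<=32K: %d files, %.1f GiB\n", c32, s32/2^30
    printf "<=64K: %d files, %.1f GiB\n", c64, s64/2^30
  }'

The byte totals are logical sizes; compression shrinks many of these blocks further before allocation, so treat them
as an upper bound on what each candidate threshold would push onto special.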

Compression and special

Compression changes the effective block size stored. If you have compression=zstd, many “medium” files
can compress into blocks below your threshold and end up on special. This is sometimes great and sometimes a
surprise.
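
A quick way to gauge how hard compression works on a dataset, and therefore how many "medium" files may shrink under
the threshold, is to compare logical and allocated sizes (shown here against the tank/work dataset used in the tasks
below):

cr0x@server:~$ sudo zfs get -o name,property,value logicalused,used,compressratio tank/work

A high compressratio on a dataset full of 20-60K files is a hint that a 16K or 32K threshold will catch noticeably
more data than the raw file sizes suggest.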

Device selection and endurance

Special vdevs do a lot of small random writes, especially for metadata-heavy workloads and snapshots. Pick devices
with decent endurance and power-loss protection if your environment demands it. Consumer NVMe can work in lab
environments; in production it’s often a slow-motion support ticket.
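
Before devices go into the special class (and periodically afterwards), check wear and error counters. A sketch
assuming nvme-cli is installed; smartctl -a covers SATA SSDs, and output field names vary slightly between versions:

cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | grep -Ei 'percentage_used|media_errors|data_units_written'

Trend percentage_used over time. A special vdev that burns endurance quickly is telling you something about your
metadata churn, and it is cheaper to hear it early.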

Implementation: step-by-step with real commands

The commands below assume OpenZFS on Linux. If you’re on FreeBSD, the concepts and most commands are the same,
but device paths and some tooling differ. Replace pool/dataset names and device identifiers to match your system.

Task 1: Confirm pool topology and current health

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:11:32 with 0 errors on Sun Dec 22 03:15:12 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0
            /dev/disk/by-id/...     ONLINE       0     0     0

errors: No known data errors

Interpretation: confirm there is no existing special class, and the pool is stable. Don’t add structural complexity
to a pool that’s already sick.

Task 2: Identify candidate SSD/NVMe devices by stable IDs

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'nvme|ssd|ata' | head
lrwxrwxrwx 1 root root 13 Dec 24 09:10 nvme-SAMSUNG_MZVLB1T0HBLR-00000_S4EWNX0R123456 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Dec 24 09:10 nvme-SAMSUNG_MZVLB1T0HBLR-00000_S4EWNX0R789012 -> ../../nvme1n1

Interpretation: use /dev/disk/by-id to avoid device renumbering surprises after reboots or HBA swaps.

Task 3: Check ashift and general pool properties

cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|name'
        name: 'tank'
        ashift: 12

Interpretation: ashift=12 implies 4K sectors, which is good. Special vdevs should be aligned appropriately too; you
generally don’t want to mix ashift values in a way that creates needless write amplification.

Task 4: Add a mirrored special vdev (metadata class)

cr0x@server:~$ sudo zpool add tank special mirror \
  /dev/disk/by-id/nvme-SAMSUNG_MZVLB1T0HBLR-00000_S4EWNX0R123456 \
  /dev/disk/by-id/nvme-SAMSUNG_MZVLB1T0HBLR-00000_S4EWNX0R789012
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            /dev/disk/by-id/...                         ONLINE       0     0     0
            ...
          special
            mirror-1                                    ONLINE       0     0     0
              nvme-SAMSUNG_MZVLB1T0HBLR...123456        ONLINE       0     0     0
              nvme-SAMSUNG_MZVLB1T0HBLR...789012        ONLINE       0     0     0

Interpretation: the pool now has a special class. From this point, metadata allocations can land on it (depending
on feature flags and block type). The pool’s fate is now tied to that mirror. Monitor it like you mean it.

Task 5: Verify special vdev allocation is recognized

cr0x@server:~$ sudo zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        54.5T  31.2T  23.3T        -         -    23%    57%  1.00x  ONLINE  -
  raidz2-0  54.5T  31.2T  23.3T        -         -    23%    57%
  special    1.8T   120G   1.7T        -         -     2%     6%

Interpretation: you can see special capacity separately. Keep an eye on this; “special is only 6% used” is a
comforting sentence you want to keep saying.

Task 6: Create a dataset for the small-file workload with sane defaults

cr0x@server:~$ sudo zfs create -o atime=off -o compression=zstd -o xattr=sa tank/work
cr0x@server:~$ sudo zfs get -o name,property,value atime,compression,xattr tank/work
NAME       PROPERTY     VALUE
tank/work  atime        off
tank/work  compression  zstd
tank/work  xattr        sa

Interpretation: turning off atime reduces metadata writes for read-heavy workloads. xattr=sa often
reduces spill to separate xattr objects for Linux extended attributes, which helps small-file storms.

Task 7: Enable special small blocks on that dataset

cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/work
cr0x@server:~$ sudo zfs get special_small_blocks tank/work
NAME       PROPERTY              VALUE   SOURCE
tank/work  special_small_blocks  16K     local

Interpretation: new file data blocks up to 16K will allocate on the special vdev. This won’t rewrite old data.

Task 8: Ensure dnodes can scale with metadata-heavy workloads

cr0x@server:~$ sudo zfs set dnodesize=auto tank/work
cr0x@server:~$ sudo zfs get dnodesize tank/work
NAME       PROPERTY   VALUE  SOURCE
tank/work  dnodesize  auto   local

Interpretation: dnodesize=auto allows larger dnodes when needed, which can reduce extra metadata blocks
and improve directory traversal and attribute access patterns. It can increase metadata footprint—plan capacity.

Task 9: Baseline small-file behavior before and after (practical micro-benchmark)

cr0x@server:~$ mkdir -p /tank/work/testtree
cr0x@server:~$ /usr/bin/time -f "elapsed=%E" bash -c 'for i in $(seq 1 200000); do echo x > /tank/work/testtree/f_$i; done'
elapsed=0:04:31
cr0x@server:~$ /usr/bin/time -f "elapsed=%E" bash -c 'find /tank/work/testtree -type f -exec stat -c %s {} \; > /dev/null'
elapsed=0:00:46

Interpretation: don’t take the absolute numbers seriously across systems; look for relative changes and for where
the time shifts (CPU vs I/O). Also: 200k files is enough to show pain without turning your filesystem into a long-term
science project.

Task 10: Observe real-time I/O distribution (special vs main)

cr0x@server:~$ sudo zpool iostat -v tank 2 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        31.3T  23.2T    210   1800   8.3M   145M
  raidz2-0  31.2T  23.3T     40    220   7.9M    40M
  special    140G  1.6T     170   1580   420K   105M
----------  -----  -----  -----  -----  -----  -----

Interpretation: you want to see lots of metadata/small-block write IOPS on the special vdev instead of hammering
the HDD vdevs. Bandwidth on special may look “small” while ops are high—this is normal.

Task 11: Check dataset properties that commonly interact with small blocks

cr0x@server:~$ sudo zfs get -o name,property,value recordsize,primarycache,logbias,sync tank/work
NAME       PROPERTY      VALUE
tank/work  recordsize    128K
tank/work  primarycache  all
tank/work  logbias       latency
tank/work  sync          standard

Interpretation: for small-file trees, recordsize is usually not your lever (it matters more for large
files and databases). But primarycache can be tuned if you’re starving ARC, and sync can
kill performance if you’re unintentionally doing sync writes.

Task 12: Find whether special is filling up and why

cr0x@server:~$ sudo zfs list -o name,used,available,usedbysnapshots,usedbydataset,usedbychildren -r tank/work
NAME                 USED  AVAIL  USEDSNAP  USEDDS  USEDCHILD
tank/work           1.02T   9.1T      120G    900G         -
cr0x@server:~$ sudo zpool list -v tank | sed -n '1,5p'
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        54.5T  32.2T  22.3T        -         -    24%    59%  1.00x  ONLINE  -
  raidz2-0  54.5T  31.6T  22.9T        -         -    24%    58%
  special    1.8T   600G   1.2T        -         -     9%    33%

Interpretation: if special usage rises disproportionately compared to dataset usage, you may have a high-churn
small-file dataset, compression-induced promotion, or snapshots pinning small blocks. Track used-by-snapshots carefully.

Task 13: Force rewriting old blocks to migrate to special (controlled)

Because special_small_blocks is not retroactive, you may want to rewrite existing data so it reallocates under the
current policy. You can do that by copying into a dataset that already has the policy set (new blocks allocate under
the destination’s properties) or by using send/receive to a new dataset.

cr0x@server:~$ sudo zfs snapshot tank/work@rewrite-start
cr0x@server:~$ sudo rsync -aHAX --delete /tank/work/ /tank/work_rewritten/

Interpretation: this is I/O heavy. Do it with guardrails (rate limiting, off-peak windows) and ensure special has
capacity to absorb the migrated blocks. Two details matter: /tank/work_rewritten should itself be a dataset created
with the same properties (otherwise the copy inherits its parent’s policy), and the safety snapshot pins the old
blocks, so don’t expect space back until it is destroyed.
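
If you prefer send/receive over rsync, a minimal sketch follows; the destination name tank/work_new is a placeholder.
The useful detail is that zfs receive -o lets you set the policy on the destination, so the received blocks allocate
under it from the start:

cr0x@server:~$ sudo zfs snapshot tank/work@migrate
cr0x@server:~$ sudo zfs send tank/work@migrate | \
    sudo zfs receive -o special_small_blocks=16K -o compression=zstd -o xattr=sa tank/work_new

Validate the copy, then swap mountpoints or rename datasets during a quiet window and retire the old dataset once you
no longer need its snapshots.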

Task 14: Verify special class is actually being used for small blocks

cr0x@server:~$ sudo zfs get -r special_small_blocks tank/work
NAME       PROPERTY              VALUE  SOURCE
tank/work  special_small_blocks  16K    local
cr0x@server:~$ sudo zdb -dddd tank/work | head -n 20
Dataset tank/work [ZPL], ID 123, cr_txg 45678, 1.02T, 720000 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       123    1   128K    16K    16K     512   16K    100%  ZFS plain file
       124    1   128K    16K    16K     512   12K     75%  ZFS plain file

Interpretation: zdb output varies by version, but you’re looking for evidence that data blocks are
small and being allocated under the current policy. For deeper verification, correlate with zpool iostat -v
during workload bursts and watch special ops spike.
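
On reasonably recent OpenZFS releases, per-vdev request-size histograms give another angle on the same question
(flag support varies by version):

cr0x@server:~$ sudo zpool iostat -r tank 5 1

Look for the special devices dominating the smallest request-size buckets while the HDDs mostly see larger requests.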

Operating it: monitoring, capacity planning, and upgrades

What you monitor changes when special exists

Before special vdevs, “pool capacity” was the headline metric. After special vdevs, you have two capacities that
can independently kill your day:

  • Main vdev free space (the classic one).
  • Special vdev free space (the new cliff).

Special can fill faster than expected because it holds the blocks that churn the most: metadata, tiny files,
and sometimes compressed blocks that fall under your threshold.

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A large internal engineering org rolled out special vdevs to speed up a monorepo build farm. The initial results were
glorious: checkouts and dependency scans sped up, and the CI queue stopped backing up every morning.

The wrong assumption was subtle: “special is just metadata; it won’t grow much.” They sized the mirrored NVMe
special class based on a rule of thumb from an older, metadata-only deployment. Then they set
special_small_blocks=64K to “be safe,” because lots of their files were under 50KB.

Two months later, the build farm’s special vdev was at uncomfortable utilization. The system started throwing
allocation warnings and intermittent stalls. Developers described it as “the storage is having panic attacks,” which
is unfair but not entirely inaccurate.

The postmortem was not about ZFS being fragile. It was about ownership: nobody had an alert on special class capacity.
They watched the pool’s overall free space and felt calm, while the pool’s brain was quietly running out of room.

The fix was boring and effective: add more special capacity (mirrored), lower the threshold for some datasets, and
introduce a policy that any dataset enabling special_small_blocks must declare a growth model and an
alerting plan. The second fix was the real win; the first was just paying the bill.

Mini-story 2: The optimization that backfired

A corporate file services team tried to “optimize everything” in one change window: special vdev added, threshold
set to 32K, compression switched from lz4 to zstd, and atime changed—plus a massive data migration to
rewrite old blocks onto special.

The backfire wasn’t that zstd is bad or special is risky. It was the combination of knobs and the timing. The
migration rewrote enormous volumes of small data and metadata. Compression reduced some blocks below the 32K threshold,
which meant far more data landed on special than anticipated. Snapshot retention pinned additional blocks. The special
mirror’s write amplification went through the roof.

The symptom in production was “random slowness.” Not total outage—worse. A mix of latency spikes, long directory
listings, and complaints from a handful of users at any moment. Monitoring showed the NVMe devices at high latency
during bursts while the HDD vdevs looked almost bored.

The recovery was operational humility: stop the rewrite migration, roll back the most aggressive threshold on the
busiest datasets, and re-run the migration later in smaller batches with explicit IO limits. They also implemented
a write-latency SLO for special devices. It’s not a fancy metric; it’s the difference between “we know it’s sick”
and “we’ll find out from a VP.”

Mini-story 3: The boring but correct practice that saved the day

A different team ran a ZFS-backed artifact store for internal releases. They used special vdevs only for metadata
at first and enabled special_small_blocks only on a single dataset used for indexes and manifests.
Nothing heroic: a mirror of enterprise SSDs, conservative threshold (8K), and aggressive alerting on special usage.

Six months in, one SSD started throwing media errors. The pool went degraded, but nothing else happened—no drama,
no scramble. They had a runbook: replace device, resilver, verify. They’d practiced it in a staging environment
with the same topology. (Yes, staging. The thing everyone claims to have and nobody funds.)

The real save came from something even less exciting: they had scheduled scrubs and they reviewed scrub reports.
The drive’s error count rose slowly over weeks, and they swapped it before it became a 3 a.m. page. The resilver
finished fast because special was mirrored, small, and healthy.

The lesson wasn’t “buy better SSDs.” It was “treat special vdevs as first-class citizens.” If your pool’s metadata
lives there, your operational maturity has to live there too.

Practical operational metrics

Things I’ve learned to watch in real environments:

  • Special vdev capacity (absolute and trend rate). Alerts should fire early.
  • Special vdev latency (read and write) during known workload bursts.
  • Pool fragmentation and whether allocator behavior shifts after policy changes.
  • Snapshot retention and churn-heavy datasets pinning special blocks.
  • ARC pressure because metadata caching changes your hit/miss patterns.

Upgrades and lifecycle: plan the “how do we add more later?”

You can add additional special vdev mirrors to expand special capacity (like adding another mirror vdev to the
special class). You cannot easily remove special vdevs from a pool in most practical production setups. This is
a one-way door for many operators.
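
Expansion looks just like the original add: another mirror into the special class. A sketch with placeholder device
IDs:

cr0x@server:~$ sudo zpool add tank special mirror \
  /dev/disk/by-id/nvme-NEW_DEVICE_A \
  /dev/disk/by-id/nvme-NEW_DEVICE_B

New special allocations then spread across both mirrors; existing blocks stay where they were written.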

So your design should assume growth. If you think “we’ll never need more than 2TB special,” you are probably
about to learn something about your own file size distribution.

Fast diagnosis playbook

When someone says “small files are slow” or “git checkout is crawling,” you can waste a day on theories. Here’s a
practical three-pass triage that tends to find the bottleneck quickly.

First: confirm the bottleneck is storage and not the client

  1. Check client-side constraints: CPU saturation, single-threaded extraction, antivirus scanning, or network latency.
  2. On the server, watch system load and I/O wait.
cr0x@server:~$ uptime
 09:41:22 up 102 days,  3:12,  5 users,  load average: 12.11, 11.90, 10.80

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  2      0 812344  91232 9023112   0    0   420  1800 3200 7000 12  6 55 27  0

Interpretation: high wa suggests storage latency. It’s not proof, but it’s a strong hint.

Second: identify whether special is helping or hurting

  1. Check special capacity and health.
  2. Watch per-vdev IOPS and latency symptoms during the slow operation.
cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
  pool: tank
 state: ONLINE
config:
        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
          special
            mirror-1  ONLINE       0     0     0
cr0x@server:~$ sudo zpool list -v tank | tail -n 3
  raidz2-0  54.5T  31.6T  22.9T        -         -    24%    58%
  special    1.8T   1.4T   420G        -         -    18%    77%

Interpretation: special at 77% is a red flag. Even if performance is fine today, the allocator and future growth are
not going to be polite.

cr0x@server:~$ sudo zpool iostat -v tank 1 10

Interpretation: if special has huge queues/ops and rising latency while HDDs are idle, special is your bottleneck
(too small, too slow, or being asked to do too much).

Third: validate dataset policy and workload reality

  1. Confirm the dataset in question has the expected properties.
  2. Check whether sync writes, snapshots, or compression changed behavior.
  3. Look for “not retroactive” surprises: old blocks still on HDD.
cr0x@server:~$ sudo zfs get -o name,property,value special_small_blocks,compression,atime,xattr,dnodesize -r tank/work
NAME       PROPERTY              VALUE
tank/work  special_small_blocks  16K
tank/work  compression           zstd
tank/work  atime                 off
tank/work  xattr                 sa
tank/work  dnodesize             auto

Interpretation: if the dataset doesn’t have the property set, you’re not using the feature. If it does, but
performance didn’t change, you may be bound by sync writes or CPU, or the old data simply hasn’t been rewritten.

Common mistakes, symptoms, and fixes

Mistake 1: Adding a non-redundant special vdev

Symptom: Everything works until it doesn’t; when the device fails, the pool may be unrecoverable.

Fix: Always mirror (or RAIDZ) special vdevs. If you already did this wrong, treat it as urgent
technical debt: attach a second device to turn the lone special device into a mirror (zpool attach), and if the
hardware itself is unsuitable, plan a migration to a correctly designed pool. Waiting for the failure is not the kind
of exciting you want.

Mistake 2: Setting special_small_blocks too high “just in case”

Symptom: Special capacity grows faster than expected; performance may initially improve then degrade
as special approaches high utilization.

Fix: Lower the threshold for new writes where appropriate, and add more special capacity. Remember
lowering the property doesn’t move existing blocks back. For some datasets, rebuilding (send/receive or rewrite) is
the clean approach.

Mistake 3: Forgetting that compression can move more data onto special

Symptom: After enabling zstd or increasing compression level, special usage accelerates, especially
for “medium” files that compress well.

Fix: Re-evaluate the threshold and your dataset selection. Consider a smaller threshold (8K/16K)
for compressed datasets that contain many moderately sized but highly compressible files.

Mistake 4: No alerting on special vdev usage

Symptom: Pool has plenty of free space, but small-file latency quietly regresses toward HDD speeds as
new allocations spill off the full special class; the team is confused because “the pool is only 60% full.”

Fix: Alert on special class utilization separately, with an early threshold (for example, warn at
60–70%, critical at 80–85%, tuned to your growth rate and risk tolerance).
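
If your monitoring stack can run a script and act on exit codes, a minimal sketch like the one below is enough to
start with. It assumes the CAP percentage is the last field on the "special" line, as in the zpool list -v output
shown earlier; adjust the parsing to your version’s layout.

cr0x@server:~$ cat /usr/local/bin/check-special-cap
#!/usr/bin/env bash
# Warn when the special allocation class crosses a capacity threshold.
set -euo pipefail

pool="${1:-tank}"
warn="${2:-70}"

# Pull the CAP column from the vdev line labeled "special".
cap=$(zpool list -v "$pool" | awk '$1 == "special" { gsub(/%/, "", $NF); print $NF; exit }')

if [ -z "${cap:-}" ]; then
  echo "UNKNOWN: no special class found in pool ${pool}"
  exit 3
fi

if [ "$cap" -ge "$warn" ]; then
  echo "WARNING: special class in ${pool} at ${cap}% (threshold ${warn}%)"
  exit 1
fi

echo "OK: special class in ${pool} at ${cap}%"

Wire the warning into whatever pages you, and alert on the trend as well as the absolute number.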

Mistake 5: Assuming enabling the property makes existing data faster

Symptom: You set special_small_blocks, users report no improvement, and someone claims
“it doesn’t work.”

Fix: Explain and plan for rewrite: move data to a new dataset via send/receive, or rewrite files
in place (carefully). Validate with zpool iostat -v during workload runs.

Mistake 6: Overloading special with too many datasets at once

Symptom: Special devices show high write latency, while main vdevs look underutilized; intermittent
slowdowns happen during metadata storms.

Fix: Prioritize datasets that truly need it; use conservative thresholds; scale special capacity
before expanding scope. Treat it like a shared critical resource, not a dumping ground for every performance wish.

Mistake 7: Ignoring snapshots and churn

Symptom: Special usage grows even when “data size” looks stable; deleting files doesn’t free much
space because snapshots retain blocks.

Fix: Audit snapshot retention and churn-heavy datasets. Adjust snapshot policy or move high-churn
workloads to separate pools/datasets with tailored retention.

Checklists / step-by-step plan

Checklist A: Decide whether special small blocks is the right tool

  1. Confirm the workload is metadata/small-file dominated (slow find, slow stat, slow untar).
  2. Confirm the bottleneck is storage latency/IOPS, not network or CPU.
  3. Measure file size distribution and object count (roughly is fine, but measure).
  4. Decide which datasets truly need it (don’t blanket-enable).

Checklist B: Design the special vdev safely

  1. Choose redundant topology (mirror minimum).
  2. Choose devices with appropriate endurance and stable power behavior for your environment.
  3. Size for metadata + expected small blocks + snapshot overhead + growth.
  4. Plan expansion: leave bays/slots or PCIe lanes, and budget for adding another mirror vdev later.
  5. Create alerts specifically for special utilization and device error rates.

Checklist C: Rollout plan that won’t wake you up at 3 a.m.

  1. Add special vdev and verify pool health.
  2. Enable on one dataset first with conservative threshold (8K or 16K).
  3. Measure before/after on real operations (checkout, untar, indexing, directory scans).
  4. Rewrite data only if needed, and throttle migrations.
  5. Expand scope dataset-by-dataset, watching special utilization trend.

Checklist D: Ongoing operations

  1. Weekly review of special usage trend and snapshot growth.
  2. Scrubs scheduled and scrub results reviewed.
  3. Device health checks (SMART/NVMe logs) integrated with paging thresholds.
  4. Quarterly validation: test a device replacement and resilver procedure (in a safe environment).

Extra practical tasks (ops-focused)

If you want more hands-on tasks beyond the core fourteen, these are the ones that tend to pay back quickly.

Task 15: Check ARC stats (metadata caching pressure)

cr0x@server:~$ sudo arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:52:01   812   102     12    48   47    40   39    14   14   96G   110G

Interpretation: if the ARC miss rate spikes during directory storms, the special vdev will be hit harder. That’s fine
if it’s sized and fast. If not, you’ll see latency.

Task 16: Quick check of mountpoint and dataset mapping (avoid tuning the wrong dataset)

cr0x@server:~$ mount | grep tank
tank/work on /tank/work type zfs (rw,xattr,noacl)

cr0x@server:~$ sudo zfs list -o name,mountpoint -r tank | grep -E '^tank/work|/tank/work'
tank/work  /tank/work

Interpretation: in real shops, “the path” and “the dataset” drift over time. Tune the dataset that actually serves
the path users hit.

Task 17: Check snapshot count and retention pressure

cr0x@server:~$ sudo zfs list -t snapshot -o name,used,creation -S creation | head
NAME                      USED  CREATION
tank/work@daily-2025-12-24 1.2G  Wed Dec 24 02:00 2025
tank/work@daily-2025-12-23 1.1G  Tue Dec 23 02:00 2025
tank/work@daily-2025-12-22 1.3G  Mon Dec 22 02:00 2025

Interpretation: snapshots are not “free.” On churny datasets, snapshots pin old small blocks and metadata.

FAQ

1) Is a special vdev the same thing as L2ARC?

No. L2ARC is a cache and can be dropped and rebuilt. Special vdev stores real blocks (metadata and possibly small
file data). If special is lost without redundancy, the pool can be lost.

2) Do I need special small blocks if I already have lots of RAM for ARC?

Maybe. ARC helps if your working set fits and the access pattern is cache-friendly. But small-file workloads often
exceed ARC or involve cold scans (think CI runners starting fresh). Special vdev improves the “miss penalty” by making
misses land on SSD/NVMe instead of HDD.

3) What’s a good starting value for special_small_blocks?

8K or 16K is a sane starting point for many environments. If you jump to 64K, you’re signing up to store a lot of
real data on special. That may be correct, but you should make that decision with sizing math, not vibes.

4) Can I enable it on the whole pool at once?

You technically can set it on datasets broadly, but operationally you shouldn’t. Roll it out per dataset, measure,
and watch special utilization trend. Special is a shared resource; one noisy dataset can consume it.

5) Why didn’t performance improve after I set the property?

Common reasons: your data wasn’t rewritten (property isn’t retroactive), your bottleneck is sync writes or network,
the workload is CPU-bound, or special is saturated/slow. Validate with zpool iostat -v during the workload.

6) What happens if the special vdev fills up?

Bad things in a boring way: new metadata and small-block allocations spill back to the main (HDD) vdevs, so the
workload quietly degrades toward its pre-special behavior, and those blocks don’t migrate back when you later add
capacity. Treat “special is nearing full” as a serious condition—add capacity or reduce growth drivers before it
becomes a production incident.

7) Can I remove a special vdev later?

In many real-world setups, removing special vdevs is not straightforward or not supported the way people expect.
Plan as if it’s permanent. If you need reversibility, consider building a new pool and migrating.

8) Should I also move metadata to SSD by using a “metadata-only” pool layout?

Special vdevs are exactly that concept, integrated into ZFS allocation classes. If your whole pool is SSD, special
vdevs may not help much. If your pool is HDD-based, special can be a big win without moving all data to SSD.

9) Does xattr=sa matter for this topic?

Often yes. When xattrs live in the system attribute area, ZFS can avoid separate xattr objects for many cases,
reducing metadata I/O. That makes directory scans and attribute-heavy workloads faster and can reduce pressure on
special capacity and IOPS.

10) Should I use RAIDZ for the special vdev instead of mirrors?

Mirrors are common for special because they keep latency low and rebuild behavior simple. RAIDZ can be viable for
capacity efficiency, but you need to be comfortable with the write amplification and rebuild dynamics. For most
latency-sensitive small-file workloads, mirrored special vdevs are the default safe choice.

Conclusion

ZFS special vdevs and special_small_blocks are one of the most effective ways to change the feel of a
filesystem that’s drowning in small files. They don’t just “make it faster”—they move the hardest part of the
workload (metadata and tiny random I/O) onto media that can actually do it.

But this is not a free lunch, and it’s not a cache you can shrug off. The special vdev becomes critical
infrastructure: it needs redundancy, monitoring, capacity planning, and operational discipline. If you treat it like
a bolt-on speed hack, it will eventually treat you like a person who enjoys reading kernel logs at midnight.

Start conservative, measure honestly, and expand with intention. That’s how you get the turbo without the blown head
gasket.
