When ZFS feels “slow,” people blame disks, then networks, then “that one noisy neighbor dataset.” Meanwhile, the real culprit is often metadata: tiny IOs, pointer chasing, and dnodes that don’t fit nicely in cache anymore. Your pool can push gigabytes per second of big sequential reads, and still collapse when asked to git status a large repo, untar a kernel tree, or scan millions of 4–16 KB objects.
This is the kind of performance problem that ruins on-call weekends because it looks like everything and nothing. CPU seems fine. Bandwidth looks fine. And yet latency spikes, directory operations crawl, and the application swears the “database is down” while your dashboards say the disks are only half busy.
What “metadata bottleneck” really means in ZFS
In ZFS, “metadata” isn’t a vague concept. It’s concrete blocks and on-disk structures that describe everything else: object sets, block pointers, indirect blocks, dnodes, directory entries (ZAP), spacemaps, allocation classes, intent logs, and more. Metadata is also the workload pattern: lots of tiny reads/writes, lots of random IO, lots of dependencies, and frequent synchronous semantics that force ordering.
Metadata bottlenecks happen when your system spends more time figuring out where data is and what it is than moving the data itself. That’s why a pool that streams backups at line rate can still buckle under:
- CI jobs that create/delete millions of files per day
- container layers and image extraction
- maildir-like patterns (many small files, lots of renames)
- VM fleets with snapshots, clones, and churn
- anything that walks directories and stats files at scale
Metadata load is a latency problem first. The enemy is not “low throughput,” it’s queueing and stall time: a chain of dependent reads where each miss in ARC turns into a disk IO, which is too small to amortize seek/flash translation overhead, which then blocks the next lookup.
Here’s the mindset shift: on metadata-heavy workloads, your “IOPS budget” and “cache hit budget” matter more than your “MB/s budget.” You can have plenty of bandwidth and still be dead in the water.
Dnodes 101: the tiny structure that decides your fate
What is a dnode?
A dnode is ZFS’s per-object metadata structure. For a file, the dnode describes its type, its block sizes, its bonus data, and how to find the file’s data blocks (through block pointers). For directories and other ZFS objects, it plays the same “this object exists and here’s how to reach it” role.
That sounds innocent. In practice, dnodes are involved in almost every metadata operation you care about: lookups, stats, permissions, extended attributes, and indirect block traversal.
Why dnodesize matters
dnodesize controls how large a dnode can be. Larger dnodes can store more “bonus” information inline (especially useful for extended attributes). But larger dnodes also increase metadata footprint. More footprint means fewer dnodes in ARC for the same RAM, more misses, more disk IO, and slower metadata operations.
This is not theoretical. It’s one of the most common reasons a ZFS system feels “fine for big files” and “awful for small files.” The app team calls it a storage regression. You call it math.
Bonus data and xattrs: the hidden multiplier
ZFS can store extended attributes (xattrs) in different ways. When xattrs are stored as sa (system attributes), they live in the dnode’s bonus area when possible. That’s great for avoiding extra IO per attribute—until you crank dnodesize up everywhere and discover you just made every object’s metadata heavier, whether or not it needed it.
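If a workload genuinely benefits, scope those settings to the dataset that needs them instead of baking them into a global template. A minimal sketch, assuming a hypothetical dataset name; both properties only affect objects created after they are set:
cr0x@server:~$ sudo zfs create -o xattr=sa -o dnodesize=auto tank/xattr-heavy   # hypothetical dataset name
cr0x@server:~$ zfs get -o name,property,value xattr,dnodesize tank/xattr-heavy
NAME PROPERTY VALUE
tank/xattr-heavy xattr sa
tank/xattr-heavy dnodesize auto
Everything outside that dataset keeps the defaults, which is exactly the point.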
Joke #1: Metadata is like office paperwork—nobody wants more of it, but somehow it keeps generating itself while you sleep.
“But I set dnodesize=auto, so I’m safe.” Not necessarily.
dnodesize=auto allows ZFS to increase dnode size to fit bonus data. That often helps for xattr-heavy workloads. But it also means your dataset can gradually accumulate larger dnodes as files are created. If you later pivot to “millions of tiny files with no xattrs,” you might still be paying the cache footprint tax for earlier decisions, snapshots, and clones. ZFS is honest; it remembers.
Where metadata lives and why it’s expensive
ARC: your metadata shock absorber
The ARC (Adaptive Replacement Cache) is not just a read cache; it’s your metadata accelerator. On metadata-heavy workloads, ARC hit rate is the difference between “fast enough” and “paging through the pool one 4K read at a time.” Metadata access tends to be random and hard to prefetch. So caching is king.
But ARC is a shared resource. Data blocks, metadata blocks, prefetch streams, and MFU/MRU lists all compete. If your ARC is full of streaming reads or large blocks, metadata can get squeezed out. The system then “works harder” by issuing more small reads, which increases latency, which makes the user complain, which makes someone run more diagnostics, which adds more load. Congratulations: you built a feedback loop.
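To see that competition in numbers on Linux OpenZFS, the raw ARC kstats are enough. Field names shift a bit between versions, so treat this as a sketch rather than the one true incantation:
cr0x@server:~$ awk '$1 ~ /^(size|c_max|data_size|metadata_size|dnode_size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ awk '$1 ~ /^demand_metadata_(hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
If metadata_size plus dnode_size is a sliver of the total while demand metadata misses keep climbing, the ARC is spending its RAM on the wrong things for this workload.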
Indirect blocks, block pointers, and pointer chasing
ZFS is copy-on-write and uses a tree of block pointers. A metadata operation often becomes: read dnode → read indirect block(s) → read ZAP blocks → maybe read SA spill blocks → repeat. Each step depends on the previous. If these blocks aren’t cached, the operation becomes a series of synchronous random reads.
This is why NVMe “fixes” metadata pain so often: not because the workload magically became sequential, but because random read latency fell by an order of magnitude. With disks, you can have a lot of spindles and still lose because the dependency chain serializes the effective throughput.
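You can watch that chain as latency instead of throughput. On reasonably recent OpenZFS, zpool iostat has latency views; the flags below are the usual suspects, though exact columns vary by version:
cr0x@server:~$ zpool iostat -l -v tank 1
cr0x@server:~$ zpool iostat -w tank
The -l output adds per-vdev wait columns and -w prints full latency histograms. Tens of milliseconds of read wait on the vdevs holding metadata means every ARC miss in that dependency chain is paid at disk price.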
Special vdev: separating metadata (and small blocks) on purpose
OpenZFS supports a “special” allocation class vdev (often called a special vdev) that can store metadata and, optionally, small blocks. Done right, it’s one of the most powerful tools for turning a metadata-bound system into a tolerable one.
Done wrong, it becomes a single point of “why is the pool suspended” drama, because losing a special vdev can make the pool unimportable if metadata is there. Treat it like top-tier storage: mirrored, monitored, and sized correctly.
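For orientation, adding a special class looks roughly like this. Device names are placeholders, the dry run comes first, and remember that with raidz data vdevs you generally cannot remove a special vdev later:
cr0x@server:~$ sudo zpool add -n tank special mirror nvme0n1 nvme1n1   # dry run: print the would-be layout
cr0x@server:~$ sudo zpool add tank special mirror nvme0n1 nvme1n1
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/app          # optional: steer small data blocks there too
Only set special_small_blocks once you have confirmed the mirror has capacity headroom; it applies to newly written blocks, not existing ones.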
Synchronous writes and the ZIL
Metadata operations often include synchronous semantics: creates, renames, fsync-heavy patterns, databases that insist on ordering. ZFS uses the ZIL (ZFS Intent Log) to satisfy synchronous writes safely. A separate SLOG device can reduce latency for sync writes, but it doesn’t magically make metadata reads fast. It helps when your bottleneck is “sync write latency,” not “random metadata read latency.”
In other words: SLOG is not a bandage for slow directory traversal. It’s a bandage for applications that demand sync writes and are sensitive to latency.
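If you do confirm the workload is fsync-bound, the relevant moves are checking the sync property and giving the ZIL a fast, mirrored, power-loss-protected home. A sketch with placeholder device names:
cr0x@server:~$ zfs get -o name,property,value sync tank/app
cr0x@server:~$ sudo zpool add tank log mirror nvme2n1 nvme3n1
Neither command touches the metadata read path; they only shorten the synchronous write commit.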
Snapshots: metadata time travel isn’t free
Snapshots are cheap until you use them like a version control system for entire filesystems. Each snapshot preserves old block pointers. That increases metadata that must be kept reachable, increases fragmentation, and can raise the cost of deletions (because a delete is not a delete; it’s a “free when no snapshots reference it anymore”). The more snapshots and churn, the more metadata gets involved in everyday housekeeping.
Failure modes: how metadata bottlenecks show up in production
Symptom cluster A: “The storage is slow, but iostat looks fine.”
Classic. You’ll see moderate bandwidth, moderate IOPS, and yet painful user-facing latency. Why? Because the bottleneck is per-operation latency, not aggregate throughput. A directory scan might be doing one or two dependent reads at a time, never building enough queue depth to “look busy,” yet each read takes long enough to ruin the user experience.
Symptom cluster B: “CPU is high in system time; the app is blocked.”
Metadata-heavy workloads stress the DMU, vnode layer, and checksum/compression paths. High system CPU plus high context switching can happen even when disks aren’t saturated—especially if the system is thrashing on cache misses and doing lots of small IO completions.
Symptom cluster C: “Everything was fine until we added snapshots / enabled xattrs / changed dnodesize.”
These are multipliers. Snapshots preserve history, which preserves metadata reachability. Xattrs stored as SA can increase bonus data usage. Larger dnodes increase cache footprint. Each is rational on its own. Together, they can turn “fast enough” into “why does ls take 10 seconds.”
Symptom cluster D: “Scrubs/resilvers make the cluster unusable.”
Scrub and resilver touch metadata and data broadly. If you’re already close to the edge, background IO pushes you into user-visible latency. Metadata misses rise because ARC is displaced by scrub reads. If your special vdev is overloaded or your metadata is on slow disks, the effect is worse.
Symptom cluster E: “Deletion is slow; freeing space takes forever.”
Deletes under snapshot load mean walking metadata, updating spacemaps, and deferring actual frees. If you’re freeing millions of tiny files, the system does a huge number of metadata updates. If your metaslab fragmentation is high, allocations and frees become more expensive.
Interesting facts and historical context (the stuff that explains today’s weirdness)
- ZFS was born in the mid-2000s at Sun Microsystems with end-to-end checksums and copy-on-write as core design points; metadata safety was a first-class goal, not an add-on.
- Copy-on-write means metadata updates are writes: updating a block pointer or directory entry triggers new blocks, not in-place edits, which is safer but can amplify IO on churny metadata workloads.
- The ARC was designed to cache both data and metadata, and in practice metadata caching is often the highest ROI use of RAM on ZFS boxes.
- System attributes (“SA”) were introduced to reduce xattr IO by packing attributes into bonus buffers, making metadata ops faster for attribute-heavy workloads.
- Feature flags let ZFS evolve without on-disk format forks: modern OpenZFS systems negotiate features like extensible_dataset and others, changing metadata capabilities over time.
- Special allocation class vdevs arrived to address a real pain: metadata on HDDs was a performance limiter even when data throughput was fine.
- ZAP (ZFS Attribute Processor) is ZFS’s “dictionary” format used for directories and properties; it can be micro-optimized in caching and becomes a hot spot in directory-heavy workloads.
- Recordsize tuning is often misunderstood: it affects file data blocks, not dnodes, and it won’t fix directory traversal slowness by itself.
- NVMe didn’t just make ZFS faster; it changed failure modes: random read latency improved so much that CPU and lock contention can become visible where disks used to hide them.
Fast diagnosis playbook
When a ZFS system “feels slow,” you need to decide whether you’re data-bound, metadata-bound, or sync-write-bound. Don’t overthink it. Run the triage in order.
1) First: confirm it’s latency, and identify read vs write
- Check pool latency and queueing with zpool iostat -v and, if available, per-vdev latency columns.
- Look for small IO patterns: lots of ops, low bandwidth, high wait.
- Ask: does the pain show up in ls -l, find, untar, git operations? That’s metadata until proven otherwise.
2) Second: check ARC health and metadata pressure
- Is ARC size stable and near target? Is it thrashing?
- Are metadata misses high? Is prefetch polluting ARC?
- Is RAM adequate for the object count and working set?
3) Third: find where metadata is physically landing
- Do you have a special vdev? Is it mirrored? Is it saturated?
- Are metadata and small blocks on HDD while data is on SSD (or vice versa)?
- Is the pool fragmented or at high utilization, driving more metadata overhead?
4) Fourth: isolate a “metadata micro-benchmark” from production symptoms
- Time a directory traversal.
- Time a metadata-heavy create/delete loop in a test dataset.
- Correlate with zpool iostat and ARC stats while it runs.
5) Fifth: validate sync write path if the app is fsync-heavy
- Check the sync setting, SLOG presence, and write latency.
- Confirm the workload actually issues synchronous writes (don’t assume); see the sketch after this list.
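On Linux OpenZFS, one low-effort way to do step 5 is to watch the ZIL commit counters while the application runs; if they barely move, sync writes are not your problem. Kstat names vary slightly by version, so this is a sketch:
cr0x@server:~$ grep -E '^zil_(commit_count|commit_writer_count|itx_metaslab_slog_count)' /proc/spl/kstat/zfs/zil
cr0x@server:~$ sleep 10; grep -E '^zil_commit_count' /proc/spl/kstat/zfs/zil
Compare the two readings: a counter climbing by thousands per second points at sync-write latency; a flat counter points you back at the metadata read path.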
Practical tasks: commands, outputs, what it means, and the decision you make
These are the commands I run when somebody says “ZFS is slow” and I want to stop guessing. Outputs are representative; your numbers will differ. The point is what you conclude.
Task 1: Identify the pool layout and special vdev presence
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
errors: No known data errors
What it means: There is a special vdev mirror. Metadata (and possibly small blocks) may be landing on NVMe, which is good. If there is no special vdev and you have HDDs, metadata latency is a prime suspect.
Decision: If metadata is on slow rust and workload is metadata-heavy, plan a special vdev (mirrored) or move the workload to flash.
Task 2: Measure pool-level latency and IO shape
cr0x@server:~$ zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 7.12T 3.88T 950 420 12.3M 8.1M
raidz2-0 7.12T 3.88T 120 80 12.0M 8.0M
sda - - 30 20 3.0M 2.0M
sdb - - 30 20 3.0M 2.0M
sdc - - 30 20 3.0M 2.0M
sdd - - 30 20 3.0M 2.0M
special 80G 720G 830 330 0.3M 0.1M
mirror-1 - - 830 330 0.3M 0.1M
nvme0n1 - - 415 330 0.2M 0.1M
nvme1n1 - - 415 330 0.1M 0.1M
What it means: Lots of operations, tiny bandwidth on the special vdev: that screams metadata. The big vdev bandwidth is modest; your pain is likely from small IO latency and dependency chains.
Decision: Focus on ARC hit rate, special vdev saturation, and metadata-heavy datasets. Don’t waste time chasing sequential throughput.
Task 3: Check dataset properties that influence metadata behavior
cr0x@server:~$ zfs get -o name,property,value atime,recordsize,xattr,dnodesize,primarycache,secondarycache,sync tank/app
NAME PROPERTY VALUE
tank/app atime off
tank/app recordsize 128K
tank/app xattr sa
tank/app dnodesize auto
tank/app primarycache all
tank/app secondarycache all
tank/app sync standard
What it means: xattr=sa and dnodesize=auto can be great for xattr-heavy workloads, but they can also inflate metadata footprint. primarycache=all allows both data and metadata into ARC, which is usually what you want here; primarycache=metadata is better suited to streaming datasets that would otherwise evict everyone else’s metadata.
Decision: If this dataset serves small-file, metadata-heavy operations, keep caching enabled for both data and metadata. Reserve primarycache=metadata for the streaming datasets that share the pool (see Task 11).
Task 4: Verify ARC size and basic behavior
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
21:10:01 480 22 4 10 45 12 55 6 27 62G 64G
21:10:02 510 28 5 14 50 14 50 8 29 62G 64G
21:10:03 495 30 6 18 60 12 40 10 33 62G 64G
What it means: ARC is near target (arcsz close to c). Miss rate is moderate. If you see miss% climb and stay high during metadata operations, you’re going to disk for metadata.
Decision: If ARC is too small or polluted by data, change caching strategy or add RAM. If misses are mostly metadata (mm% high), special vdev and metadata tuning become urgent.
Task 5: Confirm whether reads are actually hitting disks or cache
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 7.12T 3.88T 980 25 12.6M 0.3M
raidz2-0 7.12T 3.88T 130 15 12.2M 0.2M
special 80G 720G 850 10 0.4M 0.1M
What it means: High read ops to special vdev suggests metadata is not in ARC enough. If this were a warm-cache workload, you’d expect fewer physical reads during repeated directory walks.
Decision: Investigate ARC eviction, cache settings, and memory pressure. Consider pinning caching behavior per dataset.
Task 6: Check pool utilization and fragmentation risk
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 11T 7.12T 3.88T 64% 38% ONLINE
What it means: 64% full isn’t scary, but 38% fragmentation suggests allocations are getting less contiguous. Metadata-heavy churn can drive this up, making future operations more expensive.
Decision: If cap is >80% and frag is high, plan space reclamation, rewrite, or pool expansion. Metadata performance often falls off a cliff at high utilization.
Task 7: Identify the “small file” datasets and count objects
cr0x@server:~$ zfs list -o name,used,refer,logicalused,compressratio,usedsnap -r tank
NAME USED REFER LUSED RATIO USEDSNAP
tank 7.12T 256K 7.10T 1.18x 1.4T
tank/app 2.20T 2.00T 2.30T 1.12x 600G
tank/ci-cache 1.10T 920G 1.30T 1.05x 480G
tank/backups 3.60T 3.50T 3.40T 1.24x 320G
What it means: Datasets with large USEDSNAP and high churn (CI caches) often trigger metadata pain: deletes get expensive, space doesn’t come back immediately, and directory ops slow.
Decision: Target tank/ci-cache for snapshot policy review, caching policy changes, and potentially its own pool or special vdev class.
Task 8: Observe live latency and queue depth per vdev
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 7.12T 3.88T 1100 600 14.0M 10.2M
raidz2-0 7.12T 3.88T 150 120 13.7M 10.1M
special 80G 720G 950 480 0.3M 0.1M
What it means: If the special vdev ops jump during directory traversal or file creation storms, it’s doing its job—but it can also become the bottleneck if undersized or if the devices are consumer-grade and choking on sustained random IO.
Decision: If special vdev is saturated, upgrade it (faster NVMe, more mirrors) or reduce metadata IO (snapshot policy, workload isolation).
Task 9: Check whether your workload is sync-write heavy
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 06:12:44 with 0 errors on Sun Dec 21 03:10:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
logs
nvme2n1 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
errors: No known data errors
What it means: There is a SLOG device (logs). That helps sync write latency. It does nothing for metadata reads. People confuse those daily.
Decision: If users complain about fsync latency, validate SLOG health and latency. If they complain about find and ls, stop staring at the SLOG.
Task 10: Look for pathological snapshot counts and retention
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -s creation | tail -n 5
tank/ci-cache@auto-2025-12-26-0900 0B 920G Fri Dec 26 09:00 2025
tank/ci-cache@auto-2025-12-26-1000 0B 920G Fri Dec 26 10:00 2025
tank/ci-cache@auto-2025-12-26-1100 0B 920G Fri Dec 26 11:00 2025
tank/ci-cache@auto-2025-12-26-1200 0B 920G Fri Dec 26 12:00 2025
tank/ci-cache@auto-2025-12-26-1300 0B 920G Fri Dec 26 13:00 2025
What it means: Frequent snapshots can be fine, but in churny datasets they preserve dead blocks and increase metadata to traverse. “0B used” per snapshot can still hide a lot of metadata complexity depending on churn.
Decision: For CI caches and scratch datasets: shorten retention, reduce frequency, or don’t snapshot them at all.
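To see which datasets are accumulating snapshots fastest, a one-liner built from standard tools is enough (adjust the pool name to yours):
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank | awk -F'@' '{c[$1]++} END {for (d in c) print c[d], d}' | sort -rn | head
The datasets at the top of that list are where retention policy conversations should start.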
Task 11: Identify whether ARC is being polluted by streaming reads
cr0x@server:~$ zfs get -r -o name,property,value primarycache tank/backups
NAME PROPERTY VALUE
tank/backups primarycache all
What it means: Backup datasets often stream. If they’re allowed to cache data in ARC, they can evict metadata needed by latency-sensitive workloads.
Decision: Consider setting primarycache=metadata on streaming datasets so ARC keeps metadata hot without wasting RAM on bulk data you won’t reread soon.
Task 12: Apply a targeted caching change (carefully)
cr0x@server:~$ sudo zfs set primarycache=metadata tank/backups
What it means: ZFS will stop caching file data blocks for this dataset in ARC, but will still cache metadata. This often improves metadata-heavy latency elsewhere on the pool.
Decision: If latency-sensitive datasets share the pool with streaming workloads, this is one of the cleanest, lowest-risk knobs.
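If the change backfires (say, the backup job really does reread recent data), reverting is equally clean: zfs inherit drops the local value and falls back to the parent or default.
cr0x@server:~$ sudo zfs inherit primarycache tank/backups
cr0x@server:~$ zfs get -o name,property,value,source primarycache tank/backups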
Task 13: Inspect xattr and dnodesize choices on the hottest dataset
cr0x@server:~$ zfs get -o name,property,value xattr,dnodesize tank/ci-cache
NAME PROPERTY VALUE
tank/ci-cache xattr sa
tank/ci-cache dnodesize auto
What it means: This combination is commonly correct for xattr-heavy workloads (containers can be). But it can inflate metadata footprint. If your workload isn’t xattr-heavy, it may be unnecessary.
Decision: Don’t flip these casually on an existing dataset expecting instant results; they affect newly created objects. Consider A/B testing on a new dataset.
Task 14: Basic “is it metadata?” micro-test with directory traversal timing
cr0x@server:~$ time find /tank/ci-cache/workdir -maxdepth 3 -type f -printf '.' >/dev/null
real 0m28.412s
user 0m0.812s
sys 0m6.441s
What it means: High system time and long wall time for a simple traversal strongly indicates metadata and IO wait. If you repeat it and it becomes much faster, ARC is helping; if not, ARC isn’t large enough or is being evicted.
Decision: Repeat immediately and compare. If the second run is not dramatically faster, you’re missing in ARC and hitting disks/special vdev hard.
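To pair the traversal test with the create/delete loop from the playbook, a throwaway dataset and a shell loop are plenty. A minimal sketch with made-up names and counts; run it somewhere disposable and watch zpool iostat and arcstat in another terminal:
cr0x@server:~$ sudo zfs create tank/perftest                        # hypothetical scratch dataset
cr0x@server:~$ sudo chown "$USER" /tank/perftest
cr0x@server:~$ time bash -c 'mkdir /tank/perftest/d && for i in $(seq 1 100000); do : > /tank/perftest/d/f$i; done'
cr0x@server:~$ time rm -rf /tank/perftest/d
cr0x@server:~$ sudo zfs destroy tank/perftest
Creates and deletes of empty files are almost pure metadata, so the wall-clock numbers track dnode, ZAP, and TXG behavior rather than data bandwidth.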
Three corporate mini-stories (anonymized, plausible, painfully familiar)
1) The incident caused by a wrong assumption: “We have NVMe, so metadata can’t be the problem”
A mid-sized SaaS company ran a build farm on ZFS. They had modern NVMe for “the pool,” so the infrastructure team assumed storage latency was basically solved. The build system still suffered sporadic 30–90 second stalls during peak hours. Engineers blamed the CI orchestrator and wrote it off as “Kubernetes being Kubernetes.”
On-call eventually correlated stalls with jobs that unpacked large dependency trees: tens of thousands of tiny files. During those windows, average pool bandwidth was low and the NVMe devices were far from saturated. Everybody nodded at the dashboards and concluded: not storage.
The mistake: they only looked at throughput and aggregate utilization. They didn’t look at IO size distribution, cache hit rate, or the fact that metadata operations were serial and latency-dominated. ARC was small relative to the object working set, and a concurrent backup job was streaming data through ARC, evicting metadata.
The fix was boring: set primarycache=metadata on backup datasets, add RAM, and move the CI dataset’s metadata onto a properly mirrored special vdev. The “mystery stalls” evaporated. The orchestrator didn’t change. The assumptions did.
2) The optimization that backfired: “Let’s crank dnodesize to make xattrs faster”
A large enterprise ran a compliance-sensitive file service. They stored lots of tags and audit attributes as xattrs. Someone read that larger dnodes help keep attributes inline and reduce IO. True statement. Dangerous half-truth.
They applied a dataset template that set dnodesize=auto broadly across many datasets, including ones that didn’t need it: log staging, build artifacts, temporary scratch. Over time, those datasets accumulated objects with larger dnodes. ARC efficiency worsened. Metadata cache missed more often. Directory operations slowed, then client latency spiked, then ticket volume surged.
The backfire wasn’t immediate, which made it harder. The system “aged into” the problem as more objects were created and snapshots preserved history. Eventually, even moderate churn caused enough metadata IO that HDD-based metadata became the bottleneck. The team had accidentally traded fewer xattr IOs on one dataset for heavier metadata everywhere.
Rolling it back wasn’t a simple toggle because existing objects keep their dnode structure. They had to create new datasets with sane defaults, migrate data, and treat the old ones like radioactive waste until retirement. The lesson was not “never use larger dnodes.” The lesson was: apply it where it’s measured, not where it’s fashionable.
3) The boring but correct practice that saved the day: treating special vdev like production-critical storage
A financial services org ran OpenZFS with a special vdev to keep metadata on flash. They did it the right way: mirrored enterprise NVMe, strict monitoring, and a rule that special vdev capacity must never be allowed to “accidentally” fill up.
During a vendor firmware issue that caused intermittent NVMe timeouts under sustained random read load, the special vdev started throwing errors. The monitoring fired early because they watched error counters and latency on the special class separately, not just pool health.
They degraded the pool gracefully: slowed nonessential jobs, postponed scrubs, and migrated the hottest metadata-heavy workloads off the system before the situation escalated. Replacement drives were staged, and they already had a tested procedure for swapping special vdev members without panicking.
Most teams discover that “metadata is special” when the pool won’t import. This team discovered it on a Tuesday morning in a change window, and everyone got to eat lunch. Reliability often looks like paranoia with better documentation.
Common mistakes: symptoms → root cause → fix
1) ls and find are slow, but big file reads are fast
Symptom: Streaming reads/writes hit expected bandwidth; directory traversal and stat-heavy tools crawl.
Root cause: Metadata misses in ARC; metadata on slow vdevs; insufficient IOPS for small random reads.
Fix: Increase ARC effectiveness (more RAM, primarycache=metadata on streaming datasets), deploy a mirrored special vdev, reduce snapshot churn.
2) “We added SLOG and nothing improved”
Symptom: App latency unchanged after installing a fast log device.
Root cause: Bottleneck is metadata reads or async operations; SLOG only helps synchronous writes.
Fix: Confirm sync write rate; keep SLOG if needed, but fix metadata path (ARC, special vdev, object count, snapshots).
3) Latency spikes during scrubs/resilvers
Symptom: User IO becomes slow during maintenance operations.
Root cause: ARC displacement by scrub reads; vdev contention; metadata reads pushed to slower media; special vdev saturated.
Fix: Schedule scrubs off-peak, throttle where supported, ensure special vdev is not undersized, and avoid running heavy scans during peak.
4) Deleting files is slow and space doesn’t come back
Symptom: rm -rf runs forever; pool stays full.
Root cause: Snapshots keep blocks alive; metadata updates are massive; frees are deferred across TXGs.
Fix: Reduce snapshot retention/frequency on churny datasets; consider periodic dataset rotation (new dataset + migrate) for scratch.
5) “We increased recordsize, still slow”
Symptom: Tuning recordsize didn’t improve metadata-heavy operations.
Root cause: Recordsize affects file data blocks, not directory ops or dnode cache footprint.
Fix: Focus on metadata: ARC, special vdev, xattr/dnodesize strategy, snapshot policy, and workload isolation.
6) Special vdev fills up unexpectedly
Symptom: Special vdev allocation grows until it’s near full.
Root cause: Small blocks are being allocated to special due to special_small_blocks setting, or metadata growth from object explosion.
Fix: Re-evaluate special_small_blocks threshold, expand special vdev capacity (add another mirror), and control object growth (cleanup policies).
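Two commands keep this from being a surprise: zpool list -v shows per-vdev allocation, including the special mirror, and zfs get shows which datasets are steering small blocks into it.
cr0x@server:~$ zpool list -v tank
cr0x@server:~$ zfs get -r -o name,property,value special_small_blocks tank
Alert on special-class allocation the way you alert on pool capacity: once the special vdev is full, new metadata spills back onto the regular vdevs and the latency win quietly disappears.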
Joke #2: A special vdev is “special” the same way a single point of failure is “special”—it gets everyone’s attention at 3 a.m.
Checklists / step-by-step plan
Step 1: Classify the workload (don’t guess)
- Run zpool iostat -v during the slowdown.
- Time a metadata-heavy operation (directory walk) and a data-heavy operation (sequential read).
- If ops are high and bandwidth is low, treat it as metadata/small IO.
Step 2: Make ARC work for you
- Check ARC size and miss rate (use arcstat).
- If streaming workloads share the pool, set primarycache=metadata on those datasets.
- If ARC is simply too small for the working set, add RAM before buying more disks. Yes, really.
Step 3: Put metadata on the right media
- If you have HDD pools serving metadata-heavy workloads, plan a mirrored special vdev on enterprise NVMe.
- Size it with headroom. Metadata grows with objects and snapshots, not just “used bytes.”
- Monitor it separately: latency, errors, and allocation.
Step 4: Control churn and snapshots like an adult
- Identify datasets with high churn (CI caches, temp staging, container layers).
- Reduce snapshot frequency and retention for those datasets.
- Consider dataset rotation: create a new dataset, switch the workload, destroy the old dataset when safe (see the sketch after this list). This avoids pathological long-lived metadata.
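A minimal sketch of that rotation pattern for a scratch dataset; names are placeholders, and the payoff is that destroying a whole dataset frees metadata far more cheaply than rm -rf inside a long-lived one:
cr0x@server:~$ sudo zfs create tank/ci-scratch-2026w01
cr0x@server:~$ # repoint jobs at /tank/ci-scratch-2026w01 and drain the old dataset
cr0x@server:~$ sudo zfs destroy -r tank/ci-scratch-2025w52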
Step 5: Test changes with a realistic micro-benchmark
- Use a representative directory tree.
- Measure cold-cache and warm-cache behavior.
- Validate with zpool iostat and ARC stats while testing, not afterward.
Step 6: Operational hygiene that reduces “mystery” incidents
- Keep pool utilization under control (avoid living above ~80% unless you like surprises).
- Keep scrub schedules predictable.
- Document dataset defaults: xattr, dnodesize, primarycache, snapshot policies.
One quote that belongs on your runbook
Paraphrased idea, attributed to John Allspaw: “In complex systems, success and failure come from the same place: normal work and normal decisions.”
FAQ
1) What exactly is a dnode, in operational terms?
A dnode is the metadata entry for a ZFS object (file, directory, dataset object, etc.). If your workload does lots of stat(), lookups, creates, deletes, or xattrs, dnode access is on the hot path.
2) Is metadata always on the special vdev if I have one?
Typically metadata is allocated to the special class when present, but behavior can vary with features and settings. Also, “small blocks” may go there if special_small_blocks is set. You should assume the special vdev is critical and monitor it accordingly.
3) Should I always enable special_small_blocks?
No. It’s powerful, but it can fill your special vdev and turn a metadata accelerator into a capacity problem. Use it when your workload is dominated by small blocks and you have enough mirrored flash to carry it with headroom.
4) Does increasing recordsize help metadata?
Not much. recordsize is about file data blocks. Metadata operations are dominated by dnodes, ZAP, indirect blocks, and other metadata structures.
5) Will adding a SLOG fix slow directory traversals?
No. SLOG helps synchronous write latency. Directory traversal slowness is usually metadata reads and cache behavior. Keep the SLOG if your workload needs it, but don’t expect it to fix read-side metadata stalls.
6) Why does the second find run sometimes get much faster?
Because ARC warmed up. The first run pulled metadata into cache. If the second run is not faster, either the working set doesn’t fit in ARC or some other workload is evicting it.
7) How do snapshots affect metadata performance?
Snapshots preserve old block pointers, which increases the amount of metadata that remains reachable. On churny datasets, that can amplify the cost of deletes and increase fragmentation, which can make metadata operations slower over time.
8) What’s the safest “quick win” when metadata is the bottleneck?
Often: set primarycache=metadata on streaming datasets that are polluting ARC, and reduce snapshot retention on churny datasets. These two changes can improve metadata hit rates without changing on-disk formats.
9) Should I change dnodesize on an existing dataset?
Be careful. It affects new objects, and existing objects won’t necessarily “shrink.” If you need a clean state, create a new dataset with the desired properties and migrate.
10) If I have plenty of IOPS on paper, why do metadata ops still stall?
Because metadata often requires dependent reads in sequence. You can’t parallelize a chain of “read pointer to find next pointer” the way you can parallelize big reads. Latency dominates.
Conclusion: next steps that actually move the needle
If you take one operational lesson from ZFS metadata behavior, take this: stop treating “storage performance” as a bandwidth problem. For many real workloads, it’s a cache-and-latency problem wearing a disguise.
Do these next:
- Run the fast diagnosis playbook during the next incident and classify the bottleneck correctly.
- Protect ARC from streaming workloads with dataset-level primarycache=metadata where appropriate.
- If you’re metadata-heavy on HDD, stop negotiating with physics: use a mirrored special vdev on enterprise flash, sized with headroom.
- Audit snapshot policies on churny datasets and stop snapshotting scratch like it’s priceless history.
- Document dataset defaults and enforce them. Most metadata disasters start as “just one property change.”
ZFS will do exactly what you ask, and it will keep doing it long after you’ve forgotten you asked. Make the metadata path boring, predictable, and fast. Your future self will have fewer pagers and better weekends.