ZFS has a knack for turning “filesystem” conversations into “storage architecture” conversations. That’s not a bug; it’s the whole point. ZFS treats your data like a graph of blocks, and the metadata is the map. Lose the map and you’re not “missing a file,” you’re missing the ability to prove what the file even is. That’s why redundant_metadata exists: it’s a dataset property that decides how many copies of metadata ZFS tries to keep.
The catch is that “more copies” is not automatically “more better.” Extra metadata copies can save you from the kind of corruption that makes grown engineers stare at zpool status in silence — but it can also quietly add write amplification and turn an already-I/O-starved pool into a queueing disaster. This piece is about when redundant_metadata pays for itself, when it doesn’t, and how to tell the difference in production without guessing.
What redundant_metadata really does
redundant_metadata is a ZFS dataset property that influences how many copies of metadata ZFS stores. The most common values you’ll see are:
- all: store redundant copies of all metadata (as in, "try to keep extra copies").
- most: store redundant copies of most metadata.
- none: store no extra metadata copies (beyond whatever redundancy your vdev layout already provides).
This is not the same as the copies property. copies=2 duplicates data and metadata at the dataset level, increasing space usage substantially. redundant_metadata focuses on metadata only, which is usually much smaller than user data — until it isn’t (think millions of small files, heavy snapshots, or metadata-heavy workloads like maildir, container layers, build caches).
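A quick way to see both knobs side by side (assuming a pool named tank; adjust names for your environment):
cr0x@server:~$ zfs get -r -o name,property,value,source copies,redundant_metadata tank
If copies shows anything other than 1 on a large dataset, that is usually the property somebody meant to leave alone.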
Important nuance: ZFS already has redundancy at the vdev level (mirror, RAIDZ). redundant_metadata is about additional copies on top of that. On mirrors, you already have two or more copies of every block because the whole vdev is duplicated. On RAIDZ, every block is protected by parity, but it’s still one logical block per write. Extra metadata copies can give you alternate physical instances of that metadata block, which changes the recovery story when you hit silent corruption or sector-level weirdness that parity can’t correct (for example, repeated read errors concentrated in one region, or a device that quietly returns bad data and leaves ZFS hunting for another valid source).
One sentence version: redundant_metadata is a bet that your biggest risk is “metadata becomes unreadable before you notice,” and you’re willing to pay some overhead to hedge.
Metadata is like DNS: nobody cares until it breaks, and then suddenly it’s “the most critical service we’ve ever run.”
What counts as “metadata” here?
ZFS metadata includes things like block pointers, indirect blocks, dnodes (file metadata), directory structures, space maps, and other on-disk structures used to locate and verify your data. Some metadata is pool-wide (MOS and friends), some is dataset-specific. In practice, if you’re debugging a scary situation, the “metadata” you care about is anything that prevents the filesystem from traversing to your file contents reliably.
Not all metadata is equal. A corrupt file’s content might be a single customer attachment you can restore from a snapshot. Corrupt metadata can make entire datasets unmountable or cause traversal failures that block restores (which is the real nightmare: your backups exist, but you can’t enumerate them).
Why metadata is the part you actually can’t lose
ZFS is copy-on-write. Every time it changes a block, it writes new blocks and then updates pointers to the new blocks. This is why ZFS is so good at consistency: it doesn’t overwrite live structures in place; it builds a new tree and then flips a root pointer at the end. That design has implications:
- Metadata churn can be very high even when user data is “stable.”
- The “shape” of your metadata affects performance (more indirection, more small random I/O).
- Metadata is what makes snapshots possible, but snapshots also preserve historical metadata, which can increase the number of metadata blocks that must remain readable.
Operationally, ZFS failures tend to come in two flavors:
- Obvious device failures: a disk dies, SMART screams, the mirror degrades, you replace it, you resilver. Annoying but routine.
- Non-obvious corruption and partial unreadability: everything looks fine until scrub or a specific block is accessed, and then you get checksum errors or permanent errors. Metadata blocks are disproportionately painful here, because they can be on critical paths: mounting, listing directories, snapshot operations, send/receive, even scrubs.
Extra metadata copies won’t save you from every disaster. They can’t fix “oops, we destroyed the pool” and they won’t make RAIDZ behave like a mirror. But they can change a corruption incident from “we need a restore window and a lot of caffeine” to “scrub repaired it and we moved on.”
How ZFS places extra copies (and where it can’t)
ZFS has a long-standing mechanism for “ditto blocks” (extra copies of metadata). The intent is simple: if one instance is bad, another might be good, and checksums let ZFS pick the correct one. This is conceptually similar to having multiple replicas, but within a single pool and controlled by allocator rules.
There are constraints:
- Not everything gets copied the same way. Some metadata is already handled specially; some may be too large or too frequent to reasonably duplicate.
- Copies must land somewhere. If your pool is small, fragmented, or constrained to a narrow set of devices, “extra copies” can end up not being as independent as you think. Two copies on the same failing disk don’t count as redundancy; they count as optimism.
- Special vdev changes the game. If you use a special vdev for metadata, then metadata placement and redundancy can become both better and worse: better because metadata is on fast devices, worse because you just created a “metadata tier” that must be properly redundant or you’ve built a single point of failure with better latency.
Also: redundant_metadata doesn’t retroactively rewrite your entire pool. It influences new writes. Existing metadata blocks generally stay as-is until they’re rewritten through normal churn, or you force rewrites with specific tactics (which can be risky and workload-dependent).
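If you genuinely need existing metadata rewritten under the new policy, the usual tactic is to rewrite the data itself, for example with a local send/receive into a fresh dataset. This is a sketch, not a recipe: the dataset names are placeholders, you need enough free space for a second copy, and the cutover (and any incremental sends) is your problem to plan.
cr0x@server:~$ sudo zfs snapshot tank/registry@rewrite
cr0x@server:~$ sudo zfs send tank/registry@rewrite | sudo zfs receive -o redundant_metadata=all tank/registry-rewritten
Every block in the received dataset is newly written, so it picks up the current metadata policy as it lands.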
Mirrors, RAIDZ, and the false sense of “we already have redundancy”
On mirrors, every block is duplicated at the vdev level. Metadata redundancy is inherently strong because any read can come from either side, and checksums allow ZFS to detect and self-heal by rewriting from the good side. In that world, redundant_metadata is often less critical for integrity (though still sometimes relevant for performance due to placement and read distribution).
On RAIDZ, parity protects you from a full-device failure up to the parity level, and can correct some read errors. But parity is not magic against all failure modes, especially when you’re dealing with latent sector errors, firmware bugs, or a device returning bad data inconsistently. Extra physical copies of metadata can help because they create alternative sources for the same logical information.
RAIDZ parity is like insurance: it’s great until you learn the deductible is “rebuild the pool during a weekend.”
When more metadata copies actually matter
Here are the situations where redundant_metadata tends to earn its keep, based on how failures show up in real systems.
1) Metadata-heavy workloads (millions of files, tiny files, deep trees)
If you run anything that creates huge numbers of filesystem objects—CI build caches, container image layers, package mirrors, mail spools, artifact stores, source code monorepos—your “metadata” is not a rounding error. It becomes a first-class capacity and performance consumer. This is precisely the kind of environment where metadata corruption is also particularly disruptive: traversal touches lots of metadata blocks, and a single bad directory block can block access to large logical swaths of data.
Extra copies are valuable because your probability of encountering a bad metadata block increases with the number of metadata blocks you have and read. It’s statistics, not superstition.
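To get a rough sense of how many filesystem objects a dataset carries (and therefore how many dnodes and directory blocks are in play), zdb can list a dataset with its object count. It is read-only but can be slow on large pools, and the exact output format varies by OpenZFS version:
cr0x@server:~$ sudo zdb -d tank/registry
Millions of objects means millions of metadata blocks that all have to stay readable, which is exactly the population redundant_metadata is hedging against.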
2) Long snapshot retention or aggressive snapshot schedules
Snapshots preserve historical versions of metadata, not just data. When you keep many snapshots, you keep old metadata around. If some of that metadata becomes unreadable, it can break operations like zfs list -t snapshot, zfs destroy (yes, even deleting can be blocked), and zfs send. It’s like keeping years of receipts in a shoebox—fine until the one you need is the one that got coffee spilled on it.
redundant_metadata=all can reduce the chance that any given metadata block is a single point of failure.
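A quick inventory of which datasets carry the most snapshots (and therefore the most historical metadata) can be pulled with standard tooling; the pool name tank is a placeholder:
cr0x@server:~$ zfs list -H -t snapshot -o name -r tank | awk -F@ '{print $1}' | sort | uniq -c | sort -rn | head
Datasets at the top of that list are where an unreadable metadata block has the widest blast radius across your retention history.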
3) Pools with a history of checksum errors or marginal hardware
We all want enterprise hardware. Sometimes you inherit “best effort” hardware with a procurement story that starts with “it was a great deal.” If you’ve seen intermittent checksum errors during scrub, or you’ve had to replace disks for read errors more than once, redundant metadata can be a practical risk reducer.
But be honest: if you’re routinely seeing checksum errors, the first fix is usually hardware, cables, HBAs, firmware, and power. redundant_metadata is not a substitute for stable I/O paths; it’s a seatbelt, not a steering wheel.
4) RAIDZ pools where metadata read IOPS is the constraint
It sounds paradoxical: “extra copies” means more writes, but it can also improve read resilience and, in some cases, read behavior under error conditions. In a RAIDZ pool, a single bad sector can force reconstruction reads that are expensive. If ZFS can read a good copy of a metadata block without reconstruction, it can reduce the blast radius of a marginal disk during a scrub or heavy traversal.
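If you want evidence that the pool’s read mix really is small-block (metadata-shaped) rather than streaming, newer OpenZFS releases can print per-vdev request-size histograms; treat the flag as version-dependent:
cr0x@server:~$ zpool iostat -r tank 5 3
A read column dominated by 4K–16K requests on RAIDZ HDDs is the classic metadata IOPS signature.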
5) Special vdev designs done correctly
If you have a special vdev made of mirrored SSDs dedicated to metadata (and optionally small blocks), you can make metadata both faster and more resilient—if the special vdev itself is properly redundant and monitored. In such architectures, setting redundant_metadata=all for critical datasets can be the difference between “metadata is fast” and “metadata is fast and survives a weird SSD failure mode.”
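For reference, a mirrored special vdev is added with ordinary zpool syntax; the device names below are placeholders, and on many setups adding a vdev is effectively permanent, so check twice before pressing enter:
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
Only route small data blocks to it (the special_small_blocks property) if you have sized the SSDs for that; otherwise let it hold metadata alone.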
When it hurts: performance and space tradeoffs
redundant_metadata is not a free lunch. It can bite you in three places: write amplification, fragmentation behavior, and capacity planning.
Write amplification and small random writes
Metadata writes are typically small and random. Duplicating them means more small random writes. On HDD pools, this is how you turn “acceptable latency” into “why is everything at 200ms.” On SSD pools, you might not feel it immediately, but you’re still increasing write workload and potentially impacting endurance and garbage collection behavior.
If your workload is sync-heavy (databases with sync=standard and lots of fsync, NFS exports with sync semantics), the metadata path is on the critical latency chain. Extra metadata writes can show up as higher commit times, which show up as application tail latency. Nobody pages the storage team for average latency.
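To see whether the extra metadata writes show up as latency rather than throughput, OpenZFS can print per-vdev latency histograms (again, flag availability depends on your version):
cr0x@server:~$ zpool iostat -w tank 5 3
If the write-latency tail shifts right after you enable extra metadata copies on a busy dataset, you have measured the cost instead of guessing at it.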
Space overhead that’s invisible until it isn’t
Metadata overhead is usually small compared to data. “Usually” is doing a lot of work in that sentence. If you store billions of small objects, enable xattrs heavily, or keep deep snapshot histories, metadata can be substantial. Extra copies of metadata can push you into higher pool utilization, and high pool utilization is where ZFS allocation gets less friendly, fragmentation rises, and performance degrades.
This is where people get surprised: they didn’t run out of space because of “data.” They ran out because the filesystem had to manage the data, and the management got duplicated.
Not the right fix for “I want better redundancy”
If your real goal is “I want this dataset to survive two disk failures,” redundant_metadata is not your tool. Choose mirror/RAIDZ topology accordingly. Extra metadata copies help with certain classes of corruption and localized unreadability; they don’t change your fundamental fault tolerance the way vdev redundancy does.
Facts and context you can use in arguments
These are the kinds of short, concrete points that help when you’re in a design review and someone says “why are we even talking about metadata?”
- ZFS was designed around end-to-end checksums, so it can detect silent corruption rather than serving bad data quietly. That detection is only useful if there’s a valid alternate source (mirror side, parity reconstruction, or an extra copy).
- Copy-on-write makes consistency cheap but metadata busy. Even “small” changes can touch multiple metadata blocks: dnode updates, indirect blocks, spacemap updates, and so on.
- The term “ditto blocks” predates the modern property name and reflects the original idea: keep extra copies of critical metadata so a single bad sector can’t brick the pool traversal.
- Scrubs are not just “checks,” they’re repair workflows on redundant vdevs: ZFS reads, verifies checksums, and can heal by rewriting from a good copy. Extra metadata copies can increase the chance that a good copy exists.
- Metadata tends to be small-block I/O, which is precisely the worst-case pattern for HDD vdevs and parity layouts: small, random, latency sensitive.
- RAIDZ has a “reconstruction tax”: when a sector is bad, reading a block can require reading multiple columns and reconstructing. For metadata, that tax hits high-frequency operations (directory walks, snapshot enumeration).
- High pool utilization changes allocator behavior, often increasing fragmentation and making metadata I/O more scattered. Extra metadata copies increase pressure on free space earlier.
- Special vdevs were introduced to address metadata IOPS bottlenecks by moving metadata to faster devices; but they also introduce a new failure domain that must be mirrored (or better) because losing it can be catastrophic.
- “Backups exist” is not the same as “restore is possible.” If metadata corruption prevents enumerating snapshots or reading send streams, you can have bits on disk and still be blocked operationally.
Three corporate-world mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption
We inherited a storage cluster that “had redundancy,” according to the handover doc. The pool was RAIDZ2, plenty of disks, and the dashboards were green. The team’s assumption was basically: parity equals safety, so metadata tuning is academic.
Then an innocuous event: a weekly scrub started, and a day later developers complained that a container registry was “randomly hanging.” Not down—just hanging. The pool wasn’t out of space, and there were no obvious device failures. But latency spiked during metadata-heavy operations: listing tags, garbage collection, and pulling older layers. The application team suspected network issues. The network team suspected DNS. The storage team suspected everything and nothing.
zpool status showed checksum errors on one disk, but the vdev was still ONLINE. RAIDZ2 was “handling it.” The wrong assumption was that “handling it” means “no user impact.” In reality, each time ZFS hit certain metadata blocks that lived on the marginal region of that disk, it paid the reconstruction penalty. Under scrub load, those metadata reads became frequent and expensive. The system was correct; it was also slow.
We replaced the disk and the symptoms evaporated. Later, we did a postmortem and changed two things: we treated checksum errors as urgent even if redundancy masked them, and we adjusted metadata strategy for the heaviest datasets. The lesson wasn’t “redundant_metadata would have prevented this.” It was that metadata is the first place where marginal hardware becomes visible, and parity doesn’t make performance free.
Mini-story #2: The optimization that backfired
Another environment: a build system producing millions of small artifacts. The pool was SSD-based RAIDZ, and the team wanted to squeeze maximum capacity. Someone read that extra metadata redundancy “wastes space” and set redundant_metadata=none across the artifact datasets. It seemed harmless. Scrubs were clean. The graphs looked great. Promotions were discussed.
Months later, a firmware issue (not catastrophic, just annoying) caused occasional read errors on a subset of blocks on one SSD. With RAIDZ, ZFS reconstructed. Again, “handled.” But the workload was metadata-heavy: directory traversals, stat storms, and snapshot operations during retention cleanup. One day, a cleanup job started failing to destroy old snapshots due to “I/O error” on a dataset. Not the whole pool. Not even all snapshots. Just certain operations that touched specific metadata structures.
Now the backfire: because metadata had no extra physical copies, recovery options were narrower. RAIDZ could reconstruct some blocks, but one block became permanently unreadable after repeated attempts; ZFS reported permanent errors tied to metadata. The dataset became partially unmanageable: listing was slow and error-prone, cleanup failed, and sends had issues. We ended up doing a more disruptive remediation: restoring from a replication target that had a clean copy, which cost time and credibility.
The lesson was not “always set redundant_metadata=all.” It was that the “optimization” was made without understanding the failure mode it was trading away. Capacity optimization is easy to justify; recovery complexity is harder to quantify until you’re living it.
Mini-story #3: The boring but correct practice that saved the day
A third story, and it’s the least glamorous: a team that did routine scrubs, kept pool utilization reasonable, and had a clean separation of datasets with sane properties. They also had a policy: metadata-heavy datasets got extra protection, and they used a mirrored special vdev for metadata on their busiest pools.
One morning, a host rebooted after a power maintenance event. It came up, but a couple of services were sluggish. Nothing was “down,” and that’s exactly the kind of situation where people waste half a day pointing fingers. The storage engineer ran a scrub and immediately saw checksum errors that were corrected. The system healed itself because it had somewhere to heal from.
It didn’t end there: the root cause was traced to a flaky HBA cable that occasionally corrupted data in flight. The reason it wasn’t a full-blown incident was boring: checksums detected it, redundancy repaired it, and scrubs surfaced it quickly. They replaced the cable, validated, and moved on. No heroics. No weekend restore. No dramatic “we almost lost everything” meeting.
This is the real win condition: a system designed so that the failure looks like routine maintenance, not a career-defining event.
Practical tasks: commands and interpretation
Below are hands-on tasks you can run on a ZFS system. The commands are written in a Linux-style shell with OpenZFS tooling. Adjust pool/dataset names for your environment.
Task 1: Check the current redundant_metadata setting (and inheritance)
cr0x@server:~$ zfs get -r -o name,property,value,source redundant_metadata tank
NAME PROPERTY VALUE SOURCE
tank redundant_metadata all default
tank/home redundant_metadata all inherited from tank
tank/registry redundant_metadata none local
tank/registry/blobs redundant_metadata none inherited from tank/registry
Interpretation: You’re looking for “local” overrides that may have been set as a quick fix or an optimization. “default” or “inherited” is not automatically correct; it just means nobody touched it.
Task 2: Change redundant_metadata safely on a single dataset
cr0x@server:~$ sudo zfs set redundant_metadata=all tank/registry
cr0x@server:~$ zfs get redundant_metadata tank/registry
NAME PROPERTY VALUE SOURCE
tank/registry redundant_metadata all local
Interpretation: This affects future metadata writes for that dataset. It won’t instantly rewrite existing metadata. If you expected an immediate change in on-disk distribution, you’ll be disappointed (and that disappointment is safer than the alternative).
Task 3: Validate pool health and see if errors already exist
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
see: none
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 12
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/registry@daily-2025-12-01:/blobs/sha256/aa/...
Interpretation: Checksum errors on a device are not “fine because RAIDZ2.” They are a symptom. Permanent errors pointing to a snapshot path often indicate metadata or referenced block issues affecting that snapshot’s view.
Task 4: Scrub on purpose, not by superstition
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
status: scrub in progress since Tue Dec 24 10:11:22 2025
1.23T scanned at 510M/s, 420G issued at 173M/s, 8.11T total
0B repaired, 5.18% done, 0:07:42 to go
Interpretation: “Scanned” vs “issued” matters. If issued is far lower, the pool is seeking/queueing or throttled by IOPS. Metadata-heavy pools often show this gap strongly.
Task 5: Check space pressure (because it changes everything)
cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,frag,health
NAME SIZE ALLOC FREE CAPACITY FRAG HEALTH
tank 10T 8.7T 1.3T 87% 52% ONLINE
Interpretation: 87% full and 52% fragmented is the zone where “small random metadata writes” get more expensive. If you’re tuning metadata redundancy on a pool this full, capacity planning is part of the fix.
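To see where that allocation actually lives (live data versus snapshots versus children), the space-oriented listing is handy:
cr0x@server:~$ zfs list -o space -r tank
If USEDSNAP dominates, retention policy is your capacity lever; deleting live data alone won’t give the allocator the breathing room it needs.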
Task 6: Identify metadata-heavy datasets by object count and logical space
cr0x@server:~$ zfs list -o name,used,refer,logicalused,logicalrefer,compressratio -r tank
NAME USED REFER LOGICALUSED LOGICALREFER RATIO
tank 8.7T 128K 11.2T 128K 1.28x
tank/home 1.2T 1.2T 1.3T 1.2T 1.09x
tank/registry 6.8T 6.8T 9.4T 9.4T 1.38x
Interpretation: This doesn’t directly show metadata, but it helps find datasets where logical vs physical behavior and churn might indicate heavy block pointer activity. Combine with snapshot counts and workload knowledge.
Task 7: Count snapshots and understand retention risk
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -r tank/registry | head
NAME USED REFER CREATION
tank/registry@hourly-2025-12-24-09 0B 6.8T Tue Dec 24 09:00 2025
tank/registry@hourly-2025-12-24-08 0B 6.8T Tue Dec 24 08:00 2025
tank/registry@daily-2025-12-23 12G 6.7T Mon Dec 23 00:10 2025
Interpretation: Many snapshots means many historical metadata paths. If operations on old snapshots are failing, redundant metadata can be part of the resilience strategy, but you also need scrub discipline and sane retention.
Task 8: Confirm special vdev exists (metadata tier)
cr0x@server:~$ zpool status tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme0n1p1 ONLINE 0 0 0
nvme1n1p1 ONLINE 0 0 0
Interpretation: If you have a special vdev, it must be redundant. If it’s a single device, treat it like an outage waiting to be scheduled.
Task 9: Check where metadata is going (high-level behavior)
cr0x@server:~$ zpool get -o name,property,value,source feature@spacemap_histogram tank
NAME PROPERTY VALUE SOURCE
tank feature@spacemap_histogram active local
Interpretation: Features vary by platform and version; the point is to verify your pool supports the tooling you plan to use for visibility. If your platform has limited introspection, you may need to rely more on workload symptoms and scrub results.
Task 10: Observe I/O pressure and latency (quick view)
cr0x@server:~$ iostat -x 1 5
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.12 0.00 3.45 18.90 0.00 71.53
Device r/s w/s rkB/s wkB/s await svctm %util
sda 32.0 210.0 820.0 4600.0 98.12 3.10 79.2
sdb 30.0 208.0 800.0 4550.0 102.55 3.20 80.1
nvme0n1 900.0 700.0 72000.0 64000.0 1.20 0.05 8.5
Interpretation: High await on HDDs with small throughput often indicates random I/O saturation—classic metadata pain. Low util on NVMe special vdev suggests it isn’t the bottleneck; high util suggests metadata tier saturation or mis-sized special vdev.
Task 11: Watch ZFS I/O stats (pool level)
cr0x@server:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
--------------------------- ----- ----- ----- ----- ----- -----
tank 8.7T 1.3T 2.10K 5.80K 180M 95M
raidz2-0 8.7T 1.3T 2.10K 5.80K 180M 95M
sda - - 350 970 30M 16M
sdb - - 350 970 30M 16M
sdc - - 350 970 30M 16M
sdd - - 350 970 30M 16M
sde - - 350 970 30M 16M
sdf - - 350 970 30M 16M
special - - 900 1.10K 72M 64M
mirror-1 - - 900 1.10K 72M 64M
nvme0n1p1 - - 450 550 36M 32M
nvme1n1p1 - - 450 550 36M 32M
--------------------------- ----- ----- ----- ----- ----- -----
Interpretation: This helps confirm whether metadata/small-block traffic is landing on the special vdev or whether the RAIDZ HDDs are doing the painful work. If the special vdev is absent, metadata IOPS land on the main vdevs.
Task 12: Check dataset knobs that commonly interact with metadata behavior
cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,atime,xattr,dnodesize,compression,redundant_metadata tank/registry
NAME PROPERTY VALUE
tank/registry recordsize 128K
tank/registry atime off
tank/registry xattr sa
tank/registry dnodesize legacy
tank/registry compression zstd
tank/registry redundant_metadata all
Interpretation: Metadata strategy doesn’t live alone. xattr=sa can increase metadata density in dnodes; dnodesize affects how much metadata can live “inline.” These can change the amount of metadata I/O and the cost of duplicating it.
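If you are creating new metadata-heavy datasets anyway, it is usually cleaner to set these knobs at creation time than to retrofit them. A sketch with a hypothetical dataset name, with the caveat that dnodesize=auto depends on the large_dnode pool feature and can matter when sending to older systems:
cr0x@server:~$ sudo zfs create -o xattr=sa -o dnodesize=auto -o redundant_metadata=all tank/artifacts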
Task 13: Identify checksum errors early and clear only after remediation
cr0x@server:~$ zpool status -x
pool 'tank' has experienced checksum errors
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ zpool status -x
all pools are healthy
Interpretation: Clearing errors is for bookkeeping after you’ve addressed the cause (replaced disk, fixed cabling, completed scrub). Clearing first just deletes evidence and teaches the system nothing.
Task 14: Use a controlled file tree to see metadata costs in microcosm (lab technique)
cr0x@server:~$ mkdir -p /tank/testmeta
cr0x@server:~$ time bash -c 'for i in $(seq 1 200000); do echo x > /tank/testmeta/f_$i; done'
real 2m41.772s
user 0m20.118s
sys 1m39.332s
Interpretation: This is a blunt instrument, but it makes metadata cost visible. Repeat on a dataset with different redundant_metadata settings (in a lab) and compare latency and I/O stats. Don’t do this on production unless you enjoy explaining yourself.
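To turn that blunt instrument into a comparison, create two lab datasets that differ only in the metadata policy and run the same loop against each while watching pool I/O; the names are placeholders:
cr0x@server:~$ sudo zfs create -o redundant_metadata=all tank/testmeta-all
cr0x@server:~$ sudo zfs create -o redundant_metadata=most tank/testmeta-most
cr0x@server:~$ zpool iostat -v tank 1
The difference you care about is write operations per second on the vdevs that hold metadata, not just the wall-clock time of the loop.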
Fast diagnosis playbook
This is the order I use when a system is slow and “storage might be involved.” The goal is to identify whether metadata is the bottleneck, and whether redundancy settings are contributing.
First: Is the pool healthy, or are we paying a reconstruction tax?
cr0x@server:~$ zpool status -v tank
Look for: checksum errors, degraded devices, scrub in progress, “permanent errors,” repeated read errors on a single device. If you see CKSUM increments, assume performance impact even if the pool is ONLINE.
Second: Are we capacity-fragmentation constrained?
cr0x@server:~$ zpool list -o name,capacity,frag,alloc,free tank
Look for: capacity > 80–85% and high fragmentation. If you’re near-full, metadata duplication overhead is more likely to hurt, and allocator behavior can dominate everything.
Third: Is it IOPS latency (metadata-like) or throughput (data-like)?
cr0x@server:~$ iostat -x 1 10
Look for: high await with low KB/s points to random I/O pressure—often metadata. High KB/s with high util points to sequential throughput saturation.
Fourth: Is the special vdev doing its job (if present)?
cr0x@server:~$ zpool status tank
cr0x@server:~$ zpool iostat -v tank 1 5
Look for: special vdev busy while HDDs are calm (good: metadata is offloaded) versus HDDs busy with tiny I/O (metadata stuck on rust).
Fifth: Confirm dataset-level settings on the hot path
cr0x@server:~$ zfs get -o name,property,value,source redundant_metadata,copies,recordsize,xattr,dnodesize,primarycache tank/registry
Look for: copies accidentally set to 2+ (common “oops”), or redundant_metadata set inconsistently with workload and risk tolerance.
Sixth: Decide: are we fixing hardware, workload, or configuration?
If errors exist: fix hardware first. If near-full: fix capacity first. If metadata IOPS is the bottleneck: consider special vdev (properly mirrored), consider reducing metadata churn (snapshot policy, app behavior), and then consider metadata redundancy settings in context.
Common mistakes: specific symptoms and fixes
Mistake 1: Treating checksum errors as “non-urgent because redundancy”
Symptom: zpool status shows CKSUM errors on a device, pool remains ONLINE, but scrubs slow down and metadata operations get spiky.
Fix: Investigate and remediate the underlying cause (disk, cable, HBA, backplane). Run a scrub after replacement. Only then clear errors.
Mistake 2: Using redundant_metadata=none as a blanket capacity optimization
Symptom: Months later you see permanent errors tied to snapshot paths or directory traversal errors during replication/cleanup.
Fix: Re-evaluate risk. Consider redundant_metadata=most or all for metadata-heavy/critical datasets, especially on RAIDZ. Pair with scrubs and sane snapshot retention.
Mistake 3: Confusing redundant_metadata with copies
Symptom: Pool fills “mysteriously,” write latency increases, and you find copies=2 enabled on big datasets.
Fix: Use zfs get copies across the tree, and reset where inappropriate. Keep copies for narrowly scoped use cases (small critical datasets on non-redundant pools, or special cases), not as a casual reliability knob.
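A one-liner that surfaces accidental copies settings across the whole tree (pool name is a placeholder):
cr0x@server:~$ zfs get -r -H -o name,value,source copies tank | awk '$2 != "1"'
Anything it prints deserves an explanation in writing or a reset to the default.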
Mistake 4: Building a special vdev without redundancy
Symptom: Everything is fast until the special device dies, and then the pool is in a catastrophic state (often not importable or missing critical metadata).
Fix: Special vdev should be mirrored (at minimum) and monitored like a first-class component. Treat it as part of the pool’s core redundancy design.
Mistake 5: Expecting immediate results after changing the property
Symptom: You set redundant_metadata=all and see no change in error behavior or performance, so you flip it back and declare it useless.
Fix: Remember: it impacts new writes. Benefits accrue as metadata is rewritten. Use it as a policy, not a panic button.
Mistake 6: Ignoring pool utilization and fragmentation while tuning metadata
Symptom: Any change seems to make things worse, especially on HDD RAIDZ pools; latency skyrockets during snapshot operations and scrubs.
Fix: Get capacity back (delete data, shorten retention, add vdevs). Tuning on a near-full pool is like rearranging furniture in a burning house: technically possible, emotionally unhelpful.
Checklists / step-by-step plan
Checklist A: Deciding what to set (policy-level)
- Classify datasets by business impact: “can restore later” vs “must be online.”
- Classify by workload: metadata-heavy (many small files, snapshots, directories) vs data-heavy (large sequential files).
- Classify by vdev topology: mirror vs RAIDZ; and whether a special vdev exists.
- If RAIDZ + metadata-heavy + high criticality: lean toward redundant_metadata=all (or at least most).
- If mirror + healthy hardware + not metadata-heavy: default may be fine; focus on scrubs and monitoring.
- If pool is >85% full: fix capacity before adding overhead.
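To make the classification concrete, a small hypothetical inventory loop can print the current metadata policy and snapshot count per dataset; it’s a sketch for a pool named tank, not a polished tool:
cr0x@server:~$ for ds in $(zfs list -H -o name -r tank); do echo "$ds redundant_metadata=$(zfs get -H -o value redundant_metadata "$ds") snapshots=$(zfs list -H -t snapshot -d 1 -o name "$ds" | wc -l)"; done
Datasets that combine a none policy with a long snapshot tail are the first candidates for review.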
Checklist B: Rolling out the change safely
- Inventory current settings:
cr0x@server:~$ zfs get -r -o name,property,value,source redundant_metadata tank
- Pick a single dataset on the hot path and change it first:
cr0x@server:~$ sudo zfs set redundant_metadata=all tank/registry
- Track performance baseline before/after with pool and device stats:
cr0x@server:~$ zpool iostat -v tank 1 10
cr0x@server:~$ iostat -x 1 10
- Run a scrub during a controlled window and compare corrected errors and throughput:
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank
- Roll out to other datasets only after you understand the overhead.
Checklist C: If you suspect metadata IOPS bottleneck
- Confirm it’s latency/IOPS, not throughput (iostat -x).
- Confirm pool isn’t degraded / reconstructing a lot (zpool status -v).
- Check capacity/frag (zpool list).
- Check whether special vdev exists and is sized/redundant (zpool status).
- Only then tune metadata policies (redundant_metadata, snapshot retention, and dataset organization).
FAQ
1) Should I always set redundant_metadata=all?
No. It’s a good default for critical, metadata-heavy datasets on RAIDZ pools, especially with long snapshot retention. On mirrors, the incremental integrity benefit is usually smaller, and the overhead may not be worth it for every dataset.
2) Is redundant_metadata a substitute for mirrors or RAIDZ2?
No. It doesn’t change your vdev-level fault tolerance. It helps with certain corruption and unreadable-block scenarios by increasing the chance that an alternate metadata block is available.
3) What’s the difference between redundant_metadata and copies=2?
copies duplicates data blocks too, which can be extremely expensive in space and write load. redundant_metadata targets metadata only, typically far cheaper, and is often the more surgical tool.
4) Does changing redundant_metadata rewrite existing metadata?
Not immediately. It affects newly written metadata. Existing blocks may be rewritten as files change, snapshots are created/destroyed, or blocks are reallocated over time. Treat it as a forward-looking policy.
5) Will this fix “permanent errors have been detected” messages?
Not directly. Permanent errors mean ZFS couldn’t repair a block from available redundancy at the time it was needed. Extra copies can reduce the chance of that happening in the future, but you still need to remediate existing damage (restore from replication/backup, or remove the affected snapshot/files if possible).
6) Does a special vdev make redundant_metadata unnecessary?
No. A special vdev improves metadata performance and can improve resilience if it’s mirrored and healthy. But it’s still hardware, and it can still experience corruption or read failures. Extra metadata copies can complement a special vdev; they don’t replace good redundancy design.
7) What are the most common signs that metadata is my bottleneck?
High I/O wait, high disk await with low throughput, slow directory listings, slow snapshot operations, sluggish small-file workloads, and scrub “issued” throughput far below “scanned.” Also: performance that gets worse as the pool fills and fragments.
8) Can redundant_metadata make performance worse?
Yes, especially on HDD-based pools and metadata-heavy workloads. You’re adding extra small writes. If you’re already IOPS-limited, it can increase latency. That’s why you roll it out per dataset and measure.
9) If I’m on mirrors, should I set it to none for performance?
Maybe, but don’t do it as a reflex. Mirrors already provide strong self-healing, so the incremental integrity gain is smaller; on the other hand, the metadata duplication overhead may also be a smaller share of your total write load than you expect. Test on a representative dataset, and don’t optimize away safety unless you’re sure which failure mode you’re accepting.
10) What’s the most “boring correct” configuration choice here?
For critical datasets: keep scrubs scheduled, keep utilization under control, use redundant vdevs (mirror or adequate RAIDZ), and set metadata strategy intentionally (often redundant_metadata=all for metadata-heavy RAIDZ datasets). Most disasters are a pile-up of small “we’ll fix it later” decisions.
Conclusion
redundant_metadata is one of those ZFS levers that looks like a checkbox until you’ve lived through a metadata-related incident. Then it starts to look like an insurance policy with a premium paid in small-block writes and capacity overhead. The trick is to buy the policy where it matters: datasets with high metadata churn, long snapshot histories, and high business impact—especially on RAIDZ pools where reconstruction under stress is expensive.
If you take only one operational lesson: don’t debate metadata redundancy in the abstract. Measure your pool’s health, capacity, and IOPS headroom; understand whether metadata is on the critical path; and then set redundant_metadata intentionally per dataset. ZFS will do the math, but it won’t do the prioritization for you.