RAIDZ3 is ZFS’s “third seatbelt”: triple parity across a RAIDZ vdev. It costs you an extra disk’s worth of capacity compared to RAIDZ2, and it buys you a bigger safety margin exactly when you’re most exposed—during long rebuilds on big drives, under real workload, with the kind of latent sector errors that only show up when you’re already having a bad day.
If you’ve never lost a pool during a resilver, RAIDZ3 can feel like overkill. If you have, RAIDZ3 feels like the one time finance approves an insurance policy that actually pays out. This piece is about making that decision without religion, with numbers you can reason about, and with commands you can run at 3 a.m. when the pager is louder than your confidence.
What RAIDZ3 actually is (and isn’t)
In ZFS terms, RAIDZ is parity RAID implemented at the vdev level. A RAIDZ3 vdev can lose any three drives in that vdev and keep serving data. It does this by storing three independent parity blocks for each RAIDZ stripe (conceptually RAID-6 extended with a third parity, sometimes marketed as RAID 7 or triple-parity RAID, but ZFS does it with variable stripe width and copy-on-write semantics).
What it is:
- Triple-parity protection per vdev. Lose three disks in the same RAIDZ3 vdev, still online. Lose four, you’re restoring backups and practicing humility.
- Designed for long rebuild windows and real-world failure modes: second (and third) disk failures during resilver, and latent media errors uncovered by full-disk reads.
- Operationally simple compared to more exotic layouts. The on-call gets a familiar set of commands and failure behavior.
What it is not:
- Not a substitute for backups. RAIDZ3 reduces the chance of losing a pool to multiple drive failures; it does not protect you from deletion, ransomware, application bugs, or “I thought I was on staging.”
- Not a performance panacea. It can be fast for sequential reads/writes, but small random writes pay a parity tax. The parity tax is not theoretical; it is the sound of your latency SLOs coughing.
- Not “set it and forget it.” You still need scrubs, SMART monitoring, burn-in, and sane replacement procedures.
One joke, because storage people deserve joy too: RAIDZ3 is like wearing a helmet on a bicycle path—you look paranoid until the one time you aren’t.
Facts & history that matter
Here are some short, concrete context points that actually influence how you plan a RAIDZ3 pool:
- ZFS was built around end-to-end checksums (data + metadata), making silent corruption detectable and repairable when redundancy exists.
- RAIDZ was created to avoid the “write hole” common in classic RAID5/6 controllers under power loss, by using copy-on-write transactional semantics.
- Drive capacities have grown faster than drive IOPS for two decades. Your rebuild window is largely limited by I/O, not by how much you wish harder.
- URE rates exist even when vendors don’t like talking about them. On large drives, the probability of encountering an unreadable sector during a full-disk read is not fantasy; it’s arithmetic.
- Scrubs became operationally “mandatory” in ZFS culture because they surface latent errors before a resilver forces you to read everything at the worst possible time.
- Advanced Format (4K) sectors changed the cost of misalignment. Wrong ashift can quietly turn “fine” into “why is this so slow?” forever.
- SMR drives entered the mainstream and made “rebuild behavior” a first-class purchasing criterion again. A resilver that takes ages is not a resilver; it’s an extended outage rehearsal.
- Enterprise embraced “failure domains” (rack, shelf, expander, HBA) and stopped thinking of “disk failure” as an isolated event. Triple parity is partly about correlated failures.
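The URE point above is exactly the kind of arithmetic worth running yourself. Here is a toy calculation; the 16 TB capacity and the 1-per-10^15-bits spec are illustrative spec-sheet numbers, not a claim about any particular drive:

```shell
# P(at least one URE over a full-disk read) ~= 1 - exp(-bits_read * ure_rate)
# Illustrative inputs: 16 TB drive, vendor spec of 1 URE per 1e15 bits read.
awk 'BEGIN {
  tb   = 16               # drive capacity, decimal TB (assumed)
  bits = tb * 1e12 * 8    # bits read in one full pass
  rate = 1e-15            # UREs per bit read (spec-sheet number, assumed)
  p = 1 - exp(-bits * rate)
  printf "P(URE during one full read of a %dTB drive): %.1f%%\n", tb, p * 100
}'
```

Under these assumptions, roughly one in eight full-disk reads hits an unreadable sector. A resilver reads every surviving disk end to end, which is precisely when that arithmetic stops being hypothetical.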
Why triple parity exists: failure math you can feel
People sometimes frame RAIDZ3 as “for people who don’t trust drives.” That’s not quite it. RAIDZ3 is for people who understand that time is the enemy.
The pool is safest right after a scrub completes and all disks are healthy. The pool is most vulnerable during a resilver, when:
- Every disk is being read heavily (or at least more than usual), which is when marginal drives show their true personality.
- You’re running degraded—meaning your redundancy budget is already spent.
- You often keep serving production workload, so latency spikes, queues build, and “background I/O” becomes “foreground pain.”
With RAIDZ1, a second failure during resilver kills the vdev. With RAIDZ2, a third failure can. With RAIDZ3, you can survive that third failure and keep your pool alive long enough to replace hardware without a midnight restore party.
But the real driver isn’t just “another drive fails.” It’s “something becomes unreadable when we must read it.” ZFS is very good at detecting bad data. Redundancy is what makes “detect” turn into “repair.” Triple parity gives you more room for that repair to succeed when the resilver is reading the whole world.
The real tradeoffs: capacity, performance, operational cost
Capacity: the obvious cost
In a single RAIDZ3 vdev with N drives of equal size, usable capacity is roughly (N − 3) drives worth (minus a little for ZFS metadata and slop). That means:
- 8 drives in RAIDZ3: usable ≈ 5 drives worth
- 12 drives in RAIDZ3: usable ≈ 9 drives worth
- 16 drives in RAIDZ3: usable ≈ 13 drives worth
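The same arithmetic as a tiny helper, assuming equal-size drives and ignoring metadata and slop (drive counts and the 16 TB size are just examples):

```shell
# Usable capacity of a RAIDZ3 vdev: (N - 3) data drives' worth.
raidz3_usable_tb() {
  drives=$1; size_tb=$2
  echo "$(( (drives - 3) * size_tb )) TB usable from ${drives} x ${size_tb} TB"
}
raidz3_usable_tb 8 16
raidz3_usable_tb 12 16
raidz3_usable_tb 16 16
```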
RAIDZ3 becomes easier to justify as vdev width grows—because the parity overhead is a smaller fraction of total. The catch: wider RAIDZ vdevs increase rebuild time and can increase the blast radius of correlated issues if your enclosure/HBA path is a single failure domain. Engineering is never a free lunch; it’s more like a buffet where everything has hidden sodium.
Performance: parity tax, but not uniformly
Workload matters more than parity level. RAIDZ3 tends to be:
- Great at sequential reads (lots of spindles contribute).
- Good at sequential writes when records align well and you aren’t doing pathological sync behavior.
- Not great at small random writes, because parity RAID requires reading/modifying/writing stripes unless you’re doing full-stripe writes.
Triple parity can increase CPU overhead a bit (parity math), but in modern systems the dominant performance constraints are usually disks, queue depth, and write amplification—not XOR math. Still, don’t build a RAIDZ3 pool on an underpowered CPU and then act surprised when checksums and compression compete with parity at high throughput.
Operational cost: the part nobody budgets for
RAIDZ3 changes how you operate because it changes what “acceptable risk” looks like. In practice it can let you:
- Resilver during business hours without holding your breath quite as hard.
- Use larger drives without pretending rebuilds are still 6 hours like it’s 2012.
- Run wider vdevs without living in constant fear of the third failure.
But it can also tempt teams into sloppy behavior: deferring scrubs, ignoring SMART warnings, or running too close to full because “we have RAIDZ3.” That’s how robust designs become fragile in production: not because the math was wrong, but because humans got comfortable.
When RAIDZ3 is worth the disks
RAIDZ3 is a good idea when the cost of downtime or restore is high, and the probability of “multiple problems during rebuild” is non-trivial. Here are situations where I’ve seen it be the right call:
1) Very large drives and long resilver windows
If your resilvers take days, not hours, your exposure window is long enough that “two things go wrong” stops being a rare event. RAIDZ3 is basically buying a wider margin for the time dimension.
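A back-of-envelope way to size that exposure window. The 20 TB capacity and 120 MB/s sustained rate are assumptions; your last scrub's throughput is a far better input than any number in a blog post:

```shell
# Exposure window ~= replacement drive capacity / sustained resilver rate.
awk 'BEGIN {
  tb   = 20     # replacement drive size, decimal TB (assumed)
  mbps = 120    # sustained resilver throughput under production load (assumed)
  hours = tb * 1e6 / mbps / 3600
  printf "~%.0f hours (%.1f days) degraded, best case\n", hours, hours / 24
}'
```

Nearly two days of degraded operation, best case, with production load still running. That is the interval in which "two things go wrong" has to stay improbable.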
2) High-capacity archival or backup repositories
Ironically, backup storage often needs more redundancy than primary storage. Why? Because backup systems get hammered during restores—exactly when you can’t afford the pool to collapse. Also, backups are often built with huge drives and wide vdevs to optimize $/TB, which increases rebuild time.
3) Pools with known correlated failure risk
Examples: single JBOD shelf, single expander, a batch of drives from the same manufacturing window, or environments with vibration/thermal issues. RAIDZ3 doesn’t fix correlated failures, but it gives you room to handle them without turning the incident into a disaster.
4) “Hands-off” environments where replacement isn’t immediate
Remote sites, labs, edge deployments, or anything where a human can’t swap a drive within an hour. If your mean time to replacement is measured in days, parity is cheap compared to the operational reality.
5) Compliance-driven durability targets
If you have internal requirements like “two concurrent failures plus one latent error must not cause data loss,” RAIDZ3 is one of the simplest ways to get there without building elaborate multi-pool replication schemes (which you may still do, but at least the local pool isn’t one bad week away from a restore).
When RAIDZ3 is the wrong answer
Triple parity isn’t a badge of honor. It’s a trade. Skip RAIDZ3 when:
You need small random write IOPS more than you need capacity efficiency
Databases, VM storage, and latency-sensitive systems usually want mirrors (and sometimes special vdevs) rather than wide RAIDZ. RAIDZ3 will work, but it will often be the slowest acceptable option and sometimes not acceptable.
You can get your risk down better with replication
If you already have fast, tested replication to another system and you can tolerate losing a pool without losing data, then spending three disks on parity might be less valuable than spending those disks on a second pool or a replica target. The key phrase is “tested.” Untested replication is a bedtime story, not a disaster recovery plan.
You’re tempted to build a single, massive vdev because RAIDZ3 feels safe
One huge vdev can be fine, but it concentrates risk in one failure domain. RAIDZ3 makes it more survivable, not invincible. If you’re building “the one pool to rule them all,” consider whether multiple vdevs or multiple pools with replication actually fits your fault model better.
Vdev width, ashift, recordsize: layout decisions that decide outcomes
Vdev width: 8–16 drives is where the arguments get real
RAIDZ3 often shows up in widths like 10, 11, 12, 14, 16 drives. Wider vdevs mean:
- More usable capacity per parity disk (parity overhead fraction drops).
- Higher sequential throughput potential (more spindles participate).
- Longer resilvers and scrubs (more data to read, more drives involved).
A practical heuristic: if you’re choosing RAIDZ3, you’re already signaling that resilver risk matters. Don’t then choose an extreme width that makes resilvers unbearably long unless you have a strong operational reason (and a monitoring story).
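To put numbers on the parity fraction, here is the overhead for a few candidate widths (pure arithmetic, no pool required):

```shell
# RAIDZ3 parity overhead = 3 / N of raw capacity.
for n in 6 8 10 12 14 16; do
  awk -v n="$n" 'BEGIN { printf "width %2d: %4.1f%% of raw capacity is parity\n", n, 300 / n }'
done
```

The curve flattens quickly: going from 12 to 16 drives only buys you about six points of efficiency, while adding a third more data to read during every scrub and resilver.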
ashift: set it right once, or regret it forever
ashift controls the pool’s sector alignment and is set per vdev at creation. Modern disks are effectively 4K-sector devices even when they report 512-byte logical sectors. Most production pools should use ashift=12 (4K). Some SSDs perform better at ashift=13 (8K); the key is to pick based on your actual hardware and not let ZFS auto-guess on mixed gear.
You can’t change ashift after creation. You can only migrate.
recordsize and volblocksize: match the workload
For file datasets, recordsize caps how large a block ZFS will write for a file. A larger recordsize helps sequential workloads and reduces metadata overhead. For zvols (block devices), volblocksize is fixed at creation and should be aligned with the application (often 8K–64K depending on DB/VM patterns).
RAIDZ parity penalties are worst when you force partial-stripe writes repeatedly. You can mitigate that by:
- Using appropriate record sizes for your workload.
- Enabling compression to reduce physical writes (often a net win).
- Avoiding pathological sync writes unless you’ve designed for them (SLOG, latency budget).
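To see why small blocks hurt, count sectors for a single block on an ashift=12 RAIDZ3 vdev. This is a simplified model of OpenZFS RAIDZ allocation (three parity sectors per stripe row, each allocation padded up to a multiple of parity+1 sectors to avoid unusable gaps); real layouts also depend on compression and gang blocks:

```shell
# Physical footprint of one logical block on RAIDZ3 with 4K sectors (ashift=12).
block_cost() {
  kb=$1; width=$2
  data=$(( (kb + 3) / 4 ))                      # 4K data sectors, rounded up
  rows=$(( (data + width - 4) / (width - 3) ))  # stripe rows: ceil(data / (width - 3))
  total=$(( data + 3 * rows ))                  # three parity sectors per row
  pad=$(( (4 - total % 4) % 4 ))                # pad to a multiple of parity+1 = 4
  echo "${kb}K block on ${width}-wide RAIDZ3: $(( (total + pad) * 4 ))K on disk"
}
block_cost 8 10     # -> 32K on disk: 4x inflation
block_cost 128 10   # -> 192K on disk: ~1.5x, near the ideal 10/7 ratio
```

An 8K write into a 10-wide RAIDZ3 burns four times its logical size; a 128K record gets close to the theoretical overhead. That is the entire recordsize argument in two lines of output.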
Practical operations: commands, interpretations, and what “good” looks like
Below are hands-on tasks I expect an on-call SRE or storage engineer to be able to do on a RAIDZ3 pool. Every command here is something you can run. The interpretations are the part that prevents “ran commands, still confused.”
Task 1: Confirm the pool layout (and verify it’s actually RAIDZ3)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 04:12:33 with 0 errors on Mon Dec 23 02:11:10 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d5 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d6 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d7 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d8 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d9 ONLINE 0 0 0
wwn-0x5000c500a1b2c3da ONLINE 0 0 0
wwn-0x5000c500a1b2c3db ONLINE 0 0 0
wwn-0x5000c500a1b2c3dc ONLINE 0 0 0
wwn-0x5000c500a1b2c3dd ONLINE 0 0 0
errors: No known data errors
Interpretation: You’re looking for the raidz3-0 line. If it says raidz2-0 or raidz1-0, you don’t have triple parity, no matter what the wiki says. Also note the device names: using wwn-* identifiers is good practice; it avoids device renumbering surprises.
Task 2: Show health at a glance (watch for silent degradation)
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,capacity,health
NAME SIZE ALLOC FREE FRAG CAPACITY HEALTH
tank 145T 92.1T 52.9T 21% 63% ONLINE
Interpretation: Capacity and fragmentation matter. RAIDZ pools running hot (80–90% full) often suffer from allocation inefficiency and performance cliffs. RAIDZ3 doesn’t change that physics. If you’re above ~80%, expect writes to get spikier and scrubs to take longer.
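This is the kind of check worth wiring into cron or a monitoring agent rather than eyeballing. The sketch below parses `zpool list -H -o name,capacity` output; canned input is piped in so the logic is visible without a live pool, and the 80% threshold is a judgment call, not a ZFS constant:

```shell
# Warn when any pool crosses a fill threshold.
# In production, replace the printf with: zpool list -H -o name,capacity
printf 'tank\t63%%\nbackup\t87%%\n' \
  | awk -F'\t' '{
      pct = $2; sub(/%/, "", pct)               # strip the % sign
      if (pct + 0 >= 80)                        # threshold: a judgment call
        printf "WARN: %s at %s used - expect spikier writes, longer scrubs\n", $1, $2
    }'
```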
Task 3: Check ashift (alignment) on the vdev
cr0x@server:~$ zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
56: vdev_tree:
88: ashift: 12
Interpretation: ashift: 12 usually indicates 4K alignment. If you see ashift: 9 on modern 4K drives, you may be paying a permanent penalty. Fixing it means rebuilding the pool and migrating data.
Task 4: Start a scrub (and confirm it’s progressing)
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Tue Dec 24 01:02:18 2025
3.21T scanned at 1.10G/s, 1.02T issued at 350M/s, 92.1T total
0B repaired, 1.11% done, 3 days 01:20:11 to go
config:
...
Interpretation: Pay attention to issued throughput, not just scanned. If issued is low, ZFS is likely throttled by workload contention, device errors, or slow disks. A scrub ETA of days is normal on large pools, but sudden changes in rate can signal a problem.
Task 5: Find which disks are throwing errors
cr0x@server:~$ zpool status -v tank
...
wwn-0x5000c500a1b2c3d8 ONLINE 0 2 0
wwn-0x5000c500a1b2c3dc ONLINE 4 0 0
...
Interpretation: Non-zero READ/WRITE/CKSUM counts are not “fine” just because the pool is ONLINE. For RAIDZ, a few transient errors can happen (cables, expander hiccups), but repeated errors on one disk are an early warning. Correlate with SMART and kernel logs before deciding whether it’s media, path, or controller.
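Eyeballing counters works on a 10-disk vdev; on a 90-bay JBOD it doesn’t. A filter like this pulls out just the suspects. It’s fed canned `zpool status` device lines here so it runs anywhere; the field positions assume the standard NAME/STATE/READ/WRITE/CKSUM layout:

```shell
# Print only device lines with a non-ONLINE state or non-zero error counters.
# In production, pipe `zpool status tank` through the awk instead of the heredoc.
awk '$1 ~ /^wwn-/ && ($2 != "ONLINE" || $3 + $4 + $5 > 0) {
  printf "suspect: %s state=%s R=%s W=%s C=%s\n", $1, $2, $3, $4, $5
}' <<'EOF'
wwn-0x5000c500a1b2c3d7 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d8 ONLINE 0 2 0
wwn-0x5000c500a1b2c3dc ONLINE 4 0 0
EOF
```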
Task 6: Correlate ZFS errors with kernel logs
cr0x@server:~$ sudo dmesg -T | egrep -i 'sd[a-z]|sas|ata|reset|abort|I/O error' | tail -n 30
[Mon Dec 23 21:44:02 2025] sd 6:0:12:0: [sdl] tag#211 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Dec 23 21:44:02 2025] sd 6:0:12:0: [sdl] tag#211 CDB: Read(16) 88 00 00 00 00 00 1a 2b 3c 40 00 00 00 80 00 00
[Mon Dec 23 21:44:02 2025] blk_update_request: I/O error, dev sdl, sector 439041088
Interpretation: “Soft errors” and resets can be a flaky link. Media errors tend to show as consistent LBA failures on the same drive. If multiple drives on the same HBA lane show resets, suspect cabling, expander, or firmware.
Task 7: Check SMART health on a suspect disk
cr0x@server:~$ sudo smartctl -a /dev/sdl | egrep -i 'Model|Serial|Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|SMART overall'
Model Family: Seagate Exos
Device Model: ST16000NM000J
Serial Number: ZL0ABC12
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct 0
Current_Pending_Sector 3
Offline_Uncorrectable 3
UDMA_CRC_Error_Count 0
Interpretation: “PASSED” is not a clean bill of health. Pending and offline-uncorrectable sectors are red flags, especially during scrub/resilver. If CRC errors are climbing, suspect cabling/backplane. If pending/offline are non-zero and increasing, plan a replacement.
Task 8: Replace a failed disk using stable identifiers
cr0x@server:~$ zpool status tank | grep -E 'DEGRADED|FAULTED|OFFLINE|UNAVAIL'
state: DEGRADED
cr0x@server:~$ sudo zpool replace tank wwn-0x5000c500a1b2c3d8 /dev/disk/by-id/wwn-0x5000c500deadbeef
Interpretation: Use /dev/disk/by-id paths (WWN) rather than /dev/sdX. After issuing zpool replace, watch resilver progress. If you swapped in the wrong bay and replaced a good disk, RAIDZ3 might save you—but don’t treat that as a feature.
Task 9: Monitor resilver and confirm it’s not silently stalling
cr0x@server:~$ watch -n 10 'zpool status tank | sed -n "1,25p"'
Every 10.0s: zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Tue Dec 24 03:18:11 2025
12.4T scanned at 780M/s, 3.10T issued at 196M/s, 92.1T total
3.09T resilvered, 3.37% done, 1 day 19:02:10 to go
...
Interpretation: A resilver that shows progress in scanned but barely increases issued can mean contention or errors. If the ETA starts expanding without workload changes, investigate I/O wait, slow disks, or a device repeatedly resetting.
Task 10: Identify top latency contributors at the block layer
cr0x@server:~$ iostat -x 5 3
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.1 0.0 4.3 21.7 0.0 61.9
Device r/s w/s rkB/s wkB/s await svctm %util
sda 1.2 12.3 44.1 981.2 8.9 0.9 15.2
sdl 84.0 10.1 9032.0 1048.0 94.7 1.2 99.8
sdm 79.2 9.8 8900.0 1002.0 11.1 1.1 86.7
Interpretation: %util near 100% with high await on a single disk (here sdl) is a classic “one disk is dragging the vdev” symptom. RAIDZ performance is gated by the slowest participant during parity operations. Investigate that disk’s SMART and link stability.
Task 11: Check ZFS dataset properties that commonly affect RAIDZ behavior
cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,compression,atime,sync,logbias tank/data
NAME PROPERTY VALUE
tank/data recordsize 1M
tank/data compression zstd
tank/data atime off
tank/data sync standard
tank/data logbias latency
Interpretation: A 1M recordsize can be excellent for media/backup workloads. Compression often helps parity RAID by reducing physical writes. sync=standard is usually correct; forcing sync=always without a designed SLOG can cause write latency to explode.
Task 12: See real-time ZFS I/O behavior (and spot a parity-bound workload)
cr0x@server:~$ sudo zpool iostat -v tank 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 92.1T 52.9T 420 1200 1.8G 310M
raidz3-0 92.1T 52.9T 420 1200 1.8G 310M
wwn-...c3d4 - - 42 118 180M 31.0M
wwn-...c3d5 - - 41 120 176M 30.8M
wwn-...c3d6 - - 43 117 182M 31.1M
wwn-...c3d7 - - 40 121 171M 30.9M
wwn-...c3d8 - - 38 125 160M 31.5M
wwn-...c3d9 - - 44 119 186M 30.7M
wwn-...c3da - - 42 120 180M 31.0M
wwn-...c3db - - 43 118 183M 30.9M
wwn-...c3dc - - 44 121 185M 31.2M
wwn-...c3dd - - 43 121 182M 31.2M
-------------------------- ----- ----- ----- ----- ----- -----
Interpretation: Balanced per-disk bandwidth is what you want. If a few disks show much lower throughput (or much higher latency in iostat -x), your “RAIDZ3 is slow” complaint might actually be “one disk is bad and the vdev is politely waiting for it.”
Task 13: Verify that autotrim is doing what you think (for SSD vdevs)
cr0x@server:~$ zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
Interpretation: For SSD-based pools, autotrim typically helps maintain steady-state performance. On HDD RAIDZ3 it’s irrelevant, but mixed pools exist, and people do surprising things under budget pressure.
Task 14: Check for “special vdev” dependency before you celebrate redundancy
cr0x@server:~$ zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
...
special
mirror-1 ONLINE 0 0 0
nvme-INTEL_SSDPE2KX040T8 ONLINE 0 0 0
nvme-INTEL_SSDPE2KX040T9 ONLINE 0 0 0
Interpretation: If you have a special vdev, you’ve created a new dependency: losing it can make the pool unusable even if RAIDZ3 is fine. Mirror your special vdev properly, monitor it aggressively, and understand what metadata/classes you placed there.
Task 15: Confirm you’re not one reboot away from “why didn’t it import?”
cr0x@server:~$ sudo zpool import
pool: tank
id: 12345678901234567890
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
tank ONLINE
raidz3-0 ONLINE
...
Interpretation: This is a sanity check during maintenance windows or after hardware changes. If zpool import shows unexpected devices missing, fix the cabling/pathing before you need to reboot for a kernel update.
Fast diagnosis playbook (bottleneck hunting)
When RAIDZ3 “is slow,” you need a quick path to truth. Here’s the order I use to avoid getting lost in tuning folklore.
Step 1: Is it one bad disk or path?
- Run zpool status -v: look for any disk with errors, DEGRADED state, or resilver/scrub activity.
- Run iostat -x 5: look for one device with much higher await or a pegged %util.
- Check dmesg -T for resets/timeouts on the same device or bus.
Decision: If a single disk is slow or resetting, stop tuning ZFS and start fixing hardware. RAIDZ is a team sport; one player limping ruins the game.
Step 2: Is the pool busy with maintenance?
- Check zpool status for scrub/resilver activity.
- Use zpool iostat -v 5 to see whether bandwidth is being eaten by background reads.
Decision: If a resilver is running, your performance complaint might be “we’re rebuilding a disk while serving production.” That’s not a mystery; that’s a choice. Consider scheduling scrubs and replacements more strategically.
Step 3: Is it sync write behavior?
- Check the dataset properties: zfs get sync,logbias.
- Confirm whether the workload is actually issuing sync writes (databases, NFS, VM storage often do).
Decision: If latency spikes align with sync writes and you don’t have a well-designed SLOG (or you forced sync=always), you’ve found the culprit.
Step 4: Is it a recordsize/volblocksize mismatch causing partial-stripe churn?
- For files: check recordsize against the workload’s I/O size patterns.
- For zvols: verify volblocksize (it can’t be changed after creation without recreating the zvol).
Decision: If your workload is 8K random writes into a wide RAIDZ3, parity overhead will dominate. Consider mirrors, special vdev, or redesign.
Step 5: Is the pool too full or too fragmented?
- Check zpool list for capacity and frag.
- Check dataset quotas/reservations and whether a “nearly full” condition is being hidden by snapshots.
Decision: If you’re above ~80–85% used, the most effective performance “tuning” might be adding capacity or deleting data (carefully). ZFS can do many things, but it cannot allocate blocks that do not exist.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption
The company had a new backup repository: big drives, a wide RAIDZ2 vdev, and a cheerful slide deck that said “two-disk redundancy.” The assumption—spoken and then repeated until it became policy—was that RAIDZ2 meant “we can lose two disks at any time and be fine.” That sentence is only true if you treat “time” as a point, not an interval.
The first disk failed on a Tuesday morning. Ticket opened, replacement ordered, no real urgency because “we’re still protected.” The replacement arrived late Wednesday. During the resilver, performance dipped, and the backup ingestion started lagging. Someone saw the lag, assumed it was a tuning problem, and throttled the resilver to “help production.” Now the pool was degraded for longer—exactly what you do not want.
Thursday night, a second disk threw a handful of read errors. ZFS corrected what it could, but the error counters kept climbing. The team still felt okay: “RAIDZ2, right?” Friday morning, the second disk dropped offline. Now they were down to zero redundancy during the still-running resilver.
Then the third event happened: a latent error on a different drive surfaced under full-stripe reads. With RAIDZ2 and two drives already out of the game (one failed, one in distress), the pool didn’t have enough healthy members to reconstruct the stripe. The result wasn’t a clean, dramatic “pool is dead” moment. It was worse: partial metadata became unreadable, services started failing in weird ways, and the restore plan had to be executed under pressure.
Afterward, the lesson wasn’t “RAIDZ2 is bad.” The lesson was that the assumption was wrong. Redundancy is not a static number; it’s a budget you spend during incidents. RAIDZ3 wouldn’t have prevented the first failure. It would have given the team more time to be human without paying for it with data.
Mini-story #2: The optimization that backfired
A different org tried to squeeze more usable TB out of their hardware. They had a RAIDZ3 pool planned, but three parity disks “felt expensive.” Someone proposed a compromise: use RAIDZ2, keep the same vdev width, and “make up safety” by running scrubs more often.
On paper, it sounded clever: scrubs catch latent errors early, so you reduce the chance of hitting UREs during a resilver. The backfire came from the details. Scrubs were scheduled during business hours because nights were reserved for data pipelines, and the team didn’t want to “compete with batch.” So the scrubs ran while the pool was also serving interactive workloads.
Latency climbed, someone complained, and the response was predictable: scrub throttling, then skipping scrubs, then eventually ignoring the alerts because “it always finishes eventually.” Meanwhile, the pool kept filling up, and write amplification quietly increased. Months later, when a drive failed and the resilver started, the pool was hot (high utilization), fragmented, and hadn’t had a clean scrub in a long time.
During that resilver, two things happened: the replacement drive was slower than the rest (not defective, just a different model with different sustained behavior), and another drive started showing pending sectors. RAIDZ2 survived, but only barely—and the performance impact became a production incident on its own. They got lucky. Luck is not an engineering control.
The final decision was ironically more expensive: they added capacity to reduce pool fill, rebalanced workload scheduling, and eventually migrated to RAIDZ3 on the next refresh. The “optimization” wasn’t wrong because it tried to be efficient. It was wrong because it assumed operational discipline would remain perfect over time, even as the team and workload changed.
Mini-story #3: The boring but correct practice that saved the day
My favorite reliability stories are painfully unsexy. One team ran a RAIDZ3 pool for a large media pipeline. Nothing fancy: periodic scrubs, SMART checks, and strict rules about burn-in testing and labeling drives. The rules were annoying enough that new hires rolled their eyes. The rules were also why the pool survived a genuinely ugly week.
It started with a drive that began logging CRC errors. Instead of immediately replacing the disk, the on-call treated it as a path issue first—because they’d been burned by unnecessary replacements before. They reseated a cable, moved the drive to a different bay, and watched the counters. Errors stopped. That drive stayed in service, and the incident ended as a one-hour maintenance task, not a degraded pool.
Two weeks later, a real disk failed (offline, SMART ugly, the whole tragic opera). They replaced it the same day, because spares were already on-site and the procedure was scripted. Resilver began, and during resilver another disk started throwing a small number of read errors. RAIDZ3 shrugged; the pool remained online and repairable.
Here’s the part that saved them: they had a scrub schedule and they didn’t cancel it “because we’re busy.” The last scrub had completed recently enough that latent errors were rare. So when that second disk started acting up during resilver, it wasn’t dragging a long tail of old unreadable sectors into the worst possible moment.
The postmortem was boring in the best way: replace the failed disk, monitor the flaky one, plan a proactive replacement, and move on. Nobody got promoted for heroics. But nobody got fired for data loss either. In production, boring is a feature.
Common mistakes: symptoms and fixes
This is the section you read when you inherit a pool built by someone who has since “moved on to new opportunities.”
Mistake 1: Building RAIDZ3, then creating a single-disk special vdev
Symptom: Pool is RAIDZ3 but feels “fragile.” A single SSD failure threatens the whole pool.
Why it happens: Special vdev holds metadata (and optionally small blocks). If it’s not redundant, it becomes a single point of failure.
Fix: Mirror the special vdev (at least). If you already built it single-disk, migrate data off and rebuild; ZFS doesn’t magically make single points safe.
Mistake 2: Assuming RAIDZ3 makes scrubs optional
Symptom: First resilver after months leads to unexpected checksum errors, slow rebuild, or unrecoverable blocks.
Fix: Schedule scrubs at a frequency matched to your drive size and workload, and make completion a monitored SLO. Scrubs are not “maintenance”; they’re early detection.
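One way to make the schedule real rather than aspirational. The monthly cadence, pool name, and file path are assumptions; match the frequency to your drive size and to how long your scrubs actually take:

```shell
# /etc/cron.d/zfs-scrub -- kick off a scrub on the 1st of each month at 02:00.
# Pair this with an alert on scrub age from `zpool status`, so a silently
# broken schedule gets noticed before the next resilver does it for you.
0 2 1 * * root /usr/sbin/zpool scrub tank
```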
Mistake 3: Using /dev/sdX paths and then “mysteriously” replacing the wrong disk
Symptom: Disk replacement command targets the wrong device after reboot or controller reorder.
Fix: Always use stable IDs: /dev/disk/by-id/wwn-*. Also label bays physically and match serial numbers before pulling.
Mistake 4: Forcing sync writes without a design (SLOG myths)
Symptom: Latency spikes, throughput collapses, especially for NFS/VM workloads.
Fix: Keep sync=standard unless you have a reason. If you need sync performance, design a proper SLOG (power-loss-protected, mirrored if required by your risk model) and measure.
Mistake 5: Mixing SMR drives into a RAIDZ3 pool unknowingly
Symptom: Resilver/scrub takes forever, pool becomes sluggish under write load, sustained throughput collapses unpredictably.
Fix: Verify drive models; avoid DM-SMR for RAIDZ vdevs unless you truly understand the workload and accept the rebuild behavior. If you already have them, plan a migration.
Mistake 6: Running the pool too full, then blaming RAIDZ3 for being slow
Symptom: Write latency becomes spiky around 80–90% capacity; scrubs slow down; allocations fragment.
Fix: Add capacity or reduce utilization. Consider quotas, reservations, and snapshot retention. “Tune ZFS” is rarely the best first response to “we ran out of elbow room.”
Mistake 7: Choosing recordsize/volblocksize by vibes
Symptom: Parity overhead dominates; random write workloads suffer; CPU and disk churn increases.
Fix: Match recordsize to workload. For zvols, choose volblocksize carefully at creation, then benchmark with realistic I/O sizes.
Checklists / step-by-step plan
Planning checklist: decide whether RAIDZ3 is justified
- Measure your rebuild window today (scrub duration is a proxy; resilver will be similar or worse under load).
- Define operational reality: how fast can you replace a drive (hours vs days)?
- Identify correlated failure domains: same shelf, same expander, same batch, same firmware.
- Quantify restore pain: time to restore, business impact, human cost.
- Pick vdev width intentionally, not because “that’s how many bays we have.”
- Decide whether mirrors would serve the workload better (especially for random I/O).
Build checklist: create a sane RAIDZ3 pool
Example creation (adjust devices for your environment). This is not a copy-paste religion; it’s a pattern.
cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz3 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d5 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d6 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d7 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d8 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d9 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3da \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3db \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3dc \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3dd
Interpretation: Explicit ashift, explicit stable IDs. If you do nothing else, do this.
Operational checklist: keep RAIDZ3 reliable in the boring way
- Scrub on a schedule, and alert on missed/failed scrubs.
- Monitor SMART attributes (pending/offline uncorrectable, CRC errors, reallocated sectors).
- Keep cold spares or rapid replacement paths that match your risk tolerance.
- Track pool fill and snapshot growth like it’s a first-class metric (because it is).
- Test your replacement procedure when you’re calm, not when you’re degraded.
- Document device mapping (bay-to-WWN/serial) and keep it updated after swaps.
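The "alert on missed scrubs" item deserves more than a cron entry that fires `zpool scrub` and hopes. A sketch of a freshness check that parses the `scan:` line of `zpool status`, which ends with a date like `... with 0 errors on Sun Oct  1 03:10:12 2023`. Assumes GNU `date`; the 35-day limit is an example, not a recommendation.

```shell
#!/bin/sh
# Sketch: compute the age in days of the last completed scrub from the
# "scan:" line of `zpool status`. Assumes GNU date for `date -d`.
MAX_AGE_DAYS=35

scrub_age_days() {
    scan_line=$1
    when=${scan_line##* on }          # text after the last " on " is the date
    then_s=$(date -d "$when" +%s) || return 1
    now_s=$(date +%s)
    echo $(( (now_s - then_s) / 86400 ))
}

# Live usage (one pool):
#   age=$(scrub_age_days "$(zpool status tank | grep 'scan:')")
#   [ "$age" -gt "$MAX_AGE_DAYS" ] && echo "WARN: tank last scrubbed ${age}d ago"
age=$(scrub_age_days "scan: scrub repaired 0B in 10:23:45 with 0 errors on Sun Oct  1 03:10:12 2023")
echo "sample scrub age: ${age} days"
```

Checking the completion date rather than the cron log catches the failure mode that actually bites: scrubs that start, then silently never finish.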
Second joke, because we’ve earned it: The fastest way to learn ZFS is to skip scrubs—ZFS will then teach you personally, with interactive labs at 2 a.m.
FAQ
1) Is RAIDZ3 “safer” than mirrors?
They’re safe in different ways. Mirrors offer strong random IOPS and fast rebuilds (copy from a healthy mirror side). RAIDZ3 offers strong capacity efficiency at high redundancy for wide vdevs and better tolerance to multiple concurrent failures within one vdev. For databases/VMs, mirrors often win operationally; for large sequential/backup/archive, RAIDZ3 is often a better fit.
2) How many disks should a RAIDZ3 vdev have?
There isn’t a single right number, but RAIDZ3 is usually justified in wider vdevs where rebuild time and correlated failures become real risks. Practically, 10–16 disks is common. Too narrow and you pay a high parity fraction; too wide and resilvers/scrubs become long and the failure domain grows.
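The "high parity fraction when narrow" point is simple arithmetic: a RAIDZ3 vdev's raw parity fraction is 3/width (ignoring the padding effects discussed elsewhere). A quick sketch:

```shell
#!/bin/sh
# Raw parity fraction of a RAIDZ3 vdev is 3/width (padding ignored).
parity_pct() { awk -v w="$1" 'BEGIN { printf "%.1f", 300 / w }'; }

for w in 6 8 10 12 16; do
    echo "width $w: parity is $(parity_pct "$w")% of raw capacity"
done
```

At width 6 you spend half the raw capacity on parity; at width 10 it is 30%; at width 16, under 19%. The economics push RAIDZ3 toward wider vdevs, and the rebuild-time risk pushes back, which is why 10-16 is the common compromise.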
3) Does RAIDZ3 reduce the chance of data corruption?
ZFS checksums detect corruption regardless of parity level. Redundancy determines whether ZFS can repair it automatically. RAIDZ3 increases the chance that ZFS can reconstruct correct data even when multiple disks have issues during scrub/resilver.
4) Is RAIDZ3 slower than RAIDZ2?
Usually somewhat, especially for small random writes due to additional parity. For sequential workloads the difference may be modest. In practice, “slower” is often dominated by other factors: one slow disk, pool fullness, sync write behavior, or recordsize mismatch.
5) Can I convert RAIDZ2 to RAIDZ3 in place?
Not in the straightforward “flip a switch” sense. You can add vdevs, replace disks with larger ones, and (in newer OpenZFS releases) widen a RAIDZ vdev with RAIDZ expansion, but none of those operations change a vdev’s parity level. If the parity level is wrong, plan a migration, typically send/receive to a newly created pool.
6) Will RAIDZ3 save me from losing a pool during a resilver?
It improves your odds significantly, especially when a second/third disk fails or has read errors during rebuild. It does not guarantee survival against catastrophic correlated failures (expander meltdown, wrong firmware update, operator error pulling multiple disks, etc.). Backups and replication still matter.
7) Should I use RAIDZ3 for VM storage?
Sometimes, but be suspicious. VM storage tends to be random I/O heavy and latency sensitive; mirrors often perform better and are easier to reason about. If you must use RAIDZ3 for VMs, invest in careful block sizing, consider special vdev for metadata/small blocks (mirrored!), and benchmark with real workloads.
8) How often should I scrub a RAIDZ3 pool?
Often enough that you find latent errors before a disk failure forces a full read under degraded conditions. The “right” interval depends on pool size, drive behavior, and workload. Operationally: pick a schedule you can keep, ensure scrubs complete, and alert when they don’t.
9) Is triple parity overkill if I have replication?
Not necessarily. Replication helps with disaster recovery and logical corruption scenarios, but it doesn’t always save you from downtime or long restore windows when a local pool fails. RAIDZ3 can be the difference between “replace a drive and move on” and “restore 100+ TB while the business watches.” The answer depends on your RTO/RPO and how tested your recovery actually is.
10) What’s the biggest “gotcha” with RAIDZ3 in real life?
People over-trust it and under-invest in operations: scrubs, monitoring, proper device IDs, and avoiding single points like unmirrored special vdevs. RAIDZ3 is robust, but it’s not a substitute for discipline.
Conclusion
RAIDZ3 is not for everyone. It’s a specific answer to a specific modern problem: big disks, long rebuilds, and failure modes that arrive in clusters, not in neat, independent events. If losing a pool would be a career-limiting incident, and your rebuild windows are long enough that “another thing can go wrong” is realistic, triple parity is often worth the disks.
The best RAIDZ3 deployments I’ve seen don’t look heroic. They look boring: stable device IDs, sane ashift, scrubs that actually finish, SMART alerts that someone respects, and replacement procedures that don’t depend on tribal knowledge. RAIDZ3 gives you margin. Operations decides whether you spend that margin wisely.