RAIDZ1 is the storage equivalent of driving with a spare tire and convincing yourself it’s a roll cage. It works—until the day you need it to work under stress, at speed, in bad weather, with passengers who don’t care that “it passed the last test.”
Most conversations about RAIDZ1 stop at “one disk can fail.” That’s true, and also dangerously incomplete. The real question is: what is the probability that a second problem happens while you’re rebuilding, and how long are you exposed? That’s the math people don’t do, because it forces a decision: RAIDZ2, mirrors, smaller disks, more spares, better monitoring, faster replacements, or acceptance of risk—consciously, not by default.
What RAIDZ1 actually protects (and what it doesn’t)
RAIDZ1 is single-parity RAID in ZFS: you can lose one device in a vdev and keep operating. That statement is true in the same way that “your parachute can fail once” is true—technically accurate, emotionally unhelpful.
What RAIDZ1 is good at
It protects against a single complete device failure inside a vdev, assuming the remaining devices are readable enough to reconstruct missing data. In practice, RAIDZ1 is most comfortable when:
- drives are smaller (less data to read during rebuild),
- IO load is moderate,
- you have hot spares or fast hands,
- you scrub regularly and replace drives early.
What RAIDZ1 doesn’t protect against
It does not protect you from a second issue during resilver. That second issue isn’t always “another disk dies.” More often, it’s one of these:
- Unrecoverable read errors (URE) on remaining disks while reading to rebuild.
- Latent sector errors that only show up when you stress-read the whole vdev.
- Time-to-replace delays (procurement, someone needs approvals, the spare is in another region).
- Controller/HBA/cabling issues that pick the worst possible time to show up.
- Workload-induced collapse where resilver load plus production load drives timeouts and cascading faults.
One short joke, because you’ve earned it: RAIDZ1 is like a seatbelt made of hope—fine until you actually need it.
The risk calculation most people never do
Here’s the uncomfortable truth: with RAIDZ1, your risk is dominated by what happens during the resilver window. So the calculation is not “probability of one disk failing,” but:
The probability of one disk failing, multiplied by the probability of a second failure or unrecoverable read during the resulting resilver window. That second probability is driven by how long the resilver runs and by operational reality: alerting, spares on hand, concurrent load, and how fast people actually respond.
Step 1: define your failure modes
For a RAIDZ1 vdev of N drives:
- First event: one drive fails (or is faulted, or starts throwing errors you can’t ignore).
- Second event: another drive fails before resilver completes, or you encounter unrecoverable reads on remaining drives when reading blocks needed for reconstruction.
Classic RAID discussions obsess over URE rates printed on spec sheets. That matters, but in ZFS you also have:
- checksum verification that will detect corruption,
- self-healing if redundancy exists (it doesn’t in RAIDZ1 once you’re degraded),
- copy-on-write behavior that changes write patterns and fragmentation.
Step 2: compute your exposure window (resilver time)
The longer your resilver runs, the longer you’re living without parity. If resilver takes 6 hours, you might accept RAIDZ1. If it takes 6 days, you’ve effectively scheduled a weeklong game of “please don’t sneeze.”
Resilver time is not “disk size / sequential speed.” In real production it is:
- bytes actually allocated (ZFS resilvers allocated blocks, not the raw disk),
- pool fragmentation,
- concurrent workload,
- vdev layout and ashift,
- recordsize and IO amplification,
- device error retries and timeouts.
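A back-of-envelope bound is still worth doing before you ever need it: take the pool’s raw allocated bytes, divide by vdev width to approximate what the replacement disk must absorb, then divide by a sustained rate you actually believe. The numbers below are illustrative, the 150 MiB/s figure is an assumption rather than a promise, and real resilvers under mixed load commonly take two to five times this floor.
cr0x@server:~$ zpool list -Hp -o allocated tank
57614409295462
cr0x@server:~$ awk -v alloc=57614409295462 -v width=4 -v mibs=150 'BEGIN { per=alloc/width; printf "per-disk share: %.2f TiB, best-case resilver: %.1f h\n", per/2^40, per/(mibs*2^20)/3600 }'
per-disk share: 13.10 TiB, best-case resilver: 25.4 h
Interpretation: This is a floor, not a forecast. It also tells you roughly how much the survivors must read: about (width - 1) times the per-disk share, which feeds directly into the URE math in the next step.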
Step 3: model URE/latent error risk in plain English
Drive vendors quote URE as something like 1 error per 10^14 bits (consumer-ish) or 10^15 bits (better), sometimes 10^16 on paper in newer enterprise gear. Translate that:
- 10^14 bits ≈ 12.5 TB read per expected URE
- 10^15 bits ≈ 125 TB read per expected URE
During a degraded RAIDZ1 rebuild, you may need to read a lot of data from the surviving disks. ZFS does not have magic here: if a sector is unreadable and you have no redundancy left, you can lose data. Depending on the block and metadata involved, it could be a file, a dataset, or a pool-level failure in worst cases.
The trap: people multiply “URE rate” by “disk size” and call it a day. But the right number is “how much data must be read from remaining disks to reconstruct what’s missing,” which can approach the size of the vdev’s allocated data, plus overhead from retries and metadata.
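To put a number on it, here is a hedged sketch: take the per-disk share from the earlier estimate, multiply by the three surviving disks (roughly 39 TiB to read in this example), and compare that against the quoted URE rate. The spec-sheet rate is a coarse worst-case figure and real drives often do better, so treat the result as an order of magnitude, not a prediction.
cr0x@server:~$ awk -v tib_read=39.3 -v ure_exp=15 'BEGIN { bits=tib_read*2^40*8; lambda=bits/(10^ure_exp); printf "expected UREs: %.2f, P(at least one) ~ %.1f%%\n", lambda, (1-exp(-lambda))*100 }'
expected UREs: 0.35, P(at least one) ~ 29.2%
Interpretation: At 10^14 bits the same read gives an expected 3.5 UREs, so plan on hitting at least one; at 10^16 it drops to a few percent. That spread is exactly why “how much must I read while degraded” dominates the decision, and why scrubs that surface weak sectors early earn their IO cost.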
A practical risk framing that changes decisions
Instead of pretending we can precisely compute failure probability, operational teams do better with thresholds:
- If resilver routinely finishes inside hours, and you can replace drives immediately, RAIDZ1 can be acceptable for non-critical pools.
- If resilver often takes multiple days, RAIDZ1 becomes a business decision you should document like any other risk acceptance.
- If you cannot guarantee fast replacement, or you run high utilization, RAIDZ1 is frequently a false economy.
The second short joke (and we’ll stop there)
Buying bigger disks to “reduce the number of failure points” is how you end up with fewer disks and more failure.
Why modern disks change everything
RAIDZ1 was born in a world where “large disk” meant hundreds of gigabytes, not tens of terabytes. Back then, rebuild windows were shorter, and the probability of encountering a read error during rebuild was materially lower simply because you didn’t have to read as much data.
Drive size growth outpaced rebuild speed
Capacity exploded; sequential throughput improved, but not at the same rate—and random IO did not magically become cheap. Meanwhile, resilver is rarely pure sequential IO. It’s often “seek-heavy reads + parity math + metadata traversals,” especially on busy pools.
Workloads got noisier
Virtualization, containers, CI pipelines, and log-heavy systems create IO patterns that are the opposite of “nice.” ZFS can handle a lot, but degraded RAIDZ1 under mixed random IO is where you discover the difference between “it works in a benchmark” and “it works at 2 a.m. on a Tuesday.”
Helium drives and SMR complicate assumptions
Modern drives can be excellent, but they’re also more diverse. SMR (shingled magnetic recording) drives in particular can turn resilver into a multi-day slog if you’re unlucky or misinformed. Even with CMR drives, firmware behavior under error conditions can lead to long timeouts.
ZFS is resilient, but not a miracle worker
ZFS gives you end-to-end checksums and self-healing—when redundancy exists. In degraded RAIDZ1, you’ve temporarily traded “self-healing” for “please read perfectly.” That’s a fine trade for short windows. It’s a poor trade for long ones.
Interesting facts and historical context
Storage arguments are older than most of the data we’re protecting. A few concrete points that matter to how RAIDZ1 is perceived today:
- ZFS was designed at Sun Microsystems with the explicit goal of fixing silent data corruption using checksums and copy-on-write, not just “keeping disks spinning.”
- RAID-5 write hole issues (power loss mid-stripe) were a longstanding concern in traditional RAID; ZFS avoids many of those pitfalls with transactional semantics, but parity still has rebuild risk.
- “RAID is not a backup” became a cliché because too many orgs treated parity as recovery, not availability. The cliché exists because it keeps being true.
- URE rates became famous when multi-terabyte disks made “read the entire array during rebuild” a normal event rather than a rare one.
- Enterprise arrays quietly shifted toward dual-parity defaults as disk sizes climbed, even when marketing kept talking about “efficiency.”
- ZFS scrubs are a cultural artifact of checksum-based storage: the idea that you periodically verify the universe is intact, not just wait for corruption to surprise you.
- 4K sector alignment (ashift) became non-negotiable as disks moved away from 512B physical sectors; misalignment can quietly burn performance and increase wear.
- RAIDZ expansion wasn’t available for years, which pushed operators into planning vdev width carefully; today expansion exists in some implementations, but it’s still not “resize without consequences.”
- Mirrors remained popular in high-IO environments because rebuilds are simpler and often faster; parity wins on capacity, mirrors win on operational forgiveness.
Three corporate-world mini-stories from the trenches
1) Incident caused by a wrong assumption: “Degraded is fine, we’ll replace it next week”
A mid-sized company ran a ZFS pool backing internal build artifacts and VM templates. Nothing customer-facing, so the storage was treated like a utility: important, but not urgent. The pool was a single RAIDZ1 vdev of large HDDs, sized for capacity. It ran at high utilization because “unused storage is wasted budget.”
One drive faulted on a Friday afternoon. Alerts fired, a ticket was opened, and someone wrote the classic note: “Pool is degraded but functioning; replacement scheduled.” The replacement part wasn’t in-house. Procurement did what procurement does.
Over the weekend, the pool stayed busy because CI jobs don’t care about your staffing model. Monday morning, a second drive didn’t fully fail—it started throwing intermittent read errors. ZFS tried hard, but in degraded RAIDZ1, intermittent read errors can be as deadly as a clean failure because the pool needs clean reads to reconstruct missing blocks.
The symptom wasn’t dramatic at first: some builds started failing in weird ways. A few VMs refused to boot from templates. Then ZFS reported permanent errors. The team lost a subset of artifacts and had to rebuild some templates from older sources. The business impact wasn’t existential, but it was a week of expensive engineering attention.
The wrong assumption wasn’t “disks fail.” The wrong assumption was that a degraded RAIDZ1 pool is stable enough to defer action. In reality, degraded is a state of emergency whose severity is proportional to resilver time and replacement latency.
2) Optimization that backfired: “Let’s use wider RAIDZ1 for better efficiency”
A different org—bigger, more process-heavy—wanted to reduce rack count. Someone proposed fewer, larger vdevs: wide RAIDZ1 groups to maximize usable capacity. On paper, it looked elegant: fewer vdevs, simpler layout, less overhead.
In production, they discovered two problems. First, the resilver window ballooned. Wider vdevs meant more total data in the vdev, more metadata, and more opportunities for slow devices to drag the whole group. Second, performance during resilver cratered under mixed workloads; the pool remained online, but user-facing services experienced latency spikes that looked like application bugs.
The backfire arrived when a disk was replaced and resilver ran for days. During that time, another disk in the same vdev started logging high read error counts. It didn’t fully die, but the error retries made the resilver even slower, stretching the exposure window further. The org avoided total loss, but the incident response consumed multiple teams and triggered a painful capacity redesign.
They learned a lesson that never shows up in simplistic capacity math: operational efficiency beats raw capacity efficiency. The cheapest terabyte is the one that doesn’t drag your on-call team into a weeklong outage investigation.
3) Boring but correct practice that saved the day: scrubs, spares, and fast replacement
A third company ran ZFS for a moderately critical internal platform: databases had their own replication, but the ZFS pool still mattered. They used RAIDZ2 for most pools, but had one RAIDZ1 pool for ephemeral-ish workloads. The difference wasn’t heroics; it was discipline.
They scrubbed on a schedule and actually read the scrub reports. SMART data was monitored, and drives with rising reallocated sector counts were replaced early—before they failed outright. They also kept cold spares in the same data center, labeled and tested.
When a drive failed, it was replaced within an hour, not a week. The resilver started immediately, and the workload was temporarily throttled (pausing non-essential jobs) to shorten the resilver window. The pool spent minimal time degraded.
That incident didn’t become a war story because it was engineered to be boring. In SRE terms, they bought down risk with predictable process: early warning, fast hands, and limiting blast radius. That’s not exciting, but it’s what “reliability” looks like.
Practical tasks: commands, outputs, and what they mean
Below are practical, runnable tasks you can do on a ZFS system. Commands assume OpenZFS on a Linux host with a pool named tank. Adjust names for your environment.
Task 1: Identify vdev layout and confirm you’re actually RAIDZ1
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 16:44:33 with 0 errors on Sun Dec 22 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST16000NM001G-1 ONLINE 0 0 0
ata-ST16000NM001G-2 ONLINE 0 0 0
ata-ST16000NM001G-3 ONLINE 0 0 0
ata-ST16000NM001G-4 ONLINE 0 0 0
errors: No known data errors
Interpretation: You have a single RAIDZ1 vdev with four drives. One drive can fail. During degraded mode, you have zero parity margin.
Task 2: Check how full the pool is (high utilization worsens everything)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 38.2T 4.10T 192K /tank
tank/vm 25.6T 4.10T 25.6T /tank/vm
tank/logs 12.6T 4.10T 12.6T /tank/logs
Interpretation: The pool is roughly 90% used. Expect slower allocations, more fragmentation, and longer resilvers. If you run RAIDZ1 this full, you’re volunteering for drama.
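The zfs list output shows dataset-level usage; for the pool-level view that includes capacity percentage and fragmentation, zpool list is a quick companion check. The output below is illustrative for the same pool.
cr0x@server:~$ zpool list -o name,size,allocated,free,capacity,fragmentation,health tank
NAME   SIZE  ALLOC   FREE  CAP  FRAG  HEALTH
tank  58.2T  52.4T  5.81T  90%   58%  ONLINE
Interpretation: CAP near 90% plus high FRAG is the combination that stretches resilvers; the pool keeps working, but every degraded hour gets longer.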
Task 3: See recent scrubs and resilvers
cr0x@server:~$ zpool status tank | sed -n '1,12p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 16:44:33 with 0 errors on Sun Dec 22 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
Interpretation: Scrub completed with zero errors. Good. If you never scrub, the first time you find a bad sector may be during resilver—when RAIDZ1 can’t save you.
Task 4: Scrub the pool (and understand the cost)
cr0x@server:~$ sudo zpool scrub tank
Interpretation: Scrubs read and verify data. Schedule them when IO can tolerate it. If scrubs consistently find errors, that’s not “ZFS being picky”; it’s ZFS catching what would otherwise be silent corruption.
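Many OpenZFS packages already ship a periodic scrub via cron or systemd timers; check what your distribution provides before adding your own. If nothing is in place, a plain cron entry is the low-tech version. The schedule below (every Sunday at 02:00) is an example, not a recommendation; align it with your own quiet hours.
cr0x@server:~$ sudo crontab -l | grep scrub
0 2 * * 0 /usr/sbin/zpool scrub tank
Interpretation: Whatever mechanism you use, also alert on the result. A scrub that finds errors and tells nobody is just disk exercise.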
Task 5: Monitor scrub/resilver progress
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Tue Dec 24 01:12:10 2025
7.24T scanned at 1.21G/s, 3.89T issued at 664M/s, 38.2T total
0B repaired, 10.18% done, 0 days 15:42:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST16000NM001G-1 ONLINE 0 0 0
ata-ST16000NM001G-2 ONLINE 0 0 0
ata-ST16000NM001G-3 ONLINE 0 0 0
ata-ST16000NM001G-4 ONLINE 0 0 0
Interpretation: ZFS gives ETA, but treat it as weather forecasting. If “issued” speed drops sharply during business hours, your workload is contending with scrub IO.
Task 6: Identify disks and persistent device paths
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'ST16000NM001G|wwn' | head
lrwxrwxrwx 1 root root 9 Dec 24 00:40 wwn-0x5000c500abcd0001 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 24 00:40 wwn-0x5000c500abcd0002 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 24 00:40 wwn-0x5000c500abcd0003 -> ../../sdd
lrwxrwxrwx 1 root root 9 Dec 24 00:40 wwn-0x5000c500abcd0004 -> ../../sde
Interpretation: Use by-id or wwn paths in ZFS, not /dev/sdX, which can change across reboots.
Task 7: Replace a failed disk correctly (don’t guess device names)
cr0x@server:~$ sudo zpool replace tank wwn-0x5000c500abcd0003 /dev/disk/by-id/wwn-0x5000c500beef0009
Interpretation: First argument is the old member (as ZFS knows it). Second is the new disk path. After this, resilver begins. If you replace the wrong disk in RAIDZ1, you can turn “degraded” into “gone.”
Task 8: Confirm resilver started and watch for errors
cr0x@server:~$ zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Dec 24 02:02:41 2025
1.78T scanned at 412M/s, 612G issued at 222M/s, 38.2T total
204G resilvered, 1.56% done, 2 days 01:15:09 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST16000NM001G-1 ONLINE 0 0 0
ata-ST16000NM001G-2 ONLINE 0 0 0
replacing-2 DEGRADED 0 0 0
wwn-0x5000c500abcd0003 FAULTED 0 0 0 too many errors
wwn-0x5000c500beef0009 ONLINE 0 0 0 (resilvering)
ata-ST16000NM001G-4 ONLINE 0 0 0
errors: No known data errors
Interpretation: Resilver ETA is 2 days. That’s your exposure window. If another disk throws errors now, you may lose data. This is the moment to reduce workload and babysit alerts.
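If you need to tilt the balance between resilver progress and production IO, OpenZFS exposes scan-related tunables as kernel module parameters. Names and defaults vary between versions, so list what your system actually has rather than copying values from a forum post; the output below is an example from one version and may not match yours.
cr0x@server:~$ grep -H . /sys/module/zfs/parameters/*resilver* /sys/module/zfs/parameters/*scrub* 2>/dev/null | head
/sys/module/zfs/parameters/zfs_resilver_disable_defer:0
/sys/module/zfs/parameters/zfs_resilver_min_time_ms:3000
/sys/module/zfs/parameters/zfs_scrub_min_time_ms:1000
Interpretation: Prioritizing the resilver shortens the exposure window at the cost of production latency; decide which scares you more. Change one parameter at a time, note what you changed, and put it back afterward.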
Task 9: Check per-disk error counters and timeouts
cr0x@server:~$ sudo zpool status -P tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
/dev/disk/by-id/wwn-0x5000c500abcd0001 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500abcd0002 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500beef0009 ONLINE 0 0 0 (resilvering)
/dev/disk/by-id/wwn-0x5000c500abcd0004 ONLINE 3 0 0
Interpretation: Non-zero READ errors on a surviving disk during resilver is a red flag. Sometimes it’s transient (cable/HBA), sometimes it’s a dying disk. Investigate immediately; RAIDZ1 does not forgive delay.
Task 10: Pull SMART health and error log (Linux)
cr0x@server:~$ sudo smartctl -a /dev/sdd | sed -n '1,40p'
smartctl 7.4 2023-08-01 [x86_64-linux-6.6.0] (local build)
=== START OF INFORMATION SECTION ===
Device Model: ST16000NM001G
Serial Number: ZR123ABC
LU WWN Device Id: 5 000c50 0abcd0003
Firmware Version: SN03
User Capacity: 16,000,900,169,728 bytes [16.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Interpretation: If SMART says FAILED, treat it as “evict this drive.” Some drives limp for months; others fall off a cliff. Your rebuild window is not the time to debate optimism.
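The overall PASSED/FAILED verdict is coarse; drives usually confess earlier in the attribute table. The grep below pulls the counters that most often drive a “replace it now” decision. Attribute names and numbering vary slightly by vendor, and the values shown are illustrative.
cr0x@server:~$ sudo smartctl -A /dev/sdd | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|Reported_Uncorrect'
  5 Reallocated_Sector_Ct   0x0033   081   081   010    Pre-fail  Always       -       1528
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       213
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   Always       -       88
Interpretation: Non-zero and growing reallocated or pending sector counts on a RAIDZ1 member are a “schedule the replacement now” signal, not a “watch it for a month” signal.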
Task 11: See dataset properties that affect resilver and performance
cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,compression,atime,sync tank/vm
NAME PROPERTY VALUE
tank/vm recordsize 128K
tank/vm compression lz4
tank/vm atime off
tank/vm sync standard
Interpretation: lz4 often helps (less physical IO). atime=off reduces write noise. Recordsize influences IO patterns; mismatches can amplify random IO on RAIDZ.
Task 12: Check ZFS pool I/O and latency symptoms using iostat
cr0x@server:~$ zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 38.2T 4.10T 210 1200 92.1M 410M
raidz1-0 38.2T 4.10T 210 1200 92.1M 410M
wwn-0x5000c500abcd0001 - - 55 310 24.0M 110M
wwn-0x5000c500abcd0002 - - 48 295 21.2M 105M
wwn-0x5000c500beef0009 - - 60 300 25.1M 100M
wwn-0x5000c500abcd0004 - - 47 295 21.8M 95M
-------------------------- ----- ----- ----- ----- ----- -----
Interpretation: This shows distribution across disks. If one disk shows much lower bandwidth or far higher ops, it may be slow, retrying, or misbehaving. During resilver you’ll often see asymmetry; extreme asymmetry is a lead.
Task 13: Find whether you’re doing unsafe write caching assumptions
cr0x@server:~$ sudo hdparm -W /dev/sdb | head -n 2
/dev/sdb:
write-caching = 1 (on)
Interpretation: Write cache on HDDs is normal, but without power-loss protection, it can turn a power event into a mess. ZFS mitigates some issues, but hardware lies under stress. If you need strict guarantees, design for them.
Task 14: Check ashift (sector alignment) for the pool
cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
38: vdev_tree:
57: ashift: 12
Interpretation: ashift=12 means 4K sectors. If you accidentally created a pool with ashift=9 on 4K disks, you can get performance cliffs and extra write amplification. Fixing it usually means rebuilding the pool.
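To sanity-check ashift against what the hardware reports, compare logical and physical sector sizes before you build a pool; after creation, ashift is effectively fixed per vdev. Device names below match the earlier by-id mapping and the output is illustrative.
cr0x@server:~$ lsblk -o NAME,LOG-SEC,PHY-SEC,MODEL /dev/sdb /dev/sdc /dev/sdd /dev/sde
NAME LOG-SEC PHY-SEC MODEL
sdb      512    4096 ST16000NM001G
sdc      512    4096 ST16000NM001G
sdd      512    4096 ST16000NM001G
sde      512    4096 ST16000NM001G
Interpretation: 512-byte logical with 4096-byte physical (512e) is exactly the case where ashift=12 matters. Some drives under-report their physical sector size, so when in doubt on HDDs, prefer ashift=12.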
Task 15: Check for data errors after a scary event
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: resilvered 13.1T in 2 days 03:12:44 with 0 errors on Thu Dec 26 05:15:25 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST16000NM001G-1 ONLINE 0 0 0
ata-ST16000NM001G-2 ONLINE 0 0 0
ata-ST16000NM001G-NEW ONLINE 0 0 0
ata-ST16000NM001G-4 ONLINE 0 0 0
errors: No known data errors
Interpretation: “0 errors” is what you want to see. If you see “permanent errors,” don’t hand-wave it—list impacted files, assess blast radius, restore from backup if needed.
Fast diagnosis playbook (find the bottleneck quickly)
This is the playbook I use when someone pings: “Resilver is slow” or “RAIDZ1 pool is crawling.” The goal is not to become a performance poet; it’s to find the limiting factor in minutes.
First: confirm the pool state and what ZFS is doing
cr0x@server:~$ zpool status -v tank
Look for: DEGRADED, resilver in progress, read/write/cksum errors per device, and any “too many errors” messages.
Second: identify whether the bottleneck is one bad device or overall contention
cr0x@server:~$ zpool iostat -v tank 1 10
Interpretation: If one disk is dramatically slower or shows errors, you likely have a device/cable/HBA issue. If all disks are busy and throughput is low, you may be dominated by random IO or CPU parity overhead under load.
Third: check system-level IO pressure and queueing
cr0x@server:~$ iostat -x 1 10
Look for: high %util, high await, large queues. If latency explodes during resilver, throttle workloads or adjust priorities—degraded RAIDZ1 is not a time for “full send.”
Fourth: check for obvious memory pressure (ARC thrash) and CPU saturation
cr0x@server:~$ free -h
cr0x@server:~$ top -b -n 1 | head -n 20
Interpretation: If you’re swapping, ZFS will feel haunted. If CPU is pegged, parity math plus compression plus checksumming under load can become the limiter.
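The free output shows system memory, but the ARC keeps its own books. On Linux you can read them directly from the kstat file; arc_summary, where installed, prints the same data with commentary. Field names are stable across recent OpenZFS releases, but verify on your version.
cr0x@server:~$ awk '$1=="size" || $1=="c_max" || $1=="memory_throttle_count" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
size 51278436352
c_max 67445362688
memory_throttle_count 0
Interpretation: If size hugs c_max while the system swaps and memory_throttle_count climbs, ZFS and your applications are fighting over RAM. Cap the ARC or add memory before blaming the disks.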
Fifth: confirm the “slow disk” isn’t just negotiating poorly
cr0x@server:~$ sudo dmesg | grep -Ei 'ata|sas|reset|link|error|timeout' | tail -n 30
Interpretation: Link resets, timeouts, and command aborts often show up here before you see a full fault. In RAIDZ1 degraded mode, treat this like a fire alarm, not a vibe.
Sixth: if it’s still unclear, measure real workload vs resilver
cr0x@server:~$ sudo zpool iostat -r -v tank 1 5
Interpretation: Breakdown by request size can reveal you’re stuck doing small random reads (fragmentation, VM workloads) rather than streaming.
Common mistakes with specific symptoms and fixes
Mistake 1: Treating degraded RAIDZ1 as “stable enough”
Symptoms: A disk fails, ticket is opened, replacement delayed; days later, second disk errors appear; resilver hasn’t even started; data errors appear during the eventual rebuild.
Fix: Define an SLA for replacement (hours, not days). Keep tested spares. Automate paging for DEGRADED. Reduce workload immediately when degraded.
Mistake 2: Running pools too full
Symptoms: Scrubs/resilvers take dramatically longer as usage climbs; latency spikes; allocation failures; “everything is slow” with no single smoking gun.
Fix: Capacity plan with headroom (practically: keep well below the cliff). If you must run hot, don’t use RAIDZ1; choose redundancy that tolerates stress.
Mistake 3: No scrubs (or scrubs ignored)
Symptoms: First scrub in months finds checksum errors; resilver later fails with permanent errors; “but the pool was ONLINE yesterday.”
Fix: Schedule scrubs. Alert on scrub errors. Replace drives that show growing error trends even if they’re “still online.”
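Alerting does not have to wait for a monitoring project. OpenZFS ships ZED (the ZFS Event Daemon), which can notify on faults and scrub results; prefer it if it is packaged for you. As a minimal stopgap, a cron job around zpool status -x works too. The script below is a sketch: the mail command and address are placeholders for whatever paging integration you actually use.
cr0x@server:~$ cat /etc/cron.hourly/zpool-health
#!/bin/sh
# Minimal example: mail the full status if any pool is not healthy.
STATUS="$(zpool status -x)"
echo "$STATUS" | grep -q 'all pools are healthy' && exit 0
zpool status -v | mail -s "ZFS pool problem on $(hostname)" oncall@example.com
Interpretation: zpool status -x prints a single healthy line when there is nothing to report, which makes it easy to alert only on exceptions. The point is that DEGRADED should page a human, not wait in a dashboard.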
Mistake 4: Mixing drive types without thinking (SMR surprises)
Symptoms: Resilver speed is fine at first, then collapses; disks show long command times; rebuild takes days longer than expected.
Fix: Avoid SMR in RAIDZ unless you’ve explicitly validated it for resilver behavior. Standardize drive models and firmware in a vdev.
Mistake 5: Using unstable device naming
Symptoms: After reboot, ZFS imports with different /dev/sdX mapping; operator replaces “sdc” but it’s not the intended drive; pool gets worse.
Fix: Always build and operate using /dev/disk/by-id or WWN identifiers. Verify serial numbers physically and via SMART.
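If an existing pool was imported with /dev/sdX names, you can usually fix the naming without rebuilding anything: export the pool and re-import it while pointing ZFS at the by-id directory. Do this in a maintenance window, only while the pool is healthy, and expect the datasets to be briefly unavailable during the export.
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
Interpretation: After the re-import, zpool status should show stable identifiers, and every future replace command references names that survive reboots and recabling.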
Mistake 6: Over-optimizing recordsize/sync without workload proof
Symptoms: “Tune” changes seem to help benchmarks, but production sees higher latency, more fragmentation, slower resilvers.
Fix: Treat tuning as an experiment with rollback. Keep defaults unless you can explain why your workload is different—and measure before/after.
Mistake 7: Believing parity equals backup
Symptoms: Pool survives a disk loss, team relaxes; later ransomware/accidental delete happens; no clean copy exists.
Fix: Maintain real backups or snapshots replicated elsewhere. RAIDZ is availability, not recovery from human problems.
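The fix does not require exotic tooling. A recursive snapshot plus zfs send to a second pool or host is the baseline; the target names backup and backuphost below are placeholders, receive-side permissions are assumed, and since -F rolls the target back, point it only at a dedicated backup dataset. In production you would wrap this in something that schedules, prunes, and verifies.
cr0x@server:~$ sudo zfs snapshot -r tank@daily-2025-12-24
cr0x@server:~$ sudo zfs send -R tank@daily-2025-12-24 | ssh backuphost zfs receive -uF backup/tank
Interpretation: Parity protects you from a failed disk; a replicated snapshot protects you from ransomware, fat fingers, and bugs. Test the restore path, not just the send, because only restores count.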
Checklists / step-by-step plan
Checklist A: Should you use RAIDZ1 at all?
- Classify the data. If losing it triggers revenue loss, contractual breach, or multi-day recovery, default away from RAIDZ1.
- Estimate resilver time from reality. Look at your last scrub duration; resilvers under load will not be faster.
- Confirm replacement logistics. Do you have a spare on-site? Is someone authorized to swap it immediately?
- Assess utilization. If you run hot, RAIDZ1 is not the “efficient” choice; it’s the “fragile” choice.
- Decide your tolerated exposure window. If your org can’t tolerate multi-day “one more error and we’re cooked,” don’t choose RAIDZ1.
Checklist B: When a disk fails in RAIDZ1 (minute-by-minute)
- Confirm it’s a real device problem.
cr0x@server:~$ zpool status -v tank
cr0x@server:~$ sudo dmesg | tail -n 80
Decision: If you see timeouts/resets, check cabling/HBA too—replacing a good disk won’t fix a bad link.
- Reduce load. Pause batch jobs, reduce VM churn, and stop heavy writes if possible. Your goal is to shorten resilver and avoid additional stress-induced errors.
- Identify the physical drive.
cr0x@server:~$ sudo smartctl -a /dev/disk/by-id/wwn-0x5000c500abcd0003 | grep -E 'Serial|SMART overall|Reallocated|Pending'
Decision: Match serial numbers with chassis labels. Don’t trust “slot 3” folklore unless you’ve validated it.
- Replace immediately using stable IDs.
cr0x@server:~$ sudo zpool replace tank wwn-0x5000c500abcd0003 /dev/disk/by-id/wwn-0x5000c500beef0009
- Monitor resilver and errors continuously.
cr0x@server:~$ watch -n 10 'zpool status -v tank'
Decision: If another disk starts erroring, be ready to escalate: replace suspect disks, check HBA, and consider moving workload off the pool.
- After completion, scrub soon.
cr0x@server:~$ sudo zpool scrub tank
Decision: Verify the pool is clean and no permanent errors were introduced.
Checklist C: “Boring reliability” practices that make RAIDZ1 less scary
- Scrub schedule aligned with business load and pool size; alert on any repaired bytes or checksum errors.
- SMART monitoring with thresholds for reallocated/pending sectors and error logs.
- On-site spares that are burned in and labeled.
- Documented replacement runbook including device identification and rollback steps.
- Headroom policy (don’t run pools to the cliff).
- Test restores from backups/snapshots. RAID is not your recovery plan.
FAQ
1) Is RAIDZ1 “unsafe”?
Not inherently. It’s less forgiving. If you can keep resilver windows short and replacements fast, it can be reasonable for non-critical data. If your resilver window is measured in days, RAIDZ1 is a risk you should treat like any other high-impact operational exposure.
2) Why does RAIDZ1 risk increase with larger drives?
Because rebuilds require reading a lot of data from surviving disks, and larger drives mean more data to read and more time spent degraded. More time and more reads increase the chance of encountering a second failure mode (another disk failure, read errors, timeouts).
3) Is RAIDZ2 always better?
For reliability under rebuild stress, yes: dual parity gives you margin during a resilver. But “better” includes tradeoffs: more parity overhead, different performance characteristics, and sometimes more complex capacity planning. For many production pools, RAIDZ2 is the default because operational outcomes matter more than raw usable TB.
4) Mirrors vs RAIDZ: which is safer?
Mirrors are often operationally safer because rebuilds (resilvers) can be simpler and faster, and performance under random IO is typically better. RAIDZ can be great for capacity-heavy, throughput-oriented workloads. If you’re running VM random IO on spinning disks, mirrors frequently behave better under stress.
5) Does ZFS resilver the entire disk?
Usually it resilvers only allocated blocks (which is good), but fragmented pools and certain scenarios can still create long, IO-heavy rebuilds. The practical advice remains: measure your actual resilver time in your workload, not in theory.
6) If I scrub regularly, does that eliminate RAIDZ1 rebuild risk?
No, but it reduces it. Scrubs surface latent errors early, when you still have parity. That means you can fix issues before you’re degraded. Scrubs don’t prevent a second disk from failing mid-resilver, and they don’t make timeouts/cabling issues disappear.
7) What’s the simplest way to reduce RAIDZ1 risk without redesigning everything?
Shorten the exposure window and improve response: keep tested spares on-site, alert loudly on DEGRADED, and have a practiced replacement procedure. Also keep headroom; very full pools are slower to resilver and more fragile under load.
8) How do I know if my pool is one error away from data loss?
If it’s RAIDZ1 and the pool is degraded, you’re already there. Check zpool status -v. Any additional read errors on remaining disks during resilver are a serious escalation—investigate hardware and consider proactively replacing questionable drives.
9) Can I “convert” RAIDZ1 to RAIDZ2 in place?
In many deployments, not directly in-place the way people hope. Some expansion features exist in modern OpenZFS ecosystems, but changing parity level is typically not a trivial flip. The common operational path is migrating data to a newly built pool with the desired topology.
10) What’s a realistic RAIDZ1 use case today?
Non-critical, reproducible, or well-replicated data; environments with fast hands and spares; smaller disk sizes; or where a pool is a cache tier rather than the source of truth. If the pool is your only copy of important data, RAIDZ1 should make you uneasy—and that’s a useful signal.
Conclusion: choosing RAIDZ1 with eyes open
RAIDZ1’s reputation swings between “totally fine” and “never use it,” and both extremes miss the point. The real story is the rebuild window: how long you’re degraded, how much you must read to recover, and how likely the environment is to produce a second failure mode while you’re exposed.
If you run the calculation—really run it, with your actual scrub times, your actual utilization, your actual replacement process—RAIDZ1 often stops being a default and starts being a deliberate choice. Sometimes it’s still the right choice. But when it isn’t, you’ll know before the pager teaches you at 3 a.m.