There’s a rule in ZFS that sounds like superstition until you’ve watched a pool limp through a long resilver while your on-call phone heats up: vdevs are the unit of redundancy, performance, and regret. You can swap disks. You can tune datasets. You can add vdevs. But once you pick a vdev layout, you’ve effectively chosen the pool’s personality for the rest of its life.
This isn’t a theoretical warning delivered from a whiteboard. It’s a production truth learned the hard way in data centers where the storage “just needs to work” and in clouds where it “just needs to be cheap.” ZFS will do exactly what you told it to do—brilliantly, relentlessly, and with zero sympathy for your mistaken assumptions.
The rule you break once
The rule is simple: Never build a pool where you can’t articulate—out loud—how you will expand it, replace it, and survive a long resilver without gambling the business.
In practice, that rule boils down to vdev decisions:
- Redundancy lives inside a vdev. A pool is “as redundant as its least redundant vdev,” and it will fail if any single vdev fails.
- Performance aggregates across vdevs. More vdevs generally means more IOPS and more parallelism; wider RAIDZ vdevs do not behave like “more spindles = more speed” in the way people expect.
- Expansion is additive by vdev. You can add vdevs to a pool. Top-level vdev removal now exists for mirrors and single-disk vdevs (not RAIDZ), and it leaves an indirect mapping behind, so you should still plan as if removal is not part of your lifecycle.
- Repair (resilver) is per vdev and per device. Wider vdevs mean more disks participating, more time under stress, and often more exposure to a second failure before you’re safe again.
Here’s the line I use when someone proposes a “creative” vdev layout to save a budget meeting: ZFS is a file system with opinions, and its strongest opinion is that topology is destiny.
Joke #1: The fastest way to learn ZFS vdev design is to build the wrong one, because ZFS will make sure you remember it.
Vdevs as a mental model: topology beats tuning
If you’ve run RAID controllers, you’re used to thinking in terms of “the RAID group” and “the LUN.” ZFS collapses those ideas. A pool is built from vdevs; datasets live on the pool; ZFS spreads writes across vdevs to balance space and performance.
What a vdev actually is
A vdev is a virtual device that provides storage to a pool. That vdev might be:
- a single disk (no redundancy; don’t do this in production unless you explicitly accept losing it),
- a mirror (two or more disks, copies of data),
- a RAIDZ group (single, double, triple parity: raidz1/2/3),
- a special vdev (metadata and/or small blocks on faster media),
- a log (SLOG) device for synchronous write acceleration,
- a cache (L2ARC) device.
Only some of those contribute to main storage capacity (data vdevs and special vdevs, depending on configuration). And only some of them are required for the pool to keep functioning: a cache device can die without taking the pool with it, a failed log device is usually survivable, but losing any data or special vdev means losing the pool.
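To make the shapes concrete, here is a minimal creation sketch with placeholder device names (adjust the pool name and paths to your hardware). Each top-level data vdev is declared explicitly, and the support vdevs get their own keywords:
cr0x@server:~$ sudo zpool create tank \
    mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B \
    mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D \
    special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B \
    log mirror /dev/disk/by-id/nvme-SSD_C /dev/disk/by-id/nvme-SSD_D \
    cache /dev/disk/by-id/nvme-SSD_E
Data lands on the two mirrors, metadata can land on the special mirror, the log mirror only matters for synchronous writes, and the cache device is the one piece you can afford to lose.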
Why the vdev is the failure domain
If you build a pool from four mirrors, you can lose one disk in each mirror and still run. If you build a pool from two raidz2 vdevs, each vdev tolerates two failed disks, so the pool survives up to four failures only if they land two-per-vdev; three failures in one vdev kills the pool. That “where the failures land” detail is what ruins otherwise decent-looking spreadsheet math.
Operationally, you don’t get to choose where disks fail. You get to choose the blast radius when they do.
Why the vdev is the performance unit
ZFS stripes writes across vdevs, not across individual disks. A pool with more vdevs tends to deliver more IOPS because ZFS can issue more independent IO in parallel. This is why “one huge RAIDZ2 with 12 disks” often disappoints compared to “six mirrored pairs,” even though both use 12 disks.
A RAIDZ vdev can deliver good sequential throughput, but random IO and latency behave differently. A mirror vdev can often satisfy reads from either disk and can handle small random writes without parity overhead. RAIDZ has to do parity math and read-modify-write patterns for small updates. ZFS mitigates some of this with copy-on-write and transaction groups, but you can’t tune your way out of physics.
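To see the difference in vdev count, compare two ways of spending the same twelve disks. This is a sketch with placeholder names standing in for full /dev/disk/by-id paths, and the two commands are alternatives, not a sequence:
cr0x@server:~$ # option 1: one wide raidz2 vdev, one allocation and IO unit
cr0x@server:~$ sudo zpool create tank raidz2 DISK_{01..12}
cr0x@server:~$ # option 2: six mirror vdevs ZFS can schedule in parallel
cr0x@server:~$ sudo zpool create tank \
    mirror DISK_01 DISK_02  mirror DISK_03 DISK_04  mirror DISK_05 DISK_06 \
    mirror DISK_07 DISK_08  mirror DISK_09 DISK_10  mirror DISK_11 DISK_12
Option 1 wins on usable capacity; option 2 wins on random IOPS and resilver behavior. Neither is wrong; they are different contracts.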
The “you can add vdevs” trap
Yes, you can expand a pool by adding vdevs. That sounds like freedom. The trap is that you can’t (practically) “re-balance” data the way people imagine from distributed systems. ZFS will allocate new writes to the new vdevs, but existing data stays where it is unless it’s rewritten. Your hot dataset from last year stays on last year’s vdev unless you proactively migrate it.
Also, once you add a vdev, the pool now depends on that vdev forever. If you add a “temporary” single-disk vdev to survive the week, you have just created a permanent single point of failure. That week becomes a year. It always becomes a year.
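The mechanics are deceptively simple. A sketch, assuming a hypothetical tank/projects dataset and placeholder disk names:
cr0x@server:~$ sudo zpool add tank mirror /dev/disk/by-id/ata-DISK_E /dev/disk/by-id/ata-DISK_F
cr0x@server:~$ # existing blocks stay on the old vdevs; rewriting the data is what spreads it out
cr0x@server:~$ sudo zfs snapshot -r tank/projects@migrate
cr0x@server:~$ sudo zfs send -R tank/projects@migrate | sudo zfs receive tank/projects-rebalanced
After verifying the copy, repoint consumers and destroy the original; the rewrite, not the zpool add, is what rebalances.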
Interesting facts and historical context
Storage engineering is full of folklore. Here are concrete bits of context that make ZFS vdev rules feel less arbitrary:
- ZFS was designed at Sun with end-to-end integrity as a core goal, not as an add-on. Checksums on every block are not a feature; they’re the worldview.
- RAIDZ exists because traditional RAID-5/6 has a “write hole” problem (power loss during stripe updates can leave parity inconsistent). ZFS’s copy-on-write transactional model avoids that class of corruption.
- The “1 TB drive” that made RAID-5 scary became “18 TB drives” that made it existential. As disks grew, rebuild/resilver windows grew, and the probability of encountering an error during rebuild stopped being a rounding error.
- ZFS scrubs are not optional hygiene; they’re how you turn silent corruption into detected-and-corrected corruption. Without scrubs, you’re betting your backups and your luck against bit rot.
- 4K sector drives made alignment (ashift) a permanent decision. Misaligned pools can take a silent performance hit that lasts for the entire pool’s lifespan.
- People used to “fix” latency with a RAID controller cache and then discovered they’d built a corruption machine when battery units aged out. ZFS pushed caching into the OS and made power-loss behavior explicit.
- OpenZFS became a multi-OS ecosystem, which is why you see differences in defaults, tooling, and feature flags between platforms. The vdev rules remain stubbornly consistent.
- dRAID exists largely to address rebuild time and risk at scale by distributing spare capacity and speeding replacement/resilver behavior, especially in large RAIDZ-like groups.
- Special vdevs are newer but operationally dangerous when misunderstood: if you put metadata on a special vdev and lose it, the pool can be effectively lost even if data disks are fine.
Mirrors, RAIDZ, dRAID, and special vdevs: real tradeoffs
Mirrors: boring, fast, and expensive (in a good way)
Mirrors are the default recommendation for mixed workloads because they behave well under randomness, fail predictably, and resilver relatively quickly (especially with sequential resilver support and if the pool isn’t completely full). Mirrors also give you more vdevs for the same number of disks, which means more parallelism.
But mirrors cost you raw capacity: two-way mirrors cut usable space roughly in half. Three-way mirrors are the “I never want to talk about this again” option for critical systems, at the cost of two-thirds of raw capacity.
RAIDZ: capacity efficiency with performance caveats
RAIDZ is compelling when you need capacity and your workload is more sequential or large-block oriented. RAIDZ2 is often the baseline for nearline or “this is expensive to rebuild” pools. RAIDZ1 is still used in some places, but it’s increasingly a risk decision rather than an engineering one.
The two big operational pitfalls with RAIDZ are:
- Small random write latency, especially under load, because parity updates are not free.
- Long resilvers as disks get bigger, and the uncomfortable reality that resilvering stresses the remaining disks while you’re already degraded.
RAIDZ width matters. Very wide RAIDZ vdevs can look great on a capacity chart and then punish you with rebuild windows and unpredictable tail latencies at the worst time.
dRAID: the “big chassis” answer
dRAID (distributed RAID) is designed for large pools where classic RAIDZ rebuild times become unacceptable. By distributing parity and spare space across many disks, it can reduce the time to restore redundancy after a failure and reduce the “hot spare sits idle until disaster” inefficiency.
It’s not a magic wand. You still need to understand the failure domains, the parity level, and the operational workflow for replacement. But if you’re building large JBOD shelves and you care about resilver time, dRAID is worth serious consideration.
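For orientation, the vdev spec encodes parity, data disks per redundancy group, total children, and distributed spares (OpenZFS 2.1 and later). The numbers below are purely illustrative, not a sizing recommendation, and the device names are placeholders:
cr0x@server:~$ sudo zpool create tank draid2:9d:24c:2s /dev/disk/by-id/ata-DISK_{01..24}
Twenty-four drives form one draid2 vdev with two spares’ worth of capacity spread across every member, which is why a rebuild can read from and write to many disks at once instead of hammering a single idle hot spare.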
Special vdevs: the performance multiplier with sharp edges
Special vdevs can store metadata (and optionally small blocks) on faster storage like SSDs. Done right, they can transform filesystem metadata performance, directory traversals, and small-file workloads. Done wrong, they can create a new, smaller failure domain that takes the whole pool down.
Rule of thumb: if a special vdev contains metadata, treat it like a first-class data vdev with redundancy. Mirror it. Monitor it. Replace it proactively.
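A sketch of doing that, with placeholder device names; special_small_blocks is optional and controls whether small data blocks for a dataset also land on the special vdev:
cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/home
From that moment the special mirror is as load-bearing as any data vdev: monitor wear, keep it redundant, and plan its replacement like you mean it.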
SLOG and the sync write misunderstanding
SLOG is not a write cache. It is a separate log device used for synchronous writes only. If your workload is mostly asynchronous (common for many applications), a SLOG may do nothing. If your workload is synchronous (databases, NFS with sync, certain VM storage patterns), a good SLOG can reduce latency dramatically—if it’s power-loss safe and fast at low queue depths.
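If you have confirmed the workload really is sync-heavy, adding a mirrored log vdev is one command (placeholder device names; use devices with power-loss protection):
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP_A /dev/disk/by-id/nvme-PLP_B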
Joke #2: Buying a SLOG for an async workload is like installing a race car spoiler on a delivery van: it changes the look, not the lap time.
Three corporate-world mini-stories
1) Incident caused by a wrong assumption: “We can always remove that later”
The company was mid-migration. A legacy SAN was being retired, and a new ZFS appliance (whitebox, competent design) was built to take over. The storage team sized the pool conservatively: mirrored vdevs, good monitoring, tested restores. Then the project schedule slipped.
A product team showed up with a new dataset that wouldn’t fit. The easiest short-term fix seemed harmless: add a single large disk as a temporary vdev “just until procurement finishes.” The pool accepted it. The dataset landed. The crisis passed. Everyone went back to their real jobs.
Six months later, the “temporary disk” was still there. It wasn’t in the original replacement plan. It wasn’t in the spreadsheet that tracked warranty dates. It also wasn’t mirrored. One morning, it started throwing read errors. ZFS did what ZFS does: it marked the vdev as faulted, and the entire pool went down because a pool cannot survive the loss of any top-level vdev.
The postmortem was grim not because the failure was exotic, but because it was boring. The wrong assumption wasn’t “disks fail.” It was “we can easily undo topology changes.” They ended up doing an emergency migration off the pool, under time pressure, onto whatever spare capacity could be found. The business impact wasn’t from the disk dying; it was from the topology decision that made a single disk a pool-wide dependency.
The fix was straightforward but painful: rebuild the pool properly and migrate data back. The lesson stuck: there is no such thing as a temporary top-level vdev.
2) Optimization that backfired: “One wide RAIDZ is simpler”
A different org had a new analytics cluster. They ran big sequential scans at night, but daytime was interactive: dashboards, ad-hoc queries, and lots of small random IO. Someone proposed a single wide RAIDZ2 vdev to maximize usable space and simplify management. Fewer vdevs, fewer moving parts, one set of parity disks. The plan looked clean.
In the first load test, sequential throughput was excellent. Everyone celebrated. Then production arrived. Daytime queries started showing long-tail latency spikes. Not constant slowness—worse. Every few minutes, something would stall hard enough that the application timeouts would fire. The storage graphs looked “fine on average,” which is how storage problems hide in corporate dashboards.
The root cause wasn’t mysterious. It was the mismatch between workload and vdev behavior. The wide RAIDZ2 group had decent bandwidth but struggled with mixed small IO under concurrency. The pool had one data vdev, so ZFS had limited places to schedule parallel work. Add in a moderately full pool, some fragmentation over time, and the interactive latency got ugly.
They tried the usual: recordsize tweaks, compression changes, more RAM, even an L2ARC. Improvements were marginal because the core limiter was structural. Eventually they rebuilt as multiple mirror vdevs and the problem disappeared. The “optimization” backfired because it optimized the wrong metric—usable TB—at the expense of the metric users actually felt: p99 latency.
The quiet moral: simplicity is great, but a single-vdev pool is rarely simple in production.
3) Boring but correct practice that saved the day: “Scrubs, spares, and replacement discipline”
This one never shows up in architecture decks because it’s not sexy. A mid-size enterprise ran ZFS for VM storage with mirrored vdevs, conservative capacity utilization, and a schedule: monthly scrubs, alerts on checksum errors, and a policy to replace any disk that showed repeated errors even if it hadn’t fully failed.
One quarter, a batch of drives started showing intermittent read issues. Not enough to trip immediate failure, but enough to accumulate checksum errors. The scrubs caught it early. The team didn’t argue with the drives. They replaced them in an orderly way, one mirror member at a time, during business hours, while the pool was healthy.
Later that year, a power event hit a different rack. The systems recovered, but the same pool that had seen flaky drives would have been in a dangerous state if those marginal disks were still present. Instead, the pool came back clean. No degraded vdevs. No emergency resilvers on already-sick media. The incident response was almost boring: verify status, verify scrub schedule, move on.
The practice that saved them wasn’t a clever tuning parameter. It was operational discipline: scrubs, monitoring, and proactive replacement. In storage, boring is a feature.
Practical tasks: commands + interpretation
These are the tasks I actually run in production when I’m validating vdev design, diagnosing issues, or cleaning up after something went sideways. Commands assume OpenZFS on a Linux-like system; adjust device names and pool names accordingly.
Task 1: Inspect pool topology (the truth serum)
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
config:
        NAME                   STATE     READ WRITE CKSUM
        tank                   ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            ata-SAMSUNG_SSD_1  ONLINE       0     0     0
            ata-SAMSUNG_SSD_2  ONLINE       0     0     0
          raidz2-1             ONLINE       0     0     0
            ata-WDC_1          ONLINE       0     0     0
            ata-WDC_2          ONLINE       0     0     0
            ata-WDC_3          ONLINE       0     0     0
            ata-WDC_4          ONLINE       0     0     0
            ata-WDC_5          ONLINE       0     0     0
            ata-WDC_6          ONLINE       0     0     0
errors: No known data errors
Interpretation: Identify top-level vdevs (mirror-0, raidz2-1). The pool fails if any top-level vdev fails. Mixed vdev types are allowed but can create uneven performance and awkward expansion behavior.
Task 2: Get capacity and fragmentation at a glance
cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,health,fragmentation
NAME SIZE ALLOC FREE CAPACITY HEALTH FRAG
tank 100T 72T 28T 72% ONLINE 31%
Interpretation: Capacity over ~80% and rising fragmentation are a common recipe for latency complaints, especially on RAIDZ. Fragmentation isn’t inherently evil, but it correlates with allocation difficulty and IO amplification.
Task 3: Show ashift (sector alignment) per vdev
cr0x@server:~$ sudo zdb -C tank | egrep -i 'vdev|ashift' | head -n 30
    vdev_tree:
        type: 'root'
        id: 0
        guid: 123456789
        children[0]:
            type: 'mirror'
            id: 0
            ashift: 12
        children[1]:
            type: 'raidz'
            id: 1
            ashift: 12
Interpretation: ashift=12 means 4K sectors. ashift is effectively permanent for that vdev. If you accidentally built with ashift=9 on 4K disks, you may see persistent write amplification and poor performance.
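When creating a new pool, don’t leave this to autodetection against drives that may lie about their sector size; set it explicitly at creation time. A minimal sketch with placeholder device names:
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B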
Task 4: Identify datasets with sync enabled (SLOG relevance)
cr0x@server:~$ zfs get -r -o name,property,value sync tank
NAME PROPERTY VALUE
tank sync standard
tank/vm sync always
tank/backups sync disabled
Interpretation: If important datasets are sync=always, a proper SLOG can matter. If most are standard and your apps don’t issue sync writes, SLOG won’t move the needle.
Task 5: Confirm recordsize and volblocksize choices
cr0x@server:~$ zfs get -o name,property,value recordsize tank
NAME PROPERTY VALUE
tank recordsize 128K
cr0x@server:~$ zfs get -o name,property,value volblocksize tank/vm-zvol
NAME PROPERTY VALUE
tank/vm-zvol volblocksize 16K
Interpretation: recordsize matters for filesystems; volblocksize matters for zvols and is not trivial to change after creation. Mismatches can create read-modify-write overhead and poor compression behavior.
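Since volblocksize is fixed at creation, the practical fix for a badly sized zvol is a new zvol and a migration, not a property change. A sketch with a hypothetical name and size:
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vm-zvol-new
Then copy the contents at the block or application layer and retire the old zvol.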
Task 6: Observe real-time IO and latency by vdev
cr0x@server:~$ sudo zpool iostat -v tank 2
                               capacity     operations     bandwidth
pool                         alloc   free   read  write   read  write
---------------------------  -----  -----  -----  -----  -----  -----
tank                           72T    28T    120    800    15M   110M
  mirror-0                    1.2T   0.8T     90    500    12M    60M
    ata-SAMSUNG_SSD_1            -      -     45    250     6M    30M
    ata-SAMSUNG_SSD_2            -      -     45    250     6M    30M
  raidz2-1                     70T    27T     30    300     3M    50M
    ata-WDC_1                    -      -      5     50   0.5M   8.5M
...
Interpretation: If one vdev is saturated or showing disproportionate work, you’ve found your bottleneck. Pools don’t magically “average out” a slow vdev; they queue behind it when allocations land there.
Task 7: Check error counters and spot a “quietly dying” disk
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
        NAME                   STATE     READ WRITE CKSUM
        tank                   ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            ata-SAMSUNG_SSD_1  ONLINE       0     0     0
            ata-SAMSUNG_SSD_2  ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            ata-WDC_7          ONLINE       0     0    12
            ata-WDC_8          ONLINE       0     0     0
errors: No known data errors
Interpretation: CKSUM errors on one device are often cabling, controller, or the drive itself. “No known data errors” means ZFS corrected what it could, but your hardware is sending bad data. Don’t normalize checksum errors.
Task 8: Clear transient errors after fixing hardware (only after evidence)
cr0x@server:~$ sudo zpool clear tank
Interpretation: Clearing errors hides history. Do it after you’ve replaced a cable/HBA/drive and want to confirm the problem is gone. If errors come back, you’re still in the blast radius.
Task 9: Replace a failed disk in a mirror
cr0x@server:~$ sudo zpool offline tank ata-WDC_7
cr0x@server:~$ sudo zpool replace tank ata-WDC_7 /dev/disk/by-id/ata-WDC_NEW_7
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
        NAME                 STATE     READ WRITE CKSUM
        tank                 ONLINE       0     0     0
          mirror-1           ONLINE       0     0     0
            replacing-0      ONLINE       0     0     0
              ata-WDC_7      OFFLINE      0     0     0
              ata-WDC_NEW_7  ONLINE       0     0     0  (resilvering)
            ata-WDC_8        ONLINE       0     0     0
errors: No known data errors
Interpretation: Use stable device paths (by-id). Watch the resilver. If resilver time is unexpectedly long, check pool utilization, IO contention, and whether you’re hitting a slow device.
Task 10: Monitor resilver progress and rate
cr0x@server:~$ sudo zpool status tank
scan: resilver in progress since Tue Dec 24 10:12:03 2025
3.20T scanned at 540M/s, 1.10T issued at 185M/s, 72T total
1.10T resilvered, 1.53% done, 4 days 10:21:19 to go
Interpretation: “Scanned” vs “issued” tells you about throttling and IO contention. If “issued” is low, the pool is too busy or a vdev/device is slow. During resilver, your redundancy margin is reduced; treat this as an incident until it completes.
Task 11: Run a scrub and understand what it costs
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
scan: scrub in progress since Tue Dec 24 12:00:00 2025
9.80T scanned at 1.1G/s, 2.40T issued at 270M/s, 72T total
0B repaired, 3.33% done, 0 days 18:10:40 to go
Interpretation: Scrubs consume IO. Schedule them, but don’t skip them. A scrub that finds checksum errors is not “bad luck”; it’s an early warning system doing its job.
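The useful part is making the schedule automatic. Many distro packages already ship a scrub timer or cron job; if yours doesn’t, a plain cron entry works. The file name and schedule below are just examples:
cr0x@server:~$ echo '0 2 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank
That scrubs tank at 02:00 on the first of each month; pick a window that doesn’t collide with backups or resilvers.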
Task 12: Show compression ratio and confirm you’re not fooling yourself
cr0x@server:~$ zfs get -o name,property,value,source compression,compressratio tank
NAME PROPERTY VALUE SOURCE
tank compression zstd local
tank compressratio 1.72x -
Interpretation: Compression can be a capacity multiplier and sometimes a performance win, but it changes write patterns and CPU usage. If compressratio is ~1.00x, you’re not getting much.
Task 13: Identify metadata-heavy pain (special vdev candidates)
cr0x@server:~$ zfs get -o name,property,value primarycache,secondarycache tank/home
NAME PROPERTY VALUE
tank/home primarycache all
tank/home secondarycache all
cr0x@server:~$ sudo zpool iostat -v tank 2 | head -n 20
Interpretation: If your workload is directory walks, small files, and metadata churn, the issue may be latency on metadata IO. Special vdevs can help, but only when designed with redundancy and monitored like your job depends on it (because it does).
Task 14: Check ARC health and whether you’re memory-starved
cr0x@server:~$ grep -E 'c_min|c_max|size|hits|misses' /proc/spl/kstat/zfs/arcstats | head
c_min 4 8589934592
c_max 4 68719476736
size 4 53687091200
hits 4 2147483648
misses 4 268435456
Interpretation: ARC is your first cache. If misses are high and latency is bad, you might be IO-bound or simply under-cached. Don’t immediately buy SSDs; check if RAM is the cheapest performance upgrade.
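If your platform packages it, arc_summary rolls these counters into something readable, and a quick look at free memory tells you whether ARC even has room to grow:
cr0x@server:~$ arc_summary | head -n 40
cr0x@server:~$ free -g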
Fast diagnosis playbook
This is the “you have 15 minutes before the incident channel melts” flow. The goal is to identify whether the bottleneck is topology, a specific device, sync behavior, capacity pressure, or something upstream.
First: Is the pool healthy and are you already degraded?
cr0x@server:~$ sudo zpool status -x
pool 'tank' is healthy
If not healthy: stop pretending this is a “performance issue.” Degraded pools behave differently, resilvers contend with workload, and risk is elevated.
Second: Is one vdev or one disk slow or erroring?
cr0x@server:~$ sudo zpool iostat -v tank 1
cr0x@server:~$ sudo zpool status tank
What you’re looking for: one device with much lower bandwidth, high errors, or a vdev taking disproportionate ops. A single bad disk can drag a whole RAIDZ vdev into latency hell.
Third: Are you capacity-constrained (space or fragmentation)?
cr0x@server:~$ zpool list -o name,capacity,fragmentation
NAME CAPACITY FRAG
tank 89% 54%
Interpretation: High capacity plus high fragmentation is a classic driver of allocation stalls. The “fix” is rarely a sysctl. It’s usually “add vdevs, free space, or migrate.”
Fourth: Are you paying for sync writes?
cr0x@server:~$ zfs get -r sync tank | head
cr0x@server:~$ sudo zpool status tank | egrep -i 'logs|log'
Interpretation: If sync is enabled and you have no SLOG (or a bad one), latency can spike under fsync-heavy workloads. If sync is disabled everywhere, don’t chase SLOG ghosts.
Fifth: Is ARC doing its job, or are you reading from disk constantly?
cr0x@server:~$ awk '{print $1,$3}' /proc/spl/kstat/zfs/arcstats | egrep 'hits|misses'
hits 2147483648
misses 268435456
Interpretation: A rising miss rate under a read-heavy workload can indicate memory pressure or a working set larger than ARC. It can also indicate that your IO pattern is inherently uncachable (large scans).
Sixth: Confirm the workload and match it to topology
This part is human, not a command. Ask:
- Is this mostly small random IO (VMs, databases)? Mirrors tend to win.
- Is this mostly large sequential IO (backups, media)? RAIDZ can be fine.
- Are we mixing metadata-heavy workloads with bulk storage? Consider special vdevs—carefully.
- Did we recently add a new vdev and expect old data to rebalance? It didn’t.
Common mistakes: specific symptoms and fixes
Mistake 1: Adding a single-disk vdev “temporarily”
Symptom: Pool becomes a single-disk failure away from total outage; later, one disk faults and the entire pool goes down.
Fix: Don’t do it. If you already did: migrate data off, rebuild the pool properly, migrate back. If you must add capacity urgently, add a redundant vdev (mirror or RAIDZ) that meets your durability requirements.
Mistake 2: Building one giant RAIDZ vdev and expecting mirror-like IOPS
Symptom: Good sequential benchmarks, terrible p95/p99 latency under mixed load; “average throughput looks fine.”
Fix: Redesign with more vdevs (often mirrors) or multiple RAIDZ vdevs with sensible width. Stop trying to tune around a topology mismatch.
Mistake 3: Ignoring ashift and sector alignment
Symptom: Chronic write latency and lower-than-expected throughput, especially on SSDs; no obvious “errors.”
Fix: Verify ashift with zdb -C. If wrong, the real fix is rebuild/migrate. You can’t reliably “fix” ashift in place.
Mistake 4: Treating special vdevs like a cache
Symptom: Pool loss or severe corruption risk after special vdev failure; metadata becomes unavailable even though data disks are healthy.
Fix: Mirror (or otherwise protect) special vdevs. Monitor them like data vdevs. Use high-endurance SSDs. Plan replacements.
Mistake 5: Buying a cheap SSD for SLOG
Symptom: Sync write latency doesn’t improve, or gets worse; occasional stalls; device wears out quickly.
Fix: Use a power-loss-protected, low-latency device. Confirm your workload uses sync writes. Validate with application-level latency, not just ZFS counters.
Mistake 6: Running too hot (pool too full)
Symptom: Writes stall, scrub/resilver times balloon, fragmentation rises, and “random weirdness” appears under load.
Fix: Keep free space headroom. Add vdevs before you hit the cliff. If already full, migrate cold data away or expand immediately.
Mistake 7: Mixing disk sizes and expecting graceful outcomes
Symptom: Mirrors waste capacity, RAIDZ vdevs cap at smallest disk, expansion becomes messy.
Fix: Keep vdev members uniform where possible. If you must mix, do it intentionally and document how it affects usable space and replacement strategy.
Mistake 8: Confusing “zpool add” with “upgrade”
Symptom: Someone adds a vdev expecting an existing RAIDZ to expand in width; later they realize the topology is now mixed forever.
Fix: Expansion by adding vdevs is not the same as changing a vdev’s shape. Plan vdev width up front; use additional vdevs for growth.
Checklists / step-by-step plan
Step-by-step: Designing a pool you won’t hate
- Write down the workload in one sentence: “VM storage with random writes,” “backup target with sequential writes,” “NFS home dirs with metadata churn,” etc.
- Pick your failure tolerance: how many disks can fail in the same vdev without outage? Choose mirror/raidz2/raidz3 accordingly.
- Choose vdev width intentionally: don’t just use “all disks in one group.” Prefer more vdevs over wider vdevs for IOPS-heavy workloads.
- Decide how you will expand: “add another mirror pair per quarter,” or “add another raidz2 vdev of 6 disks,” etc.
- Decide how you will replace drives: by-id naming, maintenance windows, spares on hand, documented procedure.
- Set ashift correctly before creating the pool. Treat it as permanent.
- Plan scrubs and monitoring: scrub cadence, alert thresholds (checksum errors, slow resilvers), and who gets paged.
- Keep headroom: define a capacity ceiling that triggers expansion before performance collapses.
- Test degraded behavior: simulate a disk offline and observe application impact; validate resilver time under representative load.
Step-by-step: Pre-flight validation before you commit data
- Confirm topology and redundancy.
- Confirm ashift.
- Confirm dataset properties for intended use (recordsize, compression, atime, sync).
- Run a representative load test that includes mixed IO and measures tail latency.
- Pull a drive (or offline it) in staging and watch the resilver and the application; a minimal sketch of the offline variant follows this list.
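A minimal sketch of the non-destructive version of that drill, on a staging pool, using a placeholder device name; -t makes the offline temporary so it doesn’t persist across a reboot:
cr0x@server:~$ sudo zpool offline -t tank ata-WDC_3
cr0x@server:~$ sudo zpool status tank
cr0x@server:~$ sudo zpool online tank ata-WDC_3
cr0x@server:~$ sudo zpool status tank
Watch application latency while the vdev is degraded; for a realistic resilver time you still need to do a full disk replacement in staging.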
Operational checklist: when a disk fails
- Confirm pool state (zpool status).
- Identify the exact physical disk (by-id mapping, chassis bay mapping).
- Check whether the pool is already under scrub/resilver.
- Reduce avoidable load if possible; resilvering while saturated extends your risk window.
- Replace disk using stable device paths; monitor resilver to completion.
- After completion, run a scrub if you have any reason to doubt data integrity.
FAQ
1) What is the “one rule” about vdevs, really?
Design vdevs as if you’ll be stuck with them forever—because you will be. You can add vdevs, but you can’t casually undo a bad top-level vdev decision without migrating data.
2) Is it okay to mix mirrors and RAIDZ vdevs in one pool?
It’s allowed, and sometimes practical (for example, adding a special vdev or adding a mirror vdev to boost IOPS). But mixed topologies can create uneven performance and confusing allocation behavior. If you do it, document why and how you’ll expand consistently going forward.
3) Why do more vdevs usually help performance?
ZFS schedules IO across top-level vdevs. More vdevs means more independent queues and more parallelism. A single wide vdev can become a single bottleneck for random IO and latency.
4) Should I use RAIDZ1?
Only if you’ve explicitly accepted the risk profile and rebuild window for your drive sizes and workload, and you have strong backups. For many modern large-disk deployments, RAIDZ2 is the more defensible baseline.
5) Does adding a vdev rebalance existing data?
No. New allocations will prefer the new vdev based on free space and weighting, but old blocks generally stay where they are. If you need data to move, you must rewrite it (send/receive, replication, or application-level migration).
6) Do I need a SLOG?
Only if your workload issues synchronous writes and you care about sync latency. Confirm with dataset sync settings and application behavior. If you do need one, use a power-loss-protected device.
7) Are special vdevs safe?
They’re safe when treated as critical, redundant components. They’re unsafe when treated like a cache you can lose. If metadata lives there, losing it can be catastrophic.
8) What’s a sensible scrub schedule?
Common practice is monthly for many pools, sometimes more frequently for high-risk environments. The right answer depends on capacity, workload, and how quickly you want to surface latent errors. The wrong answer is “never.”
9) Why is my resilver taking forever?
Common causes: high pool utilization, heavy production load, slow or marginal disks, controller issues, or a very wide RAIDZ vdev. Check zpool status (issued vs scanned), zpool iostat -v, and hardware error counters.
10) What’s the safest expansion strategy?
Add vdevs that match your existing redundancy and performance profile, keep vdev widths consistent, and expand before you hit high-utilization performance cliffs. If you must change the profile, plan a migration rather than an improvised patch.
Conclusion
ZFS vdev design isn’t a dark art. It’s just unforgiving. The filesystem will happily accept your “temporary” disk, your too-wide RAIDZ, your under-protected special vdev, and your misaligned ashift—and then it will run those decisions in production at full scale, under stress, at 3 a.m.
If you take one thing from this: vdevs are not a configuration detail; they are the contract you sign with your future self. Sign something you can live with: redundancy you can explain, performance you can predict, expansion you can execute, and boring operational practices that keep you out of incident calls where the storage is technically “fine” but the business is on fire.