Storage fails in boring ways until it fails in exciting ways. The exciting version is the one where your “simple” storage layer
quietly ran out of breathing room, the app team is screaming about latency, and your dashboards show… nothing actionable.
ZFS and Windows Storage Spaces can both deliver real results. They can also both ruin your weekend. The difference is how often you
can see the problem forming, and how reliably you can turn evidence into a decision at 03:17.
What “opaque” means in production
“Opaque” is not an insult. It’s a property: the system hides important state behind friendly abstractions, and when performance
or reliability goes sideways, you can’t quickly answer:
- Which physical devices are slow right now?
- Is redundancy currently degraded?
- Is the system repairing data, reshaping layout, or throttling writes?
- Is the problem capacity, fragmentation, metadata pressure, cache misses, or a single dying disk?
Abstractions are useful. They let a generalist deploy storage without becoming a filesystem archaeologist. But abstraction is a
trade: you gain ease and lose directness. If your platform already has a strong operational culture—clear SLOs, consistent
telemetry, practiced incident response—you can survive with something more opaque. If you don’t, the abstraction will happily
cover your eyes while you run toward the cliff.
I’m biased toward systems where the evidence is local, structured, and hard to lie about. ZFS does that well. Storage Spaces can
do it too, but in practice it’s more common to encounter “it should work” reasoning rather than “here is the chain of proof.”
Different philosophies: self-describing truth vs managed abstraction
ZFS: the filesystem that treats storage like a database
ZFS is not just “a RAID and a filesystem.” It’s a storage stack built around end-to-end checksums, copy-on-write semantics, and
a coherent model of truth: data blocks and metadata blocks reference each other in a way that makes corruption detectable and,
with redundancy, repairable.
Operationally, ZFS tends to reward you for learning its vocabulary: vdevs, pools, datasets, snapshots, ARC, transaction groups,
scrubs, resilvers. Once you learn the nouns, the verbs behave consistently. The system is opinionated and largely honest. When
it’s mad, it usually tells you why—sometimes rudely, but clearly.
Storage Spaces: the “storage fabric” mindset
Storage Spaces is a Windows storage virtualization layer: you aggregate physical disks into pools, create virtual disks with
resiliency (mirror, parity), and then format volumes on top. In the Storage Spaces Direct (S2D) world, you scale this across
nodes with clustering.
The strengths are obvious in corporate environments:
- It fits Windows management tooling and identity models.
- It’s “clickable” and scriptable with PowerShell.
- It integrates with clustering and Windows Server features.
The weakness is also obvious once you’ve been on-call: the storage layer becomes a mediated experience. When parity is slow,
when a disk is marginal, when write-back cache is misbehaving, the system’s “helpful” layers can make root cause analysis feel
like arguing with a concierge about where they parked your car.
One quote worth keeping on the wall, because it applies to both stacks: “Hope is not a strategy.” — General Gordon R. Sullivan.
Joke #1: Storage is like a shared spreadsheet—everyone trusts it until the formulas start returning “#REF!” at quarter end.
Facts and history that still matter
A few concrete context points—because the “why” often hides in the “when”:
- ZFS was born at Sun as part of Solaris, designed in an era when RAID controllers lied and bit-rot was not theoretical.
- End-to-end checksums were a core ZFS motivation: the system assumes disks, cables, and controllers can silently corrupt data.
- ZFS made copy-on-write mainstream for general-purpose storage, enabling cheap snapshots and consistent on-disk state after crashes.
- “RAID-Z” was ZFS’s answer to the write hole problem in classic RAID-5/6 implementations.
- Storage Spaces arrived with Windows 8 / Server 2012 as Microsoft’s attempt to provide pooled storage without vendor RAID.
- Storage Spaces Direct (S2D) later extended the model into a clustered, hyperconverged design for Windows Server.
- ReFS integration became a key part of Microsoft’s storage story, with integrity streams and metadata resilience (though behavior depends on configuration).
- Both ecosystems learned painful lessons about caching: write-back caching can save performance and also magnify failure modes when mis-sized or unprotected.
These aren’t trivia. They explain why ZFS is obsessed with correctness signals (checksums, scrubs) and why Storage Spaces often
assumes you want policy-driven storage that “just works” until it doesn’t.
How failures look: ZFS vs Storage Spaces
Failure shape 1: latent corruption and “we restored garbage”
ZFS’s signature move is detecting corruption during reads and during scrubs. If redundancy exists, ZFS can repair from a good
copy. This is not magic; it’s consistent checksums and redundancy at the right layer. The operational win is that you can prove
integrity, not just assume it.
Storage Spaces can offer integrity features depending on filesystem (ReFS) and settings. But in many deployments, integrity is
an afterthought. You will see “disk is healthy” while application-level corruption festers quietly because the stack wasn’t
configured to detect it end-to-end.
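If you’re on ReFS, you can at least verify whether integrity streams are enabled for the data you claim to protect. A minimal sketch from a PowerShell session (paths are placeholders; these cmdlets only apply to ReFS volumes, and enabling integrity on a directory generally affects files created afterward):
Get-FileIntegrity -FileName 'D:\Shares\Data\app.db'
Set-FileIntegrity -FileName 'D:\Shares\Data' -Enable $true
If Get-FileIntegrity says integrity is off for your critical files, the “bit rot protection” you think you have is aspirational.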
Failure shape 2: rebuild behavior and the long tail of pain
ZFS resilvering is logical: it rebuilds only allocated blocks (for some topologies), not necessarily the full disk. That can be
a massive operational advantage when pools are large but mostly empty. The long tail is that a fragmented pool or heavy
metadata pressure can make resilvers slow and disruptive.
Storage Spaces repairs depend on layout and whether you’re in traditional Storage Spaces or S2D. In parity layouts especially,
repairs can be punishing, and the system may prioritize “correctness” in a way that makes your production workload feel like it
got demoted to background noise. The scary part is that the repair work is sometimes less obvious to operators unless you know
the right cmdlets and counters.
Failure shape 3: performance cliffs
ZFS has predictable cliffs:
- Recordsize mismatch and small random writes can hurt.
- SLOG misunderstandings can create placebo devices or real bottlenecks.
- ARC pressure will show up as cache misses and I/O amplification.
- Pool near-full is a classic: fragmentation and allocation costs climb.
Storage Spaces has different cliffs:
- Parity write penalty is real, and “it’s just parity” becomes “why is everything 10x slower.”
- Thin provisioning can turn into a sudden stop when physical capacity runs out.
- Tiering and cache can look great in benchmarks and weird in production when the working set changes.
- Background jobs (repair, optimize, rebalance) can steal performance without obvious user-facing alarms.
Failure shape 4: the human factor—what the system encourages you to ignore
ZFS encourages you to look at zpool status, scrub schedules, and error counters. Storage Spaces encourages you to
look at “HealthStatus: Healthy” and trust the abstraction. That’s fine until the abstraction is summarizing away the one detail
you needed to know: which disk is timing out, which enclosure is flapping, which slab of capacity is overcommitted.
Practical operator tasks (commands, outputs, decisions)
These are not “toy” commands. They’re the ones you run during an incident and again during the postmortem, when you decide
whether the platform is trustworthy or just politely silent.
Task 1 (ZFS): Check pool health and error accounting
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible.
scan: scrub repaired 0B in 06:21:14 with 1 errors on Wed Dec 4 02:18:10 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d5 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d6 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d7 ONLINE 0 0 1
errors: Permanent errors have been detected in the following files:
tank/data/app.db
What it means: The pool is online, but a checksum error was detected and pinpointed to a file. That’s not “fine.”
It’s a story: a bad sector, a cable, a controller, or memory. ZFS caught it; now you must respond.
Decision: Restore or rebuild the affected file from a known-good source, then investigate the device with CKSUM
errors. If errors repeat, replace that disk path (drive, cable, HBA port) even if SMART looks “okay.”
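Once the file is restored and the suspect path investigated, a reasonable follow-up (a sketch, using the pool name above) is to clear the error counters and run a fresh scrub so the next report starts from a clean baseline:
cr0x@server:~$ zpool clear tank
cr0x@server:~$ zpool scrub tank
If CKSUM climbs again after a clean scrub, you have a repeat offender, not a one-off.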
Task 2 (ZFS): Confirm scrub schedule and last scrub outcome
cr0x@server:~$ zpool status tank | sed -n '1,15p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 06:21:14 with 0 errors on Wed Dec 18 02:11:05 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
errors: No known data errors
What it means: Scrub completed, repaired nothing, found no errors. That’s your periodic integrity proof.
Decision: If you don’t see scrubs completing regularly, schedule them. If scrubs take longer over time, treat it
as a capacity/performance signal (fragmentation, slow disks, SMR drives, or a workload shift).
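If nothing on the box schedules scrubs already (some distro packages ship a cron entry or systemd timer; check before adding another), a minimal cron sketch looks like this, with the pool name, schedule, and binary path as assumptions:
# /etc/cron.d/zfs-scrub -- weekly scrub of "tank" at 02:00 on Sundays; adjust to your maintenance window
0 2 * * 0 root /usr/sbin/zpool scrub tank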
Task 3 (ZFS): Identify I/O latency at the vdev level
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 12.1T 7.8T 90 420 12.3M 55.4M
raidz2-0 12.1T 7.8T 90 420 12.3M 55.4M
wwn-...c3d4 - - 12 58 1.6M 7.6M
wwn-...c3d5 - - 11 54 1.5M 7.1M
wwn-...c3d6 - - 12 57 1.6M 7.4M
wwn-...c3d7 - - 55 197 6.3M 26.0M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: One disk is doing disproportionate work. Sometimes that’s normal (hot blocks, a slow sibling, or
a partial failure pushing reads elsewhere). It’s also how you catch a disk that’s “working” but not keeping up.
Decision: Correlate with SMART and kernel logs. If one device consistently shows different behavior, preemptively
replace it or at least move it to a different bay/cable to isolate the path.
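For the SMART half of that correlation, a quick sketch (smartmontools assumed installed; the device path is the suspect disk from the output above):
cr0x@server:~$ smartctl -a /dev/disk/by-id/wwn-0x5000c500a1b2c3d7 | egrep -i 'reallocated|pending|uncorrect|crc'
Rising interface/CRC errors with clean media attributes usually point at the cable or backplane rather than the platters.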
Task 4 (ZFS): Check ARC behavior (memory pressure vs cache hit rate)
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:10:01 8120 610 7 210 34 400 66 0 0 58G 64G
12:10:02 7901 645 8 240 37 405 63 0 0 58G 64G
12:10:03 8012 702 9 260 37 442 63 0 0 58G 64G
12:10:04 7998 690 9 250 36 440 64 0 0 58G 64G
12:10:05 8105 720 9 270 38 450 62 0 0 58G 64G
What it means: A ~9% miss rate is not inherently bad, but if latency is high and misses are climbing, disks are
doing more work. “arcsz” close to “c” means ARC is at target size; memory pressure might still exist elsewhere.
Decision: If misses spike under load, either add RAM, reduce working set, tune recordsize/compression, or consider
special vdev / metadata devices carefully. Don’t add L2ARC as your first move unless you know your read pattern.
Task 5 (ZFS): Confirm dataset properties that drive performance
cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,atime,logbias tank/data
NAME PROPERTY VALUE
tank/data atime off
tank/data compression zstd
tank/data logbias latency
tank/data recordsize 128K
What it means: These settings shape I/O patterns. recordsize impacts amplification; compression can
reduce I/O; logbias affects synchronous write handling in some cases.
Decision: If this dataset is a database with 16K pages, 128K recordsize may be wrong. Change it before
the next big load, and validate. You’re managing physics, not vibes.
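Keep in mind that recordsize only applies to blocks written after the change; existing files keep their old block size until rewritten. A minimal sketch, assuming you give the database its own dataset (the name is hypothetical):
cr0x@server:~$ zfs create -o recordsize=16K -o atime=off tank/data/pgdata
cr0x@server:~$ zfs get recordsize,compression,atime tank/data/pgdata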
Task 6 (ZFS): Watch for a pool getting too full
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME USED AVAIL REFER MOUNTPOINT
tank 12.1T 7.8T 128K /tank
What it means: Capacity looks fine now. But the real operational threshold is not “100% full.” For many pools,
performance and allocation behavior degrade substantially well before that, especially with RAIDZ and fragmented free space.
Decision: Treat ~80% as the start of serious conversations, and ~85–90% as “stop adding data unless you also add
vdevs.” Set quotas/reservations for noisy tenants.
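The guardrails are one zfs set away; a sketch with hypothetical tenant datasets and limits:
cr0x@server:~$ zfs set quota=2T tank/data/build-artifacts
cr0x@server:~$ zfs set reservation=500G tank/data/prod-db
Quota caps the noisy tenant; reservation guarantees space for the dataset that must never run dry.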
Task 7 (ZFS): Replace a disk properly and verify resilver progress
cr0x@server:~$ zpool replace tank wwn-0x5000c500a1b2c3d7 /dev/disk/by-id/wwn-0x5000c500a1b2c3ff
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Dec 19 03:14:22 2025
1.27T scanned at 1.12G/s, 412G issued at 362M/s, 8.41T total
412G resilvered, 4.78% done, 06:23:10 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
wwn-...c3d4 ONLINE 0 0 0
wwn-...c3d5 ONLINE 0 0 0
wwn-...c3d6 ONLINE 0 0 0
replacing-3 DEGRADED 0 0 0
wwn-...c3d7 OFFLINE 0 0 0
wwn-...c3ff ONLINE 0 0 0
What it means: ZFS is resilvering. The output gives you an ETA and throughput. The topology shows “replacing”
which is exactly the state you want to see mid-operation.
Decision: If resilver rate collapses or errors grow, stop treating it as routine. Investigate controller errors,
cabling, and load. Consider scheduling heavy jobs away from resilver windows.
Task 8 (Linux + ZFS): Prove the kernel is seeing disk errors (not just ZFS)
cr0x@server:~$ dmesg -T | egrep -i 'reset|i/o error|timeout|sense key' | tail -n 6
[Thu Dec 19 03:20:11 2025] sd 6:0:5:0: [sdf] tag#231 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Dec 19 03:20:11 2025] sd 6:0:5:0: [sdf] Sense Key : Medium Error [current]
[Thu Dec 19 03:20:11 2025] sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
[Thu Dec 19 03:20:11 2025] blk_update_request: I/O error, dev sdf, sector 184467440
[Thu Dec 19 03:20:12 2025] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Thu Dec 19 03:20:13 2025] ata7: hard resetting link
What it means: The OS sees medium errors and link resets. This is no longer “ZFS being picky.” This is hardware
misbehaving.
Decision: Replace the suspect component. If it repeats across different disks in the same path, replace the path
(HBA, backplane, expander, cable). Don’t “monitor” a disk that’s throwing medium errors under resilver.
Task 9 (Storage Spaces): Inspect physical disks and media type
cr0x@server:~$ powershell -NoProfile -Command "Get-PhysicalDisk | Select FriendlyName,SerialNumber,MediaType,HealthStatus,OperationalStatus,Size | Format-Table -Auto"
FriendlyName SerialNumber MediaType HealthStatus OperationalStatus Size
------------ ------------ --------- ------------ ----------------- ----
PD01 Z4A0... HDD Healthy OK 7.28 TB
PD02 Z4A1... HDD Healthy OK 7.28 TB
PD03 Z4A2... HDD Warning OK 7.28 TB
PD04 Z4A3... HDD Healthy OK 7.28 TB
What it means: “Warning” on a physical disk is your early smoke alarm. Storage Spaces often keeps running while
the disk degrades—until it doesn’t.
Decision: Correlate with SMART/vendor tools and event logs. If the disk is “Warning,” plan replacement now, not
after the virtual disk goes “Degraded.”
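For the correlation step, Get-StorageReliabilityCounter surfaces SMART-style counters through the same stack. A sketch from a PowerShell session (which counters are actually populated depends on the drive, firmware, and driver):
Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-Object DeviceId, Temperature, ReadErrorsUncorrected, WriteErrorsUncorrected, Wear, PowerOnHours | Format-Table -Auto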
Task 10 (Storage Spaces): Check virtual disk health and resiliency
cr0x@server:~$ powershell -NoProfile -Command "Get-VirtualDisk | Select FriendlyName,ResiliencySettingName,HealthStatus,OperationalStatus,Size,FootprintOnPool | Format-Table -Auto"
FriendlyName ResiliencySettingName HealthStatus OperationalStatus Size FootprintOnPool
------------ --------------------- ------------ ----------------- ---- ---------------
VD-Data Parity Healthy OK 40 TB 60 TB
VD-VMs Mirror Healthy OK 12 TB 24 TB
What it means: Parity footprint is larger than logical size due to layout and parity overhead. This also hints at
write amplification and rebuild costs. Mirror is more predictable operationally.
Decision: If your workload is write-heavy or latency sensitive, parity is a tax you will pay daily. Put hot
workloads on mirror or tiered mirror, and reserve parity for colder data with clear expectations.
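Moving a hot workload onto mirror is not exotic. A sketch from a PowerShell session, with hypothetical names and size (fixed provisioning keeps the thin-provisioning failure mode out of the picture):
New-VirtualDisk -StoragePoolFriendlyName 'Pool01' -FriendlyName 'VD-Hot' -ResiliencySettingName Mirror -ProvisioningType Fixed -Size 2TB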
Task 11 (Storage Spaces): Verify pool free space vs thin provisioning risk
cr0x@server:~$ powershell -NoProfile -Command "Get-StoragePool -IsPrimordial $false | Select FriendlyName,HealthStatus,Size,AllocatedSize,FreeSpace | Format-List"
FriendlyName : Pool01
HealthStatus : Healthy
Size : 58.2 TB
AllocatedSize : 54.9 TB
FreeSpace : 3.3 TB
What it means: Only ~3.3 TB free remains. If you have thin-provisioned virtual disks and they grow, you can hit a
hard stop. Windows will try to warn you. Production will try to ignore it.
Decision: Set alerts on FreeSpace, and enforce guardrails: either stop thin provisioning in critical systems or
keep real headroom with a policy (e.g., never below 15–20%).
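If you don’t trust a single reported number, compute headroom from Size and AllocatedSize yourself. A minimal sketch (pool name and the 15% threshold are assumptions; wire the warning into whatever actually pages someone):
# check-pool-headroom.ps1 (sketch)
$p = Get-StoragePool -FriendlyName 'Pool01'
$freePct = [math]::Round(100 * (1 - ($p.AllocatedSize / $p.Size)), 1)
if ($freePct -lt 15) { Write-Warning "Pool01 is down to $freePct% free physical capacity" }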
Task 12 (Storage Spaces): Identify ongoing repair/optimization jobs stealing performance
cr0x@server:~$ powershell -NoProfile -Command "Get-StorageJob | Select Name,JobState,PercentComplete,BytesProcessed,TimeRemaining | Format-Table -Auto"
Name JobState PercentComplete BytesProcessed TimeRemaining
---- -------- --------------- ------------- -------------
Repair Virtual Disk Running 17 3.1 TB 05:12:33
Optimize Storage Pool Running 42 9.8 TB 02:01:10
What it means: Background work is actively running. This is often the hidden reason “storage got slow.” The jobs
are legitimate—but they compete with your workload.
Decision: If this is production, schedule these jobs or throttle them where possible. If repairs run frequently,
stop and ask why: failing disks, unstable connections, or misconfiguration.
Task 13 (Storage Spaces): Map a virtual disk to underlying physical disks
cr0x@server:~$ powershell -NoProfile -Command "$vd=Get-VirtualDisk -FriendlyName 'VD-Data'; Get-PhysicalDisk -VirtualDisk $vd | Select FriendlyName,HealthStatus,OperationalStatus,Size | Format-Table -Auto"
FriendlyName HealthStatus OperationalStatus Size
------------ ------------ ----------------- ----
PD01 Healthy OK 7.28 TB
PD02 Healthy OK 7.28 TB
PD03 Warning OK 7.28 TB
PD04 Healthy OK 7.28 TB
What it means: You can tie the abstraction back to actual disks. This is how you avoid replacing the wrong drive
in the wrong chassis while everyone watches.
Decision: Replace the “Warning” disk and confirm the repair job starts. If multiple disks are “Warning,” assume a
shared fault domain (enclosure/backplane/firmware).
Task 14 (Storage Spaces / Windows): Read the event log for storage-specific signals
cr0x@server:~$ powershell -NoProfile -Command "Get-WinEvent -LogName System -MaxEvents 30 | Where-Object {$_.ProviderName -match 'Disk|Stor|Microsoft-Windows-Storage'} | Select TimeCreated,ProviderName,Id,LevelDisplayName,Message | Format-Table -Wrap"
TimeCreated ProviderName Id LevelDisplayName Message
---------- ------------ -- ---------------- -------
12/19/2025 03:21:10 Microsoft-Windows-StorageSpaces-Driver 312 Error Physical disk has encountered an error and may fail.
12/19/2025 03:21:14 Disk 153 Warning The IO operation at logical block address was retried.
12/19/2025 03:22:01 storahci 129 Warning Reset to device, \Device\RaidPort0, was issued.
What it means: This is the Windows equivalent of dmesg truth. Disk retries and resets predict worse things.
Decision: Treat repeated 153/129-style warnings as a hardware incident, not a software mystery. Check firmware,
cabling, controller drivers, and power stability.
Joke #2: A storage pool is a lot like a corporate reorg—everything is “more resilient” right up until you need to find who owns
the problem.
Three corporate mini-stories from the trenches
Mini-story 1: An incident caused by a wrong assumption
A mid-sized SaaS company ran Windows-based file services for internal build artifacts and some legacy application shares. They
moved from a traditional RAID array to Storage Spaces because it promised easier expansion. The migration was clean, the pool
was healthy, and they liked the idea of thin provisioning. “We’ll just add disks when we need them.”
The wrong assumption: that thin provisioning failures are graceful. They aren’t. Thin provisioning is an agreement with the
future, and the future does not sign SLAs.
Over a quarter, a few teams increased artifact retention. A couple of test environments started dropping large datasets into the
share “temporarily.” Pool free space slowly shrank. Monitoring existed, but it was looking at volume free space inside the
virtual disk, not the storage pool free space underneath. The share still had plenty of logical free space, so nobody worried.
Then a weekly job hit a growth spike, the pool ran out of physical capacity, and writes started failing. Not “slow.” Failing.
Applications that assumed POSIX-ish semantics behaved badly. Some retried aggressively. Others corrupted their own metadata. A
handful deadlocked on I/O.
The fix wasn’t heroic. They freed space, added disks, and stabilized the pool. The lesson was: you must monitor the pool, not
just the volume. And you must agree on a headroom policy that management can’t negotiate away in a meeting.
Mini-story 2: An optimization that backfired
A data platform team used ZFS on Linux for a mixed workload: object-like blobs, analytics extracts, and a few PostgreSQL
instances. They got performance complaints, and someone suggested adding a fast NVMe “as a SLOG” because they read a blog post
once. They installed an inexpensive consumer NVMe and set logbias=latency on a bunch of datasets.
For a week, things felt snappier. Not uniformly, but enough to declare victory. Then the NVMe started reporting media errors,
followed by timeouts. The pool didn’t explode, but synchronous write latency spiked. The databases started stalling. The incident
was confusing because the main RAIDZ vdevs were healthy and iostat looked “fine” at a glance.
They had optimized the wrong thing and created a fragile dependency. A SLOG only matters for synchronous writes, and only
if you are actually doing sync writes. Worse: if you add a SLOG that can’t sustain power-loss-safe writes, you can convert an
orderly crash into a recovery horror show.
The postmortem was blunt: don’t bolt on a SLOG because you’re nervous. Measure your sync write rate. Use an enterprise device
with power-loss protection if you do it at all. And remember that tuning one dataset for latency can punish another dataset that
wanted throughput.
The long-term outcome was good. They removed the consumer NVMe, tuned recordsize for the databases, enabled compression where it
helped, and added RAM. It was less exciting than “add a magic cache drive,” and it worked.
Mini-story 3: A boring but correct practice that saved the day
A financial services company ran a ZFS-backed NFS platform for internal analytics. The team was not flashy. They were the kind
of people who schedule scrubs, test restores, and refuse to run pools at 92% capacity. Other teams called them paranoid. They
called it “Tuesday.”
One month, a batch workload started failing with checksum errors during reads. ZFS flagged a handful of files with permanent
errors. The pool stayed online, but the evidence was unambiguous: some blocks were bad. The team didn’t debate whether ZFS was
“overreacting.” They treated it as a data integrity incident.
Because they ran regular scrubs, they had a known-good baseline and could say, confidently, that the corruption was recent. They
quickly mapped the errors to a specific disk path and found repeated link resets in kernel logs. They replaced the drive and the
cable as a unit. They restored the affected files from snapshots replicated to a second system.
No drama, no prolonged outage. The bigger win was organizational: they could prove the platform detected and contained
corruption. That proof mattered more than raw performance.
The boring practice was a combination of scrub discipline, replication, and a culture of believing error counters. It saved the
day because it reduced uncertainty. Uncertainty is what turns incidents into folklore.
Fast diagnosis playbook
The goal is not to “collect data.” The goal is to identify which layer owns the bottleneck: workload, filesystem, volume manager,
device, or transport. Here’s a practical order of operations that works under pressure.
First: confirm whether this is correctness work or user workload
- ZFS: zpool status (scrub/resilver in progress? errors growing?)
- Storage Spaces: Get-StorageJob (repair/optimize running?)
If background repair is active, your performance “problem” may be a safety feature. Your decision becomes scheduling and
throttling—not tuning random knobs.
Second: identify if it’s capacity pressure masquerading as latency
- ZFS: zfs list, check pool fullness and fragmentation symptoms
- Storage Spaces: Get-StoragePool FreeSpace; validate thin provisioning assumptions
A system near full is not just out of space. It’s out of options. Allocation gets expensive, repairs get slower, and queues get
deeper.
Third: isolate the slow device or path
- ZFS/Linux: zpool iostat -v + dmesg for timeouts/resets
- Windows: event logs + Get-PhysicalDisk warning states
If one disk is slow, your “storage system performance” is hostage to that disk. Replace it. Don’t debate with it.
Fourth: validate caching assumptions
- ZFS: ARC stats, sync write behavior, SLOG presence and health
- Storage Spaces: tiering settings, write-back cache configuration, and whether cache is being thrashed
Fifth: align workload I/O shape with layout
Random writes on parity layouts hurt. Small blocks on large recordsize waste bandwidth. Databases hate surprises. File shares
hate metadata storms. If the workload changed, the storage “suddenly got worse” because it’s doing exactly what you asked—just
not what you meant.
Common mistakes: symptom → root cause → fix
1) “Everything is healthy” but users report stalls
Symptom: Apps hang on I/O, but dashboards show green.
Root cause: Background repair/optimization consuming IOPS, or a single disk intermittently timing out while
the abstraction still reports “Healthy.”
Fix: Check Get-StorageJob / zpool status. Then check OS logs for timeouts. Replace
the flapping disk/path.
2) Parity virtual disk is “fine” but writes are painfully slow
Symptom: Good read throughput, terrible small random writes.
Root cause: Parity write penalty and read-modify-write behavior; cache not sized for workload; tiering not
aligned with write pattern.
Fix: Use mirror for write-heavy data, or tier with SSD and validate cache hit rate. Don’t promise parity
latency to database teams.
3) ZFS pool goes slow over months
Symptom: Same hardware, same workload (allegedly), but latency creeps upward.
Root cause: Pool filling up, fragmentation, more metadata churn, scrubs/resilvers taking longer, ARC misses
increasing as working set grows.
Fix: Keep headroom, add vdevs (not just bigger disks), tune dataset properties, and watch scrub duration as a
trend metric.
4) “We added a cache drive and it got worse”
Symptom: More devices, worse latency.
Root cause: Wrong cache type (SLOG vs L2ARC confusion), consumer SSD without power-loss protection, cache
thrash, or extra failure domain.
Fix: Measure sync writes before adding SLOG. Use proper devices. Remove the cache if it destabilizes the
system; stability beats theoretical performance.
5) Thin-provisioned Storage Spaces suddenly hits a wall
Symptom: Writes fail, services crash, volumes look “not full.”
Root cause: Pool physical capacity exhausted while virtual disks still have logical free space.
Fix: Monitor pool FreeSpace, enforce headroom policy, and stop treating thin provisioning as a capacity plan.
6) ZFS reports checksum errors, but SMART says disk is fine
Symptom: CKSUM increments on a device, but SMART attributes look normal.
Root cause: Bad cable, bad expander port, HBA issues, or transient link resets corrupting transfers.
Fix: Check kernel logs. Replace the path components. Move disk bays. Don’t accept “SMART is clean” as a
verdict when ZFS has proof of corruption.
7) Repairs take forever and performance tanks during rebuilds
Symptom: Rebuild/repair runs for days; workload becomes unusable.
Root cause: Large slow disks, parity layouts, SMR drives, too much concurrent workload, or poor QoS
controls.
Fix: Prefer mirrors for latency-sensitive workloads, avoid SMR in rebuild-heavy environments, schedule rebuild
windows, and keep spares/automation ready.
Checklists / step-by-step plan
If you’re choosing between ZFS and Storage Spaces
- Decide what you fear more: silent corruption or operational complexity. If you fear corruption, ZFS is the default answer.
- Inventory skills: if your team lives in Windows and PowerShell, Storage Spaces may be more supportable, but only if you commit to learning the deep cmdlets and logs.
- Match layout to workload: parity for cold, mirror for hot. Don’t negotiate with physics.
- Define headroom policy: write it down. Enforce it with quotas and alerts.
- Define integrity policy: scrubs for ZFS; integrity streams/ReFS settings if you’re in Windows land (and test what they actually do in your environment).
- Plan failure drills: practice disk replacement, repair jobs, and restore workflows before you need them.
Step-by-step: building a sane ZFS deployment
- Use HBAs, not RAID controllers (IT mode / pass-through), so ZFS can see the disks honestly.
- Pick vdev topology intentionally: mirrors for IOPS and rebuild speed; RAIDZ2 for capacity with acceptable rebuild risk.
- Enable compression (often zstd) unless you have a specific reason not to.
- Set dataset properties per workload (recordsize, atime, sync policy with eyes open); a sketch follows this list.
- Schedule scrubs and alert on errors and scrub duration changes.
- Test restore from snapshots/replication. A snapshot is not a backup until you’ve restored from it.
- Keep headroom and plan expansions by adding vdevs, not hoping bigger disks magically fix allocation pain.
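A minimal sketch of that per-workload split, with hypothetical dataset names and values:
cr0x@server:~$ zfs create -o compression=zstd -o atime=off -o recordsize=16K tank/db
cr0x@server:~$ zfs create -o compression=zstd -o atime=off -o recordsize=1M tank/media
cr0x@server:~$ zfs create -o compression=zstd -o atime=off tank/home
The exact values matter less than the structure: one dataset per workload, so properties, quotas, and snapshots can differ without a migration.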
Step-by-step: making Storage Spaces less mysterious
- Decide on mirror vs parity per volume based on write pattern, not budget hopes.
- Document thin provisioning usage and set pool free-space alerts that wake humans.
- Instrument background jobs (Get-StorageJob) so “it’s slow” can be correlated with “it’s repairing.”
- Track physical disk warning signs and replace early. Don’t wait for “Unhealthy.”
- Validate firmware and drivers in staging; storage drivers are not the place to YOLO updates.
- Practice a repair and time it (a sketch follows this list). Your first repair should not be during a customer outage.
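A sketch of that drill (names are hypothetical; run it in staging, and only if the pool has enough spare capacity to rebuild onto the remaining disks):
# From a PowerShell session: retire a disk, kick the repair, and time how long the job actually runs.
Set-PhysicalDisk -FriendlyName 'PD03' -Usage Retired
Repair-VirtualDisk -FriendlyName 'VD-Data'
Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesProcessed | Format-Table -Auto
# If Repair-VirtualDisk blocks, watch Get-StorageJob from a second session instead.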
FAQ
1) Is ZFS “safer” than Storage Spaces?
ZFS is usually safer by default for data integrity because end-to-end checksums and scrubs are first-class. Storage
Spaces can be safe too, but you must deliberately configure and monitor integrity behavior.
2) Why do people say ZFS needs lots of RAM?
ZFS uses RAM for ARC caching, which improves performance and reduces disk I/O. It doesn’t “require” absurd RAM to function, but
it will use what you give it. Under-provision RAM and you’ll feel it as latency.
3) Is Storage Spaces parity always slow?
Parity is inherently more expensive for small random writes. With large sequential writes, a good cache/tier design, and a
workload that fits the model, it can be fine. But if you promise parity latency to OLTP databases, you are writing your own
incident report.
4) What’s the ZFS equivalent of “virtual disk footprint on pool”?
ZFS shows allocation at pool and dataset levels (zfs list, zpool list). RAIDZ overhead isn’t hidden,
but it’s expressed through actual allocated space and parity layout rather than a single “footprint” number.
5) Can Storage Spaces detect bit rot like ZFS?
It can, depending on filesystem and settings (commonly ReFS with integrity features). In practice, many deployments don’t enable
or validate those features end-to-end, so detection is less consistent operationally.
6) Do I need a SLOG on ZFS?
Only if you have meaningful synchronous write load and your main vdevs can’t handle the latency. Measure first. If you add one,
use a device designed for it (power-loss protection matters).
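One rough way to answer “am I actually doing sync writes?” on OpenZFS for Linux is to watch the ZIL commit counter (the kstat path below is an assumption that holds for current OpenZFS releases):
cr0x@server:~$ awk '/zil_commit_count/ {print $NF}' /proc/spl/kstat/zfs/zil; sleep 60; awk '/zil_commit_count/ {print $NF}' /proc/spl/kstat/zfs/zil
If the counter barely moves under production load, a SLOG buys you nothing; if it climbs by thousands per minute, you have a real sync workload to design for.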
7) What’s the biggest “gotcha” with thin provisioning?
You can run out of physical capacity while still having logical free space. That failure mode is abrupt and ugly. Thin
provisioning is not a capacity plan; it’s a utilization tactic with strict monitoring requirements.
8) How do I decide mirror vs RAIDZ/parity?
If you need predictable latency and fast rebuilds: mirror. If you need capacity efficiency and can tolerate slower writes and
longer repairs: RAIDZ2/parity (with enough spindles and headroom). If you’re unsure, mirror is the safer bet operationally.
9) Which one is easier to operate?
ZFS is easier to operate once you learn it because the signals are consistent and local. Storage Spaces is easier to
start with, but can become harder when you’re debugging performance or capacity risk across layers.
10) What should I alert on first?
For ZFS: zpool status changes, scrub failures, rising checksum errors, and pool capacity thresholds. For Storage
Spaces: pool FreeSpace, any disk HealthStatus not “Healthy,” and long-running storage jobs.
Next steps you can actually do
If you’re running ZFS: make zpool status boring. Schedule scrubs. Alert on checksum deltas. Track scrub duration and
resilver times as trends. Keep headroom. And stop pretending that a pool at 89% is “basically fine.”
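Boring is achievable with one cron entry. A sketch, assuming cron and working outbound mail (the address is a placeholder):
# /etc/cron.d/zpool-health -- page only when zpool status -x has something to say
*/15 * * * * root s="$(/usr/sbin/zpool status -x)"; [ "$s" = "all pools are healthy" ] || echo "$s" | mail -s "zpool not healthy on $(hostname)" oncall@example.com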
If you’re running Storage Spaces: stop trusting the word “Healthy” without context. Build a daily health script around
Get-PhysicalDisk, Get-VirtualDisk, Get-StoragePool, and Get-StorageJob. Alert
on pool FreeSpace. Practice a disk replacement and repair. Document exactly what parity is used for, and refuse to put
latency-sensitive workloads there.
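A minimal sketch of that daily script (thresholds and the alerting hook are assumptions; replace Write-Warning with whatever actually wakes a human):
# daily-storage-health.ps1 (sketch)
$issues = @()
$issues += Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, SerialNumber, HealthStatus, OperationalStatus
$issues += Get-VirtualDisk | Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, ResiliencySettingName, HealthStatus, OperationalStatus
$issues += Get-StorageJob | Where-Object { $_.JobState -eq 'Running' } |
    Select-Object Name, JobState, PercentComplete
$issues += Get-StoragePool -IsPrimordial $false |
    Where-Object { ($_.Size - $_.AllocatedSize) -lt (0.2 * $_.Size) } |
    Select-Object FriendlyName, Size, AllocatedSize
if ($issues) { Write-Warning ($issues | Out-String) }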
If you’re choosing: pick the stack whose failure modes you can explain to a tired engineer using real evidence. “Easy” is only
easy when it stays legible under stress. When it becomes opaque, it becomes expensive.