ZFS vs mdadm: Where mdraid Wins and Where It Loses

Storage failures don’t announce themselves with a single clean symptom. They show up as “the database is slow,” or “backups are missing,”
or that special classic: “it was fine yesterday.” And then someone notices a disk is rebuilding, a latency graph is screaming, and you’re
negotiating downtime with a calendar that doesn’t believe in physics.

ZFS and Linux mdraid (managed by mdadm) both run real production fleets. They’re not religious choices; they’re engineering choices.
This is the comparison you make when you care about correctness, recovery time, and whether your on-call can sleep.

Two mental models: storage as a filesystem vs storage as a stack

Comparing ZFS to mdadm is slightly unfair because they’re not the same type of thing.
ZFS is an integrated storage system: volume manager + RAID + filesystem + checksums + snapshots + replication primitives.
mdadm is a software RAID layer. It creates a block device. You still need to put LVM and a filesystem (ext4, XFS, btrfs, etc.) on top.

That difference changes where correctness lives. With mdraid, data integrity is mostly “somewhere else”:
your filesystem, your application, your backup verification, and your monitoring. With ZFS, integrity is built-in and opinionated:
every block is checksummed; scrub is a first-class operation; snapshots are cheap; replication is designed for “ship changes, not full copies.”

If you’re reading this because you’re designing a new system: don’t start by asking “which is faster.”
Start by asking: what failures must be survivable without human heroics?
Then ask: what operational tasks must be safe at 3 AM? Speed matters, but speed without safety is just a faster way to get the wrong data.

Interesting facts and historical context (short, useful)

  • ZFS shipped in 2005 (Solaris era) as a reaction to filesystems that couldn’t reliably detect silent corruption.
  • mdraid predates modern “storage stacks”; it’s been in Linux for decades and is still the default mental model for many admins: RAID, then filesystem.
  • The “RAID5 write hole” is not folklore; it’s the real risk of power loss during parity updates leaving stripes inconsistent without additional protection.
  • ZFS RAIDZ was designed to avoid the write hole by using copy-on-write semantics and transactional updates.
  • mdraid write-intent bitmaps exist specifically to make resync after an unclean shutdown faster, at some write overhead cost.
  • Early ZFS on Linux adoption was slowed by licensing friction; it’s why ZFS isn’t “just in mainline” like mdraid.
  • ZFS introduced practical, production-friendly snapshots as cheap metadata operations; many teams built backup workflows around that before “cloud snapshots” were mainstream.
  • Linux RAID6 matured because disks got huge; rebuild windows became large enough that dual-parity stopped being “paranoid” and became “normal.”

Where mdraid wins (and why)

1) It’s boring in the best way: maximum compatibility

mdraid is a block device. That’s its superpower. Everything understands block devices: LVM, dm-crypt, XFS, ext4, database tools, OS installers,
cloud images, recovery environments. If you want to boot from RAID1 in a plain distro installer without additional modules, mdraid is still
the path of least resistance.

Operationally, that means you can move the array between machines and recover it from almost any Linux rescue ISO.
ZFS recovery environments exist, but “available everywhere” is a real mdraid advantage.
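
A minimal sketch of what that portability buys you from a rescue shell. These are standard mdadm commands; discovery is automatic, so no device names are assumed:

cr0x@server:~$ sudo mdadm --examine --scan          # read member superblocks and print the arrays they describe
cr0x@server:~$ sudo mdadm --assemble --scan         # assemble every array found in those superblocks
cr0x@server:~$ cat /proc/mdstat                     # confirm the array is up before touching any filesystem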

2) Predictable performance characteristics for simple workloads

mdraid has no ARC, no copy-on-write, and no data checksumming of its own.
That can translate into more predictable latency for certain synchronous write patterns when paired with a conservative filesystem
and a tuned IO scheduler.

RAID10 on mdraid is especially hard to mess up. It’s not always the most space-efficient, but it’s forgiving.
For latency-sensitive workloads (databases, queues) where you can afford the extra mirrors, mdraid RAID10 is a strong baseline.

3) Easier to “compose” with encryption and existing standards

ZFS encryption is good, but the ecosystem around dm-crypt/LUKS is massive. If your compliance team already has procedures,
key management, and incident playbooks built around LUKS, mdraid slots in cleanly.

Stack example that works well: mdraid (RAID10) → LUKS → XFS. It’s understandable, testable, and relatively easy to recover.
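
A sketch of that stack, assuming four illustrative partitions and a made-up mapper name (data_crypt); adapt device names and LUKS options to your own standards:

cr0x@server:~$ sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
cr0x@server:~$ sudo cryptsetup luksFormat /dev/md0                 # encryption sits on the md block device
cr0x@server:~$ sudo cryptsetup open /dev/md0 data_crypt            # unlock; "data_crypt" is a placeholder mapper name
cr0x@server:~$ sudo mkfs.xfs /dev/mapper/data_crypt                # filesystem goes on top of the mapper device
cr0x@server:~$ sudo mount /dev/mapper/data_crypt /data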

4) Less RAM hunger, fewer moving parts in memory

ZFS wants RAM. Not because it’s sloppy, but because ARC and metadata caching are part of its design.
If you’re running small-footprint nodes (edge, appliances, minimal VMs with direct-attached disks), mdraid + ext4/XFS may be the only reasonable fit.
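
If you do run ZFS on a small node, you can cap ARC instead of letting it take its default share of RAM. A sketch, assuming OpenZFS on Linux; the 8 GiB cap and the modprobe file name are illustrative:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max                                      # 0 means "use the built-in default"
cr0x@server:~$ echo 8589934592 | sudo tee /sys/module/zfs/parameters/zfs_arc_max               # cap ARC at 8 GiB at runtime
cr0x@server:~$ echo "options zfs zfs_arc_max=8589934592" | sudo tee /etc/modprobe.d/zfs.conf   # persist the cap across reboots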

First short joke: mdraid is the storage equivalent of a claw hammer—unsexy, loud, and somehow always on the truck when you need it.

Where mdraid loses (and how it hurts)

1) Silent corruption detection is not its job

mdraid does not checksum your data end-to-end. If the disk returns wrong data without reporting an error (yes, it happens),
mdraid happily hands that wrong data to your filesystem. Some filesystems can detect metadata corruption, but user data is generally unchecked.

The practical consequence: you can have “successful” reads of bad data, and only learn about it when an application complains—or worse, doesn’t.
Your mitigation becomes “better backups” and “regular verification,” which is a plan, but it’s not a built-in safety net.

2) RAID5/6 write hole and partial-stripe ugliness

The RAID5 write hole problem: a power loss or crash during a parity update can leave data/parity inconsistent.
Later reads can return corruption with no clear signal. There are mitigations: a UPS, a write journal layered above the array, or avoiding parity RAID for critical data.
But the core risk exists.

For mdraid parity arrays, you should treat “unclean shutdown” as an incident that warrants verification, not a shrug.
A write-intent bitmap helps resync speed, but it doesn’t magically prove the correctness of every stripe.
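
If an existing array doesn’t have one, adding an internal bitmap is a small, reversible change; a sketch using the same md0 device as elsewhere in this article:

cr0x@server:~$ sudo mdadm --grow --bitmap=internal /dev/md0        # enable a write-intent bitmap in place
cr0x@server:~$ sudo mdadm --detail /dev/md0 | grep -i bitmap       # confirm it reports "Intent Bitmap : Internal"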

3) Rebuilds can be brutal and the knobs are sharp

Rebuild speed is a trade: go fast and you melt latency for production; go slow and you extend your vulnerability window.
mdraid gives you controls, but they’re global-ish, easy to forget, and easy to set in a way that makes the box look dead.

4) Observability is fragmented

With mdraid you watch: mdadm state, per-disk SMART, filesystem logs, sometimes LVM, sometimes dm-crypt.
None of this is impossible. But it’s a multi-panel cockpit, and the alarms don’t always correlate.

Where ZFS wins (and when it’s the obvious choice)

1) End-to-end checksumming and self-healing (when redundant)

ZFS checksums every block and verifies on read. If redundancy exists (mirror, RAIDZ), it can heal by reading another copy and repairing.
This changes the operational game: corruption becomes a detected, actionable event instead of a vague suspicion.

This is the feature that moves ZFS from “filesystem” to “integrity system.”
If you store anything you’d be ashamed to lose—customer data, audit logs, backups—this matters.
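
The operational handle for that is the scrub: read every allocated block, verify its checksum, and repair from redundancy where possible. A sketch against the tank pool used later in this article:

cr0x@server:~$ sudo zpool scrub tank                   # walk all allocated data and verify checksums
cr0x@server:~$ sudo zpool status tank | grep scan      # progress, plus how much (if anything) was repaired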

2) Copy-on-write semantics: safer snapshots, safer metadata updates

ZFS writes new blocks then flips pointers. That’s not just a neat trick; it’s why snapshots are cheap, and why certain crash-consistency
problems are less exciting than in parity RAID stacks.

It also means fragmentation and write amplification can show up in surprising ways if you treat ZFS like ext4.
ZFS rewards planning. It punishes “YOLO, ship it” layouts with unpredictable tail latency later.

3) Snapshots and replication as first-class operations

With ZFS, “take a snapshot” and “replicate incrementally” are normal operations, not bolt-ons.
If your RPO/RTO story depends on fast, consistent point-in-time copies, ZFS is often the cleanest answer.
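
A sketch of that workflow, assuming a hypothetical backup-host with a backup/db dataset that already holds the previous snapshot (an incremental receive needs the common base on both sides):

cr0x@server:~$ sudo zfs snapshot tank/db@2025-12-26                        # instant, cheap point-in-time copy
cr0x@server:~$ sudo zfs send -i tank/db@2025-12-25 tank/db@2025-12-26 | \
                 ssh backup-host sudo zfs receive backup/db                # ship only the blocks that changed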

4) Operational clarity: pool health is a single source of truth

zpool status gives you a real, meaningful health view: degraded, faulted, checksum errors, resilver progress.
It’s not perfect, but it’s cohesive. mdraid tends to scatter the truth across multiple tools.

Second short joke: ZFS will remember every mistake you made with recordsize—and unlike your coworkers, it will show you graphs.

One reliability quote (paraphrased idea)

A paraphrased idea from Werner Vogels: build systems expecting things to fail, and design for recovery as a normal mode of operation.

Architecture patterns that actually work

Pattern A: mdraid RAID10 for databases that hate surprises

If you have a latency-sensitive database (PostgreSQL, MySQL, Elasticsearch hot tier) and you can afford mirroring,
mdraid RAID10 is a very sane option. Pair it with XFS or ext4. Add a UPS. Monitor SMART and md state.

What you don’t get: end-to-end integrity. Your mitigation is backup verification and periodic offline checksums at the application layer
(for truly critical datasets).
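
A low-tech sketch of that application-layer verification, assuming a hypothetical /data/critical directory; store the manifest somewhere other than the array it describes:

cr0x@server:~$ cd /data/critical
cr0x@server:~$ find . -type f -print0 | sort -z | xargs -0 sha256sum > ~/critical.sha256   # build a checksum manifest
cr0x@server:~$ sha256sum --check --quiet ~/critical.sha256                                 # re-run later; silence means nothing drifted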

Pattern B: ZFS mirror or RAIDZ2 for “I care about correctness and operability”

ZFS mirror is the low-drama choice: fast resilvers, simple failure modes, predictable behavior.
RAIDZ2 is great for capacity with safety, especially for large disks where rebuild windows are long.

For VMs, consider special vdevs, recordsize tuning, and separate log devices only when you know what they do.
ZFS rewards simplicity: mirrors + good monitoring + tested replacement workflows.
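
A sketch of the low-drama starting point, with placeholder /dev/disk/by-id paths (prefer stable by-id names over sdX letters, which can shuffle between boots):

cr0x@server:~$ sudo zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B   # two-way mirror vdev
cr0x@server:~$ sudo zpool add tank mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D      # grow capacity with another mirror vdev
cr0x@server:~$ sudo zpool status tank                                                                # sanity-check the layout before loading data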

Pattern C: mdraid + LUKS + XFS for compliance-heavy shops

If the org already standardizes on LUKS tooling, key escrow processes, and “break glass” recovery,
mdraid integrates neatly. This isn’t a technical victory so much as an operational one.

Pattern D: ZFS for backup targets and archival storage

Backups are where silent corruption goes to retire. ZFS’s scrub + checksums + snapshots make it a strong target for backups,
even if your primary storage is not ZFS.
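
The routine on a backup target is short; a sketch assuming a hypothetical pool named backup, with a cadence you would adjust to your own retention policy:

cr0x@server:~$ sudo zfs snapshot -r backup@2025-12-26         # freeze today's received backups (recursively)
cr0x@server:~$ sudo zpool scrub backup                        # periodically verify every block the backups ever wrote
cr0x@server:~$ sudo zpool status backup | grep scan           # alert on anything other than "repaired 0B ... 0 errors"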

Practical tasks: commands, outputs, decisions (12+)

These are not “toy” commands. They’re the ones you run when something feels off and you need to decide whether you’re looking at a disk,
a controller, a rebuild, a cache issue, or an application doing something regrettable.

Task 1: Confirm mdraid array health quickly

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5]
md0 : active raid10 sdb1[1] sda1[0] sdd1[3] sdc1[2]
      1953383488 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>

What it means: [4/4] [UUUU] says all four members are up. No resync line means it’s not rebuilding.
Decision: If you see [U_UU] or a resync/recovery line, treat it as degraded and check which disk dropped.

Task 2: Get detailed mdraid member state

cr0x@server:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed Nov  6 10:41:22 2024
        Raid Level : raid10
        Array Size : 1953383488 (1.82 TiB 2.00 TB)
     Used Dev Size : 976691744 (931.51 GiB 1000.20 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Thu Dec 26 01:12:09 2025
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

           Layout : near=2
       Chunk Size : 512K

Consistency Policy : resync

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync set-A   /dev/sda1
       1       8       17        1      active sync set-B   /dev/sdb1
       2       8       33        2      active sync set-A   /dev/sdc1
       3       8       49        3      active sync set-B   /dev/sdd1

What it means: State : clean is good. “failed” or “removed” means trouble. Layout and chunk affect performance.
Decision: If state isn’t clean, start incident workflow: identify failing member, check SMART, schedule replacement, and assess rebuild risk.

Task 3: Find which device is misbehaving (SMART triage)

cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|SMART overall-health"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3

What it means: Pending and uncorrectable sectors are a bad sign even if “PASSED.”
Decision: Plan replacement. If the array is degraded, do not wait. If it’s parity RAID, consider pausing heavy workloads during rebuild.

Task 4: Check ZFS pool health and error types

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool.
action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features.
  scan: scrub repaired 0B in 00:21:10 with 0 errors on Thu Dec 26 00:40:02 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

What it means: READ/WRITE/CKSUM counters staying at 0 is the dream. “scrub repaired” tells you whether ZFS had to heal.
Decision: If you see CKSUM errors on a device, suspect cabling/controller or the disk. If “errors: Permanent errors,” treat as data loss and restore from snapshot/replica.

Task 5: Spot ZFS performance foot-guns in dataset properties

cr0x@server:~$ sudo zfs get compression,recordsize,atime,logbias,sync tank/db
NAME      PROPERTY     VALUE     SOURCE
tank/db   compression  lz4       local
tank/db   recordsize   128K      local
tank/db   atime        off       local
tank/db   logbias      latency   local
tank/db   sync         standard  local

What it means: recordsize impacts IO amplification. sync affects durability semantics.
Decision: For databases with 8K/16K pages, consider recordsize=16K or 8K after testing. Do not set sync=disabled unless you like explaining data loss.
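
A sketch of applying that change to the tank/db dataset shown above; recordsize only affects blocks written after the change, so existing data keeps its old layout until rewritten:

cr0x@server:~$ sudo zfs set recordsize=16K tank/db       # match a 16K database page size (test before rolling out)
cr0x@server:~$ sudo zfs get recordsize tank/db           # confirm the property took effect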

Task 6: Identify whether you’re bottlenecked on ZFS ARC

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
01:12:10   890   220     24    40   18   170   77    10    5   24.1G  28.0G
01:12:11   910   260     28    60   23   190   73    10    4   24.1G  28.0G
01:12:12   870   240     27    55   23   175   73    10    4   24.1G  28.0G

What it means: Miss rate around 25–30% may be fine or may be killing you, depending on workload.
Decision: If ARC is capped low relative to RAM and your working set is larger, consider increasing the ARC max or adding RAM. If misses are high during bursts, consider a special vdev or faster disks.

Task 7: Confirm mdraid rebuild is happening and estimate blast radius

cr0x@server:~$ watch -n 2 cat /proc/mdstat
Every 2.0s: cat /proc/mdstat

md0 : active raid6 sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      4391694336 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [UUU_U]
      [=>...................]  recovery =  7.3% (107822080/1463898112) finish=215.0min speed=105000K/sec

What it means: Degraded ([5/4]) and recovering at ~105 MB/s. ETA is rough.
Decision: If production latency is unacceptable, throttle rebuild speed; if risk window is unacceptable, schedule maintenance or temporarily shed load.

Task 8: Tune mdraid rebuild speed safely (temporary)

cr0x@server:~$ cat /proc/sys/dev/raid/speed_limit_min
1000
cr0x@server:~$ cat /proc/sys/dev/raid/speed_limit_max
200000
cr0x@server:~$ echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_max
50000

What it means: max rebuild speed capped to ~50 MB/s. This reduces contention.
Decision: Use when you must keep latency stable. Revert later, and document the change. If you forget, you’ll be rebuilding until the next fiscal year.

Task 9: Check filesystem-level health on mdraid (XFS example)

cr0x@server:~$ mount | grep /data
/dev/md0 on /data type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k)
cr0x@server:~$ sudo xfs_repair -n /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
No modify flag set, skipping filesystem flush and exiting.

What it means: -n is read-only. If it reports corruption, you schedule downtime for a real repair.
Decision: If corruption exists and the array recently had an unclean shutdown, treat it as likely write-hole-adjacent behavior on parity RAID. Verify backups before repair.

Task 10: Identify ZFS latency culprits live

cr0x@server:~$ sudo zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        2.10T  1.40T    220    980  18.2M  95.4M
  raidz2-0  2.10T  1.40T    220    980  18.2M  95.4M
    sda         -      -     45    210  3.6M  19.1M
    sdb         -      -     46    195  3.7M  18.4M
    sdc         -      -     40    205  3.4M  19.0M
    sdd         -      -     44    220  3.5M  20.1M
----------  -----  -----  -----  -----  -----  -----

What it means: Balanced per-disk IO suggests no single disk is stalling. If one disk shows far less bandwidth or ops, suspect it.
Decision: If a single disk lags, correlate with SMART and cabling. If all disks are busy and latency is high, you’re workload-bound or underprovisioned.

Task 11: Replace a failed disk in mdraid (the safe sequence)

cr0x@server:~$ sudo mdadm --fail /dev/md0 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0
cr0x@server:~$ sudo mdadm --remove /dev/md0 /dev/sdc1
mdadm: hot removed /dev/sdc1 from /dev/md0
cr0x@server:~$ sudo mdadm --add /dev/md0 /dev/sdf1
mdadm: added /dev/sdf1

What it means: The array will begin recovery after add. Watch /proc/mdstat.
Decision: If rebuild doesn’t start, check partition type, superblock version mismatch, or that you added the correct partition.

Task 12: Replace a failed disk in ZFS (and ensure it’s actually replacing)

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     FAULTED     12     0    45  too many errors
            sdd     ONLINE       0     0     0

cr0x@server:~$ sudo zpool replace tank sdc /dev/sde
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Thu Dec 26 01:14:02 2025
        148G scanned at 1.45G/s, 36.1G issued at 354M/s, 2.10T total
        36.1G resilvered, 1.68% done, 01:39:40 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            replacing-2  DEGRADED     0     0     0
              sdc     FAULTED     0     0     0
              sde     ONLINE       0     0     0  (resilvering)
            sdd     ONLINE       0     0     0

What it means: ZFS shows a “replacing” vdev while resilvering. That’s your confirmation you’re not just adding a random disk.
Decision: If resilver speed is low, check for other pool activity, SMR disks, or controller issues. If errors climb on other disks during resilver, pause and reassess risk.

Task 13: Verify mdraid superblocks when assembly is weird

cr0x@server:~$ sudo mdadm --examine /dev/sda1 | egrep "Array UUID|Raid Level|Device Role|State"
Array UUID : 1c2b7d1f:0e4c1c0a:4b2330c3:b51e1c66
Raid Level : raid10
Device Role : Active device 0
State : clean

What it means: Confirms the member belongs to the expected array.
Decision: If UUIDs differ across members, you may be mixing disks from different arrays. Stop and map everything before assembling anything.

Task 14: Detect ZFS “my pool is full and now everything is slow”

cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  3.40T   120G    96K  /tank

What it means: 120G free on a multi-TB pool can be dangerously low depending on workload.
Decision: If free space is under ~10–20% for busy pools, plan immediate cleanup or expansion. ZFS performance and allocator behavior degrade when the pool is tight.
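
To find out where the space actually went (live data, snapshots, or child datasets), a sketch using standard zfs list options:

cr0x@server:~$ sudo zfs list -o space -r tank | head -n 10                          # splits USED into snapshots, datasets, children
cr0x@server:~$ sudo zfs list -t snapshot -o name,used -s used -r tank | tail -n 5   # the biggest snapshots end up at the bottom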

Task 15: Check for mdraid write-intent bitmap usage

cr0x@server:~$ sudo mdadm --detail /dev/md0 | egrep "Intent Bitmap|Bitmap"
     Intent Bitmap : Internal

What it means: Bitmap is enabled; resync after a crash will be faster (only changed regions).
Decision: For parity arrays on systems that may crash or reboot, bitmap is often worth the small write cost. For high-write workloads, measure the impact.

Task 16: Confirm IO is the bottleneck (not CPU, not memory)

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    3.44   22.10    0.00   66.34

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await  w/s     wkB/s   w_await  aqu-sz  %util
md0             120.0  18000.0     0.0    0.0     8.4  980.0  98000.0    34.2    18.7   98.0

What it means: High %iowait and %util ~98% indicates saturated storage. w_await ~34ms is painful for latency-sensitive apps.
Decision: If this is during a rebuild, throttle rebuild or shed load. If it’s steady-state, you need more spindles, SSDs, better RAID level, or workload changes.

Fast diagnosis playbook

When storage gets “slow,” your first job is to avoid wasting time. Your second job is to avoid making it worse.
This order is designed to identify bottlenecks in minutes, not hours.

First: are we rebuilding or degraded?

  • mdraid: cat /proc/mdstat and mdadm --detail /dev/mdX. Look for recovery, resync, [U_U].
  • ZFS: zpool status. Look for DEGRADED, resilver, rising READ/WRITE/CKSUM.

If yes: expect latency. Decide whether to throttle rebuild or schedule a maintenance window. Don’t argue with physics.

Second: is it a single disk, a bus, or the whole array?

  • iostat -xz 1 to see saturation and await times.
  • zpool iostat -v 1 to see if one vdev/device is lagging.
  • smartctl -a on suspected disks; look for pending/uncorrectable sectors and CRC errors (often cabling).

One bad disk can drag an array into misery long before it “fails.” This is why monitoring should alert on SMART degradation, not just dead drives.

Third: confirm the workload pattern

  • High sync writes? Check application settings, fsync behavior, and on ZFS check sync/logbias.
  • Random reads with low cache hit? Check ZFS ARC stats or Linux page cache pressure.
  • Small writes on RAIDZ/RAID5? Expect write amplification and parity overhead.

Fourth: check “boring” limits that cause big pain

  • ZFS pool too full (low free space).
  • mdraid rebuild speed limits accidentally pinned low/high.
  • Controller errors in dmesg (timeouts, resets).
  • Misaligned partitions or weird sector sizes when mixing disks.

Common mistakes: symptoms → root cause → fix

1) Symptom: parity array returns corrupted files after power loss

Root cause: RAID5/6 write hole exposure + unclean shutdown, plus lack of end-to-end checksums.

Fix: Prefer RAID10 or ZFS RAIDZ with copy-on-write for critical data. If staying on mdraid parity: UPS + write-intent bitmap + consistent shutdown practices + verification tooling.

2) Symptom: mdraid rebuild makes production unusably slow

Root cause: Rebuild saturating IO; speed_limit_max too high; parity rebuild doing full-stripe reads/writes.

Fix: Throttle rebuild (/proc/sys/dev/raid/speed_limit_max), schedule rebuild windows, and consider RAID10 for latency-sensitive workloads.

3) Symptom: ZFS pool is ONLINE but applications see random corruption

Root cause: Often not ZFS itself—more commonly bad RAM on systems without ECC, or a flaky controller lying about writes.

Fix: Run memory tests, check ECC events, inspect dmesg for controller resets. On ZFS, trust checksum errors: if they rise, hardware is guilty until proven innocent.

4) Symptom: ZFS write latency spikes after enabling snapshots for everything

Root cause: Snapshot churn + fragmentation + small recordsize mismatch for workload, especially on busy VM datasets.

Fix: Reduce snapshot frequency/retention on hot datasets, tune recordsize/volblocksize appropriately, consider separate pools for VM images vs general files.

5) Symptom: mdraid array assembles but filesystems won’t mount cleanly

Root cause: Underlying block device is “fine,” but filesystem metadata took a hit (unclean shutdown, hardware errors).

Fix: Perform filesystem checks (xfs_repair / fsck) in maintenance mode; confirm backups before making repairs that can discard metadata.

6) Symptom: ZFS resilver takes forever compared to expectations

Root cause: Resilver reads only allocated blocks (good), but can still be slow due to SMR drives, pool fragmentation, heavy concurrent IO, or slow device.

Fix: Identify slow device with zpool iostat -v, avoid SMR for vdevs, keep pool free space healthy, schedule resilvers off-peak.

7) Symptom: “We replaced a disk but the array still looks degraded”

Root cause: Replaced the wrong partition, wrong device name, or ZFS replace/add confusion; in mdraid, drive wasn’t actually added as a member.

Fix: For mdraid, verify with mdadm --detail and check member roles. For ZFS, look for “replacing” vdev in zpool status and confirm the new disk is resilvering.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran mdraid RAID6 under XFS for a shared file store. The assumption was simple:
“RAID6 means two disks can fail, so we’re safe.”

The box took a hard power event. It booted. The md array assembled. /proc/mdstat said everything was clean.
Users started complaining about “random” corrupted archives and images. Nothing was consistent enough to point at a single directory.

The team chased application bugs for a day. Then someone ran checksum comparisons against the offsite backup set and found mismatches in files that were recently modified pre-outage.
No clear disk errors. SMART looked “okay.” The array was telling the truth from its perspective: it didn’t know the parity/data relationship was inconsistent.

Root cause: a classic parity inconsistency window during the crash—effectively the write hole made visible. XFS metadata was mostly fine,
so mounts looked normal. The corruption lived in user data where no end-to-end checks existed.

The fix wasn’t clever. They restored affected datasets from backups and changed design: ZFS for the file store with scrubs and snapshots,
or mdraid RAID10 for hot data where they couldn’t migrate immediately. They also treated “unclean shutdown on parity RAID” as a real incident from then on.

Mini-story 2: The optimization that backfired

A finance-adjacent team ran ZFS for VM storage and wanted more IOPS. Someone noticed “sync writes are slow” and made the tempting move:
set sync=disabled on the VM dataset.

Benchmarks looked fantastic. Latency dropped. The graphs calmed down. The change got quietly promoted to “standard tuning.”
Nobody wrote it down because, well, it “worked.”

Weeks later, a hypervisor rebooted unexpectedly (not even a power event—just a kernel panic from an unrelated driver issue).
Several VMs came back with databases that wouldn’t start, or worse, started with subtle transaction loss.
The storage was fine. The pool was ONLINE. The damage was in application-level consistency: acknowledged writes that never hit stable storage.

The postmortem was painful because it wasn’t mysterious. The system was configured to lie about durability.
And the VMs trusted it. Because that’s what computers do.

They rolled back to sync=standard, added a proper SLOG device for the workloads that needed low-latency sync writes, and updated change management:
any durability-affecting settings require explicit risk sign-off. “Fast” is not a business requirement if “correct” is optional by accident.

Mini-story 3: The boring but correct practice that saved the day

An enterprise team ran a mix: mdraid RAID10 for database servers, ZFS for backup targets. Nothing revolutionary.
What they did have was a boring practice: monthly “recovery drills” that included replacing a disk in a staging environment,
assembling an md array from cold disks, and restoring a ZFS dataset from incremental replication.

One Friday, a database node started logging intermittent IO timeouts. SMART showed rising CRC errors on a single disk.
That’s often a cable or backplane issue, but in RAID10 it’s still a “replace something” signal. They failed and removed the member,
swapped the drive, and started rebuild. Routine.

During rebuild, a second disk on the same backplane threw errors and dropped. Not fully dead, but misbehaving.
Because they were monitoring proactively, the on-call already had the spare plan, the exact sequence of commands, and the “stop if X happens” thresholds.

They paused the rebuild speed, shifted read traffic away from the node, and used the time to swap the suspect cable/backplane slot.
The array stabilized and completed recovery without losing data. No heroics. No improvisation.

The post-incident note was almost boring: “Followed runbook.” That’s the whole point.
Boring is what you want from storage. Excitement belongs in product demos, not in your block layer.

Checklists / step-by-step plan

Choosing between ZFS and mdraid (decision checklist)

  1. If you need end-to-end integrity (detect + heal corruption): choose ZFS.
  2. If you need maximum portability and simple boot/install paths: mdraid wins, especially RAID1 for root.
  3. If you run databases and can afford mirrors: mdraid RAID10 or ZFS mirrors. Avoid parity RAID for hot OLTP unless tested thoroughly.
  4. If you rely on snapshots for RPO/RTO: ZFS is the cleanest operational story.
  5. If your RAM budget is tight: mdraid + XFS/ext4 is usually less demanding.
  6. If your org has LUKS compliance workflows: mdraid integrates naturally; ZFS encryption is fine but different operationally.

Production rollout plan (safe steps)

  1. Pick RAID level based on failure domain and rebuild window, not just capacity.
  2. Standardize disk models within vdev/array when possible; avoid mixing SMR with CMR in places you care about.
  3. Enable monitoring before go-live: mdadm alerts, SMART alerts, ZFS pool status, scrub status, latency and IO saturation graphs.
  4. Define scrub/check policy:
    • ZFS: regular zpool scrub and alert on repaired bytes or checksum errors.
    • mdraid: periodic consistency checks (and understand what they do and don’t guarantee).
  5. Test disk replacement procedure on a non-prod node with your exact hardware.
  6. Document rebuild/resilver performance impact and define when to throttle.
  7. Backups: implement and test restore, not just backup jobs.
  8. Do a game day: simulate one disk failure, then a second “flaky” disk, and watch how humans behave.

Incident response plan for a degraded array/pool

  1. Freeze changes: stop “optimizations,” stop rebalancing, stop random restarts.
  2. Collect state: /proc/mdstat + mdadm --detail or zpool status.
  3. Identify the failing component: SMART, logs, controller resets.
  4. Decide: replace now vs throttle rebuild vs evacuate workload.
  5. Replace using the correct tool sequence (mdadm fail/remove/add or zpool replace).
  6. Monitor rebuild/resilver and watch for secondary errors.
  7. Post-recovery: run scrub/check, then do a targeted data verification and validate backups.

FAQ

Is ZFS always safer than mdraid?

Safer against silent corruption and many consistency issues, yes—because it checksums and can self-heal with redundancy.
But ZFS still depends on sane hardware. Bad RAM or a controller that lies can ruin anyone’s day.

Is mdraid “outdated” now that ZFS exists?

No. mdraid is still a solid building block for Linux fleets, especially for RAID1/10 and for environments that demand portability and simplicity.
It’s not trendy; it’s serviceable. Production likes serviceable.

Should I ever use mdraid RAID5 for important data?

If “important” means “I cannot tolerate silent corruption,” avoid it. If you must use parity RAID, mitigate aggressively:
UPS, write-intent bitmap, strong monitoring, verified backups, and documented recovery procedures. Or use ZFS RAIDZ instead.

Why do mdraid rebuilds feel worse than ZFS resilvers?

An mdraid rebuild copies the entire member device regardless of how much of it actually holds data (a write-intent bitmap shortens resyncs after unclean shutdowns, not full rebuilds).
ZFS resilver typically copies only allocated blocks, which can be far less data—unless the pool is nearly full or badly fragmented.

Can I stack ZFS on top of mdraid?

You can, but you usually shouldn’t. You’re duplicating RAID logic, complicating failure modes, and blurring where integrity lives.
If you want ZFS features, give ZFS direct disks (HBAs, not RAID controllers) and let it do its job.

What’s the best RAID level for databases?

Mirrors are the default answer: mdraid RAID10 or ZFS mirrors. Parity RAID can work for some read-heavy or batchy workloads,
but it’s less forgiving and often less predictable under write pressure.

Do I need ECC RAM for ZFS?

It’s strongly recommended for systems where correctness matters. ZFS will detect disk corruption; it cannot magically prevent bad RAM
from generating bad data before it’s checksummed. ECC reduces that class of risk substantially.

Is ZFS compression safe and useful?

Yes—lz4 is widely used and typically a win, especially for text, logs, VM images, and many databases.
It can reduce IO and increase effective throughput. Measure CPU headroom, but most modern CPUs handle it comfortably.
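
A sketch of checking what compression actually buys you on a given dataset (tank/logs is an illustrative name):

cr0x@server:~$ sudo zfs set compression=lz4 tank/logs              # only affects blocks written after the change
cr0x@server:~$ sudo zfs get compression,compressratio tank/logs    # compressratio shows the real-world savings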

What’s the mdraid equivalent of ZFS scrub?

mdraid can do consistency checks, but it does not provide end-to-end data checksums. A parity check can detect mismatched parity/data,
but it cannot prove user data correctness the way ZFS can.
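
A sketch of running that consistency check through the md sysfs interface (md0 as elsewhere in this article):

cr0x@server:~$ echo check | sudo tee /sys/block/md0/md/sync_action    # start a background consistency check
cr0x@server:~$ cat /proc/mdstat                                       # progress appears much like a resync
cr0x@server:~$ cat /sys/block/md0/md/mismatch_cnt                     # non-zero after the check means stripes to investigate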

How do I pick between ZFS RAIDZ2 and mdraid RAID6?

If you want parity with stronger correctness guarantees and integrated tooling, RAIDZ2 is usually the better operational choice.
If you need a block device for existing stacks and are confident in your operational mitigations, RAID6 is workable—but treat power events seriously.

Practical next steps

If you’re building new storage for production and you don’t have a special constraint pushing you elsewhere, default to ZFS mirrors or RAIDZ2,
keep the design simple, and schedule regular scrubs with alerting. Pair it with ECC RAM when you can. Make snapshots part of your backup story,
not an afterthought.

If you’re already on mdraid and it’s working: don’t panic-migrate. Instead, tighten the operations:
verify SMART monitoring, confirm your rebuild throttling strategy, add write-intent bitmap where appropriate, and drill recovery procedures.
Then decide whether integrity features (and not just performance) justify a migration to ZFS for specific datasets like backups and file stores.

The best choice is the one your team can operate correctly under stress. The second-best choice is the one that fails loudly enough that you notice.
