Debian 13 mdadm RAID Degraded: Replace and Rebuild Without Data Loss

Your pager goes off. The storage graph is “fine” (it always is), but your app latency is climbing and one node is spitting disk errors.
You log in and see it: [U_], [UU_U], or the dreaded “inactive” array.
You can rebuild this cleanly—or you can turn a single failed disk into a full-blown incident with permanent data loss.

This is a production guide for Debian 13 and mdadm software RAID: fast diagnosis, safe replacement, rebuild monitoring,
and the sharp edges people cut themselves on. It’s opinionated because the filesystem doesn’t care about your feelings.

Ground rules: what “degraded” really means

“Degraded” is not a vibe. It’s a precise state: an array is missing at least one member, has kicked a member, or has a member
that is present but not trusted enough to read from. The array might still be serving reads and writes. That’s the good news.
The bad news is that your redundancy budget is already spent.

The objective is simple: restore redundancy without introducing new uncertainty. In practice, that means:

  • Don’t guess the failed disk. Identify by stable attributes (serial, WWN, enclosure bay).
  • Don’t “force assemble” unless you understand the metadata and the failure mode. “Force” is a power tool.
  • Don’t rebuild onto a disk that is quietly sick. SMART lies by omission, not by commission.
  • Don’t change two things at once. Replace one disk; verify; then proceed.

If you’re running RAID5/6 and you’re already degraded, the system is living on borrowed time. Reads will hit all remaining disks,
your error rate will spike, and your rebuild window becomes a reliability lottery. You can still win. You just don’t get to be casual.

Fast diagnosis playbook (first/second/third)

The fastest way to fix a degraded RAID is to avoid doing “stuff” until you know which of these you’re dealing with:
(1) a genuinely dead disk, (2) a connectivity problem, (3) a kernel/device naming shuffle, (4) latent media errors exposed by stress,
or (5) a human already tried to fix it.

First: confirm array state and whether it’s actively resyncing

  • Check /proc/mdstat for degraded, resync, recovery, reshape, or check.
  • Confirm which md devices exist and their personalities (raid1/5/6/10).
  • Decide: do you freeze changes until resync ends, or do you intervene now?

Second: identify the failing member by stable identity, not /dev/sdX

  • Map md members → partitions → underlying block devices.
  • Collect serial/WWN and, if applicable, enclosure slot mapping.
  • Decide: is the “missing” disk actually missing, or just not assembling?
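
If you want that mapping in one shot, lsblk can print serial and WWN next to each device. A minimal sketch (pair it with /proc/mdstat so you know which partitions are members):

cr0x@server:~$ cat /proc/mdstat
cr0x@server:~$ lsblk -o NAME,SIZE,SERIAL,WWN,MODEL
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -v part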

Third: classify the failure mode

  • Hard failure: I/O errors, device gone, SMART says “failing now.” Replace disk.
  • Path/cable/HBA: link resets, CRC errors, device flaps. Fix connectivity first.
  • Metadata/assembly: wrong UUID, old superblock, initramfs mismatch. Fix mdadm config and assemble safely.
  • Consistency issue: mismatch count, unclean shutdown. Run a check after redundancy is restored.

Joke #1: RAID stands for “Redundant Array of Inexpensive Disks,” but during rebuild it often means “Really Anxious IT Department.”

Interesting facts and history (why mdadm behaves like this)

  1. Linux MD predates mdadm. Early Linux software RAID used raidtools; mdadm became the practical standard as metadata and assembly improved.
  2. Superblock versions matter. Metadata 0.90 and 1.0 live at the end of the device (legacy-friendly placement that survives casual wipes), while 1.1 and 1.2 sit at or near the start; this changes boot behavior and how easily stale metadata lingers.
  3. “Write hole” is a real thing. Classic RAID5 can lose consistency across a crash mid-stripe update; a write journal (or newer mdadm’s partial parity log) mitigates it, while a write-intent bitmap only shortens resync. None of them make physics go away.
  4. Bitmaps weren’t always common. Write-intent bitmaps dramatically reduce resync time after unclean shutdowns by tracking dirty regions instead of scanning the whole array.
  5. MD can do reshape online. Changing layout, number of disks, or RAID level is possible but risky and performance-heavy—especially while degraded.
  6. MD’s “check” and “repair” are separate modes. A check can be non-destructive; repair can write corrections. Mixing them up is how people “fix” data into being wrong.
  7. Device names are not identities. /dev/sda is a suggestion the kernel makes based on discovery order; WWN is an identity the hardware carries.
  8. Debian historically leaned conservative. Debian’s defaults (and the culture around them) usually favor correctness and stability over “helpful automation” that hides state.

Practical tasks (commands, outputs, decisions)

Below are hands-on tasks you can run on Debian 13. Each includes: a command, what typical output means, and the decision you make.
If you treat these as a checklist instead of a buffet, you’ll sleep more.

Task 1: Read /proc/mdstat like a grown-up

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid10]
md0 : active raid1 sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]
      bitmap: 2/8 pages [8KB], 65536KB chunk

unused devices: <none>

What it means: md0 is RAID1 with two expected members. [2/1] means 2 slots, 1 working device. [U_] means the first disk is up, second is missing/faulty.

Decision: The array is running but has no redundancy. You prioritize restoring a mirror member. Don’t start other risky maintenance now.

Task 2: Get the authoritative story from mdadm

cr0x@server:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Sep  2 11:40:18 2024
        Raid Level : raid1
        Array Size : 976630336 (931.51 GiB 1000.20 GB)
     Used Dev Size : 976630336 (931.51 GiB 1000.20 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Mon Dec 29 09:11:22 2025
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

              Name : server:0
              UUID : 3b0d0f3b:5f2e6d7a:8c6c9b9b:9dd5a1d2
            Events : 41290

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

What it means: mdadm agrees it’s clean, degraded (not resyncing). The missing member is “removed,” not “failed.” That often happens after repeated I/O errors or manual removal.

Decision: Plan a controlled replacement. Also ask: why was it removed? Check logs before you declare victory.

Task 3: Find the member devices and their stable identities

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'sda|sdb'
lrwxrwxrwx 1 root root  9 Dec 29 09:03 ata-SAMSUNG_MZ7L3960HCGR-00005_S4JNNX0T123456 -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 29 09:03 ata-SAMSUNG_MZ7L3960HCGR-00005_S4JNNX0T789012 -> ../../sdb

What it means: You’ve got two distinct serials. Even if /dev/sdX changes after reboot, these by-id paths persist.

Decision: From now on, when you reference a disk in your notes or change plan, use serial/WWN, not sdb.

Task 4: Confirm which partitions belong to MD and whether anything is missing

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS,UUID
NAME          SIZE TYPE  FSTYPE            MOUNTPOINTS UUID
sda         931.5G disk
├─sda1      931.5G part  linux_raid_member             3b0d0f3b-5f2e-6d7a-8c6c-9b9b9dd5a1d2
│ └─md0     931.5G raid1 ext4              /           7b1e4d2d-6ad0-4f1e-bb0d-3b51f7e0f88a
└─sda2        512M part  vfat              /boot/efi   1A2B-3C4D
sdb         931.5G disk
└─sdb1      931.5G part

What it means: sda1 is a RAID member; sdb1 exists but has no linux_raid_member signature. That could mean it’s new/unpartitioned, or it’s the old disk that lost its partition table (less likely, but don’t assume).

Decision: If sdb is meant to be the replacement, you will partition it to match sda, then add to the array. If sdb is an existing member that “fell out,” investigate why it lost metadata.

Task 5: Read kernel logs for the real failure

cr0x@server:~$ sudo journalctl -k -b | grep -E 'md0|sdb|I/O error|reset|fail' | tail -n 30
Dec 29 08:52:11 server kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x6 frozen
Dec 29 08:52:11 server kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 29 08:52:12 server kernel: ata2: hard resetting link
Dec 29 08:52:17 server kernel: sd 1:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 29 08:52:17 server kernel: blk_update_request: I/O error, dev sdb, sector 1953525160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 29 08:52:17 server kernel: md/raid1:md0: Disk failure on sdb1, disabling device.
Dec 29 08:52:17 server kernel: md/raid1:md0: Operation continuing on 1 devices.

What it means: This is not “mdadm being weird.” It’s link resets and I/O errors. That’s often the disk, but it can be a cable, backplane, or HBA port.

Decision: Before you rebuild, decide whether you trust the path. If this is a hot-swap bay, reseat or move the disk to a known-good slot. If it’s a server with ancient SATA cables, replace the cable. Rebuilds are excellent at turning marginal into dead.

Task 6: SMART the suspect device (and interpret it like a pessimist)

cr0x@server:~$ sudo smartctl -a /dev/sdb | sed -n '1,40p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0-amd64] (local build)
=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7L3960HCGR-00005
Serial Number:    S4JNNX0T789012
Firmware Version: EDA7902Q
User Capacity:    960,197,124,096 bytes [960 GB]
SMART overall-health self-assessment test result: PASSED

=== START OF READ SMART DATA SECTION ===
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       14
197 Current_Pending_Sector  0x0032   098   098   000    Old_age   Always       -       27
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       9

What it means: “PASSED” is not a clean bill of health. Drives routinely “pass” while quietly reallocating sectors and logging errors (NVMe devices do the same with their media/integrity counters). In a degraded array, nonzero reallocated or pending sectors on the suspect disk are a red flag; a growing CRC error count points at the cable, backplane, or HBA rather than the media.

Decision: If the device is accumulating reallocated or pending sectors, replace it. If the media counters are clean but kernel logs show link resets or the CRC count keeps climbing, prioritize path/hardware checks.
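
If time allows, make the replacement disk prove itself before you trust it with a rebuild. A minimal sketch, assuming the new disk is /dev/sdb and not yet a member of anything:

cr0x@server:~$ sudo smartctl -t long /dev/sdb      # self-test runs in the background on the drive
cr0x@server:~$ sudo smartctl -l selftest /dev/sdb  # review the result once it finishes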

Task 7: Verify the array is mounted and not doing something dangerous already

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /
/dev/md0 / ext4 rw,relatime,errors=remount-ro

What it means: Root is on /dev/md0. This matters because you can’t casually stop the array.

Decision: Any rebuild plan must be online and cautious. If you need to stop/assemble, you will be doing it from rescue media—don’t freestyle on a live root array.

Task 8: Confirm mdadm’s config knows about the array (boot reliability check)

cr0x@server:~$ grep -R "UUID=3b0d0f3b" /etc/mdadm/
/etc/mdadm/mdadm.conf:ARRAY /dev/md0 metadata=1.2 UUID=3b0d0f3b:5f2e6d7a:8c6c9b9b:9dd5a1d2 name=server:0

What it means: Good: the array is declared. Bad: if it’s missing, boot-time assembly can become “best effort,” which sometimes means “wrong disk, wrong day.”

Decision: If the entry is missing or stale, update it and rebuild initramfs once the system is stable.

Task 9: Check for “ghost metadata” on the replacement disk before adding it

cr0x@server:~$ sudo mdadm --examine /dev/sdb1
mdadm: No md superblock detected on /dev/sdb1.

What it means: Perfect for a fresh add. If you saw an old superblock with a different UUID, you’d need to zero it or you risk wrong assembly or refusal to add.

Decision: Proceed to partition and add. If old metadata exists, stop and clean it deliberately (see checklist).

Task 10: Clone the partition table from the good disk to the new disk

cr0x@server:~$ sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: SAMSUNG MZ7L3960HCGR-00005
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

sfdisk: Successfully wrote the new partition table.
Syncing disks.

What it means: New disk now matches the known-good layout. This is safer than hand-editing starts/ends and accidentally creating misalignment or size mismatch.

Decision: Re-read partitions and verify they exist before adding to md.

Task 11: Re-read partitions and confirm kernel sees the new partition

cr0x@server:~$ sudo partprobe /dev/sdb
cr0x@server:~$ lsblk /dev/sdb
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sdb      8:16   0 931.5G  0 disk
├─sdb1   8:17   0 931.5G  0 part
└─sdb2   8:18   0   512M  0 part

What it means: Partitions are present. If partprobe fails because the disk is “busy,” you may need a reboot—but don’t reboot a degraded system casually unless you’ve validated boot assembly.

Decision: If partitions exist, proceed. If they don’t, stop and correct before touching md.
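
Before handing the disk to md, it’s cheap to confirm both members now carry the same layout. A minimal sketch comparing the sfdisk dumps; expect the device names to differ and little else:

cr0x@server:~$ diff <(sudo sfdisk -d /dev/sda) <(sudo sfdisk -d /dev/sdb)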

Task 12: Add the new member to the array and watch it start rebuilding

cr0x@server:~$ sudo mdadm --add /dev/md0 /dev/sdb1
mdadm: added /dev/sdb1
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]
      [=>...................]  recovery =  7.3% (71680000/976630336) finish=58.4min speed=258432K/sec
      bitmap: 2/8 pages [8KB], 65536KB chunk

unused devices: <none>

What it means: Recovery is running. Note: it keeps showing [U_] for the whole rebuild; the new member only counts as “up”, and the line flips to [UU], once recovery completes.

Decision: Don’t touch it unless you must. Your job now is to monitor for errors and keep the system calm (CPU, I/O, thermals).

Task 13: Monitor rebuild progress with more context

cr0x@server:~$ watch -n 2 'cat /proc/mdstat; echo; sudo mdadm --detail /dev/md0 | grep -E "State|Rebuild|Resync|Raid Devices|Active Devices|Working Devices|Failed Devices"'
Every 2.0s: cat /proc/mdstat; echo; sudo mdadm --detail /dev/md0 | grep -E "State|Rebuild|Resync|Raid Devices|Active Devices|Working Devices|Failed Devices"

md0 : active raid1 sdb1[2] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

             State : clean
      Raid Devices : 2
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0

What it means: Array is back to clean, both members active, no failed devices. This is the state you want before you do any consistency checks.

Decision: With redundancy restored, schedule a scrub/check and then fix the “why” (monitoring, cabling, firmware, spares).

Task 14: Update mdadm config and initramfs so the next reboot isn’t spicy

cr0x@server:~$ sudo mdadm --detail --scan | sudo tee /etc/mdadm/mdadm.conf
ARRAY /dev/md0 metadata=1.2 UUID=3b0d0f3b:5f2e6d7a:8c6c9b9b:9dd5a1d2 name=server:0
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.12.0-amd64

What it means: initramfs now includes current mdadm assembly information. This reduces “won’t boot after disk swap” surprises.

Decision: If this is a root array, this step is not optional. Rebuilds fix redundancy; initramfs fixes tomorrow morning.
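
One caveat with piping --detail --scan straight over the file: Debian’s stock mdadm.conf also carries DEVICE, HOMEHOST and MAILADDR lines, and a blunt overwrite drops them. A more conservative sketch (paths are Debian defaults; adapt to your layout):

cr0x@server:~$ sudo cp /etc/mdadm/mdadm.conf /etc/mdadm/mdadm.conf.bak
cr0x@server:~$ sudo sed -i '/^ARRAY /d' /etc/mdadm/mdadm.conf            # drop only stale ARRAY lines
cr0x@server:~$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
cr0x@server:~$ sudo update-initramfs -u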

Task 15: Run a non-destructive consistency check (after rebuild)

cr0x@server:~$ echo check | sudo tee /sys/block/md0/md/sync_action
check
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]
      [==>.................]  check = 11.6% (113246208/976630336) finish=45.2min speed=317120K/sec

unused devices: <none>

What it means: A check reads both sides of each mirror and compares them. On RAID5/6 it reads data and parity and verifies they agree. Either way, it can surface latent errors on sectors nothing has read in months.

Decision: If mismatch count rises (see next task), you investigate the cause. A few mismatches after an unclean shutdown can happen; persistent or growing mismatches mean trouble.
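
Debian has long shipped a checkarray helper for exactly this, scheduled from cron (newer releases may drive it from a systemd timer instead). A sketch, assuming stock packaging:

cr0x@server:~$ sudo /usr/share/mdadm/checkarray --all               # manual scrub of every md array
cr0x@server:~$ grep -Rs checkarray /etc/cron.d /etc/default/mdadm   # see how the distro schedules it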

Task 16: Read mismatch counters (and decide whether to “repair”)

cr0x@server:~$ cat /sys/block/md0/md/mismatch_cnt
0

What it means: Zero mismatches is what you want. On RAID5/6, mismatches can indicate write hole events, bad RAM, or a flaky disk/controller lying under load.

Decision: Don’t run repair as a reflex. Repair writes corrections—use it only when you know which side is correct (often: after validating hardware and having backups).

Checklists / step-by-step plan (safe disk replacement)

Here’s a plan that works under pressure. It is deliberately boring. Boring is good; boring keeps your data.

Checklist A: Before you touch anything

  1. Confirm backups are real. Not “we have snapshots.” Real, tested recovery for the data on this array.
  2. Capture current state: /proc/mdstat, mdadm --detail, lsblk, recent kernel logs.
  3. Identify the failed path by serial/WWN. Write it down. Take a photo of the chassis label if it’s physical.
  4. Check whether the array is already resyncing. If it is, your intervention may slow it or destabilize it.
  5. Set expectations. On RAID5/6, rebuild time under load is not your vendor’s brochure number.
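
For item 2, a minimal state-capture sketch that drops everything into one timestamped file you can diff against later:

cr0x@server:~$ ts=$(date +%Y%m%d-%H%M%S)
cr0x@server:~$ { cat /proc/mdstat; sudo mdadm --detail /dev/md0; lsblk -o NAME,SIZE,TYPE,SERIAL,WWN; sudo journalctl -k -b | tail -n 200; } | sudo tee /root/raid-state-$ts.txt >/dev/null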

Checklist B: If the disk is still visible but “faulty”

If md marked a member faulty but the kernel still shows the device, remove it cleanly. That stops md from trying to talk to a disk that’s already misbehaving.

cr0x@server:~$ sudo mdadm --detail /dev/md0 | grep -E "faulty|removed|active"
       0       8        1        0      active sync   /dev/sda1
       1       8       17        -      faulty   /dev/sdb1
cr0x@server:~$ sudo mdadm /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
cr0x@server:~$ sudo mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0

Interpretation: Failing then removing creates a clean slot. It also prevents random partial reads from a dying disk.

Decision: If removal fails because the device is gone, fine—proceed with replacement. If removal fails because it’s “busy,” stop and re-check you targeted the correct member.

Checklist C: Prepare the replacement disk correctly

There are two big rules: match the partition layout, and ensure no old RAID metadata survives.

  1. Partition it to match the existing member. Use sfdisk cloning, not handcrafting.
  2. Wipe old md superblocks if present. Use mdadm --zero-superblock on the member partitions, not the whole disk if you can avoid it.
  3. Check sector sizes. A 4Kn disk swapped into a 512e array can be a compatibility mess depending on controllers and partitioning.
cr0x@server:~$ sudo mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
           UUID : 11111111:22222222:33333333:44444444
     Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
cr0x@server:~$ sudo mdadm --zero-superblock /dev/sdb1
mdadm: Unrecognised md component device - /dev/sdb1
cr0x@server:~$ sudo wipefs -n /dev/sdb1
offset               type
0x00000400           linux_raid_member   [raid]

Interpretation: Example shows conflicting signals you will see in real life: stale metadata and tools disagreeing when devices are half-initialized.
If wipefs -n shows linux_raid_member, you need to remove it before adding to the new array.

Decision: Use wipefs to erase the signature, then re-check with mdadm --examine to ensure it’s truly gone.

cr0x@server:~$ sudo wipefs -a /dev/sdb1
/dev/sdb1: 8 bytes were erased at offset 0x00000400 (linux_raid_member): a9 2b 4e fc 00 00 00 00
cr0x@server:~$ sudo mdadm --examine /dev/sdb1
mdadm: No md superblock detected on /dev/sdb1.
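
For rule 3 (sector sizes), a quick sketch comparing logical and physical sector sizes on both members before you commit:

cr0x@server:~$ lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda /dev/sdb
cr0x@server:~$ sudo blockdev --getss --getpbsz /dev/sdb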

Checklist D: Add the disk and control the rebuild impact

Rebuilds are I/O intensive. You can trade speed for safety (lower load) or speed for risk (hammer everything).
In production, I bias toward “finish reliably,” not “finish fast and maybe die halfway.”

cr0x@server:~$ cat /proc/sys/dev/raid/speed_limit_min
1000
cr0x@server:~$ cat /proc/sys/dev/raid/speed_limit_max
200000
cr0x@server:~$ echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_max
50000

Interpretation: These are KiB/s limits. Capping max speed can keep latency acceptable for customers and reduce thermal stress. It will extend rebuild time.

Decision: If this is a database box serving production traffic, cap rebuild speed and survive. If it’s a maintenance window, open the throttle.

cr0x@server:~$ sudo mdadm --add /dev/md0 /dev/sdb1
mdadm: added /dev/sdb1
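
The /proc/sys/dev/raid knobs are global. If you only want to throttle this one array, md exposes per-array limits in sysfs; a minimal sketch:

cr0x@server:~$ echo 50000 | sudo tee /sys/block/md0/md/sync_speed_max   # cap only md0 (KiB/s)
cr0x@server:~$ echo system | sudo tee /sys/block/md0/md/sync_speed_max  # revert to the global limit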

Checklist E: After rebuild, harden so you don’t repeat this next week

  1. Update mdadm.conf and regenerate initramfs.
  2. Run a check and review mismatch counts.
  3. Fix the root cause: replace cable/backplane, update firmware, adjust monitoring.
  4. Set monitoring for md events and SMART warnings.
cr0x@server:~$ sudo systemctl enable --now mdmonitor.service
cr0x@server:~$ systemctl status mdmonitor.service --no-pager
● mdmonitor.service - MD array monitor
     Loaded: loaded (/lib/systemd/system/mdmonitor.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-29 09:30:14 UTC; 2min ago

Interpretation: On Debian 13 the monitoring daemon is mdmonitor.service; it runs mdadm --monitor against your arrays and raises events (Fail, DegradedArray, TestMessage) to whatever mdadm.conf points at. Older setups exposed an “mdadm” unit that sat “active (exited)” and was still correct; either way, an enabled unit is not the same as an alert that reaches a human.

Decision: Confirm you have actual alerting (email, Prometheus exporter, whatever your stack is). “Enabled” without a receiver is theater.
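
A quick way to prove the alert path actually fires: give mdadm a destination and send a test event. A minimal sketch, assuming working local mail (swap in a PROGRAM line pointing at a script if your alerting lives elsewhere); ops@example.com is a placeholder:

cr0x@server:~$ grep -E '^(MAILADDR|PROGRAM)' /etc/mdadm/mdadm.conf
MAILADDR ops@example.com
cr0x@server:~$ sudo mdadm --monitor --scan --test --oneshot    # sends a TestMessage event per array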

Special cases: RAID1 vs RAID5/6 vs RAID10, bitmaps, and reshape

RAID1: the easiest rebuild, and therefore where people get sloppy

RAID1 rebuilds are straightforward: md copies blocks from the good disk to the new disk. Your biggest risk is human:
replacing the wrong disk or rebuilding from the wrong side in a split-brain situation (rare on single host, but it happens after
weird controller resets and panicked reboots).

A RAID1 array can be “clean, degraded” and still be fully consistent. Don’t confuse “degraded” with “dirty.”
The rebuild should be safe as long as the remaining disk is healthy.

RAID5/6: degraded means every read is now a group project

With parity RAID, a single missing disk means reconstruction on reads (or reads across remaining disks) and heavy I/O.
During rebuild, you are stressing every surviving disk—the exact moment you discover which ones were marginal.

Practical implications:

  • Scrub cadence matters. If you never scrub, you find latent errors during rebuild. That’s the worst time.
  • Performance will wobble. Expect latency spikes. Plan rebuild throttling.
  • Have a stop condition. If a second disk starts erroring during RAID5 rebuild, you stop and reassess. Continuing can destroy what’s left.
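
A cheap pre-flight (and mid-rebuild habit) on parity arrays: check the survivors for pending/reallocated sectors and CRC noise, so “stop and reassess” happens before the second failure rather than after. A sketch; adjust the device list to your actual members:

cr0x@server:~$ for d in /dev/sd[a-d]; do echo "== $d"; sudo smartctl -A "$d" | grep -Ei 'reallocat|pending|uncorrect|crc'; done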

RAID10: rebuilds are localized, but don’t let that lull you

RAID10 rebuilds mirror pairs. That’s usually faster and less stressful than RAID5/6, but it’s not magic.
If you lose a disk in a pair and the surviving mirror partner is old, the rebuild reads the entire partner—again, stress.

Bitmaps: your friend after an unclean shutdown

Internal bitmaps record which regions may be out of sync, allowing a faster resync. If your arrays are large and your shutdowns are not
perfectly clean (power events happen, humans happen), bitmaps can save hours.

But bitmaps aren’t free: they add write overhead and complexity. In practice, for most production arrays where availability matters,
I still prefer having them.
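
Checking whether an array has a bitmap, and adding an internal one if it doesn’t, is a one-liner each; a minimal sketch (do this on a healthy array, not mid-rebuild):

cr0x@server:~$ sudo mdadm --detail /dev/md0 | grep -i bitmap
cr0x@server:~$ sudo mdadm --grow --bitmap=internal /dev/md0   # add a write-intent bitmap online
cr0x@server:~$ sudo mdadm --grow --bitmap=none /dev/md0       # remove it again if you change your mind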

Reshape: don’t do it while degraded unless you enjoy suspense

mdadm can reshape arrays: change RAID level, add disks, change chunk size, etc. It’s powerful.
It’s also the kind of operation where “mostly correct” is not a passing grade.

Rule: Restore redundancy first, then reshape. If you’re already degraded, reshaping increases moving parts and failure surface.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran Debian servers with mdadm RAID1 for boot and RAID10 for data. A node flagged degraded RAID1 on md0.
The on-call engineer saw /dev/sdb missing and assumed “disk 2 is dead,” because that’s the story we tell ourselves:
Linux names disks in a nice stable order and the universe is kind.

They swapped the disk in bay 2, booted, and the array still looked wrong. In frustration, they “fixed” it by adding the “new disk”
to the array and letting it rebuild. The box came back clean. Everyone exhaled.

Two days later, a planned reboot happened. The system dropped into initramfs complaining it couldn’t assemble root.
What changed? Enumeration. A firmware update and a slightly different discovery order flipped /dev/sda and /dev/sdb.
The disk that was rebuilt from was not the disk everyone thought it was. The array wasn’t assembled consistently because mdadm.conf
wasn’t updated and initramfs still had stale assembly hints.

Recovery wasn’t heroic; it was careful. They booted rescue media, assembled by UUID, confirmed which member had the latest event count,
and rebuilt the proper mirror. The real lesson wasn’t “Linux is random.” The lesson was: /dev/sdX is not an identity.

After that, they labeled bays with serial numbers and updated their runbook: every disk operation starts with /dev/disk/by-id,
and every mdadm change ends with update-initramfs -u.

Mini-story 2: The optimization that backfired

Another org had a heavy analytics workload and wanted rebuilds to finish faster. Someone found the md speed limits and cranked
speed_limit_max to a big number across the fleet. In the next failure, the rebuild screamed along.
Dashboard looked great. Time-to-redundancy dropped. High fives in a chat channel.

But during business hours, latency also spiked. Not a little. The rebuild competed with production reads, saturated the HBA queue,
and raised device temperature. One of the “healthy” disks began logging UDMA CRC errors—classic sign of a marginal link.
Then it started timing out under load. md kicked it. On RAID6 that’s survivable; on RAID5 it’s not.

The worst part: it wasn’t even a disk problem at first. The cable had been slightly loose for months, but nothing stressed it hard enough
to show symptoms. The rebuild turned a latent issue into an outage.

They rolled back the global “optimization” and replaced questionable cabling. The better optimization was policy:
rebuilds run slower during peak hours, faster during maintenance windows. Real systems have customers attached.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran mdadm RAID6 on a set of nearline disks. They had two habits that nobody celebrated:
monthly RAID checks and strict replacement procedure with serial-number verification.

One Friday night, a disk failed. They replaced it and began rebuild. Midway through, a second disk returned a read error on a sector.
That’s the moment RAID6 earns its keep: it can tolerate two failures, but only if the second one doesn’t escalate.

Because they’d been doing monthly checks, the team already knew mismatch counts were stable and that no other disks were throwing pending sectors.
They also had recent SMART baselines. The second disk’s error was new and isolated; they lowered rebuild speed and let it finish.
Then they scheduled a controlled replacement of the second disk during the next window.

No data loss. No weekend. No drama. The reason it worked is painfully unsexy:
they didn’t discover their media errors for the first time during a rebuild.

Quote (paraphrased idea) from Admiral Grace Hopper: “The most dangerous phrase is ‘we’ve always done it this way’.” Applied to storage: verify, don’t inherit.

Common mistakes (symptoms → root cause → fix)

1) Symptom: Array is degraded after reboot; disks look “fine”

Root cause: mdadm.conf/initramfs doesn’t include correct ARRAY definitions; assembly relies on scanning and timing.

Fix: Regenerate configuration and initramfs after stabilizing.

cr0x@server:~$ sudo mdadm --detail --scan | sudo tee /etc/mdadm/mdadm.conf
ARRAY /dev/md0 metadata=1.2 UUID=3b0d0f3b:5f2e6d7a:8c6c9b9b:9dd5a1d2 name=server:0
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.12.0-amd64

2) Symptom: You add a disk and mdadm says “device has wrong UUID” or refuses to add

Root cause: Replacement disk still has old md superblock or filesystem signatures (common with reused spares).

Fix: Wipe signatures on the specific member partition, then add.

cr0x@server:~$ sudo wipefs -n /dev/sdb1
offset               type
0x00000400           linux_raid_member   [raid]
cr0x@server:~$ sudo wipefs -a /dev/sdb1
/dev/sdb1: 8 bytes were erased at offset 0x00000400 (linux_raid_member): a9 2b 4e fc 00 00 00 00
cr0x@server:~$ sudo mdadm --add /dev/md0 /dev/sdb1
mdadm: added /dev/sdb1

3) Symptom: Rebuild is extremely slow and system latency is awful

Root cause: Rebuild saturates I/O; scheduler and queue-depth interactions clash with the workload; device errors cause retries.

Fix: Throttle rebuild speed; reduce workload; check logs for retries/timeouts.

cr0x@server:~$ echo 20000 | sudo tee /proc/sys/dev/raid/speed_limit_max
20000
cr0x@server:~$ sudo journalctl -k -b | grep -E 'reset|timeout|I/O error' | tail -n 20
Dec 29 09:12:04 server kernel: blk_update_request: I/O error, dev sdc, sector 1234567 op 0x0:(READ)

4) Symptom: RAID5 rebuild fails with “read error” on another disk

Root cause: Latent bad sector encountered during rebuild; RAID5 can’t tolerate a second failure or unrecoverable read error.

Fix: Stop and assess: attempt sector remap by reading offending region; consider cloning; restore from backup if necessary.

Real talk: RAID5 is not a backup, and during rebuild it’s not even an especially convincing redundancy story. Treat this as an incident, not a warning.

5) Symptom: Array shows “clean” but mismatch count rises during check

Root cause: Prior unclean shutdown, write hole (RAID5), flaky RAM/controller, or silent corruption previously masked.

Fix: Investigate hardware, run memory checks, verify cabling, and only then consider repair with backups available.

cr0x@server:~$ cat /sys/block/md0/md/mismatch_cnt
12

6) Symptom: Disk keeps getting kicked, but SMART looks fine

Root cause: Link-level problems (CRC errors), controller resets, backplane issues, power problems.

Fix: Replace cable/backplane slot; check power; review kernel logs for resets. Don’t keep swapping “good” disks into a bad slot.

7) Symptom: You can’t tell which physical disk is /dev/sdb

Root cause: No labeling, no enclosure mapping, and device naming drift across boots.

Fix: Use by-id serials, udev properties, and (if present) enclosure management tools; then label hardware.
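
A sketch of that mapping, assuming a chassis with addressable bay LEDs (ledctl comes from the ledmon package and only works with supported HBAs/enclosures):

cr0x@server:~$ ls -l /dev/disk/by-path/ | grep -v part        # which controller port each sdX sits on
cr0x@server:~$ lsblk -o NAME,SERIAL,WWN,MODEL                 # match serials against the drive label
cr0x@server:~$ sudo ledctl locate=/dev/sdb                    # blink the bay LED
cr0x@server:~$ sudo ledctl locate_off=/dev/sdb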

Joke #2: The only thing more fragile than a degraded RAID5 is the confidence of someone who just typed --force.

FAQ

1) Can I keep the server running while rebuilding?

Usually yes. mdadm is designed for online rebuilds. The trade-off is performance and risk: rebuild load can expose latent issues.
If it’s your root array, online is often the only practical option without downtime.

2) Should I replace the disk immediately or try reseating it first?

If logs show link resets/CRC errors and the device disappears/reappears, reseat and check cabling/backplane first.
If SMART shows media/integrity errors or reallocated/pending sectors (depending on device type), replace the disk.

3) How do I know which disk to pull in a hot-swap chassis?

Use /dev/disk/by-id serials and match them to the drive labels. If you have enclosure slot mapping (SAS expanders),
use that. Don’t rely on /dev/sdX.

4) What’s the safest way to partition the new disk?

Clone the partition table from a known-good member using sfdisk -d piped into sfdisk.
Then verify with lsblk and only then add to md.

5) mdadm says the array is “clean, degraded.” Is that bad?

It’s better than “dirty, degraded,” but you still have no redundancy. A second disk issue can become data loss depending on RAID level.
Treat it as urgent, not necessarily panic.

6) Should I run sync_action=repair after a rebuild?

Not by default. Repair writes changes. If you have mismatches, first determine whether they’re expected (unclean shutdown) or symptomatic
(hardware/firmware). Have backups before you let anything “repair” at block level.

7) Why did the rebuild take so long compared to the disk’s rated speed?

Rebuild speed is constrained by random IO, filesystem activity, controller limits, error retries, thermal throttling, and md speed limits.
Vendor sequential throughput numbers are marketing, not your workload.

8) Do I need to update initramfs after replacing a RAID member?

If the array is involved in boot (root, /boot), yes. Update /etc/mdadm/mdadm.conf and run update-initramfs -u.
Otherwise, you risk a boot failure or an array that assembles inconsistently.

9) What if I replaced the wrong disk?

Stop. Don’t add anything. Identify the remaining disks by serial, examine superblocks, and determine which member has the latest event count.
If you’re not confident, boot rescue media and assemble by UUID. Guessing is how you convert “recoverable” into “career development.”
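
A minimal rescue-media sketch, assuming the UUID from this article’s array (member names will almost certainly differ under a rescue boot, so examine first, assemble second):

cr0x@server:~$ sudo mdadm --examine /dev/sd[ab]1 | grep -E 'UUID|Events|Device Role'
cr0x@server:~$ sudo mdadm --assemble /dev/md0 --uuid=3b0d0f3b:5f2e6d7a:8c6c9b9b:9dd5a1d2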

10) Is mdadm RAID “safe” compared to hardware RAID?

mdadm is perfectly viable in production when operated correctly: monitoring, tested recovery, and disciplined procedures.
Hardware RAID can hide problems until it can’t. Software RAID is honest, which is uncomfortable but useful.

Conclusion: what to do next

Degraded mdadm RAID on Debian 13 is fixable without drama if you keep your hands off the panic buttons and follow a strict sequence:
identify the right disk by stable identity, validate the failure mode, replace cleanly, rebuild with controlled load, then verify consistency.

Next steps you should actually do (not just nod at):

  • Set mdadm event monitoring so “degraded” becomes a page, not an archaeological discovery.
  • Schedule periodic check runs and track mismatch counts over time.
  • Label disks by serial and document bay mapping. Future-you will buy present-you coffee.
  • After any disk work on boot arrays: update mdadm.conf and regenerate initramfs.