Your machine boots, pauses, and you get the cheerful message: “Repairing disk errors.” The fans spin, the cursor blinks, and your stomach does that little drop. You’re not imagining it: this is one of those moments where “wait and see” can turn a recoverable filesystem problem into an unrecoverable data-loss story.
Sometimes it’s just a dirty journal and a clean fix. Sometimes it’s a drive quietly dying while the OS politely tries to keep going. Your job is to tell the difference—quickly—and choose actions that preserve data, not pride.
What “Repairing disk errors” actually means (and what it doesn’t)
That boot-time message is usually one of three things:
- A filesystem consistency check is running (Windows: CHKDSK; Linux: fsck; macOS: fsck_apfs). This can be routine after an unclean shutdown, or it can be triggered because corruption is detected.
- The OS is replaying a journal (common on ext4, XFS, NTFS). Journaling filesystems try to restore consistency after a crash. Most journal replays are fast. If it’s slow, that’s a clue.
- The storage layer is unhappy: the disk is returning read/write errors, taking too long to respond, or the controller is resetting. In that case, “repairing” is the OS trying to read metadata and rewrite structures while the disk is intermittently failing.
Here’s what it doesn’t mean: it doesn’t guarantee the OS is fixing anything safely, and it doesn’t promise your data is intact afterward. Filesystem repair tools are designed to restore consistency, not to preserve every last file. On a healthy disk, that trade-off is fine. On a dying disk, it’s playing darts in the dark.
Actionable mindset: treat boot-time repairs as a symptom, not a solution. Your goal is to determine whether you’re dealing with (a) a one-off dirty shutdown, (b) recurring corruption from software/firmware/power, or (c) an imminent drive failure.
One paraphrased idea, because it’s the right energy for this situation, comes from Werner Vogels (reliability mindset): everything fails eventually; design and operate as if that’s guaranteed.
Short joke #1: A filesystem repair tool is like a dentist—helpful, but you don’t want it improvising while the chair is on fire.
Fast diagnosis playbook (first/second/third checks)
If you’re on the console and the system is “repairing” or stuck, you need a tight loop: observe → classify → decide. This is the version you can run at 2:00 AM without inventing new problems.
First: determine whether the disk is failing physically
- If you can access a shell or recovery environment, pull SMART/NVMe health and kernel logs.
- Look for: reallocated sectors, pending sectors, uncorrectable errors, CRC errors, NVMe media errors, controller resets, I/O timeouts.
- Decision: if these are present or increasing, stop “repairing” and start “copying”. Clone/image first.
Second: determine whether the filesystem is corrupt but the hardware is OK
- Check mount errors, journal replay loops, and whether fsck/chkdsk reports a stable number of fixes.
- Decision: if SMART looks clean and errors correlate to power loss or forced resets, you can proceed with offline repair—after backups.
Third: determine whether you’re blocked by something else (not the disk)
- UEFI/BIOS misdetecting boot order, RAID controller in a degraded state, bad SATA cable, flaky backplane, broken initramfs, or a kernel driver issue can masquerade as “disk errors.”
- Decision: if logs show clean I/O but boot fails at mounting root, focus on bootloader/initramfs and device naming changes.
Operational rule: the more the system “hangs,” the more suspicious you should be of timeouts and retries. Healthy repairs tend to be noisy and fast. Sick disks are quiet and slow.
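The observe → classify → decide loop is easy to script for the SMART side. Here is a minimal sketch that flags the attributes this playbook cares about (IDs 5, 197, 198, 199); the smartctl output below is a canned sample so the steps are reproducible, and the /tmp paths are illustrative:

```shell
# Canned sample of `sudo smartctl -A /dev/sda` output for demonstration;
# on a live system, replace the heredoc with the real command's output.
cat > /tmp/smart_sample.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   097   097   010    Pre-fail  Always       -       24
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   098   098   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF

# Flag only the attributes that drive the replace-vs-repair decision,
# and only when their raw value is non-zero.
awk '$1==5 || $1==197 || $1==198 || $1==199 {
    if ($NF+0 > 0) printf "SUSPECT %s raw=%s\n", $2, $NF
}' /tmp/smart_sample.txt > /tmp/smart_verdict.txt

cat /tmp/smart_verdict.txt
```

On a live box, pipe `sudo smartctl -A /dev/sda` into the same awk body. Any SUSPECT line on attributes 197/198 means "image first, repair later."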
Interesting facts & historical context (why this keeps happening)
- Fact 1: CHKDSK dates back to early DOS-era disk tooling; its modern forms still inherit the same mission: make the filesystem consistent, even if that means removing questionable entries.
- Fact 2: Journaling filesystems (like ext3/ext4, NTFS, XFS) became mainstream partly because non-journaled repairs after crashes could take hours—or require full scans—on large disks.
- Fact 3: The move from spinning disks to SSDs reduced some mechanical failures but introduced new ones: firmware bugs, wear-related read disturb, and sudden “read-only mode” behaviors on some devices.
- Fact 4: SATA link errors (CRC errors) are often caused by cables/backplanes, not the drive media. They look terrifying in logs and are frequently fixable with a $6 cable and better routing.
- Fact 5: “Bad sectors” are not always permanent. Drives can remap sectors, and sometimes a transient read error becomes readable again—right before it fails again. That’s why “it worked once” is a trap.
- Fact 6: Filesystems aren’t databases. They don’t track business-level invariants, and repair tools can’t guess your intent; they just restore internal structure integrity.
- Fact 7: RAID was never a backup, and the industry has been repeating that sentence since RAID got popular in servers. It’s still true.
- Fact 8: Boot-time filesystem checks used to be scheduled by default more aggressively (e.g., ext2/ext3 mount-count checks). Modern systems often reduce these checks to improve boot speed—sometimes hiding slow-burn corruption.
The first rule: protect data before you “fix” anything
If there’s any chance the drive is failing, your priority is to capture data with the least additional stress on the disk. Repair tools can hammer metadata and trigger lots of random I/O—exactly what weak media hates.
What “protect data” means in practice:
- If the system still boots: copy the most valuable data first (databases, user directories, configs, keys). Do not start by running full-disk checks.
- If it doesn’t boot: use a live environment to mount read-only, or image the disk.
- If it’s a server with redundancy (RAID/ZFS/Ceph): focus on replacing the failing component and rebuilding safely. Don’t run destructive checks on a degraded array unless you’re sure what they touch.
Short joke #2: The only thing more optimistic than “it’ll probably boot” is “I’ll run fsck on production at lunch.”
Practical tasks with commands: outputs, meaning, and decisions
These tasks assume Linux because that’s where you’ll see the most direct signals, but the logic ports to Windows/macOS: observe health, capture data, repair offline, verify, then prevent recurrence.
Task 1: Identify what block device your root filesystem is on
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL,SERIAL
NAME TYPE SIZE FSTYPE MOUNTPOINTS MODEL SERIAL
sda disk 1.8T ST2000DM008 ZFL123AB
├─sda1 part 512M vfat /boot/efi
├─sda2 part 2G ext4 /boot
└─sda3 part 1.8T ext4 /
What it means: You now know which disk is involved (sda) and what partitions matter (sda3 is root). If you see LVM, mdraid, dm-crypt, or ZFS, your “disk” is a stack—don’t skip the lower layers.
Decision: If root is on a complex stack, you must collect health data from physical devices and the logical layer (mdadm, LVM, ZFS).
Task 2: Check recent kernel messages for disk I/O errors and resets
cr0x@server:~$ sudo dmesg -T | egrep -i "error|fail|reset|timeout|I/O|ata|nvme" | tail -n 30
[Mon Feb 5 09:11:22 2026] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Mon Feb 5 09:11:22 2026] ata1.00: failed command: READ DMA EXT
[Mon Feb 5 09:11:22 2026] blk_update_request: I/O error, dev sda, sector 1953525160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Mon Feb 5 09:11:23 2026] ata1: soft resetting link
[Mon Feb 5 09:11:28 2026] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
What it means: Real I/O errors at the block layer. The OS tried to read a sector and the disk/controller couldn’t deliver. Link reset suggests the disk or the path is unstable.
Decision: Treat this as a hardware or connectivity issue until proven otherwise. Proceed to SMART and cable/backplane checks; prioritize imaging/backup.
Task 3: Pull SMART health for SATA/SAS disks
cr0x@server:~$ sudo smartctl -a /dev/sda
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 097 097 010 Pre-fail Always - 24
197 Current_Pending_Sector 0x0012 098 098 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 098 098 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
What it means: “PASSED” is not reassurance; it just means the drive hasn’t crossed a vendor threshold. Pending sectors and uncorrectables are bad news: the disk cannot reliably read some locations. Reallocated sectors mean it already had to remap damaged areas.
Decision: If 197/198 are non-zero and not stable, replace the disk. Before replacement, image it if you need the data. If 199 is high, suspect cable/backplane first.
Task 4: Pull NVMe health (media errors, percentage used)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 44 C
available_spare : 100%
percentage_used : 62%
media_errors : 12
num_err_log_entries : 98
What it means: Non-zero media_errors is the NVMe equivalent of “the device failed to read/write data correctly.” It’s not a cosmetic metric.
Decision: If media errors are rising, plan replacement. If the system is boot-loop repairing, move straight to imaging and migration.
Task 5: Check filesystem state and last fsck results (ext4 example)
cr0x@server:~$ sudo tune2fs -l /dev/sda3 | egrep -i "Filesystem state|Errors behavior|Last checked|Last mount time|Mount count|Check interval"
Filesystem state: clean
Errors behavior: Continue
Last mount time: Mon Feb 5 09:02:10 2026
Last checked: Mon Feb 5 09:01:55 2026
Mount count: 34
Check interval: 0 (<none>)
What it means: ext4 thinks it is clean now, but that doesn’t mean the disk is fine. Also, “Errors behavior: Continue” is a footgun on critical systems; it can keep going while quietly corrupting.
Decision: If you’ve seen repairs on boot repeatedly, change error behavior to “remount-ro” after you stabilize, and fix the underlying issue (power, disk, controller).
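Changing the error behavior is a single tune2fs flag. Shown here against the example device /dev/sda3; do this only after the hardware is stable, and expect it to take effect on the next mount:

```shell
# Make ext4 remount read-only on the first metadata error instead of
# continuing (and potentially compounding the damage).
sudo tune2fs -e remount-ro /dev/sda3

# Verify the change took.
sudo tune2fs -l /dev/sda3 | grep -i "errors behavior"
```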
Task 6: Run a non-destructive filesystem check (read-only) from a live environment
cr0x@server:~$ sudo fsck.ext4 -n /dev/sda3
e2fsck 1.46.5 (30-Dec-2021)
/dev/sda3: clean, 412156/117440512 files, 98123456/469760000 blocks
What it means: -n does not modify the filesystem. “clean” suggests the filesystem is consistent.
Decision: If the filesystem is clean but boot still triggers repairs, suspect hardware I/O timeouts or controller resets. If fsck reports errors even after a clean shutdown, suspect underlying corruption drivers (disk, RAM, controller, power).
Task 7: Run an offline repair (only when the disk path looks stable)
cr0x@server:~$ sudo fsck.ext4 -f -y /dev/sda3
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 924844 has invalid mode. Clear? yes
Pass 2: Checking directory structure
Entry 'cache.tmp' in /var/tmp (924900) has deleted/unused inode 924844. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda3: 412155/117440512 files, 98123450/469760000 blocks
What it means: The tool made changes. That can be fine. It can also mean you just lost some metadata for files you cared about. If this repeats, it’s not “bad luck.”
Decision: If this was a one-off after a crash and SMART is clean, proceed. If repairs keep happening, stop and diagnose hardware and power stability.
Task 8: Verify the disk path is not lying (SATA CRC errors and link stability)
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i "UDMA_CRC_Error_Count|SATA Version|Error"
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 214
What it means: CRC errors usually mean data corruption on the wire, not inside the drive. Think cable, connector, backplane, vibration, or power noise.
Decision: Reseat/replace cable, move bays, check backplane. After fixing physical path, watch if CRC stops increasing. If it keeps increasing, treat the whole path as suspect.
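The whole decision hinges on the delta, not the absolute count. A small sketch of the before/after comparison; the two files here are canned samples standing in for smartctl runs taken before and after the cable swap:

```shell
# Extract the raw value of attribute 199 from a saved smartctl -A dump.
read_crc() { awk '$1==199 {print $NF}' "$1"; }

# Canned before/after samples for demonstration; on a live system, capture
# `sudo smartctl -A /dev/sda` to a file before and after the fix.
printf '199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 214\n' > /tmp/crc_before.txt
printf '199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 231\n' > /tmp/crc_after.txt

before=$(read_crc /tmp/crc_before.txt)
after=$(read_crc /tmp/crc_after.txt)
delta=$((after - before))
echo "CRC delta: $delta"
# A positive delta after replacing the cable means the path is still bad:
# look at the backplane, the port, or power.
```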
Task 9: Check systemd boot failures for mount issues (common on Linux servers)
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● boot.mount loaded failed failed /boot
● local-fs.target loaded failed failed Local File Systems
What it means: Boot is failing at filesystem mounts. That can be corruption, missing device, wrong UUID, or degraded RAID not assembling.
Decision: Next check: journal logs for why the mount failed, then verify UUIDs and the presence of the block device.
Task 10: Inspect journal logs for “superblock” and “wrong fs type” errors
cr0x@server:~$ sudo journalctl -b -p err --no-pager | tail -n 30
Feb 05 09:01:44 server kernel: EXT4-fs (sda3): bad geometry: block count 469760000 exceeds size of device (0)
Feb 05 09:01:44 server systemd[1]: Failed to mount /.
Feb 05 09:01:44 server systemd[1]: Dependency failed for Local File Systems.
What it means: The filesystem metadata doesn’t match what the kernel thinks the device size is. This can happen with failing devices, broken controllers, or misassembled RAID/LVM presenting a smaller device.
Decision: Stop attempting repair until you confirm the underlying block device is correctly detected (capacity, model, stable reads). Check RAID/LVM layers and controller logs.
Task 11: Confirm the device capacity is stable and not “shrinking”
cr0x@server:~$ sudo blockdev --getsize64 /dev/sda
2000398934016
What it means: Size in bytes. If this changes between boots or after resets, you’re looking at a controller/firmware/path problem.
Decision: If size is inconsistent, treat the disk subsystem as unstable. Switch ports/controllers if possible, then image through the most stable path.
Task 12: Assemble and check mdraid state (if applicable)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
976630336 blocks super 1.2 [2/1] [U_]
[==============>......] recovery = 72.3% (706000000/976630336) finish=45.0min speed=100000K/sec
What it means: RAID1 is degraded: one side missing ([U_]). Recovery is running, which is I/O heavy.
Decision: If your boot repairs started during recovery, consider pausing heavy rebuild to stabilize and capture data. Replace the failed member, ensure the surviving disk is healthy before forcing a full rebuild.
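Throttling or pausing the rebuild uses standard md tunables. A hedged sketch, assuming the array is md0 and the speed numbers suit your hardware (they are in KiB/s; pick your own):

```shell
# Throttle the rebuild so the surviving disk isn't hammered while you
# copy data off it.
sudo sysctl -w dev.raid.speed_limit_max=10000
sudo sysctl -w dev.raid.speed_limit_min=1000

# Or freeze the resync entirely while you image the survivor.
echo frozen | sudo tee /sys/block/md0/md/sync_action

# Writing "idle" clears the freeze; md restarts any pending recovery.
echo idle | sudo tee /sys/block/md0/md/sync_action
```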
Task 13: Check ZFS pool health and scrub status (if applicable)
cr0x@server:~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device and run 'zpool clear'.
scan: scrub repaired 0B in 00:12:31 with 3 errors on Mon Feb 5 08:40:22 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb FAULTED 12 0 3 too many errors
What it means: ZFS is telling you it saw errors and even mentions corruption. That’s a “stop and fix hardware” moment, not a “scrub harder” moment.
Decision: Replace the faulted device, then scrub again. Also check cabling/HBA firmware; ZFS often surfaces upstream problems earlier than other stacks.
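The replacement sequence itself is short. A sketch using the pool and device names from the example above; /dev/sdc is an assumed name for the new disk:

```shell
# Replace the faulted member with the new disk (names are examples).
sudo zpool replace tank sdb /dev/sdc

# Watch the resilver progress.
sudo zpool status -v tank

# After the resilver completes, scrub to verify every checksum end to end.
sudo zpool scrub tank
sudo zpool status -v tank   # READ/WRITE/CKSUM columns should stay at zero
```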
Task 14: Image a failing disk with minimal retries (ddrescue)
cr0x@server:~$ sudo ddrescue -f -n /dev/sda /mnt/recovery/sda.img /mnt/recovery/sda.map
GNU ddrescue 1.26
Press Ctrl-C to interrupt
rescued: 1999 GB, errsize: 1024 kB, current rate: 120 MB/s
ipos: 2000 GB, errors: 12, average rate: 118 MB/s
opos: 2000 GB, time from start: 04:45:12
Finished
What it means: You got almost everything; the mapfile records what failed. -n does a first pass without aggressive scraping—less stress. The remaining errors might be recoverable with a second pass, or might be lost forever.
Decision: If the image is mostly complete, work from the image (mount loopback, run fsck against a copy). Do not keep beating on the original disk unless the missing sectors contain critical data and you accept the risk of total failure.
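Working from the image looks like this. The tiny filesystem synthesized below is a stand-in so the steps are reproducible without a real rescue; on an actual recovery, substitute your ddrescue output for /tmp/sda.img:

```shell
# Stand-in for the ddrescue image: a small scratch ext4 filesystem in a file.
truncate -s 16M /tmp/sda.img
mkfs.ext4 -Fq /tmp/sda.img

# Never fsck the only copy. Duplicate first (reflink is instant on XFS/Btrfs).
cp --reflink=auto /tmp/sda.img /tmp/sda-work.img

# e2fsck runs directly against image files -- no loop device needed.
fsck.ext4 -fy /tmp/sda-work.img

# To browse files, loop-mount the copy read-only (root required):
#   sudo losetup --find --show -r -P /tmp/sda-work.img
#   sudo mount -o ro /dev/loop0 /mnt
```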
Task 15: Mount a filesystem read-only to grab data safely
cr0x@server:~$ sudo mount -o ro,norecovery /dev/sda3 /mnt
cr0x@server:~$ ls -lah /mnt | head
total 96K
drwxr-xr-x 24 root root 4.0K Feb 5 09:02 .
drwxr-xr-x 3 root root 4.0K Feb 5 08:59 ..
drwxr-xr-x 2 root root 4.0K Jan 20 11:10 bin
drwxr-xr-x 4 root root 4.0K Feb 4 18:12 boot
What it means: Read-only mount reduces writes. norecovery (where supported) avoids replaying journals on mount, which can write to disk.
Decision: If this works, copy out the most valuable data immediately. If it fails, go back to imaging and deeper recovery.
Task 16: Scan for unreadable (bad) blocks on the filesystem device
cr0x@server:~$ sudo badblocks -sv /dev/sda3
Checking blocks 0 to 469760000
Checking for bad blocks (read-only test): done
Pass completed, 12 bad blocks found. (12/0/0 errors)
What it means: Bad blocks exist. On a spinning disk, this often correlates with growing failure. On SSDs, “bad blocks” is trickier because of remapping, but uncorrectables still matter.
Decision: Don’t “mark them and move on” unless you’re okay with continuing on a compromised device (you shouldn’t be, for anything important). Replace the disk.
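If you need to know which files sit on those blocks, debugfs can map block → inode → path with its icheck and ncheck commands. A reproducible sketch on a scratch image (the image, the /payload file, and the paths are all demo assumptions); note that on a real disk you must run badblocks with -b set to the filesystem block size so the numbers line up:

```shell
# Scratch image so the mapping is reproducible without touching real disks.
truncate -s 16M /tmp/fs.img
mkfs.ext4 -Fq /tmp/fs.img

# Put a file into the image (debugfs -w can write without mounting).
echo "payload" > /tmp/payload.txt
debugfs -w -R "write /tmp/payload.txt payload" /tmp/fs.img 2>/dev/null

# Which blocks does the file occupy? (Stand-in for a badblocks hit.)
blk=$(debugfs -R "blocks /payload" /tmp/fs.img 2>/dev/null | awk '{print $1}')
echo "block under test: $blk"

# Map block -> inode, then inode -> pathname.
ino=$(debugfs -R "icheck $blk" /tmp/fs.img 2>/dev/null | awk 'NR==2 {print $2}')
debugfs -R "ncheck $ino" /tmp/fs.img 2>/dev/null
```

Any path ncheck prints for a genuinely bad block is a file you should assume is damaged and restore from backup.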
Task 17: Verify RAM isn’t the silent accomplice (because corruption is not always the disk)
cr0x@server:~$ sudo journalctl -k --no-pager | egrep -i "mce|machine check|EDAC|memory error" | tail -n 20
Feb 05 08:58:12 server kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
What it means: Correctable ECC errors are warnings. If they ramp up, they can corrupt data in memory before it’s written to disk. That looks like “disk errors” later.
Decision: If you see ECC errors, schedule DIMM replacement and run memory tests. Don’t blame the filesystem for what the memory subsystem broke.
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “PASSED means healthy”
A mid-sized company had a fleet of on-prem build servers. One started showing boot-time “repairing disk errors,” but it eventually came up. The on-call checked SMART, saw “PASSED,” and moved on—because the pipeline was red and everyone wanted green graphs before the morning standup.
The machine limped for a week. Builds got slower, then intermittent. Every reboot triggered another repair cycle. The working assumption became: “It’s just filesystem churn after power events.” Except the server didn’t have power events. It had read retries, timeouts, and a growing pile of pending sectors that nobody looked at because the summary line looked fine.
Then came the day it didn’t come back. The disk finally failed hard mid-boot, right when the filesystem tool was writing metadata. The most painful part wasn’t the outage; it was the ambiguity. Was the repo cache corrupt? Were artifacts corrupt? Which builds were suspect? Nobody likes explaining to management that “the data might be wrong.”
Postmortem outcome: they stopped treating SMART “PASSED” as a verdict. They started alerting on specific attributes (pending/uncorrectable sectors, NVMe media errors, link CRC errors), and they added a runbook line: “If boot repair repeats, image first.” Not glamorous. Very effective.
2) The optimization that backfired: “Skip checks to boot faster”
A different org ran latency-sensitive services on bare metal. Boot speed mattered because rolling reboots were part of their patching rhythm. Someone tuned ext4 to minimize scheduled checks and set mount options to keep things moving. It was a classic “optimize the steady state” move.
Months later, a minor kernel bug plus a couple of abrupt watchdog reboots caused metadata inconsistencies. The systems still booted, mostly. But they booted into “repairing disk errors” more frequently, and the repair window got longer each time. Nobody correlated it because the checks were rare and the fleet was large. The slow failures were diffuse—perfect for being ignored.
Then one node failed in an especially inconvenient way: it got stuck repairing during an automated maintenance window. It missed the time slot, came back late, and triggered cascading failovers because capacity planning assumed the node would return quickly. One “fast boot” tweak had quietly converted a recoverable corner case into a predictable outage pattern.
They reverted the most aggressive settings, introduced periodic offline scrubs/checks during low-traffic windows, and—this is the key—made sure every “repair on boot” event generated a ticket. Not an alert storm. A ticket with a human owner. Optimizations are fine; hiding evidence is not.
3) The boring but correct practice that saved the day: tested restores and staged replacement
A finance-adjacent company (regulated, allergic to surprises) ran their core systems on mirrored storage with strict operational discipline. One morning, a database host booted into a repair screen. The on-call didn’t try to be a hero. They followed the runbook: capture logs, check disk health, take the node out of rotation, and fail over.
SMART showed increasing uncorrectables on one disk and a handful of CRC errors on another. The team didn’t argue about which metric “mattered more.” They replaced the cable, replaced the disk, and scheduled a controlled rebuild. While the rebuild ran, they kept the node out of critical traffic.
Here’s the part that sounds dull until you need it: they had tested restores. When someone asked, “What if the mirror rebuild corrupts the database?” the answer was not a meeting. The answer was: “We can restore last night’s snapshot to a fresh host; we rehearsed it last month.”
The incident was a non-event. A hardware swap, a validation, and a closed ticket. Nobody got a medal. That’s what success looks like in operations: calm and slightly boring.
Common mistakes: symptom → root cause → fix
1) Repair runs every boot, but “it eventually works”
Symptom: Boot-time repair takes longer each time; occasional freezes; random app crashes.
Root cause: Disk is accumulating unreadable sectors or the controller is resetting under load.
Fix: Check SMART/NVMe logs and dmesg. If pending/uncorrectable/media errors exist, image the disk and replace it. If CRC errors dominate, replace cable/backplane and retest.
2) “Wrong fs type” or “bad superblock” after an update
Symptom: systemd drops to emergency mode; mount fails; repair tool claims it can’t find a filesystem.
Root cause: Device naming changed (UUID mismatch), RAID/LVM didn’t assemble, or initramfs lacks the right driver.
Fix: Verify UUIDs in /etc/fstab against blkid. Confirm mdraid/LVM activation in initramfs. Don’t run fsck blindly on the wrong device node.
3) CHKDSK/fsck “fixes” things, but files disappear
Symptom: After repair, directories contain lost+found fragments or missing files.
Root cause: Metadata corruption forced the tool to detach orphaned inodes/records. This is expected behavior under damage.
Fix: Restore from backup/snapshot if possible. If not, recover from an image using forensic tools. Stop repeated repairs on failing disks; it increases damage.
4) Repair is “stuck” at a certain percentage
Symptom: Progress stalls; drive LED solid; fans normal; no visible errors.
Root cause: The tool is retrying a hard-to-read region, sometimes for hours. The UI doesn’t show the retries.
Fix: Check kernel logs for I/O timeouts. If present, abort and image with ddrescue. If no I/O errors and CPU is pegged, it may be a massive directory scan—let it run, but monitor.
5) RAID rebuild starts and then the system begins “repairing disk errors”
Symptom: After replacing a disk, array rebuild causes boot checks, slowdowns, and sporadic read errors.
Root cause: The surviving disk was already weak; rebuild stress exposes it. Or the controller/backplane is marginal.
Fix: Validate the remaining disks before rebuild. If errors appear, stop rebuild, image data, and plan a safer migration. Replace questionable components first.
6) SSD suddenly becomes read-only or disappears during repair
Symptom: Mount flips to read-only; NVMe resets; boot loops.
Root cause: SSD firmware protection mode, power loss protection issues, or controller failure. Sometimes triggered by thermal/power events.
Fix: Pull NVMe SMART/logs. Ensure cooling and stable power. Replace the device; do not trust it again for write-heavy workloads.
Checklists / step-by-step plan (do this, then that)
Scenario A: A laptop/desktop shows “Repairing disk errors” once
- Let it finish once. If it completes quickly and boots normally, log it as a warning anyway.
- After boot, collect health signals: SMART/NVMe, OS event logs, and recent power events.
- Back up immediately. Not later.
- If SMART/NVMe shows pending/uncorrectable/media errors, replace the drive.
- If CRC/link errors show up, reseat/replace cables (desktop) or inspect connectors.
Scenario B: It repeats every boot or gets stuck
- Stop rebooting repeatedly. Reboots are not diagnostics; they are stress tests.
- Boot a live environment (or recovery mode) and collect dmesg plus SMART/NVMe logs.
- If hardware errors appear: image with ddrescue, then work from the image.
- If hardware looks clean: run read-only fsck first; then offline repair if needed.
- After repair, validate: file integrity checks, application checks, and log review.
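"File integrity checks" can be as simple as a checksum manifest you build once and re-verify after every repair or rebuild. A runnable sketch on a scratch directory (live, you would run the find against the repaired mount and keep the manifest off the suspect disk):

```shell
# Scratch data standing in for the repaired filesystem contents.
mkdir -p /tmp/verify-demo && echo "ledger row 1" > /tmp/verify-demo/data.txt

# Build a checksum manifest of every file.
( cd /tmp/verify-demo && find . -type f -exec sha256sum {} + > /tmp/manifest.sha256 )

# Later, or after a rebuild, re-verify every file against the manifest.
( cd /tmp/verify-demo && sha256sum --check /tmp/manifest.sha256 )
```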
Scenario C: Server with RAID/ZFS and production impact
- Fail over or drain traffic if you can. Don’t debug under load unless you enjoy regret.
- Check array/pool health first (mdadm/zpool). A degraded set changes every risk calculation.
- Identify the failing component: disk vs cable vs HBA vs backplane.
- Replace suspect hardware, then rebuild/resilver with monitoring.
- Run a scrub/check post-rebuild and confirm no silent errors remain.
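For mdraid, the post-rebuild check is a sysfs write plus one counter. A sketch assuming the array is md0:

```shell
# Kick off a read-and-compare pass across the whole array.
echo check | sudo tee /sys/block/md0/md/sync_action

# Progress shows up in /proc/mdstat; when it finishes, inspect the result.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt   # non-zero means the members disagree
```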
Scenario D: You need the data, and the disk is clearly dying
- Prioritize imaging over repair. Imaging preserves optionality.
- Use ddrescue with a mapfile; do a low-stress first pass.
- Work on the image copy: try mounting read-only; run filesystem checks on a duplicated image, not the only copy.
- Only attempt aggressive recovery reads on the original if the missing data is worth the risk of total failure.
FAQ
1) Is “Repairing disk errors” always a sign my drive is dying?
No. It can be a normal response to an unclean shutdown. But if it repeats, slows down, or coincides with I/O errors in logs, assume hardware risk until proven otherwise.
2) Should I interrupt the repair process?
If you suspect physical disk failure (I/O errors, resets, pending sectors, NVMe media errors), yes—because repair writes can worsen loss. If it’s a one-time journal replay on healthy hardware, let it finish.
3) Why does SMART say “PASSED” when the disk is obviously failing?
SMART “overall health” is threshold-based and vendor-specific. Attributes like pending sectors, uncorrectables, and media errors are often more actionable than the summary line.
4) What’s the difference between filesystem corruption and bad sectors?
Filesystem corruption is broken metadata/structures (often from crashes, bugs, or bad writes). Bad sectors/media errors are the disk failing to store or read data reliably. Corruption can be repaired; bad media tends to spread.
5) Will running fsck/chkdsk fix a failing disk?
No. It can make the filesystem consistent on top of a failing disk, but it doesn’t heal hardware. On a dying disk it can accelerate failure by causing heavy random I/O.
6) If I replace the disk, can I trust the repaired filesystem copy?
Sometimes. If you repaired after imaging and verified application-level integrity, you can be reasonably confident. If the disk was returning silent corruption (rare, but real), you need checksums and higher-level validation.
7) Does RAID prevent this boot-time repair situation?
It reduces downtime for single-disk failures, but it doesn’t prevent filesystem checks, controller issues, bad cables, firmware bugs, or multi-disk correlated failures. Also: RAID rebuilds can be the moment the second disk gives up.
8) What’s the safest way to recover data from a failing disk?
Image it with a tool designed for failing media (ddrescue) using a mapfile, then do recovery work from the image. Read-only mounting and minimal retries are your friends.
9) Why does the repair take forever on large drives?
Some checks are effectively full metadata scans. If the disk is slow due to retries, timeouts, or SMR behavior under random reads, it can stretch dramatically. The logs will tell you which one you’re dealing with.
10) If I see CRC errors, do I still replace the disk?
Not immediately. CRC errors often implicate the cable/backplane. Fix the path, then watch whether the CRC counter stops increasing. If you also see pending/uncorrectable errors, replace the disk regardless.
Next steps you can take today
If your system is “repairing disk errors” on boot, stop treating it like a weather report. It’s an early-warning system that’s easy to ignore until it stops being early.
- Get signals: grab dmesg/journalctl plus SMART/NVMe health for the underlying devices.
- Make the call: if you see media/pending/uncorrectable errors or repeated resets/timeouts, image first and replace hardware.
- Repair sanely: run read-only checks before offline repairs; don’t repeatedly “fix” a filesystem on a failing disk.
- Verify: confirm mounts, run application checks, review logs, and watch for recurring errors over the next days.
- Prevent recurrence: address power stability, cables/backplanes, controller firmware, and set alerting on meaningful health attributes—not just “PASSED.”
Operations reality: the disk doesn’t care about your deadline. Your best move is to act before it forces a decision for you.