Debian 13 SMART warnings: which attributes actually predict failure (and what to do)


You get the alert at 02:17. “SMART Usage Attribute: 197 Current_Pending_Sector changed from 0 to 1.”
You stare at it like it’s a horoscope. Is the drive dying, or did it just have a bad day?

On Debian 13, SMART warnings are either early, noisy, and confusing—or late, quiet, and expensive.
The trick is knowing which attributes are actually predictive, which are vendor-math nonsense, and what
the next operational move should be when a number twitches.

SMART in plain English: what it can and cannot tell you

SMART is the drive telling you what it thinks about itself. That’s both useful and suspect, like a résumé.
It exposes counters and thresholds that are partly standardized, partly vendor-specific, and always shaped
by firmware decisions you can’t audit.

Two important truths keep you sane:

  • SMART is better at confirming failure than predicting it—but a few attributes do correlate strongly with real-world failures.
  • SMART is not a substitute for redundancy, backups, and scrubs. If you don’t have those, SMART is a smoke alarm in a house made of gasoline.

On Debian 13, the tooling is mature: smartmontools (smartctl + smartd), kernel log signals,
NVMe log pages, and storage-stack feedback from mdadm, LVM, ZFS, btrfs, and your hypervisor.
The win is combining those signals into a decision: ignore, observe, test, or replace.

Facts and context that make SMART make sense

Some context points that change how you read alerts:

  1. SMART started as vendor-specific in the 1990s; the “standard” attributes are still interpreted differently between vendors.
  2. Normalized values (VALUE/WORST/THRESH) are a vendor scale, not physics. RAW is closer to reality, but it can still be vendor-encoded.
  3. SMART thresholds are often set to avoid RMAs, not to protect your data. “PASSED” doesn’t mean “healthy.”
  4. Reallocation exists because disks ship with spare sectors. Modern drives constantly remap—some remapping is expected, but trends matter.
  5. SSDs introduced “wear” as a first-class failure mode. SMART for SSDs includes wear indicators, but not all are comparable across models.
  6. NVMe defined more consistent health logs than SATA/ATA SMART, but vendors still differ in details and thresholds.
  7. Self-tests are not the same as scrubs. SMART tests are device-internal; scrubs are filesystem/RAID-level validation of your actual data.
  8. SMART can’t see your cabling problems directly. UDMA CRC errors are a clue, but plenty of link issues show up first in kernel logs.
  9. Some failures are “sudden death”. Firmware bugs, controller failures, or power events can brick a drive with clean SMART history.

The attributes that actually predict failure (and why)

If you only remember one thing: media errors beat “health status” every time.
A drive that starts struggling to read/write real sectors is a drive you should treat as guilty until proven innocent.

1) Reallocated sectors: Reallocated_Sector_Ct (05)

This is the classic: sectors that were found bad and remapped to spares. A non-zero count isn’t instant death,
but it’s one of the clearest “the surface is degrading” signals on spinning disks and some SATA SSD firmwares.

Operational meaning: The drive has already failed to reliably store data at least once.

What predicts failure:

  • Growth rate matters more than the absolute number (a tracking sketch follows this list).
  • New reallocations during heavy reads (scrub/backup) are especially damning.
  • Reallocated sectors plus any pending/uncorrectable sectors is a “schedule replacement” combo.
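
Because the trend is the signal, it pays to snapshot the raw counters on a schedule and diff them later. A minimal sketch, assuming the device is /dev/sda and /var/log/smart-trend.log is an acceptable place to keep history (both are assumptions; adapt to your fleet):

#!/bin/sh
# smart-trend.sh -- hypothetical helper: append a timestamped snapshot of the
# predictive raw counters so "what changed since last week" is a one-line diff.
DEV=/dev/sda
{
  printf '%s ' "$(date -Is)"
  smartctl -A "$DEV" | awk '$1==5 || $1==197 || $1==198 {printf "%s=%s ", $2, $NF}'
  echo
} >> /var/log/smart-trend.log

Run it from root's crontab, daily or weekly; a Reallocated_Sector_Ct that grows between snapshots is exactly the trend this section is warning about.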

2) Current pending sectors: Current_Pending_Sector (C5)

Pending sectors are the drive saying: “I tried to read this sector and couldn’t correct it. If you rewrite it
successfully, I’ll clear it; if rewriting fails, I’ll remap it.”

Pending sectors are more actionable than reallocated ones because they correlate with
read failures that can surface as I/O errors now, not just “something went wrong sometime.”

Decisions:

  • Any non-zero pending sectors on a drive holding important data is a trigger to start controlled testing and plan replacement.
  • If the count drops to zero after rewriting (or after a scrub that forces reads), you still treat it as a warning shot (see the read-pass sketch after this list).
  • If pending sectors persist or rise, you replace. Don’t negotiate with entropy.
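
A scrub is the right tool when you have redundancy; without one, you can still force the drive to revisit suspect sectors with a plain read pass. A sketch, assuming /dev/sda; the LBA is an illustrative value you would take from the SMART error or self-test log:

# Full-surface read (slow, read-only, safe for the data itself):
sudo dd if=/dev/sda of=/dev/null bs=1M status=progress

# Targeted read around one suspect LBA (512-byte logical sectors assumed):
LBA=10629827
sudo dd if=/dev/sda of=/dev/null bs=512 skip="$LBA" count=2048 iflag=direct

If either read throws I/O errors in the kernel log, the pending sector is real and replacement moves up the queue.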

3) Offline uncorrectable: Offline_Uncorrectable (C6)

This counts uncorrectable errors found during offline scans or self-tests. It tends to line up with “you will
not read some data without redundancy.”

Decisions:

  • Non-zero C6 on a data drive means you should assume you have unreadable blocks until proven otherwise.
  • Combine with filesystem checks, RAID scrubs, and targeted reads of important areas (VM images, databases).

4) Uncorrectable read errors (vendor variants)

On SATA/ATA SMART tables you’ll see attributes like Raw_Read_Error_Rate (01) and
Seek_Error_Rate (07). The normalized values are often useless (especially on Seagate).
But some drives expose additional counters for uncorrectable errors.

The key is not the “rate” name. The key is: is the drive returning uncorrectable read errors to the host?
That shows up more reliably in:

  • SMART error log entries
  • Self-test log failures
  • Kernel I/O error messages
  • RAID/ZFS checksum errors
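
All of those except the RAID/ZFS layer can be checked from the shell in seconds. A sketch, assuming the suspect disk is /dev/sda:

sudo smartctl -l error /dev/sda | grep -i 'ATA Error Count'
sudo smartctl -l selftest /dev/sda | grep -iE 'read failure|unknown failure'
sudo journalctl -k -S -24h | grep -iE 'i/o error|medium error' | grep -i sda

A non-zero error count, a failed self-test entry, or kernel medium errors each outrank any scary-looking "rate" attribute.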

5) SMART self-test log failures (not an attribute, but a verdict)

If a long self-test ends with “Completed: read failure” or “Completed: unknown failure,” that’s the drive
admitting it cannot read its own media reliably.

Decision: If the drive holds anything you care about, replacement is the default.

6) NVMe “critical warning” and media errors

NVMe health reporting is different. Two NVMe fields matter a lot:

  • Critical Warning (bitmask): if set, the device is raising its hand for attention.
  • Media and Data Integrity Errors: counts failures the controller could not recover.

Also watch Percentage Used (wear). It’s not a failure predictor on its own, but high wear
plus media errors is a nasty combination.
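
If nvme-cli is installed, the same health log is readable without smartctl. A sketch, assuming the controller is /dev/nvme0 (field names can vary slightly between nvme-cli versions):

sudo nvme smart-log /dev/nvme0 | grep -iE 'critical_warning|media_errors|percentage_used'

Anything non-zero in critical_warning or media_errors means the same thing it does in the smartctl output: stop trusting the drive's optimism.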

7) Interface-level errors: UDMA_CRC_Error_Count (C7)

This one doesn’t predict drive failure. It predicts cabling, backplane, or controller flakiness.
If C7 grows, you’re corrupting your day with link errors.

Decision: Replace/seat cable, check backplane, check power, and stop blaming the disk surface.

One quote that’s been passed around ops teams for decades:
“Hope is not a strategy.” — traditional SRE saying

Joke #1: SMART “PASSED” is like a toddler saying “I didn’t break it.” Interesting data point, not a certification.

The attributes that waste your time (most of the time)

Some SMART attributes are famous mostly because they are printed in every smartctl output,
not because they are predictive for your incident.

Raw_Read_Error_Rate (01) and Seek_Error_Rate (07)

On many vendors these RAW values are giant counters that increase constantly and mean nothing without
proprietary context. People panic because the number looks scary. That’s understandable: it’s huge, it’s labeled “error,” and humans are pattern-seeking mammals.

When they matter: when the drive’s self-test log, SMART error log, or kernel logs also show actual read errors.

Power_On_Hours (09) and Start_Stop_Count (04)

Useful for lifecycle planning and warranty arguments. Weak as immediate failure predictors.
Drives can die young; drives can run ancient and stubborn.

Temperature_Celsius (194/190)

Temperature matters, but it’s a risk amplifier, not a smoke detector. If you’re running hot,
you’ll see more errors and shorter life. But a single temperature spike is not a reason to swap a drive.

Head flying hours, load cycles, vibration counters

These can be meaningful in large fleets when you correlate them. In a single server, they mostly tell you:
“This chassis is in a cupboard with poor airflow and a talent for making the fans scream.”

“Overall-health self-assessment test result: PASSED”

Treat it like a car’s “check engine” light being off. Nice, but not comforting if the engine is already making
metallic noises.

NVMe vs SATA: different telemetry, different traps

Debian 13 makes it easy to query both ATA SMART attributes and NVMe health logs, but the mental model differs:

  • SATA/ATA SMART exposes many attributes; interpretation is partly folklore, partly experience, partly vendor papers you don’t have.
  • NVMe exposes a more structured health log with clearer counters: media errors, unsafe shutdowns, wear percentage, and temperature events.

The trap with NVMe is assuming it’s always “better.” NVMe drives can fail abruptly (firmware/controller issues),
and “Percentage Used” can lull people into thinking: “Only 12% used, so it must be fine.” Media errors don’t care about your optimism.

The trap with SATA is the opposite: assuming every attribute is interpretable. Many aren’t. You’re better off
focusing on the small set that reliably maps to failures: reallocations, pending, uncorrectable, self-test failures,
and real error logs.

Fast diagnosis playbook (first/second/third checks)

When a SMART warning lands, you want to answer two questions quickly:
Is data at risk right now? And is the problem the disk, or the path to the disk? (A bundled sketch of these first-pass checks follows this playbook.)

First: confirm the symptom isn’t a monitoring hallucination

  • Check kernel logs for I/O errors and link resets.
  • Check smartctl current values and the error/self-test logs.
  • Check whether the SMART attribute changed once or is trending.

Second: classify the failure mode

  • Media degradation: pending/reallocated/uncorrectable sectors, self-test read failures.
  • Interface/path: CRC errors, SATA link resets, timeouts, HBA issues, power/backplane issues.
  • Workload-induced: errors show up only under heavy read or only on specific LBAs (bad region).
  • NVMe controller health: critical warning set, media errors incrementing, thermal throttling, many unsafe shutdowns.

Third: decide fast—replace now, test now, or watch

  • Replace now: any non-zero C5/C6 that persists, self-test failures, media errors on NVMe, or kernel I/O errors on important data drives.
  • Test now: a single pending sector that clears, a single reallocation with no trend, or suspicious but non-fatal counters (run long test + scrub).
  • Watch: CRC errors without media errors after fixing cabling; temperature issues after airflow fix; “PASSED” but nothing else.
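
Before working through the detailed tasks below, the first-pass classification can be done with three commands. A sketch, assuming the suspect disk is /dev/sda:

# Media signals (raw counters that should be zero or stable):
sudo smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

# Path signals (link-level trouble):
sudo smartctl -A /dev/sda | grep -E 'UDMA_CRC_Error_Count'
sudo journalctl -k -S -2h | grep -iE 'hard resetting link|frozen|i/o error|medium error' | tail -n 20

Media counters moving point at the platters; CRC growth plus link resets without media counters point at cables, backplane, or HBA.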

Practical tasks: commands, outputs, and decisions (16)

These are the tasks I actually run on Debian when SMART alerts show up. Each one includes:
command, example output, what it means, and what decision you make.

Task 1: Identify the disk and its path (don’t guess)

cr0x@server:~$ lsblk -o NAME,MODEL,SERIAL,SIZE,TYPE,TRAN,ROTA,MOUNTPOINTS
NAME   MODEL              SERIAL          SIZE TYPE TRAN ROTA MOUNTPOINTS
sda    ST8000NM000A-2KE1  ZR123ABC        7.3T disk sata    1
├─sda1                     512M part           1 /boot/efi
└─sda2                     7.3T part           1
nvme0n1 Samsung SSD 990    S7KX...         1.8T disk nvme    0
└─nvme0n1p1                               1.8T part           0 /

What it means: Confirm which device is spinning (ROTA=1) vs SSD (ROTA=0), and map mountpoints.

Decision: Use the correct tool path (/dev/sda vs /dev/nvme0 or /dev/nvme0n1) and don’t replace the wrong drive.

Task 2: Get the SMART summary for SATA/ATA disks

cr0x@server:~$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos 7E8
Device Model:     ST8000NM000A-2KE1
Serial Number:    ZR123ABC
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

What it means: “PASSED” is irrelevant here; C5=1 and C6=1 are active risk. Reallocated=8 shows prior remaps.

Decision: Run a long test immediately and schedule replacement. If the data is critical, start evacuating now.

Task 3: Pull NVMe SMART/health log the right way

cr0x@server:~$ sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 2TB
Serial Number:                      S7KX...
Firmware Version:                   5B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    6%
Data Units Read:                    18,345,112
Data Units Written:                 11,902,771
Host Read Commands:                 247,110,004
Host Write Commands:                183,991,201
Controller Busy Time:               2,401
Media and Data Integrity Errors:    0
Unsafe Shutdowns:                   7

What it means: Healthy NVMe: no critical warnings, no media errors, low wear. Unsafe shutdowns can be a facility/power story.

Decision: If you had an incident, you look elsewhere (power, kernel, filesystem). You still fix unsafe shutdowns because they correlate with future weirdness.

Task 4: Check SMART error log (ATA) to see real failures, not just counters

cr0x@server:~$ sudo smartctl -l error /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
SMART Error Log Version: 1
ATA Error Count: 2
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Error 2 occurred at disk power-on lifetime: 28421 hours (1184 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 00 00 00 40  Error: UNC at LBA = 0x00a1b2c3 = 10629827

What it means: UNC = uncorrectable. The drive returned an unrecoverable read at a specific LBA.

Decision: Treat as media failure. If redundancy exists, scrub/resilver; if not, prioritize backup and replacement.

Task 5: Check SMART self-test log (the drive’s confession booth)

cr0x@server:~$ sudo smartctl -l selftest /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     28422         10629827
# 2  Short offline       Completed without error       00%     28420         -

What it means: Short test passed; long test found a read failure at an LBA. That’s common: short tests don’t touch enough surface.

Decision: Replace the disk. Also run higher-level checks (scrub) to ensure redundancy corrected data.

Task 6: Start a short SMART test (quick triage)

cr0x@server:~$ sudo smartctl -t short /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Dec 29 03:14:12 2025

What it means: The drive accepted the test request. This doesn’t prove health; it checks basics.

Decision: If short test fails, it’s an immediate replacement signal. If it passes but you have pending/uncorrectable, you still run long test.

Task 7: Start a long SMART test (surface coverage)

cr0x@server:~$ sudo smartctl -t long /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
Testing has begun.
Please wait 822 minutes for test to complete.
Test will complete after Tue Dec 29 17:02:41 2025

What it means: Long test time is realistic for large HDDs. It will stress reads across the media.

Decision: Plan around load. On busy arrays, schedule off-peak, but don’t postpone for a week “because it’s inconvenient.”
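
While the long test runs, you can check progress without waiting for the log entry; the execution status lives in the capabilities section. A sketch (the percentage shown is illustrative):

cr0x@server:~$ sudo smartctl -c /dev/sda | grep -A1 'Self-test execution status'
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.

When the status reads "completed" (with or without error), go straight to the self-test log from Task 5.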

Task 8: Watch for kernel-level I/O errors and link resets

cr0x@server:~$ sudo journalctl -k -S -2h | grep -iE 'ata[0-9]|i/o error|reset|nvme|blk_update_request|medium error' | tail -n 30
Dec 29 01:42:18 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 29 01:42:18 server kernel: ata3.00: failed command: READ FPDMA QUEUED
Dec 29 01:42:18 server kernel: blk_update_request: I/O error, dev sda, sector 10629827 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 29 01:42:19 server kernel: ata3: hard resetting link
Dec 29 01:42:24 server kernel: ata3: link is slow to respond, please be patient (ready=0)
Dec 29 01:42:29 server kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

What it means: I/O error at a sector aligns with SMART LBA error: media issue. Link resets can be secondary (drive struggling) or cabling/backplane.

Decision: If you see medium errors, replace. If you only see resets and CRC errors rise, focus on cabling/HBA.

Task 9: Check UDMA CRC error growth (path trouble)

cr0x@server:~$ sudo smartctl -A /dev/sda | grep -E 'UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12

What it means: CRC errors are link-level corruption. They should not increase. Media errors are also present here, so you likely have both: a sick drive and a questionable path.

Decision: Replace the drive (media). While doing so, reseat/replace cable or check backplane to prevent the next drive inheriting the same CRC story.

Task 10: Run a RAID scrub/check (mdadm) and interpret the result

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5]
md0 : active raid1 sda2[0] sdb2[1]
      7813772288 blocks super 1.2 [2/2] [UU]
      [==>..................]  check = 12.4% (970000000/7813772288) finish=420.0min speed=270000K/sec

unused devices: <none>

What it means: A check is running. If the check encounters unreadable blocks on one member, md will rely on the other (if intact).

Decision: Let it finish and watch kernel logs. If errors appear, you accelerate replacement and verify the mirror is truly clean.

Task 11: For ZFS, scrub and check for checksum/read errors

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device or restore the affected file(s) from backup.
  scan: scrub repaired 0B in 01:12:33 with 1 errors on Tue Dec 29 02:10:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     DEGRADED     5     0     2
            sdb     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        tank/vmstore:vm-102-disk-0

What it means: ZFS is telling you the uncomfortable truth: there were read/checksum errors, and one file is impacted.

Decision: Replace sda, then restore the affected dataset/file from backup or replica. Scrub again after resilver.

Task 12: For btrfs, check device stats (it’s not SMART, but it’s reality)

cr0x@server:~$ sudo btrfs device stats /mnt/data
[/dev/sda2].write_io_errs    0
[/dev/sda2].read_io_errs     3
[/dev/sda2].flush_io_errs    0
[/dev/sda2].corruption_errs  2
[/dev/sda2].generation_errs  0

What it means: The filesystem observed read errors and corruption. That’s beyond “maybe.”

Decision: Replace the underlying device and run a scrub. Confirm redundancy/repair worked; if it didn’t, restore from backup.

Task 13: Enable and configure smartd for actionable alerts

cr0x@server:~$ sudo apt update
...
cr0x@server:~$ sudo apt install -y smartmontools
...
cr0x@server:~$ sudo systemctl enable --now smartd
Created symlink /etc/systemd/system/multi-user.target.wants/smartd.service → /lib/systemd/system/smartd.service.

What it means: smartd is now running and can monitor devices periodically.

Decision: Don’t stop here. You must tune /etc/smartd.conf to monitor the correct devices (especially behind HBAs) and to run tests.
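
As a starting point, an /etc/smartd.conf along these lines produces the schedule shown in the next task. This is a sketch, not Debian's shipped default: device names, the mail target, and the schedule are assumptions to adapt.

# /etc/smartd.conf (example)
# Either let smartd discover devices, run short tests daily at 02:00 and long
# tests Saturdays at 03:00, and mail root on problems:
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner
# ...or comment DEVICESCAN out and list devices explicitly, which is safer behind
# HBAs (smartd ignores lines after an uncommented DEVICESCAN):
#/dev/sda -d sat -a -s (S/../.././02|L/../../6/03) -m root
#/dev/nvme0 -a -s (S/../.././02|L/../../6/03) -m root

After editing, restart smartd and run the validation shown in Task 14.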

Task 14: Validate that smartd is watching the disks you think it is

cr0x@server:~$ sudo smartd -q showtests
smartd 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
Device: /dev/sda, will do test schedule 'S/../.././02|L/../../6/03'
Device: /dev/nvme0, will do test schedule 'S/../.././02|L/../../6/03'

What it means: smartd parsed your schedules and will run short tests daily at 02:00 and long tests Saturdays at 03:00.

Decision: If your critical disks are missing, you fix discovery (sat passthrough, megaraid, scsi device mapping) before trusting alerts.

Task 15: Inspect Debian 13’s log messages from smartd

cr0x@server:~$ sudo journalctl -u smartd -S -24h | tail -n 40
Dec 29 02:00:01 server smartd[1321]: Device: /dev/sda, starting scheduled Short self-test.
Dec 29 02:02:15 server smartd[1321]: Device: /dev/sda, SMART Self-test log error count increased from 0 to 1
Dec 29 02:02:15 server smartd[1321]: Device: /dev/sda, SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 7 to 8

What it means: smartd is doing its job: it ran the test, saw a new self-test issue, and observed attribute growth.

Decision: This is replacement territory. You also capture the full smartctl -a output for ticketing and postmortem.

Task 16: If the disk is behind a HBA/RAID controller, address it explicitly

cr0x@server:~$ sudo smartctl -a -d sat /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0] (local build)
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD80EFAX-68KNBN0
Serial Number:    7SG...
...
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       34

What it means: Media looks fine; CRC errors are high. That points away from platter and toward path issues.

Decision: Reseat/replace SATA cable/backplane slot, check HBA firmware, and watch whether C7 continues to grow afterward.

Joke #2: If your storage strategy is “we’ll know it’s bad when it stops working,” congratulations on your exciting career in emergency response.

Three corporate mini-stories from the SMART trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran a fleet of Debian servers with local SSDs for a cache tier and HDDs for bulk data.
Monitoring alerted on Reallocated_Sector_Ct for a couple of HDDs. The on-call engineer glanced at
“SMART overall-health: PASSED,” then closed the alert as noise. The assumption was simple: “If SMART says passed, it’s fine.”

Two days later, a nightly analytics job started failing. Not catastrophically—just enough to retry and clog the queue.
Latency for customer dashboards crept up. The storage graphs were confusing: iowait spikes, then calm; a pattern that
looked like “busy system” rather than “dying disk.”

When they finally pulled smartctl -l selftest, the long test had already reported a read failure at a consistent LBA.
ZFS (they had it on one cluster) would have flagged this loudly, but this particular dataset was on ext4 over mdadm,
and the errors surfaced as intermittent I/O errors in the kernel log that nobody had correlated to the SMART warning.

The fix was boring: replace the disk, rebuild the mirror, re-run the job. The postmortem, though, was spicy in a useful way.
They changed policy: any pending/uncorrectable sector counts on bulk data drives trigger replacement scheduling, regardless of “PASSED.”
They also taught the team to look at the SMART error log and self-test log, not just the summary line.

Mini-story 2: The optimization that backfired

In a different org, someone tried to reduce maintenance windows by moving long SMART tests to “only quarterly,” because
“the long test is disruptive and we have RAID anyway.” They also disabled RAID scrubs to avoid performance hits during business hours.
The optimization looked good on paper: fewer background reads, fewer latency complaints.

A few months later, a disk started growing Current_Pending_Sector. No one noticed because smartd was only doing short tests,
and the pending sectors were in a cold region rarely read. Production looked healthy. Backups were “successful” because they were
incremental and didn’t touch the bad blocks.

Then came the double event: a second drive in the same RAID group failed outright (controller dropped it).
The rebuild hammered the surviving disk, forcing reads across the surface. That’s when the pending sectors turned into uncorrectables.
The array couldn’t reconstruct a few blocks. Not the whole dataset. Just enough to corrupt several VM images.

The root cause wasn’t “RAID is useless.” The root cause was removing the only mechanisms that would have forced latent errors to show up
while redundancy was still intact: regular scrubs and long tests. They reintroduced monthly scrubs and weekly long tests for HDDs,
then rate-limited them and scheduled off-peak. Latency complaints dropped after they stopped running them at lunchtime.

Mini-story 3: The boring but correct practice that saved the day

A company with strict change control ran Debian 13 on database servers with mirrored NVMe devices.
They had a policy that sounded tedious: every drive replacement ticket must include smartctl -a,
the last 24 hours of kernel logs for the device, and the output of a redundancy check (mdadm check or ZFS scrub).

One week, a monitoring alert fired: NVMe “Unsafe Shutdowns” increased. Nobody panicked. The drive was “PASSED,” no media errors, no critical warning.
But the policy required gathering evidence, so the on-call also pulled the last kernel logs and noticed brief power-loss events on the PCIe bus.

They traced it to a loose power distribution connection in a rack PDU (not dramatic, just slightly unseated after a maintenance visit).
Had they replaced the NVMe, it would have “fixed” nothing and hidden the facility issue. Instead they stabilized power, then watched the counters.
No new unsafe shutdowns. No corruption.

The boring practice—collecting multiple signals before acting—prevented a cargo-cult disk replacement cycle and caught the real systemic risk.
Sometimes the most reliable engineering move is paperwork with teeth.

Common mistakes: symptom → root cause → fix

Here are failure modes I’ve seen repeatedly, mapped to what you actually do about them.
This section is intentionally specific; generic advice is how outages breed.

1) “SMART PASSED but app logs show I/O errors”

Symptom: Database or VM host sees sporadic read errors; SMART overall-health says PASSED.

Root cause: Vendor thresholds are lax; health summary doesn’t trip until late. Media errors show first in logs/self-tests.

Fix: Check SMART error log and self-test log; run long self-test; if read failures exist, replace the drive and verify data integrity via scrub.

2) “UDMA_CRC_Error_Count is climbing; replacing disks didn’t help”

Symptom: CRC errors increase across multiple drives in same bay/cable path.

Root cause: Bad SATA cable/backplane connector, marginal HBA port, or power noise causing link corruption.

Fix: Replace/seat cables; move drive to different bay; update HBA firmware; inspect PSU/backplane; confirm CRC counter stops increasing.

3) “Current_Pending_Sector is 1 and then went back to 0, so we’re good”

Symptom: C5 blips and clears after a test.

Root cause: A weak sector was rewritten/remapped; it may be an early sign of surface degradation.

Fix: Run a long test and a filesystem/RAID scrub; monitor trend weekly. If new pending sectors appear, replace.

4) “Long SMART tests make performance terrible, so we disabled them”

Symptom: User complaints during tests; background tasks removed.

Root cause: Tests scheduled during peak, no IO throttling, no coordination with workload.

Fix: Schedule off-peak; stagger disks; rate-limit scrub/check; accept some maintenance cost to avoid rebuild-time surprises.

5) “NVMe Percentage Used is low; still getting media errors”

Symptom: Wear looks fine; media/data integrity errors increment.

Root cause: Controller or NAND issues unrelated to wear-out, firmware defects, or thermal/power instability.

Fix: Treat media errors as real; update firmware if policy allows; check temperatures and power events; replace device if errors persist.

6) “We ran a SMART test; it passed; corruption still exists”

Symptom: SMART tests show “Completed without error,” but ZFS/btrfs reports corruption.

Root cause: SMART tests don’t validate end-to-end data integrity; they may not touch the exact blocks or catch transient path issues.

Fix: Trust filesystem checksums over SMART. Scrub, locate affected files, restore from backup, and investigate controller/cabling.

7) “smartd is running but we never get alerts”

Symptom: smartd active; no notifications even when a drive is clearly failing.

Root cause: smartd not configured to monitor all devices, especially behind HBAs; or email/notification path broken.

Fix: Validate monitored devices with smartd -q showtests; test alerting path; explicitly configure devices with correct -d types.

Checklists / step-by-step plan

Checklist A: When you see pending/uncorrectable sectors (C5/C6) on SATA HDD

  1. Capture evidence: smartctl -a, -l error, -l selftest, and recent kernel logs for the device (a capture sketch follows this checklist).
  2. Run a long self-test (smartctl -t long) if the system can tolerate the read load.
  3. Run your redundancy validation (mdadm check / ZFS scrub / btrfs scrub) while redundancy is intact.
  4. If the long test fails or C5/C6 rises: schedule replacement immediately.
  5. After replacement: rebuild/resilver, then scrub again to confirm no residual errors.
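
A minimal capture helper for step 1; the output path and default device are placeholders to adapt:

#!/bin/sh
# smart-evidence.sh -- hypothetical helper: bundle SMART output and recent kernel
# messages for one device into a single file for the ticket.
DEV=${1:-/dev/sda}
OUT=/var/tmp/smart-evidence-$(basename "$DEV")-$(date +%Y%m%d-%H%M).txt
{
  smartctl -a "$DEV"
  smartctl -l error "$DEV"
  smartctl -l selftest "$DEV"
  journalctl -k -S -24h | grep -i "$(basename "$DEV")"
} > "$OUT" 2>&1
echo "Evidence written to $OUT"

Attach the file to the ticket before any replace/test decision; later trend comparisons depend on having this snapshot.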

Checklist B: When you see only CRC errors (C7) and no media errors

  1. Confirm C5/C6 are zero and self-tests are clean.
  2. Reseat/replace SATA cable, or move to a different bay/backplane slot.
  3. Check kernel logs for link resets/timeouts.
  4. Watch C7 over 24–72 hours. It should stop increasing.
  5. If C7 continues to climb across multiple drives, escalate to HBA/backplane/power investigation.

Checklist C: NVMe warning signals

  1. Collect smartctl -a /dev/nvmeX output (critical warning, media errors, temperature, percentage used).
  2. Check kernel logs for PCIe AER, NVMe resets, or timeouts (a sketch follows this checklist).
  3. If critical warning is set or media errors increase: replace the drive (and keep the evidence for vendor/RMA).
  4. If only unsafe shutdowns increase: investigate power/PSU/PDU and abrupt resets before blaming the SSD.
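
For step 2, a quick pass over the last day of kernel messages (the time window is an arbitrary assumption):

sudo journalctl -k -S -24h | grep -iE 'nvme|aer|pcie' | tail -n 30

Controller resets, AER messages, or timeouts here shift suspicion from the SSD itself toward the PCIe path or power.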

Checklist D: Make smartd actually useful (not just installed)

  1. Inventory devices: SATA, SAS, NVMe, USB bridges, HBAs.
  2. Configure /etc/smartd.conf with correct device types and schedules.
  3. Verify parsing and schedules: smartd -q showtests.
  4. Verify alerting: simulate a notification or validate journald ingestion into your monitoring.
  5. Review weekly: any growth in realloc/pending/uncorrectable triggers a ticket with trend and plan.

FAQ

1) Which SMART attributes should wake me up at night?

Pending sectors (C5), offline uncorrectable (C6), reallocated sectors trending upward (05),
SMART error log entries showing UNC, and self-test failures. For NVMe: critical warning and media/data integrity errors.

2) Is a non-zero Reallocated_Sector_Ct always an immediate replace?

Not always. A stable, small count on an older HDD can limp along. But growth—especially during scrubs/tests—is a replacement plan.
If you also have C5/C6 or self-test failures, stop debating and replace.

3) Why did Current_Pending_Sector go back to zero?

Because the sector was successfully rewritten or remapped. That clears the pending list. It does not undo the fact that
the drive encountered a read it couldn’t correct. Treat it as an early warning and look for recurrence.

4) Can SMART predict sudden SSD death from firmware bugs?

Sometimes you’ll see resets, critical warnings, or rising error counts. Often you won’t. Firmware/controller failures can be abrupt.
That’s why redundancy and backups exist: to handle failures that don’t give notice.

5) Should I trust RAW or normalized SMART values?

Prefer RAW for counts like reallocations/pending/uncorrectables/CRC. Normalized values are vendor scales and can be misleading.
But don’t ignore normalized threshold crossings—they usually mean the vendor is finally willing to admit the drive is toast.
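
If you automate this, extract just the raw column for the counters that matter and ignore the normalized scale entirely. A sketch, assuming /dev/sda (the values shown match the Task 2 example):

cr0x@server:~$ sudo smartctl -A /dev/sda | awk '$1 ~ /^(5|196|197|198|199)$/ {printf "%-24s raw=%s\n", $2, $NF}'
Reallocated_Sector_Ct    raw=8
Reallocated_Event_Count raw=8
Current_Pending_Sector  raw=1
Offline_Uncorrectable   raw=1
UDMA_CRC_Error_Count    raw=0

Alert on deltas in these raw values, not on the normalized VALUE column dipping a few points.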

6) Are short tests worth running?

Yes, as a quick sanity check and for automation. No, they’re not enough for surface validation.
If you’re investigating a warning, a long test (plus a scrub/check) is the adult move.

7) SMART says the drive is fine, but ZFS shows checksum errors—what now?

Believe ZFS. Filesystem checksums catch end-to-end corruption that SMART may not see. Scrub, identify affected files,
replace suspect hardware (disk, cable, HBA), and restore from backup if needed.

8) How often should I run SMART long tests on HDDs?

Weekly is common for important fleets, monthly for quieter environments. The right cadence is the one you actually run
without causing performance incidents. Stagger drives and avoid peak hours.

9) Do CRC errors mean my data is corrupted?

CRC errors indicate link-level transmission issues; the protocol detects and retries. Usually you get retries and latency,
not silent corruption. Still, frequent link errors can trigger timeouts and higher-level failures, so fix the path.

10) What’s the single most useful thing to store in a ticket when SMART warns?

A full smartctl -a output plus the SMART error log and self-test log, and the last relevant kernel logs.
Trend data (what changed since last week) is even better than a single snapshot.

Conclusion: next steps you can do today

SMART warnings are not a philosophical debate. They’re an input to a decision under uncertainty.
Your job on Debian 13 is to make that decision fast, with the right signals, and without superstition.

  • Focus on predictive signals: pending (C5), uncorrectable (C6), reallocations trending (05), self-test failures, SMART error log UNC entries, NVMe critical warnings and media errors.
  • Differentiate media from path: C7 and link resets are often cables/backplanes/HBAs, not platters.
  • Combine layers: SMART + kernel logs + scrub/check results. Filesystem checksum systems (ZFS/btrfs) are brutally honest—use them.
  • Make smartd real: configure schedules, validate monitored devices, verify alert delivery, and capture evidence in tickets.
  • When in doubt and the data matters: evacuate, replace, and post-validate with a scrub. Drives are cheaper than incident calls.