Proxmox is good at surfacing SMART warnings. It’s also good at making you ask the worst question at the worst time: “Is this drive about to die, or is it just being dramatic?”
When you run ZFS or Ceph on Proxmox, storage health is a stack of signals. SMART is only one layer, often misunderstood, and occasionally weaponized by dashboards that treat every non-zero counter like a five-alarm fire. Let’s sort it out with the attributes that actually correlate with failure, the ones that mostly don’t, and the commands you’ll run at 2:13 AM when a node starts blinking yellow.
How SMART actually works (and how Proxmox consumes it)
SMART is vendor-specific telemetry wrapped in a standards-shaped container. The disk firmware decides what to count, when to increment, and how to map “normalized” values. You’re not reading objective physics. You’re reading what the drive is willing to admit.
SMART provides three kinds of signal that matter operationally:
- Failure flags (overall SMART “PASSED/FAILED”, NVMe “critical_warning”). These are blunt instruments. Some drives fail without flipping them. Some flip them late.
- Media degradation counters (reallocated sectors, pending sectors, uncorrectables, NVMe media errors). These correlate with surface or NAND failure, and they predict the future better than most.
- Transport and environment counters (CRC errors, temperature history, unsafe shutdowns). These are often about cables, backplanes, power, and behavior.
Proxmox generally gets SMART data via smartmontools (smartctl/smartd) for SATA/SAS devices and via the NVMe SMART/Health log for NVMe devices. In the UI, you’ll often see a summarized “Health” plus a handful of attributes. Summaries are for humans. Incidents are for raw data.
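If you want to check what the host can actually enumerate before trusting the UI summary, two quick commands help. The pvesh path below mirrors what the Disks panel queries on recent PVE versions; treat the exact endpoint as an assumption and confirm it with pvesh usage on yours.
cr0x@server:~$ sudo smartctl --scan                        # what smartmontools can see and which -d type it guessed
cr0x@server:~$ pvesh get /nodes/$(hostname)/disks/list     # what the PVE API (and therefore the UI) enumerates; assumed path
cr0x@server:~$ systemctl status smartd --no-pager          # is the daemon that feeds periodic checks actually running?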
If you only remember one thing: SMART is a trend tool, not a single snapshot tool. A single non-zero number can be harmless if it never changes. A small number that climbs is a siren.
One quote worth keeping near your on-call notebook: “Hope is not a strategy.” — General Gordon R. Sullivan. That’s not storage-specific, but it’s painfully applicable to “it’s probably fine” disk decisions.
Joke #1: SMART is like a toddler’s injury report: loud about the wrong things, quiet about the scary ones.
Interesting facts and short history that change how you read SMART
- SMART predates modern SSDs. It was popularized in the 1990s for spinning disks, and many attributes still carry HDD-era assumptions.
- “Normalized” values are vendor fiction. The VALUE/WORST/THRESH columns are scaled and vendor-defined; the raw counter is often more actionable.
- Early SMART was mainly about “predictive failure.” The original promise was a pre-failure warning. In practice, many failures are abrupt (electronics, firmware, power events).
- Backblaze-style fleet studies popularized a few attributes. Large fleet data made reallocated and pending sectors famous for HDDs, but the relationship isn’t identical across models.
- SSD SMART is less standardized than people assume. Wear indicators (percentage used, media wearout) are useful across vendors, but vendor-specific logs still matter.
- SMART tests are “offline” but not harmless. Long tests can stress marginal drives and slow performance; that’s a feature when you’re validating a replacement decision.
- Transport errors often have nothing to do with the drive. A rising CRC error count screams “cable/backplane” more than “media failure.”
- ZFS and Ceph already do their own truth-finding. ZFS checksum errors and Ceph scrub/BlueStore read errors can reveal corruption even when SMART looks calm.
- Drive firmware can remap sectors silently. Reallocation is sometimes proactive; the drive may swap out weak sectors before you ever see an uncorrectable read.
The SMART attributes that actually predict failure
“Predict” is doing heavy lifting here. No attribute is a prophecy. But some are consistently correlated with drives that get worse, quickly. These are the ones that deserve an incident ticket, not a shrug.
For SATA HDDs: the big three that matter most
1) Reallocated Sector Count (ID 5)
This is the canonical “media is degrading” metric. A reallocated sector is a sector the drive couldn’t reliably read/write and swapped out for a spare sector.
How to treat it in production:
- Zero is the baseline. Non-zero means the drive has already consumed spares.
- Stable non-zero can be acceptable if it doesn’t increase over weeks and scrubs show no read errors.
- Increasing reallocations are a replacement trigger. Not because of the number itself, but because the drive is telling you it’s still losing the fight.
2) Current Pending Sector (ID 197)
Pending sectors are worse than reallocations in the short term. These are sectors the drive has trouble reading and has not remapped yet—often because it wants a successful write to confirm whether it can remap.
Operational meaning: you have data the drive can’t consistently read. That’s how you get timeouts, slow reads, and eventually uncorrectable errors.
Decision rule: Pending > 0 on a ZFS/RAID member deserves immediate follow-up: run a long test and check ZFS/Ceph errors. If pending persists or grows, schedule replacement.
3) Uncorrectable Sector Count / Offline Uncorrectable (ID 198) and Reported Uncorrectable Errors (ID 187)
If the drive admits it couldn’t correct reads during offline scanning (198) or during normal operations (187), treat it as “data integrity risk present.”
In RAIDZ/mirrors you might survive. In single-disk filesystems you might not. In either case, the drive is no longer “boring,” and boring is the desired state.
For SATA SSDs: the attributes that matter
1) Media Wearout Indicator / Percentage Used / Wear Leveling Count
Different vendors name this differently. You want a direct indicator of NAND wear. When this approaches end-of-life, you expect:
- slower writes
- higher error correction overhead
- eventual read-only mode on some models
Decision rule: when the wear indicator shows you’re in the final stretch of rated life, plan replacement on your schedule, not the drive’s.
2) Reallocated sectors / reallocation events (varies)
SSDs remap differently than HDDs, but a rising reallocation-style metric often correlates with failing NAND or controller instability. If it climbs, treat like HDD reallocations: trend it and plan replacement.
3) Uncorrectable errors
Uncorrectables on SSDs are a bad sign. They’re less common than on dying HDDs, and when they happen, they often mean controller/NAND trouble that doesn’t get better.
For NVMe: the few fields that matter more than a dozen SATA attributes
1) Critical Warning
This is a bitmask. Any non-zero deserves attention. It can indicate spare capacity low, temperature issues, or media read-only mode.
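The bit meanings come straight from the NVMe specification’s SMART/Health log. A minimal decode sketch; the 0x04 is an example value you’d paste from smartctl, not output from a real device.
cw=0x04   # paste the Critical Warning value smartctl reported
(( cw & 0x01 )) && echo "bit 0: available spare below threshold"
(( cw & 0x02 )) && echo "bit 1: temperature outside thresholds"
(( cw & 0x04 )) && echo "bit 2: NVM subsystem reliability degraded"
(( cw & 0x08 )) && echo "bit 3: media placed in read-only mode"
(( cw & 0x10 )) && echo "bit 4: volatile memory backup device failed"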
2) Percentage Used
It’s exactly what it sounds like: wear relative to the manufacturer’s rated life. Not perfect, but far better than guessing based on power-on hours.
3) Media and Data Integrity Errors
Often exposed as “Media and Data Integrity Errors” in NVMe SMART. Non-zero and increasing means the device is failing to deliver correct data without error recovery.
4) Error Information Log Entries
This is the “something keeps going wrong” counter. It can increment for recoverable errors too, so correlate with latency spikes, I/O errors in logs, and ZFS checksum errors.
Trend beats threshold
Most SMART “THRESH” values are set so low that by the time they trip, you’ve already had a bad week. Your operational thresholds should be more conservative:
- Any increase in pending or uncorrectables gets immediate investigation.
- Reallocated sectors increasing month-over-month triggers planned replacement.
- NVMe media errors increasing triggers replacement planning; increasing fast triggers urgent replacement.
- Transport errors increasing triggers cabling/backplane/power work before blaming the drive.
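Turning “any increase” into an alert means keeping yesterday’s numbers. A minimal delta check, assuming you save smartctl -A output into dated files (the file names here are hypothetical placeholders):
cr0x@server:~$ diff <(awk '$1 ~ /^(5|187|197|198|199)$/ {print $1, $2, $NF}' smart-sda-2025-12-25.txt) <(awk '$1 ~ /^(5|187|197|198|199)$/ {print $1, $2, $NF}' smart-sda-2025-12-26.txt)
No output means nothing moved; any output line is a counter that changed and deserves at least a ticket.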
The noisy attributes that waste your time
Some SMART attributes are famously misleading because vendors encode multiple subcounters in one raw value, or because the attribute is workload-dependent. Dashboards love these because they move. Engineers hate them because they don’t answer the question “will this drive drop out?”
Raw Read Error Rate (ID 1) and Seek Error Rate (ID 7)
On many Seagate models these raw numbers look terrifying and keep climbing even on healthy drives. That’s not your incident. Trend the normalized value if you must, but don’t replace a drive based on a scary-looking raw read error rate alone.
Hardware ECC Recovered (ID 195)
High ECC recovery counts often mean “the error correction is doing its job.” It’s not automatically bad. It becomes interesting only if it coincides with uncorrectables, timeouts, or ZFS checksum errors.
Spin Retry Count (ID 10) and Start/Stop Count (ID 4)
These are workload and power-management dependent. Start/stop and load-cycle counts are especially inflated by aggressive spin-down and head parking. High counts don’t necessarily predict imminent failure; they predict you configured power management like a laptop.
Power-On Hours (ID 9)
Age matters, but it’s not determinism. I’ve replaced brand-new drives that arrived pre-damaged and kept 7-year-old enterprise disks alive by treating them gently and watching the right counters.
Temperature (ID 194/190)
Temperature is a risk factor, not a failure predictor by itself. Persistent high temps accelerate wear and can trigger throttling or error rates, but a one-off spike during a rebuild is not a death sentence.
Where Proxmox shows SMART and what it’s really telling you
In Proxmox VE, you typically see SMART under node → Disks → SMART. It can also surface health in the storage view depending on your setup. The UI is convenient for a glance, but it’s not where you do root cause analysis.
Use the UI to spot “something changed.” Then drop to the shell and pull:
- full SMART attributes (not just the few the UI renders)
- SMART error log
- self-test history
- kernel logs showing timeouts/resets
- ZFS/Ceph scrub and checksum status
Also: Proxmox is often running on hosts with HBAs, backplanes, and sometimes RAID controllers. SMART passthrough can be incomplete. If SMART is missing fields or always returns “UNKNOWN,” that’s not the drive being mysterious; it’s your storage path being opaque.
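If drives sit behind a RAID controller or an unusual HBA, smartctl often needs to be told how to reach them. The device-type flags below are standard smartmontools options; the disk numbers are placeholders for your hardware layout.
cr0x@server:~$ sudo smartctl -a -d sat /dev/sda          # SATA disk presented through a SAS HBA
cr0x@server:~$ sudo smartctl -a -d megaraid,0 /dev/sda   # physical disk 0 behind a Broadcom/LSI MegaRAID controller
cr0x@server:~$ sudo smartctl -a -d cciss,0 /dev/sda      # physical disk 0 behind an HP Smart Array controller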
Practical tasks: commands, what the output means, and the decision you make
These are the tasks I actually run when SMART warnings show up on Proxmox. Each includes the “so what” decision, because collecting numbers without decisions is just competitive log hoarding.
Task 1: Confirm the device path and model (avoid blaming the wrong disk)
cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE,MOUNTPOINT
NAME SIZE MODEL SERIAL TYPE MOUNTPOINT
sda 3.6T ST4000NM0035-1V4 ZC1ABC12 disk
sdb 3.6T ST4000NM0035-1V4 ZC1ABD34 disk
nvme0n1 1.8T INTEL SSDPE2KX020T8 PHBT1234001 disk
Meaning: Map Linux names to physical disks (model + serial). Proxmox UI device names can shift after reboots or HBA changes.
Decision: If you can’t tie the warning to a serial number, stop. Build that mapping first, or you’ll pull the wrong drive and learn humility the hard way.
Task 2: Pull full SMART health summary for a SATA/SAS disk
cr0x@server:~$ sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
=== START OF INFORMATION SECTION ===
Device Model: ST4000NM0035-1V4107
Serial Number: ZC1ABC12
LU WWN Device Id: 5 000c50 0a1b2c3d4
Firmware Version: SN04
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 12
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Meaning: “PASSED” doesn’t override pending/uncorrectables. Here we have reallocations, pending sectors, and offline uncorrectables: classic media trouble.
Decision: If 197 or 198 is non-zero, escalate. Run a long test, check ZFS/Ceph status, and plan replacement if the numbers persist or increase.
Task 3: Pull NVMe health and the critical warning bit
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2KX020T8
Serial Number: PHBT1234001
Firmware Version: VDV10131
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Critical Warning: 0x00
Temperature: 44 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 63%
Media and Data Integrity Errors: 0
Error Information Log Entries: 18
Meaning: Percentage Used is getting up there (63%), but not end-of-life by itself. “Error Information Log Entries” is non-zero; correlate with logs and performance symptoms.
Decision: If Critical Warning is non-zero or Media and Data Integrity Errors climbs, plan replacement. If only error log entries climb but no integrity errors, investigate firmware/driver/timeouts first.
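For a second opinion on the same NVMe fields, nvme-cli reads the identical SMART/Health log directly. Assuming the nvme-cli package is installed (it isn’t on every PVE install by default):
cr0x@server:~$ sudo nvme smart-log /dev/nvme0    # critical_warning, percentage_used, media_errors, num_err_log_entries
cr0x@server:~$ sudo nvme error-log /dev/nvme0    # the entries behind "Error Information Log Entries"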
Task 4: Check SMART error log (the drive’s confession list)
cr0x@server:~$ sudo smartctl -l error /dev/sda
SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
...
Error 3 occurred at disk power-on lifetime: 4231 hours
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 ...
40 -- 51 0008 0000001a2b3c ...
Commands leading to the command that caused the error were:
READ FPDMA QUEUED
Meaning: Errors during reads are consistent with pending/uncorrectables. If errors are old and not increasing, less urgent.
Decision: If error count increments during your incident window, treat it as live instability. Prepare to replace and resilver/rebuild.
Task 5: Check SMART self-test history
cr0x@server:~$ sudo smartctl -l selftest /dev/sda
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 4232 0x0000001a2b3c
# 2 Short offline Completed without error 00% 4200 -
Meaning: The long test hit a read failure at a specific LBA. That’s not “monitor it.” That’s “schedule replacement.”
Decision: Replace the disk. If it’s in a mirror/RAIDZ/Ceph OSD, start the controlled recovery path now, not after it fails harder.
Task 6: Run a short SMART test (quick triage)
cr0x@server:~$ sudo smartctl -t short /dev/sda
Please wait 2 minutes for test to complete.
Test will complete after Tue Dec 26 02:16:04 2025
Use smartctl -l selftest /dev/sda to read test results.
Meaning: Short tests catch obvious issues fast. They won’t fully scan the surface.
Decision: If short test fails, you’re done debating. Replace. If it passes but you have pending/uncorrectables, run extended next.
Task 7: Run an extended SMART test (surface scan, the “prove it” test)
cr0x@server:~$ sudo smartctl -t long /dev/sda
Please wait 411 minutes for test to complete.
Test will complete after Tue Dec 26 09:06:11 2025
Use smartctl -l selftest /dev/sda to read test results.
Meaning: Extended tests scan the surface and tend to flush out marginal sectors.
Decision: If it reports read failures, replace. If it completes but pending remains, don’t declare victory until you’ve verified pending cleared and no new errors appear under load.
Task 8: Identify whether “CRC errors” are cabling/backplane, not disk
cr0x@server:~$ sudo smartctl -A /dev/sda | egrep 'UDMA_CRC_Error_Count|CRC|199'
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 47
Meaning: CRC errors are almost always link-layer corruption between disk and controller: cable, connector, expander, backplane, vibration, or power noise.
Decision: If this counter is increasing, reseat/replace cable, move the drive to another bay, check backplane, and inspect kernel logs for link resets. Don’t RMA the drive first unless other media attributes also look bad.
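One more data point that separates link noise from media trouble: the drive-side SATA Phy event counters, which count link resets and R_ERR responses as seen by the disk itself (most SATA drives support this log; SAS drives expose similar counters differently).
cr0x@server:~$ sudo smartctl -l sataphy /dev/sda    # rising R_ERR / COMRESET counters point at the link, not the media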
Task 9: Correlate with kernel logs for timeouts and resets
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i 'ata[0-9]|reset|timeout|I/O error|nvme'
Dec 26 01:41:08 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 26 01:41:08 server kernel: ata3.00: failed command: READ FPDMA QUEUED
Dec 26 01:41:09 server kernel: ata3: hard resetting link
Dec 26 01:41:14 server kernel: blk_update_request: I/O error, dev sda, sector 43905102 op 0x0:(READ)
Meaning: Link resets plus I/O errors: could be media or could be transport. Combine this with SMART 199 and 5/197/198.
Decision: If 5/197/198 are clean but link resets happen, treat as cabling/HBA/backplane first. If 197/198 are non-zero and you see I/O errors, treat as drive failure in progress.
Task 10: ZFS health: see what the filesystem thinks (it’s usually right)
cr0x@server:~$ sudo zpool status -v
pool: rpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device and run 'zpool clear' or 'zpool replace'.
scan: scrub repaired 0B in 00:18:11 with 0 errors on Tue Dec 24 03:12:18 2025
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-ST4000NM0035_ZC1ABC12 FAULTED 12 0 34 too many errors
ata-ST4000NM0035_ZC1ABD34 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/rpool/vm-100-disk-0
Meaning: ZFS sees real read/checksum errors and has faulted the device. This is beyond “SMART warning”; it’s an integrity incident.
Decision: Replace the disk immediately. Then handle the permanent errors: restore from replication/backup for affected blocks or VM disks.
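The replacement itself is one command once the new disk is physically in place. A sketch based on the pool above; the new disk’s by-id path is hypothetical, and if this is your boot pool you also need the partition/bootloader steps from the Proxmox docs before touching ZFS.
cr0x@server:~$ sudo zpool replace rpool ata-ST4000NM0035_ZC1ABC12 /dev/disk/by-id/ata-ST4000NM0035_ZC1NEW99
cr0x@server:~$ sudo zpool status rpool    # watch the resilver; the old device is detached automatically when it completes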
Task 11: ZFS scrub on demand (when you need truth, not comfort)
cr0x@server:~$ sudo zpool scrub rpool
cr0x@server:~$ sudo zpool status rpool
pool: rpool
state: ONLINE
scan: scrub in progress since Tue Dec 26 02:22:10 2025
215G scanned at 1.20G/s, 48.1G issued at 273M/s, 2.10T total
0B repaired, 2.31% done, 02:41:18 to go
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-ST4000NM0035_ZC1ABC12 ONLINE 0 0 0
ata-ST4000NM0035_ZC1ABD34 ONLINE 0 0 0
Meaning: A scrub reads everything and verifies checksums. If SMART is whispering but ZFS is clean, you may have early indicators without user-visible corruption yet.
Decision: If scrub reports errors, act immediately. If scrub is clean but SMART shows pending/uncorrectables, still plan replacement—scrub can miss sectors not currently referenced.
Task 12: Ceph quick health check (if your Proxmox cluster uses it)
cr0x@server:~$ sudo ceph -s
cluster:
id: 8f3c2d3e-1b2a-4c5d-9e10-1a2b3c4d5e6f
health: HEALTH_WARN
1 osds down
2 slow ops, oldest one blocked for 31 sec
services:
mon: 3 daemons, quorum a,b,c (age 4h)
mgr: a(active, since 2d)
osd: 12 osds: 11 up (since 3m), 12 in (since 7d)
data:
pools: 4 pools, 256 pgs
objects: 1.2M objects, 4.3 TiB
usage: 13 TiB used, 24 TiB / 37 TiB avail
pgs: 254 active+clean, 2 active+degraded
Meaning: Ceph is telling you a disk/OSD is out and ops are slowing. SMART might be involved, but Ceph health is the priority signal.
Decision: Identify the down OSD host and disk, check SMART/NVMe logs, and replace/restore OSD. Don’t stare at SMART while Ceph bleeds.
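To find out which OSD and which host that HEALTH_WARN is actually about before anyone touches hardware (both are stock Ceph CLI commands):
cr0x@server:~$ sudo ceph health detail           # names the specific OSDs and the slow ops behind the warning
cr0x@server:~$ sudo ceph osd tree | grep -i down # shows the down OSD and the host it lives on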
Task 13: Map a Ceph OSD to a physical disk (stop guessing)
cr0x@server:~$ sudo ceph-volume lvm list
====== osd.4 ======
[block] /dev/disk/by-id/ata-ST4000NM0035-1V4107_ZC1ABC12
[block.db] /dev/nvme0n1p2
devices /dev/sda
====== osd.5 ======
[block] /dev/disk/by-id/ata-ST4000NM0035-1V4107_ZC1ABD34
[block.db] /dev/nvme0n1p3
devices /dev/sdb
Meaning: Now you can tie “osd.4” to a specific serial and bay.
Decision: Replace the correct disk, not its innocent neighbor.
Task 14: Look for SATA/NVMe PCIe errors (a “disk problem” that’s really a platform problem)
cr0x@server:~$ sudo journalctl -k --since "24 hours ago" | egrep -i 'pcie|aer|nvme.*reset|controller is down'
Dec 25 18:12:44 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:02:00.0
Dec 25 18:12:44 server kernel: nvme nvme0: controller reset
Dec 25 18:12:46 server kernel: nvme nvme0: I/O 37 QID 5 timeout, aborting
Meaning: PCIe AER corrected errors and controller resets are often about power management, firmware bugs, risers, or marginal slots—not the NAND wearing out.
Decision: Before you RMA the NVMe, check BIOS/firmware, disable aggressive PCIe ASPM if needed, reseat, and validate thermals. Replace the drive if media errors appear or resets persist after platform fixes.
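Checking the current ASPM policy is read-only and safe; pinning it off requires a kernel parameter and a reboot. A sketch, with the bootloader note hedged because ZFS-root Proxmox installs use systemd-boot rather than GRUB:
cr0x@server:~$ cat /sys/module/pcie_aspm/parameters/policy    # the bracketed entry is the active ASPM policy
# To disable ASPM, add pcie_aspm=off to the kernel command line:
#   GRUB installs: edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub
#   ZFS-root/UEFI installs: edit /etc/kernel/cmdline, then run proxmox-boot-tool refresh
# Reboot to apply, then re-check journalctl for AER/reset storms.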
Task 15: Trend SMART attributes over time (because “is it getting worse?” is the real question)
cr0x@server:~$ sudo smartctl -A /dev/sda | awk '$1 == "ID#" || $1 ~ /^(5|187|197|198|199)$/'
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 12
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 2
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Meaning: This is your “top offenders” set for SATA HDDs. Capture it daily or weekly.
Decision: Any upward trend in 197/198/187 is a replacement plan. 199 trending upward is a cabling/backplane plan.
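A minimal capture you can run from cron; it appends one timestamped line per attribute so “is it getting worse?” becomes a grep. The file path and device are placeholders.
cr0x@server:~$ sudo smartctl -A /dev/sda | awk -v d="$(date -Is)" '$1 ~ /^(5|187|197|198|199)$/ {print d","$1","$2","$NF}' | sudo tee -a /var/log/smart-trend-sda.csv >/dev/null
cr0x@server:~$ tail -n 10 /var/log/smart-trend-sda.csv    # eyeball the most recent values against last week's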
Task 16: Validate drive identity by-id (stable naming for automation and replacement)
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'ZC1ABC12|nvme-INTEL'
lrwxrwxrwx 1 root root 9 Dec 26 01:10 ata-ST4000NM0035-1V4107_ZC1ABC12 -> ../../sda
lrwxrwxrwx 1 root root 13 Dec 26 01:10 nvme-INTEL_SSDPE2KX020T8_PHBT1234001 -> ../../nvme0n1
Meaning: Use these stable identifiers in ZFS by-id vdev paths and in your runbooks.
Decision: If your pool uses /dev/sdX names, schedule a cleanup/migration to by-id naming before you have a replacement incident.
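For a non-boot pool, the cleanest migration to by-id names is an export and re-import; the data stays put, only the device paths recorded in the pool change. The pool name here is a placeholder, and you cannot do this to rpool while booted from it.
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ sudo zpool status tank    # vdevs should now display ata-.../nvme-... identifiers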
Fast diagnosis playbook (first/second/third)
This is the “you have five minutes before the meeting and 30 seconds before the pager escalates” routine. The goal is not perfect truth; it’s fast sorting: media failure vs transport vs platform vs filesystem.
First: determine if this is an integrity incident
- ZFS: zpool status -v. If you see CKSUM errors, faulted devices, or permanent errors, treat it as a data integrity risk now.
- Ceph: ceph -s. If OSDs are down, ops are slow, or PGs are degraded, treat it as storage service instability now.
- Kernel logs: look for I/O errors and resets in the last hour.
Second: separate media degradation from transport noise
- Pull SMART/NVMe health (smartctl -a).
- Media indicators: pending (197), uncorrectables (198/187), reallocated (5), NVMe media errors.
- Transport indicators: CRC (199), link resets, SAS phy resets, PCIe AER, NVMe controller resets without media errors.
Third: force the issue with a controlled test
- Run SMART short test, then long test if needed.
- Run a ZFS scrub, or schedule a Ceph deep scrub, if the cluster can handle the load.
- Watch counters again after the test. If pending clears and no new errors appear, you may have dodged a bullet. If counters climb, replace.
Joke #2: If your “fast diagnosis” ends with “let’s reboot and see,” congratulations—you’re practicing storage astrology.
Three corporate mini-stories (how teams get this wrong and right)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a Proxmox cluster with ZFS mirrors. The on-call saw a SMART “PASSED” and a Proxmox warning about a couple of reallocated sectors. They assumed it was fine because the pool was still ONLINE and the VMs were running.
The subtle detail: the drive also had a small but non-zero Current_Pending_Sector count. It wasn’t emphasized in the dashboard view. Nobody trended it. Nobody ran a long test. Life continued.
Two weeks later, a routine ZFS scrub started. It hit one of the pending sectors during a read, the drive timed out, and the HBA reset the link. ZFS kept trying. Latency went through the roof. VM I/O stalled. The cluster didn’t “lose data,” but it lost something more precious in the moment: time.
They replaced the drive after it finally dropped. The resilver took longer than expected because the surviving mirror member was now doing double duty under production load. The postmortem was blunt: the mistake wasn’t ignoring reallocations; it was treating pending sectors as “just another SMART number.”
The corrective action was boring: daily SMART snapshots of 5/197/198 and an auto-ticket if 197 is non-zero or increases. They didn’t stop failures. They stopped surprises.
Mini-story 2: The optimization that backfired
Another team wanted quieter alerts. Their Proxmox UI showed frequent SMART “temperature” and “UDMA CRC error count” warnings on a handful of nodes. People were tuning out notifications and, naturally, missing the real ones.
So they “optimized” alerting: they raised thresholds and suppressed CRC warnings because “it’s always the backplane.” They also suppressed temperature warnings because “those drives always run hot during rebuilds.” The dashboard became calmer. Everyone felt better.
Months later, they had intermittent VM pauses on one host. Nothing obvious in SMART. No reallocated sectors, no pending sectors. But the kernel logs showed bursts of link resets and CRC increments during peak load. It turned out one SAS expander port was marginal. The errors weren’t constant; they came in storms.
Because CRC alerts were suppressed, the hardware issue lingered. The expander degraded further and started causing multi-drive timeouts during scrubs. That’s when ZFS started showing checksum errors. The system didn’t fail because of a single drive. It failed because the “noisy” alert was actually their early-warning system for a shared component.
The fix wasn’t “turn alerts back on and suffer.” It was better: alert on CRC rate (change over time), not raw value, and correlate to a specific port/bay via inventory. They replaced the expander, not a pile of innocent disks.
Mini-story 3: The boring but correct practice that saved the day
A finance org ran Proxmox with Ceph. Their policy was dull: every disk has a label with serial, bay, and OSD ID; every week they capture SMART/NVMe summaries into a small time-series store; every month they run scheduled long tests staggered by node.
One afternoon, Ceph went HEALTH_WARN due to a few slow ops. Nothing was down. The on-call checked the trending dashboard and saw one NVMe’s “Media and Data Integrity Errors” tick up from 0 to 1 over the past day. Just one. Not exciting. But it was new.
They drained the host, marked the OSD out, and replaced the NVMe during business hours. The vendor’s RMA diagnostics later confirmed early media failure. If they’d waited for “Critical Warning” to flip, they would have eaten the failure during an unrelated maintenance window when the cluster was already stressed.
The practice that saved them wasn’t clever tooling. It was the habit of treating new integrity errors as actionable, even when they’re small, and doing replacements on their own schedule.
Common mistakes: symptom → root cause → fix
1) “SMART PASSED but Proxmox shows warnings”
Symptom: SMART overall status is PASSED; Proxmox still flags disk health.
Root cause: Overall status is a coarse vendor threshold; meaningful counters (pending/uncorrectable) can be non-zero while overall remains PASSED.
Fix: Ignore the “PASSED” comfort blanket. Review raw 5/187/197/198 (SATA) or NVMe media errors/critical warning. Trend them.
2) “UDMA CRC Error Count keeps increasing; replaced disk, problem persists”
Symptom: SMART attribute 199 increments; I/O errors appear; replacing the disk doesn’t stop it.
Root cause: Transport issue: SATA cable, backplane connector, expander, HBA port, or power instability.
Fix: Reseat/replace cables, move bays, inspect backplane, update HBA firmware, check power. Replace the shared component when errors follow the bay/port, not the disk.
3) “ZFS checksum errors but SMART looks clean”
Symptom: ZFS reports CKSUM errors; SMART attributes show no reallocations/pending.
Root cause: Data corruption in transit (cabling/HBA), flaky RAM (less common but catastrophic), or firmware/driver bugs. SMART measures the drive, not the whole path.
Fix: Check kernel logs for resets, CRC, PCIe errors. Validate ECC RAM health (edac). Swap HBA/cables. Scrub again. Don’t blame SMART for being the wrong tool.
4) “Pending sectors appeared after a power event”
Symptom: 197 > 0 after a hard shutdown or power loss.
Root cause: Incomplete writes or marginal sectors exposed by a sudden power-off; sometimes it clears after rewriting the affected sectors.
Fix: Run long SMART test, then verify whether pending clears. If pending persists or uncorrectables appear, replace. Also fix power: UPS, redundant PSUs, clean shutdowns.
5) “NVMe resets under load; SMART looks fine”
Symptom: NVMe controller resets and I/O timeouts; no media errors.
Root cause: Platform/PCIe issues: thermals, firmware, ASPM, risers, marginal slot, power delivery.
Fix: Check AER logs, thermals, firmware updates. Consider disabling aggressive power management. If resets persist, move device to another slot or replace the platform component.
6) “SMART long test slows VMs, so we never run it”
Symptom: You avoid long tests because they impact performance.
Root cause: No maintenance windows; fear of impact; lack of staggered scheduling.
Fix: Stagger tests per node and per disk during low-traffic windows. If you can’t tolerate a SMART long test, you also can’t tolerate a resilver during peak.
7) “We alert on every non-zero SMART raw value”
Symptom: Constant alerts; everyone ignores them.
Root cause: Alerting on noisy attributes without rate/trend logic.
Fix: Alert on changes in high-signal attributes (pending, uncorrectable, reallocations, NVMe media errors) and on rate-of-change for CRC/temperature, not absolute values.
Checklists / step-by-step plan
Checklist A: When Proxmox throws a SMART warning
- Identify the disk by serial number (lsblk, /dev/disk/by-id).
- Pull full SMART/NVMe data (smartctl -a).
- Extract the high-signal attributes:
- SATA: 5, 187, 197, 198, 199
- NVMe: Critical Warning, Percentage Used, Media/Data Integrity Errors
- Check the error log and self-test log (smartctl -l error, smartctl -l selftest).
- Correlate with kernel logs for the same time window (journalctl -k).
- Check ZFS/Ceph health (zpool status -v or ceph -s).
- Make a decision:
- Media counters increasing → replace disk (planned or urgent depending on rate).
- CRC/resets without media counters → fix transport/platform first.
- ZFS permanent errors → treat as integrity incident and recover data blocks/VM disks.
Checklist B: Replacement decision rules I actually use
- Replace now if:
- Any SMART long test shows read failure
- ZFS faults device or shows permanent errors
- 197 (pending) persists after long test or increases
- 198/187 increases (uncorrectables) under normal workload
- NVMe Critical Warning is non-zero or media errors climb
- Plan replacement if:
- Reallocated sectors (5) are non-zero and slowly increasing
- SSD Percentage Used is high and you’re approaching lifecycle policies
- Do not replace the drive yet if:
- Only CRC errors rise and media counters are clean → fix cabling/backplane/HBA
- Only noisy attributes look scary (raw read error rate) with no corroboration
Checklist C: Building a sane SMART monitoring setup on Proxmox
- Ensure smartmontools is installed and smartd is running (a starter smartd.conf sketch follows this checklist).
- Use stable device IDs for tracking (by-id symlinks).
- Collect a small subset of attributes and trend them (daily snapshots are enough).
- Alert on change, not on existence:
- delta(197) > 0 → page
- delta(198) > 0 → page
- delta(5) > 0 → ticket
- delta(199) > 0 → ticket; page if rapid and correlated with I/O errors
- Schedule SMART long tests staggered across disks and nodes.
- Verify that ZFS scrubs or Ceph scrubbing are happening and reviewed.
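A starting point for /etc/smartd.conf that matches this checklist: monitor everything, run staggered self-tests, and mail warnings to root (which Proxmox forwards via its configured relay). The schedule regexp follows smartd’s documented -s syntax; shift the hours per node so tests don’t all fire at once, and treat the temperature limits as placeholders for your hardware.
# /etc/smartd.conf: a minimal sketch using standard smartd directives
# -a          : monitor health status, attributes, and the error/self-test logs
# -o on -S on : enable offline data collection and attribute autosave
# -s (...)    : short test daily at 02:00, long test Saturdays at 03:00
# -W 4,45,55  : track temperature changes of 4C, warn at 45C, alert at 55C
# -m root     : mail warnings to root
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -W 4,45,55 -m root
cr0x@server:~$ sudo systemctl restart smartd
smartd covers the “something tripped” layer; the alert-on-change rules above still need your trending layer, such as the CSV capture from Task 15 plus a delta check.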
FAQ
1) If SMART says PASSED, can I ignore Proxmox warnings?
No. “PASSED” often means “hasn’t crossed the vendor’s catastrophic threshold.” Pending sectors and uncorrectables can exist while overall health still passes.
2) Which single HDD SMART attribute is most predictive?
Current_Pending_Sector (197) is the one that changes decisions fastest. It represents unreadable sectors right now, not historical remapping.
3) Are reallocated sectors always a reason to replace?
Not always immediately. A small, stable reallocated count can live for a long time. An increasing count is your replacement plan writing itself.
4) What does UDMA CRC error count mean in practice?
Data corruption on the link between disk and controller. Think cable, backplane, expander, HBA port, or power noise. It’s often not the drive’s fault.
5) Why does ZFS show checksum errors when SMART looks fine?
Because SMART measures the drive’s internal view. ZFS validates end-to-end checksums and can catch corruption from RAM, HBA, cables, or firmware issues.
6) Should I run SMART long tests on production hosts?
Yes, but stagger them. A long test is cheaper than a surprise rebuild at peak. If performance impact is unacceptable, that’s a capacity/planning problem, not a SMART problem.
7) For NVMe, what should I alert on?
Alert on non-zero Critical Warning, increasing Media and Data Integrity Errors, and sudden jumps in resets/timeouts in kernel logs. Percentage Used is for lifecycle planning.
8) Proxmox can’t read SMART behind my RAID controller. What now?
Use the controller’s tooling for drive health, or switch to an HBA/IT mode where the OS can see drives directly. If you can’t observe per-drive health, you’re running blind—at scale, that’s a choice with consequences.
9) How do I decide between “replace disk” and “fix cabling”?
If 5/197/198/187 (or NVMe media errors) are increasing, it’s the disk. If only CRC/link resets are increasing and media counters are clean, it’s the path.
10) Does high temperature mean imminent failure?
Not necessarily imminent, but it shortens life and can induce errors and throttling. Treat persistent high temperature as a reliability bug: airflow, fan curves, dust, and chassis design.
Conclusion: what to do next, practically
When Proxmox waves a SMART warning at you, don’t get hypnotized by the overall “PASSED.” Go straight to the few counters that actually predict trouble: pending sectors, uncorrectables, reallocations, NVMe media errors, and the transport counters that implicate your cabling and backplane.
Next steps that pay back immediately:
- Pick a small, opinionated alert set: 5/187/197/198/199 for SATA; critical warning/media errors/percentage used for NVMe.
- Trend those values. Alerts on change, not on existence.
- Make ZFS/Ceph the judge of integrity and SMART the early warning system.
- Write down your replacement decision rules and follow them. The point is consistency, not heroics.
Your storage stack doesn’t need you to be optimistic. It needs you to be boring, systematic, and slightly distrustful of dashboards.