Debian 13 SATA Link Resets: Prove It’s the Cable or Backplane, Not Linux

Some outages don’t start with a bang. They start with a disk “hiccup,” a single ataX: link is slow to respond, and then a slow-motion collapse: RAID rebuilds, ZFS resilvers, filesystem remounts read-only, and a team arguing about whether “the new Debian kernel” is to blame.

If you run Debian 13 on real servers with real SATA wiring—backplanes, expander-ish nonsense, cheap breakout cables, and airflow that’s fine until it isn’t—SATA link resets are the kind of failure that looks like software until you prove it’s hardware. This is how you prove it.

Facts and context: why SATA fails the way it does

Some short, concrete context points that help you reason about what you’re seeing in Debian 13 logs. These aren’t trivia for trivia’s sake; they explain the failure modes.

  1. SATA is serial, but it’s still analog at the edges. The bits ride on an electrical link that’s sensitive to impedance, shielding, connector wear, and crosstalk—especially in dense backplanes.
  2. COMRESET is a real electrical handshake. When you see “COMRESET failed,” the host is trying to reset the PHY-level link and not getting a clean response. That’s often cable/backplane, sometimes power, occasionally controller.
  3. SMART CRC errors were basically invented to catch cabling problems. “UDMA_CRC_Error_Count” increments when the drive detects transmission errors between drive and host. Media errors don’t increment it.
  4. NCQ made SATA faster and debugging harder. Native Command Queuing allows multiple outstanding commands. When the link gets flaky, failures can look like timeouts in a queue, not clean “read sector bad” errors.
  5. Consumer SATA connectors are not designed for infinite reinsertion cycles. A backplane that’s seen years of disk swaps has wear: spring tension changes, oxidation happens, plastic warps with heat.
  6. Many “SATA backplanes” are really passive until they aren’t. Some have muxes, retimers, LEDs, or cheap connectors that behave fine at 1.5 Gbps and become chaos at 6.0 Gbps.
  7. Linux libata error recovery is aggressive by design. It tries hard resets, soft resets, speed downshifts, and revalidation. This is good for uptime and terrible for finger-pointing.
  8. Link speed downshifts are a smoking gun. A port that negotiates down from 6.0 to 3.0 or 1.5 Gbps under load is frequently a signal integrity problem, not a filesystem bug.
  9. Vibration and heat matter more than people admit. A barely-marginal connection can fail only during peak I/O (more EMI, more heat) or during fan profile changes.

Fast diagnosis playbook (first/second/third)

You’re on call. The array is degraded. The app is timing out. You need the fastest path to “what do we replace, and what do we stop touching?” This is that path.

First: confirm it’s link-level instability, not just a dying drive

  • Scan journalctl/dmesg for hard resetting link, COMRESET, failed command: READ FPDMA QUEUED, link up/down.
  • Check SMART UDMA_CRC_Error_Count (or SATA PHY error log if available). If it’s moving, treat cabling/backplane as primary suspect.
  • Check if resets follow a bay/port, not a specific drive serial.

Second: correlate errors to a single port/bay under load

  • Map /dev/sdX → ataX → controller port → physical bay.
  • Look for link speed changes and repeated resets on the same ataX.
  • Run a controlled read test on that disk only, and watch logs live.

Third: isolate the fault domain with swaps that teach you something

  • Swap the cable/backplane path before swapping the OS or “tuning the kernel.” Move the drive to a different bay; if the errors stay with the bay, you have your culprit.
  • Swap to a known-good cable (or different backplane connector) and rerun the same load test.
  • Only after cabling/backplane/power are cleared do you spend time on firmware, controller drivers, or kernel regressions.

That ordering is not ideology. It’s math: cabling and backplane faults are common, fast to test, and they don’t show up in your CI pipeline.
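
If you want one quick pass over the whole box before committing to a theory, something like the following works as a starting point. It's a rough sketch, not a tool: it assumes smartmontools is installed and that the suspect disks show up as plain /dev/sd? devices.

# 1) Pull the last two hours of link-level noise from the kernel log.
sudo journalctl -k --since "2 hours ago" \
  | grep -Ei "hard resetting link|comreset|link (up|down)|failed command"

# 2) Snapshot CRC vs media counters for every disk in one pass.
for d in /dev/sd?; do
  echo "== $d =="
  sudo smartctl -A "$d" \
    | grep -Ei "UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Current_Pending_Sector"
done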

Your evidence standard: building a case that survives a postmortem

If you want to stop the “Linux did it” loop, you need a repeatable chain of evidence:

  1. Timeline: exact timestamps of resets and I/O errors.
  2. Identity: which disk (WWN and serial), which ataX, which HBA port, which bay.
  3. Signal type: CRC/link/handshake vs media reallocation vs controller bug.
  4. Repro: a test that triggers the error (or doesn’t) under controlled conditions.
  5. Swap result: the error follows the bay/cable/backplane (hardware path) or follows the drive (device).

A kernel upgrade might change timing enough to reveal a marginal link. That doesn’t mean the kernel “broke SATA.” It means the kernel stopped tiptoeing around your flaky physical layer.

Practical tasks: commands, outputs, what they mean, and what you decide

These are real tasks you can run on Debian 13. Each one includes: a command, sample output, what it means, and the decision you make from it. Don’t run stress tests on a production array without coordinating; “controlled chaos” is still chaos.

Task 1: Pull a focused SATA/libata error timeline from the journal

cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "ata[0-9]+|sata|hard reset|link up|link down|comreset|i/o error|exception"
Dec 29 10:11:02 server kernel: ata7: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x6 frozen
Dec 29 10:11:02 server kernel: ata7: SError: { CommWake 10B8B DevExch }
Dec 29 10:11:02 server kernel: ata7: hard resetting link
Dec 29 10:11:03 server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 29 10:11:03 server kernel: ata7.00: configured for UDMA/133
Dec 29 10:11:04 server kernel: blk_update_request: I/O error, dev sde, sector 124518400 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Meaning: Emask 0x10 and SErr flags with a hard reset strongly suggest a transport/link event. The I/O error is collateral damage.

Decision: Treat this as link instability until proven otherwise. Next step: identify which physical disk is sde and check its SMART CRC counters.

Task 2: Confirm which disk is sde (serial, WWN, model)

cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,WWN,HCTL /dev/sde
NAME SIZE MODEL            SERIAL         WWN                HCTL
sde  12T  ST12000NM0008-2A ZS2ABC12       0x5000c500cafe1234 6:0:6:0

Meaning: You now have stable identifiers (serial/WWN) and an HCTL path.

Decision: Use WWN/serial in tickets and postmortems, not /dev/sde. Device names move; WWNs don’t (usually).

Task 3: Read SMART attributes; look specifically for CRC errors vs media errors

cr0x@server:~$ sudo smartctl -a /dev/sde | egrep -i "UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|SMART overall"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       47

Meaning: Media looks clean (no reallocations/pending). CRC errors are non-zero. CRC errors are almost always cabling/backplane/controller-path issues, not the disk surface.

Decision: Stop swapping drives “because it’s failing.” Focus on the link: cable, backplane connector, HBA port, and power path to that bay.

Task 4: Check whether CRC errors are still increasing

cr0x@server:~$ sudo smartctl -A /dev/sde | awk '$1==199{print}'
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       47
cr0x@server:~$ sleep 60; sudo smartctl -A /dev/sde | awk '$1==199{print}'
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       49

Meaning: It moved from 47 to 49 in one minute. That’s an active transport problem.

Decision: Escalate to physical intervention. Plan a maintenance window if the array is redundant; if not, treat as imminent outage risk.
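
To watch the counter over a longer window than a single minute, a small loop is enough. A minimal sketch; adjust the device and interval to taste, and stop it with Ctrl-C:

# Print a timestamp and the raw UDMA_CRC_Error_Count for /dev/sde every 5 minutes.
# A steadily rising number means the link is still degrading right now.
while true; do
  printf '%s ' "$(date --iso-8601=seconds)"
  sudo smartctl -A /dev/sde | awk '$1==199 {print $NF}'
  sleep 300
done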

Task 5: Watch kernel messages live while you do a controlled read

cr0x@server:~$ sudo dmesg -Tw
[Sun Dec 29 10:21:17 2025] ata7: hard resetting link
[Sun Dec 29 10:21:18 2025] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Sun Dec 29 10:21:18 2025] ata7.00: configured for UDMA/133
cr0x@server:~$ # in a second terminal, drive controlled reads against the suspect disk:
cr0x@server:~$ sudo dd if=/dev/sde of=/dev/null bs=4M status=progress iflag=direct
12582912000 bytes (13 GB, 12 GiB) copied, 7 s, 1.8 GB/s

Meaning: The link came back at 3.0 Gbps instead of 6.0. That downshift is classic “signal integrity got worse under stress.”

Decision: Mark the bay/port path as suspect. Plan a swap of cable/backplane connection for that port. Don’t waste time tuning I/O schedulers.

Task 6: Identify the libata “ata7” mapping in sysfs

cr0x@server:~$ readlink -f /sys/class/block/sde/device
/sys/devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:6/6:0:6:0

Meaning: You can tie the block device to a specific ATA port (ata7) and host.

Decision: Use this path to find the controller PCI device and, from there, the slot/HBA/backplane chain.
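
To run the same mapping for every disk at once, you can walk sysfs. This is a sketch that assumes AHCI-style paths where an ataX component appears in the device link; SAS HBAs enumerate differently and won't show one.

# For each block device, print its ataX port (if any) and the PCI address of its controller.
for dev in /sys/class/block/sd?; do
  path=$(readlink -f "$dev/device")
  ata=$(echo "$path" | grep -Eo 'ata[0-9]+' | head -n1)
  pci=$(echo "$path" | grep -Eo '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | tail -n1)
  echo "$(basename "$dev")  port=${ata:-none}  pci=${pci:-unknown}"
done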

Task 7: Confirm the controller PCI identity and driver

cr0x@server:~$ lspci -nnk | egrep -A3 -i "sata|ahci|scsi|sas"
00:1f.2 SATA controller [0106]: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller [8086:3a22]
	Subsystem: Supermicro Computer Inc Device [15d9:0801]
	Kernel driver in use: ahci
	Kernel modules: ahci

Meaning: You’re on AHCI (common onboard SATA). If this were an HBA, you’d see mpt3sas, megaraid_sas, etc.

Decision: If this is onboard SATA with a cheap backplane cable run, assume physical fragility until proven otherwise.

Task 8: Verify link speed and negotiated features (where available)

cr0x@server:~$ sudo grep -H . /sys/class/ata_link/link*/sata_spd 2>/dev/null | head
/sys/class/ata_link/link6/sata_spd: 6.0 Gbps
/sys/class/ata_link/link7/sata_spd: 3.0 Gbps

Meaning: link7 (which matches the problem port, ata7) is sitting at 3.0 Gbps. If the disk and controller both support 6.0, this is consistent with a negotiated downshift after errors.

Decision: Treat sustained downshift on the problem port as evidence. Include it in your incident notes; it’s persuasive to non-kernel folks.
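
Where the attribute is present, you can also compare the current speed against the limit libata has clamped the port to; a limit below the hardware maximum usually means error recovery already downshifted it. A small sketch:

for link in /sys/class/ata_link/link*; do
  printf '%s  current=%s  limit=%s\n' \
    "$(basename "$link")" \
    "$(cat "$link/sata_spd" 2>/dev/null)" \
    "$(cat "$link/sata_spd_limit" 2>/dev/null)"
done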

Task 9: Inspect per-disk error counters and kernel “timeout” patterns

cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "failed command|READ FPDMA|WRITE FPDMA|cmd|timeout|resetting"
Dec 29 10:21:16 server kernel: ata7.00: failed command: READ FPDMA QUEUED
Dec 29 10:21:16 server kernel: ata7.00: cmd 60/08:00:00:10:6e/00:00:00:00:00/40 tag 0 ncq dma 4096 in
Dec 29 10:21:16 server kernel: ata7.00: status: { DRDY }
Dec 29 10:21:16 server kernel: ata7: softreset failed (1st FIS failed)
Dec 29 10:21:17 server kernel: ata7: hard resetting link

Meaning: “FIS failed” and command timeouts with resets are consistent with transport failure, not “the filesystem is angry.”

Decision: Proceed to isolate by moving the drive to a different bay or swapping the cable. Software-only actions won’t stop physics.

Task 10: Check if the drive is actually being power-cycled (power path/backplane)

cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "rejecting i/o|device offline|Spinning|Start/Stop|Power-on|reset"
Dec 29 10:21:19 server kernel: sd 6:0:6:0: [sde] rejecting I/O to offline device
Dec 29 10:21:19 server kernel: sd 6:0:6:0: [sde] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00

Meaning: “offline device” can happen from link loss, but if accompanied by repeated “new disk” discovery messages, suspect power interruptions or backplane behavior.

Decision: If you see remove/add cycles, prioritize power/backplane connectors and PSU rail stability. Link resets plus device disappearance is a louder alarm.

Task 11: Verify drive identity is stable across resets (same WWN comes back)

cr0x@server:~$ udevadm info --query=all --name=/dev/sde | egrep "ID_WWN=|ID_SERIAL=|DEVPATH="
E: DEVPATH=/devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:6/6:0:6:0/block/sde
E: ID_SERIAL=ST12000NM0008-2A_ZS2ABC12
E: ID_WWN=0x5000c500cafe1234

Meaning: You can confirm the same identity returns after a reset. If identity changes or disappears, you may have a flaky backplane multiplexer or a controller issue.

Decision: Stable identity + rising CRC errors points to physical link quality rather than drive firmware weirdness.
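
If you want a durable record that the same identity keeps coming back across resets, log it with timestamps. A sketch; the log path /var/tmp/sde-identity.log is just an example, and a "<missing>" entry or a changed WWN is the part you care about:

# Run as root; appends one identity line for sde per minute.
while true; do
  wwn=$(udevadm info --query=property --name=/dev/sde 2>/dev/null \
        | awk -F= '$1=="ID_WWN"{print $2}')
  echo "$(date --iso-8601=seconds) sde wwn=${wwn:-<missing>}"
  sleep 60
done >> /var/tmp/sde-identity.log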

Task 12: If you run mdadm, confirm whether the array is dropping the same member repeatedly

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5]
md0 : active raid10 sde1[2](F) sdb1[0] sdc1[1] sdd1[3]
      23436134400 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

Meaning: (F) indicates a failed member. If the same slot keeps failing, it’s rarely “Linux RAID is flaky”; it’s the underlying link.

Decision: Do not keep re-adding the same disk without addressing the transport. You’ll burn rebuild cycles and increase risk for a second fault.
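
Once the transport is actually fixed, putting the member back is the easy part. A sketch, assuming the failed member is /dev/sde1 in /dev/md0 as in the sample above:

# Remove the faulted member, re-add it, then watch the rebuild progress.
sudo mdadm /dev/md0 --remove /dev/sde1
sudo mdadm /dev/md0 --re-add /dev/sde1   # if refused (no write-intent bitmap), use --add and accept a full rebuild
watch -n 5 cat /proc/mdstat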

Task 13: If you run ZFS, see whether errors are checksum or I/O-level, and which vdev

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
  scan: resilver in progress since Sun Dec 29 10:30:01 2025
config:

	NAME                        STATE     READ WRITE CKSUM
	tank                        DEGRADED     0     0     0
	  raidz2-0                  DEGRADED     0     0     0
	    wwn-0x5000c500cafe1234  DEGRADED     5     0     0  too many errors
	    wwn-0x5000c500beef5678  ONLINE       0     0     0
	    wwn-0x5000c500abcd9999  ONLINE       0     0     0
	    wwn-0x5000c5001234aaaa  ONLINE       0     0     0

Meaning: ZFS shows READ errors on a specific WWN. That’s consistent with link resets causing I/O failures (not necessarily checksum mismatches).

Decision: If CKSUM is zero and READ is non-zero alongside libata resets, it’s probably not silent corruption—it’s failed reads due to link drops. Fix the transport first, then clear/replace.
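
After the physical fix, the ZFS verification flow is short. A sketch using the pool and WWN from the sample output; run the scrub in a quiet window:

# Clear the error counters on the affected vdev, then scrub and re-check.
sudo zpool clear tank wwn-0x5000c500cafe1234
sudo zpool scrub tank
sudo zpool status -v tank    # READ/WRITE/CKSUM should stay at zero this time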

Task 14: Check drive temperature and error log to rule out overheating-induced instability

cr0x@server:~$ sudo smartctl -a /dev/sde | egrep -i "Temperature|Error Log|ATA Error Count"
194 Temperature_Celsius     0x0022   037   045   000    Old_age   Always       -       63
SMART Error Log Version: 1
ATA Error Count: 0

Meaning: 63°C is hot for many enterprise SATA drives. It may still “work,” but signal margins shrink with heat and backplanes deform.

Decision: Improve cooling and re-test. If the errors only happen hot, don’t declare victory after a cool-room test.

Mapping /dev/sdX to a physical port without guessing

The fastest way to lose credibility is to pull the wrong drive. The second-fastest is to “reseat cables” generically and call it done. Map it.

Use stable identifiers: WWN, serial, HCTL, and ataX

Start with WWN/serial from lsblk or udevadm. Then connect those to:

  • HCTL (Host:Channel:Target:Lun) for SCSI-style enumeration (common for HBAs)
  • ataX for libata port numbering (common for AHCI and many SATA paths)
  • PCI path to find the actual controller

On many servers, the missing link is physical bay labeling. The correct solution is not “tribal knowledge.” It’s documentation: a port map from HBA port → backplane connector → bay numbers.

Practical mapping approach when you don’t have a port map

  1. Identify the failing disk by WWN/serial.
  2. Find its DEVPATH and ataX association.
  3. Correlate ataX to the motherboard/HBA connector by checking cabling diagrams, chassis manual, or physically tracing during a window.
  4. Validate by moving only that drive to another bay and seeing if the errors follow the bay.

That last step matters because it converts “educated suspicion” into evidence.
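
If no port map exists yet, you can at least generate the software half of it and fill in the physical half during the next window. A sketch; bay numbers still have to come from the chassis manual or a person with a flashlight.

# One line per disk: device, serial, WWN, ataX port. Paste into the runbook
# and add the backplane connector and bay columns by hand.
for dev in /dev/sd?; do
  serial=$(sudo udevadm info --query=property --name="$dev" | awk -F= '$1=="ID_SERIAL_SHORT"{print $2}')
  wwn=$(sudo udevadm info --query=property --name="$dev" | awk -F= '$1=="ID_WWN"{print $2}')
  path=$(readlink -f "/sys/class/block/$(basename "$dev")/device")
  ata=$(echo "$path" | grep -Eo 'ata[0-9]+' | head -n1)
  echo "$dev  serial=${serial:-?}  wwn=${wwn:-?}  port=${ata:-?}"
done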

Hardware patterns: how cables and backplanes incriminate themselves

When people blame Linux, they’re usually reacting to correlation: “we upgraded Debian, then disks started resetting.” That correlation is real. It just doesn’t mean what they think.

Here are the patterns that reliably separate fault domains.

Cable fault patterns

  • CRC errors climb (SMART attribute 199) while reallocated/pending sectors stay flat.
  • Errors correlate with vibration or movement: bumping the chassis, fan ramp, nearby cable bundle movement.
  • Problem follows the cable/backplane connection: moving the drive to another bay fixes it without replacing the drive.
  • Link speed downshifts to 3.0/1.5 Gbps on that port after resets.

What to do: replace the cable with known-good, ideally shorter and better shielded; avoid sharp bends; re-route away from power bundles when possible. If it’s a Mini-SAS to SATA breakout, treat it as a consumable.

Backplane fault patterns

  • Only one bay is cursed, regardless of drive model or serial.
  • Drive presence flaps: the OS logs show the disk disappearing and reappearing.
  • Multiple drives on same backplane segment show errors after a temperature rise or vibration event.
  • LED/SGPIO weirdness: activity LEDs stuck or wrong bay blinking can correlate with signal or grounding issues on cheap backplanes.

What to do: move the drive to a different bay as a test; if the bay is guilty, replace the backplane or stop using that slot. “Cleaning contacts” is not a plan unless you also control recurrence with replacement.

Power path fault patterns

  • Simultaneous resets across multiple ports, often after a load step (CPU spikes, fans ramp, disks spin up).
  • Disk resets coincide with PSU alarms or BMC voltage dips.
  • Disks log power-related events (depends on model; often not explicit).

What to do: check PSU health, power cabling to backplane, and whether staggered spin-up is configured. A “good” PSU can still be bad at transient response.

Controller/firmware fault patterns

  • Errors spread across many ports with no single bay correlation.
  • Resets triggered by specific command patterns (TRIM, NCQ depth, queued writes) and disappear with driver/firmware updates.
  • Kernel logs show controller resets, not only link resets.

What to do: check firmware levels; test with a different HBA if possible. But don’t use “controller maybe” as an excuse to ignore CRC evidence.

Drive fault patterns

  • Reallocated/pending sectors increase, SMART health fails, self-tests fail.
  • Errors follow the drive to a known-good bay and cable.
  • SMART error log shows read/write failures without matching CRC growth.

What to do: replace the drive, and don’t argue with the numbers.

Joke #2: If your storage “only fails under load,” congratulations—you’ve built a system that’s perfectly reliable at being idle.

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption: “It’s the new Debian kernel”

The environment was ordinary: a pair of Debian boxes running a replicated database, each with a handful of SATA SSDs in a front hot-swap chassis. A routine upgrade landed a newer kernel, and within a day one node started throwing ata resets. The team did what teams do: rolled back the kernel.

The errors reduced, but didn’t stop. That was the first clue. The second clue was nastier: the other node began showing occasional resets too. Now the narrative changed from “kernel regression” to “maybe the SSD firmware hates Linux.” A vendor ticket was opened. Time was spent gathering kernel traces that told everyone what they already knew: the link was being reset.

What finally broke the loop was a disciplined correlation. An SRE compared the UDMA_CRC_Error_Count for every disk in the chassis. Only disks in bays 5–8 had rising CRC counts. Those bays shared one backplane connector and one breakout cable run. The upgrade wasn’t the cause; it was the moment the system got busy enough to expose a marginal connection.

The “fix” was boring: replace the breakout cable and re-seat the backplane connector. Then re-run the same load that previously triggered resets. The errors didn’t return. The kernel stayed upgraded. The postmortem was uncomfortable because the wrong assumption wasn’t dumb; it was plausible. But plausibility is not proof, and production doesn’t pay you for plausible.

2) The optimization that backfired: “Let’s crank queue depth and save money”

A storage-heavy analytics fleet was being tuned for throughput. Someone had read that deeper queues and more parallelism improves utilization. They increased I/O concurrency in the application and also adjusted some block layer settings to keep disks busy. Initial benchmarks looked great. Everyone congratulated themselves and moved on.

Two weeks later, a subset of hosts began seeing sporadic READ FPDMA QUEUED failures and link resets. The failures clustered during batch windows. The team treated it as a workload problem and spread jobs out, which helped. That was the trap: the system became stable only when it was less useful.

The root cause turned out to be physical. The chassis used long SATA cables routed alongside high-current power harnesses. At low I/O, the link margins were barely okay. At high queue depth, the disks and controller were doing more work, the thermal profile shifted, and the link error rate rose—until libata started resetting ports. Nothing “mystical” happened; they moved a system from “works in the lab” to “fails at scale” by leaning on thin signal margins.

The corrective action wasn’t to abandon performance tuning. It was to treat cabling as part of the performance envelope: shorter cables, cleaner routing, and in a few cases moving high-duty workloads to SAS HBAs with better connectors. After that, the same tuning no longer triggered resets. Optimization wasn’t the villain. Optimizing on top of shaky hardware was.

3) The boring but correct practice that saved the day: a real port map and consistent identifiers

A different org ran mixed storage: some mdadm, some ZFS, lots of hot-swap bays. They had a habit that looked almost old-fashioned: every chassis had a port map in the runbook. It listed HBA port numbers, cable part numbers, backplane connector labels, and the bay range each connector served. They also recorded disk WWNs in inventory when disks were installed.

One afternoon, a host started logging link resets on one disk. The on-call followed the playbook: identify the WWN, map to HBA port, map to bay. They didn’t pull random drives. They didn’t reboot “to clear it.” They didn’t downgrade a kernel. They moved the disk to a spare bay on the same host and repeated the read test. The errors stayed with the original bay.

With that, the remediation was precise: the backplane connector for that bay group was replaced during a planned window. The disk stayed in service and never showed media errors. The ticket write-up was short and convincing because they had the receipts: CRC errors, bay correlation, and a successful isolation swap.

This is the part people dislike: the “heroic debugging” was unnecessary because the boring groundwork existed. Documentation didn’t make anyone feel smart, but it made the system cheaper to operate. That’s the job.

Common mistakes: symptoms → root cause → fix

This is the stuff that keeps incidents alive longer than they deserve.

1) Symptom: repeated hard resetting link + rising CRC errors

Root cause: marginal SATA signal path (cable, connector, backplane).
Fix: replace cable/backplane path; re-route; avoid tight bends; retest under load; verify CRC counter stops increasing.

2) Symptom: disk drops out entirely and comes back as “new”

Root cause: power interruption to the bay/backplane, or backplane presence detection flapping.
Fix: inspect/replace backplane power connectors; check PSU health; check for shared power harness strain; validate BMC logs.

3) Symptom: errors “move” after you swap drives

Root cause: you reseated or disturbed the failing cable/connector, temporarily improving contact.
Fix: stop calling it fixed; run a controlled load test and watch CRC counters; replace the suspect cable anyway.

4) Symptom: link negotiates down to 1.5/3.0 Gbps after resets

Root cause: error recovery downshifting due to poor signal integrity.
Fix: treat as physical layer problem; replace cable/backplane; confirm it stays at 6.0 Gbps under sustained load.

5) Symptom: “Linux I/O scheduler change fixed it”

Root cause: you reduced I/O intensity, hiding the problem.
Fix: keep the scheduler if it helps, but still replace the marginal link. Hidden faults come back when you need performance most.

6) Symptom: ZFS/mdadm rebuilds keep restarting on same disk

Root cause: unstable transport causing transient I/O failures during heavy sequential reads/writes.
Fix: fix the link first; then rebuild once. Rebuild storms are how you turn one flaky path into an outage.

7) Symptom: kernel upgrade “triggered” the issue

Root cause: workload, timing, or power profile changed enough to expose a marginal path; kernel is a catalyst, not the cause.
Fix: keep the upgrade unless you can reproduce on old kernel and prove regression; focus on physical evidence (CRC, bay correlation).

Checklists / step-by-step plan

Incident response checklist (while the system is degraded)

  1. Capture logs: grab the journalctl -k output for the window around the event and save it somewhere durable.
  2. Record the disk identity: model, serial, WWN, HCTL, and ataX mapping.
  3. Check SMART for media vs CRC indicators. If CRC is rising, elevate cabling/backplane.
  4. Reduce blast radius: pause non-essential heavy jobs; avoid repeated rebuild attempts.
  5. If redundancy exists, plan a controlled isolation swap (move drive to another bay).
  6. After physical change, run the same read test and watch logs live.
  7. Verify counters: CRC should stop increasing; link should remain at expected speed.

Maintenance window plan (what to replace and in what order)

  1. Replace the SATA cable/breakout serving the failing port/bay group.
  2. If failures persist and follow the bay, replace/repair the backplane segment or stop using the bay.
  3. Validate power: reseat power connectors to the backplane; check PSU status; verify no brownouts under load.
  4. Only then consider controller firmware updates or swapping the HBA/motherboard SATA.
  5. Run a sustained read/write burn-in on the affected disks after changes (see the read-pass sketch below).
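
For the burn-in step, a read pass wrapped in a before/after CRC snapshot gives you a pass/fail number instead of a feeling. A sketch; it reads the entire disk, so size the window accordingly and keep it away from arrays whose redundancy isn't healthy.

# Read the whole device and compare UDMA CRC counts before and after.
DEV=/dev/sde
before=$(sudo smartctl -A "$DEV" | awk '$1==199 {print $NF}')
sudo dd if="$DEV" of=/dev/null bs=4M iflag=direct status=progress
after=$(sudo smartctl -A "$DEV" | awk '$1==199 {print $NF}')
echo "CRC before=$before after=$after delta=$((after - before))"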

Post-incident hardening checklist (prevent the sequel)

  1. Create a port map: HBA port → cable label → backplane connector → bay numbers.
  2. Standardize on WWN-based naming in ZFS/mdadm configs where possible.
  3. Track SMART CRC counters over time; alert on deltas, not just absolute thresholds (see the cron sketch after this list).
  4. Keep spare known-good cables and (if possible) a spare backplane on hand.
  5. Document an isolation test procedure so the next on-call doesn’t improvise.
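
For item 3, a state file and a cron job are enough to alert on deltas. A sketch, not a finished tool: the script path /usr/local/sbin/crc-delta-check and the state directory are placeholders, and the final echo should feed whatever alerting you already have. Run it from root's crontab.

#!/bin/sh
# /usr/local/sbin/crc-delta-check (placeholder path): compare the current
# UDMA_CRC_Error_Count per disk against the value from the previous run.
STATE=/var/lib/crc-delta          # placeholder state directory
mkdir -p "$STATE"
for dev in /dev/sd?; do
  name=$(basename "$dev")
  now=$(smartctl -A "$dev" | awk '$1==199 {print $NF}')
  [ -n "$now" ] || continue
  prev=$(cat "$STATE/$name" 2>/dev/null || echo "$now")
  [ "$now" -gt "$prev" ] && echo "ALERT: $dev CRC errors rose from $prev to $now"
  echo "$now" > "$STATE/$name"
done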

FAQ

1) Does a SATA link reset always mean the drive is bad?

No. Often the drive is fine and the link is not. If SMART shows rising UDMA_CRC_Error_Count with clean media attributes, suspect cable/backplane first.

2) If SMART overall-health says PASSED, can the disk still be the problem?

Yes. SMART “PASSED” is not a guarantee; it’s a weak threshold check. But for link reset cases, the discriminator is usually CRC vs reallocated/pending sectors.

3) Why did this start right after upgrading Debian 13?

Because timing, power management, or workload shifted. A kernel upgrade can change I/O patterns and expose marginal hardware. Treat the upgrade as a trigger, not proof of causality.

4) Can bad power cause CRC errors?

Indirectly. Power instability can cause link drops and device resets, and the resulting behavior can look like a transport problem. If disks disappear and reappear, investigate power/backplane presence circuits.

5) Is it safe to keep running if CRC errors are non-zero but stable?

If the CRC count is old and not increasing, you might have a historical event (a cable was bumped once). If it’s increasing, you are actively losing link integrity and should fix it promptly.

6) How do I prove it’s the backplane and not the disk?

Move the same disk (same WWN/serial) to a different bay/cable path. If the errors stop, the old bay/path is guilty. If the errors follow the disk, the disk is guilty.

7) Why do I see link speed drop to 3.0 Gbps?

After errors, SATA can renegotiate at a lower speed to maintain stability. That’s a sign of marginal signal quality. It’s not a “performance feature.”

8) Should I change kernel parameters or libata settings to reduce resets?

Avoid using kernel parameters as a bandage unless you’re diagnosing a confirmed driver/controller issue. If physical evidence points to the link, changing timeouts just changes how long you wait before failing.

9) What if multiple disks across different bays show CRC increases?

Suspect something shared: a cable bundle, a backplane connector feeding multiple bays, an HBA port group, or power. Look for commonality, not coincidence.

10) Do I need SAS to avoid this?

Not strictly, but SAS connectors and cabling standards are generally more robust in server environments. SATA can be reliable; it just demands better discipline in the physical layer than many chassis provide.

Conclusion: next steps that actually reduce risk

If Debian 13 is logging SATA link resets, treat it like an investigation, not a debate. Linux is telling you what it sees. Your job is to determine whether the problem lives on the platters, in the controller, or in the copper and plastic that everyone forgets exists.

Do these next, in order:

  1. Pull a tight kernel log timeline and capture the evidence.
  2. Check SMART: CRC vs media attributes. If CRC is moving, stop blaming filesystems.
  3. Map the failing device to ataX/HCTL and a physical bay.
  4. Run a controlled read test while watching dmesg live.
  5. Isolate with a swap that teaches you something: move the drive to a different bay or replace the cable path.
  6. After the fix, verify: link speed stable, CRC no longer increasing, and no resets under load.

If you only take one operational lesson: write the port map and track CRC deltas. It’s not glamorous, but it turns a spooky “Linux storage issue” into a simple parts replacement with receipts.
