You’re pushing I/O—database checkpoint, backup window, a compile farm, whatever—and suddenly the NVMe just… leaves.
One moment it’s /dev/nvme0n1, the next it’s an archaeological mystery. Your filesystem flips read-only, your RAID/ZFS screams,
and your monitoring lights up like a holiday tree you didn’t ask for.
On Ubuntu 24.04, that “NVMe disappears under load” pattern is often not a dying SSD. It’s power management: PCIe ASPM on the link,
and/or NVMe APST inside the controller. Both can be fine in theory and catastrophic in practice when a platform’s firmware, PCIe topology,
or drive firmware gets creative. This is the pragmatic, production-sane way to diagnose it fast, fix it safely, and prove you actually fixed it.
What “NVMe disappears” actually means (and why it’s rarely magic)
When operators say “the NVMe disappeared,” they usually mean one of three things:
- The block device vanishes: /dev/nvme0n1 is gone, udev removed it, and anything mounted on it is now a bad day.
- The device stays, but I/O stalls: it's still listed, but reads/writes hang, time out, then the kernel escalates to a controller reset.
- The PCIe function goes nonresponsive: the NVMe controller is present on the bus, but the link is flapping or AER (Advanced Error Reporting) is shouting.
Under load is the key clue. High queue depth, sustained writes, heavy DMA, and thermal stress all amplify timing sensitivities.
That’s when power-state transitions that “usually work” start losing races.
If this is happening on Ubuntu 24.04, you’re likely on a fairly modern kernel and NVMe stack. That’s good for features.
It’s also good for discovering that your platform’s power management was designed by a committee whose job ended at “boots Windows.”
Facts & historical context that explain the mess
- ASPM predates NVMe: PCIe Active State Power Management was designed for link power savings long before NVMe became the default for storage.
- NVMe APST exists because idle power mattered: Autonomous Power State Transitions were built to let controllers save power without OS babysitting.
- AER became mainstream as PCIe got faster: higher link speeds and tighter margins made error reporting and recovery more critical—and more visible in logs.
- Consumer NVMe firmware often optimizes for benchmarks: fast burst performance can hide ugly corner cases under sustained, mixed workloads.
- Laptops drove aggressive power defaults: many platform vendors bias toward battery life, sometimes shipping firmware that’s “optimistic” about link stability.
- Linux got better at runtime PM over time: but “better” also means “more active,” which increases the odds of hitting platform-specific bugs.
- PCIe Gen4/Gen5 made signal integrity less forgiving: the faster you go, the more a marginal board layout or retimer can show up as “random device drop.”
- NVMe has multiple timeout layers: controller timeouts, blk-mq timeouts, filesystem timeouts, and higher-level retries can mask the real failure point.
- Ubuntu 24.04 runs systemd with aggressive power tooling available: user-space can influence runtime PM and policy, which sometimes “helps” until it hurts.
The operational takeaway: you’re dealing with a layered power stack (platform firmware, PCIe link policy, NVMe controller policy, Linux runtime PM).
If one layer lies, the others politely believe it—right until your SSD ghosts you mid-checkpoint.
Fast diagnosis playbook (check 1–2–3)
When you’re on-call, you don’t have time for interpretive dance. Do this in order.
1) Confirm it’s really a device/link problem, not a filesystem illusion
- Look for nvme timeouts and controller resets in the kernel log.
- Look for PCIe AER errors from the upstream port (pcieport).
- Confirm whether the namespace device is gone or just wedged.
2) Identify the power features in play
- Is PCIe ASPM enabled on the system?
- Is NVMe APST enabled (and what are the idle latencies)?
- Is runtime PM set to auto for the NVMe PCI device?
3) Reproduce safely and decide: stabilize first, optimize later
- Run a controlled I/O load and watch for error patterns.
- Apply the smallest stabilizing change (often disabling ASPM and/or APST).
- Re-run the same load and compare logs, not vibes.
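If you want that first pass as commands, the whole triage is read-only and takes under a minute. The PCI address below (0000:02:00.0) is a placeholder; substitute the one lspci reports for your controller (Task 6 shows how):
cr0x@server:~$ sudo journalctl -k -b | egrep -i 'nvme|pcieport|aer|timeout|reset' | tail -n 60
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL | grep -i nvme
cr0x@server:~$ cat /sys/module/pcie_aspm/parameters/policy
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
cr0x@server:~$ cat /sys/bus/pci/devices/0000:02:00.0/power/control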
Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails; design so failure is expected and recovery is routine.”
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I actually run when an NVMe is misbehaving under load. Each includes (a) the command, (b) what typical output means,
and (c) the decision you make from it.
Task 1: Find NVMe-related kernel errors fast
cr0x@server:~$ sudo journalctl -k -b | egrep -i 'nvme|pcie|aer|timeout|reset' | tail -n 60
Dec 29 09:11:22 server kernel: nvme nvme0: I/O 123 QID 6 timeout, aborting
Dec 29 09:11:22 server kernel: nvme nvme0: Abort status: 0x371
Dec 29 09:11:22 server kernel: nvme nvme0: controller is down; will reset: CSTS=0x1
Dec 29 09:11:23 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:02:00.0
Dec 29 09:11:23 server kernel: pcieport 0000:00:1c.0: AER: device recovery successful
What it means: NVMe I/O timeouts plus controller resets often indicate link/power instability or firmware/controller hangs under load.
AER corrected errors on the upstream port are a strong hint the PCIe link is glitching.
Decision: If you see timeouts/resets with AER noise, prioritize ASPM/link power changes before blaming the filesystem.
Task 2: Check if the NVMe namespace device is actually gone
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL,MOUNTPOINTS | egrep -i 'nvme|NAME'
NAME TYPE SIZE MODEL SERIAL MOUNTPOINTS
nvme0n1 disk 931.5G ExampleNVMe 1TB ABCD1234
├─nvme0n1p1 part 512M /boot
└─nvme0n1p2 part 931G /
What it means: If nvme0n1 disappears here after the incident, udev removed it because the controller fell off the bus or failed hard.
If it stays present but I/O hangs, the controller might be wedged but still enumerated.
Decision: “Gone from lsblk” pushes you toward PCIe/link issues or full controller resets; “present but stuck” can still be power-state related, but also thermal/firmware.
Task 3: Inspect NVMe controller state via nvme-cli
cr0x@server:~$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 ABCD1234 ExampleNVMe 1TB 1 120.0 GB / 1.00 TB 512 B + 0 B 3B2QEXM7
What it means: If nvme list intermittently fails or shows no devices when the problem occurs, the controller is dropping off.
Decision: If the controller drops, focus on PCIe power management and platform firmware first; SMART won’t save you if the bus is gone.
Task 4: Pull SMART/health and look for media vs transport symptoms
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 54 C
available_spare : 100%
percentage_used : 2%
media_errors : 0
num_err_log_entries : 18
power_cycles : 23
power_on_hours : 410
unsafe_shutdowns : 6
What it means: Media errors at zero with error log entries incrementing often points away from NAND failure and toward transport/controller resets.
Unsafe shutdowns creeping upward is consistent with “controller vanished” events.
Decision: If media errors are low/zero but resets/timeouts are common, don’t RMA immediately; stabilize power/link behavior and retest.
Task 5: Read the NVMe error log to confirm timeout/reset patterns
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 20
Error Log Entries for device:nvme0 entries:64
Entry[ 0]
error_count : 18
sqid : 6
cmdid : 0x00a2
status_field : 0x4004
parm_error_location: 0x0000
lba : 0
nsid : 1
vs : 0x00000000
What it means: Repeated errors on the same queue under load can align with controller resets and command aborts. This is usually “the controller got upset,” not “bits are bad.”
Decision: Correlate error_count jumps with kernel resets. If they line up, treat it as stability/power-state issue.
Task 6: Identify the PCIe device path (so you can inspect the upstream port)
cr0x@server:~$ sudo lspci -nn | egrep -i 'non-volatile|nvme'
02:00.0 Non-Volatile memory controller [0108]: Example Corp NVMe Controller [1234:11aa] (rev 01)
What it means: Your NVMe controller is at 0000:02:00.0. That address becomes your anchor for AER, ASPM, runtime PM, and link status.
Decision: Use this BDF to query lspci -vv, sysfs power control, and to find the upstream port that might be generating AER.
Task 7: Check PCIe link status and ASPM state
cr0x@server:~$ sudo lspci -s 02:00.0 -vv | egrep -i 'LnkCap:|LnkSta:|ASPM|L1Sub|AER' -n
45: LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
52: LnkSta: Speed 16GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
60: L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
What it means: LnkCap lists the ASPM states the hardware supports; check the LnkCtl line in the full output for which states are actually enabled, and L1SubCap/L1SubCtl for the L1.1/L1.2 substates. If L1 or its substates are enabled, the link can enter low-power states.
That’s not inherently wrong, but it’s a prime suspect when devices vanish under load or during bursty traffic.
Decision: If you have dropouts and ASPM is enabled, testing with ASPM disabled is usually worth it. Stabilize first.
Task 8: Confirm whether the kernel thinks ASPM is enabled globally
cr0x@server:~$ cat /sys/module/pcie_aspm/parameters/policy
default
What it means: Policies like default, powersave, or performance influence how aggressively ASPM is used.
Some systems run stable on default; others treat it like a suggestion to misbehave.
Decision: If you’re chasing disappearances, plan a test boot with pcie_aspm=off. If stability returns, you’ve found your villain.
Task 9: Check NVMe APST status (controller autonomous power transitions)
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000
What it means: This is the maximum power-state entry/exit latency (in microseconds) the driver will tolerate when it programs APST. 100000 (100 ms) is the kernel default and permits the deepest states most drives advertise; a value of 0 disables APST entirely.
On some drives, deep idle transitions can interact badly with bursts and resets.
Decision: For stability testing, cap the latency at a conservative value (or set it to 0 to disable APST) using a kernel parameter. Don't guess; test.
Task 10: Inspect runtime power management state for the NVMe PCI function
cr0x@server:~$ cat /sys/bus/pci/devices/0000:02:00.0/power/control
auto
What it means: auto allows the kernel to runtime-suspend the device when it thinks it’s idle.
On flaky platforms, runtime PM can be the extra shove that triggers link/controller weirdness.
Decision: If you’re seeing resets, test with runtime PM forced to on (no autosuspend) for the NVMe device.
Task 11: Check for NVMe timeouts configured by the kernel
cr0x@server:~$ cat /sys/module/nvme_core/parameters/io_timeout
30
What it means: That’s the I/O timeout in seconds. Increasing it can mask symptoms (you’ll “wait longer” before reset),
but it does not fix the cause if the controller is dropping.
Decision: Don’t “fix” disappearances by inflating timeouts unless you know the device is just slow under GC. Disappearances are not latency problems.
Task 12: Stress test I/O in a way that reproduces the issue without destroying production
cr0x@server:~$ sudo fio --name=nvme-load --filename=/dev/nvme0n1 --direct=1 --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 --runtime=120 --time_based=1 --group_reporting
nvme-load: (groupid=0, jobs=4): err= 0: pid=9123: Mon Dec 29 09:15:01 2025
write: IOPS=82.1k, BW=321MiB/s (337MB/s)(38.6GiB/123002msec)
lat (usec): min=48, max=21234, avg=311.42, stdev=145.11
What it means: You want a repeatable workload that triggers the issue. If the device disappears during this test, you have a reproducer.
Note that randwrite against the raw namespace destroys whatever is on it; run this form only on a scratch device, otherwise point --filename at a dedicated test file or partition.
If it only disappears under a mixed read/write or sequential pattern, adjust accordingly.
Decision: Never tune power settings blind. Reproduce, change one variable, rerun, compare logs.
Task 13: Watch for AER events live while stressing
cr0x@server:~$ sudo dmesg -wT | egrep -i 'aer|pcieport|nvme|timeout|reset'
[Mon Dec 29 09:16:22 2025] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:02:00.0
[Mon Dec 29 09:16:22 2025] nvme nvme0: I/O 87 QID 3 timeout, aborting
[Mon Dec 29 09:16:23 2025] nvme nvme0: controller is down; will reset: CSTS=0x1
What it means: The sequence “AER corrected error” → “NVMe timeout” → “controller reset” is a familiar plot.
Decision: Treat AER as evidence. If disabling ASPM stops AER spam and timeouts, you’ve likely fixed the root cause.
Task 14: Check upstream PCIe port errors (not just the NVMe endpoint)
cr0x@server:~$ sudo lspci -s 00:1c.0 -vv | egrep -i 'AER|UESta|CESta|Err' -n
78: Capabilities: [100 v2] Advanced Error Reporting
96: UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
104: CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
What it means: Corrected errors like RxErr+ indicate link-level signal issues or marginal power states.
They’re “corrected” until they aren’t.
Decision: If corrected errors spike during load, stop chasing filesystem tuning. Go after link power management and platform stability.
Task 15: Verify the drive isn’t thermal-throttling into instability
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i 'temperature|warning|critical'
temperature : 73 C
critical_warning : 0x00
What it means: Temperature alone doesn’t prove causality, but “disappears under load” plus high temps can push marginal controllers over the edge.
Decision: If temps are high, improve cooling and retest before you start rewriting boot parameters. Hardware reality beats software optimism.
Power stack: ASPM vs APST vs runtime PM (who can knock your disk offline)
There are three knobs people conflate. Don’t. They’re different layers, and you can change them independently.
PCIe ASPM (link-level power saving)
ASPM controls the PCIe link’s low-power states (not the NVMe controller’s internal states). The big players:
L0s and L1 (and L1 substates like L1.1/L1.2 on newer platforms).
In a perfect world, the endpoint and root port negotiate, enter low-power states when idle, and exit quickly when traffic resumes.
In the world you and I live in, exit latency promises are sometimes… aspirational. Under load, you can get transitions at awkward times:
bursty I/O, MSI/MSI-X interrupts, power gating events, and the occasional firmware bug that says “sure, I’m awake” while it is very much not.
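If you want to inspect or change ASPM per link rather than globally, recent kernels (including Ubuntu 24.04's 6.8) expose per-device controls under sysfs, but only when the OS owns ASPM and only for states the link supports, so treat this as a sketch. The address is the placeholder from the earlier tasks:
cr0x@server:~$ ls /sys/bus/pci/devices/0000:02:00.0/link/ 2>/dev/null
cr0x@server:~$ grep . /sys/bus/pci/devices/0000:02:00.0/link/*_aspm 2>/dev/null
cr0x@server:~$ echo 0 | sudo tee /sys/bus/pci/devices/0000:02:00.0/link/l1_2_aspm
Writing 0 turns that one state off for this link only, which is a gentler experiment than pcie_aspm=off. If the link/ directory is missing, firmware kept ASPM control and you're back to the global boot-parameter test.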
NVMe APST (controller autonomous power states)
APST is internal to the NVMe controller. It can decide to drop into deeper power states when idle, guided by OS policy (max latency)
and the drive’s own tables. If your drive’s firmware is touchy, APST can turn “idle then busy” into “idle then controller reset.”
APST is also where “fixes” get dangerously popular. People disable it and the problem goes away. Great.
Then they ship that config everywhere and wonder why laptops lose an hour of battery. Production decisions have consequences.
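You can see what the controller advertises and what APST is actually programmed to do with nvme-cli; output formatting varies by drive and nvme-cli version, so treat this as a sketch:
cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | grep -i apsta
cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | grep -E '^ps +[0-9]'
cr0x@server:~$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
The first command tells you whether the drive supports APST at all, the second lists its power states with entry/exit latencies, and the get-feature call (feature 0x0c is Autonomous Power State Transition) shows the transition table the host actually enabled.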
Runtime PM / autosuspend (OS device-level power saving)
Runtime PM is the kernel deciding “this device is idle, let’s suspend it.” For PCI devices, that can mean D-states and link changes.
It’s a multiplier: even if ASPM and APST are okay, runtime PM can trigger more transitions.
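A quick way to see what runtime PM is currently doing to the NVMe PCI function (placeholder address again; some attributes may be absent on a given device, hence the error suppression):
cr0x@server:~$ grep . /sys/bus/pci/devices/0000:02:00.0/power/{control,runtime_status,runtime_suspended_time} 2>/dev/null
control tells you the policy (on vs auto), runtime_status tells you whether the device is active or suspended right now, and runtime_suspended_time (milliseconds) tells you whether it has actually been sleeping.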
One short joke, promised and relevant: PCIe power management is like a motion-sensor light in a hallway—great until you’re carrying something fragile and it goes dark mid-step.
Fixes that usually work: kernel params, udev, and BIOS knobs
The winning strategy is to get stability first with the least invasive change, then reintroduce power savings carefully.
Don’t start by disabling everything forever unless you enjoy explaining idle power increases to finance.
Fix option A: Disable PCIe ASPM (high success rate for “vanishes under load”)
This is the blunt instrument. It often works because it removes the link-state transition path entirely.
It can increase idle power a bit, depending on platform.
Temporary test (one boot)
At the GRUB menu, edit the boot line and add:
pcie_aspm=off
Persistent change (GRUB)
cr0x@server:~$ sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off /' /etc/default/grub
cr0x@server:~$ sudo update-grub
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-xx-generic
Found initrd image: /boot/initrd.img-6.8.0-xx-generic
done
What the output means: update-grub regenerated the boot config and found your kernel/initrd.
Decision: Reboot into the new kernel cmdline. Re-run your reproducer (fio + dmesg watch). If the dropouts stop, keep this while you decide if you want a narrower change.
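After the reboot, verify the parameter actually took effect before you credit it with anything; a minimal check looks like this:
cr0x@server:~$ cat /proc/cmdline
cr0x@server:~$ sudo dmesg | grep -i aspm
cr0x@server:~$ sudo lspci -s 02:00.0 -vv | egrep -i 'LnkCtl|ASPM'
You want pcie_aspm=off in the cmdline, typically a kernel line noting ASPM is disabled, and the LnkCtl line showing ASPM disabled on the NVMe link. Then re-run the fio reproducer and the dmesg watch from Tasks 12–13 and compare.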
Fix option B: Limit or disable NVMe APST (often fixes “timeouts then reset”)
If the controller is entering deep NVMe power states and failing to return cleanly, APST is the lever.
On Linux, the common control is nvme_core.default_ps_max_latency_us=.
Common settings
- Disable APST: nvme_core.default_ps_max_latency_us=0 turns APST off outright; the controller then stays in its operational power states unless the host explicitly changes them. Blunt, but unambiguous.
- Prevent only the deep states: set a low max latency, like 1000 (1 ms) or 5000 (5 ms), to block the deepest sleep states while still allowing mild savings. In practice, many operators land here.
The annoying truth: the right number is platform- and drive-specific. That’s not philosophy. That’s what your logs will tell you after controlled testing.
Persistent change (GRUB)
cr0x@server:~$ sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=5000 /' /etc/default/grub
cr0x@server:~$ sudo update-grub
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-xx-generic
Found initrd image: /boot/initrd.img-6.8.0-xx-generic
done
Decision: If ASPM-off fixes it, you may not need APST tuning. If ASPM changes don’t help, APST is your next suspect. Don’t change both at once unless you’re only trying to stop the bleeding and will come back to isolate later.
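As with ASPM, verify after reboot that the cap is live and that the controller's APST table reflects it (the get-feature decode depends on your nvme-cli version):
cr0x@server:~$ cat /proc/cmdline
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
cr0x@server:~$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
With a 5000 µs cap, states whose combined entry and exit latency exceeds 5 ms should no longer appear as transition targets. Then run the same reproducer and compare logs.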
Fix option C: Disable runtime autosuspend for the NVMe PCI device
For systems where runtime PM is too clever, forcing the NVMe PCI function to stay “on” can help.
This is a per-device fix, so you don’t punish the whole machine.
Immediate (until reboot)
cr0x@server:~$ echo on | sudo tee /sys/bus/pci/devices/0000:02:00.0/power/control
on
What it means: The device is now excluded from runtime autosuspend.
Decision: If stability improves, make it persistent with a udev rule.
Persistent via udev rule
cr0x@server:~$ sudo tee /etc/udev/rules.d/80-nvme-runtimepm.rules >/dev/null <<'EOF'
ACTION=="add", SUBSYSTEM=="pci", KERNELS=="0000:02:00.0", ATTR{power/control}="on"
EOF
cr0x@server:~$ sudo udevadm control --reload-rules
cr0x@server:~$ sudo udevadm trigger -s pci
What the output means: No output is fine; udev rules reloaded and re-applied. Verify with the sysfs readback.
cr0x@server:~$ cat /sys/bus/pci/devices/0000:02:00.0/power/control
on
Decision: Keep this if it solves the issue and you can tolerate the power tradeoff. It’s often milder than ASPM-off globally.
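If the PCI address might change (riser swaps, BIOS updates that renumber the bus), a variant that matches on PCI vendor/device ID is sturdier. The file name is arbitrary and the IDs below are the placeholder ones from the lspci example (1234:11aa); substitute your controller's:
cr0x@server:~$ sudo tee /etc/udev/rules.d/81-nvme-runtimepm-byid.rules >/dev/null <<'EOF'
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1234", ATTR{device}=="0x11aa", ATTR{power/control}="on"
EOF
cr0x@server:~$ sudo udevadm control --reload-rules
cr0x@server:~$ sudo udevadm trigger -s pci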
Fix option D: BIOS/UEFI settings (when software knobs aren’t enough)
If the root cause is firmware-level ASPM behavior or a board-level signal issue, software can only do so much.
Useful BIOS/UEFI toggles include:
- PCIe ASPM: disable or set to “off.” Some vendors hide it under “Power” or “Advanced > PCIe.”
- Native ASPM: turning it off can force firmware-managed behavior; turning it on can let the OS control it. Which helps depends on vendor quality. Yes, that sentence is annoying.
- PCIe Link Speed: forcing Gen3 instead of Gen4 can stabilize marginal links. This is the “I need it stable today” move.
- Global C-states / Package C-states: deep CPU sleep states can interact with chipset power gating. Disabling the deepest states is sometimes a fix.
If you end up forcing Gen3 or disabling deep C-states to keep storage online, treat it as a hardware/firmware defect you’re routing around, not “a Linux tuning.”
Fix option E: Firmware updates (NVMe + BIOS)
NVMe firmware updates can fix APST tables, reset recovery, and power-state bugs. BIOS updates can fix PCIe training and ASPM policy.
Apply them like an adult: change management, maintenance window, verified rollback options where possible.
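Before the maintenance window, it helps to know what firmware you're on and whether an update is even published. nvme-cli reads the drive's firmware slot log directly; fwupd/LVFS coverage varies by vendor, so treat the second pair of commands as a best-effort check:
cr0x@server:~$ sudo nvme fw-log /dev/nvme0
cr0x@server:~$ fwupdmgr get-devices
cr0x@server:~$ fwupdmgr get-updates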
Three corporate-world mini-stories (because reality is weird)
Mini-story 1: The incident caused by a wrong assumption
A team ran a small fleet of Ubuntu hosts doing log ingestion into a local NVMe-backed queue. The hosts were fine for weeks.
Then a new pipeline version increased burst write concurrency. At first it looked like software: queue corruption, odd retries, and sudden latency spikes.
The wrong assumption was subtle: “If SMART looks clean, the drive is fine.” Their SMART checks showed no media errors and decent spare.
So they started tuning the application, widening buffers, and adding retries. It reduced visible errors but increased the time to recover after a stall.
During a heavy burst, the kernel logged NVMe I/O timeouts, then a controller reset, then the namespace vanished. The process manager restarted services, which didn’t help.
The hosts came back only after reboot. Storage “fixed itself,” which is how you know it’s going to happen again.
The actual failure mode wasn’t NAND wear. It was PCIe link instability under aggressive power management on that motherboard revision.
Disabling ASPM stopped the vanishing act immediately. The team then tested re-enabling ASPM but limiting NVMe APST latencies. That worked on some boxes, not all.
The operational lesson they wrote down: SMART tells you about the flash and controller health, not whether the PCIe link is having a panic attack.
They added AER and NVMe reset patterns to alerting, so the next time it happened they’d catch it before the filesystem went read-only.
Mini-story 2: The optimization that backfired
A different org tried to shave power usage in a lab cluster that “might become production.” They enabled an aggressive power profile and
set kernel policies toward saving energy. The machines were mostly idle, with periodic heavy build jobs.
On paper, it was sensible: idle boxes should sip power. In practice, the cluster started dropping NVMe devices during the build bursts.
The pattern was maddening: long idle, then a sudden compile storm, then a controller reset mid-artifact write.
They chased performance for weeks. It wasn’t throughput; it was a state transition problem. The build workload had sharp on/off behavior:
nothing for minutes, then extreme metadata churn and parallel writes. That’s basically a power-state torture test.
The “optimization” was runtime PM + deep ASPM + permissive APST. It made the platform constantly enter deep sleep states,
then demand instant wakeups at the worst times. Disabling runtime autosuspend for the NVMe device fixed most nodes.
The few remaining offenders needed ASPM off.
The interesting part: once stable, they reintroduced power saving carefully—one knob at a time, validated with the same workload.
Power goals were mostly achieved without sacrificing storage reliability, but only after they stopped treating power settings as harmless.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent service ran on a pair of NVMe-heavy database boxes. Nothing fancy. Just a steady I/O load and strict uptime expectations.
Their SRE lead was famously dull about change control. That dullness paid rent.
They had a habit: every kernel/firmware update had a pre-flight test that included an I/O stress run and a log review checklist.
It wasn’t heroic. It was scheduled, documented, and repeatable.
After a platform firmware update, the stress test started producing corrected AER errors on the NVMe upstream port.
No outages yet, no visible application impact, just a new kind of log noise. Most teams would ignore it because “corrected.”
They didn’t ship. They bisected the change: same OS, same kernel, only firmware changed. The errors followed the firmware.
They rolled back firmware on that batch and opened a vendor ticket with clean evidence: before/after logs, reproduction steps, and link status output.
Weeks later, the vendor provided a fixed firmware that adjusted PCIe power-state behavior. The service never had a production outage from it.
The practice that saved them wasn’t genius; it was refusing to deploy when the system started whispering “I’m not okay.”
Common mistakes: symptom → root cause → fix
This is where most outages go from “annoying” to “resume-generating.”
1) Symptom: “Filesystem went read-only, NVMe must be dying”
Root cause: controller reset or PCIe link drop caused I/O errors; the filesystem remounted read-only to protect itself.
Fix: confirm with journalctl -k for timeouts/resets; address ASPM/APST/runtime PM; then validate under load.
2) Symptom: “SMART is clean, so it can’t be the drive”
Root cause: transport/link instability doesn’t necessarily show as media errors.
Fix: correlate AER errors and NVMe resets; check PCIe link status; treat it as platform stability issue.
3) Symptom: “Only happens under load, so it’s overheating”
Root cause: could be thermal, but often it’s power-state transition timing under sustained queue depth.
Fix: check temps, yes; but also reproduce with ASPM off or APST limited. Thermal fixes without log confirmation are just HVAC cosplay.
4) Symptom: “Disabling ASPM fixed it, so ship that everywhere”
Root cause: blanket policy based on one bad platform/drive combination.
Fix: treat as hardware profile-specific. On laptops and power-sensitive deployments, try a narrower fix (limit APST, disable runtime autosuspend for NVMe only).
5) Symptom: “I increased nvme io_timeout and now it doesn’t reset”
Root cause: you turned a hard failure into a longer hang. The device still stalls; you just wait longer to admit it.
Fix: revert timeout tweaks; fix the underlying instability; use timeouts to tune recovery behavior only after stability is proven.
6) Symptom: “AER corrected errors are harmless”
Root cause: corrected errors can be early warnings of a marginal link that will eventually become uncorrected under stress.
Fix: track corrected error rates; investigate spikes; test with Gen speed forced down or ASPM adjusted.
7) Symptom: “It only happens after idle”
Root cause: deep idle power states (ASPM L1.2, APST deep PS, runtime suspend) plus a sudden workload creates a wake-up failure path.
Fix: disable runtime autosuspend for NVMe; limit APST max latency; consider ASPM off if needed.
Second and final joke: If your NVMe keeps vanishing, congratulations—you’ve invented cloud storage, except the cloud is your PCIe bus and it charges interest.
Checklists / step-by-step plan
Step-by-step: stabilize a flaky Ubuntu 24.04 NVMe in production
- Capture evidence first. Save journalctl -k -b output, lspci -vv for the NVMe and upstream port, and nvme smart-log. You want before/after proof.
- Determine the failure class. Device gone vs device stuck. "Gone" points strongly to PCIe/link/power; "stuck" can be power, firmware, or thermal.
- Reproduce with controlled load. Use fio on the raw device if safe, or on a dedicated test partition. Watch dmesg -wT live.
- First change: disable ASPM for one boot. Add pcie_aspm=off. Reproduce again. If fixed, decide whether to keep it or try a narrower fix.
- Second change: limit NVMe APST. Add nvme_core.default_ps_max_latency_us=5000 (or similar) and test. Adjust based on results.
- Third change: disable runtime autosuspend for the NVMe PCI device. Set /sys/bus/pci/devices/.../power/control to on; persist via udev if it works.
- Validate across idle→burst transitions. The classic failure is "idle for 10 minutes, then spike." Recreate that pattern with your test; a reproducer sketch follows this checklist.
- Check BIOS/UEFI if software isn't enough. Disable ASPM in firmware, force Gen3, or adjust C-states temporarily to confirm root cause.
- Once stable, optimize carefully. Re-enable one power feature at a time, measure, and stop when errors return.
- Operationalize the guardrails. Add alerts for NVMe resets/timeouts and PCIe AER bursts. Stability without detection is just surprise scheduling.
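Here is a sketch of that idle→burst pattern, assuming a scratch target; the script name and TARGET path below are placeholders, so point TARGET at a dedicated test file or partition, never at data you need. Keep Task 13's dmesg watch running in a second terminal while it loops:
cr0x@server:~$ sudo tee /usr/local/bin/idle-burst-test.sh >/dev/null <<'EOF'
#!/bin/bash
# Alternate long idle with short, sharp write bursts to provoke power-state transitions.
TARGET=/mnt/scratch/nvme-burst.bin   # placeholder: a dedicated test file or partition
for i in $(seq 1 6); do
  echo "cycle $i: idle 600s"
  sleep 600
  echo "cycle $i: burst 120s"
  fio --name=burst --filename="$TARGET" --size=4G --direct=1 \
      --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 \
      --runtime=120 --time_based --group_reporting
done
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/idle-burst-test.sh
cr0x@server:~$ sudo /usr/local/bin/idle-burst-test.sh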
Rollback checklist (because you will need it once)
- Keep console access (IPMI/iDRAC/KVM) before changing kernel parameters.
- Document the previous /etc/default/grub line and keep a copy.
- Apply one persistent change at a time; reboot; test; then proceed.
- If the host fails to boot or storage changes behavior, remove the last kernel parameter and regenerate GRUB.
FAQ
1) Is this an Ubuntu 24.04 bug?
Sometimes it’s triggered by newer kernels being more active about power management, but the root cause is often platform/firmware/drive interaction.
Ubuntu is just where you noticed it.
2) Should I disable PCIe ASPM or NVMe APST first?
If the NVMe “disappears” (drops off the bus) and you see AER/link noise, disable ASPM first.
If it’s more “timeouts then reset” without obvious AER, try limiting APST next. In practice, ASPM-off is the fastest binary test.
3) Will disabling ASPM hurt performance?
Usually not in throughput terms. It can increase idle power and sometimes affect latency micro-behavior. The real “hurt” is power efficiency,
which matters more on laptops and dense fleets.
4) Will disabling APST ruin battery life on laptops?
It can. APST is there for a reason. Prefer limiting APST max latency rather than nuking power management wholesale,
and only apply broad changes to machines that need it.
5) I only see “AER corrected errors.” Do I care?
Yes, if they correlate with load spikes or precede NVMe timeouts/resets. Corrected errors are early warnings.
If they’re constant and harmless for months, maybe ignore; if they appear suddenly after an update, investigate.
6) Can a bad cable cause NVMe dropouts?
Not for M.2 on-board NVMe, but for U.2/U.3 or PCIe backplanes, absolutely. Poor signal integrity looks like AER noise and link retrains.
In servers, “it’s the cable/backplane” is boring and frequently correct.
7) Should I force PCIe Gen3 as a workaround?
If you need stability today and your link is marginal at Gen4/Gen5, forcing Gen3 is a valid triage move.
You’ll give up peak bandwidth, but you’ll keep your filesystem, which is generally considered a fair trade.
8) Why does it happen under load instead of at idle?
Load increases queue depth, DMA activity, interrupts, and thermal output. It can also create rapid transitions between busy/idle states.
Those transitions are where buggy power-state logic trips.
9) Is increasing nvme_core.io_timeout ever a good idea?
Rarely, and only when you’ve proven the device is slow but stable (e.g., heavy internal garbage collection) and resets are counterproductive.
For “device disappears,” timeouts are not the fix.
10) How do I prove the fix actually worked?
Use the same reproducer (fio pattern, duration, queue depth), compare kernel logs for AER/timeouts/resets, and run an idle→burst test.
“It feels better” is not a verification strategy.
Conclusion: practical next steps
When an NVMe disappears under load on Ubuntu 24.04, you’re usually looking at a power management edge case:
PCIe ASPM on the link, NVMe APST in the controller, runtime PM in the OS, or some charming combination.
The fix is rarely mystical. It’s disciplined: capture logs, reproduce, change one knob, and re-test.
Next steps that pay off quickly:
- Run the fast playbook: check logs for NVMe timeouts/resets and PCIe AER, confirm whether the device is gone or wedged.
- Test boot with pcie_aspm=off. If stability returns, decide whether to keep it or try a narrower fix.
- If needed, limit APST via nvme_core.default_ps_max_latency_us and/or disable runtime autosuspend for the NVMe PCI function.
- Validate with a repeatable load and an idle→burst transition test, then operationalize alerts for AER bursts and NVMe resets.