You shut down a VM. The GPU should pop back to the host like a well-trained boomerang. Instead it comes back… wrong.
Next VM boot: black screen. Host dmesg: “reset failed”. Your cluster autoscaler shrugs and schedules the job onto a corpse.
If you’ve been told “enable IOMMU and you’ll be fine,” you’ve been sold a comforting half-truth. IOMMU gives isolation.
It does not guarantee reset, reinitialization, or that the device will ever return from whatever state it entered under load.
This is the GPU reset bug ecosystem: part silicon behavior, part PCIe spec nuance, part driver reality, and part operator pain.
What people mean by “the GPU reset bug”
“GPU reset bug” isn’t one bug. It’s the family name for any situation where a GPU cannot be reliably reset and reused
without a host reboot (or physical power cycle). It shows up most often in these scenarios:
- VFIO GPU passthrough (KVM/QEMU/libvirt) where a GPU is attached to a VM, then detached and reattached.
- Containerized GPU workloads where the host driver tries to recover from a hung kernel or bad DMA.
- Multi-tenant GPU nodes with MIG/SR-IOV/vGPU where reset semantics vary by mode and firmware.
- PCIe AER storms, link flaps, or a GPU that falls into a low-power state and doesn’t come back cleanly.
The core problem is simple: a GPU is not a polite PCIe NIC. It’s a complex SoC with multiple internal engines
(graphics, compute, copy, video, display), its own firmware, and aggressive power management. When you “reset a PCIe device,”
you’re not necessarily resetting all that internal state—or you’re resetting it at the wrong time, leaving the device
in a half-alive state that confuses the driver on the next bind.
Joke #1: GPUs don’t “crash,” they enter a contemplative state where they reconsider their relationship with your driver.
Interesting facts and historical context (short, concrete)
- PCIe FLR (Function Level Reset) arrived to standardize per-function resets, but many early implementations were partial or quirky.
- Consumer GPUs were optimized for a single OS instance per boot, long before virtualization users demanded “detach/attach forever.”
- AMD’s “reset bug” label gained traction in the VFIO community because some models needed a full bus reset or power cycle to recover.
- NVIDIA’s compute persistence made sense for HPC throughput, but it complicates “tear down and rebind” behavior when you expect clean resets.
- IOMMU was primarily designed for DMA remapping and protection, not as a device lifecycle manager.
- PCIe power states like D3hot/D3cold can break reinitialization when firmware/BIOS and OS disagree about who owns wake-up.
- ACS (Access Control Services) became a big deal because consumer platforms often group multiple devices together, blocking safe passthrough.
- AER (Advanced Error Reporting) can be a lifesaver for diagnosis, but it can also flood logs when a device is misbehaving and degrade node stability.
The uncomfortable takeaway: “reset” is not a single thing. It’s a negotiation between platform firmware, PCIe topology,
device capability, and driver behavior. And GPUs are the most negotiation-prone devices you’ll host.
Why IOMMU isolation doesn’t imply recoverability
IOMMU is a guardrail. It maps device DMA addresses into a translation domain so devices can’t scribble over arbitrary
physical memory. That matters for security and stability. It also enables VFIO to assign a device to a guest safely.
But IOMMU does not:
- guarantee that the device can be returned to a clean state on detach
- force the device to honor FLR correctly
- reset internal firmware state, microcontrollers, or power rails
- fix a broken PCIe topology where the GPU shares a reset domain with other devices
- protect you from vendor-specific driver expectations about initialization sequencing
In production terms: IOMMU stops a runaway GPU from DMA-ing your host into the abyss. It doesn’t stop the GPU from
being a stubborn little appliance that refuses to reboot without someone cutting power.
The two isolation planes you must not confuse
Operators mix these up because they look adjacent in diagrams:
- DMA isolation (IOMMU): “Can this device access only what it’s allowed to access?”
- Lifecycle isolation (reset domain): “Can this device be independently reset and reinitialized?”
Many platforms give you the first and quietly fail at the second. That’s where “IOMMU isn’t enough” lives.
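A quick way to inspect both planes for one function is to ask sysfs directly. Here is a minimal POSIX sh sketch; the SYSFS override and the example address 0000:65:00.0 are assumptions for illustration, and reset_method only exists on newer kernels:

```shell
#!/bin/sh
# Probe both isolation planes for one PCIe function. SYSFS is overridable so
# the function can be rehearsed against a fake tree; the default address is
# the example GPU used throughout this article.
SYSFS="${SYSFS:-/sys}"

check_planes() {
    bdf="$1"
    dev="$SYSFS/bus/pci/devices/$bdf"

    # Plane 1: DMA isolation -- which IOMMU group holds this function?
    if [ -e "$dev/iommu_group" ]; then
        echo "$bdf iommu_group=$(basename "$(readlink -f "$dev/iommu_group")")"
    else
        echo "$bdf iommu_group=NONE"
    fi

    # Plane 2: lifecycle isolation -- does the kernel offer a per-function reset?
    # reset_method (newer kernels) names the methods; older kernels expose only reset.
    if [ -r "$dev/reset_method" ]; then
        echo "$bdf reset_method=$(cat "$dev/reset_method")"
    elif [ -e "$dev/reset" ]; then
        echo "$bdf reset_method=present (methods not listed)"
    else
        echo "$bdf reset_method=NONE"
    fi
}

check_planes "${1:-0000:65:00.0}"
```

A device that reports an IOMMU group but no reset method is exactly the trap this section describes: isolated, but not recoverable.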
Reset domains: the hidden dependency
A GPU sits behind a root port, maybe behind a switch, sometimes behind a PLX chip, sometimes sharing lanes with other devices.
Even if the GPU is in its own IOMMU group, it may share a reset line or power domain with something else.
When you issue a bus reset you might reset neighbors. When you issue FLR the GPU might ignore it.
When you issue nothing, the driver tries to “soft reset” internal engines and sometimes loses.
If you’re running multi-tenant workloads, you should treat “independent reset” as a procurement requirement, not an afterthought.
This is where boring hardware platform choices beat clever kernel flags.
One quote (paraphrased)
Hope is not a strategy: design systems so failure is expected and recovery is routine.
— paraphrasing Gene Kranz (operations leader, Apollo program)
PCIe reset mechanisms: FLR, bus reset, hot reset, and why GPUs are special
Function Level Reset (FLR)
FLR is the cleanest conceptually: reset just one PCIe function (like 0000:65:00.0) without blasting the whole bus.
The OS can request it via sysfs, and VFIO uses it when available.
Reality: some GPUs implement FLR in a way that resets config space but not the internal firmware state you actually care about.
Or FLR works only when the device is in a certain power state. Or it “works” and the device comes back with engines wedged.
Bus reset / secondary bus reset
This resets a downstream bus segment. More forceful than FLR. More collateral damage too.
If your GPU sits behind a PCIe bridge or switch, a bus reset can sometimes kick it into compliance.
The catch is obvious: you might reset other endpoints behind the same bridge. If you’re lucky, that’s nothing important.
If you’re not, you just rebooted your storage HBA in the middle of an I/O storm. Good luck explaining that graph to finance.
Hot reset
Hot reset is closer to “simulate unplug/replug at the link level.” It can work when FLR doesn’t.
It can also fail if the platform can’t retrain the link properly or if the device’s firmware gets stuck during link training.
Fundamental reset / power cycle
The nuclear option. If the GPU won’t come back without power removal, you don’t have a software reset problem.
You have a platform lifecycle problem. In a datacenter you solve this with:
- nodes with BMC-controlled PCIe slot power control (rare, but gold)
- GPU partitions that tolerate node reboots
- keeping workloads restartable and designing scheduling around failure
Why GPUs are special (and annoying)
GPUs have multiple internal “subdevices” behind a single PCIe function: copy engines, memory controllers, video engines,
display engines, security processors, and a large chunk of firmware that boots the whole thing. A reset that doesn’t
reinitialize firmware and memory controller state is a reset in name only.
Add power management: GPUs aggressively downshift into low-power states when idle. If you detach a GPU while it’s entering
a deep power state (or while runtime PM is active), you can get a device that enumerates but doesn’t respond correctly.
Failure modes you’ll actually see in production
1) “Reset failed” after VM shutdown; next attach hangs
Classic VFIO symptom: VM runs fine. You shut it down. Detach GPU. Reattach to another VM. Libvirt says it worked.
Guest boots to a black console. Host logs show a failed reset, or the driver refuses to bind.
2) GPU stuck in D3cold (or toggling power states)
The device looks present in lspci but won’t initialize. Sometimes the kernel logs mention “unable to change power state.”
This is often a runtime power management interaction: the kernel tries to suspend the device, then you try to reset it, and you get a mess.
3) AER spam, link resets, and “frozen” nodes
The GPU begins producing PCIe correctable/non-fatal errors. AER floods dmesg. CPU time goes into interrupt handling.
Even if workloads continue, the node’s latency profile becomes untrustworthy.
4) Driver thinks the GPU is busy forever
Kernel driver tries to quiesce the device, but internal engines are stuck. It waits for fences that never signal.
You try to unload the module and it blocks. That’s not “a driver bug” in the moral sense; it’s the driver doing the only safe thing:
refusing to pretend the hardware is OK.
5) Multi-function surprises: audio function, USB-C controller, etc.
Many GPUs expose multiple functions (graphics + HDMI audio + USB controller for VirtualLink/USB-C).
Resetting one function without the others (or binding inconsistently) can leave the device in an inconsistent state.
6) “Works in the morning” failures
Under light test workloads everything resets cleanly. Under sustained compute plus frequent attach/detach cycles, resets degrade.
This is common when firmware enters a less-tested state after long runtime, or when GPU memory is heavily used and teardown paths
are not perfect.
Fast diagnosis playbook
This is the version you run when the pager is buzzing and you don’t want to become a PCIe archaeologist.
The goal: identify whether you have a reset capability issue, a power management issue,
a driver binding issue, or a topology/reset-domain issue.
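That first classification step can be roughed out as a log matcher. A sketch in sh; the patterns are illustrative, not a complete taxonomy, so tune them to the messages your kernels actually emit:

```shell
#!/bin/sh
# Rough triage: map a dmesg excerpt (stdin) to one failure class.
# Order matters: a log with "reset failed" is a reset problem even if it
# also mentions power states.
classify_gpu_failure() {
    log=$(cat)
    case "$log" in
        *"reset failed"*|*"not ready"*)                      echo "reset-capability" ;;
        *"unable to change power state"*|*D3cold*|*D3hot*)   echo "power-management" ;;
        *AER*|*"link down"*|*"Physical Layer"*)              echo "link-or-topology" ;;
        *"failed to bind"*|*"probe of"*)                     echo "driver-binding" ;;
        *)                                                   echo "unknown" ;;
    esac
}

# Example invocation on a live host:
# dmesg -T | tail -n 50 | classify_gpu_failure
```

The point is not the exact patterns; it is that triage should branch on failure class before anyone touches a reset knob.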
First: confirm what failed—reset, bind, or link
- Check dmesg for “reset failed”, “AER”, “link down”, “device not ready”.
- Check sysfs reset method availability for the PCIe function.
- Check whether the device is in D3cold/D3hot.
Second: validate topology and reset domain risks
- Find the upstream bridge/root port and see what else shares it.
- Confirm IOMMU group membership, but don’t stop there.
- Look for a PCIe switch/bridge that might be the real reset target.
Third: choose the least destructive recovery action
- Try function reset (FLR) if available.
- Try unbind/bind cycle (if the driver supports reinit cleanly).
- Try hot reset or secondary bus reset if the GPU is isolated behind its own bridge/switch.
- If errors persist or AER storms start: drain node, reboot host.
If you can’t recover without reboot more than occasionally, stop treating it as “operator error.”
It’s an architectural constraint. Build scheduling and maintenance around it.
Practical tasks: commands, outputs, decisions (12+)
All examples assume a Linux host with root access and a GPU at 0000:65:00.0.
Replace addresses and driver names as appropriate.
Task 1: Identify the GPU and functions (are you missing the audio/USB function?)
cr0x@server:~$ lspci -nn | egrep -i 'vga|3d|nvidia|amd|audio|usb'
65:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2231] (rev a1)
65:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
What it means: Your “GPU” is at least two functions. Passthrough/reset plans must account for both.
Decision: If you passthrough 65:00.0, also consider 65:00.1 to keep the device consistent, or explicitly bind it to a safe driver.
Task 2: Check IOMMU is actually enabled (don’t assume)
cr0x@server:~$ dmesg | egrep -i 'iommu|dmar|amd-vi' | head
[ 0.912345] DMAR: IOMMU enabled
[ 0.912678] DMAR: Host address width 46
What it means: The kernel believes IOMMU is on.
Decision: If you don’t see this, fix boot params/BIOS first; VFIO stability without IOMMU is fantasy.
Task 3: Inspect IOMMU group membership (necessary, not sufficient)
cr0x@server:~$ gpu=0000:65:00.0; group=$(readlink -f /sys/bus/pci/devices/$gpu/iommu_group); echo $group; ls -l $group/devices
/sys/kernel/iommu_groups/42
total 0
lrwxrwxrwx 1 root root 0 Feb 4 10:01 0000:65:00.0 -> ../../../../devices/pci0000:60/0000:60:01.0/0000:65:00.0
lrwxrwxrwx 1 root root 0 Feb 4 10:01 0000:65:00.1 -> ../../../../devices/pci0000:60/0000:60:01.0/0000:65:00.1
What it means: GPU functions share the same IOMMU group, which is normal.
Decision: If unrelated devices are in the group, you can’t safely pass through without platform changes (or accepting risk).
Task 4: Find the upstream bridge/root port (reset domain clue)
cr0x@server:~$ gpu=0000:65:00.0; echo "GPU path:"; readlink -f /sys/bus/pci/devices/$gpu
GPU path:
/sys/devices/pci0000:60/0000:60:01.0/0000:65:00.0
What it means: Upstream port is 0000:60:01.0.
Decision: Check what else is behind 60:01.0. If it’s a shared switch, bus reset may have collateral damage.
Task 5: List everything behind the same upstream port
cr0x@server:~$ upstream=0000:60:01.0; find /sys/bus/pci/devices/$upstream/ -mindepth 1 -maxdepth 1 -name '0000:*' -printf '%f\n'
0000:65:00.0
0000:65:00.1
What it means: Only the GPU functions live behind this port.
Decision: A bus reset on the downstream segment is likely safe (still validate no hidden switch with other ports).
Task 6: Check whether the kernel thinks FLR is available
cr0x@server:~$ gpu=0000:65:00.0; sudo lspci -s ${gpu#0000:} -vv | egrep -i 'devcap|flreset'
	DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <16us
		ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
What it means: “FLReset+” under DevCap means the device advertises Function Level Reset. Advertising FLR and implementing it correctly are different things.
Decision: Verify via sysfs next (newer kernels list the available methods in reset_method); don’t rely on marketing-grade capability listings.
Task 7: Try sysfs function reset (the least destructive)
cr0x@server:~$ gpu=0000:65:00.0; sudo echo 1 > /sys/bus/pci/devices/$gpu/reset
-bash: /sys/bus/pci/devices/0000:65:00.0/reset: Permission denied
What it means: sudo elevated echo, but the redirection still ran in your unprivileged shell.
Decision: Perform the write as root: use sudo tee, or run the whole command in a root shell.
cr0x@server:~$ gpu=0000:65:00.0; echo 1 | sudo tee /sys/bus/pci/devices/$gpu/reset
1
What it means: The kernel accepted a reset request.
Decision: Immediately check dmesg for success/failure; acceptance doesn’t mean completion.
Task 8: Confirm reset outcome in dmesg (success vs “reset failed”)
cr0x@server:~$ dmesg -T | tail -n 20
[Sun Feb 4 10:05:01 2026] pci 0000:65:00.0: resetting
[Sun Feb 4 10:05:02 2026] pci 0000:65:00.0: reset failed
What it means: The platform/device couldn’t complete the requested reset path.
Decision: Escalate to hot reset / bus reset only if the GPU is isolated behind its own bridge; otherwise drain and reboot node.
Task 9: Check power state (D0 vs D3) to catch runtime PM traps
cr0x@server:~$ gpu=0000:65:00.0; cat /sys/bus/pci/devices/$gpu/power_state
D3cold
What it means: The device is in deep power-down.
Decision: Consider disabling runtime PM for the device, or ensure it’s in D0 before detach/reset operations.
Task 10: Disable runtime power management for the GPU (test change)
cr0x@server:~$ gpu=0000:65:00.0; echo on | sudo tee /sys/bus/pci/devices/$gpu/power/control
on
What it means: Runtime PM is forced “on” (no autosuspend) for that function.
Decision: Re-test detach/attach cycles. If stability improves, make it persistent via udev rule or systemd unit (with care).
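One way to make the runtime-PM override persistent is a udev rule. An illustrative fragment only: the file path is a convention, and the 10de/2231 IDs follow this article’s example GPU, so verify yours with lspci -nn before copying:

```
# /etc/udev/rules.d/90-gpu-no-runtime-pm.rules (illustrative path and IDs)
# Force runtime PM "on" (no autosuspend) for the matching GPU function.
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{device}=="0x2231", ATTR{power/control}="on"
```

After installing the rule, reload udev and re-check power/control on the device to confirm it sticks across rebinds.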
Task 11: Unbind from current driver (prepare for rebinding or VFIO)
cr0x@server:~$ gpu=0000:65:00.0; readlink /sys/bus/pci/devices/$gpu/driver
../../../../bus/pci/drivers/nvidia
cr0x@server:~$ gpu=0000:65:00.0; echo $gpu | sudo tee /sys/bus/pci/drivers/nvidia/unbind
0000:65:00.0
What it means: The device is detached from the nvidia kernel driver.
Decision: If unbind blocks or errors, the driver may be stuck waiting for hardware; drain and reboot is often the only clean escape.
Task 12: Bind to vfio-pci (for passthrough) and verify binding
cr0x@server:~$ gpu=0000:65:00.0; sudo modprobe vfio-pci
cr0x@server:~$ gpu=0000:65:00.0; vendor=$(cat /sys/bus/pci/devices/$gpu/vendor); device=$(cat /sys/bus/pci/devices/$gpu/device); echo $vendor $device
0x10de 0x2231
cr0x@server:~$ echo 10de 2231 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
10de 2231
cr0x@server:~$ gpu=0000:65:00.0; readlink /sys/bus/pci/devices/$gpu/driver
../../../../bus/pci/drivers/vfio-pci
What it means: vfio-pci owns the device now.
Decision: If binding fails or flips back, you have competing drivers (initramfs, udev rules, or Xorg) claiming the GPU.
Task 13: Check for AER errors that predict future resets failing
cr0x@server:~$ dmesg -T | egrep -i 'AER|pcieport|error' | tail -n 15
[Sun Feb 4 10:06:10 2026] pcieport 0000:60:01.0: AER: Corrected error received: 0000:60:01.0
[Sun Feb 4 10:06:10 2026] pcieport 0000:60:01.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer
What it means: Link-level issues exist even if the GPU “works.”
Decision: Treat persistent AER as a hardware/platform problem: check seating, risers, cabling (for external PCIe), BIOS settings, or replace the node.
Task 14: Verify link speed/width (degraded links correlate with weird resets)
cr0x@server:~$ gpu=0000:65:00.0; sudo lspci -s ${gpu#0000:} -vv | egrep -i 'LnkSta:|LnkCap:'
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
What it means: Your GPU is running on a degraded link.
Decision: Fix physical/topology issues first. Debugging resets on a degraded link is like debugging storage on a dying SATA cable: you’ll learn nothing useful.
Task 15: Try a downstream bus reset only if topology is safe
cr0x@server:~$ upstream=0000:60:01.0; echo 1 | sudo tee /sys/bus/pci/devices/$upstream/reset
1
What it means: You requested a reset on the upstream port/bridge, which can reset the downstream segment.
Decision: If other devices share that segment, don’t do this on a live node. If the GPU is isolated, this can restore a “stuck” device.
Task 16: Validate device is responsive post-reset (quick health signal)
cr0x@server:~$ gpu=0000:65:00.0; lspci -s ${gpu#0000:} -nn
65:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2231] (rev a1)
What it means: The device still enumerates, which is the bare minimum.
Decision: If it disappears entirely, you’re in link training/power domain territory; plan for host reboot and investigate platform firmware.
What works: mitigation strategies that survive real workloads
1) Prefer GPUs and platforms with proven reset behavior
This is not philosophical. It’s operational. If you need frequent reassignments (multi-tenant VFIO, CI farms, ephemeral GPU VMs),
your hardware must support reliable FLR or predictable bus reset without collateral damage.
Datacenter-class GPUs and servers tend to behave better because they were designed for fleet operations and remote recovery.
Consumer platforms can work, but you’re signing up for a hobby that involves kernel parameters, ACPI workarounds, and occasional reboots.
2) Keep the GPU bound to one world as long as possible
Every detach/attach is a chance to hit the bad path. If you can schedule “GPU stays with a VM for its lifetime,” do it.
If you can pin GPU to a node and move jobs instead of moving GPUs, do it. If you can avoid VFIO and use containers on the host,
do it (assuming you trust the multi-tenant model and have isolation controls).
3) Disable runtime PM for passthrough GPUs (selectively)
Runtime PM saves watts. It also introduces states that make reset/rebind harder. For passthrough GPUs, reliability beats elegance.
Force D0 during lifecycle transitions, or disable runtime PM for the device entirely.
4) Treat “reset failed” as a node health failure, not a one-off glitch
If a GPU can’t reset once, it’s more likely to fail again under similar conditions. Mark the node unhealthy, drain workloads,
and reboot on your schedule, not during peak.
5) Use a controlled reset ladder
Don’t jump straight to bus resets. Use escalating attempts:
- Ensure no process is using the GPU (or stop the VM cleanly).
- Unbind driver (host) or detach device (VFIO).
- Attempt FLR via sysfs.
- Attempt hot/bus reset only if isolation is confirmed.
- If still broken: reboot node; if recurring: firmware/platform change.
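The ladder can be sketched as a script. This is a dry-run sketch, not a drop-in tool: DRY_RUN defaults to on so it only prints intended actions, the GPU address is a placeholder, and the upstream reset is skipped unless you explicitly set UPSTREAM after confirming isolation:

```shell
#!/bin/sh
# Escalating reset ladder, dry-run by default. Set DRY_RUN=0 to execute,
# and set UPSTREAM only after you have verified the GPU is alone behind it.
DRY_RUN="${DRY_RUN:-1}"
GPU="${GPU:-0000:65:00.0}"
UPSTREAM="${UPSTREAM:-}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "would: $*"; else "$@"; fi
}

reset_ladder() {
    # 1. detach from the current driver, if any (workload must be stopped first)
    if [ -e "/sys/bus/pci/devices/$GPU/driver" ]; then
        run sh -c "echo $GPU > /sys/bus/pci/devices/$GPU/driver/unbind"
    fi
    # 2. least destructive: per-function reset (FLR if the device offers it)
    run sh -c "echo 1 > /sys/bus/pci/devices/$GPU/reset"
    # 3. only with confirmed isolation: reset the upstream port/bridge
    if [ -n "$UPSTREAM" ]; then
        run sh -c "echo 1 > /sys/bus/pci/devices/$UPSTREAM/reset"
    else
        echo "skip: upstream reset (isolation not confirmed)"
    fi
    # 4. anything past this point is drain-and-reboot territory
}

reset_ladder
```

Keeping the upstream step behind an explicit variable is deliberate: the dangerous rung should require a human decision, not a default.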
6) Align firmware/BIOS settings with your reset expectations
BIOS/UEFI can sabotage you with power management and PCIe features. Common improvements (platform-dependent):
- Disable deep PCIe power states for the slot if you see D3cold issues.
- Keep Above 4G Decoding enabled for large BAR devices and modern GPUs.
- Update GPU VBIOS and system BIOS together; mismatched generations create “works until it doesn’t” behavior.
7) For clusters: build in reboot tolerance and take the win
If your fleet is big enough, the economically correct answer is often: accept that some GPU resets require reboots,
and make node reboot a normal remediation with fast job rescheduling.
Joke #2: The most reliable GPU reset is the one performed by a power supply that briefly “forgets” it was supplying power.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company built a GPU-backed inference service. They wanted strong tenant isolation, so they used VFIO passthrough:
one VM per customer workload, GPUs reassigned as VMs churned. The platform team enabled IOMMU, validated IOMMU groups, and wrote
automation that detached and reattached devices between VMs.
It passed staging. It passed load tests. The rollout began quietly and looked fine for a few days, which is exactly how
platform bugs build trust before they cash it out.
The first incident started as “some jobs stuck.” Then it became “half the GPUs not scheduling.” The VMs were healthy.
Libvirt reported successful detach/attach. But the GPU devices would not initialize in the next VM after a detach.
Operators found “reset failed” in dmesg and watched unbind operations hang. A few nodes began logging PCIe corrected errors
that slowly turned into storms.
The wrong assumption was simple: “IOMMU groups imply safe reuse.” Isolation was correct. Recoverability wasn’t.
The GPU models in that fleet could be passed through just fine, but their reset behavior was unreliable under repeated churn.
They weren’t broken for desktop use; they were just not designed for that lifecycle.
The fix was operational and architectural: they stopped reusing GPUs across VM lifetimes. A GPU stayed pinned to a VM until
the VM was destroyed, and nodes were rebooted during a controlled maintenance window before reentering the pool.
Capacity planning got a little more honest. Incidents dropped sharply.
Mini-story 2: The optimization that backfired
A financial analytics shop ran compute-heavy workloads overnight. They wanted to squeeze power costs down and enabled aggressive
runtime power management. The idea was reasonable: GPUs idle between bursts, autosuspend them, save watts.
Two weeks later, they had a new class of failure: nodes that looked healthy but refused to start GPU jobs after being idle.
The scheduler would assign a job, the job would fail to open the device, retries would bounce across nodes, and eventually
the cluster would look “busy” while doing almost nothing. A classic utilization mirage.
The root cause was GPUs entering deep power states that interacted poorly with driver reinitialization after long idles.
The reset path sometimes worked, sometimes didn’t. The failure rate was low, but the operational blast radius was high because
it triggered thundering herds of retries.
They disabled runtime PM for the GPU functions used by compute. Power costs went up slightly. Throughput went up more.
Most importantly, the cluster stopped lying about being available.
The lesson: optimizing for power without a recoverability plan is like optimizing storage performance by turning off checksums.
You’ll get numbers. You won’t like the bill when reality arrives.
Mini-story 3: The boring but correct practice that saved the day
A research org ran a multi-tenant GPU cluster with a mix of hardware generations. They had learned the hard way that
“GPU reset” problems are not evenly distributed; a few nodes are always worse.
Their boring practice: nodes were labeled with a “GPU reset reliability” class based on observed behavior. Jobs that required
frequent restarts or ephemeral provisioning were scheduled only to high-reliability nodes. Long-running jobs could run anywhere,
but with checkpointing and a preemption plan.
They also had a strict remediation workflow: a node that logged a reset failure was immediately drained, rebooted, and only
returned to service after passing a standardized attach/detach test loop. No hero debugging during business hours.
When a new driver version introduced more frequent reset edge cases on one GPU family, the impact was contained.
The scheduler avoided the risky nodes, the drain-and-reboot loop prevented long AER storms, and users mostly saw jobs restart
rather than hang. It wasn’t glamorous. It was correct.
Common mistakes (symptoms → root cause → fix)
1) Symptom: “reset failed” after detach; GPU won’t reattach
Root cause: Device doesn’t support reliable FLR; driver reset path incomplete; GPU stuck in internal firmware state.
Fix: Avoid frequent detach/attach; pin GPU per-VM lifetime; use bus reset only with isolated downstream port; otherwise reboot node.
2) Symptom: GPU present in lspci but driver bind fails with timeouts
Root cause: GPU in D3cold or platform won’t wake it; runtime PM conflict.
Fix: Force /sys/bus/pci/devices/…/power/control to on; adjust BIOS PCIe power settings; ensure device in D0 before rebind.
3) Symptom: AER corrected errors increase; occasional job failures
Root cause: Signal integrity/link issues, riser problems, downgraded link, marginal slot.
Fix: Reseat GPU, swap riser/cable, check link width/speed, update firmware; quarantine node if persistent.
4) Symptom: Unloading GPU driver hangs forever
Root cause: Driver waiting for hardware engines to idle; GPU microcontroller wedged.
Fix: Don’t fight it. Drain and reboot. Then investigate why wedges happen (thermal, firmware, workload, overclock, power).
5) Symptom: VFIO works once per boot only
Root cause: Reset works at cold boot but not after a guest uses the GPU; missing vendor reset quirk or requires bus reset/power cycle.
Fix: Validate reset capability before deploying; plan for host reboot between assignments; prefer hardware with known-good reset.
6) Symptom: VM shutdown triggers host instability or other devices reset
Root cause: You used a bus reset on a bridge shared with other endpoints (storage, NIC).
Fix: Map topology; isolate GPUs behind dedicated root ports; do not bus reset shared segments in production.
7) Symptom: “It worked in staging” but fails under churn
Root cause: Test cycles didn’t cover long runtime, heavy VRAM usage, or repeated attach/detach; firmware states are workload-dependent.
Fix: Add soak tests with detach/attach loops; classify nodes by observed reset reliability; keep remediation automated.
Checklists / step-by-step plan
Checklist A: Before you deploy VFIO GPU passthrough at scale
- Verify IOMMU enabled in BIOS and kernel; confirm DMAR/AMD-Vi messages.
- Confirm GPU functions and ensure they’re all accounted for (graphics/audio/USB).
- Map IOMMU groups and topology; identify upstream ports and shared segments.
- Run a detach/attach soak test loop (dozens to hundreds of cycles) under real workload (VRAM pressure + compute).
- Decide your recovery contract: “recover without reboot” or “reboot acceptable.” Put it in the runbook.
- Implement node quarantine: if reset fails once, remove node from scheduling until remediated.
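The soak-test loop from Checklist A can be sketched as follows. Assumptions: a drained node, the GPU already bound to the stated driver, and illustrative cycle counts; SYSFS is overridable so the loop can be rehearsed against a fake tree before touching hardware:

```shell
#!/bin/sh
# Detach/attach soak loop. Fails fast on the first unbind/bind error or on a
# fresh "reset failed" complaint in dmesg.
SYSFS="${SYSFS:-/sys}"
GPU="${GPU:-0000:65:00.0}"
DRIVER="${DRIVER:-vfio-pci}"
CYCLES="${CYCLES:-50}"

soak() {
    i=0
    while [ "$i" -lt "$CYCLES" ]; do
        echo "$GPU" > "$SYSFS/bus/pci/drivers/$DRIVER/unbind" || { echo "FAIL unbind at cycle $i"; return 1; }
        sleep 1   # let teardown settle; tune for your driver
        echo "$GPU" > "$SYSFS/bus/pci/drivers/$DRIVER/bind"   || { echo "FAIL bind at cycle $i"; return 1; }
        # any reset complaint during the cycle disqualifies the node
        dmesg 2>/dev/null | tail -n 5 | grep -qi 'reset failed' && { echo "FAIL reset at cycle $i"; return 1; }
        i=$((i + 1))
    done
    echo "PASS $CYCLES cycles"
}

# Run on a drained node only:
# soak
```

A node that cannot pass this loop at commissioning time should never receive churn-heavy workloads in production.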
Checklist B: Safe recovery ladder during an incident
- Stop or migrate workload; ensure no process uses the GPU.
- Capture evidence: dmesg tail, lspci -vv link state, power state, current driver binding.
- Attempt FLR via sysfs reset on the function.
- If still broken, and only if topology is safe: reset the upstream port/bridge.
- If AER storms or repeated failures: drain node, reboot, mark node for deeper investigation.
Checklist C: Make it boring (the best kind of reliable)
- Label nodes by GPU family/firmware and observed reset behavior.
- Schedule churn-heavy workloads only on “good reset” nodes.
- Disable runtime PM for passthrough GPUs unless you have proof it’s safe.
- Keep kernel/driver/firmware versions controlled; roll changes with canaries and rollback plans.
- Automate reboot + health test loops; treat reboots as a maintenance tool, not a failure of character.
FAQ
Q1: If the GPU is in its own IOMMU group, why can’t it always be reset?
IOMMU grouping is about DMA isolation boundaries. Reset ability depends on PCIe reset support, power domains, firmware state,
and topology. They’re related but not equivalent.
Q2: Is this only an AMD problem?
No. Different vendors show different patterns, but reset failures exist across GPU families. The VFIO community historically
highlighted AMD “reset bug” cases, but NVIDIA and others can also wedge depending on model, firmware, and driver path.
Q3: Does SR-IOV or MIG eliminate reset issues?
It changes the problem. Partitioning features can reduce full-device churn, which helps. But you still have reset semantics at the
physical function level, and firmware/driver bugs can still surface under heavy multi-tenant conditions.
Q4: Can I fix this purely in software with kernel parameters?
Sometimes you can improve behavior (especially around power management or driver binding), but you can’t software your way out of
a device that requires a fundamental reset. Treat hardware selection and topology as first-class.
Q5: What’s the safest reset to try first?
Function-level reset (FLR) via sysfs is generally the least disruptive. If it fails, the next steps depend on topology. Bus resets
can work but carry collateral risk.
Q6: Why does the GPU sometimes work after a host reboot but not after VM detach/attach?
Cold boot establishes a clean baseline: firmware initializes from scratch, power rails cycle, link trains cleanly. A detach/attach
cycle often relies on partial resets and driver teardown paths, which may not fully reinitialize internal GPU state.
Q7: Should I disable AER to stop log spam?
Usually no. AER is a symptom reporter. Disabling it hides the evidence while the link continues misbehaving.
If AER storms cause operational issues, quarantine the node and fix the underlying link/power/platform problem.
Q8: Is it acceptable to plan for reboots as the recovery method?
Yes, if you design for it. In clusters, “reboot to recover a stuck GPU” can be perfectly reasonable if jobs are restartable,
nodes are drained safely, and the reboot path is fast and automated.
Q9: Why does unbinding the driver sometimes hang?
Because the driver is attempting a safe teardown: it waits for DMA to stop and engines to quiesce. If the GPU is wedged, the driver
can’t responsibly pretend it’s free. That hang is often protecting you from memory corruption.
Q10: What’s the single best metric to alert on?
Alert on reset failures and recurring AER events on GPU root ports. Those correlate strongly with “this node will waste your time later.”
Then automate quarantine and remediation.
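A minimal version of that alert is a counter with a threshold. A sketch; the default threshold of 10 is an assumption, and in practice you would feed it dmesg output on a schedule:

```shell
#!/bin/sh
# Count AER lines in a log excerpt (stdin) and flag when a threshold is hit.
aer_alert() {
    threshold="${1:-10}"
    count=$(grep -c 'AER' || true)   # grep -c still prints 0 on no match
    if [ "$count" -ge "$threshold" ]; then
        echo "ALERT aer_count=$count"
    else
        echo "OK aer_count=$count"
    fi
}

# Example invocation on a live host:
# dmesg | aer_alert 10
```

Wire the ALERT path straight into your quarantine automation; a human reading the log is the fallback, not the mechanism.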
Conclusion: practical next steps
If you remember one thing: IOMMU gives you DMA safety, not reset safety. The GPU reset bug class is what happens when we
confuse those and build automation that assumes devices behave like stateless peripherals.
Next steps that pay off immediately:
- Inventory topology: map upstream ports, shared bridges, and IOMMU groups for every GPU node.
- Implement a reset ladder: FLR → careful bus reset (only if isolated) → drain and reboot.
- Disable runtime PM for passthrough GPUs if you see D3cold/D3hot weirdness.
- Adopt node quarantine: one reset failure should remove a node from scheduling until reboot + health test.
- Change procurement criteria: require proven reset behavior, slot power control if possible, and platforms designed for virtualization.
That’s how you turn “GPU reset bug” from a recurring incident into a known constraint with a boring, reliable response.
Boring is good. Boring means you get your weekends back.