Why Your GPU Passthrough Black Screens After Reboot (It’s Often IOMMU)
February 10, 2026
You had it working. The VM booted, the guest driver installed, you even ran a benchmark like a responsible adult.
Then you rebooted the host and—black screen. No output. Maybe the VM “runs” but you can’t see anything.
Maybe the host boots fine but the GPU is dead to the guest. Welcome to GPU passthrough: where yesterday’s success is not a contract.
When this happens after a reboot, the failure is often not “the GPU” in the abstract. It’s almost always
one of three things: IOMMU isolation changed, the wrong driver grabbed the device first, or the device didn’t reset cleanly.
The trick is to stop guessing and start collecting evidence that survives the reboot.
Fast diagnosis playbook
If you’re on-call or just tired, use this order. It’s optimized for time-to-first-truth, not for elegance.
Do these checks on the host first; your guest can’t tell you the truth if it never really owned the GPU.
1) Confirm the device is still the same device
Check PCI address and device IDs didn’t change (rare, but it happens with BIOS updates or slot changes).
Confirm the GPU and its audio function are both present.
2) Confirm VFIO actually bound the GPU at boot
If the host kernel driver (nouveau, nvidia, amdgpu) binds first, VFIO will “lose” and your VM gets a ghost.
Binding problems are common after kernel updates and initramfs changes.
3) Confirm IOMMU groups are still isolated
If your GPU shares an IOMMU group with something the host needs, the VM will either refuse to start or it will start and black-screen.
Group membership can change across BIOS toggles, firmware updates, or different kernel parameters.
4) Check for reset/FLR issues
If it works once but fails after reboot or after VM shutdown/start, you probably hit a reset quirk.
These present as “VM boots, no display” or “guest driver loads then device disappears.”
5) Confirm your display path isn’t lying
Wrong monitor input, DP handshake weirdness, or a KVM switch can look like VFIO failure.
Test via a different output (HDMI vs DP) or use a known-good dummy plug.
Joke #1: GPU passthrough is the only place where “it worked yesterday” is considered a threat model.
What “black screen” actually means in passthrough
“Black screen” is a symptom, not a diagnosis. In VFIO GPU passthrough it typically means one of these:
The guest never got the GPU: the host driver bound it, or QEMU couldn’t assign it due to IOMMU group conflicts.
The guest got the GPU, but it never initialized output: missing option ROM, wrong firmware mode (UEFI vs legacy), or a bad reset state.
The guest initialized, but you’re looking at the wrong output: multi-monitor or multi-output confusion; the GPU is rendering elsewhere.
The GPU is in a wedged state: a known problem on some consumer GPUs without proper Function Level Reset (FLR).
A useful mental model: passthrough is a custody chain. The PCIe device is “owned” by something at every moment:
firmware, host driver, VFIO, QEMU, guest driver. Black screens often happen when ownership changes
but the state does not reset cleanly—or the wrong owner grabs it first.
Why IOMMU is usually the culprit
IOMMU (Intel VT-d / AMD-Vi) is the bouncer at the memory club. It restricts DMA so devices can’t read and write
random RAM. For passthrough, that’s non-negotiable: your guest GPU must DMA only into guest-assigned memory,
not into the host’s page cache or your hypervisor’s secrets.
The part that bites you after reboot is that the IOMMU topology is not just “on/off.” It’s shaped by:
BIOS settings, PCIe topology, kernel parameters, ACS capabilities, and sometimes board vendor creativity.
That means you can have a working configuration that silently becomes non-isolated after a reboot,
a firmware update, or even a different boot order of devices.
IOMMU groups: the real unit of isolation
VFIO hands whole IOMMU groups to guests, not arbitrary single devices (with some nuance, but treat it as “the group”).
If your GPU is in a group with, say, a SATA controller or a USB controller the host needs, you have a conflict:
either you pass through too much and lose host functionality, or you can’t safely isolate the GPU at all.
After reboot, group membership can change because the firmware re-enumerated the PCIe tree differently,
or because a BIOS toggle changed ACS behavior. Sometimes a “harmless” BIOS update decides the chipset root port
should behave differently, and now your GPU is glued to half the machine.
ACS override: tempting, sometimes necessary, always a trade
The ACS override kernel parameter can split groups that the hardware doesn’t properly isolate.
This can make passthrough possible on consumer platforms, and it can also create a false sense of security:
you’re telling the kernel to assume isolation that may not exist electrically.
In production environments, I treat ACS override like using duct tape on a brake line. You might get home.
You should not build a fleet around it.
Interrupt remapping and why “IOMMU enabled” isn’t enough
Some platforms need interrupt remapping enabled for safe passthrough. Without it, you can see weird failures:
devices that appear assigned but act dead, guests that hang on driver init, or “works until reboot” patterns.
This is where “I enabled VT-d” becomes a fairy tale. You need to verify the kernel actually enabled the features.
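You can verify it from the kernel log rather than trusting the firmware menu. A minimal sketch (the message wording shown is Intel-specific; AMD hosts log differently):

```shell
# Interrupt remapping announces itself at boot:
dmesg | grep -i 'remapping' || echo "no remapping messages in this boot log"
# On a healthy Intel host you should see a line like:
#   DMAR-IR: Enabled IRQ remapping in x2apic mode
```

If that line is absent while VT-d appears enabled, go hunting in the BIOS before blaming VFIO.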
Interesting facts and historical context
VFIO replaced older KVM device assignment approaches by moving device access into a safer, IOMMU-driven framework rather than ad-hoc mapping.
IOMMU exists primarily for DMA isolation; virtualization made it famous, but the security problem (devices as DMA attackers) predates today’s homelab boom.
PCIe ACS wasn’t “designed for gamers”; it’s a server feature for isolation and routing control, and consumer chipsets often implement it partially or not at all.
Many GPUs expose multiple PCI functions (graphics + HDMI audio + sometimes USB-C/VirtualLink). You often need to pass through all related functions for stability.
Function Level Reset (FLR) is not universal; some consumer GPUs historically lacked reliable reset behavior, causing “works once” passthrough failures.
UEFI GOP vs legacy VGA ROM matters: modern GPUs prefer UEFI initialization; legacy CSM paths can behave differently across reboots and firmware versions.
Resizable BAR changed PCIe resource sizing expectations; turning it on/off can affect how devices map memory, and occasionally how firmware enumerates devices.
Early VGA arbitration is still a thing: the “primary GPU” concept influences which device firmware initializes first, which affects host driver binding and passthrough.
Practical tasks: commands, outputs, what it means, what you do next
Below are field tasks you can run right now on a Linux KVM host (including Proxmox-style hosts).
Each one includes: a command, realistic output, what the output means, and the decision you make from it.
Do not cherry-pick. Run the early ones even if you’re “sure” it’s something else.
Task 1: Confirm IOMMU is actually enabled in the kernel
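A minimal way to check, assuming a standard Linux boot (the VFIO IDs in the comment are placeholders, not your hardware):

```shell
# The kernel command line actually used for this boot:
cat /proc/cmdline
# You want to see your IOMMU parameters here, e.g. (placeholder IDs):
#   ... intel_iommu=on iommu=pt vfio-pci.ids=10de:1b80,10de:10f0 ...
# Cross-check that the kernel actually activated the IOMMU:
dmesg | egrep -i 'DMAR|IOMMU' | head
```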
Meaning: You’re requesting Intel IOMMU, pass-through mode for host devices (iommu=pt), and pre-binding VFIO to specific PCI IDs.
Decision: If the parameters are missing after reboot, you edited the wrong bootloader entry or didn’t regenerate the boot config. Fix bootloader, then rebuild initramfs if needed.
Task 3: Identify the GPU and its functions (graphics + audio)
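One hedged way to enumerate them (the slot address 01:00 is an example; substitute your own):

```shell
# Find display and audio functions:
lspci -nn | egrep -i 'vga|3d controller|audio' || echo "nothing matched (or pciutils missing)"
# List every function on the GPU's slot (example address):
lspci -nn -s 01:00 || true
```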
Meaning: Your GPU is at 01:00.0 and the HDMI audio function is 01:00.1. They usually need to be passed together.
Decision: If the audio function is missing after reboot, suspect PCIe enumeration changes, faulty riser, or a BIOS “PCIe power saving” feature misbehaving.
Task 4: See what driver currently owns the GPU (the custody chain check)
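A sketch of the check (the PCI addresses are the examples used in this article, not universal):

```shell
# Who owns each GPU function right now?
for f in 01:00.0 01:00.1; do
  lspci -nnk -s "$f"
done 2>/dev/null || true
# The line that decides everything:
#   Kernel driver in use: vfio-pci   -> good, VFIO owns it
#   Kernel driver in use: nvidia     -> bad, the host grabbed it first
```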
Meaning: The line you want is "Kernel driver in use: vfio-pci". The host isn't actively using the device.
Decision: If you see nouveau, nvidia, or amdgpu as “in use”, your passthrough will be flaky or dead. Fix driver binding before debugging IOMMU groups.
Task 5: Verify IOMMU group membership (the isolation truth)
cr0x@server:~$ for d in /sys/kernel/iommu_groups/*/devices/*; do g=$(basename "$(dirname "$(dirname "$d")")"); printf "Group %s: " "$g"; lspci -nns "${d##*/}"; done | egrep '01:00'
Group 16: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
Group 16: 01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
Meaning: Ideal: the GPU functions are alone together in one group.
Decision: If the group includes other devices (USB controller, SATA, NIC), decide: move the GPU to another slot, change BIOS PCIe settings, or accept ACS override risk.
Task 6: Detect ACS override being used (and admit the risk)
cr0x@server:~$ dmesg | egrep -i 'ACS|override' | head
[ 1.245678] PCI: Using ACS override for IOMMU isolation
Meaning: The kernel is faking/splitting isolation.
Decision: In a homelab, maybe acceptable. In an environment with real threat models, treat this as a temporary bridge while you fix hardware topology.
Task 7: Confirm VFIO modules loaded and not silently missing after an update
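One way to check both, assuming the common modprobe.d layout (the filename and IDs are placeholders):

```shell
# Are the VFIO modules present in the running kernel?
lsmod | egrep '^vfio' || echo "no vfio modules loaded"
# Did the early-binding options survive the update?
cat /etc/modprobe.d/vfio.conf 2>/dev/null || echo "no vfio.conf found"
# Typical contents (IDs are placeholders):
#   options vfio-pci ids=10de:1b80,10de:10f0
#   softdep nouveau pre: vfio-pci
```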
Meaning: You’re configuring early VFIO binding via modprobe options and ensuring modules load.
Decision: If this looks correct but the host driver still binds first, regenerate initramfs and reboot. Early binding is the difference between “stable” and “maddening.”
Task 9: Confirm the GPU is not the boot/console framebuffer device
cr0x@server:~$ dmesg | egrep -i 'fb0|framebuffer|efifb|vesafb' | head -n 20
[ 0.401234] efifb: probing for efifb
[ 0.401567] efifb: framebuffer at 0xc0000000, using 3072k, total 3072k
[ 0.401890] fb0: EFI VGA frame buffer device
Meaning: The host is using EFI framebuffer, which may or may not be on your passthrough GPU.
Decision: If the passthrough GPU is the primary display device, you’ll fight firmware and console binding. Prefer iGPU or a second cheap GPU for the host.
Task 10: Look for “GPU fell off the bus” / AER / link errors after reboot
cr0x@server:~$ journalctl -b -k | egrep -i 'AER|pcie|fallen off|vfio|DMAR|amdgpu|nvidia' | tail -n 30
Feb 04 09:11:21 server kernel: vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
Feb 04 09:11:22 server kernel: pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
Feb 04 09:11:22 server kernel: pcieport 0000:00:01.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer
Meaning: You have PCIe corrected errors. Not always fatal, but if they correlate with black screens, you may have signal integrity or power management issues.
Decision: Try a different slot, disable ASPM in BIOS, update BIOS, or remove risers. Don’t “optimize” until stable.
Task 11: Check QEMU/libvirt assignment errors (the “it never really attached” case)
cr0x@server:~$ journalctl -u libvirtd -b --no-pager | tail -n 30
Feb 04 09:15:02 server libvirtd[1123]: internal error: qemu unexpectedly closed the monitor: 2026-02-04T09:15:02.123456Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: vfio 0000:01:00.0: group 16 is not viable
Feb 04 09:15:02 server libvirtd[1123]: internal error: qemu unexpectedly closed the monitor: 2026-02-04T09:15:02.123499Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: Please ensure all devices within the iommu_group are bound to their vfio bus driver.
Meaning: Classic: the IOMMU group contains something not bound to VFIO, so QEMU refuses or bails.
Decision: Either bind the entire group to VFIO (dangerous if it includes host-critical devices), or rework isolation (preferred).
Task 12: Confirm the guest firmware mode and GPU ROM handling
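A hedged way to inspect it with libvirt (the VM name win11-gpu is the example used elsewhere in this article):

```shell
# What firmware is the VM defined with?
virsh dumpxml win11-gpu 2>/dev/null | egrep -i 'loader|nvram|rom bar|hostdev' \
  || echo "virsh unavailable or VM not defined"
# OVMF (UEFI) shows a pflash loader line such as:
#   <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
# SeaBIOS VMs have no <loader> element at all.
```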
Meaning: The VM is using OVMF (UEFI). ROM BAR is enabled (not the same as supplying a ROM file, but it’s relevant).
Decision: If you’re using legacy BIOS (SeaBIOS) with a modern GPU and seeing post-reboot black screens, switch to OVMF unless you have a specific reason not to.
Task 13: Check for the classic “NVIDIA Code 43” and distinguish it from IOMMU issues
cr0x@server:~$ virsh qemu-agent-command win11-gpu '{"execute":"guest-get-osinfo"}'
{"return":{"id":"mswindows","name":"Microsoft Windows","pretty-name":"Microsoft Windows 11","version":"10.0"}}
Meaning: Guest agent is responsive; the guest OS is alive even if you see a black screen.
Decision: If the guest is alive, your problem might be display initialization, driver state, or output selection—not a total device attach failure.
Task 14: Validate that the GPU isn’t still “in use” by a host process
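A sketch of the custody check (the group number is an example; fuser may not be installed everywhere):

```shell
# Which process holds the VFIO group node?
fuser -v /dev/vfio/16 2>/dev/null || echo "nothing holds the group (or fuser is unavailable)"
# Expect the QEMU process of the running VM; anything still holding it after
# VM stop is a cleanup problem.
```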
Meaning: The VFIO group is held by the QEMU process. That’s expected when the VM is running.
Decision: If something else holds it (or it’s held after VM stop), you have a cleanup/reset issue. Fix shutdown hooks, and consider vendor-reset or a full power cycle policy.
Task 15: Force-check IOMMU feature flags for Intel platforms
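A minimal check against the kernel log (the message wording is Intel-specific):

```shell
# Did the kernel enable DMA remapping *and* interrupt remapping?
dmesg | egrep -i 'DMAR: IOMMU enabled|DMAR-IR|IRQ remapping' || echo "no remapping messages found"
# Healthy Intel output includes lines like:
#   DMAR: IOMMU enabled
#   DMAR-IR: Enabled IRQ remapping in x2apic mode
```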
Meaning: You have the pieces that make passthrough less haunted.
Decision: If interrupt remapping is missing, look for BIOS options like “Interrupt Remapping” or “Posted Interrupt.” Some boards hide it under “Security.”
Task 16: Check whether the GPU supports and advertises reset capabilities
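Where to look, assuming pciutils and a reasonably recent kernel (the address is an example; run as root for full capability output):

```shell
# Does the GPU advertise Function Level Reset?
lspci -vv -s 01:00.0 2>/dev/null | egrep 'DevCap|FLReset|Reset' || echo "pciutils missing or no such device"
# In the DevCap line, "FLReset+" means FLR is advertised; "FLReset-" means it is not.
# Also check which reset methods the kernel thinks the device has:
cat /sys/bus/pci/devices/0000:01:00.0/reset_method 2>/dev/null || echo "no reset_method attribute"
```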
Meaning: This snippet alone doesn’t confirm FLR, but lspci -vv is where you look for reset-related capabilities and behavior.
Decision: If you consistently see “works once” and the card lacks good reset behavior, plan for mitigations: vendor-reset for some AMD cards, or host power cycle after VM stop, or choose a different GPU.
Three corporate mini-stories from the trenches
Incident #1: The wrong assumption (“Reboot is a clean slate”)
A midsize company ran a small VDI pool for a design team. Nothing exotic: KVM, VFIO, a few GPUs per host, Windows guests.
It worked in staging for weeks. Then the first patch Tuesday hit, and everyone rebooted the hosts over a weekend.
Monday morning: a third of the VMs came up with black screens.
The assumption baked into their runbook was simple: a host reboot resets the hardware, so the GPU will always come back clean.
That assumption was wrong in two ways. First, the GPU was the primary display device in the BIOS on some hosts,
so firmware and the host console framebuffer touched it before VFIO could claim it. Second, a BIOS setting
(“fast boot”) altered PCIe initialization timing and changed which devices landed in which IOMMU groups.
The diagnosis took longer than it should have because the team chased guest-side driver logs.
Those logs were real, but they weren’t causal. The GPU never truly belonged to the guest in the failing cases.
The proof was in lspci -nnk: the host driver owned the GPU on the broken hosts.
The fix was boring and effective: standardize BIOS settings, force the iGPU as primary for the host,
and ensure VFIO binding occurs in initramfs with the correct IDs for both GPU and audio function.
After that, reboots became predictable again—which is the only kind of reboot you want in production.
Incident #2: The optimization that backfired (“Let’s enable every performance toggle”)
Another shop wanted maximum GPU performance for compute workloads in guests. Someone (with good intentions)
enabled Resizable BAR, Above 4G decoding, and a few PCIe power management knobs across the fleet.
On paper, this can help certain workloads. In reality, it introduced an inconsistent boot-time PCIe resource layout.
The immediate symptom wasn’t performance. It was intermittent black screens after reboot, concentrated on one motherboard model.
Some hosts would come up with the GPU in an IOMMU group that now included a PCIe bridge and a USB controller.
QEMU would sometimes refuse to start, other times the VM would start but the GPU would fail to initialize output.
The team lost a day because the changes were “performance settings,” and nobody mentally connected them to isolation.
The postmortem was blunt: if your platform isn’t validated for those toggles, you treat them like experimental features.
They rolled back the power management changes first, then validated Resizable BAR host-by-host.
Stability returned. Performance did too—because performance is what you get after reliability, not instead of it.
Joke #2: If you flip every BIOS switch labeled “turbo,” you’re not tuning—you’re auditioning for a reboot roulette league.
Incident #3: The boring practice that saved the day (“Baseline evidence after every change”)
A more disciplined team ran a small “golden host” program. Every time they changed firmware, kernel, or hypervisor packages,
they captured a baseline bundle: /proc/cmdline, dmesg IOMMU lines, IOMMU group listings, and lspci -nnk for the GPU. They stored it with the change ticket. Nobody celebrated this.
It was paperwork with command output.
Then a routine kernel update coincided with a black-screen issue after reboot on two hosts.
Instead of arguing about whether it was “the GPU driver,” they diffed the baseline bundles.
The smoking gun was immediate: the cmdline no longer had the expected VFIO IDs on the affected hosts.
A bootloader config fragment had been overwritten during a package transition.
The fix was quick because they didn’t need to rediscover the system; they just needed to restore it.
They corrected the boot config, rebuilt initramfs, and rebooted once. The VMs came back.
The big win wasn’t technical brilliance—it was having proof of what “working” looked like.
This is the unglamorous reality of running production virtualization: you don’t prevent every failure.
You prevent the ones that waste your time.
Common mistakes: symptom → root cause → fix
1) Black screen only after reboot; first boot after cold power is fine
Root cause: GPU reset quirk. Device isn’t returning to a clean state across warm boots or VM stop/start.
Fix: Prefer GPUs with proper FLR; try a kernel/module reset workaround (where applicable), or enforce host power cycle after VM stop. Avoid “suspend” states for the host.
2) VM starts, guest is reachable (RDP/SSH), but physical monitor stays black
Root cause: Output mismatch (wrong port), UEFI/ROM init mismatch, or guest driver choosing a different display path.
Fix: Switch VM to OVMF; ensure GPU ROM handling is correct; test a different output; remove extra virtual displays; verify guest sees the GPU and selected output.
3) QEMU/libvirt complains “group is not viable” after reboot
Root cause: IOMMU group contains other devices not bound to VFIO, or group changed after firmware update.
Fix: Re-check group membership; move GPU to another slot; change BIOS PCIe settings; avoid ACS override unless you accept the risk; bind entire group only if it’s safe.
4) GPU binds to nouveau/nvidia/amdgpu on host after reboot
Root cause: VFIO binding not happening early enough; initramfs missing your vfio config; blacklists not applied at boot.
Fix: Put vfio-pci IDs in modprobe config; blacklist host GPU drivers if needed; rebuild initramfs; confirm with lspci -nnk after reboot.
5) Everything worked until you enabled “fast boot” or changed CSM/UEFI settings
Root cause: Firmware initialization changed: enumeration order, primary display selection, or ACS behavior shifted, and IOMMU groups or driver binding changed with it.
Fix: Revert the toggle, re-validate groups and driver binding, then standardize firmware settings across hosts. While it’s working, capture baseline evidence:
Capture IOMMU group membership for the GPU functions.
Capture lspci -nnk showing vfio-pci as “in use” for GPU and audio.
Capture VM firmware mode (OVMF vs SeaBIOS) and hostdev mapping.
Recovery plan when you rebooted and now it’s black
Stop the VM (if it’s running) to avoid repeated device init attempts that complicate logs.
Confirm the GPU’s PCI addresses didn’t change (lspci -nn).
Check driver ownership (lspci -nnk -s 01:00.0 and 01:00.1).
Check IOMMU groups and verify the group is viable.
Check kernel logs for AER and VFIO errors (journalctl -b -k).
Verify firmware mode (OVMF preferred for modern guests) and remove conflicting virtual display devices.
If it’s a reset issue, try a full host power cycle (not reboot). If that fixes it, treat reset behavior as the root cause and plan mitigations.
Hardening plan (make reboots boring again)
Standardize BIOS settings across hosts: VT-d/AMD-Vi on, interrupt remapping on, fast boot off.
Prefer a dedicated host display device (iGPU or a cheap secondary GPU) so firmware doesn’t initialize your passthrough GPU.
Bind VFIO in initramfs, not “late” in boot. Late binding is a race you will eventually lose.
Avoid ACS override in environments with serious isolation needs; if you must use it, document and revisit.
After any BIOS/kernel update, re-validate IOMMU groups. Treat it like a schema migration, because it is.
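The baseline bundle from the third incident can be a short script. This is a sketch assuming standard /proc and /sys paths; the GPU slot address is a placeholder:

```shell
#!/bin/sh
# Capture a passthrough baseline bundle into a timestamped directory.
set -u
dir="baseline-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dir"

# Kernel command line as actually booted.
cat /proc/cmdline > "$dir/cmdline.txt"

# IOMMU group layout (empty if the IOMMU is off).
for d in /sys/kernel/iommu_groups/*/devices/*; do
  [ -e "$d" ] || continue
  printf 'group %s: %s\n' "$(basename "$(dirname "$(dirname "$d")")")" "${d##*/}"
done > "$dir/iommu-groups.txt"

# Driver ownership for the GPU functions (slot address is a placeholder).
if command -v lspci >/dev/null 2>&1; then
  lspci -nnk -s 01:00 > "$dir/gpu-drivers.txt"
fi

echo "baseline saved in $dir"
```

Store the directory with the change ticket; the value is in the diff after the next firmware or kernel update, not in the files themselves.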
One paraphrased idea often attributed to Werner Vogels: you build reliability by expecting failure and designing for it, not by hoping the happy path stays happy.
FAQ
1) Why does it work until I reboot the host?
Reboot changes the custody chain. Firmware may initialize the GPU differently, the host driver may bind earlier,
and IOMMU groups may be enumerated differently. Warm boots also don’t always reset consumer GPUs cleanly.
2) Is IOMMU always required for GPU passthrough?
For safe and correct passthrough on modern systems: yes. Without IOMMU, DMA isolation isn’t enforced,
and VFIO can’t safely assign the device to a guest.
3) What’s the difference between intel_iommu=on and iommu=pt?
intel_iommu=on enables the Intel IOMMU. iommu=pt keeps host-owned devices identity-mapped (“pass-through”) to reduce translation overhead.
You still get isolation for devices you bind to VFIO; you’re just not forcing translation for everything else.
4) Do I have to pass through the GPU’s audio function too?
Usually, yes. Many GPUs are multi-function devices and behave better when all related functions in the same IOMMU group are assigned together.
Skipping the audio function can cause weird init or reset behavior.
5) If my IOMMU group contains other devices, can I still pass through just the GPU?
Practically: sometimes, with ACS override or unsafe configurations. Correctly: the group is the unit of isolation.
If the group includes host-critical devices, your long-term fix is hardware/slot/topology changes, not “forcing it.”
6) Is ACS override safe?
It can be acceptable in a personal environment. It is a security and correctness trade: you’re telling the kernel to assume separation that may not exist.
If you care about isolation guarantees, avoid it and fix the platform instead.
7) Why does the VM run but I get no video output?
Because “VM runs” only means QEMU is alive. Video output depends on GPU initialization, firmware mode, and output selection.
Validate that the guest sees the GPU, and try OVMF plus a different physical output path.
8) Should I use OVMF (UEFI) or SeaBIOS?
For modern Windows and modern GPUs, OVMF is typically the least painful path. SeaBIOS can work, but it increases the chance of ROM/init quirks.
If you’re chasing post-reboot black screens, switching to OVMF is often a net win.
9) What if the host has no iGPU and only one GPU?
You can still do it, but you’re choosing “hard mode.” The host will likely initialize that GPU for console output.
Expect to fight early binding and framebuffer ownership. A cheap secondary GPU often pays for itself in saved time.
10) What’s the single most useful command when debugging this?
lspci -nnk on the GPU functions. It tells you who owns the device right now. Ownership is the story.
Conclusion: next steps you can do today
If your GPU passthrough black-screens after reboot, stop treating it like a mystical GPU mood swing.
Treat it like a systems problem: verify IOMMU is enabled and fully featured, verify VFIO binds early,
verify IOMMU groups are stable and isolated, and only then chase reset quirks and display path oddities.
Practical next steps:
Run Tasks 1–5 and save the output. That’s your baseline and your diff target.
Make VFIO binding deterministic at boot (initramfs), not “eventually.”
Re-check IOMMU groups after any BIOS or kernel change. Assume they changed until proven otherwise.
If your platform needs ACS override to behave, consider that a hardware-selection problem—not a kernel-tuning victory.
If you confirm a reset issue, plan a mitigation policy (power cycle, different GPU, or a reset helper where appropriate) instead of rebooting and hoping.
The goal isn’t to get it working once. The goal is to make it survive the next reboot without a ceremony.