You did the “simple” thing: pass a GPU to a VM. The VM boots. The monitor stays black. The guest might even be alive—RDP works, SSH works, but the local display is dead like a conference-room TV with the wrong input.
GPU passthrough on Proxmox is not hard. It’s just unforgiving. One wrong firmware choice, one bad assumption about ROM/GOP, one GPU reset quirk, and you get the same symptom: nothing on the screen.
Fast diagnosis playbook (check this first)
This is the “stop thrashing” sequence. It’s optimized for time-to-first-signal and avoids rabbit holes like tinkering with random QEMU flags before confirming the basics.
1) Confirm the guest is alive vs dead
- If Windows boots and you can RDP in: you likely have a display init problem (firmware/ROM/primary display), not a total passthrough failure.
- If the VM doesn’t boot at all: suspect IOMMU, VFIO binding, or a hard reset/freeze issue.
2) Check IOMMU is actually on, and groups make sense
If IOMMU isn’t enabled, you’re not doing passthrough; you’re doing interpretive dance.
3) Verify the GPU is bound to vfio-pci on the host
Black screens frequently happen when the host driver grabbed the card first, leaving VFIO with leftovers (or a half-initialized device).
4) Match firmware to GPU requirements: OVMF for UEFI GOP cards
Modern GPUs, especially when you want the GPU as the primary display in the guest, typically behave best with OVMF + Q35.
5) Decide whether you need a ROM file
If the GPU never shows output until the guest driver loads (or never shows output at all), you might need to supply a clean ROM with a valid GOP section—or stop the host from posting it.
6) If it worked once but fails after VM reboot: assume reset bug
Single-boot success followed by black screen on subsequent starts is classic “GPU didn’t reset” territory. Treat it like that until proven otherwise.
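The playbook above can be compressed into a two-minute sanity script that separates "IOMMU is off" from "driver binding is wrong" before you touch anything else. A hedged sketch: the helper name is made up, the GPU address 0000:01:00.0 is an assumption, and it only reads /proc/cmdline and sysfs, so it is safe to run anywhere.

```shell
# Quick triage sketch (hypothetical helper; adjust the PCI address to your GPU).
# Checks: IOMMU flag on the kernel command line, and which driver owns the device.
triage() {
  gpu=${1:-0000:01:00.0}
  if grep -qE 'intel_iommu=on|amd_iommu=on' /proc/cmdline; then
    echo "iommu: enabled on kernel cmdline"
  else
    echo "iommu: NOT on kernel cmdline (fix GRUB before anything else)"
  fi
  # The driver symlink tells you who owns the device; empty means nothing bound.
  drv=$(readlink "/sys/bus/pci/devices/$gpu/driver" 2>/dev/null)
  drv=${drv##*/}
  echo "driver for $gpu: ${drv:-none bound / device not found}"
}
triage "$@"
```

If the first line says IOMMU is off, stop: nothing downstream matters until that is fixed.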
What a “black screen” actually means in passthrough
“Black screen” is a lazy symptom. You need to pin down which of these you’re seeing:
- No signal from monitor: the GPU never drives the connector (init never happened, wrong primary, or GPU is still owned by host/firmware).
- Signal but black image: link is up, but no framebuffer is being scanned out (driver/firmware mismatch, wrong resolution, guest stuck before graphics handoff).
- Shows OVMF splash then black: firmware init succeeded; handoff to guest OS/driver is failing (Windows driver conflicts, Code 43, Linux modesetting issues).
- Works only after cold boot: reset bug or power-state issue (FLR absent, D3cold transitions, vendor-specific reset behavior).
- Works only with a dummy HDMI dongle: EDID/monitor detection issues, especially with remote sessions or headless setups.
One operational rule: separate “device assignment success” from “display output success.” VFIO can assign a GPU perfectly while the guest never manages to light up the screen.
Exactly one quote, because it applies here, paraphrased from James Hamilton (Amazon): reliability comes from disciplined operations and eliminating classes of failure, not hero debugging at 2 a.m.
Interesting facts and historical context (because this mess has a backstory)
- PCI passthrough predates “homelab GPU passthrough” by decades. Enterprise hypervisors used direct device assignment for NICs and HBAs long before gamers discovered VFIO.
- UEFI GOP changed the early-boot display game. Legacy VGA option ROMs were enough for BIOS-era boot; UEFI firmware prefers a GOP-capable ROM for clean graphics init.
- OVMF is essentially UEFI for QEMU. It’s the open-source firmware that makes UEFI VMs practical in KVM environments.
- IOMMU was driven by DMA safety as much as virtualization. Without DMA remapping, a device can scribble over arbitrary memory. Passthrough without IOMMU is a security horror show.
- ACS (Access Control Services) exists to isolate PCIe traffic. Consumer boards often skimp on it, which is why people reach for ACS override and then act surprised when isolation is imperfect.
- “Reset” is not a single mechanism. FLR, bus reset, power-state transitions, and vendor-specific GPU resets all behave differently.
- Many GPUs are multi-function devices. The audio function (HDMI/DP audio) and the USB controller (on some cards) must be passed through together, or you get weird behavior.
- NVIDIA’s Code 43 history shaped passthrough culture. Driver behavior around virtualized environments pushed people into workarounds, then into better-supported configurations.
- Resizable BAR changed expectations. Large BAR mappings can stress firmware/host setup and sometimes expose bugs in platform initialization.
Joke 1: GPU passthrough is the only hobby where “I changed one line in GRUB and rebooted eight times” counts as a quiet evening.
UEFI/OVMF vs SeaBIOS: pick the one that matches reality
If you remember one thing: don’t pick SeaBIOS because it feels “simpler.” Pick firmware based on what the GPU and guest OS expect.
When OVMF is the right call
- You want Windows 11, Secure Boot off/on, or modern UEFI boot.
- You want the passed-through GPU to be the primary display.
- Your GPU’s option ROM is UEFI-first (common for newer cards).
- You want cleaner behavior with Q35 machine type and PCIe topology.
When SeaBIOS can still work
- Older GPUs with robust legacy VGA ROMs.
- Older guest OS installs that were built for BIOS boot.
- Specific edge cases where UEFI GOP init is the problem and legacy init works.
Common OVMF gotchas that look like “GPU is broken”
- Wrong display device order: if you leave a virtual VGA (like SPICE/QXL) and the guest uses it as primary, your passed-through GPU may not show until drivers load—or ever.
- OVMF variable store issues: a corrupted NVRAM file can trap you in weird boot states. Recreate it instead of negotiating with it.
- CSM expectations: UEFI-only guests may not boot off BIOS-formatted disks; this can masquerade as “no GPU output.”
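When the OVMF variable store is the suspect, the cleanest Proxmox move is to delete the EFI disk and create a fresh one rather than debugging its contents. A sketch, assuming VM ID 101 and a storage named local-lvm (substitute your own); these are standard qm options:

```shell
# Remove the existing EFI vars disk, then create a fresh one.
# VM ID 101 and storage local-lvm are assumptions; use yours.
qm set 101 --delete efidisk0
qm set 101 --efidisk0 local-lvm:1,efitype=4m,pre-enrolled-keys=0
```

The guest's boot entries are gone afterward; OVMF will re-enumerate bootable devices on the next start.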
ROM files, GOP, and why your GPU sometimes “needs a ROM”
A GPU ROM (option ROM) is the firmware blob that initializes the card during early boot. In passthrough, QEMU can try to use the device’s ROM, but several things get in the way:
- The host firmware may have already posted the GPU, leaving it in a state the guest firmware doesn’t expect.
- The device may not expose a readable ROM region once the host driver binds to it.
- Some vendor ROMs include both legacy VGA and UEFI GOP; some include one; some are… “creative.”
When supplying a ROM file is a proven fix
- Black screen at boot, but GPU works in other machines or bare metal.
- OVMF never shows its splash on the GPU output.
- GPU is not the primary boot display and never gets initialized properly.
- Multi-GPU setups where the host posts one GPU and you pass through the other.
When a ROM file makes things worse
- You use a ROM dumped from a different SKU or different board vendor revision.
- The ROM contains board-specific straps that don’t match your exact card.
- You trim or “fix” the ROM incorrectly (especially on NVIDIA cards with headers).
In production terms: a ROM file is a surgical tool, not a good-luck charm. If you don’t have a reason, don’t introduce another variable.
PCIe reset bugs: FLR, D3hot/D3cold, and the AMD reset mess
Reset is where most “it worked yesterday” stories come from. A GPU can be passed through cleanly, used, and then fail to come back after a guest reboot or a VM stop/start.
Three reset patterns you’ll see
- Cold-boot only: GPU works after a full host power cycle, not after VM restart. That’s a reset/initialization gap.
- Host driver contamination: host driver binds once (even briefly), initializes the GPU, and later VFIO hands a weirdly configured device to the guest.
- Partial function reset: the graphics function resets, but the audio function (or USB function) doesn’t, causing enumeration or driver weirdness.
FLR isn’t guaranteed
Function Level Reset is optional. Some GPUs claim it, some don’t implement it properly, and some platforms don’t propagate it cleanly. You can’t believe your way into FLR; check what the device actually advertises.
The AMD “reset bug” reality
Older AMD cards (and some not-so-old ones) are infamous for failing to reset into a clean state for VFIO reuse. The community workarounds range from vendor-reset modules to toggling power states; results vary with kernel versions, card generations, and motherboard behavior.
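On recent kernels (roughly 5.15 and later) you can ask the kernel which reset mechanisms it believes a device supports. A hedged sketch; the helper name is made up, and the sysfs file simply may not exist on older kernels:

```shell
# Print the reset methods the kernel will try for a device, if exposed.
# Typical values: flr, pm, bus, device_specific. No 'flr' here is a strong hint.
show_reset_method() {
  f="/sys/bus/pci/devices/$1/reset_method"
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "reset_method not exposed (older kernel or no such device)"
  fi
}
show_reset_method "${1:-0000:01:00.0}"
```

On affected AMD cards, this is where vendor-specific reset support (or its absence) becomes visible instead of being guessed at.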
Joke 2: The fastest way to test if you have a reset bug is to reboot the VM—your GPU will immediately file for divorce.
Host binding, vfio-pci, and the driver tug-of-war
The host must not “help.” If the host grabs the GPU with nvidia, amdgpu, or nouveau before VFIO does, you often get a black screen, or worse: a GPU that half-works and then wedges later.
What “correct” looks like
- At boot, the GPU functions are bound to vfio-pci.
- The host console uses a different GPU, iGPU, or serial console.
- Proxmox starts the VM, QEMU claims the PCIe device, and the guest driver sees a clean, exclusive GPU.
What “almost correct” looks like
- The GPU is bound to vfio-pci, but the host framebuffer driver still has it mapped.
- The HDMI audio function is still bound to snd_hda_intel.
- There’s an IOMMU group with a USB controller or SATA controller glued to the GPU, and you passed only the GPU.
Q35 vs i440fx, primary GPU, and display devices that sabotage you
In Proxmox, you choose a machine type. Treat it like hardware.
Q35 is usually the sane default
Q35 models a more modern chipset with PCIe root ports. That matters for GPUs that expect PCIe behavior, and it can influence reset behavior and device enumeration order.
i440fx is legacy compatibility
Sometimes you use it because an old guest expects it. For modern GPU passthrough, Q35 tends to reduce weirdness.
Primary GPU passthrough: remove competing virtual displays
If you want your passed-through GPU to be the boot display, don’t leave a virtual VGA device as the obvious primary. In Proxmox terms: set vga: none if you’re serious about using the physical GPU for console output.
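In Proxmox CLI terms, that combination looks roughly like this (VM ID 101 is an assumption; the options themselves are standard qm settings):

```shell
# Make the passed-through GPU the real console: UEFI firmware, Q35 machine,
# and no competing virtual display device.
qm set 101 --bios ovmf
qm set 101 --machine q35
qm set 101 --vga none
```

After --vga none, the Proxmox web console for that VM will show nothing; that is expected, not a regression.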
Hands-on tasks (commands, what output means, and what decision you make)
These are real operations tasks. Run them on the Proxmox host unless stated otherwise. Each one includes the “so what” decision, because output without a decision is just logging for fun.
Task 1: Confirm the CPU and kernel see virtualization extensions
cr0x@server:~$ egrep -m1 '(vmx|svm)' /proc/cpuinfo
flags : fpu vme de pse tsc ... vmx ...
Meaning: vmx (Intel) or svm (AMD) must exist. If it doesn’t, no IOMMU/VFIO journey ends well.
Decision: If missing, fix BIOS settings (VT-x/AMD-V) before touching Proxmox configs.
Task 2: Confirm IOMMU is enabled in the kernel command line
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
Meaning: Look for intel_iommu=on or amd_iommu=on. iommu=pt is common to reduce overhead for non-passthrough devices.
Decision: If absent, edit GRUB and reboot; don’t proceed until it’s present.
Task 3: Verify DMAR/IVRS shows IOMMU actually initialized
cr0x@server:~$ dmesg | egrep -i 'DMAR|IOMMU|IVRS' | head -n 8
[ 0.824112] DMAR: IOMMU enabled
[ 0.824980] DMAR: Host address width 39
[ 0.835010] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
Meaning: You want “IOMMU enabled” and no fatal faults.
Decision: If you see disable messages or faults, fix BIOS (VT-d/AMD-Vi) or kernel parameters before continuing.
Task 4: Identify the GPU and all its functions (graphics + audio + extras)
cr0x@server:~$ lspci -nn | egrep -i 'vga|3d|audio|nvidia|amd'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660] [10de:2184]
01:00.1 Audio device [0403]: NVIDIA Corporation TU116 High Definition Audio Controller [10de:1aeb]
Meaning: Note the device IDs and functions. The audio function is not optional in practice; pass it too.
Decision: Plan to passthrough both 01:00.0 and 01:00.1.
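To avoid typos when you later write the vfio.conf ids= line (Task 8), you can extract the vendor:device pairs mechanically instead of retyping them. A small sketch; the slot 01:00 is an assumption and the function name is made up for illustration:

```shell
# Build the comma-separated vendor:device list for every function on one slot,
# from `lspci -nn` style output fed on stdin.
build_vfio_ids() {
  slot=$1
  # Only [xxxx:xxxx] hex pairs match; class codes like [0300] have no colon.
  grep "^${slot}\." | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | paste -sd, -
}

# Usage on a live host: lspci -nn | build_vfio_ids "01:00"
```

The output drops straight into `options vfio-pci ids=...`, which keeps the graphics and audio functions paired by construction.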
Task 5: Check IOMMU groups (the isolation truth)
cr0x@server:~$ for d in /sys/kernel/iommu_groups/*/devices/*; do echo "$(basename "$(dirname "$(dirname "$d")")") $(lspci -nns "${d##*/}")"; done | sort -n | sed -n '1,18p'
1 00:01.0 Host bridge [0600]: Intel Corporation Device [8086:1237]
12 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660] [10de:2184]
12 01:00.1 Audio device [0403]: NVIDIA Corporation TU116 High Definition Audio Controller [10de:1aeb]
Meaning: Your GPU functions should ideally be in a group by themselves (or at least only with each other).
Decision: If the group includes unrelated devices (USB/SATA), you need a different slot/motherboard setting, or you accept ACS override risk (see later).
Task 6: Confirm what driver currently owns the GPU
cr0x@server:~$ lspci -nnk -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660] [10de:2184]
Subsystem: Micro-Star International Co., Ltd. [MSI] TU116 [1462:12a3]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
Meaning: “Kernel driver in use” must be vfio-pci. The modules listed are just available, not necessarily loaded.
Decision: If it says nvidia or amdgpu, fix driver binding (Task 7/8).
Task 7: Load VFIO modules and verify they exist
cr0x@server:~$ lsmod | egrep 'vfio|kvm'
vfio_pci 16384 0
vfio_pci_core 90112 1 vfio_pci
vfio_iommu_type1 40960 0
vfio 69632 2 vfio_iommu_type1,vfio_pci_core
kvm_intel 376832 0
kvm 1105920 1 kvm_intel
Meaning: If VFIO modules aren’t loaded, Proxmox may still load them on demand, but explicit is better than mystical.
Decision: If missing, ensure vfio modules are in /etc/modules and rebuild initramfs.
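The explicit version of that decision looks like this in /etc/modules. A hedge on module names: on kernels 6.2 and later, vfio_virqfd was folded into the core vfio module, so older guides listing four modules are not wrong, just dated; check your kernel's module list.

```
vfio
vfio_iommu_type1
vfio_pci
```

After editing, rebuild the initramfs (Task 9) so the modules are present at early boot, not just after userspace comes up.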
Task 8: Bind the GPU to vfio-pci using device IDs (durable method)
cr0x@server:~$ cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2184,10de:1aeb disable_vga=1
Meaning: This tells the kernel to bind those PCI IDs to VFIO early. disable_vga=1 can help reduce legacy VGA arbitration drama.
Decision: If you change this file, rebuild initramfs and reboot (Task 9).
Task 9: Rebuild initramfs and confirm it updated
cr0x@server:~$ update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.8.12-2-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Meaning: Your early boot environment now knows about VFIO binding rules.
Decision: Reboot before testing again; hot changes are not a reliable experiment here.
Task 10: Check whether the host still has a framebuffer on the GPU
cr0x@server:~$ dmesg | egrep -i 'fb0|framebuffer|efifb|simplefb|vesafb' | tail -n 8
[ 0.612345] simple-framebuffer simple-framebuffer.0: framebuffer at 0xe0000000, 0x300000 bytes
[ 0.612400] fb0: simplefb registered!
Meaning: If the host is using the passed-through GPU for a framebuffer console, you may get black screens or resets that never fully work.
Decision: Prefer headless console (serial), iGPU, or a second GPU for the host. If you must, disable host framebuffer usage (platform-specific).
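On UEFI-booting Proxmox hosts, the usual way to keep the kernel from putting a console framebuffer on the passthrough GPU is a kernel parameter. A hedged sketch; the exact parameter varies by kernel generation (older kernels used video=efifb:off, newer ones blacklist the generic framebuffer setup):

```
# Appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then update-grub:
initcall_blacklist=sysfb_init
```

The trade-off is honest: with no framebuffer, the host console on that GPU is gone, so have serial or SSH access lined up before you reboot.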
Task 11: Confirm the VM config uses OVMF + Q35 and doesn’t steal primary display
cr0x@server:~$ qm config 101
boot: order=scsi0;ide2;net0
bios: ovmf
machine: q35
scsi0: local-lvm:vm-101-disk-0,size=128G
hostpci0: 0000:01:00.0,pcie=1,x-vga=1
hostpci1: 0000:01:00.1,pcie=1
vga: none
Meaning: bios: ovmf and machine: q35 are the usual “works more often” combo. vga: none avoids Proxmox adding a competing display.
Decision: If you see bios: seabios and modern GPU/OS, switch to OVMF and reinstall or convert boot mode as needed.
Task 12: Watch QEMU/VFIO errors during VM start
cr0x@server:~$ journalctl -u pvedaemon -u pve-qemu-server --since "10 min ago" | tail -n 60
... vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
... qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
... start failed: QEMU exited with code 1
Meaning: “stuck in D3” is a power-state/reset issue. This is not a Windows driver problem; it’s host/device reset behavior.
Decision: Treat as reset bug: consider kernel upgrades, slot changes, disabling ASPM, or vendor-reset approaches (Task 16/17).
Task 13: Dump the GPU ROM (only if you have a reason)
cr0x@server:~$ echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/rom
1
cr0x@server:~$ sudo cat /sys/bus/pci/devices/0000:01:00.0/rom > /root/gpu.rom
cr0x@server:~$ echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/rom
0
cr0x@server:~$ ls -lh /root/gpu.rom
-rw-r--r-- 1 root root 256K Jan 8 12:14 /root/gpu.rom
Meaning: You got a ROM image of plausible size. If this fails with “Operation not permitted” or zero bytes, the device may be bound/posted in a way that blocks reads.
Decision: If you can dump, try it as a ROM file. If you can’t, don’t brute-force—fix binding/boot posting first.
Task 14: Attach the ROM file to the VM (targeted experiment)
cr0x@server:~$ qm set 101 --hostpci0 0000:01:00.0,pcie=1,x-vga=1,romfile=gpu.rom
update VM 101: -hostpci0 0000:01:00.0,pcie=1,x-vga=1,romfile=gpu.rom
Meaning: Proxmox will pass that ROM to QEMU. The ROM must exist under /usr/share/kvm/ or be referenced correctly depending on Proxmox expectations; place it properly before rebooting the VM.
Decision: If this fixes early boot display, keep it. If it breaks driver init, remove it and reassess (wrong ROM or unnecessary ROM).
Task 15: Check for “in use” consumers that block passthrough
cr0x@server:~$ fuser -v /dev/nvidia0 2>/dev/null || true
USER PID ACCESS COMMAND
/dev/nvidia0: root 1321 F.... nvidia-persistenced
Meaning: If NVIDIA persistence daemon or a display manager is using the GPU, VFIO assignment will be contaminated.
Decision: Stop/disable consumers on the host. Passthrough hosts should not run desktop stacks on the same GPU you’re passing through.
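Disabling the common host-side consumers is one-time hygiene. A sketch, assuming the systemd unit names as typically shipped (verify with systemctl list-units before trusting them):

```shell
# Stop and disable the persistence daemon so it never re-opens the GPU.
systemctl disable --now nvidia-persistenced
# If a display manager ever got installed on the host, keep it off too.
systemctl disable --now display-manager 2>/dev/null || true
```

Re-run the fuser check afterward; an empty result is the pass condition.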
Task 16: Inspect reset capability and power management hints
cr0x@server:~$ lspci -vv -s 01:00.0 | egrep -i 'LnkCap|LnkSta|ASPM|FLR|PM' | head -n 30
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
LnkSta: Speed 8GT/s, Width x16
Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
Capabilities: [d0] Power Management version 3
Kernel driver in use: vfio-pci
Meaning: You’re looking for clues: ASPM presence, power management version, and whether the platform might be forcing aggressive power states.
Decision: If you see repeated D3-related errors in logs, consider disabling ASPM in BIOS or kernel parameters as a test.
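The kernel-parameter version of that test is a blunt instrument: it disables ASPM for every PCIe device on the host, so treat it as a diagnostic, not a fix.

```
# Temporary test: add to the kernel command line, reboot, then re-run Task 17.
pcie_aspm=off
```

If the D3 errors disappear with this set, you have localized the problem to power management and can narrow it down in BIOS instead of leaving a global hammer in place.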
Task 17: Detect the “works once” reset bug using a controlled reboot loop
cr0x@server:~$ qm start 101
started VM 101
cr0x@server:~$ sleep 30
cr0x@server:~$ qm shutdown 101 --timeout 90
stopping VM 101 (shutdown)
cr0x@server:~$ qm start 101
started VM 101
Meaning: If the first start gives display and the second start gives black screen, you’ve just reproduced a reset bug without superstition.
Decision: Move into reset mitigations: kernel upgrade, different slot, avoid host posting, or vendor-reset module for affected AMD GPUs.
Task 18: Validate that you passed through the whole IOMMU group when required
cr0x@server:~$ ls -l /sys/kernel/iommu_groups/12/devices/
total 0
lrwxrwxrwx 1 root root 0 Jan 8 12:20 0000:01:00.0 -> ../../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
lrwxrwxrwx 1 root root 0 Jan 8 12:20 0000:01:00.1 -> ../../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
Meaning: Only GPU and its audio are in the group. That’s good. If there were more, passthrough might require them too (or you need better isolation).
Decision: If group includes “other stuff,” either passthrough all of it (often unacceptable) or change hardware/topology.
Three corporate mini-stories (anonymized, plausible, technically accurate)
Incident caused by a wrong assumption: “UEFI is optional”
A mid-size company wanted GPU-enabled Windows VMs for a CAD team. The virtualization team had Proxmox in place, and someone had successfully passed through an HBA before, so the vibe was “same thing, different PCI device.” They built a template VM with SeaBIOS because it matched older server templates. It booted fine over SPICE, so they called it done.
On rollout day, they added GPUs. Black screens everywhere on physical monitors, but RDP worked. They blamed drivers, then cables, then the monitors, then Windows updates—classic deflection spiral.
The real issue was simple: the passed-through GPU never became the early boot display because the firmware path didn’t match the card’s expectations. SeaBIOS + the card’s ROM behavior + a competing virtual VGA meant the guest never initialized the output in a way that matched their physical workflow.
The fix was boring: switch to OVMF, set Q35, remove virtual VGA (vga: none), and pass the GPU as primary with x-vga=1. They also rebuilt the VM template to avoid repeating the mistake. The postmortem had one key line: “We assumed firmware choice was cosmetic.”
Optimization that backfired: ACS override as a “quick win”
A different org ran Proxmox on consumer-grade boards because budget. Their IOMMU groups were ugly: GPU grouped with a USB controller and an NVMe controller behind the same root complex. The team wanted GPU passthrough for an AI inference service, and they wanted it yesterday.
They enabled ACS override to split groups and celebrated when the GUI showed clean isolation. They pushed it to production quickly. It worked. For a while.
Weeks later they had intermittent host freezes during VM restarts. Logs pointed to PCIe errors and occasional filesystem hiccups on the “separated” NVMe device. ACS override didn’t magically add hardware isolation; it changed how the kernel pretended devices were isolated. Under certain traffic patterns, devices still interacted in ways the IOMMU grouping was trying to ignore.
The eventual solution was not a clever kernel parameter. They moved the workload to a platform with proper PCIe isolation and validated groups before deployment. ACS override remained a lab-only tool with a big warning label: “May work; may also create novel failure modes.”
Boring but correct practice that saved the day: cold-boot discipline and change control
A team running GPU-backed Linux VMs for video processing had a habit: whenever they changed VFIO binding, firmware type, or ROM settings, they performed a full host reboot and documented the before/after outputs of lspci -nnk and the VM config. No “hot swapping” of low-level plumbing.
It sounded slow. It was. It also prevented them from stacking multiple unverified changes and then guessing which one mattered.
When they hit a black screen after a Proxmox kernel update, their rollback was clean: they compared the “known good” command outputs, saw the GPU was now being claimed by a framebuffer during early boot, and fixed it decisively rather than “try a different ROM because why not.”
The net effect was operational calm. They treated passthrough changes like storage changes: measure, change one thing, reboot, verify ownership, then proceed.
Common mistakes: symptom → root cause → fix
1) Symptom: VM boots (RDP works) but physical monitor stays black
Root cause: Virtual VGA is primary; passed-through GPU isn’t initialized as boot display, or the guest chooses the wrong adapter.
Fix: Use bios: ovmf, machine: q35, set vga: none, and enable x-vga=1 on the GPU. Confirm monitor is connected to the passed-through GPU, not the host output.
2) Symptom: Black screen until OS loads, then sometimes works
Root cause: Missing/quirky GOP init at firmware stage; guest driver later enables display.
Fix: Prefer OVMF, consider supplying a ROM file with a valid GOP, and ensure you’re not using SeaBIOS with a UEFI-first ROM.
3) Symptom: Works after host cold boot; fails after VM reboot
Root cause: Reset bug (no FLR, device stuck in D3, incomplete function reset).
Fix: Upgrade kernel/Proxmox, try different PCIe slot, disable aggressive power management/ASPM as a test, and ensure all GPU functions are passed through together. For affected AMD generations, consider vendor-reset approach.
4) Symptom: VM fails to start, “stuck in D3” or BAR restore errors
Root cause: GPU power state can’t be brought back up by VFIO; platform refuses to power it on cleanly after prior use.
Fix: Avoid host driver binding, ensure clean shutdown, test cold boot, adjust BIOS power settings, and consider moving the GPU to a slot with different power gating behavior.
5) Symptom: Random instability when under load after enabling ACS override
Root cause: Fake isolation; devices still share a path and can interfere; DMA isolation assumptions fail under contention.
Fix: Use hardware with real isolation. Treat ACS override as a last resort and avoid it for critical workloads.
6) Symptom: Guest sees GPU but driver fails (Windows shows error or falls back)
Root cause: Driver policy, virtualization detection quirks, or incorrect device topology.
Fix: Use Q35 + OVMF, hide KVM where appropriate in Proxmox CPU settings, and avoid mixing virtual display adapters with primary passthrough use.
7) Symptom: No output on one DP/HDMI port, works on another
Root cause: Port init order and link training quirks during early boot; some cards prefer certain ports for pre-OS output.
Fix: Test alternate ports/cables first. Seriously. It’s not elegant, but it’s real.
8) Symptom: VM console in Proxmox shows nothing after setting vga: none
Root cause: That’s expected—there is no virtual console.
Fix: Use physical monitor on the GPU, or rely on RDP/SSH. If you need emergency access, temporarily add a virtual display device for debugging, then remove it.
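The temporary-debug toggle is two commands (VM ID 101 assumed; vga values are standard Proxmox options):

```shell
# Add a basic virtual display for emergency console access...
qm set 101 --vga std
# ...and remove it again once debugging is done, so it cannot steal primary.
qm set 101 --vga none
```

Leaving the virtual display in place defeats the primary-GPU setup, so make removal part of the same change, not a follow-up you might forget.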
Checklists / step-by-step plan
Checklist A: First-time GPU passthrough with the fewest surprises
- Enable VT-d/AMD-Vi in BIOS/UEFI.
- Enable IOMMU in GRUB (intel_iommu=on or amd_iommu=on) and reboot.
- Verify IOMMU is enabled via dmesg.
- Identify GPU functions via lspci -nn and note both GPU + audio IDs.
- Verify clean IOMMU grouping; if groups are messy, stop and rethink hardware/topology.
- Bind GPU functions to vfio-pci via /etc/modprobe.d/vfio.conf.
- Rebuild initramfs, reboot, confirm Kernel driver in use: vfio-pci.
- Create VM with bios: ovmf and machine: q35.
- Add GPU as hostpci with pcie=1 and (if primary) x-vga=1.
- Set vga: none if you want the physical GPU to be the console.
- Boot VM, test physical output, then install guest drivers.
Checklist B: Debugging a black screen without cargo-culting
- Confirm guest boots via RDP/SSH. If not, focus on VFIO/IOMMU errors first.
- Check Proxmox logs for VFIO power/reset complaints.
- Confirm GPU is bound to vfio-pci and host isn’t using it as framebuffer.
- Remove competing virtual VGA (vga: none) and use OVMF + Q35.
- Pass all GPU functions (audio, USB if present) together.
- If it works only after cold boot, stop tweaking ROMs and address reset behavior.
- Only then try a ROM file, and only from your exact card.
- If you used ACS override, treat any instability as “expected” and plan hardware remediation.
Checklist C: Stabilizing reset-prone GPUs operationally
- Avoid VM stop/start loops; prefer suspend/resume patterns only if you’ve validated them.
- Use controlled shutdown and a wait period before restart; some cards behave better after a clean quiesce.
- Keep Proxmox kernel reasonably current; reset handling improves over time.
- Document known-good combinations: kernel version, Proxmox version, VM machine type, ROM usage, slot.
- For “cannot reset” cards, plan for host power cycles as part of maintenance windows (ugly, but reliable).
FAQ
1) Why does RDP work but the physical display is black?
The guest OS is running, but the passed-through GPU isn’t the primary display or never initialized for early boot. Fix firmware/topology: OVMF + Q35, remove virtual VGA, use x-vga=1 when appropriate.
2) Do I always need a GPU ROM file?
No. Most stable setups do not require a ROM file. Use a ROM only when you have a specific early-init problem and you can dump a correct ROM from your exact card.
3) What does “stuck in D3” mean in Proxmox logs?
The GPU is in a low-power state and VFIO/QEMU can’t bring it back. That’s a reset/power-management/platform issue, not a guest driver issue.
4) Should I use Q35 or i440fx?
Use Q35 for modern GPU passthrough unless you have a specific compatibility requirement. It models PCIe topology more realistically.
5) Can I passthrough only the GPU function and ignore HDMI audio?
Sometimes it “works,” but it’s a bad habit. Pass all functions in the IOMMU group, especially the audio function. Partial passthrough is a great way to get intermittent weirdness.
6) Why does it work after a host reboot but not after restarting the VM?
Classic reset bug. The GPU doesn’t reset cleanly between assignments. Focus on reset mitigations: kernel updates, slot changes, power management tweaks, and known workarounds for your GPU generation.
7) Is ACS override safe?
It’s a trade: it can make passthrough possible on consumer boards, but it can also reduce the isolation guarantees you thought you had. Use it for labs. Be cautious in production.
8) I set vga: none and now Proxmox console is blank. Did I break it?
No. You removed the virtual display device, so the Proxmox console has nothing to show. That’s the point if the physical GPU is the console.
9) Should I enable Resizable BAR?
If you’re chasing stability, leave it off until passthrough is stable. ReBAR can add another layer of platform and firmware complexity. Turn it on later as a controlled change.
10) Do I need to pin CPU cores or tweak hugepages to fix black screen?
Almost never. Those are performance knobs. Black screen is usually firmware, VFIO binding, IOMMU grouping, or reset behavior.
Conclusion: practical next steps
If you’re staring at a black screen, stop guessing. Start with the fast diagnosis playbook: confirm IOMMU, confirm VFIO binding, confirm OVMF + Q35, remove competing virtual VGA, and pass all GPU functions. That sequence fixes a large fraction of real-world cases without folklore.
If it works once and fails on restart, treat it as a reset bug until you prove otherwise. That’s not pessimism; it’s pattern recognition. Stabilize with platform choices, kernel updates, and clean device ownership. Use ROM files as targeted tools, not rituals.
Next steps you can do today:
- Run Task 5 (IOMMU groups) and Task 6 (driver ownership). If either is wrong, fix that before touching anything else.
- Align the VM with Task 11 (OVMF/Q35/vga none) and retest.
- Reproduce reset behavior with Task 17 and decide whether you’re in “reset mitigation mode.”