You can have the right GPU, the right hypervisor, and the right tutorial. And still end up staring at a VM that boots to a black screen,
or worse: boots fine until it randomly wedges your host at 2 a.m. The usual culprit isn’t your kernel version or a missing driver.
It’s one BIOS toggle you didn’t treat like production infrastructure.
That switch is IOMMU (Intel VT-d / AMD-Vi). When it’s right, GPU passthrough is boring. When it’s wrong, nothing you do in software matters.
This is the field guide to understanding it, proving it’s working, and diagnosing the failure modes fast—like you have an outage bridge open.
What IOMMU actually is (and why GPU passthrough needs it)
The IOMMU is a memory-management unit for devices. CPUs have an MMU that translates virtual addresses to physical addresses,
enforcing isolation between processes. The IOMMU does the same kind of translation and isolation, but for DMA (Direct Memory Access)
requests coming from PCIe devices like GPUs, NICs, NVMe controllers, and HBAs.
Without an IOMMU, a device doing DMA can effectively say: “I’d like to read/write physical memory at address X.”
If that device is buggy, misconfigured, or compromised, it can scribble over the host kernel, another VM’s memory, or a secrets page.
With an IOMMU, the host sets up a translation table so the device only sees the memory the host intends.
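From the host, this is visible state, not an abstraction: if the kernel has active translation units, they appear under /sys/class/iommu. A minimal sketch (assuming standard Linux sysfs paths) to check for them:

```shell
#!/bin/sh
# Minimal sketch: does the kernel expose any IOMMU translation units?
# /sys/class/iommu is populated only when DMA remapping hardware is active.
iommu_visible() {
  dir="${1:-/sys/class/iommu}"   # path is parameterized so it can be dry-tested
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if iommu_visible; then
  echo "IOMMU exposed to the kernel"
else
  echo "no IOMMU visible: firmware or kernel parameters need attention"
fi
```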
GPU passthrough in one sentence
GPU passthrough means the VM’s driver talks to a real GPU over PCIe as if it owned the hardware, while the hypervisor uses IOMMU mappings
to keep the device’s DMA confined to that VM’s memory.
Why “it boots” is not proof
A VM can sometimes boot with a passed-through GPU even when key IOMMU features are missing or misconfigured, especially if the workload
doesn’t immediately trigger heavy DMA patterns. Then you hit a game load, a CUDA job, or a driver reset—and the host goes sideways.
That’s not bad luck. That’s the physics of DMA meeting an incomplete isolation boundary.
The layers you’re actually debugging
- Firmware: BIOS/UEFI switches (VT-d / AMD-Vi), SR-IOV, “Above 4G decoding,” “Resizable BAR.”
- Kernel: IOMMU enabled, interrupt remapping enabled, DMA translation mode (strict vs lazy).
- PCIe topology: Root ports, ACS, and whether devices land in separable IOMMU groups.
- Driver binding: Host driver vs VFIO driver, and who grabs the GPU at boot.
- Hypervisor config: QEMU/KVM arguments, OVMF vs SeaBIOS, machine type (q35 matters).
If you treat IOMMU like “some virtualization thing,” you’ll keep doing cargo-cult config changes until it works… and then it won’t.
Treat it like an isolation primitive with observable states, and you’ll get predictable outcomes.
Interesting facts and history you can use in meetings
Here are concrete, non-trivia points that help explain why IOMMU exists and why it’s still annoyingly easy to misconfigure:
- DMA predates modern virtualization by decades. Early high-performance devices used DMA to avoid CPU copy overhead—great for speed, awkward for isolation.
- IOMMU is a response to shared-bus chaos. As PCI and later PCIe proliferated, the “trust every device” model became a security and reliability liability.
- Intel calls it VT-d; AMD calls it AMD-Vi. Different branding, same idea: remap device DMA through translation tables.
- Modern OSes increasingly assume IOMMU exists. It’s not only for VMs; it’s also used for kernel hardening and safe device assignment.
- Interrupt remapping became part of the story. It’s not just DMA. Devices also generate interrupts; remapping helps prevent certain classes of misrouting or interrupt injection issues.
- ACS (Access Control Services) is not a “nice-to-have.” ACS affects how PCIe traffic is isolated between endpoints; weak ACS often means “everything is in one IOMMU group.”
- GPU vendors weren’t thrilled at first. Early passthrough was temperamental because consumer GPUs and drivers weren’t built expecting VM device assignment.
- Above 4G decoding matters because GPUs are memory-hungry devices. Large BAR mappings and modern platforms push you into address-space features that old defaults don’t handle well.
The BIOS/UEFI switch: names, traps, and what “enabled” really means
The most expensive mistake in GPU passthrough is assuming the BIOS label matches the actual hardware state.
On many systems you must enable multiple options, and the labels vary by vendor and motherboard generation.
The core requirement is: the CPU + chipset must expose an IOMMU and the firmware must not disable it.
What the setting is called
- Intel: “Intel VT-d,” “VT-d,” “IOMMU,” sometimes “Directed I/O.”
- AMD: “SVM” (CPU virtualization) is not IOMMU; you want “IOMMU” or “AMD-Vi,” often buried in chipset/North Bridge menus.
- Server platforms: You may also see “Interrupt Remapping,” “DMA Protection,” “PCIe ACS,” “SR-IOV.”
BIOS traps that look unrelated but are not
- CSM vs pure UEFI: Many GPU passthrough setups behave better with OVMF (UEFI) and CSM disabled.
- Above 4G decoding: Often required for modern GPUs, especially if you pass through multiple devices or use large BAR features.
- Resizable BAR: Can change BAR sizing and mapping behavior. Useful sometimes, destabilizing other times. Don’t turn it on “because performance.”
- PCIe slot bifurcation: Can change topology; topology changes IOMMU groups.
Joke #1: BIOS menus are the only place where “Enabled” can mean “maybe, depending on your mood and microcode.”
One quote to keep your team honest
Hope is not a strategy.
— paraphrased idea often attributed in engineering/ops circles
Translation for IOMMU: don’t “hope” it’s enabled because you toggled something once. Verify from the OS.
IOMMU groups: the boundary that decides whether you can isolate a GPU
Even with IOMMU enabled, you may still be stuck if your GPU shares an IOMMU group with other devices you need to keep on the host
(like a USB controller you need for keyboard input, or a SATA controller that holds the host’s boot disk). IOMMU groups represent
the smallest isolation units your platform can safely enforce for DMA.
Why groups exist
Devices behind the same upstream PCIe bridge might be able to see each other’s traffic unless the platform supports ACS and proper isolation.
The kernel groups such devices together because it cannot guarantee separation. Passing through one means passing through the whole group.
What you want to see
- The GPU (and its HDMI audio function) in an IOMMU group that contains only those functions.
- NICs and storage controllers in separate groups unless you intentionally pass them through.
- No “everything lumped into one giant group” layouts; that usually signals the IOMMU is disabled or not exposed properly.
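To eyeball this quickly, here is a sketch that prints one line per device with its group number (assuming the standard sysfs layout), so a coarse grouping is obvious at a glance:

```shell
#!/bin/sh
# Sketch: one line per PCI device with its IOMMU group number. A healthy
# layout shows many small groups; a coarse one shows everything together.
list_groups() {
  root="${1:-/sys/kernel/iommu_groups}"   # overridable for testing
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue
    group=${dev%/devices/*}   # strip the /devices/<addr> suffix
    group=${group##*/}        # keep just the group number
    printf 'group %s: %s\n' "$group" "${dev##*/}"
  done
}

list_groups | sort -V
```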
ACS override: the tempting hack
On some consumer boards, you can force the kernel to split groups with an ACS override parameter. This is often presented as a magic fix.
It is not magic; it is you telling the kernel to pretend isolation exists even if the hardware doesn’t fully provide it.
Use ACS override only when you understand the risk: you might get functional passthrough, but DMA isolation could be weaker than you think.
For a home lab that’s fine. For anything that touches production data, it’s a governance conversation.
Fast diagnosis playbook
This is the “I have 20 minutes before the change window ends” sequence. Don’t freestyle. Check these in order; each step rules out a class of problems.
First: prove IOMMU is actually on
- Check kernel boot parameters and dmesg for IOMMU initialization.
- Confirm DMAR (Intel) or AMD-Vi (AMD) tables detected.
- Confirm interrupt remapping status if you care about stability and security.
Second: validate IOMMU groups and topology
- List IOMMU groups; ensure GPU and its audio function are isolated.
- Identify what else shares the group; decide whether you can pass it through.
- If groups are too coarse: prefer moving slots / changing BIOS topology before ACS override.
Third: confirm driver binding and VFIO ownership
- Ensure the host isn’t binding the GPU to nvidia/amdgpu.
- Bind GPU functions to vfio-pci early at boot; don’t rely on “unbind later” unless you like flaky.
- Confirm /dev/vfio/* nodes exist and match your group IDs.
Fourth: check VM firmware, machine type, and reset behavior
- Use OVMF (UEFI) and q35 unless you have a reason not to.
- If the GPU won’t reset cleanly between VM reboots, plan for vendor-reset (where applicable) or a host reboot.
- Check for ROM issues, primary GPU selection, and console output routing.
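The first and third checks of this playbook collapse into a short preflight script you can run before any change window. A hedged sketch (the dmesg patterns and paths are typical, not universal; swap patterns for your platform):

```shell
#!/bin/sh
# Hedged preflight sketch: rule out each failure class before GPU workloads
# land on a node. Checks read from stdin/paths so they are easy to test.

# 1) Firmware/kernel: did DMA remapping actually initialize?
iommu_initialized() {
  grep -qiE 'DMAR: IOMMU enabled|AMD-Vi'
}

# 2) Topology: are any IOMMU groups exposed at all?
groups_present() {
  dir="${1:-/sys/kernel/iommu_groups}"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# 3) Driver side: does the VFIO control node exist?
vfio_ready() {
  [ -e "${1:-/dev/vfio/vfio}" ]
}

if dmesg 2>/dev/null | iommu_initialized && groups_present && vfio_ready; then
  echo "PREFLIGHT OK"
else
  echo "PREFLIGHT FAIL: do not schedule GPU workloads on this node"
fi
```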
Practical tasks: commands, outputs, and the decisions you make
The following tasks are intentionally hands-on. Each one includes: the command, what typical output looks like, what it means,
and the decision you should make next. If you’re doing GPU passthrough without running these, you’re operating on vibes.
Task 1: Identify your CPU vendor and virtualization flags
cr0x@server:~$ lscpu | egrep -i 'Vendor ID|Model name|Virtualization|Flags'
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E-2278G @ 3.40GHz
Virtualization: VT-x
Flags: ... vmx ...
Meaning: VT-x (Intel) or SVM (AMD) is CPU virtualization, not IOMMU. Still, if you don’t have this, you’re dead in the water for KVM.
Decision: If virtualization is missing, fix BIOS CPU virtualization first. Then come back.
Task 2: Confirm IOMMU is enabled in the kernel command line
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.5.0 root=/dev/mapper/vg0-root ro quiet intel_iommu=on iommu=pt
Meaning: intel_iommu=on (or amd_iommu=on) requests IOMMU. iommu=pt uses pass-through mode for host devices to reduce overhead.
Decision: If you don’t see the parameter, add it via GRUB/systemd-boot. If it’s present, move to dmesg proof.
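If the parameter is missing, this is typically where it goes on a GRUB-managed host (a sketch assuming Debian/Ubuntu-style tooling; systemd-boot hosts edit their loader entries instead):

```shell
# /etc/default/grub -- append to the existing line; use amd_iommu=on on AMD.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# Then regenerate the bootloader config and reboot:
#   Debian/Ubuntu: sudo update-grub
#   Fedora/RHEL:   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```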
Task 3: Check dmesg for DMAR/AMD-Vi initialization
cr0x@server:~$ dmesg | egrep -i 'DMAR|IOMMU|AMD-Vi' | head -n 25
[ 0.012345] DMAR: IOMMU enabled
[ 0.012678] DMAR: Host address width 39
[ 0.013000] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.013250] DMAR: Queued invalidation will be enabled to support IOMMU
[ 0.013500] DMAR: Intel(R) Virtualization Technology for Directed I/O
Meaning: This is the kernel telling you it found the tables and enabled DMA remapping.
Decision: If you see errors like “IOMMU disabled” or nothing at all, stop. Go back to BIOS/UEFI and boot mode.
Task 4: Validate interrupt remapping status (stability/security)
cr0x@server:~$ dmesg | egrep -i 'DMAR-IR|remapping|x2apic' | head -n 30
[ 0.020000] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 0.020500] DMAR-IR: Queued invalidation will be enabled
Meaning: IRQ remapping is enabled; that’s a good sign for correctness in complex setups.
Decision: If remapping is disabled due to firmware bugs, consider BIOS update or different kernel parameters. If you’re in a regulated environment, treat missing remapping as a risk, not a shrug.
Task 5: Enumerate PCI devices and find the GPU functions
cr0x@server:~$ lspci -nn | egrep -i 'vga|3d|audio|nvidia|amd'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080] [10de:1e87] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
Meaning: Your GPU is usually two functions: graphics and HDMI/DP audio. You pass through both.
Decision: Record the PCI addresses (01:00.0 and 01:00.1) and the vendor:device IDs (10de:1e87, 10de:10f8) for vfio-pci binding.
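A small helper can turn that lspci -nn output into the comma-separated ids= list vfio-pci expects. A sketch (the regex assumes lspci's bracketed [vendor:device] notation, and slot 01:00 is this article's example):

```shell
#!/bin/sh
# Sketch: pull [vendor:device] IDs out of lspci -nn output and join them
# into the comma-separated form that vfio-pci's ids= option expects.
pci_ids() {
  grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | paste -sd, -
}

# Typical use: both functions of the card in slot 01:00.
lspci -nn -s 01:00 2>/dev/null | pci_ids
```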
Task 6: List IOMMU groups and confirm isolation
cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "IOMMU Group ${g##*/}:"; ls -l "$g/devices"; done | sed -n '1,80p'
IOMMU Group 12:
total 0
lrwxrwxrwx 1 root root 0 Feb 4 09:12 0000:01:00.0 -> ../../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
lrwxrwxrwx 1 root root 0 Feb 4 09:12 0000:01:00.1 -> ../../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
Meaning: Group 12 contains only the GPU and its audio function. This is what “good” looks like.
Decision: If other devices are in the same group, decide whether you can pass them too. If not, try a different slot or BIOS settings before considering ACS override.
Task 7: Check what driver currently owns the GPU
cr0x@server:~$ lspci -nnk -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080] [10de:1e87] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:3fdd]
Kernel driver in use: nouveau
Kernel modules: nouveau, nvidiafb
Meaning: The host grabbed the GPU with nouveau. VFIO won’t reliably take it later.
Decision: Prevent host binding (blacklist drivers or bind to vfio-pci early). Don’t try to “just unbind” in production unless you enjoy surprises.
Task 8: Bind the GPU to vfio-pci via modprobe options
cr0x@server:~$ sudo tee /etc/modprobe.d/vfio.conf >/dev/null <<'EOF'
options vfio-pci ids=10de:1e87,10de:10f8 disable_vga=1
EOF
cr0x@server:~$ sudo tee /etc/modprobe.d/blacklist-gpu.conf >/dev/null <<'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
EOF
Meaning: You’re telling the kernel to bind those device IDs to vfio-pci and to keep the common GPU drivers from claiming them.
Decision: Rebuild initramfs (next task) so this applies early, then reboot.
Task 9: Rebuild initramfs so vfio binding happens at boot
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.5.0
Meaning: Your initramfs now includes the updated module options/blacklists, affecting early device binding.
Decision: Reboot and re-check lspci -nnk to confirm vfio-pci owns the GPU. On dracut-based distros, rebuild with dracut -f instead of update-initramfs.
Task 10: After reboot, confirm vfio-pci is the kernel driver in use
cr0x@server:~$ lspci -nnk -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080] [10de:1e87] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:3fdd]
Kernel driver in use: vfio-pci
Kernel modules: nouveau, nvidiafb
Meaning: Perfect. The module list can still mention potential modules; the key line is “Kernel driver in use.”
Decision: Proceed to VM configuration. If the driver is not vfio-pci, you still have a binding race (often initramfs not updated, or the wrong IDs).
Task 11: Confirm VFIO devices exist for the group
cr0x@server:~$ ls -l /dev/vfio/
total 0
crw------- 1 root root 244, 0 Feb 4 09:15 12
crw------- 1 root root 10, 196 Feb 4 09:15 vfio
Meaning: The numeric node (12) corresponds to IOMMU group 12. That’s how VFIO enforces “all devices in a group move together.”
Decision: If the group node doesn’t exist, VFIO isn’t set up or IOMMU isn’t working as you think. Go back to Tasks 2–6.
Task 12: Check whether the GPU is the boot VGA device (can complicate things)
cr0x@server:~$ cat /sys/bus/pci/devices/0000:01:00.0/boot_vga
1
Meaning: A 1 means firmware initialized this GPU as the boot/primary VGA device (0 means it did not). If your passthrough GPU is also the host’s console GPU, you’re signing up for extra work.
Decision: Prefer an iGPU or a cheap secondary GPU for the host console. If you can’t, you’ll need to manage framebuffer/console and possibly ROM quirks.
Task 13: Confirm KVM acceleration is present (so you’re not debugging slowness as “IOMMU issues”)
cr0x@server:~$ sudo kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
Meaning: KVM is available. If /dev/kvm is missing, you’ll get terrible performance and strange timing behavior.
Decision: Fix CPU virtualization and KVM modules before blaming VFIO.
Task 14: Spot group cohabitation quickly for a specific device
cr0x@server:~$ readlink -f /sys/bus/pci/devices/0000:01:00.0/iommu_group
/sys/kernel/iommu_groups/12
Meaning: This is the canonical group mapping. It’s script-friendly and removes guesswork.
Decision: If the group contains other devices, decide whether you can pass them too. If not, change topology (slot/bifurcation) before kernel hacks.
Task 15: Check for IOMMU being “enabled” but effectively bypassed
cr0x@server:~$ dmesg | egrep -i 'iommu.*(pt|passthrough)|DMA' | head -n 20
[ 0.050000] DMAR: IOMMU enabled
[ 0.050500] iommu: Default domain type: Passthrough (set via kernel command line)
Meaning: iommu=pt is normal for host performance. Devices assigned to VFIO still get isolated domains.
Decision: If you’re chasing edge-case security requirements, consider strict mode; otherwise leave it. Don’t remove iommu=pt as a superstition fix.
Task 16: Inspect QEMU/libvirt log for VFIO attach errors
cr0x@server:~$ sudo tail -n 60 /var/log/libvirt/qemu/win11-gpu.log
2026-02-04T09:21:10.123456Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0,id=hostdev0: vfio 0000:01:00.0: failed to setup container for group 12: Failed to set iommu for container: Operation not permitted
2026-02-04T09:21:10.123789Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0,id=hostdev0: failed to initialize VFIO device
Meaning: This is a permissions / container / group ownership problem (or a missing capability in the runtime).
Decision: Run QEMU with appropriate privileges (libvirt handles this), confirm vfio group permissions, and check whether you’re inside an unprivileged container attempting passthrough.
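A common cause behind this error class is a device in the group still owned by a host driver. A sketch that audits the group's driver bindings (sysfs layout assumed standard; group 12 is this article's example):

```shell
#!/bin/sh
# Sketch: print the bound driver for every device in an IOMMU group.
# Anything not on vfio-pci can block the whole group from attaching.
group_drivers() {
  root="${1:-/sys/kernel/iommu_groups}"   # overridable for testing
  group="$2"
  for d in "$root/$group"/devices/*; do
    [ -e "$d" ] || continue
    if [ -L "$d/driver" ]; then
      drv=$(basename "$(readlink -f "$d/driver")")
    else
      drv=none
    fi
    printf '%s -> %s\n' "${d##*/}" "$drv"
  done
}

group_drivers /sys/kernel/iommu_groups 12   # 12 = the example group above
```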
Joke #2: VFIO errors are like fortune cookies—cryptic, slightly accusatory, and always delivered right before you wanted to go home.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company spun up a “temporary” GPU virtualization host to accelerate a machine learning backlog. It ran fine for a week.
Then a routine maintenance reboot turned into a half-day outage for the ML team’s pipelines.
The assumption: “We enabled virtualization in BIOS, so passthrough is covered.” They had enabled VT-x, but not VT-d.
The host still booted, VMs still booted, and the GPU appeared in the VM configuration—until the moment VFIO tried to attach for real.
After the reboot, the GPU enumerated differently (slot training changes happen), and now the VM failed hard with a VFIO group/container error.
The team burned hours on driver versions and QEMU args because the error message didn’t say “go to BIOS.”
When someone finally checked dmesg, there was no DMAR initialization. The hypervisor had been operating in a “looks okay” state,
not a “is correct” state.
The fix was boring: enable VT-d, update BIOS to a stable revision, and write a preflight script that greps dmesg for DMAR/AMD-Vi before
allowing GPU workloads on the node. They also updated their build checklist so “virtualization enabled” became two explicit checks:
CPU virtualization and IOMMU virtualization.
Mini-story 2: The optimization that backfired
Another shop wanted to squeeze a little more throughput from their GPU compute VMs. Someone suggested enabling Resizable BAR and toggling
every PCIe performance-sounding option in firmware: Above 4G decoding, ReBAR, aggressive ASPM settings, and a new PCIe slot bifurcation layout.
Benchmark numbers on a single VM looked slightly better.
Then the weirdness started: intermittent VM boot failures, occasional host lockups under heavy GPU DMA, and a pattern where the second VM
to start after a reboot would fail while the first succeeded. It smelled like a driver issue until it didn’t.
The actual issue was topology churn and group churn. The bifurcation change moved devices behind different bridges, shifting IOMMU groups.
Their automation still bound the old device IDs correctly, but the grouping was now coarser and included a USB controller the host needed.
Sometimes the group assignment landed “safe,” sometimes not, depending on enumeration order and firmware behavior.
The rollback fixed it instantly. They kept Above 4G decoding (required), disabled ReBAR for now, and treated PCIe topology as a change-managed
item with a regression checklist: verify groups, verify VFIO binding, verify VM start/stop cycles, verify reset behavior.
Mini-story 3: The boring but correct practice that saved the day
A financial-services team ran GPU passthrough on a small pool of hosts for analytics workloads. Nothing flashy, but the change control
was strict. They had a preflight runbook that looked almost comically simple: confirm IOMMU enabled, confirm groups unchanged, confirm VFIO binding,
confirm a test VM can start and stop twice.
A firmware update was scheduled as part of a security patch cycle. The vendor release notes mentioned “improved PCIe compatibility.”
The team treated that phrase like an unexploded device. They drained one host, patched it, and ran the full preflight.
The preflight caught that interrupt remapping had been disabled by default after the update (firmware reset a setting).
Everything still “worked,” but the kernel logs showed a downgrade in DMA/interrupt handling. They rolled back the firmware on that host,
then re-applied it with explicit BIOS setting enforcement and screenshots stored in their internal change record.
The saved-the-day part wasn’t heroics. It was that they didn’t discover the regression during a quarter-end run. They discovered it
on a drained node, with time to think, and with evidence.
Common mistakes: symptom → root cause → fix
1) “No IOMMU detected” or no DMAR/AMD-Vi lines in dmesg
Symptom: VFIO attach fails; dmesg doesn’t mention DMAR/AMD-Vi.
Root cause: IOMMU disabled in BIOS/UEFI, or kernel boot parameters missing/ignored, or booting in a mode that hides features.
Fix: Enable VT-d/AMD-Vi, confirm kernel parameters, update firmware if DMAR tables are broken. Verify with Task 3.
2) GPU shares IOMMU group with SATA/USB/NIC you need on the host
Symptom: You can’t pass through GPU without also passing through other critical devices.
Root cause: PCIe topology + lack of ACS isolation; consumer boards commonly group devices aggressively.
Fix: Try different PCIe slot, disable/enable onboard devices to reshuffle, adjust bifurcation, update BIOS. Last resort: ACS override, with eyes open.
3) VM boots but performance is terrible; GPU utilization weird
Symptom: VM is sluggish; frame pacing bad; compute jobs stutter.
Root cause: KVM not active, or VM is using emulated devices, or CPU pinning/NUMA mismatch—not IOMMU itself.
Fix: Confirm /dev/kvm, use virtio devices, use q35 + OVMF, and check NUMA placement for the GPU’s PCIe root complex.
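For the NUMA-placement part, sysfs reports which node a PCI device hangs off, so vCPU pinning can match it. A minimal sketch (the device path is this article's example; -1 means firmware reported no locality):

```shell
#!/bin/sh
# Sketch: report the NUMA node for a PCI device so vCPU pinning can match.
# A value of -1 means firmware did not report locality for this device.
numa_node_of() {
  devdir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
  cat "$devdir/numa_node" 2>/dev/null || echo "-1"
}

numa_node_of
```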
4) VM starts once, but after shutdown it won’t start again until host reboot
Symptom: First attach works; subsequent attaches fail; GPU appears “stuck.”
Root cause: GPU reset quirks (FLR not supported or unreliable), or driver leaves device in a bad state.
Fix: Use vendor-reset where supported, ensure you pass through all functions, avoid host driver binding entirely, and plan maintenance reboots if necessary.
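Whether a device even advertises Function Level Reset shows up on lspci -vv's DevCap line as FLReset+ or FLReset-. A small text-based check (a sketch, so it also works on saved lspci output; run lspci as root to see capabilities):

```shell
#!/bin/sh
# Sketch: does this device advertise Function Level Reset (FLR)?
# lspci -vv prints FLReset+ (supported) or FLReset- (not) on the DevCap line.
has_flr() {
  grep -qE 'FLReset\+'
}

if sudo lspci -vv -s 01:00.0 2>/dev/null | has_flr; then
  echo "FLR advertised: resets are more likely (not guaranteed) to behave"
else
  echo "no FLR: plan for vendor-reset or host reboots between assignments"
fi
```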
5) Black screen in VM despite successful VFIO attach
Symptom: VM runs, but no display output from passed-through GPU.
Root cause: Wrong VM firmware (SeaBIOS vs OVMF), missing GPU ROM, primary GPU confusion, or display output routed to a different port.
Fix: Switch to OVMF, use q35, consider specifying a ROM file if needed, and ensure the monitor is connected to the passed-through GPU.
6) VFIO “failed to set iommu for container: Operation not permitted”
Symptom: Libvirt/QEMU logs show permission/container failures.
Root cause: Trying passthrough from an unprivileged container, wrong device permissions, or misconfigured libvirt security model.
Fix: Run the hypervisor on the host, not in an unprivileged container; validate /dev/vfio permissions; use libvirt-managed QEMU with proper privileges.
7) You enabled “SVM” on AMD and thought that was IOMMU
Symptom: CPU virtualization works; IOMMU does not.
Root cause: SVM is AMD CPU virt; AMD-Vi is IOMMU. Different toggles.
Fix: Enable AMD-Vi/IOMMU explicitly and verify dmesg for AMD-Vi.
8) ACS override “fixes” grouping but introduces instability or risk
Symptom: Groups look perfect after ACS override, but you see rare host hangs or security review pushback.
Root cause: You forced the kernel to split groups beyond what hardware isolation guarantees.
Fix: Prefer hardware that supports proper ACS; for serious environments, don’t use ACS override as a permanent solution.
Checklists / step-by-step plan
Step-by-step: getting to first successful GPU passthrough (repeatable)
- Firmware: Enable VT-d (Intel) or AMD-Vi/IOMMU (AMD). Enable Above 4G decoding. Prefer pure UEFI boot.
- Kernel params: Add intel_iommu=on or amd_iommu=on. Optionally add iommu=pt.
- Reboot and verify: Use dmesg to confirm DMAR/AMD-Vi is enabled and no glaring IOMMU errors exist.
- Discover devices: Use lspci -nn to record GPU and audio function IDs and addresses.
- Verify groups: Ensure GPU functions are isolated in their own IOMMU group.
- Bind to vfio-pci early: Configure vfio-pci IDs and blacklist host GPU drivers; rebuild initramfs.
- Reboot and confirm binding: lspci -nnk should show vfio-pci in use for GPU functions.
- VM config: Use q35 + OVMF. Pass through both GPU functions. Prefer virtio for disk and NIC.
- Test lifecycle: Boot VM, run load, shut down VM, boot again. Repeat. You’re testing reset behavior, not just first boot.
- Operationalize: Write preflight checks (IOMMU enabled, groups stable, vfio binding correct) and run them after firmware/kernel updates.
Change-management checklist (because firmware updates are chaos)
- Before change: capture dmesg IOMMU lines and the IOMMU group listing for the GPU.
- Before change: capture lspci -nnk for GPU functions (driver in use).
- After change: re-run the same commands; diff outputs.
- After change: start/stop test VM twice; confirm no regressions.
- If groups changed: treat it as a topology change and re-evaluate safety before restoring production workloads.
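Those capture steps can be scripted so the before/after comparison is mechanical (a sketch; the filenames and grep patterns are illustrative):

```shell
#!/bin/sh
# Sketch: snapshot the IOMMU-relevant state into a timestamped directory.
# Run once before the change and once after, then diff the two directories.
out="iommu-baseline-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"

dmesg 2>/dev/null | grep -iE 'DMAR|IOMMU|AMD-Vi' > "$out/dmesg-iommu.txt"
find /sys/kernel/iommu_groups -maxdepth 3 2>/dev/null | sort > "$out/groups.txt"
lspci -nnk 2>/dev/null > "$out/lspci-nnk.txt"

echo "captured to $out; compare runs with: diff -ru <before-dir> <after-dir>"
```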
When to stop trying and buy better hardware
- Your GPU shares a group with critical host devices and you cannot change topology to isolate it.
- You require strict isolation guarantees and you’re relying on ACS override to make groups “look right.”
- You need reliable reset behavior and your GPU/board combination won’t do it without host reboots.
FAQ
1) Is IOMMU only for virtualization?
No. It’s also used for DMA isolation and kernel hardening on bare metal. Virtualization is just the most visible use case.
2) What’s the difference between VT-x and VT-d?
VT-x is CPU virtualization (running guest code efficiently). VT-d is IOMMU (device DMA remapping). You need VT-d for safe PCI passthrough.
3) On AMD, is “SVM” the same as AMD-Vi?
No. SVM is CPU virtualization. AMD-Vi (often labeled IOMMU) is what you need for passthrough.
4) Why do I have to pass through the GPU audio function too?
Because it’s usually a separate PCI function in the same device package. Leaving it on the host can break reset behavior or cause driver conflicts in the guest.
5) Are IOMMU groups a Linux limitation?
The grouping reflects hardware isolation boundaries (bridges, ACS capabilities). Linux is being conservative: it won’t promise isolation it can’t enforce.
6) Should I always use iommu=pt?
Usually yes for a hypervisor host: it reduces overhead for host-owned devices. VFIO-assigned devices still get isolated IOMMU domains.
If you’re chasing strict DMA isolation for all devices, reconsider and test.
7) Does ACS override make passthrough “safe”?
It can make it work. “Safe” depends on what isolation your hardware actually provides. For production or multi-tenant risk models, avoid relying on it.
8) My VM works until I reboot the VM. Why?
Likely a device reset problem. Many GPUs don’t reset cleanly via FLR in all scenarios. You may need vendor-reset support, different hardware, or host reboots between assignments.
9) Do I need UEFI (OVMF) in the VM?
For many modern GPUs and Windows guests, yes. OVMF + q35 reduces weird legacy VGA pathways and generally behaves better for passthrough.
10) Can I run GPU passthrough from inside a container?
In practice, GPU passthrough is a host-level job. Unprivileged containers commonly lack the permissions and device management required for VFIO.
Conclusion: next steps that actually move the needle
IOMMU isn’t “a feature.” It’s the contract that makes device assignment sane. If you want GPU passthrough to be stable, stop treating the BIOS
setting as a checkbox and start treating it as a dependency you continuously verify.
Practical next steps
- Run Tasks 2–6 on your host and save the outputs as a baseline (kernel cmdline, dmesg IOMMU lines, IOMMU group list).
- Bind your GPU to vfio-pci early (Tasks 7–10). If you’re still unbinding live, you’re choosing fragility.
- Test the VM lifecycle twice (boot → load → shutdown → boot). If it only works once, you don’t have a system—you have a demo.
- If groups are ugly, fix topology first. If you must use ACS override, document the risk and keep it out of environments that require strong isolation guarantees.
The win condition is dull: consistent groups, predictable binding, repeatable VM start/stop. When it gets boring, you did it right.