IOMMU Groups Are a Trap: How to Get Clean GPU/NVMe Passthrough Without Tears


You didn’t buy a GPU to watch it sit behind a hypervisor politely doing nothing. You didn’t buy an NVMe drive to feed it through an emulated controller like it’s 2009. You wanted near-native performance in a VM, and instead you got a brick wall labeled IOMMU group 12.

The trap is that IOMMU groups feel like a simple checkbox problem—until they become a topology, firmware, and platform-quirks problem. The good news: there’s a clean way through, and it mostly involves refusing to guess.

What IOMMU groups really are (and why they hurt)

An IOMMU group is the kernel’s way of saying: “these PCIe devices can’t be isolated from each other safely, so they’re treated as one security domain.” If you pass through one device in the group to a VM, you’re implicitly trusting everything else in that group not to DMA-spam the host or that VM.

That “can’t be isolated” part is the key. The group boundary isn’t defined by your feelings, or the slot label on the motherboard, or how expensive the GPU was. It’s defined by PCIe topology and the Access Control Services (ACS) behavior of the switches/bridges between the device and the CPU root port. If ACS isn’t providing proper isolation (or isn’t advertised), the kernel won’t pretend it is.

The practical consequence

Clean passthrough means the device you want—GPU or NVMe—lands in an IOMMU group with nothing else you care about. Ideally: the GPU and its audio function together, maybe a USB controller if you deliberately placed it there, and that’s it. When your GPU shares a group with your SATA controller, a NIC, and half the chipset, you’re not doing “GPU passthrough.” You’re doing “hope-as-a-service.”

Two bad mental models that cause most pain

  • “IOMMU groups are fixed by the OS.” The OS reports what the hardware and firmware expose. You can’t out-configure missing ACS on a cheap downstream switch.
  • “If it boots, it’s safe.” Safety here is DMA isolation and interrupt remapping behavior under stress, hot resets, and error recovery. Booting proves almost nothing.

The sharp edge: the ACS override patch

On many consumer platforms, people “fix” ugly IOMMU groupings by forcing ACS behavior via the pcie_acs_override kernel parameter. This can split groups in a way that looks perfect.

But it’s a lie you told the kernel. It might be a useful lie in a homelab. In production or anything multi-tenant, it’s a security compromise. You’re asking the kernel to assume isolation where the fabric may not actually enforce it.

Use ACS override only when you can accept the risk: single-user host, trusted guests, no shared hosting, and you understand the blast radius. Otherwise: change the platform, change the slot, or change the plan.

Joke #1: ACS override is like putting a “Do Not Enter” sign on a door with no lock. It improves vibes, not security.

Interesting facts and historical context

  • VT-d and AMD-Vi weren’t built for gamers. Intel’s VT-d and AMD’s IOMMU (AMD-Vi) came from a need to safely virtualize I/O in servers and multi-tenant environments.
  • DMA was the original “trust me bro.” Classic PCI devices could DMA into host memory with little restraint; IOMMU exists largely to stop one device from scribbling over everything.
  • ACS is a PCIe fabric feature, not a Linux feature. Linux can only group based on what the PCIe hierarchy reports it can isolate.
  • Interrupt remapping is part of the safety story. Even with DMA isolation, broken interrupt routing/remapping can become a stability and security issue under load.
  • GPUs became “weird” passthrough devices because of reset behavior. Many GPUs weren’t designed for clean function-level reset (FLR) from a guest; some need bus resets, some don’t reset cleanly at all.
  • NVMe is usually easier than GPUs… until it isn’t. NVMe controllers often behave well with FLR and namespace management, but power management and ASPM quirks can still bite.
  • SR-IOV normalized the idea of sharing devices safely. Network and storage vendors pushed SR-IOV hard; GPU virtualization (vGPU) followed a different, more vendor-locked path.
  • Consumer boards optimized for cost, not isolation. Many desktop chipsets hang multiple functions behind shared bridges or switches that don’t expose ACS the way server platforms do.

One paraphrased idea worth keeping on your wall: “Hope is not a strategy” — attributed to many operators over the years, and the SRE version is “measure, then change.”

Design for clean passthrough: pick battles you can win

1) Start with topology, not with kernel flags

Before you touch GRUB, find out what you actually bought: which slots are CPU root ports, which are chipset lanes, which share a downstream switch, and which share a bridge with “everything else.” Marketing diagrams are lies of omission. You need to see the PCIe tree.
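A quick way to see that tree next to its isolation boundaries is to print every PCI device alongside its IOMMU group. A sketch, using only the standard sysfs layout and lspci:

```shell
#!/bin/bash
# Sketch: print each PCI device with its IOMMU group so you can
# correlate the lspci -tv topology with actual isolation boundaries.
shopt -s nullglob
for d in /sys/kernel/iommu_groups/*/devices/*; do
  grp=${d%/devices/*}; grp=${grp##*/}   # group number from the path
  dev=${d##*/}                          # PCI address, e.g. 0000:01:00.0
  printf 'group %-4s %s\n' "$grp" "$(lspci -nns "$dev")"
done
```

Run it once per slot experiment and keep the output; it is the ground truth the marketing diagram omits.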

2) Prefer CPU-attached lanes for passthrough devices

GPUs and high-performance NVMe drives behave best when they’re on CPU root ports. Chipset-attached devices often share bandwidth and isolation boundaries, and they’re more likely to end up in giant IOMMU groups with other controllers.

3) Don’t pass through the device you boot from

Yes, you can do it. Yes, you can make it work. And yes, it’s an outage generator when firmware updates, initramfs changes, or device enumeration shifts. Keep host boot on something boring: a small SATA SSD, a mirrored pair, or a dedicated M.2 that you never hand to guests.

4) GPUs: plan for resets and display ownership

GPU passthrough failures are often not “IOMMU problems.” They’re reset problems, ROM problems, or driver ownership problems.

  • Reset: If the GPU can’t reset cleanly between VM stops/starts, you’ll get the dreaded “works once after boot.”
  • ROM: Some setups need a GPU ROM file; others break if you force one.
  • Host ownership: If the host loads a driver and grabs the GPU framebuffer early, detaching later can be messy.
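To see whether the host console grabbed the GPU early, check who owns the framebuffer. A sketch; efifb/simpledrm are the usual suspects, and whether a BOOTFB claim shows up in /proc/iomem varies by platform:

```shell
# Who owns the console framebuffer right now?
cat /proc/fb
# Did the boot framebuffer land on the GPU's memory BAR?
sudo grep -i -B1 'bootfb' /proc/iomem
# Which early framebuffer driver claimed it at boot?
dmesg | egrep -i 'efifb|simpledrm|simple-framebuffer' | head
```

If the answer is "the GPU you want to pass through," early vfio-pci binding (and, on some setups, a kernel parameter like video=efifb:off) avoids the tug-of-war.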

5) NVMe: decide between passthrough and virtualized storage based on failure domain

NVMe passthrough is great for latency-sensitive workloads, game VMs, or single-tenant appliances. But it changes who gets to recover from errors.

If the NVMe controller wedges and the guest owns it, the host can’t necessarily poke it back to life without tearing down the VM. If you need host-level HA, snapshots, replication, or predictable recovery, a virtual disk on top of a sane storage layer often wins—even if it’s a bit slower.

6) The only sane security posture: treat passthrough guests as trusted

VFIO + IOMMU is strong isolation when the platform supports it properly, but passthrough expands the guest’s control over real hardware. If your guest is untrusted, reconsider. If your platform needs ACS override, do not pretend you’re building a hardened multi-tenant system.

Practical tasks: commands, outputs, decisions

You’ll diagnose this fastest by treating it like any other ops problem: observe, change one thing, observe again. Below are concrete tasks that you can run on a Linux KVM host (including Proxmox-like environments) with notes on how to interpret the output and what decision to make.

Task 1: Verify IOMMU is enabled (kernel boot and dmesg)

cr0x@server:~$ dmesg | egrep -i 'iommu|dmar|amd-vi' | head -n 30
[    0.000000] ACPI: DMAR 0x0000000079B2B000 0000A8 (v01 INTEL  SKL      00000001 INTL 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.000000] DMAR: Host address width 39
[    0.000000] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.000000] DMAR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.000000] DMAR: Enabled IRQ remapping in x2apic mode

Meaning: You want to see “IOMMU enabled” (Intel: DMAR) or “AMD-Vi: Enabled.” Bonus points for “IRQ remapping enabled.”

Decision: If you don’t see IOMMU enabled, fix firmware settings (VT-d/AMD-Vi) and kernel parameters before anything else.

Task 2: Confirm kernel command line (catch wrong flags)

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12 root=ZFS=rpool/ROOT/default ro quiet intel_iommu=on iommu=pt

Meaning: intel_iommu=on or amd_iommu=on enables the IOMMU; iommu=pt often improves host performance by using passthrough mappings for non-VFIO devices.

Decision: Keep iommu=pt for performance unless you’re debugging something exotic; it’s commonly fine.

Task 3: List IOMMU groups and spot the “monster group”

cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "IOMMU Group $(basename "$g")"; for d in "$g"/devices/*; do echo -n "  "; lspci -nns "$(basename "$d")"; done; done | sed -n '1,80p'
IOMMU Group 0
  00:00.0 Host bridge [0600]: Intel Corporation Device [8086:1237]
IOMMU Group 1
  00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:460d]
IOMMU Group 12
  01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
  01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)
IOMMU Group 13
  02:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]

Meaning: This is the truth. Your GPU is usually two functions (graphics + audio) and should be in the same group. An NVMe controller ideally sits alone or with harmless siblings you don’t need.

Decision: If your target device shares a group with critical host devices (SATA controller, NIC you need, USB controller used for host input), stop and redesign: different slot, different motherboard, or accept the ACS override risk.

Task 4: See the PCIe tree (find the bridge causing group merging)

cr0x@server:~$ lspci -tv
-+-[0000:00]-+-00.0  Intel Corporation Device 1237
 |           +-01.0-[01]----00.0  NVIDIA Corporation Device 2684
 |           |            \-00.1  NVIDIA Corporation Device 22bc
 |           +-03.0-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller a808
 |           \-14.0  Intel Corporation Device 7ae0

Meaning: You’re looking for where your device attaches: CPU root ports vs chipset bridges. The tree helps you correlate “bad grouping” with a particular upstream bridge.

Decision: If the GPU/NVMe is behind a shared downstream switch/bridge, try another physical slot that maps to a different root port.

Task 5: Identify your GPU/NVMe device IDs for vfio-pci binding

cr0x@server:~$ lspci -nn | egrep -i 'vga|audio|nvme'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bc] (rev a1)
02:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]

Meaning: The bracketed pairs are vendor:device IDs. You bind those IDs to vfio-pci to keep host drivers off them.

Decision: Capture all functions you plan to pass through (GPU + audio; some GPUs also expose a USB-C controller).

Task 6: Check which driver currently owns the device

cr0x@server:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device 4101
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Meaning: If the host is using nvidia or nouveau, passthrough will be harder. For clean VFIO, you want Kernel driver in use: vfio-pci.

Decision: If a non-VFIO driver is bound, you need to bind early via initramfs/modprobe config or driver blacklisting.

Task 7: Bind devices to vfio-pci (modprobe config) and rebuild initramfs

cr0x@server:~$ sudo tee /etc/modprobe.d/vfio.conf >/dev/null <<'EOF'
options vfio-pci ids=10de:2684,10de:22bc,144d:a808 disable_vga=1
EOF
cr0x@server:~$ sudo tee /etc/modprobe.d/blacklist-gpu.conf >/dev/null <<'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
EOF
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12

Meaning: This forces vfio-pci to claim those device IDs at boot, before other drivers.

Decision: Reboot after this change. If you can’t reboot, don’t start “live rebinding” unless you like debugging a headless host from the floor.

Task 8: After reboot, confirm vfio-pci owns the devices

cr0x@server:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device 4101
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Meaning: Modules can exist, but “driver in use” should be vfio-pci.

Decision: If it still binds to a GPU driver, your initramfs didn’t include vfio settings early enough, or the distro’s GPU stack is forcing it. Fix that before touching QEMU.

Task 9: Check VFIO and IOMMU faults in the kernel log

cr0x@server:~$ sudo dmesg -T | egrep -i 'vfio|iommu|dmar|fault' | tail -n 30
[Tue Feb  4 10:10:02 2026] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[Tue Feb  4 10:10:02 2026] vfio-pci 0000:02:00.0: enabling device (0000 -> 0002)
[Tue Feb  4 10:12:17 2026] DMAR: [DMA Read] Request device [02:00.0] fault addr 0x7f1c9000 [fault reason 0x02] Present bit in context entry is clear

Meaning: VFIO “enabling device” is normal. DMAR/IOMMU faults are not. The “Present bit in context entry is clear” often indicates a mapping issue or device misbehavior.

Decision: If you see recurring IOMMU faults under load, suspect buggy firmware, broken interrupt remapping, or a device not playing well with the platform. Don’t paper over it.

Task 10: Validate interrupt remapping (stability and security)

cr0x@server:~$ dmesg | egrep -i 'interrupt remapping|irq remapping|x2apic' | head
[    0.000000] DMAR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.000000] DMAR: Enabled IRQ remapping in x2apic mode

Meaning: You want IRQ remapping enabled, especially if you care about robustness and isolation.

Decision: If it’s disabled, check BIOS settings (sometimes tied to “Above 4G decoding”, “SR-IOV”, or “Intel VT-d” toggles) and kernel parameters. On some platforms, enabling x2APIC helps.

Task 11: Check for ACS override usage (know when you’re lying to yourself)

cr0x@server:~$ cat /proc/cmdline | tr ' ' '\n' | egrep 'pcie_acs_override|acs'
pcie_acs_override=downstream,multifunction

Meaning: This is the “group splitter.” It may make groups look clean without the hardware actually enforcing it.

Decision: If this is present on a host with untrusted guests or compliance requirements, remove it and fix the hardware/topology instead.

Task 12: Check PCIe link speed/width (performance bottleneck hiding in plain sight)

cr0x@server:~$ sudo lspci -s 02:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
LnkSta: Speed 8GT/s (downgraded), Width x2 (downgraded)

Meaning: Your NVMe is capable of PCIe Gen4 x4, but it’s running at Gen3 x2. That’s not a virtualization problem. That’s a slot, BIOS, riser, or signal integrity problem.

Decision: Fix link training before benchmarking anything. Swap slots, remove risers, update BIOS, or force PCIe generation in firmware if available.

Task 13: Check IOMMU group membership of a single device (surgical view)

cr0x@server:~$ dev=0000:02:00.0; readlink -f /sys/bus/pci/devices/$dev/iommu_group
/sys/kernel/iommu_groups/13

Meaning: This tells you exactly which group the device belongs to, useful for scripts and change tracking.

Decision: If this changes after BIOS updates or slot moves, update your provisioning docs and VM configs. Do not assume stability.

Task 14: Attach via QEMU/Libvirt and verify guest sees the real device

cr0x@server:~$ sudo virsh domblklist win11-gpu
 Target   Source
------------------------------------------------
 vda      /var/lib/libvirt/images/win11.qcow2

cr0x@server:~$ sudo virsh dumpxml win11-gpu | egrep -n 'hostdev|address|vfio' | head -n 40
132:    <hostdev mode='subsystem' type='pci' managed='yes'>
133:      <driver name='vfio'/>
135:        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
137:      <address type='pci' domain='0x0000' bus='0x06' slot='0x10' function='0x0'/>

Meaning: The VM definition includes a PCI host device assignment. That’s your passthrough.

Decision: If the guest sees a generic storage controller instead of the NVMe model, you didn’t actually pass through the controller—you passed a virtual disk.

Task 15: Measure latency and I/O inside the guest vs host (find virtualization overhead)

cr0x@server:~$ sudo fio --name=randread --filename=/dev/nvme0n1 --ioengine=io_uring --direct=1 --bs=4k --iodepth=32 --numjobs=1 --runtime=15 --time_based --rw=randread
randread: (groupid=0, jobs=1): err= 0: pid=24111: Tue Feb  4 10:21:12 2026
  read: IOPS=182k, BW=712MiB/s (747MB/s)(10.4GiB/15001msec)
    slat (nsec): min=820, max=12514, avg=1790.12, stdev=402.21
    clat (usec): min=49, max=412, avg=173.40, stdev=21.11

Meaning: This is host-side baseline. You compare similar runs inside the guest on the passed-through NVMe. Large gaps suggest interrupt moderation, power management, or CPU pinning issues—not just VFIO.

Decision: If guest latency is wildly worse than host, check CPU topology, C-states, and whether the VM is using MSI/MSI-X properly.

Task 16: Check if your NVMe supports reset sanely (avoid “works once”)

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | egrep -i 'frmw|oacs|oncs|vwc|lpa' | head
oacs    : 0x17
oncs    : 0x5f
vwc     : 0x1
lpa     : 0x3

Meaning: Capabilities vary; you’re mainly checking that the controller is not some bargain-bin oddity that behaves strangely under reset/power events.

Decision: If passthrough stability is poor, swap to an enterprise-ish controller or at least a model known to behave under virtualization.

Fast diagnosis playbook

This is the “I have 20 minutes and a VP hovering” routine. The goal is to identify whether your bottleneck is isolation, ownership, reset, or plain-old PCIe bandwidth.

First: topology and isolation (don’t debug ghosts)

  1. Check IOMMU groups and confirm the target device is not glued to host-critical devices.
  2. Check PCIe tree to see what bridge/root port the device hangs off.
  3. Check ACS override and decide if you’re comfortable with the risk profile.

Second: driver ownership and binding (the boring stuff that breaks everything)

  1. Confirm vfio-pci is in use for every function you’re passing through.
  2. Scan dmesg for VFIO errors, DMAR faults, and interrupt remapping status.
  3. Verify the VM config actually includes hostdev devices, and that you’re not benchmarking a virtual disk by accident.

Third: performance and stability (where the “it works” becomes “it’s usable”)

  1. Check link width/speed for downgrades (Gen4 device running at Gen3 x1 is a classic).
  2. Run a short fio test on host and guest to compare latency and IOPS patterns.
  3. Look for reset-related “works once” symptoms after VM stop/start.

If you do these in order, you avoid the most common time-waster: tweaking QEMU flags for a problem that is actually a motherboard routing decision.
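The three passes condense into one script you can paste while the VP hovers. A sketch; the PCI addresses are the example devices from the tasks above, so substitute your own:

```shell
#!/bin/bash
# Triage sketch: isolation, ownership, and bandwidth in one pass.
echo "== IOMMU / IRQ remapping =="
dmesg | egrep -i 'IOMMU enabled|AMD-Vi|IRQ remapping' | head -n 5
echo "== ACS override present? =="
tr ' ' '\n' </proc/cmdline | grep acs || echo "none"
echo "== Driver ownership (example GPU) =="
lspci -k -s 01:00.0 | grep -i 'driver in use' || echo "no driver bound"
echo "== Link status (example NVMe) =="
sudo lspci -s 02:00.0 -vv | egrep -i 'LnkSta'
```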

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company built an internal “GPU pool” for CI workloads: compile, test, and occasional CUDA runs. They bought a stack of consumer workstations, tossed a couple of GPUs in each, and ran KVM with VFIO. It looked clean in early testing. One GPU per VM. Everybody happy.

Then a kernel update rolled through and the IOMMU grouping changed subtly on a subset of hosts. Same model motherboard, different firmware revisions. On those hosts, a GPU ended up grouped with a USB controller that also hosted the out-of-band KVM dongle used by remote hands. Nobody noticed because the VM still started.

During a routine VM recycle, VFIO took the whole IOMMU group. The host didn’t crash, but it effectively lost remote input access. That wouldn’t be fatal—except these were “workstations” in a locked room with no real BMC, and the only remote access path was through that USB controller. Two people spent an evening coordinating a physical reboot schedule across time zones.

The wrong assumption wasn’t “IOMMU exists.” It was “IOMMU groups are stable and consistent across identical hardware.” They’re not. Firmware, slot wiring, and even minor PCIe enumeration changes can move the cheese.

The fix was painfully ordinary: standardize BIOS versions, pin hardware SKUs, and add a pre-flight check that validates group membership before provisioning a GPU VM. They also stopped attaching their only remote management path to a controller that could be yanked into a VM.

Mini-story 2: The optimization that backfired

A different org wanted low-latency storage inside a Windows VM that hosted a data processing tool with a cranky licensing model. They passed through an NVMe drive directly to the VM. Performance was spectacular. Benchmarks were chart-worthy.

Then reality arrived: the VM would occasionally bluescreen under heavy write bursts, and when it did, the NVMe controller sometimes stayed wedged. The host saw a device that existed but wouldn’t complete commands. A warm reboot didn’t reliably recover it. Sometimes they needed a full power cycle of the chassis.

They had optimized for the wrong metric. They optimized for steady-state throughput, not for recovery behavior. The “backfired” part wasn’t that passthrough is bad; it was that the failure domain moved into the guest OS and its driver stack, and the hardware didn’t reliably reset in their platform.

The eventual solution was to stop pretending they needed raw NVMe for that workload. They moved the VM onto a virtual disk backed by the host’s storage stack with write caching configured sanely and periodic snapshotting. They kept one dedicated passthrough NVMe host for a small subset of workloads that truly required it, with known-good controllers and a tested reset path.

They lost some benchmark glory and gained a system that could be repaired without someone driving to the data center with a finger ready for the power button.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran a few high-value workstation VMs for engineering and visualization. They had strict change management, which is unpopular until it’s the only thing between you and chaos. Their runbook required capturing PCIe topology and IOMMU group listings as artifacts whenever firmware changed.

One quarter, a vendor firmware update “improved PCIe compatibility.” In staging, nothing failed. But their artifact diff showed that a GPU’s group now included a new sibling: a PCIe bridge function previously hidden. That didn’t matter for bare metal, but it mattered for VFIO because the bridge pulled in other devices.

Because they noticed before rollout, they stopped. They tested alternative slots, found one that kept the GPU isolated, and updated their physical build standard: “GPU must be in slot X; NVMe passthrough must be on CPU M.2 slot Y.”

When production rollout happened, it was boring. No surprises. The team looked almost disappointed.

Joke #2: Change management is like flossing—nobody wants to do it, and the consequences of skipping it are expensive and weird.

Common mistakes: symptoms → root cause → fix

1) VM won’t start: “Device is in use” / “cannot reset”

Symptoms: QEMU errors about resetting the device, or “device is in use.” VM starts only once per boot.

Root cause: Host driver grabbed the device first, or the device doesn’t support FLR cleanly, or you’re missing a proper bus reset path.

Fix: Bind to vfio-pci in initramfs; avoid host framebuffer; consider different GPU model; ensure the device is alone in its group; for GPUs with reset bugs, test vendor-specific reset quirks or change hardware.
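On kernels that expose it (roughly 5.15 and later), sysfs will tell you up front which reset methods a device supports, which is a cheap way to predict "works once" before you build around the card. A sketch:

```shell
# Show the reset methods the kernel thinks this device supports
# (e.g. "flr bus"); a missing attribute means an older kernel or
# no usable reset path.
reset_methods() {
  local f="/sys/bus/pci/devices/$1/reset_method"
  [ -r "$f" ] && cat "$f" || echo "no reset_method attribute"
}
reset_methods 0000:01:00.0
```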

2) GPU passthrough works, but performance is terrible

Symptoms: Low FPS, stutter, high CPU usage, or the guest behaves like it’s using a basic display adapter.

Root cause: GPU not actually passed through (using virtio-gpu), PCIe link downgraded, guest driver not installed, or CPU scheduling/pinning issues.

Fix: Verify guest sees the GPU model; check host lspci -vv link status; install proper guest drivers; pin vCPUs to physical cores and avoid overcommit for latency-sensitive gaming/interactive workloads.

3) NVMe passthrough benchmarks great, then corrupts or disappears

Symptoms: I/O errors in guest, NVMe resets, device vanishes until reboot/power cycle.

Root cause: Controller firmware quirks under virtualization, power management/ASPM issues, broken reset behavior, or platform instability.

Fix: Disable ASPM for that link if needed; update BIOS/NVMe firmware; try a different controller; if you need reliability, switch to virtual disks on a host-managed storage stack.

4) After enabling IOMMU, host networking breaks or becomes flaky

Symptoms: Packet loss, NIC resets, weird latency spikes.

Root cause: Some drivers/platforms behave differently with DMA remapping; interrupt remapping issues; buggy BIOS.

Fix: Ensure IRQ remapping is enabled; update BIOS; test different kernel; consider iommu=pt; if it’s a platform bug, don’t fight it—replace the board.

5) “IOMMU group is huge, so I’ll just pass it all through”

Symptoms: Host loses storage/NIC/USB; sudden outages when VM starts.

Root cause: Passing through a group containing host-critical devices.

Fix: Don’t. Move the passthrough device to a different slot/root port or change platforms. If you must, redesign the host so those “critical” devices aren’t needed by the host (separate NIC, separate boot drive, separate USB controller).

6) ACS override “fixes” groups but introduces spooky issues later

Symptoms: Rare DMA/IOMMU faults, odd instability, security review panic.

Root cause: Forcing group splits without hardware ACS isolation guarantees.

Fix: Remove override in serious environments; move to hardware with proper ACS; use server/workstation chipsets and boards known for sane groupings.

Checklists / step-by-step plan

Step-by-step: get clean GPU passthrough

  1. Firmware setup: Enable VT-d/AMD-Vi. Enable “Above 4G decoding” if you have modern GPUs and multiple devices.
  2. Boot flags: Add intel_iommu=on or amd_iommu=on; consider iommu=pt for host performance.
  3. Map groups: Enumerate IOMMU groups and confirm the GPU (and its audio) are isolated from host-critical devices.
  4. Pick the right slot: Favor CPU root ports. If a slot routes through chipset bridges/switches, it often creates messy groups.
  5. Bind early: Configure vfio-pci IDs and blacklist conflicting GPU drivers; rebuild initramfs.
  6. Confirm ownership: After reboot, ensure Kernel driver in use: vfio-pci.
  7. VM config: Pass through all GPU functions you need (graphics + audio; sometimes USB-C controller).
  8. Stability test: Start/stop the VM multiple times. If it only works once, you have a reset problem to solve before calling it “done.”

Step-by-step: get clean NVMe passthrough

  1. Decide the failure domain: If the guest owns recovery, passthrough is fine. If the host must recover and snapshot, favor virtual disks.
  2. Isolate the controller: Ensure the NVMe controller’s IOMMU group doesn’t include host boot storage or a NIC you need.
  3. Check link status: Confirm expected PCIe generation and width. Fix downgrades first.
  4. Bind to vfio-pci: As with GPUs, do it early and consistently.
  5. Guest verification: Confirm the guest sees the real NVMe controller and that drivers are sane.
  6. Load test: Run short, brutal I/O tests. Watch host dmesg for resets or DMAR faults.
  7. Power management: If you hit stability issues, test ASPM changes and firmware updates.
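For the power management step, both the global ASPM policy and the per-link state are visible from the host. A sketch; forcing the policy to "performance" is a blunt but effective way to take ASPM out of the equation for one boot:

```shell
# Global ASPM policy; the bracketed entry is the active one.
cat /sys/module/pcie_aspm/parameters/policy
# Per-link ASPM state for the example NVMe controller.
sudo lspci -s 02:00.0 -vv | grep -i 'ASPM'
# Blunt experiment (reverts on reboot): disable ASPM host-wide.
# echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy
```

If stability improves with ASPM off, you have found your culprit; then decide whether to pin the policy or swap the hardware.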

Operational checklist: keep it from regressing

  • Record PCIe topology (lspci -tv) and IOMMU groups as artifacts per host.
  • Standardize BIOS/UEFI versions across a cluster.
  • After firmware updates, re-validate group membership and device binding before re-enabling production workloads.
  • Keep host boot storage separate from passthrough devices.
  • Maintain a rollback path for kernel and initramfs changes.

FAQ

1) Why can’t I pass through just one device from an IOMMU group?

Because the group boundary is the kernel’s safety boundary for DMA isolation. If devices can’t be isolated, passing one through risks the others DMAing into the wrong memory.

2) Are IOMMU groups the same as PCIe lanes or slots?

No. Groups are shaped by the PCIe hierarchy and ACS capabilities. A single physical slot can sit behind a bridge that merges isolation domains.

3) Is ACS override safe?

It can be acceptable for trusted, single-user setups. It is not a safe assumption for multi-tenant or security-sensitive environments because it may claim isolation the hardware doesn’t enforce.

4) My GPU shares a group with its audio device. Is that bad?

No, that’s normal. You usually pass through both functions together. The problem is when the group includes unrelated host-critical devices.

5) Why does GPU passthrough work only once after reboot?

Commonly a reset issue: the GPU doesn’t support FLR cleanly or the platform can’t reset it properly after the VM stops. Binding, ROM quirks, and vendor driver behavior can also contribute.

6) NVMe passthrough vs virtio-blk on ZFS/LVM: which is “better”?

For raw latency and direct control, NVMe passthrough often wins. For operational safety—snapshots, replication, host-level recovery—virtual disks on a host-managed storage layer usually win.

7) Does enabling iommu=pt weaken isolation?

It typically applies passthrough mappings for host-owned devices to reduce overhead, while VFIO devices still use strict mappings. It’s widely used; validate in your environment.

8) What’s the quickest way to tell if performance issues are PCIe-related?

Check LnkSta with lspci -vv. If the link is downgraded in speed or width, fix that before touching VM tuning.

9) Should I pass through the entire NVMe drive or use namespaces?

If you need strict separation and your stack supports it, namespaces can help—but you’re still dealing with one controller and its quirks. For simplicity and fewer surprises, whole-controller passthrough is common.

10) Can I do clean passthrough on consumer hardware?

Sometimes, yes—especially if the board has good ACS behavior and CPU-attached slots. But you’re betting on topology luck. If uptime matters, buy hardware that was designed to be boring.

Conclusion: next steps that actually work

If you want clean GPU/NVMe passthrough, stop treating IOMMU groups like an obstacle you can out-argue. They’re a hardware truth serum.

Do this next:

  1. Inventory your current grouping and PCIe tree. Save it. Treat it like config, because it is.
  2. Move the device to a CPU-attached slot/root port if the grouping is ugly.
  3. Bind to vfio-pci early, confirm ownership after reboot, and refuse to debug guest drivers until the host side is clean.
  4. Benchmark link speed/width and latency before and after, so you know whether you improved anything.
  5. If you’re relying on ACS override for “security,” stop and re-scope the project: either accept trusted guests or buy the right platform.

When passthrough is clean, it’s a joy. When it’s not, it’s a career in interpreting error messages. Choose joy. Fix the topology.
