IOMMU On, But Devices Still Share a Group? Fix Your PCIe Topology Like a Pro

You flipped the BIOS toggles. Linux says IOMMU is enabled. You even see DMAR or AMD-Vi in dmesg. And yet: your GPU shares an IOMMU group with a USB controller, a SATA HBA, and what looks like half the motherboard. VFIO laughs quietly in the corner.

This is where most “GPU passthrough guides” stop and start recommending random kernel parameters like they’re seasoning. Don’t. Shared IOMMU groups aren’t a vibe; they’re a topology problem. Fix the topology, and isolation becomes boring. Boring is good.

The mental model: what an IOMMU group actually is

An IOMMU group is the set of PCIe functions that cannot be reliably isolated from each other for DMA. If one device in a group can initiate DMA transactions that reach another device’s memory mappings without being blocked, the kernel treats them as inseparable. VFIO doesn’t “want” groups; it enforces the isolation the hardware can prove.

Why “IOMMU enabled” isn’t the win you think it is

Turning on IOMMU (Intel VT-d / AMD-Vi) gives the platform the ability to translate and restrict DMA. It does not guarantee that every PCIe endpoint gets its own neat sandbox. Isolation depends on the chain of trust between the endpoint and the CPU: the root port, any downstream ports, any PCIe switches, and the availability of proper Access Control Services (ACS) and related controls along that path.

The group boundary is usually a bridge boundary

In practice, IOMMU group boundaries commonly align with PCIe bridges/ports that can enforce separation. If you have a single downstream port feeding multiple endpoints (or a switch that doesn’t expose or enable ACS properly), Linux may lump them together. This is not Linux being mean. This is Linux refusing to promise isolation it can’t prove.

Paraphrased idea from James Hamilton (Amazon): “Measure everything, assume nothing.” It applies painfully well to IOMMU groups—topology assumptions are how you end up passing through your USB controller along with your GPU.

Fast diagnosis playbook (what to check first)

First: confirm IOMMU is actually active, not just “configured”

  • Check dmesg for DMAR/AMD-Vi enabled lines and remapping status.
  • Check that IOMMU groups exist under /sys/kernel/iommu_groups.

Second: identify the exact devices you care about and their upstream path

  • Map the GPU/HBA/NIC to PCI addresses with lspci -nn.
  • Walk the PCIe tree with lspci -tv and lspci -vv.
  • Find the upstream bridge and whether ACS is present/enabled.

Third: decide whether this is fixable in hardware/firmware or only “fixable” with overrides

  • If devices share a group because they sit behind the same non-ACS downstream port: fix slot choice, enable BIOS features, or change the platform.
  • If devices share a group because the motherboard vendor wired multiple slots behind one root port: accept reality, redesign, or use ACS override knowing what you’re trading away.

Rule: if you’re doing this for production reliability or compliance, treat ACS override as a last resort and document it like a controlled substance.

Facts & history that explain today’s mess

  1. VT-d and AMD-Vi showed up well after CPU virtualization. Early virtualization focused on CPU privilege levels; DMA isolation arrived later and matured slowly across chipsets.
  2. PCIe ACS is optional. ACS is a capability that devices/ports may or may not implement. Optional features are where dreams go to die.
  3. IOMMU groups are a Linux policy layer over hardware realities. The kernel forms groups based on isolation guarantees; it’s not a “passthrough feature,” it’s a safety boundary.
  4. Many consumer motherboards optimize lanes, not isolation. Splitting x16 into x8/x8 via the same root complex can be fine for performance but awful for grouping.
  5. PCIe switches are not magic isolation boxes. Some switches implement ACS well; others don’t, or firmware leaves it disabled. Same part number, different board, different outcome.
  6. Multifunction devices can be inseparable by design. A GPU with audio function is typically one physical device with multiple functions; those functions often stay in the same IOMMU group.
  7. “ACS override” exists because people kept asking. An out-of-tree kernel patch (shipped by some distribution kernels) added a knob to relax grouping for virtualization use cases. It’s pragmatic, not purity.
  8. Thunderbolt and external PCIe bring extra topology weirdness. Hotplug bridges and security levels complicate DMA trust and grouping behavior.
  9. SR-IOV changed expectations. Once NICs could expose multiple VFs, everyone expected tidy isolation—then discovered the platform still decides groups.

Practical tasks: commands, outputs, and decisions (15 of them)

These are the checks I run when someone pings, “IOMMU is on but groups are still shared.” Each task includes: a command, what you might see, and what you should do next.

Task 1: Confirm kernel sees an active IOMMU (Intel)

cr0x@server:~$ dmesg | grep -E "DMAR|IOMMU"
[    0.012345] DMAR: IOMMU enabled
[    0.012678] DMAR: Host address width 39
[    0.013210] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.015432] DMAR: Intel(R) Virtualization Technology for Directed I/O

Meaning: “IOMMU enabled” is the line you want. If you only see DMAR tables but no enablement, you’re not actually remapping DMA.

Decision: If missing, fix BIOS (VT-d) and kernel args (intel_iommu=on) before you waste time on groups.

Task 2: Confirm kernel sees an active IOMMU (AMD)

cr0x@server:~$ dmesg | grep -E "AMD-Vi|IOMMU"
[    0.010101] AMD-Vi: IOMMU performance counters supported
[    0.010202] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[    0.010303] AMD-Vi: Interrupt remapping enabled

Meaning: AMD-Vi found plus interrupt remapping enabled is a good sign for safe device assignment.

Decision: If interrupt remapping is off, consider firmware updates and BIOS toggles; some setups become fragile without it.

Task 3: Verify IOMMU groups exist

cr0x@server:~$ ls -1 /sys/kernel/iommu_groups | head
0
1
2
3
4
5
6
7
8
9

Meaning: Groups exist. If the directory is empty or missing, you don’t have IOMMU grouping active.

Decision: No groups means stop and fix enablement (BIOS/kernel). Groups exist means move on to topology.

Task 4: Print groups with device names (fast inventory)

cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "Group ${g##*/}"; for d in $g/devices/*; do lspci -nns ${d##*/}; done; echo; done | sed -n '1,35p'
Group 0
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:1234]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:5678]

Group 1
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a36d]
00:14.2 RAM memory [0500]: Intel Corporation Device [8086:a36f]

Group 2
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2484]
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228b]

Meaning: This tells you whether the GPU is isolated (often it shares with its audio function, which is normal) or trapped with unrelated devices (bad).

Decision: If your target device shares with unrelated controllers, you need to understand the upstream port and ACS.

Task 5: Identify the device and its kernel driver (who owns it)

cr0x@server:~$ lspci -nnk -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2484] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:xxxx]
	Kernel driver in use: nouveau
	Kernel modules: nouveau, nvidiafb

Meaning: If a host driver is bound, VFIO passthrough will fight you at boot or when the VM starts.

Decision: Bind to vfio-pci (or stub) only after you’ve confirmed the group is acceptable. Don’t “fix” grouping by rebinding drivers; rebinding changes driver ownership, not PCIe topology.

Task 6: Show PCIe topology as a tree (find the parent bridge)

cr0x@server:~$ lspci -tv
-+-[0000:00]-+-00.0  Host bridge
 |           +-01.0-[01]----00.0  VGA compatible controller
 |           |            \-00.1  Audio device
 |           +-14.0  USB controller
 |           +-17.0  SATA controller
 |           \-1d.0-[02]----00.0  Ethernet controller

Meaning: Your GPU is behind 00:01.0. If multiple endpoints are under the same downstream segment and grouped together, the upstream port may not isolate.

Decision: If the GPU slot and an onboard controller sit under the same bridge in the tree, you likely have a wiring/topology constraint. Consider moving slots/cards.

Task 7: Inspect ACS capabilities on the upstream port

cr0x@server:~$ sudo lspci -vv -s 00:01.0 | grep -A8 -i "Access Control Services"
	Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

Meaning: Capability exists, but controls may be disabled. Some platforms leave ACSCtl bits off.

Decision: If ACSCap is missing entirely on the relevant ports, you’re unlikely to get clean splits without overrides or different hardware. If present but disabled, firmware/kernel may be able to enable it.

Task 8: Check whether your platform is using translation mode you expect

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0 root=UUID=... ro quiet intel_iommu=on iommu=pt

Meaning: intel_iommu=on enables remapping. iommu=pt sets passthrough for host devices (often improves performance for non-assigned devices).

Decision: If you’re debugging isolation, keep it simple: enable IOMMU explicitly, avoid exotic parameters until you understand the baseline.

Task 9: Confirm interrupt remapping / posted interrupts state (sanity for passthrough)

cr0x@server:~$ dmesg | grep -E "Interrupt Remapping|IR|x2apic" | head -n 20
[    0.013333] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.013444] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.

Meaning: IRQ remapping reduces the chance of interrupts being delivered to the wrong place under assignment.

Decision: If it’s disabled, treat passthrough as higher risk; prioritize firmware updates or platform changes if this is production.

Task 10: Check whether your GPU is in a group with a bridge you don’t control

cr0x@server:~$ readlink -f /sys/bus/pci/devices/0000:01:00.0/iommu_group
/sys/kernel/iommu_groups/2

Meaning: Confirms group number from the device’s perspective.

Decision: If the group contains devices you can’t pass through together, your next step is topology remediation, not VFIO config tweaks.

Task 11: Identify which devices share the group (the “who else is in the room” check)

cr0x@server:~$ G=$(basename $(readlink /sys/bus/pci/devices/0000:01:00.0/iommu_group)); for d in /sys/kernel/iommu_groups/$G/devices/*; do lspci -nnk -s ${d##*/}; echo; done
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2484] (rev a1)
	Kernel driver in use: nouveau
	Kernel modules: nouveau, nvidiafb

01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228b] (rev a1)
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

Meaning: GPU + its audio function only? That’s typically fine. GPU + USB + SATA? That’s a design constraint.

Decision: If you see unrelated devices, you’re deciding between: moving cards/slots, BIOS settings, different board/CPU, or ACS override (with risk).

Task 12: Check for a PCIe switch in the path (common in servers, sneaky in workstations)

cr0x@server:~$ lspci -nn | grep -i "PCI bridge\|PLX\|Broadcom\|PEX"
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:5678]
03:00.0 PCI bridge [0604]: Broadcom / PLX Device [10b5:8725]
03:01.0 PCI bridge [0604]: Broadcom / PLX Device [10b5:8725]

Meaning: Switches often show up as multiple bridge functions. Whether they isolate depends on ACS and configuration.

Decision: If a switch is involved, you must inspect ACS on those downstream ports, not just the root port.

Task 13: Validate the kernel’s view of isolation with a targeted ACS scan

cr0x@server:~$ for dev in 00:01.0 03:00.0 03:01.0; do echo "== $dev =="; sudo lspci -vv -s $dev | grep -A6 -i "Access Control Services"; done
== 00:01.0 ==
	Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
== 03:00.0 ==
	Access Control Services
		ACSCap: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

Meaning: Capabilities vary per port. Some ports can do request redirection but not upstream forwarding controls, etc.

Decision: If the required ports lack ACS, you don’t get safe isolation, period. Plan hardware changes or accept grouping.

Task 14: Check if the platform is forcing “above 4G decoding” / Resizable BAR effects

cr0x@server:~$ dmesg | grep -iE "Resizable BAR|Above 4G|pci 0000"
[    0.222222] pci 0000:01:00.0: BAR 0: assigned [mem 0x8000000000-0x800fffffff 64bit pref]
[    0.222333] pci 0000:01:00.0: BAR 2: assigned [mem 0x8010000000-0x8011ffffff 64bit pref]

Meaning: Large BAR mappings aren’t a grouping issue, but they can expose firmware bugs and resource allocation problems that masquerade as VFIO pain.

Decision: If you see resource allocation errors or missing BARs, fix that first; a broken device init can make you chase phantom “IOMMU issues.”

Task 15: Confirm VFIO viability without committing (dry-run thinking)

cr0x@server:~$ sudo dmesg | tail -n 20
[  120.123456] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[  120.123789] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[  120.124111] vfio-pci 0000:01:00.1: enabling device (0000 -> 0002)

Meaning: When you bind, the kernel logs what it’s doing. Errors here often indicate reset quirks, not grouping.

Decision: If binding works cleanly but group is shared, don’t celebrate. You still can’t safely assign part of a shared group.

Joke #1: An IOMMU group is like a conference room booking: if your USB controller shows up, everyone’s leaving with your whiteboard markers.

PCIe topology clinic: root ports, bridges, switches, and ACS

Start with the upstream path, not the endpoint

People obsess over the GPU model and forget the boring part: the PCIe fabric. Your endpoint device is only as isolatable as the least-capable bridge between it and the CPU. That includes:

  • Root port / root complex on the CPU or chipset.
  • Downstream ports (often part of a switch or chipset fabric).
  • PCIe switches (PLX/Broadcom, ASMedia, etc.).
  • Integrated endpoints (onboard audio, USB controllers, SATA, Wi‑Fi).

ACS: what it does and why it changes grouping

Access Control Services is a set of features that help prevent or control peer-to-peer traffic and ensure that requests are properly routed upstream where the IOMMU can enforce policy. In plain terms: ACS helps stop devices behind the same switch/port from talking to each other behind the IOMMU’s back.

Linux uses ACS information (and other topology cues) to decide whether devices can be separated into different IOMMU groups. If the platform can’t guarantee isolation, the group stays big.
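
A quick way to see where isolation actually stops is to survey every PCI bridge on the box and dump its ACS state. A minimal sketch (the ::0604 class filter selects PCI bridges; needs a reasonably recent pciutils):

cr0x@server:~$ for b in $(lspci -d ::0604 | awk '{print $1}'); do echo "== $b =="; sudo lspci -vv -s $b | grep -E "ACSCap|ACSCtl" || echo "  no ACS capability"; done

Ports that report “no ACS capability” are the prime suspects for any oversized group downstream of them.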

Typical “why is my GPU grouped with X?” patterns

  • Chipset downstream port fan-out: multiple onboard controllers sit behind one chipset port without ACS, so they get grouped.
  • Shared switch: two physical slots are actually behind one PCIe switch without proper ACS configuration.
  • Motherboard lane muxing: enabling a second M.2 slot reroutes lanes and changes which devices share a root port.
  • Multifunction devices: GPU video + GPU audio share. This is normal and usually desirable for passthrough.

What “fixing topology” really means

It’s not mystical. It’s one of these moves:

  1. Move the card to a different slot that hangs off a different root port (ideally CPU-attached).
  2. Stop using a slot/M.2 port that forces lane sharing with something you need isolated.
  3. Enable the firmware features that expose ACS or proper remapping (when they exist).
  4. Use hardware that actually supports isolation: workstation/server boards, known-good switches, CPUs with enough lanes.
  5. Accept the grouping and passthrough the entire group if that’s safe and operationally acceptable.

Most “IOMMU group problems” are your motherboard silently telling you it was designed for gaming RGB, not for trustworthy DMA boundaries.
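
One cheap verification after any physical change: the sysfs device path spells out the full upstream chain for an endpoint. Using the example GPU address from the tasks above:

cr0x@server:~$ readlink -f /sys/bus/pci/devices/0000:01:00.0
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0

Every 0000:BB:DD.F component between pci0000:00 and the endpoint is an upstream bridge or port. If this chain doesn’t change after a slot move, don’t expect the grouping to change either.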

BIOS/firmware settings that matter (and the ones that don’t)

Settings that usually matter

  • Intel: VT-d (Directed I/O) must be enabled. VT-x alone is not enough.
  • AMD: SVM enables CPU virtualization; IOMMU (or “AMD-Vi”) enables DMA remapping. You usually need both for passthrough use cases.
  • Above 4G decoding: often required for modern GPUs and multiple PCIe devices; it can also influence how firmware allocates resources.
  • SR-IOV: enabling it can change how some firmware configures downstream ports and ARI; not required for passthrough, but relevant for NIC VFs.
  • PCIe ACS / “PCIe ARI” toggles: if your BIOS exposes anything resembling ACS, enable it. If it exposes ARI, enable it when using SR-IOV-heavy NICs.

Settings that people toggle out of desperation

  • CSM (Compatibility Support Module): turning it off can help modern GPU initialization and BAR sizing, but it won’t magically split IOMMU groups.
  • PCIe Gen speed forcing: occasionally fixes link stability; rarely affects grouping.
  • “Gaming mode”: if the BIOS has this, back away slowly.

Firmware updates: the unglamorous fix

Motherboard BIOS updates and server BMC/UEFI updates can change ACS exposure, remapping quirks, and PCIe resource allocation. It’s not fun. It’s also the difference between “works for months” and “random DMA fault at 3 AM.”

Kernel parameters, ACS override, and when to refuse them

The parameters you’ll see in the wild

  • intel_iommu=on / amd_iommu=on: explicit enablement.
  • iommu=pt: passthrough mode for host devices (performance/latency tradeoffs).
  • pcie_acs_override=downstream,multifunction: force the kernel to treat some ports as if ACS separation exists. This knob comes from an out-of-tree patch; stock mainline kernels don’t implement it.
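
To make the enablement parameters persistent, they belong in your bootloader config, not in a one-off boot edit. A minimal sketch for GRUB-based systems (file path and regeneration command follow Debian/Ubuntu conventions; other distros differ):

cr0x@server:~$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
cr0x@server:~$ sudo update-grub
cr0x@server:~$ cat /proc/cmdline    # verify after the next reboot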

What ACS override really does

It changes Linux’s grouping decisions by pretending certain PCIe components provide isolation boundaries even when they don’t advertise the right ACS capabilities. That can produce smaller IOMMU groups and make VFIO happy.

It can also create an isolation boundary that is not real. In a hostile or compromised-device threat model, that’s not acceptable. In a homelab, people take the risk. In production, you need a serious conversation with security and risk owners.

When I will use ACS override

  • Lab environments where the only “attacker” is my own impatience.
  • Temporary validation on a platform we already plan to replace, to prove the workload is viable.
  • Situations where the passthrough device is the only significant DMA-capable endpoint in the group and the rest are effectively inert (rare).

When I refuse ACS override

  • Multi-tenant environments.
  • Anything with compliance requirements around isolation.
  • Systems where the shared group contains storage or networking that the host depends on for survival.

Joke #2: ACS override is like putting “Do Not Enter” tape on a doorway with no door. It makes you feel safer until someone walks through it.

Virtualization stack specifics: Proxmox/KVM, bare metal, and “enterprise” boxes

KVM/VFIO reality check

VFIO is strict because it’s the last adult in the room. If a group contains multiple devices, VFIO wants you to assign the entire group to one guest, because splitting it is how you end up with undefined behavior and security holes.

In practice, “I need to pass through just the GPU” means you need the GPU’s group to contain only the GPU functions (and maybe a USB controller if you’re intentionally passing a whole controller through for input devices).
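
When the group really does contain only the GPU’s functions, the usual move is binding both functions to vfio-pci at boot. A minimal modprobe sketch using the example IDs from Task 4 (substitute your own from lspci -nn, and regenerate your initramfs afterwards):

cr0x@server:~$ cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2484,10de:228b
softdep nouveau pre: vfio-pci

The softdep line makes modprobe load vfio-pci before the host GPU driver so vfio-pci wins the claim; which driver you need to beat (nouveau, nvidia, amdgpu) depends on your hardware.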

Proxmox and similar stacks

Proxmox makes it easy to see groups and configure VFIO, but it can’t fix a motherboard that welded your GPU slot to the same downstream port as your SATA controller. If Proxmox shows a fat group, believe it.

Workstation vs server boards

Servers often have better lane distribution and more predictable PCIe switch designs. Workstations can be good too. Consumer boards are the wild west: sometimes you get lucky, sometimes the chipset fabric is a party bus.

Storage engineer’s note: don’t casually share groups with HBAs

If the group contains your HBA or NVMe controller used by the host, you are one bad decision away from handing your storage controller to a VM. That’s not “performance tuning.” That’s data loss roulette.

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption

They were building a small internal GPU cluster for build acceleration and some ML inference. Nothing exotic: KVM, VFIO, a few midrange GPUs, and a standardized motherboard because procurement liked “standardized.” Someone verified that IOMMU was enabled in BIOS and checked that dmesg showed DMAR. They declared the platform “passthrough-ready.”

The first two nodes worked because the GPUs landed in a slot wired directly to CPU root ports. The third node got assembled slightly differently—same board, but an extra NVMe carrier went in. That carrier changed lane bifurcation and pushed the GPU slot behind a chipset switch path. Nobody noticed because the GPUs still showed up fine in lspci.

On that node, the GPU ended up in the same IOMMU group as a USB controller and a SATA controller. The engineer configuring VFIO assumed “groups are just a Linux thing” and forced ACS override to split them. The system passed initial testing. A week later, during a busy build window, the host experienced intermittent SATA errors and then a filesystem went read-only. The postmortem wasn’t glamorous: they had created a fake isolation boundary, and heavy DMA from the GPU workload correlated with host I/O weirdness.

The fix was painfully ordinary: move the GPU to a different slot, disable the lane-sharing option, and standardize the build so every node had the same PCIe topology. They also banned ACS override in production unless approved. The real lesson wasn’t “ACS override is evil.” It was that “IOMMU enabled” is not the same as “IOMMU isolation achieved,” and topology drift is a reliability bug.

2) The optimization that backfired

A team wanted to reduce latency for a high-throughput packet processing VM. They planned to pass through a NIC and a GPU for some on-the-fly inference. The host also had a fast NVMe used for local caching. Someone noticed that the NIC and NVMe were in the same IOMMU group on a particular platform generation.

Instead of changing hardware, they made a “clever” move: passthrough the entire group to the VM, reasoning that the VM was “the real workload” and the host could boot from a mirrored SATA DOM anyway. On paper, it reduced overhead and avoided dealing with group splits. It even benchmarked well.

Then operations showed up with the boring questions. How do you patch the host if the VM owns the storage controller used for cache and logs? How do you capture crash dumps? What happens if the VM wedges and holds the PCIe function in a bad reset state? The answer was: you reboot and hope. That’s not a plan, that’s a ritual.

The optimization backfired when the VM hit a driver bug and stopped responding. The host was still “up” but lost access to the passthrough devices, including the cache NVMe. Monitoring went noisy, and the node flapped in and out of service. They reverted to a less “efficient” design: separate IOMMU groups by choosing a different slot layout and using a platform with more root ports, keeping host-critical storage out of any passthrough group. Performance dropped a little. Uptime improved a lot. Nobody missed the heroics.

3) The boring but correct practice that saved the day

Another organization ran mixed workloads: some VMs with NIC passthrough (SR-IOV VFs), some with full device passthrough for specialized accelerators. They had been burned before, so they had a policy: every hardware SKU had a “PCIe topology manifest” checked into the same repo as provisioning code.

When a new batch of servers arrived, the procurement team had substituted a “compatible” motherboard revision due to supply constraints. Same CPU family, same chassis, same everything-to-a-non-expert. The manifest check in CI compared expected IOMMU group layouts against what a freshly booted node reported. It failed immediately.

Instead of discovering the problem in production after a late-night kernel upgrade, they quarantined the batch. The root cause was a different PCIe switch configuration on the new revision: the accelerator slot and an onboard RAID controller now shared a downstream port without usable ACS separation. The vendor could provide a firmware update for some, and for others the fix was a different riser.

The practice was boring: boot, collect lspci, collect group mappings, diff against expected, fail fast. It saved days of debugging and prevented a slow-motion incident. This is the kind of “paperwork” that actually increases system availability, which is why people hate it right up until it saves them.

Common mistakes: symptom → root cause → fix

1) “I enabled IOMMU but /sys/kernel/iommu_groups is empty”

Symptom: No groups directory entries, VFIO complains, guides say “enable IOMMU.”

Root cause: VT-d/AMD-Vi is off in BIOS, or kernel boot params aren’t enabling it, or you’re in a firmware mode that hides remapping.

Fix: Enable VT-d/AMD IOMMU in BIOS; add intel_iommu=on or amd_iommu=on; update BIOS if DMAR tables are broken.

2) “My GPU shares a group with USB/SATA/NIC on the motherboard”

Symptom: Group includes GPU plus unrelated onboard controllers.

Root cause: Devices sit behind the same downstream port/switch without ACS isolation; motherboard wiring prioritizes lane sharing over isolation.

Fix: Move GPU to a CPU-attached slot; disable lane sharing features; avoid M.2/riser combos that reroute lanes; choose a board with better root port distribution.

3) “I used ACS override and it works, but I get random host I/O errors”

Symptom: Passthrough works, but host gets intermittent PCIe/SATA/NVMe weirdness under load.

Root cause: Forced group splitting created a boundary the hardware doesn’t enforce; peer-to-peer and DMA interactions become undefined.

Fix: Remove ACS override; redesign topology; isolate via real ACS-capable ports or pass through the whole real group if safe.

4) “GPU reset problems look like IOMMU problems”

Symptom: VM shutdown leaves GPU unusable, next VM start fails, errors in dmesg about reset.

Root cause: GPU reset quirks (FLR not supported well), vendor driver behavior, or multifunction handling—not IOMMU grouping.

Fix: Use vendor-supported reset mechanisms (where applicable), consider different GPU models, ensure both GPU functions are bound appropriately, avoid hot-reassigning in production.
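
Before blaming grouping, check whether the device even advertises Function Level Reset. A quick check (output line varies by device; FLReset+ means the function claims FLR, FLReset- means you’re relying on slot/bus resets and vendor quirks):

cr0x@server:~$ sudo lspci -vv -s 01:00.0 | grep -i flreset
		ExtTag+ AttnBtn- AttnInd- PwrInd- RME- FLReset+ SlotPowerLimit 75.000W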

5) “SR-IOV VFs are in huge groups and can’t be assigned cleanly”

Symptom: VFs share groups with PFs and other devices in ways you didn’t expect.

Root cause: ARI/ACS/firmware configuration; platform groups them because isolation can’t be guaranteed.

Fix: Enable SR-IOV and ARI support in BIOS if available; ensure firmware is current; use NICs and platforms known to behave well with VFIO.
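
As a concrete sanity check once firmware is sorted: VFs are created through sysfs, and on a well-behaved platform each one lands in its own group. A minimal sketch, assuming a hypothetical interface name enp3s0f0 and a NIC that supports at least 4 VFs:

cr0x@server:~$ echo 4 | sudo tee /sys/class/net/enp3s0f0/device/sriov_numvfs
4
cr0x@server:~$ for v in /sys/class/net/enp3s0f0/device/virtfn*; do readlink -f $v/iommu_group; done
/sys/kernel/iommu_groups/24
/sys/kernel/iommu_groups/25
/sys/kernel/iommu_groups/26
/sys/kernel/iommu_groups/27

Four distinct groups is the healthy outcome; VFs piling into one shared group means the platform, not the NIC, is the constraint.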

6) “Everything is in one group on a laptop/mini PC”

Symptom: One giant group, no clean splits, external GPU dreams fading.

Root cause: Integrated topology with limited ACS, chipset-rooted fabric, and security model not designed for deterministic isolation.

Fix: Accept limitations; use paravirtualized devices; or change to hardware built for this.

Checklists / step-by-step plan

Step-by-step: get from “shared group” to “clean isolation”

  1. Baseline enablement: confirm DMAR/AMD-Vi enabled in dmesg and groups exist.
  2. Identify targets: list PCI addresses of the devices you want to assign.
  3. Map groups: print group membership and confirm what shares with what.
  4. Draw the upstream chain: use lspci -tv to locate the upstream bridges/ports.
  5. Check ACS on each relevant port: lspci -vv on the root/downstream ports in the path.
  6. Try the easy topology fix: move the card to a different slot; remove/relocate M.2/riser cards that cause lane rerouting.
  7. Re-check groups after each physical change: do not batch changes; you’ll lose causality.
  8. Firmware settings: enable VT-d/AMD IOMMU, Above 4G decoding, SR-IOV/ARI if relevant; update BIOS.
  9. Decide your risk stance: if you need ACS override, write down why, what’s in the group, and what the blast radius is.
  10. Bind drivers only after isolation is correct: assign devices to vfio-pci once groups are sane.
  11. Test reset paths: start/stop VM repeatedly and watch for device wedging.
  12. Operationalize: capture topology + group mapping as an artifact for future drift detection (see the sketch after this list).
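
For step 12, here is a minimal drift-check sketch (the manifest path /root/iommu-manifest.txt and its one-line-per-device format are illustrative; adapt to your provisioning tooling):

cr0x@server:~$ for d in /sys/kernel/iommu_groups/*/devices/*; do echo "${d#/sys/kernel/iommu_groups/}"; done | sort > /tmp/iommu-now.txt
cr0x@server:~$ diff -u /root/iommu-manifest.txt /tmp/iommu-now.txt && echo "topology matches" || echo "DRIFT: quarantine this node"

Run it on first boot of every new node and after every firmware update; it’s the cheap version of the manifest check from the third story above.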

Checklist: production readiness for passthrough

  • Host-critical storage/network devices are not in any group you plan to assign.
  • No ACS override in production unless explicitly approved and risk-accepted.
  • Firmware versions are pinned and upgrade-tested with topology regression checks.
  • Device reset behavior validated (cold boot, warm reboot, VM start/stop cycles).
  • Monitoring includes PCIe AER errors, IOMMU faults, and device driver resets.

Checklist: when you should stop and buy different hardware

  • Your target device shares a group with chipset SATA/USB/NVMe you need on the host, and there is no alternative slot/root port.
  • Upstream ports lack ACS capability and you can’t change the topology.
  • Firmware offers no relevant toggles and updates don’t change behavior.
  • The system requires ACS override to work at all, and you have a multi-tenant or compliance-driven environment.

FAQ

1) If IOMMU is enabled, why are my devices still in the same group?

Because enablement is not isolation. Grouping reflects whether the PCIe fabric can enforce separation (ACS/bridge behavior), not whether DMA remapping exists in theory.

2) Is it normal for a GPU and its audio function to share a group?

Yes. That’s one physical device exposing multiple PCI functions. You usually pass both through together.

3) Can I pass through only one device from a shared group?

Not safely. VFIO requires group-level assignment because any device in the group could DMA in ways that affect the others. If you split it anyway (via hacks), you’re accepting undefined behavior and security risk.

4) Will enabling “Above 4G decoding” split my IOMMU groups?

Usually no. It helps resource allocation and BAR mapping for large devices, which can fix boot/probing issues, but it doesn’t create isolation boundaries by itself.

5) Does iommu=pt affect grouping?

No. It affects how the host maps DMA for devices the host keeps, often improving performance. Group formation is about topology and ACS, not passthrough vs translated mappings.

6) What’s the safest alternative if I can’t get clean groups?

Don’t do full device passthrough. Use paravirtual devices (virtio), API-level GPU sharing (where applicable), or move to hardware with better PCIe isolation.

7) Why do my IOMMU groups change when I add an NVMe drive or use a different M.2 slot?

Lane sharing and bifurcation. Many boards reroute lanes depending on which sockets are populated. That can move devices behind different bridges/switches, changing isolation.

8) Is ACS override always insecure?

It relaxes the kernel’s isolation assumptions. In a strict threat model, yes, it undermines the guarantee you think you have. In a single-user lab, many accept it. The key is being honest about the boundary you’re pretending exists.

9) My groups are fine, but passthrough still fails. What next?

Check driver binding, device reset behavior, firmware resource allocation errors, and whether the device supports FLR. Grouping is necessary, not sufficient.

10) Why do servers “just work” more often than desktops for this?

More root ports, better lane budgets, more consistent switch designs, and firmware that takes ACS and SR-IOV seriously. Also: vendors expect virtualization customers to complain loudly.

Conclusion: next steps that actually move the needle

If IOMMU is on but your devices still share a group, stop treating it like a kernel-parameter puzzle. It’s a PCIe topology investigation. Do the boring work: map the tree, identify the upstream ports, verify ACS capabilities, and change the physical layout or platform until the hardware can prove isolation.

Next steps:

  1. Run the group inventory command and highlight the exact unwanted co-tenants in your target group.
  2. Use lspci -tv and lspci -vv to find the upstream bridge chain and whether ACS exists where it matters.
  3. Try one topology change at a time: move the card to a different slot, change M.2 population, toggle bifurcation/PCIe settings, update firmware.
  4. If you still need ACS override, write down the risk, what devices share the fabric, and what failure looks like—then decide like an adult.
  5. Operationalize it: capture your known-good IOMMU groups as a regression test so a “minor hardware revision” doesn’t become your next incident.