PCIe passthrough is the kind of feature that looks deterministic on a slide and behaves like weather in production.
One day your GPU VM is a perfect citizen; the next it black-screens on reboot and your maintenance window turns into a group therapy session.
The question “Is Intel VT-d better than AMD-Vi?” is usually asked right after someone buys hardware and right before they regret at least one BIOS setting.
The real answer is less about logos and more about specific platform behaviors: IOMMU grouping, interrupt remapping, firmware quality, and how badly you need clean device isolation.
What “better passthrough” actually means
“Better passthrough” isn’t a single metric. In production you care about four things, in roughly this order:
- Isolation correctness: the device is in its own IOMMU group, DMA is contained, and resets behave.
- Operational stability: reboots, live migrations (where applicable), driver reloads, and kernel updates don’t turn into incident tickets.
- Performance consistency: low jitter under load and predictable latency, especially for NVMe, NICs, and GPUs.
- Manageability: good tooling visibility, sane logs, and fewer “special boot flags” that become tribal knowledge.
If you’re building a home lab, you can tolerate hacks like ACS override and “just don’t reboot the VM twice in a row.”
If you’re doing this for a business—especially with regulated workloads—your “passthrough solution” is a system, not a checkbox.
VT-d vs AMD-Vi: the opinionated summary
Both Intel VT-d and AMD-Vi (AMD’s IOMMU) can deliver excellent passthrough. The biggest differences you’ll feel are not theoretical.
They’re platform implementation details: motherboard firmware, PCIe topology, IOMMU grouping, and whether interrupt remapping is solid.
My default recommendation
- If you need boring, enterprise-grade passthrough (SR-IOV NICs, HBAs, NVMe, GPUs in a fleet): pick the platform with the best board + BIOS track record, not the CPU vendor. In practice, that often means Intel platforms in vendor-certified servers, and AMD platforms in modern EPYC servers with mature firmware.
- If you’re buying consumer gear: AMD systems frequently give you more cores per dollar, but also more variability in IOMMU grouping depending on chipset and board routing. Intel consumer boards can be more predictable, but you’ll still meet the occasional “shared root port group” mess.
- If you rely on clean IOMMU groups and cannot tolerate ACS override: prioritize platforms that expose more PCIe root ports and cleaner downstream isolation. That’s less about “Intel vs AMD” and more about “this specific motherboard and CPU generation.”
Here’s the blunt version: the best passthrough is the passthrough you can reboot.
If you can’t cold-reboot the host, reboot the guest, and reattach the device repeatedly without weirdness, you don’t have a solution—you have a demo.
How IOMMU passthrough really works (and where it fails)
Passthrough is a three-part contract:
- The device does DMA and raises interrupts.
- The IOMMU translates device DMA addresses through page tables, restricting what physical memory the device can touch.
- The hypervisor (KVM/QEMU via VFIO on Linux, or a type-1 hypervisor) binds the device to a guest and programs the IOMMU mappings.
When it’s working, your guest can drive the device nearly as if it were bare metal, but the host retains DMA safety boundaries.
When it’s not working, the failure modes are… educational:
- Bad grouping: your GPU and your USB controller share an IOMMU group; you can’t safely pass one without the other.
- Reset failure: the guest shuts down, the device doesn’t properly reset, and the next boot hangs at “Starting bootloader…” or black-screens.
- Interrupt issues: MSI/MSI-X delivery gets weird; performance tanks or latency spikes; some devices only behave with certain kernel parameters.
- Page faults: IOMMU faults appear in dmesg; the guest driver is fine, but mappings or ATS/PRI behaviors don’t match expectations.
Also: passthrough is a topology problem.
Two identical CPUs with different motherboards can behave like different species, because your PCIe layout determines which devices sit behind which root ports and bridges,
and that determines groupings and reset domains.
Joke #1: PCIe passthrough is easy—until you try it.
Interesting facts and historical context (the short, useful kind)
These aren’t trivia-night facts. They explain why certain bugs and platform quirks exist.
- VT-d arrived after VT-x: CPU virtualization (VT-x) was not enough for safe device DMA, so VT-d filled the gap for directed I/O.
- AMD-Vi is “IOMMU” in Linux logs: Linux often reports AMD’s implementation under the generic IOMMU naming, and you’ll see “AMD-Vi” in dmesg on many systems.
- Interrupt remapping is a reliability feature: it’s not just performance—without it, you can be forced into less safe or less functional interrupt modes.
- ACS is a PCIe feature, not an IOMMU feature: Access Control Services can help enforce isolation between downstream ports; lack of ACS often drives ugly groupings.
- “ACS override” is a Linux workaround: it can split IOMMU groups by pretending ACS isolation exists. Sometimes it’s fine; sometimes it’s an own-goal.
- SR-IOV made IOMMU mainstream: once NICs started presenting multiple virtual functions, IOMMU correctness stopped being niche and became table stakes in data centers.
- DMAR is Intel’s ACPI table for IOMMU: if DMAR tables are wrong, VT-d can be “enabled” but effectively unreliable.
- AMD IOMMU uses IVRS tables: similarly, bad IVRS entries can lead to missing devices, broken mappings, or confusing group topology.
- GPU reset pain is partly historical: many GPUs were never designed for frequent function-level resets in virtualized environments, so you inherit hardware assumptions from the bare-metal world.
Platform differences that change outcomes
1) IOMMU group quality: the silent kingmaker
Your best-case setup: each device you want to pass through sits alone in its IOMMU group (or shares only with harmless functions of the same device, like GPU audio).
Worst case: half the motherboard is one group because the firmware exposes a coarse topology or the PCIe switches don’t support ACS.
Intel vs AMD here is not a moral contest. It’s about the combination of:
CPU integrated PCIe root complexes, chipset lanes, onboard PCIe switches/retimers, and board routing.
Server boards tend to be cleaner than consumer boards. Workstation boards can be either paradise or a carnival.
2) Interrupt remapping: the difference between “fine” and “why is latency spiky?”
With passthrough, you want MSI/MSI-X interrupts delivered cleanly to the guest. Interrupt remapping helps keep that sane and secure.
Without it, you may see warnings in dmesg, fallbacks, or restrictions. When people describe “jitter” on passthrough devices, interrupts are often involved.
3) ATS/PRI and friends: when devices get clever
Some devices can participate more actively in address translation (ATS) or request pages (PRI). In theory, this improves performance.
In practice, it expands the surface area for platform quirks. If you’re chasing rare IOMMU faults under load, these features can be relevant.
You don’t need to memorize acronyms; you need to recognize patterns and know where to look.
4) Reset domains and FLR support
Function Level Reset (FLR) makes passthrough lifecycle management much easier.
If your device can’t reset cleanly, you’ll get the classic symptom: first boot works, second boot fails until host reboot.
This affects both Intel and AMD systems because it’s often the device’s limitation, not the IOMMU’s.
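One way to probe reset behavior from the host is the per-device sysfs `reset` attribute, which only exists when the kernel knows some reset method for the device (FLR, secondary-bus reset, and so on). A sketch, assuming root; `try_pci_reset` is my name for it and the address is an example, not yours:

```shell
# try_pci_reset: ask the kernel to reset one PCI function via sysfs.
# The 'reset' attribute only appears when the kernel has a reset
# method for the device. Find your address with `lspci -D`.
try_pci_reset() {
    node="/sys/bus/pci/devices/$1/reset"
    if [ -w "$node" ]; then
        # Writing 1 triggers whichever reset method the kernel chose.
        echo 1 > "$node" && echo "reset issued for $1"
    else
        echo "no usable reset method exposed for $1"
    fi
}
# Example (as root): try_pci_reset 0000:01:00.0
```

If the attribute is missing for your GPU, that is an early warning: lifecycle management will depend on bridge resets and firmware behavior, not a clean per-function reset.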
5) Firmware maturity: the BIOS is part of your hypervisor
On paper, VT-d and AMD-Vi are both mature technologies. In reality, firmware quality ranges from “solid” to “somebody shipped on Friday.”
A board can advertise IOMMU support and still have broken IVRS/DMAR tables or questionable defaults.
Update the BIOS early, and treat release notes like incident retrospectives—because that’s what they are.
Performance: where the overhead hides
With passthrough, raw throughput is usually close to bare metal. The killers are:
- Latency and jitter from interrupt handling, scheduling, and NUMA mismatches.
- DMA mapping overhead when the guest memory is fragmented or the workload churns mappings.
- Wrong NUMA placement: passing through a device connected to NUMA node 1 into a guest pinned to node 0 is a slow-motion faceplant.
- Hugepages vs not: hugepages reduce TLB pressure and can reduce mapping overhead for some workloads.
If you’re comparing Intel and AMD purely on “passthrough performance,” you’re probably benchmarking the wrong thing.
The real differentiator is how easily you can make the system stable and predictable under your workload and reboot patterns.
Security and reliability: what you get when it’s correct
IOMMU is a security boundary. Without it, a DMA-capable device can read/write host memory directly.
With it, the device is constrained to a mapping defined by the host kernel/hypervisor.
That matters for:
- Multi-tenant hosts (even “tenants” inside one company).
- Untrusted drivers in guests (especially GPU and niche accelerator stacks).
- Containment of device misbehavior (firmware bugs, rogue DMA, etc.).
One useful reliability framing: passthrough is safe when your IOMMU is strict, your grouping is clean, and your lifecycle resets are correct.
Lose any one of those, and you’ll spend your time inventing rituals instead of running services.
One quote to keep you honest: “Hope is not a strategy.” — Rick Pitino
Practical tasks: commands, outputs, and decisions (12+)
These are Linux-centric, because that’s where most VFIO/KVM passthrough lives and where you’ll be debugging at 02:00.
The commands are runnable on modern distros; adjust package names for your environment.
Task 1: Confirm the CPU and virtualization flags
cr0x@server:~$ lscpu | egrep -i 'Vendor ID|Model name|Virtualization|Flags'
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU
Virtualization: VT-x
Flags: ... vmx ...
What it means: “Virtualization: VT-x” (Intel) or “AMD-V” (AMD) tells you CPU virtualization exists. It does not prove IOMMU is enabled.
Decision: If virtualization isn’t present, stop. You’re not doing passthrough on that host in any sane way.
Task 2: Confirm IOMMU is enabled in the kernel (Intel VT-d)
cr0x@server:~$ dmesg | egrep -i 'DMAR|IOMMU|VT-d' | head -n 30
[ 0.612345] DMAR: IOMMU enabled
[ 0.612678] DMAR: Host address width 46
[ 0.613210] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.615432] DMAR: Interrupt remapping enabled
What it means: “DMAR: IOMMU enabled” is the money line. “Interrupt remapping enabled” is a strong sign you’ll have fewer weird interrupt edge cases.
Decision: If you don’t see DMAR lines, check BIOS settings and kernel parameters (later tasks). Don’t debug VFIO until this is correct.
Task 3: Confirm IOMMU is enabled in the kernel (AMD-Vi)
cr0x@server:~$ dmesg | egrep -i 'AMD-Vi|IOMMU|IVRS' | head -n 40
[ 0.501234] AMD-Vi: IOMMU performance counters supported
[ 0.501567] AMD-Vi: Lazy IO/TLB flushing enabled
[ 0.504321] AMD-Vi: [Firmware Bug]: IOAPIC[4] not in IVRS table
What it means: AMD-Vi lines indicate the AMD IOMMU driver is active. IVRS warnings can be harmless or a firmware smell depending on severity.
Decision: If the system logs repeated IVRS/IOAPIC complaints and passthrough is flaky, update BIOS and consider a different board before you burn a week.
Task 4: Verify kernel command line (intel_iommu / amd_iommu)
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.0 root=/dev/mapper/vg0-root ro quiet intel_iommu=on iommu=pt
What it means: intel_iommu=on (or amd_iommu=on) enables IOMMU. iommu=pt puts host devices in pass-through mode for lower overhead while keeping translation for guests.
Decision: For virtualization hosts, iommu=pt is usually a good default. If you’re debugging device isolation or faults, you might temporarily remove it to compare behavior.
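If you want to sanity-check this without eyeballing `/proc/cmdline`, a tiny helper works. A sketch (`check_iommu_cmdline` is a name I made up; persisting the flags is a bootloader change, e.g. `GRUB_CMDLINE_LINUX` in `/etc/default/grub` followed by `update-grub` on Debian-family hosts):

```shell
# check_iommu_cmdline: report whether a kernel command line enables
# the IOMMU explicitly. Reads /proc/cmdline by default; a string can
# be passed in for testing.
check_iommu_cmdline() {
    line="${1:-$(cat /proc/cmdline)}"
    case " $line " in
        *" intel_iommu=on "*|*" amd_iommu=on "*)
            echo "IOMMU enabled on kernel command line" ;;
        *)
            echo "IOMMU parameter missing" ;;
    esac
}

check_iommu_cmdline
```

Wire this into your preflight checks so a bootloader regression shows up in staging, not during a maintenance window.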
Task 5: Confirm IOMMU groups and spot “bad sharing”
cr0x@server:~$ find /sys/kernel/iommu_groups/ -maxdepth 2 -type l | sed 's#.*/##' | sort | head
0000:00:01.0
0000:00:14.0
0000:00:14.2
0000:01:00.0
0000:01:00.1
What it means: This lists devices in IOMMU groups. You need to inspect which group each device belongs to and whether your target device is isolated.
Decision: If your target GPU/NVMe/NIC shares a group with unrelated devices you can’t also pass through, you either change slots, change motherboard, or accept ACS override risk.
Task 6: Print groups with human-readable names
cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "Group ${g##*/}"; for d in $g/devices/*; do lspci -nns ${d##*/}; done; echo; done | sed -n '1,40p'
Group 0
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:1234] (rev 02)
Group 1
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:5678] (rev 02)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)
What it means: You’re looking for “just the GPU functions” in a group, not the GPU plus SATA plus USB plus a random bridge with friends.
Decision: Clean grouping? Proceed with VFIO. Messy grouping? Consider a different PCIe slot (often changes root port), or a different motherboard.
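When auditing one candidate device at a time, a small helper (name is mine) answers the only question that matters: who else is in the group? Everything it prints must be passed to the guest together, or stay on the host together.

```shell
# groupmates: list every PCI function sharing an IOMMU group with
# the given device (address format as shown by `lspci -D`).
groupmates() {
    group="/sys/bus/pci/devices/$1/iommu_group"
    if [ -e "$group" ]; then
        ls "$(readlink -f "$group")/devices"
    else
        echo "no IOMMU group for $1 (IOMMU disabled, or bad address?)"
    fi
}
# Example: groupmates 0000:01:00.0
```

A GPU whose only groupmate is its own audio function is the good outcome; a GPU grouped with a chipset SATA controller is your cue to change slots.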
Task 7: Check whether your device supports reset (FLR) signals
cr0x@server:~$ lspci -s 01:00.0 -vv | egrep -i 'Capabilities:|FLR|Reset' -n | head -n 20
45:Capabilities: [1b0] Vendor Specific Information: ID=0001 Rev=1 Len=024
78:Capabilities: [1e0] Device Serial Number 00-00-00-00-00-00-00-00
92:Capabilities: [250] Latency Tolerance Reporting
110:Capabilities: [300] Secondary PCI Express
132:Capabilities: [400] Physical Resizable BAR
160:Capabilities: [420] Data Link Feature
What it means: Not all devices clearly advertise FLR in an obvious grep. Some will show “Function Level Reset” explicitly; others don’t.
Decision: If your GPU shows poor reset behavior in practice, plan for mitigations: vendor-reset modules (where applicable), avoiding hot-restart patterns, or selecting a different GPU model known to reset well.
Task 8: Identify the driver currently bound to a PCI device
cr0x@server:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5110
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
What it means: If “Kernel driver in use” is the vendor driver, you haven’t handed it to VFIO yet.
Decision: For passthrough, bind it to vfio-pci on the host and keep host graphics elsewhere (iGPU or separate GPU).
Task 9: Bind a device to vfio-pci (persistent via modprobe config)
cr0x@server:~$ sudo tee /etc/modprobe.d/vfio.conf >/dev/null <<'EOF'
options vfio-pci ids=10de:2684,10de:22ba disable_vga=1
EOF
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0
What it means: You’re telling the host to bind those PCI IDs to vfio-pci early in boot. The initramfs update ensures it takes effect.
Decision: If you rely on the GPU for host console, don’t do this. Use an iGPU or serial/IPMI. Otherwise you will lock yourself out in a very pure way.
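If you want to test a binding before baking it into initramfs, the kernel’s `driver_override` mechanism (standard sysfs since kernel 3.16) lets you rebind a single device at runtime. A sketch, assuming root and that nothing on the host still needs the device; `bind_vfio` is my name for it:

```shell
# bind_vfio: non-persistent runtime rebind of one device to vfio-pci,
# useful for validating before committing to the modprobe config.
bind_vfio() {
    sysdev="/sys/bus/pci/devices/$1"
    if [ ! -e "$sysdev" ]; then
        echo "no such device: $1"
        return
    fi
    modprobe vfio-pci
    # Detach whatever driver currently owns the device, if any.
    [ -e "$sysdev/driver" ] && echo "$1" > "$sysdev/driver/unbind"
    # Prefer vfio-pci for this one device, then ask for a reprobe.
    echo vfio-pci > "$sysdev/driver_override"
    echo "$1" > /sys/bus/pci/drivers_probe
    echo "bound $1 to vfio-pci"
}
# Example (as root): bind_vfio 0000:01:00.0
```

This does not survive a reboot, which is exactly why it is a good test harness: a bad decision evaporates on the next boot instead of becoming a locked-out host.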
Task 10: Confirm vfio-pci binding after reboot
cr0x@server:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5110
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
What it means: “Kernel driver in use: vfio-pci” is what you want. “Kernel modules” may still list vendor modules; that’s fine.
Decision: If it’s not bound, check initramfs, blacklist conflicting drivers, and confirm Secure Boot policy if it interferes with module loading in your environment.
Task 11: Check for IOMMU faults and DMA remapping errors
cr0x@server:~$ sudo dmesg -T | egrep -i 'DMAR|IOMMU|fault|vfio|remapping' | tail -n 30
[Tue Feb 4 01:12:11 2026] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[Tue Feb 4 01:13:09 2026] DMAR: [DMA Read] Request device [01:00.0] fault addr 0x7f2b0000 [fault reason 05] PTE Read access is not set
What it means: DMA faults indicate mapping problems or device behavior that violates current mappings. Sometimes it’s a misconfigured guest driver; sometimes it’s platform quirks.
Decision: If faults correlate with guest crashes or device lockups, prioritize stability over tuning. Re-check group isolation, kernel version, and consider disabling advanced features (like ATS) if your platform allows.
Task 12: Confirm hugepages (latency hygiene for guests)
cr0x@server:~$ grep -i huge /proc/meminfo | head
AnonHugePages: 1048576 kB
HugePages_Total: 256
HugePages_Free: 200
HugePages_Rsvd: 10
Hugepagesize: 2048 kB
What it means: This shows whether explicit hugepages are provisioned. Many latency-sensitive passthrough workloads behave better with predictable memory backing.
Decision: If you see stutter under load on a GPU VM or packet processing VM, hugepages are a reasonable next lever—after you’ve fixed IOMMU grouping and NUMA placement.
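For a quick provisioned-versus-used view, those meminfo fields can be summarized. A sketch (`hugepage_summary` is a made-up name; it parses stdin so the logic is easy to verify, and runtime provisioning is shown only as a comment because sizing is workload-specific):

```shell
# hugepage_summary: condense the explicit-hugepage counters from
# /proc/meminfo-format input into one line.
hugepage_summary() {
    awk '/^HugePages_Total:/ {t=$2} /^HugePages_Free:/ {f=$2}
         END {printf "total=%s free=%s in-use=%s\n", t, f, t-f}'
}

hugepage_summary < /proc/meminfo
# Runtime provisioning (root): echo 1024 > /proc/sys/vm/nr_hugepages
# Permanent reservations belong in sysctl.d or the kernel command line.
```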
Task 13: Check NUMA locality for a passed-through device
cr0x@server:~$ cat /sys/bus/pci/devices/0000:01:00.0/numa_node
1
What it means: The device lives on NUMA node 1. If your VM vCPUs and memory sit on node 0, you’re paying a cross-socket penalty.
Decision: Pin the VM vCPUs and memory to the device’s NUMA node where possible. If you can’t, reconsider which slot the device uses (some slots map to different CPU roots).
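A helper makes the lookup explicit before you pin anything (`pci_numa_node` is my name for it; the pinning itself is done with tools like numactl or libvirt’s <numatune> and <vcpupin>):

```shell
# pci_numa_node: print the NUMA node a PCI device reports, so guest
# vCPU/memory pinning can be aligned with it.
# A value of -1 means the platform reports no NUMA affinity.
pci_numa_node() {
    f="/sys/bus/pci/devices/$1/numa_node"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown device: $1"
    fi
}
# Example: pci_numa_node 0000:01:00.0
```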
Task 14: Inspect PCIe topology to understand grouping causes
cr0x@server:~$ lspci -t
-[0000:00]-+-00.0
+-01.0-[01]----00.0
+-14.0
\-1c.0-[02]----00.0-[03]----00.0
What it means: This shows the bridge tree. Devices behind the same downstream bridge often land in the same IOMMU group unless ACS is available and enabled.
Decision: If your GPU shares a bridge with critical host devices, move it to a slot connected to a different root port, or you’ll be stuck negotiating with physics.
Task 15: Verify VFIO is loaded and which modules are active
cr0x@server:~$ lsmod | egrep 'vfio|kvm' | head
vfio_pci 65536 0
vfio_pci_core 90112 1 vfio_pci
vfio_iommu_type1 45056 0
vfio 45056 2 vfio_pci_core,vfio_iommu_type1
kvm_intel 409600 0
What it means: VFIO core and IOMMU type1 are loaded. If they’re missing, your host isn’t ready for passthrough even if the BIOS is configured.
Decision: Load modules, fix initramfs, and confirm your kernel config supports VFIO. Don’t attempt guest configs until the host foundation is stable.
Task 16: Check whether the kernel is using interrupt remapping
cr0x@server:~$ dmesg | egrep -i 'interrupt remapping|IR:' | head
[ 0.615432] DMAR: Interrupt remapping enabled
What it means: On Intel, this is explicit. On AMD, you may see different wording. Either way, you’re looking for signs that modern interrupt handling is enabled.
Decision: If interrupt remapping is disabled and you see instability or security warnings, consider BIOS toggles related to VT-d/AMD IOMMU and interrupt remapping, then retest.
Fast diagnosis playbook
When passthrough is broken, you don’t start by rewriting your QEMU config. You start by proving the platform is capable of being correct.
Here’s the “find the bottleneck fast” flow I use.
First: Is the IOMMU actually on and sane?
- Check /proc/cmdline for intel_iommu=on or amd_iommu=on.
- Check dmesg for “IOMMU enabled” and any DMAR/IVRS table complaints.
- Check that VFIO modules load.
Second: Are IOMMU groups acceptable?
- Enumerate groups under /sys/kernel/iommu_groups.
- Verify your target device is isolated or only grouped with its own functions.
- If it’s not: move slots, disable unused onboard devices, or accept that the motherboard isn’t fit for this requirement.
Third: Is this a reset/firmware quirk rather than “VFIO tuning”?
- Does the first VM boot work but subsequent boots fail? Smells like reset/FLR trouble.
- Do you need a host reboot to recover the device? That’s reset domain pain.
- Update BIOS. Then update the kernel. Then retest. Don’t swap twelve variables at once.
Fourth: Is it NUMA/interrupt latency masquerading as “passthrough is slow”?
- Check device NUMA node and align VM pinning accordingly.
- Look for interrupt remapping status and MSI/MSI-X issues in logs.
- Only after that: consider hugepages, CPU isolation, and scheduler tuning.
Joke #2: The IOMMU didn’t “randomly break.” It waited until you were confident.
Common mistakes: symptom → root cause → fix
1) “VFIO works once, then black screen on second boot”
Symptom: Guest boots and uses the GPU once. After shutdown/restart, GPU never initializes again until host reboot.
Root cause: Device doesn’t support clean FLR or reset isn’t propagating through the bridge; common with some consumer GPUs.
Fix: Prefer GPUs known for virtualization-friendly resets; try different slot (changes reset domain); update BIOS; consider a reset workaround module if appropriate for your environment; avoid fast reboot loops.
2) “Device is in a giant IOMMU group with SATA/USB; can’t pass through safely”
Symptom: Your GPU shares an IOMMU group with chipset SATA controller and USB controller.
Root cause: No ACS isolation on the relevant downstream path; board routes multiple functions behind one bridge; firmware exposes coarse grouping.
Fix: Move the card to a CPU-rooted slot; disable unused onboard devices; choose a different motherboard with better PCIe isolation. Use ACS override only if you accept the security and stability trade.
3) “High throughput but awful latency/jitter”
Symptom: NVMe benchmarks look fine, but application latency spikes; GPU frame times are uneven; NIC packet processing is bursty.
Root cause: NUMA mismatch, interrupt handling issues, host contention, missing hugepages.
Fix: Align VM CPU+memory to device NUMA node; ensure interrupt remapping and MSI/MSI-X are working; isolate host CPUs for latency-sensitive guests; use hugepages for the guest.
4) “IOMMU is enabled but no groups appear”
Symptom: dmesg mentions IOMMU, but /sys/kernel/iommu_groups is empty or missing.
Root cause: Kernel booted without IOMMU parameters; virtualization disabled in BIOS; or you’re in a kernel/boot mode mismatch (rare, but happens).
Fix: Verify BIOS toggles (VT-d/AMD IOMMU); verify /proc/cmdline; update kernel; ensure you’re not running a stripped-down kernel build.
5) “IOMMU faults under load”
Symptom: DMAR/AMD-Vi faults appear in dmesg during heavy I/O; guest freezes or device drops.
Root cause: Platform firmware bugs, unstable PCIe link, or advanced translation features interacting poorly.
Fix: Update BIOS and kernel; re-seat card; reduce PCIe link speed as a test; verify power delivery; consider disabling advanced features if your stack provides safe toggles. If it persists, replace the board before you normalize it.
6) “Host loses network or storage when VM starts”
Symptom: Starting a VM with passthrough causes host services to die.
Root cause: You passed through the wrong device (or the right device in the same group as host-critical devices) because grouping was ignored.
Fix: Re-check IOMMU groups, bind only the target device IDs, and keep host-critical controllers out of passthrough groups. If you can’t, the hardware isn’t appropriate for this design.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company wanted GPU passthrough for a handful of ML annotation workstations running as VMs.
The plan looked simple: one host per team, one GPU per VM, easy scaling.
Procurement bought a batch of “virtualization-ready” machines because the CPU spec sheet said VT-d or AMD-Vi support.
The first week was fine. The second week, after routine patching, the tickets started: black screens after VM reboot, random USB dropouts, and one host that refused to boot a VM if a specific USB hub was plugged in.
The team assumed it was “a driver issue” and spent days pinning blame on guest OS updates.
The actual failure was topology: the GPU and the USB controller sat in the same IOMMU group on that motherboard.
The “fix” they accidentally applied was rebooting the host frequently, which temporarily cleared the reset state and masked the issue.
Once they correlated failures with IOMMU groups, the pattern was embarrassingly consistent.
They ended up moving GPUs to different slots where possible, and for a subset of hosts, replacing the motherboard model entirely.
The lesson wasn’t “Intel vs AMD.” The lesson was: a CPU feature list is not a passthrough guarantee. The board is the product.
Mini-story 2: The optimization that backfired
Another org ran a virtualized storage appliance VM with an HBA passed through, plus a high-performance NIC passed through for replication traffic.
They were chasing a few percentage points of throughput and decided to “optimize” by enabling every performance feature in BIOS: aggressive power settings, deeper C-states, and some IOMMU performance knobs.
Throughput improved in a synthetic benchmark. Latency got worse in production.
Replication windows started missing their targets, and worse: the storage VM occasionally logged I/O timeouts under peak load.
Nothing was completely broken, which made it more expensive to debug because it looked like “the network being weird.”
The root cause was a combination of power management and interrupt latency interacting badly with the passthrough NIC.
The VM’s vCPUs were also pinned on the wrong NUMA node relative to the NIC, so every interrupt was effectively a small cross-socket negotiation.
They had optimized the wrong metric and then deployed it to the only metric that matters: user-facing latency.
Rolling back the “optimizations,” aligning NUMA pinning, and using a conservative power profile restored stability.
The funniest part (in a dry way) was that the original system was fine; their benchmark victory lap created the incident.
Mini-story 3: The boring but correct practice that saved the day
A financial services shop ran a cluster of virtualization hosts with mixed workloads, including a few VMs with passthrough NICs for specialized packet capture.
They treated passthrough hosts like pets at first—hand-tuned, lovingly configured, and impossible to reproduce.
Eventually they got tired of surprises and standardized.
The “boring practice” was a preflight checklist executed on every new host and after every BIOS update:
confirm IOMMU enabled, confirm interrupt remapping, dump IOMMU groups, snapshot PCIe topology, and record known-good kernel parameters.
Nothing glamorous. Just disciplined.
Then a vendor BIOS update quietly changed PCIe enumeration order on a subset of machines.
Without the preflight, they would have discovered it during a production maintenance window when devices attached to different groups and the old VFIO bindings grabbed the wrong controller.
With the preflight, they caught it in staging, adjusted bindings, and shipped without drama.
Their system didn’t become faster. It became predictable. In ops, that’s usually the better deal.
Checklists / step-by-step plan
Step-by-step: selecting hardware for passthrough (so you don’t buy regrets)
- Start with the motherboard model, not the CPU. Check whether people report clean IOMMU groups for your intended devices.
- Prefer CPU-rooted PCIe slots for passthrough devices (GPU, NVMe adapter, NIC). Chipset-rooted slots often group more aggressively.
- Avoid platforms that require ACS override for basic isolation. If you must use it, document the risk acceptance explicitly.
- Plan host console access: iGPU, BMC/IPMI, or serial. Don’t rely on the passed-through GPU for host access.
- Budget for firmware updates: choose vendors with a track record of maintaining BIOS updates for stability, not just CPU microcode.
Step-by-step: host configuration baseline (Linux + KVM/VFIO)
- Enable VT-d/AMD IOMMU in BIOS/UEFI. Also enable any setting labeled “interrupt remapping” if present.
- Boot with intel_iommu=on iommu=pt or amd_iommu=on iommu=pt.
- Confirm dmesg shows IOMMU enabled and no serious DMAR/IVRS table errors.
- Confirm IOMMU groups; verify target device isolation.
- Bind target device IDs to vfio-pci in initramfs.
- Pin VM CPU/memory to the correct NUMA node if latency matters.
- Test the lifecycle: boot VM, run load, shutdown, boot again. Repeat until you’re bored. Bored is the goal.
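The lifecycle test pairs well with a snapshot you can diff after every firmware or kernel change, which is the same “boring practice” from the mini-story above. A sketch of such a preflight capture (function name and output path are assumptions; adjust for your fleet tooling):

```shell
# preflight_snapshot: capture the passthrough-relevant state of a
# host into one directory, to diff against a known-good copy after
# BIOS or kernel updates.
preflight_snapshot() {
    out="${1:-/var/tmp/passthrough-preflight}"
    mkdir -p "$out"
    cat /proc/cmdline > "$out/cmdline.txt"
    dmesg 2>/dev/null | grep -iE 'DMAR|AMD-Vi|remapping' > "$out/iommu-dmesg.txt" || true
    find /sys/kernel/iommu_groups -type l 2>/dev/null | sort > "$out/iommu-groups.txt" || true
    lspci -t > "$out/pci-topology.txt" 2>/dev/null || true
    lsmod | grep -E 'vfio|kvm' > "$out/modules.txt" || true
    echo "snapshot written to $out"
}
# Example: preflight_snapshot /var/lib/preflight/$(date +%F)
```

A changed diff in iommu-groups.txt or pci-topology.txt after a BIOS update is exactly the kind of surprise you want to meet in staging.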
Step-by-step: deciding between passthrough and paravirtualized devices
- Use virtio when you can (disk, net). It’s simpler and often “fast enough.”
- Use passthrough when you must (specialized NIC features, GPU compute/graphics, vendor drivers that require physical function access, HBAs for storage appliances).
- When in doubt, avoid passing through host-critical controllers. If you pass through the only HBA holding the host OS, you are one mistake away from a remote reinstall.
FAQ
1) Is Intel VT-d inherently more stable than AMD-Vi?
Not inherently. Stability is mostly about platform implementation: firmware tables (DMAR/IVRS), PCIe topology, and device reset behavior.
In certified server platforms, both can be very stable. In consumer boards, both can be chaotic—just in different ways.
2) Why are my IOMMU groups “worse” on one motherboard than another?
Because grouping is influenced by PCIe bridges, switches, and ACS capability along the path.
A board that routes multiple slots behind one downstream bridge (without ACS) will glue devices together in one group.
Another board with more root ports or better ACS isolation will produce cleaner groups.
3) Should I use ACS override?
Only if you understand the trade: it can make groups look isolated even when the hardware doesn’t enforce full separation.
For home labs, it’s often acceptable. For environments with stronger isolation requirements, it’s a risk you should not normalize.
4) Does iommu=pt reduce guest isolation?
It typically sets the host’s own DMA mappings to identity/pass-through for performance while still using translation for devices assigned to guests.
It’s commonly used on virtualization hosts. If you’re debugging or validating strictness, you can test without it.
5) Why does GPU passthrough fail after a VM reboot?
Usually reset behavior: GPU doesn’t reset cleanly (no usable FLR, or reset doesn’t propagate), leaving it in a bad state.
Sometimes it’s also a driver/firmware interaction. The reliable fix is choosing hardware known for reset friendliness, plus correct topology.
6) Is SR-IOV easier on Intel or AMD platforms?
SR-IOV success depends heavily on the NIC model/firmware, driver maturity, and IOMMU correctness.
Both Intel and AMD platforms can run SR-IOV well. The “easier” experience usually comes from enterprise NICs and server boards with mature BIOS.
7) What’s the quickest way to know if passthrough will be painless on a host?
Check IOMMU groups and test reboot cycles. If the device is cleanly isolated and you can reboot the guest repeatedly without host reboot, you’re 80% there.
The remaining 20% is performance tuning and edge-case handling.
8) Do kernel versions matter for VT-d/AMD-Vi passthrough?
Yes. IOMMU, VFIO, and PCIe quirks get fixes over time. If you’re chasing rare faults or reset problems, newer kernels can help.
Just upgrade methodically: one variable at a time, with a rollback plan.
9) Is passing through an NVMe drive a good idea?
It can be excellent for performance and for storage appliances that want direct control.
But NVMe devices can share IOMMU groups with other chipset devices on some boards, and you must avoid passing through anything the host needs to boot.
10) Should I choose Intel or AMD for a Proxmox/KVM passthrough build?
Choose the motherboard + platform that yields clean groups for your target devices and has good firmware support.
If you already own the hardware, evaluate it with the group inspection tasks above before you design the service around it.
Practical next steps
If you’re deciding between Intel VT-d and AMD-Vi for passthrough, don’t treat it like a brand preference.
Treat it like a supply chain problem: pick a platform where the board topology and firmware maturity match your isolation needs.
- For new builds: shortlist boards, then verify reported IOMMU group quality for your exact device types and slots.
- For existing hosts: run the group enumeration tasks, confirm interrupt remapping, and test reboot cycles under load.
- For production rollouts: standardize a preflight checklist, keep BIOS and kernel updates controlled, and document which slots are “passthrough-safe.”
The win condition is not “maximum throughput.” It’s “no surprises at reboot.” When you achieve that, both VT-d and AMD-Vi look pretty great.