Some days your “virtualized” NIC is doing 2% CPU and 25 Gbps like a champ. Other days it’s dropping packets under load, your p99 latency looks like a seismograph, and someone suggests “just enable SR-IOV” like it’s a universal solvent.
This is the grown-up version of that conversation: what SR-IOV and PCI passthrough actually buy you, what the IOMMU is really doing, and the specific ways you can make performance worse while congratulating yourself for “going closer to hardware.”
The mental model: PFs, VFs, DMA, and why IOMMU exists
Let’s define terms the way your kernel does, not the way sales decks do.
Passthrough (VFIO) in one paragraph
PCI passthrough assigns a whole PCIe function to a VM (or container-ish workload) so the guest owns the device. In Linux/KVM this is typically VFIO: the host binds the device to vfio-pci, and QEMU maps it into the guest. The device performs DMA into guest memory, and the IOMMU (if enabled) enforces that the DMA stays inside what the guest is allowed to touch.
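For concreteness, here is the host-side plumbing as a sketch. The PCI address is a placeholder (take yours from lspci -D), and the script only prints the commands instead of executing them, so you can read before you leap:

```shell
# Sketch: hand one PCIe function to VFIO on Linux/KVM.
# The address below is a placeholder, not from a real host.
DEV="0000:3b:00.1"
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

run "modprobe vfio-pci"
# driver_override claims only this device, not every device with the same ID.
run "echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override"
run "echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind"
run "echo $DEV > /sys/bus/pci/drivers_probe"
# QEMU/libvirt then maps the device into the guest (a hostdev entry in domain XML).
```

Drop the run wrapper once you have verified the address and the IOMMU group it lives in.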
SR-IOV in one paragraph
SR-IOV splits a physical PCIe function (PF) into multiple lightweight PCIe functions (VFs). Each VF looks like a distinct PCI device with its own config space, BARs, and queues (implementation-dependent), so you can hand individual VFs to guests. The PF stays managed by the host (or sometimes a “service VM”), and the VF is “mostly hardware,” with policy knobs exposed through the PF driver and firmware.
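Creating VFs is a sysfs write on the PF. A sketch, with a placeholder interface name and count, again printing rather than executing:

```shell
# Sketch: enable 4 VFs on a hypothetical PF named enp59s0f0.
PF=enp59s0f0
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

# Many PF drivers require dropping to 0 before changing the VF count.
run "echo 0 > /sys/class/net/$PF/device/sriov_numvfs"
run "echo 4 > /sys/class/net/$PF/device/sriov_numvfs"
# The new VFs then appear as PCI devices you can bind to vfio-pci.
run "lspci -D | grep -i 'virtual function'"
```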
Where DMA fits, and why you should care
Both SR-IOV and passthrough are about one thing: who is allowed to drive DMA, and how expensive it is to do it safely. Devices don’t “send packets”; they DMA descriptors and payloads to/from memory. If a device can DMA anywhere, it can read your secrets, scribble on your kernel, and turn reliability into interpretive dance.
This is the IOMMU’s job: translate and constrain device DMA addresses, similar to how the CPU’s MMU constrains process memory. Without IOMMU, “assigned” devices can still DMA into host memory if you mess up isolation. With IOMMU, you pay a translation cost (sometimes tiny, sometimes not), but you gain real containment and features like interrupt remapping.
Dry rule of thumb: if you’re doing passthrough in production and you’re not using an IOMMU, you’re not “brave,” you’re just running a different threat model than you think you are.
Joke #1: The IOMMU is like a nightclub bouncer for DMA. It doesn’t stop bad decisions inside, but it keeps random strangers out.
What “performance” actually means here
People say “SR-IOV is faster than virtio.” Sometimes. But you need to specify which performance axis:
- Throughput (Gbps or IOPS) at a given CPU budget
- Tail latency (p99/p999), especially under contention
- Jitter (variance), which breaks real-time-ish apps
- CPU efficiency (cycles per packet/IO)
- Operational performance (how quickly you can debug and restore service)
SR-IOV and passthrough can be fantastic for throughput and CPU efficiency. They can also make tail latency worse if you do interrupts wrong, pin nothing, and let the host scheduler improvise.
SR-IOV vs passthrough: the real tradeoffs
Here’s the opinionated version:
- If you need one VM to own a device end-to-end (GPU, FPGA, HBA): use passthrough. SR-IOV is not universally available, and even when it is, feature parity is weird.
- If you need many guests to get close-to-bare-metal NIC performance: use SR-IOV, but treat VF management as part of your platform, not a per-VM hobby.
- If you need flexibility (live migration, snapshots, heterogeneous hosts): prefer virtio and accept the CPU cost, unless you have a proven reason not to.
Security and isolation: they’re not the same story
With passthrough, the guest gets the whole device. That’s great for performance and feature access, and terrible for sharing. Isolation depends heavily on IOMMU correctness and device behavior. With SR-IOV, you share a physical device across tenants, and isolation depends on the NIC’s VF implementation (queue separation, rate limiting, spoof checks) plus the PF driver. Some VFs can do things they shouldn’t if you leave trust-like flags enabled.
Practical guidance:
- Multi-tenant SR-IOV is doable, but you must explicitly configure VF spoof checking, VLAN enforcement, and trust settings on the PF.
- Passthrough for untrusted guests is strongly coupled to IOMMU and interrupt remapping. If either is missing, you’re accepting risk.
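The PF-side policy boils down to three ip link settings per VF. A sketch (interface, VF index, and VLAN are examples; commands are printed, not executed):

```shell
# Sketch: lock down VF 0 on a hypothetical PF before handing it to a tenant.
PF=enp59s0f0
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

run "ip link set dev $PF vf 0 spoofchk on"   # drop frames with a forged source MAC
run "ip link set dev $PF vf 0 vlan 100"      # force tenant traffic onto VLAN 100
run "ip link set dev $PF vf 0 trust off"     # no promiscuous/multicast privileges
```

Make these part of VF provisioning, not a post-incident patch.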
Operations: SR-IOV wins until it doesn’t
SR-IOV looks operationally “simple” because you can hand out VFs like candy. Then you hit the hidden complexity:
- VF provisioning and garbage collection across reboots
- Firmware/driver mismatches that only break under specific queue counts
- Observability gaps (host tools see PF stats; the guest sees VF stats; nobody sees “end-to-end”)
- Packet steering and IRQ affinity becoming a platform requirement
Passthrough is simpler in the sense that one guest owns the device and you debug one stack. It’s harder in the sense that you lose a lot of virtualization niceties (migration, snapshots, oversubscription) and you can brick a host’s networking if you pass the wrong thing through.
The dirty secret: both approaches still need boring Linux hygiene
CPU pinning, NUMA alignment, IRQ affinity, ring sizing, and sane MTU decisions still matter. SR-IOV doesn’t save you from a VM running on the wrong socket, and passthrough doesn’t save you from a guest driver configured like a science experiment.
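That hygiene is a handful of commands, sketched here with placeholder names (VM name, interface, CPU, and node are all examples; printed rather than executed):

```shell
# Sketch: the boring hygiene both SR-IOV and passthrough still need.
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

run "virsh vcpupin vm-netperf-01 0 8"                         # pin vCPU0 to a NIC-local core
run "virsh numatune vm-netperf-01 --mode strict --nodeset 1"  # keep guest memory on node 1
run "ethtool -G ens5 rx 4096 tx 4096"                         # deliberate ring sizes in the guest
run "ip link set dev ens5 mtu 9000"                           # deliberate MTU, consistent end to end
```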
When IOMMU helps (and why)
1) Containment: DMA isolation is the whole point
Without IOMMU, a device doing DMA can access physical memory addresses you didn’t intend. In passthrough, that can mean a guest-controlled device (or guest-controlled programming of a device) can read or corrupt host memory. With IOMMU, the DMA address space (IOVA) is translated through tables the host controls.
In SR-IOV, VFs also DMA. If you assign VFs to guests, you still want IOMMU on to confine VF DMA to the guest’s memory. Yes, the NIC is “virtualized,” but it’s still a device doing DMA.
2) Interrupt remapping: fewer ways to ruin your day
Modern IOMMUs can also remap interrupts (MSI/MSI-X) so a device can’t inject interrupts in weird ways. That matters when you’re passing devices to guests. Without interrupt remapping, you may be forced into unsafe modes, or you’ll get unstable behavior depending on platform support.
3) You can enable safe features that would otherwise be scary
If you’re doing device assignment at scale, IOMMU is the enabling layer for “this device only touches what it’s supposed to touch.” It unlocks doing passthrough for real workloads, not just lab setups.
4) Some performance paths assume it
Counterintuitive: sometimes leaving IOMMU off triggers kernel fallbacks, disables interrupt remapping, or forces different DMA mapping strategies. On some platforms, “IOMMU off” is not “fast mode,” it’s “compatibility mode.” You don’t get to pick your own adventure; your motherboard already did.
When IOMMU doesn’t help (and can hurt)
Now the part people don’t like hearing: IOMMU is not a magic performance switch. It’s a safety feature with performance implications. Sometimes those implications are negligible. Sometimes they’re your p99.
1) High-rate small-packet networking can amplify translation overhead
If your workload is 64-byte packets at very high PPS, the cost of mapping/unmapping DMA, TLB pressure in the IOMMU, and IOTLB misses can show up. Good drivers amortize mapping costs (long-lived mappings, hugepages, batching). Bad setups churn mappings and pay for it.
2) Misconfigured hugepages or memory fragmentation makes it worse
If the guest memory backing is fragmented, the IOMMU needs more page table entries, which increases IOTLB miss probability. With hugepages (and correct pinning), you reduce the translation footprint. This is why “SR-IOV is slower than virtio” sometimes shows up in real systems: it’s not SR-IOV; it’s the mapping strategy plus memory layout.
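The "translation footprint" claim is easy to put numbers on. The 64 GiB figure below is an arbitrary example; the ratios are what matter:

```shell
# Back-of-envelope math: how many IOMMU mappings it takes to cover
# 64 GiB of guest memory at common page sizes. Pure arithmetic.
MEM=$((64 * 1024 * 1024 * 1024))
for SZ in $((4 * 1024)) $((2 * 1024 * 1024)) $((1024 * 1024 * 1024)); do
  echo "page size ${SZ} B -> $((MEM / SZ)) mappings"
done
# page size 4096 B -> 16777216 mappings
# page size 2097152 B -> 32768 mappings
# page size 1073741824 B -> 64 mappings
```

Sixteen million entries versus sixty-four is the difference between constant IOTLB pressure and a translation set that effectively never misses.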
3) You can lose features that you assumed would be there
Some environments turn on IOMMU in a mode that breaks or disables peer-to-peer DMA (device-to-device), or changes how ATS/PRI works (when present). For storage stacks that rely on certain DMA patterns, you might see regressions that look like “driver bug” but are actually translation behavior changes.
4) Debugging gets harder because failure modes multiply
When IOMMU is involved, a failure can be:
- guest driver bug
- host VFIO bug
- platform IOMMU bug/quirk
- BIOS setting mismatch
- ACS grouping oddities
- firmware behavior under load
Joke #2: Turning on IOMMU to “fix performance” is like buying a torque wrench to fix a flat tire. Useful tool, wrong problem.
Interesting facts and historical context
- IOMMUs predate cloud hype. They showed up in various forms to solve DMA addressing limits and isolation long before “multi-tenant” was a product pitch.
- DMA addressing used to be a real constraint. Early systems had devices that could only DMA within limited address ranges; IOMMUs helped by remapping device-visible addresses.
- Intel VT-d and AMD-Vi made device assignment mainstream. Hardware DMA remapping became a standard feature for servers that wanted serious virtualization.
- MSI-X changed the game for high-performance NICs. Multiple interrupt vectors allowed queue-per-core designs, which SR-IOV heavily leans on.
- SR-IOV is a PCI-SIG standard. It’s not vendor magic, though vendor implementations vary wildly in quality and knobs.
- “IOMMU groups” are about isolation boundaries. Grouping reflects what hardware can isolate; it’s not a Linux invention, it’s Linux exposing reality.
- ACS became the awkward hero. Access Control Services influence how devices are isolated behind PCIe switches; lack of ACS can force larger IOMMU groups.
- Virtio matured because ops demanded it. Virtio isn’t just “slower emulation.” It evolved into a solid, debuggable paravirtual interface that fits cloud operations.
- DPDK and user-space networking raised expectations. Once people saw line rate in user space, they started demanding similar behavior from VMs, which pushed SR-IOV adoption.
Fast diagnosis playbook
The goal is to find the bottleneck fast, not to “fully understand PCIe.” You can do that later.
First: confirm what you actually deployed
- Is the workload using virtio, SR-IOV VF, or full passthrough?
- Is IOMMU enabled and working (not just “set in BIOS”)?
- Are interrupts remapped and MSI-X enabled?
Second: locate the contention domain
- NUMA alignment: is the VM on the same socket as the PCIe device?
- IRQ affinity: are interrupts pinned to appropriate CPUs?
- Queue count: do you have enough queues, or too many?
Third: decide whether you’re CPU-bound, IRQ-bound, or DMA/IOMMU-bound
- If CPU is pegged in softirq/ksoftirqd: it’s packet processing and interrupt/steering.
- If CPU is fine but p99 is bad: look at IRQ migration, power states, and IOTLB churn.
- If throughput is capped suspiciously: check link speed/width, negotiated PCIe, and offloads.
Fourth: prove it with one targeted experiment
- Pin vCPUs and memory to the device NUMA node, re-test.
- Change IRQ affinity for VF queues, re-test.
- Switch hugepages on/off (or 2M vs 1G) for one host, re-test.
Do not change five variables and declare victory. That’s how you build folklore.
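A minimal harness for the one-variable discipline, sketched with a hypothetical benchmark command. It labels each run so results cannot be mixed up later:

```shell
# Sketch: run exactly one named experiment and tag its output.
# "netperf-run.sh" is a placeholder for whatever benchmark you actually use.
experiment() {
  local label="$1"; shift
  echo "=== experiment: $label ==="
  "$@"                      # the single change under test happens before this call
}

experiment "baseline" echo "run netperf-run.sh here"
experiment "irq-pinned" echo "run netperf-run.sh here"
```

One label, one change, one re-test. Anything else is folklore manufacturing.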
Practical tasks: commands, outputs, and decisions
These are the checks I actually run when someone says “SR-IOV is slow” or “passthrough is unstable.” Each task includes the decision you make from the output.
Task 1: Confirm IOMMU is enabled in the kernel
cr0x@server:~$ dmesg | egrep -i 'iommu|vt-d|amd-vi|dmar' | head -n 25
[ 0.142311] DMAR: IOMMU enabled
[ 0.142355] DMAR: Host address width 46
[ 0.142360] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.381200] DMAR-IR: Enabled IRQ remapping in x2apic mode
What it means: You have DMA remapping (DMAR: IOMMU enabled) and interrupt remapping (DMAR-IR).
Decision: If you don’t see these lines, don’t pretend passthrough isolation exists. Fix BIOS/kernel args before diagnosing performance.
Task 2: Check kernel cmdline for IOMMU mode and common foot-guns
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.0 root=/dev/mapper/vg0-root ro quiet intel_iommu=on iommu=pt mitigations=auto
What it means: intel_iommu=on enables it; iommu=pt uses passthrough mappings for host devices (often reduces overhead for non-assigned devices).
Decision: For mixed workloads, iommu=pt is usually sane. If you’re troubleshooting isolation, you may temporarily remove it to see if behavior changes—but document why.
Task 3: Identify whether you’re using SR-IOV VFs or full devices
cr0x@server:~$ lspci -D | egrep -i 'ethernet|network' | head
0000:3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28
0000:3b:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function
0000:3b:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function
What it means: The PF is 3b:00.0, and VFs exist at 3b:02.x.
Decision: If you expect SR-IOV but only see PFs, you’re not using SR-IOV; you’re debugging the wrong thing.
Task 4: Confirm VF creation count on the PF
cr0x@server:~$ sudo cat /sys/class/net/enp59s0f0/device/sriov_numvfs
8
What it means: PF has 8 VFs currently enabled.
Decision: If it’s 0, no VFs exist. If it’s higher than you planned, you may be starving PF resources (queues, TCAM, interrupts). Reduce VF count and retest stability.
Task 5: Check IOMMU groups for isolation boundaries
cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "Group $(basename $g):"; ls -1 $g/devices; done | sed -n '1,40p'
Group 12:
0000:3b:00.0
0000:3b:00.1
Group 13:
0000:3b:02.0
0000:3b:02.1
What it means: Devices in the same group can’t be safely isolated from each other by the platform.
Decision: If your target device shares a group with something you can’t pass through (like a storage controller), don’t use passthrough on that host design. Fix the PCIe topology or accept virtio.
Task 6: Check what driver is bound (host side)
cr0x@server:~$ lspci -nnk -s 0000:3b:02.0
3b:02.0 Ethernet controller [0200]: Intel Corporation Ethernet Virtual Function [8086:154c]
Subsystem: Intel Corporation Device [8086:0000]
Kernel driver in use: vfio-pci
Kernel modules: iavf
What it means: VF is bound to VFIO for passthrough; the native VF driver module exists but isn’t active on the host.
Decision: If you expected host networking via VF, seeing vfio-pci is correct. If you expected host to use it, this is your misconfiguration.
Task 7: Verify the VM actually has the device attached (QEMU/libvirt)
cr0x@server:~$ sudo virsh domiflist vm-netperf-01
Interface Type Source Model MAC
-------------------------------------------------------
vnet0 bridge br0 virtio 52:54:00:aa:bb:cc
What it means: This VM is still using virtio on a bridge, not SR-IOV VF passthrough.
Decision: Stop debating SR-IOV tuning. First attach the VF device and confirm in-guest driver changes.
Task 8: Check negotiated PCIe link speed/width (common silent cap)
cr0x@server:~$ sudo lspci -s 3b:00.0 -vv | egrep -i 'LnkSta:|LnkCap:' | head -n 4
LnkCap: Port #0, Speed 8GT/s, Width x8
LnkSta: Speed 8GT/s, Width x4
What it means: The card supports x8 but is running at x4. That’s a physical/topology/BIOS issue, not a driver issue.
Decision: If throughput caps align with x4 limits, move the card, change riser, or fix BIOS lane bifurcation. Don’t tune queues to solve missing lanes.
Task 9: Check NUMA locality of the PCI device
cr0x@server:~$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
1
What it means: Device is attached to NUMA node 1.
Decision: Place the VM’s vCPUs and memory on node 1. If you can’t, accept higher latency and lower throughput, or move the device to the other socket.
Task 10: Find VF queue IRQs and see where they land
cr0x@server:~$ grep -E 'enp59s0f0v0|iavf|vfio|msi' /proc/interrupts | head -n 8
178: 120433 0 0 0 IR-PCI-MSI 524288-edge vfio-msi[0]
179: 118901 0 0 0 IR-PCI-MSI 524289-edge vfio-msi[1]
180: 119552 0 0 0 IR-PCI-MSI 524290-edge vfio-msi[2]
What it means: All interrupts are hitting CPU0 (first column) because affinity isn’t configured, or irqbalance made a “creative” choice.
Decision: Pin IRQs to CPUs local to the NIC, and ideally spread queues across isolated cores. Then re-test p99 latency.
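That pinning can be scripted as a small helper. The IRQ numbers match the example output above, and the CPU list is a stand-in for cores local to the NIC's NUMA node; the helper prints the /proc/irq writes it would make so you can review the plan first:

```shell
# Spread a list of IRQs round-robin across NIC-local CPUs by printing
# the smp_affinity_list writes (execute them once the plan looks right).
assign_irqs() {
  local irqs=($1) cpus=($2) i=0 irq cpu
  for irq in "${irqs[@]}"; do
    cpu=${cpus[$((i % ${#cpus[@]}))]}      # cycle through the CPU list
    echo "echo $cpu > /proc/irq/$irq/smp_affinity_list"
    i=$((i + 1))
  done
}

# CPUs 8-11 standing in for the cores on the NIC's socket.
assign_irqs "178 179 180" "8 9 10 11"
# echo 8 > /proc/irq/178/smp_affinity_list
# echo 9 > /proc/irq/179/smp_affinity_list
# echo 10 > /proc/irq/180/smp_affinity_list
```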
Task 11: Check irqbalance status (it can help or harm)
cr0x@server:~$ systemctl status irqbalance --no-pager
● irqbalance.service - irqbalance daemon
Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled)
Active: active (running) since Mon 2026-02-02 09:14:12 UTC; 1 day ago
What it means: irqbalance is running and may move IRQs dynamically.
Decision: For latency-sensitive SR-IOV/passthrough, consider disabling irqbalance and setting affinity explicitly—especially on hosts with CPU isolation.
Task 12: Check hugepages configuration (host)
cr0x@server:~$ grep -i huge /proc/meminfo | head -n 6
HugePages_Total: 2048
HugePages_Free: 1980
HugePages_Rsvd: 12
Hugepagesize: 2048 kB
Hugetlb: 4194304 kB
What it means: 2M hugepages are available and mostly free.
Decision: If you’re chasing IOMMU/IOTLB overhead, hugepages are a lever. If HugePages_Free is low, you might be fragmenting or leaking reservations; fix before blaming SR-IOV.
Task 13: Check if the VM is using hugepages (libvirt)
cr0x@server:~$ sudo virsh dumpxml vm-netperf-01 | egrep -n 'memoryBacking|hugepages|locked'
112:  <memoryBacking>
113:    <hugepages/>
114:    <locked/>
115:  </memoryBacking>
What it means: VM memory is backed by hugepages and locked (reduces page churn and surprises).
Decision: If you’re using SR-IOV/passthrough and care about tail latency, this is typically worth doing. If you can’t lock memory, expect variability under host pressure.
Task 14: Check for IOMMU faults (you’d be amazed)
cr0x@server:~$ dmesg | egrep -i 'DMAR:.*fault|IOMMU.*fault|AMD-Vi:.*Event' | tail -n 10
[12345.671234] DMAR: [DMA Read] Request device [3b:02.0] fault addr 0x7f3a1000 [fault reason 0x05] PTE Read access is not set
What it means: The device attempted DMA outside allowed mappings, or mappings are being torn down incorrectly.
Decision: Stop. This is not a tuning issue; it’s a correctness issue. Check VFIO/QEMU versions, driver bugs, and whether you’re hot-unplugging devices unsafely.
Task 15: Check NIC offloads inside the guest (SR-IOV VFs vary)
cr0x@guest:~$ sudo ethtool -k ens5 | egrep -i 'checksumming|segmentation-offload|receive-offload'
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
What it means: Offloads are enabled where expected; LRO is off (often good for latency/observability).
Decision: If offloads are unexpectedly off, you’ll pay CPU. If they’re on but you see weird packet traces, consider disabling GRO for troubleshooting—not as a permanent “fix.”
Task 16: Verify VF anti-spoofing and trust settings on the PF (host)
cr0x@server:~$ sudo ip link show enp59s0f0
2: enp59s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
cr0x@server:~$ sudo ip link show enp59s0f0 | grep 'vf 0 '
vf 0 MAC 52:54:00:11:22:33, vlan 100, spoof checking on, link-state auto, trust off, query_rss on
What it means: VF has VLAN enforced, spoof checking on, trust off. That’s a good default for shared environments.
Decision: If trust on appears “because it fixed something,” demand a concrete reason and add compensating controls. Trust tends to spread like mold.
Task 17: Confirm CPU frequency governor/power state (tail latency culprit)
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
What it means: CPU may downclock aggressively, harming p99 latency.
Decision: For high-performance dataplane hosts, use performance governor or platform-appropriate tuning. If you can’t, stop expecting deterministic latency.
Task 18: Check softirq pressure (network dataplane health)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (server) 02/04/2026 _x86_64_ (32 CPU)
02:11:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
02:11:02 PM all 12.5 0.0 9.8 0.0 2.1 18.7 0.0 56.9
What it means: High %soft suggests softirq processing is heavy; common in packet-heavy workloads.
Decision: Add queues/cores, improve IRQ affinity/RPS/XPS, or move to SR-IOV/DPDK if virtio is the bottleneck. If you’re already on SR-IOV, this points to steering and CPU placement.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They were rolling out NIC passthrough for a latency-sensitive service. The pitch was clean: “remove the virtual switch, reduce overhead, lower p99.” The test host looked fine. The canary looked fine. Then production got… haunted.
One cluster started rebooting under load. Not all nodes. Only the ones in a particular rack row. The logs showed sporadic DMA faults and sometimes a hard lockup with no clean kernel panic. The team’s first assumption was the usual: “driver bug.” They upgraded guest drivers. They upgraded QEMU. They toggled offloads. The hauntings continued.
The wrong assumption was that “IOMMU enabled in BIOS” meant “IOMMU is actually working end-to-end.” On the affected nodes, the BIOS setting was present but the platform shipped with interrupt remapping disabled due to an older firmware setting. Linux enabled DMA remapping but couldn’t enable interrupt remapping cleanly on that hardware revision.
They had been passing through a device that spammed MSI-X interrupts under peak load, and without proper remapping guarantees, the platform behaved unpredictably. It wasn’t that the IOMMU “was off”; it was that a critical slice of it wasn’t reliably on.
The fix was boring: standardize BIOS profiles, add a boot-time gate that fails the node if DMAR-IR isn’t enabled, and refuse passthrough scheduling on those hosts. Performance improved later, but the incident ended because they stopped lying to themselves about the platform’s capabilities.
Mini-story 2: The optimization that backfired
A different company ran SR-IOV VFs for high-throughput workloads. Someone noticed that IOMMU translation overhead might be contributing to CPU cost at extreme PPS. They found iommu=pt and decided to “optimize” further: disable IOMMU for the host, because “the guests only have VFs and the NIC is isolating them anyway.”
At first it looked like a win. Microbenchmarks improved slightly, and CPU graphs looked prettier. Then the real workload hit: sporadic data corruption inside a storage-backed service that was also using another passthrough device on some nodes. Not everywhere—only where scheduling lined up a particular combination of device assignment and memory pressure.
Without IOMMU, a misbehaving device (or a buggy interaction) could DMA into memory it shouldn’t. The corruption wasn’t loud. It was subtle: rare checksum mismatches, occasional process crashes, “impossible” state transitions. The worst kind of outage: the one that makes your team doubt physics.
They reverted the change and the corruption stopped. The postmortem wasn’t kind: the optimization was chasing a theoretical overhead while deleting the safety rail. The cost of the “gain” was days of incident response, risk reviews, and a new policy: IOMMU stays on; if performance is a problem, fix mapping strategy (hugepages, pinning) or change the dataplane design.
Mini-story 3: The boring but correct practice that saved the day
A platform team ran a mixed fleet: some hosts for generic virtualization (virtio everywhere), and a smaller pool for high-performance SR-IOV and occasional GPU passthrough. They had a reputation for being annoyingly strict about host profiles.
Every boot ran a checklist: validate IOMMU enabled, validate interrupt remapping, validate NIC link width, validate firmware versions against an allowlist, validate VF counts, validate NUMA locality constraints. If any check failed, the node was drained automatically. No heroics. No negotiation.
One day, a batch of servers arrived with a subtle PCIe topology difference. The NIC negotiated x4 instead of x8 in one slot configuration. Nothing “broke.” No alerts fired from the NIC driver. Applications simply got slower in a way that looked like “traffic spike.”
The boot gate caught it: LnkSta didn’t match expectations. The nodes never entered the SR-IOV pool. Workloads stayed on healthy hosts, and the only “incident” was a ticket to datacenter ops to move the cards to the right slots.
The practice that saved them wasn’t a clever kernel tweak. It was the refusal to accept silent degradation. Reliability engineering is mostly deciding what you will not tolerate.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Passthrough is enabled, but performance is worse than virtio”
Root cause: VM is remote-NUMA from the device; interrupts land on the wrong CPUs; memory not pinned; IOTLB churn due to fragmented memory.
Fix: Align vCPUs and memory to NIC NUMA node, enable hugepages + locked memory, set IRQ affinity, match queue count to cores.
2) Symptom: “SR-IOV VF randomly loses link or stalls under load”
Root cause: PF driver/firmware bug triggered by VF count, queue count, or offload combination; sometimes exacerbated by resets.
Fix: Update NIC firmware + PF driver; reduce VFs/queues; avoid exotic offload combos; add health checks that recreate VFs on failure.
3) Symptom: “VFIO attach fails: device is busy / can’t reset”
Root cause: Device is bound to host driver or is part of a group with other in-use functions; FLR/reset limitations.
Fix: Unbind properly, ensure isolation by IOMMU group, avoid passing through devices without reliable reset semantics, or use SR-IOV instead.
4) Symptom: “IOMMU groups are huge; can’t isolate the NIC”
Root cause: PCIe topology lacks ACS, or devices share upstream components that can’t enforce isolation.
Fix: Change slot/topology, use hosts with proper ACS-capable switches, or stop trying to do safe passthrough on that platform.
5) Symptom: “High p99 latency only during host contention”
Root cause: CPU frequency scaling, IRQ migration, memory reclaim, or noisy neighbors on the same socket.
Fix: CPU isolation for dataplane cores, performance governor, disable irqbalance (or configure it), reserve hugepages, lock memory, set real NUMA policies.
6) Symptom: “Packets drop in guest, host looks fine”
Root cause: VF queue starvation, insufficient ring sizes, interrupt moderation too aggressive, or guest driver mis-tuned.
Fix: Increase ring sizes, adjust coalescing, ensure MSI-X vectors, ensure enough queues, pin vCPUs, and validate guest driver version.
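The ring and coalescing parts of that fix, as a guest-side sketch (interface name is an example; printed rather than executed so each change can be tested in isolation):

```shell
# Sketch: guest-side ring/coalescing experiment for "drops in guest, host fine".
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

run "ethtool -g ens5"                              # current vs maximum ring sizes
run "ethtool -G ens5 rx 4096 tx 4096"              # grow rings to absorb bursts
run "ethtool -c ens5"                              # current interrupt coalescing
run "ethtool -C ens5 adaptive-rx off rx-usecs 16"  # tame moderation, then re-test p99
```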
7) Symptom: “Security team says SR-IOV is unsafe”
Root cause: VF trust/spoof/VLAN policies left permissive; misunderstanding of shared hardware risk.
Fix: Enforce VF policies on PF, restrict features per tenant, audit device assignment, ensure IOMMU on, document threat model.
Checklists / step-by-step plans
Checklist A: Choosing SR-IOV vs passthrough (decision-making, not vibes)
- Need device sharing across many guests? SR-IOV. If the device doesn’t support SR-IOV well, reconsider the hardware.
- Need full device features (GPU, HBA advanced modes, vendor tooling)? Passthrough.
- Need live migration/snapshots? Prefer virtio. SR-IOV/passthrough complicate or block it.
- Multi-tenant risk tolerance low? Passthrough with IOMMU and strict platform checks, or avoid device assignment entirely.
- Ops maturity: If you can’t standardize BIOS/firmware/kernel versions, don’t do device assignment at scale.
Checklist B: SR-IOV rollout plan (what to do in order)
- Standardize BIOS settings (VT-d/AMD-Vi on, SR-IOV on, consistent PCIe settings).
- Standardize kernel cmdline and validate via dmesg gates.
- Pick a VF count per PF based on hardware limits and queue needs (don't max it out by default).
- Define VF policy: spoof checking on, trust off by default, VLAN policy explicit.
- Define NUMA placement policy for SR-IOV hosts and enforce via scheduler.
- Implement IRQ affinity strategy; don’t rely on luck.
- Canary under real traffic patterns (PPS + packet sizes + flow count).
- Observe p99/p999, drops, resets, and IOMMU faults; roll forward only when boring.
Checklist C: Passthrough rollout plan (don’t torch your host)
- Verify IOMMU and interrupt remapping are enabled and stable across reboots.
- Verify IOMMU groups and ensure the device can be isolated.
- Verify device reset support (FLR or vendor reset behavior) works reliably.
- Bind the device to vfio-pci and confirm host services won't need it.
- Pin VM vCPUs and memory to the device's NUMA node; use hugepages if you care about jitter.
- Instrument: collect dmesg IOMMU faults, PCIe AER errors, and interrupt distribution metrics.
- Have a rollback: detach the device, rebind to the host driver, restore networking/storage paths.
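The rollback is short enough to script ahead of time, before you need it at 3 a.m. A sketch with a placeholder address (printed rather than executed):

```shell
# Sketch: return a passthrough device to its native host driver.
DEV="0000:3b:00.1"       # placeholder; use the address you actually assigned
run() { echo "+ $*"; }   # dry-run: print each command instead of executing it

run "echo > /sys/bus/pci/devices/$DEV/driver_override"   # clear the vfio-pci override
run "echo $DEV > /sys/bus/pci/drivers/vfio-pci/unbind"
run "echo $DEV > /sys/bus/pci/drivers_probe"             # native driver reclaims it
```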
FAQ
1) Is SR-IOV always faster than virtio?
No. SR-IOV often reduces CPU per packet and improves throughput, but can lose on tail latency if IRQ/NUMA/memory mapping is sloppy. Virtio can be “fast enough” and far easier to operate.
2) Is passthrough always the fastest option?
Often for single-tenant device ownership, yes. But “fastest” depends on placement and interrupt handling. A remote-NUMA passthrough device can be slower than a well-tuned virtio path.
3) Do I need IOMMU for SR-IOV?
If you assign VFs to guests, you want IOMMU for DMA isolation. If you keep everything on the host, it’s less about isolation—but many platforms still behave better with a consistent IOMMU configuration.
4) What does iommu=pt actually do?
It typically sets up identity/pass-through mappings for host devices so they don’t pay translation overhead, while still allowing translated mappings for assigned devices. It’s a common compromise for performance plus safety.
5) Why are my IOMMU groups “too big”?
Because your hardware can’t isolate those devices from each other. PCIe topology and ACS support drive grouping. Linux is reporting the boundary it can trust, not the boundary you wish you had.
6) Can I live migrate a VM using SR-IOV or passthrough?
Not in the usual “it just works” way. Device assignment ties the VM to specific hardware state. Some ecosystems have specialized migration approaches, but treat it as a special project, not a checkbox.
7) What’s the biggest cause of “SR-IOV is unstable” reports?
Firmware/driver mismatch and resource overcommit on the NIC (too many VFs, too many queues, too aggressive settings). The second biggest is operational: VFs not recreated consistently after reboot or link events.
8) Does IOMMU add measurable latency?
It can, especially when mappings churn or IOTLB misses rise. With pinned memory, hugepages, and stable mappings, the overhead is often small compared to the rest of the dataplane.
9) Should I disable irqbalance on SR-IOV/passthrough hosts?
For latency-sensitive workloads, yes—unless you’ve explicitly configured irqbalance to respect isolation and locality. Dynamic IRQ migration and deterministic p99 are not friends.
10) What’s the simplest “safe default” architecture?
Virtio for general compute, a separate SR-IOV pool for performance workloads, and passthrough only for devices that truly require full ownership. Keep host profiles strict and validated.
Practical next steps
Make the decision based on your operational reality, not your benchmark screenshot.
- Pick a baseline: If you’re currently on virtio, get clean NUMA/IRQ hygiene first. Otherwise you won’t know what SR-IOV improved.
- Turn on IOMMU correctly: Confirm DMA remapping and interrupt remapping in dmesg. Gate your fleet on it.
- Choose the right model: SR-IOV for shared NIC acceleration, passthrough for single-tenant device ownership, virtio for flexibility.
- Operationalize it: Standardize firmware, BIOS, kernel args, VF counts, and VF security policy. Automate checks; drain on mismatch.
- Measure the right thing: Track p99/p999 latency and drops, not just average throughput. Most outages live in the tails.
One quote worth keeping on a sticky note, because it’s the whole job: “Hope is not a strategy.” — General Gordon R. Sullivan