You enable SR-IOV because you want clean, predictable throughput and lower CPU overhead. You reboot, you echo a number into
sriov_numvfs, and… nothing. Or worse: VFs appear, but the host loses link, guests can’t get packets, or the IOMMU groups
look like a bowl of spaghetti and VFIO refuses to play.
The first SR-IOV rollout is rarely blocked by “a bug.” It’s blocked by a chain of assumptions: firmware settings, kernel parameters,
PF/VF driver choices, PCIe topology, and one innocent “optimization” that turns out to be a trap.
SR-IOV in one page (without fairy tales)
SR-IOV (Single Root I/O Virtualization) lets a PCIe device expose multiple lightweight PCI functions. The physical function (PF) is the
“owner” interface. Virtual functions (VFs) are the “slices” you hand to guests or containers (usually via VFIO passthrough) or manage in
the host for traffic steering.
The promise: bypass some software switching, reduce per-packet overhead, and get near line-rate performance with less jitter. The cost:
you now depend on hardware behavior, firmware policy, PCIe topology, and strict driver expectations. That’s not “hard,” but it’s
different. The debugging mindset is closer to “storage HBA troubleshooting” than “Linux bridge tweaking.”
On Debian 13, SR-IOV is mostly a kernel+driver+firmware story. Debian doesn’t “do” SR-IOV for you; it gives you good tools to see what’s
happening. Your job is to make the platform honest: IOMMU on, ACS sane, PF driver stable, and no magical assumptions.
Core mental model: three planes
- Control plane: sysfs knobs, devlink, ethtool, PF configuration, firmware toggles.
- Data plane: actual packet/IO path inside the NIC; queue mapping, VLAN/MAC filters, spoof checks.
- Isolation plane: IOMMU/VT-d/AMD-Vi, IOMMU groups, ACS, VFIO bindings, interrupt routing.
Most “SR-IOV is broken” tickets are isolation-plane issues disguised as data-plane issues. The second most common are control-plane
defaults you didn’t know existed.
Interesting facts and a little history (so you stop blaming the wrong thing)
- SR-IOV is a PCI-SIG spec from the late 2000s, born from the pain of CPU-heavy virtualization stacks pushing 10GbE and storage traffic through software emulation.
- “Virtual Function” isn’t a Linux invention. It’s a PCIe function with a standardized capability structure. Linux just exposes it via sysfs and drivers.
- Early SR-IOV deployments were as much about storage as networking. HBAs and NVMe-like designs pushed for similar “direct assignment” patterns.
- IOMMU adoption lagged SR-IOV adoption in many data centers. People enabled VFs before they enabled DMA isolation, which is how you learn the difference between “works” and “safe.”
- ACS (Access Control Services) is a PCIe feature that controls peer-to-peer behavior. Without proper ACS, your IOMMU groups may be too large to safely pass through individual functions.
- VF drivers often intentionally expose fewer knobs than PF drivers. That’s by design: fewer sharp edges for tenants, and fewer ways to wedge the device.
- Some NICs implement “switch-like” behavior internally. VFs are not just queues; there’s internal steering, filtering, and sometimes a little embedded logic.
- SR-IOV isn’t always faster. For small packets the host CPU overhead can drop, but for some workloads the operational complexity outweighs the gains and you’re better off with vhost-net/virtio and good tuning.
How SR-IOV fails in real life
SR-IOV failures cluster. If you can categorize what you see, you can stop flailing and start measuring.
Failure class A: “VFs don’t appear”
You echo 8 into /sys/class/net/<pf>/device/sriov_numvfs and it returns
Invalid argument or silently stays at zero. Common causes:
- SR-IOV disabled in BIOS/UEFI or NIC firmware (yes, both can matter).
- Wrong PF driver loaded (inbox vs vendor; or a fallback driver without SR-IOV support).
- Firmware limit: you requested more VFs than the device/port supports.
- Device is in a state where VFs can’t be created (e.g., configured for a mode that conflicts with VF creation).
Failure class B: “VFs appear, but passthrough fails”
VF devices show up in lspci, but your hypervisor can’t assign them. Or VFIO binds, but QEMU errors out with IOMMU group
problems. Common causes:
- IOMMU not enabled at the kernel level.
- ACS/IOMMU groups too large; VF shares a group with the PF or unrelated devices.
- vfio-pci not binding, because another driver grabbed the VF first.
- Secure Boot or kernel lockdown policy blocks VFIO or module loading in certain ways (environment dependent).
Failure class C: “Traffic drops, ARP weirdness, VLAN confusion”
Guest sees link up, can ping its gateway once, then dies. Or VLAN tags vanish. Or ARP replies don’t return. Common causes:
- PF is enforcing anti-spoof/MAC/VLAN policies and the VF is not configured accordingly.
- Switch port security rejects multiple MACs; your VF MACs don’t match allowed list.
- Offload features interact with the virtual switch path in surprising ways (less common, but real).
Failure class D: “Performance is worse than virtio”
This is the one that makes management suspicious. Common causes:
- Interrupt storms due to bad IRQ affinity or too many queues.
- PCIe slot bifurcation or link speed downshift (x8 becomes x4, Gen4 becomes Gen3).
- NUMA mismatch: VF is on one socket, VM vCPUs on another.
- Small packet workload with per-VM overhead elsewhere (firewall, conntrack, app locks) dominating.
Joke 1: SR-IOV is like giving your VM its own lane on the highway—until you realize the on-ramp is still managed by a committee.
Fast diagnosis playbook (first/second/third checks)
This is the sequence I use when someone says “SR-IOV is broken” and I want a useful answer in under 15 minutes. The goal is to identify
the first broken layer, not to fix everything at once.
1) Confirm the platform can do isolation (IOMMU + groups)
- Check IOMMU enabled in kernel cmdline and dmesg.
- Check IOMMU groups: can you isolate a VF cleanly?
- If groups are wrong, stop. Do not proceed to “tuning” or guest changes.
2) Confirm the NIC and PF driver actually support SR-IOV
- Check sriov_totalvfs.
- Check PF driver and firmware versions.
- Create a small number of VFs (2) first, not 32.
3) Confirm the VF binding and guest attach path
- Ensure VFs bind to vfio-pci when passing through; otherwise they’ll bind to a net driver on the host and you’ll fight udev.
- Confirm guest sees the device and loads the correct VF driver.
4) Only then: data-plane sanity
- Switch port security and VLAN trunking rules.
- MAC/VLAN filters on PF for each VF.
- Performance counters and queue/IRQ placement.
Practical tasks: commands, outputs, and decisions (12+)
These are runnable on Debian 13. The outputs shown are representative. Your exact strings will vary, but the decision logic holds.
Task 1: Identify the PF interface and PCI address
cr0x@server:~$ ip -br link
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
enp129s0f0 UP 3c:fd:fe:12:34:56 <BROADCAST,MULTICAST,UP,LOWER_UP>
enp129s0f1 DOWN 3c:fd:fe:12:34:57 <BROADCAST,MULTICAST>
What it means: enp129s0f0 looks like your PF (one of the physical ports).
Decision: pick the PF you’ll create VFs from; don’t try to do both ports at once.
cr0x@server:~$ ethtool -i enp129s0f0
driver: mlx5_core
version: 6.12.0-amd64
firmware-version: 22.39.2048 (MT_0000000012)
expansion-rom-version:
bus-info: 0000:81:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
What it means: You have the bus address 0000:81:00.0 and the PF driver.
Decision: record the bus address; most SR-IOV and VFIO operations become easier when you think in PCI IDs.
Task 2: Confirm SR-IOV capability exists and see the VF limit
cr0x@server:~$ cat /sys/class/net/enp129s0f0/device/sriov_totalvfs
64
What it means: Hardware/driver combination can expose up to 64 VFs on this PF.
Decision: start with 2 or 4. If you jump straight to 64 and it fails, you won’t know whether the problem is “SR-IOV is
broken” or “you asked for too much.”
Task 3: Try creating VFs (and handle the two common failure modes)
cr0x@server:~$ cat /sys/class/net/enp129s0f0/device/sriov_numvfs
0
cr0x@server:~$ echo 4 | sudo tee /sys/class/net/enp129s0f0/device/sriov_numvfs
4
cr0x@server:~$ cat /sys/class/net/enp129s0f0/device/sriov_numvfs
4
What it means: VFs were created successfully.
Decision: move on to enumerating VFs in PCI and mapping them to VF indices.
If it fails, you’ll see something like:
cr0x@server:~$ echo 4 | sudo tee /sys/class/net/enp129s0f0/device/sriov_numvfs
tee: /sys/class/net/enp129s0f0/device/sriov_numvfs: Invalid argument
4
What it means: The kernel rejected VF creation. That’s usually driver/firmware/platform policy, not a typo.
Decision: immediately inspect dmesg for the real reason (next task).
Task 4: Read dmesg for the device-specific error, not generic noise
cr0x@server:~$ sudo dmesg -T | tail -n 30
[Mon Dec 30 09:11:12 2025] mlx5_core 0000:81:00.0: SR-IOV: failed to enable VFs, error: -22
[Mon Dec 30 09:11:12 2025] mlx5_core 0000:81:00.0: hint: SR-IOV disabled in firmware or not enough resources
What it means: The driver is telling you it can’t allocate resources or firmware blocks it.
Decision: check firmware settings, PF mode, and ensure you’re not in a restricted NIC configuration.
Task 5: Verify IOMMU is enabled (this blocks VFIO and safe assignment)
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.12.0-amd64 root=UUID=8c3d... ro quiet intel_iommu=on iommu=pt
What it means: Kernel parameters request Intel IOMMU and pass-through mode for host devices.
Decision: if you don’t see intel_iommu=on on an Intel platform, add it and reboot before doing VFIO work. On AMD platforms, AMD-Vi is usually enabled by default when firmware exposes it, but verify it in dmesg all the same.
cr0x@server:~$ sudo dmesg -T | egrep -i 'DMAR|IOMMU|AMD-Vi' | head -n 20
[Mon Dec 30 09:02:01 2025] DMAR: IOMMU enabled
[Mon Dec 30 09:02:01 2025] DMAR: Host address width 46
[Mon Dec 30 09:02:01 2025] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[Mon Dec 30 09:02:01 2025] DMAR: Interrupt remapping enabled
What it means: IOMMU and interrupt remapping are live.
Decision: proceed to grouping; if interrupt remapping is missing, expect MSI/MSI-X weirdness under load.
Task 6: Inspect IOMMU groups (the “can I isolate a VF?” question)
cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do \
echo "Group $(basename "$g")"; \
ls -l "$g/devices"; \
done | sed -n '1,60p'
Group 12
total 0
lrwxrwxrwx 1 root root 0 Dec 30 09:03 0000:81:00.0 -> ../../../../devices/pci0000:80/0000:80:01.0/0000:81:00.0
lrwxrwxrwx 1 root root 0 Dec 30 09:03 0000:81:00.1 -> ../../../../devices/pci0000:80/0000:80:01.0/0000:81:00.1
lrwxrwxrwx 1 root root 0 Dec 30 09:03 0000:81:00.2 -> ../../../../devices/pci0000:80/0000:80:01.0/0000:81:00.2
What it means: PF and VFs might be in the same group (depends on your device listing). That can be a problem for
passthrough: many setups want each VF in its own group.
Decision: if VF shares a group with the PF or unrelated devices, you may need BIOS ACS settings, different slot
placement, or accept that your platform won’t do “clean” VF assignment.
Task 7: Enumerate the newly created VF PCI functions
cr0x@server:~$ lspci -nn | egrep -i 'Ethernet|Virtual Function|SR-IOV'
81:00.0 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
81:00.2 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
81:00.3 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
81:00.4 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
What it means: VFs exist at 81:00.1 through 81:00.4.
Decision: decide whether the host will use these VFs as netdevs (rare in clean designs) or bind them to VFIO for guests.
Task 8: Map VF index to PCI address (stop guessing which VF you handed out)
cr0x@server:~$ ip link show enp129s0f0
5: enp129s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:12:34:56 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
What it means: VF indices exist, but MACs are unset and trust is off.
Decision: set MAC/VLAN/trust intentionally on the PF before assigning the VF to a guest, or you’ll debug “network
flakiness” that is actually policy enforcement.
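The index-to-address mapping itself lives in sysfs: each VF index is a virtfn symlink under the PF’s device directory. A quick sketch (representative output, consistent with the VFs enumerated in Task 7):
cr0x@server:~$ ls -l /sys/class/net/enp129s0f0/device/ | grep virtfn
lrwxrwxrwx 1 root root 0 Dec 30 09:15 virtfn0 -> ../0000:81:00.1
lrwxrwxrwx 1 root root 0 Dec 30 09:15 virtfn1 -> ../0000:81:00.2
lrwxrwxrwx 1 root root 0 Dec 30 09:15 virtfn2 -> ../0000:81:00.3
lrwxrwxrwx 1 root root 0 Dec 30 09:15 virtfn3 -> ../0000:81:00.4
This is the mapping to inventory: VF index on the PF → PCI address → VM.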
Task 9: Set VF MAC and VLAN policy (or deliberately disable enforcement)
cr0x@server:~$ sudo ip link set enp129s0f0 vf 0 mac 52:54:00:aa:bb:01
cr0x@server:~$ sudo ip link set enp129s0f0 vf 0 vlan 120 qos 0
cr0x@server:~$ sudo ip link set enp129s0f0 vf 0 spoofchk on
cr0x@server:~$ sudo ip link set enp129s0f0 vf 0 trust off
What it means: VF0 can only use the specified MAC and VLAN 120; spoofing is blocked; trust is off.
Decision: in multi-tenant setups, keep spoof checking on. If you need the guest to run a router, you may need
trust on (and then you’d better have an upstream security model that makes sense).
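If you’re configuring more than one VF, do it in a loop so the assignment stays deterministic. A sketch (the MAC scheme is illustrative; pick your own and inventory it):
cr0x@server:~$ for i in 0 1 2 3; do \
    sudo ip link set enp129s0f0 vf "$i" mac "52:54:00:aa:bb:0$((i+1))"; \
    sudo ip link set enp129s0f0 vf "$i" vlan 120 qos 0; \
  done
cr0x@server:~$ ip link show enp129s0f0 | grep 'vf 3'
    vf 3 link/ether 52:54:00:aa:bb:04 brd ff:ff:ff:ff:ff:ff, vlan 120, spoof checking on, link-state auto, trust off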
Task 10: Bind a VF to vfio-pci (for passthrough) and verify it stuck
cr0x@server:~$ lspci -nnk -s 81:00.1
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
Subsystem: Mellanox Technologies Device [15b3:0058]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
What it means: The VF is currently bound to a host net driver. That’s fine for host-side use, but not for VFIO
passthrough.
Decision: unbind and bind to vfio-pci, and ensure it persists via modprobe config or udev rules.
cr0x@server:~$ sudo modprobe vfio-pci
cr0x@server:~$ echo 0000:81:00.1 | sudo tee /sys/bus/pci/devices/0000:81:00.1/driver/unbind
0000:81:00.1
cr0x@server:~$ echo vfio-pci | sudo tee /sys/bus/pci/devices/0000:81:00.1/driver_override
vfio-pci
cr0x@server:~$ echo 0000:81:00.1 | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
0000:81:00.1
cr0x@server:~$ lspci -nnk -s 81:00.1
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
Subsystem: Mellanox Technologies Device [15b3:0058]
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
What it means: The VF is now owned by VFIO and ready for passthrough.
Decision: if bind fails with “Device or resource busy,” you still have users (like NetworkManager or systemd-networkd)
grabbing it; stop those, or blacklist the VF driver for these functions.
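For persistence across reboots, one low-friction option is the driverctl package (available in Debian), which records a per-device override and reapplies it via udev. A sketch, assuming the VF address from above and that driverctl fits your environment:
cr0x@server:~$ sudo apt install driverctl
cr0x@server:~$ sudo driverctl set-override 0000:81:00.1 vfio-pci
cr0x@server:~$ driverctl list-overrides
0000:81:00.1 vfio-pci
Remember that VFs only exist after something recreates them at boot, so pair this with the persistence unit later in this article.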
Task 11: Check PCIe link speed/width (your “SR-IOV is slow” smoking gun)
cr0x@server:~$ sudo lspci -vv -s 81:00.0 | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 16GT/s (ok), Width x8 (downgraded)
What it means: The card supports x16 but is running at x8. That might be fine, or it might cap you under load.
Decision: if you expected x16 and you’re saturating, check slot wiring, BIOS bifurcation, risers, and whether another
device stole lanes.
Task 12: Check NUMA locality (because cross-socket DMA is a silent tax)
cr0x@server:~$ cat /sys/bus/pci/devices/0000:81:00.0/numa_node
1
cr0x@server:~$ lscpu | egrep 'NUMA node1 CPU\(s\)|NUMA node0 CPU\(s\)'
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
What it means: Your NIC sits on NUMA node 1.
Decision: pin the VM vCPUs (and ideally memory) to node 1 when using VFs from that NIC. Otherwise you’ll benchmark
latency and accidentally benchmark your interconnect.
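The device also advertises its local CPUs directly, which is handy for pinning scripts. A sketch (output consistent with the lscpu mapping above):
cr0x@server:~$ cat /sys/bus/pci/devices/0000:81:00.0/local_cpulist
32-63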
Task 13: Confirm interrupt distribution (the hidden reason CPU spikes)
cr0x@server:~$ grep -E 'mlx5|vfio|enp129s0f0' /proc/interrupts | head -n 12
156: 120345 0 0 0 IR-PCI-MSI 524288-edge mlx5_comp0@pci:0000:81:00.0
157: 98765 0 0 0 IR-PCI-MSI 524289-edge mlx5_comp1@pci:0000:81:00.0
158: 110002 0 0 0 IR-PCI-MSI 524290-edge mlx5_comp2@pci:0000:81:00.0
What it means: All the interrupt counts are accumulating in the CPU0 column, which is classic “we forgot IRQ affinity.”
Decision: distribute IRQs across CPUs local to the NIC’s NUMA node, or enable irqbalance with sane constraints.
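A sketch of spreading them manually, assuming the IRQ names from the output above and CPUs 32-63 local to the NIC. Caveats: some drivers use kernel-managed IRQ affinity and will reject these writes, and a running irqbalance may rewrite them later.
cr0x@server:~$ for irq in $(grep 'mlx5_comp.*0000:81:00.0' /proc/interrupts | awk '{print $1}' | tr -d ':'); do \
    echo 32-63 | sudo tee "/proc/irq/$irq/smp_affinity_list" > /dev/null; \
  done
cr0x@server:~$ cat /proc/irq/157/smp_affinity_list
32-63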
Task 14: Confirm PF isn’t blocking the VF with policy (spoof check, trust, link-state)
cr0x@server:~$ ip link show enp129s0f0 | sed -n '1,12p'
5: enp129s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:12:34:56 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 52:54:00:aa:bb:01 brd ff:ff:ff:ff:ff:ff, vlan 120, spoof checking on, link-state auto, trust off
What it means: Policy is set and visible.
Decision: if the guest sends tagged frames on a different VLAN, they will be dropped. That’s not a “Linux bug”; it’s you
enforcing something you forgot you configured.
Task 15: Confirm the guest sees the VF and loads the right driver (inside the VM)
cr0x@guest:~$ lspci -nnk | egrep -A3 -i 'Ethernet|Virtual Function' | head -n 10
00:04.0 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
Subsystem: Mellanox Technologies Device [15b3:0058]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
What it means: The VM sees the VF device and has a driver.
Decision: if the VM sees the device but no driver binds, you’re missing guest driver support or using an older kernel/initramfs.
Task 16: Check for PCI reset capability issues (common with VFs)
cr0x@server:~$ sudo dmesg -T | egrep -i 'reset|FLR|vfio' | tail -n 20
[Mon Dec 30 09:20:44 2025] vfio-pci 0000:81:00.1: enabling device (0000 -> 0002)
[Mon Dec 30 09:20:44 2025] vfio-pci 0000:81:00.1: not capable of FLR, using PM reset
What it means: The VF may not support Function Level Reset (FLR). VFIO will fall back to other reset paths.
Decision: if VMs fail on reboot or hot reset, you may need a different VF, a full PF reset workflow, or operational rules
(“migrate instead of reboot under load”).
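You can check the reset story before it bites you: the kernel exposes the available reset methods in sysfs, and lspci shows whether the function advertises FLR. A sketch (representative output for a VF without FLR):
cr0x@server:~$ cat /sys/bus/pci/devices/0000:81:00.1/reset_method
pm
cr0x@server:~$ sudo lspci -vv -s 81:00.1 | grep -i flreset
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W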
Three corporate-world mini-stories (all true enough to hurt)
Mini-story 1: The incident caused by a wrong assumption
A team rolled out SR-IOV for a set of latency-sensitive services. They’d done the lab work: great numbers, clean graphs, smug grins. In
production, they passed VFs to VMs and everything looked normal. Link up, DHCP success, health checks green. Then, five minutes later,
random instances went “half dead”: they could send traffic, but replies didn’t come back reliably.
The first assumption was that it was the switch. The second assumption was that it was ARP cache weirdness. The third was that it must be
a kernel regression. Everyone had a favorite ghost.
The actual problem was boring: the upstream switch port had MAC limiting set for “one MAC per port.” SR-IOV introduced multiple MACs
behind the same physical port. The switch didn’t fully shut the port; it selectively discarded frames once the limit was exceeded, which
made it look like a flaky host problem. Different racks behaved differently because not all ports had the same security profile.
The fix was twofold: change switch port security to allow the expected number of MAC addresses (or use a trunk profile built for
virtualization), and implement a preflight check that counted planned VFs and compared it to switch policy. The real lesson wasn’t “turn
off security.” It was “stop assuming the network treats a server port like a single identity.”
They also learned a social lesson: when you change the semantics of an interface, you own the blast radius. The network team didn’t break
anything; the compute team changed the rules without telling them.
Mini-story 2: The optimization that backfired
Another org decided SR-IOV would fix CPU burn on hosts running lots of small flows. They created the maximum number of VFs per NIC,
because “we might need them later,” and pre-bound them to VFIO. They also enabled every offload they could find, because “the NIC is built
for this.”
The result: boot times stretched, and occasional hosts failed to come up cleanly after maintenance. When they did boot, their monitoring
showed odd latency spikes every few minutes. The easy story was “SR-IOV is unstable,” which became a political talking point.
Root cause ended up being a combination of resource pressure and interrupts. Creating a large number of VFs increased device bookkeeping,
and the host ended up with a mess of IRQs that defaulted onto a small CPU set. The offloads weren’t universally bad, but some interacted
poorly with their traffic pattern and the guest drivers. Their “optimize everything” posture created a system that was brittle under
reboot, unpredictable under load, and hard to debug.
The rollback plan was simple: create only the number of VFs needed per host role, distribute IRQs properly, and only enable offloads that
were proven in their environment. Their performance recovered—and their reliability improved more than their p99 latency did.
Joke 2: If you create 64 VFs “just in case,” you’ve invented a new kind of technical debt: PCIe debt, payable at reboot time.
Mini-story 3: The boring but correct practice that saved the day
A third team ran mixed workloads: some VMs needed SR-IOV for throughput, others were fine on virtio. They standardized an operational
rule: every VF assignment required a recorded mapping of PF name → VF index → PCI address → VM. No exceptions.
People complained. It felt bureaucratic. It felt like paperwork for engineers who “know what they’re doing.” Then a host rebooted after a
kernel update and udev enumeration order changed. PCI addresses stayed stable, but VM definitions referenced the wrong VF indices in one
cluster because someone had been “just clicking” in the UI and relying on label names.
The team didn’t have an outage. They had a minor scare and a handful of misrouted attachments caught in pre-production checks, because
their mapping spreadsheet (and later, a small internal inventory service) made the mismatch obvious.
Their boring practice also made audits easier: you can explain “this VF belongs to this tenant” when you can point to deterministic IDs,
not vibes. Reliability is often a clerical discipline wearing an engineering hat.
A paraphrased idea often attributed to W. Edwards Deming fits operations: “You can’t improve what you don’t measure.”
Common mistakes: symptom → root cause → fix
1) “echo to sriov_numvfs returns Invalid argument”
- Symptom: Invalid argument when creating VFs.
- Root cause: firmware/BIOS SR-IOV disabled, wrong PF driver, or you requested more than sriov_totalvfs.
- Fix: confirm sriov_totalvfs > 0; check dmesg for the driver hint; enable SR-IOV in BIOS/NIC; request fewer VFs first.
2) “VFs exist but VFIO passthrough fails with IOMMU group error”
- Symptom: hypervisor complains the device is not in an isolated IOMMU group; QEMU refuses assignment.
- Root cause: ACS not providing isolation, or platform groups PF+VF together.
- Fix: move the NIC to a different slot/root port, enable ACS-related BIOS options if available, or accept that this platform can’t safely passthrough per-VF.
3) “Guest has link but can’t pass traffic reliably”
- Symptom: intermittent connectivity, ARP resolution issues, one-way traffic.
- Root cause: VF spoof checking/trust/VLAN policy not matching guest behavior; upstream port security MAC limit.
- Fix: set VF MAC/VLAN on PF, adjust spoof/trust intentionally, and ensure switch port allows multiple MACs.
4) “Performance is worse after SR-IOV”
- Symptom: higher CPU usage or lower throughput vs virtio.
- Root cause: IRQs pinned to wrong CPUs, NUMA mismatch, link width downshift, too many queues/offloads.
- Fix: check link width/speed, place VM on same NUMA node, tune IRQ affinity, reduce VF count/queues, validate offloads with measurements.
5) “VF disappears after reboot”
- Symptom: VFs vanish or reset to 0 after host restart.
- Root cause: SR-IOV VF creation is not persistent; you didn’t reapply sriov_numvfs at boot, or a service resets the device.
- Fix: implement a boot-time mechanism (systemd unit) to set sriov_numvfs after the PF driver loads; verify ordering.
6) “Can’t unload PF driver / device stuck”
- Symptom: you can’t change VF count; driver unload fails; device is busy.
- Root cause: VFs still exist and are in use; VFs bound to drivers; a VM holds a VF open via VFIO.
- Fix: set sriov_numvfs to 0, detach VFs from guests, unbind VFs from drivers, then reconfigure (sketch below).
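A minimal teardown sketch for this case, using the addresses from the earlier tasks (detach the VF from any guest first):
cr0x@server:~$ echo 0000:81:00.1 | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
0000:81:00.1
cr0x@server:~$ echo 0 | sudo tee /sys/class/net/enp129s0f0/device/sriov_numvfs
0
If the second write fails or hangs, a guest or driver still holds a VF; detach it first.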
Checklists / step-by-step plan
Host preflight checklist (do this before touching guests)
- Confirm PF driver and firmware: ethtool -i.
- Confirm SR-IOV capability: sriov_totalvfs > 0.
- Enable IOMMU in kernel cmdline; reboot; verify in dmesg.
- Inspect IOMMU groups; if isolation is impossible, stop and redesign.
- Verify PCIe link speed/width; fix slot/bifurcation issues early.
- Create a small VF count (2–4) and ensure dmesg is clean.
- Decide policy: per-VF MAC/VLAN/trust/spoof settings.
- Decide binding: host net driver vs vfio-pci, not both.
VF creation and policy plan (repeatable operations)
- Set VF count to 0 (clean slate).
- Create desired number of VFs.
- Assign MAC/VLAN/trust per VF deterministically (inventory it).
- Bind to vfio-pci if passing through.
- Attach to VM; confirm guest driver; do a simple ping + iperf test.
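For the throughput half of that last step, iperf3 from the Debian repos is enough to prove the path. A sketch run from the guest that owns the VF, assuming a peer at 10.0.120.20 running iperf3 -s (the address and the numbers are illustrative):
cr0x@server:~$ sudo apt install iperf3
cr0x@server:~$ iperf3 -c 10.0.120.20 -P 4 -t 30
[SUM]   0.00-30.00  sec   82.1 GBytes  23.5 Gbits/sec    0             sender
[SUM]   0.00-30.04  sec   82.0 GBytes  23.4 Gbits/sec                  receiver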
Guest validation checklist (prove it’s not lying)
- Confirm the device is visible in the guest: lspci -nnk.
- Confirm the guest interface name appears and link is up.
- Confirm MTU matches the network (don’t assume jumbo frames are end-to-end).
- Confirm VLAN behavior matches PF policy.
- Run a packet capture on the host uplink if you suspect policy drops (data-plane truth beats theory).
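A capture sketch for that last item. One caveat: on many SR-IOV NICs, VF traffic is switched in hardware and never appears on the PF netdev, so an empty capture here is a hint, not proof; mirror at the physical switch if you need ground truth. The VLAN and addresses below are illustrative:
cr0x@server:~$ sudo tcpdump -eni enp129s0f0 -c 5 'vlan 120 and arp'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp129s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:31:02.114523 52:54:00:aa:bb:01 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 120, p 0, ethertype ARP (0x0806), Request who-has 10.0.120.1 tell 10.0.120.20, length 46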
Persistence plan (because reboot is a feature)
SR-IOV configuration often resets on reboot. Treat VF creation and VF policy like any other system configuration: declarative-ish and
applied by systemd in the right order.
cr0x@server:~$ cat /etc/systemd/system/sriov-enp129s0f0.service
[Unit]
Description=Configure SR-IOV VFs on enp129s0f0
BindsTo=sys-subsystem-net-devices-enp129s0f0.device
After=sys-subsystem-net-devices-enp129s0f0.device
Before=network-pre.target
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/class/net/enp129s0f0/device/sriov_numvfs; echo 4 > /sys/class/net/enp129s0f0/device/sriov_numvfs'
ExecStart=/sbin/ip link set enp129s0f0 vf 0 mac 52:54:00:aa:bb:01 vlan 120 spoofchk on trust off
ExecStart=/sbin/ip link set enp129s0f0 vf 1 mac 52:54:00:aa:bb:02 vlan 120 spoofchk on trust off
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl enable --now sriov-enp129s0f0.service
Created symlink /etc/systemd/system/multi-user.target.wants/sriov-enp129s0f0.service → /etc/systemd/system/sriov-enp129s0f0.service.
What it means: you’ve made VF creation and policy repeatable.
Decision: keep this unit simple. Avoid embedding VFIO binding logic here unless you’re sure about module load order and
udev behavior in your environment.
FAQ
1) Do I need SR-IOV enabled in BIOS, or only in the NIC?
Potentially both. Some platforms gate SR-IOV behind BIOS/UEFI settings (especially for “IO virtualization”), and some NICs have firmware
toggles or resource profiles that restrict VF creation. If sriov_totalvfs is 0, treat it as “something is disabled” until
proven otherwise.
2) What’s the difference between iommu=pt and not using it?
iommu=pt (pass-through) typically means the host’s own devices use identity mappings for lower overhead, while still enabling
translation/isolation for VFIO devices. It’s common in virtualization hosts. If you’re debugging weird DMA faults, you can try without it,
but measure and understand the trade-off.
3) Why are my IOMMU groups huge?
Because your PCIe topology and ACS settings decide what can be isolated. Some consumer-ish platforms and some server designs behind
certain switches group functions together. If the VF shares a group with the PF or unrelated devices, your clean per-VF passthrough plan
may be impossible on that hardware.
4) Can I use SR-IOV VFs with Linux bridges or Open vSwitch instead of passthrough?
You can, but it’s usually not why people deploy SR-IOV. If you keep VFs in the host and bridge them, you often reintroduce software
switching and policy complexity. Decide what you want: “direct assignment” or “host-managed switching.” Mixing goals is how you create
mystery meat networking.
5) Why does VFIO binding sometimes revert after reboot?
Because the default driver may bind first during enumeration. Fix it with persistent binding rules: driver_override applied
at the right time, or modprobe configuration that ensures vfio-pci claims specific vendor/device IDs. Be careful: claiming by ID can also
grab devices you didn’t intend if you have multiple identical NICs.
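A minimal sketch of the ID-based variant, using the VF device ID from the lspci output earlier (15b3:101c); the file name is arbitrary, and the softdep line is one common way to make vfio-pci register before the net driver:
cr0x@server:~$ cat /etc/modprobe.d/vfio-vf.conf
options vfio-pci ids=15b3:101c
softdep mlx5_core pre: vfio-pci
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.12.0-amd64
Because this matches by ID, it claims every VF with that ID on the host, including ones from a second identical NIC; the per-address driver_override or driverctl approach avoids that.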
6) Should I set trust on for VFs?
Only if you know exactly why. Trust can allow the VF to change MAC/VLAN behavior and may relax filtering. That’s useful for appliances,
routers, or nested networking, but it changes your threat model. Default stance: trust off, spoofchk on, and
explicitly set MAC/VLAN.
7) Are jumbo frames “free” with SR-IOV?
No. You still need end-to-end MTU consistency: guest interface, VF, PF, switchport, and upstream path. Jumbo frames failing often looks
like random drops, because small control traffic works while large payloads blackhole.
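A quick end-to-end check, as a sketch: with a 9000-byte MTU, an unfragmentable 8972-byte ICMP payload (9000 minus 28 bytes of IP and ICMP headers) should survive the path. The gateway address is illustrative, and the test is most telling from the guest:
cr0x@server:~$ ping -M do -s 8972 -c 3 10.0.120.1
PING 10.0.120.1 (10.0.120.1) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
That output is the failure mode you’re hunting: some hop (here, the sending interface itself) is still at 1500.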
8) Why does performance vary across hosts with the same NIC?
PCIe topology, NUMA placement, BIOS defaults, and firmware profiles. Two identical NICs in different slots can behave very differently.
Also, IRQ placement and host CPU frequency scaling can skew results. Treat performance as a platform property, not a NIC property.
9) Is SR-IOV a security boundary?
It can be part of one, but don’t treat it as a magic wall. You rely on IOMMU isolation, correct firmware, correct driver behavior, and
sane operational controls. If your threat model is strong multi-tenancy, you need disciplined configuration and auditing. If your threat
model is “keep honest people honest,” it’s easier.
10) When should I not use SR-IOV?
When you need flexible L2/L3 policy in the host, when your platform can’t isolate IOMMU groups, when you can’t coordinate switch port
policies, or when you need live migration without disruption and your tooling can’t handle SR-IOV cleanly. Virtio with good tuning is not
an embarrassment. It’s often the correct answer.
Next steps you can do this week
If you want SR-IOV to behave on Debian 13, stop treating it like a single feature toggle. Treat it like a small platform integration
project.
- Pick one host and one NIC port and validate IOMMU + isolation first. If groups are wrong, redesign now.
- Create 2 VFs, not 32. Bind one to VFIO and attach to a test VM. Prove the guest driver and basic connectivity.
- Make policy explicit: set VF MAC/VLAN/spoof/trust intentionally and document the mapping of VF → tenant.
- Measure performance with topology awareness: confirm link width/speed and NUMA placement before blaming drivers.
- Automate persistence with a simple systemd oneshot unit that recreates VFs and policies after reboot.
Once SR-IOV is stable, you can start caring about the fun stuff: queue counts, offloads, DPDK, and shaving microseconds. First, make it
boring. Boring networking is the kind that doesn’t wake you up.