Passthrough is the promise: “Just give the VM the real device and it’ll be fast.” In practice, you get a Tuesday-night mystery—VM boots, but your HBA vanishes; the GPU only works after a cold boot; the NIC drops when you vMotion; storage latency looks like a heartbeat monitor.
This piece compares Proxmox (KVM/QEMU + VFIO) and VMware ESXi (DirectPath I/O) the way production people actually experience them: setup friction, operational sharp edges, and the performance you can reasonably expect for GPUs, HBAs, and NICs.
The punchy take: what to choose
If you want “most predictable for the average homelab and many SMBs”
Proxmox is usually easier to reason about because it’s Linux end-to-end. You can see every knob: IOMMU groups, driver binding, interrupt routing, reset quirks. When it fails, it fails loudly—and you can inspect the host like a normal Linux box.
If you’re passing through HBAs for ZFS, Proxmox is especially comfortable: ZFS on host plus VM passthrough is a religious debate, but Proxmox gives you both patterns with fewer “black box” moments.
If you want “enterprise guardrails and predictable lifecycle”
ESXi is the cleaner corporate story: device compatibility lists, a stable management plane, and operational patterns that scale with teams. For passthrough specifically, ESXi is excellent when your hardware is on the happy path and your requirements match what VMware expects.
When you stray off the happy path—consumer GPUs, oddball reset behavior, niche NICs—ESXi turns into a polite bouncer: it won’t fight you, it just won’t let you in.
GPUs
- Ease: Proxmox usually wins for DIY GPU passthrough. ESXi is good if you stay within supported GPUs and supported modes (plus licensing, if you’re doing vGPU).
- Speed: near-native on both when configured correctly. Your bottleneck is more likely CPU scheduling, PCIe topology, or driver/power management than the hypervisor.
HBAs (SAS/SATA controllers)
- Ease: Proxmox wins if your goal is “give the VM the whole HBA and let it run ZFS/TrueNAS.” ESXi is fine, but you’ll spend more time in compatibility and “is this allowed” land.
- Speed: passthrough is effectively native, but the real performance story is queue depth, interrupt moderation, and firmware/driver behavior.
NICs (10/25/40/100G, SR-IOV, DPDK-ish workloads)
- Ease: tie, depending on your goal. ESXi has mature SR-IOV workflows; Proxmox has Linux tooling and fewer product constraints.
- Speed: SR-IOV or full passthrough can be excellent on both. For general VM networking, often paravirtual NICs (vmxnet3/virtio-net) are “fast enough” and operationally kinder.
Dry-funny truth: passthrough gives you native performance plus native problems.
Interesting facts and short history (useful, not trivia night)
- Intel VT-d and AMD-Vi (IOMMU) are what make serious DMA isolation possible. Before that, “passthrough” was a security and stability roulette wheel.
- VFIO (Virtual Function I/O) became the mainstream Linux approach because it cleanly separates device access from userspace (QEMU) and leans on the IOMMU for safety.
- Interrupt remapping is a big deal for correctness and security. Without it, some platforms can’t safely virtualize MSI/MSI-X at scale.
- SR-IOV (Single Root I/O Virtualization) was created to avoid full device assignment by splitting a NIC into “virtual functions” with hardware-enforced isolation.
- GPU reset bugs are not folklore. Some GPUs don’t implement a proper function-level reset (FLR), which is why “works after power cycle” is a real category of incident.
- ACS (Access Control Services) affects IOMMU grouping. Consumer boards often group multiple slots together, forcing you to passthrough more than you wanted—or use an ACS override with tradeoffs.
- NVMe was designed for deep queues and low latency. Virtualizing it poorly looks like “it works” until you put a database on it and watch p99 get weird.
- VMware DirectPath I/O has been around for years and is mature, but it’s intentionally conservative: it prioritizes supportability and predictable behavior over “let me hack it.”
- vMotion and passthrough have always had an awkward relationship. You can’t live-migrate a VM that owns a physical device unless you’re using specific mediated/device-sharing tech.
How passthrough really works (and where it breaks)
PCIe devices do two things that matter here: they expose configuration space (registers, capabilities) and they do DMA into system memory. Passthrough means the guest OS gets direct access to the device, and the hypervisor mostly gets out of the way—except it must still keep the host safe.
The three pillars: IOMMU, isolation, and resets
- IOMMU maps device DMA to guest memory. Without it, a device assigned to a VM could DMA into the host kernel and scribble on reality.
- Isolation depends on IOMMU groups. If two devices share the same group, you often can’t safely assign just one of them (because they can potentially DMA to each other’s address space or share upstream translation quirks).
- Reset behavior matters. When a VM reboots, the device must return to a clean state. If the device can’t reset (or the platform won’t reset it), you get “works once” problems.
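A quick pre-flight check for that third pillar: ask the hardware and the kernel what reset options exist before you commit to a card. A minimal sketch on a Linux host, using an illustrative BDF (substitute your own; the reset_method file only exists on reasonably recent kernels):
cr0x@server:~$ sudo lspci -vv -s 01:00.0 | grep -i flreset
cr0x@server:~$ cat /sys/bus/pci/devices/0000:01:00.0/reset_method
FLReset+ in the device capabilities means the function advertises Function Level Reset; reset_method lists what the kernel will actually attempt (flr, pm, bus). Neither guarantees clean behavior after a guest reboot, but “no FLR, bus reset only” is the kind of warning you want on day one, not after deployment.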
Proxmox path: VFIO + QEMU + Linux drivers
On Proxmox, you typically:
- Enable IOMMU in firmware and kernel cmdline.
- Bind the target PCIe device to vfio-pci (not its normal host driver).
- Add hostpci lines to the VM config, optionally controlling ROM BAR, MSI, and other quirks.
- Let the guest driver own the device.
The upside: you can see everything and change it. The downside: you can see everything and change it. You are now the integration team.
ESXi path: DirectPath I/O under vSphere rules
On ESXi, you:
- Enable IOMMU in firmware.
- Mark the device for passthrough in the host configuration.
- Reboot the host (often required to rebind the device).
- Add the device to a VM, accepting constraints like no vMotion.
ESXi is less “tweakable” but more “guardrailed.” When it refuses a configuration, it’s usually because it can’t guarantee supportability or safety on that platform.
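If you live in the CLI, recent ESXi builds (7.x and later) also expose passthrough state under esxcli. This is a sketch under that assumption; subcommand and option names have shifted between versions, so check the namespace’s built-in help on your build before scripting anything:
cr0x@server:~$ esxcli hardware pci pcipassthru list
The matching “set” subcommand toggles passthrough per device, and on some hardware it applies without a host reboot; when in doubt, plan the reboot anyway and treat the no-reboot path as a bonus.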
Where it breaks in real life
- Bad IOMMU grouping (common on consumer boards): you can’t isolate a GPU from its audio function or from a USB controller sitting behind the same upstream bridge.
- Reset failures: the device doesn’t come back after VM shutdown; the host sees it but the guest driver times out; only a host reboot fixes it.
- Interrupt storms or latency spikes: you passed through a high-IOPS device and now CPU 0 is crying because interrupts are pinned weirdly.
- Firmware/driver mismatch: the host BIOS is old; the HBA firmware is old; the guest driver expects something else; you now have a storage ghost story.
One quote, because it fits ops life: Hope is not a strategy.
— James R. Schlesinger
Joke #1: If your GPU only resets after a host reboot, congratulations—you’ve invented “high availability via patience.”
What’s easier: Proxmox vs ESXi by device type
GPU passthrough
Proxmox is easier when you’re outside enterprise GPU lanes. You can pass through almost any PCIe GPU if you can isolate it, bind it to VFIO, and deal with reset quirks. Community lore exists because the problems are common and reproducible.
ESXi is easier when your GPU is supported and your workflow matches VMware’s expectations. If you need vGPU (time-sliced GPUs, multiple VMs sharing a GPU), ESXi’s ecosystem is historically stronger, but it’s also where licensing and vendor enablement start to matter.
Practical reality: the hardest part is rarely “click passthrough.” It’s getting stable behavior across reboots, driver updates, and power events. Proxmox gives you more levers. ESXi gives you fewer but safer levers.
HBA passthrough (SAS HBAs, RAID cards in IT mode)
If you’re doing storage engineering, you already know the rule: RAID firmware lies. If you want ZFS or any storage stack that expects direct disk visibility, you want an HBA in IT mode (or a true HBA) and you want the VM to own it.
Proxmox makes this feel natural: bind HBA to VFIO, pass it through, let TrueNAS/OmniOS/Linux see the disks, manage SMART, and do your thing. You can still keep management and boot devices on the host.
ESXi is fine too, but you’ll spend more time ensuring the controller and firmware are in a supported combination and that the host isn’t trying to be clever. In bigger orgs, this is a feature: fewer weird snowflakes.
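Whichever hypervisor you pick, the guest-side sanity check is the same: the VM should see real disks, not virtualized ones. A minimal sketch from inside the guest (hostname and device name are illustrative):
cr0x@vm-guest:~$ sudo smartctl -i /dev/sda | egrep -i 'model|serial'
Real vendor models and serial numbers mean the HBA passthrough is doing its job. A QEMU or VMware virtual-disk identity means you’re still behind a virtualization layer, and ZFS won’t get the visibility you built this for.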
NIC passthrough: full device vs SR-IOV vs paravirtual
For most workloads, you don’t need full NIC passthrough. virtio-net (Proxmox) and vmxnet3 (ESXi) are excellent. They also preserve migration features and make operational life calmer.
When you do need it—NFV, ultra-low latency, heavy packet rates, or a VM acting as a firewall/router at 25/100G—then you’re deciding between:
- Full passthrough: simplest conceptually, but kills mobility and can complicate host networking.
- SR-IOV VFs: best compromise for many designs; you keep some host control and can map multiple VMs to one physical port.
- DPDK/Userspace stacks: often paired with passthrough or SR-IOV; performance can be great, but you inherit a tuning problem.
Ease winner depends on your team: Linux-heavy shops find Proxmox SR-IOV and IRQ tuning straightforward; VMware-heavy shops find ESXi’s workflows and consistency easier to operationalize.
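For the SR-IOV route on a Linux host, the mechanics are small; the judgment calls (VF counts, VLANs, trust settings) are where the actual work lives. A minimal sketch assuming an Intel 700-series port named enp5s0f0 (your interface name will differ):
cr0x@server:~$ cat /sys/class/net/enp5s0f0/device/sriov_totalvfs
cr0x@server:~$ echo 4 | sudo tee /sys/class/net/enp5s0f0/device/sriov_numvfs
cr0x@server:~$ lspci -nn | grep -i "virtual function"
Each VF appears as its own PCI function you can hand to a VM (hostpci on Proxmox, an SR-IOV adapter on ESXi) while the physical function stays with the host. Make the VF count persistent with a udev rule or systemd unit, or it quietly disappears on the next reboot.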
What’s faster: performance reality by workload
GPU performance: usually near-native, until it isn’t
When a GPU is passed through properly, you’re close to bare metal for raw throughput. The performance misses usually come from:
- CPU scheduling: your VM vCPUs aren’t pinned, the host is overcommitted, or latency-sensitive threads bounce between cores.
- NUMA mismatches: GPU on one socket, VM memory on another. Bandwidth and latency penalties can be sharp.
- Power management: aggressive ASPM or C-states can make latency spiky (depends on platform).
Proxmox gives you direct access to Linux tools for NUMA pinning and hugepages; ESXi has mature CPU/memory scheduling controls too, but the “why is this slower” root cause is sometimes less transparent.
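Checking locality takes seconds and settles arguments. A minimal sketch on the Proxmox host, reusing an illustrative GPU address:
cr0x@server:~$ cat /sys/bus/pci/devices/0000:01:00.0/numa_node
cr0x@server:~$ numactl --hardware | head -n 4
A numa_node of 0 or 1 tells you which socket owns the slot (-1 means the platform isn’t reporting it). Pin the VM’s vCPUs and memory to that node, re-measure, and only then start blaming the hypervisor.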
HBA passthrough: the hypervisor is rarely the bottleneck
For HBAs, passthrough is basically “wire the device to the guest.” Performance problems almost always trace back to:
- Wrong firmware mode (IR/RAID instead of IT/HBA).
- Queue depth and interrupt moderation mismatches.
- Disk behavior (SMR drives, power management, error recovery timers).
Both hypervisors can deliver near-native. The difference is operational: Proxmox makes it easier to correlate host logs, PCIe errors, and kernel messages. ESXi keeps the environment consistent but can hide low-level details behind its abstractions.
NIC performance: paravirtual is underrated
At 10G, both virtio-net and vmxnet3 are usually great. At 25/100G, you start caring about:
- Offloads (TSO/GSO/GRO/LRO) and whether your workload benefits.
- IRQ distribution and RSS queues.
- vSwitch overhead versus direct hardware queues.
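Before reaching for passthrough, confirm what the NIC is actually doing today. A quick ethtool pass covers all three bullets above (interface name is illustrative):
cr0x@server:~$ ethtool -k enp5s0f0 | egrep 'tcp-segmentation|generic-receive|large-receive'
cr0x@server:~$ ethtool -l enp5s0f0
cr0x@server:~$ ethtool -S enp5s0f0 | egrep -i 'drop|miss' | head
Offload flags show whether the cheap wins are already enabled, the channel count shows how many queues you have to spread load across, and the drop counters tell you whether the problem is even where you think it is.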
Full passthrough and SR-IOV can reduce overhead, but they also reduce your ability to do live migration and centralized policy. If your main goal is “fast,” passthrough can win. If your goal is “fast and sane,” paravirtual often wins.
Joke #2: The fastest NIC is the one you don’t debug at 3 a.m.
Practical tasks: commands, outputs, and decisions (12+)
These are the things I actually run when a passthrough build smells off. Each task includes: a command, realistic output, what it means, and the decision you make.
1) Confirm IOMMU is enabled (Proxmox/Linux host)
cr0x@server:~$ dmesg | grep -E "DMAR|IOMMU|AMD-Vi" | head
[ 0.812345] DMAR: IOMMU enabled
[ 0.812901] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 0.813210] DMAR: Interrupt remapping enabled
Meaning: IOMMU and interrupt remapping are active. That’s the baseline for safe passthrough.
Decision: If you don’t see this, stop. Fix BIOS settings (VT-d/AMD-Vi) and kernel cmdline before touching VM configs.
2) Verify kernel cmdline includes IOMMU flags (Proxmox)
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-4-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
Meaning: intel_iommu=on enables VT-d; iommu=pt uses passthrough mode for host devices to reduce overhead.
Decision: If missing, update GRUB and reboot. Don’t “try anyway.” You’ll waste hours.
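If the flags are missing, the fix is a one-line edit plus a bootloader refresh. A minimal sketch for a GRUB-booted Proxmox host; ZFS-root installs boot via systemd-boot instead, where the flags go in /etc/kernel/cmdline followed by proxmox-boot-tool refresh:
cr0x@server:~$ sudo nano /etc/default/grub    # add intel_iommu=on iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT
cr0x@server:~$ sudo update-grub
cr0x@server:~$ sudo reboot
On AMD platforms, current kernels enable the IOMMU driver automatically when AMD-Vi is on in firmware; you mostly just add iommu=pt if you want passthrough mode for host devices.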
3) Map the device you want to passthrough (PCI ID)
cr0x@server:~$ lspci -nn | egrep -i "vga|3d|ethernet|sas|raid|nvme"
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0087] (rev 02)
05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 02)
Meaning: You have three classic passthrough candidates: GPU (plus its audio function), an HBA, and a 10GbE NIC.
Decision: Note the BDF addresses (01:00.0, 03:00.0, 05:00.0). You’ll use these in VM config or binding steps.
4) Check IOMMU groups (can you isolate the device?)
cr0x@server:~$ for g in /sys/kernel/iommu_groups/*; do echo "Group $(basename $g):"; ls -1 $g/devices/; done | sed -n '1,20p'
Group 0:
0000:00:00.0
Group 1:
0000:00:01.0
0000:01:00.0
0000:01:00.1
Group 2:
0000:00:03.0
0000:03:00.0
Group 3:
0000:05:00.0
Meaning: GPU functions 01:00.0 and 01:00.1 are together (good), HBA is isolated, NIC is isolated. This is what you want.
Decision: If your GPU shares a group with random devices, you either move slots, change BIOS “PCIe ACS” settings, or accept an ACS override (with security/isolation tradeoffs).
5) Confirm what driver currently owns the device (before binding VFIO)
cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Subsystem: Broadcom / LSI Device 0087
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
Meaning: The host is currently using the HBA driver. If you pass it through, the host must let go.
Decision: If the host is using this for its own storage, do not proceed. Boot the host from something else first (mirrored SATADOM, separate NVMe, etc.).
6) Bind a device to vfio-pci (Proxmox/Linux)
cr0x@server:~$ echo "options vfio-pci ids=1000:0087,10de:2204,10de:1aef" | sudo tee /etc/modprobe.d/vfio.conf
options vfio-pci ids=1000:0087,10de:2204,10de:1aef
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve
Meaning: You’ve told the host to bind those PCI IDs to VFIO at boot.
Decision: Reboot the host during a maintenance window. Hot-rebinding is possible, but production people prefer boring repeatability.
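For completeness, the hot-rebind path looks like this: load vfio-pci, tell the kernel which driver should claim the function, detach the current driver, and reprobe. A sketch using the HBA address from above; a device that is still busy will not detach cleanly, so this is still maintenance-window work:
cr0x@server:~$ sudo modprobe vfio-pci
cr0x@server:~$ echo vfio-pci | sudo tee /sys/bus/pci/devices/0000:03:00.0/driver_override
cr0x@server:~$ echo 0000:03:00.0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/driver/unbind
cr0x@server:~$ echo 0000:03:00.0 | sudo tee /sys/bus/pci/drivers_probe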
7) Verify vfio-pci is now in use
cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Subsystem: Broadcom / LSI Device 0087
Kernel driver in use: vfio-pci
Kernel modules: mpt3sas
Meaning: Host driver is detached; VFIO owns the device. This is the state you want for passthrough.
Decision: If it’s still on the native driver, you have an initramfs or module ordering problem. Fix that before touching VM configs.
8) Add passthrough to a Proxmox VM config (hostpci)
cr0x@server:~$ sudo qm set 120 --hostpci0 01:00,pcie=1 --hostpci1 03:00,pcie=1
update VM 120: -hostpci0 01:00,pcie=1 -hostpci1 03:00,pcie=1
Meaning: VM 120 will get the GPU (both functions when you specify 01:00) and the HBA.
Decision: If you need the GPU audio too and Proxmox doesn’t pick it up implicitly, pass both functions explicitly. Then boot the VM and check guest detection.
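If you want every function spelled out instead of relying on the all-functions shorthand, a hedged variant of the same command (same illustrative VM ID and addresses):
cr0x@server:~$ sudo qm set 120 --hostpci0 01:00.0,pcie=1 --hostpci1 01:00.1,pcie=1 --hostpci2 03:00,pcie=1
Explicit functions are also how you deliberately pass only part of a device—a GPU without its audio function, or one port of a multi-function NIC—assuming the IOMMU grouping allows the split.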
9) Check QEMU starts with the devices attached (Proxmox)
cr0x@server:~$ sudo journalctl -u pvedaemon -n 20 | tail
Dec 28 10:12:41 server pvedaemon[2123]: starting task UPID:server:00003A1B:0002C2F9:676FFB29:qmstart:120:root@pam:
Dec 28 10:12:43 server pvedaemon[2123]: VM 120 started with pid 14592
Meaning: The VM started. This doesn’t prove the device initialized, but it rules out immediate QEMU refusal.
Decision: If the VM won’t start, inspect the task output for VFIO/IOMMU errors and go straight to group isolation and driver binding checks.
10) In the guest, confirm the GPU/HBA is visible (Linux guest)
cr0x@vm-guest:~$ lspci -nn | egrep -i "vga|sas|scsi"
00:10.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
00:11.0 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
00:12.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0087] (rev 02)
Meaning: The VM sees the real hardware. Now the battle is driver and stability, not attachment.
Decision: If missing, you likely passed the wrong BDF, the device is still bound to host driver, or the VM machine type/PCIe setting is wrong.
11) Check for PCIe/AER errors on the host (link stability)
cr0x@server:~$ dmesg | egrep -i "aer|pcie bus error|vfio|DMAR" | tail -n 8
[ 912.344120] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 912.344141] pcieport 0000:00:01.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 912.344148] pcieport 0000:00:01.0: AER: device [8086:1901] error status/mask=00000001/00002000
[ 912.344152] pcieport 0000:00:01.0: AER: [ 0] RxErr
Meaning: Corrected errors happen, but frequent bursts can correlate with GPU hangs, NIC drops, or NVMe weirdness. It can be signal integrity, risers, power, or BIOS.
Decision: If you see repeated corrected errors under load, check cabling/risers, reseat cards, update BIOS, and consider forcing PCIe Gen speed down one notch for stability.
12) Check interrupt distribution (host) for a passed-through NIC/HBA
cr0x@server:~$ cat /proc/interrupts | egrep -i "vfio|ixgbe|i40e|mpt3sas" | head
55: 182934 0 0 0 IR-PCI-MSI 524288-edge vfio-pci
56: 176211 0 0 0 IR-PCI-MSI 524289-edge vfio-pci
Meaning: Interrupts are landing on CPU0 only (all the nonzero counts sit in the first CPU column). That can be okay at low rates, but it’s a classic reason for latency spikes.
Decision: If p99 latency matters, configure IRQ balancing/pinning so interrupts distribute across cores aligned with NUMA locality.
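Pinning is unglamorous sysfs work. A minimal sketch that moves the two vfio interrupts above onto CPUs 2 and 3; pick cores on the same NUMA node as the device, and either exclude these IRQs from irqbalance or expect it to move them back:
cr0x@server:~$ echo 2 | sudo tee /proc/irq/55/smp_affinity_list
cr0x@server:~$ echo 3 | sudo tee /proc/irq/56/smp_affinity_list
cr0x@server:~$ grep -E '^ *5[56]:' /proc/interrupts
Re-run the workload and confirm the counts now grow on the CPUs you chose before declaring victory.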
13) Measure storage latency inside the guest (quick reality check)
cr0x@vm-storage:~$ iostat -x 1 3
Linux 6.6.15 (vm-storage) 12/28/2025 _x86_64_ (16 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 120.0 250.0 6.0 18.0 96.0 3.20 8.50 4.10 10.60 0.70 25.2
Meaning: await is average request latency in ms. If you expected NVMe-like behavior and see 8–15ms, something upstream is off (device, disks, queueing, controller mode).
Decision: If latency is high, verify HBA is in IT mode, check disks, and inspect host PCIe errors. Don’t blame the hypervisor first.
14) ESXi: check whether a device is eligible for passthrough (conceptual parity task)
cr0x@server:~$ esxcli hardware pci list | egrep -n "0000:03:00.0|Passthru" -A2
214:0000:03:00.0
215: Class Name: Serial Attached SCSI controller
216: Passthru Capable: true
Meaning: ESXi recognizes the device and says it can be passed through.
Decision: If Passthru Capable is false, stop and check platform support/BIOS/IOMMU and whether ESXi has driver/quirk support for that device.
Fast diagnosis playbook: find the bottleneck without guessing
This is the “walk into the war room and be useful in 10 minutes” sequence. It’s ordered. Don’t skip ahead because you’re bored.
First: verify attachment and isolation
- Is IOMMU enabled? (dmesg/cmdline or ESXi host settings). If not, nothing else matters.
- Is the device isolated in its IOMMU group? If not, expect instability or refusal to start.
- Is the device bound to the right driver? Proxmox: vfio-pci. ESXi: marked for passthrough (which usually requires a host reboot).
Second: verify reset and lifecycle behavior
- Cold boot host → start VM → stop VM → start VM again. Repeat once.
- If it fails on second start, suspect FLR/reset limitations. This is very common for GPUs.
- Check host logs for VFIO reset errors or PCIe link retraining issues.
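The loop above is easy to script so nobody “forgets” the second start. A minimal sketch for a Proxmox host and an illustrative VM ID; the sleeps are arbitrary, qm stop is a hard stop (swap in qm shutdown if you want the guest’s cooperation), and the dmesg check at the end is the part people skip:
cr0x@server:~$ for i in 1 2; do sudo qm start 120; sleep 120; sudo qm stop 120; sleep 30; done
cr0x@server:~$ dmesg | egrep -i 'vfio|reset|AER' | tail -n 20
If the second start fails or the log fills with reset errors, you’ve found your incident in staging instead of production.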
Third: check the boring hardware realities
- PCIe errors (AER) in dmesg or ESXi logs. Corrected errors in bursts are a smell.
- Thermals and power: GPU + HBA + NIC in one chassis can be a power transient festival.
- Firmware alignment: BIOS, GPU VBIOS, HBA firmware. Old firmware causes new pain.
Fourth: measure in the guest with the right lens
- GPU: check clocks and utilization, not just FPS. If clocks throttle, it’s not the hypervisor.
- Storage: look at latency (iostat -x) and queue depths, not just throughput.
- Network: look at packet drops, IRQ CPU, and RSS distribution, not just iperf headline numbers.
Fifth: only then tune
- NUMA pinning and hugepages for consistent latency.
- IRQ balancing/pinning for NIC/HBA stability under load.
- Disable aggressive power states if latency jitter is killing you.
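If measurement justifies the hugepages item above, the usual Proxmox pattern is: reserve 1G pages on the kernel cmdline (same bootloader edit as the IOMMU flags, e.g. default_hugepagesz=1G hugepagesz=1G hugepages=32), reboot, then point the VM at them. Sizes here are placeholders; reserved hugepages come out of the general memory pool, so reserve only what the VM actually needs:
cr0x@server:~$ grep HugePages_ /proc/meminfo
cr0x@server:~$ sudo qm set 120 --hugepages 1024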
Common mistakes (symptom → root cause → fix)
1) VM won’t start after adding hostpci
Symptom: Proxmox task fails; QEMU complains about VFIO or “group not viable.”
Root cause: Device shares an IOMMU group with another in-use device, or IOMMU isn’t enabled.
Fix: Enable VT-d/AMD-Vi, verify dmesg; move card to another slot; avoid ACS override unless you accept the isolation tradeoff.
2) GPU works once, then black screen on VM reboot
Symptom: First boot fine; after shutdown/restart, guest driver hangs or device disappears.
Root cause: GPU lacks reliable FLR/reset support; host can’t reset the function cleanly.
Fix: Try different GPU model; use vendor-reset where applicable; prefer server/workstation GPUs for uptime; plan for host reboot as recovery (and document it).
3) Storage performance is “fine” until database load, then p99 explodes
Symptom: Benchmarks look okay; real workload has latency spikes and timeouts.
Root cause: HBA in RAID/IR mode, queueing mismatch, or interrupts pinned to one core; sometimes SMR drives doing SMR things.
Fix: Ensure IT mode; tune queue depths; distribute interrupts; validate drive models and write behavior.
4) NIC passthrough VM has great throughput but random packet loss
Symptom: iperf shows line rate; real traffic sees drops and retransmits.
Root cause: IRQ imbalance, RSS misconfiguration, ring buffers too small, or PCIe errors under load.
Fix: Check interrupts, tune driver settings in guest, verify link stability (AER), and align NUMA (NIC near CPU/memory used).
5) ESXi passthrough device is “not capable” even though the hardware supports it
Symptom: ESXi refuses the device for DirectPath I/O.
Root cause: BIOS IOMMU disabled, platform firmware bug, or device/driver not on the supported path ESXi expects.
Fix: Enable IOMMU in BIOS, update BIOS, validate device support, and consider using SR-IOV/paravirtual instead if the device is a weird snowflake.
6) You passed through an HBA… and now the host storage vanished
Symptom: Host boots sometimes; after changes, root disk missing or ZFS pool not found.
Root cause: You passed through the controller that hosts your boot or data devices. Classic self-own.
Fix: Separate host boot storage from passthrough storage. Treat it as a hard design requirement, not a suggestion.
Three corporate mini-stories (anonymized, plausible, painfully familiar)
Incident caused by a wrong assumption: “IOMMU groups are per-slot, right?”
The team inherited a small virtualization cluster that had grown organically: a couple of high-core servers, mixed NICs, and one “special” GPU node for ML experiments. The plan was simple: move a second GPU into an empty slot, pass both through to separate VMs, and let the data science group self-serve.
They did the Proxmox configuration carefully—VFIO binding, hostpci entries, even documented which VM got which card. The first GPU worked. The second one… worked, but only when the first VM was powered off. When both were up, one VM would freeze under load and the other would log PCIe errors.
The wrong assumption was subtle: they assumed the motherboard isolated each slot into its own IOMMU group. In reality, two slots shared the same upstream PCIe bridge without ACS separation, and the platform grouped multiple devices together. The “empty slot” wasn’t operationally independent. It was just physically empty.
The fix wasn’t a clever kernel parameter. They moved one GPU to a slot on a different root complex, updated BIOS, and accepted that the board’s topology—not the hypervisor—was the actual constraint. The post-incident note was short and sharp: “PCIe topology is a design input, not an afterthought.”
Optimization that backfired: “Let’s force max performance everywhere”
A fintech-ish shop had latency-sensitive services and a habit of tuning. They moved a network appliance VM to Proxmox and gave it SR-IOV VFs to hit packet rates they couldn’t get with pure virtio-net. It worked. So they did the usual performance rituals: disable CPU C-states, set CPU governor to performance, and crank NIC interrupt coalescing to reduce overhead.
Throughput improved on paper. CPU utilization dropped. Everyone high-fived. Then customer-facing tail latency got worse, intermittently, and only during certain traffic patterns. It took too long to see the correlation: heavy bursts caused micro-bursts inside the appliance VM, and the new interrupt coalescing settings turned “many small timely interrupts” into “fewer chunky interrupts.”
Worse, disabling C-states changed thermal behavior. Fans ramped in odd ways, and one host started logging corrected PCIe errors under sustained load. Nothing crashed, but the system lived closer to the edge. The “optimization” took multiple small buffers of safety and spent them all.
The rollback was boring: return to default coalescing, keep reasonable power settings, and only pin CPU and tune IRQs where measurement justified it. They kept SR-IOV because it solved the core problem, but they stopped trying to outsmart the hardware without a clear p99 goal. Lesson learned: performance tuning without a latency model is just cargo cult with better vocabulary.
Boring but correct practice that saved the day: “We staged a reboot test”
A media company ran GPU-accelerated transcode VMs on ESXi. They had a change window for patching: host firmware, ESXi updates, and guest driver updates. The environment had been stable, so the temptation was to do the updates and call it done.
Instead, their ops lead insisted on a “reboot lifecycle test” for one host: patch it, then cycle each VM through start/stop/start twice, and do a full host reboot between passes. It felt excessive. It also took time they didn’t want to spend.
On the second VM restart after the patch, a GPU device failed to initialize. It didn’t happen consistently. It didn’t show up in basic smoke tests. But it was real, and the failure mode was exactly the kind that becomes a 2 a.m. incident when a host reboots unexpectedly.
Because they caught it in staging, they could roll back the driver combination and schedule a vendor-supported firmware update later. No drama. No customer impact. Just a calendar invite and a fixed runbook. The boring practice—testing lifecycle behavior, not just first boot—was the difference between “minor delay” and “major outage.”
Checklists / step-by-step plan
Design checklist (before you touch config files)
- Define the non-negotiables: do you need live migration, HA restart, snapshots, or is this a “pet VM” bound to a host?
- Pick device strategy: full passthrough vs SR-IOV vs paravirtual. Default to paravirtual unless you can name the bottleneck.
- Validate topology: which CPU socket is the slot attached to? Which IOMMU group? Don’t guess.
- Separate boot from passthrough devices: host must boot and be manageable without the passed-through controller.
- Document reset recovery: what do you do when the device won’t reinitialize? Host reboot? Power cycle? Scripted detach/reattach?
Proxmox step-by-step (pragmatic baseline)
- Enable VT-d/AMD-Vi in BIOS/UEFI. Enable “Above 4G decoding” if you’re doing GPUs/NVMe with big BARs.
- Add kernel params (intel_iommu=on or amd_iommu=on, plus iommu=pt), then reboot.
- Check IOMMU groups; relocate cards if necessary.
- Bind devices to vfio-pci via /etc/modprobe.d/vfio.conf, update the initramfs, reboot.
- Add hostpci entries to the VM; use pcie=1 for modern devices.
- Boot the guest, install vendor drivers, and validate stability across multiple start/stop cycles.
- Only then: NUMA pinning, hugepages, IRQ tuning if you have measured latency goals.
ESXi step-by-step (operationally safe baseline)
- Enable IOMMU in BIOS/UEFI.
- Confirm device is seen and “passthru capable.”
- Mark device for passthrough in host configuration; reboot host.
- Add device to VM; accept that vMotion is typically off the table.
- Validate guest drivers and lifecycle behavior (start/stop/reboot loops).
- If you need mobility, evaluate SR-IOV (for NIC) or mediated/vGPU options (for GPU) within your support constraints.
Operational checklist (what you write in the runbook)
- How to confirm device is attached (host and guest commands).
- How to recover from reset failure (exact steps, maintenance window requirements).
- How to patch safely (order: firmware, hypervisor, guest drivers; validation loops).
- What logs to collect for escalation (host dmesg/AER, VM logs, guest driver logs).
- What features are disabled (migration, snapshots, suspend/resume), so nobody is surprised later.
FAQ
1) Is PCIe passthrough “faster” than virtio/vmxnet3?
Sometimes. For high packet rates, low latency, or specialized offloads, passthrough/SR-IOV can win. For general workloads, paravirtual devices are often within striking distance and far easier to operate.
2) Why does my GPU passthrough work on first boot but fail after VM reboot?
Reset behavior. Some GPUs don’t implement reliable FLR, or the platform can’t reset the device cleanly. You end up needing a host reboot or power cycle to recover.
3) Should I pass through an HBA to a TrueNAS/ZFS VM?
If you want the VM to own disks directly with full visibility (SMART, error handling), yes—HBA passthrough is a common, sane design. Just ensure the host doesn’t depend on that HBA for boot or management storage.
4) Is ACS override on Proxmox safe?
It can make things “work” by splitting IOMMU groups, but it’s not magic isolation. You’re trading safety guarantees for flexibility. In production, prefer hardware that gives you clean groups without overrides.
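For reference, the override is a kernel command-line parameter on kernels that carry the ACS override patch (the Proxmox kernel does); treat this as a sketch, not an endorsement:
pcie_acs_override=downstream,multifunction
Add it next to the IOMMU flags, reboot, and re-check the groups. If you do this, write it down: the groups you see afterwards describe the override, not the hardware’s real isolation.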
5) Can I live-migrate a VM using passthrough?
Generally, no. Full passthrough ties the VM to the physical host. Some mediated approaches (SR-IOV, vGPU) can restore partial mobility, but they come with platform constraints.
6) What’s the biggest performance trap with passthrough?
NUMA and interrupts. A passed-through NIC/HBA/GPU can be “fast” but still deliver terrible tail latency if interrupts and memory locality are wrong.
7) Do I need hugepages for GPU passthrough?
Not strictly. Hugepages can reduce TLB pressure and improve consistency for some workloads, but they also add allocation constraints. Use them when you have measured benefits or known workload patterns.
8) For NICs, should I prefer SR-IOV over full passthrough?
Often yes. SR-IOV gives near-passthrough performance while letting the host keep control of the physical function. But it adds operational complexity (VF lifecycle, driver matching, security posture).
9) Why does ESXi say my device isn’t passthrough-capable?
Most commonly: IOMMU disabled in BIOS, outdated firmware, or the device isn’t on the supportable path ESXi expects. ESXi is conservative; it refuses configurations it can’t stand behind.
10) Which is “easier” overall for a mixed GPU + HBA + NIC passthrough host?
Proxmox is easier if you want to tinker and troubleshoot at Linux depth. ESXi is easier if you want guardrails and your hardware is mainstream and supported.
Conclusion: next steps that actually help
If you’re choosing a platform for PCIe passthrough, don’t start with ideology. Start with your failure budget and your migration needs.
- If you need mobility and fleet operations, default to paravirtual devices, and use passthrough only where it measurably matters. ESXi tends to shine here.
- If you need maximum device control and you’re comfortable living in kernel logs and PCIe topology, Proxmox is a great fit—especially for GPU tinkering and HBA passthrough.
- Whichever you choose, treat passthrough as a hardware integration project: validate IOMMU grouping, reset behavior, and lifecycle stability before you promise anything to stakeholders.
Practical next step: pick one target workload, build a single host, and run the lifecycle test loop (start/stop/reboot twice) before you scale. That single boring exercise prevents the most expensive surprises.