Some performance problems are obvious. A CPU pegged at 100%. A network link flatlined. Storage is meaner: it fails quietly, in the gaps between microseconds. Your VM “has an NVMe disk,” your dashboards say IOPS are fine, and yet tail latency ruins your database, your build farm, or your logging pipeline.
The usual debate is framed like a boxing match: NVMe passthrough (VFIO) vs VirtIO. The awkward truth: the winner is often neither. The winner is removing the one extra layer you didn’t know you had, and tuning the layer you can’t remove. That’s why teams keep losing this fight in production while winning it in benchmarks.
What actually changes between passthrough and VirtIO
At a high level:
- NVMe passthrough (VFIO PCIe passthrough) gives a guest VM direct access to a physical NVMe controller. The guest loads the native nvme driver, owns the queues, issues admin commands, and “sees” something close to bare metal.
- VirtIO gives the guest a paravirtual device (virtio-blk or virtio-scsi). The guest submits requests to a virtqueue; the host (QEMU + vhost + kernel block layer) completes them via a backing device or file.
But the performance story is mostly about path length and scheduling points:
- How many times does the I/O request cross the user/kernel boundary?
- How many queues exist, and are they mapped sensibly to CPU cores?
- Where can the request be delayed by contention: lock, interrupt, cgroup throttling, host page cache, filesystem journaling, or a hypervisor thread that got descheduled?
The actual I/O paths (simplified)
VirtIO (common virtio-blk with QEMU):
- Guest app issues I/O (syscall).
- Guest kernel block layer schedules it.
- VirtIO driver posts descriptors to a virtqueue.
- Exit to host (virtio kick / interrupt), QEMU/vhost processes request.
- Host kernel block layer submits to physical device (or filesystem if you use image files).
- Completion bubbles back up; guest gets an interrupt; app resumes.
NVMe passthrough:
- Guest app issues I/O (syscall).
- Guest kernel NVMe driver submits directly to hardware queue (via PCIe MMIO/DMA).
- Hardware completes; interrupt to guest (MSI-X), completion handled in guest.
Passthrough deletes the host’s block stack from the hot path. That can be huge for latency. It also deletes host-side control points you might actually need, like host caching, snapshots, flexible live migration, and easy storage multiplexing.
Here’s the rule I use in production: if your workload is latency-sensitive and predictable, passthrough is attractive. If your workload is mixed, multi-tenant, and operationally messy, VirtIO is usually the correct compromise—unless you run it in the default configuration and then wonder why it’s not magic.
Interesting facts and historical context (the stuff that explains today)
Storage virtualization didn’t get “complicated” for fun. It got complicated because real systems are. A few concrete context points that matter when you’re choosing between passthrough and VirtIO:
- VirtIO was born from pain: early full-device emulation (like IDE) was slow because every I/O looked like a hardware interrupt party in software.
- NVMe standardized multi-queue from day one, built for parallelism. That fits modern CPUs; it also means queue mapping and interrupt routing matter a lot more than with old SATA.
- MSI-X made high IOPS possible by supporting multiple interrupts per device. It’s why “one disk” can scale across cores, and why bad interrupt affinity can ruin your day.
- Linux blk-mq changed the game: the multi-queue block layer reduced lock contention and improved scaling, but it also added new knobs and new ways to misconfigure.
- vhost was created to get QEMU out of the way: moving datapath work into the kernel reduced context switches and improved throughput for virtio networking and storage.
- IO schedulers went from “pick one” to “pick none” for NVMe: for fast devices, schedulers can add latency with little benefit, so “none” often wins.
- People used to benchmark with 4k random read and declare victory. Modern services are usually a mix of reads, writes, fsyncs, metadata ops, and bursts—so the old benchmark culture still misleads.
- Cloud hypervisors normalized VirtIO: operational features (migration, snapshots, tenancy controls) mattered as much as raw speed, so VirtIO won by being workable.
- Passthrough became practical at scale when IOMMU got boring: VFIO and IOMMU isolation made it less scary, but “less scary” is not the same as “no tradeoffs.”
The performance winner nobody mentions: fewer layers, fewer lies
When teams argue “passthrough vs VirtIO,” they’re often skipping the real question: what extra layers are you accidentally adding, and are you measuring the right thing?
Example: a VM uses VirtIO, backed by a QCOW2 image, sitting on an ext4 filesystem on LVM on top of a RAID controller with a write-back cache you didn’t check. Then someone compares that to NVMe passthrough and calls it science. That’s not science; that’s a layer cake with a latency garnish.
The “nobody mentions” winner is usually one of these:
- VirtIO with the right mode: virtio-scsi with multiple queues, iothreads, a correct cache mode, and direct LUNs instead of image files can get within striking distance of passthrough for many workloads.
- Correct CPU and interrupt topology: pinning vCPUs, aligning queues to cores, and getting interrupts off your noisy neighbors can cut tail latency dramatically without changing the storage device at all.
- Removing the wrong caching layer: the host page cache, guest page cache, and device write cache can interact in ways that look fast until you hit a crash or a flush storm.
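The CPU and interrupt topology point above can be scripted with libvirt. A minimal sketch, assuming a libvirt host and a hypothetical VM named vm01; the host core numbers are illustrative, and the script skips quietly where virsh is unavailable:

```shell
#!/bin/sh
# Sketch: pin vCPUs and an iothread for a hypothetical VM "vm01".
# Core numbers are illustrative; pick cores away from noisy host work.
command -v virsh >/dev/null 2>&1 || { echo "virsh not installed; skipping"; exit 0; }
virsh vcpupin vm01 0 2        # vCPU 0 -> host core 2
virsh vcpupin vm01 1 3        # vCPU 1 -> host core 3
virsh iothreadpin vm01 1 4    # iothread 1 -> host core 4
```

Persist the same mappings in the domain XML (`<cputune>`) so they survive a restart.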
One dry truth: you can buy an NVMe drive that does millions of IOPS and still get 50 ms spikes because your VM’s I/O completion thread is starved. Storage isn’t only a device problem; it’s a scheduling problem wearing a disk mask.
Joke #1: Storage benchmarks are like résumés: technically true, strategically incomplete, and usually missing the part where it falls over under pressure.
When NVMe passthrough wins (and why)
Passthrough wins when your bottleneck is software overhead in the virtualization layer, and the workload is sensitive to it. Typical signs:
- You care about tail latency (p99, p999) more than average latency.
- You do frequent fsync or small synchronous writes (databases, message queues, journaling filesystems under load).
- You have high IOPS with small block sizes, and VirtIO adds measurable CPU cost per I/O.
- You can dedicate a device to a VM without crying about utilization.
Why it’s faster
NVMe is already a queue-based, low-latency protocol. Passthrough lets the guest submit directly to the controller queues. No QEMU thread to schedule. No host filesystem metadata. Fewer context switches. Fewer locks. The guest sees a real NVMe device, so the kernel can apply NVMe-specific optimizations and features.
Where it bites you
Passthrough isn’t a “set it and forget it” feature. It changes operational physics:
- Live migration becomes hard or impossible (in the usual sense). A device bound to a VM doesn’t want to teleport.
- Device resets and errors become guest-visible. If the NVMe controller throws a tantrum, your VM gets front-row seats.
- Sharing a device is non-trivial unless you have SR-IOV or NVMe namespaces designed for it (and even then, management complexity goes up).
- Security/isolation depends on IOMMU. If your IOMMU setup is wrong, you’re not doing passthrough; you’re doing trust fall engineering.
When VirtIO wins (and why)
VirtIO wins when the system is bigger than a single VM and a single disk. That’s most production environments.
Operational features you would miss with passthrough
- Live migration is dramatically easier with virtual disks.
- Snapshots, backups, replication are simpler when storage is a managed artifact (LVM LV, Ceph RBD, ZVOL, etc.).
- Overcommit and pooling become possible. Not always wise, but often economically required.
- Policy enforcement: you can apply throttling, I/O prioritization, and tenant boundaries at the host or storage backend.
Performance is not automatically bad
VirtIO performance is often excellent when you avoid self-inflicted wounds:
- Use raw block devices (LVM LV, NVMe namespace exposed as /dev/… on the host) rather than QCOW2 on a filesystem for high-performance workloads.
- Use virtio-scsi with multiple queues if you need scale and concurrency. virtio-blk can be fine, but virtio-scsi tends to give you better flexibility with queueing and device model behaviors.
- Add iothreads so I/O completion isn’t stuck behind QEMU’s main thread doing something unrelated.
- Pick the right cache mode based on your durability requirements, not vibes.
Joke #2: VirtIO defaults are like default passwords: they exist to get you started, not to keep you safe in production.
Why microbenchmarks lie and production punishes you
Benchmarks are necessary. They’re also frequently malpractice.
Microbenchmarks usually:
- Run on an idle host with warm caches.
- Use a single job, single queue depth, and perfect locality.
- Measure average latency, not tail latency.
- Ignore CPU time consumed per I/O, which is where VirtIO can “pay” for its flexibility.
Production workloads usually:
- Have mixed reads/writes and metadata operations.
- Have bursts that cause queueing.
- Compete for CPU with other guests and host daemons.
- Experience periodic flushes, journal commits, trim/discard, and background maintenance.
The performance winner you’re looking for is often a latency percentile improvement, not a headline IOPS number. Passthrough can improve p99 by reducing software jitter. VirtIO can keep p99 stable by making system behavior manageable—if tuned and if the backend isn’t doing something sneaky.
One quote that’s relevant here, because storage failures are rarely “one thing”:
“Everything fails, all the time.” — Werner Vogels
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I actually run when someone says, “storage is slow.” Each includes: command, what the output means, and the decision you make.
Task 1: Confirm what the guest thinks the disk is
cr0x@server:~$ lsblk -o NAME,MODEL,TRAN,ROTA,SIZE,TYPE,MOUNTPOINT
NAME MODEL TRAN ROTA SIZE TYPE MOUNTPOINT
vda Virtio Block Dev 0 200G disk
├─vda1 0 512M part /boot
└─vda2 0 199.5G part /
Meaning: vda with model “Virtio Block Dev” is VirtIO. If you see nvme0n1 and a real vendor model, that’s likely passthrough.
Decision: If performance expectations assume “NVMe,” align stakeholders: VirtIO isn’t NVMe, and “SSD-like” isn’t a spec.
Task 2: Check if NVMe passthrough is actually NVMe inside the guest
cr0x@server:~$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 S5XXXXXXXXXXXX SAMSUNG MZVLB1T0HALR-00000 1 200.04 GB / 200.04 GB 512 B + 0 B EXF7
Meaning: The guest sees a real NVMe device and model; passthrough is plausible.
Decision: If you need admin features (SMART log, firmware, namespaces), passthrough gives you visibility; VirtIO often hides it.
Task 3: On the host, verify VFIO binding for passthrough
cr0x@server:~$ lspci -nnk | grep -A3 -i nvme
5e:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
Kernel driver in use: vfio-pci
Kernel modules: nvme
Meaning: “Kernel driver in use: vfio-pci” indicates the host handed the device to VFIO, not the host nvme driver.
Decision: If the host is still using nvme, you’re not doing passthrough; fix binding before arguing about performance.
Task 4: Identify VirtIO device model and queues from inside the guest
cr0x@server:~$ grep -H . /sys/block/vda/queue/nr_requests /sys/block/vda/queue/scheduler; ls /sys/block/vda/mq
/sys/block/vda/queue/nr_requests:128
/sys/block/vda/queue/scheduler:[none] mq-deadline kyber bfq
0
Meaning: Each directory under /sys/block/vda/mq is one hardware queue; a lone 0 means one queue and limited parallelism. The scheduler line shows what’s available, with none currently selected.
Decision: If you’re pushing concurrency and see only one queue, consider virtio-scsi multi-queue or virtio-blk multi-queue configuration on the host.
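On the QEMU side, multi-queue virtio-blk is a device property. A command-line sketch; the object and drive IDs and the backing path are illustrative, and the elided flags stand in for the rest of your VM definition:

```shell
# Sketch: 4-queue virtio-blk with a dedicated iothread.
# "io1", "d0", and the backing path are illustrative names.
qemu-system-x86_64 ... \
  -object iothread,id=io1 \
  -drive if=none,id=d0,file=/dev/vg0/vm01-data,format=raw,cache=none,aio=native \
  -device virtio-blk-pci,drive=d0,num-queues=4,iothread=io1
```

With libvirt, the equivalent lives in the disk’s `<driver>` element rather than raw QEMU flags.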
Task 5: Check host-side block device scheduler and queue depth
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
Meaning: For NVMe, none is often correct. If you see BFQ on a high-IOPS NVMe backend, you may be paying extra latency for fairness you didn’t ask for.
Decision: For dedicated NVMe backing a VM, prefer none or mq-deadline depending on workload; test with p99 latency.
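Applying that choice can be scripted as a dry run. A sketch that prints what it would write rather than writing it; run the printed commands as root to apply:

```shell
#!/bin/sh
# Sketch: dry-run of selecting the "none" scheduler on every NVMe block device.
# Prints the commands instead of executing them; apply by hand as root.
found=0
for q in /sys/block/nvme*/queue/scheduler; do
  [ -e "$q" ] || continue       # glob may not match on this machine
  found=1
  echo "echo none > $q"
done
[ "$found" -eq 1 ] || echo "no NVMe devices visible"
```

Re-check after kernel upgrades; scheduler defaults can change between releases.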
Task 6: Measure latency distribution with fio (guest)
cr0x@server:~$ fio --name=randread --filename=/dev/vda --direct=1 --ioengine=libaio --rw=randread --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=30 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
read: IOPS=180k, BW=703MiB/s (737MB/s)(20.6GiB/30001msec)
slat (nsec): min=900, max=150000, avg=4200, stdev=1900
clat (usec): min=70, max=22000, avg=680, stdev=1200
lat (usec): min=75, max=22010, avg=684, stdev=1200
clat percentiles (usec):
| 1.00th=[ 140], 5.00th=[ 180], 10.00th=[ 210], 50.00th=[ 410]
| 90.00th=[ 1500], 95.00th=[ 2600], 99.00th=[ 6000], 99.90th=[16000]
Meaning: Average looks fine (avg=680us), but p99/p99.9 is ugly. That’s the kind of “it’s fine” that burns databases.
Decision: If tail latency is high, investigate CPU scheduling, iothreads, IRQ affinity, host contention, cache modes, and backend writeback behavior. Don’t just chase IOPS.
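If a tool only logs per-I/O latencies, you can get percentiles with sort and an index. A sketch; the input file here is stand-in data (1..1000 µs), so feed it your real samples instead:

```shell
#!/bin/sh
# Sketch: compute p99 from one latency sample (microseconds) per line.
# /tmp/lat_us.txt is stand-in data; replace it with real measurements.
seq 1 1000 > /tmp/lat_us.txt
sort -n /tmp/lat_us.txt | awk '
  { a[NR] = $1 }
  END { i = int(NR * 0.99); if (i < 1) i = 1; print "p99:", a[i] }'
```

On the stand-in data this prints `p99: 990`; the same pipeline works for p99.9 by changing the multiplier.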
Task 7: Determine whether you’re paying for host page cache (host)
cr0x@server:~$ ps -eo pid,comm,%cpu,%mem,args | grep -E 'qemu-system|qemu-kvm' | head -n 1
2143 qemu-system-x86 175.2 8.1 /usr/bin/qemu-system-x86_64 ... -drive file=/var/lib/libvirt/images/vm01.qcow2,if=virtio,cache=writeback ...
Meaning: cache=writeback implies host page cache may be involved. That can be fast, until memory pressure triggers writeback storms.
Decision: For latency-critical workloads, prefer cache=none with O_DIRECT to reduce host cache interference (and ensure your durability model is understood).
Task 8: Check for host memory pressure and dirty writeback risk (host)
cr0x@server:~$ grep -E 'Dirty:|Writeback:|MemAvailable:' /proc/meminfo
MemAvailable: 2148320 kB
Dirty: 842112 kB
Writeback: 90112 kB
Meaning: High Dirty/Writeback relative to available memory suggests the kernel is flushing a lot. That can correlate with latency spikes.
Decision: If Dirty is persistently high, tune VM writeback behavior, reduce host cache usage, or provision more RAM/headroom. Also check if your storage backend is slow to absorb writes.
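That judgment call can be automated. A sketch that flags dirty-page pressure; the heredoc mirrors the sample output above, and on a real host you would read /proc/meminfo directly:

```shell
#!/bin/sh
# Sketch: flag dirty-page pressure from /proc/meminfo-style fields.
# The heredoc is stand-in data mirroring the sample above.
cat > /tmp/meminfo_sample.txt <<'EOF'
MemAvailable: 2148320 kB
Dirty: 842112 kB
Writeback: 90112 kB
EOF
awk '/^MemAvailable:/ { avail = $2 }
     /^Dirty:/        { dirty = $2 }
     END { if (dirty > avail * 0.25)
             print "WARN: Dirty is " int(100 * dirty / avail) "% of MemAvailable" }' \
    /tmp/meminfo_sample.txt
```

The 25% threshold is an assumption, not a kernel constant; tune it against your own incident history.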
Task 9: Identify CPU steal and scheduling issues (guest)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (vm01) 02/04/2026 _x86_64_ (8 CPU)
02:10:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
02:10:02 AM all 12.5 0.0 6.4 4.1 0.0 0.8 9.7 66.5
02:10:02 AM 0 15.0 0.0 8.0 2.0 0.0 0.5 18.0 56.5
Meaning: %steal indicates the hypervisor is taking CPU time away. High steal means your I/O completion path can be delayed even if the disk is fine.
Decision: If steal is non-trivial during latency spikes, fix CPU contention (pinning, reservations, reducing overcommit). Changing storage drivers won’t cure scheduler starvation.
Task 10: Inspect interrupts and confirm distribution (host or guest)
cr0x@server:~$ grep -E 'nvme|virtio|vfio' /proc/interrupts | head
45: 1203941 0 0 0 IR-PCI-MSI 327680-edge nvme0q0
46: 893220 0 0 0 IR-PCI-MSI 327681-edge nvme0q1
47: 901144 0 0 0 IR-PCI-MSI 327682-edge nvme0q2
48: 876990 0 0 0 IR-PCI-MSI 327683-edge nvme0q3
Meaning: Multiple NVMe queues (q0..q3) exist. If all interrupts increment on CPU0 only, you’ve got an affinity problem.
Decision: If interrupts are concentrated, set IRQ affinity or enable irqbalance with a policy that doesn’t sabotage latency-sensitive CPUs.
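Eyeballing /proc/interrupts gets old fast. A sketch that flags queues whose interrupts all land on CPU0, assuming a 4-CPU layout like the sample above; the heredoc is stand-in data:

```shell
#!/bin/sh
# Sketch: flag IRQ lines concentrated on CPU0 (columns 2-5 are per-CPU counts).
# The heredoc mirrors the sample above; on a host, read /proc/interrupts.
cat > /tmp/irq_sample.txt <<'EOF'
45: 1203941 0 0 0 IR-PCI-MSI 327680-edge nvme0q0
46: 893220 0 0 0 IR-PCI-MSI 327681-edge nvme0q1
EOF
awk '{ total = $2 + $3 + $4 + $5
       if (total > 0 && $2 / total > 0.9)
         print $NF, "is concentrated on CPU0" }' /tmp/irq_sample.txt
```

Extend the column sum to match your CPU count; the 90% threshold is an illustrative cutoff.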
Task 11: Spot IO wait vs actual device saturation (guest)
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (vm01) 02/04/2026 _x86_64_ (8 CPU)
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
vda 4500.0 18000.0 0.0 0.0 1.20 4.00 1200.0 9600.0 2.40 8.00 4.10 92.0
Meaning: %util near 100% suggests the device queue is busy. r_await/w_await show time spent waiting in the device queue, not just CPU.
Decision: If %util is high and awaits rise, you’re near device or backend limits; consider more queues, faster backend, or sharding across devices. If %util is low but latency is high, look for scheduling/caching/lock issues.
Task 12: Confirm discard/TRIM behavior (guest)
cr0x@server:~$ lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
vda 0 512B 2G 0
Meaning: Discard granularity and max indicate whether TRIM/discard is supported and at what sizes.
Decision: If discard is enabled unintentionally on a backend that handles it poorly (some thin-provisioned or networked systems), it can cause latency spikes. Either schedule fstrim during off-peak hours or disable the discard mount option.
Task 13: Check QEMU thread model and iothreads presence (host)
cr0x@server:~$ ps -T -p $(pgrep -n qemu-system) -o spid,comm,pcpu | head
 SPID COMMAND         %CPU
 2143 qemu-system-x86 98.4
 2160 IO iothread0    35.2
 2161 CPU 0/KVM       12.1
 2162 CPU 1/KVM       11.7
Meaning: Presence of IO iothread0 suggests you have a dedicated I/O thread; if it’s missing, I/O may share QEMU’s main event loop.
Decision: For high IOPS or latency-sensitive VirtIO, add iothreads and isolate their CPU placement so they don’t compete with noisy work.
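In libvirt, iothreads are declared in the domain XML and referenced per disk. A fragment sketch; the volume path, device names, and counts are illustrative:

```xml
<!-- Sketch: 2 iothreads, with the data disk's I/O handled by iothread 1. -->
<!-- Volume path and target names are illustrative. -->
<iothreads>2</iothreads>
<devices>
  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
    <source dev='/dev/vg0/vm01-data'/>
    <target dev='vdb' bus='virtio'/>
  </disk>
</devices>
```

Pair this with `<cputune>` pinning so the iothread lands on a quiet core rather than wherever the scheduler feels like.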
Task 14: Verify backend is raw block, not QCOW2-on-filesystem (host)
cr0x@server:~$ virsh domblklist vm01
Target Source
------------------------------------------------
vda /var/lib/libvirt/images/vm01.qcow2
Meaning: QCOW2 adds metadata overhead and fragmentation risk; it can be fine for many uses, but not a free lunch at high write rates.
Decision: If you’re chasing consistent low latency, move hot data disks to raw LVs or dedicated block devices, keep QCOW2 for boot disks and convenience where it belongs.
Fast diagnosis playbook
This is the “you have 20 minutes before the incident call gets spicy” sequence. The goal is to locate the bottleneck layer quickly, not to craft a perfect benchmark.
First: decide whether it’s device saturation or scheduling jitter
- Check guest iostat: if %util is high and await climbs with load, suspect backend saturation.
- Check guest CPU steal: if %steal spikes during latency spikes, suspect host CPU contention.
- Check fio percentiles: if average looks fine but p99/p99.9 is awful, suspect queueing and contention, not raw device speed.
Second: identify the I/O path and remove the most suspicious layer
- If you’re on VirtIO with QCOW2: that’s your first suspect. Switch the hot disk to raw block/LV as a test.
- If you’re on VirtIO without iothreads: add iothreads, then retest p99.
- If you’re on passthrough and still slow: stop blaming VirtIO; look at IRQ affinity, guest kernel settings, and NVMe power/latency features.
Third: confirm host backend health
- Host dmesg for NVMe errors, resets, timeouts.
- Host memory pressure (Dirty/Writeback), which can create flush storms.
- Host CPU usage of QEMU threads; if the I/O thread is pegged or starved, you found the villain.
If you do only one thing: measure p99 latency and CPU steal while reproducing the problem. That pair tells you whether you’re fighting storage or scheduling.
Common mistakes: symptoms → root cause → fix
1) Symptom: “VirtIO is slow, so we need passthrough”
Root cause: VirtIO disk backed by QCOW2 on a filesystem with host caching and writeback spikes; plus no iothreads.
Fix: Put hot disks on raw block (LV, RBD, or direct device), use cache=none where appropriate, enable iothreads, and validate queue configuration. Then compare again.
2) Symptom: Great average latency, terrible p99.9 during peak
Root cause: CPU contention: QEMU I/O thread or vCPU gets descheduled; IRQs pile up on one core; host reclaim/writeback storms.
Fix: Pin iothreads and vCPUs, fix IRQ affinity, reduce overcommit, and ensure host memory headroom. Verify with mpstat and /proc/interrupts.
3) Symptom: Passthrough VM is fast until it isn’t; then it falls off a cliff
Root cause: NVMe controller reset, firmware quirk, or PCIe error propagated directly to guest; recovery is guest-visible and brutal.
Fix: Validate firmware, check PCIe AER logs, monitor NVMe error counters, and design for failure (replication, clustering). Passthrough reduces layers; it also reduces cushioning.
4) Symptom: Latency spikes every few minutes like clockwork
Root cause: Periodic flush/journal commit, fstrim/discard jobs, or host writeback thresholds causing bursts.
Fix: Schedule trims, tune writeback parameters, avoid double-caching, and ensure durability settings match your database expectations (don’t “optimize” away flushes unless you like data loss).
5) Symptom: “We added more queue depth and it got worse”
Root cause: Too much queueing increases latency; you’re just building a bigger waiting room. Also possible lock contention or backend saturation amplified by concurrency.
Fix: Tune iodepth to match the device and workload; measure tail latency. For databases, lower iodepth often improves p99 even if peak IOPS drops.
6) Symptom: Live migration works, but performance is inconsistent across hosts
Root cause: Different CPU models, NUMA topology, IRQ balancing policies, or backend NVMe models/firmware across the cluster.
Fix: Standardize host profiles, pin interrupts/threads consistently, and treat “same instance type” as an operational contract, not marketing.
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a fleet of VMs hosting build agents and artifact caches. Builds were “randomly slow,” which is the kind of symptom that makes everyone blame the network first and storage second. Someone noticed the hosts had shiny NVMe drives and concluded the build VMs were “on NVMe.” That phrase survived long enough to become a belief system.
During a release week, build times doubled. The on-call team saw no obvious disk saturation. IOPS were okay. The mean latency was okay. But the long tail was nasty. The build tool spent a lot of time waiting on small file metadata operations and synchronous fsyncs, the sort of workload that punishes jitter.
The wrong assumption was simple: the VMs were using VirtIO disks backed by QCOW2 images on an ext4 filesystem. The host page cache made the system look fast in light load, and then memory pressure kicked in and forced writeback at the worst times. The “NVMe” was real, but it was buried under a stack of overhead and contention.
The fix wasn’t heroic. They moved the hot disks to raw LVs, set cache mode deliberately, and added iothreads. Tail latency improved enough that the build slowdown disappeared. The cultural fix mattered too: they stopped using “on NVMe” as a performance promise and started specifying the actual path.
Mini-story 2: The optimization that backfired
A financial services team had a database VM that was constantly fighting p99 latency. They chose NVMe passthrough because the benchmarks were gorgeous. And for weeks, it was. Lower latency, lower CPU overhead, fewer mysteries.
Then a host experienced a PCIe hiccup. Nothing dramatic: a transient error, a link retrain, the kind of event that happens in real data centers when you pack high-density hardware into racks and pretend physics is optional. The NVMe controller reset. On bare metal, the OS recovered after a pause. In the passthrough VM, the pause looked like storage disappearing mid-sentence.
The database reacted exactly as designed: it panicked about I/O errors, marked devices suspect, and triggered a failover. Failover worked, but it was noisy: reconnect storms, cache warmup, and a cascade of “why is everything slow” tickets. The optimization had moved the failure boundary from “host absorbs it” to “guest experiences it.”
They kept passthrough, but only after adding boring resilience: database replication tuned for fast failover, alerts on NVMe reset counters, and a runbook that assumed the device could vanish. The lesson wasn’t “passthrough is bad.” It was “passthrough is honest.” Honest systems show you the sharp edges you used to ignore.
Mini-story 3: The boring but correct practice that saved the day
A SaaS company ran mixed workloads on a virtualization cluster. They standardized on VirtIO for most VMs because they needed migration and operational flexibility. Nothing exotic. But they did something that sounds dull and is therefore rare: they maintained a storage performance profile per host.
Every host had a baseline fio run on commissioning, capturing not just IOPS but latency percentiles at several iodepth values. They repeated it after firmware updates, kernel upgrades, and hardware swaps. The results were stored alongside host metadata, so “this host is weird” could be proven in minutes.
One day, after a routine maintenance cycle, a subset of hosts started showing intermittent p99 spikes on otherwise normal VMs. The team compared baselines and immediately saw a change: the host writeback behavior and QEMU thread CPU consumption looked different. Not “broken,” just different enough to cause tail latency issues for certain tenants.
They rolled back a specific combination of host kernel settings and adjusted their iothread pinning policy. The incident didn’t become a week-long war room because the team had a boring baseline and treated performance as a monitored property, not an anecdote. The saving move wasn’t clever tuning. It was making “normal” measurable.
Checklists / step-by-step plan
Decision checklist: should this VM get NVMe passthrough?
- Is the workload latency-critical at p99/p99.9? If no, don’t bother.
- Can you dedicate a device (or namespace) to this VM? If no, passthrough will turn into a resource allocation fight.
- Can you live without live migration? If no, VirtIO wins by default.
- Do you have IOMMU and VFIO operational maturity? If no, you’ll learn during an incident. That’s the worst time to learn.
- Do you have a failure-handling design? Passthrough makes device resets and errors guest-visible. Plan for it.
VirtIO “do it right” checklist (practical defaults)
- Back hot disks with raw block where possible (LV, ZVOL, RBD) instead of QCOW2-on-filesystem.
- Use virtio-scsi when you need multi-queue and flexibility; validate queue counts in guest.
- Add iothreads and pin them to stable CPUs. Don’t let the I/O path fight with emulator housekeeping.
- Pick cache mode intentionally:
- cache=none for direct I/O and reduced host cache jitter.
- cache=writeback only when you understand the durability and memory-pressure tradeoff.
- Validate tail latency with fio percentiles, not only average.
Passthrough “do it safely” checklist
- Confirm the device is bound to vfio-pci on the host.
- Verify IOMMU is enabled and you have sane IOMMU groups (no surprise device sharing).
- Pin vCPUs and handle IRQ affinity so NVMe interrupts aren’t concentrated on a single core.
- Monitor NVMe error logs and resets; treat them as predictive signals.
- Design the application/cluster for device-level faults (replication, failover testing).
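The binding check at the top of this list can be scripted. A sketch assuming a PCI address like the one from Task 3; adjust DEV to your device:

```shell
#!/bin/sh
# Sketch: check whether a PCI device is bound to vfio-pci.
# DEV is illustrative; take the address from `lspci -nn` on your host.
DEV=0000:5e:00.0
link="/sys/bus/pci/devices/$DEV/driver"
if [ -e "$link" ]; then
  drv=$(basename "$(readlink -f "$link")")
else
  drv=unknown
fi
if [ "$drv" = "vfio-pci" ]; then
  echo "OK: $DEV is bound to vfio-pci"
else
  echo "NOT passthrough-ready: $DEV driver is $drv"
fi
```

Run it per device in your inventory; "driver is nvme" means the host never let go of the disk.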
FAQ
1) Is NVMe passthrough always faster than VirtIO?
No. It’s often faster for latency-sensitive small I/O because it removes host-side overhead. But VirtIO can match or exceed throughput in some configurations, and it usually wins operationally.
2) What’s the simplest VirtIO change that yields real performance improvements?
Stop putting high-write, high-IOPS disks on QCOW2 images sitting on a general-purpose filesystem. Move hot disks to raw block and add iothreads.
3) virtio-blk or virtio-scsi?
virtio-scsi is often the better choice when you want multi-queue scaling and a mature path for complex storage behavior. virtio-blk can be simpler and fast, but check queueing limits and your hypervisor’s configuration options.
4) Does iodepth “as high as possible” make NVMe faster?
It makes queues deeper. That’s not the same thing. High iodepth can increase throughput but destroy tail latency. Tune iodepth based on p99, not ego.
5) Why is p99 latency bad even when %util is low?
Because your bottleneck is likely scheduling or contention: CPU steal, IRQ affinity, a busy QEMU main thread, host memory reclaim, or lock contention in the I/O path. Low device utilization does not mean low end-to-end latency.
6) Can I live migrate a VM with NVMe passthrough?
Not in the normal “move the running VM to another host” way. There are specialized approaches, but if live migration is a hard requirement, VirtIO is the practical answer.
7) Is passthrough safe in multi-tenant environments?
It can be, if IOMMU isolation is correct and you operationally control device assignment. But it reduces the host’s ability to enforce policy, and it increases blast radius when a device misbehaves.
8) What about using NVMe-oF or networked block devices instead?
Networked storage can be excellent, but it introduces network jitter and different failure modes. It also changes who owns the queueing problem. Measure p99 latency end-to-end and decide based on your workload’s sensitivity.
9) If I only care about throughput (MB/s), what should I pick?
VirtIO with a well-configured backend often delivers plenty of throughput, sometimes limited more by CPU than disk. If you’re doing big sequential I/O, the difference between passthrough and VirtIO may be smaller than you expect.
10) What’s the “real winner” again?
The real winner is removing accidental complexity: unnecessary image layers, bad cache modes, missing iothreads, and misaligned CPU/IRQ topology. Passthrough is one way to remove layers. Tuning VirtIO is another.
Practical next steps
- Pick a representative fio profile for your workload (mix reads/writes, include fsync where relevant) and record p50/p95/p99/p99.9 latency.
- Map your actual I/O path: guest device type, host cache mode, backing store type (raw vs QCOW2), and backend device scheduler.
- Fix the easy wins first: raw block for hot disks, iothreads, correct cache mode, and CPU/IRQ placement. Re-measure.
- Only then consider NVMe passthrough for the handful of VMs that truly need it—and only if you can accept the operational tradeoffs.
- Institutionalize baselines: store host and VM storage performance profiles so you can detect regressions before your customers do.
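For the first step, a mixed-workload fio sketch; the target file, size, and runtime are illustrative, it exits quietly where fio or the scratch directory is missing, and it should point at scratch space, never live data:

```shell
#!/bin/sh
# Sketch: mixed 70/30 read/write 4k profile with periodic fsync.
# Paths and sizes are illustrative; run against a scratch file only.
command -v fio >/dev/null 2>&1 || { echo "fio not installed; skipping"; exit 0; }
[ -d /mnt/scratch ] || { echo "scratch dir missing; skipping"; exit 0; }
fio --name=mixed --filename=/mnt/scratch/fio-test --size=4G \
    --direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 --bs=4k \
    --iodepth=16 --numjobs=4 --fsync=32 \
    --time_based --runtime=60 --group_reporting --lat_percentiles=1
```

Record p50/p95/p99/p99.9 from the output alongside host metadata, and rerun the identical profile after every kernel, firmware, or topology change.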
If you want a single opinionated directive: treat storage virtualization as a latency budget problem, not a driver-selection problem. The budget gets spent on layers. Spend it intentionally.