iommu=pt: The Hidden Performance Mode for Linux Virtualization (When to Use It)

You buy fast CPUs, fast NICs, fast NVMe, and then your “virtualization platform” runs like it’s carrying a piano upstairs.
Latency spikes. Packet rates wobble. Storage IOPS look fine in a benchmark and weird in production.
Someone says “maybe it’s the IOMMU” and the room gets quiet, because that sentence is both true and expensive.

iommu=pt is one of those Linux kernel switches that can feel like free performance—until it isn’t.
It can reduce overhead in the DMA mapping path, especially under heavy I/O virtualization workloads.
It can also silently remove protections you were relying on, or make your debugging story much worse.
This is the field guide: when to flip it, when not to, and how to prove you’re fixing the right problem.

What iommu=pt actually does (not what the internet thinks)

The IOMMU (Input-Output Memory Management Unit) is the MMU for devices. It translates device-visible addresses (IOVAs)
into host physical addresses, and it enforces isolation so a device can’t DMA into whatever it feels like.
In virtualization, it’s a core building block for safe PCI passthrough (VFIO), SR-IOV isolation, and sometimes
even for “just virtio” depending on your stack.

Linux supports multiple IOMMU operational modes. The details vary by architecture and driver (Intel VT-d, AMD-Vi,
arm-smmu, etc.), but the general pattern is:

  • Translated mode: devices DMA through IOMMU page tables; the kernel sets up mappings and can do
    per-device isolation.
  • Pass-through mode: devices use “identity” mappings (or something equivalent) so DMA translations are
    effectively 1:1 and mapping overhead is reduced.
  • Disabled: no IOMMU translation; isolation features are gone, and some advanced virtualization features
    stop working.

iommu=pt asks Linux to default to IOMMU pass-through mappings for devices that are not assigned to
guests (and sometimes even before assignment, depending on the driver and device lifecycle). It does not mean
“no IOMMU.” It means “keep the IOMMU on, but avoid translation overhead where isolation isn’t needed.”

The subtlety: when you later bind a device to VFIO and assign it to a VM, the kernel still uses translated mode for that
device (because you need isolation between guest and host). The win is usually about everything else: host devices,
the default DMA mapping behavior, and the amount of mapping churn in the IOMMU when you run a busy box with lots of I/O.
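If you want to see what the kernel actually decided per device, recent kernels expose a type attribute for each IOMMU group. A minimal sketch, assuming that attribute exists on your kernel (availability varies by version); the base path is a parameter so the logic can be pointed at a test tree:

```shell
#!/bin/sh
# List the domain type of every IOMMU group.
# On recent kernels, /sys/kernel/iommu_groups/<N>/type reports
# "identity" (pass-through), "DMA", "DMA-FQ", etc.
# The base path is a parameter so this can be exercised offline.
list_iommu_domain_types() {
    base="${1:-/sys/kernel/iommu_groups}"
    for grp in "$base"/*/; do
        [ -e "$grp/type" ] || continue
        printf '%s: %s\n' "$(basename "$grp")" "$(cat "$grp/type")"
    done
}
# Typical use on a host: list_iommu_domain_types
```

On an iommu=pt host you would expect host-owned groups to report identity; groups owned by VFIO may report differently depending on kernel version, so treat this as a hint, not a verdict.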

This is why iommu=pt can help even when you’re not doing full PCI passthrough. Some workloads create massive
DMA map/unmap activity (networking stacks, storage, userspace packet processing, certain GPU/accelerator drivers).
If those paths hit the IOMMU in translated mode, you can pay in CPU cycles, cache pressure, IOTLB invalidations, and
occasionally “why is ksoftirqd eating my lunch.”

A practical way to think about it:
iommu=pt trades some default isolation behavior for reduced translation and bookkeeping overhead.
It’s a performance dial with a security and debugging bill attached.

Two terms people mix up: “IOMMU enabled” vs “DMA remapping enabled”

Many platforms show IOMMU-related firmware toggles with vague labels: “IOMMU,” “VT-d,” “DMA remapping,” “SVM,” “AMD-Vi.”
You can often have virtualization extensions on while DMA remapping is off. Guests still run, but device isolation and
safe passthrough get messy. On Linux, intel_iommu=on forces DMA remapping on Intel platforms (AMD-Vi is typically
enabled by default when firmware supports it), while iommu=off disables IOMMU handling in-kernel.

Keep this straight because the failure modes look similar: performance changes, VFIO stops working, or kernel logs fill
with DMAR/IOMMU complaints. But the fixes differ.

Exactly one quote (paraphrased, because accuracy matters)

Paraphrased idea (attributed to Gene Kim): “Improvement work must be tied to outcomes, and you need feedback loops fast enough to steer.”

Why the IOMMU exists: the short history with sharp edges

If you treat IOMMUs as “just performance tax,” you’ll end up with a fast system that’s also a loaded foot-gun.
A little historical context helps you understand why the tax exists and when it’s reasonable to avoid it.

Interesting facts and context (8 points)

  1. IOMMUs predate modern virtualization hype. They originally mattered for systems that needed device DMA into
    memory beyond what devices could address directly, and for remapping in complex bus topologies.
  2. Intel VT-d and AMD-Vi formalized DMA remapping as a platform feature. This was critical for safe device
    assignment to VMs: without DMA isolation, “passthrough” is basically “please scribble on my hypervisor.”
  3. Early PCI passthrough was fragile because interrupts and DMA isolation were fragile. MSI/MSI-X, remapping,
    and proper IOMMU group isolation were all part of making it boring.
  4. The Linux DMA API abstracted mapping, but the IOMMU backend can make it expensive. Drivers call mapping APIs
    and expect them to be “cheap enough.” Under heavy load, “cheap enough” becomes an argument.
  5. The IOTLB exists because devices cache translations too. When mappings change, you may need invalidations.
    Invalidation storms are real and they show up as latency spikes.
  6. SR-IOV changed the blast radius. Now you can have many VFs (virtual functions) doing DMA, and each VF might
    have its own mapping needs. Great for density. Also great for finding the slow path.
  7. High-rate userspace networking (DPDK, XDP, AF_XDP) put pressure on mapping strategy. Pinning memory and using
    hugepages is as much about avoiding mapping churn as it is about TLB misses.
  8. Security research made DMA attacks mainstream. “Malicious peripheral” is not just a spy-movie plot;
    it’s a credible threat model in multi-tenant environments and in places where physical access happens.

Joke #1 (short, relevant): The IOMMU is like a bouncer for your RAM. It’s expensive, but it stops the NIC from wandering into VIP areas.

Who should use iommu=pt (and who should keep their hands off)

Here’s the opinionated version: use iommu=pt when you can measure IOMMU overhead and you control the blast radius.
Don’t use it because you saw a forum comment that ended with “fixed it for me.”

Good candidates

  • Dedicated virtualization hosts with VFIO device passthrough where most devices remain host-owned and you want to cut default
    DMA translation overhead for host devices.
  • High packet rate networking nodes doing SR-IOV, DPDK, AF_XDP, or heavy virtio-net with vhost acceleration, where you’ve
    verified time is spent in mapping/unmapping or IOTLB invalidations.
  • Storage-heavy hypervisors doing lots of NVMe completions, queues, and interrupt work, where CPU cycles spent in DMA mapping
    are measurable and hurting tail latency.
  • Single-tenant environments where the threat model is operational accidents and not malicious DMA.

Bad candidates

  • Multi-tenant or hostile-tenant environments where DMA isolation is part of your security story. You can still use IOMMU and
    still do VFIO safely, but “default pass-through” is a policy choice. Make it deliberately, not casually.
  • Hosts that rely on strict DMA isolation even for host devices (think: untrusted peripherals, hotplug, edge locations, lab
    environments with mystery hardware).
  • Systems where debugging is already hard. If you’re living in a swamp of intermittent I/O issues, reducing isolation can turn
    “intermittent” into “unreproducible.”

If you’re thinking “but we’re not multi-tenant,” remember: you are always multi-stakeholder. Security teams, compliance, and
incident response will be in your future. Leave them a system that makes sense.

What you gain vs. what you risk

The upside: where the performance comes from

The IOMMU adds work in a few ways:

  • Mapping overhead: The kernel (or a driver) needs to create IOVA mappings for DMA. Under load, that bookkeeping costs CPU.
  • IOTLB invalidations: When mappings change, devices and the IOMMU need invalidations; those can serialize and spike latency.
  • Smaller effective DMA windows: Bad defaults can cause fragmentation in IOVA space or force bounce buffering.
  • Interrupt remapping interactions: Some platforms tie interrupt remapping to IOMMU enablement; configuration choices can
    affect interrupt delivery overhead and behavior.

iommu=pt reduces mapping work for “normal” host devices by preferring identity-like mappings. That can:

  • Lower CPU overhead in high I/O paths.
  • Reduce tail latency caused by invalidation churn.
  • Make some workloads less sensitive to small mapping changes.

The downside: what you might break (or weaken)

  • Security posture changes: You may lose default DMA protection for some devices. If a device misbehaves (bug or malicious),
    it can target host memory more easily.
  • Debuggability decreases: Strict translation can catch certain classes of driver/device bugs (bad DMA addresses).
    Identity mappings can let them scribble silently.
  • Assumptions in tooling: Some environments assume “IOMMU on” implies “host DMA protected.” With pass-through defaults,
    that assumption becomes false unless you enforce per-device translation where needed.
  • Corner-case hardware quirks: Some chipsets and devices behave differently with pass-through defaults, especially with
    ATS/PRI features, interrupt remapping, or odd firmware behavior.

Joke #2 (short, relevant): Changing IOMMU modes in production is like reorganizing a data center during a fire drill—educational, but you’ll learn new words.

Fast diagnosis playbook

You don’t flip iommu=pt because you’re bored. You flip it because you have a bottleneck and you can explain it.
This is the order of operations I use when a virtualization host is “fast except when it isn’t.”

First: prove whether the problem is CPU time, latency, or queueing

  • Look for CPU saturation in softirq, kvm, vhost, or iommu-related work.
  • Look for tail latency spikes correlated with network/storage interrupts.
  • Look for queue depth growth: NIC rings, NVMe queues, blk-mq, vhost queues.

Second: confirm IOMMU is in the path and how it’s configured

  • Is the IOMMU enabled in firmware and kernel?
  • Are you using translated mode globally, or pass-through defaults?
  • Are the critical devices (VFIO-assigned) still isolated properly?

Third: identify the hot path (mapping/unmapping vs. something else)

  • Perf sample for iommu_map/iommu_unmap and DMA API functions.
  • Check for IOTLB invalidation costs.
  • Check for bounce buffering / swiotlb use, which often indicates an addressing/mapping constraint.

Decision point

If you can show meaningful CPU time in DMA mapping or IOMMU invalidations on the host—and the security model allows it—then
iommu=pt is a reasonable experiment. If you can’t show it, you’re probably about to optimize a placebo.
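To turn “meaningful CPU time” into a number you can put in a change request, sum the mapping-related symbols out of a perf-report-style listing. A rough sketch only: the symbol patterns are my guesses at what matters, and any threshold you pick is a judgment call, not a kernel-blessed value:

```shell
#!/bin/sh
# Sum the CPU% attributed to IOMMU/DMA-mapping symbols in a
# perf-report-style listing read on stdin (lines like
# "  5.21%  [kernel]  [k] iommu_map"). The symbol regex is a
# heuristic starting point, not an exhaustive list.
iommu_cpu_share() {
    awk '
    /iommu_(map|unmap)|dma_map|dma_unmap|intel_iommu|amd_iommu|iova/ {
        gsub(/%/, "", $1); total += $1
    }
    END { printf "IOMMU/DMA mapping CPU: %.2f%%\n", total }'
}
# Typical use: sudo perf report --stdio | iommu_cpu_share
```

If this number is in the noise during your real workload, stop here: iommu=pt is not your fix.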

Practical tasks: commands, outputs, and decisions (12+)

These are not “run this because it’s fun.” Each task includes what to look for and what decision it drives.
Run them on the host unless noted otherwise.

Task 1: Confirm the kernel command line (what you actually booted)

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0 root=UUID=... ro quiet intel_iommu=on iommu=pt

Meaning: This is ground truth. If iommu=pt isn’t here, you’re not testing it.
Decision: If parameters are missing or conflicting (like iommu=off), fix bootloader config before doing anything else.

Task 2: Check dmesg for IOMMU mode and DMAR/AMD-Vi initialization

cr0x@server:~$ dmesg -T | egrep -i 'DMAR|IOMMU|AMD-Vi|iommu=' | head -n 30
[Mon Feb  3 10:11:21 2026] DMAR: IOMMU enabled
[Mon Feb  3 10:11:21 2026] DMAR: Default domain type: Passthrough
[Mon Feb  3 10:11:21 2026] pci 0000:00:00.0: DMAR: Skip IOMMU disabling for graphics

Meaning: “Default domain type: Passthrough” is the smoking gun that iommu=pt took effect (on Intel VT-d paths).
Decision: If you don’t see it, you may be on a different driver path, firmware might be disabling features, or the parameter is ignored.

Task 3: Verify IOMMU groups exist (and see what’s isolated)

cr0x@server:~$ find /sys/kernel/iommu_groups/ -maxdepth 2 -type l | head
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/7/devices/0000:3b:00.0

Meaning: If groups exist, the IOMMU is at least organizing isolation units.
Decision: No groups usually means IOMMU isn’t enabled or not supported; VFIO passthrough safety will be compromised.

Task 4: Inspect a specific device’s group (is it shareable or stuck with friends?)

cr0x@server:~$ lspci -nn | egrep -i 'Ethernet|Non-Volatile|VGA' | head -n 5
3b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572]
5e:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
cr0x@server:~$ readlink -f /sys/bus/pci/devices/0000:3b:00.0/iommu_group
/sys/kernel/iommu_groups/7
cr0x@server:~$ ls -1 /sys/kernel/iommu_groups/7/devices/
0000:3b:00.0
0000:3b:00.1

Meaning: The NIC has two functions in the same group; that’s common. Passthrough requires the whole group to be controlled.
Decision: If the group contains unrelated devices, you may need different PCIe slotting, BIOS ACS settings, or accept “no passthrough.”
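The “whole group or nothing” rule is worth automating rather than eyeballing. A small sketch; the sysfs root is a parameter so the logic can be exercised against a fake tree (the real path is /sys/bus/pci/devices):

```shell
#!/bin/sh
# Show every device sharing an IOMMU group with a given PCI address,
# and flag groups with more than one device (passthrough takes the
# whole group). The sysfs root is a parameter for offline testing.
group_members() {
    dev="$1"
    pci_root="${2:-/sys/bus/pci/devices}"
    grp=$(readlink -f "$pci_root/$dev/iommu_group") || return 1
    members=$(ls "$grp/devices")
    count=$(echo "$members" | wc -l)
    echo "$members"
    if [ "$count" -gt 1 ]; then
        echo "WARNING: $count devices in this group; passthrough takes them all"
    fi
    return 0
}
# Typical use: group_members 0000:3b:00.0
```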

Task 5: Check whether VFIO is in use and which devices are bound

cr0x@server:~$ lsmod | egrep 'vfio|kvm|vhost' | head -n 20
vfio_pci               73728  0
vfio_pci_core          94208  1 vfio_pci
vfio_iommu_type1       45056  0
vfio                   53248  2 vfio_pci_core,vfio_iommu_type1
kvm_intel             376832  0
kvm                  1097728  1 kvm_intel
vhost_net              32768  0
cr0x@server:~$ lspci -nnk -s 3b:00.0
3b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572]
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710 [8086:0000]
	Kernel driver in use: vfio-pci
	Kernel modules: i40e

Meaning: The NIC is bound to vfio-pci; it should be using translated isolation for the guest even if the default domain is passthrough.
Decision: If a device intended for the host is accidentally bound to vfio, expect missing network/storage on the host and a very tense meeting.

Task 6: Check a group’s domain type (generic interface on recent kernels)

cr0x@server:~$ cat /sys/kernel/iommu_groups/1/type
identity

Meaning: Recent kernels expose a type attribute per IOMMU group; “identity” means that group’s devices get pass-through mappings.
Decision: If host-owned groups report DMA (or DMA-FQ) instead, you’re not in the mode you think you are; validate kernel parameters and distro defaults.

Task 7: Look for SWIOTLB usage (a classic hidden performance sink)

cr0x@server:~$ dmesg -T | egrep -i 'swiotlb|bounce' | head -n 20
[Mon Feb  3 10:11:20 2026] software IO TLB: mapped [mem 0x000000007a000000-0x000000007e000000] (64MB)

Meaning: SWIOTLB is a bounce buffer mechanism. It can appear due to DMA addressing limits, IOMMU configuration, or platform quirks.
Decision: If you see heavy SWIOTLB use during load (often visible in perf), consider memory placement, BIOS “above 4G decoding,” or IOMMU tuning.

Task 8: See if interrupt remapping is enabled (stability and security implications)

cr0x@server:~$ dmesg -T | egrep -i 'Interrupt remapping|IR:' | head -n 20
[Mon Feb  3 10:11:21 2026] DMAR-IR: Enabled IRQ remapping in x2apic mode

Meaning: IRQ remapping is on. Good for isolation and sometimes required for stable passthrough on certain systems.
Decision: If it’s off unexpectedly, check firmware settings and kernel parameters; some platforms disable it under odd configurations.

Task 9: Profile for IOMMU/DMA mapping hot spots (proof before policy)

cr0x@server:~$ sudo perf top -g --call-graph fp
Samples: 18K of event 'cycles', Event count (approx.): 12200542319
  5.21%  [kernel]  [k] iommu_map
  4.87%  [kernel]  [k] iommu_unmap
  3.94%  [kernel]  [k] dma_map_page_attrs
  3.12%  [kernel]  [k] intel_iommu_unmap
  2.77%  [kernel]  [k] handle_edge_irq

Meaning: This is the evidence you need: significant CPU time inside mapping/unmapping.
Decision: If these functions are near the top during your real workload, iommu=pt is a justified experiment. If not, look elsewhere.

Task 10: Check host CPU time in softirq (networking pain often shows here)

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0 (server) 	02/03/2026 	_x86_64_	(64 CPU)

12:14:01 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %idle
12:14:02 AM  all   12.5  0.0  18.2   0.4  0.8  22.9   0.0  45.2
12:14:02 AM    7    8.1  0.0  15.0   0.0  1.2  46.3   0.0  29.4

Meaning: High %soft often indicates networking or block completions and interrupt-driven work.
Decision: If softirq is high and perf shows IOMMU functions, the mapping path is a candidate bottleneck; if softirq is high but IOMMU isn’t, focus on NIC/RPS/IRQ affinity.

Task 11: Validate hugepages and pinned memory behavior (especially for DPDK/VFIO)

cr0x@server:~$ grep -E 'HugePages|Hugepagesize' /proc/meminfo
HugePages_Total:     4096
HugePages_Free:      3900
HugePages_Rsvd:       120
Hugepagesize:       2048 kB

Meaning: Hugepages are available; this reduces TLB pressure and can reduce mapping churn in userspace I/O frameworks.
Decision: If you rely on hugepages and they’re exhausted under load, you’ll see performance collapse that looks like “IOMMU is slow.”
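A tiny check catches “the hugepage pool quietly ran out” before it gets misdiagnosed as IOMMU overhead. A sketch only: it reads /proc/meminfo-format text on stdin, and the 10% floor is an arbitrary example threshold, not a kernel default:

```shell
#!/bin/sh
# Warn when free (unreserved) hugepages drop below a floor.
# Reads /proc/meminfo-format text on stdin; the 10% floor is an
# arbitrary example, tune it to your workload.
hugepage_headroom() {
    awk '
    /^HugePages_Total:/ { total = $2 }
    /^HugePages_Free:/  { free  = $2 }
    /^HugePages_Rsvd:/  { rsvd  = $2 }
    END {
        avail = free - rsvd
        printf "available hugepages: %d of %d\n", avail, total
        if (total > 0 && avail < total / 10)
            print "WARNING: hugepage pool nearly exhausted"
    }'
}
# Typical use: hugepage_headroom < /proc/meminfo
```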

Task 12: Check QEMU/KVM process CPU and thread behavior (don’t blame IOMMU for a pinned vCPU)

cr0x@server:~$ ps -eo pid,comm,pcpu,psr,cls,rtprio,pri,ni,stat --sort=-pcpu | head
    PID COMMAND          %CPU PSR CLS RTPRIO PRI  NI STAT
  23144 qemu-system-x86 189.7  12  TS      -  19   0 Rl
  1821 ksoftirqd/7       45.2   7  TS      -  19   0 R
  1402 vhost-23144       32.1   7  TS      -  19   0 R

Meaning: A VM process and vhost thread are hot, plus ksoftirqd on the same CPU.
Decision: Before touching IOMMU, fix CPU affinity and IRQ distribution. If you keep everything on CPU 7, you’ve built a tiny traffic jam.

Task 13: Inspect IRQ distribution for a NIC (interrupt storms masquerade as “IOMMU overhead”)

cr0x@server:~$ egrep -i 'eth0|i40e|x710' /proc/interrupts | head -n 10
 169:  12499231   38123   11212   11098   IR-PCI-MSI 524288-edge      eth0-TxRx-0
 170:  12588110   39001   11992   10902   IR-PCI-MSI 524289-edge      eth0-TxRx-1

Meaning: Interrupts are present and distributed across CPUs (first columns). If all counts sit on one CPU, you’ve found a likely cause of latency.
Decision: If distribution is bad, tune IRQ affinity and RPS/XPS; iommu=pt won’t save a single-core interrupt pileup.
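To quantify “all counts sit on one CPU” instead of eyeballing columns, compute each IRQ line’s worst-CPU share. A sketch with assumptions: it expects /proc/interrupts-format lines with the header already stripped, and you pass the CPU count yourself:

```shell
#!/bin/sh
# For each IRQ line on stdin (/proc/interrupts format, header
# stripped), report what share of its interrupts landed on the
# busiest CPU. The CPU count is passed as an argument because the
# header line is assumed to be removed by the caller.
irq_skew() {
    ncpus="$1"
    awk -v n="$ncpus" '
    {
        max = 0; sum = 0
        for (i = 2; i <= n + 1; i++) {
            sum += $i
            if ($i > max) max = $i
        }
        if (sum > 0)
            printf "%s %s -> busiest CPU has %.1f%% of interrupts\n", $1, $NF, 100 * max / sum
    }'
}
# Typical use: tail -n +2 /proc/interrupts | grep eth0 | irq_skew 4
```

Anything near 100% on a busy queue is your single-core traffic jam; fix affinity before touching IOMMU knobs.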

Task 14: Check for IOMMU faults (when performance issues are actually errors)

cr0x@server:~$ dmesg -T | egrep -i 'IOMMU fault|DMAR: DRHD|IO_PAGE_FAULT|AMD-Vi: Event logged' | tail -n 10
[Mon Feb  3 12:19:04 2026] DMAR: [DMA Read] Request device [3b:00.0] fault addr 0x7f3a4000 [fault reason 0x05] PTE Read access is not set

Meaning: A real DMA fault. This is not “tuning time,” this is “containment and correctness time.”
Decision: Stop. Validate device assignment, driver stability, firmware, and whether the guest/host is misconfigured. iommu=pt might hide this, not fix it.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran a fleet of KVM hypervisors with a mix of virtio devices and a handful of VFIO passthrough NICs for “premium” tenants.
Security signed off on “IOMMU enabled” as part of their hardening checklist. The infrastructure team heard the same phrase and translated it as
“we can safely do passthrough; the IOMMU is on.”

A new kernel rollout included iommu=pt. It was added as a performance tweak after a lab test showed reduced CPU usage in network-heavy workloads.
The change note said “keeps IOMMU on, improves performance.” Everyone nodded. Nobody asked: “which devices still get translated domains by default?”

Weeks later they investigated a weird host memory corruption event. Not frequent. Not reproducible. The kind of issue that forces you to drink cold coffee.
After enough log archaeology, the team found a pattern: the affected hosts had a particular out-of-tree driver for a monitoring PCIe card.
That card DMA’d to the wrong address under certain reset conditions. In strict translation mode it would have faulted loudly. In pass-through defaults,
it scribbled quietly.

The fix wasn’t “never use iommu=pt.” The fix was to stop treating “IOMMU enabled” as a boolean security control.
They replaced the problematic card/driver combination, and they tightened change control around kernel parameters that affect DMA isolation.
They also added a post-boot check that verifies default domain and alerts when it changes unexpectedly.

Mini-story 2: The optimization that backfired

An enterprise ran low-latency trading infrastructure on dedicated hosts. They weren’t multi-tenant, but they were extremely sensitive to jitter.
A performance engineer saw mapping/unmapping show up in a profile and decided to “remove IOMMU overhead entirely.” They set iommu=off across the fleet.
The initial benchmark looked great. Management got the graph they wanted.

Then they hit an upgrade window where NIC firmware updates were applied. One host came back with a slightly different PCIe topology and an odd quirk in interrupt behavior.
Without interrupt remapping (often tied to IOMMU enablement), they got intermittent interrupt storms and occasional missed interrupts under load.
The symptom was simple: sporadic packet loss at the host boundary.

They tried to tune rings, coalescing, IRQ affinity, all the usual levers. Sometimes it helped, sometimes it didn’t.
The real problem was that the platform had crossed into a less-tested configuration for that NIC/firmware combination.
It wasn’t a “Linux bug,” it was a “you disabled a feature the platform expects you to have” bug.

They walked it back to intel_iommu=on iommu=pt. Performance stayed good, and stability returned.
The lesson was not “IOMMU is always good.” The lesson was “the platform is a system.” Remove one part and the rest might sulk.

Mini-story 3: The boring but correct practice that saved the day

A cloud team ran a private virtualization platform for internal workloads: CI runners, databases, and a few GPU-backed ML boxes.
They had a strict “one change per deploy” policy for kernel parameters. It annoyed developers. It also saved them.

When they experimented with iommu=pt, they didn’t start by rolling it out. They started by instrumenting:
perf profiles under representative load, baseline latency histograms, a count of IOMMU faults from dmesg, and
a simple daily export of /proc/cmdline + IOMMU default domain.

The test showed a small but consistent drop in host CPU for their network-heavy CI workload. Great. They rolled it out to one cluster.
Two days later, one GPU host started throwing DMA faults during guest resets. The team could correlate it immediately:
the issue existed before, but strict mode would fault earlier; pass-through defaults made the failure rarer and nastier.

Because they had staged rollout and telemetry, rollback was clean. No drama. No “we think it’s the kernel” handwaving.
They isolated the issue to a specific GPU firmware+driver pairing, fixed that first, then reintroduced iommu=pt with confidence.
Boring practices don’t get you applause. They do get you sleep.

Common mistakes (symptoms → root cause → fix)

1) Symptom: VFIO passthrough fails after enabling iommu=pt

Root cause: Confusing iommu=pt with iommu=off, or firmware doesn’t actually enable DMA remapping.
Sometimes the system boots without proper IOMMU support, so VFIO refuses to attach.

Fix: Verify with dmesg that IOMMU is enabled, check groups exist, ensure intel_iommu=on is set on Intel (AMD-Vi is normally enabled by default when firmware supports it), and confirm firmware VT-d/AMD-Vi is enabled.

2) Symptom: Performance improved, but you now get rare host crashes or memory corruption

Root cause: A buggy device/driver DMA’d incorrectly; strict translation would have faulted, pass-through defaults let it land.

Fix: Treat this as correctness. Update firmware/drivers, remove problematic devices, and consider keeping translated mode for certain device classes (or avoid iommu=pt on hosts with questionable peripherals).

3) Symptom: Latency spikes remain even with iommu=pt

Root cause: The bottleneck wasn’t IOMMU translation; it was IRQ affinity, softirq backlog, vhost thread placement, or guest-side driver issues.

Fix: Use /proc/interrupts, mpstat, and perf to find the real hot path. Fix CPU pinning and IRQ distribution first.

4) Symptom: SR-IOV VFs work, but throughput is inconsistent across VMs

Root cause: IOMMU group constraints and ACS limitations lead to awkward placement, or you have mixed interrupt modes across VFs.
Sometimes a subset of VFs ends up with different NUMA locality and you blame IOMMU.

Fix: Check groups, NUMA node for PCI devices, IRQ distribution, and ensure VM vCPU pinning aligns with device locality.

5) Symptom: Dmesg shows DMA faults after a kernel parameter change

Root cause: A device is now using translated mode with stricter enforcement (or the reverse), surfacing a real bug.

Fix: Do not “tune it away.” Identify the device, driver, and workload triggering the fault. Validate VFIO configuration and update firmware.

Checklists / step-by-step plan

Step-by-step: deciding whether to try iommu=pt

  1. State your goal. Example: “Reduce host CPU by 10% at 2Mpps per host while keeping VFIO passthrough stable.”
  2. Capture a baseline. Perf sample, softirq %, p99 latency for key paths (storage, network), and dmesg fault counts.
  3. Prove IOMMU overhead exists. Look for mapping/unmapping and IOTLB invalidations in perf during real workload.
  4. Confirm threat model. Single-tenant? Controlled hardware? If not, get security buy-in and document the trade.
  5. Stage rollout. One host, one pool, one cluster. Make rollback easy.
  6. Validate isolation for VFIO devices. Ensure assigned devices still use VFIO with proper groups.
  7. Monitor after change. Watch for DMA faults, host instability, and tail latency.

Implementation checklist: making the change cleanly

  • Set firmware VT-d/AMD-Vi and (if relevant) “Above 4G decoding” appropriately.
  • Set kernel params: intel_iommu=on iommu=pt on Intel; on AMD, iommu=pt alone usually suffices because AMD-Vi defaults to enabled.
  • Reboot, then verify with /proc/cmdline and dmesg.
  • Verify IOMMU groups exist and VFIO devices are in correct groups.
  • Re-run your baseline measurements and compare.
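The “reboot, then verify” step is worth scripting so it runs on every boot, not just the one you were watching. A sketch for an Intel host (adjust the parameter list for AMD); the cmdline path is a parameter so the check is testable, and the parameter list itself is the thing you should review:

```shell
#!/bin/sh
# Post-boot sanity check: confirm the booted cmdline contains the
# parameters you intended. The file path is a parameter for testing;
# the parameter list below assumes an Intel host.
check_cmdline() {
    f="${1:-/proc/cmdline}"
    ok=0
    for param in intel_iommu=on iommu=pt; do
        if ! grep -qw "$param" "$f"; then
            echo "MISSING: $param"
            ok=1
        fi
    done
    [ "$ok" -eq 0 ] && echo "cmdline OK"
    return "$ok"
}
# Typical use (e.g. from cron or a boot unit): check_cmdline || alert
```

Wire the nonzero exit into whatever alerting you already have; this is exactly the boring daily export that made rollback painless in mini-story 3.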

Rollback checklist

  • Remove iommu=pt and reboot.
  • Confirm default domain is not pt (translated or distro default).
  • Re-check VFIO device binding and guest functionality.
  • Compare fault logs and latency/CPU regression to baseline.

FAQ

1) Does iommu=pt disable the IOMMU?

No. It keeps the IOMMU enabled but asks for pass-through/identity-style mappings by default for non-assigned devices.
VFIO-assigned devices still need translated isolation.

2) When is iommu=pt most likely to help?

When your host spends meaningful CPU time in DMA map/unmap or IOMMU invalidation paths under real load—common in high packet rate networking,
heavy storage completion rates, or userspace I/O frameworks.

3) Is iommu=pt safe for multi-tenant environments?

“Safe” depends on policy and hardware trust. If your security model depends on DMA isolation for host-owned devices, pass-through defaults are a risk.
In multi-tenant setups, make this a deliberate, reviewed decision with compensating controls.

4) How is iommu=pt different from intel_iommu=on?

intel_iommu=on forces Intel VT-d IOMMU enablement. iommu=pt sets the default mapping domain type.
You often use them together: force enable, then choose pass-through default.

5) Will iommu=pt improve virtio performance?

Sometimes indirectly. Virtio itself is paravirtual, but the host’s DMA behavior for backing devices and vhost paths can still involve IOMMU work.
Measure. If perf doesn’t show IOMMU overhead, don’t expect miracles.

6) What’s the biggest operational risk of enabling iommu=pt?

Masking or enabling DMA-related bugs and weakening default isolation for host devices. The scary part is that failures can become rarer and more destructive.

7) If performance is bad, should I just use iommu=off instead?

Usually no. Disabling the IOMMU removes protections and can break VFIO, SR-IOV isolation expectations, and interrupt remapping behavior on some platforms.
If you need performance, iommu=pt is the more surgical tool—still with tradeoffs.

8) How do I prove iommu=pt is the reason performance improved?

Use before/after comparisons under the same workload: perf profiles (mapping/unmapping time), CPU utilization, and tail latency.
Also verify the default domain switched to passthrough in dmesg or sysfs.

9) Can iommu=pt cause devices to end up in different IOMMU groups?

No. Groups are determined by hardware topology (PCIe ACS behavior, chipset/root port isolation), not by the pass-through default domain.
If your groups are bad, fix topology/ACS/slotting, not kernel mapping mode.

10) What’s a sign I should not touch IOMMU knobs at all?

If your bottleneck is clearly elsewhere: a single hot CPU due to IRQ affinity, guest-side driver misconfiguration, storage queueing, or scheduling issues.
Fix the obvious first.

Conclusion: next steps that won’t make you famous for the wrong reasons

iommu=pt is not a magic speed switch. It’s a policy choice: keep IOMMU enabled, reduce translation overhead where isolation isn’t needed,
and accept that you’ve changed the default safety rails.

Practical next steps:

  1. Run the fast diagnosis playbook and get a perf profile under real load. If IOMMU mapping isn’t hot, stop here.
  2. If mapping/unmapping is hot, test iommu=pt on one host. Validate: cmdline, dmesg default domain, VFIO bindings, and absence of DMA faults.
  3. Compare CPU and tail latency to baseline. If you don’t get a meaningful win, roll it back and move on—your bottleneck lives elsewhere.
  4. If you keep it, document the security trade, add a post-boot check for default domain, and watch dmesg for DMA faults like it’s your job (because it is).