Debian 13 IRQ storms and weird latency: check irqbalance and fix interrupts

Everything looks “fine” until it doesn’t. Latency graphs get teeth. A database that’s been boring for months starts stuttering every 30–90 seconds. p99 goes off-road while CPU “idle” still claims it’s relaxed. Then you look at one core pinned at 100% in ksoftirqd and realize you’re not chasing an application problem at all.

On Debian 13, IRQ storms and interrupt imbalance can feel like paranormal activity: packets arrive, disks complete, but your scheduler and queues are losing the fight. The fix is not “reboot and pray.” It’s measuring interrupts like you mean it, checking irqbalance, and making intentional decisions about affinity, queue counts, and offloads.

Fast diagnosis playbook

If you’re on-call and the pager is doing that thing, you need a sequence that narrows the search in minutes. Don’t start by “tuning.” Start by proving where the time is going.

First: confirm whether interrupts/softirqs are the bottleneck

  • Check if one CPU is pegged and it’s not userland.
  • Check softirq rates and top offenders (NET_RX, BLOCK, TIMER).
  • Check if interrupts are piling onto CPU0 (classic) or a single NUMA node.

Second: identify which device is generating the pressure

  • Map hot IRQs to devices (NIC queue IRQs, NVMe MSI-X vectors, HBA lines).
  • Correlate with workload timing (network spikes, storage flush, snapshot, backup).
  • Check whether the device has enough queues and whether they’re used.

Third: decide between “let irqbalance do it” vs “pin it deliberately”

  • If this is a general-purpose server with changing workloads: prefer irqbalance with sane defaults.
  • If this is a latency-sensitive, pinned-CPU system (DPDK, realtime-ish, trading, audio, telco): disable irqbalance and pin interrupts with intent.
  • If you use CPU isolation (isolcpus, nohz_full): interrupts must be kept off isolated cores, or you’ve built a race car and mounted shopping cart wheels.

Fourth: verify you improved the right metric

  • Measure p95/p99 end-to-end latency, not just “CPU looks better.”
  • Confirm that the hot IRQs are spread (or pinned correctly) and rates are stable.
  • Watch for regressions: packet drops, increased retransmits, higher interrupt rate due to disabled coalescing.

What you’re actually seeing (IRQ storms, softirqs, and latency)

An “IRQ storm” is usually not a literal electrical storm. It’s the kernel being interrupted so frequently—or handling so much deferred interrupt work—that real work can’t run smoothly. Symptoms often look like application weirdness: timeouts, stalled IO, short “hiccups,” and jitter that doesn’t match average CPU or throughput.

On modern Linux, hard interrupts (top halves) are kept short. The heavy lifting is deferred into softirqs (bottom halves) and kthreads like ksoftirqd/N. Network receive processing (NET_RX) is a repeat offender: if packets arrive quickly enough, the system can spend a huge share of its CPU time just moving packets from the NIC into socket buffers, leaving less time for the application to drain them. Storage can do it too: NVMe completions are fast and frequent, and with multiple queues you can generate a steady drumbeat of interrupts.

Interrupt imbalance is the quieter cousin of storms. The total interrupt rate might be fine, but if it’s concentrated on one CPU (often CPU0), that core becomes your de facto bottleneck. The scheduler may show plenty of idle time elsewhere, which leads to the classic bad diagnosis: “CPU is fine.” It’s not. One core is on fire while the others are at brunch.

Two common patterns:

  • CPU0 overload: many drivers default to queue 0 or a single vector unless RSS/MSI-X and affinity are set up. Boot-time affinity defaults can also bias CPU0.
  • NUMA mismatch: interrupts run on CPUs far from the PCIe device’s memory locality. That adds latency and burns interconnect bandwidth. It’s death by a thousand cache misses.

There’s also the “interrupt moderation” trap: if coalescing is too aggressive, you reduce interrupt rate but increase latency because the NIC waits longer before interrupting. Too little coalescing, and you can melt a core with interrupts. Tuning is a trade: you’re deciding how much jitter you can afford to avoid overload.

One quote that should live in your head during this kind of work: Hope is not a strategy. — General Gordon R. Sullivan

Joke #1: Interrupt storms are like meetings—if you have too many, nothing else gets done.

Interesting facts and context (why this keeps happening)

  • 1) “irqbalance” exists because SMP made naive interrupt routing painful. Early multiprocessor Linux systems often defaulted interrupts onto the boot CPU, producing CPU0 hotspots that looked like “Linux is slow.”
  • 2) MSI-X changed the game. Message Signaled Interrupts (and MSI-X) let devices raise interrupts via in-memory messages and support many vectors—perfect for multi-queue NICs and NVMe.
  • 3) NAPI was invented to stop packet receive livelock. Linux networking moved to an interrupt-mitigating polling model (NAPI) because purely interrupt-driven RX could collapse under high packet rates.
  • 4) Softirqs are per-CPU by design. That’s good for cache locality, but it also means “one CPU drowning in NET_RX” can starve work on that CPU even when others are idle.
  • 5) “IRQ storm” used to mean broken hardware more often. Today it’s frequently mis-tuned queueing/coalescing or workload changes (like a new service sending 64-byte packets at a million per second).
  • 6) NVMe’s performance comes with completion interrupt volume. Many small IOs at high IOPS can generate a high rate of completion events; MSI-X vectors and queue mapping matter.
  • 7) CPU isolation features made interrupt placement more important. isolcpus and nohz_full can improve latency, but only if you keep interrupts and kernel housekeeping off isolated cores.
  • 8) Modern NIC offloads aren’t always your friend. GRO/LRO/TSO can reduce CPU but increase latency or jitter, and some workloads (small RPCs) hate their buffering behavior.
  • 9) “irqbalance” has matured into policy, not magic. It makes decisions based on load heuristics. Those heuristics can be wrong for specialized systems.

Tools and principles: how Debian 13 handles interrupts

Debian 13 is just Linux with opinions and packaging. The kernel uses /proc/interrupts as the raw truth. Tools like irqbalance apply policy. Drivers expose knobs through /sys and ethtool. Your job is to decide what “good” looks like for your workload, then enforce it.

Hard IRQs vs softirqs: what matters for latency

Hard IRQ context is extremely constrained. Most work gets punted to softirqs. When softirq load is high, the kernel can run softirq processing in the context of the interrupted task (fast, good for throughput) or in ksoftirqd (preemptable, but can lag). If you see ksoftirqd dominating a CPU, you’re behind.
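
If you want per-core softirq time over time rather than a single top snapshot, mpstat from the sysstat package breaks it out in the %soft column. A minimal check, assuming sysstat is installed (or installable) on the box:

cr0x@server:~$ sudo apt install sysstat
cr0x@server:~$ mpstat -P ALL 1 5

One core sitting near 100% in %soft while its neighbors idle is the same story top -H told you, but with per-core numbers you can paste straight into the incident ticket.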

Affinity: “which CPU handles this interrupt?”

Every IRQ has an affinity mask. For MSI-X, each vector is effectively its own IRQ and can be balanced across CPUs. For legacy line-based interrupts, your options are limited and sharing can get ugly.
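
A quick way to see where every vector of one device currently lands is to walk /proc/interrupts and read each IRQ's affinity list. A minimal sketch, assuming the device's vectors contain "eth0" in their name (adjust the pattern for your interface):

cr0x@server:~$ for irq in $(awk '/eth0/ {sub(":","",$1); print $1}' /proc/interrupts); do echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"; done

If every line says the same single CPU, you have found your designated sufferer.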

Queueing: multi-queue devices and why one queue is a tragedy

For NICs, multi-queue + RSS lets inbound flows hash across RX queues, each with its own interrupt vector. For block devices, blk-mq maps IO queues to CPUs. This isn’t just throughput. It’s also a latency story: reduce lock contention and keep the hot path local.
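
You can read the blk-mq queue-to-CPU mapping straight out of sysfs. A minimal sketch, assuming an NVMe namespace named nvme0n1; each hardware-context directory lists the CPUs that submit through that queue:

cr0x@server:~$ for q in /sys/block/nvme0n1/mq/*/; do echo "$q -> CPUs $(cat "${q}cpu_list")"; done

If the submission mapping looks sane but completions still pile onto one CPU, the interrupt side (MSI-X vector affinity) is the next thing to check.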

NUMA: don’t pay for distance if you don’t have to

On multi-socket systems, putting NIC interrupts on CPUs local to the NIC’s PCIe root complex reduces cross-node memory traffic. Debian won’t guess your topology correctly every time. You need to check.

Practical tasks: commands, outputs, and decisions (12+)

These are the moves that fix real incidents. Each task includes: a command, what output means, and the decision you make from it. Run them as root when needed.

Task 1: Confirm the symptom is interrupt/softirq pressure

cr0x@server:~$ top -H -b -n 1 | head -n 25
top - 10:11:12 up 12 days,  3:44,  1 user,  load average: 6.14, 5.98, 5.22
Threads:  421 total,   3 running, 418 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.0 us,  2.1 sy,  0.0 ni, 78.4 id,  0.0 wa,  0.0 hi, 11.5 si,  0.0 st
PID   USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  32  root      20   0       0      0      0 R  96.7   0.0  78:22.41 ksoftirqd/0

What it means: High si (softirq) plus a hot ksoftirqd/N thread is a smoking gun. Your CPU time is being spent on deferred interrupt work, commonly networking.

Decision: Move immediately to measuring softirqs and mapping interrupts to devices. Don’t tune the application yet.

Task 2: See which softirq classes are hot

cr0x@server:~$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
       TIMER:   10234567    9123456    9234567    9012345
      NET_TX:       884      12933      11822      11540
      NET_RX:  90345678   1245678   1300045   1219990
       BLOCK:    112233     110998     111102     109876
    IRQ_POLL:          0          0          0          0
     TASKLET:      3333       2888       3011       2999
       SCHED:    400000     398000     402000     399000
     HRTIMER:       222       211        219        210
         RCU:    600000     590000     610000     605000

What it means: NET_RX massively skewed to CPU0 screams “RX processing is concentrated.” This often correlates with one RX queue interrupt landing on CPU0.

Decision: Inspect /proc/interrupts and NIC queue configuration. Your goal is to distribute RX queue interrupts or enable/verify RSS.

Task 3: Identify hot IRQ lines and whether they’re imbalanced

cr0x@server:~$ awk 'NR==1 || /eth0|nvme|i915|virtio|mlx|ixgbe|enp/ {print}' /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  35:   81234567     120332     110221     118877   PCI-MSI 524288-edge      eth0-TxRx-0
  36:      1200   22100333     118900     119010   PCI-MSI 524289-edge      eth0-TxRx-1
  37:      1100     111200   20300111     119100   PCI-MSI 524290-edge      eth0-TxRx-2
  38:       900     112300     120010   19899110   PCI-MSI 524291-edge      eth0-TxRx-3
  92:    900000     910000     905000     899000   PCI-MSI 1048576-edge      nvme0q0
  93:    120000     118000     119000     121000   PCI-MSI 1048577-edge      nvme0q1

What it means: Here the NIC queue 0 IRQ is absurdly hot on CPU0. Other queues look healthier. This is a classic imbalance.

Decision: Fix affinity for the hot vector and confirm RSS spreads flows. If queue 0 is legitimately busiest due to hashing, increase queues or adjust RSS indirection.

Task 4: Confirm whether irqbalance is running and what it thinks it should do

cr0x@server:~$ systemctl status irqbalance --no-pager
● irqbalance.service - irqbalance daemon
     Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-29 08:12:01 UTC; 1 day 02:00 ago
   Main PID: 812 (irqbalance)
      Tasks: 1 (limit: 38121)
     Memory: 3.8M
        CPU: 21min 33.120s

What it means: irqbalance is running, so either (a) it can’t move that IRQ (some are pinned/unmovable), (b) it is configured to avoid certain CPUs, or (c) it’s making a poor choice for your workload.

Decision: Check irqbalance configuration and IRQ affinity masks. Decide whether to tune irqbalance or override specific IRQs manually.

Task 5: Check current affinity mask for a specific IRQ

cr0x@server:~$ cat /proc/irq/35/smp_affinity_list
0

What it means: IRQ 35 is pinned to CPU0 only. irqbalance can’t help if something (or someone) hard-pinned it.

Decision: If this is not intentional, change it. If CPU0 is reserved for housekeeping and you want that, pin it elsewhere.

Task 6: Move a hot IRQ to another CPU (quick test)

cr0x@server:~$ echo 2 | sudo tee /proc/irq/35/smp_affinity
2
cr0x@server:~$ cat /proc/irq/35/smp_affinity_list
1

What it means: The IRQ is now routed to CPU1 (mask bit 1). This is a blunt instrument but excellent for proving causality.

Decision: If latency improves immediately and ksoftirqd calms down, you’ve confirmed interrupt placement is the problem. Then do the durable fix (queue-aware mapping, irqbalance policy, NUMA-aware placement).
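
Once the quick test proves the point, the durable version is a small script that spreads the device's vectors across a chosen CPU list instead of dumping them on one core. A minimal sketch; the script name, the eth0-TxRx vector naming, and the CPU list 16-19 are all assumptions to replace with your own topology:

#!/bin/bash
# spread-nic-irqs.sh (hypothetical helper): round-robin eth0 queue IRQs over a CPU list
set -euo pipefail
cpus=(16 17 18 19)   # assumed NUMA-local CPUs; pick the node local to the NIC
i=0
for irq in $(awk '/eth0-TxRx/ {sub(":","",$1); print $1}' /proc/interrupts); do
  cpu=${cpus[$((i % ${#cpus[@]}))]}
  echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"   # needs root
  echo "IRQ $irq -> CPU $cpu"
  i=$((i + 1))
done

Two caveats: some vectors (NVMe queues on recent kernels, for example) use kernel-managed affinity and will refuse the write, and irqbalance may move NIC vectors again unless you ban them or stop it on boxes where you manage affinity by hand.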

Task 7: Identify the NIC, driver, and bus location (NUMA hints)

cr0x@server:~$ ethtool -i eth0
driver: ixgbe
version: 6.1.0
firmware-version: 0x800003e2
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

What it means: bus-info is the PCI address. You can map that to a NUMA node and CPU locality.

Decision: Keep interrupts on CPUs local to the NIC’s NUMA node whenever possible.

Task 8: Find the device’s NUMA node and local CPU list

cr0x@server:~$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
1
cr0x@server:~$ cat /sys/devices/system/node/node1/cpulist
16-31

What it means: The NIC is attached to NUMA node 1, local CPUs 16–31. If your IRQs are on CPU0–3, you’re paying cross-node costs.

Decision: Place NIC IRQs on CPUs 16–31, and consider placing the NIC-heavy workloads there too.

Task 9: Check NIC queue/channel counts (do you even have enough queues?)

cr0x@server:~$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:	16
TX:	16
Other:	0
Combined:	16
Current hardware settings:
RX:	0
TX:	0
Other:	0
Combined:	4

What it means: The NIC supports up to 16 combined channels but is currently using 4. If you have many cores and high packet rates, 4 queues may be a bottleneck.

Decision: Increase combined channels if the workload benefits and the system has CPU to handle it. But don’t go wild; more queues can mean more overhead and worse cache locality.

Task 10: Increase NIC combined queues (carefully) and re-check interrupts

cr0x@server:~$ sudo ethtool -L eth0 combined 8
cr0x@server:~$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:	16
TX:	16
Other:	0
Combined:	16
Current hardware settings:
RX:	0
TX:	0
Other:	0
Combined:	8

What it means: You now have 8 queue pairs. That should create more MSI-X vectors and distribute work better—if RSS is configured and your flows hash well.

Decision: Re-check /proc/interrupts after 30–60 seconds under load. If one queue still dominates, your hashing/indirection or traffic pattern may be the constraint.

Task 11: Inspect RSS indirection and hash key (is traffic being spread?)

cr0x@server:~$ ethtool -x eth0 | head -n 25
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
   16: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
   32: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
   48: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
RSS hash key:
6d:5a:2c:...:91

What it means: The table appears evenly distributed. That’s good. If it’s not, or it’s all zeros, you’re effectively single-queue.

Decision: If distribution is poor, adjust the indirection table (advanced) or fix the driver/firmware settings. If your workload is a single flow (one giant TCP stream), RSS won’t help much; you need application sharding or different transport behavior.
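
If the table is skewed and the driver allows rewriting it, ethtool can reset it to an even spread over the active rings. A minimal sketch, assuming the 8-ring configuration from the previous task; not every NIC/driver supports this:

cr0x@server:~$ sudo ethtool -X eth0 equal 8
cr0x@server:~$ ethtool -x eth0 | head -n 6

Some drivers also accept explicit weights (ethtool -X eth0 weight ...), which is occasionally useful when you want deliberately asymmetric spreading.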

Task 12: Check and tune interrupt coalescing (latency vs CPU trade)

cr0x@server:~$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: on
rx-usecs: 50
rx-frames: 0
tx-usecs: 50
tx-frames: 0

What it means: Adaptive coalescing is on; the NIC/driver will change coalescing based on traffic. That’s usually okay for general use, but it can introduce latency variance.

Decision: For strict tail-latency workloads, consider disabling adaptive coalescing and setting conservative fixed values. Validate with real latency measurements, not vibes.

Task 13: Identify dropped packets or backlog overruns (symptom of NET_RX overload)

cr0x@server:~$ nstat | egrep 'TcpExtListenOverflows|IpInDiscards|UdpInErrors|TcpExtTCPBacklogDrop'
TcpExtTCPBacklogDrop          123
IpInDiscards                  456

What it means: You’re dropping packets in the stack. This can happen when softirq processing can’t keep up, or when application accept/read can’t drain fast enough.

Decision: Fix interrupt distribution first. Then consider socket backlog tuning and application capacity. Don’t paper over an IRQ hotspot with bigger buffers unless you enjoy delayed failure.

Task 14: Look for IRQ “nobody cared” messages and kernel warnings

cr0x@server:~$ journalctl -k -b | egrep -i 'irq|nobody cared|soft lockup|hard lockup' | tail -n 20
Dec 30 09:55:18 server kernel: irq 35: nobody cared (try booting with the "irqpoll" option)
Dec 30 09:55:18 server kernel: Disabling IRQ #35

What it means: This is serious. An IRQ got so noisy or misbehaved that the kernel decided it was broken and disabled it. That can take your NIC or storage offline in slow motion.

Decision: Treat it as a driver/firmware/hardware issue first. Check BIOS settings, update firmware, verify MSI/MSI-X stability, and examine whether interrupt moderation settings are pathological.

Task 15: Verify NVMe queue/IRQ mapping is sane

cr0x@server:~$ grep -E 'nvme[0-9]q' /proc/interrupts | head -n 10
 92:    900000     910000     905000     899000   PCI-MSI 1048576-edge      nvme0q0
 93:    120000     118000     119000     121000   PCI-MSI 1048577-edge      nvme0q1
 94:    119000     121000     118000     120000   PCI-MSI 1048578-edge      nvme0q2

What it means: Multiple NVMe queues exist and interrupts are relatively evenly distributed. If you see only nvme0q0 hot and others idle, the device might be limited to one queue or the workload is effectively single-threaded.

Decision: If NVMe interrupts are concentrated, check nvme_core.default_ps_max_latency_us power settings, driver parameters, and whether the block layer is mapping queues to CPUs as expected.

Task 16: Observe per-IRQ rate over time (not just totals)

cr0x@server:~$ for i in 1 2 3; do date; grep -E 'eth0-TxRx-0|eth0-TxRx-1' /proc/interrupts; sleep 1; done
Tue Dec 30 10:10:01 UTC 2025
 35:   81234567     120332     110221     118877   PCI-MSI 524288-edge      eth0-TxRx-0
 36:      1200   22100333     118900     119010   PCI-MSI 524289-edge      eth0-TxRx-1
Tue Dec 30 10:10:02 UTC 2025
 35:   81310222     120350     110240     118899   PCI-MSI 524288-edge      eth0-TxRx-0
 36:      1210   22155880     118920     119030   PCI-MSI 524289-edge      eth0-TxRx-1
Tue Dec 30 10:10:03 UTC 2025
 35:   81389001     120360     110255     118920   PCI-MSI 524288-edge      eth0-TxRx-0
 36:      1220   22210210     118940     119050   PCI-MSI 524289-edge      eth0-TxRx-1

What it means: You can eyeball deltas per second. If one IRQ increments much faster than others, that’s your hotspot. Totals hide rate changes.

Decision: Target the high-rate IRQ for balancing/pinning and confirm the rate distribution improves under representative load.
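
Eyeballing raw counters works, but per-interval deltas are easier to read at 3 a.m. A minimal sketch that prints how much each matching IRQ line grew per interval; the default eth0-TxRx pattern is an assumption, pass your own as the first argument:

#!/bin/bash
# irq-rate.sh (hypothetical helper): per-interval interrupt deltas for IRQ lines matching a pattern
pattern=${1:-eth0-TxRx}
interval=${2:-1}
prev=$(grep -E "$pattern" /proc/interrupts)
while sleep "$interval"; do    # Ctrl-C to stop
  cur=$(grep -E "$pattern" /proc/interrupts)
  # pair old and new samples line by line, then sum the per-CPU count deltas for each IRQ
  paste <(echo "$prev") <(echo "$cur") | awk '{
    half = NF / 2; delta = 0;
    for (i = 2; i <= half; i++) if ($i ~ /^[0-9]+$/) delta += $(half + i) - $i;
    printf "%-6s %-14s +%d/interval\n", $1, $NF, delta;
  }'
  echo "---"
  prev=$cur
done

A lazier alternative is watch -n1 -d 'grep eth0-TxRx /proc/interrupts', which highlights changing digits but makes you do the arithmetic yourself.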

irqbalance on Debian 13: verify, tune, and know when to disable

irqbalance is often either blindly trusted or blindly blamed. Neither is adult behavior. Treat it like any other automation: it’s a policy engine with defaults tuned for “typical servers.” If your server is not typical, it will still be treated as typical unless you intervene.

What irqbalance does well

  • Spreads IRQs across CPUs so CPU0 isn’t the designated sufferer.
  • Responds to changing load without you hand-editing affinity masks at 3 a.m.
  • Plays reasonably with MSI-X multi-queue devices.

What irqbalance does poorly (or can’t do at all)

  • Latency-critical CPU isolation setups: It may move interrupts onto CPUs you intended to keep “clean” unless configured carefully.
  • NUMA-sensitive placement: It may not always keep device interrupts local to the PCIe NUMA node in the way you want.
  • Unmovable interrupts: Some IRQs are effectively pinned or handled in ways irqbalance can’t change.

Verify irqbalance configuration

cr0x@server:~$ grep -v '^\s*#' /etc/default/irqbalance | sed '/^\s*$/d'
ENABLED="1"
OPTIONS=""

What it means: Default config. No CPU ban masks, no special behavior.

Decision: If you’re seeing IRQ pinning anyway, something else is setting affinity (driver scripts, custom tuning, container runtime hooks, or old “optimization” leftovers).

See what CPUs are online and whether you’ve isolated some

cr0x@server:~$ lscpu | egrep 'CPU\(s\)|On-line CPU|NUMA node'
CPU(s):                               32
On-line CPU(s) list:                  0-31
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-15
NUMA node1 CPU(s):                    16-31

What it means: Clean topology. If you also use isolation kernel parameters, check the kernel cmdline.

Decision: If you isolate CPUs (say 4–31) for workloads, you must ban those CPUs from interrupts using irqbalance options or manual affinity.

Inspect kernel cmdline for isolation-related parameters

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.12.0 root=/dev/mapper/vg0-root ro quiet isolcpus=4-31 nohz_full=4-31 rcu_nocbs=4-31

What it means: You’ve declared CPUs 4–31 special. Great. Now keep interrupts off them, or you’ve made a promise the kernel will break on your behalf.

Decision: Configure irqbalance to avoid 4–31, or disable irqbalance and set explicit masks for all relevant IRQs.

Ban CPUs from irqbalance (typical isolation pattern)

cr0x@server:~$ sudo sed -i 's/^OPTIONS=.*/OPTIONS="--banirq=0"/' /etc/default/irqbalance
cr0x@server:~$ sudo systemctl restart irqbalance
cr0x@server:~$ systemctl status irqbalance --no-pager | head -n 12
● irqbalance.service - irqbalance daemon
     Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-12-30 10:18:01 UTC; 2s ago

What it means: This is a deliberate counter-example. --banirq bans a specific IRQ number (here IRQ 0); it does not ban CPUs. CPU banning is done through irqbalance’s banned-CPU settings, and the exact variable or option varies by version and build. The point: do not randomly “tune” options you don’t understand.

Decision: For CPU isolation, prefer explicit per-IRQ affinity or irqbalance CPU mask configuration appropriate to your installed irqbalance. Verify by reading /proc/irq/*/smp_affinity_list after restart.
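
For the record, here is what the CPU-side configuration usually looks like, assuming your irqbalance unit loads /etc/default/irqbalance as an environment file (the Debian package typically does) and CPUs 4-31 are the isolated set. Which variable your build honors varies by version, so check irqbalance(1) before trusting either line:

# /etc/default/irqbalance (hypothetical excerpt): keep irqbalance off isolated CPUs 4-31
# Newer irqbalance builds accept a CPU list:
IRQBALANCE_BANNED_CPULIST="4-31"
# Older builds read a hexadecimal mask instead (bits 4-31 set):
# IRQBALANCE_BANNED_CPUS=fffffff0

cr0x@server:~$ sudo systemctl restart irqbalance

If your build honors neither variable, fall back to explicit per-IRQ affinity as described above.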

Here’s the opinionated guidance: if you’re running a normal fleet of Debian servers, leave irqbalance enabled. If you’re running a specialized latency box with isolated CPUs, disable it and manage affinity with configuration management so it’s repeatable.

NIC interrupts: RSS, RPS/XPS, coalescing, and multi-queue sanity

Most “weird latency” cases that smell like interrupts are network-driven. Not because storage is innocent, but because packets can arrive at line rate and demand immediate CPU attention. A NIC can happily deliver more work per second than your kernel can digest if you configure it like it’s 2009.

Start with RSS (hardware receive scaling)

RSS spreads flows across RX queues in hardware. Each RX queue has its own interrupt vector. If RSS is off, or only one queue exists, one CPU ends up doing most of the receive work. Your throughput might still look okay, while tail latency gets wrecked by queueing and jitter.

RPS/XPS: software steering when RSS isn’t enough

RPS (Receive Packet Steering) can distribute packet processing across CPUs even if hardware RSS is limited. XPS can help distribute transmit processing. These are CPU features; they can improve balance but also add overhead. Use them when you have a reason, not because you saw a blog post in 2016.

cr0x@server:~$ ls -1 /sys/class/net/eth0/queues/
rx-0
rx-1
rx-2
rx-3
tx-0
tx-1
tx-2
tx-3

What it means: Multi-queue exists. Good. Now confirm the IRQ vectors align with those queues and aren’t all routed to the same CPU set.

Decision: Align each queue IRQ to CPUs local to the NIC, and avoid isolated CPUs if applicable.
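
If you do decide RPS/XPS are warranted, they are configured per queue with CPU bitmasks in sysfs (all zeros means disabled, which is the default). A minimal sketch, assuming you want rx-0 and tx-0 handled by CPUs 16-19, i.e. mask f0000; compute the mask for your own CPU set:

cr0x@server:~$ echo f0000 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
f0000
cr0x@server:~$ echo f0000 | sudo tee /sys/class/net/eth0/queues/tx-0/xps_cpus
f0000

Repeat per queue, and remember these masks reset on reboot and sometimes on driver reload, so persist them if they earn their keep.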

Interrupt coalescing: the latency tax you might be paying

If your service is latency-sensitive (RPC, databases, interactive APIs), coalescing settings can swing tail latency. Adaptive coalescing is designed to be generally efficient, not deterministic. Sometimes “generally efficient” is the enemy.

cr0x@server:~$ sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 25 tx-usecs 25
cr0x@server:~$ ethtool -c eth0 | egrep 'Adaptive|rx-usecs|tx-usecs'
Adaptive RX: off  TX: off
rx-usecs: 25
tx-usecs: 25

What it means: You’ve reduced wait time before interrupts fire. That can reduce latency but increase CPU usage and interrupt rate.

Decision: If CPU headroom exists and latency improves, keep it. If CPU melts or drops increase, back off. The correct setting is the one that meets your SLO without wasting a core.

Joke #2: Tuning NIC coalescing is like seasoning soup—too little and it’s bland, too much and you’re suddenly drinking seawater.

Storage interrupts: NVMe, blk-mq, and the “disk is fine” trap

Storage latency issues often get blamed on the drive. Sometimes that’s correct. Often it’s not. NVMe devices are extremely fast, which means they can complete IO so quickly that the completion path becomes the bottleneck: interrupts, queue mapping, and CPU locality.

Check whether you’re CPU-bound in the block layer

cr0x@server:~$ iostat -x 1 3
Linux 6.12.0 (server) 	12/30/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.12    0.00    3.01    0.22    0.00   89.65

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz aqu-sz  %util
nvme0n1         1200.0  96000.0     0.0   0.00    0.55    80.00  800.0  64000.0     0.0   0.00    0.60    80.00   0.12  45.00

What it means: Device await is low, util is moderate. If your app latency is awful anyway, the device isn’t obviously the limiter; your completion processing, contention, or network path might be.

Decision: Correlate with IRQ/softirq metrics. If NVMe IRQs are high and concentrated, fix that first.

Verify NVMe power state/latency settings aren’t sabotaging you

cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0

What it means: Counter-intuitively, 0 disables APST (autonomous power state transitions): the drive is not allowed to drop into deeper power-saving states on its own. Larger values permit any power state whose exit latency fits within that many microseconds; the kernel default is typically 100000 (100 ms), which allows deep states and can add wake latency on servers.

Decision: If this reads 0, NVMe power saving isn’t your latency problem; look elsewhere. If it’s large, latency-sensitive systems can tighten it (or set 0) via a kernel parameter. Validate carefully; power policies are workload- and platform-specific.
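
Since it’s a module parameter, the usual way to persist a tighter limit is the kernel command line. A minimal sketch, assuming GRUB; the 5500 µs value is purely illustrative (it permits only shallow power states), and 0 disables APST entirely:

# /etc/default/grub -- append the parameter to the existing GRUB_CMDLINE_LINUX value, for example:
GRUB_CMDLINE_LINUX="quiet nvme_core.default_ps_max_latency_us=5500"

cr0x@server:~$ sudo update-grub
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # verify after reboot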

Virtualization quirks: virtio, vhost, and noisy neighbors

In VMs, you’re not just tuning Linux; you’re negotiating with the hypervisor. virtio-net and vhost can be excellent, but interrupt behavior changes. You can see “interrupt storms” that are really “exit storms” or host-side queue contention.

Inside the guest: check virtio IRQ distribution

cr0x@server:~$ grep -E 'virtio|vhost' /proc/interrupts | head -n 8
 40:   22112233    1100223    1099888    1101001   PCI-MSI 327680-edge      virtio0-input.0
 41:     110022   19888777     990001     980002   PCI-MSI 327681-edge      virtio0-output.0

What it means: If one virtio vector dominates and it’s pinned, you can still suffer imbalance. But in VMs, the host’s CPU pinning and vCPU topology matter just as much.

Decision: Fix guest affinity only after verifying host pinning/NUMA placement. Otherwise you’re rearranging furniture in a moving truck.

Check steal time and scheduling pressure

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  12345 987654    0    0     1     3 1200 2400  6  2 90  0  2

What it means: Non-trivial st (steal) suggests host contention. That can mimic interrupt issues because your guest can’t run when it needs to service queues.

Decision: Escalate to the virtualization layer: CPU pinning, host IRQ routing, noisy neighbor mitigation, and ensuring virtio multi-queue is enabled end-to-end.

Three corporate mini-stories from the trenches

Incident: the wrong assumption (“CPU is only 30%, so it can’t be CPU”)

A mid-sized company ran a Debian-based API tier behind a load balancer. After a kernel upgrade, p99 latency doubled during busy hours. Dashboards showed average CPU at ~30–40% with lots of idle. The initial response was predictable: blame the new app release, blame the database, then blame the network team for “packet loss.”

The on-call SRE finally looked at per-core utilization and saw CPU0 pinned. The process list didn’t show a hot userland thread. It was ksoftirqd/0. Meanwhile CPU1–CPU31 were sipping tea. The team had assumed “CPU%” was a scalar. It’s not. It’s a distribution.

/proc/interrupts made it painfully obvious: the primary NIC RX/TX queue interrupt landed almost exclusively on CPU0. RSS was configured for multiple queues, but the IRQ affinity mask had been pinned during an earlier “performance tuning” attempt that never got reverted. irqbalance was running, but it can’t override a pin you’ve welded in place.

The fix was boring: remove the manual pinning, restart irqbalance, and then explicitly pin NIC queue interrupts to CPUs local to the NIC’s NUMA node. Latency returned to baseline without touching the application. The postmortem conclusion was harsher than the fix: the team’s assumption about CPU metrics was wrong, and it wasted hours of cross-team churn.

Optimization that backfired: “Turn off coalescing to reduce latency”

A different organization had a latency-sensitive service and a new engineer with good intentions. They read that interrupt coalescing “adds latency,” so they disabled adaptive coalescing and set rx-usecs to 0 everywhere. The graphs looked amazing in a synthetic test: lower median latency, snappy response, everyone happy.

Two weeks later, a traffic pattern changed. More small packets. More concurrent connections. Suddenly a subset of servers began dropping packets and showing periodic latency spikes that looked like garbage collection pauses. Again, CPU “average” looked fine. The real story was that interrupt rate went through the roof; one CPU per host became an interrupt concierge, and the rest of the machine was underutilized because the hot CPU couldn’t keep up with the pace of interrupts.

The team had optimized for the median and paid with the tail. Disabling coalescing didn’t just reduce latency; it removed a safety valve. Under heavy packet rates, the system spent too much time in interrupt and softirq paths. The fix was to re-enable adaptive coalescing, then set a moderate fixed baseline for RX/TX usecs aligned with their SLO. They also increased NIC queues and ensured interrupts were mapped to the right NUMA node.

The lesson was not “never change coalescing.” It was: coalescing is a control knob, not a religion. If you set it to “always lowest possible,” you’re choosing fragility under bursty workloads.

Boring but correct practice: keep an IRQ/affinity baseline and enforce it

One company ran Debian 13 on storage-heavy nodes: NVMe + 100GbE. They had a simple rule: every hardware class had a documented, versioned “interrupt and queue baseline.” When a new server got provisioned, configuration management applied it and a validation script checked it.

The script wasn’t fancy. It captured a snapshot of /proc/interrupts, queue counts from ethtool -l, NUMA node mapping from sysfs, and current affinity lists for hot IRQs. It also flagged CPU0 hotspots and any IRQ pinned to isolated CPUs. If something drifted, the pipeline failed and the server didn’t join the pool.

During an otherwise nasty incident—high latency on a subset of nodes after a vendor firmware update—this baseline saved days. They could immediately see which nodes deviated: one batch had NIC queues reduced by the firmware reset, and interrupts collapsed onto two vectors. Rolling back or reapplying settings fixed the issue quickly.

The practice was unglamorous. No one got a conference talk out of it. It worked anyway. In production, “boring and correct” beats “clever and fragile” almost every time.

Common mistakes: symptom → root cause → fix

1) p99 latency spikes, average CPU low

Symptom: Tail latency is bad; CPU dashboards show plenty of idle.

Root cause: One CPU is overloaded by interrupts/softirqs (often CPU0). Average hides skew.

Fix: Check top -H, /proc/softirqs, and /proc/interrupts. Fix IRQ affinity and ensure RSS/multi-queue is active.

2) ksoftirqd pegged, NET_RX huge on one CPU

Symptom: ksoftirqd/N is hot; NET_RX skewed.

Root cause: RX queue interrupt pinned to one CPU; RSS not distributing; single flow dominating; or RPS misconfigured.

Fix: Increase queues, verify RSS indirection, distribute IRQ vectors across CPUs local to NIC, consider RPS for stubborn cases.

3) “irqbalance is running but nothing changes”

Symptom: irqbalance active; IRQ still stuck on one CPU.

Root cause: IRQ is manually pinned; driver enforces affinity; IRQ is unmovable; or irqbalance is constrained by banned CPU masks.

Fix: Inspect /proc/irq/*/smp_affinity_list. Remove manual pins or adjust policy. Validate after restarting irqbalance.

4) Latency improved briefly after pinning, then got worse

Symptom: Quick win followed by regression under real traffic.

Root cause: Pinning improved locality but overloaded a subset of CPUs; queue distribution/hashing changed; or you created contention with application threads on the same CPUs.

Fix: Make IRQ placement match CPU scheduling strategy: reserve CPUs for interrupts, or align application threads to the same NUMA node and avoid fighting over the same cores.

5) Packet drops increase after “latency tuning”

Symptom: Retransmits/backlog drops after disabling coalescing/offloads.

Root cause: Interrupt rate too high; CPU can’t keep up; buffers overflow.

Fix: Re-enable adaptive coalescing or set moderate coalescing values; keep RSS and multi-queue; verify no single IRQ dominates.

6) Storage latency blamed on NVMe, but iostat looks clean

Symptom: App sees stalls; NVMe device metrics look fine.

Root cause: CPU-side completion processing or cross-NUMA interrupt handling; sometimes IRQ saturation on a CPU shared with networking.

Fix: Check NVMe IRQ distribution and NUMA locality; avoid mixing hot NIC and NVMe interrupts on the same CPU set if it creates contention.

Checklists / step-by-step plan

Checklist A: Stop the bleeding in 15 minutes

  1. Capture evidence: top -H, /proc/softirqs, /proc/interrupts, and latency graphs at the same timestamp.
  2. Identify the single hottest IRQ line/vector and map it to a device/queue name.
  3. Check its affinity: cat /proc/irq/<IRQ>/smp_affinity_list.
  4. If it’s pinned to one CPU and that CPU is hot, move it as a test (one step) and re-measure p99.
  5. If moving helps, implement a durable policy: either tune irqbalance or set persistent affinity rules.
  6. Verify no packet drops/regressions: nstat, driver stats, and application error rates.

Checklist B: Make the fix durable (don’t rely on heroics)

  1. Document intended CPU topology: which CPUs are for housekeeping, which are isolated, which are workload.
  2. Record NIC and NVMe NUMA locality and ensure IRQs are placed accordingly.
  3. Set NIC queue counts explicitly (and verify after reboot/firmware update).
  4. Pick a coalescing policy: adaptive for general servers; fixed for strict latency with headroom.
  5. Decide on irqbalance: enabled for general purpose; disabled + explicit affinity for specialized nodes.
  6. Automate validation: fail provisioning if IRQs collapse onto CPU0 or isolated cores (a minimal gate is sketched below).
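
The gate doesn’t need to be clever. A minimal sketch, assuming isolated CPUs 4-31 and treating “collapsed onto CPU0” as any eth0/NVMe vector whose affinity list is exactly 0; both thresholds and the device patterns are assumptions to replace with your own policy:

#!/bin/bash
# irq-baseline-check.sh (hypothetical provisioning gate): fail if hot device IRQs sit on CPU0 only or on isolated cores
set -euo pipefail
isolated='^([4-9]|[12][0-9]|3[01])$'   # assumed isolated set: CPUs 4-31
fail=0
for irq in $(awk '/eth0|nvme/ {sub(":","",$1); if ($1 ~ /^[0-9]+$/) print $1}' /proc/interrupts); do
  aff=$(cat "/proc/irq/$irq/smp_affinity_list")
  if [ "$aff" = "0" ]; then
    echo "FAIL: IRQ $irq pinned to CPU0 only"; fail=1
  fi
  # crude check: flags an affinity list that is a single CPU inside the isolated range
  if echo "$aff" | grep -Eq "$isolated"; then
    echo "FAIL: IRQ $irq lands on isolated CPU $aff"; fail=1
  fi
done
exit "$fail"

Wire it into provisioning so a drifted node never joins the pool, which is exactly the boring practice from the story above.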

Checklist C: Confirm improvements with real measurements

  1. Measure end-to-end p95/p99 and error rates for at least one business cycle (not 60 seconds of hope).
  2. Confirm interrupt distribution over time using deltas (not just totals).
  3. Watch CPU softirq time and ksoftirqd residency.
  4. Validate no new drops/retransmits appeared.
  5. Keep a before/after snapshot of /proc/interrupts for the postmortem.

FAQ

1) What exactly is an IRQ storm on Linux?

It’s a condition where interrupt activity (hard IRQs or the softirq work they trigger) overwhelms CPUs, causing latency, drops, and jitter. It’s often “too many interrupts” or “interrupts in the wrong place.”

2) Why does CPU0 always seem to be the victim?

Boot-time defaults, legacy interrupt routing, and accidental pinning frequently land work on CPU0. Also, many housekeeping tasks naturally run there. If you let devices dogpile onto CPU0, it becomes your bottleneck core.

3) Should I just disable irqbalance?

On general-purpose servers: no, keep it. On specialized latency systems with CPU isolation or strict affinity requirements: yes, disable it and manage affinity explicitly. The worst option is “disable it and do nothing else.”

4) How do I know whether it’s network or storage?

Look at /proc/softirqs and /proc/interrupts. High NET_RX and hot NIC queue IRQs point to network. High block-related IRQs and NVMe queue vectors point to storage. Then correlate with workload timing.

5) If I increase NIC queues, will latency always improve?

No. More queues can reduce contention and spread load, but can also increase overhead and worsen cache locality. It’s a tool, not a guarantee. Measure after each change.

6) Can a single TCP flow defeat RSS?

Yes. RSS hashes flows; one flow maps to one RX queue. If your workload is dominated by a few heavy flows, you might still hotspot a queue. Fix by sharding traffic or changing how the workload fans out.

7) What’s the difference between RSS and RPS?

RSS is hardware-based distribution into multiple RX queues. RPS is software-based steering of packet processing across CPUs. RSS is usually preferable when available; RPS is a fallback or augmentation.

8) Why did my tuning disappear after reboot or firmware update?

Many settings (queue counts, coalescing, affinity) are not persistent by default. Firmware updates can reset NIC state. Make configuration persistent via systemd units, udev rules, or configuration management—and verify on boot.
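
One pattern that works on Debian is a oneshot systemd unit that reruns your tuning once the NIC device exists. A minimal sketch; the unit name and /usr/local/sbin/apply-irq-tuning.sh (a script that reapplies your queue counts, coalescing, and affinity) are placeholders, not a packaged tool:

# /etc/systemd/system/irq-tuning.service (hypothetical)
[Unit]
Description=Reapply NIC queue, coalescing and IRQ affinity baseline
After=sys-subsystem-net-devices-eth0.device
Wants=sys-subsystem-net-devices-eth0.device

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/apply-irq-tuning.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl enable --now irq-tuning.service

Pair it with the validation gate from the checklists, because firmware updates love to quietly reset queue counts.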

9) How does NUMA affect interrupts and latency?

If a device is attached to NUMA node 1 but its interrupts run on CPUs in node 0, you increase cross-node traffic and cache misses. That adds jitter and reduces headroom. Place IRQs on local CPUs when possible.

10) I moved an IRQ and the problem improved, but not completely. Now what?

Good: you proved causality. Next step is systematic distribution: ensure all queues are used, map vectors across a CPU set, check coalescing, and avoid placing hot interrupts on CPUs running the busiest application threads unless you intend to.

Conclusion: next steps that actually reduce p99

IRQ storms and interrupt imbalance aren’t mysterious. They’re measurable. On Debian 13, you already have what you need: /proc/interrupts, /proc/softirqs, ethtool, and a clear head.

Do this next, in order:

  1. Prove whether you’re interrupt/softirq bound (top -H, /proc/softirqs).
  2. Map hot IRQs to devices and queues (/proc/interrupts plus ethtool -i/-l).
  3. Fix distribution and locality: queues, RSS, affinity masks, NUMA placement.
  4. Only then: tune coalescing and offloads, with real latency measurements and rollback plans.
  5. Make it persistent and validated, because the only thing worse than a storm is a storm that returns after a reboot.