Everything looks “fine” until it doesn’t. Latency graphs get teeth. A database that’s been boring for months starts stuttering every 30–90 seconds. p99 goes off-road while CPU “idle” still claims it’s relaxed. Then you look at one core pinned at 100% in ksoftirqd and realize you’re not chasing an application problem at all.
On Debian 13, IRQ storms and interrupt imbalance can feel like paranormal activity: packets arrive, disks complete, but your scheduler and queues are losing the fight. The fix is not “reboot and pray.” It’s measuring interrupts like you mean it, checking irqbalance, and making intentional decisions about affinity, queue counts, and offloads.
Fast diagnosis playbook
If you’re on-call and the pager is doing that thing, you need a sequence that narrows the search in minutes. Don’t start by “tuning.” Start by proving where the time is going.
First: confirm whether interrupts/softirqs are the bottleneck
- Check if one CPU is pegged and it’s not userland.
- Check softirq rates and top offenders (NET_RX, BLOCK, TIMER).
- Check if interrupts are piling onto CPU0 (classic) or a single NUMA node.
Second: identify which device is generating the pressure
- Map hot IRQs to devices (NIC queue IRQs, NVMe MSI-X vectors, HBA lines).
- Correlate with workload timing (network spikes, storage flush, snapshot, backup).
- Check whether the device has enough queues and whether they’re used.
Third: decide between “let irqbalance do it” vs “pin it deliberately”
- If this is a general-purpose server with changing workloads: prefer irqbalance with sane defaults.
- If this is a latency-sensitive, pinned-CPU system (DPDK, realtime-ish, trading, audio, telco): disable irqbalance and pin interrupts with intent.
- If you use CPU isolation (isolcpus, nohz_full): interrupts must be kept off isolated cores, or you've built a race car and mounted shopping cart wheels.
Fourth: verify you improved the right metric
- Measure p95/p99 end-to-end latency, not just “CPU looks better.”
- Confirm that the hot IRQs are spread (or pinned correctly) and rates are stable.
- Watch for regressions: packet drops, increased retransmits, higher interrupt rate due to disabled coalescing.
What you’re actually seeing (IRQ storms, softirqs, and latency)
An “IRQ storm” is usually not a literal electrical storm. It’s the kernel being interrupted so frequently—or handling so much deferred interrupt work—that real work can’t run smoothly. Symptoms often look like application weirdness: timeouts, stalled IO, short “hiccups,” and jitter that doesn’t match average CPU or throughput.
On modern Linux, hard interrupts (top halves) are kept short. The heavy lifting is deferred into softirqs (bottom halves) and kthreads like ksoftirqd/N. Network receive processing (NET_RX) is a repeat offender: if packets arrive quickly enough, the system can spend huge CPU time just moving packets from NIC to socket buffers, leaving less time for the application to drain them. Storage can do it too: NVMe completions are fast and frequent, and with multiple queues you can generate a steady drumbeat of interrupts.
Interrupt imbalance is the quieter cousin of storms. The total interrupt rate might be fine, but if it’s concentrated on one CPU (often CPU0), that core becomes your de facto bottleneck. The scheduler may show plenty of idle time elsewhere, which leads to the classic bad diagnosis: “CPU is fine.” It’s not. One core is on fire while the others are at brunch.
Two common patterns:
- CPU0 overload: many drivers default to queue 0 or a single vector unless RSS/MSI-X and affinity are set up. Boot-time affinity defaults can also bias CPU0.
- NUMA mismatch: interrupts run on CPUs far from the PCIe device’s memory locality. That adds latency and burns interconnect bandwidth. It’s death by a thousand cache misses.
There’s also the “interrupt moderation” trap: if coalescing is too aggressive, you reduce interrupt rate but increase latency because the NIC waits longer before interrupting. Too little coalescing, and you can melt a core with interrupts. Tuning is a trade: you’re deciding how much jitter you can afford to avoid overload.
One quote that should live in your head during this kind of work: Hope is not a strategy.
— General Gordon R. Sullivan
Joke #1: Interrupt storms are like meetings—if you have too many, nothing else gets done.
Interesting facts and context (why this keeps happening)
- 1) “irqbalance” exists because SMP made naive interrupt routing painful. Early multiprocessor Linux systems often defaulted interrupts onto the boot CPU, producing CPU0 hotspots that looked like “Linux is slow.”
- 2) MSI-X changed the game. Message Signaled Interrupts (and MSI-X) let devices raise interrupts via in-memory messages and support many vectors—perfect for multi-queue NICs and NVMe.
- 3) NAPI was invented to stop packet receive livelock. Linux networking moved to an interrupt-mitigating polling model (NAPI) because purely interrupt-driven RX could collapse under high packet rates.
- 4) Softirqs are per-CPU by design. That’s good for cache locality, but it also means “one CPU drowning in NET_RX” can starve work on that CPU even when others are idle.
- 5) “IRQ storm” used to mean broken hardware more often. Today it’s frequently mis-tuned queueing/coalescing or workload changes (like a new service sending 64-byte packets at a million per second).
- 6) NVMe’s performance comes with completion interrupt volume. Many small IOs at high IOPS can generate a high rate of completion events; MSI-X vectors and queue mapping matter.
- 7) CPU isolation features made interrupt placement more important. isolcpus and nohz_full can improve latency, but only if you keep interrupts and kernel housekeeping off isolated cores.
- 8) Modern NIC offloads aren’t always your friend. GRO/LRO/TSO can reduce CPU but increase latency or jitter, and some workloads (small RPCs) hate their buffering behavior.
- 9) “irqbalance” has matured into policy, not magic. It makes decisions based on load heuristics. Those heuristics can be wrong for specialized systems.
Tools and principles: how Debian 13 handles interrupts
Debian 13 is just Linux with opinions and packaging. The kernel uses /proc/interrupts as the raw truth. Tools like irqbalance apply policy. Drivers expose knobs through /sys and ethtool. Your job is to decide what “good” looks like for your workload, then enforce it.
Hard IRQs vs softirqs: what matters for latency
Hard IRQ context is extremely constrained. Most work gets punted to softirqs. When softirq load is high, the kernel can run softirq processing in the context of the interrupted task (fast, good for throughput) or in ksoftirqd (preemptable, but can lag). If you see ksoftirqd dominating a CPU, you’re behind.
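A quick way to see that split per CPU is mpstat from the sysstat package (an assumption: it isn’t always installed by default). Watch the %irq and %soft columns:
cr0x@server:~$ mpstat -P ALL 1 3
One CPU showing double-digit %soft while the others sit near zero is the imbalance pattern this article keeps circling back to.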
Affinity: “which CPU handles this interrupt?”
Every IRQ has an affinity mask. For MSI-X, each vector is effectively its own IRQ and can be balanced across CPUs. For legacy line-based interrupts, your options are limited and sharing can get ugly.
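Both views of the same setting live under /proc/irq; a minimal sketch of reading them, using IRQ 35 (the example vector from the tasks below):
cr0x@server:~$ cat /proc/irq/35/smp_affinity             # hex bitmask; bit N means CPU N is allowed
cr0x@server:~$ cat /proc/irq/35/smp_affinity_list        # same setting as a human-readable CPU list
cr0x@server:~$ cat /proc/irq/35/effective_affinity_list  # where delivery actually lands, if the kernel exposes it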
Queueing: multi-queue devices and why one queue is a tragedy
For NICs, multi-queue + RSS lets inbound flows hash across RX queues, each with its own interrupt vector. For block devices, blk-mq maps IO queues to CPUs. This isn’t just throughput. It’s also a latency story: reduce lock contention and keep the hot path local.
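The blk-mq queue-to-CPU mapping is visible in sysfs; a minimal sketch, assuming nvme0n1 (adjust the device name to yours):
cr0x@server:~$ grep . /sys/block/nvme0n1/mq/*/cpu_list
Each hardware queue should map to a reasonable, local set of CPUs; if every queue funnels into the same few cores, you’ve found the block-side version of the NIC imbalance above.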
NUMA: don’t pay for distance if you don’t have to
On multi-socket systems, putting NIC interrupts on CPUs local to the NIC’s PCIe root complex reduces cross-node memory traffic. Debian won’t guess your topology correctly every time. You need to check.
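Task 8 below walks the PCI-address route; there is also a shortcut through the network device’s sysfs link (an assumption: exact paths depend on the driver, and a numa_node of -1 means the platform didn’t report locality):
cr0x@server:~$ cat /sys/class/net/eth0/device/numa_node
cr0x@server:~$ cat /sys/class/net/eth0/device/local_cpulist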
Practical tasks: commands, outputs, and decisions (12+)
These are the moves that fix real incidents. Each task includes: a command, what output means, and the decision you make from it. Run them as root when needed.
Task 1: Confirm the symptom is interrupt/softirq pressure
cr0x@server:~$ top -H -b -n 1 | head -n 25
top - 10:11:12 up 12 days, 3:44, 1 user, load average: 6.14, 5.98, 5.22
Threads: 421 total, 3 running, 418 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.0 us, 2.1 sy, 0.0 ni, 78.4 id, 0.0 wa, 0.0 hi, 11.5 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32 root 20 0 0 0 0 R 96.7 0.0 78:22.41 ksoftirqd/3
What it means: High si (softirq) plus a hot ksoftirqd/N thread is a smoking gun. Your CPU time is being spent on deferred interrupt work, commonly networking.
Decision: Move immediately to measuring softirqs and mapping interrupts to devices. Don’t tune the application yet.
Task 2: See which softirq classes are hot
cr0x@server:~$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 0 0 0 0
TIMER: 10234567 9123456 9234567 9012345
NET_TX: 884 12933 11822 11540
NET_RX: 90345678 1245678 1300045 1219990
BLOCK: 112233 110998 111102 109876
IRQ_POLL: 0 0 0 0
TASKLET: 3333 2888 3011 2999
SCHED: 400000 398000 402000 399000
HRTIMER: 222 211 219 210
RCU: 600000 590000 610000 605000
What it means: NET_RX massively skewed to CPU0 screams “RX processing is concentrated.” This often correlates with one RX queue interrupt landing on CPU0.
Decision: Inspect /proc/interrupts and NIC queue configuration. Your goal is to distribute RX queue interrupts or enable/verify RSS.
Task 3: Identify hot IRQ lines and whether they’re imbalanced
cr0x@server:~$ awk 'NR==1 || /eth0|nvme|i915|virtio|mlx|ixgbe|enp/ {print}' /proc/interrupts
CPU0 CPU1 CPU2 CPU3
35: 81234567 120332 110221 118877 PCI-MSI 524288-edge eth0-TxRx-0
36: 1200 22100333 118900 119010 PCI-MSI 524289-edge eth0-TxRx-1
37: 1100 111200 20300111 119100 PCI-MSI 524290-edge eth0-TxRx-2
38: 900 112300 120010 19899110 PCI-MSI 524291-edge eth0-TxRx-3
92: 900000 910000 905000 899000 PCI-MSI 1048576-edge nvme0q0
93: 120000 118000 119000 121000 PCI-MSI 1048577-edge nvme0q1
What it means: Here the NIC queue 0 IRQ is absurdly hot on CPU0. Other queues look healthier. This is a classic imbalance.
Decision: Fix affinity for the hot vector and confirm RSS spreads flows. If queue 0 is legitimately busiest due to hashing, increase queues or adjust RSS indirection.
Task 4: Confirm whether irqbalance is running and what it thinks it should do
cr0x@server:~$ systemctl status irqbalance --no-pager
● irqbalance.service - irqbalance daemon
Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 08:12:01 UTC; 1 day 02:00 ago
Main PID: 812 (irqbalance)
Tasks: 1 (limit: 38121)
Memory: 3.8M
CPU: 21min 33.120s
What it means: irqbalance is running, so either (a) it can’t move that IRQ (some are pinned/unmovable), (b) it is configured to avoid certain CPUs, or (c) it’s making a poor choice for your workload.
Decision: Check irqbalance configuration and IRQ affinity masks. Decide whether to tune irqbalance or override specific IRQs manually.
Task 5: Check current affinity mask for a specific IRQ
cr0x@server:~$ cat /proc/irq/35/smp_affinity_list
0
What it means: IRQ 35 is pinned to CPU0 only. irqbalance can’t help if something (or someone) hard-pinned it.
Decision: If this is not intentional, change it. If CPU0 is reserved for housekeeping and you want that, pin it elsewhere.
Task 6: Move a hot IRQ to another CPU (quick test)
cr0x@server:~$ echo 2 | sudo tee /proc/irq/35/smp_affinity
2
cr0x@server:~$ cat /proc/irq/35/smp_affinity_list
1
What it means: The IRQ is now routed to CPU1 (mask bit 1). This is a blunt instrument but excellent for proving causality.
Decision: If latency improves immediately and ksoftirqd calms down, you’ve confirmed interrupt placement is the problem. Then do the durable fix (queue-aware mapping, irqbalance policy, NUMA-aware placement).
Task 7: Identify the NIC, driver, and bus location (NUMA hints)
cr0x@server:~$ ethtool -i eth0
driver: ixgbe
version: 6.1.0
firmware-version: 0x800003e2
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
What it means: bus-info is the PCI address. You can map that to a NUMA node and CPU locality.
Decision: Keep interrupts on CPUs local to the NIC’s NUMA node whenever possible.
Task 8: Find the device’s NUMA node and local CPU list
cr0x@server:~$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
1
cr0x@server:~$ cat /sys/devices/system/node/node1/cpulist
16-31
What it means: The NIC is attached to NUMA node 1, local CPUs 16–31. If your IRQs are on CPU0–3, you’re paying cross-node costs.
Decision: Place NIC IRQs on CPUs 16–31, and consider placing the NIC-heavy workloads there too.
Task 9: Check NIC queue/channel counts (do you even have enough queues?)
cr0x@server:~$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 16
TX: 16
Other: 0
Combined: 16
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
What it means: The NIC supports up to 16 combined channels but is currently using 4. If you have many cores and high packet rates, 4 queues may be a bottleneck.
Decision: Increase combined channels if the workload benefits and the system has CPU to handle it. But don’t go wild; more queues can mean more overhead and worse cache locality.
Task 10: Increase NIC combined queues (carefully) and re-check interrupts
cr0x@server:~$ sudo ethtool -L eth0 combined 8
cr0x@server:~$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 16
TX: 16
Other: 0
Combined: 16
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 8
What it means: You now have 8 queue pairs. That should create more MSI-X vectors and distribute work better—if RSS is configured and your flows hash well.
Decision: Re-check /proc/interrupts after 30–60 seconds under load. If one queue still dominates, your hashing/indirection or traffic pattern may be the constraint.
Task 11: Inspect RSS indirection and hash key (is traffic being spread?)
cr0x@server:~$ ethtool -x eth0 | head -n 25
RX flow hash indirection table for eth0 with 8 RX ring(s):
0: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
16: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
32: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
48: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
RSS hash key:
6d:5a:2c:...:91
What it means: The table appears evenly distributed. That’s good. If it’s not, or it’s all zeros, you’re effectively single-queue.
Decision: If distribution is poor, adjust the indirection table (advanced) or fix the driver/firmware settings. If your workload is a single flow (one giant TCP stream), RSS won’t help much; you need application sharding or different transport behavior.
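If you do decide to rewrite the indirection table, ethtool -X can spread it evenly across the active rings; a minimal sketch, assuming the 8 rings from the example above:
cr0x@server:~$ sudo ethtool -X eth0 equal 8
cr0x@server:~$ ethtool -x eth0 | head -n 6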
Task 12: Check and tune interrupt coalescing (latency vs CPU trade)
cr0x@server:~$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on TX: on
rx-usecs: 50
rx-frames: 0
tx-usecs: 50
tx-frames: 0
What it means: Adaptive coalescing is on; the NIC/driver will change coalescing based on traffic. That’s usually okay for general use, but it can introduce latency variance.
Decision: For strict tail-latency workloads, consider disabling adaptive coalescing and setting conservative fixed values. Validate with real latency measurements, not vibes.
Task 13: Identify dropped packets or backlog overruns (symptom of NET_RX overload)
cr0x@server:~$ nstat | egrep 'TcpExtListenOverflows|IpInDiscards|UdpInErrors|TcpExtTCPBacklogDrop'
TcpExtTCPBacklogDrop 123
IpInDiscards 456
What it means: You’re dropping packets in the stack. This can happen when softirq processing can’t keep up, or when application accept/read can’t drain fast enough.
Decision: Fix interrupt distribution first. Then consider socket backlog tuning and application capacity. Don’t paper over an IRQ hotspot with bigger buffers unless you enjoy delayed failure.
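/proc/net/softnet_stat shows the same pressure per CPU: the second hex column counts packets dropped because that CPU’s input backlog was full. A minimal sketch to print it readably:
cr0x@server:~$ awk '{ printf "CPU%d dropped(hex)=%s\n", NR-1, $2 }' /proc/net/softnet_stat
Non-zero drops concentrated on one CPU should line up with the hot NET_RX CPU from Task 2.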
Task 14: Look for IRQ “nobody cared” messages and kernel warnings
cr0x@server:~$ journalctl -k -b | egrep -i 'irq|nobody cared|soft lockup|hard lockup' | tail -n 20
Dec 30 09:55:18 server kernel: irq 35: nobody cared (try booting with the "irqpoll" option)
Dec 30 09:55:18 server kernel: Disabling IRQ #35
What it means: This is serious. An IRQ got so noisy or misbehaved that the kernel decided it was broken and disabled it. That can take your NIC or storage offline in slow motion.
Decision: Treat it as a driver/firmware/hardware issue first. Check BIOS settings, update firmware, verify MSI/MSI-X stability, and examine whether interrupt moderation settings are pathological.
Task 15: Verify NVMe queue/IRQ mapping is sane
cr0x@server:~$ grep -E 'nvme[0-9]q' /proc/interrupts | head -n 10
92: 900000 910000 905000 899000 PCI-MSI 1048576-edge nvme0q0
93: 120000 118000 119000 121000 PCI-MSI 1048577-edge nvme0q1
94: 119000 121000 118000 120000 PCI-MSI 1048578-edge nvme0q2
What it means: Multiple NVMe queues exist and interrupts are relatively evenly distributed. If you see only nvme0q0 hot and others idle, the device might be limited to one queue or the workload is effectively single-threaded.
Decision: If NVMe interrupts are concentrated, check nvme_core.default_ps_max_latency_us power settings, driver parameters, and whether the block layer is mapping queues to CPUs as expected.
Task 16: Observe per-IRQ rate over time (not just totals)
cr0x@server:~$ for i in 1 2 3; do date; grep -E 'eth0-TxRx-0|eth0-TxRx-1|eth0-TxRx-2|eth0-TxRx-3' /proc/interrupts; sleep 1; done
Tue Dec 30 10:10:01 UTC 2025
35: 81234567 120332 110221 118877 PCI-MSI 524288-edge eth0-TxRx-0
36: 1200 22100333 118900 119010 PCI-MSI 524289-edge eth0-TxRx-1
Tue Dec 30 10:10:02 UTC 2025
35: 81310222 120350 110240 118899 PCI-MSI 524288-edge eth0-TxRx-0
36: 1210 22155880 118920 119030 PCI-MSI 524289-edge eth0-TxRx-1
Tue Dec 30 10:10:03 UTC 2025
35: 81389001 120360 110255 118920 PCI-MSI 524288-edge eth0-TxRx-0
36: 1220 22210210 118940 119050 PCI-MSI 524289-edge eth0-TxRx-1
What it means: You can eyeball deltas per second. If one IRQ increments much faster than others, that’s your hotspot. Totals hide rate changes.
Decision: Target the high-rate IRQ for balancing/pinning and confirm the rate distribution improves under representative load.
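If eyeballing deltas gets old, a small script can do the subtraction for you. A minimal sketch (call it irqrate.sh, a hypothetical name), not production code; the pattern and interval arguments are assumptions:
#!/bin/bash
# Rough per-interval interrupt rates for IRQ lines matching a pattern.
# Usage: ./irqrate.sh 'eth0-TxRx' 1   (both arguments optional)
pattern="${1:-eth0}"
interval="${2:-1}"

snapshot() {
  # Sum every purely numeric column (the per-CPU counters) for each matching IRQ line.
  grep -E "$pattern" /proc/interrupts |
    awk '{ s = 0; for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i; print $1, s }'
}

prev="$(snapshot)"
while sleep "$interval"; do
  cur="$(snapshot)"
  # Same grep, same line order: paste snapshots side by side and print the per-second delta.
  paste <(echo "$prev") <(echo "$cur") |
    awk -v dt="$interval" '{ printf "IRQ %-6s %10.0f irq/s\n", $1, ($4 - $2) / dt }'
  echo "---"
  prev="$cur"
done
Run it during representative load, watch which vectors dominate, and stop it with Ctrl-C.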
irqbalance on Debian 13: verify, tune, and know when to disable
irqbalance is often either blindly trusted or blindly blamed. Neither is adult behavior. Treat it like any other automation: it’s a policy engine with defaults tuned for “typical servers.” If your server is not typical, it will still be treated as typical unless you intervene.
What irqbalance does well
- Spreads IRQs across CPUs so CPU0 isn’t the designated sufferer.
- Responds to changing load without you hand-editing affinity masks at 3 a.m.
- Plays reasonably with MSI-X multi-queue devices.
What irqbalance does poorly (or can’t do at all)
- Latency-critical CPU isolation setups: It may move interrupts onto CPUs you intended to keep “clean” unless configured carefully.
- NUMA-sensitive placement: It may not always keep device interrupts local to the PCIe NUMA node in the way you want.
- Unmovable interrupts: Some IRQs are effectively pinned or handled in ways irqbalance can’t change.
Verify irqbalance configuration
cr0x@server:~$ grep -v '^\s*#' /etc/default/irqbalance | sed '/^\s*$/d'
ENABLED="1"
OPTIONS=""
What it means: Default config. No CPU ban masks, no special behavior.
Decision: If you’re seeing IRQ pinning anyway, something else is setting affinity (driver scripts, custom tuning, container runtime hooks, or old “optimization” leftovers).
See what CPUs are online and whether you’ve isolated some
cr0x@server:~$ lscpu | egrep 'CPU\(s\)|On-line CPU|NUMA node'
CPU(s): 32
On-line CPU(s) list: 0-31
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
What it means: Clean topology. If you also use isolation kernel parameters, check the kernel cmdline.
Decision: If you isolate CPUs (say 4–31) for workloads, you must ban those CPUs from interrupts using irqbalance options or manual affinity.
Inspect kernel cmdline for isolation-related parameters
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.12.0 root=/dev/mapper/vg0-root ro quiet isolcpus=4-31 nohz_full=4-31 rcu_nocbs=4-31
What it means: You’ve declared CPUs 4–31 special. Great. Now keep interrupts off them, or you’ve made a promise the kernel will break on your behalf.
Decision: Configure irqbalance to avoid 4–31, or disable irqbalance and set explicit masks for all relevant IRQs.
Ban CPUs from irqbalance (typical isolation pattern)
cr0x@server:~$ sudo sed -i 's/^OPTIONS=.*/OPTIONS="--banirq=0"/' /etc/default/irqbalance
cr0x@server:~$ sudo systemctl restart irqbalance
cr0x@server:~$ systemctl status irqbalance --no-pager | head -n 12
● irqbalance.service - irqbalance daemon
Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
Active: active (running) since Tue 2025-12-30 10:18:01 UTC; 2s ago
What it means: Example only: --banirq=0 bans a single IRQ, not a CPU, so it does nothing for CPU isolation. CPU banning is typically done via irqbalance’s CPU mask or CPU list options (which vary by version/build). The point: do not randomly “tune” options you don’t understand.
Decision: For CPU isolation, prefer explicit per-IRQ affinity or irqbalance CPU mask configuration appropriate to your installed irqbalance. Verify by reading /proc/irq/*/smp_affinity_list after restart.
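For orientation only, here is a hedged sketch of what a CPU-ban configuration can look like in /etc/default/irqbalance on builds that read environment variables from that file. The variable names (IRQBALANCE_BANNED_CPULIST, IRQBALANCE_BANNED_CPUS) exist in recent irqbalance versions, but availability varies, so confirm against your installed man page before trusting it:
# /etc/default/irqbalance (sketch; verify variable names for your irqbalance build)
ENABLED="1"
OPTIONS=""
# Newer builds accept a CPU list; older ones want a hex mask instead:
IRQBALANCE_BANNED_CPULIST="4-31"
# IRQBALANCE_BANNED_CPUS="fffffff0"
After restarting the service, verify by reading /proc/irq/*/smp_affinity_list, exactly as the Decision above says.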
Here’s the opinionated guidance: if you’re running a normal fleet of Debian servers, leave irqbalance enabled. If you’re running a specialized latency box with isolated CPUs, disable it and manage affinity with configuration management so it’s repeatable.
NIC interrupts: RSS, RPS/XPS, coalescing, and multi-queue sanity
Most “weird latency” cases that smell like interrupts are network-driven. Not because storage is innocent, but because packets can arrive at line rate and demand immediate CPU attention. A NIC can happily deliver more work per second than your kernel can digest if you configure it like it’s 2009.
Start with RSS (hardware receive scaling)
RSS spreads flows across RX queues in hardware. Each RX queue has its own interrupt vector. If RSS is off, or only one queue exists, one CPU ends up doing most of the receive work. Your throughput might still look okay, while tail latency gets wrecked by queueing and jitter.
RPS/XPS: software steering when RSS isn’t enough
RPS (Receive Packet Steering) can distribute packet processing across CPUs even if hardware RSS is limited. XPS can help distribute transmit processing. These are CPU features; they can improve balance but also add overhead. Use them when you have a reason, not because you saw a blog post in 2016.
cr0x@server:~$ ls -1 /sys/class/net/eth0/queues/
rx-0
rx-1
rx-2
rx-3
tx-0
tx-1
tx-2
tx-3
What it means: Multi-queue exists. Good. Now confirm the IRQ vectors align with those queues and aren’t all routed to the same CPU set.
Decision: Align each queue IRQ to CPUs local to the NIC, and avoid isolated CPUs if applicable.
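If you do reach for RPS (say, on a single-queue virtio NIC), it is configured per RX queue by writing a CPU bitmask to sysfs; a minimal sketch, steering rx-0 to CPUs 1-3 (mask 0xe), on the assumption those CPUs are not isolated:
cr0x@server:~$ echo e | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
cr0x@server:~$ cat /sys/class/net/eth0/queues/rx-0/rps_cpus
Writing 0 turns RPS back off for that queue. Like every other knob here, measure before and after.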
Interrupt coalescing: the latency tax you might be paying
If your service is latency-sensitive (RPC, databases, interactive APIs), coalescing settings can swing tail latency. Adaptive coalescing is designed to be generally efficient, not deterministic. Sometimes “generally efficient” is the enemy.
cr0x@server:~$ sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 25 tx-usecs 25
cr0x@server:~$ ethtool -c eth0 | egrep 'Adaptive|rx-usecs|tx-usecs'
Adaptive RX: off TX: off
rx-usecs: 25
tx-usecs: 25
What it means: You’ve reduced wait time before interrupts fire. That can reduce latency but increase CPU usage and interrupt rate.
Decision: If CPU headroom exists and latency improves, keep it. If CPU melts or drops increase, back off. The correct setting is the one that meets your SLO without wasting a core.
Joke #2: Tuning NIC coalescing is like seasoning soup—too little and it’s bland, too much and you’re suddenly drinking seawater.
Storage interrupts: NVMe, blk-mq, and the “disk is fine” trap
Storage latency issues often get blamed on the drive. Sometimes that’s correct. Often it’s not. NVMe devices are extremely fast, which means they can complete IO so quickly that the completion path becomes the bottleneck: interrupts, queue mapping, and CPU locality.
Check whether you’re CPU-bound in the block layer
cr0x@server:~$ iostat -x 1 3
Linux 6.12.0 (server) 12/30/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
7.12 0.00 3.01 0.22 0.00 89.65
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 1200.0 96000.0 0.0 0.00 0.55 80.00 800.0 64000.0 0.0 0.00 0.60 80.00 0.12 45.00
What it means: Device await is low, util is moderate. If your app latency is awful anyway, the device isn’t obviously the limiter; your completion processing, contention, or network path might be.
Decision: Correlate with IRQ/softirq metrics. If NVMe IRQs are high and concentrated, fix that first.
Verify NVMe power state/latency settings aren’t sabotaging you
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
What it means: This parameter caps the exit latency the driver will tolerate for autonomous power state transitions (APST). A value of 0 effectively disables APST, so the drive never drops into deep power-saving states; the stock default is much higher (typically 100000 µs), which allows deep power saving and can add wake latency on mostly idle devices.
Decision: For latency-sensitive systems, a low or zero value is usually what you want; for power-constrained hardware, allow more. Set it via kernel parameter and validate carefully; power policies are workload and platform-specific.
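If you decide to pin this down, the usual Debian route is a kernel parameter set through GRUB; a minimal sketch (0 disables APST entirely, trading power for latency):
# Append nvme_core.default_ps_max_latency_us=0 to GRUB_CMDLINE_LINUX in /etc/default/grub,
# by hand or via configuration management, then:
cr0x@server:~$ sudo update-grub
cr0x@server:~$ # after the next reboot, confirm it stuck:
cr0x@server:~$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us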
Virtualization quirks: virtio, vhost, and noisy neighbors
In VMs, you’re not just tuning Linux; you’re negotiating with the hypervisor. virtio-net and vhost can be excellent, but interrupt behavior changes. You can see “interrupt storms” that are really “exit storms” or host-side queue contention.
Inside the guest: check virtio IRQ distribution
cr0x@server:~$ grep -E 'virtio|vhost' /proc/interrupts | head -n 8
40: 22112233 1100223 1099888 1101001 PCI-MSI 327680-edge virtio0-input.0
41: 110022 19888777 990001 980002 PCI-MSI 327681-edge virtio0-output.0
What it means: If one virtio vector dominates and it’s pinned, you can still suffer imbalance. But in VMs, the host’s CPU pinning and vCPU topology matter just as much.
Decision: Fix guest affinity only after verifying host pinning/NUMA placement. Otherwise you’re rearranging furniture in a moving truck.
Check steal time and scheduling pressure
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 812345 12345 987654 0 0 1 3 1200 2400 6 2 90 0 2
What it means: Non-trivial st (steal) suggests host contention. That can mimic interrupt issues because your guest can’t run when it needs to service queues.
Decision: Escalate to the virtualization layer: CPU pinning, host IRQ routing, noisy neighbor mitigation, and ensuring virtio multi-queue is enabled end-to-end.
Three corporate mini-stories from the trenches
Incident: the wrong assumption (“CPU is only 30%, so it can’t be CPU”)
A mid-sized company ran a Debian-based API tier behind a load balancer. After a kernel upgrade, p99 latency doubled during busy hours. Dashboards showed average CPU at ~30–40% with lots of idle. The initial response was predictable: blame the new app release, blame the database, then blame the network team for “packet loss.”
The on-call SRE finally looked at per-core utilization and saw CPU0 pinned. The process list didn’t show a hot userland thread. It was ksoftirqd/0. Meanwhile CPU1–CPU31 were sipping tea. The team had assumed “CPU%” was a scalar. It’s not. It’s a distribution.
/proc/interrupts made it painfully obvious: the primary NIC RX/TX queue interrupt landed almost exclusively on CPU0. RSS was configured for multiple queues, but the IRQ affinity mask had been pinned during an earlier “performance tuning” attempt that never got reverted. irqbalance was running, but it can’t override a pin you’ve welded in place.
The fix was boring: remove the manual pinning, restart irqbalance, and then explicitly pin NIC queue interrupts to CPUs local to the NIC’s NUMA node. Latency returned to baseline without touching the application. The postmortem conclusion was harsher than the fix: the team’s assumption about CPU metrics was wrong, and it wasted hours of cross-team churn.
Optimization that backfired: “Turn off coalescing to reduce latency”
A different organization had a latency-sensitive service and a new engineer with good intentions. They read that interrupt coalescing “adds latency,” so they disabled adaptive coalescing and set rx-usecs to 0 everywhere. The graphs looked amazing in a synthetic test: lower median latency, snappy response, everyone happy.
Two weeks later, a traffic pattern changed. More small packets. More concurrent connections. Suddenly a subset of servers began dropping packets and showing periodic latency spikes that looked like garbage collection pauses. Again, CPU “average” looked fine. The real story was that interrupt rate went through the roof; one CPU per host became an interrupt concierge, and the rest of the machine was underutilized because the hot CPU couldn’t keep up with the pace of interrupts.
The team had optimized for the median and paid with the tail. Disabling coalescing didn’t just reduce latency; it removed a safety valve. Under heavy packet rates, the system spent too much time in interrupt and softirq paths. The fix was to re-enable adaptive coalescing, then set a moderate fixed baseline for RX/TX usecs aligned with their SLO. They also increased NIC queues and ensured interrupts were mapped to the right NUMA node.
The lesson was not “never change coalescing.” It was: coalescing is a control knob, not a religion. If you set it to “always lowest possible,” you’re choosing fragility under bursty workloads.
Boring but correct practice: keep an IRQ/affinity baseline and enforce it
One company ran Debian 13 on storage-heavy nodes: NVMe + 100GbE. They had a simple rule: every hardware class had a documented, versioned “interrupt and queue baseline.” When a new server got provisioned, configuration management applied it and a validation script checked it.
The script wasn’t fancy. It captured a snapshot of /proc/interrupts, queue counts from ethtool -l, NUMA node mapping from sysfs, and current affinity lists for hot IRQs. It also flagged CPU0 hotspots and any IRQ pinned to isolated CPUs. If something drifted, the pipeline failed and the server didn’t join the pool.
During an otherwise nasty incident—high latency on a subset of nodes after a vendor firmware update—this baseline saved days. They could immediately see which nodes deviated: one batch had NIC queues reduced by the firmware reset, and interrupts collapsed onto two vectors. Rolling back or reapplying settings fixed the issue quickly.
The practice was unglamorous. No one got a conference talk out of it. It worked anyway. In production, “boring and correct” beats “clever and fragile” almost every time.
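A minimal sketch of that kind of baseline check, for flavor only; the NIC name, the expected channel count, and the “nothing pinned to CPU0” rule are assumptions standing in for whatever your documented baseline actually says:
#!/bin/bash
# Baseline drift check: refuse to enroll a node whose interrupt layout has drifted.
set -euo pipefail

NIC="eth0"            # assumption: adjust per hardware class
EXPECTED_CHANNELS=8   # assumption: from the documented baseline

# 1) NIC combined queue count must match the baseline (firmware resets love to change this).
current=$(ethtool -l "$NIC" | awk '/Current hardware settings/{f=1} f && /Combined:/{print $2; exit}')
if [ "${current:-0}" -ne "$EXPECTED_CHANNELS" ]; then
  echo "FAIL: $NIC combined channels = ${current:-unknown}, expected $EXPECTED_CHANNELS"
  exit 1
fi

# 2) No NIC or NVMe vector may be pinned exclusively to CPU0.
while read -r irq; do
  aff=$(cat "/proc/irq/$irq/smp_affinity_list" 2>/dev/null || echo "?")
  if [ "$aff" = "0" ]; then
    echo "FAIL: IRQ $irq is pinned to CPU0 only"
    exit 1
  fi
done < <(awk -v nic="$NIC" '$0 ~ nic || /nvme/ { gsub(":", "", $1); print $1 }' /proc/interrupts)

echo "OK: interrupt baseline holds"
Wire something like this into provisioning and the firmware-reset failure mode above becomes a failed pipeline instead of a multi-day incident.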
Common mistakes: symptom → root cause → fix
1) p99 latency spikes, average CPU low
Symptom: Tail latency is bad; CPU dashboards show plenty of idle.
Root cause: One CPU is overloaded by interrupts/softirqs (often CPU0). Average hides skew.
Fix: Check top -H, /proc/softirqs, and /proc/interrupts. Fix IRQ affinity and ensure RSS/multi-queue is active.
2) ksoftirqd pegged, NET_RX huge on one CPU
Symptom: ksoftirqd/N is hot; NET_RX skewed.
Root cause: RX queue interrupt pinned to one CPU; RSS not distributing; single flow dominating; or RPS misconfigured.
Fix: Increase queues, verify RSS indirection, distribute IRQ vectors across CPUs local to NIC, consider RPS for stubborn cases.
3) “irqbalance is running but nothing changes”
Symptom: irqbalance active; IRQ still stuck on one CPU.
Root cause: IRQ is manually pinned; driver enforces affinity; IRQ is unmovable; or irqbalance is constrained by banned CPU masks.
Fix: Inspect /proc/irq/*/smp_affinity_list. Remove manual pins or adjust policy. Validate after restarting irqbalance.
4) Latency improved briefly after pinning, then got worse
Symptom: Quick win followed by regression under real traffic.
Root cause: Pinning improved locality but overloaded a subset of CPUs; queue distribution/hashing changed; or you created contention with application threads on the same CPUs.
Fix: Make IRQ placement match CPU scheduling strategy: reserve CPUs for interrupts, or align application threads to the same NUMA node and avoid fighting over the same cores.
5) Packet drops increase after “latency tuning”
Symptom: Retransmits/backlog drops after disabling coalescing/offloads.
Root cause: Interrupt rate too high; CPU can’t keep up; buffers overflow.
Fix: Re-enable adaptive coalescing or set moderate coalescing values; keep RSS and multi-queue; verify no single IRQ dominates.
6) Storage latency blamed on NVMe, but iostat looks clean
Symptom: App sees stalls; NVMe device metrics look fine.
Root cause: CPU-side completion processing or cross-NUMA interrupt handling; sometimes IRQ saturation on a CPU shared with networking.
Fix: Check NVMe IRQ distribution and NUMA locality; avoid mixing hot NIC and NVMe interrupts on the same CPU set if it creates contention.
Checklists / step-by-step plan
Checklist A: Stop the bleeding in 15 minutes
- Capture evidence: top -H, /proc/softirqs, /proc/interrupts, and latency graphs at the same timestamp.
- Identify the single hottest IRQ line/vector and map it to a device/queue name.
- Check its affinity: cat /proc/irq/<IRQ>/smp_affinity_list.
- If it’s pinned to one CPU and that CPU is hot, move it as a test (one step) and re-measure p99.
- If moving helps, implement a durable policy: either tune irqbalance or set persistent affinity rules.
- Verify no packet drops/regressions: nstat, driver stats, and application error rates.
Checklist B: Make the fix durable (don’t rely on heroics)
- Document intended CPU topology: which CPUs are for housekeeping, which are isolated, which are workload.
- Record NIC and NVMe NUMA locality and ensure IRQs are placed accordingly.
- Set NIC queue counts explicitly (and verify after reboot/firmware update).
- Pick a coalescing policy: adaptive for general servers; fixed for strict latency with headroom.
- Decide on irqbalance: enabled for general purpose; disabled + explicit affinity for specialized nodes.
- Automate validation: fail provisioning if IRQs collapse onto CPU0 or isolated cores.
Checklist C: Confirm improvements with real measurements
- Measure end-to-end p95/p99 and error rates for at least one business cycle (not 60 seconds of hope).
- Confirm interrupt distribution over time using deltas (not just totals).
- Watch CPU softirq time and ksoftirqd residency.
- Validate no new drops/retransmits appeared.
- Keep a before/after snapshot of /proc/interrupts for the postmortem.
FAQ
1) What exactly is an IRQ storm on Linux?
It’s a condition where interrupt activity (hard IRQs or the softirq work they trigger) overwhelms CPUs, causing latency, drops, and jitter. It’s often “too many interrupts” or “interrupts in the wrong place.”
2) Why does CPU0 always seem to be the victim?
Boot-time defaults, legacy interrupt routing, and accidental pinning frequently land work on CPU0. Also, many housekeeping tasks naturally run there. If you let devices dogpile onto CPU0, it becomes your bottleneck core.
3) Should I just disable irqbalance?
On general-purpose servers: no, keep it. On specialized latency systems with CPU isolation or strict affinity requirements: yes, disable it and manage affinity explicitly. The worst option is “disable it and do nothing else.”
4) How do I know whether it’s network or storage?
Look at /proc/softirqs and /proc/interrupts. High NET_RX and hot NIC queue IRQs point to network. High block-related IRQs and NVMe queue vectors point to storage. Then correlate with workload timing.
5) If I increase NIC queues, will latency always improve?
No. More queues can reduce contention and spread load, but can also increase overhead and worsen cache locality. It’s a tool, not a guarantee. Measure after each change.
6) Can a single TCP flow defeat RSS?
Yes. RSS hashes flows; one flow maps to one RX queue. If your workload is dominated by a few heavy flows, you might still hotspot a queue. Fix by sharding traffic or changing how the workload fans out.
7) What’s the difference between RSS and RPS?
RSS is hardware-based distribution into multiple RX queues. RPS is software-based steering of packet processing across CPUs. RSS is usually preferable when available; RPS is a fallback or augmentation.
8) Why did my tuning disappear after reboot or firmware update?
Many settings (queue counts, coalescing, affinity) are not persistent by default. Firmware updates can reset NIC state. Make configuration persistent via systemd units, udev rules, or configuration management—and verify on boot.
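One way to make affinity survive reboots is a oneshot systemd unit; a minimal sketch with a hypothetical unit name, IRQ number, and CPU list (a real implementation should resolve IRQ numbers by device/queue name at boot, because they can change across reboots and driver reloads):
# /etc/systemd/system/irq-affinity.service (sketch)
[Unit]
Description=Pin hot NIC IRQs to NUMA-local CPUs
After=network.target

[Service]
Type=oneshot
# Example: route IRQ 35 to CPUs 16-23; repeat or loop for each vector you manage.
ExecStart=/bin/sh -c 'echo 16-23 > /proc/irq/35/smp_affinity_list'

[Install]
WantedBy=multi-user.target
Enable it with systemctl enable --now irq-affinity.service, keep it in configuration management, and pair it with the validation habit from the checklists above.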
9) How does NUMA affect interrupts and latency?
If a device is attached to NUMA node 1 but its interrupts run on CPUs in node 0, you increase cross-node traffic and cache misses. That adds jitter and reduces headroom. Place IRQs on local CPUs when possible.
10) I moved an IRQ and the problem improved, but not completely. Now what?
Good: you proved causality. Next step is systematic distribution: ensure all queues are used, map vectors across a CPU set, check coalescing, and avoid placing hot interrupts on CPUs running the busiest application threads unless you intend to.
Conclusion: next steps that actually reduce p99
IRQ storms and interrupt imbalance aren’t mysterious. They’re measurable. On Debian 13, you already have what you need: /proc/interrupts, /proc/softirqs, ethtool, and a clear head.
Do this next, in order:
- Prove whether you’re interrupt/softirq bound (top -H, /proc/softirqs).
- Map hot IRQs to devices and queues (/proc/interrupts plus ethtool -i / -l).
- Fix distribution and locality: queues, RSS, affinity masks, NUMA placement.
- Only then: tune coalescing and offloads, with real latency measurements and rollback plans.
- Make it persistent and validated, because the only thing worse than a storm is a storm that returns after a reboot.