WireGuard is Slow: MTU, Routing, CPU — Speed It Up Without Guesswork

WireGuard “works,” which is the most dangerous state a VPN can be in. You can ping, SSH, maybe even open a web page—yet file transfers crawl,
backups miss their window, and your “fast” fiber looks like a motel Wi‑Fi connection with commitment issues.

This is the field guide for turning “WireGuard is slow” into a measured bottleneck, a specific fix, and a repeatable method. No folklore. No
random MTU numbers from a forum post written in 2019. We’ll test, read the counters, and make changes you can defend in a postmortem.

The mental model: where WireGuard can be slow

WireGuard is “just” a network interface and a UDP transport. That simplicity is why it’s beloved—and why performance problems often come from
everything around it: MTU, routing, NAT, NIC driver behavior, kernel queues, and CPU scheduling.

There are four common failure modes that present as “slow”:

  • Path MTU / fragmentation blackholes: small packets work, big transfers stall or oscillate.
  • Wrong route / wrong policy: traffic hairpins, traverses NAT twice, or exits the wrong interface.
  • CPU bottleneck: one core pins at 100% during iperf; throughput tops out at a suspiciously round number.
  • Loss/buffering on UDP: TCP inside the tunnel reacts badly when the underlay drops or reorders its UDP packets.

You want to avoid “tuning” until you know which one you have. Blind tuning is how you end up with a configuration that only works on Tuesdays.

Here’s the operational lens: start with a single flow you can reproduce (iperf3 is fine). Confirm whether the bottleneck is the local host, the
remote host, or the path. Then apply the smallest change that moves the needle, and measure again.

One line worth keeping on a sticky note, a favorite of SREs everywhere: “Hope is not a strategy.”

Joke #1: If you’re changing MTU values at random, you’re not tuning a VPN—you’re doing numerology with extra steps.

Fast diagnosis playbook (first/second/third)

First: prove whether it’s MTU/fragmentation

  1. Run a PMTU-style ping test (don’t guess; a sweep sketch follows this list). If large DF pings fail, you have a blackhole or mismatch.
  2. Check counters for fragmentation needed / ICMP blocked.
  3. If PMTU is broken, stop and fix MTU or MSS clamping before touching anything else.
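
Here is a minimal DF-ping sweep sketch, assuming the peer tunnel IP used later in this guide (10.60.0.10) and an IPv4 path; sizes are ICMP payload
bytes, so the on-wire packet is the payload plus 28:

for size in 1472 1452 1432 1412 1392 1372 1352; do
  # -M do forbids fragmentation; -W 2 keeps a blackholed probe from hanging the loop
  if ping -M do -s "$size" -c 1 -W 2 10.60.0.10 >/dev/null 2>&1; then
    echo "payload $size passes (wire size $((size + 28)))"
  else
    echo "payload $size blocked"
  fi
done

The largest payload that passes reliably, plus 28, is your effective path MTU from this end. Run it from both ends; paths and filters are not
always symmetric.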

Second: prove whether it’s CPU

  1. Run iperf3 over the tunnel while watching per-core usage and softirq load (a combined sketch follows this list).
  2. If one core pins (or ksoftirqd goes wild), you’re CPU/interrupt limited.
  3. Fix by confirming you’re on the in-kernel WireGuard (not a userspace wireguard-go fallback), improving NIC/IRQ distribution, or scaling out across flows, peers, or hosts.
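
A minimal sketch for step 1, assuming an iperf3 server is already listening on the far tunnel IP (10.60.0.10) and sysstat is installed:

mpstat -P ALL 1 15 > /tmp/cpu-during-iperf.txt &   # sample every core for the test window
iperf3 -c 10.60.0.10 -t 15                         # single flow over the tunnel
wait                                               # let the sampler finish
less /tmp/cpu-during-iperf.txt                     # look for one core dominating %soft or %sys

If Retr climbs in iperf3 at the same moment one core’s %soft spikes, you have the diagnosis in a single capture.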

Third: prove routing and path correctness

  1. Verify the route for the destination (and the source address selection) from the sending host.
  2. Check for asymmetric routing: one direction uses the tunnel, the other uses the WAN.
  3. Confirm NAT and firewall rules aren’t rewriting or rate-limiting UDP.

If all three look sane, treat it as UDP loss/queuing

  1. Measure loss/retransmits via TCP stats and interface counters.
  2. Check qdisc, shaping, bufferbloat, and the underlay link health.
  3. Only then consider advanced tuning (socket buffers, fq, pacing).

Interesting facts and context (why these problems exist)

  • WireGuard entered the Linux kernel in 2020, which made performance and deployment dramatically more predictable than out-of-tree modules.
  • It rides over UDP by design, partly to avoid TCP-over-TCP meltdown and partly for simpler NAT traversal—yet it inherits UDP’s “best effort” reality.
  • Its cryptography uses ChaCha20-Poly1305, chosen for strong performance on systems without AES acceleration; on many CPUs it’s blisteringly fast.
  • Path MTU Discovery has been fragile for decades because it depends on ICMP “fragmentation needed” messages that firewalls love to drop.
  • Ethernet’s classic MTU of 1500 is a historical artifact, not a law of physics; tunnels add headers and make 1500 a trap.
  • Linux offloads (GSO/GRO/TSO) can make packet captures look “wrong” and can also hide performance problems until a driver update changes behavior.
  • WireGuard peers are identified by public keys, not IPs; routing mistakes often manifest as “it connects but it’s slow” when traffic matches the wrong AllowedIPs.
  • Cloud networks frequently encapsulate your packets already (VXLAN/Geneve), so your tunnel is a tunnel inside a tunnel—MTU death by a thousand headers.
  • TCP throughput collapse can come from tiny loss rates on long fat networks; VPN overhead is rarely the main enemy compared to loss and RTT.

Practical tasks: commands, outputs, decisions

Below are practical tasks you can run on Linux hosts (or inside Linux VMs) to find the bottleneck. Each includes: the command, what typical
output means, and the decision you should make next. Do them in order if you want speed without superstition.

Task 1: Confirm you’re actually testing over WireGuard

cr0x@server:~$ ip route get 10.60.0.10
10.60.0.10 dev wg0 src 10.60.0.1 uid 1000
    cache

Meaning: Traffic to 10.60.0.10 goes out wg0 with source 10.60.0.1.
Decision: If you don’t see dev wg0, stop. Fix routing/AllowedIPs/policy routing first or your tests are garbage.

Task 2: Inspect WireGuard peer health and whether you’re roaming endpoints

cr0x@server:~$ sudo wg show wg0
interface: wg0
  public key: 2r4...redacted...Kk=
  listening port: 51820

peer: q0D...redacted...xw=
  endpoint: 203.0.113.44:51820
  allowed ips: 10.60.0.10/32, 10.20.0.0/16
  latest handshake: 28 seconds ago
  transfer: 18.42 GiB received, 25.11 GiB sent
  persistent keepalive: every 25 seconds

Meaning: Handshake is recent; endpoint is stable; traffic counters move.
Decision: If latest handshake is old or endpoint keeps changing unexpectedly, suspect NAT timeouts, roaming, or firewall state issues—expect loss and jitter.

Task 3: Baseline the underlay (non-VPN) throughput and latency

cr0x@server:~$ ping -c 5 203.0.113.44
PING 203.0.113.44 (203.0.113.44) 56(84) bytes of data.
64 bytes from 203.0.113.44: icmp_seq=1 ttl=53 time=19.8 ms
64 bytes from 203.0.113.44: icmp_seq=2 ttl=53 time=20.4 ms
64 bytes from 203.0.113.44: icmp_seq=3 ttl=53 time=19.9 ms
64 bytes from 203.0.113.44: icmp_seq=4 ttl=53 time=62.1 ms
64 bytes from 203.0.113.44: icmp_seq=5 ttl=53 time=20.2 ms

--- 203.0.113.44 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 19.8/28.5/62.1/16.8 ms

Meaning: Underlay RTT has spikes (62 ms). VPN will amplify that into TCP throughput pain.
Decision: If underlay jitter/loss is present, don’t expect miracles from MTU tweaks; you may need queue management or a better path/provider.
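
If the jitter or loss is real, a hop-by-hop report helps you localize it before blaming the tunnel. A minimal sketch, assuming mtr is installed
and using the example endpoint above:

mtr -rw -c 100 203.0.113.44   # 100 probes per hop; the report shows per-hop loss and latency spread

If loss starts at a hop inside your provider or at a peering point, that is escalation evidence, not a WireGuard problem.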

Task 4: Measure tunnel throughput with iperf3 (single flow)

cr0x@server:~$ iperf3 -c 10.60.0.10 -t 15
Connecting to host 10.60.0.10, port 5201
[  5] local 10.60.0.1 port 43144 connected to 10.60.0.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-1.00   sec  62.2 MBytes   522 Mbits/sec    0
[  5]   1.00-2.00   sec  61.8 MBytes   518 Mbits/sec    1
[  5]   2.00-3.00   sec  44.9 MBytes   377 Mbits/sec   12
[  5]   3.00-4.00   sec  58.2 MBytes   488 Mbits/sec    3
[  5]  14.00-15.00 sec  60.1 MBytes   504 Mbits/sec    2
- - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-15.00  sec   848 MBytes   474 Mbits/sec   44  sender
[  5]   0.00-15.00  sec   846 MBytes   473 Mbits/sec        receiver

Meaning: Throughput is unstable and there are retransmits.
Decision: Retransmits over a VPN usually mean MTU blackholing, underlay loss/jitter, or buffering/queue issues. Next: MTU tests and loss counters.

Task 5: Measure with multiple parallel flows (to detect single-core limits)

cr0x@server:~$ iperf3 -c 10.60.0.10 -P 8 -t 15
[SUM]   0.00-15.00  sec  3.62 GBytes  2.07 Gbits/sec  81  sender
[SUM]   0.00-15.00  sec  3.61 GBytes  2.07 Gbits/sec      receiver

Meaning: Parallelism improved throughput significantly.
Decision: If -P 8 is much faster than single flow, you may be CPU-limited per flow, or TCP is struggling with loss/RTT. Check CPU and qdisc next.

Task 6: Check MTU on wg0 and the underlay interface

cr0x@server:~$ ip link show wg0
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
cr0x@server:~$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff

Meaning: wg0 is at 1420, the common default: 1500 minus 80 bytes of headroom, enough for WireGuard even over an IPv6 underlay.
Decision: If the underlay MTU is smaller than you think (cloud overlay, PPPoE, etc.), 1420 can still be too high. Don’t guess—test PMTU.

Task 7: PMTU test with DF ping through the tunnel

cr0x@server:~$ ping -M do -s 1372 -c 3 10.60.0.10
PING 10.60.0.10 (10.60.0.10) 1372(1400) bytes of data.
1380 bytes from 10.60.0.10: icmp_seq=1 ttl=64 time=23.4 ms
1380 bytes from 10.60.0.10: icmp_seq=2 ttl=64 time=23.1 ms
1380 bytes from 10.60.0.10: icmp_seq=3 ttl=64 time=23.3 ms

--- 10.60.0.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms

Meaning: Payload 1372 (1400 bytes on the wire once the ICMP and IPv4 headers are added) works. Good sign.
Decision: Increase until it fails; the failure point tells you real PMTU. If it fails unexpectedly low, you likely have encapsulation overhead or ICMP blocked.

cr0x@server:~$ ping -M do -s 1412 -c 3 10.60.0.10
PING 10.60.0.10 (10.60.0.10) 1412(1440) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420

--- 10.60.0.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2038ms

Meaning: Your local wg0 MTU stops you at 1420—this is not a path test yet, it’s an interface constraint.
Decision: If real path MTU is lower than wg0 MTU, you’ll see failures at smaller sizes too (or weird stalls). Continue by testing near wg0 MTU and watching for loss/retransmits.

Task 8: Check underlay interface counters for errors and drops

cr0x@server:~$ sudo ip -s -s link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    RX:  bytes packets errors dropped  missed   mcast
    9012345678 8123456      0    1452       0  120345
    TX:  bytes packets errors dropped carrier collsns
    8123456789 7012345      0       0       0       0

Meaning: RX drops exist (1452). That might be congestion, ring overflow, or policing upstream.
Decision: If drops rise during iperf, treat it as underlay or host receive path pressure. Investigate NIC rings/interrupts/qdisc and upstream shaping.

Task 9: Observe TCP health (retransmits, congestion) during a transfer

cr0x@server:~$ ss -ti dst 10.60.0.10
ESTAB 0 0 10.60.0.1:43144 10.60.0.10:5201
	 cubic wscale:7,7 rto:204 rtt:24.1/2.1 ato:40 mss:1360 pmtu:1420 rcvmss:1360 advmss:1360 cwnd:64 bytes_acked:8123456 segs_out:6021 segs_in:5844 send 1.9Gbps lastsnd:8 lastrcv:8 lastack:8 pacing_rate 3.8Gbps delivery_rate 1.7Gbps retrans:12/44

Meaning: MSS 1360, PMTU 1420. Retransmissions exist.
Decision: Retrans plus a stable PMTU suggests loss/jitter/queuing, not just MTU mismatch. If MSS/PMTU look wrong (too high), fix MTU/MSS clamp first.

Task 10: Check CPU saturation and softirq during load

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0 (server) 	12/27/2025 	_x86_64_	(8 CPU)

12:10:01 AM  CPU   %usr %nice  %sys %iowait  %irq %soft  %steal  %idle
12:10:02 AM  all   22.1  0.0   18.4   0.0    0.0  20.9    0.0   38.6
12:10:02 AM    0   12.0  0.0   10.1   0.0    0.0  62.3    0.0   15.6
12:10:02 AM    1   28.4  0.0   21.0   0.0    0.0   8.2    0.0   42.4
12:10:02 AM    2   30.1  0.0   25.7   0.0    0.0   5.9    0.0   38.3
12:10:02 AM    3   29.8  0.0   19.4   0.0    0.0   6.1    0.0   44.6
12:10:02 AM    4   18.0  0.0   15.2   0.0    0.0  28.0    0.0   38.8
12:10:02 AM    5   20.5  0.0   17.1   0.0    0.0  26.4    0.0   36.0
12:10:02 AM    6   19.7  0.0   15.9   0.0    0.0  25.1    0.0   39.3
12:10:02 AM    7   17.6  0.0   13.4   0.0    0.0  29.0    0.0   40.0

Meaning: CPU0 is heavy in %soft (softirq). That’s often network receive processing and can cap throughput.
Decision: If one CPU’s softirq dominates, you likely need better IRQ distribution (RSS/RPS), NIC queue tuning, or to move the workload off a tiny VM.

Task 11: Check WireGuard and UDP socket buffer pressure

cr0x@server:~$ netstat -su
Udp:
    2212345 packets received
    0 packets to unknown port received
    0 packet receive errors
    1832 packets sent
    0 receive buffer errors
    412 send buffer errors
UdpLite:
IpExt:
    InNoRoutes: 0
    InBcastPkts: 0

Meaning: The send buffer errors counter suggests the host couldn’t queue outbound UDP fast enough (or the qdisc/NIC is congested).
Decision: Investigate qdisc, NIC TX queues, and potentially increase socket buffers (carefully, with measurement). Also check for egress policing.

Task 12: Check qdisc (queue discipline) and whether you’re using something sensible

cr0x@server:~$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn

Meaning: fq_codel is generally a good default for latency and fairness.
Decision: If you see pfifo_fast, or large backlogs building on the egress that matters, consider switching to fq_codel or cake (where available), as sketched below.
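
A minimal sketch for switching, assuming eth0 is the egress that matters; the cake bandwidth figure is an example and should sit just under your
real uplink rate if you shape here:

sudo tc qdisc replace dev eth0 root fq_codel
# or, where the sch_cake module is available:
sudo tc qdisc replace dev eth0 root cake bandwidth 900mbit

The change applies immediately but does not survive a reboot; persist it in your network configuration once it proves out.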

Task 13: Verify policy routing (common in split tunnel setups)

cr0x@server:~$ ip rule show
0:	from all lookup local
1000:	from 10.60.0.0/24 lookup 100
32766:	from all lookup main
32767:	from all lookup default
cr0x@server:~$ ip route show table 100
default dev wg0 scope link

Meaning: Traffic sourced from 10.60.0.0/24 uses table 100 and defaults to wg0.
Decision: If replies don’t use the same policy, you’ll get asymmetry, weird stalls, and “only some apps are slow.” Fix rules so both directions agree.

Task 14: Look for conntrack/NAT issues that punish UDP

cr0x@server:~$ sudo conntrack -S
cpu=0 found=912345 invalid=12 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=120
cpu=1 found=901234 invalid=8 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=98

Meaning: invalid exists but isn’t crazy; search_restart can hint at table pressure.
Decision: If invalid climbs quickly during VPN load, verify NAT rules, UDP timeouts, and whether middleboxes are rewriting or expiring state too aggressively.

Task 15: Verify offloads and whether the NIC driver is doing something “creative”

cr0x@server:~$ sudo ethtool -k eth0 | egrep 'gro|gso|tso|tx-checksum|rx-checksum'
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on

Meaning: Offloads are enabled. Usually good. Occasionally disastrous with certain virtual NICs or buggy drivers.
Decision: If you see high CPU, odd packet capture behavior, or tunnel throughput collapses after a kernel/driver update, test disabling GRO on wg0 or on the underlay as a controlled experiment—then revert if it doesn’t help.

Task 16: Capture evidence without lying to yourself (tcpdump with clarity)

cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
01:12:01.123456 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 148
01:12:01.143211 IP 203.0.113.44.51820 > 198.51.100.10.51820: UDP, length 92
01:12:01.163001 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 1200
01:12:01.182992 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 1200
01:12:01.203114 IP 203.0.113.44.51820 > 198.51.100.10.51820: UDP, length 92
5 packets captured

Meaning: You see bidirectional UDP traffic on the WireGuard port. Good.
Decision: If traffic is one-way only, performance problems are secondary—you have a path/firewall/NAT failure in one direction.

MTU and fragmentation: the most common “it’s fine” lie

When WireGuard is slow, MTU is guilty often enough that you should treat it as a default suspect—but not a default fix. The correct approach is:
determine the effective path MTU, then set wg0 MTU (or clamp TCP MSS) so your traffic never depends on fragmented packets being delivered reliably.

What actually happens

WireGuard encapsulates your packets inside UDP. That means extra headers. If your inner packet is sized to a 1500-byte path, the outer packet can
exceed 1500 and either:

  • get fragmented (best case: fragments arrive; worst case: fragments drop), or
  • get dropped with an ICMP “fragmentation needed” message (if PMTUD works), or
  • get silently dropped (PMTUD blackhole, the classic).

The failure patterns are distinctive:

  • Small packets OK, big packets fail: SSH works, file copy stalls, web pages partially load.
  • Throughput sawtooths: TCP ramps up, hits a wall, collapses, repeats.
  • “Fixes itself” on different networks: because PMTU differs across paths and ICMP filtering policies.

Don’t worship 1420

1420 is a reasonable default for an IPv4 underlay with a 1500 MTU (the header math is sketched after this list). But:

  • If you’re running WireGuard over IPv6 underlay, overhead differs.
  • If your “1500 MTU” link is actually an overlay (cloud SDN) or PPPoE, you may have less.
  • If there’s IPSec, GRE, VXLAN, or “helpful” middleboxes, you have less.
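
For reference, the header arithmetic behind those defaults, assuming a plain Ethernet underlay and the 32 bytes a WireGuard data packet adds on
top of the outer IP and UDP headers:

echo "IPv4 underlay:  $((1500 - 20 - 8 - 32)) max inner MTU"   # 1440
echo "IPv6 underlay:  $((1500 - 40 - 8 - 32)) max inner MTU"   # 1420
echo "PPPoE + IPv4:   $((1492 - 20 - 8 - 32)) max inner MTU"   # 1432

Every additional encapsulation on the underlay subtracts again, which is why measured PMTU beats arithmetic once overlays are involved.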

Two solid strategies

  1. Set wg0 MTU to a known-safe value.
    This affects all traffic, including UDP-based apps. It’s blunt but reliable.
  2. Clamp TCP MSS on the tunnel ingress/egress.
    This only affects TCP and can preserve larger MTU for UDP flows that can handle fragmentation (or don’t send huge payloads).

Practical MTU workflow you can defend

  1. Run DF ping tests to the far tunnel IP (Task 7), find the largest payload that passes reliably.
  2. Convert consistently between layers: ICMP payload, on-the-wire packet size, and interface MTU are different numbers.
  3. Set wg0 MTU to a value that leaves headroom, not one that “barely passes on a good day.”
  4. Re-run iperf3 and compare retransmits and stability, not just peak throughput.

Example: setting MTU safely

cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
cr0x@server:~$ ip link show wg0
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1380 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none

Meaning: wg0 MTU is now 1380.
Decision: If performance stabilizes (fewer retransmits, more consistent throughput), keep it and document the measured PMTU basis. If nothing changes, MTU wasn’t your main bottleneck.
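
A change made with ip link does not survive recreating the interface. If wg-quick manages wg0, persist the value in its config; a minimal sketch,
assuming the addressing used in this guide:

# wg-quick reads an MTU key from the [Interface] section of /etc/wireguard/wg0.conf:
#
#   [Interface]
#   ...
#   MTU = 1380
#
# Restart the interface during a quiet window so the change applies cleanly:
sudo wg-quick down wg0 && sudo wg-quick up wg0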

Example: MSS clamping (iptables)

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -S FORWARD | tail -n 2
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j TCPMSS --clamp-mss-to-pmtu

Meaning: TCP SYN packets going out wg0 will have MSS clamped.
Decision: Use this when you can’t reliably control MTU end-to-end (multi-peer, roaming clients), or when you only see TCP issues. Validate with ss -ti that MSS/PMTU align.
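
If the gateway runs nftables rather than iptables, a close equivalent is sketched below; the table and chain names are examples, so adapt to the
ruleset you already have instead of pasting blindly:

sudo nft add table inet mangle
sudo nft 'add chain inet mangle forward { type filter hook forward priority mangle; policy accept; }'
sudo nft add rule inet mangle forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu

Validate it the same way as the iptables version: ss -ti during a transfer should show an MSS consistent with the clamp.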

Routing and policy routing: when packets take the scenic route

Routing mistakes don’t always break connectivity. They degrade it in ways that feel like “slow VPN” while the root cause is “packets are doing
weird tourism.” You see it most in split-tunnel setups, multi-homed servers, and environments with default routes that change (hello, laptops).

AllowedIPs is routing first, access control second

WireGuard’s AllowedIPs is a clever multipurpose field: on egress it decides which peer a destination is routed to, and on ingress it filters which
inner source addresses a peer may use. Together, that is the cryptokey routing table. The important part: if you define it wrong, traffic can be
sent to the wrong peer, dropped on arrival, never reach the tunnel at all, or leave with the wrong source address.

Asymmetric routing: the silent throughput killer

Your outbound traffic might go through WireGuard, but replies might return via the underlay, or through a different tunnel, or through a NAT path
that mangles state. TCP hates asymmetry. UDP hates it too; it’s just less vocal about it.

How to catch routing lies quickly

  • Use ip route get for the destination from the sender (Task 1).
  • Use ip rule and ip route show table X when policy routing is in play (Task 13).
  • Check reverse path filtering (rp_filter) if you have multiple interfaces and policy routing.

Reverse path filtering: “security feature” turned availability bug

cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter net.ipv4.conf.wg0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.wg0.rp_filter = 0

Meaning: Strict rp_filter is enabled globally and on eth0. In policy routing setups, that can drop legitimate asymmetric replies.
Decision: If you see unexplained drops and asymmetry, set rp_filter to 2 (loose) on involved interfaces or disable where appropriate—then verify with packet counters.

cr0x@server:~$ sudo sysctl -w net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.all.rp_filter = 2

Meaning: Loose mode. Still provides some sanity checks without breaking policy routing.
Decision: Make it persistent only after confirming it resolves the issue and doesn’t violate your threat model.
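
A minimal persistence sketch, once you have confirmed loose mode actually helps (the file name is an example):

echo 'net.ipv4.conf.all.rp_filter = 2' | sudo tee /etc/sysctl.d/99-wg-rpfilter.conf
sudo sysctl --system    # reapply all sysctl.d files and confirm the value sticks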

CPU and crypto: when your core count is a throughput limit

WireGuard is fast. But “fast” isn’t “free,” and the underlay can be fast enough to expose CPU as the ceiling. On small cloud instances, old
Xeons, busy hypervisors, or hosts doing a lot of conntrack/firewall work, you can absolutely pin a core and stop scaling.

What CPU bottleneck looks like

  • iperf3 plateaus at a stable bitrate, regardless of link capacity.
  • One core is pegged in %soft or %sys while others are mostly idle.
  • Parallel iperf3 streams increase throughput more than they “should.”
  • System time grows with packet rate, not with bytes transferred.

Distinguish crypto cost from packet processing cost

People love blaming crypto because it sounds sophisticated. Often, the real issue is packet processing overhead: interrupts, softirq, GRO/GSO
behavior, conntrack, and qdisc. WireGuard’s crypto is efficient; your host’s network path may not be.

Check for kernel time and softirq

cr0x@server:~$ top -b -n 1 | head -n 15
top - 01:20:11 up 16 days,  3:12,  1 user,  load average: 3.21, 2.88, 2.44
Tasks: 213 total,   2 running, 211 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.2 us,  0.0 ni, 27.6 sy,  0.0 id,  0.0 wa,  0.0 hi, 54.2 si,  0.0 st
MiB Mem :  16000.0 total,   2200.0 free,   4100.0 used,   9700.0 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  11200.0 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  934 root      20   0       0      0      0 S  62.0   0.0  31:22.11 ksoftirqd/0
 2112 root      20   0       0      0      0 S  18.0   0.0  12:04.22 ksoftirqd/4

Meaning: Softirq threads are burning CPU. That’s network processing overhead.
Decision: Investigate IRQ affinity, NIC queue count, RPS/XPS, and whether your VM is starved or pinned to a noisy neighbor host.

Interrupt distribution and queueing

cr0x@server:~$ grep -E 'virtio0|eth0' /proc/interrupts | head
  24:  8123456        0        0        0        0        0        0        0   PCI-MSI 327680-edge      virtio0-input.0
  25:        0  7012345        0        0        0        0        0        0   PCI-MSI 327681-edge      virtio0-output.0

Meaning: Interrupts are concentrated on specific CPUs.
Decision: If most interrupts land on one CPU, configure IRQ affinity and ensure multi-queue is enabled. The goal: spread packet processing across cores without causing cache chaos.
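
A minimal RPS sketch, assuming the single-queue virtio NIC and 8-CPU host shown above; rps_cpus takes a hex CPU bitmask, and fe means CPUs 1-7,
leaving CPU0 to the NIC interrupt:

ls -d /sys/class/net/eth0/queues/rx-*                 # how many RX queues the NIC exposes
echo fe | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

If the NIC is genuinely multi-queue, prefer spreading the hardware IRQs themselves (affinity or irqbalance); RPS is the software fallback for
single-queue devices.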

Crypto acceleration reality check

WireGuard’s ChaCha20 performs well on many CPUs, including those without AES-NI. But the performance story changes with:

  • Very high packet rates (small packets): overhead dominated by per-packet costs.
  • VMs with limited vCPU and poor virtio tuning.
  • Firewalls doing heavy conntrack on the same host.

If your ceiling is CPU: scale up the instance, offload firewalling, or terminate the tunnel on a box built for packet pushing. The “fix” is
sometimes buying a larger VM. That is not shameful; it is cheaper than staff time.

NIC offloads, checksum weirdness, and the VM tax

WireGuard lives in the kernel, which is good. But it still relies on NIC drivers and the Linux networking stack. Offloads (GRO/GSO/TSO) usually
improve throughput and reduce CPU. Sometimes they interact badly with tunnels, virtual NICs, or packet filters.

Symptoms that smell like offload trouble

  • Packet captures show “giant” packets that don’t exist on the wire.
  • Performance changes drastically after a kernel update, with no config change.
  • Throughput is fine one direction but awful the other.
  • High CPU in softirq plus inexplicable drops on the host.

Controlled experiment: disable GRO on wg0

cr0x@server:~$ sudo ethtool -K wg0 gro off
Cannot get device settings: Operation not supported

Meaning: WireGuard interface doesn’t support standard ethtool toggles the same way a physical NIC does.
Decision: Don’t fight wg0 itself; run offload experiments on the underlay NIC. For WireGuard, focus on MTU, routing, and host CPU/IRQ behavior.

More relevant: underlay offload toggles (test, don’t commit blindly)

cr0x@server:~$ sudo ethtool -K eth0 gro off gso off tso off
cr0x@server:~$ sudo ethtool -k eth0 | egrep 'gro|gso|tso'
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off

Meaning: Offloads are disabled. Expect higher CPU use, since the stack now handles more, smaller packets instead of batched ones.
Decision: If disabling offloads fixes tunnel throughput stability (rare, but real), you likely have a driver/virtualization issue. Keep the workaround short-term; plan a kernel/driver/hypervisor fix.

The VM tax

On many hypervisors, a VPN endpoint inside a VM pays extra overhead:

  • virtio/netfront drivers and vSwitch processing add latency and CPU cost.
  • Interrupt coalescing and vCPU scheduling can create bursty delivery.
  • Cloud providers may rate-limit UDP or de-prioritize it under congestion.

If you’re trying to push multi-gigabit through a small VM, it will not become a big VM through positive thinking.

UDP realities: loss, jitter, buffers, and shaping

WireGuard uses UDP. That’s a feature, but it means you inherit the underlay’s behavior without a built-in retransmission layer. If the underlay
drops or reorders, the tunnel doesn’t “fix” it. TCP running inside the tunnel will notice and punish you with reduced congestion windows and
retransmits.

What to measure when “it’s just slow sometimes”

  • Loss: interface drops, UDP errors, TCP retransmits.
  • Jitter: ping variance and application-level timing.
  • Queueing: qdisc stats, bufferbloat symptoms (latency spikes under load).
  • Policing: cloud egress shaping or middlebox rate-limits (often harsh on UDP bursts).

Check qdisc stats for drops/overlimits

cr0x@server:~$ tc -s qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 912345678 bytes 701234 packets (dropped 123, overlimits 0 requeues 12)
 backlog 0b 0p requeues 12

Meaning: Drops occurred at qdisc. That’s local congestion or shaping.
Decision: If drops correlate with slowdowns, address egress queueing: ensure fq_codel/cake, reduce offered load, or fix upstream shaping instead of blaming WireGuard.

Socket buffers: only after you see buffer errors

People love cranking rmem_max and wmem_max. Sometimes it helps; often it just increases latency and memory use while the
real problem is a congested uplink.

cr0x@server:~$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992

Meaning: Default-ish max buffers.
Decision: Only increase if you have evidence of buffer errors (netstat -su) and you understand where queueing will move. Measure after each change.

cr0x@server:~$ sudo sysctl -w net.core.rmem_max=8388608
net.core.rmem_max = 8388608
cr0x@server:~$ sudo sysctl -w net.core.wmem_max=8388608
net.core.wmem_max = 8388608

Meaning: Larger max buffers allowed.
Decision: Re-run iperf3 and check netstat -su. If errors drop and throughput stabilizes without latency exploding, keep; otherwise revert and look at underlay congestion.

Joke #2: UDP is like office gossip—fast, lightweight, and if something gets lost, nobody comes back to correct the record.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “MTU is always 1500 in the cloud”

A mid-size company ran a fleet of WireGuard gateways in a major cloud provider. Things looked fine in staging. In production, every nightly
database dump over the tunnel “sometimes” stalled. Not failed—stalled. Operators would see transfer progress freeze for minutes and then resume.

The first reaction was predictable: blame the database, then the backup tool, then the storage. Someone suggested “maybe WireGuard is slow.”
They lowered the MTU to 1280 because they’d seen it on a blog. It improved slightly, then regressed the next week after a change in the cloud
network path.

The wrong assumption was that the underlay path was a simple 1500-byte Ethernet. It wasn’t. The provider’s network overlay added encapsulation,
and the effective path MTU differed between availability zones. PMTUD should have handled it, but ICMP “frag needed” messages were filtered by a
security appliance that had been installed “temporarily” and then became a permanent fixture—like many corporate things.

The fix was boring: allow the relevant ICMP types, set wg0 MTU based on measured PMTU for the worst path, and clamp MSS for safety. Nightly
backups became stable, and the team stopped paging storage engineers for a network problem wearing a storage costume.

The lesson: if you haven’t measured PMTU end-to-end, you don’t have an MTU. You have a belief system.

An optimization that backfired: “Disable qdisc to reduce overhead”

An internal platform team wanted maximum throughput between two data centers. They read that queue disciplines can add overhead and decided to
“simplify” by switching egress to a basic FIFO and cranking buffers. The graphs loved it at first: higher peak throughput in a lab test.

Then production happened. During daytime traffic, a large transfer would cause latency spikes across unrelated services. The tunnel was shared by
several apps, and the FIFO queue turned into a latency cannon. RPC timeouts climbed. Retries increased load. The classic self-inflicted storm.

Engineers tried to pin the blame on WireGuard. But packet loss wasn’t the main issue; it was queueing delay. The VPN was simply where traffic
converged, so it got blamed like the only person in the room wearing a bright shirt.

Rolling back to fq_codel stabilized latency and improved effective throughput for interactive flows. Peak throughput on the single big
transfer dropped a little. Nobody cared, because the business cared about “systems stay up,” not “one benchmark number looks pretty.”

The lesson: if you optimize throughput without controlling queueing, you’re building a denial-of-service mechanism and calling it performance.

A boring but correct practice that saved the day: “Measure, record, and re-test after every change”

A company with strict change control ran WireGuard between on-prem and cloud. They had a recurring complaint: “the VPN is slow.” Instead of
jumping to tuning, an SRE wrote a small runbook: one iperf3 test, one MTU test, one routing check, one CPU check. Results went into a ticket.

Over a few weeks, patterns appeared. Slowdowns coincided with a specific ISP peering path and increased underlay jitter. Another subset of
incidents correlated with a particular VM size used for the gateway: CPU steal time was high during peak hours.

When a major slowdown hit during a release, the team didn’t argue. They ran the same tests. Underlay ping showed jitter. iperf3 showed
retransmits. CPU was fine. Routing was correct. That eliminated half the usual speculation in five minutes.

They temporarily re-routed traffic through a different gateway on a better path and filed an ISP escalation with clean evidence. The VPN wasn’t
“fixed” by a magic sysctl. It was fixed by knowing where the problem lived.

The lesson: boring measurement discipline is a force multiplier. It doesn’t look heroic, but it prevents expensive guessing.

Common mistakes: symptom → root cause → fix

1) Symptom: SSH works, large downloads stall or hang

Root cause: PMTU blackhole or MTU mismatch; fragmentation drops; ICMP blocked.

Fix: Measure PMTU with DF pings; lower wg0 MTU; clamp TCP MSS; allow ICMP “frag needed” on the path.

2) Symptom: Throughput is fine with iperf3 -P 8 but bad with single flow

Root cause: Single-flow TCP is limited by RTT/loss or CPU per flow; congestion window can’t grow.

Fix: Check retransmits and jitter; fix MTU/loss first; if CPU-limited, improve IRQ distribution or scale the gateway.

3) Symptom: Tunnel is slow only from one site / one ASN / one Wi‑Fi

Root cause: Underlay path loss/jitter or provider shaping; sometimes UDP treated poorly.

Fix: Baseline underlay ping/jitter; compare from multiple networks; consider changing endpoint port, but only after verifying MTU and routing.

4) Symptom: Fast for minutes, then collapses, then recovers

Root cause: Queueing/bufferbloat, or transient drops; sometimes NAT state flapping without keepalive.

Fix: Use fq_codel/cake on egress; monitor qdisc drops; set PersistentKeepalive for roaming/NATed peers; check conntrack.

5) Symptom: One direction is much slower than the other

Root cause: Asymmetric routing, egress policing, or different MTU/PMTU behavior on each path.

Fix: Run routing checks on both ends; verify policy routing and rp_filter; compare qdisc stats and interface drops per direction.

6) Symptom: Performance got worse after kernel/driver/hypervisor update

Root cause: Offload behavior changed; virtio or NIC driver regression; different interrupt moderation.

Fix: Test offload toggles as a temporary workaround; adjust IRQ affinity; plan upgrade/rollback with evidence.

7) Symptom: “WireGuard is slow” only on a gateway that also does firewalling

Root cause: conntrack and iptables/nft rules add CPU cost; softirq pressure; table contention.

Fix: Reduce conntrack for tunnel traffic where safe, simplify rules, scale CPU, or separate roles (VPN endpoint vs stateful firewall).

8) Symptom: Latency spikes under load, even when throughput is acceptable

Root cause: Bad qdisc (FIFO), too-large buffers, or shaping without AQM.

Fix: Use fq_codel/cake; avoid “just increase buffers”; measure ping under load.

Checklists / step-by-step plan (safe, boring, effective)

Checklist A: One-hour diagnosis on a production incident

  1. Confirm routing: ip route get to the remote tunnel IP. If not wg0, fix routing/AllowedIPs first.
  2. Confirm tunnel health: wg show for handshake freshness and counters moving.
  3. Run iperf3: single flow and -P 8. Note retransmits and stability.
  4. MTU test: DF pings to the peer tunnel IP; look for blackholes.
  5. CPU/softirq check: mpstat / top. If one core is pinned, treat as host limitation.
  6. Loss and queueing: ip -s link drops; tc -s qdisc drops/overlimits.
  7. NAT/conntrack sanity: conntrack stats, firewall counters if available.
  8. Decide: MTU/MSS fix, routing fix, scale gateway CPU, or escalate to the underlay provider with evidence (a collection-script sketch follows).
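
A minimal evidence-collection sketch of this checklist, assuming the example addresses used throughout, an iperf3 server already listening on the
far side, and iperf3/sysstat installed:

#!/usr/bin/env bash
# Collect Checklist A evidence into one timestamped file for the ticket.
# Deliberately no "set -e": a failing probe is itself evidence.
set -u
PEER_TUN=10.60.0.10        # far tunnel IP (example)
PEER_WAN=203.0.113.44      # far underlay endpoint (example)
OUT="wg-diag-$(date +%Y%m%dT%H%M%S).txt"
{
  echo "== route to peer ==";      ip route get "$PEER_TUN"
  echo "== wg status ==";          sudo wg show
  echo "== DF ping (PMTU) ==";     ping -M do -s 1372 -c 3 "$PEER_TUN"
  echo "== underlay ping ==";      ping -c 10 "$PEER_WAN"
  echo "== iperf3 single flow =="; iperf3 -c "$PEER_TUN" -t 15
  echo "== tcp stats ==";          ss -ti dst "$PEER_TUN"
  echo "== per-core CPU ==";       mpstat -P ALL 1 5
  echo "== interface drops ==";    ip -s link
  echo "== qdisc drops ==";        tc -s qdisc show
} 2>&1 | tee "$OUT"

Attach the file to the incident ticket; the point is that the next engineer can diff two runs instead of re-arguing from memory.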

Checklist B: Hardening a WireGuard deployment for predictable performance

  1. Pick a target MTU strategy: set wg0 MTU based on worst-case measured PMTU, and/or clamp MSS for TCP.
  2. Standardize qdisc: fq_codel on WAN egress; avoid FIFO unless you enjoy latency incidents.
  3. Capacity plan CPU: benchmark iperf3 and measure per-core softirq before declaring “it’s fine.”
  4. Document routing intent: AllowedIPs, policy rules, source-based routing; include the reason in config comments.
  5. Decide on keepalive policy: roaming clients and NAT-heavy paths need PersistentKeepalive; servers on stable links often don’t.
  6. Instrument: export interface drops, qdisc drops, CPU softirq, and handshake age; alert on trends, not one-off spikes.

Checklist C: Change plan for a suspected MTU issue (with rollback)

  1. Measure baseline iperf3 and ss -ti retransmits.
  2. Measure DF ping max payload that passes reliably.
  3. Lower wg0 MTU by a small, justified amount (e.g., 20–40 bytes) or implement MSS clamp.
  4. Re-test iperf3 and observe retransmit change.
  5. Roll back if no improvement; MTU wasn’t the bottleneck.
  6. Write down the measured PMTU and the chosen value so nobody “optimizes” it later.

FAQ

1) What throughput should I expect from WireGuard?

On modern hardware, WireGuard can reach multi-gigabit. In practice, your ceiling is usually CPU, NIC/driver behavior, and underlay loss/jitter.
Measure with iperf3 and watch per-core softirq.

2) Is 1420 always the right MTU for wg0?

No. It’s a reasonable default for a 1500-byte IPv4 underlay. Cloud overlays, PPPoE, and nested tunnels can require smaller MTUs. Measure PMTU
and set MTU based on evidence.

3) Should I use MSS clamping or set MTU?

If the problem is mostly TCP and you have diverse clients/paths, MSS clamping is often the least disruptive fix. If you carry significant UDP
traffic (VoIP, games, QUIC-heavy workloads) and PMTU is unreliable, setting wg0 MTU to a safe value is cleaner.

4) Why does iperf3 -P 8 look great but single stream is mediocre?

Parallel streams hide single-flow limitations (RTT, loss sensitivity, congestion control behavior) and can distribute CPU costs. If business
traffic is single-flow (backups, object downloads), optimize for that: fix loss/MTU and reduce queueing.

5) Can firewall rules make WireGuard slow?

Absolutely. Stateful filtering and conntrack can add CPU overhead and drop packets under load. Also, firewalls blocking ICMP “frag needed” can
break PMTUD and cause the classic “small packets fine, big packets die” symptom.

6) Is WireGuard slower than OpenVPN?

Often it’s faster, especially in kernel mode. But if your bottleneck is MTU blackholing, routing asymmetry, or underlay loss, switching VPNs
won’t fix physics and policy. Diagnose first.

7) Does changing the WireGuard port help performance?

Sometimes, but it’s not a first-line fix. Changing ports can bypass dumb rate limits or broken middleboxes. Do MTU/routing/CPU checks first so
you don’t “fix” the wrong problem by accident.

8) Why is WireGuard slow only on one laptop network?

Likely PMTU/ICMP filtering on that network, or UDP shaping/loss on that path. Roaming clients behind NAT can also suffer if keepalive isn’t set.
Compare DF pings and observe handshake stability.

9) Should I increase Linux socket buffers for WireGuard?

Only if you see evidence like UDP send/receive buffer errors. Bigger buffers can increase latency and hide congestion. Fix queueing and loss
first; tune buffers second.

Next steps you can execute this week

  1. Pick one reproducible test (iperf3 single flow + -P 8) and record baseline throughput, retransmits, and CPU.
  2. Run DF ping PMTU tests across the tunnel and decide on MTU or MSS clamping based on the largest reliable size, not vibes.
  3. Validate routing deterministically with ip route get and ip rule on both ends; hunt asymmetry.
  4. Watch softirq during load; if one core is pinned, fix IRQ distribution or scale the VPN endpoint.
  5. Check qdisc drops and move to fq_codel where appropriate; stop using FIFO if latency matters (and it always does).
  6. Write a one-page runbook containing the exact commands you ran today and the “if output looks like X, do Y” decisions.

WireGuard isn’t “slow.” Your path, your MTU, your routing, or your CPU is slow. The good news is that those are diagnosable with a handful of
commands and a refusal to guess.
