WireGuard “works,” which is the most dangerous state a VPN can be in. You can ping, SSH, maybe even open a web page—yet file transfers crawl,
backups miss their window, and your “fast” fiber looks like a motel Wi‑Fi connection with commitment issues.
This is the field guide for turning “WireGuard is slow” into a measured bottleneck, a specific fix, and a repeatable method. No folklore. No
random MTU numbers from a forum post written in 2019. We’ll test, read the counters, and make changes you can defend in a postmortem.
The mental model: where WireGuard can be slow
WireGuard is “just” a network interface and a UDP transport. That simplicity is why it’s beloved—and why performance problems often come from
everything around it: MTU, routing, NAT, NIC driver behavior, kernel queues, and CPU scheduling.
There are four common failure modes that present as “slow”:
- Path MTU / fragmentation blackholes: small packets work, big transfers stall or oscillate.
- Wrong route / wrong policy: traffic hairpins, traverses NAT twice, or exits the wrong interface.
- CPU bottleneck: one core pins at 100% during iperf; throughput tops out suspiciously round.
- Loss/buffering on UDP: TCP over UDP reacts badly when the underlay drops or reorders packets.
You want to avoid “tuning” until you know which one you have. Blind tuning is how you end up with a configuration that only works on Tuesdays.
Here’s the operational lens: start with a single flow you can reproduce (iperf3 is fine). Confirm if the bottleneck is local host,
remote host, or the path. Then apply the smallest change that moves the needle, and measure again.
One quote worth keeping on a sticky note: “Hope is not a strategy,” a line long associated with NASA flight director Gene Kranz.
Joke #1: If you’re changing MTU values at random, you’re not tuning a VPN—you’re doing numerology with extra steps.
Fast diagnosis playbook (first/second/third)
First: prove whether it’s MTU/fragmentation
- Run a PMTU-style ping test (don’t guess). If large DF pings fail, you have a blackhole or mismatch.
- Check counters for fragmentation needed / ICMP blocked.
- If PMTU is broken, stop and fix MTU or MSS clamping before touching anything else.
Second: prove whether it’s CPU
- Run iperf3 over the tunnel while watching per-core usage and softirq load.
- If one core pins (or ksoftirqd goes wild), you’re CPU/interrupt limited.
- Fix by making sure you’re on the in-kernel WireGuard implementation (not a userspace fallback), improving NIC/IRQ distribution, or scaling flows/peers/hosts.
Third: prove routing and path correctness
- Verify the route for the destination (and the source address selection) from the sending host.
- Check for asymmetric routing: one direction uses the tunnel, the other uses the WAN.
- Confirm NAT and firewall rules aren’t rewriting or rate-limiting UDP.
If all three look sane, treat it as UDP loss/queuing
- Measure loss/retransmits via TCP stats and interface counters.
- Check qdisc, shaping, bufferbloat, and the underlay link health.
- Only then consider advanced tuning (socket buffers, fq, pacing).
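If you want the whole playbook on one screen, the sketch below strings the checks together. The addresses are the placeholders used throughout this guide (10.60.0.10 for the peer’s tunnel IP); substitute your own, and treat it as a starting point, not a finished script.
cr0x@server:~$ ip route get 10.60.0.10                    # does traffic actually use wg0?
cr0x@server:~$ ping -M do -s 1372 -c 3 10.60.0.10         # DF ping: coarse MTU/blackhole check
cr0x@server:~$ iperf3 -c 10.60.0.10 -t 15                 # single-flow throughput and retransmits
cr0x@server:~$ mpstat -P ALL 1 5                          # run during iperf3: is one core pinned?
cr0x@server:~$ tc -s qdisc show dev eth0                  # drops/overlimits point at local queueing
Each of these shows up again below, with example output and a decision to make.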
Interesting facts and context (why these problems exist)
- WireGuard entered the Linux kernel in 2020, which made performance and deployment dramatically more predictable than out-of-tree modules.
- It rides over UDP by design, partly to avoid TCP-over-TCP meltdown and partly for simpler NAT traversal—yet it inherits UDP’s “best effort” reality.
- Its cryptography uses ChaCha20-Poly1305, chosen for strong performance on systems without AES acceleration; on many CPUs it’s blisteringly fast.
- Path MTU Discovery has been fragile for decades because it depends on ICMP “fragmentation needed” messages that firewalls love to drop.
- Ethernet’s classic MTU of 1500 is a historical artifact, not a law of physics; tunnels add headers and make 1500 a trap.
- Linux offloads (GSO/GRO/TSO) can make packet captures look “wrong” and can also hide performance problems until a driver update changes behavior.
- WireGuard peers are identified by public keys, not IPs; routing mistakes often manifest as “it connects but it’s slow” when traffic matches the wrong AllowedIPs.
- Cloud networks frequently encapsulate your packets already (VXLAN/Geneve), so your tunnel is a tunnel inside a tunnel—MTU death by a thousand headers.
- TCP throughput collapse can come from tiny loss rates on long fat networks; VPN overhead is rarely the main enemy compared to loss and RTT.
Practical tasks: commands, outputs, decisions
Below are practical tasks you can run on Linux hosts (or inside Linux VMs) to find the bottleneck. Each includes: the command, what typical
output means, and the decision you should make next. Do them in order if you want speed without superstition.
Task 1: Confirm you’re actually testing over WireGuard
cr0x@server:~$ ip route get 10.60.0.10
10.60.0.10 dev wg0 src 10.60.0.1 uid 1000
cache
Meaning: Traffic to 10.60.0.10 goes out wg0 with source 10.60.0.1.
Decision: If you don’t see dev wg0, stop. Fix routing/AllowedIPs/policy routing first or your tests are garbage.
Task 2: Inspect WireGuard peer health and whether you’re roaming endpoints
cr0x@server:~$ sudo wg show wg0
interface: wg0
public key: 2r4...redacted...Kk=
listening port: 51820
peer: q0D...redacted...xw=
endpoint: 203.0.113.44:51820
allowed ips: 10.60.0.10/32, 10.20.0.0/16
latest handshake: 28 seconds ago
transfer: 18.42 GiB received, 25.11 GiB sent
persistent keepalive: every 25 seconds
Meaning: Handshake is recent; endpoint is stable; traffic counters move.
Decision: If latest handshake is old or endpoint keeps changing unexpectedly, suspect NAT timeouts, roaming, or firewall state issues—expect loss and jitter.
Task 3: Baseline the underlay (non-VPN) throughput and latency
cr0x@server:~$ ping -c 5 203.0.113.44
PING 203.0.113.44 (203.0.113.44) 56(84) bytes of data.
64 bytes from 203.0.113.44: icmp_seq=1 ttl=53 time=19.8 ms
64 bytes from 203.0.113.44: icmp_seq=2 ttl=53 time=20.4 ms
64 bytes from 203.0.113.44: icmp_seq=3 ttl=53 time=19.9 ms
64 bytes from 203.0.113.44: icmp_seq=4 ttl=53 time=62.1 ms
64 bytes from 203.0.113.44: icmp_seq=5 ttl=53 time=20.2 ms
--- 203.0.113.44 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 19.8/28.5/62.1/16.8 ms
Meaning: Underlay RTT has spikes (62 ms). VPN will amplify that into TCP throughput pain.
Decision: If underlay jitter/loss is present, don’t expect miracles from MTU tweaks; you may need queue management or a better path/provider.
Task 4: Measure tunnel throughput with iperf3 (single flow)
cr0x@server:~$ iperf3 -c 10.60.0.10 -t 15
Connecting to host 10.60.0.10, port 5201
[ 5] local 10.60.0.1 port 43144 connected to 10.60.0.10 port 5201
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-1.00 sec 62.2 MBytes 522 Mbits/sec 0
[ 5] 1.00-2.00 sec 61.8 MBytes 518 Mbits/sec 1
[ 5] 2.00-3.00 sec 44.9 MBytes 377 Mbits/sec 12
[ 5] 3.00-4.00 sec 58.2 MBytes 488 Mbits/sec 3
[ 5] 14.00-15.00 sec 60.1 MBytes 504 Mbits/sec 2
- - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-15.00 sec 848 MBytes 474 Mbits/sec 44 sender
[ 5] 0.00-15.00 sec 846 MBytes 473 Mbits/sec receiver
Meaning: Throughput is unstable and there are retransmits.
Decision: Retransmits over a VPN usually mean MTU blackholing, underlay loss/jitter, or buffering/queue issues. Next: MTU tests and loss counters.
Task 5: Measure with multiple parallel flows (to detect single-core limits)
cr0x@server:~$ iperf3 -c 10.60.0.10 -P 8 -t 15
[SUM] 0.00-15.00 sec 3.62 GBytes 2.07 Gbits/sec 81 sender
[SUM] 0.00-15.00 sec 3.61 GBytes 2.07 Gbits/sec receiver
Meaning: Parallelism improved throughput significantly.
Decision: If -P 8 is much faster than single flow, you may be CPU-limited per flow, or TCP is struggling with loss/RTT. Check CPU and qdisc next.
Task 6: Check MTU on wg0 and the underlay interface
cr0x@server:~$ ip link show wg0
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
cr0x@server:~$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
Meaning: wg0 defaults to 1420, typical for IPv4 underlay with 1500 MTU.
Decision: If underlay MTU is smaller than you think (cloud overlay, PPPoE, etc.), 1420 can still be too high. Don’t guess—test PMTU.
Task 7: PMTU test with DF ping through the tunnel
cr0x@server:~$ ping -M do -s 1372 -c 3 10.60.0.10
PING 10.60.0.10 (10.60.0.10) 1372(1400) bytes of data.
1380 bytes from 10.60.0.10: icmp_seq=1 ttl=64 time=23.4 ms
1380 bytes from 10.60.0.10: icmp_seq=2 ttl=64 time=23.1 ms
1380 bytes from 10.60.0.10: icmp_seq=3 ttl=64 time=23.3 ms
--- 10.60.0.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Meaning: A 1372-byte payload (a 1400-byte IP packet once ICMP and IPv4 headers are added) gets through. Good sign.
Decision: Increase the size until it fails; the failure point tells you the real PMTU. If it fails unexpectedly low, you likely have extra encapsulation overhead or blocked ICMP.
cr0x@server:~$ ping -M do -s 1412 -c 3 10.60.0.10
PING 10.60.0.10 (10.60.0.10) 1412(1440) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
--- 10.60.0.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2038ms
Meaning: Your local wg0 MTU stops you at 1420—this is not a path test yet, it’s an interface constraint.
Decision: If real path MTU is lower than wg0 MTU, you’ll see failures at smaller sizes too (or weird stalls). Continue by testing near wg0 MTU and watching for loss/retransmits.
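To find the failure point without guessing one size at a time, a quick sweep helps. A minimal sketch, assuming the peer tunnel IP from this example; sizes are ICMP payload bytes, so add 28 to get the IPv4 packet size:
cr0x@server:~$ for s in 1300 1320 1340 1360 1372 1392 1412; do ping -M do -s "$s" -c 1 -W 1 10.60.0.10 >/dev/null 2>&1 && echo "payload $s ok" || echo "payload $s FAIL"; done
Run it more than once; one lost probe is not a blackhole. The largest payload that passes reliably, plus 28, is your effective path MTU, and wg0 MTU should sit at or below it with some headroom.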
Task 8: Check underlay interface counters (and whether ICMP “frag needed” is arriving)
cr0x@server:~$ sudo ip -s -s link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
9012345678 8123456 0 1452 0 120345
TX: bytes packets errors dropped carrier collsns
8123456789 7012345 0 0 0 0
Meaning: RX drops exist (1452). That might be congestion, ring overflow, or policing upstream.
Decision: If drops rise during iperf, treat it as underlay or host receive path pressure. Investigate NIC rings/interrupts/qdisc and upstream shaping.
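To see whether PMTUD signals are arriving at all, watch the host’s ICMP counters while the DF tests run. A minimal sketch; the counter names come from /proc/net/snmp, and “destination unreachable” includes the “fragmentation needed” code, so treat the grep as an approximation:
cr0x@server:~$ nstat -az | grep -Ei 'icmpindestunreachs|icmpinerrors'
If IcmpInDestUnreachs never moves while large DF packets vanish, assume a PMTUD blackhole and fix MTU/MSS yourself instead of waiting for the path to be honest.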
Task 9: Observe TCP health (retransmits, congestion) during a transfer
cr0x@server:~$ ss -ti dst 10.60.0.10
ESTAB 0 0 10.60.0.1:43144 10.60.0.10:5201
cubic wscale:7,7 rto:204 rtt:24.1/2.1 ato:40 mss:1360 pmtu:1420 rcvmss:1360 advmss:1360 cwnd:64 bytes_acked:8123456 segs_out:6021 segs_in:5844 send 1.9Gbps lastsnd:8 lastrcv:8 lastack:8 pacing_rate 3.8Gbps delivery_rate 1.7Gbps retrans:12/44
Meaning: MSS 1360, PMTU 1420. Retransmissions exist.
Decision: Retrans plus a stable PMTU suggests loss/jitter/queuing, not just MTU mismatch. If MSS/PMTU look wrong (too high), fix MTU/MSS clamp first.
Task 10: Check CPU saturation and softirq during load
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0 (server) 12/27/2025 _x86_64_ (8 CPU)
12:10:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:10:02 AM all 22.1 0.0 18.4 0.0 0.0 20.9 0.0 38.6
12:10:02 AM 0 12.0 0.0 10.1 0.0 0.0 62.3 0.0 15.6
12:10:02 AM 1 28.4 0.0 21.0 0.0 0.0 8.2 0.0 42.4
12:10:02 AM 2 30.1 0.0 25.7 0.0 0.0 5.9 0.0 38.3
12:10:02 AM 3 29.8 0.0 19.4 0.0 0.0 6.1 0.0 44.6
12:10:02 AM 4 18.0 0.0 15.2 0.0 0.0 28.0 0.0 38.8
12:10:02 AM 5 20.5 0.0 17.1 0.0 0.0 26.4 0.0 36.0
12:10:02 AM 6 19.7 0.0 15.9 0.0 0.0 25.1 0.0 39.3
12:10:02 AM 7 17.6 0.0 13.4 0.0 0.0 29.0 0.0 40.0
Meaning: CPU0 is heavy in %soft (softirq). That’s often network receive processing and can cap throughput.
Decision: If one CPU’s softirq dominates, you likely need better IRQ distribution (RSS/RPS), NIC queue tuning, or to move the workload off a tiny VM.
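If the NIC exposes only one receive queue (common on small virtio instances), RPS can spread receive processing across CPUs in software. A sketch, assuming eth0 with a single RX queue on an 8-CPU host; the mask is an example, not a recommendation:
cr0x@server:~$ echo fe | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
fe
The hex mask fe selects CPUs 1–7 and leaves CPU0, which is already eating the NIC interrupt, alone. Write 0 to revert, and re-check mpstat under load after each change.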
Task 11: Check WireGuard and UDP socket buffer pressure
cr0x@server:~$ netstat -su
Udp:
2212345 packets received
0 packets to unknown port received
0 packet receive errors
2147890 packets sent
0 receive buffer errors
412 send buffer errors
UdpLite:
IpExt:
InNoRoutes: 0
InBcastPkts: 0
Meaning: Send buffer errors suggest the host couldn’t queue outbound UDP fast enough (or the qdisc/NIC is congested).
Decision: Investigate qdisc, NIC TX queues, and potentially increase socket buffers (carefully, with measurement). Also check for egress policing.
Task 12: Check qdisc (queue discipline) and whether you’re using something sensible
cr0x@server:~$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Meaning: fq_codel is generally a good default for latency and fairness.
Decision: If you see pfifo_fast or a huge fq backlog elsewhere, consider switching to fq_codel or cake (where available) on the egress that matters.
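Switching is a one-liner and easy to revert. A sketch, assuming eth0 is the WAN egress; the cake bandwidth figure is a placeholder you should set slightly below your real uplink rate, and cake needs the sch_cake module:
cr0x@server:~$ sudo tc qdisc replace dev eth0 root fq_codel
cr0x@server:~$ sudo tc qdisc replace dev eth0 root cake bandwidth 900mbit
Pick one, re-run the iperf3 tests, and compare latency under load as well as raw throughput.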
Task 13: Verify policy routing (common in split tunnel setups)
cr0x@server:~$ ip rule show
0: from all lookup local
1000: from 10.60.0.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
cr0x@server:~$ ip route show table 100
default dev wg0 scope link
Meaning: Traffic sourced from 10.60.0.0/24 uses table 100 and defaults to wg0.
Decision: If replies don’t use the same policy, you’ll get asymmetry, weird stalls, and “only some apps are slow.” Fix rules so both directions agree.
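You can also ask the kernel which way a forwarded or reply packet would go. A minimal sketch using this example’s addresses; 192.0.2.50 stands in for whatever destination the tunnel clients actually reach, and iif wg0 simulates a packet that arrived over the tunnel:
cr0x@server:~$ ip route get 192.0.2.50 from 10.60.0.10 iif wg0
If the answer uses a different table or egress interface than your policy intends, fix the ip rule set before touching anything else.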
Task 14: Look for conntrack/NAT issues that punish UDP
cr0x@server:~$ sudo conntrack -S
cpu=0 found=912345 invalid=12 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=120
cpu=1 found=901234 invalid=8 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=98
Meaning: invalid exists but isn’t crazy; search_restart can hint at table pressure.
Decision: If invalid climbs quickly during VPN load, verify NAT rules, UDP timeouts, and whether middleboxes are rewriting or expiring state too aggressively.
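Two knobs worth a look: the UDP conntrack timeouts, and whether the encapsulated WireGuard flow needs conntrack at all. A sketch; the NOTRACK rules assume the tunnel listens on 51820 and that no NAT or stateful rule has to match that outer UDP flow:
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_udp_timeout net.netfilter.nf_conntrack_udp_timeout_stream
cr0x@server:~$ sudo iptables -t raw -A PREROUTING -p udp --dport 51820 -j NOTRACK
cr0x@server:~$ sudo iptables -t raw -A OUTPUT -p udp --sport 51820 -j NOTRACK
Measure conntrack CPU and the -S counters before and after; if nothing improves, remove the rules.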
Task 15: Verify offloads and whether the NIC driver is doing something “creative”
cr0x@server:~$ sudo ethtool -k eth0 | egrep 'gro|gso|tso|tx-checksum|rx-checksum'
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
Meaning: Offloads are enabled. Usually good. Occasionally disastrous with certain virtual NICs or buggy drivers.
Decision: If you see high CPU, odd packet capture behavior, or tunnel throughput collapses after a kernel/driver update, test disabling GRO on wg0 or on the underlay as a controlled experiment—then revert if it doesn’t help.
Task 16: Capture evidence without lying to yourself (tcpdump with clarity)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
01:12:01.123456 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 148
01:12:01.143211 IP 203.0.113.44.51820 > 198.51.100.10.51820: UDP, length 92
01:12:01.163001 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 1200
01:12:01.182992 IP 198.51.100.10.51820 > 203.0.113.44.51820: UDP, length 1200
01:12:01.203114 IP 203.0.113.44.51820 > 198.51.100.10.51820: UDP, length 92
5 packets captured
Meaning: You see bidirectional UDP traffic on the WireGuard port. Good.
Decision: If traffic is one-way only, performance problems are secondary—you have a path/firewall/NAT failure in one direction.
MTU and fragmentation: the most common “it’s fine” lie
When WireGuard is slow, MTU is guilty often enough that you should treat it as a default suspect—but not a default fix. The correct approach is:
determine the effective path MTU, then set wg0 MTU (or clamp TCP MSS) so your traffic never depends on fragmented packets being delivered reliably.
What actually happens
WireGuard encapsulates your packets inside UDP. That means extra headers. If your inner packet is sized to a 1500-byte path, the outer packet can
exceed 1500 and either:
- get fragmented (best case: fragments arrive; worst case: fragments drop), or
- get dropped with an ICMP “fragmentation needed” message (if PMTUD works), or
- get silently dropped (PMTUD blackhole, the classic).
The failure patterns are distinctive:
- Small packets OK, big packets fail: SSH works, file copy stalls, web pages partially load.
- Throughput sawtooths: TCP ramps up, hits a wall, collapses, repeats.
- “Fixes itself” on different networks: because PMTU differs across paths and ICMP filtering policies.
Don’t worship 1420
1420 is a reasonable default for an IPv4 underlay with a 1500 MTU. But:
- If you’re running WireGuard over IPv6 underlay, overhead differs.
- If your “1500 MTU” link is actually an overlay (cloud SDN) or PPPoE, you may have less.
- If there’s IPSec, GRE, VXLAN, or “helpful” middleboxes, you have less.
Two solid strategies
- Set wg0 MTU to a known-safe value. This affects all traffic, including UDP-based apps. It’s blunt but reliable.
- Clamp TCP MSS on the tunnel ingress/egress. This only affects TCP and can preserve a larger MTU for UDP flows that can handle fragmentation (or don’t send huge payloads).
Practical MTU workflow you can defend
- Run DF ping tests to the far tunnel IP (Task 7), find the largest payload that passes reliably.
- Subtract the correct overhead if you’re testing different layers (be consistent).
- Set wg0 MTU to a value that leaves headroom, not one that “barely passes on a good day.”
- Re-run iperf3 and compare retransmits and stability, not just peak throughput.
Example: setting MTU safely
cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
cr0x@server:~$ ip link show wg0
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1380 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: wg0 MTU is now 1380.
Decision: If performance stabilizes (fewer retransmits, more consistent throughput), keep it and document the measured PMTU basis. If nothing changes, MTU wasn’t your main bottleneck.
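An ip link change does not survive the interface being torn down and brought back. If wg-quick manages the tunnel, the persistent equivalent is the MTU key in the [Interface] section; the Address and ListenPort below are just this guide’s example values:
[Interface]
Address = 10.60.0.1/24
ListenPort = 51820
MTU = 1380
Without an explicit MTU, wg-quick derives one from the route toward the endpoint, which is exactly the guess you were trying to replace with a measurement.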
Example: MSS clamping (iptables)
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -S FORWARD | tail -n 1
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Meaning: TCP SYN packets going out wg0 will have MSS clamped.
Decision: Use this when you can’t reliably control MTU end-to-end (multi-peer, roaming clients), or when you only see TCP issues. Validate with ss -ti that MSS/PMTU align.
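If the host runs nftables rather than iptables, the equivalent clamp looks like the sketch below. The inet mangle table and forward chain are assumptions; attach the rule wherever your forward hook already lives:
cr0x@server:~$ sudo nft add rule inet mangle forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
As with the iptables version, confirm the result with ss -ti: MSS should track the clamped PMTU.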
Routing and policy routing: when packets take the scenic route
Routing mistakes don’t always break connectivity. They degrade it in ways that feel like “slow VPN” while the root cause is “packets are doing
weird tourism.” You see it most in split-tunnel setups, multi-homed servers, and environments with default routes that change (hello, laptops).
AllowedIPs is routing, not access control
WireGuard’s AllowedIPs is a clever multipurpose field: it tells the kernel what destinations should be routed to that peer, and it
also acts as a cryptokey routing table. The important part: if you define it wrong, traffic can be sent to the wrong peer, or not sent to the
tunnel at all, or sent with the wrong source address.
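You can dump the cryptokey routing table directly and compare it with what you believe the routes should be. A quick sketch using the peer from the earlier wg show output:
cr0x@server:~$ sudo wg show wg0 allowed-ips
q0D...redacted...xw=	10.60.0.10/32 10.20.0.0/16
If a destination you care about matches no peer, or matches a peer you didn’t expect, that is your “slow VPN”: the packets aren’t going where you assume.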
Asymmetric routing: the silent throughput killer
Your outbound traffic might go through WireGuard, but replies might return via the underlay, or through a different tunnel, or through a NAT path
that mangles state. TCP hates asymmetry. UDP hates it too; it’s just less vocal about it.
How to catch routing lies quickly
- Use ip route get for the destination from the sender (Task 1).
- Use ip rule and ip route show table X when policy routing is in play (Task 13).
- Check reverse path filtering (rp_filter) if you have multiple interfaces and policy routing.
Reverse path filtering: “security feature” turned availability bug
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter net.ipv4.conf.wg0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.wg0.rp_filter = 0
Meaning: Strict rp_filter is enabled globally and on eth0. In policy routing setups, that can drop legitimate asymmetric replies.
Decision: If you see unexplained drops and asymmetry, set rp_filter to 2 (loose) on involved interfaces or disable where appropriate—then verify with packet counters.
cr0x@server:~$ sudo sysctl -w net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.all.rp_filter = 2
Meaning: Loose mode. Still provides some sanity checks without breaking policy routing.
Decision: Make it persistent only after confirming it resolves the issue and doesn’t violate your threat model.
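To persist it, drop the setting into sysctl.d instead of hand-editing during an incident; the file name is just a convention:
cr0x@server:~$ echo 'net.ipv4.conf.all.rp_filter = 2' | sudo tee /etc/sysctl.d/99-wg-rpfilter.conf
net.ipv4.conf.all.rp_filter = 2
cr0x@server:~$ sudo sysctl --system | grep rp_filter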
CPU and crypto: when your core count is a throughput limit
WireGuard is fast. But “fast” isn’t “free,” and the underlay can be fast enough to expose CPU as the ceiling. On small cloud instances, old
Xeons, busy hypervisors, or hosts doing a lot of conntrack/firewall work, you can absolutely pin a core and stop scaling.
What CPU bottleneck looks like
- iperf3 plateaus at a stable bitrate, regardless of link capacity.
- One core is pegged in %soft or %sys while others are mostly idle.
- Parallel iperf3 streams increase throughput more than they “should.”
- System time grows with packet rate, not with bytes transferred.
Distinguish crypto cost from packet processing cost
People love blaming crypto because it sounds sophisticated. Often, the real issue is packet processing overhead: interrupts, softirq, GRO/GSO
behavior, conntrack, and qdisc. WireGuard’s crypto is efficient; your host’s network path may not be.
Check for kernel time and softirq
cr0x@server:~$ top -b -n 1 | head -n 15
top - 01:20:11 up 16 days, 3:12, 1 user, load average: 3.21, 2.88, 2.44
Tasks: 213 total, 2 running, 211 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.2 us, 0.0 ni, 27.6 sy, 0.0 id, 0.0 wa, 0.0 hi, 54.2 si, 0.0 st
MiB Mem : 16000.0 total, 2200.0 free, 4100.0 used, 9700.0 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 11200.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
934 root 20 0 0 0 0 S 62.0 0.0 31:22.11 ksoftirqd/0
2112 root 20 0 0 0 0 S 18.0 0.0 12:04.22 ksoftirqd/4
Meaning: Softirq threads are burning CPU. That’s network processing overhead.
Decision: Investigate IRQ affinity, NIC queue count, RPS/XPS, and whether your VM is starved or pinned to a noisy neighbor host.
Interrupt distribution and queueing
cr0x@server:~$ grep -E 'virtio0|eth0' /proc/interrupts | head
24: 8123456 0 0 0 0 0 0 0 PCI-MSI 327680-edge virtio0-input.0
25: 0 7012345 0 0 0 0 0 0 PCI-MSI 327681-edge virtio0-output.0
Meaning: Interrupts are concentrated on specific CPUs.
Decision: If most interrupts land on one CPU, configure IRQ affinity and ensure multi-queue is enabled. The goal: spread packet processing across cores without causing cache chaos.
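Affinity is set per IRQ number through procfs. A minimal sketch using IRQ 24 from the output above; the CPU choice is illustrative, and irqbalance (if it is running) may silently rewrite manual settings:
cr0x@server:~$ echo 2 | sudo tee /proc/irq/24/smp_affinity_list
2
cr0x@server:~$ cat /proc/irq/24/smp_affinity_list
2
Re-run mpstat under load afterwards; the goal is %soft spread across a few cores, not numbers that look perfectly even.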
Crypto acceleration reality check
WireGuard’s ChaCha20 performs well on many CPUs, including those without AES-NI. But the performance story changes with:
- Very high packet rates (small packets): overhead dominated by per-packet costs.
- VMs with limited vCPU and poor virtio tuning.
- Firewalls doing heavy conntrack on the same host.
If your ceiling is CPU: scale up the instance, offload firewalling, or terminate the tunnel on a box built for packet pushing. The “fix” is
sometimes buying a larger VM. That is not shameful; it is cheaper than staff time.
NIC offloads, checksum weirdness, and the VM tax
WireGuard lives in the kernel, which is good. But it still relies on NIC drivers and the Linux networking stack. Offloads (GRO/GSO/TSO) usually
improve throughput and reduce CPU. Sometimes they interact badly with tunnels, virtual NICs, or packet filters.
Symptoms that smell like offload trouble
- Packet captures show “giant” packets that don’t exist on the wire.
- Performance changes drastically after a kernel update, with no config change.
- Throughput is fine one direction but awful the other.
- High CPU in softirq plus inexplicable drops on the host.
Controlled experiment: disable GRO on wg0
cr0x@server:~$ sudo ethtool -K wg0 gro off
Cannot get device settings: Operation not supported
Meaning: WireGuard interface doesn’t support standard ethtool toggles the same way a physical NIC does.
Decision: Skip ethtool toggles on wg0; test offloads on the underlay NIC instead. For WireGuard itself, focus on MTU, routing, and host CPU/IRQ behavior.
More relevant: underlay offload toggles (test, don’t commit blindly)
cr0x@server:~$ sudo ethtool -K eth0 gro off gso off tso off
cr0x@server:~$ sudo ethtool -k eth0 | egrep 'gro|gso|tso'
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
Meaning: Offloads are disabled. CPU use will rise because the stack now handles more, smaller packets for the same traffic.
Decision: If disabling offloads fixes tunnel throughput stability (rare, but real), you likely have a driver/virtualization issue. Keep the workaround short-term; plan a kernel/driver/hypervisor fix.
The VM tax
On many hypervisors, a VPN endpoint inside a VM pays extra overhead:
- virtio/netfront drivers and vSwitch processing add latency and CPU cost.
- Interrupt coalescing and vCPU scheduling can create bursty delivery.
- Cloud providers may rate-limit UDP or de-prioritize it under congestion.
If you’re trying to push multi-gigabit through a small VM, it will not become a big VM through positive thinking.
UDP realities: loss, jitter, buffers, and shaping
WireGuard uses UDP. That’s a feature, but it means you inherit the underlay’s behavior without a built-in retransmission layer. If the underlay
drops or reorders, the tunnel doesn’t “fix” it. TCP running inside the tunnel will notice and punish you with reduced congestion windows and
retransmits.
What to measure when “it’s just slow sometimes”
- Loss: interface drops, UDP errors, TCP retransmits.
- Jitter: ping variance and application-level timing.
- Queueing: qdisc stats, bufferbloat symptoms (latency spikes under load).
- Policing: cloud egress shaping or middlebox rate-limits (often harsh on UDP bursts).
Check qdisc stats for drops/overlimits
cr0x@server:~$ tc -s qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 912345678 bytes 701234 packets (dropped 123, overlimits 0 requeues 12)
backlog 0b 0p requeues 12
Meaning: Drops occurred at qdisc. That’s local congestion or shaping.
Decision: If drops correlate with slowdowns, address egress queueing: ensure fq_codel/cake, reduce offered load, or fix upstream shaping instead of blaming WireGuard.
Socket buffers: only after you see buffer errors
People love cranking rmem_max and wmem_max. Sometimes it helps; often it just increases latency and memory use while the
real problem is a congested uplink.
cr0x@server:~$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992
Meaning: Default-ish max buffers.
Decision: Only increase if you have evidence of buffer errors (netstat -su) and you understand where queueing will move. Measure after each change.
cr0x@server:~$ sudo sysctl -w net.core.rmem_max=8388608
net.core.rmem_max = 8388608
cr0x@server:~$ sudo sysctl -w net.core.wmem_max=8388608
net.core.wmem_max = 8388608
Meaning: Larger max buffers allowed.
Decision: Re-run iperf3 and check netstat -su. If errors drop and throughput stabilizes without latency exploding, keep; otherwise revert and look at underlay congestion.
Joke #2: UDP is like office gossip—fast, lightweight, and if something gets lost, nobody comes back to correct the record.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “MTU is always 1500 in the cloud”
A mid-size company ran a fleet of WireGuard gateways in a major cloud provider. Things looked fine in staging. In production, every nightly
database dump over the tunnel “sometimes” stalled. Not failed—stalled. Operators would see transfer progress freeze for minutes and then resume.
The first reaction was predictable: blame the database, then the backup tool, then the storage. Someone suggested “maybe WireGuard is slow.”
They lowered the MTU to 1280 because they’d seen it on a blog. It improved slightly, then regressed the next week after a change in the cloud
network path.
The wrong assumption was that the underlay path was a simple 1500-byte Ethernet. It wasn’t. The provider’s network overlay added encapsulation,
and the effective path MTU differed between availability zones. PMTUD should have handled it, but ICMP “frag needed” messages were filtered by a
security appliance that had been installed “temporarily” and then became a permanent fixture—like many corporate things.
The fix was boring: allow the relevant ICMP types, set wg0 MTU based on measured PMTU for the worst path, and clamp MSS for safety. Nightly
backups became stable, and the team stopped paging storage engineers for a network problem wearing a storage costume.
The lesson: if you haven’t measured PMTU end-to-end, you don’t have an MTU. You have a belief system.
An optimization that backfired: “Disable qdisc to reduce overhead”
An internal platform team wanted maximum throughput between two data centers. They read that queue disciplines can add overhead and decided to
“simplify” by switching egress to a basic FIFO and cranking buffers. The graphs loved it at first: higher peak throughput in a lab test.
Then production happened. During daytime traffic, a large transfer would cause latency spikes across unrelated services. The tunnel was shared by
several apps, and the FIFO queue turned into a latency cannon. RPC timeouts climbed. Retries increased load. The classic self-inflicted storm.
Engineers tried to pin the blame on WireGuard. But packet loss wasn’t the main issue; it was queueing delay. The VPN was simply where traffic
converged, so it got blamed like the only person in the room wearing a bright shirt.
Rolling back to fq_codel stabilized latency and improved effective throughput for interactive flows. Peak throughput on the single big
transfer dropped a little. Nobody cared, because the business cared about “systems stay up,” not “one benchmark number looks pretty.”
The lesson: if you optimize throughput without controlling queueing, you’re building a denial-of-service mechanism and calling it performance.
A boring but correct practice that saved the day: “Measure, record, and re-test after every change”
A company with strict change control ran WireGuard between on-prem and cloud. They had a recurring complaint: “the VPN is slow.” Instead of
jumping to tuning, an SRE wrote a small runbook: one iperf3 test, one MTU test, one routing check, one CPU check. Results went into a ticket.
Over a few weeks, patterns appeared. Slowdowns coincided with a specific ISP peering path and increased underlay jitter. Another subset of
incidents correlated with a particular VM size used for the gateway: CPU steal time was high during peak hours.
When a major slowdown hit during a release, the team didn’t argue. They ran the same tests. Underlay ping showed jitter. iperf3 showed
retransmits. CPU was fine. Routing was correct. That eliminated half the usual speculation in five minutes.
They temporarily re-routed traffic through a different gateway on a better path and filed an ISP escalation with clean evidence. The VPN wasn’t
“fixed” by a magic sysctl. It was fixed by knowing where the problem lived.
The lesson: boring measurement discipline is a force multiplier. It doesn’t look heroic, but it prevents expensive guessing.
Common mistakes: symptom → root cause → fix
1) Symptom: SSH works, large downloads stall or hang
Root cause: PMTU blackhole or MTU mismatch; fragmentation drops; ICMP blocked.
Fix: Measure PMTU with DF pings; lower wg0 MTU; clamp TCP MSS; allow ICMP “frag needed” on the path.
2) Symptom: Throughput is fine with iperf3 -P 8 but bad with single flow
Root cause: Single-flow TCP is limited by RTT/loss or CPU per flow; congestion window can’t grow.
Fix: Check retransmits and jitter; fix MTU/loss first; if CPU-limited, improve IRQ distribution or scale the gateway.
3) Symptom: Tunnel is slow only from one site / one ASN / one Wi‑Fi
Root cause: Underlay path loss/jitter or provider shaping; sometimes UDP treated poorly.
Fix: Baseline underlay ping/jitter; compare from multiple networks; consider changing endpoint port, but only after verifying MTU and routing.
4) Symptom: Fast for minutes, then collapses, then recovers
Root cause: Queueing/bufferbloat, or transient drops; sometimes NAT state flapping without keepalive.
Fix: Use fq_codel/cake on egress; monitor qdisc drops; set PersistentKeepalive for roaming/NATed peers; check conntrack.
5) Symptom: One direction is much slower than the other
Root cause: Asymmetric routing, egress policing, or different MTU/PMTU behavior on each path.
Fix: Run routing checks on both ends; verify policy routing and rp_filter; compare qdisc stats and interface drops per direction.
6) Symptom: Performance got worse after kernel/driver/hypervisor update
Root cause: Offload behavior changed; virtio or NIC driver regression; different interrupt moderation.
Fix: Test offload toggles as a temporary workaround; adjust IRQ affinity; plan upgrade/rollback with evidence.
7) Symptom: “WireGuard is slow” only on a gateway that also does firewalling
Root cause: conntrack and iptables/nft rules add CPU cost; softirq pressure; table contention.
Fix: Reduce conntrack for tunnel traffic where safe, simplify rules, scale CPU, or separate roles (VPN endpoint vs stateful firewall).
8) Symptom: Latency spikes under load, even when throughput is acceptable
Root cause: Bad qdisc (FIFO), too-large buffers, or shaping without AQM.
Fix: Use fq_codel/cake; avoid “just increase buffers”; measure ping under load.
Checklists / step-by-step plan (safe, boring, effective)
Checklist A: One-hour diagnosis on a production incident
- Confirm routing: ip route get to the remote tunnel IP. If not wg0, fix routing/AllowedIPs first.
- Confirm tunnel health: wg show for handshake freshness and counters moving.
- Run iperf3: single flow and -P 8. Note retransmits and stability.
- MTU test: DF pings to the peer tunnel IP; look for blackholes.
- CPU/softirq check: mpstat/top. If one core is pinned, treat as host limitation.
- Loss and queueing: ip -s link drops; tc -s qdisc drops/overlimits.
- NAT/conntrack sanity: conntrack stats, firewall counters if available.
- Decide: MTU/MSS fix, routing fix, scale gateway CPU, or escalate underlay provider with evidence.
Checklist B: Hardening a WireGuard deployment for predictable performance
- Pick a target MTU strategy: set wg0 MTU based on worst-case measured PMTU, and/or clamp MSS for TCP.
- Standardize qdisc: fq_codel on WAN egress; avoid FIFO unless you enjoy latency incidents.
- Capacity plan CPU: benchmark iperf3 and measure per-core softirq before declaring “it’s fine.”
- Document routing intent: AllowedIPs, policy rules, source-based routing; include the reason in config comments.
- Decide on keepalive policy: roaming clients and NAT-heavy paths need PersistentKeepalive; servers on stable links often don’t.
- Instrument: export interface drops, qdisc drops, CPU softirq, and handshake age; alert on trends, not one-off spikes.
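For the keepalive and instrumentation items, both are one-liners. A sketch; the peer key is the redacted placeholder from earlier, 25 seconds is the interval used elsewhere in this guide, and the timestamp in the output is illustrative:
cr0x@server:~$ sudo wg set wg0 peer 'q0D...redacted...xw=' persistent-keepalive 25
cr0x@server:~$ sudo wg show wg0 latest-handshakes
q0D...redacted...xw=	1735271432
Export that timestamp as “seconds since last handshake” and alert when it trends past a few minutes on a peer that should be passing traffic.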
Checklist C: Change plan for a suspected MTU issue (with rollback)
- Measure baseline iperf3 and ss -ti retransmits.
- Measure the largest DF ping payload that passes reliably.
- Lower wg0 MTU by a small, justified amount (e.g., 20–40 bytes) or implement MSS clamp.
- Re-test iperf3 and observe retransmit change.
- Roll back if no improvement; MTU wasn’t the bottleneck.
- Write down the measured PMTU and the chosen value so nobody “optimizes” it later.
FAQ
1) What throughput should I expect from WireGuard?
On modern hardware, WireGuard can reach multi-gigabit. In practice, your ceiling is usually CPU, NIC/driver behavior, and underlay loss/jitter.
Measure with iperf3 and watch per-core softirq.
2) Is 1420 always the right MTU for wg0?
No. It’s a reasonable default for a 1500-byte IPv4 underlay. Cloud overlays, PPPoE, and nested tunnels can require smaller MTUs. Measure PMTU
and set MTU based on evidence.
3) Should I use MSS clamping or set MTU?
If the problem is mostly TCP and you have diverse clients/paths, MSS clamping is often the least disruptive fix. If you carry significant UDP
traffic (VoIP, games, QUIC-heavy workloads) and PMTU is unreliable, setting wg0 MTU to a safe value is cleaner.
4) Why does iperf3 -P 8 look great but single stream is mediocre?
Parallel streams hide single-flow limitations (RTT, loss sensitivity, congestion control behavior) and can distribute CPU costs. If business
traffic is single-flow (backups, object downloads), optimize for that: fix loss/MTU and reduce queueing.
5) Can firewall rules make WireGuard slow?
Absolutely. Stateful filtering and conntrack can add CPU overhead and drop packets under load. Also, firewalls blocking ICMP “frag needed” can
break PMTUD and cause the classic “small packets fine, big packets die” symptom.
6) Is WireGuard slower than OpenVPN?
Often it’s faster, especially in kernel mode. But if your bottleneck is MTU blackholing, routing asymmetry, or underlay loss, switching VPNs
won’t fix physics and policy. Diagnose first.
7) Does changing the WireGuard port help performance?
Sometimes, but it’s not a first-line fix. Changing ports can bypass dumb rate limits or broken middleboxes. Do MTU/routing/CPU checks first so
you don’t “fix” the wrong problem by accident.
8) Why is WireGuard slow only on one laptop network?
Likely PMTU/ICMP filtering on that network, or UDP shaping/loss on that path. Roaming clients behind NAT can also suffer if keepalive isn’t set.
Compare DF pings and observe handshake stability.
9) Should I increase Linux socket buffers for WireGuard?
Only if you see evidence like UDP send/receive buffer errors. Bigger buffers can increase latency and hide congestion. Fix queueing and loss
first; tune buffers second.
Next steps you can execute this week
- Pick one reproducible test (iperf3 single flow + -P 8) and record baseline throughput, retransmits, and CPU.
- Run DF ping PMTU tests across the tunnel and decide on MTU or MSS clamping based on the largest reliable size, not vibes.
- Validate routing deterministically with ip route get and ip rule on both ends; hunt asymmetry.
- Watch softirq during load; if one core is pinned, fix IRQ distribution or scale the VPN endpoint.
- Check qdisc drops and move to fq_codel where appropriate; stop using FIFO if latency matters (and it always does).
- Write a one-page runbook containing the exact commands you ran today and the “if output looks like X, do Y” decisions.
WireGuard isn’t “slow.” Your path, your MTU, your routing, or your CPU is slow. The good news is that those are diagnosable with a handful of
commands and a refusal to guess.