VPN Slower Than Expected: Diagnose Router CPU, Crypto, and MSS Clamping Like You Mean It

Your VPN link is “up,” the ping looks fine, and yet file copies crawl like it’s 2003 and somebody picked up a phone handset.
Welcome to the most common VPN failure mode in production: not “down,” just embarrassingly slower than the line you pay for.

The fix is rarely mystical. It’s usually a bottleneck you can measure: router CPU pegged doing crypto, a single-threaded daemon,
path MTU discovery (PMTUD) getting black-holed, or MSS clamping done “almost right” (the worst kind of right).
This is how you diagnose it without cargo culting settings until it “seems better.”

Fast diagnosis playbook (first 10 minutes)

This is the sequence that finds the bottleneck quickly. The goal is not “collect data.” The goal is to make a decision after each step:
CPU-bound? MTU-bound? loss/bufferbloat? shaping? single-thread? offload disabled? Move on only when you can say “not that.”

1) Prove whether the tunnel is throughput-limited by the endpoints or the path

  • Run iperf3 across the VPN in both directions, with 1 stream and with 4 parallel streams (see the sketch after this list).
    If 1 stream is slow but 4 streams are much faster, you have a TCP/congestion/MTU/loss problem, not a raw link cap.
  • Compare to an iperf test outside the VPN (if possible). If WAN speed is fine but VPN speed isn’t, it’s tunnel overhead or endpoint compute.
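
A minimal way to run both checks above in one pass, assuming the far end is 10.20.0.10 and already runs iperf3 -s (address and durations are placeholders):

cr0x@server:~$ iperf3 -c 10.20.0.10 -P 1 -t 15        # single stream, this side sending
cr0x@server:~$ iperf3 -c 10.20.0.10 -P 4 -t 15        # four parallel streams
cr0x@server:~$ iperf3 -c 10.20.0.10 -P 1 -t 15 -R     # reverse direction without swapping roles
cr0x@server:~$ iperf3 -c 10.20.0.10 -P 4 -t 15 -R     # reverse, four streams

Record all four numbers. The 1-vs-4 gap and the forward-vs-reverse gap are the first two forks in the decision tree.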

2) Check router CPU (and specifically the crypto path)

  • Look for a single core pinned at 100% (classic OpenVPN) or high softirq / ksoftirqd (packet processing).
  • Confirm whether hardware crypto/offload is active. It often turns off when NAT, policy routing, or a “helpful” feature is enabled.
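
A quick smell test for the crypto side, assuming a Linux gateway with OpenSSL installed (cipher names depend on your OpenSSL build, and the numbers only mean something relative to your target rate):

cr0x@server:~$ lsmod | grep -i aes                        # is a hardware AES driver (e.g. aesni_intel) loaded?
cr0x@server:~$ openssl speed -evp aes-256-gcm             # rough single-core AES-GCM throughput in userspace
cr0x@server:~$ openssl speed -evp chacha20-poly1305       # worth comparing on CPUs without AES instructions

If single-core AES-GCM in userspace can't comfortably beat your WAN rate, no amount of VPN tuning will.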

3) Check MTU/PMTUD/MSS before you “tune TCP”

  • Do a binary search with ping and DF (don’t fragment) set to find the real path MTU through the tunnel (a quick loop sketch follows this list).
  • If ICMP frag-needed is blocked anywhere, PMTUD breaks and TCP stalls or retransmits. MSS clamping can mask it, but only if set correctly.
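
A rough sketch of that search as a simple loop (assumes IPv4 and a tunnel peer at 10.20.0.10; the -s payload plus 28 bytes of IP and ICMP headers is the packet size actually being tested):

cr0x@server:~$ for sz in 1472 1452 1432 1412 1392 1372; do printf '%s: ' "$sz"; ping -M do -c 1 -W 1 -s "$sz" 10.20.0.10 >/dev/null 2>&1 && echo ok || echo too-big-or-lost; done

The largest payload that answers, plus 28, is your effective IPv4 path MTU; Task 7 below shows the same test with real output.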

4) Measure loss, reordering, and bufferbloat

  • Use mtr and iperf3 in UDP mode carefully; a UDP test rate set above the real capacity manufactures its own loss (example below). If you see loss under load, fix queueing/shaping, not encryption settings.
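
For example (the peer address is a placeholder; keep the UDP rate below what you expect the path to handle, or you will measure loss you caused yourself):

cr0x@server:~$ mtr -rwz -c 100 10.20.0.10               # per-hop loss/latency report over 100 probes
cr0x@server:~$ iperf3 -c 10.20.0.10 -u -b 200M -t 15    # UDP at a fixed 200 Mbit/s offered rate; watch loss% and jitter

Loss that appears only while the link is loaded points at queues and shapers, not at the crypto.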

5) Only then: chase configuration specifics (cipher suites, tunnel mode, fragmentation settings)

Don’t start by swapping AES-256 for ChaCha20 “because Reddit.” If the CPU is idle and MTU is correct, cipher choice isn’t your bottleneck.

A mental model that prevents dumb guesses

VPN performance is the product of three separate systems that love to fail independently:

  1. Compute: Can your endpoint encrypt/decrypt packets at line rate? This includes CPU, crypto acceleration, kernel vs userspace,
    and whether you’ve accidentally forced all packets through the slow path.
  2. Packet sizing: Encapsulation adds overhead. If your effective MTU is wrong, you get fragmentation, drops, and weird TCP behavior.
    PMTUD is supposed to handle this, until somebody blocks ICMP because “security.”
  3. Transport dynamics: TCP throughput depends on RTT, loss, and queueing. VPNs often increase RTT slightly and can amplify bufferbloat,
    which murders throughput long before the link is saturated.

The trap: engineers treat “VPN” as one knob. It’s not one knob. It’s a pipeline. Your job is to find which stage is limiting,
then fix that stage without breaking the others.

One quote worth keeping on a sticky note, because it applies aggressively here:
“Hope is not a strategy.” — Gordon R. Sullivan

Interesting facts and quick history (because it explains the weirdness)

  • IPsec predates “modern internet” anxiety. The early IPsec RFCs were mid-1990s; it was designed to secure IP itself,
    long before “zero trust” became a slide template.
  • OpenVPN’s default architecture is historically CPU-hostile. Classic OpenVPN runs in userspace and (commonly) pushes most traffic through
    a single thread, which is why one busy core can cap throughput even when you have eight idle ones.
  • WireGuard is deliberately small. Its codebase is famously compact compared to typical VPN stacks, reducing attack surface and often improving
    performance via kernel integration and modern crypto choices.
  • PMTUD was meant to eliminate fragmentation. Path MTU discovery (late 1980s concept, widely implemented later) relies on ICMP feedback.
    Block the ICMP and you’re back to “mystery stalls.”
  • MSS clamping is an old workaround that refuses to die. It’s been used for decades to avoid fragmentation when PMTUD breaks,
    especially across PPPoE and tunnels.
  • AES-NI changed the economics of VPNs. Once common x86 CPUs got AES acceleration, “crypto is slow” stopped being universally true.
    On small routers and ARM SoCs, it’s still very true.
  • GCM vs CBC isn’t just academic. AEAD modes like AES-GCM can be faster and safer in modern implementations, but only if hardware and drivers
    handle them well; otherwise you can end up slower than expected.
  • Encapsulation overhead is not negotiable physics. You pay bytes for headers, authentication tags, and sometimes UDP wrapping.
    That reduces payload efficiency and can push you over MTU limits.

Joke #1: VPN troubleshooting is like archaeology—you brush away layers of “temporary” fixes until you find a fossilized MTU from 2014.

Symptoms that matter (and the ones that mislead)

Symptoms that actually narrow the search

  • Throughput caps at a suspicious number (like 80–200 Mbps) regardless of WAN speed: often CPU/crypto or single-thread limits.
  • One direction fast, the other slow: asymmetric routing, asymmetric shaping, or CPU pinned on only one side.
  • Small transfers “feel fine,” large transfers crawl: MTU/PMTUD issues or bufferbloat under sustained load.
  • Multiple TCP streams outperform one stream by a lot: packet loss, reordering, or MTU black-hole behavior; raw link is probably fine.
  • UDP traffic is fine; TCP is awful: classic sign of loss/queueing/MTU rather than raw bandwidth.

Symptoms that waste your time if you over-index on them

  • Ping looks great: ping is tiny packets; your problem is usually big packets under load.
  • CPU average is low: one core pegged matters; averages lie.
  • “It got worse after we enabled security feature X”: maybe, but measure. Correlation is a career-limiting drug.

Router CPU & crypto limits: the silent throughput killer

A lot of “VPN is slow” cases are simply “your router is doing expensive math at a rate it cannot sustain.”
That math could be encryption, authentication, encapsulation, checksums, or just moving packets between interfaces.

CPU bottleneck patterns you can recognize quickly

On commodity routers, VPN throughput is frequently bounded by:

  • Single-threaded encryption (common in OpenVPN setups): one core hits 100%, throughput flatlines, and you can’t “optimize” your way out
    except by changing architecture (DCO, kernel mode, different protocol) or hardware.
  • Kernel packet processing overhead: high softirq, ksoftirqd activity, lots of small packets (ACKs), NAT, conntrack, and firewall rules.
  • Crypto acceleration not used: you bought a box with acceleration, but the traffic path doesn’t hit it.
    Or the cipher/setting you picked bypasses it.

Crypto isn’t just “AES vs ChaCha”; it’s the whole data path

People love cipher debates because it feels like tuning a race car. But most VPN performance problems are closer to “the tires are flat.”
Ask the unglamorous questions:

  • Is the VPN in kernel or userspace?
  • Is it UDP or TCP transport?
  • Is hardware offload active end-to-end?
  • Did enabling NAT or policy routing disable fast-path forwarding?
  • Are you doing deep packet inspection or excessive firewall logging on encrypted flows?

The key diagnostic: if increasing parallel streams increases throughput dramatically, you’re not purely CPU-bound on crypto.
If throughput flatlines at the same number regardless of streams and packet size, you probably are.

MTU, PMTUD black holes, and MSS clamping: where “works” becomes “slow”

VPNs add headers. Headers consume MTU. When you ignore that, you get fragmentation or drops. When PMTUD is broken, you don’t even get a clean failure.
You get timeouts, retransmits, and throughput that looks like a random number generator.

Encapsulation overhead: the math you actually need

Start simple: Ethernet MTU is typically 1500 bytes. A VPN packet has to fit inside that.
If you add (for example) IP + UDP + VPN header + auth tag, you reduce the payload.

The exact overhead varies by protocol and settings. Don’t guess. Measure the real effective path MTU through the tunnel.
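
One concrete illustration, assuming WireGuard over IPv4 (other protocols, and IPv6 outers, change every number here):

cr0x@server:~$ # outer IPv4 20 + outer UDP 8 + WireGuard data header 16 + Poly1305 tag 16 = 60 bytes of overhead
cr0x@server:~$ echo $((1500 - 20 - 8 - 16 - 16))
1440

That is the largest inner packet that fits a 1500-byte wire MTU over IPv4; wg-quick's default MTU of 1420 corresponds to the 80-byte worst case with an IPv6 outer header. IPsec, OpenVPN, and anything riding on PPPoE each have their own arithmetic, which is why you measure instead of copying someone else's number.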

PMTUD black hole: the classic “security” footgun

PMTUD relies on ICMP “Fragmentation Needed” messages. Block them, and TCP keeps sending too-large packets with DF set, which get dropped silently.
TCP then backs off, retransmits, and crawls.

MSS clamping is a workaround: if you reduce the TCP MSS so endpoints never send packets that require fragmentation, you bypass PMTUD.
But you must clamp on the right interface, in the right direction, for the right traffic.
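
A sketch of what "right interface, right direction" looks like on a Linux gateway (wg0 and the nftables table/chain names are placeholders; Task 9 below shows how to verify the rules actually match traffic):

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -i wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ # roughly equivalent nftables form, assuming an existing inet mangle table with a forward chain
cr0x@server:~$ sudo nft add rule inet mangle forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu

Clamping SYN packets in both forward directions covers forwarded traffic; if the gateway itself talks through the tunnel, locally generated traffic needs the same treatment in the OUTPUT path.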

Why MSS clamping sometimes “helps a bit” but doesn’t fix it

  • You clamped MSS on the WAN interface instead of the tunnel interface (or vice versa).
  • You clamped only forwarded traffic but not locally generated traffic (or the reverse).
  • You clamped to a value that’s still too high for the actual encapsulated MTU.
  • You clamped TCP, but your problematic traffic is UDP (some apps) or you have IPsec with ESP fragmentation issues.

Joke #2: MSS clamping is like trimming your bangs with a chainsaw—it can work, but you won’t like the process or the edge cases.

TCP behavior over VPN: why latency and loss multiply the pain

TCP throughput isn’t “bandwidth.” It’s bandwidth constrained by congestion control behavior and the bandwidth-delay product.
VPNs often add a bit of latency. They also often add queueing (especially if the router buffers too much).
A tiny amount of loss under load can cut throughput dramatically.
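
A back-of-envelope sketch of why, using the classic Mathis approximation (throughput roughly (MSS / RTT) * 1.22 / sqrt(loss)). The 500 Mbit/s target, 40 ms RTT, 1360-byte MSS, and 0.01% loss below are illustrative numbers; real stacks (CUBIC, BBR) behave differently, but the order of magnitude is the point:

cr0x@server:~$ awk 'BEGIN { rtt=0.040; p=0.0001; mss=1360*8;
      printf "window to fill 500 Mbit/s at 40 ms RTT: %.1f MB\n", 500e6*rtt/8/1e6;
      printf "Mathis ceiling at 0.01%% loss: %.1f Mbit/s\n", (mss/rtt)*1.22/sqrt(p)/1e6 }'
window to fill 500 Mbit/s at 40 ms RTT: 2.5 MB
Mathis ceiling at 0.01% loss: 33.2 Mbit/s

A sliver of loss turns a fat link into a thin one for a single flow, which is exactly why "four streams fast, one stream slow" points at loss and queueing rather than raw bandwidth.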

What “multiple streams faster than one” really means

When one TCP flow is slow but four in parallel saturate the link, you’re seeing one (or more) of:

  • loss events that cause single-flow congestion window collapse
  • bufferbloat causing high RTT variance and delayed ACKs
  • retransmissions from MTU/fragmentation problems
  • per-flow shaping/policing on some middlebox

Don’t “fix” this by telling users to use more parallelism. That’s how you turn your network into a queueing experiment.
Fix the underlying issue: MTU correctness, queue management (fq_codel/cake), proper shaping at the edge, or a better VPN datapath.
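
A minimal sketch of that fix on a Linux gateway, assuming the WAN egress is eth0, the kernel has cake available, and the measured uplink is roughly 500 Mbit/s (interface name and rate are placeholders you must replace with your own):

cr0x@server:~$ sudo tc qdisc replace dev eth0 root cake bandwidth 450Mbit   # shape slightly below the true uplink rate
cr0x@server:~$ tc qdisc show dev eth0                                       # confirm cake took over as root qdisc

If cake isn't available, fq_codel as root (Task 14) at least fixes the queueing discipline, though without a shaper the oversized buffer may still live in the modem or ISP gear upstream.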

VPN over TCP is usually self-inflicted pain

Running a VPN tunnel over TCP, then running TCP inside it, creates TCP-over-TCP meltdown: retransmissions and congestion control fight each other.
If you can use UDP transport for the tunnel, do it. If you can’t, expect “it works but weirdly slow when there’s loss.”

Hardware offload, NAT, and why acceleration sometimes disappears

Many routers advertise “gigabit VPN.” Then you turn on NAT, QoS, IDS, VLAN tagging, policy routing, or just a different cipher, and suddenly you’re at 120 Mbps.
The missing piece is usually fast-path/offload. Hardware acceleration is picky; it often requires a very specific traffic path.

Common ways to accidentally disable offload

  • NAT on the encrypted interface when the offload engine expects route-only forwarding.
  • Firewall rules that require deep inspection on the datapath (or logging every packet because “audit”).
  • Policy-based routing that moves packets out of the accelerated pipeline.
  • Using an unsupported cipher/mode for the hardware engine.
  • Fragmentation handling in software due to MTU mismatch.

You don’t have to worship offload. But you do need to know whether it’s active. Otherwise you’re tuning the wrong engine.
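
What "know whether it's active" looks like depends on the platform. On a plain Linux gateway, a hedged starting point (eth0 is a placeholder; vendor routers keep their fast-path and crypto-offload counters in their own CLIs):

cr0x@server:~$ ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-receive-offload|rx-checksumming|tx-checksumming'
cr0x@server:~$ lsmod | grep -iE 'aesni|qat'                          # AES-NI / Intel QAT style crypto drivers loaded?
cr0x@server:~$ grep -B1 -A2 'gcm(aes)' /proc/crypto | head -20       # which gcm(aes) implementations are registered, and by which driver

On vendor gear, find the offload or flow-acceleration counter in the vendor CLI; if it sits at zero while tunnel traffic flows, you are in the software slow lane.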

Practical tasks (commands, outputs, decisions) — do these, not vibes

Below are concrete tasks. Each includes commands, sample output, what it means, and the decision you make from it.
Run them on the VPN endpoints (routers, Linux gateways, or servers) and on a client if needed.

Task 1: Measure raw throughput with iperf3 (1 stream vs 4 streams)

cr0x@server:~$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
cr0x@server:~$ iperf3 -c 10.20.0.10 -P 1 -t 15
Connecting to host 10.20.0.10, port 5201
[  5]   0.00-15.00  sec   145 MBytes  81.0 Mbits/sec  0             sender
[  5]   0.00-15.01  sec   143 MBytes  79.8 Mbits/sec                receiver
cr0x@server:~$ iperf3 -c 10.20.0.10 -P 4 -t 15
Connecting to host 10.20.0.10, port 5201
[SUM]   0.00-15.00  sec   580 MBytes   324 Mbits/sec  12             sender
[SUM]   0.00-15.01  sec   575 MBytes   321 Mbits/sec                receiver

Meaning: One stream is 80 Mbps, four streams reach 320 Mbps. The path can move data, but single-flow TCP is struggling.
Decision: Prioritize MTU/PMTUD, loss, and queueing diagnostics. Don’t buy new hardware yet.

Task 2: Check CPU saturation and per-core hotspots (Linux)

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0 (gw-a)  12/28/2025  _x86_64_  (8 CPU)

12:10:01 AM  CPU   %usr %nice  %sys %iowait  %irq %soft  %steal %idle
12:10:02 AM  all    6.2  0.0   9.8    0.0    0.0  4.1     0.0  79.9
12:10:02 AM    0    2.0  0.0   6.0    0.0    0.0  1.0     0.0  91.0
12:10:02 AM    1   38.0  0.0  55.0    0.0    0.0  7.0     0.0   0.0
12:10:02 AM    2    1.0  0.0   2.0    0.0    0.0  0.0     0.0  97.0

Meaning: CPU1 is pinned (0% idle) while others are mostly idle. That screams “single-thread bottleneck” or a single IRQ queue.
Decision: If this is OpenVPN, test DCO or a different protocol. If it’s IRQ, consider RSS/RPS/XPS tuning and NIC queue settings.

Task 3: Identify the VPN process and whether it’s the hot spot

cr0x@server:~$ top -H -b -n 1 | head -30
top - 00:10:12 up 12 days,  3:44,  1 user,  load average: 1.12, 0.98, 0.91
Threads: 274 total,   2 running, 272 sleeping,   0 stopped,   0 zombie
%Cpu(s): 45.3 us, 10.4 sy,  0.0 ni, 44.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8421 root      20   0  164796  23452  10652 R  99.3   0.6  12:33.14 openvpn
 8428 root      20   0  164796  23452  10652 R  98.7   0.6  12:30.09 openvpn

Meaning: OpenVPN threads are consuming CPU. Even if there are two threads, the datapath may still be limited by how packets are handled.
Decision: Validate OpenVPN mode (userspace vs DCO), consider switching to WireGuard or IPsec with kernel acceleration if appropriate.

Task 4: Check whether AES acceleration is available (x86 AES-NI)

cr0x@server:~$ grep -m1 -o 'aes\|avx\|sha_ni' /proc/cpuinfo | sort -u
aes
avx

Meaning: CPU supports AES instructions. That doesn’t guarantee your VPN uses them, but it makes high throughput plausible.
Decision: If throughput is still low and CPU is high, suspect non-accelerated code path, userspace overhead, or offload disabled.

Task 5: Confirm WireGuard handshake and per-peer transfer rates

cr0x@server:~$ sudo wg show
interface: wg0
  public key: T7oG...redacted...
  listening port: 51820

peer: 9q3B...redacted...
  endpoint: 198.51.100.20:49012
  allowed ips: 10.20.0.0/24
  latest handshake: 38 seconds ago
  transfer: 21.14 GiB received, 18.02 GiB sent
  persistent keepalive: every 25 seconds

Meaning: Tunnel is alive. Transfer counters move. If performance is bad, you can stop blaming “handshake instability.”
Decision: Move to MTU/queueing tests and CPU profiling; WireGuard itself is rarely the bottleneck unless the box is underpowered.

Task 6: Inspect IPsec state and whether it’s rekeying too often

cr0x@server:~$ sudo ip -s xfrm state
src 203.0.113.10 dst 198.51.100.20
	proto esp spi 0x0000000b reqid 1 mode tunnel
	replay-window 32 flag af-unspec
	aead rfc4106(gcm(aes)) 0x... 128
	anti-replay context: seq 0x0007a12f, oseq 0x0, bitmap 0x00000000
	lifetime config:
	  limit: soft (INF) hard (INF)
	lifetime current:
	  12345(s) 0(bytes)
	stats:
	  replay-window 0 replay 0 failed 0
	  bytes 9823141234 packets 912341 errors 0

Meaning: IPsec is stable (no errors, counters increasing). Rekey churn would show repeated SA changes and sometimes brief stalls.
Decision: If CPU is high here, look for missing hardware acceleration or small-packet overhead; otherwise proceed to MTU/PMTUD.

Task 7: Find the effective MTU through the tunnel with DF pings (binary search)

cr0x@server:~$ ping -M do -s 1472 -c 3 10.20.0.10
PING 10.20.0.10 (10.20.0.10) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420

--- 10.20.0.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss
cr0x@server:~$ ping -M do -s 1392 -c 3 10.20.0.10
PING 10.20.0.10 (10.20.0.10) 1392(1420) bytes of data.
1400 bytes from 10.20.0.10: icmp_seq=1 ttl=64 time=18.6 ms
1400 bytes from 10.20.0.10: icmp_seq=2 ttl=64 time=18.7 ms
1400 bytes from 10.20.0.10: icmp_seq=3 ttl=64 time=18.5 ms

--- 10.20.0.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms

Meaning: You’ve found a real MTU constraint (1420). Here the “local error” means the limit is the local tunnel interface MTU; if the constraint
were further along the path, you’d see “Frag needed” replies from a router, or silent loss where those are blocked.
Decision: Set tunnel MTU appropriately and/or clamp MSS. Also verify ICMP frag-needed is not blocked in the path.
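
If the DF test had come back lower than the interface MTU, applying the discovered limit on a WireGuard interface looks like this (1412 is an illustrative value, and the config path assumes wg-quick's conventional layout):

cr0x@server:~$ sudo ip link set dev wg0 mtu 1412                 # runtime change, lost on reboot
cr0x@server:~$ grep -n '^MTU' /etc/wireguard/wg0.conf            # persist it as "MTU = 1412" under [Interface]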

Task 8: Check current interface MTU settings

cr0x@server:~$ ip link show dev wg0
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none

Meaning: wg0 MTU matches the discovered limit. Good.
Decision: If you found a lower path MTU than the interface MTU, lower it; don’t “let it fragment and hope.”

Task 9: Verify MSS clamping rules (iptables) and confirm packet counters increment

cr0x@server:~$ sudo iptables -t mangle -S | grep -E 'TCPMSS|MSS'
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
-A FORWARD -i wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -L FORWARD -v -n | grep TCPMSS
  842K   52M TCPMSS  tcp  --  *    wg0   0.0.0.0/0   0.0.0.0/0   tcp flags:0x06/0x02 TCPMSS clamp to PMTU
  799K   49M TCPMSS  tcp  --  wg0  *     0.0.0.0/0   0.0.0.0/0   tcp flags:0x06/0x02 TCPMSS clamp to PMTU

Meaning: MSS clamping is applied both directions and counters increment. That’s what you want when PMTUD is unreliable.
Decision: If counters are zero, you clamped the wrong chain/interface; fix placement. If PMTUD works, clamping can be optional.

Task 10: Confirm ICMP “fragmentation needed” isn’t being dropped (nftables example)

cr0x@server:~$ sudo nft list ruleset | grep -n 'icmp type'
42:    ip protocol icmp icmp type { echo-request, echo-reply, destination-unreachable, time-exceeded } accept

Meaning: “Fragmentation needed” is ICMP type 3 (destination unreachable) code 4, so blocking destination-unreachable is how you manufacture PMTUD black holes.
Decision: If your ruleset only allows echo-request/echo-reply, expand it; for IPv6, allow ICMPv6 packet-too-big as well. Security teams can log these; they shouldn’t drop them.

Task 11: Look for fragmentation and reassembly signals in interface counters

cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped overrun mcast
    9812312312 81231231 0       1234    0       0
    TX:  bytes  packets  errors  dropped carrier collsns
    8712312312 70123123 0       9876    0       0

Meaning: Dropped packets under load (TX dropped, RX dropped) correlate strongly with buffer pressure, shaping errors, or MTU issues.
Decision: If drops climb during iperf tests, fix queueing/shaping. If drops occur only for large packets, revisit MTU/MSS.

Task 12: Check for retransmits and TCP pain (ss -i)

cr0x@server:~$ ss -ti dst 10.20.0.10 | head -20
ESTAB 0 0 10.10.0.5:46732 10.20.0.10:445
	 cubic wscale:7,7 rto:220 rtt:48.3/12.1 ato:40 mss:1360 pmtu:1420 rcvmss:1360 advmss:1360 cwnd:18 bytes_acked:8123412 segs_out:51234 segs_in:49871 send 40.5Mbps lastsnd:28 lastrcv:12 lastack:12 pacing_rate 81.0Mbps retrans:214/51234

Meaning: Non-trivial retransmits (214) plus modest cwnd suggests loss/queueing/MTU issues impacting TCP.
Decision: If retrans climbs during transfers, find where packets are dropped. Don’t tune sysctls until you stop drops.

Task 13: Detect bufferbloat under load using ping while saturating

cr0x@server:~$ ping -i 0.2 -c 20 10.20.0.10
PING 10.20.0.10 (10.20.0.10) 56(84) bytes of data.
64 bytes from 10.20.0.10: icmp_seq=1 ttl=64 time=19.2 ms
64 bytes from 10.20.0.10: icmp_seq=2 ttl=64 time=18.9 ms
64 bytes from 10.20.0.10: icmp_seq=3 ttl=64 time=210.5 ms
64 bytes from 10.20.0.10: icmp_seq=4 ttl=64 time=243.1 ms
64 bytes from 10.20.0.10: icmp_seq=5 ttl=64 time=198.7 ms

Meaning: Run this ping while the Task 1 iperf3 test is saturating the tunnel. Latency jumping from ~19 ms to 200+ ms under load is queueing/bufferbloat,
and it will smash TCP throughput and make apps “feel broken.”
Decision: Implement sane queue management (fq_codel/cake) and shape at the bottleneck link, typically the WAN egress.

Task 14: Confirm qdisc and apply fq_codel (Linux) if appropriate

cr0x@server:~$ tc qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
cr0x@server:~$ sudo tc qdisc replace dev eth0 root fq_codel
cr0x@server:~$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn

Meaning: pfifo_fast is a museum piece. fq_codel provides active queue management that reduces bufferbloat.
Decision: If latency spikes under load, keep fq_codel (or cake) and add shaping if the bottleneck is upstream.

Task 15: Check NIC and IRQ distribution (RSS/RPS hint)

cr0x@server:~$ cat /proc/interrupts | head -12
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  35:   9123412  781234123      12345      11234      10987      10321      10012       9987  IR-PCI-MSI 524288-edge  eth0-TxRx-0

Meaning: One interrupt queue landing mostly on CPU1 can create a per-core bottleneck even if total CPU seems fine.
Decision: If one core is pinned and interrupts are skewed, tune RSS/RPS or adjust IRQ affinity; or upgrade hardware/NIC drivers.
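
If the skew is real and the NIC can't spread load in hardware, a hedged software mitigation is RPS (the queue name rx-0 and the mask f, meaning CPUs 0-3, are assumptions; prefer proper multi-queue RSS in the NIC/driver when available):

cr0x@server:~$ cat /sys/class/net/eth0/queues/rx-0/rps_cpus                 # 0 means RPS is currently off for this queue
cr0x@server:~$ echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus   # hex CPU mask: spread over CPUs 0-3

RPS costs cycles moving work between cores, so treat it as a bridge to a NIC with working RSS or to faster hardware, not as the end state.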

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption (PMTUD “security hardening”)

A mid-sized company rolled out a site-to-site VPN between two offices and a cloud VPC. Pings were great. RDP was fine.
Then the finance team started exporting large reports and the transfer would stall at 2–5% for minutes, then continue, then stall again.
The initial assumption: “VPN crypto is too slow; we need bigger boxes.”

The network team checked CPU. Plenty of headroom. They swapped ciphers anyway because that’s what people do when they’re stuck.
Slight improvements in one direction, worse in the other. The problem stayed weird and intermittent, which is catnip for bad theories.

An SRE ran a DF ping test and immediately hit a “message too long” MTU around 1412 bytes—lower than expected.
Then they checked firewall rules and found that ICMP destination-unreachable was dropped on a “hardened” perimeter ACL.
PMTUD was dead, so TCP sessions were black-holing larger packets until retransmits and timeouts made progress.

The fix wasn’t a bigger router. It was allowing the specific ICMP types needed for PMTUD and adding MSS clamping as a belt-and-suspenders move.
Transfers went from “minutes of stalls” to “boringly fast.” The postmortem action item wasn’t “don’t harden.”
It was “don’t break fundamental protocols and call it security.”

Mini-story 2: The optimization that backfired (enabling “acceleration” that disabled the real fast path)

Another org had an IPsec tunnel doing okay-ish performance. A network engineer enabled a vendor feature marketed as “advanced traffic inspection”
for “better visibility.” It was supposed to be lightweight. It was also enabled globally because the UI made that the easiest option.

Within a day, complaints rolled in: VoIP jitter, file transfers slowed, database replication lag.
Monitoring showed WAN utilization lower than normal while application latency climbed. That’s always a clue: the link isn’t saturated; the bottleneck is inside the box.

The router CPU was not pegged overall, but packet processing jumped and interrupts spiked. Offload counters went flat.
The inspection feature forced packets out of the accelerated datapath and into a software slow lane.
Suddenly the same hardware was doing more work per packet, and it couldn’t keep up at peak.

Disabling the feature restored offload and throughput. They reintroduced visibility later using flow logs on the decrypted side and selective sampling,
not per-packet inspection on the tunnel interface. The lesson wasn’t “never inspect.”
It was “if you enable a feature that touches every packet, prove it doesn’t disable the thing paying your performance bills.”

Mini-story 3: The boring but correct practice that saved the day (baseline tests and change discipline)

A company planned a migration of services between data centers over a VPN. The VPN “seemed fine” in casual testing.
But the SRE team insisted on a baseline: iperf3 in both directions, 1 and 4 streams, DF MTU tests, and ping-under-load.
They recorded results, not as an academic exercise, but to have a before/after truth source.

During the migration window, throughput dropped by ~60% and latency spikes appeared under load.
Because they had a baseline, they didn’t waste time arguing about whether it was “normal internet variability.”
They knew exactly what “normal” looked like for their path.

They rolled back the most recent change: a QoS policy update on the WAN edge that accidentally shaped the tunnel traffic twice
(once on the physical interface and again on a VLAN subinterface). Double shaping caused queueing and drops.
The fix was unglamorous: one shaper at the true bottleneck, fq_codel for fairness, and clear ownership of the policy.

Nobody got a trophy for it. But replication finished on time and the migration didn’t turn into a weekend-long incident bridge.
Boring practices—baselines, incremental changes, and “one shaper to rule them all”—are how grownups keep VPNs fast.

Common mistakes: symptom → root cause → fix

These are the repeat offenders. If you recognize your situation, skip the improvisation and do the specific fix.

1) Symptom: Speed tops out at 80–200 Mbps no matter what

  • Root cause: CPU/crypto bottleneck, single-threaded VPN, or offload disabled.
  • Fix: Verify per-core CPU. Confirm offload status. Move to kernel datapath (WireGuard, IPsec kernel) or OpenVPN DCO where viable; otherwise upgrade hardware.

2) Symptom: Small web browsing fine, big downloads stall or crawl

  • Root cause: MTU mismatch or PMTUD black hole causing drops of large packets.
  • Fix: DF ping to find real MTU. Allow ICMP destination-unreachable/time-exceeded. Clamp MSS on tunnel forwarding path.

3) Symptom: One direction fast, reverse direction slow

  • Root cause: Asymmetric routing, asymmetric shaping, or one endpoint CPU-bound on decrypt/encrypt.
  • Fix: Test iperf3 both directions. Check CPU and qdisc on both sides. Verify routing tables and policy routing.

4) Symptom: VPN works until traffic ramps up, then latency explodes

  • Root cause: Bufferbloat or an oversubscribed uplink with no shaping/AQM.
  • Fix: Apply fq_codel/cake and shape at the true bottleneck (usually WAN egress). Re-test ping under load.

5) Symptom: “We enabled QoS and now VPN is slower”

  • Root cause: Misclassification (encrypted traffic all looks the same), double shaping, or too-low shaper rates causing drops.
  • Fix: Shape only once, on the real egress. Classify on inner traffic only where possible (decrypted side), otherwise use coarse policies.

6) Symptom: OpenVPN performance terrible on fast links

  • Root cause: Userspace overhead and single-thread constraints.
  • Fix: Consider OpenVPN DCO, or migrate to WireGuard/IPsec for high throughput. If stuck, ensure UDP mode, tune MTU/MSS, and pin to faster cores.

7) Symptom: IPsec throughput dropped after enabling NAT on tunnel path

  • Root cause: Offload disabled or policy mismatch; NAT forces slow path.
  • Fix: Re-architect to avoid NAT across the tunnel when possible. If required, confirm hardware/driver supports it; otherwise accept lower throughput or upgrade.

8) Symptom: Speed tests vary wildly across time of day

  • Root cause: Upstream congestion, ISP shaping, or shared-medium saturation; VPN just makes it more obvious.
  • Fix: Establish baselines, measure outside VPN too, and shape traffic to reduce packet loss. If it’s the ISP, escalate with evidence.

Checklists / step-by-step plan

Checklist A: The “stop guessing” measurement sequence

  1. Run iperf3 across the VPN: 1 stream and 4 streams, both directions.
  2. Check per-core CPU and interrupts during the test.
  3. DF ping to find effective MTU through the tunnel.
  4. Verify ICMP frag-needed isn’t blocked; confirm MSS clamp counters increment if used.
  5. Test ping under load to detect bufferbloat.
  6. Check drops on relevant interfaces (WAN and tunnel).

Checklist B: Decision tree you can use on an incident bridge

  1. If CPU core pinned: confirm VPN type; move to kernel/offload or upgrade hardware; stop fiddling with MTU until you can encrypt at speed.
  2. If multi-stream helps a lot: hunt MTU/PMTUD and loss/queueing; implement AQM and shaping before touching crypto.
  3. If DF ping fails at surprisingly low sizes: fix MTU and ICMP; clamp MSS; retest.
  4. If latency spikes under load: implement fq_codel/cake; shape at bottleneck; retest.
  5. If performance changed after enabling a feature: confirm offload/fast path status; revert feature and reintroduce surgically.

Checklist C: What to document so you don’t relearn this next quarter

  • Baseline iperf3 results (1 vs 4 streams, both directions) and test conditions.
  • Effective MTU through the tunnel and the configured interface MTU.
  • Whether MSS clamping is used, where, and why (PMTUD reliable or not).
  • Offload/acceleration status and which features disable it.
  • Queueing/shaping settings and the rationale for rates.

FAQ

1) Why is my VPN slow when my ISP speed test is fast?

ISP speed tests often use multiple parallel streams and may run to nearby servers. Your VPN traffic might be single-stream,
longer RTT, and capped by endpoint crypto or MTU issues. Test with iperf3 across the VPN and compare 1 vs 4 streams.

2) How do I know if it’s CPU/crypto vs MTU?

CPU/crypto bottlenecks usually show a hard throughput ceiling and a pinned core during load.
MTU/PMTUD issues show stalls, retransmits, and big improvement when MSS is clamped or MTU is lowered; multi-stream often helps.

3) Should I just switch to WireGuard?

If you’re stuck on a single-threaded userspace VPN and you need hundreds of Mbps to Gbps, switching often helps.
But don’t use it as a substitute for diagnosing MTU and queueing; WireGuard can also be slow if the path drops packets or buffers too much.

4) Is MSS clamping always necessary?

No. If PMTUD works end-to-end (ICMP frag-needed permitted) and your tunnel MTU is configured correctly, you may not need it.
In the real world, PMTUD is frequently broken by middleboxes, so MSS clamping is a pragmatic workaround.

5) What MSS value should I set?

Prefer --clamp-mss-to-pmtu where available; it adapts to the interface MTU.
If you must set a fixed MSS, compute it from the effective path MTU and standard headers (typically MTU minus 40 for IPv4 TCP headers, more for IPv6).
Measure first; guessing here is how you create new problems.
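
Sticking with the Task 7 numbers as an illustration (path MTU 1420, minus 40 bytes of IPv4+TCP headers, or 60 for IPv6):

cr0x@server:~$ echo $((1420 - 40))     # IPv4: 20-byte IP header + 20-byte TCP header
1380
cr0x@server:~$ echo $((1420 - 60))     # IPv6: 40-byte IPv6 header + 20-byte TCP header
1360

If both families cross the tunnel, clamp each to its own value, or use clamp-mss-to-pmtu and let the kernel track the route MTU.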

6) Why does ping look fine but file transfers are slow?

Ping uses small packets. Your problem usually involves large packets (MTU/fragmentation) or sustained queues under load (bufferbloat),
neither of which ping reveals unless you test under load and with DF set.

7) Can firewall logging really slow down a VPN?

Yes. Logging per packet adds CPU and I/O overhead, and can disable fast-path/offload on some platforms.
Log flows or sampled events instead of every packet on high-throughput paths.

8) What’s the single most common “security” change that breaks VPN throughput?

Blocking ICMP destination-unreachable/time-exceeded. It breaks PMTUD and turns MTU issues into silent drops and retransmit storms.
Allow the necessary ICMP types; security can still be strict without being self-sabotaging.

9) Why do multiple TCP streams “fix” my throughput?

Multiple streams hide loss/queueing/MTU problems by letting at least some flows make progress while others back off.
It’s a diagnostic clue, not a real fix. Fix drops, PMTUD, and bufferbloat instead.

10) How do I explain this to management without sounding like I’m making excuses?

Show two graphs: throughput vs CPU core utilization, and ping latency under load. Add iperf3 1-stream vs 4-stream results.
That’s evidence of where the bottleneck lives and what change will address it.

Conclusion: next steps you can execute today

Diagnose VPN slowness like any other production performance issue: measure, isolate, then change one variable.
Don’t start with cipher bikeshedding. Start with the fast playbook: iperf3 (1 vs 4 streams), per-core CPU, DF MTU tests, MSS/ICMP verification,
and latency-under-load to spot bufferbloat.

Practical next steps:

  1. Run iperf3 across the VPN in both directions with 1 and 4 streams; write down the numbers.
  2. During the test, capture per-core CPU and interrupt distribution; prove or eliminate endpoint compute limits.
  3. Measure real path MTU with DF pings; align tunnel MTU and apply MSS clamping if PMTUD can’t be trusted.
  4. Check drops and latency spikes under load; fix queue management and shaping at the bottleneck.
  5. If you confirm a crypto datapath limit, change the architecture (kernel/offload/DCO) or change the hardware. Anything else is theater.