Office VPN Quality Tests: Measure Latency, Jitter, and Loss to Prove ISP Problems

Your office VPN “works” right up until it doesn’t. The tunnel is up, routing looks fine, authentication succeeds,
and yet every video call sounds like a haunted voicemail, file copies stall at 99%, and your ticket queue fills with
the same complaint: “The VPN is slow today.”

The hard part isn’t running a speed test. The hard part is producing evidence that survives a meeting with an ISP,
a security team, and a manager who thinks “latency” is something you can fix by rebooting Outlook. You need numbers:
latency, jitter, packet loss, reordering, MTU, and throughput—measured the right way, from the right places, at the
right times, with enough context to isolate where the problem actually lives.

What “good VPN quality” actually looks like

Most teams only notice VPN quality when it’s bad. That’s a mistake. Define what “good” means, or you’ll spend your
life debating feelings.

For an office-to-datacenter (or office-to-cloud) VPN carrying interactive traffic (VoIP, RDP, VDI, SSH), these are
the practical targets I use:

  • Latency (RTT): Stable beats low. Under 50 ms RTT is great for the same region; under 100 ms is usable; spikes are the enemy.
  • Jitter: Under 5–10 ms is typically fine; over 20–30 ms is where voice starts sounding like a robot with a sore throat.
  • Packet loss: Under 0.1% is usually invisible; 0.5% starts causing real application pain; 1% is a productivity tax; 2% is a crisis.
  • Reordering: Usually near zero; when it appears, TCP can look “slow” even when bandwidth is available.
  • MTU sanity: Correct end-to-end MTU prevents fragmentation/blackholes; wrong MTU can mimic “random slowness.”
  • Throughput: For bulk flows, you care about sustained throughput under load, not peak “speed test” numbers.

VPNs complicate measurement because the tunnel adds overhead, changes packet sizes, and often shifts traffic onto a
different path than plain Internet traffic. The result: your “ISP is fine” argument dies the moment you realize you
only tested to a public CDN, not through the VPN to the thing users actually use.

Interesting facts and short history (because it explains today’s mess)

  • Fact 1: The classic ping tool dates back to the early 1980s and was inspired by sonar; it was designed for reachability, not performance analysis.
  • Fact 2: “Jitter” became an everyday metric largely because real-time voice/video moved from circuit-switched to packet-switched networks.
  • Fact 3: IPsec (a common site-to-site VPN technology) was standardized in the 1990s, when typical Internet links were far slower and NAT wasn’t everywhere.
  • Fact 4: WireGuard is relatively new (mid-to-late 2010s) and intentionally small; fewer knobs, fewer foot-guns, and generally better baseline performance.
  • Fact 5: TCP throughput is constrained by RTT and loss; this is why a high-bandwidth link can still “feel slow” over a lossy or jittery VPN path.
  • Fact 6: Many consumer and even “business” broadband links rely on oversubscription; congestion is often time-of-day dependent and doesn’t show up in a single test.
  • Fact 7: Bufferbloat (excessive buffering in network devices) became widely discussed in the late 2000s and can cause huge latency spikes under upload/download.
  • Fact 8: ECMP (equal-cost multipath) in provider networks can cause packet reordering; VPN encapsulation sometimes changes how flows hash across paths.
  • Fact 9: Some networks still treat ICMP differently (rate-limit, deprioritize); that’s why you corroborate ping with other methods before declaring victory.

Joke 1: The nice thing about “intermittent packet loss” is it teaches everyone mindfulness. You can’t cling to expectations when your packets don’t cling to existence.

The metrics that matter (and the ones that lie)

Latency: measure distributions, not a single number

Average latency hides pain. Users feel spikes. Capture min/avg/max and percentiles if your tooling supports it.
When someone says “the VPN is slow,” they’re usually describing variability, not a steady-state average.
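
If your tooling doesn't print percentiles, you can approximate them from raw ping output. A minimal sketch (nearest-rank percentiles, no interpolation) against the gateway address used later in this article:

cr0x@server:~$ ping -c 100 -i 0.2 203.0.113.10 \
    | sed -n 's/.*time=\([0-9.]*\) ms/\1/p' \
    | sort -n \
    | awk '{ t[NR]=$1 } END { printf "p50=%s ms  p95=%s ms  max=%s ms\n", t[int(NR*0.50+0.5)], t[int(NR*0.95+0.5)], t[NR] }'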

Jitter: the “quality” metric for real-time traffic

Jitter is variation in delay. VoIP and video conferencing can buffer a little; once jitter exceeds the buffer,
audio breaks up and video freezes. VPN tunnels can amplify jitter when encryption appliances are CPU-bound or when
ISP queues fill under load.
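
ICMP gives only a rough jitter proxy. If an iperf3 server is already running on the internal test host used in the tasks below, a paced UDP run reports jitter and datagram loss directly; -u selects UDP and -b 2M keeps the stream well below link capacity so you measure quality, not saturation. A sketch:

cr0x@server:~$ iperf3 -c 10.10.20.15 -u -b 2M -t 30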

Loss: the silent killer of throughput

TCP interprets loss as congestion and backs off. A tiny amount of random loss can crater throughput on long-RTT
paths. For office VPNs, loss is also the main driver of “RDP feels sticky,” “Teams freezes,” and “file copy stalls.”
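
The classic Mathis approximation makes this concrete: sustainable TCP throughput is roughly MSS / (RTT × √loss), give or take a constant factor. A sketch with illustrative numbers (1360-byte MSS, 40 ms RTT, 0.5% loss), useful as an order-of-magnitude estimate rather than a precise model:

cr0x@server:~$ awk 'BEGIN { mss=1360*8; rtt=0.040; p=0.005; printf "~%.1f Mbit/s ceiling\n", mss/(rtt*sqrt(p))/1e6 }'
~3.8 Mbit/s ceiling

That's how half a percent of loss turns a 100 Mbit/s link into "the VPN is slow."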

Throughput: test both directions and include concurrent flows

One flow is not reality. Many VPN gateways do fine with one iperf stream and fall apart with ten. Also test upload:
offices often saturate upstream during backups, scans, or “someone emailed a 2 GB file to the whole company.”
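
A sketch of both checks, assuming an iperf3 server is listening on the internal test host used later in this article; -P 8 opens eight parallel streams, and -R reverses direction so the remote side sends. Keep the bursts short on production links:

cr0x@server:~$ iperf3 -c 10.10.20.15 -P 8 -t 20
cr0x@server:~$ iperf3 -c 10.10.20.15 -P 8 -t 20 -R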

MTU/fragmentation: the classic VPN foot-gun

VPN encapsulation adds overhead. If your effective path MTU drops and PMTUD is blocked or broken, large packets get
blackholed. Symptoms look like “some websites work, some don’t,” or “SSH is fine but file transfer hangs.”
MTU issues are boring, common, and absolutely worth testing early.
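
A quick way to bracket the working packet size is a do-not-fragment ping sweep (Tasks 8 and 9 below do the same thing more carefully). A sketch toward the internal example host; payload size plus 28 bytes of IP/ICMP headers gives the on-the-wire packet size:

cr0x@server:~$ for s in 1472 1452 1432 1412 1392 1372; do ping -M do -c 1 -W 1 -s "$s" 10.10.20.15 >/dev/null 2>&1 && { echo "largest DF payload that passes: $s"; break; }; done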

What lies: speed tests, single pings, and “it’s green in the dashboard”

Public speed tests often hit nearby CDN nodes over well-peered paths that your VPN traffic never uses. A single ping
at 9:03 AM proves almost nothing. And dashboards go green when they measure reachability, not experience.

One quote (paraphrased idea): John Allspaw has argued that reliability comes from understanding systems under real conditions, not from blaming individuals.

Fast diagnosis playbook

When the office VPN “feels bad,” you don’t have time for a week-long research project. Start with the fastest
discriminators: is the bottleneck the local LAN/Wi‑Fi, the ISP last mile, the VPN gateway, or the upstream network?

First: confirm scope and pick two endpoints

  • Pick a client: a wired machine on the office LAN (not Wi‑Fi unless the complaint is Wi‑Fi).
  • Pick two targets: (1) the office VPN gateway public IP, (2) an internal VPN-side host (a jump box or server).
  • Decide the timeframe: “right now” plus “repeat during peak hours.”

Second: split the path into segments

  • Client → office router: proves LAN/Wi‑Fi issues.
  • Client → VPN gateway public IP: proves ISP path to the gateway.
  • Client → internal host through VPN: proves tunnel + remote side.
  • VPN gateway → internal host: proves remote network if you can run tests there.

Third: run three quick tests

  1. Ping with timestamps and intervals (latency/jitter/loss baseline).
  2. MTR (loss/latency by hop; good for escalation, not gospel).
  3. A short iperf3 run in both directions (throughput + loss under load).

Fourth: provoke the problem (carefully)

If quality collapses only when the link is busy, you need a controlled load test. Saturate upstream briefly,
measure latency under load, and you’ll expose bufferbloat, shaping bugs, or a VPN appliance that panics when it has
to do real work.

Fifth: decide where to focus

  • Bad to gateway public IP: ISP/local circuit, peering, or last-mile congestion.
  • Good to gateway but bad through tunnel: VPN gateway CPU, MTU, encryption settings, or remote side.
  • Good from wired but bad from Wi‑Fi: stop blaming the ISP; fix RF, roaming, or client drivers.

Practical tasks: commands, outputs, and decisions (12+)

These tasks assume Linux clients/servers. If you’re on macOS or Windows, you can still apply the logic; the command
names change, physics does not.

Task 1: Identify the active interface and gateway (avoid testing the wrong path)

cr0x@server:~$ ip route show
default via 192.0.2.1 dev eno1 proto dhcp src 192.0.2.50 metric 100
192.0.2.0/24 dev eno1 proto kernel scope link src 192.0.2.50

What it means: Your client egresses via eno1 to 192.0.2.1.
If someone runs tests over Wi‑Fi while users are on wired (or vice versa), you’re debugging fan fiction.

Decision: If the route is wrong, fix it before testing. Otherwise proceed.

Task 2: Baseline LAN health: ping the default gateway

cr0x@server:~$ ping -c 20 -i 0.2 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.362 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.401 ms
...
20 packets transmitted, 20 received, 0% packet loss, time 3804ms
rtt min/avg/max/mdev = 0.331/0.389/0.471/0.038 ms

What it means: This is your floor. If you see loss or multi-millisecond jitter here on a wired
network, the ISP is not your first problem.

Decision: If LAN ping is unstable, check switch ports, duplex, NIC errors, Wi‑Fi, or local congestion.

Task 3: Check NIC counters for drops and errors

cr0x@server:~$ ip -s link show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    987654321  123456      0       2       0       123
    TX:  bytes packets errors dropped carrier collsns
    876543210  112233      0       0       0       0

What it means: A few drops might be fine; rising errors/drops during complaints is a smoking gun.

Decision: If errors climb, replace cable, fix duplex mismatch, update drivers/firmware, or move ports.
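
If the physical layer is the suspect, ethtool (where installed) shows negotiated speed/duplex and driver-level counters; counter names vary by driver, so treat the grep patterns as a sketch:

cr0x@server:~$ sudo ethtool eno1 | grep -E 'Speed|Duplex|Link detected'
cr0x@server:~$ sudo ethtool -S eno1 | grep -iE 'err|drop|crc' | grep -v ': 0$'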

Task 4: Confirm VPN route selection (split tunnel surprises are common)

cr0x@server:~$ ip route get 10.10.20.15
10.10.20.15 dev wg0 src 10.200.0.2 uid 1000
    cache

What it means: Traffic to 10.10.20.15 goes via wg0 (the VPN).
If it goes via your default route instead, you’re not testing the tunnel at all.

Decision: Fix routing/policy routing before diagnosing “VPN performance.”

Task 5: Measure RTT, jitter proxy (mdev), and loss to the VPN gateway public IP

cr0x@server:~$ ping -c 50 -i 0.2 203.0.113.10
PING 203.0.113.10 (203.0.113.10) 56(84) bytes of data.
64 bytes from 203.0.113.10: icmp_seq=1 ttl=52 time=18.9 ms
64 bytes from 203.0.113.10: icmp_seq=2 ttl=52 time=19.2 ms
...
50 packets transmitted, 50 received, 0% packet loss, time 10052ms
rtt min/avg/max/mdev = 18.5/19.4/28.7/1.8 ms

What it means: mdev is not perfect jitter, but it’s a decent quick indicator.
The max spike (28.7 ms) is notable; if this turns into 200+ ms during complaints, you’re likely seeing congestion.

Decision: If loss appears here, focus on ISP path before blaming VPN crypto settings.

Task 6: MTR to the VPN gateway public IP (evidence, not theology)

cr0x@server:~$ mtr -rwzc 200 203.0.113.10
Start: 2025-12-28T09:15:02+0000
HOST: office-client                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.0.2.1                        0.0%   200    0.5   0.6   0.4   1.2   0.1
  2.|-- 198.51.100.1                     0.0%   200    7.8   8.1   6.9  20.3   1.5
  3.|-- 198.51.100.34                    1.5%   200   12.1  12.8  10.2  45.7   4.2
  4.|-- 203.0.113.10                     0.5%   200   19.6  21.4  18.4  80.2   6.9

What it means: Loss at intermediate hops can be ICMP rate-limiting. The key is whether loss
persists to the final hop. Here, 0.5% loss to the destination is real enough to hurt VPN experience.

Decision: If destination loss correlates with complaint windows, start assembling an ISP escalation pack.

Task 7: Compare to a “control” target outside the VPN (prove it’s path-specific)

cr0x@server:~$ ping -c 50 -i 0.2 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=9.7 ms
...
50 packets transmitted, 50 received, 0% packet loss, time 10043ms
rtt min/avg/max/mdev = 9.4/9.9/12.5/0.4 ms

What it means: The Internet in general can look fine while the path to your VPN gateway is sick.
This is exactly why you test the actual VPN endpoint.

Decision: If control is clean but gateway is not, push the ISP on routing/peering to that destination.

Task 8: Verify MTU to an internal host (find blackholes early)

cr0x@server:~$ ping -M do -s 1372 -c 3 10.10.20.15
PING 10.10.20.15 (10.10.20.15) 1372(1400) bytes of data.
1380 bytes from 10.10.20.15: icmp_seq=1 ttl=63 time=24.3 ms
1380 bytes from 10.10.20.15: icmp_seq=2 ttl=63 time=24.8 ms
1380 bytes from 10.10.20.15: icmp_seq=3 ttl=63 time=25.1 ms

--- 10.10.20.15 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms

What it means: -M do forbids fragmentation. If this fails, your effective MTU is
smaller than you think, and PMTUD might not be working.

Decision: If it fails, reduce tunnel MTU (or MSS clamp) and retest until it passes consistently.

Task 9: Detect PMTUD breakage (classic “works for small, fails for big”)

cr0x@server:~$ tracepath 10.10.20.15
 1?: [LOCALHOST]                      pmtu 1500
 1:  10.200.0.1                        4.123ms
 2:  10.10.20.15                      24.910ms reached
     Resume: pmtu 1420 hops 2 back 2

What it means: Path MTU discovered as 1420. That’s plausible for some VPN overhead scenarios.
If your interfaces are set to 1500 and you rely on PMTUD but ICMP is blocked somewhere, you can get blackholes.

Decision: Set the VPN MTU at or below the discovered PMTU (with some margin), or apply TCP MSS clamping at the edge.
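
On a Linux edge router that forwards traffic into the tunnel, one common way to clamp MSS is an iptables mangle rule (nftables and vendor firewalls have equivalents); the interface name matches the WireGuard example used elsewhere in this article:

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu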

Task 10: Run iperf3 through the VPN (single stream)

cr0x@server:~$ iperf3 -c 10.10.20.15 -t 15
Connecting to host 10.10.20.15, port 5201
[  5] local 10.200.0.2 port 45322 connected to 10.10.20.15 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.25 MBytes  52.4 Mbits/sec    0   1.12 MBytes
[  5]   1.00-2.00   sec  6.12 MBytes  51.3 Mbits/sec    2   0.96 MBytes
...
[  5]  14.00-15.00  sec  3.10 MBytes  26.0 Mbits/sec   18   0.42 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-15.00  sec  67.2 MBytes  37.6 Mbits/sec  61             sender
[  5]   0.00-15.00  sec  66.5 MBytes  37.2 Mbits/sec                  receiver

What it means: Retransmits (Retr) and falling bitrate point to loss, congestion, or
a CPU-bound path. A healthy VPN path should not show retransmits climbing rapidly in a short clean test.

Decision: If retransmits spike, correlate with ping loss and investigate ISP or tunnel MTU/queueing.

Task 11: Run iperf3 reverse direction (office download can be fine while upload is a disaster)

cr0x@server:~$ iperf3 -c 10.10.20.15 -R -t 15
Connecting to host 10.10.20.15, port 5201
Reverse mode, remote host 10.10.20.15 is sending
[  5] local 10.200.0.2 port 45330 connected to 10.10.20.15 port 5201
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-15.00  sec  120 MBytes  67.1 Mbits/sec   3             sender
[  5]   0.00-15.00  sec  119 MBytes  66.5 Mbits/sec                 receiver

What it means: Reverse is clean; forward was not. That’s a strong hint the office upstream is
congested or shaped poorly, or your egress queue is bloated.

Decision: Focus on upstream capacity, QoS/SQM, or ISP upload impairment.

Task 12: Measure latency under load (bufferbloat check)

cr0x@server:~$ (ping -i 0.2 -c 50 203.0.113.10 | tail -n 3) & iperf3 -c 10.10.20.15 -t 15 > /dev/null; wait
50 packets transmitted, 50 received, 0% packet loss, time 10041ms
rtt min/avg/max/mdev = 19.0/85.4/312.7/58.1 ms

What it means: Average RTT quadrupled and max hit 312 ms while pushing traffic. That’s classic
bufferbloat or queue saturation. Users experience this as “VPN is fine until we do anything.”

Decision: Implement smart queue management (SQM) on the edge, shape slightly below line rate, or get more upstream bandwidth.
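
If the edge router runs Linux with the cake qdisc available, a minimal shaping sketch looks like this; wan0 is a placeholder for the actual WAN interface, and 18mbit assumes an upstream of roughly 20 Mbit/s (shape a little below the rate you actually measured). Packaged SQM implementations, such as OpenWrt's sqm-scripts, wrap the same idea:

cr0x@server:~$ sudo tc qdisc replace dev wan0 root cake bandwidth 18mbit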

Task 13: Check VPN interface MTU and stats (tunnel-level drops)

cr0x@server:~$ ip -s link show dev wg0
5: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
    RX:  bytes packets errors dropped  missed   mcast
    12345678   54321      0      45       0       0
    TX:  bytes packets errors dropped carrier collsns
    23456789   65432      0      12       0       0

What it means: Drops on the tunnel interface can indicate local queueing, policing, or bursts
exceeding what the path can handle.

Decision: If drops rise during peak, shape traffic, review QoS, and check CPU saturation on the VPN endpoint.

Task 14: Observe WireGuard runtime stats (handshake freshness and transfer)

cr0x@server:~$ sudo wg show
interface: wg0
  public key: zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz=
  private key: (hidden)
  listening port: 51820

peer: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy=
  endpoint: 203.0.113.10:51820
  allowed ips: 10.10.0.0/16
  latest handshake: 28 seconds ago
  transfer: 1.23 GiB received, 2.10 GiB sent
  persistent keepalive: every 25 seconds

What it means: Handshakes are happening; the tunnel is alive. If latest handshake becomes stale
or flaps, you might be hitting NAT timeouts or upstream filtering.

Decision: If handshakes drop during idle, enable keepalive or fix NAT/firewall state timeouts.
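
For WireGuard, keepalive can be enabled at runtime with wg set (the placeholder below stands in for the peer's public key from wg show); to make it survive restarts, set PersistentKeepalive = 25 in that peer's section of the config file:

cr0x@server:~$ sudo wg set wg0 peer <peer-public-key> persistent-keepalive 25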

Task 15: Confirm IPsec SA health (if you run IPsec)

cr0x@server:~$ sudo ipsec statusall
Status of IKE charon daemon (strongSwan 5.9.8, Linux 6.5.0, x86_64):
  uptime: 2 hours, since Dec 28 07:12:03 2025
Connections:
office-to-dc:  192.0.2.0/24 === 10.10.0.0/16
Security Associations (1 up, 0 connecting):
office-to-dc[1]: ESTABLISHED 12 minutes ago, 198.51.100.10[CN=office]...203.0.113.10[CN=dc]
office-to-dc{1}:  INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c1a2b3c4_i c4b3a2c1_o
office-to-dc{1}:   192.0.2.0/24 === 10.10.0.0/16

What it means: SA is established. This doesn’t prove quality, but it rules out negotiation churn
that can cause intermittent freezes.

Decision: If SAs rekey too frequently or flap, fix lifetimes, NAT-T handling, or firewall pinholes.

Task 16: Capture a short packet trace to prove fragmentation or retransmissions

cr0x@server:~$ sudo tcpdump -ni wg0 -c 20 -vv 'host 10.10.20.15 and (tcp or icmp)'
tcpdump: listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
09:18:41.120001 IP (tos 0x0, ttl 64, id 12345, offset 0, flags [DF], proto ICMP (1), length 1420)
    10.200.0.2 > 10.10.20.15: ICMP echo request, id 3011, seq 1, length 1400
09:18:41.144210 IP (tos 0x0, ttl 63, id 54321, offset 0, flags [DF], proto ICMP (1), length 1420)
    10.10.20.15 > 10.200.0.2: ICMP echo reply, id 3011, seq 1, length 1400
...

What it means: You see DF set and successful replies at size 1400—good for MTU. For TCP, you’d
look for retransmissions and window shrinking.

Decision: Use traces sparingly: enough to prove a hypothesis, not enough to drown in packets.
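
If tshark (Wireshark's command-line tool) is installed, its analysis filters can pull out TCP retransmissions without reading raw packets by hand; a sketch on the tunnel interface, stopping automatically after 15 seconds:

cr0x@server:~$ sudo tshark -i wg0 -f 'host 10.10.20.15' -Y 'tcp.analysis.retransmission' -a duration:15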

Three corporate mini-stories (all painfully plausible)

Incident caused by a wrong assumption: “Ping is clean, so the ISP is innocent”

A mid-sized company had a site-to-site VPN from headquarters to a cloud VPC. The helpdesk reported that RDP sessions
were freezing, then catching up in bursts. The network team ran ping to the VPN gateway. 0% loss, ~12 ms.
They declared the ISP fine and moved on to blaming Windows updates.

The freezing continued. So someone finally ran a test that actually matched the workload: sustained upload through
the VPN while measuring latency to the gateway. RTT jumped from ~12 ms to ~280 ms and stayed there. The VPN wasn’t
“down.” It was just stuck behind an ocean of buffering in the office router’s upstream queue.

The wrong assumption was subtle: “If ping is good right now, the link is good.” In reality, idle ping measures the
happy path. The business was suffering under load. A single camera upload, a cloud backup burst, or even a big Teams
screen share was enough to fill the queue and introduce seconds of delay.

Fixing it was not glamorous. They implemented smart queue management at the edge and shaped upload slightly below
the ISP’s real upstream rate. RDP became boring again. The ISP didn’t change anything; the office did.

Optimization that backfired: “We enabled hardware offload and made it worse”

Another company rolled out new edge appliances and enabled every acceleration feature they could find: checksum
offload, GRO/LRO, and some vendor-specific “VPN fast path.” Throughput benchmarks looked great in a lab.
In production, VoIP calls over the VPN started dropping words. The tunnel stayed up, and bulk transfers were fast,
so the issue got dismissed as “Teams being Teams.”

The clue came from jitter. Under moderate load, latency wasn’t high on average, but it was wildly variable—microbursts
and packet batching. Offloads were coalescing packets and delivering them in clumps. Interactive flows hate clumps.
The users experienced it as stutter and awkward talk-over.

The “optimization” improved the metric someone cared about (peak throughput) and degraded the metric users care
about (consistent latency). Disabling some offloads on the WAN/VPN interfaces reduced throughput slightly but made
jitter sane. VoIP stabilized immediately.

Nobody likes the conclusion: sometimes you buy a faster car and then realize you only drive in traffic. Optimize
for latency stability first. Bandwidth is rarely the limiting factor for human-facing VPN pain.

Boring but correct practice that saved the day: “Time-series measurements with a change log”

A global team had a recurring complaint: “Every Tuesday afternoon the VPN is terrible.” People had theories:
ISP maintenance, cloud provider issues, solar flares, and one memorable suggestion involving a haunted switch.

The SRE on duty did the least exciting thing possible. They set up scheduled tests every five minutes from a wired
office host: ping and MTR to the VPN gateway, plus a short iperf3 run to an internal test server. They stored raw
outputs with timestamps and aligned them with a simple change log: WAN circuit changes, firewall policy changes,
endpoint upgrades, and office events.

After two weeks, the pattern was clear. Loss and jitter spiked at the same time every week, but only in the upstream
direction. It correlated with a recurring offsite backup job that hammered upload. The job wasn't new; the data set
grew, and it finally crossed the threshold where queues collapsed.

They moved the backup window, set a rate limit, and put basic SQM in place. Tuesday afternoons became quiet. The
best part: they didn’t need heroics, only receipts. The evidence also prevented a pointless ISP escalation that
would have ended with “no fault found.”

Joke 2: The most reliable network strategy is “measure first.” The second most reliable is “stop scheduling backups at 2 PM.”

Checklists / step-by-step plan

Step-by-step: establish a repeatable VPN quality test

  1. Pick a stable test client. Wired, not on a docking station with a flaky USB NIC.
    If you must test Wi‑Fi, do it intentionally and document that it’s Wi‑Fi.
  2. Pick stable targets.
    One public target: the VPN gateway public IP. One private target: an internal host reachable only via the VPN.
  3. Define three test windows.
    Morning calm, peak business hours, and the “complaint hour” everyone mentions.
  4. Run baseline tests.
    Ping gateway (LAN), ping VPN endpoint public IP, ping internal host via VPN, MTR to VPN endpoint, MTU checks.
  5. Run load tests in short bursts.
    iperf3 forward and reverse for 15–30 seconds. Don’t melt production links for minutes.
  6. Repeat and record raw outputs.
    Screenshots are not data. Keep text outputs with timestamps (a minimal probe script follows this list).
  7. Correlate with device metrics.
    VPN gateway CPU, interface drops, WAN utilization, and any shaping/policing counters.
  8. Make a call.
    If the impairment shows before the tunnel (to public IP), escalate to ISP. If it appears only through the tunnel,
    fix your VPN configuration or remote side.
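
A minimal probe script for step 6, meant to run from cron every five minutes; the targets and log path are placeholders that reuse this article's example addresses:

#!/bin/sh
# vpn-probe.sh: append timestamped VPN quality samples to a log (a sketch).
# Example cron entry: */5 * * * * /usr/local/bin/vpn-probe.sh
GATEWAY=203.0.113.10        # VPN gateway public IP
INTERNAL=10.10.20.15        # internal host reachable only through the VPN
LOG=/var/log/vpn-quality.log

{
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  ping -c 20 -i 0.2 "$GATEWAY"  | tail -n 2    # loss and rtt min/avg/max/mdev to the gateway
  ping -c 20 -i 0.2 "$INTERNAL" | tail -n 2    # same, through the tunnel
  mtr -rwzc 50 "$GATEWAY"                      # per-hop report for escalation context
} >> "$LOG" 2>&1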

Checklist: before you call the ISP

  • Confirm the issue happens from a wired client (unless Wi‑Fi is the suspected root cause).
  • Show loss/jitter to the VPN gateway public IP over time, not one test.
  • Provide MTR outputs to the same endpoint during good and bad periods.
  • Provide latency-under-load evidence if the issue is congestion/bufferbloat.
  • Confirm MTU works (or document that you adjusted it) so the ISP doesn’t blame your tunnel overhead.
  • Document who, where, when: circuit ID, office location, timestamps with timezone, and whether tests were through VPN.

Checklist: changes that usually improve VPN quality

  • Enable or fix SQM on the office edge (or shape slightly below ISP rate).
  • Correct MTU/MSS for the VPN path.
  • Prefer wired for critical VPN users; fix Wi‑Fi roaming for the rest.
  • Watch VPN gateway CPU and interrupt pressure during peaks; scale up or offload appropriately.
  • Separate bulk traffic (backups, large syncs) from interactive traffic with QoS.

Common mistakes: symptom → root cause → fix

  • Symptom: “VPN is slow, but speed tests are great.”
    Root cause: Speed tests hit a nearby CDN; VPN traffic takes a different, congested path or suffers from loss/jitter.
    Fix: Test to the VPN gateway public IP and to an internal host through the tunnel; run tests during peak hours and under load.
  • Symptom: “SSH works, but file transfers hang; some apps time out.”
    Root cause: MTU/PMTUD blackhole inside the VPN path.
    Fix: Use ping -M do with larger payloads, use tracepath, then lower tunnel MTU or clamp TCP MSS at the edge.
  • Symptom: “Teams/VoIP is choppy only when uploads happen.”
    Root cause: Bufferbloat or upstream saturation on the office link.
    Fix: Apply SQM/shaping for upstream; move bulk jobs off-hours; enforce rate limits for backups.
  • Symptom: “MTR shows loss at hop 3; ISP says it’s fine.”
    Root cause: ICMP rate-limiting on intermediate routers; not necessarily data-plane loss.
    Fix: Focus on loss/latency to the final destination; corroborate with iperf3 retransmits and application symptoms.
  • Symptom: “VPN throughput is awful on one site; fine elsewhere.”
    Root cause: Asymmetric routing, peering differences, or a local last-mile impairment.
    Fix: Compare MTR and ping from multiple sites to the same VPN endpoint; present the delta to the ISP.
  • Symptom: “Performance is bad after we upgraded the firewall.”
    Root cause: Crypto throughput limit, small CPU, mis-sized MTU, or offload features causing jitter.
    Fix: Measure CPU and interface drops under load; run iperf3 multi-stream tests; disable problematic offloads; scale the appliance.
  • Symptom: “VPN disconnects every few minutes, especially idle.”
    Root cause: NAT timeouts, aggressive state cleanup, or UDP being mishandled.
    Fix: Enable keepalive for the tunnel; adjust firewall/NAT UDP timeouts; confirm the ISP isn’t filtering UDP.
  • Symptom: “RDP lags, but bulk file copy is fine.”
    Root cause: Jitter and microbursts; batching/coalescing; queueing delays that don’t destroy throughput.
    Fix: Measure latency under load; tune SQM; reduce offload features that batch packets; prioritize interactive traffic.
  • Symptom: “Everything is bad only on Wi‑Fi.”
    Root cause: RF interference, roaming issues, power save behavior, or poor AP placement.
    Fix: Reproduce on wired; if wired is clean, fix Wi‑Fi—channel plan, minimum RSSI, band steering, and driver updates.
  • Symptom: “We see loss, but only small pings succeed reliably.”
    Root cause: Fragmentation/MTU or policing that drops larger packets under load.
    Fix: Perform DF pings at increasing sizes; adjust MTU/MSS; verify that policing isn’t misconfigured on WAN/VPN interfaces.

Building an ISP-proof evidence pack

ISPs don’t respond to “VPN is slow.” They respond to “here is loss and latency to your handoff, here is loss to the
destination, here are timestamps, and here is a comparison with a control path.” Your goal is not to win an argument.
Your goal is to get a ticket routed to someone who can change something.

What to collect (minimum viable receipts)

  • Two weeks of periodic ping to VPN gateway public IP (or at least several days including “bad” windows).
  • MTR runs during good and bad periods to the VPN gateway public IP.
  • Latency-under-load test demonstrating bufferbloat (ping while iperf3 runs) if congestion is suspected.
  • iperf3 results forward and reverse through the VPN to show directionality.
  • MTU tests showing your tunnel is configured sanely (so you don’t get blamed for fragmentation blackholes).
  • Office edge metrics: WAN utilization, interface drops, CPU on the VPN device, timestamps aligned with user complaints.

How to present it so it gets action

Keep it short. Make it easy to skim. ISPs are staffed by humans with queues. Provide:
circuit identifier, public IP, your VPN gateway IP, test client IP, timestamps with timezone, and a one-paragraph
statement of impact (“interactive VPN traffic unusable 10:00–12:00 daily; loss up to 1% and RTT spikes 300 ms”).

Then attach raw outputs. Raw outputs matter because they’re hard to argue with and easy to forward internally.
Also: include both “good” and “bad” samples. If you only show failures, you’ll get stuck in the “can’t reproduce”
loop.

A practical stance on blame

Don’t lead with blame. Lead with segmentation. “Loss is present from office to VPN gateway public IP; LAN is clean;
reverse-direction tests show upstream impairment.” That’s how you get the ISP to stop asking you to reboot your modem.

FAQ

1) What’s the difference between latency and jitter, in plain terms?

Latency is how long a packet takes to make a round trip. Jitter is how much that time varies. Humans hate jitter
more than they hate moderate steady latency.

2) How much packet loss is “acceptable” for a VPN?

For interactive work, aim for effectively zero. In practice, under 0.1% is usually fine; sustained 0.5% is noticeable;
1%+ will cause recurring pain and support tickets.

3) Can ICMP ping be misleading?

Yes. Some networks rate-limit ICMP or treat it differently. That’s why you corroborate with end-to-end tests like
iperf3 retransmits, latency under load, and application behavior. Still, ping is useful when you use it carefully.

4) Why does the VPN get worse during video calls or backups?

Because you’re filling queues. On many office links, upstream bandwidth is limited and buffers are deep. Once the
upstream queue fills, latency spikes. The tunnel stays up; experience collapses.

5) What MTU should I set for WireGuard or IPsec?

There isn’t one magic number. Start with WireGuard’s common default around 1420, then validate with DF pings and
tracepath. For IPsec, overhead varies (NAT-T, algorithms). Measure, don’t guess.
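
For context on where 1420 comes from (assuming WireGuard's usual transport framing of about 32 bytes, plus UDP and the outer IP header), the arithmetic is simple; the default is the conservative IPv6 case:

cr0x@server:~$ echo "IPv4 outer: $(( 1500 - 20 - 8 - 32 ))   IPv6 outer: $(( 1500 - 40 - 8 - 32 ))"
IPv4 outer: 1440   IPv6 outer: 1420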

6) Should I run iperf3 over production links during business hours?

Yes, but in short, controlled bursts and with warnings. A 15-second test is usually enough to reveal directionality,
retransmits, and jitter under load. Don’t run a five-minute saturation test and then act surprised when people complain.

7) How do I prove the VPN device is the bottleneck, not the ISP?

Show clean performance to the VPN gateway public IP but degraded performance only through the tunnel. Then correlate
with VPN device CPU, interface drops, or crypto offload counters during the same window.

8) Our ISP says “no problem found.” What now?

Provide time-series evidence and segment the path. Include good vs bad MTR/ping, directionality (iperf3 forward vs
reverse), and latency-under-load results. If they still refuse, ask for escalation to backbone/peering or request a
circuit move—politely, with receipts.

9) Why do I see loss in MTR at an intermediate hop but not at the destination?

That hop may be rate-limiting ICMP responses. It can look like “loss” in the report while actual forwarding is fine.
Trust end-to-end results more than hop-level ICMP behavior.

10) Is Wi‑Fi really that big a deal for VPN quality?

Yes. VPNs magnify Wi‑Fi problems: retransmissions, roaming interruptions, and variable latency. Always reproduce on
wired before escalating to an ISP—unless the whole complaint is explicitly “Wi‑Fi + VPN.”

Conclusion: next steps that actually move the needle

VPN quality isn’t mysterious. It’s measurable. The trick is measuring the right things, on the right segment of the
path, in a way that survives the corporate ritual of “prove it.”

  1. Define targets for latency, jitter, and loss, and write them down.
  2. Instrument the path: scheduled ping/MTR to the VPN public endpoint and a private internal target.
  3. Test MTU early; fix it once and stop rediscovering the same problem every quarter.
  4. Measure under load to expose bufferbloat and directionality issues.
  5. Build an evidence pack with timestamps and raw outputs; escalate with segmentation, not vibes.
  6. Fix the boring stuff: SQM, shaping, off-hours bulk jobs, and sane VPN gateway sizing.

Do this, and the next time someone says “the VPN is slow,” you won’t argue. You’ll point at a graph, attach a log,
and choose the right lever—ISP ticket, edge shaping, MTU change, or VPN capacity—on purpose.
