VoIP works great right up until you put it through a VPN tunnel, share the uplink with a cloud backup, and let someone start a “quick” screen-share. Then it turns into a radio drama: pauses, talking over each other, robotic syllables, and the dreaded “can you repeat that?”—from the CFO.
Most teams treat this as a mystical problem: “VPNs add overhead.” True, but unhelpful. The real causes are usually measurable and fixable: MTU mismatches, bufferbloat, bad queueing, mis-marked packets, and tunnels that hide the traffic from the one device that could have prioritized it.
The mental model: what a VPN tunnel does to voice
VoIP is allergic to variability. Not just delay—variable delay. Humans tolerate a steady 80 ms far better than a spiky 30–200 ms. VPN tunnels tend to increase variability for three boring reasons:
- Extra headers and crypto work: more bytes per packet, more CPU per packet, and sometimes different NIC offload behavior.
- Hidden flows: once traffic is encrypted, intermediate devices can’t see “this is RTP” unless you mark it before encryption and preserve markings.
- Queueing moves around: you may think your firewall is the bottleneck; actually the ISP CPE or upstream shaper is where packets sit and rot.
Voice (SIP signaling + RTP media) typically uses small packets at a steady cadence. Tunnels like steady too—until the office link saturates. When saturation happens, big flows (cloud drives, updates, backups) create queues. Those queues inject jitter into RTP. Then the jitter buffer tries to cope, adds delay, and when it can’t, you get clipped audio.
There’s a harsh law of office networking: you don’t “fix jitter” inside the tunnel; you prevent queues before the bottleneck and prioritize voice into the tunnel. That means shaping at the egress of the actual constrained link, and choosing a queueing discipline that doesn’t punish small real-time flows.
One dry operational joke, as promised: VPN tunnels are like umbrellas—people only remember them when it’s already raining and they’re already wet.
What this article assumes (and what it doesn’t)
Assumptions:
- You have an office router/firewall you control (Linux, pfSense/OPNsense, a vendor firewall, or a router-on-a-stick VM).
- VoIP endpoints are either desk phones, softphones, or a cloud PBX SBC, and the office is tunneling to something (DC, cloud, security gateway).
- You can run packet captures and basic CLI tools.
Non-goals:
- We’re not doing “buy a new WAN circuit” as the primary solution, though sometimes that’s the adult answer.
- We’re not hand-waving “QoS” without proving where the bottleneck is.
Interesting facts and context (because history repeats in packets)
- VoIP got popular long before “good internet” was common. Early deployments leaned heavily on jitter buffers because access links were noisy and congested.
- RTP is intentionally simple: it assumes the network will sometimes be terrible, and endpoints will mask it with buffering and loss concealment.
- DiffServ (DSCP) dates back to the late 1990s as a replacement for the older IP Precedence model. It’s still the main language of QoS, even if many networks ignore it.
- IPsec ESP was designed for confidentiality, not traffic engineering. The fact that we now expect it to preserve QoS markings is an operational adaptation, not the original vibe.
- Bufferbloat got its name around 2010, but the behavior existed as long as consumer gear shipped with deep unmanaged buffers.
- FQ-CoDel and CAKE were born out of real-time pain: gamers and voice users drove practical queue management improvements, not academic perfection.
- SRTP (secure RTP) encrypts media end-to-end. When you add a VPN on top, you’re doing double encryption. That can be fine—unless your edge CPU is a candle in the wind.
- MTU problems got worse with VPNs over broadband because PPPoE, VLAN tags, and tunnel overhead stack like shipping containers.
What “good voice” looks like: latency, jitter, loss, and MOS
To fix VoIP over VPN you need targets. Here are the practical ones used in production environments:
- One-way latency: < 150 ms is generally “good.” 150–250 ms is tolerable but noticeable. Above 250 ms, people start stepping on each other.
- Jitter: keep it < 20–30 ms for a comfortable experience. You can survive higher, but the jitter buffer will add delay.
- Packet loss: keep it < 1% for decent audio. Many codecs can conceal a little loss; they don’t enjoy it.
- Reordering: small amounts are okay. Some VPN/WAN acceleration tricks can increase it, and RTP hates surprises.
The jitter buffer is not a cure; it’s a loan
Endpoints compensate for jitter by buffering. That buffer increases latency. If you “fix” jitter by increasing buffer size, you’ve just moved the pain from robotic audio to delayed conversation. Sometimes that’s the right trade—call centers often prefer slightly more delay over choppiness—but it’s not a free lunch.
MOS isn’t magic either
MOS (Mean Opinion Score) is a model. It’s useful when trend lines are obvious and you don’t lie to it. If you can, measure jitter/loss directly; treat MOS as a customer-sentiment proxy, not a physics engine.
One reliability quote (paraphrased), because operations has its prophets: Werner Vogels’ point that everything fails, all the time, so design and operate as if failure is normal.
Fast diagnosis playbook (five steps, in order)
First: prove where the bottleneck is
Before you touch QoS knobs, figure out what’s saturating:
- Check uplink utilization and drops on the actual WAN egress (the device that pushes packets to the ISP). If you’re shaping on an internal interface, you’re probably shaping the wrong place.
- Check for bufferbloat: latency under load is the truth. If ping times jump 10x during an upload, you found your jitter factory.
- Check CPU on the VPN endpoint: encrypted packet processing is often single-core limited. A tunnel can be “up” while the box is quietly melting.
Second: validate MTU and fragmentation end-to-end
If you have fragmentation, PMTUD blackholes, or MSS issues, voice will fail in ways that look like “random jitter.” It’s not random; it’s packets being delayed, fragmented, or dropped.
Third: ensure voice is classified correctly before encryption
Voice must be identified and prioritized before it disappears into a tunnel. Then you need to either:
- Preserve DSCP into the outer tunnel header, or
- Classify inside the tunnel endpoint and apply shaping/queueing there.
Fourth: fix queueing with modern disciplines (FQ-CoDel/CAKE)
Classic priority queues can help, but they can also starve everything else and create bizarre burstiness. For office links, CAKE or FQ-CoDel with sane shaping is usually the “works on Monday morning” answer.
Fifth: only then tune endpoint and PBX settings
Codec selection, packetization time (ptime), jitter buffer, and SIP session timers matter—but they can’t fix a saturated uplink with terrible queues.
Practical tasks: commands, outputs, and decisions (15)
These tasks are written for a Linux-based VPN gateway. If you’re on an appliance, the same concepts apply; the commands just become vendor-specific.
Task 1: Identify the real WAN interface and its link type
cr0x@server:~$ ip -br link
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens18 UP 52:54:00:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP>
ppp0 UNKNOWN <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP>
wg0 UNKNOWN <POINTOPOINT,NOARP,UP,LOWER_UP>
What it means: You might be on PPPoE (ppp0) and not realize you’re shaping the wrong interface (ens18). PPP adds overhead and has different MTU behavior.
Decision: Apply shaping/QoS on the true WAN egress (often ppp0). If you shape on ens18 but the real bottleneck is ppp0 or the ISP modem, you’re just doing interpretive dance.
Task 2: Check VPN interface stats for errors and drops
cr0x@server:~$ ip -s link show dev wg0
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
987654321 123456 0 12 0 0
TX: bytes packets errors dropped carrier collsns
876543210 120000 0 0 0 0
What it means: Dropped RX packets can indicate queue overflow on the receive side, CPU contention, or kernel-level pressure. For WireGuard, drops are often downstream congestion or policing elsewhere.
Decision: If drops climb during calls, don’t touch the PBX first. Fix congestion/queueing and check CPU and shaping.
Task 3: Measure latency under load (bufferbloat test you can run now)
cr0x@server:~$ ping -i 0.2 -c 30 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=12.3 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=13.1 ms
...
--- 1.1.1.1 ping statistics ---
30 packets transmitted, 30 received, 0% packet loss, time 6008ms
rtt min/avg/max/mdev = 11.9/12.8/14.6/0.6 ms
What it means: Baseline looks stable. Now repeat while saturating upload (e.g., a speed test from a client or a controlled iperf). If max jumps to 200–800 ms, you have bufferbloat.
Decision: If latency under load spikes, your first fix is shaping + modern AQM (CAKE/FQ-CoDel), not “change codecs.”
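If you want to generate the load yourself rather than wait for someone’s speed test, a minimal sketch, assuming you control an iperf3 server reachable from the office (203.0.113.20 is only a placeholder here): run the upload in one terminal and repeat the ping in another.
cr0x@server:~$ iperf3 -c 203.0.113.20 -t 30 -P 4 &
cr0x@server:~$ ping -i 0.2 -c 100 1.1.1.1
Compare max/mdev against the idle baseline above. If max jumps by an order of magnitude while the upload runs, you have your answer before touching a single QoS knob.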
Task 4: Confirm current qdisc on WAN egress
cr0x@server:~$ tc qdisc show dev ppp0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
What it means: pfifo_fast is old-school and usually terrible under mixed traffic. It doesn’t actively manage latency.
Decision: Replace with CAKE or FQ-CoDel + shaping. If you can’t, you’ll be fighting symptoms forever.
Task 5: Install CAKE shaping (example) and verify it’s active
cr0x@server:~$ sudo tc qdisc replace dev ppp0 root cake bandwidth 80Mbit diffserv4 nat nowash ack-filter
cr0x@server:~$ tc -s qdisc show dev ppp0
qdisc cake 8001: root refcnt 2 bandwidth 80Mbit diffserv4 nat nowash ack-filter
Sent 123456789 bytes 234567 pkt (dropped 123, overlimits 456 requeues 0)
backlog 0b 0p requeues 0
What it means: CAKE is shaping to 80 Mbit and applying DiffServ tiers. Some drops are normal; they should happen at your shaper, not upstream in a giant ISP buffer.
Decision: Set bandwidth slightly below real throughput (often 85–95%). Then re-test call quality under load. If it improves dramatically, you found the main culprit.
Task 6: Observe DSCP on inbound RTP before it enters the tunnel
cr0x@server:~$ sudo tcpdump -ni ens18 -vv 'udp and (port 5060 or portrange 10000-20000)' -c 5
tcpdump: listening on ens18, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP (tos 0xb8, ttl 64, id 12345, offset 0, flags [DF], proto UDP (17), length 214)
10.10.20.50.40012 > 198.51.100.10.16432: UDP, length 172
IP (tos 0x00, ttl 64, id 12346, offset 0, flags [DF], proto UDP (17), length 498)
10.10.20.50.5060 > 198.51.100.10.5060: SIP, length 456
What it means: RTP marked with DSCP EF (0xb8) is ideal. SIP often remains best-effort (0x00), which is fine; media matters more.
Decision: If RTP isn’t marked, decide whether to mark at the phone, at the access switch/AP, or at the gateway (mangle table). Marking at the edge is cleaner.
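If you decide to mark at the gateway, a minimal iptables sketch, assuming your media really lives in UDP 10000–20000 (adjust the range to whatever your PBX/SBC actually negotiates) and ens18 is the LAN-facing interface:
cr0x@server:~$ sudo iptables -t mangle -A PREROUTING -i ens18 -p udp --dport 10000:20000 -j DSCP --set-dscp-class EF
Then re-run the Task 6 capture and confirm RTP shows tos 0xb8 before it enters the tunnel.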
Task 7: Check whether DSCP survives inside the tunnel (outer header)
cr0x@server:~$ sudo tcpdump -ni ppp0 -vv 'udp and host 203.0.113.20' -c 3
tcpdump: listening on ppp0, link-type LINUX_SLL2, snapshot length 262144 bytes
IP (tos 0x00, ttl 64, id 54321, offset 0, flags [DF], proto UDP (17), length 148)
192.0.2.10.51820 > 203.0.113.20.51820: UDP, length 96
What it means: Outer tunnel packets are best-effort (tos 0x00). That’s common: the tunnel encapsulation didn’t copy inner DSCP to outer.
Decision: If your WAN QoS depends on DSCP, you may need to copy DSCP (where supported) or classify traffic on the tunnel endpoint and shape there. Otherwise your ISP sees one bland flow and treats voice like cat videos.
Task 8: Verify MTU on tunnel and detect fragmentation risk
cr0x@server:~$ ip link show dev wg0
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
What it means: 1420 is a common WireGuard MTU. If your LAN is 1500 and you don’t clamp MSS for TCP, you’ll see fragmentation for large packets. RTP is small, but SIP/TLS and “other traffic during a call” can still trigger queueing and retransmits.
Decision: If you have any suspicion of MTU trouble, clamp MSS for TCP and validate PMTUD. Don’t guess.
Task 9: PMTU test with “do not fragment” ping
cr0x@server:~$ ping -M do -s 1372 -c 3 203.0.113.20
PING 203.0.113.20 (203.0.113.20) 1372(1400) bytes of data.
1380 bytes from 203.0.113.20: icmp_seq=1 ttl=57 time=23.1 ms
1380 bytes from 203.0.113.20: icmp_seq=2 ttl=57 time=23.4 ms
1380 bytes from 203.0.113.20: icmp_seq=3 ttl=57 time=23.0 ms
--- 203.0.113.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 23.0/23.1/23.4/0.2 ms
What it means: 1400-byte packets with DF succeed. Increase size until it fails to find the true PMTU margin.
Decision: If even modest DF pings fail, you likely have a PMTUD blackhole. Fix MTU/MSS and ensure ICMP “frag needed” isn’t blocked.
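To bracket the PMTU quickly, a small sweep sketch (sizes are ICMP payload bytes; add 28 for the on-wire packet size):
cr0x@server:~$ for s in 1272 1372 1400 1420 1452 1472; do echo "== $s =="; ping -M do -s $s -c 1 -W 1 203.0.113.20; done
The largest size that still gets replies, plus 28, is your working PMTU toward the peer. If even small sizes fail while normal pings work, suspect a blackhole rather than a merely small MTU.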
Task 10: MSS clamp for TCP to avoid fragmentation through the tunnel
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -S FORWARD | grep TCPMSS
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
What it means: This prevents TCP sessions from trying to send segments too large for the tunnel, reducing fragmentation and retransmits that can indirectly worsen jitter under load.
Decision: Keep it unless you have a very specific reason not to. This is one of those “boring, correct” guardrails.
Task 11: Check VPN endpoint CPU saturation during calls
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0 (server) 12/28/2025 _x86_64_ (4 CPU)
12:10:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:10:02 AM all 22.0 0.0 11.0 0.0 0.0 20.0 0.0 47.0
12:10:02 AM 0 18.0 0.0 9.0 0.0 0.0 42.0 0.0 31.0
12:10:02 AM 1 25.0 0.0 12.0 0.0 0.0 18.0 0.0 45.0
12:10:02 AM 2 23.0 0.0 11.0 0.0 0.0 15.0 0.0 51.0
12:10:02 AM 3 22.0 0.0 12.0 0.0 0.0 5.0 0.0 61.0
What it means: High %soft on one core can indicate packet processing bottlenecks (softirq). VPN encryption and encapsulation can pin to a core depending on driver/queueing.
Decision: If one core is consistently pegged during peak, you need to optimize CPU path (NIC RSS, IRQ affinity, faster cipher, hardware, or fewer packets via larger ptime where acceptable).
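A hedged starting point for checking whether receive work can spread at all, assuming ens18 is a multiqueue-capable NIC (virtio and most server NICs are; interrupt names in /proc/interrupts may reference the driver rather than the interface name):
cr0x@server:~$ ethtool -l ens18
cr0x@server:~$ sudo ethtool -L ens18 combined 4
More channels only help if the driver hashes flows across them; measure per-core softirq again after the change.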
Task 12: Observe softirq pressure (network processing) and packet drops
cr0x@server:~$ cat /proc/softirqs | egrep 'NET_RX|NET_TX'
NET_TX: 1020304 993322 880011 770022
NET_RX: 9090909 8888888 7777777 6666666
What it means: Large, fast-increasing NET_RX on one CPU can correlate with jitter when the system is overloaded and packets wait their turn.
Decision: If correlated with bad calls, reduce packet rate (codec packetization), scale CPU, or adjust IRQ/RPS/RFS so work spreads.
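If the NIC has a single queue, RPS can spread receive processing in software; a minimal sketch, assuming a 4-core box (the mask f means CPUs 0–3):
cr0x@server:~$ echo f | sudo tee /sys/class/net/ens18/queues/rx-0/rps_cpus
cr0x@server:~$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
cr0x@server:~$ echo 32768 | sudo tee /sys/class/net/ens18/queues/rx-0/rps_flow_cnt
Measure before and after: on some drivers the default already spreads fine, and this only adds cache misses.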
Task 13: Verify RTP packet loss and jitter with a capture on the gateway
cr0x@server:~$ sudo tshark -ni ens18 -f "udp portrange 10000-20000" -c 500 -q -z rtp,streams
Running as user "root" and group "root". This could be dangerous.
========================= RTP Streams =========================
Start time End time Src IP Src port Dst IP Dst port SSRC Payload Packets Lost Max Jitter
0.000000 9.842000 10.10.20.50 40012 198.51.100.10 16432 0x1a2b3c4d 111 400 3 14.7 ms
===============================================================
What it means: You’re seeing actual RTP loss and jitter (as observed at the capture point). A few lost packets might be fine; sustained loss correlates with audible artifacts.
Decision: If loss happens before tunnel encapsulation, fix LAN/Wi‑Fi. If loss happens after encapsulation (capture on WAN side), fix WAN shaping/ISP path/tunnel endpoint.
Task 14: Check conntrack pressure (NAT table overload can look like “jitter”)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 24789
net.netfilter.nf_conntrack_max = 262144
What it means: Plenty of headroom. If count is close to max, new flows get dropped/evicted, which can break SIP/RTP in odd ways.
Decision: If near max during business hours, increase conntrack_max and/or reduce needless NAT churn (short-lived flows, misconfigured timeouts, chatty devices).
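Raising the ceiling is one sysctl plus persistence; the value below is illustrative, size it to your RAM and real flow counts:
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
cr0x@server:~$ echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/90-conntrack.conf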
Task 15: Confirm RTP/SIP timeouts aren’t being murdered by firewall defaults
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_udp_timeout net.netfilter.nf_conntrack_udp_timeout_stream
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
What it means: If these are too low for your environment, RTP flows can get dropped mid-call when there’s a brief silence or a re-INVITE dance.
Decision: If you see mid-call one-way audio after ~30 seconds of silence, increase UDP timeouts or ensure keepalives are configured.
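A hedged example of loosening the UDP timeouts; the numbers are illustrative and should be coordinated with whatever keepalive interval your phones or SBC actually send:
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout=60
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=300
Persist them the same way as any other sysctl once you’ve confirmed they help.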
Tunnel-specific guidance: IPsec, WireGuard, OpenVPN
IPsec (IKEv2/ESP): strong, common, and easy to mis-QoS
IPsec is a workhorse. It also loves to make your QoS invisible if you don’t deliberately preserve it.
- ESP encapsulation hides ports, so “prioritize UDP 10000–20000” doesn’t help on the WAN side; everything becomes ESP.
- DSCP copying is not automatic everywhere. Some stacks can copy inner DSCP to outer; some don’t; some do until you enable NAT-T.
- NAT-T (UDP encapsulation) adds another header and can change MTU and fragmentation dynamics.
What to do: Prioritize traffic before encryption, then make sure the VPN endpoint uses that to schedule packets into the tunnel. If you’re doing QoS downstream (ISP-managed), you’ll need outer DSCP preservation as well, and you must verify it with captures.
WireGuard: lean, fast, still not magic
WireGuard’s simplicity is operationally beautiful. But it doesn’t automatically solve jitter; it just wastes fewer CPU cycles than some older designs.
- Default MTU choices matter. 1420 is common; your environment may need 1380-ish if you stack PPPoE/VLAN.
- Single UDP flow problem: from an ISP perspective, it’s one flow unless you shape intelligently. You must do fair queueing on your side.
- Marking: DSCP on inner packets doesn’t automatically become DSCP on outer UDP 51820 packets.
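MTU and keepalive live in the wg-quick config; a minimal sketch with placeholder values for a PPPoE/VLAN stack (the addresses and numbers are illustrative, not recommendations):
[Interface]
Address = 10.99.0.2/32
PrivateKey = <redacted>
MTU = 1380

[Peer]
PublicKey = <redacted>
Endpoint = 203.0.113.20:51820
AllowedIPs = 10.0.0.0/8
PersistentKeepalive = 25
PersistentKeepalive also keeps NAT/conntrack mappings warm, which ties back to the UDP timeout discussion above.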
OpenVPN: flexible, sometimes heavier than you think
OpenVPN can be perfectly fine for VoIP. It can also become a jitter generator if you run it in user space on a small CPU and push lots of small packets.
- UDP mode is usually the right choice for VoIP. TCP mode can amplify latency due to retransmission semantics (“TCP meltdown”).
- User-space processing overhead can be visible as jitter during bursts of traffic.
- Compression is a trap in modern setups: it’s often a security risk and can increase CPU jitter.
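If you stay on OpenVPN, the boring knobs are a handful of directives; a sketch of the relevant lines only, assuming a routed tun setup:
proto udp
dev tun
# keep TCP segments small enough to avoid fragmentation inside the tunnel
mssfix 1360
# deliberately no compress/comp-lzo line: compression stays off
mssfix is the OpenVPN-side cousin of the MSS clamp in Task 10; pick one place to do it and verify with captures.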
QoS and shaping that actually works over tunnels
Start with shaping, not “priority”
If you do nothing else: shape egress slightly below the real WAN rate and use CAKE. This forces queueing to happen on your box (where you control it) instead of inside the ISP’s mysterious buffer (where you don’t).
Why this helps VoIP over VPN:
- RTP packets stay small and frequent; fair queueing gives them frequent turns.
- Large uploads get spread out instead of building a giant queue.
- Latency under load becomes predictable, which is exactly what jitter buffers want.
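Task 5 shaped the upload; the download side usually means redirecting ingress to an IFB device and shaping there. A minimal sketch, assuming roughly a 500/80 Mbit line (the rates are placeholders):
cr0x@server:~$ sudo modprobe ifb
cr0x@server:~$ sudo ip link set ifb0 up
cr0x@server:~$ sudo tc qdisc add dev ppp0 handle ffff: ingress
cr0x@server:~$ sudo tc filter add dev ppp0 parent ffff: protocol all prio 10 matchall action mirred egress redirect dev ifb0
cr0x@server:~$ sudo tc qdisc add dev ifb0 root cake bandwidth 450Mbit besteffort ingress
besteffort on ingress is deliberate: you can’t re-prioritize what the ISP already sent, only keep its queue from growing.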
Classification: mark before encryption, then schedule into the tunnel
There are two workable models:
- Inner classification + tunnel-aware shaping: classify RTP/SIP on the LAN interface and then shape on WAN based on firewall marks. This avoids needing DSCP on the outer header.
- DSCP end-to-end: mark RTP EF and ensure the tunnel copies DSCP to the outer header so that downstream devices can honor it. This is cleaner when it works. It often doesn’t, unless you configure it explicitly.
Don’t overdo priority
Many “QoS guides” recommend a strict priority queue for voice. That’s fine if your voice traffic volume is modest and your classification is correct. In a real office, misclassification happens, and strict priority can starve DNS, ACKs, and control traffic—causing a weird kind of self-inflicted outage where voice is crystal clear and everything else collapses.
Second short joke (and that’s the quota): Strict priority QoS is like giving one coworker the only key to the conference room; the meeting is efficient until everyone else starts picking the lock.
Make it measurable: latency under load as your KPI
Your best “before/after” is not MOS from one phone. It’s ping latency under load plus RTP loss/jitter from captures at key points. If shaping works, ping under load improves dramatically and RTP jitter stabilizes.
MTU, MSS, fragmentation: the silent voice killer
Why MTU problems show up as “jitter”
Fragmentation doesn’t always drop packets, but it increases variability. Fragments may take different paths, get queued differently, or get dropped selectively. If you’re unlucky, PMTUD breaks and large packets get blackholed; then TCP retransmissions flood the link and the voice stream suffers collateral damage.
Common MTU stacks in offices
- Ethernet: 1500
- PPPoE: typically 1492
- VLAN tags: 4 extra bytes per tag; harmless where the gear accepts slightly larger frames, payload-reducing where it doesn’t
- WireGuard: often 1420-ish
- IPsec ESP/NAT-T: varies; can cut effective MTU substantially
Practical rules
- Clamp MSS for TCP toward the tunnel to prevent fragmentation storms.
- Allow essential ICMP (frag-needed, time-exceeded). Blocking it is like removing road signs and blaming drivers for being late.
- Set tunnel MTU deliberately. Auto can be fine, but “fine” isn’t a strategy when the CEO hears robots.
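For the ICMP point above, the minimum on an iptables gateway looks roughly like this (nftables equivalents exist; placement relative to your existing drop rules matters, so insert where the rules are actually reached):
cr0x@server:~$ sudo iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
cr0x@server:~$ sudo iptables -A FORWARD -p icmp --icmp-type time-exceeded -j ACCEPT
If you run IPv6, the equivalent ICMPv6 packet-too-big messages are not optional; blocking them breaks PMTUD outright.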
Wi‑Fi and last-mile issues you keep blaming on “the VPN”
Wi‑Fi introduces jitter by design
Wi‑Fi is contention-based. Clients take turns, retransmit, and rate-adapt. It’s astonishing that VoIP works on it at all, and the reason it does is that voice is low bandwidth. But low bandwidth doesn’t mean low sensitivity.
If your office phones are on Wi‑Fi:
- Prefer 5 GHz (or 6 GHz if available) for lower interference.
- Enable WMM (Wi‑Fi Multimedia) and map DSCP to WMM access categories correctly.
- Watch for “sticky clients” on distant APs causing retries and airtime abuse.
Last-mile asymmetry and policing
Many broadband links have asymmetric upload. VoIP is bidirectional; the uplink often matters more because it’s where the office sends RTP to the provider, and where backups and cloud sync can saturate.
Also: ISPs sometimes police traffic in a way that causes microbursts to drop. Your shaper should smooth bursts before they hit the modem. If you don’t shape, the ISP will shape for you, badly.
Three corporate mini-stories (pain, regret, and boring success)
1) The incident caused by a wrong assumption
They migrated phones to a cloud PBX and kept the site-to-site VPN because “security requires all traffic through HQ.” The migration plan assumed voice was “just another UDP stream,” and since bandwidth graphs looked fine, nobody worried.
First week in production, executives started reporting that calls were perfect early in the morning and unusable at 10:30. That pattern screamed “congestion,” but the NOC looked at average utilization and saw nothing alarming—because averages are the lie your graphs tell to keep you calm.
A capture on the LAN side showed RTP marked EF. A capture on the WAN side showed the tunnel packets all best-effort. The HQ firewall had a beautiful QoS policy for voice ports, but once the packets entered IPsec, they became ESP. The policy never matched. Voice was being queued behind a mid-morning surge of cloud storage sync and a weekly endpoint update cycle.
The fix wasn’t exotic: shape the HQ uplink with CAKE, classify voice before encryption, and schedule it into a higher tier. Once the bottleneck queue moved from the ISP to their controlled shaper, jitter flattened and calls stabilized. The wrong assumption was that “we already have QoS” because the config existed. The network didn’t care about the config; it cared about where the packets actually queued.
2) The optimization that backfired
A different company had a tiny branch with a modest firewall. VoIP over VPN was borderline during busy periods, so someone “optimized” by enabling aggressive packet coalescing and offloads, plus a performance profile that cranked interrupt moderation to reduce CPU usage.
The firewall’s CPU graphs looked better immediately. The team celebrated. Then the helpdesk got a new class of complaint: “People sound like they’re underwater” and “Every few seconds the audio hiccups.” It didn’t correlate with bandwidth; it correlated with traffic bursts.
What happened: interrupt moderation and coalescing increased latency variance. Packets weren’t processed steadily; they arrived in clumps. RTP prefers a metronome. It got jazz. Also, some offload settings interacted badly with encapsulated traffic, increasing retransmits in other flows, which created more bursts. A perfect feedback loop—like an echo, but for bad decisions.
They rolled back the “performance” tuning, then used shaping to cap the WAN slightly below line rate. CPU went up, but call quality stabilized. The lesson: optimizing for average CPU can worsen tail latency. VoIP lives in the tails.
3) The boring but correct practice that saved the day
One more: a company with dozens of small offices ran VoIP over WireGuard to a regional hub. Nothing fancy. But they had an unsexy habit: every new site got a standardized “voice acceptance test” run from day one.
The test included a ping-under-load check, an MTU DF-ping sweep, and a short synthetic RTP stream measured at the hub. They stored results per site and compared them to baseline. It was so routine that junior techs could run it without improvisation.
One office started having intermittent call issues after an ISP hardware swap. The ISP insisted the line was fine. The team reran the acceptance test and immediately saw latency under load jump from “steady” to “roller coaster.” The MTU sweep also showed a reduced PMTU. They hadn’t changed anything internally.
Because they had baseline data, they didn’t argue about feelings. They presented measured evidence, adjusted their shaper and tunnel MTU temporarily, and pushed the ISP to correct the provisioning. The calls improved the same day. The practice wasn’t glamorous. It was correct. It saved time, and it saved the team from the endless meeting where someone says “but it worked last year.”
Common mistakes: symptom → root cause → fix
1) Symptom: calls are fine until someone uploads a large file
Root cause: bufferbloat on the uplink; queueing occurs in modem/ISP, not under your control.
Fix: shape egress below line rate on the true WAN interface (CAKE/FQ-CoDel). Validate via ping-under-load.
2) Symptom: one-way audio, especially after a minute of silence
Root cause: NAT/conntrack UDP timeouts too low; SIP/RTP mappings expire.
Fix: increase UDP timeouts appropriately, enable SIP/RTP keepalives if supported, and ensure symmetric routing.
3) Symptom: robotic audio with periodic stutters, even when bandwidth is low
Root cause: CPU/softirq bottleneck on VPN gateway, often single-core saturation, or aggressive interrupt moderation causing bursty processing.
Fix: check per-core CPU and softirq. Adjust IRQ/RSS, reduce packet rate (codec ptime), or upgrade hardware. Roll back “latency-hostile” offloads.
4) Symptom: random drops when enabling the VPN, especially for certain applications
Root cause: MTU/PMTUD issues; ICMP frag-needed blocked; fragmentation or blackholing.
Fix: validate PMTU with DF pings; clamp MSS; allow ICMP; set tunnel MTU deliberately.
5) Symptom: QoS configured, but voice still sounds bad over VPN
Root cause: you’re matching on inner ports that are hidden after encryption; or DSCP doesn’t copy to outer header.
Fix: classify before encryption using marks, shape on WAN using those marks; or configure DSCP copy and verify with captures.
6) Symptom: jitter only in Wi‑Fi users, wired phones are fine
Root cause: Wi‑Fi retries, interference, airtime contention, missing WMM/DSCP-to-WMM mapping.
Fix: enforce 5/6 GHz usage, tune APs, enable WMM, reduce retries, and consider dedicated SSID/VLAN for voice.
7) Symptom: VPN “works,” but call setup is slow or fails intermittently
Root cause: SIP ALG or “helpful” firewall features rewriting SIP; or asymmetric routing across multiple WANs.
Fix: disable SIP ALG unless you demonstrably need it; ensure consistent path for signaling and media; use an SBC if required by architecture.
Checklists / step-by-step plan
Step-by-step: stabilize voice over a VPN in an office
- Map the traffic path: where is the VPN endpoint, where is NAT done, where is the real bottleneck (WAN egress)? Write it down.
- Measure baseline: ping to a stable external target; capture RTP stats at gateway; note call symptoms and time-of-day pattern.
- Run latency-under-load test: saturate upload briefly; observe ping max and jitter. If it spikes, stop blaming “the PBX.”
- Install shaping on WAN egress: start with CAKE at 85–95% of actual up/down. Re-test under load.
- Classify voice: mark RTP EF (or equivalent) at the edge; ensure the shaper respects it. If you can’t preserve DSCP in the tunnel, use firewall marks.
- Fix MTU/MSS: DF-ping sweep, clamp MSS, allow necessary ICMP, adjust tunnel MTU if needed.
- Verify CPU headroom: watch per-core CPU and softirq during concurrent calls + load. If you’re CPU bound, no QoS policy will save you.
- Validate on Wi‑Fi separately: if only Wi‑Fi breaks, treat it as a Wi‑Fi problem. Wired tests are your control group.
- Lock in monitoring: track latency under load, tunnel drops, qdisc drops, and RTP jitter/loss. Trend it. Make it boring.
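To make that repeatable rather than heroic, a bare-bones acceptance-test sketch in bash; the target host, tunnel peer, and iperf3 server below are placeholders you would standardize per site:
#!/usr/bin/env bash
# Minimal voice acceptance test: idle latency, latency under upload load, DF ping sweep.
set -u
TARGET=1.1.1.1        # stable external host for latency checks (placeholder)
PEER=203.0.113.20     # tunnel far end (placeholder)
IPERF=203.0.113.20    # iperf3 server you control (placeholder)

echo "== idle latency =="
ping -i 0.2 -c 30 "$TARGET" | tail -2

echo "== latency under upload load =="
iperf3 -c "$IPERF" -t 20 -P 4 >/dev/null 2>&1 &
LOAD=$!
sleep 2
ping -i 0.2 -c 60 "$TARGET" | tail -2
wait "$LOAD" || true

echo "== DF ping sweep toward tunnel peer =="
for s in 1272 1372 1400 1420 1472; do
  printf "size %s: " "$s"
  if ping -M do -s "$s" -c 1 -W 1 "$PEER" >/dev/null 2>&1; then echo ok; else echo fail; fi
done
Store the output per site and diff it against the last known-good run; that is the whole trick from the third mini-story.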
Operational checklist: before you change anything mid-incident
- Capture 30 seconds of traffic on LAN side and WAN side (or tunnel interface) during a bad call.
- Check qdisc stats and interface drops.
- Check CPU per core and softirq.
- Confirm whether the uplink is saturating (not average; current and peaks).
- Confirm whether DSCP is present on RTP and whether it survives encapsulation.
- Run DF ping with a few sizes to detect PMTU regressions.
FAQ
1) Does a VPN inherently make VoIP bad?
No. A VPN adds overhead and can hide traffic for QoS, but the usual killer is congestion-induced queueing and poor queue management. Fix those and VoIP can be excellent over a tunnel.
2) Should I run VoIP outside the VPN to “fix it”?
Sometimes it’s a valid architecture choice, especially with cloud PBX and SRTP/TLS already in place. But don’t use it as a workaround for bufferbloat you’ll still have for everything else.
3) What matters more: latency or jitter?
For voice quality, jitter is often what users perceive as “choppy.” For conversational comfort, latency matters more. In practice they’re linked: jitter buffers trade jitter for latency.
4) Is DSCP worth it on the public internet?
Across the open internet, DSCP is inconsistent. Inside your office and on your controlled edge, it’s absolutely worth it because it informs your own shaper and Wi‑Fi WMM mapping. Treat ISP honoring as a bonus, not a dependency.
5) What’s the single best change for office VoIP over VPN?
Shaping the WAN egress with CAKE (or FQ-CoDel) at a realistic bandwidth cap, then verifying latency-under-load improves. It’s not glamorous. It’s effective.
6) Why do my graphs show low utilization but calls still stutter?
Because averages hide microbursts and queueing. Voice fails on millisecond-scale variability. Look at max latency under load, interface drops, and qdisc backlog—not just 5-minute averages.
7) Should I change codecs to fix jitter?
Codec choice can help (some tolerate loss better), and packetization time affects packet rate. But if your uplink queues are the problem, codec changes are a bandage. Fix the network first, then optimize codecs if needed.
8) Is TCP-based VPN (or SIP over TCP) a good idea?
For media (RTP), no—use UDP. For SIP signaling, TCP/TLS is fine. But TCP-based tunneling for everything can amplify latency under loss because of retransmission and head-of-line blocking.
9) What if the VPN endpoint is in the cloud, not the office?
Then you still need shaping at the office WAN egress. Cloud shaping doesn’t fix the office uplink queue. You can also add shaping/queueing at the cloud side to protect downstream, but the first choke point is usually the branch uplink.
10) How do I tell whether the problem is the ISP or my equipment?
If shaping on your gateway dramatically improves latency-under-load, the ISP path may be fine; your previous queueing location was wrong. If you still see loss/jitter after shaping and CPU is healthy, gather captures and push the ISP with evidence (PMTU changes, loss patterns, time-of-day correlation).
Conclusion: next steps you can do this week
If your office runs VoIP through a VPN tunnel and quality is inconsistent, stop debating and start measuring. The fix is usually a small stack of boring controls:
- Run latency-under-load tests and prove bufferbloat (or rule it out).
- Shape the true WAN egress with CAKE or FQ-CoDel, slightly below real line rate.
- Classify voice before encryption; don’t assume QoS rules match once traffic becomes “just a tunnel.”
- Validate MTU/PMTU and clamp MSS so you don’t create fragmentation-driven chaos.
- Check VPN endpoint CPU/softirq and undo “optimizations” that increase tail latency.
Do that, and VoIP over VPN becomes boring—which is the highest compliment an SRE can give a production system.