You know the sound. The call starts fine, then somebody turns into a fax machine auditioning for a robot movie.
Everyone blames “the VPN,” then the ISP, then the softphone, then the moon phase. Meanwhile, the real culprit is usually boring:
MTU/MSS mismatch, jitter from bufferbloat, or QoS that doesn’t survive the trip through your tunnel.
I run production networks where voice is just another workload—until it isn’t. Voice punishes lazy assumptions.
It doesn’t care that your throughput test looks great; it cares about small packets arriving on time, consistently, with minimal loss and reordering.
A mental model that actually predicts failures
If you remember one thing: voice is a real-time stream riding on a best-effort network. A VPN adds headers, hides inner QoS markings unless you deliberately preserve them, and can alter packet pacing.
The usual failure modes are not mysterious. They are physics and queueing.
What “robotic audio” usually is
“Robotic” is rarely a codec “quality” problem. It’s packet loss and concealment in action.
RTP audio arrives in small packets (often 20 ms of audio per packet). Lose a few, jitter spikes, the jitter buffer stretches, the decoder guesses, and you hear the robot.
Voice can survive some loss; it just can’t hide it politely.
The VoIP-over-VPN stack in one diagram (conceptual)
Think of the packet as a nested set of envelopes:
- Inner: SIP signaling + RTP media (often UDP) with DSCP markings you’d like to keep
- Then: your VPN wrapper (WireGuard/IPsec/OpenVPN) adds overhead and may change MTU
- Outer: ISP and internet queues (where bufferbloat lives) and where QoS may or may not work
- Endpoints: softphone or IP phone, and a PBX/ITSP
Breakage is usually in one of three places:
(1) size (MTU/fragmentation),
(2) timing (jitter/queues),
(3) prioritization (QoS/DSCP and shaping).
Paraphrased idea from W. Edwards Deming: “Without data, you’re just another person with an opinion.” Treat voice issues like incidents: measure, isolate, change one variable, re-measure.
Fast diagnosis playbook
When the CEO says “calls are broken,” you do not start by debating codecs. You start by narrowing the blast radius and locating the queue.
Here’s the order that finds root causes quickly.
First: confirm whether it’s loss, jitter, or MTU
- Check RTP stats in the client/PBX: loss %, jitter, late packets. If you don’t have this, capture packets and compute it (later). If you see even 1–2% loss during “robot” moments, treat it as a network problem until proven otherwise.
- Run a quick path MTU test through the VPN. If PMTUD is broken, you’ll get black-holed large packets, especially on UDP-based VPNs.
- Check queueing delay under load on the narrowest uplink (usually the user’s upload). Bufferbloat is the silent killer of voice.
Second: isolate where it breaks
- Bypass the VPN for one test call (split tunnel or temporary policy). If voice improves dramatically, focus on tunnel overhead, MTU, and QoS handling at tunnel edges.
- Compare wired vs Wi‑Fi. If Wi‑Fi is worse, you’re in airtime contention and retransmission land. Fix that separately.
- Test from a known-good network (a lab circuit, a different ISP, or a cloud VM running a softphone). If that’s clean, the problem is at the user edge.
Third: apply the “boring fixes”
- Set the VPN interface MTU explicitly and clamp TCP MSS where relevant.
- Apply smart queue management (fq_codel/cake) at the real bottleneck and shape slightly below line rate.
- Mark voice traffic and prioritize it where you control the queue (often the WAN edge), not just in your dreams. A minimal sketch of all three fixes follows below.
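A minimal sketch, assuming a Linux gateway with WireGuard on wg0, the WAN on eth0, roughly 20 Mbit/s of real uplink, and iptables; the interface names and rate are placeholders to adapt:
cr0x@server:~$ sudo ip link set dev wg0 mtu 1420   # conservative tunnel MTU until you've measured better
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu   # clamp MSS for TCP crossing the tunnel
cr0x@server:~$ sudo tc qdisc replace dev eth0 root cake bandwidth 18mbit diffserv4   # shape slightly below the real uplink so this queue, not the modem's, fills first
Each of these shows up again below, next to the measurements that justify it.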
Joke #1: A VPN is like a suitcase—if you keep stuffing extra headers in, eventually the zipper (MTU) gives up at the worst moment.
MTU, MSS, and fragmentation: why “robotic” often means “tiny loss”
MTU problems don’t always look like “can’t connect.” They can look like “connects, but sometimes sounds haunted.”
That’s because signaling might survive while certain media packets or re-invites get dropped, or because fragmentation increases loss sensitivity.
What changes when you add a VPN
Every tunnel adds overhead:
- WireGuard adds an outer UDP/IP header plus WireGuard overhead.
- IPsec adds ESP/AH overhead (plus possible UDP encapsulation for NAT-T).
- OpenVPN adds user-space overhead and can add extra framing depending on mode.
The inner packet that was fine at MTU 1500 may no longer fit. If your path doesn’t support fragmentation the way you think, something gets dropped.
And UDP doesn’t retransmit; it just disappoints you in real time.
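A quick overhead sanity check, assuming standard WireGuard framing (outer IP header, 8-byte UDP header, 16-byte WireGuard data header, 16-byte auth tag); the arithmetic is where the ubiquitous 1420 default comes from:
cr0x@server:~$ echo $((1500 - 20 - 8 - 32))   # IPv4 outer header: largest inner packet that still fits
1440
cr0x@server:~$ echo $((1500 - 40 - 8 - 32))   # IPv6 outer header: hence the conservative 1420 default
1420
IPsec and OpenVPN overhead varies with cipher, NAT-T, and framing, so measure your own path rather than copying these numbers.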
Path MTU discovery (PMTUD) and how it fails
PMTUD relies on ICMP “Fragmentation Needed” messages (for IPv4) or Packet Too Big (for IPv6). Lots of networks block or rate-limit ICMP.
Result: you send packets that are too large, routers drop them, and your sender never learns. That’s called a “PMTUD black hole.”
Why RTP usually isn’t “too big” — but still suffers
RTP voice packets are typically small: dozens to a couple hundred bytes payload, plus headers. So why do MTU issues affect calls?
- Signaling and session changes (SIP INVITE/200 OK with SDP, TLS records) can get large.
- VPN encapsulation can fragment even moderate packets, increasing loss probability.
- Jitter spikes happen when fragmentation and reassembly interact with congested queues.
- Some softphones bundle or send larger UDP packets under certain settings (comfort noise, SRTP, or unusual ptime).
Actionable guidance
- For WireGuard, start with MTU 1420 if you’re not sure. It’s not magic; it’s a conservative default that avoids common overhead pitfalls.
- For OpenVPN, be explicit with tunnel MTU and MSS clamping for TCP flows that traverse the tunnel (a config sketch follows this list).
- Don’t “just lower MTU everywhere” blindly. You can fix one path and hurt another. Measure, then set.
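For reference, “explicit” can be this small. A hedged sketch, assuming wg-quick for WireGuard and a routed OpenVPN tunnel; the values are starting points, not measurements:
# /etc/wireguard/wg0.conf (excerpt)
[Interface]
MTU = 1420          # set deliberately; wg-quick applies it to wg0

# OpenVPN config (excerpt)
tun-mtu 1400        # explicit tunnel MTU
mssfix 1360         # cap TCP MSS so encapsulated packets stay under the path MTU
After changing either, re-run the DF ping tests from the tasks below and write down why you picked the number.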
Jitter, bufferbloat, and why speed tests lie
You can have 500 Mbps down and still sound like you’re calling from a submarine. Voice needs low latency variation, not bragging rights.
The biggest practical enemy is bufferbloat: oversized queues in routers/modems that build up under load and add hundreds of milliseconds of delay.
Jitter vs latency vs packet loss
- Latency: how long a packet takes end-to-end.
- Jitter: how much that latency varies packet-to-packet.
- Loss: packets that never arrive (or arrive too late to matter).
Voice codecs use jitter buffers. Those buffers can smooth variation up to a point, at the cost of added delay.
When jitter gets ugly, buffers either grow (increasing delay) or drop late packets (increasing loss). Either way: robotic audio.
Where jitter is born
Most jitter in VoIP-over-VPN incidents isn’t “the internet.” It’s the edge queue:
- User home router with a deep upstream buffer
- Corporate branch firewall doing traffic inspection and buffering
- VPN concentrator CPU saturation causing packet scheduling delay
- Wi‑Fi contention/retransmissions (looks like jitter and loss)
Queue management that actually works
If you control the bottleneck, you can fix voice.
Smart queue management (SQM) algorithms like fq_codel and cake actively prevent queues from growing without bound and keep latency stable under load.
The trick: you must shape slightly below the true link rate so your device, not the ISP modem, becomes the bottleneck and therefore controls the queue.
If you don’t, you’re politely asking the modem to behave. It will not.
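A minimal SQM sketch, assuming a Linux edge router with eth0 as the WAN, roughly 20 Mbit/s up and 200 Mbit/s down, and the ifb module available for ingress shaping; replace the rates with roughly 85–95% of your measured line rate:
cr0x@server:~$ sudo tc qdisc replace dev eth0 root cake bandwidth 18mbit diffserv4     # egress: shape below the real uplink so this queue is the bottleneck
cr0x@server:~$ sudo ip link add ifb0 type ifb && sudo ip link set ifb0 up              # virtual interface for shaping downloads
cr0x@server:~$ sudo tc qdisc add dev eth0 handle ffff: ingress
cr0x@server:~$ sudo tc filter add dev eth0 parent ffff: protocol all matchall action mirred egress redirect dev ifb0
cr0x@server:~$ sudo tc qdisc replace dev ifb0 root cake bandwidth 180mbit ingress      # 'ingress' tells cake this traffic already crossed the bottleneck
The egress rule is the one that saves calls; the ingress half mostly keeps downloads and screen sharing from trampling each other.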
Joke #2: Bufferbloat is what happens when your router hoards packets like they’re collectible antiques.
QoS/DSCP basics for voice through VPNs (and what gets stripped)
QoS is not a magic “make it good” checkbox. It’s a way to decide what gets hurt first when the link is congested.
That’s it. If there is no congestion, QoS changes nothing.
DSCP and the myth of “end-to-end QoS”
Voice often marks RTP as DSCP EF (Expedited Forwarding) and SIP as CS3/AF31 depending on your policy.
Within your LAN, that can help. Across the internet, most providers will ignore it. Across a VPN, it might not even survive encapsulation.
What you can control
- LAN edge: prioritize voice from phones/softphones to your VPN gateway.
- VPN gateway WAN egress: shape and prioritize outer packets that correspond to voice flows.
- Branch/user edge: if you manage it, deploy SQM and mark voice locally.
VPN specifics: inner vs outer markings
Many tunnel implementations will encapsulate inner packets into an outer packet. The outer packet gets forwarded by the ISP.
If the outer packet isn’t marked (or if it’s marked but stripped), your “EF” on the inside is just decorative.
The workable approach:
- Classify voice before encryption when possible, then apply priority to the encrypted flow (outer header) on egress (see the sketch after this list).
- Preserve DSCP across the tunnel if your gear supports it and your policy allows it.
- Don’t trust Wi‑Fi WMM to save you if your uplink queue is melting down.
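Two hedged examples of what that can look like on Linux. The OpenVPN directive copies the inner TOS/DSCP byte onto the encapsulating packet; the iptables rule is a coarse fallback that marks the entire tunnel flow (WireGuard on UDP 51820 here, matching the captures below), which only makes sense if the tunnel is voice-dominated or your queue treats that class gently:
# OpenVPN config (excerpt): copy inner TOS/DSCP to the outer packet
passtos

cr0x@server:~$ sudo iptables -t mangle -A POSTROUTING -o eth0 -p udp --dport 51820 -j DSCP --set-dscp-class CS4   # coarse: promotes everything inside the tunnel; pick a class your egress queue actually honors
In the Task 6 capture below, the outer WireGuard packets carry tos 0x00 even though the inner RTP is EF, so inner marking alone buys you nothing past the tunnel edge.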
QoS caution: you can make it worse
A bad QoS policy can starve control traffic, or create microbursts and reordering. Voice likes priority, but it also likes stability.
Keep classes simple: voice, interactive, bulk. Then shape.
Practical tasks: commands, outputs, and decisions
These are “run it now” tasks. Each includes a command, what the output tells you, and what decision to make.
Use them on Linux endpoints, VPN gateways, or troubleshooting hosts. Adjust interface names and IPs to match your environment.
Task 1: Confirm interface MTU on the VPN tunnel
cr0x@server:~$ ip link show dev wg0
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: wg0 MTU is 1420. Good conservative baseline for WireGuard.
Decision: If MTU is 1500 on a tunnel, assume trouble unless you’ve proven the path supports it. If robotic audio correlates with certain paths, test lower MTU.
Task 2: Measure path MTU with “do not fragment” ping (IPv4)
cr0x@server:~$ ping -M do -s 1372 -c 3 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1372(1400) bytes of data.
1380 bytes from 10.20.30.40: icmp_seq=1 ttl=63 time=18.4 ms
1380 bytes from 10.20.30.40: icmp_seq=2 ttl=63 time=18.7 ms
1380 bytes from 10.20.30.40: icmp_seq=3 ttl=63 time=18.2 ms
--- 10.20.30.40 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
Meaning: 1400-byte packet (including headers) makes it through without fragmentation.
Decision: Increase -s until pings fail; that locates the path maximum. Set the tunnel MTU safely below that value minus the encapsulation overhead (a quick sweep is sketched below).
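A paste-in sweep, assuming the same endpoint as above; the +28 accounts for the IPv4 and ICMP headers ping adds on top of the payload:
cr0x@server:~$ for s in 1472 1452 1432 1412 1392 1372; do ping -M do -s "$s" -c 1 -W 1 10.20.30.40 >/dev/null 2>&1 && echo "payload $s ok ($((s + 28)) bytes on the wire)" || echo "payload $s blocked"; done
The largest “ok” payload marks the working path MTU on this route; as above, leave headroom for encapsulation overhead when you set the tunnel MTU.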
Task 3: Observe PMTUD failure symptoms (IPv4)
cr0x@server:~$ ping -M do -s 1472 -c 2 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
--- 10.20.30.40 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1013ms
Meaning: Your local interface MTU is 1420; kernel refuses to send bigger with DF set.
Decision: If apps are sending bigger packets anyway (UDP VPN encapsulation can), clamp or configure MTU/MSS so they don’t.
Task 4: Check TCP MSS clamping rules (iptables)
cr0x@server:~$ sudo iptables -t mangle -S | grep -i mss
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Meaning: TCP SYN packets have MSS clamped based on PMTU.
Decision: If you carry SIP over TCP/TLS through the VPN and see stalls or retransmits, enable this. It won’t fix RTP (UDP), but it will stabilize signaling.
Task 5: Verify DSCP markings on outbound packets
cr0x@server:~$ sudo tcpdump -ni eth0 -vv udp and portrange 10000-20000 -c 5
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:10:41.112233 IP (tos 0xb8, ttl 63, id 44211, offset 0, flags [DF], proto UDP (17), length 214) 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
12:10:41.132244 IP (tos 0xb8, ttl 63, id 44212, offset 0, flags [DF], proto UDP (17), length 214) 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
Meaning: TOS 0xb8 corresponds to DSCP EF (46). Your host is marking RTP.
Decision: Next check if the marking survives encapsulation and whether your WAN queue honors it. If it disappears on the outer packet, you need QoS at the tunnel egress, not hopes and dreams.
Task 6: Confirm DSCP on the VPN outer packet
cr0x@server:~$ sudo tcpdump -ni eth0 -vv udp port 51820 -c 5
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:12:03.220011 IP (tos 0x00, ttl 64, id 12001, offset 0, flags [DF], proto UDP (17), length 208) 203.0.113.5.51820 > 203.0.113.9.51820: UDP, length 180
12:12:03.240022 IP (tos 0x00, ttl 64, id 12002, offset 0, flags [DF], proto UDP (17), length 208) 203.0.113.5.51820 > 203.0.113.9.51820: UDP, length 180
Meaning: Outer packets are unmarked (tos 0x00). Even if inner RTP is EF, the ISP only sees outer.
Decision: Apply QoS classification on the VPN gateway: identify voice flows before encryption (or by port/peer heuristics) and set DSCP/priority on egress.
Task 7: Identify the real bottleneck and current qdisc
cr0x@server:~$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Meaning: fq_codel is active. That’s a decent baseline for latency under load.
Decision: If you see pfifo_fast or a deep vendor qdisc on the WAN edge, plan to deploy shaping + fq_codel/cake where congestion happens.
Task 8: Check qdisc stats for drops/overlimits (shaping trouble)
cr0x@server:~$ tc -s qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 98234123 bytes 84521 pkt (dropped 213, overlimits 0 requeues 12)
backlog 0b 0p requeues 12
maxpacket 1514 drop_overlimit 213 new_flow_count 541 ecn_mark 0
Meaning: Some drops occurred. Drops aren’t always bad—controlled drops can prevent massive latency. But drops + robotic audio suggests you’re dropping RTP, not bulk.
Decision: Add classification so voice gets priority (or at least isolation), and ensure shaping rate matches the actual uplink.
Task 9: Quick jitter and loss check with mtr (baseline)
cr0x@server:~$ mtr -rwc 50 203.0.113.9
Start: 2025-12-28T12:20:00+0000
HOST: server Loss% Snt Last Avg Best Wrst StDev
1. 192.0.2.1 0.0% 50 1.1 1.3 0.9 3.8 0.6
2. 198.51.100.1 0.0% 50 8.2 8.5 7.9 13.4 1.1
3. 203.0.113.9 0.0% 50 19.0 19.2 18.6 26.8 1.4
Meaning: No loss, stable latency, low jitter (StDev). Good baseline.
Decision: If you see loss at hop 1 under load, it’s your LAN/Wi‑Fi/router. If loss starts later, it’s upstream—still maybe fixable with shaping at your edge.
Task 10: See if VPN gateway CPU is causing packet scheduling delays
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.5.0 (vpn-gw) 12/28/2025 _x86_64_ (8 CPU)
12:21:01 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:21:02 all 18.20 0.00 22.40 0.10 0.00 21.70 0.00 0.00 0.00 37.60
12:21:02 0 20.00 0.00 28.00 0.00 0.00 30.00 0.00 0.00 0.00 22.00
Meaning: High softirq can indicate heavy packet processing (encryption, forwarding).
Decision: If softirq is pegged during call issues, consider enabling multiqueue, moving to faster crypto, adding CPU headroom, or reducing VPN overhead (MTU and offloads).
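If the NIC and driver support multiple queues, spreading packet processing across cores is often the cheapest headroom. A hedged example; the queue count and interface are placeholders, and not every NIC accepts the change:
cr0x@server:~$ sudo ethtool -l eth0                # show how many combined queues the NIC supports vs. currently uses
cr0x@server:~$ sudo ethtool -L eth0 combined 4     # spread crypto/forwarding softirq work across four queues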
Task 11: Inspect NIC offloads (can break captures, sometimes timing)
cr0x@server:~$ sudo ethtool -k eth0 | egrep 'tso|gso|gro'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
Meaning: Offloads are enabled. Usually fine, but can confuse packet captures and in some edge cases interact poorly with tunnels.
Decision: For accurate troubleshooting, temporarily disable GRO/LRO on a test host, then re-test. Don’t randomly disable offloads on busy gateways without a plan.
Task 12: Check UDP receive errors and drops
cr0x@server:~$ netstat -su
Udp:
128934 packets received
12 packets to unknown port received
0 packet receive errors
4311 packets sent
UdpLite:
IpExt:
InOctets: 221009331
OutOctets: 198887112
Meaning: UDP receive errors are zero. Good.
Decision: If receive errors climb during calls, you may be hitting socket buffer limits or kernel drops; tune buffers, fix CPU saturation, or reduce traffic contention.
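A hedged example of checking and raising the kernel’s socket buffer ceilings; the 2.5 MB figure is purely illustrative, and it only helps if the receiving daemon actually requests larger buffers or the defaults are raised as well:
cr0x@server:~$ sysctl net.core.rmem_max net.core.rmem_default   # current ceilings (values differ per distro)
cr0x@server:~$ sudo sysctl -w net.core.rmem_max=2500000          # example value: raise the ceiling for UDP-heavy gateways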
Task 13: Verify SIP/RTP packet rate during a call (sanity check)
cr0x@server:~$ sudo tcpdump -ni any udp portrange 10000-20000 -ttt -c 10
tcpdump: listening on any, link-type LINUX_SLL2, snapshot length 262144 bytes
0.000000 IP 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
0.019884 IP 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
0.020042 IP 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
0.019901 IP 192.0.2.10.14562 > 198.51.100.20.10012: UDP, length 186
Meaning: Inter-packet timing around 20 ms suggests ptime=20ms (common). Big gaps indicate jitter or scheduling delays.
Decision: If timing is irregular at capture point close to sender, investigate sender CPU/Wi‑Fi. If it’s regular at sender but irregular at receiver, it’s network/queueing.
Task 14: Identify whether traffic is going through the VPN or bypassing it
cr0x@server:~$ ip route get 198.51.100.20
198.51.100.20 via 10.10.0.1 dev wg0 src 10.10.0.2 uid 1000
cache
Meaning: The route to the media endpoint uses wg0.
Decision: If your “bypass VPN test” still routes through wg0, you didn’t bypass anything. Fix policy routing/split tunnel, then compare call quality.
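If the “bypass” is supposed to come from a split tunnel, the WireGuard side of that is just AllowedIPs: list only the prefixes that should ride the VPN. A hedged excerpt with placeholder prefixes; the media endpoint 198.51.100.20 is deliberately left out so test calls use the local uplink:
# /etc/wireguard/wg0.conf (excerpt)
[Peer]
PublicKey = <peer public key>
Endpoint = 203.0.113.9:51820
# Split tunnel: only corporate prefixes route via wg0.
# 198.51.100.20 (the ITSP media host) is intentionally absent, so it bypasses the tunnel.
AllowedIPs = 10.10.0.0/16
Re-run ip route get after wg-quick down/up and confirm the media endpoint now leaves via the physical interface before comparing call quality.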
Task 15: Confirm MTU on the physical WAN interface (and spot jumbo mismatch)
cr0x@server:~$ ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
Meaning: WAN interface is standard 1500 MTU.
Decision: If you’re on PPPoE or certain cellular links, WAN MTU can be smaller (1492, 1428, etc.). That pushes you to lower tunnel MTU.
Task 16: Spot bufferbloat under load with a simple ping while saturating uplink
cr0x@server:~$ ping -i 0.2 -c 20 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=18.9 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=210.4 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=245.7 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=57 time=198.1 ms
--- 1.1.1.1 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 3812ms
rtt min/avg/max/mdev = 18.4/156.2/265.1/72.9 ms
Meaning: Latency jumps massively under load: classic bufferbloat.
Decision: Deploy SQM shaping on the uplink and prioritize voice; don’t waste time chasing codecs.
Three corporate mini-stories from the trenches
Incident #1: The wrong assumption (MTU “can’t be it, we use 1500 everywhere”)
A mid-sized company moved a call center to softphones over a full-tunnel VPN. It worked in the pilot. Then they rolled it to a few hundred remote agents.
Within a day, the helpdesk queue became a second call center—except with worse audio.
The network team’s first assumption was classic: “MTU can’t be it; Ethernet is 1500, and the VPN is configured cleanly.”
They focused on the SIP provider, then blamed home Wi‑Fi, then tried changing codecs.
Calls improved randomly, which is the worst kind of improvement because it encourages superstition.
The pattern that broke the case: robotic audio spiked during certain call flows—transfers, consult calls, and when the softphone renegotiated SRTP.
That’s when signaling packets got larger, and in some scenarios the VPN path required fragmentation. ICMP “fragmentation needed” was blocked on the user edge by a “security” setting.
PMTUD black holes. Not glamorous. Very real.
The fix was boring and decisive: set a conservative tunnel MTU, clamp MSS for TCP signaling, and document “do not block all ICMP” in the remote access baseline.
They also added a one-page test: DF ping through the tunnel to a known endpoint. It caught regressions later.
Lesson: “1500 everywhere” is not a design. It’s a wish.
Incident #2: The optimization that backfired (prioritizing voice… by accelerating everything)
Another org had a capable VPN gateway and wanted “premium voice quality.” Somebody enabled hardware acceleration and fast-path features on the edge firewall.
Throughput went up. Latency in a synthetic test went down. Everyone celebrated.
Two weeks later, complaints: “Robotic audio only during big file uploads.” That detail mattered.
Under load, the fast path bypassed parts of the QoS and queue management stack. Bulk traffic and voice landed in the same deep queue on the WAN side.
The acceleration improved peak throughput, but it removed the mechanism that kept latency stable.
Engineers did what engineers do: they added more QoS rules. More classes. More match statements. It got worse.
The classification cost CPU on the slow path, while the fast path still punted the bulk of traffic into the same bottleneck queue.
Now they had complexity and still had bufferbloat.
The eventual fix was not “more QoS.” It was: shape the uplink just below real capacity, enable a modern qdisc, and keep the class model simple.
Then decide whether acceleration was compatible with that policy. Where it wasn’t, voice won.
Lesson: optimizing for throughput without respecting queue behavior is how you build a faster way to sound terrible.
Incident #3: The boring practice that saved the day (standard tests + change control)
A global company ran voice over IPsec between branches and HQ. Nothing fancy. The key difference: they treated voice like a production service.
Every network change had a pre-flight and post-flight checklist, including a handful of VoIP-relevant tests.
One Friday, an ISP swapped access gear at a regional office. Users noticed “slight robot” on calls.
The local team ran the standard tests: baseline ping idle vs under uplink load, DF pings for MTU, and a quick DSCP check on the WAN egress.
They didn’t debate. They measured.
The data showed PMTUD was broken on the new access, and the upstream buffer was deeper than before. Two problems. Both actionable.
They lowered tunnel MTU slightly, enabled MSS clamping, and adjusted shaping to keep latency stable. Calls stabilized immediately.
On Monday, they escalated to the ISP with crisp evidence: timestamps, MTU failure threshold, and latency-under-load graphs.
The ISP fixed the ICMP handling later. But the company didn’t have to wait to regain usable voice.
Lesson: the most effective reliability feature is a repeatable test you actually run.
Common mistakes: symptom → root cause → fix
1) Symptom: robotic audio during uploads or screen sharing
- Root cause: bufferbloat on upstream; voice packets stuck behind bulk traffic in a deep queue.
- Fix: enable SQM (fq_codel/cake) and shape slightly below uplink; add simple priority for RTP/SIP on WAN egress.
2) Symptom: call connects, then audio drops or becomes choppy after a minute
- Root cause: MTU/PMTUD black hole triggered by rekey, SRTP renegotiation, or SIP re-INVITE size increase.
- Fix: set tunnel MTU explicitly; allow ICMP “frag needed”/PTB (firewall sketch after this list); for TCP signaling, clamp MSS.
3) Symptom: one-way audio (you hear them, they don’t hear you)
- Root cause: NAT traversal issue or asymmetric routing; RTP pinned to wrong interface; firewall state/timeouts for UDP.
- Fix: ensure correct NAT settings (SIP ALG usually off), confirm routes, increase UDP timeout on stateful devices, validate symmetric RTP if supported.
4) Symptom: fine on wired, bad on Wi‑Fi
- Root cause: airtime contention, retries, or power-save behavior; VPN adds overhead and jitter sensitivity.
- Fix: move calls to 5 GHz/6 GHz, reduce channel contention, disable aggressive client power save for voice devices, prefer wired for call-heavy roles.
5) Symptom: only remote users on a certain ISP have issues
- Root cause: ISP upstream shaping/CGNAT behavior, poor peering, or ICMP filtering affecting PMTUD.
- Fix: reduce tunnel MTU, enforce shaping at user edge if managed, test alternative transport/port, and collect evidence to escalate.
6) Symptom: “QoS is enabled” but voice still degrades under load
- Root cause: QoS configured on the LAN while congestion is on the WAN; or DSCP marked on inner packets but not on outer tunnel packets.
- Fix: prioritize at the egress queue that is actually congested; map/classify voice to outer packets; verify with tcpdump and qdisc stats.
7) Symptom: sporadic bursts of robot, especially at peak hours
- Root cause: microbursts and queue oscillation; VPN concentrator CPU contention; or upstream congestion.
- Fix: check softirq/CPU, enable pacing/SQM, ensure adequate gateway headroom, and avoid overcomplicated class hierarchies.
8) Symptom: calls are fine, but “hold/resume” breaks or transfers fail
- Root cause: SIP signaling fragmentation or MTU issues affecting larger SIP messages; sometimes SIP over TCP/TLS affected by MSS.
- Fix: MSS clamping, reduce MTU, allow ICMP PTB, and validate SIP settings (and disable SIP ALG where it mangles packets).
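For mistake #2 above, the firewall piece is usually one rule per address family. A hedged shape of it with iptables; chain choice and placement relative to existing drops depend on your ruleset, and on a gateway you likely want the same in FORWARD:
cr0x@server:~$ sudo iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT    # IPv4 PMTUD signal
cr0x@server:~$ sudo ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT     # IPv6 equivalent (PTB)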
Checklists / step-by-step plan
Step-by-step: stabilize VoIP over VPN in a week (not a quarter)
- Pick one representative failing user and reproduce the issue on demand (upload while on a call is usually enough). No reproducibility, no progress.
- Collect voice stats (from softphone/PBX): jitter, loss, concealment, RTT if available. Decide: loss-driven (network) vs CPU-driven (endpoint) vs signaling-driven (SIP transport/MTU).
- Confirm routing: ensure media actually traverses the VPN when you think it does. Fix split tunnel confusion early.
- Measure path MTU through the tunnel using DF pings to known endpoints. Decide on a safe MTU (conservative beats theoretical).
- Set tunnel MTU explicitly at both ends (and document why). Avoid the “auto” setting unless you’ve tested it across all access types (home broadband, LTE, hotel Wi‑Fi, etc.).
- Clamp TCP MSS on the tunnel for forwarded TCP flows (SIP/TLS, provisioning, management).
- Find the real bottleneck (usually uplink). Use ping-under-load to confirm bufferbloat.
- Deploy SQM shaping at the bottleneck, slightly below line rate, with fq_codel or cake.
- Keep QoS classes simple and prioritize voice only where it matters: the egress queue.
- Verify DSCP handling with packet captures: inner marking, outer marking, and whether the queue respects it.
- Re-test the original failure case (call + upload) and confirm jitter/loss improvements.
- Roll out gradually with a canary group and a rollback plan. Voice changes are user-visible immediately; treat it like a production deploy.
Operational checklist: every time you touch VPN or WAN
- Record current MTU settings (WAN + tunnel) and qdisc/shaping policies.
- Run DF ping MTU test through tunnel to a stable endpoint.
- Run ping idle vs ping under load to measure bufferbloat regression.
- Capture 30 seconds of RTP during a test call and check for loss/jitter spikes.
- Confirm DSCP on the outer packet on the WAN side (if you rely on marking).
- Check gateway CPU softirq under load. (A script bundling these checks is sketched below.)
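A minimal script that bundles these checks, assuming wg0/eth0 and a stable inner endpoint at 10.20.30.40 as in the tasks above; it’s a sketch to adapt, not a monitoring system:
#!/usr/bin/env bash
# voice-preflight.sh — run before and after any VPN/WAN change; keep the output with the change record.
set -u
TARGET=10.20.30.40   # stable endpoint reachable through the tunnel

echo "== MTU (tunnel and WAN) =="
ip link show dev wg0 | grep -o 'mtu [0-9]*'
ip link show dev eth0 | grep -o 'mtu [0-9]*'

echo "== DF ping through the tunnel (1400 bytes on the wire) =="
ping -M do -s 1372 -c 3 "$TARGET" | tail -n 2

echo "== Idle latency baseline =="
ping -c 15 -i 0.2 "$TARGET" | tail -n 2

echo "== WAN qdisc and drop counters =="
tc -s qdisc show dev eth0

echo "Now saturate the uplink (large upload) and re-run the latency test; compare avg/max against the baseline."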
Interesting facts and historical context
- Fact 1: RTP (Real-time Transport Protocol) was standardized in the mid-1990s to carry real-time media over IP networks.
- Fact 2: SIP grew popular partly because it looked like HTTP for calls—text-based, extensible—great for features, occasionally painful for MTU.
- Fact 3: A lot of PMTUD pain comes from ICMP filtering practices that became common as a blunt security response in the early internet era.
- Fact 4: Early VoIP deployments often leaned on DiffServ markings (DSCP) inside enterprise networks, but “QoS across the public internet” never became broadly reliable.
- Fact 5: VPN adoption surged for remote work, and voice quality complaints followed because consumer uplinks are typically the narrowest, most bufferbloated segment.
- Fact 6: WireGuard became popular partly because it’s lean and fast, but “fast crypto” doesn’t cancel “bad queues.”
- Fact 7: Bufferbloat was identified and named because consumer gear shipped with overly deep buffers that improved throughput benchmarks while wrecking latency-sensitive apps.
- Fact 8: Modern Linux qdiscs like fq_codel were built specifically to keep latency bounded under load, a big deal for voice and gaming.
- Fact 9: Many enterprise VPN designs historically assumed a 1500-byte underlay; widespread PPPoE, LTE, and tunnel stacking made that assumption increasingly fragile.
FAQ
1) Why is the audio “robotic” and not just quiet or delayed?
Because you’re hearing packet loss concealment. The decoder is guessing missing audio frames. Quiet/low volume is usually gain or device issues; robotic is usually loss/jitter.
2) What packet loss percentage becomes audible for VoIP?
Depends on codec and concealment, but even ~1% loss can be noticeable, especially when it’s bursty. Stable 0.1% may be fine; 2% in bursts often isn’t.
3) Is MTU only a TCP problem? RTP is UDP.
MTU hurts UDP too. If packets exceed path MTU and PMTUD is broken, they get dropped. Also, fragmentation increases loss sensitivity and jitter when fragments compete in queues.
4) Should I just set MTU to 1200 and move on?
No. You’ll reduce efficiency and might break other protocols or paths unnecessarily. Measure path MTU, choose a safe value, and document it. Conservative, not extreme.
5) Does DSCP marking help over the public internet?
Sometimes inside an ISP domain, often not end-to-end. The reliable win is prioritizing where you control the queue: your WAN egress and managed edges.
6) Can QoS fix packet loss from a bad ISP?
QoS cannot conjure bandwidth. It can prevent self-inflicted loss/jitter by managing your own queues. If the ISP is dropping upstream, you need a better path or different provider.
7) Why does it only break when someone uploads a file?
Upload saturates the upstream. Upstream queues bloat, latency and jitter explode, and RTP arrives late. Speed tests rarely reveal this because they reward deep buffers.
8) Is WireGuard automatically better than OpenVPN for voice?
WireGuard is typically lower overhead and easier to reason about, but voice quality is mostly about MTU correctness, queue management, and stable routing. You can break voice on any VPN.
9) What’s the simplest QoS policy that actually works?
Shape the WAN slightly below real rate, then prioritize voice (RTP) above bulk. Keep the class model small. Verify with qdisc stats and real call tests.
10) How do I prove it’s MTU and not “the codec”?
Reproduce with DF ping thresholds and observe failures around specific packet sizes; correlate with call events that increase signaling size; fix MTU and see the problem disappear without changing codecs.
Practical next steps
If you want calls that sound human, treat voice like a latency SLO, not a vibe.
- Run the fast diagnosis playbook on one affected user and capture evidence: MTU, jitter under load, DSCP behavior.
- Set explicit tunnel MTU (start conservative), and clamp MSS for TCP signaling paths.
- Deploy SQM shaping at the uplink bottleneck with fq_codel/cake and prioritize voice at that queue.
- Verify with measurements (qdisc stats, tcpdump DSCP, mtr, and client jitter/loss stats), not feelings.
- Write it down: the chosen MTU, why it was chosen, and the regression tests. Future-you will be tired and unimpressed.
Most VoIP-over-VPN “mysteries” are just networks doing network things. Make the packets smaller, make the queues smarter, and make your priorities real where congestion happens.