You know the vibe: “Internet is slow.” Not down. Not obviously broken. Just… sticky. Some sites load, others hang.
SSH connects but stalls when you paste anything bigger than a tweet. The VPN “works” until you open the one app your CFO cares about.
Nine times out of ten, someone will blame DNS, Wi‑Fi, or “the cloud.” Sometimes they’re right. But MTU failures are the quiet
saboteurs: they don’t break everything, only the packets that dare to be slightly larger than your path can handle.
That’s why they masquerade as performance issues instead of clean outages.
MTU in plain English (and why it fools you)
MTU is the Maximum Transmission Unit: the biggest packet size (in bytes) a link will carry without fragmentation.
On a typical Ethernet network, that’s 1500 bytes. If you’re doing PPPoE, it’s usually 1492. If you’re living the
jumbo frames dream, it might be 9000. If you’re tunneling inside a tunnel inside an overlay, subtract overhead until
you’re back in reality.
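That subtraction is easy to script. A minimal sketch; the overhead values passed in are assumptions you should verify by measurement (VXLAN over an IPv4 underlay is commonly quoted at 50 bytes; IPsec varies with cipher and mode):

```shell
#!/bin/sh
# Subtract per-layer encapsulation overhead from an underlay MTU.
# Overhead values are inputs, not gospel: measure them on your path.
effective_mtu() {
  mtu=$1; shift
  for overhead in "$@"; do
    mtu=$((mtu - overhead))
  done
  echo "$mtu"
}

effective_mtu 1500 8       # PPPoE: prints 1492
effective_mtu 1500 50 60   # VXLAN + assumed IPsec overlay: prints 1390
```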
The trap is that TCP doesn’t send “packets.” It sends a stream. The OS chops that stream into segments, and
the network chops those into frames. When MTU is wrong, you don’t see “MTU error.” You see retries, stalls,
asymmetric failures, and timeouts that look like congestion. Sometimes you can load text but not images. Sometimes
the login page loads but the POST request hangs. Sometimes only one direction is broken because only one firewall
drops the one ICMP message you needed.
A clean MTU failure is obvious: big packets never get through. But in the wild, MTU failures often become
“PMTUD black holes,” where the network is supposed to signal the sender to reduce packet size (via ICMP “Fragmentation
Needed”), but some device blocks those ICMP messages. The sender keeps trying big packets, they keep getting dropped,
TCP keeps retransmitting, and your user keeps saying “slow.”
One idea to keep in your head while you troubleshoot, paraphrased from Richard Cook (the safety/operations researcher):
systems tend to look fine until they suddenly don’t, because success hides the complexity that makes failure possible.
MTU issues are exactly that kind of failure: everything seems okay until one payload size crosses the invisible line.
Short joke #1: MTU bugs are like “quiet quitting” packets — they show up, do the minimum, and then vanish when work gets serious.
Fast diagnosis playbook (first/second/third)
When you have 15 minutes and a pager, you need a sequence that narrows the blast radius quickly. Don’t “tune” first.
Don’t reboot first. Prove where it breaks.
First: Confirm it’s size-dependent
- Does small traffic work reliably (DNS, small HTTP responses, SSH banner) while large transfers stall?
- Does ping with DF (don’t fragment) set fail at certain sizes?
- Do you see TCP retransmissions and “stuck” connections during uploads/downloads?
Second: Identify the path segment that changes MTU
- VPN, GRE, IPsec, WireGuard, overlay networks, MPLS, PPPoE, cloud transit gateways, load balancers.
- Find where encapsulation happens. Overhead reduces effective MTU.
- Check if a firewall blocks ICMP type 3 code 4 (“fragmentation needed”). That’s a classic.
Third: Apply the least-risky mitigation
- Clamp TCP MSS at the edge (temporary band-aid, often safe).
- Set interface MTU correctly on tunnel endpoints (permanent fix if you control both sides).
- Allow the right ICMP for PMTUD (permanent fix, but coordinate with security teams).
The goal is not “perfect MTU.” The goal is “stop the bleeding without creating a new incident.”
Interesting facts & historical context
- Ethernet’s 1500-byte MTU became the de facto default early, partly due to hardware and memory tradeoffs in the 1980s and 1990s.
- IPv4 allows routers to fragment packets (unless DF is set), but fragmentation has performance and reliability costs, so modern stacks try to avoid it.
- IPv6 routers do not fragment in transit; the sender must use PMTUD. That makes ICMPv6 “Packet Too Big” messages operationally critical.
- PPPoE commonly uses 1492 MTU because of its encapsulation overhead; that “missing 8 bytes” is enough to break paths if PMTUD is blocked.
- Jumbo frames (often 9000 MTU) can reduce CPU overhead and increase throughput on trusted LANs, but a single non-jumbo hop can create bizarre partial failures.
- PMTUD black holes were widely documented in the 1990s and 2000s as firewalls began blocking ICMP without understanding what it was for.
- TCP MSS is not MTU: MSS is payload size, MTU includes headers. People confuse them and “fix” the wrong number.
- Encapsulation stacks add up fast: VXLAN/Geneve, plus IPsec, plus cloud provider headers can shave hundreds of bytes off effective MTU.
- Datacenters standardized on 1500 for a reason: interoperability beats theoretical performance. Most outages are born from “we’ll just change MTU.”
What MTU pain looks like in production
Classic symptoms
- Some websites load, others hang (often during TLS handshake or large responses).
- VPN connects, but file shares and large API calls stall.
- SSH interactive sessions work, but scp is painfully slow or freezes.
- Kubernetes nodes report Ready, but pulling some container images times out.
- HTTP works, HTTPS flaky (TLS record sizes, larger handshake messages, or different path via CDN).
- Throughput tests show high retransmits; latency looks okay.
Why it’s misleading
With MTU problems, you can have perfect latency and good small-packet performance. Monitoring that checks
“ping works” will report green. Your synthetic checks might hit a tiny endpoint and call it healthy.
Meanwhile, real users are trying to upload a PDF or fetch a container layer and getting a timeout.
The fastest way to get lost is to treat it like a generic “slow network” issue and start tuning TCP congestion control,
buffers, or QoS. MTU failures aren’t a throughput optimization problem. They’re a correctness problem.
Hands-on tasks (commands, what output means, what you decide)
These are production-friendly tasks you can run from a Linux host. Use a test host on each side of the suspected
break (client and server, or two pods, or two VMs). The point is to measure, not guess.
Task 1: Check interface MTU and spot obvious mismatches
cr0x@server:~$ ip -d link show
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
5: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: eth0 is 1500, wg0 is 1420 (common WireGuard choice). That’s not wrong by itself.
Decision: If you see a tunnel interface with MTU larger than the underlay can support, that’s suspicious. Confirm with DF ping tests next.
Task 2: Confirm PMTUD behavior with DF ping (IPv4)
cr0x@server:~$ ping -M do -s 1472 -c 3 203.0.113.10
PING 203.0.113.10 (203.0.113.10) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1492
ping: local error: message too long, mtu=1492
ping: local error: message too long, mtu=1492
--- 203.0.113.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2044ms
Meaning: Your host already knows the outgoing MTU is 1492 on that path/interface. It refused to send 1500-byte packets.
Decision: If the application expects 1500 but path is 1492, you need MSS clamping or correct MTU on the tunnel/PPPoE edge.
Task 3: Binary search the largest working packet size
cr0x@server:~$ for s in 1472 1464 1452 1440 1432; do echo "size=$s"; ping -M do -s $s -c 1 203.0.113.10 | grep transmitted; done
size=1472
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
size=1464
1 packets transmitted, 1 received, 0% packet loss, time 0ms
size=1452
1 packets transmitted, 1 received, 0% packet loss, time 0ms
size=1440
1 packets transmitted, 1 received, 0% packet loss, time 0ms
size=1432
1 packets transmitted, 1 received, 0% packet loss, time 0ms
Meaning: 1464 works, 1472 doesn’t. An ICMP payload of 1464 plus 28 bytes of headers (8 ICMP + 20 IPv4) is exactly 1492, so the effective path MTU is 1492.
Decision: Stop arguing about theory. Set MTU/MSS to match what actually works.
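If you’d rather not eyeball step sizes, a binary search converges in a handful of probes. A sketch; the target address is a placeholder, and probe() wraps the same DF ping used above:

```shell
#!/bin/sh
# Binary-search the largest DF-marked ICMP payload a path will carry.
TARGET=203.0.113.10   # placeholder: use your failing destination

probe() {
  # Success if a don't-fragment packet with payload size $1 gets through.
  ping -M do -s "$1" -c 1 -W 2 "$TARGET" >/dev/null 2>&1
}

find_max_payload() {
  lo=$1; hi=$2; best=0
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if probe "$mid"; then
      best=$mid; lo=$((mid + 1))
    else
      hi=$((mid - 1))
    fi
  done
  echo "$best"   # add 28 (ICMP + IPv4 headers) to get the path MTU
}

# Example: find_max_payload 1200 1472
```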
Task 4: Check route MTU / PMTU cache hints
cr0x@server:~$ ip route get 203.0.113.10
203.0.113.10 via 198.51.100.1 dev ppp0 src 198.51.100.20 uid 1000
cache mtu 1492
Meaning: Linux has a cached PMTU of 1492 for that destination on ppp0.
Decision: If PMTU is unexpectedly low, hunt for where encapsulation happens (PPPoE, VPN, overlay). If PMTU is high but packets still stall, ICMP may be blocked and the cache isn’t updating.
Task 5: Look for retransmits and “stalls” on a real connection
cr0x@server:~$ ss -ti dst 203.0.113.10:443
ESTAB 0 0 198.51.100.20:53124 203.0.113.10:443
cubic wscale:7,7 rto:204 rtt:38.5/4.2 ato:40 mss:1460 pmtu:1500 rcvmss:536 advmss:1460 cwnd:10 bytes_acked:1543 bytes_received:2010 segs_out:35 segs_in:33 retrans:7/18
Meaning: Retransmits are non-zero and climbing. pmtu:1500 may be wrong if there’s a black hole. rcvmss:536 hints at path weirdness too.
Decision: If retransmits spike during bulk transfer while small traffic is fine, suspect MTU/PMTUD. Move to packet capture or MSS clamp.
Task 6: Capture ICMP “fragmentation needed” (or absence of it)
cr0x@server:~$ sudo tcpdump -ni any '(icmp and (icmp[0]=3 and icmp[1]=4)) or (icmp6 and ip6[40]=2)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
Meaning: For IPv4 you’re watching ICMP type 3 code 4. For IPv6 you’re watching ICMPv6 type 2 (Packet Too Big).
Decision: If you reproduce the issue and see no ICMP PTB/frag-needed messages anywhere, a firewall or middlebox may be dropping them. If you do see them, PMTUD is functioning and you should focus on incorrect local MTU/MSS instead.
Task 7: Reproduce with curl and watch for hangs at “upload completely sent off”
cr0x@server:~$ curl -v --max-time 10 -o /dev/null https://203.0.113.10/large-object
* Trying 203.0.113.10:443...
* Connected to 203.0.113.10 (203.0.113.10) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
> GET /large-object HTTP/1.1
> Host: 203.0.113.10
> User-Agent: curl/8.5.0
> Accept: */*
* Operation timed out after 10001 milliseconds with 0 bytes received
* Closing connection 0
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received
Meaning: TCP connect and TLS handshake can succeed, then the request/response path stalls on larger packets or specific record sizes.
Decision: Confirm with DF ping and then mitigate (MSS clamp or correct MTU). Don’t waste an hour “optimizing TLS.”
Task 8: Measure actual throughput and loss with iperf3
cr0x@server:~$ iperf3 -c 203.0.113.10 -t 10 -i 1
Connecting to host 203.0.113.10, port 5201
[ 5] local 198.51.100.20 port 45436 connected to 203.0.113.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 3.12 MBytes 26.2 Mbits/sec 28 61.5 KBytes
[ 5] 1.00-2.00 sec 2.50 MBytes 21.0 Mbits/sec 31 52.3 KBytes
[ 5] 2.00-3.00 sec 1.75 MBytes 14.7 Mbits/sec 44 43.1 KBytes
[ 5] 3.00-4.00 sec 1.12 MBytes 9.40 Mbits/sec 56 32.8 KBytes
[ 5] 4.00-5.00 sec 0.88 MBytes 7.38 Mbits/sec 61 26.9 KBytes
[ 5] 0.00-10.00 sec 15.2 MBytes 12.8 Mbits/sec 472 sender
[ 5] 0.00-10.00 sec 14.5 MBytes 12.1 Mbits/sec receiver
Meaning: Retransmits are high and congestion window collapses. That’s not normal for a stable path.
Decision: If retransmits correlate with larger MSS/MTU, clamp MSS and re-test. If the path is truly lossy, MTU tuning won’t save you.
Task 9: Check NIC offloads that can confuse packet captures
cr0x@server:~$ ethtool -k eth0 | egrep 'tso|gso|gro|lro'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
Meaning: Offloads are enabled. That’s normal, but it can make tcpdump show “giant” packets that aren’t actually sent on the wire.
Decision: Don’t “fix” MTU based on a misleading capture. If you need clean packet sizes for debugging, temporarily disable offloads on a test host.
Task 10: Temporarily disable offloads on a test host (for clarity)
cr0x@server:~$ sudo ethtool -K eth0 tso off gso off gro off
Meaning: The kernel will no longer coalesce/segment in ways that confuse your capture tooling.
Decision: Do this only on a test box or during a controlled window. Offloads can be performance-critical on busy hosts.
Task 11: Verify IPv6 PMTUD (ICMPv6 Packet Too Big) isn’t blocked
cr0x@server:~$ ping6 -c 2 -s 1452 -M do 2001:db8::10
PING 2001:db8::10(2001:db8::10) 1452 data bytes
ping: local error: message too long, mtu=1280
ping: local error: message too long, mtu=1280
--- 2001:db8::10 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1023ms
Meaning: Something on the path (or local) is enforcing a 1280 MTU (the IPv6 minimum). That’s a real thing, but it’s also a performance killer if unexpected.
Decision: If IPv6 path MTU collapses unexpectedly, you need to find the tunnel/segment causing it. Don’t just disable IPv6 unless you enjoy recurring incidents.
Task 12: Clamp TCP MSS with iptables (fast mitigation)
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Meaning: For forwarded TCP connections, SYN packets will be rewritten so peers negotiate an MSS that fits the discovered PMTU.
Decision: Use this at edges (VPN gateways, routers, firewalls) when you can’t quickly fix the underlying ICMP blocking or MTU mismatch. Then schedule the real fix.
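When the PMTU cache itself is poisoned by a black hole, clamping to an explicit value you measured is often more predictable than --clamp-mss-to-pmtu. The sketch below converts the Task 3 measurement into an MSS; the commented iptables line uses the real --set-mss option, and 1452 is just the worked example:

```shell
#!/bin/sh
# Convert a measured max DF ping payload (Task 3) into a TCP MSS clamp value.
# payload + 28 (ICMP + IPv4 headers) = path MTU; path MTU - 40 (IP + TCP) = MSS.
payload_to_mss() {
  echo $(( $1 + 28 - 40 ))
}

payload_to_mss 1464   # prints 1452
# Then clamp explicitly at the edge:
#   sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
#     -j TCPMSS --set-mss 1452
```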
Task 13: Clamp TCP MSS with nftables (modern systems)
cr0x@server:~$ sudo nft add table inet mangle
cr0x@server:~$ sudo nft 'add chain inet mangle forward { type filter hook forward priority mangle; policy accept; }'
cr0x@server:~$ sudo nft add rule inet mangle forward tcp flags syn tcp option maxseg size set rt mtu
Meaning: Similar idea: adjust MSS based on route MTU. Syntax varies by distro/kernel; test carefully.
Decision: Prefer a known-good, reviewed rule set. Don’t freestyle nftables in production unless you also enjoy freestyle outages.
Task 14: Set MTU on an interface (permanent when correct)
cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
cr0x@server:~$ ip link show wg0 | head -n 1
5: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1380 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
Meaning: You’re shrinking the tunnel MTU to avoid exceeding underlay limits after overhead.
Decision: If you control both ends, this is often the cleanest fix. Validate with DF pings across the tunnel and a real app transfer.
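Note that ip link set does not survive a reboot. Persist the value in whatever manages the interface; for example, with systemd-networkd (file path and interface name here are illustrative):

```ini
# /etc/systemd/network/50-wg0.network (illustrative path and name)
[Match]
Name=wg0

[Link]
MTUBytes=1380
```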
Task 15: Check kernel counters for fragmentation and reassembly stress
cr0x@server:~$ netstat -s | egrep -i 'fragment|reassembl'
0 fragments received ok
0 fragments created
12 packets reassembled ok
3 packet reassembly failures
Meaning: Reassembly failures can show path fragmentation trouble or packet loss interacting with fragments.
Decision: Fragmentation is a smell. Avoid it by setting correct MTU/MSS. If you see reassembly failures during the incident window, MTU is a strong suspect.
Task 16: Validate MTU end-to-end from inside Kubernetes (overlay reality)
cr0x@server:~$ kubectl run -it --rm netdebug --image=nicolaka/netshoot -- sh
/ # ip link show eth0 | head -n 1
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
/ # ping -M do -s 1422 -c 1 10.96.0.1
PING 10.96.0.1 (10.96.0.1) 1422(1450) bytes of data.
1430 bytes from 10.96.0.1: icmp_seq=1 ttl=64 time=0.632 ms
--- 10.96.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
Meaning: Pod MTU is 1450 (common in VXLAN setups), so your “normal 1500” assumption is already false inside the cluster. Note the debug image: busybox’s ping doesn’t support -M, so use one that ships iputils.
Decision: If pods talk to external services through tunnels or gateways, account for additional overhead. Fix at CNI config or gateway MSS clamp.
Fix patterns that work (and why)
Pattern 1: Make PMTUD work again (the adult solution)
PMTUD exists so endpoints can adapt. When it works, you get maximum packet sizes without manual tuning.
When it’s broken, people start hardcoding MTUs all over the place like it’s 1997.
The typical break is ICMP filtering. For IPv4, you want ICMP type 3 code 4 to be allowed back to the sender.
For IPv6, ICMPv6 Packet Too Big is mandatory for basic operation. Blocking it is not “security.” It’s self-harm.
What to do: allow the specific ICMP messages, rate-limit sensibly, and log drops. If your security tooling forces
blanket ICMP drops, negotiate an exception with evidence: packet captures, retransmits, and a clear before/after test.
Pattern 2: Clamp TCP MSS at boundaries (the pragmatic fix)
MSS clamping rewrites the TCP SYN so both sides agree on a smaller segment size. It’s a workaround for paths where PMTUD
is unreliable, often due to middleboxes you can’t control (some partner network, a carrier, an enterprise firewall you don’t own).
It’s also your fastest “15-minute fix” because you can apply it at the edge without touching every host.
Downsides: it only helps TCP, not UDP. It can also hide the real issue long enough for it to come back during the next topology change.
Pattern 3: Set correct MTU on tunnels and overlays (the structural fix)
Tunnels reduce effective MTU because they add headers. If your underlay is 1500 and your tunnel adds 60 bytes of overhead,
your tunnel MTU must be 1440 or lower if you want to avoid fragmentation/black holes.
The right value depends on encapsulation type and options (IPsec ESP overhead varies; WireGuard has a typical overhead; VXLAN adds its own).
Don’t memorize numbers. Measure with DF ping through the tunnel, then pick a value with a little safety margin.
Pattern 4: Stop treating jumbo frames like a personality trait
Jumbo frames can be great on a controlled L2 domain: storage networks, HPC, certain east-west traffic patterns.
But jumbo frames across mixed infrastructure is where dreams go to become Change Advisory Board meetings.
If you enable MTU 9000 on some switches and forget one hypervisor vSwitch or one firewall interface, you’ve created
a failure mode that will look like “intermittent slowness” forever. Either make jumbo frames universal on that segment,
or don’t do it at all.
Short joke #2: If you enable jumbo frames without an end-to-end plan, you didn’t “optimize the network,” you just invented a new hobby for your on-call rotation.
Three corporate mini-stories (anonymized, plausible, technically accurate)
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a hybrid setup: office users on a managed ISP circuit, workloads in a public cloud.
They rolled out a new “secure web gateway” appliance. The rollout plan was simple: route corporate traffic through the appliance,
verify web browsing works, then expand.
On day one, the helpdesk got a weird mix of tickets. Some employees said “everything is slow.”
Others said “only uploads are broken.” A few reported that Slack was fine, but the internal CRM (behind a VPN)
kept timing out during login. Network graphs looked normal. CPU on the gateway was normal. The appliance vendor
suggested bumping TCP buffers. Classic.
The wrong assumption was subtle: the team assumed the appliance was a plain L3/L4 forwarder, and thus “MTU stays 1500.”
In reality, the appliance established an IPsec tunnel to a cloud POP for inspection. That added overhead. The appliance
still accepted 1500-byte packets on the LAN side but could not forward them without fragmentation on the tunnel side.
And the tunnel path had ICMP frag-needed filtered by a provider edge.
The fix was not heroic. They reproduced the hang with DF pings, saw the effective MTU was roughly 1410 across the new path,
and applied MSS clamping on the gateway for outbound TCP. Immediately, CRM logins stopped timing out. Then they scheduled the
real fix: adjust tunnel MTU and coordinate ICMP allowances with the provider.
The lesson: “Ethernet is 1500” is not a fact; it’s a default. As soon as you tunnel, encapsulate, or traverse managed security gear,
MTU becomes a design parameter, not a footnote.
Mini-story 2: The optimization that backfired
A data platform team wanted higher replication throughput between two datacenters. They were pushing large objects and seeing CPU pressure
on hosts during peak. Someone proposed jumbo frames: raise MTU to 9000 on the storage VLAN, reduce per-packet overhead, win.
They did the change on the ToR switches and the storage servers. Initial tests on a couple of hosts looked great. Throughput improved.
CPU fell. They celebrated quietly, because loud celebrations anger the networking gods.
Two weeks later, a different team moved a hypervisor cluster into that same VLAN. Their vSwitch port group MTU remained 1500.
Nothing exploded immediately. Instead, they got a mess: some VMs could mount storage but would intermittently freeze during large reads.
Backups started timing out. A few NFS clients would hang on directory listings that included large metadata blobs.
Because many small packets still worked, people chased phantom issues: “NFS tuning,” “ZFS arc,” “ESXi bug,” “maybe the storage array is overloaded.”
It took a blunt DF ping test from a VM to reveal the truth: the VM could not pass packets larger than 1500, but storage servers were happily emitting jumbo frames.
Somewhere in the middle, fragmentation or drops were happening inconsistently, depending on path and offload behavior.
The fix was boring: either make jumbo frames truly end-to-end (switches, hypervisors, vSwitches, NICs, storage endpoints),
or keep that VLAN at 1500 and accept slightly higher CPU. They chose a split: a dedicated, isolated jumbo storage segment for hosts that could support it,
and a separate standard-MTU segment for the mixed environment.
Mini-story 3: The boring but correct practice that saved the day
A fintech ran dozens of site-to-site VPNs for partners. They’d been burned before by PMTUD black holes, so they had a policy:
every new tunnel must ship with a documented effective MTU test and an explicit MSS clamp rule at the VPN edge, validated by an application transfer.
Not “we think it’s fine.” Evidence.
One Friday, a major partner changed their firewall model. No notice. Suddenly, settlement file transfers started failing intermittently.
The partner insisted nothing changed “in the VPN.” Their monitoring said the tunnel was up. Everyone was technically correct and operationally useless.
The fintech team had two advantages. First, they already had MSS clamping at their edge, so most TCP traffic kept working.
Second, they had a standard runbook: DF ping tests both directions, tcpdump for ICMP, and a quick check of negotiated MSS values on a test TCP session.
Within an hour, they had proof that ICMP frag-needed was no longer returning from the partner side.
They didn’t need to redesign the network mid-incident. The workaround was already in place. They simply tightened the MSS clamp slightly
(based on measured effective MTU), restored file transfer reliability, and sent the partner a short packet-capture-backed report.
The partner later fixed their firewall policy to allow the right ICMP, and the fintech rolled back the extra conservatism.
The lesson: boring controls—documented MTU tests, default MSS clamping on VPN edges, and captured “known good” behavior—turn mystery outages into routine maintenance.
Common mistakes: symptoms → root cause → fix
1) “Some sites load, others hang” → PMTUD black hole → allow ICMP / clamp MSS
Symptom: Web browsing is inconsistent; large pages hang; small ones work.
Root cause: Path MTU is lower than assumed, and ICMP frag-needed / packet-too-big is blocked.
Fix: Allow the specific ICMP messages on firewalls; as a fast mitigation, clamp TCP MSS at the egress gateway.
2) “VPN is up but file transfers stall” → tunnel overhead not accounted for → reduce tunnel MTU
Symptom: Small pings and RDP/SSH work; SMB/NFS/HTTPS uploads freeze.
Root cause: IPsec/WireGuard/GRE overhead lowers effective MTU; endpoints still try near-1500 packets.
Fix: Set tunnel interface MTU to a measured safe value; clamp MSS on VPN gateway.
3) “Kubernetes image pulls time out” → overlay MTU too high → fix CNI MTU and node MTU consistency
Symptom: Pods can resolve DNS and hit small endpoints, but pulling large layers fails.
Root cause: CNI overlay MTU mismatched with underlay, or mixed MTU across nodes.
Fix: Configure CNI MTU correctly (e.g., 1450/1440 depending on encapsulation); ensure nodes and vSwitches match.
4) “Jumbo frames enabled, random freezes” → not end-to-end jumbo → either make it universal or revert
Symptom: Only some hosts/VMs have problems; often during large transfers; monitoring shows “up.”
Root cause: One hop is still 1500 (vSwitch, firewall, NIC, intermediate switch).
Fix: Validate MTU on every hop; if you can’t guarantee it, keep the segment at 1500.
5) “Packet captures show huge frames” → offload artifacts → disable offloads for debugging
Symptom: tcpdump shows giant packets that shouldn’t exist; conclusions get weird fast.
Root cause: TSO/GSO/GRO make captures look like oversized packets at the host boundary.
Fix: Temporarily disable offloads on a test host, or capture on a switch/mirror port instead.
6) “UDP app is broken; TCP fine after MSS clamp” → clamp only helps TCP → tune app or lower MTU
Symptom: VoIP, gaming, or custom UDP protocol still fails while web apps recover.
Root cause: MSS clamping doesn’t apply to UDP; UDP datagrams may exceed path MTU and get dropped.
Fix: Reduce UDP payload sizes in the app, enable fragmentation carefully (rarely ideal), or lower MTU on the interface/tunnel.
7) “IPv6 is flaky; IPv4 fine” → ICMPv6 PTB blocked → allow ICMPv6 properly
Symptom: Dual-stack hosts show intermittent hangs over IPv6; IPv4 works.
Root cause: ICMPv6 Packet Too Big blocked; IPv6 can’t rely on router fragmentation.
Fix: Permit required ICMPv6 types; validate with ping6 DF tests and tcpdump.
Checklists / step-by-step plan
15-minute “stop the bleeding” plan
- Prove size-dependence: run DF pings with increasing sizes to the failing destination (Task 2–3).
- Check interface MTUs: underlay and tunnel interfaces (Task 1).
- Check route PMTU: ip route get for a hint (Task 4).
- Confirm retransmits: ss -ti on an active flow (Task 5) and optionally iperf3 (Task 8).
- Capture ICMP PTB/frag-needed: confirm whether PMTUD messages exist or vanish (Task 6).
- Mitigate fast: clamp MSS at the edge (Task 12 or 13).
- Verify with real traffic: curl a large object, scp a file, pull a container image (Task 7 + your workload).
- Write down the measured working MTU: don’t rely on memory; future-you is already tired.
Permanent fix plan (the one that prevents recurrence)
- Map encapsulation: list every tunnel/overlay segment and its overhead (VPN, VXLAN, Geneve, GRE, IP-in-IP).
- Set MTU where it belongs: tunnel interfaces and CNIs to values that fit the underlay with margin (Task 14).
- Restore PMTUD signals: adjust firewall policies to allow ICMP type 3/4 (IPv4) and ICMPv6 PTB (IPv6).
- Standardize edge MSS clamping: keep it as defense-in-depth for partner networks you don’t control.
- Add a synthetic check that transfers real bytes: not just ping; something that fails when MTU fails.
- Document the “known good” MTU per segment: including who owns it and what change process applies.
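The synthetic check mentioned above can be as simple as the sketch below; the URL and byte threshold are placeholders, and the object should be comfortably larger than any plausible MTU so the check actually exercises full-size packets:

```shell
#!/bin/sh
# Synthetic check that moves real bytes instead of pinging.
# Fails if the transfer stalls (curl deadline) or comes up short.
URL="https://example.internal/mtu-canary-blob"   # placeholder object
MIN_BYTES=100000                                 # placeholder threshold

check_bulk() {
  got=$(curl -s --max-time 15 -o /dev/null -w '%{size_download}' "$1") || return 1
  [ "${got%.*}" -ge "$2" ]   # strip any decimal part older curl versions emit
}

# check_bulk "$URL" "$MIN_BYTES" && echo OK || echo "FAIL: bulk transfer stalled"
```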
What to avoid during an incident
- Don’t randomly “increase MTU for performance” to fix slowness. That’s how you turn a partial outage into a full outage.
- Don’t disable ICMP broadly. If you must filter, filter with intent and allow PMTUD-critical messages.
- Don’t treat one successful ping as proof the path is healthy. MTU bugs love your simplistic monitoring.
FAQ
1) What’s the difference between MTU and MSS?
MTU is the maximum IP packet size on a link. MSS (Maximum Segment Size) is the maximum TCP payload size.
MSS is roughly MTU minus IP+TCP headers. Fixing MTU fixes everything; clamping MSS only fixes TCP.
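As a worked sketch of that arithmetic (assuming no IP or TCP options, which would subtract more):

```shell
#!/bin/sh
# MSS = MTU minus network-layer header minus TCP header (no options assumed).
mss_v4() { echo $(( $1 - 20 - 20 )); }   # IPv4 header 20 + TCP header 20
mss_v6() { echo $(( $1 - 40 - 20 )); }   # IPv6 header 40 + TCP header 20

mss_v4 1500   # prints 1460 (the classic Ethernet MSS)
mss_v6 1500   # prints 1440
```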
2) Why does MTU mismatch look like “slow internet” instead of an outage?
Because only packets over a certain size fail. Small requests, ACKs, DNS, and some page elements work.
Large responses, uploads, or specific protocol messages hang and retry. Humans interpret “retries” as “slow.”
3) Why is blocking ICMP a problem? Isn’t ICMP unsafe?
Blanket blocking is the problem. PMTUD relies on ICMP errors to inform senders about MTU limits.
For IPv6, ICMPv6 Packet Too Big is essential. You can rate-limit and filter specific types, but
“no ICMP” breaks real traffic in non-obvious ways.
4) How do I pick the right MTU for a VPN tunnel?
Measure. Start with your underlay MTU (often 1500), subtract encapsulation overhead, then validate with DF pings across the tunnel.
Add a safety margin if the path includes unknown provider gear. If you can’t measure reliably, clamp MSS at the edge.
5) Does MSS clamping reduce performance?
It can slightly increase CPU and packet rate because you send more, smaller segments. But it usually improves real throughput
versus a black hole situation where you’re retransmitting large segments repeatedly. Think “less elegant, more functional.”
6) Can MTU issues affect only one direction?
Yes. One side might be able to send ICMP PTB back, the other might have it blocked. Or only one direction traverses a tunnel.
Asymmetry is common in corporate networks with multiple egress paths and “helpful” security appliances.
7) Why does IPv6 sometimes fail while IPv4 works?
IPv6 relies on PMTUD because routers don’t fragment packets in transit. If ICMPv6 PTB is blocked, IPv6 flows can hang
on larger packets. IPv4 might limp along due to fragmentation in some paths, masking the issue.
8) Are jumbo frames worth it?
They’re worth it on tightly controlled segments where every hop is verified (storage networks, specific east-west domains).
They’re not worth it across heterogeneous infrastructure or partner networks. If you can’t guarantee end-to-end MTU, stay at 1500.
9) How do I test MTU without breaking production?
Use DF ping tests to a known target, run them off-peak, and don’t change MTU on busy interfaces casually.
For packet captures, disable offloads only on a test host. Prefer edge MSS clamp as a reversible mitigation.
10) What’s the quickest “tell” that this is MTU and not congestion?
DF ping failing at a consistent size threshold, plus TCP retransmits during large transfers while small requests succeed.
Congestion doesn’t usually create a sharp size cliff. MTU issues do.
Conclusion: next steps you can do today
MTU incidents don’t announce themselves. They cosplay as “the network is slow” and waste your afternoon.
The fix is rarely complicated, but it is specific: measure the real working packet size, identify where the path shrank,
and either restore PMTUD or clamp MSS until you can.
Next steps that pay off immediately:
- Add a runbook step: DF ping size test + ss -ti retransmit check before anyone touches “performance tuning.”
- Standardize MSS clamping on VPN/edge gateways as a safety net for partner networks.
- Audit firewall rules for ICMP frag-needed (IPv4) and Packet Too Big (IPv6). Allow them with sane rate limits and logging.
- Document MTU per segment (underlay, tunnels, overlays). Treat it like an SLO dependency, not trivia.
Do those, and the next time someone says “internet is slow,” you’ll have a concrete answer in minutes—rather than a vague feeling and a growing coffee habit.