Debian 13 MTU/MSS Mismatch: Why Large Files Stall and How to Fix It Cleanly

You can ping it. You can curl small pages. DNS is fine. SSH logs in instantly. And then you try to pull a 4 GB image, push a backup,
rsync a VM, or upload a database dump—and the transfer slows to a crawl or flat-out stalls. The progress meter sits there like it’s
waiting for permission from Legal.

On Debian 13, this often isn’t a “Debian problem” at all. It’s the network doing what you told it to do, plus one tiny mismatch you didn’t
realize you told it. MTU and MSS mismatches create failure modes that are surgical: small packets work, big ones die. This is how you
end up with a network that looks healthy in dashboards but behaves like it’s allergic to large files.

What’s actually going on (without fairy tales)

MTU is the Maximum Transmission Unit: the largest IP packet size a link can carry without fragmentation. Ethernet is usually 1500 bytes.
PPPoE often forces 1492. Tunnels (WireGuard, VXLAN, GRE, IPsec) subtract overhead and reduce the effective MTU further.

MSS is the Maximum Segment Size: the largest TCP payload size (not counting IP/TCP headers) that a host will put into one TCP segment.
MSS is negotiated during the TCP handshake via options in SYN and SYN-ACK. If your path MTU is 1500, then your TCP MSS will typically be
1460 (1500 minus a 20-byte IPv4 header minus a 20-byte TCP header). Over IPv6, MSS is typically 1440, because the base header is 40 bytes.
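
For quick reference, here is that arithmetic spelled out, assuming only the base headers (TCP options such as timestamps shave a few more bytes off each segment):

  Path MTU 1500 (plain Ethernet):    IPv4 MSS = 1500 - 20 - 20 = 1460    IPv6 MSS = 1500 - 40 - 20 = 1440
  Path MTU 1492 (PPPoE):             IPv4 MSS = 1492 - 40 = 1452         IPv6 MSS = 1492 - 60 = 1432
  Path MTU 1420 (typical WireGuard): IPv4 MSS = 1420 - 40 = 1380         IPv6 MSS = 1420 - 60 = 1360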

The stall pattern: why “small works, large fails”

Here’s the core pattern:

  • Small requests and interactive sessions keep packets under the real path MTU. They work.
  • Bulk transfers fill the pipe, send full-sized TCP segments, and eventually hit a packet that exceeds the real path MTU. Those packets get dropped.
  • If the sender never learns the correct MTU (or ignores it), it keeps retransmitting the same too-large segments. Now you have a “black hole.”
  • TCP throughput collapses. Sometimes it looks like the transfer “hangs,” sometimes like it runs at 3 KB/s, sometimes it times out.

The mechanism that should save you is Path MTU Discovery (PMTUD). In IPv4, the sender sets the DF (Don’t Fragment) bit. When a router can’t forward
a packet because it’s too big, it should send back an ICMP “Fragmentation Needed” message with the next-hop MTU. The sender reduces its packet size
and retries.
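
When PMTUD works, the sender caches the discovered value per destination, and you can see it. A quick check using the article's example server address (the mtu field only appears once something has been learned or explicitly set, and the exact fields vary by kernel):

cr0x@server:~$ ip route get 203.0.113.20
203.0.113.20 via 192.0.2.1 dev enp3s0 src 192.0.2.10 uid 1000
    cache expires 595sec mtu 1492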

In a perfect world, PMTUD is boring and invisible. In the real world, ICMP gets filtered (“for security”) by firewalls that were last reviewed when
people still faxed change requests. Now the sender never receives the “fragmentation needed” hint, and you get a PMTUD black hole.

Debian 13 isn’t special here—but modern defaults can make this show up more often:
more VPNs, more overlays, more container networking, more IPv6, more cloud edges, more “helpful” middleboxes. You’re layering headers like it’s
a lasagna, and then acting surprised that a 1500-byte assumption stops fitting.

MTU vs MSS: which one is “wrong”?

MTU mismatch is a link/path property. MSS mismatch is an end-host TCP behavior that can be adjusted to accommodate the path.
If the path MTU is smaller than you think, you can fix the path (ideal) or clamp MSS so TCP never emits packets too big for the path (pragmatic).

If you operate the whole network, fix MTU at the source: make the path consistent, set correct MTUs on tunnels, ensure PMTUD works.
If you only control one side (common in corporate reality), MSS clamping at your edge is often the cleanest “we can ship today” fix.

One paraphrased idea from Werner Vogels (reliability/operations): “Everything fails; design so failures are routine and recover automatically.”
MTU issues are exactly that kind of failure—routine, predictable, and solvable with guardrails.

Joke #1: PMTUD is like office gossip—if someone blocks ICMP, nobody learns the important news and everyone keeps making the same bad decision.

Facts and history that make the behavior make sense

A handful of concrete facts explains why this problem keeps recurring, even among experienced teams:

  1. Ethernet’s 1500-byte MTU became a de facto standard because it balanced efficiency and hardware limits in early LAN design; it’s not sacred, just common.
  2. PPPoE’s classic MTU is 1492 because it adds 8 bytes of overhead on top of Ethernet framing, shrinking what IP can carry.
  3. IPv6 routers do not fragment packets in transit. If you exceed the path MTU, you rely on ICMPv6 “Packet Too Big” to fix it, or you black-hole.
  4. IPv4 fragmentation exists but is widely avoided because fragmented traffic increases loss amplification and creates security/monitoring headaches.
  5. PMTUD black holes were common enough that TCP implementations added heuristics like Packetization Layer PMTUD (PLPMTUD), trying to infer MTU without relying on ICMP.
  6. MSS clamping was popularized by edge devices in ISP/VPN scenarios because it works even when ICMP is filtered; it’s a workaround that became a standard move.
  7. Jumbo frames (9000 MTU) can boost throughput on storage networks, but only if every hop supports it—one 1500-MTU hop turns it into a silent failure factory.
  8. VXLAN and similar overlays add ~50 bytes of overhead (often more depending on underlay), meaning a 1500-MTU underlay implies ~1450-ish usable MTU in the overlay.
  9. WireGuard’s effective MTU depends on transport and route; “1420” is a common default-ish value because it avoids fragmentation on typical Internet paths, but it’s not universal.

Those aren’t trivia. They are the reasons your “it worked last year” network starts failing after a VPN rollout, cloud migration, or firewall refresh.

Fast diagnosis playbook (first/second/third)

The goal is to answer one question quickly: Are we dropping big packets because the real path MTU is smaller than the sender thinks?
Here’s the order that saves the most time in production.

First: confirm the symptom is size-related, not generic loss

  • Try a small download and a large download from the same endpoint.
  • Use a DF ping (IPv4) or a large ping (IPv6) to find the largest working size.
  • Watch retransmits and “black hole” behavior with ss and tcpdump.

Second: identify where MTU changes (tunnels, overlays, PPPoE, cloud edges)

  • Check interface MTUs: physical NIC, VLAN, bridge, VPN, container veth, tunnel.
  • Check routes for MTU hints.
  • Confirm the actual encapsulation in use (WireGuard? IPsec? VXLAN? GRE?).

Third: decide on the cleanest fix boundary

  • If you can fix the path, do that (consistent MTU end-to-end; allow PMTUD ICMP messages).
  • If you can’t, clamp TCP MSS at the edge closest to the sender for affected flows.
  • Verify with a repeatable test and capture proof (before/after).

This playbook is intentionally boring. Boring is good. Boring means you stop cargo-culting “set MTU 1400 everywhere” and start making one correct change.

Hands-on tasks: commands, expected output, and decisions

Below are practical tasks you can run on Debian 13 (or any modern Debian-like) while you’re on the incident bridge. Each task includes:
a command, what the output means, and what decision you make next.

Task 1: Check interface MTUs quickly

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
enp3s0           UP             3c:ec:ef:12:34:56 <BROADCAST,MULTICAST,UP,LOWER_UP>
wg0              UNKNOWN        9a:bc:de:f0:12:34 <POINTOPOINT,NOARP,UP,LOWER_UP>
br0              UP             02:42:ac:11:00:01 <BROADCAST,MULTICAST,UP,LOWER_UP>
cr0x@server:~$ ip link show dev enp3s0 | grep -i mtu
mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000

Meaning: You’re looking for suspicious differences: physical NIC at 1500, but a tunnel/overlay expecting 1420; or a bridge at 1500 while
a member interface is smaller.

Decision: If any “upper” interface (bridge/tunnel) has an MTU larger than what the encapsulated path can carry, you’ve found a prime suspect.

Task 2: Inspect routes for MTU hints

cr0x@server:~$ ip route get 1.1.1.1
1.1.1.1 via 192.0.2.1 dev enp3s0 src 192.0.2.10 uid 1000
    cache

Meaning: Linux stores per-route MTU values (including PMTU learned via ICMP), but ip route get only shows an mtu field once one has been learned or set. It's still useful for confirming which interface and gateway the traffic actually takes.

Decision: If traffic exits via a tunnel interface or a VRF you didn’t expect, stop. Diagnose that path first.

Task 3: Confirm the stall is size-dependent with curl

cr0x@server:~$ curl -o /dev/null -s -w "time=%{time_total} size=%{size_download}\n" https://repo.example.net/small.bin
time=0.18 size=1048576
cr0x@server:~$ curl -o /dev/null -v https://repo.example.net/large.iso
* Connected to repo.example.net (203.0.113.20) port 443
> GET /large.iso HTTP/1.1
> Host: repo.example.net
...
< HTTP/1.1 200 OK
...
  0 4096M    0 1024k    0     0   890k      0  1:18:33  0:00:01  1:18:32  0:00:00
  0 4096M    0 1024k    0     0      0      0 --:--:--  0:00:10 --:--:--     0

Meaning: The large transfer starts, then throughput collapses. That’s classic “some packets pass, then black-hole.”

Decision: Move to PMTUD testing and packet captures. Don’t waste time on DNS, CA certificates, or “maybe the server is slow.”

Task 4: IPv4 PMTUD test using DF ping (find max payload)

cr0x@server:~$ ping -M do -s 1472 -c 2 203.0.113.20
PING 203.0.113.20 (203.0.113.20) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1492
ping: local error: message too long, mtu=1492

--- 203.0.113.20 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1007ms

Meaning: Your host already knows (or an interface enforces) an MTU of 1492 somewhere on the egress path. A 1472-byte payload implies a 1500-byte packet; it fails locally, and the error message hints at 1492.

Decision: Re-test with -s 1464 (1492-28) and see if it passes. If it does, you’re in PPPoE or tunnel territory.

Task 5: Binary search the path MTU (IPv4)

cr0x@server:~$ ping -M do -s 1464 -c 2 203.0.113.20
PING 203.0.113.20 (203.0.113.20) 1464(1492) bytes of data.
1472 bytes from 203.0.113.20: icmp_seq=1 ttl=51 time=22.1 ms
1472 bytes from 203.0.113.20: icmp_seq=2 ttl=51 time=22.0 ms

--- 203.0.113.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms

Meaning: A 1492-byte packet makes it end to end. Combined with the failure at 1500 in Task 4, the IPv4 path MTU on this route is 1492.

Decision: If your sending hosts are using MTU 1500 and not adapting, clamp MSS or adjust MTU to match this reality.

Task 6: IPv6 PMTU sanity test

cr0x@server:~$ ping -6 -s 1452 -c 2 2001:db8:100::20
PING 2001:db8:100::20(2001:db8:100::20) 1452 data bytes
From 2001:db8:200::1 icmp_seq=1 Packet too big: mtu=1420
From 2001:db8:200::1 icmp_seq=2 Packet too big: mtu=1420

--- 2001:db8:100::20 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1002ms

Meaning: IPv6 is telling you the path MTU (1420). That’s the best-case scenario: ICMPv6 is not blocked and PMTUD is working.

Decision: Align tunnel/interface MTU and/or clamp MSS for IPv6 too. Don’t “fix IPv4 only” and call it done.

Task 7: Observe TCP retransmits on the client during the stall

cr0x@server:~$ ss -ti dst 203.0.113.20:443
ESTAB 0 0 192.0.2.10:52144 203.0.113.20:https
	 cubic wscale:7,7 rto:1000 rtt:24.1/4.2 ato:40 mss:1460 pmtu:1500 rcvmss:536 advmss:1460 cwnd:10 bytes_sent:1048576 bytes_acked:1048576 bytes_retrans:65536 retrans:12

Meaning: Retransmits are climbing. MSS is 1460 and PMTU shows 1500, but your earlier tests suggested smaller. That mismatch is the story.

Decision: Prove whether ICMP “frag needed” is missing, or whether a tunnel MTU is mis-set. Move to tcpdump.

Task 8: Capture PMTUD-related ICMP while reproducing

cr0x@server:~$ sudo tcpdump -ni enp3s0 '(icmp and (icmp[0]=3 and icmp[1]=4)) or (tcp and host 203.0.113.20 and port 443)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp3s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:01:10.112233 IP 192.0.2.10.52144 > 203.0.113.20.443: Flags [P.], seq 1:1449, ack 1, win 501, length 1448
12:01:10.212244 IP 192.0.2.10.52144 > 203.0.113.20.443: Flags [P.], seq 1449:2897, ack 1, win 501, length 1448
12:01:11.312255 IP 192.0.2.10.52144 > 203.0.113.20.443: Flags [P.], seq 1449:2897, ack 1, win 501, length 1448
12:01:12.412266 IP 192.0.2.10.52144 > 203.0.113.20.443: Flags [P.], seq 1449:2897, ack 1, win 501, length 1448

Meaning: You see repeated retransmission of the same segment, but no ICMP type 3/code 4 coming back. That’s a PMTUD black hole signature.

Decision: Either allow the needed ICMP back, or clamp MSS so you never send too-big segments in the first place.

Task 9: Check if a local firewall is blocking PMTUD ICMP

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
	chain input {
		type filter hook input priority filter; policy drop;
		ct state established,related accept
		iif "lo" accept
		tcp dport { 22 } accept
	}
	chain forward {
		type filter hook forward priority filter; policy drop;
	}
	chain output {
		type filter hook output priority filter; policy accept;
	}
}

Meaning: Input policy drops by default, but accepts established,related. ICMP errors are usually “related” to an existing flow;
that’s good, but not guaranteed depending on conntrack settings and ICMP type.

Decision: If you see explicit drops for ICMP or ICMPv6, fix that first. PMTUD depends on those messages.

Task 10: Confirm tunnel overhead and MTU on WireGuard

cr0x@server:~$ ip link show dev wg0 | grep -i mtu
mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
cr0x@server:~$ sudo wg show wg0
interface: wg0
  public key: 3rY...redacted
  listening port: 51820

peer: q8S...redacted
  endpoint: 198.51.100.77:51820
  allowed ips: 10.10.0.0/16
  latest handshake: 1 minute, 2 seconds ago
  transfer: 1.32 GiB received, 2.01 GiB sent

Meaning: wg0 is already at 1420. But if containers, namespaces, or downstream hosts sit behind 1500-byte interfaces and their traffic is forwarded into wg0, they can still emit packets too large for the tunnel and end up depending on PMTUD (or fragmentation) to cope.

Decision: Ensure the sending interface for tunneled traffic has an MTU ≤ wg0 MTU (or clamp MSS for traffic entering wg0).
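
If adjusting every inner interface is impractical, a narrower option is to clamp MSS only for TCP traffic forwarded out through the tunnel. A minimal nftables sketch, assuming this host forwards for others and reusing the wg0 name above (the table and chain names are illustrative; fold the rule into your existing ruleset if you already have a mangle-priority forward chain):

cr0x@server:~$ sudo nft add table inet wgclamp
cr0x@server:~$ sudo nft 'add chain inet wgclamp forward { type filter hook forward priority mangle; policy accept; }'
cr0x@server:~$ sudo nft 'add rule inet wgclamp forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu'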

Task 11: Check bridge/veth MTUs for container hosts

cr0x@server:~$ ip -d link show br0 | sed -n '1,25p'
7: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:01 brd ff:ff:ff:ff:ff:ff
    bridge forward_delay 1500 hello_time 200 max_age 2000 stp_state 0 priority 32768
cr0x@server:~$ ip -br link | grep veth | head
veth9f2a2c3       UP             9a:3c:1d:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP>
veth6a1b7d8       UP             4e:55:66:77:88:99 <BROADCAST,MULTICAST,UP,LOWER_UP>

Meaning: br0 is 1500. If overlay encapsulation leaves only ~1450 bytes of usable MTU on top of a 1500-byte underlay, your pods/containers can generate 1500-byte packets that later get dropped.

Decision: In overlay environments, set pod/bridge MTU appropriately (or clamp MSS at node egress).
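
As a rough sketch of the interface-level fix, assuming ~50 bytes of overlay overhead and that nothing on br0 still needs 1500 (for CNI-managed bridges, prefer the CNI's own MTU setting so the value survives pod churn):

cr0x@server:~$ sudo ip link set dev br0 mtu 1450
cr0x@server:~$ ip link show dev br0 | grep -i mtu
mtu 1450 qdisc noqueue state UP mode DEFAULT group default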

Task 12: Verify negotiated MSS in an actual connection (tcpdump SYN options)

cr0x@server:~$ sudo tcpdump -ni enp3s0 -c 6 'tcp port 443 and host 203.0.113.20 and tcp[tcpflags] & tcp-syn != 0'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp3s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:05:01.000111 IP 192.0.2.10.52210 > 203.0.113.20.443: Flags [S], seq 111111111, win 64240, options [mss 1460,sackOK,TS val 123 ecr 0,nop,wscale 7], length 0
12:05:01.020222 IP 203.0.113.20.443 > 192.0.2.10.52210: Flags [S.], seq 222222222, ack 111111112, win 65160, options [mss 1440,sackOK,TS val 456 ecr 123,nop,wscale 8], length 0

Meaning: Your host offered MSS 1460 (sized for a 1500-byte MTU). The server replied with 1440. If the true path MTU is smaller, the negotiated MSS should shrink to match: a 1492-byte path wants an IPv4 MSS of 1452, and a 1420-byte path wants 1380. Neither endpoint learns that on its own.

Decision: If you can’t trust PMTUD, clamp MSS to a safe value based on measured path MTU (and do it for both v4 and v6 where needed).

Task 13: Check kernel PMTU/MTU probing behavior

cr0x@server:~$ sysctl net.ipv4.tcp_mtu_probing net.ipv4.ip_no_pmtu_disc
net.ipv4.tcp_mtu_probing = 0
net.ipv4.ip_no_pmtu_disc = 0

Meaning: tcp_mtu_probing=0 means the kernel won’t actively probe smaller MTUs when black-holed; ip_no_pmtu_disc=0 means PMTUD is enabled.

Decision: Don’t “fix” this by disabling PMTUD globally. Consider enabling MTU probing selectively if you can’t fix the network quickly, but treat it as a fallback.
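
If you do reach for it as a stopgap, a minimal sketch (value 1 enables probing only after a black hole is suspected, value 2 probes always; the drop-in file name is just a convention):

cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_mtu_probing = 1
cr0x@server:~$ echo 'net.ipv4.tcp_mtu_probing = 1' | sudo tee /etc/sysctl.d/99-tcp-mtu-probing.conf
net.ipv4.tcp_mtu_probing = 1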

Task 14: Validate that ICMP is permitted through your firewall (IPv4 + IPv6)

cr0x@server:~$ sudo nft list chain inet filter input
table inet filter {
	chain input {
		type filter hook input priority filter; policy drop;
		ct state established,related accept
		iif "lo" accept
		ip protocol icmp accept
		ip6 nexthdr ipv6-icmp accept
		tcp dport 22 accept
	}
}

Meaning: This is the “stop being clever” rule set: allow ICMP and ICMPv6 in, at least for errors and PMTU messages, ideally with sane limits.

Decision: If ICMP isn’t allowed, fix that. If you can’t (corporate firewall upstream), MSS clamping becomes your reliable workaround.
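
If a blanket accept is too broad for your taste, a narrower sketch that still keeps PMTUD alive (rules appended to the existing inet filter input chain; the rate limits are illustrative, not tuned):

cr0x@server:~$ sudo nft add rule inet filter input icmp type destination-unreachable limit rate 50/second accept
cr0x@server:~$ sudo nft add rule inet filter input icmpv6 type packet-too-big limit rate 50/second accept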

Task 15: Implement MSS clamping with nftables (IPv4/IPv6) and verify counters

cr0x@server:~$ sudo nft add table inet mssclamp
cr0x@server:~$ sudo nft 'add chain inet mssclamp forward { type filter hook forward priority mangle; policy accept; }'
cr0x@server:~$ sudo nft add rule inet mssclamp forward tcp flags syn tcp option maxseg size set rt mtu
cr0x@server:~$ sudo nft add rule inet mssclamp forward ip6 nexthdr tcp tcp flags syn tcp option maxseg size set rt mtu
cr0x@server:~$ sudo nft -a list chain inet mssclamp forward
table inet mssclamp {
	chain forward {
		type filter hook forward priority mangle; policy accept;
		tcp flags syn tcp option maxseg size set rt mtu # handle 2
		ip6 nexthdr tcp tcp flags syn tcp option maxseg size set rt mtu # handle 3
	}
}

Meaning: This clamps MSS on SYN packets to the route MTU automatically. It’s usually cleaner than hardcoding a number, because it adapts per-route.

Decision: Use route-based MSS clamping on gateways/routers. On single hosts, you may clamp in output if the host itself initiates flows.
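
For the single-host case, a hedged sketch that reuses the mssclamp table from above (clamping your own outgoing SYN mainly limits what the peer sends back to you, so treat it as a partial measure and verify with a capture):

cr0x@server:~$ sudo nft 'add chain inet mssclamp output { type filter hook output priority mangle; policy accept; }'
cr0x@server:~$ sudo nft 'add rule inet mssclamp output tcp flags syn tcp option maxseg size set rt mtu'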

Task 16: If you must hardcode, clamp to a known safe MSS

cr0x@server:~$ sudo nft add rule inet mssclamp forward ip protocol tcp tcp flags syn tcp option maxseg size set 1360
cr0x@server:~$ sudo nft add rule inet mssclamp forward ip6 nexthdr tcp tcp flags syn tcp option maxseg size set 1360

Meaning: MSS 1360 is often safe for “Internet + tunnels + unknowns” environments, but it’s a blunt instrument. It trades performance for reliability.

Decision: Prefer route-based clamping. Hardcode only when you can’t get consistent per-route MTU info.

Task 17: Make MTU changes persistent on Debian (systemd-networkd example)

cr0x@server:~$ sudo sed -n '1,120p' /etc/systemd/network/10-enp3s0.network
[Match]
Name=enp3s0

[Network]
DHCP=yes

[Link]
MTUBytes=1492
cr0x@server:~$ sudo systemctl restart systemd-networkd
cr0x@server:~$ ip link show dev enp3s0 | grep -i mtu
mtu 1492 qdisc mq state UP mode DEFAULT group default qlen 1000

Meaning: You’ve aligned the host MTU to the real path requirement (common with PPPoE or a known tunnel).

Decision: If only certain destinations are affected (some paths, some not), don’t lower global MTU on all interfaces. Use per-route MTU or MSS clamping at the edge.
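
A per-route MTU looks like this at runtime (the destination prefix is illustrative; to persist it with systemd-networkd, add a [Route] section with Destination=, Gateway= and MTUBytes= to the matching .network file):

cr0x@server:~$ sudo ip route replace 203.0.113.0/24 via 192.0.2.1 dev enp3s0 mtu 1420
cr0x@server:~$ ip route show 203.0.113.0/24
203.0.113.0/24 via 192.0.2.1 dev enp3s0 mtu 1420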

Task 18: Validate the fix with a repeatable transfer test

cr0x@server:~$ curl -o /dev/null -s -w "time=%{time_total} speed=%{speed_download}\n" https://repo.example.net/large.iso
time=43.22 speed=101234567

Meaning: The transfer completes and the speed is stable. The stall symptom is gone.

Decision: Capture a short tcpdump “after” sample and commit the change with a clear incident note: what failed, what you changed, how you proved it.

Joke #2: MTU problems are the only kind of networking issue where “just make it smaller” is both wrong advice and, annoyingly, sometimes the right fix.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A team rolled out a new site-to-site VPN between two offices. The new appliance “supported jumbo frames,” and someone decided that since the LAN was
already using 9000-byte MTU for storage traffic, the VPN should match it. Consistency, right?

The cutover went smoothly for the first hour. Monitoring showed clean pings and normal latency. SSH was fine. People celebrated quietly because no one
trusts celebrating loudly. Then the nightly backup started, and the job froze at a few percent. The backup software logged retries and “connection reset”
messages. The storage team swore it wasn’t them. The network team swore the tunnel was up. Both were correct and still wrong.

The problem: the VPN interface was set to 9000, but the ISP uplink in the middle wasn’t. PMTUD should have handled it, except the upstream firewall
filtered ICMP fragmentation-needed messages. So the sender kept transmitting massive packets that vanished silently. Small control-plane traffic worked.
Large data-plane traffic died.

The fix was almost insulting: set the VPN MTU to a value that fit the real path (or clamp MSS on the tunnel ingress), and permit the right ICMP types.
The root cause wasn’t “VPN is flaky.” It was “we assumed the Internet respects our LAN MTU.” The Internet does not care about your assumptions.

What changed culturally was more important than the technical fix: they added an MTU/PMTU validation step to every tunnel rollout, and they stopped treating
ICMP as optional noise. No heroics. Just fewer 2 a.m. meetings.

Mini-story 2: The optimization that backfired

Another company had a high-throughput internal service used for moving artifacts between build agents and a central cache. Someone noticed that the NICs,
switches, and hypervisors all supported jumbo frames. The service was throughput-sensitive, so they enabled MTU 9000 on the servers. It benchmarked well
in the same rack. Everyone nodded.

Weeks later, a new set of build agents in a different DC started reporting intermittent timeouts while uploading large artifacts. The painful part:
it failed only for “big” builds. Small builds passed. The team chased CPU steal time, TLS settings, and load balancer tuning. They upgraded a few kernels.
They even changed the artifact compression level. The failures stayed.

The underlying network path between DCs had a single segment that remained at 1500 MTU. The jumbo-enabled servers happily sent large segments,
which got fragmented or dropped depending on the hop. Some flows survived due to different routing. Some died due to ECMP selecting a path with a stricter hop.
The new build agents were “unlucky” more often.

They fixed it by rolling back jumbo frames on the artifact service and standardizing MTU across the inter-DC path before re-enabling it.
Jumbo frames weren’t inherently bad; they were bad as an “optimization” applied to only part of the topology. The performance win was real,
but the reliability cost was higher than anyone priced in.

The postmortem takeaway was blunt: do not partially deploy MTU changes. Either you commit to an end-to-end design, or you keep it at 1500 and spend your
performance budget elsewhere.

Mini-story 3: The boring but correct practice that saved the day

A financial-services org had a habit that looked like bureaucracy: whenever a network change involved tunnels, overlays, or WAN circuits,
they required a pre-flight test checklist. It included one line item that engineers loved to hate:
“Verify PMTUD or apply MSS clamp; attach evidence.”

During a migration to a new remote-access platform, an engineer ran the checklist and discovered that large DF pings failed across a certain provider path.
Small packets were fine. The provider’s edge filtered certain ICMP types (because of course it did), but only on one region. The engineer added MSS clamping
on the VPN concentrator for that region and wrote down the before/after captures. Migration continued.

Two weeks later, a different team reported “large uploads are hanging” from that region. The on-call pulled up the migration notes, saw the MSS clamp and the
test evidence, and immediately ruled out MTU black-holing as the new cause. They focused on an application regression instead and fixed it quickly.

The boring practice didn’t just prevent an outage. It prevented a misdiagnosis during a later incident, which is almost as valuable. Good notes are an
operational force multiplier.

Fix it cleanly: preferred solutions in order

You can “fix” MTU/MSS mismatch in a dozen ways. Only a few are clean. Here’s the order I recommend in production, with a strong bias for
changes you can explain at 3 a.m. and reverse at noon.

1) Fix the path MTU consistency (best, hardest)

If you own the network end-to-end, standardize MTU across the path. For overlays, increase the underlay MTU so the overlay can remain 1500.
Example: if VXLAN adds ~50 bytes, set the underlay MTU to 1550+ so the overlay can still run 1500 without fragmentation.
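
On a host that participates in the overlay, the underlay side is one command, assuming the NIC, driver, and every switch port in between actually accept frames that large (persist it the same way as in Task 17, with MTUBytes= in the .network file):

cr0x@server:~$ sudo ip link set dev enp3s0 mtu 1550
cr0x@server:~$ ip link show dev enp3s0 | grep -i mtu
mtu 1550 qdisc mq state UP mode DEFAULT group default qlen 1000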

This is the “do it right” fix. It’s also the one that requires coordination across teams and vendors, which is why it’s not the only fix.

2) Ensure PMTUD works (allow the right ICMP)

PMTUD is not optional if you expect modern networks to behave. Allow ICMP type 3/code 4 in IPv4 (“fragmentation needed”).
Allow ICMPv6 “packet too big” and other necessary control messages in IPv6.

The key is not “allow all ICMP forever.” The key is “don’t break the protocol and then act surprised.”
If you must rate-limit, do it carefully and test.

3) Clamp TCP MSS at the edge (best pragmatic fix)

MSS clamping adjusts the TCP MSS value during SYN so endpoints agree to smaller segments that will fit the path MTU.
It works even when ICMP is filtered. It doesn’t require host changes. It’s reversible. It’s easy to scope.

Use route-based MSS clamping where possible (as shown in the nftables task). Hardcoding MSS is acceptable when the environment is messy,
but treat it as a temporary bandage.

4) Lower host MTU only when the host truly lives on that constrained path

Lowering MTU on a host interface is clean when the host’s primary egress really is constrained (classic PPPoE at 1492).
It’s not clean when only some destinations are constrained. Then you’re penalizing everything to fix one path.

5) Enable TCP MTU probing as a tactical workaround, not a strategy

Enabling net.ipv4.tcp_mtu_probing can allow Linux to recover from black holes by probing smaller segments.
It can help if you don’t control the firewall in the middle and you need relief quickly.

But don’t confuse “kernel guessed around the problem” with “network is healthy.” If you can clamp MSS or fix ICMP, do that.

Common mistakes: symptom → root cause → fix

These are the patterns I keep seeing in real orgs. The fixes are specific because vague advice is how outages get promoted to “mysteries.”

1) Large HTTPS downloads hang, but small pages load

Symptom: Browser loads the site; downloading an ISO stalls at a few MB.

Root cause: PMTUD black hole: ICMP fragmentation-needed blocked somewhere; sender keeps using MSS/MTU too large.

Fix: Allow ICMP type 3/code 4 (IPv4) and ICMPv6 Packet Too Big; or clamp MSS on the gateway for TCP/443.

2) SSH works, scp/rsync stalls or crawls

Symptom: Interactive SSH fine; file copy gets stuck or runs extremely slowly.

Root cause: Data packets hit path MTU limit; retransmits explode; interactive packets stay small and escape.

Fix: Confirm with DF ping; clamp MSS for TCP/22 on the path; validate retransmits drop after fix.

3) Kubernetes: some pods can pull images, others time out

Symptom: Node A fine; Node B fails pulling large layers; restarts “fix” it sometimes.

Root cause: Mixed MTU in CNI overlay/underlay; ECMP selects different paths with different effective MTUs.

Fix: Align CNI MTU with underlay; or clamp MSS at node egress; standardize MTU across nodes.

4) “We enabled jumbo frames and now inter-DC replication is flaky”

Symptom: Replication sessions reset, large transfers fail intermittently.

Root cause: Jumbo frames enabled only on some segments; one 1500 hop causes fragmentation/drop.

Fix: Roll back to 1500 until the entire path supports jumbo; then reintroduce with end-to-end validation.

5) IPv6 is slower or fails only for large transfers

Symptom: Dual-stack host prefers IPv6; large transfers stall while IPv4 works.

Root cause: ICMPv6 Packet Too Big blocked; IPv6 routers don’t fragment; black-hole occurs.

Fix: Permit ICMPv6 properly; clamp MSS for IPv6; verify with ping -6 and tcpdump.

6) “We set MTU 1400 everywhere” and performance tanked

Symptom: Everything works, but throughput dropped and CPU went up.

Root cause: Over-correction: you reduced MTU far below what the path needs, increasing overhead and interrupt rate.

Fix: Measure actual PMTU; set MTU/MSS to the minimum required, not an arbitrary superstition value.

Checklists / step-by-step plan

Checklist A: When a large transfer stalls right now

  1. Reproduce with one command (curl or scp) and record the timestamp.
  2. Run DF ping (IPv4) or large ping (IPv6) to the same destination; find the largest working payload.
  3. Check ss -ti for retransmits and MSS/PMTU values during the stall.
  4. Run a short tcpdump looking for repeated retransmits and missing ICMP “too big” messages.
  5. If ICMP is blocked and you control the edge: apply MSS clamping (route-based if possible).
  6. Re-test the transfer; capture “after” evidence; commit the config change persistently.

Checklist B: Before rolling out a tunnel/overlay

  1. Calculate overhead (tunnel + encapsulation + possible VLAN tags) and decide target MTU; see the worked example after this checklist.
  2. Set underlay MTU to support overlay MTU (preferred), or set overlay MTU lower (acceptable).
  3. Confirm ICMP/ICMPv6 required for PMTUD is allowed across security boundaries.
  4. Validate with DF ping and a real bulk transfer test across the tunnel.
  5. Document the chosen MTU, where it is enforced, and what evidence you captured.
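
A worked example for item 1, assuming VXLAN over IPv4 on a 1500-byte underlay (swap in your own encapsulation and tag count):

  Underlay MTU:                       1500
  - outer IPv4 header                   20
  - UDP header                           8
  - VXLAN header                         8
  - inner Ethernet header               14
  = usable MTU inside the overlay     1450   (subtract 4 more if the underlay can't carry an extra VLAN tag)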

Checklist C: The “clean fix” selection guide

  • You own the full path: standardize MTU end-to-end; keep overlay at 1500 if possible; keep PMTUD working.
  • You own only your edge: route-based MSS clamping + verify PMTU; optionally adjust local MTU for known constrained links.
  • You own only one host: adjust MTU on the relevant interface if that host always uses the constrained path; otherwise consider local MSS clamp in output.

FAQ

1) Why do pings succeed while large TCP transfers fail?

Because typical pings are small (56 bytes payload by default). They never exceed the path MTU. Large TCP transfers use near-MTU-sized segments,
which trigger drops when the true path MTU is smaller.

2) Is this really MTU/MSS, or could it be generic packet loss?

Generic loss usually hurts everything: interactive sessions, small downloads, API calls. MTU black holes are selective: small is fine, large stalls.
Prove it with DF ping size tests and tcpdump showing repeated retransmits without ICMP “too big” responses.

3) What’s the difference between fixing MTU and clamping MSS?

Fixing MTU makes the network path capable of carrying the packets you send, consistently. Clamping MSS tells TCP endpoints to send smaller payloads
so packets fit the existing path. MTU fix is architectural; MSS clamp is operationally pragmatic.

4) Should I just set MTU to 1400 everywhere?

No. It may hide the symptom, but it’s a performance tax and it can break internal paths that actually support 1500 or jumbo frames.
Measure the real PMTU and make the smallest change that restores correctness.

5) Does Debian 13 change anything about MTU behavior?

Debian 13 uses modern kernels and tooling (systemd-networkd, nftables, contemporary TCP behaviors). The problem is not Debian-specific.
But Debian 13 makes it easier to implement clean fixes (nftables MSS clamp, consistent network config) if you use the native stack properly.

6) Is ICMP safe to allow through firewalls?

“Allow all ICMP” is lazy. “Block all ICMP” is worse. You should allow essential ICMP/ICMPv6 types for error reporting and PMTUD,
ideally with rate limits. If you block them, you break core Internet behavior and then pay for it in outages.

7) What MSS value should I clamp to?

Prefer route-based MSS clamping (set rt mtu) so it adapts. If you must hardcode, compute from path MTU:
IPv4 MSS ≈ MTU-40, IPv6 MSS ≈ MTU-60 (depending on options). When uncertain, choose a conservative value and then refine with measurement.

8) Why does it fail only for some destinations?

Different destinations can traverse different paths (ECMP, different peers, different VPN exit points), each with a different bottleneck MTU or ICMP filtering behavior.
That’s why a single host can have “works to A, fails to B” symptoms.

9) Can middleboxes rewrite MSS already?

Yes. Some firewalls and load balancers clamp MSS automatically. That can mask the issue—or make it weirder if only some paths clamp.
Treat MSS as something that can change in transit, and verify with tcpdump SYN option captures.

10) What if this is UDP (like some storage or streaming traffic)?

MSS clamping is TCP-only. For UDP-heavy workloads, you need correct MTU sizing and/or application-level fragmentation control.
In practice, fix the path MTU or configure the application to use smaller datagrams.

Next steps you can do today

If you’re seeing large transfers stall on Debian 13, treat it like a real network bug, not a ghost story.
Do three things, in this order:

  1. Prove the path MTU. Run DF ping tests (IPv4) and/or IPv6 “packet too big” tests. Capture the largest working size.
  2. Capture evidence. A 30-second tcpdump showing repeated retransmits and missing ICMP is worth a thousand Slack arguments.
  3. Apply one clean fix. Prefer: fix MTU consistency and allow PMTUD ICMP. If you can’t, clamp MSS at the edge with nftables.

After the fix, rerun the same bulk transfer test, then save the outputs and captures with the change record. The difference between
“we fixed it” and “we got lucky” is documentation and repeatability. Also, future-you will be tired.
