Docker Network MTU Issues: Why Large Requests Fail and How to Fix MTU/MSS

Everything “works” until it doesn’t. Small pings succeed, health checks are green, login pages load, and then a
real request shows up: a 2 MB JSON POST, a TLS handshake with a fat certificate chain, a gRPC stream, a container
pulling a layer. Suddenly: hangs, timeouts, mysterious retries, or that classic corporate Slack message: “Is the
network slow?”

If this only happens in Docker (or “only from containers,” or “only across the VPN,” or “only when traffic hits the
overlay”), you’re likely staring at an MTU/MSS problem. MTU issues are the kind of failure that makes smart people
doubt reality. That’s because the symptoms are selective, the observability is poor by default, and the fixes are
deceptively easy to get wrong.

MTU, MSS, fragmentation, PMTUD: what actually breaks

MTU in one sentence

MTU (Maximum Transmission Unit) is the largest IP packet size (in bytes) that can traverse a link without being
fragmented at that hop.

And MSS is the piece people forget

TCP MSS (Maximum Segment Size) is the largest TCP payload a host will put into a single TCP segment; it’s derived
from the path’s MTU minus IP and TCP headers. In Ethernet/IP/TCP with no options, a 1500-byte MTU commonly yields an
MSS of 1460 bytes (1500 – 20 – 20).

MTU is about packets on the wire. MSS is about how big TCP chunks its payload. If you clamp MSS down, you can avoid
fragmentation entirely by ensuring packets are small enough to fit the smallest MTU on the path.
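
The arithmetic is worth keeping in your head, because it comes back in every fix below. A minimal sketch, assuming IPv4 and TCP with no options (options shrink the usable payload further):

cr0x@server:~$ echo $(( 1500 - 20 - 20 ))   # IP header (20) + TCP header (20) off a plain Ethernet MTU
1460
cr0x@server:~$ echo $(( 1420 - 20 - 20 ))   # same math behind a 1420-MTU tunnel
1380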

Why small requests work but big ones fail

With an MTU mismatch, you can get “selective failure”:

  • Small payloads fit into small packets; they pass.
  • Large payloads produce packets that exceed some hop’s MTU; they need fragmentation or smaller MSS.
  • If fragmentation is blocked or Path MTU Discovery (PMTUD) is broken, those large packets blackhole.

Fragmentation, DF, and the blackhole pattern

IPv4 can fragment packets in transit. But modern stacks try hard not to. They set the DF (Don’t Fragment) bit and
rely on PMTUD: the sender transmits DF packets and expects the network to tell it the maximum workable MTU by sending
back ICMP “Fragmentation needed” messages when a packet is too large.

When those ICMP messages are filtered (by security groups, firewalls, “helpful” network appliances, or misconfigured
policy), PMTUD fails silently. The sender keeps retransmitting packets that will never fit. To the application, this
looks like a hang: the TCP connection is established, maybe even some data moves, and then everything stalls.
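
For a quick first read on where a path narrows, tracepath (from iputils; assumed installed, and blind to the same ICMP filtering that breaks PMTUD) sends DF probes and prints the pmtu it learns along the way. Here 10.20.30.40 stands in for your real destination, as in the tasks later on:

cr0x@server:~$ tracepath -n 10.20.30.40

Watch for the reported pmtu dropping below 1500 partway along; if the output is mostly "no reply", that silence is itself a clue about filtered ICMP.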

One line worth keeping on a sticky note, borrowed from SRE lore: “Hope is not a strategy.” Treat MTU as a
design parameter, not a wish.

Docker makes it easier to create MTU mismatches

Docker adds layers: bridges, veth pairs, NAT, and sometimes overlays (VXLAN) or tunnels (WireGuard, IPsec, VPN
client). Each encapsulation layer eats bytes. Eat enough bytes and your “1500” becomes “1450” or “1412” or “1376”
in practice. If only some nodes or some routes have that overhead, congratulations: you have a partial blackhole.

Joke #1: MTU bugs are like office printers — they wait until you’re late, then they develop “character.”

Where Docker hides MTU problems

The usual topology

On a typical Linux host with Docker’s default bridge network:

  • A container connects via a veth pair to docker0 (a Linux bridge).
  • The host routes/NATs traffic out through some physical NIC like eth0 or ens5.
  • From there it might traverse VLANs, VPC fabrics, VPNs, proxies, and other magical corporate inventions.

The container sees an interface MTU (often 1500). The bridge sees an MTU (often 1500). The host NIC might be 1500.
But if the real path includes encapsulation (e.g., VPN adds ~60–80 bytes, VXLAN adds ~50 bytes, GRE adds ~24 bytes,
plus possible extra headers), the effective MTU is smaller. If nobody tells the TCP stacks, large segments will be
too big.
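
The back-of-the-envelope math looks like this (a sketch; the overhead figures are typical values, not guarantees, and they stack if you run a tunnel inside a tunnel):

cr0x@server:~$ echo $(( 1500 - 50 ))   # VXLAN on a 1500 underlay
1450
cr0x@server:~$ echo $(( 1500 - 80 ))   # VPN/IPsec at the high end of the range above
1420
cr0x@server:~$ echo $(( 1500 - 24 ))   # plain GRE
1476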

Overlay networks make it spicier

Docker Swarm overlay and many CNI plugins rely on VXLAN or similar encapsulation to build L2-ish semantics across
L3 networks. VXLAN typically adds 50 bytes of overhead (outer Ethernet + outer IP + UDP + VXLAN headers; exact
overhead depends on environment). If the underlay is 1500, the overlay MTU should be around 1450.

Problems appear when:

  • Some nodes set overlay MTU to 1450, others leave it at 1500.
  • The underlay isn’t really 1500 (cloud fabrics vary; VPNs vary more).
  • ICMP is filtered between nodes, breaking PMTUD.

Why you feel it as “TLS is flaky” or “POSTs hang”

TLS can be an accidental MTU tester because certificates and handshake flights can be larger than you think. So can
gRPC metadata, HTTP/2 frames, and “small” JSON that’s actually huge once someone dumps a base64 blob into it.

A classic symptom: TCP handshake completes, small response headers flow, then the connection stalls when the first
big segment with DF set hits a too-small MTU hop. Retransmissions happen. Eventually you get a timeout.

Interesting facts and a little history

  • Fact 1: Ethernet’s 1500-byte MTU became a de facto standard largely from early design tradeoffs, not because 1500 is sacred.
  • Fact 2: Path MTU Discovery was introduced because fragmentation is expensive and fragile; it shifts the work to endpoints.
  • Fact 3: PMTUD relies on ICMP. Blocking all ICMP is like removing road signs and then blaming drivers for getting lost.
  • Fact 4: VXLAN encapsulation overhead commonly pushes overlay MTU to ~1450 on a 1500 underlay; if you keep 1500, you’re betting on fragmentation or jumbo frames.
  • Fact 5: Jumbo frames (9000 MTU) can reduce CPU overhead in some workloads, but a single 1500-only hop ruins your day unless you clamp or segment correctly.
  • Fact 6: IPv6 routers do not fragment in transit. If you break PMTUD in IPv6, you don’t get “sometimes.” You get “doesn’t work.”
  • Fact 7: TCP MSS is announced per direction in the SYN/SYN-ACK rather than negotiated to a common value; middleboxes can (and for tunnels often should) rewrite it on the fly.
  • Fact 8: Many “random” API timeouts blamed on app servers are actually retransmission storms from blackholed PMTUD.
  • Fact 9: Docker’s default bridge doesn’t track your uplink’s MTU; bridges and veth pairs get their MTU when they’re created, so changing the host MTU later doesn’t retrofit existing containers.

Fast diagnosis playbook

This is the order that finds the culprit fastest in production, with the fewest rabbit holes.

1) Confirm it’s size-dependent and path-dependent

  • From a container, run a small request and a large one to the same destination.
  • If small works and large hangs/timeouts, move to MTU suspicion immediately.

2) Find the effective path MTU with DF pings

  • Use ping -M do -s to binary search the maximum payload that succeeds.
  • Compare results from host vs container. Differences matter.

3) Inspect MTU values on all relevant interfaces

  • Host physical NIC, docker bridge, veth, overlays, VPN/tunnel interface.
  • If any tunnel interface is 1420 and containers think they have 1500, you likely found it.

4) Validate PMTUD is not being blocked

  • Check firewall rules for ICMP type 3 code 4 (IPv4) and ICMPv6 Packet Too Big (type 2).
  • Capture traffic: are ICMP errors returning? Are there repeated retransmissions?

5) Choose your fix strategy

  • Best: set correct MTU end-to-end (underlay, overlay, containers).
  • Pragmatic: clamp TCP MSS at egress/ingress of a tunnel or bridge.
  • Avoid: “just lower it everywhere” without understanding; you’ll mask the issue and may kneecap throughput.

Practical tasks: commands, expected output, and decisions

These are production-grade tasks: what you run, what you hope to see, what it means when you don’t, and what you do
next. Run them on a Linux Docker host unless stated otherwise.

Task 1: Reproduce from inside a container with small vs large payload

cr0x@server:~$ docker run --rm curlimages/curl:8.6.0 curl -sS -o /dev/null -w "%{http_code}\n" http://10.20.30.40:8080/health
200

Meaning: basic connectivity works.
Decision: now test a large request that triggers big TCP segments.

cr0x@server:~$ docker run --rm curlimages/curl:8.6.0 sh -lc 'dd if=/dev/zero bs=1k count=2048 2>/dev/null | curl -sS -m 10 -o /dev/null -w "%{http_code}\n" -X POST --data-binary @- http://10.20.30.40:8080/upload'
curl: (28) Operation timed out after 10000 milliseconds with 0 bytes received

Meaning: size-dependent failure pattern is present.
Decision: stop blaming the application until you prove packets can traverse the path.

Task 2: Compare host vs container behavior

cr0x@server:~$ dd if=/dev/zero bs=1k count=2048 2>/dev/null | curl -sS -m 10 -o /dev/null -w "%{http_code}\n" -X POST --data-binary @- http://10.20.30.40:8080/upload
200

Meaning: the host path works but container path fails.
Decision: focus on Docker bridge/veth/iptables and MTU inheritance rather than the upstream network only.

Task 3: Check MTU on host NIC and Docker bridge

cr0x@server:~$ ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:aa:bb:bb brd ff:ff:ff:ff:ff:ff
cr0x@server:~$ ip link show dev docker0
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:1c:11:22:22 brd ff:ff:ff:ff:ff:ff

Meaning: both are 1500, which is “normal” but not necessarily correct for your real path.
Decision: identify any tunnels/overlays. If present, 1500 may be a lie.

Task 4: Inspect MTU inside a running container

cr0x@server:~$ docker run --rm --network bridge alpine:3.19 ip link show dev eth0
7: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff

Meaning: container believes MTU is 1500.
Decision: if the path contains encapsulation (VPN/VXLAN), you likely need a lower MTU in the container or MSS clamping.

Task 5: Look for tunnel interfaces and their MTU

cr0x@server:~$ ip link show | egrep 'wg0|tun0|tap|vxlan|geneve|gre' || true
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000

Meaning: WireGuard MTU 1420 is a strong clue. If container traffic egresses via wg0, 1500-sized packets won’t fit.
Decision: either lower container/bridge MTU to match the smallest egress MTU, or clamp MSS on traffic leaving through wg0.

Task 6: Determine effective path MTU with DF ping (host)

cr0x@server:~$ ping -c 2 -M do -s 1472 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420

--- 10.20.30.40 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 0ms

Meaning: host already knows MTU 1420 on the relevant route/interface (likely because traffic is going out via wg0).
Decision: now test smaller sizes to find the max that passes; confirm the same from a container.

Task 7: Binary search a working DF ping size (host)

cr0x@server:~$ ping -c 2 -M do -s 1392 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1392(1420) bytes of data.
1400 bytes from 10.20.30.40: icmp_seq=1 ttl=62 time=12.3 ms
1400 bytes from 10.20.30.40: icmp_seq=2 ttl=62 time=12.1 ms

--- 10.20.30.40 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 12.051/12.173/12.296/0.122 ms

Meaning: path MTU is around 1420 (1392 payload + 28 ICMP/IP overhead = 1420).
Decision: set MTU/MSS so that TCP payload segments fit under ~1420 (practically: MSS around 1380-ish depending on headers/options).
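
Bisecting by hand gets old. A small loop finds the largest DF payload for you (a sketch; assumes iputils ping and the same example target, and the starting window must bracket the answer: lo succeeds, hi fails):

cr0x@server:~$ lo=1200 hi=1473; while [ $((hi-lo)) -gt 1 ]; do mid=$(( (lo+hi)/2 )); \
    ping -c 1 -W 1 -M do -s "$mid" 10.20.30.40 >/dev/null 2>&1 && lo=$mid || hi=$mid; \
    done; echo "largest DF payload: $lo (path MTU ~ $((lo+28)))"
largest DF payload: 1392 (path MTU ~ 1420)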

Task 8: Run DF ping from inside a container (to detect mismatch)

cr0x@server:~$ docker run --rm alpine:3.19 sh -lc 'apk add -q iputils && ping -c 2 -M do -s 1472 10.20.30.40'
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.

--- 10.20.30.40 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1017ms

Meaning: the oversized DF probes vanish from inside the container: no replies, and no “Frag needed” errors coming back either. The container’s path is blackholing large packets instead of learning the smaller MTU.
Decision: inspect routing from container namespace and confirm which interface traffic leaves on the host.

Task 9: Identify the container’s veth and check its MTU on the host

cr0x@server:~$ cid=$(docker run -d alpine:3.19 sleep 300); echo "$cid"
b2e1e3d4c5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
cr0x@server:~$ pid=$(docker inspect -f '{{.State.Pid}}' "$cid"); echo "$pid"
24188
cr0x@server:~$ nsenter -t "$pid" -n ip link show dev eth0
9: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
cr0x@server:~$ ifindex=$(nsenter -t "$pid" -n cat /sys/class/net/eth0/iflink); echo "$ifindex"
10
cr0x@server:~$ ip -o link show | awk '$2 ~ /^veth/ {print $2, $4, $5}'
veth1a2b3c4d@if9: mtu 1500

Meaning: veth is 1500. If traffic exits via a 1420 tunnel, packets may be too big unless PMTUD works end-to-end.
Decision: decide whether to lower MTU on bridge/veth or clamp MSS.
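
If you want evidence before touching persistent config, you can shrink both ends of a veth pair on a live container (a sketch reusing the PID and veth name from this task; in practice you would target the failing workload’s veth, and the change does not survive a container restart):

cr0x@server:~$ sudo ip link set dev veth1a2b3c4d mtu 1420
cr0x@server:~$ sudo nsenter -t "$pid" -n ip link set dev eth0 mtu 1420

Rerun the large-request reproduction against that workload; if it suddenly behaves, you have your answer before editing daemon.json.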

cr0x@server:~$ docker rm -f "$cid" >/dev/null

Task 10: Check whether ICMP “Fragmentation needed” is being dropped

cr0x@server:~$ sudo iptables -S | egrep 'icmp|RELATED|ESTABLISHED'
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT

Meaning: ICMP is allowed on INPUT here; good sign for PMTUD.
Decision: also check FORWARD (Docker uses forwarding) and any firewall manager chains.

cr0x@server:~$ sudo iptables -S FORWARD
-P FORWARD DROP
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Meaning: FORWARD default drop is fine if exceptions exist, but ICMP errors are “RELATED” and should pass. If you don’t have the RELATED rule, PMTUD breaks.
Decision: ensure RELATED,ESTABLISHED is present for the relevant direction, or explicitly allow ICMP types.

Task 11: Observe retransmissions and MSS with tcpdump

cr0x@server:~$ sudo tcpdump -ni any 'host 10.20.30.40 and tcp' -c 10
tcpdump: data link type LINUX_SLL2
12:01:10.100001 IP 172.17.0.2.51234 > 10.20.30.40.8080: Flags [S], seq 1234567890, win 64240, options [mss 1460,sackOK,TS val 111 ecr 0,nop,wscale 7], length 0
12:01:10.102300 IP 10.20.30.40.8080 > 172.17.0.2.51234: Flags [S.], seq 2222222, ack 1234567891, win 65160, options [mss 1460,sackOK,TS val 222 ecr 111,nop,wscale 7], length 0
12:01:10.103000 IP 172.17.0.2.51234 > 10.20.30.40.8080: Flags [.], ack 1, win 502, options [nop,nop,TS val 112 ecr 222], length 0
12:01:11.105500 IP 172.17.0.2.51234 > 10.20.30.40.8080: Flags [P.], seq 1:1461, ack 1, win 502, options [nop,nop,TS val 113 ecr 222], length 1460
12:01:12.108000 IP 172.17.0.2.51234 > 10.20.30.40.8080: Flags [P.], seq 1:1461, ack 1, win 502, options [nop,nop,TS val 114 ecr 222], length 1460

Meaning: MSS is 1460. Data segments of 1460 bytes are being retransmitted, suggesting they’re not getting through. If the path MTU is ~1420, these segments are too large once encapsulated.
Decision: clamp MSS down (e.g., to 1360–1380) or lower MTU on the container network so MSS negotiates smaller.

Task 12: Check route MTU and policy routing (common with VPN clients)

cr0x@server:~$ ip route get 10.20.30.40
10.20.30.40 dev wg0 src 10.8.0.2 uid 1000
    cache

Meaning: traffic goes out wg0. That’s the small MTU interface.
Decision: apply MSS clamping on wg0 egress (or ingress) for forwarded container traffic, or set Docker network MTU to fit wg0.

Task 13: Inspect Docker daemon MTU configuration (if present)

cr0x@server:~$ cat /etc/docker/daemon.json
{
  "log-driver": "journald"
}

Meaning: no MTU override configured.
Decision: if you need a stable fix across reboots and container restarts, configure Docker’s MTU explicitly (or configure your CNI/overlay MTU).

Task 14: Check sysctls that can influence PMTUD behavior

cr0x@server:~$ sysctl net.ipv4.ip_no_pmtu_disc net.ipv4.tcp_mtu_probing
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.tcp_mtu_probing = 0

Meaning: PMTUD is enabled (good), but TCP MTU probing is off (default).
Decision: don’t “fix” MTU by enabling probing globally unless you understand the blast radius; prefer correct MTU/MSS.

Fixes that work: MTU alignment and MSS clamping

Fix strategy A: Align MTU across underlay, overlays, and container interfaces

This is the clean solution: packets are naturally sized correctly, PMTUD is a safety net, and your capture files are
boring. It takes more thinking up front, but it’s less magical.

1) If you use a tunnel, accept its overhead

If egress is via wg0 with MTU 1420, you can set Docker’s bridge MTU to 1420 (or slightly less to account
for additional overhead, depending on your path). The idea: ensure the container interface MTU is not larger than the
smallest effective MTU on the path.

2) Set Docker daemon MTU (bridge networks)

Configure Docker to create networks with a specific MTU. Example:

cr0x@server:~$ sudo sh -lc 'cat > /etc/docker/daemon.json <<EOF
{
  "mtu": 1420,
  "log-driver": "journald"
}
EOF'
cr0x@server:~$ sudo systemctl restart docker

Meaning: containers attached to the default bridge network will now get MTU 1420; user-defined networks still need the per-network option shown next.
Decision: recreate affected containers (existing veth pairs may keep old MTU). Plan a controlled rollout.

3) For user-defined networks, set MTU explicitly

cr0x@server:~$ docker network create --driver bridge --opt com.docker.network.driver.mtu=1420 appnet
9f1c2b3a4d5e6f708192a3b4c5d6e7f8091a2b3c4d5e6f708192a3b4c5d6e7f8

Meaning: this network will use MTU 1420 for its bridge/veth.
Decision: attach workloads that traverse the tunnel to this network; keep “local-only” networks at 1500 if appropriate.
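
A quick check that the option actually landed (a sketch; reuses the appnet network created above):

cr0x@server:~$ docker run --rm --network appnet alpine:3.19 ip link show dev eth0 | grep -o 'mtu [0-9]*'
mtu 1420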

4) Overlay networks: compute MTU, don’t guess

For VXLAN overlays on a 1500 underlay, 1450 is a common setting. If the underlay is itself tunneled (VPN), your
overlay MTU may need to be smaller still. A VXLAN-over-WireGuard stack can get cramped fast.

The correct approach is empirical: measure effective path MTU between nodes (DF ping), then subtract encapsulation
overheads you add on top of that. Or, more bluntly: set overlay MTU based on the smallest underlay you actually have,
not the one on the purchase order.
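
For Swarm’s built-in overlay driver the knob has the same name (a sketch; assumes Swarm mode is active, and 1370 assumes a measured ~1420 underlay like the WireGuard path above minus ~50 bytes of VXLAN):

cr0x@server:~$ docker network create --driver overlay --attachable \
    --opt com.docker.network.driver.mtu=1370 appnet-overlay

CNI plugins expose their own MTU settings; the principle is the same everywhere: measure, subtract, configure consistently.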

Fix strategy B: Clamp TCP MSS (fast, practical, slightly ugly)

MSS clamping is the field medic’s tourniquet: it stops the bleeding even if you haven’t fully corrected the anatomy.
It works particularly well when your path MTU is stable but smaller than endpoints believe, or when you can’t change
MTU on every container/network quickly.

What it does: rewrites MSS in SYN packets so endpoints never send TCP segments too large for the path.
It does not fix UDP-based protocols directly, and it does not fix IPv6 unless you also clamp in ip6tables/nft for v6.

1) Clamp MSS on forwarded container traffic exiting a tunnel

If containers are behind NAT on the host and egress via wg0, clamp on the FORWARD path.

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Meaning: the kernel adjusts MSS based on discovered PMTU for that route.
Decision: if PMTUD is broken (ICMP blocked), clamp-to-pmtu may not converge; then use a fixed MSS.

2) Clamp to a fixed MSS when PMTUD is unreliable

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360

Meaning: MSS becomes 1360 for those flows, fitting inside a ~1420 MTU with room for headers/options.
Decision: choose MSS conservatively. Too low costs throughput; too high doesn’t fix the problem.

3) Verify MSS clamping is active

cr0x@server:~$ sudo iptables -t mangle -S FORWARD | grep TCPMSS
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360

Meaning: rule is present.
Decision: run tcpdump again and confirm SYN advertises MSS 1360.

4) Confirm with tcpdump that MSS changed

cr0x@server:~$ sudo tcpdump -ni wg0 'tcp[tcpflags] & tcp-syn != 0 and host 10.20.30.40' -c 2
12:05:44.000001 IP 10.8.0.2.53312 > 10.20.30.40.8080: Flags [S], seq 1, win 64240, options [mss 1360,sackOK,TS val 333 ecr 0,nop,wscale 7], length 0
12:05:44.002000 IP 10.20.30.40.8080 > 10.8.0.2.53312: Flags [S.], seq 2, ack 2, win 65160, options [mss 1360,sackOK,TS val 444 ecr 333,nop,wscale 7], length 0

Meaning: MSS is now 1360; you should stop sending oversized segments.
Decision: rerun the “large POST” reproduction. If it works, you have your immediate mitigation.
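
If IPv6 also rides the tunnel, remember the caveat above: the rules so far only touch v4. The ip6tables analog looks like this; pick the variant that mirrors your v4 choice (a sketch; the fixed value sits 20 bytes lower because the IPv6 header is 40 bytes, not 20):

cr0x@server:~$ sudo ip6tables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo ip6tables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1340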

Fix strategy C: Stop breaking PMTUD (the security team will survive)

If you control firewalls, allow the ICMP needed for PMTUD:

  • IPv4: ICMP type 3 code 4 (“Fragmentation needed”).
  • IPv6: ICMPv6 type 2 (“Packet Too Big”).

Also allow “RELATED” conntrack traffic for forwarded flows. Docker hosts forward traffic; treat them like routers.
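
In iptables terms, that looks roughly like this (a sketch for a host managing its own chains; if your RELATED,ESTABLISHED rule already covers forwarded flows, these explicit rules are belt and suspenders):

cr0x@server:~$ sudo iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
cr0x@server:~$ sudo ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT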

Joke #2: Blocking ICMP to “improve security” is like removing the oil light because it’s distracting.

Three corporate mini-stories (all true in spirit)

Incident 1: The wrong assumption (“MTU is always 1500”)

A team rolled out a new containerized ingestion service. It worked perfectly in staging, then started failing in
production only for certain customers. Smaller payloads were fine. Larger uploads stalled and timed out after a few
minutes. The application logs were useless: the request arrived, then nothing. The load balancer metrics showed
connections hanging in “active” state.

The on-call initially suspected a slow upstream dependency. Reasonable, because “timeouts” in distributed systems
are usually dependency pain. But packet captures showed retransmissions of full-sized TCP segments. SYN/SYN-ACK looked
normal. The first chunk of request body disappeared into the void.

The assumption was simple: “the network MTU is 1500.” On the host NIC, it was. On the path to certain customer
networks, it wasn’t, because traffic hairpinned through an IPsec tunnel managed by a separate team. That tunnel had
an effective MTU closer to 1400, and ICMP was filtered “for safety.”

The fix was twofold: allow PMTUD-related ICMP across that tunnel and clamp MSS at the tunnel edge. Once applied,
large uploads returned instantly to boring. The incident review was brutal but productive: you can’t treat “MTU” as a
property of a single interface. It’s a property of the path, and paths change.

Incident 2: The optimization that backfired (“enable jumbo frames everywhere”)

Another organization wanted better throughput for container-to-container transfers. Someone proposed enabling jumbo
frames (MTU 9000) on the data network. It was pitched as a low-risk performance win: fewer packets, less CPU, higher
throughput. They tested between two hosts. It was faster. Everyone applauded and merged the change request.

A week later, odd things started happening. Some services were fine. Others had intermittent issues: gRPC streams
resetting, occasional TLS handshake failures, “random” timeouts during image pulls. The most confusing part: packet
loss graphs were clean. Latency looked normal. Only certain paths were affected.

The problem wasn’t jumbo frames by themselves. The problem was partial deployment and a single 1500-byte hop in the
middle: a firewall cluster that didn’t support jumbo frames on one interface. Some traffic passed through a
jumbo-capable path; some didn’t, depending on routing and failover.

The network now had a split-brain MTU. Hosts happily emitted 9000-byte frames. When they hit the 1500-only hop, they
relied on PMTUD. The ICMP “Fragmentation needed” errors PMTUD depends on were rate-limited by the firewall and sometimes dropped. The result was
blackholing that came and went with load and failover events.

They eventually standardized MTU to 1500 on that segment and used LRO/GRO tuning plus application-level parallelism
for performance. Jumbo frames can be great, but the only safe jumbo MTU is one that is consistently supported end-to-end.

Incident 3: The boring practice that saved the day (standard MTU tests in rollout)

A platform team maintained a baseline “node readiness” script for every new cluster node. It was not glamorous. It
ran a few checks: DNS, time sync, disk space, and—quietly—a pair of PMTU tests to key destinations (control plane,
registry, service mesh gateways). It also validated that container networks had expected MTU values.

During a migration, a subset of nodes was provisioned in a different subnet that routed through a VPN appliance. The
workloads scheduled on those nodes immediately showed higher error rates, but the readiness script caught it before
the blast radius got interesting. The PMTU check failed for 1500-sized DF pings and suggested a smaller MTU.

Because it was caught early, the fix was surgical: the team set Docker network MTU on those nodes to match the VPN
path and added MSS clamping on the VPN egress. They also documented the subnet difference so networking could later
remove the unnecessary tunnel.

Nothing about this was heroic. That’s the point. Routine MTU checks turned a multi-day whodunit into a 30-minute
change. Boring is a feature in operations.

Common mistakes: symptom → root cause → fix

1) Symptom: “Small requests work; large POSTs hang from containers”

Root cause: MTU mismatch plus broken PMTUD (ICMP blocked) or tunnel overhead not accounted for.
Fix: measure path MTU with DF ping; set Docker network MTU accordingly or clamp TCP MSS on egress interface.

2) Symptom: “Works on host, fails in container”

Root cause: container network path differs (NAT/forwarding rules, different routing table, policy routing, overlay).
Fix: inspect ip route get on host for destination; tcpdump on docker0 and egress interface; clamp MSS for forwarded traffic or align MTUs.

3) Symptom: “TLS handshake fails intermittently; curl sometimes stalls after CONNECT”

Root cause: large TLS handshake records exceed path MTU; retransmissions; ICMP blocked or rate-limited.
Fix: clamp MSS; ensure ICMP PTB/frag-needed permitted; verify with tcpdump that MSS is lowered.

4) Symptom: “Overlay network drops large packets; node-to-node ping works”

Root cause: overlay MTU not reduced for encapsulation overhead (VXLAN/Geneve).
Fix: set overlay/CNI MTU (often ~1450 for VXLAN on 1500 underlay), ensure consistent config across all nodes.

5) Symptom: “IPv6 only: some destinations unreachable, weird stalls”

Root cause: ICMPv6 Packet Too Big blocked; IPv6 routers do not fragment, so PMTUD is mandatory.
Fix: allow ICMPv6 type 2; clamp MSS for IPv6 TCP if needed; stop treating ICMPv6 as optional.

6) Symptom: “After changing host MTU, old containers still break”

Root cause: existing veth pairs keep old MTU; Docker doesn’t always retrofit live networks.
Fix: recreate networks/containers; set daemon/user-network MTU so new attachments are correct.

7) Symptom: “Only traffic through VPN breaks; internal traffic is fine”

Root cause: VPN interface MTU smaller than LAN; container/bridge MTU too large; PMTUD broken across VPN.
Fix: clamp MSS on VPN egress; optionally use a dedicated Docker network with smaller MTU for VPN-bound workloads.

8) Symptom: “UDP-based service (DNS, QUIC, syslog) drops large messages”

Root cause: MSS clamping doesn’t help UDP; fragmentation may be blocked; UDP payload exceeds path MTU.
Fix: reduce application message size; use TCP for that protocol where supported; align MTU and allow fragmentation/ICMP as appropriate.

Checklists / step-by-step plan

Checklist A: Quick containment (15–30 minutes)

  1. Reproduce failure with a large request from inside a container.
  2. Run DF ping from host to destination and find maximum size that works.
  3. Identify the egress interface for that destination (ip route get).
  4. Apply MSS clamping on the egress interface for forwarded traffic:
    • Prefer --clamp-mss-to-pmtu if PMTUD works.
    • Use --set-mss if PMTUD is broken or inconsistent.
  5. Verify with tcpdump that SYN MSS is reduced.
  6. Rerun the large request test. Confirm success.

Checklist B: Correct fix (same day, less adrenaline)

  1. Inventory encapsulation layers in the path (VPN, overlay, cloud fabric, load balancers).
  2. Standardize underlay MTU across links that are supposed to be “the same network.”
  3. Set overlay MTU explicitly (VXLAN/Geneve/IPIP as appropriate) and ensure it’s consistent across nodes.
  4. Set Docker daemon MTU or per-network MTU for bridge networks that must traverse constrained paths.
  5. Ensure ICMP requirements for PMTUD are allowed (IPv4 frag-needed, IPv6 packet-too-big), including across FORWARD paths.
  6. Document the expected MTU values and add an automated PMTU test to node readiness.

Checklist C: Post-fix validation (don’t skip this)

  1. Capture a short tcpdump during a large transfer; confirm no repeated retransmissions of large segments.
  2. Check application error rates; confirm the specific symptom disappears (not just “looks better”).
  3. Verify that performance is acceptable; too-small MSS can reduce throughput on high-BDP paths.
  4. Reboot one node (or restart Docker) in a maintenance window to ensure configuration persists.

FAQ

1) Why do pings work but HTTP uploads fail?

Default pings use small payloads that fit under nearly any MTU. Large HTTP uploads generate large TCP segments that
exceed the path MTU and get blackholed when PMTUD/ICMP is broken.

2) Is this a Docker bug?

Usually not. Docker is a magnifier: it introduces extra hops (bridge, veth, NAT) and makes it easy to accidentally
send traffic through tunnels/overlays with smaller MTU than the container assumes.

3) Should I just set MTU to 1400 everywhere and move on?

Only if you like permanent performance tax and mystery regressions later. Measure the smallest real path MTU you need
to support, then set MTU appropriately per network. Use MSS clamping as a targeted mitigation, not a lifestyle.

4) What’s better: lowering MTU or MSS clamping?

Lowering MTU is cleaner and works for TCP and UDP. MSS clamping is faster to roll out and often sufficient for TCP,
but it doesn’t fix UDP payload size problems.

5) Why does blocking ICMP break TCP? TCP isn’t ICMP.

PMTUD uses ICMP as the control plane to signal “your packet is too big.” Without that feedback, TCP keeps sending
oversized DF packets and retransmitting forever. The data plane waits for a sign that never comes.

6) Does Kubernetes change any of this?

The concepts are identical; the surface area increases. CNI plugins, node-to-node encapsulation, and service meshes
add overhead and complexity. Kubernetes just gives your MTU bug more places to hide.

7) What about IPv6?

IPv6 depends on PMTUD even more: routers do not fragment. If ICMPv6 Packet Too Big is blocked, you’ll see hard failure
on paths with smaller MTU. Treat ICMPv6 as required infrastructure, not optional noise.

8) Can TCP MTU probing sysctls fix it?

Sometimes, but it’s a last resort. MTU probing can mask broken networks by adapting after loss, but it’s not a
substitute for correct MTU and working ICMP. In production, prefer deterministic fixes.
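
For completeness, the knob itself (a sketch; 1 enables probing only after the kernel suspects a blackhole, 2 probes unconditionally, and neither persists across reboots without a sysctl.d entry):

cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_mtu_probing = 1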

9) How do I pick a fixed MSS value?

Start from measured path MTU. MSS should be MTU minus headers. For IPv4 TCP without options it’s typically MTU-40, but
options (timestamps, etc.) effectively reduce it. That’s why values like 1360 for a 1420 path are common: conservative,
safe, not maximal.
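
Worked through for the 1420-byte path measured earlier (a sketch; the second line is margin, not physics):

cr0x@server:~$ echo $(( 1420 - 20 - 20 ))   # theoretical IPv4 max MSS for a 1420 path
1380
cr0x@server:~$ echo $(( 1380 - 20 ))        # the clamp value used earlier, with headroom for options
1360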

10) Why does it only fail for some destinations?

Because MTU is a property of the path, and routes differ. One destination might stay on your LAN at 1500; another
crosses a tunnel at 1420; a third crosses a misconfigured firewall that drops ICMP. Same code, different physics.

Conclusion: practical next steps

When large requests fail from containers and everything else looks “fine,” treat MTU/MSS as a prime suspect. It’s not
superstition; it’s a well-worn failure mode with consistent fingerprints: size-dependent hangs, retransmissions, and
a path that silently can’t carry what you’re sending.

Do this next:

  1. Reproduce with a large payload from a container and from the host. Confirm the shape of the bug.
  2. Measure path MTU with DF pings to the actual destination.
  3. Find the actual egress interface and any tunnels/overlays in the path.
  4. Mitigate immediately with MSS clamping if you need uptime now.
  5. Fix properly by aligning MTU across Docker networks, overlays, and underlays, and by allowing PMTUD-required ICMP.
  6. Make it boring forever: add PMTU checks to node readiness and standardize MTU settings as code.
Leave a comment