Ubuntu 24.04: Jumbo frames break “only some” traffic — how to test and fix MTU safely

You flip MTU to 9000 because storage is “slow,” and suddenly the world gets weird: SSH stays fine, DNS still answers, but “some” HTTPS sites stall,
certain API calls hang forever, and your Ceph or NFS traffic turns into a drama club production of Waiting for ACK.

This is the signature of a partial MTU failure: big packets die quietly somewhere in the path, while small packets glide through and make you question
your career choices. Let’s fix it like adults: measure, isolate, change one thing at a time, and keep rollback easy.

Why jumbo frames break “only some” traffic

“Only some traffic” breaking is not a paradox. It’s what happens when the network can carry small frames (like TCP ACKs, SYNs, DNS packets, small HTTP
requests), but drops or blackholes larger frames (like TLS records, gRPC responses, SMB reads, storage replication, container overlay payloads).

The root pattern usually looks like one of these:

  • MTU mismatch between interfaces in a single L2 domain (one hop stuck at 1500, others set to 9000). Result: large frames get dropped
    at that hop. Small frames survive. You get “works for me” traffic… until the payload is big.
  • Broken Path MTU Discovery (PMTUD): somewhere drops ICMP “Fragmentation needed” messages, so endpoints never learn the real MTU.
    TCP keeps trying big segments, which keep dying. The connection doesn’t always fail fast; it stalls. That’s why this bug feels haunted.
  • Tunnels/overlays reduce effective MTU (VXLAN, GRE, WireGuard, IPsec, cloud VPN). Your NIC may do 9000, but the tunnel needs overhead.
    If you don’t lower MTU accordingly, you just built a packet shredder.
  • Offload interactions (TSO/GSO/GRO/LRO) can hide fragmentation behavior and make packet captures misleading. The kernel may “pretend” it
    sent huge packets while the NIC actually segments them—until the path doesn’t allow it.
  • Policy devices (firewalls, load balancers, NAT gateways) that mishandle ICMP, clamp MSS incorrectly, or enforce unexpected MTU on one
    direction only. Asymmetric paths are where simple truths go to die.

Here’s the practical takeaway: jumbo frames are not a single setting; they’re an end-to-end contract. If any hop disagrees, only the packets that exceed
the smallest MTU get punished. Everything else keeps working and gaslights you.

Joke #1: MTU bugs are like office microwaves—fine for small things, but put something big in and suddenly there’s smoke and blame.

Facts & historical context (the stuff that explains the pain)

  • Ethernet’s “1500 MTU” is a convention, not physics. It became the de facto default partly because it balanced efficiency with buffer limits in early gear.
  • “Jumbo frames” never had one universal size. 9000 is common, but 9216 and 9600 show up because vendors wanted headroom for VLAN tags and internal framing.
  • PMTUD relies on ICMP, which enterprise networks love to block. This has been a recurring failure mode since the 1990s and still wins awards for “most avoidable outage.”
  • RFC 4821 (Packetization Layer PMTUD) exists because classic PMTUD was fragile when ICMP is filtered; many stacks only partially implement the “more robust” approach.
  • VLAN tagging reduces payload MTU unless you account for it. One 802.1Q tag adds 4 bytes; QinQ adds 8. Some switches compensate; some don’t.
  • VXLAN and friends made MTU math operational. Overlay networks shifted MTU from “switch feature” to “every node must agree,” especially in Kubernetes and multi-tenant setups.
  • Jumbo frames help mostly with CPU efficiency, not raw line rate. Fewer packets for the same bytes means fewer interrupts and per-packet overhead—useful at high throughput.
  • Storage networks embraced jumbo frames early (iSCSI, NFS, Ceph replication) because they tend to move large sequential payloads and benefit from fewer packets.

Fast diagnosis playbook

When you’re on the clock, you don’t need theory. You need a quick path to “what’s the smallest MTU on this path, and who disagrees about it.”

First: confirm it’s MTU and not “the app”

  • Run DF ping tests from client to server at two payload sizes: one that fits a 1500 MTU and one that requires jumbo (details in the tasks
    section). If 1472 works but 8972 doesn’t, congratulations: it’s MTU.
  • If small requests work but large downloads stall, test with curl --http1.1 and a known large response. Stalls during transfer plus DF ping
    failures are classic.

Second: find the smallest hop MTU on the real path

  • Check MTU on the host NIC, bond, VLAN subinterface, bridge, and tunnel endpoints. The smallest one wins, whether you like it or not.
  • Confirm the switchport profile actually allows jumbo. Don’t trust “we configured it last year.” That sentence has ended careers.

Third: verify PMTUD is functioning

  • Look for ICMP “frag needed” coming back when you send DF packets that are too big. If it’s missing, something is dropping it.
  • Temporarily clamp TCP MSS at boundaries (firewall, VPN, tunnel) as a mitigation, then go fix the real MTU alignment.

Fourth: check overlays and offloads

  • In Kubernetes/VXLAN, set pod MTU lower than node MTU by the encapsulation overhead.
  • Disable TSO/GSO briefly for troubleshooting if captures are confusing, then re-enable for performance once you understand the path.

A practical MTU mental model (layers, overhead, and where it goes wrong)

MTU is the maximum size of the L3 packet (IP header plus payload) your interface will carry without fragmentation. On Ethernet, people casually say
“MTU 1500” meaning a 1500-byte IP packet carried as the payload of an Ethernet frame. The frame on the wire is bigger (Ethernet headers, FCS, preamble),
but the MTU setting you touch in Linux is that L3 packet size.

The most useful thing to remember is that MTU is per hop and per interface. You can have:

  • NIC MTU = 9000
  • Bond MTU = 9000
  • VLAN subinterface MTU = 9000
  • Linux bridge MTU = 1500 (oops)
  • VXLAN device MTU = 1450 (probably correct for an underlay of 1500)

And then everyone argues about “the MTU,” as if there is one MTU. There isn’t. There’s a chain, and the weakest link sets your effective MTU.

Why TCP “sort of works” even when MTU is broken

TCP can avoid fragmentation by choosing smaller segments. It learns how big it can safely send through PMTUD: it sends packets with DF (don’t fragment),
and if a hop can’t carry them, that hop returns ICMP “fragmentation needed” with the next-hop MTU. Then TCP shrinks its segment size.

When ICMP is blocked, TCP doesn’t get the memo. It keeps sending large segments that get dropped. Retries happen. Backoff happens. Apps time out.
Meanwhile, small packets (ACKs, keepalives, small queries) still flow, making everything look “mostly fine” from basic monitoring.

That’s why MTU bugs show up as:

  • Intermittent hangs (not hard failures)
  • Long tail latency spikes
  • Works on one network path but not another (asymmetric routing, different firewall policies)
  • “It breaks only when we upload/download something big”
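
Related to the RFC 4821 point earlier: Linux can fall back to Packetization Layer PMTUD, probing for a workable segment size at the TCP layer even when the ICMP errors never arrive. A minimal sketch, assuming you try the sysctl on a canary host first (the persistence file name below is just an example):

cr0x@server:~$ sysctl net.ipv4.tcp_mtu_probing
net.ipv4.tcp_mtu_probing = 0
cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_mtu_probing = 1

Value 0 disables probing, 1 enables it only after a blackhole is suspected, 2 probes on every connection. To persist it, drop the setting into something like /etc/sysctl.d/90-mtu-probing.conf. Treat this as a mitigation, not a substitute for fixing the MTU mismatch.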

Overlay overhead: your hidden MTU tax

Encapsulation adds headers. Those bytes must fit under the underlay MTU. If your underlay is 1500 and you use VXLAN, your effective payload MTU is
typically around 1450 (depending on exact headers). If you set the overlay to 1500 anyway, you’re asking the underlay to carry >1500, which means
fragmentation (bad) or drops (worse).

Jumbo frames can help overlays too, but only if the entire underlay supports them. Otherwise, you must lower MTU on the overlay and/or clamp MSS.
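
The arithmetic is worth writing down once instead of re-deriving it during an incident. A quick sketch of the common case (the 50-byte figure assumes VXLAN over an IPv4 underlay; an IPv6 underlay adds another 20 bytes):

cr0x@server:~$ # VXLAN overhead: outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14 = 50 bytes
cr0x@server:~$ echo $((1500 - 50))
1450
cr0x@server:~$ echo $((9000 - 50))
8950

GRE, WireGuard, and IPsec each have their own overhead numbers; the pattern is the same: subtract the worst-case header stack from the underlay MTU and set the overlay to that or lower.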

One quote to keep on the wall near your change calendar: Hope is not a strategy. — Gene Kranz

Practical tasks: commands, expected output, and what decision to make

These are the tasks I actually run when “some traffic” breaks after an MTU change on Ubuntu 24.04. Each task includes: command, what the output means,
and the decision you make.

Task 1: Check current MTU on all interfaces (spot the odd one out)

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP> mtu 65536
enp3s0           UP             3c:fd:fe:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000
bond0            UP             3c:fd:fe:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000
br-storage       UP             7a:11:22:33:44:55 <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500

Meaning: The bridge is at 1500 while everything else is 9000. That’s a classic partial failure: traffic traversing the bridge gets forced
down to 1500 or dropped if DF is set.

Decision: Align MTU across the entire chain (NIC → bond → VLAN → bridge → veth/tap) or deliberately set everything to the smallest you can support end-to-end.

Task 2: Confirm the default route and real egress interface (avoid debugging the wrong path)

cr0x@server:~$ ip route get 10.50.12.34
10.50.12.34 dev bond0 src 10.50.12.10 uid 1000
    cache

Meaning: Traffic to that destination exits via bond0. If you were staring at enp3s0 settings, you’d be debugging fan fiction.

Decision: Focus MTU checks and packet captures on the actual egress interface(s) and the reverse path from the peer.

Task 3: Test L3 MTU to a peer with DF ping (IPv4)

cr0x@server:~$ ping -M do -s 1472 -c 2 10.50.12.34
PING 10.50.12.34 (10.50.12.34) 1472(1500) bytes of data.
1480 bytes from 10.50.12.34: icmp_seq=1 ttl=63 time=0.412 ms
1480 bytes from 10.50.12.34: icmp_seq=2 ttl=63 time=0.401 ms

--- 10.50.12.34 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms

Meaning: 1500-byte IP packets work. This doesn’t prove jumbo works; it only proves you haven’t broken basic Ethernet.

Decision: Now test a jumbo-sized packet that should pass if MTU 9000 is truly end-to-end.

Task 4: Test jumbo MTU with DF ping (IPv4) and watch it fail cleanly

cr0x@server:~$ ping -M do -s 8972 -c 2 10.50.12.34
PING 10.50.12.34 (10.50.12.34) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500

--- 10.50.12.34 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1018ms

Meaning: The local stack believes the outgoing interface MTU is 1500 for this route, despite what you thought you set. That’s not a
path problem yet; it’s local configuration.

Decision: Find which device in the egress chain actually has MTU 1500 (often a bridge, VLAN, VRF, or tunnel device).
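
One way to spot the offender quickly is to print the MTU of every device you believe is in the egress chain. A sketch; the device names match the examples in this article, so substitute your own:

cr0x@server:~$ for dev in enp3s0 bond0 bond0.120 br-storage; do printf '%-12s %s\n' "$dev" "$(cat /sys/class/net/$dev/mtu)"; done
enp3s0       9000
bond0        9000
bond0.120    9000
br-storage   1500

Whichever line says 1500 is the device the kernel is honoring for this route.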

Task 5: If it fails without a local error, you have a path blackhole (PMTUD or middle hop)

cr0x@server:~$ ping -M do -s 8972 -c 2 10.50.12.34
PING 10.50.12.34 (10.50.12.34) 8972(9000) bytes of data.

--- 10.50.12.34 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1013ms

Meaning: The host is willing to send 9000, but nobody answers. That usually means a hop dropped the large frame. If ICMP “frag needed”
doesn’t come back, PMTUD can’t correct it.

Decision: Capture traffic and look for ICMP errors; if they’re absent, fix ICMP filtering or clamp MSS as a mitigation.
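
If iputils’ tracepath is installed, it does much of this legwork for you: it probes the path hop by hop, without needing root, and reports where the path MTU drops. A sketch; the hop addresses and exact output format are illustrative:

cr0x@server:~$ tracepath -n 10.50.12.34
 1?: [LOCALHOST]                      pmtu 9000
 1:  10.50.12.1                       0.412ms
 2:  10.50.12.1                       0.405ms pmtu 1500
 2:  10.50.12.34                      0.690ms reached
     Resume: pmtu 1500 hops 2 back 2

The pmtu value dropping mid-path tells you which hop to go argue with; the Resume line is the effective path MTU tracepath settled on.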

Task 6: Test IPv6 MTU behavior (it differs; fragmentation is endpoint-only)

cr0x@server:~$ ping -6 -M do -s 1452 -c 2 2001:db8:10::34
PING 2001:db8:10::34(2001:db8:10::34) 1452 data bytes
1460 bytes from 2001:db8:10::34: icmp_seq=1 ttl=63 time=0.588 ms
1460 bytes from 2001:db8:10::34: icmp_seq=2 ttl=63 time=0.570 ms

--- 2001:db8:10::34 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms

Meaning: The IPv6 path carries full 1500-byte packets here (1452 data + 8 ICMPv6 header + 40 IPv6 header). With IPv6, routers never fragment; a hop that
can’t forward the packet sends ICMPv6 “Packet Too Big.” If that’s blocked, IPv6 gets brittle fast.

Decision: Make sure ICMPv6 is allowed end-to-end. Blocking it breaks real traffic, not just “ping.”

Task 7: Check TCP MSS on live sessions (detect clamping or lack of it)

cr0x@server:~$ sudo ss -ti dst 10.50.12.34 | head -n 20
ESTAB 0 0 10.50.12.10:44218 10.50.12.34:443
	 cubic wscale:7,7 rto:204 rtt:1.21/0.12 ato:40 mss:8960 pmtu:9000 rcvmss:536 advmss:8960 cwnd:10 bytes_acked:12983 bytes_received:44112 segs_out:112 segs_in:98 data_segs_out:64 data_segs_in:71

Meaning: This flow believes PMTU is 9000 and MSS is 8960. If the path can’t actually carry 9000, this is a recipe for stalls.

Decision: Either make the path truly 9000 end-to-end or clamp MSS to a safe value at the boundary where the MTU is smaller.

Task 8: Watch for PMTUD ICMP messages with tcpdump

cr0x@server:~$ sudo tcpdump -ni bond0 'icmp and (icmp[0]=3 and icmp[1]=4)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:22:31.201983 IP 10.50.12.1 > 10.50.12.10: ICMP 10.50.12.34 unreachable - need to frag (mtu 1500), length 556

Meaning: The network is telling you a hop MTU is 1500. PMTUD is working, but your endpoints might still misbehave if policies drop some
ICMPs or if only some paths return the error.

Decision: Fix the mismatch: either lower MTU/MSS to 1500-safe sizes for that route, or upgrade the path to jumbo consistently.

Task 9: Check for ICMP being filtered locally (firewall rules)

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
	chain input {
		type filter hook input priority filter; policy drop;
		ct state established,related accept
		iif "lo" accept
		ip protocol icmp accept
		ip6 nexthdr icmpv6 accept
		tcp dport { 22, 443 } accept
	}
}

Meaning: ICMP/ICMPv6 are accepted. That’s good. If you don’t see those accepts, PMTUD can fail even if the network is sending the right messages.

Decision: Allow ICMP “frag needed”/“packet too big” and related traffic. Don’t “secure” your network by breaking it.
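
If a blanket ICMP accept is more than your security policy will tolerate, a narrower rule set that still keeps PMTUD (and IPv6 neighbor discovery) alive looks roughly like this. A sketch against the table and chain shown above; adapt the names to your ruleset:

cr0x@server:~$ sudo nft add rule inet filter input icmp type destination-unreachable accept
cr0x@server:~$ sudo nft add rule inet filter input icmp type time-exceeded accept
cr0x@server:~$ sudo nft add rule inet filter input icmpv6 type { packet-too-big, time-exceeded, parameter-problem, nd-neighbor-solicit, nd-neighbor-advert } accept

Destination-unreachable carries the “frag needed” code PMTUD depends on; packet-too-big is its IPv6 equivalent. Drop those and you’re back in blackhole territory.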

Task 10: Check NIC offloads that can confuse troubleshooting (not necessarily the root cause)

cr0x@server:~$ sudo ethtool -k enp3s0 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on

Meaning: Offloads are enabled. Packet captures on the host might show large “super-packets” that never exist on the wire.

Decision: If debugging is unclear, disable TSO/GSO temporarily on a test node, validate behavior, then restore. Don’t leave performance on the floor permanently.

Task 11: Temporarily disable TSO/GSO/GRO (diagnostic move)

cr0x@server:~$ sudo ethtool -K enp3s0 tso off gso off gro off

Meaning: Now packet captures match wire behavior more closely. Throughput may drop; that’s fine for diagnosis.

Decision: Re-test DF pings and the real application. If symptoms change dramatically, incorporate offload understanding into your fix, then re-enable offloads.

Task 12: Check bridge MTU and members (common “hidden 1500” culprit)

cr0x@server:~$ ip -d link show br-storage
7: br-storage: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7a:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9000 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Meaning: The bridge MTU is explicitly 1500 even though it could go up to 9000. That bridge will constrain traffic traversing it.

Decision: Set the bridge MTU to match the underlay if every attached segment supports jumbo; otherwise keep it at 1500 and accept that reality.
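
The runtime fix is one command, sketched here assuming every port attached to br-storage really is jumbo-capable. It does not survive a reboot, so mirror the value in netplan as well:

cr0x@server:~$ sudo ip link set dev br-storage mtu 9000
cr0x@server:~$ ip -br link show br-storage
br-storage       UP             7a:11:22:33:44:55 <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000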

Task 13: Confirm VLAN interfaces are aligned (VLANs are where assumptions go to die)

cr0x@server:~$ ip -d link show bond0.120
12: bond0.120@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 120 <REORDER_HDR>

Meaning: The VLAN device is at 9000. If it were 1500 while the parent is 9000, you’d get route-specific failures and strange asymmetry.

Decision: Keep MTU consistent for parent and VLAN subinterfaces unless you have a very specific reason not to (and a diagram proving it).

Task 14: Measure real application behavior with curl (stall patterns matter)

cr0x@server:~$ curl -v --max-time 10 http://10.50.12.34:8080/largefile.bin -o /dev/null
*   Trying 10.50.12.34:8080...
* Connected to 10.50.12.34 (10.50.12.34) port 8080
> GET /largefile.bin HTTP/1.1
> Host: 10.50.12.34:8080
> User-Agent: curl/8.5.0
> Accept: */*
* Request completely sent off
* Operation timed out after 10000 milliseconds with 0 bytes received
curl: (28) Operation timed out after 10000 milliseconds with 0 bytes received

Meaning: Connection setup succeeded, request went out, response never arrived. That is consistent with large server-to-client packets being dropped,
or PMTUD failure in the reverse direction.

Decision: Test reverse direction too (run the same from the other host) and capture on both ends; suspect asymmetric MTU/ICMP policy.

Task 15: Check kernel MTU/PMTUD related counters (are we seeing fragmentation or errors?)

cr0x@server:~$ netstat -s | egrep -i 'fragment|reassembl|MTU|icmp.*frag' | head -n 30
    86 reassemblies required
    74 packets reassembled ok
    12 fragments dropped after timeout
    160 fragments received ok

Meaning: Fragment-related counters moving can indicate real fragmentation or blackholing. It’s not perfect, but it’s a clue.

Decision: If you see reassembly or fragment drops rising during tests, align MTU and avoid relying on fragmentation for performance traffic.

Task 16: Validate persistent MTU configuration on Ubuntu 24.04 (netplan)

cr0x@server:~$ sudo cat /etc/netplan/01-bond.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      mtu: 9000
    enp4s0:
      mtu: 9000
  bonds:
    bond0:
      interfaces: [enp3s0, enp4s0]
      mtu: 9000
      parameters:
        mode: 802.3ad
        mii-monitor-interval: 100

Meaning: Netplan is explicitly setting MTU. If the running MTU doesn’t match, you may have another file overriding it or a runtime change.

Decision: Consolidate netplan configuration and apply carefully (see the safe rollout plan) to avoid surprise reconfig mid-incident.
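
If the chain also includes VLAN subinterfaces and a bridge, give them explicit MTU values too; don’t assume the bond’s setting propagates everywhere. A hedged sketch extending the file above (same interface names as the earlier examples; review against the netplan reference for your release before applying):

  vlans:
    bond0.120:
      id: 120
      link: bond0
      mtu: 9000
  bridges:
    br-storage:
      interfaces: [bond0.120]
      mtu: 9000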

How to fix MTU safely (without “oops” outages)

There are two broad strategies. Pick one and commit. Half-jumbo is how you get half-working networks.

Strategy A: Make jumbo truly end-to-end (preferred for storage clusters)

This is correct when you control the entire L2/L3 path: NICs, bonds, bridges, switches, and any intermediate appliances.

  • Set MTU consistently on every Linux interface that carries the traffic: physical NICs, bonds, VLAN subinterfaces, bridges, and any veth/tap bridges for VMs/containers.
  • Ensure the switch ports and trunks are configured to accept jumbo frames. Watch for “baby giant” settings if VLAN tags are involved.
  • Validate PMTUD still works (don’t block ICMP), even with jumbo enabled. You want PMTUD as a safety net, not a casualty.
  • Run DF ping tests for 1500 and jumbo sizes between all critical peers (not just one pair).
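
A small loop turns “between all critical peers” from a chore into a habit. A sketch, assuming the peer list is yours to fill in (1472 and 8972 are ICMP payload sizes corresponding to 1500 and 9000 IP MTU):

cr0x@server:~$ for peer in 10.50.12.34 10.50.12.35; do for size in 1472 8972; do ping -M do -s "$size" -c 2 -W 1 -q "$peer" >/dev/null 2>&1 && echo "OK   $peer size=$size" || echo "FAIL $peer size=$size"; done; done
OK   10.50.12.34 size=1472
OK   10.50.12.34 size=8972
OK   10.50.12.35 size=1472
FAIL 10.50.12.35 size=8972

Any FAIL at 8972 with an OK at 1472 means that pair is not ready for jumbo, no matter what the switch config claims.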

Strategy B: Keep underlay at 1500, tune overlays and clamp MSS (preferred for mixed corporate networks)

If you traverse VPNs, cloud gateways, or unknown middleboxes, jumbo frames are often an argument with reality you will lose.

  • Leave underlay MTU at 1500 (or whatever the narrowest hop requires).
  • Set tunnel/overlay MTU correctly (usually lower than underlay by overhead).
  • Clamp TCP MSS at the boundary so TCP never tries segments that won’t fit. This mitigates PMTUD blackholes.
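
On a Linux box that is the boundary (router, VPN endpoint, gateway), the clamp is a single rule. A sketch assuming nftables with an existing inet table and forward chain; the rule itself is the standard clamp-to-path-MTU form, and iptables users would reach for TCPMSS --clamp-mss-to-pmtu instead:

cr0x@server:~$ sudo nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu

This rewrites the MSS option on SYN packets passing through the box, so endpoints never negotiate segments larger than the route’s MTU can carry.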

Safe change mechanics on Ubuntu 24.04

Ubuntu 24.04 typically uses netplan with systemd-networkd or NetworkManager. In servers, it’s often networkd. MTU changes can bounce interfaces. If this is
the same interface you’re SSH’d over, you need a safety harness.

Golden rules for MTU changes in production

  • Never change MTU on your only management path without an out-of-band console or a timed rollback.
  • Change the narrowest domain first: if the switchport is 1500, raising the host to 9000 buys you nothing except confusion.
  • Roll out in rings: one host, then a pair, then a rack/zone, then the fleet.
  • Measure with real traffic: storage replication, container overlay traffic, large HTTP transfers—not just pings.

Joke #2: “We’ll just bump MTU to 9000 everywhere” is the networking equivalent of “I’ll just restart production quickly.”

Three corporate mini-stories from the MTU trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size company ran a private virtualization cluster: hypervisors on Ubuntu, storage traffic on a dedicated VLAN, and a top-of-rack switch stack that
“supported jumbo.” Someone decided to enable jumbo frames to reduce CPU on the hypervisors, because graphs showed high softirq during backup windows.

The engineer changed MTU to 9000 on the hypervisor bonds and on the storage VLAN interfaces. A few test pings succeeded (at 1500 size), VMs kept running,
and everyone went home. The next morning, only some VMs had broken backups. Some NFS mounts were fine. Others would hang during large reads. It looked like
a flaky storage array, which is a popular scapegoat.

The wrong assumption: “The switch stack supports jumbo, so the ports must be configured for it.” In reality, the switch had jumbo enabled globally, but
one particular trunk to the storage network still had a legacy profile with a 1500-ish maximum. Traffic within a rack stayed jumbo-capable; traffic that
crossed that trunk hit the smaller MTU and started blackholing large frames.

Diagnosis took longer than it should have because basic health checks were green: SSH, small RPCs, cluster heartbeats. The first truly useful clue was a DF ping
that failed only across racks, and a tcpdump showing ICMP “frag needed” from one direction but not the other (filtered asymmetrically).

The fix was boring: align switch trunk MTU, verify with DF pings between representative nodes, and stop treating “supports jumbo” as “configured for jumbo.”
The post-incident action item was even more boring: maintain a simple MTU matrix per VLAN and enforce it with config checks. It worked.

Mini-story 2: The optimization that backfired

Another org ran Kubernetes on Ubuntu with a VXLAN-based CNI. They had a fast east-west network and wanted “maximum performance,” so they set the node NIC MTU to
9000 and assumed everything else would benefit automatically. Pods kept the default MTU, and the CNI’s MTU auto-detection guessed wrong on a subset of nodes.

For a week, nothing obvious broke. Then they rolled out a service that streamed large responses (think: artifacts, ML model blobs, large JSON payloads—pick your poison).
Suddenly, only some pods in some nodes experienced huge latency spikes and timeouts. Retries masked the issue, so graphs showed “slower but mostly okay,” which is exactly
the kind of problem that burns money quietly.

The optimization backfired because of encapsulation overhead. Some pod-to-pod paths were effectively exceeding the underlay MTU due to VXLAN headers, and PMTUD was
unreliable due to firewall rules that treated some ICMP as suspicious. So pods would send packets that were fine on-node, but too big off-node.

The eventual fix was to set a correct, explicit CNI MTU (lower than the underlay by the worst-case overhead), and to keep the underlay MTU consistent across
all relevant interfaces. Jumbo frames were still used on the physical network, but the overlay MTU was chosen deliberately, not “because bigger is better.”

The most valuable lesson wasn’t “don’t use jumbo.” It was: don’t deploy performance changes that you can’t validate with end-to-end tests and a rollback plan.
Performance work is production work. Treat it with the same discipline.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had a habit: every network-affecting change came with a small “path contract test.” It was a script that ran from a canary host:
DF pings at multiple sizes, a large HTTPS download, and a storage replication sanity check. It logged results to a place everyone could see.

One quarter, a network team replaced a firewall pair. The migration plan was solid, but in the middle of a long change window, someone restored a default
policy that unintentionally dropped certain ICMP unreachable messages. Connectivity stayed up. Monitoring stayed green. The kind of green that lies.

The canary test tripped immediately. DF pings at jumbo sizes stopped getting “frag needed” responses; large downloads stalled; PMTUD behavior changed. Nobody
had to wait for customers to call. The rollback decision was made based on evidence, not vibes.

They didn’t even need to abandon the firewall cutover. They adjusted the policy to allow essential ICMP, re-ran the tests, and moved forward.
The practice wasn’t glamorous. It didn’t get conference talks. It saved the day anyway.

Common mistakes: symptom → root cause → fix

1) Symptom: SSH works, but large downloads hang or reset

Root cause: PMTUD blackhole. Large TCP segments exceed a hop MTU; ICMP “frag needed” is blocked.

Fix: Allow ICMP fragmentation-needed/packet-too-big across firewalls. As mitigation, clamp TCP MSS on the boundary device.

2) Symptom: Only cross-subnet traffic breaks after enabling jumbo

Root cause: L3 device (router, firewall, gateway) still at 1500 while local L2 is jumbo.

Fix: Either make the routed path jumbo-capable end-to-end, or keep server MTU at 1500 for that routed segment and use jumbo only on isolated storage VLANs.

3) Symptom: East-west pod traffic in Kubernetes is flaky; node-to-node ping is fine

Root cause: Overlay MTU not reduced for VXLAN/Geneve overhead, or inconsistent CNI MTU across nodes.

Fix: Set a consistent CNI MTU (e.g., 1450 for 1500 underlay, or larger if underlay is jumbo). Validate with DF tests between pods on different nodes.
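
To validate from inside the cluster rather than between nodes, run a DF ping pod-to-pod across nodes. A sketch with hypothetical pod name and pod IP, assuming the image ships iputils ping (1422 bytes of ICMP payload corresponds to a 1450 pod MTU):

cr0x@server:~$ kubectl exec -it netcheck-a -- ping -M do -s 1422 -c 2 10.244.3.17

If that passes but larger sizes fail, the overlay MTU is your ceiling; if even that fails, suspect a node whose CNI MTU exceeds what the underlay can carry, or ICMP being filtered on the overlay.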

4) Symptom: VMs on a Linux bridge break, but the host itself is fine

Root cause: Bridge MTU or tap/vnet devices at 1500; host NIC at 9000. VM traffic hits the smaller MTU.

Fix: Set MTU on bridge and VM-facing interfaces consistently, or keep everything at 1500 if the physical path can’t guarantee jumbo.

5) Symptom: One direction is slow, the other is fine

Root cause: Asymmetric routing or asymmetric ICMP filtering; PMTUD works in one direction but not the reverse.

Fix: Capture on both ends; align MTU policies; allow ICMP both ways; verify path symmetry for critical flows.

6) Symptom: “We set MTU to 9000 but ping says mtu=1500”

Root cause: The route uses a different interface (VRF, VLAN, bridge) still at 1500, or a config override resets MTU on link up.

Fix: Use ip route get to confirm egress device; check MTU on the full interface stack; fix netplan/systemd-networkd configuration to persist correctly.

7) Symptom: Performance got worse after enabling jumbo

Root cause: Microbursts and buffer pressure on switches/NICs, or offload/driver issues. Jumbo reduces packets, but can increase per-packet serialization delay and burstiness.

Fix: Validate with throughput and latency tests, watch drops on switches/NICs, consider slightly smaller jumbo (e.g., 9000 vs 9600) and tune queues; don’t assume “bigger always faster.”

Checklists / step-by-step plan

Checklist: before you touch MTU

  • Define the scope: which VLAN/subnet/tunnel is changing?
  • List every hop: server NICs, bonds, bridges, hypervisor vSwitch, switchports, trunks, routers, firewalls, VPNs, load balancers.
  • Decide your target MTU per domain: 1500-only, or jumbo end-to-end.
  • Confirm you have a rollback path: console access, alternate mgmt NIC, or a timed job reverting netplan.
  • Prepare tests: DF ping sizes, a large HTTP transfer, and one real workload test (storage replication, backup job, etc.).

Step-by-step: safe jumbo rollout on Ubuntu 24.04 (servers)

  1. Pick a canary host with out-of-band access.
  2. Measure baseline:
    run DF pings at 1472 and 8972 to key peers; run a large transfer; capture a short tcpdump for ICMP “frag needed.”
  3. Validate the network first:
    configure switchports/trunks for jumbo, confirm with network team using their tooling. Don’t change hosts until the network is ready.
  4. Change MTU on the canary host in netplan for all relevant interfaces (physical + bond + VLAN + bridge).
  5. Apply in a controlled way:
    prefer a maintenance window; if remote-only, use a timed rollback mechanism (see below).
  6. Re-run tests immediately:
    DF ping jumbo must succeed end-to-end; application transfer must complete; check ss -ti for sane MSS/PMTU.
  7. Expand to a small ring:
    a pair of hosts that talk heavily; then one rack; then the fleet.
  8. Keep monitoring focused:
    watch retransmits, timeouts, and storage latency. MTU failures often show up as retransmit storms, not link down events.

Timed rollback pattern (because you like sleeping)

If you’re changing the MTU on the interface carrying your SSH session, set up a timed rollback before applying netplan. Example: schedule a reboot or a netplan revert
in 5 minutes, then cancel it once you confirm connectivity. There are many ways; the point is: don’t bet production on your Wi‑Fi.

cr0x@server:~$ sudo systemd-run --on-active=5m --unit=mtu-rollback /usr/sbin/netplan apply
Running timer as unit: mtu-rollback.timer
Will run service as unit: mtu-rollback.service

Meaning: This example is intentionally simplistic and not a full rollback; in practice you’d run a script that restores a known-good netplan YAML and applies it.

Decision: Use a real rollback script in your environment. The point is the operational pattern: apply change with a safety timer, then cancel once verified.
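
A slightly more realistic version of the pattern, sketched with hypothetical paths: snapshot the known-good netplan file, schedule a restore with a transient timer, and stop the timer once you have confirmed you can still reach the host:

cr0x@server:~$ sudo cp /etc/netplan/01-bond.yaml /root/01-bond.yaml.known-good
cr0x@server:~$ sudo tee /usr/local/sbin/mtu-rollback.sh >/dev/null <<'EOF'
#!/bin/sh
# Restore the last known-good netplan config and apply it.
cp /root/01-bond.yaml.known-good /etc/netplan/01-bond.yaml
netplan apply
EOF
cr0x@server:~$ sudo chmod +x /usr/local/sbin/mtu-rollback.sh
cr0x@server:~$ sudo systemd-run --on-active=5m --unit=mtu-rollback /usr/local/sbin/mtu-rollback.sh
cr0x@server:~$ # ...make the MTU change, verify connectivity, then cancel the pending rollback:
cr0x@server:~$ sudo systemctl stop mtu-rollback.timer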

FAQ

1) Why does ping work but my app breaks?

Default ping uses small packets. Your app likely sends larger TCP segments. MTU issues only bite when packets exceed the smallest MTU in the path, and then
PMTUD may or may not rescue you depending on ICMP policies.

2) What MTU should I choose for jumbo frames: 9000 or 9216?

Pick the value your switches and NICs support consistently. 9000 is common for “IP MTU.” Some devices advertise 9216 as a frame size allowance. The only
correct answer is: choose one, document it per VLAN, and test end-to-end including VLAN tagging and trunks.

3) Do jumbo frames always improve performance?

No. They often reduce CPU overhead at high throughput, but they can also increase burstiness and stress buffers. If your bottleneck is disk, encryption,
application serialization, or a single-threaded userland process, jumbo frames won’t save you.

4) Is fragmentation actually that bad?

Occasional fragmentation for odd traffic is survivable. Relying on fragmentation for bulk traffic is a tax: extra CPU, reassembly complexity, higher drop
sensitivity, and more painful debugging. Most modern designs try to avoid it.

5) What’s the best way to detect PMTUD blackholes?

DF ping tests plus packet capture for ICMP “frag needed” / “packet too big.” If large DF packets disappear and you never see the ICMP error, something is filtering it.
Also look for stalled TCP flows with high retransmits and no forward progress.

6) In Kubernetes, should I set node MTU to 9000?

Only if your underlay truly supports jumbo end-to-end and you set the CNI/overlay MTU appropriately. Otherwise, keep the underlay at 1500 and use a lower
overlay MTU that accounts for encapsulation.

7) Why does the problem show up only for some destinations?

Different destinations can take different network paths with different MTUs (different routers, firewalls, VPN tunnels, cloud edges). MTU mismatches are
path-dependent by nature.

8) How do I fix it quickly if I can’t change the network today?

Mitigate by lowering MTU on the affected interface(s) to the smallest safe value, or clamp TCP MSS at the boundary so endpoints stop sending too-large
segments. Then schedule the real fix: align MTU end-to-end or fix ICMP policy.

9) Does Ubuntu 24.04 do anything “special” with MTU?

The usual complexity comes from netplan rendering into systemd-networkd or NetworkManager, plus stacked interfaces (bonds, bridges, VLANs, tunnels).
The OS isn’t special; your topology is. Verify the MTU at runtime with ip -br link, not just in YAML.

10) Should I allow all ICMP through the firewall?

You should allow the ICMP types needed for normal operation, especially fragmentation-needed/packet-too-big and related unreachable messages. Blocking all
ICMP is a blunt instrument that breaks PMTUD and makes outages harder to diagnose.

Conclusion: next steps that don’t ruin your weekend

Jumbo frames don’t “kind of” work. They either work end-to-end, or they create a selective reality distortion field where small packets thrive and big
packets vanish. That’s why you see “only some traffic” break.

Do this next:

  • Run DF ping tests at 1500 and jumbo sizes between the peers that matter.
  • Find the smallest MTU on the path by checking every interface layer: NIC, bond, VLAN, bridge, tunnel.
  • Verify PMTUD by capturing ICMP “frag needed” / “packet too big” messages; fix firewall policies that block them.
  • Pick a strategy: true end-to-end jumbo for controlled domains, or conservative underlay MTU with correct overlay MTU and MSS clamping.
  • Roll out in rings with a canary and a rollback plan. Production rewards caution, not courage.