Office VPN MTU/MSS Issues: Why Large Files Hang (SMB/RDP) and How to Fix It

Everything “works” over the office VPN until someone tries to copy a big file from a Windows share, or an RDP session freezes the moment you drag a folder. Small stuff? Fine. Directory listings? Fine. A 6 GB installer or a PST file? Suddenly the progress bar stops, the session gets sticky, and you start hearing the words “the VPN is flaky.”

Nine times out of ten, the VPN isn’t flaky. Your packets are. More specifically: MTU and MSS mismatches turn large, long-lived TCP transfers into a slow-motion failure. The fix is usually boring, measurable, and immediate—once you stop guessing and start proving.

The failure mode: why “large files” expose MTU/MSS bugs

When someone says, “The VPN connects but copying large files hangs,” they’re describing a classic path MTU problem that looks like a performance issue, a firewall issue, or a storage issue—until you measure it.

Here’s what’s typically happening:

  • Your endpoints believe they can send packets up to some size (often 1500 bytes on Ethernet).
  • The VPN adds overhead (encapsulation, encryption, authentication). Now the effective payload MTU inside the tunnel is smaller than 1500.
  • Some device on the path can’t forward those larger packets and tries to signal that fact using ICMP “Fragmentation needed” (or the IPv6 equivalent).
  • That ICMP message gets blocked or never makes it back (hello, “security hardening”).
  • TCP keeps retransmitting segments that never successfully traverse the path. Small flows survive because they fit; big flows stall because they don’t.

This is the definition of a PMTUD black hole: Path MTU Discovery is trying to do its job, but the feedback channel is broken.

SMB and RDP don’t necessarily fail instantly. They fail in the most annoying way possible: they partially work. The first few megabytes might transfer, then you hit a packet that must be larger than the path allows, and the session moves from “fine” to “haunted.”

One practical rule: if small packets work but large ones stall, stop blaming the storage, stop blaming DNS, and start testing MTU.

Interesting facts and historical context

  1. 1500 bytes became “the default” largely because of Ethernet, not because 1500 is optimal—just convenient and widely interoperable.
  2. PPPoE famously reduces MTU to 1492 (8 bytes of overhead), which is why “1492” is a recurring character in VPN troubleshooting lore.
  3. IPsec ESP overhead is not a single fixed number; it depends on tunnel vs transport mode, NAT-T, cipher, integrity, padding, and alignment.
  4. PMTUD dates back to the late 1980s/early 1990s, created to reduce fragmentation and improve efficiency—great idea, fragile in the face of ICMP filtering.
  5. “ICMP is blocked for security” has been breaking networks for decades; it’s the operational equivalent of taping over your dashboard warning lights.
  6. IPv6 tries to make fragmentation saner by pushing fragmentation responsibility to endpoints—yet PMTUD black holes still happen when ICMPv6 is mishandled.
  7. MSS clamping became mainstream because PMTUD often fails in the real world; it’s a pragmatic workaround when you can’t trust the path to signal MTU correctly.
  8. Jumbo frames (9000 MTU) made MTU mismatches more common in mixed environments: great inside a data center, messy when accidentally leaked into WAN expectations.

MTU, MSS, PMTUD: what matters in production

MTU: the largest packet you can push through a link

MTU (Maximum Transmission Unit) is the largest IP packet size that can traverse an interface without fragmentation at that layer. On Ethernet, 1500 is common. VPNs wrap your packet in another packet. That wrapper costs bytes.

If your inner packet is 1500 bytes and the VPN adds 60–100+ bytes of overhead, the outer packet might be 1560–1600 bytes. If any hop on the path only allows 1500, something has to give.

MSS: the largest TCP payload per segment

MSS (Maximum Segment Size) is the TCP payload size, excluding IP and TCP headers. When you see MSS 1460, that’s 1500 MTU minus 20 bytes IPv4 header minus 20 bytes TCP header.

When MTU shrinks (due to VPN overhead), the safe MSS shrinks too. If you don’t adjust MSS and PMTUD fails, TCP will try to send segments that are too large for the path, and you get stalls under load.
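If you want to sanity-check the arithmetic in a shell, here’s a minimal sketch; the 1392 is an assumed path MTU from a DF ping test, not a universal constant:

cr0x@server:~$ PATH_MTU=1392; echo "IPv4 MSS ceiling: $((PATH_MTU - 40))"
IPv4 MSS ceiling: 1352

Subtract 40 for IPv4 (20-byte IP header plus 20-byte TCP header). For IPv6, subtract 60, because the base IPv6 header is 40 bytes.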

PMTUD: the feedback loop that’s easy to break

Path MTU Discovery relies on routers telling senders, “That packet is too big; here’s the MTU you need.” On IPv4, that’s ICMP “Fragmentation needed” (type 3, code 4). On IPv6, it’s ICMPv6 “Packet Too Big.”

Block those messages, and PMTUD becomes “Path MTU Guessing.” Guess wrong, and TCP burns time retransmitting segments that never get delivered.
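You can watch for that feedback directly while reproducing a stall. A minimal sketch, assuming eth0 is the WAN-facing interface on your gateway; icmp[0] is the ICMP type byte and icmp[1] the code:

cr0x@server:~$ sudo tcpdump -ni eth0 'icmp[0] == 3 and icmp[1] == 4'

Silence while a large transfer is visibly stalling means either no hop is generating “Fragmentation needed” or something is eating it on the way back. Both are black holes from the sender’s perspective.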

Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails; design to recover quickly.” MTU issues are a perfect example—build for the failure mode instead of hoping the path is polite.

One-sentence operational truth: if you can’t guarantee PMTUD works end-to-end, clamp MSS at the tunnel edge.

Joke #1 (short and relevant): MTU bugs are like office printers: they only jam when the CEO needs something in five minutes.

Why SMB and RDP are the first to complain

SMB: chatty, stateful, and extremely good at revealing packet loss

SMB file copies are long-lived TCP streams with bursts. They also do a bunch of control chatter that’s sensitive to latency and retransmits. When MTU is wrong, the large data segments get stuck while smaller control messages may still slip through. That produces a maddening symptom: the share browses, authentication works, folder listings appear, but the file copy freezes.

SMB3 adds encryption and signing options that can change packet sizes and behavior. That’s not “the cause,” but it can be the spark that pushes a borderline MTU into failure.

RDP: it will keep the session alive while your payload chokes

RDP can appear connected because the keepalives and small screen updates fit. Clipboard redirection, drive mapping, printing, and file copy are where larger transfers show up and the TCP stream begins to suffer. Users describe it as “RDP freezing,” but often it’s the redirected I/O path stalling while the session threads keep limping along.

Why web browsing often looks fine

Modern web traffic is many short connections, with congestion control that recovers quickly, and a lot of content delivered in chunks that may not require sustained large segments. Also, browsers retry aggressively and hide failures behind spinners. SMB copies are honest. They just stop and stare at you.

Fast diagnosis playbook

This is the order that gets you to truth fast, without spending an afternoon rewriting VPN configs because “it feels like MTU.”

First: confirm the symptom is size-related, not name-resolution or auth

  • Can you list the SMB share reliably but large file copies stall?
  • Does RDP connect instantly but file copy over redirected drive freezes?
  • Do small pings work but large pings with DF set fail?

Second: determine the working path MTU with a DF ping test

  • Test from client to server (and reverse if possible).
  • Find the largest payload that succeeds without fragmentation.
  • If you see “Frag needed” sometimes, you’re lucky; if you see silence, you may have a black hole.
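The Linux version of this test appears in the tasks below. From a Windows client, the equivalent uses -f (set DF) and -l (payload size); the target address here is an assumption:

C:\> ping -f -l 1472 10.20.30.40

If Windows answers “Packet needs to be fragmented but DF set,” the local stack already knows the limit; if the pings simply time out, suspect a black hole further along the path.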

Third: check tunnel overhead assumptions and clamp MSS at the edge

  • Identify whether you’re using IPsec (ESP, NAT-T), OpenVPN (UDP/TCP), WireGuard, or SSL VPN appliances.
  • Set tunnel interface MTU appropriately, and/or clamp TCP MSS on the VPN gateway/firewall.
  • Retest large file copy and long-lived TCP streams.

Fourth: verify ICMP handling (don’t guess)

  • Allow required ICMP types through the tunnel and on WAN edges.
  • Confirm PMTUD works, or accept that it doesn’t and rely on MSS clamping.
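On a Linux edge, the allow rules can be this small; a minimal sketch assuming iptables/ip6tables (if the box forwards transit traffic, add matching FORWARD rules):

cr0x@server:~$ sudo iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
cr0x@server:~$ sudo ip6tables -A INPUT -p ipv6-icmp --icmpv6-type packet-too-big -j ACCEPT

Order matters: these must land before any blanket ICMP drop, or they will never match.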

Fifth: only then chase “performance” tuning

  • SMB multichannel, window scaling, offloads—those are second-order once MTU is correct.
  • Don’t optimize a broken path. You’ll just create faster failure.

Practical tasks with commands: prove it, then fix it

Below are real tasks you can run. Each one includes: command, sample output, what it means, and the decision you make.

Task 1: Identify the VPN interface and its MTU (Linux)

cr0x@server:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none

What it means: The tunnel interface (wg0) is already at 1420, a common WireGuard choice. If you see 1500 on a VPN interface, that’s a red flag in most WAN scenarios.

Decision: If MTU is 1500 on the tunnel, plan to lower it or clamp MSS.

Task 2: Check route MTU hints (Linux)

cr0x@server:~$ ip route get 10.20.30.40
10.20.30.40 dev wg0 src 10.20.30.1 uid 1000
    cache

What it means: The traffic to the remote host uses wg0. If it unexpectedly uses eth0, your split-tunnel rules may be wrong, and you’re debugging the wrong path.

Decision: Confirm you’re actually testing over the VPN path.

Task 3: Find the real path MTU using ping with DF (Linux, IPv4)

cr0x@server:~$ ping -M do -s 1472 -c 3 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.

--- 10.20.30.40 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2042ms

What it means: A 1500-byte packet (1472 payload + 28 bytes of headers) didn’t make it. If you got an explicit “Frag needed,” PMTUD is at least partially functioning. Silence can mean a black hole. One caveat: if the sending interface MTU is already below 1500 (like wg0 at 1420 above), Linux refuses locally with “ping: local error: message too long” instead of sending anything; run the test from a host that still believes in 1500 if you want to probe the path itself.

Decision: Reduce the payload size until it succeeds; record the largest working value.

Task 4: Binary search MTU until it works (Linux)

cr0x@server:~$ ping -M do -s 1364 -c 3 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1364(1392) bytes of data.
1372 bytes from 10.20.30.40: icmp_seq=1 ttl=63 time=31.2 ms
1372 bytes from 10.20.30.40: icmp_seq=2 ttl=63 time=30.9 ms
1372 bytes from 10.20.30.40: icmp_seq=3 ttl=63 time=31.0 ms

--- 10.20.30.40 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 30.9/31.0/31.2/0.1 ms

What it means: Your path MTU is at least 1392 bytes for IPv4 packets. That suggests an MSS around 1352 (1392 – 40 bytes IP+TCP headers), and you should clamp below that to be safe.

Decision: Set tunnel MTU or MSS clamp to a conservative value (often 1360–1380 MTU equivalent for the inner path, depending on overhead and variability).
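If you’d rather not bisect by hand, a short loop does it for you. A minimal sketch, assuming a Linux client, an ICMP-reachable target, and bounds that bracket the real limit:

lo=1200; hi=1473; tgt=10.20.30.40
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))                     # candidate payload size
  if ping -M do -s "$mid" -c 1 -W 2 "$tgt" >/dev/null 2>&1; then
    lo=$mid                                    # worked: search higher
  else
    hi=$mid                                    # failed: search lower
  fi
done
echo "largest working payload: $lo (path MTU ~ $((lo + 28)))"

The +28 covers the 20-byte IPv4 header and 8-byte ICMP header, so the echoed value is directly comparable to interface MTU numbers.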

Task 5: Observe TCP MSS in SYN packets with tcpdump (Linux)

cr0x@server:~$ sudo tcpdump -ni wg0 'tcp[tcpflags] & (tcp-syn) != 0' -c 3
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
12:10:11.123456 IP 10.20.30.10.51123 > 10.20.30.40.445: Flags [S], seq 123456789, win 64240, options [mss 1460,sackOK,TS val 111 ecr 0,nop,wscale 7], length 0
12:10:12.234567 IP 10.20.30.10.51124 > 10.20.30.40.3389: Flags [S], seq 987654321, win 64240, options [mss 1460,sackOK,TS val 112 ecr 0,nop,wscale 7], length 0

What it means: MSS 1460 is being advertised over a tunnel that likely cannot support 1500 inner MTU end-to-end. This is where large transfers go to die.

Decision: Implement MSS clamping on the VPN egress/ingress (or fix the tunnel MTU so endpoints advertise the right MSS).

Task 6: Clamp MSS with iptables (Linux gateway)

cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360

What it means: New TCP connections forwarded out wg0 will have their MSS adjusted to 1360, preventing oversized segments even if PMTUD fails.

Decision: Choose the MSS based on measured path MTU and overhead. Err low by ~20–40 bytes if the path changes (LTE, home Wi‑Fi, ISP quirks).
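If the gateway runs nftables instead, the equivalent clamp looks like this; the table and chain names are assumptions, so match them to your ruleset:

cr0x@server:~$ sudo nft add rule ip mangle forward oifname "wg0" tcp flags syn tcp option maxseg size set 1360

iptables also offers --clamp-mss-to-pmtu instead of a fixed --set-mss (nftables: size set rt mtu), which derives the clamp from the route MTU; that only helps if your tunnel interface MTU is already honest.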

Task 7: Verify MSS clamp is taking effect (Linux)

cr0x@server:~$ sudo tcpdump -ni wg0 'tcp[tcpflags] & (tcp-syn) != 0' -c 2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
12:12:44.111111 IP 10.20.30.10.51200 > 10.20.30.40.445: Flags [S], seq 11111111, win 64240, options [mss 1360,sackOK,TS val 210 ecr 0,nop,wscale 7], length 0
12:12:45.222222 IP 10.20.30.10.51201 > 10.20.30.40.3389: Flags [S], seq 22222222, win 64240, options [mss 1360,sackOK,TS val 211 ecr 0,nop,wscale 7], length 0

What it means: The SYN now advertises MSS 1360. You’ve taken packet size drama off the table.

Decision: Re-test SMB copy and RDP file transfer. If still bad, move to loss/latency diagnosis.

Task 8: Check if ICMP “fragmentation needed” is being dropped (Linux firewall counters)

cr0x@server:~$ sudo iptables -L INPUT -v -n | head
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 120K  96M ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
  120  9600 ACCEPT     icmp --  eth0   *       0.0.0.0/0            0.0.0.0/0            icmptype 3 code 4
    0     0 DROP       icmp --  eth0   *       0.0.0.0/0            0.0.0.0/0

What it means: You’re explicitly allowing ICMP type 3 code 4 (frag needed). If those counters are zero while you reproduce the issue, the message might be blocked upstream or not generated.

Decision: Ensure ICMP “frag needed” and “packet too big” are permitted across edges. If you can’t guarantee that, keep MSS clamping.

Task 9: Confirm WireGuard MTU setting (Linux)

cr0x@server:~$ sudo wg show
interface: wg0
  public key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=
  private key: (hidden)
  listening port: 51820

peer: YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY=
  endpoint: 203.0.113.10:51820
  allowed ips: 10.20.30.0/24
  latest handshake: 1 minute, 2 seconds ago
  transfer: 2.31 GiB received, 3.02 GiB sent
  persistent keepalive: every 25 seconds

What it means: WireGuard doesn’t show MTU here; you must check the interface MTU via ip link. The handshake and transfers confirm the tunnel is alive; MTU can still be wrong.

Decision: If the interface MTU is too high, lower it and keep MSS clamping for safety in mixed client networks.

Task 10: Set MTU on a Linux tunnel interface (WireGuard example)

cr0x@server:~$ sudo ip link set dev wg0 mtu 1380

What it means: You’re forcing the tunnel MTU down. Endpoints will generally choose smaller MSS, reducing reliance on PMTUD.

Decision: Prefer setting MTU correctly on the tunnel plus MSS clamping on the edge if you serve unmanaged clients.
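An ip link change won’t survive the next wg-quick restart. If that’s how the tunnel is managed, pin the value in the config instead; a minimal sketch, with the path and address being illustrative:

# /etc/wireguard/wg0.conf
[Interface]
Address = 10.20.30.1/24
MTU = 1380

wg-quick applies the MTU key when it brings the interface up, so the setting persists across reboots and tunnel restarts.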

Task 11: Inspect MTU on Windows (client-side reality check)

cr0x@server:~$ powershell -NoProfile -Command "Get-NetIPInterface | Sort-Object -Property InterfaceMetric | Select-Object -First 8 ifIndex,InterfaceAlias,AddressFamily,NlMtu,InterfaceMetric"
ifIndex InterfaceAlias AddressFamily NlMtu InterfaceMetric
------- -------------- ------------- ----- ---------------
     12 Wi-Fi          IPv4           1500              25
     19 VPN - Corp     IPv4           1500              35
     12 Wi-Fi          IPv6           1500              25
     19 VPN - Corp     IPv6           1500              35

What it means: Windows believes the VPN interface MTU is 1500. If the underlying path can’t support that inside the tunnel, you will see stalls unless PMTUD works perfectly.

Decision: Fix MTU on the VPN adapter profile where supported, or clamp MSS on the VPN gateway so clients don’t need to be perfect.
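Where the client software tolerates it, you can pin the adapter MTU from PowerShell; a minimal sketch, with the interface alias taken from the listing above (some VPN clients rewrite this on every reconnect, which is exactly why gateway-side MSS clamping is the safer default):

cr0x@server:~$ powershell -NoProfile -Command "Set-NetIPInterface -InterfaceAlias 'VPN - Corp' -NlMtuBytes 1380"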

Task 12: Test SMB throughput and stalls from Windows without guesswork

cr0x@server:~$ powershell -NoProfile -Command "robocopy \\10.20.30.40\share C:\Temp\mtu-test bigfile.bin /J /R:1 /W:1 /NP"
-------------------------------------------------------------------------------
   ROBOCOPY     ::     Robust File Copy for Windows
-------------------------------------------------------------------------------
  Started : Sunday, December 28, 2025 12:20:01
   Source : \\10.20.30.40\share\
     Dest : C:\Temp\mtu-test\

    Files : bigfile.bin

  Options : /R:1 /W:1 /J

------------------------------------------------------------------------------

	            New File               6.0 g	\\10.20.30.40\share\bigfile.bin
	              0.0%

What it means: If this hangs at 0% or stalls at a consistent point, you likely have a transport problem (MTU/MSS, loss, firewall inspection), not a file permission issue. The /J switch uses unbuffered I/O, reducing client-side caching illusions.

Decision: Re-run after MSS clamp / MTU change. If it becomes stable immediately, you’ve proven the cause.

Task 13: Validate OpenVPN-style MSS fix on Linux client (if applicable)

cr0x@server:~$ grep -E 'mssfix|tun-mtu|fragment' -n /etc/openvpn/client.conf
41:mssfix 1360
42:tun-mtu 1400

What it means: OpenVPN can enforce smaller packetization. Be careful: the fragment directive adds OpenVPN-level fragmentation and reassembly and is generally discouraged; mssfix is usually the safer lever.

Decision: Prefer MSS fixes over fragmentation hacks. If you must touch MTU, do it consistently on both ends.

Task 14: Spot retransmits that scream “black hole” (Linux)

cr0x@server:~$ ss -ti dst 10.20.30.40 | head -n 20
ESTAB 0 0 10.20.30.10:51200 10.20.30.40:445
	 cubic wscale:7,7 rto:204 retrans:8/12 lost:0 sacked:0 unacked:1
	 bytes_sent:10485760 bytes_retrans:5242880 bytes_acked:5242880

What it means: High retransmits and retransmitted bytes during an SMB session are consistent with packets being dropped. MTU black holes often produce repeated retransmits at a steady cadence.

Decision: If MSS clamping reduces retransmits and restores throughput, keep it. If not, look for loss, shaping, or broken middleboxes.
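For a system-wide view that complements the per-socket numbers, kernel counters work too; a minimal sketch using nstat (it reports deltas since its previous run, and the figures below are illustrative):

cr0x@server:~$ nstat TcpOutSegs TcpRetransSegs
#kernel
TcpOutSegs                      52344              0.0
TcpRetransSegs                  2410               0.0

If TcpRetransSegs climbs sharply only while the large copy runs, you’re watching the black hole in real time.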

Three corporate mini-stories (anonymized)

1) The incident caused by a wrong assumption

They had a tidy setup: a small HQ, a handful of branch offices, and a “modern” IPsec VPN between sites. The firewall vendor’s marketing promised “automatic path optimization,” so the team assumed MTU was handled. Nobody measured it because it “worked.”

The first big incident hit on a payroll morning. Accounting couldn’t open large spreadsheets stored on an SMB share at HQ. Small spreadsheets opened instantly. Large ones would load halfway, then Excel would hang long enough for people to start force-quitting and filing tickets. RDP sessions into the finance terminal server stayed connected but became sluggish whenever anyone copied files via redirected drives.

The storage team got pulled in because “files are slow.” They checked server CPU, disk latency, and SMB counters. Everything looked normal. Network declared “no packet loss” based on interface errors and a few pings. The VPN team insisted encryption was stable because the tunnel stayed up.

The wrong assumption was simple: “If the tunnel is up, MTU must be fine.” A DF ping test across the tunnel failed above a payload that implied an inner MTU around the high 1300s. PMTUD messages were blocked by an upstream ACL that someone had copy-pasted years ago. Once they clamped MSS on the branch firewall, the issue evaporated instantly—no storage changes, no server tuning, no Excel voodoo.

They could have saved a day by measuring packet size first. Instead, they held three meetings and blamed a file server that was just doing its job.

2) The optimization that backfired

A different company wanted “better performance” for remote engineers. Someone enabled jumbo frames inside a virtual switch environment because the SAN team had good results with 9000 MTU in the data center. Then they extended the same VLAN profile to a set of VPN termination VMs because “consistency.”

Locally, things screamed. iSCSI traffic improved. East-west VM traffic looked great. Over the VPN, it was a slow disaster. Some clients worked fine, others had intermittent stalls. SMB copies over the VPN would start fast, then crater. RDP felt like it was running on a dial-up line whenever someone moved files.

The backfire mechanism wasn’t magical. It was the worst of both worlds: servers and VMs began assuming they could send larger frames, while the WAN and consumer ISPs absolutely could not. PMTUD should have saved them, but an IDS rule set was dropping ICMP as “noise.” The result was a black hole that only appeared under larger segment sizes. Smaller interactive flows stayed mostly okay, which made it hard to convince management it was a network problem.

The fix was to stop treating “jumbo frames everywhere” as a personality trait. They reverted MTU on the VPN-facing interfaces to a sane value, enabled MSS clamping, and explicitly allowed ICMP “packet too big” through the security stack. Performance stabilized, and the only lasting damage was the team’s confidence in “one MTU to rule them all.”

3) The boring but correct practice that saved the day

An IT team running a mix of WireGuard for developers and IPsec for site-to-site did something unfashionable: they maintained a one-page “packet size contract.” It listed expected MTU per tunnel type, overhead assumptions, and the standard MSS clamp values they used at each edge.

When a new ISP circuit was added at a branch, a junior engineer noticed that large SMB copies to HQ were stalling, but only from that branch. Before escalating, they ran the standard DF ping test from their checklist and documented the maximum working payload. It was smaller than expected by about 60 bytes.

Because they had a contract, they didn’t debate whether to “tune SMB” or “upgrade the file server.” They immediately checked the new ISP handoff and found an extra encapsulation layer introduced by the provider CPE. Nobody had mentioned it; nobody was surprised. They lowered the tunnel MTU and kept the MSS clamp consistent.

That branch went live on schedule. Nobody outside IT noticed. This is what good operations looks like: dull, repeatable, and quietly correct.

Fix strategies that actually hold up

Strategy A: Set the tunnel interface MTU correctly (best when you control both ends)

If you manage both VPN endpoints (site-to-site, managed clients), setting MTU on the tunnel interface is clean. It causes endpoints to choose smaller packet sizes and reduces fragmentation and retransmits.

What to avoid: random MTU numbers copied from forum posts. Measure your path and pick values with margin.

Strategy B: Clamp TCP MSS at the VPN edge (best when clients are messy)

MSS clamping is the workhorse fix. It doesn’t require every client to behave correctly. It doesn’t require PMTUD to succeed. It just forces TCP to segment conservatively.

Where to apply: on the forwarding path that carries client traffic into the tunnel (FORWARD chain on Linux gateways, firewall policy on appliances, or equivalent).

What to avoid: clamping too low “just to be safe.” You can kneecap throughput if you choose a tiny MSS. Conservative is good; paranoid is expensive.

Strategy C: Fix ICMP handling so PMTUD works (best for correctness, often politically hard)

If you can, allow the essential ICMP messages. You don’t have to allow all ICMP. You have to allow the ICMP types that make the Internet function.

Operational reality: security teams often fear ICMP because it’s visible and easy to misunderstand. Your job is to explain that PMTUD is not an optional luxury. It’s plumbing.

Strategy D: Remove extra encapsulation layers (best for long-term sanity)

Sometimes MTU is wrong because you have multiple overlays stacked: GRE over IPsec over PPPoE over LTE, with a side of NAT. Each layer steals bytes and adds failure modes.

What to do: simplify. If you need both encryption and routing, prefer one well-supported tunnel rather than nesting.

Strategy E: Don’t “fix” it by forcing TCP-over-TCP unless you enjoy pain

OpenVPN over TCP can hide some symptoms but introduces meltdown behavior under loss (TCP retransmits inside TCP retransmits). It can make stalling feel less like stalling and more like molasses.

Recommendation: prefer UDP-based tunnels for VPN transport, then manage MTU/MSS cleanly.

Joke #2 (short and relevant): Blocking ICMP to “improve security” is like removing the oil light to prevent engine warnings.

Common mistakes: symptoms → root cause → fix

1) Symptom: SMB share browses, but copying a large file hangs at 0% or a few percent

Root cause: MSS too high; large TCP segments can’t traverse the tunnel; PMTUD feedback is blocked.

Fix: clamp MSS on VPN gateway (e.g., 1360) and/or lower tunnel MTU; allow ICMP frag-needed/packet-too-big.

2) Symptom: RDP is fine until file copy or clipboard transfer, then the session “freezes”

Root cause: RDP’s bulk channel hits MTU black hole; keepalives and small updates still pass.

Fix: same as above; verify MSS in SYN packets; re-test with a controlled file transfer.

3) Symptom: Works from some home ISPs but not others

Root cause: variable WAN MTU (PPPoE, LTE, DOCSIS quirks) plus a fixed tunnel MTU/MSS assumption.

Fix: choose conservative tunnel MTU (e.g., 1380–1420 depending on tech), clamp MSS, and don’t rely on PMTUD through consumer gear.

4) Symptom: Small pings work; large DF pings fail with no error

Root cause: PMTUD black hole; ICMP messages are dropped.

Fix: permit required ICMP types; if that’s a political battle, clamp MSS and move on with your life.

5) Symptom: “We lowered MTU and now everything is slower”

Root cause: MTU was lowered far more than necessary; smaller segments mean more packets and more header overhead per byte, and CPU cost rises with them.

Fix: measure real maximum MTU and set MTU/MSS with minimal safety margin. Don’t cut to 1200 unless you must.

6) Symptom: Only SMB over VPN is bad; iperf looks okay

Root cause: test mismatch; iperf may use multiple flows, a different MSS, or tolerate retransmits differently, while SMB exposes the retransmit pain directly.

Fix: reproduce with DF ping and packet capture; verify MSS; test with a single long TCP flow similar to SMB.

7) Symptom: Problems started after “security hardening”

Root cause: ICMP filtering or VPN inspection features interfering with fragmentation/PMTUD or TCP options.

Fix: explicitly allow ICMP types; disable broken “helpful” features (e.g., aggressive MSS rewriting without understanding).

8) Symptom: VPN works for months, then fails intermittently during peak hours

Root cause: path changes (new ISP route), MTU variability, or a new middlebox dropping ICMP under load.

Fix: clamp MSS (stability), then investigate ICMP behavior and path MTU across time.

Checklists / step-by-step plan

Step-by-step: from “SMB hangs” to stable transfers

  1. Pick a single failing flow: one client, one server, one file share, one big file.
  2. Confirm routing: ensure traffic is actually using the VPN interface (Linux: ip route get).
  3. Run DF ping tests to find maximum working payload (Linux: ping -M do -s). Document it.
  4. Capture SYN packets and note advertised MSS (Linux: tcpdump filter for SYN).
  5. Compare MSS to measured path MTU: if MSS implies 1500 MTU but you measured ~1392 MTU, you found the mismatch.
  6. Implement MSS clamping on the VPN edge for forwarded traffic; keep a note of the value and why.
  7. Optionally lower tunnel MTU so endpoints naturally advertise a safer MSS.
  8. Re-test the same file copy. Don’t change three things at once; keep it controlled.
  9. Verify retransmits dropped (Linux: ss -ti, or packet capture statistics).
  10. Decide whether to fix ICMP properly: allow required ICMP types or accept MSS clamping as the permanent workaround.
  11. Roll out in layers: pilot one branch, then expand; avoid “big bang MTU changes” across the fleet.
  12. Write a packet size contract: standard MTU/MSS per VPN type, and how to measure it. This is the boring practice that saves weekends.

Policy checklist: what to allow through firewalls

  • IPv4: ICMP type 3 code 4 (“Fragmentation needed”) in the right directions.
  • IPv6: ICMPv6 “Packet Too Big” and related neighbor discovery (don’t break IPv6 fundamentals while you’re at it).
  • Rate limiting: reasonable ICMP rate limiting is fine; blanket drops are not.

Operational checklist: what to record in tickets

  • Client network type (home Wi‑Fi, LTE hotspot, office LAN).
  • VPN type and encapsulation (IPsec NAT-T, WireGuard, SSL VPN).
  • Largest DF ping payload that succeeds.
  • MSS observed in SYN before and after changes.
  • Whether ICMP frag-needed/packet-too-big messages are seen.
  • Whether the issue is symmetric (client→server and server→client).

FAQ

1) Why do small files copy fine but large files hang?

Small transfers often fit into smaller TCP segments or complete before the connection hits problematic packet sizes. Large transfers sustain larger segments and eventually hit the path’s real MTU limit.

2) If PMTUD exists, why do we need MSS clamping?

Because PMTUD depends on ICMP messages surviving the trip back. In many corporate and consumer networks, those messages are blocked, rate-limited, or mangled. MSS clamping removes the dependency.

3) What MSS value should I clamp to?

Use measurement. Find the maximum working IPv4 packet size with DF ping (say, 1392). Subtract 40 bytes for IP+TCP headers: MSS ≈ 1352. Then pick a slightly lower clamp (e.g., 1340–1352) to account for variability.

4) Is lowering tunnel MTU better than MSS clamping?

If you control both tunnel endpoints and clients are consistent, setting the tunnel MTU is clean. In mixed client environments (home networks, roaming laptops), MSS clamping at the gateway is more robust.

5) Can SMB encryption or signing cause this?

It can make packet sizes and throughput behavior change, which might expose an existing MTU problem sooner. But the root is still transport: packets too large for the path and no working feedback.

6) Why does it sometimes fail only one direction?

Asymmetric paths are common: different ISPs, different NAT devices, different filtering rules. One direction might allow ICMP “packet too big” while the return direction drops it.

7) Does this apply to IPv6 too?

Yes. IPv6 relies heavily on ICMPv6 for PMTUD. If you block ICMPv6 “Packet Too Big,” you can create the same black hole behavior, often with even more confusion.

8) Can I just allow fragmentation and be done?

Relying on fragmentation is usually a last resort. It adds overhead, increases loss sensitivity, and can interact badly with firewalls and NAT. Prefer correct MTU and MSS clamping.

9) Why does speed test traffic look fine while SMB hangs?

Different tools use different patterns (multiple flows, different segment sizes, UDP vs TCP). SMB is a long-lived, stateful TCP stream that’s excellent at exposing retransmit-heavy conditions.

10) We use an all-in-one firewall appliance. Where do I apply MSS clamping?

On the policy rule that allows traffic from the inside to the VPN zone/tunnel, or on the VPN interface egress. Vendors name it differently, but the goal is the same: rewrite MSS on SYN packets crossing into the tunnel.

Conclusion: next steps that won’t embarrass you

If your office VPN “mostly works” but large SMB copies hang or RDP file transfers freeze, treat it like an MTU/MSS incident until proven otherwise. Measure path MTU with DF pings. Observe MSS in SYN packets. Clamp MSS at the VPN edge. Then—only then—argue about SMB tuning or storage performance.

Next steps you can do today:

  1. Run a DF ping test across the VPN and record the largest working payload.
  2. Capture a SYN and confirm what MSS is being advertised.
  3. Implement MSS clamping on the VPN gateway for tunnel-bound traffic.
  4. Optionally adjust the tunnel interface MTU to match reality.
  5. Write down your chosen MTU/MSS values and the measurement behind them. Future you will want receipts.