DNS failures that come and go are the worst kind of outage: the graphs look “fine,” the resolver looks healthy, and yet a handful of domains keep timing out like it’s 1998.
If you’ve ever watched dig succeed for one name and hang for another, or seen DNS work on your phone hotspot but not over VPN, you’ve met the quiet villain: MTU, fragmentation, and the long chain of “it should work” assumptions between client and resolver.
What’s actually happening: why MTU breaks DNS
DNS is small until it isn’t. A-record lookups for modest zones usually fit comfortably inside a single UDP packet. But the moment responses grow—DNSSEC signatures, multiple records, long TXT strings (hello, SPF/DMARC), SVCB/HTTPS records, or just a resolver that gets chatty—you can end up with a response that exceeds the Path MTU between client and server.
When that happens, one of two things must be true for UDP DNS to work:
- The response must be fragmented, and every fragment must arrive at the client (IP reassembly tolerates reordering, but not loss).
- Or the client and server must negotiate a smaller payload size (typically via EDNS(0)) so fragmentation never happens.
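The size arithmetic behind that second option is simple. A minimal Python sketch, assuming IPv4 with no IP options (the header constants are standard; the function names are mine):

```python
# IPv4 and UDP header sizes; IP options would add more.
IP_HEADER = 20
UDP_HEADER = 8

def will_fragment(dns_payload: int, path_mtu: int) -> bool:
    """True if a UDP DNS response of this payload size exceeds the path MTU on the wire."""
    return dns_payload + IP_HEADER + UDP_HEADER > path_mtu

def safe_edns_size(path_mtu: int) -> int:
    """Largest EDNS(0) payload that fits in a single packet on this path."""
    return path_mtu - IP_HEADER - UDP_HEADER

# A 1600-byte DNSSEC answer over a 1400-MTU VPN tunnel: fragments.
print(will_fragment(1600, 1400))  # True
print(safe_edns_size(1400))       # 1372
```

This is also where the well-known 1232 figure comes from: it is roughly the IPv6 minimum MTU (1280) minus IPv6 and UDP headers.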
In real networks, fragmentation is fragile. Firewalls drop fragments. NAT devices “help” by timing them out. Some load balancers hash on UDP ports, which only the first fragment carries, so non-first fragments can hash differently and take another path—like a group project.
So the DNS query goes out, the resolver sends a too-large UDP response, that response fragments, one fragment disappears, and the client sits there waiting. Eventually it retries, maybe over TCP, maybe to another resolver, maybe not. From the outside you see: “DNS flakiness.” From the inside you see: “a networking layer 2–3 problem wearing a layer 7 mask.”
Path MTU Discovery (PMTUD) is supposed to prevent this mess by having routers send ICMP “Fragmentation Needed” (IPv4) or “Packet Too Big” (IPv6) so endpoints learn the maximum size. But PMTUD requires ICMP to get through. Many environments treat ICMP like an unwanted guest and then act surprised when the plumbing leaks.
One operational reality: DNS is often the first protocol to show MTU pain because it’s ubiquitous, it uses UDP by default, and it’s involved in almost every other failure cascade. When DNS breaks, everything looks broken. Your job is to prove it’s MTU, then fix the path so you don’t spend next week blaming the resolver.
Joke #1: If you block ICMP because it’s “insecure,” you’ve reinvented the network equivalent of removing the oil light because it was annoying.
The key mechanism: EDNS(0) and “how big is too big?”
Classic DNS limited UDP responses to 512 bytes. EDNS(0) (RFC 2671, later obsoleted by RFC 6891) extended DNS so clients could advertise a larger UDP buffer size; 1232 bytes is a common conservative choice today. That’s good: it reduces TCP fallback and improves performance. It’s also bad: it increases the chance you’ll exceed some broken path’s MTU and get silent drops.
So the “MTU breaks DNS” pattern frequently looks like this:
- Client sends DNS query with EDNS(0) advertising 4096 bytes.
- Resolver returns a 1600–3000 byte UDP response.
- Some hop can’t carry it (e.g., VPN tunnel, PPPoE link, overlay network, mis-set MTU on a vNIC).
- Fragmentation occurs or PMTUD should occur, but fragments/ICMP get blocked.
- Client times out, retries, or falls back to TCP—if it’s allowed.
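To see what “advertising 4096 bytes” means on the wire, here is a hand-rolled query builder in Python (stdlib only). Per RFC 6891 the OPT pseudo-record’s CLASS field carries the advertised UDP payload size; the function name and transaction ID are mine, and this sketch only builds the bytes, it doesn’t send them:

```python
import struct

def build_edns_query(name: str, bufsize: int = 1232, qtype: int = 1) -> bytes:
    """Build a minimal DNS query with an EDNS(0) OPT record advertising bufsize."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT RR)
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1)
    # QNAME as length-prefixed labels, then QTYPE and QCLASS=IN
    qname = b"".join(bytes([len(l)]) + l.encode() for l in name.split(".")) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)
    # OPT pseudo-RR: root name, TYPE=41, CLASS = requestor's UDP payload size,
    # TTL = extended RCODE/version/flags (0x00008000 would set the DO bit), RDLENGTH=0
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt

q = build_edns_query("example.com", bufsize=4096)
print(len(q))  # 40
```

Forty bytes of query can invite back kilobytes of response—the asymmetry is the whole problem.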
Why it’s often “only some domains”
MTU issues don’t break DNS so much as they break large DNS responses. That’s why you can resolve example.com all day while some-dnssec-heavy-domain.tld flakes out. Any domain with DNSSEC, large TXT records, many records in an RRset, or CNAME chains is a candidate.
The scariest part is that you can “fix” this accidentally by switching resolvers, caching, or waiting. None of those fix the path. They just move the symptom around until it bites a more important name at 2 a.m.
Facts & history that matter in production
- Fact 1: The original DNS UDP response limit was 512 bytes; EDNS(0) raised that limit by letting clients advertise a larger receive buffer.
- Fact 2: DNSSEC adoption made large responses common, because signatures and key material add bulk—especially for negative answers (NSEC/NSEC3 proofs).
- Fact 3: A widely used “safe” EDNS UDP payload size today is around 1232 bytes to avoid IPv6 fragmentation issues on typical paths.
- Fact 4: PPPoE reduces Ethernet MTU from 1500 to 1492, and tunnels on top of that can take you lower. You can lose another ~60–80 bytes quickly.
- Fact 5: IPv4 routers can fragment packets in transit, but IPv6 routers do not; only the sending endpoint may fragment, which makes PMTUD and ICMPv6 delivery much more important.
- Fact 6: Many middleboxes treat IP fragments as suspicious and drop them, sometimes intentionally, sometimes as a side effect of “security hardening.”
- Fact 7: Some NAT devices and firewalls track UDP flows poorly for fragments, because later fragments don’t include UDP port information.
- Fact 8: DNS fallback to TCP is part of the protocol’s design, but in enterprise networks TCP/53 is often blocked “because DNS is UDP.” That’s how you turn a mild MTU problem into an outage.
- Fact 9: Cloud overlay networks and encapsulation (VXLAN, Geneve, IPsec, GRE) routinely reduce effective MTU. When you stack tunnels, you stack overhead, not happiness.
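Fact 9’s “stacking overhead” is worth making concrete. A small Python sketch with rough per-layer costs (the exact numbers vary by options, ciphers, and inner/outer address families—treat these as illustrative, not authoritative):

```python
# Approximate encapsulation overheads in bytes (real values depend on configuration).
OVERHEAD = {
    "pppoe": 8,       # PPPoE + PPP headers
    "gre": 24,        # outer IPv4 (20) + GRE (4)
    "vxlan": 50,      # outer IPv4 + UDP + VXLAN + inner Ethernet
    "ipsec_esp": 73,  # tunnel-mode ESP with padding, near worst case
}

def effective_mtu(base: int, *layers: str) -> int:
    """Subtract stacked encapsulation overheads from a base link MTU."""
    return base - sum(OVERHEAD[l] for l in layers)

print(effective_mtu(1500, "pppoe"))         # 1492
print(effective_mtu(1500, "pppoe", "gre"))  # 1468
```

Stack PPPoE, IPsec, and GRE and you are well under 1400 before a single DNS byte moves.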
One quote worth remembering in ops culture, because it applies brutally well to MTU mysteries:
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” — Richard Feynman
Fast diagnosis playbook (do this first)
If DNS is flaky and you suspect MTU, don’t start by reinstalling systemd-resolved or arguing about which public resolver is best. Do this:
1) Confirm it’s “large-response DNS”
- Pick a domain that fails and one that succeeds.
- Test with EDNS enabled and disabled.
- Test UDP vs TCP explicitly.
2) Measure path MTU in the failing direction
- From client to resolver IP (not “the internet”).
- Use DF-bit pings (IPv4) or IPv6 pings with size.
- Expect to find a number lower than 1500 on VPNs, PPPoE, overlays, and some cloud paths.
3) Look for ICMP black holes and fragment drops
- Packet capture on client or near the resolver.
- Check firewall rules for ICMP “Fragmentation Needed” / “Packet Too Big.”
- Check whether TCP/53 works; if it does, you’re staring at a UDP fragmentation problem.
4) Apply the least risky mitigation first
- Reduce EDNS UDP size on clients/resolvers (e.g., 1232).
- Allow TCP/53 where appropriate.
- Fix tunnel/interface MTU and MSS clamping properly (don’t guess; measure).
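Step 2’s PMTU measurement is just a binary search for the largest size that passes. A sketch of that logic in Python, with the probe injected so the script runs anywhere (in the field the probe would wrap something like ping -M do -s <size-28>; here it’s simulated):

```python
def find_pmtu(probe, lo: int = 576, hi: int = 1500) -> int:
    """Binary-search the largest packet size for which probe(size) succeeds.

    Assumes the path is monotone: if a size fails, every larger size fails too.
    `probe` is a callable size -> bool; in production it would run a DF ping.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid      # mid fits; search upward
        else:
            hi = mid - 1  # mid too big; search downward
    return lo

# Simulated path with a 1400-byte bottleneck:
print(find_pmtu(lambda size: size <= 1400))  # 1400
```

Around eleven probes cover the whole 576–1500 range, which is why a manual “binary-ish search” with ping converges so quickly.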
Prove it with commands: 12+ real tasks and decisions
The following tasks are written like you’re on call: you run a command, interpret output, then make a decision. Run them from a client that experiences failures, and if possible from a host close to the resolver too.
Task 1: Reproduce the failure with dig and capture timing
cr0x@server:~$ dig @10.20.30.40 large-dnssec-domain.example A +tries=1 +time=2
; <<>> DiG 9.18.24 <<>> @10.20.30.40 large-dnssec-domain.example A +tries=1 +time=2
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
What it means: You have a client→resolver reachability problem at the DNS layer. Not enough yet to call it MTU.
Decision: Compare against a “small” domain and then force TCP to see if UDP is the problem.
Task 2: Compare with a known small response
cr0x@server:~$ dig @10.20.30.40 example.com A +tries=1 +time=2
; <<>> DiG 9.18.24 <<>> @10.20.30.40 example.com A +tries=1 +time=2
;; ANSWER SECTION:
example.com. 3600 IN A 93.184.216.34
;; Query time: 18 msec
What it means: Basic DNS to the resolver works. So it’s likely data-dependent: response size, fragmentation, DNSSEC, or packet filtering.
Decision: Force TCP for the failing name. If TCP succeeds, suspect UDP fragmentation/PMTUD.
Task 3: Force TCP DNS (bypasses UDP fragmentation issues)
cr0x@server:~$ dig @10.20.30.40 large-dnssec-domain.example A +tcp +tries=1 +time=2
; <<>> DiG 9.18.24 <<>> @10.20.30.40 large-dnssec-domain.example A +tcp +tries=1 +time=2
;; ANSWER SECTION:
large-dnssec-domain.example. 300 IN A 203.0.113.77
;; Query time: 42 msec
What it means: TCP/53 works and returns an answer. UDP path is the likely culprit, not the resolver’s data.
Decision: Confirm the response is large and triggers EDNS/fragmentation behavior.
Task 4: Inspect EDNS and UDP payload size negotiation
cr0x@server:~$ dig @10.20.30.40 large-dnssec-domain.example A +dnssec +bufsize=4096 +qr +tries=1 +time=2
; <<>> DiG 9.18.24 <<>> @10.20.30.40 large-dnssec-domain.example A +dnssec +bufsize=4096 +qr +tries=1 +time=2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; connection timed out; no servers could be reached
What it means: Advertising a large UDP buffer triggers failure. That’s consistent with MTU/fragment issues.
Decision: Try a smaller buffer (1232) and see if it becomes reliable.
Task 5: Reduce EDNS buffer to avoid fragmentation
cr0x@server:~$ dig @10.20.30.40 large-dnssec-domain.example A +dnssec +bufsize=1232 +tries=1 +time=2
; <<>> DiG 9.18.24 <<>> @10.20.30.40 large-dnssec-domain.example A +dnssec +bufsize=1232 +tries=1 +time=2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; ANSWER SECTION:
large-dnssec-domain.example. 300 IN A 203.0.113.77
;; Query time: 29 msec
What it means: Smaller UDP payload succeeds. Your path likely can’t reliably carry larger UDP responses.
Decision: Now you need to prove the MTU/path limitation, not just “EDNS weirdness.” Measure PMTU next.
Task 6: Measure IPv4 path MTU with DF pings
cr0x@server:~$ ping -M do -s 1472 -c 3 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1400
--- 10.20.30.40 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss
What it means: The local stack refuses to send a 1500-byte packet with DF set because the egress interface MTU is only 1400—you’re on a link with reduced MTU. Note this is a local error: nothing even left the machine.
Decision: Find the actual interface MTU and then binary-search the path with smaller sizes. Also check whether the resolver is reached over a tunnel interface.
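Keep the ping size arithmetic straight while you search: the -s value is the ICMP payload, and IPv4 adds 28 bytes of headers on top. A quick sketch (assumptions: IPv4, no IP options; function names are mine):

```python
ICMP_HEADER = 8
IP_HEADER = 20

def ping_wire_size(payload: int) -> int:
    """Bytes on the wire for `ping -s <payload>` over IPv4: payload + ICMP + IP headers."""
    return payload + ICMP_HEADER + IP_HEADER

def payload_for_mtu(mtu: int) -> int:
    """The -s value that exactly fills a given MTU."""
    return mtu - ICMP_HEADER - IP_HEADER

print(ping_wire_size(1472))   # 1500, the classic Ethernet-filling probe
print(payload_for_mtu(1400))  # 1372, the probe that exactly fits a 1400-MTU tunnel
```

This is why Task 8 below uses -s 1372 (fits a 1400 MTU exactly) and -s 1392 (overshoots it by 20 bytes).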
Task 7: Find interface MTU and the egress interface used
cr0x@server:~$ ip route get 10.20.30.40
10.20.30.40 via 10.8.0.1 dev tun0 src 10.8.0.10 uid 1000
cache
cr0x@server:~$ ip link show dev tun0
6: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 500
link/none
What it means: DNS to the resolver goes over tun0 with MTU 1400. That’s already smaller than Ethernet 1500; add headers and you can easily choke large UDP.
Decision: Test PMTU to the resolver with sizes under 1400 and confirm where it fails.
Task 8: Confirm the largest unfragmented payload to resolver (binary-ish search)
cr0x@server:~$ ping -M do -s 1372 -c 2 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1372(1400) bytes of data.
1380 bytes from 10.20.30.40: icmp_seq=1 ttl=63 time=31.2 ms
1380 bytes from 10.20.30.40: icmp_seq=2 ttl=63 time=30.7 ms
--- 10.20.30.40 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss
cr0x@server:~$ ping -M do -s 1392 -c 2 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1392(1420) bytes of data.
From 10.8.0.1 icmp_seq=1 Frag needed and DF set (mtu = 1400)
From 10.8.0.1 icmp_seq=2 Frag needed and DF set (mtu = 1400)
--- 10.20.30.40 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss
What it means: Your PMTU is 1400 on that path, and ICMP “Frag needed” is actually reaching you (good). If ICMP were blocked, you might see only timeouts instead.
Decision: Since PMTU is 1400, ask: why did 4096-byte EDNS fail? Because a 1400 MTU path will fragment UDP responses above that. If fragments are dropped, you get timeouts. Next: confirm fragment loss with tcpdump.
Task 9: Capture DNS packets and look for fragmentation behavior
cr0x@server:~$ sudo tcpdump -ni tun0 -vvv 'host 10.20.30.40 and (udp port 53 or tcp port 53)' -c 20
tcpdump: listening on tun0, link-type RAW (Raw IP), snapshot length 262144 bytes
IP 10.8.0.10.50821 > 10.20.30.40.53: 12345+ [1au] A? large-dnssec-domain.example. (56)
IP 10.20.30.40.53 > 10.8.0.10.50821: 12345* 1/0/1 A 203.0.113.77 (1492)
IP 10.8.0.10.42233 > 10.20.30.40.53: 54321+ [1au] A? large-dnssec-domain.example. (56)
IP 10.8.0.10.42233 > 10.20.30.40.53: Flags [S], seq 101, win 64240, options [mss 1360,sackOK,TS val 1 ecr 0,nop,wscale 7], length 0
IP 10.20.30.40.53 > 10.8.0.10.42233: Flags [S.], seq 202, ack 102, win 65160, options [mss 1360,sackOK,TS val 2 ecr 1,nop,wscale 7], length 0
What it means: You can see a large UDP response (1492-byte DNS payload, over 1500 bytes on the wire with IP/UDP headers), which exceeds the 1400-byte tunnel MTU; depending on the capture point you may or may not see the fragmentation explicitly. The TCP SYN afterward suggests the client is retrying and falling back to TCP.
Decision: If you don’t see the full UDP response arrive (or see it leave the server but not arrive at client), you’ve got fragment loss. Mitigate by lowering EDNS size and/or fixing tunnel MTU and allowing ICMP.
Task 10: Confirm whether TCP/53 is allowed end-to-end
cr0x@server:~$ nc -vz 10.20.30.40 53
Connection to 10.20.30.40 53 port [tcp/domain] succeeded!
What it means: TCP/53 is open on that path. That’s a safety net; many networks don’t have it.
Decision: If TCP works, you can mitigate quickly by ensuring clients/resolvers use TCP fallback properly and by shrinking UDP sizes. If TCP is blocked, fix that immediately—then continue on MTU root cause.
Task 11: Check resolver-side settings for EDNS buffer size (BIND example)
cr0x@server:~$ sudo named-checkconf -p | grep -E 'edns-udp-size|max-udp-size' || true
max-udp-size 4096;
What it means: The resolver is willing to send very large UDP responses. That’s optimistic in enterprise networks with tunnels and firewalls.
Decision: Reduce to a safer number (often 1232) unless you control every hop and can prove larger works reliably.
Task 12: Apply a safer max UDP size on the resolver (BIND) and validate config
cr0x@server:~$ sudo sed -i 's/max-udp-size 4096;/max-udp-size 1232;/' /etc/bind/named.conf.options
cr0x@server:~$ sudo named-checkconf
cr0x@server:~$ sudo systemctl reload bind9
cr0x@server:~$ sudo systemctl status bind9 --no-pager -l
● bind9.service - BIND Domain Name Server
Loaded: loaded (/lib/systemd/system/bind9.service; enabled)
Active: active (running)
What it means: Resolver is now constrained to smaller UDP responses, reducing fragmentation risk.
Decision: Re-test failing domains with dig without forcing +bufsize. If the issue disappears, you’ve proven a path MTU/fragment sensitivity. Still fix the network, but you’ve bought stability.
Task 13: Check if ICMP “frag needed” is being dropped by host firewall (Linux)
cr0x@server:~$ sudo iptables -S | grep -E 'icmp|fragmentation|RELATED' || true
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
What it means: ICMP is allowed on the host. If you saw only timeouts earlier, the drop might be on a network firewall, not the host.
Decision: If ICMP is blocked anywhere in the path, fix policy to allow essential ICMP types (not “all ICMP forever,” but enough for PMTUD).
Task 14: Check MSS clamping on a VPN gateway (common mitigation for TCP, not UDP)
cr0x@server:~$ sudo iptables -t mangle -S | grep -i clamp || true
-A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
What it means: MSS clamping is in place, which helps TCP avoid MTU trouble. It does nothing for UDP DNS fragmentation.
Decision: Don’t stop here. People see “clamp-mss-to-pmtu” and declare victory, then wonder why DNS over UDP still dies. You need EDNS sizing and/or MTU correctness.
Task 15: Quick Kubernetes check: does CoreDNS see truncation or retries?
cr0x@server:~$ kubectl -n kube-system logs deploy/coredns --tail=50 | grep -E 'timeout|truncated|SERVFAIL' || true
[ERROR] plugin/errors: 2 large-dnssec-domain.example. A: read udp 10.244.2.15:46122->10.20.30.40:53: i/o timeout
What it means: CoreDNS is timing out on UDP reads to the upstream resolver. If TCP succeeds from the same pod/network, suspect MTU/fragmentation between cluster and resolver (often an overlay MTU mismatch).
Decision: Verify pod network MTU and node MTU, and set CoreDNS (or upstream) to a safe EDNS size. Then fix overlay/tunnel MTU properly.
Three corporate mini-stories (how this really goes)
1) The incident caused by a wrong assumption
They rolled out a new site-to-site VPN between two offices. It came with the usual promises: “transparent connectivity,” “no application changes,” “it’s just routing.” The network team set the tunnel MTU to a default value they’d used for years, then moved on. No one measured anything, because why would you measure a thing that is “standard”?
Monday morning, ticket volume spiked. Users could access internal apps by IP, but names were flaky. Some web pages loaded, some spun forever, and the login portal failed in a way that looked like authentication trouble. The resolver graphs looked clean. CPU fine. QPS normal. On-call started chasing “DNS server performance.”
It took one person to run dig +tcp. Suddenly the failing domains resolved instantly, and the “authentication outage” looked a lot like “DNSSEC response size.” Packet captures showed UDP responses leaving the resolver and never arriving at the client intact. The tunnel path had an effective MTU smaller than assumed, and the firewall in the middle dropped fragments.
The wrong assumption wasn’t about MTU math; it was about ownership. Everyone assumed “the VPN handles it” and “DNS is tiny.” The fix was embarrassingly simple: correct the tunnel MTU, allow the necessary ICMP, and cap EDNS UDP size on the resolver as a belt-and-suspenders measure. The postmortem action item was even simpler: any new tunnel must have a measured PMTU test in the rollout checklist.
2) The optimization that backfired
A different org had a performance initiative: reduce latency by avoiding TCP for DNS. Someone noticed that TCP queries were a non-trivial fraction of resolver traffic during peak hours. So they “optimized” by raising EDNS buffer sizes across the fleet and tweaking firewall settings to “prefer UDP.” The change looked good in a narrow benchmark: fewer TCP handshakes, slightly lower average query time.
Two weeks later, a subset of remote employees started reporting that “some sites don’t exist” and “VPN is unstable.” It was intermittent and maddening. The resolver team saw more retransmits and timeouts but couldn’t reproduce from the data center. The VPN team blamed Wi‑Fi. The desktop team blamed the OS. Everyone had a plausible story; none of them were right.
The root cause was the optimization: bigger UDP responses increased fragmentation frequency on VPN links with smaller effective MTU. The path included a security device that dropped non-first fragments as part of a “hardening baseline.” The old configuration had forced more truncation and TCP fallback, which—while slower—was reliable. They had optimized away the safety margin.
The rollback reduced EDNS UDP size to a conservative value, and they explicitly allowed TCP/53. Then they fixed the security device configuration to stop dropping fragments blindly and allowed ICMP types needed for PMTUD. The lesson wasn’t “never optimize.” It was “optimize only what you can observe end-to-end,” especially when the protocol is UDP and the network is full of middleboxes with opinions.
3) The boring but correct practice that saved the day
One team had a rule that sounded dull: every resolver change required a canary test from at least three network vantage points—data center, VPN, and a cloud VPC. The test suite wasn’t fancy. It was a handful of dig queries, a few known large DNSSEC names, and a forced +bufsize case to simulate worst behavior.
During a routine upgrade, their canary from the VPN vantage point started failing only on the “large response” names. Everything else passed. The upgrade itself was fine. The difference was that the VPN concentrator pool had been refreshed the same week, and one new appliance model had a smaller effective MTU due to an extra encapsulation layer.
Because they tested from the edge, they caught it before users did. They lowered EDNS UDP size on the resolver as an immediate mitigation, then corrected the VPN MTU and ensured ICMP “Packet Too Big” messages were permitted. No incident call, no executive escalation, no tickets. Just boring tests preventing drama.
Joke #2: Boring reliability work is like flossing—nobody brags about it, but everyone notices when you don’t do it.
Fixes that actually stick (and what to avoid)
Fix category A: Make DNS less likely to fragment
Set a conservative EDNS UDP size
If you only do one mitigation, do this. Many resolvers and clients allow tuning the advertised/accepted UDP payload size. A common operational target is 1232 bytes, chosen to play nicely with IPv6 and typical tunnel overheads.
Resolver side (example approaches):
- BIND: max-udp-size 1232;
- Unbound: tune edns-buffer-size / msg-buffer-size as appropriate
- PowerDNS Recursor: configure EDNS UDP limits
The exact knob differs, but the operational intent is the same: stop emitting “heroic” UDP responses that rely on fragmentation across unknown networks.
Allow TCP/53 and make sure fallback works
TCP fallback isn’t a “nice-to-have.” It’s the protocol’s escape hatch when UDP fails or truncation happens. Blocking TCP/53 is a classic enterprise self-own.
Do not “prefer UDP” by blocking TCP. That’s not preference; that’s sabotage. Allow TCP/53 at least between clients and recursive resolvers, and between resolvers and upstreams if you operate that way.
Fix category B: Fix the path (the real root cause)
Correct MTU on tunnels and overlays
If DNS goes over a tunnel, the tunnel MTU must account for encapsulation overhead. IPsec, WireGuard, OpenVPN, GRE, VXLAN—pick your poison. The overhead varies and can stack. The right approach is:
- Measure PMTU between endpoints.
- Set interface MTUs accordingly.
- Verify with DF pings and real DNS tests.
Stop dropping essential ICMP
PMTUD needs ICMP messages. You do not need to allow every ICMP type from everywhere, but you do need to allow the ones that make the network function. For IPv4, that’s “Fragmentation Needed.” For IPv6, “Packet Too Big” is non-negotiable.
If your security policy says “no ICMP,” rewrite it. If you can’t, at least be honest that you’re trading operational correctness for a feeling.
Handle fragments intentionally
Some environments choose to drop fragments as a security stance. If you do that, you must compensate: cap EDNS size, ensure TCP fallback, and verify applications that rely on UDP won’t exceed MTU. DNS is the poster child, but not the only victim.
Fix category C: Fix client behavior (useful, but don’t hide the path bug)
Client stacks vary. Some stub resolvers advertise large EDNS buffers and are aggressive about UDP retries. Some quickly fall back to TCP. Some are… creative.
In managed fleets, you can enforce safer defaults:
- Use a local caching resolver (systemd-resolved, Unbound, dnsmasq) with a conservative EDNS size.
- Ensure VPN clients set sane MTU and don’t rely on PMTUD alone, especially if ICMP is filtered.
- Prefer DNS over TLS/HTTPS only if you understand the MTU/PMTUD implications (it’s TCP-based, which can help, but also adds its own moving parts).
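For example, if your local caching resolver is Unbound, the conservative EDNS size is a one-line setting. A minimal sketch of the relevant unbound.conf fragment (not a complete configuration; verify option names against your Unbound version):

```conf
server:
    # Advertise a conservative EDNS(0) UDP payload size to avoid fragmentation
    edns-buffer-size: 1232
```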
What to avoid (because it “fixes” the wrong thing)
- Avoid whack-a-mole resolver switching as a primary fix. It can change response sizes and caching, masking the issue.
- Avoid raising EDNS UDP size as a performance hack unless you control and test the full path.
- Avoid “just block fragments” without a compensating DNS strategy (EDNS cap + TCP fallback).
- Avoid guessing MTU values. Measure PMTU. Then set MTU.
Common mistakes: symptom → root cause → fix
1) “Only some domains won’t resolve”
Symptom: A handful of domains time out; most work. DNS server appears healthy.
Root cause: Large UDP responses (DNSSEC/TXT/many records) fragment; fragments dropped or ICMP blocked.
Fix: Reduce EDNS UDP size (start at 1232), allow TCP/53, fix MTU/ICMP on tunnels/firewalls.
2) “DNS works on my laptop hotspot but fails on corporate VPN”
Symptom: Same client, same resolver, different network path changes behavior.
Root cause: VPN reduces effective MTU and/or blocks ICMP; fragmentation fails.
Fix: Set VPN MTU correctly, allow ICMP “frag needed/packet too big,” cap EDNS size.
3) “TCP DNS works, UDP DNS times out”
Symptom: dig +tcp succeeds reliably; normal dig fails intermittently.
Root cause: UDP fragmentation or PMTUD black hole.
Fix: Investigate PMTU; allow fragments or reduce UDP size; ensure TCP/53 isn’t blocked.
4) “We set MSS clamping, but DNS still fails”
Symptom: Someone points to --clamp-mss-to-pmtu and expects miracles.
Root cause: MSS clamping only affects TCP. DNS primarily uses UDP.
Fix: Cap EDNS UDP size and fix MTU/ICMP; don’t confuse TCP mitigations with UDP behavior.
5) “We blocked ICMP and now random things hang”
Symptom: Large transfers stall; DNS occasionally times out; IPv6 behaves worse.
Root cause: PMTUD fails; black holes form; endpoints keep sending too-large packets.
Fix: Allow essential ICMP types; validate with DF ping and packet capture.
6) “Kubernetes cluster DNS is flaky after enabling overlay encryption”
Symptom: Pods get intermittent DNS timeouts; nodes might be fine.
Root cause: Overlay MTU reduced by encapsulation; pods still assume 1500; fragment loss or drop on nodes.
Fix: Set CNI MTU correctly, verify node/pod MTU alignment, cap EDNS size in CoreDNS/upstream resolver.
Checklists / step-by-step plan
Step-by-step: from symptom to root cause (field-tested)
- Pick two names: one that fails, one that succeeds. Prefer a DNSSEC-heavy name for the failing one.
- Run three queries: normal UDP, forced TCP, UDP with +bufsize=1232.
- If TCP succeeds and small bufsize succeeds: treat as MTU/fragmentation until proven otherwise.
- Identify the route: ip route get <resolver_ip> and record the egress interface.
- Check the interface MTU: ip link show dev <if>.
- Measure PMTU to resolver: DF pings (IPv4) or sized pings (IPv6). Record the largest working size.
- Capture packets: on the egress interface while reproducing. Look for retries, truncation, missing responses.
- Check firewall policy: ensure essential ICMP passes; verify fragment handling stance.
- Mitigate quickly: set resolver max-udp-size (or equivalent) to 1232; ensure TCP/53 allowed.
- Fix permanently: correct tunnel/overlay MTU; avoid stacking tunnels without recalculating overhead.
- Regression test: canary from VPN + data center + cloud path with large-response queries.
- Document: record PMTU values per major path and bake tests into change management.
Operational checklist: “we’re about to change MTU/tunnels/firewalls”
- Measure PMTU before and after change from at least two endpoints.
- Test DNS UDP with EDNS at 1232 and at a larger value (to expose fragment issues early).
- Verify TCP/53 permitted for clients to recursive resolvers.
- Verify ICMP types required for PMTUD are permitted (especially ICMPv6 Packet Too Big).
- Decide your fragment policy intentionally; don’t inherit defaults blindly.
- Update resolver EDNS max UDP size if the environment includes tunnels/overlays you don’t fully control.
FAQ
1) Why does DNS use UDP at all if it’s fragile?
Latency and simplicity. UDP avoids connection setup and scales well for small queries. The protocol includes TCP fallback for larger responses and retries for loss. The fragility comes from middleboxes and broken PMTUD, not from DNS alone.
2) If I allow TCP/53, do I still need to fix MTU?
Yes. TCP fallback is a safety net, not a cure. If your path is a PMTU black hole, other UDP-based protocols (and even TCP in certain cases) can suffer too. Fix the network so it behaves predictably.
3) What EDNS UDP size should I choose?
If you operate across VPNs/overlays/enterprise firewalls, start with 1232 bytes. If you control every hop and can prove larger works, you can raise it. But don’t raise it because a benchmark liked it.
4) How can I tell it’s fragmentation vs a “bad DNS server”?
The classic proof: dig +tcp works while UDP times out, and dig +bufsize=1232 works while +bufsize=4096 fails. Packet captures will show missing UDP responses or retries.
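That proof is mechanical enough to encode. A toy Python classifier for the triage results—heuristic only, with my own wording, not a standard tool:

```python
def diagnose(udp_ok: bool, tcp_ok: bool, small_bufsize_ok: bool) -> str:
    """Map the classic dig triage trio (plain UDP, +tcp, +bufsize=1232) to a likely cause."""
    if udp_ok:
        return "no size-related issue observed"
    if tcp_ok and small_bufsize_ok:
        return "UDP fragmentation / PMTUD black hole likely"
    if tcp_ok:
        return "UDP path problem; confirm with packet capture"
    return "resolver unreachable or TCP/53 blocked; check reachability first"

print(diagnose(udp_ok=False, tcp_ok=True, small_bufsize_ok=True))
```

It won’t replace a packet capture, but it keeps an incident channel from re-litigating the same three test results.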
5) Why is IPv6 often worse for this?
IPv6 routers don’t fragment in transit. Endpoints must handle it, and PMTUD relies on ICMPv6 “Packet Too Big” reaching the sender. If your network blocks that ICMPv6, you’re building an IPv6 black hole by design.
6) Can DNSSEC be the trigger even if I’m not “using DNSSEC”?
Yes. Recursive resolvers may validate DNSSEC and fetch extra records, and clients might request DNSSEC (the DO bit). Even without client requests, upstream behavior can enlarge responses for certain names.
7) What about DNS over HTTPS (DoH) or DNS over TLS (DoT)?
They ride over TCP (and often TLS), so they’re less sensitive to UDP fragmentation. But they introduce different failure modes (proxy interception, TLS inspection, connection limits, latency). Don’t use DoH/DoT as a band-aid for broken MTU unless you accept the tradeoffs.
8) I see “truncated” responses (TC bit). Is that MTU?
Not necessarily. Truncation means the server intentionally cut the UDP response and asks the client to retry over TCP. That can be caused by size limits, policy, or upstream behavior. MTU issues often look like timeouts instead of clean truncation because the response is lost mid-flight.
9) Can a single firewall rule really break only large DNS answers?
Absolutely. A rule that drops IP fragments, or blocks ICMP “frag needed,” selectively harms packets that exceed PMTU. Small DNS responses keep working, which is why this issue survives so long in production.
Next steps you can do today
MTU bugs don’t announce themselves. They cosplay as flaky DNS, “random” timeouts, and user complaints that sound subjective until you realize the packet sizes are objective.
- Prove it: run dig normal vs +tcp vs +bufsize=1232 for a failing name.
- Measure it: find the egress interface and PMTU to the resolver IP using DF pings.
- Mitigate fast: cap resolver EDNS UDP size to 1232 and ensure TCP/53 is allowed.
- Fix for real: correct tunnel/overlay MTU and permit essential ICMP for PMTUD; decide fragment policy intentionally.
- Keep it fixed: add a canary test from VPN + data center + cloud that includes at least one large DNSSEC response.
Do those five things and you’ll stop treating DNS like a mystical service and start treating it like what it is: packets, sizes, and paths. The network will still find new ways to disappoint you, but at least you’ll know where to look first.