Random timeouts are the worst kind of outage: too small for a clean graph spike, too frequent to ignore, and perfectly timed to hit
whoever is on-call and whoever is in a demo. Your monitoring says “latency elevated,” your app says “upstream timed out,” and your
users say “it’s broken,” with the confidence of someone who has never had to chase a packet through three networks and a load balancer.
This case is about doing the boring, reliable thing: trace the path with mtr, capture the truth with tcpdump,
and then fix the actual cause—not the last thing you touched. We’ll treat Debian/Ubuntu as the operating room, and the network as the
patient who keeps “forgetting” to mention they also smoke.
Fast diagnosis playbook
When timeouts are “random,” the real pattern is usually hidden by averages. Your job is to force the system to show its hand.
This is the order that finds bottlenecks quickly, without burning a day arguing with dashboards.
1) Confirm the scope in 5 minutes
- One host or many? If it’s one VM, think host NIC, driver, MTU, local firewall/conntrack. If it’s a whole service tier, think path, LB, DNS, upstream.
- One destination or many? If only one upstream, focus on that path. If many, suspect local network stack, resolver, or egress.
- TCP, UDP, or both? TCP-only often smells like MSS/MTU, stateful firewall, conntrack, or retransmit storms. UDP-only points at DNS, QUIC, or rate-limits.
2) Reproduce from the box that times out
- Run an application-level probe (curl with timing) and a packet-level probe (tcpdump) at the same time.
- Collect a small capture (30–60 seconds) during a failure window. Don’t “capture all day” unless you like explaining disk usage.
3) Trace the path with mtr, but use the right protocol
- Use mtr -T (TCP) to the actual port your app uses when ICMP is filtered or deprioritized.
- Compare from two vantage points: the failing host and a healthy peer in the same subnet/VPC.
4) Identify which of these buckets you’re in
- Loss on the last hop (destination): real issue or destination rate limiting.
- Loss starts mid-path and persists: real transit loss or congestion.
- Loss only on one intermediate hop: usually ICMP rate limiting; ignore unless latency/loss also shows downstream.
- No loss, but latency spikes: bufferbloat, queueing, CPU interrupt issues, or retransmits hidden by ICMP.
- Looks clean, but app times out: MTU blackhole, asymmetric routing, conntrack drops, or DNS/Happy Eyeballs weirdness.
5) Make the smallest change that proves causality
- Clamp MSS, adjust MTU, change resolver, pin interface, change route preference, or disable offload (temporarily) to prove a hypothesis.
- Don’t deploy a “network refactor” to fix a timeout. That’s not engineering; that’s performance art.
A workable mental model of random timeouts
A timeout is not a single failure. It’s the final symptom of something else taking too long: a SYN not answered, a DNS query stuck,
a request retransmitting, an ACK delayed behind a congested queue, a PMTU discovery that never completes, or a stateful firewall
silently dropping “weird” packets because it can.
Random timeouts tend to be one of four shapes:
- Micro-loss with retries: you don’t notice until traffic rises or timeouts shrink.
- Bursty queueing: a link or virtual NIC gets congested, packets sit in buffers, latency spikes, then “recovers.”
- Path inconsistency: ECMP, asymmetric routing, or flapping routes send some flows down a bad lane.
- Protocol mismatch: MTU/MSS, checksum/offload bugs, or firewall/conntrack behaviors that only show under certain packet sizes or rates.
The key is to stop thinking in “the network” and start thinking in flows. One flow can be broken while others look fine,
especially when load balancers, NAT, or ECMP are involved. That’s why you capture packets and look for retransmits, resets, and stalls.
Facts and context that make you better at this
- Traceroute predates most production SRE teams. It was created in the late 1980s to debug routing and reachability using TTL expiration behavior.
- mtr is basically “traceroute with a memory.” It continuously samples and shows trends—perfect for intermittent issues.
- ICMP is often treated as second-class traffic. Many routers rate-limit ICMP replies, so ICMP-based mtr can show “loss” that isn’t real loss for TCP.
- Path MTU Discovery relies on ICMP “Fragmentation Needed.” If those ICMP messages are blocked, you can get a PMTU blackhole: small packets work, big ones hang.
- TCP retransmissions are normal—until they aren’t. A few retransmits happen in healthy networks; sustained retransmits indicate loss or severe reordering.
- Linux conntrack has finite tables. When state tables fill, you don’t get a nice error. You get drops that look “random.”
- ECMP can make debugging look like gaslighting. Two consecutive probes can take different paths; one path might be fine, the other pathological.
- Offloads can confuse packet captures. GRO/TSO can make tcpdump show giant segments that never hit the wire that way, unless you account for it (a quick check follows this list).
- DNS timeouts can masquerade as “network timeouts.” Your app may report upstream timeouts when it’s actually stuck resolving names.
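If you suspect offloads are reshaping what tcpdump shows you (the GRO/TSO point above), the feature flags are one command away. A minimal check, reusing the ens5 interface from the tasks below; exact feature names can vary slightly by driver:
cr0x@server:~$ sudo ethtool -k ens5 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
If these report “on”, expect captures on the sending host to contain segments larger than the MTU; that’s the capture point talking, not the wire.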
Practical tasks: commands, outputs, and decisions
These are field tasks. Each one includes: a runnable command, what the output means, and what you decide next.
Run them from the host that experiences timeouts. If possible, run the same set from a “known good” host in the same network for comparison.
Task 1: Confirm interface, IP, and default route
cr0x@server:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
ens5 UP 10.20.4.17/24 fe80::5054:ff:fe12:3456/64
cr0x@server:~$ ip route show default
default via 10.20.4.1 dev ens5 proto dhcp src 10.20.4.17 metric 100
Meaning: You know which interface matters and where traffic exits. If you have multiple defaults, you’ve found a likely cause.
Decision: If routing is ambiguous (multiple defaults, policy routing), capture route selection with ip route get (Task 2) and look for asymmetric routing.
Task 2: Verify route selection to the failing destination
cr0x@server:~$ ip route get 203.0.113.10
203.0.113.10 via 10.20.4.1 dev ens5 src 10.20.4.17 uid 1000
cache
Meaning: Confirms the egress interface and gateway for that IP.
Decision: If this varies across runs, suspect policy routing, multiple tables, or ECMP at your edge. If src IP is unexpected, fix source-based routing.
Task 3: Check link-level counters for drops and errors
cr0x@server:~$ ip -s link show dev ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 1423 0 12345
TX: bytes packets errors dropped carrier collsns
876543210 1122334 0 7 0 0
Meaning: Drops at the NIC level can cause retransmits and timeouts. Errors are worse.
Decision: If RX drops climb during incidents, suspect host congestion, ring buffer limits, or virtual NIC issues. Move to Task 11 (interrupt/softnet stats) and Task 12 (qdisc/queue).
Task 4: Validate DNS quickly (because it lies quietly)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.4.53
DNS Servers: 10.20.4.53 10.20.4.54
cr0x@server:~$ getent ahostsv4 api.example.internal
10.60.8.21 STREAM api.example.internal
10.60.8.21 DGRAM
10.60.8.21 RAW
Meaning: Resolver is systemd-resolved and you have two DNS servers. Name resolves fast.
Decision: If getent hangs or intermittently fails, fix DNS before touching the network path. If DNS is solid, continue.
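Intermittent resolver trouble hides from a single lookup. A rough sketch that times 20 consecutive lookups, reusing the hostname from this task (substitute your own name):
cr0x@server:~$ for i in $(seq 1 20); do (time getent ahostsv4 api.example.internal >/dev/null) 2>&1 | grep real; done
Anything that jumps from milliseconds to whole seconds on some iterations is a resolver problem wearing a network costume.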
Task 5: Application-level timing with curl (pinpoint which phase stalls)
cr0x@server:~$ curl -sS -o /dev/null -w 'dns:%{time_namelookup} connect:%{time_connect} tls:%{time_appconnect} ttfb:%{time_starttransfer} total:%{time_total}\n' https://203.0.113.10:443/health
dns:0.000 connect:0.214 tls:0.462 ttfb:2.997 total:3.002
Meaning: DNS is instant (the URL is an IP literal), TCP connect took 214 ms, TLS finished by roughly 0.46 s, and time-to-first-byte is ~3 s. That’s not a connect timeout; it’s a server response stall or downstream loss after the connection is up.
Decision: Capture packets during this exact request (Task 9) and run TCP-based mtr to port 443 (Task 7).
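One timing sample proves little when the problem is intermittent. A small loop, assuming the same health endpoint, that keeps only the five slowest runs sorted by total time:
cr0x@server:~$ for i in $(seq 1 50); do curl -sS -o /dev/null --max-time 10 -w '%{time_connect} %{time_starttransfer} %{time_total}\n' https://203.0.113.10:443/health; sleep 1; done | sort -k3 -n | tail -5
If the slow runs show a normal connect time but a long time_starttransfer, the stall is after the handshake, which is exactly what the capture in Task 9 will confirm or refute.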
Task 6: mtr with ICMP to get a first shape
cr0x@server:~$ mtr -n -r -c 50 203.0.113.10
Start: 2025-12-30T10:14:02+0000
HOST: server Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.20.4.1 0.0% 50 0.3 0.4 0.2 1.8 0.3
2.|-- 10.20.0.1 0.0% 50 0.5 0.6 0.4 2.0 0.3
3.|-- 192.0.2.9 6.0% 50 1.2 1.3 0.9 12.4 1.6
4.|-- 198.51.100.14 0.0% 50 2.1 2.4 1.8 15.6 2.1
5.|-- 203.0.113.10 2.0% 50 2.7 3.1 2.2 48.0 6.7
Meaning: Some loss appears at hop 3 and at the destination. But ICMP loss can be rate limiting.
Decision: Repeat with TCP mtr to the actual port (Task 7). If TCP shows the same loss/spikes, treat it as real. If not, stop blaming hop 3.
Task 7: mtr with TCP SYN to the real service port
cr0x@server:~$ mtr -n -T -P 443 -r -c 50 203.0.113.10
Start: 2025-12-30T10:15:31+0000
HOST: server Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.20.4.1 0.0% 50 0.5 0.6 0.3 3.1 0.6
2.|-- 10.20.0.1 0.0% 50 0.7 0.8 0.5 4.8 0.8
3.|-- 192.0.2.9 0.0% 50 1.4 1.5 1.0 6.5 0.9
4.|-- 198.51.100.14 0.0% 50 2.2 2.6 1.9 21.0 2.9
5.|-- 203.0.113.10 4.0% 50 3.0 6.8 2.4 98.7 17.2
Meaning: Now loss shows up only at the destination and the average latency is ugly with a nasty worst case. That’s consistent with real application pain.
Decision: Move from “path suspicion” to “flow truth.” Capture TCP retransmits and stalls (Task 9/10). Also test MTU/PMTUD (Task 8).
Task 8: Check MTU and PMTUD behavior (the blackhole classic)
cr0x@server:~$ ip link show dev ens5 | sed -n '1p'
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
cr0x@server:~$ ping -M do -s 1472 -c 3 203.0.113.10
PING 203.0.113.10 (203.0.113.10) 1472(1500) bytes of data.
1472 bytes from 203.0.113.10: icmp_seq=1 ttl=54 time=3.11 ms
1472 bytes from 203.0.113.10: icmp_seq=2 ttl=54 time=3.08 ms
1472 bytes from 203.0.113.10: icmp_seq=3 ttl=54 time=3.06 ms
--- 203.0.113.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
Meaning: 1500-byte path works for ICMP. That doesn’t guarantee PMTUD works for TCP, but it’s a good sign.
Decision: If this fails with “Frag needed,” reduce size until it passes and compare to expected. If it hangs (no replies), suspect ICMP blocking mid-path and consider MSS clamping (Fixes section).
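tracepath (from iputils) probes the path MTU hop by hop without root, which makes it a useful complement here. Keep in mind it depends on the same ICMP errors PMTUD needs, so a path that silently filters them will mislead tracepath too:
cr0x@server:~$ tracepath -n 203.0.113.10
Watch the reported pmtu values: a drop below 1500 partway along the path points at a tunnel boundary or a clamped link.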
Task 9: Capture traffic for one destination (focused tcpdump)
cr0x@server:~$ sudo tcpdump -i ens5 -nn -s 0 -w /tmp/case4_443.pcap 'host 203.0.113.10 and tcp port 443'
tcpdump: listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C
1458 packets captured
1492 packets received by filter
0 packets dropped by kernel
Meaning: You have a clean capture with zero kernel drops. That matters: drops in the capture make you blame the network for what is really your own CPU problem.
Decision: If tcpdump reports kernel drops, reduce capture scope, increase buffers, or fix host congestion (Task 11). Then analyze retransmits (Task 10).
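If the failure window is unpredictable, a rotating capture keeps disk usage bounded while you wait for it. A sketch with assumed rotation values (5-minute files, at most 12 kept, so roughly an hour on disk):
cr0x@server:~$ sudo tcpdump -i ens5 -nn -s 0 -G 300 -W 12 -w '/tmp/case4_%Y%m%d_%H%M%S.pcap' 'host 203.0.113.10 and tcp port 443'
When a timeout lands, note the wall-clock time, pull the matching file, and delete the rest.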
Task 10: Spot retransmits and resets (quick tshark-less method)
cr0x@server:~$ sudo tcpdump -nn -tt -r /tmp/case4_443.pcap 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0' | head
1735553741.102134 IP 10.20.4.17.51322 > 203.0.113.10.443: Flags [S], seq 240112233, win 64240, options [mss 1460,sackOK,TS val 111 ecr 0,nop,wscale 7], length 0
1735553741.316902 IP 203.0.113.10.443 > 10.20.4.17.51322: Flags [S.], seq 99112233, ack 240112234, win 65535, options [mss 1380,sackOK,TS val 222 ecr 111,nop,wscale 8], length 0
Meaning: The filter shows only SYNs and RSTs, which is enough to spot SYN retransmits and abrupt resets. The snippet above is a normal handshake opening, and the server advertises MSS 1380 (interesting).
Decision: If you see repeated SYNs with no SYN-ACK, it’s reachability/filtering. If you see data retransmits and dup ACK storms, it’s loss/queueing. If MSS is unexpectedly low, investigate MTU or tunneling overhead.
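To turn “look for repeated SYNs” into a number, count outbound SYNs per source port in the same capture; more than one SYN from the same ephemeral port is a retransmitted connection attempt. A rough one-liner (the awk field position assumes default tcpdump text output):
cr0x@server:~$ sudo tcpdump -nn -r /tmp/case4_443.pcap 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' | awk '{print $3}' | sort | uniq -c | sort -rn | head
Counts of 1 everywhere mean connections are being accepted promptly; counts of 3–6 on some ports are the timeouts your users are reporting.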
Task 11: Check softnet backlog drops (host can be the bottleneck)
cr0x@server:~$ awk '{print "cpu"NR-1, "processed="$1, "dropped="$2, "time_squeeze="$3}' /proc/net/softnet_stat | head
cpu0 processed=12345678 dropped=12 time_squeeze=34
cpu1 processed=12233445 dropped=0 time_squeeze=0
cpu2 processed=11999887 dropped=0 time_squeeze=2
cpu3 processed=12111222 dropped=9 time_squeeze=17
Meaning: Drops and time_squeeze indicate the kernel can’t keep up with packet processing on some CPUs (note: the raw fields in /proc/net/softnet_stat are hexadecimal counters). That yields “random” loss from the app’s point of view.
Decision: If this spikes during incidents, you fix the host: IRQ affinity, NIC ring sizes, qdisc, or reduce packet rate. Don’t open a ticket with “the internet is flaky” yet.
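These counters only matter if they move during the incident, so sample them a few times. A crude sketch (watch for growth, not absolute values, since the raw fields are hex):
cr0x@server:~$ for i in 1 2 3; do date +%T; awk '{print "cpu"NR-1, "dropped="$2, "squeeze="$3}' /proc/net/softnet_stat; sleep 10; done
Growth in dropped or squeeze on specific CPUs during the bad window confirms a host-side bottleneck.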
Task 12: Inspect qdisc and pacing (bufferbloat and queueing)
cr0x@server:~$ tc -s qdisc show dev ens5
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
Sent 876543210 bytes 1122334 pkt (dropped 7, overlimits 0 requeues 21)
backlog 0b 0p requeues 21
maxpacket 1514 drop_overlimit 0 new_flow_count 1234 ecn_mark 0
new_flows_len 0 old_flows_len 0
Meaning: fq_codel is generally good. A small number of drops is fine. Overlimits and huge backlog would hint at queueing.
Decision: If you see massive backlog/overlimits, you’re saturating egress or shaping badly. Fix bandwidth limits, shaping policies, or upstream congestion.
Task 13: Check conntrack pressure (stateful drops look random)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 24612
net.netfilter.nf_conntrack_max = 262144
cr0x@server:~$ sudo dmesg -T | tail -n 5
[Tue Dec 30 10:12:01 2025] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Tue Dec 30 10:12:07 2025] nf_conntrack: table full, dropping packet
Meaning: If conntrack is full or near full, packets get dropped and connections stall. The kernel tells you, quietly, in dmesg.
Decision: If you see “table full,” you either increase conntrack sizing, reduce connection churn, or stop tracking flows you don’t need (carefully). If you see SYN cookies, investigate inbound storms or mis-sized backlog.
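If the conntrack CLI (the conntrack-tools package) is installed, its per-CPU statistics make the pressure visible without waiting for a dmesg line:
cr0x@server:~$ sudo conntrack -S
Nonzero insert_failed, drop, or early_drop counters that keep growing tell the same story as “table full,” just earlier.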
Task 14: Validate rp_filter and asymmetric routing risk
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ens5.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.ens5.rp_filter = 1
Meaning: Strict reverse path filtering can drop legitimate traffic in asymmetric routing setups (common with multi-homed hosts, policy routing, some cloud edges).
Decision: If your path is asymmetric (verified via routing tables and captures), set rp_filter to loose mode (2) on affected interfaces, but only when you understand the blast radius.
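For reference, switching to loose mode is one sysctl per key; the kernel uses the higher of the “all” and per-interface values, so setting both to 2 keeps the intent unambiguous. Persist it under /etc/sysctl.d/ once you’re sure:
cr0x@server:~$ sudo sysctl -w net.ipv4.conf.all.rp_filter=2 net.ipv4.conf.ens5.rp_filter=2
Document why it’s loose, or someone will “re-harden” it back to strict during the next security review.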
Using mtr without lying to yourself
mtr is a flashlight, not a courtroom transcript. It’s great at showing where latency and loss appear, but it can also lead you into
accusing an innocent hop that’s simply rate-limiting ICMP replies. The trick is to interpret it like an operator, not like a tourist.
Three rules for mtr that prevent bad decisions
- Loss at an intermediate hop is irrelevant unless it continues downstream. If hop 3 shows 30% loss but hop 4 and the destination show 0% loss, hop 3 is probably deprioritizing your probes.
- Match the protocol to your problem. If your app uses TCP/443, use mtr -T -P 443. ICMP tells you something, but not always the thing you need.
- Sample long enough to catch the “random” part. 10 probes can look perfect. 200 probes can show a 3% loss pattern that ruins tail latency.
Joke #1: Traceroute is like office gossip—occasionally accurate, always confident, and it never tells you what happened inside the meeting room.
mtr patterns that matter
- Step change in latency: a jump that stays higher after a certain hop suggests a slower link, a tunnel boundary, or a congested segment.
- Spiky worst-case at destination: tail latency pain, often queueing or intermittent loss.
- Destination-only loss: could be real last-mile loss, destination rate-limiting, or firewall behavior on the far end. Confirm with tcpdump.
- Alternating hop behavior: can indicate ECMP where different probes take different paths. mtr may show a blended view.
tcpdump: catching retransmits, MTU, and stalls
tcpdump is where debates go to die. It doesn’t care about your cloud provider’s status page or the network team’s confidence. It shows what your
host sent, what it received, and what it never got back.
How to capture without hurting the patient
- Filter hard. Capture only the host/port you need. Disk and CPU are production resources.
- Capture short bursts. 30–60 seconds during a bad window beats 4 hours of “mostly fine.”
- Record kernel drops. If tcpdump drops packets, your capture is incomplete and your conclusions are suspect.
Three tcp-level signatures of “random timeouts”
- SYN retransmissions: the connection attempt isn’t getting a SYN-ACK. Often firewall, route, or upstream overload.
- Data retransmissions and duplicate ACKs: packet loss or reordering. Loss is more common; reordering is rarer but nasty.
- Stalls after sending large segments: MTU/MSS/PMTUD trouble, especially if small requests succeed and larger ones hang.
Case #4: path looks “fine” until it isn’t
This case shows up in real systems because modern networking is layered on layered on layered: VPC overlay, encryption, load balancers,
NAT gateways, sometimes a service mesh. The path can be “reachable,” yet still wrong in a way that only hits certain packet sizes or flows.
The symptoms
- Intermittent timeouts to one upstream HTTPS API.
- Most requests succeed; some hang for 2–10 seconds, then fail.
- ICMP ping looks clean. ICMP-based mtr shows occasional loss at a mid-hop, which sends people into unproductive Slack threads.
- TCP-based mtr shows destination tail latency and a little loss.
The investigation path that works
Start at the host. Prove whether the host is dropping packets (softnet, NIC drops). Then prove whether the flow is retransmitting.
Finally, validate MTU/MSS and any tunnel boundaries.
A plausible root cause in this case
The smoking gun is the MSS mismatch in the handshake: the server advertises MSS 1380. That strongly implies that somewhere on the path,
packets larger than roughly 1420–1450 bytes are risky (tunnel overhead, IPsec, GRE/VXLAN, or a provider edge doing something clever).
If PMTUD is blocked or unreliable, some flows will stall when they try to send larger TLS records or HTTP responses.
You’ll often see this combination:
- Small requests (health checks, short JSON) are fine.
- Requests that trigger larger responses or uploads fail intermittently.
- tcpdump shows retransmits of full-sized segments; the sender backs off, then eventually times out.
Proving it with a targeted capture
In the capture, look for repeated retransmissions of the same sequence number, especially when the segment length is near your MSS.
Also check for ICMP “fragmentation needed” messages; if they’re absent and traffic stalls at larger sizes, PMTUD blackhole is likely.
Joke #2: MTU bugs are like glitter—once they get into your network, they show up everywhere and nobody admits who brought them.
Fixes that actually stick
There are two categories of fixes: stop the bleeding and repair the path. Do both, but in that order.
Production systems value availability over aesthetics.
Fix 1: Clamp TCP MSS at the edge (pragmatic, often immediate)
If you control a firewall/router/NAT instance between your hosts and the upstream, clamp MSS to a safe value so TCP never tries to send
packets larger than the real path can carry. This sidesteps PMTUD dependency.
cr0x@server:~$ sudo iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o ens5 -j TCPMSS --clamp-mss-to-pmtu
Meaning: SYN packets get MSS adjusted to match discovered PMTU (when possible).
Decision: If PMTUD is broken, you may need an explicit MSS value (like 1360–1400) based on tunnel overhead, as sketched below. Validate with tcpdump and confirm the timeout rate actually drops.
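A minimal sketch of the explicit variant, with 1380 chosen only because it matches the MSS the far end advertised in Task 10; derive your own value from the real tunnel overhead:
cr0x@server:~$ sudo iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o ens5 -j TCPMSS --set-mss 1380
nftables has an equivalent (tcp option maxseg size set), if that’s what your edge runs; the iptables form above is the one most edge boxes still carry.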
Fix 2: Set the correct MTU on the interface (correct when you own the underlay)
In overlay networks (VXLAN, IPsec), the effective MTU is smaller than 1500. If your host thinks it can do 1500, it’ll try. Then the path does
what paths do: drop what it can’t handle.
cr0x@server:~$ sudo ip link set dev ens5 mtu 1450
Meaning: The host won’t send IP packets larger than 1450 bytes on that interface, and locally originated TCP connections will advertise a correspondingly smaller MSS.
Decision: Only do this if your environment expects it (cloud VPCs, tunnels). Confirm by checking the provider’s recommended MTU and measuring with ping -M do.
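Note that ip link set mtu does not survive a reboot. On Ubuntu hosts managed by netplan, a small override file is the usual way to persist it; the filename here is illustrative, and Debian hosts using ifupdown set mtu in /etc/network/interfaces instead:
cr0x@server:~$ sudo tee /etc/netplan/99-mtu-ens5.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    ens5:
      mtu: 1450
EOF
cr0x@server:~$ sudo netplan apply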
Fix 3: Stop blocking PMTUD ICMP in the middle
The correct fix is often “allow ICMP type 3 code 4” (fragmentation needed) through firewalls. But this is corporate networking,
so “correct” may require meetings. Still: push for it. It fixes the disease, not just the symptom.
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
ip protocol icmp accept
ip6 nexthdr icmpv6 accept
}
}
Meaning: ICMP is accepted here (good). If you see ICMP blocked, PMTUD may fail.
Decision: Permit necessary ICMP for your environment. Don’t “block all ICMP” and then act surprised when the network behaves strangely.
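If your input chain is stricter than the one above and you don’t want to open all of ICMP, admitting destination-unreachable (type 3, which includes code 4 “fragmentation needed”) is enough to keep PMTUD alive. A sketch reusing the table and chain names from the listing:
cr0x@server:~$ sudo nft add rule inet filter input icmp type destination-unreachable accept
cr0x@server:~$ sudo nft add rule inet filter input icmpv6 type packet-too-big accept
On IPv6 this is not optional: routers never fragment, so a filtered packet-too-big is a guaranteed blackhole for large packets.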
Fix 4: Reduce host packet drops (softnet/IRQ hygiene)
If Task 11 shows softnet drops, fix the host. Common improvements: ensure RSS is enabled, spread IRQs, and don’t let one CPU do all the work.
cr0x@server:~$ grep -H . /proc/interrupts | grep -E 'ens5|virtio' | head
/proc/interrupts: 43: 9922331 0 0 0 IR-PCI-MSI 327680-edge virtio0-input.0
/proc/interrupts: 44: 0 8877665 0 0 IR-PCI-MSI 327681-edge virtio0-output.0
Meaning: Interrupts are not balanced (CPU0 is taking a beating on input).
Decision: Adjust IRQ affinity or enable irqbalance if appropriate, then re-check softnet stats during load.
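Manual pinning is a hex CPU bitmask written to the IRQ’s smp_affinity file. A sketch using the IRQ number from the output above (43) and mask 2, i.e. CPU1; a running irqbalance may rewrite it:
cr0x@server:~$ echo 2 | sudo tee /proc/irq/43/smp_affinity
Re-run Task 11 under load afterwards; the goal is fewer softnet drops, not prettier /proc output.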
Fix 5: Make conntrack boring again
If conntrack is full or you see drops, fix state. Increase table size only if you understand memory and traffic patterns. Better is to reduce churn:
keep connections alive, use pooling, and don’t track what you don’t need (with care around NAT and security policies).
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
net.netfilter.nf_conntrack_max = 524288
Meaning: More room for tracked flows.
Decision: If count rises to fill max again, you didn’t fix the cause—just gave it a larger warehouse to burn down.
Fix 6: Prefer TCP mtr and service-level probes in runbooks
This is a cultural fix. Update your runbooks so “trace the path” means “trace with the protocol that matters.” It prevents false blame and speeds up triage.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A fintech team ran a set of Ubuntu API servers behind a managed load balancer. Users started reporting “random checkout failures.”
The on-call ran ICMP mtr from one API node to the payment gateway. It showed 20–30% loss at hop 4. The conclusion formed instantly:
“The ISP is dropping packets.” A ticket was filed. Everyone waited.
Meanwhile, the retries piled up. The payment gateway saw bursty traffic and started rate limiting. Now timeouts were worse.
The team added more API nodes, which increased connection churn, which increased conntrack pressure on the egress NAT layer.
Suddenly, timeouts were not random—they were constant, and everyone had a new favorite theory.
A quieter engineer did the unfashionable thing: mtr -T -P 443 to the gateway and a 60-second tcpdump during failures.
Hop 4 loss vanished under TCP mtr. The destination showed sporadic SYN retransmits. tcpdump showed repeated SYNs with no SYN-ACK during bursts.
The real issue was a stateful firewall in the middle with an aggressive SYN rate limit configuration that had been copied from a “hardened” template.
ICMP loss at hop 4 was just ICMP being deprioritized. The wrong assumption was believing mtr loss at an intermediate hop equals packet loss for your app.
The fix was boring: adjust firewall thresholds to match traffic patterns, add connection reuse, and set alerts on SYN retransmits.
The ISP ticket got closed with a polite non-answer, as ISP tickets often do.
Mini-story 2: The optimization that backfired
A media company wanted faster uploads to an object store. Someone noticed the servers were using MTU 1500 in a data center that supported jumbo frames.
The proposal: set MTU 9000 everywhere. Bigger frames, fewer packets, less CPU. They rolled it out to a fleet of Debian boxes.
Upload throughput improved in the same rack. Then the bug reports started: intermittent timeouts to certain external services, flaky TLS handshakes,
and weird behavior where small API calls worked but anything “heavy” stalled. mtr looked “mostly fine.” Ping worked. Everyone got smug for the wrong reasons.
The culprit was predictable: not every segment between those servers and the outside world supported jumbo frames. Some paths clamped, some dropped,
and PMTUD ICMP messages were filtered by an intermediate security appliance. Result: PMTU blackholes for a subset of flows.
The optimization created a split-brain network where the same host behaved differently depending on destination.
The rollback was painful because services had been tuned around the “improvement,” and now everything had to relearn reality.
They eventually standardized MTU by zone, documented it, enabled required ICMP for PMTUD, and used MSS clamping at the boundary where tunnels existed.
Mini-story 3: The boring but correct practice that saved the day
A SaaS platform had an internal rule: every incident channel must include a packet capture from the failing host within 20 minutes.
People complained it was bureaucratic. They wanted to “start with dashboards.” Dashboards are comforting. Packets are honest.
One day, random timeouts hit a set of Ubuntu workers talking to a database over a private link. Latency graphs were flat-ish.
mtr showed nothing suspicious. The database team insisted nothing changed. The network team insisted nothing changed. The only thing changing was blame.
The capture showed periodic bursts of retransmissions correlated with a nightly batch job. Not huge loss, just enough.
On the worker hosts, /proc/net/softnet_stat showed drops and time_squeeze spikes on a subset of CPUs during that batch.
The problem wasn’t the link; it was the host network stack getting starved while CPU-heavy compression ran.
Because they had a standard “capture early” practice, they didn’t spend a day arguing about routers. They pinned the batch job’s CPU set,
tuned IRQ affinity, and the timeouts vanished. Nobody got a trophy. Production got quieter, which is the only trophy that matters.
Common mistakes: symptoms → root cause → fix
This section is opinionated because it’s written in the blood of wasted hours.
1) mtr shows loss on a middle hop
Symptoms: mtr reports 10–50% loss at hop N, but the destination looks fine or only slightly affected.
Root cause: ICMP rate limiting or deprioritization on that router. Not actual forwarding loss.
Fix: Re-run with mtr -T -P <port>. Only act if loss/latency persists to the destination.
2) Random timeouts only for “large” responses or uploads
Symptoms: Health checks pass. Small requests pass. Big payloads stall or time out.
Root cause: MTU mismatch, PMTUD blackhole, tunnel overhead, or filtered ICMP fragmentation-needed messages.
Fix: Validate PMTU with ping -M do, observe MSS in SYN/SYN-ACK, clamp MSS or set correct MTU, allow PMTUD ICMP.
3) SYN retransmits during spikes
Symptoms: tcpdump shows repeated SYNs, few SYN-ACKs, connect timeouts under load.
Root cause: Firewall SYN rate limiting, saturated load balancer, exhausted conntrack/NAT state, or upstream accept backlog pressure.
Fix: Inspect conntrack/dmesg, adjust firewall thresholds, reduce connection churn (keepalive/pooling), scale or tune upstream.
4) Looks like “network loss,” but tcpdump drops packets
Symptoms: tcpdump reports “packets dropped by kernel,” softnet drops rise, app sees timeouts.
Root cause: Host cannot process packets fast enough (CPU contention, IRQ imbalance, virtualization noise).
Fix: Reduce capture scope; fix host: IRQ affinity, RSS, irqbalance, CPU pinning, reduce packet rate, or move workload.
5) Timeouts only from one subnet or one AZ
Symptoms: Same code, same service, but one zone times out more.
Root cause: Different path (routing, NAT gateway, firewall policy), or one bad underlay segment.
Fix: Compare mtr -T and captures from each zone. Fix the divergent egress path; don’t “tune the app” to tolerate a broken lane.
6) “We enabled strict rp_filter for security” and now weird things happen
Symptoms: Some replies never make it back, intermittent drops on multi-homed hosts.
Root cause: Asymmetric routing plus strict reverse path filtering drops legitimate packets.
Fix: Use rp_filter loose mode (2) where asymmetry is expected, and document the routing design so nobody “re-secures” it later.
Checklists / step-by-step plan
Checklist A: 30-minute triage for random timeouts
- Identify one failing destination IP:port and one healthy control destination.
- Run curl -w timing (Task 5) to see whether it’s connect vs TTFB vs total.
- Run mtr -T -P (Task 7) long enough to catch spikes.
- Capture 60 seconds with tcpdump filtered to that host:port (Task 9).
- Check NIC drops (Task 3) and softnet drops (Task 11).
- Check conntrack and dmesg (Task 13).
- Test PMTU quickly with ping -M do (Task 8).
- Make one containment change (MSS clamp or MTU adjust) only if it clearly matches the evidence.
Checklist B: Evidence package to hand to a network/provider team
- Timestamped mtr -T -P output showing loss/latency at the destination.
- pcap snippet showing SYN retransmits or data retransmits (include start/end times).
- Source IP, destination IP, port, and whether NAT is involved.
- Confirmation that host isn’t dropping packets (softnet, tcpdump kernel drops).
- Whether MSS/MTU appears clamped (SYN options, observed MSS values).
Checklist C: Hardening so this doesn’t come back next month
- Add synthetic probes that measure connect time and TTFB separately.
- Alert on TCP retransmissions and SYN retransmissions at the node level.
- Standardize MTU per environment and document tunnel overhead assumptions.
- Keep ICMP needed for PMTUD allowed across internal boundaries.
- Track conntrack utilization where NAT/stateful firewalls exist.
- Update runbooks: use TCP mtr to the service port, not ICMP by default.
FAQ
1) Why does ICMP mtr show loss but my app seems fine?
Many routers rate-limit ICMP replies (control plane) while forwarding data (data plane) normally. Use mtr -T -P to test the protocol your app uses.
2) Why does TCP mtr show loss at the destination—could it still be “fake”?
It’s less likely to be fake, but destinations can rate-limit SYN/ACK or deprioritize responses under load. Confirm with tcpdump: do you see SYN retransmits or data retransmits?
3) What’s the fastest way to detect MTU blackholes?
Compare behavior for small vs large payloads, check MSS in SYN/SYN-ACK, and test with ping -M do using increasing sizes. If large stalls and ICMP “frag needed” is missing, suspect PMTUD failure.
4) Should I disable TSO/GRO when debugging?
Only if you understand why. Offloads can make captures confusing, but disabling them in production can hurt performance. Prefer interpreting captures with offload behavior in mind and keep tests targeted.
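For completeness, a temporary toggle looks like this; it does not persist across reboot or driver reload and it can raise CPU usage, so treat it as an experiment, not a fix:
cr0x@server:~$ sudo ethtool -K ens5 gro off gso off tso off
Turn the features back on with the same command and “on” once you’ve captured what you needed.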
5) My tcpdump says “packets dropped by kernel.” Is the network guilty?
No. That’s your host failing to capture (and often failing to process) packets fast enough. Fix CPU/IRQ/softnet issues or reduce capture load before blaming the path.
6) How do I tell if timeouts are DNS-related?
Use curl -w to see name lookup time, and run getent ahosts repeatedly. DNS issues show as spikes in time_namelookup or hangs in resolver calls.
7) Is MSS clamping a hack?
It’s a pragmatic containment strategy. The correct long-term fix is consistent MTU and functioning PMTUD, but MSS clamping is widely used at tunnel boundaries because it works.
8) How do random timeouts relate to storage engineering?
Remote storage (iSCSI, NFS, object gateways) turns packet loss into application stalls fast. A tiny retransmit rate can become huge tail latency for synchronous IO.
9) What quote should I keep in mind during these investigations?
“Hope is not a strategy.” — traditional SRE saying (it opens the first chapter of Google’s SRE book)
Conclusion: next steps
Random timeouts aren’t random. They’re just distributed across flows, hidden by averages, and protected by people’s favorite wrong assumptions.
The reliable workflow is simple: reproduce on the failing host, trace the path using the correct protocol, capture packets during failure,
and prove whether you’re dealing with loss, queueing, MTU/PMTUD, or host-level drops.
Next steps you can do today:
- Update your runbook to default to mtr -T -P for app timeouts.
- Add a lightweight tcpdump-on-demand procedure with strict filters and short capture windows.
- Baseline MTU/MSS expectations in your environment (especially if tunnels exist).
- Instrument retransmissions and conntrack pressure where NAT/stateful devices exist.
- When you find the fix, write down the evidence—not the theory—so the next on-call doesn’t relive your weekend.