Debian/Ubuntu Random Timeouts: Trace the Network Path with mtr/tcpdump and Fix the Cause (Case #64)

Random timeouts are the worst kind of outage: nothing is “down,” everything is “mostly fine,” and yet users are staring at spinning wheels. Your dashboards look smug. Your logs look bored. And the CEO can always reproduce it on hotel Wi‑Fi.

This is a field guide for Debian/Ubuntu operators who need to prove where the timeout happens, pin it to a hop or a subsystem, and fix the actual cause. We’ll use mtr and tcpdump like grown-ups: not as ritual, but as instruments. Along the way we’ll separate “packet loss” from “ICMP de-prioritized,” catch PMTU blackholes, spot DNS lies, and identify the special kind of pain that is asymmetric routing.

Fast diagnosis playbook

If you’re on call, you don’t need a philosophy lecture. You need a sequence that finds the bottleneck quickly and avoids false positives.

First: determine what times out (DNS, TCP connect, TLS, or app)

  • Check DNS latency and failures (it’s the classic “random” culprit because caching makes it sporadic).
  • Check TCP connect time (SYN/SYN-ACK handshake). If connect is slow, focus on path/firewalls/conntrack.
  • Check TLS handshake (if TLS stalls after connect, suspect PMTU, middleboxes, or MTU offload weirdness).
  • Check request/response time (if connect and TLS are fine, it’s application or server load).

Second: decide if it’s “path” or “endpoint”

  • Run mtr from the client side to the server IP, and from the server back to the client network if you can.
  • Run tcpdump on the endpoint that you control to verify if packets arrive and if replies leave.

Third: classify the timeout

  • Retransmissions without ACKs → loss, filtering, asymmetric routing, or conntrack drops.
  • Large packets disappear → PMTU blackhole (DF set, ICMP blocked).
  • Only UDP “randomly” fails → NAT timeouts, stateful firewalls, or DNS/QUIC quirks.
  • Only one destination ASN/provider → upstream routing/peering issue; get evidence and escalate.

One operational rule: do not trust a single perspective. If you only measure from the server, you’ll miss last-mile issues. If you only measure from the client, you’ll blame the internet for your own conntrack table being full.

A practical mental model: what “random timeouts” usually are

Random timeouts are rarely random. They’re conditional. Something about the traffic pattern, packet size, resolver choice, route, or state table triggers a failure. The “randomness” is your observability failing to capture the condition.

In production, intermittent timeouts typically land in one of these bins:

  • DNS issues: slow resolvers, broken split-horizon, EDNS0/UDP fragmentation problems, or resolver rate limiting.
  • Packet loss on a hop: real loss (congestion, Wi‑Fi, bad optics) vs “ICMP deprioritized” (mtr shows loss but TCP is fine).
  • PMTU blackholes: Path MTU Discovery fails because ICMP “fragmentation needed” is blocked; small packets work, larger ones stall.
  • State exhaustion: conntrack table full, NAT table full, firewall CPU pegged, DDoS mitigations dropping state.
  • Asymmetric routing: SYN goes out one path, SYN-ACK returns another and gets dropped by a stateful firewall or source validation.
  • Offload and driver quirks: GRO/TSO/LRO interactions, checksum offload oddities, or buggy NIC firmware—rare, but memorable.
  • Queueing and bufferbloat: latency spikes under load cause “timeouts” even without loss.

Here’s the unglamorous truth: mtr tells you where to look; tcpdump tells you what happened. Use both.

Interesting facts and historical context

  • Traceroute’s core idea dates back to 1987, using increasing TTL to coax ICMP “time exceeded” replies from each hop.
  • mtr (My Traceroute) has been around since the late 1990s, combining traceroute and ping over time to expose intermittent issues.
  • ICMP is not “optional” in practice: block too much of it and you break PMTU discovery, causing classic “works for small payloads” failures.
  • Path MTU Discovery (PMTUD) became a big deal in the early 1990s as networks diversified; it’s still a top cause of weird stalls today.
  • TCP retransmission behavior is intentionally conservative; exponential backoff makes a small loss event look like a massive stall to users.
  • Linux conntrack exists because stateful firewalls and NAT need it; when it’s full, your network doesn’t “degrade,” it lies and drops new flows.
  • EDNS0 increased DNS message sizes; that’s great until firewalls and NATs mishandle fragmented UDP and DNS starts “randomly” timing out.
  • Modern CDNs and multi-homed services make routing more dynamic; your “same destination” can traverse different paths minute to minute.
  • ICMP rate limiting is common on routers; it can make mtr show loss at an intermediate hop even when forwarding is perfect.

Tooling you actually need on Debian/Ubuntu

Install the basics. Don’t be the person debugging a network with only curl and vibes.

cr0x@server:~$ sudo apt-get update
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Reading package lists... Done
cr0x@server:~$ sudo apt-get install -y mtr-tiny tcpdump iproute2 dnsutils iputils-ping traceroute ethtool conntrack netcat-openbsd
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  conntrack dnsutils ethtool mtr-tiny netcat-openbsd tcpdump traceroute
0 upgraded, 7 newly installed, 0 to remove and 0 not upgraded.

mtr-tiny is fine for most cases; it’s the same tool without the GTK front end. Install the full mtr package if you want the graphical interface. tcpdump is non-negotiable.

Hands-on tasks: commands, outputs, and decisions

Below are practical tasks you can run on Debian/Ubuntu. Each includes: command, what to look for, and the decision it should trigger.

Task 1: Confirm the timeout is real and measure where it happens (DNS vs connect vs TLS)

cr0x@server:~$ time curl -sS -o /dev/null -w "namelookup:%{time_namelookup} connect:%{time_connect} appconnect:%{time_appconnect} starttransfer:%{time_starttransfer} total:%{time_total}\n" https://api.example.net/health
namelookup:0.002 connect:0.012 appconnect:0.054 starttransfer:0.061 total:0.061

real	0m0.070s
user	0m0.010s
sys	0m0.004s

Interpretation: If namelookup spikes on failures, chase DNS. If connect spikes, chase path/firewall/conntrack. If appconnect spikes, chase TLS/MTU/middleboxes. If only starttransfer spikes, it’s likely server/app latency.

Decision: Pick the subsystem to instrument next. Do not run mtr to a hostname if DNS is failing; resolve to an IP first.
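
Because the failure is intermittent, one clean sample proves nothing. Here is a minimal sketch for catching the bad samples over time, assuming the same example URL as above; the interval and log path are placeholders to adjust for your service.

# Sample the curl timing breakdown every 5 seconds and log it with a timestamp,
# so the rare slow or failed attempt is captured instead of stayed anecdotal.
while true; do
  ts=$(date -u +%FT%TZ)
  out=$(curl -sS -o /dev/null --max-time 15 \
    -w "namelookup:%{time_namelookup} connect:%{time_connect} appconnect:%{time_appconnect} starttransfer:%{time_starttransfer} total:%{time_total}" \
    https://api.example.net/health 2>&1)
  echo "$ts $out"
  sleep 5
done | tee -a /var/tmp/timeout-samples.log

Grep the log for connect or namelookup values above a second and you have exact timestamps to line up with mtr runs and packet captures.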

Task 2: Resolve the hostname and lock onto an IP

cr0x@server:~$ dig +time=1 +tries=1 api.example.net A
; <<>> DiG 9.18.24-1ubuntu1.4-Ubuntu <<>> +time=1 +tries=1 api.example.net A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1203
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
api.example.net.	60	IN	A	203.0.113.42

;; Query time: 18 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Mon Dec 30 12:01:02 UTC 2025
;; MSG SIZE  rcvd: 58

Interpretation: Note resolver (127.0.0.53 suggests systemd-resolved) and query time. If query time is hundreds of ms or times out intermittently, DNS is guilty until proven innocent.

Decision: Use the IP (203.0.113.42) for path tests and packet captures to avoid DNS noise.

Task 3: Run mtr with TCP mode to the service port (not ICMP)

cr0x@server:~$ mtr -rw -c 200 -T -P 443 203.0.113.42
Start: 2025-12-30T12:02:10+0000
HOST: server                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.10.0.1                     0.0%   200    0.4   0.6   0.3   2.1   0.2
  2.|-- 192.0.2.10                    0.0%   200    1.2   1.5   1.0   5.4   0.6
  3.|-- 198.51.100.9                  0.0%   200    2.8   3.0   2.2   9.7   0.8
  4.|-- 203.0.113.1                   0.0%   200    9.9  10.3   8.7  22.1   2.1
  5.|-- 203.0.113.42                  6.0%   200   18.2  14.1  11.0  85.3  10.7

Interpretation: TCP-mode mtr is closer to your real traffic. Loss at the final hop with clean intermediates is meaningful. The stdev and worst spikes matter: 85 ms worst isn’t fatal, but 6% loss is.

Decision: If loss appears only on the last hop, suspect the destination host, its firewall, or an adjacent network device. If loss begins at a hop and continues onward, suspect that hop or link.

Task 4: Run mtr with ICMP to detect ICMP rate limiting vs real forwarding loss

cr0x@server:~$ mtr -rw -c 200 203.0.113.42
Start: 2025-12-30T12:03:44+0000
HOST: server                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.10.0.1                     0.0%   200    0.5   0.6   0.3   1.6   0.2
  2.|-- 192.0.2.10                    0.0%   200    1.3   1.4   1.0   4.1   0.4
  3.|-- 198.51.100.9                 22.0%   200    3.0   3.2   2.4  11.8   1.1
  4.|-- 203.0.113.1                   0.0%   200   10.2  10.5   8.8  21.9   2.0
  5.|-- 203.0.113.42                  0.0%   200   12.9  13.6  11.2  40.5   4.4

Interpretation: Hop 3 shows 22% loss, but later hops show 0% loss. That’s classic ICMP response deprioritization or rate limiting on hop 3, not actual forwarding loss.

Decision: Do not escalate to the provider based on this alone. Prefer TCP mtr to the actual port and corroborate with tcpdump.

Task 5: Check local interface errors and drops (boring, often decisive)

cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     987654321  123456      0    421       0   10234
    TX:  bytes packets errors dropped carrier collsns
     876543210  120001      0     19       0       0

Interpretation: RX dropped 421 indicates local drops (queue overflow, driver, or host CPU pressure). Not always fatal, but it’s not “the internet.”

Decision: If drops climb during incidents, investigate host load, NIC ring buffers, qdisc, and offloads. If you see errors/carrier, suspect cabling/virtual NIC issues.

Task 6: Check routing and source IP selection

cr0x@server:~$ ip route get 203.0.113.42
203.0.113.42 via 10.10.0.1 dev eth0 src 10.10.0.23 uid 1000
    cache

Interpretation: Confirms which gateway and source IP are used. Wrong source IP is a quiet killer when policy routing or multiple interfaces exist.

Decision: If source IP is wrong, fix routing rules, ip rule, or application binding. If gateway is unexpected, you might have asymmetric routing.

Task 7: Look for conntrack exhaustion (timeouts that smell like “new connections fail”)

cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 262132
net.netfilter.nf_conntrack_max = 262144

Interpretation: You’re effectively at 100%. New connections can be dropped or delayed, producing intermittent connect timeouts that magically “go away.”

Decision: Raise nf_conntrack_max (carefully: memory impact), reduce connection churn, shorten timeouts where safe, or move NAT/state elsewhere.
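
To confirm conntrack is actually the culprit during an incident, and not just suspiciously full in general, watch the counters live and check the kernel log. A small sketch, assuming the conntrack tool installed earlier:

# Watch the table fill in real time during the incident window.
sudo watch -n1 'sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max'
# The kernel logs a telltale message when it starts refusing new flows.
sudo dmesg -T | grep -i 'table full' | tail -n 5
# The conntrack CLI can report the live count as well.
sudo conntrack -C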

Task 8: Capture traffic during a timeout (SYN retransmits vs server silence)

cr0x@server:~$ sudo tcpdump -vv -ni eth0 host 203.0.113.42 and tcp port 443
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:06:01.120001 IP (tos 0x0, ttl 64, id 43121, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.0.23.53124 > 203.0.113.42.443: Flags [S], seq 3021001001, win 64240, options [mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 7], length 0
12:06:02.121045 IP (tos 0x0, ttl 64, id 43122, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.0.23.53124 > 203.0.113.42.443: Flags [S], seq 3021001001, win 64240, options [mss 1460,sackOK,TS val 2000 ecr 0,nop,wscale 7], length 0
12:06:04.123210 IP (tos 0x0, ttl 64, id 43123, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.0.23.53124 > 203.0.113.42.443: Flags [S], seq 3021001001, win 64240, options [mss 1460,sackOK,TS val 4000 ecr 0,nop,wscale 7], length 0

Interpretation: SYN retransmissions with no SYN-ACK returning. Either the SYN never reaches the server, or the SYN-ACK never makes it back, or it’s being dropped locally by conntrack/firewall.

Decision: Capture on the server side too if possible. If server sees SYN but client doesn’t see SYN-ACK, suspect return path/asymmetry or a stateful device dropping the reply.

Task 9: Capture on the server to prove whether SYN arrives

cr0x@server:~$ sudo tcpdump -ni eth0 host 10.10.0.23 and tcp port 443
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:06:01.129900 IP 10.10.0.23.53124 > 203.0.113.42.443: Flags [S], seq 3021001001, win 64240, options [mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 7], length 0
12:06:01.130010 IP 203.0.113.42.443 > 10.10.0.23.53124: Flags [S.], seq 901200200, ack 3021001002, win 65160, options [mss 1440,sackOK,TS val 7000 ecr 1000,nop,wscale 7], length 0

Interpretation: Server sends SYN-ACK. If the client capture didn’t show it, the return path is dropping it. That’s not application. That’s networking.

Decision: Escalate to network team/provider with this evidence, or check intermediate firewalls/NAT for asymmetric routing or state drops.

Task 10: Detect PMTU blackhole with tracepath

cr0x@server:~$ tracepath -n 203.0.113.42
 1?: [LOCALHOST]                      pmtu 1500
 1:  10.10.0.1                          0.471ms
 2:  192.0.2.10                         1.312ms
 3:  198.51.100.9                       3.004ms
 4:  203.0.113.1                       10.142ms
 5:  203.0.113.42                      13.012ms reached
     Resume: pmtu 1500 hops 5 back 64

Interpretation: If tracepath reports a lower PMTU (values like 1492, 1476, or 1420 are typical of PPPoE, GRE, and WireGuard paths) or stalls with “no reply,” you may have PMTU issues.

Decision: If you suspect PMTU blackhole, test with DF pings and adjust MTU/MSS clamping.

Task 11: Validate PMTU with “do not fragment” ping sizes

cr0x@server:~$ ping -c 3 -M do -s 1500 203.0.113.42
PING 203.0.113.42 (203.0.113.42) 1500(1528) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500

--- 203.0.113.42 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2046ms

Interpretation: This output indicates the packets never left the host: your local interface MTU is 1500, and a 1500-byte ICMP payload becomes 1528 bytes on the wire once ICMP and IP headers are added. That’s a local sizing issue, not a path issue.

Decision: Retry with correct sizing. For MTU 1500 the largest DF-able ICMP payload is 1472 (1472 + 8 bytes ICMP + 20 bytes IPv4 = 1500). If 1472 also fails with “message too long,” you’re probably on a tunnel interface or a different MTU than you think. Confirm MTU with ip link.
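
Once the sizing is right, stepping through a few DF payload sizes brackets the real path MTU quickly. A sketch with illustrative sizes and the same example target:

# Probe decreasing DF payload sizes; payload + 28 bytes of ICMP/IP headers
# is the on-wire packet size. The largest size that gets replies ~= path MTU.
for size in 1472 1452 1432 1392 1372; do
  printf 'payload %4s: ' "$size"
  ping -c 2 -W 1 -M do -s "$size" 203.0.113.42 >/dev/null 2>&1 && echo ok || echo 'lost or too big'
done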

Task 12: Check local MTU and tunnels that quietly lower it

cr0x@server:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
3: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000

Interpretation: WireGuard interface MTU 1420 is common. If traffic routes via wg0 or is encapsulated, effective MTU is lower than 1500.

Decision: If your path uses a tunnel, set MTU appropriately and clamp TCP MSS at the edge.

Task 13: Find drops in firewall logs and nftables counters

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    tcp dport 22 accept
    counter packets 1024 bytes 65536 drop
  }
}

Interpretation: That counter on a drop rule is not decoration. If it increases during timeouts, your host firewall is participating.

Decision: Add explicit accept rules for the service, or fix state tracking/routing so replies match established state.
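
A sketch of how to make the firewall show its hand, assuming the inet filter table and input chain from the ruleset above; 443 stands in for your service port. Note that insert (not add) places the rule before the final drop.

# Count and accept inbound traffic to the service port, at the top of the chain.
sudo nft insert rule inet filter input tcp dport 443 counter accept
# Re-read the counters while reproducing the timeout.
sudo nft list chain inet filter input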

Task 14: Verify systemd-resolved behavior and upstream DNS health

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 192.0.2.53
       DNS Servers: 192.0.2.53 192.0.2.54

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.0.2.53
       DNS Servers: 192.0.2.53 192.0.2.54

Interpretation: You know which resolvers you’re actually using. “We use Google DNS” is often a myth surviving from a previous terraform module.

Decision: Test each resolver directly with dig @server. If one is flaky, remove it or fix it.
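
A quick sketch for that comparison, using the resolver addresses resolvectl reported above and the earlier example hostname; swap in your own.

# Query each resolver directly with a short timeout so a flaky one stands out.
for ns in 192.0.2.53 192.0.2.54; do
  echo "== $ns =="
  dig @"$ns" api.example.net A +time=1 +tries=1 | grep -E 'status:|Query time:'
done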

mtr done right: read it like an SRE

mtr is a time-series traceroute. It’s also a generator of bad tickets when misread. The main trap: loss shown on an intermediate hop does not necessarily mean loss for your traffic. Many routers de-prioritize TTL-expired ICMP replies; they still forward packets just fine.

Rules of thumb that keep you out of trouble

  • Trust loss only if it persists to the final hop. Loss that appears at hop N and disappears at hop N+1 is usually ICMP response rate limiting.
  • Prefer TCP mtr to the service port. ICMP might be treated differently than your traffic, especially across providers and firewalls.
  • Look at latency distribution (avg, worst, stdev), not just “Loss%”. Queueing can wreck user experience without high loss.
  • Run enough samples (-c 200 or more) to catch intermittency. Five packets is a horoscope.
  • Pin tests to the actual IP. CDNs can route different clients to different edges; you want a stable target.

Joke #1: mtr is like a conference call—someone always “drops,” and it’s usually the person who isn’t doing any real work.

What “good mtr” evidence looks like

Good evidence is comparative and consistent:

  • mtr TCP to port 443 shows 5–10% loss at final hop during incident.
  • tcpdump shows SYN retransmissions and missing SYN-ACKs on the client side.
  • Server-side capture shows SYN-ACK leaving (or not leaving) the interface.
  • Path tests from another vantage point show different behavior (routing issue) or same behavior (endpoint issue).

tcpdump done right: catch the timeout on the wire

tcpdump is your courtroom transcript. The packet trace doesn’t care about your assumptions, your org chart, or the fact that “it worked yesterday.” It only cares what was sent and what was seen.

Capture strategy: don’t drown

  • Filter aggressively: host + port + protocol. If you capture everything, you’ll miss the moment that matters.
  • Capture both directions when possible: client-side and server-side. Asymmetric routing loves single-sided captures.
  • Correlate with timestamps: keep incident times, and use monotonic clocks if you can. If NTP is broken, you’ll suffer twice.
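
A rotating, tightly filtered capture you can leave running until the next timeout shows up, then stop. This is a sketch: interface, peer IP, and output path are placeholders, and you should check that your AppArmor policy allows writing to the chosen directory.

# Headers only (-s 128), rotate every 5 minutes (-G 300), stop after 12 files (-W 12):
# roughly an hour of evidence without drowning the disk.
sudo tcpdump -ni eth0 -s 128 -G 300 -W 12 \
  -w '/var/tmp/timeouts-%Y%m%d-%H%M%S.pcap' \
  host 203.0.113.42 and tcp port 443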

What to look for in TCP timeouts

  • SYN retransmits: connect path issue, filtering, return path, or conntrack.
  • SYN/SYN-ACK/ACK succeeds but TLS stalls: PMTU, middlebox interference, or packet reordering with weird MTU.
  • Data sent, no ACKs, retransmissions: downstream loss or stateful device drop.
  • RSTs: active rejection by firewall or service not listening.

One more operational truth: timeouts are not errors, they’re missing evidence. tcpdump gives you that evidence.

Fix the cause: MTU, routing, DNS, conntrack, and friends

Fix class 1: DNS “random” timeouts

DNS often looks like the network because everything uses it. Typical failure modes:

  • One resolver is slow or intermittently drops UDP.
  • EDNS0 responses get fragmented; fragments are dropped; clients retry over TCP and stall.
  • Split-horizon returns unroutable IPs depending on which resolver answers.

What to do (a sketch follows the list):

  • Test each resolver directly using dig @192.0.2.53 and dig @192.0.2.54 with short timeouts.
  • If you see large responses timing out, test with +tcp and consider limiting EDNS buffer size on resolvers or clients.
  • On systemd-resolved systems, confirm which servers are active; don’t assume.
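
A sketch of the EDNS/fragmentation check, assuming the resolver from the earlier output; DNSKEY is just a convenient example of a large answer, not a requirement.

# Same question three ways: default EDNS, a small UDP buffer, and TCP.
# If only the large-UDP case stalls, fragmented responses are being dropped.
dig @192.0.2.53 api.example.net DNSKEY +time=2 +tries=1
dig @192.0.2.53 api.example.net DNSKEY +time=2 +tries=1 +bufsize=512
dig @192.0.2.53 api.example.net DNSKEY +time=2 +tries=1 +tcp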

Fix class 2: PMTU blackholes

PMTU blackholes are classic: small requests work, larger requests stall, usually during TLS or large headers/cookies. The culprit is often blocked ICMP “fragmentation needed” messages. TCP keeps sending with DF set, never learns the smaller MTU, and you get retransmissions until timeout.

What to do (a sketch follows the list):

  • Allow necessary ICMP types through firewalls (at least “fragmentation needed” and “time exceeded” where appropriate).
  • Clamp TCP MSS on tunnel edges (WireGuard, IPsec, GRE) so endpoints never send too-large segments.
  • Set correct MTU on tunnel interfaces, and verify routing actually uses them.
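
A minimal nftables sketch of MSS clamping plus the ICMP types PMTUD needs, assuming the inet filter table shown earlier and a box that forwards tunnel traffic; the chain name and numeric priority (-150, the mangle slot) are illustrative.

# Clamp outgoing SYN MSS to the route MTU on the forwarding path.
sudo nft add table inet mangle
sudo nft add chain inet mangle forward '{ type filter hook forward priority -150; }'
sudo nft add rule inet mangle forward tcp flags syn tcp option maxseg size set rt mtu
# Let the ICMP errors PMTUD depends on through the host firewall (insert before the drop).
sudo nft insert rule inet filter input icmp type '{ destination-unreachable, time-exceeded }' accept
sudo nft insert rule inet filter input icmpv6 type '{ packet-too-big, time-exceeded }' accept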

Fix class 3: conntrack/NAT exhaustion

Conntrack exhaustion makes new connections flaky while established ones limp along. It’s the network equivalent of a restaurant that seats old customers forever and pretends it’s “fully booked.”

What to do (a sketch follows the list):

  • Raise nf_conntrack_max and tune timeouts if your workload is connection-heavy.
  • Reduce churn: enable keep-alives, reuse connections, configure pools.
  • Move NAT/state to devices sized for it, or eliminate NAT where you can.
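
A sketch of the tuning knobs; the numbers are examples, not recommendations, because each conntrack entry costs kernel memory and the right ceiling depends on your workload.

# Raise the ceiling and trim one of the chattier timeouts, live.
sudo sysctl -w net.netfilter.nf_conntrack_max=524288
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
# Persist across reboots.
echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/90-conntrack.conf
sudo sysctl --system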

Fix class 4: asymmetric routing

Asymmetric routing is not inherently bad; the internet does it all the time. It becomes bad when stateful devices assume symmetry (firewalls, NATs) or when source validation drops packets returning via an unexpected interface.

What to do (a sketch follows the list):

  • Validate routing with ip route get and captures on both ends.
  • If you have multiple uplinks, align policy routing and ensure replies use the same path as requests when passing stateful devices.
  • Check reverse path filtering (rp_filter) settings when multi-homed hosts see return traffic on a different interface.
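
A sketch for the rp_filter check; eth0 is a placeholder, and the kernel combines the “all” and per-interface values (the numerically higher one wins).

# 0 = off, 1 = strict (drops replies arriving on an "unexpected" interface), 2 = loose.
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
# Relax to loose mode on a multi-homed host while testing.
sudo sysctl -w net.ipv4.conf.eth0.rp_filter=2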

Fix class 5: local drops, NIC/driver, or queueing

Sometimes the network issue is your host dropping packets under load. Look at interface drops, CPU softirq pressure, and qdisc behavior (a quick inspection sketch follows the list).

  • If ip -s link shows drops during incidents, inspect CPU load and softirqs.
  • Check offloads with ethtool -k if you suspect weirdness; disable selectively for troubleshooting, not permanently by superstition.
  • Consider fq_codel or cake at the edge if bufferbloat is causing latency spikes.
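
A quick inspection sketch; eth0 is a placeholder, statistic names vary by driver, and offload toggles are a troubleshooting test, not a permanent fix.

# NIC-level drop/error counters (non-zero ones only) and current offload state.
sudo ethtool -S eth0 | grep -iE 'drop|err|miss' | grep -v ': 0$'
sudo ethtool -k eth0 | grep -E 'offload|checksumming'
# Toggle GRO off while reproducing, then put it back.
sudo ethtool -K eth0 gro off
sudo ethtool -K eth0 gro on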

Joke #2: If you “fixed” timeouts by rebooting the firewall, you didn’t fix it—you hit the snooze button with authority.

Three corporate mini-stories (anonymized)

Mini-story 1: The incident caused by a wrong assumption

The company ran a multi-region API. Users in one geography reported random checkout failures: some requests succeeded, others timed out at 10–15 seconds. The application team swore it was a backend problem because their dashboards showed elevated latency on the API tier. They started scaling.

The first wrong assumption was subtle: they assumed “timeouts” meant “server slow.” In reality, curl -w showed time_connect was the thing spiking, not starttransfer. That single metric changed the whole investigation from application to network path.

mtr from a failing client network showed loss at the last hop only when using TCP to port 443. ICMP mtr was clean. That was an important detail: the network could forward ICMP and even some TCP, but not consistently to that service port. The team finally ran tcpdump on the server and saw SYNs arrive, SYN-ACKs leave, and… nothing. The client never saw the SYN-ACK.

The real cause was asymmetric routing introduced by a “temporary” upstream change. The SYN entered through provider A, but the return SYN-ACK exited through provider B due to a route preference tweak. A stateful firewall upstream expected the return path through A and dropped the SYN-ACK as “invalid.”

The fix wasn’t heroic: adjust routing policy so the return path matched, and coordinate with the upstream team to keep session symmetry where stateful inspection existed. The lesson was grimly simple: don’t let the word “timeout” bully you into scaling the app. Measure where the time went.

Mini-story 2: The optimization that backfired

A platform team wanted to reduce costs and latency. They moved DNS resolution closer to workloads by deploying local caching resolvers and pointing all hosts at them. It looked great in synthetic tests. They patted themselves on the back, which is a known precursor to pain.

Then intermittent failures started: service discovery would occasionally hang, then recover. It wasn’t constant, which made it politically annoying. Application logs showed timeouts calling dependencies by hostname; direct IP calls worked. The on-call rotation developed a superstition that “DNS is haunted.”

Packet captures told the story. DNS responses for certain records were larger due to DNSSEC and many A/AAAA records. The local resolver used EDNS0 with a large UDP buffer. Somewhere along the path—specifically a NAT device with an old fragmentation policy—fragmented UDP responses were being dropped. Clients retried over TCP, but the resolver had too-low TCP worker capacity and started queueing. Now you had “random” DNS timeouts depending on which record and which response size.

The optimization was well-intended, but it changed traffic shape: fewer upstream queries, yes, but larger and more bursty responses locally, and more fragmentation. The fix involved reducing advertised EDNS buffer size, ensuring TCP fallback capacity, and updating the NAT policy to handle fragments properly. The team also added direct monitoring of DNS query time distributions, not just “success rate.”

The outcome: DNS got stable, and the “optimization” stopped being a slow-motion incident. The enduring takeaway is that performance optimizations that change packet size and burst patterns are network changes, whether you file a ticket or not.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop had a reputation for being dull about change management. They maintained a runbook that required capturing evidence from both ends for any intermittent network failure. It was not cool. It was also the reason the incident ended before lunch.

One morning, several batch jobs started failing with “connection timed out” to an external partner endpoint. Internal services were fine. The network team suspected the partner. The app team suspected the network. Everyone prepared their favorite blame.

The on-call followed the runbook: run TCP mtr to the partner’s port, capture tcpdump during a failure, and record the exact 5‑tuple. They captured on the client host and saw SYN retransmits. They captured on the edge firewall and saw SYNs leaving but no SYN-ACKs returning. That’s not “maybe,” that’s a directional observation.

They then repeated the test from a different egress IP and saw clean connections. So the issue was not “the partner is down.” It was route-specific. They escalated to their upstream provider with mtr plus packet evidence, and the provider confirmed a broken path to that destination prefix from one peering point.

The incident resolution was basically paperwork plus rerouting, but the speed came from a boring discipline: always capture at least one packet trace at the boundary you control. The runbook didn’t make them smarter. It made them faster and harder to argue with.

Common mistakes: symptom → root cause → fix

  • mtr shows 30% loss at hop 4, but the destination is fine → ICMP rate limiting on hop 4 → Ignore intermediate-hop loss unless it continues to the final hop; validate with TCP mtr and tcpdump.
  • HTTPS connects, then “hangs” during TLS handshake → PMTU blackhole or MSS too high over a tunnel → Verify with tracepath, DF pings, and fix MTU/MSS clamping; allow ICMP fragmentation-needed.
  • New connections randomly time out, existing keep working → conntrack/NAT table near full → Check nf_conntrack_count; increase limits and reduce connection churn.
  • Only one ISP path has timeouts → asymmetric routing or upstream peering issue → Prove with captures (SYN arrives, SYN-ACK leaves, not received); adjust routing or escalate with evidence.
  • Timeouts correlate with traffic bursts → queueing/bufferbloat or firewall CPU saturation → Watch latency distribution (worst/stdev), interface drops, and firewall counters; apply fq_codel/cake and capacity fixes.
  • DNS fails “sometimes,” especially for certain names → EDNS/fragmentation issues or one bad resolver in the list → Test each resolver directly; reduce EDNS buffer, ensure TCP fallback, remove flaky resolver.
  • mtr TCP to port works, but application still times out → app-layer timeouts or server overload → Use curl timing breakdown; profile app and server; stop blaming the network prematurely.
  • Packets seen leaving server, client never receives → return path drop (ACL, stateful firewall, rp_filter) → Validate routing symmetry; tune rp_filter; fix stateful device policies.

Checklists / step-by-step plan

Checklist A: 15-minute triage (single operator, minimal access)

  1. Run curl -w timings to classify DNS vs connect vs TLS vs app.
  2. Resolve hostname to IP with dig; keep the IP for tests.
  3. Run mtr -T -P <port> to the IP with at least 200 samples.
  4. Run ICMP mtr once to identify ICMP rate limiting patterns (don’t overreact).
  5. Check local interface drops with ip -s link.
  6. Check routing with ip route get.
  7. If connect timeouts: run tcpdump filtered by host+port while reproducing.

Checklist B: Prove asymmetry or stateful drop (when you control both ends)

  1. Start tcpdump on client: filter host SERVER and tcp port SERVICE.
  2. Start tcpdump on server: filter host CLIENT and tcp port SERVICE.
  3. Reproduce the timeout and save timestamps and 5‑tuple.
  4. If server sees SYN and sends SYN-ACK, but client doesn’t see SYN-ACK: return path drop/asymmetry.
  5. If server never sees SYN: forward path drop or wrong routing/ACL before server.
  6. Correlate with firewall/nft counters and conntrack pressure.

Checklist C: PMTU / MTU verification (when “small works, big fails”)

  1. Run tracepath -n to the target; record PMTU hints.
  2. Identify tunnels and interface MTUs with ip link show.
  3. Test with DF pings at a few sizes (within local MTU constraints).
  4. Clamp MSS on tunnel edge; allow essential ICMP; retest TLS handshake timing.

Operational guidance you should actually adopt

  • Always collect a packet trace during the failure. A ten-second capture beats an hour of guessing.
  • Prefer service-port measurements (TCP mtr, tcpdump). ICMP is a different class of traffic and is treated differently.
  • Keep a “known-good” vantage point (a small VM in another network) to distinguish “your network” from “the world.”

FAQ

1) Why does mtr show loss on an intermediate hop but not on the destination?

Because the router is de-prioritizing ICMP TTL-expired replies (or rate limiting them) while still forwarding your packets. Trust end-to-end loss.

2) Should I always use mtr -T?

For diagnosing application timeouts on a specific TCP port, yes. ICMP tests are still useful for PMTU clues and general reachability, but TCP matches your workload.

3) How do I tell if the timeout is DNS?

Use curl -w and look at time_namelookup. Also query directly with dig +time=1 +tries=1. If DNS is slow, everything looks slow.

4) What’s the simplest sign of a PMTU blackhole?

TCP connect works, but TLS handshake or large responses stall, often with retransmissions. tracepath can hint, but packet captures and MSS/MTU tests confirm.

5) Why would only some users see timeouts?

Different routes, different resolvers, different MTUs (VPNs), different NAT behavior, or different CDN edges. “Some users” is often “some paths.”

6) Can tcpdump be run safely on production servers?

Yes, if you filter tightly and capture briefly. Use host/port filters, avoid writing huge pcap files to busy disks, and don’t leave it running unattended.

7) mtr shows clean results but users still time out. Now what?

mtr doesn’t see application queueing, TLS issues, or server overload. Use curl timing breakdown, server metrics, and tcpdump to find whether packets stall post-connect.

8) How do I know if conntrack is the culprit?

When nf_conntrack_count approaches nf_conntrack_max and new connections fail while existing ones remain stable. It’s especially common on NAT gateways.

9) What quote should I keep in mind during these incidents?

Werner Vogels (paraphrased idea): “Everything fails, all the time; design and operate systems assuming failure is normal.”

Practical next steps

Random timeouts stop being “random” the moment you measure the right layer and capture the right packets. Do these next:

  1. Standardize on a quick classification: DNS vs connect vs TLS vs app using curl -w.
  2. Use TCP mtr to the real service port and sample long enough to catch intermittency.
  3. Capture tcpdump during an actual failure on at least one endpoint you control; if possible, capture on both ends.
  4. If you find PMTU/MTU trouble, fix it properly (MTU, MSS clamping, essential ICMP) instead of hoping retries will save you.
  5. If you find conntrack pressure, treat it like capacity: raise limits, reduce churn, and monitor it like you mean it.

Most importantly: write down what you learned. The next time the timeout returns (and it will), you’ll want a runbook, not a séance.
