Linux Networking: Packet Loss ‘Only Sometimes’ — The Real Debug Flow

Intermittent packet loss is the worst kind of outage: it dodges graphs, hides during incident calls, and only appears when your CEO joins the all-hands video. The app team says “network.” The network team says “servers.” Someone suggests rebooting “just to clear it.”

Here’s the production workflow that ends the argument: you prove where the packets are being dropped, using specific Linux counters and captures, then you fix the bottleneck that actually owns the loss. Not vibes. Evidence.

A ruthless mental model: where drops happen

“Packet loss” is a symptom, not a location. A packet can be dropped in a lot of places, and Linux gives you counters for most of them. The trick is knowing which counters are adjacent to the drop domain you’re investigating.

Think in drop domains, not in tools

When people are stuck, they usually pick a tool (ping, traceroute, tcpdump) and keep running it harder. Tools are fine. But the right flow is: identify the domain, then pick the minimal set of tools to prove it.

Common drop domains in a Linux production system:

  • Remote drops: the packet leaves your host fine, but dies elsewhere (switch, router, firewall, remote host input queue, remote CPU, remote policy).
  • Link-level issues: cabling, optics, flaps, FEC, PCS errors. This is “physical,” but shows up as MAC/PHY counters and sometimes driver logs.
  • NIC receive path drops: ring buffer overflow, descriptor starvation, driver bugs, or offload interactions.
  • Kernel backlog drops: NAPI/softirq backlog (the classic softnet_stat story), CPU starvation, interrupt routing.
  • Netfilter/conntrack: state table pressure, invalid states, aggressive timeouts, or rules that drop under load.
  • qdisc/traffic control: shaping/policing, fq_codel behavior, or a well-meaning qdisc that quietly drops.
  • Socket-level drops: application not reading, receive buffer too small, accept queue overflow, SYN backlog issues.
  • Virtualization/overlay: veth queues, bridge drops, VXLAN/Geneve overhead, encapsulation MTU mismatch.
  • Storage-shaped “network loss”: not a joke. If the app blocks on disk, it stops reading sockets and you’ll “see” loss as retransmits and timeouts.

The goal is not to collect every metric known to humanity. The goal is to correlate one observed symptom (retransmits, gaps, tail latency, timeouts) with one specific counter increasing in one place.

Two kinds of “only sometimes”

Intermittent loss tends to fall into one of two patterns:

  • Bursty drops (microbursts): everything is fine until it isn’t, and then a small queue overflows. The graphs look clean because averages hide spikes.
  • Conditional drops: only some traffic is affected (certain MTU, DSCP, flows hashed to one bad LACP member, one IRQ pinned to a busy CPU, one conntrack bucket under attack).

If you don’t distinguish these, you’ll debug the wrong thing for days. Bursty drops demand queue/capacity work. Conditional drops demand hashing, MTU, policy, and per-flow inspection.

One quote worth keeping on a sticky note: “Hope is not a strategy,” a line attributed to many engineering leaders in operations culture.

Joke #1: Packet loss is like a toddler with a marker. If you don’t catch it in the act, all you get is a wall full of “huh, that’s weird.”

Fast diagnosis playbook (first/second/third)

This is the triage order that gets you to a bottleneck quickly, even when the issue is intermittent and your dashboards are lying by averaging.

First: prove whether loss is local or remote

  1. Check TCP retransmits and kernel drop counters on the affected host. If retransmits rise but local RX/TX drops don’t, suspect upstream, remote, or middleboxes.
  2. Capture on both sides of an interface boundary. For example: capture on eth0 and on bond0, or on eth0 and inside the namespace. If the packet appears on one side and not the other, you found the drop boundary.
  3. Compare error counters at the NIC/PHY. CRC, FCS, symbol errors, and link flaps are not “app problems.”
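
The boundary test in step 2 reduces to arithmetic once you have the two captures. A minimal sketch, assuming you have already counted packets for the same flow on each side (for example with `tcpdump -r side.pcap 'host 10.10.5.20' | wc -l`); the interface names and counts below are illustrative, not real data:

```shell
# Name the likely drop boundary from per-side packet counts for one flow.
boundary() {
  local side_a="$1" count_a="$2" side_b="$3" count_b="$4"
  if [ "$count_a" -gt "$count_b" ]; then
    echo "drop boundary between $side_a and $side_b: $((count_a - count_b)) packets lost"
  else
    echo "no loss between $side_a and $side_b"
  fi
}
boundary eth0 20000 bond0 19942
```

Trivial on purpose: the hard part is capturing both sides over the same window, not the comparison.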

Second: check the three queues that most often overflow

  1. NIC RX ring / driver drops: ethtool -S often exposes “missed,” “no_buffer,” “rx_missed_errors,” etc.
  2. Kernel backlog: /proc/net/softnet_stat drops are the smoking gun for CPU/softirq starvation.
  3. qdisc: tc -s qdisc shows drops at the egress queue; policing shows as drops too.

Third: isolate the conditional patterns

  1. MTU/fragmentation: check PMTUD blackholes, overlay overhead, and ICMP filtering.
  2. Hashing/LACP: does one member have errors? Does one path have asymmetric routing or policing?
  3. conntrack: table full or churn; drops show up as packet loss “only at peak.”
  4. IRQ/CPU pinning: one busy core handling all RX interrupts means intermittent drops under load.

If you do these in order, you stop guessing. You also stop “fixing” things that weren’t broken.

Interesting facts and historical context

  • Ethernet has always dropped frames under congestion; early shared Ethernet relied on collision detection. Modern switched networks moved the pain into buffers and queues.
  • Linux NAPI was introduced to reduce interrupt storms by switching to polling under load; it’s why softirq CPU time matters so much on busy NICs.
  • Bufferbloat became a mainstream term in the late 2000s; big buffers reduce loss but can destroy latency. That trade-off is still alive in your NIC and switch.
  • TCP congestion control assumes loss means congestion. When loss is caused by bad optics or a busted queue, TCP still backs off and your app “mysteriously slows down.”
  • GRO/LRO offloads were created to reduce CPU overhead by coalescing packets, but they can complicate packet captures and timing analysis.
  • fq_codel and related qdiscs became popular because they fight latency under load by managing queues smarter than FIFO.
  • Conntrack exists because stateful firewalls and NAT needed tracking; at scale, conntrack becomes a shared kernel resource that can fail like a database.
  • RSS/RPS/XPS (Receive/Transmit packet steering) are decades of accumulated work to keep multicore CPUs fed evenly; misconfiguration still causes “one core melts, packets drop.”
  • Jumbo frames aren’t new, but overlays made MTU mistakes easier: VXLAN/Geneve overhead turns “works in the lab” into “drops only for large payloads.”

Practical tasks: commands, outputs, decisions

Below are real tasks you can run during an incident. Each includes: command, what the output means, and what decision you make from it. Run them on both ends when you can. If you only run them on one host, you’ll still learn a lot, just not enough to win arguments.

Task 1: Confirm the symptom is real (and which protocol shows it)

cr0x@server:~$ ping -O -c 20 -i 0.2 -s 56 10.10.5.20
PING 10.10.5.20 (10.10.5.20) 56(84) bytes of data.
64 bytes from 10.10.5.20: icmp_seq=1 ttl=62 time=0.611 ms
64 bytes from 10.10.5.20: icmp_seq=2 ttl=62 time=0.623 ms
no answer yet for icmp_seq=3
64 bytes from 10.10.5.20: icmp_seq=4 ttl=62 time=0.640 ms

--- 10.10.5.20 ping statistics ---
20 packets transmitted, 19 received, 5% packet loss, time 3805ms
rtt min/avg/max/mdev = 0.611/0.655/0.812/0.052 ms

Meaning: ICMP loss exists, but ICMP can be deprioritized or rate-limited by devices. Don’t conclude “the network is broken” from ping alone.

Decision: Immediately check TCP-level signals (retransmits) for the real workloads. If only ICMP drops, suspect policing/rate limits.

Task 2: Check TCP retransmits and resets on the host

cr0x@server:~$ netstat -s | egrep -i 'retran|reset|timeout' | head
    1823 segments retransmitted
    17 bad segments received
    42 resets received for embryonic SYN_RECV sockets
    9 connections reset due to unexpected data

Meaning: Retransmits indicate loss somewhere along the path (or reordering severe enough to look like loss). SYN_RECV resets hint at backlog pressure or SYN handling issues.

Decision: If retransmits climb during the incident window, treat this as real packet loss or extreme jitter. Move to counters and captures.
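
Absolute retransmit counts mean little; the delta across the incident window is what matters. A minimal sketch that extracts the counter from `netstat -s` text and diffs two snapshots; the canned lines here stand in for two live runs about a minute apart, and the grep assumes the classic "N segments retransmitted" wording, which can vary slightly by distro:

```shell
# Pull the cumulative retransmit count out of `netstat -s` output.
retrans() { grep -oE '[0-9]+ segments retransmitted' | awk '{print $1}'; }

before=$(echo "    1823 segments retransmitted" | retrans)  # snapshot 1
after=$(echo "    1901 segments retransmitted" | retrans)   # snapshot 2, ~60s later
echo "retransmits in window: $((after - before))"
```

If the delta is flat while users complain, the loss story is probably not TCP retransmission on this host.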

Task 3: Identify the route and whether PMTUD should work

cr0x@server:~$ ip route get 10.10.5.20
10.10.5.20 via 10.10.4.1 dev eth0 src 10.10.4.55 uid 0
    cache

Meaning: You now know the egress interface and next hop. This is your anchor for every subsequent counter and capture.

Decision: Run interface-level stats on eth0. If routing changes during the issue (ECMP), you’ll need to correlate per-path behavior.

Task 4: Check link status, speed/duplex, and obvious physical problems

cr0x@server:~$ ethtool eth0
Settings for eth0:
	Supported ports: [ FIBRE ]
	Supported link modes:   10000baseSR/Full
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: off
	Link detected: yes

Meaning: Link is up at expected speed. That doesn’t clear the optics or cable, but it removes “duplex mismatch” and “it’s actually 1G” from the list.

Decision: If the link is flapping or negotiating at an unexpected rate, stop and fix physical. If it looks stable, proceed to error counters.

Task 5: Read NIC and driver counters for RX/TX drops and errors

cr0x@server:~$ ethtool -S eth0 | egrep -i 'drop|error|miss|no_buffer|timeout' | head -n 20
     rx_missed_errors: 1249
     rx_no_buffer_count: 1180
     rx_errors: 0
     tx_errors: 0
     tx_timeout_count: 0

Meaning: rx_missed_errors / rx_no_buffer_count usually means the NIC couldn’t hand packets to the kernel fast enough (ring exhaustion, CPU starvation, or driver issues). That’s local loss.

Decision: If these increase during the incident, you’re looking at receive-path pressure. Move to softnet stats, IRQ distribution, and ring sizes.
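
One common local fix for receive-path pressure is growing the RX ring toward the hardware maximum. A sketch that parses `ethtool -g`-style output and suggests the resize; the canned output stands in for a live run, and on some drivers `ethtool -G` briefly disrupts the interface, so do it in a window:

```shell
# Canned `ethtool -g eth0` output; on a live host: ring=$(ethtool -g eth0)
ring='Ring parameters for eth0:
Pre-set maximums:
RX:   4096
TX:   4096
Current hardware settings:
RX:   512
TX:   512'
max=$(echo "$ring" | awk '/Pre-set/{p=1} p && /^RX:/{print $2; exit}')
cur=$(echo "$ring" | awk '/Current/{c=1} c && /^RX:/{print $2; exit}')
if [ "$cur" -lt "$max" ]; then
  echo "RX ring at $cur of $max; candidate fix: ethtool -G eth0 rx $max"
fi
```

A bigger ring buys headroom against microbursts; it does not fix a CPU that can't drain the ring, which is what the next tasks check.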

Task 6: Check generic interface drops (less precise, still useful)

cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes  packets  errors  dropped  missed  mcast
    9132299123  9023123  0       431      0       12213
    TX:  bytes  packets  errors  dropped  carrier collsns
    8023312288  8112231  0       0        0       0

Meaning: The dropped field is a coarse aggregate; it can include kernel-level drops, not strictly NIC drops. Still: if it grows with your problem, it’s a big hint.

Decision: If RX dropped grows but ethtool -S doesn’t show misses, suspect kernel backlog, netfilter, or qdisc/virtualization layers.
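
To diff the coarse `ip -s link` counter across two runs, you need the RX `dropped` field, which sits on the line after the RX header. A small parser sketch, fed with the output shape shown above:

```shell
# Pull the RX "dropped" column from `ip -s link show dev <if>` output.
rx_dropped() {
  awk '/RX:.*dropped/{getline; print $4}'
}
# live usage: ip -s link show dev eth0 | rx_dropped
printf 'RX:  bytes  packets  errors  dropped  missed  mcast\n9132299123  9023123  0  431  0  12213\n' | rx_dropped
```

Newer iproute2 can also emit JSON (`ip -s -j link`), which is less fragile to parse if you have jq handy.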

Task 7: Check kernel backlog drops with softnet_stat

cr0x@server:~$ awk '{t+=strtonum("0x"$1); d+=strtonum("0x"$2)} END{print "total_processed="t, "total_dropped="d}' /proc/net/softnet_stat
total_processed=140338812 total_dropped=92841

Meaning: The second column is packets dropped in the per-CPU backlog. Note that softnet_stat values are hexadecimal, which is why the command converts them with gawk’s strtonum instead of summing raw fields. If the dropped total climbs rapidly during the incident, the kernel is dropping before the packet reaches your socket.

Decision: Look at CPU saturation, interrupt affinity, and whether your NIC is feeding one core. Fixing physical won’t help; you need to fix CPU/interrupt/queueing.
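
A per-CPU view is often more telling than the total, because backlog drops concentrate on whichever CPU the NIC is feeding. A sketch that decodes softnet_stat's hex columns with bash's base-16 arithmetic (no gawk needed); sample rows stand in for the live file:

```shell
# softnet_stat: one row per CPU, hex fields; col 1 = processed, col 2 = dropped.
softnet_per_cpu() {
  local cpu=0 processed dropped rest
  while read -r processed dropped rest; do
    # $((16#...)) is bash base-16 conversion
    echo "cpu$cpu processed=$((16#$processed)) dropped=$((16#$dropped))"
    cpu=$((cpu + 1))
  done
}
# live usage: softnet_per_cpu < /proc/net/softnet_stat
printf '0000004a 00000000 00000001\n00fba312 00016a09 00000000\n' | softnet_per_cpu
```

If one CPU owns nearly all the drops, the fix lives in IRQ/RSS distribution, not in buffers.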

Task 8: Identify whether IRQs are concentrated on one CPU

cr0x@server:~$ grep -E 'eth0|mlx|ixgbe|i40e' /proc/interrupts | head
  86:  9833221        12         4         9   IR-PCI-MSI 524288-edge      eth0-TxRx-0
  87:       31   9123312         8        11   IR-PCI-MSI 524289-edge      eth0-TxRx-1
  88:       25        18   9011123        13   IR-PCI-MSI 524290-edge      eth0-TxRx-2
  89:       22        15        10   8832212   IR-PCI-MSI 524291-edge      eth0-TxRx-3

Meaning: If one IRQ line has almost all interrupts on one CPU, you get local drops under load. Even distribution is the goal, but “even” depends on NUMA and workload.

Decision: If skewed, adjust IRQ affinity (carefully), enable/verify RSS, and confirm RPS/XPS where appropriate.
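
Spotting the skew by eye works for four CPUs, not for 64. Note that each individual IRQ sitting on one CPU is normal with pinned RSS queues; the red flag is the per-CPU totals across all of the NIC's IRQs being wildly uneven. A sketch that sums those totals (sample /proc/interrupts lines stand in for a live grep); if skew is confirmed, per-IRQ affinity lives in /proc/irq/<N>/smp_affinity_list:

```shell
# Sum NIC interrupts per CPU column across all matching IRQ lines.
per_cpu_irqs() {
  awk '{n=0
        for (i=2; i<=NF && $i ~ /^[0-9]+$/; i++) {tot[i-1]+=$i; if (i-1>n) n=i-1}
        if (n>maxn) maxn=n}
       END{for (c=1; c<=maxn; c++) printf "cpu%d %d\n", c-1, tot[c]}'
}
# live usage: grep eth0 /proc/interrupts | per_cpu_irqs
printf '86: 9833221 12 4 9 IR-PCI-MSI eth0-TxRx-0\n87: 31 9123312 8 11 IR-PCI-MSI eth0-TxRx-1\n' | per_cpu_irqs
```

Roughly even totals mean RSS is doing its job; one CPU taking nearly everything means you found your intermittent drop generator.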

Task 9: Check CPU time spent in softirq (network processing)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0-18-generic (server) 	02/04/2026 	_x86_64_	(32 CPU)

02:18:11 PM  CPU   %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:18:12 PM  all    8.2  0.0  6.1    0.2   0.0  14.9   0.0    0.0    0.0  70.6
02:18:12 PM   7    3.1  0.0  5.9    0.0   0.0  62.2   0.0    0.0    0.0  28.8

Meaning: A core with extremely high %soft suggests it’s drowning in packet processing. That correlates strongly with softnet_stat drops.

Decision: Rebalance interrupts, check RSS queue count, reduce per-packet overhead (offloads can help), or scale out traffic.

Task 10: Inspect qdisc drops and backlog on egress

cr0x@server:~$ tc -s qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 8122331123 bytes 8112231 pkt (dropped 3312, overlimits 0 requeues 12)
 backlog 0b 0p requeues 12
  maxpacket 1514 drop_overlimit 3312 new_flow_count 4012 ecn_mark 0

Meaning: Drops here are egress queue drops on the host. That’s not “the network”; that’s your server deciding it can’t queue any more packets.

Decision: Determine why the host is congested on egress: shaping/policing, too-small qdisc limit, or the NIC can’t transmit fast enough (or is blocked by flow control issues).

Task 11: Detect MTU mismatch and PMTUD blackholes

cr0x@server:~$ ping -M do -s 1500 -c 3 10.10.5.20
PING 10.10.5.20 (10.10.5.20) 1500(1528) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500

--- 10.10.5.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2025ms

Meaning: This host’s interface MTU is 1500, and a 1500-byte payload plus 28 bytes of IP/ICMP headers makes a 1528-byte packet with DF set, so the kernel rejects it locally before it ever leaves the box. The largest payload that fits a 1500 MTU is 1472 (1472 + 28 = 1500).

Decision: Now do the same test with the largest expected payload for your overlay or jumbo configuration. If some path can’t pass it and ICMP is filtered, you get “loss only for big packets.”
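
Overlay math makes “the largest expected payload” concrete. A sketch that computes the biggest `ping -M do -s` payload that should survive an encapsulated path, assuming roughly 50 bytes of VXLAN-over-IPv4 overhead (outer IP 20 + UDP 8 + VXLAN 8 + inner Ethernet 14); Geneve options and IPv6 underlays cost more, so adjust for your overlay:

```shell
# Largest ICMP payload that fits: underlay MTU minus encap overhead,
# minus inner IP (20) and ICMP (8) headers. 50 bytes is the common
# VXLAN/IPv4 figure; your overlay may differ.
max_icmp_payload() {
  local underlay_mtu="$1" encap_overhead="$2"
  echo $((underlay_mtu - encap_overhead - 20 - 8))
}
max_icmp_payload 1500 50   # prints 1422
# then: ping -M do -s <result> <peer> should pass; <result>+1 should fail
```

If the computed size passes but something slightly larger silently vanishes, you have your PMTUD blackhole and a specific number to put in the ticket.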

Task 12: Confirm offloads and understand capture weirdness

cr0x@server:~$ ethtool -k eth0 | egrep -i 'gro|gso|tso|lro|rx-checksumming|tx-checksumming'
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

Meaning: With GRO/GSO/TSO enabled, packet captures can show “giant” packets or odd segmentation patterns. That doesn’t mean jumbo frames are on; it means the kernel is optimizing.

Decision: Don’t disable offloads as your first move. If you must test, do it temporarily and only after capturing baseline counters. Disabling offloads can increase CPU and create loss.

Task 13: Capture at the right place and time (and keep it small)

cr0x@server:~$ sudo tcpdump -i eth0 -nn -s 96 -w /tmp/edge.pcap 'host 10.10.5.20 and (tcp or icmp)' -c 20000
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 96 bytes
20000 packets captured
20000 packets received by filter
0 packets dropped by kernel

Meaning: “0 packets dropped by kernel” means your capture process kept up. Good. If tcpdump drops packets, your capture becomes untrustworthy for microburst analysis.

Decision: If you see loss in the app but tcpdump shows no gaps on egress, you likely have remote drops or return-path issues. Capture on the receiver too.
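
For “only sometimes” problems, a one-shot capture usually misses the event. tcpdump’s rotation flags give you a standing ring buffer of small files instead; a sketch of the invocation, echoed rather than executed here (interface, peer, and sizes reuse the example above):

```shell
# -G rotates the file every 60 seconds, -W keeps only the last 10 files,
# and %s expands to an epoch timestamp in each filename.
cmd="tcpdump -i eth0 -nn -s 96 -G 60 -W 10 -w /tmp/loss-%s.pcap 'host 10.10.5.20 and (tcp or icmp)'"
echo "run on the affected host: sudo $cmd"
```

With ten one-minute files on disk, the burst is already captured when someone finally reports it.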

Task 14: Use ss to check socket pressure (accept queue, rmem/wmem)

cr0x@server:~$ ss -s
Total: 1789 (kernel 0)
TCP:   912 (estab 501, closed 330, orphaned 0, timewait 309/0), ports 0

Transport Total     IP        IPv6
RAW	  0         0         0
UDP	  71        60        11
TCP	  582       421       161
INET	  653       481       172
FRAG	  0         0         0

Meaning: This is a sanity view. For deeper diagnosis, inspect specific listeners and their queues.

cr0x@server:~$ ss -lntp | head
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
LISTEN 512    4096   0.0.0.0:443        0.0.0.0:*     users:(("nginx",pid=1412,fd=8))
LISTEN 128    128    127.0.0.1:9100     0.0.0.0:*     users:(("node_exporter",pid=1201,fd=3))

Meaning: Recv-Q on a listener is the backlog of pending connections; if it’s consistently at/near Send-Q (the configured backlog), you’re dropping SYNs or stalling accept.

Decision: If your listener is saturated, packet loss may be a byproduct (retries/timeouts) rather than true network loss. Fix application accept rate, tune backlog, or scale.
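
The kernel also keeps cumulative counters for exactly this failure mode. A sketch that pulls the accept-queue overflow count out of `netstat -s` text; the canned line stands in for a live run, and the pattern assumes the usual TcpExt phrasing (“times the listen queue of a socket overflowed”, with “SYNs to LISTEN sockets dropped” as its sibling):

```shell
# Cumulative accept-queue overflows (TcpExt ListenOverflows).
listen_overflows() {
  grep -oE '[0-9]+ times the listen queue' | awk '{print $1}'
}
# live usage: netstat -s | listen_overflows
echo "    37 times the listen queue of a socket overflowed" | listen_overflows
```

Diff it across the incident window: a climbing count means the application, not the network, is shedding connections.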

Task 15: Check conntrack saturation (stateful firewall/NAT environments)

cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 252118
net.netfilter.nf_conntrack_max = 262144

Meaning: You’re near the limit. When conntrack is full, new connections can fail in ways that look like packet loss (SYNs vanish, retries happen, clients time out).

Decision: If near max during incidents, you either raise the limit (and memory), reduce churn, or stop tracking flows that don’t need tracking.
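
The alerting dashboard this decision calls for reduces to one division. A sketch using the count/max pair from this task; on a live host the two arguments come from the same sysctls shown above:

```shell
# Conntrack utilization percent; sustained >80% deserves an alert.
conntrack_pct() {
  # args: nf_conntrack_count, nf_conntrack_max
  echo "$((100 * $1 / $2))%"
}
conntrack_pct 252118 262144   # the values shown above: prints 96%
```

96% at a quiet moment means peak churn will hit the wall, which is exactly the “loss only at peak” signature.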

Task 16: Find retransmits per connection (stop blaming the entire network)

cr0x@server:~$ ss -ti dst 10.10.5.20 | head -n 30
ESTAB 0 0 10.10.4.55:48216 10.10.5.20:443
	 cubic wscale:7,7 rto:204 rtt:4.2/0.8 ato:40 mss:1460 pmtu:1500 rcvmss:1460 advmss:1460 cwnd:10 bytes_sent:88212 bytes_acked:88120 bytes_received:112233 segs_out:712 segs_in:690 data_segs_out:502 retrans:7/23 lost:0 sacked:8

Meaning: This connection is retransmitting. Now you can ask “what’s special about this flow?” rather than arguing about the whole fabric.

Decision: Compare affected vs unaffected flows: destination, DSCP, port, MTU, path (ECMP), and whether it rides a specific bond member.

Joke #2: Turning off every offload to “make it simple” is like removing the car’s seats to improve aerodynamics. Technically a change, emotionally satisfying, operationally suspicious.

Three corporate-world mini-stories

Mini-story 1: The outage caused by a wrong assumption

A platform team inherited a fleet of Linux gateways doing NAT for several internal services. One week, teams started reporting “random packet loss” affecting login flows. Only sometimes. Mostly during peak hours. The incident channel filled with ping screenshots, as is tradition.

The wrong assumption: “If pings show loss, it must be the network fabric.” The team escalated to the network group, who looked at switch interface errors and found none. They responded with the classic: “Looks clean from here.” Tension rose. Someone proposed a rolling reboot of the gateways “to clear whatever.”

A calmer engineer checked netstat -s and saw retransmits rising. Then they checked conntrack counters and found the table sitting near the limit. Not always full. Just flirting with it. Under peak churn, new flows failed. Some clients retried on new source ports, adding more churn. A feedback loop, but only at certain load profiles.

The punchline was banal: the system had grown, the conntrack sizing hadn’t. A previous capacity plan assumed “connections are sticky.” They weren’t. A mobile client update increased short-lived connections. Nothing was “wrong” with the fabric; the gateway was dropping state creation and effectively blackholing SYNs.

The fix wasn’t heroic. They increased conntrack max with appropriate memory headroom, stopped tracking traffic that didn’t need it, and added a dashboard that plotted nf_conntrack_count versus nf_conntrack_max with alerting on sustained > 80%.

Lesson: intermittent “packet loss” can be a state exhaustion problem. If you don’t check shared kernel tables early, you’ll waste time “debugging the network” while the server quietly refuses to remember new connections.

Mini-story 2: The optimization that backfired

A team ran a low-latency trading-adjacent service (not high-frequency, but latency-sensitive enough that engineers argued about microseconds like it was a hobby). They wanted to reduce CPU overhead on a busy ingestion node. Someone suggested disabling GRO/GSO/TSO so packet processing would be “more predictable” and captures would be “more accurate.”

They rolled the change during a quiet window. CPU usage went up, but still within budget. A week later, traffic increased. Suddenly the same node began showing intermittent timeouts and retransmits. The dashboards were confusing: network throughput was unchanged, but tail latency spiked and client retries climbed.

The failure mode was classic: by disabling offloads, they increased per-packet CPU cost. Under higher packet rates, a single core handling softirq work got saturated. /proc/net/softnet_stat drops climbed. Those drops looked like “the network losing packets,” but it was the host shedding load because it couldn’t keep up.

The team reverted offload changes and instead fixed the real problem: IRQ distribution and RX queue sizing, plus ensuring RSS was configured correctly and mapped to appropriate CPUs. Captures remained workable by capturing at the right boundaries and using appropriate snap lengths.

Lesson: performance “optimizations” that remove kernel/NIC features often just move cost into the CPU and convert predictable throughput into intermittent loss. If you change offloads, you own the side effects—especially during the next traffic step, not today’s.

Mini-story 3: The boring practice that saved the day

A mid-size company had a habit that was deeply unsexy: they kept a runbook for network incident triage that included “record these counters before and during.” It listed the exact commands—ip -s link, ethtool -S, tc -s qdisc, softnet_stat, and a small tcpdump filter—and it required pasting outputs into the incident ticket.

One day, customer requests started timing out, but only from one AZ. Engineers swore it was a code deploy because the timing lined up. The runbook forced them to collect counters on an unaffected node and an affected node. The comparison was decisive: affected nodes showed a rising count of RX “missed” errors in ethtool -S, while unaffected nodes did not.

The network team checked the switch port stats for the affected rack and found a corresponding pattern of physical-layer errors. Not enough to flap the link. Just enough to ruin frames and trigger retransmits. The fix was physical: replace optics and clean a few patches. It was resolved without a rollback, and the postmortem had a clean, evidence-based timeline.

Lesson: boring, consistent counter collection saves time because it makes comparisons easy. The job isn’t to be clever; the job is to be correct quickly.

Common mistakes: symptom → root cause → fix

1) “Ping shows 5% loss, users complain, so the network is dropping packets”

Symptom: ping loss, but TCP apps mostly fine; ICMP RTT jittery.

Root cause: ICMP is rate-limited or deprioritized on routers/firewalls; some devices treat it as a nuisance.

Fix: validate with TCP retransmits (netstat -s, ss -ti) and application-level retries; use targeted tcpdump for the actual protocol.

2) “No interface errors, so it’s not physical”

Symptom: retransmits spike; ip -s link looks clean.

Root cause: physical issues can show up in switch counters, FEC/PCS counters, or vendor-specific NIC stats not surfaced in basic counters.

Fix: use ethtool -S and correlate with switch port error counters; check for link flaps in logs.

3) “We increased buffers, so drops should be gone”

Symptom: packet loss reduced, but tail latency gets worse; users still complain.

Root cause: bufferbloat; fewer drops at the cost of giant queues and high latency.

Fix: use sane qdiscs (often fq_codel), right-size buffers, and watch p99 latency as a first-class metric.

4) “Only big uploads fail; small requests succeed”

Symptom: small packets fine; large transfers stall; retransmits climb.

Root cause: MTU/PMTUD blackhole, often because ICMP fragmentation-needed is blocked, especially with overlays.

Fix: validate path MTU with ping -M do; ensure correct MTU end-to-end; allow required ICMP; adjust overlay MTU.

5) “The NIC drops are zero, so the host can’t be dropping”

Symptom: retransmits and timeouts; ethtool -S looks clean.

Root cause: drops occur in kernel backlog (softnet_stat), qdisc (tc -s qdisc), netfilter, or socket queues.

Fix: check /proc/net/softnet_stat, tc qdisc stats, and ss queues; address CPU/IRQ distribution and queue limits.

6) “We turned off offloads to debug, and now it’s worse”

Symptom: new drops appear under load; CPU climbs; softirq high.

Root cause: removing offloads increases per-packet CPU cost and can trigger backlog drops.

Fix: revert offload changes; debug using captures placed at boundaries; fix interrupt/queue scaling instead.

7) “It must be the firewall; add rules later”

Symptom: intermittent connection failures under peak.

Root cause: conntrack saturation or churn, plus costly rulesets causing CPU pressure.

Fix: check conntrack utilization; reduce tracked traffic; tune timeouts; ensure firewall processing is sized and observable.

8) “Bonding gives redundancy, so it can’t cause loss”

Symptom: intermittent loss only for some flows; one member link looks suspicious.

Root cause: LACP member imbalance, a bad cable/optic on one member, or hashing that pins certain flows to a degraded link.

Fix: check per-member stats; temporarily drain/remove the suspect member; adjust hashing policy where appropriate.

Checklists / step-by-step plan

Checklist A: During the incident (15–30 minutes)

  1. Define the symptom in protocol terms. Is it ICMP loss, TCP retransmits, UDP gaps, or application timeouts?
  2. Pin down the path. Run ip route get to identify egress interface and next hop.
  3. Collect baseline counters now. Run:
    • ip -s link show dev <if>
    • ethtool -S <if>
    • tc -s qdisc show dev <if>
    • awk ... /proc/net/softnet_stat
    • netstat -s or ss -s
  4. Wait 60 seconds and collect again. Deltas matter more than absolute numbers.
  5. Capture targeted traffic for a short window. Small snaplen, narrow filter, fixed count, write to disk.
  6. Correlate one symptom to one counter. If you can’t, you don’t yet know where the loss is.
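
Steps 3 and 4 beg to be one script: run it twice, a minute apart, and diff the files. A minimal sketch, assuming the interface name is passed as the first argument; a tool missing on a given box just leaves an error line in the snapshot instead of breaking it:

```shell
#!/usr/bin/env bash
# Snapshot the Checklist A counters into one timestamped file for diffing.
IFACE="${1:-eth0}"
out="/tmp/netsnap-$(date +%s)-$IFACE.txt"
{
  echo "== ip -s link";   ip -s link show dev "$IFACE"
  echo "== ethtool -S";   ethtool -S "$IFACE"
  echo "== tc -s qdisc";  tc -s qdisc show dev "$IFACE"
  echo "== softnet";      cat /proc/net/softnet_stat
  echo "== netstat -s";   netstat -s
} > "$out" 2>&1
echo "wrote $out"
```

Paste both snapshots into the ticket; `diff` between them is the evidence that ends the blame ping-pong.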

Checklist B: If it smells like CPU/softirq pressure

  1. Check /proc/net/softnet_stat deltas during the issue.
  2. Check mpstat for high %soft on one or a few cores.
  3. Check /proc/interrupts for skewed IRQ distribution.
  4. Check RSS queue count and whether the driver exposes multiple queues.
  5. Only then consider ring sizes, offloads, and tuning. Avoid random sysctl churn.

Checklist C: If it smells like MTU / overlay issues

  1. Confirm interface MTU: ip link show.
  2. Validate payload sizes with ping -M do tests on both ends.
  3. Check if ICMP is being filtered in your environment (security teams love doing this “for safety”).
  4. Confirm overlay MTU settings (VXLAN/Geneve) and ensure underlay supports them.
  5. Fix MTU consistently; don’t rely on fragmentation in modern production unless you enjoy surprises.

Checklist D: If it smells like state tables (conntrack) or policy

  1. Check conntrack utilization via sysctls and observe during peak.
  2. Inspect firewall counters (nftables/iptables) if available; dropped packets with matching rules are not “mysterious.”
  3. Reduce tracked traffic where safe (e.g., not all traffic needs state).
  4. Right-size the table with memory headroom and alerting.

FAQ

1) If ping shows loss but TCP looks fine, is that “real” packet loss?

It can be real ICMP loss, which is often irrelevant to application performance. Many devices rate-limit ICMP. Validate with TCP retransmits and app-level metrics.

2) What’s the single most useful Linux file for “drops under load”?

/proc/net/softnet_stat. If the dropped column climbs during the incident, you have kernel backlog drops. That usually points to CPU/interrupt/packet-rate pressure.

3) Why do averages lie so badly for intermittent loss?

Because drops often happen in microbursts. A 200ms burst can overflow a queue and cause visible loss, while your 1-minute average throughput looks normal.

4) Should I disable GRO/TSO/GSO to make tcpdump “accurate”?

Not as a first move. Offloads reduce CPU overhead. Disabling them can create loss by starving softirq. If you must test, do it temporarily, capture before/after, and watch softnet drops and CPU.

5) How do I tell whether drops happen on ingress or egress?

Ingress: rising RX misses/no-buffer, softnet drops, or socket receive issues. Egress: tc -s qdisc drops, TX queue issues, or shaping/policing. Captures on both sides of a boundary make this obvious.

6) Why does MTU mismatch look like “only sometimes”?

Because not all packets are large. Small requests succeed, large responses or uploads stall. If PMTUD is broken (often due to filtered ICMP), the connection can hang and retransmit until timeout.

7) Can conntrack really look like packet loss?

Yes. When conntrack is full or heavily contended, new flows can fail. Clients see timeouts and retries, and it’s easy to mislabel that as “packets disappearing.”

8) What if tcpdump says “packets dropped by kernel” during capture?

Your capture can’t keep up, which is common on busy links. Narrow the filter, reduce snaplen, capture fewer packets, or capture at a less busy point. Don’t use a lossy capture to prove microbursts.

9) Does a clean ip -s link mean the NIC is fine?

Not necessarily. Many important counters are only in ethtool -S, and physical-layer errors can be more visible on the switch side.

10) How do I avoid endless blame ping-pong between teams?

Agree on boundaries and evidence: “packet seen here, not seen there,” plus counters that increase. Captures at two points beat opinions every time.

Conclusion: next steps that work

Intermittent packet loss stops being mystical when you stop treating it like a vibe and start treating it like a queue overflow or a conditional path failure. The winning move is to locate the drop boundary using deltas in counters and small, targeted captures.

Do this next, in order:

  1. On an affected host, record deltas for netstat -s, ip -s link, ethtool -S, tc -s qdisc, and /proc/net/softnet_stat over a 1–5 minute window.
  2. If drops are local, decide which domain owns them: NIC ring, softirq backlog, qdisc, conntrack, or socket pressure.
  3. If drops are not local, capture on both ends (or on both sides of a boundary) and force the conversation to be about evidence.
  4. Write the runbook you wish you had today, while the pain is fresh. Keep it short. Require outputs in tickets.

The best debug flow is the one you can execute while tired, under pressure, with three people arguing in your ear. Build that flow, and intermittent loss becomes just another Tuesday.
