It starts the same way every time: the application team swears “the network is slow,” the network team swears “the servers are dropping,” and you’re stuck watching a dashboard that says “TCP Retransmits” like it’s a diagnosis. It isn’t. It’s a symptom.
On Debian 13, retransmits can mean real packet loss, queue collapse, PMTU issues, offload weirdness, conntrack overload, packet reordering, or a middlebox doing something “helpful” to your traffic. The trick isn’t to prove retransmits exist. The trick is to pin down where the loss (or the illusion of loss) actually occurs—host, hypervisor, NIC, switch, firewall, or the far end.
What retransmits really mean (and what they don’t)
TCP retransmissions happen when the sender believes a segment didn’t make it to the receiver. The important part is “believes.” TCP infers loss from missing acknowledgements (ACKs), duplicate ACKs, SACK blocks, and timers. That missing ACK might be because:
- The data segment was dropped on the way.
- The ACK was dropped on the return path.
- The segment arrived but got stuck behind a giant queue (bufferbloat), arriving “late enough” to trigger loss recovery.
- Packets were reordered and “looked lost” temporarily.
- The receiver was so overloaded it didn’t ACK in time.
- A middlebox mangled MSS/MTU/ECN, causing blackholing or stalling.
So “retransmits are high” is like saying “the smoke alarm is loud.” Yes. And now you need to find the fire, the burnt toast, or the person vaping under the detector.
On modern Linux, the kernel’s TCP stack is excellent. When you see retransmits “killing performance,” it’s usually the system telling you that something beneath or beside TCP is behaving badly: NIC rings, driver bugs, offload interactions, IRQ starvation, or the network path dropping bursts.
Dry truth: your app might be fine. Your disks might be fine. Your CPU might be fine. Your retransmits are still real, and they’re still slowing you down. Fix the path.
Fast diagnosis playbook
This is the order that wins incidents. It doesn’t prove everything; it finds the bottleneck fast enough to stop the bleeding.
First: decide if it’s one host, one link, or the path
- Compare multiple clients/servers. If one Debian 13 host is the outlier, start there. If every host sees it to the same destination, suspect the path or the destination.
- Check interface counters. If the NIC shows RX/TX errors, drops, missed, or “no buffer,” you likely have a host-side drop problem.
- Check TCP stats. If retransmits rise but interface drops stay flat, suspect upstream congestion, reordering, MTU blackholes, or ACK loss on return.
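A minimal first-pass check, assuming eno1 is the interface and the usual hypothetical destination: sample the TCP retransmit counters and the NIC RX counters together, repeat a minute later, and compare. (nstat accepts counter-name patterns as arguments.)
cr0x@server:~$ # quick triage: do retransmits move together with local RX drops?
cr0x@server:~$ nstat -az TcpRetransSegs TcpExtTCPSynRetrans
cr0x@server:~$ ip -s link show dev eno1 | grep -A1 'RX:'
If retransmits climb while the RX drop column stays flat, lean toward the path; if both climb, start with the host.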
Second: locate the loss direction (data vs ACK path)
- tcpdump on both ends (or one end plus span/mirror) and check: are data segments missing on ingress, or are ACKs missing on egress?
- Check asymmetry. Loss on ACK path looks like “sender retransmits, receiver got data.”
Third: test the usual suspects in the shortest possible loop
- MTU/PMTUD (especially VLANs, tunnels, jumbo frames, overlay networks).
- Offloads (GRO/LRO/TSO/GSO) when packet capture looks “weird” or when a NIC/driver combo is known cranky.
- IRQ/softirq starvation (high CPU in ksoftirqd, drops with “no buffer”).
- Firewall/conntrack (drops, timeouts, “invalid,” or conntrack table pressure).
- Bonding/LACP (reordering across links, especially with certain hashing policies).
One rule: don’t tune TCP until you’ve proven it’s the TCP stack’s fault. Most “TCP tuning” is expensive superstition with a nice spreadsheet.
Facts and context you can use in war rooms
- TCP retransmission logic predates your datacenter. Classic fast retransmit and fast recovery behaviors were formalized in the 1990s as the Internet grew and loss became normal.
- SACK (Selective Acknowledgment) changed the game. It lets receivers tell senders exactly which blocks arrived, reducing needless retransmits on lossy paths.
- Linux has been a serious TCP implementation for decades. Many high-performance networking features—like modern congestion control options—were battle-tested in Linux before becoming “industry defaults.”
- CUBIC became Linux’s default congestion control because the Internet got faster. Reno-like behavior was too conservative for high-bandwidth, high-latency links.
- RACK (Recent ACKnowledgment) improved loss detection. It uses the timing of recent ACKs to distinguish reordering from loss better than older heuristics, reducing spurious retransmits.
- GRO/TSO offloads changed what packet capture looks like. tcpdump can show “giant” packets or odd segmentation because the NIC/kernel is doing work you used to see on the wire.
- Ethernet has no built-in end-to-end reliability. Drops are allowed behavior under congestion; TCP is the reliability layer, not the switch.
- Microbursts are old, not trendy. Fast links and shallow buffers can drop short spikes even when average utilization looks fine.
- PMTUD failure is a recurring classic. ICMP filtering + tunnels/overlays = blackholes that look like “random retransmits.”
One paraphrased idea that still holds up comes from Werner Vogels: “Everything fails, all the time; design and operate like it.”
Get ground truth: measure retransmits, loss, and where it happens
“TCP retransmits” appears in monitoring because it correlates with bad user experience. But correlation is not location. You want an answer to three questions:
- Is the loss real or apparent? (Reordering/late packets can trigger recovery without true drop.)
- Is it happening on the host? (Driver/NIC/IRQ/queue drops.)
- Is it happening in the network path? (Switch congestion, MTU blackhole, firewall drops.)
Debian 13 ships a modern kernel and userland. That’s good news. It also means you can easily misinterpret tools if you don’t account for offloads, namespaces, virtualization layers, and overlay networks. Your tcpdump is not a confession; it’s a witness statement.
Joke #1: Packet loss is like office politics: it’s rarely where the loudest person says it is.
Practical tasks: commands, outputs, decisions (12+)
These are field tasks: run them during an incident and you’ll narrow the search space without starting a week-long “network performance initiative.” Each task includes: command, what typical output means, and what decision you make next.
Task 1 — Confirm retransmits and who’s suffering (ss)
cr0x@server:~$ ss -ti dst 10.20.30.40
ESTAB 0 0 10.20.30.10:54322 10.20.30.40:443
cubic wscale:7,7 rto:204 rtt:5.2/0.8 ato:40 mss:1460 pmtu:1500 rcvmss:1460 advmss:1460 cwnd:10 bytes_acked:1453297 segs_out:12094 segs_in:11820 send 22.5Mbps lastsnd:12 lastrcv:12 lastack:12 pacing_rate 45.0Mbps unacked:3 retrans:57/1034 reordering:3
Meaning: retrans:57/1034 shows retransmitted segments. If retransmits climb during throughput collapse, you have a TCP recovery problem, not merely “slow app.”
Decision: If retransmits are concentrated to one destination or one interface, scope it. If it’s global, suspect a shared path element (switch, firewall, hypervisor, uplink).
Task 2 — System-wide TCP retrans stats (nstat / netstat)
cr0x@server:~$ nstat -az | egrep 'TcpRetransSegs|TcpExtTCPTimeouts|TcpExtTCPSynRetrans|TcpExtTCPFastRetrans'
TcpRetransSegs 18933 0.0
TcpExtTCPTimeouts 412 0.0
TcpExtTCPFastRetrans 12080 0.0
TcpExtTCPSynRetrans 55 0.0
Meaning: Rising TcpExtTCPTimeouts implies RTO-based retransmits (worse), while TcpExtTCPFastRetrans points at fast retransmit recovery (often congestion or reordering). SYN retransmits hint at path/filtering issues.
Decision: If timeouts rise, immediately check MTU/blackholes and severe drops. If mostly fast retransmits, focus on congestion, microbursts, or reordering.
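To turn those lifetime totals into per-interval deltas, you can lean on nstat's history file: a plain nstat run prints increments since the previous run. A rough sketch (the 60-second window is arbitrary):
cr0x@server:~$ nstat > /dev/null    # prime/refresh the history file
cr0x@server:~$ sleep 60
cr0x@server:~$ nstat | egrep 'TcpRetransSegs|TCPTimeouts|TCPFastRetrans|TCPSynRetrans'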
Task 3 — Check NIC error and drop counters (ip -s link)
cr0x@server:~$ ip -s link show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 1432 980 12034
TX: bytes packets errors dropped carrier collsns
8765432109 7456123 0 0 0 0
Meaning: dropped and missed rising on RX means the host couldn’t process incoming packets fast enough (ring overflow, IRQ starvation, CPU pressure).
Decision: If RX drops/missed increase, prioritize host-side: driver stats, rings, IRQ affinity, RPS/XPS, offloads, CPU contention.
Task 4 — Deep driver-level counters (ethtool -S)
cr0x@server:~$ sudo ethtool -S eno1 | egrep -i 'drop|miss|err|timeout|buffer|fifo' | head
rx_dropped: 1432
rx_missed_errors: 980
rx_no_buffer_count: 975
rx_fifo_errors: 0
tx_timeout_count: 0
Meaning: rx_no_buffer_count is the NIC saying “I had frames, but no place to put them.” That’s a smoking gun for ring/IRQ/softirq issues.
Decision: Increase rings (Task 9), spread interrupts, check CPU softirq load (Task 6), and verify you’re not doing expensive firewalling on this box.
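To watch these counters move in near real time during a load test, a simple watch loop works; the exact counter names vary by driver, so adjust the pattern for your NIC:
cr0x@server:~$ sudo watch -d -n 1 "ethtool -S eno1 | egrep -i 'drop|miss|no_buffer|fifo'"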
Task 5 — Validate link speed/duplex and auto-negotiation (ethtool)
cr0x@server:~$ sudo ethtool eno1 | egrep 'Speed|Duplex|Auto-negotiation|Link detected'
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Link detected: yes
Meaning: A server stuck at 1Gb/s when you expect 10/25/100Gb/s is not “TCP loss,” it’s a time machine back to 2009.
Decision: Fix cabling/SFP/switch config. Don’t tune TCP to compensate for a negotiated downgrade.
Task 6 — Check softirq pressure and CPU contention (mpstat / top)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.12.0 (server) 12/29/2025 _x86_64_ (32 CPU)
12:03:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:03:12 PM all 12.1 0.0 8.2 0.3 0.0 18.4 0.0 61.0
12:03:12 PM 7 5.0 0.0 6.1 0.0 0.0 72.3 0.0 16.6
Meaning: One CPU pinned in %soft while others are idle screams “interrupt/softirq imbalance.” Packets are arriving, but one core is drowning.
Decision: Inspect IRQ distribution (Task 7). Consider enabling/modifying RPS/XPS or adjusting IRQ affinity.
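One quick thing to check before reshuffling IRQs: whether RPS is enabled at all. An all-zero mask means RX processing stays on whichever CPU took the interrupt. A sketch, assuming eno1 and an illustrative mask that spreads queue 0 across CPUs 1-15 (pick a mask that matches your topology):
cr0x@server:~$ grep . /sys/class/net/eno1/queues/rx-*/rps_cpus
cr0x@server:~$ echo fffe | sudo tee /sys/class/net/eno1/queues/rx-0/rps_cpus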
Task 7 — Check IRQ distribution for the NIC (procfs)
cr0x@server:~$ grep -E 'eno1|eth0|ixgbe|i40e|mlx|ens' /proc/interrupts | head -n 5
88: 12000342 0 0 0 IR-PCI-MSI 262144-edge eno1-TxRx-0
89: 3 0 0 0 IR-PCI-MSI 262145-edge eno1-TxRx-1
90: 1 0 0 0 IR-PCI-MSI 262146-edge eno1-TxRx-2
Meaning: Queue 0 is handling essentially all interrupts. That creates RX drops under load, which turns into TCP retransmits.
Decision: Fix affinity (irqbalance, manual masks), verify RSS is enabled, confirm multiple queues are active.
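To act on that: confirm how many queues the driver actually exposes, raise the count if the hardware allows it, and pin a hot queue's IRQ by hand if irqbalance isn't doing the right thing (note that irqbalance can overwrite manual masks). The queue count and IRQ number below are illustrative:
cr0x@server:~$ sudo ethtool -l eno1          # supported vs currently configured queue counts
cr0x@server:~$ sudo ethtool -L eno1 combined 8
cr0x@server:~$ echo 4 | sudo tee /proc/irq/89/smp_affinity   # hex CPU mask: CPU 2 for the IRQ shown above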
Task 8 — Look for packet drops inside qdisc (tc -s)
cr0x@server:~$ tc -s qdisc show dev eno1
qdisc mq 0: root
qdisc fq_codel 8001: parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 8123456789 bytes 7456123 pkt (dropped 0, overlimits 0 requeues 2)
backlog 0b 0p requeues 2
Meaning: If qdisc shows significant drops, you may be self-inflicting loss on egress (traffic shaping, mis-sized fq_codel, or too-small txqueuelen).
Decision: If egress drops exist, fix shaping and queue settings before blaming the network. If none, move on.
Task 9 — Check and tune NIC ring sizes (ethtool -g / -G)
cr0x@server:~$ sudo ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 4096
TX: 4096
Current hardware settings:
RX: 512
TX: 512
Meaning: A small RX ring can overflow during bursts. That yields RX drops, which look like “random TCP retransmits.”
Decision: If you see rx_no_buffer_count and missed drops, increase RX/TX rings cautiously and re-test.
cr0x@server:~$ sudo ethtool -G eno1 rx 2048 tx 2048
Task 10 — Confirm MTU end-to-end and spot mismatches (ip link, ping -M do)
cr0x@server:~$ ip link show dev eno1 | head -n 1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
cr0x@server:~$ ping -c 3 -M do -s 1472 10.20.30.40
PING 10.20.30.40 (10.20.30.40) 1472(1500) bytes of data.
1480 bytes from 10.20.30.40: icmp_seq=1 ttl=63 time=0.412 ms
1480 bytes from 10.20.30.40: icmp_seq=2 ttl=63 time=0.398 ms
1480 bytes from 10.20.30.40: icmp_seq=3 ttl=63 time=0.405 ms
--- 10.20.30.40 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2040ms
Meaning: This tests DF (don’t fragment). If it fails on a path you believe supports 1500/9000, you’ve found a PMTU problem.
Decision: If DF pings fail at expected sizes, stop. Fix MTU consistency or ICMP “fragmentation needed” filtering. Retransmits are a downstream symptom.
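If the full-size ping fails, a quick sweep narrows down what the path actually carries. Sizes are ICMP payload bytes; add 28 for the IP and ICMP headers (so 1472 probes a 1500 MTU, 8972 probes 9000). Destination is the same hypothetical host:
cr0x@server:~$ for size in 8972 1472 1464 1400 1300; do ping -c1 -W1 -M do -s $size 10.20.30.40 >/dev/null 2>&1 && echo "$size OK" || echo "$size FAIL"; done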
Task 11 — Capture on the host and label retransmissions (tcpdump)
cr0x@server:~$ sudo tcpdump -i eno1 -nn 'host 10.20.30.40 and tcp port 443' -vv
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:05:01.101010 IP 10.20.30.10.54322 > 10.20.30.40.443: Flags [P.], seq 12001:13461, ack 8801, win 501, length 1460
12:05:01.104444 IP 10.20.30.10.54322 > 10.20.30.40.443: Flags [P.], seq 12001:13461, ack 8801, win 501, length 1460
12:05:01.105000 IP 10.20.30.40.443 > 10.20.30.10.54322: Flags [.], ack 13461, win 8192, length 0
Meaning: the same sequence range (12001:13461) appears twice from the sender; that’s the retransmission. tcpdump doesn’t label it for you (Wireshark/tshark will), so you spot it by repeated seq numbers. And remember: capture point matters. A sender-side capture shows what was handed to the driver, not proof the segment ever reached the wire or the far end.
Decision: If possible, capture on both ends. If only one end: correlate with NIC drops and switch counters to avoid blaming ghosts.
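A sketch of the two-ended capture: short snaplen to keep files small, mirror-image filters, same time window. Hostnames and file paths are hypothetical; compare the pcaps afterwards by sequence number (tshark or Wireshark makes that much less painful).
cr0x@server:~$ sudo tcpdump -i eno1 -nn -s 128 -w /tmp/sender.pcap 'host 10.20.30.40 and tcp port 443'
cr0x@receiver:~$ sudo tcpdump -i eno1 -nn -s 128 -w /tmp/receiver.pcap 'host 10.20.30.10 and tcp port 443'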
Task 12 — Verify offloads when capture looks wrong (ethtool -k)
cr0x@server:~$ sudo ethtool -k eno1 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
Meaning: TSO/GSO/GRO can make captures misleading and can interact badly with certain drivers or tunnels, especially in virtualized overlays.
Decision: If you suspect offload-induced weirdness, test with a temporary toggle on one host (not fleet-wide) and measure retransmits again.
cr0x@server:~$ sudo ethtool -K eno1 gro off gso off tso off
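Treat that toggle as a diagnostic, not a fix: re-run the measurement, record the result, and put the offloads back (ethtool settings don’t persist across reboots anyway unless your network config reapplies them):
cr0x@server:~$ sudo ethtool -K eno1 gro on gso on tso on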
Task 13 — Check conntrack pressure and nftables drops
cr0x@server:~$ sudo nft list ruleset | head -n 20
table inet filter {
chain input {
type filter hook input priority filter; policy drop;
ct state established,related accept
iif "lo" accept
tcp dport { 22, 443 } accept
counter packets 189234 bytes 120934234 drop
}
}
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 258901
net.netfilter.nf_conntrack_max = 262144
Meaning: If conntrack is near max, new flows may be dropped or delayed. That produces SYN retransmits and “random” stalls.
Decision: If you’re doing stateful firewalling on a busy host, either raise limits with memory awareness, reduce tracked traffic, or move filtering off-box.
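If you decide to raise the limit, do it deliberately and persist it. The value below is purely illustrative; each conntrack entry costs kernel memory (roughly a few hundred bytes), so size it against RAM rather than copying numbers from the internet:
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
cr0x@server:~$ echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/90-conntrack.conf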
Task 14 — Look for reordering and DSACK clues (ss -i, nstat)
cr0x@server:~$ ss -ti src 10.20.30.10:54322
ESTAB 0 0 10.20.30.10:54322 10.20.30.40:443
cubic rtt:5.1/1.0 rto:204 mss:1460 cwnd:9 unacked:2 retrans:22/903 reordering:18
cr0x@server:~$ nstat -az | egrep 'TcpExtTCPDSACKRecv|TcpExtTCPDSACKOfoRecv|TcpExtTCPSACKReorder'
TcpExtTCPDSACKRecv 3812 0.0
TcpExtTCPDSACKOfoRecv 1775 0.0
TcpExtTCPSACKReorder 2209 0.0
Meaning: Reordering-related counters rising suggests the network isn’t dropping packets as much as shuffling them. Reordering can still hurt throughput because TCP reacts defensively.
Decision: Investigate LACP/bonding hashing, ECMP paths, and any device doing per-packet load balancing. Fixing reordering often “fixes retransmits” without changing loss rates.
Task 15 — Confirm path MTU from the kernel’s view (ip route get)
cr0x@server:~$ ip route get 10.20.30.40
10.20.30.40 dev eno1 src 10.20.30.10 uid 1000
cache mtu 1500
Meaning: The route cache MTU is what the kernel thinks is safe. If you expect 9000 and see 1500, something learned a smaller PMTU.
Decision: If the PMTU is unexpectedly low, check tunnels/VPNs/VXLAN/GRE, and whether ICMP “frag needed” messages are being received and honored.
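One follow-up worth knowing: after you fix the underlying MTU or ICMP problem, a stale learned PMTU can linger until the cache entry expires. Flushing it forces rediscovery; then re-check what the kernel believes:
cr0x@server:~$ sudo ip route flush cache
cr0x@server:~$ ip route get 10.20.30.40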
Task 16 — Validate bonding/LACP state (if relevant)
cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.12.0
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
Slave Interface: eno1
MII Status: up
Slave Interface: eno2
MII Status: up
Meaning: LACP itself doesn’t guarantee in-order delivery if the upstream is misconfigured or if hashing changes. Some setups produce reordering under certain flow patterns.
Decision: If reordering counters rise and you’re bonded, test with a single link or adjust hashing policy and upstream port-channel configuration.
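The bond’s transmit hash policy is readable (and on most setups writable) through sysfs, which makes a controlled test cheap. Keep in mind this only changes how this host hashes egress traffic; the switch’s port-channel hashing governs the return path, so both sides have to be sane. The policy value below is illustrative:
cr0x@server:~$ cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer3+4 1
cr0x@server:~$ echo layer2+3 | sudo tee /sys/class/net/bond0/bonding/xmit_hash_policy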
Loss patterns that fool smart people
1) “No interface drops, but tons of retransmits”
This is the classic war room stalemate. The host says it isn’t dropping. TCP says it’s retransmitting. Both can be true.
- Path drops: switch buffers overflowing, policing/shaping upstream, bad optics, or a firewall dropping under load.
- ACK path loss: your sender sees missing ACKs, retransmits, but the receiver actually got the data.
- Queueing delays: packets arrive late (bufferbloat), triggering fast retransmit or timeouts even though they weren’t dropped.
How you prove it: capture on both ends (or sender + SPAN). If the receiver sees the original segment and the sender retransmits anyway, it’s likely ACK path issues or extreme delay.
2) Microbursts: average utilization lies
Microbursts are the reason “links at 20%” still drop. Modern systems can emit traffic in bursts: interrupt coalescing, batching in the kernel, GRO, application writes, or storage replication pushing a big chunk. Shallow buffers in ToR switches can drop these bursts in microseconds. Your SNMP graph at 60-second resolution will cheerfully report “fine.”
What you do: ask for switch interface discards and buffer drop counters, ideally at high resolution. On the host, look for RX no-buffer and softirq pressure. If you can’t get switch telemetry, you can still infer microbursts from “no obvious host drops” plus retransmits under high fan-in or synchronized senders.
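If all you have is the host, a crude one-second delta loop on the interface statistics still helps: flat averages with occasional single-second spikes in rx_dropped is the classic microburst signature. Interface name is the same hypothetical eno1:
cr0x@server:~$ prev=$(cat /sys/class/net/eno1/statistics/rx_dropped); while sleep 1; do cur=$(cat /sys/class/net/eno1/statistics/rx_dropped); echo "$(date +%T) rx_dropped +$((cur - prev))"; prev=$cur; done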
3) PMTUD blackholes: it’s not loss, it’s a silent MTU mismatch
If ICMP “fragmentation needed” is blocked or mishandled, large packets vanish. TCP retransmits. Eventually connections stall or limp along with weird MSS behavior. This is extra common with tunnels, VPNs, and overlay networks.
Don’t guess. Test with ping -M do. If that fails at reasonable sizes, stop “tuning TCP.” You’re sending packets the path can’t carry.
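If you can’t fix the ICMP filtering quickly, MSS clamping on the device that forwards the traffic is the standard stopgap. A minimal nftables sketch, assuming it runs on the forwarding box and that these table/chain names don’t already exist there:
cr0x@server:~$ sudo nft add table inet mangle
cr0x@server:~$ sudo nft add chain inet mangle forward '{ type filter hook forward priority mangle; }'
cr0x@server:~$ sudo nft add rule inet mangle forward tcp flags syn tcp option maxseg size set rt mtu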
4) Reordering: packets arrive, just not politely
TCP’s classic heuristics treat significant reordering as loss, though Linux has gotten better at telling the two apart. If your network does per-packet load balancing (intentionally or by accident), you can induce reordering. Bonding misconfigurations can do it too.
The result: duplicate ACKs, fast retransmits, DSACKs. Throughput tanks. Everyone blames “packet loss” because that’s the graph they have.
5) “We turned on feature X and it got worse” (because it changed timing)
Offloads, ECN, pacing, queue disciplines—many are good. But enabling them in the wrong place can change burst behavior and timing enough to expose a weak link. That doesn’t mean the feature is bad. It means your environment is fragile.
Joke #2: Turning off GRO to “fix networking” is like firing the smoke detector to reduce noise.
Three corporate mini-stories (what actually goes wrong)
Mini-story 1: The incident caused by a wrong assumption
The company had a new Debian 13 fleet rolling into production behind a pair of firewalls. The rollout was smooth until one service—bulk exports—started timing out. Retransmits climbed, RTOs climbed, and the exports were now “unreliable.” The application owner insisted it was a regression in the kernel.
The network team pointed at clean interface counters on the servers. No RX drops. No CRC errors. “It’s not us.” The SRE on call did the most boring thing possible: tested PMTU with ping -M do between the exporter hosts and the downstream API. It failed at sizes that should have worked in that VLAN.
Everyone had assumed the VLAN was a plain 1500-byte Ethernet domain. It wasn’t. There was a tunnel segment in the middle, and the firewalls were configured to drop incoming ICMP “fragmentation needed” as “unwanted noise.” PMTUD didn’t work, and the path effectively blackholed larger packets.
Fixing it didn’t require a Debian rollback, a new NIC driver, or a TCP sysctl bonfire. They allowed the relevant ICMP types and adjusted MSS clamping on the firewall for the tunnel. Retransmits dropped, throughput returned, and the war room ended before lunch.
The wrong assumption wasn’t “Debian 13 is buggy.” The wrong assumption was “our network has the MTU we think it has.” That’s how you get paged.
Mini-story 2: The optimization that backfired
A different organization wanted to squeeze more throughput out of replication traffic. Someone read that “bigger buffers improve performance,” so they increased NIC ring sizes and changed qdisc settings across a storage cluster during a maintenance window. The initial tests looked great: fewer drops, higher peak throughput on synthetic benchmarks.
Then production arrived. Latency-sensitive services sharing the same uplinks started reporting timeouts. Retransmits rose everywhere, not just on replication. The graphs didn’t show link saturation, so the blame ricocheted between teams.
The problem wasn’t that bigger rings were “wrong.” The problem was that bigger rings plus changed qdisc behavior increased queueing latency under bursty load. The network started buffering more on the host, the switch buffered more, and the RTT jitter became large enough that RTO and loss recovery behavior got triggered more often. It looked like packet loss. A chunk of it was delayed delivery.
The fix was to back out the blunt changes, then reintroduce them selectively. Replication got a dedicated interface class with controlled pacing, and the general service traffic went back to sane queue management. Retransmits fell because the network stopped behaving like a storage appliance trying to win a benchmark.
This is why you don’t “optimize” without a model of what your workloads actually need. Fast isn’t the same as stable.
Mini-story 3: The boring but correct practice that saved the day
In a third company, a single rack began showing elevated TCP retransmits on Debian 13 nodes after a hardware refresh. Nothing was outright down, just slow enough that customers noticed. The first instinct was to chase kernel versions, NIC firmware, and “maybe it’s IPv6.”
But the team had a boring practice: every rack had a baseline “golden” capture and counter snapshot taken when the rack was known-good. Not fancy—just ethtool -S, ip -s link, ss -s, and a short tcpdump. It lived in a repo next to the rack build notes.
Comparing current counters to the baseline showed one clear difference: CRC and alignment errors on the switch port connected to one leaf uplink were non-zero and steadily rising. On the hosts, the errors were not visible yet (the switch was the one seeing the physical layer issue first).
They replaced the suspect optic and cleaned one fiber connector. Retransmits dropped within minutes. No kernel change, no performance tuning, no finger-pointing. Just physics.
Boring practices don’t make great conference talks. They do end incidents.
Common mistakes: symptom → root cause → fix
This section is intentionally specific. These are the patterns that burn time because they look like something else.
1) High TCP retransmits, NIC RX drops rising
- Symptom: Retransmits increase with throughput; ip -s link shows RX dropped/missed increasing.
- Likely root cause: Host can’t service RX fast enough (IRQ pinned, too-small rings, driver issues, CPU contention, heavy nftables, virtualization overhead).
- Fix: Balance IRQs/RSS, increase ring sizes, verify RPS/XPS, reduce firewall/conntrack load, pin workloads away from softirq-heavy cores, update NIC firmware/driver if indicated.
2) High retransmits, no host drops, SYN retransmits rising
- Symptom: TcpExtTCPSynRetrans climbs; connections slow to establish.
- Likely root cause: Firewall drops, conntrack table full, asymmetric routing through a stateful device, or destination overload/refusal.
- Fix: Check conntrack count/max, firewall counters/logs, ensure symmetric routing, reduce tracked traffic, adjust firewall capacity or bypass for known-safe internal traffic.
3) Retransmits appear after enabling jumbo frames
- Symptom: Works for small responses, stalls on large transfers; DF pings fail.
- Likely root cause: MTU mismatch somewhere (switch, firewall, overlay tunnel), PMTUD blocked.
- Fix: Make MTU consistent end-to-end or clamp MSS at edges; allow ICMP frag-needed; validate with DF pings across the real path.
4) Retransmits with high “reordering” counters
- Symptom: reordering in ss -i and DSACK counters rise.
- Likely root cause: ECMP/LACP hashing changes, per-packet load balancing, mixed-speed links, or a misconfigured bond/port-channel.
- Fix: Ensure per-flow hashing, validate LACP config matches, avoid per-packet distribution, test single-link behavior to confirm.
5) Retransmits “only during backups” or “only at the top of the hour”
- Symptom: Periodic retransmit spikes with scheduled jobs.
- Likely root cause: Microbursts and congestion from synchronized senders, switch buffer pressure, or host queue collapse.
- Fix: Stagger schedules, apply pacing/shaping for bulk flows, separate traffic classes, and request/inspect switch discard counters.
6) tcpdump shows retransmits, but receiver logs show data arrived
- Symptom: Sender retransmits; receiver application sees duplicates or out-of-order data handling (or just high CPU).
- Likely root cause: ACK loss on return path or capture point misleads due to offloads/virtual switching.
- Fix: Capture on both ends; verify offload settings and capture points; investigate return-path devices and asymmetric routing.
Checklists / step-by-step plan
Step-by-step: find where the loss is in 60–90 minutes
- Pick one impacted flow (source, dest, port). Don’t chase aggregate graphs. Run ss -ti dst <ip> and record retransmits, RTT, cwnd.
- Check host interface counters on both ends: ip -s link, ethtool -S. If RX drops/missed/no-buffer rise on either end, you have a host-side problem to fix before blaming the network.
- Check CPU softirq balance: mpstat and /proc/interrupts. If one core is getting hammered, fix IRQ/RSS.
- Validate MTU and PMTUD with DF pings at realistic sizes. If it fails, fix MTU/ICMP/MSS. Don’t proceed until it passes.
- Capture packets during a retransmit spike. If possible, capture on both ends for the same window. Confirm whether original segments appear at receiver and whether ACKs return.
- Check firewall/conntrack if applicable: nft counters, conntrack count/max, logs. If tables are near full or drops increment, fix that layer.
- Investigate reordering if counters rise: bonding, ECMP, multipath overlays. Try a controlled test bypassing the bond or pinning to one path.
- Engage the network team with specifics: “drops on switch port X,” “CRC errors on uplink,” “PMTU fails at 1472,” “reordering spikes when bond is active.” Vague “retransmits are high” gets you vague answers.
Operational checklist: keep from re-living this
- Baseline ethtool -S, ip -s link, and ss -s for known-good hosts.
- Log switch port errors/discards and correlate with incident windows.
- Standardize MTU per domain; document where it changes (tunnels, overlays, WAN).
- Prefer per-flow hashing; avoid per-packet load balancing for TCP-heavy traffic.
- Keep conntrack sizing intentional; don’t let defaults run critical firewalls blindly.
- Require a “capture point map” in virtualized environments (host, VM, vSwitch, physical).
FAQ
1) Are TCP retransmits always packet loss?
No. They’re a sender reaction to missing ACKs. True drop is common, but so are reordering and excessive queueing delay.
2) Why do retransmits destroy throughput so badly?
TCP reduces its congestion window when it thinks loss occurred. That throttles sending rate. On high-BDP paths, recovery can be slow.
3) If my NIC counters show no drops, can the host still be at fault?
Yes. Drops can occur in places you’re not looking: virtual switching layers, firewall hooks, conntrack, or upstream. Also, not all drivers expose every drop counter cleanly.
4) My capture shows a retransmission. Is that definitive?
It’s indicative, not definitive. A capture only reflects what was observable at that capture point. With offloads and virtualization, you might be looking at post-processed traffic rather than what was actually on the wire.
5) Should I disable GRO/GSO/TSO to fix retransmits?
Only as a diagnostic on one host. If it helps, you’ve learned something about driver/offload interaction, but you haven’t proven the “right” steady-state configuration yet.
6) How do I tell if it’s an MTU/PMTUD issue?
DF ping tests failing at expected payload sizes are a strong signal. Also look for stalls on large transfers and weird MSS behavior. Fix ICMP frag-needed handling and MTU consistency.
7) Why do retransmits spike during backups or replication jobs?
Bulk flows create bursts and congestion. You can get microburst drops even when average utilization is modest. Consider pacing, staggering, or traffic separation.
8) Can conntrack cause retransmits even if the server is “not a firewall”?
Yes. If nftables rules track traffic by default (common), the conntrack table can fill or become CPU-expensive, causing drops and delays—especially during connection churn.
9) What’s the fastest proof that the network path is dropping packets?
Simultaneous captures on sender and receiver showing missing segments at the receiver, plus clean host counters, point strongly to path drops. Switch discard counters seal it.
10) Should I tune sysctls like tcp_rmem/tcp_wmem first?
No. If you have real drops, bigger buffers can amplify queueing and make latency worse. Fix loss and reordering first; tune second, with measurements.
Next steps you can do today
Stop treating “TCP retransmits” as a root cause. Treat it as a breadcrumb trail.
- Pick one impacted flow and capture ss -ti output during the problem.
- Check ip -s link and ethtool -S on both ends for drops, missed, no-buffer, CRC.
- Validate MTU with DF pings across the real path.
- Check CPU softirq balance and IRQ distribution.
- If still unclear, capture packets on both ends and compare what left vs what arrived.
- Only then adjust rings/offloads/qdisc—and do it surgically, with rollback.
If you do this methodically, the argument about “whose fault it is” becomes boring. Boring is good. Boring means you’re about to fix it.