Random disconnects on a server are a special kind of pain: nothing is “down” long enough to page the networking team, but everything is broken long enough to make you look unreliable. SSH stalls. RPC timeouts. Storage mounts pause like they’re thinking about their life choices. And by the time you log in, the link is “fine.”
This is where superstition breeds. People toggle offloads like they’re flipping a haunted light switch. They reboot. They blame “Ubuntu updates.” You can do better. The goal is not to find a magic sysctl. The goal is to prove where packets are being dropped: on the wire, on the switch, in the NIC, in the driver, in the kernel’s receive path, or in your own configuration.
A working mental model: where “disconnects” actually happen
Most “random disconnects” are not disconnects. They’re transient loss, reordering, stalls, or brief link renegotiations that your application interprets as a failure. You need to separate failure modes:
- Link flaps: the physical link goes down/up (cable, SFP, switch port, autoneg). Linux will often log this clearly.
- Driver/NIC resets: the device stays “up” but the driver resets queues, firmware, or DMA. Looks like a blip; logs can be subtle.
- Receive-path overload: link is up, but RX rings overflow or softnet backlog drops. No physical errors required.
- Offload/feature interactions: checksum offload, GRO/LRO, TSO/GSO, VLAN offload, or XDP can create weirdness with specific switches, tunnels, or NIC revisions.
- Path MTU blackholes: the link is fine; PMTU discovery is not. Certain flows stall; pings “work.”
- Bonding/LACP/VLAN/bridge misbehavior: you built something clever and now it occasionally eats packets.
Your job is to label the event correctly. Once you do that, the fix becomes embarrassingly straightforward.
Paraphrased idea (with attribution): John Allspaw has long argued reliability comes from treating operations as a science of evidence, not a theater of blame.
Fast diagnosis playbook (first/second/third)
When you’re on-call, you don’t have time to admire packet graphs. Start here. The goal is to figure out which layer is lying to you.
First: confirm whether it’s link flapping, driver reset, or congestion/drops
- Kernel logs around the event window: link down/up vs reset vs queue timeout.
- NIC counters: CRC errors and alignment errors scream physical; rx_missed_errors and rx_no_buffer scream rings/interrupts.
- Softnet drops: if the kernel is dropping before your app sees packets, you’ll find it here.
Second: isolate “offload weirdness” from “real capacity problem”
- Check offload state and driver/firmware versions.
- Reproduce with a controlled traffic test (even basic iperf3) and watch counters.
- If disabling an offload “fixes” it, prove why: counters change, resets stop, or a specific encapsulation stops breaking.
Third: validate switch, optics, and cabling like an adult
- Look for FEC/CRC errors, symbol errors, and renegotiations on both ends.
- Swap optics/cable to rule out physical. It’s not glamorous, but it’s fast.
- Confirm LACP partner settings and MTU end-to-end.
Decision rule: If the OS shows link down/up, start physical and switch-port config. If the link stays up but counters climb and softnet drops spike, tune the host. If the NIC resets, go driver/firmware and power/PCIe health.
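If you want that decision rule as something you can paste mid-incident, here is a minimal triage sketch. It assumes the interface is eno1, that you run it with root privileges, and that the grep patterns are illustrative rather than exhaustive; adjust both to your hardware and log wording.
#!/usr/bin/env bash
# triage.sh: first-pass classification of an event (link flap vs driver reset vs host drops).
# Hypothetical helper; run as root, pass the interface and a journalctl time window.
IFACE="${1:-eno1}"
SINCE="${2:-30 minutes ago}"

echo "== link/driver events in the kernel log =="
journalctl -k --since "$SINCE" | grep -iE "$IFACE.*(link is (up|down)|reset|tx.*hang|tx timeout)" || echo "(none logged)"

echo "== physical vs host-side NIC counters (only meaningful as deltas) =="
ethtool -S "$IFACE" | grep -iE "crc|symbol|align|missed|no_buffer|discard" || echo "(driver exposes no matching counters)"

echo "== kernel softnet drops since boot (hex column 2, summed in decimal) =="
total=0
while read -r _ dropped _; do total=$(( total + 16#$dropped )); done < /proc/net/softnet_stat
echo "softnet drops: $total"
Link events point you at physical/switch work, reset messages at driver/firmware/PCIe, and climbing drop counters at the host receive path, exactly as in the decision rule above.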
Facts and context you’ll wish you knew earlier
- Fact 1: “Checksum errors” in captures can be artifacts: with TX checksum offload, a host-side capture sees the packet before the NIC fills in the checksum, so it can look wrong even though it’s correct on the wire.
- Fact 2: GRO (Generic Receive Offload) in Linux is a software mechanism; LRO is NIC-driven and historically more likely to misbehave with tunnels and certain traffic patterns.
- Fact 3: The Linux network stack has had receive-side scaling (RSS) for ages, but mapping queues to CPU cores remains a common self-own on multi-socket systems.
- Fact 4: Many “random drops” blamed on kernels were actually switch microbursts: short spikes that overflow buffers faster than your monitoring interval can admit.
- Fact 5: Energy Efficient Ethernet (EEE) has a long track record of being “fine in theory” and “mysteriously spiky in practice” on mixed-vendor gear.
- Fact 6: On modern NICs, firmware is part of your reliability surface area. A driver update without a firmware update can leave you with new bugs and old microcode.
- Fact 7: Autonegotiation disputes are not a relic of the 1990s. They show up today via bad DACs, marginal optics, or forced settings on one side.
- Fact 8: Bonding (LACP) is robust, but only if both sides agree on hashing, LACP mode, and what “up” means. Otherwise it fails in delightfully intermittent ways.
- Fact 9: PMTU blackholes remain common because ICMP filtering is still treated as “security,” even though it often acts like “random outage generator.”
Practical tasks (commands + meaning + decision)
Below are field tasks you can run on Ubuntu 24.04. Each one includes what the output means and what you should do next. Copy/paste is allowed. Vibes are not.
Task 1: Identify the NIC, driver, and firmware you’re actually running
cr0x@server:~$ sudo ethtool -i eno1
driver: ixgbe
version: 6.8.0-31-generic
firmware-version: 0x800003e7
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
Meaning: You now have a precise tuple: kernel version, driver name, firmware version, PCI bus. That tuple is what you correlate with resets and known quirks. “Intel 10G” is not a tuple.
Decision: If disconnects correlate with recent kernel upgrades, keep this output in the incident notes. If firmware is very old relative to your fleet, schedule a firmware update path. If the driver is in-tree but the NIC vendor recommends a newer out-of-tree driver, treat that as a hypothesis, not a religion.
Task 2: Check for link flaps and driver resets in the kernel log
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "eno1|link is|down|up|reset|watchdog|tx timeout|nic"
[ 7432.112233] ixgbe 0000:3b:00.0 eno1: NIC Link is Down
[ 7435.556677] ixgbe 0000:3b:00.0 eno1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 8121.000111] ixgbe 0000:3b:00.0 eno1: Detected Tx Unit Hang
[ 8121.000222] ixgbe 0000:3b:00.0 eno1: Reset adapter
Meaning: The first pair is a genuine link flap. The latter lines indicate a driver/NIC hang and reset—different class of problem.
Decision: Link down/up points you to optics/cable/switch port/autoneg/EEE. Tx hang/reset points you to driver/firmware/PCIe/power-management and queue/interrupt behavior.
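If you can catch the next event live, following the kernel log with an interface filter is cheaper than reconstructing it afterwards. The pattern below is illustrative; widen it if your driver words things differently.
cr0x@server:~$ sudo journalctl -kf | grep -Ei "eno1.*(link is|reset|hang|timeout)"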
Task 3: Watch live link state and negotiated parameters
cr0x@server:~$ sudo ethtool eno1 | egrep -i "Speed|Duplex|Auto-negotiation|Link detected"
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Link detected: yes
Meaning: This is your baseline. If you see speed toggling (10G to 1G) or autoneg off on one end, you have a configuration or physical issue.
Decision: If link flaps coincide with renegotiation or speed changes, stop tweaking offloads. Talk to the switch and swap components.
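When flaps are too brief to catch by hand, a crude one-second poll with timestamps is usually enough to correlate link state with application errors. This is a throwaway sketch, not monitoring:
cr0x@server:~$ while sleep 1; do date '+%T'; sudo ethtool eno1 | grep -E 'Speed|Link detected'; done | tee /tmp/eno1.linkstate.log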
Task 4: Pull NIC counters that reveal physical vs host drops
cr0x@server:~$ sudo ethtool -S eno1 | egrep -i "crc|align|symbol|discard|drop|miss|overrun|timeout" | head -n 30
rx_crc_errors: 0
rx_align_errors: 0
rx_symbol_err: 0
rx_discards: 124
rx_dropped: 0
rx_missed_errors: 98765
tx_timeout_count: 3
Meaning: CRC/align/symbol errors indicate physical layer corruption. rx_missed_errors usually indicates the NIC couldn’t DMA packets into host buffers fast enough (ring starvation / interrupt moderation / CPU scheduling).
Decision: If physical errors are non-zero and rising during incidents, treat it as cable/optics/switch-port. If missed errors climb while physical errors stay at zero, focus on RX rings, IRQs, NAPI, and CPU contention.
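Counters are only meaningful as deltas across the incident window, so snapshot and diff rather than staring at absolute values. The 60-second window is an assumption; use whatever reproduces the problem.
cr0x@server:~$ sudo ethtool -S eno1 > /tmp/s1; sleep 60; sudo ethtool -S eno1 > /tmp/s2
cr0x@server:~$ diff /tmp/s1 /tmp/s2 | grep -Ei 'crc|symbol|align|missed|no_buffer|discard'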
Task 5: Inspect softnet drops (kernel dropping before the socket layer)
cr0x@server:~$ awk '{print NR-1, $1, $2, $3, $4, $5}' /proc/net/softnet_stat | head
0 0000001a 00000000 0000003f 00000000 00000000
1 00000020 00000000 00000110 00000000 00000000
Meaning: The fields are hexadecimal. Field 1 is processed packets, field 2 is dropped packets, field 3 is time_squeeze (the softirq ran out of budget or time). Rising drops or time_squeeze means the CPU/softirq path couldn’t keep up.
Decision: If drops increase during disconnects, you’re not dealing with “mystery.” You’re dealing with overload or bad CPU/IRQ placement. Move to IRQ/RSS tasks.
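The hex columns make trends easy to misread, so here is a small bash loop that prints per-CPU drops and time_squeeze in decimal (one line of softnet_stat per online CPU, columns as described above):
cr0x@server:~$ i=0; while read -r _ drop squeeze _; do echo "cpu$i dropped=$((16#$drop)) time_squeeze=$((16#$squeeze))"; i=$((i+1)); done < /proc/net/softnet_stat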
Task 6: Confirm whether NIC interrupts are mapped sensibly
cr0x@server:~$ grep -i eno1 /proc/interrupts | head -n 10
98: 12345678 0 0 0 IR-PCI-MSI 524288-edge eno1-TxRx-0
99: 12 0 0 0 IR-PCI-MSI 524289-edge eno1-TxRx-1
100: 10 0 0 0 IR-PCI-MSI 524290-edge eno1-TxRx-2
Meaning: Queue 0 is doing all the work while other queues are idle. That can happen due to RSS misconfiguration, flow types, or a pinning issue.
Decision: If one queue is hot and others cold, fix RSS/queue count and consider IRQ affinity. If the host is multi-socket, make sure NIC queues land on local NUMA cores.
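Two quick checks usually settle the placement question: which NUMA node the NIC sits on, and which CPUs its queue IRQs are currently allowed to run on. The interface name is an assumption; a numa_node of -1 means the platform reports no NUMA affinity.
cr0x@server:~$ cat /sys/class/net/eno1/device/numa_node
cr0x@server:~$ for irq in $(grep eno1 /proc/interrupts | awk -F: '{print $1}'); do echo "IRQ $irq -> CPUs $(sudo cat /proc/irq/$irq/smp_affinity_list)"; done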
Task 7: Check RSS and number of combined channels
cr0x@server:~$ sudo ethtool -l eno1
Channel parameters for eno1:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 64
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 8
Meaning: This NIC supports up to 64 combined queues, currently configured for 8. That’s not “wrong,” but it should match your CPU cores and workload.
Decision: If you see missed errors or softnet drops and you have CPU headroom, increasing combined channels can help. If you’re already CPU-bound, more queues can add overhead. Tune deliberately.
Task 8: Change channel count (temporarily) to test a hypothesis
cr0x@server:~$ sudo ethtool -L eno1 combined 16
cr0x@server:~$ sudo ethtool -l eno1
Channel parameters for eno1:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 64
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 16
Meaning: You increased queue parallelism. If drops disappear under load, you found a receive bottleneck. If latency worsens and CPU spikes, you overshot.
Decision: Keep the change only if you can prove improvement via counters and application symptoms. Make it persistent using systemd-networkd/NetworkManager hooks or udev rules, not by hoping it survives reboot.
Task 9: Check offload features currently enabled
cr0x@server:~$ sudo ethtool -k eno1 | egrep -i "rx-checksumming|tx-checksumming|tso|gso|gro|lro|rx-vlan-offload|tx-vlan-offload|ntuple"
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
Meaning: GRO/TSO/GSO/checksum offloads are on. LRO is off (often a good default). VLAN offload is on.
Decision: Don’t disable everything “to be safe.” Disable one feature at a time, only to validate a suspected interaction. If you disable checksumming, expect CPU increase and possibly worse throughput; you’re trading correctness hypotheses for measurable cost.
Task 10: Toggle a single offload to test for a bug (and watch counters)
cr0x@server:~$ sudo ethtool -K eno1 gro off
cr0x@server:~$ sudo ethtool -k eno1 | grep -i generic-receive-offload
generic-receive-offload: off
Meaning: GRO is disabled. If your issue is GRO-related (often with tunnels/encapsulation or some buggy NIC/driver combos), symptoms may change quickly.
Decision: If disabling GRO eliminates stalls but increases CPU and lowers throughput, you likely hit a kernel/driver edge case. Then you either (a) keep GRO off for that interface, (b) change kernel/driver/firmware, or (c) redesign encapsulation. Pick based on business constraints, not pride.
Task 11: Check MTU and whether you’re accidentally fragmenting or blackholing
cr0x@server:~$ ip -br addr show eno1
eno1 UP 10.10.0.12/24 fe80::1234:56ff:fe78:9abc/64
cr0x@server:~$ ip link show eno1 | egrep -i "mtu|state"
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
Meaning: MTU is 9000. If your switch path or peer is 1500, you now have a reproducible intermittent failure: small packets work, big ones stall or fragment weirdly, and “random” starts trending.
Decision: Validate MTU end-to-end. For tunnels, remember there’s overhead. If you can’t guarantee jumbo frames across the path, don’t run them half-way.
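For the tunnel case, the arithmetic is simple but routinely skipped: VXLAN over IPv4 adds roughly 50 bytes of encapsulation, so a 1500-byte underlay leaves about 1450 bytes for the inner interface. You can test the inner path the same way as the physical one; the peer address below is a placeholder.
cr0x@server:~$ ping -M do -s 1422 -c 3 10.20.0.5
1422 bytes of payload plus 8 bytes of ICMP header plus 20 bytes of IPv4 header is exactly 1450 on the inner interface.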
Task 12: PMTU test with “do not fragment” ping
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.0.1
PING 10.10.0.1 (10.10.0.1) 8972(9000) bytes of data.
From 10.10.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.10.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
From 10.10.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
--- 10.10.0.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2043ms
Meaning: The path can’t carry 9000-byte frames. This is not subtle. Your interface MTU is set larger than the path supports.
Decision: Fix MTU mismatch. Either set MTU to 1500 on the host or enable jumbo frames end-to-end (including bonds, bridges, VLAN subinterfaces, and switch ports).
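If tracepath (from iputils) is available, it reports the PMTU it discovers hop by hop, which is handy when the destination is several routers away and you would rather not binary-search packet sizes by hand:
cr0x@server:~$ tracepath -n 10.10.0.1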
Task 13: Check TCP retransmits and kernel network stats
cr0x@server:~$ nstat -az | egrep -i "TcpRetransSegs|TcpTimeouts|IpInDiscards|IpOutDiscards"
IpInDiscards 120
IpOutDiscards 0
TcpRetransSegs 3490
TcpExtTCPTimeouts 18
Meaning: Retransmits and timeouts confirm real loss/stalls experienced by TCP, not just application whining. IP discards mean the IP layer itself threw packets away locally, which is a separate signal from loss on the wire.
Decision: If retransmits spike during incidents while link stays up, correlate with softnet drops, NIC missed errors, and switch port buffers. Then choose: host tuning or network remediation.
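nstat keeps per-user history, so two invocations give you per-interval deltas instead of counters-since-boot; that is what you actually want during an incident window. Counters that didn’t move are omitted by default.
cr0x@server:~$ nstat > /dev/null; sleep 60; nstat | grep -Ei 'retrans|timeout'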
Task 14: Confirm if NetworkManager/systemd-networkd is bouncing the interface
cr0x@server:~$ systemctl status NetworkManager --no-pager
● NetworkManager.service - Network Manager
Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 09:12:10 UTC; 3h 2min ago
Docs: man:NetworkManager(8)
cr0x@server:~$ sudo journalctl -u NetworkManager --since "2 hours ago" | egrep -i "eno1|carrier|down|up|deactivat|activat"
Dec 29 10:41:12 server NetworkManager[1023]: <info> [....] device (eno1): carrier: link connected
Dec 29 10:41:13 server NetworkManager[1023]: <info> [....] device (eno1): state change: activated -> deactivating (reason 'carrier-changed')
Dec 29 10:41:14 server NetworkManager[1023]: <info> [....] device (eno1): state change: deactivating -> activated (reason 'carrier-changed')
Meaning: The carrier is changing; NM is reacting. That usually reflects a real link event, not NM being “random.”
Decision: Don’t fight the network manager. Fix the carrier instability (physical/switch) or driver resets triggering carrier changes.
Task 15: Validate PCIe health hints (AER errors can look like “random NIC problems”)
cr0x@server:~$ sudo journalctl -k --since "24 hours ago" | egrep -i "AER|pcieport|Corrected error|Uncorrected|DMAR|IOMMU" | head -n 30
[ 8120.998877] pcieport 0000:3a:00.0: AER: Corrected error received: 0000:3a:00.0
[ 8120.998900] pcieport 0000:3a:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Meaning: Corrected PCIe physical errors can precede device hiccups or resets, especially under load or with marginal slots/risers.
Decision: If AER logs correlate with NIC resets, stop arguing about GRO and start checking the physical host: seating, risers, BIOS PCIe power management, and platform firmware.
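To see what the slot and link think they are doing, inspect the device’s PCIe capability block using the bus address from Task 1; ASPM state and link speed/width downgrades are the usual suspects:
cr0x@server:~$ sudo lspci -vvv -s 3b:00.0 | grep -E 'LnkCap|LnkSta|ASPM'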
Offloads without superstition: when they help, when they hurt
Offloads are not evil. They’re performance features designed to push work from CPU to NIC (or to amortize work in the kernel). They are also a frequent scapegoat because toggling them is easy and the symptoms are intermittent. The fact that a toggle changes the symptom does not mean you understood the cause.
Know what you’re toggling
- TX checksum offload: kernel hands packet with checksum “to be filled”; NIC computes checksum. Captures on host can show “bad checksum” because it hasn’t been computed yet.
- TSO/GSO: large TCP segments created by the stack and segmented later (NIC or kernel). Great for throughput; can amplify burstiness.
- GRO: coalesces received packets into larger SKBs before passing up the stack. Saves CPU; can change latency characteristics.
- LRO: similar concept but NIC-driven; can interact badly with encapsulation and can break packet semantics more easily.
- VLAN offloads: NIC handles VLAN tag operations; usually fine, occasionally painful with bridging or odd switch behavior.
Here’s the adult approach: treat offloads as variables in an experiment. Toggle one, measure counters, measure application symptoms, and decide whether you found a real bug or just shifted the bottleneck.
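A sketch of that experiment discipline, assuming eno1, GRO as the single variable, and that you can generate comparable load in between. The script name and file paths are hypothetical; run it as root.
#!/usr/bin/env bash
# offload-ab.sh: toggle ONE offload and compare counters around a load test.
IFACE=eno1
FEATURE=gro                      # the single variable in this experiment

ethtool -S "$IFACE" > "/tmp/$IFACE.stats.A"
nstat -az | grep -Ei 'retrans|timeout' > "/tmp/$IFACE.tcp.A"

ethtool -K "$IFACE" "$FEATURE" off
# ... reproduce load here (iperf3, replayed traffic, or simply the incident window) ...

ethtool -S "$IFACE" > "/tmp/$IFACE.stats.B"
nstat -az | grep -Ei 'retrans|timeout' > "/tmp/$IFACE.tcp.B"

diff "/tmp/$IFACE.stats.A" "/tmp/$IFACE.stats.B"
diff "/tmp/$IFACE.tcp.A" "/tmp/$IFACE.tcp.B"
Remember to turn the feature back on (ethtool -K eno1 gro on) if the experiment is inconclusive.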
When disabling offloads is appropriate
Disable an offload temporarily when:
- You have evidence of driver resets correlated with a feature (e.g., Tx hang with TSO under specific traffic).
- You’re debugging packet captures and need on-host checksums to be meaningful.
- You’re dealing with encapsulation/tunneling (VXLAN/Geneve) and suspect an offload path is broken on that NIC/driver.
And keep it disabled only when you can’t upgrade driver/firmware/kernel in time and the CPU cost is acceptable.
When disabling offloads is cargo cult
Disabling offloads is cargo cult when:
- The issue is link down/up events (offloads don’t flap your cable).
- RX CRC errors are non-zero (checksum offload doesn’t create CRC errors on the wire).
- Softnet drops are the real problem (turning off offloads often increases CPU work and makes it worse).
Joke #1: Turning off every offload is like taking the batteries out of the smoke detector because it’s loud. The fire still wins.
IRQs, RSS, ring buffers, and the “drops with no errors” trap
One of the most common disconnect patterns on fast NICs is: link is stable, no CRC errors, switch looks clean, but applications report timeouts. Under the hood, the host is dropping packets because it can’t service interrupts and drain rings quickly enough. This is not theoretical; it’s what happens when a 10/25/40/100G NIC meets a CPU that is busy doing literally anything else.
What “rx_missed_errors” and “rx_no_buffer” usually mean
These counters typically mean the NIC had frames to deliver but couldn’t put them in host memory because the receive ring was full or buffers weren’t available. Causes include:
- Too few RX descriptors (ring size too small for bursts).
- Interrupt moderation too aggressive (packets pile up, then overflow).
- CPU starvation of ksoftirqd/softirq context.
- Bad IRQ affinity (all queues pinned to one CPU, often CPU0, because the world is cruel).
- NUMA mismatch (NIC interrupts serviced on a remote socket).
Ring sizes: a blunt but effective instrument
cr0x@server:~$ sudo ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 512
Meaning: You’re using 512 descriptors while the NIC supports 4096. That’s not wrong, but it’s not resilient to bursts.
Decision: If you see bursts/microbursts and missed errors, increase rings to something sane (e.g., 2048) and measure memory impact and latency.
cr0x@server:~$ sudo ethtool -G eno1 rx 2048 tx 2048
Meaning: You’ve increased buffering capacity at the NIC boundary. This can reduce drops during bursts, at the cost of some added buffering latency and memory usage.
Decision: If your issue is microburst drops, this often helps quickly. If the problem is sustained overload, it just delays the inevitable.
Interrupt moderation: the latency/CPU trade nobody documents in your org
Interrupt coalescing reduces CPU overhead by batching interrupts. But if you coalesce too aggressively, you can introduce latency spikes and create ring overflow during bursts.
cr0x@server:~$ sudo ethtool -c eno1 | head -n 40
Coalesce parameters for eno1:
Adaptive RX: on TX: on
rx-usecs: 50
rx-frames: 64
tx-usecs: 50
tx-frames: 64
Meaning: Adaptive coalescing is on, and there are baseline microsecond/frame thresholds. Adaptive modes can be great, or they can oscillate under certain workloads.
Decision: If you’re chasing brief stalls, consider disabling adaptive coalescing temporarily and setting conservative fixed values. Measure tail latency and drops. Don’t “optimize” in the dark.
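If you want adaptive moderation out of the equation for a test window, a fixed, conservative setting looks like the line below. Which knobs a driver accepts varies, and the values are illustrative starting points, not recommendations:
cr0x@server:~$ sudo ethtool -C eno1 adaptive-rx off adaptive-tx off rx-usecs 8 rx-frames 32
Re-run the load test and compare tail latency plus missed/softnet counters before keeping anything.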
IRQ affinity: when the default is “hope”
Ubuntu typically runs irqbalance, which attempts to distribute interrupts across CPUs. Sometimes it works. Sometimes your workload is sensitive enough that you want deterministic placement, especially on NUMA systems.
cr0x@server:~$ systemctl status irqbalance --no-pager
● irqbalance.service - irqbalance daemon
Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 09:12:12 UTC; 3h 1min ago
Meaning: irqbalance is active. That’s fine, but not always optimal.
Decision: If you see one queue hot or NUMA remote interrupts, consider pinning NIC IRQs to local CPUs and excluding those CPUs from noisy neighbors. This is especially important on storage servers and hypervisors.
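A minimal pinning sketch, assuming eno1, a NUMA-local CPU list you have already verified (the one below is made up), root privileges, and that irqbalance is stopped or told to ignore these IRQs so it doesn’t quietly undo you:
#!/usr/bin/env bash
# pin-irqs.sh: spread eno1 queue IRQs across a fixed, NUMA-local CPU list. Run as root.
IFACE=eno1
CPUS=(2 4 6 8 10 12 14 16)       # illustrative; replace with cores local to the NIC
i=0
for irq in $(grep "$IFACE" /proc/interrupts | awk -F: '{print $1}'); do
  cpu=${CPUS[$(( i % ${#CPUS[@]} ))]}
  echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
  echo "IRQ $irq -> CPU $cpu"
  i=$(( i + 1 ))
done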
Joke #2: IRQ tuning is like office seating plans: everyone agrees it matters, and everyone hates the meeting where you change it.
Link-layer and switch-side failures that masquerade as Linux issues
Linux gets blamed because Linux logs are readable and your switch logs are behind a ticket. Still, the physical layer and the switch are common culprits. Especially when the failure is intermittent.
Physical indicators that should end the “it’s the kernel” debate
- CRC/alignment/symbol errors increasing during incidents.
- FEC corrections spiking (common on higher speeds; heavy correction can precede drops).
- Auto-negotiation loops visible as repeated link down/up.
- Speed/duplex renegotiation or “link is up 1Gbps” surprises.
On the host side you can’t always see FEC details, but you can see enough to justify the escalation.
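Depending on the NIC and driver, you can at least see the negotiated FEC mode from the host, and some drivers expose correction counters via ethtool -S. If neither exists, that absence is itself your justification for asking the switch team for their FEC stats.
cr0x@server:~$ sudo ethtool --show-fec eno1
cr0x@server:~$ sudo ethtool -S eno1 | grep -i fec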
EEE and power management: death by “green” defaults
If you see periodic micro-outages and everything else looks clean, check whether Energy Efficient Ethernet (EEE) is enabled. Mixed hardware can behave badly.
cr0x@server:~$ sudo ethtool --show-eee eno1
EEE Settings for eno1:
EEE status: enabled - active
Tx LPI: 1 (on)
Supported EEE link modes: 1000baseT/Full 10000baseT/Full
Advertised EEE link modes: 1000baseT/Full 10000baseT/Full
Link partner advertised EEE link modes: 1000baseT/Full 10000baseT/Full
Meaning: EEE is active. That’s not automatically wrong, but it’s a common variable in intermittent latency and short stalls.
Decision: If you suspect EEE, disable it on both ends for a test window and watch whether stalls disappear.
cr0x@server:~$ sudo ethtool --set-eee eno1 eee off
Bonding and LACP: reliable when configured, chaotic when assumed
Bonding failures often look like random disconnects because traffic hashing sends some flows down a broken path while others succeed. You’ll see “some services flaky” rather than “host down.”
cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.8.0-31-generic
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eno1
MII Status: up
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Slave Interface: eno2
MII Status: up
Aggregator ID: 1
Actor Churn State: churned
Partner Churn State: churned
Meaning: Churned actor/partner states indicate instability in LACP negotiation, and the two members reporting different Aggregator IDs means they aren’t bundling into the same aggregate. That’s not a Linux offload problem; it’s a link aggregation control plane problem.
Decision: Engage the switch team with this evidence. Verify LACP mode (active/passive), VLAN trunking, and that both links terminate on the same LACP group. Also confirm timers and that the switch isn’t silently blocking one member due to errors.
Three corporate mini-stories (anonymized, painfully plausible)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran an internal object storage cluster on Ubuntu servers. The symptom was “random disconnects” between gateway nodes and back-end storage nodes. It happened mostly during business hours, which made everyone suspicious of “load” but nobody had the courage to quantify it.
The first assumption was classic: “It’s a kernel regression.” The team had recently moved to a newer kernel and a newer NIC driver, and the timing looked guilty. They rolled back two hosts. The issue persisted. They rolled back two more. Still persisted. Now they were running a mixed fleet with inconsistent behavior and no root cause. Reliability didn’t improve; it just became harder to reason about.
When someone finally pulled ethtool -S counters, they noticed rx_crc_errors wasn’t zero. It was small, but it climbed exactly during “disconnect” complaints. That broke the narrative. CRC errors don’t come from sysctls. CRC errors come from the physical layer.
The actual cause was painfully mundane: a batch of DAC cables that were marginal at 10G in a specific rack layout with tight bends and stressed connectors. Under temperature changes, errors rose, LACP churned, and some flows got blackholed long enough to break storage sessions. The fix was swapping cables and re-seating optics. The kernel rollback was reverted later, after everyone stopped flinching.
The lesson wasn’t “check cables.” The lesson was: don’t start with the most emotionally satisfying theory. Start with the counters that can falsify it.
Mini-story 2: The optimization that backfired
A SaaS team had a latency problem on an API tier, and someone decided the NIC interrupt rate was “too high.” They enabled aggressive interrupt coalescing and tuned it until the CPU graphs looked calmer. They celebrated. CPU usage dropped. They wrote a short internal post about “winning performance.”
Two weeks later, customers reported occasional 1–2 second hangs during peak traffic. It didn’t show up in average latency, only in tail latency. Retransmits increased slightly, but not enough to trigger existing alerts. The service looked healthy in most dashboards, which is how tail latency likes to operate: it hides in the averages like a professional.
On the affected hosts, rx_missed_errors slowly climbed. Softnet drops spiked during microbursts. The NIC wasn’t “too interrupt-y.” It was doing its job. The new coalescing settings delayed RX processing just long enough that ring buffers overflowed during brief bursts. The optimization worked for CPU graphs and failed for users.
The fix was to revert to adaptive coalescing with conservative bounds and increase RX ring size modestly. They also added monitoring on softnet drops and NIC missed errors, because “I reduced interrupts” is not a user-facing SLO.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent organization ran a set of Ubuntu 24.04 KVM hypervisors hosting critical internal workloads. Random disconnect reports came in from multiple tenants: brief packet loss, occasional TCP stalls, nothing consistent. The networking team was ready to blame virtual switching. The platform team was ready to blame the ToR switches. Everyone was ready to blame everyone.
What saved them was not genius. It was discipline. They had a standing practice: every host shipped kernel logs, NIC stats, and softnet counters into a central system at a 30-second cadence, and they kept a few days of high-resolution history. No hero debugging required. The data already existed.
When incidents occurred, they correlated three signals: (1) AER corrected PCIe errors in journalctl, (2) NIC “Reset adapter” messages, and (3) short dips in guest network throughput. The pattern matched across a subset of hosts—same hardware model, same BIOS revision.
It turned out to be a platform firmware issue interacting with PCIe power management. Under certain states, the NIC would briefly misbehave, reset, and recover. The fix was a vendor firmware update plus disabling a specific PCIe ASPM setting in BIOS as a short-term mitigation. No offload toggling. No midnight cable swapping. Just boring evidence and a change window.
The best part: they could prove the fix worked because the same signals went quiet. No vibes-based closure emails.
Common mistakes: symptom → root cause → fix
- Symptom: “Network disconnects,” and journalctl -k shows NIC Link is Down/Up.
Root cause: Physical link flap (bad cable/DAC, optics, autoneg mismatch, switch port errors, EEE weirdness).
Fix: Check ethtool -S for CRC/symbol errors, disable EEE for a test, swap cable/optics, validate switch port config.
- Symptom: Link stays up, but you see Detected Tx Unit Hang / Reset adapter.
Root cause: Driver/NIC firmware bug, PCIe/AER issues, or a queue/interrupt corner case under load.
Fix: Collect the driver/firmware tuple, correlate with AER logs, update firmware, test a newer kernel/HWE, adjust coalescing/rings, consider the vendor driver if supported.
- Symptom: No link flap, no CRC errors, but TCP timeouts and retransmits spike.
Root cause: Host drops (softnet drops, RX ring overflow, CPU starvation) or upstream microbursts.
Fix: Check /proc/net/softnet_stat and ethtool -S missed errors, tune rings/queues/IRQ affinity, and validate switch buffering.
- Symptom: Small packets work, big transfers stall; pings succeed; SSH “sometimes freezes.”
Root cause: MTU mismatch or PMTU blackhole (ICMP blocked).
Fix: Audit MTU end-to-end; run ping -M do; allow ICMP fragmentation-needed; set correct MTU on tunnels and VLANs.
- Symptom: Only some flows fail, especially behind a bond; issues appear “random” across clients.
Root cause: LACP hashing sends some flows to a broken member; churn/partner mismatch.
Fix: Inspect /proc/net/bonding; verify switch LACP group and hashing; check member errors; consider removing the bad member until fixed.
- Symptom: Packet capture shows “bad checksum” and people panic.
Root cause: Checksum offload artifact in the capture path.
Fix: Validate with an on-wire capture (SPAN/TAP) or temporarily disable checksum offload for debugging only.
- Symptom: Disabling GRO “fixes” something but CPU skyrockets and throughput drops.
Root cause: You masked a deeper issue (driver bug or overload) by changing batching behavior.
Fix: Treat it as a diagnostic clue; pursue driver/firmware/kernel fixes or adjust coalescing/rings; keep GRO off only as a documented mitigation.
Checklists / step-by-step plan
Checklist A: Capture evidence during the next incident (15 minutes, no guesswork)
- Record time window of the symptoms (start/end). If you don’t have a window, you don’t have an incident.
- Collect kernel messages:
cr0x@server:~$ sudo journalctl -k --since "30 minutes ago" > /tmp/kern.log
Decision: Link flap vs reset vs nothing logged tells you where to go next.
- Snapshot NIC counters before and after a reproduction:
cr0x@server:~$ sudo ethtool -S eno1 > /tmp/eno1.stats.before
cr0x@server:~$ sleep 60; sudo ethtool -S eno1 > /tmp/eno1.stats.after
Decision: Rising CRC/symbol errors = physical; rising missed/no_buffer = host receive path.
- Snapshot softnet stats:
cr0x@server:~$ cat /proc/net/softnet_stat > /tmp/softnet.before
cr0x@server:~$ sleep 60; cat /proc/net/softnet_stat > /tmp/softnet.after
Decision: Drops/time_squeeze increasing means the kernel can’t keep up.
- Check retransmits/timeouts:
cr0x@server:~$ nstat -az | egrep -i "TcpRetransSegs|TcpTimeouts"
TcpRetransSegs 3490
TcpExtTCPTimeouts 18
Decision: If TCP sees it, it’s real loss/stall, not just an application bug.
Checklist B: Offload and tuning changes without making things worse
- Make one change at a time. Not negotiable.
- Before changing, record current settings:
cr0x@server:~$ sudo ethtool -k eno1 > /tmp/eno1.offloads.before
cr0x@server:~$ sudo ethtool -c eno1 > /tmp/eno1.coalesce.before
cr0x@server:~$ sudo ethtool -g eno1 > /tmp/eno1.rings.before
Decision: You can now undo changes and compare.
- Pick the smallest relevant change (e.g., disable GRO, not “all offloads”).
- Reproduce under similar load and compare counters and symptoms.
- If change helps, decide whether it’s a mitigation or a final fix. Mitigations need documentation and a plan to remove.
Checklist C: Physical and switch validation you can request precisely
- Provide switch team: timestamps, server interface, negotiated speed/duplex, and whether Linux logged link down/up.
- Ask for: port error counters (CRC/FCS), FEC stats, link renegotiation history, LACP state, and buffer/drop counters.
- Plan a swap test: move the cable/optic to a known-good port or swap optics between a good and bad host.
- After swap, verify whether errors follow the component or stay with the port.
FAQ
1) Why does my app say “disconnected” when the interface never went down?
Because TCP can time out without a link flap. Packet loss, microbursts, RX ring overflow, or PMTU blackholes can stall flows long enough that the app gives up. Check retransmits (nstat), softnet drops, and NIC missed errors.
2) Should I disable GRO/TSO/checksum offloads to fix random drops?
Not as a first move. If you have link flaps or physical errors, offloads are irrelevant. If you suspect an offload bug, disable one feature temporarily and prove the effect with counters and logs. Keep the CPU cost in mind.
3) tcpdump shows bad checksums. Is the NIC corrupting packets?
Often no. With checksum offload, the kernel hands packets to the NIC before the checksum is computed; tcpdump can observe them “pre-checksum.” Confirm with an on-wire capture or temporarily disable TX checksum offload for debugging.
4) What’s the fastest way to tell “cable/switch” vs “host tuning”?
CRC/alignment/symbol errors and explicit link down/up messages are your fast physical indicators. Missed errors, softnet drops, and a stable link point toward host receive path and CPU/IRQ/ring tuning.
5) Why do I see drops but no errors in ip -s link?
ip -s link counters don’t always expose NIC driver-specific drop reasons. Use ethtool -S for detailed counters and /proc/net/softnet_stat for kernel drops.
6) Can irqbalance cause random disconnects?
It can contribute if interrupts end up concentrated or pinned poorly on a NUMA system, especially under load. It’s rarely the sole cause, but it can turn “fine” into “flaky” when the margin is small. Verify queue distribution via /proc/interrupts.
7) I run bonding (LACP). Why are only some clients affected?
Hashing. Some flows land on a bad member link while others land on a healthy one. Check /proc/net/bonding/bond0 for churn and confirm switch-side LACP configuration and per-member errors.
8) How do I make ethtool settings persist after reboot on Ubuntu 24.04?
Don’t rely on manual commands. Use a systemd unit, udev rule, or your network manager’s native configuration hooks to apply ethtool -K, -G, -L, and -C at interface-up time. The exact method depends on whether you use NetworkManager, systemd-networkd, or netplan-generated config.
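One minimal pattern that survives reboots regardless of which renderer owns the interface: a udev rule that reapplies the tuning whenever the device appears. The file name and values below are placeholders; prefer your network manager’s native options where they exist.
# /etc/udev/rules.d/70-eno1-tuning.rules: hypothetical example
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eno1", RUN+="/usr/sbin/ethtool -G eno1 rx 2048 tx 2048"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eno1", RUN+="/usr/sbin/ethtool -L eno1 combined 16"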
9) Do kernel upgrades on Ubuntu 24.04 commonly cause NIC instability?
They can, but “commonly” is overstated. More often, upgrades change timing (interrupts, batching, power management defaults) and expose a marginal physical layer or firmware bug. Treat kernel change as a correlation, then prove the mechanism with logs and counters.
10) What if everything looks clean but users still report stalls?
Then your visibility is insufficient. Add higher-resolution metrics: softnet drops, NIC missed errors, retransmits, and switch-side drops. Many of these problems occur in sub-minute bursts that 5-minute averages politely erase.
Conclusion: next steps you can ship this week
Random disconnects aren’t random. They’re just happening in the gap between your assumptions and your evidence. Close the gap and the problem usually folds.
- Instrument first: start collecting journalctl -k excerpts, ethtool -S counters, and /proc/net/softnet_stat snapshots around incidents.
- Classify the failure mode: link flap vs NIC reset vs host drops vs MTU/PMTU vs LACP/bridging.
- Make one change at a time: ring sizes, queue counts, coalescing, or a single offload. Measure before/after with counters and retransmits.
- Escalate with proof: if you have CRC/symbol errors or LACP churn, take that to the switch team along with timestamps. You’re not “asking networking to look,” you’re presenting a case.
- Stabilize long-term: align driver/firmware updates with kernel updates, and monitor the counters that actually predict pain (missed errors, softnet drops, retransmits, link events).
If you do just one thing: stop treating offloads as a ritual. Treat them as an experiment. The network will respect you more. Your pager will, too.