Debian 13 LACP Bonding Flaps: Proving Whether the Switch or Host Is Wrong

When an LACP bond flaps, it doesn’t fail loudly. It fails like a committee: intermittently, with plausible deniability, and always during your busiest hour. One minute you’ve got a healthy 2×25G bundle; the next minute your storage traffic is pinned to one link, TCP is retransmitting, and everyone’s favorite sentence appears in chat: “Must be the network.”

This is a field guide for Debian 13 operators who need to do something rarer than troubleshooting: proving. Proving whether the host is misbehaving, the switch is misconfigured, or the physical layer is quietly turning your LACP into interpretive dance.

Fast diagnosis playbook

If you only have 15 minutes before someone suggests rebooting the core, do this in order. The point is to separate “LACP negotiation issue” from “carrier/PHY issue” from “host networking stack did a weird thing.”

1) Confirm whether this is carrier flapping or LACP-state flapping

Carrier down/up points to optics/cables/ASIC/driver/firmware. LACP state changes with stable carrier points to config mismatch, timers, MLAG/stack issues, or LACPDU loss.

cr0x@server:~$ ip -br link show
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eno1             UP             3c:fd:fe:aa:bb:01 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
eno2             UP             3c:fd:fe:aa:bb:02 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
bond0            UP             3c:fd:fe:aa:bb:ff <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>

What it means: If a slave shows DOWN or lacks LOWER_UP during the flap window, you’re chasing physical/driver. If slaves remain UP but bond0 drops active members, you’re chasing LACP negotiation or LACPDU delivery.

Decision: If carrier drops, skip ahead to ethtool counters and physical-layer checks. If carrier is stable, focus on LACPDU, timers, and switch consistency.
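
If you can watch live, ip can stream carrier events with timestamps; a minimal sketch (leave it running in tmux through a flap window):

cr0x@server:~$ ip -ts monitor link

Every carrier transition prints a timestamped line for the affected slave, which cleanly separates “carrier flap” from “LACP churn with stable carrier” without polling.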

2) Pull bonding driver state and “last churn” indicators

cr0x@server:~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.12.0

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 17
        Partner Key: 51
        Partner Mac Address: 00:11:22:33:44:55

Slave Interface: eno1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 3c:fd:fe:aa:bb:01
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 3
Partner Churned Count: 3
details actor lacp pdu:
    system priority: 65535
    system mac address: 3c:fd:fe:aa:bb:ff
    port key: 17
    port priority: 255
    port number: 1
details partner lacp pdu:
    system priority: 32768
    system mac address: 00:11:22:33:44:55
    oper key: 51
    port priority: 128
    port number: 49

Slave Interface: eno2
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 3c:fd:fe:aa:bb:02
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 3
Partner Churned Count: 3

What it means: “Churned” and increasing churn counts are your smoking gun for LACP negotiation instability or LACPDU loss. Link Failure Count rising is more physical/driver. Partner MAC, keys, and port numbers are the identity of the other side as the host sees it.

Decision: If churn counts rise while MII stays up, demand switch-side LACP logs and verify LACPDU reachability via capture.

3) Check kernel logs for NIC resets, PCIe issues, or carrier events

cr0x@server:~$ journalctl -k -S -30min | egrep -i 'bond0|eno1|eno2|lacp|link up|link down|reset|timeout|tx hang|firmware'
[...]
kernel: bond0: (slave eno1): Enslaving as an active interface with a down link.
kernel: eno1: Link is Down
kernel: eno1: Link is Up - 25Gbps/Full - flow control rx/tx
kernel: bond0: (slave eno1): link status definitely up, 25000 Mbps full duplex
kernel: bond0: (slave eno1): Actor Churn State is churned

What it means: The kernel will happily rat you out. “Link is Down/Up” = carrier. “TX hang/reset/firmware” = host-side. Churn without link down suggests LACP mismatch or LACPDU loss.

Decision: If you see resets/timeouts, stop arguing about switch config and start looking at driver/firmware and PCIe health.
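
PCIe health has its own breadcrumbs worth one grep; a minimal sketch (the patterns are illustrative, and empty output is good news):

cr0x@server:~$ journalctl -k -S -30min | egrep -i 'aer|pcie bus error|corrected error|uncorrectable'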

4) Get error counters on the slave NICs

cr0x@server:~$ ethtool -S eno1 | egrep -i 'err|drop|crc|fcs|miss|timeout|reset|align|symbol|disc|over'
rx_crc_errors: 0
rx_fcs_errors: 0
rx_missed_errors: 0
rx_discards: 12
tx_errors: 0
tx_timeout_count: 0

What it means: CRC/FCS/symbol errors point hard at physical (optics, cable, dirty fiber, marginal DAC). Discards can be congestion, ring buffers, or switch behavior.

Decision: If CRC/FCS increments during flaps, treat the physical path as guilty until proven innocent.
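
Counters only convict if they move during the incident, so watch the deltas; watch -d highlights fields that changed between refreshes:

cr0x@server:~$ watch -d -n 5 "ethtool -S eno1 | egrep -i 'crc|fcs|symbol'"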

5) If still unclear, capture LACPDUs for 60 seconds during the problem

More on this later, but it’s the fastest way to show “host sent LACPDU; switch didn’t reply” (or vice versa).
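
If you want the pcap files in the case file right now, a minimal sketch (assumes a root shell or passwordless sudo, since the captures run in the background):

cr0x@server:~$ for i in eno1 eno2; do sudo timeout 60 tcpdump -i $i -w /tmp/lacp-$i.pcap ether proto 0x8809 & done; wait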

What “LACP flapping” actually means

In Linux bonding mode 802.3ad, “flapping” usually means the bond repeatedly changes which slave interfaces are collecting/distributing. That can happen while the physical links remain up. It can also happen because the physical links are actually dropping, in which case LACP is just the messenger getting shot.

Operationally, you’ll see:

  • One slave repeatedly removed/added to the active aggregator.
  • Partner MAC or Partner Key changing unexpectedly (often MLAG/stack weirdness).
  • Throughput sawtoothing, storage latency spikes, TCP retransmits, and “why is half my capacity gone?”

LACP has timers. LACP has state machines. LACP also has a habit of revealing every inconsistency you didn’t know your network had. The trick is to collect evidence that survives cross-team debate: logs, counters, and on-the-wire frames.

Interesting facts and context (short, concrete)

  1. LACP is standardized as IEEE 802.1AX; older documents and many tools still refer to IEEE 802.3ad, which got folded into 802.1AX.
  2. “Fast” LACP rate is roughly 1 second between LACPDUs; “slow” is around 30 seconds. “Fast” finds problems quicker—and amplifies packet-loss problems quicker too.
  3. Linux bonding predates modern “network stack as code” culture; /proc/net/bonding is still one of the most useful diagnostics files on the system.
  4. Many switches treat LACP as control-plane traffic with different queuing rules than data; congestion or policing can drop LACPDUs while data “mostly works.”
  5. MLAG (multi-chassis LAG) implementations differ widely; some handle partner identity and keying in ways that confuse hosts during failovers if configuration isn’t perfectly symmetric.
  6. Transmit hash policy matters more than people admit; “layer2” can pin flows strangely in virtualized environments, while “layer3+4” can break expectations with NAT or asymmetric routing.
  7. Link aggregation doesn’t sum bandwidth for a single TCP flow unless you’re doing something special; most deployments rely on per-flow hashing.
  8. Carrier up does not guarantee a clean physical layer; you can have stable link with intermittent CRCs that trigger higher-layer churn and retransmits.
  9. Some NIC offloads have historically interacted badly with bonding (especially around VLAN offload and certain driver versions); the failure mode looks like “the network is flaky.”

Host-side evidence: Debian 13 checks that hold up in a postmortem

Debian 13 gives you a modern kernel, systemd, and the usual Linux toolbox. Use them like you’re building a case file, not like you’re playing whack-a-mole.

Task 1: Identify who configures the bond (systemd-networkd vs ifupdown)

cr0x@server:~$ systemctl is-active systemd-networkd
inactive
cr0x@server:~$ dpkg -l | egrep 'ifupdown|network-manager|systemd-networkd'
ii  ifupdown  0.8.41  amd64  high level tools to configure network interfaces

What it means: You need to know the control plane. If two managers fight, you get “flaps” that are actually reconfiguration events.

Decision: If more than one is active, pick one and disable the rest before doing anything else. Debugging a moving target is a hobby, not an SRE practice.
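
One line covers the usual three suspects (networking.service is ifupdown’s unit on Debian); on the example host above, only ifupdown should report active:

cr0x@server:~$ systemctl is-active systemd-networkd NetworkManager networking
inactive
inactive
active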

Task 2: Show the running bond configuration from the kernel’s point of view

cr0x@server:~$ grep -H . /sys/class/net/bond0/bonding/*
/sys/class/net/bond0/bonding/mode: 802.3ad 4
/sys/class/net/bond0/bonding/lacp_rate: fast 1
/sys/class/net/bond0/bonding/miimon: 100
/sys/class/net/bond0/bonding/xmit_hash_policy: layer3+4 1
/sys/class/net/bond0/bonding/ad_select: stable 0
/sys/class/net/bond0/bonding/min_links: 1

What it means: This is the truth as the kernel enforces it, not what your config file claims. lacp_rate “fast 1” is the fast rate; ad_select “stable 0” is the default aggregator-selection policy.

Decision: If miimon is 0 and you rely on carrier detection, you’re trusting the driver’s link notifications. That can be fine, but at least be explicit about it.
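
For reference, a minimal ifupdown sketch that produces the kernel state above (assumes the ifenslave package is installed; the address is illustrative):

# /etc/network/interfaces.d/bond0 (sketch)
auto bond0
iface bond0 inet static
    address 192.0.2.10/24
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4
    bond-min-links 1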

Task 3: Verify slaves, and check for accidental VLAN-on-slave vs VLAN-on-bond mismatch

cr0x@server:~$ ip -d link show bond0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:ff brd ff:ff:ff:ff:ff:ff
    bond mode 802.3ad miimon 100 updelay 0 downdelay 0 lacp_rate 1 ad_select stable xmit_hash_policy layer3+4
cr0x@server:~$ ip -br link show master bond0
eno1             UP             3c:fd:fe:aa:bb:01 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
eno2             UP             3c:fd:fe:aa:bb:02 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>

What it means: You want VLANs on top of bond0, not individually on slaves (unless you have a very specific design and matching switch config). “ip -d” shows bond parameters and MTU.

Decision: If MTU differs across slaves or the switch, you may not see “flaps,” but you’ll see symptoms that look like it. Fix MTU consistency early.
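
MTU symmetry is cheap to verify, so verify it:

cr0x@server:~$ for i in bond0 eno1 eno2; do printf '%-6s %s\n' "$i" "$(cat /sys/class/net/$i/mtu)"; done
bond0  9000
eno1   9000
eno2   9000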

Task 4: Watch the bond state live while the flap happens

cr0x@server:~$ watch -n 1 'grep -E "MII Status|Slave Interface|Aggregator ID|Churn|Link Failure" -n /proc/net/bonding/bond0'
Every 1.0s: grep -E "MII Status|Slave Interface|Aggregator ID|Churn|Link Failure" -n /proc/net/bonding/bond0
5:MII Status: up
18:Active Aggregator Info:
19:        Aggregator ID: 1
20:        Number of ports: 2
34:Slave Interface: eno1
35:MII Status: up
39:Link Failure Count: 0
45:Actor Churn State: churned
46:Partner Churn State: churned
64:Slave Interface: eno2
65:MII Status: up
69:Link Failure Count: 0
75:Actor Churn State: churned
76:Partner Churn State: churned

What it means: You’re looking for churn transitions, port count changes, and link failures. If number of ports drops to 1 without a link-down, LACP is the culprit.

Decision: If the bond is stable but the app is unhappy, stop blaming LACP and start measuring packet loss and latency.

Task 5: Confirm NIC speed/duplex/auto-neg and look for flapping there

cr0x@server:~$ ethtool eno1 | egrep -i 'Speed|Duplex|Auto-negotiation|Link detected'
Speed: 25000Mb/s
Duplex: Full
Auto-negotiation: on
Link detected: yes

What it means: Unexpected speed changes, autoneg off on one side, or link detected toggling points to physical or configuration mismatch.

Decision: If autoneg is mismatched with the switch, fix that before arguing about LACP timers. LACP can’t negotiate over a link that’s arguing about physics.

Task 6: Check for CRC/FEC/PCS errors (physical-layer smoking guns)

cr0x@server:~$ ethtool --phy-statistics eno1 2>/dev/null | head
PHY statistics for eno1:
Symbol Error During Carrier: 0
Receive Error Count: 0
cr0x@server:~$ ethtool -S eno1 | egrep -i 'crc|fcs|symbol|fec|align|jabber|code|pcs' | head -n 20
rx_crc_errors: 0
rx_fcs_errors: 0

What it means: Not every driver exposes PHY stats. But if you do see CRC/symbol/FEC corrections climbing, you’ve found a physical problem that might present as LACP instability.

Decision: If physical errors are present, replace/clean/recable first. Software tuning around a bad fiber is like optimizing a database running on a dying disk.

Joke 1: LACP is a relationship protocol: it’s stable until someone stops communicating for one second, then it “needs to talk.”

Task 7: Detect NIC driver/firmware version and look for known-bad combos in your fleet

cr0x@server:~$ ethtool -i eno1
driver: ice
version: 6.12.0
firmware-version: 4.20 0x8001a3d5
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

What it means: Driver and firmware matter. A lot. Keep a small internal matrix of “known good” versions for your hardware.

Decision: If this host is an outlier versus the fleet (different firmware), treat it as suspect. Standardization is boring; boring is reliable.

Task 8: Check IRQ balance, ring buffers, and dropped packets (control-plane starvation can drop LACPDUs)

cr0x@server:~$ nstat -az | egrep -i 'TcpRetransSegs|IpInDiscards|UdpInErrors'
TcpRetransSegs            1823               0.0
IpInDiscards              47                 0.0
UdpInErrors               0                  0.0
cr0x@server:~$ ip -s link show eno1 | sed -n '1,6p'
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:01 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    9876543210 1234567 0      14      0       0
    TX:  bytes packets errors dropped carrier collsns
    1234567890 234567  0      0       0       0

What it means: Dropped RX packets on the NIC can include LACPDUs if things are ugly enough. Usually LACPDUs are tiny and rare, but starvation happens on overloaded systems.

Decision: If the host drops packets while CPU is pegged or IRQs are mispinned, fix host performance first—especially on storage nodes doing heavy checksum/crypto/compression.
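
A quick way to see whether the NIC’s vectors are pinned or free-floating; a sketch that assumes the driver names its MSI-X vectors after the interface (ice, mlx5, and friends generally do):

cr0x@server:~$ for irq in $(grep eno1 /proc/interrupts | awk -F: '{print $1}'); do cat /proc/irq/$irq/smp_affinity_list; done | sort | uniq -c
     16 0-63

Sixteen vectors all allowed on every CPU means nothing is pinned; if your storage workload owns those cores, LACPDU handling inherits the jitter.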

Task 9: Confirm LACP multicast MAC reception is not filtered

LACPDUs use the slow protocols multicast address (01:80:c2:00:00:02). If your NIC, bridge, or switch filters it unexpectedly, LACP fails in interesting ways.

cr0x@server:~$ ip maddr show dev eno1
2:	eno1
	link  01:00:5e:00:00:01
	link  33:33:00:00:00:01
	link  01:80:c2:00:00:02
	link  33:33:ff:aa:bb:01

What it means: Seeing 01:80:c2:00:00:02 is a reassuring sign. Absence isn’t always fatal (some drivers don’t show it cleanly), but if it’s missing and things flap, it’s a clue.

Decision: If multicast filtering or special security features are enabled (on host or switch), explicitly validate they allow slow-protocols frames.

Task 10: Validate that both slaves have identical L2 settings (MTU, offloads, VLAN filtering)

cr0x@server:~$ for i in eno1 eno2; do echo "== $i =="; ip -d link show $i | egrep -i 'mtu|vf|vlan|state'; ethtool -k $i | egrep -i 'vlan-offload|segmentation-offload|receive-offload' | head -n 12; done
== eno1 ==
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
rx-vlan-offload: on
tx-vlan-offload: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
== eno2 ==
3: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
rx-vlan-offload: on
tx-vlan-offload: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on

What it means: Asymmetry is a recurring villain. If offloads differ, or MTU differs, your bond can behave “fine” but intermittently drop control or data frames under load.

Decision: Make slaves as identical as possible. If you must change offloads, change them on both and record why.

Task 11: Verify STP/bridge involvement isn’t eating slow-protocol frames

If bond0 is part of a bridge (common on virtualization hosts), check for filtering settings that interact with LACP or link state.

cr0x@server:~$ bridge link show
2: eno1 state UP : <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> master bond0
3: eno2 state UP : <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> master bond0
cr0x@server:~$ bridge -d vlan show | head
port              vlan-id
bond0             10 PVID Egress Untagged
bond0             20

What it means: Bridging is fine; confusing bridging with per-port VLAN config is not. Keep your L2 model simple when debugging LACP.

Decision: If the host is doing complex bridging plus bonding, temporarily simplify (maintenance window) to isolate whether the bond itself is unstable.

Task 12: Run a controlled degradation test (prove the bond reacts correctly)

This is how you prove host behavior to a skeptical switch team: show deterministic reactions to a forced link event.

cr0x@server:~$ sudo ip link set dev eno2 down
cr0x@server:~$ cat /proc/net/bonding/bond0 | egrep -A3 'Active Aggregator Info|Number of ports|Slave Interface|MII Status' | head -n 20
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
Slave Interface: eno1
MII Status: up
Slave Interface: eno2
MII Status: down
cr0x@server:~$ sudo ip link set dev eno2 up

What it means: You’re checking that the host cleanly removes/adds a slave and returns to 2 ports without churn storms.

Decision: If this controlled test creates churn or takes a long time to rejoin, you may have host config issues (ad_select, lacp_rate mismatch, or timing problems).

Switch-side evidence: what to demand from the network team

You may not have switch CLI access. Fine. You still need switch facts, not vibes. Ask for specific outputs and insist they cover both member ports and the port-channel interface.

Here’s what you want, regardless of vendor:

  • Port-channel state: up/down, members bundled or suspended, reason codes.
  • LACP partner details: actor/partner system ID (MAC), keys, port numbers.
  • Interface counters: CRC/FCS, symbol errors, discards, pause frames, link transitions.
  • Consistency checks: VLAN list, MTU, speed/duplex, LACP mode (active/passive), LACP rate.
  • MLAG/stack status if relevant: peer link health, consistency, orphan ports.

And you want it time-correlated with the host’s churn timestamps.

What switch-side evidence conclusively implicates the switch?

  • The switch reports LACP partner ID changing back and forth while the host’s system MAC is stable.
  • The switch suspends a member due to “LACP timeout” but the host capture shows it transmitted LACPDUs on schedule.
  • The switch shows CRC/FCS increasing on its port counters while the host does not (or vice versa), narrowing the bad segment.
  • MLAG peer-link issues coincide with flaps, and the switch logs show renegotiation events.

What switch-side evidence implicates the host?

  • Switch sees LACPDU gaps from the host (no PDUs received) while the host is overloaded or experiencing NIC resets.
  • Switch reports the host sending inconsistent keys or port IDs across members (often due to misbonding, SR-IOV/VFs, or config drift).
  • Switch logs show interface link down/up events that match host kernel logs and point to optics/cable.

Packet capture: LACPDUs don’t lie (much)

If you want proof that survives a change review, capture the control traffic. LACPDUs are Ethernet frames (EtherType 0x8809) sent to 01:80:c2:00:00:02. They are not IP. They will not show up in your tcpdump filters unless you ask properly.

Task 13: Capture LACPDUs on each slave and on bond0 (and compare)

Do this during a flap window. Capture on slaves, because bond0 may not see every frame the way you expect.

cr0x@server:~$ sudo timeout 60 tcpdump -i eno1 -e -vvv -s 0 ether proto 0x8809
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:01:10.123456 00:11:22:33:44:55 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
	Actor System 3c:fd:fe:aa:bb:ff, Actor Key 17, Port 1, State 0x3d
	Partner System 00:11:22:33:44:55, Partner Key 51, Port 49, State 0x3f
cr0x@server:~$ sudo timeout 60 tcpdump -i eno2 -e -vvv -s 0 ether proto 0x8809
tcpdump: listening on eno2, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:01:10.223456 00:11:22:33:44:55 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
	Actor System 3c:fd:fe:aa:bb:ff, Actor Key 17, Port 2, State 0x3d
	Partner System 00:11:22:33:44:55, Partner Key 51, Port 50, State 0x3f

What it means: You should see regular LACPDUs. Actor System should match your bond MAC; Partner System should match the switch LACP system ID. Port numbers should be stable. If one interface stops seeing inbound LACPDUs while the other continues, that isolates the problem to one physical path or one switch port.

Decision: If host is sending LACPDUs (you can also capture outbound) but not receiving responses on one leg, the switch/port/PHY path is suspect.

Task 14: Prove the host is transmitting LACPDUs (outbound evidence)

Inbound-only evidence can be misleading if the switch is the only one talking. tcpdump on Linux doesn’t always label frame direction reliably across drivers, so identify direction by source MAC and cadence instead: frames sourced from your bond MAC are outbound.

cr0x@server:~$ sudo timeout 20 tcpdump -i eno1 -e -vv -s 0 ether proto 0x8809 | head
12:02:00.000001 3c:fd:fe:aa:bb:ff > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
12:02:01.000113 3c:fd:fe:aa:bb:ff > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110

What it means: Source MAC equal to your bond MAC indicates the host is sending. Timing near 1 second implies LACP fast.

Decision: If host sends consistently but switch claims “no PDUs received,” you now have an objective conflict that usually ends with a switch port capture or ASIC counter deep-dive.

Task 15: Correlate flap time with LACPDU gaps

cr0x@server:~$ sudo timeout 30 tcpdump -tt -i eno1 -e -s 0 ether proto 0x8809 | awk '{print $1, $2, $3, $4}'
1703851360.000001 3c:fd:fe:aa:bb:ff > 01:80:c2:00:00:02,
1703851361.000104 3c:fd:fe:aa:bb:ff > 01:80:c2:00:00:02,
1703851362.000095 3c:fd:fe:aa:bb:ff > 01:80:c2:00:00:02,

What it means: You’re looking for missing seconds. If you see 1-second cadence then a sudden multi-second silence and the bond churns, you’ve got control-plane loss (host or switch).

Decision: Silence with stable CPU/IRQ and no drops suggests switch-side filtering/policing or a physical glitch not long enough to register as link down.
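
To turn a capture into a number a ticket can carry, compute inter-arrival gaps; a minimal sketch assuming fast rate (3 seconds is the fast-rate timeout; use 90 for slow):

cr0x@server:~$ sudo timeout 120 tcpdump -tt -nn -i eno1 ether proto 0x8809 2>/dev/null | awk 'prev && $1-prev > 3 { printf "gap %.1fs ending at %.6f\n", $1-prev, $1 } { prev = $1 }'
gap 7.0s ending at 1703851400.000120

Any line printed is your objective conflict: either the host stopped sending, or the wire stopped delivering.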

Decision tree: switch vs host vs physical

Case A: You see “Link is Down/Up” in host logs

Most likely: physical layer or NIC/driver/firmware. LACP is downstream of link state.

Prove it:

  • Kernel log timestamps show carrier down/up on a specific slave.
  • ethtool counters show CRC/FCS or other physical errors climbing.
  • Switch interface counters match (link transitions, errors).

Action: swap cable/DAC/optic, clean fiber, move to another switch port, update NIC firmware/driver if it’s a known troublemaker.

Case B: Carrier stays up, but bond membership changes and churn counts rise

Most likely: LACP negotiation instability, mismatch, or control-plane frame loss.

Prove it:

  • /proc/net/bonding shows churned state increments.
  • tcpdump shows LACPDUs missing on one leg, or partner identity changes.
  • Switch port-channel logs show “suspended” members with reasons that align with churn.

Action: validate LACP mode (active/active is the sane default), timers (fast/fast), and config symmetry (VLAN/MTU/speed) across both switch ports and both host slaves.

Case C: Everything looks stable, but apps see intermittent loss/latency

Most likely: not actually LACP. Think buffer drops, ECN/pause storms, hashing skew, or upstream congestion.

Prove it:

  • Bond never changes membership; no churn increments; no link failures.
  • Interface drops increase under load.
  • nstat shows retransmits climbing while link stays up.

Action: treat it as performance engineering: queueing, QoS, MTU, ECN, switch buffers, and hashing policy.
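
To see hashing skew directly, watch per-slave TX counters under load; if one slave barely moves, you have a distribution problem, not an LACP problem:

cr0x@server:~$ watch -n 1 'for i in eno1 eno2; do printf "%s %s\n" $i $(cat /sys/class/net/$i/statistics/tx_bytes); done'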

Quote (paraphrased), attributed to Richard Feynman: for a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

Three corporate mini-stories (anonymized)

Mini-story 1: The incident caused by a wrong assumption

They had a pair of Debian storage nodes, each with a 2×10G LACP bond to a top-of-rack pair running MLAG. Everything looked clean: bonds were “up,” port-channels were “up,” and the incident started as a familiar complaint: “writes are spiky.”

The on-call assumed it was a storage-side problem because the graphs showed IO latency. They chased disk queues, tuned elevator settings, and even throttled a background scrub. Nothing changed. The network team insisted the port-channel was stable because it was “up/up.” Everyone felt productive and nobody was correct.

The breakthrough came from doing the boring thing: capturing LACPDUs on both host interfaces. One leg had clean 1-second LACPDU cadence; the other showed bursts followed by gaps. Carrier never dropped. The host was not “flapping” physically; it was being intermittently starved of control traffic.

Switch-side logs (once requested with timestamps) showed that during microbursts, the switch control-plane queue was being hit by a storm of other slow-protocol frames from a misconfigured downstream device. LACPDUs weren’t prioritized the way the team assumed they were. They had assumed “control protocols are always protected.” In that environment, they weren’t.

The fix was not a kernel tweak. It was isolating the noisy device and adjusting switch policy to avoid slow-protocol starvation. The postmortem lesson was blunt: “up/up” is not an SLA, and assumptions are not evidence.

Mini-story 2: The optimization that backfired

A virtualization cluster wanted faster failover. Someone set LACP to fast rate everywhere, dropped miimon to 50ms, and tightened other timers because “we can detect failure quicker.” They did this during a network refresh where new switches replaced old ones, and the lab tests looked great.

In production, they started seeing intermittent bond member drops on hosts under heavy CPU load. Not consistently. Not on every host. Only on the busiest hypervisors. It was the worst kind of failure: the kind that looks like a rumor until it’s a revenue event.

The root cause was a chain reaction. Fast LACP meant more frequent PDUs. Lower miimon meant the bond reacted quickly to any hint of trouble. Under CPU saturation, a subset of hosts occasionally delayed processing receive interrupts long enough that the effective LACPDU handling jitter exceeded the tolerance of the switch’s state machine. The links were physically fine. The negotiation wasn’t.

They “optimized” failover and accidentally turned normal scheduling jitter into an outage trigger. Rolling back to slow rate (or leaving LACP fast but fixing host IRQ affinity and reserving CPU for networking) eliminated the flaps. The real fix was resource isolation: keep the host healthy enough that control traffic is handled predictably.

One of the few useful lines from the review: “Failover speed is a feature until it becomes a sensitivity.” If you’re going fast, measure jitter and loss, not just average throughput.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran Debian on bare metal for a latency-sensitive pipeline. Their networking was not fancy, but it was disciplined. Every NIC firmware version was tracked. Every switch port-channel had a standard template. Every change had pre/post snapshots of counters and LACP partner state.

One week, a new rack came online and immediately started seeing LACP churn on a subset of servers. The network team suspected the servers because “the rest of the fabric is fine.” The server team suspected the switches because “the servers are identical.” Classic.

Then the boring practice paid out. They compared the pre/post snapshot from a known-good rack: partner system ID, partner key, member port IDs, error counters at baseline. In the new rack, the partner key was different on one member port. Same port-channel ID, different operational key. That’s not “Linux being weird.” That’s a switch configuration consistency error.

The network team found that one switch in the MLAG pair had an older template applied: VLAN list and LACP key differed, so the switch alternated between bundling and suspending depending on which peer was primary at the moment. It wasn’t dramatic enough to drop the whole port-channel—just enough to churn under load.

They fixed the config, and the flaps stopped. No heroics. No guessing. Just baseline snapshots and insistence on symmetry. Boring is not a lack of skill; it’s the result of it.

Common mistakes: symptoms → root cause → fix

1) Symptom: One slave repeatedly “suspended” on the switch; host shows churned state

Root cause: LACP mode/timer mismatch (active/passive, fast/slow) or inconsistent LACP key/VLAN/MTU across member ports.

Fix: Make both sides symmetric. Use LACP active on both ends unless there’s a policy reason not to. Align LACP rate. Verify VLAN/MTU/speed settings are identical on both switch ports in the bundle.

2) Symptom: Bond loses a member without any “Link is Down” messages

Root cause: LACPDU loss, slow-protocol filtering, or switch control-plane congestion. In MLAG, could be peer-link instability causing partner identity changes.

Fix: Capture LACPDUs on both host slaves; demand switch logs with reason codes. Validate MLAG/stack health and slow-protocol handling.

3) Symptom: Kernel logs show NIC resets, timeouts, or TX hangs during flaps

Root cause: host-side driver/firmware issue, PCIe errors, or power management oddities.

Fix: Update NIC firmware to your known-good baseline, check PCIe AER logs, disable problematic power saving, and validate BIOS settings for PCIe ASPM if relevant.

4) Symptom: CRC/FCS errors increase on one interface; LACP churn follows

Root cause: bad cable/DAC/optic, dirty fiber, marginal transceiver, or bad port.

Fix: swap physical components in a controlled order (cable first, then optic, then port). Record which change altered counters. If you don’t measure, you’re just rearranging the crime scene.

5) Symptom: Throughput is half expected, but no flaps

Root cause: hashing skew (one big flow), wrong xmit_hash_policy, or upstream flow distribution not matching.

Fix: Confirm expected traffic pattern. Use layer3+4 hashing for general-purpose server traffic. For storage protocols, validate whether you’re pushing many flows or a few large ones.

6) Symptom: Flaps only happen during peak CPU load

Root cause: host cannot process control traffic reliably (IRQ imbalance, CPU starvation, drops), causing LACPDU gaps and timeouts.

Fix: Restore host resource isolation: pin IRQs, ensure irqbalance policy is sane, increase ring buffers if needed, and avoid timer “optimizations” that reduce tolerance.
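
If ring buffers turn out to be the limit, check the hardware maximum before raising them; sizes below are illustrative, not a recommendation:

cr0x@server:~$ ethtool -g eno1 | head -n 4
Ring parameters for eno1:
Pre-set maximums:
RX:             8192
RX Mini:        n/a
cr0x@server:~$ sudo ethtool -G eno1 rx 4096 tx 4096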

Joke 2: A port-channel is like a corporate merger: if both sides don’t agree on the paperwork, you get “synergy” and outages.

Checklists / step-by-step plan

Step-by-step: build a “proof packet” for switch vs host responsibility

  1. Record time window: start/end timestamps in UTC. If teams can’t agree on time, they can’t agree on reality.
  2. Host snapshot: capture /proc/net/bonding/bond0 before, during, after (a collection sketch follows this list).
  3. Host logs: journalctl -k around the window; grep for link and driver events.
  4. Host counters: ethtool -S for each slave; ip -s link for drops.
  5. Host config truth: /sys/class/net/bond0/bonding/* plus ip -d link show.
  6. LACPDU capture: 60 seconds on each slave during the flap window.
  7. Switch ask: request port-channel state, per-member reason codes, partner ID/key, and per-port error counters for the same timestamps.
  8. Correlate: do timestamps match? does the same leg misbehave consistently?
  9. Isolate: move one cable/optic/port at a time if physical is suspected; re-test.
  10. Decide and fix: commit to one hypothesis and validate by change + measurement.
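
Steps 2 through 5 collapse into one script you can run before, during, and after the window; a minimal sketch (directory, lookback interval, and interface names are assumptions):

cr0x@server:~$ cat > /tmp/lacp-snap.sh <<'EOF'
#!/bin/bash
# Snapshot bonding evidence into a timestamped directory (sketch; bond/slave names are assumptions).
set -eu
d=/var/tmp/lacp-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$d"
cp /proc/net/bonding/bond0 "$d/bonding.txt"
grep -H . /sys/class/net/bond0/bonding/* > "$d/sysfs.txt" || true
for i in eno1 eno2; do
    ethtool -S "$i" > "$d/ethtool-$i.txt"
    ip -s link show "$i" > "$d/iplink-$i.txt"
done
journalctl -k -S -30min > "$d/kernel.log"
echo "$d"
EOF
cr0x@server:~$ sudo bash /tmp/lacp-snap.sh
/var/tmp/lacp-20250101T120000Z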

Step-by-step: physical isolation without making it worse

  1. Identify the “bad leg” by churn/counters/capture (eno1 vs eno2).
  2. Swap only the cable/DAC on that leg; do not change both at once.
  3. Re-check CRC/FCS counters baseline, then after 10 minutes under load.
  4. If still bad, swap optic, then move switch port, then move to another line card if possible.
  5. Document each change with before/after counters. This prevents “it got better somehow” folklore.

Step-by-step: configuration symmetry audit (host + switch)

  • Host: both slaves same MTU, same offloads, same speed/duplex/autoneg, same driver/firmware class (the diff one-liner after this list catches most of it).
  • Host: bond mode 802.3ad, stable ad_select, sane miimon, and matching lacp_rate with the switch expectation.
  • Switch: both member ports have identical VLAN list, MTU, LACP mode, LACP rate, and are in the same port-channel.
  • MLAG: both switches have identical port-channel config and peer-link is healthy; no “orphaned” inconsistencies.
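
For the host rows, diff does the audit for you; empty output means symmetric (tail and grep strip the per-interface lines that always differ):

cr0x@server:~$ diff <(ethtool -k eno1 | tail -n +2) <(ethtool -k eno2 | tail -n +2)
cr0x@server:~$ diff <(ethtool -i eno1 | grep -v bus-info) <(ethtool -i eno2 | grep -v bus-info)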

FAQ

1) Is Debian 13 special here, or is this just “Linux bonding stuff”?

Mostly Linux bonding stuff. Debian 13 matters because kernel/driver versions change behavior, and systemd-networkd vs ifupdown changes how config drift happens.

2) What’s the single best file to look at for LACP issues on the host?

/proc/net/bonding/bond0. It shows churn, partner identity, keys, port numbers, and link failure count. It’s the closest thing to a black box recorder.

3) If the switch says the port-channel is up, can it still be broken?

Yes. “Up” can mean “some member is bundled.” You can still have one leg suspended, churning, or erroring. Ask for per-member state and reasons.

4) Should I use LACP fast or slow?

Fast if you can guarantee control traffic isn’t dropped and hosts aren’t starved; slow if you want tolerance. If you run fast, measure jitter and packet loss, not just link state.

5) How do I prove the host is sending LACPDUs?

tcpdump on the slave interface with EtherType 0x8809, checking source MAC equals your bond MAC and cadence matches lacp_rate.

6) How do I prove the switch is sending LACPDUs back?

Same capture, but look for frames sourced from the switch LACP system MAC. If you see host outbound but no inbound on one leg, that’s strong evidence.

7) Can a bad cable cause LACP flapping without link down events?

Yes. You can get errors and micro-interruptions that don’t drop carrier long enough to log a link down, but do disrupt LACPDU exchange and traffic quality.

8) Does hashing policy cause flapping?

No, hashing policy usually causes performance weirdness, not LACP negotiation churn. But performance weirdness often gets misreported as “the bond is flapping.” Verify with /proc/net/bonding and churn counts.

9) What if only one host in the rack flaps, and others are fine?

Suspect host-side firmware/driver differences, a bad NIC port, or a bad cable/optic. Then suspect a single switch port. Use counters and “bad leg” isolation.

10) What if multiple hosts flap on the same switch pair at the same time?

Suspect switch-side issues first: MLAG peer-link, control-plane congestion, misapplied template, or a software bug. Ask for switch logs around the same timestamps and compare partner/key behavior.

Conclusion: next practical steps

When LACP bonds flap, you win by being methodical, not loud. Start with the simplest classification: carrier flapping vs LACP churn with stable carrier. Then gather the three kinds of proof that survive cross-team debate: kernel logs, counters, and packet captures.

Next steps that usually end the argument:

  • Capture 60 seconds of LACPDUs on each slave during a flap.
  • Export /proc/net/bonding/bond0 before/during/after and correlate timestamps with switch logs.
  • Compare CRC/FCS and drops on both host and switch ports to isolate the bad segment.
  • Force a controlled link down/up on one slave in a maintenance window to verify host behavior is deterministic.
  • If MLAG is involved, demand evidence of peer-link health and configuration symmetry, not assurances.

Do that, and you’ll stop “troubleshooting” and start proving. In production, that’s the difference between fixing the problem and just moving it to next week.
