You’re in the middle of a production change. You run one more command, wait for output… and the terminal freezes. A few seconds later: “Broken pipe” or “Connection reset.” Your SSH session didn’t “randomly” die. Something in the path decided your quiet TCP flow looked like dead weight.
Ubuntu 24.04 isn’t uniquely cursed here, but it’s modern enough (OpenSSH defaults, systemd plumbing, ubiquitous NAT, aggressive firewalls) that old advice like “set TCPKeepAlive yes” often lands with a thud. Let’s get practical: what to check, what to change, and what not to touch if you want SSH sessions that survive corporate networks, cloud load balancers, and home routers pretending to be enterprise gear.
How SSH sessions really die
SSH is “just TCP” with a lot of cryptography and a little hope. When your session drops, one of these happened:
- Idle timeout in the middle: a firewall/NAT/load balancer decides an idle TCP flow is expendable and forgets the mapping.
- Path flap: Wi‑Fi roaming, VPN renegotiation, cellular handoff, ISP hiccup. The TCP connection can’t recover.
- One side gives up: client or server decides the peer is dead and closes the socket.
- Resource or policy kill: sshd restarts, a host reboots, a security agent kills long-lived sessions, or PAM/systemd/logind policies terminate sessions.
- PMTU/fragmentation weirdness: packets get black-holed, keepalives don’t make it through, retransmits pile up, then the app times out.
“Keepalive” is not a single feature. It’s a family of timers and probes at different layers:
- SSH protocol keepalives (OpenSSH): encrypted application-level messages that travel like real traffic and can detect a dead peer.
- TCP keepalives (kernel): unencrypted TCP probes that may or may not traverse NATs/firewalls the way you think.
- NAT/firewall session aging: the device in the middle decides how long “idle” is allowed to be.
When someone says “SSH drops randomly,” translate it to: “something times out my idle flow sooner than I expect.” Your job is to find which thing and then pick the cheapest, safest way to keep the flow non-idle or recover quickly.
One more reality check: keepalives are not a substitute for a resilient workflow. They reduce pain. They do not make TCP immortal.
Interesting facts and historical context (useful trivia)
- SSH started as a response to insecure remote logins. Early remote access leaned on plaintext protocols; SSH’s rise was as much about sanity as security.
- OpenSSH didn’t invent keepalives; it professionalized them. The protocol-level keepalive mechanism is basically “send something legitimate and wait for proof the peer is alive.”
- TCP keepalive defaults are famously unhelpful for humans. Many systems historically used ~2 hours before probing, which is great for long-lived database sockets and terrible for a coffee break.
- NAT devices are not neutral bystanders. They maintain per-flow state, and that state ages out. Consumer gear is often more aggressive than enterprise firewalls.
- “Broken pipe” is your shell reporting SIGPIPE. The underlying socket is dead; your next write fails; your terminal app tells you in its own dramatic way.
- Stateful firewalls popularized “idle timeouts” as a safety valve. Tracking every flow forever costs memory; timeouts are a crude garbage collector.
- Some load balancers have distinct timeouts for TCP vs. HTTP. If you hairpin SSH through an appliance optimized for HTTP, it may treat your long-lived TCP flow as suspicious furniture.
- SSH multiplexing (ControlMaster) can hide drops. Your “new” sessions may just be new channels on an old TCP socket that’s about to die.
- IPv6 changes the middleboxes story, but doesn’t remove it. Less NAT doesn’t mean no stateful filtering, and enterprise networks still time out idle flows.
Fast diagnosis playbook
First: decide whether the server or the network killed it
- On the client, reproduce with verbose logging and look for timing and who initiated the close.
- On the server, correlate sshd logs around the disconnect timestamp.
- If nothing logs, assume a middlebox dropped state and your next packet hit a void.
Second: identify “idle timeout” vs “path flap”
- If it dies after a predictable idle window (10m, 30m, 60m), that’s a timeout policy somewhere.
- If it dies during movement (VPN on/off, Wi‑Fi roam), that’s path instability; keepalives won’t fully fix it, but they can shorten detection time.
Third: pick the cheapest fix
- Client-side ServerAlive* if you just need your own sessions stable and you can’t change servers.
- Server-side ClientAlive* if you run the servers and want consistent behavior for everybody.
- Kernel TCP keepalive if you also need non-SSH apps stable, or your environment blocks SSH-level probes (rare but real).
Fourth: confirm the path’s timeout boundary
Don’t guess. Measure. If the network kills idle flows at ~900 seconds, sending keepalives every 60 seconds is fine; every 5 seconds is theater with extra load.
Hands-on tasks: 12+ real checks with commands, output meaning, decisions
These are deliberately “operator-style”: run a command, interpret it, make a decision. Keep timestamps. Don’t rely on vibes.
Task 1: Capture a failing session with SSH verbosity
cr0x@server:~$ ssh -vvv ops@db-prod-1
OpenSSH_9.6p1 Ubuntu-3ubuntu13, OpenSSL 3.0.13 30 Jan 2024
debug1: Server host key: ssh-ed25519 SHA256:...
debug1: Authenticating to db-prod-1:22 as 'ops'
...
debug1: client_input_channel_req: channel 0 rtype keepalive@openssh.com reply 1
...
Write failed: Broken pipe
What it means: keepalive@openssh.com channel requests arriving at the client are the server’s ClientAlive probes. If you see them at regular intervals and the session still drops, the probe interval may be too long for the path, or the path is flapping (not just idle aging).
Decision: If drops happen without any keepalive logs, enable ServerAliveInterval on the client. If keepalives exist but drops persist, shorten the interval and investigate network events (VPN/Wi‑Fi) and server logs.
Task 2: Check whether you’re using SSH multiplexing (and accidentally trusting a stale socket)
cr0x@server:~$ ssh -G db-prod-1 | egrep -i 'controlmaster|controlpath|controlpersist'
controlmaster auto
controlpath /home/cr0x/.ssh/cm-%r@%h:%p
controlpersist 10m
What it means: Your “new” SSH connections may reuse one underlying TCP connection. If that TCP flow is silently killed, you’ll see weird failures across multiple tabs.
Decision: For unstable networks, consider reducing ControlPersist or disabling multiplexing for that host to make failures more obvious and less blast-radius-y.
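If you suspect a stale control socket right now, you can ask the master directly, and you can scope multiplexing per host instead of globally. A minimal sketch follows; the *.flaky-vpn pattern is a placeholder for whatever hosts sit behind your unstable path.
cr0x@server:~$ ssh -O check db-prod-1   # "Master running" means the control socket is alive
cr0x@server:~$ ssh -O exit db-prod-1    # tear down a stale master so the next ssh builds a fresh TCP connection
# ~/.ssh/config fragment: keep multiplexing for stable hosts, disable it on flaky paths
Host *.flaky-vpn
  ControlMaster no
  ControlPath none
Host *
  ControlMaster auto
  ControlPath ~/.ssh/cm-%r@%h:%p
  ControlPersist 30s
Note that ControlMaster no alone still reuses an existing socket if ControlPath points at one; ControlPath none is what actually opts the host out.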
Task 3: See server-side disconnect reasons in logs (journalctl)
cr0x@server:~$ sudo journalctl -u ssh -S "2025-12-29 09:00" -U "2025-12-29 10:00" --no-pager | tail -n 20
Dec 29 09:41:12 db-prod-1 sshd[21877]: Connection closed by 10.20.4.18 port 53122 [preauth]
Dec 29 09:41:18 db-prod-1 sshd[21903]: Received disconnect from 10.20.4.18 port 53144:11: disconnected by user
Dec 29 09:52:07 db-prod-1 sshd[22011]: Timeout, client not responding from 10.20.4.18 port 53410
Dec 29 09:52:07 db-prod-1 sshd[22011]: Disconnecting: Timeout, client not responding
What it means: “Connection closed by” can be the client, or the client reacting to a dead path. “Timeout, client not responding” points to server-side keepalive (ClientAlive*) firing or a dead client path.
Decision: If the server is timing out clients, tune ClientAliveInterval/ClientAliveCountMax. If the client “closed,” focus on middleboxes and client-side settings.
Task 4: Verify sshd effective configuration (don’t trust the file)
cr0x@server:~$ sudo sshd -T | egrep -i 'clientalive|tcpkeepalive|kex|loglevel'
clientaliveinterval 0
clientalivecountmax 3
tcpkeepalive yes
loglevel INFO
What it means: ClientAliveInterval 0 means the server will not send SSH-level keepalives. Many people assume TCPKeepAlive yes is enough. It often isn’t.
Decision: If you administer the server, set ClientAliveInterval to something sensible (example later). If you don’t, use client-side ServerAliveInterval.
Task 5: Confirm the client’s effective SSH config (includes Match blocks)
cr0x@server:~$ ssh -G db-prod-1 | egrep -i 'serveralive|tcpkeepalive|ipqos'
serveraliveinterval 0
serveralivecountmax 3
tcpkeepalive yes
ipqos lowdelay throughput
What it means: No client-side keepalive is enabled. If the network drops idle flows, you’re living dangerously.
Decision: Add per-host or global ServerAliveInterval (and a sane ServerAliveCountMax).
Task 6: Measure the failure window with a controlled idle test
cr0x@server:~$ date; ssh -o ServerAliveInterval=0 -o ServerAliveCountMax=3 ops@db-prod-1 'echo connected; sleep 3600'
Mon Dec 29 10:02:01 UTC 2025
connected
Write failed: Broken pipe
What it means: With no keepalives, the session dies during the hour sleep. Note the timestamp when it dies; repeat a few times and you’ll usually see a tight band (e.g., ~15 minutes).
Decision: If the cutoff is consistent, tune keepalives to fire well inside that window.
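If one run isn’t conclusive, you can bracket the window with a short loop. This is a sketch (the durations are arbitrary, the host follows the examples above); each iteration reports whether the connection survived that much idle time:
cr0x@server:~$ for t in 300 600 900 1200 1800; do
  date +"%T testing ${t}s idle"
  ssh -o ServerAliveInterval=0 -o ConnectTimeout=10 ops@db-prod-1 "sleep $t" \
    && echo "survived ${t}s idle" || echo "died before ${t}s idle elapsed"
done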
Task 7: Repeat with aggressive SSH keepalive to validate the hypothesis
cr0x@server:~$ date; ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ops@db-prod-1 'echo connected; sleep 3600; echo done'
Mon Dec 29 10:10:44 UTC 2025
connected
done
What it means: If this survives, your problem is almost certainly idle timeout state loss in the network path.
Decision: Set a less aggressive but still safe value (30 seconds is sometimes fine; 60 seconds is often fine; 5 seconds is usually performative).
Task 8: Inspect kernel TCP keepalive defaults (server)
cr0x@server:~$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
What it means: First probe after 2 hours. That’s not “keepalive”; that’s archaeology.
Decision: If you need kernel-level keepalives to help multiple services, lower these—but do it consciously and document the blast radius.
Task 9: Check whether the TCP session shows retransmits or stalls (ss)
cr0x@server:~$ ss -tinp '( sport = :22 )' | head -n 12
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 10.10.8.21:22 10.20.4.18:53144 users:(("sshd",pid=21903,fd=4))
cubic wscale:7,7 rto:204 rtt:0.356/0.112 ato:40 mss:1448 pmtu:1500 rcvmss:1392 advmss:1448 cwnd:10 bytes_sent:104925 bytes_acked:104901 bytes_received:20631 segs_out:161 segs_in:149 send 325Mbit/s lastsnd:2800 lastrcv:2800 lastack:2800
What it means: lastsnd/lastrcv in milliseconds shows how long since traffic. If you see retrans climbing or rto ballooning, you have a path quality problem, not just idle aging.
Decision: If retransmits spike near disconnect, investigate MTU/VPN/Wi‑Fi. Keepalives won’t save a blackholed path.
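If you want to watch the socket as it approaches death, a rough polling loop over the same ss fields works; the 5-second interval and the :22 filter are just a starting point:
cr0x@server:~$ while sleep 5; do
  date +"%T"
  ss -tin '( sport = :22 )' | grep -Eo 'rto:[0-9.]+|retrans:[0-9/]+|lastrcv:[0-9]+'
done
# retrans:* only appears once retransmissions have happened; a ballooning rto with a
# stale lastrcv usually means the path is gone, not merely idle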
Task 10: Look for MTU/PMTU trouble (ping with DF)
cr0x@server:~$ ping -M do -s 1472 -c 3 db-prod-1
PING db-prod-1 (10.10.8.21) 1472(1500) bytes of data.
1472 bytes from 10.10.8.21: icmp_seq=1 ttl=63 time=0.512 ms
1472 bytes from 10.10.8.21: icmp_seq=2 ttl=63 time=0.487 ms
1472 bytes from 10.10.8.21: icmp_seq=3 ttl=63 time=0.499 ms
--- db-prod-1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2040ms
What it means: This suggests a 1500-byte path works. If this fails but smaller sizes succeed, you may have PMTU blackholing (common with some VPNs or misconfigured tunnels).
Decision: If PMTU is suspect, fix the network/MTU; don’t paper over with keepalives.
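If the 1472-byte probe fails, a quick sweep shows roughly where the path gives up (ICMP adds 28 bytes of headers, so payload 1472 corresponds to a 1500-byte packet). A sketch:
cr0x@server:~$ for size in 1472 1452 1420 1400 1380 1350; do
  ping -M do -s "$size" -c 1 -W 2 db-prod-1 > /dev/null 2>&1 \
    && echo "payload $size passes (path MTU >= $((size + 28)))" \
    || echo "payload $size blocked"
done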
Task 11: Confirm if a firewall on the server is doing something “helpful”
cr0x@server:~$ sudo ufw status verbose
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), deny (routed)
New profiles: skip
To Action From
-- ------ ----
22/tcp ALLOW IN 10.20.0.0/16
What it means: UFW typically won’t drop established connections randomly, but heavy logging or additional nftables rules might. This is a sanity check, not a smoking gun.
Decision: If you see rate-limits, connection tracking exhaustion, or aggressive “recent” rules, dig into nftables counters and conntrack usage.
Task 12: Check conntrack pressure (state table exhaustion can look like random drops)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 198432
net.netfilter.nf_conntrack_max = 262144
What it means: If nf_conntrack_count rides close to max, established flows can get evicted or new flows fail. SSH symptoms: intermittent drops or inability to reconnect.
Decision: If you’re near the ceiling, fix the workload (too many short-lived flows), increase table size, or move stateful filtering to a device designed for it.
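While you’re there, it’s worth knowing how long this host (or any Linux-based firewall in the path) keeps established TCP flows in conntrack; the stock kernel default is generous, but appliances frequently override it downward:
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_tcp_timeout_established
# stock Linux default is 432000 seconds (5 days); if someone set this to 600,
# that is your idle window right there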
Task 13: See whether sshd is being restarted (and killing sessions)
cr0x@server:~$ systemctl status ssh --no-pager
● ssh.service - OpenBSD Secure Shell server
Loaded: loaded (/usr/lib/systemd/system/ssh.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 08:01:12 UTC; 2h 15min ago
Docs: man:sshd(8)
man:sshd_config(5)
Main PID: 1123 (sshd)
Tasks: 1 (limit: 18920)
Memory: 6.9M
CPU: 1.421s
CGroup: /system.slice/ssh.service
└─1123 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
What it means: If uptime is short and aligns with drops, you’re chasing the wrong thing: a restart kills sessions. Config management, unattended upgrades, or a watchdog could be involved.
Decision: If restarts correlate, fix the restart behavior first. Keepalive won’t outsmart a daemon restart.
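Two quick ways to correlate drops with restarts or package upgrades; the log paths are the stock Ubuntu locations, and unattended-upgrades may not be installed everywhere:
cr0x@server:~$ sudo journalctl -u ssh --no-pager | grep -Ei 'started|stopping|stopped' | tail -n 10
cr0x@server:~$ grep -i openssh /var/log/apt/history.log 2>/dev/null
cr0x@server:~$ ls -lt /var/log/unattended-upgrades/ 2>/dev/null | head -n 5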
Task 14: Check OpenSSH version and crypto policy (rare, but matters for renegotiation edge cases)
cr0x@server:~$ ssh -V
OpenSSH_9.6p1 Ubuntu-3ubuntu13, OpenSSL 3.0.13 30 Jan 2024
What it means: Ubuntu 24.04 ships a modern OpenSSH. That’s good. It also means old client/server interoperability assumptions may fail in edge environments.
Decision: If only specific old clients drop, test with updated clients or adjust host-specific algorithms, but don’t weaken global security to fix one antique.
Task 15: Prove it’s “idle” by generating low-impact traffic during a session
cr0x@server:~$ ssh ops@db-prod-1 'while true; do date +"%T"; sleep 120; done'
10:31:02
10:33:02
10:35:02
What it means: If this stays alive indefinitely while your interactive session dies when you stop typing, you’ve confirmed it’s an idle-timeout problem, not general instability.
Decision: Fix idle behavior with SSH keepalives or network policy changes; don’t waste time chasing CPU load, unless logs suggest otherwise.
Keepalive knobs that actually matter (client, server, TCP)
Client-side: ServerAliveInterval and ServerAliveCountMax
This is the most effective, least political fix because it only requires access to your own machine. It sends an SSH-level message over the encrypted channel periodically. If the server (or the path) is dead, the client will notice after a few missed replies and disconnect. That’s a feature: it fails fast instead of hanging for half an hour.
How it works:
- ServerAliveInterval sets how often (in seconds) the client sends a keepalive request when no data has been received.
- ServerAliveCountMax sets how many unanswered keepalives the client tolerates before giving up.
What it’s good for:
- Keeping NAT/firewall state warm (because it’s real traffic).
- Detecting dead sessions quickly and returning you to a prompt you can reconnect from.
What it’s not: It won’t preserve your terminal state across IP changes. If you roam networks, consider a tool designed for roaming (see FAQ). SSH keepalives are a seatbelt, not teleportation.
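The arithmetic is worth internalizing: detection time is roughly ServerAliveInterval × ServerAliveCountMax. You can also apply it to a single connection without touching any config file:
cr0x@server:~$ ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 ops@db-prod-1
# probes after every ~60s of silence; after 3 unanswered probes (~180s) the client
# gives up instead of hanging on a dead socket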
Server-side: ClientAliveInterval and ClientAliveCountMax
If you run the servers, server-side keepalives are about policy and hygiene. You’re telling the server to probe idle clients and evict dead sessions. It helps with:
- Cleaning up zombie sessions when clients disappear behind broken networks.
- Keeping state alive through middleboxes (same benefit as client-side, just initiated from the other end).
There’s a trade-off: too aggressive, and you’ll kill valid sessions in high-latency or temporarily congested paths. Too lax, and you’re back to mystery hangs.
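For known high-latency or lossy links, a more forgiving policy is a reasonable sketch (the numbers are a starting point, not doctrine). Use it instead of, not in addition to, the baseline drop-in shown later: sshd uses the first value it reads for each keyword, so keep a single keepalive drop-in.
cr0x@server:~$ sudo tee /etc/ssh/sshd_config.d/50-keepalive.conf > /dev/null <<'EOF'
# probe every minute, but tolerate ~10 minutes of missed replies before evicting
ClientAliveInterval 60
ClientAliveCountMax 10
EOF
cr0x@server:~$ sudo sshd -t && sudo systemctl restart ssh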
Kernel TCP keepalives: TCPKeepAlive and sysctls
OpenSSH has TCPKeepAlive (client and server). When enabled, it lets the kernel send TCP keepalive probes according to kernel timers.
Here’s the catch: kernel keepalive defaults are usually too slow, and some NAT/firewalls treat TCP keepalive probes differently from real application traffic. Also, TCP keepalives can keep a broken mapping “half-alive” long enough to confuse you. SSH-level keepalives are usually clearer and more adjustable per host.
When I actually recommend tuning kernel keepalives:
- You have multiple long-lived TCP services (not just SSH) suffering the same idle drops.
- You can’t rely on SSH config being applied consistently (fleet of clients, unmanaged laptops).
- You’re standardizing behavior in a controlled environment (servers, bastions, jump hosts).
Timeout economics: pick intervals based on the weakest middlebox
If a firewall times out idle TCP at 10 minutes, a 5-minute keepalive works. A 30-second keepalive also works, but it’s unnecessary chatter. Your target is “comfortably below the shortest idle timeout you can’t control.”
Rule of thumb that survives reality: 30–60 seconds is usually safe for hostile networks. 120 seconds is fine in well-managed data centers. Anything below 15 seconds is typically a sign you’re debugging by superstition.
Joke #1: If your SSH keepalive interval is 1 second, you’re not keeping the session alive—you’re just giving the firewall a job.
One quote to keep you honest
Paraphrased idea (Werner Vogels, reliability/operations context): “Everything fails; design assuming failure is normal.”
That’s the right mental model. Keepalives don’t prevent failure; they make failure predictable and detectable.
Opinionated recommended settings for Ubuntu 24.04
If you want one answer that works for most humans on most networks: enable SSH-level keepalives on the client, and optionally on the server for cleanup. Keep TCP keepalives enabled but don’t depend on them.
Client: set keepalive in ~/.ssh/config
Use a global default and override for fragile networks or high-latency links. Two cautions: the heredoc below overwrites any existing ~/.ssh/config (merge by hand if you already have one), and host-specific blocks must come before the Host * defaults, because ssh uses the first value it obtains for each option.
cr0x@server:~$ cat > ~/.ssh/config <<'EOF'
Host *.corp
  ServerAliveInterval 30
  ServerAliveCountMax 3

Host *
  ServerAliveInterval 60
  ServerAliveCountMax 3
  TCPKeepAlive yes
EOF
What it means: Every minute the client sends a keepalive; if it misses 3 replies (about 3 minutes), it disconnects. On corporate hosts, it’s every 30 seconds.
Decision: If you still see drops around a known idle timeout (e.g., 5 minutes), reduce interval to 20–30 seconds for that environment. If you see drops during travel/roaming, stop trying to brute-force it with keepalives and use a roaming-friendly approach (FAQ).
Server: set keepalive in /etc/ssh/sshd_config.d/
On Ubuntu 24.04, prefer drop-in config files over editing the monolith. Keep it readable and reversible.
cr0x@server:~$ sudo tee /etc/ssh/sshd_config.d/50-keepalive.conf > /dev/null <<'EOF'
ClientAliveInterval 60
ClientAliveCountMax 3
TCPKeepAlive yes
EOF
cr0x@server:~$ sudo sshd -t
cr0x@server:~$ sudo systemctl restart ssh
cr0x@server:~$ sudo systemctl status ssh --no-pager | head -n 8
● ssh.service - OpenBSD Secure Shell server
Loaded: loaded (/usr/lib/systemd/system/ssh.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 11:02:12 UTC; 4s ago
Docs: man:sshd(8)
man:sshd_config(5)
What it means: The server will probe idle clients. If the client can’t answer within ~3 minutes, sshd drops the session. That’s usually what you want for dead connections.
Decision: If you have known high-latency or intermittently congested links (satellite, overloaded VPN), bump ClientAliveCountMax or interval to avoid false positives.
Kernel TCP keepalive tuning (use sparingly)
If you must adjust kernel defaults, do it explicitly and persistently:
cr0x@server:~$ sudo tee /etc/sysctl.d/60-tcp-keepalive.conf > /dev/null <<'EOF'
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
EOF
cr0x@server:~$ sudo sysctl --system | tail -n 6
* Applying /etc/sysctl.d/60-tcp-keepalive.conf ...
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
What it means: Probing starts after 5 minutes idle, then every 30 seconds, up to 5 probes. That’s ~7.5 minutes to declare dead at the TCP layer.
Decision: Only do this if you understand the impact on all TCP connections on that host. On busy systems, it can increase keepalive traffic noticeably (still usually small, but “small” scales).
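To verify the kernel is actually arming keepalive timers on live SSH sockets after the change, check the socket timers; an idle sshd connection should show a keepalive countdown:
cr0x@server:~$ ss -tno state established '( sport = :22 )'
# look for timer:(keepalive,<countdown>,0) on each connection; with the sysctls above,
# the countdown starts from roughly 5 minutes after the last traffic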
What to avoid
- Don’t set keepalive intervals to 5 seconds globally. You’ll generate constant background traffic, keep state hot in every middlebox, and still fail during real path changes.
- Don’t rely on TCPKeepAlive yes alone with default sysctls. Two hours is not a keepalive strategy; it’s a bedtime story.
- Don’t “fix” drops by disabling encryption rekeying or weakening crypto. Session stability is not an excuse to time-travel back to 2006.
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “The firewall is stateful, so it will remember us”
A team ran a set of jump hosts for engineers to access production databases. Sessions “randomly” dropped—mostly during incident response, because of course. People blamed the jump hosts. Someone even patched the kernel “just in case,” which is an impressive way to lose time.
The wrong assumption was quiet and deadly: “Stateful firewall means it keeps state until the connection closes.” In practice, the firewall had an idle timeout for TCP established flows. If the SSH channel was quiet—reading logs in another tab, thinking, waiting—mappings aged out.
They tried increasing server-side limits, raising file descriptor ceilings, tweaking sshd logging. Nothing changed. The key clue was the clockwork pattern: disconnects clustered around a consistent idle window. Once they ran a controlled sleep 3600 test with and without ServerAliveInterval, it became embarrassingly clear.
The fix was not heroic: set client-side ServerAliveInterval 30 on the jump hosts (and recommended it for laptops), plus server-side ClientAliveInterval 60 to clean up zombies. The incident postmortem was short and slightly humiliating, which is the best kind.
2) The optimization that backfired: multiplexing everywhere
Another org standardized on SSH multiplexing because it makes repeated connections fast. ControlMaster + ControlPersist is great when you run lots of short commands (automation, fleet checks). The performance win is real.
Then the VPN team rolled out a new “smart” tunnel client that periodically renegotiated routes. The underlying TCP connection would stall or get orphaned. With multiplexing, engineers didn’t see “the connection dropped” immediately. They saw random failures: scp hanging, new ssh sessions failing instantly, terminals that accepted input but never produced output.
The optimization turned one failing socket into a shared dependency. One dead control connection could break ten tabs. The failure mode felt chaotic because people were opening “new sessions” that weren’t new at all.
The eventual fix was nuanced: keep multiplexing for automation and stable internal networks, but disable it (or keep ControlPersist short) for hosts accessed over flaky VPN paths. They also tightened ServerAliveInterval so the control socket died fast when the VPN hiccupped, allowing new connections to create a fresh socket.
3) The boring but correct practice that saved the day: explicit keepalive standards
A financial services shop had a policy: all bastions and admin workstations include a baseline SSH config with keepalives, and the servers enforce a matching policy. It was not glamorous. It was written down. It was deployed consistently.
One day, a network change introduced a shorter idle timeout on a set of segmentation firewalls. The fallout was almost nothing. Engineers noticed a few sessions dropped faster than before (because the keepalive detection was quicker), but the overall “random” disconnect rage never materialized.
What saved them was the boring alignment: client sends keepalives every 60 seconds; server times out dead clients in a few minutes; TCP keepalive sysctls were set to reasonable values on bastions. When the network got stricter, their sessions still generated enough periodic traffic to remain “not idle.”
Joke #2: The most reliable system is the one you can explain to an auditor without crying.
Common mistakes (symptom → root cause → fix)
1) Symptom: “Broken pipe” after ~10–30 minutes of inactivity
Root cause: Idle timeout in a NAT/firewall/load balancer. The mapping expires; next packet gets reset or black-holed.
Fix: Set ServerAliveInterval to 30–60 and ServerAliveCountMax to 3. If you manage servers too, set ClientAliveInterval similarly.
2) Symptom: Session hangs (no output), then eventually disconnects
Root cause: Path blackhole or asymmetric drop; TCP doesn’t immediately know the peer is unreachable. No application traffic occurs to reveal the failure.
Fix: Enable SSH keepalives so SSH detects lack of response and tears down the session. If it correlates with VPN/Wi‑Fi movement, treat it as path instability and consider a roaming tool.
3) Symptom: Multiple terminals die together; new SSH commands fail instantly
Root cause: ControlMaster multiplexing reusing a dead control socket, or one underlying TCP connection shared by many sessions.
Fix: Reduce ControlPersist, disable multiplexing for that host, and ensure keepalives are enabled so the control socket fails fast.
4) Symptom: Server logs show “Timeout, client not responding”
Root cause: Server-side ClientAliveInterval is enabled and firing, or the network path blocks replies long enough to trip the threshold.
Fix: Keep ClientAliveInterval but set a realistic count max (often 3). If clients are on high-latency links, increase count or interval.
5) Symptom: Drops coincide with sshd restarts or unattended upgrades
Root cause: Service restarts kill sessions. This is not a keepalive problem.
Fix: Control restart cadence, use maintenance windows, or keep bastions stable. Verify with journalctl and systemctl status.
6) Symptom: Only large outputs (or scp) fail; interactive typing works
Root cause: PMTU/fragmentation issues or a broken tunnel MTU. Small packets pass; larger ones get black-holed.
Fix: Validate with DF pings and fix MTU. Keepalives won’t fix packet blackholes.
7) Symptom: Reconnects fail intermittently; SSH sometimes can’t establish
Root cause: Conntrack table exhaustion on a firewall or the host; or rate-limiting rules.
Fix: Check nf_conntrack_count, inspect firewall policies, adjust conntrack sizing, and reduce abusive traffic patterns elsewhere.
8) Symptom: Works from one network, fails from another
Root cause: Different middlebox policies: corporate firewall, hotel Wi‑Fi captive portal, ISP CGNAT.
Fix: Use per-host or per-network SSH config blocks. If necessary, consider alternative transport (VPN/bastion) rather than fighting every captive network.
Checklists / step-by-step plan
Step-by-step: stabilize your own SSH sessions (client-only)
1) Measure the failure window. Run an idle test without keepalive and observe when it drops.
cr0x@server:~$ ssh -o ServerAliveInterval=0 ops@db-prod-1 'echo connected; sleep 1800'
connected
Write failed: Broken pipe
Decision: If it drops around a consistent time, you’re dealing with idle timeout.
2) Validate with a temporary keepalive.
cr0x@server:~$ ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ops@db-prod-1 'echo connected; sleep 1800; echo survived'
connected
survived
Decision: If it survives, bake it into ~/.ssh/config.
3) Apply config globally, override where needed.
cr0x@server:~$ ssh -G db-prod-1 | egrep -i 'serveraliveinterval|serveralivecountmax'
serveraliveinterval 60
serveralivecountmax 3
Decision: Confirm effective values are what you intended (Match blocks can surprise you).
4) If you roam networks, stop expecting TCP to survive IP changes. Use a tool designed for roaming or accept reconnects and use tmux/screen on the server (FAQ).
Step-by-step: standardize behavior on servers (fleet approach)
1) Check effective sshd config.
cr0x@server:~$ sudo sshd -T | egrep -i 'clientalive|tcpkeepalive'
clientaliveinterval 0
clientalivecountmax 3
tcpkeepalive yes
Decision: If clientaliveinterval is 0, you’re not probing idle clients.
2) Add a drop-in keepalive policy.
cr0x@server:~$ sudo tee /etc/ssh/sshd_config.d/50-keepalive.conf > /dev/null <<'EOF'
ClientAliveInterval 60
ClientAliveCountMax 3
TCPKeepAlive yes
EOF
3) Validate and restart.
cr0x@server:~$ sudo sshd -t
cr0x@server:~$ sudo systemctl restart ssh
Decision: If sshd -t fails, do not restart; fix syntax first.
4) Observe log changes after deployment.
cr0x@server:~$ sudo journalctl -u ssh -S "1 hour ago" --no-pager | egrep -i 'timeout|disconnect|closed' | tail -n 20
Dec 29 11:21:07 db-prod-1 sshd[24102]: Timeout, client not responding from 10.20.4.18 port 54119
Dec 29 11:21:07 db-prod-1 sshd[24102]: Disconnecting: Timeout, client not responding
Decision: If timeouts spike unexpectedly, your interval/count may be too aggressive for your environment.
Step-by-step: decide whether to tune kernel TCP keepalive
- Check current sysctls (tcp_keepalive_time especially).
- Inventory who shares the host. Lowering keepalive affects all TCP sockets (databases, agents, exporters).
- Change via /etc/sysctl.d/ and measure traffic/behavior.
- Roll back quickly if you see unexpected side effects (some applications already implement their own keepalives).
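One rollback trap: deleting the drop-in file does not undo values already applied to the running kernel. A rollback sketch that sets the stock defaults back explicitly:
cr0x@server:~$ sudo rm /etc/sysctl.d/60-tcp-keepalive.conf
cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_keepalive_time=7200 \
  net.ipv4.tcp_keepalive_intvl=75 net.ipv4.tcp_keepalive_probes=9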
FAQ
1) Should I use ServerAliveInterval or TCPKeepAlive?
Use ServerAliveInterval first. It’s SSH-level, configurable per host, and tends to keep NAT state alive reliably. Keep TCPKeepAlive yes but don’t rely on kernel defaults.
2) What values should I set for keepalives?
Start with ServerAliveInterval 60 and ServerAliveCountMax 3 on clients. For stricter networks, use 30 seconds. On servers, ClientAliveInterval 60 and ClientAliveCountMax 3 is a sane baseline.
3) Why does my session die even while I’m typing?
That’s usually not idle timeout. Look for packet loss, VPN route churn, Wi‑Fi roaming, or MTU issues. Check ss -tinp for retransmits and run DF pings to test MTU.
4) Does ClientAliveInterval kick users off?
It can, if set too aggressively. It’s meant to drop dead sessions, not punish slow links. If you see false disconnects, increase ClientAliveCountMax or interval.
5) Are keepalives bad for security?
Not inherently. They send minimal authenticated traffic. The real security question is whether you’re keeping sessions alive that should time out for policy reasons. If your org requires idle logout, keepalives may conflict with that intent.
6) I’m behind a corporate proxy/firewall that kills SSH. Will keepalive help?
Only if SSH connections are allowed but idle flows are aged out. If the network actively blocks SSH or performs deep inspection that tears down long-lived sessions, you may need a sanctioned bastion/VPN approach.
7) Should I tune Linux net.ipv4.tcp_keepalive_* on Ubuntu 24.04?
Only if you have a system-wide need. For SSH alone, prefer OpenSSH keepalives. If you tune kernel keepalives, document it and understand it affects all TCP connections.
8) My SSH session drops when I close my laptop lid. Is that keepalive-related?
Not really. Your laptop may suspend networking; the TCP connection goes stale. Keepalives can help detect the stale session faster, but they can’t keep a suspended NIC talking.
9) What about using tmux or screen?
Do it. Server-side terminal multiplexers don’t prevent disconnects, but they prevent lost work. Pair tmux with keepalives and you get both fewer drops and less pain when drops happen.
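A minimal attach-or-create pattern that pairs well with keepalives (the session name “main” is arbitrary):
cr0x@server:~$ ssh -t ops@db-prod-1 'tmux new-session -A -s main'
# -t allocates a TTY for tmux; -A attaches to "main" if it exists, otherwise creates it.
# After a drop, rerun the same command and your windows are where you left them.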
10) Is there a better tool than SSH for roaming networks?
Yes: roaming-friendly session tools exist specifically to survive IP changes and intermittent connectivity. Keepalives help SSH, but they don’t change TCP’s fundamentals.
Next steps you can do today
Here’s the production-grade path out of “SSH drops randomly”:
- Prove the failure mode. Measure whether it’s idle-timeout or path instability using an idle sleep test and verbose logs.
- Enable client-side SSH keepalives. Set ServerAliveInterval 60 (30 on hostile networks) and ServerAliveCountMax 3.
- If you run servers, add server-side keepalives. Use a drop-in file with ClientAliveInterval 60 and ClientAliveCountMax 3 to clean up dead sessions.
- Only then consider kernel keepalive tuning. It has broader impact, is sometimes necessary, and is often overused.
- Reduce surprise. If multiplexing is turning one dead socket into many broken sessions, scope it carefully.
If you do those in order, you’ll stop treating SSH disconnects as weather and start treating them as what they are: a timeout policy meeting an idle TCP flow. Fixable. Predictable. Occasionally still annoying—because networks—but at least no longer mystical.