You don’t notice a remote filesystem is fragile until the exact moment it becomes your entire incident. The build job stalls at “copying artifacts…”, shells stop responding when you cd, and your monitoring graph looks fine because your monitoring never tried to touch the mount.
Debian 13 makes it easy to mount almost anything. It does not make it easy to mount anything well. If you want “won’t randomly hang” as a non-negotiable property, you have to pick the right protocol and configure failure semantics on purpose.
The decision: which one won’t randomly hang?
If your requirement is “a mount that won’t randomly hang in production,” the honest answer is:
- Pick NFSv4.1+ for production shared storage (services, CI artifacts, home directories, container volumes where appropriate), and configure timeouts/retrans/automount semantics so failure is bounded.
- Use SSHFS for tooling: quick ad-hoc access, one-off migrations, debugging, “I need that directory now,” or limited developer workflows—but treat it like a convenience layer, not a storage platform.
Why? Because “random hangs” are usually not random. They’re blocking semantics meeting a flaky network, a dead server, or a state mismatch. NFS has decades of operational knobs to define what “down” means and how clients should behave. SSHFS is a FUSE filesystem backed by an interactive protocol (SFTP) that wasn’t designed to provide strict, low-latency POSIX semantics under failure.
That said: NFS can also hang your processes. It’s just that with NFS you can choose how it fails and make it observable. With SSHFS, you spend more time reverse-engineering why a FUSE request is wedged behind a stuck SSH channel.
Rule of thumb: if you’d be annoyed by a stuck ls at 3 a.m., don’t build your system around SSHFS. If you’d be annoyed by a stuck ls at 3 a.m., also don’t deploy NFS without explicit mount options. Welcome to storage.
One quote to keep you honest: “Hope is not a strategy.” — Rick Page
Joke #1: SSHFS is like a pocketknife. Great until you try to remodel your house with it.
Facts and history you can use in arguments
Short, concrete context points that explain why these tools behave the way they do:
- NFS dates to 1984 (Sun Microsystems). It was built for shared Unix filesystems on LANs, with a long history of operational tuning and enterprise expectations.
- NFSv4 moved from “RPC soup” to a more integrated protocol, reducing reliance on separate daemons/ports compared to NFSv3, and adding stateful features like delegations and better locking.
- NFS “hard mount” semantics are intentionally stubborn: the client will retry I/O because failing a write can corrupt application-level assumptions. That “hang” is a feature until it isn’t.
- SSHFS is a FUSE filesystem. That means every filesystem operation crosses into userspace, and performance/latency behavior depends heavily on the FUSE daemon and its request handling.
- SSHFS uses SFTP, which is not “filesystem native.” It’s a file-transfer protocol adapted to appear like a filesystem, which creates mismatches around metadata operations and locking.
- FUSE can create unkillable-looking processes when the kernel is waiting on userspace responses. You can kill -9 the process; the kernel still waits if the request is blocked.
- NFS has well-known “stale file handle” behavior when exported directories change underneath the client (e.g., server-side fs replaced, failover gone wrong). It’s a classic, not a novelty.
- Systemd’s automount units changed the game for network mounts: you can avoid boot hangs and bound failures by mounting on access rather than at boot.
What “hang” really means on Linux
People say “the mount hung.” Linux usually means one of these:
1) Your process is in uninterruptible sleep (D state)
Classic for NFS hard mounts, and also possible with FUSE when the kernel is waiting for a response from the FUSE daemon. If it’s truly D-state, the process won’t die until the underlying I/O resolves or the kernel gives up (often: it doesn’t).
2) Your shell is fine, but any path under the mount blocks
This is the stealthiest failure mode. Monitoring checks the host load and disk space and shrugs. Meanwhile, anything that touches /mnt/share is a dead stop.
3) “Hang” is actually DNS or auth
Mount helpers can block on DNS reverse lookups, Kerberos ticket acquisition, or contacting rpcbind/mountd (NFSv3). It looks like storage. It’s often name resolution.
4) “Hang” is the network path, not the protocol
MTU mismatch, asymmetric routing, conntrack exhaustion, firewall dropping fragments, a flaky ToR, or a VPN with aggressive idle timeouts. SSH survives some of these better than NFS because it rides one TCP stream. Or worse: it survives by stalling forever.
Operationally, you want two things:
- Bounded failure: requests time out and callers get errors rather than infinite stalls.
- Good observability: when it fails, you can see why in logs/metrics and reproduce with simple tools.
SSHFS on Debian 13: where it shines, where it traps you
SSHFS is seductively simple. You already have SSH. Fire a command, get a mount. No server daemon to tune (beyond sshd), no exports, no idmap, no firewall committee meeting.
And then it starts “randomly hanging.” Here’s why that happens and what to do about it.
SSHFS reliability model (what you’re really buying)
- Single encrypted TCP session (SSH) carrying SFTP requests.
- Userspace filesystem (FUSE) translating POSIX-ish operations into SFTP actions.
- Failure often presents as “blocked syscall” because the kernel is waiting for FUSE, and FUSE is waiting for SSH, and SSH is waiting for the network.
SSHFS is great when the workflow tolerates occasional stalls, or when “it’s down” means “I’ll try again” and nothing is serving user traffic.
Common SSHFS hang triggers
- Idle TCP sessions killed by NAT/VPN: the SSH channel is still “up” from the client’s perspective but blackholed on the path.
- Server-side overload: sshd and SFTP subsystem are CPU-bound (encryption, compression, or too many sessions).
- Metadata-heavy workloads: git status, recursive finds, dependency installs. SSHFS tends to amplify latency and RTT sensitivity.
- FUSE request backlog: userspace daemon stuck; kernel waits; processes pile up.
How to run SSHFS without self-sabotage
On Debian 13, assume you’re using sshfs with FUSE3. The tactics:
- Use SSH keepalives so dead paths are detected.
- Use automount so a dead server doesn’t block boot or services that don’t need it.
- Accept that caching is tricky. Over-caching improves speed and increases “stale view” risk; under-caching can turn a repo into a latency benchmark.
- Don’t pretend locking is perfect. If you need correct multi-client POSIX behavior, SSHFS is the wrong tool.
Joke #2: The fastest way to find an SSHFS problem is to announce you’re “just going to use it for production—temporarily.”
NFS on Debian 13: production-grade, if you respect it
NFS gets a bad reputation from two sources: people who used NFSv3 across unreliable networks, and people who left defaults untouched and then acted surprised when the default behavior occurred.
NFS reliability model
NFS clients make remote I/O look local. When the server or path fails, the client decides what to do based on mount options:
- Hard mounts: keep retrying. Great for data integrity, dangerous for availability if you don’t isolate the blast radius.
- Soft mounts: give up and return an error. Better for interactive tools and “best effort” reads; risky for writes.
- Timeouts and retrans: define how long you wait and how aggressively you retry.
- Systemd automount: only mount when accessed, and avoid boot-time deadlocks.
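A minimal sketch of what those knobs look like as one-off mount commands (the read-only export and its mount point are placeholders; timeo is in tenths of a second):
cr0x@server:~$ sudo mount -t nfs4 -o vers=4.2,proto=tcp,hard,timeo=600,retrans=2 nfs01:/exports/build /mnt/build
cr0x@server:~$ sudo mount -t nfs4 -o vers=4.2,proto=tcp,soft,timeo=100,retrans=2,ro nfs01:/exports/readonly /mnt/readonly
The hard variant retries until the server answers; the soft variant gives up after roughly timeo × retrans and hands the caller an I/O error, which is only acceptable here because the mount is read-only and non-critical.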
What “random hang” looks like in NFS land
It’s almost always one of these:
- Hard mount + server unreachable ⇒ tasks in D state, potentially piling up.
- Locking / state recovery issues after server reboot ⇒ hangs or I/O errors depending on version and workload.
- DNS/idmap mismatch ⇒ permissions weirdness that looks like application bugs.
- Firewall/NAT meddling ⇒ intermittent stalls under load (especially with NFSv3’s extra services).
What to deploy by default
If you want the sane baseline for Debian 13 clients:
- NFSv4.1 or newer (4.2 if your server supports it), TCP.
- Systemd automount for anything non-essential at boot; for essential mounts, be explicit about dependencies and timeouts.
- Hard mounts for write-heavy critical data, but isolate them (separate service, separate mount namespace if needed) and monitor for stuck I/O.
- Soft mounts only for read-mostly, non-critical access, and only if your application can tolerate I/O errors.
Bounded failure with NFS doesn’t mean “never hard.” It means: hard where correctness matters, and surrounded by guardrails so it doesn’t freeze your whole host.
Practical tasks: commands, outputs, decisions (12+)
These are the checks I actually run when a “random hang” report lands. Each task includes what the output implies and what decision you make next.
Task 1 — Identify what’s mounted and how
cr0x@server:~$ findmnt -t nfs,nfs4,fuse.sshfs
TARGET SOURCE FSTYPE OPTIONS
/mnt/build nfs01:/exports/build nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,proto=tcp,sec=sys,clientaddr=10.0.2.10
/mnt/tools dev@jump:/srv/tools fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000
Meaning: you can see protocol, version, and whether NFS is hard/soft. SSHFS shows up as fuse.sshfs.
Decision: if NFS is hard and the mount is on a path used by critical system services (like /var subpaths), you prioritize blast-radius reduction (automount or service isolation).
Task 2 — Verify NFS version negotiation from the client
cr0x@server:~$ nfsstat -m
/mnt/build from nfs01:/exports/build
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.10
Meaning: this confirms what the kernel is actually using, not what you think you configured.
Decision: if it says vers=3 and you expected v4, you investigate server support, mount options, and firewall constraints (v3 is more fragile across networks).
Task 3 — Detect “stuck” processes and whether it’s D-state
cr0x@server:~$ ps -eo pid,state,comm,wchan:32,args | awk '$2 ~ /D/ {print}'
23144 D rsync nfs_wait_on_request rsync -a /mnt/build/ /var/cache/build/
Meaning: D-state with nfs_* wait channels screams “blocked on NFS I/O.” If it’s fuse_wait, think SSHFS/FUSE.
Decision: stop restarting apps blindly. You now know it’s kernel I/O wait; focus on network/server path, not the application.
Task 4 — Check kernel logs for NFS client errors
cr0x@server:~$ journalctl -k -n 50 --no-pager
Dec 30 10:12:03 server kernel: NFS: server nfs01 not responding, still trying
Dec 30 10:12:33 server kernel: NFS: server nfs01 not responding, still trying
Dec 30 10:13:04 server kernel: NFS: server nfs01 OK
Meaning: transient server/path issues. If you see “still trying” repeatedly, the client is stuck in retry loops (hard mount behavior).
Decision: if this happens frequently, you tune timeouts and investigate network drops; you don’t “fix” it by adding more application retries.
Task 5 — Confirm the server export from the client side
cr0x@server:~$ showmount -e nfs01
Export list for nfs01:
/exports/build 10.0.2.0/24
Meaning: validates the server is exporting and your network is permitted. (Mostly relevant to NFSv3; still useful for sanity checks.)
Decision: if it fails or hangs, suspect firewall/rpc services, or DNS misdirection.
Task 6 — Validate name resolution and reverse DNS (yes, it matters)
cr0x@server:~$ getent hosts nfs01
10.0.2.20 nfs01
cr0x@server:~$ getent hosts 10.0.2.20
10.0.2.20 nfs01
Meaning: forward and reverse resolution are consistent. Inconsistent resolution causes weird auth, delays, and “it hangs only sometimes” behavior.
Decision: if reverse lookup is slow or wrong, fix DNS or add sane entries in the appropriate resolver stack. Don’t accept 5-second DNS stalls on every mount path.
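If you want a number instead of a vibe, time the lookups (a quick sketch; run it a few times, since intermittent stalls are the usual culprit):
cr0x@server:~$ time getent hosts nfs01
cr0x@server:~$ time getent hosts 10.0.2.20
Anything that repeatedly lands in the hundreds of milliseconds, or at a suspiciously round number like 5 seconds, points at the resolver, not the storage.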
Task 7 — Measure basic RTT and packet loss to the server
cr0x@server:~$ ping -c 5 nfs01
PING nfs01 (10.0.2.20) 56(84) bytes of data.
64 bytes from 10.0.2.20: icmp_seq=1 ttl=64 time=0.41 ms
64 bytes from 10.0.2.20: icmp_seq=2 ttl=64 time=0.38 ms
64 bytes from 10.0.2.20: icmp_seq=3 ttl=64 time=0.43 ms
64 bytes from 10.0.2.20: icmp_seq=4 ttl=64 time=0.40 ms
64 bytes from 10.0.2.20: icmp_seq=5 ttl=64 time=0.39 ms
--- nfs01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4090ms
rtt min/avg/max/mdev = 0.38/0.40/0.43/0.02 ms
Meaning: basic connectivity looks fine, but ping isn’t proof NFS/SSH won’t stall. Still, loss here is a red flag.
Decision: if packet loss exists, fix the network first. Storage tuning is pointless on a lossy path.
Task 8 — Check for MTU or path weirdness with tracepath
cr0x@server:~$ tracepath nfs01
1?: [LOCALHOST] pmtu 1500
1: nfs01 0.439ms reached
Resume: pmtu 1500 hops 1 back 1
Meaning: PMTU is sane and path is simple. If PMTU drops unexpectedly, you can see fragmentation risk.
Decision: if PMTU is lower than expected (e.g., 1400 on a LAN), verify overlays/VPNs and align MTU to avoid intermittent stalls.
Task 9 — Verify SSH keepalive settings for SSHFS scenarios
cr0x@server:~$ ssh -G jump | egrep 'serveraliveinterval|serveralivecountmax|tcpkeepalive'
tcpkeepalive yes
serveraliveinterval 15
serveralivecountmax 3
Meaning: SSH will detect dead paths in ~45 seconds. Without this, SSH can sit on a dead TCP session for a long time depending on the network.
Decision: if these are missing or large, add them to your SSHFS invocation or SSH config to make failures fail.
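A sketch of the client-side config, assuming the jump host alias from the examples above; put it in ~/.ssh/config (or under /etc/ssh/ssh_config.d/ if you want it host-wide):
cr0x@server:~$ tee -a ~/.ssh/config >/dev/null <<'EOF'
Host jump
    ServerAliveInterval 15
    ServerAliveCountMax 3
    TCPKeepAlive yes
EOF
With those values, SSH declares a dead path dead after roughly 45 seconds instead of waiting for the TCP stack to give up on its own schedule.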
Task 10 — Reproduce the “hang” with a single syscall and a timeout wrapper
cr0x@server:~$ timeout 5s stat /mnt/build/. || echo "stat timed out"
stat timed out
Meaning: this proves the hang is in filesystem access, not the calling program. Using timeout makes it visible and scriptable.
Decision: if stat hangs but ping is fine, you suspect server-side NFS thread starvation, firewall state issues, or client-side stuck RPCs.
Task 11 — Inspect live NFS RPC activity (client side)
cr0x@server:~$ nfsstat -rc
Client rpc stats:
calls retrans authrefrsh
124981 317 0
Client nfs v4:
null read write commit open open_conf
3 20811 14402 0 9812 14
Meaning: retrans increasing rapidly suggests packet loss or server slowness. A small stable retrans count during normal operation is fine; spikes are not.
Decision: if retrans climbs during incidents, treat it like a network/server capacity problem, not “NFS is bad.”
Task 12 — Confirm systemd ordering and whether boot is at risk
cr0x@server:~$ systemctl show -p After -p Wants remote-fs.target
After=network-online.target
Wants=network-online.target
Meaning: your remote mounts are tied to network-online.target. If network-online is flaky, boot can be slow or stuck.
Decision: for non-critical mounts, switch to systemd automount so the system boots even if storage is down.
Task 13 — Spot SSHFS/FUSE backpressure and blocked requests
cr0x@server:~$ ps -eo pid,state,comm,wchan:32,args | egrep 'sshfs|fuse' | head
18211 S sshfs wait_woken sshfs dev@jump:/srv/tools /mnt/tools -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3
Meaning: not D-state yet, but you can correlate timing with a stall report. If you see many processes stuck in fuse_wait or similar, SSHFS is wedged.
Decision: if it wedges regularly under load, stop using SSHFS for that workload. This is a design mismatch, not a tuning problem.
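When it does wedge and you need the mount point back, a lazy unmount detaches it from the namespace so new accesses stop piling up behind it (a recovery sketch; it will not unstick requests already in flight):
cr0x@server:~$ fusermount3 -u -z /mnt/tools
cr0x@server:~$ pkill -f 'sshfs dev@jump:/srv/tools' || true
If fusermount3 refuses, sudo umount -l /mnt/tools performs the same detach from the mount table.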
Task 14 — Confirm NFS server is reachable on the right port (v4)
cr0x@server:~$ nc -vz nfs01 2049
Connection to nfs01 (10.0.2.20) 2049 port [tcp/nfs] succeeded!
Meaning: TCP/2049 is open. NFSv4 primarily needs this. If it fails intermittently, suspect firewall state or load balancers doing “helpful” things.
Decision: if 2049 is blocked, stop. No mount option will negotiate around “port closed.” Fix network policy.
Task 15 — Verify actual mount options from /proc (ground truth)
cr0x@server:~$ grep -E ' /mnt/build | /mnt/tools ' /proc/mounts
nfs01:/exports/build /mnt/build nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.10 0 0
dev@jump:/srv/tools /mnt/tools fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0
Meaning: this is what the kernel uses. If you think you set an option and it’s not here, it didn’t apply.
Decision: correct the mount unit/fstab/command and remount.
Fast diagnosis playbook
When someone says “the remote mount hung,” you can burn an hour guessing. Or you can do this in five minutes and look like you planned it.
First: determine if it’s NFS or SSHFS, and whether anything is in D-state
- Run: findmnt -t nfs,nfs4,fuse.sshfs and ps ... | awk '$2 ~ /D/'
- If D-state exists: it’s kernel I/O waiting. Restarting apps won’t fix it. Move to network/server checks.
- If not D-state: it might still be slow/stalled in userspace (SSHFS) or name resolution/auth.
Second: check the simplest proof of life for the exact mount path
- Run: timeout 5s stat /mnt/whatever
- If stat hangs: you’ve reproduced with one syscall. Now you can correlate with logs, RPC stats, and network.
- If stat works: your “hang” is probably at higher layers (app lock, huge directory traversal, slow metadata).
Third: isolate the domain—network, server, client, or semantics
- Network: ping, tracepath, check retrans in nfsstat -rc, check port 2049 with nc.
- Server: look for “server not responding” in kernel logs; if you have access, check NFS server thread saturation and disk latency (outside scope here, but you know the drill).
- Client configuration: confirm mount options in /proc/mounts and what systemd is doing at boot.
- Semantics mismatch: if the workload is metadata-heavy and you’re on SSHFS, stop. If you require strict failure semantics but mounted NFS hard on a critical path, redesign.
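If you want the first two checks as one paste-able tool, here’s a minimal sketch (the script name and default mount point are placeholders):
cr0x@server:~$ sudo tee /usr/local/bin/mount-triage >/dev/null <<'EOF'
#!/bin/sh
# Quick triage for a suspect remote mount: what's mounted, who's in D-state, does one syscall answer.
MNT="${1:-/mnt/build}"
findmnt -t nfs,nfs4,fuse.sshfs
ps -eo pid,state,comm,wchan:32,args | awk '$2 ~ /D/'
if timeout 5s stat "$MNT/." >/dev/null 2>&1; then
  echo "OK: $MNT answered within 5s"
else
  echo "STUCK: $MNT did not answer within 5s"
fi
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/mount-triage
cr0x@server:~$ mount-triage /mnt/build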
Common mistakes (symptoms → root cause → fix)
1) Symptom: “The whole host is frozen,” but only some commands hang
Root cause: shell or system service touched a hard-mounted NFS path; processes are stuck in D-state waiting for I/O.
Fix: use x-systemd.automount for non-critical mounts, and avoid putting NFS under paths used by core services. For critical mounts, isolate services and monitor for NFS stalls.
2) Symptom: SSHFS mount works, then “randomly” freezes after idle
Root cause: NAT/VPN idle timeout blackholes the TCP session; SSH doesn’t detect it quickly without keepalives.
Fix: set ServerAliveInterval and ServerAliveCountMax (client-side), and consider TCP keepalive. Prefer automount so it’s re-established on access.
3) Symptom: NFS mounts sometimes take 30–90 seconds, sometimes instant
Root cause: DNS reverse lookup delays or intermittent resolver issues; or trying NFS versions/ports that time out before falling back.
Fix: fix DNS consistency; specify vers=4.2 explicitly; ensure TCP/2049 is permitted end-to-end.
4) Symptom: “Permission denied” on NFSv4 despite matching UID/GID
Root cause: idmapping/domain mismatch or exporting with root_squash assumptions that don’t match client identity model.
Fix: align idmap domain settings if you rely on name-based mapping, or use consistent numeric IDs across systems. Verify export options server-side.
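If you rely on name-based mapping, the domain lives in /etc/idmapd.conf and has to match on client and server; a minimal sketch of the stanza that matters (example.internal is a placeholder):
[General]
Domain = example.internal
After changing it, remount or clear the client’s idmap keyring cache (nfsidmap -c) so stale mappings don’t linger.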
5) Symptom: “Stale file handle” after maintenance/failover
Root cause: server-side filesystem replaced/moved; clients hold references to old filehandles.
Fix: remount the filesystem; avoid replacing exported directory trees without coordinated client remount or proper HA semantics.
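The remount itself is boring, which is the point; assuming the mount is defined in fstab or a mount unit, a lazy unmount followed by a fresh mount gives new opens clean handles (processes still holding old handles keep their errors):
cr0x@server:~$ sudo umount -l /mnt/build
cr0x@server:~$ sudo mount /mnt/build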
6) Symptom: CI jobs crawl on SSHFS; CPU usage spikes on both ends
Root cause: metadata-heavy operations over encrypted SFTP + FUSE context switching; optional compression making it worse.
Fix: use NFS for build artifacts; if you must use SSHFS, disable compression for LAN and avoid using it for hot paths.
7) Symptom: Boot hangs waiting for remote mounts
Root cause: mounts in /etc/fstab without automount; network-online target not actually online; hard mount waiting.
Fix: use nofail, x-systemd.automount, and x-systemd.mount-timeout=; ensure the network-online service is correctly implemented for your network stack.
8) Symptom: NFS works, but some apps behave oddly (locks, partial writes, weird errors)
Root cause: using soft for workloads that assume writes never fail midstream; or mixing old NFS versions with lock expectations.
Fix: use hard for write-critical workloads; ensure NFSv4 and correct lock/state handling; redesign apps to handle I/O errors if you insist on soft.
Checklists / step-by-step plan
Plan A: You need a reliable shared filesystem (most production cases)
- Choose NFSv4.2 over TCP unless you have a compelling reason not to.
- Decide failure semantics per mount:
- Write-critical: hard, but isolate.
- Read-mostly and non-critical: consider soft with short timeouts.
- Use systemd automount for anything not required for boot.
- Set explicit timeouts and retrans instead of trusting defaults you haven’t tested.
- Monitor for “server not responding” messages and retrans spikes.
- Document the blast radius: which services touch the mount? Which nodes? What happens if it stalls?
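The monitoring item above fits in a few lines of shell; a minimal sketch you could run from cron or wrap in a metrics exporter (the script path and the 10-minute window are placeholders, and the awk assumes the nfsstat layout shown in Task 11):
cr0x@server:~$ sudo tee /usr/local/bin/nfs-health >/dev/null <<'EOF'
#!/bin/sh
# Emit two numbers: recent "not responding" kernel messages and total client RPC retransmissions.
NOT_RESPONDING=$(journalctl -k --since "10 minutes ago" --no-pager | grep -c 'not responding')
RETRANS=$(nfsstat -rc | awk '/^[0-9]/ {print $2; exit}')
echo "nfs_not_responding_10m=${NOT_RESPONDING} nfs_client_retrans_total=${RETRANS}"
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/nfs-health
cr0x@server:~$ nfs-health
nfs_not_responding_10m=0 nfs_client_retrans_total=317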
Plan B: You just need secure ad-hoc access (developer tooling, migrations)
- Use SSHFS with keepalives and reconnect behavior.
- Avoid metadata-heavy workflows (large repos, language package managers) on SSHFS when you care about speed or stability.
- Mount on-demand with systemd or manual commands, not as a critical boot dependency.
- Prefer rsync/scp for bulk transfers when you don’t need a mounted view.
Debian 13 configuration snippets you can actually ship
NFS client packages and a sane fstab entry (with automount)
cr0x@server:~$ sudo apt-get update
...output...
cr0x@server:~$ sudo apt-get install -y nfs-common
...output...
cr0x@server:~$ sudo mkdir -p /mnt/build
cr0x@server:~$ sudo sh -c 'printf "%s\n" "nfs01:/exports/build /mnt/build nfs4 rw,vers=4.2,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,x-systemd.automount,x-systemd.idle-timeout=300,noatime,nofail,x-systemd.mount-timeout=30s 0 0" >> /etc/fstab'
What this buys you: the host boots even if NFS is down (nofail), the mount occurs on access (automount), and if it can’t mount quickly, it fails fast rather than blocking forever at boot.
SSHFS with keepalives and reconnect (manual mount)
cr0x@server:~$ sudo apt-get install -y sshfs
...output...
cr0x@server:~$ sudo mkdir -p /mnt/tools
cr0x@server:~$ sshfs dev@jump:/srv/tools /mnt/tools -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=/home/dev/.ssh/id_ed25519,allow_other
Operational note: allow_other requires user_allow_other in /etc/fuse.conf. Use it only when you mean it; otherwise keep mounts user-private.
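Enabling it is a one-liner and a host-wide decision, so make it on purpose (check with grep first so you don’t append a duplicate):
cr0x@server:~$ echo 'user_allow_other' | sudo tee -a /etc/fuse.conf
user_allow_other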
Systemd automount for SSHFS (so it doesn’t wedge boot)
When SSHFS is used by a service, I prefer systemd units so you can set timeouts and restart behavior.
cr0x@server:~$ sudo tee /etc/systemd/system/mnt-tools.mount &>/dev/null <<'EOF'
[Unit]
Description=SSHFS tools mount
After=network-online.target
Wants=network-online.target
[Mount]
What=dev@jump:/srv/tools
Where=/mnt/tools
Type=fuse.sshfs
Options=_netdev,reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=/home/dev/.ssh/id_ed25519,allow_other
[Install]
WantedBy=multi-user.target
EOF
...output...
cr0x@server:~$ sudo tee /etc/systemd/system/mnt-tools.automount &>/dev/null <<'EOF'
[Unit]
Description=Automount SSHFS tools
[Automount]
Where=/mnt/tools
TimeoutIdleSec=300
[Install]
WantedBy=multi-user.target
EOF
...output...
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl enable --now mnt-tools.automount
Created symlink /etc/systemd/system/multi-user.target.wants/mnt-tools.automount → /etc/systemd/system/mnt-tools.automount.
Decision: if the SSHFS server is down, accesses to /mnt/tools will attempt the mount and fail within systemd’s timeout constraints, rather than wedging boot.
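To confirm the automount is armed without triggering it, list the automount units; “waiting” means systemd is watching the path and will mount on first access (output shown is illustrative):
cr0x@server:~$ systemctl list-units --type=automount --no-pager
UNIT                LOAD   ACTIVE SUB     DESCRIPTION
mnt-tools.automount loaded active waiting Automount SSHFS tools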
Three corporate mini-stories from the trenches
Story 1: The incident caused by a wrong assumption (“it’s just a mount”)
A mid-sized company ran Debian on build runners. Someone needed fast access to a legacy artifact store that only had SSH open through a hardened jump host. SSHFS was the obvious shortcut: no firewall request, no server changes, and it worked in the first five minutes.
They wired the SSHFS mount into /etc/fstab on the runners because the CI job expected /mnt/artifacts to exist. It was “temporary,” which in enterprise time means “until the next reorg.” For weeks it seemed fine, because the network path was stable and the jump host was underutilized.
Then a VPN change introduced an idle timeout. SSH sessions that sat quiet for a while didn’t get torn down cleanly; they got blackholed. SSHFS calls began to block. The CI workers weren’t down; they were worse. They were up, burning executor slots on jobs that stalled during tar extraction.
The wrong assumption was that a mount is like a local directory with a different latency profile. It isn’t. It’s a dependency with failure modes. When it fails, it tends to fail inside syscalls where your application cannot “catch” it.
The fix wasn’t heroic: remove SSHFS from fstab, add systemd automount with keepalives, and move the artifact store to NFS behind the firewall where it belonged. The more important change was cultural: remote mounts got treated like databases—owned, monitored, and reviewed.
Story 2: The optimization that backfired (bigger is not always better)
Another team had an NFS share used by analytics jobs. They saw throughput issues and did what everyone does: increased rsize and wsize, enabled aggressive client-side caching behavior, and declared victory when a single benchmark looked better.
It wasn’t malicious. It was normal. One person ran a sequential read test and got a nice number. Then the workload shifted: thousands of small files, lots of metadata operations, and multiple clients. Suddenly the “fast” configuration produced long stalls during peak hours.
The backfire came from two angles. First, bigger I/O sizes didn’t help metadata-heavy workloads and sometimes made recovery slower when packets were lost (more data per retrans). Second, the team’s automount/timeout semantics were still default-ish, so when the server got busy, clients piled up retries and everything felt frozen.
They fixed it by treating tuning as workload-specific. They kept large I/O sizes for the sequential data path but separated the metadata-heavy directory trees to different exports and enforced sensible client timeouts. The biggest improvement came from making failures visible and bounded, not from squeezing another 5% out of throughput.
Story 3: The boring, correct practice that saved the day (and nobody got a trophy)
A regulated environment ran NFS home directories. Everyone was scared of “NFS hangs,” so the temptation was to go soft mounts everywhere. But the ops team did something aggressively dull: they classified mounts by correctness requirements and set policies accordingly.
Home directories were mounted hard because losing writes is unacceptable. But they were also mounted with systemd automount and kept away from boot-critical paths. User sessions would block if the filer died, but the host wouldn’t become unmanageable. That distinction matters when you’re trying to fix the problem.
They also had a routine weekly check: sample stat latency, monitor kernel log rate for “not responding,” and alert on NFS retrans spikes. Nothing fancy. No buzzwords. Just small checks that detect “this is getting weird” before it becomes a fire.
When a network maintenance caused brief packet loss, they saw retrans climb and “server not responding” messages appear. Because they had baselines, they could say, “This is abnormal and it correlates with the change window.” The fix happened quickly, and user impact stayed limited.
The best part: the runbook worked even for new on-call engineers. The system didn’t rely on one wizard with a memory of the 2019 outage.
FAQ
1) Which is faster on Debian 13: SSHFS or NFS?
For most real workloads, NFS is faster and more predictable, especially under concurrency and metadata-heavy access. SSHFS adds encryption overhead, userspace crossings, and SFTP translation costs.
2) Which is more secure by default?
SSHFS inherits SSH’s encryption and authentication, so it’s often easier to deploy securely over untrusted networks. NFS can be secure, but you must design it: network isolation, firewalling, and (if needed) Kerberos.
3) What actually causes “random hangs” on NFS?
Most commonly: hard mounts retrying during server/path issues, network packet loss causing retrans storms, or state recovery/locking pain during server restarts. The fix is usually better failure semantics plus fixing the underlying network/server issue.
4) Can SSHFS be made “non-hanging”?
You can reduce the probability by using keepalives and automount, but you can’t change the fact that FUSE + SFTP can block syscalls when the channel stalls. If “never hang” is your bar, don’t put SSHFS on critical paths.
5) Should I use NFS soft mounts to prevent hangs?
Only for read-mostly, non-critical workloads where your application can handle I/O errors. Soft mounts trade “hang forever” for “fail sometimes,” and failing writes midstream can cause subtle corruption at the application layer.
6) Is systemd automount better than autofs?
For many Debian 13 setups, systemd automount is simpler and integrates well with boot ordering and timeouts. Autofs is still valid when you need advanced map logic or legacy behavior.
7) What mount options matter most for preventing host-wide pain?
x-systemd.automount (avoid boot dependency and bound access), nofail, and explicit timeo/retrans so you know what “down” means. Also: don’t mount remote filesystems under paths core services depend on unless you truly mean it.
8) Why does df hang when the mount is broken?
df queries filesystem stats for all mounts. If one remote mount blocks, df blocks. Use df -l for local-only, or target specific filesystems carefully.
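Both escapes are one flag away; -x excludes by filesystem type and pairs well with the types used throughout this article:
cr0x@server:~$ df -hl
cr0x@server:~$ df -h -x nfs4 -x fuse.sshfs
Put one of these in your shell habits and a broken mount stops taking df hostage.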
9) Is NFS okay over Wi‑Fi or the public internet?
Over Wi‑Fi: it can work, but expect variability and retrans. Over the public internet: don’t, unless you know exactly what you’re doing with secure transport, latency expectations, and failure semantics. SSH-based tools are usually a better fit for that environment.
10) What’s the simplest safe default recommendation?
NFSv4.2 with systemd automount for production shares; SSHFS with keepalives for ad-hoc access only. If you’re tempted to reverse those, you’re about to learn something expensive.
Conclusion: what to do Monday morning
If you only remember one thing: remote filesystems are dependencies that fail inside syscalls. Choose the protocol that matches your failure tolerance, then configure it so failure is bounded and observable.
- Inventory mounts with findmnt and classify them: critical writes vs convenience reads.
- Migrate production shared paths to NFSv4.2 where possible. Keep SSHFS for tooling, not foundations.
- Add systemd automount for non-boot-critical mounts. Make “storage down” a recoverable state.
- Set explicit timeouts and keepalives (NFS timeo/retrans; SSH ServerAliveInterval/CountMax).
- Write one runbook page with the fast diagnosis steps: D-state check, timeout stat, logs, retrans, port reachability.
The goal isn’t “never fails.” The goal is “fails in ways you can survive, debug, and explain without lying.” That’s what keeps mounts from becoming mysteries—and keeps your on-call from learning new swear words.