You didn’t “lose” the server. You lost time—staring at a terminal that won’t return because some process is stuck in uninterruptible I/O,
and now ls is a hostage negotiator. The pager isn’t impressed that the share “usually works.”
On Debian 13, the SSHFS vs NFS question isn’t philosophical. It’s operational: which one fails predictably, which one recovers cleanly,
and how you configure mounts so they don’t turn routine outages into whole-host paralysis.
The opinionated decision: use NFS for production, SSHFS for edge cases
If your goal is “won’t randomly hang,” pick NFS with systemd automount for anything resembling production shared storage.
SSHFS is great when you need encryption without infrastructure, or when you’re pulling data from a machine you don’t control,
or when you want a quick duct-tape mount for a one-off job. But if you mount SSHFS as a core dependency and expect it to behave like a LAN file server,
you’re going to meet FUSE semantics, TCP hiccups, and user-space buffering in the least charming way possible.
NFS is not magic. It can hang too—especially with hard mounts (the default), because “don’t return corrupt data” is a respectable principle.
But the difference is that NFS gives you better tooling, better kernel integration, and clearer knobs for failure behavior.
Most “random hangs” aren’t random; they’re you choosing blocking semantics without an escape hatch.
Here’s the practical rule:
- Use NFSv4.2 for shared home dirs, build artifacts, backup landing zones, and “multiple clients need the same data.”
- Use SSHFS for ad-hoc access, crossing hostile networks, or when you need per-user auth without standing up NFS properly.
- Never mount either one in a way that blocks boot or stalls critical daemons. Automount or explicit dependency ordering, always.
One quote to keep you honest. Werner Vogels (Amazon CTO) has a widely cited idea—paraphrased: “Everything fails, all the time.”
If you accept that, you stop treating hangs as mysterious and start designing the failure mode you can live with.
Facts and history that actually matter in 2025
A few context points you can use in meetings when someone insists “it’s just a mount, how hard can it be?”
- NFS is old enough to have scars. It dates back to the mid-1980s at Sun Microsystems, and its design reflects real enterprise pain: shared UNIX files at scale.
- NFSv4 was a rewrite in spirit. It consolidated protocols, improved security and statefulness, and moved toward “one firewall-friendly port” behavior.
- NFS used to love UDP. Early NFS deployments commonly used UDP; modern best practice is TCP for reliability and congestion control.
- SSHFS is FUSE. That means user-space filesystem plumbing; great for flexibility, not great for “kernel-level always-on storage.”
- FUSE semantics can surprise you. Permission checks, caching, and error propagation behave differently than kernel filesystems—especially under failure.
- NFS client behavior is policy-heavy. “Hard” vs “soft,” timeouts, retransmits, and attribute caching all shape how “hung” feels.
- Systemd changed the game. Automount units and mount dependencies let you keep boots and services alive even when storage is flaky.
- Locking is historically messy. NFS locking went through separate lock managers; NFSv4 integrated locking, but legacy assumptions still crop up in apps.
- Id mapping is a common hidden footgun. NFSv4 uses name-based identity mapping unless you configure it cleanly; mismatches look like “permissions randomly broke.”
What “random hang” really means (and who’s guilty)
When engineers say “SSHFS hung” or “NFS hung,” they usually mean one of four things:
1) A process is stuck in D state (uninterruptible sleep)
That’s the classic: ps shows D, kill -9 does nothing, and the only thing that “fixes it” is restoring the remote server or rebooting.
This is most common with NFS hard mounts during a server outage or network partition.
2) The mount call blocks forever
Boot hangs, or a service start hangs, because something tried to mount a remote filesystem and waited. This is a configuration failure:
no automount, no timeout, and a dependency chain that lets a remote share hold the host hostage.
3) Directory listings hang, but some reads work
Metadata operations (lookup, getattr, readdir) are sensitive to caching and network jitter.
SSHFS can feel worse here because every metadata operation is a round trip through an SSH channel, plus SFTP protocol overhead.
NFS can also do this if attribute caching is mis-tuned or the server is overloaded.
4) “It’s slow” looks like “it’s hung”
A 30-second stall on find or du is often just latency amplification:
thousands of tiny metadata operations times a few milliseconds each equals “my terminal died.”
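A quick way to sanity-check this on your own mount during calm hours (a rough sketch; /mnt/shared and nas01 are the example names used later, and the arithmetic is back-of-envelope):

cr0x@server:~$ find /mnt/shared -xdev | wc -l        # entry count ≈ number of metadata round trips on a cold walk
cr0x@server:~$ ping -c 20 nas01 | tail -1            # read the average RTT from the summary line
# entries times average RTT is your floor: 50,000 entries at ~2 ms each is roughly 100 seconds of pure waiting.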
Joke #1: A remote filesystem is like a coworker on vacation—technically still employed, but you’re not getting a response before lunch.
Who’s more likely to “randomly hang”?
SSHFS is more likely to disconnect or stall unpredictably on unreliable networks, especially if you don’t configure keepalives.
But when SSHFS fails, it often fails “softer” (you get I/O errors, the mount is gone, reconnect may work) rather than trapping the kernel in retries forever.
NFS is more likely to block forever by design with default hard mounts. That’s not random; it’s policy.
If you want “don’t hang the whole host,” you design the failure mode: automount, timeouts where appropriate,
and a clear decision about hard vs soft per workload.
Fast diagnosis playbook (first/second/third)
This is the “I have five minutes before this turns into a war room” flow. The goal is to identify whether you’re dealing with
(a) network reachability, (b) server-side load or daemon failure, (c) client-side mount semantics, or (d) DNS/identity weirdness.
First: confirm it’s not basic reachability
- Can you ping the server IP (or at least resolve the name)?
- Is TCP port 2049 reachable for NFSv4? Is TCP 22 reachable for SSHFS?
- Is there packet loss or high latency that turns metadata into molasses?
Second: check whether the client is blocked in kernel I/O
- Are processes stuck in D state?
- Does cat /proc/mounts or mount hang?
- Does accessing other filesystems work normally?
Third: interrogate the specific protocol path
- For NFS: nfsstat, rpcinfo (if v3), kernel logs, server export status, and whether the server is replying.
- For SSHFS: SSH keepalives, SFTP subsystem health, and whether the SSH connection is stuck due to a dead TCP session.
Decision points
- If the server is unreachable: automount prevents total host stall; fix network or fail over.
- If the server is reachable but slow: you’re capacity-bound—CPU, disks, or too many clients; tune and scale.
- If one client hangs but others don’t: suspect client-side DNS, identity mapping, or a wedged mount state.
Practical tasks: commands, outputs, and what decision you make
These are the on-call moves that separate “I think it’s NFS” from “the client is in hard-retry because the server stopped answering COMPOUND ops.”
Each task includes: command, what the output means, and what you do next.
Task 1: See what’s mounted and whether you’re dealing with NFS or FUSE
cr0x@server:~$ findmnt -t nfs,nfs4,fuse.sshfs
TARGET SOURCE FSTYPE OPTIONS
/mnt/shared nas01:/export/shared nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.21
/mnt/adhoc devbox:/home/builds fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000
Meaning: you’ve got both. NFS is kernel; SSHFS is FUSE. Decision: apply the right debugging tools—nfsstat for NFS, SSH logs for SSHFS.
Task 2: Check whether processes are stuck in uninterruptible I/O (D state)
cr0x@server:~$ ps -eo state,pid,comm,wchan:30 | awk '$1=="D"{print}'
D 18472 rsync nfs_wait_bit_killable
D 21205 du rpc_wait_bit_killable
Meaning: these are blocked in kernel waits, typical of NFS hard retry behavior. Decision: stop expecting kill -9 to help; restore server/network or force unmount carefully.
Task 3: Confirm network path quickly (latency and loss)
cr0x@server:~$ ping -c 5 nas01
PING nas01 (10.0.0.50) 56(84) bytes of data.
64 bytes from 10.0.0.50: icmp_seq=1 ttl=63 time=0.453 ms
64 bytes from 10.0.0.50: icmp_seq=2 ttl=63 time=0.482 ms
64 bytes from 10.0.0.50: icmp_seq=3 ttl=63 time=12.103 ms
64 bytes from 10.0.0.50: icmp_seq=4 ttl=63 time=0.469 ms
64 bytes from 10.0.0.50: icmp_seq=5 ttl=63 time=0.476 ms
--- nas01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4093ms
rtt min/avg/max/mdev = 0.453/2.797/12.103/4.650 ms
Meaning: one spike. That can be enough to make metadata-heavy workloads feel “hung.” Decision: if spikes correlate with hangs, look for congestion, bufferbloat, or NIC issues.
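If spikes keep showing up, look at the whole path rather than one hop. mtr is one option (not installed by default on Debian; the mtr-tiny package is enough), and report mode keeps it scriptable; a sketch:

cr0x@server:~$ sudo apt install -y mtr-tiny
cr0x@server:~$ sudo mtr -rwc 100 nas01    # 100 probes in report mode; watch Loss% and StDev per hop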
Task 4: For NFSv4, check port reachability (2049)
cr0x@server:~$ nc -vz -w2 nas01 2049
Connection to nas01 (10.0.0.50) 2049 port [tcp/nfs] succeeded!
Meaning: basic connectivity is fine. Decision: stop blaming the firewall; look at server load, NFS daemon health, or client mount semantics.
Task 5: For SSHFS, confirm SSH transport health and authentication speed
cr0x@server:~$ ssh -o BatchMode=yes -o ConnectTimeout=5 build@devbox 'echo ok'
ok
Meaning: SSH responds. Decision: if SSH is fine but SSHFS stalls, focus on SFTP subsystem, keepalives, and FUSE behavior under load.
Task 6: Inspect kernel logs for NFS timeouts and server not responding
cr0x@server:~$ sudo journalctl -k -n 20
Aug 12 14:03:19 app01 kernel: nfs: server nas01 not responding, still trying
Aug 12 14:03:50 app01 kernel: nfs: server nas01 not responding, still trying
Aug 12 14:04:23 app01 kernel: nfs: server nas01 OK
Meaning: the “hang” is the client doing exactly what you told it to do: keep trying. Decision: if this blocks production, add automount and consider soft only for non-critical reads.
Task 7: See per-mount NFS stats (retransmits are the smoke)
cr0x@server:~$ nfsstat -m
/mnt/shared from nas01:/export/shared
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.21
Stats: age: 00:43:11 ops/s: 127
xprt: tcp 90632 211 0 0 0
per-op statistics
READ: 42.1% WRITE: 11.7% GETATTR: 20.4% LOOKUP: 14.2% ACCESS: 7.6%
Meaning: the transport line shows retransmits (second number after calls). If retrans climbs, you’re losing responses. Decision: investigate network loss, server saturation, or too aggressive timeouts.
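A single snapshot is less useful than a trend. One low-effort way to watch client-side retransmissions during an incident (a sketch; adjust the interval to taste):

cr0x@server:~$ watch -n 5 'nfsstat -c | head -n 5'          # client RPC counters; a climbing retrans value means lost replies
cr0x@server:~$ grep -A 3 'device nas01' /proc/self/mountstats   # per-mount detail lives here if you need to dig deeper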
Task 8: Identify whether DNS is causing stalls (reverse lookups can hurt)
cr0x@server:~$ getent hosts nas01
10.0.0.50 nas01
Meaning: name resolves fast. Decision: if this hangs or returns unexpected addresses, fix DNS before touching mounts.
Task 9: Check server-side exports (from server)
cr0x@nas01:~$ sudo exportfs -v
/export/shared 10.0.0.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,root_squash,no_all_squash)
Meaning: export exists; root squashed; sync enabled. Decision: if clients need root ownership semantics, don’t “fix” with all_squash hacks—use proper UID/GID mapping or a service account.
Task 10: Confirm NFS server services are up (server)
cr0x@nas01:~$ systemctl status nfs-server --no-pager
● nfs-server.service - NFS server and services
Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; preset: enabled)
Active: active (exited) since Tue 2025-08-12 13:01:21 UTC; 3h 2min ago
Docs: man:nfsd(8)
Meaning: service is “active (exited)” which is normal for some NFS server units once kernel threads are spawned. Decision: if it’s failed, don’t debug clients—fix the server first.
Task 11: Detect a mount that will hang your shell (try a timed stat)
cr0x@server:~$ timeout 3 stat /mnt/shared || echo "stat timed out"
stat timed out
Meaning: metadata call blocked longer than 3 seconds. Decision: your app doing filesystem calls will also block; route around by failing over, unmounting, or fencing the dependency.
Task 12: Safely attempt a lazy unmount (when the remote is gone)
cr0x@server:~$ sudo umount -l /mnt/shared
Meaning: no output usually means success. Lazy unmount detaches the mount point now and cleans up when references go away. Decision: use it when you need the box usable again; then fix the underlying outage.
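Two follow-ups worth knowing (a sketch; paths match the earlier examples). Verify the detach actually happened, and remember that for NFS specifically, umount -f asks the kernel to abort outstanding requests, which sometimes frees things that -l alone leaves dangling:

cr0x@server:~$ findmnt /mnt/shared || echo "detached"
cr0x@server:~$ sudo umount -f /mnt/shared    # NFS-aware force; you may still need -l afterwards if processes hold references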
Task 13: Force a clean remount with systemd (client)
cr0x@server:~$ sudo systemctl restart mnt-shared.mount
Meaning: systemd re-evaluates the mount unit. Decision: if restart blocks, switch to automount so the mount happens on demand and can time out without holding boot/service start.
Task 14: For SSHFS, check for dead TCP sessions and keepalive behavior
cr0x@server:~$ ss -tpn '( sport = :ssh or dport = :ssh )'
ESTAB 0 0 10.0.0.21:52418 10.0.0.60:22 users:(("ssh",pid=29102,fd=3))
Meaning: SSH connection exists. Decision: if it’s ESTAB but I/O is stuck, you likely have a half-dead path (stateful middlebox, NAT, Wi‑Fi sleep). Add keepalives and ServerAliveInterval for SSHFS.
NFS that doesn’t embarrass you: Debian 13 server + client configs
NFS’s reputation is weird: either “rock solid” or “haunted.” In production, it’s mostly determined by whether you treat it as a system component
(with identity, exports, and automount) or as a quick mount line copied from a wiki written in 2009.
Server: install and export sanely
On the NFS server (Debian 13), you want a simple export, TCP, and predictable permissions. If you need strong security across untrusted networks,
use Kerberos; for trusted LANs, sec=sys is common and fine if your network is actually trusted.
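If you do go the Kerberos route, the export line is the smallest part of the work; clients also need keytabs and rpc.gssd running, which is its own project. A sketch of what the export side looks like (the path is illustrative; sec=krb5p means authentication plus privacy):

/export/secure 10.0.0.0/24(rw,sync,no_subtree_check,sec=krb5p)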
cr0x@nas01:~$ sudo apt update
Hit:1 ... InRelease
Reading package lists... Done
cr0x@nas01:~$ sudo apt install -y nfs-kernel-server
Reading package lists... Done
...
Setting up nfs-kernel-server ...
Create an export. The flags matter:
- sync trades some performance for integrity; for many business workloads that’s the right trade.
- no_subtree_check avoids painful path-based checks (and surprises during renames).
- Keep it narrow: a subnet or a set of hosts.
cr0x@nas01:~$ sudo mkdir -p /export/shared
cr0x@nas01:~$ sudo chown -R root:root /export
cr0x@nas01:~$ sudo chmod 755 /export /export/shared
cr0x@nas01:~$ sudo tee /etc/exports >/dev/null <<'EOF'
/export/shared 10.0.0.0/24(rw,sync,no_subtree_check,root_squash)
EOF
cr0x@nas01:~$ sudo exportfs -ra
cr0x@nas01:~$ sudo exportfs -v
/export/shared 10.0.0.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,root_squash,no_all_squash)
Client: mount with NFSv4.2 and pick your failure semantics
The key choice is hard vs soft:
- hard: retries indefinitely. Safer for data correctness. Also the classic source of “host feels hung.”
- soft: returns an error after retries. Better for interactive clients and non-critical reads. Dangerous for some apps that don’t handle partial failures.
In production, I default to hard + automount. That way, you don’t corrupt workflows, and you don’t wedge boot.
For non-critical tooling (like a developer box listing a media share), soft can be acceptable.
cr0x@server:~$ sudo apt install -y nfs-common
Reading package lists... Done
...
Setting up nfs-common ...
Test a manual mount to validate basics.
cr0x@server:~$ sudo mkdir -p /mnt/shared
cr0x@server:~$ sudo mount -t nfs4 -o vers=4.2,proto=tcp,hard,timeo=600,retrans=2 nas01:/export/shared /mnt/shared
cr0x@server:~$ mount | grep /mnt/shared
nas01:/export/shared on /mnt/shared type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.21)
What those options mean in practice:
- proto=tcp: you want TCP unless you enjoy debugging “works on my desk” packet loss.
- timeo=600: for TCP, this is in tenths of a second. 600 → 60s per RPC timeout attempt. That’s conservative.
- retrans=2: number of retries before escalating the timeout logic. Not “total retries” for hard mounts; hard mounts can keep trying.
- rsize/wsize: large sizes help throughput, but they don’t fix metadata latency. They’re not a cure for “find is slow.”
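For contrast with the hard mount above, a soft variant for a non-critical, read-mostly share might look like this (a sketch; /export/media and /mnt/media are placeholder names, the timeo/retrans values are illustrative, and soft means the application will see I/O errors when retries run out):

cr0x@server:~$ sudo mount -t nfs4 -o vers=4.2,proto=tcp,soft,timeo=100,retrans=3,ro nas01:/export/media /mnt/media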
Permissions and identity: make it boring
NFS is not a permission system; it forwards identity. If UID/GID don’t match across machines, you get “random” access denied.
Fix it at the source: consistent UID/GID for service accounts and shared users.
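The fastest way to catch a mismatch before it becomes a ticket is to compare the numeric IDs on both ends (svc-build is a placeholder account name):

cr0x@server:~$ id svc-build
cr0x@nas01:~$ id svc-build
# the uid= and gid= numbers must match on every client and on the server; matching names alone are not enough with sec=sys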
SSHFS that behaves: keepalives, systemd, and sane caching
SSHFS is appealing because it’s just SSH. No exports, no rpcbind nostalgia, no special firewall meetings.
But SSHFS lives in user space, speaks SFTP, and depends on a stable SSH session. It’s a very polite filesystem: it asks permission for everything,
one question at a time, over an encrypted tunnel. That politeness is expensive.
Install and do a basic mount
cr0x@server:~$ sudo apt install -y sshfs
Reading package lists... Done
...
Setting up sshfs ...
cr0x@server:~$ mkdir -p /mnt/adhoc
cr0x@server:~$ sshfs build@devbox:/home/builds /mnt/adhoc -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,Compression=no
cr0x@server:~$ findmnt /mnt/adhoc
TARGET SOURCE FSTYPE OPTIONS
/mnt/adhoc build@devbox:/home/builds fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000
Keepalive options are non-negotiable if there is any NAT, Wi‑Fi, or “security appliance” between you and the server:
ServerAliveInterval sends SSH-level pings; reconnect tries to re-establish the session.
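You can also bake the keepalives into the client’s SSH config so every sshfs (and plain ssh) session to that host gets them; a minimal sketch, assuming the devbox host from the earlier examples:

cr0x@server:~$ tee -a ~/.ssh/config >/dev/null <<'EOF'
Host devbox
    ServerAliveInterval 15
    ServerAliveCountMax 3
    TCPKeepAlive yes
EOF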
Joke #2: SSHFS is like putting a filing cabinet through airport security—everything gets inspected, and you’re paying for the privilege.
Common SSHFS hang patterns
- Half-open TCP sessions: your laptop slept, NAT expired, connection looks ESTAB but no packets flow. Keepalives reduce the time to detect.
- SFTP subsystem trouble: the remote sftp-server is slow, resource-starved, or constrained by forced commands.
- FUSE thread contention: heavy parallel filesystem calls can bottleneck the userspace daemon; CPU spikes on the client look like “mount is hung.”
SSHFS mount options that tend to reduce pain
- reconnect: try to reconnect if the connection drops.
- ServerAliveInterval=15 and ServerAliveCountMax=3: detect dead sessions quickly.
- workaround=rename (sometimes): helps with apps that do atomic renames and hit edge cases.
- cache=yes or kernel_cache: can help throughput but can hurt coherence. Use carefully if multiple writers exist.
- Compression=no on a fast LAN: saves CPU and often improves latency.
If multiple clients write the same files and you care about consistency, SSHFS caching is a trap.
NFS has its own cache/coherence story, but it’s designed for multi-client shared access. SSHFS is designed for access, not coordination.
Stop hanging your boot: systemd automount patterns
The most preventable “random hang” is the one that happens at boot because /etc/fstab told the host to block until a remote machine answers.
Don’t do that. Use systemd automount so the mount occurs on first access, and so timeouts don’t brick the whole host.
NFS with /etc/fstab and systemd automount
Put this in /etc/fstab on the client. It mounts on demand, not at boot:
cr0x@server:~$ sudo tee -a /etc/fstab >/dev/null <<'EOF'
nas01:/export/shared /mnt/shared nfs4 nofail,x-systemd.automount,x-systemd.device-timeout=10s,x-systemd.mount-timeout=10s,_netdev,vers=4.2,proto=tcp,hard,timeo=600,retrans=2 0 0
EOF
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart remote-fs.target
cr0x@server:~$ systemctl status mnt-shared.automount --no-pager
● mnt-shared.automount - /mnt/shared
Loaded: loaded (/etc/fstab; generated)
Active: active (waiting) since Tue 2025-08-12 14:10:12 UTC; 2s ago
Meaning: automount is armed and waiting. Decision: this host can boot even if nas01 is down; first access triggers a mount attempt with timeouts.
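To prove the on-demand behavior works, touch the path and watch systemd flip the units (a sketch; output shapes vary, the point is that the .automount arms the .mount):

cr0x@server:~$ ls /mnt/shared >/dev/null                         # first access triggers the mount attempt
cr0x@server:~$ systemctl list-units 'mnt-shared.*' --no-pager    # expect the .automount active and the .mount now mounted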
SSHFS with systemd (preferred over fstab for user mounts)
SSHFS often runs best as a user service or with explicit systemd units, because it’s a user-space process and you want clean lifecycle handling.
Here’s a system-level unit example if you want it mounted for a service account (adjust user/paths as needed).
cr0x@server:~$ sudo tee /etc/systemd/system/mnt-adhoc.mount >/dev/null <<'EOF'
[Unit]
Description=SSHFS mount for /mnt/adhoc
After=network-online.target
Wants=network-online.target
[Mount]
What=build@devbox:/home/builds
Where=/mnt/adhoc
Type=fuse.sshfs
Options=_netdev,reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,Compression=no,IdentityFile=/home/build/.ssh/id_ed25519
[Install]
WantedBy=multi-user.target
EOF
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl enable --now mnt-adhoc.mount
Created symlink /etc/systemd/system/multi-user.target.wants/mnt-adhoc.mount → /etc/systemd/system/mnt-adhoc.mount.
Decision: if you see this mount occasionally wedge, add an accompanying .automount unit and mount on demand—same principle as NFS.
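A sketch of that accompanying automount unit (TimeoutIdleSec is optional; here it unmounts the share after ten idle minutes, which also limits how long a dead session can linger). With an .automount in place you enable the .automount, not the .mount, and let systemd start the mount on first access:

cr0x@server:~$ sudo tee /etc/systemd/system/mnt-adhoc.automount >/dev/null <<'EOF'
[Unit]
Description=Automount for /mnt/adhoc

[Automount]
Where=/mnt/adhoc
TimeoutIdleSec=600

[Install]
WantedBy=multi-user.target
EOF
cr0x@server:~$ sudo systemctl disable --now mnt-adhoc.mount
cr0x@server:~$ sudo systemctl enable --now mnt-adhoc.automount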
Three corporate mini-stories from the land of “it should have worked”
Incident caused by a wrong assumption: “Hard mounts can’t hurt us”
A mid-sized company ran Debian app servers that pulled static assets and a few configuration fragments from an NFS share.
The assumption was straightforward: hard mounts prevent corrupt reads, so they’re “safer.” Someone even argued that a hang is better than bad data,
which is true in the abstract and untrue in the middle of a customer-facing outage.
During a maintenance window, the storage team rebooted the NFS server. The reboot took longer than expected because the box ran a filesystem check.
Meanwhile, app servers kept running, until a deploy triggered a script that did a recursive find over the mounted path.
Suddenly, every deploy agent was stuck. Then the monitoring agent tried to read a file under that path. Then a logrotate hook did the same.
The “wrong assumption” was that a hard mount hang is localized. It wasn’t. The mount point was part of too many code paths.
The result wasn’t just one job blocked; it was an entire fleet accumulating stuck processes and load from retry storms.
Some nodes degraded enough that the orchestrator started flapping services.
The fix was not “use soft mounts everywhere.” It was automount plus a redesign: stop putting core runtime dependencies on a remote filesystem
unless the application explicitly tolerates that dependency being slow or unavailable. Hard mounts stayed for the data that truly needed them,
but access became demand-driven and time-bounded at the systemd layer.
Optimization that backfired: “Let’s crank up caching and jumbo rsize/wsize”
Another shop had a build farm reading large artifacts from a shared store. Someone did the sensible thing: increased rsize/wsize to 1 MiB,
enabled aggressive client-side caching, and declared victory after a benchmark that copied a few multi-gig files faster.
The dashboard improved. People clapped. The change rolled out.
Then the weirdness began. Builds started intermittently failing with “file not found” during steps that expected a file to appear immediately
after another job uploaded it. Sometimes the file was there but had an older timestamp. Sometimes the directory listing didn’t include it for seconds.
It looked like eventual consistency, which is not a phrase you want associated with a POSIX-ish filesystem.
The backfire came from mixing workloads: the same share served large sequential reads and metadata-sensitive coordination.
The caching tuned for throughput made metadata visibility laggy. The system didn’t “hang”; it lied politely for a moment.
That’s worse than a hang because it creates ghost bugs.
They recovered by splitting workloads: one NFS export tuned for large artifacts, another path for coordination with stricter cache semantics,
and a shift toward explicit artifact versioning so readers never depended on “latest file in a directory” appearing instantly.
The lesson: performance tuning is not universal; it’s workload-specific, and mixed workloads punish optimism.
Boring but correct practice that saved the day: automount + fenced dependencies
A financial org (the sort that has change windows and paperwork for the paperwork) ran a set of Debian batch servers
that periodically pulled files from a remote share. The share was important, but the servers had other duties too.
Years earlier, a cautious engineer insisted on two things: systemd automount for all network filesystems, and explicit timeouts on mount attempts.
No one loved it. It wasn’t “performance work.” It didn’t show up in feature demos.
One night, the storage backend had a partial outage—half the NFS nodes were reachable but overloaded,
returning responses slowly enough to trigger client retransmits. Clients without automount would have blocked at boot after a reboot,
and any service restart would have become a gamble.
Instead, the batch servers stayed alive. Jobs that needed the share failed fast with clear errors and retried later.
Monitoring kept running. SSH access stayed responsive. The incident stayed contained to “the batch pipeline is delayed,” not “we can’t log in.”
The postmortem was boring. That was the win. Boring practices don’t get promoted in slide decks,
but they keep the rest of your infrastructure from participating in someone else’s outage.
Common mistakes: symptom → root cause → fix
1) Symptom: boot hangs or takes minutes when the NAS is down
Root cause: blocking mounts in /etc/fstab without nofail, automount, or timeouts.
Fix: add x-systemd.automount, nofail, and timeouts; or convert to explicit systemd units.
2) Symptom: ls in the mount point never returns; processes stuck in D state
Root cause: NFS hard mount + server not responding; kernel retries indefinitely.
Fix: restore server/network; if you must regain the host, use umount -l and redesign with automount to avoid global blockage.
3) Symptom: “Permission denied” on NFS, but the same path works on another client
Root cause: UID/GID mismatch across hosts (or NFSv4 id mapping inconsistencies).
Fix: standardize identities; ensure consistent user IDs; avoid “fixing” with unsafe export options.
4) Symptom: SSHFS mount stalls after laptop sleep or Wi‑Fi roaming
Root cause: half-open SSH session; NAT/stateful device expired the flow.
Fix: add ServerAliveInterval/ServerAliveCountMax and reconnect; consider automount for on-demand usage.
5) Symptom: SSHFS is “fine” for big files but terrible for find or small-file workloads
Root cause: per-operation SFTP overhead; metadata latency amplified.
Fix: don’t use SSHFS for metadata-heavy shared workloads; use NFS or local sync (rsync, artifact fetch) depending on the use case.
6) Symptom: NFS feels hung only during peak hours
Root cause: server overload (CPU, disk latency, lock contention) or network congestion causing retransmits.
Fix: confirm with nfsstat and server metrics; scale server, move hot data, or split exports; don’t “tune” timeouts as a substitute for capacity.
7) Symptom: mounts work by IP but not by hostname
Root cause: DNS issues, reverse lookup delays, or inconsistent records.
Fix: fix DNS; pin stable names; don’t bake IPs into fleet configs unless you accept the operational debt.
8) Symptom: unmount hangs
Root cause: busy mount with stuck processes holding references; NFS hard retry; FUSE daemon stuck.
Fix: identify holders (lsof/fuser if they respond), use umount -l; for FUSE, consider stopping the sshfs process and then lazy unmount.
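A minimal detach sequence for that last case, assuming the mount points from earlier (expect lsof/fuser themselves to hang if the mount is fully wedged; point them at the exact path and be prepared to skip them):

cr0x@server:~$ sudo fuser -vm /mnt/shared          # list holders; may block on a dead mount
cr0x@server:~$ sudo umount -l /mnt/shared          # detach now, clean up when references drop
cr0x@server:~$ sudo pkill -f 'sshfs.*mnt/adhoc' ; sudo umount -l /mnt/adhoc   # FUSE case: stop the daemon, then detach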
Checklists / step-by-step plan
Step-by-step: choose the protocol (production reality edition)
- If multiple clients need the same writable data: choose NFSv4.2.
- If it’s ad-hoc access over semi-hostile networks: choose SSHFS.
- If you need “shared storage” but the app can’t tolerate filesystem semantics: don’t use either—use object storage, HTTP artifact fetch, or replicate locally.
- Decide your failure mode:
- Prefer hard + automount for correctness without boot hangs.
- Use soft only when the app tolerates I/O errors cleanly.
Step-by-step: implement NFS correctly on Debian 13
- Server: install nfs-kernel-server.
- Server: create a dedicated export path; set ownership deliberately.
- Server: configure /etc/exports with sync and no_subtree_check; limit clients.
- Client: install nfs-common.
- Client: test manual mount with vers=4.2,proto=tcp.
- Client: switch to /etc/fstab with x-systemd.automount, nofail, and timeouts.
- Validate: simulate server downtime and confirm the host remains usable and boots normally (a sketch of that test follows this list).
Step-by-step: implement SSHFS without self-sabotage
- Install sshfs.
- Confirm non-interactive SSH works quickly (BatchMode).
- Mount with reconnect and SSH keepalives.
- If it’s anything but temporary: manage it with systemd units; add automount if needed.
- Be conservative with caching when multiple writers exist.
Operational checklist: “this share should never take the host down”
- Automount is enabled for all network mounts.
- Mount timeouts are set (device timeout and mount timeout).
- Critical services do not hard-depend on a remote mount unless designed for it.
- UID/GID consistency is enforced across clients.
- You have a documented “how to detach a wedged mount” procedure.
- You periodically test failure: server reboot, network drop, DNS failure.
FAQ
1) Which one is less likely to hang: SSHFS or NFS?
SSHFS is less likely to wedge the kernel in indefinite retry the way a hard NFS mount can, but it’s more likely to stall or disconnect on flaky networks.
For “won’t take the host down,” the winner is NFS with systemd automount and timeouts.
2) Should I use NFS hard or soft mounts?
Default to hard for correctness, but pair it with automount so outages don’t freeze boot or service start.
Use soft only when the application can tolerate I/O errors and you prefer failure over waiting.
3) Why does kill -9 not kill a hung process on NFS?
Because it’s stuck in uninterruptible sleep in the kernel waiting for I/O completion. The signal is noted but not delivered until the syscall returns.
The fix is restoring the server/network or detaching the mount (lazy unmount) so the process can unwind.
4) Why is SSHFS so slow on directories with many small files?
Metadata operations require many round trips, and SSHFS adds user-space and SFTP overhead to each.
For small-file workloads and shared directories, NFS is usually a better fit.
5) Does NFSv4 require rpcbind?
NFSv4 is designed to be simpler and typically uses port 2049 without the older portmapper dependency that NFSv3 relied on.
In practice, your environment may still run related RPC services, but NFSv4’s firewall story is cleaner.
6) What mount options are the most important to avoid “random hangs”?
For both: the most important are systemd automount and timeouts.
For SSHFS specifically: ServerAliveInterval, ServerAliveCountMax, and reconnect.
For NFS: explicit vers=4.2, proto=tcp, and choosing hard vs soft intentionally.
7) How do I unstick a host where anything touching the mount hangs?
First, stop touching it. Then restore server/network if possible. If you need the host back immediately, use umount -l on the mount point.
If SSHFS is involved, stop the sshfs process (systemd unit) and then lazy unmount.
8) Is SSHFS “secure enough” compared to NFS?
SSHFS rides on SSH, so it’s encrypted and authenticated by default. NFS can be secure too, but you need to configure it—often with Kerberos for strong auth.
For untrusted networks, SSHFS is operationally simpler. For trusted LAN production storage, NFS is usually the more stable tool.
9) Should I use /etc/fstab or systemd units?
For NFS, /etc/fstab with x-systemd.automount is perfectly fine and easy to manage at scale.
For SSHFS, systemd units are often cleaner because it’s a user-space daemon with lifecycle needs.
10) What about file locking and correctness for build systems?
If your build system depends on locking semantics and multiple writers, NFSv4 generally gives you a better shot at correctness than SSHFS.
But the real answer is architectural: avoid using a shared filesystem as a coordination database if you can.
Next steps you can do today
If you only do three things, do these:
- Convert network mounts to systemd automount with explicit timeouts. This alone prevents the most humiliating hangs.
- Pick failure semantics on purpose: NFS hard for correctness, soft only where errors are acceptable. Stop inheriting defaults as policy.
- Run the diagnosis tasks during calm hours: capture a baseline nfsstat -m, confirm DNS, confirm port reachability, and test a simulated outage.
Then be a little ruthless: if an application can’t tolerate remote filesystem weirdness, don’t give it a remote filesystem dependency.
Replace “shared directory as integration layer” with artifact publishing, local caching, or a proper service. Your future on-call self will notice.