Proxmox NFS timeouts: mount options that improve stability

You notice it when backups start “hanging” and the Proxmox GUI feels like it’s wading through syrup.
A VM pauses at the worst possible moment. Logs fill with “server not responding” and then… nothing.
Not a clean failure. Just a slow-motion operational hostage situation.

NFS timeouts on Proxmox are rarely a single bad setting. They’re usually an argument between mount semantics,
network behavior, and how Proxmox (and QEMU) react to blocked I/O. This piece is the practical path through:
what to mount, how to mount it, how to prove what’s wrong, and how to stop it from happening again.

What “NFS timeouts” look like on Proxmox (and why they’re scary)

“NFS timeout” is usually a euphemism. The system isn’t politely timing out; it’s blocking.
On Linux, a hard-mounted NFS filesystem will keep retrying operations until the server answers.
This is correct behavior for data integrity, but it’s a special kind of misery when that mount contains
VM disks or backup targets.

On Proxmox, the blast radius depends on what’s on NFS:

  • ISO/templates storage: annoying but survivable. Fetches fail, operations retry.
  • Backup target (vzdump): jobs hang, lock files linger, monitoring screams.
  • VM disks on NFS: the host can stall on I/O. Guest I/O freezes, sometimes QEMU pauses, sometimes you get split-brain feelings without the split-brain.
  • Shared storage for migration: migrations hang mid-flight. You now have two machines arguing about who owns the problem.

The reason NFS timeouts are operationally nasty is that failures are often partial:
one path flaps, one NIC drops, one switch buffer chokes, one server thread stalls. The client keeps trying.
You don’t get a clean “down.” You get a haunted house.

Two quick truths before we start tuning

  • Mount options won’t fix a broken network. They can make failure behavior sane and reduce lockups, but they can’t out-argue physics.
  • Stability beats speed for Proxmox storage. Fast storage that occasionally stalls is slower than modest storage that never lies.

Interesting facts & historical context (because the past keeps billing us)

  1. NFS was born in the mid-1980s at Sun Microsystems to share files across workstations without a shared disk bus.
  2. NFSv3 is stateless by design (server doesn’t track client state much), which makes recovery from server reboot simpler but pushes complexity elsewhere.
  3. NFSv4 introduced statefulness (locks, sessions, delegations). Better semantics, more moving parts.
  4. NFS over UDP used to be common for performance; TCP largely won because it behaves better on lossy networks and with modern NIC offloads.
  5. Hard mounts are the Linux default because silent data corruption is worse than a hang. Yes, that’s a grim trade-off.
  6. Linux’s NFS client uses RPC timeouts and exponential backoff; “server not responding” doesn’t necessarily mean the server is down—sometimes it’s congested.
  7. VM disk I/O patterns are hostile to chatty protocols: small random reads/writes, metadata operations, fsyncs. NFS gets stress-tested whether you want it to or not.
  8. pNFS exists to scale NFS, but most Proxmox setups don’t use it; they’re on a single head and wonder why it behaves like a single head.

One quote that belongs taped above every storage dashboard:
“Hope is not a strategy.” — General Gordon R. Sullivan

Joke #1: NFS is like office Wi‑Fi—when it’s good, nobody notices; when it’s bad, everyone becomes a network engineer.

Fast diagnosis playbook (first/second/third checks)

When NFS timeouts show up, you have minutes to decide: is this a server problem, a network problem,
or a client-side behavior problem? Here’s the order that finds the bottleneck fastest in real environments.

First: confirm the failure mode (blocked I/O vs slow I/O vs permission/lock issue)

  • Look for “server not responding” in the kernel log on the Proxmox node. If it’s there, you’re in transport/RPC land.
  • Check if processes are stuck in D state. If you see QEMU, vzdump, or kernel threads in D, you have blocked I/O.
  • Confirm whether it’s one node or all nodes. One node suggests NIC/switch path; all nodes suggests server-side stall or shared network segment failure.

Second: prove whether the NFS server is healthy right now

  • Server load and disk latency: if the server is saturated or its backing store is stalling, clients will time out.
  • RPC service responsiveness: if rpcbind/nfsd threads are stuck, even a “pingable” server won’t answer NFS calls.
  • Export configuration sanity: a mis-export can behave like intermittent failure when different clients negotiate different paths/versions.

Third: validate the network path like you don’t trust it (because you shouldn’t)

  • Packet loss and reordering matter more than raw bandwidth. NFS is latency-sensitive and hates microbursts.
  • MTU mismatches cause “mostly works” behavior that ruins afternoons.
  • LACP misconfigurations and asymmetric routing cause periodic stalls that look exactly like NFS server issues.

Decision point

If you have D-state tasks + kernel “server not responding” + retrans climbing, prioritize stability options
that avoid cluster-wide lockups (and then fix the underlying issue). If you have no retrans but
slow operations, look at server disk latency and NFS thread saturation.
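
Before diving into the longer tasks below, you can gather all three signals from one terminal. A minimal sketch; the one-hour window is an assumption, adjust it to your incident timeline:

cr0x@server:~$ journalctl -k --since "1 hour ago" | grep -ci "not responding"
14
cr0x@server:~$ nfsstat -c | head -n 3
Client rpc stats:
calls      retrans    authrefrsh
189432     814        0
cr0x@server:~$ ps -eo state,comm | awk '$1=="D"{print $2}' | sort | uniq -c
      1 qemu-system-x86
      2 vzdump
...output...

A nonzero count from the first command, climbing retrans from the second, and D-state processes from the third all point at transport/server stability rather than client tuning.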

Mount options that actually improve stability

Mount options are where opinions matter. Some choices are about performance. Many are about how your system behaves
when the world is on fire. Proxmox lives in the “when the world is on fire” category more often than we admit.

The baseline recommendation (stable defaults for Proxmox NFS storage)

For most Proxmox environments using NFS as a backup target or ISO store, start here:

cr0x@server:~$ cat /etc/pve/storage.cfg
nfs: nas-backup
        export /export/proxmox-backup
        path /mnt/pve/nas-backup
        server 10.10.20.10
        content backup,iso,vztmpl
        options vers=4.1,proto=tcp,hard,timeo=600,retrans=2,noatime,nodiratime,rsize=1048576,wsize=1048576
...output...

Let’s unpack the important bits (a quick way to apply them follows the list):

  • vers=4.1: v4.1 adds sessions and better recovery semantics than v4.0, and avoids some v3-era weirdness. It’s a good “modern default” when the server supports it.
  • proto=tcp: TCP handles loss and congestion with less chaos than UDP. If you’re still on UDP, I admire your confidence and fear your change window.
  • hard: keep it hard for VM disk storage and for backups if correctness matters. Soft mounts can return I/O errors mid-write. That’s not “timeout handling,” that’s “corruption roulette.”
  • timeo=600,retrans=2: timeo is in deciseconds, so 600 = 60 seconds per RPC attempt; retrans=2 caps how many retransmissions happen before the client logs a timeout. These happen to be the TCP defaults, but pinning them explicitly beats inheriting whatever the protocol picks. The combination reduces log spam and retry thrash during outages. You still block on hard mounts, but you do it more politely.
  • noatime,nodiratime: reduces metadata writes. Not a timeout fix, but removes pointless chatter.
  • rsize/wsize: using 1M can reduce RPC overhead on modern networks. If you see fragmentation or weird NIC issues, dial down.
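
If the storage is managed by Proxmox (as above), you don’t have to hand-edit storage.cfg: pvesm can set the option string directly. A minimal sketch, assuming the storage ID nas-backup from the example; note that changing options does not alter an already-mounted share, so unmount during a quiet window and let Proxmox remount it (pvestatd usually does this within seconds):

cr0x@server:~$ pvesm set nas-backup --options vers=4.1,proto=tcp,hard,timeo=600,retrans=2,noatime,nodiratime,rsize=1048576,wsize=1048576
cr0x@server:~$ umount /mnt/pve/nas-backup
cr0x@server:~$ findmnt -no OPTIONS /mnt/pve/nas-backup
rw,noatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys
...output...

The findmnt check is the part that matters: the options you think you set and the options in effect are not always the same thing (see Task 2 below).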

Hard vs soft: choose the failure you can live with

This is the decision people try to avoid. You can’t.

  • hard means: NFS operations retry forever. Your processes may hang. Your data is safer.
  • soft means: after some retries, NFS returns an error. Your processes may fail fast. Your data may be inconsistent if the application wasn’t expecting I/O failure mid-flight.

For VM disks on NFS, I strongly prefer hard. A soft mount can cause guest filesystem corruption if writes fail.
For backup targets, hard is still usually correct—unless your operational model requires jobs to fail quickly and retry later
and you accept partial backup failures as “normal.” Most shops don’t say that out loud, but some do it.

intr and “can I interrupt a hung mount?”

Historically, intr allowed signals to interrupt NFS operations. On kernels since 2.6.25 the option is accepted but ignored: only a fatal signal such as SIGKILL can interrupt a pending NFS request.
Don’t bet your incident response plan on it. Plan for safe reboot/evacuation paths instead.

timeo and retrans: what they really do (and what they don’t)

timeo is the base RPC timeout; retrans is how many times the client will retransmit an RPC request before it reports a timeout to the kernel log.
With hard, even after it reports a timeout, it keeps retrying. So why tune them?

  • Reduce storm behavior: aggressive retrans can generate RPC storms during server hiccups, making recovery slower.
  • Improve observability: a larger timeo reduces false “server not responding” during brief microbursts.
  • Make failover less dramatic: if you have HA NFS (or VIP movement), you want clients to tolerate the transition without melting down.

actimeo / attribute caching: a performance knob that can become a correctness story

Attribute caching reduces metadata round-trips. It can help performance, especially with directory-heavy operations.
But for shared directories where multiple clients modify the same paths (think: template caches, some backup rotation setups),
too-aggressive caching can create “where did my file go?” moments.

For Proxmox backup targets, you typically don’t need fancy attribute caching. If you do tune it, do it conservatively and measure.

lookupcache=positive and why it can reduce pain

With lookupcache=positive, the client caches only successful (positive) lookups and revalidates negative ones, so it stops remembering “file not found” answers too aggressively.
It can help when apps create files and immediately look them up across nodes.
It’s not magic, but it’s one of the few knobs that can fix “stale perception” issues without changing the server.
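
If you do tune these knobs, keep the change small and visible in the option string rather than scattered across nodes. A minimal sketch of a conservative combination for a shared backup/template store; actimeo=30 (which sets all four attribute-cache timers to 30 seconds) and lookupcache=pos are illustrative values, not a universal recommendation:

        options vers=4.1,proto=tcp,hard,timeo=600,retrans=2,noatime,actimeo=30,lookupcache=pos

Measure before and after: if directory listings get faster but other nodes start seeing stale entries, dial actimeo back down.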

nconnect: more connections, fewer head-of-line stalls (sometimes)

Newer Linux kernels (5.3 and later) support the nconnect mount option, which opens multiple TCP connections to the NFS server instead of a single one.
It can improve throughput and reduce head-of-line blocking when one connection experiences loss.
It can also amplify load on a struggling server and make your switch buffers cry.

Use it only after you’ve verified server CPU, NIC, and storage have slack. Start with nconnect=4, not 16.
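
A cheap way to see how many connections a mount is really using is to count established TCP sessions to the server’s NFS port. A minimal sketch, assuming the server IP from earlier examples; with nconnect=4 you should see four lines:

cr0x@server:~$ ss -tn | grep ':2049'
ESTAB 0      0      10.10.20.21:756     10.10.20.10:2049
ESTAB 0      0      10.10.20.21:892     10.10.20.10:2049
ESTAB 0      0      10.10.20.21:1020    10.10.20.10:2049
ESTAB 0      0      10.10.20.21:738     10.10.20.10:2049
...output...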

Options to avoid (unless you enjoy writing postmortems)

  • soft for VM disks: it’s a quick path to silent damage under the right failure.
  • proto=udp on modern networks: it’s not 1999, and your switches don’t love it.
  • noac as a “fix” for stale views: it can nuke performance and increase server load; treat it like chemotherapy.
  • excessively low timeo: it turns transient congestion into self-inflicted RPC storms.

Joke #2: If you set soft,timeo=1,retrans=1 to “avoid hangs,” congratulations—your storage now fails fast, like a startup’s first on-call rotation.

Pick the right NFS version (v3 vs v4.1) like you mean it

Proxmox doesn’t care about your philosophical preference. It cares whether I/O completes.
Version choice affects locking, recovery, and what “timeout” looks like.

When NFSv4.1 is the right default

  • Single export namespace and cleaner firewalling (typically 2049 only, depending on server).
  • Better session semantics that can make transient network blips less catastrophic.
  • Stateful locks that are usually more predictable for multi-client scenarios.

When NFSv3 is still reasonable

  • Legacy NAS appliances with shaky v4 implementations.
  • Specific interoperability constraints (some enterprise arrays have “special” ideas about v4 features).
  • Debuggability preference: v3 can be easier to reason about in certain packet traces because it’s simpler.

Locking considerations

If you store VM images on NFS, locking matters. For qcow2 especially, concurrent access is a disaster.
Make sure your setup ensures exclusivity: Proxmox does its part, but NFS lock semantics can still be a factor,
particularly in failover scenarios.

One practical rule

If you can run NFSv4.1 cleanly end-to-end, do it. If you can’t, run v3 cleanly and accept that you’re choosing “simple and stable” over “featureful.”
The worst option is “v4 sometimes,” where clients negotiate different behaviors across nodes.
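
Catching “v4 sometimes” means comparing what every node actually negotiated, not what one node reports. A minimal sketch; the node names pve1..pve3 are placeholders for illustration:

cr0x@server:~$ for h in pve1 pve2 pve3; do echo -n "$h: "; ssh "$h" findmnt -no FSTYPE,OPTIONS /mnt/pve/nas-backup | cut -d, -f1-3; done
pve1: nfs4 rw,noatime,vers=4.1
pve2: nfs4 rw,noatime,vers=4.1
pve3: nfs4 rw,noatime,vers=4.1
...output...

If one node comes back with nfs (v3) or a different vers=, you’ve found configuration drift worth fixing before touching anything else.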

Network realities: drops, MTU, pause frames, and why “it pings” is meaningless

Ping is a postcard. NFS is freight shipping with paperwork. You can have perfect ping and terrible NFS.
Timeouts often come from microbursts, buffer starvation, or retransmits that snowball under load.

What usually causes NFS “server not responding” in stable data centers

  • Packet loss under load (bad cable, marginal optics, oversubscribed ToR, congested uplink).
  • MTU mismatch (jumbo frames on one side, not the other). Works until it doesn’t, then fails like a magic trick.
  • Flow control / pause frame weirdness leading to short stalls that trigger RPC timeout cascades.
  • LACP hashing problems where one member link gets saturated while others idle.
  • NIC offload bugs (less common now, still real). TSO/GRO issues can manifest as strange latency spikes; a quick check of what’s enabled is shown after this list.
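
Checking current offload settings takes seconds and rules a whole class of weirdness in or out. A minimal sketch; bond0 is an assumption, point it at whatever interface carries storage traffic:

cr0x@server:~$ ethtool -k bond0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
...output...

Toggling offloads (ethtool -K) is a diagnostic step, not a fix: if disabling GRO or TSO makes the stalls vanish, chase the NIC driver and firmware versions instead of leaving offloads off forever.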

Operational stance

Treat NFS like storage traffic, not “just network.” Put it on a predictable path:
dedicated VLAN, consistent MTU, consistent bonding configuration, consistent QoS if you use it.
If you can’t isolate it, at least measure it.

Proxmox-specific failure modes: backups, migration, and clustered pain

Vzdump backups to NFS: why they hang differently than you expect

Proxmox backups can involve snapshots, compression, chunked writes, and sync operations.
When the NFS target stalls, the backup process can block in kernel I/O.
If you run multiple backups in parallel, you can amplify the stall into a self-made thundering herd.

Practical approach: cap concurrency, keep mount stable, and avoid tuning that creates retry storms.
If backups must be “always finish,” use hard mounts and operational backpressure (scheduling, concurrency limits).
If backups must “fail fast,” accept the risk and build retries and cleanup automation.
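
Concurrency is mostly a scheduling decision (fewer overlapping jobs), but /etc/vzdump.conf gives you a blunt per-job backpressure knob. A minimal sketch; the bwlimit value (KiB/s) is an assumption you should size against your own storage network, not a recommendation:

cr0x@server:~$ grep -vE '^#|^$' /etc/vzdump.conf
storage: nas-backup
bwlimit: 200000
ionice: 7
...output...

bwlimit caps how hard a single backup job can push the NFS path; ionice lowers the backup process’s local I/O priority so it loses gracefully when the node is busy.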

VM disks on NFS: latency spikes become guest incidents

Even small NFS stalls show up as guest latency. Databases notice. Filesystems notice. Humans notice.
If you place VM disks on NFS, you’re signing up for:

  • tight control over network behavior,
  • predictable server performance,
  • and mount options chosen for correctness over “not hanging.”

Cluster effects: one bad node can ruin the meeting

In a Proxmox cluster, shared NFS storage issues can present as:

  • multiple nodes logging timeouts simultaneously,
  • migration operations stuck,
  • GUI operations waiting on storage status refresh,
  • and the fun one: “everything looks healthy except it isn’t doing anything.”

Your goal is to make the failure mode predictable. Predictable is debuggable. Debuggable is fixable.

12+ practical tasks with commands, outputs, and decisions

These are field tasks. Each one includes: a command, an example output, what it means, and what decision you make.
Run them on the Proxmox node unless otherwise stated.

Task 1: Confirm what Proxmox thinks the NFS storage is

cr0x@server:~$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        196540280        90581232       96023420   46.07%
nas-backup        nfs     active       7812500000      2123456789     5600000000   27.18%
...output...

What it means: Storage is “active” from Proxmox’s point of view. That doesn’t guarantee it’s responsive; it just means the mount exists and looks usable.

Decision: If status flips between active/inactive during the incident, suspect mount flapping or server-side availability issues.

Task 2: Inspect the actual mount options in effect

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /mnt/pve/nas-backup
10.10.20.10:/export/proxmox-backup /mnt/pve/nas-backup nfs4 rw,noatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.20.21,local_lock=none
...output...

What it means: Confirms the options actually in effect. If something you set in storage.cfg (noatime, timeo, vers) is missing here, the mount never picked it up.

Decision: If you don’t see hard/timeo/vers as expected, fix storage.cfg or fstab and remount deliberately.

Task 3: Watch the kernel complain in real time

cr0x@server:~$ journalctl -kf | grep -i nfs
Dec 26 10:14:21 pve1 kernel: nfs: server 10.10.20.10 not responding, still trying
Dec 26 10:14:52 pve1 kernel: nfs: server 10.10.20.10 OK
...output...

What it means: This is classic retrans/timeout behavior with recovery.

Decision: If “still trying” repeats for minutes with no “OK,” treat it as a real outage. Start isolating: server health vs network.

Task 4: Find blocked tasks (D state) and identify who’s stuck

cr0x@server:~$ ps -eo pid,state,comm,wchan:32,args | awk '$2=="D"{print}' | head
23145 D qemu-system-x86  nfs_wait_bit_killable       /usr/bin/kvm -id 102 -name vm102 ...
30411 D vzdump           rpc_wait_bit_killable       vzdump 102 --compress zstd --storage nas-backup
...output...

What it means: Processes in D state are waiting on uninterruptible I/O. Killing them usually won’t work; they’re stuck in the kernel.

Decision: If QEMU is blocked, assume guest impact. Consider moving workloads off this node after NFS recovers; don’t just “kill -9” and hope.

Task 5: Check NFS client RPC stats for retrans and timeouts

cr0x@server:~$ nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
189432     814        0

Client nfs v4:
null         read         write        commit       open         open_conf    ...
0            48212        27111        135          324          0            ...
...output...

What it means: Retransmissions are a strong hint of network loss or server responsiveness issues.

Decision: If retrans climbs rapidly during normal load, prioritize network investigation. If retrans is low but performance is bad, look at server disk/CPU.

Task 6: Check per-mount NFS state and age (useful for “stale handle” narratives)

cr0x@server:~$ cat /proc/mounts | grep nas-backup
10.10.20.10:/export/proxmox-backup /mnt/pve/nas-backup nfs4 rw,noatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.20.21 0 0
...output...

What it means: Confirms it’s NFSv4 and shows core options. This is a quick “what are we even running?” check.

Decision: If you expect v4.1 and see v3 (or vice versa), you’ve found a configuration drift or negotiation issue.

Task 7: Verify NFS server exports from the client perspective

cr0x@server:~$ showmount -e 10.10.20.10
Export list for 10.10.20.10:
/export/proxmox-backup 10.10.20.0/24
/export/iso            10.10.20.0/24
...output...

What it means: Confirms export visibility. This doesn’t validate NFSv4 namespace behavior fully, but it catches obvious export changes.

Decision: If the export isn’t listed or access is narrowed unexpectedly, stop tuning mounts and fix exports/ACLs.

Task 8: Test I/O behavior directly on the mount (simple latency sniff test)

cr0x@server:~$ time sh -c 'dd if=/dev/zero of=/mnt/pve/nas-backup/.iotest bs=1M count=256 conv=fdatasync status=none'
real    0m5.842s
user    0m0.012s
sys     0m0.221s
...output...

What it means: This includes a sync to force the server to commit. If this spikes from seconds to minutes, you’re seeing the problem, not imagining it.

Decision: If this is slow but retrans is low, suspect server disk latency or NFS server thread starvation.

Task 9: Check network error counters on the Proxmox node

cr0x@server:~$ ip -s link show dev bond0
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped  missed  mcast
    9876543210 12345678  0       124      0       112233
    TX:  bytes  packets  errors  dropped  carrier collsns
    8765432109 11223344  0       0        0       0
...output...

What it means: RX drops on storage traffic are bad news. Dropped packets can produce retrans/timeouts.

Decision: If drops increase during backup windows, you’ve got congestion or NIC/switch buffer issues. Consider QoS, separate VLAN, or capacity fixes.

Task 10: Validate MTU end-to-end (the jumbo frame trap)

cr0x@server:~$ ping -c 3 -M do -s 8972 10.10.20.10
PING 10.10.20.10 (10.10.20.10) 8972(9000) bytes of data.
8972 bytes from 10.10.20.10: icmp_seq=1 ttl=64 time=0.431 ms
8972 bytes from 10.10.20.10: icmp_seq=2 ttl=64 time=0.438 ms
8972 bytes from 10.10.20.10: icmp_seq=3 ttl=64 time=0.422 ms

--- 10.10.20.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
...output...

What it means: This confirms jumbo frames work between client and server. If it fails with “Frag needed,” you have an MTU mismatch.

Decision: If jumbo fails, either fix MTU consistently everywhere or drop to 1500 everywhere. Mixed MTU is a reliability tax.

Task 11: Confirm the NFS server responds to RPC/NFS probes

cr0x@server:~$ rpcinfo -t 10.10.20.10 nfs 4
program 100003 version 4 ready and waiting
...output...

What it means: NFS service is reachable over TCP. If this blocks or fails during the incident, it’s either network or server RPC stack trouble.

Decision: If rpcinfo fails from multiple clients, focus on the server or the shared network segment rather than client tuning.

Task 12: Trace one hung process to confirm it’s stuck in NFS I/O

cr0x@server:~$ cat /proc/30411/stack
[<0>] rpc_wait_bit_killable+0x1d/0x60
[<0>] __rpc_execute+0x3a/0x90
[<0>] rpc_execute+0x44/0x50
[<0>] nfs4_call_sync+0x6a/0xa0
[<0>] nfs4_proc_write+0xa1/0x140
[<0>] nfs_file_write+0x10a/0x2a0
[<0>] vfs_write+0x1b4/0x2a0
[<0>] ksys_write+0x67/0xe0
[<0>] do_syscall_64+0x5b/0x110
...output...

What it means: The stack shows the process waiting in RPC/NFS write path. This isn’t “application slowness.” It’s storage path unresponsiveness.

Decision: Stop debating app tuning. Stabilize NFS path and consider pausing new backup jobs until recovery.

Task 13: Remount with safer options (controlled change)

cr0x@server:~$ umount /mnt/pve/nas-backup
umount: /mnt/pve/nas-backup: target is busy.
...output...
cr0x@server:~$ fuser -vm /mnt/pve/nas-backup 2>&1 | head
                     USER        PID ACCESS COMMAND
/mnt/pve/nas-backup: root      30411 f.... vzdump
/mnt/pve/nas-backup: root      23145 f.... qemu-system-x86
...output...

What it means: You can’t unmount because active processes are using it, potentially stuck.

Decision: Don’t force unmount a mount holding VM disks. If it’s only a backup target, consider stopping/timeout-handling jobs first, then remount during a quiet window.

Task 14: Check Proxmox task logs for storage-related stalls

cr0x@server:~$ tail -n 30 /var/log/pve/tasks/active
UPID:pve1:000076CB:0012ABCD:676D0B31:vzdump:102:root@pam:
status: running
starttime: 1766743800
...output...

What it means: The task is “running” but not making progress. Combine with D-state checks to confirm it’s blocked on storage.

Decision: If it’s blocked and the mount is hard, plan recovery: restore NFS availability rather than trying to kill the task.

Task 15: Server-side check (on the NFS server): NFS thread saturation

cr0x@server:~$ ps -eLo pid,cls,pri,rtprio,stat,comm | grep -E 'nfsd|rpc'
  8123 TS  19      - S    nfsd
  8124 TS  19      - R    nfsd
  8125 TS  19      - D    nfsd
...output...

What it means: If nfsd threads are in D state, they’re likely blocked on server storage (local disk, ZFS, RAID controller, etc.).

Decision: If server nfsd threads block, no mount option on the client will save you. Fix server storage latency or reduce load.
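
If the threads are merely all busy (R/S state) rather than blocked, raising the server’s nfsd thread count can help; if they’re in D state, it won’t. A minimal sketch for a Debian-style NFS server; appliances keep this setting elsewhere, and newer distributions use the [nfsd] threads= entry in /etc/nfs.conf instead:

cr0x@server:~$ cat /proc/fs/nfsd/threads
8
cr0x@server:~$ grep RPCNFSDCOUNT /etc/default/nfs-kernel-server
RPCNFSDCOUNT=16
...output...

The first number is what’s running right now; the second is what the service will start with next time. If they disagree, someone changed one and forgot the other.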

Three corporate-world mini-stories (all anonymized, all plausible)

1) Incident caused by a wrong assumption: “The NAS has redundant links, so the network can’t be it.”

A mid-sized company ran a Proxmox cluster with NFS-backed backups. The NAS had dual NICs bonded with LACP.
The team assumed redundancy meant resilience. Backups were scheduled overnight and “mostly worked” until a quarter-end push
increased load and suddenly backups started hanging on two out of five nodes.

The first assumption: if the NAS is reachable, it’s fine. Ping worked. The GUI loaded. The NAS dashboard showed “healthy.”
But on the Proxmox nodes, kernel logs showed repeating NFS “not responding.” Some vzdump processes were stuck in D state.

They spent hours tuning mount options—lower timeouts, higher timeouts, adding soft on one node “just to test.”
That node stopped hanging… and started producing partial backups with I/O errors. They nearly promoted a broken backup as “successful” because the job exited.
The backup system didn’t corrupt data quietly; it failed loudly. That was the only good part.

The real cause was an LACP hashing mismatch: the switch hashed based on L3/L4, the NAS hashed differently, and most NFS traffic landed on one physical link.
Under peak load, that link dropped bursts of packets. TCP recovered, but RPC latency spiked enough to trigger NFS retrans and “not responding.”

The fix was boring: align LACP hashing policies, validate with interface counters, and keep NFS on a predictable VLAN.
Only after the network was stable did the mount tuning matter—and then only mildly.

2) Optimization that backfired: jumbo frames + “nconnect everywhere”

Another shop wanted faster VM migrations and snappier backups. They enabled jumbo frames (MTU 9000) on the Proxmox nodes and the NFS server,
and they turned on nconnect=8 because someone saw a benchmark chart. The first day was glorious.
Then intermittent NFS timeouts started appearing during heavy backup windows.

The tricky part: it wasn’t a clean break. Most of the time it flew. Sometimes it stalled.
Retrans counts rose, but only during peak. The team blamed the NAS.
The NAS team blamed Proxmox. The network team blamed “unknown traffic.” Everyone was partially correct, which is the worst kind of correct.

The hidden issue was MTU inconsistency: the ToR switches supported jumbo, but one inter-switch link in the path was left at 1500.
Most traffic hashed around it. Some flows didn’t. Those flows suffered fragmentation/blackholing behavior depending on ICMP handling.
With nconnect, more flows existed, increasing the chance some hit the bad path. The optimization made the failure easier to trigger.

They fixed the MTU end-to-end, then dialed nconnect down to 4 after observing higher server CPU usage.
Performance stayed good. Timeouts stopped. Nobody got to keep the “we should just tune harder” narrative.

3) Boring but correct practice that saved the day: conservative mounts + concurrency caps

A company ran nightly backups of many VMs to an NFS target. Early on, they noticed the NAS sometimes paused for maintenance tasks
and would respond slowly for a minute or two. They accepted that as reality and engineered around it.

They did three things that were not exciting enough for a conference talk:
stable NFSv4.1 over TCP with hard,timeo=600,retrans=2, a dedicated backup window, and a strict concurrency limit so only a few VMs backed up at once.
They also had a dashboard that tracked client retrans and server disk latency side-by-side.

When a switch started dropping packets under load months later, backups slowed but didn’t implode.
The logs showed timeouts, but the system recovered without cascading into dozens of hung tasks.
The on-call had enough breathing room to identify rising RX drops and shift traffic before it became a crisis.

The post-incident review was refreshingly dull: confirm drops, replace a suspect optic, validate MTU and LACP, done.
The backup system never became the headline. That’s the goal.

Common mistakes: symptom → root cause → fix

1) Symptom: “server not responding, still trying” during backups; jobs hang for hours

Root cause: hard mount + server stall or packet loss. The client is doing the correct thing (retry forever).

Fix: keep hard, raise timeo (e.g., 600), keep retrans low (2–3), fix network loss and/or server storage latency. Reduce backup concurrency.

2) Symptom: backups “finish” but restore fails or files are missing

Root cause: soft mounts returning I/O errors mid-write; backup tooling doesn’t always interpret partial writes the way you hope.

Fix: avoid soft for backup targets unless you have explicit verification and retry logic. Prefer hard and operational scheduling.

3) Symptom: only one Proxmox node has NFS issues

Root cause: per-node NIC issues, bonding misconfig, cabling/optics, driver bugs, or a single switch port buffer problem.

Fix: compare ip -s link counters across nodes; swap cables/ports; validate bonding mode and hashing; check MTU.

4) Symptom: issues appear only at peak times

Root cause: congestion, microbursts, queue drops, server CPU saturation, or backing disk latency spikes.

Fix: measure retrans and RX drops; shape backup concurrency; consider separate VLAN/QoS; fix server bottlenecks (disk, CPU, NIC).

5) Symptom: “stale file handle” errors after server maintenance

Root cause: export moved, filesystem recreated, snapshot rollback, or inode changes under the export path.

Fix: avoid destructive operations under exports; use stable datasets/paths; remount clients after disruptive server-side changes; for v4, ensure consistent pseudo-root exports.

6) Symptom: migration hangs even though storage is “shared”

Root cause: shared storage is present but experiencing latency spikes or intermittent RPC stalls; migration hits synchronous I/O.

Fix: stabilize NFS (network, server), use NFSv4.1/TCP, and avoid heroic tuning. If you must migrate during instability, stop and fix storage first.

7) Symptom: GUI operations slow, storage status checks lag

Root cause: management operations touching mounted paths; NFS hangs can propagate into userland tools.

Fix: keep problematic NFS mounts out of critical paths; mount only where needed; ensure mount options and server stability; don’t put “everything” on one flaky NFS.

Checklists / step-by-step plan

Step-by-step: stabilize an existing Proxmox NFS backup mount

  1. Identify scope: one node or many? Use kernel logs and nfsstat -c.
  2. Confirm actual mount options: findmnt. Stop trusting config files.
  3. Ensure protocol sanity: prefer vers=4.1,proto=tcp unless you have a reason not to.
  4. Set conservative timeout behavior: hard,timeo=600,retrans=2 as a baseline for stability.
  5. Reduce metadata churn: noatime,nodiratime.
  6. Check MTU end-to-end: jumbo either works everywhere or belongs nowhere.
  7. Check for drops: ip -s link on clients; interface counters on switches (yes, actually look).
  8. Cap backup concurrency: fewer parallel vzdump jobs, especially during known NAS busy windows.
  9. Measure retrans trends: if retrans rises with load, stop blaming mount options and fix the transport/server.
  10. Test with forced sync writes: dd ... conv=fdatasync to catch commit latency.
  11. Schedule disruptive server tasks: scrubs, snapshots, replication, dedupe, whatever—don’t overlap with backup peak unless you like gambling.
  12. Document the chosen semantics: “hard mount; timeouts mean stall; incident response is restore service, not kill processes.” Make it explicit.

Checklist: when you store VM disks on NFS (be honest with yourself)

  • Dedicated storage network path (VLAN or physical), consistent MTU.
  • NFSv4.1 over TCP; consider nconnect only after baseline stability.
  • Hard mount. Always. If you want soft, you want a different storage design.
  • Server has predictable latency under write sync loads.
  • Monitoring includes retrans, server disk latency, and switch drops.
  • Planned response for NFS outage: what happens to guests, what do you do first, what do you never do.

Checklist: minimum monitoring that catches the real problem

  • Client retrans count rate (from nfsstat -c or node exporter equivalents).
  • Client RX/TX drops and errors (from ip -s link).
  • Server disk latency and queue depth (tooling depends on server stack).
  • NFS server thread health (nfsd threads not blocked).
  • Backup job duration and concurrency.

FAQ

1) Should I use soft to avoid Proxmox freezing?

Not for VM disks. For backups, only if you explicitly accept partial failures and you verify backups end-to-end.
Otherwise, keep hard and fix the underlying instability.

2) What’s a sane timeo and retrans for Proxmox?

A pragmatic baseline is timeo=600,retrans=2 for stability-focused mounts. Then measure.
If you have rapid failover and want quicker detection, lower timeo cautiously, but don’t create retry storms.

3) Is NFSv4 always better than NFSv3?

No. NFSv4.1 is often better when implemented well, especially for session recovery and simplified firewalling.
But a solid NFSv3 server beats a flaky NFSv4 server every day of the week.

4) Do rsize/wsize fix timeouts?

Not directly. They can reduce RPC overhead and improve throughput, which may reduce congestion-related stalls.
But if you have packet loss or server stalls, larger sizes can make symptoms sharper.

5) My NFS mount is “active” in Proxmox, but operations hang. Why?

“Active” means mounted and accessible at a basic level. NFS can be mounted while being non-responsive under certain operations.
Use kernel logs, D-state checks, and nfsstat -c to confirm the stall.

6) Can I safely unmount and remount NFS during an incident?

If it holds VM disks: generally no, not safely, not quickly. If it’s only a backup target and you can stop jobs cleanly, sometimes.
Your goal is to restore NFS service, not play tug-of-war with the VFS layer.

7) What causes “stale file handle” and how do I prevent it?

Server-side changes under the export path: rollback, recreation, moving datasets, or changing export roots.
Prevent it by keeping export paths stable and avoiding destructive operations under active exports.

8) Should I enable nconnect?

Only after you’ve proven baseline stability (no drops, sane latency, server has CPU headroom).
It can improve throughput and reduce head-of-line blocking, but it can also amplify server load and expose network path inconsistencies.

9) Do I need a dedicated storage network for NFS?

If NFS hosts VM disks or critical backups, yes—at least a dedicated VLAN with predictable MTU and limited noisy neighbors.
Shared “everything” networks are where reliability goes to be humbled.

10) What’s the single best thing to do when timeouts start mid-backup?

Stop starting new backup jobs. Reduce load while you diagnose. Then confirm whether the issue is retrans/drops (network) or disk/CPU stall (server).

Conclusion: next steps that survive contact with production

If Proxmox is timing out on NFS, don’t treat it like a mount-options scavenger hunt. Options matter, but they’re not your root cause.
Use them to shape failure behavior: fewer storms, clearer signals, safer semantics.

Practical next steps:

  1. Lock in a stable baseline mount: vers=4.1,proto=tcp,hard,timeo=600,retrans=2,noatime,nodiratime, plus sensible rsize/wsize.
  2. Run the fast diagnosis playbook during the next incident: confirm D-state, confirm retrans, check drops/MTU, then check server latency.
  3. Cap backup concurrency and avoid overlapping heavy server maintenance with backup windows.
  4. Instrument retrans and interface drops as first-class metrics. If you can’t see retrans rising, you’re debugging blind.
  5. When you need more performance, add capacity or fix topology before you add cleverness like aggressive caching or high nconnect.

NFS can be stable in Proxmox. It just needs adults in the room: conservative semantics, measured tuning, and a network that doesn’t improvise.
