Ubuntu 24.04 “Stale file handle” on NFS: why it happens and how to stop it

The error shows up at the worst possible time: deploys, backups, CI jobs, incident calls where everyone insists they “didn’t change anything.”
Your process tries to read a file and the kernel replies with a shrug: Stale file handle.

On Ubuntu 24.04, nothing about this is “new”—but the combinations that trigger it keep evolving: containers pinning paths, systemd automount,
aggressive export changes, NAS failovers, and storage teams “optimizing” layouts mid-day. Let’s treat it like a production bug: understand the
mechanism, isolate the failure mode, and eliminate it with boring, repeatable practices.

What “Stale file handle” actually means

On Linux, “Stale file handle” maps to ESTALE. In NFS terms, it’s the client saying:
“I’m holding a reference to a file/directory that the server no longer recognizes.”

That reference is a file handle—opaque bytes returned by the NFS server that uniquely identify an inode-ish thing on that server.
The client caches those handles, because asking the server to re-resolve paths for every operation would be a performance crime.
When the server later can’t resolve the handle, you get ESTALE.
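
If you want to see exactly what error code the kernel hands back, ESTALE is a plain errno. A quick way to confirm the mapping on an Ubuntu client (the path assumes the linux-libc-dev headers are installed, which they usually are):

cr0x@server:~$ grep ESTALE /usr/include/asm-generic/errno.h
#define ESTALE          116     /* Stale file handle */

Anything reporting errno 116 on Linux is this exact failure, no matter how the application dresses it up.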

The important part: this is not a permissions error, and it’s not “network flaky.” It’s a mismatch between what the client thinks
exists and what the server can map back to an object. Sometimes it’s caused by legitimate object removal. More often it’s caused by
server-side identity changing when you didn’t expect it.

In an outage, the kernel message is brutally honest: the server says “never heard of that handle,” and your process can’t proceed.
If your workload is a build system, a container runtime, or a database that expects filesystem semantics to behave consistently, you’ll see failures
that look random until you correlate them to NFS topology events.

Why it happens (the real causes)

File handles are identities, not paths

NFS mostly treats files as objects. Paths are resolved into file handles, then operations use handles.
That’s why a file can be renamed while open and still be read locally—Linux tracks the inode. With NFS, the client tracks a server-issued identity.
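
A throwaway local-disk illustration of “identity, not path”: hold a descriptor open, rename the file, and the descriptor still reads, because the kernel tracks the inode rather than the name. The paths below are examples.

cr0x@server:~$ echo "identity survives rename" > /tmp/handle-demo
cr0x@server:~$ exec 3< /tmp/handle-demo              # hold an open descriptor
cr0x@server:~$ mv /tmp/handle-demo /tmp/handle-demo.old
cr0x@server:~$ cat <&3                               # still readable through the old descriptor
identity survives rename
cr0x@server:~$ exec 3<&-                             # release it

On NFS the same trick works only as long as the server can still map the client’s handle back to an object. When it can’t, you get ESTALE.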

The core trigger: the server can’t map the handle anymore

That happens when:

  • The underlying filesystem object genuinely disappeared (delete, filesystem recreated, snapshot rollback, etc.).
  • The server changed its idea of identity (export moved, fsid changed, filesystem re-mounted differently, failover to a different backend).
  • The client’s cached handle refers to a different filesystem than the server now exposes at that path (classic “export change” foot-gun).

Common server-side events that invalidate handles

  • Export reconfiguration: changing /etc/exports, moving exports, switching from exporting a directory to exporting its parent, etc.
  • Filesystem replacement: reformatting, restoring from backup, recreating a dataset, re-creating a directory tree.
  • Snapshot rollback on the server: the path exists, but object identities rewound.
  • Failover to a different node: HA clusters where the new node serves “same path” but not the same object IDs.
  • Crossing mountpoints in exports: exporting a path that contains other mounts and later changing those mounts.

Client-side contributors (usually not the root cause, but they amplify it)

  • Long-lived processes that keep directory FDs and assume they stay valid forever (CI runners, log shippers, language servers).
  • Containers bind-mounting NFS paths; the container survives while the NFS identity changes underneath.
  • Automounters that hide mount/unmount churn until a background job hits a stale FD.
  • Aggressive caching assumptions (attribute caching, lookup caching) that make behavior look “sticky.”
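
A hedged way to see which processes, including containers living in their own mount namespaces, still reference an NFS mount is to grep the per-process mountinfo files. The server:path string and the output below are illustrative.

cr0x@server:~$ sudo grep -l "nfs01:/exports/build" /proc/[0-9]*/mountinfo | head -n 3
/proc/1/mountinfo
/proc/4112/mountinfo
/proc/28817/mountinfo

If a container’s namespace keeps listing the old source after the host has remounted, that container needs to be recycled too.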

NFSv3 vs NFSv4: the “stateful” misconception

NFSv4 is stateful in ways NFSv3 isn’t (locks, delegations, sessions), but file handles can still go stale. The protocol can negotiate a lot,
but it can’t resurrect an identity that the server cannot map to an object anymore.

One quote that operations folks tend to re-learn in pain: “Everything fails, all the time.” — Werner Vogels.
It’s blunt, not cynical. Design for it.

Two jokes, because humans require them

Joke #1: NFS file handles are like corporate badges—once security invalidates yours, you can still know the hallway, but you’re not getting in.

Joke #2: If you “fixed” stale file handles by rebooting clients, congratulations: you’ve invented the most expensive cache invalidation strategy.

Interesting facts and historical context (short, concrete)

  • NFS originated at Sun Microsystems in the 1980s; the core idea was “network transparency” for Unix files.
  • NFSv2 used 32-bit file sizes; large files pushed the evolution to NFSv3 and beyond.
  • NFSv3 is stateless on the server (mostly), which simplified recovery but put more burden on clients and file handles.
  • NFSv4 introduced a pseudo-root and a stronger namespace model; exports behave differently than v3 “just export a directory.”
  • ESTALE entered Unix errno tables alongside NFS support; Linux later dropped “NFS” from the message (“Stale NFS file handle” became “Stale file handle”) because other remote filesystems can return it too.
  • File handles are intentionally opaque so servers can change internal inode mapping without clients knowing—until the server can’t map it back.
  • Some NAS vendors encode filesystem IDs into handles; if those IDs change during failover, handles die even if paths look the same.
  • “Subtree checking” (a server feature) historically caused handle weirdness with renames; many admins disable it for sanity.
  • Linux clients cache dentries aggressively; this is good for performance and terrible for debugging when identities change underneath.

Fast diagnosis playbook (first/second/third)

You’re on-call. You don’t have time for a philosophy seminar about distributed filesystems. Here’s the shortest path to truth.

First: confirm it’s ESTALE and find the scope

  • Is it one host or many?
  • One mount or every NFS mount?
  • One directory subtree or random files everywhere?

Second: determine whether the server-side identity changed

  • Did exports change?
  • Did the filesystem get remounted, replaced, failed over, or rolled back?
  • Did someone “rebuild the share” while keeping the same path?

Third: decide recovery strategy by workload risk

  • If it’s a read-mostly workload: remount is often enough.
  • If it’s write-heavy or has locks (databases, build caches): stop the app cleanly, then remount, then restart. Avoid half-measures.
  • If it’s many clients at once: treat it as a server/export event, not “client flakiness.” Fix server identity first.

Then: prevent recurrence

  • Stop changing exports in place during business hours.
  • Pin stable filesystem identities (fsid strategy, stable backend datasets).
  • Use systemd automount thoughtfully, not as a magic carpet over underlying churn.

Practical tasks: commands, outputs, decisions (12+)

These are real checks you can run on Ubuntu 24.04 clients and servers. Each task includes: command, typical output, what it means,
and what decision to make.

Task 1: Verify the error is really ESTALE (client)

cr0x@server:~$ dmesg -T | tail -n 20
[Mon Dec 29 11:02:14 2025] NFS: stale file handle
[Mon Dec 29 11:02:14 2025] NFS: v4 server nfs01 returned a stale file handle

Meaning: The kernel NFS client received ESTALE from the server. This isn’t an application-level exception; it’s the filesystem layer.

Decision: Treat it as a consistency/identity problem. Start checking server/export events, not just network latency.

Task 2: Identify which mount is affected (client)

cr0x@server:~$ findmnt -t nfs,nfs4 -o TARGET,SOURCE,FSTYPE,OPTIONS
TARGET        SOURCE                 FSTYPE OPTIONS
/mnt/build    nfs01:/exports/build   nfs4   rw,relatime,vers=4.1,hard,timeo=600,retrans=2
/mnt/archive  nfs02:/exports/archive nfs4   ro,relatime,vers=4.2,hard,timeo=600,retrans=2

Meaning: Shows the exact server:path and options. The source string matters when correlating with export changes.

Decision: If only one mount is implicated, focus on that export’s backend. If multiple, suspect server-wide events or failover.

Task 3: Reproduce and capture the failing path (client)

cr0x@server:~$ ls -la /mnt/build/workdir
ls: cannot access '/mnt/build/workdir/.': Stale file handle
total 0

Meaning: Directory handle is stale. Often indicates a rename/move/export remap or a filesystem swap behind that directory.

Decision: Don’t blindly restart apps yet. Collect evidence: which directory, which mount, and whether other paths work.

Task 4: Check whether the NFS server is reachable and responding (client)

cr0x@server:~$ rpcinfo -t nfs01 nfs 4
program 100003 version 4 ready and waiting

Meaning: Basic RPC reachability. It doesn’t prove your export is sane, but it rules out “server down.”

Decision: If this fails, treat it as connectivity/firewall/RPC service issue first. ESTALE can be secondary noise.

Task 5: List exports from the server perspective (client)

cr0x@server:~$ showmount -e nfs01
Export list for nfs01:
/exports/build   10.20.0.0/16
/exports/shared  10.20.0.0/16

Meaning: Confirms what the server claims it exports (often more useful with NFSv3; still handy for sanity checks).

Decision: If the export disappeared or changed, you have a smoking gun: your client mount points to an old assumption.

Task 6: Confirm NFS version and negotiated parameters (client)

cr0x@server:~$ nfsstat -m
/mnt/build from nfs01:/exports/build
 Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.20.3.41,local_lock=none

Meaning: Shows the negotiated mount parameters. This helps you spot “soft” mounts, odd timeouts, or legacy versions.

Decision: If you see soft for critical data, plan a change. Soft mounts can convert transient issues into data corruption or partial writes.

Task 7: Determine whether the problem is confined to one subtree (client)

cr0x@server:~$ (cd /mnt/build && ls -la . && ls -la ./workdir) 2>&1 | sed -n '1,6p'
total 8
drwxr-xr-x  4 root root 4096 Dec 29 10:58 .
drwxr-xr-x  3 root root 4096 Dec 29 10:58 ..
ls: cannot access './workdir': Stale file handle

Meaning: Mount root is fine; a subdirectory handle is stale. Often caused by server-side rename, snapshot rollback, or replacing that subtree.

Decision: Escalate to the server/storage owner: “This directory’s identity changed.” A remount may help, but also hunt down the server-side event.

Task 8: Identify processes holding FDs on the stale path (client)

cr0x@server:~$ sudo lsof +D /mnt/build/workdir 2>/dev/null | head
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3  4112 ci     cwd  DIR  0,1234      0  862  /mnt/build/workdir (stale)
make     4188 ci     cwd  DIR  0,1234      0  862  /mnt/build/workdir (stale)

Meaning: These processes have their current working directory or an open FD pointing at the stale directory.

Decision: For safe recovery, stop these processes before remounting. Otherwise they’ll keep failing, or worse, keep retrying and stalling pipelines.

Task 9: Try a clean remount on the client (client)

cr0x@server:~$ sudo umount /mnt/build
umount: /mnt/build: target is busy.

Meaning: Something is holding the mount busy (open files, cwd, bind mounts into containers).

Decision: Don’t force it blindly. Find holders (next task) and stop them, or use a lazy unmount only as a tactical move.

Task 10: Find what keeps the mount busy (client)

cr0x@server:~$ sudo fuser -vm /mnt/build
                     USER        PID ACCESS COMMAND
/mnt/build:          root     kernel mount /mnt/build
                     ci        4112 ..c..  python3
                     ci        4188 ..c..  make

Meaning: Lists processes using the mount. The ACCESS flags indicate how they touch it (cwd, file open, etc.).

Decision: Stop/kill the offenders, or coordinate with workload owners. If this is a shared build host, a polite drain beats a surprise kill.

Task 11: Use lazy unmount when you must (client)

cr0x@server:~$ sudo umount -l /mnt/build

Meaning: Detaches the mount from the namespace; existing references linger until processes release them.

Decision: Use this only if you’re already in incident mode and need to restore service. Follow up by restarting affected services to drop stale FDs.

Task 12: Remount with explicit options and test (client)

cr0x@server:~$ sudo mount -t nfs4 -o vers=4.1,hard,timeo=600,retrans=2 nfs01:/exports/build /mnt/build
cr0x@server:~$ stat /mnt/build/workdir
  File: /mnt/build/workdir
  Size: 4096       Blocks: 8          IO Block: 1048576 directory
Device: 0,1234     Inode: 862         Links: 12

Meaning: If stat works again, client-side recovery succeeded. If it still errors, the server is still presenting a different identity or broken export.

Decision: If remount fixes it, schedule prevention work anyway. Remounting is recovery, not a cure.

Task 13: Check export configuration and reload events (server)

cr0x@server:~$ sudo exportfs -v
/exports/build   10.20.0.0/16(rw,wdelay,root_squash,no_subtree_check,sec=sys,fsid=101)
/exports/shared  10.20.0.0/16(rw,wdelay,root_squash,no_subtree_check,sec=sys,fsid=102)

Meaning: Shows the effective export options. Note fsid presence—stability matters, especially with NFSv4 exports and HA.

Decision: If fsid is absent or changed recently, consider pinning it. If export paths changed, expect stale handles until clients remount.

Task 14: Inspect server logs for re-export/restart/FS events (server)

cr0x@server:~$ sudo journalctl -u nfs-server -u rpcbind -u nfs-idmapd --since "2 hours ago" | tail -n 25
Dec 29 10:41:08 nfs01 systemd[1]: Stopped NFS server and services.
Dec 29 10:41:08 nfs01 systemd[1]: Starting NFS server and services...
Dec 29 10:41:09 nfs01 exportfs[2143]: exporting 10.20.0.0/16:/exports/build
Dec 29 10:41:09 nfs01 systemd[1]: Started NFS server and services.

Meaning: Confirms service restarts and export reloads. Restarts alone don’t always cause ESTALE, but they correlate with backend remounts and failover actions.

Decision: If you see restarts around the incident time, look for what triggered them: maintenance, config management, package upgrades, failover scripts.

Task 15: Confirm the exported path is the filesystem you think it is (server)

cr0x@server:~$ findmnt -T /exports/build -o TARGET,SOURCE,FSTYPE,OPTIONS
TARGET        SOURCE          FSTYPE OPTIONS
/exports/build /dev/sdb1      ext4   rw,relatime

Meaning: Verifies what backs the export. If this source device/dataset changed since yesterday, handles can go stale even if the path stayed the same.

Decision: If the backing filesystem changed, treat the event as a breaking change. Communicate to client owners and plan remount/restart windows.

Task 16: Detect snapshot rollback / dataset recreation hints (server)

cr0x@server:~$ sudo tune2fs -l /dev/sdb1 | sed -n '1,12p'
tune2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   builddata
Filesystem UUID:          6f8f2b7f-7c3f-4b0e-9b8b-3a8e5f7e0c2a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit

Meaning: UUID changes are a huge clue that the filesystem was recreated. A rollback might not change UUID, but recreation usually does.

Decision: If UUID changed, stop arguing about “no changes.” Something replaced the filesystem. Expect stale handles until clients remount and apps reopen paths.

Task 17: Check client RPC and retrans stats for a noisy network (client)

cr0x@server:~$ nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
153209     18         153227

Meaning: High retrans can indicate packet loss or congestion. This won’t directly cause ESTALE, but it can make recovery slower and symptoms messier.

Decision: If retrans is high, you have two problems: an identity problem and a transport problem. Fix identity first, then stabilize the network.

Task 18: If systemd automount is involved, confirm mount unit state (client)

cr0x@server:~$ systemctl status mnt-build.automount --no-pager
● mnt-build.automount - Automount mnt-build
     Loaded: loaded (/etc/systemd/system/mnt-build.automount; enabled; preset: enabled)
     Active: active (waiting) since Mon 2025-12-29 10:12:10 UTC; 51min ago
      Where: /mnt/build

Meaning: Automount may unmount when idle and remount later. That’s fine—until something keeps a stale FD while the mount gets replaced.

Decision: If you see frequent mount churn, consider whether your workload is compatible. Long-lived daemons + automount can be a bad marriage.
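
For reference, a minimal fstab-based automount entry looks like the sketch below; the server, export, and options are placeholders to adapt. By default systemd does not idle-unmount an automount; adding x-systemd.idle-timeout introduces exactly the churn that bites long-lived daemons, so use it deliberately.

# /etc/fstab — hypothetical entry; adjust server, export, and options to your environment
nfs01:/exports/build  /mnt/build  nfs4  vers=4.1,hard,timeo=600,retrans=2,_netdev,x-systemd.automount  0  0

cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl start mnt-build.automount   # assuming nothing is currently mounted at /mnt/build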

Common mistakes: symptom → root cause → fix

“Only one directory errors; the rest of the share is fine”

Symptom: ls works at mount root, but a specific subdirectory gives “Stale file handle.”

Root cause: That subtree was renamed, replaced, rolled back via snapshot, or is a mountpoint that changed beneath an exported parent.

Fix: Remount the client to flush handles. If it recurs, stop exporting paths that contain volatile mountpoints; export stable filesystem boundaries.

“It started right after we ‘cleaned up’ /etc/exports”

Symptom: Many clients see ESTALE soon after an export change.

Root cause: Export path semantics changed (parent vs child export, cross-mount export, fsid changes). Clients hold handles to objects served under the old export mapping.

Fix: Treat export changes as breaking changes. Coordinate a remount/restart window for clients. Pin fsid values and keep export topology stable.

“Rebooting the NFS server fixed it, but it came back”

Symptom: Temporary relief, then recurring stale handles.

Root cause: Backend failover or storage lifecycle actions (rollback, rebuild, replication cutover) are changing object identity repeatedly.

Fix: Fix the lifecycle process: stable backends, consistent filesystem IDs, and controlled cutovers with client coordination. Reboots are symptom management.

“It happens during failover tests”

Symptom: HA event triggers widespread ESTALE.

Root cause: The failover node presents the same export path but different underlying filesystem identity or inconsistent fsid mapping.

Fix: Design HA with stable identity: same backend replicated with consistent IDs, or accept that clients must remount and apps must restart as part of failover runbooks.

“It’s a Docker/Kubernetes problem”

Symptom: Pods report stale handles; nodes look fine; storage team points at Kubernetes.

Root cause: Containers keep long-lived mounts and directory FDs; the NFS server side changed identity. Kubernetes just makes it easier to keep things running long enough to notice.

Fix: Stop treating NFS as an infinitely mutable backing store. If you must change exports/backends, roll pods/nodes to refresh handles. Prefer stable PVs and avoid moving exports.

“We used ‘soft’ mounts to avoid hangs, now we get weird build artifacts”

Symptom: Intermittent failures, partial outputs, and sometimes stale handles after stress.

Root cause: Soft mounts allow operations to fail mid-stream; the application wasn’t written to handle partial I/O semantics.

Fix: For critical data, use hard mounts and tune timeouts instead. If you need non-blocking behavior, redesign the workflow (local staging, retries at app layer).

Three corporate mini-stories from the trenches

Incident #1: the wrong assumption (“Paths are identities”)

A company ran a monorepo build farm on Ubuntu. The build workers mounted /mnt/build from an NFS server.
One Friday, a storage engineer migrated the share to a new volume. They kept the same directory path on the server: /exports/build.
The assumption was simple and very human: if the path is the same, clients won’t care.

Monday morning, CI started failing in a way that looked like a flaky compiler. Random jobs died while trying to stat() temp directories.
Some jobs worked, some didn’t. Retries sometimes succeeded. The build team blamed the CI scheduler. The scheduler team blamed the workers.
The storage team blamed “Linux caching.”

The clue was that only long-lived workers failed. Freshly rebooted workers were fine. On affected hosts,
lsof showed old cwd handles inside the workspace pointing at stale directories. Remounting fixed it instantly.
The server migration didn’t “break NFS.” It changed the object identities behind a path.

The fix wasn’t heroic: coordinate share migrations with a client refresh window. That meant a controlled drain of workers,
unmount/remount, then rejoin. The postmortem action item was even more boring: treat NFS backends like API versions.
If you change the identity map, clients must be recycled.

Incident #2: the optimization that backfired (exporting the parent to “simplify”)

Another org had multiple exports: /exports/app1, /exports/app2, /exports/app3.
Someone decided this was “too many mounts.” The proposal: export just /exports and let clients use subdirectories.
One mount to rule them all. The change rolled out gradually, because configuration management made it easy to “just switch.”

A month later, stale file handles popped up in only one app, and only during deploys. It was maddening.
The deploy process created a new release directory, flipped a symlink, then cleaned up old releases. Standard stuff.
On local disks it’s fine. On NFS, it’s mostly fine—until you export a parent that contains other mountpoints and you reshuffle them.

The hidden detail: /exports/app2 wasn’t a directory on the same filesystem anymore; it had become a separate mount.
The “single export” now crossed filesystem boundaries. During a maintenance window, the mount backing app2 was unmounted and remounted.
Clients that had cached handles into that subtree started getting ESTALE, while other apps stayed healthy.

Rolling back to separate exports fixed it. Not because “more mounts are better,” but because each export now mapped to a stable filesystem boundary.
The optimization reduced config complexity and increased incident complexity. That trade rarely pays.

Incident #3: the boring practice that saved the day (stable IDs and coordinated remounts)

A finance-ish company ran NFS for shared tooling and build artifacts. They had an HA pair and did regular failover tests.
Early on, they accepted that failovers were messy: clients would sometimes throw stale handles, and people would “just reboot stuff.”
Then they got serious and wrote a runbook that treated identity stability as a first-class requirement.

They pinned fsid values in exports, kept export paths tied to stable filesystem boundaries, and refused in-place export topology changes.
When backends had to move, they scheduled it like a schema migration. There was a change ticket, a client drain, and a coordinated remount.
No heroics, just discipline.

During a later storage firmware incident, the HA pair failed over unexpectedly. Some clients still needed a remount,
but the blast radius was smaller: fewer stale handles, faster recovery, and fewer “ghost” processes holding old directory FDs.
The runbook worked because it didn’t depend on luck or tribal knowledge.

The take-away: reliability isn’t a secret feature you toggle. It’s a collection of constraints you agree not to violate, even when you’re in a hurry.

How to stop it: prevention patterns that work

1) Keep exports stable and boring

The single biggest lever: don’t change the identity mapping behind an export path casually.
If you must migrate data, do it in a way that preserves object identity (often impossible), or accept that clients must refresh.

  • Avoid swapping filesystems under the same export path without a client maintenance plan.
  • Avoid exporting a parent directory that contains mountpoints that can be remounted independently.
  • Prefer exporting the root of the backing filesystem (or a stable dataset boundary), not a random directory.

2) Use NFSv4 with a consistent namespace strategy

For NFSv4, treat the server namespace as an API. Changing the pseudo-root structure or moving exports around breaks clients in ways that feel non-local.
Keep the namespace stable. If you need a new layout, create a new export and migrate clients deliberately.

3) Pin fsid where appropriate (server)

On Linux NFS servers, explicit fsid= values in /etc/exports can help keep export identity consistent across reboots and HA events
(depending on architecture). It’s not a magic spell, but it prevents accidental renumbering.
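
A hedged sketch of what pinned identities look like in /etc/exports; the paths, client network, and fsid numbers are examples, and in an HA pair the same values must exist on both nodes.

# /etc/exports — example entries with pinned fsid values
/exports/build   10.20.0.0/16(rw,root_squash,no_subtree_check,sec=sys,fsid=101)
/exports/shared  10.20.0.0/16(rw,root_squash,no_subtree_check,sec=sys,fsid=102)

cr0x@server:~$ sudo exportfs -ra    # re-export in place, no nfs-server restart
cr0x@server:~$ sudo exportfs -v     # confirm the effective options took

Note that changing an existing fsid is itself an identity change: pin it once, early, and then leave it alone.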

4) Don’t “soft mount” your way out of engineering

For reliability, hard mounts are the Linux default for good reason: they preserve POSIX-ish expectations, and writes don’t randomly fail mid-flight.
If you need bounded failure, implement it in the application (timeouts, retries, local staging), not by letting the filesystem lie.

5) Plan for client refresh: restart services that hold stale references

Even after remount, long-running daemons may keep file descriptors pointing at stale objects. Identify them and restart them as part of recovery.
This is especially true for: CI runners, language servers, application servers reading templates/config, and anything watching directories.
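
A hedged helper for that step: list the PIDs holding open files under the mount and map them to their systemd units, so you restart services instead of playing whack-a-mole with PIDs. The mount point and the output are illustrative.

cr0x@server:~$ for pid in $(sudo lsof -t +f -- /mnt/build 2>/dev/null | sort -u); do
>   systemctl status "$pid" --no-pager 2>/dev/null | head -n 1
> done | sort -u
● ci-runner.service - CI build agent
● log-shipper.service - Log shipping daemon

Restart those units after the remount and the stale descriptors go away with them.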

6) Treat storage lifecycle operations as breaking changes

Snapshot rollback, filesystem restore, replication cutover, and “rebuild the share” are not transparent.
If your storage process changes object identity, it needs:

  • an announcement (who is impacted),
  • a recovery plan (remount/restart),
  • a validation step (test paths from a canary client).

7) If you run NFS under Kubernetes, embrace recycling

When the server identity changes, pods won’t magically heal themselves. Your best defense is an operational pattern:
roll deployments (or even nodes) in a controlled way after storage events. It sounds crude. It’s also effective.
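
A hedged example of what “embrace recycling” looks like in practice; the namespace and deployment names are placeholders.

cr0x@server:~$ kubectl -n ci rollout restart deployment/build-workers
cr0x@server:~$ kubectl -n ci rollout status deployment/build-workers --timeout=10m

Pair it with a canary pod that stats a known path on the share, so you know when the roll actually fixed things.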

8) Observability: log the “identity events,” not just latency

Everyone graphs IOPS and throughput. Fewer teams graph: export reloads, filesystem UUID changes, failover events, snapshot rollbacks.
Those are the events that correlate with stale handles. Put them in your incident timeline.

Checklists / step-by-step plan

Checklist A: During an incident (client-side)

  1. Confirm kernel sees it: check dmesg for stale handle messages.
  2. Identify the mount: use findmnt and nfsstat -m.
  3. Scope it: one subtree or whole mount?
  4. Find holders: lsof +D (careful on large trees) or fuser -vm.
  5. Stop workloads holding stale FDs (or drain the node).
  6. Unmount cleanly; if busy and you must proceed, use umount -l and then restart the offending services.
  7. Remount and verify with stat and a small read/write test appropriate to the share.
  8. If it recurs quickly, stop: it’s likely a server/export identity churn problem.

Checklist B: During an incident (server-side)

  1. Check for service restarts/export reloads in journalctl.
  2. Verify effective exports with exportfs -v.
  3. Confirm what backs the export: findmnt -T on the exported path.
  4. Confirm no recent filesystem recreation clues (UUID changes, new datasets, remount logs).
  5. If HA: confirm whether a failover occurred and whether the new node serves identical backend identity.
  6. Communicate clearly: “Clients must remount and services must restart” is a valid operational instruction when identity changed.

Checklist C: Prevention rollout (change plan)

  1. Inventory exports and their backing filesystems. Document which are stable boundaries.
  2. Decide namespace strategy (especially for NFSv4): stable pseudo-root layout, avoid moving exports.
  3. Pin fsid values where needed; keep them stable across nodes in HA designs.
  4. Define client recovery procedure: remount + restart specific services.
  5. Add canaries: one or two clients that run periodic stat/find checks and alert on ESTALE (a sketch follows this checklist).
  6. Run a failover or export-change game day in daylight, with owners present.
  7. Write the runbook and make it the default, not folklore.
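
A minimal canary sketch for step 5, assuming a couple of known paths on the mounts you care about; the paths and the alerting hook are placeholders to replace.

#!/usr/bin/env bash
# ESTALE canary: stat known paths and log when the kernel reports a stale handle.
# PATHS and the alert hook are placeholders; run from cron or a systemd timer.
set -u
PATHS=(/mnt/build/.canary /mnt/archive/.canary)
for p in "${PATHS[@]}"; do
  err=$(stat "$p" 2>&1 >/dev/null)
  if printf '%s\n' "$err" | grep -qi 'stale file handle'; then
    logger -t nfs-canary -p user.err "ESTALE on $p"
    # wire in your real alerting here
  fi
done

Keep the canary dumb. Its only job is to turn “identity changed” into a timestamped alert you can line up against server events.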

FAQ

1) Is “Stale file handle” an Ubuntu 24.04 bug?

Usually no. It’s the Linux NFS client faithfully reporting that the server invalidated an object identity. Ubuntu 24.04 just runs modern kernels
and modern workloads that make identity churn easier to trigger and harder to ignore.

2) Why does remounting fix it?

Remounting forces the client to drop cached handles and re-resolve paths. If the server is now consistent and stable, new handles work.
If the server keeps changing identity, the problem returns.

3) Can attribute caching options prevent stale handles?

Not really. Attribute caching affects how quickly metadata changes are noticed (mtime/size/permissions). Stale handles happen when the server
can’t map the handle at all. You can change caching and still get ESTALE if identities change.

4) Does NFSv4 eliminate stale handles?

No. NFSv4 improves many things (sessions, locking model, namespace), but it cannot keep a handle valid if the server’s backend identity changes.

5) Is this caused by network issues?

Packet loss and congestion can make NFS feel broken, but they don’t usually produce ESTALE directly. They can, however, increase mount churn,
timeouts, retries, and “recovery” actions that coincide with identity changes. Diagnose both, but don’t confuse them.

6) What’s the safest client mount option set for production?

There isn’t a single universal set, but a sane baseline for critical mounts is: NFSv4.1+, hard, TCP, reasonable timeo,
and avoid exotic tweaks until you have a measured reason. Reliability beats cleverness.
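
As a hedged example, that baseline in /etc/fstab might look like the line below (it mirrors the options shown in Task 12); the server, export, and mount point are placeholders.

# /etc/fstab — example baseline for a critical NFS mount
nfs01:/exports/build  /mnt/build  nfs4  vers=4.1,proto=tcp,hard,timeo=600,retrans=2,_netdev  0  0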

7) Why do only long-running processes fail while new shells work?

Long-running processes keep open file descriptors or a current working directory pointing at objects that became invalid.
New shells resolve paths fresh and get new handles, so they appear “fine.” That split is a classic stale handle signature.

8) How do we handle stale handles in Kubernetes?

Assume you’ll need to recycle something: restart pods that touched the share, possibly drain/reboot nodes in worst cases.
More importantly: prevent identity churn by keeping exports/backends stable and treating storage cutovers as coordinated operations.

9) Can we “clear” stale handles without unmounting?

Sometimes you can work around by cd out of the stale directory, reopening files, or restarting the service.
But if the stale references are widespread, unmount/remount is the clean reset.
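
A hedged illustration of that workaround; it only helps if the path still exists on the server and can be re-resolved.

cr0x@server:~$ cd /                        # drop the stale working-directory reference
cr0x@server:~$ cd /mnt/build/workdir       # re-resolve the path; the client asks the server for a fresh handle
cr0x@server:~$ pwd
/mnt/build/workdir

Any process that keeps its old descriptors, though, still needs a restart.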

10) What should we tell app teams when this happens?

Tell them the truth in operational terms: “The NFS server identity changed; your process holds stale references.
We will remount and restart your service to reopen paths.” Don’t sell it as “random NFS flakiness.”

Conclusion: practical next steps

“Stale file handle” is what happens when you treat NFS like a magic folder name instead of a distributed system with identity.
The fix is rarely exotic. It’s usually discipline: stable exports, stable backends, and coordinated client refresh when identities change.

  1. Right now: use the fast diagnosis playbook, identify the mount and the subtree, and remount cleanly after stopping FD holders.
  2. This week: audit export topology and stop exporting paths that cross volatile mountpoints. Pin stable identity where appropriate.
  3. This quarter: turn migrations, rollbacks, and failovers into runbooks with canary validation and explicit client restart/remount steps.

You don’t have to love NFS. You do have to operate it like it’s real infrastructure. Because it is.
