ZFS NFS Sync Semantics: Why Clients Change Your Write Safety

You built a ZFS filer. You set sync=standard like a responsible adult. You even bought “enterprise” SSDs for the SLOG.
Then a database team mounts the export, runs a workload, and suddenly your write latency graph looks like a seismograph. Or worse: you get a
post-crash ticket that begins with “we lost committed transactions.”

The punchline is cruel: on NFS, the client can change what “safe write” means on the server. Not by malice—by protocol semantics, mount options,
and application behavior. ZFS is deterministic; NFS clients are… creative.

The core idea: ZFS can only honor what the client actually asks for

ZFS gives you a clean contract: if an operation is synchronous, ZFS will not acknowledge it until it’s safe according to its rules—meaning it’s
in stable storage as described by ZFS’s intent log (ZIL) and the pool’s write ordering. If it’s asynchronous, ZFS is free to buffer.

NFS complicates this because the server isn’t in charge of when an application thinks a write is “committed.” The client decides when to issue
stable writes, when to send a COMMIT, and whether to lie (accidentally) via caching policy or mount options. Many applications don’t
call fsync() as often as you imagine; many libraries “batch durability” until checkpoints; and many NFS clients attempt performance
tricks that are perfectly legal in the protocol but surprising to storage engineers.

Here’s the uncomfortable operational truth: you can run the same ZFS server configuration and get materially different durability outcomes
depending on the client OS, NFS version, mount options, and the application’s flush behavior. Two clients can mount the same export and get
different crash-loss profiles.

If you manage shared storage, you have to treat NFS clients as part of the write path. They’re not “just consumers.” They’re participants in
your durability protocol—often without knowing they signed up.

Interesting facts and history that still matter

  • NFS started as “stateless” by design (v2/v3 era). The server didn’t keep per-client session state, which made recovery and scaling easier, but pushed durability complexity to clients.
  • NFSv3 introduced the COMMIT procedure. It exists because WRITE replies can represent “unstable” storage; COMMIT asks the server to push those bytes to stable storage.
  • NFSv4 shifted toward stateful operation. It added locks, delegations, and a more integrated model—yet “when is it on stable storage?” is still a negotiated reality.
  • ZFS intentionally decouples “acknowledge” from “on-disk TXG commit.” Sync writes can be satisfied by the ZIL without waiting for the next transaction group (TXG) sync.
  • The ZIL is not a write cache for everything. It only records what’s needed to replay synchronous operations after a crash; it’s about correctness, not acceleration by default.
  • A separate SLOG device is just “ZIL on faster stable media.” It doesn’t store your actual data long-term; it stores intent records until the TXG commits.
  • “sync=disabled” exists because people kept asking for it. It also exists because sometimes you want speed more than truth. Your future incident report will decide whether that was wise.
  • Linux and the various UNIX NFS clients differ in how aggressively they issue COMMIT. Some will happily buffer and batch; others force stability more often, depending on mount options and workload patterns.
  • Write ordering and barriers matter even with a good SLOG. If the device lies about flushes, you get acknowledgements for writes that were never actually stable.

NFS write semantics in practice: unstable, stable, and “I swear it’s fine”

The NFS promise is not “every write is durable”

NFS is not a block device protocol. It’s a file protocol with a contract that depends on what operations the client requests and what semantics
it expects.

With NFSv3, a client can send WRITE operations that the server may buffer. The WRITE reply indicates how committed the data is: FILE_SYNC or
DATA_SYNC mean it reached stable storage, while UNSTABLE means it may still be sitting in server memory. If it’s unstable, the client is
expected to send a COMMIT to make it stable before treating it as durable.

With NFSv4, the model evolves but the same fundamental question remains: what does the server promise at the moment it replies? The client may
use compound operations and different caching rules, but it still decides when to demand stability.
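
If you want to see this negotiation rather than infer it, capture the traffic and inspect the WRITE and COMMIT calls directly. A minimal sketch, assuming NFS over TCP port 2049; the interface name and capture path are illustrative:

# capture a short sample of NFS traffic for offline inspection
tcpdump -i eth0 -s0 -w /tmp/nfs-sample.pcap port 2049

# open the capture in Wireshark: each NFSv3 WRITE call shows the stability level
# the client asked for (UNSTABLE, DATA_SYNC, FILE_SYNC), and the COMMIT calls
# show when it finally demanded stable storage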

Stable write is a negotiation, not a vibe

“Stable” in NFS isn’t a philosophical concept. It’s concrete: stable means the server has put the write in non-volatile storage such that it
survives a crash consistent with the server’s rules. That can be battery-backed RAM, NVRAM, a disk with proper cache flush, or—on ZFS—typically
the ZIL/SLOG path for synchronous semantics.

But the client doesn’t always ask for stable. It might issue unstable writes and send COMMIT later. Or it might rely on close-to-open semantics
and attribute caching, not durability semantics. Or the app might “commit” at the database layer while letting the OS buffer.

Two exact words that trigger very different worlds: fsync and O_DSYNC

Applications express durability needs using syscalls like fsync(), fdatasync(), and flags like O_SYNC /
O_DSYNC. On a local filesystem, these map fairly directly to “don’t lie to me.”

Over NFS, those calls turn into protocol-level operations that may include COMMIT or stable writes—depending on client settings. If the client
chooses to coalesce or delay COMMIT, your server might be “correct” but your app’s expectations might be violated after a crash.
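
Two client-side checks make this concrete. A sketch with illustrative paths and PIDs: first generate writes that are unambiguously synchronous, then check whether a running application ever asks for durability at all.

# write through the NFS mount with O_DSYNC semantics; this latency is the real
# cost of a stable write, not the buffered-write cost (path is an example)
dd if=/dev/zero of=/mnt/db/dsync-test bs=4k count=1000 oflag=dsync

# watch whether a live process actually calls fsync/fdatasync (PID is an example)
strace -f -e trace=fsync,fdatasync -p 12345

If the dd latency looks nothing like the application’s “committed” latency, the application probably isn’t asking for stability as often as its owners believe.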

Joke #1: NFS durability is like office politics—everyone agrees in the meeting, and then the real decisions happen in private.

ZFS side: sync, ZIL, SLOG, txg, and where the truth lives

Transaction groups: the big rhythm underneath everything

ZFS batches changes in memory and periodically commits them to disk in transaction groups (TXGs). This is a core performance strategy: it turns
random small writes into more sequential I/O patterns and lets ZFS optimize allocation.

The TXG cadence is typically on the order of a few seconds. That’s fine for asynchronous writes: the app gets an ACK quickly, and ZFS commits
later. But for synchronous writes, waiting for the next TXG commit is too slow. That’s where the ZIL comes in.

ZIL: intent logging for synchronous operations

The ZIL (ZFS Intent Log) records enough information about synchronous operations so they can be replayed after a crash. It’s not a full journal
of all changes. It’s a safety net specifically for operations that were acknowledged as “done” before the TXG made them permanent on disk.

Without a separate device, the ZIL lives on the main pool. With a separate log device (SLOG), ZFS can put those intent records on faster, low
latency storage, reducing the cost of synchronous ACKs.

Operationally: the ZIL is where you pay for honesty. If you demand synchronous semantics, you’re asking ZFS to do extra work right now instead
of later.
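
For reference, attaching or detaching a log device is a single command each way; the device paths below are illustrative, and mirroring the log is the cautious default for anything that matters:

# add a mirrored SLOG to the pool (device names are examples)
zpool add tank log mirror /dev/disk/by-id/nvme-SLOG0 /dev/disk/by-id/nvme-SLOG1

# remove a log device that turned out to hurt more than it helps
zpool remove tank nvme-SLOG0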

The sync dataset property: the lever people misuse

ZFS exposes sync as a dataset property:
standard, always, and disabled.

  • sync=standard: honor sync requests. If the client/app asks for sync, do it; otherwise buffer.
  • sync=always: treat all writes as sync, even if the client didn’t ask. This is the “I don’t trust you” mode.
  • sync=disabled: lie. Acknowledge sync requests without actually ensuring stable storage.

In NFS land, sync=standard is not the same as “safe.” It’s “safe if the client asks for safety.”
And clients have multiple ways to not ask.
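
Moving the lever is the easy part; a minimal sketch, reusing the dataset names from the tasks later in this article:

# force synchronous semantics for one strict workload, not the whole pool
zfs set sync=always tank/nfs/db

# confirm what is actually in effect and where it was set
zfs get -o name,property,value,source sync tank/nfs/db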

What “safe” means depends on the whole chain

A synchronous ACK is only as honest as the weakest component that can pretend data is stable when it isn’t:

  • Drive cache settings and flush behavior
  • Controller cache and battery/flash protection
  • SLOG device power-loss protection (PLP) and write ordering
  • Hypervisor storage stack if you’re virtualized
  • Client-side caching and NFS mount options

You can do everything right on ZFS and still get burned by a client that’s configured to treat close() as “good enough”
without enforcing stable writes, or by a “fast” SSD that acknowledges flushes like it’s reading a bedtime story.

One quote that holds up: “Hope is not a strategy.” — paraphrased idea often attributed to engineers in ops/reliability circles.

Why the client changes safety: mounts, caches, and app patterns

NFS mount options can quietly trade durability for throughput

On many clients, the difference between “sync” and “async” isn’t a single toggle; it’s a combination of behaviors:

  • Mount-time caching. Attribute caching, directory entry caching, and client-side page cache can change when writes are pushed.
  • Write gathering. Clients may batch small writes into larger RPCs, reducing overhead but delaying stability.
  • Commit behavior. The client may send COMMITs lazily, or only on fsync/close depending on policy.
  • Hard vs soft mounts. Not directly a durability control, but it changes how failures manifest (hang vs error), which changes application behavior under stress.

You don’t get to assume the client’s defaults. Defaults vary by OS version, distro, kernel, and even security baselines.
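
The practical countermeasure is a written mount profile per workload class, kept in configuration management. A sketch for a strict workload; the server name, export path, and exact option set are examples to adapt, not a recommendation to copy blindly:

# one documented mount profile per workload class
mount -t nfs -o vers=4.1,proto=tcp,hard,timeo=600,retrans=2,sec=sys \
    server:/tank/nfs/db /mnt/db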

Applications are not consistent about flushes

Databases are the obvious culprits, but plenty of “boring” systems do dangerous things:

  • Message queues that batch fsync every N messages.
  • Loggers that call fsync only on rotation.
  • Build systems that “don’t care” until they suddenly care (artifact repositories, anyone?).
  • ETL pipelines that assume rename is atomic and durable everywhere.

On local filesystems, these patterns can be survivable. Over NFS, they can turn into long windows of acknowledged-but-not-stable data,
especially if the client uses unstable writes and delays COMMIT.

NFS server exports: what you allow matters

Server-side export settings don’t usually say “lie about durability,” but they do influence client behavior: security flavors, subtree checking,
FSID stability, and delegation behaviors (on NFSv4) can change caching and retry patterns.

On Linux servers (including many ZFS-on-Linux deployments), the NFS server stack itself (nfsd threads, rpc.mountd, the RPC layer) can become the
bottleneck that looks like “disk latency.” If you don’t measure both layers, you’ll blame the wrong one and “fix” it by buying SSDs.

SLOG realities: when it helps, when it hurts, and when it’s theater

When a SLOG helps

A SLOG helps when you have many synchronous writes and the pool’s latency is higher than a dedicated low-latency device. This
is common with:

  • NFS workloads where clients issue frequent fsync / stable writes (databases, VM images on NFS, some mail systems)
  • Small random synchronous writes where pool vdevs are HDD or busy
  • Latency-sensitive apps where every sync write stalls a thread

The SLOG is about ACK latency, not throughput. If you’re not sync-heavy, it won’t move the needle much.
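
Before buying hardware, measure how the export behaves under genuinely synchronous load. A sketch using fio against the mounted export; the directory, size, and runtime are illustrative:

# small random writes with an fsync after every write: this exercises the
# stable-write/COMMIT path (and the ZIL/SLOG on the server), not the page cache
fio --name=sync-probe --directory=/mnt/db --rw=randwrite --bs=4k \
    --size=256m --runtime=60 --time_based --iodepth=1 --numjobs=1 --fsync=1

Compare the same run with --fsync=0: if the two look similar, your clients probably aren’t sync-heavy and a SLOG won’t buy much.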

When a SLOG does nothing

If most writes are asynchronous, the ZIL/SLOG path is not your bottleneck. Your bottleneck is TXG commit throughput, dirty data limits, or read
amplification.

Also: if your clients are doing unstable writes and delaying COMMIT, a fast SLOG may only help at the moment COMMIT happens. Until then, you’re
not paying the sync cost anyway—which means you might be accumulating risk, not solving latency.

When a SLOG hurts

A bad SLOG device can destroy latency. ZFS sends sync intent records in a pattern that wants consistent low-latency writes with proper flush
semantics. Consumer SSDs often:

  • have unpredictable latency under sustained sync writes,
  • lack power-loss protection,
  • acknowledge flushes optimistically,
  • fall off a cliff when the SLC cache is exhausted.

That’s how you get the classic graph: p99 write latency is fine… until it’s suddenly not fine, and then it stays not fine.

Joke #2: A “fast” SSD without power-loss protection is like a resume with “team player” on it—technically possible, but you should verify.

sync=always: the nuclear option that sometimes is correct

If you cannot trust clients to request stability correctly, sync=always is the blunt tool that forces ZFS to treat every write as
synchronous. It reduces the space for client-side creativity.

It also increases latency, and it will expose every weak link: SLOG, pool, controller, and network. Use it surgically: per dataset, per export,
for workloads that truly need it.

Fast diagnosis playbook

The fastest way to debug NFS-on-ZFS sync pain is to avoid debating philosophy and instead answer three questions in order:
(1) are we doing sync writes, (2) where is the latency, and (3) is the system lying about flush durability?

First: confirm whether the workload is actually synchronous

  • Check ZFS dataset sync property and confirm expectations.
  • Check NFS client mount options and whether apps are calling fsync.
  • Watch ZIL/SLOG activity and NFS COMMIT rates.

Second: locate the bottleneck domain (CPU, network, SLOG, pool)

  • If NFS server threads are saturated, it’s not “disk latency,” it’s a server bottleneck.
  • If SLOG latency spikes, sync writes will spike even if the pool is fine.
  • If pool vdevs are busy, TXG commits will drag and async workloads will stall.

Third: validate stable storage behavior end-to-end

  • Confirm SLOG device has PLP and isn’t virtualized behind writeback caches.
  • Confirm drive write cache policy and that flushes are honored (see the check below).
  • Confirm you’re not running sync=disabled anywhere “temporarily.”
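
A sketch of the drive-cache part of that checklist, with device names as examples. The goal is to know the cache state, not to toggle it blindly; power-loss protection itself is usually a datasheet or vendor-tool claim, so verify it there:

# SATA/SAS: is the volatile write cache enabled?
hdparm -W /dev/sda
smartctl -g wcache /dev/sda

# NVMe: does the controller report a volatile write cache at all?
nvme id-ctrl /dev/nvme0 | grep -i vwc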

Practical tasks: commands, outputs, and decisions

These are the checks I actually run when someone says “NFS is slow” or “we lost writes.” Each includes: command, example output, what it means,
and the decision you make.

Task 1: Identify the dataset and its sync policy

cr0x@server:~$ zfs get -o name,property,value,source sync tank/nfs/db
NAME         PROPERTY  VALUE     SOURCE
tank/nfs/db  sync      standard  local

What it means: ZFS will honor sync requests, but won’t force them.

Decision: If the workload needs strict durability regardless of client behavior, consider sync=always (on that dataset), not pool-wide.

Task 2: Look for the classic foot-gun: sync disabled

cr0x@server:~$ zfs get -r -o name,property,value,source sync tank/nfs
NAME        PROPERTY  VALUE     SOURCE
tank/nfs    sync      standard  default
tank/nfs/db sync      standard  local
tank/nfs/ci sync      disabled  local

What it means: One subtree is lying about sync writes.

Decision: Treat this as a production risk. If it’s intentional, document blast radius and crash-loss expectations; otherwise fix it now.

Task 3: Confirm there is (or isn’t) a SLOG

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            ata-HDD_A                ONLINE       0     0     0
            ata-HDD_B                ONLINE       0     0     0
        logs
          nvme-SLOG0                 ONLINE       0     0     0

errors: No known data errors

What it means: Sync writes can be acknowledged via the log device.

Decision: If you have sync-heavy NFS and no SLOG, expect higher latency. If you have a SLOG, validate it’s actually good.

Task 4: Check SLOG device latency under load (quick indicator)

cr0x@server:~$ iostat -x 1 3 /dev/nvme0n1
Linux 6.5.0 (server)   12/26/2025

avg-cpu:  %user %nice %system %iowait  %steal %idle
           3.2   0.0    5.4    1.1     0.0   90.3

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz await r_await w_await  svctm  %util
nvme0n1           0.0  8200.0     0.0  65600.0     16.0     9.1   1.2    0.0    1.2    0.1   82.0

What it means: ~1.2 ms write await at high write rate is fine; if you see tens of ms, your SLOG is a problem.

Decision: If w_await spikes during “sync storms,” replace SLOG with PLP-grade device or remove it if it’s making things worse.

Task 5: Verify dataset recordsize and workload fit (latency vs throughput)

cr0x@server:~$ zfs get -o name,property,value,source recordsize,logbias,primarycache tank/nfs/db
NAME         PROPERTY      VALUE     SOURCE
tank/nfs/db  recordsize    128K      local
tank/nfs/db  logbias       latency   local
tank/nfs/db  primarycache  all       default

What it means: logbias=latency nudges ZFS to favor the log for sync writes; recordsize affects read/write amplification for databases.

Decision: For DB-like random I/O, consider smaller recordsize (e.g., 16K) after measuring. Keep logbias=latency for sync-heavy datasets.
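
If measurement says the recordsize should change, the change itself is one command, but it only affects newly written blocks; existing data keeps its old block size until it is rewritten. A sketch reusing the dataset above:

# applies to new writes only; copy or rewrite data to convert existing blocks
zfs set recordsize=16K tank/nfs/db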

Task 6: Check TXG pressure and “dirty data” stalls

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
8589934592
cr0x@server:~$ awk '{print $1, $4}' /proc/spl/kstat/zfs/tank/txgs | tail -n 5
7812408 1610612736
7812409 1879048192
7812410 2147483648
7812411 2080374784
7812412 1973741824

What it means: The second column is per-TXG dirty data (ndirty). If it keeps pushing toward zfs_dirty_data_max, ZFS will throttle writers—clients see stalls unrelated to NFS semantics.

Decision: If you’re throttling, look at pool commit throughput, vdev saturation, and whether a “fast SLOG” is masking a slow pool.

Task 7: Watch NFS server thread saturation (Linux nfsd)

cr0x@server:~$ ps -eLo pid,comm,psr,pcpu,stat | awk '$2=="nfsd" {sum+=$4} END {print "total_nfsd_cpu="sum"%"}'
total_nfsd_cpu=380%

What it means: nfsd threads are burning CPU. This can look like “storage is slow” from the client side.

Decision: If CPU is high, tune nfsd thread count, check encryption/auth overhead, and profile RPC mix before touching ZFS.

Task 8: Confirm NFS protocol versions and mount options on a client

cr0x@client:~$ nfsstat -m
/mnt/db from server:/tank/nfs/db
 Flags: rw,relatime,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.10,mountvers=3,mountproto=tcp,local_lock=none

What it means: NFSv3 with large rsize/wsize; hard mount. No explicit sync/async mount option shown here, but version matters for COMMIT behavior.

Decision: If durability expectations are strict, evaluate NFSv4.1+ behavior and client options; align on one supported mount profile per workload class.

Task 9: Observe COMMIT rate on the server (is the client forcing stability?)

cr0x@server:~$ nfsstat -s | egrep 'commit|write'
write        182938
commit        12480

What it means: Many writes but relatively few commits suggests unstable writes being batched, or workloads not calling fsync frequently.

Decision: If the workload claims “every transaction is durable,” this mismatch is your lead. Audit the application’s fsync policy and client caching settings.

Task 10: Confirm the export options being served

cr0x@server:~$ exportfs -v | sed -n '/tank\/nfs\/db/,+2p'
/tank/nfs/db   10.0.0.0/24(rw,wdelay,root_squash,sec=sys,no_subtree_check,fsid=120)

What it means: Standard export; wdelay can change write coalescing behavior at the server side.

Decision: Don’t randomly toggle export knobs hoping for durability. Use exports to control access and correctness; use ZFS and client policy for sync semantics.

Task 11: Validate that the SLOG device is not erroring or being bypassed

cr0x@server:~$ zpool iostat -v tank 1 2
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        2.10T  5.10T      0   2200      0  45.0M
  mirror-0                  2.10T  5.10T      0   2100      0  44.0M
    ata-HDD_A                   -      -      0   1050      0  22.0M
    ata-HDD_B                   -      -      0   1050      0  22.0M
logs                            -      -      0    800      0  12.5M
  nvme-SLOG0                    -      -      0    800      0  12.5M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: The log device is seeing write activity. If it’s always zero during a “sync” workload, either the workload isn’t sync or your config isn’t what you think.

Decision: Correlate with application behavior and NFS COMMIT. Don’t buy hardware until you prove sync writes are happening.

Task 12: Check whether client-perceived latency is network or server I/O

cr0x@server:~$ ss -tin sport = :2049 | head -n 12
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0      0      10.0.0.10:2049     10.0.0.51:49822
	 cubic wscale:7,7 rto:204 rtt:0.289/0.032 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:23849342 bytes_acked:23849210 bytes_received:14920345 segs_out:22110 segs_in:21540 data_segs_out:18900 data_segs_in:17620 send 400.8Mbps lastsnd:8 lastrcv:8 lastack:8 pacing_rate 801.6Mbps unacked:1

What it means: RTT is sub-millisecond; network isn’t the primary latency source in this snapshot.

Decision: Focus on server-side CPU and storage latency. If RTT is multiple ms or retransmits climb, fix network first.

Task 13: Verify ZFS is not suffering checksum or device errors (silent latency killers)

cr0x@server:~$ zpool status -x
all pools are healthy

What it means: No known faults. If this isn’t clean, stop “performance tuning” and start incident response.

Decision: Replace failing devices, scrub, and re-evaluate. A degraded pool can look exactly like “NFS sync is slow.”

Task 14: Confirm sync=always is applied where you think it is

cr0x@server:~$ zfs get -o name,property,value,source sync tank/nfs/db
NAME         PROPERTY  VALUE   SOURCE
tank/nfs/db  sync      always  local

What it means: ZFS will treat every write as sync for this dataset, regardless of client requests.

Decision: Expect higher latency; ensure SLOG is solid and monitor p95/p99. Use this for data that must survive crashes with minimal ambiguity.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a ZFS-backed NFS cluster for internal services. One team hosted a job queue’s persistence layer on an NFS export.
The storage folks had done their homework: mirrored pool, SLOG with a “fast NVMe,” sync=standard, and routine scrubs.

The queue team migrated from one Linux distro to another during a platform refresh. Same application version, same data, same mount target.
After a power event (not even dramatic; the kind that only becomes dramatic later), a chunk of acknowledged jobs reappeared as “never processed.”
Worse, a smaller chunk vanished.

The first round of debugging was predictably unhelpful. People argued about “ZFS is copy-on-write so it can’t lose data” and “NFS is reliable over TCP”
as if those statements end incidents. The storage team pulled zpool status—healthy. The network team showed low retransmits.
The queue team insisted their app “fsyncs on commit.”

The eventual breakthrough came from looking at protocol behavior, not beliefs. On the client, nfsstat -m showed a different NFS version and
different caching defaults. On the server, COMMIT calls were far lower than expected under load. The application did call fsync—but the I/O pattern was
buffered and the client’s policy delayed the stability boundary more than the team realized.

Fix was boring: standardize the client mount profile for that workload, and flip the dataset to sync=always until everyone could prove the
application’s durability contract. It cost latency. It bought clarity. That’s a trade you can explain in a postmortem without sweating.

Mini-story 2: The optimization that backfired

Another organization served home directories and build artifacts over NFS. They wanted faster CI times. Someone found the magic knob:
sync=disabled on the dataset backing the build cache. It was framed as “safe enough because artifacts can be rebuilt.”

For a while, it looked great. CI pipelines sped up. Storage latency dropped. Everyone congratulated the change ticket for “improving utilization.”
And then a non-obvious dependency bit them: a release pipeline also wrote signed metadata into that same tree. The metadata was “small and fast,”
so nobody thought about it. It also happened to be the thing you can’t easily reconstruct without redoing security ceremony.

A host crash occurred mid-release. ZFS had acknowledged sync requests without stable storage. The metadata file existed, but content was older.
A few builds shipped with mismatched metadata. Not catastrophic, but the compliance team got involved, and suddenly “rebuildable artifacts”
didn’t feel like a good blanket statement.

The backfire wasn’t ZFS being fickle. It was the organization being imprecise about data classes on shared exports. They optimized one workload and
accidentally dragged a stricter workload into the same durability policy.

The fix was structural: split datasets by durability requirements, apply sync=standard or sync=always accordingly, and keep
“unsafe speed hacks” behind explicit mount points and access controls. Also: delete “temporary” from change tickets; it’s the most permanent word in ops.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop had a ZFS NFS platform that hosted a mix of services, including a stateful app whose vendor was very explicit:
“must have stable writes on commit.” The storage engineers didn’t guess. They created a separate dataset and export just for that app.
sync=always, a vetted PLP SLOG, and a written client mount standard. No exceptions.

This was not popular. The app team complained about latency compared to their previous local SSD setup. The storage team didn’t debate feelings.
They showed a simple test: with sync=standard and the app’s current mount, fsync frequency was lower than expected and COMMIT behavior
was bursty. With sync=always, the latency curve was stable and the crash-loss window became easy to reason about.

Months later, an ugly incident happened elsewhere: a firmware bug caused a subset of servers to reboot under specific I/O patterns.
Many systems had data oddities after restart. That stateful app didn’t. No mystery corruption, no “we need to re-run jobs,” no silent rollbacks.

The reason wasn’t heroism. It was that someone had written down the durability requirement, enforced it server-side, and refused to co-host it with
“fast but sloppy” workloads. That is the kind of boredom you want on your résumé.

Common mistakes: symptoms → root cause → fix

1) “We set sync=standard, so we’re safe.”

Symptoms: Data loss after crash despite “sync enabled,” or inconsistent durability across clients.

Root cause: Clients didn’t request stable writes consistently (unstable writes + delayed COMMIT, caching policy, app not fsyncing).

Fix: For strict workloads, use sync=always per dataset/export; standardize client mount options; verify COMMIT/fsync behavior with metrics.

2) “A SLOG will make everything faster.”

Symptoms: No improvement, or worse p99 latency after adding SLOG.

Root cause: Workload is mostly async; or SLOG device has poor latency/flush honesty.

Fix: Measure COMMIT/sync write rate first. Use PLP-grade SLOG or remove the bad one. Don’t cargo-cult SLOGs.

3) “sync=disabled is fine for non-critical data.”

Symptoms: Randomly stale files, truncated metadata, “it exists but it’s older,” painful post-crash reconciliation.

Root cause: Non-critical data shared a dataset/export with critical data, or “non-critical” turned out to be critical during an incident.

Fix: Separate datasets by durability class. Lock down unsafe exports. Make the risk explicit and auditable.

4) “NFS is slow; the disks must be slow.”

Symptoms: High client latency but pool looks fine; CPU spikes on server.

Root cause: nfsd thread saturation, auth overhead, RPC mix heavy in metadata, or single-client serialization effects.

Fix: Measure server CPU, nfsstat ops mix, thread counts, and network RTT. Scale nfsd threads and tune exports before buying disks.

5) “We lost writes, so ZFS is corrupt.”

Symptoms: Application-level loss without pool errors; no checksum errors.

Root cause: Crash-loss window due to async behavior or lied-about flushes (sync disabled, bad SLOG/drive cache).

Fix: Audit sync properties and hardware flush semantics; enforce sync semantics server-side for critical data.

6) “We switched to NFSv4 so it’s fixed.”

Symptoms: Same durability confusion, different error messages.

Root cause: Version alone doesn’t force apps to fsync or clients to demand stable storage at the right time.

Fix: Keep the focus on: app flush behavior, client caching policy, dataset sync policy, and honest stable storage.

Checklists / step-by-step plan

Plan A: You want strict durability for one NFS workload

  1. Create a dedicated dataset. Don’t share it with “fast and loose” workloads.
  2. Set sync semantics server-side. Use sync=always if you cannot fully control client flush behavior.
  3. Use a vetted SLOG if sync latency matters. PLP required; mirrored SLOG if you cannot tolerate log-device failure risk for availability.
  4. Standardize client mounts. Document the supported NFS version and mount options. Treat deviations as unsupported.
  5. Test crash behavior. Not “benchmarks,” but actual reboot/power-loss simulations in a staging environment that matches production.
  6. Monitor protocol signals. Track write/commit rates, p95 latency, and server CPU saturation.

Plan B: You want performance and can accept some crash loss

  1. Be explicit about acceptable loss. “Some” is not a number; define a recovery method and expected window.
  2. Keep sync=standard and avoid sync=disabled unless isolated. If you must disable sync, do it on a separate dataset and make it obvious.
  3. Optimize for pool throughput. Focus on vdev layout, ARC/L2ARC strategy, and TXG behavior rather than SLOG.
  4. Keep the blast radius small. Separate exports by workload class so the risky choice doesn’t infect everything.

Plan C: You inherited a mystery NFS/ZFS platform and nobody knows what’s safe

  1. Inventory datasets and sync properties. Find any sync=disabled immediately.
  2. Map exports to datasets. Ensure critical workloads aren’t sitting on a “performance” export.
  3. Sample client mounts. Collect nfsstat -m outputs from representative clients.
  4. Measure COMMIT rate and SLOG activity. Determine if clients are actually demanding stable writes.
  5. Pick a default durability posture. My bias: safe by default for stateful workloads; performance hacks must be opt-in and isolated.

FAQ

1) If I set sync=standard, are NFS writes durable?

Durable only when the client/app requests synchronous semantics (stable write / COMMIT behavior). If the client buffers and delays stability,
ZFS will happily buffer too. For strict durability independent of client behavior, use sync=always on that dataset.

2) Does NFS hard vs soft change durability?

Not directly. It changes failure behavior. hard tends to hang and retry, which is usually correct for data integrity; soft
can return errors that apps mishandle. But durability is primarily about stable writes/COMMIT and server-side sync policy.

3) Is adding a SLOG always the right move for NFS?

Only if you have a meaningful rate of synchronous writes and the pool latency is the bottleneck. If clients aren’t issuing stable writes/COMMIT
often, a SLOG won’t help. A bad SLOG can hurt a lot.

4) What’s the difference between ZIL and SLOG again?

ZIL is the mechanism (intent logging for synchronous ops). SLOG is a dedicated device where ZFS places that log to reduce latency. Without SLOG,
ZIL lives on the main pool.

5) If we use sync=disabled, can we still be “mostly safe”?

You can be “mostly lucky.” sync=disabled acknowledges sync writes without making them stable. After a crash, you can lose recent
acknowledged operations, including metadata updates. Use it only for isolated datasets where you truly accept that risk.

6) Does NFSv4 guarantee better crash consistency than NFSv3?

Not as a blanket statement. NFSv4 adds stateful features and can improve certain behaviors, but crash consistency still depends on client policy,
app fsync discipline, and server-side honest stable storage.

7) Why do some clients show huge latency spikes when a database checkpoints?

Checkpoints often trigger bursts of fsync/commit behavior. If the dataset is sync=always or the app uses O_DSYNC, ZFS will route
those through ZIL/SLOG. If the SLOG is weak or saturated, p99 latency explodes exactly at checkpoint time.

8) Should I mirror the SLOG?

Mirroring a SLOG is mostly about availability. If the only SLOG device dies while the pool is running, ZFS falls back to the in-pool ZIL; the
dangerous case is losing that device at the same moment the server crashes, because sync writes that existed only in the log are gone. For
systems that must stay up and must not widen that window, a mirrored SLOG is often worth it.

9) Can I “force clients to be safe” from the server side?

You can’t force an app to call fsync, but you can force ZFS to treat all writes as sync with sync=always. That reduces reliance on
client flush discipline. It’s the most practical server-side enforcement tool you have.

10) What is the most reliable way to validate durability claims?

Controlled crash testing with the real client OS, real mount options, and the real application. Measure what the app believes is committed,
then crash/reboot and verify invariants. Benchmarks don’t tell the truth here; crash tests do.
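
A minimal crash-test sketch, with paths, counts, and the reset method as examples: record what the client believes was acknowledged, force a server reset, then compare against what actually survived.

# on the client: append records, forcing each one to stable storage,
# and log each acknowledgement to local (non-NFS) disk
for i in $(seq 1 10000); do
  printf 'record %d\n' "$i" | dd of=/mnt/db/crash-test oflag=append,dsync \
      conv=notrunc status=none
  echo "acked $i" >> /var/tmp/acked.log
done

# crash the server (lab only): pull power, or force an immediate reset with
#   echo b > /proc/sysrq-trigger
# after recovery, the NFS file should contain every record logged as acked
wc -l /mnt/db/crash-test
tail -n 1 /var/tmp/acked.log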

Next steps you can do this week

  1. Inventory sync settings: run zfs get -r sync on your NFS trees and flag any sync=disabled.
  2. Pick two critical exports and standardize client mount profiles; collect nfsstat -m from representative clients.
  3. Measure COMMIT/write ratios on the server during peak workloads; compare against what applications claim.
  4. Validate your SLOG by measuring latency and confirming it has PLP; if you can’t prove it, treat it as suspect.
  5. Split datasets by durability class so one team’s performance tweak can’t rewrite another team’s risk model.
  6. Run one crash test in staging: a workload that does “commits,” a forced reboot, and a verification step. It’s amazing how fast myths die.

The goal isn’t to make NFS “safe” in the abstract. The goal is to make safety a property you can explain, measure, and enforce—without relying on
whatever today’s client defaults happen to be.
