ZFS NFS: The Tunables That Make It Feel Like Local Disk

You don’t notice “storage” when it’s working. You notice it when your build queue stalls, your VM boots like it’s pulling bytes through a coffee stirrer,
and your database starts logging “fsync taking too long” while everyone blames the network because that’s a socially acceptable scapegoat.

ZFS over NFS can feel shockingly close to local disk—until it doesn’t. The gap is almost always tunables, mismatched assumptions, and a couple of
small but lethal defaults. Let’s fix that.

The mental model: why “NFS feels slow” is usually a lie

“NFS is slow” is not a diagnosis. It’s an emotional state. In production, slowness comes from one of four places:
latency, serialization, sync semantics, or fragmentation/misalignment with the workload. ZFS and NFS both have strong opinions
about how IO should happen. When their opinions line up, you get something that feels local. When they don’t, you get a distributed
denial of service built from small, well-intentioned defaults.

Your job is to align the pipeline:

  • Application IO shape: random vs sequential, small vs large, sync vs async, metadata-heavy vs streaming.
  • NFS behavior: attribute caching, read-ahead, delegation/locking, number of parallel RPCs, mount semantics.
  • ZFS behavior: recordsize, compression, primarycache, ARC sizing, special vdev, sync policy, ZIL/SLOG.
  • Hardware and network reality: latency floor, link speed, queue depth, interrupt moderation, NIC offloads.

The most common failure mode is assuming the storage is “fast enough” because it benchmarks well locally on the server.
NFS changes the IO pattern. ZFS reacts. Latency is amplified by round trips and synchronous semantics. A pool that does 2 GB/s of sequential reads
might still feel terrible to a workload doing 4 KB sync writes over NFS.

One more important point: “feels like local disk” doesn’t mean “same throughput.” It means “same tail latency under normal load.”
Tail latency is what makes users file tickets, or worse, build workarounds.

A paraphrase worth keeping in mind: John Allspaw has argued that reliability is about reducing surprise, not just preventing failure.
NFS tuning is a surprise-reduction project.

Facts and history that still matter in production

A few concrete bits of context make today’s tuning choices less mysterious. These aren’t trivia; they explain why the defaults are the way they are,
and why your workload can still fall off a cliff.

  1. NFS started life assuming unreliable networks. Early designs leaned hard on statelessness (especially in NFSv3), which affects how clients recover and cache.
  2. NFSv4 introduced state. Locking and delegations changed client/server coordination and can improve performance—until it clashes with odd client behavior or failover patterns.
  3. ZFS was built around copy-on-write. Great for snapshots and integrity; it also means small random writes can become larger, more complex IO patterns.
  4. The ZIL exists even without a dedicated SLOG. People still think “no SLOG means no ZIL.” The ZIL is a log structure; SLOG is just a faster device for it.
  5. Sync writes are a semantic promise, not a performance preference. Databases and VM hypervisors often force sync for good reasons.
  6. Recordsize is not “block size.” It’s the maximum logical block size for files, affecting how ZFS packs data and how much it reads/writes for a given request.
  7. Attribute caching in NFS is a performance hack with consequences. It can make metadata-heavy workloads fly—or introduce “why can’t I see the file?” confusion in distributed apps.
  8. Checksums changed the game. ZFS checksumming and scrubs are why “silent corruption” stories aren’t just urban legend in enterprise storage.
  9. 10GbE made throughput easy, latency still hard. Many teams “upgrade the pipe” and then discover their bottleneck is sync RTT and storage flush time.

NFS client and server knobs that move the needle

Pick the protocol version on purpose

NFSv4.x is usually the right default in 2025: better security model, compound operations, stateful features that can reduce chatter.
But NFSv3 still has a place, especially in environments where you want simpler behavior, or you’re dealing with certain hypervisors
and legacy stacks that are “special” in the way that ruins weekends.

The key is consistency. Mixed versions across clients can create non-obvious behavior differences (locking, caching, recovery), and you end up debugging
“performance” when you’re really debugging semantics.

rsize/wsize: stop being timid

Modern networks and kernels handle large IO sizes well. Use 1M (rsize=1048576, wsize=1048576) where supported; if the client or server can't, the values negotiate down and nfsstat -m will show what you actually got.
Small rsize/wsize values produce more RPCs, more overhead, and more chances to hit latency walls.

But don’t cargo-cult it: if you’re on a high-latency WAN link, large sizes can improve throughput but worsen interactive tail latency in some patterns.
Measure with your workload, not with a synthetic sequential read test.
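
A minimal /etc/fstab sketch, reusing the nas01:/tank/home export and /home mountpoint from the tasks below; treat the values as a starting point and confirm what was actually negotiated afterwards (Task 1):

nas01:/tank/home  /home  nfs  rw,hard,proto=tcp,vers=4.2,rsize=1048576,wsize=1048576  0  0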

nconnect and parallelism: throughput’s best friend

Linux clients support nconnect= to open multiple TCP connections to the server for a single mount. This can increase throughput dramatically
by spreading load across CPU queues and avoiding a single flow becoming the choke point. It’s not free; more connections mean more state and sometimes
more lock contention on the server.
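
A hedged example of trying nconnect on a Linux client (kernel 5.3 or newer); 4 is a number to measure against, not a recommendation, and the server and export names are the ones used in the tasks below:

cr0x@server:~$ sudo mount -t nfs -o rw,hard,proto=tcp,vers=4.2,nconnect=4,rsize=1048576,wsize=1048576 nas01:/tank/home /home

Verify it actually took effect (Task 2). nconnect applies per client-server pair, so an existing mount to the same server can pin the value until everything is unmounted.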

Hard vs soft mounts: be brave, not reckless

Use hard mounts for anything that matters. Soft mounts return errors on timeout, which many applications treat as corruption or “delete and retry.”
That’s how you turn a transient glitch into data loss.

For interactive developer home directories, “soft” can feel nicer when the server is down, but it’s a trap. If you want responsiveness, tune timeouts and
retransmits. Don’t trade correctness for convenience.

actimeo and attribute caching: choose your poison explicitly

Metadata-heavy workloads (build systems, package managers) can benefit from attribute caching. But distributed applications expecting near-instant visibility
of file changes can be confused if clients cache attributes too long.

The clean approach: set actimeo to something reasonable (like 1–5 seconds) rather than disabling caching entirely.
“actimeo=0” is a performance self-own for most workloads.
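
As a sketch, the same mount with a modest attribute-cache window; 3 seconds is illustrative, not a magic number:

cr0x@server:~$ sudo mount -t nfs -o rw,hard,proto=tcp,vers=4.2,actimeo=3 nas01:/tank/home /home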

Server threads and RPC behavior

On Linux NFS servers, nfsd thread count matters under concurrency. Too few threads and requests queue; too many and you burn CPU on context switches
and lock contention. You want “enough to keep the pipeline full,” not “as many as you can fit in RAM.”
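
On most Linux distributions the thread count lives in /etc/nfs.conf, with rpc.nfsd for runtime changes; 64 below is an example to measure against (Task 6 shows how to read the current value), not a universal target:

cr0x@server:~$ sudo rpc.nfsd 64        # runtime change; confirm via /proc/fs/nfsd/threads

# /etc/nfs.conf, to make it persistent across restarts:
[nfsd]
threads = 64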

ZFS dataset and pool knobs that change outcomes

recordsize: align it to your workload, not your feelings

If you serve VM images or databases over NFS, the most common win is setting recordsize to something like 16K or 32K (sometimes 8K),
because those workloads do lots of small random IO. The default 128K is great for streaming and backups, not for random-write-heavy guests.

For general-purpose home directories, 128K is fine. For media archives, consider larger record sizes if your platform supports it.
The point is to avoid read-modify-write amplification and to keep ARC useful.
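
A hedged sketch for a VM-image dataset; tank/vmstore is a placeholder name, and recordsize only applies to blocks written after the change, so existing files keep their old layout until rewritten or migrated:

cr0x@server:~$ sudo zfs create -o recordsize=16K tank/vmstore
cr0x@server:~$ zfs get recordsize tank/vmstore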

compression: turn it on unless you have a reason not to

compression=lz4 is one of the few near-free lunches in storage. It often improves throughput and reduces IO. On modern CPUs, the cost is modest.
You don’t “save performance” by disabling compression; you often just force more bytes over disk and network.

atime: disable it for most NFS exports

Access time updates create extra writes and metadata churn. Unless you have a compliance or application requirement, use atime=off.
For shared filesystems serving lots of reads, this is an easy latency reduction.
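
Both changes are one-liners on an existing dataset (tank/home from the tasks below); compression only applies to data written afterwards, and neither change rewrites data already on disk:

cr0x@server:~$ sudo zfs set compression=lz4 tank/home
cr0x@server:~$ sudo zfs set atime=off tank/home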

xattr and ACL behavior: pick a mode that matches your clients

Linux clients, Windows clients (via SMB gateways), and hypervisors can have different expectations about ACLs and extended attributes.
Poor alignment shows up as metadata storms and permission weirdness. Decide the primary use case, then tune for it.

special vdev: metadata deserves better than “whatever disks are left”

If your workload is metadata-heavy (millions of small files, builds, source trees), a special vdev can be transformative.
Put metadata (and optionally small blocks) on fast SSDs. ARC hits are great, but ARC misses still happen, and metadata misses are death by a thousand seeks.
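
A hedged sketch of adding one; the NVMe device names and the tank/builds dataset are placeholders. Treat the special vdev as pool-critical and mirror it, because losing it generally means losing the pool:

cr0x@server:~$ sudo zpool add tank special mirror nvme1n1 nvme2n1
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/builds    # optional: small data blocks also land on the SSDs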

primarycache and secondarycache: be intentional

ARC is memory; L2ARC is SSD cache. For NFS exports serving large streaming reads, caching all that data can evict the metadata you actually need.
For VM images, caching data can help, but validate memory sizing first. A common pattern:
primarycache=metadata on selected datasets, keeping ARC focused on what helps most.
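
A sketch, assuming a VM-image dataset like the hypothetical tank/vmstore above; measure ARC behavior (Task 11) before and after, because this knob is easy to get wrong in both directions:

cr0x@server:~$ sudo zfs set primarycache=metadata tank/vmstore
cr0x@server:~$ zfs get primarycache,secondarycache tank/vmstore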

ashift: you don’t tune it later

ashift is set at vdev creation and affects sector alignment. Get it wrong and you pay forever in write amplification.
If you’re on 4K sector disks or SSDs (you are), ashift=12 is the usual safe choice. Don’t let auto-detection guess wrong.
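
Because it is fixed at vdev creation, ashift belongs on the zpool create line; a sketch using the raidz2 layout from Task 8, with device names as examples only:

cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf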

Synchronous writes, SLOG, and why “sync=disabled” is a career-limiting move

NFS clients often issue writes as synchronous depending on mount options and application behavior. Databases call fsync because they like their data to
survive power loss. Hypervisors often use sync semantics for VM disks because “I lost a VM filesystem” is an expensive meeting.

In ZFS, synchronous writes are acknowledged only when they’re committed safely. That means the ZIL path matters.
Without a dedicated SLOG, the ZIL lives on your main pool, and sync write latency becomes “how fast can the pool commit small writes safely.”
On HDD pools, that can be brutal. On SSD pools, it might be fine. On mixed pools, it depends on the slowest consistent step.

A dedicated SLOG device can drastically reduce sync latency. But only if it’s the right device: low latency, power-loss protection, consistent performance,
and connected reliably. A cheap consumer NVMe without PLP is an outage generator in disguise.
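
Adding a mirrored SLOG is one command; the device names are placeholders, and the devices should be the enterprise-grade, PLP-equipped kind described above:

cr0x@server:~$ sudo zpool add tank log mirror nvme0n1 nvme1n1
cr0x@server:~$ zpool status tank        # the 'logs' section should now show the mirror (compare Task 12)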

Joke #1: A consumer SSD as SLOG is like using a paper umbrella in a hurricane—technically an umbrella, emotionally a mistake.

Don’t disable sync to “fix performance”

sync=disabled makes benchmarks look amazing. It also turns “acknowledged write” into “maybe written eventually,” which is not what many applications expect.
If the server crashes, you can lose data that the client believed was durable. That’s not tuning; that’s a magic trick with a trapdoor.

Understand sync modes: standard, always, disabled

  • sync=standard: respect application requests (default, usually correct).
  • sync=always: treat all writes as sync. Useful for certain compliance cases; often punishing.
  • sync=disabled: lie to clients. Only for special cases where you’ve explicitly accepted the risk and documented it like an adult.

Logbias and workload intent

logbias=latency tells ZFS to favor low-latency logging behavior (often appropriate for sync-heavy workloads). logbias=throughput can help for streaming.
This is a knob to align intent, not a magic performance lever. Test it per dataset.
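
A per-dataset sketch of encoding that intent; the dataset names are placeholders and every change here deserves a before/after measurement:

cr0x@server:~$ sudo zfs set sync=standard tank/db            # hypothetical database dataset
cr0x@server:~$ sudo zfs set logbias=latency tank/db
cr0x@server:~$ sudo zfs set logbias=throughput tank/backups  # hypothetical streaming/backup dataset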

Network path: boring settings that buy you milliseconds

NFS is sensitive to latency because every RPC is a round trip, and synchronous writes are a chain of acknowledgements.
You don’t need a fancy network. You need a predictable one.

Jumbo frames: not mandatory, but don’t half-do it

If you enable MTU 9000, it must be end-to-end: NICs, switches, VLANs, bonds, and the storage server. A partial jumbo setup is worse than none because you’ll
get fragmentation or black holes. If you don’t control the whole path, skip it and focus on avoiding drops.
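
One quick way to prove the path is actually jumbo end-to-end: a don't-fragment ping sized for MTU 9000 (9000 minus 28 bytes of IP and ICMP headers). If it fails while a normal ping succeeds, something on the path is still at 1500:

cr0x@server:~$ ping -M do -s 8972 -c 3 nas01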

NIC offloads and CPU

Offloads can help, but they can also introduce strange latency spikes depending on drivers and firmware. The correct approach is empirical:
measure CPU softirq load, drops, and latency, then decide. Don’t disable features because a blog post from 2013 said so.

Congestion and bufferbloat

On fast LANs, you can still get micro-bursts and queueing. Watch switch buffers, host queueing, and TCP retransmits.
If your NFS “feels slow” only under load, you might be looking at queue buildup rather than raw throughput limits.
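
Two quick host-side looks at queueing, assuming the eno1 interface from Task 16 and the nas01 address from Task 3; growing qdisc drops or backlog, or per-connection retransmits climbing under load, point at queueing rather than storage:

cr0x@server:~$ tc -s qdisc show dev eno1        # per-qdisc backlog, drops, requeues
cr0x@server:~$ ss -ti dst 10.10.10.10           # per-connection RTT, cwnd, retransmit counters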

Practical tasks: commands, outputs, and what you decide next

These are field tasks you can run during a performance complaint without turning the incident into an archaeology project.
Each task includes: command, what the output means, and the decision you make from it.

Task 1: Confirm NFS mount options (client)

cr0x@server:~$ nfsstat -m
/home from nas01:/tank/home
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.21,local_lock=none

What it means: You’re on NFSv4.2 with 1M rsize/wsize, hard mount, TCP. That’s a healthy baseline.

Decision: If you see tiny rsize/wsize (like 64K) or vers=3 unexpectedly, fix mounts first. If it’s soft, change it unless this is disposable data.

Task 2: Check negotiated NFS version and transport quickly (client)

cr0x@server:~$ cat /proc/mounts | grep nfs
nas01:/tank/home /home nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.21,local_lock=none 0 0

What it means: Confirms what the kernel thinks, not what your fstab wished for.

Decision: If your fstab says nconnect=8 and it’s not here, you’re not actually using it. Fix your mount automation.

Task 3: Measure latency to the NFS server (client)

cr0x@server:~$ ping -c 5 nas01
PING nas01 (10.10.10.10) 56(84) bytes of data.
64 bytes from 10.10.10.10: icmp_seq=1 ttl=64 time=0.312 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=64 time=0.287 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=64 time=0.301 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=64 time=0.295 ms
64 bytes from 10.10.10.10: icmp_seq=5 ttl=64 time=0.289 ms

--- nas01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4090ms
rtt min/avg/max/mdev = 0.287/0.296/0.312/0.009 ms

What it means: Sub-millisecond RTT on LAN. Good. If this is 2–5 ms, sync workloads will feel “sticky.”

Decision: If RTT is high or jittery, stop tuning ZFS. Fix network path, NIC issues, or routing first.

Task 4: Check TCP retransmits and drops (client)

cr0x@server:~$ netstat -s | egrep -i 'retrans|segments retransm|lost|timeouts' | head
    12 segments retransmitted
    0 retransmit timeouts
    0 connections reset due to unexpected data

What it means: Low retransmits. If this climbs during complaints, you’re losing packets or queueing badly.

Decision: Rising retransmits under load = check NIC stats, switch ports, MTU mismatch, or bufferbloat.

Task 5: Confirm NFS server is actually exporting what you think (server)

cr0x@server:~$ exportfs -v
/tank/home  10.10.10.0/24(rw,async,wdelay,hide,no_subtree_check,sec=sys,secure,root_squash,no_all_squash)

What it means: The export is async at the NFS layer (server reply behavior, not ZFS sync semantics). Root squashing is enabled.

Decision: If you're serving databases/VMs, think hard about async. With an async export the server can acknowledge writes before they reach stable storage, so ZFS sync semantics can't protect data the filesystem was never asked to commit. Some orgs accept that for re-creatable data; others require sync export semantics. Don't guess; match the export mode to your risk model.

Task 6: Check NFS server thread count (Linux server)

cr0x@server:~$ cat /proc/fs/nfsd/threads
32

What it means: 32 nfsd threads. If you have hundreds of clients doing metadata ops, too few threads will queue.

Decision: If CPU is low and clients queue, increase threads moderately (e.g., 64). If CPU is high and latency grows, adding threads may backfire.

Task 7: Observe NFS RPC activity (server)

cr0x@server:~$ nfsstat -s
Server rpc stats:
calls      badcalls   badclnt    badauth   xdrcall
482913     0          0          0         0

Server nfs v4:
null         compound     read         write        getattr
0            412001       32941        21887        100221

What it means: High compound and getattr counts suggest metadata-heavy traffic. Read/write counts show the IO mix.

Decision: If getattr dominates and latency hurts, consider ZFS special vdev, ARC metadata focus, and client attribute caching strategy.

Task 8: Check ZFS pool health and slow IO indicators (server)

cr0x@server:~$ zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:11:43 with 0 errors on Sun Dec 22 01:10:18 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors

What it means: Healthy pool, scrub clean. Not proof of performance, but removes “degraded vdev” from the suspect list.

Decision: If you see a degraded disk or resilver, stop. Performance tuning during rebuild is like tuning a car during a tire fire.

Task 9: See real-time ZFS latency and sync behavior (server)

cr0x@server:~$ zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        3.12T  8.54T    220   1450   48.1M  31.2M
  raidz2-0  3.12T  8.54T    220   1450   48.1M  31.2M
    sda         -      -     35    240   8.2M   5.3M
    sdb         -      -     36    241   8.1M   5.2M
    sdc         -      -     37    239   8.0M   5.3M
    sdd         -      -     37    242   8.0M   5.2M
    sde         -      -     38    243   8.0M   5.1M
    sdf         -      -     37    245   7.8M   5.1M

What it means: Write IOPS are high, bandwidth moderate—classic small write workload. HDD RAIDZ2 doing sync-heavy IO will feel sluggish.

Decision: If this is VM/databases over NFS, consider SLOG and smaller recordsize. If it’s metadata churn, consider special vdev and atime off.

Task 10: Inspect key dataset properties (server)

cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,logbias,primarycache tank/home
NAME       PROPERTY      VALUE
tank/home  recordsize    128K
tank/home  compression   lz4
tank/home  atime         off
tank/home  sync          standard
tank/home  logbias       latency
tank/home  primarycache  all

What it means: Sensible defaults for general use, but recordsize might be wrong for VM images or databases.

Decision: If this dataset serves VM disks, set recordsize smaller (16K–32K) and consider primarycache=metadata depending on RAM and access pattern.

Task 11: Check ARC pressure and hit ratio (server)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:11   780    90     11    22    2    68    9     0    0   42G   48G
12:01:12   802    95     12    24    2    71    9     0    0   42G   48G
12:01:13   799    88     11    21    2    67    8     0    0   42G   48G

What it means: ~11–12% miss rate. Not terrible. If miss% spikes during workload, disks will get hammered and NFS latency climbs.

Decision: If ARC is capped too low or memory is starving, tune ARC sizing (carefully) or focus caching on metadata via dataset properties.

Task 12: Verify SLOG presence and role (server)

cr0x@server:~$ zpool status tank | sed -n '1,80p'
  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0     0
            sdc       ONLINE       0     0     0
            sdd       ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
        logs
          nvme0n1p1   ONLINE       0     0     0

What it means: Dedicated log device exists. Good. Now ensure it’s the right class of device (latency, PLP).

Decision: If there is no SLOG and you have sync-heavy workloads on HDDs, add a mirrored SLOG from enterprise-grade devices.

Task 13: Detect “sync write pain” from the client side (client)

cr0x@server:~$ dd if=/dev/zero of=/home/testfile bs=4k count=4096 oflag=dsync status=progress
16777216 bytes (17 MB, 16 MiB) copied, 2.10 s, 8.0 MB/s
4096+0 records in
4096+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 2.10 s, 8.0 MB/s

What it means: 4K synchronous writes crawl at 8 MB/s (roughly 2,000 IOPS, each waiting on a stable commit). This is the pain users feel in databases and VM journaling.

Decision: If this is “bad,” don’t chase rsize/wsize. Chase SLOG quality, pool latency, and sync semantics.

Task 14: Compare async streaming performance (client)

cr0x@server:~$ dd if=/dev/zero of=/home/testfile2 bs=1M count=4096 status=progress
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.12 s, 471 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.12 s, 471 MB/s

What it means: Big streaming writes look great. This contrast is classic: throughput is fine, sync latency is not.

Decision: Optimize for the workload you actually run. If your app does sync 4K, focus on ZIL/SLOG and latency, not throughput.

Task 15: Check per-dataset write amplification suspects (server)

cr0x@server:~$ zfs get -o name,property,value copies,dedup,checksum tank/home
NAME       PROPERTY  VALUE
tank/home  copies    1
tank/home  dedup     off
tank/home  checksum  on

What it means: No dedup (good for most), single copy, checksumming on. Dedup on a busy NFS dataset is a common “why is everything slow?” story.

Decision: If dedup is on accidentally, plan a migration off it. Turning it off doesn’t undedup existing blocks.

Task 16: Spot NIC-level errors and MTU mismatches (server)

cr0x@server:~$ ip -s link show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    918G  612M       0      12       0   1200
    TX:  bytes packets errors dropped carrier collsns
    877G  590M       0       3       0       0

What it means: A few drops. Not necessarily fatal, but if drops spike during peaks, NFS latency and retransmits follow.

Decision: If MTU is 9000 here but 1500 somewhere else, fix the network. If drops persist, check ring buffers, IRQ affinity, and switch congestion.

Fast diagnosis playbook

When someone says “NFS is slow,” you need a plan that finds the bottleneck in minutes, not in an email thread.
Here’s the order that minimizes wasted time.

First: decide whether it’s latency pain or throughput pain

  • Latency pain: small IO, sync-heavy, metadata ops. Symptoms: slow git status, slow package installs, databases complaining about fsync, VM pauses.
  • Throughput pain: large transfers slow. Symptoms: backups crawling, large file copies underperforming link speed.

Run the quick pair: a sync-flush dd (Task 13) and a big sequential dd (Task 14). If sequential is fine but sync is terrible, you’ve already narrowed it.

Second: check the network for obvious betrayal

  • Ping RTT and jitter (Task 3).
  • Retransmits (Task 4).
  • Interface drops/errors (Task 16).

If you see retransmits increasing during the incident, stop and fix the network path. Storage tuning won’t outsmart packet loss.

Third: validate mount semantics and protocol version

  • Confirm mount options and negotiated rsize/wsize, version, proto (Tasks 1–2).
  • Check if nconnect is actually in effect.
  • Look for actimeo=0 or bizarre timeouts.

Fourth: identify if the server is CPU-bound, thread-bound, or IO-bound

  • NFS server threads (Task 6).
  • NFS operation mix (Task 7): metadata vs data.
  • ZFS pool iostat (Task 9): IOPS and bandwidth profile.
  • ARC behavior (Task 11): are you missing cache and going to disk?

Fifth: confirm sync path (SLOG/ZIL) and dataset properties

  • SLOG presence (Task 12).
  • Dataset sync/logbias/recordsize (Task 10).
  • Dedup/copies surprises (Task 15).

If it’s sync-heavy and you don’t have a proper SLOG, you have your answer. If you do have one, suspect the SLOG device quality or saturation,
or that you’re not actually hitting it due to dataset settings.

Three corporate mini-stories from the trenches

Incident: the wrong assumption (“NFS async means safe enough”)

A mid-sized company ran their CI artifacts and a small Postgres instance on an NFS export backed by ZFS. The storage team had set the NFS export to async
because it “improved performance,” and they assumed ZFS would cover durability anyway. The application team assumed “NFS is shared disk, therefore durable.”

Then came a power event. Not dramatic—just a brief outage and a messy reboot. The Postgres instance restarted, and immediately started complaining about corrupted pages.
The CI system also showed missing artifact chunks that were “successfully uploaded” minutes earlier. The incident channel filled with the usual classics:
“Could it be DNS?” and “I thought ZFS prevented corruption.”

The root problem wasn’t ZFS corruption. It was semantics. The NFS server had replied to writes before they were durable on stable storage.
ZFS preserved the integrity of everything it had actually committed; the problem was that some acknowledged writes never made it that far. That's not a checksum problem; it's a promise problem.

The fix wasn’t a single toggle. They separated workloads: CI artifacts stayed on async exports (acceptable risk, re-creatable data),
databases moved to an export and dataset policy that respected synchronous durability, with a mirrored enterprise SLOG.
They also documented the policy in the service catalog so “async” stopped being an invisible footgun.

Optimization that backfired: the recordsize gamble

Another org served VM images over NFS from a shiny new ZFS pool. Someone noticed the default recordsize=128K and decided that “bigger blocks mean faster.”
They set recordsize to 1M on the dataset hosting VM disks. The initial tests looked great: copying large ISO files was faster, and a synthetic sequential benchmark
made a graph that could win awards.

Two weeks later, random latency complaints started: VM boot times wandered, interactive sessions stuttered, and the hypervisor logs showed occasional IO timeouts.
The storage server didn’t look “busy” in bandwidth terms, but it was busy in the worst way: read-modify-write amplification and cache inefficiency.
Small guest writes forced the storage to touch huge records, and ARC started caching big data blocks instead of the metadata and hot regions that mattered.

The troubleshooting took longer than it should have because everyone stared at throughput. Throughput was fine. Tail latency was not.
Eventually someone compared IO sizes from the hypervisor with ZFS recordsize and realized they had tuned for the wrong workload.

The recovery plan was boring: new dataset with sane recordsize (16K–32K), storage vMotion/migration of VM disks, and leaving the 1M recordsize for backup archives.
Lesson learned: don’t tune a filesystem for a benchmark you don’t run in production.

Boring but correct practice that saved the day: separate datasets, sane defaults, documented intent

A large enterprise had an internal “NFS platform” used by dev, build, analytics, and a handful of stateful services. They did something profoundly unsexy:
they standardized dataset templates. Home dirs got one template; build caches another; VM images another. Each template had a recordsize, atime, compression,
and caching policy aligned to the use case.

When a performance incident hit during a big release week, the on-call didn’t have to guess what “this share is for.” The dataset name encoded intent, and the
properties matched it. They pulled zfs get output and immediately ruled out the classic mistakes: no surprise dedup, no accidental sync=always,
no atime churn on read-heavy shares.

The issue ended up being network congestion on a ToR switch. Because the storage configuration was predictable, they didn’t waste hours toggling ZFS settings
and rebooting clients. They fixed the queueing, and the problem disappeared.

The win wasn’t a clever tunable. The win was eliminating ambiguity, which is the most reliable performance optimization I know.

Common mistakes: symptom → root cause → fix

1) “Big file copy is fast, but databases are slow”

Symptom: Sequential dd looks great, but anything that fsyncs crawls; VM guests stutter.

Root cause: Sync write latency bottleneck (no SLOG, weak SLOG, slow pool flush).

Fix: Add a proper mirrored SLOG with PLP; keep sync=standard. Verify with sync-flush tests and observe latency under load.

2) “Everything got slower after we set actimeo=0”

Symptom: Builds and file listing operations become sluggish; server CPU rises; getattr ops spike.

Root cause: Attribute cache disabled, causing constant metadata revalidation.

Fix: Use a modest actimeo (1–5 seconds) or tune acregmin/acregmax rather than setting it to zero; add a special vdev for metadata if needed.

3) “NFS hangs forever when the server is down”

Symptom: Processes stuck in D state; unkillable IO waits.

Root cause: Hard mount doing what it promised, plus an application that can’t tolerate blocking IO.

Fix: Keep hard mounts for correctness, but tune timeo/retrans sensibly and design apps with timeouts; for non-critical mounts, consider separate paths or local caches.

4) “Random IO is terrible on a RAIDZ pool”

Symptom: High write IOPS workload yields awful latency; iostat shows lots of small writes across HDD vdevs.

Root cause: RAIDZ + small random writes + sync semantics equals write amplification and seek storms.

Fix: Use mirrors for random-write-heavy pools, or ensure workload is cached/aggregated; add SLOG for sync, tune recordsize, and consider special vdev for metadata/small blocks.

5) “We added nfsd threads and it got worse”

Symptom: CPU usage spikes, context switching rises, latency increases.

Root cause: Too many server threads causing contention and scheduler overhead.

Fix: Reduce threads to a measured sweet spot; focus on CPU affinity, network interrupts, and underlying storage latency.

6) “After enabling jumbo frames, some clients are randomly slow”

Symptom: Certain subnets or hosts see timeouts, retransmits; others fine.

Root cause: MTU mismatch somewhere in the path causing fragmentation or drops.

Fix: Ensure end-to-end MTU consistency or revert to 1500; validate with ping DF tests and switch configuration checks.

7) “ARC hit rate is fine but we’re still slow”

Symptom: ARC miss% looks okay, yet clients complain; pool shows sync write pressure.

Root cause: Cache doesn’t help sync commit latency; the bottleneck is flush/commit time.

Fix: Invest in SLOG and low-latency devices; reduce sync IO where correct (app settings), not where dangerous (sync=disabled).

8) “We enabled dedup and now the NFS server is haunted”

Symptom: Latency spikes, memory pressure, unpredictable performance.

Root cause: Dedup requires large, hot metadata structures; if not sized correctly, it thrashes and punishes every IO.

Fix: Don’t use dedup for general NFS. Migrate data off dedup dataset; keep compression instead.

Joke #2: Dedup is the storage equivalent of adopting a raccoon—sometimes it’s cute, but it will absolutely get into everything.

Checklists / step-by-step plan

Plan A: make an existing ZFS+NFS share feel local for general users

  1. Baseline the client mount. Confirm vers, proto=tcp, hard, and negotiated rsize/wsize (Tasks 1–2).
  2. Set sensible mount options. Typical good starting point: NFSv4.2, rsize/wsize=1M, hard, tuned timeo, and consider nconnect=4 or 8 for busy clients.
  3. Disable atime on the dataset. zfs set atime=off for user shares unless required.
  4. Enable lz4 compression. zfs set compression=lz4.
  5. Watch metadata behavior. If getattr dominates (Task 7), consider moderate attribute caching and special vdev for metadata.
  6. Validate network sanity. RTT, retransmits, drops (Tasks 3–4, 16).
  7. Re-test with real workflows. Git operations, builds, file browsing, not just a big dd.

Plan B: serve VM images over NFS without regrets

  1. Create a dedicated dataset. Don’t reuse the “home dirs” dataset and hope for the best.
  2. Set recordsize to match IO. Start at 16K or 32K for VM disks. Test both.
  3. Keep sync honest. Use sync=standard. Add a proper mirrored SLOG if HDD-based or if sync latency is high.
  4. Consider caching strategy. Often primarycache=metadata helps if the dataset is large and ARC would otherwise be polluted by guest data.
  5. Measure sync latency. Use sync-flush tests (Task 13) and observe pool iostat under load (Task 9).
  6. Validate failover behavior. NFS client recovery and hypervisor expectations can create performance issues that look like “storage.”

Plan C: metadata-heavy builds and source trees

  1. Prioritize metadata. Consider a special vdev for metadata/small blocks.
  2. Use compression and atime=off. Reduce IO churn.
  3. Tune attribute caching. Don’t set actimeo=0 unless you enjoy self-inflicted pain.
  4. Watch server CPU and nfsd threads. Increase threads only when you see queueing, not as a ritual.
  5. Ensure ARC has breathing room. Metadata caching is your best friend here.

FAQ

1) Should I use NFSv3 or NFSv4.2?

Default to NFSv4.2 on modern Linux unless you have a compatibility reason. NFSv3 can be simpler and sometimes easier to debug, but v4.x features
often reduce chatter and improve security posture.

2) What rsize/wsize should I use?

Start with 1M where supported. If negotiation falls back, you’ll see it in nfsstat -m. If your workload is latency-sensitive and small-IO-heavy,
rsize/wsize won’t save you from sync latency—SLOG and pool design will.

3) Is nconnect always good?

Often good for throughput and parallelism, especially on fast links and multi-core servers. But it increases connection state and can expose server-side lock contention.
Try 4 or 8, measure, and don’t assume more is better.

4) Do I need a SLOG?

If you have sync-heavy workloads (databases, VM images, anything fsyncing frequently) and your pool isn’t already low-latency, yes. For all-flash pools with
excellent latency, a SLOG might not help much. Measure with sync-flush writes.

5) Can I just set sync=disabled to fix performance?

You can, and it will look great—until you crash and lose acknowledged writes. For disposable data, maybe. For anything that matters, don’t do it.
Fix the actual latency path instead.

6) What ZFS recordsize should I use for NFS shares?

For general file shares: 128K is fine. For VM images and databases: start at 16K or 32K. For large sequential archives: larger can help.
The right value depends on IO shape, not on link speed.

7) Does compression help or hurt NFS performance?

Often helps. Less data over disk and network, sometimes better cache utilization. lz4 is the usual pick. If CPU is pegged and IO is low,
then consider testing, but don’t disable compression by superstition.

8) Why does my NFS share feel slow only during peak hours?

Congestion and queueing. Check retransmits, switch port utilization, interface drops, and server CPU softirq load. Storage can be fine; the network can be
building queues that inflate tail latency.

9) Should I use a special vdev?

If your workload is metadata-heavy or you have millions of small files, a special vdev on fast SSDs can be a night-and-day improvement.
But treat it like a critical vdev: mirror it and monitor it, because losing it can mean losing the pool depending on configuration.

10) How do I know if the bottleneck is the NFS server or the ZFS pool?

Look at server-side nfsstat -s for operation mix and volume, then correlate with zpool iostat. If NFS ops spike but pool stays idle,
suspect CPU/threading/network. If pool IOPS and latency spike, suspect storage design (sync path, vdev layout, recordsize, cache misses).

Conclusion: practical next steps

If you want ZFS over NFS to feel like local disk, optimize for tail latency and semantics, not for pretty throughput charts.
Start with the basics: confirm mounts, confirm network health, observe IO shape, then align ZFS dataset properties to the workload.

  1. Run the two quick client tests: sync-flush and sequential throughput (Tasks 13 and 14). Decide whether you’re fighting latency or bandwidth.
  2. Verify negotiated NFS options and fix mismatches (Tasks 1–2). Don’t debug what you didn’t actually configure.
  3. Check network RTT and retransmits (Tasks 3–4). If packets are dropping, storage tuning is just elaborate denial.
  4. On the server, correlate NFS op mix with ZFS pool behavior (Tasks 7–9). Identify whether it’s metadata, sync, or raw IO.
  5. Apply targeted ZFS changes: recordsize per workload, compression on, atime off, and a proper SLOG where sync matters (Tasks 10–12).

Then write down what you chose and why. Future-you will be tired, on-call, and allergic to mystery knobs.

Leave a comment