Storage: iSCSI vs NFS vs NVMe-oF — What Actually Wins and Why


You don’t notice storage protocols when they work. You notice them at 02:13 when the database starts timing out, CPU is bored,
network looks “fine,” and your pager insists this is now your personality.

iSCSI, NFS, and NVMe-oF all move bytes over networks. That’s where the similarity ends. They differ in semantics, failure modes,
operational overhead, and the kind of performance you can realistically get without turning your storage team into full-time packet archaeologists.

The real question: what are you optimizing for?

“Which is faster?” is the wrong first question. You can make any of these fast in a lab and miserable in production.
The right question is: what failure are you willing to tolerate, and what work are you willing to do repeatedly?

Storage protocols are a bundle deal:

  • Semantics: block device vs shared filesystem, locking behavior, metadata operations.
  • Concurrency model: one host writes its block device vs multiple clients sharing a namespace.
  • Recovery behavior: what happens when a path flaps, a switch drops packets, or a server reboots mid-I/O.
  • Operational ergonomics: provisioning, resizing, snapshots, multipathing, observability.
  • Security posture: authN/authZ, encryption, blast radius, “oops” protection.

If you’re running virtualization, databases, analytics, or Kubernetes, you’re picking not just a protocol but a lifestyle.
Choose the one you can operate at 3 a.m. while half-awake and mildly resentful.

Quick verdicts (the opinionated part)

Pick NFS when…

  • You want shared storage with sane human workflows: home directories, content repos, media, build caches.
  • You value simplicity of provisioning and fast restores more than absolute lowest latency.
  • You can invest in a good NFS server (or appliance) and you’re disciplined about mount options and network design.

NFS is the “works well enough” champion—until you abuse it with metadata-heavy workloads and pretend the network is lossless.

Pick iSCSI when…

  • You need block storage for a single host (or cluster filesystem/volume manager on top).
  • You need mature multipathing and broad compatibility with OSes, hypervisors, and storage arrays.
  • You can commit to operational hygiene: timeouts, path policies, and monitoring.

iSCSI is the reliable sedan of network storage. It’s not cool. It’s also not usually the reason you miss your SLO—unless you configure it like a lab toy.

Pick NVMe-oF when…

  • Your workload is latency-sensitive and already tuned locally (databases, high-frequency indexing, log ingestion at scale).
  • You have a clean, low-loss network and can justify operational complexity.
  • You’re ready to treat the fabric like a first-class system: telemetry, congestion control, and careful change management.

NVMe-oF can be spectacular. It can also be spectacularly humbling when a tiny amount of packet loss turns your “almost local NVMe” dream into a ticket generator.

If you want one default rule: Choose NFS for shared files, iSCSI for generic block, NVMe-oF only when you can prove you need it and can operate it.

Interesting facts and historical context (so you stop repeating old mistakes)

  1. NFS predates your cloud budget. NFS first appeared in the mid-1980s at Sun Microsystems, built to make networks feel like local filesystems.
  2. NFSv3 went big by being simple. Its stateless design helped servers scale and recover, but pushed complexity to clients (and locking to side protocols).
  3. NFSv4 got serious about state. It introduced integrated locking and stronger security mechanisms, trading simplicity for correctness and better WAN behavior.
  4. iSCSI was born to kill “special networks.” It emerged around the early 2000s to run SCSI over TCP/IP so SANs could use Ethernet instead of FC.
  5. Jumbo frames were once a status symbol. Many teams enabled MTU 9000 without end-to-end validation; mismatches still cause weird fragmentation and drops today.
  6. NVMe was designed for parallelism. NVMe uses multiple queues to reduce lock contention and exploit modern CPUs—unlike older storage stacks built for spinning disks.
  7. NVMe-oF isn’t one thing. It’s a family: RDMA transports (RoCE/iWARP/InfiniBand) and TCP. TCP is often easier to deploy; RDMA can be lower latency but is pickier.
  8. “NAS vs SAN” is mostly a proxy war. The real split is file semantics vs block semantics—and which layer gets to own caching, locking, and consistency.
  9. Multipathing predates your container platform. MPIO patterns were hammered out in the era of dual-controller arrays and flaky HBAs; the lessons still apply.

How they actually work (and where they hurt)

NFS: remote filesystem semantics

With NFS, clients ask a server for file operations: open, read, write, getattr, readdir, lock (in v4), and so on.
The server controls the authoritative namespace and metadata. Clients cache aggressively to reduce round trips.

The big win: shared namespace and trivial provisioning. Export a directory, mount it, done.
The big trap: metadata latency and cache coherency behavior become performance and correctness issues.
NFS can scream on sequential throughput and still melt down on “millions of tiny files” patterns.
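You can see this split for yourself with a crude probe. A minimal sketch, assuming a Linux client with GNU coreutils: it times file creates and stats, which on NFS each become metadata RPCs. Point it at an NFS mount to compare against local disk; with no argument it uses a throwaway local temp directory. The numbers are relative, not a benchmark.

```shell
#!/usr/bin/env bash
# Rough metadata probe (illustrative): time N file creates and N stats.
# On NFS, each create/stat is a round trip to the server; on local disk
# it is mostly cached. Compare the two and the "millions of tiny files"
# problem stops being abstract.
set -eu
DIR="${1:-}"
CLEANUP=0
if [ -z "$DIR" ]; then DIR=$(mktemp -d); CLEANUP=1; fi
N=500

start=$(date +%s%N)
for i in $(seq 1 "$N"); do : > "$DIR/probe-$i"; done   # each create = a metadata RPC on NFS
mid=$(date +%s%N)
for i in $(seq 1 "$N"); do stat "$DIR/probe-$i" >/dev/null; done
end=$(date +%s%N)

echo "create: $(( (mid - start) / 1000000 )) ms for $N files"
echo "stat:   $(( (end - mid) / 1000000 )) ms for $N files"
if [ "$CLEANUP" -eq 1 ]; then rm -rf "$DIR"; fi
```

If create time balloons on the NFS mount while throughput tests look fine, you have a metadata problem, not a bandwidth problem.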

iSCSI: block device over TCP

iSCSI encapsulates SCSI commands over TCP. The client (initiator) sees a remote LUN as a local block device.
Filesystems, LVM, RAID, databases—those sit above it, just like with local disks.

The win: applications and OSes know how to deal with block devices. Multipath is well understood.
The pain: you inherit block storage sharp edges: corruption if multiple hosts write without a cluster-aware layer, and ugly failure behavior if timeouts are wrong.

NVMe-oF: NVMe semantics across a fabric

NVMe-oF extends NVMe command sets over the network. Compared to iSCSI, it generally reduces protocol overhead and supports deep parallelism.
In practice, you can get closer to local NVMe latency—especially with RDMA—but only if the network behaves.

The pain is not “it’s new” (it’s not that new anymore). The pain is that it raises expectations.
When latency drops, the next bottleneck becomes visible: CPU scheduling, IRQ affinity, TCP congestion, noisy neighbors, array firmware, you name it.
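For orientation, connecting a host to an NVMe/TCP target is a one-liner with nvme-cli. This sketch only prints the command rather than running it; the address, port, and NQN are illustrative placeholders, not a real target.

```shell
#!/usr/bin/env bash
# Sketch: assemble an NVMe-oF/TCP connect command (printed, not executed).
# The target address, service ID, and NQN below are placeholders.
set -eu
build_nvme_connect() {
  # $1=transport  $2=target address  $3=service id (port)  $4=subsystem NQN
  echo "nvme connect -t $1 -a $2 -s $3 -n $4"
}
cmd=$(build_nvme_connect tcp 10.20.0.50 4420 nqn.2014-08.org.example:subsys1)
echo "$cmd"
# Run for real with: sudo $cmd  (requires nvme-cli and the nvme_tcp module)
```

The ease of that command is exactly the trap: connecting is trivial, operating the fabric underneath it is not.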

Joke #1: NVMe-oF promises “local-like performance.” So does every resume I’ve ever read.

Performance reality: latency, IOPS, throughput

Latency: the only metric your database truly believes

If your workload is transactional, latency distribution matters more than peak IOPS. Median isn’t enough; tail latency ruins you.
Protocol overhead, network jitter, queue depths, and retransmits show up as p95/p99 spikes.

  • NFS: often excellent for large sequential reads/writes, but metadata and synchronous writes can be expensive depending on server and mount options.
  • iSCSI: usually stable and predictable; multipathing helps availability more than raw latency.
  • NVMe-oF: best potential latency; also the fastest way to discover your network isn’t as clean as you assumed.
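If you want that latency distribution as numbers rather than vibes, fio can report percentiles directly. A hedged sketch that only prints the job; the target path, size, and runtime are placeholders, and you should point it at a scratch file, never at a device holding data you care about.

```shell
#!/usr/bin/env bash
# Sketch: a fio job that reports latency percentiles, not just averages.
# Printed rather than executed here; TARGET is a placeholder scratch file.
set -eu
TARGET="${1:-/mnt/scratch/fio-probe}"
FIO_JOB="fio --name=lat-probe --filename=$TARGET --rw=randread --bs=4k \
 --iodepth=16 --size=256M --runtime=30 --time_based --direct=1 \
 --ioengine=libaio --percentile_list=50:95:99 --group_reporting"
echo "$FIO_JOB"
# Read the clat percentile block in fio's output; p99 is what users feel.
```

Run the same job against each protocol's backing storage and compare p99, not the averages.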

IOPS: what marketing loves and SREs distrust

“IOPS” without block size, read/write mix, queue depth, and latency distribution is trivia.
Still, the protocol shapes how efficiently small I/O is transported.

NVMe-oF tends to handle high queue depths and parallelism well. iSCSI can do very high IOPS too, but CPU overhead and SCSI command processing can show up sooner.
NFS performance depends heavily on client caching, server threads, and whether your workload is data-heavy or metadata-heavy.

Throughput: the easy win with the easiest foot-guns

For bulk throughput, 10/25/40/100GbE matters more than protocol choice. Also: end-to-end MTU consistency, NIC offloads, and not oversubscribing your uplinks into a black hole.

The most common throughput failure isn’t “protocol is slow.” It’s “one switch port is erroring” or “one link in a LAG is half-dead” or “the array is doing background rebuilds.”

CPU cost: the hidden tax

iSCSI and NFS both ride TCP, so CPU becomes part of the storage bill. NVMe-oF over TCP is similar; NVMe-oF over RDMA can reduce CPU overhead but increases operational constraints.
If you’re on shared compute nodes, that CPU tax is real: it steals headroom from application threads and shifts latency upward.

“Hope is not a strategy.” — paraphrased idea often cited in engineering and operations culture

Failure modes you’ll meet in production

NFS: when the server sneezes, clients catch a cold

NFS is centralized. That’s both the point and the risk. If the server is overloaded, all clients see it.
Client caching can mask problems until it can’t, and then you get thundering herds.

  • Stale file handles: typically after server-side export changes, filesystem rebuilds, or failovers that didn’t preserve inode identity.
  • “Hanging” I/O: clients waiting on server responses, often due to network loss, server thread starvation, or lock contention.
  • Metadata storms: build systems, package managers, and “let’s store millions of small objects as files” designs.

iSCSI: death by timeouts and partial failures

iSCSI tends to fail in ugly partial ways: one path degrades, one switch drops packets, one NIC starts flapping.
Your OS keeps the block device alive until it can’t. Then filesystems panic, or worse: they keep going and your app corrupts itself.

  • Path flapping: intermittent link issues triggering frequent failover, causing latency spikes.
  • Queueing collapse: I/O piles up behind a stalled path; by the time failover happens, you’re already paging.
  • Split-brain access: multiple initiators writing the same LUN without coordination (this is not a “maybe”; it is a “when”).

NVMe-oF: the network is now your backplane

NVMe-oF shines when you can treat the fabric like a high-quality internal bus. But Ethernet is a democracy of packets; it does not care about your latency goals.
Congestion, microbursts, ECN settings, buffer behavior, and NIC firmware all matter.

  • Packet loss sensitivity: even small loss rates can spike tail latency, especially under load.
  • Misconfigured multipath: asymmetry or wrong policies leading to hot paths and unpredictable performance.
  • Observability gap: teams deploy NVMe-oF before they can explain per-queue latency and retransmits.

Joke #2: Storage people love “five nines” until the network team asks which five minutes you’re willing to lose.

Three corporate mini-stories from the trenches

Incident: the wrong assumption (“It’s just a mount”)

A mid-sized company ran CI builds on a fleet of Linux workers. They stored build artifacts on an NFS share because it was easy to manage and easy to clean up.
Someone noticed the share was “underutilized” and decided to consolidate: move user home directories, build caches, and application logs onto the same export.

The wrong assumption was subtle: they assumed NFS behaves like local disk under metadata churn. It does not.
The CI workload created a storm of file creates, stats, renames, and deletes. The home directories added a constant background of small reads and writes.
Logs added synchronous appends plus bursts at rotation time.

The symptoms were confusing. CPU on the build workers spiked, but not in a “busy compiling” way—more in a “blocked in D state” way.
Builds started timing out. SSH logins became sluggish because shell startup touches a lot of files. Monitoring showed the NFS server’s network wasn’t saturated.

The root cause: NFS server threads and storage backing were saturated by metadata operations and synchronous writes, while client-side caching was repeatedly invalidated.
“But bandwidth is fine” was the wrong lens. The real bottleneck was ops/sec at the server and the latency of metadata RPCs.

The fix was boring and effective: separate exports by workload, isolate CI caches to their own server (or local disks), and tune mount options and server thread counts.
They also added a synthetic metadata benchmark to capacity planning. The service recovered, and the NFS server stopped being the company’s unofficial shared anxiety.

Optimization that backfired: jumbo frames and the silent drop

Another organization wanted better throughput on iSCSI. Someone enabled MTU 9000 on the storage VLAN and on the iSCSI NICs.
The change window was short, and they didn’t validate every hop. “It’s a dedicated VLAN, it’s fine.”

For a week, everything looked okay—until periodic latency spikes appeared during peak hours. Not full outages. Just the kind of random slowness that makes teams blame “the cloud”
even when they run their own racks.

The pattern was nasty: only certain hosts saw it, and only when traffic crossed a particular pair of switches.
TCP retransmits rose slightly, but nobody was watching retransmits on the storage VLAN because, historically, “storage VLANs are clean.”

The real issue: one switch interface in the path was still at MTU 1500 and was dropping oversized frames rather than fragmenting.
iSCSI traffic got retransmitted, queues built up, and the multipath layer occasionally failed over—adding more turbulence.
Bandwidth charts stayed calm because lost packets don’t show up as used throughput.

The rollback—return to MTU 1500 end-to-end—stabilized everything immediately. Later, they reintroduced jumbo frames only after enforcing MTU compliance checks in automation
and alerting on interface drops and TCP retransmits. The lesson: performance “optimizations” that are not end-to-end are just new failure modes.
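That compliance check doesn't need to be fancy. A minimal sketch of the DF-bit probe they automated, assuming a Linux host; the peer address is a placeholder, and the header arithmetic (20 bytes IP + 8 bytes ICMP) is the standard reason a 9000-byte MTU needs an 8972-byte ping payload.

```shell
#!/usr/bin/env bash
# Sketch: verify a jumbo MTU actually survives end-to-end with DF-bit pings.
# PEER is a placeholder; run from every host toward every storage portal.
set -eu
mtu_payload() {
  # ICMP payload for a given MTU: subtract 20 (IP) + 8 (ICMP) header bytes
  echo $(( $1 - 28 ))
}
PEER="${1:-10.20.0.10}"
MTU="${2:-9000}"
SIZE=$(mtu_payload "$MTU")
echo "probe: ping -c 3 -M do -s $SIZE $PEER"
# -M do sets Don't-Fragment: if any hop in the path is below $MTU, the
# ping fails loudly instead of the traffic silently fragmenting or dropping.
```

If the printed probe fails on any host/portal pair, your jumbo frames are a liability, not an optimization.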

Boring but correct practice that saved the day: multipath discipline

A finance-adjacent platform ran critical databases on iSCSI LUNs. The setup wasn’t exotic: dual storage controllers, two fabrics, multipath on every host.
What was different was the team’s discipline. Every host had the same multipath config template, the same timeouts, and a quarterly “pull a cable” test in staging.

One afternoon, a top-of-rack switch started logging CRC errors on a port connected to a storage controller.
The errors were intermittent enough that the link stayed up, but bad enough to cause retries and occasional iSCSI session issues.

The databases didn’t go down. Application latency ticked up slightly, then stabilized.
Monitoring caught increased path failovers and elevated I/O latency on one path group. On-call drained traffic, failed the suspect link intentionally, and opened a hardware ticket.

The postmortem was not dramatic. That’s the point. Multipath with sane timeouts turned a flaky physical layer problem into a maintenance event instead of a customer incident.
They didn’t “solve” physics; they planned for it and practiced the response until it was muscle memory.

Hands-on: practical tasks with commands, outputs, and decisions

These are the tasks you actually do when systems wobble. Each one includes what the output means and what decision you make next.
Commands assume a Linux client unless stated otherwise.

1) Identify whether you’re on NFS, iSCSI, or NVMe-oF from the host

cr0x@server:~$ findmnt -t nfs,nfs4
TARGET           SOURCE                 FSTYPE OPTIONS
/mnt/shared      nas01:/export/shared   nfs4   rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2

cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL,TRAN
NAME    TYPE  SIZE MODEL        TRAN
sda     disk  200G Virtual Disk
sdb     disk  2.0T iscsi_disk   iscsi
nvme0n1 disk  1.6T NVMeDisk     nvme

cr0x@server:~$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:uuid:...
\
 +- nvme0 tcp traddr=10.20.0.50 trsvcid=4420 live

Meaning: findmnt shows active NFS mounts; lsblk shows transport; nvme list-subsys confirms NVMe-oF sessions.

Decision: Pick the right diagnostic path. Don’t troubleshoot NFS like block storage or vice versa.

2) NFS: confirm mount options that can make or break latency

cr0x@server:~$ nfsstat -m
/mnt/shared from nas01:/export/shared
 Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.20.1.21

Meaning: hard means I/O retries indefinitely (safer for data, can “hang” apps); timeo and retrans influence stall behavior; rsize/wsize impact throughput.

Decision: For critical data, keep hard. If user-facing apps can’t tolerate long stalls, consider architecture changes (local caching, async pipelines) rather than flipping to soft.
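Auditing a fleet for risky soft mounts is scriptable. A minimal sketch reading /proc/self/mounts (present on any Linux host), which avoids depending on nfsstat being installed:

```shell
#!/usr/bin/env bash
# Sketch: flag NFS mounts using the risky "soft" option.
set -eu
soft_nfs_mounts() {
  # prints "mountpoint options" for every nfs/nfs4 mount carrying ,soft
  awk '$3 ~ /^nfs4?$/ && $4 ~ /(^|,)soft(,|$)/ { print $2, $4 }' /proc/self/mounts
}
count=$(soft_nfs_mounts | wc -l)
echo "soft NFS mounts found: $count"
```

Anything this prints on a host carrying critical data deserves a conversation before it deserves an incident.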

3) NFS: measure RPC-level behavior (latency and retrans)

cr0x@server:~$ nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
2143098    153        2143120

Client nfs v4:
null         read         write        open         close        getattr
0            152340       91822        4311         4309         602111

Meaning: retransmits suggest network loss/congestion or server overload. Even “small” retrans counts can correlate with tail latency spikes.

Decision: If retrans is non-zero and rising, treat it like a P1. Check network errors and server load before tuning apps.

4) NFS: spot server-induced stalls in kernel logs

cr0x@server:~$ dmesg -T | tail -n 8
[Tue Feb  4 10:44:21 2026] nfs: server nas01 not responding, still trying
[Tue Feb  4 10:44:23 2026] nfs: server nas01 not responding, still trying
[Tue Feb  4 10:44:41 2026] nfs: server nas01 OK

Meaning: Clients experienced timeouts; recovery happened. This is often the earliest on-host signal of server hiccups or network micro-outages.

Decision: Correlate with switch interface counters and NFS server CPU/IO. If it repeats, plan mitigation (separate workloads, add HA, improve network).

5) iSCSI: validate sessions and target discovery

cr0x@server:~$ sudo iscsiadm -m session
tcp: [1] 10.20.0.10:3260,1 iqn.2003-01.org.linux-iscsi.san01:storage.lun1 (non-flash)
tcp: [2] 10.20.0.11:3260,1 iqn.2003-01.org.linux-iscsi.san01:storage.lun1 (non-flash)

cr0x@server:~$ sudo iscsiadm -m node -o show | head
# BEGIN RECORD 2.1.8
node.name = iqn.2003-01.org.linux-iscsi.san01:storage.lun1
node.tpgt = 1
node.startup = automatic

Meaning: Two sessions typically represent two paths (good). Startup automatic means it reconnects on boot.

Decision: If you only see one session where you expect two, stop: you have a single point of failure and likely performance imbalance.

6) iSCSI: check multipath health and path policy

cr0x@server:~$ sudo multipath -ll
mpatha (36001405a7c3d2a1b9c2d7f1a4e5b6c7d) dm-2 LIO-ORG,iscsi_disk
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 4:0:0:1 sdc 8:32 active ready running

Meaning: Both paths are active/ready. queue_if_no_path means I/O queues during total path loss (can cause app stalls).
Policy influences load balancing.

Decision: If any path is faulty or failed, fix physical/network issues first. Don’t “tune” around a broken path.
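The queue_if_no_path behavior above is exactly what a bounded no_path_retry tames: queue briefly during failover, then fail I/O instead of stalling applications forever. An illustrative multipath.conf fragment; the values are placeholders and must follow your array vendor's guidance:

```
# /etc/multipath.conf (illustrative fragment, not vendor-blessed defaults)
defaults {
    polling_interval    5
    path_selector       "service-time 0"
    no_path_retry       12      # queue ~60s (12 checks * 5s), then fail I/O
    failback            immediate
}
```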

7) iSCSI: confirm timeouts to avoid “hang forever” or “fail too fast”

cr0x@server:~$ sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.san01:storage.lun1 -p 10.20.0.10:3260 -o show | egrep 'noop|replacement|timeout'
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.timeo.replacement_timeout = 120

Meaning: Noops detect dead paths; replacement timeout controls failover behavior. Too large = long stalls; too small = flappy sessions.

Decision: Align these values with multipath and application tolerance. Test failover under load; don’t guess in production.
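Changing replacement_timeout is a one-line node update. A sketch using the target and portal from the example above; the command is printed, not executed, and 30 seconds is an illustrative value, not a recommendation for your environment.

```shell
#!/usr/bin/env bash
# Sketch: lower replacement_timeout so multipath fails over instead of
# queueing behind a dead path for two minutes.
set -eu
TARGET="iqn.2003-01.org.linux-iscsi.san01:storage.lun1"
PORTAL="10.20.0.10:3260"
CMD="iscsiadm -m node -T $TARGET -p $PORTAL -o update \
 -n node.session.timeo.replacement_timeout -v 30"
echo "$CMD"
# Only safe when multipath actually handles path loss; with a single path,
# a short timeout just turns long stalls into I/O errors faster.
```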

8) NVMe-oF: list controllers and transport details

cr0x@server:~$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
/dev/nvme0n1     deadbeefdeadbeef     NVMeOF-Target                             1         2.00  TB / 2.00  TB       4 KiB +  0 B     1.3

cr0x@server:~$ sudo nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:uuid:3b8d...
\
 +- nvme0 tcp traddr=10.20.0.50 trsvcid=4420 live

Meaning: Confirms NVMe-oF over TCP and which target IP/port you’re using.

Decision: If you see unexpected transports or missing paths, stop and fix topology before chasing performance.

9) NVMe-oF: check for multipath and ANA state (asymmetric access)

cr0x@server:~$ cat /sys/module/nvme_core/parameters/multipath
Y

cr0x@server:~$ sudo nvme ana-log /dev/nvme0
Asymmetric Namespace Access Log for NVMe device: nvme0
ANA LOG HEADER :-
chgcnt  :       4
ngrps   :       2
ANA Log Desc :-
grpid   :       1
state   :       optimized
nsid    :       1
grpid   :       2
state   :       non-optimized
nsid    :       1

Meaning: Multipath is enabled. One path is optimized, another is non-optimized (common with dual-controller systems).

Decision: Ensure your host uses optimized paths preferentially. If traffic hits non-optimized paths, you’ll get avoidable latency.

10) Network: check interface errors (the unglamorous truth)

cr0x@server:~$ ip -s link show dev ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    RX:  bytes packets errors dropped  missed   mcast
    9876543210 1234567  0      12      0        0
    TX:  bytes packets errors dropped carrier collsns
    8765432109 1122334  0      0      0       0

Meaning: RX drops exist. Drops on a storage interface are latency multipliers.

Decision: Investigate physical layer, MTU mismatches, congestion. Don’t accept “a few drops” on storage networks.

11) Network: measure TCP retransmits (especially for iSCSI/NFS/NVMe-TCP)

cr0x@server:~$ nstat -az | egrep 'TcpRetransSegs|TCPSynRetrans|TCPTimeouts'
TcpRetransSegs                  1842               0.0
TcpExtTCPSynRetrans             0                  0.0
TcpExtTCPTimeouts               21                 0.0

Meaning: Retransmits and timeouts indicate loss or severe congestion. This often correlates with p99 latency spikes.

Decision: If these counters climb during incidents, treat the network path as a suspect, not just the storage array.
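During an incident, the delta matters more than the lifetime total. A minimal sketch that samples retransmit growth over an interval by reading /proc/net/snmp (the same counters nstat reads), so it works even where nstat isn't installed:

```shell
#!/usr/bin/env bash
# Sketch: measure TCP retransmit growth over an interval, to correlate
# with storage latency spikes.
set -eu
tcp_retrans() {
  # pull the RetransSegs value out of the Tcp header/value line pair
  awk '/^Tcp:/ { if (!seen) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i; seen = 1 }
                 else print $col }' /proc/net/snmp
}
before=$(tcp_retrans)
sleep "${1:-1}"
after=$(tcp_retrans)
echo "retransmits in interval: $(( after - before ))"
```

A steadily climbing delta during p99 spikes is your cue to stop tuning the array and start interrogating the network.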

12) OS: see if the kernel is waiting on I/O (storage stall vs CPU problem)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812344  92124 933112    0    0   120   340  890 1200 12  4 80  4  0
 3  5      0 810120  92124 933900    0    0   110   500  910 1300 10  3 60 27  0
 1  6      0 808992  92124 934100    0    0   130   480  920 1290  9  3 58 30  0

Meaning: High b (blocked processes) and high wa (I/O wait) point to storage latency as a bottleneck.

Decision: Switch from “app debugging” to “I/O path debugging.” Collect iostat, protocol stats, and network counters immediately.

13) Block devices: watch per-disk latency and queue depth

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   r_await   w_await   aqu-sz  %util
sdb              120.0   80.0     6.10    18.40      2.95   98.0
dm-2             240.0  160.0     7.50    21.30      6.10   99.0

Meaning: await is average latency; aqu-sz shows queued I/O; high %util on dm device suggests saturation somewhere in the path.

Decision: If awaits spike while network is clean, suspect array/controller load or path imbalance. If awaits spike with retrans/drops, suspect network/transport.

14) NFS: detect metadata pain with a simple syscall-heavy probe

cr0x@server:~$ sudo strace -f -tt -T -e trace=file ls -l /mnt/shared >/dev/null
10:51:02.112345 statx(AT_FDCWD, "/mnt/shared", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, ...) = 0 <0.012341>
10:51:02.124910 openat(AT_FDCWD, "/mnt/shared", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <0.020118>

Meaning: 12–20 ms per file operation is slow for a trivial directory listing. That’s metadata latency.

Decision: If metadata ops are slow, stop blaming “disk throughput.” Separate workloads, tune NFS server threads/cache, or redesign to reduce metadata chatter.

15) NVMe-oF: verify host sees expected queue and interrupt behavior

cr0x@server:~$ cat /proc/interrupts | egrep 'nvme|ens192' | head
 42:  1234567  0  0  0  IR-PCI-MSI 524288-edge  nvme0q0
 43:  2345678  0  0  0  IR-PCI-MSI 524289-edge  nvme0q1
 58:  3456789  0  0  0  IR-PCI-MSI 393216-edge  ens192-TxRx-0

Meaning: NVMe queues exist; interrupts are firing. If everything is stuck on one CPU, you’ll see imbalance and latency.

Decision: If interrupts concentrate on one core, adjust IRQ affinity / RSS, and re-test latency under load. NVMe-oF rewards CPU/network tuning; it punishes neglect.
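A quick first check is just counting the vectors. This sketch greps /proc/interrupts for the device pattern from the task above; the pinning example in the comment uses an illustrative IRQ number and requires root.

```shell
#!/usr/bin/env bash
# Sketch: how many IRQ vectors exist for a device pattern. One vector
# (or all activity piled on one CPU column) is a red flag for latency.
set -eu
irq_vectors() {
  grep -c -- "$1" /proc/interrupts || true   # vector count, 0 if none
}
n=$(irq_vectors nvme)
echo "nvme IRQ vectors: $n"
# To pin a vector to a CPU (root required; IRQ 43 is illustrative):
#   echo 2 | sudo tee /proc/irq/43/smp_affinity_list
# Re-measure p99 latency after each affinity change; never batch tweaks.
```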

Fast diagnosis playbook (find the bottleneck without a week-long war room)

First: classify the failure in 60 seconds

  • Is it storage latency or application CPU? Check vmstat (blocked processes, iowait) and app thread states.
  • Is it protocol-specific? NFS client logs vs iSCSI session events vs NVMe controller resets.
  • Is it isolated or systemic? One host, one rack, one AZ, or everyone?

Second: check the network like you mean it

  1. Interface drops/errors on clients and storage ports: ip -s link.
  2. TCP retrans/timeouts for TCP-based protocols: nstat.
  3. Path consistency: MTU end-to-end, LACP health, VLAN correctness.

If you find drops/retransmits, stop. Fix that before tuning storage. Storage traffic is not “just traffic.” It’s your app’s bloodstream.

Third: validate pathing and failover behavior

  • iSCSI: iscsiadm -m session, multipath -ll, and kernel logs for session resets.
  • NVMe-oF: nvme list-subsys, nvme list-ana, multipath enabled.
  • NFS: look for “server not responding” and rising retrans with nfsstat -rc.

Fourth: check the backend storage system (without hand-waving)

  • Is the array rebuilding, scrubbing, or throttling?
  • Are you saturating controller CPU or cache?
  • Is one path/controller hotter than the other (asymmetry)?

Fifth: measure, then change one thing

Capture baseline latency (p50/p95/p99), retrans/drops, queue sizes, and path states. Make a single change, re-measure.
If you can’t measure, you’re not tuning; you’re gambling.
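Capturing that baseline can be one small script. A minimal sketch that snapshots kernel counters into a timestamped directory; extend it with your own collectors (protocol stats, multipath state, array-side metrics).

```shell
#!/usr/bin/env bash
# Sketch: bundle a before/after baseline so a tuning change is judged
# against data instead of vibes.
set -eu
snap() {
  dir="iosnap-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir"
  cp /proc/diskstats "$dir/diskstats"            # per-device I/O counters
  cp /proc/net/snmp  "$dir/snmp"                 # TCP retrans/timeouts
  ip -s link > "$dir/links" 2>/dev/null || true  # interface drops/errors
  echo "$dir"
}
base=$(snap)
echo "baseline saved to $base"
```

Snapshot before the change, snapshot after, diff the counters. One change per cycle.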

Common mistakes: symptoms → root cause → fix

1) Symptom: NFS mounts “hang” and processes go into D state

  • Root cause: hard mounts waiting on a slow/unreachable server, or lock contention on the server.
  • Fix: diagnose server health and network loss; add HA (active/active where possible), separate noisy workloads, and ensure client timeouts are sane. Avoid “fixing” by switching to soft for critical data.

2) Symptom: intermittent iSCSI timeouts, filesystem errors, random VM pauses

  • Root cause: path flapping (bad cable/SFP/NIC), MTU mismatch causing drops, or multipath misconfiguration.
  • Fix: enforce redundant paths, validate MTU end-to-end, monitor retrans and interface errors, and standardize multipath policies across hosts.

3) Symptom: “Great throughput, terrible latency” on NFS

  • Root cause: metadata latency and synchronous operations; too many clients contending on the same export; server thread starvation.
  • Fix: split workloads by export/server, tune NFS server threads and storage backing, consider local caches for build systems, and reduce file-count explosions.

4) Symptom: NVMe-oF over TCP shows random p99 spikes under load

  • Root cause: packet loss or congestion microbursts; CPU/IRQ affinity issues; insufficient buffering/ECN configuration in the fabric.
  • Fix: eliminate drops first; tune NIC RSS/IRQ affinity, verify ECN/aqm strategy (if used), and ensure multipath/ANA prefers optimized paths.

5) Symptom: corruption or “mysterious” filesystem inconsistencies on iSCSI LUNs

  • Root cause: same LUN mounted read-write by multiple hosts without cluster-aware coordination; fencing absent.
  • Fix: enforce single-writer semantics or use a proper cluster filesystem with fencing (or move to NFS if you truly need shared files).

6) Symptom: NFS “stale file handle” errors after maintenance

  • Root cause: server-side filesystem/export moved or replaced in a way that changes file handles/inodes; failover didn’t preserve identity.
  • Fix: remount clients; fix HA procedure to preserve filesystem identity; avoid manual export swapping without coordination.

7) Symptom: iSCSI looks fine, but latency spikes during backup windows

  • Root cause: backend array contention (snapshots, replication, rebuild), or network oversubscription when backup traffic shares uplinks.
  • Fix: isolate backup traffic, schedule heavy array operations, enforce QoS, and monitor array-side latencies and cache hit rates.

8) Symptom: “We upgraded to 100GbE but got no faster”

  • Root cause: single flow limitations, wrong queue depths, CPU bottlenecks, or storage media/controller limits.
  • Fix: measure CPU utilization and IRQ distribution; parallelize workloads; validate that the array can actually deliver more; tune queue depths carefully.

Checklists / step-by-step plan

Decision checklist: choose the protocol based on workload and team reality

  1. Need shared POSIX-ish files? Start with NFSv4.1+.
  2. Need a raw block device? Start with iSCSI (unless proven NVMe-oF requirement).
  3. Need ultra-low latency and high parallelism? Consider NVMe-oF, but require network SLOs and observability first.
  4. Multi-writer requirement? NFS for files, or a cluster-aware block layer with fencing. Never “just mount the LUN on both.”
  5. Ops maturity available? If your team can’t consistently manage MTU, LACP, and monitoring, don’t deploy the protocol that magnifies those errors (NVMe-oF).

Build checklist: make it survive normal failure

  1. Two independent network paths (separate switches) for storage traffic.
  2. Explicit MTU strategy, validated end-to-end, with automation checks.
  3. Host monitoring: retransmits, interface drops, latency percentiles, queue depth.
  4. Documented and tested failover: pull one link, reboot one controller, confirm recovery under load.
  5. Change management: storage network changes require the same rigor as database schema changes.

Operational checklist: what you standardize across hosts

  1. NFS: consistent mount options; use NFSv4.1+ where appropriate; alert on “server not responding” logs and retransmits.
  2. iSCSI: consistent iscsiadm node settings; consistent multipath config; periodic path failure drills.
  3. NVMe-oF: multipath enabled; ANA-aware path preferences; IRQ/RSS tuning standards; alerting on drops/retrans and controller resets.

FAQ

1) Is NFS “slower” than iSCSI?

Not inherently. NFS can be extremely fast for large sequential I/O. It often loses on small, sync-heavy, metadata-heavy workloads.
iSCSI tends to be more predictable for block-oriented patterns.

2) Can I run databases on NFS?

Yes, and many do. But you must validate semantics and latency (especially fsync behavior), and you need a robust NFS server setup.
If you can’t measure tail latency, you’re rolling dice with your redo logs.

3) Why do NFS clients “hang” instead of failing fast?

Because hard mounts prioritize data integrity: the client keeps retrying to avoid partial writes and corruption.
If your application can’t tolerate that, the right fix is usually architectural (timeouts, retries, queues), not switching to soft.

4) Is iSCSI safe for shared access from multiple hosts?

Only with a cluster-aware layer and fencing (cluster filesystem, clustered volume manager, etc.). Otherwise you risk corruption.
Block devices assume a single writer unless proven otherwise.

5) Does NVMe-oF over TCP make sense, or is RDMA mandatory?

NVMe-oF over TCP is often the pragmatic choice: easier to deploy on existing Ethernet, fewer specialized knobs.
RDMA can be lower latency and lower CPU, but it raises the bar on fabric configuration and operational expertise.

6) What’s the biggest hidden cost of NVMe-oF?

Operational cost. You’ll need better network observability, stricter change control, and more tuning discipline.
When you can do it, it’s great. When you can’t, it’s an expensive way to learn humility.

7) Should I always enable jumbo frames for storage?

Only if you can guarantee end-to-end MTU consistency and you monitor drops. Otherwise you’re trading a theoretical win for a real outage mode.
MTU 1500 that’s correct beats MTU 9000 that’s “mostly configured.”

8) What’s the fastest way to tell if the network is the problem?

Check interface drops/errors (ip -s link) and TCP retrans/timeouts (nstat) during the incident window.
If those counters rise, your “storage issue” is at least partly a network issue.

9) For virtualization datastores, NFS or iSCSI?

Both can work. NFS often wins on simplicity (one export, easy provisioning), iSCSI can win on predictability and integration with certain hypervisor features.
Pick based on your operational strengths and your failure drills, not on folklore.

Conclusion: practical next steps

If you’re choosing today with no special constraints: use NFS for shared files, iSCSI for general-purpose block storage,
and reserve NVMe-oF for workloads that can prove they need lower latency and for teams that can prove they can operate the fabric.

Do this next, in order:

  1. Write down your latency target and failure tolerance (p95/p99, stall duration, recovery expectations).
  2. Instrument the path: drops, retransmits, protocol stats, backend latency.
  3. Run a failure drill: pull a link, reboot a controller, fail a switch—measure app impact.
  4. Standardize configs (mount options, multipath, timeouts) and enforce with automation.
  5. Separate incompatible workloads before you chase hero tuning (metadata storms and logs don’t belong on the same NFS export).

The winning protocol is the one that meets your SLOs and doesn’t require daily heroics. Your future self will thank you. Quietly. While sleeping.
