ZFS Kubernetes: PV Design That Won’t Bite During Node Failures


You don’t discover whether your storage design is good during a calm Tuesday deploy. You discover it when a node disappears mid-write,
a StatefulSet wedges, and your incident channel turns into a live reenactment of “Who owns the data, exactly?”

ZFS and Kubernetes can be a great match, but only if you accept a blunt truth: the default failure domain for local ZFS is the node,
and Kubernetes is not sentimental about your node. If you want your PVs to survive node failures, you must design for it up front—topology,
replication, fencing, and operational checks that don’t rely on hope.

What actually fails during “node failure”

“Node failure” is a vague phrase that hides very different outcomes. Your PV design needs to handle the specific failure modes you’ll
actually see in production, not the polite ones from diagrams.

Failure mode A: the node is dead, storage is intact

Think: kernel panic, NIC died, power supply popped, or the hypervisor rebooted it. The disks are fine, the pool is fine, but Kubernetes
can’t schedule onto that node—at least not immediately. If your PV is local-only, your pod is stuck until the node comes back. That might
be acceptable for a cache. It’s not acceptable for a payment ledger.

Failure mode B: the node is alive, storage is sick

The node responds, but the pool is degraded, a vdev is missing, NVMe errors spike, or latency is through the roof. Kubernetes will keep
happily placing pods there if your topology rules allow it. ZFS will keep trying to deliver data, sometimes heroically, sometimes slowly.
You need signals and automation that treat storage sickness as a scheduling problem.

Failure mode C: the node is half-dead (the most expensive kind)

The node is “Ready” just enough to mislead control-plane logic but cannot do reliable IO. This is where you get timeouts, hung mounts,
stuck Terminating pods, and cascading retries.

Kubernetes is great at replacing cattle. Your PV is not cattle. If you store real state, you need a plan for where that state lives
and how it moves when the host is gone.

Facts and context you should know

These aren’t trivia. They’re the little realities that shape why certain ZFS+Kubernetes designs work and others eventually eat your weekend.

  1. ZFS was born in a world that hated silent corruption. Its end-to-end checksums and self-healing were built to catch bit rot that RAID alone can’t.
  2. Copy-on-write is the core reason snapshots are cheap. ZFS doesn’t “freeze” a volume; it preserves block pointers while new writes go elsewhere.
  3. Early ZFS adoption was tied to Solaris and its tight OS integration. That legacy still influences assumptions about predictable device naming and stable storage stacks.
  4. ZVOLs and datasets are different beasts. ZVOLs behave like block devices; datasets are filesystems. They have different tuning knobs and failure behaviors.
  5. ZFS always lies a little about “free space.” Fragmentation, metaslabs, and reservation behavior mean 80% full can feel like 95% full for latency.
  6. Ashift matters more than people think. Misaligned sector sizing can permanently tax performance; you don’t “fix it later” without rebuilding.
  7. Scrubs are not optional in long-lived fleets. They’re how ZFS finds latent errors before you need that block during a restore.
  8. Kubernetes storage abstractions were built for networks first. Local PVs exist, but the orchestration story is intentionally limited: it won’t magically teleport your bytes.
  9. “Single writer” semantics are a foundational safety constraint. If two nodes can mount and write the same filesystem without fencing, you’re designing a corruption machine.

Three PV archetypes with ZFS (and when to use them)

1) Local ZFS PV (node-bound): fast, simple, unforgiving

This is the “ZFS on each node, PV binds to the node” pattern. You create a dataset or zvol on the node, expose it via CSI or LocalPV,
and schedule the consuming pod onto that node.
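
A minimal sketch of what “node-bound” means at the API level, assuming a dataset mounted at /tank/k8s/pv/example on worker-07 (names and paths are illustrative, not tied to any specific CSI driver; most drivers generate an equivalent object for you):

cr0x@server:~$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: zfs-local-example              # hypothetical PV name
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: zfs-local
  local:
    path: /tank/k8s/pv/example         # mountpoint of the ZFS dataset on the node
  nodeAffinity:                        # this is the binding that prevents failover
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["worker-07"]
EOF

That nodeAffinity block is exactly what Task 1 below reads back out of the PV: if it’s there, the volume does not move.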

Use it when: the workload tolerates node outage, you can rebuild from elsewhere, or it’s a read-mostly cache with upstream truth.

Avoid it when: you need automatic failover with no data movement time, or you can’t tolerate “pod stuck until node returns.”

2) ZFS-backed network storage (iSCSI/NFS over ZFS): portable, centralized, failure domain shifts

Put ZFS on dedicated storage nodes and export volumes over the network. Kubernetes sees it as networked storage, which aligns with its
scheduling model. Your failure domain becomes the storage service, not the compute node.

Use it when: you need pods to move freely and are willing to engineer storage HA like adults.

Avoid it when: your network is fragile or oversubscribed, or you can’t staff the operational burden of storage HA.

3) Replicated ZFS local storage with promotion (ZFS send/receive, DRBD-like, or an orchestrator): resilient, complex

This is the “have your cake and pay for it” pattern: keep data local for performance, but replicate to peers so you can promote
a replica when a node fails. This can be done with ZFS replication mechanisms (snapshots + send/receive) coordinated by a controller,
or by a storage system layered atop ZFS.

Use it when: you want local-performance with node failure survivability, and you can enforce single-writer semantics with fencing.

Avoid it when: you can’t guarantee fencing or you need true synchronous replication but won’t accept latency costs.
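
A minimal sketch of the replication loop an orchestrator automates, assuming a standby host called standby-node (hypothetical, with passwordless sudo) that already holds an earlier common snapshot; a real controller adds retries, lag metrics, snapshot cleanup, and fencing before any promotion:

cr0x@server:~$ SRC=tank/k8s/pv/pvc-7b3b3b9a
cr0x@server:~$ NOW=$(date +%Y%m%d-%H%M)
cr0x@server:~$ sudo zfs snapshot "${SRC}@replica-${NOW}"
cr0x@server:~$ sudo zfs send -I "${SRC}@replica-20251225-0050" "${SRC}@replica-${NOW}" | ssh standby-node "sudo zfs receive -F ${SRC}"

Promotion is then just mounting the received dataset on the standby and repointing the PV, which is precisely the step that must never happen while the old node can still write.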

Here’s the opinionated takeaway: if you’re running serious state (databases, queues, object metadata) and node failures must not cause
multi-hour recovery drama, don’t pretend local-only PVs are “highly available.” They’re “highly available to that node.”

Design principles that prevent ugly surprises

Principle 1: Decide your failure domain explicitly

For each StatefulSet, write down: “If node X is gone, do we accept downtime? For how long? Can we rebuild? Do we need automated failover?”
If you can’t answer those questions, your storage class is just a prayer with YAML.

Principle 2: Enforce single-writer with fencing, not vibes

If you replicate and promote, you must prevent two nodes from writing the same logical volume. Kubernetes won’t do this for you.
Your design needs either:

  • hard fencing (STONITH, power cut, hypervisor fence), or
  • a storage system that guarantees exclusive attachment, or
  • a workload that is itself multi-writer safe (rare; usually requires a clustered filesystem or database-level replication).
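
For the hard-fencing path, a minimal sketch of “power cut before promotion,” assuming an out-of-band BMC reachable at worker-07-bmc.example.net and a taint key of your own choosing (both hypothetical):

cr0x@server:~$ kubectl cordon worker-07
cr0x@server:~$ kubectl taint node worker-07 storage.example.com/fenced=true:NoExecute --overwrite
cr0x@server:~$ ipmitool -I lanplus -H worker-07-bmc.example.net -U admin -P "$IPMI_PASS" chassis power off
cr0x@server:~$ ipmitool -I lanplus -H worker-07-bmc.example.net -U admin -P "$IPMI_PASS" chassis power status

Only promote a replica after the power status check confirms the node is actually off; “the API server says NotReady” is not fencing.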

Joke #1: Split-brain is like giving two interns root on production—everyone learns a lot, and the company learns to regret it.

Principle 3: Pick dataset vs zvol based on how your app writes

Many CSI ZFS setups give you a choice: dataset (filesystem) or zvol (block). Don’t pick based on aesthetics.

  • Datasets play nicely with POSIX semantics and quotas, and are easy to inspect. Great for general file workloads.
  • ZVOLs behave like block volumes and are common for databases via ext4/xfs on top, or raw block. They need careful volblocksize selection.
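
A short sketch of creating one of each, with the properties that matter set at creation time (dataset and zvol names are illustrative):

cr0x@server:~$ # filesystem dataset: general file workload, POSIX semantics, easy to inspect
cr0x@server:~$ sudo zfs create -o recordsize=128K -o compression=lz4 -o atime=off tank/k8s/pv/files-example
cr0x@server:~$ # zvol: block device for a database; volblocksize is fixed at creation, so choose it now
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K -o compression=lz4 tank/k8s/zvol/db-example
cr0x@server:~$ sudo mkfs.xfs /dev/zvol/tank/k8s/zvol/db-example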

Principle 4: Tune ZFS properties per workload, not per cluster

“One ZFS configuration to rule them all” is how you end up with either sad databases or sad log pipelines. Use per-dataset properties:
recordsize, compression, atime, xattr, logbias, primarycache/secondarycache, and reservations where appropriate.
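
A sketch of two per-dataset profiles, assuming a 16K-page database (InnoDB-style) and a write-heavy log pipeline (the log dataset name is illustrative; recordsize changes apply only to newly written blocks):

cr0x@server:~$ # database profile: small random IO, sync-heavy
cr0x@server:~$ sudo zfs set recordsize=16K tank/k8s/pv/pvc-7b3b3b9a
cr0x@server:~$ sudo zfs set atime=off tank/k8s/pv/pvc-7b3b3b9a
cr0x@server:~$ sudo zfs set logbias=latency tank/k8s/pv/pvc-7b3b3b9a
cr0x@server:~$ # log-pipeline profile: large sequential writes, little read reuse
cr0x@server:~$ sudo zfs set recordsize=1M tank/k8s/logs/pvc-logs-example
cr0x@server:~$ sudo zfs set compression=zstd tank/k8s/logs/pvc-logs-example   # zstd needs OpenZFS 2.0+
cr0x@server:~$ sudo zfs set primarycache=metadata tank/k8s/logs/pvc-logs-example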

Principle 5: Capacity is a performance setting

With ZFS, “nearly full” is not just a capacity risk; it’s a latency risk. Plan alerts around pool fragmentation and allocation, not just “df says 10% free.”
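
A sketch of the kind of check that catches this, with illustrative thresholds (80% allocated, 50% fragmentation) that you should set from your own latency data:

cr0x@server:~$ zpool list -H -o name,capacity,fragmentation | awk '{gsub(/%/,""); if ($2 >= 80 || $3 >= 50) print "WARN pool " $1 ": cap=" $2 "% frag=" $3 "%"}'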

Principle 6: Observability must include ZFS, not just Kubernetes

Kubernetes will tell you the pod is Pending. It won’t tell you that a single NVMe is retrying commands and your pool is throttling.
Build dashboards and alerts on zpool status, error counts, scrub results, and latency metrics.

CSI realities: what Kubernetes will and won’t do for you

CSI is an interface, not a guarantee of correctness. The driver you choose (or build) determines whether your volumes behave like
mature storage or like a science fair project with YAML.

Kubernetes will:

  • attach/mount volumes according to the driver,
  • respect node affinity for local PVs,
  • restart pods elsewhere if a PV is portable and the scheduler can place it.

Kubernetes will not:

  • replicate your local ZFS datasets,
  • fence a node to prevent double-writes,
  • magically heal a degraded pool,
  • understand ZFS’s notion of “pool health” unless you teach it (taints, node conditions, or external controllers).

A reliable design accepts these boundaries and builds the missing pieces explicitly: replication orchestration, promotion rules,
and health-based scheduling.
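
One of those missing pieces can be as small as a periodic health check on each storage node that turns pool sickness into a scheduling signal; a minimal sketch, assuming a taint key you define yourself (the key name here is hypothetical):

cr0x@server:~$ # run via cron or a systemd timer on every node that hosts ZFS-backed PVs
cr0x@server:~$ zpool status -x | grep -q 'all pools are healthy' || kubectl taint node "$(hostname)" storage.example.com/zfs-degraded=true:NoSchedule --overwrite
cr0x@server:~$ # and clear the taint again once the pool reports healthy
cr0x@server:~$ zpool status -x | grep -q 'all pools are healthy' && kubectl taint node "$(hostname)" storage.example.com/zfs-degraded:NoSchedule- 2>/dev/null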

One operational quote that aged well: “Hope is not a strategy.”

Practical tasks: commands, outputs, and decisions

These are the checks I actually run when something smells off. Each task includes: the command, what the output means, and the decision
you make. Use them during design reviews and incidents.

Task 1: Confirm which node a PVC is actually bound to (local PV reality check)

cr0x@server:~$ kubectl get pvc -n prod app-db -o wide
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE   VOLUMEMODE
app-db   Bound    pvc-7b3b3b9a-1a2b-4f31-9bd8-3c1f9d3b2d1a   200Gi      RWO            zfs-local      91d   Filesystem
cr0x@server:~$ kubectl get pv pvc-7b3b3b9a-1a2b-4f31-9bd8-3c1f9d3b2d1a -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}'
worker-07

Meaning: If there’s nodeAffinity, the PV is node-bound. Your pod can’t reschedule elsewhere without volume migration.

Decision: If this is a tier-1 database, stop pretending it’s HA. Either accept node-tied downtime or move to replicated/portable storage.

Task 2: See why a pod is Pending (scheduler tells on you)

cr0x@server:~$ kubectl describe pod -n prod app-db-0
...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m30s  default-scheduler  0/12 nodes are available: 11 node(s) didn't match Pod's node affinity, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.

Meaning: It’s not “Kubernetes being weird.” It’s doing exactly what you told it: place the pod where the PV lives, but that node is unreachable.

Decision: If you need automatic recovery, you need a PV that can move (network) or a replica that can be promoted (with fencing).

Task 3: Inspect ZFS pool health on the node (start with facts)

cr0x@server:~$ sudo zpool status -x
all pools are healthy

Meaning: No known errors and no degraded vdevs. This does not mean “performance is fine,” but it does mean “not obviously on fire.”

Decision: If the app is slow but pool is healthy, pivot to latency, CPU, ARC, and filesystem properties.

Task 4: Find silent error accumulation (the scary kind)

cr0x@server:~$ sudo zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
  scan: scrub repaired 0B in 05:12:44 with 2 errors on Sun Dec 22 03:10:24 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            nvme-SAMSUNG_MZVLB1T0   ONLINE       0     0     2
            nvme-SAMSUNG_MZVLB1T0   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        tank/k8s/pv/pvc-...:<0x123>

Meaning: Checksum errors and permanent errors are not “monitor later.” They indicate corruption ZFS could not repair.

Decision: Treat as a data-loss incident: identify affected PV, restore from backup or replica, and plan disk replacement + scrub verification.

Task 5: Verify ashift and physical sector alignment (forever decision)

cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
72:        ashift: 12

Meaning: ashift=12 implies 4K sectors. If your disks are 4K and ashift is 9, you’re paying a permanent write penalty.

Decision: If ashift is wrong on production data, plan a migration to a rebuilt pool. Don’t waste time trying to “tune it away.”

Task 6: Check pool free space and fragmentation (capacity ≠ usable performance)

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,health
NAME  SIZE   ALLOC  FREE  FRAG  HEALTH
tank  3.62T  3.02T  614G  61%   ONLINE

Meaning: 61% fragmentation with high allocation is a latency multiplier. Writes get expensive, and sync writes get worse.

Decision: If frag climbs and latency matters, expand pool, reduce churn, or migrate hot workloads. Also revisit recordsize/volblocksize and snapshot retention.

Task 7: Inspect dataset properties for a PV (catch accidental “one-size-fits-none”)

cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,logbias,primarycache tank/k8s/pv/pvc-7b3b3b9a
NAME                      PROPERTY      VALUE
tank/k8s/pv/pvc-7b3b3b9a   recordsize    128K
tank/k8s/pv/pvc-7b3b3b9a   compression   lz4
tank/k8s/pv/pvc-7b3b3b9a   atime         on
tank/k8s/pv/pvc-7b3b3b9a   logbias       latency
tank/k8s/pv/pvc-7b3b3b9a   primarycache  all

Meaning: atime=on is often a pointless write amplifier for databases. recordsize=128K might be wrong for small random IO patterns.

Decision: For database datasets, consider atime=off and recordsize tuned to the database’s page size (8K for Postgres, 16K for InnoDB). Validate with benchmarks, not folklore.

Task 8: For ZVOL-backed PVs, check volblocksize (the hidden performance lever)

cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/k8s/zvol/pvc-1f2e3d4c
NAME                         PROPERTY     VALUE
tank/k8s/zvol/pvc-1f2e3d4c    volblocksize 8K

Meaning: volblocksize is fixed at creation. If your DB does 16K pages and you picked 8K, you may double IO operations.

Decision: If volblocksize is wrong for a critical workload, plan to migrate the volume to a correctly sized zvol.

Task 9: Confirm whether a dataset has a reservation (prevent “pool full” cascading failure)

cr0x@server:~$ sudo zfs get -o name,property,value refreservation,reservation tank/k8s/pv/pvc-7b3b3b9a
NAME                      PROPERTY        VALUE
tank/k8s/pv/pvc-7b3b3b9a   reservation     none
tank/k8s/pv/pvc-7b3b3b9a   refreservation  none

Meaning: No reserved space. A noisy neighbor can fill the pool and your “important” PV will fail writes.

Decision: For critical PVs on shared pools, allocate reservations or separate pools. Prefer engineering separation over arguing during incidents.
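
A minimal sketch of carving out a guarantee for the critical PV on a shared pool (the size is illustrative and should match the PVC):

cr0x@server:~$ sudo zfs set refreservation=200G tank/k8s/pv/pvc-7b3b3b9a

refreservation guarantees space for the dataset itself (snapshots excluded); plain reservation also counts descendants and snapshots. Pick deliberately, because reserved space for one dataset is space nobody else on the pool can use.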

Task 10: Check scrub schedule and last scrub result (slow disasters start here)

cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 04:01:55 with 0 errors on Sun Dec 15 03:00:11 2025
config:
...

Meaning: Scrub ran and found no errors. Great. If your scrub hasn’t run in months, you’re deferring bad news.

Decision: Set a scrub cadence appropriate to disk size and churn. Then alert if scrubs stop happening or start finding errors.
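
If nothing schedules scrubs for you, a cron sketch (cadence and paths are illustrative; many distros ship a zfs scrub timer you can enable instead):

cr0x@server:~$ echo '0 3 * * 0 root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank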

Task 11: Validate replication freshness (RPO is a metric, not a promise)

cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -s creation | tail -5
tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0010  Thu Dec 25 00:10 2025
tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0020  Thu Dec 25 00:20 2025
tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0030  Thu Dec 25 00:30 2025
tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0040  Thu Dec 25 00:40 2025
tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0050  Thu Dec 25 00:50 2025

Meaning: Snapshots exist on the source. That’s not replication yet. You need to confirm they arrived on the target and are recent.

Decision: If snapshots are lagging beyond your RPO, throttle other traffic, debug send/recv failures, and stop selling “HA” internally.
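
Checking the receiving side closes the loop; a sketch assuming the replica keeps the same dataset name on a host called standby-node (hypothetical), with creation times printed as epoch seconds via -p:

cr0x@server:~$ LAST=$(ssh standby-node "zfs list -H -p -t snapshot -o creation -s creation -d 1 tank/k8s/pv/pvc-7b3b3b9a | tail -1")
cr0x@server:~$ echo "replication lag: $(( $(date +%s) - LAST )) seconds"

Alert on that number against your stated RPO, not on “the cron job exists.”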

Task 12: Run a ZFS send/receive dry run mentality check (know what would happen)

cr0x@server:~$ sudo zfs send -nPv tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0050 | head
send from @ to tank/k8s/pv/pvc-7b3b3b9a@replica-20251225-0050 estimated size is 3.14G
total estimated size is 3.14G
TIME        SENT   SNAPSHOT

Meaning: Estimated send size is reasonable. If it’s huge for a small change, you may have recordsize mismatch or excessive churn.

Decision: If incremental sends are unexpectedly large, review snapshot frequency, recordsize/volblocksize, and whether the app rewrites big files constantly.

Task 13: Identify mount/IO stalls at the node level (when pods “hang”)

cr0x@server:~$ dmesg -T | tail -20
[Thu Dec 25 01:12:11 2025] INFO: task kworker/u32:4:12345 blocked for more than 120 seconds.
[Thu Dec 25 01:12:11 2025] zio pool=tank vdev=/dev/nvme0n1 error=5 type=1 offset=123456 size=131072 flags=1809
[Thu Dec 25 01:12:12 2025] blk_update_request: I/O error, dev nvme0n1, sector 987654

Meaning: Kernel is reporting blocked tasks and IO errors. This is not a Kubernetes issue; it’s a node storage incident.

Decision: Cordon the node, drain non-stateful workloads, and start replacement/repair. Don’t “just restart the pod” into a broken IO path.

Task 14: See if Kubernetes is stuck detaching a volume (control-plane symptom, storage cause)

cr0x@server:~$ kubectl get volumeattachment
NAME                                                                   ATTACHER                 PV                                         NODE        ATTACHED   AGE
csi-9a2d7c5f-1d20-4c6a-a0a8-1c0f67c9a111                                 zfs.csi.example.com    pvc-7b3b3b9a-1a2b-4f31-9bd8-3c1f9d3b2d1a  worker-07    true       3d

Meaning: Kubernetes believes the volume is attached to worker-07. If worker-07 is dead, this attachment can block failover.

Decision: Follow your driver’s safe force-detach procedure. If you don’t have one, that’s a design gap—not an ops failure.

Task 15: Check ZFS ARC pressure (cache misses look like “storage is slow”)

cr0x@server:~$ sudo arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
01:20:11   812   401     49   122   15   249   31    30    3   28.1G  31.9G
01:20:12   790   388     49   110   14   256   32    22    2   28.1G  31.9G
01:20:13   840   430     51   141   17   260   31    29    3   28.1G  31.9G
01:20:14   799   420     53   150   18   244   30    26    3   28.1G  31.9G
01:20:15   820   415     51   135   16   255   31    25    3   28.1G  31.9G

Meaning: High miss% means reads are falling through to disk. If your disks are fine but latency spikes, ARC pressure can be the culprit.

Decision: If ARC is too small or memory is constrained by pods, consider node sizing, cgroup memory policies, and dataset primarycache settings.

Task 16: Confirm actual filesystem free space vs pool free space (quota traps)

cr0x@server:~$ sudo zfs get -o name,property,value used,avail,quota,refquota tank/k8s/pv/pvc-7b3b3b9a
NAME                      PROPERTY  VALUE
tank/k8s/pv/pvc-7b3b3b9a   used      187G
tank/k8s/pv/pvc-7b3b3b9a   avail     13.0G
tank/k8s/pv/pvc-7b3b3b9a   quota     200G
tank/k8s/pv/pvc-7b3b3b9a   refquota  none

Meaning: The dataset is at its quota; the app will see ENOSPC even if the pool still has free space.

Decision: Increase quota (with change control), clean data, or scale out the app. Don’t just “add disks” if the quota is the limiter.

Fast diagnosis playbook

When the page hits and someone says “storage is down,” you need a ruthless order of operations. The goal is to isolate the bottleneck fast:
scheduling, attach/mount, ZFS health, or underlying device failure.

First: determine if this is scheduling/placement or IO failure

  1. Is the pod Pending or Running?

    • If Pending: check kubectl describe pod events for node affinity/taints.
    • If Running but hung: suspect mount/IO path.
  2. Is the PV node-bound? Check PV nodeAffinity. If yes, node failure equals volume unavailability unless you have replication/promotion.

Second: check attachment and mount status (control-plane vs node reality)

  1. Check kubectl get volumeattachment for stuck attachments.
  2. On the node, check for mount stalls and IO errors via dmesg -T.
  3. If your CSI driver has logs, look for timeouts, permission errors, or “already mounted” loops.

Third: check ZFS pool and dataset health (not just “is it mounted?”)

  1. zpool status -x and full zpool status for errors.
  2. zpool list for fragmentation and free space.
  3. zfs get on the affected dataset/zvol for properties that match workload expectations.

Fourth: decide whether to recover by restarting, failing over, or restoring

  • Restart only if the node and pool are healthy and the problem is software-level (app, kubelet, CSI transient).
  • Fail over only if you can guarantee single-writer and you have a known-good replica.
  • Restore if you have checksum errors, permanent errors, or a compromised pool.
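
Condensed into one pass, the playbook above looks roughly like this (names reuse the examples from the tasks; adjust to your namespaces and nodes):

cr0x@server:~$ kubectl get pod -n prod app-db-0 -o wide                 # Pending vs Running, and which node it wants
cr0x@server:~$ kubectl get pv "$(kubectl get pvc -n prod app-db -o jsonpath='{.spec.volumeName}')" -o jsonpath='{.spec.nodeAffinity}{"\n"}'
cr0x@server:~$ kubectl get volumeattachment | grep pvc-7b3b3b9a         # stuck attachment to a dead node?
cr0x@server:~$ ssh worker-07 'dmesg -T | tail -50; sudo zpool status -x'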

Joke #2: Kubernetes will reschedule your pod in seconds; your data will reschedule itself in approximately never.

Three corporate mini-stories (all too real)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company built a “high performance” Kubernetes platform on top of local ZFS. Each node had NVMe mirrors.
Their StatefulSets used a CSI driver that provisioned datasets locally, and everything screamed in benchmarks.

The wrong assumption was subtle: “If a node dies, Kubernetes will bring the pod back somewhere else.” True for stateless.
False for local PVs. They had read the words “PersistentVolume” and interpreted it as “persistent across nodes.”

When a rack switch died, a chunk of nodes went unreachable. The database pods went Pending because their PVs were node-affined.
The on-call tried to “force delete” pods, then “recreate PVCs,” then “just scale to zero and back up.” None of that moved the data.

The outage wasn’t just downtime; it was decision paralysis. No one could answer: is it safe to attach the same dataset on another node?
The answer was “not possible,” but they didn’t know that until the incident.

The fix was boring: they reclassified which workloads were allowed on node-bound PVs, moved tier-1 state to portable storage,
and wrote a runbook that begins with “Is the PV node-affined?” That one question shaved hours off future incidents.

Mini-story 2: The optimization that backfired

A large enterprise team wanted to reduce storage overhead. Snapshots were piling up, replication windows were growing, and someone
proposed an “efficiency sprint”: crank up compression, shrink recordsize cluster-wide, and increase snapshot frequency to reduce RPO.

The change looked great on paper. In practice, the smaller recordsize increased metadata pressure and fragmentation on busy datasets.
Snapshot frequency increased churn and send sizes for workloads that rewrote large files. Replication didn’t get faster; it got noisier.

Worse, they ran the changes globally, including on volumes that didn’t need it: log aggregators, build caches, and databases with very
different IO patterns. The tail latencies crept up first. Then queue depths. Then the support tickets.

The lesson wasn’t “compression is bad” or “small recordsize is bad.” The lesson was that storage knobs are workload-specific, and
global optimizations are how you manufacture a cross-functional outage.

They recovered by rolling back to per-storage-class defaults and defining profiles: “db-postgres,” “db-mysql,” “logs,” “cache,” each
mapping to a dataset template. The platform team stopped being a tuning cult and started being a service.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran ZFS on dedicated storage nodes exporting volumes to Kubernetes. Nothing fancy. The “innovation” was discipline:
weekly scrubs, alerting on any checksum errors, and a policy that a degraded pool triggers an incident even if apps look fine.

One month, scrubs started reporting a small number of correctable checksum errors on a mirror. The apps were healthy. The temptation
was to defer: “We’ll replace it next quarter.” They didn’t. They replaced the suspect device and scrubbed again until clean.

Two weeks later, a different disk in the same pool suffered a sudden failure. Because the earlier device had been replaced, the mirror
stayed intact and the pool stayed ONLINE. No scrambling, no restore, no executive “why didn’t we see this coming?”

The team didn’t celebrate with a war room. That was the point. The correct practice was so boring it prevented a story from existing.
In operations, boring is a feature.

Common mistakes: symptom → root cause → fix

This section is intentionally specific. If you recognize a symptom, you should be able to act without reinventing storage engineering under stress.

1) Pod stuck Pending after node failure

Symptom: StatefulSet pod won’t schedule; events mention node affinity or “volume node affinity conflict.”
Root cause: Local PV is node-bound; node is unreachable or tainted.
Fix: If downtime is acceptable, restore the node. If not, redesign: portable storage or replicated local storage with promotion and fencing.

2) Pod Running but app times out on disk

Symptom: Container is up; app logs show IO timeouts; kubectl exec hangs sometimes.
Root cause: Underlying device errors or blocked IO; ZFS may be retrying; node is “half-dead.”
Fix: Check dmesg and zpool status. Cordon the node. Replace failing devices. Don’t just restart pods onto the same IO path.

3) “No space left on device” but pool has free space

Symptom: Application hits ENOSPC; zpool list shows free space.
Root cause: Dataset quota or refquota reached, or snapshot reservations consuming space.
Fix: Inspect zfs get quota,refquota,used,avail. Increase quota, clean data, or adjust snapshot policy.

4) Replication exists but failover loses “recent writes”

Symptom: After promotion, data is stale; last few minutes are missing.
Root cause: Asynchronous replication and RPO not met (snapshot lag, send backlog).
Fix: Measure replication lag; adjust snapshot frequency and send concurrency; ensure bandwidth; consider sync replication if business requires it (and accept latency).

5) Unexpectedly large incremental sends

Symptom: Incremental replication balloons even with small changes.
Root cause: Workload rewrites large files; poor recordsize match; fragmentation; too many small snapshots causing churn.
Fix: Tune recordsize/volblocksize per workload; reduce churn; adjust snapshot cadence; consider app-level replication for rewrite-heavy patterns.

6) Mount errors: “already mounted” or stuck Terminating pods

Symptom: CSI logs show mount conflicts; pods hang in Terminating; attachment remains true.
Root cause: Node died without clean unmount; stale attachment object; driver not handling crash recovery cleanly.
Fix: Use driver-supported force detach procedure; enforce fencing; tune kubelet timeouts carefully; validate crash-recovery behavior in staging.

7) Pool is ONLINE but latency is terrible

Symptom: No errors; apps slow; p99 latency spikes; scrubs fine.
Root cause: Pool near-full, high fragmentation, ARC misses, sync write amplification, or small random writes hitting HDD vdevs.
Fix: Check fragmentation and allocation; tune dataset properties; add capacity; move sync-heavy workloads to a pool with a proper SLOG (only if you understand it).

Checklists / step-by-step plan

Step-by-step: choose the right PV strategy per workload

  1. Classify the workload.

    • Tier-0: data loss unacceptable; downtime minutes, not hours.
    • Tier-1: downtime acceptable but must be predictable; restore tested.
    • Tier-2: rebuildable caches and derived data.
  2. Pick the failure domain you can live with.

    • If Tier-0: do not use node-bound local PV without replication + fencing.
    • If Tier-1: local PV may be fine if node recovery is fast and practiced.
    • If Tier-2: local PV is usually fine; optimize for simplicity.
  3. Pick dataset vs zvol.

    • Dataset for general files, logs, and apps that benefit from easy inspection.
    • Zvol for block semantics or when your CSI stack expects it; set volblocksize intentionally.
  4. Define storage profiles as code.

    • Per profile: recordsize/volblocksize, compression, atime, logbias, quotas/reservations.
    • Map profiles to StorageClasses, not tribal knowledge (see the sketch after this list).
  5. Plan and test node failure.

    • Kill a node in staging while the DB is writing.
    • Time how long until service is healthy again.
    • Verify no split-brain conditions are possible.
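
For step 4, a minimal sketch of a profile expressed as a StorageClass, assuming a ZFS CSI driver named zfs.csi.example.com as in Task 14; the parameter names are illustrative, because each driver defines its own:

cr0x@server:~$ cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-db-16k                    # one class per workload profile
provisioner: zfs.csi.example.com      # hypothetical driver, as in Task 14
parameters:                           # names vary by driver; these are illustrative
  recordsize: "16K"
  compression: "lz4"
  atime: "off"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

WaitForFirstConsumer matters for node-bound storage: it lets the scheduler pick the node before the volume gets pinned to one.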

Operational checklist: what “ready for production” looks like

  • Scrubs scheduled and alerts if missed or errors found.
  • Pool health monitored (degraded, checksum errors, device removal events).
  • Capacity thresholds based on pool allocation and fragmentation, not just free GB.
  • Replicas verified (if using replication): last successful receive time, lag, and promotion procedure.
  • Fencing documented and tested (if promoting replicas): who/what prevents dual-writer.
  • Runbooks exist for: node loss, disk failure, stuck attachment, restore from snapshot, and rollback.
  • Backups tested as restores, not as warm feelings.

FAQ

1) Can I get “real HA” with local ZFS PVs?

Not by default. Local PVs are tied to a node. For HA you need replication to another node and a safe promotion mechanism with fencing,
or you need networked/portable storage.

2) Is ZFS replication (send/receive) enough for failover?

It’s necessary but not sufficient. You also need orchestration (when to snapshot, send, receive, promote) and strict single-writer fencing
so you don’t corrupt data during partial failures.

3) Should I use datasets or zvols for Kubernetes PVs?

Datasets are simpler to inspect and tune for file workloads. Zvols are better when you need block semantics or your CSI driver expects block.
For databases, either can work—just tune recordsize/volblocksize deliberately.

4) What’s the biggest “gotcha” during node failures?

Attach/mount state gets stuck while the node is unreachable, and your cluster can’t safely reattach elsewhere without risking dual-writer.
If your design relies on manual force operations, you’re betting your RTO on human calm.

5) Does Kubernetes handle fencing for me?

No. Kubernetes can delete pods and reschedule, but it cannot guarantee a dead node isn’t still writing to storage unless the storage system
enforces exclusivity or you implement fencing externally.

6) If I run ZFS on dedicated storage nodes and export NFS, is that “bad”?

Not inherently. It’s a trade: simpler portability for pods, but you must engineer storage HA and network reliability. It can be very sane,
especially for mixed workloads, if you treat storage nodes as first-class production systems.

7) What ZFS properties are the most important for PVs?

For datasets: recordsize, compression=lz4 (usually), atime=off for many write-heavy workloads, primarycache, logbias for sync-heavy apps.
For zvols: volblocksize and compression. Also don’t ignore quotas/reservations.

8) How do I prevent a noisy neighbor from filling the pool and taking down critical PVs?

Use dataset quotas for fairness, and reservations/refreservations for critical volumes if you share pools. Better: split critical workloads
onto separate pools or nodes when stakes are high.

9) Is adding a SLOG always a good idea for databases?

No. A SLOG helps only for synchronous writes and only if it’s low-latency and power-loss-safe. A bad SLOG is an expensive placebo or a new
failure point. Measure your workload’s sync behavior before buying hardware.

10) What’s the cleanest way to handle “node disappears” with stateful apps?

Prefer an architecture where the authoritative state is replicated at the application layer (e.g., database replication) or stored on portable
storage with clear failover semantics. Storage-layer replication can work, but it must be engineered as a system, not a script.

Conclusion: next steps you can do this week

ZFS will happily protect your data from cosmic rays and sloppy disks. Kubernetes will happily delete your pods and reschedule them somewhere else.
The trap is assuming those two “happilys” align during node failure. They don’t unless you design for it.

Practical next steps:

  1. Inventory StatefulSets and label them by failure tolerance (Tier-0/1/2). Make it explicit.
  2. Find node-bound PVs (a one-liner for this is sketched below) and decide whether that downtime is acceptable. If not, redesign now—not during an outage.
  3. Standardize storage profiles (dataset/zvol properties) per workload class. Stop doing global tuning experiments in production.
  4. Add ZFS health and scrub alerts alongside Kubernetes metrics. Storage can be broken while pods look “fine.”
  5. Run a node failure game day for one stateful workload. Time it. Document it. Fix the parts that require heroics.
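
For step 2, a sketch that lists every PV carrying node affinity, assuming jq is installed:

cr0x@server:~$ kubectl get pv -o json | jq -r '.items[] | select(.spec.nodeAffinity != null) | [.metadata.name, (.spec.claimRef.namespace // "-") + "/" + (.spec.claimRef.name // "-"), .spec.storageClassName // "-"] | @tsv'

Every line of output is a volume that will not follow its pod anywhere.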

If you do only one thing: make the failure domain a first-class design decision. Local ZFS is fast. But fast storage that can’t fail over
is just a very efficient way to be down.
