ZFS SLOG Sizing: How Much You Really Need (Not “Bigger Is Better”)

If you’ve ever watched an NFS-backed VM farm “randomly” freeze for a second at a time, you’ve met the SLOG problem.
Not the “you don’t have one” problem. The “you bought the wrong one, sized it like a hoarder, and still blamed ZFS” problem.

A ZFS SLOG is not a cache you fill up. It’s a safety valve for sync writes. Size it wrong and you’ll waste money;
build it wrong and you’ll get latency spikes that look like ghosts. Worse: you can feel confident and be completely wrong.

What a SLOG actually does (and what it doesn’t)

In ZFS, the ZIL (ZFS Intent Log) is the mechanism that makes synchronous writes safe. If an application asks
for a write to be committed (“sync”), ZFS must be able to replay that write after a crash. By default, ZFS stores the
ZIL on the pool disks themselves. That works, but it can be slow because it forces small, latency-critical, write-heavy I/O
onto the same vdevs that are doing everything else.

A SLOG (Separate LOG device) is simply an external place to put the ZIL records. It does not hold your data long-term.
It is not an L2ARC. It is not a magic “more IOPS” lever. It is a way to turn “sync write latency” from “whatever your pool
can do under load” into “whatever your SLOG can acknowledge quickly and safely.”

The mental model that won’t betray you at 3 a.m.

Think of a sync write as two steps:

  1. ZFS receives the write and needs a durable promise before it can ACK to the client.
  2. ZFS later writes the actual blocks into the normal pool allocation as part of a transaction group (TXG) commit.

With a SLOG, step 1 becomes “write a log record to the SLOG, flush, ACK.” Step 2 still happens on the main pool
on TXG commit. The SLOG is about making the ACK fast and safe; it does not accelerate the bulk of your steady-state writes.

If your workload is mostly async writes (databases configured for write-back, bulk copies, media ingestion), the SLOG may
do almost nothing. If your workload is sync-heavy (NFS, iSCSI with write barriers, VM images on NFS, databases with strict
durability), the SLOG can be the difference between “fine” and “please stop paging me.”

One paraphrased idea from Werner Vogels (Amazon CTO) that fits SLOG sizing perfectly: everything fails; design so failure is a normal, contained event.
Your SLOG design should assume the device dies, because it eventually will.

Interesting facts & historical context

  • ZFS has always had a ZIL; the SLOG is optional. Early adopters often blamed ZFS for sync latency that was really “spindles being spindles.”
  • TXG commits are periodic (commonly on the order of seconds), which means the ZIL is usually short-lived: it’s a staging area, not storage.
  • “SLOG” is community shorthand; ZFS talks about the “log vdev.” The term stuck because storage engineers love acronyms the way cats love boxes.
  • NFS made SLOG famous because many NFS clients default to synchronous semantics for safety, especially for VM datastores and metadata-heavy activity.
  • Power-loss protection (PLP) became the defining feature of good SLOG devices as SSDs got fast enough that “latency” beat “throughput” as the limiting factor.
  • Early consumer SSDs lied (sometimes unintentionally) about flush behavior; this is why SLOG advice leans hard into enterprise drives with PLP.
  • A SLOG is typically tiny compared to pool size. In many production systems, tens of gigabytes is already generous.
  • Mirrored SLOGs became common not because ZFS can’t handle SLOG loss (it usually can), but because operators hate avoidable downtime and scary alerts.
  • NVMe changed the conversation: not “can we make sync writes tolerable?” but “how do we prevent one fast device from masking a slow, overloaded pool?”

Sizing rule that survives contact with production

Here’s the uncomfortable truth: SLOG sizing is almost never about capacity. It’s about latency,
durability, and write patterns. Capacity is the last checkbox, and a small one.

What you’re actually sizing

The SLOG must be large enough to hold outstanding ZIL records that have been acknowledged to clients but not yet committed
to the main pool. Those records are freed as TXGs sync. In a stable system, the ZIL doesn’t grow without bound; it oscillates.

A practical sizing formula looks like this:

SLOG capacity needed ≈ peak sustained sync write rate × worst-case time until TXG sync + overhead.

“Worst-case time until TXG sync” is not just a config value; it’s the time your pool takes to successfully finish a TXG
under load. If the pool is slow, fragmented, or hitting a pathological workload, TXG sync can stretch.
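
As a worked example, assume (hypothetically) 300 MiB/s of peak sustained sync writes, a 10-second worst-case TXG sync, and a 2× safety factor for overhead. The arithmetic is small enough to sanity-check in a shell:

cr0x@server:~$ echo "$(( 300 * 10 * 2 )) MiB"
6000 MiB

That is roughly 6 GiB of in-flight ZIL records, which is why a modest partition on a fast device is usually plenty and why the starting range below already includes generous slack.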

The number most people should start with

For many mixed workloads, start with 16–64 GiB of SLOG, mirrored if you care about uptime. That’s not a meme.
That’s because:

  • TXGs commonly commit on the order of a few seconds.
  • Most systems don’t sustain multi-gigabyte-per-second sync writes; those rates usually show up only in async bursts.
  • You want slack for bursts and for “TXG sync got slower because the pool is busy.”

If you need more than 64–128 GiB for SLOG capacity, I start suspecting you’re trying to solve a pool throughput problem
with a log device. That’s how people end up buying an expensive band-aid for a broken leg.

Joke #1: A 2 TB SLOG is like installing a helicopter pad on a canoe—impressive in a meeting, confusing in the water.

When “bigger” is actively worse

Some SSDs get slower at sustained small writes when they run out of SLC cache or their internal mapping tables get stressed.
A huge consumer drive can look fast in a benchmark and then fall off a cliff during real sync storms. Meanwhile, a smaller
enterprise NVMe with PLP stays boringly consistent. In storage, “boring” is a compliment.

Latency is the real budget

Your clients don’t care that your SLOG can do 3 GB/s sequential writes. Sync writes are usually small (4–128 KiB),
often random, and gated by flush / FUA semantics. The metric that matters is 99th percentile latency
on durable writes.

What “good” looks like

  • Great: sub-100 microsecond durable write latency on a real PLP NVMe under load.
  • Good: a few hundred microseconds to ~1 ms, stable.
  • Bad: multi-millisecond spikes, especially periodic ones (they’ll line up with TXG behavior and client timeouts).
  • Ugly: tens to hundreds of milliseconds when the SSD garbage-collects or lies about flush.

Why the pool still matters

The SLOG only accelerates the ACK path. The pool still has to absorb the actual writes on TXG sync. If your pool can’t
keep up with the average write rate, you can “buffer” in the SLOG for a little while, but eventually you get backpressure.
Symptoms include growing sync latency, rising txg_sync times, and application stalls.

Workloads: when SLOG helps, and when it’s lipstick

Workloads that often benefit

  • NFS datastores for virtualization where the hypervisor issues sync writes for VM safety.
  • iSCSI with write barriers or guest filesystems doing frequent fsync.
  • Databases with durability on (frequent fsync), especially small commits.
  • Metadata-heavy file workloads: lots of create/rename/fsync patterns.

Workloads where SLOG is usually pointless

  • Large sequential ingestion (media, backups) that’s async or can tolerate async.
  • Analytics batches where writes are buffered and committed infrequently.
  • Anything where sync=disabled (yes, it’s fast; no, it’s not safe).

The “sync write” trap: your app decides, not you

ZFS obeys the requested semantics. If the client asks for sync, ZFS will do sync. If the client doesn’t, ZFS won’t.
This is why SLOG sizing without workload observation is just performance astrology.
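
If you want evidence instead of astrology, watch a suspect process for sync calls during a short window. A minimal sketch (the PID is a placeholder; strace adds overhead, so keep it brief on production systems):

cr0x@server:~$ sudo strace -f -tt -e trace=fsync,fdatasync -p <pid> 2>&1 | head -n 20

A steady stream of fsync()/fdatasync() calls during the slow period means the application really is asking for durability; silence means the pain is probably somewhere else.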

Device choices: the stuff vendors don’t lead with

Non-negotiables for a real SLOG device

  • Power-loss protection (PLP): capacitors or equivalent so a flush is actually durable.
  • Consistent low latency under sustained small writes: not just “peak IOPS.”
  • Endurance: sync storms can be write-amplification hell; choose accordingly.
  • Proper flush/FUA behavior: you want the device to obey write barriers.

What capacity class to buy

Buy for latency and PLP, then pick a capacity that covers your computed need with slack. In practice:

  • 16–64 GiB effective need: a 100–400 GB enterprise SSD/NVMe is usually fine.
  • High sync rate or long TXG sync times: 400–800 GB gives headroom, but don’t make it a trophy drive.
  • Multi-tenant storage serving many clients: consider mirrored NVMe SLOG with consistent latency, not larger capacity.

Why consumer SSDs are a bad idea (even when they “work”)

The failure mode isn’t just “it dies.” The failure mode is “it lies.” Consumer devices can acknowledge a write before it’s
actually durable, especially under power loss. That turns your SLOG into a confidence trick: it’s fast until you need it.

Mirror or not: reliability math and operational reality

Losing a SLOG device does not usually lose your pool. ZFS can fall back to using the on-pool ZIL.
But “usually” is not a comfort when you run production.

The operational impact of losing a single SLOG varies by implementation and environment:

  • Performance can crater immediately (sync writes go back to pool disks).
  • You may need to replace hardware under pressure.
  • Some teams choose to take an outage to restore the expected latency profile rather than limp.

My opinionated default: mirror the SLOG if this system serves latency-sensitive sync clients (NFS for VMs, databases, shared storage).
The incremental cost is usually smaller than the first incident review meeting.

Joke #2: A single-disk SLOG in production is like a single keycard for the whole office—efficient until someone drops it in a latte.

Practical tasks: commands, outputs, and decisions

The only sizing that matters is sizing you can defend with measurements. Here are practical tasks you can run today.
Each includes: a command, what the output means, and the decision you make.

Task 1: Identify whether your pool has a SLOG and its topology

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            sda                      ONLINE       0     0     0
            sdb                      ONLINE       0     0     0
            sdc                      ONLINE       0     0     0
            sdd                      ONLINE       0     0     0
            sde                      ONLINE       0     0     0
            sdf                      ONLINE       0     0     0
        logs
          mirror-1                   ONLINE       0     0     0
            nvme0n1p1                ONLINE       0     0     0
            nvme1n1p1                ONLINE       0     0     0

errors: No known data errors

Meaning: There is a mirrored log vdev (SLOG) using two NVMe partitions. Good.
If you see no logs section, you’re using the on-pool ZIL.

Decision: If you serve sync-heavy clients and there’s no SLOG, you have a strong candidate improvement.
If you have an unmirrored SLOG and uptime matters, plan a mirror.
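
For reference, adding a mirrored log vdev is a one-liner. A sketch using the partition names from the example above (substitute your own devices and double-check the paths; zpool add does not forgive typos):

cr0x@server:~$ sudo zpool add tank log mirror nvme0n1p1 nvme1n1p1

Re-run zpool status afterward and confirm both partitions show up under logs.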

Task 2: Check dataset sync settings (the “did someone get clever?” test)

cr0x@server:~$ sudo zfs get -o name,property,value -s local,inherited sync tank/vmstore
NAME         PROPERTY  VALUE
tank/vmstore sync      standard

Meaning: standard honors application requests. always forces sync (often slower),
disabled lies to clients (fast, unsafe).

Decision: If you find sync=disabled on production VM or database storage, treat it as a risk acceptance decision,
not a performance tweak.
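
If you find a surprise override and decide to back it out, the revert is simple. A sketch for the example dataset (pick one of the two forms):

cr0x@server:~$ sudo zfs set sync=standard tank/vmstore
cr0x@server:~$ sudo zfs inherit sync tank/vmstore

The first sets the value explicitly; the second drops the local setting so the dataset falls back to its parent (or the default).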

Task 3: Confirm the workload is actually issuing sync writes

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                         12.1T  43.7T    210    980  32.1M  88.4M
  raidz2-0                   12.1T  43.7T    210    780  32.1M  61.2M
    sda                          -      -     35    130  5.3M   9.9M
    sdb                          -      -     34    129  5.2M   9.8M
    sdc                          -      -     36    131  5.4M  10.0M
    sdd                          -      -     35    128  5.3M   9.7M
    sde                          -      -     35    131  5.3M  10.0M
    sdf                          -      -     35    131  5.3M  10.0M
logs                             -      -      0    200   0K   27.2M
  mirror-1                       -      -      0    200   0K   27.2M
    nvme0n1p1                    -      -      0    100   0K   13.6M
    nvme1n1p1                    -      -      0    100   0K   13.6M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Writes are hitting the logs vdev. That’s sync traffic being logged.
If log bandwidth is near zero while clients complain about latency, your workload may be mostly async or bottlenecked elsewhere.

Decision: If log write ops/bw is significant during the problem window, SLOG latency matters. If not, stop obsessing over SLOG size.

Task 4: Look at per-vdev latency with extended iostat (OpenZFS)

cr0x@server:~$ sudo zpool iostat -l -v tank 1 3
                              capacity     operations     bandwidth    total_wait     disk_wait
pool                        alloc   free   read  write   read  write   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----   ----- -----
tank                         12.1T  43.7T    220    990  33.0M  90.1M   2ms  24ms     1ms  20ms
  raidz2-0                   12.1T  43.7T    220    790  33.0M  62.0M   3ms  28ms     2ms  23ms
logs                             -      -      0    200   0K   28.1M    -    1ms      -    1ms
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----   ----- -----

Meaning: The log vdev shows ~1 ms write wait, while the main vdev shows much worse. That’s expected: SLOG is fast, pool is slower.
If log write wait is high, your SLOG is the bottleneck for sync ACKs.

Decision: If log write wait is consistently higher than your application tolerance, replace the SLOG device (latency), not “bigger capacity.”

Task 5: Confirm ashift and device alignment (performance hygiene)

cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
34:        vdev_tree:
58:                ashift: 12

Meaning: ashift: 12 indicates 4K sectors. Misalignment can cause write amplification and latency spikes.

Decision: If ashift is wrong for your devices, plan a rebuild/migration. You can’t “tune” your way out of a misaligned pool.
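
Before you blame ashift, check what the devices actually report. A quick sketch (PHY-SEC and LOG-SEC are standard lsblk columns; some drives report 512-byte sectors even when they are 4K internally, so vendor documentation still matters):

cr0x@server:~$ lsblk -d -o NAME,PHY-SEC,LOG-SEC,MODEL /dev/sda /dev/nvme0n1

ashift=12 matches 4096-byte sectors; ashift=9 on a 4K-native device is the misalignment you are hunting for.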

Task 6: Validate the SLOG device supports durable writes (PLP hints)

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0n1 | egrep -i 'oncs|vwc|psd|mn'
mn      : INTEL SSDPE2KX040T8
oncs    : 0x001f
vwc     : 0x0001

Meaning: vwc indicates Volatile Write Cache presence. It does not prove PLP, but it’s a useful clue in inventory validation.
Real confirmation comes from model knowledge and vendor specs—but you can at least detect obvious mismatches.

Decision: If you can’t establish PLP confidence, do not use the device for SLOG in a system where correctness matters.
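
Another inventory clue you can pull from the drive itself is whether its volatile write cache is currently enabled. A sketch using nvme-cli (feature ID 0x06 is the Volatile Write Cache feature; this is a hint, not proof of PLP):

cr0x@server:~$ sudo nvme get-feature /dev/nvme0 -f 0x06 -H

Some PLP drives report no volatile cache at all; others keep one enabled but back it with capacitors. Either way, the vendor spec sheet is the final word.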

Task 7: Check if TRIM/discard is enabled and behaving (latency stability)

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

Meaning: TRIM can help SSD steady-state behavior, reducing long-term latency cliffs.
It is not a SLOG sizing tool, but it changes the “weeks later it got slow” story.

Decision: If you’re on SSD/NVMe and the platform supports it, consider autotrim=on after testing.
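
If you decide to enable it, the property change and a one-off manual pass look like this (a sketch; zpool trim runs in the background, and -t on zpool status shows its progress):

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank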

Task 8: Observe TXG behavior indirectly via ZFS stats (Linux)

cr0x@server:~$ head -n 20 /proc/spl/kstat/zfs/txg
1 0 0x01 4 336 64080000 2110536636251
name                            type data
birth                           4    174541
timeout                         4    5
synced                          4    174538
opened                          4    174540

Meaning: timeout shows the target TXG interval (often 5s). If your pool can’t sync TXGs quickly,
the “real” interval stretches, and your SLOG must hold records longer.

Decision: If TXG sync falls behind during peak, fix the pool throughput/latency issue; don’t just inflate the SLOG.
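
Kstat layout varies across OpenZFS versions, but on Linux the target TXG interval itself is a module parameter, and per-pool TXG history (on versions that expose it) lives under the pool's kstat directory. A sketch of where to look:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5

Five seconds is the default target. The per-pool history at /proc/spl/kstat/zfs/tank/txgs (where available) shows how long recent TXGs actually took to sync, which is the gap your SLOG has to bridge.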

Task 9: Capture sync write pressure with a targeted fio test (carefully)

cr0x@server:~$ sudo fio --name=sync4k --directory=/tank/test --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --size=1G --fsync=1 --runtime=20 --time_based --group_reporting
sync4k: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
sync4k: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=580KiB/s][w=145 IOPS][eta 00m:00s]
sync4k: (groupid=0, jobs=1): err= 0: pid=14219: Fri Dec 26 12:14:21 2025
  write: IOPS=148, BW=594KiB/s (608kB/s)(11.6MiB/20s)
    lat (usec): min=820, max=9980, avg=1420.53, stdev=401.11
    clat percentiles (usec):
     |  1.00th=[  950],  5.00th=[ 1057], 10.00th=[ 1106], 50.00th=[ 1369]
     | 90.00th=[ 1795], 95.00th=[ 1975], 99.00th=[ 2474], 99.90th=[ 5866]

Meaning: This test issues small writes with fsync. The latency percentiles are what matter.
If 99th percentile is multiple milliseconds, VMs and databases will feel “stuttery.”

Decision: If latency is too high, look at SLOG device latency and queueing first. If SLOG is fast but overall is slow, the pool is your limiter.
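
To isolate the SLOG device itself, you can run the same kind of test against a candidate drive directly, but only against a spare device that is not part of any pool, because raw writes are destructive. A hedged sketch (the device path is a placeholder):

cr0x@server:~$ sudo fio --name=slogtest --filename=/dev/nvme2n1 --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based --group_reporting

Compare the 99th-percentile latency here against the filesystem-level numbers above: a fast raw device plus a slow end-to-end result points at the pool or the sync path, not at the SLOG.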

Task 10: Verify the SLOG is actually being used by your dataset path

cr0x@server:~$ sudo zfs get -o name,property,value logbias,recordsize,primarycache tank/vmstore
NAME         PROPERTY      VALUE
tank/vmstore logbias       latency
tank/vmstore recordsize    128K
tank/vmstore primarycache  all

Meaning: logbias=latency nudges ZFS to favor the SLOG for sync writes. logbias=throughput steers large sync writes to the main pool instead, reducing SLOG reliance.

Decision: For VM/NFS/databases where sync latency matters, keep logbias=latency. Don’t treat it as a “make it faster” switch without evidence.

Task 11: Check for periodic stalls and correlate with system I/O latency

cr0x@server:~$ iostat -x 1 5
Linux 6.6.0 (server)     12/26/2025  _x86_64_    (32 CPU)

Device            r/s     w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
sda              5.2   120.1    0.30    9.50      164.3     7.20   58.30     9.10    60.40   0.95  99.1
nvme0n1          0.0    98.0    0.00   12.80       267.0     0.10    1.05     0.00     1.05   0.08   0.9

Meaning: Spinning disks pinned at ~99% util with ~60 ms awaits will drag TXG sync time out.
NVMe is fine here; the pool vdev is the bottleneck.

Decision: If pool disks are saturated, fix vdev layout, add vdevs, or reduce write amplification. A bigger SLOG won’t change sda being on fire.

Task 12: Inspect ARC pressure (to avoid blaming SLOG for memory issues)

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:15:01  3120   210      6   180   5     20   1    10   0   112G   128G
12:15:02  2980   260      9   225   8     25   1    10   0   112G   128G
12:15:03  3410   240      7   210   6     20   1    10   0   112G   128G

Meaning: If ARC is thrashing and misses spike, you may be bottlenecking on reads, not sync writes.
SLOG won’t rescue a read-bound workload.

Decision: If miss% is high and latency is read-driven, tune memory/ARC, add RAM, or revisit recordsize/special vdev decisions.
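
For a deeper breakdown than arcstat's rolling counters, arc_summary (shipped with OpenZFS on most Linux distributions) prints a point-in-time report; a sketch:

cr0x@server:~$ arc_summary | head -n 30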

Task 13: Check ZFS pool properties that influence write behavior

cr0x@server:~$ sudo zpool get -H -o property,value ashift,autotrim,listsnapshots,autoreplace tank
ashift	12
autotrim	on
listsnapshots	off
autoreplace	off

Meaning: Not directly SLOG sizing, but these indicate baseline operational intent. autoreplace off is common; just ensure you have a replacement workflow.

Decision: If your pool behavior is inconsistent across environments, standardize. SLOG troubleshooting gets easier when everything else is sane.

Task 14: Simulate SLOG removal impact (planned maintenance window)

cr0x@server:~$ sudo zpool detach tank nvme0n1p1
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            sda                      ONLINE       0     0     0
            sdb                      ONLINE       0     0     0
            sdc                      ONLINE       0     0     0
            sdd                      ONLINE       0     0     0
            sde                      ONLINE       0     0     0
            sdf                      ONLINE       0     0     0
        logs
          nvme1n1p1                  ONLINE       0     0     0

errors: No known data errors

Meaning: You detached one side of the mirror; you’re running an unmirrored log. This is a controlled way to prove how dependent you are on the SLOG.

Decision: If latency becomes unacceptable with degraded logs, you’ve proven SLOG is critical and should be mirrored with proper devices and alerting.
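
When the test window ends, restoring redundancy is the reverse operation: attach the detached device back onto the surviving one (order matters: existing device first, new device second). A sketch with the names from this example:

cr0x@server:~$ sudo zpool attach tank nvme1n1p1 nvme0n1p1
cr0x@server:~$ sudo zpool status tank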

Fast diagnosis playbook

When users say “storage is slow,” they’re describing a symptom. Your job is to find which subsystem is lying.
Here’s a quick, production-grade sequence that finds SLOG-related bottlenecks fast.

1) Confirm it’s actually sync latency

  • Check whether the affected dataset is used by NFS/iSCSI/VMs/databases.
  • Run zpool iostat -v and see if writes hit logs.
  • Check dataset sync property for surprise settings.

2) Decide: is the SLOG slow, or the pool slow?

  • If log vdev shows high write wait: your SLOG device is the gating factor.
  • If log vdev is fast but pool disks have high await/%util: your pool can’t sync TXGs quickly enough.
  • If neither: suspect client/network, NFS settings, or application behavior (fsync storms).

3) Find periodicity and correlate

  • Look for spikes every few seconds (TXG rhythm) or minutes (SSD GC or snapshot tasks).
  • Correlate with iostat -x and zpool iostat -l latency output.

4) Validate the SLOG device class

  • Confirm PLP-capable model (inventory + controller info).
  • Check for firmware oddities, thermal throttling, and PCIe slot issues (NVMe can downshift).

5) Only then talk about size

If you can’t show that the ZIL is filling beyond current SLOG capacity between TXG syncs, “bigger SLOG” is superstition.
Fix latency, fix pool throughput, fix sync semantics. Size is last.

Common mistakes (symptom → root cause → fix)

1) VM datastore freezes for 1–5 seconds, then recovers

Symptom: periodic stalls, often aligned with bursts of metadata writes.

Root cause: TXG sync taking too long because pool vdevs are saturated; SLOG only accelerates the ACK path but can’t fix a drowning pool.

Fix: add vdevs, move to mirrors for IOPS, reduce write amplification (recordsize, volblocksize, compression), or separate workloads. Verify with iostat -x.

2) SLOG “helps” for a week, then latency gets weird

Symptom: gradually rising sync latency, occasional ugly spikes.

Root cause: consumer SSD steady-state collapse (GC, SLC cache exhaustion), lack of TRIM, or thermal throttling.

Fix: replace with PLP enterprise device; ensure autotrim=on if appropriate; monitor NVMe temperatures and PCIe link state.

3) SLOG device failure causes an outage “because we got scared”

Symptom: pool remains online but performance falls off a cliff; ops chooses to take downtime.

Root cause: unmirrored SLOG serving critical sync workload; failure forces sync traffic back onto pool.

Fix: mirror the SLOG; keep spare devices; rehearse replacement; alert on log vdev degradation.

4) “We added a huge SLOG and nothing changed”

Symptom: same throughput, same latency complaints.

Root cause: workload is mostly async, or the bottleneck is reads, network, CPU, or pool write bandwidth.

Fix: measure log vdev activity; check ARC misses; validate NFS/client settings; profile the app’s fsync behavior before changing hardware.

5) Data inconsistency after a power event

Symptom: application-level corruption or missing acknowledged transactions after abrupt power loss.

Root cause: SLOG device without real power-loss protection acknowledging writes prematurely.

Fix: use enterprise PLP devices only; do not rely on “UPS will save us” as your durability strategy; validate policy for sync semantics.

6) SLOG is fast, but sync writes are still slow

Symptom: log vdev latency looks low, yet clients see high commit times.

Root cause: application doing extra fsyncs, NFS server/client settings forcing sync, or network latency dominating.

Fix: inspect client mount options, NFS export settings, database durability knobs; measure end-to-end latency not just disk.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “SLOG is a write cache”

A mid-size SaaS company ran ZFS for an internal virtualization cluster. The storage team added a “big fast NVMe” as SLOG,
because the procurement doc said “write cache for ZFS.” It was a consumer drive, but the benchmark charts were gorgeous.

The cluster felt better immediately. Tickets slowed down. The team declared victory and moved on. Two months later, a power
event hit one rack. UPS covered most servers, but one storage head dropped hard. On reboot, the pool imported. No obvious ZFS errors.
But the database team started reporting broken invariants: transactions that had been acknowledged by the guests weren’t there.

The postmortem was brutal because ZFS “looked healthy.” The failure was semantic: the SLOG device acknowledged flushes that were not
actually durable. ZFS did what it was asked to do; the device lied about doing what it claimed to do.

The fix wasn’t “bigger” or “more.” It was boring: PLP NVMe for SLOG, mirrored, and a policy that banned consumer drives in the write-ack path.
They also stopped calling SLOG a cache in internal docs. Words matter. The wrong word creates the wrong mental model, and the wrong model writes checks
your data can’t cash.

Optimization that backfired: forcing sync=always to “use the SLOG more”

At a large enterprise, a team noticed their shiny SLOG barely showed any traffic during normal hours. Someone concluded the SLOG was “wasted”
and pushed sync=always on the main NFS dataset to justify the hardware. The change got merged during a broader “storage hardening”
sprint with good intentions and insufficient skepticism.

Within a day, application response times became jittery. Not catastrophically slow—worse. Intermittent. Teams chased DNS, then load balancers,
then kernel versions, because the storage graphs looked “fine” in average throughput.

The real issue was self-inflicted: workloads that previously wrote async (and were designed to) were now forced into sync semantics. The SLOG
dutifully accepted a flood of tiny commits, and the pool had to absorb them on TXG sync. Tail latency went up across the board.
The change also increased write amplification and made SSD wear a budget line item.

Rolling back to sync=standard stabilized latency immediately. They kept the SLOG because NFS metadata bursts and certain clients did benefit.
But the lesson stuck: don’t force durability semantics to “use the thing you bought.” Make the system correct first, then make it fast where it matters.

Boring but correct practice that saved the day: mirrored SLOG plus rehearsed replacement

A financial services firm ran ZFS-backed NFS for a cluster that processed time-sensitive batch windows. The storage design was unglamorous:
mirrored SLOG on PLP SSDs, alerts wired to on-call, and a quarterly maintenance drill where they simulated a log device failure.

One quarter, a log device started throwing media errors. Monitoring fired. On-call replaced it during business hours without drama because
they’d practiced. Performance barely moved because the log was mirrored and the surviving device carried the load.

The punchline is that nobody outside infrastructure noticed. That’s success in operations: the absence of stories.
The team still wrote an internal note, updated the runbook, and checked firmware versions across the fleet.

This is why I push mirrored SLOG for sync-heavy production. Not because it’s theoretically required, but because it prevents avoidable “we should have”
moments. The cheapest incident is the one you never get invited to explain.

Checklists / step-by-step plan

Step-by-step: sizing and deploying a SLOG the right way

  1. Confirm the workload is sync-heavy.
    Use zpool iostat -v during peak. If logs show activity, you’re in SLOG territory.
  2. Measure acceptable latency.
    For VM and database workloads, aim for stable sub-millisecond durable write latency if possible.
  3. Estimate capacity need.
    Peak sync write rate × worst-case TXG sync time × overhead. If you don’t know the rate, measure it.
  4. Select device class first, size second.
    PLP enterprise SSD/NVMe, consistent latency, adequate endurance. Avoid consumer parts in the ACK path.
  5. Mirror it for production.
    Especially if the storage serves many clients or you have tight maintenance windows.
  6. Partition intentionally.
    Use a dedicated partition for SLOG if needed, but don’t share the device with random workloads.
  7. Add it cleanly (see the sketch after this list).
    Verify zpool status shows it under logs.
  8. Validate improvement with sync tests.
    Use safe fio tests on a non-critical dataset or maintenance window. Compare tail latency, not just averages.
  9. Set monitoring for log vdev health and latency.
    Alert on device errors, degradation, and significant changes in log vdev wait times.
  10. Rehearse failure.
    Practice replacing a log device. If you can’t do it calmly, you haven’t really designed for it.
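
A minimal sketch of steps 6–7 combined, assuming a dedicated NVMe pair and a 32 GiB log partition on each (device names and sizes are placeholders; triple-check paths before running destructive partitioning commands):

cr0x@server:~$ sudo sgdisk -n 1:0:+32G /dev/nvme0n1
cr0x@server:~$ sudo sgdisk -n 1:0:+32G /dev/nvme1n1
cr0x@server:~$ sudo zpool add tank log mirror nvme0n1p1 nvme1n1p1
cr0x@server:~$ sudo zpool status tank

Verify both partitions appear under logs before calling it done.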

Procurement checklist (print this before someone buys a “fast SSD”)

  • PLP: yes, documented.
  • Steady-state random write latency: consistent, not just bursty benchmarks.
  • Endurance rating appropriate for expected sync volume.
  • Firmware maturity and known-good behavior in your OS/kernel.
  • Thermals: won’t throttle in your chassis.
  • Form factor and interface: NVMe generally preferred for latency; SAS can be fine if consistent.
  • Mirror budget included.

FAQ

1) Does a bigger SLOG make ZFS faster?

Not usually. SLOG size rarely matters once it’s “big enough.” Latency and durability matter. If performance improves with a larger SLOG,
it often means your TXG syncs are taking too long and you’re buffering longer—not solving the root cause.

2) How do I know if my workload is using the SLOG?

Run zpool iostat -v during the workload and check whether the logs vdev shows write ops/bandwidth.
If it’s near zero, your workload isn’t issuing many sync writes or they’re not going through that pool path.

3) Is SLOG the same as L2ARC?

No. L2ARC is a read cache. SLOG is for the ZIL, which is about acknowledged sync writes surviving a crash.
Mixing the two mentally is how you end up buying the wrong hardware.

4) What happens if the SLOG device dies?

Typically the pool stays online and ZFS falls back to using the in-pool ZIL, but performance for sync writes may drop sharply.
Operationally, expect alerts and potential client-visible latency changes.

5) Should I mirror the SLOG?

If you care about consistent latency and uptime for sync-heavy workloads: yes. If this is a lab box or a non-critical system: maybe not.
But recognize what you’re choosing—risk and disruption, not just “saving a drive bay.”

6) Can I use a partition of a larger NVMe as SLOG?

Yes, but don’t share the rest of that NVMe with other workloads that can introduce latency spikes. Contention defeats the whole point.
If you partition, keep the device dedicated to ZFS roles.

7) Does sync=disabled eliminate the need for a SLOG?

It eliminates durability guarantees instead. It can be acceptable for strictly disposable data, but for VMs and databases it’s a policy decision
with consequences. Treat it like turning off seatbelts because it makes the car lighter.

8) Is logbias=throughput faster than logbias=latency?

It can be, depending on write patterns, but it’s not a free win. For latency-sensitive sync clients, latency is usually the right default.
Change it only with measurement and a rollback plan.

9) What’s a reasonable SLOG size for Proxmox or VM hosts on NFS?

Commonly 16–64 GiB effective need, implemented as a 100–400 GB mirrored PLP NVMe pair. If you observe sustained high sync write rates
and slow TXG sync, adjust—but verify you’re not masking a pool bottleneck.

10) Can a fast SLOG hide a slow pool?

Briefly, yes. It can make ACKs fast while the pool is still struggling to commit TXGs. Eventually the system applies backpressure.
If your pool is chronically saturated, fix the pool design.

Conclusion: practical next steps

Size your SLOG like an operator, not a collector. You want a device that can acknowledge sync writes quickly, correctly, and consistently—
and a capacity that covers the burst between TXG syncs with slack. That’s it.

Next steps that actually move the needle:

  1. Measure whether sync writes are the problem (zpool iostat -v, dataset sync property, fio fsync tests in a safe window).
  2. If sync writes are gating, buy for PLP + latency stability, not size. Mirror it if you value uptime.
  3. If TXG sync is slow, stop shopping for SLOGs and redesign the pool for write IOPS/bandwidth.
  4. Write a runbook for log vdev failure and practice it once. Your future self will be less tired.