ZFS Using NVMe as SLOG: When It’s Perfect and When It’s Overkill

You add a fast NVMe “log device” to your ZFS pool and expect instant glory. Your VM latency graphs should flatten, your NFS clients should stop complaining, your database should finally look impressed.
Then… nothing. Or worse: it gets faster in benchmarks and slower in production. That’s the SLOG experience when the problem wasn’t sync writes in the first place.

The Separate Intent Log (SLOG) is one of the most misunderstood ZFS features because it solves a very specific problem extremely well—and does almost nothing for everything else. The trick is knowing which camp you’re in before you buy hardware, burn bays, and explain to finance why you need “one tiny SSD for logs.”

SLOG in one sentence (and why that sentence matters)

A SLOG is a dedicated device for ZFS’s ZIL records that accelerates synchronous writes by committing intent quickly and safely, without waiting for the main pool.

Read that again and underline “synchronous.” If your workload is mostly async writes, reads, or CPU-bound, an NVMe SLOG is expensive placebo. If your workload is dominated by sync write latency (NFS with sync, ESXi datastores, some databases, some VM workloads), a good SLOG is a cheat code.

Also: ZFS already has a ZIL even without a SLOG. By default the ZIL lives on the pool disks. Adding a SLOG doesn’t “enable logging”; it relocates the hot part of the sync-write path to something faster (and ideally safer).

What SLOG actually accelerates (and what it never will)

The sync write path in plain English

When an application does a synchronous write, it’s asking the storage stack for a promise: “If you say ‘done’, this data will survive a crash.”
ZFS fulfills that promise by writing a small record describing the write (the “intent”) to the ZIL, then acknowledging the write to the application.
Later, during the next transaction group (TXG) sync, ZFS writes the actual blocks into the main pool and frees the ZIL entries.

With no SLOG, those ZIL writes hit the same vdevs that hold your data. With a SLOG, those ZIL writes hit the log device instead, which can be dramatically lower latency than a RAIDZ of HDDs or even SATA SSDs.
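
A quick way to feel this path on your own pool is to write the same data with and without a per-write durability requirement. This is a minimal sketch using GNU dd against a throwaway file (the path is just an example; delete the file afterwards):

cr0x@server:~$ dd if=/dev/zero of=/tank/vmstore/dd.test bs=8k count=10000 conv=fsync
cr0x@server:~$ dd if=/dev/zero of=/tank/vmstore/dd.test bs=8k count=10000 oflag=dsync

The first run flushes once at the end and mostly rides the async path; the second asks for durability on every 8K write and lives on the ZIL path. On a sync-limited pool the second run is dramatically slower, and that gap is roughly the territory a SLOG can win back.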

What a SLOG helps

  • Sync write latency: the thing clients feel as “my VM is laggy” or “NFS feels sticky.”
  • Sync write IOPS: especially many small fsync-heavy operations.
  • Contention on pool vdevs: moving ZIL traffic away can reduce random write pressure on the main pool during sync-heavy bursts.

What a SLOG does not help

  • Reads (that’s ARC, L2ARC, and your vdev layout).
  • Async writes (ZFS will buffer them in RAM and flush per TXG; SLOG isn’t on that critical path).
  • Sequential throughput (that’s mostly vdev bandwidth).
  • CPU-bound workloads (checksumming, compression, encryption, or simply an overloaded host).

One dry truth: a SLOG can’t make a bad pool layout good. It can make sync writes less miserable while you plan a better layout.

Joke #1: Buying a SLOG for an async workload is like installing a turbo on a parked car. It’s impressive hardware, and it still doesn’t go anywhere.

When NVMe as SLOG is perfect

1) NFS exports where clients demand sync semantics

Many NFS clients (and admins) insist on safe writes, and NFS itself is often used to back things that are allergic to data loss: VM datastores, build artifact stores, home directories with “my editor calls fsync,” and so on.
If your export is set to sync (or the client behavior effectively forces sync), the latency of committing those ZIL records is your user experience.

This is the canonical SLOG win: a pool of HDDs (or a RAIDZ of SSDs) serving NFS with steady fsync traffic. Put a low-latency NVMe with power loss protection (PLP) in front of that, and you’ll feel the difference.
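
Before crediting or blaming the export, confirm what the client actually negotiated; the server-side sync option and the client-side mount options are separate knobs. A minimal check from a Linux NFS client (hostname and mount point are examples):

cr0x@client:~$ findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS

Look at the OPTIONS column: a client-side async mount still sends COMMITs when applications call fsync, while a sync mount forces every write to be stable before it's acknowledged. Either way, the server export's sync setting decides whether the server is allowed to lie about stability.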

2) Virtualization storage with lots of small sync writes

VM platforms and guest filesystems generate sync writes more often than you think. Journaling, metadata updates, database flushes inside VMs, guest OS behavior, and application-level durability settings all stack.
A good SLOG can turn “random 8K sync write latency of 10–20ms” into “sub-millisecond to a few milliseconds,” which is the difference between “fine” and “why is the UI freezing.”

3) Databases that actually rely on fsync()

Some database workloads are dominated by durable commit latency. If the database is correctly configured to require durability (and you want that), you’re in sync land.
A fast, safe SLOG can reduce commit latency variance—variance is what causes tail latency spikes and user-visible stalls.

4) You have a slow main pool that you can’t rebuild yet

Sometimes you inherit a pool that’s “RAIDZ2 on nearline HDDs” because someone optimized for cost per terabyte and forgot that latency is also a cost.
An NVMe SLOG can be a tactical bandage: it reduces the pain for sync-heavy consumers while you plan the real fix (more vdevs, mirrors, special vdevs, or a different architecture).

5) You need predictable durability under load

The best SLOG devices aren’t just fast; they’re consistent. Sync write performance is about tail latency as much as it is about average latency.
Commodity SSDs can look great until they hit an internal garbage collection or a write cliff, and then your “durable write” suddenly takes 50ms. Users notice.

When SLOG is overkill

1) Your workload is mostly async writes

File copies, media ingest, backups that stream sequentially, object storage writes that batch, analytics pipelines that buffer—these typically don’t block on sync commits.
ZFS will soak a lot of that into ARC and flush on TXG boundaries. A SLOG won’t enter the room.

2) You’re actually read-bound or metadata-bound

The classic misdiagnosis: “storage is slow” when the bottleneck is cache misses, too few vdevs, or metadata scattered across HDDs.
You might need more mirrors, a special vdev for metadata/small blocks, more RAM for ARC, or better recordsize tuning. None of those are a SLOG.

3) You can (and should) fix the sync requirement instead

Sometimes the correct move is policy, not hardware: do you really need sync on that dataset? Are you exporting NFS with sync because “that’s what we always do,” while the data is non-critical and already replicated?
If you can tolerate a small window of loss on crash, setting sync=disabled on the right dataset gives you a larger speedup than any SLOG. It also increases risk. Don’t do it casually.
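
If you do make that call, scope it to the one dataset whose data is genuinely disposable, never the whole pool. A minimal sketch (dataset name is an example):

cr0x@server:~$ sudo zfs set sync=disabled tank/scratch
cr0x@server:~$ zfs get sync tank/scratch
NAME          PROPERTY  VALUE     SOURCE
tank/scratch  sync      disabled  local

Write down why it's disabled, and revert with zfs inherit sync tank/scratch once the data stops being disposable.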

4) You don’t have power-loss protection, so you’re “fast but lying”

A SLOG’s entire job is to make sync writes safe. If your device lies about flushes, or loses data on power loss, you can end up acknowledging durable writes that never made it. That’s not a performance bug. That’s corruption with a good marketing budget.

5) Your pool is already fast enough for sync

If you’re on all-mirror SSDs with solid latency and your sync write bottleneck is already small, the marginal gains from a SLOG can be hard to justify.
Measure first. If p99 sync latency is already in the low single-digit milliseconds and nobody is complaining, you have better places to spend your complexity budget.

Interesting facts and history that change decisions

  1. ZFS was born at Sun in the mid-2000s with an explicit focus on end-to-end integrity: checksums, copy-on-write, and transactional semantics weren’t afterthoughts.
  2. The ZIL exists on every pool whether you add a SLOG or not. The “separate” part is optional; the intent log is not.
  3. SLOG writes are usually short-lived: ZIL entries are replay material for crashes, not a long-term journal like ext4’s. They’re discarded after the TXG commits.
  4. Early “SSD as log” guidance came from the era when HDD pools were common and SSDs were expensive; the performance delta for sync writes was massive and obvious.
  5. Power-loss protection (PLP) became the dividing line between “consumer SSD seems fine” and “enterprise SSD is boringly correct.” ZIL correctness depends on honest flush behavior.
  6. SLOG doesn’t need big capacity because it only needs to hold a small window of outstanding sync writes—typically seconds—until TXG commit.
  7. Mirroring a SLOG is about availability, not speed: if the SLOG dies, the pool can keep running, but you lose the fast path for sync writes (and you may have a risk window during replacement).
  8. NVMe changed the conversation by making low latency accessible, but it also made it easy to buy the wrong device: consumer NVMe is fast, but PLP and consistent latency are the hard parts.
  9. ZFS’s “sync=disabled” is infamous because it can make benchmarks look heroic while quietly changing durability guarantees. It’s not a free lunch; it’s a different contract.

Hardware requirements: latency, PLP, endurance, and why “fast” is not enough

Latency beats throughput

SLOG performance is about how quickly the device can commit small writes and flush them safely. You care about:
low write latency, low flush latency, and stable tail latency. A device that does 7GB/s sequential writes but stalls on flush is a bad SLOG.

Power-loss protection (PLP) is non-negotiable for serious sync workloads

With sync writes, the application is betting its data on your storage stack telling the truth. PLP helps ensure that data acknowledged as durable can survive a sudden power cut.
In enterprise SSDs, that’s typically backed by capacitors and firmware designed to flush volatile buffers to NAND.

Can you run a consumer NVMe as SLOG? Yes. Should you for business-critical sync workloads? Not unless you’re comfortable explaining to incident response why a “durable commit” evaporated. Use the right tool.

Endurance matters, but less than people fear

SLOG writes can be write-intensive, but the total volume depends on how much sync write traffic you actually generate. The killer is not total bytes; it’s sustained write rate plus flushes, plus worst-case behavior under steady load.
Still: pick a device with real endurance, not a bargain drive that’s one firmware bug away from a bad week.

Form factor and connectivity matter operationally

U.2/U.3 or enterprise M.2 with proper cooling and monitoring is easier to run in production than a random consumer stick buried under a GPU.
Also consider hot-swap and how you’ll replace it at 2 a.m. without turning “log device replacement” into “host maintenance window.”

Paraphrasing Werner Vogels: design systems assuming things fail; reliability comes from expecting failure and recovering fast.

Sizing, mirroring, and topology choices

How big should a SLOG be?

Smaller than you think. The SLOG only needs to absorb the maximum amount of outstanding synchronous writes between TXG syncs.
Default TXG timeout is typically around 5 seconds. Even if you have hundreds of MB/s of sync writes, the required log space is not huge.
In practice, 8–32GB of effective log space is often plenty; people deploy 100–400GB because that’s what the SSD comes with, not because ZFS needs it.

But don’t cut it too close. Leave headroom for bursts, and remember that performance can degrade if the device is near full or heavily fragmented internally.

Mirror the SLOG when downtime or risk is expensive

A single SLOG device is a single point of performance failure. If it dies, your pool continues, but sync writes fall back to main vdevs, and latency can spike sharply.
Mirroring the SLOG keeps the fast path alive through a device failure and reduces “we replaced it and everything is slow until resilver” drama.

Don’t confuse “log vdev” with “special vdev”

The SLOG is for ZIL (sync intent records). A special vdev is for metadata and small blocks (and can be a game changer for HDD pools with lots of metadata I/O).
They solve different problems. Buying a SLOG when you needed a special vdev is a classic expensive mistake.

Partitioning a device for SLOG is possible—often a bad habit

Yes, you can partition an NVMe and use a small slice as SLOG. No, you shouldn’t do it casually in production.
Mixing workloads on the same device increases variance and complicates failure domains. If you must, isolate via separate devices and keep the SLOG’s behavior predictable.

Joke #2: Partitioning a single consumer NVMe into “SLOG + L2ARC + swap” is the storage equivalent of using one parachute for two people—creative, but the landing gets weird.
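
If you've inherited a box and aren't sure whether the log device is shared, lsblk settles it quickly (device name and sizes are illustrative):

cr0x@server:~$ lsblk /dev/nvme0n1
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme0n1     259:0    0 372.6G  0 disk
└─nvme0n1p1 259:1    0    16G  0 part

One small partition and nothing else is the answer you want. If the same device also carries swap, L2ARC, or a filesystem, expect latency variance that no amount of ZFS tuning will smooth out.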

Practical tasks: commands, outputs, and decisions

These are the checks I run when someone says “we added a SLOG and nothing changed,” or “sync writes are killing us,” or “is this NVMe safe as a log device.”
Each task includes a command, example output, what it means, and what decision you make.

Task 1: Confirm whether you even have a SLOG (and what it is)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
        logs
          nvme0n1p1                 ONLINE       0     0     0

errors: No known data errors

What it means: You have a log vdev (logs) on nvme0n1p1.

Decision: If the log device isn’t listed, you don’t have a SLOG. If it’s listed but not mirrored, decide if performance-only failure is acceptable.

Task 2: Check dataset sync settings (the workload contract)

cr0x@server:~$ zfs get -r sync tank/vmstore
NAME          PROPERTY  VALUE     SOURCE
tank/vmstore  sync      standard  default

What it means: Sync behavior is default (“honor the application’s sync requests”).

Decision: If this says disabled, your SLOG probably won’t do much because you’ve told ZFS to ignore sync semantics. If it says always, you just forced more writes onto the SLOG path—maybe intentionally, maybe not.
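
While you're looking at this dataset, check logbias too. It defaults to latency; if someone set it to throughput, ZFS deliberately keeps ZIL writes for that dataset off the log device, and the SLOG sits idle no matter how good it is:

cr0x@server:~$ zfs get logbias tank/vmstore
NAME          PROPERTY  VALUE    SOURCE
tank/vmstore  logbias   latency  default

Leave logbias=latency on datasets you bought the SLOG for; logbias=throughput is for workloads where bypassing the log device is the point.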

Task 3: Check how the NFS export is configured (sync vs async)

cr0x@server:~$ cat /etc/exports
/tank/vmstore 10.10.0.0/24(rw,sync,no_subtree_check)
/tank/media   10.10.0.0/24(rw,async,no_subtree_check)

What it means: vmstore is sync (SLOG relevant). media is async (SLOG mostly irrelevant).

Decision: Put SLOG effort where sync is required. Don’t “optimize” async exports with a SLOG and then wonder why nothing moved.

Task 4: See if the workload is generating sync writes at all

cr0x@server:~$ arcstat.py 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:10:01   222    12      5     2    1    10    4     0    0   42G   64G
12:10:02   240    14      5     3    1    11    4     0    0   42G   64G
12:10:03   230    11      4     2    1     9    4     0    0   42G   64G

What it means: This shows ARC behavior, not sync writes directly—but it tells you whether you’re read-bound and cache-miss heavy.

Decision: If the complaint is “slow reads,” stop talking about SLOG and start talking about vdev layout, ARC sizing, and maybe special vdevs.

Task 5: Observe ZFS latency and see if writes are the pain

cr0x@server:~$ zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        20.1T  10.4T    120    980  1.2M  38.4M
  raidz2-0  20.1T  10.4T    120    840  1.2M  32.1M
    sda         -      -     30    210   310K  8.1M
    sdb         -      -     28    205   295K  8.0M
    sdc         -      -     31    212   305K  8.0M
    sdd         -      -     31    213   310K  8.0M
logs            -      -      0    140      0  6.3M
  nvme0n1p1     -      -      0    140      0  6.3M

What it means: You see log device writes. That strongly suggests sync activity is happening.

Decision: If log writes are zero while users complain about “sync latency,” you may not actually be doing sync writes, or sync might be disabled somewhere. Go check the client/app behavior.
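
If your OpenZFS release supports the latency views of zpool iostat (most reasonably recent ones do), they answer the "is the log device fast?" question more directly than op counts:

cr0x@server:~$ zpool iostat -v -l tank 5
cr0x@server:~$ zpool iostat -w tank 5

The -l form adds per-vdev wait columns (total_wait, disk_wait, syncq_wait, asyncq_wait); the -w form prints latency histograms. What you want to see on the log vdev is sync write latency in the microseconds-to-low-milliseconds range with no second hump out in the tens of milliseconds; that hump is the classic signature of throttling or a device that falls over under sustained flushes.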

Task 6: Check ZIL/SLOG statistics via kstats (Linux OpenZFS)

cr0x@server:~$ egrep 'zil_commit|zil_itx_metaslab' /proc/spl/kstat/zfs/zil
zil_commit_count                4    184229
zil_commit_writer_count         4    182910
zil_itx_metaslab_normal_count   4    1319
zil_itx_metaslab_normal_bytes   4    43122688
zil_itx_metaslab_slog_count     4    918795
zil_itx_metaslab_slog_bytes     4    12899323904

What it means: Sync commits are happening, and nearly all ZIL block allocations land on the slog class (zil_itx_metaslab_slog_*) rather than the normal class, so the log device really is in the write path.

Decision: If the normal-class counters dominate even though a log vdev exists, ZIL writes are bypassing the SLOG (check logbias and confirm the log device is ONLINE). If the slog counters barely move while users complain about sync latency, the pain probably isn't the log path at all; look at the pool, the network, or the application.

Task 7: Verify the NVMe device has PLP hints and identify model

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | egrep 'mn|vid|oacs|oncs|vwc'
mn      : INTEL SSDPE2KX040T8
vid     : 0x8086
oacs    : 0x17
oncs    : 0x5f
vwc     : 0x01

What it means: You’ve identified the exact model; vwc indicates volatile write cache info is exposed (not a guarantee of PLP, but a clue).

Decision: If you can’t identify the model or it’s a consumer drive with unclear PLP behavior, don’t use it as a SLOG for critical sync workloads. Use an enterprise drive with known power-loss behavior.

Task 8: Check NVMe health and error counters

cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0x00
temperature                         : 43 C
available_spare                     : 100%
percentage_used                     : 2%
data_units_written                  : 184,992
media_errors                        : 0
num_err_log_entries                 : 0

What it means: Healthy, low wear, no media errors.

Decision: If critical_warning is non-zero, temperatures are high, or error logs climb, treat the SLOG as suspect. Replace early; it’s cheaper than chasing ghost latency.

Task 9: Check if the NVMe is thermal throttling (silent latency killer)

cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep 'temperature|critical_warning'
critical_warning                    : 0x00
temperature                         : 72 C

What it means: 72°C is flirting with throttling territory for many drives, depending on model and airflow.

Decision: If you see high temps during incidents, fix cooling before buying a “faster” drive. A throttling NVMe is a slow NVMe with better PR.

Task 10: Validate that the log device is actually being used for sync I/O

cr0x@server:~$ sudo zdb -C tank | egrep 'is_log'
                is_log: 0
                is_log: 1

What it means: The cached pool config contains two top-level vdevs, and one of them is flagged as a log vdev (is_log: 1).

Decision: If you expected a log vdev but don’t see it, you may have added it to the wrong pool, or it failed and was removed. Fix the configuration before tuning anything else.

Task 11: Measure sync write latency directly with a controlled test (use carefully)

cr0x@server:~$ fio --name=syncwrite --filename=/tank/vmstore/fio.test --rw=write --bs=8k --iodepth=1 --numjobs=1 --direct=1 --fsync=1 --size=512m --runtime=20 --time_based --group_reporting
syncwrite: (groupid=0, jobs=1): err= 0: pid=1337: Sat Dec 16 12:10:55 2025
  write: IOPS=1450, BW=11.3MiB/s (11.8MB/s)(226MiB/20002msec)
    clat (usec): min=190, max=14200, avg=640.12, stdev=410.50

What it means: This is a sync-heavy pattern (fsync=1). Average commit latency ~0.64ms with some tail spikes.

Decision: If latency is multiple milliseconds on an NVMe SLOG, investigate the device and the path (PCIe link, throttling, contention). If it’s tens of milliseconds, you may be falling back to the pool or the device is misbehaving.
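
A useful companion test bounds how much any SLOG could possibly help: run the same job against a throwaway dataset with sync honored versus ignored, and compare. Names here are examples, and this dataset must never hold real data:

cr0x@server:~$ sudo zfs create tank/synctest
cr0x@server:~$ fio --name=ceiling --filename=/tank/synctest/fio.test --rw=write --bs=8k --iodepth=1 --numjobs=1 --fsync=1 --size=512m --runtime=20 --time_based --group_reporting
cr0x@server:~$ sudo zfs set sync=disabled tank/synctest
cr0x@server:~$ fio --name=ceiling --filename=/tank/synctest/fio.test --rw=write --bs=8k --iodepth=1 --numjobs=1 --fsync=1 --size=512m --runtime=20 --time_based --group_reporting
cr0x@server:~$ sudo zfs destroy tank/synctest

The gap between the two runs is the theoretical maximum a perfect log device could recover. If the gap is small, stop shopping for SLOGs; the bottleneck is somewhere else.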

Task 12: Confirm TXG behavior and dirty data pressure (why sync may be stalling)

cr0x@server:~$ tail -2 /proc/spl/kstat/zfs/tank/txgs
9822163  6071372959872   C     83427328   0   96468992   0   731   5004211870   18221   27410   401238766
9822164  6076377171742   S     41943040   0   0          0   0     5002998811   19034   31077   0

What it means: Each row is a recent transaction group; the columns are txg, birth, state, ndirty, nread, nwritten, reads, writes, and four timings in nanoseconds (otime, qtime, wtime, stime). Here each TXG stays open about 5 seconds (otime) and syncs in about 0.4 seconds (stime), so the pool is keeping up.

Decision: If stime regularly approaches or exceeds the TXG interval, your pool is write-bound or under-provisioned. A SLOG helps with sync acknowledgement, but it doesn't magically increase the pool's ability to eventually commit.

Task 13: Check if your pool is constrained by too few vdevs (IOPS math)

cr0x@server:~$ zpool list -v tank
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank       30.5T  20.1T  10.4T        -         -    22%    65%  1.00x  ONLINE  -
  raidz2-0 30.5T  20.1T  10.4T        -         -    22%    65%      -  ONLINE  -

What it means: One RAIDZ vdev. That’s often great for capacity, not for IOPS.

Decision: If your workload is random I/O heavy, especially VM storage, the correct fix might be more vdevs (striped mirrors) rather than a SLOG.

Task 14: Confirm ashift and alignment (prevent self-inflicted latency)

cr0x@server:~$ zdb -C tank | egrep 'ashift'
                    ashift: 12

What it means: 4K sector alignment (ashift=12) which is generally correct for modern disks and SSDs.

Decision: If you see ashift: 9 on 4K drives, you can get pathological write amplification and latency. That’s a rebuild-level fix, not a SLOG fix.

Fast diagnosis playbook

When someone is yelling “storage is slow” and you have 20 minutes before the next escalation, this is the fastest way to determine whether SLOG is relevant.

First: determine if the pain is sync write latency

  • Check dataset sync and protocol settings (NFS/SMB/iSCSI). If sync isn’t in play, SLOG isn’t the hero.
  • Check for evidence of log device activity (zpool iostat -v showing writes under logs).
  • If possible, run a small controlled fio sync test on the dataset (don’t do this on a fragile system at peak load).

Second: validate the SLOG device isn’t the bottleneck

  • NVMe health (nvme smart-log) and temperature. Throttling produces “mystery” tail latency.
  • kstats for ZIL commit waits: are they climbing when latency spikes?
  • Confirm the log vdev is ONLINE and not silently faulted/removed.

Third: if sync is fine, find the real bottleneck

  • Read miss rate: ARC thrash vs actual disk reads.
  • Pool layout: too few vdevs or RAIDZ for VM workloads is a repeat offender.
  • CPU overhead: encryption/compression/checksum with saturated cores can masquerade as “slow disks.”
  • Network: NFS latency often lives in the NIC/MTU/interrupts world, not the pool.

Common mistakes: symptoms → root cause → fix

1) “We added an NVMe SLOG and nothing improved.”

Symptoms: Benchmarks change little; users still complain; zpool iostat shows minimal log writes.

Root cause: Workload is not dominated by sync writes (or sync is disabled, or clients aren’t issuing sync).

Fix: Verify with zfs get sync, NFS export options, and a sync-heavy fio test. If it’s read/metadata bound, consider ARC/special vdev/vdev layout changes.
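
A fast way to spot surprises across the whole pool is to list only the datasets where someone explicitly changed sync, filtering by property source so the defaults don't drown the output:

cr0x@server:~$ zfs get -r -s local,received -o name,value,source sync tank
NAME          VALUE     SOURCE
tank/scratch  disabled  local

If nothing comes back, every dataset is on sync=standard and the question becomes whether clients actually request durability at all.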

2) “Latency got better on average, but p99 is still awful.”

Symptoms: Mean latency drops; occasional multi-10ms stalls remain.

Root cause: NVMe throttling, internal garbage collection, firmware quirks, or contention from sharing the device.

Fix: Check temperature, controller logs, and avoid co-locating L2ARC/swap/other workloads on the same device. Use an enterprise SSD with consistent latency.

3) “After the SLOG died, everything became unusably slow.”

Symptoms: Pool stays ONLINE but sync-heavy clients crawl; recovery requires emergency hardware swap.

Root cause: Single log device failure removed the low-latency sync path; main pool can’t handle sync write IOPS.

Fix: Mirror the SLOG. Also address the underlying pool’s random write capability if it’s fundamentally mismatched to the workload.

4) “We used a consumer NVMe, and after a power event we had weird corruption.”

Symptoms: Application-level inconsistencies, database recovery anomalies, sometimes no obvious ZFS errors.

Root cause: Device acknowledged flushes without true persistence (no PLP, volatile cache semantics, firmware behavior).

Fix: Use a PLP-capable enterprise device for SLOG. Ensure stable power (UPS) and prefer mirrored SLOG if the workload is critical.

5) “We forced sync=always everywhere and now it’s slower.”

Symptoms: Write throughput drops; latency increases; SLOG shows heavy activity.

Root cause: You turned async traffic into sync traffic and overloaded the commit path.

Fix: Only use sync=always where it’s explicitly required. Keep default standard for general-purpose datasets.

6) “We mirrored the SLOG and expected it to double performance.”

Symptoms: No speedup; sometimes slightly worse.

Root cause: Mirrors are for redundancy; each ZIL write must be committed to both devices.

Fix: Mirror for availability and safety of the fast path. If you need more performance, you need a better single-device latency profile or a different architecture.

Three corporate mini-stories from the real world

Incident caused by a wrong assumption: “SLOG makes everything faster”

A mid-sized SaaS company ran a ZFS-backed NFS cluster for CI artifacts and container image layers. The storage team heard “NVMe makes ZFS fast” and installed a shiny log device on each filer.
The incident they were trying to solve was build jobs timing out during peak hours.

After the change, nothing improved. The graphs were stubborn. The build team still saw timeouts, and now the storage team had to defend a purchase that didn’t move the needle.
A senior engineer finally asked a rude question: “Are we even sync write bound?”

They checked exports. The artifact path was mounted with settings that favored throughput and accepted some risk; the actual bottleneck was metadata reads and directory traversals on a pool built from large RAIDZ vdevs of HDDs.
The workload was thousands of small file stats and opens—not fsync-heavy commits.

The fix wasn’t more log devices. It was a combination of adding a special vdev for metadata/small blocks and rethinking dataset recordsize for the hot paths.
The SLOG wasn’t wrong. It was just irrelevant. The wrong assumption was treating “fast device” as a universal performance spell.

Optimization that backfired: “Let’s use a cheap consumer NVMe as SLOG”

A finance-driven infrastructure refresh landed at a company that hosted multi-tenant workloads on ZFS over iSCSI. The team wanted lower commit latency for a few chatty databases.
Someone proposed a consumer NVMe as SLOG: “It’s fast and it’s only for logs. It’ll be fine.”

It was fine—until the first real power incident. Not a dramatic outage, just a quick blip: a PDU reboot during routine work. Systems came back. ZFS imported pools cleanly. No screaming.
A week later, a tenant opened a ticket: inconsistent application state. Then another.

Investigation was painful because nothing looked obviously broken. The pool was healthy. Scrubs were clean. The problem smelled like “durable writes that weren’t durable,” which is the nastiest kind of mystery.
They eventually correlated the first reports with the power blip and the consumer NVMe model used for SLOG.

The final change was boring: replace SLOG devices with enterprise drives with PLP, mirror them, and stop treating flush semantics as optional.
Performance stayed good, and more importantly, the system’s promises matched reality. The optimization backfired because it optimized the wrong metric: initial cost over correctness.

Boring but correct practice that saved the day: mirrored SLOG and rehearsed replacement

A regulated enterprise ran a ZFS storage appliance serving NFS to virtualization and some internal databases. They had a mirrored SLOG on PLP enterprise SSDs.
Nobody celebrated it. It was just there, quietly eating sync writes.

One afternoon, one SLOG device started logging media errors. The monitoring system flagged it, and the on-call SRE did the least heroic thing imaginable: followed the runbook.
They confirmed the pool was still using a mirrored log, verified the remaining device was healthy, and scheduled a replacement the same day.

During replacement, nothing noticeable happened to clients. Latency didn’t spike. There was no emergency change window.
The team swapped the device, re-added it as part of the mirrored log, and watched resilver complete quickly because the log vdev was small and the process was well understood.

The moral isn’t “mirrored SLOG always.” It’s that production systems pay you back for boring discipline: redundancy where it matters, monitoring that’s specific, and a practiced procedure so you don’t learn ZFS commands during an incident.

Checklists / step-by-step plan

Step-by-step: decide whether you need a SLOG

  1. Identify workloads: NFS for VMs? databases? home dirs? sequential backups?
  2. Confirm sync behavior: dataset sync, NFS export options, application durability settings.
  3. Measure: sync-heavy fio test on a staging dataset; compare p95/p99 latency with and without SLOG if possible.
  4. Check pool design: if it’s a single RAIDZ vdev serving VM I/O, fix that regardless.
  5. Decide risk tolerance: if data must be safe, don’t plan around sync=disabled.

Step-by-step: deploy an NVMe SLOG the sane way

  1. Pick hardware with PLP and stable latency. Treat “enterprise” as a requirement, not a vibe.
  2. Prefer mirroring if performance stability matters during device failure.
  3. Install with cooling: NVMe throttling is a performance outage wearing a trench coat.
  4. Add the log vdev during a controlled window and verify usage under real load.
  5. Monitor: temperatures, error logs, and ZIL commit wait behavior.

Step-by-step: safe operational habits

  1. Keep SLOG devices single-purpose.
  2. Record device model/firmware versions in your inventory.
  3. Test failure: simulate removal in a maintenance window and confirm the performance degradation is acceptable (a drill is sketched right after this list).
  4. Have a replacement runbook that includes how to re-add mirrored logs and verify pool state.
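
For a mirrored log, the least invasive drill is to offline one side, watch sync latency for a few minutes, then bring it back. A sketch, assuming the mirrored log devices used as examples elsewhere in this article:

cr0x@server:~$ sudo zpool offline tank nvme0n1p1
cr0x@server:~$ zpool status tank | grep -A3 logs
        logs
          mirror-1                  DEGRADED     0     0     0
            nvme0n1p1               OFFLINE      0     0     0
            nvme1n1p1               ONLINE       0     0     0
cr0x@server:~$ sudo zpool online tank nvme0n1p1

If latency stays flat with one log device offline, your failure plan is real. If it doesn't, you just learned that in a maintenance window instead of during an incident.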

Common ZFS actions for SLOG management

These are operationally common, but do them carefully: removing logs can change performance instantly under load.

Add a mirrored SLOG

cr0x@server:~$ sudo zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1

What it means: Adds a mirrored log vdev; ZIL writes commit to both devices.

Decision: Use when you need the fast sync path to survive a device failure.

Remove an existing log vdev

cr0x@server:~$ sudo zpool remove tank nvme0n1p1

What it means: Removes the log vdev; this works on any modern pool, and sync writes fall back to the in-pool ZIL on the main vdevs.

Decision: Only do this when you’re sure the pool can handle sync load or during a controlled maintenance window.

FAQ

1) Is SLOG the same thing as ZIL?

No. The ZIL is the on-disk intent log mechanism used for sync writes. A SLOG is a separate device that can hold ZIL records so they commit faster.

2) Will adding a SLOG speed up my SMB shares?

Sometimes, but only if the workload is sync-write heavy and SMB is configured/used in a way that forces durability semantics.
Measure. Don’t assume. SMB performance issues are often metadata, oplocks, or network/CPU issues.

3) Can I use any NVMe as SLOG?

You can, but you shouldn’t for critical data unless it has power-loss protection and consistent flush behavior.
“Fast” isn’t the requirement; “fast and honest under power loss” is.

4) Should I mirror the SLOG?

If losing the SLOG would cause a visible performance incident (and it often does for NFS+VM workloads), mirror it.
If it’s a lab box and you can tolerate a sudden performance cliff during failure, a single device may be acceptable.

5) How do I know if my clients are issuing sync writes?

You infer it from behavior: log device write activity, ZIL commit stats, application settings, and controlled tests (fio with fsync).
The honest approach: measure under production-like load and check whether latency correlates with ZIL commits.

6) Does SLOG help with pool scrubs or resilvers?

No. Scrubs and resilvers are about reading and reconstructing data. SLOG is about acknowledging sync writes quickly.

7) What’s the relationship between SLOG and L2ARC?

Completely different: L2ARC is a read cache extension; SLOG is a sync write accelerator.
People buy both when they’re desperate. Only one might be relevant. Sometimes neither.

8) Is sync=disabled safe if I have a UPS?

A UPS reduces the chance of sudden power loss, but it doesn’t eliminate kernel panics, controller resets, or firmware bugs.
sync=disabled changes durability semantics. Use it only when the data is disposable or protected elsewhere, and you’ve accepted the risk in writing.

9) How much does SLOG capacity matter?

Usually not much. What matters is latency, flush behavior, and consistency. Capacity just needs to cover the outstanding sync write window plus headroom.

10) If I have an all-SSD mirror pool, do I still need SLOG?

Often no, because the pool’s sync latency may already be good. But if you’re serving latency-sensitive NFS/iSCSI and you see sync commit bottlenecks, a SLOG can still help.
Measure first; don’t buy hardware out of habit.

Conclusion: practical next steps

NVMe as SLOG is perfect when you have real sync write pressure and you need durable acknowledgements with low, predictable latency.
It’s overkill when your workload is async, read-bound, metadata-bound, or simply suffering from a pool layout that can’t do random IOPS.

Next steps that actually move outcomes:

  1. Prove sync is the bottleneck: check dataset/export settings, observe log writes, run a sync-heavy fio test.
  2. Validate your SLOG device: PLP-capable, healthy, cool, and consistent. If it’s consumer-grade, treat it as a risk decision, not a technical detail.
  3. Design for failure: mirror the SLOG if performance stability matters, and practice replacement.
  4. Fix the pool if it’s the real problem: more vdevs, mirrors for IOPS, special vdev for metadata, and enough RAM for ARC often beat “one more gadget.”

If you do those in order, you’ll either get the SLOG win you expected—or you’ll save yourself from buying a very fast solution to the wrong problem.
