ZFS Queue Depth: Why Some SSDs Fly on ext4 and Choke on ZFS

You swap a server from ext4 to ZFS because you want checksums, snapshots, send/receive, and the warm feeling that your storage is no longer held together by hope.
Then the graphs happen. Latency jumps, throughput flattens, and the SSD that looked like a rocket on ext4 suddenly behaves like it’s waiting for management approval.

This is usually not “ZFS is slow.” It’s ZFS being honest about the device, the workload, and the queueing model you accidentally built.
Queue depth is where a lot of these misunderstandings go to live.

What queue depth really means (and why you should care)

Queue depth is how many I/O operations are “in flight” to a device at once. If you have one outstanding request at a time,
you are basically driving a sports car in first gear while carefully obeying a 10 mph speed limit.

SSDs, especially NVMe, are designed to run with parallelism. They hide internal work (flash translation layer mapping, garbage collection,
wear leveling) behind multiple outstanding commands. With low queue depth, you get decent latency but mediocre throughput. With higher queue depth,
you usually get higher throughput… until you hit a cliff and latency explodes.

The part people miss: queue depth isn’t just a property of your benchmark tool. It is an emergent property of the full stack:
application concurrency, kernel I/O scheduler, filesystem behavior, caching, writeback policy, and device firmware.
ext4 and ZFS differ dramatically in how they produce I/O, when they submit it, and how they turn random writes into something the device can digest.

Also, “queue depth” has two meanings in common conversation:

  • Host-side queue depth: how many requests the OS has submitted but the device hasn’t completed.
  • Device-side queue depth: how many commands are sitting in the device queues (NVMe has multiple submission/completion queues).

If you’ve ever seen an SSD do great at QD=1 and awful at QD=32, congratulations: you’ve met the difference between “fast” and “stable.”
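
If you want to see both sides on a Linux host, sysfs exposes them. A minimal read-only sketch, assuming the device is nvme0n1 (substitute your own device name):

cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests   # host-side cap on requests per software queue
cr0x@server:~$ cat /sys/block/nvme0n1/inflight            # two counters: reads and writes currently in flight
cr0x@server:~$ ls /sys/block/nvme0n1/mq/                  # one directory per hardware (device-side) queue

Sample inflight in a loop during a slowdown and you have a poor man's queue-depth graph.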

Why ext4 can look faster on the same SSD

ext4’s normal operating mode is: accept writes, let the kernel page cache absorb them, and push them out later via writeback.
For many workloads, that means the application thinks it’s fast because it’s mostly writing to RAM. When the flush comes,
the kernel’s block layer and scheduler can combine, reorder, and stream I/O in ways your SSD likes.
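
You can see the acknowledgement-versus-durability gap with nothing fancier than dd. A hedged sketch, assuming /mnt/ext4 is a scratch ext4 mount you can write to (the path and file names are placeholders):

cr0x@server:~$ dd if=/dev/zero of=/mnt/ext4/ack.img bs=1M count=1024
cr0x@server:~$ dd if=/dev/zero of=/mnt/ext4/durable.img bs=1M count=1024 conv=fdatasync

The first run mostly measures RAM and writeback policy; the second includes the flush to media. The difference between the two is the page cache doing its job.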

ZFS does caching too, but it plays by different rules. ZFS is copy-on-write and transaction-based. It accumulates changes in memory,
then commits them in transaction groups (TXGs). It also validates data with checksums, writes new blocks instead of overwriting old ones,
and maintains metadata that ext4 doesn’t have to track in the same way.

This can be a performance win or a performance trap, depending on your workload and your device:

  • Win: ZFS can aggregate small random writes into larger sequential-ish writes at TXG commit time, which SSDs often handle well.
  • Trap: TXG commits arrive in bursts. A pile of I/O hits the device at once, queue depth spikes, and some SSDs fall apart.
  • Trap: ZFS metadata I/O can be more intense (especially under snapshots, small blocks, and churny datasets).
  • Trap: sync writes (fsync, O_DSYNC, databases) can be brutally honest. ext4 often gets away with “ordered mode + barriers”
    while ZFS insists on correct semantics. Correct can look slow when your hardware isn’t provisioned for it.

The practical takeaway: ext4 often “looks faster” in benchmarks that measure time-to-acknowledgement rather than time-to-durable-storage,
or that accidentally test page cache. ZFS tends to expose the true cost of durability, especially for sync-heavy apps.

How ZFS builds I/O: aggregation, TXGs, and the reality gap

The TXG rhythm: smooth on paper, spiky in production

ZFS collects dirty data in memory and periodically commits it as a TXG. That commit is where ZFS writes a lot of stuff:
new data blocks, new metadata blocks, updated indirect blocks, spacemaps, and finally uberblocks.
If the system is busy, you can get a near-continuous pipeline of TXGs, but each one still has a “push” moment.

Bursts aren’t inherently bad. SSDs like parallelism. The problem is when the burst pushes the device into its worst behavioral mode:
internal garbage collection, SLC cache exhaustion, firmware queue management overhead, or thermal throttling.
ZFS didn’t create those weaknesses; it just found them faster than ext4 did.
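
You can watch this rhythm yourself by putting per-second pool I/O next to the TXG-related tunables. A read-only sketch for OpenZFS on Linux, assuming a pool named tank:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout      # upper bound, in seconds, between TXG commits
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max   # dirty-data ceiling (bytes) that forces an earlier commit
cr0x@server:~$ zpool iostat tank 1                                 # watch write bandwidth pulse as each TXG lands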

ZIO pipeline and concurrency knobs

ZFS doesn’t submit I/O as a single stream. It has a pipeline (ZIO) with stages: checksum, compression, gang blocks, allocation,
issuing I/O, completion, etc. ZFS also has internal concurrency limits per vdev and per class of I/O.
Those limits have changed across OpenZFS versions and operating systems, but the idea remains: ZFS tries to be fair and stable,
not “max queue depth at all costs.”

If you come from ext4 and think “more queue depth is always better,” ZFS is going to disagree with you in production.
ZFS typically prefers controlled concurrency because uncontrolled concurrency turns into latency spikes,
and latency spikes turn into unhappy databases and timeouts.
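
On OpenZFS for Linux those per-vdev, per-class limits are visible as module parameters. A read-only sketch (the exact parameter set and defaults vary by version, so treat this as inspection, not a tuning recipe):

cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_vdev_*_active
# prints one line per parameter: min/max active I/Os per vdev for sync reads,
# sync writes, async reads, async writes, scrub, and so on

Read them before you ever consider raising them; mini-story 2 below is what skipping that step looks like.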

Why zvols can behave differently than datasets

A ZFS dataset is a filesystem whose I/O unit is recordsize (128K by default). A zvol is a block device whose unit is volblocksize (8K by default on older OpenZFS releases, 16K on newer ones).
The I/O patterns differ, the metadata differs, and the write amplification differs. With zvols you can easily end up with
lots of small synchronous writes—especially with VM images and databases that love fsync.

ext4 on a raw block device may be getting decent writeback and reordering. A zvol can be more “literal,” and you pay for it.
That’s not a moral failing. It’s an engineering choice you need to line up with the workload.
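
If you do use zvols, pick volblocksize deliberately at creation time, because it cannot be changed afterwards. A hedged sketch with a hypothetical volume name and a size chosen purely for illustration:

cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=16K tank/vm1
cr0x@server:~$ zfs get volblocksize tank/vm1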

One paraphrased idea, quoted because it's the closest thing operations has to scripture: "Hope is not a strategy." It circulates in SRE circles as received wisdom rather than as a verbatim quote from any one engineer.
(And yes, you should measure, not hope.)

Where SSDs choke: latency cliffs, firmware limits, and mixed I/O

The latency cliff is real

Many consumer and prosumer SSDs are optimized for desktop-style bursts, low queue depth, and read-heavy mixes.
They shine in synthetic ext4 tests that don’t sustain pressure long enough to trigger the ugly bits.
Under ZFS TXG commits or sustained sync writes, the device reaches steady state, the SLC cache empties,
and suddenly your “3 GB/s” SSD is doing “please hold” at 40–200 MB/s with latency that makes your app look haunted.

Firmware fairness vs. throughput tricks

Enterprise SSDs often trade peak benchmark numbers for predictable tail latency at higher queue depths.
They’re boring. Boring is good. Consumer SSDs often chase great marketing numbers by leaning on caches and aggressive write combining.
That can be great until the workload becomes continuous and multi-threaded. ZFS workloads frequently are exactly that: continuous and multi-threaded.

Mixed reads and writes with metadata

ZFS metadata is not free. Snapshots and clones can multiply metadata work. Small random writes create more metadata churn.
If your pool is near full, space allocation gets harder and spacemaps get busier. The SSD sees a mix:
small reads for metadata, writes for data, and sync points for consistency.
Some drives handle this with grace. Some handle it like a toddler handed a violin.

Joke #1: I once asked a consumer SSD about its steady-state write performance. It responded by thermal throttling out of spite.

Alignment and ashift: the silent performance tax

If your pool’s ashift doesn’t match the device’s real physical sector size, you can force read-modify-write behavior.
That means each “small” write can turn into multiple internal operations. ext4 can suffer from misalignment too,
but ZFS makes it easier to create a permanent misconfiguration at pool creation time.
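
Before you create a pool, ask the device what sector sizes it claims, remembering that many SSDs report 512-byte logical sectors regardless of what the flash underneath prefers. A quick check, assuming nvme0n1:

cr0x@server:~$ lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1

If PHY-SEC reads 512 on a modern SSD, don't take it at face value; setting ashift explicitly at pool creation (see Task 13) is the conservative move.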

Sync write semantics: ZFS refuses to lie

The fastest way to make ZFS “slow” is to run a sync-heavy workload on devices without power-loss protection and without a proper SLOG,
while also expecting ext4-like behavior. ZFS will honor the request: “make this durable.”
Your SSD will do its best impression of a spinning disk.
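
To see what "make this durable" costs on your hardware, measure sync writes directly instead of arguing about them. A hedged fio sketch against a hypothetical dataset path, issuing one 8K write at a time followed by fdatasync, which is roughly what a database commit feels like:

cr0x@server:~$ fio --name=synclat --filename=/tank/data/fio.sync --size=2G \
      --ioengine=psync --rw=randwrite --bs=8k --iodepth=1 --numjobs=1 \
      --fdatasync=1 --runtime=120 --time_based --group_reporting

Watch the clat percentiles: without PLP and without a SLOG, the p99 here tends to be the number your users describe as "random hangs."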

Interesting facts and historical context

  • ZFS originated at Sun Microsystems in the mid-2000s, designed for end-to-end integrity and pooled storage, not for chasing single-thread benchmark trophies.
  • Copy-on-write wasn’t invented by ZFS, but ZFS mainstreamed it for general-purpose filesystems at scale with integrated volume management.
  • NVMe introduced multiple hardware queues to reduce lock contention and CPU overhead compared to SATA/AHCI, making queue depth and CPU affinity more visible in performance.
  • Early SSD controllers were notorious for write cliffs when their internal mapping tables or caches filled; modern drives are better, but the shape of the cliff still exists.
  • ZFS has long used transaction groups to batch changes; this batching can improve throughput but also creates burstiness that exposes weak devices.
  • Linux I/O schedulers evolved from CFQ to deadline to mq-deadline and none; with NVMe, many stacks prefer “none,” shifting more behavior into the device and filesystem layers.
  • Sector sizes are messy: 512e and 4Kn exist because the industry tried to move to 4K sectors without breaking everything at once.
  • Checksumming data blocks means ZFS reads are often verified end-to-end; this is small CPU work but real, and it changes how “fast” feels under load.

Fast diagnosis playbook

When ZFS “chokes,” you don’t start by flipping random tunables. You start by finding which queue is filling:
application, ARC, ZIO, vdev, kernel block layer, or the SSD itself. Here’s a fast sequence that works when the pager is yelling.

First: identify whether you’re bound by latency, bandwidth, CPU, or sync

  1. Check latency and queueing per device (iostat). If await and aqu-sz climb, the device is saturated or misbehaving.
  2. Check ZFS-level wait (zpool iostat -w or -l). If disk wait dominates, you're bottlenecked below ZFS; if queue wait dominates, ZFS's own queuing and throttling are part of the story.
  3. Check CPU and softirqs. If CPU is pegged in kernel or softirq, the “storage problem” might be CPU contention.
  4. Check if the workload is sync-heavy. If yes, treat it as a durability problem first.

Second: isolate which kind of I/O is hurting

  1. Reads vs writes: are reads slow because metadata is thrashing?
  2. Small random vs large sequential: is recordsize/volblocksize mismatched?
  3. Steady state vs burst: does performance collapse after a minute? That’s SLC cache/GC/thermal behavior.

Third: confirm pool layout and the non-negotiables

  1. Pool fullness: if you’re above ~80–90% on many pools, allocation costs rise and SSDs suffer.
  2. ashift: wrong ashift is forever unless you rebuild.
  3. Sync strategy: if you need sync semantics, use appropriate hardware or accept the cost.
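
To grab the raw numbers for this playbook in one pass while the pager is still yelling, here is a minimal capture sketch (pool name and output paths are placeholders):

cr0x@server:~$ iostat -x 1 30 > /tmp/iostat.cap &                 # block-layer view: await, aqu-sz, %util
cr0x@server:~$ zpool iostat -l tank 1 30 > /tmp/zpool-lat.cap &   # ZFS latency breakdown per interval
cr0x@server:~$ zpool iostat -q tank 1 30 > /tmp/zpool-queue.cap & # ZFS queue occupancy by I/O class
cr0x@server:~$ wait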

Practical tasks: commands, outputs, decisions (12+)

These are the checks I run when someone tells me “ext4 was faster.” The point is not to collect trivia.
The point is to decide what to fix, what to leave alone, and what to rebuild.

Task 1: Confirm the device and transport (NVMe vs SATA, link speed)

cr0x@server:~$ lsblk -d -o NAME,MODEL,TRAN,ROTA,SIZE
NAME    MODEL           TRAN ROTA   SIZE
nvme0n1 SAMSUNG_MZVLW1  nvme    0 953.9G
sda     ST2000DM008     sata    1   1.8T

What it means: TRAN tells you if you’re on NVMe or SATA; ROTA shows rotational.
If you expected NVMe and see SATA, your “ZFS problem” might be a procurement problem.
Decision: Verify your test target. Do not tune ZFS for the wrong device class.

Task 2: Check kernel I/O scheduler for the device

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq

What it means: NVMe commonly runs with none, meaning minimal host scheduling.
Decision: If you see bfq on fast NVMe and you’re chasing throughput, consider none or mq-deadline.
If you’re chasing latency fairness under mixed load, mq-deadline can be saner than “trust the firmware.”
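
Switching schedulers is a one-liner and takes effect immediately, but it does not survive a reboot; persist it with a udev rule only after it proves itself. A sketch, assuming nvme0n1:

cr0x@server:~$ echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler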

Task 3: Observe queue depth and latency live (block layer view)

cr0x@server:~$ iostat -x 1 nvme0n1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.21    0.00    4.88    2.44    0.00   84.47

Device            r/s     w/s   rKB/s   wKB/s  rrqm/s  wrqm/s  %util  await  aqu-sz
nvme0n1         1200    9000   98000  620000     0.0     0.0   99.5   6.80   62.4

What it means: aqu-sz is average queue size; await is average latency.
%util near 100% means the device is busy.
Decision: If queue size is huge and latency rises, you’re saturating the device or hitting a firmware cliff.
If queue is small but throughput is low, your bottleneck is above the device (CPU, sync, or ZFS throttling).

Task 4: Check ZFS vdev latency and queueing

cr0x@server:~$ zpool iostat -v 1
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
tank         1.23T  2.41T    900   7200  110M   690M
  mirror     1.23T  2.41T    900   7200  110M   690M
    nvme0n1      -      -    450   3600   55M   345M
    nvme1n1      -      -    450   3600   55M   345M

What it means: This tells you if ZFS is issuing I/O and how it’s distributed across vdevs.
Decision: If one leaf vdev is slower, it will drag the whole pool. Replace the problem child or re-balance layout.

Task 5: Confirm pool health and error counters

cr0x@server:~$ zpool status -v
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvme0n1 ONLINE       0     0     0
            nvme1n1 ONLINE       0     0     0

errors: No known data errors

What it means: If you have retries, checksum errors, or degraded devices, performance issues are expected.
Decision: Fix reliability first. Tuning around errors is how incidents get promoted to outages.

Task 6: Check dataset properties that change I/O shape

cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,compression,atime,primarycache,sync tank/data
NAME       PROPERTY      VALUE
tank/data  recordsize    128K
tank/data  compression   zstd
tank/data  atime         off
tank/data  primarycache  all
tank/data  sync          standard

What it means: recordsize affects write amplification for small random writes; sync affects durability behavior.
Decision: For databases with 8K pages, consider recordsize=16K or 8K only if you understand the metadata cost.
Don’t cargo-cult this; measure.
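
If measurement says a smaller recordsize wins, the change itself is trivial; just remember it only applies to blocks written after the change, so existing data keeps its old layout until rewritten. A sketch against the same dataset:

cr0x@server:~$ sudo zfs set recordsize=16K tank/data
cr0x@server:~$ zfs get recordsize tank/data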

Task 7: Check zvol settings if you’re serving VM disks or iSCSI

cr0x@server:~$ zfs get -o name,property,value -s local,default volblocksize,compression,sync,logbias tank/vm0
NAME      PROPERTY      VALUE
tank/vm0  volblocksize  8K
tank/vm0  compression   off
tank/vm0  sync          standard
tank/vm0  logbias       latency

What it means: Small volblocksize can increase IOPS but can also punish the pool with metadata and fragmentation.
Decision: If you see heavy sync writes and no SLOG, expect pain. If you have a SLOG, logbias=latency is reasonable.

Task 8: Detect sync-heavy behavior from the application side

cr0x@server:~$ strace -f -tt -e trace=fdatasync,fsync,pwrite64 -p 2143
12:19:41.102334 fdatasync(7) = 0
12:19:41.104981 pwrite64(7, "...", 8192, 123904) = 8192
12:19:41.106204 fdatasync(7) = 0

What it means: If you see frequent fsync/fdatasync, your workload cares about durability and ordering.
Decision: Treat this as a sync latency problem. Consider SLOG (with PLP), adjust sync only with eyes open.

Task 9: Verify SLOG presence and whether it’s actually being used

cr0x@server:~$ zpool status
  pool: tank
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
        logs
          nvme2n1    ONLINE       0     0     0

What it means: A device under logs is a separate intent log (SLOG).
Decision: If you run sync-heavy workloads, a proper SLOG can cut tail latency. If your SLOG is a random consumer SSD without PLP,
you’re buying speed with integrity risk. Don’t.
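
For reference, adding a log vdev is a single command. The device path below is a placeholder; mirroring the SLOG is a reasonable habit if losing it during a crash would hurt:

cr0x@server:~$ sudo zpool add tank log /dev/disk/by-id/nvme-PLP_SSD_EXAMPLE
cr0x@server:~$ zpool status tank
# mirrored variant: zpool add tank log mirror <first-plp-ssd> <second-plp-ssd>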

Task 10: Check pool fragmentation and capacity pressure

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health
NAME  SIZE  ALLOC   FREE  CAP  FRAG  HEALTH
tank  3.64T  3.20T   440G  87%   62%  ONLINE

What it means: High cap and high frag often correlate with worse allocation behavior and more metadata churn.
Decision: If you’re running near full, stop tuning and start planning capacity. Free space is performance.

Task 11: Measure ZFS ARC behavior (are you actually caching?)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:20:10  9850  1200     12   500   5    700   7      0   0   32.0G  32.0G
12:20:11 10310  1500     15   800   8    700   7      0   0   32.0G  32.0G
12:20:12  9980  2200     22  1600  16    600   6      0   0   32.0G  32.0G

What it means: Rising miss% means you’re going to disk more often.
Decision: If your working set doesn’t fit and you have a read-heavy workload, consider more RAM or restructure datasets.
Don’t add L2ARC as a first reflex; it can add overhead.

Task 12: Check TRIM/discard status (SSD steady-state behavior)

cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

What it means: autotrim=on helps SSDs maintain performance by informing them of freed blocks.
Decision: If autotrim is off on SSD pools and you have churn, consider enabling it (after validating OS/ZFS version support).
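
Enabling it plus a one-time catch-up trim looks like this; treat it as a sketch and schedule the manual trim for a quiet window on busy pools:

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank        # one-off trim of space freed before autotrim was on
cr0x@server:~$ zpool status -t tank        # -t shows trim state and progress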

Task 13: Confirm ashift (alignment) on each vdev

cr0x@server:~$ zdb -C tank | grep -E 'ashift|path' -n
128:            path: '/dev/nvme0n1'
135:            ashift: 12
142:            path: '/dev/nvme1n1'
149:            ashift: 12

What it means: ashift: 12 corresponds to 4K sectors. Many SSDs want at least this.
Decision: If you see ashift: 9 on modern SSDs, you likely have a permanent performance tax. Rebuild the pool correctly.
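
When you do rebuild, set ashift explicitly instead of trusting autodetection, since it is fixed per vdev for the life of the pool. A hedged sketch with placeholder pool and device names:

cr0x@server:~$ sudo zpool create -o ashift=12 tank2 mirror \
      /dev/disk/by-id/nvme-DRIVE_A /dev/disk/by-id/nvme-DRIVE_B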

Task 14: Check NVMe health indicators and throttling hints

cr0x@server:~$ nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0x00
temperature                         : 71 C
available_spare                     : 100%
percentage_used                     : 3%
data_units_read                     : 123,456,789
data_units_written                  : 98,765,432
host_read_commands                  : 4,321,000,000
host_write_commands                 : 7,654,000,000
controller_busy_time                : 12,345
media_errors                        : 0
num_err_log_entries                 : 0

What it means: 71°C is flirting with throttling territory for some drives.
Decision: If performance collapses under load and temperature is high, fix airflow before you write a tuning doc.

Task 15: Reproduce with a benchmark that controls queue depth and bypasses cache

cr0x@server:~$ fio --name=randwrite --filename=/tank/data/fio.test --size=8G --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --group_reporting
randwrite: (groupid=0, jobs=4): err= 0: pid=8812: Fri Dec 22 12:24:33 2025
  write: IOPS=52.4k, BW=205MiB/s (215MB/s)(8192MiB/39945msec)
    slat (usec): min=4, max=980, avg=18.21, stdev=9.44
    clat (usec): min=50, max=120000, avg=2400.15, stdev=7800.22
    lat (usec): min=60, max=120500, avg=2419.10, stdev=7800.40

What it means: Average latency is 2.4ms but max is 120ms: tail latency is ugly.
Decision: If ext4 looks “faster,” repeat the same fio profile on ext4 with direct=1.
If ZFS tail latency is worse, focus on sync, pool fullness, device thermal/firmware behavior, and I/O pattern mismatch.
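
The apples-to-apples ext4 run is the same job pointed at an ext4 mount; only the filename changes (the mount point below is a placeholder):

cr0x@server:~$ fio --name=randwrite --filename=/mnt/ext4/fio.test --size=8G --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --group_reporting

Compare clat percentiles, not just bandwidth; the tail is usually where the two filesystems stop agreeing.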

Common mistakes: symptom → root cause → fix

1) “Throughput is fine, but latency spikes every few seconds”

Symptom: Periodic latency spikes; application timeouts; graphs look like a heartbeat.

Root cause: TXG commit bursts hitting a weak SSD steady-state regime (SLC cache exhaustion, GC) or thermal throttling.

Fix: Confirm with fio steady-state tests and NVMe temperature. Improve cooling, choose SSDs with predictable sustained writes, and keep the pool out of capacity pressure.

2) “ext4 beats ZFS by 2–5× on writes in my test”

Symptom: Simple file copy or naive benchmark reports ext4 much faster.

Root cause: Benchmark is measuring page cache and delayed allocation, not durability; ZFS test is hitting sync semantics or direct I/O.

Fix: Use direct=1 in fio, or at least flush caches; compare apples to apples (sync vs async, buffered vs direct).

3) “VM storage on zvol is slow, datasets look okay”

Symptom: Random write IOPS poor, fsync-heavy workloads stall, guests complain.

Root cause: Small volblocksize + sync-heavy guest workload + no SLOG; also possible mis-sized volblocksize relative to guest filesystem.

Fix: Add proper SLOG (PLP), consider matching volblocksize to workload, consider dataset-based storage for some VM patterns, validate sync expectations.

4) “It got worse after we enabled compression”

Symptom: Higher CPU, lower throughput, more latency under load.

Root cause: CPU becomes bottleneck, or compression algorithm is too heavy for the core budget; also can change I/O size distribution.

Fix: Use a faster algorithm (e.g., zstd at a low level), pin down CPU contention, measure again. Compression is not free; it’s a trade.

5) “Everything was fine until the pool hit 85–90%”

Symptom: Writes slow down, metadata reads increase, fragmentation rises.

Root cause: Allocation becomes expensive; SSDs see more random writes; ZFS has fewer large contiguous regions.

Fix: Add capacity, delete/relocate data, reduce snapshot churn, keep headroom. Tuning cannot replace free space.

6) “We set sync=disabled and it’s fast now”

Symptom: Performance improves dramatically; everyone celebrates.

Root cause: You traded durability for speed. A crash can lose acknowledged writes. On some systems it can corrupt application-level invariants.

Fix: Undo unless you have a hard, documented reason and acceptance of data loss. Use SLOG or better devices instead.

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company moved a busy PostgreSQL cluster from ext4 on RAID10 to ZFS mirrors on “fast” NVMe.
The migration plan was solid: snapshots, rollback points, controlled cutover, the usual grown-up stuff.
The one unspoken assumption: “NVMe is NVMe, it’ll be fine.”

Within hours, they saw periodic query stalls. Not constant slowness—worse. Random 2–10 second pauses that triggered connection pool exhaustion.
The database logs showed bursts of slow fsync. The OS looked calm. CPU was fine. Network was fine. The storage graphs looked like a seismograph.

The root cause wasn’t ZFS correctness overhead. It was the SSD model’s steady-state write behavior under sustained sync pressure.
Under ext4, the workload had been “smoothed” by page cache and writeback; the RAID controller cache (with battery) also helped.
Under ZFS, with correct sync semantics and no dedicated log device, the pool forced the SSDs to show their true tail latency.

The fix was unglamorous: add proper power-loss-protected log devices, adjust the pool layout to match the workload,
and replace the worst SSD model with drives that had predictable sustained write latency.
The postmortem takeaway was even less glamorous: procurement checklists now include sustained write tests, not just spec sheet IOPS.

Mini-story 2: The optimization that backfired

An internal platform team ran a multi-tenant virtualization cluster. They were proud of their ZFS skills and wanted maximum performance.
They saw a forum post that said, roughly, “increase concurrency; ZFS is conservative.”
They adjusted several ZFS module parameters to allow deeper queues and more outstanding I/O per vdev.

In single-tenant benchmarks, it looked great. Throughput went up. Everyone posted screenshots. Then the real workload returned:
dozens of VMs doing mixed reads, writes, and sync bursts. Tail latency got worse. Not a little worse. “Users filing tickets about slowness” worse.

The system had become excellent at saturating the SSDs and terrible at keeping latency predictable.
Under mixed load, the deeper queues caused request bunching. The SSD firmware responded with internal reordering and long garbage-collection pauses.
The average metrics looked fine. The 99.9th percentile looked like a fire.

They rolled back the tuning and focused on workload isolation instead: separate pools for noisy neighbors,
appropriate SLOG for sync-heavy tenants, and leaving ZFS’s default throttling alone unless they had a measured reason.
Performance “decreased” on synthetic tests, and the platform got faster in the only way customers care about: fewer stalls.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company (which means audits and long memories) ran ZFS for a log ingestion pipeline.
Not glamorous. Mostly sequential writes with periodic compactions. They had a policy: never run pools above 75%,
always keep autotrim enabled on SSD pools, and always test new SSD models for 30-minute steady-state writes before approving them.

One quarter, a vendor substituted a “newer” SSD model due to supply constraints. On paper it was faster.
The team ran their steady-state test anyway. At minute 12, write latency jumped by an order of magnitude and stayed there.
The drive wasn’t failing; it was just doing what it does when its cache and mapping tables are under sustained pressure.

They rejected the substitution and kept the older model. A month later, another department deployed the substituted model in a different system
and spent weeks chasing intermittent timeouts. The ZFS team didn’t have to say “told you so” out loud.
The boring process—capacity headroom, trim policy, and steady-state testing—prevented a slow-burn incident.

Joke #2: The best performance tuning is sometimes a spreadsheet and the ability to say “no,” which is why it’s so rarely deployed.

Checklists / step-by-step plan

Step-by-step: reproduce the ext4 vs ZFS claim without lying to yourself

  1. Pick the same device and partitioning style. Don’t compare ext4 on a raw disk to ZFS on a sparse zvol unless you mean to.
  2. Run direct I/O tests first. Use fio with direct=1 so you’re measuring the device + filesystem, not RAM.
  3. Test at multiple queue depths. QD=1, 4, 16, 32 (a sweep sketch follows this list). Watch where latency breaks. That break point matters more than peak IOPS.
  4. Run long enough to reach steady state. Minutes, not seconds. Many SSD cliffs appear after caches are exhausted.
  5. Separate sync vs async tests. If your workload does fsync, test fsync. If it doesn’t, don’t punish yourself with sync tests.
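
A queue-depth sweep that honors steps 2–4 can be a simple loop; the file path is a placeholder and five minutes per depth is a starting point, not gospel:

cr0x@server:~$ for qd in 1 4 16 32; do
>   fio --name=qd$qd --filename=/tank/data/fio.sweep --size=8G --direct=1 \
>       --ioengine=libaio --rw=randwrite --bs=4k --iodepth=$qd --numjobs=1 \
>       --runtime=300 --time_based --group_reporting
> done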

Step-by-step: production-safe tuning order (least regret first)

  1. Fix hardware/firmware realities. Cooling, correct SSD class, avoid bargain drives for sync-heavy workloads.
  2. Fix pool capacity and layout. Headroom, vdev design, mirrors vs RAIDZ based on latency needs.
  3. Fix obvious dataset mismatches. recordsize/volblocksize, atime, compression level.
  4. Address sync semantics properly. SLOG with PLP, or accept the performance cost.
  5. Only then touch deeper ZFS internal tunables. And do it with rollback plans and measured outcomes.

Checklist: before you blame ZFS

  • Are you comparing buffered writes on ext4 vs durable writes on ZFS?
  • Is the pool above 80–85% capacity or heavily fragmented?
  • Is the SSD temperature high under load?
  • Does the workload do frequent fsync/fdatasync?
  • Is ashift correct?
  • Are you using zvols where datasets would be simpler (or vice versa)?
  • Do you have predictable enterprise SSDs or spiky consumer ones?

FAQ

1) Is ZFS “lower queue depth” by design?

ZFS tends to manage concurrency to protect latency and fairness. It can drive high queue depth under load, but it tries not to melt the device.
If your SSD needs extreme QD to perform, that’s a device characteristic you should validate under steady state.

2) Why do some SSDs benchmark great on ext4 and terrible on ZFS?

ext4 benchmarks often hit page cache and delayed writeback, masking device cliffs. ZFS can push more truthful patterns: bursts at TXG commit,
more metadata churn, and more explicit sync behavior. Weak steady-state SSDs get exposed.

3) Should I change the Linux I/O scheduler for NVMe when using ZFS?

Sometimes. “none” is common and often fine. If you need latency stability under mixed workloads, mq-deadline can help.
Don’t expect miracles; scheduler tweaks can’t fix a drive that collapses under sustained writes.

4) Does adding a SLOG always improve performance?

Only for sync writes. For async-heavy workloads, SLOG does little. For sync-heavy workloads, a proper PLP SLOG can reduce tail latency a lot.
A non-PLP SLOG is a reliability gamble, not an optimization.

5) Are zvols inherently slower than datasets?

Not inherently, but they behave differently. zvols can amplify small random writes and sync patterns. Datasets can aggregate writes more naturally.
Choose based on workload: VM images and iSCSI often want zvols; file workloads often want datasets.

6) What’s the single biggest “oops” that causes ZFS SSD pain?

Running a sync-heavy workload on consumer SSDs without PLP and without a real SLOG, then expecting low tail latency.
Second place is running pools too full and wondering why allocation is slow.

7) Should I tune recordsize to 8K for databases?

Sometimes, but it’s not free. Smaller recordsize can reduce read amplification for small reads but increases metadata and fragmentation risk.
Many database deployments do well with 16K or 32K; some need 8K. Measure with realistic concurrency and steady-state tests.

8) How can I tell if I’m seeing an SSD SLC cache cliff?

Run a sustained write test for long enough (10–30 minutes) with direct I/O and a realistic queue depth.
If throughput drops sharply after a short period and latency balloons, that’s classic cache exhaustion or GC behavior.

9) Is autotrim safe to enable?

On modern OpenZFS and SSDs, autotrim is commonly used and helps steady-state performance. Still, validate in your environment.
If you have ancient firmware or weird virtualization layers, test first.

10) Why does performance change with snapshots?

Snapshots increase metadata work and can increase fragmentation under churn. Deletes become more expensive because blocks are referenced by snapshots.
This can change write behavior and increase random I/O, which some SSDs handle poorly at higher queue depths.

Conclusion: next steps you can actually do today

If an SSD “flies on ext4 and chokes on ZFS,” you’re probably seeing one of three things: a benchmark that flatters ext4 by testing cache,
a sync-latency reality check, or an SSD that collapses under sustained mixed I/O when queue depth rises.
ZFS is rarely the villain. It’s the auditor.

Do this next:

  1. Re-test with fio using direct I/O at multiple queue depths and for long enough to hit steady state.
  2. Capture iostat and zpool iostat during the slowdown and decide whether the bottleneck is device saturation, sync latency, or CPU/interrupt overhead.
  3. Check pool headroom and ashift; if you’re too full or misaligned, fix the design, not the symptoms.
  4. If sync is the issue, solve it correctly: PLP SLOG, better SSDs, or accept that durability costs something.
  5. Resist random tuning. Tune only after you can explain the queue that’s filling and why.

ZFS is a system for people who prefer reality over vibes. If you want vibes, buy a marketing benchmark and frame it.
