ZFS SSD-Only Pool Tuning: Avoiding Garbage Collection Collapse

The first sign is never “the pool is dying.” It’s a Slack message: “Why is the database randomly freezing?”
Then a dashboard: p99 latency turns into modern art. CPU is fine. Network is fine. ZFS says “ONLINE.”
And yet every few minutes your SSD-only pool hits a wall like it forgot how to write.

That wall is usually SSD garbage collection meeting ZFS write patterns, then multiplying the pain through sync writes,
small blocks, or accidental fragmentation. This is not theoretical. It happens in production, at 2 a.m., right after someone
“optimized” something.

What “garbage collection collapse” looks like on ZFS

SSDs don’t overwrite in place. They write to fresh pages and later clean up old ones. That cleanup is garbage collection (GC),
and it’s normally hidden behind controller firmware and spare area. ZFS also doesn’t overwrite in place. It’s copy-on-write.
So you’ve got two layers of “write somewhere else, clean up later” stacked together. When conditions are right (or wrong),
those layers synchronize into periodic misery.

In production you’ll see:

  • Latency spikes with no obvious CPU bottleneck. The system is “idle” while storage is melting.
  • Write IOPS collapse after sustained random writes, especially when pool utilization is high.
  • Sync-heavy workloads (databases, NFS, VM hypervisors) suddenly stall even though “it’s NVMe.”
  • zpool iostat -l shows large wait times while bandwidth looks unimpressive.
  • Spikes correlate with transaction group (TXG) commits or periodic bursts of small writes.

The key: GC collapse is not “ZFS is slow.” It’s a system-level feedback loop: ZFS produces a write pattern; SSD firmware reacts;
latency increases; ZFS queues more; firmware has less idle time; latency increases again. It’s like a traffic jam where everyone
decides to change lanes at the same time.

Joke #1: SSD garbage collection is like cleaning your apartment by shoving everything into one closet—until the closet starts charging rent.

Interesting facts and historical context

  • TRIM wasn’t always a given. Early SSD deployments often ran without effective TRIM/discard, so drives “forgot” which pages were free and performance degraded over time.
  • ZFS copy-on-write predates SSD ubiquity. ZFS was designed for data integrity and large storage; SSD behavior became a tuning concern later as flash became the default.
  • Write amplification is a double tax. SSDs already rewrite internal structures; CoW filesystems can increase internal movement if the workload fragments free space.
  • Enterprise SSDs are not just “faster.” They typically ship with more overprovisioning, better steady-state behavior, and firmware tuned for mixed workloads.
  • Power-loss protection (PLP) changed the SLOG story. The moment you rely on sync semantics, capacitor-backed SSDs become a design requirement, not a luxury.
  • 4K alignment mistakes are ancient and still happening. The industry moved from 512-byte sectors to 4K physical, and misalignment still silently wrecks performance.
  • NVMe solved a queueing problem, not physics. Deep queues and low overhead help, but flash erase blocks and GC still exist; latency cliffs are still real.
  • ZFS compression became a performance tool on SSDs. With LZ4, less NAND written can mean more throughput and less GC stress, not just saved space.
  • Special vdevs are a modern footgun. They can be amazing for metadata/small blocks, but the failure domain and sizing mistakes have taken down more than a few “fast” pools.

A mental model that actually predicts failures

1) ZFS writes in transaction groups, not in your application’s rhythm

Applications issue writes. ZFS accumulates dirty data in memory and flushes it to disk in transaction groups (TXGs).
Those commits are bursty. Bursty writes are fine for SSDs until the SSD is short on clean pages and needs to do GC
exactly when you’re hammering it. Then the burst turns into a stall.
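
If you want to see those TXG bursts directly, OpenZFS on Linux keeps a per-pool TXG history kstat. A minimal sketch, assuming a pool named ssdpool; the exact columns vary slightly between OpenZFS versions:

cr0x@server:~$ tail -n 5 /proc/spl/kstat/zfs/ssdpool/txgs

Watch the ndirty column (dirty bytes per TXG) and the sync time column (stime, in nanoseconds): if sync time balloons while ndirty stays modest, the devices are stalling during the flush, which is exactly the pattern this section describes.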

2) SSDs have a “steady-state” that marketing doesn’t print

Fresh-out-of-box performance is not steady-state performance. After the drive fills and rewrites, write amplification rises and
sustained random write performance can drop dramatically. A pool at 80–90% usage is the classic setup: less free space means
fewer easy choices for both ZFS allocator and the SSD’s FTL (flash translation layer).
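
To see the steady-state cliff yourself instead of trusting the datasheet, run a long random-write test against a file on the pool. A hedged sketch using fio, assuming it’s installed and that ssdpool/bench is a scratch dataset (hypothetical name) you’re allowed to hammer:

cr0x@server:~$ fio --name=steadystate --filename=/ssdpool/bench/fio.test --rw=randwrite \
    --bs=8k --size=20G --iodepth=16 --numjobs=4 --ioengine=libaio \
    --time_based --runtime=1800 --group_reporting

Compare IOPS and completion-latency percentiles from the first few minutes against the last few: a healthy setup degrades gently, a GC-bound one falls off a ledge. Delete the test file afterwards so the space (and TRIM) goes back to the drives.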

3) Sync writes are latency’s best friend (unfortunately)

If your workload requires sync writes, ZFS must confirm durability on stable storage before acknowledging.
Without a separate log device (SLOG) that can take low-latency writes safely, your pool devices take the hit.
If those pool devices are simultaneously busy with GC, your p99 becomes your personality.

4) Fragmentation isn’t just “files are scattered”

On ZFS, fragmentation is about free space segmentation and block allocation patterns. On SSDs, it couples with the FTL’s own mapping.
High fragmentation makes it harder to write large contiguous segments; the SSD ends up doing more internal copying during GC.
The result is classic: “but it’s an SSD, why is sequential write only 150 MB/s?”

5) Compression can be an SSD tuning knob

Compression isn’t just for saving space. If you compress, you write fewer bytes, which can reduce write amplification.
On modern CPUs, LZ4 is frequently a net win for throughput and latency. The exception is already-compressed data or when CPU is the bottleneck.
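
Whether compression is actually paying off is a one-command check. A quick sketch against the ssdpool/app dataset used later in this article:

cr0x@server:~$ zfs get compressratio,compression,logicalused,used ssdpool/app

A compressratio meaningfully above 1.00x means fewer bytes are hitting NAND for the same logical writes; a ratio stuck at 1.00x on already-compressed data means you’re only spending CPU, and compression=off for that dataset becomes a reasonable, measured choice.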

One paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time—design so you can survive it.”
ZFS on SSD is exactly that: assume latency cliffs exist and engineer around them.

Fast diagnosis playbook (first/second/third)

First: confirm the symptom is storage latency, not CPU, not network

  • Check application latency breakdowns (db commit time, fsync time, VM guest disk latency).
  • Check system load vs iowait: high load with low CPU usage often means blocked on IO.
  • Confirm ZFS is the device being waited on, not a downstream network mount or a single saturated NIC queue.

Second: determine whether the pain is sync, fragmentation, or device steady-state

  • If stalls align with fsync/commit: suspect sync write path (SLOG, sync settings, PLP).
  • If performance degrades as pool fills: suspect free space + fragmentation and SSD GC pressure.
  • If “random writes collapse after minutes”: suspect SSD steady-state and insufficient overprovisioning/TRIM.

Third: isolate which layer is collapsing

  • ZFS layer: TXG timing, dirty data limits, recordsize/volblocksize mismatch, special vdev saturation.
  • Device layer: firmware GC, thermal throttling, media errors, internal queue saturation.
  • Topology layer: too-wide RAIDZ on SSDs for your IOPS profile, or single-vdev bottleneck.

If you take only one operational habit from this: during an incident, collect evidence first.
Tuning in the dark is how you end up with a “performance fix” that becomes a data-loss postmortem.
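
In that spirit, a throwaway capture loop pasted during the incident beats reconstructing events from memory afterwards. A minimal sketch, assuming the ssdpool pool and one NVMe device; paths and intervals are placeholders:

cr0x@server:~$ while true; do
>   date >> /var/tmp/ssd-incident.log
>   zpool iostat -l ssdpool 1 2 >> /var/tmp/ssd-incident.log
>   zpool list -H -o capacity,fragmentation ssdpool >> /var/tmp/ssd-incident.log
>   sudo nvme smart-log /dev/nvme0n1 | grep -i temperature >> /var/tmp/ssd-incident.log
>   sleep 10
> done

Stop it with Ctrl-C when the spike passes; the resulting log lines up latency, capacity, fragmentation, and temperature on one timeline, which is most of what you need to tell GC pressure from thermal throttling.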

Practical tasks: commands, outputs, decisions (12+)

Task 1: Check pool health and topology (you’re not tuning a broken system)

cr0x@server:~$ zpool status -v ssdpool
  pool: ssdpool
 state: ONLINE
  scan: scrub repaired 0B in 00:12:19 with 0 errors on Sun Dec 22 03:10:11 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        ssdpool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0HBLR-00000  ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0HBLR-00001  ONLINE       0     0     0

errors: No known data errors

What it means: If you’re degraded or resilvering, performance data is contaminated. Mirrors behave differently than RAIDZ.

Decision: Don’t tune until the pool is stable and scrubs are clean. If topology is RAIDZ and workload is sync-random write heavy, reconsider design.

Task 2: Watch real-time latency and queueing per vdev

cr0x@server:~$ zpool iostat -v ssdpool 1
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
ssdpool                      620G   312G    120   1800  12.3M   210M
  mirror-0                   620G   312G    120   1800  12.3M   210M
    nvme0n1                     -      -     60    900  6.1M   105M
    nvme1n1                     -      -     60    900  6.2M   105M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: This shows rate, not latency directly. If ops drop while application latency rises, the device is stalling.

Decision: If one vdev is underperforming or uneven, suspect device firmware/thermal issues or a misbehaving path.

Task 3: Check detailed per-device latency (OpenZFS on Linux)

cr0x@server:~$ zpool iostat -v -l ssdpool 1
(the -l flag adds latency columns per vdev and per device: total_wait, disk_wait, syncq_wait and asyncq_wait, each split into read and write)

What it means: disk_wait is time spent at the device; syncq_wait and asyncq_wait are time spent queued inside ZFS. If disk_wait dominates during a stall, the SSD itself (GC, throttling, firmware) is the problem. If the queue waits dominate, ZFS is backed up behind a slow or saturated vdev.

Decision: Queue tunables exist (the zfs_vdev_*_active parameters under /sys/module/zfs/parameters), but the defaults are usually fine unless you have a very specific contention pattern. Don’t start by twiddling queue internals. First confirm you’re not causing GC collapse via space, sync, and block sizing.

Task 4: Confirm autotrim state and enable it if appropriate

cr0x@server:~$ zpool get autotrim ssdpool
NAME     PROPERTY  VALUE     SOURCE
ssdpool  autotrim  off       default

What it means: With autotrim off, freed blocks may not be promptly communicated to SSDs. Some drives cope; others slowly decay into steady-state sadness.

Decision: For SSD-only pools, enable autotrim unless you have a known drive/firmware issue with queued TRIM.

cr0x@server:~$ sudo zpool set autotrim=on ssdpool
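
If you’d rather not rely on continuous TRIM for a particular drive or firmware, you can also trim on demand and watch progress; this doubles as a check that the devices accept TRIM at all. A sketch against the same pool:

cr0x@server:~$ sudo zpool trim ssdpool
cr0x@server:~$ zpool status -t ssdpool

With -t, zpool status shows per-device TRIM state (percent done, completed, or unsupported), which makes a scheduled weekly trim easy to verify.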

Task 5: Verify discard/TRIM is actually reaching the device (Linux)

cr0x@server:~$ lsblk -D
NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
nvme0n1            0      4K       2G         0
nvme1n1            0      4K       2G         0

What it means: DISC-GRAN shows discard granularity. If it’s 0B, the device/stack may not support discard.

Decision: If discard isn’t supported, autotrim won’t help. Plan for more overprovisioning and keep pool utilization lower.

Task 6: Check pool utilization and fragmentation

cr0x@server:~$ zpool list -o name,size,allocated,free,capacity,fragmentation,ashift,health ssdpool
NAME     SIZE  ALLOC  FREE  CAP  FRAG  ASHIFT  HEALTH
ssdpool  932G   620G  312G   66%   18%      12  ONLINE

What it means: CAP and FRAG correlate with allocator pain and GC pressure. On SSD-only pools, problems often start showing up as you approach high CAP.

Decision: Treat 80%+ as “danger zone” for mixed random writes. If you must run hot, use drives with serious OP and design for it.

Task 7: Identify datasets with bad recordsize for the workload

cr0x@server:~$ zfs get -r recordsize,compression,atime ssdpool/app
NAME          PROPERTY     VALUE     SOURCE
ssdpool/app   recordsize   128K      default
ssdpool/app   compression  lz4       local
ssdpool/app   atime        off       local

What it means: 128K recordsize is fine for streaming IO and many general workloads. It can be awful for small random updates (databases) by increasing write amplification.

Decision: For DB datafiles or VM images, consider 16K or 8K recordsize (datasets) or appropriate volblocksize (zvols), then benchmark and observe.
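
As a concrete example of that decision, a database dataset might be created like this. A sketch; the 16K value assumes a 16K database page size (for example InnoDB), so match it to your engine:

cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 -o atime=off \
    -o logbias=latency ssdpool/db

Note that changing recordsize on an existing dataset only affects files created afterwards; existing files keep their old block layout until the data is copied or rewritten.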

Task 8: For zvols, check volblocksize and alignment expectations

cr0x@server:~$ zfs get volblocksize,compression,logbias,sync ssdpool/vmstore
NAME             PROPERTY      VALUE     SOURCE
ssdpool/vmstore  volblocksize  8K        local
ssdpool/vmstore  compression   lz4       inherited from ssdpool
ssdpool/vmstore  logbias       latency   default
ssdpool/vmstore  sync          standard  default

What it means: volblocksize is fixed at creation. An 8K zvol is often sensible for VM random IO, but it can increase metadata and IO overhead depending on workload.

Decision: Match volblocksize to guest filesystem/DB page size and workload. If you got it wrong, you recreate the zvol. There is no magical toggle later.
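
If you do need a different volblocksize, the path is a new zvol plus a block-level copy. A hedged sketch with placeholder names and sizes; make sure the guest is shut down before copying:

cr0x@server:~$ sudo zfs create -s -V 200G -o volblocksize=16K ssdpool/vmstore-new
cr0x@server:~$ sudo dd if=/dev/zvol/ssdpool/vmstore of=/dev/zvol/ssdpool/vmstore-new bs=1M status=progress

A zfs send/receive of the zvol generally preserves the original volblocksize, which is exactly the thing you’re trying to change, hence the plain block copy into a freshly created zvol.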

Task 9: Check whether sync writes are dominating

cr0x@server:~$ sudo zpool iostat -v ssdpool 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
ssdpool                      620G   312G    100   2400  10.1M   140M
  mirror-0                   620G   312G    100   2400  10.1M   140M
    nvme0n1                     -      -     50   1200  5.0M    70M
    nvme1n1                     -      -     50   1200  5.1M    70M
--------------------------  -----  -----  -----  -----  -----  -----

cr0x@server:~$ sudo grep -E 'zil_commit|metaslab' /proc/spl/kstat/zfs/zil
zil_commit_count                 98122
zil_commit_writer_count          98050
zil_itx_metaslab_normal_count    181230
zil_itx_metaslab_normal_bytes    1489731584
zil_itx_metaslab_slog_count      0
zil_itx_metaslab_slog_bytes      0

What it means: A busy ZIL (ZFS Intent Log) indicates lots of synchronous semantics. That’s normal for some workloads. The metaslab_normal counters are ZIL blocks written to the main pool devices; the metaslab_slog counters are ZIL blocks written to a dedicated log device.

Decision: If you need sync durability, consider a proper SLOG device with PLP. If you don’t need it, fix the application mount/export settings rather than lying to ZFS.

Task 10: Verify SLOG presence and whether it’s actually helping

cr0x@server:~$ zpool status ssdpool
  pool: ssdpool
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        ssdpool       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nvme0n1   ONLINE       0     0     0
            nvme1n1   ONLINE       0     0     0
        logs
          nvme2n1     ONLINE       0     0     0

What it means: A log device exists. That does not mean it’s safe or fast. If it lacks PLP, a power loss can turn “fast sync” into “creative corruption.”

Decision: Only use SLOG devices designed for power-loss protection. If you can’t guarantee that, accept higher sync latency or re-architect.
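
Adding a mirrored SLOG is a one-liner once the right devices exist. A sketch with placeholder device IDs; both devices should have power-loss protection:

cr0x@server:~$ sudo zpool add ssdpool log mirror \
    /dev/disk/by-id/nvme-PLP_MODEL_serialA /dev/disk/by-id/nvme-PLP_MODEL_serialB

Afterwards, re-run the quick sync canary from Task 14 and check the zil_itx_metaslab_slog counters from Task 9: if they aren’t climbing while the normal counters are, your sync writes still aren’t landing on the log device.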

Task 11: Inspect dirty data behavior (TXG pressure)

cr0x@server:~$ grep dirty /proc/spl/kstat/zfs/dmu_tx
dmu_tx_dirty_throttle              0
dmu_tx_dirty_delay                 5321
dmu_tx_dirty_over_max              3
cr0x@server:~$ grep memory_throttle_count /proc/spl/kstat/zfs/arcstats
memory_throttle_count              12

What it means: dmu_tx_dirty_delay counts writes that were deliberately slowed because dirty data crossed the delay threshold; dmu_tx_dirty_over_max and a climbing memory_throttle_count mean ZFS is forcing writers to wait because dirty data is piling up faster than TXGs can sync.
That often correlates with storage not keeping up (or periodic stalls). Current per-TXG dirty bytes show up as ndirty in the per-pool txgs kstat mentioned earlier.

Decision: If TXG sync is slow because devices stall, fix device/GC causes first. Tuning dirty limits can mask symptoms but won’t cure the disease.

Task 12: Check for thermal throttling on NVMe (the silent performance killer)

cr0x@server:~$ sudo nvme smart-log /dev/nvme0n1 | egrep 'temperature|warning|critical'
temperature                         : 77 C
warning_temp_time                   : 143
critical_comp_time                  : 0

What it means: warning_temp_time counts the minutes the controller has spent above its warning temperature threshold. Throttling often looks like “random GC collapse.”

Decision: Fix airflow, heatsinks, chassis layout. If you’re tuning ZFS to compensate for heat, you’re writing a fan policy with extra steps.

Task 13: Check NAND wear and spare blocks (a proxy for steady-state behavior)

cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep 'Percentage Used|Data Units Written|Available Spare'
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Written:                 18,442,118

What it means: High “Percentage Used” doesn’t directly cause GC collapse, but worn drives can behave differently, and low spare can reduce performance headroom.

Decision: Track these over time. If you see performance instability and wear is high, replace before you’re debugging firmware on a Friday night.

Task 14: Measure sync latency directly (quick and dirty)

cr0x@server:~$ sudo zfs create -o sync=standard -o compression=off ssdpool/testsync
cr0x@server:~$ cd /ssdpool/testsync
cr0x@server:~$ /usr/bin/time -f "%e seconds" bash -c 'for i in {1..5000}; do dd if=/dev/zero of=f bs=4k count=1 conv=fsync oflag=dsync status=none; done'
12.84 seconds

What it means: This isn’t a benchmark suite; it’s a canary. If this suddenly becomes 60+ seconds under load, your sync path is unhealthy.

Decision: If sync latency is unacceptable, either add a correct SLOG, reduce sync requirements at the application layer (only if safe), or move the workload.
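
If you want latency percentiles rather than a single wall-clock number, fio can run roughly the same canary. A sketch against the same test dataset, assuming fio is installed:

cr0x@server:~$ fio --name=synccanary --directory=/ssdpool/testsync --rw=write --bs=4k \
    --size=256m --fsync=1 --time_based --runtime=60 --ioengine=psync

Look at the fsync latency percentiles in the output: roughly sub-millisecond p99 is what a healthy PLP SLOG looks like, while multi-millisecond p99 on NVMe usually means sync writes are landing on busy main pool devices and colliding with GC.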

Tuning knobs that matter (and the ones that mostly don’t)

Keep free space like you mean it

SSD-only pools are happiest with slack space. ZFS allocator needs room, and the SSD needs room.
If your pool is routinely above 80% and you do random writes, you’re asking for GC collapse.
Mirrors tolerate this better than RAIDZ, but don’t get cocky.

What to do:

  • Target 60–75% utilization for mixed random write workloads you care about.
  • Use larger drives than you “need” and treat the extra as performance reserve.
  • Consider explicit overprovisioning (leave unpartitioned space) if your SSD firmware benefits from it, or reserve slack inside ZFS as sketched below.
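
One low-tech way to reserve slack inside ZFS is a dedicated empty dataset holding a reservation, so nobody “borrows” the headroom by accident. A sketch with a placeholder size:

cr0x@server:~$ sudo zfs create -o refreservation=100G -o canmount=off ssdpool/slack

Releasing that headroom in an emergency then becomes a deliberate, visible act (shrinking or destroying ssdpool/slack) instead of a slow drift past your utilization policy.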

Autotrim: enable it, then validate it’s not harming you

On modern stacks, autotrim is usually worth it for SSD-only pools. It helps maintain steady-state performance by informing the device what’s free.
But TRIM isn’t free: it adds IO and can interact poorly with certain consumer drives or buggy firmware.

Operational stance:

  • Enable autotrim and watch latency during heavy delete churn.
  • If you see TRIM-induced spikes, consider scheduled trim windows (where supported) or different SSDs.

Recordsize/volblocksize: stop pretending one size fits all

Default recordsize (128K) is fine for a lot of file workloads. For databases doing 8K pages with frequent updates,
128K can inflate write amplification: ZFS rewrites larger logical records and touches more metadata.

Opinionated guidance:

  • Databases on datasets: try recordsize 16K (sometimes 8K), compression lz4, atime off.
  • VM images on datasets: often recordsize 64K or 128K is okay if IO is somewhat sequential; if it’s random, smaller may help.
  • VM images on zvols: set volblocksize to 8K or 16K based on guest IO, and accept you must recreate to change.

Compression: LZ4 is the default for a reason

Compression reduces bytes written. Fewer bytes written means less NAND churn, less GC pressure, and sometimes faster throughput.
With SSD-only pools, compression is often a performance feature disguised as a capacity feature.

Use compression=lz4 broadly unless you have a measured reason not to. If CPU is the bottleneck, you’ll see it quickly.

Sync behavior: the fastest safe setting is “standard”

There are three common choices people argue about:

  • sync=standard: ZFS honors application requests. This is the default and usually correct.
  • sync=disabled: ZFS lies about sync. Performance improves. So does your risk profile.
  • sync=always: forces sync for everything, often used for testing or specific compliance requirements, and it will expose weak SLOG/pool devices quickly.

If you’re running databases, hypervisors, or NFS for serious applications: do not use sync=disabled as a “tuning” step.
That’s not tuning. That’s changing the durability contract. Sometimes it’s acceptable in ephemeral environments; in general production, it’s how you end up explaining to finance what “acknowledged but not durable” means.

SLOG: only when you need it, and only when it’s the right device

A SLOG helps sync write latency when your workload issues lots of sync writes and your main pool latency is the bottleneck.
It doesn’t accelerate normal async writes. It doesn’t fix a pool that’s full and fragmented. And it absolutely must be power-loss safe.

SLOG design rules that survive real life:

  • PLP is mandatory for SLOG in any environment where power loss is possible (so: all of them).
  • Mirror it if losing it would be operationally painful; a lost SLOG won’t lose main data, but it can cost downtime and stress.
  • Don’t oversize it: you need enough for bursts; enormous SLOGs don’t buy you much.

Ashift: get it right at creation or pay forever

ashift sets the allocation block size exponent (2^ashift bytes) for a vdev and is fixed when the vdev is created. On SSDs with 4K physical sectors, ashift=12 is typical.
Wrong ashift causes read-modify-write cycles and unnecessary write amplification. It’s one of those mistakes that feels small until you graph latency.
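
Verifying ashift takes seconds; setting it correctly only happens at vdev creation. A sketch with placeholder device IDs:

cr0x@server:~$ zpool get ashift ssdpool
cr0x@server:~$ sudo zpool create -o ashift=12 newpool mirror \
    /dev/disk/by-id/nvme-MODEL_serialA /dev/disk/by-id/nvme-MODEL_serialB

Note that zpool get ashift can report 0 when the pool was created without an explicit value, meaning ZFS auto-detected the sector size; zdb -C ssdpool | grep ashift shows what each vdev actually uses.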

Special vdevs: fantastic tool, sharp edges

Putting metadata and small blocks on a special vdev can reduce IO amplification and improve latency. It can also become a single point of pool failure if not redundant,
and it can run hot if sized too small. People love special vdevs because the performance chart goes up immediately. They hate them when it’s rebuild time.

If you deploy special vdevs (a sketch of adding one follows this list):

  • Mirror them. Treat them like critical infrastructure, because they are.
  • Size for the long term, especially if you’re storing small blocks there.
  • Monitor their wear separately. They can wear faster than the main pool devices.
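
A hedged sketch of adding one the careful way, with placeholder device IDs; the 16K threshold is an example to adapt, not a universal recommendation:

cr0x@server:~$ sudo zpool add ssdpool special mirror \
    /dev/disk/by-id/nvme-FAST_serialA /dev/disk/by-id/nvme-FAST_serialB
cr0x@server:~$ sudo zfs set special_small_blocks=16K ssdpool/app

special_small_blocks only affects blocks written after the change, and everything at or below the threshold lands on the special vdev, so size the device against your real small-block footprint before turning it on.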

What not to obsess over first

People love tuning what they can change quickly: arc sizes, obscure module parameters, IO schedulers.
Sometimes those matter. Most of the time, GC collapse is caused by space pressure, sync semantics, block sizing mismatch,
drive class mismatch, or thermal throttling. Fix the big levers before you do interpretive dance with kernel tunables.

Joke #2: If you fix SSD GC collapse by changing twelve sysctls, congratulations—you’ve invented a new outage mode with a nicer name.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “NVMe means sync is free”

A mid-sized SaaS company migrated from a mixed HDD/SSD setup to an all-NVMe ZFS pool. The plan was simple:
“Faster disks, same architecture, everyone wins.” Their primary database ran on ZVOLs presented to VMs.
The migration was a weekend project. It mostly worked. Then Monday arrived.

Latency spikes hit every few minutes. Application nodes timed out on transactions. Engineers saw NVMe devices with huge advertised IOPS and assumed
the database couldn’t possibly be waiting on storage. They chased network issues. They chased lock contention. They scaled application pods.
It got worse, because the database did more retries, which did more sync writes.

The wrong assumption was that NVMe latency is always low, and that ZFS on NVMe behaves like “raw NVMe but with checksums.”
In reality the workload was sync-heavy, the pool was filling fast, and the consumer-grade NVMe devices had mediocre steady-state random write performance.
GC was doing long internal copy operations during TXG flush bursts.

The fix wasn’t exotic: move the database log to a proper PLP SLOG device, keep pool utilization under control, and align volblocksize with the database page size.
They also improved cooling; the devices were flirting with throttling during peak traffic.
The incident ended not with a heroic kernel patch, but with a procurement request and a capacity plan.

Optimization that backfired: “Let’s disable sync for speed”

Another company had a build farm running on NFS exports from a ZFS SSD-only pool. Build times mattered, and developers had learned to complain loudly.
Someone noticed that NFS clients were doing lots of synchronous writes. Someone else noticed a blog post suggesting sync=disabled.
You can guess what happened next.

Build times improved immediately. Charts looked great. The team declared victory and moved on.
Weeks later, a power event hit a rack. UPS coverage was partial and the storage node restarted. The pool imported cleanly.
The filesystem mounted. Everyone relaxed. Then the build cache started returning corrupted artifacts in subtle ways: wrong object files, mismatched checksums,
intermittent test failures that looked like flaky code.

The postmortem was uncomfortable because nothing “crashed” in an obvious way. The system did exactly what it was told:
acknowledge writes without waiting for stable storage. That’s what sync=disabled means. They had traded durability for speed without documenting it,
and without scoping the blast radius.

The fix was boring and slightly humbling: restore sync=standard, add a correct SLOG device sized for the sync workload,
and tune the NFS export/client mount options to reduce unnecessary sync semantics for the specific cache paths that could tolerate it.
They also added a policy: performance changes that alter durability guarantees require sign-off and a rollback plan.

Boring but correct practice that saved the day: “We kept 30% free and watched wear”

A financial services team ran an analytics platform on ZFS over SSD mirrors. The workload was ugly: bursts of ingest, heavy compaction,
lots of deletes, and periodic rebuild of derived datasets. It was the kind of IO profile that turns optimistic storage designs into incident generators.
Their secret weapon was not a clever trick. It was discipline.

They had a hard rule: do not exceed a utilization threshold on the pool. When the business asked for “just a little more,” they bought capacity.
They also tracked drive temperature, wear indicators, and weekly trim behavior. Not because they loved dashboards, but because it let them spot
“this is getting worse” before it became “this is down.”

One quarter, ingest volume jumped. Latency started creeping up during nightly jobs. The on-call saw it early: fragmentation increased, trims took longer,
and the pool’s free space dropped below their comfort line. They expanded the pool and rebalanced workloads before the next week’s ingest.
No outage. No drama. Just a ticket closed with a note: “Capacity headroom restored.”

The lesson is annoyingly unsexy: the best GC-collapse mitigation is not being tight on space, plus monitoring that tells you when you’re drifting toward the cliff.
Most reliability wins look like “nothing happened,” which is why people keep trying to replace them with excitement.

Common mistakes: symptoms → root cause → fix

1) Symptom: periodic multi-second stalls, especially on database commits

  • Root cause: sync-heavy workload hitting pool devices (no SLOG), compounded by SSD GC during TXG flushes.
  • Fix: add PLP SLOG (and mirror if needed), keep utilization lower, confirm cooling, tune recordsize/volblocksize for smaller updates.

2) Symptom: performance great after deployment, then slowly degrades over weeks

  • Root cause: no TRIM/autotrim, high delete churn, SSDs losing free-page knowledge and hitting steady-state write amplification.
  • Fix: enable autotrim, validate discard support, consider scheduled trims, leave more free space / OP, consider different drive class.

3) Symptom: random write IOPS collapse when pool hits ~85–90% usage

  • Root cause: allocator and FTL both starved for clean space; fragmentation increases; GC has fewer options.
  • Fix: expand capacity, migrate cold data out, enforce quotas/reservations, set a policy threshold and alert before it’s critical.

4) Symptom: one NVMe device shows worse performance, pool looks “lumpy”

  • Root cause: thermal throttling, firmware differences, PCIe link issues, or a drive entering a different GC regime.
  • Fix: check NVMe SMART temperatures, verify PCIe link speed/width, update firmware in a controlled window, improve airflow.

5) Symptom: tuning “worked,” then a reboot made it worse

  • Root cause: changes were applied ad hoc (module params, sysctls) without config management; rollback wasn’t real.
  • Fix: codify settings, use change control, record baseline metrics, and test under load with the same kernel/module versions.

6) Symptom: special vdev wears rapidly and becomes the bottleneck

  • Root cause: special_small_blocks set too high, special vdev too small, metadata/small-block hot set concentrated there.
  • Fix: mirror and size special vdev appropriately, reconsider special_small_blocks threshold, monitor wear and bandwidth separately.

7) Symptom: “We disabled sync and it’s fast,” followed by mysterious corruption after power loss

  • Root cause: durability contract changed; acknowledged writes weren’t actually durable.
  • Fix: revert to sync=standard, use SLOG for legitimate sync acceleration, and document where async durability is acceptable (if anywhere).

Checklists / step-by-step plan

Step-by-step: stabilize an SSD-only ZFS pool showing GC-collapse symptoms

  1. Capture evidence during the spike. Run zpool iostat -v 1, check NVMe temps, and note pool utilization and frag.
  2. Confirm sync pressure. Check ZIL activity and whether the workload is truly sync-heavy (db commits, NFS semantics, hypervisor settings).
  3. Check TRIM/autotrim. Enable autotrim if it’s off, verify discard support, and watch for TRIM-induced latency in delete-heavy periods.
  4. Get utilization under control. Free space, expand, or move data. If you’re above ~80% with random writes, treat that as the primary bug.
  5. Fix thermal and firmware basics. NVMe throttling looks like storage “mood swings.” Don’t ignore it.
  6. Right-size blocks. Adjust dataset recordsize (or recreate zvol with correct volblocksize). Do not cargo-cult 128K everywhere.
  7. Use compression. Prefer lz4 unless you measured a reason not to.
  8. If you need sync performance, add a correct SLOG. PLP, ideally mirrored. Then measure again.
  9. Re-test under realistic load. A synthetic sequential benchmark is not evidence for a database workload.
  10. Set alerts. CAP %, FRAG %, NVMe temps, write latency, and ZFS throttle counters (a minimal check script is sketched after this list). Problems are easier to prevent than to heroically solve.
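
For the alerting step, even a tiny cron-able check beats nothing. A minimal sketch; thresholds, pool name, and path are placeholders, and the exit code is meant to feed whatever already pages you:

cr0x@server:~$ cat /usr/local/bin/check-ssdpool.sh
#!/bin/bash
# Warn when the pool drifts toward the GC-collapse zone.
POOL=ssdpool
CAP_LIMIT=75     # percent allocated
FRAG_LIMIT=40    # percent fragmentation
read -r cap frag < <(zpool list -H -o capacity,fragmentation "$POOL")
cap=${cap%\%}
frag=${frag%\%}
if (( cap >= CAP_LIMIT || frag >= FRAG_LIMIT )); then
  echo "WARNING: $POOL capacity=${cap}% fragmentation=${frag}%"
  exit 1
fi
exit 0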

Build checklist: designing an SSD-only ZFS pool that won’t embarrass you

  • Topology: Mirrors for latency and IOPS-heavy workloads; be cautious with wide RAIDZ for random writes.
  • Drives: Prefer enterprise SSDs with predictable steady-state and PLP where needed.
  • Space headroom: Plan capacity so you can keep healthy free space without begging for budget monthly.
  • ashift: Confirm 4K alignment and correct ashift at pool creation.
  • autotrim: Enable and validate with your drive model/firmware.
  • Workload mapping: Set recordsize/volblocksize per dataset/zvol, not per pool “vibe.”
  • Monitoring: Latency percentiles, temperatures, wear indicators, ZFS throttle counters, and utilization trends.
  • Change control: Any change that alters durability (sync, write cache settings) needs review and rollback.

FAQ

1) Is garbage collection collapse a “bad SSD” problem or a “bad ZFS” problem?

Usually neither. It’s an interaction problem: a CoW filesystem plus SSD firmware under space pressure and bursty writes.
Better SSDs (more OP, better firmware) widen the safe operating area. Better ZFS design (space headroom, correct block sizes, sane sync path) prevents the cliff.

2) Should I always enable autotrim on SSD-only pools?

In most modern environments, yes. Then verify discard support and watch for any regression under delete-heavy churn.
If your specific drive model behaves badly with continuous TRIM, you may need scheduled trim or different hardware.

3) Does adding more RAM (ARC) fix SSD stalls?

It can mask them by absorbing bursts, but it doesn’t change SSD steady-state behavior.
If the pool can’t flush fast enough because the device stalls, you’ll still hit throttling eventually.

4) When is a SLOG worth it?

When you have a measurable sync write bottleneck: databases committing frequently, NFS with sync semantics, VM workloads doing fsync-heavy patterns.
A SLOG is not a general write cache. And without PLP it’s not an optimization; it’s a reliability bug.

5) Can I fix volblocksize after creating a zvol?

No. You recreate the zvol and migrate data. Plan volblocksize upfront, and write it down so future-you doesn’t “optimize” it into chaos.

6) Does compression really help on fast NVMe?

Often yes. Less data written means less NAND work and less GC pressure. LZ4 is typically cheap enough that it improves throughput.
Measure for your workload, especially if you’re CPU-bound or storing incompressible data.

7) Why do things get worse as the pool fills even if the SSD has spare capacity?

Because “spare capacity” on SSDs isn’t the same as “free space” in ZFS. ZFS needs usable free space inside its metaslabs; the SSD’s FTL needs free erase blocks.
High pool utilization and fragmentation reduce flexibility for both layers.

8) Is RAIDZ bad for SSD pools?

Not universally, but it’s less forgiving for small random writes and sync-heavy workloads. Mirrors tend to deliver better latency and IOPS consistency.
If your workload is mostly sequential reads/writes and capacity matters, RAIDZ can be fine—if you keep utilization sane.

9) Is sync=disabled ever acceptable?

Sometimes, for truly ephemeral data where you explicitly accept losing recent writes on a crash or power loss.
It should be a conscious policy decision, documented and scoped, not a performance “tweak.”

10) How do I know if my problem is GC or just thermal throttling?

Check NVMe SMART logs for temperature and warning time, correlate spikes with heat, and verify consistent PCIe link behavior.
Thermal throttling often produces repeatable “after X minutes of load” degradation that recovers when cooled.

Conclusion: practical next steps

GC collapse on SSD-only ZFS pools is preventable. Not with magic. With fundamentals:
keep free space, match block sizes to workloads, respect sync semantics, and don’t cheap out on the devices that define durability.

Next steps that pay off this week:

  1. Enable and verify autotrim (and confirm discard support).
  2. Set a utilization policy and alert before you cross it.
  3. Audit recordsize/volblocksize per dataset/zvol that matters.
  4. If you’re sync-heavy, implement a proper PLP SLOG and measure again.
  5. Check NVMe temperatures under peak load and fix airflow before you “tune.”
  6. Write down your durability contract (sync settings, application expectations) so performance work doesn’t become a data-loss bet.

Storage is the part of the system that remembers your mistakes. Tune like it’s going to testify later.
