ZFS is a system that politely lies to you—briefly, and for your own good. Most writes land in RAM first, then ZFS turns them into on-disk reality in big, efficient batches. That batching is a feature: it’s how ZFS delivers strong consistency and good throughput without turning every small write into a small disaster.
The problem is that “big efficient batches” can look like bursts: your disks go quiet, then suddenly light up; latency graphs look like heartbeats; databases complain every few seconds; and someone asks why the storage “stutters.” A common suspect is txg_timeout: the timer that nudges ZFS to close and sync a transaction group (TXG). This article explains what’s actually happening, how to prove it in production, and how to smooth latency without trading away safety.
The mental model: TXGs, not “writes”
If you remember one thing, remember this: ZFS isn’t doing “a write” when an application calls write(2). ZFS is building a new version of the filesystem in memory and then committing that version to disk. Those commits happen in transaction groups (TXGs).
A TXG is a batch window. During that window, ZFS collects modified metadata and data buffers (dirty data). When the TXG closes, ZFS starts syncing it: allocating blocks, writing data and metadata, updating the MOS (Meta Object Set), and ultimately writing uberblocks that make the new state durable. The key: closing a TXG is cheap; syncing it is the real work.
Three TXGs at a time
ZFS typically has three TXGs in flight:
- Open TXG: accepting new changes (dirty data accumulating).
- Quiescing TXG: closing out, freezing the set of changes.
- Syncing TXG: writing that frozen set to disk.
This pipeline is why ZFS can keep taking writes while it’s syncing the previous batch. It’s also why you can get periodic bursts: the syncing phase is when storage sees real work, and it often lands in a concentrated interval.
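If you want to watch that pipeline instead of taking it on faith, Linux OpenZFS builds generally expose a per-pool txgs kstat. A minimal sketch, assuming the kstat lives at this path on your version and using the pool name tank from the examples later in this article:
cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs
Each row is one recent TXG (pipe through tail if you only want the latest few). On the builds I’ve seen, a state column walks through open/quiescing/syncing/committed, and the per-phase timing columns make the point above concrete: closing is quick, syncing is where the milliseconds go.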
Joke #1: ZFS doesn’t lose your data, it just “temporarily misplaces” it in RAM—like you do with your keys, but with checksums.
What txg_timeout really does (and doesn’t)
txg_timeout is commonly described as “how often ZFS flushes.” That’s close enough to be useful, and wrong enough to ruin a weekend.
In most OpenZFS implementations, txg_timeout is the maximum time (in seconds) that a TXG is allowed to stay open before ZFS forces it to quiesce and start syncing. The default is often 5 seconds. That default is not random: it’s a compromise between amortizing metadata costs (batching) and limiting the amount of work per commit (latency).
What changing txg_timeout actually changes
- Lowering it tends to create more frequent, smaller TXGs. This can reduce worst-case “commit burst” size but increases overhead (more commits, more metadata churn, potentially lower throughput).
- Raising it tends to create fewer, larger TXGs. This can improve throughput on streaming writes but can amplify periodic latency spikes (and can increase the amount of dirty data in memory).
What txg_timeout does NOT fix
It does not magically turn random sync writes into smooth low-latency I/O if your underlying device can’t keep up, your SLOG is mis-sized, your pool is fragmented, or your workload is forcing synchronous semantics. TXG timing is the drummer; your disks are still the band.
Why the “every 5 seconds” pattern appears
If you see latency spikes at a regular cadence—often near 5 seconds—your first hypothesis should be: “TXG commit bursts.” txg_timeout is a clock that can make that cadence visible. But bursts can also be driven by dirty data limits (write throttling), sync write pressure, and device cache behavior.
Why writes come in bursts
In production, “bursty writes” are rarely a single root cause. They’re usually several sane mechanisms aligning their peaks. Here are the common ones.
1) Batching is the point
ZFS is copy-on-write. For many workloads, the efficient path is: write new blocks elsewhere, then atomically switch pointers. If ZFS had to do that pointer choreography for every 4 KB write, you’d get correctness and misery. TXGs allow ZFS to do the choreography once per batch.
2) Dirty data limits and throttling create “cliffs”
ZFS allows a certain amount of dirty data in RAM. When you hit the dirty data threshold, ZFS starts throttling writers to prevent unbounded memory growth and to force syncing to catch up. That “throttle event” often looks like a cliff: latency is fine until it isn’t.
3) Sync writes can force immediate intent-log activity
Synchronous writes (O_SYNC, fsync(), database WAL settings, NFS sync semantics) don’t wait for the TXG to sync to the main pool. Instead, they wait for the ZIL (ZFS Intent Log) to record the intent. On a pool without a dedicated SLOG, that log is on the same disks as everything else, and the log writes can serialize or contend in unpleasant ways.
4) Small blocks + metadata amplification
A workload writing many small random blocks doesn’t just write data. It generates metadata updates: block pointers, indirect blocks, spacemap updates, checksums, and so on. ZFS is good at this, but it still has to do it. In a TXG commit, metadata I/O can become a short, intense storm.
5) Underlying device behavior is bursty too
SSDs have internal garbage collection; HDD arrays have write cache flushes; RAID controllers reorder I/O; NVMe firmware has its own queues and housekeeping. ZFS bursts can synchronize with device bursts and make the graph look like it’s breathing heavily.
6) ZFS is honest about backpressure
Some storage stacks hide contention by buffering endlessly until they fall over. ZFS tends to push back when it has to. That’s better engineering—and worse optics on a dashboard.
Facts & history worth knowing
- ZFS was built at Sun with enterprise storage in mind, where batching and atomic commits beat “write-through everything.”
- TXGs are older than the term “DevOps”; this style of transactional batching comes from classic filesystem and database ideas, not modern cloud marketing.
- The “5 second” default for TXG timing shows up in multiple ZFS lineages because it often balances latency and throughput on general-purpose systems.
- The ZIL is not a write cache. It’s a log of intent for synchronous semantics; most of it is thrown away after the TXG commits.
- SLOG devices only matter for sync writes. If your workload is async, a SLOG won’t help and can even become another failure domain to manage.
- Copy-on-write makes partial writes safer, but it shifts complexity into allocation and metadata updates, which are paid during TXG sync.
- ZFS checksums everything (metadata and data), which is fantastic for integrity and non-zero for CPU and memory traffic during heavy commits.
- OpenZFS diverged across platforms (Illumos, FreeBSD, Linux) and tuning knobs differ slightly; the concepts remain, but exact parameter names and defaults can vary.
- Latency spikes often correlate with uberblock writes because that’s the final “publish” step of a TXG; you can sometimes see the cadence in low-level traces.
Recognizing TXG-shaped latency
TXG-related pain has a particular smell:
- Regular cadence: spikes every ~5 seconds (or your txg_timeout value).
- Mostly write-side: read latency may stay fine until the system is saturated.
- “Everything pauses” moments: not a full stall, but applications report jitter—especially databases and VM hosts.
- CPU not pegged: you can be I/O-bound with plenty of idle CPU, or CPU-bound in checksum/compression in ways that don’t look like 100% user CPU.
- Dirty data oscillation: memory usage and dirty data counters rise, then drop sharply at commit.
Joke #2: If your graphs spike every five seconds, congratulations—you’ve discovered that time is a flat circle, and your storage is the circle.
Fast diagnosis playbook
This is the “walk into the outage bridge” sequence. It’s ordered to answer one question quickly: are we seeing TXG commit pressure, sync-write log pressure, or raw device saturation?
First: confirm the symptom is periodic and write-related
- Check latency cadence: does it align with ~5 seconds?
- Separate reads from writes: is write latency spiking while reads look normal?
- Check whether the workload is sync-heavy (databases, NFS, VM storage with barriers).
Second: determine if sync writes are the trigger
- Inspect ZIL/SLOG activity: are log writes dominating? (A quick kstat check follows this list.)
- Check dataset sync and logbias settings.
- Validate SLOG health and latency (if present).
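To answer the first question above (are log writes dominating?), one low-risk check is the global ZIL kstat. A sketch, assuming your OpenZFS version exposes it at this path with these counter names; they vary a little across releases:
cr0x@server:~$ grep -E 'zil_commit_count|zil_itx_metaslab_slog_bytes|zil_itx_metaslab_normal_bytes' /proc/spl/kstat/zfs/zil
Sample it twice, a few seconds apart. A fast-climbing commit count means the workload is fsync-heavy; the slog versus normal byte counters hint at whether those log writes are landing on a dedicated log device or on the main pool.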
Third: determine if TXG dirty data + throttling is the trigger
- Look at dirty data and TXG state counters (platform-specific, but observable).
- Check if writers are being throttled (high wait time, blocked in kernel I/O paths).
- Correlate with pool write bandwidth: are you hitting device limits during sync windows?
Fourth: confirm underlying device behavior
- Check per-vdev latency: one slow device can dictate the pool’s commit time.
- Check queue depth and saturation.
- Check error counters and retransmits; “slow” is sometimes “retrying.”
Practical tasks (commands + interpretation)
Commands below assume a Linux + OpenZFS environment. If you’re on FreeBSD/Illumos, you can translate the intent; the workflow still applies.
Task 1: Identify datasets and key properties (sync, logbias, recordsize)
cr0x@server:~$ sudo zfs list -o name,used,avail,recordsize,compression,sync,logbias -r tank
NAME USED AVAIL RECORDSIZE COMPRESS SYNC LOGBIAS
tank 8.21T 5.44T 128K lz4 standard latency
tank/vm 4.92T 5.44T 16K lz4 standard latency
tank/db 1.10T 5.44T 16K lz4 standard latency
tank/backups 2.04T 5.44T 1M lz4 disabled throughput
Interpretation: If your latency complaints come from tank/db or tank/vm, the small recordsize and logbias=latency suggest you care about sync behavior. If the dataset is set to sync=disabled, you may have “fixed” latency by deleting durability—useful for benchmarks, dangerous for jobs.
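If you discover sync=disabled somewhere it wasn’t a deliberate decision, the remedy is a scoped, reversible property change, made after confirming with the data owner. A minimal sketch using tank/backups from the listing above:
cr0x@server:~$ sudo zfs get -r -s local sync tank              # show only explicit overrides, not inherited defaults
cr0x@server:~$ sudo zfs set sync=standard tank/backups         # restore normal durability semantics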
Task 2: Watch pool-level I/O and spot periodic bursts
cr0x@server:~$ sudo zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 8.21T 5.44T 210 980 28.1M 210M
mirror 2.73T 1.81T 105 490 14.0M 105M
nvme0n1 - - 52 245 7.0M 52.5M
nvme1n1 - - 53 245 7.0M 52.4M
mirror 2.73T 1.81T 105 490 14.1M 105M
nvme2n1 - - 52 245 7.0M 52.6M
nvme3n1 - - 53 245 7.1M 52.4M
Interpretation: Look for write bandwidth that goes from “moderate” to “maxed out” at a cadence, with operations climbing sharply. Bursts that line up with the TXG timer are a clue, not a verdict.
Task 3: Add latency columns to expose the slow vdev
cr0x@server:~$ sudo zpool iostat -v -l tank 1
operations bandwidth total_wait disk_wait
pool read write read write read write read write
-------------------------- ---- ----- ----- ----- ----- ----- ----- -----
tank 210 980 28.1M 210M 1ms 18ms 0ms 15ms
mirror 105 490 14.0M 105M 1ms 18ms 0ms 15ms
nvme0n1 52 245 7.0M 52.5M 1ms 20ms 0ms 17ms
nvme1n1 53 245 7.0M 52.4M 1ms 19ms 0ms 16ms
mirror 105 490 14.1M 105M 1ms 18ms 0ms 15ms
nvme2n1 52 245 7.0M 52.6M 1ms 18ms 0ms 15ms
nvme3n1 53 245 7.1M 52.4M 1ms 40ms 0ms 38ms
Interpretation: One device (nvme3n1) shows much higher write wait. A single laggard can stretch TXG sync time and create visible bursts. This is where you stop tuning and start asking “is a device dying or misconfigured?”
Task 4: Check pool health and error counters
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Replace the device or clear the errors.
scan: scrub repaired 0B in 03:12:55 with 0 errors on Sun Dec 22 02:12:19 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror-1 ONLINE 0 5 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 5 0
errors: No known data errors
Interpretation: Write errors on a single vdev member can translate into retries and timeouts, which look like latency bursts. If you see errors, treat them as both performance and reliability issues.
Task 5: Confirm txg_timeout and related tunables
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
17179869184
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent
10
Interpretation: The default 5-second timeout plus a generous dirty data max can create “quiet then storm” behavior. But don’t touch these yet—measure first, then change one thing at a time.
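The timeout is only half the story; the dirty-data and write-throttle parameters decide how much work a TXG can accumulate and when writers start getting delayed. A hedged way to list them (exact parameter names differ between OpenZFS releases, so some globs may come back empty):
cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_dirty_data_* /sys/module/zfs/parameters/zfs_delay_* 2>/dev/null
For reference, the 17179869184 above is 16 GiB of allowed dirty data: the ceiling of in-flight work a sync window may have to drain.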
Task 6: Observe ARC pressure (because memory shapes dirty data)
cr0x@server:~$ grep -wE 'c_max|c_min|size|memory_throttle_count' /proc/spl/kstat/zfs/arcstats
c_max 4 137438953472
c_min 4 68719476736
size 4 92341784576
memory_throttle_count 4 0
Interpretation: If ARC is fighting for memory, ZFS may throttle or behave more aggressively. If memory_throttle_count climbs, your “bursts” may be the system gasping for RAM rather than a simple TXG cadence issue.
Task 7: Identify whether the workload is sync-heavy
cr0x@server:~$ sudo zfs get -H -o name,property,value sync tank/db
tank/db sync standard
cr0x@server:~$ sudo iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
12.00 0.00 8.00 18.00 0.00 62.00
Device r/s w/s r_await w_await aqu-sz %util
nvme3n1 55.0 280.0 0.9 35.0 9.8 98.0
Interpretation: High w_await and near-100% util on a device aligns with commit pressure. To decide whether it’s ZIL-related, you need to look at the presence and behavior of a SLOG.
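If the properties and iostat don’t settle whether the application is actually issuing sync writes, count them at the source. A sketch with strace; the PID 4120 is hypothetical, so substitute your suspected writer, and expect some tracing overhead on a busy process. Press Ctrl-C after roughly 30 seconds to detach and print the summary:
cr0x@server:~$ sudo strace -f -c -e trace=fsync,fdatasync,sync_file_range -p 4120
A high fsync/fdatasync call rate means the ZIL path matters here, not just the TXG cadence, and SLOG latency becomes the number to chase.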
Task 8: Check if there is a dedicated log device (SLOG)
cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
nvme4n1 ONLINE 0 0 0
nvme5n1 ONLINE 0 0 0
Interpretation: A mirrored SLOG exists. Good. Now you must verify it is low-latency and power-loss safe, or it will turn sync writes into a performance hostage situation.
Task 9: Measure SLOG latency indirectly via per-vdev waits
cr0x@server:~$ sudo zpool iostat -v -l tank 1 | sed -n '1,40p'
pool read write read write read write read write
-------------------------- ---- ----- ----- ----- ----- ----- ----- -----
tank 210 980 28.1M 210M 1ms 18ms 0ms 15ms
mirror-0 105 490 14.0M 105M 1ms 18ms 0ms 15ms
nvme0n1 52 245 7.0M 52.5M 1ms 20ms 0ms 17ms
nvme1n1 53 245 7.0M 52.4M 1ms 19ms 0ms 16ms
logs
mirror-2 0 320 0.0K 19.2M 0ms 2ms 0ms 1ms
nvme4n1 0 160 0.0K 9.6M 0ms 2ms 0ms 1ms
nvme5n1 0 160 0.0K 9.6M 0ms 2ms 0ms 1ms
Interpretation: Log vdev write wait around 1–2 ms is decent for many sync workloads. If you see 10–50 ms here, your SLOG is not a SLOG; it’s a slow log.
Task 10: Check dataset logbias to match intent
cr0x@server:~$ sudo zfs get -H -o name,property,value logbias tank/db tank/vm
tank/db logbias latency
tank/vm logbias latency
Interpretation: logbias=latency favors using the log for sync writes (good for databases). throughput can reduce log traffic in some cases but may increase main-pool writes for sync operations. Changing it without understanding the workload is a classic foot-gun.
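If you do decide to experiment after profiling the workload, keep the change scoped to one dataset and keep the rollback one command away. A sketch using tank/vm from the listing above; whether throughput helps at all depends on your sync pattern:
cr0x@server:~$ sudo zfs set logbias=throughput tank/vm     # scoped experiment on one dataset
cr0x@server:~$ sudo zfs inherit logbias tank/vm            # roll back to the inherited/default value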
Task 11: Observe TXG-related kstats (dirty data and sync time)
cr0x@server:~$ ls /proc/spl/kstat/zfs/ | head
arcstats
dbufstats
dmu_tx
vdev_queue
vdev_mirror
zfetchstats
cr0x@server:~$ cat /proc/spl/kstat/zfs/dmu_tx
13 1 0x01
name type data
dmu_tx_assigned 4 12800451
dmu_tx_delay 4 21438
dmu_tx_error 4 0
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
Interpretation: On Linux, some TXG visibility is indirect. Rising dmu_tx_delay indicates writers are being delayed—often due to dirty data limits and sync pressure. You’re looking for correlation: when latency spikes, do these counters jump?
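To capture the correlation rather than eyeball it, timestamp the counter while the spikes are happening. A minimal sketch (the log path is arbitrary; stop it with Ctrl-C):
cr0x@server:~$ while sleep 1; do date +%T; grep dmu_tx_delay /proc/spl/kstat/zfs/dmu_tx; done | tee /tmp/dmu_tx_delay.log
If the counter only increments during the same seconds your application latency spikes, you have your correlation. If it never moves, spend your energy on sync writes and per-vdev latency instead.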
Task 12: Use perf to confirm you’re blocked in ZFS I/O paths (not CPU)
cr0x@server:~$ sudo perf top -g
12.4% [kernel] [k] __schedule
8.9% [kernel] [k] io_schedule
6.1% [kernel] [k] blk_mq_get_tag
5.7% zfs [k] zio_wait
4.8% zfs [k] zio_execute
4.1% zfs [k] vdev_queue_io
Interpretation: Seeing time in io_schedule and ZFS zio_* functions suggests the system is waiting on storage, not burning CPU on compression or checksums. That steers you away from tuning compression or CPU scaling and toward vdev latency, SLOG, and TXG behavior.
Task 13: Quick check for device-level saturation and queueing
cr0x@server:~$ sudo nvme smart-log /dev/nvme3n1 | sed -n '1,25p'
Smart Log for NVME device:nvme3n1 namespace-id:ffffffff
critical_warning : 0
temperature : 47 C
available_spare : 100%
percentage_used : 3%
media_errors : 0
num_err_log_entries : 0
warning_temp_time : 0
critical_comp_time : 0
Interpretation: This isn’t a performance benchmark; it’s a sanity check. If media errors or error log entries are climbing, your “TXG problem” may be a hardware problem wearing a ZFS mask.
Task 14: Change txg_timeout safely (temporary) and observe
cr0x@server:~$ sudo sh -c 'echo 3 > /sys/module/zfs/parameters/zfs_txg_timeout'
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
3
Interpretation: This is a test, not a lifestyle. If your latency spikes shift cadence and reduce amplitude, you’ve confirmed TXG timing is part of the picture. If nothing changes, stop fiddling and look at sync writes or vdev latency.
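Two follow-ups belong with this experiment: an immediate rollback, and, only if the change earns its keep, a persistent setting. A sketch, assuming the usual modprobe convention for ZFS module parameters (the file name zfs-tuning.conf is arbitrary, and on distros that load ZFS from the initramfs you may also need to regenerate it):
cr0x@server:~$ sudo sh -c 'echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout'                 # rollback to the default
cr0x@server:~$ echo 'options zfs zfs_txg_timeout=3' | sudo tee /etc/modprobe.d/zfs-tuning.conf  # persist only after validation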
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
The platform team got a ticket: “Database stalls every five seconds.” It was the kind of complaint that makes you suspect the monitoring system first. But the graphs were honest: write latency spiked with metronomic precision.
Someone had a tidy assumption: “ZFS flushes every five seconds, so the database must be forcing flushes.” That assumption got traction because it sounded like physics. They focused on the database, tuned checkpoints, argued about WAL settings, and changed application batching. The latency spikes didn’t care.
The real issue was simpler and more annoying: one mirror member in the pool had started taking longer to complete writes—no hard failures, just occasional stalls. ZFS, being correct, waited for the slowest component to complete the TXG’s obligations. TXG commit time stretched, dirty data accumulated, and the system hit throttling. The five-second pattern was just the timer exposing a bottleneck that would have hurt anyway.
They proved it by adding wait-time columns to zpool iostat. One device consistently showed elevated disk_wait during the spikes. SMART looked “fine,” because SMART is often “fine” right up until it isn’t. Replacing the device eliminated the periodic stalls without touching a single ZFS knob.
The lesson that stuck: periodic behavior doesn’t always mean “timer problem.” Sometimes it’s “one slow component causes periodic pain when the system is forced to synchronize.” TXGs just make the synchronization visible.
Mini-story 2: The optimization that backfired
A different organization had a VM farm on ZFS zvols. They wanted smoother latency for guest writes. Somebody read that lowering txg_timeout can reduce burst size, so they dropped it aggressively across the fleet.
The graphs improved—for about a day. Then throughput tanked during nightly jobs. Backup windows slipped. The storage nodes weren’t “slow” in the classic sense; they were busy doing more work per unit of data. More TXGs meant more frequent metadata commits, more churn, and less time in the efficient steady-state.
Worse, the smaller TXGs increased the relative share of overhead for workloads with many small synchronous writes. The SLOG started getting hammered with more frequent fsync patterns, and a previously “fine” log device became the pacing item. Their latency got spikier under load, not smoother.
They unwound the change and took a more surgical approach: separate datasets, correct volblocksize for zvols, add a proper mirrored low-latency SLOG for sync-heavy tenants, and cap dirty data in a way that matched memory headroom. The final tuning was boring and incremental—which is the only kind of tuning you want near money.
The lesson: txg_timeout is a lever, not a cure. Pull it too hard and you trade burstiness for overhead, which often shows up as worse tail latency when the system is actually busy.
Mini-story 3: The boring but correct practice that saved the day
A SaaS company ran multi-tenant storage with strict SLOs. They had learned, the hard way, that ZFS tuning without observability is astrology. So they did the dull thing: per-vdev latency tracking, TXG cadence dashboards, and routine scrub schedules. Nobody loved it. Everyone benefited.
One afternoon, write latency started to develop a faint but growing 5-second rhythm. Not a full incident yet—just the kind of “hmm” that good on-call engineers notice while pretending not to. The dashboards showed commit-related spikes were slowly getting worse, but only on one pool.
Because they already had per-vdev wait histograms, they spotted a single device with a rising long-tail write latency. It wasn’t failing outright. It was just occasionally taking long enough to stretch TXG sync time. They drained tenants from that pool and swapped the device during a normal maintenance window. No emergency change, no midnight heroics, no “we rebooted it and it got better.”
After replacement, the TXG rhythm vanished. The team looked like geniuses, which was unfair—they were just prepared. The boring practice wasn’t tuning; it was noticing early enough that tuning wasn’t needed.
The lesson: the best way to “smooth latency” is to prevent the conditions that create big sync windows—especially silent device degradation and unobserved saturation.
Tuning strategy: what to change, in what order
When people say “tune ZFS,” they often mean “change a parameter until the graph looks nicer.” That’s not tuning; that’s graph-based wish fulfillment. Here’s a safer order of operations.
Step 1: Decide if your problem is sync latency or TXG commit latency
If your workload is dominated by synchronous operations, the fastest way to reduce tail latency is usually a proper SLOG (mirrored, power-loss protected, low latency) and dataset settings that match the workload (sync, logbias). If it’s mostly async throughput, TXG and dirty data behavior matter more.
Step 2: Fix the obvious bottleneck first (hardware and topology)
- Replace or remove slow devices.
- Ensure vdevs are balanced; one underperforming vdev sets the pace.
- Confirm controller settings, firmware, queue depths, and that you’re not fighting a misconfigured HBA.
Step 3: Use dataset-level knobs before global knobs
Dataset properties are reversible and scoped. Global module parameters are shared across the pool and can create cross-tenant coupling.
- recordsize for filesystems: match I/O size (databases often like 16K; backups like 1M).
- volblocksize for zvols: set at creation; match the guest/filesystem block patterns.
- logbias: choose latency for sync-heavy low-latency workloads; consider throughput for streaming patterns.
- sync: keep standard unless you have a very explicit durability trade-off.
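As a sketch of what those scoped changes look like in practice (dataset names come from the earlier listing; the new zvol tank/vm/guest42 and its 200G size are purely hypothetical):
cr0x@server:~$ sudo zfs set recordsize=16K tank/db                          # affects newly written blocks only
cr0x@server:~$ sudo zfs set logbias=latency tank/db
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vm/guest42  # volblocksize must be set at creation
Remember that recordsize does not rewrite existing data; only blocks written after the change use the new size.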
Step 4: Only then, consider TXG timing and dirty data limits
txg_timeout tweaks the cadence. Dirty-data tunables adjust how much work can accumulate before ZFS forces backpressure. Together they shape “burst amplitude” and “burst frequency.”
Practical heuristics that hold up in real environments:
- Lower txg_timeout modestly (e.g., 5 → 3) if you have periodic spikes and your system isn’t already dominated by metadata overhead.
- Be cautious raising it; larger TXGs can produce uglier tail latency and longer recovery time after transient stalls.
- Adjust dirty data max only when you understand memory headroom and ARC behavior. Too high can create bigger bursts; too low can throttle writers constantly.
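If you do adjust the dirty data ceiling, treat it like the txg_timeout test: one temporary change, observed, rollback ready. A sketch with a deliberately conservative, hypothetical value (8 GiB), assuming your build accepts runtime writes to this parameter:
cr0x@server:~$ echo $((8 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_dirty_data_max
Write down the previous value (from Task 5) first, so the rollback is a single echo.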
Step 5: Validate with a real workload, not a synthetic victory lap
Run the workload that hurts: your database, your VM farm, your NFS traffic. A sequential write benchmark can “prove” any bad idea is good, if you pick the right block size and ignore tail latency.
Common mistakes, symptoms, and fixes
Mistake 1: Treating txg_timeout as a “flush interval” like a database checkpoint
Symptom: You lower txg_timeout, and spikes change frequency but tail latency stays bad under load.
What’s happening: The system is device-limited or sync-limited; you’ve changed the rhythm, not the capacity.
Fix: Measure per-vdev wait; verify SLOG; check sync write rate; replace slow devices or add capacity before further tuning.
Mistake 2: Disabling sync to “fix” latency
Symptom: Latency looks amazing, and then after a power event, you’re restoring from backups while explaining “it was just a test.”
What’s happening: sync=disabled turns synchronous requests into asynchronous behavior. It improves performance by changing the contract.
Fix: Use a proper SLOG; tune logbias; fix underlying device latency; leave sync at standard unless you can accept data loss.
Mistake 3: Adding a cheap consumer SSD as SLOG
Symptom: Sync latency worsens; occasional stalls get worse; sometimes you see log device timeouts.
What’s happening: SLOG needs low, consistent latency and power-loss protection. A drive with volatile write cache and unpredictable GC turns fsync into roulette.
Fix: Use an enterprise-grade, PLP-capable device; mirror it; confirm latency via zpool iostat -l.
Mistake 4: Oversizing dirty data “because RAM is free”
Symptom: Longer, uglier spikes; occasional multi-second stalls when the system is under pressure.
What’s happening: Bigger dirty buffers mean bigger commits. You’ve increased burst amplitude. Under transient slowdowns, you’ve also increased the amount of work that must drain before writers are released.
Fix: Right-size dirty limits to match device throughput and acceptable tail latency; keep ARC healthy; avoid memory contention with other services.
Mistake 5: Ignoring per-vdev imbalance
Symptom: “The pool is fast” in aggregate metrics, but latency spikes persist.
What’s happening: One vdev or one device is slow; TXG sync waits for the slowest required I/O. Average bandwidth hides tail behavior.
Fix: Use zpool iostat -v -l, replace the outlier, and verify controller/firmware consistency.
Mistake 6: Changing five knobs at once
Symptom: The system feels “different,” but you can’t tell why, and rollbacks are scary.
What’s happening: You’ve destroyed your ability to reason about cause and effect.
Fix: One change, one validation, one rollback plan. Write down what you changed and why.
Checklists / step-by-step plan
Checklist A: Confirm you’re seeing TXG commit bursts
- Graph write latency and see if spikes align to ~5 seconds.
- Run zpool iostat -v -l 1 during spikes; identify whether wait times jump at the same cadence.
- Check /sys/module/zfs/parameters/zfs_txg_timeout and compare to the observed cadence.
- Confirm whether the log vdev (if present) shows increased wait during spikes.
- Check dmu_tx_delay or equivalent counters; look for growth during events.
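To make Checklist A concrete, here is a minimal capture to run while the spikes are happening (pool name tank and the /tmp paths are just examples; the first command stops itself after 120 samples, stop the second with Ctrl-C):
cr0x@server:~$ sudo zpool iostat -v -l tank 1 120 > /tmp/iostat_l.log &
cr0x@server:~$ while sleep 1; do date +%T; grep dmu_tx_delay /proc/spl/kstat/zfs/dmu_tx; done > /tmp/dmu_tx.log
Reviewing the two logs side by side answers most of the checklist: the cadence, which vdev’s wait jumps, and whether writers are being delayed at the same moments.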
Checklist B: Smooth latency without breaking durability
- Validate hardware first: replace slow/erroring devices; confirm firmware uniformity.
- Confirm sync behavior: identify which datasets/apps are sync-heavy; keep sync=standard.
- Fix SLOG if needed: mirrored, PLP-capable, low-latency; verify waits in zpool iostat -l.
- Dataset tuning: set recordsize appropriately; verify logbias.
- Small txg_timeout experiment: adjust 5 → 3 temporarily; observe tail latency and throughput.
- Dirty data sanity: ensure dirty limits don’t create giant commits or constant throttling.
- Rollback if trade-offs hurt: prefer stable throughput and predictable tail over a prettier average.
Checklist C: Change management that won’t haunt you
- Capture “before” metrics: per-vdev wait, app latency, sync write rate.
- Change one knob at a time with a timed observation window.
- Document the hypothesis (“lower txg_timeout reduces commit burst amplitude”).
- Define success criteria (p99 write latency, not just average).
- Keep a rollback command ready and tested.
FAQ
1) Is txg_timeout the reason my writes spike every five seconds?
It can be the visible metronome, but it’s not always the cause. The cause is usually that TXG sync work concentrates in time, often because devices or sync logging can’t absorb the workload smoothly. Verify by correlating spikes with per-vdev wait times and TXG-related backpressure counters.
2) Should I lower txg_timeout to smooth latency?
Sometimes, modestly. Dropping from 5 to 3 seconds can reduce burst size, but it can also reduce throughput and increase overhead. If your bottleneck is a slow vdev or a slow SLOG, lowering txg_timeout just changes the tempo of suffering.
3) Will a SLOG fix TXG bursts?
A SLOG helps with synchronous write latency (fsync-heavy workloads). It does not directly eliminate TXG commit work to the main pool. Many environments have both issues: the SLOG fixes the “fsync pain,” and TXG tuning/hardware fixes the “commit burst pain.”
4) Why does it get worse when the pool is nearly full?
As pools fill, allocation becomes harder, fragmentation increases, and metadata updates can become more expensive. TXG sync can take longer, which increases the time dirty data spends waiting to become durable, which increases throttling and perceived burstiness.
5) Is recordsize related to txg_timeout?
Indirectly. recordsize changes I/O shape and metadata overhead. Smaller records can increase metadata work and IOPS demand, making TXG sync more intense. The timer doesn’t change, but the work done per TXG does.
6) What’s the safest way to “smooth latency” for databases on ZFS?
Keep sync=standard, use a proper mirrored PLP SLOG for WAL/fsync workloads, set recordsize appropriately (often 16K for many databases), and then address any per-vdev latency outliers. Consider small txg_timeout changes only after the basics are solid.
7) If I don’t care about durability, can I just set sync=disabled?
You can, but you should treat it like removing seatbelts because they wrinkle your shirt. It may be acceptable for scratch data, ephemeral caches, or test environments with explicit loss tolerance. Don’t do it for anything you’d be embarrassed to explain after a crash.
8) How do I know if I’m being throttled by dirty data limits?
Look for rising writer delays (e.g., dmu_tx_delay increasing) and application-level latency that correlates with dirty data oscillation and TXG sync windows. You’ll often see I/O bursts followed by periods where writers block more than usual.
9) Can compression help with TXG bursts?
Sometimes. Compression can reduce bytes written, which reduces device time during TXG sync, but it increases CPU work and can change I/O patterns. If you’re device-bound and have CPU headroom, compression can reduce burst amplitude. If you’re CPU-bound or already near saturation, it can worsen tail latency.
10) Is this problem specific to Linux?
No. TXGs are fundamental to ZFS across platforms. What differs is the visibility (which counters are exposed) and the exact tunable names and defaults.
Conclusion
txg_timeout doesn’t make ZFS bursty; it reveals the batching that ZFS uses to be fast and correct. Those write bursts are often normal—until they’re not. When they become a latency problem, it’s usually because the TXG sync phase is being stretched by a slow device, sync-write pressure, or dirty data accumulation that triggers throttling.
Smoothing latency is mostly about respecting the pipeline: ensure the slowest vdev isn’t secretly slow, give synchronous workloads a proper low-latency log path, tune datasets to match I/O shape, and only then adjust TXG timing and dirty data limits—carefully, measurably, and with rollback ready. The goal isn’t to eliminate bursts entirely; it’s to keep them small enough that your applications stop noticing—and your on-call stops developing a Pavlovian response to five-second intervals.