Your database was “fine” in staging. Then production showed up with real users, real latency, and the kind of write amplification that turns
confident benchmarks into performance fan fiction.
If you run ZFS under a database, sync writes are where truth lives. They’re also where most fio tests accidentally cheat: wrong flags, wrong
assumptions, wrong I/O path, wrong durability guarantees. This is a field guide for running fio against ZFS like you actually care whether the
data survives a power cut.
What “sync writes” really mean on ZFS (and why databases care)
Databases do not want “fast writes.” They want committed writes. When a database says “transaction committed,” it’s making a promise:
if the machine loses power immediately after, the committed data is still there when it comes back.
That promise is implemented with a small set of behaviors:
fsync(), fdatasync(), O_DSYNC, O_SYNC, and sometimes FUA
semantics in the storage stack. Most database durability comes down to one hard requirement: the log record is on stable storage before “OK”
goes back to the client.
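To feel the difference, compare a buffered write with a per-write durable one using plain GNU dd (the scratch path is illustrative; the flags are standard, the numbers are whatever your hardware gives you):
cr0x@server:~$ dd if=/dev/zero of=/tank/scratch/ddtest bs=8k count=2000
cr0x@server:~$ dd if=/dev/zero of=/tank/scratch/ddtest bs=8k count=2000 oflag=dsync
The first returns as soon as the data lands in memory; the second opens the file with O_DSYNC, so every 8k write must reach stable storage before the next one starts. The gap between those two results is roughly the gap between “fast writes” and “committed writes.”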
ZFS complicates this in a good way (strong consistency model) and a confusing way (copy-on-write + transaction groups + intent log).
ZFS can accept writes into memory and commit them later in a TXG (transaction group). That’s normal buffered I/O. But when an application
asks for sync semantics, ZFS must make sure the write is durable before returning.
Enter the ZIL (ZFS Intent Log). The ZIL is not a “write cache” in the usual sense; it’s a mechanism to satisfy synchronous requests quickly
without forcing an immediate full pool commit. If the system crashes, the ZIL records are replayed at import to bring the filesystem to a
consistent state that includes those synchronous operations.
The ZIL normally lives on the pool itself. A SLOG (separate log device) is an optional dedicated device that holds the ZIL records. Done
correctly, a SLOG turns “sync write latency” into “SLOG device latency.” Done incorrectly, it turns “durable database” into “I hope the UPS
is having a good day.”
Your job when testing is to measure latency and IOPS under sync semantics that match the database, while making sure you aren’t accidentally benchmarking RAM, the page cache, or a durability setting you don’t actually run in prod.
Facts & context: how we got here
- Fact 1: ZFS was born at Sun Microsystems in the mid-2000s with a focus on end-to-end data integrity (checksums everywhere), not “peak fio numbers.”
- Fact 2: Traditional filesystems often leaned on write-back caching and a journal; ZFS uses copy-on-write plus transactional commits (TXGs), which changes how “write latency” behaves.
- Fact 3: The ZIL exists specifically to make synchronous operations fast without forcing a full TXG sync on every fsync-like request.
- Fact 4: Early enterprise storage commonly shipped with battery-backed write cache; today, NVMe drives with power-loss protection (PLP) are the closest commodity analog for safe low-latency sync logging.
- Fact 5: Databases historically optimized around spinning disks, where sequential logging was king; on SSD/NVMe, latency variance (tail latency) becomes the real killer.
- Fact 6: Linux async I/O APIs evolved separately (libaio, io_uring), and fio can exercise many code paths—some of which don’t map to how your database writes its WAL/redo log.
- Fact 7: “Write cache enabled” on a disk has been a footgun since the dawn of SATA; it can improve benchmarks while quietly destroying durability guarantees during power loss.
- Fact 8: ZFS dataset properties like sync, recordsize, logbias, and primarycache can materially change performance without changing application code: great power, great ability to shoot your own toe.
ZFS mental model for database people: TXGs, ZIL, SLOG, and lies
TXGs: the metronome behind your write workload
ZFS batches modifications into transaction groups. Every few seconds (commonly around 5s, tunable), a TXG is synced to disk. If your workload
is mostly asynchronous buffered writes, the throughput you see is strongly tied to TXG behavior: dirty data accumulates, then flushes. That
can produce “sawtooth” latency patterns—bursty flushes, then calm.
For sync writes, ZFS can’t just wait for the next TXG sync. It needs to acknowledge only after the write is durable. That’s why the ZIL exists.
ZIL: not a second journal, not a database log, not magic
The ZIL records the intent of synchronous operations. It’s written sequentially-ish, in small records. When the TXG eventually commits the
real data blocks, the corresponding ZIL records become unnecessary and can be discarded.
Important consequence: in steady state, the ZIL is about latency, not capacity. A SLOG device does not need to be huge.
It needs to be low latency, durable, and consistent under pressure.
SLOG: the “don’t make sync writes terrible” device
With a SLOG, ZFS writes ZIL records to that device instead of the main pool. If the SLOG is fast and has power-loss protection,
synchronous write latency can drop dramatically.
If the SLOG is fast but not power-safe, you’ve built a database corruption appliance. Yes, you can get lucky. Luck is not a design.
Where people lie to themselves
Most “ZFS database benchmarks” go wrong in one of three ways:
- They’re not actually doing sync I/O. fio writes buffered data and reports glorious numbers. The database doesn’t behave that way.
- They’re doing sync I/O, but ZFS isn’t honoring it the way production does. Dataset sync=disabled or virtualization layers change semantics.
- They measure throughput and ignore latency distribution. Databases die from p99 and p99.9, not from average MB/s.
“Everything is fast on average” is how you get paged at 02:13. The database isn’t upset about the average. It’s upset about the outliers.
One quote worth keeping taped to your monitor:
“Hope is not a strategy.”
—General Gordon R. Sullivan
Joke #1: If your benchmark finishes in 30 seconds and your conclusions last three years, you’re doing performance engineering like a horoscope.
fio rules: how benchmarks accidentally cheat
Rule 1: “sync” must mean the same thing to fio, the OS, and ZFS
fio can do sync-like behavior in multiple ways:
fsync=1 (call fsync after each write),
fdatasync=1 (call fdatasync after each write),
sync=1 (open with O_SYNC),
sync=dsync (open with O_DSYNC; fio has no separate dsync=1 flag),
direct=1 (O_DIRECT, bypass the page cache),
and the I/O engine can change semantics further (psync, libaio, io_uring).
Databases commonly do buffered writes to their WAL/redo log file and call fsync at commit (or group commit). Some do O_DSYNC / O_DIRECT
depending on configuration. That means your fio test should be selected based on the database configuration you run, not based on what makes
the prettiest chart.
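One way to keep yourself honest is to encode those choices in a fio job file, so the durability mode is explicit and reviewable. A minimal sketch, assuming an 8k WAL-like pattern and a scratch file on the database dataset (path, size, and runtime are placeholders):
[global]
filename=/tank/db/fio-sync-compare.dat
size=2g
bs=8k
rw=write
ioengine=psync
runtime=60
time_based

; no durability barrier at all: page-cache numbers
[buffered]
stonewall

; fsync after every write
[fsync-per-write]
fsync=1
stonewall

; O_DSYNC (plus O_DIRECT) durability on every write
[odsync]
direct=1
sync=dsync
stonewall
Save it as sync-compare.fio and run fio sync-compare.fio. The stonewall lines run the jobs one after another; if the three results don’t differ dramatically on your hardware, either the flags or the dataset aren’t doing what you think.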
Rule 2: buffered I/O + small files = page cache benchmark
If you run fio against a small file without direct=1, you can end up measuring the Linux page cache. That can still be useful,
but it’s not what you think it is.
For sync-write testing, the worst-case lie is:
you think you’re measuring durability latency, but you’re measuring RAM speed plus an occasional flush.
Rule 3: “fsync per write” is not always what your database does
Setting fsync=1 in fio forces an fsync after each write call. That models a database with no group commit. Many databases do
group commit: multiple transactions share an fsync. If your production database groups commits, “fsync per 4k write” may drastically
understate throughput and overstate latency.
The fix is not to cheat. The fix is to model group commit intentionally (multiple threads writing, fsync cadence, or a mix of sync and async).
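fio can’t reproduce a real group-commit protocol, but you can roughly approximate the effect with its fsync-every-N-writes option plus some concurrency. A sketch where the batch size and job count are placeholders, not your database’s real commit batching:
cr0x@server:~$ fio --name=wal-groupcommit --filename=/tank/db/fio-wal-gc.log --size=8g --bs=8k --rw=write --ioengine=psync --fsync=32 --numjobs=4 --runtime=120 --time_based=1 --group_reporting=1
Here fsync=32 issues one fsync per 32 writes per job. Compare it against the fsync-per-write profile; the gap is a rough estimate of what commit batching buys you on this storage.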
Rule 4: latency percentiles are the product
When sync writes are slow, your database waits. That’s visible as latency spikes and queue buildup. Always capture:
p50, p95, p99, and ideally p99.9, plus IOPS under a target latency.
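fio reports a default percentile set unless you ask for more. You can request exactly the percentiles you care about and keep per-I/O latency logs for later analysis (the log prefix is arbitrary):
cr0x@server:~$ fio --name=wal-fsync --filename=/tank/db/fio-wal.log --size=8g --bs=8k --rw=write --ioengine=psync --fsync=1 --runtime=300 --time_based=1 --percentile_list=50:95:99:99.9:99.99 --write_lat_log=wal-lat --group_reporting=1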
Rule 5: verify your test is actually sync
Don’t trust a config file because it “looks right.” Make the system prove it: trace syscalls, watch ZIL behavior, confirm dataset
properties, and confirm the devices are honoring flushes.
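A cheap first proof is to watch the pool per-vdev while the job runs. If the test is genuinely synchronous, the log vdev (or, without a SLOG, the main vdevs) should show a steady write stream that stops the instant the job stops:
cr0x@server:~$ zpool iostat -v tank 1
Pair this with the strace and iostat checks in the tasks below; each one catches a different flavor of self-deception.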
Designing an honest fio test for database sync writes
Start from the database’s durability path
Pick one database configuration and map it to fio:
- PostgreSQL default: buffered WAL writes + fsync. fio analog: buffered writes with fsync=1 or a periodic fsync, depending on commit grouping.
- MySQL/InnoDB: depends on innodb_flush_method and innodb_flush_log_at_trx_commit. fio analog: a mix of fsync and O_DSYNC patterns.
- SQLite FULL: frequent fsync barriers. fio analog: small sync writes with fsync or O_DSYNC.
Pick a block size that matches the log, not the table
WAL/redo is usually written in chunks (often 8k, 16k, 32k, sometimes 4k). Use bs=8k or bs=16k for WAL-like tests.
Don’t use bs=1m and call it “database.”
Use multiple jobs to model concurrency and group commit
Most systems commit multiple transactions concurrently. Even if each transaction is “sync,” the system can pipeline work. Run multiple fio jobs
with a shared file or separate files to model contention. But understand the trade:
more threads can hide single-write latency while increasing tail latency.
Keep files large enough to avoid fake locality
For log tests, you can use a file sized like WAL segments (e.g., a few GB), but make sure you’re not just rewriting the same blocks that sit
hot in ARC and metadata caches. Also: don’t run everything against a tiny dataset with primarycache=all and then wonder why it’s
“fast.”
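If you need a read-leaning test that isn’t an ARC benchmark, a throwaway dataset with caching limited to metadata keeps the numbers honest. A sketch (the dataset name is made up; use whatever fits your naming scheme):
cr0x@server:~$ sudo zfs create -o primarycache=metadata -o recordsize=16K tank/fio-scratch
Destroy it when you’re done; benchmark datasets left lying around confuse the next on-call.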
Separate “sync log” from “data flush” tests
A database has at least two write personalities:
- Log writes (sync): small, latency-sensitive, durability-critical.
- Data writes (async-ish): larger, throughput-oriented, checkpoint-driven.
If you blend them into one fio job, you can’t diagnose anything. Create at least two test profiles.
Run long enough to hit steady state and observe TXG cycles
If your runtime is shorter than a few TXG sync intervals, you can get misled by warm caches, initial allocation behavior, and short-term device
boosts. For sync-write latency, 2–5 minutes is a decent minimum; 10–20 minutes is better if you’re chasing p99.9.
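fio can also discard the warm-up window for you, so the reported percentiles describe steady state rather than the first lucky minute. A sketch with illustrative timings:
cr0x@server:~$ fio --name=wal-steady --filename=/tank/db/fio-wal.log --size=8g --bs=8k --rw=write --ioengine=psync --fsync=1 --ramp_time=60 --runtime=900 --time_based=1 --group_reporting=1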
Joke #2: The only thing faster than an NVMe SLOG is a benchmark that forgot to enable sync.
Practical tasks (commands, outputs, decisions)
These are the tasks I actually run when someone says “ZFS is slow for our database” and hands me a single average IOPS number.
Each task includes: command, what the output means, and what decision you make from it.
Task 1: Confirm dataset sync setting (the biggest “oops”)
cr0x@server:~$ zfs get -o name,property,value,source sync tank/db
NAME PROPERTY VALUE SOURCE
tank/db sync standard local
Meaning: standard honors application sync requests. If this says disabled, your “sync write benchmark”
is probably testing fantasy mode.
Decision: If production must be durable, keep standard (or always if you must force sync). If you see
disabled anywhere near a database, treat it like a sev-1 waiting for a power event.
Task 2: Check if a SLOG exists and is actually in use
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
logs
mirror-1 ONLINE 0 0 0
nvme2n1p1 ONLINE 0 0 0
nvme3n1p1 ONLINE 0 0 0
Meaning: A mirrored log device is present. Good: ZIL writes go to the SLOG, and the mirror reduces single-device log loss risk.
Decision: If there is no logs section and sync latency is high, evaluate a PLP-capable SLOG. If there is a SLOG
but it’s a single consumer SSD with unknown power safety, assume durability risk until proven otherwise.
Task 3: Verify the SLOG device reports write cache and flush behavior
cr0x@server:~$ sudo nvme get-feature /dev/nvme2n1 -f 0x06 -H
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
	Volatile Write Cache Enable (WCE): Enabled
Meaning: The drive exposes a volatile write cache and it is enabled. On NVMe this is normal and not automatically unsafe; the real question is whether the drive has PLP. (hdparm -W is the SATA-era way to ask this question and generally doesn’t work on NVMe devices.)
Decision: Confirm PLP via vendor specs or nvme id-ctrl. If you can’t establish power-loss safety, don’t use it as SLOG for a database that claims durability.
Task 4: Identify power-loss protection indicators (best-effort from the OS)
cr0x@server:~$ sudo nvme id-ctrl /dev/nvme2n1 | egrep -i 'oncs|vwc|frmw'
oncs : 0x001f
vwc : 0x0001
frmw : 0x0016
Meaning: vwc indicates presence of volatile write cache. This does not confirm PLP; it confirms you should care.
oncs bits show optional command support like flush. You still need vendor truth for PLP.
Decision: If you can’t confirm PLP, don’t sell “durable commits.” You can still use a fast device, but you should be honest about risk.
Task 5: Confirm ashift and alignment (quiet performance killer)
cr0x@server:~$ zdb -C tank | egrep -n 'ashift|vdev_tree' -A2
42: ashift: 12
43: asize: 1999844147200
44: is_log: 0
Meaning: ashift=12 means 4K sectors. If you’re on modern SSD/NVMe, that’s usually correct. Wrong ashift can cause write amplification and latency.
Decision: If ashift is too small (e.g., 9) on 4K-native devices, fix requires rebuilding the pool. Don’t “tune” around it; it’s structural.
Task 6: Watch TXG sync behavior and stalls
cr0x@server:~$ sudo dmesg -T | egrep -i 'txg|spa_sync|zil' | tail -n 5
[Mon Dec 25 10:11:03 2025] ZFS: spa_sync: tank txg 93421 took 2112ms
[Mon Dec 25 10:11:08 2025] ZFS: spa_sync: tank txg 93422 took 1897ms
Meaning: TXG syncs taking ~2 seconds are not inherently fatal, but they can correlate with latency spikes, especially when the system is near dirty data limits.
Decision: If TXG sync times are frequently multi-second under load, investigate pool device latency, dirty data tunables, and whether sync writes are piling up behind congestion.
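Kernel logs are hit-or-miss for TXG timing. On OpenZFS builds that expose the per-pool txgs kstat (and keep TXG history enabled), you can read recent TXGs directly; treat the exact path and columns as an assumption about your platform:
cr0x@server:~$ tail -n 5 /proc/spl/kstat/zfs/tank/txgs
Each row is a recent TXG with its dirty byte count and open/quiesce/wait/sync times in nanoseconds; long sync times here should line up with the latency spikes you feel.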
Task 7: Check ZFS properties that matter for database datasets
cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,logbias,atime,compression,primarycache tank/db
NAME PROPERTY VALUE SOURCE
tank/db recordsize 16K local
tank/db logbias latency local
tank/db atime off local
tank/db compression lz4 local
tank/db primarycache all default
Meaning: recordsize affects data blocks (not ZIL records), but it matters for table I/O patterns. logbias=latency encourages ZIL use patterns favorable to low latency. atime=off avoids extra writes. lz4 is usually a net win.
Decision: For WAL datasets, consider smaller recordsize (e.g., 16K) and logbias=latency. For data, align recordsize with typical page size and read patterns. Don’t cargo-cult recordsize=8K everywhere.
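In practice that usually means giving the WAL its own dataset so the knobs don’t fight with table data. A sketch, with the dataset name and values as illustrations rather than a prescription:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o logbias=latency -o compression=lz4 -o atime=off tank/db-wal
Point the database’s WAL/redo directory at that dataset and benchmark it separately from the data dataset.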
Task 8: Run a “WAL-like” fio test with fsync per write (worst-case sync)
cr0x@server:~$ fio --name=wal-fsync --filename=/tank/db/fio-wal.log --size=8g --bs=8k --rw=write --ioengine=psync --direct=0 --fsync=1 --numjobs=1 --runtime=120 --time_based=1 --group_reporting=1
wal-fsync: (groupid=0, jobs=1): err= 0: pid=18421: Mon Dec 25 10:22:11 2025
write: IOPS=3400, BW=26.6MiB/s (27.9MB/s)(3192MiB/120001msec)
slat (usec): min=6, max=310, avg=11.2, stdev=5.4
clat (usec): min=120, max=8820, avg=280.4, stdev=210.1
lat (usec): min=135, max=8840, avg=292.1, stdev=210.5
clat percentiles (usec):
| 50.00th=[ 240], 95.00th=[ 520], 99.00th=[ 1200], 99.90th=[ 4100]
Meaning: This models “write 8K, fsync, repeat.” Latency percentiles show the durability path. p99.9 at 4ms may be acceptable or not depending on your SLA.
Decision: If p99/p99.9 are too high, you’re hunting SLOG latency, device flush behavior, or pool contention. Don’t touch database settings yet; prove storage first.
Task 9: Run an O_DSYNC test (closer to some DB log modes)
cr0x@server:~$ fio --name=wal-dsync --filename=/tank/db/fio-wal-dsync.log --size=8g --bs=8k --rw=write --ioengine=psync --direct=1 --sync=dsync --numjobs=1 --runtime=120 --time_based=1 --group_reporting=1
wal-dsync: (groupid=0, jobs=1): err= 0: pid=18502: Mon Dec 25 10:25:10 2025
write: IOPS=5200, BW=40.6MiB/s (42.6MB/s)(4872MiB/120001msec)
clat (usec): min=95, max=6210, avg=190.7, stdev=160.2
clat percentiles (usec):
| 50.00th=[ 160], 95.00th=[ 380], 99.00th=[ 900], 99.90th=[ 2700]
Meaning: direct=1 bypasses the page cache; sync=dsync asks for data-only sync semantics (O_DSYNC) per write. This often maps better to “durable log append” patterns than fsync=1.
Decision: If this is dramatically better than fsync-per-write, it may indicate your workload benefits from group commit or from different sync primitives. Validate against your database’s actual mode.
Task 10: Add concurrency to model group commit pressure
cr0x@server:~$ fio --name=wal-dsync-8jobs --filename=/tank/db/fio-wal-8jobs.log --size=16g --bs=8k --rw=write --ioengine=psync --direct=1 --sync=dsync --numjobs=8 --runtime=180 --time_based=1 --group_reporting=1
wal-dsync-8jobs: (groupid=0, jobs=8): err= 0: pid=18614: Mon Dec 25 10:29:40 2025
write: IOPS=24000, BW=187MiB/s (196MB/s)(33660MiB/180001msec)
clat (usec): min=110, max=20210, avg=285.9, stdev=540.3
clat percentiles (usec):
| 50.00th=[ 210], 95.00th=[ 650], 99.00th=[ 2400], 99.90th=[12000]
Meaning: Throughput rose, but p99.9 exploded. That’s typical when the log device or pool can’t keep tail latency tight under concurrency.
Decision: If your database SLA is latency-sensitive, optimize for tail latency, not peak IOPS. You might prefer fewer concurrent committers or a better SLOG rather than more threads.
Task 11: Confirm fio is issuing the syscalls you think (strace)
cr0x@server:~$ sudo strace -f -e trace=pwrite64,write,fdatasync,fsync,openat fio --name=wal-fsync --filename=/tank/db/trace.log --size=256m --bs=8k --rw=write --ioengine=psync --direct=0 --fsync=1 --numjobs=1 --runtime=10 --time_based=1 --group_reporting=1 2>&1 | tail -n 12
openat(AT_FDCWD, "/tank/db/trace.log", O_RDWR|O_CREAT, 0644) = 3
pwrite64(3, "\0\0\0\0\0\0\0\0"..., 8192, 0) = 8192
fsync(3) = 0
pwrite64(3, "\0\0\0\0\0\0\0\0"..., 8192, 8192) = 8192
fsync(3) = 0
Meaning: You can see the actual pattern: pwrite, fsync, repeat. If you expected O_DSYNC and you don’t see it, your fio options aren’t doing what you think.
Decision: Don’t accept “fio says sync=dsync” as proof. Verify syscalls, especially when changing ioengine or direct I/O flags.
Task 12: Observe ZIL/SLOG activity during the test (iostat on log vdev)
cr0x@server:~$ iostat -x 1 /dev/nvme2n1 /dev/nvme3n1
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.22 0.00 2.10 7.11 0.00 85.57
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme2n1 0.0 0.0 0.0 0.0 0.00 0.0 18500.0 150000.0 0.0 0.0 0.18 8.1 3.2 91.0
nvme3n1 0.0 0.0 0.0 0.0 0.00 0.0 18480.0 149800.0 0.0 0.0 0.19 8.1 3.1 90.4
Meaning: Heavy writes on the log devices while running sync-write fio suggests the ZIL is hitting the SLOG. Low w_await is what you want.
Decision: If log devices show little activity during “sync” tests, you are likely not doing synchronous I/O, or the dataset/pool is configured in a way you didn’t expect.
Task 13: Check pool-wide latency and queueing (the “is it the main vdevs?” question)
cr0x@server:~$ iostat -x 1 /dev/nvme0n1 /dev/nvme1n1
Device r/s rkB/s r_await w/s wkB/s w_await aqu-sz %util
nvme0n1 50.0 12000.0 0.45 1200.0 98000.0 3.90 6.1 99.0
nvme1n1 45.0 11000.0 0.50 1180.0 97000.0 4.10 6.2 99.0
Meaning: Main pool devices are saturated (%util near 99) with multi-millisecond write awaits. Even with a good SLOG, TXG sync and general contention can push tail latency up.
Decision: If main vdevs are pegged, you need to reduce background write pressure (checkpoints, compaction, other tenants), add vdevs, or move workloads. A SLOG can’t fix a pool that’s drowning.
Task 14: Ensure you’re not benchmarking ARC (cache) for reads
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:35:01 412 20 4 0 0 20 100 0 0 28.1G 30.0G
10:35:02 398 17 4 0 0 17 100 0 0 28.1G 30.0G
10:35:03 420 18 4 0 0 18 100 0 0 28.1G 30.0G
Meaning: Low miss% means reads are mostly served from ARC. That’s fine for “database buffer cache hits,” but not representative of disk read performance.
Decision: For read benchmarks, size the test to exceed ARC or use primarycache=metadata on a test dataset to avoid accidental cache-only numbers.
Task 15: Test with sync=always temporarily to catch “async disguised as sync”
cr0x@server:~$ sudo zfs set sync=always tank/db
cr0x@server:~$ zfs get -o name,property,value sync tank/db
NAME PROPERTY VALUE
tank/db sync always
Meaning: ZFS will treat all writes as synchronous on that dataset. This is a diagnostic lever: if performance changes dramatically, your workload wasn’t really issuing sync writes as you thought, or it was being buffered unexpectedly.
Decision: Use this only in controlled testing. If “sync always” craters your throughput, that’s a sign your application relies on async write behavior and you need to validate durability assumptions.
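When the experiment is over, remove the override instead of leaving a forced setting behind:
cr0x@server:~$ sudo zfs inherit sync tank/db
cr0x@server:~$ zfs get -o name,property,value,source sync tank/db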
Task 16: Confirm ZFS is not silently throttling due to dirty data limits
cr0x@server:~$ egrep 'dmu_tx_dirty' /proc/spl/kstat/zfs/dmu_tx
dmu_tx_dirty_throttle           4    0
dmu_tx_dirty_delay              4    128
dmu_tx_dirty_over_max           4    0
Meaning: These counters track how often ZFS delayed or throttled writers because dirty data approached its limits. If dmu_tx_dirty_delay or dmu_tx_dirty_throttle keeps climbing during your workload, ZFS is pushing back on writers to control memory and sync behavior. That manifests as write stalls and latency spikes.
Decision: If you’re hitting dirty limits under normal workload, investigate memory sizing, background write patterns, and ZFS tunables—carefully. Don’t just raise limits and call it a day.
Task 17: Validate that your fio job is not overwriting the same blocks (trimmed illusion)
cr0x@server:~$ fio --name=wal-dsync-rand --filename=/tank/db/fio-wal-rand.log --size=8g --bs=8k --rw=randwrite --ioengine=psync --direct=1 --sync=dsync --numjobs=1 --runtime=60 --time_based=1 --group_reporting=1
wal-dsync-rand: (groupid=0, jobs=1): err= 0: pid=19311: Mon Dec 25 10:41:07 2025
write: IOPS=2100, BW=16.4MiB/s (17.2MB/s)(984MiB/60001msec)
clat percentiles (usec):
| 50.00th=[ 330], 95.00th=[ 1200], 99.00th=[ 4800], 99.90th=[21000]
Meaning: Random sync writes are harsher than sequential log appends. If your “WAL test” is random, you’re modeling something else (like data file sync).
Decision: Use sequential for WAL-like tests unless your database truly writes log blocks non-sequentially (rare). Keep this test as a stressor, not the primary model.
Fast diagnosis playbook
You’re on a call. Someone says “commits are slow” and pastes a single graph. Here’s the shortest path to identifying the bottleneck without
turning it into an archaeology project.
First: prove it’s actually synchronous and actually durable
- Check dataset sync and any parent dataset inheritance (a one-liner follows this list). If you find sync=disabled, stop the performance discussion and start the risk discussion.
- Confirm the database’s own durability settings (e.g., WAL fsync on, InnoDB log flush mode). If the DB is set to “relaxed durability,” don’t benchmark “strict durability” and vice versa.
- Trace one representative write with strace or database-level tracing to confirm the syscalls: fsync? fdatasync? O_DSYNC?
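Checking the whole inheritance chain in one pass is faster than spelunking dataset by dataset (pool name is illustrative):
cr0x@server:~$ zfs get -r -t filesystem -o name,value,source sync tank
Anything showing disabled, or a local override you didn’t expect, is where the conversation goes next.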
Second: isolate whether the pain is ZIL/SLOG latency or pool congestion
- While running a sync fio job, watch iostat -x on the SLOG devices. If they’re busy and w_await is high, your log device is the problem.
- If SLOG looks fine, watch the main vdev devices. If they’re saturated or have high latency, TXG syncing and pool contention are dragging everything (a per-vdev shortcut follows this list).
- Look for TXG sync time spikes in kernel logs. Multi-second TXG sync times correlate strongly with tail latency problems.
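If you’d rather not juggle raw device names, reasonably recent OpenZFS builds can show per-vdev latency histograms and queue depths, which makes the “SLOG or pool?” question visual:
cr0x@server:~$ zpool iostat -w tank 5
cr0x@server:~$ zpool iostat -q -v tank 5
The -w histograms break latency out by sync and async queues; the -q view shows how deep each vdev’s queues are running.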
Third: chase the usual amplifiers (they matter more than you want)
- Check ashift, RAID layout, and whether the pool is near full. A pool above ~80–85% used often sees worse fragmentation and allocation behavior.
- Check for competing workloads: snapshots, replication, scrubs, backups, compactions, other datasets on the same pool.
- Verify the SLOG is power-safe. If it isn’t, you can’t rely on it for database durability—performance is the wrong argument.
Three corporate mini-stories from the reliability trenches
1) Incident caused by a wrong assumption: “sync=disabled is just faster”
A mid-sized SaaS company migrated from a managed database to self-hosted Postgres to save cost and gain control. Storage was ZFS on a pair of
SSD mirrors. A well-meaning engineer set sync=disabled on the dataset because commits were “slow in benchmarks.”
For weeks, it looked great. Latency dropped. Charts were green. The team took a victory lap and moved on to more visible work, like product
features and building dashboards about dashboards.
Then a power event happened—not a dramatic datacenter fire, just a routine failure chain: UPS maintenance plus a breaker trip that didn’t
fail over the way everyone assumed it would. The hosts rebooted. Postgres came up. It even accepted connections.
The subtle horror was logical corruption. A small set of recently committed transactions were missing, and a different set were partially
applied. The application saw foreign key weirdness and reconciliation jobs started deleting “duplicates” that weren’t duplicates.
It took days to unwind, mostly because corruption is rarely polite enough to crash immediately.
The postmortem wasn’t complicated. The wrong assumption was: “ZFS is safe, so disabling sync is safe.” ZFS is safe when you keep its safety
contract. They had asked ZFS to lie to the database. ZFS complied. Computers are obedient like that.
2) Optimization that backfired: “Let’s add a cheap fast SLOG”
A financial services platform had a ZFS pool on enterprise SSDs, but sync-heavy workloads (audit logging plus a transactional database) were
showing p99 commit latency spikes. Someone suggested adding a SLOG. Good instinct.
Procurement got involved. They found “very fast” consumer NVMe drives at a fraction of the cost of enterprise models. The spec sheet was a
rainbow of IOPS numbers, the kind that make executives briefly believe physics is optional.
The team installed the drives as a mirrored SLOG and re-ran fio. The numbers looked phenomenal. Then production traffic hit. Tail latency got
worse. Not consistently—just enough to ruin SLAs intermittently.
The backfire was twofold. First, those drives had aggressive SLC caching behavior: bursts were great, sustained sync logging under concurrency
wasn’t. Second, they lacked meaningful power-loss protection, so the team couldn’t honestly claim durability. They had improved the mean and
poisoned the p99, while also increasing risk.
The fix was boring: replace the SLOG with PLP-capable devices chosen for low-latency writes under sync flush, not for max throughput. The
performance improved, and the risk story stopped being embarrassing.
3) Boring but correct practice that saved the day: “Test the actual fsync path”
A large enterprise ran a multi-tenant MySQL fleet on ZFS. They had an internal rule: any storage change required a “durability-path test.”
Not a generic benchmark. A test that modeled their real flush method and commit behavior.
A team planned a platform refresh: new servers, new NVMe pool, new SLOG. On paper it was an upgrade. During pre-prod tests, the standard
throughput benchmarks looked great. The durability-path fio test looked… off. p99.9 latencies were unexpectedly high.
Because they had the rule, they also had the tooling: fio jobs that matched their MySQL log flush pattern, strace verification, and an iostat
script to confirm the SLOG was being hit. They traced it quickly to a firmware setting on the log devices that altered flush behavior under
queue depth.
The vendor fixed the firmware/setting. The platform shipped. Nobody celebrated, which is the correct amount of celebration for “we didn’t ship
a latent data risk into production.”
The savings wasn’t a line item; it was the absence of a postmortem. Those are the best kind.
Common mistakes: symptoms → root cause → fix
1) “fio shows 200k IOPS, but the database commits at 2k/s”
Symptoms: fio numbers are huge; DB commit latency is still bad; graphs don’t correlate.
Root cause: fio test is buffered and not forcing durability (direct=0 + no fsync/dsync), or dataset sync=disabled in test but not in prod.
Fix: Re-run fio with fsync=1 or sync=dsync (matching DB mode), verify with strace, and confirm zfs get sync on the target dataset.
2) “Adding a SLOG made things worse”
Symptoms: average latency improved, but p99/p99.9 got worse; intermittent stalls; sometimes the SLOG is pegged.
Root cause: SLOG device has poor sustained sync write latency, lacks PLP, or suffers under concurrency due to internal caching behavior.
Fix: Use PLP-capable low-latency devices; mirror the SLOG; validate with concurrency fio tests; watch iostat -x for w_await and queue depth behavior.
3) “Sync writes are slow even with a good SLOG”
Symptoms: SLOG looks fine; main pool devices show high utilization; TXG sync times are long.
Root cause: pool congestion (checkpoint flushes, other tenants, scrubs, replication), dirty data throttling, or pool nearly full/fragmented.
Fix: Reduce competing writes, schedule scrubs/replication appropriately, add vdevs, keep pools with headroom, and validate TXG sync times.
4) “Latency is fine in short tests, terrible over hours”
Symptoms: 1–2 minute tests look great; sustained workloads degrade; periodic spikes.
Root cause: device SLC cache exhaustion, thermal throttling, background maintenance (GC), or TXG/dirty data cycles not captured by short runs.
Fix: Run longer tests (10–20 minutes), monitor device temperature and throttling, track p99.9, and test at steady-state fill levels.
5) “We toggled sync=always and performance cratered”
Symptoms: throughput drops massively; latency jumps; app starts timing out.
Root cause: workload was relying on async writes; application durability assumptions were weaker than expected; or log device/pool cannot sustain forced sync behavior.
Fix: Align durability expectations with business requirements, confirm DB settings, implement a proper SLOG or accept lower commit rates with correct durability.
6) “Random sync writes are catastrophic”
Symptoms: randwrite + dsync shows terrible tail latency; sequential tests look okay.
Root cause: you’re testing data file sync patterns (hard) rather than log append (easier). Also possible: RAIDZ write amplification and fragmentation.
Fix: Separate WAL tests (sequential) from data tests; consider mirrors for latency-sensitive DB workloads; keep pool headroom and sane recordsize.
Checklists / step-by-step plan
Step-by-step: build a sync-write benchmark you can defend in a postmortem
- Pick the database durability mode. Write down exactly how commits are made (fsync per commit? group commit? O_DSYNC?).
- Lock ZFS properties. Confirm sync, logbias, compression, atime, and any inheritance from parents.
- Confirm SLOG presence and type. zpool status, identify log vdevs, confirm mirroring, confirm the PLP story.
- Build two fio profiles: one for log sync writes (sequential 8–16K), one for data/checkpoint (larger, mixed, possibly random).
- Verify syscalls. Use strace on a short run. If it’s not issuing fsync/fdatasync/O_DSYNC the way you think, stop and fix the job.
- Run long enough. At least 2–5 minutes; longer if you care about p99.9.
- Collect the right metrics while running: iostat -x for SLOG and main vdevs, TXG sync time logs, CPU iowait, and fio latency percentiles (a wrapper sketch follows this list).
- Make a decision from p99/p99.9. If tail latency is unacceptable, treat it as a storage design issue, not a “tune the database until it stops complaining” issue.
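To make the collection step repeatable, a small wrapper that runs the durability-path job while logging device stats is usually enough. A minimal sketch, reusing the device names, pool name, and paths from the tasks above as placeholders:
#!/usr/bin/env bash
# Sketch: run a WAL-like sync-write fio profile while capturing storage stats.
# Device names, pool name, and file paths are placeholders; adjust to your layout.
set -euo pipefail

RUNTIME=600
OUT="walbench-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

# Background collectors: SLOG devices, main vdevs, and the per-vdev pool view.
iostat -x 5 nvme2n1 nvme3n1 nvme0n1 nvme1n1 > "$OUT/iostat.log" &
IOSTAT_PID=$!
zpool iostat -v tank 5 > "$OUT/zpool-iostat.log" &
ZPOOL_PID=$!

# Durability-path job: buffered 8k sequential writes, fsync after each write.
fio --name=wal-fsync --filename=/tank/db/fio-wal.log --size=8g --bs=8k \
    --rw=write --ioengine=psync --fsync=1 \
    --runtime="$RUNTIME" --time_based=1 \
    --percentile_list=50:95:99:99.9 \
    --output-format=json --output="$OUT/fio.json" \
    --group_reporting=1

kill "$IOSTAT_PID" "$ZPOOL_PID"
echo "Results collected in $OUT"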
Checklist: SLOG sanity for databases
- PLP-capable devices (or explicit risk acceptance documented).
- Mirror the SLOG for availability and to reduce single-device log loss scenarios.
- Optimize for latency consistency, not peak throughput marketing numbers.
- Monitor SLOG write latency under concurrency.
- Keep it simple: one good SLOG beats three questionable ones.
Checklist: when to suspect you’re lying to yourself
- Your fio job runs “too fast,” and the log devices show no writes.
- Short tests look amazing; long tests degrade sharply.
- Average latency looks fine, but the database is timing out (tail latency).
- You changed sync or DB flush settings “for performance” and did not run power-loss scenario validation.
- You cannot explain which syscall pattern your benchmark uses.
FAQ
1) Should I use fsync=1 or sync=dsync in fio for database log testing?
Use what your database uses. PostgreSQL commonly resembles buffered writes plus fsync. Some MySQL modes resemble O_DSYNC or O_DIRECT patterns.
If you don’t know, trace the database with strace during commits and match that.
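For PostgreSQL, for example, you can attach strace to the WAL writer or a busy backend during a burst of commits; the PID is a placeholder you’d look up with ps or pg_stat_activity:
cr0x@server:~$ sudo strace -f -tt -T -e trace=pwrite64,write,fsync,fdatasync,sync_file_range -p <pid>
A stream of pwrite64 calls followed by fdatasync is the classic Linux PostgreSQL pattern; whatever pattern you actually see is what your fio job should imitate.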
2) Is sync=disabled ever acceptable for databases?
Only if you’re explicitly choosing weaker durability (and the business agrees). It can be acceptable for ephemeral caches or rebuildable data.
For transactional systems, it’s a durability downgrade disguised as a tuning knob.
3) Does a SLOG improve normal async write throughput?
No. A SLOG is about synchronous operations. If your workload is mostly async writes, a SLOG won’t help much. It can still help if your
database does frequent fsync, which is exactly the point.
4) How big should my SLOG be?
Usually smaller than you think. SLOG sizing is about buffering a few seconds of sync write intent, not storing your database. The real
requirement is low-latency durable writes under sustained load. Overbuying capacity doesn’t fix bad latency.
5) Why does concurrency increase IOPS but worsen p99.9 latency?
You’re trading per-operation latency for throughput. More jobs increase queue depth, which can smooth utilization but amplify tail behavior,
especially if the device has internal write caching quirks or the pool is saturated.
6) Does recordsize affect sync write performance?
Not directly for ZIL records, but it affects how data blocks are written and can change checkpoint and read-modify-write behavior. For log
datasets, the key knobs are sync, SLOG quality, and overall pool health.
7) Should I use direct=1 for WAL benchmarks?
It depends. If your database writes WAL buffered and fsyncs, then direct=0 with fsync models it better. If the
database uses direct I/O or O_DSYNC, direct=1 may be appropriate. The correct answer is: match reality, then measure.
8) Is mirroring the SLOG necessary?
For serious databases, yes, if you’re using a SLOG at all. A lost log device at the wrong moment can turn a crash into a longer recovery or,
in worst cases, data loss of recent sync operations depending on timing and failure mode. Mirroring is cheap insurance compared to explaining
lost transactions.
9) Why do my fio numbers change after a reboot?
Cache warm-up, different device thermal state, background GC state on SSDs, and different fragmentation/metadata locality all matter. That’s
why long tests with steady-state conditions are more honest than “one run after boot.”
10) Can RAIDZ be good for databases on ZFS?
It can be fine for read-heavy or throughput-oriented workloads, but latency-sensitive sync-heavy databases often prefer mirrors because small
writes and tail latency behave better. If your priority is commit latency, mirrors are the default for a reason.
Conclusion: practical next steps
If you want to test ZFS for databases honestly, stop chasing the biggest throughput number and start chasing the most boring, repeatable,
durability-correct latency profile you can get.
Next steps that actually move the needle:
- Pick the database durability mode you run in production and write it down (fsync cadence, group commit, flush method).
- Verify ZFS dataset sync and SLOG configuration, and prove syscall behavior with strace.
- Run two fio profiles (WAL sync and data/checkpoint) long enough to capture tail latency and TXG cycles.
- When results are bad, use the fast diagnosis playbook: validate sync semantics, isolate SLOG vs pool, then hunt contention and amplifiers.
- If you need a SLOG, buy for PLP and latency consistency. If you can’t justify that, be honest about durability tradeoffs.
Performance engineering is mostly refusing to be impressed by your own charts. ZFS will do what you ask. Make sure you’re asking for truth.