ZFS for MySQL: Avoiding Latency Meltdowns Under Write Bursts

Everything is fine. QPS is steady. Latency is boring. Then a deploy lands, a queue drains, or a batch job “just for today” kicks in—and your MySQL p99 turns into a horror story. The graphs don’t spike politely; they cliff-dive, then flatline at “timeout”.

If you run MySQL on ZFS, you can absolutely get excellent performance and strong operational safety. You can also build a perfect machine for turning short write bursts into long latency meltdowns. ZFS is honest: it will do exactly what you asked, not what you meant.

The mental model: where MySQL meets ZFS and things get spicy

Start with the uncomfortable truth: MySQL cares about latency consistency more than peak throughput. Your application doesn’t page anyone because you lost 10% throughput. It pages you because p99 jumped from 5 ms to 800 ms and connections started timing out.

ZFS is not a thin shim on top of disks. It’s a transactional copy-on-write filesystem with its own caching (ARC), write gathering, checksumming, and a separate intent log mechanism (ZIL) for synchronous semantics. It is brilliant at making writes safe and verifiable. It is also capable of taking a smooth stream of small synchronous writes and turning them into a queueing problem—especially when the pool is busy, fragmented, or mis-provisioned for latency.

Where the writes come from in MySQL

InnoDB does multiple kinds of writing. Under bursty workloads, the ones that hurt most are:

  • Redo log writes (sequential, often synced frequently depending on innodb_flush_log_at_trx_commit).
  • Doublewrite buffer writes (write amplification by design, for crash safety).
  • Data page flushes (background, but bursts happen when dirty page pressure builds).
  • Binlog writes (can be synced, especially on replication/GTID setups).
  • Temporary tables / sort files (if misconfigured, these can churn storage during spikes).

Where ZFS puts those writes

ZFS writes into transaction groups (TXGs). Data is accumulated in memory, then committed periodically (commonly every few seconds). That smoothing is great until it isn’t—because when a commit hits, ZFS has to push a lot of dirty data and metadata out, and it may need to do it while your database is also asking for synchronous durability guarantees.

Synchronous is the key word. When MySQL does an fsync() or opens files with O_SYNC, it is demanding that the data be on stable storage before proceeding. On ZFS, synchronous writes are handled through the ZIL (ZFS Intent Log). If you add a separate log device (a SLOG), you’re changing where those synchronous log writes land.

ZIL is not a “write cache” for the whole pool. It is a mechanism to satisfy sync semantics safely. ZIL records are only ever read back after a crash; in normal operation they are simply discarded once the next TXG commit writes the same data through the regular path. For MySQL latency, the ZIL path is where seconds become milliseconds—or milliseconds become seconds.
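
If you want to watch the ZIL doing work on a Linux host, OpenZFS exposes kernel counters. A read-only sketch (the kstat path is standard on OpenZFS-on-Linux, though counter names can shift slightly between versions):

cr0x@server:~$ # sync commit activity: a rising zil_commit_count means fsync()s are hitting the intent log
cr0x@server:~$ grep -E 'zil_commit_count|zil_itx_metaslab' /proc/spl/kstat/zfs/zil

The zil_itx_metaslab_slog_* counters versus zil_itx_metaslab_normal_* roughly tell you whether those intent-log records are landing on a dedicated log device or on the main pool.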

Opinionated guidance: if you run MySQL with real durability settings (and you should), you are signing up for synchronous writes. Treat ZIL/SLOG and latency as first-class citizens, not afterthoughts.

Paraphrasing an idea Werner Vogels has long pushed: design for failure as a normal condition, not an exception. For MySQL on ZFS, that means planning the sync write path for the burst, not the average.

Facts and historical context that explain today’s failure modes

  • ZFS was born at Sun as an end-to-end data integrity filesystem: checksums everywhere, self-healing with redundancy. That “integrity first” DNA affects performance trade-offs today.
  • Copy-on-write isn’t optional in ZFS. Overwrites become allocate-new-and-update-metadata. Great for snapshots; potentially rough for random-write-heavy databases when space is tight.
  • The ZIL exists because POSIX demands sync semantics. It’s not a special feature for databases; it’s plumbing for correct behavior under fsync().
  • SLOG is a device, not a mode. People talk about “enabling SLOG”; you’re actually adding a separate log vdev to store ZIL records faster and more predictably.
  • ARC (Adaptive Replacement Cache) was designed to outperform classic LRU by balancing recency and frequency. It can make reads look magical—until it steals too much memory from the InnoDB buffer pool and the OS.
  • L2ARC came later to extend ARC to fast devices. It helps reads, but it also costs memory and write bandwidth to maintain, which is not free under write bursts.
  • MySQL’s doublewrite buffer exists to guard against partial-page (torn) writes on crash. On ZFS, copy-on-write means a record is either committed whole or not at all, so many operators disable the doublewrite buffer to cut write amplification; keeping it on is belt-and-suspenders at the cost of extra IO.
  • Dataset properties evolved as safety rails because admins kept shooting themselves in the foot. Things like atime=off, compression, and recordsize exist because default filesystem behavior is not “database-smart.”
  • Modern SSDs added their own latency surprises: SLC cache exhaustion, firmware garbage collection, and variable write amplification. Your “fast” SLOG can become a pumpkin under sustained sync writes.

How write bursts become latency meltdowns on ZFS

1) Sync write amplification: fsync storms meet limited IOPS

When MySQL is configured for durability—say, innodb_flush_log_at_trx_commit=1 and binlog sync enabled—each commit can require an fsync(). Group commit helps, but under bursty traffic you can still see a thundering herd of sync requests.

If ZFS has to put those sync writes onto the main pool, your latency becomes the pool’s latency. And the pool is also busy doing TXG commits, metadata updates, and possibly resilver/scrub. That’s how you get a system where throughput looks “fine” but every single transaction waits in line for durable storage.

2) TXG commit pressure: “smooth” writes become periodic pain

ZFS accumulates dirty data in memory and flushes it in TXGs. Under write bursts, you can hit dirty data limits, and ZFS will start throttling incoming writes to avoid runaway memory usage. This is correct behavior. It is also the moment your database threads stop doing useful work and start waiting on storage.

When a TXG flush is large, it competes with sync IO. Even with a SLOG, the pool still has to do the real work of writing data and metadata. The SLOG helps you acknowledge sync writes quickly, but you can still melt down later if the pool can’t keep up and the ZIL starts filling and waiting for commits.
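
You can at least see the relevant knobs before you ever need them. A read-only sketch on OpenZFS-on-Linux (these are standard module parameters; defaults depend on version and RAM):

cr0x@server:~$ # how much dirty data ZFS buffers before it starts throttling writers
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent
cr0x@server:~$ # how often a TXG is forced even without pressure (seconds)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout

Raising zfs_dirty_data_max buys buffering, not bandwidth: the pool still has to flush everything you accumulated, so treat changes here as experiments, not incident fixes.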

3) Space and fragmentation: the silent latency multiplier

ZFS wants free space. Not “some free space.” Real free space. As pools fill, allocation becomes harder. Blocks become more fragmented. Metadata updates become more scattered. Every write starts looking like a small random IO problem.

Databases are great at turning free space into “not free space.” If your pool is hovering at 80–90% used, you’re effectively load-testing ZFS allocation algorithms during peak traffic. Don’t do that in production unless you enjoy learning about your pager tone at 3 a.m.

4) Misfit recordsize and too much metadata work

InnoDB pages are commonly 16K. ZFS recordsize defaults to 128K for general-purpose workloads. If you leave it at 128K for a dataset storing InnoDB tablespaces, you can increase write amplification: a 16K page update can force ZFS to read, modify, and rewrite a much larger record, depending on access patterns and compression behavior.

It’s not always catastrophic—ZFS has some smarts, and sequential IO can still perform well—but under random updates and write bursts, wrong recordsize makes your pool work harder for the same database work.

5) The SLOG myth: “add an SSD and all sync latency disappears”

A SLOG is only as good as its durable write latency. Consumer SSDs can be fast until they hit a write cliff. They can also lie about flush behavior. For a SLOG, you want predictable latency under sustained writes and power-loss protection (or equivalent enterprise guarantees).

Joke #1: Buying a cheap SSD for SLOG is like hiring an intern to hold the building’s foundation—enthusiastic, but the physics department will have questions.

6) ARC vs InnoDB buffer pool: memory is a shared battlefield

ARC is aggressive and effective. InnoDB buffer pool is also aggressive and effective. If you let both fight for RAM, the kernel will eventually pick a winner, and it won’t be your uptime. You’ll see swapping, reclaim storms, and IO amplification as cached pages churn.

For MySQL, it’s usually better to size InnoDB buffer pool intentionally and cap ARC so ZFS doesn’t eat the rest. ZFS can survive with a smaller ARC; MySQL suffering random reads cannot.

Fast diagnosis playbook (first/second/third)

This is the “you have five minutes before leadership joins the incident channel” flow. It is not elegant. It is effective.

First: confirm the symptom is storage latency, not CPU or locks

  1. Check MySQL for lock waits and flush pressure: if threads are stuck on mutex/locks, storage tuning won’t help.
  2. Check OS load and IO wait: high %wa and rising load with low CPU usage points to IO backlog.
  3. Check ZFS pool IO latency and queue depth: identify whether the pool or SLOG is the choke point.

Second: decide if sync writes are the bottleneck

  1. Look for high sync write rates (redo/binlog) and fsync() behavior.
  2. Check if you have a SLOG and whether it’s saturated or slow.
  3. Confirm dataset sync setting (should generally be standard for durability; don’t “fix” incidents by lying).

Third: check pool health and “slow burn” factors

  1. Pool fullness: if you’re above ~80% used, you’ve found a contributor.
  2. Fragmentation: high fragmentation correlates with random-write pain.
  3. Scrub/resilver: if running, it can turn a manageable spike into a meltdown.
  4. Dirty data throttling: ZFS may be intentionally slowing you to stay alive.

Decision bias: in a write-burst incident, the top three culprits are (1) sync write path, (2) pool saturation/latency, (3) memory pressure causing cache churn. Don’t start by changing recordsize mid-incident; it only affects newly written blocks, and the root cause is almost always one of the three above.

Practical tasks: commands, outputs, and decisions (12+)

These are real operational moves: run a command, interpret the output, decide what to do next. Assume a Linux host with OpenZFS, pool tank, dataset tank/mysql, and MySQL data in /var/lib/mysql.

Task 1: See if the pool is obviously overloaded right now

cr0x@server:~$ zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.20T  1.10T    120   3800   18M   420M
  raidz2    6.20T  1.10T    120   3700   18M   410M
    sda         -      -     15    520  2.2M    58M
    sdb         -      -     14    515  2.1M    57M
    sdc         -      -     15    540  2.2M    60M
    sdd         -      -     15    530  2.2M    59M
logs            -      -      -     100     -    12M
  nvme0n1       -      -      -     100     -    12M

What it means: writes are heavy (3800 ops/s). The pool bandwidth is 420 MB/s with relatively low reads. If latency is high, you need the next command: IO latency, not just ops.

Decision: if writes are spiking and the pool is near full, plan to reduce write load (throttle batch jobs) and investigate sync and dirty throttling immediately.

Task 2: Check per-vdev latency to find the choke point

cr0x@server:~$ zpool iostat -v -l tank 1 5
                               operations         bandwidth            total_wait         disk_wait
pool                         read  write        read  write        read  write        read  write
--------------------------  -----  -----  ---------  ---------  ---------  ---------  ---------  ---------
tank                           90   4200        12M       480M        3ms     120ms        1ms     110ms
  raidz2                       90   4100        12M       470M        3ms     125ms        1ms     115ms
    sda                         9    580       1.2M        68M        2ms     140ms        1ms     130ms
    sdb                         9    570       1.2M        67M        2ms     138ms        1ms     128ms
    sdc                         9    600       1.3M        69M        2ms     142ms        1ms     132ms
    sdd                         9    590       1.3M        69M        2ms     141ms        1ms     131ms
logs                            0    120         0        14M        0ms       2ms        0ms       1ms
  nvme0n1                       0    120         0        14M        0ms       2ms        0ms       1ms

What it means: pool write total_wait is ~120 ms. The SLOG is fine (2 ms). The bottleneck is the main pool flush/commit work, not the log device.

Decision: focus on pool write capacity, fragmentation, free space, and dirty data throttling. A faster SLOG won’t fix this specific stall.

Task 3: Check pool fullness and fragmentation (predict trouble)

cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,frag,health tank
NAME  SIZE  ALLOC  FREE  CAPACITY  FRAG  HEALTH
tank  7.28T  6.20T  1.08T      85%   62%  ONLINE

What it means: 85% capacity and 62% fragmentation. That’s a latency tax, especially for random writes and metadata-heavy workloads.

Decision: prioritize capacity relief (add vdevs or migrate data). Treat “free space” as a performance feature, not a bookkeeping detail.

Task 4: Check whether a scrub or resilver is running

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 02:11:44 2025
        3.12T scanned at 1.2G/s, 1.55T issued at 600M/s, 6.20T total
        0B repaired, 25.00% done, 02:30:12 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        logs
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

What it means: scrub is actively issuing IO. On a pool already struggling with write latency, scrub can tip you into meltdown.

Decision: if you’re in an incident, consider pausing the scrub (policy-dependent). Resume after stabilizing latency.

Task 5: Pause/resume a scrub during an incident (if your policy allows)

cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub paused since Mon Dec 23 03:01:02 2025
        3.12T scanned at 1.2G/s, 1.55T issued at 600M/s, 6.20T total
        0B repaired, 25.00% done, 02:30:12 to go
errors: No known data errors

What it means: IO pressure should reduce. You’re trading off time-to-detect latent errors for production stability.

Decision: do it only if your organization accepts the risk; document it, resume later.

Task 6: Inspect dataset properties that affect MySQL behavior

cr0x@server:~$ zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,logbias,primarycache,secondarycache tank/mysql
NAME       PROPERTY        VALUE
tank/mysql recordsize      128K
tank/mysql compression     off
tank/mysql atime           on
tank/mysql sync            standard
tank/mysql logbias         latency
tank/mysql primarycache    all
tank/mysql secondarycache  all

What it means: recordsize is default 128K, compression is off, atime is on. For InnoDB, recordsize often should be smaller; atime usually should be off; compression is frequently a win on modern CPUs.

Decision: plan changes deliberately (especially recordsize). Turn off atime quickly; consider enabling compression; evaluate recordsize with workload context.

Task 7: Disable atime for the MySQL dataset

cr0x@server:~$ sudo zfs set atime=off tank/mysql
cr0x@server:~$ zfs get -o name,property,value atime tank/mysql
NAME       PROPERTY  VALUE
tank/mysql atime     off

What it means: reads won’t generate metadata writes to update access times. Small win, but consistent.

Decision: do this unless you have a specific audit workload depending on atime (rare for MySQL data dirs).

Task 8: Enable compression (usually lz4) to reduce IO under bursts

cr0x@server:~$ sudo zfs set compression=lz4 tank/mysql
cr0x@server:~$ zfs get -o name,property,value compression tank/mysql
NAME       PROPERTY     VALUE
tank/mysql compression  lz4

What it means: ZFS will compress new blocks. For databases, this often reduces write bandwidth and can improve latency, unless CPU is already pegged.

Decision: if CPU headroom exists, enable it. If CPU is saturated during incidents, test first.

Task 9: Check ARC size and pressure (avoid memory fights)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  avail
12:10:01   980   120     10    40   33    60   50    20   17   58G   60G    3G
12:10:02  1020   140     12    55   39    70   50    15   11   59G   60G    2G
12:10:03  1005   160     14    70   44    75   47    15    9   60G   60G    1G

What it means: ARC is at its target (c), and available memory is collapsing. If the host starts swapping, MySQL will suffer badly.

Decision: cap ARC (zfs_arc_max) so MySQL buffer pool has stable RAM. Don’t try to “win” by letting both caches grow unchecked.
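
A minimal sketch of capping ARC on OpenZFS-on-Linux (the 16 GiB value is purely illustrative; size it from your RAM budget, leaving room for the buffer pool, connections, and the OS):

cr0x@server:~$ # runtime cap; ARC shrinks gradually rather than instantly
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ # persist across reboots (some distros also need an initramfs rebuild to pick this up)
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee /etc/modprobe.d/zfs.conf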

Task 10: Check actual swapping and IO wait at the OS level

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12  3  524288  31200  10240  88400  150  220   120  9800 2100 6400 12  8 62 18  0
10  2  524288  28000  10080  87000  180  260   110 10500 2200 6500 11  9 60 20  0
14  4  524288  25000   9980  86500  210  300    90 12000 2300 6700 10 10 58 22  0
11  3  524288  24000   9950  86000  190  280   100 11000 2250 6600 11  9 59 21  0
13  4  524288  23000   9900  85500  220  310    95 12500 2350 6800 10 10 56 24  0

What it means: swap in/out (si/so) is active and wa is high. This is a classic “latency got worse because memory got worse” scenario.

Decision: reduce ARC, reduce MySQL memory pressure, or add RAM. In the short term, stop the worst write burst source and stabilize.

Task 11: Confirm the ZIL/SLOG devices and their health

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        logs
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

What it means: there is a SLOG (nvme0n1), and it’s healthy.

Decision: if you have sync latency but SLOG is healthy and low-latency, the main pool is likely the limiting factor (as in Task 2).

Task 12: Verify dataset is mounted where MySQL actually stores data

cr0x@server:~$ findmnt /var/lib/mysql
TARGET         SOURCE      FSTYPE OPTIONS
/var/lib/mysql tank/mysql  zfs    rw,xattr,noacl

What it means: MySQL is indeed on the ZFS dataset you think it is. You’d be surprised how often it isn’t.

Decision: if it’s not, stop “tuning” the wrong dataset and fix your mount/layout first.

Task 13: Inspect MySQL durability settings that drive sync writes

cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog','innodb_doublewrite','innodb_flush_method');"
+--------------------------------+------------+
| Variable_name                  | Value      |
+--------------------------------+------------+
| innodb_doublewrite             | ON         |
| innodb_flush_log_at_trx_commit | 1          |
| innodb_flush_method            | O_DIRECT   |
| sync_binlog                    | 1          |
+--------------------------------+------------+

What it means: this is full-durability mode: redo flushed every commit and binlog synced every commit. Great for correctness. Harder on storage during bursts.

Decision: do not change these casually. If business accepts risk (some do for caches, not for money), consider sync_binlog > 1 or innodb_flush_log_at_trx_commit=2, but only with explicit sign-off.

Task 14: Check current MySQL checkpoint/dirty page pressure signals

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs'; SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';"
+-------------------------------+--------+
| Variable_name                 | Value  |
+-------------------------------+--------+
| Innodb_buffer_pool_pages_dirty| 245000 |
+-------------------------------+--------+
+---------------------+--------+
| Variable_name       | Value  |
+---------------------+--------+
| Innodb_os_log_fsyncs| 185000 |
+---------------------+--------+
+-----------------+------+
| Variable_name   | Value|
+-----------------+------+
| Innodb_log_waits| 420  |
+-----------------+------+

What it means: many dirty pages plus a non-zero Innodb_log_waits suggest log flush bottlenecks (the log buffer filling up while waiting on flushes). Under write bursts, this is consistent with storage sync pressure.

Decision: correlate with ZFS latency. If ZFS write wait is high, fix storage path; if ZFS looks fine, examine MySQL log configuration and CPU scheduling.

Task 15: Spot processes stuck in IO (quick and dirty)

cr0x@server:~$ ps -eo pid,comm,state,wchan:30 | egrep 'mysqld|z_wr_iss|z_wr_int|z_wr|txg|sync' | head
  2141 mysqld           D io_schedule
  2147 mysqld           D io_schedule
  2153 mysqld           D io_schedule
  1103 z_wr_iss         D cv_wait
  1104 z_wr_int         D cv_wait

What it means: MySQL threads in D state waiting in the IO scheduler indicate storage stalls. ZFS writer threads waiting can also indicate internal throttling/commit pressure.

Decision: validate with zpool iostat -l and OS disk stats. If confirmed, reduce write load and address pool limits.

Task 16: Check per-device latency and saturation from Linux (complements ZFS view)

cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    8.20   18.40    0.00   61.30

Device            r/s     w/s   rKB/s   wKB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
sda               9.0   580.0   1200   68000     228.0     45.0   135.0     3.0    137.0   1.6  99.0
sdb               9.0   570.0   1180   67000     228.0     44.0   132.0     3.0    134.0   1.6  98.5
nvme0n1           0.0   120.0      0   14000     233.0      0.3     2.0     0.0      2.0   0.2   2.4

What it means: HDDs are pegged at ~99% util with ~130 ms await on writes. NVMe (SLOG) is fine. This matches ZFS’s story: pool write latency is the problem.

Decision: you need more IOPS (more vdevs, different topology), less random write pressure (tuning, batching), or more headroom (free space).

ZFS knobs that matter for MySQL (and which ones are traps)

Recordsize: align to workload, not ideology

For InnoDB, common practice is recordsize=16K on the dataset storing tablespaces. The reasoning: InnoDB modifies 16K pages, and smaller records can reduce write amplification and improve latency under random updates.

But don’t cargo-cult. If your workload is mostly large sequential scans, backups, or analytics reading big ranges, a larger recordsize can help. If you store binary logs or backups on the same dataset (don’t), you’ll have conflicting needs.

Practical stance:

  • Put MySQL data on its own dataset.
  • For OLTP-heavy InnoDB, start at recordsize=16K (see the sketch after this list).
  • For a mixed workload, test 16K vs 32K.
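
Applying that stance to the running example is one property change. Note it only affects blocks written after the change; existing tablespaces keep their old block size until rewritten (table rebuild, dump/restore, or file copy):

cr0x@server:~$ sudo zfs set recordsize=16K tank/mysql
cr0x@server:~$ zfs get -o name,property,value recordsize tank/mysql
NAME       PROPERTY    VALUE
tank/mysql recordsize  16K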

Compression: lz4 is usually free money

Compression reduces bytes written and read. For write bursts, that matters because you’re trying to push less physical IO through the same bottleneck. On modern CPUs, lz4 is typically a net win.

Edge cases exist: if CPU is already saturated during bursts, compression can move you from IO-bound to CPU-bound. That’s still often better (CPU saturation is easier to scale than storage latency), but measure.

sync and logbias: stay honest, but choose your poison

sync=standard is the normal, honest setting: respect application sync requests. sync=disabled is lying to applications: it acknowledges sync writes before they’re durable. It’s sometimes used for ephemeral data or caches, but for real databases it is a data-loss footgun.

logbias=latency vs logbias=throughput is frequently misunderstood. For MySQL, you typically care about latency. If you have a SLOG, logbias=latency is reasonable. If you don’t have a SLOG and your pool is slow, changing logbias won’t create hardware. It can, however, change how much goes to ZIL vs main pool in certain cases.

Joke #2: Setting sync=disabled to “fix latency” is like removing your smoke alarm because it keeps beeping during fires.

SLOG: what it actually does and when it helps

A SLOG helps when:

  • Your workload does many synchronous writes.
  • Your main pool has worse latency than a good SSD/NVMe.
  • The SLOG device has power-loss protection and low, consistent write latency.

A SLOG does not help when:

  • You are bottlenecked by TXG commits to the main pool (pool can’t flush fast enough).
  • Your workload is mostly asynchronous writes (no fsync pressure).
  • Your “SLOG” is a consumer SSD that collapses under sustained writes or lies about flushes.

Operational advice: mirror the SLOG if your risk profile demands it. Losing the log device only loses data if it fails at the same moment as a crash with unreplayed sync records on it, but even a graceful failure drops sync latency back onto the main pool and hands you an incident trigger. For business-critical MySQL, mirrored SLOG is boring and correct.
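
A sketch of converting the running example’s single log device into a mirror (nvme1n1 is a placeholder; in real life use stable /dev/disk/by-id paths and a power-loss-protected device):

cr0x@server:~$ # attach a second device to the existing log vdev, turning it into a mirror
cr0x@server:~$ sudo zpool attach tank nvme0n1 nvme1n1
cr0x@server:~$ zpool status tank | grep -A3 logs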

Special vdev: metadata and small blocks can be your hidden win

If your pool is HDD-based, consider a special vdev (on SSD) for metadata and small blocks. Databases create metadata churn: indirect blocks, directory structures, space maps. Putting those on fast media can reduce latency and improve consistency.

But: special vdev is not optional redundancy. If it dies and it’s not redundant, you can lose the pool. Treat it like a top-tier vdev with proper mirroring.
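
A sketch, assuming two spare SSDs (sde and sdf are placeholders) on an HDD pool like the running example. Expect zpool to warn about mismatched replication levels when adding a mirror to a raidz2 pool; read the warning before forcing, and remember that on pools with raidz vdevs you generally cannot remove the special vdev later:

cr0x@server:~$ # mirrored special vdev for metadata (and optionally small data blocks)
cr0x@server:~$ sudo zpool add tank special mirror sde sdf
cr0x@server:~$ # optional: route small data blocks there too; if this value is >= the dataset recordsize,
cr0x@server:~$ # ALL data blocks of that dataset land on the special vdev, so set it deliberately
cr0x@server:~$ sudo zfs set special_small_blocks=4K tank/mysql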

Primarycache/secondarycache: stop double-caching yourself into swap

MySQL already has a buffer pool. ZFS has ARC and potentially L2ARC. If you cache everything twice, you waste RAM and increase eviction churn under bursts.

Many shops set primarycache=metadata on the MySQL dataset to avoid ARC caching data pages that InnoDB already caches, while keeping metadata hot. This can help stabilize memory usage. It’s not universal; measure your read patterns.
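
A sketch of that policy on the running dataset. It only influences what ARC caches from now on, and it is a judgment call, so measure read misses before and after:

cr0x@server:~$ sudo zfs set primarycache=metadata tank/mysql
cr0x@server:~$ zfs get -o name,property,value primarycache tank/mysql
NAME       PROPERTY      VALUE
tank/mysql primarycache  metadata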

Ashift and vdev layout: the “you can’t tune your way out of geometry” rule

If your pool is built with the wrong sector alignment (ashift too small), you can suffer write amplification forever. If your vdev layout is built for capacity instead of IOPS (for example, wide RAIDZ for a latency-sensitive OLTP database), you can get predictable pain.

Opinionated rule: for latency-sensitive MySQL, mirrored vdevs are the default choice unless you have a very clear reason not to. RAIDZ can work, but it makes small random writes harder and rebuild behavior more complex.
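
Two quick sketches. Reading ashift is safe on any pool; the create example is for new pools only (device and pool names are placeholders, and ashift=12 assumes 4K-sector disks):

cr0x@server:~$ # the pool property may read 0 (auto-detect); zdb shows what the vdevs actually use
cr0x@server:~$ zpool get ashift tank
cr0x@server:~$ sudo zdb -C tank | grep ashift
cr0x@server:~$ # new-pool sketch: two mirrored pairs instead of one wide RAIDZ, buying IOPS with capacity
cr0x@server:~$ sudo zpool create -o ashift=12 fastpool mirror sda sdb mirror sdc sdd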

MySQL/InnoDB knobs that interact with ZFS

Durability settings: make the risk explicit

The biggest write-burst latency lever in MySQL is how often you force durable flushes:

  • innodb_flush_log_at_trx_commit=1: durable redo every commit. Best safety, highest sync pressure.
  • innodb_flush_log_at_trx_commit=2: write redo to the OS every commit, fsync about once per second. Less sync pressure; risk of losing up to ~1 second of transactions on an OS crash or power loss (a mysqld-only crash loses nothing).
  • sync_binlog=1: fsync binlog every commit. Strong replication correctness; sync pressure.

If you choose to relax these, do it as a business decision, not a midnight performance hack. For many companies, losing up to one second of data is acceptable for some systems (analytics staging) and unacceptable for others (money, inventory, auth).
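
If the business does sign off on relaxing durability for a specific system, make the change explicit and reversible. A hedged sketch (values are illustrative; both variables are dynamic, but persist whatever you choose in my.cnf or with SET PERSIST):

cr0x@server:~$ # relaxed posture: up to ~1s of commits at risk on OS crash, binlog fsync every 1000 commits
cr0x@server:~$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2; SET GLOBAL sync_binlog = 1000;"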

Flush method: avoid double buffering

innodb_flush_method=O_DIRECT is commonly used to avoid the OS page cache for InnoDB data files, reducing double buffering. On ZFS, the interaction is nuanced: ZFS caches through ARC rather than the regular page cache, and many OpenZFS versions accept O_DIRECT without fully bypassing ARC, but the setting is still widely used and rarely hurts.

What you’re trying to avoid: MySQL writes data, OS caches it, ZFS caches it again, memory evaporates, and then the kernel starts swapping. That’s not a performance strategy; it’s a cry for help.

Dirty page tuning: bursts are amplified by backlog

When the buffer pool accumulates too many dirty pages, MySQL has to flush aggressively. That can turn a moderate write burst into a storage stampede. Tune:

  • innodb_max_dirty_pages_pct and innodb_max_dirty_pages_pct_lwm
  • innodb_io_capacity and innodb_io_capacity_max (set based on real storage capability)
  • innodb_flush_neighbors (often 0 on SSD pools; more nuanced on HDD)

Don’t set innodb_io_capacity to “a big number” because you bought fast disks. Set it to what the pool delivers during mixed read/write under load, not the spec sheet.
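
A sketch of setting it from measurement rather than the spec sheet (the numbers are placeholders; derive yours from what the pool actually sustained under mixed load in the Task 2/Task 16 style observations):

cr0x@server:~$ # both are dynamic; persist in my.cnf (or SET PERSIST) once validated under load
cr0x@server:~$ mysql -e "SET GLOBAL innodb_io_capacity = 2000; SET GLOBAL innodb_io_capacity_max = 4000;"
cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';"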

Binary logs and temp files: don’t colocate your pain

Put binlogs on a dataset tuned for sequential writes, potentially with larger recordsize, and keep them from competing with tablespaces if you can. Same for tmpdir if you do heavy sorts. Co-locating everything on one dataset is how you create “mystery latency” during bursts.
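
A sketch of that split using the article’s naming (mount points and the binlog recordsize are illustrative; move existing files with MySQL stopped or during a controlled switchover):

cr0x@server:~$ sudo zfs create -o recordsize=128K -o compression=lz4 -o atime=off -o mountpoint=/var/lib/mysql-binlog tank/mysql-binlog
cr0x@server:~$ sudo zfs create -o compression=lz4 -o atime=off -o mountpoint=/var/lib/mysql-tmp tank/mysql-tmp
cr0x@server:~$ # then point MySQL at them in my.cnf, e.g. log_bin=/var/lib/mysql-binlog/binlog and tmpdir=/var/lib/mysql-tmp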

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “We added a SLOG, so sync latency is solved”

The company had a MySQL primary on a ZFS pool backed by HDDs. They experienced occasional p99 spikes during traffic bursts—mostly around marketing sends and end-of-month jobs. Someone did the right thing halfway: they added a fast NVMe as SLOG. Latency improved in steady state, so the change got labeled “fixed.”

Months later, a bigger burst hit. p99 went through the roof again, but this time the on-call was confident it couldn’t be storage because “we have SLOG.” They chased ghosts in query plans and connection pools while the pool sat at high utilization. Users kept retrying, which created more writes, which created more latency—classic self-feeding failure.

When they finally looked at zpool iostat -l, the story was blunt: SLOG writes were low-latency, but the main pool write wait time was enormous. TXG commits were the bottleneck. The ZIL could acknowledge writes quickly, but it couldn’t make the pool flush dirty data faster.

The fix wasn’t another log device. It was boring: add more vdevs to increase IOPS, reduce pool fullness, and separate workloads so binlogs and temp files didn’t fight with tablespaces. The wrong assumption was thinking sync latency equals SLOG latency; in reality, sync durability still depends on the pool staying ahead of accumulated dirty work.

An optimization that backfired: “Let’s crank recordsize up for throughput”

A different team had a mostly SSD-based pool and wanted better bulk load performance. They changed the MySQL dataset recordsize to 1M because they saw that recommended for large sequential workloads. Bulk loads got faster. They declared victory and moved on.

Then OLTP latency started oscillating. Not always, but during bursts. Small updates to hot rows began causing disproportionate IO. ZFS was rewriting large records more often, and metadata churn increased. The pool could handle the bandwidth, but latency distribution got ugly. The business didn’t care that nightly loads finished sooner; they cared that checkout sometimes took 900 ms.

The worst part was operational: snapshots got larger and replication took longer because changes touched bigger blocks. Recovery time objectives quietly got worse. Nobody meant to change RTO, but they did.

The eventual fix was segmentation: separate dataset(s) for bulk-loaded tables with a larger recordsize, keep OLTP tablespaces at 16K or 32K, and stop treating “MySQL” like one monolithic IO pattern. The backfire wasn’t that large recordsize is always wrong; it was applying a single tuning knob to multiple conflicting workloads.

A boring but correct practice that saved the day: “Headroom and staged throttling”

One organization ran MySQL on ZFS for a core internal system. Nothing glamorous: business workflows, some bursts during the workday, occasional backfills. They had a strict rule: the pool never goes above a conservative fullness threshold, and every write-heavy job has a throttle knob wired into the scheduler.

When a bug caused an upstream service to retry aggressively, write traffic tripled in minutes. It was the kind of incident that usually becomes a multi-hour performance meltdown. But this time the database got slow, not dead. Latency rose, but it didn’t spiral into timeouts.

Why? Two reasons. First, there was free space and low fragmentation, so allocation didn’t degrade under pressure. Second, the on-call could quickly dial down the backfill and the batch queue without changing durability settings. The system had escape hatches that didn’t involve lying to the filesystem.

Afterward, the postmortem was almost dull. That was the point. The “boring practice” was capacity headroom plus operational controls for write bursts. Not a magic sysctl. Not a heroic rebuild. Just disciplined planning and the ability to reduce write load intentionally.

Common mistakes: symptom → root cause → fix

1) p99 spikes during bursts, SLOG looks fine

Symptom: zpool iostat -l shows low log latency but high pool write wait; MySQL shows log waits and timeouts.

Root cause: main pool cannot flush TXGs fast enough; pool is saturated, fragmented, or too few vdevs for IOPS.

Fix: add vdevs or move to mirror topology, reduce pool fullness, reduce competing workloads, consider special vdev for metadata, tune MySQL dirty page behavior.

2) Latency suddenly worsens after pool crosses ~80% used

Symptom: no config changes, but writes get slower over weeks; fragmentation climbs.

Root cause: allocation and fragmentation penalties on a near-full copy-on-write filesystem; more scattered writes and metadata updates.

Fix: add capacity (preferably by adding vdevs, not replacing disks one-by-one), rebalance by migrating datasets, enforce free-space SLOs.

3) “Fixed” by setting sync=disabled, then a crash causes data loss

Symptom: latency looks great until an unexpected reboot; after restart, MySQL tables are corrupted or missing recent commits.

Root cause: synchronous semantics were disabled; applications received acknowledgments before data was durable.

Fix: set sync=standard (or always if you must), use a proper SLOG, and address the real performance bottleneck.

4) Swapping begins during bursts and never fully recovers

Symptom: vmstat shows swap activity; MySQL stalls even after burst ends; ARC remains large.

Root cause: ARC and MySQL buffer pool compete; memory pressure triggers reclaim and swap, increasing IO and latency.

Fix: cap ARC, right-size MySQL buffer pool, avoid L2ARC unless you can afford the memory overhead, add RAM if needed.

5) Writes are fast until scrub/resilver runs, then p99 explodes

Symptom: correlated with zpool status showing scan activity.

Root cause: maintenance IO competes with production writes; pool has no headroom.

Fix: schedule scrubs off-peak, ensure pool has IOPS margin, use IO scheduling/priority controls where available, pause scrubs during incidents per policy.

6) After “adding SSD cache,” performance got worse under writes

Symptom: L2ARC enabled; under bursts, write latency increases; memory usage rises.

Root cause: L2ARC maintenance costs (metadata, writes to L2ARC) increase pressure; memory overhead reduces headroom.

Fix: disable L2ARC for write-heavy OLTP systems unless you have a measured read-miss problem and sufficient RAM.

7) Random stalls every few seconds like clockwork

Symptom: periodic latency spikes aligned with TXG commit intervals.

Root cause: TXG flush bursts causing queueing; pool can’t sustain the flush workload smoothly.

Fix: increase pool write capability, reduce dirty data production (MySQL tuning), check for background jobs causing bursts, ensure SLOG is not the only “performance plan.”

Checklists / step-by-step plan

Step-by-step: build (or rebuild) ZFS specifically for MySQL burst resilience

  1. Pick a vdev layout for IOPS first: mirrored vdevs as the baseline for OLTP. Capacity comes from more mirrors, not wider RAIDZ.
  2. Keep the pool under a capacity SLO: set an internal rule (for example, alert at 70%, act by 80%). The exact number depends on workload, but “run it to 95%” is not serious engineering.
  3. Use a real SLOG if you need sync performance: low-latency, power-loss protected, and preferably mirrored.
  4. Consider a special vdev on HDD pools: mirrored, sized for metadata and small blocks.
  5. Create separate datasets:
    • tank/mysql for InnoDB tablespaces
    • tank/mysql-binlog for binlogs
    • tank/mysql-tmp for tmpdir if needed
  6. Set dataset properties intentionally:
    • atime=off
    • compression=lz4
    • recordsize=16K (start point for OLTP tablespaces)
    • logbias=latency (typical for DB datasets)
  7. Decide on caching policy: often primarycache=metadata on tablespaces, keep default on binlogs if you read them frequently for replication/backup workflows.
  8. Cap ARC based on total RAM and MySQL buffer pool: leave headroom for the OS, page tables, connections, and burst absorption.
  9. Load test write bursts: not just average load. Simulate a burst pattern and watch p99, not just throughput.
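
For step 9, you don’t need custom tooling to start. A rough fio sketch that imitates fsync-heavy burst behavior (paths, sizes, and job counts are placeholders, and it writes real data, so aim it at a scratch dataset, never the live datadir):

cr0x@server:~$ sudo fio --name=sync-burst --directory=/tank/fio-test --rw=randwrite --bs=16k \
    --ioengine=psync --fdatasync=1 --numjobs=8 --size=2G --runtime=120 --time_based --group_reporting

Watch fio’s latency percentiles and zpool iostat -l side by side; throughput alone will tell you everything is fine right up until p99 says otherwise.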

Operational checklist: when you deploy a write-heavy change

  • Verify pool capacity and fragmentation. If you’re already near your threshold, delay the write-heavy job or add capacity first.
  • Confirm scrub/resilver isn’t scheduled to overlap.
  • Ensure batch jobs have throttles and can be paused without code changes.
  • Baseline zpool iostat -l and MySQL log fsync metrics before the change.
  • Alert on p99 latency, Innodb_log_waits, and ZFS write wait time together. Single-metric alerts lie.

Incident checklist: the “don’t make it worse” rules

  • Do not set sync=disabled on the data dataset as an incident mitigation unless you are explicitly accepting data loss.
  • Do not change recordsize mid-incident expecting instant relief; it affects new writes, and the root cause is usually elsewhere.
  • Do pause scrubs/resilvers if your policy allows and the pool is melting down.
  • Do reduce write load: throttle jobs, reduce retries, shed non-critical writes, and stop backfills.
  • Do capture evidence: zpool iostat -l, iostat -x, MySQL status counters. Your future self will want receipts.

FAQ

1) Should I run MySQL on ZFS at all?

Yes, if you want strong integrity guarantees, snapshots, and sane administration. But you must design for latency: proper vdev layout, headroom, and a plan for synchronous writes.

2) Is SLOG mandatory for MySQL?

Not always. If your workload is mostly async writes or you accept relaxed durability, you may not need it. If you run with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 under bursty traffic, a good SLOG is often the difference between “fine” and “incident.”

3) Can I fix write burst latency by setting sync=disabled?

You can reduce latency and also reduce truth. It acknowledges sync writes before they are durable, which can lose committed transactions on crash. Use it only for non-critical, reconstructable data.

4) What recordsize should I use for InnoDB?

Common starting point is recordsize=16K for the tablespace dataset. Test 16K vs 32K if you have mixed patterns. Keep separate datasets for different IO patterns (binlogs, backups).

5) Does ZFS compression help databases?

Often yes. lz4 can reduce physical writes and reads, which helps under bursts and reduces wear. Validate CPU headroom and measure p99 latency, not just throughput.

6) Should I enable L2ARC for MySQL?

Usually not for write-heavy OLTP. L2ARC has memory overhead and write maintenance costs. If you have a proven read-miss problem and plenty of RAM, it can help, but it’s not a default move.

7) Why does performance degrade as the pool fills up?

Copy-on-write allocation becomes harder with less free space; fragmentation and metadata overhead increase. Your “simple write” becomes more random IO and more bookkeeping. Keep headroom.

8) RAIDZ or mirrors for MySQL?

For latency-sensitive OLTP, mirrors are the safe default because they provide more IOPS and more predictable latency. RAIDZ can work, but it’s easier to hit latency cliffs under small random writes and high utilization.

9) How do I know if the bottleneck is sync writes or background flushing?

Correlate MySQL Innodb_log_waits and fsync counters with ZFS zpool iostat -l (log and pool wait times). If log latency is high, look at SLOG and sync path. If pool wait is high, look at TXG flush pressure, pool fullness, and vdev layout.

10) Can special vdevs help MySQL latency?

Yes, especially on HDD pools, by accelerating metadata and small-block IO. Mirror them and treat them as critical; losing a special vdev can mean losing the pool.

Conclusion: next steps that actually reduce risk

If you run MySQL on ZFS and you fear write bursts, you don’t need mysticism. You need three things: a correct mental model, honest durability choices, and enough IOPS/headroom that ZFS can do its job without throttling the database into the ground.

Practical next steps:

  1. Run the fast diagnosis playbook on a quiet day. Capture baseline zpool iostat -l, iostat -x, and key MySQL counters.
  2. Audit pool fullness and fragmentation. If you’re above your safe threshold, treat capacity as an urgent performance fix.
  3. Confirm your sync write path: dataset sync is honest, SLOG (if present) is enterprise-grade and healthy, and you understand whether the pool flush is the real choke point.
  4. Split datasets by IO pattern and set properties intentionally (atime off, lz4 on, recordsize appropriate).
  5. Cap ARC so memory doesn’t become the hidden trigger for storage incidents.
  6. Build operational throttles for write-heavy jobs. Your best latency fix during a burst is often “stop the extra writes,” not “change the filesystem under fire.”

Do this, and the next write burst becomes a controlled slowdown, not a midnight meltdown. The goal isn’t heroic recovery. It’s boring graphs.
