ZFS IO Scheduler Choices: mq-deadline vs none for HDD, SSD, and NVMe

You have a ZFS pool that benchmarks fine on a sunny afternoon and then turns into a pumpkin at 2 a.m.
Scrubs crawl. Latency spikes. The database starts timing out, and everyone stares at “%util” like it’s going to confess.
Somewhere in the stack, the Linux IO scheduler is either helping you… or politely making things worse.

The punchline: with ZFS, you don’t “tune the scheduler” so much as you stop it from fighting the rest of your system.
This is a decision article: what to set on HDD, SATA/SAS SSD, and NVMe; how to prove it with commands; and how to diagnose
the ugly failure modes when reality disagrees with your assumptions.

The one-sentence rule (what to set)

If you want a default that survives contact with production: use mq-deadline for rotational disks and most SATA/SAS SSDs,
and use none for NVMe (and other high-end devices with their own deep hardware queues).

That’s the baseline. You deviate only when you can explain, with measurements, what you’re optimizing: tail latency, fairness between jobs,
or a specific device/driver quirk.
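
If you want the thirty-second status check before reading further, one glance at sysfs (nothing assumed beyond a modern kernel) shows the active scheduler, in brackets, for every block device:

# Active scheduler is the bracketed entry on each line
grep -H . /sys/block/*/queue/scheduler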

What Linux IO schedulers actually do under blk-mq

The IO scheduler is not “a performance mode switch.” It is a policy engine sitting between the block layer and the device driver,
deciding how requests are ordered, merged, and dispatched. On modern Linux, most block devices use the multi-queue block layer (blk-mq),
which changes the scheduler story compared to the old single-queue era.

blk-mq changed the problem: from one elevator to many lanes

The classic “elevator” schedulers (deadline, CFQ, noop) were built for a single request queue feeding a single device queue.
blk-mq adds multiple software submission queues (often per-CPU) mapped to one or more hardware dispatch queues. The scheduler may now run
per hardware context, and the device itself might do aggressive reordering (NVMe is particularly good at this).

In practical terms: if the hardware already does deep queueing and smart dispatch, adding another layer of cleverness in software can
increase latency variance and CPU overhead without improving throughput.
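
You can see those lanes directly in sysfs on kernels that expose blk-mq details (most modern ones do; nvme0n1 is just an example device). This shows how many hardware contexts blk-mq created and which CPUs feed each one:

# One directory per hardware dispatch queue (hctx)
ls /sys/block/nvme0n1/mq/
# Which CPUs submit into each hardware queue
cat /sys/block/nvme0n1/mq/*/cpu_list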

mq-deadline: bounded waiting and “don’t starve reads”

mq-deadline is the blk-mq version of deadline. It keeps separate read and write queues, tracks expiration times,
and dispatches in a way that tries to prevent starvation. It also performs request merging when possible.

On rotational media, this matters. HDDs have terrible random latency; if you let writes queue forever, reads can starve and your application
will interpret it as “the storage is dead.” Deadline tries to cap that damage. On SATA SSDs, it can still help by smoothing bursts and keeping
latency from going feral under mixed workloads.
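
Once mq-deadline is the active scheduler, its knobs live under queue/iosched. A quick dump (sda is an example device; the exact tunables and defaults vary slightly by kernel version) shows what it's working with:

# read_expire/write_expire are deadlines in milliseconds; writes_starved caps how
# many read batches can be dispatched before waiting writes must be served
grep -H . /sys/block/sda/queue/iosched/*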

none: “hands off” (but not “no queueing”)

none means the block layer does minimal scheduling beyond what blk-mq inherently does. Requests still queue—just not with an
additional elevator policy. This is typically best for NVMe because:

  • NVMe devices have multiple hardware queues and sophisticated internal schedulers.
  • NVMe thrives on parallelism; extra software ordering can reduce concurrency.
  • CPU overhead matters at high IOPS; “none” reduces scheduler work.

Why not BFQ, kyber, or “whatever the default is”?

BFQ and kyber exist for good reasons. BFQ can be excellent for desktop interactivity and fairness; kyber targets latency control under load.
But ZFS already has its own IO patterns and buffering behaviors, and many production ZFS workloads care more about predictable tail latency
and avoiding pathological interactions than about per-cgroup fairness at the block layer.

If you’re running multi-tenant hosts with strict fairness requirements, you might explore them. But for the common “ZFS pool for databases,
VMs, NFS, object storage, or backups” scenario, mq-deadline/none are the sane starting points. Most of the time, the right move is not “a better
scheduler,” it’s “stop double-scheduling and measure latency correctly.”

How ZFS changes the game (ARC, TXG, sync, and why “more scheduling” is not better)

ZFS is not a dumb block consumer. It’s a filesystem and volume manager that already does aggregation, ordering, and write behavior shaping.
When people say “ZFS likes sequential writes,” what they really mean is: ZFS tries hard to turn scattered application writes into larger,
more contiguous IO at commit time.

ZFS writes are staged: the TXG heartbeat

ZFS groups modifications into transaction groups (TXGs). Dirty data accumulates in memory, then ZFS commits it to disk. This batching is
wonderful for throughput and compression, and it’s also why “my app wrote 4 KB but disk did 1 MB” is not a mystery—it’s ZFS being efficient.

The IO scheduler sees the final block IO pattern, not your application’s intent. If you add aggressive reordering at the scheduler layer,
you can interfere with ZFS’s attempt to manage latency for reads while flushing writes.
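
If you want to watch that batching happen, OpenZFS exposes per-pool TXG history as a kstat, and zpool iostat can show request-size histograms (pool name tank is an example; both assume a reasonably recent ZFS-on-Linux with the /proc/spl interface):

# Each row is one transaction group: dirty data carried, bytes written, and how
# long the open/quiesce/sync phases took
tail -n 5 /proc/spl/kstat/zfs/tank/txgs
# Request-size histograms reveal how well writes were aggregated before hitting disk
zpool iostat -r tank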

Sync writes: where assumptions go to die

For sync writes, ZFS must ensure durability before acknowledging completion (depending on settings and workload). If you don’t have a proper
SLOG (separate intent log) device, your sync workload can turn into “random write latency is your new personality.”

No IO scheduler will rescue you from a pool of HDDs doing small sync writes without a log device. It can only shape how bad the wait feels.
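
If you want a number instead of a feeling, a small fsync-per-write fio run against the dataset approximates worst-case durable-write latency (the file path and sizes here are placeholders; run it only when the extra load is acceptable):

# Queue depth 1 plus an fsync after every write isolates the sync path
fio --name=syncprobe --filename=/tank/syncprobe.tmp --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --fsync=1 --size=1G --time_based --runtime=30
rm /tank/syncprobe.tmp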

ZFS read behavior: prefetch and metadata

ZFS does prefetch and adaptive caching (ARC). Reads are often served from RAM; the reads that hit disk may be metadata-heavy, random, or driven
by scrub/resilver. That means your “scheduler choice” should prioritize preventing tail latency explosions when the disk gets busy.
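
Before blaming the disks for read latency, check how much of the read load even reaches them. The ARC kstats give a quick hit/miss picture, and arcstat (shipped with most OpenZFS packages) turns it into a rolling view:

# Raw counters: hits vs misses, current ARC size vs target size
grep -E '^(hits|misses|size|c) ' /proc/spl/kstat/zfs/arcstats
# Rolling view, one line every 5 seconds (if arcstat is installed)
arcstat 5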

Double queueing and the “latency tax”

ZFS queues IO internally. The block layer queues IO. The device firmware queues IO. That’s three places latency can hide, and it loves to
multiply under pressure.

If your device is an NVMe with deep queues, adding mq-deadline might increase latency variance by reshuffling requests that the controller could
have handled better in parallel. If your device is an HDD, leaving it on “none” can allow pathological starvation patterns that ZFS alone
can’t always smooth out.

Paraphrased idea (Werner Vogels, reliability/operations): “Everything fails eventually; you design systems assuming it will.”
Scheduler selection is exactly that mindset: choose the policy that fails least badly under ugly load.

Recommendations by media: HDD vs SATA/SAS SSD vs NVMe

Rotational HDD (single disks, mirrors, RAIDZ)

Use mq-deadline.

HDDs are latency machines with a side hobby of doing IO. On mixed workloads (scrub + reads + writes), an HDD can starve reads while it’s
chewing through writes. mq-deadline gives you a practical guarantee: reads won’t sit behind writes forever.

When might you consider something else? Mostly when you have a specialized appliance or you’ve validated that another scheduler (like kyber)
gives better tail latency under your exact load. But for general ZFS on Linux with HDD vdevs, mq-deadline is the boring correct answer.

SATA/SAS SSD (consumer or enterprise)

Default to mq-deadline, then consider none only if you have evidence it helps and the SSD is not misbehaving.

SATA/SAS SSDs vary wildly. Some have decent internal scheduling; others fall apart in weird ways during garbage collection or when their write
cache policies interact with power-loss protection (or the absence of it).

mq-deadline often keeps latency more predictable on SATA SSDs under mixed read/write or bursty flush behavior, which maps nicely to ZFS TXG flushes.
“none” can be fine too—especially on enterprise SSDs with strong firmware—but don’t assume NVMe rules apply.
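
One cheap sanity check before trusting a SATA SSD with sync-heavy duty (sdc is an example device): confirm what its volatile write cache is doing, from both the kernel's and the drive's point of view.

# Kernel view: "write back" means the drive buffers writes in volatile cache
cat /sys/block/sdc/queue/write_cache
# Drive view (SATA): is write caching enabled on the device itself?
sudo hdparm -W /dev/sdc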

NVMe (PCIe)

Use none.

NVMe is built for parallel queues, not for being treated like a fancy SATA drive. The controller is already doing dispatch decisions in hardware.
Your job is to keep the software path lean and avoid serializing what should be concurrent.

If you run mq-deadline on NVMe and you see worse tail latency or reduced throughput, that’s not surprising; it’s software trying to outsmart a
device designed specifically to avoid that software.
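
If you want evidence that the controller really is doing its own queueing, nvme-cli (assumed installed; nvme0 is an example controller) reports how many I/O queues were negotiated:

# Feature 0x07 = Number of Queues; -H prints a human-readable breakdown
sudo nvme get-feature /dev/nvme0 -f 0x07 -H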

Virtualized or SAN-backed block devices

Here’s where you stop trusting labels. A “disk” might be a virtual block device backed by a network, a RAID controller, or a storage array with
its own caching and queueing. In those cases:

  • If the device presents as rotational but is actually backed by flash, your choice might differ.
  • If the hypervisor or array already does strong scheduling, “none” can be better.
  • If you need fairness between multiple guests, a scheduler that shapes latency might help—but validate.

For many virtual disks, you’ll still end up with mq-deadline as a safe default, unless the vendor explicitly recommends “none” and you’ve proven it.
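
A quick way to see how little the guest actually knows about its “disk” (vda is a typical virtio device name; yours may differ):

# TRAN is often empty for virtio/SAN devices, and ROTA reflects whatever the
# hypervisor or array chose to advertise, not the real backing media
lsblk -d -o NAME,ROTA,TRAN,MODEL,SIZE /dev/vda
cat /sys/block/vda/queue/rotational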

Interesting facts and historical context (so the defaults make sense)

  1. The original “deadline” scheduler was built to prevent starvation—a real problem when writeback could bury reads on slow disks.
  2. CFQ used to be the default on many distros because it improved desktop responsiveness, not because it maximized server throughput.
  3. blk-mq arrived to scale IO on multicore systems; the old single-queue path became a bottleneck at high IOPS.
  4. “noop” was historically recommended for RAID controllers because the controller did its own scheduling; “none” is the blk-mq-era equivalent.
  5. NVMe was designed around multiple submission and completion queues, explicitly to reduce lock contention and improve parallelism.
  6. ZFS’s intent log (ZIL) exists because POSIX sync semantics exist; it’s not a performance feature, it’s a correctness feature with performance consequences.
  7. ZFS on Linux (OpenZFS) matured later than Solaris ZFS; Linux-specific interactions (like blk-mq schedulers) became a tuning topic only after that port matured.
  8. “IOPS” became a mainstream metric with flash; in the HDD era, we mostly talked about throughput because latency was uniformly awful.
  9. Modern kernels changed defaults multiple times; if you cargo-cult advice from 2016, you’re probably selecting for a kernel that no longer exists.

Fast diagnosis playbook: find the bottleneck without guessing

You’re paged. Latency is up. ZFS pool “looks fine” because it’s not screaming; it’s just quietly ruining your day. This is the shortest path
to the truth.

First: identify the device class and current scheduler

  • Is it HDD, SATA SSD, or NVMe?
  • Is the scheduler actually what you think it is?
  • Are you benchmarking the pool or a single vdev member by accident?

Second: determine whether you’re latency-bound or throughput-bound

  • Check per-device latency (r_await/w_await) and queue depth (aqu-sz); ignore svctm, which newer sysstat versions deprecate or drop because it was never reliable.
  • Look at tail latency symptoms: application timeouts, sync write stalls, NFS hiccups.

Third: separate “ZFS is flushing” from “disk is slow”

  • Check TXG and dirty data behavior (indirectly via ZFS stats and workload patterns).
  • Check whether sync writes are the culprit (and whether you have an effective SLOG); see the latency sketch after this list.
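
A quick way to separate “the device is slow” from “ZFS is queueing”: let zpool iostat break latency down by stage (pool name tank is an example; both flags need a reasonably recent OpenZFS):

# -l adds latency columns: disk_wait vs syncq_wait/asyncq_wait shows whether time
# is spent on the device itself or waiting in ZFS's own queues
zpool iostat -l -v tank 5
# -w prints full latency histograms when you care about the tail, not the average
zpool iostat -w tank 5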

Fourth: check for the usual saboteurs

  • A single dying disk in a vdev causing retries and timeouts.
  • SMR HDDs pretending to be normal disks.
  • Consumer SSD with no power-loss protection getting punished by sync semantics.
  • PCIe/NVMe power management or firmware oddities.

Fifth: only then change the scheduler

Scheduler changes are low-risk compared to, say, changing recordsize on a busy dataset—but they’re also low-impact compared to
fixing sync write design or replacing a bad disk. Change the scheduler when you can articulate the failure mode it addresses:
starvation and fairness (mq-deadline) vs reduced overhead and better parallelism (none).

Practical tasks (commands, output, and decisions)

These are real tasks you can run on a Linux host with ZFS. Each includes: the command, an example of what you might see, what it means, and
the decision you make. Do them in this order when you’re diagnosing a system you didn’t build yourself.

Task 1: list block devices and spot NVMe vs SATA

cr0x@server:~$ lsblk -d -o NAME,ROTA,TRAN,MODEL,SIZE
NAME  ROTA TRAN MODEL                 SIZE
sda      1 sata ST12000NM0008-2JH101 10.9T
sdb      1 sata ST12000NM0008-2JH101 10.9T
sdc      0 sata Samsung SSD 870 EVO   931.5G
nvme0n1  0 nvme Samsung SSD 980 PRO  953.9G

Meaning: ROTA=1 is rotational (HDD). TRAN=nvme is NVMe. SATA SSDs show ROTA=0 and TRAN=sata.

Decision: Put HDDs on mq-deadline. Put NVMe on none. For SATA SSD, start with mq-deadline unless you have a reason to prefer none.

Task 2: check the current scheduler per device

cr0x@server:~$ for d in sda sdb sdc nvme0n1; do echo -n "$d: "; cat /sys/block/$d/queue/scheduler; done
sda: [mq-deadline] none
sdb: [mq-deadline] none
sdc: mq-deadline [none]
nvme0n1: [none]

Meaning: The scheduler in brackets is active. Here, the SATA SSD (sdc) is on none; HDDs are on mq-deadline; NVMe is on none.

Decision: If that SATA SSD is a SLOG or special vdev and you’re seeing latency spikes, consider moving it to mq-deadline and measure.

Task 3: check whether the kernel even offers schedulers you expect

cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none

Meaning: On some systems, you may see additional options like kyber or bfq. If they’re not present, you can’t select them.

Decision: Don’t waste time chasing a blog post scheduler if your kernel doesn’t support it. Stick to mq-deadline/none and fix the real bottleneck.

Task 4: verify which devices belong to which ZFS vdevs

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-ST12000NM0008_1     ONLINE       0     0     0
            ata-ST12000NM0008_2     ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            ata-ST12000NM0008_3     ONLINE       0     0     0
            ata-ST12000NM0008_4     ONLINE       0     0     0
        logs
          nvme-Samsung_SSD_980PRO   ONLINE       0     0     0

Meaning: ZFS is built from vdevs. Performance and failure modes often hinge on one slow or unhealthy member.

Decision: Apply scheduler changes to the actual underlying block devices that correspond to vdev members—especially logs and special vdevs.

Task 5: watch ZFS-level latency and throughput during load

cr0x@server:~$ zpool iostat -v tank 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        5.12T  16.7T    180    870  21.3M  90.5M
  mirror-0  2.55T  8.35T     90    310  10.6M  39.2M
  mirror-1  2.57T  8.33T     90    310  10.7M  39.2M
logs            -      -      0    250     0  12.1M

Meaning: You can see if the log device is being used heavily (sync workload), and whether read/write operations are balanced across mirrors.

Decision: If the log is busy and latency is bad, scheduler changes on the log device may affect tail latency—especially if it’s SATA SSD.

Task 6: identify per-device IO latency and saturation (iostat)

cr0x@server:~$ iostat -x -d 1
Device            r/s     w/s   r_await   w_await   aqu-sz  %util
sda              2.1    78.3     18.2    145.7     9.84   99.0
sdb              1.9    77.5     17.9    141.2     9.53   98.7
nvme0n1          0.0   250.2      0.2      1.8     0.45   12.3

Meaning: HDDs are saturated (%util ~99) with huge write await. NVMe log looks fine. Your bottleneck is the HDD vdevs, not the scheduler on NVMe.

Decision: If reads are timing out, mq-deadline on HDD is appropriate; if it’s already set, your next move is workload shaping (sync, recordsize, vdev layout) or more spindles.

Task 7: confirm the rotational flag is sane (it lies sometimes)

cr0x@server:~$ cat /sys/block/sdc/queue/rotational
0

Meaning: 0 means non-rotational. For some RAID controllers or virtual devices, this can be misleading.

Decision: If a “virtual disk” is actually backed by HDD and claims rotational=0, don’t blindly set none. Validate with latency behavior and vendor constraints.

Task 8: check the device queue depth and see if you’re accidentally constraining NVMe

cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests
1023

Meaning: This is the request queue size. A too-low value can limit throughput; too high can increase latency under certain loads.

Decision: Don’t touch it unless you have a measured queueing problem. Scheduler choice is usually enough at this layer; bigger knobs have bigger side effects.

Task 9: change scheduler temporarily (runtime) for a device

cr0x@server:~$ echo mq-deadline | sudo tee /sys/block/sdc/queue/scheduler
mq-deadline

Meaning: This changes the scheduler until reboot (or until udev rules override it).

Decision: Use this for controlled tests during a maintenance window. Measure before/after with iostat and application latency, not vibes.

Task 10: confirm the change “took”

cr0x@server:~$ cat /sys/block/sdc/queue/scheduler
[mq-deadline] none

Meaning: It’s active now.

Decision: If performance improves but reverts after reboot, you need persistence (next task).

Task 11: make the scheduler persistent with a udev rule

cr0x@server:~$ sudo bash -c 'cat > /etc/udev/rules.d/60-io-scheduler.rules <<EOF
# Baseline policy: rotational disks and SATA/SAS SSDs -> mq-deadline, NVMe -> none
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
EOF'
cr0x@server:~$ sudo udevadm control --reload-rules
cr0x@server:~$ sudo udevadm trigger --type=devices --action=change

Meaning: This applies the baseline policy by device type. The rule content above is one reasonable example; adapt the match patterns to your device naming and distro conventions. It's not elegant, but it works.

Decision: If you manage fleets, encode this in configuration management and treat it like any other baseline: reviewed, tested, and rolled out gradually.

Task 12: verify udev applied the rule after a trigger

cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none]

Meaning: Good: HDD is mq-deadline, NVMe is none.

Decision: Move on to workload validation. Scheduler policy is a means, not the end.

Task 13: detect sync-heavy workload (quick indicator)

cr0x@server:~$ zfs get -o name,property,value -H sync tank
tank	sync	standard

Meaning: sync=standard respects application sync behavior. If your application uses fsync a lot, the log device matters.

Decision: Don’t set sync=disabled as “a performance fix” unless you’re willing to lose data. There are careers that end this way.

Task 14: measure real latency with fio against a ZVOL or dataset (carefully)

cr0x@server:~$ sudo fio --name=randread --filename=/tank/testfile --size=4G --rw=randread --bs=4k --iodepth=32 --numjobs=4 --direct=1 --time_based --runtime=30
randread: (groupid=0, jobs=4): err= 0: pid=22110: Fri Dec 25 02:10:11 2025
  read: IOPS=58.2k, BW=227MiB/s (238MB/s)(6815MiB/30002msec)
    clat (usec): min=90, max=9210, avg=265.4, stdev=112.7
     lat (usec): min=92, max=9220, avg=267.2, stdev=112.9

Meaning: You get latency distribution, not just throughput. The max latency tells you about tail behavior.

Decision: If switching from mq-deadline to none changes average but worsens max latency (or vice versa), decide based on application sensitivity. Databases hate tail latency more than they hate losing 5% throughput.

Joke #1: If you’re changing IO schedulers without measuring latency, you’re not tuning—you’re doing storage astrology.

Three corporate-world mini-stories from real life patterns

Mini-story 1: The incident caused by a wrong assumption (NVMe “needs” mq-deadline)

A mid-sized company ran a multi-tenant virtualization cluster on OpenZFS. They refreshed hardware, swapped in newer NVMe drives,
and kept their old tuning playbook. That playbook said: “deadline reduces latency,” so they set mq-deadline everywhere.

The change looked harmless. The first week was quiet. Then the Monday morning storm hit: VM boot bursts, backup ingest, and a database failover.
Suddenly they saw weird behavior: throughput looked fine, but the 99th percentile latency spiked hard enough to trip application timeouts.
Guests didn’t “slow down.” They stalled. The difference matters.

The team chased ZFS knobs first. They tried adjusting recordsize on a few datasets, then debated SLOG wear, then blamed the hypervisor.
Meanwhile, the real clue was hiding in plain sight: CPU usage in softirq and block-layer paths climbed, and IO completion latency got more variable
under concurrency.

They flipped NVMe to none on two canary hosts and reran the same mixed load. Tail latency improved; CPU overhead dropped; the stalls stopped.
The old assumption—“deadline always reduces latency”—was true in the HDD era and sometimes for SATA SSD. For NVMe, it was a tax with no benefit.

The lesson wasn’t “mq-deadline is bad.” The lesson was “match the policy to the device.” NVMe wants parallelism, not babysitting.

Mini-story 2: The optimization that backfired (setting none on HDD to “let ZFS handle it”)

Another org ran a big backup repository on ZFS with large RAIDZ vdevs built from HDDs. They’d read that ZFS already aggregates writes,
and concluded the scheduler was redundant. Someone set all disks to none, checked in the udev rule, and moved on.

The steady-state throughput during nightly backup ingest actually improved a bit. People congratulated themselves, which is usually when the system
starts planning revenge. The next scrub window was the first real test: scrub reads plus ongoing writes plus a few restores.

Restore jobs (reads) became painfully slow. Not “a bit slower,” but “users think the restore is hung.” Latency spiked during write-heavy periods.
The pool wasn’t failing; it was just spending an impressively long time deciding to do reads.

The root cause was classic: HDDs under a mixed workload can starve reads behind writes if you don’t enforce some fairness.
ZFS’s internal IO scheduling can’t always compensate when the block layer dispatch order is effectively “whatever arrives, whenever.”

Switching HDDs back to mq-deadline restored predictable read behavior during scrubs and mixed IO. Throughput during ingest dropped slightly,
but restores became reliable again. The business didn’t pay for “best-case throughput at 2 a.m.” They paid for restores that finish before the meeting.

Mini-story 3: The boring but correct practice that saved the day (per-device baselines + canaries)

A financial services shop had a habit: every new kernel rollout went through storage canaries. Not fancy. Two hosts per hardware generation,
same workload replay, same dashboards. They kept a small baseline doc: device types, firmware versions, scheduler settings, and expected latency bands.

One quarter, a routine kernel update changed behavior on a subset of SATA SSDs used as special vdevs. Nothing catastrophic—just a slow drift in tail latency.
The canaries caught it because they were looking at the 99th percentile, not just average throughput. They also noticed that the devices started spending
more time in internal housekeeping under mixed writes.

The fix was not a heroic rewrite. They standardized those SATA SSDs on mq-deadline, ensured the udev rules were applied consistently,
and rolled it out gradually while watching latency. It wasn’t exciting. It was correct.

Later, when a real incident happened—one SSD firmware update causing brief stalls—the team had enough discipline to test changes on canaries first.
The scheduler baseline meant they weren’t debugging ten variables at once.

Boring practices don’t get credit until they prevent a very expensive kind of excitement.

Joke #2: The IO scheduler is like office politics—ignore it and you’ll still suffer, but getting “too involved” can also ruin your week.

Common mistakes: symptom → root cause → fix

1) “NVMe is fast but my ZFS pool stalls under load”

Symptom: Good average bandwidth, but periodic latency cliffs; CPU in kernel rises; application timeouts.

Root cause: NVMe running a scheduler that adds overhead or reduces concurrency (often mq-deadline), combined with deep queueing under mixed IO.

Fix: Set NVMe to none, validate with tail latency measurements (fio + app metrics), and ensure you didn’t clamp queue depth elsewhere.

2) “HDD pool throughput is okay but reads become unusable during backups or scrubs”

Symptom: Restore/read jobs crawl during write-heavy periods; interactive reads stall.

Root cause: HDDs on none (or overly permissive dispatch), allowing write bursts to starve reads.

Fix: Use mq-deadline on HDD vdev members. If still bad, separate workloads, add spindles, or reshape vdev layout.

3) “Switching to none made benchmarks better, but production got worse”

Symptom: fio shows higher IOPS, but real workloads show worse p99 latency or more jitter.

Root cause: Benchmarks measured throughput-centric workloads; production is tail-latency sensitive and mixed IO.

Fix: Benchmark what you run. Track latency percentiles. Pick scheduler based on p95/p99, not peak IOPS screenshots.
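
If you let fio report the percentiles you actually care about, this argument settles itself. A mixed-workload sketch (file path, block size, and read/write mix are placeholders for whatever production really does):

# Report p50/p95/p99/p99.9 completion latency instead of eyeballing averages
fio --name=p99probe --filename=/tank/p99probe.tmp --size=4G \
    --rw=randrw --rwmixread=70 --bs=8k --iodepth=16 --numjobs=4 \
    --direct=1 --time_based --runtime=60 --percentile_list=50:95:99:99.9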

4) “Scheduler settings don’t persist after reboot”

Symptom: You echo to sysfs, it works, reboot resets.

Root cause: sysfs changes are runtime only; udev or distro defaults reapply at boot.

Fix: Use a udev rule (as shown) or distribution-supported tuning mechanism; verify after boot.

5) “One disk is slow and it poisons the whole vdev”

Symptom: Mirror/RAIDZ performance collapses; iostat shows one device with massive await and errors/retries.

Root cause: Failing disk, bad cable, controller issue, or firmware problem; scheduler won’t fix hardware retries.

Fix: Confirm with SMART/NVMe logs; replace hardware. Keep scheduler sane, but don’t treat it as a repair tool.
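
A minimal triage pass for “is this device actually healthy” (device names are examples; smartctl and nvme-cli assumed installed):

# SATA/SAS: reallocated/pending sectors and CRC errors are the usual smoking guns
sudo smartctl -a /dev/sda | grep -Ei 'realloc|pending|uncorrect|crc'
# NVMe: media errors, error-log entries, and thermal throttling history
sudo nvme smart-log /dev/nvme0
# Kernel-side resets, retries, and timeouts show up here
dmesg -T | grep -Ei 'reset|timeout|I/O error' | tail -n 20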

6) “We ‘fixed’ sync latency by disabling sync and now we’re brave”

Symptom: Latency improved dramatically; then you lose power and have a bad time explaining missing data.

Root cause: Correctness traded for speed; ZFS was doing the right thing before.

Fix: Use a proper SLOG with power-loss protection for sync-heavy workloads; keep sync=standard unless you’re intentionally accepting data loss.

Checklists / step-by-step plan (change control friendly)

Checklist A: choose the scheduler per device

  1. Inventory devices: lsblk -d -o NAME,ROTA,TRAN,MODEL,SIZE.
  2. Map devices to vdevs: zpool status -v.
  3. Set baseline:
    • HDD: mq-deadline
    • SATA/SAS SSD: mq-deadline (start here)
    • NVMe: none
  4. If you must deviate, write down what metric you’re optimizing (p99 latency, throughput, fairness).

Checklist B: apply changes safely

  1. Pick a canary host (or one vdev member only, if risk is low and redundancy exists).
  2. Record baseline metrics: zpool iostat -v 1, iostat -x 1, application p95/p99 latency.
  3. Change scheduler runtime via sysfs.
  4. Run the workload that matters (not just a synthetic benchmark).
  5. Compare tail latency and CPU overhead.
  6. Only then create persistent udev rules and roll out gradually.

Checklist C: validate ZFS behavior didn’t change for the worse

  1. Scrub/resilver performance: does it starve production reads?
  2. Sync write latency: did it improve or worsen, and is the SLOG healthy?
  3. Error counters and retries: any disk suddenly “slow” after the change is probably just broken.

FAQ

1) Should I always use mq-deadline with ZFS?

No. Use mq-deadline for HDD and usually SATA/SAS SSD. For NVMe, use none unless you have a measured reason to do otherwise.
ZFS benefits from predictable latency on slow media; NVMe benefits from minimal software interference.

2) Why does “none” sometimes benchmark faster?

Because it removes scheduling overhead and preserves parallelism. On NVMe especially, “none” keeps the software path lean,
letting the controller do what it was designed to do.

3) If ZFS already schedules IO, why do I need a block scheduler at all?

You don’t always “need” one. But for HDDs, the scheduler’s fairness and starvation prevention is still valuable.
ZFS can’t fully compensate for the mechanical realities of rotational media when the block layer dispatch order is unhelpful.

4) What about kyber?

kyber can be good for latency control on some devices. If it’s available on your kernel and you have a specific latency problem that mq-deadline
doesn’t address, it can be worth testing. Don’t deploy it fleet-wide because a forum thread had a nice graph.

5) What about BFQ?

BFQ is often excellent for interactive fairness, particularly on desktops. On servers with ZFS, it’s less commonly the right choice because it can
add overhead and its fairness goals may not match your workload. Test it if you need per-cgroup fairness and can tolerate the cost.

6) Does scheduler choice affect ZFS scrubs and resilvers?

Indirectly, yes. Scrubs/resilvers generate sustained reads and metadata operations. On HDDs, mq-deadline can prevent those operations from getting
buried behind writes (or vice versa), improving predictability. On NVMe, none generally keeps the pipeline efficient.

7) My SSD is SATA but feels “NVMe-fast.” Should I still use mq-deadline?

Start with mq-deadline. SATA is still limited by a different command model and often simpler device queueing. Some enterprise SATA SSDs do fine with none,
but mq-deadline is a safer latency baseline under mixed workloads.

8) Can I set one scheduler for the whole pool?

You set schedulers per block device, not per pool. A pool can include HDD vdevs plus an NVMe SLOG; they should use different schedulers.
Treat log/special devices as first-class citizens—they can dominate perceived latency.

9) Why do my changes revert even with udev rules?

Because something else is also setting it (initramfs rules, distro tuning tools, or a conflicting udev rule order). Check rule ordering
and verify with udevadm test style debugging if needed. The fix is to standardize one mechanism and remove the competing one.
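
Two commands usually find the competing mechanism (sda is an example device; rule paths vary slightly by distro):

# Dry-run udev processing for the disk and see which rules touch the scheduler
udevadm test /sys/block/sda 2>&1 | grep -i scheduler
# Find every rule file that sets the scheduler attribute
grep -rn 'queue/scheduler' /etc/udev/rules.d/ /usr/lib/udev/rules.d/ 2>/dev/null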

10) What’s the simplest safe baseline for mixed fleets?

HDD: mq-deadline. SATA/SAS SSD: mq-deadline. NVMe: none. Then validate with canaries per hardware generation.
This gets you 90% of the benefit with minimal risk.

Conclusion: the next steps you should actually take

If you’re running ZFS on Linux, the IO scheduler decision is refreshingly unromantic:
mq-deadline for HDD and most SATA/SAS SSD, none for NVMe. Anything else is a special case that needs evidence.

Next steps that pay off:

  1. Inventory devices and confirm current schedulers with sysfs.
  2. Map devices to ZFS vdev roles (data vs log vs special).
  3. Apply the baseline scheduler policy and make it persistent via udev rules.
  4. Measure tail latency before/after using iostat, zpool iostat, and your application’s p95/p99.
  5. If latency is still bad, stop blaming the scheduler and investigate sync write design, failing hardware, and vdev layout.

The IO scheduler is not magic. It’s a traffic cop. Put it where it helps, remove it where it doesn’t, and keep your incident budget for problems that deserve it.
