ZFS Mirrors: The Layout That Makes Random IO Feel Like NVMe


There’s a particular kind of joy you only get in production: when a workload that used to feel like it was dragging a refrigerator through gravel suddenly snaps to attention. The CPU stops idling. The database stops whining. Latency charts stop looking like a seismograph. You didn’t buy new servers. You didn’t rewrite the app. You changed the storage layout.

ZFS mirrors are that change, more often than people expect. Not because mirrors are magic, but because they align with the physics of disks and the math of queues. Mirrors turn random reads into a parallelizable problem, and most real-world “my storage is slow” tickets are, underneath the Slack thread, a random-read latency problem. If you’ve ever watched a RAIDZ pool do a beautiful job at sequential bandwidth while your database still times out, welcome. This is the piece that explains why mirrors can make random IO feel like NVMe—without pretending they actually are.

Why mirrors feel fast (and when they don’t)

Mirrors win on random reads for one boring, powerful reason: they give ZFS choices. For each block that exists on two (or three) disks, ZFS can pick which copy to read. That’s not just redundancy—it’s scheduling freedom. With a single disk, random reads line up into a queue and wait their turn. With a mirror, ZFS can send different reads to different members, and it can route around a disk that’s having a bad day (or just a busy one).

In practice, mirror vdevs behave like “IOPS amplifiers” for reads. A pool built from multiple mirror vdevs multiplies that again because ZFS stripes data across vdevs. The end result is that a database doing lots of small reads may see latency collapse when you move from a wide RAIDZ vdev to several mirror vdevs, even if the raw disk count stays the same.

Writes are a different story. For a simple 2-way mirror, each write must land on both members. That means mirror writes are not faster than a single disk in the same vdev (ignoring ZIL/SLOG behavior). But pool-level parallelism still helps: if you have many mirror vdevs, your writes distribute across them. So you can get strong write throughput from many mirrors, but a single mirror vdev doesn’t magically double write IOPS.

Here’s the key mental model: ZFS performance is dominated by the number and type of vdevs, not the number of disks. Mirrors are typically deployed as many small vdevs (2-disk mirrors, or sometimes 3-disk mirrors), which increases vdev count and gives ZFS more independent queues to work with.

And when don’t mirrors feel fast? When your bottleneck isn’t the disks. If you’re saturating a single HBA queue, if your CPU is stuck in checksumming or compression with a bad choice of algorithm, if your network is the limiting factor, or if your workload is mostly synchronous writes without a proper SLOG, mirrors won’t save you. They’ll still be correct. They just won’t be heroic.

Joke #1: A mirror is like a pager rotation: redundancy doesn’t make the work disappear, it just makes sure someone else can take the call.

Facts and history you can use at 3 a.m.

Some context points that are short, concrete, and surprisingly useful when you’re arguing for a design change in a meeting where someone thinks RAID5 is still the answer to everything:

  1. ZFS was born at Sun in the mid-2000s as an end-to-end filesystem + volume manager, designed to detect and correct silent corruption with checksums at every level.
  2. RAIDZ exists because traditional RAID had a write hole: parity RAID could acknowledge writes that weren’t fully committed in a power loss, leaving silent inconsistency. ZFS’s transactional model reduces that class of failure.
  3. “Vdevs are the unit of performance” is not a slogan—it’s how ZFS schedules IO. ZFS stripes allocations across top-level vdevs, not across disks inside a vdev the way classic RAID controllers present it.
  4. Mirrors enable read selection: ZFS can choose the less-busy mirror leg, and can use historical latency to prefer the faster member.
  5. Modern “4K sector” drives changed the game: using the wrong ashift can permanently penalize small writes with read-modify-write behavior. This hurts mirrors too, but parity vdevs suffer harder.
  6. SSDs made “random reads” normal: databases and search clusters started expecting low-latency storage, and HDD-based parity arrays started showing their age in those workloads.
  7. L2ARC was never meant to be a magic SSD tier: it can help read-heavy sets that don’t fit in RAM, but it consumes RAM for metadata and doesn’t fix write latency.
  8. SLOG is not a write cache: it’s a separate log device for synchronous writes; it accelerates commit latency, not bulk throughput.
  9. Resilver behavior differs: mirrors often resilver faster than wide parity vdevs because they can copy only allocated blocks (depending on platform/features) and the reconstruction math is simpler.

Mirror architecture: vdevs, queues, and the “IOPS math”

What ZFS is really doing when you create mirrors

A ZFS pool (zpool) is built from one or more top-level vdevs. Those vdevs can be:

  • a single disk (don’t do this in production unless you enjoy pain),
  • a mirror (2-way, 3-way…),
  • a RAIDZ group (raidz1/2/3).

When you build a pool from multiple top-level vdevs, ZFS distributes allocations across them. That’s the stripe. It’s not a fixed RAID0 stripe like a controller would do; it’s allocation-based, but the effect is similar: more vdevs means more independent IO queues.

Why mirrors help random reads

Random read workloads (databases, VM images, mail servers, build caches, metadata-heavy filesystems) are dominated by seek/latency on HDDs and by queue depth/latency on SSDs. With a mirror, each read can be served by either member. ZFS can:

  • send concurrent reads to different legs of the mirror,
  • prefer the disk with the shorter queue,
  • avoid a disk that is returning errors or taking longer than expected.

If you have N mirror vdevs, you have N independent vdev queues. For random reads, you also have two possible spindles per vdev. So you get two types of concurrency: across vdevs and within each vdev (choice of leg). The combined effect can look dramatic compared to a single wide RAIDZ vdev, which has one vdev queue and must touch multiple disks for each operation.

But do mirrors “double IOPS”?

On reads, mirrors can approach 2× read IOPS per vdev under the right conditions, because reads can be split between legs. In real life, it’s less clean: caching, prefetch, request sizes, and the fact that some reads are sequential reduce the theoretical gain. Still, it’s common to see a meaningful improvement.

On writes, a 2-way mirror vdev must write twice. So per-vdev write IOPS is roughly single-disk (again, ignoring ZIL/SLOG). The pool-level write throughput improves by adding more mirror vdevs, not by expecting a single mirror to write faster than physics.
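
A back-of-the-envelope illustration, assuming roughly 150 random read IOPS per 7200 RPM HDD (a rule of thumb, ignoring ARC and firmware quirks):

  • 12 disks as 6 × 2-way mirrors: up to 12 legs can serve reads independently, so the ceiling is on the order of 12 × 150 ≈ 1,800 random read IOPS.
  • The same 12 disks as one RAIDZ2 vdev: small random reads are bounded by the vdev acting as a single scheduling unit, so you're looking at a few hundred IOPS, not thousands.
  • Writes on the mirror pool: each write hits both legs, so the headroom is closer to 6 × 150 ≈ 900 IOPS.

The absolute numbers will be wrong for your hardware; the ratios are the point.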

RAIDZ comparison without theology

RAIDZ is great at capacity efficiency and sequential throughput. It’s also very good at “one big vdev” simplicity. But parity vdevs have a harder time with small random IO because:

  • a small write may become read-modify-write across multiple disks,
  • random reads still have to coordinate across disks (even if they can be serviced from one disk in some cases, there’s more overhead and less scheduling freedom),
  • the vdev is one scheduling unit; one wide RAIDZ2 vdev is still one vdev.

This is why mirror pools are the default answer for latency-sensitive workloads. RAIDZ is not “bad”; it’s just better suited for different performance profiles.

A note about NVMe comparisons

No, mirrors don’t turn HDDs into NVMe. But they can make the system feel NVMe-like to an application that is primarily blocked by random-read tail latency, especially when ARC hits are high and the remaining disk reads are distributed well. The subjective experience is what changes: fewer stalls, fewer long pauses, fewer “everything is fine except the database is slow.”

Joke #2: Calling HDD mirrors “NVMe” is like calling a bicycle “a motorcycle with excellent fuel efficiency.” Technically it moves you, but please don’t race it.

Design decisions: width, count, and what to buy

2-way vs 3-way mirrors

2-way mirrors are the standard: good redundancy, good read performance, acceptable capacity cost.

3-way mirrors are for when rebuild risk and availability matter more than capacity. They can reduce the odds that a second failure kills the vdev during resilver, and they can improve read selection further. They cost you dearly in usable space, so they’re usually justified for small hot pools (VM metadata, critical DB logs, etc.) or when drive failure rates are ugly and replacements take time.
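
If a hot 2-way mirror later earns a third leg, you don't rebuild: attach a device to an existing member and let it resilver. A sketch with hypothetical device names:

cr0x@server:~$ sudo zpool attach tank /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_G
cr0x@server:~$ sudo zpool status tank | grep -A3 mirror-0
          mirror-0                  ONLINE       0     0     0
            ata-DISK_A              ONLINE       0     0     0
            ata-DISK_B              ONLINE       0     0     0
            ata-DISK_G              ONLINE       0     0     0  (resilvering)

The new member doesn't count as redundancy until the resilver finishes.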

Many small mirrors beat one big mirror

If you have 12 disks, the performance conversation is usually not “one 12-disk thing” but “how many top-level vdevs do I get?”

  • 6 mirror vdevs (2-way) gives you 6 vdev queues and excellent random-read scaling.
  • 1 RAIDZ2 vdev gives you 1 vdev queue and strong sequential throughput with good capacity.

If your workload is VM storage, databases, CI caches, maildir-style file servers, or anything with lots of small files: mirrors usually win.

Capacity math you should do before buying disks

Mirrors “waste” half the raw capacity (or two-thirds for 3-way). But that waste is not vanity: it buys you performance headroom and operational breathing room. When systems are slow, teams do dumb things: they turn off safety, they disable sync, they reduce replication. Paying for mirrors is often cheaper than paying for a week of incident response.

Still, you should do the math upfront (a worked example follows the list):

  • Plan to stay below ~70–80% pool usage for latency-sensitive workloads. ZFS gets slower as free space shrinks because allocation becomes harder and fragmentation increases.
  • Budget for redundancy and spares. If you can’t replace a failed disk quickly, your mirror is temporarily a single disk. That’s not redundancy; that’s hope.
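
Putting numbers on it for a hypothetical 12 × 8 TB build (illustrative, raw decimal terabytes):

  • 6 × 2-way mirrors: ~48 TB usable; keep ~25% free for latency and you're planning around ~36 TB of working data.
  • 4 × 3-way mirrors: ~32 TB usable; two-thirds of raw capacity spent on rebuild safety.
  • 1 × 12-wide RAIDZ2: ~80 TB usable, with a very different random IO profile.

If 36 TB isn't enough, the answer is more or bigger disks, not skipping the headroom.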

Drive and controller considerations

Mirrors expose latency. That’s good, until your HBA or expander starts doing “creative” things with queueing and error recovery. A few practical notes:

  • Use HBAs in IT mode (pass-through). ZFS wants to see the disks.
  • Avoid mixing wildly different drive types in the same mirror vdev. ZFS can route reads, but writes must wait for both, and resilver time follows the slowest member.
  • Set sane drive error recovery (TLER/ERC) for RAID-like environments. Long disk internal recovery can stall IO queues and trigger timeouts; a quick check is shown below.
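
A quick way to inspect (and, on drives that support it, set) error recovery timeouts with smartmontools; the device name is illustrative and the smartctl banner is trimmed:

cr0x@server:~$ sudo smartctl -l scterc /dev/sdb
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

cr0x@server:~$ sudo smartctl -l scterc,70,70 /dev/sdb

The second command requests 7-second read/write recovery limits. Desktop drives may not support SCT ERC at all, or may forget the setting after a power cycle, which is worth knowing before you depend on it.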

Properties and tuning that actually matter

ashift: the permanent decision

ashift sets the pool’s sector alignment. Get it wrong and you can lock in write amplification for the life of the vdev. For modern drives, ashift=12 (4K sectors) is typically the minimum sane choice; many admins choose ashift=13 (8K) for some SSDs to align with internal pages. You can’t change ashift after creation without rebuilding.
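
Before creating the pool, check what the drives actually advertise; many 4K-native drives still report 512-byte logical sectors:

cr0x@server:~$ lsblk -o NAME,LOG-SEC,PHY-SEC,MODEL /dev/sdb /dev/sdc
NAME LOG-SEC PHY-SEC MODEL
sdb      512    4096 WDC WD40EFRX-68N32N0
sdc      512    4096 WDC WD40EFRX-68N32N0

A 512/4096 pairing is exactly the case where being explicit with ashift=12 removes any doubt about what auto-detection decided.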

recordsize (and volblocksize) matching workload

For filesystems, recordsize controls the maximum block size ZFS will use for file data. Big records (1M) are great for sequential throughput (backups, media), but can hurt random IO if your app reads small chunks from large blocks.

For zvols (block devices for iSCSI/VMs), volblocksize is set at creation and should match the guest filesystem and workload (commonly 8K/16K for databases, 16K/32K for general VM use). Again: you can’t change it later without recreating the zvol.
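
A minimal sketch with hypothetical dataset names; the point is that these are per-dataset decisions made at creation time:

cr0x@server:~$ sudo zfs create -o recordsize=16K tank/db
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vm-disk1

recordsize can be changed later (it only affects blocks written after the change); volblocksize cannot, which is why it belongs in the provisioning checklist rather than the tuning backlog.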

compression: free performance until it isn’t

compression=lz4 is often a win: fewer bytes read/written means less disk time. For random read workloads, compression can reduce IO size and improve latency. But compression burns CPU, and on already CPU-bound systems it can backfire.
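
Enabling it and checking what it actually buys you is two commands (the ratio shown is illustrative):

cr0x@server:~$ sudo zfs set compression=lz4 tank
cr0x@server:~$ zfs get -o name,property,value compressratio tank
NAME  PROPERTY       VALUE
tank  compressratio  1.61x

Compression only applies to data written after it's enabled, so the ratio drifts upward as old blocks get rewritten.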

atime: death by a thousand metadata writes

atime=on updates access times on reads, causing extra writes. On busy filesystems, this can become a steady stream of small sync-ish metadata updates depending on workload. For most server workloads, atime=off is a straightforward improvement.
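
The fix is one inherited property; if a dataset (tank/home here, hypothetically) genuinely needs access times, re-enable them there with relatime=on so they're only updated occasionally instead of on every read:

cr0x@server:~$ sudo zfs set atime=off tank
cr0x@server:~$ sudo zfs set atime=on tank/home
cr0x@server:~$ sudo zfs set relatime=on tank/home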

sync writes, ZIL, and SLOG

Synchronous writes are a contractual guarantee: the caller wants confirmation that the data is safely on stable storage. ZFS satisfies that via the ZIL (ZFS Intent Log). If you add a dedicated SLOG device (fast, low-latency, power-loss-protected), you can accelerate synchronous commit latency dramatically.

Mirrors and SLOG are a common pairing: mirrors for random read IOPS, SLOG for sync write latency. But if your workload is mostly async writes, SLOG won’t help much. If your workload is mostly sync writes and you don’t have SLOG, mirrors won’t fix the tail latency either.
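
Adding one is a normal pool operation; a sketch with hypothetical NVMe device names, mirrored because a dead log device during an unclean shutdown is its own incident:

cr0x@server:~$ sudo zpool add tank log mirror \
  /dev/disk/by-id/nvme-PLP_SSD_A /dev/disk/by-id/nvme-PLP_SSD_B

Afterwards, confirm under real load (zpool iostat -v) that the log vdev is actually seeing writes; if it isn't, your workload wasn't sync-bound in the first place.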

Practical tasks with commands (and how to read the output)

The commands below assume a typical Linux system with OpenZFS. Adjust device names and pool/dataset names. The point is not copy-paste heroics; it’s building the habit of verifying what the system is actually doing.

Task 1: Inspect pool topology (confirm you really built mirrors)

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:17:02 with 0 errors on Sun Dec 22 02:17:05 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-WDC_WD40...-part1   ONLINE       0     0     0
            ata-WDC_WD40...-part1   ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            ata-WDC_WD40...-part1   ONLINE       0     0     0
            ata-WDC_WD40...-part1   ONLINE       0     0     0

errors: No known data errors

Interpretation: You want multiple top-level mirror-X vdevs. If you see one giant RAIDZ vdev, your random IO story will be different. Also confirm state is ONLINE, not DEGRADED.

Task 2: Watch per-vdev IO in real time (find the hot vdev)

cr0x@server:~$ sudo zpool iostat -v tank 1
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
tank              3.21T  4.05T    820    210  41.0M  12.3M
  mirror-0        1.60T  2.02T    610    105  30.4M   6.2M
    ata-WDC...      -      -     320     52  15.2M   3.1M
    ata-WDC...      -      -     290     53  15.2M   3.1M
  mirror-1        1.61T  2.03T    210    105  10.6M   6.1M
    ata-WDC...      -      -     110     52   5.3M   3.0M
    ata-WDC...      -      -     100     53   5.3M   3.1M
----------------  -----  -----  -----  -----  -----  -----

Interpretation: If one mirror vdev is doing most of the work, you may have skew (data placement, a hot dataset, or a small set of blocks living mostly on one vdev due to pool age and expansion history). Mirrors give you concurrency, but only if allocations are reasonably distributed.

Task 3: Get the 30-second “am I cache-bound or disk-bound?” signal

cr0x@server:~$ arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:01:10  12K  2.1K    17%   890   7%  1.2K  10%     0   0%  96.3G  128G
12:01:11  11K  2.4K    21%   1.1K 10%  1.3K  12%     0   0%  96.3G  128G

Interpretation: High ARC hit rates often mask disk layout differences. When misses rise, the pool’s true random-read capability matters. If misses spike and latency spikes with them, you’re disk-bound.
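
If arcstat isn't installed, the same signal is in the raw kernel counters on Linux OpenZFS (cumulative since import, so compare two samples taken a minute apart):

cr0x@server:~$ grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats
hits                            4    1824551023
misses                          4    96322108

Hit rate is hits / (hits + misses); here that's roughly 95%, which means only one read in twenty is testing the disk layout at all.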

Task 4: Check dataset properties that quietly sabotage latency

cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,sync tank
NAME  PROPERTY     VALUE
tank  recordsize   128K
tank  compression  lz4
tank  atime        off
tank  sync         standard

Interpretation: A database dataset with recordsize=1M is a classic self-own. atime=on on a hot filesystem creates background write pressure. sync=disabled is an incident report waiting to happen.

Task 5: Find whether sync writes are your real problem

cr0x@server:~$ sudo zpool iostat -v tank 1 | head -n 20
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
tank              3.21T  4.05T    120    950  6.0M   38.0M
  mirror-0        1.60T  2.02T     70    480  3.4M   19.2M
  mirror-1        1.61T  2.03T     50    470  2.6M   18.8M
----------------  -----  -----  -----  -----  -----  -----

Interpretation: High write ops with low bandwidth often means lots of small writes. If the application is demanding sync semantics (databases, NFS with sync, VM flushes), your next question is: do you have a SLOG, and is it healthy?
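
OpenZFS can break the write stream down further: request-size histograms and queue activity are both split into sync and async classes.

cr0x@server:~$ sudo zpool iostat -r tank 5
cr0x@server:~$ sudo zpool iostat -q tank 5

Outputs are omitted here because they're wide, but the reading is simple: if the sync_write columns dominate in -r, or the sync write queues stay busy in -q, the workload is demanding durability on every commit and the SLOG question becomes urgent.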

Task 6: Confirm whether a SLOG exists and what it is

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            /dev/sdb1                ONLINE       0     0     0
            /dev/sdc1                ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            /dev/sdd1                ONLINE       0     0     0
            /dev/sde1                ONLINE       0     0     0
        logs
          nvme-SAMSUNG_MZ...-part1   ONLINE       0     0     0

Interpretation: If you care about sync write latency, the log device should be low-latency and power-loss-protected. A random consumer SSD used as SLOG can fail in ways that don’t show up in benchmarks but do show up as “why did fsync just take 800ms?”

Task 7: Create a mirror pool correctly (with explicit ashift)

cr0x@server:~$ sudo zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B \
  mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D

Interpretation: Use stable device identifiers. Explicit ashift avoids platform defaults biting you later. Multiple mirror vdevs is the point: you’re buying queues.

Task 8: Add another mirror vdev (scale performance and capacity)

cr0x@server:~$ sudo zpool add tank mirror \
  /dev/disk/by-id/ata-DISK_E /dev/disk/by-id/ata-DISK_F

Interpretation: This is how mirror pools grow: add more mirror vdevs. You generally cannot “turn” a RAIDZ vdev into mirrors. Layout is a design-time choice.

Task 9: Replace a failing disk in a mirror (the grown-up way)

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdb1    ONLINE       0     0     0
            sdc1    FAULTED     12     0     0  too many errors
          mirror-1  ONLINE       0     0     0
            sdd1    ONLINE       0     0     0
            sde1    ONLINE       0     0     0

cr0x@server:~$ sudo zpool replace tank sdc1 /dev/disk/by-id/ata-NEW_DISK-part1
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: resilver in progress since Tue Dec 24 09:12:03 2025
        312G scanned at 1.20G/s, 41.2G issued at 160M/s, 3.21T total
        41.2G resilvered, 1.26% done, 05:22:11 to go

Interpretation: zpool replace is the lifecycle operation you want. Watch resilver rates. If resilver is crawling, you may be IO-bound by workload, controller limits, or a struggling surviving disk.

Task 10: Offline/online a disk for maintenance (simulate failure safely)

cr0x@server:~$ sudo zpool offline tank /dev/disk/by-id/ata-DISK_B
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            ata-DISK_A             ONLINE       0     0     0
            ata-DISK_B             OFFLINE      0     0     0
          mirror-1                 ONLINE       0     0     0
            ata-DISK_C             ONLINE       0     0     0
            ata-DISK_D             ONLINE       0     0     0

cr0x@server:~$ sudo zpool online tank /dev/disk/by-id/ata-DISK_B

Interpretation: Useful for controlled maintenance. Don’t leave mirrors degraded longer than necessary; you’re one surprise away from downtime.

Task 11: Scrub and verify (and see if your pool is silently suffering)

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Tue Dec 24 10:03:41 2025
        1.20T scanned at 820M/s, 510G issued at 350M/s, 3.21T total
        0B repaired, 15.87% done, 02:22:10 to go

Interpretation: Scrubs catch and correct bad blocks using mirror redundancy. If scrubs routinely “repair” data, your drives, cabling, or controller path needs attention. Mirrors are forgiving; they are not a permission slip to ignore errors.

Task 12: Check fragmentation and free space (predict the future)

cr0x@server:~$ sudo zpool list -o name,size,alloc,free,frag,capacity,health
NAME  SIZE  ALLOC  FREE  FRAG  CAPACITY  HEALTH
tank  7.25T  3.21T 4.05T   18%      44%  ONLINE

Interpretation: High fragmentation and high capacity usage correlate with worse latency. If you’re past ~80% and the app is latency-sensitive, you’re now paying interest on earlier optimism.

Task 13: Find top IO consumers (dataset-level IO accounting)

cr0x@server:~$ grep -E '^(dataset_name|reads|writes) ' /proc/spl/kstat/zfs/tank/objset-0x*
/proc/spl/kstat/zfs/tank/objset-0x36:dataset_name                    7    tank/db
/proc/spl/kstat/zfs/tank/objset-0x36:writes                          4    48120993
/proc/spl/kstat/zfs/tank/objset-0x36:reads                           4    112004511
/proc/spl/kstat/zfs/tank/objset-0x54:dataset_name                    7    tank/vm
/proc/spl/kstat/zfs/tank/objset-0x54:writes                          4    9310244
/proc/spl/kstat/zfs/tank/objset-0x54:reads                           4    20488102
/proc/spl/kstat/zfs/tank/objset-0x71:dataset_name                    7    tank/home
/proc/spl/kstat/zfs/tank/objset-0x71:writes                          4    120881
/proc/spl/kstat/zfs/tank/objset-0x71:reads                           4    45102

Interpretation: zpool iostat stops at vdevs, but Linux OpenZFS publishes per-dataset counters (reads, writes, nread, nwritten) under /proc/spl/kstat/zfs/<pool>/objset-*; older releases may not have them. The counters are cumulative, so sample twice and diff to get rates. This tells you where the IO is coming from: if tank/db is hot and has the wrong recordsize or sync setting, you have a targeted fix instead of a vague storage panic.

Task 14: Verify ashift on existing vdevs

cr0x@server:~$ sudo zdb -C tank | grep -nE "ashift|type: 'mirror'"
34:            type: 'mirror'
48:                ashift: 12
71:            type: 'mirror'
85:                ashift: 12

Interpretation: Confirm alignment. If you discover ashift=9 on 4K sector drives, you’ve found a structural cause of latency. The fix is rebuild/replace vdevs, not tuning sysctls until morale improves.

Task 15: Measure latency indirectly (service time hints from iostat)

cr0x@server:~$ iostat -x 1
Linux 6.8.0 (server)  12/24/2025  _x86_64_  (32 CPU)

Device            r/s   w/s  rkB/s  wkB/s  r_await  w_await  aqu-sz  %util
sdb              90.0  60.0   4600   5100      7.2      4.1     0.9   18.0
sdc              88.0  62.0   4500   5200      8.0      4.3     1.0   19.0
sdd              30.0  90.0   1400   7800     25.0      9.0     1.6   22.0
sde              28.0  92.0   1300   7900     27.5      9.4     1.7   23.0

Interpretation: modern iostat reports r_await/w_await (svctm is gone, and was never trustworthy anyway). Waits rising while %util is not pegged can indicate queueing above the device layer (controller, filesystem, sync write stalls). One mirror leg showing consistently worse r_await than its partner is a clue: that disk, path, or cable might be degraded.
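
For latency as ZFS sees it, rather than inferred from the block layer, OpenZFS also has a wait-histogram mode:

cr0x@server:~$ sudo zpool iostat -w tank 5

It prints per-bucket latency histograms for total wait, disk wait, and queue wait (per vdev with -v). The output is too wide to reproduce here, but the disk-wait versus queue-wait split answers the same "device or layer above?" question directly.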

Fast diagnosis playbook

This is the “walk into the room, look at three things, and don’t get distracted” routine. It’s designed for outages and slowdowns where you need a hypothesis quickly.

First: Is it cache, disk, or sync?

  1. ARC hit rate: run arcstat 1 (or equivalent). If misses are low, disk layout might not be the problem right now.
  2. Pool IO shape: run zpool iostat -v tank 1. Look for high small-write ops, skewed vdev load, and whether read or write dominates.
  3. Sync pressure: check if the workload is sync-heavy (DB commits, NFS sync). Confirm whether a SLOG exists via zpool status. If no SLOG and sync writes dominate, you’ve likely found the latency culprit.

Second: Is a single device/path misbehaving?

  1. Health and errors: zpool status -v. Any READ/WRITE/CKSUM errors? Any vdev DEGRADED?
  2. Per-disk latency hints: iostat -x 1. One disk with high await or retries is often “the problem,” even though the pool is “ONLINE.”
  3. Kernel logs: check for link resets/timeouts.

Third: Are you out of space or drowning in fragmentation?

  1. Capacity and frag: zpool list -o size,alloc,free,frag,capacity. If you’re above ~80% and frag is high, random IO pain is expected.
  2. Dataset offenders: zfs iostat -v 1 to find hot datasets, then inspect their properties.
  3. Background work: scrubs/resilvers can dominate IO. Check zpool status scan line.

Once you have the category—cache miss, sync write latency, single bad disk/path, or full/fragmented pool—you can act. Without that, you’ll spend hours “tuning” a system that’s simply out of design envelope.

Three corporate-world mini-stories

1) Incident caused by a wrong assumption: “Mirror writes are fast, right?”

A team migrated a transactional service from a managed database to self-hosted instances for cost reasons. The storage plan looked sound: “We’ll use ZFS mirrors, so it’ll be fast.” They built a pool of several mirror vdevs on decent SSDs, turned on compression, and called it done. Early tests were great—because the tests were mostly reads and bulk loads. The service went live on a Monday, because of course it did.

By Tuesday afternoon, the on-call rotation had discovered a new flavor of alert fatigue. Latency spikes lined up with transaction bursts. The database wasn’t CPU-bound. Network was clean. The disks weren’t saturated in bandwidth. But commits were stalling long enough to trigger timeouts. The incident channel filled with the usual suspects: “Maybe ZFS is slow?” “Maybe we need bigger instances?” “Maybe disable fsync?”

The wrong assumption was subtle: they thought mirrors would accelerate writes the way they accelerate random reads. But the workload was dominated by synchronous commits, and the system had no dedicated SLOG. ZFS was doing the right thing—honoring sync semantics by committing to stable storage—but the latency profile of the main pool devices (and the write amplification from small sync writes) made it ugly under pressure.

The fix was not a mysterious sysctl. They added a power-loss-protected low-latency device as a mirrored SLOG (because yes, log devices can fail too), verified that sync write latency dropped, and retested with actual transaction patterns. The service stabilized. The postmortem wasn’t about ZFS being “slow.” It was about matching layout to IO shape and understanding that mirrors are a read-latency weapon, not a universal performance potion.

The lesson that stuck: mirrors gave them the random read headroom they wanted, but without SLOG they still had synchronous write tail latency. They started treating “sync-heavy” as a first-class requirement, not an afterthought.

2) Optimization that backfired: “Let’s crank recordsize and turn off sync”

This one begins the way many storage tragedies begin: with a dashboard screenshot and a confident sentence. A developer showed that the application was doing lots of IO, and someone suggested “optimizing” the filesystem. The proposal was simple: increase recordsize to 1M “for throughput” and set sync=disabled “because fsync is expensive.” The team wanted their graphs to look better. They got their wish.

For about two weeks, performance looked great. Then a routine maintenance window included a power cycle that took longer than expected. When systems came back, the application data had inconsistencies. Not a dramatic “everything is gone” failure—worse. A small number of recent transactions were missing, and some records were partially applied. The kind of corruption that turns into a long week of reconciling and apologizing.

Disabling sync removed the contractual guarantee the application was relying on. ZFS did exactly what it was told: treat synchronous writes as asynchronous. The mirror layout was not the villain. The “optimization” was. Increasing recordsize also had a cost: the database was now reading and rewriting larger blocks than necessary, which made random IO spikier under load, particularly when ARC wasn’t catching it.

The recovery plan involved restoring from backups and replaying what they could from upstream logs. The lasting change was cultural: filesystem properties became change-controlled, and performance work required a workload-representative test plan. The team learned to fear any optimization that begins with “we can probably disable safety; it’ll be fine.”

Mirrors didn’t save them from themselves. But the later redesign did: they reverted sync to standard, used a proper SLOG for the sync-heavy parts, tuned recordsize per dataset, and kept mirrors because the random read profile was the real long-term need.

3) Boring but correct practice that saved the day: replacing disks before they “fail”

In a large corporate environment, the most heroic work is usually quiet. One storage fleet had a policy: if a disk starts logging medium errors or shows a rising error rate, replace it during business hours—before it trips a vdev into a degraded state. It wasn’t dramatic. It was also occasionally unpopular because it looked like spending money “early.”

One quarter, they started seeing sporadic latency spikes on a pool backing internal CI and artifact storage. Nothing screaming in zpool status. No DEGRADED vdevs. Just tail latency spikes that made builds flaky and developers unhappy. Instead of blaming the network or “Kubernetes being Kubernetes,” the on-call ran the standard triage: zpool iostat -v showed one mirror leg with uneven load and worse service times. iostat -x hinted that a single disk was occasionally stalling.

They pulled SMART data and found early signs of trouble—nothing that would trigger an immediate failure ticket, but enough to correlate with the stalls. They replaced the disk in a planned window using zpool replace, watched the resilver complete, and the latency spikes disappeared. No outage. No postmortem. Just a closed loop between observation and maintenance.

Weeks later, a different disk in a different mirror failed hard. The team shrugged, replaced it, and moved on. The important part: because they’d kept the pool healthy and had a culture of proactive replacement, the “real” failure didn’t coincide with a second marginal disk, a scrub backlog, and a full pool. The mirrors did their job because operations did theirs.

Common mistakes: symptoms and fixes

Mistake 1: Building one giant RAIDZ vdev for a random IO workload

Symptoms: Sequential benchmarks look fine. Real apps (DB/VMs) show high latency and low IOPS. zpool iostat shows one vdev doing everything because there is only one vdev.

Fix: For latency-sensitive random IO, design with multiple mirror vdevs. If you already built RAIDZ, the fix is usually migration to a new pool layout (send/receive, replication, rebuild). There isn’t a safe “convert in place” shortcut that preserves the same vdev.

Mistake 2: Wrong ashift

Symptoms: Small writes are inexplicably slow. CPU and bandwidth are fine, but latency is persistent. You see heavy read-modify-write behavior at the device layer.

Fix: Verify with zdb -C. If it’s wrong, plan a rebuild: replace vdevs with correctly aligned ones or migrate to a new pool. Don’t waste time chasing “tuning” for a geometry problem.

Mistake 3: Turning off sync to “fix” performance

Symptoms: Performance graphs improve. Then an unclean shutdown leads to missing recent data or application-level inconsistencies.

Fix: Keep sync=standard unless you can prove the application does not require durability (rare). If you need sync performance, add a proper SLOG and confirm it’s actually being used (and is low-latency and PLP-capable).

Mistake 4: Using consumer SSD as SLOG

Symptoms: Sync write latency is unstable; occasional long stalls. SLOG shows errors or resets. Sometimes performance degrades over time as the device’s internal GC fights you.

Fix: Use an enterprise-grade, power-loss-protected device. Consider mirroring the SLOG. Monitor it like it matters—because it does.

Mistake 5: Filling the pool to the brim

Symptoms: Latency increases as pool approaches high utilization. Allocation gets expensive. Fragmentation rises. Everything feels “mushy” and unpredictable.

Fix: Keep free space headroom (often 20–30% for hot pools). Add vdevs early. If you’re already full, migrate cold data elsewhere or expand; tuning won’t create free space.

Mistake 6: Ignoring one “slightly weird” disk

Symptoms: Pool is ONLINE but tail latency spikes. One disk shows higher await or sporadic errors. Scrubs take longer than usual.

Fix: Treat marginal hardware as a performance bug. Replace suspicious disks and fix flaky cables/paths. Mirrors hide failures; they also hide early warnings unless you look.

Mistake 7: Mixing disk sizes or speeds within a mirror

Symptoms: You only get capacity equal to the smallest disk in each mirror. Resilver and write latency follow the slowest member. Performance is inconsistent.

Fix: Mirror like-with-like when possible. If you must mix, do it intentionally and accept the constraints (and document them so Future You doesn’t “optimize” it into chaos).

Checklists / step-by-step plan

Step-by-step: designing a mirror pool for random IO

  1. Classify the workload: random read heavy (DB/VM), sync write heavy (transaction logs, NFS sync), sequential (backups/media), mixed.
  2. Pick the top-level vdev strategy:
    • Random IO + latency sensitive: multiple 2-way mirror vdevs.
    • Extreme availability requirements: consider 3-way mirrors for the hot tier.
    • Capacity-first + sequential: RAIDZ2/3 may be appropriate, but understand the random IO tradeoffs.
  3. Decide growth pattern: mirrors grow by adding mirror vdevs. Confirm the business can buy disks in pairs (or triples) later.
  4. Choose ashift: usually 12. Be explicit.
  5. Plan for sync writes: if you have a sync-heavy workload, design SLOG upfront (and budget for a PLP device).
  6. Set dataset defaults: compression=lz4, atime=off, and sane recordsize per dataset (not one-size-fits-all).
  7. Operational policy: scrub schedule, SMART monitoring, spare inventory, replacement playbook.

Step-by-step: validating mirrors deliver the latency you expect

  1. Confirm topology with zpool status and verify multiple mirror vdevs exist.
  2. Baseline IO: measure with application-like load, not just synthetic sequential tests.
  3. Observe distribution: zpool iostat -v 1 should show work spread across vdevs.
  4. Verify cache behavior: arcstat to see whether you’re truly hitting disk.
  5. Test failure mode: offline a disk during a controlled window, watch behavior, then online it. Ensure monitoring alerts correctly.
  6. Document the “known good” settings: recordsize/volblocksize, sync semantics, SLOG device type, ashift.

Step-by-step: when you already have RAIDZ and want mirrors

  1. Admit it’s a migration: you’re not “tuning” into mirrors. You’re moving data.
  2. Build a new pool with mirror vdevs and correct ashift.
  3. Use replication: zfs send/zfs receive for datasets; keep snapshots consistent (a sketch follows this list).
  4. Cut over during a planned window, validate, then decommission the old pool.
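
A minimal replication sketch, assuming the old pool is tank, the new mirror pool is tank2, and a hypothetical dataset tank/db; the incremental pass shrinks the final cutover window:

cr0x@server:~$ sudo zfs snapshot -r tank/db@migrate-base
cr0x@server:~$ sudo zfs send -R tank/db@migrate-base | sudo zfs receive -u tank2/db

Then, in the cutover window with writers stopped:

cr0x@server:~$ sudo zfs snapshot -r tank/db@migrate-final
cr0x@server:~$ sudo zfs send -R -I tank/db@migrate-base tank/db@migrate-final | sudo zfs receive -uF tank2/db

-u keeps the received datasets unmounted until you're ready; -F lets the incremental receive roll the target back to the base snapshot if anything touched it in between.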

FAQ

1) Are ZFS mirrors always faster than RAIDZ?

No. Mirrors are typically faster for random reads and latency-sensitive mixed workloads. RAIDZ can be excellent for sequential throughput and capacity efficiency. Pick based on IO shape and operational constraints.

2) How many mirror vdevs do I need?

Enough that your workload can spread across vdev queues. As a rough operational heuristic: more top-level vdevs generally means more IOPS and lower contention. But controllers, CPU, and network can become the new bottleneck, so measure and don’t overbuild blindly.

3) Should I use 2-way or 3-way mirrors?

Use 2-way mirrors by default. Use 3-way when availability and rebuild risk dominate and you can afford the capacity cost—especially for small, critical hot datasets.

4) Do mirrors help write latency?

Not directly per vdev. Writes must go to all members of a mirror. Pool-level write throughput improves by having multiple mirror vdevs, but synchronous write latency is usually addressed with a proper SLOG, not mirrors alone.

5) What’s the single biggest “don’t mess this up” setting?

ashift. Get sector alignment right at pool creation. A wrong ashift can permanently degrade small-write performance and is painful to correct later.

6) Is SLOG required for mirror pools?

No. It’s required only if you have significant synchronous write workload and you care about latency. If your workload is mostly asynchronous, a SLOG often provides little benefit.

7) Why is my mirror pool fast in benchmarks but slow in production?

Benchmarks often test sequential throughput or unrealistic queue depths. Production pain is frequently tail latency from sync writes, a single misbehaving disk/path, a nearly-full pool, or a dataset property mismatch (recordsize/volblocksize).

8) Can I expand a mirror vdev by adding a third disk later?

Yes. zpool attach adds a device to an existing mirror vdev (the same command that turns a single-disk vdev into a mirror), and the new member counts as redundancy once it finishes resilvering. Treat it as a change with real risk all the same: test in a lab, confirm monitoring, and make sure replacement procedures handle the new topology.

9) Do mirrors reduce resilver time?

Often, yes. Mirror resilvers are typically simpler than parity reconstruction. Depending on features and platform behavior, ZFS may resilver only allocated blocks, which can make recovery faster than “copy the whole disk.” But resilver speed still depends on workload, disk health, and controller limits.

10) What if I want both capacity efficiency and random IO performance?

Split tiers. Use mirrors for hot random IO datasets and RAIDZ for cold/sequential data—or separate pools entirely. Trying to make one layout satisfy every workload is how you get average performance and above-average incidents.

Conclusion

ZFS mirrors don’t cheat physics. They exploit it. Random IO is a queueing problem, and mirrors give ZFS more choices: more queues, more parallelism, and more opportunities to avoid the one disk that’s quietly ruining your tail latency. When you build a pool out of multiple mirror vdevs, you’re not just duplicating data—you’re buying scheduling freedom.

The operational payoff is bigger than benchmarks. Mirrors make failure handling straightforward, scrubs meaningful, and performance more predictable under mixed workloads. Pair them with sane properties, sufficient free space, and (when needed) a real SLOG, and you get the storage equivalent of a well-run on-call rotation: not glamorous, but suddenly everything stops being an emergency.
