You don’t notice ZFS internals when things are healthy. You notice them at 02:17, when latency spikes,
replicas start timing out, and someone asks whether “adding cache” will fix it. (It won’t. Not like that.)
This is the glossary you wish you’d had before you made that pool design “just work,” shipped it to production,
and discovered your mental model was mostly vibes. We’ll define the big nouns—VDEV, TXG, ARC, SPA—then use them
to make decisions, debug bottlenecks, and avoid the classic self-inflicted outages.
Interesting facts and history (the parts that actually matter)
- ZFS started at Sun in the early 2000s as an answer to the “filesystem + volume manager + RAID tool” mess. The design assumption: storage stacks should be one coherent system.
- Copy-on-write wasn’t new, but ZFS made it operationally mainstream: every change writes new blocks, then flips pointers atomically. That’s why it can do consistent snapshots without freezing the world.
- The “Zettabyte” name was aspirational at a time when multi-terabyte pools felt enormous. Today, the name reads less like marketing and more like a warning.
- OpenZFS is a community-led continuation after Sun’s acquisition era. Features like special vdevs and persistent L2ARC evolved largely in the open, driven by pain from real operators.
- Checksumming everything wasn’t a common default in commodity filesystems back then. ZFS made “silent corruption is a storage bug, not an application problem” an operational stance.
- ARC became a pattern: a filesystem-managed adaptive cache that knows about blocks, compression, and metadata. It’s not just “RAM is a disk cache.” It’s a policy engine with opinions.
- TXGs are the compromise between “sync everything” and “buffer forever.” The transaction group model is why ZFS can be fast and consistent—until your workload turns it into a traffic jam.
- RAID-Z is not hardware RAID. Parity is computed by ZFS, with awareness of block sizes and checksums. It’s better in many ways, but you don’t get to pretend parity math is free.
- The 128K recordsize default wasn’t chosen for databases. It’s a throughput-friendly default for large sequential IO. If you run random 8K writes and never change it, you’re not unlucky—you’re misconfigured.
Core glossary: VDEV, TXG, ARC, SPA (with the real implications)
VDEV (Virtual Device): the unit of performance and failure
A VDEV is ZFS’s building block for a pool’s storage. People casually say “my pool has 12 disks,”
but ZFS hears: “my pool has these vdevs.” And vdevs are what determine
IOPS, latency, and often your blast radius.
A vdev can be a single disk (don’t), a mirror, RAID-Z, a file (really don’t), a special vdev for metadata,
a log vdev (SLOG), or a cache device (L2ARC).
The operational rule: vdevs add performance; disks inside a vdev add redundancy (and sometimes bandwidth).
In a pool made of multiple vdevs, ZFS stripes allocations across vdevs. More vdevs usually means more parallelism.
But within a RAID-Z vdev, small random writes still pay a parity tax and tend to serialize more than you’d like.
Mirror vdevs are the workhorse for latency-sensitive workloads. RAID-Z vdevs are for capacity and sequential throughput.
If you run mixed workloads, pick your poison deliberately instead of letting procurement pick it for you.
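If that sounds abstract, here’s a minimal sketch of the choice, assuming six hypothetical disks (sda through sdf; in production you’d use /dev/disk/by-id paths, like everywhere else in this article). Same disk count, very different pools:
cr0x@server:~$ # Option A: three mirror vdevs -> three vdevs of parallelism, ~50% usable capacity
cr0x@server:~$ zpool create fastpool mirror sda sdb mirror sdc sdd mirror sde sdf
cr0x@server:~$ # Option B: one RAID-Z2 vdev -> one vdev of IOPS, more usable capacity
cr0x@server:~$ zpool create bigpool raidz2 sda sdb sdc sdd sde sdf
Option A is what you pick for random IO and latency; Option B is what you pick for capacity and sequential streams. Neither is wrong, but only one of them matches your workload.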
TXG (Transaction Group): how ZFS turns chaos into an atomic commit
A TXG is a batch of in-memory changes that ZFS eventually commits to stable storage. Think of it as
“the set of dirty blocks we’ll write out together, then declare durable.”
ZFS cycles TXGs through states: open (accepting changes), quiescing (stopping new changes),
and syncing (writing to disk). The switching is periodic and pressure-based.
If your pool is healthy, this is invisible. If your pool is overloaded, you’ll see it as:
write latency spikes, sync storms, and applications blocked in fsync().
TXG behavior is why you can have high throughput but still awful tail latency. The pool might be “busy syncing”
and your workload is forced to wait. Your monitoring needs to separate “we’re writing” from “we’re blocked waiting
to finish writing.”
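On Linux with OpenZFS, the knobs that shape this cadence are visible as module parameters; the values below are illustrative defaults, not a recommendation. Reading them is harmless, and it tells you how often TXGs rotate and how much dirty data can accumulate before writers get throttled:
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
4294967296
Five seconds and 4 GiB in this example. Changing these is a capacity decision, not a quick fix, so measure before you touch them.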
ARC (Adaptive Replacement Cache): RAM as policy, not just cache
The ARC is ZFS’s in-memory cache for frequently and recently used blocks—data and metadata.
It is not a dumb LRU. It’s adaptive: it tries to balance “recent” and “frequent” access patterns.
ARC is also a political actor in your system. It competes with applications for memory. If you let it,
it will happily eat RAM until the kernel makes it stop. That’s not a bug; it’s the bargain:
memory unused is performance wasted. But in production, “unused” is rarely true—your database, JVM, and page cache
are also hungry.
ARC has multiple important populations: MFU/MRU lists (frequent/recent), metadata, anonymous buffers, and dirty buffers awaiting TXG sync.
When people say “ARC hit rate is low,” the useful follow-up is: low for which class, under what workload?
SPA (Storage Pool Allocator): the pool brain
The SPA is the internal subsystem that coordinates the pool: vdev management, allocations,
metaslabs, space maps, and the high-level state machine that makes “a pool” a coherent thing.
If ZFS were a company, the SPA would be the operations team that actually schedules the work.
You rarely touch the SPA directly, but you see its decisions everywhere: how blocks are allocated across vdevs,
how free space is tracked, why fragmentation happens, and why some pools age like fine wine while others age like milk.
One quote worth keeping on your wall, because storage failures are almost always coordination failures:
“Hope is not a strategy.”
— often attributed to General Gordon R. Sullivan (and a staple in engineering and operations)
Extended glossary you’ll trip over in production
Pool
A pool is the top-level storage object. It aggregates vdevs. You cannot shrink it by removing
a RAID-Z vdev. Plan accordingly. Your future self will not be impressed by “we’ll just migrate later.”
Dataset
A dataset is a filesystem with its own properties: compression, recordsize, quota, reservation,
sync behavior, and more. Datasets are how you stop one workload from poisoning another—if you actually use them.
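A minimal sketch of that separation, with hypothetical dataset names: one dataset per workload, each with properties that match its IO pattern instead of a pool-wide compromise.
cr0x@server:~$ zfs create -o recordsize=16K -o compression=zstd -o atime=off tank/db
cr0x@server:~$ zfs create -o recordsize=1M -o compression=zstd -o atime=off tank/archive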
Zvol
A zvol is a block device backed by ZFS. Use it for iSCSI, VM disks, and things that demand
block semantics. Tune volblocksize for the workload. Leave it wrong, and you’ll discover new and
exciting ways to waste IOPS.
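For example (name, size, and block size are hypothetical), a zvol for a VM that mostly issues 16K IO might be created like this; since volblocksize is fixed at creation, this command is your one chance to get it right:
cr0x@server:~$ zfs create -V 100G -o volblocksize=16K -o compression=zstd tank/vm-web01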
Recordsize and volblocksize
recordsize is the maximum block size for files in a dataset. Large recordsize helps sequential
throughput and compression. Small recordsize helps random IO and reduces write amplification on small updates.
volblocksize is the block size for zvols. It’s fixed at creation time.
If you’re storing 8K database pages in a 128K volblocksize zvol, you are asking ZFS to do extra work.
Metaslab
A metaslab is a region of space within a vdev used for allocation. Metaslab fragmentation is a
common “my pool is 70% full and everything is slow” story. SPA and metaslab classes decide where blocks go;
if you’re near full, those decisions get constrained and expensive.
Scrub and resilver
A scrub verifies checksums and repairs from redundancy. A resilver rebuilds
redundancy after a device is replaced. Both compete with your workload for IO. Throttle them appropriately,
but don’t skip them. Scrub is how you find latent errors before they become data loss.
SLOG and ZIL
The ZIL is the in-pool intent log for synchronous writes. The SLOG is an optional
separate log device to accelerate those synchronous writes. SLOG does not make async writes faster. It makes sync
writes less miserable by giving them a low-latency landing zone.
The SLOG must be reliable and power-loss safe. If you put “fast but fragile” flash as SLOG, you’re building a
lie into your durability model. And ZFS will faithfully operationalize that lie.
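Adding a SLOG is one command; a sketch with hypothetical NVMe device names, mirrored because an unmirrored log device dying at the wrong moment is exactly the risk you were trying to buy your way out of:
cr0x@server:~$ zpool add tank log mirror /dev/disk/by-id/nvme-SLOG-A /dev/disk/by-id/nvme-SLOG-B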
L2ARC
L2ARC is a second-level cache on fast devices (usually SSD/NVMe). It caches reads, not writes.
Historically it also cost memory to index; modern implementations improved persistence and behavior, but it still
isn’t free. L2ARC is not a substitute for enough RAM or sane dataset tuning.
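If you decide it earns its keep, attaching and detaching a cache device is low-drama, which is part of its appeal (device name is hypothetical):
cr0x@server:~$ zpool add tank cache /dev/disk/by-id/nvme-CACHE-A
cr0x@server:~$ # changed your mind? cache devices can be removed without rebuilding anything
cr0x@server:~$ zpool remove tank /dev/disk/by-id/nvme-CACHE-A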
Ashift
ashift is the sector size exponent used by ZFS for a vdev. The canonical production advice:
set it correctly at pool creation, because changing it later is effectively “rebuild the pool.”
Wrong ashift is a slow leak that becomes a flood under load.
Joke #1: Setting ashift wrong is like buying shoes a size too small—technically you can walk, but you’ll hate yourself on stairs.
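Before you create the pool, check what the drives actually report and set ashift explicitly rather than trusting autodetection; a sketch with placeholder device names:
cr0x@server:~$ lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda
NAME PHY-SEC LOG-SEC
sda     4096     512
cr0x@server:~$ # 4K physical sectors -> ashift=12 at creation time
cr0x@server:~$ zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-A /dev/disk/by-id/ata-B
Some drives report 512-byte sectors for compatibility; when in doubt, ashift=12 is the safer bet on modern disks.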
Compression
Compression saves space and often improves performance: it trades IO for CPU, and IO is usually the bottleneck.
But if your CPU is already pinned, compression can become the last straw. Measure before you declare victory.
Copy-on-write and fragmentation
Copy-on-write means blocks are rewritten elsewhere, not overwritten in place. That’s great for consistency and
snapshots, but it can fragment. Fragmentation gets worse as the pool fills and as you churn data with snapshots.
“ZFS is slow when full” is a meme because it’s often true in specific ways.
A usable mental model: from syscall to spinning rust
Reads: the happy path (when it’s happy)
A read comes in. ZFS checks ARC. If the block is in ARC, you’re done quickly. If not, ZFS schedules reads to vdevs,
pulls blocks, verifies checksums, decompresses (if needed), and optionally populates ARC (and maybe L2ARC over time).
Latency is driven by: cache hit rate, vdev queue depth, device latency, and how scattered the blocks are.
Mirrors can serve reads from either side, often improving parallelism. RAID-Z reads can be fine, but small random reads
contend with the parity layout, and repair/scrub traffic adds IO amplification on top.
Writes: where TXG and ZIL decide your fate
An async write goes to memory (dirty buffers in ARC). Later, the TXG sync writes it out. If the system crashes before
the TXG sync, those async writes are lost. That’s the contract. Apps that care call fsync() or use sync writes.
A sync write must be committed in a way that survives a crash. ZFS uses the ZIL: it logs the intent quickly, acknowledges,
then later the TXG sync writes the real blocks and frees the log records. If you have a SLOG, the log writes go there.
If you don’t, they go to the main pool, and your slowest vdev becomes your sync latency.
The operational takeaway: sync write latency is often a “log device and pool latency” problem, not a raw bandwidth problem.
You can have 5 GB/s sequential write throughput and still have 20 ms fsync latency that destroys a database.
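If you want that number instead of an argument, a small synchronous-write probe makes it concrete. A hedged sketch using fio (dataset path, size, and runtime are placeholders); it writes real data, so aim it at scratch space:
cr0x@server:~$ fio --name=synclat --directory=/tank/scratch --rw=randwrite --bs=8k \
      --size=1G --runtime=30 --time_based --fsync=1 --group_reporting
The completion and sync latency percentiles from a run like this are what your database feels; the bandwidth number is mostly decoration.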
Space management: SPA, metaslabs, and why “80% full” is not just a number
The SPA carves each vdev into metaslabs and tracks free space via space maps. When pools are emptier,
allocations are cheap: pick a big free region, write. As pools fill and fragment, allocations become:
find enough segments, deal with scattered free extents, update metadata, and pay more seeks and more bookkeeping.
That’s why capacity planning is performance planning. If you treat “free space” as only a budget line, ZFS will
eventually collect its interest in latency.
Practical tasks: commands, output meaning, and what you decide
These are not party tricks. These are the commands you run when someone says “storage is slow,” and you’d like to
respond with evidence instead of feelings.
Task 1: Identify pool health and immediate red flags
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:14:19 with 0 errors on Wed Dec 24 03:00:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/dev/disk/by-id/ata-A ONLINE 0 0 0
/dev/disk/by-id/ata-B ONLINE 0 0 0
errors: No known data errors
What it means: State and error counters tell you whether you’re debugging performance or
actively losing redundancy/data integrity. Scrub results matter: repaired bytes means you had corruption
that redundancy fixed.
Decision: If you see DEGRADED, rising READ/WRITE/CKSUM, or an active resilver,
stop tuning and start stabilizing. Performance debugging comes after you stop the bleeding.
Task 2: See vdev-level throughput and latency trends
cr0x@server:~$ zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 1.20T 2.40T 320 980 45.2M 110M
mirror-0 1.20T 2.40T 320 980 45.2M 110M
ata-A - - 160 490 22.6M 55.1M
ata-B - - 160 490 22.6M 55.0M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: You’re seeing per-vdev and per-disk ops/bandwidth. Mirrors split reads across sides;
writes go to both. If one disk shows notably higher latency in iostat -x (next task), it can drag the mirror down.
Decision: If one vdev is saturated (high ops, low bandwidth, rising latency), add vdevs (more parallelism)
or change workload characteristics (recordsize, sync behavior). If a single disk is misbehaving, replace it.
Task 3: Confirm device latency and queueing at the OS level
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/26/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.20 0.00 2.10 18.40 0.00 73.30
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 95.0 280.0 12.3 34.8 352.1 7.80 25.6 8.2 31.4 1.1 41.0
sdb 92.0 275.0 12.1 34.6 353.0 8.10 27.4 8.5 33.2 1.1 43.5
What it means: await and avgqu-sz show latency and queue depth.
High %iowait suggests CPU is waiting on storage. High %util means devices are busy.
Decision: If await is high and %util is pegged, you’re storage-limited.
If %iowait is low but apps are slow, look elsewhere (locks, network, CPU).
Task 4: Check dataset properties that directly change IO behavior
cr0x@server:~$ zfs get -o name,property,value recordsize,compression,atime,sync,logbias tank/app
NAME PROPERTY VALUE
tank/app recordsize 128K
tank/app compression zstd
tank/app atime off
tank/app sync standard
tank/app logbias latency
What it means: These are the knobs that decide whether ZFS is doing 128K IO, compressing,
updating access times, and treating sync writes with a latency bias.
Decision: For databases, consider smaller recordsize (often 16K or 8K depending on DB page size),
keep atime=off unless you truly need it, and be cautious with sync changes. “sync=disabled” is not tuning;
it’s negotiating with physics using accounting fraud.
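One caveat if you do adjust recordsize on an existing dataset: it only applies to newly written blocks, so existing files keep their old layout until rewritten. A minimal sketch (the value is an example, not a universal recommendation):
cr0x@server:~$ zfs set recordsize=16K tank/app
cr0x@server:~$ zfs get -o name,property,value recordsize tank/app
NAME PROPERTY VALUE
tank/app recordsize 16K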
Task 5: Inspect ARC behavior (Linux)
cr0x@server:~$ awk 'NR==1 || /^(size|c |c_min|c_max|hits|misses|mfu_hits|mru_hits|prefetch_data_hits|prefetch_data_misses)/' /proc/spl/kstat/zfs/arcstats
13 1 0x01 204 33728 11970458852 6427389332172
size                            4    4509715456
c                               4    6442450944
c_min                           4    1073741824
c_max                           4    25769803776
hits                            4    132948210
misses                          4    21928411
mfu_hits                        4    90128210
mru_hits                        4    42820000
prefetch_data_hits              4    1428000
prefetch_data_misses            4    6180000
What it means: size is the current ARC size and c is the target (both in bytes), and the hit/miss counters
hint whether reads are cache-friendly. Prefetch misses can indicate workloads that defeat sequential heuristics.
Decision: If misses dominate and your workload is read-heavy, you may need more RAM, better locality,
or a different layout (mirrors, special vdev for metadata). If ARC is huge and apps are swapping, cap ARC.
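Capping ARC on Linux is a module parameter; a sketch assuming you’ve decided 16 GiB is the right ceiling for this host. The runtime write takes effect as the ARC gradually shrinks; the modprobe.d line makes it survive a reboot:
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
17179869184
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee -a /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184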
Task 6: Check ZFS dirty data and TXG pressure (Linux)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
4294967296
cr0x@server:~$ awk '$1 == "anon_size"' /proc/spl/kstat/zfs/arcstats
anon_size                       4    812345678
What it means: zfs_dirty_data_max is the ceiling on write-back pending (dirty) data, and anon_size in arcstats
roughly tracks dirty buffers waiting on a TXG. If dirty data approaches the max and stays there,
the pool can’t flush fast enough. That’s when TXG sync starts throttling writers.
Decision: If dirty data is chronically high, you’re over-driving the pool. Reduce write rate,
improve vdev write latency (more vdevs, faster devices), or adjust workload (bigger sequential writes, compression).
Task 7: Identify sync write pain and whether SLOG is present
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-A ONLINE 0 0 0
ata-B ONLINE 0 0 0
logs
nvme-SLOG ONLINE 0 0 0
What it means: A logs section means you have a separate log device. Without it,
sync writes hit the main vdevs and inherit their latency.
Decision: If you have a sync-heavy workload (databases, NFS with sync, VM storage),
consider a proper power-loss-safe SLOG. If you already have one and sync is still slow, validate it isn’t the bottleneck.
Task 8: See per-dataset space usage and snapshot pressure
cr0x@server:~$ zfs list -t all -o name,used,avail,refer,compressratio -r tank
NAME USED AVAIL REFER RATIO
tank 1.20T 2.40T 128K 1.45x
tank/app 620G 2.40T 410G 1.62x
tank/app@daily-1 80G - 390G 1.60x
tank/logs 110G 2.40T 110G 1.05x
What it means: Snapshots consume space via changed blocks. If used is growing but refer is stable,
snapshots (or clones) are holding onto history.
Decision: If the pool is filling and performance is degrading, audit snapshot retention.
Keep the snapshots you need, delete the ones you don’t, and stop pretending “infinite retention” is free.
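To see which snapshots are actually holding space (output is illustrative), sort by used and look at the bottom of the list:
cr0x@server:~$ zfs list -t snapshot -o name,used,refer -s used -r tank | tail -n 3
tank/app@daily-3 12G 402G
tank/app@daily-2 35G 398G
tank/app@daily-1 80G 390G
Remember that a snapshot’s used column only counts space unique to that snapshot; overlapping snapshots can collectively pin far more than the column totals suggest.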
Task 9: Check pool fragmentation and capacity headroom
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,dedup,health
NAME SIZE ALLOC FREE FRAG CAP DEDUP HEALTH
tank 3.60T 1.20T 2.40T 18% 33% 1.00x ONLINE
What it means: frag is a heuristic, but it’s directionally useful. cap shows fullness.
As cap rises past ~70–80% on busy pools, allocations get harder and latency usually follows.
Decision: If you’re above ~80% and latency matters, plan capacity expansion or migration now.
“We’ll wait until 95%” is how you end up debugging metaslab allocation under fire.
Task 10: Inspect ashift and vdev layout (and catch the irreversible mistakes early)
cr0x@server:~$ zdb -C tank | grep -E "type:|ashift:|path:"
            type: 'root'
                type: 'mirror'
                ashift: 12
                    type: 'disk'
                    path: '/dev/disk/by-id/ata-A'
                    type: 'disk'
                    path: '/dev/disk/by-id/ata-B'
What it means: ashift: 12 means 4K sectors. If you have 4K/8K-native drives but ashift is too small,
you’ve forced read-modify-write behavior that will haunt you.
Decision: If ashift is wrong, budget a rebuild/migration. There’s no magical “fix ashift in place” button.
Task 11: Verify compression is helping (not just “enabled”)
cr0x@server:~$ zfs get -o name,property,value,source compression,compressratio tank/app
NAME PROPERTY VALUE SOURCE
tank/app compression zstd local
tank/app compressratio 1.62x -
What it means: The ratio shows actual savings. A ratio near 1.00x means your data is incompressible
or already compressed (media files, encrypted blobs).
Decision: If ratio is ~1.00x and CPU is strained, consider lighter compression or none. If ratio is good,
compression is probably buying you IO headroom.
Task 12: Detect a scrub/resilver stealing your IO budget
cr0x@server:~$ zpool status -x tank
pool 'tank' is healthy
cr0x@server:~$ zpool status tank | sed -n '1,12p'
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 25 03:00:12 2025
540G scanned at 610M/s, 120G issued at 140M/s, 1.20T total
0B repaired, 9.77% done, 02:13:37 to go
What it means: A scrub in progress competes for reads (and sometimes writes for repairs).
The “issued” rate is the useful one; it reflects what’s actually being pushed to devices.
Decision: If you’re in a performance incident, you may pause or schedule scrubs off-peak
(depending on platform controls). But don’t “solve” performance by never scrubbing again.
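On OpenZFS versions that support pausing, it’s one command each way; a sketch:
cr0x@server:~$ zpool scrub -p tank
cr0x@server:~$ # later, off-peak: re-issuing the scrub resumes where it left off
cr0x@server:~$ zpool scrub tank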
Task 13: Confirm mountpoints and avoid accidental double-mount weirdness
cr0x@server:~$ zfs get -o name,property,value mountpoint,canmount tank/app
NAME PROPERTY VALUE
tank/app mountpoint /srv/app
tank/app canmount on
What it means: Mis-mounted datasets create the kind of outage where you write data to the wrong place,
then “restore” from a snapshot that never captured it.
Decision: Standardize mountpoints and enforce them in provisioning. Treat ad-hoc mounts as change-managed events.
Task 14: Monitor real-time IO by ZFS and catch the bully process
cr0x@server:~$ zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 1.20T 2.40T 110 2200 12.1M 280M
mirror-0 1.20T 2.40T 110 2200 12.1M 280M
ata-A - - 55 1100 6.0M 140M
ata-B - - 55 1100 6.1M 140M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: Even the basic view shows write storms; add -l for per-vdev latency columns or -q for queue depths
on OpenZFS builds that support them. Pair this with process-level tools (pidstat, iotop) to find the source.
Decision: If one job is flooding writes and pushing TXG sync, rate-limit it, isolate it to its own dataset,
or move it to a different pool class. Shared pools reward bullies and punish everyone else.
Joke #2: L2ARC is like a corporate “Center of Excellence”—great on paper, expensive in meetings, and it won’t fix your org chart.
Fast diagnosis playbook (first/second/third)
First: determine whether it’s health, capacity, or contention
- Health: zpool status -v. Any errors, degraded devices, resilvers, or scrubs? Fix health before tuning.
- Capacity headroom: zpool list. If you’re above ~80% capacity on a busy pool, assume allocations are getting expensive.
- Is it actually storage? iostat -x for device latency and utilization. High await/queue = storage contention.
Second: identify whether reads or writes are the pain
- Pool-level mix: zpool iostat -v 1 to see read/write operations and bandwidth.
- Sync write suspicion: If apps block on commits/fsync, check SLOG presence and dataset sync/logbias.
- Cache angle: ARC stats for hits/misses; if reads miss ARC and hit slow disks, latency follows.
Third: validate the layout matches the workload
- VDEV type: Mirrors for low-latency random IO; RAID-Z for capacity/sequential. If you picked RAID-Z for a random-write VM farm, the pool is doing exactly what you asked.
- Block sizing: Check recordsize and (for zvols) volblocksize. Misalignment creates write amplification and TXG pressure.
- Special vdev: If metadata is hot and you have slow rust, a special vdev can help—but only if you understand the redundancy implications.
Common mistakes: symptoms → root cause → fix
1) “Everything is slow when pool hits ~85%”
Symptoms: Rising write latency, unpredictable pauses, scrub/resilver takes forever, metadata operations feel sticky.
Root cause: Metaslab fragmentation + constrained allocation choices at high capacity; copy-on-write churn worsens it.
Fix: Maintain headroom (capacity expansion sooner), reduce churn (snapshot retention discipline), consider adding vdevs for parallelism. Don’t aim for 95% utilization on performance-critical pools.
2) “Database commits take 20–80 ms, but throughput looks fine”
Symptoms: High p99 latency, threads blocked in fsync, TPS collapses during bursts.
Root cause: Sync write path bottleneck: no SLOG, slow SLOG, or main vdev latency; log writes serialized.
Fix: Use a proper power-loss-safe SLOG for sync-heavy workloads; validate dataset sync and logbias; ensure vdev latency isn’t pathological.
3) “We added L2ARC and it got worse”
Symptoms: Higher memory pressure, occasional stalls, no improvement in hit rate.
Root cause: L2ARC indexing overhead + caching the wrong working set; insufficient RAM so ARC is already starved.
Fix: Add RAM first, tune workload locality, measure ARC hit rates before/after. Use L2ARC when the working set is larger than RAM but still cacheable and read-heavy.
4) “Random writes are awful on our shiny new RAID-Z pool”
Symptoms: IOPS far below expectation, high latency at low bandwidth, CPU looks idle.
Root cause: RAID-Z parity overhead and write amplification on small random writes; TXG sync becomes a bottleneck.
Fix: Use mirrored vdevs for random-write workloads; or isolate the workload, tune recordsize/volblocksize, and ensure you’re not forcing sync writes unnecessarily.
5) “We disabled sync to ‘fix’ latency and then lost data”
Symptoms: Performance improves; after a crash, recent transactions are missing or corrupted at application level.
Root cause: You changed durability semantics. ZFS honored your request and acknowledged writes before they were stable.
Fix: Put sync back to standard; use SLOG and proper hardware; fix the actual latency bottleneck instead of rewriting the contract.
6) “Scrub kills performance every week”
Symptoms: Predictable latency spikes during scrub windows, timeouts in IO-heavy services.
Root cause: Scrub competes for IO with latency-sensitive workloads; no throttling/scheduling discipline.
Fix: Schedule scrubs off-peak, tune scrub behavior available on your platform, and isolate workloads across pools/vdev classes when necessary.
Three corporate mini-stories (anonymized, painfully real)
Incident caused by a wrong assumption: “A pool is a pool”
A mid-sized company migrated a fleet of VM hosts onto a new ZFS-backed storage appliance. Procurement had optimized
for usable capacity, so the design was “a few wide RAID-Z2 vdevs with lots of disks.” On paper: great.
In practice: it hosted hundreds of VM disks doing small random reads/writes, with periodic sync-heavy bursts.
The assumption that caused the incident was simple: “More disks means more IOPS.” That’s true when you add vdevs,
not when you add disks inside a RAID-Z vdev and then ask it to behave like a mirror set. During the first big patch
window, the pool hit sustained write pressure, TXGs started syncing continuously, and guest IO latencies went from
“fine” to “call the incident commander.”
The dashboards weren’t lying. Aggregate throughput was high. That’s what made it confusing: bandwidth looked healthy,
but the VMs were dying. The missing metric was tail latency on sync writes. The workloads weren’t bandwidth-bound;
they were commit-latency-bound. Every little fsync was waiting behind parity work and vdev queueing.
The fix wasn’t “tuning.” It was architecture. They added a dedicated mirrored pool for latency-sensitive VM storage
and kept RAID-Z for backups and sequential data. They also revisited dataset defaults: smaller block sizes where
appropriate, and no one touched sync=disabled again without a written risk acceptance.
Optimization that backfired: special vdev as a magic speed button
Another organization had a large pool on HDD mirrors that served millions of small files. Metadata was hot.
Someone read about special vdevs and proposed: “Put metadata on SSD and everything gets faster.” True, with caveats.
The caveats were not discussed. They added a special vdev with less redundancy than the main pool because it was “just metadata.”
It worked immediately. Directory traversals sped up, stat-heavy workloads calmed down, and the team declared victory.
A month later, the special device started throwing intermittent errors. ZFS reacted correctly: metadata integrity is
not optional. If you lose the special vdev and it’s not redundant enough, you don’t just lose performance—you can lose the pool.
The incident didn’t become a data-loss headline because the device didn’t fully die. But they got an operational
scare that was expensive: emergency maintenance window, vendor escalation, and a very tense conversation about
“why did we put critical pool data on a single point of failure?”
The backfired optimization wasn’t “special vdevs are bad.” The backfired optimization was treating a performance
feature as if it were a cache. Special vdevs can store metadata and sometimes small blocks. That makes them
structural. Structural means redundancy must match the pool’s expectations. They rebuilt the special vdev
as a mirror, documented the failure domain, and only then rolled it out broadly.
Boring but correct practice that saved the day: consistent scrubs + replacement discipline
A SaaS team ran ZFS pools on commodity servers. Nothing glamorous: mirrored vdevs, conservative headroom,
and a schedule that scrubs regularly. They also had a habit that looks obsessive until it saves you:
if a disk shows increasing error counters, it gets replaced before it “fully fails.”
One week, a scrub reported repaired data on a pool that was otherwise healthy. No application had complained.
No one had noticed. That’s the point: latent sector errors don’t send a calendar invite.
The scrub had found checksum mismatches, read from the mirror, repaired, and logged it.
The team treated repaired bytes as a hardware early-warning system. They checked the device’s error trends at the OS level,
swapped the disk in a planned window, and resilvered while the pool was still healthy and not under duress.
The replacement was boring. The incident that never happened would have been exciting.
A month later, another disk in a different server failed hard during peak hours. That pool stayed online because
mirrors plus proactive practice meant they were never running close to the edge. Nobody outside the team noticed.
That’s what “reliability work” looks like: success you can’t show in a screenshot.
Checklists / step-by-step plan
Step-by-step: design a pool that won’t embarrass you later
- Classify workloads: random read-heavy, random write-heavy, sequential, sync-heavy, metadata-heavy. Don’t average them into meaninglessness.
- Pick vdev type per workload: mirrors for low latency; RAID-Z for capacity and sequential throughput. Mixed workloads deserve separation via datasets or separate pools.
- Plan vdev count for IOPS: need more IOPS? add vdevs, not disks inside a single RAID-Z vdev.
- Decide on SLOG: only if you have meaningful sync writes. Choose power-loss-safe devices; mirror SLOG if your risk model demands it.
- Decide on special vdev: only if metadata/small-block performance is the bottleneck. Match redundancy to pool importance.
- Set ashift correctly at creation. Treat it as irreversible, because for practical purposes it is.
- Set dataset defaults: compression=zstd (often), atime=off (commonly), recordsize per workload, and quotas/reservations where needed.
- Capacity headroom policy: define a hard threshold (like 75–80%) where you must expand or migrate.
- Operational cadence: scrubs on schedule, alerting on errors, and documented replacement procedures.
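For the scrub cadence, a boring cron entry is enough to make it real; many distros already ship a scrub timer or cron job, so check before adding your own (file name and schedule below are examples):
cr0x@server:~$ cat /etc/cron.d/zfs-scrub
# scrub 'tank' every Sunday at 03:00; adjust to your off-peak window
0 3 * * 0 root /usr/sbin/zpool scrub tank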
Step-by-step: when you inherit a pool and don’t trust it
- Run zpool status -v and record baseline error counters and scrub state.
- Run zpool list and note capacity and fragmentation.
- Run zpool iostat -v 1 10 during normal load to understand “normal.” Save the output.
- Audit dataset properties: zfs get -r for recordsize, sync, compression, atime, quotas/reservations.
- Inventory special devices: SLOG, L2ARC, special vdev. Confirm redundancy and health.
- Inspect ashift via zdb -C. If wrong, log it as technical debt with a plan.
- Check snapshot retention and growth patterns. Delete responsibly, but don’t let snapshots silently consume the pool.
- Implement alerts: pool health changes, error counter deltas, capacity threshold, scrub failures, and unusual latency.
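A minimal baseline-capture habit, assuming a hypothetical ~/baselines directory; the point isn’t the script, it’s having dated evidence of what “normal” looked like before anything caught fire:
cr0x@server:~$ mkdir -p ~/baselines/$(date +%F)
cr0x@server:~$ zpool status -v > ~/baselines/$(date +%F)/zpool-status.txt
cr0x@server:~$ zpool list > ~/baselines/$(date +%F)/zpool-list.txt
cr0x@server:~$ zpool iostat -v 1 10 > ~/baselines/$(date +%F)/zpool-iostat.txt
cr0x@server:~$ zfs get -r recordsize,sync,compression,atime tank > ~/baselines/$(date +%F)/zfs-props.txt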
FAQ
1) Is a VDEV basically a RAID group?
Functionally, yes: a mirror vdev resembles RAID1, RAID-Z resembles parity RAID. Operationally, the important difference is:
vdevs are the unit ZFS stripes across. Add vdevs to add IOPS/parallelism.
2) Why does ZFS performance fall off when the pool is full?
Because allocations become constrained and fragmented. The SPA has fewer good choices, metadata grows, and writes turn into
more scattered IO. Near-full pools are expensive pools.
3) What exactly does TXG syncing mean for my applications?
TXG syncing is when ZFS flushes dirty data to disk. If the pool can’t keep up, ZFS throttles writers, and your app experiences it as latency.
Sync-heavy apps feel it more because they require durable commits.
4) Does ARC replace the Linux page cache?
Not exactly. ZFS uses ARC for ZFS-managed caching. The kernel still has its own page cache mechanisms, but ZFS is not a typical filesystem
in how it manages cached blocks. The practical concern is memory competition: you must ensure applications have enough RAM too.
5) When should I add a SLOG?
When your workload issues a meaningful amount of synchronous writes (databases, NFS with sync semantics, VM storage that does flushes),
and latency matters. If your workload is mostly async streaming writes, a SLOG won’t help.
6) Is L2ARC worth it?
Sometimes. If your working set is larger than RAM, read-heavy, and cacheable, L2ARC can help. If you’re write-heavy, RAM-starved,
or your workload is random with low reuse, it often does nothing or makes things worse.
7) What’s the difference between recordsize and volblocksize?
recordsize applies to files in datasets and can be changed. volblocksize applies to zvols and is fixed at creation.
Choose block sizes that match the IO your application actually does.
8) Can I change RAID-Z level or remove a vdev later?
Assume “no” for production planning. Some expansion features exist in some implementations, but vdev removal and RAID level changes are not
something you should bet your business on. Design like you’ll be living with the decision.
9) Do scrubs hurt SSDs or shorten life dramatically?
Scrubs do read the entire pool and can cause writes if repairs occur. That’s wear, yes—but the alternative is not knowing you have latent
corruption until you need the data. Scrub frequency is a risk decision, not a superstition.
10) Should I enable dedup?
Only if you have a proven dedup-friendly workload and enough memory/CPU to support it. In most general-purpose environments, dedup is a
performance and complexity tax that doesn’t pay back.
Conclusion: next steps you can do this week
ZFS isn’t “complicated.” It’s honest. It exposes the reality that storage is a coordinated system: vdev geometry,
TXG timing, ARC behavior, and SPA allocation policy all show up in your latency charts eventually.
Practical next steps:
- Run the baseline commands: zpool status -v, zpool list, zpool iostat -v, and save outputs as your “known good.”
- Audit dataset properties for your top three workloads; fix the obvious mismatches (recordsize, atime, compression).
- Set a capacity headroom policy and enforce it with alerts.
- If you have sync-heavy workloads, validate your SLOG story: either you have a good one, or you accept the latency.
- Schedule scrubs, monitor repairs, and replace suspicious disks early. Boring is the goal.
If you remember one thing: the pool is not the performance unit. The vdev is. And TXGs are always ticking in the background,
waiting to collect on whatever assumptions you made during design.