ZFS pools: Why ‘Partitions Thinking’ Makes Your Design Bad

There’s a particular kind of storage outage that doesn’t start with a bang. It starts with someone saying, “We’ll just carve the disks into a few partitions. It’ll be flexible.” Six months later you’re staring at a ZFS pool that’s technically online, practically limping, and emotionally exhausting.

ZFS isn’t a partition manager with a filesystem attached. ZFS is a storage system that wants to own its devices end-to-end: topology, redundancy, write semantics, healing, and performance. When you design it like a pile of partitions, you don’t just make it messy—you make it fragile, slower, and harder to recover when things go sideways (which they will, on a Tuesday, five minutes before the change freeze).

What “partitions thinking” is (and why it’s so tempting)

“Partitions thinking” is the habit of treating disks like generic containers you slice up to serve different purposes: one partition for data, one for logs, one for swap, one for “future,” one for “temporary performance,” and—because we’re adults who’ve been burned before—one “just in case.” It’s the worldview shaped by decades of traditional filesystems and RAID controllers where capacity planning was literally about partition tables.

In ZFS, that instinct shows up as:

  • Making multiple partitions on the same physical disk and using them in different vdevs (or different pools).
  • Creating a “small fast partition” on each disk for metadata or special workloads.
  • Carving out “SLOG partitions” on the same devices as the pool (or worse, on the same spindles).
  • Over-engineering GPT layouts to “standardize” across servers without understanding what ZFS already standardizes.

Why it’s tempting: partitions feel like control. They look neat in a spreadsheet. They allow you to claim “future flexibility” without doing the hard thing—choosing a correct vdev layout up front. And sometimes, especially in corporate environments, partitions are used as a political tool: “We can share the disks between teams.”

Here’s the problem: ZFS doesn’t care about your intention. It cares about I/O paths, failure domains, and vdev geometry. Partitions don’t change physics; they just hide it behind nicer labels.

Joke #1: Partitioning disks for ZFS “flexibility” is like buying a bigger fridge and then taping off shelves so you can “scale later.”

The ZFS mental model: vdevs are the unit of truth

If you want to stop designing bad pools, internalize this sentence: ZFS redundancy and performance are defined at the vdev level, not at the pool level.

Pool: the aggregator

A ZFS pool is a collection of vdevs. The pool stripes data across vdevs. It does not magically “balance” bad vdevs into good behavior. If you add one slow or fragile vdev, you’ve added a slow or fragile component to the pool.

Vdev: the failure domain

A vdev is built from one or more devices and has a redundancy model: single disk, mirror, raidz1/2/3, special vdev mirror, etc. If a vdev dies, the pool dies. This is why “just a small partition in a single-disk vdev” is a quietly catastrophic idea. You’re adding a single point of failure to the entire pool.
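
ZFS itself tries to warn you here. As a rough illustration (pool and device names are hypothetical, and the exact wording varies by OpenZFS version), trying to add a bare single disk to a pool built from mirrors produces a replication-level complaint:

cr0x@server:~$ sudo zpool add tank /dev/disk/by-id/ata-LONELY_SINGLE_DISK
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses mirror and new vdev is disk

The moment you reach for -f, stop. That flag is not a convenience; it is ZFS asking whether you really want to bolt a single point of failure onto the whole pool.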

Devices: the physics

ZFS does not get to ignore what a device is. If two partitions live on the same SSD, they contend for the same flash translation layer, the same write amplification, the same garbage collection, the same endurance budget, the same firmware bugs, and the same “surprise” latency spikes when the drive decides it’s housekeeping time.

Datasets: the right place for “slicing”

If you need separation—quotas, reservations, recordsize policies, compression, snapshots—ZFS datasets already provide that cleanly. You don’t partition disks for administrative boundaries; you use datasets (or zvols) because they preserve the pool’s topology while giving you policy control.
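
A minimal sketch of the dataset-first approach (dataset names, quotas, and reservations below are illustrative, not recommendations): instead of carving per-team partitions, give each consumer its own dataset with a quota and, if needed, a guaranteed reservation.

cr0x@server:~$ sudo zfs create -o quota=2T -o reservation=500G tank/team-a
cr0x@server:~$ sudo zfs create -o quota=1T tank/team-b
cr0x@server:~$ sudo zfs list -o name,quota,reservation,used -r tank

Each dataset gets its own snapshot schedule, compression setting, and recordsize, while the pool keeps one coherent topology underneath.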

Facts and history that explain the trap

Some context helps because a lot of bad ZFS design isn’t stupidity—it’s inherited mental models.

  1. ZFS was built to end the “volume manager + filesystem” split. In the Solaris era, UFS plus a volume manager meant two layers that both wanted to manage blocks. ZFS unified them.
  2. Early ZFS deployments often used whole disks because the tooling assumed it. Partitioning arrived later as a compatibility escape hatch, not a design goal.
  3. The “whole disk” vs “partition” debate is older than ZFS. Traditional admins partitioned because MBR/GPT were the only way to slice devices for multiple filesystems. ZFS made that less relevant.
  4. 4K sector drives changed everything. When Advanced Format disks became common, misalignment and wrong sector assumptions caused real performance cliffs. ZFS responded with ashift, which you must get right at vdev creation time.
  5. RAIDZ is not RAID5/6 in a controller. ZFS has variable stripe width and copy-on-write semantics, which changes the performance and fragmentation story compared to classic RAID.
  6. Copy-on-write means “overwrite” is a lie. ZFS allocates new blocks for changes; it doesn’t overwrite in place. Partition-centric thinking often assumes you can “protect” regions of disk from each other. You can’t.
  7. The ZIL and SLOG are frequently misunderstood. The ZIL exists in the pool; a SLOG is just an external device to accelerate certain synchronous writes. It’s not a “log partition.”
  8. Special vdevs are powerful and dangerous. They can store metadata (and optionally small blocks), but if they’re not redundant, they can take your whole pool down.
  9. SSD firmware behavior is a bigger part of your latency story than people admit. Partitions don’t partition firmware. They partition your confidence.

How partition-centric design breaks ZFS in real life

1) You accidentally create shared failure domains

The classic anti-pattern: “We’ll split each disk into two partitions and make two mirrors. That way, if we lose a disk, each mirror loses one side and we’re still fine.” On paper, it looks like redundancy. In reality, each mirror depends on every disk. Lose one disk and both mirrors degrade. Lose a second disk and you can lose both mirrors depending on which disks die.

It gets worse when you span across enclosures: you thought you built “two independent pools,” but you really built one pool with correlated failures (same firmware batch, same shelf, same SAS expander).

2) You create performance contention you can’t tune away

If two vdevs share the same physical device (because you used partitions), you’ve created internal contention that ZFS can’t see. ZFS schedules I/O to vdevs assuming they’re independent. When they’re not, you get weirdness: one workload’s latency spikes bleed into another vdev, scrub times explode, resilvers crawl, and the system looks haunted.

3) You lock yourself into unfixable geometry mistakes

In ZFS, some layout decisions are effectively permanent without rebuilding: ashift, RAIDZ width, and “where does metadata live” decisions like special vdev usage. Partition-centric plans often bake in “we’ll change it later” assumptions that turn into “we’ll migrate it later,” which turns into “we’ll live with it until the hardware refresh.”

4) You misunderstand “space efficiency” and pay for it in ops

Admins sometimes partition disks to reserve space for future vdevs or to create “equal-sized” groups. ZFS doesn’t benefit from unused disk space sitting idle in a partition table. You’re not creating a reserve; you’re creating stranded capacity and future re-layout pain.

5) You make recovery harder for no operational gain

When a pool is sick, the last thing you want is ambiguity: “Which partition was that vdev again? Was it /dev/sdb2 on this host but /dev/sdc2 on the other?” ZFS can use stable device IDs, but partition-heavy layouts multiply the number of identifiers and the number of ways humans can be wrong during a high-stress replacement.

Joke #2: If you ever find yourself labeling partitions “final2_fixed_really,” congratulations—you’ve discovered technical debt in its natural habitat.

Three corporate-world mini-stories from the trenches

Mini-story 1: An incident caused by a wrong assumption (“Partitions are independent”)

In a mid-sized enterprise, a team inherited a storage server with a pool that looked “reasonably redundant.” The previous admin had split each of eight HDDs into two partitions and built two RAIDZ1 vdevs from the partitions. The logic was that “two vdevs means parallelism,” and “RAIDZ1 is fine because we have two of them.”

The first failure was boring: one disk started throwing CRC errors. The pool stayed online, degraded. The on-call swapped the drive. Resilvering began—and immediately the system performance went sideways. Latency rose, applications timed out, and the database team started circling like sharks.

Then a second drive failed. Not surprising: same age, same batch, and resilver stress is a classic way to find marginal disks. The surprise was that the pool died even though “only two disks failed.” The partitions meant those failures hit both RAIDZ1 vdevs in a way nobody expected from the diagram. The pool could not tolerate the combination.

The postmortem was painful because the root cause wasn’t a single bad action; it was a design assumption that partitions created separate fault domains. They don’t. A disk is a disk, and ZFS redundancy does not understand your partition boundaries as safety barriers. The “two vdevs” were not independent; they were two ways to lose the same pool.

What fixed it wasn’t heroics. It was re-architecting during recovery: rebuild the pool as mirrors (or RAIDZ2 depending on capacity goals), no partition games, and enforce a “whole disk only” policy in automation so the layout couldn’t regress.

Mini-story 2: An optimization that backfired (“Small fast partitions for metadata”)

A finance org had a mixed workload: lots of small files, some databases, and a backup process that behaved like a wood chipper. Someone read about “special vdevs” and got inspired. But instead of adding a dedicated pair of SSDs for a special vdev, they carved a “fast metadata partition” from the existing SSDs that were also serving as data vdevs.

At first, it looked great. Metadata-heavy operations sped up. Directory traversals got snappy. The dashboard numbers improved enough to win a small internal award for “cost-neutral performance improvements,” which is the kind of phrase that should make you nervous.

Then quarter-end arrived. The workload shifted: more synchronous writes, more churn, more snapshots. The “metadata partitions” contended with data I/O on the same SSDs. Latency started spiking, not consistently, but in ugly bursts. The team chased ghosts: network, NFS, database locks. Meanwhile, the pool’s I/O scheduler was doing exactly what it was told—treating the special vdev as separate capacity, even though it shared the same physical devices underneath.

The backfire came in the form of maintenance: during a scrub, the system fell into a bad rhythm—SSD garbage collection coinciding with heavy reads. The scrub took dramatically longer, which extended the window of risk. Nothing exploded, but the system became predictably unpredictable, which is worse in corporate life because it’s hard to explain to management why “it’s not down, it’s just intermittently awful.”

The fix was simple and expensive: stop being clever. Move metadata/small blocks to dedicated mirrored SSDs designed for that role, and keep data vdevs separate. The performance win returned, and the latency spikes calmed down. The “cost-neutral” plan ended up costing time, attention, and credibility—three currencies SRE teams never have enough of.

Mini-story 3: A boring but correct practice that saved the day (“Whole disks, stable IDs, rehearsed replace”)

A different org ran ZFS for VM storage. No fancy partitioning. Mirrors for IOPS, a separate mirrored SLOG on power-loss-protected SSDs, and a dedicated pair of SSDs as a special vdev (also mirrored). The design document was almost insultingly dull.

But their operational practice was the real story: every disk was referenced by stable paths (by-id), every bay was labeled, and the “replace a disk” procedure was rehearsed quarterly like a fire drill. They also kept a small stash of identical spare drives because procurement lead times are a kind of outage too.

One afternoon a disk failed hard—gone from the bus. The on-call didn’t have to guess which device was missing, didn’t have to map partitions, and didn’t have to interpret a pile of ambiguous symlinks. They ran the procedure, replaced with like-for-like, resilver started, and the system kept serving workloads.

The kicker: during the resilver, another disk started logging errors. The team caught it early because they were watching zpool status and SMART counters as part of the drill. They replaced it proactively before it became catastrophic. No incident bridge. No frantic Slack archaeology. Just boring, correct storage operations—exactly the kind that never gets a trophy, but keeps the company paid.

Practical tasks: commands that keep you honest

These are the tasks I actually reach for when someone hands me a ZFS system that “was designed carefully” and is now making sad noises. Each task includes the command and how to interpret what you see.

Task 1: Inventory pool topology (spot partition games)

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_...-part1  ONLINE       0     0     0
            ata-SAMSUNG_SSD_...-part1  ONLINE       0     0     0

errors: No known data errors

Interpretation: If you see -partX everywhere, ask why. It might be harmless (ZFS itself creates a data partition plus a small reserved partition when handed a whole disk, and boot disks legitimately carry EFI partitions), or it might be someone slicing devices for multiple roles. Look for the same base disk appearing in multiple vdevs via different partitions. That’s a red flag.

Task 2: Show vdevs with persistent device names

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'ata-|nvme-' | head
lrwxrwxrwx 1 root root  9 Dec 24 09:10 ata-WDC_WD80... -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 24 09:10 ata-WDC_WD80...-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 24 09:10 ata-WDC_WD80...-part2 -> ../../sdb2

Interpretation: In production, reference vdev members by /dev/disk/by-id/..., not /dev/sdX. If you rely on /dev/sdX and someone adds an HBA or changes boot order, your next reboot becomes a surprise party.
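
If an existing pool was built against /dev/sdX names, you can usually re-point it at stable IDs without rebuilding: export it and import it again while scanning the by-id directory. A hedged sketch (pool name assumed; exporting takes the pool offline, so do this in a maintenance window):

cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ sudo zpool status -P tank | head

After the re-import, the member names in zpool status should be the long, boring, unambiguous ones. That is the goal.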

Task 3: Verify ashift (alignment) per vdev

cr0x@server:~$ sudo zdb -C tank | grep -E 'vdev_tree|ashift' -n
97:        vdev_tree:
130:                ashift: 12

Interpretation: ashift: 12 means 4K sectors. For most modern HDDs/SSDs, 12 is a safe baseline. If you see ashift: 9 on modern disks, you may be paying a performance tax forever unless you rebuild that vdev.

Task 4: Check actual sector sizes from the OS

cr0x@server:~$ sudo lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC,ROTA,TYPE | grep -E 'sd|nvme'
sda  INTEL SSDSC2  447.1G    4096    512    0 disk
sdb  WDC WD80EAZZ  7.3T      4096    512    1 disk

Interpretation: ZFS writes in units of 2^ashift. If the physical sector is 4096 and your vdev was created with ashift=9, you’re misaligned. That often shows up as weirdly low IOPS and higher latency under sync or small-block workloads.

Task 5: Identify whether you’re using special vdevs

cr0x@server:~$ sudo zpool status tank | sed -n '/special/,$p'
          special
            mirror-2                ONLINE       0     0     0
              nvme-SAMSUNG_MZ...     ONLINE       0     0     0
              nvme-SAMSUNG_MZ...     ONLINE       0     0     0

Interpretation: If a special vdev exists and it’s not mirrored, treat it like an incident waiting to happen. Special vdevs can hold metadata and (optionally) small blocks; losing them can mean losing the pool.

Task 6: Confirm where sync writes go (SLOG presence and health)

cr0x@server:~$ sudo zpool status tank | sed -n '/logs/,$p'
          logs
            mirror-1                ONLINE       0     0     0
              ata-INTEL_SLOG_A      ONLINE       0     0     0
              ata-INTEL_SLOG_B      ONLINE       0     0     0

Interpretation: A SLOG helps only for synchronous writes (NFS with sync, databases configured for durability, etc.). If your “SLOG” is a partition on a busy data SSD, you likely traded durable latency for unpredictable latency.

Task 7: See per-vdev I/O and spot the slow member

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
tank              2.10T  5.10T    120    380  12.3M  45.1M
  mirror-0        1.05T  2.55T     60    190   6.1M  22.6M
    sdb            -      -       60    190   6.1M  22.6M
    sdc            -      -       60    190   6.1M  22.6M
  mirror-1        1.05T  2.55T     60    190   6.2M  22.5M
    sdd            -      -       60    190   6.2M  22.5M
    sde            -      -       60    190   6.2M  22.5M

Interpretation: When one vdev is much slower, the pool’s tail latency follows it. If you’ve built vdevs from partitions on the same physical disk, the iostat output can look “balanced” while the physical device is overloaded—because ZFS can’t see the shared bottleneck.
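
To see the physical contention that zpool iostat can’t, compare against OS-level device statistics. A hedged sketch (requires the sysstat package; which devices you filter on is up to you):

cr0x@server:~$ iostat -x 1 3

If one physical device sits near 100% utilization while each vdev carved from its partitions looks only half-loaded in zpool iostat, you’ve found the shared bottleneck ZFS can’t see.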

Task 8: Check dataset-level properties instead of inventing partitions

cr0x@server:~$ sudo zfs get -o name,property,value,source compression,recordsize,atime,sync tank/app
NAME      PROPERTY     VALUE   SOURCE
tank/app  compression  lz4     local
tank/app  recordsize   128K    local
tank/app  atime        off     local
tank/app  sync         standard default

Interpretation: This is where you do workload tuning. You don’t need a “log partition” for a log-heavy app; you need an appropriate recordsize, maybe atime=off, maybe a separate dataset for different I/O patterns.
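
For example, a log-heavy application doesn’t need a “log partition”; it needs its own dataset with properties that match its I/O pattern. A hedged sketch (dataset name and values are illustrative, not tuning advice for your workload):

cr0x@server:~$ sudo zfs create -o recordsize=1M -o compression=lz4 -o atime=off tank/applogs
cr0x@server:~$ sudo zfs get -o name,property,value recordsize,compression,atime tank/applogs

Large, append-mostly writes tend to like a bigger recordsize and cheap compression; the point is that this is a per-dataset decision, not a disk-layout decision.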

Task 9: Observe latency live (the truth serum)

cr0x@server:~$ sudo zpool iostat -l -v tank 1 3
                    capacity     operations     bandwidth    total_wait  disk_wait
pool              alloc   free   read  write   read  write   ---------   ---------
tank              2.10T  5.10T    120    380  12.3M  45.1M     12ms        8ms
  mirror-0        1.05T  2.55T     60    190   6.1M  22.6M     10ms        7ms
  mirror-1        1.05T  2.55T     60    190   6.2M  22.5M     14ms        9ms

Interpretation: If disk_wait is high, the physical devices are slow or overloaded. If total_wait is much higher than disk_wait, you might be CPU bound, memory pressured, or stuck behind ZFS queues (often a symptom of undersized vdevs or pathological sync workloads).

Task 10: Confirm ARC pressure and whether memory is your bottleneck

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:21:01   520    40      7    10   25    20   50    10   25   28G    32G

Interpretation: High miss rates with small arcsz relative to your workload can drive disk I/O and make everything look like a “storage problem” when it’s really a memory sizing issue. Partitioning won’t fix a cache problem; it just gives you new places to be wrong.
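
If arcstat isn’t installed, the raw kernel counters tell the same story on Linux/OpenZFS. A minimal sketch, assuming the standard /proc interface is present:

cr0x@server:~$ awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

Here size is the current ARC in bytes, c is its current target, and c_max is the ceiling. A size that keeps slamming into c_max while miss rates stay high is a memory conversation, not a storage one.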

Task 11: Scrub status and impact awareness

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
scan: scrub in progress since Tue Dec 24 08:10:15 2025
        1.20T scanned at 520M/s, 600G issued at 260M/s, 2.10T total
        0B repaired, 28.57% done, 0 days 02:10:33 to go

Interpretation: Scrubs are good; scrubs are also load. If scrubs reliably crush performance, you likely have a vdev geometry problem (too-wide RAIDZ for IOPS workloads) or a shared device bottleneck (partition shenanigans), or simply not enough spindles.
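
If a scrub collides with peak hours, pause it rather than cancelling it. A hedged sketch (OpenZFS 0.7 or newer; pool name assumed):

cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool scrub tank

The first command pauses the running scrub; issuing zpool scrub again later resumes it from where it stopped. If you find yourself pausing every scrub, treat that as the geometry warning described above, not as a workflow.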

Task 12: Replace a disk correctly (and not by guessing partitions)

cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-WDC_OLD_DRIVE /dev/disk/by-id/ata-WDC_NEW_DRIVE
cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
scan: resilver in progress since Tue Dec 24 09:02:11 2025
config:
        NAME                                 STATE
        tank                                 DEGRADED
          mirror-0                           DEGRADED
            replacing-0                      DEGRADED
              ata-WDC_OLD_DRIVE              UNAVAIL
              ata-WDC_NEW_DRIVE              ONLINE  (resilvering)
            ata-WDC_PEER_DRIVE               ONLINE

Interpretation: Use stable IDs. Replace whole devices when possible. When your pool is built from partitions, replace operations become more error-prone because you’re juggling multiple partX targets per disk.

Task 13: Detect accidental multi-use of a physical disk

cr0x@server:~$ sudo zpool status -P tank | grep -E '/dev/disk/by-id/.*-part' | sed 's/-part[0-9]\+//g' | sort | uniq -c | sort -nr | head
      2 /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S5...
      1 /dev/disk/by-id/ata-WDC_WD80EAZZ_7SG...

Interpretation: If the same base device appears multiple times (count > 1) after stripping partition suffixes, you probably have one disk feeding multiple vdev members. Sometimes it’s intentional for boot; often it’s a design smell.

Task 14: Check pool feature flags and versioning assumptions

cr0x@server:~$ sudo zpool get -H -o property,value feature@async_destroy,feature@spacemap_histogram tank
feature@async_destroy      enabled
feature@spacemap_histogram active

Interpretation: This isn’t “partition vs whole disk,” but it prevents a related class of failure: trying to import a pool on a host whose ZFS version can’t support its features. The partition people also tend to be the “we can import it anywhere” people. You can’t, not always; the feature flags decide.

Fast diagnosis playbook (bottleneck hunting)

When performance tanks or latency spikes, you need a sequence that gets you from “users are angry” to “this is the limiting factor” without wandering into folklore. This is the production-first order I use.

First: Is the pool healthy and not doing emergency work?

cr0x@server:~$ sudo zpool status -x
all pools are healthy

If not healthy: stop diagnosing “performance.” You’re diagnosing “survival.” A degraded vdev, resilver, or heavy error correction will dominate everything.

Second: Are you currently scrubbing/resilvering?

cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'

Interpretation: A resilver is a legitimate reason for reduced throughput. If your pool becomes unusable during routine scrubs, that’s a design capacity issue: not enough IOPS headroom, or a layout that over-optimizes for capacity at the expense of concurrency.

Third: Which vdev is the slowest under load?

cr0x@server:~$ sudo zpool iostat -l -v tank 1 10

Interpretation: Look for a vdev with higher wait times. In mixed vdev pools, the slowest vdev drags tail latency. If you see inconsistent waits that correlate across multiple vdevs, suspect shared physical devices (partitioning) or shared backplanes/HBA issues.

Fourth: Is it sync write pressure (SLOG / ZIL) or general random I/O?

cr0x@server:~$ sudo zfs get -r -o name,property,value,source sync tank | head
NAME      PROPERTY  VALUE     SOURCE
tank      sync      standard  default
tank/app  sync      standard  default

Interpretation: If the workload forces sync (NFS, databases), and you don’t have a proper SLOG, latency will spike. If you do have a SLOG but it’s not power-loss-protected or is overloaded (or is a partition on busy devices), you get “fast sometimes, awful sometimes.”
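
One hedged way to confirm sync pressure is an A/B test on a disposable dataset (the name below is hypothetical, and sync=disabled trades durability for speed, so never leave it set on data you care about):

cr0x@server:~$ sudo zfs create tank/synctest
cr0x@server:~$ sudo zfs set sync=disabled tank/synctest
cr0x@server:~$ sudo zfs inherit sync tank/synctest

Run the latency-sensitive workload against the dataset between the second and third commands. If latency collapses with sync disabled, the bottleneck is sync write handling (ZIL/SLOG), not raw pool bandwidth; fix it with a proper SLOG, not by leaving sync off.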

Fifth: Is ARC/memory pressure pushing you to disk?

cr0x@server:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           64Gi        52Gi       1.2Gi       1.0Gi        11Gi        8Gi
Swap:          0B          0B          0B

Interpretation: Low available memory with high disk read load can look like “slow storage.” ZFS loves RAM; starving it makes your disks do dumb work.

Sixth: Is the bottleneck CPU, compression, or checksums?

cr0x@server:~$ top -b -n1 | head -20

Interpretation: If CPU is pinned during heavy writes with compression, you may be compute bound. That’s not bad—compression can still be worth it—but you need to know what you’re paying with.

Seventh: Validate device-level errors and link issues

cr0x@server:~$ sudo dmesg -T | grep -E 'ata[0-9]|nvme|sd[a-z]:|I/O error|link reset' | tail -20

Interpretation: CRC errors, link resets, and timeouts can masquerade as “ZFS is slow.” Often it’s a cable, expander, or HBA firmware. Partitions don’t cause this, but partition-heavy layouts make it harder to identify which physical disk is actually sick.
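
To interrogate a specific suspect, SMART counters narrow things down. A hedged sketch (device path is illustrative; attribute names vary by vendor, and smartmontools must be installed):

cr0x@server:~$ sudo smartctl -a /dev/sdb | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC_Error'

Growing reallocated or pending sectors point at a dying disk; a climbing UDMA CRC error count usually points at the cable, backplane, or expander rather than the drive itself.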

Common mistakes: specific symptoms and fixes

Mistake 1: Using partitions from the same disk in different vdevs

Symptom: Random latency spikes that don’t match ZFS-level iostat distribution; scrubs/resilvers cause “everything” to slow down; unpredictable tail latency.

Why it happens: ZFS thinks it has more independent devices than it really does.

Fix: Rebuild with one role per physical device. If you need multiple “classes” (data, special, slog), use separate physical devices for each class, and mirror the critical ones.

Mistake 2: Adding a single-disk “special” or “log” vdev because it’s “just metadata” or “just a log”

Symptom: Pool failure after one device loss; inability to import; sudden “I/O error” when that device dies.

Why it happens: Special vdevs and SLOGs have different risk profiles, and people conflate them. Losing a SLOG usually doesn’t lose the pool; you risk only the most recent sync transactions, and only if the loss coincides with a crash. A special vdev is a top-level vdev holding pool metadata: lose it and you lose the pool.

Fix: Mirror special vdevs. Use proper devices for SLOG (PLP SSDs). Avoid “clever” partitions for critical roles.

Mistake 3: Thinking partitions solve “future expansion”

Symptom: Stranded capacity; weird vdev sizes; inability to add new vdevs cleanly; uneven performance after expansion.

Why it happens: ZFS expansion is about adding vdevs, not extending partitions. You want coherent vdevs with consistent performance characteristics.

Fix: Plan expansion by vdev units. Mirrors expand by adding more mirrors. RAIDZ expands by adding another RAIDZ vdev (or by replacing disks with larger ones and waiting for autoexpand where supported).
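
A hedged sketch of mirror-based expansion (disk IDs are hypothetical): you grow the pool by adding another whole-disk mirror vdev, and you enable autoexpand for the “replaced every member with a larger disk” path.

cr0x@server:~$ sudo zpool add tank mirror \
  /dev/disk/by-id/ata-WDC_DISK_E /dev/disk/by-id/ata-WDC_DISK_F
cr0x@server:~$ sudo zpool set autoexpand=on tank

No partition reshuffling, no reserved slices waiting for a future that never arrives.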

Mistake 4: Misaligned ashift due to “it worked on the old server” cloning

Symptom: Small random writes are inexplicably slow; high write amplification; SSD wear accelerates.

Fix: Create vdevs with correct ashift from the start. If wrong, migration/rebuild is the honest fix. Don’t try to “partition your way out” of it.

Mistake 5: Treating RAIDZ like a performance feature for IOPS-heavy workloads

Symptom: VM storage feels sluggish; database latency high; throughput fine in benchmarks but real workloads suffer.

Fix: Use mirrors for IOPS-sensitive workloads. RAIDZ for capacity/throughput-heavy, sequential-ish workloads. If you must use RAIDZ, keep widths sensible and understand rebuild windows.

Mistake 6: Over-partitioning for “standard boot layouts” without documenting it

Symptom: Confusion during replacements; wrong partition replaced; pool members swapped incorrectly; longer outages due to human error.

Fix: If you must partition for boot (common on some setups), keep it minimal and uniform, and automate labeling. Make ZFS vdev members stable and obvious via by-id names and consistent partition numbers.

Checklists / step-by-step plan

Step-by-step: designing a new pool without “partitions thinking”

  1. Define the failure domain first. One server? One shelf? Multiple HBAs? Decide what a “single failure” looks like in your environment.
  2. Pick vdev type based on workload.
    • VMs/databases/latency-sensitive: mirrors (more vdevs = more parallelism).
    • Backups/media/sequential: RAIDZ2/3 (capacity efficient).
  3. Choose ashift intentionally. Default to 12 unless you have a compelling reason otherwise.
  4. Decide whether you need special vdev. Only if metadata/small-block performance matters and you can mirror it with good SSDs.
  5. Decide whether you need SLOG. Only if you have real sync write pressure and you can use power-loss-protected devices.
  6. Use datasets for policy separation. Quotas, reservations, recordsize, compression, snapshots—do it there.
  7. Standardize device naming. Build and operate using /dev/disk/by-id.
  8. Document a replace procedure. Then rehearse it.

Step-by-step: creating a pool (example) using whole disks and stable IDs

cr0x@server:~$ sudo zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-WDC_DISK_A /dev/disk/by-id/ata-WDC_DISK_B \
  mirror /dev/disk/by-id/ata-WDC_DISK_C /dev/disk/by-id/ata-WDC_DISK_D

cr0x@server:~$ sudo zfs set compression=lz4 atime=off tank
cr0x@server:~$ sudo zfs create tank/app
cr0x@server:~$ sudo zfs set recordsize=16K tank/app

Interpretation: Mirrors give you IOPS by parallelism; datasets give you policy. No partition gymnastics required.

Step-by-step: adding a mirrored special vdev (example)

cr0x@server:~$ sudo zpool add tank special mirror \
  /dev/disk/by-id/nvme-SAMSUNG_SPECIAL_A \
  /dev/disk/by-id/nvme-SAMSUNG_SPECIAL_B

cr0x@server:~$ sudo zfs set special_small_blocks=16K tank

Interpretation: This is how you accelerate metadata/small blocks without lying to yourself about failure domains. Mirrored, dedicated devices.

Step-by-step: adding a mirrored SLOG (example)

cr0x@server:~$ sudo zpool add tank log mirror \
  /dev/disk/by-id/ata-INTEL_PLP_SLOG_A \
  /dev/disk/by-id/ata-INTEL_PLP_SLOG_B

Interpretation: A mirrored SLOG reduces risk and improves consistency for sync-heavy workloads. Don’t fake this with a partition on your already-busy pool SSDs.

Operational checklist: what to standardize so humans don’t have to be perfect

  • Always use by-id device paths in pool creation and replacement.
  • Label drive bays and keep an internal mapping of bay-to-serial.
  • Schedule scrubs and monitor their duration trend over time.
  • Alert on SMART pre-fail and ZFS read/write/checksum errors.
  • Keep spare drives (or at least guaranteed procurement path) for each class of device.
  • Rehearse zpool replace and zpool clear workflows in a safe environment (see the sketch after this list).
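
A minimal rehearsal sketch for a lab or staging pool, never a healthy production one (pool and device names are hypothetical): take a member offline, bring it back, and clear the counters, so the real replacement day isn’t the first time anyone types these commands.

cr0x@server:~$ sudo zpool offline tank /dev/disk/by-id/ata-WDC_DISK_B
cr0x@server:~$ sudo zpool status tank
cr0x@server:~$ sudo zpool online tank /dev/disk/by-id/ata-WDC_DISK_B
cr0x@server:~$ sudo zpool clear tank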

FAQ

1) Is using partitions with ZFS always bad?

No. It’s common to use a small EFI/boot partition and then a ZFS partition for the pool member. What’s bad is using partitions to pretend one physical device is multiple independent devices, or mixing roles on the same disk in ways that create hidden contention and correlated failure.

2) Why does “one bad vdev kills the pool” matter so much?

Because it flips how you think about risk. In LVM-land, you can lose a PV and maybe lose only a volume. In ZFS, if you add a single-disk vdev (even a tiny one), you’ve added a single point of failure to the entire pool.

3) Can I split disks into partitions to create more vdevs for performance?

You can, but you shouldn’t. You’re not creating more IOPS; you’re creating more queues that eventually hit the same physical device. ZFS will schedule work as if the vdevs are independent, which makes performance less predictable and can worsen latency under load.

4) What’s the right way to isolate workloads if not partitions?

Use datasets (or zvols) with quotas, reservations, and per-dataset properties. If you need hard isolation at the device level, use separate pools on separate physical devices—not partitions on the same spindles/SSDs.

5) Do I need a SLOG for NFS?

Only if your NFS workload is doing synchronous writes (common with VM storage, some database configs, and certain NFS export options). If your workload is mostly async, a SLOG won’t help much. If you do need it, use power-loss-protected devices and preferably mirror them.

6) Are special vdevs worth it?

They can be transformative for metadata-heavy workloads and small-file performance, but they must be designed as first-class, redundant, monitored devices. Treat them as pool-critical unless you’ve explicitly limited what they store—and even then, understand the failure implications.

7) Mirrors vs RAIDZ: what’s the “production default”?

For general-purpose, latency-sensitive workloads (VMs, databases), mirrors are the safer default because they scale IOPS with each vdev and rebuild faster. RAIDZ is great for capacity efficiency and sequential throughput, but it’s not magic for random I/O.

8) If my pool is already built with partitions, what should I do?

First, don’t panic. Inventory whether partitions share physical disks across vdevs. If they do, plan a migration: build a new pool with sane topology and replicate using ZFS send/receive or application-level migration. If the partitioning is only “boot + ZFS member,” document it and move on.
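
The mechanics of that migration are simpler than the planning. A hedged sketch, assuming a second pool named newtank already exists with the sane topology (pool names and the snapshot label are illustrative):

cr0x@server:~$ sudo zfs snapshot -r tank@migrate
cr0x@server:~$ sudo zfs send -R tank@migrate | sudo zfs receive -F newtank

After the bulk copy, quiesce writers, take a second snapshot, and send it incrementally (zfs send -R -i) to catch up before the cutover.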

9) Does ZFS “auto-balance” data when I add new vdevs?

Not in the way people hope. New writes will tend to go to the new space, but existing blocks stay where they are unless rewritten (or you use specific rebalancing strategies). This is another reason “we’ll fix it later” partition plans often age badly.

10) What’s the quickest sign my team is thinking in partitions instead of ZFS?

If the design document talks more about “carving disks” than about vdev count, vdev width, failure domains, and rebuild behavior, you’re in partition territory.

Conclusion

ZFS rewards people who design with topology, failure domains, and workload reality. It punishes people who design with partition tables and optimism. “Partitions thinking” is seductive because it looks like flexibility, but in production it’s usually just hidden coupling: shared bottlenecks, correlated failures, and recovery procedures that require humans to be flawless under stress.

If you want a ZFS pool that behaves like a professional system—predictable, diagnosable, and survivable—design around vdevs, use datasets for separation, keep roles on dedicated physical devices, and standardize the operational basics. The boring approach wins, not because it’s boring, but because it leaves less room for physics to surprise you.
