ZFS ashift Mismatch Detection: How to Check Existing Pools


You don’t notice an ashift problem when a pool is new. Everything’s empty, benchmarks look fine, life is good.
Then the pool fills, latency creeps up, scrubs take forever, and suddenly your “fast” SSD array performs like it’s phoning in results from a basement.

The ugly part: you can’t “fix” a wrong ashift in-place. Detection is the job. Decision-making is the job.
This guide is about checking existing pools, proving whether you have a mismatch, and choosing the least painful path out.

What ashift really is (and what it isn’t)

ashift is ZFS’s on-disk allocation unit size, expressed as a power of two.
If ashift=12, ZFS allocates in 2^12-byte units: 4096 bytes (4K).
If ashift=9, it’s 512 bytes.

That’s not the same thing as the disk’s reported “logical” and “physical” sector sizes, but it’s related.
ZFS will choose an ashift at vdev creation time based on what it thinks the device’s minimum sector size is.
Once set, it is effectively permanent for that vdev. You can replace disks, expand, resilver—ashift stays.

Here’s the operational translation:

  • Too small ashift (e.g., 512B on 4K physical sectors) causes read-modify-write, write amplification, latency spikes, and scrubs/resilvers that feel cursed.
  • Too large ashift (e.g., 8K or 16K on 4K devices) wastes some space and can increase overhead for very small blocks, but it usually doesn’t create the same “everything is on fire” profile.

What ashift is not:

  • It’s not the same as recordsize or volblocksize. Those are higher-level logical block sizes used for filesystems and zvols.
  • It’s not a tunable you can “set later” without rebuilding. If a blog post tells you otherwise, it’s selling you hope.

One more thing: ashift is per-vdev, not per-pool in a strict sense. Pools contain vdevs. Vdevs contain disks.
Most pools are created with a consistent ashift across all vdevs, but mixed-ashift pools happen—often during expansions or when someone added “just one more” vdev from a different era.
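Because ashift can't be changed afterward, the one moment you actually control it is pool or vdev creation. Here's a minimal sketch of forcing and immediately verifying it; the pool name and device path are placeholders, so substitute your own before running anything:

cr0x@server:~$ sudo zpool create -o ashift=12 testpool /dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL
cr0x@server:~$ sudo zdb -C testpool | grep ashift
cr0x@server:~$ sudo zpool destroy testpool

If the grep doesn't show ashift: 12 for every top-level vdev, stop before putting real data on it. zpool add accepts the same -o ashift=12 when you expand, which is exactly where mixed-ashift pools sneak in.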

Interesting facts and small history

Storage problems repeat because storage vendors keep changing the rules while keeping the marketing names. A few facts help frame why ashift even exists:

  1. 512-byte sectors were the long default. Decades of software assumed 512B blocks. ZFS was born in that world.
  2. “Advanced Format” (4K physical sectors) became mainstream in the early 2010s. Many drives presented 512B logical sectors for compatibility while writing 4K physically.
  3. Some devices lie. USB-SATA bridges, certain RAID controllers, and some virtual disks have historically reported 512B sectors even when the backend is 4K.
  4. ZFS picked ashift once because on-disk formats matter. ZFS cares about consistency and verifiability; changing allocation granularity later would be a format migration, not a “tune.”
  5. Early OpenZFS tooling didn’t always make ashift obvious. You could build a pool and only find the mismatch when performance cratered at scale.
  6. SSD erase blocks and NAND pages are much larger than 4K. Even with correct ashift, flash translation layers can create their own write amplification—wrong ashift makes it worse.
  7. 4Kn (native 4K logical sectors) exists. Some enterprise drives report 4096 logical and physical. They tend to behave better, mostly because they’re harder to mis-detect.
  8. Some hypervisors “normalize” block sizes. Your guest might see 512B while the datastore uses 4K or larger. ZFS inside the guest will make decisions based on the guest-visible size.

Why an ashift mismatch hurts (failure modes)

The classic mismatch is ashift=9 on disks with 4K physical sectors. ZFS writes 512B-aligned blocks, but the drive can only update 4K chunks.
So a “small write” becomes:

  1. Read the 4K physical sector containing the 512B region.
  2. Modify the 512B portion in memory.
  3. Write the full 4K sector back.

That read-modify-write cycle inflates latency and turns your IOPS into a practical joke.

A quick back-of-envelope makes the tax concrete: a single 512 B logical write landing on a 4096 B physical sector costs a 4096 B read plus a 4096 B write, so roughly 8 KiB of media traffic for 0.5 KiB of payload, about 16x amplification before ZFS metadata, parity, or the drive's own bookkeeping are counted.
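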

Common ways it shows up in production:

  • Scrub/resilver time explodes because the system issues more IO operations per unit of logical data.
  • Random write workloads suffer (databases, VM images, small-file workloads, metadata-heavy builds).
  • CPU usage can rise because the IO path does more work, compression/dedup interactions become noisier, and latency increases context switching.
  • Write amplification hurts SSD endurance—not always catastrophically, but enough to turn “comfortably within warranty” into “we’ll be swapping drives more than planned.”

“But my pool seems fine.” Sure. A mismatch can be masked by:

  • Mostly sequential writes
  • Lots of ARC cache hits
  • Very low pool utilization
  • Workloads dominated by large blocks (big media files, backups)

Then you add VMs, enable sync writes, hit 70% full, or start doing lots of metadata updates. The mismatch stops being theoretical.

One operational mantra worth borrowing (paraphrasing Gene Kranz's "failure is not an option"): plan like the mismatch is real until you prove it isn't.

Joke #1: If storage could talk, it would say “it’s not slow, it’s just expressing itself.” Then it would time out your database.

Fast diagnosis playbook

You want the shortest path from “this feels slow” to “this is ashift, or it isn’t.” Here’s the order that saves the most time.

First: confirm ashift on every top-level vdev

  • If ashift is correct and consistent, stop blaming it and move on.
  • If ashift is too small (9 on 4K hardware), treat it as suspect #1.
  • If ashift is mixed across vdevs, expect uneven performance and weird tail latency.

Second: confirm what the OS thinks the disks are (logical/physical)

  • Check sector sizes via lsblk and blockdev.
  • Watch for “512/4096” devices. That’s where ashift mistakes breed.
  • Don’t trust USB bridges or RAID HBAs that virtualize sectors without telling you.

Third: check for the obvious performance signatures

  • zpool iostat -v for small writes and high latency at the vdev level
  • iostat -x for high await and low throughput despite high utilization
  • zpool status -v for errors or slow resilver/scrub progress (not proof, but a smell)

Fourth: validate alignment behavior under a controlled write test

  • Use a scratch dataset or test host if you can.
  • Run a small-block random write test and observe latency/IOPS. A mismatched ashift pool often “falls off a cliff” on 4K-ish workloads (see the fio sketch after this list).
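If you want something more repeatable than dd, fio is the usual tool. A minimal sketch, assuming fio is installed, a scratch dataset like the tank/ashift-test one created in Task 14 below is mounted at /tank/ashift-test, and the job parameters are arbitrary starting points rather than a standard:

cr0x@server:~$ sudo fio --name=ashift-probe --directory=/tank/ashift-test --rw=randwrite --bs=4k --size=2g --runtime=60 --time_based --ioengine=psync --fdatasync=1 --numjobs=4 --group_reporting

Run it once with --bs=4k and once with --bs=128k. On a healthy pool the small-block run is slower but sane; on a mismatched vdev the 4k run tends to collapse disproportionately.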

Fifth: decide your path

  • Correct ashift: focus on recordsize, sync settings, slog, special vdevs, fragmentation, and pool fullness.
  • Wrong ashift: stop tuning around it. Plan migration or rebuild.

Practical tasks: commands, outputs, decisions

Below are real checks you can run today. Each task has: the command, what the output means, and the decision you make from it.
Use them like a checklist, not like fortune-telling.

Task 1: List pools and confirm you’re looking at the right one

cr0x@server:~$ zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  21.8T  17.2T  4.60T        -         -    32%    78%  1.00x  ONLINE  -

Meaning: You’ve got one pool, tank, 78% full and moderately fragmented. Fullness and fragmentation can compound ashift issues.

Decision: If the pool is >80% full, treat any performance problem as multi-factor. Still check ashift, but don’t stop there.

Task 2: Get the vdev layout (you need it for per-vdev ashift checks)

cr0x@server:~$ zpool status -P tank
  pool: tank
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3C  ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3D  ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3E  ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3F  ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3G  ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3H  ONLINE       0     0     0

errors: No known data errors

Meaning: One RAIDZ2 vdev with six disks.

Decision: You will check ashift for this vdev and confirm all underlying devices have matching sector characteristics.

Task 3: Check ashift via zdb (the most direct, least wishful method)

cr0x@server:~$ sudo zdb -C tank | sed -n '/vdev_tree/,/features_for_read/p'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 12345678901234567890
            children[0]:
                type: 'raidz'
                id: 0
                guid: 11111111111111111111
                nparity: 2
                ashift: 12
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 22222222222222222222
                    path: '/dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3C'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 33333333333333333333
                    path: '/dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3D'

Meaning: ashift: 12 means 4K allocations. That’s usually correct for modern HDDs/SSDs.

Decision: If you see ashift: 9 on any vdev built on 4K physical drives, flag it for migration planning.

Task 4: Spot mixed ashift across vdevs (expansions are where this happens)

cr0x@server:~$ sudo zdb -C tank | grep -E 'children\[[0-9]+\]:|ashift'
            children[0]:
                ashift: 12
            children[1]:
                ashift: 9

Meaning: You have at least two top-level vdevs with different ashift values. That’s a performance and predictability tax.

Decision: Treat the smallest ashift as the likely troublemaker. Plan to retire that vdev (via migration) or rebuild the pool.

Task 5: Check what the OS reports for logical/physical sector sizes

cr0x@server:~$ lsblk -d -o NAME,MODEL,SIZE,ROTA,LOG-SEC,PHY-SEC /dev/sd[a-f]
NAME MODEL            SIZE ROTA LOG-SEC PHY-SEC
sda  ST12000NM0008  10.9T    1     512    4096
sdb  ST12000NM0008  10.9T    1     512    4096
sdc  ST12000NM0008  10.9T    1     512    4096
sdd  ST12000NM0008  10.9T    1     512    4096
sde  ST12000NM0008  10.9T    1     512    4096
sdf  ST12000NM0008  10.9T    1     512    4096

Meaning: These are 512e drives: 512 logical, 4096 physical. If your vdev ashift is 9, you’re misaligned for physical IO.

Decision: With 512e, default to ashift=12 for new vdevs. For existing vdevs with ashift=9, plan a migration rather than “tuning it away.”

Task 6: Confirm sector sizes with blockdev (helps detect odd reporting)

cr0x@server:~$ sudo blockdev --getss /dev/sda
512
cr0x@server:~$ sudo blockdev --getpbsz /dev/sda
4096

Meaning: Kernel sees logical sector 512B, physical 4096B. Consistent with the lsblk view.

Decision: If tools disagree (e.g., 512/512 but you know it’s 4K), treat the device layer as untrustworthy and assume you need ashift=12 at minimum.
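A third opinion sometimes helps, especially behind controllers. If smartmontools is installed, smartctl reads the sector sizes straight from the drive's identify data (the device path is this example's, and behind some HBAs you may need an extra -d option):

cr0x@server:~$ sudo smartctl -i /dev/sda | grep -i 'sector size'

On a 512e drive that line should report 512 bytes logical, 4096 bytes physical. If smartctl and the kernel disagree, believe the more pessimistic number and size ashift for it.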

Task 7: Check pool-wide IO behavior during real load

cr0x@server:~$ zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        17.2T  4.60T    210    980  54.3M  22.1M
  raidz2-0                  17.2T  4.60T    210    980  54.3M  22.1M
    ata-ST12000NM...B3C         -      -     35    165  9.10M  3.80M
    ata-ST12000NM...B3D         -      -     35    165  9.05M  3.75M
    ata-ST12000NM...B3E         -      -     35    165  9.07M  3.78M
    ata-ST12000NM...B3F         -      -     35    165  9.02M  3.76M
    ata-ST12000NM...B3G         -      -     35    160  9.03M  3.70M
    ata-ST12000NM...B3H         -      -     35    160  9.06M  3.72M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Balanced per-disk stats suggest the vdev is behaving normally. This doesn’t prove ashift correctness, but it rules out obvious single-disk drag.

Decision: If per-disk writes are high while bandwidth is low, suspect small writes and amplification—ashift mismatch is a candidate, along with sync workloads and fragmentation.
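Recent OpenZFS releases can also show request-size and latency distributions directly, which gets closer to the "many tiny writes" signature than averages do. A minimal sketch, assuming your zpool build supports the -r and -w flags:

cr0x@server:~$ zpool iostat -r tank 5 1
cr0x@server:~$ zpool iostat -w tank 5 1

A write histogram dominated by 4K-and-under requests, paired with a long latency tail in the -w view, is exactly the profile an ashift mismatch (or an unbuffered sync-heavy workload) produces.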

Task 8: Correlate with device-level latency using iostat

cr0x@server:~$ iostat -x 1 3
Linux 6.6.0 (server) 	12/26/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.12    0.00    3.41    8.55    0.00   81.92

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  r_await  w_await
sda              35.0   160.0  9320.0  3860.0     0.0     2.0   92.0   27.4     9.2     31.1
sdb              35.0   160.0  9300.0  3840.0     0.0     1.0   91.5   26.9     9.0     30.6

Meaning: High utilization and elevated await on writes can be normal under load for HDD RAIDZ, but if you see extreme await with tiny throughput, something is off.

Decision: If you’re pushing small writes and seeing big await, check ashift and also check sync write path (SLOG), dataset settings, and workload patterns.

Task 9: Validate volblocksize/recordsize expectations (don’t confuse them with ashift)

cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression tank
NAME  PROPERTY     VALUE     SOURCE
tank  recordsize   128K      default
tank  compression  lz4       local

Meaning: recordsize is 128K. That’s typical for general-purpose filesystems. It does not change ashift, but it affects IO patterns.

Decision: If your pool hosts VMs/databases, you may tune recordsize/volblocksize. But if ashift is wrong, tuning these is lipstick on a forklift.

Task 10: Check zvol block size (common in virtualization stacks)

cr0x@server:~$ zfs get -o name,property,value,source volblocksize tank/vmdata
NAME         PROPERTY      VALUE  SOURCE
tank/vmdata  volblocksize  8K     local

Meaning: zvol uses 8K blocks. If ashift is 9 on 4K physical devices, 8K may still be misaligned in subtle ways during partial writes.

Decision: Keep volblocksize aligned to workload (often 8K–16K for VMs/DBs), but prioritize fixing ashift mismatch first if present.

Task 11: Check for 4Kn vs 512e vs “mystery devices” using udev IDs

cr0x@server:~$ for d in /dev/disk/by-id/ata-ST12000NM0008-*; do echo "== $d =="; udevadm info --query=property --name="$d" | grep -E 'ID_MODEL=|ID_SERIAL=|ID_BUS='; done
== /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3C ==
ID_BUS=ata
ID_MODEL=ST12000NM0008-1JH101
ID_SERIAL=ST12000NM0008-1JH101_ZA1A2B3C

Meaning: Disks are direct ATA devices, not behind USB. Good: fewer lies about sectors.

Decision: If you see USB or unusual transport layers for production pools, assume sector reporting is suspect and be conservative (ashift=12 or 13 for new pools).

Task 12: Confirm ZFS is using native paths and not stale device names

cr0x@server:~$ zpool status -P tank | grep /dev/sd

Meaning: No /dev/sdX paths in the config output (good). Using by-id paths reduces operational foot-guns during disk replacement.

Decision: If your pool uses /dev/sdX, fix that during your next maintenance window (export/import with by-id) before doing anything risky like migrations.
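The fix itself is quick once you have the window. A minimal sketch, assuming nothing is actively using the pool and you can tolerate it being briefly unavailable:

cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ zpool status -P tank | grep /dev/sd

The final grep should print nothing. Don't do this casually on a pool backing live workloads; exporting unmounts every dataset.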

Task 13: Check ashift on a per-leaf basis (useful when vdev_tree output is long)

cr0x@server:~$ sudo zdb -l /dev/disk/by-id/ata-ST12000NM0008-1JH101_ZA1A2B3C | grep -i ashift
    ashift: 12

Meaning: The leaf device label indicates ashift 12 for the vdev it belongs to. This is a fast spot-check when you don’t want the whole config dump.

Decision: If label reads ashift: 9 and the device’s physical sector is 4096, you’ve confirmed a mismatch. Start planning the rebuild/migration.
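To sweep every leaf at once instead of spot-checking, a small loop works. The glob matches this example's drive model, so adjust it, and if your pool members are partitions rather than whole disks, point at the -part1 device nodes instead:

cr0x@server:~$ for d in /dev/disk/by-id/ata-ST12000NM0008-*; do echo "== $d =="; sudo zdb -l "$d" | grep -m1 ashift; done

Every member should report the same value. Any disagreement means either a genuinely mixed-ashift configuration or a label read from the wrong device path.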

Task 14: Measure a small-block sync write signature (don’t guess)

cr0x@server:~$ sudo zfs create -o sync=always -o recordsize=16K tank/ashift-test
cr0x@server:~$ dd if=/dev/zero of=/tank/ashift-test/blob bs=4K count=250000 oflag=dsync status=progress
1024000000 bytes (1.0 GB, 977 MiB) copied, 54 s, 19.0 MB/s
cr0x@server:~$ sync

Meaning: This crude test forces sync behavior. 4K blocks plus sync can expose alignment and latency problems quickly.

Decision: If throughput collapses and latency spikes while your hardware should do better, investigate ashift mismatch and also your SLOG/PLP story. Don’t “optimize” until you know which bottleneck you’re hitting.

Three corporate mini-stories (what went wrong, what saved us)

Mini-story 1: The incident caused by a wrong assumption

A team inherited a ZFS-backed VM platform that had been “stable for years.” It was also built during an era when half the disks in the supply chain were 512e,
and half the tooling still treated sector size as trivia.

The on-call started getting alerts: VM IO latency spiking, storage queue depths rising, and periodic application timeouts.
The team’s first assumption—an extremely corporate assumption—was that “the new workload is noisier,” because a new microservice platform had moved in.
They throttled the tenants, fiddled with IO scheduler knobs, and moved a few VMs around. The symptoms moved, but the problem didn’t.

Then a scrub ran. It ran into business hours, because it always did, and it got slower every month. This time it triggered a cascade:
latency rose, timeouts increased, retry storms started, and the monitoring system congratulated everyone by paging them harder.

The actual root cause was painfully boring: one top-level vdev had ashift=9 from the original pool creation,
while the disks underneath were 4K physical. It had “worked” when the pool was mostly empty, and when the workload was mostly sequential.
Years later, it became a tax on every random write and metadata update.

The fix wasn’t clever. It was expensive: build a new pool with ashift=12, migrate datasets, rotate tenants, and decommission the old vdevs.
The lesson stuck because it hurt: don’t assume your predecessors chose the right ashift, even if the system is old and “fine.”

Mini-story 2: The optimization that backfired

Another shop decided to “standardize performance” by creating new pools with a larger ashift because “bigger blocks are faster.”
Someone picked ashift=13 (8K) for everything, including small-IO database volumes and metadata-heavy CI workloads.

At first it looked okay. Large sequential writes were healthy, and the space graphs didn’t look horrifying.
Then they noticed that some zvol-heavy workloads got weird: small updates produced more backend churn than expected, snapshots grew faster than anticipated,
and the team started complaining about “ZFS overhead.”

The backfire wasn’t that ashift=13 is always wrong. It’s that they applied it universally without matching it to devices and workload.
On 4K-sector SSDs, 8K allocation increased internal amplification for certain patterns, and wasted space for tiny blocks and metadata.
They also made compression/dedup trade-offs worse in edge cases by forcing a larger minimum allocation size.

The recovery plan was again unglamorous: keep ashift=13 only where it made sense (some arrays and special cases),
and return to ashift=12 as the default. They documented the decision and required a justification for deviations.
The “performance standardization” turned into a standard practice: don’t pick ashift by vibes.

Joke #2: Nothing says “enterprise optimization” like making things worse in a standardized way.

Mini-story 3: The boring but correct practice that saved the day

A regulated environment ran ZFS for log archives and VM backups. Nothing exciting. The team had a change checklist that included:
record device sector sizes, record intended ashift, and store the zdb -C output at pool creation time.
It was dull enough that engineers complained about it—quietly, like professionals.

Years later, a storage refresh introduced a new HBA firmware and a new batch of drives. A pool expansion was planned:
add another RAIDZ vdev. During pre-flight checks, the engineer compared sector reporting for the new disks against the old ones.
The new path (through a different enclosure) reported 512 logical and 512 physical, despite the drive model being known 4K physical.

Because the team had the boring artifacts from prior builds, they caught the inconsistency before adding the vdev.
They moved the disks to a different enclosure slot path, got correct 512/4096 reporting, and proceeded with ashift=12.
No mixed-ashift pool, no latent performance cliff, no “mysterious” scrubs.

The win wasn’t just avoiding a mistake; it was avoiding a slow-motion mistake that would have surfaced months later, when nobody remembered the change.
Documentation didn’t make them faster. It made them less surprised. That’s the point.

Common mistakes: symptoms → root cause → fix

This section is deliberately specific. “Check your hardware” is not a fix; it’s procrastination with better posture.

1) Scrubs are painfully slow, but disks look healthy

  • Symptoms: Scrub progress crawls; zpool status shows no errors; IO looks busy but throughput is underwhelming.
  • Root cause: ashift=9 on 4K physical disks causing extra IO operations (especially visible on metadata and small blocks).
  • Fix: Confirm with zdb -C. If mismatched, plan pool migration/rebuild with ashift=12. Don’t try to “tune” scrub speed around wrong ashift.

2) Random writes are terrible; sequential seems fine

  • Symptoms: VM workloads complain; database fsync latency is high; backup writes look okay.
  • Root cause: Misaligned small writes; often ashift mismatch, sometimes sync workload without proper SLOG, sometimes both.
  • Fix: Verify ashift and sector sizes first. If ashift is correct, evaluate sync settings and SLOG (and power-loss protection if SSD-based).

3) One vdev “feels slower” after expansion

  • Symptoms: After adding a vdev, latency distribution worsens; performance becomes inconsistent across time and workloads.
  • Root cause: Mixed ashift across vdevs or a new vdev behind a device layer reporting different sector sizes.
  • Fix: Check zdb -C for each top-level vdev. If mixed, consider migrating hot datasets off the slow vdev or rebuilding.

4) You replaced disks and expected ashift to change

  • Symptoms: “We swapped to 4Kn drives; why does zdb still show ashift=9?”
  • Root cause: Ashift is set at vdev creation and doesn’t change via replacements/resilvering.
  • Fix: Rebuild/migrate to a new vdev/pool created with the intended ashift.

5) Benchmarks don’t reproduce production pain

  • Symptoms: fio on an empty pool looks fine; real workloads stutter when pool is busy.
  • Root cause: Benchmark doesn’t match IO size, sync semantics, concurrency, or pool fullness; ashift mismatch amplifies small random IO particularly under contention.
  • Fix: Benchmark with 4K/8K random writes, realistic sync settings, and multiple jobs. Confirm ashift before trusting any benchmark narrative.

6) USB/JBOD enclosures produce “random” performance behavior

  • Symptoms: Sector sizes appear inconsistent; ashift selection differs across identical disks; performance changes after replug/reboot.
  • Root cause: Bridge firmware lies or changes reported characteristics; ZFS locks ashift based on what it saw at creation.
  • Fix: Avoid USB bridges for serious ZFS pools. If you must, force ashift at creation via zpool create -o ashift=12 and validate with zdb -C immediately.

Checklists / step-by-step plan

Checklist A: “Do I have an ashift mismatch right now?”

  1. Identify the pool and vdevs: zpool status -P.
  2. Dump config and find ashift: sudo zdb -C poolname (a multi-pool sweep sketch follows this checklist).
  3. Record ashift per top-level vdev (mirror/raidz/draid/special).
  4. Check each leaf disk’s sector sizes: lsblk -d -o NAME,LOG-SEC,PHY-SEC.
  5. If any vdev’s allocation unit (2^ashift) is smaller than its disks’ physical sector size (e.g., ashift=9 on 4096-byte-physical drives), you have a mismatch that matters.
  6. If ashift is mixed, you have a mismatch that will bite in uneven ways.
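If you run more than one pool per host, sweep them all in one pass rather than trusting memory. A minimal sketch using only the commands already shown above:

cr0x@server:~$ for p in $(zpool list -H -o name); do echo "== $p =="; sudo zdb -C "$p" | grep ashift; done

One ashift line per top-level vdev should come back for each pool; anything below 12 on 4K-physical hardware, or any pool printing a mix of values, goes on the migration list.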

Checklist B: “Is ashift actually the bottleneck?”

  1. Confirm workload IO size and sync behavior (VMs and databases usually mean small random + sync-ish patterns).
  2. During load, run zpool iostat -v 1 and iostat -x 1 side-by-side.
  3. Look for high ops/s with low bandwidth and high latency—classic small IO amplification signature.
  4. If ashift is correct, check:
    • pool fullness/fragmentation (zpool list FRAG/CAP)
    • dataset properties (sync, recordsize, volblocksize)
    • special vdev metadata pressure (if used)
    • SLOG design (if sync workloads dominate)

Checklist C: “Migration/rebuild plan when ashift is wrong”

  1. Stop arguing with physics. Decide to rebuild/migrate rather than tune around.
  2. Provision new storage (new pool or new vdevs) with explicit -o ashift=12 (or a justified higher value).
  3. Verify new ashift immediately with sudo zdb -C and record it.
  4. Replicate datasets (a minimal send/receive sketch follows this checklist):
    • zfs snapshot
    • zfs send | zfs receive
    • or use your existing replication tooling
  5. Cut over workloads gradually; measure latency and scrub speed.
  6. Decommission old pool/vdevs only after you have:
    • validated restores
    • validated scrub on new pool
    • validated performance under peak-ish load
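For step 4, the core replication pattern is short. A minimal sketch, assuming the destination pool is named newtank, it was created with the correct ashift, and the snapshot names are placeholders; real migrations usually add an incremental pass right before cutover:

cr0x@server:~$ sudo zfs snapshot -r tank@migrate-1
cr0x@server:~$ sudo zfs send -R tank@migrate-1 | sudo zfs receive -uF newtank
cr0x@server:~$ sudo zfs snapshot -r tank@migrate-2
cr0x@server:~$ sudo zfs send -R -i tank@migrate-1 tank@migrate-2 | sudo zfs receive -uF newtank

The -R flag carries child datasets, snapshots, and properties; -u keeps the received datasets unmounted until cutover; -F lets the destination roll back to match the incoming stream. Verify ashift on newtank with zdb -C before the first full send, not after.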

FAQ

1) Can I change ashift on an existing vdev or pool?

Not in-place. Ashift is baked into how blocks were allocated on disk. The practical fix is to create a new vdev/pool with the right ashift and migrate data.

2) Where do I reliably see ashift for an existing pool?

Use sudo zdb -C poolname and look for ashift: N under each top-level vdev in vdev_tree.
zpool get ashift only reports the pool-level default applied to newly added vdevs (0 means auto-detect); it doesn’t reveal the values already baked into existing vdevs.

3) If my disks are 512e (512 logical / 4096 physical), what ashift should I want?

Usually ashift=12. That aligns ZFS allocations to 4K boundaries and avoids read-modify-write at the drive.

4) Is ashift=13 “better” than ashift=12?

Not automatically. It can make sense for some arrays and some flash characteristics, but it also increases minimum allocation size.
Default to 12 unless you have a specific reason and you’ve tested the workload.

5) I replaced every disk with 4Kn drives. Why didn’t ashift update?

Because ashift is determined when the vdev is created, not when disks are replaced. Replacements inherit the vdev’s existing ashift.

6) If ashift is wrong, why didn’t ZFS detect it?

ZFS uses the device’s reported characteristics at creation time. If the device layer reported 512B sectors (even falsely), ZFS believed it.
ZFS is skeptical about corruption, not about vendor honesty.

7) What are the most reliable sector-size checks on Linux?

lsblk -o LOG-SEC,PHY-SEC and blockdev --getss/--getpbsz. If those disagree or look suspicious behind bridges/controllers, treat it as a warning.

8) Can mixed ashift vdevs exist in one pool?

Yes. It often happens after expansions or when adding vdevs from different hardware paths. It can work, but it complicates performance and predictability.

9) Does recordsize/volblocksize “fix” ashift mismatch?

No. They influence IO patterns and can reduce pain in some cases, but they can’t change the on-disk allocation granularity of existing vdevs.

10) What’s the safest operational response once I confirm ashift=9 on 4K physical drives?

Plan a migration/rebuild. Keep the pool stable, avoid “hero tuning,” and schedule a controlled move to a correctly created pool.

Conclusion: next steps you can actually take

Ashift mismatch detection is not an academic exercise. It’s the difference between “this pool is old but fine” and “this pool is quietly stealing our latency budget.”
If you suspect a mismatch, don’t debate it—measure it.

Practical next steps:

  1. Run sudo zdb -C and record ashift per vdev.
  2. Confirm disk sector sizes with lsblk and blockdev.
  3. If you find ashift=9 on 4K physical devices, stop tuning and start planning a rebuild/migration.
  4. If ashift is correct, move on quickly: check sync behavior, pool fullness, fragmentation, and workload IO sizes.
  5. Add one boring practice to your future: capture zdb -C output at pool creation time and store it with change records.