The pager goes off. A pool that’s been fine for years suddenly won’t import after a reboot, a controller swap, a firmware update, or an “I only changed one thing” maintenance window.
You don’t need heroics. You need a method that’s boring, repeatable, and leaves you with evidence. This is the fix path I use when the pool won’t import and people are starting to say words like “recreate” and “restore” too early.
The mindset: preserve evidence, reduce writes
When a pool won’t import, you’re in a forensics situation disguised as an outage. Your goals are:
- Stop making it worse. Reduce writes to the affected disks until you understand what happened.
- Capture the state. Save command output. Copy logs. Write down the timeline.
- Make one change at a time. If you flip multiple flags and it “works,” you won’t know which change mattered or what risk you accepted.
Also: treat “force” flags like a chainsaw. Useful tool, but you don’t juggle it in the server room.
One operations quote worth keeping on a sticky note: “Hope is not a strategy.”
Short joke #1: ZFS won’t import because it’s feeling cautious today. Same, ZFS. Same.
Hard rule: start read-only whenever you can
If your first successful import is read-write and you were wrong about the failure mode, you can roll metadata forward into a new, worse reality. Prefer a read-only import for initial inspection. You can always remount later.
What “won’t import” actually means
That phrase hides multiple symptoms:
- zpool import shows the pool, but import fails with I/O errors.
- zpool import does not show the pool at all.
- Import “hangs” (actually stuck in I/O retries or waiting for slow devices).
- Import works, but datasets won’t mount or the pool is suspended immediately.
- The wrong host imports it (multi-host access, SAN, shared JBOD disasters).
The fix path depends on which one you have. So we triage fast.
Fast diagnosis playbook (first/second/third)
This is the “stop scrolling, start checking” sequence. The point is to find the bottleneck quickly: is it device discovery, device health, pool metadata, or mounts/keys?
First: can the OS see the disks reliably?
- Check kernel logs for resets/timeouts.
- Confirm stable device identity: by-id, WWN, or GPT IDs.
- Make sure the HBA/controller sees the same count of disks you expect.
If the OS can’t see the disks cleanly, ZFS is not the problem. ZFS is just the first honest witness.
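If you want this check to be repeatable instead of eyeballed, a small sketch helps. Everything here is illustrative: the inventory file name and the demo paths are made up, and on a real box you would point it at /dev/disk/by-id.

```shell
# check_wwns EXPECTED_FILE BYID_DIR
# Prints WWNs listed in the inventory that the OS no longer presents.
check_wwns() {
    ls "$2" 2>/dev/null | grep '^wwn-' | sort > /tmp/present-wwns.txt
    sort "$1" > /tmp/expected-wwns.txt
    comm -23 /tmp/expected-wwns.txt /tmp/present-wwns.txt
}

# Demo with a synthetic by-id directory; on a real box:
#   check_wwns inventory.txt /dev/disk/by-id
mkdir -p /tmp/demo-byid
touch /tmp/demo-byid/wwn-0x5000c500a1b2c3d4 /tmp/demo-byid/wwn-0x5000c500a1b2c3d5
printf 'wwn-0x5000c500a1b2c3d4\nwwn-0x5000c500a1b2c3d5\nwwn-0x5000c500a1b2c3d6\n' > /tmp/demo-inventory.txt
check_wwns /tmp/demo-inventory.txt /tmp/demo-byid
```

Anything the function prints is a disk the OS no longer sees; empty output means the inventory matches reality.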
Second: does zpool import see the pool, and what does it say?
- If the pool appears: read the status text carefully, especially “UNAVAIL,” “was /dev/…,” “insufficient replicas,” and last TXG.
- If it does not appear: scan paths manually, check cachefile confusion, and consider label damage.
Third: import in the safest viable mode
- Try read-only import first (-o readonly=on).
- Only then consider rewind/rollback flags (-F, with or without -X), and only after you know what you’re sacrificing.
- If encryption is in play, separate “pool import” from “dataset mount”: keys can block mounts even when the pool imports.
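The triage order above can be sketched as a tiny router over the text zpool import prints. The matched strings are examples of common OpenZFS messages; your version’s wording may differ, so treat this as a note-taking aid, not an oracle.

```shell
# classify_import: read `zpool import` output on stdin, name the likely bucket.
classify_import() {
    text=$(cat)
    case "$text" in
        *"insufficient replicas"*)
            echo "device layer: too many vdev members missing" ;;
        *"currently imported by another system"*)
            echo "ownership: verify the other host before any -f" ;;
        *"unsupported feature"*)
            echo "feature flags: import on a newer OpenZFS host" ;;
        *"no pools available"*)
            echo "discovery: scan -d /dev/disk/by-id and check dmesg" ;;
        *)
            echo "read the status text; don't reach for -f yet" ;;
    esac
}

# Demo on a captured snippet; on a real box: zpool import 2>&1 | classify_import
printf 'state: UNAVAIL\nraidz1-0 UNAVAIL insufficient replicas\n' | classify_import
```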
Interesting facts and context (why ZFS behaves this way)
- ZFS stores multiple labels per device. Each disk typically has labels near the start and end, which is why partial label damage can still be recoverable.
- Pool imports are transaction-based. ZFS advances through TXGs (transaction groups). Rewind operations are about selecting an older consistent TXG.
- “Uberblocks” are the breadcrumbs. ZFS writes multiple uberblocks; import logic chooses the best valid one it can find across devices.
- OpenZFS is the living line. ZFS originated at Sun Microsystems, then continued as OpenZFS across illumos, FreeBSD, and Linux, with feature flags used to manage compatibility.
- Feature flags are not cosmetic. A pool with newer feature flags may not import on an older system, even if the disks are healthy.
- ZFS tries hard to protect you from split-brain. The pool “hostid” and “multihost” behaviors exist because importing the same pool on two machines can corrupt it fast.
- Self-healing needs redundancy. Checksums can detect corruption, but without mirrors/RAIDZ parity, ZFS can’t always repair automatically.
- Import can be slow by design. With many disks or flaky hardware, ZFS may probe labels and retry I/O; it can look like a hang while it’s being painfully thorough.
Hands-on tasks: commands, outputs, and decisions
Below are practical tasks I run in roughly this order. Each one includes what to look for and what decision it drives. Run them as root or with sudo. Save output to a file if you’re on a call with other teams.
Task 1: Capture the obvious: what does ZFS think right now?
cr0x@server:~$ zpool status -v
no pools available
What it means: Nothing is imported. That’s not yet a diagnosis; it’s a starting point.
Decision: Move to discovery: can the system even see the pool on disk?
Task 2: List importable pools (and don’t ignore the text)
cr0x@server:~$ zpool import
pool: tank
id: 1234567890123456789
state: UNAVAIL
status: One or more devices are unavailable.
action: The pool cannot be imported. Attach the missing devices and try again.
see: zpool(8)
config:
tank UNAVAIL insufficient replicas
raidz1-0 UNAVAIL insufficient replicas
sda ONLINE
sdb ONLINE
sdc UNAVAIL cannot open
sdd ONLINE
What it means: ZFS sees the pool metadata and tells you exactly why it won’t import: insufficient replicas, because a vdev is missing members. RAIDZ1 tolerates one missing disk, so either more than one is gone, or the “missing” disk is physically present but unreachable.
Decision: Stop trying random import flags. Fix device visibility first: find out why sdc can’t open.
Task 3: Confirm device identity (don’t trust /dev/sdX)
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'ata-|wwn-' | head
lrwxrwxrwx 1 root root 9 Feb 4 09:11 ata-SAMSUNG_MZ7LM960HMJP-00005_S4X9NX0M123456 -> ../../sda
lrwxrwxrwx 1 root root 9 Feb 4 09:11 ata-SAMSUNG_MZ7LM960HMJP-00005_S4X9NX0M123457 -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 4 09:11 ata-SAMSUNG_MZ7LM960HMJP-00005_S4X9NX0M123458 -> ../../sdd
What it means: You’re missing an expected by-id entry, or it points somewhere unexpected. That’s often a cabling/HBA issue, a dead disk, or a disk that changed mode/namespace.
Decision: If a disk is absent at the OS level, no ZFS flag will magic it back. Go to kernel logs and controller state.
Task 4: Kernel log triage for I/O errors and resets
cr0x@server:~$ dmesg -T | egrep -i 'error|reset|timeout|sas|scsi|nvme' | tail -n 20
[Mon Feb 4 09:08:12 2026] sd 6:0:12:0: [sdc] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 4 09:08:12 2026] sd 6:0:12:0: [sdc] Sense Key : Hardware Error [current]
[Mon Feb 4 09:08:12 2026] sd 6:0:12:0: [sdc] Add. Sense: Internal target failure
[Mon Feb 4 09:08:13 2026] mpt3sas_cm0: log_info(0x31120405): originator(PL), code(0x12), sub_code(0x0405)
[Mon Feb 4 09:08:14 2026] scsi 6:0:12:0: rejecting I/O to offline device
What it means: The disk or path is failing at the transport layer. ZFS is correct to refuse import if redundancy is insufficient.
Decision: Fix hardware first: reseat, replace, or move the drive/path. If this is a dual-path SAN/JBOD, check multipath configuration.
Task 5: Verify you’re on the right OpenZFS version/features
cr0x@server:~$ zpool upgrade -v | head -n 12
This system supports ZFS pool feature flags.
The following features are supported:
FEAT DESCRIPTION
async_destroy Destroy filesystems asynchronously.
bookmarks ZFS bookmarks.
embedded_data Blocks which compress very well use even less space.
What it means: This host supports feature flags generally, but that doesn’t confirm the pool’s exact flags.
Decision: If the pool was last imported on a newer system, you may need to import it there (or upgrade this system). Mismatch problems often show as “unsupported feature” on import.
Task 6: Get detailed import scan output (the pool’s story)
cr0x@server:~$ zpool import -d /dev/disk/by-id -o cachefile=none -N -f tank
cannot import 'tank': one or more devices is currently unavailable
What it means: Even when scanning stable paths and avoiding stale cachefiles, a vdev is missing.
Decision: If redundancy allows, you might import degraded. If it doesn’t, stop and fix the missing device(s) first.
Task 7: See whether ZFS is just waiting (import “hang” triage)
cr0x@server:~$ zpool import -d /dev/disk/by-id -o cachefile=none -N -f tank
What it means: No output and no prompt can mean the import is stuck doing I/O. Don’t assume it’s dead; assume it’s blocked.
Decision: In another terminal, confirm whether the process is active and whether devices are timing out.
cr0x@server:~$ ps -eo pid,etime,cmd | egrep 'zpool import|PID'
PID ELAPSED CMD
8421 02:14 zpool import -d /dev/disk/by-id -o cachefile=none -N -f tank
Decision: If elapsed time grows and logs show retries/timeouts, you have a hardware path problem. If there are no I/O errors, you might have a very large pool and slow label probing; be patient but watch metrics.
Task 8: Check active I/O and identify the slow device
cr0x@server:~$ iostat -x 1 5
Linux 6.6.0 (server) 02/04/2026 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.02 0.00 2.31 35.44 0.00 61.23
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sda 1.0 32.0 0.0 0.0 12.0 32.0 0.0 0.0 0.0 0.01 2.0
sdb 1.0 32.0 0.0 0.0 10.0 32.0 0.0 0.0 0.0 0.01 2.0
sdc 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0
sdd 1.0 32.0 0.0 0.0 11.0 32.0 0.0 0.0 0.0 0.01 2.0
What it means: High iowait with one device showing nothing can mean it’s not responding at all, or the kernel has taken it offline. Alternatively, a single device at 100% util with huge await indicates it’s dragging the import.
Decision: If a device is offline/unresponsive, fix that. If a device is just slow, you can sometimes complete a read-only import to extract data before replacing it.
Task 9: Try a read-only, no-mount import (safe first import)
cr0x@server:~$ zpool import -d /dev/disk/by-id -o cachefile=none -o readonly=on -N tank
cannot import 'tank': one or more devices is currently unavailable
What it means: Read-only doesn’t bypass missing devices. It just reduces risk when import is possible.
Decision: If redundancy exists (mirror/RAIDZ with tolerable loss), use degraded import. If not, stop and recover the missing disk/path.
Task 10: Attempt degraded import when redundancy allows
cr0x@server:~$ zpool import -d /dev/disk/by-id -o cachefile=none -o readonly=on -N -f -o failmode=continue tank
cannot import 'tank': insufficient replicas
What it means: ZFS is telling you there is no safe way to assemble consistent data because the vdev can’t be reconstructed. This is the “you’re missing too much” message.
Decision: Hardware recovery or restore from backup. Do not attempt clever flags expecting miracles; you might only create new damage.
Task 11: If the pool imports but datasets won’t mount, check encryption and mount settings
cr0x@server:~$ zpool import -d /dev/disk/by-id -o cachefile=none -N -f vault
cr0x@server:~$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
vault 1.2T 800G 96K /vault
vault/secure 0B 800G 96K /vault/secure
What it means: Pool is imported, datasets are visible, but not necessarily mounted. If encryption is enabled, mounts may fail until keys are loaded.
Decision: Check key status, then load keys and mount intentionally.
cr0x@server:~$ zfs get -H -o name,property,value keystatus vault/secure
vault/secure keystatus unavailable
Decision: Load the key. If you don’t have it, stop and escalate; brute force is not a feature.
cr0x@server:~$ zfs load-key -r vault/secure
Enter passphrase for 'vault/secure':
cr0x@server:~$ zfs mount -a
cr0x@server:~$ mount | grep vault
vault on /vault type zfs (rw,relatime,xattr,noacl)
vault/secure on /vault/secure type zfs (rw,relatime,xattr,noacl)
Task 12: If import complains about “active pool” or wrong host, verify hostid and multihost situation
cr0x@server:~$ zpool import
pool: prod
id: 9876543210987654321
state: ONLINE
status: The pool is currently imported by another system.
action: The pool must be exported from the other system, then imported.
see: zpool(8)
config:
prod ONLINE
mirror-0 ONLINE
wwn-0x5000c500a1b2c3d4 ONLINE
wwn-0x5000c500a1b2c3d5 ONLINE
What it means: ZFS thinks another host owns it. Sometimes that’s true (shared shelf), sometimes it’s stale state after a crash, sometimes it’s split-brain waiting to happen.
Decision: Confirm the other host is truly down or exported. If you’re not 100% sure, stop. Multi-host corruption is fast and humiliating.
Task 13: If you must force import, do it read-only first
cr0x@server:~$ zpool import -f -o readonly=on -o cachefile=none -N prod
cr0x@server:~$ zpool status -x
pool 'prod' is healthy
What it means: You got the pool in safely (read-only, no mounts). This is inspection mode.
Decision: Validate data and device health before flipping to read-write. If this is an ownership issue, plan a clean export/import sequence.
Task 14: Check pool-wide errors and whether it suspended itself
cr0x@server:~$ zpool status -v prod
pool: prod
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
config:
NAME STATE READ WRITE CKSUM
prod SUSPENDED 0 0 0
mirror-0 DEGRADED 0 0 0
wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d5 FAULTED 0 12 0 too many errors
errors: No known data errors
What it means: ZFS suspended I/O to protect consistency. This is usually a device that went sideways mid-flight.
Decision: Fix/replace the faulted device first. Clearing without fixing is asking ZFS to resume writing into a fire.
Task 15: After fixing hardware, clear errors and attempt recovery actions deliberately
cr0x@server:~$ zpool clear prod
cr0x@server:~$ zpool status prod
pool: prod
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
config:
NAME STATE READ WRITE CKSUM
prod DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
wwn-0x5000c500a1b2c3d5 UNAVAIL 0 0 0 cannot open
What it means: Pool is running on remaining replicas. You need to reattach/replace the missing side.
Decision: Replace the failed disk, then resilver. Don’t leave it degraded “for a bit.” That’s how “a bit” becomes “a resume-generating event.”
Task 16: Identify recent TXGs and consider rewind only when corruption is plausible
cr0x@server:~$ zpool import -F -n -d /dev/disk/by-id tank
Would be able to return pool to state as of Tue Feb 3 18:22:41 2026.
Would discard 2.14G of transactions.
What it means: Dry-run rewind is available. It would roll back some recent writes. That can be acceptable if the alternative is “no pool.” It can also be unacceptable if those writes were critical.
Decision: Pause and align with stakeholders. If you proceed, do the first successful import read-only, confirm datasets, then decide about read-write.
Task 17: Execute rewind import carefully (only if you chose to sacrifice recent TXGs)
cr0x@server:~$ zpool import -F -o readonly=on -o cachefile=none -N -d /dev/disk/by-id tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
status: Pool was previously in use from another system.
action: Export the pool from the other system, then import it.
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
What it means: The pool is accessible again (in read-only inspection mode). The “previously in use” warning is common after crashes or migrations; it’s not always a real multi-host condition, but treat it seriously.
Decision: Validate data, export cleanly if needed, then import normally read-write. If you actually have multi-host access, fix that architecture.
Task 18: Once imported, mount intentionally and verify
cr0x@server:~$ zfs mount -a
cr0x@server:~$ zfs list -o name,used,avail,mountpoint | head
NAME USED AVAIL MOUNTPOINT
tank 128K 10.9T /tank
tank/home 2.1T 10.9T /tank/home
tank/vm 6.4T 10.9T /tank/vm
What it means: Datasets are mounted, mountpoints make sense, and you can proceed to application-level checks.
Decision: If mountpoints are wrong (e.g., legacy vs ZFS-managed), pause and fix properties before starting services.
Failure modes you can actually fix
1) The pool is fine; the OS can’t see the disks (or sees them differently)
Most “ZFS won’t import” incidents are really “storage layer changed.” Common triggers:
- HBA firmware update changed how drives enumerate.
- SAS expanders flapping under load.
- NVMe namespaces changed after a vendor tool run.
- Multipath misconfiguration presenting two device nodes per LUN.
Fix: Get the device layer stable first. Use by-id paths; confirm all expected WWNs exist; remove stale multipath nodes; replace failing cables/transceivers.
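One cheap way to separate “dead disk” from “dead path” is to ask each device for a single block. This is a read-only sketch; the demo runs on ordinary files so the logic is visible, and the real invocation against by-id paths is shown in the comment.

```shell
# probe_first_block: can each device serve even one 4 KiB read?
# Read-only probe (dd writes only to /dev/null).
probe_first_block() {
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=4096 count=1 2>/dev/null; then
            echo "readable: $dev"
        else
            echo "CANNOT READ: $dev"
        fi
    done
}

# Demo on ordinary files; on a real box: probe_first_block /dev/disk/by-id/wwn-*
printf 'x' > /tmp/ok.img
probe_first_block /tmp/ok.img /tmp/does-not-exist.img
```

A device that cannot serve a single block is a transport or hardware problem, and no ZFS flag will change that.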
2) Cachefile confusion and stale device paths
On some systems, ZFS uses a cachefile to remember vdev paths. If the cachefile points to old /dev/sdX names, import may fail even though the drives exist.
Fix: Use -o cachefile=none and explicitly scan -d /dev/disk/by-id. After a clean import, set a sane cachefile path for your OS, or let the service manage it.
3) Feature flag / version mismatch
If a pool was upgraded (feature flags enabled) on one host, older hosts may refuse to import it. This looks like an import error, not a disk failure.
Fix: Import on a system with compatible OpenZFS feature support. If you must move pools between hosts, manage ZFS versions like you manage database versions: deliberately.
4) Corruption in the most recent TXGs after a crash (rewind territory)
Unexpected power loss plus write cache lies plus just the wrong timing can leave the latest TXG inconsistent. ZFS usually detects and avoids inconsistent uberblocks, but sometimes you need to rewind.
Fix: Use zpool import -F -n to see the rollback cost. Proceed only if you accept discarding recent transactions. Prefer first import read-only.
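Since the dry-run output is exactly what stakeholders will ask about, it’s worth capturing the cost line verbatim. A minimal sketch, assuming output similar to the example earlier in this article; the exact wording varies across OpenZFS versions, so adjust the pattern to what yours prints.

```shell
# rewind_cost: keep the "would discard" line for the incident record.
rewind_cost() {
    grep -i 'discard'
}

# Demo on captured output; on a real box: zpool import -F -n tank 2>&1 | rewind_cost
printf 'Would be able to return pool to state as of Tue Feb  3 18:22:41 2026.\nWould discard 2.14G of transactions.\n' | rewind_cost
```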
5) Pool suspended due to I/O failures
This is ZFS doing the right thing: halting I/O to protect consistency. People hate it because it’s loud, but it’s a better outcome than silent corruption.
Fix: Identify and replace/restore the failing device path. Then zpool clear and resilver.
6) “Imported elsewhere” warnings (real multi-host or stale state)
Sometimes the other host is real. Sometimes it died and left breadcrumbs. Either way, importing on two hosts is how you turn a recoverable incident into a forensic art project.
Fix: Confirm exclusivity. If the other host is up, export there. If it’s dead, ensure it cannot see the disks (power off, disconnect shelf) before forcing import.
Short joke #2: Forcing an import on shared storage without checking the other host is like merging to main without tests—technically possible, emotionally expensive.
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption
The storage team replaced an HBA in a virtualization host during a planned window. New card, same model family, same cabling. The engineer assumed device names would stay stable because “Linux always finds the disks.” It did find them—just in a different order.
After reboot, the ZFS pool wouldn’t import. The on-call saw cannot open errors for a couple of devices and immediately suspected disk failure. A ticket went to the data center to pull drives. That’s a costly reflex: drive swaps are easy to do wrong, and every pull is a chance to create a second outage.
Thirty minutes later, someone finally ran zpool import -d /dev/disk/by-id -o cachefile=none and saw the pool metadata perfectly intact. The problem was that the system’s cachefile referenced old /dev/sdX paths and the new HBA enumeration didn’t match. Nobody had confirmed stable identifiers.
The fix was boring: import using by-id, update the cachefile behavior, and add a pre-flight checklist step that validates all WWNs are present after any controller change. The postmortem highlighted the real issue: the wrong assumption wasn’t about ZFS; it was about identity.
Mini-story #2: The optimization that backfired
A team wanted faster sync write performance for a logging pipeline. They added fast consumer SSDs as a SLOG device, configured aggressively, and celebrated when benchmarks doubled. Then reality arrived: a power event plus a brownout that didn’t fully drop the rack, just long enough to confuse several devices.
After the reboot, the pool import sometimes worked and sometimes hung. The logs showed intermittent resets and “Internal target failure” on one of the SSDs. Because the SLOG sat on a flaky consumer device without power-loss protection, the system spent far too long trying to talk to it during import. It wasn’t the only issue, but it was the loudest.
The first attempted “fix” was to force import with every flag under the sun. That made the system occasionally import, but it wasn’t stable. ZFS did what it could; the hardware did what it wanted.
The actual recovery: physically remove the failing log device (or replace it with a known-good, PLP-capable device), then import read-only, then replace bad parts, then reintroduce a log device with proper selection criteria. The “optimization” turned into an availability tax.
The lesson: if you optimize durability boundaries (sync writes, intent logs, write caches), you inherit their failure modes. If your business cares about those writes, your hardware has to care too.
Mini-story #3: The boring but correct practice that saved the day
One enterprise had a policy that every storage chassis had a laminated sheet taped inside the front door. It listed disk slot → WWN → pool/vdev membership. It looked like something from a 1990s NOC, and everyone made fun of it until it mattered.
A pool stopped importing after a maintenance contractor “cleaned up cabling.” Half the disks were present, half were invisible. The contractor swore nothing was moved. The OS logs said otherwise.
Because the team had slot-to-WWN mapping, they quickly identified which physical slots corresponded to the missing WWNs and traced the problem to one expander cable seated halfway. No guessing. No “try swapping sdc and sdd.” No accidental removal of the wrong drive from the only surviving mirror side.
The pool imported degraded, then fully after the cable fix. Resilver completed overnight. The policy was dull. It was also the difference between a two-hour incident and a multi-day restore.
Common mistakes: symptom → root cause → fix
1) Symptom: zpool import shows nothing
Root cause: Devices not discovered, wrong scan path, or labels not readable due to hardware/driver issues. Sometimes the pool exists but is behind multipath nodes you aren’t scanning.
Fix: Scan explicit directories (-d /dev/disk/by-id), check dmesg for device discovery errors, and confirm the expected WWNs exist. If on SAN, verify multipath presents a single stable device per LUN.
2) Symptom: Import fails with “insufficient replicas”
Root cause: Too many missing devices in a vdev. Mirrors need one side; RAIDZ needs enough members to reconstruct parity; if you’re below that threshold, ZFS can’t assemble the vdev.
Fix: Restore missing devices (cable/HBA/path), replace failed drives, or recover from backup. Do not expect -f to bypass physics.
3) Symptom: Import hangs indefinitely
Root cause: Kernel is retrying I/O to a device that’s timing out, often due to a failing disk, expander, or controller firmware mismatch.
Fix: Identify the slow/failing device via kernel logs and iostat. Remove/replace the offender. Then retry import, preferably read-only first.
4) Symptom: Pool imports, but services fail because datasets aren’t mounted
Root cause: Import with -N, mountpoint properties, legacy mounts, or encryption keys not loaded.
Fix: Check zfs get mountpoint,canmount,keystatus. Load keys if needed. Mount explicitly with zfs mount -a or per-dataset.
5) Symptom: Pool immediately becomes SUSPENDED
Root cause: ZFS detected repeated I/O failures and suspended to protect the pool. Often a device is faulting under write load.
Fix: Replace/fix the device path. Then clear errors and resilver. Do not keep clearing without fixing hardware.
6) Symptom: “The pool is currently imported by another system”
Root cause: Either it is imported elsewhere, or the system believes it was (stale host state after crash). In shared storage environments, this is the warning that matters most.
Fix: Confirm exclusivity: ensure other hosts can’t see the disks, then import read-only first. Implement multihost and operational controls if shared shelves exist.
7) Symptom: Import error mentions unsupported features
Root cause: Pool created or upgraded with feature flags not supported by this ZFS implementation/version.
Fix: Import on a newer compatible host. Align ZFS versions across fleet. Don’t “upgrade pool features” casually in mixed environments.
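Before moving a pool between hosts, you can diff the two hosts’ supported feature lists. A sketch, assuming you’ve trimmed each zpool upgrade -v capture down to one feature name per line and sorted it; the file names here are placeholders.

```shell
# features_only_on_source SRC_FILE DST_FILE
# Expects one feature name per line, sorted. Anything printed exists on the
# source host but not the target: a potential import blocker.
features_only_on_source() {
    comm -23 "$1" "$2"
}

# Demo with synthetic, pre-trimmed captures (file names are placeholders)
printf 'async_destroy\nbookmarks\nzstd_compress\n' > /tmp/features-src.txt
printf 'async_destroy\nbookmarks\n' > /tmp/features-dst.txt
features_only_on_source /tmp/features-src.txt /tmp/features-dst.txt
```

Empty output means the target should at least recognize every feature; it still doesn’t prove the pool has none of them active, so verify on the target before committing.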
8) Symptom: After forced import, data seems “older” than expected
Root cause: Rewind/rollback discarded recent TXGs, or applications assumed sync semantics they didn’t have.
Fix: Treat rewind as data loss by design. Communicate clearly. Validate application consistency; restore app-level logs if available.
Checklists / step-by-step plan
Phase 0: Freeze the scene (5–10 minutes)
- Stop any automation that might keep retrying imports or mounting datasets.
- Capture: dmesg -T, journalctl -k (if available), zpool import, and hardware inventory output.
- Confirm whether this is shared storage. If yes, ensure only one host can access the disks.
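Phase 0 is easier to enforce as a script than as a habit. A minimal sketch: the output directory and command list are just conventions for this example, and failed commands are captured rather than hidden, because “command not found” is evidence about the environment too.

```shell
# One-shot evidence capture. Output directory and command list are conventions
# for this sketch; add whatever your fleet needs.
dir="/tmp/zfs-incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dir"
for cmd in 'dmesg -T' 'zpool import' 'zpool status -v' 'ls -l /dev/disk/by-id'; do
    out="$dir/$(echo "$cmd" | tr ' /' '__').txt"
    # Capture stderr too: errors are part of the record.
    sh -c "$cmd" > "$out" 2>&1 || true
done
echo "evidence saved to $dir"
```

Run it before the first import attempt and again after any state change; the diff between snapshots is often the timeline you’ll need in the postmortem.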
Phase 1: Device layer truth
- List stable identifiers: ls -l /dev/disk/by-id.
- Compare expected WWNs to present WWNs (from your inventory, labels, or last known-good zpool status output).
- Check kernel logs for resets/timeouts for the missing devices.
- Fix cabling/HBA/drive issues until all expected devices are present and stable.
Phase 2: Non-destructive ZFS discovery and import attempt
- Scan explicitly: zpool import -d /dev/disk/by-id.
- If the pool appears, note state, missing vdevs, and any “last TXG” hints.
- Try safe import: zpool import -d /dev/disk/by-id -o cachefile=none -o readonly=on -N POOL.
- If it imports, inspect zpool status -v and zfs list before mounting or starting services.
Phase 3: Controlled risk (only when necessary)
- If import fails due to suspected recent corruption, try dry-run rewind: zpool import -F -n POOL.
- Decide whether you can discard the shown transactions. Treat it as data loss.
- If proceeding, import with rewind read-only first: zpool import -F -o readonly=on -N POOL.
- Validate data, then export and import normally read-write if appropriate.
Phase 4: After import—make it stable again
- Replace failed devices and resilver.
- Run a scrub when the system is stable and performance impact is acceptable.
- Document the root cause. Update runbooks and inventory mappings.
What to avoid (these are repeat offenders)
- Don’t run destructive commands (like repartitioning, formatting, or “initializing” disks) on devices you haven’t positively identified by WWN/serial.
- Don’t upgrade pool features during recovery. Recovery time is not feature time.
- Don’t import read-write first when you’re unsure. You’re debugging; minimize side effects.
- Don’t ignore hardware logs. They’re usually telling you the truth you don’t want to hear.
FAQ
1) Should I try zpool import -f immediately?
No. First determine whether the pool is actually imported elsewhere, and whether devices are missing. Force import is for ownership conflicts, not missing disks.
2) What does -o cachefile=none buy me?
It prevents ZFS from trusting potentially stale cached paths and forces a fresh scan of the devices you specify. It’s a cheap way to eliminate “old /dev/sdX name” problems.
3) Why do you keep saying “scan by-id”?
/dev/sdX names are assigned at boot and can change when controllers, firmware, or disk timing changes. By-id and WWN naming is how you stay sane in production.
4) If I can import read-only, can I just copy data off and rebuild?
Often yes, and it’s a great move when you suspect ongoing hardware degradation. Read-only import reduces the chance you’ll worsen metadata issues while extracting data.
5) When is rewind (zpool import -F) appropriate?
When you strongly suspect the most recent TXGs are inconsistent (crash mid-write, write cache lies, sudden power loss) and normal import fails. Always do -n first to see the cost.
6) Can ZFS recover from silent corruption without redundancy?
ZFS can detect corruption using checksums. Repair requires redundancy (mirror/RAIDZ or copies) or an external trusted source (backup). Detection without repair is still valuable; it tells you what not to trust.
7) Why does import take so long on large pools?
Import may probe labels across many devices and retry I/O when devices respond slowly. If a single drive is flapping, import time can balloon because the kernel keeps trying to be helpful.
8) What if the pool imports but my applications still fail?
Separate storage health from application correctness. Check dataset mountpoints, encryption keys, and whether services expect specific paths/permissions. Then validate application-level consistency.
9) Is it safe to run a scrub immediately after recovery?
Usually, but timing matters. If hardware is still unstable, a scrub can amplify failures. Stabilize hardware, ensure redundancy is restored, then scrub to confirm integrity.
10) How do I prevent “pool imported elsewhere” problems?
Architecturally: don’t present the same disks to multiple hosts unless you have a designed, tested multi-host strategy. Operationally: enforce fencing and use stable host identity practices.
Conclusion: next steps that reduce repeat incidents
If your ZFS pool won’t import, the calm path is: stabilize devices, scan with stable identifiers, import read-only first, then take controlled risks only when the evidence supports it.
Practical next steps I’d actually assign after the incident:
- Standardize on by-id/WWN vdev paths for all pools. Fix the odd legacy pool before it becomes an outage.
- Add a pre-maintenance check that records zpool status, WWN inventory, and controller firmware versions.
- Review write cache, power protection, and any “performance optimizations” that change failure behavior.
- Rehearse recovery in a lab: import with missing devices, test rewind dry-runs, practice encryption key workflows.
- Make backups boring and verified. ZFS is resilient, not magical.
When you do this enough times, the scary part stops being the commands. It becomes the human tendency to rush. Don’t. ZFS rewards patience and punishes improvisation.