You reboot a Proxmox node, or move disks to new hardware, and ZFS greets you with the kind of message that makes coffee taste like regret:
cannot import pool. Your VMs are down, your storage is “not found,” and your change window is turning into a meeting.
This guide is for that exact moment. It’s written for people who run production systems and want the fastest safe path from “pool won’t import”
to “pool imported and consistent,” with minimal heroics and maximum evidence.
What “cannot import pool” really means in Proxmox
Proxmox is not doing anything mystical here. It’s a Debian-based host with ZFS tools. If the pool won’t import,
Proxmox can’t mount datasets, can’t present ZVOLs, and your storage definition in the GUI turns into a sad, inert label.
The import process is ZFS reading labels from the vdevs, reconstructing pool state from the most recent transaction group (TXG),
and ensuring it can safely open the pool with the current host’s identity and device paths.
When it fails, it’s almost always one of five categories:
- Device visibility problems: disks aren’t there (or not stable), wrong by-id paths, HBA/firmware weirdness, multipath conflicts.
- Metadata safety checks: multihost/hostid mismatch, stale imports, pool thinks it’s still active elsewhere.
- Corruption or incomplete writes: dirty shutdowns, bad blocks, failing device causing unreadable labels/uberblocks.
- Feature/property mismatch: old userspace vs pool feature flags, encryption keys not loaded, unsupported options.
- Proxmox boot/import sequencing: initramfs, zfs-import service ordering, cachefile drift, race conditions at boot.
The job is to identify which bucket you’re in using evidence, then choose the least-dangerous import method that gets you back to service.
Start with read-only imports, avoid “force” unless you can explain to another engineer why it’s safe, and don’t “just try stuff” on the only copy of your data.
Fast diagnosis playbook (first/second/third checks)
0. Stabilize: stop the bleeding
- If this is shared storage or potentially visible to multiple hosts (SAS shelf, iSCSI LUNs, FC), ensure only one host can touch it. Pull cables or disable targets if you must.
- Freeze automation: stop cron jobs, backup jobs, and anything that might keep probing devices.
- If you suspect failing disks, avoid reboot loops. Each boot is another dice roll.
1. First check: are the vdevs actually present and stable?
Most “cannot import” tickets are not “ZFS is broken.” They are “the disk path changed,” “the HBA is sulking,” or “multipath is lying.”
Confirm device visibility, and confirm it’s the same disks you think it is.
- Run zpool import and read the exact failure message.
- Check ls -l /dev/disk/by-id for the expected WWNs.
- Check dmesg -T for resets, timeouts, "I/O error," "offline device."
2. Second check: is ZFS refusing for safety (hostid/multihost/stale state)?
If ZFS thinks the pool is active on another system, it will refuse or require explicit override. This is good; it’s preventing split-brain writes.
Look for “pool may be in use from another system” and check hostid.
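If you want to confirm you are in this bucket before touching anything, two read-only commands are usually enough (Task 6 below digs deeper):
cr0x@server:~$ hostid
cr0x@server:~$ zpool import | grep -i "another system"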
3. Third check: can you import read-only and inspect?
If the pool is visible, attempt a read-only import. This is the closest thing ZFS has to “safe mode.”
If read-only fails, you’re likely dealing with missing devices, severe corruption, unsupported features, or encryption key issues.
If you do nothing else: collect outputs for zpool import -d /dev/disk/by-id (plus -D if the pool may have been destroyed), zdb -l on each device, and journalctl -u zfs-import-cache -u zfs-import-scan.
Those three usually tell you which direction to walk.
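A minimal evidence-collection pass might look like the sketch below; the output file names are arbitrary, and the wwn-* glob assumes SAS/SATA disks with WWN symlinks (adjust for NVMe):
cr0x@server:~$ zpool import -d /dev/disk/by-id > /root/evidence-import.txt 2>&1
cr0x@server:~$ for d in /dev/disk/by-id/wwn-*; do echo "== $d =="; zdb -l "$d"; done > /root/evidence-labels.txt 2>&1
cr0x@server:~$ journalctl -b -u zfs-import-cache -u zfs-import-scan --no-pager > /root/evidence-journal.txt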
Interesting facts & context (why ZFS behaves like this)
- ZFS pools are self-describing: the configuration lives on each vdev label, not in a single fragile config file.
- Import is a consensus process: ZFS chooses the most recent consistent TXG across devices; it’s not just “mount and hope.”
- Hostid exists to prevent split brain: ZFS can record the last host that imported a pool and block unsafe imports when multihost=on.
- Feature flags replaced version numbers: modern pools advertise features; older tools may refuse import if they can't understand them.
- ZFS prefers stable device IDs: /dev/sdX is an opinion, not a fact; /dev/disk/by-id is the adult choice.
- The "cachefile" is an optimization: it speeds imports, but stale cachefiles can mislead boot-time imports after hardware changes.
- OpenZFS diverged from Solaris ZFS: Proxmox uses OpenZFS on Linux, which brings its own boot integration and service ordering quirks.
- Scrubs are not backups: scrubs find and heal silent corruption (when redundancy exists), but they don’t protect you from “rm -rf” or encryption key loss.
Root causes: the usual suspects, with symptoms
1) Disks missing, renamed, or intermittently dropping
Symptoms: cannot import 'pool': no such pool available, or one or more devices is currently unavailable.
In Proxmox, this often follows a reboot, HBA swap, firmware update, moving shelves, or “we cleaned up cabling.”
Reality: ZFS labels are on the disks, but Linux might not be presenting them consistently, or udev names changed,
or multipath is creating duplicate nodes. The pool may be importable with -d /dev/disk/by-id or after fixing device discovery.
2) Pool appears “in use” on another system (hostid / multihost / stale import)
Symptoms: pool may be in use from another system. This is common after cloning boot drives, restoring Proxmox OS images,
or importing a pool that was last used on a different node.
Reality: ZFS is trying to avoid a situation where two machines write to the same pool at once. If you’re absolutely sure only one host has access,
you can import with -f. If you’re not sure, stop here and prove it.
3) Encryption key not available (native ZFS encryption)
Symptoms: import succeeds but datasets won’t mount, or import fails when trying to mount with key-related errors.
Proxmox may show storage as present but unusable.
Reality: the pool can import, but datasets remain locked until keys are loaded. Sometimes people misread this as “pool won’t import”
because their VMs don’t start. Different failure, different fix.
4) Unsupported feature flags or version mismatch
Symptoms: import refuses with messages about unsupported features, or “pool uses the following feature(s) not supported by this system.”
Reality: you upgraded the pool on a newer OpenZFS, then tried importing on an older Proxmox node or rescue environment.
The solution is usually to boot a modern environment with matching OpenZFS, not to “downgrade the pool” (you can’t).
5) Metadata damage / failing device / unreadable labels
Symptoms: I/O errors during import, cannot open on specific vdevs, bad label, repeated retries, or the pool shows in zpool import but fails on actual import.
Reality: one or more devices can’t read critical metadata. Redundancy helps. Without redundancy, you’re in recovery-land: try read-only, try importing with missing devices,
and be ready to copy data off fast.
6) Boot-time import races and stale cachefiles
Symptoms: pool imports manually but not at boot; Proxmox GUI storage “missing” after reboot until you run zpool import.
Reality: ZFS import services ran before devices were ready, or cachefile points to old paths. The pool is fine; the boot sequence is not.
Practical tasks: commands, outputs, decisions (12+)
These are the tasks I actually run in incidents. Each one includes: the command, a realistic output snippet, what it means, and the decision you make.
Run them as root (or with sudo) on the Proxmox node or a rescue shell with ZFS tools installed.
Task 1: See what ZFS thinks is importable
cr0x@server:~$ zpool import
pool: rpool
id: 1234567890123456789
state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and the '-f' flag.
config:
rpool ONLINE
sda3 ONLINE
What it means: ZFS can see the pool and its vdev(s). The blocker is “last accessed by another system.”
Decision: Before you use -f, prove only this host can write to those disks (especially if this is shared storage). If single-host local disks, proceed with a forced import.
Task 2: Scan an explicit device directory (add -D separately if you need to list destroyed pools)
cr0x@server:~$ zpool import -d /dev/disk/by-id
pool: tank
id: 9876543210987654321
state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-3C
config:
tank UNAVAIL missing device
raidz1-0 UNAVAIL
wwn-0x5000cca2abcd0001 ONLINE
wwn-0x5000cca2abcd0002 ONLINE
wwn-0x5000cca2abcd0003 UNAVAIL
What it means: ZFS has a clear expectation of the missing member. It’s not a naming issue; it’s absent/unreadable.
Decision: Go hunt that device at the hardware layer. If it’s truly gone and you have redundancy, you may import degraded. If you don’t have redundancy, prioritize imaging/copying whatever you can.
Task 3: Confirm the kernel is seeing storage errors
cr0x@server:~$ dmesg -T | egrep -i "zfs|sd |sas|nvme|ata|reset|timed out|I/O error" | tail -n 12
[Thu Dec 26 10:12:09 2025] sd 2:0:3:0: timing out command, waited 180s
[Thu Dec 26 10:12:09 2025] sd 2:0:3:0: [sdd] tag#24 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Thu Dec 26 10:12:10 2025] blk_update_request: I/O error, dev sdd, sector 2048 op 0x0:(READ)
[Thu Dec 26 10:12:12 2025] ata4: hard resetting link
[Thu Dec 26 10:12:18 2025] sd 2:0:3:0: [sdd] Synchronizing SCSI cache
What it means: This is not ZFS being dramatic. The OS can’t read reliably from a device.
Decision: Stop trying random import flags. Stabilize the hardware path (replace disk/cable/HBA port) or expect partial recovery at best.
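If smartmontools is installed (it usually is on a Proxmox node), it is worth asking the suspect disk directly before deciding anything; /dev/sdd here is the device from the dmesg output above.
cr0x@server:~$ smartctl -H -A /dev/sdd
Reallocated or pending sectors, or a failed health check, turn this from an import problem into a disk-replacement problem.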
Task 4: Verify you’re using stable device identifiers
cr0x@server:~$ ls -l /dev/disk/by-id | egrep "wwn|nvme|ata" | head
lrwxrwxrwx 1 root root 9 Dec 26 10:10 wwn-0x5000cca2abcd0001 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 10:10 wwn-0x5000cca2abcd0002 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 26 10:10 wwn-0x5000cca2abcd0003 -> ../../sdd
lrwxrwxrwx 1 root root 13 Dec 26 10:10 nvme-SAMSUNG_MZVL21T0HCLR-00000_S6Z7... -> ../../nvme0n1
What it means: You have WWN-based symlinks; that’s what ZFS wants for consistent imports.
Decision: When importing manually, point ZFS to /dev/disk/by-id using zpool import -d. If your pool was created using /dev/sdX, plan a remediation (export/import or config update) once stable.
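A remediation sketch for the /dev/sdX case, assuming the pool is healthy and you have a maintenance window (the pool name is an example):
cr0x@server:~$ zpool export tank
cr0x@server:~$ zpool import -d /dev/disk/by-id tank
cr0x@server:~$ zpool status tank | grep wwn
After the re-import, zpool status should list by-id names instead of sdX, and the cachefile records the stable paths.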
Task 5: Identify which disks actually contain ZFS labels
cr0x@server:~$ for d in /dev/disk/by-id/wwn-*; do echo "== $d =="; zdb -l $d 2>/dev/null | egrep "name:|pool_guid|vdev_guid|state:"; done | head -n 20
== /dev/disk/by-id/wwn-0x5000cca2abcd0001 ==
name: 'tank'
state: 0
pool_guid: 9876543210987654321
== /dev/disk/by-id/wwn-0x5000cca2abcd0002 ==
name: 'tank'
state: 0
pool_guid: 9876543210987654321
== /dev/disk/by-id/wwn-0x5000cca2abcd0003 ==
name: 'tank'
state: 1
pool_guid: 9876543210987654321
What it means: These devices really belong to the pool, and one label is out of step with the others (in zdb -l output, state: 0 is an active pool, while state: 1 generally means the label was last written at export; a lone disagreeing label is a warning sign).
Decision: If ZFS sees labels on the expected devices, the issue is likely availability/safety checks, not “wrong disks.” If a disk has no label, you might be looking at the wrong LUN or wrong shelf slot.
Task 6: Check hostid and multihost properties to explain “in use”
cr0x@server:~$ hostid
7f01000a
cr0x@server:~$ zpool get multihost tank
NAME PROPERTY VALUE SOURCE
tank multihost on local
What it means: multihost=on makes ZFS more strict about concurrent access safety. Hostid changes (cloned OS images) can trigger warnings.
Decision: If this pool is truly single-host local storage, consider setting multihost=off later. For now, confirm exclusivity and import with -f only when you can defend it.
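If, and only if, you have proven this is single-host local storage, the later cleanup can be as small as the sketch below (the pool name is an example; leave multihost=on for anything shared-capable):
cr0x@server:~$ zpool set multihost=off tank
cr0x@server:~$ zpool get multihost tank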
Task 7: Attempt a read-only import (safe inspection mode)
cr0x@server:~$ zpool import -o readonly=on -N -d /dev/disk/by-id tank
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
wwn-0x5000cca2abcd0001 ONLINE 0 0 0
wwn-0x5000cca2abcd0002 ONLINE 0 0 0
wwn-0x5000cca2abcd0003 UNAVAIL 0 0 0 cannot open
What it means: Pool can be opened without mounting datasets (-N) and without writes (readonly=on).
Decision: This is a green light to proceed carefully: mount datasets read-only and copy critical data off, or replace the missing disk then resilver.
Task 8: Import with missing log device (if you had a separate SLOG)
cr0x@server:~$ zpool import -m -d /dev/disk/by-id tank
cannot import 'tank': one or more devices is currently unavailable
cr0x@server:~$ zpool import -m -o readonly=on -d /dev/disk/by-id tank
cr0x@server:~$ zpool status tank | egrep -i "logs|missing|unavail"
logs
wwn-0x5000cca2beef0009 UNAVAIL cannot open
What it means: The missing device is a log vdev. ZFS can often import without it using -m (missing log) if the main vdevs are intact.
Decision: Import without the log, accept performance degradation, and plan to remove/replace the log vdev properly once stable.
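Once the pool is imported read-write and the bus is stable, removing the dead log vdev is usually a one-liner; the device name below is the one zpool status reported, so substitute yours.
cr0x@server:~$ zpool remove tank wwn-0x5000cca2beef0009
cr0x@server:~$ zpool status tank | grep -A 2 logs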
Task 9: Inspect ZFS import services and boot-time failures
cr0x@server:~$ systemctl status zfs-import-cache zfs-import-scan --no-pager
● zfs-import-cache.service - Import ZFS pools by cache file
Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled)
Active: failed (Result: exit-code) since Thu 2025-12-26 10:01:21 UTC; 3min ago
Process: 812 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
● zfs-import-scan.service - Import ZFS pools by device scanning
Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; enabled)
Active: inactive (dead)
What it means: Cache-based import failed, and scan-based import didn’t run (or isn’t enabled). Classic “pool imports manually, not at boot.”
Decision: Fix the cachefile or enable scan import. Also investigate why devices weren’t ready when cache import ran.
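A hedged fix sketch for local disks that are merely slow to appear: enable the scan-based import as a fallback and refresh the cachefile once the pool is imported (the pool name is an example).
cr0x@server:~$ systemctl enable zfs-import-scan.service
cr0x@server:~$ zpool set cachefile=/etc/zfs/zpool.cache tank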
Task 10: Read the journal for ZFS import hints (it’s usually blunt)
cr0x@server:~$ journalctl -u zfs-import-cache -u zfs-import-scan -b --no-pager | tail -n 40
Dec 26 10:01:20 server zpool[812]: cannot import 'tank': one or more devices is currently unavailable
Dec 26 10:01:20 server zpool[812]: Destroy and re-create the pool from a backup source.
Dec 26 10:01:20 server systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Dec 26 10:01:21 server systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
What it means: The import failed for the same reason you see manually. It’s not “systemd drama”; it’s device availability.
Decision: Go back to device enumeration and hardware errors. If import works manually after a delay, you likely have a timing issue at boot.
Task 11: Check for multipath conflicts (common with SAN/iSCSI/FC)
cr0x@server:~$ multipath -ll | head -n 25
mpatha (3600508b400105e210000900000490000) dm-2 IBM,2145
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 4:0:0:1 sdc 8:32 active ready running
What it means: The same LUN is visible via multiple paths and is correctly aggregated into /dev/dm-2 as mpatha.
Decision: Ensure ZFS uses the multipath devices consistently (the /dev/mapper or dm-uuid-mpath by-id names) and that you are not accidentally importing via the raw /dev/sdb or /dev/sdc paths. Duplicate visibility can cause corruption if mishandled.
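When importing such a pool manually, one way to keep ZFS away from the raw paths is to point it only at the multipath nodes; this is a sketch assuming standard dm-multipath naming under /dev/mapper.
cr0x@server:~$ zpool import -d /dev/mapper tank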
Task 12: Verify pool feature compatibility (especially on rescue environments)
cr0x@server:~$ zpool import -o readonly=on -N tank
cannot import 'tank': unsupported feature(s)
This pool uses the following feature(s) not supported by this system:
com.datto:encryption
org.zfsonlinux:project_quota
What it means: Your current userspace/kernel ZFS can’t understand the pool. This is common when booting an older live ISO.
Decision: Boot a Proxmox version (or rescue environment) with a modern OpenZFS that supports those features. Do not attempt “repairs” from an incompatible environment.
Task 13: Load encryption keys and mount datasets (the “pool imported but nothing works” case)
cr0x@server:~$ zpool import -N -d /dev/disk/by-id tank
cr0x@server:~$ zfs get -r -o name,property,value keystatus tank | head
NAME PROPERTY VALUE
tank/secure keystatus unavailable
tank/vmdata keystatus available
cr0x@server:~$ zfs load-key -r tank/secure
Enter passphrase for 'tank/secure':
cr0x@server:~$ zfs mount -a
cr0x@server:~$ zfs list -r -o name,mounted,mountpoint tank/secure | head -n 3
NAME MOUNTED MOUNTPOINT
tank/secure yes /tank/secure
What it means: Pool import is fine. Your problem was locked datasets.
Decision: Fix key management (prompting at boot vs keyfiles vs external KMS). Don’t chase phantom “import failures” in the GUI.
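One key-management sketch, assuming you settle on a root-only keyfile instead of boot-time passphrase prompts; the key path and dataset name are examples, and the file must contain whatever the dataset's keyformat expects.
cr0x@server:~$ zfs get keyformat,keylocation,keystatus tank/secure
cr0x@server:~$ zfs set keylocation=file:///root/keys/tank-secure.key tank/secure
cr0x@server:~$ zfs load-key tank/secure && zfs mount -a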
Task 14: Confirm the pool isn’t already imported (yes, it happens)
cr0x@server:~$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 5.45T 3.12T 2.33T - - 18% 57% 1.00x ONLINE -
cr0x@server:~$ pvesm status | head
Name Type Status Total Used Available %
local dir active 102399872 4823424 972... 4%
What it means: The pool is already imported and healthy. The complaint might be a dataset mount issue, a storage config mismatch, or a Proxmox storage definition pointing elsewhere.
Decision: Shift from “import” to “why isn’t Proxmox using it”: check dataset mountpoints, /etc/pve/storage.cfg, and permissions.
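For reference, a ZFS-backed entry in /etc/pve/storage.cfg usually looks roughly like the sketch below; the storage ID and dataset are invented, so compare them against what zfs list actually reports on your node.
zfspool: vmdata
        pool tank/vmdata
        content images,rootdir
        sparse 1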
Task 15: Attempt a forced import (only after proving exclusivity)
cr0x@server:~$ zpool import -f -d /dev/disk/by-id rpool
cr0x@server:~$ zpool status rpool
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
sda3 ONLINE 0 0 0
What it means: The “in use” warning was cleared by force import.
Decision: If this pool is ever visible to another host, fix the underlying fencing/ownership model. Force import is a tool, not a lifestyle.
Recovery options ranked by risk
Tier 0: Don’t make it worse
- Avoid zpool clear as a reflex. It clears errors; it doesn't fix the cause.
- Avoid writing to a pool you haven't validated. Import read-only first when uncertain.
- Avoid zpool labelclear unless you are intentionally destroying labels (and you're sure you're on the right disks).
- Avoid "try random flags until it works." That's how you turn a recoverable pool into a forensic artifact.
Tier 1: Safe imports and inspections
- Read-only import: zpool import -o readonly=on -N ... lets you inspect without changing on-disk state.
- No-mount import: -N prevents dataset mounts so you can fix mountpoints or keys deliberately.
- Explicit device directory: -d /dev/disk/by-id avoids device name roulette.
Tier 2: Controlled overrides for known scenarios
- Force import (-f): only when you've proven the pool is not imported elsewhere and you control all paths.
- Missing log (-m): if a dedicated SLOG is gone. Import, then remove/replace it properly.
- Import with an alternate root (-R /mnt): mount datasets under a temporary root for rescue copying without disturbing normal mountpoints (see the sketch after this list).
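A combined Tier 1/Tier 2 rescue import might look like this, assuming a pool named tank, a dataset you care about called tank/vmdata, and a scratch directory at /mnt/rescue:
cr0x@server:~$ zpool import -o readonly=on -N -R /mnt/rescue -d /dev/disk/by-id tank
cr0x@server:~$ zfs mount tank/vmdata
cr0x@server:~$ ls /mnt/rescue/tank/vmdata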
Tier 3: Degraded imports and data evacuation
- Import degraded when redundancy exists and one device is missing. Import read-only if you’re not sure the remaining disks are healthy.
- Copy out critical VM disks and configs immediately. If a disk is failing, resilvering is stress; copying is also stress; choose the path that gets data off fastest.
Tier 4: Last resorts
- Extreme recovery using ZFS debugging tools (zdb in deeper modes) can sometimes recover information, but at this point you should treat the pool as unstable and prioritize getting data out.
- Professional recovery for irreplaceable data with no redundancy. If you're thinking "maybe we can just keep trying," you're probably about to erase the last readable metadata.
One paraphrased idea from Werner Vogels: “Everything fails all the time.” ZFS is built around this. Your operational practices should be, too.
Joke #1: ZFS doesn’t “lose” your disks; it just forces you to finally learn which cable goes where.
Common mistakes: symptom → root cause → fix
1) “No pools available to import” after moving disks
Symptom: zpool import shows nothing.
Root cause: You’re scanning the wrong device namespace (e.g., behind multipath), or the HBA isn’t presenting the disks, or you booted a kernel missing the driver.
Fix: Verify disks with lsblk, check dmesg for driver attach, scan with zpool import -d /dev/disk/by-id, and resolve multipath to a single canonical device set.
2) “Pool may be in use from another system” and people reach for -f blindly
Symptom: Import asks for -f.
Root cause: Shared access is possible (SAN, shared JBOD, accidental dual-path). Or hostid changed due to cloning.
Fix: Prove exclusivity (physically or via target-side access control). Then use zpool import -f. If hostid is unstable, fix /etc/hostid and avoid OS image cloning without regenerating it.
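OpenZFS ships zgenhostid for exactly this. A minimal sketch that pins the currently reported hostid into /etc/hostid so it survives the next clone or rebuild:
cr0x@server:~$ zgenhostid -f "$(hostid)"
cr0x@server:~$ hostid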
3) Pool imports, but Proxmox storage still looks dead
Symptom: zpool list shows pool online, yet GUI storage is “inactive” or VMs won’t start.
Root cause: Datasets not mounted, mountpoints changed, encryption keys not loaded, or Proxmox storage.cfg points to a dataset that no longer exists.
Fix: zfs mount -a, check zfs get mountpoint,mounted, load keys, and validate /etc/pve/storage.cfg against real datasets/ZVOLs.
4) Import fails only at boot, but manual import works
Symptom: After reboot, pool missing until you run a command.
Root cause: Device discovery is late (HBA firmware init, iSCSI login, multipath), and cache-based import runs too early or points to stale paths.
Fix: Enable scan import, fix iSCSI/multipath ordering, regenerate /etc/zfs/zpool.cache by exporting/importing cleanly, and ensure services depend on storage readiness.
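One way to express the ordering, as a sketch only: a systemd drop-in that makes the cache-based import wait for multipath and iSCSI. Unit names vary by packaging, so verify them with systemctl list-units before copying this.
cr0x@server:~$ mkdir -p /etc/systemd/system/zfs-import-cache.service.d
cr0x@server:~$ cat <<'EOF' > /etc/systemd/system/zfs-import-cache.service.d/wait-for-storage.conf
[Unit]
After=multipathd.service iscsid.service
Wants=multipathd.service iscsid.service
EOF
cr0x@server:~$ systemctl daemon-reload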
5) “Unsupported feature(s)” in rescue mode
Symptom: Pool visible but refuses import with feature list.
Root cause: Rescue environment uses older OpenZFS than the pool requires.
Fix: Use a modern Proxmox kernel/userspace matching the pool’s features. Don’t “upgrade the pool” on one host without planning for recovery environments.
6) Degraded import attempted on a non-redundant pool
Symptom: Single-disk pool missing its only disk, or mirror missing one side with silent corruption.
Root cause: No redundancy or both sides impacted; ZFS can’t conjure blocks from physics.
Fix: Hardware-level recovery (bring device back), then immediate data evacuation. If it’s truly gone, restore from backup. If there’s no backup, the fix is “learn and budget.”
Checklists / step-by-step plan
Checklist A: “Pool won’t import” triage in 15 minutes
- Confirm exclusivity: make sure no other host can see the disks/LUNs.
- Run zpool import and capture the exact message.
- Verify device presence: lsblk -o NAME,SIZE,MODEL,SERIAL and ls -l /dev/disk/by-id.
- Scan for kernel errors: dmesg -T | tail and look for timeouts/resets.
- Try a read-only, no-mount import: zpool import -o readonly=on -N -d /dev/disk/by-id POOL.
- If it imports: check zpool status and decide "resilver vs copy data off first."
- If it doesn't: check feature flags, missing devices, and hostid/multihost signals.
Checklist B: Recovery workflow when a vdev is missing but redundancy exists
- Import read-only first to validate the remaining devices are stable.
- Collect zpool status -v and error counters; if they climb while the pool is idle, hardware is still sick.
- Replace/restore the missing disk path (swap disk, reseat, fix multipath).
- Import read-write and begin resilver when you’re confident the bus is stable.
- Monitor resilver progress and system logs for resets/timeouts (a minimal watch loop is sketched after this list).
- After resilver, scrub.
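A minimal way to watch the resilver without babysitting the console, assuming the pool is named tank:
cr0x@server:~$ watch -n 60 'zpool status tank | grep -E "scan:|resilver|errors:"'
cr0x@server:~$ zpool scrub tank    # once the resilver has completed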
Checklist C: Boot-time import failures
- Confirm manual import works consistently.
- Check systemctl status zfs-import-cache zfs-import-scan.
- Check ordering for iSCSI/multipath (if used): devices must exist before ZFS import runs.
- Regenerate cachefile by clean export/import during a maintenance window.
- Reboot once. If it works, reboot again. Don’t declare victory after one boot.
Joke #2: “It worked after a reboot” is not a fix; it’s a plot twist.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran Proxmox on two nodes with a shared SAS shelf. Someone labeled it “like a SAN but cheaper,” which is how you can tell it will become expensive later.
They had ZFS on top of the shared shelf because it “worked fine in testing.”
One afternoon, a node rebooted after a kernel update. The pool didn’t auto-import, and the on-call engineer did what every impatient human does:
they ran zpool import -f because the tool suggested it. The pool came up. VMs started. Everyone exhaled.
The other node, which still had physical access to the shelf, was also configured to import the same pool for a previous experiment.
It had not imported it at that moment, but it was still running monitoring and udev rules that occasionally poked the devices.
A few hours later, a scheduled task triggered an import attempt there as well, and it also used -f, because that’s what the runbook said.
They got a classic split-brain write scenario. ZFS did its best, but no filesystem can reconcile two writers to the same blocks without coordination.
The corruption wasn’t immediate; it manifested as “random” VM disk errors and a slow-motion unraveling of trust.
The fix wasn’t clever. It was governance: fencing, removing shared visibility, and stopping the myth that ZFS is a clustered filesystem. It is not.
They rebuilt the pool from backups. The real change was political: “prove exclusivity” became a non-negotiable step, not a suggestion.
Mini-story 2: The optimization that backfired
Another team ran Proxmox with ZFS over iSCSI LUNs. It can work, but it’s an agreement with the devil: you must manage multipath consistently and avoid device churn.
They wanted faster imports and faster boots, so they tuned things: pinned imports to a cachefile, disabled scan import, and trimmed “unnecessary” boot delays.
It was fine until a routine SAN controller failover made LUN paths appear a few seconds later than usual during boot.
ZFS import-cache ran early, didn’t see the expected devices, and failed. Because scan import was disabled, nothing retried.
Proxmox came up with no storage. The cluster thought the node was alive but empty-handed, which is a fun combination.
The on-call did a manual import and everything worked. That’s the trap: manual fixes make you think the underlying system is reliable.
But it kept happening on every cold boot, and only on days when the SAN felt like being dramatic.
The eventual fix was boring: enable scan import as a fallback, add proper dependencies so iSCSI login and multipath settled before ZFS import,
and regenerate the cachefile after confirming stable by-id naming. Boot became a few seconds slower. Uptime became a lot less exciting.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company ran Proxmox on mirrored boot drives and a separate ZFS pool for VM storage. Nothing exotic: HBAs in IT mode, disks with WWNs,
datasets named like adults name things, and monthly scrub reports sent to a mailbox nobody read—until they did.
One morning after a power event, a node refused to import the VM pool. The error was “one or more devices is currently unavailable.”
Instead of flailing, they followed their own checklist: confirm device inventory, check dmesg, try read-only import.
The pool imported degraded; one disk was truly gone.
The key detail: because they scrubbed regularly, they already knew the remaining disks were clean as of last week.
That changed the decision calculus. They imported read-write, replaced the failed disk, resilvered, and then scrubbed again.
VM downtime was measured in a maintenance window, not in a career change.
The postmortem was almost boring. “Disk failed; redundancy worked; scrub history confirmed health; replacement completed.”
That’s what you want in operations: a story nobody wants to hear because nothing exploded.
FAQ
1) Should I always use zpool import -f if ZFS suggests it?
No. Use -f only when you have proven the pool is not imported elsewhere and no other host can write to those devices.
On shared storage, force-import is how you buy corruption with urgency.
2) What does “no pools available to import” usually mean on Proxmox?
ZFS scanning didn’t find labels. That’s typically missing drivers, missing devices, scanning the wrong namespace (multipath vs raw),
or you’re simply not looking at the disks you think you are.
3) If I can import read-only, does that guarantee my pool is safe?
It’s a strong sign that core metadata is readable and coherent, but it doesn’t guarantee future stability.
Hardware can degrade under load. Use read-only import to inspect, then decide whether to resilver or evacuate data first.
4) Why does Proxmox sometimes fail to import at boot but works manually?
Boot sequencing. Devices appear late (HBA init, iSCSI login, multipath settling), while ZFS import services run early.
Fix ordering and fallback mechanisms; don’t rely on “someone runs the command.”
5) Can I “downgrade” a ZFS pool to import it on an older Proxmox?
No. If the pool has feature flags unsupported by the older environment, you need a newer environment to import it.
Plan upgrades with recovery tooling in mind.
6) Pool imported but datasets didn’t mount. Is that an import failure?
Usually not. It’s a mount/key/mountpoint problem. Check zfs mount, zfs get mounted,mountpoint, and encryption keystatus.
Proxmox storage can look “down” when it’s really “locked” or “not mounted.”
7) Is it safe to import a pool with missing devices?
If redundancy exists and ZFS reports sufficient replicas, it can be safe enough to proceed—especially read-only at first.
If redundancy does not exist, “missing device” often means “missing data.”
8) Does clearing errors with zpool clear help import problems?
Rarely. It clears error counters and may re-enable a device after transient faults, but it doesn’t fix timeouts, cabling,
failing drives, multipath duplication, or unsupported features.
9) What’s the best device path strategy for Proxmox ZFS pools?
Use /dev/disk/by-id (WWN or NVMe serial-based) consistently. Avoid /dev/sdX.
If you use SAN/multipath, ensure ZFS uses a single canonical set of devices (typically dm-id) and never both raw and multipath nodes.
10) When is zpool import -m appropriate?
When a separate log device (SLOG) is missing and preventing import. It allows import without that log.
You’ll want to remove/replace the missing log vdev afterward to avoid repeated warnings and weirdness.
Conclusion: next steps you can do today
“Cannot import pool” is not a single problem. It’s a symptom family. The fastest path out is disciplined triage:
verify devices, read the exact ZFS complaint, import read-only when uncertain, and only escalate to force/degraded imports with a clear safety argument.
Practical next steps that reduce future pain:
- Standardize on /dev/disk/by-id naming for all pools and document the mapping from WWN to physical slot.
- Make "prove exclusivity" mandatory before -f on any shared-capable storage.
- Test recovery using a modern environment that can import your pool's feature set (especially encryption).
- Enable and monitor scrubs, and treat scrub reports as operational signals, not background noise.
- Fix boot ordering if you rely on iSCSI/multipath; manual imports are not an SLO.
When you’re in the incident: collect evidence first, then act. ZFS will usually tell you what’s wrong. You just have to listen longer than your adrenaline wants.