Nothing says “fun weekend” like booting a Proxmox node and discovering your shiny new disks have ghosted you. The installer shows nothing. lsblk is a desert. ZFS pools vanish. You swear the drives were there yesterday.
This is a field checklist for production humans: storage engineers, SREs, and the unlucky on-call who inherited a “simple” disk expansion. We’ll hunt the failure domain fast: BIOS/UEFI, HBA firmware and mode, PCIe, cabling/backplanes/expander weirdness, Linux drivers, and the gotchas that make disks “present” but invisible.
Fast diagnosis playbook (do this in order)
0) Decide what “not detected” means
- Not in BIOS/UEFI: hardware, power, cabling, backplane, HBA/PCIe enumeration.
- In BIOS but not in Linux: kernel driver/module, IOMMU quirks, broken firmware, PCIe AER errors.
- In Linux but not in Proxmox UI: wrong screen, existing partitions, multipath masking, ZFS holding devices, permissions, or it's under /dev/disk/by-id but not obvious.
1) Start with the kernel’s truth
Run these three and don’t improvise yet:
- dmesg -T | tail -n 200 (look for PCIe, SAS, SATA, NVMe, link resets)
- lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,HCTL (see what the kernel created)
- lspci -nn | egrep -i 'sas|raid|sata|nvme|scsi' (confirm the controller exists)
Decision: If the controller isn’t in lspci, stop blaming Proxmox. It’s BIOS/PCIe seating/lane allocation or the card is dead.
2) If the controller exists, check the driver and link
- lspci -k -s <slot> → verify "Kernel driver in use".
- journalctl -k -b | egrep -i 'mpt3sas|megaraid|ahci|nvme|reset|timeout|aer' → find the smoking gun.
Decision: No driver bound? Load the module or fix firmware/BIOS settings. Link resets/timeouts? Suspect cabling/backplane/expander/power.
3) Rescan before you reboot
Rescan SCSI/NVMe. If disks appear after a rescan, you’ve learned something: hotplug, link training, or boot timing.
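A minimal rescan sketch, assuming a stock Proxmox/Debian kernel and standard sysfs paths (no vendor tooling):

# Rescan every SCSI host (SAS/SATA targets behind HBAs and onboard ports)
for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done
# Ask the kernel to re-enumerate PCIe devices (can pick up NVMe that missed link training)
echo 1 > /sys/bus/pci/rescan
# Then see what changed
lsblk -e7 -o NAME,SIZE,MODEL,TRAN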
4) If disks appear but “missing” in Proxmox UI
Go to the CLI and use stable IDs. The UI isn’t lying; it’s just not your incident commander.
Decision: If they exist in /dev/disk/by-id but not in your pool, it’s a ZFS/import/partitioning story, not a detection story.
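A quick triage sketch for that case; this is read-only, since zpool import with no arguments only scans and reports, it doesn't import anything:

# Stable identifiers the kernel created for the disks
ls -l /dev/disk/by-id/ | egrep -i 'wwn|scsi|nvme'
# Pools ZFS could import but hasn't (scan only, changes nothing)
zpool import
# Pools already imported and the devices they hold
zpool status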
A practical mental model: where disks can disappear
Disk detection is a chain. Break any link and you’ll stare at an empty list.
Layer 1: Power and physical connectivity
Drive needs power, correct connector, and a backplane that isn’t doing interpretive dance. “Spins up” is not the same as “data link established.” SAS especially will happily power a drive while the link is down due to a bad lane.
Layer 2: Interposer/backplane/expander translation
SAS backplanes can include expanders, multiplexers, and “helpful” logic. A single marginal lane can drop a disk, or worse, make it flap under load. SATA behind SAS expanders works—until it doesn’t, depending on the expander, drive firmware, and cabling.
Layer 3: HBA/controller firmware and mode
HBAs can run as real HBAs (IT mode) or pretend RAID controllers (IR/RAID mode). Proxmox + ZFS wants boring pass-through. RAID personality can hide drives behind virtual volumes, block SMART, and complicate error recovery.
Layer 4: PCIe enumeration and lane budget
The controller itself is a PCIe device. If the motherboard doesn’t enumerate it, Linux can’t either. PCIe bifurcation settings, slot wiring, and lane sharing with M.2/U.2 can quietly make a slot “physical x16” but electrically x4—or x0, if you anger the lane gods.
Layer 5: Linux kernel drivers + device node creation
Even when the hardware is fine, the kernel might not bind the correct driver, or udev might not create nodes the way you expect. Multipath can intentionally hide individual paths. Old initramfs can miss modules. The disks might exist but under different names.
Layer 6: Proxmox storage presentation
Proxmox VE is Debian under a UI. If Debian can’t see it, Proxmox can’t. If Debian can see it but the UI doesn’t show it where you’re looking, that’s a workflow problem, not a hardware problem.
Paraphrased idea from John Allspaw: reliability comes from responding well to failure, not pretending failure won’t happen.
Joke #1: “RAID mode will make ZFS happy” is like saying “I put a steering wheel on the toaster; now it’s a car.”
Interesting facts and history that actually help troubleshooting
- SCSI scanning is old… and still here. Modern SAS and even some SATA stacks still rely on SCSI host scans, which is why rescans can “find” drives without a reboot.
- LSI's SAS HBAs became the de facto standard in homelabs and enterprises. Broadcom/Avago/LSI lineage matters because driver naming (mpt2sas/mpt3sas) and firmware tooling assumptions follow it.
- IT mode became popular because filesystems got smarter. ZFS and similar systems want direct disk visibility. RAID controllers were built for an era where the controller owned integrity.
- SFF-8087 and SFF-8643 look like “just cables” but are signal systems. A partially-seated mini-SAS can power drives and still fail data lanes. It’s not magic; it’s differential pairs and tolerance.
- PCIe slots lie by marketing. “x16 slot” often means “x16 connector.” Electrically it might be x8 or x4 depending on CPU and board routing.
- UEFI changed option ROM behavior. Some storage cards rely on option ROMs for boot-time enumeration screens; UEFI settings can hide those screens without changing what Linux sees.
- NVMe brought its own detection path. NVMe devices aren’t “SCSI disks” and won’t show up in SAS HBA tools; they use the NVMe subsystem and PCIe link training.
- SMART passthrough is not guaranteed. With RAID controllers, SMART data may be blocked or require vendor tools, which changes how you verify “the disk exists.”
Hands-on tasks (commands + meaning + decision)
These are the tasks I actually run when a node says “no disks.” Each includes what you’re looking at and the decision you make.
Task 1: Confirm the controller is enumerated on PCIe
cr0x@server:~$ lspci -nn | egrep -i 'sas|raid|sata|scsi|nvme'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
What it means: The motherboard sees the HBA/NVMe controller. If it’s not here, Linux will never see disks behind it.
Decision: Missing device → reseat card, change slot, check BIOS PCIe settings, disable conflicting devices, verify power to risers.
Task 2: Verify kernel driver binding
cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Subsystem: Broadcom / LSI SAS9300-8i
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
What it means: The right driver is attached. If “Kernel driver in use” is blank, you’ve got a driver/firmware/blacklist problem.
Decision: No driver bound → check modprobe, kernel logs, Secure Boot, firmware compatibility, and whether you’re using a weird vendor kernel.
Task 3: See what disks Linux created (don’t trust the UI yet)
cr0x@server:~$ lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,HCTL
NAME TYPE SIZE MODEL SERIAL TRAN HCTL
sda disk 3.6T ST4000NM0035-1V4 ZC123ABC sas 3:0:0:0
sdb disk 3.6T ST4000NM0035-1V4 ZC123DEF sas 3:0:1:0
nvme0n1 disk 1.8T Samsung SSD 990 PRO S6Z1NZ0R12345 nvme -
What it means: If it’s in lsblk, the kernel sees it. TRAN tells you if it’s sas, sata, nvme.
Decision: Disks absent → move down the stack: dmesg, cabling, expander, power. Disks present but Proxmox “missing” → likely UI/workflow, multipath, or ZFS import.
Task 4: Check kernel logs for link resets/timeouts
cr0x@server:~$ journalctl -k -b | egrep -i 'mpt3sas|megaraid|ahci|nvme|reset|timeout|aer|link down' | tail -n 60
Dec 26 10:12:01 server kernel: mpt3sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Dec 26 10:12:01 server kernel: sd 3:0:1:0: rejecting I/O to offline device
Dec 26 10:12:03 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:03:00.0
Dec 26 10:12:03 server kernel: nvme nvme0: I/O 42 QID 5 timeout, aborting
What it means: “offline device”, “timeout”, “link down”, AER spam = signal integrity, power, or failing device/controller.
Decision: Timeouts on multiple drives → cable/backplane/expander/HBA. Timeouts on one drive → that drive or its slot.
Task 5: List storage controllers the kernel thinks exist
cr0x@server:~$ lsscsi -H
[0] ata_piix
[2] mpt3sas
[3] nvme
What it means: Confirms host adapters. If your HBA driver is loaded, it shows up as a host.
Decision: HBA missing here but present in lspci → driver didn’t load or failed to initialize.
Task 6: Inspect SCSI hosts and rescan for devices
cr0x@server:~$ ls -l /sys/class/scsi_host/
total 0
lrwxrwxrwx 1 root root 0 Dec 26 10:10 host0 -> ../../devices/pci0000:00/0000:00:17.0/ata1/host0/scsi_host/host0
lrwxrwxrwx 1 root root 0 Dec 26 10:10 host2 -> ../../devices/pci0000:00/0000:03:00.0/host2/scsi_host/host2
cr0x@server:~$ for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done
What it means: Forces a scan of all SCSI hosts. If disks appear after this, detection is timing/hotplug/expander behavior.
Decision: If rescans consistently “fix it,” check BIOS hotplug, staggered spin-up, expander firmware, and HBA firmware.
Task 7: Check SATA/AHCI detection (onboard ports)
cr0x@server:~$ dmesg -T | egrep -i 'ahci|ata[0-9]|SATA link' | tail -n 40
[Thu Dec 26 10:10:12 2025] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[Thu Dec 26 10:10:13 2025] ata1: SATA link down (SStatus 0 SControl 300)
[Thu Dec 26 10:10:13 2025] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
What it means: “link down” on a port with a drive means cabling/port disabled in BIOS/power.
Decision: If ports are link down across the board, check BIOS SATA mode (AHCI), and whether the board disabled SATA when M.2 is populated.
Task 8: Enumerate NVMe devices and controller health
cr0x@server:~$ nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- ---------------- -------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S6Z1NZ0R12345 Samsung SSD 990 PRO 2TB 1 1.80 TB / 2.00 TB 512 B + 0 B 5B2QJXD7
What it means: NVMe is present as its own subsystem. If nvme list is empty but lspci shows the controller, it can be driver, PCIe ASPM, or link issues.
Decision: Empty list → check journalctl -k for NVMe errors, BIOS settings for PCIe Gen speed, and slot bifurcation (for multi-NVMe adapters).
Task 9: Confirm stable disk identifiers (what you should use for ZFS)
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep -i 'wwn|nvme|scsi' | head
lrwxrwxrwx 1 root root 9 Dec 26 10:15 nvme-Samsung_SSD_990_PRO_2TB_S6Z1NZ0R12345 -> ../../nvme0n1
lrwxrwxrwx 1 root root 9 Dec 26 10:15 scsi-35000c500a1b2c3d4 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 26 10:15 scsi-35000c500a1b2c3e5 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 10:15 wwn-0x5000c500a1b2c3d4 -> ../../sda
What it means: These IDs survive reboots and device renames (sda becoming sdb after hardware changes).
Decision: If your pool/import scripts use /dev/sdX, stop. Migrate to by-id/by-wwn before your next maintenance window eats you.
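One way to migrate an existing pool from sdX names to by-id, sketched under the assumption the pool (called tank here, a hypothetical name) can tolerate a brief export:

# Export, then re-import using by-id paths so device renames stop mattering
zpool export tank
zpool import -d /dev/disk/by-id tank
# Verify the vdev members now show by-id names
zpool status tank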
Task 10: Check SMART visibility (tells you if you’re really seeing the disk)
cr0x@server:~$ smartctl -a /dev/sda | head -n 20
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
=== START OF INFORMATION SECTION ===
Model Family: Seagate Exos 7E8
Device Model: ST4000NM0035-1V4
Serial Number: ZC123ABC
LU WWN Device Id: 5 000c50 0a1b2c3d4
Firmware Version: SN03
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
What it means: If SMART works, you likely have true pass-through visibility. If SMART fails behind a RAID controller, you may need different device types or vendor utilities.
Decision: SMART blocked + you want ZFS → verify HBA IT mode or true HBA, not RAID personality.
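If you're stuck behind a MegaRAID-style controller for now, smartctl can sometimes reach the physical disks with an explicit device type. A sketch; the device index 0 is a guess you'd iterate through, and support varies by controller and firmware:

# Address physical disk 0 behind a MegaRAID controller instead of the virtual drive
smartctl -a -d megaraid,0 /dev/sda
# SAT passthrough is the analogous trick for SATA disks behind some SAS HBAs
smartctl -a -d sat /dev/sdb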
Task 11: Detect if multipath is hiding your disks
cr0x@server:~$ multipath -ll
mpatha (3600508b400105e210000900000490000) dm-0 IBM,2810XIV
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:0 sda 8:0 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 4:0:0:0 sdb 8:16 active ready running
What it means: Multipath can present disks as /dev/dm-* and the raw /dev/sdX are just paths.
Decision: If these are local disks, multipath is a misconfiguration. Disable it or blacklist local HBAs, otherwise Proxmox/ZFS may see “weird” devices.
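A sketch of an /etc/multipath.conf blacklist for local disks, assuming you keep multipathd around for shared storage and only want it to ignore the local HBA's drives; the WWIDs are placeholders, use your own from multipath -ll or /lib/udev/scsi_id -g -u /dev/sdX:

blacklist {
    # Local disks that should never be multipathed (placeholder WWIDs)
    wwid "35000c500a1b2c3d4"
    wwid "35000c500a1b2c3e5"
}

Then tell the daemon to reread its config (multipathd reconfigure, or restart the service) and confirm with multipath -ll that the local disks dropped out of the map list.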
Task 12: Identify if you’re on a RAID controller with virtual disks
cr0x@server:~$ lspci -nn | egrep -i 'megaraid|raid'
04:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [1000:005d] (rev 02)
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL
NAME TYPE SIZE MODEL
sda disk 7.3T LSI MR9361-8i
What it means: If the model looks like the controller and you only see one “disk,” you’re looking at a virtual drive, not physical disks.
Decision: For ZFS: either switch to an HBA / IT mode, or accept that ZFS can’t manage individual disks properly.
Task 13: Check if BIOS disabled the slot or remapped lanes
cr0x@server:~$ dmidecode -t baseboard | egrep -i 'Manufacturer|Product|Version'
Manufacturer: Supermicro
Product Name: X11SPH-NCTF
Version: 1.02
cr0x@server:~$ lspci -vv -s 03:00.0 | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 8GT/s, Width x8
LnkSta: Speed 2.5GT/s (downgraded), Width x8
What it means: Link downgraded to 2.5GT/s suggests signal integrity issues, wrong slot generation forcing, or bad riser/cable.
Decision: Downgraded links with errors → try forcing Gen3/Gen4 in BIOS, move slots, replace riser, check seating.
Task 14: Proxmox-specific: confirm the kernel and modules match expectations
cr0x@server:~$ uname -r
6.8.12-4-pve
cr0x@server:~$ modinfo mpt3sas | egrep -i 'filename|version|firmware'
filename: /lib/modules/6.8.12-4-pve/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
version: 44.100.00.00
firmware: mpt3sas_fw.bin
What it means: Confirms you’re using the Proxmox kernel and the module exists. Mismatched kernels/initramfs can bite after upgrades.
Decision: If module missing or wrong kernel, fix packages and regenerate initramfs before chasing hardware ghosts.
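The boring fix, sketched for a stock install using initramfs-tools; the proxmox-boot-tool line only applies if your node boots through it (ZFS root / UEFI installs), skip it otherwise:

# Rebuild the initramfs for every installed kernel
update-initramfs -u -k all
# On proxmox-boot-tool installs, sync the refreshed images to the boot partitions
proxmox-boot-tool refresh
# Confirm the storage module actually made it into the initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep mpt3sas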
HBA, BIOS/UEFI, and PCIe: the usual crime scene
HBA mode: IT vs IR/RAID (and why Proxmox cares)
If you’re running ZFS (and many Proxmox shops are), you want the HBA to present each physical disk directly to Linux. That’s IT mode in LSI/Broadcom terms. RAID mode (IR) is a different product philosophy: the controller abstracts disks into logical volumes. That abstraction breaks several things you rely on in modern ops:
- Accurate SMART/health per disk (often blocked or weird).
- Predictable disk identities (WWNs may be hidden or replaced).
- Clear error surfaces (timeouts may become “controller says no”).
- ZFS’s ability to manage redundancy and self-heal with full visibility.
Also: RAID controllers tend to have write caches, BBUs, and policies that are great until they’re not. ZFS already does its own consistency story. You don’t need two captains steering one ship. You get seasickness.
UEFI settings that silently impact detection
BIOS/UEFI can hide or break your storage without dramatic error messages. The most common settings to audit when disks vanish:
- SATA mode: AHCI vs RAID. On servers, RAID mode can route ports through an Intel RST-like layer Linux may not handle the way you expect.
- PCIe slot configuration: Gen speed forced vs auto; bifurcation x16 → x4x4x4x4 for multi-NVMe adapters.
- Option ROM policy: UEFI-only vs Legacy. This mostly affects boot visibility and management screens, but misconfiguration can mask what you think “should” appear pre-boot.
- IOMMU/VT-d/AMD-Vi: Not usually a disk-detection breaker, but it can change device behavior with passthrough setups.
- Onboard storage disablement: Some boards disable SATA ports when M.2 slots are occupied, or share lanes with PCIe slots.
PCIe lane sharing: the modern “why did my slot stop working?”
Motherboards are traffic cops. Put an NVMe in one M.2 slot and your HBA might drop from x8 to x4, or the adjacent slot may get disabled. This is not “bad design.” It’s economics and physics: CPUs have finite lanes, and board vendors multiplex them in ways that require you to read the fine print.
If you see a controller present but unstable (AER errors, link down/up), lane or signal integrity issues are very much on the table. Risers, especially, love to be “mostly fine.”
Joke #2: A PCIe riser that “works if you don’t touch the chassis” is less a component and more a lifestyle choice.
Cabling, backplanes, expanders, and “it’s seated” lies
Mini-SAS connectors: why partial failure is common
SAS cables carry multiple lanes. A single SFF-8643 can carry four SAS lanes; a backplane may map lanes to individual drive bays. If one lane goes bad, you don’t always lose all drives. You lose “some bays,” often in a pattern that looks like software.
Practical rule: if disks are missing in a repeating bay pattern (e.g., bays 1–4 fine, 5–8 dead), suspect a specific mini-SAS cable or port. Don’t spend an hour in udev for a problem that lives in copper.
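To see the pattern without pulling drives, by-path names encode the controller and phy/position each disk sits behind. A sketch; exact by-path naming varies by HBA and topology:

# Gaps in the phy/position sequence line up with the missing bays
ls -l /dev/disk/by-path/ | grep -i sas
# SAS address per target, useful for matching against backplane documentation
lsscsi -t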
Backplanes with expanders: nice when they work
Expanders let you connect many drives to fewer HBA ports. They also add a layer that can have firmware bugs, negotiation quirks, and sensitivity to SATA drives behind SAS expanders. Symptoms include:
- Disks appear after boot but disappear under load.
- Intermittent “device offlined” messages.
- Only some drive models misbehave.
When that happens, you don’t “tune Linux.” You validate the expander firmware, swap cables, isolate by connecting fewer bays, and test with a known-good disk model.
Power delivery and spin-up
Especially in dense chassis, power can be the silent killer. Drives may spin but brown out during link training or when multiple drives spin simultaneously. Some HBAs and backplanes support staggered spin-up. Some don’t. Some support it and ship misconfigured.
A telltale sign is multiple drives dropping at the same time during boot or scrub, then reappearing later. That’s not a “Proxmox thing.” That’s power or signal.
Simple physical checks that beat cleverness
- Reseat both ends of mini-SAS cables. Do not “press gently.” Disconnect, inspect, reconnect firmly.
- Swap cables between known-good and suspected-bad ports to see if the problem follows the cable.
- Move a disk to another bay. If the disk works elsewhere, the bay/backplane lane is suspect.
- If you can, temporarily connect one disk directly to an HBA port (bypass expander/backplane) to isolate layers.
Linux/Proxmox layer: drivers, udev, multipath, and device nodes
Driver presence is not driver health
Seeing mpt3sas loaded doesn’t guarantee the controller initialized properly. Firmware mismatch can produce partial functionality: controller enumerates, but no targets show; or targets show but error constantly.
Kernel logs matter more than module lists. If you see repeated resets, “firmware fault,” or queues stuck, treat it like a real incident: collect logs, stabilize hardware, and consider firmware updates.
Multipath: helpful until it’s not
Multipath is designed for SANs and dual-path storage. On a Proxmox node with local SAS disks, it’s usually accidental and harmful. It can mask the devices you expect, or it can create device-mapper nodes that Proxmox/ZFS will use inconsistently if you aren’t deliberate.
If you’re not explicitly using multipath for shared storage, you generally want it disabled or configured to ignore local disks.
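If nothing on the node legitimately needs multipath, disabling it entirely is the least surprising option. A sketch, with unit names as shipped by stock Debian packaging:

# Stop the daemon and keep it (and its socket activation) from returning at boot
systemctl disable --now multipathd.service multipathd.socket
# Optional: remove the package if it arrived by accident
# apt remove multipath-tools
# Confirm nothing is still presented through device-mapper multipath
multipath -ll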
Device naming: /dev/sdX is a trap
Linux assigns /dev/sdX names in discovery order. Add a controller, reorder cables, or change BIOS boot settings and the order changes. That’s how you import the wrong disks, wipe the wrong device, or build a pool on the wrong members.
Use /dev/disk/by-id or WWNs. Make it policy. Your future self will quietly thank you.
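What that policy looks like at pool creation time, sketched with a hypothetical pool name and the example WWNs from earlier:

# Build the mirror from by-id (WWN) paths, never /dev/sdX
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 /dev/disk/by-id/wwn-0x5000c500a1b2c3e5
# zpool status now reports the same names after any reboot or cable shuffle
zpool status tank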
When Proxmox “doesn’t show disks” but Linux does
Common realities:
- The disks have old partitions and Proxmox UI filters what it considers “available.”
- ZFS is already using the disks (they belong to an imported pool or a stale pool). ZFS won’t politely share.
- You’re looking in the wrong place: node disks vs storage definitions vs datacenter view.
- Multipath or device-mapper is presenting different names than you expect.
ZFS angle: why “RAID mode” is not your friend
Proxmox ships with first-class ZFS support. ZFS assumes it is in charge of redundancy, checksums, and healing. Hardware RAID assumes it is in charge of redundancy and error recovery. When you stack them, you create a system where each layer makes decisions without full information.
What “works” but is still wrong
- Creating one huge RAID0/RAID10 volume and putting ZFS on it: ZFS loses per-disk visibility and can’t isolate failing members.
- Using RAID controller caching with ZFS sync writes: you can accidentally lie to ZFS about durability if the cache policy is unsafe.
- Assuming the controller will surface disk errors cleanly: it may remap, retry, or mask until it can’t.
What you should do instead
- Use an HBA (or flash the controller to IT mode) and present raw disks to ZFS.
- Use stable IDs when creating pools.
- Prefer boring, well-tested firmware combinations. Bleeding edge is great for lab work, not for your cluster quorum.
Common mistakes: symptom → root cause → fix
1) Symptom: HBA not in lspci
Root cause: Card not seated, dead slot, lane sharing disabled the slot, riser failure, or BIOS disabled that slot.
Fix: Reseat, try another slot, remove riser, check BIOS “PCIe slot enable,” check lane sharing with M.2/U.2, update BIOS if it’s ancient.
2) Symptom: HBA in lspci but no disks in lsblk
Root cause: Driver not bound, firmware mismatch, HBA in a mode requiring vendor stack, broken cable/backplane preventing target discovery.
Fix: Verify lspci -k, check journalctl -k, rescan SCSI hosts, swap cables, validate HBA firmware and mode (IT for ZFS).
3) Symptom: Some bays missing in a pattern
Root cause: One SAS lane/cable/port down; backplane mapping aligns with the missing set.
Fix: Swap mini-SAS cable; move to other HBA port; reseat connector; check for bent pins/damage.
4) Symptom: Disks appear after rescan but vanish after reboot
Root cause: Hotplug timing, expander quirks, staggered spin-up misconfigured, marginal power at boot.
Fix: Update HBA/backplane/expander firmware, enable staggered spin-up if supported, verify PSU and power distribution, check boot logs for resets.
5) Symptom: NVMe not detected, but works in another machine
Root cause: Slot disabled due to bifurcation settings, PCIe Gen forced too high/low, lane sharing with SATA, or adapter needs bifurcation.
Fix: Set correct bifurcation, set PCIe speed to Auto/Gen3/Gen4 appropriately, move to CPU-attached slot, update BIOS.
6) Symptom: Proxmox GUI doesn’t show disks, but lsblk does
Root cause: Existing partitions/LVM metadata, ZFS already claims them, multipath device presentation, or you’re looking at the wrong UI view.
Fix: Use CLI to confirm by-id, check zpool status/zpool import, check multipath -ll, wipe signatures only when you’re sure.
7) Symptom: SMART fails with “cannot open device” behind controller
Root cause: RAID controller abstraction; SMART passthrough requires special device type or isn’t supported.
Fix: Use HBA/IT mode for ZFS; otherwise use vendor tooling and accept limitations.
8) Symptom: Disks flap under load, ZFS sees checksum errors
Root cause: Cable/backplane/expander signal integrity or insufficient power; sometimes one drive is poisoning the bus.
Fix: Replace cables first, isolate by removing disks, check dmesg for resets, validate PSU and backplane health.
Checklists / step-by-step plan
Checklist A: “Installer can’t see any disks”
- Enter BIOS/UEFI and confirm the controller is enabled and visible.
- Confirm SATA mode is AHCI (unless you explicitly need RAID for a boot volume).
- For HBA: verify it’s in IT mode or true HBA (not MegaRAID virtual volumes) if you want ZFS.
- Move the HBA to a different PCIe slot (prefer CPU-attached slots).
- Boot a rescue environment and run lspci and dmesg. If it's missing there, it's hardware.
- Swap mini-SAS cables and re-seat connectors at both ends.
- If using a backplane expander: try a direct-attach test with one disk.
Checklist B: “Some disks missing behind HBA”
- Run lsblk and identify which bays are missing; look for patterns.
- Check logs for link resets and offline devices.
- Rescan SCSI hosts; see if missing disks appear.
- Swap the cable feeding the missing-bay set.
- Move the cable to another HBA port; see if the missing set moves.
- Move one missing drive to a known-good bay; if it appears, the bay/lane is bad.
- Update HBA firmware if you’re running a known-problematic revision.
Checklist C: “Disks detected in Linux but not usable in Proxmox”
- Confirm stable IDs in /dev/disk/by-id.
- Check if ZFS sees an importable pool: zpool import.
- Check if disks have signatures: wipefs -n /dev/sdX (the -n is the safety flag; keep it).
- Check multipath: multipath -ll.
- Decide your intent: import existing data vs wipe and repurpose.
- If wiping, do it deliberately and document which WWNs you wiped.
Checklist D: “NVMe not showing up”
- Confirm the controller appears in lspci.
- Check nvme list and kernel logs for timeouts.
- Inspect PCIe link status (LnkSta) for downgrades.
- Set correct bifurcation for multi-NVMe adapters.
- Move the NVMe to another slot and retest.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The team was rolling out a new Proxmox cluster for internal CI workloads. The storage plan was “simple”: eight SAS drives per node, ZFS mirrors, done. Procurement delivered servers with a “SAS RAID controller” instead of the requested HBA. Nobody panicked because the controller still had “SAS” in the name and the BIOS showed a giant logical disk.
They installed Proxmox on that logical volume and built ZFS pools on top of whatever the controller exposed. It worked fine for a few weeks, which is how bad assumptions get promoted to “design decisions.” Then a drive started failing. The controller remapped and retried in ways ZFS couldn’t observe, and the node began stalling during scrubs. The logs were full of timeouts but nothing that mapped cleanly to a physical bay.
During the maintenance window, someone pulled the “failed” drive according to the controller UI. The wrong one. The controller had changed its internal numbering after the earlier remap events, and the mapping sheet was outdated. Now the logical volume degraded in a different way, ZFS got angry, and the cluster lost a chunk of capacity during peak pipeline usage.
The fix was unglamorous: swap the RAID controller for a real HBA, rebuild the node, and enforce a policy: ZFS gets raw disks, identified by WWN, and bay mapping is validated with LEDs and serial numbers before anyone pulls hardware. The assumption “SAS equals HBA” was the original root cause, and it cost them a weekend.
Mini-story 2: The optimization that backfired
A different shop had performance problems during ZFS resilvers. Someone suggested “optimizing cabling” by using a single expander backplane to reduce HBA ports and keep the build tidy. Fewer cables, fewer failure points, right?
In practice, the expander introduced a subtle behavior: during heavy I/O, a couple of SATA SSDs (used as special vdevs) would intermittently drop for a few seconds, then return. The HBA and kernel would log link resets, and ZFS would mark devices as faulted or degraded depending on timing. The symptom looked like “ZFS is flaky” because the drops were transient.
The team tried tuning timeouts and queue depths, because engineers like knobs and the expander looked “enterprise.” The tuning reduced the obvious errors but didn’t solve the underlying issue. Under a real incident—node reboot plus simultaneous VM recovery—the devices flapped again and the pool refused to import cleanly without manual intervention.
They backed out the “optimization.” Direct-attach the SSDs, keep the expander for the bulk HDDs where latency wasn’t as sensitive, and standardize drive models behind the expander. Performance improved, and so did sleep. Sometimes fewer cables is just fewer clues when it breaks.
Mini-story 3: The boring but correct practice that saved the day
One team had a habit that looked pedantic: every disk was recorded by WWN and bay location at install time. They kept a simple sheet: chassis serial, bay number, drive serial, WWN, and the intended ZFS vdev membership. They also labeled cables by HBA port and backplane connector. Nobody loved doing it, but it was policy.
A year later, a node started reporting intermittent checksum errors during scrubs. The logs suggested a flaky link, not a failing disk, but the pool topology included twelve drives and a backplane expander. In the old world, this would devolve into “pull drives until the errors stop.” That’s how you create new incidents.
Instead, they correlated the affected WWN with the bay. The errors were always on disks in bays 9–12. That matched a single mini-SAS cable feeding that section of the backplane. They swapped the cable during a short maintenance window, scrubbed again, and the errors disappeared.
No drama. No guessing. The boring inventory practice turned a potentially messy incident into a 20-minute fix with a clear root cause. Reliability is often just bookkeeping with conviction.
FAQ
1) Proxmox installer shows no disks. Is it always an HBA driver issue?
No. If lspci doesn’t show the controller, it’s BIOS/PCIe/hardware. If the controller shows but no disks, then it might be driver/firmware/cabling.
2) I see disks in BIOS but not in Linux. How is that possible?
BIOS may show RAID virtual volumes or a controller summary without exposing targets to Linux. Or Linux lacks the right module, or the controller fails initialization during boot (check journalctl -k).
3) Do I need IT mode for Proxmox?
If you use ZFS and want sane operations, yes. If you insist on hardware RAID, you can run it, but you’re choosing a different operational model with different tooling.
4) Why do disks show up as /dev/dm-0 instead of /dev/sda?
Usually multipath or device-mapper stacking (LVM, dm-crypt). For local disks you didn’t intend to multipath, fix multipath config or disable it.
5) My disks appear, but Proxmox GUI doesn’t list them as available. Are they broken?
Often they have existing signatures (old ZFS/LVM/RAID metadata) or are already part of an imported pool. Verify with lsblk, wipefs -n, and zpool import before doing anything destructive.
6) Can a bad SAS cable really cause only one disk to disappear?
Yes. Mini-SAS carries multiple lanes; depending on backplane mapping, a lane issue can isolate a single bay or a subset. Patterns are your friend.
7) NVMe not detected: what’s the single most common BIOS mistake?
Wrong bifurcation settings when using multi-NVMe adapters, or lane sharing that disables the slot when another M.2/U.2 is populated.
8) Should I force PCIe Gen speed to fix link issues?
Sometimes forcing a lower Gen speed stabilizes flaky links (useful for diagnosis), but the real fix is usually seating, risers, cabling, or board/slot choice.
9) How do I decide between “replace disk” and “replace cable/backplane”?
If multiple disks show errors on the same HBA port/backplane segment, suspect the cable/backplane. If the errors follow one disk across bays, it's the disk.
10) Is it safe to rescan SCSI hosts on a production node?
Generally yes, but do it with situational awareness. Rescans can trigger device discovery events and log noise. Avoid during sensitive storage operations if you’re already degraded.
Conclusion: practical next steps
If Proxmox can’t see disks, stop guessing and walk the chain: PCIe enumeration → driver binding → link stability → target discovery → stable IDs → Proxmox/ZFS consumption. The fastest wins are usually physical: seating, lane allocation, and cables. The most expensive failures come from the wrong controller mode and sloppy device naming.
- Run the fast diagnosis playbook and classify the failure domain in 10 minutes.
- Collect evidence: lspci -k, lsblk, and kernel logs around detection time.
- Standardize: HBA/IT mode for ZFS, by-id naming, and a bay-to-WWN map.
- Fix the root cause, not the symptom: replace suspect cables/risers, correct BIOS bifurcation, update firmware deliberately.
- After recovery, do one scrub/resilver test and review logs. If you don’t verify, you didn’t fix it—you just stopped seeing it.