You rebooted. The box didn’t. Or it booted, but a mount went read-only, apps started throwing I/O errors, and your “quick fix” reflex is hovering over fsck like a nervous finger over a big red button.
This is the moment where tool order matters more than tool choice. With ext4 and XFS, the wrong first move can turn a recoverable metadata problem into an archaeological dig. The right first move often looks boring: identify, isolate, image/snapshot, then repair with the filesystem’s native tools in the filesystem’s preferred order.
The big rule: your first tool is “stop changing the disk”
If you remember one thing: filesystem repair tools are not “read-only diagnostics.” They are surgeons, and sometimes they amputate to save the patient. Your first tool is a pause button: remount read-only, stop services, and capture evidence before you mutate state.
What “first” means in real life
When someone says “run fsck” or “run xfs_repair,” they usually mean “fix it now.” In operations, “first” should mean:
- Confirm what filesystem you’re touching. People guess wrong more than they admit.
- Confirm the device mapping. LVM, mdraid, multipath, NVMe namespaces—your /dev/sdX intuition is not a strategy.
- Confirm the failure domain. Is it the filesystem, the block device, the controller, or a layer above (like dm-crypt, mdraid, or virtualization storage)?
- Make the disk stop changing. If you can snapshot, do it. If you can image, do it. If you can at least remount read-only, do it.
Only then do you pick the first filesystem-specific tool. On ext4, that’s usually e2fsck (the binary behind fsck.ext4), but not always in “fix everything” mode. On XFS, it’s typically xfs_repair—but often preceded by xfs_repair -n or a deliberate log decision (-L is not a warm hug).
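A minimal sketch of that ordering, reusing the example device paths from the tasks later in this article (/dev/vg0/home for ext4, /dev/vg0/var for XFS); both invocations are assessment-only:

# ext4: assess without changing anything (-n answers "no" to every fix prompt)
sudo e2fsck -fn /dev/vg0/home

# XFS: dry-run repair; reports what it would change, including how it would handle the log
sudo xfs_repair -n /dev/vg0/var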
One more rule: if you see hardware errors (timeouts, media errors, CRC errors), treat repair utilities as secondary. Fix the plumbing first, or you’ll be repairing corruption that continues to accrue.
Joke 1: Running repair tools on a dying disk is like repainting a house during an earthquake: it’s not the paint’s fault, but it won’t end well.
And one operational truth worth pinning to your incident channel:
“Hope is not a strategy.” — General Gordon R. Sullivan
Facts and context that change decisions
These are not trivia. They explain why ext4 and XFS behave differently under failure, and why the “first tool” differs.
- ext4 came from ext3’s journaling lineage, designed for broad Linux compatibility and predictable recovery. Its journal often makes post-crash mounts more survivable than older filesystems.
- XFS originated at SGI for high-performance IRIX systems and later became a mainline Linux filesystem; it’s built around aggressive metadata structures and scalability.
- XFS journals metadata (the log), not file data by default. That means after a crash you can see “clean” file contents that are semantically wrong because the last writes didn’t complete—especially for applications without fsync discipline.
- ext4’s “ordered” data mode (common default) tries to flush file data before committing related metadata, reducing some classes of post-crash garbage pointers, at a performance cost.
- XFS uses allocation groups (AGs) to scale parallel metadata operations. Corruption can localize to an AG, which is good; repair still often needs a full scan, which is not fast on huge volumes.
- ext4 can do “online metadata checks” in limited ways, but serious inconsistencies usually require unmounting and running e2fsck. XFS repair is also offline (unmounted) for real fixes.
- XFS has a log replay mechanism on mount. If the log itself is corrupt, mount may fail. People get tempted to use xfs_repair -L, which discards the log—sometimes necessary, sometimes disastrous.
- ext4 has metadata checksums and journal checksums on modern kernels; that helps detect corruption early, but it also means older rescue environments can choke on newer feature flags.
- Ubuntu 24.04 rescue environments matter: a too-old live ISO may not understand your ext4 features or XFS v5 metadata, leading you to misdiagnose “corruption” that is actually “old tooling.”
Those facts point to a sober conclusion: ext4 tends to “repair through fsck” workflows, while XFS tends to “diagnose first, then repair carefully” workflows, with special respect for the log.
Which tool to run first (ext4 vs XFS): the decision rules
Before filesystem tools: the universal first moves
The first “tool” should be a set of actions, not a command:
- Stop writes: stop services, unmount if possible, or remount read-only.
- Validate the device path: confirm whether you’re on a partition, LVM LV, mdraid, or dm-crypt mapping.
- Look for hardware errors: dmesg, SMART/NVMe logs, RAID status.
- Snapshot or image if the data matters: LVM snapshot, storage snapshot, or ddrescue if hardware is failing (see the sketch right after this list).
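A minimal “stop changing the disk” sketch, assuming a systemd service writing to /var and an LVM-backed volume; the service name and snapshot size are illustrative:

# stop the writers (service name is hypothetical)
sudo systemctl stop app-ingest.service

# see what still holds the mount open
sudo fuser -vm /var

# if you can't unmount, at least stop further writes (fails if files are still open for writing)
sudo mount -o remount,ro /var

# cheap rollback point before any repair; watch its usage later
sudo lvcreate -L 20G -s -n var-pre-repair /dev/vg0/var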
Now the filesystem-specific first tool
Here’s the opinionated rule set I use in production.
ext4: first tool is usually e2fsck, but start with “look, don’t touch”
- If the filesystem is unmountable or mounts read-only due to errors: first run e2fsck -fn (no changes) to estimate damage and risk. Then decide whether to snapshot/image and run a fixing pass.
- If you suspect severe corruption or hardware issues: first run SMART/NVMe checks and consider imaging, then run e2fsck on the image or snapshot.
- If the journal is suspect: you may need e2fsck -f and, in some cases, journal rebuild with tune2fs—but do not start there.
XFS: first tool is xfs_repair -n (dry-run), not xfs_repair
- If mount fails with log errors: first run xfs_repair -n to see what it would do. If it tells you the log is dirty/corrupt, decide carefully whether to mount with recovery or discard the log.
- Only use xfs_repair -L when you accept losing recent metadata transactions. It can bring the filesystem back, but at the cost of forgetting “recent history.” Sometimes that means losing whole files or directory entries created just before the crash.
- If hardware is flaking: image first. XFS repair is metadata-intensive and will stress a failing disk.
Joke 2: xfs_repair -L is like deleting your browser history: it solves a problem, just not always the one you actually have.
The short answer: why the order differs
ext4’s repair workflow is built around fsck as the authoritative fixer, and it often can replay or reconcile journals in a relatively deterministic way. XFS expects mount-time log replay to handle common crash recovery, and when that fails, the log becomes a decision point. That’s why xfs_repair -n is the right first XFS tool: it tells you whether you’re about to take a chainsaw to the log.
Fast diagnosis playbook (find the bottleneck fast)
When you’re on the clock, the goal is not “be thorough.” The goal is “identify which layer is lying to you.” Here’s a fast order that works on Ubuntu 24.04 in real fleets.
Step 1: Is it hardware or transport?
- Check kernel logs for I/O errors/timeouts. If you see them, treat filesystem symptoms as secondary.
- Check SMART/NVMe health quickly. Reallocated sectors, media errors, CRC errors, or NVMe “critical_warning” flags change everything.
- If it’s RAID: confirm array health before running repair tools. Repairing on top of a degraded/rebuilding array can be fine, but repairing on top of a silently inconsistent one is how you create art.
Step 2: Is the block mapping correct?
- Confirm the device identity (UUID, WWN, serial). Do not trust /dev/sdX.
- Confirm layers: dm-crypt? LVM? mdraid? multipath? cloud volume?
Step 3: What filesystem is it and what does it say about itself?
- Identify FS type and features without mounting (or mount read-only if safe).
- For ext4: do a non-destructive fsck pass to see scope.
- For XFS: do a dry-run repair to see whether the log is the problem.
Step 4: Choose the safest recovery path
- Data is irreplaceable: image/snapshot first, repair second.
- Service is irreplaceable: fail over first, repair on the side.
- You need the box back in 10 minutes: you’ll be tempted to “just run the fixer.” Do at least one read-only pass first so you know what you’re risking.
Practical recovery tasks (commands, what the output means, and the decision you make)
These are deliberately operational: each task includes a command, an example output, what it means, and what you do next. Run them from a rescue shell when possible (or at least single-user mode). Replace device names carefully.
Task 1: Confirm what is failing (mount errors and current state)
cr0x@server:~$ findmnt -rno TARGET,SOURCE,FSTYPE,OPTIONS / /var /home
/ /dev/mapper/vg0-root ext4 rw,relatime,errors=remount-ro
/var /dev/nvme0n1p3 xfs ro,relatime
/home /dev/mapper/vg0-home ext4 rw,relatime
Meaning: /var is XFS and already read-only. That’s a clue: either XFS detected inconsistency, or the kernel flipped it due to I/O errors.
Decision: Stop anything that writes to /var. If you can’t unmount, keep it read-only and start with log/hardware checks before repair.
Task 2: Identify the filesystem type from the block device (no guessing)
cr0x@server:~$ lsblk -f /dev/nvme0n1
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
nvme0n1
├─nvme0n1p1 vfat FAT32 4C3A-1B2F 510M 2% /boot/efi
├─nvme0n1p2
└─nvme0n1p3 xfs 3a2d0a3e-3f2e-4f2d-9d8a-0c2c8c9d6f11 /var
Meaning: It’s XFS on /dev/nvme0n1p3. No ambiguity.
Decision: Use XFS tooling (xfs_repair, xfs_metadump), not fsck (fsck.xfs is deliberately a no-op, so generic fsck will not repair XFS).
Task 3: Check kernel logs for I/O errors before you “repair” corruption
cr0x@server:~$ dmesg -T | egrep -i "I/O error|blk_update_request|nvme|ata|reset|xfs|ext4" | tail -n 25
[Mon Dec 29 09:41:11 2025] nvme nvme0: I/O 123 QID 6 timeout, aborting
[Mon Dec 29 09:41:12 2025] nvme nvme0: Abort status: 0x371
[Mon Dec 29 09:41:14 2025] XFS (nvme0n1p3): log I/O error -5
[Mon Dec 29 09:41:14 2025] XFS (nvme0n1p3): Log I/O Error Detected. Shutting down filesystem
[Mon Dec 29 09:41:14 2025] XFS (nvme0n1p3): Please unmount the filesystem and rectify the problem(s)
Meaning: This is not “random filesystem corruption.” The log I/O error is driven by NVMe timeouts.
Decision: Treat the device as suspect. Capture logs, consider controller/firmware/PCIe issues, and prioritize imaging/snapshotting over aggressive repairs.
Task 4: Quick SMART check for SATA/SAS drives (if applicable)
cr0x@server:~$ sudo smartctl -H -A /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-xx-generic] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 199 199 000 Old_age Always - 12
Meaning: CRC errors suggest cabling/controller issues even when the disk “passes.” That can corrupt writes and cause filesystem shutdowns.
Decision: Fix the transport (cable/backplane/HBA) before repair. If this is a server with hot-swap bays, reseat/replace and re-test.
Task 5: NVMe health check (common in Ubuntu 24.04 fleets)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 47 C
available_spare : 100%
percentage_used : 3%
media_errors : 0
num_err_log_entries : 18
Meaning: No media errors, but error log entries exist. Combined with timeouts in dmesg, this smells like controller/path issues rather than NAND death.
Decision: Pull the NVMe error log, check PCIe AER events, consider firmware updates. Still snapshot/image before repair if data is critical.
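A small follow-up sketch, assuming the nvme-cli package is installed (device name as in the task above):

# dump the controller's persistent error log entries
sudo nvme error-log /dev/nvme0 | head -n 40

# look for PCIe AER events and link resets around the incident window
sudo dmesg -T | grep -iE "aer|pcieport|nvme" | tail -n 30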
Task 6: Confirm mdraid status before touching filesystems on top
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
976630336 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Meaning: RAID1 is healthy ([UU]).
Decision: It’s at least safe to proceed to filesystem-level diagnostics. If you saw [U_] or rebuilding, you’d pause and stabilize first.
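If you want more than /proc/mdstat, a short sketch (array name as in the sample above):

# per-array detail: state, degraded/failed members, last check results
sudo mdadm --detail /dev/md0

# a non-zero mismatch count after a check deserves investigation before filesystem repair
cat /sys/block/md0/md/mismatch_cnt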
Task 7: Confirm LVM layout and which LV you’re about to operate on
cr0x@server:~$ sudo lvs -a -o lv_name,vg_name,lv_size,lv_attr,devices
LV VG LSize Attr Devices
root vg0 80.00g -wi-ao---- /dev/nvme0n1p2(0)
var vg0 200.00g -wi-ao---- /dev/nvme0n1p2(20480)
home vg0 500.00g -wi-ao---- /dev/nvme0n1p2(71680)
Meaning: /var may actually live on an LV rather than a raw partition; in this sample the LVs sit on nvme0n1p2, which conflicts with the earlier p3 assumption. This is where people repair the wrong thing.
Decision: Repair the correct block device (LV path like /dev/vg0/var) and stop using partition assumptions.
Task 8: Make a snapshot before repair (LVM)
cr0x@server:~$ sudo lvcreate -L 20G -s -n var-pre-repair /dev/vg0/var
Logical volume "var-pre-repair" created.
Meaning: You now have a copy-on-write snapshot. It’s not a full backup, but it’s a rollback point against repair mistakes.
Decision: Perform repair on the origin LV only after verifying snapshot space is sufficient: every block the repair rewrites on the origin consumes copy-on-write space, and a snapshot that fills to 100% becomes invalid.
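A minimal sketch for keeping an eye on that snapshot during the repair (names follow the lvcreate command above; lvconvert --merge is the rollback path if the repair goes badly):

# watch the snapshot's fill level while the repair runs
sudo lvs -o lv_name,lv_size,origin,snap_percent vg0

# roll back: merge the snapshot into the origin (if the origin is in use, the merge is deferred until it is next activated)
sudo lvconvert --merge vg0/var-pre-repair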
Task 9: For ext4, start with a non-destructive check
cr0x@server:~$ sudo e2fsck -fn /dev/vg0/home
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/home: ********** WARNING: Filesystem still has errors **********
/home: 219842/32768000 files (0.2% non-contiguous), 4123712/131072000 blocks
Meaning: With -n, e2fsck answers “no” to every fix prompt and changes nothing; the closing warning means it found inconsistencies that a fixing pass would repair.
Decision: If hardware looks stable and you have a snapshot/image, proceed to a fixing pass (e2fsck -fy) during downtime. If hardware is unstable, image first.
Task 10: For ext4, do a fixing pass (controlled, logged)
cr0x@server:~$ sudo e2fsck -fy /dev/vg0/home
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Inode 924518 has illegal block(s). Clear? yes
Pass 2: Checking directory structure
Entry 'cache' in /user/alex (924612) has deleted/unused inode 1032211. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/home: ***** FILE SYSTEM WAS MODIFIED *****
/home: 219840/32768000 files (0.2% non-contiguous), 4123690/131072000 blocks
Meaning: It fixed real metadata inconsistencies. Expect some files to be moved into lost+found if directory entries were damaged.
Decision: Mount and validate application-level integrity (databases, mail spools, etc.). Filesystem consistency does not guarantee application consistency.
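A small triage sketch for anything e2fsck reconnected, assuming /home is mounted again (paths are the sample ones):

# reconnected files and directories are named after their inode numbers
sudo ls -la /home/lost+found | head -n 20

# identify file types before deciding what to restore, rename, or discard
sudo find /home/lost+found -maxdepth 1 -type f -exec file {} + | head -n 20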
Task 11: For XFS, dry-run repair first
cr0x@server:~$ sudo xfs_repair -n /dev/vg0/var
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
would zero log
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
Phase 4 - check for duplicate blocks...
Phase 5 - rebuild AG headers and trees...
Phase 6 - check inode connectivity...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
Meaning: The tool is telling you it would zero the log. That is the big red sign: the log can’t be replayed as-is.
Decision: Decide whether you can accept losing recent metadata operations. If this is a database volume, consider restoring from application backups rather than forcing log discard.
Task 12: For XFS, attempt a read-only mount to copy data (when feasible)
cr0x@server:~$ sudo mount -o ro,norecovery /dev/vg0/var /mnt/var_ro
mount: /mnt/var_ro: mount(2) system call failed: Structure needs cleaning.
Meaning: Even read-only with no recovery, it can’t mount. Metadata is likely damaged beyond mount-time tolerance.
Decision: Proceed to xfs_repair planning—ideally on a snapshot/image.
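If you want to rehearse the repair before touching the real volume, xfs_metadump captures metadata only (no file contents) into a restorable image; a minimal sketch, with output paths that are purely illustrative:

# capture metadata from the unmounted filesystem (-g shows progress, -o keeps real names)
sudo xfs_metadump -g -o /dev/vg0/var /root/var.metadump

# restore the metadump into a sparse image file
sudo xfs_mdrestore /root/var.metadump /root/var-meta.img

# rehearse: run the repair (dry or real) against the image, never the origin (-f = target is a file)
sudo xfs_repair -n -f /root/var-meta.img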
Task 13: For XFS, run repair (without -L unless you mean it)
cr0x@server:~$ sudo xfs_repair /dev/vg0/var
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
- found corruption in filesystem
- resetting log...
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
Phase 4 - check for duplicate blocks...
Phase 5 - rebuild AG headers and trees...
Phase 6 - check inode connectivity...
Phase 7 - verify link counts...
done
Meaning: xfs_repair zeroes and resets the log as part of a normal repair run. If the log had been dirty and unreplayable, plain xfs_repair would have refused to proceed and forced you to choose between mounting to replay the log and using -L.
Decision: Mount and validate. If files are missing or directories look odd, plan for restore from backups or application-level repair.
Task 14: For XFS when log is unrecoverable: the explicit -L decision
cr0x@server:~$ sudo xfs_repair -L /dev/vg0/var
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
Phase 3 - for each AG...
Phase 4 - check for duplicate blocks...
Phase 5 - rebuild AG headers and trees...
Phase 6 - check inode connectivity...
Phase 7 - verify link counts...
done
Meaning: You discarded the log. Any metadata transactions that were only in the log are gone. This can orphan recent files or roll back directory changes.
Decision: Only do this if (a) you have backups, (b) you’ve accepted loss of recent changes, and (c) the alternative is “filesystem stays down.” Document the choice.
Task 15: Validate mount and check for read-only flips after repair
cr0x@server:~$ sudo mount /dev/vg0/var /var
cr0x@server:~$ findmnt -no OPTIONS /var
rw,relatime,attr2,inode64,logbufs=8,logbsize=32k
Meaning: Mounted read-write with normal XFS options. Good sign.
Decision: Watch dmesg for fresh I/O errors. If it flips to read-only again, you’re chasing hardware instability, not a one-time crash.
Task 16: Check ext4 journal and features (helps with “old tools” problems)
cr0x@server:~$ sudo tune2fs -l /dev/vg0/home | egrep -i "Filesystem features|Filesystem state|Errors behavior|Journal"
Filesystem state: clean
Errors behavior: Continue
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit metadata_csum
Journal inode: 8
Meaning: Modern ext4 features like metadata_csum require a sufficiently new e2fsck. Ubuntu 24.04 is fine; ancient rescue media might not be.
Decision: If you’re booted into an old environment and it complains about unknown features, switch to a newer rescue image rather than “disabling features” in a panic.
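A quick sketch for checking whether the rescue environment’s tooling is new enough, assuming a Debian/Ubuntu-based rescue image:

# userspace repair tooling versions (e2fsck also prints its version at the top of every run)
dpkg -l e2fsprogs xfsprogs | grep ^ii
xfs_repair -V

# kernel in the rescue environment
uname -r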
Task 17: Confirm XFS metadata version (v5 vs older)
cr0x@server:~$ sudo xfs_info /dev/vg0/var
meta-data=/dev/vg0/var isize=512 agcount=64, agsize=819200 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
data = bsize=4096 blocks=52428800, imaxpct=25
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal bsize=4096 blocks=25600, version=2
realtime =none extsz=4096 blocks=0, rtextents=0
Meaning: crc=1 indicates metadata checksums (XFS v5 features). Again: old tooling can misbehave.
Decision: Ensure your rescue environment is contemporary (Ubuntu 24.04 or similar) before interpreting errors as “corruption.”
Task 18: When the disk is dying, image it with ddrescue (not dd)
cr0x@server:~$ sudo ddrescue -f -n /dev/nvme0n1 /mnt/recovery/nvme0n1.img /mnt/recovery/nvme0n1.map
GNU ddrescue 1.28
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 0 B, errsize: 0 B, current rate: 0 B/s
ipos: 0 B, errors: 0, average rate: 0 B/s
opos: 0 B, run time: 0s, remaining time: 0s
Finished
Meaning: In real runs you’ll see rescued bytes, error size, and error counts. The mapfile lets you resume and do smarter retries later.
Decision: Perform filesystem repairs on the image (loopback) or on a cloned disk. Don’t stress a failing source with repeated metadata scans.
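A minimal sketch for running repairs against the image rather than the failing source, assuming the file and mapfile paths above; losetup exposes the image as a block device and --partscan creates partition mappings:

# attach the image and scan its partition table (prints the allocated device, e.g. /dev/loop0)
sudo losetup --find --show --partscan /mnt/recovery/nvme0n1.img

# repair the loop partition that backs the damaged filesystem (partition number per the earlier lsblk)
sudo xfs_repair -n /dev/loop0p3

# detach when done
sudo losetup -d /dev/loop0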
Three corporate mini-stories from the repair trenches
1) The incident caused by a wrong assumption
They had a fleet of Ubuntu VMs, most of them ext4, because that’s what their base image shipped with years ago. One day a critical analytics node stopped mounting /data. The on-call saw “Structure needs cleaning,” shrugged, and typed fsck -y /dev/sdb because muscle memory is undefeated.
The command didn’t “fix” anything. It also didn’t obviously fail in a way that triggered panic. That’s part of the trap: generic fsck just dispatches to a filesystem-specific helper, so when the FS type is misidentified or the device is wrong, you can do nothing useful or—worse—damage an adjacent filesystem because you were pointed at the wrong layer (partition vs LV vs whole disk).
The real issue: /data was XFS on an LVM LV, and the device they repaired was the underlying PV partition. So they ran a tool that wasn’t appropriate, on a target that wasn’t the filesystem, while the actual LV continued to error out.
When a senior engineer joined, they did the unglamorous steps: lsblk -f, lvs, and findmnt. Then xfs_repair -n showed the log was corrupt. The fix ended up being xfs_repair -L—accepted loss of the last few minutes of directory updates—followed by restoring a small set of application outputs from upstream.
The real lesson wasn’t “XFS is scary.” It was that filesystem recovery is a mapping problem first. If you don’t know what block device backs the mount, you’re gambling with someone else’s weekend.
2) The optimization that backfired
A storage team wanted faster ingest. They moved a high-churn workload to XFS and tuned for throughput: larger log buffers, aggressive parallelism, and application writes that relied on “the filesystem will handle it.” Performance improved. Everyone congratulated themselves in the way enterprises do: quietly, and only after the dashboard goes green.
Then a power event hit a rack with a flaky UPS transfer. Servers survived, mostly. But several nodes came back with XFS complaining about log recovery. The team’s first instinct was to standardize an automated recovery step: if mount fails, run xfs_repair -L and reboot. It worked in testing. It worked in staging. It worked in exactly the way a sharp knife “works.”
On production, it brought the mounts back, but the application data was inconsistent. Some directories lost recently created files. Some files existed with zero length. The worst part: the service was “up,” so downstream systems consumed bad outputs before anyone noticed. That cost more than an extra hour of downtime would have.
The backfire wasn’t XFS. It was the assumption that “filesystem available” equals “application correct,” and that discarding the log is a safe default. XFS log discard is a last resort. It’s not a reboot script line item.
Afterward, they enforced two policies: (1) never auto-run log discard; require human approval, and (2) for critical datasets, fail over and restore from known-good checkpoints rather than forcing the filesystem to limp back.
3) The boring but correct practice that saved the day
A financial services company had a rule that annoyed everyone: before any repair tool runs, take a snapshot or clone. No exceptions. Engineers joked that the snapshot policy existed to protect them from themselves. That joke wasn’t wrong.
One quarter-end, a VM hosting batch processing hit a host-level storage hiccup. The guest kernel started logging I/O errors, then remounted ext4 volumes read-only. That alone was good behavior: ext4 trying to prevent further damage. The pressure was enormous because quarter-end batch delays have a special talent for climbing the management chain.
The on-call followed the playbook: stop writes, confirm hardware stability from the hypervisor side, take an LVM snapshot on the storage layer, then run e2fsck -fn to scope damage. The report showed inode and directory issues, repairable. They ran e2fsck -fy on the snapshot clone first, validated application-level checks, then repaired the actual volume.
During repair, e2fsck made a decision that would have been uncomfortable without a rollback: it cleared some directory entries pointing to invalid inodes. The volume came back clean, but one directory tree was partially moved to lost+found. Because they had the snapshot, they could recover specific files by inode from the pre-repair state rather than restoring the whole world.
Nothing heroic happened. That’s the point. The boring policy—snapshot first—turned a stressful incident into a bounded maintenance task.
Common mistakes: symptoms → root cause → fix
1) “I ran fsck and it didn’t help”
Symptoms: You ran fsck, the issue persists, or it complained about “bad magic number.”
Root cause: Wrong filesystem type or wrong target device (partition/PV instead of LV, md device vs member disk).
Fix: Re-identify with lsblk -f and findmnt. Use e2fsck only for ext* and xfs_repair for XFS. Repair the correct layer (e.g., /dev/vg0/var, not /dev/nvme0n1p2).
2) XFS won’t mount and you immediately used xfs_repair -L
Symptoms: Mount works after repair, but “recent” files are missing or directories look rolled back.
Root cause: Log discard loses metadata transactions that hadn’t been committed to the main structures.
Fix: Prefer xfs_repair -n first; if you must use -L, document it and validate application integrity. Restore missing outputs from backups/checkpoints. For databases, prioritize database recovery tooling and backups over filesystem heroics.
3) ext4 keeps remounting read-only after fsck “fixed it”
Symptoms: ext4 mounts, then flips to ro; kernel logs show I/O errors.
Root cause: Underlying disk/controller errors or unstable path, not unresolved metadata.
Fix: Check dmesg and SMART/NVMe logs. Replace hardware, fix cabling/backplane, or migrate data off the device. Repair tools can’t make a flaky bus reliable.
4) Repairs take forever and you assume the tool is hung
Symptoms: e2fsck or xfs_repair runs for hours on multi-TB volumes.
Root cause: Full metadata scans are expensive; also, a disk with slow retries makes everything look stuck.
Fix: Confirm disk isn’t throwing retries/timeouts in dmesg. Use iostat to see if I/O is progressing. If hardware is sick, image first with ddrescue.
5) “Structure needs cleaning” on XFS and you try to mount read-write anyway
Symptoms: Mount fails; you try different mount options; eventually you get a mount but corruption worsens.
Root cause: Forcing mounts or replay on damaged metadata can amplify damage, especially with continued writes.
Fix: Stop writes. Use xfs_repair -n and decide: copy out from RO if possible, or repair offline on a snapshot.
6) You repaired the filesystem but the app is still broken
Symptoms: PostgreSQL won’t start, MySQL complains, or a queue has corrupt segments, even though fsck/repair is clean.
Root cause: Filesystem integrity is not application transactional integrity. Crashes and log discards can break higher-level invariants.
Fix: Run application-level recovery (DB recovery, WAL replay, rebuild indexes) or restore from known-good backups. Treat filesystem repair as “make it mountable,” not “make it correct.”
Checklists / step-by-step plan
Checklist A: The first 10 minutes (any filesystem)
- Stop writes: stop services, disable cron/timers if needed.
- Capture evidence: save dmesg output and relevant logs.
- Confirm mapping: findmnt, lsblk -f, lvs, /proc/mdstat.
- Check hardware indicators: SMART/NVMe logs, RAID status, controller logs.
- Decide on snapshot/image: if data value is high, snapshot/image now, not later.
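Checklist A as commands, assuming a root shell in the rescue environment and an evidence directory under /root/incident (path and device names are illustrative):

mkdir -p /root/incident
dmesg -T > /root/incident/dmesg.txt
journalctl -k -b > /root/incident/kernel-journal.txt
findmnt > /root/incident/findmnt.txt
lsblk -f > /root/incident/lsblk.txt
lvs -a -o +devices > /root/incident/lvs.txt
cat /proc/mdstat > /root/incident/mdstat.txt
smartctl -x /dev/sda > /root/incident/smart-sda.txt 2>&1 || true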
Checklist B: ext4 recovery order (safe-first)
- Unmount the filesystem (preferred) or boot into rescue mode.
- Run a non-destructive check: e2fsck -fn /dev/...
- If errors exist and hardware is stable: run e2fsck -fy /dev/...
- Mount read-only first if you’re nervous, then read-write.
- Check dmesg for new errors; if clean, proceed with application validation (a consolidated command sketch follows this checklist).
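That order as commands, assuming the /dev/vg0/home volume from the earlier tasks (snapshot size is a guess):

sudo umount /home                                         # or boot into rescue / single-user mode
sudo e2fsck -fn /dev/vg0/home                             # scope the damage, change nothing
sudo lvcreate -L 20G -s -n home-pre-repair /dev/vg0/home  # rollback point before the fixing pass
sudo e2fsck -fy /dev/vg0/home                             # fixing pass, only if hardware looks stable
sudo mount -o ro /dev/vg0/home /home                      # look around read-only first if nervous
sudo mount -o remount,rw /home                            # then go read-write
sudo dmesg -T | tail -n 50                                # no fresh I/O errors? move on to app validation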
Checklist C: XFS recovery order (diagnose, then repair deliberately)
- Unmount the filesystem (required for real repair).
- Try to mount read-only with minimal risk (optional): mount -o ro,norecovery; if it works, copy data out.
- Run xfs_repair -n to see what repair would do.
- If log is implicated, decide whether log discard is acceptable; prefer restore over -L for critical transactional workloads.
- Run xfs_repair (and only then, if necessary, xfs_repair -L).
- Mount and validate; run application-level checks (see the sketch after this list).
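That order as commands, assuming /dev/vg0/var and illustrative salvage/mount paths; everything before the plain xfs_repair is non-destructive:

sudo umount /var                                        # real repair requires an unmounted filesystem
sudo mount -o ro,norecovery /dev/vg0/var /mnt/var_ro    # optional salvage window
sudo rsync -a /mnt/var_ro/ /srv/var-copy/               # copy out what you can, then unmount again
sudo umount /mnt/var_ro
sudo xfs_repair -n /dev/vg0/var                         # dry run: what would change, including the log
sudo xfs_repair /dev/vg0/var                            # real repair; refuses to run over a dirty log
# sudo xfs_repair -L /dev/vg0/var                       # last resort: discards the log; document the decision
sudo mount /dev/vg0/var /var && sudo dmesg -T | tail -n 30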
Checklist D: When hardware is suspicious
- Stop all writes immediately.
- Image with ddrescue or clone via your storage platform.
- Run repairs on the clone/image.
- Plan for replacement/migration; don’t “repair-and-pray” on unstable hardware.
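A minimal two-pass ddrescue sketch, reusing the image and mapfile paths from Task 18; the first pass skips problem areas to grab readable data quickly, the second goes back for the bad spots:

# pass 1: no scraping (-n), copy everything that reads cleanly
sudo ddrescue -f -n /dev/nvme0n1 /mnt/recovery/nvme0n1.img /mnt/recovery/nvme0n1.map

# pass 2: retry the failed areas with direct I/O and three retries
sudo ddrescue -f -d -r3 /dev/nvme0n1 /mnt/recovery/nvme0n1.img /mnt/recovery/nvme0n1.map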
FAQ
1) Can I run fsck on XFS?
Not in the way you mean. XFS uses xfs_repair. Generic fsck won’t do the right thing for XFS metadata repair and can waste precious time—or worse, target the wrong device.
2) Should I always unmount before running repair tools?
Yes for actual repairs. ext4 e2fsck and XFS xfs_repair are intended for offline repair. Running them on a mounted filesystem is a great way to get “fixed” corruption plus new corruption.
3) Why is xfs_repair -n such a big deal?
Because XFS log handling is a decision point. -n shows what the tool wants to do—especially whether it wants to zero the log—before you commit to losing recent metadata transactions.
4) When is xfs_repair -L justified?
When the filesystem is down, log replay can’t succeed, and you accept losing recent metadata changes. Use it as a last resort, preferably after snapshotting/imaging and after considering restores for transactional workloads.
5) ext4 says “clean” but my app still has missing data. How?
Filesystem consistency is about metadata structures, not your application’s transactional semantics. A crash can leave an app with partial writes or missing fsyncs even when the filesystem is perfectly consistent.
6) What if the rescue environment is older than the filesystem features?
You can get bogus “corruption” reports or refusal to operate. On Ubuntu 24.04-era ext4 and XFS (with metadata checksums), use a contemporary rescue system. Don’t try to disable features as a shortcut.
7) Should I repair on the original disk or a clone?
If the data matters and you can afford it, repair on a clone/snapshot/image. It protects you from mistakes and from hardware degradation during repair. If you cannot, at least snapshot if you’re on LVM or a storage platform that supports it.
8) How do I know if the “corruption” is actually hardware?
Kernel logs showing timeouts, resets, I/O errors, CRC errors, or NVMe aborts are your strongest signal. SMART/NVMe logs and RAID controller health confirm the pattern. If those exist, stabilize hardware first.
9) Is ext4 “easier” to recover than XFS?
In many routine cases, ext4 recovery feels more linear because e2fsck is the standard path. XFS is extremely robust, but the log decision and repair behavior demand a bit more caution.
10) After repair, what should I validate?
Mount options (rw vs ro), kernel logs, free space/inode counts, and—most importantly—application-level integrity checks (DB checks, queue rebuilds, reindexing, checksums if you have them).
Next steps you can actually take
If you’re standing in front of a broken mount on Ubuntu 24.04, don’t start with a fixer. Start with control: stop writes, confirm device mapping, and determine whether hardware is betraying you. Then pick your “first filesystem tool” based on the filesystem’s failure model:
- ext4: begin with e2fsck -fn to scope, then e2fsck -fy when you’re ready to commit changes.
- XFS: begin with xfs_repair -n. Treat log discard (-L) as a deliberate trade, not a reflex.
Do one more thing that your future self will appreciate: write down what you saw in dmesg, what device you operated on, what flags you used, and why. Incident retrospectives love facts. Filesystems do too.