Nothing warms an SRE’s heart like a server that won’t boot because it’s “checking the filesystem” and politely refuses to estimate how long it’ll take. You stare at a console that says fsck is running, the progress bar crawls, and your brain starts negotiating with physics.
This is the line between routine maintenance and “something is eating my storage stack alive.” On Debian 13, the tooling is modern, the defaults are mostly sane, and yet filesystem checks can still take forever. The trick is knowing what “forever” means on your hardware, and when to stop trusting the spinner.
What’s normal vs what’s a red flag
First: understand what kind of “check” you’re watching
People say “fsck” like it’s one thing. It isn’t. On Debian 13 you might be seeing:
- Journal replay (fast-ish): ext4 replays its journal after an unclean shutdown. This is not a full scan; it’s essentially applying the last transactional writes. Often seconds to a few minutes.
- Full filesystem check (slow): scans metadata and sometimes directory trees and inodes. This is the one that can take hours on multi-terabyte volumes.
- Device/RAID rebuild (not fsck): mdadm resync or controller rebuild that makes every read slow, so fsck feels “stuck.”
- Storage layer error recovery (the scary one): the disk/SSD is retrying reads, the kernel is logging I/O errors, and fsck is waiting on the block device.
What’s “normal” time?
Normal depends on how much metadata, not just the raw disk size, and on your IO profile (HDD vs SATA SSD vs NVMe, local vs network, and whether your controller is in a bad mood). Still, some practical heuristics:
- Journal replay: usually under a minute on SSD/NVMe, a few minutes on HDD, longer if your storage is saturated or degraded.
- Full ext4 fsck: can be 10–30 minutes per terabyte on spinning disks under decent conditions. On SSD/NVMe it can be much faster, but don’t bet your on-call sleep on it.
- Huge inode counts: millions of small files are the classic fsck tax. A 500 GB volume with 200 million inodes can take longer than a 4 TB volume with mostly large files.
What “forever” looks like: red flags you should treat as incidents
These are the moments where you stop waiting and start diagnosing:
- No disk activity for minutes while fsck claims to run, and the system log shows repeated I/O errors or timeouts.
- Progress stalls at a specific pass (for ext4: Pass 1, 2, 3, 4, 5) for a long time on a small filesystem. A big filesystem can legitimately crawl, but “stuck” on a 50 GB root disk is suspicious.
- Kernel messages about resets, link down/up, or NCQ errors. That’s not filesystem complexity; that’s the device begging for retirement.
- fsck prints “UNEXPECTED INCONSISTENCY” or prompts for fixes during boot. You’re in interactive repair during a boot path. That’s a trap for unattended systems.
- Reboots repeat the check every boot, even after it “finishes.” That can mean the filesystem never cleanly remounts, or the underlying block device is lying.
One quick mental model: if fsck is slow but the disk is steadily doing work, you probably have a big problem (scale) but not necessarily a dangerous problem. If fsck is slow and the disk is doing no work, or the kernel is shouting, you probably have a dangerous problem (hardware or corruption).
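To apply that model without guessing, sample the device’s I/O counters twice and see whether they move. A minimal sketch assuming the check is grinding on /dev/sda (substitute your device); the third number after the device name in /proc/diskstats is sectors read, and if it climbs between samples, fsck is still pulling data.
cr0x@server:~$ grep -w sda /proc/diskstats; sleep 5; grep -w sda /proc/diskstats
Frozen counters under a “running” e2fsck mean it’s time to treat this as a hardware problem, not a patience problem.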
Interesting facts and context (why fsck behaves like this)
- Fact 1: The classic ext2 filesystem required full checks regularly because it had no journal; ext3/ext4 journaling reduced the need for frequent full scans.
- Fact 2: ext4 uses metadata checksums and journaling improvements that make some corruption easier to detect, but detection doesn’t automatically make repairs fast.
- Fact 3: A “clean” unmount is a bit in the superblock. If the system loses power, that bit doesn’t flip, and the next mount can trigger a check—even if user data is fine.
- Fact 4: tune2fs can schedule checks by mount count or time interval; many systems go years without a full check because the defaults are conservative for servers.
- Fact 5: The lost+found directory exists because fsck may “salvage” orphaned files by reconnecting them to something. It’s a filesystem’s version of the junk drawer.
- Fact 6: ext4 “lazy inode table initialization” makes creating new filesystems fast, but it also means background initialization can compete with IO early in a filesystem’s life.
- Fact 7: SSDs can go slow without “failing” in a dramatic way; firmware may spend time on internal garbage collection or error correction, which shows up as latency spikes and fsck delays.
- Fact 8: On large RAID arrays, a rebuild/resync can halve effective read bandwidth. fsck then becomes the messenger that gets blamed.
One engineering quote that’s survived enough postmortems to earn a chair at the table: “Hope is not a strategy.” That line is widely attributed in ops culture; treat it as a paraphrased idea from reliability engineering lore, not scripture.
Joke #1: Watching fsck progress is like watching paint dry, except the paint occasionally asks you to choose between two bad options.
Fast diagnosis playbook (find the bottleneck quickly)
Your goal: decide which bucket you’re in within 5–10 minutes.
First: are we actually running a filesystem check, and on what device?
- Check boot console messages and systemd unit status if you have access.
- Identify the block device backing the filesystem (/dev/nvme0n1p2, /dev/md0, LVM LV, etc.).
Second: is the storage stack healthy right now?
- Look for kernel I/O errors, link resets, controller timeouts.
- Check SMART health and current error counters.
- If RAID is involved, verify resync/rebuild status.
Third: is fsck making forward progress or blocked on IO?
- Measure device utilization and latency (even rough numbers help).
- Watch for steady reads; “stuck” output can still mean active scanning.
- Decide whether to let it run, reboot into rescue, or stop and image the disk.
Decision tree you should actually use
- If kernel logs show I/O errors/timeouts: treat as potential disk failure. Prioritize data safety: image/backup, replace hardware, then repair filesystem.
- If RAID rebuild/resync is active: expect slowness; consider pausing the resync (carefully) or scheduling fsck after the array stabilizes.
- If no errors and IO is active: it’s probably a legitimate long scan. Let it finish, but plan for future mitigation (tuning, splitting filesystems, faster storage).
- If it’s asking interactive questions during boot: stop doing that. Use rescue mode and run a controlled repair with logs.
Practical tasks: commands, output, and decisions (12+)
These are written for Debian 13 with systemd. Adjust device names to your reality. Each task includes: command, what the output means, and what decision to make.
Task 1: Identify what systemd is waiting on (boot-time fsck)
cr0x@server:~$ systemctl list-jobs
JOB UNIT TYPE STATE
123 systemd-fsck@dev-disk-by\x2duuid-....service start running
124 dev-mapper-vg0-root.device start waiting
2 jobs listed.
Meaning: A systemd fsck unit is actively running for a UUID-backed device.
Decision: Resolve which block device that UUID maps to next, and check for IO errors before assuming it’s “just slow.”
Task 2: Map filesystem UUID to a device
cr0x@server:~$ ls -l /dev/disk/by-uuid | head
total 0
lrwxrwxrwx 1 root root 10 Dec 30 02:11 1a2b3c4d-... -> ../../sda1
lrwxrwxrwx 1 root root 15 Dec 30 02:11 7f8e9d0c-... -> ../../dm-0
lrwxrwxrwx 1 root root 10 Dec 30 02:11 9abc0123-... -> ../../sda2
Meaning: UUID resolves to a partition (sda1) or device-mapper node (dm-0 for LVM/crypt).
Decision: If it’s dm-*, also identify the underlying PV(s). Slow fsck on dm-0 can be a failing disk under LVM.
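One way to walk from a device-mapper node down to the physical disk(s) before blaming ext4, using the dm-0 node from the listing above (a sketch; the column choice is a matter of taste):
cr0x@server:~$ lsblk -s -o NAME,TYPE,SIZE,FSTYPE /dev/dm-0
The bottom rows are the disks whose SMART data and kernel errors you actually care about.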
Task 3: Check recent kernel logs for IO pain
cr0x@server:~$ dmesg -T | tail -n 25
[Mon Dec 30 02:12:11 2025] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Mon Dec 30 02:12:11 2025] ata1.00: failed command: READ FPDMA QUEUED
[Mon Dec 30 02:12:11 2025] blk_update_request: I/O error, dev sda, sector 12345678 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Mon Dec 30 02:12:12 2025] EXT4-fs (sda2): I/O error while writing superblock
Meaning: This is not “slow fsck.” This is read errors and link freezes. ext4 also couldn’t write the superblock cleanly.
Decision: Stop treating this as a filesystem-only issue. Prepare to replace the disk; prioritize a block-level copy or backup before repeated repair writes.
Task 4: Check which filesystems are configured for automatic checks
cr0x@server:~$ cat /etc/fstab
UUID=7f8e9d0c-... / ext4 defaults,errors=remount-ro 0 1
UUID=1a2b3c4d-... /boot ext4 defaults 0 2
UUID=9abc0123-... /data ext4 defaults 0 2
Meaning: The last column is the fsck “pass” order. Root is 1 (first), others are 2 (after root), 0 disables fsck.
Decision: If you see a huge data volume with pass 2 and it’s causing long boot delays, consider setting it to 0 and running fsck in a maintenance window instead. Don’t do this to hide corruption; do it to control blast radius.
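For reference, the only thing that changes in that case is the last field of the data volume’s line; a sketch based on the /data entry above (root stays at 1):
UUID=9abc0123-... /data ext4 defaults 0 0
Pair this with a scheduled maintenance check, or you’ve just traded a slow boot for a slow-motion surprise.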
Task 5: See ext4 check schedule and last check time
cr0x@server:~$ sudo tune2fs -l /dev/sda2 | egrep -i 'Filesystem state|Mount count|Maximum mount count|Last checked|Check interval'
Filesystem state: not clean
Mount count: 41
Maximum mount count: 42
Last checked: Sun Dec 1 03:10:22 2025
Check interval: 15552000 (6 months)
Meaning: A full check is triggered when the mount count reaches its maximum or the filesystem is marked “not clean.” Here the state is “not clean” (that’s the trigger), and the mount count is also one mount away from its limit.
Decision: If this is a planned maintenance trigger, let it run. If it’s unexpected, ask why the filesystem keeps becoming “not clean” (power loss, kernel panic, storage timeouts).
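If the mount-count policy is what keeps ambushing you, the schedule itself is adjustable. A sketch reusing /dev/sda2 from above; the numbers are examples, not recommendations, and disabling both triggers only makes sense if you schedule checks some other way.
cr0x@server:~$ sudo tune2fs -c 50 -i 6m /dev/sda2   # check every 50 mounts or 6 months, whichever comes first
cr0x@server:~$ sudo tune2fs -c 0 -i 0 /dev/sda2     # disable both count- and time-based checks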
Task 6: Get real-time insight into what fsck is doing (process view)
cr0x@server:~$ ps -eo pid,etime,stat,cmd | grep -E 'fsck|e2fsck' | grep -v grep
622 01:17:43 D /sbin/e2fsck -p -C 0 /dev/sda2
Meaning: D state means uninterruptible sleep, almost always waiting on IO. That’s a smoking gun when paired with disk errors or high latency.
Decision: If it’s stuck in D and you have IO errors, stop and triage the disk. If it’s R or S with steady IO, it may be fine.
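To confirm whether a D-state e2fsck is actually moving data, watch its per-process I/O counters. A minimal sketch reusing PID 622 from the ps output above; read_bytes climbing between samples means it’s scanning, frozen counters plus kernel I/O errors means it’s waiting on a sick device.
cr0x@server:~$ sudo grep -E '^(read|write)_bytes' /proc/622/io
cr0x@server:~$ sleep 10; sudo grep -E '^(read|write)_bytes' /proc/622/io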
Task 7: Measure device latency and utilization (quick and dirty)
cr0x@server:~$ iostat -x 1 5
Linux 6.12.0 (server) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.20 0.00 2.40 72.00 0.00 24.40
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sda 90.0 14400.0 0.0 0.00 850.00 160.00 2.0 64.0 120.00 40.00 99.00
Meaning: r_await at 850ms and %util at ~99% means the disk is saturated and slow. iowait is huge. That’s why fsck feels like molasses.
Decision: If this is a single HDD on a large filesystem, it may still be “normal-ish” but painful. If this is an SSD/NVMe, it’s a red flag: firmware issues, failing media, or a degraded RAID.
Task 8: Check SMART health and error counters
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i 'SMART overall|Reallocated|Pending|Offline|Error|Power_On_Hours'
SMART overall-health self-assessment test result: FAILED!
Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always - 8123
Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 97
Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 97
ATA Error Count: 203
Meaning: Pending/offline-uncorrectable sectors mean the drive can’t reliably read some data. fsck will hammer those areas and stall.
Decision: Replace the disk. Before running aggressive repair options, image or back up what you can. Every repair is additional writes on a failing drive.
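Before any repair writes, get a block-level copy onto known-good storage. A sketch using GNU ddrescue (package gddrescue on Debian); /mnt/backup is a hypothetical mount point for a healthy destination with enough space, and the mapfile lets the copy resume if it gets interrupted.
cr0x@server:~$ sudo apt install gddrescue
cr0x@server:~$ sudo ddrescue -d -r3 /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map   # /mnt/backup is hypothetical
If you want to rehearse the repair, attach a copy of the image with losetup -P and run e2fsck against the mapped partition instead of spending more writes on the dying drive.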
Task 9: Check mdadm RAID state (if applicable)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
976630336 blocks super 1.2 [2/1] [U_]
[========>............] recovery = 41.2% (402123456/976630336) finish=210.3min speed=45512K/sec
Meaning: Array is degraded ([U_]) and recovering. Reads may be slower; IO contention can make fsck crawl.
Decision: If data is critical and performance is collapsing, you may prefer to complete the recovery first, then run a filesystem check, rather than mixing both heavy operations.
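If you decide the check has to win the IO fight for a while, md resync bandwidth can be throttled system-wide. A sketch; the value is in KB/s and purely illustrative, and you should restore the original limit once the check or the maintenance window is done.
cr0x@server:~$ sudo sysctl dev.raid.speed_limit_max
cr0x@server:~$ sudo sysctl -w dev.raid.speed_limit_max=20000   # illustrative cap, roughly 20 MB/s
There is a matching dev.raid.speed_limit_min. Lowering the cap slows the rebuild and extends the degraded window, so this is a trade-off, not a free win.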
Task 10: If using LVM, confirm which PV is slow
cr0x@server:~$ sudo lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root vg0 -wi-ao---- 120.00g /dev/sda2(0)
data vg0 -wi-ao---- 3.50t /dev/sdb1(0)
Meaning: You can map a slow filesystem to a physical disk. If / is on sda2, you know where to focus SMART and cabling checks.
Decision: Don’t guess. If the volume spans multiple PVs, you need to find the weakest link, not run fsck twice and hope.
Task 11: Run a controlled fsck from rescue mode (read-only first)
cr0x@server:~$ sudo e2fsck -f -n -v /dev/sda2
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Inode 774533 has illegal block(s). Clear? no
Pass 2: Checking directory structure
Entry 'tmp123' in /var/tmp (12345) has deleted/unused inode 778899. Clear? no
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda2: ********** WARNING: Filesystem still has errors **********
/dev/sda2: 164233/7864320 files, 19200322/31457280 blocks
Meaning: -n opens the filesystem read-only and answers “no” to every question, so it’s safe for assessment. It still reports what it would fix; the “still has errors” warning (and a non-zero exit status) tells you a real repair is needed.
Decision: If errors are reported, plan downtime and run the repair without -n (or with -p for preen) only after confirming the disk is healthy enough to survive writes.
Task 12: Run repair with logging (and accept that you own the outcome)
cr0x@server:~$ sudo e2fsck -f -y -v /dev/sda2 | tee /root/e2fsck-root.log
Pass 1: Checking inodes, blocks, and sizes
Inode 774533 has illegal block(s). Clear? yes
Pass 2: Checking directory structure
...
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: 164233/7864320 files, 19200320/31457280 blocks
Meaning: -y answers “yes” to prompts. This is blunt force repair. Useful for unattended recovery, dangerous if the underlying issue is failing hardware.
Decision: Use -y when you’ve decided that restoring mountability is more important than preserving every last file name. For critical systems, consider a backup/restore path instead of aggressive repair.
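One habit worth keeping: record e2fsck’s exit status, because it’s a bitmask, not a boolean. Roughly: 0 means no errors, 1 means errors were corrected, 2 means errors were corrected and a reboot is recommended, 4 means errors were left uncorrected, and higher bits indicate operational or usage problems. A sketch of the same repair with the status captured (PIPESTATUS is bash-specific):
cr0x@server:~$ sudo e2fsck -f -y -v /dev/sda2 | tee /root/e2fsck-root.log; echo "e2fsck exit=${PIPESTATUS[0]}"
Anything with the 4 bit set means you are not done: re-run the check and keep the hardware question open.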
Task 13: Check ext4 features that may affect check behavior
cr0x@server:~$ sudo dumpe2fs -h /dev/sda2 | egrep -i 'Filesystem features|Checksum|metadata_csum|64bit'
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit metadata_csum
Meaning: Modern ext4 features like metadata_csum and 64bit are normal on large filesystems. They can change how checks are performed and what errors look like.
Decision: If you’re migrating disks across older toolchains, ensure your rescue environment has a recent e2fsck. Ancient tools plus modern features equals a bad day.
Task 14: Confirm mounts and avoid running fsck on a mounted filesystem
cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE / /data
/dev/sda2 / ext4
/dev/sdb1 /data ext4
Meaning: Shows which devices are mounted where.
Decision: Do not run a writable fsck on a mounted filesystem. If you must check a mounted ext4, use fsck -n only for inspection and schedule downtime for real repairs.
Task 15: Force a check next reboot (controlled, not accidental)
cr0x@server:~$ sudo touch /forcefsck
cr0x@server:~$ ls -l /forcefsck
-rw-r--r-- 1 root root 0 Dec 30 02:45 /forcefsck
Meaning: Many Debian setups honor /forcefsck at boot to trigger checks.
Decision: Use it when you want a planned check, not when you want to gamble with a surprise 2-hour boot delay during a deploy window.
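There is also a systemd-native way to do the same thing without touching the filesystem, handy when / is read-only or you’re already staring at the bootloader: systemd-fsck honors the kernel command line parameters fsck.mode= (auto, force, skip) and fsck.repair= (preen, yes, no). A sketch; for a one-off you’d append them to the linux line from the GRUB edit screen rather than baking them into the config.
cr0x@server:~$ cat /proc/cmdline   # see what the current boot was started with
Adding fsck.mode=force fsck.repair=preen for a single boot gives you a forced, non-interactive check without leaving marker files behind.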
Task 16: Find why the previous shutdown was unclean
cr0x@server:~$ journalctl -b -1 -p warning..alert | tail -n 30
Dec 30 01:58:09 server kernel: EXT4-fs warning (device sda2): ext4_end_bio:345: I/O error 10 writing to inode 2625315 starting block 1234567)
Dec 30 01:58:12 server systemd[1]: Failed to start Flush Journal to Persistent Storage.
Dec 30 01:58:16 server kernel: Kernel panic - not syncing: Fatal exception
Meaning: The previous boot ended badly. fsck isn’t the cause; it’s a consequence.
Decision: Fix the trigger (disk errors, panic, power issues) or you’ll be back here again, enjoying the same “filesystem check takes forever” performance art.
Three corporate-world mini-stories (what actually happens)
1) Incident caused by a wrong assumption: “It’s just a big disk”
A mid-sized company ran a Debian fleet hosting internal build artifacts. One morning, a critical build runner stopped booting. The console showed an ext4 check. The on-call shrugged: big disk, lots of files, it’ll finish.
Two hours later, still running. The progress messages were advancing slowly, but not at the pace you’d expect from an NVMe-backed host. The team’s first mistake was assuming “slow” equals “normal,” and not looking at kernel logs. They were treating fsck as a maintenance task, not as a symptom.
Eventually someone checked dmesg and found link resets on the SATA path feeding a cheap SATA SSD used as a “temporary” cache tier that had become permanent. Reads were retrying. The disk wasn’t dead; it was dying in that polite way where it drags everyone down with it.
The repair finished, but the filesystem needed another check at the next boot because the disk threw more errors during the final superblock writes. Only then did they stop and replace the device. The lesson was painfully boring: storage errors masquerade as “long checks” before they become obvious outages.
Afterwards they added a runbook: if fsck exceeds a threshold, check SMART and dmesg within ten minutes. Not because they love process, but because they love sleep.
2) Optimization that backfired: “Let’s crank up RAID rebuild speed”
Another organization ran a pair of large RAID1 arrays on commodity servers. A disk failed, RAID rebuild started, and someone decided to “optimize” by increasing rebuild speed to shorten the degraded window. On paper, it was rational: faster rebuild, less risk.
In reality, the box also hosted a critical database and a pile of log processing jobs. Increasing rebuild aggressiveness starved latency-sensitive reads. The database started timing out, the kernel queue depths grew, and then came the reboot—because someone thought the database was “hung.” It wasn’t. It was suffocating.
On reboot, the filesystem check started. Now fsck was competing with the still-running RAID recovery. The “optimization” turned a one-problem day into a two-problem day: degraded RAID plus painfully slow fsck, plus application recovery time.
The fix was not heroic. They returned rebuild speed to a conservative setting during business hours, scheduled heavy rebuilds and checks off-peak, and introduced monitoring on latency, not just throughput. They also stopped rebooting machines because applications looked slow. If you reboot a box in distress, you’re mostly rebooting your own assumptions.
3) Boring but correct practice that saved the day: “Maintenance windows and rehearsed recovery”
A third team ran Debian servers for a compliance-heavy environment. They had the least exciting engineering culture you can imagine, and it was glorious. They scheduled quarterly storage maintenance windows, including controlled reboots, verification of SMART attributes, and a planned fsck on a rotating subset of machines.
When a host eventually suffered an unclean shutdown due to a power distribution issue, the on-call saw a long fsck and didn’t panic. They already had baselines: how long a full check takes on that specific volume and hardware, and what the normal passes look like. They also had a known-good rescue image with compatible e2fsck versions.
During diagnosis they found no I/O errors, just a filesystem marked unclean. They let it run, it finished within the expected window, and services returned. No drama, no improvised heroics.
The next day they audited the power event, confirmed it wasn’t recurring, and moved on. Their “secret sauce” was not intelligence. It was rehearsal and boring baselines. The kind of stuff nobody wants to fund until the outage bill arrives.
Joke #2: Filesystems are like corporate org charts—everything seems consistent until you try to reconcile who reports to whom.
Common mistakes: symptom → root cause → fix
1) “fsck is stuck at 0–5% forever”
Symptom: Early pass takes ages, console shows little movement.
Root cause: Device is returning reads with huge latency due to retries, or you’re on degraded RAID/resync.
Fix: Check dmesg for I/O errors, run iostat -x for await/%util, and inspect SMART counters. If RAID is rebuilding, decide whether to finish rebuild first.
2) “fsck runs every reboot”
Symptom: System always checks root on boot even after clean shutdown.
Root cause: Filesystem never gets marked clean: storage errors, forced checks due to mount-count policy, or shutdowns not reaching clean unmount.
Fix: Confirm tune2fs -l state/mount count, check logs for previous crash, fix the underlying reboot/power issue, and ensure shutdown completes.
3) “It prompts for manual intervention during boot”
Symptom: Boot drops to a prompt asking to fix errors.
Root cause: Non-preenable errors or policy mismatch (system expecting -p but errors require manual decisions).
Fix: Boot into rescue/emergency, run e2fsck -f -n first to assess, then repair with logging. Consider adjusting how the system handles checks so production boots aren’t interactive games.
4) “fsck is slow on NVMe, which should be fast”
Symptom: Minutes feel like hours on modern storage.
Root cause: NVMe thermal throttling, firmware-level error correction, PCIe link issues, or virtualization storage contention.
Fix: Check dmesg for NVMe resets, use iostat to see awaits, and inspect NVMe SMART/health. If virtualized, verify host storage health and noisy neighbors.
5) “We disabled fsck in fstab and now corruption got worse”
Symptom: After a series of crashes, filesystem mounts but behaves oddly, later becomes unmountable.
Root cause: Disabling checks to speed boots hid escalating metadata problems.
Fix: Re-enable checks for root, schedule checks for large data volumes, and build a maintenance workflow instead of hoping journaling solves everything.
6) “We ran fsck on a mounted filesystem and it got weird”
Symptom: Files disappear, directories misbehave, later errors increase.
Root cause: Running repair while the filesystem is mounted and changing.
Fix: Stop. Remount read-only if possible, then repair offline from rescue or a maintenance reboot. If data is critical, snapshot/backup first.
Checklists / step-by-step plan
Checklist A: When fsck is running at boot and you’re on-call
- Confirm what’s running: identify the fsck unit and the device (Tasks 1–2).
- Look for kernel storage errors: dmesg and the previous boot journal (Tasks 3 and 16).
- Measure IO health: iostat -x for latency/utilization (Task 7).
- Check SMART / NVMe health: confirm you’re not grinding on a failing device (Task 8).
- Check RAID/LVM layers: verify you’re not competing with rebuild or a degraded path (Tasks 9–10).
- Decide:
- If device errors: stop repeated repairs, plan disk replacement and data recovery.
- If no errors but big volume: let it run; communicate ETA range, not a promise.
- If interactive prompts: reboot into rescue and do controlled repair with logs.
Checklist B: Controlled repair workflow (the “I want my life back” plan)
- Boot into rescue/emergency mode or from an installer/live image.
- Ensure the filesystem is unmounted: verify with findmnt (Task 14).
- Run a read-only assessment: e2fsck -f -n -v (Task 11).
- If the disk is healthy enough, run the repair with logs: e2fsck -f -y -v | tee (Task 12).
- Re-run fsck until it reports clean. One pass isn’t always enough after heavy repairs.
- Mount and sanity-check: confirm key directories, services, and application-level checksums if you have them.
- After boot, collect evidence: SMART, kernel logs, and the fsck log file for the incident record.
Checklist C: Prevent the next 3am fsck marathon
- Baseline check times per host class (HDD vs SSD vs NVMe, RAID vs non-RAID); see the sketch after this checklist.
- Ensure monitoring alerts on disk latency and SMART degradation, not just “disk full.”
- Review /etc/fstab pass values. Don’t let multi-terabyte data volumes block boots unless you mean it.
- Schedule periodic controlled checks for large volumes if your risk model requires it.
- Fix unclean shutdown causes: flaky power, unstable kernels/drivers, overheating controllers, bad cables.
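To make the baseline item concrete: time a read-only check during a maintenance window, with the volume unmounted, and keep the log. A sketch using the /data device from earlier; the date-stamped filename is just a convention.
cr0x@server:~$ time sudo e2fsck -f -n -v /dev/sdb1 | tee /root/fsck-baseline-data-$(date +%F).log   # run with /data unmounted
A read-only pass approximates the read workload of a real check closely enough to give the next on-call an honest “this usually takes N minutes” instead of a shrug.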
FAQ
1) How long should ext4 fsck take on Debian 13?
Anywhere from seconds (journal replay) to hours (full scan). For multi-terabyte HDD filesystems with lots of small files, hours can be normal. For SSD/NVMe, hours is suspicious unless the filesystem is enormous or the device is unhealthy.
2) Why does fsck appear stuck with no percentage updates?
Some passes don’t emit frequent progress, and boot consoles can buffer output. More importantly, fsck can be blocked on IO and still “run.” Use ps state and iostat -x to tell the difference.
3) Can I cancel fsck?
You can, but it’s usually a bad bargain. Interrupting a repair can leave the filesystem in a worse state. If you suspect hardware failure, the correct “cancel” is to stop writes, image the disk, and recover safely.
4) Is it safe to run fsck on a mounted filesystem?
For ext4, running a repairing fsck on a mounted filesystem is not safe. Use -n only for read-only inspection if you must, and schedule downtime for offline repair.
5) Why does Debian run fsck at boot at all?
Because mounting a corrupted filesystem read-write can spread damage. The boot-time check is a guardrail. Sometimes it’s annoying. It’s still better than silent corruption.
6) My server is virtualized. What changes?
Latency lies. A VM can show “slow disk” because the host is overloaded, the storage backend is degraded, or you’re contending with noisy neighbors. The VM-level diagnostics still help, but you may need host-side confirmation to close the loop.
7) Should I disable automatic checks for big data volumes?
Often, yes—if you replace them with scheduled maintenance checks and monitoring. Use 0 in the fsck pass field for the data mount in /etc/fstab, keep root checked, and run controlled fsck in a window.
8) What’s the difference between journal replay and a full fsck?
Journal replay applies pending journal transactions and is usually quick. Full fsck scans metadata structures and can take a long time, especially with many inodes and directory entries.
9) fsck finishes, but the system still won’t boot. Now what?
Check whether the boot loader or initramfs is missing files, whether /etc/fstab references the wrong UUID, and whether the underlying disk is still erroring. A “successful” fsck on a failing disk can be temporary.
10) Does XFS/Btrfs/ZFS have the same fsck problem?
They have different failure modes. XFS uses xfs_repair (offline, can be heavy). Btrfs has its own tooling and can be painful under certain corruptions. ZFS is different entirely (scrubs, checksumming). The common theme: storage latency and hardware faults make every repair look slow.
Next steps that won’t waste your night
If fsck on Debian 13 “takes forever,” don’t treat it like weather. Treat it like a diagnostic moment.
- Within 10 minutes: check dmesg, iostat -x, and SMART. Decide if you’re dealing with corruption-on-healthy-storage or corruption-because-storage-is-failing.
- If hardware looks sick: stop repeated repair attempts, protect data, replace the device, then repair/restore.
- If hardware looks healthy: let fsck finish, but fix the cause of unclean shutdowns and revisit boot-time check policy for huge non-root volumes.
- After recovery: capture the fsck log, adjust /etc/fstab passes thoughtfully, and baseline check durations so the next on-call doesn’t have to guess.
You don’t need to love filesystem checks. You just need to stop being surprised by them.