It always starts small: a deploy that “can’t write to /var”, a logrotate job that silently fails, a package install that dies mid-flight. Then you notice it: Read-only file system. On Debian 13, it’s not a vibe. It’s the kernel telling you it saw something bad enough that it stopped trusting your disk.
Your job is not to “make it writable again.” Your job is to figure out why the system decided it couldn’t write safely, preserve evidence, and recover without turning a recoverable incident into a data-loss memoir.
What “read-only file system” actually means (and why Debian did it)
When you see “Read-only file system,” you’re typically looking at one of three realities:
- The filesystem is mounted read-only (either from boot, or remounted read-only at runtime).
- The block device is refusing writes (device-level write-protect, failing media, SAN/virtual disk in trouble).
- You’re not actually writing where you think (overlay/containers, bind mounts, permissions, or a full/readonly upper layer).
On Debian 13 with ext4 (still the default in many installs), the common “surprise” pattern is: a write fails in a way that suggests corruption or I/O instability; ext4 decides continuing would risk further damage; the kernel remounts the filesystem read-only to stop the bleeding. XFS has similar “I’m not playing this game” behavior when it detects certain classes of problems.
That’s not the OS being dramatic. It’s the OS doing the least-wrong thing in the face of uncertain storage.
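A quick way to confirm which policy a given ext4 filesystem will actually apply: the superblock records a default, and the errors= option in /etc/fstab can set it per mount. A minimal check, using device names that match the examples later in this article (treat them as placeholders):
cr0x@server:~$ tune2fs -l /dev/mapper/vg0-root | grep -i 'errors behavior'
Errors behavior:          Remount read-only
cr0x@server:~$ grep 'errors=' /etc/fstab
/dev/mapper/vg0-root  /  ext4  errors=remount-ro  0  1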
Here’s the operational rule: the read-only flip is a symptom, not a diagnosis. The diagnosis lives in your logs and in the storage stack beneath the filesystem.
One quote worth keeping on the wall: “Hope is not a strategy.” — Gene Kranz. In incident response, that’s not motivational; it’s a reminder not to “just reboot and see.”
Fast diagnosis playbook (first/second/third)
This is the shortest path to identifying the failing layer: filesystem vs block device vs platform. Don’t freestyle. Do this in order.
First: confirm what is read-only, and whether it changed at runtime
- Is it the root filesystem or a data mount?
- Is it ro in mount flags?
- Did the kernel remount it due to errors? (A quick scan for read-only mounts is sketched below.)
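One hedged shortcut before you go mount by mount: list every mount whose options carry a standalone ro flag. The awk pattern is written so errors=remount-ro doesn’t trigger a false positive; the /var hit shown here is illustrative.
cr0x@server:~$ findmnt -rn -o TARGET,OPTIONS | awk '$2 ~ /(^|,)ro(,|$)/'
/var ro,relatime,errors=remount-ro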
Second: scan kernel logs for the trigger line
- ext4: “Errors detected… remounting filesystem read-only”
- XFS: “Log I/O error” or “xfs_do_force_shutdown”
- block layer: “I/O error”, “timeout”, “reset”, “aborted command”
- device: NVMe “media errors”, SATA “UNC”, SCSI “sense key”
Third: decide whether you’re dealing with failing hardware or recoverable corruption
- If there are repeated I/O errors/timeouts/reset loops: treat as hardware/platform first. Plan to replace/migrate.
- If it’s a single burst (power loss, crash) and device health looks clean: you can likely fsck repair and move on.
If you’re on-call and tempted to remount read-write immediately, remember: you can’t out-configure a dying NVMe controller.
Interesting facts and context (short, concrete, useful)
- Fact 1: ext4 inherits its “errors behavior” from ext3-era thinking: when metadata integrity is questionable, remounting read-only is a safety valve.
- Fact 2: The ext4 mount option errors=remount-ro has been a long-standing default on many distributions, because “keep writing” is how corruption spreads.
- Fact 3: XFS doesn’t “fsck” in the traditional sense; it uses xfs_repair, and it can force-shutdown the filesystem when it detects log inconsistency or I/O errors.
- Fact 4: NVMe drives expose “Media and Data Integrity Errors” counters; non-zero doesn’t always mean imminent death, but rising counts do (see the sketch after this list).
- Fact 5: A hypervisor can present a disk as read-only when the datastore is full, when snapshots chain badly, or when it detects underlying storage trouble.
- Fact 6: Linux can cleanly mount a filesystem read-only during boot if it needs recovery (journal replay) that can’t be completed safely, or if the journal replay fails.
- Fact 7: LVM and dm-crypt typically pass I/O errors upward; the filesystem is the layer that decides “I’ll go read-only now.” That’s why root cause is usually below the filesystem.
- Fact 8: Many “read-only” incidents are triggered by timeouts, not corruption: transient link issues (SATA cable/backplane, HBA firmware, SAN path flaps) can cause enough failed writes to trip protective behavior.
- Fact 9: Debian’s recovery story improved dramatically once systemd made emergency/rescue modes more standardized; you can get a shell without guessing runlevels.
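For Fact 4 specifically: if the nvme-cli package is installed, you can pull the raw counters straight from the controller and trend them over time. Device name and numbers are illustrative, chosen to line up with the smartctl example further down.
cr0x@server:~$ nvme smart-log /dev/nvme0 | egrep -i 'critical_warning|media_errors|num_err_log_entries'
critical_warning        : 0
media_errors            : 12
num_err_log_entries     : 57
A single snapshot proves little; the useful signal is whether those counters grew since the last capture.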
Practical tasks: commands, outputs, decisions (12+)
These are not “run everything because it feels productive.” Each task has: command, what the output means, and the decision you make.
Task 1: Confirm mount flags (ro vs rw) and scope
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /
/ /dev/mapper/vg0-root ext4 rw,relatime,errors=remount-ro
Meaning: Root is currently writable (rw). If you’re still seeing errors, it may be a different mount like /var or /srv, or an overlay in a container.
Decision: If you see ro, continue. If rw, locate the specific path/device that’s read-only with findmnt -R /path or check the service container mount.
Task 2: Locate which filesystem contains the failing path
cr0x@server:~$ findmnt -T /var/lib/postgresql -no TARGET,SOURCE,FSTYPE,OPTIONS
/var /dev/mapper/vg0-var ext4 ro,relatime,errors=remount-ro
Meaning: It’s /var, not root. That changes recovery: you might unmount just /var in single-user mode and repair it, leaving root alone.
Decision: Identify the device (/dev/mapper/vg0-var) and focus diagnostics there.
Task 3: Check kernel ring buffer for the first error
cr0x@server:~$ dmesg -T | egrep -i 'EXT4-fs error|remounting filesystem read-only|I/O error|blk_update_request|Buffer I/O error|nvme|ata[0-9]|reset|timed out' | tail -n 25
[Sat Dec 28 11:04:10 2025] EXT4-fs error (device dm-2): ext4_journal_check_start:83: Detected aborted journal
[Sat Dec 28 11:04:10 2025] EXT4-fs (dm-2): Remounting filesystem read-only
Meaning: ext4 aborted the journal on dm-2 (likely your LVM LV). That’s either corruption from an unclean shutdown or upstream I/O trouble.
Decision: Now you must determine if the abort was triggered by I/O errors below (disk, path, controller) or by a one-off crash.
Task 4: Correlate with journal logs across boots
cr0x@server:~$ journalctl -k -b -1 -g 'Remounting filesystem read-only|I/O error|nvme|ata|sd ' --no-pager | tail -n 60
Dec 28 10:12:03 server kernel: nvme nvme0: I/O 123 QID 5 timeout, aborting
Dec 28 10:12:04 server kernel: nvme nvme0: Abort status: 0x371
Dec 28 10:12:05 server kernel: EXT4-fs (dm-2): I/O error while writing superblock
Dec 28 10:12:05 server kernel: EXT4-fs (dm-2): Remounting filesystem read-only
Meaning: That’s not “unclean shutdown.” That’s the drive timing out. ext4 then fails to write metadata and flips to read-only.
Decision: Treat it as storage instability first. Plan for migration/replacement. Do not “fsck and forget.”
Task 5: Map device-mapper names to real disks
cr0x@server:~$ lsblk -o NAME,KNAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS
nvme0n1 nvme0n1 disk 1.8T
├─nvme0n1p1 nvme0n1p1 part 1G vfat /boot/efi
└─nvme0n1p2 nvme0n1p2 part 1.8T LVM2_member
├─vg0-root dm-0 lvm 80G ext4 /
├─vg0-var dm-2 lvm 200G ext4 /var
└─vg0-home dm-3 lvm 100G ext4 /home
Meaning: The failing filesystem is on dm-2 which maps to /dev/mapper/vg0-var on nvme0n1p2.
Decision: Your blast radius includes everything on that NVMe. Prepare for a full-disk incident, not a single LV problem.
Task 6: Check if the underlying block device is itself read-only
cr0x@server:~$ cat /sys/block/nvme0n1/ro
0
Meaning: The kernel does not consider the device write-protected. The read-only state is likely at filesystem level, not hardware write-protect.
Decision: Continue with device health checks; don’t assume “hardware lock.”
Task 7: NVMe SMART / health snapshot
cr0x@server:~$ smartctl -a /dev/nvme0n1 | egrep -i 'model number|firmware|critical warning|media and data integrity errors|error information log entries|percentage used|available spare|temperature'
Model Number: ACME NVMe 2TB
Firmware Version: 3B2QEXM7
Critical Warning: 0x00
Temperature: 63 Celsius
Available Spare: 100%
Percentage Used: 7%
Media and Data Integrity Errors: 12
Error Information Log Entries: 57
Meaning: Non-zero media/integrity errors plus many error log entries is not “fine.” Maybe it limps for months, maybe it dies tonight. Either way, it has a story.
Decision: If you care about the data, treat this as replace/migrate. Use repairs to get stable enough to evacuate, not to “restore confidence.”
Task 8: SATA/SCSI disks: check for link resets and bad sectors
cr0x@server:~$ dmesg -T | egrep -i 'ata[0-9].*reset|SATA link|UNC|I/O error|failed command|sense key|medium error' | tail -n 20
[Sat Dec 28 09:58:31 2025] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Sat Dec 28 09:58:33 2025] ata2.00: failed command: WRITE FPDMA QUEUED
[Sat Dec 28 09:58:33 2025] blk_update_request: I/O error, dev sdb, sector 918273645 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
Meaning: Link flaps + failed writes. That can be disk media, controller, backplane, cable, power, or the universe being mean.
Decision: Stop trying to “fix the filesystem” until you stabilize the transport. Move workload off, then investigate hardware.
Task 9: Determine if the filesystem thinks it had errors (ext4)
cr0x@server:~$ tune2fs -l /dev/mapper/vg0-var | egrep -i 'Filesystem state|Errors behavior|Last error|Last mount time|Last checked'
Filesystem state: clean with errors
Errors behavior: Remount read-only
Last error: ext4_journal_check_start: Detected aborted journal
Last mount time: Sat Dec 28 10:11:57 2025
Last checked: Tue Dec 10 02:14:18 2025
Meaning: ext4 recorded an error state. Even if you remount rw, you’re now operating on a filesystem that admits it had a bad day.
Decision: Plan a controlled unmount and fsck. If the device is flaky, do it after evacuation or from a stable recovery environment.
Task 10: Check for “disk full masquerading as read-only”
cr0x@server:~$ df -hT /var
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg0-var ext4 197G 197G 0 100% /var
Meaning: Full filesystems usually cause “No space left on device,” not “read-only file system.” But they can trigger cascading failures: journal issues, application crashes, and subsequent unsafe states.
Decision: Still fix space pressure immediately (delete safely, rotate logs, move caches). But if the mount is ro, you also need the real trigger from logs.
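If space pressure is part of the picture, a bounded look at where /var went beats guessing. This is a sketch, not a cleanup script; -x keeps du on one filesystem so bind mounts and other filesystems don’t pollute the numbers.
cr0x@server:~$ du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -n 15
Work from the top of that list deliberately; don’t bulk-delete under /var/lib unless you know exactly what owns it.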
Task 11: Confirm whether systemd put you into emergency mode due to fsck failure
cr0x@server:~$ systemctl status
● server
State: degraded
Jobs: 0 queued
Failed: 1 units
Since: Sat 2025-12-28 11:07:12 UTC; 2min 13s ago
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● var.mount loaded failed failed /var
Meaning: Boot proceeded but a mount unit failed. Sometimes systemd drops you to an emergency shell; sometimes it keeps going with partial service availability.
Decision: Focus on the failed unit and its device; don’t chase random application errors first.
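To see why the unit failed rather than just that it failed, ask systemd for the unit’s own view and its journal entries; var.mount matches the example output above.
cr0x@server:~$ systemctl status var.mount --no-pager
cr0x@server:~$ journalctl -u var.mount -b --no-pager | tail -n 20
Mount units generated from /etc/fstab usually log the exact mount or fsck error that scrolled past on the boot console.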
Task 12: Attempt a safe remount to collect logs (only if storage looks stable)
cr0x@server:~$ mount -o remount,rw /var
mount: /var: cannot remount /dev/mapper/vg0-var read-write, is write-protected.
Meaning: Either the kernel has forced it read-only due to errors, or the device layer is refusing writes.
Decision: Don’t loop this command. Go back to logs and device health; consider recovery boot and offline fsck.
Task 13: Check mdadm RAID state (if applicable)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
976630336 blocks super 1.2 [2/1] [U_]
bitmap: 1/8 pages [4KB], 65536KB chunk
unused devices: <none>
Meaning: Degraded RAID1: one disk missing or failed. Reads may still work; writes may get complicated depending on failure mode and settings.
Decision: Replace the failed member. If you have filesystem errors too, decide whether the RAID issue preceded the filesystem flip or vice versa.
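To identify which member dropped and what state the array believes it is in, mdadm --detail is more explicit than /proc/mdstat. The array name follows the example above; the trimmed output is illustrative.
cr0x@server:~$ mdadm --detail /dev/md0 | egrep 'State :|Active Devices|Failed Devices'
            State : clean, degraded
   Active Devices : 1
   Failed Devices : 0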
Task 14: Check LVM for missing PVs or read-only activation
cr0x@server:~$ pvs -o pv_name,vg_name,pv_size,pv_free,pv_attr
PV VG PSize PFree Attr
/dev/nvme0n1p2 vg0 1.80t 1.42t a--
Meaning: PV is present and active. If you saw missing PVs or partial activation, that would steer you toward cabling/storage path issues immediately.
Decision: Continue to filesystem/device checks; LVM isn’t the immediate villain here.
Task 15: Verify whether you’re in a container overlay situation
cr0x@server:~$ mount | egrep -i 'overlay|docker|containerd' | head
overlay on /var/lib/docker/overlay2/abc123/merged type overlay (ro,relatime,lowerdir=...,upperdir=...,workdir=...)
Meaning: The overlay mount is read-only; that can happen if the upperdir filesystem went read-only or if the runtime remounted it due to issues.
Decision: Don’t chase “Docker bug.” Find the upperdir backing filesystem and diagnose that mount/device.
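One hedged way to trace an overlay back to its backing filesystem when Docker’s overlay2 driver is in play: ask the runtime for the container’s upper directory, then ask findmnt which mount holds it. The container name myapp is a placeholder; podman and containerd have equivalent inspect commands.
cr0x@server:~$ docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' myapp
/var/lib/docker/overlay2/abc123/diff
cr0x@server:~$ findmnt -T /var/lib/docker/overlay2/abc123/diff -no TARGET,SOURCE,FSTYPE,OPTIONS
/var /dev/mapper/vg0-var ext4 ro,relatime,errors=remount-ro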
Root-cause map: storage, filesystem, block layer, and userspace triggers
1) The filesystem protected itself after a real I/O error
This is the most common and the most important. ext4 and XFS don’t flip to read-only because they’re bored. They do it because the kernel told them a write failed, or because the filesystem detected internal inconsistency.
Typical kernel lines include:
- blk_update_request: I/O error
- Buffer I/O error
- EXT4-fs error ... Detected aborted journal
- XFS (dm-0): log I/O error
The underlying causes range from failing SSDs to flaky HBAs, from SAN path instability to “someone kicked the power cable.” Your goal is to identify which layer misbehaved first.
2) The device (or hypervisor) is effectively write-protecting you
Even if Linux thinks /sys/block/.../ro is 0, your writes can still fail because the platform is denying them. In VMs, “read-only” can be the polite version of “datastore is out of space” or “snapshot chain is broken.” On SAN, it can be an array-side protection mode after detecting an issue.
Clue: you see SCSI sense codes, persistent timeouts, or weirdly consistent failures across filesystems at once.
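A quick way to separate device-level write protection from filesystem-level read-only: lsblk’s RO column reflects the block-device flag, not the mount state, so RO=1 on a disk or LV that should be writable points below the filesystem. The output here mirrors the earlier lsblk example.
cr0x@server:~$ lsblk -o NAME,TYPE,RO,SIZE,MOUNTPOINTS
NAME          TYPE RO  SIZE MOUNTPOINTS
nvme0n1       disk  0  1.8T
├─nvme0n1p1   part  0    1G /boot/efi
└─nvme0n1p2   part  0  1.8T
  ├─vg0-root  lvm   0   80G /
  ├─vg0-var   lvm   0  200G /var
  └─vg0-home  lvm   0  100G /home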
3) The filesystem mounted read-only at boot because recovery couldn’t be completed
If journal replay fails, or if the filesystem is marked as needing a check that the boot sequence can’t complete safely, the system can come up with read-only mounts or drop you into emergency mode.
In that scenario, the system is basically saying: “I could mount this, but I do not trust it enough to write to it unattended.” It’s a reasonable boundary.
4) You’re not on a normal filesystem
OverlayFS, squashfs, live images, immutable base images, and certain security-hardened setups can make “read-only” the intended default. The difference is: intended read-only is consistent and clean; accidental read-only comes with errors in the kernel log and usually a specific timestamp where things went sideways.
5) The system is fine, your assumption is not
Sometimes the filesystem is writable; your application is hitting a read-only bind mount, or trying to write to a path on a read-only NFS export, or it’s under an SELinux/AppArmor policy that makes the error look like “read-only” at the app layer. Debian 13 itself isn’t special here, but modern deployments are layered enough to lie convincingly.
Joke #1: Storage failures are like meetings—if you think you’re done in 15 minutes, you just haven’t found the real agenda yet.
Three corporate mini-stories (anonymized, plausible, and instructive)
Mini-story 1: The incident caused by a wrong assumption
The company ran a tidy Debian fleet on decent hardware. One Friday, a production host started throwing “Read-only file system” during a routine package upgrade. The on-call did what many of us have done under pressure: remounted root read-write, reran the upgrade, and declared victory.
The wrong assumption was subtle: they assumed the filesystem remounted read-only due to an unclean shutdown the week before. That’s a normal cause, and the host had been rebooted recently. Narrative cohesion is seductive.
But the kernel logs had two lines that didn’t fit the story: intermittent NVMe timeouts and aborted commands. They were ignored because the box “felt fast” and monitoring showed normal latency most of the time.
Over the weekend, the drive’s failure mode got worse. Metadata writes failed more frequently, the filesystem flipped read-only again, and the database started journaling to whatever it could still write—until it couldn’t. Monday morning brought not just downtime, but a messy recovery: partial WAL segments, inconsistent application state, and a restore-from-backup decision made under executive supervision.
The fix wasn’t clever. They replaced the NVMe, restored cleanly, and updated runbooks: any read-only remount requires checking block-layer errors before touching the filesystem. The second time this happened months later, the response was boring and fast. Boring is a compliment.
Mini-story 2: The optimization that backfired
A different org had a “performance tiger team.” They were tuning high-write workloads and wanted fewer latency spikes. Someone suggested aggressive mount options and queue tuning, and they also extended disk health polling intervals to reduce monitoring overhead. It shaved a bit of noise off the graphs. Everybody felt smart.
Then one host began to see rare timeouts on its storage path. With less frequent health checks and fewer early-warning signals, the first visible symptom was ext4 remounting a busy data volume read-only. By the time anyone looked, the workload had already piled up errors and retries, amplifying application-level timeouts.
The post-incident analysis showed the storage errors were detectable earlier: SMART error log entries and increasing timeout counts in kernel logs. But the “optimized” monitoring cadence didn’t catch the trend. The team had optimized for quiet dashboards instead of early detection.
The lesson was unglamorous: keep health telemetry frequent enough to detect drift. A handful of extra metrics is cheaper than a mid-day production freeze. They reverted the polling change and added a specific alert for “filesystem remounted read-only” events extracted from the kernel log.
Mini-story 3: The boring but correct practice that saved the day
A finance company ran Debian systems with strict change control and a habit that seemed almost quaint: every storage-backed host had a tested “evacuation plan” and a known-good recovery ISO available via remote management. They also kept a small, separate /var/log partition sized sanely and rotated aggressively. Not exciting. Very adult.
One morning, a host flipped /var read-only. The app team panicked because it looked like “the server is dead.” The SRE on duty pulled the playbook: verify mount flags, check kernel logs, confirm underlying NVMe errors, and stop writing.
Because logs were on a separate mount with enough headroom, the error trail was intact. They captured journalctl, saved SMART data, and initiated a planned failover to a standby node. Then they booted the bad host into recovery mode, ran a read-only check first, and only repaired after data evacuation. No heroic midnight surgery.
It took longer to explain to management why the fix was “replace the drive” than it took to keep the service running. That’s how you want it.
Recovery patterns that don’t make things worse
Priority 1: Stop the bleeding, preserve evidence
If the filesystem flipped read-only due to I/O errors, continuing to hammer it with retries can make the platform less stable. Your first move is to reduce write pressure:
- Stop heavy writers (databases, log shippers, queues).
- If it’s a non-root volume, unmount it if possible.
- Capture evidence: kernel log excerpt, SMART data, lsblk mapping, and the mount table (a capture sketch follows this list).
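A minimal capture sketch, assuming you still have somewhere writable (a tmpfs such as /dev/shm usually survives a read-only data mount) and that the tools used in the tasks above are installed. Copy the directory off the host before any reboot.
cr0x@server:~$ mkdir -p /dev/shm/evidence && cd /dev/shm/evidence
cr0x@server:~$ journalctl -k -b --no-pager > kernel-journal.txt
cr0x@server:~$ dmesg -T > dmesg.txt
cr0x@server:~$ smartctl -a /dev/nvme0n1 > smart-nvme0n1.txt
cr0x@server:~$ lsblk -o NAME,KNAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS > lsblk.txt
cr0x@server:~$ findmnt > findmnt.txt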
Priority 2: Decide whether you’re evacuating or repairing in place
I’ll be blunt: if you see repeated device timeouts, aborts, resets, or SMART errors trending upward, evacuate first. Filesystem repair on unreliable hardware is gambling, and the house is physics.
Priority 3: Use the right repair tool, in the right mode, at the right time
- ext4: fsck.ext4 (or e2fsck). Start with a read-only check when you can.
- XFS: xfs_repair. If the filesystem was mounted, you must unmount it first. XFS repair can be destructive if misused.
- btrfs: recovery is a different playbook (scrub, check, device stats). Don’t apply ext4 advice blindly.
Priority 4: Remount read-write only after you’ve removed the cause
Remounting rw is not a fix. It’s a decision to resume writes. Do that only when:
- the device layer is stable (no ongoing I/O errors/timeouts), and
- the filesystem is consistent (fsck/replay done), and
- you’ve accepted the risk (or migrated off).
Joke #2: The fastest way to make a filesystem writable again is to buy it dinner and promise you won’t ignore SMART warnings next time.
Checklists / step-by-step plan
Checklist A: Live system triage (5–10 minutes)
- Identify the failing mount: findmnt -T /path
- Confirm it’s actually ro: findmnt -no OPTIONS /mount
- Pull the trigger line from kernel logs: journalctl -k -b, filtered for I/O errors and remount events
- Map dm/LVM to physical devices: lsblk
- Snapshot device health: smartctl -a (or vendor tools if needed)
- Reduce writes: stop services; consider switching to read-only mode at the app layer
- Decide: evacuate vs attempt repair
Checklist B: Controlled recovery for ext4 on a non-root mount
- Stop services using the mount (databases first).
- Try a clean unmount:
cr0x@server:~$ umount /var
umount: /var: target is busy.
Meaning: Something still holds files open.
Decision: Identify and stop offenders before forcing anything.
- Find open files:
cr0x@server:~$ lsof +f -- /var | head
COMMAND   PID   USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
rsyslogd  612 syslog  1w   REG  253,2  1048576  123 /var/log/syslog
Meaning: Logging is still writing.
Decision: Stop rsyslog/journald persistence (carefully), or redirect logs temporarily.
- Once unmounted, run a read-only fsck pass:
cr0x@server:~$ fsck.ext4 -fn /dev/mapper/vg0-var
e2fsck 1.47.2 (1-Jan-2025)
Pass 1: Checking inodes, blocks, and sizes
Inode 513124 has illegal block(s). Clear? no
/dev/mapper/vg0-var: ********** WARNING: Filesystem still has errors **********
/dev/mapper/vg0-var: 123456/13107200 files, 9876543/52428800 blocks
Meaning: Errors exist; -n refused to modify anything. Good. You now know repair is needed.
Decision: If hardware is stable and you have backups/evacuation, proceed with a real repair. If not stable, evacuate first.
- Repair with fsck (this will modify on-disk structures):
cr0x@server:~$ fsck.ext4 -fy /dev/mapper/vg0-var
e2fsck 1.47.2 (1-Jan-2025)
Pass 1: Checking inodes, blocks, and sizes
Inode 513124 has illegal block(s). Clear? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/vg0-var: FILE SYSTEM WAS MODIFIED
/dev/mapper/vg0-var: 123455/13107200 files, 9876501/52428800 blocks
Meaning: Repairs were applied.
Decision: Remount and validate app behavior, but keep investigating the original trigger (especially if there were I/O errors).
- Remount:
cr0x@server:~$ mount /var
Meaning: If this succeeds and mounts rw, the filesystem is back.
Decision: Keep the host under observation and schedule deeper checks. A repaired filesystem on a questionable disk is not “resolved,” it’s “temporarily coherent.”
Checklist C: Root filesystem recovery (the reality show version)
If root (/) is read-only and you can’t recover online, use a recovery boot environment. Depending on your setup, that’s console access via remote management, a rescue ISO, or initramfs shell.
- Boot into rescue/emergency mode (or a live image) and ensure the root filesystem is not mounted read-write.
- Identify the root block device:
cr0x@server:~$ blkid | egrep 'ext4|xfs|btrfs|LVM2_member'
/dev/nvme0n1p2: UUID="..." TYPE="LVM2_member"
/dev/mapper/vg0-root: UUID="..." TYPE="ext4"
Meaning: Root is an ext4 LV.
Decision: Proceed with ext4 workflow.
- Run fsck offline:
cr0x@server:~$ fsck.ext4 -fy /dev/mapper/vg0-root
e2fsck 1.47.2 (1-Jan-2025)
/dev/mapper/vg0-root: recovering journal
/dev/mapper/vg0-root: clean
Meaning: Journal replay succeeded; the filesystem is now consistent.
Decision: Reboot normally. If you saw device errors earlier, don’t declare victory; plan hardware remediation.
Common mistakes: symptom → root cause → fix
Mistake 1: “Everything is read-only” after a single app error
Symptom: One service logs “Read-only file system,” but findmnt shows the filesystem is rw.
Root cause: The service writes to a different mount (bind mount, container overlay), or it’s hitting a read-only export (NFS), or a permissions/policy problem is being misreported.
Fix: Use findmnt -T /path and check container mounts. Confirm with a direct write test in the same namespace as the process.
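A hedged probe that tests exactly what the process sees, not what your shell sees: attempt a write inside its mount namespace. The PID and path are placeholders for your actual service.
cr0x@server:~$ nsenter -t 4321 -m -- sh -c 'touch /app/data/.rw-probe && rm /app/data/.rw-probe && echo writable'
touch: cannot touch '/app/data/.rw-probe': Read-only file system
If that fails while the same path is writable from the host, you’re looking at a container or bind-mount layer, not the host filesystem.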
Mistake 2: Immediate reboot to “clear” read-only
Symptom: Box flips ro; someone reboots; it comes back; incident “resolved”… until it isn’t.
Root cause: Reboot clears the symptom but destroys forensic context and may worsen corruption if the underlying device is failing.
Fix: Before reboot: capture journalctl -k, smartctl, and device mapping. If it’s a VM, check the hypervisor datastore state. Then decide repair vs evacuation.
Mistake 3: Running fsck on a mounted filesystem
Symptom: Someone runs fsck.ext4 on /dev/mapper/vg0-var while /var is still mounted.
Root cause: Panic + muscle memory.
Fix: Unmount first (or boot recovery). If you can’t unmount because it’s busy, stop services; if it’s root, use offline recovery.
Mistake 4: “It’s filesystem corruption” when it’s actually storage transport
Symptom: ext4 aborted journal; fsck “fixes” it; it happens again a day later.
Root cause: Underlying I/O errors/timeouts from NVMe, SATA link, HBA firmware, SAN path flaps.
Fix: Treat I/O errors as primary. Replace disk, update firmware, reseat cables/backplane, investigate multipath. Then repair filesystem.
Mistake 5: Remounting rw and continuing heavy writes to “buy time”
Symptom: mount -o remount,rw succeeds and everything looks fine for 20 minutes.
Root cause: The error condition is still present; the next failed metadata write returns you to read-only, sometimes with more damage.
Fix: Use remount rw only as a controlled step for data evacuation or post-repair validation. Do not resume normal workload until the trigger is addressed.
Mistake 6: Ignoring a full filesystem because “read-only is the real issue”
Symptom: /var is 100% full, services fail, then filesystem goes weird.
Root cause: Space pressure causes app crashes and metadata churn; combined with a crash, it can leave the filesystem in a state that requires recovery. It can also make logging stop right when you need it.
Fix: Solve space pressure as part of triage: free space safely, add capacity, separate logs, and enforce retention.
FAQ
1) Why does Debian remount ext4 read-only instead of just failing the one write?
Because metadata integrity matters. If ext4 sees conditions that imply inconsistency (or can’t safely commit journal transactions), continuing writes risks widening corruption. Read-only is containment.
2) Can I just run mount -o remount,rw and keep going?
You can, but you’re making an explicit bet that the underlying cause is gone. If kernel logs show I/O errors or timeouts, that bet is usually wrong. Use rw remount for controlled evacuation, not denial.
3) If fsck fixes it, does that mean hardware is fine?
No. Filesystems often fail “second” after a device fails “first.” A clean fsck result only tells you the on-disk structure is consistent at that moment. Check SMART, kernel I/O errors, and platform logs.
4) How do I know if it was a power loss versus a dying drive?
Power loss typically shows unclean shutdown and journal replay needs, but not repeated device timeouts/resets. Dying drives show I/O errors, aborts, resets, or increasing SMART error counters.
5) Why do I see this inside a container when the host filesystem is fine?
Containers frequently use OverlayFS. If the overlay upperdir sits on a filesystem that went read-only, the container sees read-only behavior. Also, some container images mount paths read-only intentionally.
6) What’s the safest first fsck command for ext4?
Start with a non-destructive read-only check: fsck.ext4 -fn /dev/DEVICE. It tells you whether repairs are needed without modifying anything. Then decide when and where to run real repairs.
7) Does XFS do the same “remount ro” behavior?
Different mechanism, same intent. XFS may force-shutdown the filesystem on certain errors; the result is functionally similar: writes fail, and the filesystem must be repaired offline with xfs_repair.
8) Should I keep the system up to copy data, or power it down immediately?
If the device is actively erroring (timeouts/reset loops), keep runtime minimal and prioritize evacuation. If it’s stable but the filesystem is ro, you can often keep it up long enough to collect logs and plan a controlled repair.
9) What if the root filesystem is read-only and I can’t log in normally?
Use console access and boot into rescue/emergency mode. If you land in initramfs, you can still inspect logs (sometimes limited) and run offline checks once the device is visible.
10) After recovery, what should I monitor to catch this earlier next time?
At minimum: kernel log patterns for I/O errors and “remounting filesystem read-only,” SMART critical counters, and storage latency/timeouts. Also alert on filesystems nearing capacity, especially /var.
Next steps you can do today
- Add a detection hook: alert on kernel messages that include “Remounting filesystem read-only” and block-layer I/O errors (a minimal watcher sketch follows this list).
- Practice the recovery path: confirm you can boot rescue mode and access encrypted/LVM volumes with your actual operational credentials.
- Separate blast radii: consider separate mounts for /var, databases, and logs, sized with realistic headroom.
- Make evacuation boring: document how to fail over or move workloads when a host’s storage starts lying.
- Stop trusting “it’s back up” as a resolution: after any ro remount, require a root-cause note: filesystem trigger line, device health snapshot, and remediation decision.
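A minimal watcher sketch for the detection hook above: it follows kernel messages and forwards matches to syslog via logger; swap logger for your real alerting hook. The script path is hypothetical, and in production you’d wrap this in a systemd service with restart-on-failure.
cr0x@server:~$ cat /usr/local/bin/ro-watch.sh
#!/bin/sh
# Follow kernel messages; escalate read-only remounts and block-layer I/O errors.
journalctl -k -f --no-pager \
  | grep -E --line-buffered 'Remounting filesystem read-only|I/O error|critical medium error' \
  | while read -r line; do
      # Replace logger with your actual alerting hook (webhook, pager, etc.).
      logger -p user.crit -t ro-watch "$line"
    done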
The read-only flip is the kernel’s way of saying: “I’m not going to help you destroy your own data.” Take the hint. Diagnose the trigger, stabilize the storage, repair offline when needed, and only then resume normal writes.