One minute your service is fine. The next minute it’s failing to write PID files, logs stop moving, and your deploy pipeline starts screaming about “Read-only file system.” The kernel didn’t become poetic overnight; it’s protecting your data by refusing to make things worse.
This guide is what you do next on Debian 13 when a filesystem flips to read-only after errors. Not the “run fsck and pray” folklore. Real, damage-minimizing steps: diagnose the failure mode, preserve evidence, choose the least-destructive repair path, and only then put writes back in production.
What “mounted read-only” actually means (and what it doesn’t)
On Debian 13 (and Linux generally), a filesystem going read-only is usually the kernel or filesystem driver deciding that continuing to write would risk turning a recoverable problem into permanent loss. Think of it as a circuit breaker for metadata integrity.
Two common ways you end up here
- Automatic remount read-only: The filesystem was mounted read-write, hit a serious error, and the kernel remounted it read-only (ext4 does this a lot). You’ll see logs like “Remounting filesystem read-only” or “Aborting journal.”
- Mounted read-only from the start: boot-time issues, explicit ro mount options, systemd emergency mode, or initramfs logic decided it’s safer to keep it read-only until repaired.
What it does not guarantee
Read-only does not mean “data is safe.” It means “we stopped making it worse.” If the underlying disk is dying or your RAID is degraded and still taking writes in the wrong places, you can absolutely lose more data just by reading aggressively (timeouts, resets, controller bugs, etc.).
Also: a read-only mount doesn’t stop every kind of state change. Some subsystems can still update things outside the filesystem (device write caches, RAID metadata, drive firmware tables). So the mindset is: preserve evidence, minimize writes, and make deliberate moves.
Facts and history: why Linux does this
Some context helps you make better calls at 3 a.m. Here are short, concrete facts that explain why your Debian box is acting “overprotective.”
- ext2 didn’t journal. Back in the day, power loss meant long fsck runs and a decent chance of losing metadata. Journaling filesystems (ext3/ext4) reduced this but added the concept of “journal abort.”
- ext3 made journaling mainstream on Linux. ext3 (early 2000s) introduced journaling to the ext line; ext4 later improved allocation and scalability but kept the “abort journal then remount ro” safety valve.
- XFS was born in high-end UNIX land. XFS originated at SGI and has a strict view of metadata integrity. When it detects corruption, it often forces a shutdown to stop further damage.
- Btrfs treats checksums as first-class citizens. It can detect silent data corruption via checksums, and with redundancy it can self-heal. Without redundancy, checksums mainly tell you you’re in trouble.
- The kernel’s VFS tries to keep the system alive. Remounting one filesystem read-only is often less disruptive than panicking the kernel or hard-crashing the node.
- Drive write caches can lie to you. A disk can acknowledge a write before it’s safely on stable storage. Power loss plus cache behavior is a classic recipe for journal trouble.
- RAID controllers can “help” in harmful ways. Some controller firmware retries, remaps, or reorders operations, masking problems until the filesystem trips over inconsistent results.
- “errors=remount-ro” is policy, not destiny. For ext4, the mount option errors=remount-ro tells the kernel what to do on certain errors. You can choose panic or continue instead, but “continue” is how you get creative corruption.
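If you want to see or change that policy for a given ext4 filesystem, tune2fs exposes it. A quick illustration (the device name is a placeholder, and an errors= option in /etc/fstab overrides the superblock default):
cr0x@server:~$ tune2fs -l /dev/sda2 | grep -i 'errors behavior'
cr0x@server:~$ tune2fs -e remount-ro /dev/sda2   # accepted values: continue, remount-ro, panic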
One quote to keep your hands steady: “Hope is not a strategy” (a paraphrase commonly passed around in operations circles).
Fast diagnosis playbook (first/second/third)
This is the “stop guessing” sequence. The goal is to identify which layer is failing: filesystem metadata, block device, RAID/LVM, or something upstream like out-of-space that got misread as corruption.
First: confirm scope and impact (30–90 seconds)
- Is it only one mount or multiple?
- Is root filesystem affected or just a data volume?
- Are applications failing because they can’t write, or because the device is timing out?
Second: read the kernel’s story (2–5 minutes)
- Look for I/O error, Buffer I/O error, EXT4-fs error, XFS (dm-*) forced shutdown, NVMe resets, SCSI sense data.
- Decide whether this is a media failure (hardware) or a metadata/consistency failure (filesystem-level).
Third: decide whether you can stay online or must stop writes now
- If you see repeated device resets, medium errors, or RAID degradation: prioritize imaging/backup and minimizing reads/writes.
- If it’s a clean journal abort after an unclean shutdown and the disk is healthy: you can plan a controlled repair window.
Rule: If you don’t yet know whether the block device is stable, don’t run destructive repair tools. A “repair” on a failing disk is often a fast way to convert partial data into confetti.
Stabilize the patient: stop the bleeding without panicking
When a filesystem is read-only, the temptation is to remount it read-write and keep going. That’s a great way to learn what corruption smells like. You want to do the opposite: reduce variables.
Stabilization priorities
- Capture evidence: kernel logs, mount state, RAID/LVM status, SMART/NVMe health.
- Stop unnecessary writes: halt noisy services, disable log spam, avoid “cleanup scripts” that hammer the disk.
- Decide on a safe snapshot/backup boundary: LVM snapshot, VM snapshot, storage array snapshot, or at least a file-level copy of critical state.
- Only then repair: pick the right tool for your filesystem and scenario.
Joke #1: The filesystem went read-only because it loves your data more than your uptime. It’s the only subsystem on the server with healthy boundaries.
Hands-on tasks (commands, output meaning, decisions)
Below are practical tasks you can run on Debian 13. Each includes: a command, what typical output means, and the decision you make. Use them like an incident checklist, not a buffet.
Task 1: Verify what is actually mounted read-only
cr0x@server:~$ findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS | sed -n '1,5p'
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda2 ext4 rw,relatime,errors=remount-ro
/var /dev/vg0/var xfs ro,relatime,attr2,inode64,logbufs=8,logbsize=32k
/home /dev/sdb1 ext4 rw,relatime
What it means: /var is currently mounted ro. Root is still rw, but has errors=remount-ro so it could flip too.
Decision: Scope the blast radius. If only a data mount is ro, you may keep the system up while you triage. If root is ro, prepare for maintenance mode.
Task 2: Confirm the kernel remounted it read-only (and why)
cr0x@server:~$ journalctl -k -b --no-pager | tail -n 40
Dec 29 02:11:06 server kernel: XFS (dm-2): metadata I/O error in "xlog_iodone" at daddr 0x1a2c8 len 16 error 5
Dec 29 02:11:06 server kernel: XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
Dec 29 02:11:06 server kernel: XFS (dm-2): Please unmount the filesystem and rectify the problem(s)
Dec 29 02:11:06 server kernel: XFS (dm-2): Corruption of in-memory data detected. Shutting down filesystem
Dec 29 02:11:06 server kernel: XFS (dm-2): Filesystem has been shut down due to log error (0x2).
Dec 29 02:11:07 server kernel: systemd-journald[301]: Failed to write entry (22 items, 812 bytes), ignoring: Read-only file system
What it means: XFS hit a log I/O error. That’s often not “XFS is broken,” it’s “the block layer didn’t deliver consistent reads/writes.” Error 5 is EIO.
Decision: Treat as possible storage instability first. Don’t rush into xfs_repair until you’ve checked the underlying device(s).
Task 3: Check for obvious device errors and resets
cr0x@server:~$ dmesg -T | egrep -i 'blk_update_request|I/O error|reset|nvme|medium error|critical medium|sense key' | tail -n 30
[Sun Dec 29 02:10:58 2025] nvme nvme0: I/O 112 QID 4 timeout, aborting
[Sun Dec 29 02:10:58 2025] nvme nvme0: Abort status: 0x371
[Sun Dec 29 02:10:59 2025] nvme nvme0: Resetting controller
[Sun Dec 29 02:11:06 2025] blk_update_request: I/O error, dev dm-2, sector 171720 write
[Sun Dec 29 02:11:06 2025] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
What it means: NVMe timeouts and controller resets correlate with the filesystem shutdown.
Decision: Assume hardware-path trouble until proven otherwise. Plan for backup/image first, then repair.
Task 4: Map the affected mount to the real block device stack
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,PKNAME
NAME TYPE SIZE FSTYPE MOUNTPOINTS PKNAME
sda disk 1.8T -
sda2 part 1.8T ext4 / sda
nvme0n1 disk 1.9T -
nvme0n1p1 part 1G vfat /boot/efi nvme0n1
nvme0n1p2 part 1.9T LVM2_member nvme0n1
vg0-var lvm 800G xfs /var -
vg0-data lvm 1000G ext4 /data -
What it means: /var is XFS on an LVM LV backed by nvme0n1p2. The disk errors above are relevant.
Decision: Your repair plan must include NVMe health checks and possibly moving data off that device.
Task 5: Check LVM health and whether you can snapshot safely
cr0x@server:~$ vgs; lvs -a -o +devices
VG #PV #LV #SN Attr VSize VFree
vg0 1 2 0 wz--n- 1.90t 120.00g
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
var vg0 -wi-ao---- 800.00g /dev/nvme0n1p2(0)
data vg0 -wi-ao---- 1000.00g /dev/nvme0n1p2(204800)
What it means: You have 120G free in the VG. That’s potential snapshot space, but snapshots are not magic: on a flaky device, CoW churn can make things worse.
Decision: If the device is stable and you need a rollback boundary, take a snapshot. If the disk is timing out, skip snapshots and do an external backup/image instead.
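If you go the snapshot route, here is a sketch of what that looks like with the layout above (names and sizes are illustrative; leave generous room, because a classic snapshot that fills up silently invalidates itself):
cr0x@server:~$ lvcreate --snapshot --name var-pre-repair --size 40G vg0/var
cr0x@server:~$ lvs -o lv_name,lv_size,origin,data_percent vg0   # watch data_percent; near 100% the snapshot is toast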
Task 6: Check NVMe health (NVMe devices)
cr0x@server:~$ nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0x04
temperature : 49 C
available_spare : 96%
percentage_used : 21%
media_errors : 12
num_err_log_entries : 187
warning_temp_time : 0
critical_comp_time : 0
What it means: a non-zero critical_warning is bad news (0x04 flags degraded reliability due to significant media errors), and media_errors is a smoking gun. The controller is admitting it failed at the NAND/media layer.
Decision: Treat this as impending device failure. Prioritize copying data off. Repairs may “work” and still leave you with a ticking brick.
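The smart-log summary has a companion worth pulling while you’re here: the controller’s error log (that’s where num_err_log_entries comes from). Two read-only checks, assuming the same device:
cr0x@server:~$ nvme error-log /dev/nvme0n1 | head -n 40   # look for clustered LBAs and repeated status codes
cr0x@server:~$ nvme smart-log /dev/nvme0n1 | grep -i media_errors   # re-run later; a climbing count means active decay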
Task 7: Check SMART health (SATA/SAS devices)
cr0x@server:~$ smartctl -a /dev/sda | egrep -i 'SMART overall|Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|CRC_Error_Count'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
What it means: “PASSED” is not a guarantee. Pending/uncorrectable sectors indicate the drive couldn’t read data reliably and hasn’t remapped it.
Decision: If the affected filesystem lives here, plan for replacement and data migration. If it doesn’t, still keep it on your watchlist.
Task 8: Confirm free space and inode exhaustion (yes, this can cascade)
cr0x@server:~$ df -hT /var; df -i /var
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg0-var xfs 800G 799G 1.2G 100% /var
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg0-var 0 0 0 - /var
What it means: inode accounting works differently on XFS (inodes are allocated dynamically), so the number that matters here is the 100% space usage. Full filesystems trigger weird application behavior and can expose latent bugs, but they typically don’t force a read-only remount by themselves.
Decision: If full, you still need to free space after stabilizing the device and verifying it’s not an I/O error scenario. Don’t delete random files on a possibly-corrupt filesystem unless you’ve decided it’s safe to write.
Task 9: Check for RAID degradation (mdadm)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md0 : active raid1 sdb1[0] sdc1[1]
976630336 blocks super 1.2 [2/2] [UU]
md1 : active raid10 sdd1[0] sde1[1] sdf1[2] sdg1[3]
3906885632 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
[>....................] recovery = 5.1% (100000000/1953442816) finish=240.0min speed=120000K/sec
What it means: md1 is degraded ([4/3]). Recovery is running, which is heavy I/O and can amplify latent disk issues.
Decision: Consider pausing recovery if it’s making the situation worse and you need to get data out first. That’s situational; don’t pause if you’re one disk away from total loss.
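If you decide the rebuild is doing more harm than good right now, you can throttle or pause it without failing the array. A sketch (numbers are illustrative; the limits are in KiB/s):
cr0x@server:~$ cat /proc/sys/dev/raid/speed_limit_max
cr0x@server:~$ echo 10000 > /proc/sys/dev/raid/speed_limit_max   # throttle recovery to roughly 10 MB/s
cr0x@server:~$ echo frozen > /sys/block/md1/md/sync_action       # pause recovery entirely; write idle to let it resume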
Task 10: Check systemd’s view (did you boot into emergency due to fsck?)
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● systemd-fsck@dev-disk-by\x2duuid-1a2b.service loaded failed failed File System Check on /dev/disk/by-uuid/1a2b
● var.mount loaded failed failed /var
What it means: A boot-time fsck unit failed, and the mount failed. Debian may drop to emergency mode or keep going with partial mounts depending on config.
Decision: Don’t keep rebooting hoping it “fixes itself.” Figure out why fsck failed: wrong device, missing driver, actual corruption, or a dying disk.
Task 11: Identify which process is still trying to write (and failing)
cr0x@server:~$ journalctl -b --no-pager | egrep -i 'Read-only file system|EROFS' | tail -n 20
Dec 29 02:11:07 server systemd[1]: Failed to start Rotate log files.
Dec 29 02:11:08 server nginx[1337]: open() "/var/log/nginx/access.log" failed (30: Read-only file system)
Dec 29 02:11:09 server postgres[2020]: could not write lock file "postmaster.pid": Read-only file system
What it means: Your apps are thrashing on writes. That can increase noise and confusion (retries, timeouts, cascading failures).
Decision: Stop the high-churn services cleanly. The best repair is the one done on a quiet system.
Task 12: If ext4: inspect filesystem state without writing
cr0x@server:~$ tune2fs -l /dev/sda2 | egrep -i 'Filesystem state|Errors behavior|Last mount time|Last checked|Mount count'
Filesystem state: clean with errors
Errors behavior: Remount read-only
Last mount time: Sun Dec 29 02:05:10 2025
Last checked: Wed Nov 6 10:12:44 2025
Mount count: 31
Maximum mount count: -1
What it means: ext4 thinks it’s “clean with errors,” which is basically “journal says stop pretending.”
Decision: Plan an offline fsck (unmounted filesystem). If the underlying disk looks unhealthy, take an image first.
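If imaging is the call, GNU ddrescue (Debian package gddrescue) is the usual choice because it retries bad regions and keeps a resumable map file. A minimal sketch, assuming the destination paths live on separate, healthy storage:
cr0x@server:~$ apt install gddrescue
cr0x@server:~$ ddrescue -d -r3 /dev/sda2 /mnt/rescue/sda2.img /mnt/rescue/sda2.map
Then run fsck against a copy of the image, not against the dying disk.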
Task 13: If XFS: check whether the log needs replay (read-only check)
cr0x@server:~$ xfs_repair -n /dev/mapper/vg0-var
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.
Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.
What it means: XFS detected pending log transactions. Normally, mounting will replay the log. But if the log I/O was the problem, replay might fail or re-trigger shutdown.
Decision: If the device is stable now, you can attempt a mount (possibly read-only) to replay. If not stable, prioritize backup/imaging, then consider xfs_repair with full awareness it can discard log data if forced.
Task 14: If Btrfs: read-only check and scrub status
cr0x@server:~$ btrfs device stats /data
[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 3
[/dev/sdb1].flush_io_errs 0
[/dev/sdb1].corruption_errs 1
[/dev/sdb1].generation_errs 0
What it means: Btrfs is seeing read errors and corruption. With redundancy (RAID1/10), it might self-heal; without it, it’s telling you “I can detect the rot, not cure it.”
Decision: If redundant, run scrub and replace the bad device. If not redundant, restore from backup or copy out what you can, then rebuild.
Task 15: Make a safe, minimal “get me writable logs” workaround (tmpfs)
cr0x@server:~$ systemctl stop nginx postgresql
cr0x@server:~$ mount -t tmpfs -o mode=0755,size=512M tmpfs /var/log
cr0x@server:~$ mkdir -p /var/log/nginx   # recreate the subdirectories your remaining services expect
What it means: You’re rerouting logs to RAM to keep a node limping for diagnostics without writing to the suspect filesystem. This is a bandage, not a cure.
Decision: Use only if you must keep minimal services up briefly. Prefer failing over to another node if you have one.
Task 16: Capture a forensic snapshot of logs and state (low-risk)
cr0x@server:~$ mkdir -p /root/ro-incident
cr0x@server:~$ journalctl -b --no-pager > /root/ro-incident/journal.txt
cr0x@server:~$ findmnt -a > /root/ro-incident/findmnt.txt
cr0x@server:~$ lsblk -f > /root/ro-incident/lsblk.txt
What it means: You saved the story while it’s fresh. If the next reboot changes device names or the machine won’t come back, you still have clues.
Decision: If storage is unstable, copy this directory off-host (scp to a bastion, attach to ticket, etc.).
Repair strategies by filesystem (ext4, XFS, Btrfs)
Different filesystems have different failure behavior. Using the wrong repair approach is how you turn a manageable incident into a multi-day restore. Your guiding principle: repair offline whenever possible, and never repair on a device you don’t trust.
ext4: journal abort and “errors=remount-ro”
ext4 usually remounts read-only after it detects an error serious enough to abort the journal. Often triggered by I/O errors, sometimes by bugs, occasionally by “your disk returned garbage.” ext4 is conservative. Listen to it.
Preferred ext4 path
- Confirm the device stack is stable (SMART/NVMe, dmesg, RAID state).
- Schedule downtime or boot into rescue mode so the target filesystem is unmounted.
- Run a non-destructive check first (fsck without automatic fixes if you want visibility).
- Run a repair pass, then re-check.
cr0x@server:~$ umount /dev/sda2
umount: /: target is busy.
Meaning: You can’t unmount root while running. Use a rescue boot, initramfs shell, or attach the volume to another host.
cr0x@server:~$ fsck.ext4 -f -n /dev/sda2
e2fsck 1.47.2 (1-Jan-2025)
/dev/sda2: clean, 812345/122101760 files, 44444444/488378368 blocks
Meaning: “clean” is a good sign. If it reports errors in -n mode, you have a decision: repair now, or image first if hardware is suspicious.
cr0x@server:~$ fsck.ext4 -f -y /dev/sda2
e2fsck 1.47.2 (1-Jan-2025)
Pass 1: Checking inodes, blocks, and sizes
Inode 1234567 has illegal block(s). Clear? yes
Pass 2: Checking directory structure
Entry 'tmp' in /var/tmp (98765) has deleted/unused inode 1234567. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
Meaning: Repairs happened. Now you must run a second fsck until it reports clean, because first-pass fixes can uncover more.
Decision: If it keeps finding new issues every run, stop and reassess hardware. Filesystems don’t usually “grow” infinite errors on stable media.
XFS: forced shutdown and log problems
XFS doesn’t do fsck at mount time the way ext filesystems do. Its journal (log) is integral, and XFS is intolerant of log I/O issues. That’s good for correctness, annoying for uptime.
Preferred XFS path
- Fix the storage path first: cabling, controller resets, failing NVMe, degraded RAID, flaky multipath.
- Attempt a clean unmount if possible. If it’s busy and already shut down, stop services and unmount.
- Try mounting to replay the log (if safe). If mount fails, you move to repair.
- Run xfs_repair. Only consider log zeroing (-L) as a last resort.
cr0x@server:~$ fuser -vm /var
USER PID ACCESS COMMAND
/var: root 1022 f.... rsyslogd
postgres 2020 f.... postgres
root 1337 f.... nginx
Meaning: These processes keep the mount busy.
Decision: Stop them. If you can’t, you’re not doing “repair,” you’re doing “unplanned corruption research.”
cr0x@server:~$ systemctl stop rsyslog nginx postgresql
cr0x@server:~$ umount /var
cr0x@server:~$ xfs_repair /dev/mapper/vg0-var
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
Phase 3 - for each AG...
Phase 4 - check for duplicate blocks...
Phase 5 - rebuild AG headers and trees...
Phase 6 - check inode connectivity...
Phase 7 - verify and correct link counts...
done
Meaning: Repair succeeded. That doesn’t mean the storage is healthy. It means XFS can build consistent metadata given what it sees.
Decision: Remount and validate application data. If I/O errors remain, migrate immediately.
About xfs_repair -L: It can clear the log (“zero log”). This may discard recent metadata changes. That might be acceptable on a cache volume; it’s catastrophic on a database volume. Use only when you accept data loss as the cost of mounting.
Btrfs: read-only mode, checksums, scrub, and the “don’t run repair casually” rule
Btrfs is powerful, but it has sharp edges. The biggest operational one: btrfs check --repair is not a routine tool. In many environments, the preferred move is: mount read-only, copy out, rebuild.
Preferred Btrfs path
- Mount read-only if not already.
- If redundancy exists, scrub to heal.
- Replace failing devices, then scrub again.
- If no redundancy and corruption exists: copy out critical data and recreate filesystem.
cr0x@server:~$ mount -o ro,subvolid=5 /dev/sdb1 /mnt
cr0x@server:~$ btrfs scrub start -Bd /mnt
scrub started on /mnt, fsid 1b2c3d4e-....
Starting scrub on devid 1
Scrub device /dev/sdb1 (id 1) done
Scrub started: Sun Dec 29 02:20:01 2025
Status: finished
Duration: 0:10:12
Total to scrub: 900.00GiB
Error summary: read=3, csum=1, verify=0, super=0, malloc=0, uncorrectable=1
Meaning: There is at least one uncorrectable error. Without a mirrored copy, Btrfs can’t fix missing truth.
Decision: If this volume matters, restore from backup or copy out what you can and rebuild. “Keep running” is a slow-motion incident.
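When there is redundancy and you’ve decided which device is bad, the usual sequence once the filesystem is mounted writable again is replace, then scrub. A sketch with illustrative device names (-r avoids reading from the failing source wherever a good mirror exists):
cr0x@server:~$ btrfs replace start -r /dev/sdb1 /dev/sdc1 /mnt
cr0x@server:~$ btrfs replace status /mnt
cr0x@server:~$ btrfs scrub start -Bd /mnt   # confirm the error summary comes back clean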
When it’s not the filesystem: storage, RAID, and kernel-level causes
Filesystems get blamed because they’re the messenger. In reality, “read-only after errors” is frequently a block-layer reliability failure: timeouts, resets, medium errors, or misbehaving virtual devices.
Failure modes that commonly masquerade as filesystem corruption
- NVMe firmware/controller issues: periodic resets under load, APST/power state bugs, thermal throttling leading to timeouts.
- SATA/SAS link problems: CRC errors, flaky cables/backplanes, expander weirdness. CRC errors are a gift: they scream “transport issue.”
- RAID rebuild stress: rebuild amplifies reads; weak drives fail during rebuild; the filesystem gets partial reads and panics.
- Thin-provisioned LVM snapshots: snapshot fills up, the device returns I/O errors, filesystems shut down to preserve metadata.
- Virtualization storage hiccups: iSCSI path flaps, NFS server stalls, hypervisor storage latency spikes. Guest sees EIO; filesystem gives up and goes read-only.
A quick, opinionated rulebook
- If you see EIO in dmesg, assume hardware until proven otherwise. Filesystems don’t invent EIO for fun.
- If you see a single clean journal replay need after power loss, assume software until proven otherwise. But still check SMART/NVMe. Cheap insurance.
- If the underlying device is unstable, your “repair” should start with copying data off. Repair tools are not backup tools.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a small fleet of Debian servers running a write-heavy analytics pipeline. One node started remounting /var read-only about once a week. The on-call pattern was always the same: reboot, it comes back, everyone moves on.
The assumption was simple: “It’s just ext4 being cranky after an unclean shutdown.” There had been a power event a month earlier, and that story stuck. Humans love a neat narrative, especially when it’s convenient.
During a busier week, the same node flipped read-only mid-ingest. This time, it took a service down hard: the pipeline couldn’t write spool files, and retry storms multiplied. Someone did what people do under pressure: remounted read-write to “get through the night.” It worked for an hour and then the filesystem stopped mounting altogether.
The postmortem was boring and painful. The actual root cause was intermittent NVMe controller resets under sustained queue depth. The drive’s health log had been showing media errors for weeks, but nobody was checking it because “SMART always says PASSED anyway.”
What changed their practice wasn’t a fancy tool. It was a tiny checklist item: if you see read-only remount, you always check device errors and health metrics before touching filesystem repair. They stopped treating the kernel as melodramatic. Downtime dropped; surprise, so did data loss.
Mini-story 2: The optimization that backfired
A different team wanted faster “repair boundaries” for their stateful services, so they leaned hard into LVM snapshots. Their idea was operationally elegant: snapshot the LV, run risky operations, roll back if needed. It worked great in staging.
Then a production node experienced a transient storage issue. The filesystem started throwing errors and remounted read-only. The on-call followed the runbook: “create snapshot before repair.” Snapshot creation succeeded, but it was thin on free space.
With the filesystem already upset, the snapshot’s copy-on-write amplification made the disk work harder. Writes piled up, latency spiked, and the snapshot filled quickly. Once full, the thin snapshot effectively turned future writes into errors from the block layer.
The team ended up with a two-layer problem: the original filesystem error plus a block-device behavior change caused by snapshot exhaustion. Applications saw EIO and crashed in ways that looked like data corruption. Their “safety mechanism” became a failure multiplier.
The fix was not “never snapshot.” It was “snapshot deliberately.” They added a minimum free-space requirement, monitoring, and a hard rule: if the disk is already unstable, don’t add CoW complexity. Image externally or fail over.
Mini-story 3: The boring but correct practice that saved the day
A large-ish org ran Debian nodes as part of a transactional system. Nothing exotic: ext4 for most volumes, XFS for big append-heavy logs. Their SRE team was annoyingly consistent about two things: scheduled filesystem checks in maintenance windows, and tested restores.
One morning, a kernel update plus an unfortunate power interruption caused a subset of nodes to boot with filesystems flagged dirty. On two nodes, ext4 aborted the journal and remounted root read-only during early boot. The first instinct from app teams was “just remount rw.”
The SRE on-call didn’t do heroics. They followed the boring plan: boot into rescue mode, run fsck offline, validate mounts, then bring services up in a controlled order. Meanwhile, they failed traffic to healthy nodes.
Because backups and restores were rehearsed, they also knew the escape hatch: if fsck looked suspicious or the disk health was questionable, they would wipe and restore rather than improvising. That option changes the psychology of incident response; you stop gambling.
They were back within their internal SLA, and the post-incident meeting was almost disappointingly calm. The lesson wasn’t “we’re geniuses.” It was “boring operational hygiene turns drama into a checklist.”
Common mistakes: symptoms → root cause → fix
This section is deliberately specific. You don’t need more generic advice. You need correct pattern matching.
1) Symptom: “Remounting filesystem read-only” after a burst of writes
Root cause: ext4 journal abort due to underlying I/O errors or timeouts; sometimes caused by a flaky storage path.
Fix: Check dmesg/journal for EIO and device resets. Run SMART/NVMe health checks. If hardware looks clean, do offline fsck. If not, image/migrate first.
2) Symptom: XFS shows “Log I/O Error Detected. Shutting down filesystem”
Root cause: Storage didn’t reliably complete log writes/reads. Could be device failure, controller resets, or dm layer issues.
Fix: Stabilize hardware path (replace device, fix multipath, pause rebuild if needed). Then unmount and run xfs_repair. Avoid -L unless you accept losing recent metadata transactions.
3) Symptom: Btrfs flips to read-only and scrub reports “uncorrectable”
Root cause: Corruption detected without redundant copies to heal from (or too many failures).
Fix: Copy out what you can, restore from backup, rebuild filesystem. If redundant, replace bad device and scrub again to heal.
4) Symptom: system boots into emergency mode; mount units failed
Root cause: fsck failed, device missing, wrong UUID, or real corruption. Sometimes triggered by a renamed device under RAID/HBA changes.
Fix: Use systemctl --failed and journalctl -xb. Verify /etc/fstab UUIDs via blkid. Repair offline. Don’t loop reboots.
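Two read-only commands catch most UUID/fstab mismatches before you reboot yet again:
cr0x@server:~$ blkid
cr0x@server:~$ findmnt --verify   # cross-checks /etc/fstab entries against what actually exists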
5) Symptom: filesystem is read-only but disk looks “fine” and no EIO appears
Root cause: Mount options (explicit ro), systemd remount logic, or a previous error flag that wasn’t cleared.
Fix: Check findmnt options, /etc/fstab, and tune2fs -l for ext4 state. If the filesystem thinks it has errors, do offline fsck even if the device is stable.
6) Symptom: after “successful” fsck, the filesystem goes read-only again quickly
Root cause: The underlying device is still throwing errors; fsck treated the symptom, not the disease.
Fix: Re-check SMART/NVMe and kernel logs after the repair. Replace suspect media. Verify cabling/backplane. If virtualized, check host storage latency and path stability.
7) Symptom: thin LVM snapshot exists; suddenly I/O errors appear and FS shuts down
Root cause: Snapshot or thin pool ran out of space; dm layer returns errors; filesystem protects itself.
Fix: Check lvs Data%/Meta%. Extend thin pool or remove snapshot. Then repair filesystem if needed. Add monitoring and thresholds.
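Checking and extending are both quick, assuming the VG still has free extents (pool and VG names below are placeholders):
cr0x@server:~$ lvs -a -o lv_name,lv_attr,data_percent,metadata_percent vg0
cr0x@server:~$ lvextend -L +50G vg0/thinpool                    # grow the thin pool's data space
cr0x@server:~$ lvextend --poolmetadatasize +1G vg0/thinpool     # grow its metadata if Meta% is the problem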
Checklists / step-by-step plan
Checklist A: When you first see “Read-only file system” in production
- Stop new writes: drain traffic, pause batch jobs, stop the chatty services.
- Record state: save kernel logs, mount options, device topology (journalctl, findmnt, lsblk).
- Confirm scope: which mount(s) are ro and which apps depend on them.
- Check device health: SMART/NVMe, dmesg for resets/timeouts, RAID status.
- Choose a safety boundary: snapshot if stable, backup/image if unstable.
Checklist B: Controlled repair plan (minimize damage)
- Get to an offline environment: rescue boot, initramfs shell, or attach the volume to another host.
- Validate you’re repairing the correct device: map mount → LV → PV → disk. Confirm UUID.
- Run a read-only check first where supported (fsck -n, xfs_repair -n).
- Run the repair with the filesystem-appropriate tool (fsck.ext4, xfs_repair).
- Re-run the check until it comes back clean.
- Mount read-only first and validate key data (DB files, config, application state); see the example after this checklist.
- Mount read-write, start services in order, watch logs for renewed I/O errors.
- Post-recovery monitoring: error counts, latency, RAID rebuild status, NVMe resets.
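For the “mount read-only first” step, a minimal sketch using the device from the earlier examples (norecovery skips journal replay entirely, which matters if even replay writes worry you, at the cost of seeing a pre-replay view of the data):
cr0x@server:~$ mount -o ro /dev/mapper/vg0-var /mnt
cr0x@server:~$ mount -o ro,norecovery /dev/mapper/vg0-var /mnt   # XFS; ext4 spells the same idea noload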
Checklist C: If hardware looks bad (the “save the data, not the filesystem” plan)
- Stop services to reduce churn and timeouts.
- Mount read-only if possible, copy out critical data first (configs, DB dumps if safe, key directories).
- Image the block device if you need forensic recovery (and if you have somewhere to put it).
- Replace the device, rebuild RAID/LVM, recreate filesystem.
- Restore from backup or from copied-out data.
Joke #2: If your first recovery step is “force remount rw,” congratulations—you’ve invented a new data destruction benchmark.
Prevention that actually works (and what’s just vibes)
Prevention isn’t a sermon. It’s a set of boring controls that keep “read-only remount” as an isolated incident instead of a repeat offender.
What works
- Device health monitoring that you trust: NVMe error logs, SMART attributes that matter (pending/uncorrectable/CRC), and alerting on resets/timeouts in kernel logs.
- RAID hygiene: monitor degradation, rebuild status, and error rates. Rebuilds are when weak drives confess.
- Capacity management: full filesystems don’t usually cause read-only remounts directly, but they cause operator mistakes and application thrash during incidents.
- Backups you can restore: test restores regularly. The ability to rebuild cleanly is the best antidote to risky “repair” heroics.
- Maintenance windows for filesystem checks: especially for ext filesystems, where modern defaults disable periodic mount-count and time-based checks (note the Maximum mount count of -1 in Task 12).
- Write-aware service design: apps that fail gracefully when storage becomes read-only (clear errors, stop writing, degrade features) reduce collateral damage.
What doesn’t work (or works only in presentations)
- Assuming journaling means “no corruption.” Journaling helps metadata consistency; it doesn’t make hardware reliable.
- Relying on “SMART PASSED.” It’s a coarse status, not a guarantee. The interesting parts are in attributes/logs.
- Repairing online. Some tools let you, but the risk profile is bad. Offline repair is slower and safer.
- Ignoring storage error logs because “the filesystem is the problem.” That’s how you replace the wrong component and keep having the same incident.
FAQ
1) Why does Linux remount a filesystem read-only instead of crashing?
Because continuing to write after metadata errors can cause cascading corruption. Remounting read-only keeps the system mostly functional while preventing further damage.
2) Can I just run mount -o remount,rw and continue?
You can, and sometimes it “works,” briefly. If the original error was due to I/O instability or journal abort, you’re likely to worsen corruption. Only remount rw after you’ve diagnosed and fixed the underlying cause.
3) What’s the fastest safe action when this happens on a database server?
Stop the database cleanly, capture logs, and confirm whether the storage layer is throwing errors. If the disk is unstable, prioritize copying data files or restoring from backups rather than filesystem repair gymnastics.
4) ext4 says “clean with errors.” Is that contradictory?
It means the filesystem was cleanly unmounted or looks consistent, but errors were recorded (often due to aborted journal or detected inconsistencies). Treat it as “needs fsck offline.”
5) XFS says to mount to replay the log, but mounting caused shutdown earlier. Now what?
Fix the storage path first. If you can’t make storage stable, replay may keep failing. After stability, attempt mount (possibly read-only) to replay, then unmount and run xfs_repair. Use xfs_repair -L only if you accept losing recent metadata changes.
6) For Btrfs, should I run btrfs check --repair?
Not as a default. Prefer scrub (if mounted) and device replacement when redundant. If uncorrectable corruption exists and no redundancy, copy out and rebuild. Use repair only when you understand the risk and have backups.
7) Does a read-only mount mean my disk is failing?
Not always. It can be a clean response to an unclean shutdown or a one-off bug. But repeated read-only remounts, EIO messages, timeouts, resets, pending sectors, or NVMe media errors strongly suggest hardware trouble.
8) Should I reboot immediately to “fix” it?
No. Rebooting can destroy evidence and may make recovery harder if the device gets worse. Capture logs and assess disk health first. Reboot only as part of a controlled repair plan.
9) If only /var is read-only, can I keep serving traffic?
Maybe, briefly, depending on what lives in /var (logs, spool, databases). If applications need it for state, you’re already broken. If it’s mostly logs, you might limp by with tmpfs logging while you fail over and repair properly.
10) How do I know whether fsck will cause data loss?
Any repair tool can remove or orphan corrupt structures. Run a read-only check first (fsck -n) to see what it wants to change. If the underlying disk is suspect, image first so you have a recovery path.
Conclusion: next steps you can execute
When Debian 13 mounts a filesystem read-only after errors, the kernel is giving you a chance to recover cleanly. Don’t waste it by forcing writes and turning a single failure into a forensic hobby.
Do this next:
- Run the fast diagnosis playbook: confirm scope, read kernel logs, check storage health.
- Stabilize: stop write-heavy services and capture evidence.
- Create a safety boundary: snapshot only if storage is stable; otherwise back up/image first.
- Repair offline with the correct tool for your filesystem, then validate.
- After recovery, treat repeated read-only events as a hardware-path incident until proven otherwise.