Linux Storage: The Mount Option That Can Corrupt Your Expectations

Some outages aren’t loud. The service stays up, the graphs look normal, and everyone congratulates the autoscaler. Then a power blip happens, or a node reboots, and suddenly your database is “successfully” missing yesterday afternoon.

This is the Linux storage failure mode nobody wants to explain at the postmortem: your application did exactly what it was supposed to do, but your mount options quietly changed what “write completed” means. The data isn’t corrupted in the dramatic sense. Your expectations are.

The mount option that corrupts expectations: async (and its friends)

If you force me to pick one mount option that routinely turns production into improv theater, it’s async.

async changes when the system reports success. It allows the kernel (or server, in the NFS case) to acknowledge writes before they are committed to stable storage. That can be fine for scratch space, caches, build artifacts, or things you can regenerate. It is not fine for your database, your queue, your identity service, your “single source of truth,” or anything you will later subpoena yourself for.

On local filesystems, you don’t often mount with async explicitly because it’s usually the default behavior of the page cache: writes are buffered and flushed later. The dangerous part is when people combine buffered IO assumptions with options that weaken ordering and durability, then assume fsync() saves them. Sometimes it does. Sometimes it saves you artistically.

The expectation that gets corrupted

Most engineers—reasonable, employed engineers—assume one of these is true:

  • “If my app gets a successful return code from write(), the data is safe.”
  • “If my app calls fsync(), the data is definitely safe.”
  • “If the filesystem is journaling, it can’t lose committed data.”

On Linux, the only one that is sometimes defensible is the second, and even then it depends on mount options, filesystem semantics, device write cache behavior, controller firmware, and whether your “disk” is actually a network service wearing a block-device costume.

The small set of knobs that cause the most regret

These are the usual suspects:

  • NFS async on the server side (and sometimes client choices) acknowledging before commit.
  • ext4 data=writeback weakening ordering between metadata journaling and file data.
  • barrier=0 / nobarrier removing flush/FUA behavior that enforces ordering through volatile write caches.
  • Disabling write barriers via device/controller policies (“it’s faster”) while trusting fsync() (“it’s safe”).
  • commit= increasing the maximum time metadata can sit in RAM before being committed to the journal (bigger window of pain on power loss).

Individually, each can be justified in specific contexts. Together, they’re a group project where the grade is “irrecoverable.”

Joke 1: The “fastest” storage system is the one that never writes anything. It’s also very durable, because there’s nothing to lose.

Durability: what you expect vs what the kernel actually guarantees

Linux IO has layers, and each layer has opinions.

At a high level: applications write to the page cache; the kernel schedules writeback; the filesystem updates metadata and maybe journals it; the block layer queues requests; the device may lie with a volatile cache; and then physics happens. Or doesn’t. The nasty surprises are almost always in the “maybe.”

Three statements that sound similar and are not

  • Data is visible: another process can read it (from cache), even if it’s not on disk.
  • Data is persistent: it will survive crash/power loss.
  • Filesystem is consistent: after crash, metadata is recoverable and structures are sane.

Journaling is about metadata consistency first. Some modes add stronger guarantees about file data. But “consistent” does not mean “contains the most recent successful writes.” It means “mountable without spending the weekend in fsck.”

Where fsync() fits—and where it doesn’t

fsync(fd) asks the kernel to flush dirty pages and associated metadata for that file to stable storage. That’s the contract. The problem is what “stable storage” means when:

  • a disk has a volatile write cache and ignores flushes,
  • a RAID controller reorders writes and lies about battery health,
  • a hypervisor translates flushes into “best effort,”
  • a networked system acknowledges before commit.

Also, you need to call it correctly. Many systems require fsync() on the directory after creating/renaming files to guarantee the directory entry is durable. The file can be safe while the name that points to it is not. That’s a very Linux way to lose things: the data exists, but you can’t find it.
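The create/rename/directory-fsync pattern the paragraph describes looks like this in practice. A minimal sketch (the function name `atomic_write` and the temp-file naming are mine, not a standard API):

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    """Durably replace a file: tmp write -> fsync(file) -> rename -> fsync(dir).

    Minimal sketch; production code also needs error handling and cleanup
    of the temp file on failure.
    """
    dirpath = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(dirpath, ".tmp." + os.path.basename(path))

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)      # flush the file's data and metadata to stable storage
    finally:
        os.close(fd)

    os.rename(tmp, path)  # atomic replacement within the same filesystem

    dfd = os.open(dirpath, os.O_DIRECTORY)
    try:
        os.fsync(dfd)     # make the new directory entry itself durable
    finally:
        os.close(dfd)
```

Skip the last fsync and you get exactly the failure mode above: the bytes may be on disk, but after a crash the rename that points to them might not be.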

One quote worth keeping in your on-call notebook

“Hope is not a strategy.” — Gen. Gordon R. Sullivan

Swap “hope” for “default mount options,” and you’re most of the way to a better storage posture.

Journaling modes that change your reality: ext4 data=ordered, writeback, journal

ext4 is still everywhere because it’s boring, fast, and predictable—until you change the knobs that make it boring.

What these modes actually do

  • data=ordered (default on many distros): metadata is journaled. File data is not journaled, but ext4 tries to ensure file data blocks are written to disk before the metadata that points to them is committed. After crash, you should not see old garbage in newly written file blocks (in most cases). You can still lose recent writes, but you lose them in a more sane way.
  • data=writeback: metadata is journaled, but ext4 does not enforce ordering between data writes and metadata commits. After crash, metadata can point to blocks whose contents are old. This is the “filesystem is consistent but your files are surprising” mode.
  • data=journal: both data and metadata are journaled. It’s stronger for crash consistency, usually slower, and not a free lunch. It also increases write amplification: you write data to the journal and then to its final location.
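To make the choice concrete, here is how the three modes appear in /etc/fstab. Device paths and mount points are illustrative, not recommendations; note that on most kernels ext4 refuses to change data= on a live remount, so switching modes means a full unmount and mount:

```
# /etc/fstab — illustrative entries; device paths are hypothetical.

# Safer baseline for anything durable (often the distro default anyway):
/dev/sdb1  /var/lib/postgresql  ext4  defaults,data=ordered               0 2

# Scratch-only volume where everything is regenerable: writeback is defensible.
/dev/sdc1  /scratch             ext4  defaults,data=writeback,commit=60   0 2
```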

Why data=writeback is the expectation-corruptor

Because it often passes casual testing. Your app writes, your tests pass, your benchmark looks great, your manager nods. Then you hit an unclean shutdown and discover that the last “successful” transaction wrote metadata durable enough to survive reboot, but the data blocks under it are from an earlier version.

This is especially brutal for append-heavy or log-like workloads where users assume monotonic growth. Under writeback, after a crash you can get “holes” of older content inside what looks like a newer file structure.

So should you never use data=writeback?

For general-purpose servers: yes, basically never. If you have a narrowly scoped workload that never reads recently written data after crash (scratch, caches), fine. For anything with durability semantics, you’re buying performance with the kind of debt that doesn’t refinance.

Barriers, write cache, and why “stable storage” is political

Barriers (and their modern equivalents: flushes and FUA) are the guardrails that tell devices “this ordering matters.” Without them, the storage stack becomes a suggestion box.

Write caches: the great accelerator and the great liar

Most drives (and many virtual block devices) use a write-back cache. Writes complete quickly because they land in volatile memory first. If power is lost, those writes evaporate. If your device acknowledges writes from volatile cache as “done” and ignores flush requests, the OS can’t make durability guarantees no matter how sincerely it calls fsync().

What barrier=0 / nobarrier really means

Disabling barriers can improve throughput and latency in some setups, especially older kernels and certain controller stacks. It can also create the classic post-crash paradox: the journal replays, the filesystem mounts cleanly, and your database discovers yesterday is missing but the logs claim it committed.

Barriers are not “extra safety.” They are how the filesystem enforces the ordering it assumed when it promised you consistency. Removing them is like removing lug nuts because the wheel spins faster.

Interesting facts and historical context (storage has receipts)

  1. ext3 introduced journaling to mainstream Linux in the early 2000s by extending ext2, mostly to reduce long fsck times after crashes.
  2. ext4’s delayed allocation improved performance but also changed crash behavior; it became more important for apps to fsync() correctly.
  3. Write barriers were once optional and expensive on some stacks; admins sometimes disabled them after reading performance tuning guides written for different hardware eras.
  4. Some early consumer SSDs were infamous for ignoring flush commands, making “sync” behavior more aspirational than real.
  5. XFS prioritized scalability and parallelism and historically pushed “use proper hardware with power-loss protection” rather than pretending it could fix lying devices.
  6. NFS’s sync vs async debate is ancient; admins have been trading durability for speed since before “cloud” was a job title.
  7. POSIX allows buffered writes; a successful write() does not imply persistence, which surprises people exactly once per career.
  8. Databases added durability toggles (like disabling fsync) because users kept demanding speed—and then blamed the database for the consequences.

Filesystem-specific gotchas (ext4, XFS, btrfs)

ext4: safe defaults, unsafe tweaks

ext4 with default-ish settings (data=ordered, barriers on) is usually a good baseline for local disks. Most of the horror stories begin with “we changed mount options to hit a latency target.” The second act begins with “power loss.”

Also watch for:

  • commit=: increases the interval between journal commits. Larger values increase the window of metadata loss after crash. Great for throughput benchmarks; less great for incident timelines.
  • errors=remount-ro: common default; good for safety, but it can turn a latent disk problem into a sudden read-only root filesystem. That’s a feature, not a betrayal.

XFS: robust, but you still need to respect physics

XFS is excellent under parallel workloads and large filesystems. It uses journaling for metadata; data is not journaled. It relies on write ordering and flush behavior to maintain consistency.

Common “expectation corruption” with XFS is less about mount flags and more about assuming the underlying device honors flushes. If you run XFS on a stack that lies about cache flushes, you can get post-crash inconsistencies that look like “XFS ate my data” when the real culprit is “controller acknowledged a write it couldn’t keep.”

btrfs: checksums, copy-on-write, and different tradeoffs

btrfs brings end-to-end checksums and copy-on-write semantics, which changes how corruption shows up. It can detect silent corruption, which is great. It can also amplify writes and behave differently under certain workloads. The expectation failures here are usually about performance tuning (compression, CoW on databases) rather than raw durability semantics. Still: mount options matter.

Networked storage: NFS and the seductive lie of async

If local storage is complicated, networked storage is complicated with latency.

NFS server async: fast acknowledgements, slow regrets

On NFS, the most dangerous “mount option” is often not a client mount option—it’s the server export option async. With async, the server may reply “OK” to a client write before committing it to stable storage. If the NFS server crashes, the client believes the write was durable, because it got an acknowledgement. Reality disagrees.

Sometimes teams justify async with “but the server has a UPS.” Cool. A UPS is a power plan, not a correctness proof. Kernel panic, firmware bug, dual power supply failure, human error: they don’t ask your UPS for permission.

Joke 2: “We mounted it async for performance” is the storage equivalent of “I removed the smoke detector because it was noisy.”

NFS sync is not free, but it’s honest

sync forces the server to commit writes to stable storage before replying. It’s slower, yes. But it aligns acknowledgements with durability. That’s the whole point of a storage system: to store things.
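On the server side, the difference is one word in /etc/exports. The network range and path below are illustrative:

```
# /etc/exports — illustrative; network and path are hypothetical.
# sync: the server must commit writes to stable storage before replying.
/export/shared  10.0.0.0/16(rw,sync,no_subtree_check,root_squash)
```

After editing, apply with exportfs -ra and confirm with exportfs -v that async didn’t sneak back in via a default.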

Three corporate mini-stories from the storage trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran an internal artifact repository: binaries, build logs, and deployment manifests. Nothing “mission critical,” until it was. The team migrated it to a new VM cluster and attached a fast networked block device. The migration plan included a quick filesystem mount tweak: data=writeback on ext4 because a benchmark showed a noticeable improvement during parallel uploads.

The assumption was simple and very human: “The repository is append-only, so consistency doesn’t matter much.” Their CI pipelines wrote artifacts, got HTTP 200s, and moved on. The service looked stable for months.

Then a hypervisor host crashed during a patch window. The VM rebooted cleanly. ext4 replayed the journal. The filesystem mounted without complaint. The artifact service started. It served files. Everyone exhaled.

Two days later, a team tried to redeploy an older version. The manifest was present, but a referenced binary was subtly wrong: correct filename, correct size, incorrect content. Not random garbage—an older build chunk. Debugging was miserable because it looked like a bad cache or a botched release. They eventually traced it to a file that had been overwritten during a metadata update with no enforced ordering between data and metadata. Their “append-only” assumption was wrong: the service did periodic compaction and metadata rewrites.

The fix wasn’t heroic. They remounted with data=ordered, forced correct fsync behavior in the application path, and added a post-crash integrity check for critical artifacts. The lesson stuck: storage semantics aren’t negotiable just because your service isn’t “customer facing.”

Mini-story 2: The optimization that backfired

An analytics platform had a hot ingest path writing to local XFS on NVMe. During peak hours, latency got spiky. Someone did the classic move: tweak mount options and I/O scheduler settings in production because the graphs were yelling. They disabled barriers (the exact knob varied by kernel and filesystem) and adjusted commit behavior to push out metadata commits less frequently.

It worked. Latency improved. Dashboards got calmer. The change got codified into an image build, because success is contagious.

Weeks later, a rack lost power. The hosts came back. Most services recovered. A subset of nodes came up with “replayed journal” messages and then quietly served partial ingest data. The platform didn’t crash; it degraded in a way that was worse than a crash: it produced incomplete analytics that looked plausible.

The postmortem was uncomfortable because nobody had “broken” anything. The optimization did what it was supposed to do: it made the system faster by weakening ordering assumptions. The backfire was that their data pipeline relied on fsync-like behavior from a component that didn’t fsync in all the right places. The mount tweak widened the window where missing flushes mattered.

They rolled back the mount changes, added explicit durability boundaries in the ingest components, and—this part matters—introduced a chaos test that simulated abrupt power loss for a staging cluster. Performance tuning didn’t stop. It just stopped being a leap of faith.

Mini-story 3: The boring but correct practice that saved the day

A payments-adjacent service (not processing cards, but close enough to be audited) had a dull policy: all database volumes had to be mounted with safe defaults, barriers on, and any deviation required a written risk assessment and a rollback plan. Engineers complained because it slowed down “simple” performance experiments.

They also had another boring habit: quarterly pull-the-plug tests on a non-production replica using production-like storage settings. Not fancy. Just brutal. They would run a representative workload, then trigger an unclean shutdown, reboot, and verify that the database recovered without missing acknowledged transactions.

One quarter, the test failed. The database came up, but a small set of recently committed transactions were gone. That’s the kind of bug that turns into a regulatory event if it ships.

They traced it to a storage firmware update that changed flush behavior under certain queue depths. Nothing in the OS logs screamed. The filesystem looked clean. Only the durability test caught it. They blocked the firmware rollout, switched to devices with proper power-loss protection for that tier, and kept the boring policy intact.

When the real data center later had a power event, their service was the one that didn’t need an emergency war room. Not because they were smarter. Because they were more stubborn about semantics.

Fast diagnosis playbook

When durability or corruption is suspected, you need speed and triage discipline. This is the “don’t get lost in the weeds” checklist I use.

First: confirm what you actually mounted and exported

  • Check current mount options (findmnt, /proc/mounts).
  • If NFS is involved, check server export options (exportfs -v) and client mount options (nfsstat -m).

Second: validate the durability chain (flushes and write cache)

  • Is the block device write cache enabled? Is it protected?
  • Are flushes supported and honored (as much as you can infer)?
  • Are barriers/flushes disabled at filesystem or device-mapper layer?

Third: identify whether you’re seeing metadata consistency issues or data ordering issues

  • Metadata problems: mount failures, journal replay loops, orphan inodes, filesystem errors.
  • Ordering problems: filesystem mounts cleanly, but application-level invariants are violated (missing last transactions, wrong file contents under correct filenames).

Fourth: reproduce with a controlled crash test (if you can)

  • Run a small write+fsync workload.
  • Force an unclean shutdown.
  • Verify what survived and what didn’t.
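The crash test above needs a workload whose survivors you can count afterwards. A minimal sketch: append length-and-checksum framed records, fdatasync after each acknowledgement, and after reboot count how many records survived intact. The framing format and function names here are mine, chosen for illustration:

```python
import os
import struct
import zlib

# Each record: 4-byte payload length + 4-byte CRC32, then the payload.
HEADER = struct.Struct("<II")

def append_record(fd: int, payload: bytes) -> None:
    """Append one framed record and force it to stable storage.

    Only after fdatasync returns is the record 'acknowledged'.
    """
    os.write(fd, HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
    os.fdatasync(fd)

def count_intact_records(path: str) -> int:
    """After an unclean shutdown, count records that survived fully intact."""
    with open(path, "rb") as f:
        data = f.read()
    intact, pos = 0, 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupted tail: everything past it is suspect
        intact += 1
        pos += HEADER.size + length
    return intact
```

The invariant to verify after the crash: every record acknowledged before the power cut must be counted intact. Acknowledged-but-missing records mean the durability chain lied somewhere.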

Practical tasks: commands, outputs, and decisions

These are real commands you can run during an incident, a migration, or a “why is this slow” week. For each: what to run, what the output means, and what decision you make.

Task 1: See what is actually mounted (not what you think is in fstab)

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /var/lib/postgresql
/dev/nvme0n1p2 /var/lib/postgresql ext4 rw,relatime,errors=remount-ro,data=ordered

Meaning: This is the truth on the running kernel. Here, ext4 is in data=ordered, which is generally the safer default.

Decision: If you see data=writeback, nobarrier, or unusually large commit= on a durability-requiring path, flag it and plan a change.

Task 2: Cross-check with /proc to spot bind mounts and surprises

cr0x@server:~$ grep -E ' /var/lib/postgresql ' /proc/mounts
/dev/nvme0n1p2 /var/lib/postgresql ext4 rw,relatime,errors=remount-ro,data=ordered 0 0

Meaning: Same story; also useful when findmnt output is being filtered by tooling.

Decision: If a path is unexpectedly mounted from overlayfs, tmpfs, or a container runtime layer, stop and map the real persistence boundary.

Task 3: Inspect ext4 superblock defaults and features (journal status, etc.)

cr0x@server:~$ sudo tune2fs -l /dev/nvme0n1p2 | egrep 'Filesystem features|Default mount options|Filesystem state|Journal'
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Default mount options:    user_xattr acl
Filesystem state:         clean
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3

Meaning: Confirms journaling is present (has_journal) and metadata checksums are enabled (metadata_csum), which helps detect certain corruptions.

Decision: If the filesystem isn’t clean after an incident, plan an offline fsck window. If journaling is missing on a volume you assumed was journaled, treat it as a design flaw.

Task 4: Check the kernel’s view of journal replays and ext4 warnings

cr0x@server:~$ dmesg -T | egrep -i 'ext4|jbd2|xfs|btrfs' | tail -n 20
[Tue Feb  4 10:42:12 2026] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[Tue Feb  4 10:42:12 2026] EXT4-fs (nvme0n1p2): re-mounted. Opts: (null)

Meaning: Shows whether the filesystem replayed a journal, mounted read-only, or logged errors.

Decision: If you see repeated journal recovery or “Errors detected,” assume underlying IO issues or unsafe caching. Escalate to hardware/virtualization team and reduce write-risk options.

Task 5: Verify disk write cache settings (and don’t assume they’re safe)

cr0x@server:~$ sudo hdparm -W /dev/sda
/dev/sda:
 write-caching =  1 (on)

Meaning: Write cache is enabled. That’s normal, but it raises the question: is it protected by power-loss protection, battery-backed cache, or a controller that honors flushes? Note that hdparm speaks ATA; for NVMe devices, query the volatile write cache feature instead (nvme get-feature /dev/nvme0 -f 0x06 -H).

Decision: If this is commodity hardware without PLP and you care about durability, ensure barriers/flushes are enabled and consider hardware changes for critical tiers.

Task 6: Check for volatile cache and flush support at the block layer

cr0x@server:~$ cat /sys/block/nvme0n1/queue/write_cache
write back

Meaning: The kernel treats this device as having a volatile write-back cache, so it will issue flush/FUA requests to enforce durability boundaries. A value of “write through” means the kernel assumes completed writes are already durable and skips flushes.

Decision: If this reads “write through” on a device you know caches writes in volatile memory, fsync() guarantees are fiction; verify the setting and the hardware before trusting it. The file is writable, but only override it when you are certain what the device actually does.

Task 7: Look for disabled barriers/flushes in mount options

cr0x@server:~$ findmnt -no TARGET,OPTIONS | egrep -i 'barrier|nobarrier|data=writeback|commit='
/mnt/fast rw,relatime,data=writeback,commit=60

Meaning: This mount is explicitly weakening ordering (data=writeback) and increasing commit interval.

Decision: If /mnt/fast holds anything you can’t regenerate, schedule a remount with safer options and plan an app-level durability audit.

Task 8: Confirm NFS client mount behavior (what the client thinks it’s doing)

cr0x@server:~$ nfsstat -m
/var/lib/shared from nfs01:/export/shared
 Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.10,local_lock=none

Meaning: Shows NFS version and client options. Note: client options don’t override server-side async acknowledgements.

Decision: If this backs a database or queue, treat NFS as a design review item. Verify server export options and application compatibility.

Task 9: Check NFS server exports for async (the real footgun)

cr0x@server:~$ sudo exportfs -v | sed -n '1,120p'
/export/shared
	10.0.0.0/16(rw,async,wdelay,hide,no_subtree_check,sec=sys,secure,root_squash)

Meaning: The export is async. The server may acknowledge writes before commit.

Decision: For durability workloads, change to sync, then measure the performance hit honestly. If the performance is unacceptable, the answer is architecture, not denial.

Task 10: Validate actual writeback pressure and dirty limits (risk window sizing)

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000

Meaning: Dirty pages can accumulate until 20% of RAM before throttling; expiration is ~30 seconds. This affects how much data can be “acknowledged” by the system while still not on disk.

Decision: For critical systems, keep these reasonable. Don’t “optimize” by letting the system buffer minutes of writes unless you accept crash loss of those minutes.
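If you decide to tighten the window, the knobs live in sysctl. The values below are illustrative starting points, not universal recommendations:

```
# /etc/sysctl.d/99-writeback.conf — illustrative values; tune for your RAM size.
# Start background writeback earlier and cap the dirty window.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
# Expire dirty pages after ~15s instead of the default ~30s.
vm.dirty_expire_centisecs = 1500
# On large-RAM machines, consider the absolute variants instead:
# vm.dirty_background_bytes and vm.dirty_bytes (setting these disables the ratios).
```

Apply with sysctl --system and re-check with the command above.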

Task 11: Observe what your filesystem thinks about delayed allocation and writeback (ext4 example)

cr0x@server:~$ grep ' ext4 ' /proc/mounts | head -n 2
/dev/nvme0n1p2 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/nvme0n1p3 /var ext4 rw,relatime,data=ordered 0 0

Meaning: Confirms data mode and error handling.

Decision: If you need stronger guarantees for specific paths, consider moving them to a separate filesystem with more conservative settings, rather than globally making the system slower.

Task 12: Measure whether your app is actually issuing flushes (a reality check)

cr0x@server:~$ sudo strace -f -e trace=fdatasync,fsync,openat,rename -p 1423 -s 80
strace: Process 1423 attached
fsync(17)                                = 0
rename("tmpfile", "current")             = 0
fsync(5)                                 = 0

Meaning: The process is calling fsync() and also syncing a directory FD (often needed after rename() for durable metadata). This is what “correct” looks like for many workloads.

Decision: If you don’t see fsync/fdatasync where you expect it (databases, WAL writers, queues), your durability is probably wishful. Fix the app/config before blaming the filesystem.

Task 13: Check IO errors and timeouts at the block layer (silent corruption’s cousin)

cr0x@server:~$ sudo journalctl -k -b | egrep -i 'I/O error|blk_update_request|nvme|reset|timed out' | tail -n 20
Feb 04 10:38:21 server kernel: nvme nvme0: I/O 42 QID 7 timeout, aborting
Feb 04 10:38:21 server kernel: nvme nvme0: Abort status: 0x371

Meaning: You have device-level IO timeouts. This can cause journal replays, remount-ro events, and partial writes depending on error handling.

Decision: Treat as hardware/virtual disk reliability issue. Stop tuning mount options and start planning replacement, firmware fixes, or infrastructure changes.

Task 14: Inspect queue settings to understand latency spikes (not directly durability, but often adjacent)

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] kyber none

Meaning: Shows scheduler in use. Latency tuning sometimes leads people to disable flush-related behavior; don’t confuse these layers.

Decision: Tune scheduling separately from durability semantics. Keep barriers/flush semantics correct, then optimize throughput/latency within those constraints.

Common mistakes: symptoms → root cause → fix

1) “Filesystem mounts cleanly, but my database lost committed transactions”

Symptoms: After crash/reboot, DB starts; logs show commits succeeded; data missing or rolled back beyond what your replication/WAL logic expects.

Root cause: Acknowledged writes weren’t durable: NFS server async, barriers disabled, lying write cache, or application not fsyncing WAL/data correctly. ext4 data=writeback can amplify the weirdness.

Fix: Ensure NFS exports are sync for durability workloads; keep barriers/flush enabled; validate device honors flush; set filesystem to safer journaling mode; fix app configuration to fsync at correct boundaries.

2) “After a power loss, files exist but contain older content”

Symptoms: Filenames and timestamps look right; contents are stale or partially reverted; checksum mismatches appear in a narrow time window before crash.

Root cause: ext4 data=writeback or barrier/flush issues allowing metadata to commit before data blocks are stable.

Fix: Use data=ordered (or data=journal when justified); keep flushes on; verify controller/device flush semantics; reduce risky tuning.

3) “Everything became read-only mid-flight”

Symptoms: Applications start failing with “Read-only file system”; kernel logs show filesystem errors; remount occurs.

Root cause: Underlying IO errors or metadata corruption; errors=remount-ro did its job. Not a mount option bug—this is the kernel protecting you.

Fix: Investigate device health; collect logs; run filesystem checks in maintenance; replace suspect hardware; restore from known-good backups if needed.

4) “We enabled a performance option and now crash recovery takes forever”

Symptoms: Long journal replay; slow boot; services waiting on filesystem recovery; high IO on startup.

Root cause: Larger commit=, heavy writeback backlog, or workload producing huge uncommitted metadata changes; in NFS, server-side backlog.

Fix: Reduce commit interval; ensure adequate IO capacity; avoid buffering massive dirty sets; for NFS, keep exports synchronous for critical data and scale storage appropriately.

5) “We only changed mount options, why is data integrity now ‘random’?”

Symptoms: Heisenbugs after reboots; intermittent checksum failures; not reproducible under normal load.

Root cause: You changed ordering/durability semantics. The workload was accidentally relying on old behavior (timing, ordering, flushes) and now violates its own invariants.

Fix: Revert to safe semantics; then explicitly define and enforce durability boundaries in the application and storage design.

Checklists / step-by-step plan

Checklist 1: Before you touch mount options in production

  1. Classify the data: cache/regenerable vs durable/authoritative.
  2. Write down the durability promise in plain English: “After the app reports success, data survives power loss.”
  3. Confirm application behavior: does it call fsync/fdatasync properly (including directory fsync for file-based workflows)?
  4. Confirm storage behavior: does the device/controller/hypervisor honor flushes?
  5. Stage the change under production-like load and run an unclean shutdown test.
  6. Have a rollback: either remount or redeploy prior configuration quickly.

Checklist 2: Safe defaults I recommend (and when to deviate)

  • ext4: prefer data=ordered, barriers/flush on (default), avoid data=writeback for anything durable.
  • XFS: keep defaults unless you have a specific, tested reason. Spend effort validating device flush behavior.
  • NFS for durable workloads: prefer server export sync. If you need async-like performance, reconsider architecture (local WAL + replication, or purpose-built storage).
  • Commit intervals: don’t inflate them casually. Bigger windows mean bigger losses.

Checklist 3: Post-incident verification (after unclean shutdown)

  1. Capture kernel logs for filesystem replay/errors.
  2. Verify mount options are what you expect.
  3. Run application-level integrity checks (not just filesystem checks).
  4. Validate last-known committed transactions or state markers.
  5. If mismatch exists, assume durability chain failure and escalate accordingly.

FAQ

1) Is data=writeback always “corruption”?

No. The filesystem can remain structurally consistent. The “corruption” is often semantic: metadata may reference data that wasn’t written in the order you assumed. If you expected “new file points to new bytes,” writeback can violate that after crashes.

2) If my app uses fsync(), am I safe?

Safer, not automatically safe. You also need the storage stack to honor flushes, and you must fsync the right things (including directories after rename/create, depending on the pattern). And if you’re on NFS with server async, acknowledgements can still be ahead of durable commit.

3) Why do people use unsafe options at all?

Because benchmarks are persuasive and outages are rare—until they aren’t. Also because some environments (scratch clusters, caches, build farms) truly don’t need durability, and the options are valid there. The mistake is cargo-culting those settings into durable systems.

4) What’s the difference between “filesystem consistency” and “data durability”?

Consistency means the filesystem structures are sane after recovery. Durability means acknowledged writes survive crashes. Journaling mostly targets consistency. Durability requires correct ordering and flush behavior across the entire stack.

5) Is NFS ever okay for databases?

Sometimes, with the right server settings (sync), stable storage underneath, and a database that supports it. But it’s a high-risk design if you don’t control the whole chain. Many teams choose local storage for WAL/primary writes and replicate for redundancy.

6) Should I disable disk write cache for safety?

Disabling write cache can reduce risk on unprotected devices, but it can also crush performance and is not always supported or effective. The better answer is: use devices with power-loss protection for critical tiers and keep flush/barrier semantics enabled.

7) What mount option improves safety the most?

Usually, “don’t disable safety features” is the winning move: don’t use data=writeback for durable data, don’t disable barriers/flushes, and avoid NFS server async where durability matters. Safety is often the absence of “optimizations.”

8) How do I prove my storage is durable?

You test it with power-loss style fault injection under realistic load. Run a workload that does acknowledged writes with fsync boundaries, crash it uncleanly, reboot, and validate invariants. Anything else is theory.

9) My filesystem is ext4 with data=ordered. Can I still lose data?

Yes. You can lose the most recent writes that were still in cache, and you can lose acknowledged writes if the device lies about flushes or the application doesn’t fsync correctly. ordered reduces specific classes of post-crash surprises; it doesn’t repeal physics.

10) Is data=journal the answer for everything?

No. It can be slower and increases write amplification. It’s a tool for specific cases, not a universal setting. Many systems do better with proper application-level logging (WAL) and safe filesystem defaults.

Conclusion: next steps you can actually do this week

If you take one idea from this: mount options don’t just tune performance. They define what “done” means. When you change them, you change the contract your applications thought they were getting.

Do these next steps:

  1. Inventory mounts on critical paths with findmnt. Write down anything non-default and justify it.
  2. Ban data=writeback for durable data unless you have a documented, tested exception.
  3. Audit NFS exports for async. If you find it on a durability workload, treat it like a sev-worthy risk.
  4. Validate the flush chain: device cache behavior, controller policies, virtualization layer semantics.
  5. Add one unclean shutdown test to staging for at least one critical service. Make it routine, not heroic.

You don’t need to become a filesystem developer. You do need to stop letting performance tweaks rewrite your durability story behind your back.
