Everything’s fine until it isn’t. Your Debian 13 host runs happily for hours, then a batch job lands, memory fills with dirty cache, and the box goes from “responsive” to “why is SSH typing in slow motion.” iowait spikes, kworker threads light up, and your database starts timing out like it’s auditioning for a drama.
This is the writeback storm: a synchronized flush of too many dirty pages, happening too late, at the worst time. The fix is usually not “buy faster disks.” It’s making Linux start writeback earlier, in smaller bites, and then verifying the change didn’t cost you throughput or add data risk.
What a writeback storm looks like (and why Debian 13 makes it obvious)
A writeback storm is not “the disk is slow.” It’s a timing problem: the kernel allows a large amount of memory to accumulate dirty pages (modified cached file data not yet on disk), and then it has to flush them. Flush too much at once, and you get contention everywhere:
- Foreground writers suddenly get throttled hard, often in a burst.
- kswapd or reclaim pressure forces writeback at the same time.
- jbd2 (ext4 journal) or filesystem log commits stack up.
- Latency-sensitive threads (databases, API workers) block on IO or on filesystem locks.
On Debian 13, you’re likely running a modern kernel with better visibility and sometimes more aggressive reclaim/writeback behavior than older long-lived fleets. That’s good: you see the problem sooner. It’s also bad: default ratios that “worked fine” on Debian 10-era spinning disks can turn into NVMe-shaped latency cliffs because the system writes very fast… right until it doesn’t.
One operational truth: writeback storms are rarely random. They’re repeatable when your workload pattern is repeatable (ETL jobs, backups, compactions, log bursts, container image pulls, CI artifacts). That’s a gift. Use it.
Joke #1: The kernel is an optimist—it assumes your disk will totally handle all those writes later. “Later” is when you’re on call.
Dirty page basics: what the kernel is really doing
Linux uses free memory aggressively as page cache. When applications write to files (not using direct IO), they often write into memory first. Those pages become “dirty.” Later, background writeback flushes them to disk. This gives great throughput and decouples application speed from disk speed—until you hit limits.
Key terms that matter in production
- Dirty pages: cached file data modified in RAM but not yet persisted to storage.
- Writeback: the kernel flushing dirty pages to disk, in background or under pressure.
- Background writeback threshold: when background threads begin writing dirty data proactively.
- Dirty limit: when processes generating dirty data are throttled (or forced to writeback).
- Expire time: how long dirty data is allowed to sit before writeback tries to push it.
The knobs live in /proc/sys/vm/, most notably:
- vm.dirty_background_ratio and vm.dirty_ratio
- vm.dirty_background_bytes and vm.dirty_bytes
- vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs
Ratios versus bytes (pick one style, don’t mix casually)
Ratios are percentages of “available memory” (not strictly total RAM; the kernel uses internal accounting that changes with memory pressure). Bytes are absolute thresholds. In production, bytes are more predictable across machines and across memory-pressure scenarios, especially in container-heavy hosts where “free memory” is a moving target.
If you set dirty_bytes, the ratio knobs are effectively ignored (same for background bytes). That’s usually what you want: fewer surprises.
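To see what a ratio actually means on a particular host, translate it into bytes yourself. A minimal sketch, using MemTotal as a rough stand-in for the kernel’s dirtyable-memory accounting (the real figure is smaller and moves with reclaim); the output shown is illustrative for a 256 GiB machine:
cr0x@server:~$ awk -v ratio=20 '/MemTotal:/ {printf "%.1f GiB may be dirty at %d%%\n", $2*ratio/100/1048576, ratio}' /proc/meminfo
51.2 GiB may be dirty at 20%
Fifty-plus gigabytes of pending writes is not a cache; it’s a delayed incident.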
What “data risk” actually means here
Tuning vm.dirty* does not change filesystem correctness or journaling guarantees. It changes how much data is allowed to be dirty in RAM and how long it can stay there. The risks are operational:
- Bigger dirty limits can improve throughput but increase the amount of data “at risk” during a power loss (data not yet flushed), and can produce larger, more painful writeback storms.
- Smaller dirty limits reduce storm size and reduce “dirty data exposure,” but can throttle writers sooner and reduce peak throughput.
Most teams don’t need heroic throughput. They need predictable latency. Tune accordingly.
One reliability-adjacent maxim belongs here: you can’t improve what you don’t measure. That’s the whole game with writeback storms: measure first, tune second.
Facts & history: how we got here
- Fact 1: Early Linux kernels used much simpler buffer cache behavior; modern writeback evolved heavily around the 2.6 era with per-bdi (backing device) mechanisms to manage IO pressure better.
- Fact 2: The default dirty ratios historically assumed “a reasonable disk” and human-scale RAM. Those defaults aged badly once 128–512 GB RAM hosts became normal.
- Fact 3: Dirty throttling is a latency tool wearing a throughput costume: it exists to prevent the system from dirtying “infinite” memory and then blocking forever on IO.
- Fact 4: With SSDs and NVMe, the problem often shifts from throughput to tail latency—the 99.9th percentile spikes caused by queueing and sudden bursts.
- Fact 5: Journaling filesystems (ext4, XFS) can still produce IO patterns that look bursty, especially under metadata-heavy workloads.
- Fact 6: Virtualized environments can amplify storms: host-level writeback plus guest-level writeback equals “double caching,” and each layer thinks it’s being helpful.
- Fact 7: The kernel’s idea of “available memory” is dynamic; ratio thresholds can effectively rise and fall during reclaim, changing writeback behavior mid-incident.
- Fact 8: Setting dirty_writeback_centisecs to 0 doesn’t mean “no writeback”; it means the periodic writeback timer is disabled, and writeback still happens via other triggers.
- Fact 9: Many “disk is slow” incidents are actually “queue is saturated” incidents: your disk might be fast, but you’re feeding it pathological bursts.
Fast diagnosis playbook (check 1/2/3)
This is the on-call version. You’re not here to philosophize. You’re here to find the bottleneck in five minutes and decide whether tuning vm.dirty helps.
First: confirm it’s writeback, not random IO
- Check dirty page levels and writeback activity.
- Check whether tasks are blocked in D state (a one-liner follows this list).
- Check if IO wait and disk queue depth spike during the freeze.
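The D-state check doesn’t need a profiler. A minimal sketch using only ps and awk; the output here is illustrative:
cr0x@server:~$ ps -eo state,pid,comm,wchan:32 | awk '$1=="D"'
D     612 jbd2/nvme0n1p3-  jbd2_journal_commit_transaction
D   18721 postgres         do_fsync
A handful of D-state tasks is normal life; dozens parked in writeback or journal paths during the freeze is your storm.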
Second: identify the device and the pressure source
- Which block device is saturated? (NVMe? RAID? network block?)
- Is the workload file writes, journal commits, or swap/reclaim forcing writeback? (vmstat sketch below)
- Is this inside a VM or container host with double caching?
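To separate swap pressure from plain file writeback, vmstat is enough. A minimal sketch with illustrative output; watch si/so for swap-in/out and bo for blocks written:
cr0x@server:~$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 14      0 11816960 126764 159838420    0    0    12 78240 9312 15210  5  2 35 58  0
si/so near zero with a large, sustained bo means file writeback is the driver; si/so climbing alongside bo means reclaim and swap are part of the storm.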
Third: choose the minimal safe mitigation
- Reduce the dirty limits (use bytes) to start writeback earlier and avoid large flush bursts.
- Optionally shorten dirty expiry so old dirty data doesn’t pile up.
- Validate with latency and queue metrics; roll back if throughput collapses.
Practical tasks: 14 checks with commands, outputs, and decisions
These are real tasks you can run on Debian 13 during an incident or in a calm window. Each one includes: command, example output, what it means, and what decision to make.
Task 1: Inspect current dirty tunables
cr0x@server:~$ sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
Meaning: This host uses ratio-based thresholds (bytes are 0). Background writeback starts at 10% dirty, and writers are throttled at 20% dirty.
Decision: On large-memory hosts, 20% can be enormous. If storms exist, plan to move to byte-based thresholds.
Task 2: Confirm the machine’s memory scale (ratios are relative)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256Gi 92Gi 11Gi 2.0Gi 154Gi 150Gi
Swap: 8.0Gi 0.0Gi 8.0Gi
Meaning: With 256 GiB RAM, a 20% dirty limit can translate into tens of gigabytes of dirty data. That’s a big flush.
Decision: Prefer bytes. For example, cap dirty to single-digit GiB unless you have a strong reason not to.
Task 3: Watch dirty and writeback pages in real time
cr0x@server:~$ awk '/Dirty:|Writeback:|MemAvailable:|Cached:|Buffers:/{print}' /proc/meminfo
MemAvailable: 157392112 kB
Cached: 132884944 kB
Buffers: 126764 kB
Dirty: 6248120 kB
Writeback: 184320 kB
Meaning: ~6.2 GiB dirty and some active writeback. During a storm, Dirty can climb rapidly and then Writeback spikes as flush begins.
Decision: If Dirty grows into multi-GB and Writeback lags, your background threshold is too high or storage can’t keep up.
Task 4: Verify blocked tasks and IO wait during the freeze
cr0x@server:~$ top -b -n1 | head -n 20
top - 11:08:21 up 14 days, 3:52, 1 user, load average: 18.42, 16.77, 10.91
Tasks: 612 total, 4 running, 599 sleeping, 0 stopped, 9 zombie
%Cpu(s): 5.3 us, 2.1 sy, 0.0 ni, 33.7 id, 58.6 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 262144.0 total, 11540.0 free, 94512.0 used, 156092.0 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 157000.0 avail Mem
Meaning: 58.6% iowait is classic saturation. Load average rises because tasks are waiting, not because you’re “CPU bound.”
Decision: Continue IO path diagnostics. Dirty tuning helps when the waiting correlates with massive writeback.
Task 5: Identify the busiest block device and queue pressure
cr0x@server:~$ iostat -x 1 3
Linux 6.12.0 (server) 12/30/2025 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.92 0.00 2.18 56.10 0.00 36.80
Device r/s w/s rKB/s wKB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 8.0 1900.0 256.0 78000.0 82.4 32.5 17.3 2.1 17.4 0.5 99.2
Meaning: %util near 100% and large avgqu-sz means the device is saturated. Write latency (w_await) is elevated.
Decision: If this aligns with dirty accumulation, reduce dirty thresholds. If not, find the IO source (journal, swap, random reads, etc.).
Task 6: Check pressure stall information (PSI) for IO
cr0x@server:~$ cat /proc/pressure/io
some avg10=12.43 avg60=8.11 avg300=3.21 total=91823354
full avg10=7.02 avg60=4.88 avg300=1.94 total=51299210
Meaning: full indicates time where tasks are completely stalled on IO. If it climbs during the storm, it’s not subtle.
Decision: High full supports the case for smoothing writeback; you’re seeing system-wide stalls.
Task 7: Observe writeback and throttling counters
cr0x@server:~$ grep -E 'nr_dirty|nr_dirtied|nr_writeback|nr_written' /proc/vmstat
nr_dirty 1605402
nr_writeback 48812
nr_dirtied 981234567
nr_written 979998123
nr_dirty_threshold 1572864
nr_dirty_background_threshold 786432
Meaning: nr_dirty_threshold and nr_dirty_background_threshold are the current limits, expressed in 4 KiB pages. Here nr_dirty has crept past the dirty threshold, which is exactly when the kernel throttles writers in balance_dirty_pages() to pull the number back down.
Decision: If nr_dirty rides at or above nr_dirty_threshold during the event, you’re in burst-throttle territory. Tune to start earlier and avoid cliffs.
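If you want direct evidence of throttling rather than inference, the kernel exposes a writeback:balance_dirty_pages tracepoint. A minimal sketch, assuming the linux-perf package is installed and your kernel build ships the tracepoint; the counts shown are illustrative:
cr0x@server:~$ sudo perf stat -e writeback:balance_dirty_pages -a -- sleep 10

 Performance counter stats for 'system wide':

            41,233      writeback:balance_dirty_pages

      10.002041253 seconds time elapsed
Tens of thousands of throttle events in ten seconds during the “freeze” is not a coincidence; it’s the dirty limit doing exactly what it’s configured to do.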
Task 8: Identify which processes are writing
cr0x@server:~$ pidstat -d 1 5
Linux 6.12.0 (server) 12/30/2025 _x86_64_ (64 CPU)
11:10:01 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
11:10:02 0 2143 0.00 32100.00 31500.00 12 rsyslogd
11:10:02 105 18721 0.00 98000.00 96500.00 88 postgres
11:10:02 0 291002 0.00 14000.00 13900.00 6 backup-agent
Meaning: kB_ccwr/s shows cancelled write bytes: data that was dirtied but then truncated or deleted before writeback happened. High numbers can indicate churn.
Decision: If one process dominates writes, you can also fix the workload (batching, fsync cadence, log rotation) instead of only tuning kernel thresholds.
Task 9: Check filesystem and mount options
cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /var/lib/postgresql
/dev/nvme0n1p3 /var/lib/postgresql ext4 rw,relatime,errors=remount-ro,data=ordered
Meaning: ext4 data=ordered is default and sane; it journals metadata and ensures data blocks are written before metadata points to them.
Decision: Don’t change journaling modes during an incident. If you’re on exotic mount options, document and re-evaluate later.
Task 10: Check if you’re accidentally using writeback caching without power-loss protection
cr0x@server:~$ lsblk -d -o NAME,MODEL,ROTA,TRAN,TYPE,SIZE
NAME MODEL ROTA TRAN TYPE SIZE
nvme0n1 Samsung SSD 980 PRO 0 nvme disk 1.8T
cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0n1 | egrep 'vwc|oncs'
vwc : 0x01
oncs : 0x001f
Meaning: vwc indicates volatile write cache presence. This isn’t automatically bad, but it raises the importance of barriers/flush behavior and power-loss protection.
Decision: If the platform lacks power-loss protection, keep dirty limits conservative. You can’t “sysctl” your way out of physics.
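A related check that costs nothing: how the block layer treats the device’s cache. A minimal sketch; the sysfs attribute is standard on current kernels and the output is illustrative:
cr0x@server:~$ cat /sys/block/nvme0n1/queue/write_cache
write back
“write back” means the kernel must issue flushes to make data durable; “write through” means the stack reports no volatile cache to worry about. Either way, treat it as information, not a knob to flip casually.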
Task 11: Confirm the IO scheduler (and avoid cargo culting)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
Meaning: NVMe often uses none by default; that can be correct. Schedulers can affect latency under contention.
Decision: Don’t treat scheduler selection as a first-line fix for writeback storms. Tune dirty thresholds first; change schedulers only with measurement.
Task 12: Measure fsync/commit pressure (ext4 example)
cr0x@server:~$ ps -eLo pid,comm,wchan:30 | egrep 'jbd2|fsync|fdatasync' | head
612 jbd2/nvme0n1p3 jbd2_journal_commit_transaction
18721 postgres do_fsync
18804 postgres do_fsync
Meaning: If many threads are waiting in fsync paths while dirty pages are high, your workload is forcing durability points under heavy writeback.
Decision: Consider tuning dirty limits smaller (reduce background/dirty bytes), and check app-level fsync behavior and checkpointing parameters.
Task 13: Apply a temporary, safer writeback profile (bytes-based)
cr0x@server:~$ sudo sysctl -w vm.dirty_background_bytes=$((512*1024*1024)) vm.dirty_bytes=$((2*1024*1024*1024))
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 2147483648
Meaning: Background writeback starts around 512 MiB dirty, and throttling starts around 2 GiB dirty.
Decision: This is a sane first cut for many servers. If storms shrink and latency improves, make it persistent and refine.
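Keep the rollback next to the change. Because writing a ratio sysctl zeroes its bytes counterpart, restoring the Task 1 defaults is one line (the values assume the defaults shown earlier):
cr0x@server:~$ sudo sysctl -w vm.dirty_background_ratio=10 vm.dirty_ratio=20
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
If the experiment makes things worse, you’re back where you started in seconds.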
Task 14: Validate that storms shrink (dirty levels + IO await)
cr0x@server:~$ watch -n1 'awk "/Dirty:|Writeback:/{print}" /proc/meminfo; iostat -x 1 1 | tail -n +7'
Every 1.0s: awk "/Dirty:|Writeback:/{print}" /proc/meminfo; iostat -x 1 1 | tail -n +7
Dirty: 612480 kB
Writeback: 98304 kB
nvme0n1 6.0 820.0 192.0 31000.0 78.3 6.2 3.8 1.9 3.8 0.4 71.0
Meaning: Dirty stays under ~600 MiB, writeback is active but not explosive, await and queue are lower, and %util has headroom.
Decision: If this holds under peak workload, you’ve tamed the storm. If throughput drops too much, raise dirty_bytes carefully (not ratios).
A safe tuning strategy: reduce storms without betting the company
Writeback tuning is deceptively easy to do badly. People see a blog post, slam vm.dirty_ratio=80, and then wonder why a power blip turned a queue into a crime scene.
Here’s the strategy that works in real production:
1) Prefer bytes-based thresholds on modern servers
Ratios scale with memory. That sounds nice until your “20% dirty” becomes 30–50 GiB on a big host. If your storage can flush that quickly and consistently, you wouldn’t be reading this.
Recommendation: Set:
- vm.dirty_background_bytes to 256–1024 MiB
- vm.dirty_bytes to 1–8 GiB, depending on storage and workload
Start conservative. Increase only if you have measured throughput pain.
2) Keep the background threshold meaningfully below the dirty limit
If background and dirty thresholds are too close, you don’t get “smooth writeback,” you get “writeback starts late and then throttles immediately.” That feels like a stall.
Rule of thumb: background at 1/4 to 1/2 of the dirty limit. Example: 512 MiB background, 2 GiB dirty.
3) Shorten dirty expiry if your workload creates long-lived dirt
vm.dirty_expire_centisecs controls how long dirty data can sit before it’s “old enough” to be flushed. Default values often mean “up to tens of seconds.” That can be fine. It can also allow a slow accumulation that becomes a storm when pressure hits.
Recommendation: If you see “quiet accumulation then sudden flush,” try reducing expiry moderately (e.g., from 30s to 10–15s). Don’t set it to 1s and act surprised when throughput changes.
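The unit is centiseconds, so 1500 is 15 seconds. A minimal sketch of a temporary change, leaving the flusher wakeup interval (dirty_writeback_centisecs) at its default:
cr0x@server:~$ sudo sysctl -w vm.dirty_expire_centisecs=1500
vm.dirty_expire_centisecs = 1500
Watch Dirty in /proc/meminfo afterwards; the backlog of “old” dirty data should stop growing as far between flushes.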
4) Don’t disable periodic writeback unless you understand the consequences
vm.dirty_writeback_centisecs controls periodic wakeups for background writeback. Setting it to 0 changes the dynamics and can shift flushing to more reactive triggers (reclaim, sync, fsync-heavy paths). That’s not “more efficient.” That’s “more chaotic.”
5) Remember what you’re optimizing: tail latency
You’re not trying to win a benchmark. You’re trying to keep p99 latency sane while still writing enough data. Lower dirty limits smooth writeback and reduce the maximum burst size. That’s the point.
Joke #2: If you tune dirty ratios based on “feels faster,” you’ve reinvented performance testing—poorly.
6) Make it persistent, reviewed, and reversible
Temporary sysctls fix incidents. Persistent sysctls prevent repeats. But only if they’re deployed like any other production change: peer-reviewed, documented, and rolled out gradually.
cr0x@server:~$ sudo tee /etc/sysctl.d/99-dirty-writeback.conf >/dev/null <<'EOF'
# Reduce writeback storms by starting writeback earlier and capping dirty cache.
# Bytes-based thresholds are predictable across RAM sizes.
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 2147483648
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
EOF
cr0x@server:~$ sudo sysctl --system
* Applying /etc/sysctl.d/99-dirty-writeback.conf ...
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 2147483648
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
Meaning: You’ve applied a controlled profile: earlier background writeback, smaller max dirty cache, slightly faster expiry.
Decision: Roll this to one canary host first, then a slice of the fleet. Watch p99 latency and IO queueing.
Suggested profiles (VMs, DBs, file servers, NVMe)
These are starting points, not gospel. The right values depend on storage write throughput, IO latency tolerance, and how bursty your writers are.
Profile A: General-purpose VM or app server (latency-sensitive)
- dirty_background_bytes = 256–512 MiB
- dirty_bytes = 1–2 GiB
- dirty_expire_centisecs = 1500–3000 (15–30 s)
Use this when you care more about responsiveness than streaming write throughput.
Profile B: Database host (durability points + steady writes)
- dirty_background_bytes = 512 MiB–1 GiB
- dirty_bytes = 2–4 GiB
- dirty_expire_centisecs = 1000–2000 (10–20 s)
Databases often do their own flushing/checkpointing and care about fsync latency. Smaller storm size is usually a win.
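As a concrete starting point, here is a sketch of a drop-in for this profile; the filename and exact values are illustrative, so pick numbers inside the ranges above for your hardware:
cr0x@server:~$ sudo tee /etc/sysctl.d/99-dirty-writeback-db.conf >/dev/null <<'EOF'
# Database host: smaller writeback bursts, bounded dirty cache, faster expiry.
vm.dirty_background_bytes = 1073741824
vm.dirty_bytes = 4294967296
vm.dirty_expire_centisecs = 1500
EOF
cr0x@server:~$ sudo sysctl --system
Canary it the same way as the general profile before touching the rest of the fleet.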
Profile C: File server / backup target (throughput-biased, still no storms)
- dirty_background_bytes = 1–2 GiB
- dirty_bytes = 4–8 GiB
- dirty_expire_centisecs = 3000 (30 s)
This is for sequential write ingestion where users tolerate slightly higher latency but not full-host stalls.
Profile D: NVMe RAID or very fast local SSD (avoid “too much optimism”)
Fast devices can flush quickly, which tempts you to raise dirty limits. The trap is that queueing still creates spikes, and background writeback can still get behind when metadata/journal patterns get weird.
- Start with Profile A or B anyway.
- Increase only after measuring sustained write workload behavior and p99 latency.
Three corporate mini-stories from the writeback trenches
Mini-story 1: The incident caused by a wrong assumption (the “RAM is free, right?” story)
A mid-sized SaaS company migrated a batch-heavy analytics service from older 64 GB nodes to newer 256 GB nodes. The storage stayed roughly the same class: decent SSDs behind a controller, good enough for years. Their assumption was simple: more RAM means more cache, which means fewer disk hits, which means faster jobs.
The first Monday after rollout, the daily ingestion job finished faster—until it didn’t. Halfway through the run, API latency spiked. SSH sessions stuttered. The load average hit numbers that made the graph look like a skyline. The team initially blamed a noisy neighbor in the virtualization layer because CPU usage was low and load was high. Classic misread.
When they finally looked at /proc/meminfo and /proc/vmstat, it was obvious: dirty cache had climbed into the tens of gigabytes, then the kernel throttled writers and started an aggressive flush. The storage could write fast, but not “flush 40 GB while serving random reads and fsyncs” fast. The workload hadn’t changed; the default ratio thresholds had.
They fixed it with bytes-based dirty thresholds and a lower background trigger. The job still ran quickly, but the system no longer froze. The wrong assumption wasn’t “cache helps.” It was “defaults scale safely with hardware.” They don’t.
Mini-story 2: The optimization that backfired (the “let it buffer more” experiment)
A fintech team had a high-throughput logging pipeline writing large append-only files. They wanted to maximize throughput because nightly reconciliation depended on log availability. Someone suggested increasing vm.dirty_ratio and vm.dirty_background_ratio to “let Linux buffer more and write in bigger batches.” On paper, that can improve sequential write efficiency.
It worked in a quick test. Throughput improved. Everyone nodded. Then the workload met reality: log rotation, compression jobs, and a periodic snapshot process. Those introduced bursts of metadata and sync operations. The system began experiencing sudden pauses around the top of the hour, like a commuter train hitting inexplicable red signals.
The deeper issue: by raising dirty ratios, they increased the maximum dirty footprint substantially. Under rotation and snapshot pressure, writeback needed to push out a mountain of dirty data at once. Foreground tasks—especially the ones doing fsync—started getting stuck behind that flush. The pipeline didn’t just slow down; it created cascading delays across dependent systems.
The fix was not “undo performance.” It was choosing smaller dirty limits and starting background writeback earlier, then tuning the pipeline to write more consistently. Their “optimization” was a throughput-only change in a system with latency-sensitive dependencies. That’s how you backfire an otherwise reasonable idea.
Mini-story 3: The boring but correct practice that saved the day (canary + metrics + rollback)
A media company had a fleet of Debian hosts serving uploads and transcoding outputs. They’d been burned before by “one-line sysctl fixes” that caused a week of subtle regressions. So they treated kernel tuning like an application deploy: canary, observe, expand, and keep a rollback plan.
When writeback storms started showing up during peak upload windows, they didn’t shotgun changes across the fleet. They picked one representative host, applied bytes-based dirty thresholds, and watched two things: IO PSI full and request latency at the application edge. They also watched error rates, because nothing says “oops” like timeouts masquerading as success.
The canary improved: fewer IO stalls, lower tail latency, and no throughput collapse. They rolled out to 10% of the fleet, then 50%. One cluster with older SATA SSDs saw a slight throughput drop, so they bumped dirty_bytes modestly on that hardware class only. No drama, no war room, no heroics.
The boring practice was the win: controlled rollout plus meaningful metrics. When you’re tuning writeback, the best tool is not a sysctl—it’s restraint.
Common mistakes: symptom → root cause → fix
1) Symptom: periodic full-host “freezes” with high iowait
Root cause: Dirty cache accumulates until the dirty limit, then writeback bursts saturate storage and throttle everything.
Fix: Use bytes-based thresholds; lower dirty_bytes and lower dirty_background_bytes so writeback starts earlier. Validate with /proc/meminfo Dirty/Writeback and iostat -x queue depth.
2) Symptom: throughput is fine, but p99 latency spikes under mixed load
Root cause: Queueing effects from burst writeback; journaling and fsync points compete with bulk flushing.
Fix: Reduce dirty limits (smaller bursts). Check for fsync-heavy processes and tune application flush cadence where possible.
3) Symptom: tuning ratios “works” on one host class but not another
Root cause: Ratio thresholds scale with memory and with the kernel’s available-memory accounting, which varies by workload and host role.
Fix: Standardize on dirty_bytes/dirty_background_bytes per hardware class.
4) Symptom: after lowering dirty limits, bulk writers slow down dramatically
Root cause: You throttled too early relative to the disk’s sustained write throughput; background writeback can’t keep up, so writers are constantly forced to wait.
Fix: Raise dirty_bytes modestly so writers get more runway before throttling, and keep dirty_background_bytes well below it so background writeback still starts early. Confirm disk throughput and latency; if the device is simply too slow, tuning can’t create bandwidth.
5) Symptom: “sync” or snapshots cause multi-minute stalls
Root cause: Huge dirty backlog meets a forced flush trigger (sync, snapshot, filesystem commit), causing a flood of writes and journal activity.
Fix: Keep dirty backlog small via bytes thresholds. Schedule snapshots away from peak. Ensure your snapshot tooling doesn’t force global sync behavior unnecessarily.
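If you want a feel for how long flushing the current backlog takes, measure it in a calm window; running sync on a host that is already storming will only pile on. A minimal sketch with illustrative output:
cr0x@server:~$ grep -E 'Dirty:|Writeback:' /proc/meminfo; time sync
Dirty:            612480 kB
Writeback:             0 kB

real    0m1.842s
user    0m0.001s
sys     0m0.004s
If ~600 MiB takes a couple of seconds, extrapolate what a 40 GiB backlog does to your snapshot window.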
6) Symptom: storms mainly happen inside VMs, not on bare metal
Root cause: Double caching and writeback interactions between guest and hypervisor. Each layer buffers, then flushes in bursts.
Fix: Use conservative dirty bytes inside guests. If possible, align hypervisor storage settings and avoid extreme buffering at both layers. Measure at both guest and host.
Checklists / step-by-step plan
Step-by-step: from incident to stable config
- Confirm it’s writeback. Check Dirty/Writeback in /proc/meminfo, IO PSI, and iostat queue depth.
- Identify the hot device. Find the saturated block device and confirm it maps to the affected filesystem.
- Capture a short baseline. Save current sysctls and a 2–3 minute snapshot of IO stats during the event (see the capture sketch after this list).
- Apply temporary bytes thresholds. Start with 512 MiB background and 2 GiB dirty. Avoid touching filesystem mount options mid-incident.
- Watch the shape. Dirty should oscillate below the cap; IO queue depth should drop; latency should smooth.
- Validate app health. p99 latency, error rates, and DB commit latency if applicable.
- Persist the change. Use /etc/sysctl.d/ with comments explaining why.
- Canary rollout. One host → small slice → fleet, with dashboards that include IO PSI and disk await.
- Refine by class. Older disks and network block devices may need different caps than local NVMe.
- Write a runbook note. “If you see A/B/C, check these values and these graphs.” Future-you is a stakeholder.
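For the baseline step, a capture loop beats memory. A minimal sketch that records the dirty sysctls once and then samples Dirty/Writeback plus IO PSI every two seconds for about two minutes; the /tmp paths are illustrative:
cr0x@server:~$ sysctl -a 2>/dev/null | grep '^vm\.dirty' > /tmp/dirty-sysctls.txt
cr0x@server:~$ for i in $(seq 1 60); do { date +%T; grep -E 'Dirty:|Writeback:' /proc/meminfo; cat /proc/pressure/io; } >> /tmp/writeback-baseline.log; sleep 2; done
Two minutes of before/after data turns “it feels better” into a graph you can defend in review.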
Safety checklist (the “don’t create new incidents” list)
- Don’t raise dirty thresholds during an incident unless you are absolutely certain you’re only throughput-limited and not latency-sensitive.
- Don’t mix ratio and bytes tuning “because both seem relevant.” Pick bytes for predictability.
- Don’t disable writeback timers as a first-line move.
- Don’t change journaling modes or barrier settings to fix storms. That’s not tuning; that’s gambling.
- Do keep dirty caps conservative on systems without power-loss protection.
- Do test on the same workload pattern that triggers the storm (batch window, backup window, compaction window).
FAQ
1) Are writeback storms a Debian 13 bug?
Usually not. They’re a mismatch between defaults, your RAM size, and your workload’s burstiness. Debian 13 just makes it easier to notice because modern stacks surface tail latency problems more clearly.
2) Should I use vm.dirty_ratio or vm.dirty_bytes?
Use bytes on servers where predictability matters (which is most servers). Ratios can be acceptable on small, uniform machines, but they scale poorly as RAM grows.
3) Does lowering dirty limits increase data safety?
It reduces how much unwritten file data can be sitting in RAM, so power-loss exposure decreases. It does not replace proper durability practices (journaling, correct flush semantics, UPS/PLP storage).
4) Can I set dirty limits extremely low to eliminate storms?
You can, but you may just convert storms into permanent throttling. The goal is smaller, continuous writeback, not forcing every writer to act like it’s doing synchronous IO all the time.
5) What about databases using O_DIRECT or direct IO?
Direct IO bypasses page cache for data files, which reduces dirty page pressure from that workload. But databases still write logs, metadata, and other files through the cache, and the rest of the system still uses page cache. Dirty tuning can still matter.
6) Should I tweak vm.swappiness instead?
Swappiness affects reclaim behavior and swap usage; it can influence when writeback is triggered under memory pressure, but it’s not the primary tool for writeback storms. Fix dirty thresholds first, then look at reclaim if you still see thrashing.
7) Why do storms happen at “random” times?
They’re often triggered by a periodic event: log rotation, backups, compaction, snapshotting, or memory pressure from a new workload. Correlate the time with cron/systemd timers and application schedules.
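Correlating is usually a five-minute job. A minimal sketch of where to look; output is omitted because it is host-specific:
cr0x@server:~$ systemctl list-timers --all
cr0x@server:~$ ls /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly
Line the trigger times up against your Dirty/Writeback graph; storms that arrive on a schedule were scheduled.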
8) Is changing IO scheduler a better fix than dirty tuning?
Sometimes schedulers help tail latency under contention, but they don’t fix the root cause of “too much dirty data flushed too late.” Scheduler tuning without writeback tuning is polishing the wrong part of the machine.
9) How do I know I didn’t just mask the problem?
If the device is still pegged at 100% util and the queue stays deep, you didn’t solve the bottleneck; you only changed when it hurts. A good fix reduces stall time (IO PSI), reduces queue depth, and improves application latency without exploding error rates.
Next steps you can do today
If you’re seeing disk writeback storms on Debian 13, don’t start with superstition. Start with evidence: Dirty/Writeback levels, IO PSI, and disk queue depth. Then make one disciplined change: move from ratio-based dirty thresholds to bytes-based caps that fit your storage reality.
Do this next:
- Run the fast diagnosis checks and capture a baseline during a storm.
- Apply a temporary profile: dirty_background_bytes=512 MiB, dirty_bytes=2 GiB, and optionally dirty_expire_centisecs=1500.
- Confirm storms shrink: Dirty stays bounded, IO queue depth drops, p99 improves.
- Persist the config in /etc/sysctl.d/ with comments, canary it, then roll out.
You don’t need to eliminate writeback. You need to stop it from showing up all at once, like an unpaid bill with interest.