At 02:17, your graph does the classic horror-movie thing: latency spikes, CPU “steal” looks fine, but the app starts timing out. The disks aren’t full. Network is boring. Yet the box “feels” like it’s wading through wet cement. You SSH in and see swap activity and a pile of dirty pages. Nothing is technically “down,” and somehow everything is unusable.
This is where tiny kernel tunings stop being trivia and start being a steering wheel. Ubuntu 24.04 ships with sane defaults for general-purpose systems. Production is rarely general-purpose. If your workload is memory-hungry, I/O-bursty, or runs on cloud volumes with personality, vm.swappiness and the vm.dirty* knobs are among the few “small” tunings that can actually change outcomes.
1. The mental model: what these knobs really control
Swapping isn’t just “out of RAM”; it’s “I picked a victim”
Linux memory management is a negotiation between anonymous memory (heap, stacks, tmp allocations) and file-backed memory (page cache, mapped files). When you see swap activity, it doesn’t necessarily mean you’re “out of RAM.” It means the kernel decided that some anonymous pages are less valuable than keeping other pages resident.
vm.swappiness is not a simple on/off switch. It’s a preference signal: how aggressively the kernel should reclaim anonymous memory by swapping it out versus reclaiming file cache (dropping clean cache pages). Higher values encourage more swapping earlier; lower values bias toward keeping anonymous memory in RAM and sacrificing cache first.
That bias matters because swapping has a brutal failure mode: latency becomes nonlinear. A system can feel perfectly fine and then suddenly become unresponsive when it crosses into sustained swap-in/swap-out behavior. That’s the swap storm: not a single big swap, but a repeating cycle where working set is larger than effective RAM and the kernel keeps evicting what the process will need again soon.
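A minimal way to tell a cold-page trickle from a storm is to watch the kernel’s cumulative swap counters; they only ever grow, so what matters is how fast they move between samples. A rough sketch, not a monitoring solution:
cr0x@server:~$ while true; do grep -E '^pswp(in|out)' /proc/vmstat; sleep 5; echo ---; done
If pswpin climbs steadily while users are on the system, hot pages are being pulled back from swap and you’re paying for it. If the counters barely move, the swap usage you see elsewhere is mostly cold history.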
Dirty pages are “IO debt” that eventually comes due
When an application writes to a file, the kernel often marks pages dirty in RAM and returns quickly. It’s a performance feature: batching writes is cheaper than issuing tiny synchronous IO. The debt is that those dirty pages must be flushed to storage later. The vm.dirty* settings decide how big that debt can get, how quickly it must be paid, and how aggressively the kernel throttles writers when the ledger looks scary.
Two ratios tend to dominate the conversation:
- vm.dirty_background_ratio: when dirty memory exceeds this percentage, background writeback starts.
- vm.dirty_ratio: when dirty memory exceeds this percentage, processes doing writes are throttled (they effectively help flush).
There are also byte-based versions (vm.dirty_background_bytes, vm.dirty_bytes) which override the ratios when set. In production, byte-based settings are often safer because ratios scale with RAM, and modern RAM sizes make “percent of RAM” a dangerously large number of dirty pages. With 256 GB RAM, 20% dirty is a lot of IO debt.
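To make that concrete on your own hardware, translate the ratios into absolute numbers from MemTotal (reported in kB in /proc/meminfo). This is back-of-the-envelope arithmetic assuming the stock 10%/20% ratios shown later in Task 1:
cr0x@server:~$ awk '/^MemTotal/ {printf "background (10%%): %.1f GiB, throttle (20%%): %.1f GiB\n", $2*0.10/1048576, $2*0.20/1048576}' /proc/meminfo
If those numbers are bigger than what your storage can flush in a few seconds, the ratios are writing checks the disks can’t cash.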
These knobs don’t “speed up disks.” They shape pain
No sysctl makes slow storage fast. What these tunings do is decide when you pay the cost of memory pressure and writeback, and whether you pay it in manageable background work or in a catastrophic foreground stall. Your goal is boring predictability:
- Swap only when it’s genuinely less harmful than dropping cache.
- Write back dirty data steadily enough that you don’t hit a cliff.
- Throttle writers before the system becomes a hostage situation.
One quote that ages well in operations: “Hope is not a strategy” (a paraphrase often attributed to General Gordon R. Sullivan). Kernel defaults are hope. Production is evidence.
2. Interesting facts and a little history (because defaults have a backstory)
- Fact 1: The page cache is not “wasted memory.” Linux will happily fill RAM with cache and drop it instantly when applications need memory. The confusion comes from old tools and old mental models.
- Fact 2: Early Linux kernels had very different swap heuristics; swapping behavior has been repeatedly reworked as storage and RAM sizes changed.
- Fact 3: Ratios like dirty_ratio made more sense when “a lot of RAM” meant single-digit gigabytes. Today, ratio-based dirty limits can translate into tens of gigabytes of dirty data, which means huge IO bursts later.
- Fact 4: The writeback subsystem is designed to smooth IO, but it can only smooth what the storage can sustain. Bursty apps on bursty cloud disks get bursty consequences.
- Fact 5: The kernel has multiple reclaim mechanisms: dropping clean page cache, writing back dirty cache, compacting memory, and swapping. These interact; tuning one knob can change which path gets chosen.
- Fact 6: Swap-on-SSD became practical, and then commonplace, which changed the cost equation. Swap is still slower than RAM, but it’s no longer “instant death” in every environment.
- Fact 7: Cgroups and containers changed what “memory pressure” means. Swapping can happen due to cgroup limits even when the host has plenty of RAM.
- Fact 8: The “dirty” settings also affect how long data can sit in RAM before being written—important for durability expectations and for how nasty a crash recovery can be.
3. vm.swappiness: when swapping is smart and when it’s just rude
What swappiness actually influences
vm.swappiness ranges from 0 to 200 on modern kernels, though most tuning advice still assumes the traditional 0–100 range. Ubuntu’s default is 60. That’s a middle-of-the-road bias: some swapping is fine if it preserves cache and overall throughput.
Here’s the key: swappiness doesn’t say “never swap.” It says “how hard should the kernel try to avoid swapping compared to reclaiming cache.” If you set it too low, the kernel may drop cache aggressively and you can end up with more disk reads (cache misses) and worse performance for IO-heavy services. If you set it too high, you can swap out memory that will be needed again soon, and latency becomes spiky.
When lower swappiness is usually correct
- Latency-sensitive services (API servers, interactive systems) where rare long pauses are worse than slightly higher steady-state IO.
- Hosts with plenty of RAM relative to working set where swapping indicates bad heuristics more than necessity.
- Systems where swap is slow (network-backed swap, oversubscribed hypervisors, or cheap cloud disks under credit pressure).
When higher swappiness can be rational
- Mixed workloads where keeping file cache hot matters (build boxes, artifact servers, read-heavy databases with OS cache doing real work).
- Memory overcommit with known cold pages (some JVMs, some caches, some batch jobs) where swapping out inactive anonymous memory can be less harmful than thrashing the page cache.
- When swap is fast enough (NVMe, good SSD arrays) and you’re trading throughput for slightly more swap IO without wrecking tail latency.
Two realistic target ranges
Opinionated guidance that tends to survive production:
- General servers: 20–60. Start at 30 if you’re unsure and have swap enabled.
- Latency-sensitive boxes: 1–10, but only after confirming you are not relying on swap as a safety valve.
A swappiness of 0 is often misunderstood. It doesn’t mean “swap off.” It means “avoid swapping as much as possible,” but under real pressure, swapping can still happen. If you truly want no swapping, disable swap (and accept what that implies: the OOM killer becomes your blunt instrument).
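Before choosing between “swap off” and “swappiness near zero,” it’s worth knowing whether the OOM killer has already been visiting. The exact log text varies by kernel version, so treat this grep as a starting point (current boot only):
cr0x@server:~$ journalctl -k | grep -iE 'out of memory|oom-kill' | tail -n 5
No output is good news. If you do see kills, disabling swap will make them more frequent, not less.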
Joke #1: Setting vm.swappiness=1 is like putting “please be nice” on a sticky note for the kernel. Sometimes it listens; sometimes it’s having a day.
4. vm.dirty settings: writeback, throttling, and why “buffers” isn’t free
Meet the dirty quartet
The commonly tuned set:
- vm.dirty_background_ratio / vm.dirty_background_bytes: start background flushing when dirty pages exceed this threshold.
- vm.dirty_ratio / vm.dirty_bytes: throttle foreground writers when dirty pages exceed this threshold.
- vm.dirty_expire_centisecs: how long dirty data can sit before it’s considered old enough to write back (in 1/100s).
- vm.dirty_writeback_centisecs: how often the kernel wakes flusher threads to write back (in 1/100s).
Ratios are percentages of total memory. Bytes are absolute thresholds. Setting the byte form clears the corresponding ratio (it will read as 0), so only one of each pair is active at a time. Byte-based tuning is usually more predictable across instance types and future RAM upgrades.
The failure mode you’re trying to prevent: the writeback cliff
If your dirty thresholds are high, the kernel can accumulate a huge amount of dirty data in RAM, then suddenly decide it’s time to flush. That can saturate your storage, trigger IO queueing, and stall writes. If your application writes via buffered IO (most do), the stall shows up as blocked processes in D state and elevated IO wait. Users call it “the system froze.” Engineers call it “writeback throttling did its job, just not in a way anyone enjoyed.”
On cloud volumes, the cliff is extra fun because baseline throughput and burst credits can change over time. You can “get away” with high dirty ratios during bursts and then get punished when the volume drops back to baseline. It looks random until you remember the storage is literally a bank account.
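When you suspect the cliff, it’s more convincing to watch dirty memory live than to reconstruct it afterwards. A simple sketch; run iostat -xz 1 in a second terminal so you can line dirty levels up with write latency:
cr0x@server:~$ watch -n 2 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
Dirty climbing into the gigabytes and then collapsing while write latency spikes is the cliff in real time.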
Choosing ratios vs bytes
Use ratios when:
- You manage a fairly uniform fleet with similar RAM sizes.
- Your performance envelope scales with RAM and storage proportionally.
- You want simpler mental math and accept some variability.
Use bytes when:
- RAM sizes vary widely (common in autoscaling groups).
- Storage throughput is the real constraint and does not scale with RAM.
- You want to cap IO debt to something your storage can flush predictably.
Concrete starting points that usually behave
These are not magical, but they’re less likely to produce cliff behavior than the “let’s just raise dirty_ratio” school of thought.
- For general-purpose SSD-backed VMs: set dirty_background_bytes to 256–1024 MB and dirty_bytes to 1–4 GB, depending on storage bandwidth and write burstiness.
- For write-heavy services on modest disks: smaller thresholds (background 128–512 MB, max 512 MB–2 GB) plus more frequent writeback can reduce spikes.
- For large-memory nodes: bytes almost always beat ratios unless storage scales with RAM (rare outside very expensive boxes).
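If you adopt any of the byte-based starting points above, let the shell do the multiplication instead of typing long numbers by hand; it keeps the zeros honest:
cr0x@server:~$ echo "background=$((256*1024*1024)) max=$((1*1024*1024*1024))"
background=268435456 max=1073741824
Those two values happen to match Profile A later in this article; adjust the multipliers to whatever your storage can actually flush.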
Expire and writeback intervals: the “how lumpy is lumpy” controls
dirty_writeback_centisecs sets how often background writeback kicks in. Shorter intervals can mean smoother writeback at the cost of more frequent flushing overhead. dirty_expire_centisecs is about how long data can remain dirty before it’s old enough to be pushed out.
For most servers, defaults are fine. But if you see periodic “every N seconds the world pauses” behavior, these are suspects, especially when combined with high dirty ratios. Don’t go wild. Small changes, measure, repeat.
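If you do end up experimenting with the intervals, runtime changes are enough for a test and revert on reboot. The values below are only an illustration of “small changes” (the defaults are 500 and 3000), not a recommendation:
cr0x@server:~$ sudo sysctl -w vm.dirty_writeback_centisecs=300 vm.dirty_expire_centisecs=1500
vm.dirty_writeback_centisecs = 300
vm.dirty_expire_centisecs = 1500
Measure the periodic stalls before and after; if nothing changes, put the defaults back.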
Joke #2: Dirty pages are like dishes in the sink: you can stack them impressively high, but eventually someone has to face the consequences.
5. Fast diagnosis playbook (first/second/third)
First: decide if you’re dying from memory pressure or writeback pressure
- Memory pressure signs: rising pgscan/pgsteal, frequent reclaim, growing swap in/out, major page faults, kswapd busy.
- Writeback pressure signs: many tasks stuck in D state, high IO wait, dirty pages high, balance_dirty_pages shows up in stacks (if you sample), storage utilization pegged.
Second: locate the bottleneck layer in 5 minutes
- Is swap actually being used? If yes, is it steady background trickle or a storm?
- Are dirty pages climbing? If yes, are you hitting the dirty throttle threshold?
- Is storage saturated? If yes, is it throughput, IOPS, queue depth, or latency?
- Is this per-cgroup/container? If you’re on a host with containers, check cgroup memory and IO constraints.
Third: pick the right lever
- Swap storm: reduce working set, add RAM, adjust swappiness, fix memory leaks, consider zswap/zram only with eyes open (see the quick check after this list).
- Writeback cliff: cap dirty bytes, lower dirty ratios, smooth writeback intervals, and fix the storage bottleneck (often the real answer).
- Both at once: you likely have an app doing heavy buffered writes while memory is tight. Tuning helps, but capacity planning fixes.
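On the zswap/zram point above: before adding either, check whether zswap is already enabled on your kernel and whether a zram device is already part of your swap setup, because both change the cost model of swapping. The zswap parameter file is normally present on Ubuntu kernels; treat its absence as “not compiled in” rather than an error:
cr0x@server:~$ cat /sys/module/zswap/parameters/enabled
cr0x@server:~$ swapon --show
A Y means swapped pages are compressed in RAM before they hit disk; a /dev/zram* entry in swapon means zram is in play. Either one changes what a given swappiness value actually costs you.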
6. Practical tasks (commands + output meaning + decisions)
Task 1: Confirm current swappiness and dirty settings
cr0x@server:~$ sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.swappiness = 60
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
What it means: the ratio-based limits are active (both byte settings are zero). Dirty data can sit about 30 seconds before it’s considered old (3000 centiseconds), and the flusher threads wake every 5 seconds (500 centiseconds).
Decision: If RAM is large and storage is modest, consider moving to byte-based dirty limits to avoid huge bursts.
Task 2: Check if swap is enabled and what kind
cr0x@server:~$ swapon --show --bytes
NAME TYPE SIZE USED PRIO
/swap.img file 8589934592 268435456 -2
What it means: 8 GiB swapfile, ~256 MiB used. That’s not automatically bad; it depends on trends.
Decision: If swap usage is steadily increasing during load and never returns, you’re probably exceeding the working set or leaking memory.
Task 3: Check memory pressure quickly (including swap trends)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 18Gi 1.2Gi 1.0Gi 12Gi 9.5Gi
Swap: 8.0Gi 256Mi 7.8Gi
What it means: “available” is your friend; it estimates memory that can be reclaimed without heavy swapping. Low “free” alone is meaningless on Linux.
Decision: If available collapses and swap rises under normal load, you need capacity or a working set reduction, not a magical sysctl.
Task 4: See real swap activity over time
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 262144 1265000 120000 9500000 0 0 120 180 600 1200 12 4 82 2 0
3 0 262144 1180000 120000 9440000 0 0 100 210 620 1300 14 4 80 2 0
4 1 310000 900000 119000 9300000 2048 4096 90 5000 800 2000 18 6 60 16 0
2 2 420000 600000 118000 9100000 4096 8192 60 9000 1000 2600 20 7 45 28 0
1 2 520000 500000 118000 9000000 4096 8192 40 12000 1100 2800 22 7 40 31 0
What it means: si/so (swap-in/out) ramping indicates active swapping. Increasing b indicates blocked processes (often IO). Rising wa points at IO wait.
Decision: If you see sustained swap-in (si) during user traffic, lowering swappiness may help only if the system is swapping “unnecessarily.” If the working set is genuinely too large, tuning won’t save you.
Task 5: Identify top swap consumers
cr0x@server:~$ sudo smem -rs swap -k | head -n 8
PID User Command Swap USS PSS RSS
14231 app /usr/bin/java -jar service.jar 512000 420000 600000 1300000
5123 postgres /usr/lib/postgresql/16/bin/... 128000 300000 380000 700000
2211 root /usr/bin/containerd 24000 35000 42000 90000
1987 root /usr/lib/systemd/systemd 2000 8000 12000 25000
What it means: Which processes are actually swapped out, not just large.
Decision: If a latency-critical process has meaningful swap, consider memory tuning in the app, adding RAM, or reducing swappiness. If only truly idle daemons are swapped, don’t panic.
Task 6: Check dirty page levels and writeback activity
cr0x@server:~$ egrep 'Dirty|Writeback|MemTotal' /proc/meminfo
MemTotal: 32734064 kB
Dirty: 184320 kB
Writeback: 16384 kB
WritebackTmp: 0 kB
What it means: Dirty is currently small (~180 MiB). Writeback is active but modest.
Decision: If Dirty grows into multiple gigabytes and stays there, you’re likely accumulating IO debt faster than storage can flush.
Task 7: Check the kernel’s global dirty ratios (what the thresholds are computed from)
cr0x@server:~$ cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
10
20
What it means: Background flush begins at 10% of memory; throttling at 20%.
Decision: On big RAM systems, that’s often too high in absolute terms. Consider byte limits.
Task 8: Measure blocked IO and disk saturation
cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0 (server) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
14.20 0.00 5.10 18.40 0.00 62.30
Device r/s rkB/s rrqm/s %rrqm r_await w/s wkB/s w_await aqu-sz %util
nvme0n1 120.0 5200.0 2.0 1.6 4.2 980.0 64000.0 22.5 7.8 99.0
What it means: Disk is pegged (%util ~99) and write latency (w_await) is high. That’s consistent with writeback throttling and/or heavy writes.
Decision: Tuning dirty settings can smooth bursts, but if the disk is saturated at baseline, you need better storage, less write volume, or app-level batching/compression.
Task 9: Look for tasks stuck in uninterruptible IO sleep
cr0x@server:~$ ps -eo state,pid,comm,wchan:32 --sort=state | head -n 15
D 9182 postgres io_schedule
D 14231 java balance_dirty_pages
D 7701 rsyslogd ext4_da_writepages
R 2109 sshd -
S 1987 systemd ep_poll
S 2211 containerd ep_poll
What it means: D state tasks are blocked, typically on IO. Seeing balance_dirty_pages is a strong hint that dirty throttling is active.
Decision: If many workers sit in balance_dirty_pages, reduce dirty limits and/or fix storage throughput. Also evaluate whether your application is doing large buffered writes without backpressure awareness.
Task 10: Verify cgroup memory limits (containers bite here)
cr0x@server:~$ cat /sys/fs/cgroup/memory.max
max
What it means: This host (or current cgroup) is not memory-limited.
Decision: If you see a number instead of max, swapping behavior can be triggered by cgroup pressure even if the machine looks roomy overall.
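On cgroup v2 (which Ubuntu 24.04 uses), you can also read per-cgroup pressure stall information to see whether a specific service is struggling even when the host looks roomy. The unit name below is a placeholder; substitute your own:
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
That is what an unpressured cgroup looks like. Non-zero full averages mean tasks in that cgroup were completely stalled on memory; that’s cgroup pressure, and global swappiness won’t fix it.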
Task 11: Apply a temporary tuning safely (runtime only)
cr0x@server:~$ sudo sysctl -w vm.swappiness=30
vm.swappiness = 30
What it means: You changed the running kernel parameter. Reboot will revert unless made persistent.
Decision: If swap storms reduce and cache misses don’t explode, consider persisting. If nothing changes, don’t cargo-cult it—look elsewhere.
Task 12: Switch dirty tuning to byte-based caps (runtime only)
cr0x@server:~$ sudo sysctl -w vm.dirty_background_bytes=536870912 vm.dirty_bytes=2147483648
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 2147483648
What it means: Background writeback starts at 512 MiB dirty, throttling at 2 GiB dirty—regardless of RAM size.
Decision: If this smooths IO wait and reduces “writeback cliff” stalls, persist it. If your workload relies on huge write caching for throughput, you may see reduced peak write speed; decide whether you prefer peak or predictability.
Task 13: Persist sysctls the right way (and verify)
cr0x@server:~$ sudo tee /etc/sysctl.d/99-vm-tuning.conf >/dev/null <<'EOF'
vm.swappiness=30
vm.dirty_background_bytes=536870912
vm.dirty_bytes=2147483648
EOF
cr0x@server:~$ sudo sysctl --system | tail -n 6
* Applying /etc/sysctl.d/99-vm-tuning.conf ...
vm.swappiness = 30
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 2147483648
What it means: Settings are now persistent and applied. The output confirms what was loaded.
Decision: If sysctl --system shows your values overridden later, you have another file or tool fighting you (cloud-init, config management, vendor agent). Fix the source of truth.
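A quick way to find the competing file before blaming cloud-init: search every location sysctl --system reads. The directory list below mirrors the usual Ubuntu layout; adjust if yours differs:
cr0x@server:~$ grep -rnE 'swappiness|dirty_' /etc/sysctl.conf /etc/sysctl.d/ /run/sysctl.d/ /usr/lib/sysctl.d/ 2>/dev/null
If the same key shows up in more than one file, the one applied last wins; consolidate rather than playing priority games.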
Task 14: Confirm your dirty ratios are effectively disabled when bytes are set
cr0x@server:~$ sysctl vm.dirty_ratio vm.dirty_bytes
vm.dirty_ratio = 0
vm.dirty_bytes = 2147483648
What it means: Setting dirty_bytes clears the corresponding ratio (the two are mutually exclusive), so dirty_ratio now reads 0 and the byte value is the active throttle threshold.
Decision: Don’t be alarmed by the 0. The bytes are your actual control now; writing a ratio again would clear the byte setting, so pick one mechanism and stick with it.
Task 15: Check for ext4 delayed allocation write bursts (context, not blame)
cr0x@server:~$ mount | grep ' on / '
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
What it means: ext4 with delayed allocation can batch writes, which can amplify writeback bursts depending on workload.
Decision: Don’t change filesystem mount options as a first response. Use dirty caps to shape burstiness; change FS behavior only after you have a measured reason.
7. Opinionated tuning profiles that don’t blow up quietly
Profile A: “Mostly normal servers” (web + sidecars + modest writes)
Use this when you want fewer surprises, not maximum benchmark numbers.
- vm.swappiness=30
- vm.dirty_background_bytes=268435456 (256 MiB)
- vm.dirty_bytes=1073741824 (1 GiB)
Why: Keeps dirty debt capped. Starts flushing earlier. Still allows buffering for throughput, just not “let’s buffer 30 GB because we can.”
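As a drop-in, Profile A might look like the sketch below. The filename is a placeholder; if you already keep a single file like the 99-vm-tuning.conf from Task 13, put the values there instead of maintaining two sources of truth:
cr0x@server:~$ sudo tee /etc/sysctl.d/90-profile-a.conf >/dev/null <<'EOF'
# Profile A: mostly normal servers -- cap IO debt, allow moderate swapping
vm.swappiness=30
vm.dirty_background_bytes=268435456
vm.dirty_bytes=1073741824
EOF
cr0x@server:~$ sudo sysctl --system >/dev/null && sysctl vm.swappiness vm.dirty_background_bytes vm.dirty_bytes
vm.swappiness = 30
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824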
Profile B: “Write-heavy, latency-sensitive” (logs, ingestion, queues)
- vm.swappiness=10 (or 20 if you rely on cache heavily)
- vm.dirty_background_bytes=134217728 (128 MiB)
- vm.dirty_bytes=536870912 (512 MiB)
Why: You’re trading peak write buffering for fewer tail-latency cliffs. This reduces the chance that a sudden flush event bulldozes your service.
Profile C: “Big RAM boxes with uneven storage” (the classic trap)
- vm.swappiness=20–40 depending on workload
- vm.dirty_background_bytes=536870912 (512 MiB)
- vm.dirty_bytes=2147483648 (2 GiB)
Why: Big RAM plus average disks is how you accidentally build an IO debt machine. Ratios scale up, but disks don’t.
What to avoid (because you’ll be tempted)
- Don’t set dirty_ratio higher to “improve performance” unless you’re sure storage can handle the later flush. You’re usually buying a benchmark win with a production outage.
- Don’t disable swap blindly on general-purpose servers. You are replacing a gradual degradation mechanism with a sudden death mechanism.
- Don’t tune in the dark. If you don’t measure IO wait, dirty levels, and swap activity, you’re just rearranging kernel vibes.
8. Three corporate mini-stories from the trenches
Story 1: The incident caused by a wrong assumption
They had a fleet of Ubuntu servers running a busy API, plus a background worker that periodically wrote batch exports to disk. Nothing exotic. The team noticed swap usage creeping up over weeks and decided swap was the villain. The fix was decisive: disable swap everywhere. “Swap is slow; we have enough RAM.” A sentence that has ended many peaceful on-call rotations.
Two days later, a traffic surge arrived with a modest memory leak in a rarely exercised code path. The leak wasn’t huge, and with swap present, the system would have limped along while alerts fired. Without swap, the kernel had one tool left: the OOM killer. Under load, the OOM killer did what it does: it killed a process that looked like it was using a lot of memory. That process happened to be the API worker.
The outage wasn’t dramatic at first. It was worse: a rolling pattern of partial failure. Instances would fall out of the load balancer, restart, warm up caches, and die again. Users saw flapping errors. Engineers saw “healthy” CPU graphs and a growing sense of personal betrayal.
The wrong assumption wasn’t “swap is slow.” Swap is slow. The wrong assumption was that swap is only for performance. Swap is also for stability—buying you time to detect leaks and recover gracefully. After the incident, they re-enabled swap and set a lower swappiness. Then they fixed the leak. The most important change was cultural: they stopped treating swap as a moral failing.
Story 2: The optimization that backfired
A data platform team had large-memory nodes and write-heavy ingestion. Someone read that increasing dirty ratios can improve throughput because the kernel can batch larger writes. They raised vm.dirty_ratio and vm.dirty_background_ratio substantially. In a synthetic test, throughput looked great. The dashboards smiled. The PR merged quickly.
In production, the workload wasn’t a steady stream. It was spiky: bursts of writes followed by quiet periods. With high dirty ratios, the system happily buffered huge bursts in RAM. Then the quiet period arrived and the kernel started flushing a mountain of dirty data. Flushes were large enough to saturate the storage, which increased IO latency for everything else: metadata operations, reads, even small writes from unrelated services.
Users experienced the issue as “random stalls.” The stalls lined up with flush cycles, but only if you graphed dirty memory and IO latency on the same axis. The first response was to blame the application. The second response was to blame the cloud provider. The third response—eventually—was to admit the kernel was doing exactly what they asked, and the optimization was tuned for a benchmark, not for a spiky fleet.
The fix was boring: switch to byte-based dirty limits that reflected what the storage could actually flush without drama. Throughput dropped slightly on the most aggressive bursts. Tail latency improved drastically. Everyone pretended this was the plan all along, which is also boring and therefore acceptable.
Story 3: The boring but correct practice that saved the day
A finance-adjacent team ran a set of Ubuntu 24.04 hosts that handled end-of-month processing. The workload was predictable: heavy writes for a few hours, then quiet. They weren’t chasing peak performance; they were chasing “never wake me up.” Their practice was painfully unglamorous: every kernel tuning change required a canary rollout, a set of before/after measurements, and a rollback plan.
They had already standardized on byte-based dirty caps and a moderate swappiness. They also had dashboards for dirty memory, IO latency, and swap-in/out rates. Not because it’s fun, but because it prevents debates fueled by vibes.
One month, a storage backend change introduced slightly higher write latency. Nothing catastrophic, just slower. On the first heavy processing run, their metrics showed dirty pages climbing faster than usual and foreground write throttling kicking in earlier. But because their dirty caps were conservative, the system degraded gracefully: processing took longer, but the hosts stayed responsive, and nothing spiraled into a stall storm.
The team used the evidence to push back on the storage change and temporarily adjusted their job scheduling. No heroics, no kernel archaeology at 3 AM. The correct practice wasn’t a magic sysctl; it was treating sysctls as production configuration with observability and change control.
9. Common mistakes: symptom → root cause → fix
1) Symptom: “Swap is used even though free RAM exists”
Root cause: “Free” is not the metric. Linux uses RAM for cache; swap decisions depend on reclaim cost, not on “free MB.” Also, cold anonymous pages may be swapped to keep cache hot.
Fix: Look at available in free, trend swap-in/out with vmstat. If swap-in is near zero and performance is fine, do nothing. If swap-in rises under load, reduce working set or lower swappiness cautiously.
2) Symptom: periodic 10–60 second stalls; many processes in D state
Root cause: Dirty writeback cliff: too many dirty pages accumulate, then a big flush saturates storage and throttles writers.
Fix: Set vm.dirty_background_bytes and vm.dirty_bytes to sane caps; verify with /proc/meminfo and IO latency tools (iostat). Consider lowering expire/writeback intervals modestly if flushes are too lumpy.
3) Symptom: high IO wait but low disk throughput
Root cause: Latency-bound storage (IOPS-limited, queueing, throttling, or a noisy neighbor). Dirty throttling can amplify it because writers stall waiting for writeback completion.
Fix: Use iostat -xz to check await and queue size; reduce dirty caps to limit queue explosion; address storage limits (bigger volume class, more IOPS, local NVMe, or reduce sync writes).
4) Symptom: swapping causes tail latency spikes, but total swap used is “small”
Root cause: Even small swap-in rates can hurt if they hit hot pages during peak. Latency-sensitive workloads hate swap-in more than they hate reduced cache.
Fix: Lower swappiness (e.g., 10–30), confirm the app isn’t memory-leaking, and ensure swap device isn’t slow. Consider pinning critical services with appropriate cgroup settings rather than global hacks.
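If you go the per-service route, systemd’s resource controls are the usual lever. A hedged sketch with placeholder names (api.service and 2G are illustrative): MemoryLow= asks the kernel to protect that much of the service’s memory from reclaim while the rest of the host gets squeezed first.
cr0x@server:~$ sudo systemctl set-property api.service MemoryLow=2G
cr0x@server:~$ systemctl show api.service -p MemoryLow
MemoryLow=2147483648
set-property applies at runtime and persists as a drop-in, so track it in config management like any other tuning.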
5) Symptom: after tuning dirty bytes, throughput dropped and CPU went up
Root cause: You reduced buffering too far, causing more frequent smaller writebacks and more overhead, or you forced foreground throttling too often.
Fix: Increase dirty_bytes modestly (e.g., from 512 MiB to 1 GiB) and re-measure. Don’t jump back to huge ratios; aim for the smallest caps that avoid stalls.
6) Symptom: “Settings don’t stick” after reboot
Root cause: You changed runtime sysctl only, or another component overwrites sysctls at boot.
Fix: Put settings in /etc/sysctl.d/, run sysctl --system, and inspect boot-time logs or configuration management to find the override source.
7) Symptom: container swaps heavily while host looks fine
Root cause: cgroup memory limits (and possibly swap limits) trigger reclaim inside the container scope.
Fix: Inspect cgroup memory.max, container memory requests/limits, and workload sizing. Global swappiness won’t fix a too-tight container limit.
10. Checklists / step-by-step plan
Step-by-step: safe tuning workflow for production
- Write down the symptom in operational terms: “p99 latency spikes during batch writes,” “hosts freeze for 30 seconds,” “swap-in rises after deploy.” If you can’t phrase it, you can’t validate it.
- Capture baselines: swappiness/dirty settings, swap usage, dirty memory, IO latency, and blocked tasks.
- Prove which pressure dominates: memory vs writeback vs raw storage saturation.
- Change one dimension at a time: swappiness or dirty limits, not both, unless you’re already sure both contribute.
- Use runtime sysctl first: validate effect under real workload without committing.
- Canary rollout: apply to a small subset; compare with identical load patterns.
- Persist via sysctl.d: keep a single authoritative file per host role; avoid “mystery” settings.
- Reboot validation: confirm after reboot that settings remain and behavior matches.
- Document the why: include the metric change you were targeting and the observed outcome.
Checklist: what “good” looks like after tuning
- Swap-in/out is near zero during normal traffic (unless you have a known reason).
- Dirty memory rises and falls smoothly, not as a sawtooth with long flat stalls.
- IO latency doesn’t periodically spike to the moon during background flush.
- Few or no application threads are stuck in D state for long.
- Throughput is acceptable and tail latency is stable.
Checklist: rollback plan
- Keep last-known-good sysctl file content in your config management history.
- Have a one-liner to revert runtime settings (sysctl -w with the old values).
- Know which graphs should improve within minutes (dirty/IO wait) versus hours (swap trends, cache behavior).
11. FAQ
1) Should I set vm.swappiness=1 on all servers?
No. On latency-sensitive services, it can be reasonable. On IO-heavy servers that benefit from cache, it can backfire by dropping cache too aggressively. Start around 20–30 and measure.
2) Is swap usage always bad?
Swap activity during peak is often bad. Small amounts of swap used can be fine if those pages are truly cold and swap-in stays near zero.
3) What’s better: dirty ratios or dirty bytes?
Dirty bytes are more predictable across varying RAM sizes and usually safer on large-memory machines. Ratios are simpler but can scale into absurd dirty backlogs on modern servers.
4) If I cap dirty bytes, will I lose performance?
You may lose peak buffered-write throughput during bursts. You often gain stability and lower tail latency. If you’re running production systems, that trade is usually correct.
5) Do these tunings matter for databases?
Sometimes. Databases with their own buffering and WAL patterns may care more about IO scheduler, filesystem, and storage. But dirty writeback can still affect background tasks, logs, and any buffered IO paths around the database.
6) Why do I see high IO wait when CPU is mostly idle?
Because threads are blocked on IO. CPU “idle” doesn’t mean the system is healthy; it can mean the CPU has nothing to do while everyone waits for storage.
7) Should I disable swap to force the app to fail fast?
Only if you truly want OOM kills as your primary control mechanism and you have strong orchestration/retry behavior. For many fleets, swap is a safety net that prevents cascading failures during short-lived memory spikes.
8) Do I need to tune dirty_expire_centisecs and dirty_writeback_centisecs?
Usually no. Start with dirty bytes/ratios. Consider intervals only if you have periodic flush-related stalls and you’ve confirmed dirty limits aren’t the core issue.
9) I changed sysctls and nothing improved. Why?
Because the bottleneck is often storage throughput/latency, application write patterns, or memory leaks. Sysctls shape behavior at the margins; they do not create capacity.
10) What’s a safe way to test without risking a fleet-wide incident?
Runtime changes on a single canary instance under representative load, with clear success metrics: swap-in rate, dirty levels, IO latency, and request latency.
12. Conclusion: practical next steps
If your Ubuntu 24.04 boxes are stalling, and you see either swap churn or a wall of dirty writeback, you don’t need folklore. You need a tight loop: observe, cap, verify.
- Run the fast diagnosis: is it swap pressure, writeback pressure, or saturated storage?
- If writeback cliffs are the story, move from ratio-based dirty thresholds to byte-based caps sized to your storage reality.
- If swap storms are the story, lower swappiness cautiously and fix the real cause: working set, leaks, or underprovisioned RAM.
- Persist settings in /etc/sysctl.d/, canary them, and verify after reboot.
These tunings are small. That’s the point. Small, targeted changes that turn 02:17 into a normal night.