Debian 13: Swap Is Growing and Performance Tanks — Memory Pressure Fixes That Actually Help

You log in and everything feels… sticky. SSH takes a second too long. A “quick” apt run drags. Your dashboards show CPU is fine,
disk isn’t obviously pegged, and yet latency climbs like it’s trying to win a stair-climb race.
Then you notice it: swap usage is climbing. Slowly at first, then relentlessly. The box isn’t down, but it’s not alive, either.

On Debian 13, this usually isn’t “swap is bad” so much as “your system is negotiating with physics.” The good news: you can diagnose memory pressure
quickly, choose the right intervention, and stop treating swap like a superstition. The bad news: some common “tweaks” make it worse.
Let’s do the fixes that hold up in production.

Fast diagnosis playbook (check this first)

When swap climbs and performance tanks, you want to answer three questions fast:
(1) Are we actually under memory pressure, or just holding onto cache?
(2) If pressure is real, what’s causing it—one process, a cgroup limit, or the kernel’s reclaim behavior?
(3) Is the slowdown from swap I/O, direct reclaim stalls, or both?

First: confirm the pressure and the kind of pain

  • Look at memory + swap + reclaim indicators: free, vmstat, and PSI in /proc/pressure.
  • Look at the biggest consumers: ps RSS sorting, then cgroups.
  • Look at stalls: high si/so (swap in/out), high wa, and PSI “some/full”.

Second: identify the boundary

  • If you’re on bare metal/VM with no cgroup limits, the boundary is “RAM + swap”.
  • If you’re on a host with containers, the boundary might be a cgroup memory.max; swap can grow on the host while a container is choking.
  • If systemd-oomd is enabled, it might be killing (or not killing) based on PSI signals and unit configuration.

Third: decide the intervention class

  • Leak or runaway: fix the app, restart safely, add guardrails (limits/oomd).
  • Cache-heavy workload: tune reclaim and avoid “drop caches” panic buttons.
  • Swap thrash: reduce swapping (swappiness, memcg tuning) or make swap faster (zram/NVMe) while you fix the real cause.
  • Mis-sized system: add RAM or reduce concurrency. Yes, sometimes the fix is spending money.

A paraphrased idea from John Allspaw (operations and reliability): “You don’t fix incidents by hunting a single ‘root cause’; you fix the conditions that allowed failure.”
That’s swap growth in a nutshell: it’s usually a set of conditions, not one villain.

What’s really happening when swap grows

Linux uses RAM for more than your application’s heap. It keeps file data in the page cache, tracks filesystem metadata,
and holds anonymous pages (your process memory). When RAM fills, the kernel reclaims memory: it drops clean cache pages,
writes dirty pages back, and, if it must, swaps anonymous pages to swap space.

Swap growth doesn’t automatically mean the system is “out of memory.” It can also mean the kernel decided some pages were cold
and parked them in swap to make room for cache that improves throughput. That sounds reasonable—until it isn’t.

The three failure modes you actually see

  1. Benign swap: swap is used a bit, but swap-in rate stays low. The machine is responsive. You can ignore it.
  2. Swap thrash: swap-in/out rates spike, latency explodes, CPU shows time in I/O wait, and interactive tasks crawl.
    This is your “performance tanks” scenario.
  3. Direct reclaim stalls without massive swap I/O: the kernel spends time reclaiming pages; PSI memory “some/full” rises.
    The box feels slow even if swap I/O isn’t huge, especially with certain workloads and memory fragmentation.

On Debian 13, you’ll typically have a modern kernel and systemd. That’s good—PSI is available, cgroup v2 is common, and
systemd-oomd exists. But it also means you can end up with a lot of “smart defaults” interacting in ways that are smart individually
and dumb collectively.

Short joke #1: Swap is like a junk drawer—fine until you need that one thing quickly and it’s buried under old takeout menus.

Facts and context: swap, reclaim, and why Debian behaves like this

  • Fact 1: Linux swap predates modern “RAM is cheap” thinking; it was designed when memory was scarce and workloads were smaller,
    so reclaim policies evolved over decades, not months.
  • Fact 2: The Linux kernel distinguishes anonymous memory (process pages) from file-backed memory (page cache).
    File-backed pages can be dropped or written back; anonymous pages can only go to swap (or be freed by killing the owning process).
  • Fact 3: PSI (Pressure Stall Information) landed to measure time tasks are stalled due to resource pressure (memory/CPU/I/O),
    which is more actionable than “free memory” when diagnosing latency.
  • Fact 4: cgroup v2 changed memory control semantics and added “memory.high” as a throttling knob alongside the hard “memory.max” limit;
    this matters because throttling can look like “mysterious slowness.”
  • Fact 5: OOM killer is kernel-level and blunt; systemd-oomd is user-space and policy-driven,
    often triggered by PSI signals rather than waiting for total OOM.
  • Fact 6: The default vm.swappiness value has changed across eras and distros; it’s not a moral statement,
    it’s a trade-off knob, and defaults are tuned for “general purpose,” not your specific workload.
  • Fact 7: Swapping to fast NVMe can “work” but still wreck tail latency because swap-in is often on the critical path for request handling.
  • Fact 8: Transparent Huge Pages (THP) can inflate memory footprint and complicate reclaim behavior; for some databases it’s beneficial,
    for others it’s a steady source of pain.
  • Fact 9: Memory fragmentation can cause allocation failures or reclaim stalls even when “free memory” looks nonzero,
    particularly for large contiguous allocations (including huge pages).

Practical tasks: commands, outputs, and decisions (the real work)

These are not “run this and feel better” commands. Each one is: command → what output means → what decision you make.
Run them in order until the story becomes obvious.

Task 1: Establish baseline memory and swap state

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        51Gi       1.2Gi       1.5Gi        10Gi       3.4Gi
Swap:           16Gi       9.8Gi       6.2Gi

What it means: “available” is the number to watch, not “free.” Here, only 3.4Gi is readily reclaimable without major pain.
Swap is already ~60% used.
Decision: Continue: we need to see whether swap is actively used (thrash) or just allocated and cold.

Task 2: Check live swap activity (thrash vs parked pages)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 10032128 120064  89216 8675328  12  18    22    40  820 2150 12  6 80  2  0
 1  0 10032256 118720  89216 8675456   0   0     0     8  790 2080 10  5 84  1  0
 3  1 10110464  93248  89216 8629312 420  980   120   640 1100 3200 14  7 62 17  0
 2  1 10135680  90112  89216 8622336 510 1200   180   920 1180 3400 15  8 55 22  0
 1  0 10136576  95040  89216 8622912  10   40    16    88  840 2250 12  6 79  3  0

What it means: Look at si/so (swap in/out) and wa (I/O wait). The middle samples show heavy swap I/O and high wait.
That’s thrash, not benign swap.
Decision: Identify who is forcing this, and whether the pressure is global or inside a cgroup.

Task 3: Identify top resident memory consumers (RSS)

cr0x@server:~$ ps -eo pid,comm,rss,vsz,%mem --sort=-rss | head -n 12
  PID COMMAND           RSS    VSZ %MEM
 4121 java          18842340 25011240 29.6
 2337 postgres       9241120 10012560 14.5
 2875 python3        4021152  5120032  6.3
 1998 node           2568896  3101024  4.0
 1552 dockerd         812224  1887744  1.3
  911 systemd-journald 392448   566288 0.6

What it means: RSS is “resident set,” what’s actually in RAM. VSZ counts mapped address space, which isn’t necessarily resident.
If one process dominates RSS, it’s your primary suspect.
Decision: For big processes, check whether memory growth is expected (heap, cache) or a leak; also check cgroup limits.

Task 4: Confirm whether you’re using cgroup v2 (common on modern Debian)

cr0x@server:~$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

What it means: cgroup v2 is active. Memory pressure may be occurring within a unit/container even if the host looks “okay,” or vice versa.
Decision: Inspect memory usage and limits per unit or container.

Task 5: Find which systemd slices and units hold the most memory

cr0x@server:~$ systemctl status --no-pager --full user.slice
● user.slice - User and Session Slice
     Loaded: loaded (/usr/lib/systemd/system/user.slice; static)
     Active: active since Mon 2025-12-29 08:12:10 UTC; 2h 21min ago
       Docs: man:systemd.special(7)
      Tasks: 246 (limit: 76800)
     Memory: 14.8G
        CPU: 1h 10min 32.920s
     CGroup: /user.slice
             ├─user-1000.slice
             │ ├─session-4.scope
             │ │ ├─4121 java
             │ │ └─...

What it means: systemd can show memory by slice. If a slice or service is ballooning, you can contain it.
Decision: Drill into the specific service unit (or container scope) and check its memory controls.
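
If you want a faster ranking than reading systemctl status slice by slice, systemd-cgtop can order the cgroup tree by memory. A minimal sketch, assuming a current systemd (check systemd-cgtop(1) for the exact flags on your version):

cr0x@server:~$ systemd-cgtop -m -b -n 1

Scan the Memory column from the top; the heaviest slice or service is the one to drill into in Task 6.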

Task 6: Inspect memory limits and current usage for a specific unit

cr0x@server:~$ systemctl show myapp.service -p MemoryCurrent -p MemoryPeak -p MemoryMax -p MemoryHigh -p ManagedOOMMemoryPressure
MemoryCurrent=20521447424
MemoryPeak=22231822336
MemoryMax=infinity
MemoryHigh=infinity
ManagedOOMMemoryPressure=auto

What it means: No memory ceiling. If the app grows, it will push the system into reclaim and swap.
Decision: If this is a multi-tenant host, add MemoryHigh/MemoryMax guardrails and define an OOM policy.

Task 7: Read PSI memory pressure (this correlates with “everything is slow”)

cr0x@server:~$ cat /proc/pressure/memory
some avg10=18.42 avg60=12.01 avg300=6.55 total=922337
full avg10=3.21 avg60=1.70 avg300=0.62 total=120044

What it means: “some” is time at least one task is stalled on memory. “full” is time the system is fully stalled (no forward progress).
Those are high enough to explain latency spikes and user complaints.
Decision: Treat this as a reliability issue, not a “nice-to-have tuning.” Reduce pressure or enforce limits.

Task 8: Check swap devices and priority (slow swap can be fatal)

cr0x@server:~$ swapon --show --bytes
NAME       TYPE      SIZE        USED       PRIO
/dev/nvme0n1p3 partition 17179865088 10522624000 -2

What it means: Swap lives on an NVMe partition with the default negative priority (-2). If you also have zram or other swap devices, priority decides which gets used first.
Decision: If swap is unavoidable, prefer faster swap (often zram for bursts) and set priorities intentionally.

Task 9: Observe major page faults and swap faults per process

cr0x@server:~$ pidstat -r -p 4121 1 3
Linux 6.12.0 (server)  12/29/2025  _x86_64_  (32 CPU)

09:41:10      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
09:41:11     1000      4121   1200.00     35.00 25011240 18842340  29.6  java
09:41:12     1000      4121   1305.00     42.00 25011240 18850012  29.7  java
09:41:13     1000      4121   1104.00     38.00 25011240 18861020  29.7  java

What it means: Major faults (majflt/s) often mean disk I/O to bring pages back—swap-in is a common cause.
Sustained major faults combined with swap activity are a smoking gun for thrash.
Decision: Reduce working set or increase RAM; also consider heap tuning and memory limits.

Task 10: Check kernel reclaim stats (are we churning page cache?)

cr0x@server:~$ egrep 'pgscan|pgsteal|pswpin|pswpout' /proc/vmstat | head
pgscan_kswapd 18223344
pgscan_direct 992311
pgsteal_kswapd 17199812
pgsteal_direct 801223
pswpin 621044
pswpout 1200033

What it means: High pgscan_direct suggests direct reclaim (tasks stalling). High pswpin/out indicates swap churn.
Decision: If direct reclaim is significant, pressure is hurting latency. Prioritize reducing working set and adding limits; tuning swappiness alone won’t save you.

Task 11: Verify swappiness and other VM knobs (don’t tune blindly)

cr0x@server:~$ sysctl vm.swappiness vm.vfs_cache_pressure vm.dirty_ratio vm.dirty_background_ratio
vm.swappiness = 60
vm.vfs_cache_pressure = 100
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

What it means: swappiness=60 is a general-purpose default. On latency-sensitive servers, it can be too eager to swap under pressure.
Dirty ratios affect writeback bursts; mis-tuning can create I/O congestion that amplifies swap pain.
Decision: If you have thrash, consider lowering swappiness (carefully) and ensure dirty settings aren’t causing writeback storms.

Task 12: Check for THP settings (can inflate footprint and stall reclaim)

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

What it means: THP is enabled “always.” That can be fine, but for some workloads (notably certain DB/cache patterns) it causes latency spikes or memory bloat.
Decision: If you see reclaim stalls and unpredictable latency, test switching to madvise (or never) in a controlled manner.
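
A hedged sketch of that controlled change: flip THP at runtime first, and persist it via the kernel command line only after the test holds (the grub file and update-grub are standard on Debian; adjust if you boot differently):

cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
madvise
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

To persist, add transparent_hugepage=madvise to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run sudo update-grub. Existing transparent huge pages generally stay mapped; the switch mainly affects new allocations.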

Task 13: Confirm disk I/O wait and swap I/O correlation

cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0 (server)  12/29/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    6.50   18.40    0.00   63.00

Device            r/s     w/s   rKB/s   wKB/s  avgrq-sz avgqu-sz   await  svctm  %util
nvme0n1         85.00  140.00  9200.0 18000.0    210.0     4.20   18.50   0.45  92.00

What it means: %iowait is high and NVMe is near saturation. Swap I/O competes with everything else and makes tail latency ugly.
Decision: If swap is on the same device as your database or logs, you’re inviting cross-talk. Separate devices or reduce swapping.

Task 14: Check journal and kernel logs for OOM and memory pressure events

cr0x@server:~$ journalctl -k --since "1 hour ago" | egrep -i 'oom|out of memory|memory pressure|kswapd' | tail -n 20
Dec 29 09:02:14 server kernel: oom_reaper: reaped process 4121 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 29 09:02:14 server kernel: Out of memory: Killed process 4121 (java) total-vm:25011240kB, anon-rss:18842340kB, file-rss:10240kB, shmem-rss:0kB, UID:1000 pgtables:52000kB oom_score_adj:0

What it means: If you see OOM kills, you didn’t “fix swap,” you survived it. Also note: kernel OOM is a last resort, often after long thrash.
Decision: Put policy in place (limits, oomd) so you fail faster and more predictably, and fix the memory sizing/leak.

Task 15: If you run containers, inspect memory events (the quiet assassin)

cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker.service/memory.events
low 0
high 1242
max 0
oom 0
oom_kill 0

What it means: A growing “high” count means the group is hitting memory.high throttling frequently. That looks like random slowness.
Decision: Increase the memory.high threshold, reduce workload concurrency, or fix the service footprint. Throttling is useful, but it’s still pain.
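
cgroup v2 also exposes per-group PSI, which separates “this group is hurting” from “the whole host is hurting.” A minimal check, reusing the same path and assuming PSI is enabled (Task 7 already confirmed it on this host):

cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker.service/memory.pressure

Read it like /proc/pressure/memory: sustained “some/full” averages here while host-wide PSI stays low points at this group, not the node.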

Fixes that actually help (and why)

There are two categories of “fix”: structural and tactical. Structural fixes address why the working set doesn’t fit in RAM. Tactical fixes reduce the blast radius
when it happens anyway. You want both. And you want to avoid cargo-cult tuning.

1) Put memory guardrails where the problem lives (systemd units / cgroups)

If a service can eat the host, it eventually will—by accident, by traffic spike, or by a feature someone deployed at 4:55 PM on a Friday.
On Debian with systemd, use unit-level memory controls.

cr0x@server:~$ sudo systemctl edit myapp.service
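# In the override file that opens, add: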
[Service]
MemoryHigh=18G
MemoryMax=22G
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=20%

How to think about it:
MemoryHigh is “start applying pressure” (reclaim/throttling). MemoryMax is “hard stop.” If your service can’t survive reclaim,
you’d rather kill and restart it than let it poison the whole node.
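
If you need relief right now, without editing unit files or restarting the service, a hedged runtime-only variant (myapp.service and the sizes are placeholders; the values vanish on reboot):

cr0x@server:~$ sudo systemctl set-property --runtime myapp.service MemoryHigh=18G MemoryMax=22G
cr0x@server:~$ systemctl show myapp.service -p MemoryHigh -p MemoryMax

Once the incident is over, move the numbers into a drop-in as above so they survive a reboot.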

2) Configure systemd-oomd intentionally (or explicitly disable it)

systemd-oomd is not the kernel OOM killer. It’s a policy engine based on pressure metrics. This can be fantastic—if you configure it for your topology.
It can also kill the wrong thing if you treat defaults like gospel.

cr0x@server:~$ systemctl status systemd-oomd --no-pager
● systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer
     Loaded: loaded (/usr/lib/systemd/system/systemd-oomd.service; enabled)
     Active: active (running) since Mon 2025-12-29 08:12:11 UTC; 2h 35min ago

Decision: If you run a multi-service host, enabling oomd with correct per-service policies is usually better than waiting for kernel OOM.
If you run a single-purpose appliance with strong app-level control, you may choose to disable oomd to avoid surprise kills—but then you must have other guardrails.
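
A hedged sketch of making the global policy explicit (the thresholds are illustrative, not recommendations; see oomd.conf(5) for the exact keys your systemd version supports). Per-unit opt-in still happens with ManagedOOMMemoryPressure= in the unit drop-in, as shown in fix 1:

cr0x@server:~$ sudo mkdir -p /etc/systemd/oomd.conf.d
cr0x@server:~$ sudo tee /etc/systemd/oomd.conf.d/50-pressure.conf >/dev/null <<'EOF'
[OOM]
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s
EOF
cr0x@server:~$ sudo systemctl restart systemd-oomd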

3) Lower swappiness for latency-sensitive services (but don’t pretend it’s a fix)

vm.swappiness controls how aggressively the kernel will swap anonymous memory relative to reclaiming file cache.
On servers where latency matters more than squeezing every last cache hit, a lower swappiness can help reduce swap churn.
It does not make an undersized host magically fit.

cr0x@server:~$ sudo sysctl -w vm.swappiness=10
vm.swappiness = 10
cr0x@server:~$ sudo tee /etc/sysctl.d/99-swappiness.conf >/dev/null <<'EOF'
vm.swappiness=10
EOF

Decision: If you observe real thrash (high si/so), lowering swappiness is worth testing.
If you observe high PSI “full” and direct reclaim, you still need to reduce memory demand or set limits.

4) Fix the biggest working set first: cap heaps, tune caches, stop double-caching

The most common “swap grows forever” culprit is an application with an elastic heap or cache that was never given a ceiling.
Java, Node, Python services with aggressive caching, and databases can all behave like this.

  • Java: set -Xmx thoughtfully; don’t let it float to “whatever the OS allows.”
  • PostgreSQL: keep shared_buffers reasonable and remember the OS page cache is already a cache.
  • Elasticsearch / caches: cap memory; don’t let caching be “infinite” just because the host has swap.

The classic trap is double-caching: your application caches data in heap while the kernel caches the same data in page cache. You pay twice, then you swap.
A memory hierarchy is not a buffet.
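
A hedged sketch of giving a JVM service a ceiling through its unit (myapp.service, the variable name, and the sizes are placeholders; this only works if the start script actually reads JAVA_OPTS, otherwise put -Xmx directly in ExecStart):

cr0x@server:~$ sudo systemctl edit myapp.service
# In the override file, add:
[Service]
Environment="JAVA_OPTS=-Xms4g -Xmx16g"

Keep the heap comfortably below MemoryHigh: the JVM also needs room for metaspace, thread stacks, and native buffers, and that overhead is what pushes “capped” services into swap anyway.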

5) Consider zram for bursty pressure (especially on smaller VMs)

If your host is occasionally memory tight, zram (compressed RAM swap) can absorb bursts without hammering your disk.
It’s not free—compression costs CPU—but it can be a net win when disk latency is the bigger enemy.

cr0x@server:~$ sudo apt-get update
Hit:1 deb.debian.org stable InRelease
Reading package lists... Done
cr0x@server:~$ sudo apt-get install -y zram-tools
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  zram-tools
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.

Decision: If swap thrash is happening on a slow disk or shared storage, zram can be a strong tactical mitigation.
If your workloads are already CPU-bound, zram might backfire. Measure.
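
zram-tools is driven by /etc/default/zramswap. A hedged example config (variable names as shipped by the package at the time of writing; verify against the file on your host, and treat the numbers as starting points):

cr0x@server:~$ sudo tee /etc/default/zramswap >/dev/null <<'EOF'
# Compression algorithm and the share of RAM to dedicate to the zram device
ALGO=zstd
PERCENT=25
# Higher priority than disk swap so bursts land in compressed RAM first
PRIORITY=100
EOF
cr0x@server:~$ sudo systemctl restart zramswap.service
cr0x@server:~$ swapon --show

swapon --show should now list the zram device above the disk swap in priority; if it doesn’t, your package version may use different variable names.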

6) Put swap on the right storage, with the right expectations

Swap on spinning disks is a time machine to 2009. Swap on networked block devices is an adventure you didn’t ask for.
Swap on NVMe can be workable, but if it shares the device with latency-critical databases, it can still destroy tail latency.

Good practice:

  • Prefer a dedicated swap device, or at least keep swap away from DB WAL/log-heavy volumes.
  • Ensure swap is enabled early and has the correct priority if multiple swap backends exist (see the fstab sketch below).
  • Right-size swap. If you “need” enormous swap to stay up, you’re masking a capacity issue.
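
A hedged /etc/fstab sketch for intentional priorities, using the NVMe partition from Task 8 (higher pri wins; a zram device configured as in fix 5 would sit above it):

# /etc/fstab
/dev/nvme0n1p3  none  swap  sw,pri=10  0  0

cr0x@server:~$ sudo swapoff -a && sudo swapon -a
cr0x@server:~$ swapon --show

Caution: swapoff -a on a box that is already deep in swap forces everything back into RAM and can itself trigger OOM. Re-prioritize in a calm window, not mid-incident.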

7) Tune dirty writeback to avoid I/O storms that amplify swap pain

Memory pressure + writeback bursts = a bad time. If dirty pages pile up, the kernel eventually forces writeback, which can saturate I/O just when swap-in needs it.
This is how you get a server that is both swapping and “randomly” slow even after swap stops growing.

cr0x@server:~$ sudo sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Decision: Lower ratios can smooth writeback on systems with bursty writes and latency sensitivity.
But don’t tune this in isolation; you need to check your storage throughput and application write patterns.

8) Avoid “drop caches” as a habit

Yes, echo 3 > /proc/sys/vm/drop_caches will make “free memory” look better. It will also nuke useful cache and can cause a thundering herd of re-reads.
It is a last-ditch troubleshooting tool, not a fix. If you need it weekly, you’re running a production system on vibes.

9) Don’t disable swap unless you’re prepared for abrupt death

Turning off swap can reduce thrash, but it changes failure mode from “slow” to “killed.” Sometimes that’s desirable. Often it’s not.
If you disable swap, you need:

  • strict memory limits per service
  • a clear restart policy
  • alerting on approaching limits and PSI
  • a plan for traffic spikes

Short joke #2: Disabling swap to fix memory pressure is like removing the smoke detector because the beeping is annoying.

10) Admit when the machine is too small

If your steady-state working set exceeds RAM, you will either swap, stall, or kill processes. Tuning can pick which misery you prefer;
it can’t rewrite arithmetic. Add RAM, reduce concurrency, shard the workload, or move heavy services off the node.

Three corporate mini-stories (how this fails in real life)

Mini-story 1: The incident caused by a wrong assumption

A team ran a customer-facing API on Debian hosts with “plenty of memory.” They had swap configured because “it’s safer.”
Their dashboards tracked CPU, request rate, and disk utilization. Memory was a single line: “free.”
It looked fine most of the time. Then latency doubled for ten minutes at a time, several times a day, with no obvious pattern.

The wrong assumption was subtle: they believed if the host wasn’t out of memory, memory wasn’t the problem.
Swap usage climbed gradually during traffic peaks, but since nothing crashed, it didn’t trigger urgency.
Meanwhile, the kernel was swapping out cold pages from a Java service. Under later bursts, those pages weren’t cold anymore.
The swap-ins landed on the same NVMe volume used by the database WAL.

The incident manifested as “database gets slow,” because that’s what everyone saw first: queries waiting on I/O and locks.
But the root condition was memory pressure causing swap I/O, which competed with WAL and amplified tail latency.
The team chased indexes and connection pools for a week.

The fix wasn’t exotic. They added PSI-based alerting, lowered swappiness, separated swap from the database volume,
and—most importantly—capped the Java heap and the in-process caches. After that, swap stayed mostly flat,
and when pressure returned, it was obvious early instead of being a daily whodunit.

Mini-story 2: The optimization that backfired

Another org wanted to “use memory efficiently” and pushed hard on maximizing page cache for read-heavy workloads.
Someone suggested raising swappiness “so Linux can keep file cache hot.” They also enabled Transparent Huge Pages everywhere
because “bigger pages are faster.”

On paper, it sounded clean: more cache hits, fewer disk reads. In practice, two services shared the host:
a cache-heavy API and a batch job that spiked memory during daily processing.
With high swappiness, anonymous pages were swapped aggressively during pressure; with THP, the memory footprint became less predictable.
The batch job triggered reclaim; reclaim triggered swap; swap stole I/O bandwidth; and the interactive API paid the bill.

The really painful part: the optimization improved throughput benchmarks in isolation.
It only failed under mixed workload conditions, which is the default state of corporate servers.
The team saw “good cache hit ratio” and assumed success while customer latency degraded.

The rollback was straightforward: swappiness back down, THP moved to madvise, and the batch job moved into a systemd scope with a hard memory cap.
They also stopped co-locating batch processing with latency-sensitive APIs on the same node class.
The “optimization” wasn’t evil; it was misapplied and unbounded.

Mini-story 3: The boring but correct practice that saved the day

A platform team ran a fleet of Debian servers hosting multiple internal services. Nothing glamorous: logs, metrics, a handful of APIs,
and a couple of background workers. They had a boring rule: every service gets a systemd unit with memory boundaries, and every node
exports PSI metrics with alerts. No exceptions. Developers complained. Quietly, everyone complied.

One afternoon, a deployment introduced a memory leak in a Python worker. RSS climbed slowly over a few hours.
On hosts without guardrails, this kind of leak becomes a slow-motion outage: swap grows, latency rises, people blame the network,
then finally something crashes.

Here, the worker hit MemoryHigh first and started getting throttled. PSI spiked for the unit, not the whole node.
Alerts fired with clean context: “worker unit memory pressure” rather than “host is slow.”
systemd-oomd killed the worker when pressure exceeded the configured limit. systemd restarted it, restoring service.

Nobody celebrated. That’s the point. The system degraded predictably and recovered without turning the entire node into a swap furnace.
The follow-up work was focused: fix the leak, add a regression test, and keep the guardrails. Boring won. Again.

Common mistakes: symptom → root cause → fix

1) “Swap used is high, so we’re out of memory”

Symptom: Swap is non-zero or growing, but system feels fine.
Root cause: Benign swap: cold anonymous pages swapped out; no significant swap-in rate.
Fix: Check vmstat for si/so and PSI. If swap-in is near zero and PSI is low, don’t touch it.

2) “We’ll fix it by setting swappiness=1”

Symptom: Swap reduces, but the system starts OOMing or stalls during reclaim anyway.
Root cause: Working set still doesn’t fit; you just changed which pages get sacrificed. Low swappiness doesn’t prevent memory pressure.
Fix: Cap application memory, set memory.high/max, reduce concurrency, or add RAM. Use swappiness as a fine-tuning knob, not a parachute.

3) “Disable swap and we’re safe”

Symptom: No more slow thrash, but sudden process kills and restarts; sometimes kernel OOM takes out the wrong service.
Root cause: Failure mode changed from “degraded” to “abrupt.” Without policy, it’s chaos.
Fix: If disabling swap, enforce per-service limits and adopt predictable kill/restart policy (systemd-oomd or service supervision).

4) “It’s disk performance; let’s tune the filesystem”

Symptom: High iowait, high storage utilization, but the root event is memory pressure.
Root cause: Swap I/O competes with real I/O; the disk looks guilty because it is being used as emergency RAM.
Fix: Reduce swapping first. Separate swap device if needed. Then re-evaluate disk tuning.

5) “We’ll just drop caches”

Symptom: Short-term improvement, then worse performance and more I/O.
Root cause: You destroyed useful page cache; the system must re-read data, increasing I/O and pressure.
Fix: Don’t do it except as a controlled experiment. Fix memory sizing and app footprint.

6) “The container has a memory limit, so the host is protected”

Symptom: Host swap grows; a container slows down; no clear OOM events.
Root cause: memory.high throttling and reclaim inside the cgroup; host page cache and other services still compete; limits don’t prevent pressure from affecting neighbors.
Fix: Configure both memory.high and memory.max sensibly; isolate noisy services; watch cgroup PSI and memory.events.

7) “Swap is slow because it’s on NVMe; that can’t be it”

Symptom: NVMe is fast but latency is still awful under swapping.
Root cause: Swap-in is on the critical path; queueing and contention hurt tail latency even on fast devices.
Fix: Stop swapping by reducing working set. Use zram for bursts if CPU allows. Separate swap from hot I/O paths.

Checklists / step-by-step plan

Checklist A: When you’re on-call and the box is “slow”

  1. Run free -h and record “available” and swap used.
  2. Run vmstat 1 10. If si/so spikes or wa is high, suspect swap thrash.
  3. Read /proc/pressure/memory. If “full avg10” is non-trivial, treat as incident-worthy.
  4. Check iostat -xz 1 3 to see if swap is saturating your primary disk.
  5. Find top RSS consumers with ps; confirm with service/cgroup metrics.
  6. If a single service is runaway: restart safely and add a temporary limit (MemoryHigh/Max) to stop recurrence.
  7. Open a follow-up task: capacity plan, per-service memory caps, and PSI alerts.
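
A hedged one-shot capture of steps 1–5, so the evidence lands in a single file for that follow-up (assumes sysstat is installed for iostat; the path is arbitrary):

cr0x@server:~$ { free -h; cat /proc/pressure/memory; vmstat 1 5; iostat -xz 1 3; ps -eo pid,comm,rss,%mem --sort=-rss | head -n 15; } > /var/tmp/mem-triage-$(date +%Y%m%d-%H%M%S).txt 2>&1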

Checklist B: Permanent fixes on a multi-service Debian 13 host

  1. Inventory services and set memory boundaries: MemoryHigh for soft pressure, MemoryMax for hard stop.
  2. Decide policy: use systemd-oomd to kill units under sustained pressure, or keep kernel OOM as last resort (not recommended for multi-tenant).
  3. Set swappiness to a tested value (often 10–30 for latency-sensitive hosts).
  4. Evaluate zram for burst absorption; validate CPU headroom.
  5. Place swap on appropriate storage; avoid sharing with DB/WAL heavy workloads.
  6. Validate THP behavior; switch to madvise if it reduces latency spikes for your workload.
  7. Alert on PSI memory “some/full” and on swap-in/out rates, not just swap used.

Checklist C: If you suspect a leak

  1. Capture RSS growth over time: ps snapshots or pidstat -r.
  2. Confirm swap-in behavior via major faults: pidstat -r (watch majflt/s).
  3. Add a hard memory cap to prevent host-wide impact while you debug.
  4. Restart service to restore stability; file an incident report with evidence: RSS trend, PSI, vmstat, and logs.
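
A minimal sketch for capturing that RSS trend on one suspect process (the PID is the Java process from Task 3 and the interval is arbitrary; stop the loop once the trend is obvious):

cr0x@server:~$ PID=4121; while sleep 60; do echo "$(date -Is) $(ps -o rss= -p $PID)"; done >> /var/tmp/rss-$PID.log &

Plot or just eyeball the log: steady growth with no plateau across normal traffic cycles is the classic leak signature.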

FAQ

1) Is swap usage always bad?

No. Swap activity is what kills you. If swap used is high but vmstat shows near-zero si/so and PSI is low, it’s probably fine.

2) What’s the best single metric for “memory is hurting latency”?

PSI memory pressure (/proc/pressure/memory). It tells you how much time tasks are stalled. It correlates with “users are mad” better than “free RAM.”

3) Should I lower swappiness on Debian 13 servers?

Often yes for latency-sensitive services, but only after confirming you’re seeing swap thrash. Start with 10–30, measure swap-in/out and PSI.
Don’t use swappiness to hide an oversized working set.

4) Should I disable swap entirely?

Only if you accept abrupt OOM kills and you have guardrails (per-service limits, restart policies, and monitoring). Disabling swap can improve predictability
for some clusters, but it’s not a universal upgrade.

5) What’s better: kernel OOM killer or systemd-oomd?

For multi-service hosts, systemd-oomd with explicit unit policies is often better because it can act earlier based on pressure and target the right unit.
Kernel OOM is last-resort and can take down something critical after minutes of thrash.

6) Why does performance tank even when swap I/O isn’t massive?

Direct reclaim stalls. The kernel may spend time scanning and reclaiming pages while tasks wait. PSI “some/full” will usually show this.
Fragmentation and THP interactions can also contribute.

7) Is zram safe for servers?

Yes, when used deliberately. It’s great for bursty pressure and for systems where disk swap is slow or contested.
It costs CPU; if you’re CPU-bound, you may just move the bottleneck.

8) My container has a memory limit; why is the host swapping?

Because the host still manages global memory and may swap due to overall pressure. Also, cgroup throttling (memory.high) can cause slowdowns without OOM events.
Inspect memory.events and unit-level PSI.

9) How much swap should a Debian 13 server have?

Enough to handle short bursts and allow orderly behavior, not so much that it hides chronic undersizing. There’s no universal number.
If you regularly rely on deep swap, you need more RAM or smaller working sets.

10) Why does swap keep growing and never shrink?

Linux doesn’t aggressively swap pages back in unless they’re accessed. If swapped pages are truly cold, they stay swapped.
That’s not a bug. It becomes a problem when those pages stop being cold and you start swapping-in under load.

Conclusion: practical next steps

If swap is growing and Debian 13 performance tanks, treat it like what it is: memory pressure turning into latency and I/O contention.
Don’t argue with the swap graph. Use it as a clue, then prove the failure mode with swap activity and PSI.

  1. Right now: run vmstat, free, PSI, and iostat. Decide if you have thrash or benign swap.
  2. Within the hour: identify the top RSS consumer and whether it’s bounded by cgroups/systemd units.
  3. Within the day: add memory guardrails (MemoryHigh/MemoryMax) and a deliberate OOM policy.
  4. Within the week: fix the working set (heap/cache sizing), evaluate zram for bursts, and ensure swap placement doesn’t fight critical I/O.
  5. Always: alert on PSI and swap-in/out, not swap used. Your users experience stalls, not accounting.

The end state you want isn’t “swap is zero.” It’s “pressure is visible, bounded, and recoverable”—and the machine stays responsive when it matters.
