High RAM Usage ‘For No Reason’: What’s Normal and What’s Broken

You log in, run free, and your stomach drops: memory is “used” like it’s going out of style. Nothing is obviously on fire. Latency looks fine. CPU is bored. Yet the dashboard screams: RAM 92%. Someone pings: “Memory leak?”

Sometimes that’s a real incident. Other times it’s just Linux doing its job, using spare RAM as cache because unused memory is wasted opportunity. The trick is knowing which is which—quickly—without ritual sacrifices to the kernel gods.

The mental model: RAM isn’t a bucket, it’s a marketplace

If you take one idea from this piece, take this: “high RAM usage” is not a diagnosis. It’s an observation, and often a healthy one.

Three kinds of memory that get mixed up in alerts

  • Anonymous memory: process heap, stacks, malloc arenas, Java heap, Node/V8 heap, etc. If this grows without bound, you might have a leak or unbounded workload.
  • File-backed memory: mapped files, shared libraries, memory-mapped databases, and especially page cache. This is where Linux keeps file data around because reading from RAM is faster than reading from disk.
  • Kernel memory: slab caches, network buffers, dentries/inodes, and other kernel structures. This can grow for good reasons (traffic) or bad reasons (bugs, misconfig).
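
A quick way to see how these three classes map onto kernel counters (Task 4 below goes deeper): AnonPages is the anonymous class, Cached plus Buffers is the file-backed class, and Slab is the kernel-cache class.

cr0x@server:~$ egrep '^(AnonPages|Cached|Buffers|Slab):' /proc/meminfo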

Why “free memory” is a misleading metric

Linux wants free memory to be close to zero. Not because it’s careless, but because idle memory does nothing. The kernel opportunistically uses RAM for caching and reclaims it when applications need it.

So when someone says, “RAM is full,” the next question is: full of what? Cache is reclaimable. Leaked heap is not (until the process dies).

Two rules that keep you out of trouble

  1. If latency is stable and there’s no reclaim thrash, stop panicking. A cache-heavy system can look “full” and be perfectly healthy.
  2. If swap is growing and major faults spike, you’re paying interest on memory debt. That’s where production outages are born.

One dry truth: the dashboard’s “RAM %” is usually a bad abstraction. You need pressure signals, not just occupancy.
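
On reasonably modern kernels (4.20+ with PSI built in; some distros ship it disabled unless booted with psi=1), the kernel publishes exactly that pressure signal:

cr0x@server:~$ cat /proc/pressure/memory

The "some" line is the share of time at least one task was stalled waiting for memory; "full" is the share of time all non-idle tasks were stalled. Sustained non-zero avg10/avg60 values mean real pressure, whatever the occupancy graph says.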

Interesting facts and history (short, concrete, and useful)

  1. Linux’s “buff/cache” behavior has confused admins for decades; the modern free output added available to reduce false alarms.
  2. Windows popularized the term “standby memory” for reclaimable cache; Linux does the same thing, just with different labels and tooling.
  3. The OOM killer exists because memory overcommit is a promise the kernel cannot always keep: when it can't actually satisfy allocations, it chooses a process to kill rather than deadlock the system.
  4. Containers didn’t invent memory limits; cgroups did. Containers just made it easy to apply cgroups to everything (including your database, which then complains loudly).
  5. Early Unix systems were stingy with page cache because RAM was expensive; modern kernels are aggressive because CPU cycles and I/O wait cost more than RAM.
  6. ZFS’s ARC is basically “page cache, but for ZFS” and can dominate RAM if you let it; it’s not a leak, it’s a strategy.
  7. Transparent Huge Pages (THP) can improve performance—or create latency spikes and memory bloat, depending on workload and fragmentation.
  8. Java memory reporting is notoriously confusing: the heap is only part of it; RSS includes native allocations, JIT, thread stacks, and mapped files.
  9. Linux can reclaim clean page cache quickly but dirty pages require writeback; if writeback is slow, “reclaimable” becomes “not right now.”

What “normal” high RAM looks like vs “broken” high RAM

Normal: RAM used as cache (fast, quiet, and reversible)

Normal looks like this:

  • free shows high “used” but also high available.
  • Low swap usage (or stable swap that doesn’t climb).
  • vmstat shows low si/so (swap-in/out) and low wa (I/O wait).
  • Disk read latency is low because cache hits are high.
  • Processes’ RSS is stable; nothing is ballooning.

This is the kernel being helpful. You can usually leave it alone. If you “fix” it by clearing cache on a schedule, you’re not optimizing; you’re self-sabotaging.

Broken: memory pressure and reclaim thrash

Broken looks like this:

  • available is low and falling.
  • Swap activity is increasing (si/so non-zero for sustained periods).
  • Major page faults spike; latency gets weird.
  • OOM killer messages appear in logs.
  • Kernel kswapd shows up in CPU profiles, because the machine is busy trying to find memory to breathe.

At that point the question changes from “why is RAM high?” to “what is consuming it and can we reclaim it safely?”

The three most common “it’s broken” causes

  1. Real application leak or unbounded memory growth: queueing data in RAM, caching without eviction, growth in per-connection buffers, etc.
  2. Mis-sized caches: database buffers, JVM heap, Redis maxmemory, ZFS ARC, CDN cache, application-level LRU that isn’t actually LRU.
  3. Kernel memory growth: slab caches, networking buffers, conntrack tables, filesystem metadata caches under churn.
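
For the kernel-side suspects, a couple of quick reads cost nothing. One example, assuming the nf_conntrack module is loaded (the files don't exist otherwise), is conntrack occupancy versus its ceiling:

cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

A count hugging the max explains both dropped connections and a chunk of kernel memory; dentry/inode growth shows up in Task 8 below.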

Joke #1: Memory leaks are like glitter: you don’t see it at first, then suddenly it’s in every corner of your life.

Fast diagnosis playbook (first/second/third)

First: Determine if this is pressure or just occupancy

  1. Check free for available.
  2. Check swap trend and vmstat for sustained swap activity.
  3. Check for OOM events in logs.

Decision: If available is healthy and swap isn’t climbing, call it normal caching unless you have user-facing symptoms.

Second: Identify the class of memory

  1. Is it process RSS (a few PIDs growing)?
  2. Is it page cache (cache high, anon stable)?
  3. Is it kernel slab (slab high, objects growing)?
  4. In containers: is it a cgroup limit issue (RSS fine, but cgroup reports near limit)?

Decision: Choose the right tool: per-process inspection, cgroup stats, or slab breakdown—not “restart everything” as a diagnostic method.

Third: Prove the top consumer and pick the least risky mitigation

  1. Get top memory consumers and confirm growth over time (not just a snapshot).
  2. Check whether the consumer is tunable (cache limit) or buggy (leak).
  3. Mitigate: cap cache, fix query/workload, add memory, or restart as a controlled stopgap with follow-up action.

Decision: If the system is under active pressure, prioritize stability: prevent OOM, stop thrash, then do root cause calmly.

Practical tasks: commands, outputs, and decisions

These are real tasks you can run on production Linux hosts. Each includes what the output means and the decision it drives. Use them in order, not at random.

Task 1: Read memory like an adult (free -h)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        26Gi       600Mi       1.2Gi       4.5Gi       3.8Gi
Swap:          4.0Gi       512Mi       3.5Gi

What it means: “used” includes cache and other reclaimables. The number you care about first is available (roughly: how much can be allocated without swapping).

Decision: If available is several GiB and stable, this is likely normal caching. If it’s low (<5–10% of total) and falling, move to pressure checks.

Task 2: Confirm pressure and swap churn (vmstat 1)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 524288 612000  82000 3920000   0    0     5    22  430  900 12  4 83  1  0
 1  0 524288 598000  82000 3950000   0    0     1    10  420  880 11  4 84  1  0
 3  1 525312 120000  81000 2800000 150  320   800   900 1200 2500 25 12 35 28  0
 2  1 540000  90000  80000 2700000 220  400  1100  1400 1500 3000 22 15 30 33  0
 2  1 560000  70000  79000 2600000 180  380   900  1200 1400 2900 20 14 32 34  0

What it means: Sustained non-zero si/so indicates swapping. Rising wa suggests the box is waiting on I/O, often made worse by swap or writeback.

Decision: If swap churn is sustained, treat as an incident: reduce memory use or increase memory/limits before you get latency collapse or OOM.

Task 3: Check for OOM killer evidence (journalctl)

cr0x@server:~$ journalctl -k -S -2h | egrep -i 'oom|out of memory|killed process' | tail -n 20
Feb 05 09:41:12 server kernel: Out of memory: Killed process 21784 (java) total-vm:9876544kB, anon-rss:2456789kB, file-rss:12345kB, shmem-rss:0kB

What it means: The kernel ran out of allocatable memory under its constraints and killed something. That “something” is often not the root cause—just the unlucky victim with the wrong score.

Decision: If you see OOM kills, stop debating definitions and start reducing memory pressure immediately. Then do root cause.

Task 4: See system-wide memory breakdown (cat /proc/meminfo)

cr0x@server:~$ egrep '^(MemTotal|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Active|Inactive|AnonPages|Slab|SReclaimable|Dirty|Writeback):' /proc/meminfo
MemTotal:       32803520 kB
MemAvailable:    4012340 kB
Buffers:          112340 kB
Cached:          3890120 kB
SwapTotal:       4194300 kB
SwapFree:        3670016 kB
Active:         20123456 kB
Inactive:        8765432 kB
AnonPages:      22123456 kB
Slab:            1456780 kB
SReclaimable:     612340 kB
Dirty:            245600 kB
Writeback:          4200 kB

What it means: AnonPages is a rough proxy for process private memory. Cached is page cache. Slab/SReclaimable tells you about kernel caches. Dirty/Writeback tells you if reclaim is blocked on flushing.

Decision: High AnonPages points you at processes/cgroups. High Slab points you at kernel object growth. High Dirty suggests writeback issues and I/O bottlenecks.

Task 5: Identify top processes by RSS (ps)

cr0x@server:~$ ps -eo pid,user,comm,rss,vsz --sort=-rss | head -n 15
  PID USER     COMMAND        RSS    VSZ
21784 app      java        3854120 9876544
10422 app      node        1123400 2456780
 2314 postgres postgres     823000 1024000
 1981 root     dockerd      412000 1650000
 1450 root     systemd-journ 180000  450000

What it means: RSS is resident set size: memory actually in RAM (including some file-backed mappings). VSZ is virtual, often huge, often irrelevant.

Decision: If one process dominates and grows over time, suspect leak or mis-sized cache. If RSS is spread across many processes, suspect workload fan-out or container density.

Task 6: Inspect a suspicious process’s mappings (pmap -x)

cr0x@server:~$ sudo pmap -x 21784 | tail -n 5
00007f2c78000000  262144  260000  260000 rw---   [ anon ]
00007f2c88000000  262144  262144  262144 rw---   [ anon ]
00007f2c98000000  262144  262144  262144 rw---   [ anon ]
00007ffd2c1b1000     132      44      44 rw---   [ stack ]
 total kB         9876544 3854120 3720000

What it means: Large anonymous regions point to heap/native allocations. Huge file-backed mappings may be mmap’d files or shared libs.

Decision: If anonymous RSS is ballooning, you need app-level investigation (heap dump, profiling, cache eviction). If it’s file-backed, consider page cache/mmap behavior and I/O pattern.
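
For the Java process in this example, a minimal app-level starting point, assuming a JDK with jcmd on the box and that you can tolerate the brief pause a class histogram causes:

cr0x@server:~$ sudo -u app jcmd 21784 GC.class_histogram | head -n 15
cr0x@server:~$ sudo -u app jcmd 21784 VM.native_memory summary

The second command only returns data if the service was started with -XX:NativeMemoryTracking=summary; otherwise it tells you tracking is not enabled, which is itself a useful prompt for the next deploy.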

Task 7: Detect whether the kernel is spending time reclaiming (sar -B)

cr0x@server:~$ sar -B 1 3
Linux 6.5.0 (server)  02/05/2026  _x86_64_ (16 CPU)

12:01:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
12:01:02 AM      5.00     22.00   1200.00      0.00   8000.00      0.00      0.00      0.00      0.00
12:01:03 AM    900.00   1400.00   9000.00     85.00  12000.00   4500.00      0.00   3200.00     71.11
12:01:04 AM    850.00   1350.00   8700.00     90.00  11800.00   4700.00      0.00   3100.00     65.96

What it means: Major faults (majflt/s) and scanning/stealing indicate reclaim activity. %vmeff gives a rough sense of reclaim efficiency; low efficiency often correlates with thrash.

Decision: High major faults + heavy scanning = you’re in pressure. Fix memory usage or limits; don’t just “add swap and hope.”

Task 8: Examine kernel slab consumers (slabtop)

cr0x@server:~$ sudo slabtop -o -s c | head -n 15
 Active / Total Objects (% used)    : 812345 / 845000 (96.1%)
 Active / Total Slabs (% used)      : 22000 / 22000 (100.0%)
 Active / Total Caches (% used)     : 95 / 120 (79.2%)
 Active / Total Size (% used)       : 1456780.00K / 1520000.00K (95.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.18K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
220000 219500  99%    0.19K  11000       20    440000K dentry
180000 179000  99%    0.94K   9000       20    360000K inode_cache
 95000  94900  99%    0.05K   1250       76     47500K kmalloc-64

What it means: Big dentry/inode_cache often means filesystem metadata churn (many files). This can be normal on build servers, log shippers, or systems scanning directories.

Decision: If slab is huge and growing, investigate workload (file storms), kernel bugs, or limits like vm.vfs_cache_pressure (with caution). Don’t “drop caches” in production as a lifestyle choice.
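
If you do end up looking at vm.vfs_cache_pressure, inspect before you touch. 100 is the usual default; higher values make the kernel reclaim dentries and inodes more eagerly, at the cost of re-reading metadata later:

cr0x@server:~$ sysctl vm.vfs_cache_pressure
vm.vfs_cache_pressure = 100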

Task 9: Check cgroup memory limits (containers/systemd)

cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myapp.service/memory.max
8589934592
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
8422686720

What it means: You can have plenty of host RAM but still OOM inside a cgroup because the limit is tight. memory.current near memory.max is pressure regardless of host free memory.

Decision: If the cgroup is near limit, raise the limit or reduce workload/caches within the container. Host-level “free memory” won’t save you.
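
The cgroup path above is an example. To find the right one for a specific process on a cgroup v2 host, read its membership from /proc and substitute the path you get back (the node PID from Task 5 is used here):

cr0x@server:~$ cat /proc/10422/cgroup
cr0x@server:~$ cat /sys/fs/cgroup/<path-from-previous-command>/memory.current

On v2 the first command prints a single line in the form 0::/<path>. On cgroup v1 hosts you'll see one line per controller, and the limit lives under /sys/fs/cgroup/memory/<path>/memory.limit_in_bytes instead.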

Task 10: Identify cgroup OOM events (dmesg / kernel log)

cr0x@server:~$ dmesg | egrep -i 'memory cgroup out of memory|oom-kill|Killed process' | tail -n 10
[123456.789012] Memory cgroup out of memory: Killed process 10422 (node) total-vm:2456780kB, anon-rss:980000kB, file-rss:12000kB, shmem-rss:0kB

What it means: That’s a container/cgroup OOM, not necessarily a host OOM. The remediation is limits, not “clear cache.”

Decision: Adjust cgroup limits and set realistic requests/limits. If you don’t, you’re running a lottery where the prize is downtime.

Task 11: Understand “missing memory” via smaps rollup

cr0x@server:~$ sudo cat /proc/21784/smaps_rollup | egrep '^(Rss|Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty|Swap):'
Rss:                3854120 kB
Pss:                3801200 kB
Shared_Clean:         12000 kB
Shared_Dirty:          4000 kB
Private_Clean:        35000 kB
Private_Dirty:      3803120 kB
Swap:                120000 kB

What it means: Private_Dirty is a strong signal of anonymous private memory—often heap. Swap here means the process is actively paying the swap tax.

Decision: If private dirty grows with load and never comes down, plan an application fix. If swap grows, consider memory limit adjustments and investigate hotspots.

Task 12: Check open file descriptors and mmap pressure (lsof)

cr0x@server:~$ sudo lsof -p 21784 | wc -l
18234

What it means: Lots of FDs can correlate with per-connection buffers, large mmap sets, and sometimes leaks (FD leak is not a RAM leak, but it often travels with one).

Decision: If FD counts climb unbounded, treat it as a leak class. Fix the app; raise ulimits only as a stopgap.
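
A cheaper way to watch the trend than re-running lsof is to count entries in /proc/<pid>/fd directly; a minimal sketch for the same PID (Ctrl-C to stop the loop):

cr0x@server:~$ sudo ls /proc/21784/fd | wc -l
cr0x@server:~$ while sleep 60; do echo "$(date +%T) $(sudo ls /proc/21784/fd | wc -l)"; done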

Task 13: Track memory trend over time (ps loop)

cr0x@server:~$ for i in {1..5}; do date; ps -p 21784 -o pid,rss,etime,cmd; sleep 60; done
Mon Feb  5 10:00:00 UTC 2026
  PID   RSS     ELAPSED CMD
21784 3720000  01:22:10 java -jar service.jar
Mon Feb  5 10:01:00 UTC 2026
  PID   RSS     ELAPSED CMD
21784 3755000  01:23:10 java -jar service.jar
Mon Feb  5 10:02:00 UTC 2026
  PID   RSS     ELAPSED CMD
21784 3812000  01:24:10 java -jar service.jar
Mon Feb  5 10:03:00 UTC 2026
  PID   RSS     ELAPSED CMD
21784 3899000  01:25:10 java -jar service.jar
Mon Feb  5 10:04:00 UTC 2026
  PID   RSS     ELAPSED CMD
21784 3980000  01:26:10 java -jar service.jar

What it means: Growth trend beats a single snapshot. If RSS climbs steadily, you’re not looking at “cache”; you’re looking at accumulation.

Decision: Escalate to profiling and code-level mitigation; implement guardrails (limits, backpressure) while you fix it.

Task 14: Check writeback congestion (grep vmstat counters)

cr0x@server:~$ egrep '^(nr_dirty|nr_writeback|pgscan_kswapd|pgsteal_kswapd|oom_kill) ' /proc/vmstat
nr_dirty 61234
nr_writeback 2100
pgscan_kswapd 987654
pgsteal_kswapd 765432
oom_kill 3

What it means: High scanning/stealing indicates reclaim work. Dirty/writeback levels hint at whether reclaim is blocked behind slow flushes.

Decision: If writeback is elevated and I/O is slow, fix storage performance or dirty limits; otherwise memory pressure will feel worse than it “should.”
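
The ceilings that decide how much dirty data can pile up before writeback kicks in are sysctls; worth a glance before blaming reclaim itself. Note that some deployments set the *_bytes variants instead, which override the ratios:

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs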

Three corporate mini-stories from the memory trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size company ran a fleet of Linux VMs behind a load balancer. Their monitoring tool had one headline metric: “RAM used %.” It alerted at 85%. It alerted often. People got trained to ignore it, which is a neat trick until it isn’t.

One afternoon, the RAM alert fired on a critical API node. The on-call glanced: used 90%, CPU fine. They shrugged—“it’s just cache”—and went back to whatever people do when they’re not being paged.

Thirty minutes later, latency climbed and then fell off a cliff. The node started timing out health checks and was yanked from the load balancer. A second node followed. Then a third. Now it was an outage.

The root cause wasn’t page cache. A new deployment had introduced an unbounded in-memory queue when a downstream dependency slowed down. The “used %” graph looked like it always did, but MemAvailable was collapsing and swap churn was rising. They had a pressure problem and treated it like a cosmetics problem.

The fix was two-fold: implement backpressure and put MemAvailable, swap activity, and OOM events into the primary alerts. The wrong assumption wasn’t “Linux uses cache.” The wrong assumption was “high used means the same thing every time.”

Mini-story 2: The optimization that backfired

An engineering team wanted to reduce disk I/O on a log-processing service. They increased an internal cache size substantially. It worked in staging and it made the graphs look great: fewer reads, smoother throughput.

In production, the same change caused a slow-motion failure. Peak traffic made the cache warm quickly, which made RSS rise. That alone would have been fine—machines had plenty of RAM. But the service ran in containers with conservative memory limits set months earlier when traffic was lower and “efficiency” was a KPI.

The cgroup started reclaiming aggressively. Latency increased. The service retried more, which increased memory, which increased reclaim. Then the cgroup OOM killer started terminating containers. Auto-restart made it worse because it created thundering herds of cache warm-ups.

The post-incident lesson was not “caches are bad.” It was “caches are workloads.” If you change the memory profile, you must change the limits and the alarms. Otherwise you’ve just moved the bottleneck into a smaller box and acted surprised when it hurt.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company ran a database cluster and a handful of services on the same hosts. Not ideal, but budgets are a real physical law. One SRE insisted on a weekly drill: record baseline memory breakdowns and keep a runbook with the “known normal” ranges: page cache, slab, database buffers, and typical RSS for each service.

It was tedious. It produced no dopamine. It also meant that when memory started creeping up on a set of nodes, they didn’t argue about whether it was normal—they compared it to baseline.

The slab caches were within normal range, page cache was typical, but AnonPages was higher than baseline by a few GiB and trending up. Top RSS pointed to a sidecar process that “shouldn’t do much.” That phrase is a reliable predictor of future work.

They rolled back the sidecar change before any OOMs. Later analysis showed a library update that changed buffering behavior under certain response patterns. The save wasn’t heroics. It was having a baseline, and trusting it.

Common mistakes: symptom → root cause → fix

1) Symptom: “RAM is 95% used, must be a leak”

Root cause: Interpreting “used” as “unavailable.” Page cache inflates “used” and is reclaimable.

Fix: Alert on MemAvailable, swap activity, major faults, and OOM events. Teach your team to read free properly.

2) Symptom: “Clearing caches fixed it” (temporarily)

Root cause: You flushed page cache and slab, which reduced used memory, but you also destroyed performance and masked the real growth.

Fix: Don’t use drop_caches as a routine. If you must do it during diagnosis, document it and measure the performance impact immediately afterward.

3) Symptom: OOM kills but host has “free memory”

Root cause: cgroup/container limit reached. Host memory is irrelevant to that limit.

Fix: Inspect memory.max/memory.current, raise limits, and set sane requests/limits. Make cgroup OOM alerts distinct from host OOM.

4) Symptom: Latency spikes with plenty of RAM

Root cause: Reclaim thrash or THP/compaction stalls. Memory may be "available" on average but fragmented, so huge-page allocation and compaction stall at exactly the wrong moments.

Fix: Check major faults, reclaim stats, and THP settings. Consider disabling THP for latency-sensitive workloads where it hurts.
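
Checking THP takes seconds. The bracketed word is the active mode (madvise in this example), and AnonHugePages in /proc/meminfo shows how much anonymous memory is actually backed by huge pages:

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
cr0x@server:~$ grep AnonHugePages /proc/meminfo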

5) Symptom: Swap is used, panic ensues

Root cause: Swap usage is not inherently evil; constant swap-in/out is. Linux may keep cold pages swapped out as a policy choice.

Fix: Look at vmstat si/so and major faults. If stable, it’s fine. If churn, treat as pressure.

6) Symptom: Memory grows after every deploy, then “stabilizes”

Root cause: Warmup behavior (JIT, caches, connection pools) looks like a leak if you only watch the first hour.

Fix: Compare steady-state after warmup. Set alerts on growth rate, not just level, and annotate deploys.

7) Symptom: “Used memory” climbs with filesystem-heavy workloads

Root cause: Slab (dentry/inode) growth under metadata churn.

Fix: Confirm with slabtop. Reduce churn (fewer files, larger batches), tune carefully, and ensure kernels are up to date.

8) Symptom: Database node “eats all RAM”

Root cause: Database buffer cache + OS cache double-caching, or intentionally using RAM for performance.

Fix: Size DB caches intentionally and leave headroom for OS and other services. Don’t co-locate memory-hungry services without explicit budgets.

Checklists / step-by-step plan (production-ready)

Checklist A: When you get paged for “high memory”

  1. Run free -h. Focus on available, not “used.”
  2. Run vmstat 1 10. Look for sustained si/so and high wa.
  3. Search for OOM events: journalctl -k -S -2h | egrep -i 'oom|killed process'.
  4. If in containers, check cgroup: memory.current vs memory.max.
  5. If pressure is real, identify top consumers (ps) and check growth over 5–15 minutes.
  6. Choose mitigation: cap caches, reduce concurrency, shed load, raise limits, or add RAM.
  7. Only then, consider restart—and treat it as a stopgap with a ticket attached.

Checklist B: Stabilize a host under memory pressure (without doing something regrettable)

  1. Stop the bleeding: reduce traffic/concurrency or roll back the change that increased memory.
  2. Protect the kernel: ensure you have some swap configured sensibly and not on painfully slow media; avoid turning swap off as a reflex.
  3. Prioritize critical services: adjust cgroup priorities/limits so the database doesn’t die to save a batch job.
  4. Measure reclaim: major faults, pgscan, pgsteal, and swap churn tell you if you’re thrashing.
  5. Make one change at a time: raising a cache limit and lowering concurrency in the same minute destroys your ability to reason.

Checklist C: Build better alerts so this doesn’t happen again

  1. Alert on low MemAvailable sustained (host) and memory.current near memory.max (cgroup).
  2. Alert on sustained swap-in/out, not “swap used.”
  3. Alert on OOM kills as page-worthy events with clear victim/root-cause context.
  4. Track top RSS processes and their slope after deploys.
  5. Annotate deploys and config changes on memory graphs.

Joke #2: “We’ll just increase the cache” is the tech equivalent of “I’ll just loosen my belt” — it works until you try to run.

One quote worth keeping on your wall (or at least your runbook)

Paraphrased idea from Gene Kranz (mission operations): “Tough and competent” — stay calm, stick to fundamentals, and execute the checklist.

FAQ

1) Why does Linux use almost all RAM even when nothing is running?

Because “nothing is running” is rarely true (daemons, caches, kernel structures), and because Linux uses spare RAM for page cache. That cache accelerates future reads and is reclaimed under pressure.

2) Is it bad if swap is non-zero?

Not automatically. Stable swap usage can mean cold pages were moved out. What’s bad is sustained swap-in/out (vmstat si/so), rising major faults, and user-facing latency.

3) What’s the fastest way to tell cache from a leak?

Start with free: if available is healthy, it’s probably cache. Then confirm with /proc/meminfo: high Cached with stable AnonPages points to cache; rising AnonPages points to process memory growth.

4) Why does the same service use more RAM over time even without a leak?

Warm caches, JIT compilation (Java), allocator behavior (glibc arenas), connection pools, and workload shifts can all increase steady-state memory. The question is whether it stabilizes and whether pressure signals remain calm.
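
One of those allocator behaviors is worth naming: glibc malloc creates per-thread arenas (up to 8 per core on 64-bit by default), and each arena reserves address space in roughly 64 MiB chunks that show up as anonymous regions in pmap. If that pattern matches what you see, capping arenas is a common mitigation, assuming the service actually uses glibc malloc:

cr0x@server:~$ MALLOC_ARENA_MAX=2 java -jar service.jar

In practice you'd set the variable in the unit file or container environment rather than an interactive shell, and compare steady-state RSS before and after.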

5) In Kubernetes, why do I get OOMKilled when the node has free memory?

Because the pod hit its memory limit. The kernel enforces cgroup limits; it doesn’t care that the node has headroom. Fix by right-sizing requests/limits and/or reducing in-container caching.
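
To confirm that's what happened, the pod's last state and its configured limit are both visible from kubectl (pod name here is a placeholder):

cr0x@server:~$ kubectl describe pod <pod-name> | grep -iA3 'last state'
cr0x@server:~$ kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

A reason of OOMKilled (exit code 137) plus a limit close to the container's working set confirms the cgroup story; node-level free memory is beside the point.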

6) Should I ever run echo 3 > /proc/sys/vm/drop_caches in production?

Almost never. It’s a diagnostic tool at best, and it can cause a sudden performance drop by forcing disk reads that used to be cache hits. If you do it, do it intentionally, during controlled investigation, and expect impact.

7) What about ZFS ARC—leak or cache?

Usually cache. ARC will expand to use available RAM unless capped. If ARC competes with applications, you’ll see pressure: low MemAvailable, swapping, reclaim. Fix by sizing ARC to your workload and leaving headroom.
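
On OpenZFS for Linux, current ARC size and its ceiling are readable directly (paths differ on FreeBSD/illumos; a zfs_arc_max of 0 means "use the built-in default," roughly half of RAM on most versions):

cr0x@server:~$ awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max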

8) Why does “available” differ from “free”?

free is literally unused pages at that moment. available estimates memory that can be allocated without swapping by reclaiming cache and other reclaimables. It’s the more operationally useful number.
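
If you want that as a single number for a dashboard or a quick shell check, a minimal sketch that turns MemAvailable into a percentage of total:

cr0x@server:~$ awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "MemAvailable: %.1f%% of RAM\n", 100*a/t}' /proc/meminfo

Alert on this staying low for minutes, not on a single sample.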

9) How do I decide between “add RAM” and “fix the app”?

If you’re under pressure now, adding RAM (or raising limits) can be the least risky immediate mitigation. But if RSS growth is unbounded, more RAM just delays the crash. Use growth trend plus pressure signals to decide.

10) Can kernel slab growth cause outages?

Yes. Slab can consume large amounts of RAM, especially with filesystem metadata churn or networking tables. Confirm with slabtop and investigate the workload pattern and kernel version before tuning knobs.

Conclusion: practical next steps (do these, not vibes)

  1. Fix your mental model: stop treating “RAM used” as “RAM unavailable.” Use MemAvailable and pressure indicators.
  2. Upgrade your alerts: page on OOM kills, sustained swap churn, and low available memory—not on “used %.”
  3. Classify the memory: anonymous vs file cache vs slab vs cgroup. Pick tools that match the class.
  4. Measure trends: snapshots lie. Track top RSS processes and their slopes after deploys.
  5. Make memory budgets explicit: caches (DB, JVM, Redis, ZFS ARC) need headroom. If you can’t explain the budget, you’re gambling.
  6. Use restarts sparingly: they’re acceptable as a stopgap when pressure is acute, but only if you also open the “why did it grow?” investigation.

If you run production systems long enough, you’ll see both: perfectly healthy “full RAM” machines and quiet little leaks that eat your weekends. The difference is whether you look for pressure and attribution—or just stare at a percentage and invent a story.
