You rented a small VPS because you’re responsible. Then PostgreSQL eats the box at 03:17, the kernel swings an axe,
and your “high availability plan” turns out to be a Slack apology and a very awake morning.
The good news: most out-of-memory (OOM) events on small Postgres servers are self-inflicted. The better news:
the fixes are boring, measurable, and repeatable—if you stop treating memory like a single knob.
How OOM happens on a small VPS (and why Postgres gets blamed)
On a small VPS, “memory” is not a single pool. It’s an uneasy treaty between the kernel page cache, anonymous
memory (heap/stack), shared memory segments, per-process allocations, and whatever the hypervisor decided your
neighbor doesn’t deserve today. PostgreSQL adds its own categories: shared buffers, per-backend working memory,
background workers, maintenance memory, and extensions that have zero interest in your budget.
When an OOM happens, PostgreSQL often becomes the headline because it’s a big, long-lived process with lots of
children. But the kill decision is the kernel’s, based on badness scoring, memory cgroups (if used), and what’s
easiest to shoot without taking the whole OS down. Sometimes it kills the biggest backend. Sometimes it kills the
postmaster and you get a full outage. Sometimes it kills your monitoring agent, which is like shooting the smoke
alarm because the beeping is annoying.
The most common pattern on tiny boxes is not “shared_buffers too big.” It’s “unbounded concurrency plus per-query
memory settings that multiply.” A single statement can allocate work_mem multiple times (one per sort/hash
node), and many backends can do it simultaneously. That’s not a bug. That’s math.
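To make the multiplication concrete, here is the arithmetic as a tiny sketch. The numbers are invented for illustration, not measurements from a real server:

```python
# Worst-case working memory is a product, not a single knob.
work_mem_mb = 64          # per sort/hash node, NOT per query
nodes_per_query = 3       # e.g. two sorts plus one hash join
active_backends = 50      # concurrent queries at peak

worst_case_mb = work_mem_mb * nodes_per_query * active_backends
print(worst_case_mb)  # 9600 MB of potential working memory -- on a 2 GB box
```

Three harmless-looking numbers multiply into several times your RAM. That is the whole failure mode in one line.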
The second common pattern is the slow burn: autovacuum can’t keep up, tables bloat, indexes bloat, queries get
slower, and slower queries hold more connections longer, which increases concurrency, which increases memory
pressure. Congratulations, you invented a feedback loop.
Joke #1: If you set max_connections to 2000 on a 1 GB VPS, you don’t have a database server—you have a
memory-themed lottery.
Interesting facts and historical context (the stuff that still bites)
- Fact 1: PostgreSQL’s multi-process architecture (one OS process per connection) dates back to decisions made in an era when threads were less portable and less predictable across Unix variants.
- Fact 2: shared_buffers used to be set quite high by default recommendations, but Linux's page cache and better kernel IO behavior shifted best practice toward "moderate buffers, let the OS cache work."
- Fact 3: work_mem has been misunderstood for decades because it's per operation, not per query. One query can use it several times per backend.
- Fact 4: PostgreSQL has had log_temp_files for a long time, but many teams still don't enable it, so they never learn that their OOM "mystery" was a sort that should have spilled to disk.
- Fact 5: Autovacuum was introduced to reduce manual vacuum pain, but its default aggressiveness is intentionally conservative to avoid surprising workloads; on a small VPS it often needs help.
- Fact 6: The kernel OOM killer exists to keep the system alive, not your service. It’s not a fairness algorithm; it’s triage.
- Fact 7: PostgreSQL’s memory accounting is not centralized because it’s distributed across processes and contexts; this is why “Postgres memory usage” is always an estimate, never a single counter.
- Fact 8: Connection pooling (transaction pooling especially) became mainstream for Postgres partly because “one process per connection” is operationally expensive on small machines.
The mental model: where PostgreSQL actually spends memory
1) Shared memory: shared_buffers (and friends)
shared_buffers is the Postgres-managed cache of table and index pages. It lives in shared memory and is
visible as part of the main postmaster’s footprint, but it’s used by all backends. On small VPS, you want it large
enough to avoid constant churn and small enough to leave room for everything else (including the OS page cache).
Practical rule: on a 1–4 GB VPS, a shared_buffers between 128 MB and 1 GB is often sane. If you go bigger,
you’d better have a reason and measurements. A tiny box still needs memory for:
- per-connection overhead (stacks, memory contexts, syscalls)
- sorts/hashes (work_mem)
- maintenance (maintenance_work_mem)
- WAL buffers, background processes, and extension memory
- kernel page cache (your cheap IO accelerator)
2) Per-backend memory: the silent multiplier
Every client connection is a backend process. Each backend has baseline memory usage (a few MB to tens of MB,
depending on settings, extensions, locale, and query shape). Then queries add memory in chunks, often in memory
contexts tied to operators.
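If your server runs PostgreSQL 14 or newer, you can inspect the memory contexts of your own backend directly with the pg_backend_memory_contexts view. This is a read-only diagnostic, shown here as a sketch:

```sql
-- PostgreSQL 14+: top memory contexts for the current backend.
SELECT name, parent, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;
```

It only shows the backend you are connected as, but it makes "queries add memory in chunks" visible instead of theoretical.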
The dangerous settings are the ones that look harmless in isolation:
- work_mem (sorts, hashes)
- temp_buffers (temporary tables per session)
- max_parallel_workers_per_gather (parallel query multiplies memory)
A small VPS doesn’t need cleverness. It needs predictability. Your goal is to cap worst-case memory, not to win a
benchmark.
3) Maintenance memory: vacuum, create index, and the “surprise weekend rebuild”
maintenance_work_mem is used by VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY, and friends.
On small boxes, you can keep it modest (64–256 MB) and still be fine. The trap is running several maintenance jobs
at once: each can consume up to that limit. Autovacuum workers can do it concurrently.
4) The OS page cache: not optional, not “wasted”
Linux will use free memory as cache. This is good. Postgres reads data, kernel caches it, future reads are faster.
If you starve the page cache by inflating Postgres memory settings, you’ll force disk reads and increase latency,
which increases query time, which increases concurrency, which increases memory pressure. That’s the slow-burn OOM.
5) Swap: the emergency exit, not a second living room
On a small VPS, some swap is often better than none. It gives the kernel a place to park cold pages instead of
immediately killing your database on a transient spike. But if you let the system thrash in swap under load, you’re
not “surviving,” you’re “timing out slowly.”
Fast diagnosis playbook: first, second, third checks
This is the order that wins incidents. Not always. But often.
First: confirm it’s actually OOM (and who died)
- Check kernel logs for OOM killer messages.
- Check whether the postmaster restarted (Postgres crash recovery logs).
- Check whether systemd/cgroup limits caused a kill (often looks like OOM but isn’t).
Second: find the multiplier (connections, parallelism, work_mem)
- How many active connections at peak?
- Any sudden change in max_connections, connection pooler, or application deploy?
- Any queries doing huge sorts/hashes or spilling temp files?
- Parallel query enabled on a tiny box?
Third: check for the slow-burn loop (bloat, autovacuum lag, IO wait)
- Autovacuum running? Falling behind?
- Table/index bloat symptoms (growing size, slow scans)?
- Disk latency high? WAL fsync slow? That turns spikes into sustained pressure.
Fourth: validate OS-level basics (swap, overcommit, vm settings)
- Swap exists and is sized appropriately.
- vm.swappiness not set to something dramatic without reason.
- Memory overcommit isn't tricking you into thinking allocations are "free."
Paraphrased idea (attributed): Hope is not a strategy.
— General H. Norman Schwarzkopf, often cited in operations contexts (paraphrased idea)
Practical tasks: commands, outputs, and decisions (12+)
These are the checks I actually run on a small VPS when Postgres is accused of murder. Each task includes:
command → sample output → what it means → what decision you make.
Task 1: Confirm OOM killer activity
cr0x@server:~$ journalctl -k -g 'out of memory|oom-killer|killed process' -n 50
Jan 12 03:17:21 vps kernel: Out of memory: Killed process 24113 (postgres) total-vm:612844kB, anon-rss:348920kB, file-rss:1200kB, shmem-rss:0kB, UID:113 pgtables:1256kB oom_score_adj:0
Jan 12 03:17:21 vps kernel: oom_reaper: reaped process 24113 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Meaning: The kernel killed a specific postgres backend. This is real OOM, not “Postgres crashed.”
Decision: Move immediately to concurrency and per-backend memory checks. One backend got big.
Task 2: Check if the postmaster restarted (crash recovery)
cr0x@server:~$ sudo journalctl -u postgresql -n 80
Jan 12 03:17:22 vps postgresql[1023]: LOG: database system was interrupted; last known up at 2026-01-12 03:12:05 UTC
Jan 12 03:17:22 vps postgresql[1023]: LOG: database system was not properly shut down; automatic recovery in progress
Jan 12 03:17:23 vps postgresql[1023]: LOG: redo starts at 0/5A1C2B0
Jan 12 03:17:24 vps postgresql[1023]: LOG: database system is ready to accept connections
Meaning: Postgres itself went down hard (postmaster died or got SIGKILL). Recovery ran.
Decision: Treat as an outage: check client error rates, WAL/recovery time, and consider systemd OOM behavior.
Task 3: See memory and swap status right now
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 1.9Gi 1.6Gi 74Mi 28Mi 264Mi 152Mi
Swap: 1.0Gi 612Mi 412Mi
Meaning: RAM is tight; swap is in use. “Available” is what matters for new allocations.
Decision: If swap is climbing under load with latency spikes, you’re in thrash territory; cap concurrency and memory multipliers.
Task 4: Identify top memory consumers (RSS) quickly
cr0x@server:~$ ps -eo pid,ppid,user,comm,rss,etime --sort=-rss | head -n 15
PID PPID USER COMMAND RSS ELAPSED
24113 1023 postgres postgres 348920 00:03:12
24102 1023 postgres postgres 112540 00:03:11
1023 1 postgres postgres 87420 05:42:19
24098 1023 postgres postgres 80112 00:03:10
1780 1 root node 62310 02:11:03
Meaning: One backend is huge. That’s usually a big sort/hash, a parallel plan, or an extension.
Decision: Map that PID to the query and user; tune the query or reduce work_mem/max_parallel_workers_per_gather.
Task 5: Map a backend PID to its query
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS age, left(query,120) AS q FROM pg_stat_activity WHERE pid=24113;"
pid | usename | state | wait_event_type | wait_event | age | q
-------+---------+--------+-----------------+------------+----------+------------------------------------------------------------
24113 | app | active | | | 00:03:09 | SELECT ... ORDER BY created_at DESC LIMIT 200000;
(1 row)
Meaning: It’s doing a large order-by with a huge limit. That’s a memory and temp-file invitation.
Decision: Fix the query (indexes, smaller limits, keyset pagination) and set statement_timeout as a seatbelt.
Task 6: See connection counts and states
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
state | count
-----------+-------
idle | 82
active | 14
idle in transaction | 7
(3 rows)
Meaning: Too many idle connections on a small VPS is death by a thousand paper cuts.
Decision: Add a pooler (pgBouncer), lower max_connections, and kill “idle in transaction” sessions.
Task 7: Check max_connections and current settings
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SHOW max_connections; SHOW shared_buffers; SHOW work_mem; SHOW maintenance_work_mem; SHOW effective_cache_size;"
max_connections
-----------------
300
(1 row)
shared_buffers
---------------
512MB
(1 row)
work_mem
---------
64MB
(1 row)
maintenance_work_mem
----------------------
512MB
(1 row)
effective_cache_size
----------------------
1536MB
(1 row)
Meaning: On a 2 GB box, max_connections=300 and work_mem=64MB is an OOM recipe if concurrency spikes.
Decision: Reduce max_connections (pooler), reduce work_mem (often 4–16MB), and right-size maintenance memory.
Task 8: Find whether sorts/hash operations are spilling to disk (temp files)
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SHOW log_temp_files;"
log_temp_files
----------------
-1
(1 row)
Meaning: You’re blind to temp file usage in logs.
Decision: Set log_temp_files = 0 (log all) temporarily during diagnosis, or a threshold like 16MB for steady-state.
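One way to apply that decision from psql (ALTER SYSTEM writes to postgresql.auto.conf, and the reload applies it without a restart):

```sql
-- Diagnosis mode: log every temp file, regardless of size.
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();

-- Steady state: only log spills above a threshold, e.g.
-- ALTER SYSTEM SET log_temp_files = '16MB';
```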
Task 9: Check current temp file usage per database
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes FROM pg_stat_database ORDER BY temp_bytes DESC;"
datname | temp_files | temp_bytes
----------+------------+------------
appdb | 18412 | 37 GB
postgres | 3 | 12 MB
(2 rows)
Meaning: Queries are spilling to temp heavily. That can be OK, but on small disks it becomes IO pain and extends query duration.
Decision: Fix the worst queries and indexes; don’t just increase work_mem “to avoid temp.” That’s how you buy OOM.
Task 10: Check autovacuum status and dead tuples
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
relname | n_live_tup | n_dead_tup | last_autovacuum | last_autoanalyze
---------------+------------+------------+-------------------------------+------------------------------
events | 48210321 | 9123401 | 2026-01-10 02:11:44+00 | 2026-01-10 02:30:02+00
sessions | 1192031 | 412120 | | 2026-01-08 11:02:19+00
(2 rows)
Meaning: Bloat pressure. One table is accumulating dead tuples; another hasn’t autovacuumed at all recently.
Decision: Adjust autovacuum thresholds for hot tables; ensure vacuum isn’t blocked; consider manual VACUUM during low traffic.
Task 11: Check if vacuum is blocked by long transactions
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT pid, usename, state, now()-xact_start AS xact_age, left(query,120) AS q FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 5;"
pid | usename | state | xact_age | q
-------+---------+-------+------------+------------------------------------------------------------
23301 | app | idle in transaction | 02:14:09 | UPDATE sessions SET ...;
(1 row)
Meaning: “Idle in transaction” for hours prevents vacuum cleanup and grows bloat.
Decision: Kill the session, fix the app transaction handling, add idle_in_transaction_session_timeout.
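A sketch of both the immediate cleanup and the guardrail; the 15-minute and 5-minute thresholds are arbitrary examples to adapt:

```sql
-- Terminate sessions stuck idle in transaction for over 15 minutes.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '15 minutes';

-- Prevent recurrence at the server level.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
```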
Task 12: See IO pressure and swap activity under load
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 621120 74200 18324 232140 0 18 120 340 450 780 22 8 55 15 0
3 1 623080 61840 18324 230820 0 120 200 1100 520 910 28 11 39 22 0
4 1 624904 51220 18324 229110 0 240 180 2500 600 1020 34 12 26 28 0
Meaning: Swap-out (so) increasing plus rising IO wait (wa) suggests thrash and disk contention.
Decision: Reduce concurrency, lower per-query memory, and ensure storage isn’t overloaded (WAL on slow disk hurts).
Task 13: Check per-cgroup memory limits (systemd) if applicable
cr0x@server:~$ systemctl show postgresql -p MemoryMax -p MemoryHigh -p OOMPolicy -p ManagedOOMMemoryPressure
MemoryMax=infinity
MemoryHigh=infinity
OOMPolicy=stop
ManagedOOMMemoryPressure=auto
Meaning: No explicit MemoryMax cap here; systemd may still react via ManagedOOM depending on distro defaults.
Decision: If you do set MemoryMax, set it knowingly and leave headroom for shared memory and kernel needs; otherwise you get “mysterious kills.”
Task 14: Inspect huge pages / shared memory constraints
cr0x@server:~$ grep -E 'HugePages|Shmem' /proc/meminfo | head
Shmem: 28672 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Meaning: No huge pages configured; shared memory is small at the moment. That’s typical for small VPS.
Decision: Don’t chase huge pages as your first fix on tiny machines. Fix concurrency and memory multipliers first.
Task 15: Identify worst query shapes (quick-and-dirty top time)
cr0x@server:~$ sudo -u postgres psql -d appdb -c "SELECT queryid, calls, mean_exec_time, rows, left(query,100) AS q FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
queryid | calls | mean_exec_time | rows | q
----------+-------+----------------+--------+----------------------------------------------------
91344122 | 112 | 8421.113 | 200000 | SELECT ... ORDER BY created_at DESC LIMIT $1
11822001 | 9011 | 312.882 | 20 | SELECT ... FROM sessions WHERE user_id = $1
(2 rows)
Meaning: One query is slow and high-row; it’s likely producing large sorts or wide hashes.
Decision: EXPLAIN it, index it, and cap it. On small VPS, one bad query can be the entire memory story.
Task 16: Verify current kernel swappiness and overcommit policy
cr0x@server:~$ sysctl vm.swappiness vm.overcommit_memory vm.overcommit_ratio
vm.swappiness = 10
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
Meaning: Conservative swappiness, default overcommit heuristic. Not inherently wrong.
Decision: Avoid extreme values unless you understand the failure mode. On small VPS, predictability beats folklore.
The tuning that prevents OOM (what to set, what to avoid)
Start with a hard truth: your worst-case memory must fit
On small machines, tuning is not “make it fast.” It’s “make it not die.” Speed comes from avoiding pathological
behaviors: too many backends, too much per-backend memory, and long transactions that convert spikes into sustained
load.
Think in budgets:
- Budget A: shared memory (buffers and overhead)
- Budget B: per-connection baseline × maximum active connections
- Budget C: per-query working memory × concurrent heavy operators
- Budget D: maintenance workers × maintenance memory
- Budget E: OS needs + page cache + “unknown unknowns” (extensions, libc, TLS, monitoring)
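The budgets above can be sketched as one back-of-envelope function. Every number here is an illustrative assumption, not a measurement; the point is that the sum must fit in RAM with headroom:

```python
# Back-of-envelope memory budget for a small VPS (all values in MB).
def worst_case_mb(shared_buffers, backends, backend_base,
                  heavy_ops, work_mem, maint_workers, maint_mem, os_reserve):
    return (shared_buffers                    # Budget A: shared memory
            + backends * backend_base         # Budget B: per-connection baseline
            + heavy_ops * work_mem            # Budget C: concurrent sorts/hashes
            + maint_workers * maint_mem       # Budget D: maintenance workers
            + os_reserve)                     # Budget E: OS + page cache floor

total = worst_case_mb(shared_buffers=256, backends=100, backend_base=8,
                      heavy_ops=20, work_mem=16, maint_workers=3,
                      maint_mem=128, os_reserve=400)
print(total)  # 2160 -> does not fit a 2048 MB box; something must shrink
```

If the total exceeds RAM, you pick which budget shrinks before the kernel picks for you.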
1) Fix the real villain: connection count
If you’re on a 1–2 GB VPS and you’re running direct-to-Postgres connections from a web app with bursty traffic,
you’re gambling. Every “idle” connection is still a process with memory overhead, and every burst creates a memory
shock.
What to do:
- Use pgBouncer (transaction pooling unless you need session features).
- Set max_connections to something your RAM can afford, not what your app can open.
- Prefer fewer connections with faster queries over many connections with slow queries.
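A minimal pgbouncer.ini sketch for a ~2 GB VPS. The pool sizes and the appdb/app names are assumptions; measure your own peak before committing:

```ini
; Sketch: transaction pooling in front of a small Postgres.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20     ; server-side connections Postgres actually sees
max_client_conn = 500      ; the app can open many; Postgres stays capped
server_idle_timeout = 60
```

The app connects to port 6432 and can hold hundreds of client connections while Postgres only ever sees a small, steady pool.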
2) Set work_mem like you’re paying for it (you are)
On small VPS, a global work_mem of 4–16 MB is a solid starting point. If that sounds tiny, good: it
forces large operations to spill to disk rather than murdering RAM. Spilling is slower, but dead is slower still.
The better pattern is low default, higher per-role or per-session for controlled jobs:
analytics role gets more, OLTP role gets less.
Also: parallel queries increase memory usage. Each worker may allocate work_mem. On a small VPS, cap
parallelism aggressively or disable it if it’s unpredictable for your workload.
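A sketch of capping parallelism from psql; the values are a conservative starting point, not a universal recommendation:

```sql
-- Disable per-query parallelism entirely on a tiny box.
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
ALTER SYSTEM SET max_parallel_workers = 2;
-- Note: max_worker_processes requires a server restart to take effect.
ALTER SYSTEM SET max_worker_processes = 4;
SELECT pg_reload_conf();
```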
3) Keep shared_buffers moderate
Don’t set shared_buffers to 70% of RAM because a blog said “cache is king.” Linux already caches. You
want room for the OS and for per-backend spikes.
Practical starting points:
- 1 GB RAM: 128–256 MB shared_buffers
- 2 GB RAM: 256–512 MB shared_buffers
- 4 GB RAM: 512 MB–1 GB shared_buffers
Then measure: cache hit ratio, IO wait, and latency under load. If you increase buffers and performance doesn’t
improve, you just stole memory from where you needed it.
4) Autovacuum: tune it so bloat doesn’t turn into memory pressure
Autovacuum on a small VPS should be set to keep up, not to be polite. The defaults aim to be safe across
wildly different machines. Your tiny VPS is not wildly different—it’s specifically constrained.
What helps:
- Lower autovacuum_vacuum_scale_factor on hot tables; rely less on percentage thresholds and more on fixed thresholds.
- Increase autovacuum_max_workers only if IO can handle it; otherwise you just add contention.
- Raise autovacuum_work_mem modestly (or use maintenance_work_mem), but don't let multiple workers run wild.
- Monitor for "idle in transaction" and long-running transactions blocking cleanup.
5) Put guardrails on transaction and statement behavior
The cheapest reliability feature in Postgres is a timeout. Timeouts convert “slowly eating your server” into “fast
failure with logs.” You can debug a failure. You can’t debug an OOM after the process is gone.
- statement_timeout: stop runaway queries.
- idle_in_transaction_session_timeout: stop apps that begin a transaction and wander off.
- lock_timeout: stop query piles caused by lock waits.
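One way to apply these is per role rather than server-wide, so maintenance sessions keep sane defaults. The role name "app" and the thresholds are examples, not recommendations:

```sql
-- Seatbelts scoped to the application role.
ALTER ROLE app SET statement_timeout = '30s';
ALTER ROLE app SET idle_in_transaction_session_timeout = '5min';
ALTER ROLE app SET lock_timeout = '5s';
```

New sessions for that role pick the settings up automatically; existing sessions keep their old values until they reconnect.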
6) Swap: have it, but don’t depend on it
A small swap file (equal to RAM or half RAM, depending on disk and workload) can reduce “instant death” during a
spike. It is not a license to ignore memory tuning. Watch swap-out rates; if swap is actively churning under load,
you’re already in the danger zone.
7) Logging that prevents superstition
To prevent future blame games, log enough to connect symptoms to causes:
- Enable slow query logging (log_min_duration_statement).
- Enable temp file logging (log_temp_files threshold).
- Use pg_stat_statements to identify repeat offenders.
Joke #2: Turning on logs won’t fix your outage, but it will dramatically improve your ability to be correct in the next outage.
Three corporate-world mini-stories (how this fails in real life)
Mini-story 1: The incident caused by a wrong assumption (“work_mem is per query, right?”)
A mid-sized SaaS team ran Postgres on a 2 GB VPS for a “non-critical” internal analytics dashboard. It started as a
side project and graduated into “everyone uses it every morning.” That’s how it goes.
A developer bumped work_mem to 128 MB to speed up a set of dashboard queries. The assumption was that a
query gets one work_mem allocation. They also left max_connections at a generous value because
“connections are cheap.” The dashboard ran faster in testing. In production, it ran faster right up until
marketing’s Monday all-hands.
At peak, dozens of concurrent dashboard queries ran with multiple sort/hash nodes per query. Some queries went
parallel. Memory climbed fast. Swap got hit. Latency spiked. Clients retried. Connection counts rose. The kernel
OOM killer started picking off the fattest backends, and then took the postmaster for good measure.
The fix was not heroic: reduce default work_mem, cap parallel workers, add pgBouncer, and set per-role
work_mem higher only for an “analytics” role with controlled concurrency. They also added a statement
timeout, which converted “melt the box” into “dashboard times out,” a much healthier failure mode.
Mini-story 2: The optimization that backfired (“Let’s max shared_buffers and disable swap”)
Another company moved a small customer-support tool to a 1 GB VPS. The tool had modest traffic but a lot of
full-text search and an extension or two. An engineer, trying to be helpful, applied a tuning snippet:
shared_buffers=768MB, effective_cache_size=1GB, and “swap is bad, disable it.”
The first week looked fine. Then they installed a security agent that added a bit more memory usage. Later they
enabled a background job that ran a nightly report with a couple of heavy sorts. With swap disabled, transient
spikes had nowhere to go. The kernel OOM killer triggered earlier and more violently.
The incident report had a familiar smell: graphs showed memory rising until it fell off a cliff. The engineer’s
tuning did reduce disk reads, but it also eliminated the OS page cache’s breathing room and removed swap as a
pressure-release valve. The system became brittle.
They rolled back to moderate buffers, re-enabled a small swap file, and reduced default work_mem.
Performance improved because the system stopped thrashing and stopped restarting. It turns out “up” is a very fast
state.
Mini-story 3: The boring practice that saved the day (“timeouts, pooler, and one dashboard”)
A finance-adjacent service ran on a 4 GB VPS with Postgres. Nothing fancy. They had two habits that looked
painfully conservative: strict timeouts and a connection pooler with an intentionally low server pool size.
A product manager demanded a new “export everything” feature. Engineering implemented it as a background job and a
user-facing endpoint. The endpoint had a hard statement_timeout. The job used a dedicated role with a
higher work_mem and a low concurrency queue.
The day the feature shipped, one customer tried to export a massive dataset during peak traffic. The endpoint timed
out and returned an error that was annoying but honest. The background job queued and ran later with controlled
resources. The database stayed stable.
The postmortem was short because there wasn’t much of one. The boring practice—resource compartmentalization and
timeouts—prevented the system from turning one customer’s request into everyone’s outage. Nobody got promoted for
it, but nobody got paged either.
Common mistakes: symptoms → root cause → fix
1) Symptom: random Postgres disconnects, “server closed the connection unexpectedly”
Root cause: Kernel OOM killer killing backends or postmaster; sometimes systemd kills due to memory pressure.
Fix: Confirm with kernel logs; lower concurrency, reduce work_mem, add pgBouncer, and ensure swap exists.
2) Symptom: swap usage grows, latency spikes, then everything times out
Root cause: Memory pressure causing swap thrash; typically too many active connections or heavy queries running concurrently.
Fix: Cap active connections, enforce timeouts, tune worst queries, consider lowering shared_buffers to restore page cache breathing room.
3) Symptom: “idle in transaction” connections pile up
Root cause: Application leaks transactions (forgot to commit/rollback), holds locks, blocks vacuum.
Fix: Set idle_in_transaction_session_timeout; fix app connection/transaction handling; monitor and kill offenders.
4) Symptom: autovacuum always running, queries slowly get worse over days
Root cause: Vacuum can’t keep up; bloat increases IO and query time; concurrency increases; memory pressure increases.
Fix: Tune autovacuum per hot table; remove blockers (long transactions); schedule manual vacuum during quiet windows.
5) Symptom: temp disk fills, then Postgres errors about no space
Root cause: Large sorts/hashes spilling to temp files; insufficient disk; sometimes a single report query.
Fix: Fix queries/indexes; cap result sizes; enable temp file logging; ensure adequate disk or move temp to faster/larger volume if possible.
6) Symptom: performance got worse after “tuning” shared_buffers upward
Root cause: Starved OS page cache and per-backend memory; increased IO wait and longer queries.
Fix: Reduce shared_buffers to a moderate value; measure; focus on query plans and connection management.
7) Symptom: a single query sometimes triggers OOM
Root cause: Parallel query plus high work_mem; query plan includes multiple memory-hungry nodes; sometimes wide rows.
Fix: Lower work_mem; cap parallel workers; rewrite query; add indexes; set statement timeout.
Checklists / step-by-step plan (small VPS safe baseline)
Step-by-step: stabilize first, then optimize
- Prove OOM: check kernel logs for kills; record timestamps and PIDs.
- Count connections: measure peak active and total; identify “idle” vs “idle in transaction.”
- Install or configure pgBouncer: aim for a small, steady server-side pool.
- Lower max_connections: force the pooler to do its job.
- Set conservative defaults: work_mem low, maintenance_work_mem modest, shared_buffers moderate.
- Cap parallelism: limit max_parallel_workers_per_gather (or disable if needed).
- Add timeouts: statement + idle-in-transaction + lock timeout for sanity.
- Turn on observability knobs: slow query log, temp file logging, pg_stat_statements.
- Fix top 3 queries: index, rewrite, or limit. On small boxes, you don’t need a top 30 list.
- Autovacuum tune hot tables: reduce scale factors; ensure vacuum isn’t blocked.
- Validate swap: ensure it exists; avoid disabling it; monitor swap-in/out rates.
- Load test with concurrency: verify worst-case stays within memory.
Baseline config ideas (not universal, but safe-ish)
These are intentionally conservative starting points for small VPS. Adjust based on measurement, not vibes.
- 1 GB RAM: shared_buffers 128–256MB, work_mem 4–8MB, maintenance_work_mem 64–128MB, max_connections 50–100 (prefer pooler).
- 2 GB RAM: shared_buffers 256–512MB, work_mem 4–16MB, maintenance_work_mem 128–256MB, max_connections 100–150 (prefer pooler).
- 4 GB RAM: shared_buffers 512MB–1GB, work_mem 8–16MB, maintenance_work_mem 256MB, max_connections 150–200 (prefer pooler).
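Pulled together as a postgresql.conf fragment for the 2 GB case, using the ranges above. Treat it as a sketch to adjust from measurement, not a drop-in:

```ini
# Example fragment for a ~2 GB VPS behind a connection pooler.
shared_buffers = 384MB
effective_cache_size = 1GB
work_mem = 8MB
maintenance_work_mem = 128MB
max_connections = 100
max_parallel_workers_per_gather = 0
log_min_duration_statement = 500ms
log_temp_files = 16MB
```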
Operational checklist: after any change
- Re-check connection behavior during peak traffic.
- Re-check temp file generation and slow queries.
- Re-check swap activity and IO wait.
- Re-check autovacuum progress and dead tuples.
- Keep a rollback plan: config changes are cheap; recovery time is not.
FAQ
1) Is shared_buffers the main cause of OOM on small VPS?
Sometimes, but usually not. The classic small-VPS OOM is per-backend memory multiplied by concurrency: too many
connections plus a generous work_mem, often with parallel query. Moderate buffers rarely kill a system
by themselves; unbounded spikes do.
2) What’s a safe work_mem on a 1–2 GB box?
Start at 4–8MB for general OLTP workloads. Increase per-role for controlled jobs (like a single reporting worker),
not globally. Remember: one query can allocate work_mem multiple times.
3) Should I disable swap for databases?
On small VPS, usually no. A small amount of swap can prevent abrupt OOM during transient spikes. The real goal is
to avoid steady-state swapping; monitor swap-in/out and IO wait. Disabling swap often turns “slow” into “dead.”
4) Why do I see lots of “idle” connections? Aren’t idle connections harmless?
Idle connections still cost memory and process overhead. On small machines, that overhead matters. Idle connections
also make bursts worse because you’ve already spent your baseline budget. Use a pooler and keep server-side
connections small and stable.
5) How do I know if OOM is caused by a specific query?
Correlate: kernel kill log PID → pg_stat_activity (if still present) → application logs → slow query logs.
Add pg_stat_statements and temp file logging. If one query shape dominates temp_bytes or mean_exec_time,
it’s a strong suspect.
6) Should I increase effective_cache_size to fix performance?
effective_cache_size doesn’t allocate memory; it’s a planner hint. Set it to a reasonable estimate of
memory available for caching (OS cache + shared buffers). Don’t treat it as a performance lever for OOM.
7) Does autovacuum tuning help with OOM?
Indirectly, yes—often a lot. When autovacuum falls behind, bloat increases, queries slow down, concurrency rises,
and memory pressure gets sustained. Keeping vacuum healthy prevents the “slow burn” path to OOM.
8) Can I solve this just by upgrading the VPS?
More RAM buys margin, not correctness. If the workload has unbounded concurrency or runaway queries, it will fill
any memory you give it—just later and more expensively. Tune first, then scale with evidence.
9) Are parallel queries bad on small VPS?
They’re not morally bad. They’re operationally unpredictable. Parallelism can multiply memory usage and CPU
contention. On small VPS, cap it low. If you need big parallel analytics, you probably need a different box or a
different architecture.
10) What’s the fastest win if I’m already paging weekly?
Add a pooler and lower max_connections. Then drop global work_mem to something conservative.
Those two changes alone often eliminate OOM events because they control the multiplier.
Next steps you can do today
- Run the fast diagnosis playbook: confirm OOM, identify the multiplier, and check for slow-burn bloat/IO loops.
- Put a hard cap on concurrency: pgBouncer + lower max_connections.
- Reset work_mem to a conservative default, then selectively raise it per-role for controlled jobs.
- Add timeouts: statement + idle-in-transaction + lock timeout. Make your failures fast and debuggable.
- Enable temp file and slow query visibility: so the next incident has evidence, not folklore.
- Autovacuum tune hot tables: bloat prevention is memory protection wearing an IO hat.
Small VPS PostgreSQL isn’t fragile by nature. It’s fragile when you let memory usage be unbounded. Bound it, measure
it, and you’ll sleep like someone who doesn’t treat the OOM killer as a load balancer.