MariaDB vs PostgreSQL on an 8GB VPS: How to Scale Clients Safely

You don’t “run out of database.” You run out of something boring: RAM, IOPS, file descriptors, CPU time, or patience.
On an 8GB VPS, the margin between “fine” and “why is checkout timing out” is thinner than your incident channel’s sarcasm.

This is a field guide for scaling clients safely on MariaDB and PostgreSQL when you can’t just throw hardware at the problem.
We’ll talk about connection limits, memory math, I/O reality, the gotchas that only show up under load, and the practices that keep you out of trouble.

Frame the problem: what “scaling clients” actually breaks

“Scale clients” usually means “increase the number of concurrent connections or requests.” It sounds like a networking problem.
It’s not. It’s a resource accounting problem.

The typical failure pattern on an 8GB VPS goes like this:

  • Traffic increases, app servers open more DB connections “just in case.”
  • The DB accepts them (because defaults are generous or you bumped max_connections in a hurry).
  • Memory rises per connection/session. Context switching increases. Caches get squeezed.
  • I/O spikes because the working set no longer fits in RAM.
  • Latency rises; app timeouts trigger retries; retries create more connections; you invent a denial-of-service against yourself.

On a VPS, there’s another twist: CPU steal time, noisy neighbors, burst credits, and storage that “looks like SSD” but behaves like a sad USB stick at the worst moment.

A quote worth taping near your dashboard: Hope is not a strategy. — General Gordon R. Sullivan.
It’s not a database quote, but it’s the best production engineering quote for database scaling.

The core question isn’t “which is faster, MariaDB or PostgreSQL?” It’s:
which one lets you enforce sane limits and predictable behavior as clients scale, on constrained RAM and real-world disks.

Facts & history that matter in production

These aren’t trivia-night points. They explain why defaults, behaviors, and operational tooling look the way they do.

  1. PostgreSQL’s process model is historically “one backend per connection”.
    That design makes isolation and debugging straightforward, but pushes you toward connection pooling as concurrency rises.
  2. MySQL’s thread-per-connection approach shaped MariaDB’s concurrency story.
    It can handle many connections, but memory per session can still crush an 8GB box if you let it.
  3. MariaDB exists because Oracle acquired Sun Microsystems, and MySQL with it, in the 2009–2010 era.
    That fork created two ecosystems and two sets of operational assumptions that still differ today.
  4. InnoDB became the default MySQL storage engine only in MySQL 5.5 (2010), long after MyISAM dominated.
    Many “MySQL tuning” blog posts are fossilized from the MyISAM era and are actively harmful on modern InnoDB workloads.
  5. PostgreSQL’s MVCC has “bloat” as an operational reality.
    Vacuuming isn’t optional if you update/delete heavily; it’s garbage collection for your tables.
  6. MariaDB’s InnoDB also has MVCC and purge, but the ops levers look different.
    History matters: the knobs aren’t one-to-one, and the symptoms mislead people migrating between them.
  7. PostgreSQL’s WAL (write-ahead log) and MariaDB’s redo/binlogs both turn random writes into sequential-ish I/O.
    But fsync behavior, checkpointing, and “flush storms” still vary by engine and configuration.
  8. PostgreSQL introduced major parallel query improvements in recent years.
    That helps analytics, but can surprise you on small boxes if parallel workers eat CPU and memory during peak OLTP.
  9. MariaDB’s thread pool exists specifically because thread-per-connection hits scaling walls.
    It’s one of the few “turn it on, then validate” knobs that can genuinely change concurrency behavior.

Joke #1: A database will always accept your max_connections=2000 change. It’s like a toddler accepting a third espresso.

How MariaDB and PostgreSQL spend your 8GB

The 8GB VPS budget: what you really have

“8GB RAM” is marketing. In production, you reserve space for the kernel, filesystem cache, background services, and the fact that memory usage spikes under load.
If you run the DB alone, you might safely allocate ~5–6GB to database memory knobs.
If you run app + DB on the same VPS (don’t, if you can avoid it), cut that drastically.

You’re also buying I/O, not just RAM. Most latency bugs blamed on “database performance” are actually storage performance and queueing.

PostgreSQL: predictable, but connection-heavy

PostgreSQL’s memory is broadly split into:

  • Shared memory: mainly shared_buffers.
  • Per-session / per-operation memory: work_mem, maintenance_work_mem, sorts, hashes, temp buffers.
  • OS page cache: PostgreSQL relies on it heavily for real performance.

The killer on small boxes isn’t shared_buffers. It’s the multiplication of per-connection overhead plus per-query memory allocations.
If you allow hundreds of direct connections from apps, you’re betting your uptime on “they won’t all do expensive queries at once.”
They will. On a Monday. During payroll. Or when someone ships a new report.
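
A quick way to see what the current config already promises (the query only reads settings; the multiplication afterwards is the point):

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('max_connections','work_mem','shared_buffers','temp_buffers');"

Multiply work_mem by the connections you allow and by the sort/hash nodes one ugly query can use. If the result dwarfs 8GB, the config is a promise you can’t keep.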

MariaDB (InnoDB): buffer pool and per-thread surprises

For MariaDB (InnoDB workloads), memory is typically dominated by:

  • InnoDB buffer pool: your main cache (innodb_buffer_pool_size).
  • Per-connection buffers: sort buffer, join buffer, tmp tables, net buffers.
  • Thread and internal caches: especially visible when connection churn is high.

MariaDB can look “fine” until one query forces big sort/join buffers across many threads and you take a sudden detour into swapping.
Swap on a VPS-backed disk is a slow-motion outage generator.
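
The matching audit on the MariaDB side, a minimal check of the per-thread multipliers before anyone “tunes” them upward:

cr0x@server:~$ sudo mariadb -e "SHOW VARIABLES WHERE Variable_name IN ('max_connections','sort_buffer_size','join_buffer_size','tmp_table_size','max_heap_table_size');"

Per-thread buffers times max_connections is your theoretical worst case; keep that product comfortably below what the buffer pool leaves free.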

Scaling clients safely is mostly about controlling concurrency

Both databases can handle serious throughput on 8GB if you do one thing: cap concurrency and smooth the workload.
Connection pooling, sane limits, and query discipline beat heroic tuning every time.

Fast diagnosis playbook (first/second/third)

When latency spikes, you need a fast triage loop that tells you where the limit is: CPU, memory, locks, or I/O.
Don’t “tune.” Diagnose.

First: Is the host sick?

  • Load average vs CPU count: if load is high and CPU is idle, you’re waiting on I/O.
  • Swap activity: any sustained swap-in/out during peak means you’re losing.
  • Disk latency: if await is high, the DB is innocent; storage is the crime scene.

Second: Is the DB queued or blocked?

  • PostgreSQL: check pg_stat_activity for wait events and blocked queries.
  • MariaDB: check processlist and InnoDB status for lock waits and deadlocks.
  • Connection storms: timeouts + retries produce self-inflicted load. Look for sudden connection count jumps.

Third: Is it one bad query or systemic tuning?

  • Slow query logs (both) show whether a few queries dominate time.
  • Cache hit rates tell you if you’re I/O bound due to insufficient cache or poor indexes.
  • Autovacuum / purge issues show up as growing tables/indexes, rising I/O, and “random slowness.”

This order matters because the fastest fix is often outside the database: reduce client concurrency, stop retries, or throttle background jobs.

12+ practical tasks with commands, output meaning, decisions

These are the checks I run on an 8GB VPS before I touch tuning parameters. Each includes: command, what the output means, and what decision it drives.

Task 1: Confirm real memory and swap pressure

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       5.9Gi       220Mi       120Mi       1.6Gi       1.3Gi
Swap:          2.0Gi       1.1Gi       900Mi

Meaning: “available” is what you can still use without swapping. Swap already in use is a warning; swap actively moving is the real alarm.
If swap usage grows during load, the DB will stall.

Decision: If swap is non-trivial under load, reduce connection counts, lower per-session memory settings, or add RAM before “optimizing queries.”

Task 2: Check for active swapping (the outage slow-cooker)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1 118000 210000  65000 980000  120  240  3100  4200  900 2100 18  6 32 44  0
 3  2 119200 190000  64000 970000  300  500  2800  3900 1000 2500 20  7 25 48  0

Meaning: si/so above ~0 sustained means you’re swapping. wa high means CPU is waiting on I/O.

Decision: Stop the bleeding: cap concurrency (pool), lower DB memory multipliers, disable expensive reports at peak, or move to faster storage.

Task 3: Measure disk latency and queueing

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0 (server)  12/31/2025  _x86_64_  (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.20    0.00    6.50   28.10    0.80   50.40

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await
vda              85.0   120.0  4200.0  9800.0     0.0     2.0   96.0   18.5

Meaning: %util near 100% with await in double digits means the disk is saturated.
On many VPS platforms, that’s your bottleneck right there.

Decision: Fix I/O: reduce write amplification (indexes, autovacuum/purge tuning), move logs to faster storage, or upgrade disk tier. Don’t just raise caches blindly.

Task 4: Check filesystem space and inode pressure

cr0x@server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        80G   62G   15G  81% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
cr0x@server:~$ df -i
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/vda1      5242880 220000 5022880    5% /

Meaning: Near-full disks hurt DB performance and can hard-stop writes (WAL/redo/binlog can’t grow).

Decision: Keep free space headroom (I like 20%+ on busy systems). If you’re tight, you’re one incident away from “read-only surprise.”

Task 5: Check file descriptor limits (quiet connection killer)

cr0x@server:~$ ulimit -n
1024

Meaning: 1024 is low for a database that may open many files (tables, indexes, WAL segments) plus sockets.

Decision: Raise limits via systemd unit or limits.conf. If your DB hits FD limits, clients see connection failures that look “random.”
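
A minimal sketch for a systemd-managed service, assuming stock unit names (postgresql.service or mariadb.service; adjust to your distro):

cr0x@server:~$ sudo systemctl edit mariadb.service
# in the editor that opens, add an override:
[Service]
LimitNOFILE=65536
cr0x@server:~$ sudo systemctl restart mariadb

The new limit only applies after a restart, so schedule it; limits.conf alone won’t help a service that systemd starts.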

Task 6 (PostgreSQL): Count connections and see who’s hogging them

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
  state  | count
---------+-------
 idle    |   140
 active  |    12
         |     3
(3 rows)

Meaning: Lots of idle connections usually means the app is hoarding. That’s memory overhead and process overhead doing nothing.

Decision: Use a pooler (PgBouncer) and reduce direct connections. Set sane app pool sizes, don’t let every pod open 50 sessions “because defaults.”
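
A minimal PgBouncer sketch in transaction pooling mode; appdb and the numbers are placeholders, and it assumes your app tolerates transaction pooling (no session-level state carried across transactions):

; /etc/pgbouncer/pgbouncer.ini (sketch, not a drop-in config)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
; auth setup depends on your environment
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; clients queue here instead of becoming backends
max_client_conn = 1000
; actual PostgreSQL backends per database/user pair
default_pool_size = 20

Point the app at port 6432 and let the pooler absorb the spikes.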

Task 7 (PostgreSQL): Find blocked queries and lock chains

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT pid, wait_event_type, wait_event, state, left(query,80) AS q FROM pg_stat_activity WHERE wait_event IS NOT NULL ORDER BY pid;"
 pid  | wait_event_type |  wait_event   |  state  | q
------+-----------------+---------------+---------+----------------------------------------------
 2314 | Lock            | relation      | active  | UPDATE orders SET status='paid' WHERE id=$1
 2451 | Lock            | transactionid | active  | DELETE FROM carts WHERE user_id=$1
(2 rows)

Meaning: Waiting on locks means your concurrency is fighting itself. Adding connections makes it worse.

Decision: Identify the blocking transaction (often long-running or idle-in-transaction). Fix app transaction scope; add indexes to reduce lock time; schedule migrations safely.
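
A minimal follow-up that names the blockers directly (pg_blocking_pids() has been available since PostgreSQL 9.6):

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query,60) AS q FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;"

The PIDs in blocked_by are the sessions to investigate (and, if justified, terminate with pg_terminate_backend).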

Task 8 (PostgreSQL): Check cache effectiveness (are you I/O bound?)

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT datname, blks_hit, blks_read, round(100.0*blks_hit/(blks_hit+blks_read+1),2) AS hit_pct FROM pg_stat_database ORDER BY hit_pct ASC;"
  datname  | blks_hit | blks_read | hit_pct
-----------+----------+-----------+---------
 appdb     | 89213321 |   5120032 |   94.55
 postgres  |  1200032 |     12033 |   98.99
(2 rows)

Meaning: 94–95% might be fine for some workloads, terrible for others. If hit% drops during load and disk latency rises, you’re cache-starved or your working set grew.

Decision: Add RAM or reduce working set (better indexes, fewer full scans). Don’t crank shared_buffers to 6GB and starve the OS cache.

Task 9 (PostgreSQL): Check vacuum pressure (bloat tax)

cr0x@server:~$ sudo -u postgres psql -X -c "SELECT relname, n_dead_tup, n_live_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
   relname   | n_dead_tup | n_live_tup |     last_autovacuum
-------------+------------+------------+----------------------------
 events      |    8200032 |   41000211 | 2025-12-31 10:22:14+00
 sessions    |    1200033 |    9000123 | 2025-12-31 10:18:01+00
 carts       |     450012 |    1500032 | 2025-12-31 09:59:44+00
(3 rows)

Meaning: Dead tuples piling up means more disk reads, bigger indexes, slower queries. Autovacuum might be under-provisioned or blocked by long transactions.

Decision: Fix long transactions, tune autovacuum per table, and consider partitioning for append-heavy tables.
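
A minimal sketch of per-table autovacuum tuning for the hot table above; the thresholds are illustrative, not recommendations:

cr0x@server:~$ sudo -u postgres psql -X -c "ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_vacuum_threshold = 10000);"

This makes autovacuum visit events after roughly 5% of its rows die instead of the default 20%, at the cost of more frequent (but smaller) vacuum I/O.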

Task 10 (MariaDB): Check current connections and max usage

cr0x@server:~$ sudo mariadb -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW VARIABLES LIKE 'max_connections';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 210   |
+-------------------+-------+
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 398   |
+----------------------+-------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 500   |
+-----------------+-------+

Meaning: You’ve already hit 398 connections. That’s not “capacity,” that’s “you’re flirting with memory collapse.”

Decision: Implement pooling on the app side or use MariaDB thread pool; reduce max_connections to a safe number; enforce timeouts and kill idle sleepers.
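
A minimal server-side cap sketch, assuming the app layer owns pooling; the file path varies by distro and the numbers are illustrative:

# /etc/mysql/mariadb.conf.d/60-limits.cnf (sketch)
[mariadb]
# hard ceiling the box can actually afford
max_connections     = 200
# kill idle sleepers instead of hoarding them
wait_timeout        = 300
interactive_timeout = 300
# abort runaway statements (seconds); consider setting per-user instead of globally
max_statement_time  = 30

Pair this with matching app-side pool sizes, or the app will just see “too many connections” instead of slow queries.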

Task 11 (MariaDB): Spot sleeping connections and long runners

cr0x@server:~$ sudo mariadb -e "SHOW PROCESSLIST;"
Id	User	Host	db	Command	Time	State	Info
1203	app	10.0.1.12:44218	appdb	Sleep	512		NULL
1210	app	10.0.1.12:44276	appdb	Query	9	Sending data	SELECT * FROM events WHERE user_id=?
1311	app	10.0.1.13:51022	appdb	Sleep	611		NULL

Meaning: Lots of Sleep with high Time is connection hoarding. The Sending data state often means scanning and returning rows (or waiting on disk).

Decision: Reduce app pool sizes, add connection lifetime limits, enable slow query log, and fix missing indexes.
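
Turning on slow query visibility without a restart, a minimal sketch (1 second is a starting threshold, not a verdict):

cr0x@server:~$ sudo mariadb -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1;"

Persist the same settings in the config file afterwards, or a restart will quietly turn the log back off.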

Task 12 (MariaDB): Check InnoDB buffer pool sizing and pressure

cr0x@server:~$ sudo mariadb -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 4294967296 |
+-------------------------+------------+
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| Innodb_buffer_pool_reads | 8200332 |
+--------------------------+---------+
+----------------------------------+------------+
| Variable_name                    | Value      |
+----------------------------------+------------+
| Innodb_buffer_pool_read_requests | 932003312  |
+----------------------------------+------------+

Meaning: The ratio of reads from disk vs read requests hints at cache misses. On an 8GB VPS, 4GB buffer pool may be reasonable, but not if per-thread buffers are huge.

Decision: Keep buffer pool substantial (often 50–70% of RAM if DB-only), but audit per-connection buffers and connection counts first.

Task 13 (MariaDB): Catch temporary table spills (silent disk hammer)

cr0x@server:~$ sudo mariadb -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';"
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Created_tmp_disk_tables | 2200033  |
| Created_tmp_tables      | 3100044  |
+-------------------------+----------+

Meaning: A high proportion of disk temp tables indicates queries that sort/group on large datasets without proper indexes, or too-small tmp table limits.

Decision: Fix queries/indexes first. Only then consider adjusting tmp_table_size/max_heap_table_size, because raising them increases per-session memory risk.

Task 14: Confirm network socket backlog and SYN flood lookalikes

cr0x@server:~$ ss -s
Total: 1180
TCP:   942 (estab 410, closed 410, orphaned 0, timewait 110)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       8         6         2
TCP       532       410       122
INET      540       416       124
FRAG      0         0         0

Meaning: If established connections spike and timewait grows fast, your app might be reconnecting aggressively or a load balancer is misbehaving.

Decision: Fix client behavior (pool, keepalive, timeouts). Databases hate connection churn almost as much as humans hate surprise on-call.

Task 15: Confirm CPU steal time (VPS reality check)

cr0x@server:~$ mpstat 1 3
Linux 6.8.0 (server)  12/31/2025  _x86_64_  (2 CPU)

12:00:01 AM  all   %usr  %nice  %sys %iowait  %irq  %soft  %steal  %idle
12:00:02 AM  all  22.00   0.00  10.00   18.00  0.00   1.00    6.00  43.00
12:00:03 AM  all  20.00   0.00   9.00   19.00  0.00   1.00    7.00  44.00

Meaning: %steal indicates the hypervisor is taking CPU away. If steal rises during peaks, your database “slowness” is literally someone else’s workload.

Decision: Consider a better VPS class or dedicated CPU. Tuning won’t fix stolen cycles.

Safe scaling patterns for more clients

Pattern 1: Connection pooling is not optional on 8GB

If you run PostgreSQL and you expect many concurrent clients, put a pooler in front. Period.
Even if you “only” have 100 app instances. Especially then.

For PostgreSQL, external pooling (PgBouncer) is the standard move because it reduces backend processes while keeping client concurrency.
For MariaDB, many stacks use application-level pooling; the MariaDB thread pool can help, but it’s not a free lunch.

The safe mental model: clients can be large, server concurrency must be small and controlled.
You want a queue you control, not a resource stampede the kernel controls.
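
For the MariaDB half of this pattern, the thread pool is a config switch rather than an extra daemon. A minimal sketch, to be validated under real load (it requires a restart):

# my.cnf sketch: thread pool instead of one thread per connection
[mariadb]
thread_handling         = pool-of-threads
# roughly the number of CPU cores
thread_pool_size        = 2
thread_pool_max_threads = 500

It changes how connections map to threads, not how much memory a bad query can allocate, so the per-session limits still matter.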

Pattern 2: Set hard caps and enforce them

“Unlimited” is how you get a cascading failure. Cap:

  • Database connections (max_connections on both servers).
  • Per-user or per-app connections (roles, users, firewall rules, pooler settings).
  • Statement time (PostgreSQL statement_timeout, MariaDB max_statement_time, plus application timeouts).
  • Idle-in-transaction time (PostgreSQL idle_in_transaction_session_timeout is a lifesaver).
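
A minimal PostgreSQL sketch for the last two caps, scoped to an application role so admin sessions aren’t cut off; the role name app_rw is a placeholder:

cr0x@server:~$ sudo -u postgres psql -X -c "ALTER ROLE app_rw SET statement_timeout = '30s';"
cr0x@server:~$ sudo -u postgres psql -X -c "ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '60s';"

New sessions for that role pick the values up automatically; existing sessions keep their old settings until they reconnect.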

Pattern 3: Protect the cache by keeping memory multipliers small

The most common 8GB tuning mistake is spending all RAM on “database memory” while forgetting:
per-connection buffers multiply, and OS cache is not “wasted memory.”

PostgreSQL: if you set work_mem high and allow many active sessions, you can allocate far more memory than you own.
MariaDB: if you inflate sort/join buffers and allow many threads, you can do the same thing, only with less obvious accounting.
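
The multiplication, with deliberately round hypothetical numbers:

  200 allowed connections × 32MB work_mem (or sort/join buffers) × 2 memory-hungry nodes per query ≈ 12.5GB
  of potential allocation on an 8GB box, before shared_buffers, the buffer pool, or the OS cache are counted.

Nothing in either database stops this up front; the limit gets enforced later, by swap and then the OOM killer.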

Pattern 4: Make writes boring

On small VPS disks, writes cause most of the pain:
WAL/redo fsyncs, checkpoints, background flush, temp tables, vacuum, index maintenance.
Your goal is steady write behavior, not “fast until it stalls.”

  • Keep transactions short.
  • Batch background jobs off-peak or throttle them.
  • Don’t create three new indexes at noon on a busy table.
  • Use sane durability settings; don’t disable fsync unless you genuinely want to practice data loss.
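
One place where “boring” is literally a setting: spreading PostgreSQL checkpoints out instead of letting them land as a burst. A minimal sketch, values illustrative and worth validating against your actual WAL volume:

# postgresql.conf sketch: fewer, smoother checkpoints
checkpoint_timeout           = 15min
max_wal_size                 = 2GB
checkpoint_completion_target = 0.9
# durability stays on; this only spreads the write load
synchronous_commit           = on

The MariaDB analogue is InnoDB flushing and redo log sizing; the knobs differ, the goal (steady writes) is the same.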

Pattern 5: Scale reads and writes differently

On a single 8GB VPS, vertical scaling is limited. So you scale the shape of the work instead:

  • Reads: caching, read replicas, query optimization, indexing, denormalized read models where necessary.
  • Writes: reduce write amplification, partition large churny tables, archive old data, avoid “update every row” operations.

PostgreSQL and MariaDB both can replicate. The operational tradeoffs differ, but the principle holds: don’t expect one box to do infinite work.

Pattern 6: Observe the database from the outside

Internal stats are great. External symptoms are better for triage:

  • p95/p99 query latency
  • disk await and utilization
  • swap activity
  • connection rate (new connections/sec)
  • checkpoint/flush behavior

If you can’t see it, you’ll “solve” the wrong problem. That’s how outages get repeated with better dashboards.

Joke #2: Turning off durability to “improve performance” is like removing your brakes to win a race. You’ll definitely arrive faster, briefly.

Three mini-stories from corporate life

Mini-story 1: The incident caused by a wrong assumption (PostgreSQL connections)

A mid-sized SaaS team moved from a managed database to a self-hosted PostgreSQL instance on an 8GB VPS to cut costs.
The migration went smoothly. The first week looked quiet. Everyone congratulated themselves and went back to shipping features.

Then they onboarded a large customer. Traffic doubled. Latency crept up, but only during business hours. The team assumed it was “just heavier queries”
and started chasing indexes. It helped a little. Then the real incident hit: the app started throwing intermittent 500s, and the DB CPU graph looked oddly calm.

The wrong assumption was subtle: they thought “PostgreSQL can handle lots of connections like our old system did.”
Their app tier used a default connection pool of 50 per instance. With a handful of instances, it was already high. With auto-scaling, it exploded.
The box ran hundreds of backend processes, ate memory, forced the OS to reclaim cache, and turned every cache miss into an I/O wait party.

The fix wasn’t a clever index. It was boring: reduce app pool sizes, deploy PgBouncer, cap server-side max connections, and set timeouts so idle sessions didn’t linger.
They also adjusted retry behavior to avoid a feedback loop.

After that, query tuning actually worked—because the system was no longer thrashing itself to death.
The lesson stuck: on small machines, connection management is performance management.

Mini-story 2: The optimization that backfired (MariaDB memory knobs)

Another company ran MariaDB for an internal analytics dashboard on an 8GB VM. It was mostly read-heavy, with periodic batch loads.
Someone read an old tuning guide and decided that “bigger buffers are always better.” They increased per-thread sort and join buffers and enlarged temporary table limits.

On a calm system, it looked great. Queries that used to spill to disk became faster. The dashboard felt snappier.
The change got promoted to production without load testing because the system “wasn’t mission-critical.” That phrase ages poorly.

When the next batch load ran, it kicked off several concurrent reports plus ETL jobs. Each job opened multiple connections.
Each connection now had permission to allocate large buffers. Memory usage spiked fast, the kernel started swapping, and the box entered the classic state:
CPU mostly idle, load average high, requests timing out, and everyone blaming “the database being slow.”

Rolling back the memory changes stabilized the system. The long-term fix was more disciplined:
put a cap on concurrent heavy jobs, fix the worst queries and indexes, and keep per-connection memory small so concurrency didn’t multiply into chaos.

The takeaway: memory knobs can be performance knobs or self-destruct buttons. On 8GB, they’re usually both.

Mini-story 3: The boring but correct practice that saved the day (disk headroom and WAL/binlog hygiene)

A team ran a mixed workload—transactions plus background jobs—on a single VPS while they waited for budget approval.
Not ideal, but reality has opinions. They did one thing consistently: routine capacity checks and conservative disk headroom.

They tracked disk usage, WAL/binlog growth patterns, and ensured log rotation was sane.
They also monitored replication lag (when enabled) and kept a strict alert on filesystem usage thresholds.
None of this was glamorous. No one got promoted for it.

One day, a deployment introduced a bug that generated far more write activity than normal. WAL/binlogs started growing.
On many teams, this is where you learn your disk is full when the database stops accepting writes.
They didn’t. An alert fired early, before the danger zone.

The on-call throttled the offending job, paused the rollout, and trimmed the blast radius while the DB stayed healthy.
Postmortem was short. The fix was straightforward. The customer impact was minimal.

Boring practice doesn’t prevent every bug. It prevents bugs from becoming incidents.

Common mistakes: symptom → root cause → fix

1) “DB is slow” but CPU is low and load average is high

Symptom: p95 latency spikes; CPU idle; load high; users complain.

Root cause: I/O wait (disk saturation) or swapping.

Fix: Check iostat/vmstat. Reduce concurrency, stop swap, move to better disk tier, and cut write amplification (indexes, vacuum/purge tuning).

2) Random connection errors during traffic spikes

Symptom: “too many connections”, “could not connect”, intermittent failures.

Root cause: Unbounded app connection pools + lack of pooling + low FD limits.

Fix: Enforce pool sizes, use PgBouncer for PostgreSQL, set realistic max_connections, raise ulimit -n, and add backpressure.

3) PostgreSQL gets slower over days/weeks, then “magically” improves after maintenance

Symptom: Gradual slowdown; disk grows; more reads; autovacuum logs appear.

Root cause: Table/index bloat from MVCC and insufficient vacuuming, often made worse by long transactions.

Fix: Find and eliminate long-running/idle-in-transaction sessions. Tune autovacuum per table. Schedule vacuum/analyze where needed.

4) MariaDB is fine until one report runs, then everything stalls

Symptom: A single dashboard/report causes global slowdown; temp tables spike.

Root cause: Bad query plan + missing index + temp tables spilling to disk + concurrency.

Fix: Use slow query log, EXPLAIN, add indexes, cap concurrent report jobs, and avoid inflating per-thread buffers as a band-aid.

5) Replication lag spikes during peak hours

Symptom: Read replica behind; failover risk increases.

Root cause: Write bursts, slow disk fsync, large transactions, or vacuum/checkpoint pressure.

Fix: Reduce write bursts, break up large transactions, tune checkpoint/flush behavior carefully, and ensure replication settings match disk capability.

6) Latency spikes every few minutes like a heartbeat

Symptom: Regular stalls, periodic slowdowns, then recovery.

Root cause: Checkpoints/flush storms (PostgreSQL checkpoints; InnoDB flushing) or scheduled jobs colliding.

Fix: Smooth write workload; tune checkpoint interval and completion target (PostgreSQL); validate InnoDB flushing settings; schedule batch jobs away from peak.

7) “We increased caches and it got worse”

Symptom: Higher memory settings lead to more latency, not less.

Root cause: Starving OS cache, increasing swap risk, or increasing per-session allocation headroom.

Fix: Back off. Keep headroom. Make memory settings proportional to concurrency and workload, not to hope.

Checklists / step-by-step plan

Step-by-step: scale clients safely on an 8GB VPS (either database)

  1. Separate concerns: if possible, keep DB on its own VPS. If you can’t, budget memory and CPU for the DB explicitly.
  2. Measure the baseline: disk await, swap activity, connection counts, p95 latency. Write them down. If you don’t, you’ll argue later.
  3. Set hard connection limits: cap server connections and app pool sizes. Avoid “autoscaling equals infinite connections.”
  4. Add pooling: PgBouncer for PostgreSQL; for MariaDB use robust app pooling and consider thread pool where appropriate.
  5. Set timeouts: kill idle-in-transaction sessions; set statement timeouts for runaway queries; use app timeouts that don’t cause retry storms.
  6. Protect RAM: keep per-session memory small; reserve OS cache; avoid swap. Add headroom like you mean it.
  7. Turn on slow query visibility: slow query log (MariaDB) or pg_stat_statements (PostgreSQL) and log slow queries; a PostgreSQL sketch follows this list.
  8. Fix the top offenders first: add missing indexes, remove N+1 queries, eliminate large unbounded scans.
  9. Stabilize writes: avoid huge transactions; batch background jobs; tune checkpoint/flush behaviors carefully and validate with metrics.
  10. Plan growth: if client count keeps rising, plan read replicas, partitioning, or moving to a bigger box before you’re forced to during an incident.
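
For step 7 on the PostgreSQL side, a minimal pg_stat_statements sketch (column names as of PostgreSQL 13+; older versions use total_time):

# postgresql.conf: needs a restart to load the module
shared_preload_libraries = 'pg_stat_statements'

cr0x@server:~$ sudo -u postgres psql -X -d appdb -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
cr0x@server:~$ sudo -u postgres psql -X -d appdb -c "SELECT calls, round(total_exec_time) AS total_ms, left(query,60) AS q FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 3;"

The three queries at the top of that list usually explain most of the pain; fix those before touching any global knob.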

PostgreSQL-specific checklist (8GB VPS)

  • Use a pooler if client concurrency is more than a few dozen.
  • Keep shared_buffers moderate (often around 1–2GB on 8GB DB-only); leave room for OS cache.
  • Set work_mem conservatively and remember it can be allocated multiple times per query (once per sort or hash node).
  • Watch autovacuum: dead tuples, long transactions, vacuum freeze risk.
  • Check wait events before changing settings. Locks and I/O waits need different fixes.

MariaDB-specific checklist (8GB VPS)

  • Size innodb_buffer_pool_size realistically (often 4–6GB on DB-only 8GB, depending on connection count and per-thread buffers).
  • Keep per-thread buffers sane; avoid “just increase sort buffer” without concurrency math.
  • Enable and review slow query log regularly; fix the worst queries, not the average ones.
  • Validate temp table behavior and reduce disk temp tables by indexing and query rewrites.
  • Consider thread pool when you have many connections but limited CPU.

Capacity planning rule-of-thumb (practical, not perfect)

On 8GB, aim for a small, stable number of active executing queries (tens, not hundreds).
Let clients queue in a pooler or app queue rather than letting the kernel and database fight for memory and I/O.

FAQ

1) Which is “better” on an 8GB VPS: MariaDB or PostgreSQL?

If you need strict correctness semantics, strong observability of waits/locks, and predictable query planning tools, PostgreSQL is usually the better ops experience.
If you’re deep in the MySQL ecosystem and your workload is straightforward OLTP with disciplined pooling, MariaDB can run very well.
The bigger predictor is whether you control connection counts and write behavior.

2) How many connections should I allow?

Fewer than you think. On an 8GB VPS, direct database connections in the hundreds are a common path to swap and I/O collapse.
Prefer a small server-side limit (often tens to low hundreds depending on workload) and queue at the pooler/app layer.
Measure memory and latency under peak, then adjust.

3) Do I really need PgBouncer for PostgreSQL?

If you have many app instances, short-lived connections, or spiky traffic: yes.
PostgreSQL’s connection model makes pooling the clean way to scale client count without scaling backend processes.
If you have a small number of long-lived connections and no spikes, you can skip it—until you can’t.

4) Can MariaDB handle more connections than PostgreSQL without pooling?

It can handle more threads than PostgreSQL can handle processes, but “handle” isn’t the same as “stay fast.”
Memory per connection and per-query buffers still multiply, and contention still rises.
Pooling and sane app behavior matter just as much.

5) Should I increase shared_buffers (PostgreSQL) to use most of RAM?

No. PostgreSQL benefits heavily from OS cache. On 8GB, a moderate shared_buffers often performs better than an aggressive one
because it leaves room for filesystem cache and avoids memory pressure. Validate with cache hit rates and disk await.

6) Should I set innodb_buffer_pool_size to 75–80% of RAM (MariaDB)?

Only if the DB is the main service and your connection counts and per-thread buffers are controlled.
If you have high concurrency or big per-session buffers, 80% can push you into swap during peaks.
Start conservative, measure, then increase.

7) Why does performance get worse when I increase per-query memory (work_mem, sort/join buffers)?

Because concurrency turns “per query” into “per everything.”
Higher per-query memory can help single queries, but under load it increases total memory footprint, squeezes cache, and triggers swapping.
On small RAM, you want predictable behavior more than peak single-query speed.

8) What’s the first sign I’m I/O bound?

Rising latency with low CPU and high iowait is the classic. iostat shows high %util and elevated await.
PostgreSQL wait events often show I/O-related waits; MariaDB may show “Sending data” while actually waiting on reads.

9) Is it safe to disable fsync or relax durability to survive peak load?

It’s a business decision dressed up as a tuning knob. If you disable fsync, you are accepting data loss on crash.
If you need throughput, fix concurrency, queries, and storage first. Only relax durability if you can explain the blast radius to stakeholders without blinking.

10) When do I stop tuning and upgrade the VPS?

When disk is saturated at steady state, when you can’t avoid swap during peak despite disciplined limits, or when CPU steal time dominates.
Tuning can’t create IOPS or reclaim stolen CPU. Upgrading is sometimes the most reliable optimization.

Next steps you can do this week

  1. Pick a connection strategy: pooler for PostgreSQL, strict app pooling for MariaDB. Write down target max server concurrency.
  2. Run the fast diagnosis checks during peak: vmstat, iostat, connection counts, lock waits.
  3. Enable slow query visibility and fix the top 3 offenders. Not 30. Three.
  4. Set timeouts that prevent runaway sessions and retry storms.
  5. Create a capacity budget for RAM: shared cache + per-connection overhead + OS cache headroom. Then enforce it with limits.
  6. Plan the exit: if growth continues, decide early whether you’ll add a replica, move to a bigger box, or split workloads.

On an 8GB VPS, you don’t win by being clever. You win by being disciplined: fewer connections, steadier writes, and a refusal to let defaults drive architecture.
