MySQL vs MariaDB max_connections: stop OOM crashes on small servers

It always happens at the worst possible time: a tiny VM, a sudden traffic spike, a few slow queries, and then the database disappears like it had somewhere better to be. You log in and see it: the Linux OOM killer took out mysqld. Again.

When someone says “Just raise max_connections so users stop getting errors,” they are usually asking you to trade visible failure (too many connections) for invisible failure (memory blow-up), and invisible failure wins by crashing the process. This piece is about making that trade explicit, sizing it correctly, and not getting surprised.

What max_connections really means (and why it’s a memory promise)

max_connections is not a performance knob. It is a risk knob.

At face value, it’s the maximum number of simultaneous client connections the server will allow. On a calm day, you might think it’s just about concurrency: more connections means more users. On a production day, it’s about how many memory-hungry threads your database is willing to host at once before it runs out of RAM and gets killed.

Each connection can carry per-session and per-thread allocations. Some are tiny. Some are surprisingly not. Some get allocated only when needed (sorts, joins, temp tables). The problem is that “when needed” is exactly when the system is already stressed: slow queries pile up, connections accumulate, and those expensive allocations happen right as you least want them.

So the correct mental model is:

  • Global memory (shared by everyone): InnoDB buffer pool, log buffers, caches, internal structures.
  • Per-connection baseline: thread stack, network buffers, session structures.
  • Per-query spikes: sort buffers, join buffers, temp table memory, read buffers.

If you set max_connections too high on a small server, you’ve effectively promised the kernel: “I might spawn up to N threads, and each might do memory-expensive work at the same time.” The kernel believes you. The OOM killer has a different opinion.

One short joke, because we need it: Raising max_connections to fix ‘too many connections’ is like buying a bigger trash can to fix a kitchen fire. It addresses the visible symptom and ignores the heat.

Interesting facts and context (because this debate has history)

  1. MySQL’s thread-per-connection model is old and durable. It scales surprisingly well with good workload patterns, but it makes “too many idle connections” a memory and scheduler problem.
  2. MariaDB was forked from MySQL in 2009, when Oracle moved to acquire Sun (and with it MySQL); compatibility stayed high at first, but defaults and features diverged over time.
  3. MySQL 5.6 popularized the “big InnoDB buffer pool” era; before that, many deployments leaned too hard on OS cache and MyISAM. That changed memory planning culture.
  4. InnoDB became the default engine for MySQL 5.5. That single decision increased the importance of global memory sizing and reduced tolerance for “random tuning.”
  5. Performance Schema (MySQL) and similar instrumentation changed how memory and waits could be observed—if you turn it on and actually use it.
  6. Linux OOM killer behavior is not “random,” but it is non-intuitive: it chooses victims based on badness scoring, memory usage, and constraints. Databases are large, so they’re attractive targets.
  7. “Swap makes databases slow” became lore, but in reality “no swap on a tiny VM” often turns a recoverable spike into a process kill.
  8. Connection pooling became mainstream in the 2010s for web stacks because thread-per-connection databases don’t love thousands of mostly-idle TCP sessions.

MySQL vs MariaDB: the differences that matter for OOM risk

If you’re looking for a simple answer like “MySQL uses less memory” or “MariaDB is safer,” you won’t get one. Both can be stable or catastrophic. The difference is where the foot-guns are, and which defaults you inherited.

Defaults and packaging: your real database is “MySQL + distro choices”

On small servers, the biggest risk factor is not the fork; it’s the packaged configuration and what your team copied from a blog in 2016.

Common patterns:

  • MariaDB packages often ship with conservative defaults for small instances, but your cloud image might override them with “helpful” presets.
  • MySQL on managed platforms (or vendor builds) sometimes enables instrumentation or extra caches that increase baseline memory.
  • Both can be configured into the ground by setting per-thread buffers too large and then raising max_connections.

Thread handling and connection limits

Both typically use a thread per client connection. Thread scheduling overhead and memory overhead scale with connection count.

Where people get burned:

  • High max_connections with long wait_timeout creates a large population of idle sessions. Idle isn’t free.
  • High connection churn (connect/disconnect per request) causes CPU waste and sometimes memory fragmentation patterns that look like “leaks.”
  • Thread cache can help, but it can also keep a large number of thread resources around if you let it.

Memory observability differences

MySQL’s Performance Schema has gotten better at breaking down memory usage in modern versions, but it’s not automatically “cheap.” MariaDB has its own instrumentation and status variables; the practical difference is: which tool your team actually uses.

For OOM prevention, the winner is the system that:

  • lets you estimate per-connection and per-query spikes,
  • shows you current connection state distribution (Sleep vs active),
  • helps you find slow queries causing pile-ups,
  • makes it easy to enforce sane limits.

The one real “fork” difference that matters: you can’t assume knobs map 1:1

Most configuration keys are similar, but not identical across versions and forks. You can’t safely apply a tuning guide written for “MySQL” without checking which server you’re running. Some variables exist in both but have different defaults or behavior. Memory is unforgiving; a small mismatch is enough.
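
Before applying any guide, confirm which fork and version you are actually running. This is a minimal check that works on both servers; the exact strings in the output vary by build and distro, so treat them as informational:

cr0x@server:~$ mysql -e "SELECT VERSION() AS version, @@version_comment AS build;"

If the version string says MariaDB, validate every variable name and default against the MariaDB docs for that release; if it reports a MySQL version, do the same against the matching MySQL major version.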

Fast diagnosis playbook

This is the order I use when a small server is flapping and people are yelling.

1) Confirm it was OOM and identify the kill pattern

  • Check kernel logs for OOM killer entries naming mysqld.
  • Decide whether you’re dealing with a single spike event (bad query) or chronic overcommit (too many connections + oversized buffers).

2) Measure current connection state and pile-up cause

  • How many connections are active vs sleeping?
  • Are threads stuck in “Sending data,” “Copying to tmp table,” “Sorting result,” or waiting for locks?
  • Do you have a thundering herd from a pool misconfiguration?

3) Compute the “worst credible memory” quickly

  • Global: InnoDB buffer pool + log buffers + other global caches.
  • Per-connection baseline: thread stack + net buffers + session.
  • Spikes: sort/join buffers and temp tables (memory) for active queries.

4) Apply a stabilizing cap

  • Lower max_connections to a value you can afford today.
  • Lower per-thread buffers to sane defaults if someone inflated them.
  • Shorten timeouts so idle sessions don’t camp forever.
  • Implement or fix connection pooling at the app layer.

5) Fix the root cause: slow queries and lock contention

  • Find the worst offenders and make them stop holding resources.
  • Add indexes, rewrite queries, reduce result sizes, or batch work.

If you do steps 1–4, you stop the bleeding. Step 5 keeps it from coming back.

The memory model: global buffers, per-thread buffers, and the trap door

Databases OOM on small servers because people reason about memory like it’s a single knob. It’s not. It’s a pile of knobs, and some multiply by connection count.

Global memory: the stuff that’s always there

On InnoDB-heavy systems (which is most of them), the big one is:

  • innodb_buffer_pool_size: typically the largest consumer. Great for performance. Fatal when oversized.

Other global or mostly-global consumers include:

  • innodb_log_buffer_size and other InnoDB internal allocations,
  • various caches (table cache, etc.),
  • instrumentation overhead if enabled at scale.

Global memory is at least predictable. You set it, it stays mostly stable.

Per-connection baseline: the silent multiplier

Every connection has overhead. The exact footprint depends on version, build, and workload, but the multiplier behavior doesn’t change: more sessions means more memory and more scheduling overhead.

Key contributors:

  • thread_stack: allocated per thread.
  • Network buffers and per-session structures.
  • Thread cache effects: it can keep thread resources alive after disconnect.

Per-query spikes: where OOM happens during “slow query storms”

Spiky allocations are the true villain. The classic ones:

  • sort_buffer_size: used for sorts that can’t be satisfied by indexes.
  • join_buffer_size: used for joins when indexes aren’t used effectively.
  • read_buffer_size and read_rnd_buffer_size for certain access patterns.
  • Temporary tables: can use memory up to a limit, then spill to disk.

Here’s the trap door: those buffers are often allocated per thread per operation. If you have 200 active connections doing sorts concurrently, even modest buffer sizes can become catastrophic.
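
You can do the multiplication in the client itself. Using the inflated values that show up later in Task 8 (8 MiB sort buffer, 4 MiB join buffer) and assuming 200 concurrent memory-heavy operations, the spike alone is about 2.4 GiB before you count the buffer pool or the per-connection baselines:

cr0x@server:~$ mysql -e "SELECT ROUND(200 * (8388608 + 4194304) / 1024 / 1024) AS spike_mib;"
+-----------+
| spike_mib |
+-----------+
|      2400 |
+-----------+

That is arithmetic, not a measurement: not every query allocates the full buffer, and some allocate it more than once. But it is the right order of magnitude for worst-case planning.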

Why “I had free memory yesterday” doesn’t mean you’re safe

OOM events are usually correlated with concurrency spikes and tail latency. A slow query increases its own runtime, which increases the number of concurrent queries, which increases memory use, which slows things further. You get a feedback loop that ends when either the load drops… or the kernel drops your database.

Second short joke, because we’re allowed exactly two: The OOM killer is your most decisive SRE colleague: it never bikesheds, it just deletes the largest process and goes to lunch.

One operational quote

Paraphrased idea from John Allspaw: “In complex systems, incidents come from normal work interacting, not a single broken part.”

This matters because OOM on small servers is rarely “a leak.” It’s normal buffers + normal connections + one slow thing, all at once.

Right-sizing max_connections on small servers

Here’s the opinionated guidance: on small servers, you should prefer fewer database connections, kept warm, and multiplexed by a pool, rather than “let everyone connect directly.”

Step 1: pick a memory budget for MySQL/MariaDB

On a small VM, you don’t own the whole box. The OS needs page cache, the service manager needs headroom, and if you have co-located apps, they need room too.

A practical approach:

  • Decide how much RAM you can give the database process under pressure. Not “available,” but “safe.”
  • Reserve headroom for spikes, filesystem cache, and the unexpected.

If you’re running with no swap, be extra conservative. The difference between “slow” and “dead” is sometimes 512MB.

Step 2: compute your global baseline

Start with innodb_buffer_pool_size. Add other global allocations. Don’t overthink the smaller ones; just leave margin.

Step 3: estimate per-connection cost (baseline + credible spike)

On small servers, you don’t need the perfect number. You need a conservative upper bound.

Break it into two categories:

  • Baseline per connection: thread stack + session + net buffers. Assume a few MB per connection unless you have evidence otherwise.
  • Active-query overhead: for the fraction of connections that can be concurrently running memory-heavy operations, assume sort/join buffers could be used.

That last part is key: not all sessions are active. If you have 300 connections but only 20 are truly active at once, you size for 300 baseline + 20 spikes. If you have 300 active because the app is doing synchronous work and everything is slow, you’re already in trouble and max_connections should not be 300.
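
A rough way to put a number on “worst credible memory” is the classic upper-bound formula: global buffers plus max_connections times the per-thread buffers. It overstates reality (not every session allocates everything at once) and can also understate it (some buffers can be allocated more than once per query), but it tells you whether your configuration is even theoretically survivable. A sketch using only standard variables that exist in both MySQL and MariaDB:

cr0x@server:~$ mysql -e "SELECT ROUND((@@GLOBAL.innodb_buffer_pool_size + @@GLOBAL.innodb_log_buffer_size + @@GLOBAL.key_buffer_size + @@GLOBAL.max_connections * (@@GLOBAL.sort_buffer_size + @@GLOBAL.join_buffer_size + @@GLOBAL.read_buffer_size + @@GLOBAL.read_rnd_buffer_size + @@GLOBAL.thread_stack + @@GLOBAL.binlog_cache_size)) / 1024 / 1024 / 1024, 1) AS worst_case_gib;"

With the values from the tasks below (3 GiB buffer pool, max_connections of 600, 8 MiB sort buffer, 4 MiB join buffer) and typical defaults for the rest, this lands roughly in the 13–14 GiB range on a host that has about 4 GiB of RAM. That gap is the OOM kill waiting to happen.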

Step 4: set max_connections to the value you can actually afford

On a small server, the safest default is often in the range of 50–200, depending on workload and buffers. If you’re tempted to set it to 1000 “just in case,” you’re treating the database like a message queue. It is not a message queue.

Also decide whether you need a reserved admin slot:

  • max_connections caps the total for ordinary accounts.
  • Both servers reserve one extra connection for a privileged account (SUPER, or CONNECTION_ADMIN on newer MySQL), and newer MySQL releases add a separate admin interface; either way, keep room for emergency access and automation.
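
If you want a starting point to adapt, here is a minimal sketch of a capacity-focused override for a roughly 4 GiB VM. The file path and every value are assumptions to adjust for your distro and workload, not numbers to copy blindly; the point is that the connection cap and the buffers get decided together, in one place:

cr0x@server:~$ cat /etc/mysql/conf.d/zz-capacity.cnf
# Hypothetical capacity contract for a ~4 GiB VM; adjust before use.
[mysqld]
innodb_buffer_pool_size = 2G
max_connections         = 150
thread_cache_size       = 32
wait_timeout            = 300
interactive_timeout     = 300
sort_buffer_size        = 2M
join_buffer_size        = 1M
tmp_table_size          = 32M
max_heap_table_size     = 32M

On older servers, changing innodb_buffer_pool_size requires a restart; most of the rest can be applied at runtime and then persisted here so a restart doesn’t undo your work.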

Step 5: fix the connection pattern at the application

If you have lots of short-lived connections, you will get hammered by handshake/auth overhead and thread churn. Use a pool. Keep it modest. Set a hard cap per app instance. Then scale horizontally if needed.

Bonus: connection pooling tends to make tail latency better because you remove “connection storms” that happen during partial outages.

Practical tasks: 14 commands that tell you what to do next

These are real tasks you can run on a Linux server. Each has: a command, typical output, what it means, and the decision you make.

Task 1: confirm OOM killer hit mysqld

cr0x@server:~$ sudo journalctl -k -b | egrep -i 'oom-killer|out of memory|killed process|mysqld' | tail -n 20
Dec 29 09:12:01 server kernel: Out of memory: Killed process 1423 (mysqld) total-vm:3187420kB, anon-rss:1456820kB, file-rss:0kB, shmem-rss:0kB, UID:110 pgtables:4120kB oom_score_adj:0
Dec 29 09:12:01 server kernel: oom_reaper: reaped process 1423 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

What it means: This is a kernel-level kill, not a database crash. Your DB asked for memory; the kernel said no.

Decision: Treat as memory capacity/limits issue first, not “MySQL bug.” You can’t tune a dead process.

Task 2: check system memory and swap posture

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       3.4Gi       120Mi        52Mi       310Mi       140Mi
Swap:          0B          0B          0B

What it means: You are running with essentially no slack and no swap. Any spike is a kill.

Decision: Add swap (even small) or lower DB memory footprint immediately. Prefer both.
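
If you decide to add a small swap file, a minimal sketch on an ext4 root looks like the following; the size, path, and swappiness value are assumptions, and some filesystems need dd instead of fallocate:

cr0x@server:~$ sudo fallocate -l 1G /swapfile && sudo chmod 600 /swapfile
cr0x@server:~$ sudo mkswap /swapfile && sudo swapon /swapfile
cr0x@server:~$ echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
cr0x@server:~$ sudo sysctl --system

Add the swap file to /etc/fstab so it survives reboot, and alert on sustained swap-in: swap here is a crumple zone, not extra capacity.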

Task 3: see the mysqld process RSS and threads (when it’s up)

cr0x@server:~$ ps -o pid,cmd,rss,vsz,nlwp --sort=-rss -C mysqld
  PID CMD                          RSS    VSZ NLWP
 2189 /usr/sbin/mysqld          1684200 3278800  198

What it means: RSS is real memory in use. NLWP gives you a clue about thread/connection scale.

Decision: If NLWP balloons with connection count, you need pooling/timeouts and lower max_connections.

Task 4: check current max_connections and connection usage

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW GLOBAL STATUS LIKE 'Threads_connected';"
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 600   |
+-----------------+-------+
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 412   |
+----------------------+-------+
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 389   |
+-------------------+-------+

What it means: You’ve allowed 600 and you’re actively sitting near 400. On a small VM, that’s usually a bad sign.

Decision: Cap max_connections closer to what you can afford (and reduce actual concurrency via pooling).
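
A sketch of the immediate cap, assuming 150 is what your memory math supports; the runtime change takes effect right away, and you still need to persist it:

cr0x@server:~$ mysql -e "SET GLOBAL max_connections = 150;"

On MySQL 8.0 you can use SET PERSIST instead to survive restarts; on MariaDB and older MySQL, mirror the value in the config file. Existing connections are not dropped by this change; it only limits new ones.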

Task 5: break down “Sleep” vs active connections

cr0x@server:~$ mysql -N -e "SELECT COMMAND, COUNT(*) FROM information_schema.PROCESSLIST GROUP BY COMMAND ORDER BY COUNT(*) DESC;"
Sleep	340
Query	28
Connect	3
Binlog Dump	1

What it means: Most connections are idle. This is often an app pool sizing issue or overly long timeouts.

Decision: Lower wait_timeout and fix pool max size per service instance. Don’t “solve” this by raising max_connections.

Task 6: identify what active queries are doing (state distribution)

cr0x@server:~$ mysql -N -e "SELECT STATE, COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND='Query' GROUP BY STATE ORDER BY COUNT(*) DESC;"
Sending data	11
Copying to tmp table	6
Sorting result	4
Waiting for table metadata lock	3
executing	2

What it means: Temp tables and sorts are happening now. That’s where memory spikes tend to occur.

Decision: Look for the offending queries and reduce the need for sorts/temp tables (indexes, query rewrites), and tighten per-thread buffers.

Task 7: find the top longest-running queries currently executing

cr0x@server:~$ mysql -e "SELECT ID,USER,HOST,DB,TIME,STATE,LEFT(INFO,120) AS INFO FROM information_schema.PROCESSLIST WHERE COMMAND='Query' ORDER BY TIME DESC LIMIT 10;"
+------+------+-----------------+------+------+---------------------------------+------------------------------------------------+
| ID   | USER | HOST            | DB   | TIME | STATE                           | INFO                                           |
+------+------+-----------------+------+------+---------------------------------+------------------------------------------------+
| 7712 | app  | 10.0.2.15:53342 | prod |  188 | Copying to tmp table            | SELECT ... ORDER BY created_at DESC LIMIT 1000 |
| 7761 | app  | 10.0.2.15:53388 | prod |  163 | Sorting result                  | SELECT ... JOIN ... WHERE ...                  |
| 7799 | app  | 10.0.2.18:40210 | prod |  142 | Waiting for table metadata lock | ALTER TABLE ...                                |
+------+------+-----------------+------+------+---------------------------------+------------------------------------------------+

What it means: The “storm” is visible. You have long runners and even DDL waiting on locks.

Decision: Consider killing the worst offenders (carefully), and separate online schema changes from peak hours. Fix the query patterns after you stabilize.
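
If you decide the long-running report has to go, kill the statement rather than the whole session where possible. Using the ID from the example output above:

cr0x@server:~$ mysql -e "KILL QUERY 7712;"

KILL QUERY stops the running statement and leaves the connection open; plain KILL (or KILL CONNECTION) drops the session. Be much more careful with DDL and writes: rolling back a half-finished statement can itself take time and resources.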

Task 8: check memory-sensitive per-thread variables

cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('sort_buffer_size','join_buffer_size','read_buffer_size','read_rnd_buffer_size','thread_stack','tmp_table_size','max_heap_table_size');"
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| join_buffer_size     | 4194304  |
| max_heap_table_size  | 67108864 |
| read_buffer_size     | 2097152  |
| read_rnd_buffer_size | 4194304  |
| sort_buffer_size     | 8388608  |
| thread_stack         | 262144   |
| tmp_table_size       | 67108864 |
+----------------------+----------+

What it means: Someone inflated buffers. With high concurrency, this is an OOM recipe.

Decision: Reduce these to conservative values unless you have proof they help. Big buffers help single queries; they hurt overall stability.
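
As a stop-gap you can lower the global values at runtime; these are sketch values, and note that already-connected sessions keep the per-session values they started with, so the effect arrives as connections recycle:

cr0x@server:~$ mysql -e "SET GLOBAL sort_buffer_size = 2097152; SET GLOBAL join_buffer_size = 1048576; SET GLOBAL read_rnd_buffer_size = 1048576;"

Persist whatever you settle on in the config file so a restart doesn’t silently restore the inflated values.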

Task 9: check InnoDB buffer pool size vs RAM

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 3221225472 |
+-------------------------+------------+

What it means: That’s 3GiB. On a ~4GiB host, it leaves almost nothing for connections, temp tables, and the OS.

Decision: Lower buffer pool on small boxes (often 50–70% RAM depending on co-tenancy and swap). Then re-evaluate cache hit rates and performance.
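
On reasonably modern servers (MySQL 5.7+ and MariaDB 10.2+) the buffer pool can be resized online; older versions need a config change and a restart. A sketch that drops it to 2 GiB:

cr0x@server:~$ mysql -e "SET GLOBAL innodb_buffer_pool_size = 2147483648;"

The resize happens in chunks and can take a while under load, and the final size is rounded to a multiple of the chunk size. Persist the new value in the config file as well.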

Task 10: check whether tmp tables are spilling to disk (performance + memory clues)

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%';"
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| Created_tmp_disk_tables | 183492  |
| Created_tmp_files       | 2281    |
| Created_tmp_tables      | 512044  |
+-------------------------+---------+

What it means: A significant number of temp tables spill to disk. That can slow queries, causing connection pile-ups; memory may still spike before the spill.

Decision: Fix queries/indexes first. Don’t blindly increase tmp_table_size/max_heap_table_size on a small server; that increases peak memory risk.

Task 11: check thread cache and thread creation churn

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'thread_cache_size'; SHOW GLOBAL STATUS LIKE 'Threads_created'; SHOW GLOBAL STATUS LIKE 'Connections';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| thread_cache_size | 50    |
+-------------------+-------+
+-----------------+--------+
| Variable_name   | Value  |
+-----------------+--------+
| Threads_created | 924812 |
+-----------------+--------+
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| Connections   | 985004 |
+---------------+--------+

What it means: If Threads_created rises rapidly relative to Connections, you have connection churn and insufficient caching.

Decision: Prefer fixing the app to reuse connections. Adjust thread_cache_size modestly, but don’t let it mask a bad connection pattern.

Task 12: check timeouts that control idle-connection hoarding

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'wait_timeout'; SHOW VARIABLES LIKE 'interactive_timeout';"
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| wait_timeout  | 28800  |
+---------------+--------+
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| interactive_timeout | 28800 |
+---------------------+-------+

What it means: Eight hours. On small servers with busy apps, this is a common reason Sleep connections accumulate.

Decision: Lower timeouts (for non-interactive app users) and enforce pool sizing. If you truly need long-lived connections, you need a bigger box or fewer of them.
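
A sketch of tightening the timeouts at runtime; 300 seconds is an assumption, not a universal answer, and sessions that are already connected keep the timeout they were created with:

cr0x@server:~$ mysql -e "SET GLOBAL wait_timeout = 300; SET GLOBAL interactive_timeout = 300;"

Make sure the application pool’s idle timeout is shorter than the server’s, otherwise the app will occasionally try to reuse a connection the server already closed and log “MySQL server has gone away” errors.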

Task 13: check open files and table cache pressure (secondary symptom)

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'table_open_cache'; SHOW GLOBAL STATUS LIKE 'Open_tables'; SHOW GLOBAL STATUS LIKE 'Opened_tables';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| table_open_cache | 4000  |
+------------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Open_tables   | 3980  |
+---------------+-------+
+----------------+---------+
| Variable_name  | Value   |
+----------------+---------+
| Opened_tables  | 784920  |
+----------------+---------+

What it means: If Open_tables is pinned near table_open_cache and Opened_tables climbs fast, you’re churning table opens. This can add overhead and latency, indirectly increasing connection concurrency.

Decision: Tune table cache and OS file limits, but don’t confuse this with the primary OOM driver. Fix memory multipliers first.

Task 14: verify systemd limits and cgroup memory constraints

cr0x@server:~$ systemctl show mysql -p MemoryMax -p TasksMax -p LimitNOFILE
MemoryMax=infinity
TasksMax=4915
LimitNOFILE=1048576

What it means: You’re not capped by systemd memory limits here. If MemoryMax were set, the DB might be getting killed by cgroup constraints rather than system OOM.

Decision: If you run in containers or strict cgroups, size for that limit, not host RAM, and consider setting oom_score_adj and memory reservations carefully.
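
If you want the service manager to participate, a minimal drop-in sketch looks like this; the unit name (mysql vs mariadb), the score, and the limit are assumptions, and MemoryMax only helps if the footprint genuinely fits under it, otherwise you have just traded a host OOM kill for a cgroup one:

cr0x@server:~$ cat /etc/systemd/system/mysql.service.d/override.conf
[Service]
# Make the kernel prefer other victims; this reduces, not removes, OOM risk.
OOMScoreAdjust=-600
# Only set a cap if buffers + connections are sized to stay under it (cgroup v2).
MemoryMax=3G
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart mysql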

Three corporate mini-stories from the OOM trenches

Mini-story #1: the incident caused by a wrong assumption

They had a small production environment for a “non-critical” internal tool. The tool became critical the moment payroll reporting depended on it. The database lived on a 4GB VM, and it had been “fine for months.” That phrase is always the setup.

A developer noticed occasional Too many connections errors during a busy window and made a change: max_connections went from a few hundred to a thousand. The assumption was simple: more connections means fewer errors. Nobody checked memory math. Nobody asked what the sleeping connections were doing.

The next Monday, an innocuous report query got slower due to a missing index. Requests stacked up, the app opened more sessions, and the DB complied because it was now allowed to. Memory climbed steadily. Then the kernel killed mysqld. The tool went down, payroll reporting failed, and a lot of people learned what “blast radius” means.

The postmortem wasn’t dramatic. It was worse: it was boring. “We assumed connection limit errors were the cause, not the symptom.” They fixed the query, implemented a pool with a hard cap, lowered max_connections, and reduced timeouts. The same VM survived future spikes with the exact same RAM. The difference was that the database was no longer being volunteered as a stress ball for the app.

Mini-story #2: the optimization that backfired

A team was chasing latency on a customer-facing API. They had read that increasing sort_buffer_size and join_buffer_size could speed up complex queries. They bumped both substantially, tested with a single-user benchmark, and it looked great. The change sailed through.

Under real production concurrency, it was a different story. Multiple app workers ran similar reporting queries at the same time, each allocating larger buffers. Memory consumption didn’t just rise—it rose like a staircase when concurrency increased. They didn’t hit a performance wall; they hit the OOM killer.

The nasty part was that the crash was intermittent. Sometimes the load pattern aligned and the buffers were allocated concurrently; sometimes not. That made it look like a “random crash” or a “leak.” They spent days staring at graphs and blaming the storage layer.

The fix was humbling: revert buffer sizes closer to defaults, add an index that removed the sort, and limit concurrency at the app (pool) so that heavy queries could not stampede. Performance improved overall because the database stopped thrashing and the app stopped retrying. The moral wasn’t “never tune buffers.” It was “never tune per-thread buffers using a single-thread benchmark and call it done.”

Mini-story #3: the boring but correct practice that saved the day

A different org ran many small MySQL/MariaDB instances, each attached to a service. Nothing fancy. The trick was process: every instance had a lightweight “capacity sheet” checked into the repo alongside config. It listed RAM, buffer pool size, per-thread buffer assumptions, and the resulting safe connection cap. It wasn’t perfect. It was consistent.

One night, a traffic spike hit a low-priority service that shared a node with other things. The service’s pool misbehaved and tried to open far more connections than normal. The database refused once it hit max_connections, and the app started failing requests. Annoying, visible, and recoverable.

Because the database didn’t OOM, other services on the host stayed alive. The on-call had time to respond like a human: reduce pool size, restart a few workers, and roll out a config fix. No filesystem recovery. No crash loops. No corrupted state.

That was the point: a clean refusal at the DB boundary is often better than letting the OS shoot the database. The boring practice—documented limits and conservative caps—turned a potential multi-service incident into a single-service hiccup.

Common mistakes: symptom → root cause → fix

1) Symptom: “Too many connections” errors, followed by OOM after you raise max_connections

Root cause: Connection pile-up from slow queries or pool misbehavior; raising max_connections increases memory exposure.

Fix: Keep max_connections within memory budget. Implement pooling with a strict cap. Find and fix the slow query or lock contention that caused pile-up.

2) Symptom: Hundreds of sleeping connections; memory still high

Root cause: Long wait_timeout plus oversized pool(s), plus thread/session overhead.

Fix: Lower wait_timeout (for application users), set pool max size per instance, and reduce idle connections. If you need many long-lived sessions, resize the server.

3) Symptom: OOM happens only during reporting jobs or end-of-month batch

Root cause: Concurrency of memory-heavy queries (sorts, joins, temp tables) triggers per-thread spikes.

Fix: Tune queries and indexes to reduce sorts/temp tables, cap concurrency for those jobs, and keep per-thread buffers conservative. Consider running batch on a separate replica.

4) Symptom: OOM happens after “performance tuning” with big buffers

Root cause: Per-thread buffer sizes multiplied by active sessions exceed RAM.

Fix: Revert to sane buffer sizes; measure with production-like concurrency. Prefer query/index fixes over buffer inflation.

5) Symptom: DB restarts in a loop; logs show nothing obvious

Root cause: Killed by OOM before it can flush useful logs, or killed by cgroup/container memory limits.

Fix: Check kernel logs and cgroup limits. Reduce memory footprint and/or raise container memory limit. Add swap where appropriate.

6) Symptom: Memory usage grows slowly over days; occasional crashes

Root cause: Often not a leak, but workload shifts: more concurrent sessions, bigger result sets, increased temp table usage, or fragmentation pressure.

Fix: Track connection count distribution, slow query logs, and temp table metrics. Apply hard concurrency caps and fix the workload drivers.

Checklists / step-by-step plan

Stabilize now (same day)

  1. Confirm OOM killer via kernel logs. If it’s OOM, stop debating.
  2. Lower max_connections to a safe number based on current RAM and buffer sizes. Yes, users will see errors. Better errors than data recovery.
  3. Reduce per-thread buffers if they were inflated (sort_buffer_size, join_buffer_size, read buffers).
  4. Lower wait_timeout for app users to prevent connection hoarding.
  5. Enable or verify connection pooling and set hard per-instance caps.
  6. Kill or reschedule the worst queries only if you understand the blast radius (reporting queries are usually safe to stop; migrations/DDL are trickier).

Make it robust (this week)

  1. Capture baseline metrics: max used connections, active vs sleep counts, temp tables, slow query counts.
  2. Turn on slow query logging with a reasonable threshold for your workload and review the top offenders.
  3. Add missing indexes for the queries that cause sorts and temp tables.
  4. Set a “capacity contract” in config: documented buffer pool size, per-thread buffers, and safe connection cap.
  5. Decide on swap: small swap on small servers often increases survivability. Monitor swap-in; don’t let it become permanent.

Prevent recurrence (this quarter)

  1. Load test with production-like concurrency, not a single connection benchmark.
  2. Separate OLTP and reporting: replicas, dedicated instances, or at least concurrency limits on reporting.
  3. Automate guardrails: alerts on Threads_connected, Max_used_connections approaching max, and memory pressure.
  4. Review pool settings with application teams like you review firewall rules: explicitly and regularly.

FAQ

1) Should I set max_connections to the number of app threads?

No. You should set app pool max sizes so that the sum of all pools stays below a safe DB cap, with headroom for admin and batch tasks.

2) Is MariaDB safer than MySQL for memory?

Not inherently. Both can OOM the same way: too many threads and too much per-thread memory. Safety comes from sizing and limits, not the logo.

3) Why does the DB OOM when most connections are sleeping?

Sleeping connections still consume per-session memory and threads. Also, your “sleeping” population can hide a smaller number of active connections doing memory-heavy work that triggers the spike.

4) If I lower max_connections, won’t I just get more errors?

Yes, but they are controlled errors. A capped DB fails fast and predictably. An OOM-killed DB fails everything, including recovery tasks, and it may trigger cascading retries.

5) Is adding swap a good idea for databases?

On small servers, often yes—if done intentionally. A small swap can absorb short spikes and prevent OOM kills. If you see sustained swap-in under normal load, you’re underprovisioned or misconfigured.

6) Which variables are the biggest “per-connection memory multipliers” to watch?

Common culprits are sort_buffer_size, join_buffer_size, and temp table memory limits (tmp_table_size/max_heap_table_size). These don’t always allocate fully, but they define worst-case behavior under concurrency.

7) What’s the best fix for too many connections: pooling or raising max_connections?

Pooling, almost always. Raising max_connections is sometimes appropriate after you’ve done memory math and proven you have headroom, but it’s not the first move on small hardware.

8) How do I know if slow queries are causing the connection pile-up?

Look at processlist TIME/STATE, slow query logs, and whether active queries cluster around temp tables, sorts, or lock waits. Connection count rising with latency is the giveaway.

9) Should I shrink innodb_buffer_pool_size to stop OOM?

If the buffer pool is crowding out everything else on a small server, yes. Performance might drop, but a stable database is faster than a dead one. Then fix queries and workload to regain performance.

10) Can I “just put the database in a container” to solve this?

Containers don’t solve memory math. They enforce it. If you set a low memory limit, you’ll get faster kills unless you also size buffers and connection caps for that limit.

Conclusion: next steps you can do this week

If your small server is crashing, stop treating max_connections as a customer happiness dial. It’s a memory commitment. Make it a conservative one.

Do these next steps in order:

  1. Prove it’s OOM with kernel logs and measure RSS/threads when the DB is alive.
  2. Right-size global memory (especially innodb_buffer_pool_size) so the OS and connections have breathing room.
  3. Cap max_connections to what the server can actually support, not what your app can spam.
  4. Keep per-thread buffers conservative unless you have strong evidence and production-like concurrency tests.
  5. Fix the connection pattern with pooling and sane timeouts. Most “DB capacity” problems are really “app behavior” problems wearing a database costume.
  6. Eliminate the slow queries that cause pile-ups and temp-table storms. Fewer concurrent queries is the cheapest memory upgrade.

When the next spike hits, you want the database to say “no” early and clearly, not to get shot by the kernel. That’s not pessimism. That’s operations.
