It’s 02:13. Your API is “up” but users are watching spinners like it’s a meditation app. Then the logs land the punchline: ERROR 1040 (08004): Too many connections. MySQL didn’t crash. It just stopped letting anyone in.
The tempting fix is to crank max_connections to a heroic number and call it done. That’s how you turn a connection problem into a memory problem, and then into a paging problem, and then into a career-reflection problem. Let’s fix it properly on Ubuntu 24.04: find the real bottleneck, stop the bleeding, and increase concurrency only where it’s safe—without making the database slower.
What “too many connections” actually means (and what it doesn’t)
MySQL throws “too many connections” when it can’t accept a new client connection because it hit an internal ceiling. Most commonly, that ceiling is max_connections. Sometimes it’s a different wall: OS file descriptor limits, systemd service limits, or resource exhaustion that makes MySQL effectively unable to create more sessions.
Here’s the part people miss: “too many connections” is rarely caused by “too much traffic” in the simplistic sense. It’s usually caused by connections not going away fast enough. That can be slow queries, locked transactions, overloaded CPU, saturated I/O, a mis-sized buffer pool causing constant disk reads, or application code that leaks connections. The database is telling you it’s drowning in concurrency. Your job is to figure out whether the water is coming from too many faucets or a clogged drain.
One opinion that will save you time: raising max_connections is valid only after you have evidence that the server has headroom (memory, CPU, I/O) and that the application uses connections responsibly (pooling, timeouts, sane transaction scope). If you raise it blindly, you often convert a sharp error into slow misery.
Joke #1 (short, relevant): A MySQL connection leak is like a kitchen faucet drip—nobody cares until the water bill starts paging you.
Fast diagnosis playbook: first/second/third checks
This is the “I’m on call, my coffee is cold, give me signal” sequence. Do it in order. Each step narrows the bottleneck class quickly.
First: is it a connection leak or a throughput bottleneck?
- Check current threads and states. If most are Sleep, you likely have app-side pooling/leak issues or overly long idle timeouts. If most are Query or Locked, it’s throughput/locking.
- Check whether connections are climbing. A steady climb suggests a leak. A sudden spike suggests a traffic burst or retry storm.
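A minimal way to grab both signals, assuming root access over the local socket (the 5-second interval is arbitrary):
cr0x@server:~$ sudo mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected','Threads_running');"
cr0x@server:~$ watch -n 5 "sudo mysql -NBe \"SHOW GLOBAL STATUS LIKE 'Threads_connected';\""
If Threads_connected is high while Threads_running stays small, most sessions are idle; if the number climbs steadily between samples, suspect a leak.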
Second: what’s blocking progress—CPU, I/O, locks, or memory?
- CPU pegged + many running queries: too much parallelism, bad query plans, or missing indexes.
- Disk I/O saturated + high read latency: buffer pool too small, random I/O from poor indexes, or slow storage.
- Locks/waits dominate: long transactions, hot rows, or schema design causing contention.
- Memory pressure/swap: MySQL allowed too many per-connection buffers, too many connections, or OS is reclaiming aggressively.
Third: are you hitting a limit that is not max_connections?
- File descriptors too low: can’t open more sockets or tables.
- systemd limits: the service’s LimitNOFILE overrides what you think you set.
- Network backlog issues: SYN queue overflow under bursts (less common, but shows up as connect timeouts rather than 1040).
By the end of this playbook, you should be able to say one sentence that starts with: “Connections are piling up because…” If you can’t, you’re guessing. And guessing is how “quick fixes” become architecture.
Interesting facts and context (why this keeps happening)
- Fact 1: MySQL historically used a “one thread per connection” model by default, which made max_connections a direct proxy for thread count and memory pressure.
- Fact 2: Many per-connection buffers (sort_buffer_size, join_buffer_size, read_buffer_size) are allocated per session and can explode memory when concurrency spikes.
- Fact 3: The default wait_timeout is traditionally long enough that idle sessions can hang around for hours in poorly pooled apps.
- Fact 4: The “thundering herd” effect—many clients retrying at once—turns a small blip into a connection storm. This shows up a lot with aggressive HTTP retry middleware.
- Fact 5: InnoDB’s buffer pool was introduced to reduce disk I/O by caching pages; an undersized buffer pool often looks like “too many connections” because queries stall and sessions pile up.
- Fact 6: MySQL’s thread_cache_size matters more than people think: creating/destroying threads under spiky workloads can add latency and CPU overhead.
- Fact 7: Linux limits (ulimits, file descriptors) are as old as multi-user Unix; they still bite modern databases because the defaults are conservative.
- Fact 8: On systemd-based distros (including Ubuntu 24.04), service-level limits can silently override shell ulimit settings, confusing even experienced operators.
- Fact 9: “Increasing max connections” is a famous anti-pattern because it can increase lock contention and reduce cache hit rates, making each query slower.
One paraphrased idea worth keeping on your wall, attributed to John Ousterhout: complexity is the tax you pay later; the best systems keep the fast path simple.
The concurrency model: connections, threads, and why “more” can be slower
Think of a MySQL server like a restaurant that seats customers (connections) and assigns a waiter (thread). If you add more tables without adding kitchen capacity, you don’t get more food—just more people waiting, more noise, and worse service.
When you raise max_connections, you aren’t just letting more clients in. You’re allowing more concurrent work to be attempted. That changes:
- Memory footprint: each session consumes base memory, plus per-connection buffers, plus temporary tables, plus metadata locks structures.
- CPU scheduling: more threads means more context switching. On a saturated CPU, that can reduce useful work.
- Lock contention: more concurrent transactions increase time spent waiting for locks, especially on hot tables/rows.
- I/O queue depth: more outstanding reads/writes can exceed what your storage can handle, increasing latency for everyone.
The goal is not “accept infinite connections.” The goal is “keep response times stable under peak load.” That often means controlling concurrency: pooling connections in the app, limiting max in the pool, using short transactions, and tuning MySQL so it can sustain the real workload.
Joke #2 (short, relevant): The easiest way to handle 10,000 MySQL connections is to not have 10,000 MySQL connections.
Hands-on tasks (commands, expected output, and decisions)
These are production-grade checks. Run them during an incident or in a calm maintenance window. Each task includes: command, what typical output means, and the decision you make.
Task 1: Confirm the exact error rate and where it’s coming from
cr0x@server:~$ sudo journalctl -u mysql --since "30 min ago" | tail -n 30
Dec 29 02:08:11 db1 mysqld[1327]: [Warning] Aborted connection 18933 to db: 'app' user: 'appuser' host: '10.0.2.41' (Got an error reading communication packets)
Dec 29 02:08:13 db1 mysqld[1327]: [Note] Too many connections
Dec 29 02:08:14 db1 mysqld[1327]: [Note] Too many connections
What it means: MySQL is logging connection pressure, plus aborted connections that can indicate network issues, client timeouts, or server overload.
Decision: If you see many “Too many connections” lines, proceed to measure current sessions and their states. If you mostly see “Aborted connection,” check network, client timeouts, and server CPU/I/O saturation.
Task 2: Check configured and effective connection limits
cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Max_used_connections';"
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 300 |
+-----------------+-------+
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 298 |
+----------------------+-------+
What it means: You’re riding the ceiling. This is not theoretical; you’re hitting it.
Decision: Don’t raise the ceiling yet. First find why sessions aren’t completing. If Max_used_connections is far below max_connections, your error may be a different limit (FDs/systemd) or a proxy layer.
Task 3: See what connections are doing right now
cr0x@server:~$ sudo mysql -e "SHOW FULL PROCESSLIST;" | head -n 25
Id User Host db Command Time State Info
19401 appuser 10.0.2.41:53312 app Sleep 412 NULL
19408 appuser 10.0.2.45:58921 app Query 18 Sending data SELECT ...
19412 appuser 10.0.2.44:51220 app Query 18 Waiting for table metadata lock ALTER TABLE ...
19420 appuser 10.0.2.47:60011 app Sleep 399 NULL
What it means: You have a mix: many long-sleeping sessions (app pooling or leaks) and some active sessions stuck on metadata locks.
Decision: If most are Sleep with long times, reduce idle timeouts and fix pooling. If many are waiting for locks, you have a schema change or long transaction blocking others—address the blocker.
Task 4: Count sessions by state (fast signal)
cr0x@server:~$ sudo mysql -NBe "SELECT COMMAND, STATE, COUNT(*) c FROM information_schema.PROCESSLIST GROUP BY COMMAND, STATE ORDER BY c DESC LIMIT 15;"
Sleep 221
Query Sending data 36
Query Waiting for table metadata lock 18
Query Sorting result 6
What it means: Your “connections problem” is mostly idle sessions plus a real lock contention situation.
Decision: Attack idle session hoarding (timeouts + pool sizing) and unblock the metadata lock (finish/kill DDL, avoid DDL in peak).
Task 5: Identify the metadata lock blocker
cr0x@server:~$ sudo mysql -e "SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, THREAD_ID FROM performance_schema.metadata_locks WHERE LOCK_STATUS='PENDING' LIMIT 10;"
+---------------+-------------+-----------+-------------+----------+
| OBJECT_SCHEMA | OBJECT_NAME | LOCK_TYPE | LOCK_STATUS | THREAD_ID|
+---------------+-------------+-----------+-------------+----------+
| app | orders | EXCLUSIVE | PENDING | 8421 |
+---------------+-------------+-----------+-------------+----------+
cr0x@server:~$ sudo mysql -e "SELECT * FROM performance_schema.threads WHERE THREAD_ID=8421\G" | head -n 20
*************************** 1. row ***************************
THREAD_ID: 8421
NAME: thread/sql/one_connection
PROCESSLIST_ID: 19412
PROCESSLIST_USER: appuser
PROCESSLIST_HOST: 10.0.2.44:51220
PROCESSLIST_DB: app
PROCESSLIST_COMMAND: Query
PROCESSLIST_TIME: 18
What it means: You’ve got DDL needing an exclusive metadata lock, and other sessions may be holding conflicting locks.
Decision: Find who holds the lock (granted locks on same object), and decide whether to wait, kill the blocker, or postpone DDL. In production, “postpone DDL” is usually the grown-up move.
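If the sys schema is installed (it ships with MySQL 8.0 packages), it can pair waiters with blockers for you. A sketch; verify the view and column names on your version:
cr0x@server:~$ sudo mysql -e "SELECT object_schema, object_name, waiting_pid, blocking_pid, sql_kill_blocking_query FROM sys.schema_table_lock_waits\G"
Here blocking_pid is the processlist ID holding the conflicting lock, and sql_kill_blocking_query is a ready-made KILL statement if you decide to go that route.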
Task 6: Check for long transactions that keep locks alive
cr0x@server:~$ sudo mysql -e "SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5\G" | sed -n '1,40p'
*************************** 1. row ***************************
trx_id: 42119291
trx_started: 2025-12-29 01:58:03
trx_mysql_thread_id: 19321
trx_query: UPDATE orders SET status='paid' WHERE ...
What it means: A transaction has been open for ~10 minutes. That’s an eternity in OLTP land. It can hold row locks and prevent purge, causing cascading stalls.
Decision: If it’s accidental (stuck worker, bad code path), kill it and fix the application transaction scope. If it’s intentional (batch job), schedule it off-peak or refactor to chunked commits.
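If you choose to kill it, the processlist ID is trx_mysql_thread_id from the output above (19321 in this example). Keep in mind that killing a large transaction triggers a rollback, which can itself take a while:
cr0x@server:~$ sudo mysql -e "KILL 19321;"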
Task 7: Check whether you’re CPU-bound
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0-xx-generic (db1) 12/29/2025 _x86_64_ (16 CPU)
02:12:10 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:12:11 PM all 78.2 0.0 9.1 1.3 0.0 0.8 0.0 0.0 0.0 10.6
What it means: CPU is heavily utilized. If this stays pegged, adding connections will usually increase latency, not throughput.
Decision: If CPU-bound, prioritize query optimization and concurrency limiting (pool caps) over raising max_connections.
Task 8: Check for disk pressure and latency
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
62.11 0.00 7.40 7.55 0.00 22.94
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 920.0 48200.0 12.0 1.3 9.80 52.4 240.0 19320.0 14.20 6.12 96.3
What it means: Storage is near saturation (%util ~96%), with noticeable await times. Queries will stall, and connections will pile up.
Decision: If I/O-bound, raise buffer pool (if memory allows), reduce random reads via indexes, and avoid increasing concurrency until latency improves.
Task 9: Check for swapping (the silent performance killer)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 64Gi 58Gi 1.2Gi 612Mi 4.8Gi 2.1Gi
Swap: 8.0Gi 2.9Gi 5.1Gi
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
12 3 3024896 128992 98112 3621120 64 120 980 410 9200 18000 72 8 10 10
What it means: You’re swapping in/out (si/so), which will spike query latency and cause connection backlogs.
Decision: Reduce memory usage (often by lowering per-connection buffers and/or max_connections), and size buffer pool intelligently. Swapping plus “raise max_connections” is a self-own.
Task 10: Verify OS file descriptor limits for MySQL (systemd matters)
cr0x@server:~$ sudo systemctl show mysql -p LimitNOFILE -p LimitNPROC
LimitNOFILE=1048576
LimitNPROC=15238
cr0x@server:~$ sudo cat /proc/$(pidof mysqld)/limits | egrep "Max open files|Max processes"
Max open files 1048576 1048576 files
Max processes 15238 15238 processes
What it means: Your service has high FD limits. If this were low (like 1024/4096), it could masquerade as a connection issue or cause odd failures under load.
Decision: If low, set a systemd override for mysql to raise it. If already high, don’t blame ulimit; move on.
Task 11: Check table cache pressure (can drive latency and stalls)
cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'table_open_cache'; SHOW STATUS LIKE 'Opened_tables'; SHOW STATUS LIKE 'Open_tables';"
+------------------+--------+
| Variable_name | Value |
+------------------+--------+
| table_open_cache | 4000 |
+------------------+--------+
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| Opened_tables | 812349 |
+---------------+--------+
+-------------+------+
| Variable_name | Value |
+-------------+------+
| Open_tables | 3998 |
+-------------+------+
What it means: Opened_tables is huge and still climbing, while Open_tables is pegged near the cache size. MySQL is constantly opening tables—extra overhead under load.
Decision: Increase table_open_cache if memory allows, and verify you have adequate file descriptors. This is a “small cuts become hemorrhage” issue.
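If you decide to raise it, table_open_cache is dynamic; a hedged example on MySQL 8.0, with an illustrative value rather than a recommendation:
cr0x@server:~$ sudo mysql -e "SET PERSIST table_open_cache = 8000;"
Then watch whether Opened_tables keeps climbing at the same rate; if it does, the table cache wasn’t the real bottleneck.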
Task 12: Find the worst query patterns that hold connections open
cr0x@server:~$ sudo mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1000000000000 AS total_s, AVG_TIMER_WAIT/1000000000000 AS avg_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G" | sed -n '1,80p'
*************************** 1. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 1203912
total_s: 84231.1134
avg_s: 0.0699
What it means: You have high total time tied up in a few statement patterns. Even if average latency is “fine,” the total time indicates heavy load and connection occupancy.
Decision: Optimize the hottest statement digests (indexes, covering indexes, reducing selected columns, better pagination patterns). This reduces time-per-connection and fixes the symptom at the source.
Task 13: Check thread cache effectiveness (cheap win for spiky workloads)
cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'thread_cache_size'; SHOW STATUS LIKE 'Threads_created'; SHOW STATUS LIKE 'Connections';"
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| thread_cache_size | 8 |
+-------------------+-------+
+-----------------+--------+
| Variable_name | Value |
+-----------------+--------+
| Threads_created | 238912 |
+-----------------+--------+
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| Connections | 9823812 |
+---------------+---------+
What it means: A small thread cache combined with a large Threads_created count suggests overhead from thread churn.
Decision: Raise thread_cache_size to reduce CPU overhead under connection churn. This doesn’t make queries slower; it reduces overhead around them.
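A sketch of both the change and the check, assuming MySQL 8.0 (the value 64 is illustrative):
cr0x@server:~$ sudo mysql -e "SET PERSIST thread_cache_size = 64;"
cr0x@server:~$ sudo mysql -NBe "SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Threads_created') / (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Connections') AS thread_cache_miss_rate;"
A miss rate that trends toward zero after the change means the cache is absorbing the churn.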
Task 14: Confirm buffer pool sizing and hit rate signals
cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+------------+
| Variable_name | Value |
+-------------------------+------------+
| innodb_buffer_pool_size | 8589934592 |
+-------------------------+------------+
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| Innodb_buffer_pool_reads| 832912311 |
+-------------------------+-----------+
+----------------------------------+------------+
| Variable_name | Value |
+----------------------------------+------------+
| Innodb_buffer_pool_read_requests | 3412981123 |
+----------------------------------+------------+
What it means: A lot of physical reads relative to requests suggests a cache that’s not keeping up (rough signal; don’t worship one ratio).
Decision: If you have memory headroom and are I/O-bound, increase buffer pool. If you’re swapping, do the opposite: reduce other memory consumers first.
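If you do grow it, innodb_buffer_pool_size can be resized online on MySQL 5.7+; it changes in multiples of innodb_buffer_pool_chunk_size, so the final value may be rounded. A hedged example (24 GiB is illustrative; leave room for the OS and per-connection memory):
cr0x@server:~$ sudo mysql -e "SET PERSIST innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;"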
Fixes that don’t slow the DB
Here’s the practical hierarchy: fix the application’s connection behavior, fix query time, fix lock behavior, then size MySQL limits to match reality. If you do it in reverse, you’re masking the problem until it returns louder.
1) Fix connection hoarding: pooling, caps, and timeouts
Do: Use a connection pool per app instance, set a hard cap, and set sane timeouts.
- Pool size: start smaller than you think. If you have 50 app instances and each pool is 50, that’s 2500 potential connections before the database even gets a vote.
- Connection acquisition timeout: force fast failure in the app rather than infinite waiting. Infinite waiting just holds more sockets open.
- Idle timeout: close idle connections so you don’t spend your connection budget on sleeping sessions.
Avoid: “One connection per request” without pooling. It’s a thread churn machine and turns spikes into storms.
MySQL-side, two knobs help reduce sleeping zombies:
- wait_timeout (non-interactive sessions)
- interactive_timeout (interactive sessions)
Opinionated guidance: In typical web app environments, a wait_timeout of 60–300 seconds is reasonable if the app pool is healthy. If your app can’t tolerate that, the app is the bug, not the database.
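A hedged example applying both timeouts at 300 seconds on MySQL 8.0. Note that the global value only applies to new sessions; existing connections keep the value they started with:
cr0x@server:~$ sudo mysql -e "SET PERSIST wait_timeout = 300;"
cr0x@server:~$ sudo mysql -e "SET PERSIST interactive_timeout = 300;"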
2) Unstick lock pileups: keep transactions short and schedule DDL
Connection errors often follow lock jams. Sessions wait; new sessions arrive; eventually you hit the ceiling. Fixing lock jams is one of the few ways to resolve “too many connections” without adding capacity.
- Keep transactions short: no user think-time inside a transaction, no long loops, no “read 10k rows then decide.”
- DDL discipline: don’t run schema changes during peak, especially not ones that can take metadata locks for long periods.
- Chunk batch jobs: commit frequently, use bounded batches, and back off when the system is hot.
3) Reduce query time: because time is what holds connections open
Connection count is basically “arrival rate × time in system.” You can’t always control arrival rate. You can reduce time in system.
- Index the actual predicates and orderings used by the hot digests.
- Prefer covering indexes for common list endpoints to avoid extra page reads.
- Stop selecting columns you don’t use. Yes, even if “it’s just JSON.” It’s memory, CPU, and network.
- Kill “OFFSET pagination” at scale. It gets slower as the offset grows. Use keyset pagination when possible.
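As an illustration only, using the hot digest from Task 12 (column names beyond user_id and created_at are assumed, and id is assumed to be the primary key), a composite index plus keyset pagination could look like this; confirm with EXPLAIN against your real schema before rolling it out:
-- assumed schema: orders(id PRIMARY KEY, user_id, status, created_at, ...)
ALTER TABLE orders ADD INDEX idx_user_created (user_id, created_at, id);
-- keyset pagination: pass the last (created_at, id) seen instead of a growing OFFSET
SELECT id, status, created_at
  FROM orders
 WHERE user_id = ?
   AND (created_at, id) < (?, ?)
 ORDER BY created_at DESC, id DESC
 LIMIT 50;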
4) Make max_connections a calculated decision, not a wish
If you’ve verified you’re not locked, not swapping, and not I/O-saturated, raising max_connections can be appropriate. But size it with math and guardrails.
Rule of thumb (use with judgment):
- Baseline memory: InnoDB buffer pool + MySQL global buffers + OS overhead.
- Per-connection memory: can vary wildly depending on your settings and workload. The dangerous part is that many per-thread buffers are allocated on demand; worst-case is ugly.
Practical approach:
- Measure memory use at current peak connections (RSS of mysqld).
- Estimate additional memory per extra 50–100 connections from observed deltas, not guesses.
- Increase in small steps (e.g., +20%); observe swap, latency, and lock waits.
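Measuring the observed delta is one command plus one query; a sketch assuming a single mysqld process:
cr0x@server:~$ ps -o rss= -p "$(pidof mysqld)" | awk '{printf "mysqld RSS: %.1f GiB\n", $1/1048576}'
cr0x@server:~$ sudo mysql -e "SHOW STATUS LIKE 'Threads_connected';"
Record both at peak, repeat after load changes, and divide the RSS delta by the connection delta for a rough per-connection cost.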
5) Tune per-connection buffers conservatively
This is how you avoid “we raised max connections and the box started swapping.” Many MySQL defaults are not insane, but people copy-paste “performance” configs that set big buffers, forgetting they multiply by active sessions.
Settings to treat with skepticism in high-concurrency systems:
- sort_buffer_size
- join_buffer_size
- read_buffer_size
- read_rnd_buffer_size
- tmp_table_size / max_heap_table_size (affects in-memory temp tables)
Opinionated guidance: Keep these relatively small until you have evidence a particular workload benefits. Most OLTP workloads gain more from indexes and buffer pool than from giant per-thread buffers.
6) Thread cache: reduce overhead, especially during churn
Spiky workloads create connection churn. Churn creates thread creation/destruction overhead. A decent thread_cache_size helps absorb spikes without wasting CPU.
What to do: Increase thread_cache_size until Threads_created stops climbing rapidly relative to Connections. Don’t set it to 10,000 “just in case.” Cache enough to cover spikes.
7) Use pooling at the right layer (app first, proxy second)
App-side pooling is usually best because it preserves request context and backpressure. But there are cases where you need a dedicated pooling proxy:
- Many short-lived clients (serverless, cron bursts, CI jobs).
- Legacy apps that open too many connections and can’t be fixed quickly.
- Multi-tenant environments where you need hard governance.
Beware: transaction-level pooling can break session assumptions (temporary tables, session variables). If your app uses those, you need session pooling or code changes.
8) Ubuntu 24.04 specifics: systemd overrides and making them stick
On Ubuntu 24.04, the most common “we set ulimit but it didn’t work” story is systemd. The service unit has its own limits.
To set a persistent FD limit for MySQL:
cr0x@server:~$ sudo systemctl edit mysql
# (opens an editor)
cr0x@server:~$ sudo cat /etc/systemd/system/mysql.service.d/override.conf
[Service]
LimitNOFILE=1048576
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart mysql
What it means: MySQL will now inherit the FD limit at service start.
Decision: Do this only if you’ve confirmed FD limits are actually part of the issue. Raising limits doesn’t fix slow queries; it just lets you fail later.
9) Add backpressure: fail fast upstream instead of melting down downstream
If the DB is the shared critical resource, your app should protect it. Backpressure strategies that keep systems responsive:
- Cap pool size per app instance.
- Short acquisition timeout (e.g., 100–500ms depending on SLO).
- Queue requests in the app with bounded queues; reject beyond that.
- Disable aggressive retries on DB connection failures; use exponential backoff with jitter.
10) Scale the right thing: read replicas, sharding, or just a bigger box
If you’ve done the hygiene and you’re still hitting connection ceilings because the workload is real, scale deliberately:
- Read replicas for read-heavy workloads (move reporting/list endpoints, not critical writes).
- Bigger host if you’re memory/I/O constrained and can scale vertically quickly.
- Sharding if write scaling is the problem and the data model supports it—this is not a “Tuesday afternoon fix.”
Common mistakes: symptom → root cause → fix
1) Symptom: lots of “Sleep” sessions, maxed connections, but CPU is not high
Root cause: application pools too large, leaked connections, or long wait_timeout keeping idle sessions alive.
Fix: cap pool sizes, ensure connections are returned, reduce wait_timeout to a few minutes, and set client-side idle timeouts.
2) Symptom: many sessions “Waiting for table metadata lock”
Root cause: online schema change/DDL or long transaction blocking metadata locks.
Fix: schedule DDL off-peak, use online schema change tools appropriately, and kill/avoid long transactions that hold locks. Identify blocker via performance_schema and act.
3) Symptom: “too many connections” appears together with swap activity
Root cause: memory overcommit, often due to per-connection buffers and excessive concurrency.
Fix: reduce per-connection buffers, reduce connection caps, right-size buffer pool, and stop swapping before raising max_connections.
4) Symptom: connections spike during an outage, then never recover cleanly
Root cause: retry storms, clients reconnecting aggressively, or load balancer health checks opening sessions.
Fix: exponential backoff with jitter, circuit breakers, rate-limit reconnects, and ensure health checks don’t authenticate to MySQL repeatedly.
5) Symptom: “Aborted connection… error reading communication packets” grows during peak
Root cause: clients timing out due to server slowness, packet loss, or overloaded network stack.
Fix: fix the underlying latency (CPU/I/O/locks), validate network stability, and align client/server timeouts so clients don’t churn unnecessarily.
6) Symptom: raising max_connections “fixes” the error but latency doubles
Root cause: you increased contention and memory pressure; you didn’t increase throughput.
Fix: revert or reduce, implement pooling caps, optimize top queries, and scale capacity (CPU/I/O) if needed.
7) Symptom: errors persist even though max_connections looks high
Root cause: you’re hitting OS/service limits (FDs), or a proxy has its own connection cap.
Fix: check systemd LimitNOFILE, verify mysqld limits via /proc, and audit intermediate layers for caps.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a tidy mental model: “We run 20 app pods, so 20 connections.” Neat. Comforting. Wrong.
The real setup included a background worker deployment, a metrics collector doing periodic checks, an admin UI, and a migration job that “just runs sometimes.” Each component had its own connection pool defaults. The web pods alone were fine. The workers, however, scaled up during backlog events—exactly when the database was already busy.
The first “too many connections” incident happened during a marketing campaign. They increased max_connections, restarted MySQL, and watched the error disappear. For about an hour. Then the DB slowed, timeouts increased, and the app retried harder. Connections climbed again, now with more lock waits and more swapping.
The post-incident lesson wasn’t “set max_connections higher.” It was “count connections from the entire fleet, including the stuff nobody thinks about.” They fixed it by capping every pool, adding an acquisition timeout, and teaching the worker system to back off when DB latency rises.
Mini-story 2: The optimization that backfired
A team wanted to reduce query latency. Someone increased join_buffer_size and sort_buffer_size aggressively because a blog said it helps. It did help—on a staging box running one query at a time.
In production, concurrency was the whole game. At peak, hundreds of sessions were active. Those larger buffers weren’t always allocated, but when certain endpoints hit complex queries, memory usage climbed fast. The OS started swapping under burst traffic. Once swapping started, queries slowed. As queries slowed, connections stayed open longer. As connections stayed open longer, the server ran out of connections. The original symptom? “Too many connections.”
It looked like a connection limit problem, but it was a per-connection memory amplification problem. The fix was boring: revert the oversized per-thread buffers, add the right indexes, and increase buffer pool within safe memory. Their latency got better, not worse, and the connection limit stopped being a daily drama.
Mini-story 3: The boring but correct practice that saved the day
Another org had a habit that wasn’t glamorous: every service had a documented “DB budget.” A maximum pool size per instance. A maximum number of instances allowed before a scaling review. A standard acquisition timeout and retry policy.
During a regional failover test, traffic shifted abruptly. Load doubled. The database got hot but didn’t fall over. Instead of spawning infinite connections, the apps queued a bit, rejected excess requests quickly, and recovered when the initial spike passed.
They still saw elevated DB CPU. They still had to tune a couple of indexes. But they didn’t get the catastrophic “too many connections” outage that tends to trigger chaos changes in the middle of the night.
The best part: the incident report was short. Boring systems are reliable systems, and reliable systems are the ones you can sleep near.
Checklists / step-by-step plan
Step-by-step: stop the bleeding during an incident (15–30 minutes)
- Verify what’s saturated: run Task 7 (CPU), Task 8 (I/O), Task 9 (swap).
- Identify connection states: Task 4 to see Sleep vs Query vs Locked.
- If lock waits dominate: Task 5 and Task 6 to find blockers. Decide: wait, kill, or postpone DDL/batch.
- If Sleep dominates: reduce app pool size (fastest lever), then consider lowering wait_timeout after verifying impact.
- If I/O bound: stop increasing concurrency. Reduce load (rate limit, disable expensive endpoints), then tune buffer pool/indexes.
- If swapping: reduce memory usage immediately (lower concurrency/pools, revert oversized per-thread buffers if applicable). Swapping is a “drop everything” signal.
- Only then, if you have headroom and need it, increase max_connections modestly as a temporary mitigation.
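If you take that last step, a sketch of a modest bump (360 follows the +20% guidance from earlier, starting from the 300 seen in Task 2; SET GLOBAL rather than PERSIST, so it doesn’t silently outlive the incident):
cr0x@server:~$ sudo mysql -e "SET GLOBAL max_connections = 360;"
MySQL also reserves one extra connection slot for a user with the CONNECTION_ADMIN (or legacy SUPER) privilege, which is how you get in to run this even while the ceiling is being hit.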
Step-by-step: permanent fix (one sprint, not one panic)
- Inventory all clients: web, workers, cron, admin tools, metrics, ETL. Document expected max connections per component.
- Implement connection pooling correctly: cap pools, set acquisition timeouts, and ensure connections are returned.
- Fix hot queries: use performance_schema digest summaries (Task 12), then add/adjust indexes and reduce data fetched.
- Fix lock behavior: shorten transactions, chunk batch jobs, and schedule DDL.
- Calibrate MySQL limits: set max_connections based on measured headroom; tune thread_cache_size and table_open_cache.
- Validate OS/service limits: ensure systemd limits match your design (Task 10).
- Load test with concurrency: not just QPS. Verify latency and error rates under spike patterns.
- Set alerts that prevent surprise: alert on rising Threads_connected, lock waits, swap, and rising aborted connections.
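Until real dashboards exist, even a crude poll of the key counters shows the trend; a minimal sketch:
cr0x@server:~$ sudo mysql -NBe "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected','Threads_running','Max_used_connections','Aborted_clients','Aborted_connects');"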
What to avoid (because you’ll regret it)
- Setting max_connections to a huge number “so it never happens again.” It will happen again, just slower and more expensively.
- Copy-pasting “high performance” my.cnf templates with massive per-thread buffers.
- Running schema changes during peak and then acting surprised when metadata locks exist.
- Allowing clients to retry instantly and infinitely. That’s not resilience; it’s a denial-of-service feature.
FAQ
1) Should I just increase max_connections?
Only if you’ve confirmed you have memory headroom (no swapping), and the server isn’t already CPU/I/O saturated. Otherwise you’ll trade hard errors for slow collapse.
2) Why does the database get slower when I allow more connections?
More sessions mean more contention, more context switching, more buffer churn, and potentially more disk I/O. Throughput doesn’t scale linearly with connections.
3) What’s a good value for wait_timeout?
For typical web apps with pooling, 60–300 seconds is a sane starting range. If you need hours, your app is using MySQL as a session store, and that’s a different conversation.
4) How do I know if it’s an application connection leak?
If Threads_connected climbs steadily and most sessions are Sleep with large times, that’s classic leak/hoarding. Confirm by correlating with app instance counts and pool configs.
5) What if I see many “Waiting for table metadata lock” sessions?
Stop running DDL during peak, find the blocker (often a long transaction), and decide whether to kill it. Then put DDL behind a change window and a process.
6) Do read replicas fix “too many connections”?
They can, if the pressure comes from read traffic and you actually route reads there. They won’t fix write contention, long transactions, or bad pooling.
7) Can OS file descriptor limits cause connection errors?
Yes. If mysqld can’t open more sockets/files, you’ll see weird failures under load. On Ubuntu 24.04, check systemd LimitNOFILE and mysqld’s effective limits.
8) Is lowering per-connection buffers always safe?
It’s usually safer than raising them blindly. Lower buffers can increase temporary disk usage or slow certain queries, so validate against your real workload. But oversized buffers under high concurrency are a common outage trigger.
9) How do I choose between fixing queries and adding hardware?
If you’re CPU/I/O saturated with known bad queries, fix queries first. If you’re well-optimized and still saturated, add hardware or scale out. Measure before spending.
10) What metrics should I alert on to catch this early?
Threads_connected, Max_used_connections, lock wait indicators, aborted connections rate, swap usage, disk await/utilization, and p95/p99 query latency (not just average).
Next steps that won’t hurt later
If you’re seeing “too many connections” on Ubuntu 24.04, treat it like a symptom, not a setting. The fastest stable fix is almost always to reduce connection occupancy: shorten queries, shorten transactions, and stop the app from hoarding sessions.
Do these next, in order:
- Run the fast diagnosis playbook and capture evidence (process states, CPU/I/O/swap).
- Fix the biggest driver: sleeping hoards or lock jams. Don’t debate; measure.
- Cap and tune application pools. Add acquisition timeouts and sane retry backoff.
- Optimize the top statement digests. This reduces time-in-system and makes everything else easier.
- Then—and only then—adjust max_connections and MySQL caches based on observed headroom.
When you fix this properly, the database gets faster under load, not slower. That’s the whole point. The error goes away because the system is healthier, not because you taught it to tolerate more pain.