Fix WordPress MySQL “Server Has Gone Away” and “Too Many Connections”

If your WordPress site randomly throws “MySQL server has gone away” or “Too many connections”, you’re not having a philosophy problem. You’re having a resource accounting problem. Something is killing connections, exhausting them, or holding them too long—usually under a traffic spike, a slow query, or a bad plugin doing the database equivalent of leaving the faucet on.

This guide is written for people who run production systems and need a calm, repeatable way to get from “it’s down” to “it’s stable” without guessing. You’ll get fast triage, hard commands, what the output means, and the decisions you should make next.

What these errors actually mean (and what they don’t)

“MySQL server has gone away”

This message is the application telling you: “I tried to use a MySQL connection, but the server side isn’t there anymore.” That can happen because:

  • The server closed the connection due to idle timeout (wait_timeout).
  • The server crashed or restarted (OOM killer, manual restart, upgrade, panic).
  • A network device dropped the connection (NAT, firewall, load balancer idle timeout).
  • The client sent a packet bigger than allowed (max_allowed_packet).
  • The server hit resource limits and started refusing or aborting work (disk full, table corruption, runaway memory, thread exhaustion).

It is not “WordPress is broken.” WordPress is just the messenger. Don’t shoot it; interrogate it.

“Too many connections”

This is MySQL being blunt: the number of concurrent client connections exceeded max_connections. In WordPress stacks, this usually happens because PHP is opening connections faster than MySQL can finish queries.

Two subtle points matter:

  • Raising max_connections is not a fix by itself. It can buy time, or it can turn “refused connections” into “swap storm” and take the whole host down.
  • Connection count is a symptom. The root is almost always latency: slow queries, locked tables, saturated I/O, CPU pegged, or PHP-FPM spawning too many workers.

One idea often paraphrased from engineering leadership (John Allspaw, among others): reliability is a feature you build, not a wish you make.

Short joke #1: MySQL doesn’t “go away.” It’s more like it rage-quits because someone asked it to do five things at once on a laptop disk.

Fast diagnosis playbook (first/second/third checks)

When pages are erroring, you have one job: find the bottleneck fast, stop the bleeding, then make it boring.

First: identify the failure class in 2 minutes

  1. Is MySQL up and accepting connections? If not, this is crash/restart/OOM/disk.
  2. Are connections saturated? If yes, this is “too many connections,” often driven by slow queries or PHP worker explosion.
  3. Are connections being killed mid-flight? If yes, think network timeouts, MySQL timeouts, max packet, proxy behavior.
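
Each of those questions has a two-second command behind it. A minimal sketch, assuming the systemd unit name and local socket used in the tasks later in this guide:

cr0x@server:~$ systemctl is-active mysql && mysql -e "SHOW GLOBAL STATUS LIKE 'Uptime';"
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Aborted_clients';"

A tiny Uptime means crash/restart territory; Threads_connected pinned near max_connections means saturation; a fast-rising Aborted_clients during the incident points at connections being killed mid-flight.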

Second: decide where the queue is forming

  • PHP side queue: PHP-FPM has a long listen queue, many active workers, slow requests.
  • MySQL side queue: MySQL shows many running threads, long queries, lock waits, buffer pool misses, I/O waits.
  • Network/proxy queue: intermittent resets, NAT timeouts, “gone away” after idle periods.

Third: apply the safest immediate mitigation

  • If MySQL is down: fix disk/OOM, restart, and cap PHP concurrency before you let traffic hit it again.
  • If connections are saturated: temporarily reduce PHP-FPM concurrency; optionally raise max_connections slightly if memory allows; find the slow query pattern.
  • If “gone away” happens after idle: align timeouts across MySQL, PHP, and network devices; consider enabling keepalives or avoiding persistent connections.

Interesting facts & historical context (so the behavior makes sense)

  1. MySQL’s default timeouts were designed for long-lived app servers, not for modern layers of proxies, NAT, and serverless edges that love dropping idle TCP.
  2. In the early LAMP era, persistent connections were trendy because opening TCP + auth was expensive. Today, persistent connections often amplify failure modes under traffic spikes.
  3. “Too many connections” used to be a badge of honor in some orgs—until they discovered each thread costs memory and context switching time.
  4. WordPress has always encouraged a sprawling plugin ecosystem, which is great for features and terrible for query discipline. A single plugin can add 20 queries per page without telling you.
  5. InnoDB became the default engine in MySQL 5.5, and that changed the “right” tuning approach. Old MyISAM-era advice still haunts forums.
  6. Connection storms are a known class of cascading failure: when MySQL slows down, PHP opens more connections, which slows MySQL more, which triggers retries… and now everything is on fire.
  7. Historically, max_allowed_packet was small by default because memory mattered and giant blobs were discouraged. WordPress media and serialized options ignore that history.
  8. Many “server has gone away” incidents are not MySQL bugs; they are infrastructure timeouts (load balancers, firewalls) with defaults like 60 seconds that nobody remembers setting.
  9. Aborted connections are not always bad: they can reflect users closing browsers mid-request. But a spike in aborted connections alongside errors is usually a smoking gun.

Core failure modes behind “gone away” and “too many connections”

1) MySQL is restarting (often OOM or disk)

If mysqld restarts, every active connection dies. WordPress reports “server has gone away” because the socket it had is now a fossil.

Common triggers:

  • Out-of-memory kill: buffer pool too big, too many per-thread buffers, too many connections, or a memory leak in surrounding services.
  • Disk full: binary logs, tmpdir, ibtmp, or slow query logs grow; MySQL starts failing writes; then things cascade.
  • Filesystem latency: I/O stalls lead to hung threads; watchdog restarts; clients time out.

2) Connection starvation from slow queries

WordPress is chatty. Add a few unindexed meta queries, a report plugin, and an over-eager bot crawl, and you can occupy all threads. New connections get refused: “too many connections.”

3) PHP-FPM or web tier is allowed to create too much parallelism

Most WordPress outages blame MySQL, but PHP-FPM often holds the match. If you allow 200 PHP workers and each can open a MySQL connection, you’ve effectively configured a connection flood.

4) Timeouts misaligned across layers

Imagine MySQL wait_timeout is 60 seconds. Your load balancer drops idle TCP at 50 seconds. PHP tries to reuse a connection at 55 seconds. Result: “server has gone away,” but MySQL never actually closed it.
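
Two of the three layers can be checked from the web node itself; the third (the load balancer or firewall idle timeout) lives in that device's configuration. A sketch, with db.internal as a placeholder for your database host:

cr0x@server:~$ mysql -h db.internal -u wpuser -p -e "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout','interactive_timeout');"
cr0x@server:~$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

On most Linux distros tcp_keepalive_time defaults to 7200 seconds, far longer than a typical 50–60 second firewall idle timeout, so keepalives only help once you lower that value (or the client library enables its own keepalive interval).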

5) Oversized packets and weird WordPress data

WordPress stores arrays in wp_options as serialized blobs. Some plugins shove huge payloads there (page builder caches, analytics dumps). If that blob exceeds max_allowed_packet, MySQL drops the connection and you get “server has gone away.”
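
A quick way to see whether anything in wp_options is creeping toward that limit. This assumes the wpdb database name and default wp_ table prefix used in the tasks below; adjust for your install.

cr0x@server:~$ mysql wpdb -e "SELECT option_name, LENGTH(option_value) AS bytes FROM wp_options ORDER BY bytes DESC LIMIT 10;"
cr0x@server:~$ mysql -e "SELECT @@max_allowed_packet;"

If the largest rows are within an order of magnitude of max_allowed_packet, the next big plugin update or page-builder save is a coin flip.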

Practical tasks: commands, outputs, decisions (12+)

These are the moves I expect an on-call to execute during an incident. Each task includes a command, sample output, what it means, and the decision you make.

Task 1: Confirm MySQL is alive and note uptime

cr0x@server:~$ systemctl status mysql --no-pager
● mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2025-12-27 10:11:03 UTC; 3min ago
       Docs: man:mysqld(8)
   Main PID: 1423 (mysqld)
     Status: "Server is operational"

What it means: If uptime is minutes and you didn’t restart it, you have a crash loop or host restart.

Decision: If it recently restarted, immediately check logs for OOM/disk and cap PHP concurrency before traffic re-saturates it.

Task 2: Check MySQL error log for crash signatures

cr0x@server:~$ journalctl -u mysql --since "30 min ago" --no-pager | tail -n 40
Dec 27 10:10:58 server mysqld[1423]: 2025-12-27T10:10:58.221Z 0 [Warning] Aborted connection 891 to db: 'wpdb' user: 'wpuser' host: '10.0.2.15' (Got an error reading communication packets)
Dec 27 10:11:02 server systemd[1]: mysql.service: Main process exited, code=killed, status=9/KILL
Dec 27 10:11:03 server systemd[1]: mysql.service: Scheduled restart job, restart counter is at 1.
Dec 27 10:11:03 server systemd[1]: Started MySQL Community Server.

What it means: Status 9/KILL often points to OOM killer or admin kill.

Decision: Validate OOM and memory sizing; don’t “just restart” repeatedly and hope the database finds inner peace.

Task 3: Confirm OOM killer events

cr0x@server:~$ dmesg -T | egrep -i "oom|killed process|out of memory" | tail -n 20
[Sat Dec 27 10:11:01 2025] Out of memory: Killed process 1423 (mysqld) total-vm:4123988kB, anon-rss:2450180kB, file-rss:0kB, shmem-rss:0kB, UID:110 pgtables:6420kB oom_score_adj:0

What it means: MySQL was killed for memory. “Server has gone away” is the aftertaste.

Decision: Reduce memory footprint (buffer pool, per-thread buffers, connection count), add RAM, or move MySQL off the congested host.

Task 4: Check free disk and inode pressure (yes, inodes)

cr0x@server:~$ df -hT
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4   80G   76G  2.1G  98% /
tmpfs          tmpfs 3.9G   12M  3.9G   1% /run
cr0x@server:~$ df -ihT
Filesystem     Type  Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 ext4     5M   4.9M  120K   98% /

What it means: Nearly full disk or inodes can break temp tables, logs, binlogs, and even socket writes.

Decision: Free space immediately (rotate logs, purge binlogs safely), then fix growth (log retention, monitoring alerts).

Task 5: Measure connection pressure and max_connections headroom

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 198   |
+-------------------+-------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 200   |
+-----------------+-------+

What it means: You’re pinned at the ceiling. MySQL is spending time context switching, and new logins will fail.

Decision: Don’t jump straight to 1000 connections. First reduce upstream concurrency (PHP-FPM), then find what’s slow.

Task 6: Identify who is consuming connections (by user/host)

cr0x@server:~$ mysql -e "SELECT user, host, COUNT(*) AS conns FROM information_schema.processlist GROUP BY user, host ORDER BY conns DESC LIMIT 10;"
+--------+------------+-------+
| user   | host       | conns |
+--------+------------+-------+
| wpuser | 10.0.2.15  | 184   |
| wpuser | 10.0.2.16  | 12    |
| root   | localhost  | 1     |
+--------+------------+-------+

What it means: One web node is overwhelming the database or stuck retrying.

Decision: Rate-limit or drain that node; check its PHP-FPM queue and error logs; verify it’s not looping on DB failures.

Task 7: Spot long-running queries and lock waits

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;" | head -n 30
Id	User	Host	db	Command	Time	State	Info
812	wpuser	10.0.2.15:54012	wpdb	Query	42	statistics	SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts ...
901	wpuser	10.0.2.15:54101	wpdb	Query	39	Waiting for table metadata lock	ALTER TABLE wp_postmeta ADD INDEX meta_key (meta_key)
...

What it means: Long queries and metadata locks can pin threads. That drives connection pileups.

Decision: Kill obviously bad ad-hoc queries, postpone DDL, and fix the query/index. If DDL is blocking, move it to off-hours and use online schema change tooling.
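
If you need to shed the obviously bad ad-hoc queries right now, you can generate KILL statements from the processlist instead of typing IDs by hand. A sketch only: the wpuser filter and the 60-second threshold are illustrative, and you should read the generated list before pasting any of it back into mysql. KILL QUERY aborts just the statement; plain KILL drops the whole connection.

cr0x@server:~$ mysql -N -e "SELECT CONCAT('KILL QUERY ', id, ';') FROM information_schema.processlist WHERE user='wpuser' AND command='Query' AND time > 60;"

This buys breathing room; it does not fix the query that will come right back.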

Task 8: Check InnoDB health and the real “why is it slow” clues

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | egrep -A3 -B2 "LATEST DETECTED DEADLOCK|History list length|buffer pool|I/O|semaphore" | head -n 80
History list length 124873
Pending reads 0
Pending writes: LRU 0, flush list 12, single page 0
Buffer pool size   131072
Free buffers       12
Database pages     130984
Modified db pages  9812

What it means: A big history list length can imply purge lag; lots of modified pages can imply dirty-page flush pressure; a near-zero free buffer count means the buffer pool is fully occupied by the working set.

Decision: If I/O flush is backed up, improve storage latency, tune flush settings, and reduce write bursts (caches, batching). If purge is lagging, investigate long transactions.

Task 9: Validate timeout settings that cause “gone away” after idle

cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout','interactive_timeout','net_read_timeout','net_write_timeout','max_allowed_packet');"
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| interactive_timeout| 28800     |
| max_allowed_packet | 67108864  |
| net_read_timeout   | 30        |
| net_write_timeout  | 60        |
| wait_timeout       | 60        |
+--------------------+-----------+

What it means: A 60-second wait_timeout is aggressive. It can be fine if your app doesn’t reuse idle connections. WordPress stacks sometimes do (depending on client library and persistence).

Decision: Align timeouts: set wait_timeout to something sane (e.g., 300–900) and ensure proxies don’t drop earlier than MySQL or vice versa.
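
A minimal sketch of what “align timeouts” looks like on the MySQL side, assuming a Debian/Ubuntu-style config path. SET GLOBAL affects only new connections, so persist the values in the config file as well or a restart will quietly undo them.

cr0x@server:~$ mysql -e "SET GLOBAL wait_timeout=600; SET GLOBAL interactive_timeout=600;"

# /etc/mysql/mysql.conf.d/mysqld.cnf (excerpt to persist the change; path varies by distro)
[mysqld]
wait_timeout        = 600
interactive_timeout = 600

Pick a value comfortably above the longest idle gap your application actually produces, then make sure no firewall or load balancer in the path drops idle TCP sooner than MySQL does.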

Task 10: Inspect aborted connections counters

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Aborted_%';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 1821  |
| Aborted_connects | 94    |
+------------------+-------+

What it means: Aborted_clients rises when clients disconnect uncleanly; Aborted_connects rises when auth/handshake fails or resource limits block connections.

Decision: If both spike during incidents, correlate with network drops and max_connections saturation. If only Aborted_clients grows, don’t panic—confirm with error rate and timing.

Task 11: Check PHP-FPM concurrency and queueing

cr0x@server:~$ systemctl status php8.2-fpm --no-pager
● php8.2-fpm.service - The PHP 8.2 FastCGI Process Manager
     Active: active (running) since Sat 2025-12-27 09:41:10 UTC; 33min ago
cr0x@server:~$ sudo ss -lntp | egrep "php-fpm|:9000"
LISTEN 0 511 127.0.0.1:9000 0.0.0.0:* users:(("php-fpm8.2",pid=2201,fd=8))

What it means: The listen backlog (here 511) is just capacity for queued requests. The real question is: are workers saturated?

Decision: Check FPM status page or logs; if saturated, lower pm.max_children to match DB capacity, not “number that fits in RAM on a good day.”
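
Two ways to answer “are the workers saturated?”, sketched under the assumption that the pool is named www and listens on 127.0.0.1:9000 as above. The status page requires pm.status_path to be set in the pool config and a FastCGI client such as cgi-fcgi (from libfcgi-bin); the log check needs nothing extra.

cr0x@server:~$ sudo grep -i "max_children" /var/log/php8.2-fpm.log | tail -n 3
cr0x@server:~$ SCRIPT_NAME=/fpm-status SCRIPT_FILENAME=/fpm-status REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000

A “server reached pm.max_children setting” warning in the log, or a listen queue that stays above zero on the status page, means the web tier is queueing, and every queued request is a future MySQL connection.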

Task 12: Confirm Nginx/Apache is retrying upstream too aggressively

cr0x@server:~$ sudo tail -n 30 /var/log/nginx/error.log
2025/12/27 10:12:01 [error] 3112#3112: *9821 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.55, server: example, request: "GET / HTTP/2.0", upstream: "fastcgi://127.0.0.1:9000"
2025/12/27 10:12:03 [error] 3112#3112: *9838 FastCGI sent in stderr: "PHP message: WordPress database error Too many connections for query SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'"

What it means: Slow upstream causes web timeouts; PHP logs show DB errors. Both are symptoms of the same queue.

Decision: Reduce concurrency and fix DB latency. Extending Nginx timeouts without fixing DB is how you create longer outages with more angry users.

Task 13: Find the worst WordPress queries (slow log on-demand)

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log=ON; SET GLOBAL long_query_time=0.5; SHOW VARIABLES LIKE 'slow_query_log_file';"
+---------------------+----------------------------------+
| Variable_name       | Value                            |
+---------------------+----------------------------------+
| slow_query_log_file | /var/log/mysql/mysql-slow.log    |
+---------------------+----------------------------------+
cr0x@server:~$ sudo tail -n 20 /var/log/mysql/mysql-slow.log
# Time: 2025-12-27T10:12:22.114Z
# Query_time: 3.218  Lock_time: 0.004 Rows_sent: 10  Rows_examined: 184221
SET timestamp=1766830342;
SELECT p.ID FROM wp_posts p
JOIN wp_postmeta pm ON pm.post_id=p.ID
WHERE pm.meta_key='some_key' AND pm.meta_value LIKE '%needle%'
ORDER BY p.post_date DESC LIMIT 10;

What it means: Rows examined is huge. That’s your connection killer: each slow query ties up a thread and grows the pile.

Decision: Add/adjust indexes, reduce meta queries, or remove/replace the plugin pattern. Then turn the log back down so it doesn’t fill disk.
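
For a first pass over the log, mysqldumpslow (shipped with MySQL) is enough; the thresholds here are illustrative.

cr0x@server:~$ sudo mysqldumpslow -s t -t 5 /var/log/mysql/mysql-slow.log
cr0x@server:~$ mysql -e "SET GLOBAL long_query_time=2;"

-s t sorts by total time and -t 5 keeps the top five patterns, which is usually enough to name the offending plugin. The second command is the “turn it back down” step; switch slow_query_log off entirely if you don’t normally keep it on.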

Task 14: Verify table/index health quickly

cr0x@server:~$ mysql -e "CHECK TABLE wp_options, wp_posts, wp_postmeta QUICK;"
+------------------+-------+----------+----------+
| Table            | Op    | Msg_type | Msg_text |
+------------------+-------+----------+----------+
| wpdb.wp_options  | check | status   | OK       |
| wpdb.wp_posts    | check | status   | OK       |
| wpdb.wp_postmeta | check | status   | OK       |
+------------------+-------+----------+----------+

What it means: “OK” rules out some corruption scenarios. It does not prove performance is fine.

Decision: If you see corruption, stop ad-hoc repairs during peak unless you know the blast radius. Stabilize, snapshot/backup, then repair deliberately.

Task 15: Quantify memory per connection (rough sizing)

cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('read_buffer_size','read_rnd_buffer_size','sort_buffer_size','join_buffer_size','tmp_table_size','max_heap_table_size');"
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| join_buffer_size     | 262144   |
| max_heap_table_size  | 16777216 |
| read_buffer_size     | 131072   |
| read_rnd_buffer_size | 262144   |
| sort_buffer_size     | 262144   |
| tmp_table_size       | 16777216 |
+----------------------+----------+

What it means: Per-connection buffers can allocate on demand. At high concurrency, “on demand” becomes “all at once.”

Decision: Keep per-thread buffers modest. Don’t crank them up to fix a single query; fix the query.

Tuning knobs that actually matter (and the ones that waste time)

Start with the constraint: MySQL threads are not free

Each connection consumes memory and scheduler attention. Raising max_connections increases the size of the possible stampede.

When you must raise it, do it like an adult:

  • Estimate memory headroom first (buffer pool + overhead + OS cache + other daemons).
  • Raise in steps (e.g., +25% or +50%), not 10x.
  • Pair it with upstream rate limiting (PHP-FPM max_children, web server limits).
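
A sketch of the “adult” version, using the numbers from Task 5 (200 connections, +25%). The arithmetic is the point, not the exact values: confirm memory headroom first, bump at runtime, and persist only after the host survives it.

cr0x@server:~$ free -m
cr0x@server:~$ mysql -e "SELECT @@innodb_buffer_pool_size/1024/1024 AS buffer_pool_mb, @@max_connections AS max_conn;"
cr0x@server:~$ mysql -e "SET GLOBAL max_connections = 250;"

SET GLOBAL takes effect immediately for new connections and is lost on restart, which is exactly what you want during an incident: a change you can watch, and one that disappears if you forget about it.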

InnoDB buffer pool: the only big knob that usually pays off

For dedicated MySQL hosts, a common starting point is 60–75% of RAM for innodb_buffer_pool_size. For shared hosts, it depends on what else is fighting for memory.

If your working set doesn’t fit, you’ll hit disk more often. Disk I/O makes queries slower. Slow queries hold connections longer. That triggers “too many connections.” It’s a loop.
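
To check whether the working set fits before you grow anything, compare buffer pool misses to read requests, then size the pool in config. The 6G figure is purely illustrative (roughly 70% of an 8 GB host dedicated to MySQL), and the config path varies by distro.

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';"

# /etc/mysql/mysql.conf.d/mysqld.cnf (excerpt)
[mysqld]
innodb_buffer_pool_size = 6G

Innodb_buffer_pool_reads counts reads that had to go to disk. If that number grows quickly relative to read requests during the bad periods, the loop described above is your loop.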

Timeouts: fix the alignment, not the myth

wait_timeout is how long MySQL keeps an idle non-interactive connection. If it’s too low and your client reuses idle connections, you get “gone away.” If it’s too high and you have thousands of idle connections, you waste memory. Pick a realistic value and match it to your environment.

Also consider the invisible killers: firewalls and load balancers. They often have idle timeouts smaller than MySQL’s. If they drop connections silently, the first read/write later triggers “gone away.”

max_allowed_packet: the “why did this upload kill my DB” knob

WordPress uploads don’t directly go through MySQL, but plugins store large serialized blobs, especially in options and post meta. If you see errors around big updates, raise max_allowed_packet to something sane (64M or 128M depending on workload) and then find the plugin storing junk.
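
The change itself is two lines; the values below mirror the 64M–128M range suggested above. Note that each session’s copy of max_allowed_packet is fixed at connect time, so SET GLOBAL only helps connections opened afterwards.

cr0x@server:~$ mysql -e "SET GLOBAL max_allowed_packet = 128*1024*1024;"

# /etc/mysql/mysql.conf.d/mysqld.cnf (excerpt, so a restart keeps it)
[mysqld]
max_allowed_packet = 128M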

Don’t waste time on these during an incident

  • Random query cache advice: If someone tells you to enable the old MySQL query cache, they’re time traveling. In modern MySQL it’s gone.
  • Cranking per-thread buffers: That’s how you manufacture OOM kills with confidence.
  • Turning off InnoDB flush safety: You can trade durability for speed, but do it knowingly—not because a blog comment said so.

WordPress-specific failure patterns

Autoloaded options becoming a database tax

WordPress loads all options with autoload='yes' early. If a plugin dumps huge data into autoloaded options, every request pays the cost. That makes requests slower, which increases concurrent DB usage.

Fix: audit wp_options, shrink autoloaded junk, and push bulky caches into object caching (Redis/Memcached) or files.
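
The audit is two queries, assuming the wpdb database and default table prefix used throughout this guide.

cr0x@server:~$ mysql wpdb -e "SELECT COUNT(*) AS option_count, SUM(LENGTH(option_value))/1024/1024 AS autoload_mb FROM wp_options WHERE autoload='yes';"
cr0x@server:~$ mysql wpdb -e "SELECT option_name, LENGTH(option_value) AS bytes FROM wp_options WHERE autoload='yes' ORDER BY bytes DESC LIMIT 20;"

As a rough rule, more than a megabyte or two of autoloaded options is worth investigating, because every front-end request loads all of it before doing anything useful.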

wp_postmeta queries without proper indexes

Meta queries are flexible but expensive. Many plugins build “search-like” features using LIKE '%...%' on meta_value. That’s not a query; it’s a cry for help.

Fix: limit meta querying, add targeted indexes (with care), or move that feature to a search engine.
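
Before touching indexes, let EXPLAIN show what the optimizer does with the slow-log query from Task 13. The composite index below is an illustration of a “targeted index” (the name and prefix lengths are assumptions): it tightens meta_key lookups and prefix matches, but no index rescues a leading-wildcard LIKE '%needle%'; that part of the fix is the search engine mentioned above.

cr0x@server:~$ mysql wpdb -e "EXPLAIN SELECT p.ID FROM wp_posts p JOIN wp_postmeta pm ON pm.post_id=p.ID WHERE pm.meta_key='some_key' AND pm.meta_value LIKE '%needle%' ORDER BY p.post_date DESC LIMIT 10;"
cr0x@server:~$ mysql wpdb -e "ALTER TABLE wp_postmeta ADD INDEX meta_key_value (meta_key(191), meta_value(64)), ALGORITHM=INPLACE, LOCK=NONE;"

Run DDL like this in a maintenance window regardless: online DDL still takes metadata locks briefly, and Task 7 showed what a stuck metadata lock does to connection counts.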

WP-Cron and admin-ajax storms

High traffic can trigger WP-Cron frequently if it’s traffic-driven. Meanwhile, admin-ajax.php can be abused by themes/plugins for constant polling.

Fix: move cron to real system cron; rate-limit admin-ajax endpoints; cache aggressively for anonymous users.
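
Moving cron off page traffic is two small changes. The schedule, URL, and paths are placeholders; the DISABLE_WP_CRON constant itself is standard WordPress.

# wp-config.php (add above the “That's all, stop editing!” line)
define('DISABLE_WP_CRON', true);

# crontab entry on one web node (add via crontab -e)
*/5 * * * * curl -s 'https://example.com/wp-cron.php?doing_wp_cron' >/dev/null 2>&1

Now scheduled work runs on a predictable five-minute cadence instead of piggybacking on whoever happens to load a page during a traffic spike.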

Short joke #2: “Too many connections” is MySQL’s way of saying it needs fewer meetings on its calendar.

Three corporate-world mini-stories from the trenches

Mini-story 1: The outage caused by a wrong assumption

The team had a WordPress estate behind a load balancer: two web nodes, one MySQL node. They’d been stable for months. Then intermittent “server has gone away” started appearing—only on some pages, mostly after users spent time browsing and then clicked something.

The first assumption was predictable: “MySQL is timing out.” Someone proposed raising wait_timeout to hours. Another person suggested enabling persistent connections to “avoid reconnect cost.” It sounded plausible, and it was wrong.

The actual issue lived in the network layer. A firewall between the web subnet and the DB subnet had an idle TCP timeout shorter than MySQL’s. It silently dropped idle sessions. PHP reused those connections later and got resets. MySQL logs showed “aborted connection” messages, but MySQL wasn’t the one doing the aborting.

The fix was boring: align timeouts and enable TCP keepalives on the web nodes so idle connections stayed alive, or better yet, disable persistent reuse where it wasn’t needed. The lesson stuck: when you see “gone away,” verify who actually hung up first.

Mini-story 2: The optimization that backfired

A different company was chasing performance. They reduced page TTFB by adding aggressive PHP-FPM scaling: high pm.max_children so the site could “handle spikes.” It did handle spikes—by creating them.

During a marketing campaign, traffic rose. PHP-FPM happily spawned workers. Each worker opened a MySQL connection and ran a handful of queries. MySQL started to lag, so requests took longer. Longer requests meant workers stayed busy longer, so the pool spawned more. Connections hit the ceiling. “Too many connections” exploded. Then retries kicked in from the app and some upstream proxies.

They had optimized for the wrong thing: throughput at the web tier, not end-to-end capacity. The correct fix wasn’t “raise max_connections to 2000.” It was to cap PHP concurrency to what the database could serve with acceptable latency, and to cache anonymous traffic so most requests didn’t need MySQL at all.

After tuning, the system handled the same campaign with fewer workers, fewer DB connections, and more cache hits. It felt slower in a synthetic “how many workers can I spawn” benchmark. It was faster for actual users. That’s the only metric that matters.

Mini-story 3: The boring but correct practice that saved the day

An enterprise WordPress deployment ran on MySQL with replicas. Nothing fancy. The secret sauce was discipline: every schema change was scheduled, every plugin update had a staging soak, and slow query logging was enabled during business hours with sensible thresholds.

One afternoon, a plugin update introduced a new query pattern on wp_postmeta. Not catastrophic. Just slower. But slow enough that the connection count started climbing under normal traffic. The on-call noticed because they had alerting on “Threads_connected as percentage of max_connections” and on “95th percentile query time” from their DB metrics.

They didn’t wait for a full outage. They rolled back the plugin, purged a bad autoloaded option bloat that came with it, and added a targeted index in a maintenance window. Users barely noticed. Leadership never heard about it. That’s what success looks like: prevention so dull it doesn’t make slides.

Common mistakes: symptom → root cause → fix

1) “Server has gone away” after exactly N seconds idle

  • Symptom: Errors appear when users return to the site after being idle; reproduces with a stopwatch.
  • Root cause: Timeout mismatch (wait_timeout vs firewall/LB idle timeout vs client reuse).
  • Fix: Align timeouts; raise wait_timeout moderately; disable persistent connections if present; set TCP keepalives where appropriate.

2) “Too many connections” during traffic spikes, plus high CPU iowait

  • Symptom: DB becomes sluggish, connection count rises, iowait spikes.
  • Root cause: Buffer pool miss + slow storage; queries block waiting for disk.
  • Fix: Increase InnoDB buffer pool (within RAM limits), improve storage, reduce query cost, add caching for anonymous traffic.

3) “Too many connections” right after raising PHP-FPM max_children

  • Symptom: More workers “improve” throughput for a day, then incidents begin.
  • Root cause: Upstream concurrency exceeds DB capacity; connection storms.
  • Fix: Cap PHP-FPM to DB capacity; add queueing and backpressure; consider separate read replica for read-heavy traffic if the app supports it.

4) “Server has gone away” when saving posts or updating plugins

  • Symptom: Admin actions fail; logs show packet errors or aborted connections.
  • Root cause: max_allowed_packet too small, or long writes hitting timeouts.
  • Fix: Raise max_allowed_packet; inspect what data is being stored; stop plugins from stuffing huge blobs into options.

5) “Error establishing a database connection” intermittently

  • Symptom: WordPress generic DB connection error, not always “too many connections.”
  • Root cause: MySQL restarts, DNS hiccups, socket exhaustion, or auth throttling.
  • Fix: Check MySQL uptime and crash logs; pin DB host by IP if DNS is unreliable; verify file descriptor limits; audit auth failures.

6) Raising max_connections makes the host fall over

  • Symptom: Fewer “too many connections” errors, but latency gets worse; then mysqld gets OOM-killed.
  • Root cause: Memory per connection + thread scheduling overhead; swapping or OOM.
  • Fix: Revert to safer max_connections; cap PHP; reduce per-thread buffers; add caching; scale vertically or move DB to dedicated host.

Checklists / step-by-step plan

Incident response checklist (15–30 minutes)

  1. Confirm DB health: check systemctl status mysql and uptime.
  2. Check for OOM/disk: dmesg OOM lines; df -h and df -ih.
  3. Measure saturation: Threads_connected vs max_connections.
  4. Find the culprits: processlist; top users/hosts; long queries.
  5. Apply backpressure: reduce PHP-FPM concurrency (or temporarily drain one web node) before touching DB knobs.
  6. Turn on slow log briefly: capture query patterns; don’t leave it on at very low thresholds forever.
  7. Stabilize: confirm error rate drops; watch latency; ensure MySQL doesn’t restart again.

Stabilization plan (next day)

  1. Fix slow queries: indexes, plugin changes, query rewrites.
  2. Audit wp_options autoload: remove bloat, disable offenders, move caching out of DB.
  3. Align timeouts: MySQL, PHP, web server, load balancer, firewall.
  4. Right-size buffer pool: based on RAM and workload, not folklore.
  5. Add monitoring: connection utilization, query latency, InnoDB metrics, disk growth, OOM events.

Hardening plan (next sprint)

  1. Separate roles: put MySQL on a dedicated host or managed service if you can.
  2. Cache anonymous traffic: full-page cache + object cache to reduce DB load.
  3. Limit concurrency intentionally: set PHP-FPM pm.max_children to protect MySQL.
  4. Operational hygiene: scheduled schema changes, plugin update policy, and a rollback plan that doesn’t involve prayer.

FAQ

1) Should I just increase max_connections to fix “Too many connections”?

Only as a temporary pressure release, and only after checking memory headroom. The durable fix is reducing query time and upstream concurrency.

2) What’s the single fastest way to stop the outage right now?

Cap concurrency at the web tier (PHP-FPM max_children or drain a web node) so MySQL can catch up. Then hunt slow queries.

3) Why do I see “server has gone away” but MySQL looks healthy?

Because the connection can be killed by firewalls, load balancers, NAT gateways, or client timeouts. Verify timeouts and network resets, not just mysqld uptime.

4) Is this caused by WordPress itself or a plugin?

WordPress core is usually not the direct culprit. Plugins often create pathological queries (especially meta queries) or bloat autoloaded options.

5) Do persistent MySQL connections help WordPress performance?

Sometimes, but they can also amplify “gone away” errors and connection hoarding. In most modern setups, fixing query cost and caching pays more.

6) What timeout values should I use?

There’s no universal answer. But wait_timeout=60 is frequently too low for layered networks. Start around 300–900 seconds and align with your network device idle timeouts.

7) Why does it happen mostly during admin actions?

Admin actions trigger heavier queries, writes, and sometimes schema changes. They also expose max_allowed_packet limits when large option blobs are updated.

8) Can a replica solve “too many connections”?

Only if your application actually sends reads to it. Classic WordPress doesn’t automatically split reads/writes without additional tooling. Caching is usually the first win.

9) How do I know if disk I/O is the real bottleneck?

Look for high iowait, slow query times that correlate with disk activity, InnoDB pending flushes, and low buffer pool hit rates. If storage is slow, everything else is theater.

10) Is MariaDB different here?

The failure modes are similar: connection ceilings, timeouts, memory/thread costs, and slow queries. The exact variables and defaults can differ, so verify on your version.

Conclusion: practical next steps

These MySQL errors aren’t mysteries. They’re accounting statements: you ran out of connections, time, memory, or patience somewhere in the stack.

  1. Today: run the fast diagnosis playbook, cap PHP concurrency, confirm MySQL isn’t restarting, and capture slow queries.
  2. This week: remove or fix the worst query patterns (often plugin-driven), shrink autoloaded option bloat, and align timeouts across MySQL and network layers.
  3. This sprint: add caching, set explicit capacity limits, and put monitoring/alerts on connection utilization and query latency so you see the cliff before you drive off it.

If you make one cultural change: stop treating “max_connections” as a performance knob. It’s a circuit breaker. Size it carefully, then design your stack so it rarely matters.
