“MySQL server has gone away” is the database equivalent of a co-worker vanishing mid-meeting: the room goes quiet, everyone stares at their laptop, and somehow it’s now your problem.
On Ubuntu 24.04, this usually isn’t “a MySQL bug.” It’s a timeout mismatch, a packet size limit, a proxy that thinks it’s being helpful, or a mysqld restart you didn’t notice because the app only logs “SQLSTATE[HY000]”. We’re going to fix it properly: find the bottleneck fast, prove the root cause with commands, and tune the right layer without turning your database into an unbounded memory piñata.
What “server has gone away” actually means (and what it doesn’t)
The phrase is misleadingly friendly. It implies the server went out for coffee and will be right back. In practice it means: the client tried to use a MySQL connection and the other side wasn’t there anymore, or the protocol stream broke in a way the client can’t recover from.
Common shapes of the failure
- Error 2006: MySQL server has gone away (client notices the socket is dead or can’t send).
- Error 2013: Lost connection to MySQL server during query (query started, connection broke mid-flight).
- SQLSTATE[HY000] wrappers from app drivers (PDO, mysqli, JDBC) that hide the error number and context.
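If your stack buries the numeric code behind SQLSTATE[HY000], make it visible before you start guessing. A minimal sketch, assuming a Python service on PyMySQL (the logger name, connection object, and query handling are placeholders, not a prescription):

# Sketch: log the MySQL error number (2006/2013) instead of a bare SQLSTATE[HY000].
# Assumes PyMySQL; conn is an existing connection from your app's pool.
import logging

import pymysql

log = logging.getLogger("db")

def run_query(conn, sql, params=None):
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    except pymysql.err.OperationalError as exc:
        errno, message = exc.args[0], exc.args[1]  # e.g. 2006 or 2013 plus the text
        log.error("MySQL error %s: %s", errno, message)
        raise

Once the numeric code is in your logs, the categories below become much easier to separate.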
What it is not
It’s not automatically “the database is overloaded” and it’s not solved by “just increase all timeouts.” Increasing timeouts blindly tends to hide leaks (connection leaks, transaction leaks) until the day you run out of memory or threads.
There are only a handful of root categories:
- Idle connection killed by MySQL (wait_timeout) or by a proxy/LB/NAT in the middle.
- Packet too big: MySQL rejects it (max_allowed_packet), a proxy rejects it, or the client hits its own limit.
- Server restart/crash: mysqld died, got OOM-killed, or systemd restarted it.
- Network: TCP resets, MTU weirdness, conntrack timeouts, keepalive mismatches.
- Query exceeded timeouts (server-side net_read_timeout/net_write_timeout, proxy timeouts, client-side timeouts).
One quote worth keeping in your pocket: “Hope is not a strategy.” It’s commonly attributed in ops circles, and it applies directly to database troubleshooting.
Joke #1: If your fix is “set timeouts to 24 hours,” you didn’t solve the problem, you just scheduled it for tomorrow’s shift.
Interesting facts and history you can weaponize
- Fact 1: “MySQL server has gone away” has been around since early MySQL client libraries; it’s a client-side message, not a server log line.
- Fact 2: The classic default wait_timeout historically sat at 8 hours on many installations; modern environments with proxies often need much shorter timeouts plus proper pooling.
- Fact 3: The protocol is packet-based; large statements and large rows stress both max_allowed_packet and memory allocation paths.
- Fact 4: NAT gateways and stateful firewalls frequently have idle TCP timeouts far below MySQL’s defaults; they’ll drop “healthy” idle sockets without telling either endpoint nicely.
- Fact 5: Linux TCP keepalive defaults are conservative (hours). In cloud networks, that’s basically “never,” which means dead connections can linger until the next write fails.
- Fact 6: Many drivers (especially older ones) don’t automatically reconnect safely because reconnecting mid-transaction is correctness landmine territory.
- Fact 7: MySQL’s max_allowed_packet limit exists on both server and client; raising only one side can leave the failure in place.
- Fact 8: A common modern culprit is not MySQL at all but the layer-7 stack: an app-side connection pool with “infinite” lifetime holding sockets that proxies time out.
- Fact 9: The error often shows up after deploys because deploys change connection behavior: more parallelism, bigger payloads, different retry logic.
Fast diagnosis playbook (first/second/third)
First: determine if mysqld restarted or crashed
If mysqld restarted around the time of errors, stop chasing packet limits. Your clients lost connections because the server disappeared. Find out why it restarted.
- Check systemd journal for mysqld restarts and exit codes.
- Check MySQL error log for crash markers and InnoDB recovery.
- Check OOM killer logs and memory pressure.
Second: decide if it’s idle timeouts vs mid-query drop
“Gone away” right after a period of inactivity points to wait_timeout or network/proxy idle timeouts. “Lost connection during query” points to long queries, proxy timeouts, or server-side net timeouts.
- Compare error timestamps to “idle periods” in app logs.
- Check MySQL Aborted_clients and Aborted_connects trends.
- Inspect proxy/LB timeout settings if present.
Third: test packet size limits (only after you know it’s not restarts)
If errors correlate with large inserts/updates or blob traffic, check max_allowed_packet on both server and client, and see whether the app is sending multi-megabyte statements.
- Check current max_allowed_packet values (global and session).
- Reproduce with a controlled large payload test.
- Fix by raising limits thoughtfully or changing how data is sent (chunking, streaming, avoiding mega-queries).
Practical tasks: commands, outputs, decisions (do these in order)
These are not “ideas.” These are the steps you run on Ubuntu 24.04 and what you conclude from each. Run them as a user with sudo rights on the DB host unless noted.
Task 1: Confirm what MySQL you’re running (MySQL vs MariaDB matters)
cr0x@server:~$ mysql --version
mysql Ver 8.0.39-0ubuntu0.24.04.1 for Linux on x86_64 ((Ubuntu))
What it means: You’re on Oracle MySQL 8.0 packages. Variables and log paths match MySQL 8.0 conventions.
Decision: Use MySQL 8.0 docs/variables; don’t copy MariaDB-specific knobs.
Task 2: Check service health and recent restarts (systemd is the truth serum)
cr0x@server:~$ systemctl status mysql --no-pager
● mysql.service - MySQL Community Server
Loaded: loaded (/usr/lib/systemd/system/mysql.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-30 08:22:10 UTC; 2h 11min ago
Main PID: 1234 (mysqld)
Status: "Server is operational"
Tasks: 44 (limit: 38316)
Memory: 1.3G
CPU: 18min 12.345s
CGroup: /system.slice/mysql.service
└─1234 /usr/sbin/mysqld
What it means: MySQL is currently up; you also get uptime.
Decision: If uptime is suspiciously short compared to when the errors started, pivot to crash/restart investigation before tuning timeouts.
Task 3: Pull the last hour of mysql.service logs (look for exits, OOM, restarts)
cr0x@server:~$ journalctl -u mysql --since "1 hour ago" --no-pager
Dec 30 09:54:01 server systemd[1]: mysql.service: Main process exited, code=killed, status=9/KILL
Dec 30 09:54:01 server systemd[1]: mysql.service: Failed with result 'signal'.
Dec 30 09:54:01 server systemd[1]: mysql.service: Scheduled restart job, restart counter is at 1.
Dec 30 09:54:04 server systemd[1]: Started mysql.service - MySQL Community Server.
What it means: mysqld got SIGKILL. That’s not graceful. Common reasons: OOM killer, admin kill, watchdog, or container eviction.
Decision: Don’t touch max_allowed_packet yet. First find who killed it.
Task 4: Check for OOM killer evidence (Ubuntu will tell you)
cr0x@server:~$ journalctl -k --since "2 hours ago" | grep -E "Out of memory|Killed process|oom-kill" | tail -n 20
Dec 30 09:53:59 server kernel: Out of memory: Killed process 1234 (mysqld) total-vm:9123456kB, anon-rss:6212340kB, file-rss:0kB, shmem-rss:0kB, UID:112 pgtables:15432kB oom_score_adj:0
Dec 30 09:53:59 server kernel: oom_reaper: reaped process 1234 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What it means: MySQL was killed due to memory exhaustion. Your app sees “server has gone away” because it did.
Decision: Fix memory sizing, concurrency, or leaks; raising packet limits might make it worse.
Task 5: Locate and tail the MySQL error log (catch crash recovery, packet errors)
cr0x@server:~$ sudo ls -l /var/log/mysql/
total 120
-rw-r----- 1 mysql adm 118432 Dec 30 10:32 error.log
cr0x@server:~$ sudo tail -n 60 /var/log/mysql/error.log
2025-12-30T09:54:04.123456Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.39) starting as process 5678
2025-12-30T09:54:05.234567Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2025-12-30T09:54:06.345678Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.39' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server - GPL.
What it means: You see restart and readiness markers; if there were packet errors, you’d often see “Got packet bigger than ‘max_allowed_packet’ bytes”.
Decision: If logs show crash recovery or frequent restarts, fix stability first. If logs show packet-size errors, move to packet section.
Task 6: Check current timeout and packet variables (global truth)
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout','interactive_timeout','net_read_timeout','net_write_timeout','max_allowed_packet','max_connections','connect_timeout');"
+---------------------+----------+
| Variable_name       | Value    |
+---------------------+----------+
| connect_timeout     | 10       |
| interactive_timeout | 28800    |
| max_allowed_packet  | 67108864 |
| max_connections     | 500      |
| net_read_timeout    | 30       |
| net_write_timeout   | 60       |
| wait_timeout        | 28800    |
+---------------------+----------+
What it means: 64MB packets, 8-hour idle timeout, and fairly standard net timeouts.
Decision: If errors happen after ~8 hours idle, it’s likely wait_timeout. If errors happen at ~60 seconds during large result sets, suspect net_write_timeout or proxy timeouts.
Task 7: Check session variables from the app’s perspective (pooling can differ)
cr0x@server:~$ mysql -uroot -p -e "SHOW SESSION VARIABLES WHERE Variable_name IN ('wait_timeout','max_allowed_packet');"
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 67108864 |
| wait_timeout       | 28800    |
+--------------------+----------+
What it means: The current session inherits global values. Apps can change session values at connect time.
Decision: If the app uses a different account with different init commands, check that specific account or connection pool config.
Task 8: Look at aborted connections counters (detect timeouts and network drops)
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Aborted_%';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 1249  |
| Aborted_connects | 12    |
+------------------+-------+
What it means: Aborted_clients counts connections that were aborted (client disappeared, network issue, timeout). It’s not perfectly diagnostic, but trends matter.
Decision: If Aborted_clients climbs rapidly during incidents, investigate idle drops (proxy/NAT) or server overload causing stalled writes.
Task 9: Check current connections and what they’re doing (idle vs stuck)
cr0x@server:~$ mysql -uroot -p -e "SHOW PROCESSLIST;"
+-----+------+-----------------+------+---------+------+------------------------+------------------+
| Id  | User | Host            | db   | Command | Time | State                  | Info             |
+-----+------+-----------------+------+---------+------+------------------------+------------------+
| 101 | app  | 10.10.0.21:5332 | prod | Sleep   | 7200 |                        | NULL             |
| 102 | app  | 10.10.0.21:5333 | prod | Sleep   | 7199 |                        | NULL             |
| 201 | app  | 10.10.0.22:6121 | prod | Query   |   55 | Sending data           | SELECT ...       |
+-----+------+-----------------+------+---------+------+------------------------+------------------+
What it means: You have long-sleeping pooled connections (2 hours). That’s fine if your network supports it. It’s a problem if a proxy kills idle sockets at 60 minutes.
Decision: If you see lots of very old sleeping connections and the app errors after idle, align pool max lifetime and server/proxy timeouts.
Task 10: Check TCP-level resets and retransmits (is the network lying?)
cr0x@server:~$ ss -tan sport = :3306 | head -n 20
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 10.0.0.10:3306 10.10.0.21:5332
ESTAB 0 0 10.0.0.10:3306 10.10.0.21:5333
ESTAB 0 0 10.0.0.10:3306 10.10.0.22:6121
What it means: Connections exist and aren’t obviously backed up. For deeper network signals, use nstat.
cr0x@server:~$ nstat -az | egrep "TcpRetransSegs|TcpExtTCPRcvCoalesce|TcpExtListenOverflows|TcpExtListenDrops" || true
TcpRetransSegs 124
TcpExtListenOverflows 0
TcpExtListenDrops 0
Decision: If retransmits spike during errors, it’s not “a MySQL setting.” Investigate network path, MTU, congestion, or host CPU saturation.
Task 11: Validate Linux TCP keepalive settings (often too sleepy)
cr0x@server:~$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
What it means: Keepalives start after 2 hours. If your firewall drops idle TCP at 15 minutes, these won’t help.
Decision: If you must keep long-lived idle connections, lower keepalive time (or better: set pool max lifetime below the network idle timeout).
Task 12: Check for MySQL packet-size errors by grepping logs
cr0x@server:~$ sudo grep -E "max_allowed_packet|packet bigger than" -n /var/log/mysql/error.log | tail -n 20
2025-12-30T10:01:12.123456Z 45 [Warning] [MY-000000] [Server] Got packet bigger than 'max_allowed_packet' bytes
What it means: That’s your smoking gun. The server rejected an incoming packet.
Decision: Increase max_allowed_packet to a justified value and fix the app behavior that generated giant packets if possible.
Task 13: Reproduce packet limit safely (controlled test, not on prod peak)
cr0x@server:~$ python3 - << 'PY'
import os
print(len(os.urandom(10*1024*1024)))
PY
10485760
What it means: You can generate a 10MB blob. Now test an insert with parameter binding (preferred) using your app stack, or a local mysql test if acceptable.
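For example, here is a hedged sketch of that controlled test, assuming PyMySQL and a throwaway packet_test table; the host, credentials, database name, and the client-side max_allowed_packet value are placeholders for your environment, not recommendations:

# Sketch: controlled large-payload insert to probe packet limits (run against a test DB).
import os

import pymysql

payload = os.urandom(10 * 1024 * 1024)  # same 10MB blob as above

# max_allowed_packet here is PyMySQL's client-side knob; the server's own limit still applies.
conn = pymysql.connect(host="127.0.0.1", user="app", password="secret",
                       database="test", max_allowed_packet=256 * 1024 * 1024)
try:
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS packet_test "
                    "(id INT AUTO_INCREMENT PRIMARY KEY, payload LONGBLOB)")
        cur.execute("INSERT INTO packet_test (payload) VALUES (%s)", (payload,))
    conn.commit()
    print("insert ok:", len(payload), "bytes")
except pymysql.err.OperationalError as exc:
    print("rejected:", exc.args[0], exc.args[1])  # 1153 = packet too large, 2006 = gone away
finally:
    conn.close()

If this fails at a payload size your application legitimately needs, you have your evidence for the configuration change described later; if it fails far below that size, check the client-side limit first.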
Task 14: Confirm configuration sources on Ubuntu (avoid editing the wrong file)
cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'pid_file'; SHOW VARIABLES LIKE 'socket';"
+---------------+-----------------------------+
| Variable_name | Value                       |
+---------------+-----------------------------+
| pid_file      | /var/run/mysqld/mysqld.pid  |
+---------------+-----------------------------+
+---------------+-----------------------------+
| Variable_name | Value                       |
+---------------+-----------------------------+
| socket        | /var/run/mysqld/mysqld.sock |
+---------------+-----------------------------+
Decision: On Ubuntu packaging, you typically adjust MySQL settings in /etc/mysql/mysql.conf.d/mysqld.cnf (or drop-in files). Verify with mysqld --verbose --help if in doubt.
Task 15: Test max connections pressure (thread exhaustion can look like “gone away”)
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 312   |
+-------------------+-------+
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 498   |
+----------------------+-------+
What it means: You are flirting with max_connections (500). When you hit the cap, apps often fail to connect and misreport the error.
Decision: If Max_used_connections approaches the limit, fix pooling and connection leaks before raising max_connections. Raising it can just convert “connect errors” into “OOM kill.”
Timeouts that matter: wait_timeout, net_*_timeout, proxies, and TCP keepalives
Start with the timeline, not the variable names
The fastest way to solve “gone away” is to plot time. Not dashboards with forty colors; a simple timeline:
- When did the application open the connection?
- How long was it idle?
- When did the error occur: on the first query after idle, or mid-query?
- Is there a proxy between app and MySQL?
If errors occur on the first query after a long idle, you’re looking at an idle timeout mismatch. If errors occur during a long query, you’re looking at query/proxy read/write timeouts or a server crash/restart.
wait_timeout and interactive_timeout: MySQL’s idle guillotine
wait_timeout is how long the server lets a non-interactive session sit idle before it kills it. Most app connections are non-interactive. interactive_timeout is for interactive clients (like a human in a terminal) when the client sets the CLIENT_INTERACTIVE flag.
What goes wrong in production is rarely “wait_timeout is too low.” It’s usually:
- An app pool holds connections for hours.
- A firewall/NAT/proxy in the middle drops idle TCP at 5–60 minutes.
- The app keeps the connection object and tries to reuse it later.
- The next query hits a dead socket and you get “server has gone away.”
So the fix is not always “increase wait_timeout.” In many environments, you want to reduce server-side idle timeout and ensure the pool doesn’t keep zombies.
Practical guidance that works
- Set the connection pool max lifetime below the shortest idle timeout in the path (NAT/proxy/firewall). Example: if the LB kills idle at 15 minutes, set pool max lifetime to 10 minutes (see the sketch after this list).
- Prefer health-checks at borrow time (test query) over periodic pings if your workload is spiky.
- Don’t set wait_timeout to days. Long-lived idle connections are not “stability,” they’re stale state with a good PR team.
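To make the first rule concrete, here is a minimal pool-configuration sketch, assuming SQLAlchemy over PyMySQL and a 15-minute load-balancer idle cut; the URL, pool sizes, and timings are placeholders you’d tune to your own path:

# Sketch: keep pool lifetime (600 s) below the assumed 15-minute LB idle timeout.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://app:secret@10.0.0.10/prod",
    pool_size=10,        # steady-state connections
    max_overflow=5,      # burst headroom
    pool_recycle=600,    # retire connections after 10 minutes
    pool_pre_ping=True,  # validate on borrow; swaps out dead sockets transparently
)

# Each checkout now hands back a connection that is either fresh or was just validated.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

The pre-ping costs one lightweight round trip per checkout; in bursty traffic that is usually cheaper than the first request of the morning failing.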
net_read_timeout and net_write_timeout: the “mid-query” timeouts
These are server-side timeouts for reading from and writing to a client. They become relevant when:
- The client sends a large statement slowly (slow network, overloaded client), and the server decides it waited long enough.
- The server is sending a large result set and the client isn’t reading (client CPU pinned, app backpressure), so server blocks and times out.
If you’re seeing Lost connection during query and you have large results or large writes, these variables are suspects. But don’t just raise them. Ask why the sender/receiver is slow. In production, “slow client” is often a thread pool starved by GC, or a PHP worker stuck doing something else while holding the socket open.
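If the slow side is the client, streaming the result set is usually a better fix than raising net_write_timeout. A sketch assuming PyMySQL’s unbuffered SSCursor; connection details, the table, and the query are placeholders:

# Sketch: stream a large result set instead of buffering it all client-side,
# so the server isn't stuck waiting on a client that stopped reading.
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="127.0.0.1", user="app", password="secret",
                       database="prod", cursorclass=pymysql.cursors.SSCursor)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM big_table")
        rows = 0
        for row in cur:   # rows arrive incrementally instead of one giant buffered fetch
            rows += 1     # do per-row work here and keep it fast so the server isn't blocked
        print("streamed rows:", rows)
finally:
    conn.close()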
Linux TCP keepalives: last-resort plumbing
Keepalives are a low-level way to keep NAT/firewall state alive and to detect dead peers. They can help, but they are not a substitute for correct pooling.
If you control the servers and must maintain long-lived connections through flaky middleboxes, tuning keepalives can reduce “surprise dead socket” events:
- Lower tcp_keepalive_time to something like 300 seconds (5 minutes) if the path drops idle around 10–15 minutes.
- Keep intervals reasonable (tcp_keepalive_intvl of 30–60 seconds) and probes modest.
Be careful: keepalives create traffic. On very large fleets it’s not free. But it’s cheaper than waking up humans.
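For reference, this is what the per-connection version of those knobs looks like at the socket level in Python; the address is a placeholder, the option values mirror the sysctl suggestions above, and in practice you would let the driver, pool, or proxy own these settings rather than hand-rolling sockets:

# Sketch: per-connection TCP keepalive (Linux-specific socket options).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)  # first probe after 5 min idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # then probe every 30 seconds
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # declare the peer dead after 3 misses
sock.connect(("10.0.0.10", 3306))  # placeholder address; normally the driver owns this socket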
Packet limits: max_allowed_packet, big rows, and large queries
Packet errors are straightforward when you actually catch them. The hard part is that many stacks don’t log the real MySQL warning, so you only see “gone away.”
What counts as a “packet” in this context
MySQL’s protocol transmits data in packets, and max_allowed_packet limits the largest packet the server will accept (and the largest it will send). This interacts with:
- Huge INSERT ... VALUES (...), (...), ... statements
- Large BLOB/TEXT columns
- Big result sets (server sending a lot back)
- Replication and binlog events (yes, packet sizes matter there too)
How to pick a value without cargo culting
Common values you’ll see: 16MB, 64MB, 256MB, 1GB. The temptation is to set it to 1GB “just in case.” Don’t.
Why? Because bigger packets can create bigger per-connection buffers and memory allocations. A single huge request can pressure memory and trigger OOM conditions, which then manifests as… you guessed it… “server has gone away.”
A reasonable approach:
- Measure payload size: find the largest realistic request/row your application sends.
- Add headroom: 2–4x is usually enough, not 100x.
- Prefer changing the app to avoid mega-statements: batch smaller, stream blobs, store large objects elsewhere, or at least compress.
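A minimal sketch of the “batch smaller” option, assuming PyMySQL; the events table, the payload column, and the 500-row batch size are placeholders you’d derive from measured row sizes:

# Sketch: cap batch size so no single INSERT approaches max_allowed_packet.
import pymysql

BATCH_ROWS = 500  # derive from measured row size, not wishful thinking

def insert_in_batches(conn, rows):
    sql = "INSERT INTO events (payload) VALUES (%s)"
    with conn.cursor() as cur:
        for start in range(0, len(rows), BATCH_ROWS):
            chunk = rows[start:start + BATCH_ROWS]
            cur.executemany(sql, [(r,) for r in chunk])  # one bounded multi-row INSERT per chunk
            conn.commit()  # keep transactions short as well as packets small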
Client-side limits matter too
Many clients have their own max_allowed_packet or equivalent. If you raise the server but the client still refuses, you’ll keep failing—sometimes with confusing, driver-specific errors.
Fixing it properly on Ubuntu (MySQL 8.0)
You want a persistent change in config, not a one-off SET GLOBAL that disappears on restart.
Example change (illustrative): set max_allowed_packet=128M if you have evidence you need it.
cr0x@server:~$ sudo grep -n "max_allowed_packet" /etc/mysql/mysql.conf.d/mysqld.cnf || true
cr0x@server:~$ sudo bash -lc 'printf "\n[mysqld]\nmax_allowed_packet=128M\n" >> /etc/mysql/mysql.conf.d/mysqld.cnf'
cr0x@server:~$ sudo systemctl restart mysql
What it means: You’ve set it at server startup. Now verify.
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';"
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| max_allowed_packet | 134217728 |
+--------------------+-----------+
Decision: If errors stop and memory remains stable, keep it. If memory spikes or OOM events increase, your “fix” is a grenade with the pin taped back in.
Joke #2: A 1GB max_allowed_packet is like buying a bigger trash can instead of taking out the trash; it works until it doesn’t, and then it’s spectacular.
The sneaky cause: restarts, crashes, OOM, and systemd
In production, “gone away” is very often literal. mysqld died, restarted, or got replaced during maintenance. Your clients were not consulted.
Why MySQL restarts on Ubuntu 24.04
- OOM killer due to memory pressure (common when buffers are oversized or workload changed).
- Kernel or hypervisor events (host reboot, live migration issues, storage stalls causing watchdog actions).
- Package upgrades restarting services.
- Human actions (restart for config change) without coordinating with connection pools and retries.
- Crashes (bugs exist; but assume resource and config first).
How to make restarts less painful
Two angles: reduce frequency and reduce blast radius.
- Reduce frequency: size memory correctly; avoid pathological queries; ensure storage is healthy; don’t run the box at 99% RAM with swap disabled and prayers enabled.
- Reduce blast radius: use app retries with idempotency; use connection pools that validate connections; set shorter pool lifetimes; consider MySQL Router/ProxySQL if appropriate (and then manage their timeouts too).
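As an illustration of “retries with idempotency,” here is a hedged sketch for read-only queries, assuming PyMySQL; it retries exactly once on 2006/2013 with a fresh connection and deliberately never retries writes:

# Sketch: retry an idempotent read on a dead connection; not for blind write retries.
import pymysql

GONE_AWAY = {2006, 2013}  # server has gone away / lost connection during query

def fetch_with_retry(make_conn, sql, params=None, retries=1):
    """Run an idempotent read, reconnecting once if the connection turns out to be dead."""
    for attempt in range(retries + 1):
        conn = make_conn()  # e.g. a callable that checks a connection out of your pool
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except pymysql.err.OperationalError as exc:
            if exc.args[0] not in GONE_AWAY or attempt == retries:
                raise  # not a dead-socket error, or retries exhausted
        finally:
            try:
                conn.close()
            except Exception:
                pass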
Detecting crash loops quickly
If you see frequent restarts, you need a stability incident, not a tuning exercise.
cr0x@server:~$ systemctl show mysql -p NRestarts -p ExecMainStatus -p ExecMainCode
NRestarts=3
ExecMainStatus=0
ExecMainCode=0
What it means: The service has restarted multiple times since boot.
Decision: Pull journal logs across the window of the restarts; check OOM and error logs; consider temporarily reducing concurrency from the app side to stabilize.
Proxies and load balancers: the timeout middlemen
If there’s a proxy between your app and MySQL (HAProxy, ProxySQL, cloud load balancer, service mesh sidecar), you now have at least three independent timeout systems:
- MySQL server timeouts
- Proxy idle and session timeouts
- Client/driver timeouts and pool settings
Most “server has gone away after idle” cases come from mismatch: the proxy kills the connection first, but the pool thinks it can reuse it forever.
What “correct” looks like
- The shortest idle timeout wins. Your pool max lifetime must be shorter than that.
- Keepalive only where needed. If the proxy supports TCP keepalive, enable it there; don’t rely solely on kernel defaults.
- Observe at the proxy. If the proxy logs disconnect reasons, it will save you hours of guessing.
When increasing timeouts is correct
Sometimes the app truly needs long-running queries (analytics, migrations, backfills). In that case:
- Raise proxy/server read/write timeouts to exceed the longest legitimate query time, plus buffer.
- Prefer moving long-running work off the request path.
- Use proper limits and cancellation to avoid zombie queries that never finish.
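One way to enforce such limits on MySQL 8.0 is max_execution_time, which caps read-only SELECTs in milliseconds. A sketch assuming PyMySQL, with placeholder connection details, table name, and a 60-second cap:

# Sketch: bound long-running SELECTs instead of relying on ever-larger timeouts.
import pymysql

conn = pymysql.connect(host="10.0.0.10", user="report", password="secret", database="prod")
try:
    with conn.cursor() as cur:
        # Session-wide cap: every SELECT on this connection is interrupted after 60 s
        cur.execute("SET SESSION max_execution_time = 60000")
        # Or cap a single statement with an optimizer hint
        cur.execute("SELECT /*+ MAX_EXECUTION_TIME(60000) */ COUNT(*) FROM big_table")
        print(cur.fetchone())
finally:
    conn.close()

When the cap is hit, MySQL returns a clear “query execution was interrupted” error instead of leaving a zombie query running behind a broken connection.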
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a perfectly reasonable assumption: “Our database and app are in the same VPC, so the network is stable.” Reasonable is not the same as correct.
The application used a pool with a generous max lifetime. Connections would sit idle for hours overnight, then get reused in the morning. The error rate would spike at exactly the first traffic surge after 9am, and then slowly settle.
The team tuned wait_timeout upward because it looked like “idle disconnects.” That did nothing. They tuned it further. Still nothing. Meanwhile, the proxy logs (which nobody was reading) showed that idle sessions were being culled at 60 minutes by an internal load balancer policy.
Once they aligned the pool lifetime to 45 minutes and enabled a borrow-time validation query, the error disappeared. The lesson wasn’t “set timeouts higher.” The lesson was “the shortest timeout in the path decides your fate.”
Mini-story 2: The optimization that backfired
A team wanted to reduce network round trips. They changed a bulk import path to build enormous multi-row INSERT statements. CPU went down, throughput went up, and everyone congratulated everyone.
Then a new customer started importing bigger records with long JSON blobs. The first failure wasn’t a graceful “packet too large.” It was intermittent “server has gone away” in the app, followed by a MySQL restart a few minutes later. On-call blamed the network. The network team blamed the app. The app team blamed MySQL.
The root cause was ugly but simple: gigantic statements pushed memory usage up (server-side parsing plus buffers plus per-connection overhead), and under peak concurrency MySQL got OOM-killed. The original “optimization” increased the size of worst-case packets and made memory spikes sharper. It worked until it didn’t.
The fix was twofold: cap batch size to keep statements below a sane maximum, and set max_allowed_packet to a value driven by real needs. They kept most of the performance win, and the database stopped playing dead.
Mini-story 3: The boring but correct practice that saved the day
Another org had a habit that looked dull on paper: after every significant change, they captured the current MySQL variables, systemd status, and a small set of OS network parameters into a ticket comment. Not a giant config dump—just the stuff that explains behavior.
One afternoon they started seeing sporadic “lost connection during query.” The first instinct was to blame a recent schema migration. But their “boring snapshots” showed that net_write_timeout had been lowered weeks earlier during a misguided attempt to “fail fast,” and the change survived because it lived in a drop-in file nobody remembered.
They reverted the timeout, and the errors vanished. The migration was innocent. The value of the practice wasn’t the snapshot itself; it was that it gave them a fast diff of assumptions versus reality.
That kind of boring correctness is what keeps your weekends intact.
Common mistakes: symptom → root cause → fix
1) Errors happen right after long idle periods
Symptom: First query after inactivity fails with “server has gone away,” subsequent retries work.
Root cause: Idle TCP sessions are being dropped (proxy/NAT/firewall) or MySQL kills idle sessions (wait_timeout).
Fix: Set pool max lifetime below the shortest path timeout; add connection validation; optionally tune TCP keepalive. If the server is killing idles and you truly need long idles, raise wait_timeout—but only if the network can keep them alive.
2) Errors happen during large inserts/updates
Symptom: Failures correlate with bulk writes, large JSON, BLOB uploads, or big multi-row statements.
Root cause: Packet limit exceeded (max_allowed_packet) on server or client; or memory pressure from huge statements.
Fix: Confirm log message; increase max_allowed_packet to a justified value; reduce batch sizes; switch to streaming/chunking patterns.
3) “Lost connection during query” on slow endpoints
Symptom: Long query runs, then fails. Or large result sets fail when the client is busy.
Root cause: Proxy read/write timeout or MySQL net_read_timeout/net_write_timeout is too low for real behavior; client not reading fast enough.
Fix: Increase the relevant timeouts to exceed realistic query times; fix the app to stream results; avoid fetching massive result sets into memory; add pagination.
4) Errors spike alongside MySQL restarts
Symptom: Connection errors cluster around service restart times; uptime resets.
Root cause: mysqld crash, OOM kill, package restart, host reboot.
Fix: Diagnose restart reason via journal + error log; fix memory sizing; reduce concurrency; consider swap behavior; limit worst-case queries/packets.
5) Errors appear after raising max_connections
Symptom: Fewer “too many connections,” but more “gone away” and restarts.
Root cause: More concurrent connections increased memory usage and context switching; server becomes unstable.
Fix: Reduce connections with proper pooling; add a connection limiter; right-size buffers; don’t scale by brute force.
6) Errors only in one environment (prod, not staging)
Symptom: Staging is fine, prod fails intermittently.
Root cause: Different network middleboxes/timeouts, different pool settings, different payload sizes, different concurrency, or different MySQL config.
Fix: Compare the shortest timeout in the path; check real payload sizes; diff MySQL variables; reproduce with prod-like data volume.
Checklists / step-by-step plan
Step-by-step plan: stop the bleeding, then fix it right
- Confirm whether MySQL restarted. Use systemctl status mysql and journalctl -u mysql. If yes, this is stability first.
- Check for OOM kills. Use journalctl -k filtering OOM messages. If OOM: reduce memory use and concurrency before raising limits.
- Tail MySQL error log. Look for packet warnings and crash recovery markers.
- Classify the error timing. After idle vs during query changes the entire path of investigation.
- Inventory timeouts across layers. MySQL (wait_timeout, net_*), proxy, client pool, OS keepalive.
- Fix the shortest timeout mismatch. Set pool max lifetime and validation; don’t try to out-wait a firewall with optimism.
- Only then adjust MySQL variables. Raise max_allowed_packet if you have evidence; adjust net_write_timeout/net_read_timeout if mid-query drops occur.
- Re-test with a controlled reproduction. Use a known payload size, known query time, and watch logs and counters.
- Lock changes in config management. Avoid mystery drop-in files and “temporary” global SETs that vanish on restart.
- Monitor what matters. Uptime, restarts, OOM events, aborted clients, connections used, and latency distribution.
Checklist: safe timeout tuning rules
- Do not set server idle timeouts longer than the network can maintain.
- Pool max lifetime < shortest network/proxy idle timeout.
- Validation on borrow beats periodic ping in bursty traffic.
- Increase read/write timeouts only after confirming legitimate long queries or slow clients.
Checklist: safe packet tuning rules
- Find the largest legitimate payload first.
- Increase in steps (e.g., 64M → 128M), not leaps to 1G.
- Cap batch sizes; avoid building mega-queries.
- Watch memory after the change; packet size and OOM are close friends.
FAQ
1) Is “MySQL server has gone away” always a timeout?
No. It can be a timeout, a packet limit, a server restart, or a network reset. Your first job is to determine which category you’re in.
2) Should I just increase wait_timeout to stop idle disconnects?
Only if MySQL is the layer killing idles and you truly need long-lived idle connections. In many real environments, a proxy/NAT drops the TCP session first, so increasing wait_timeout doesn’t help. Fix the pool lifetime and validation first.
3) What’s a good value for max_allowed_packet?
One driven by measured payload sizes plus headroom. 64MB or 128MB often works for apps that occasionally move medium blobs. If you need hundreds of MB, reconsider the design (chunking, object storage, streaming).
4) Why do I see it only after deployment?
Deployments change connection behavior: more worker processes, different pooling defaults, larger requests, different retry logic, or a new proxy hop. Also, package upgrades can restart mysql.service.
5) What’s the difference between error 2006 and 2013?
2006 often indicates the connection was already dead when trying to use it. 2013 usually indicates the connection died during an active query or data transfer. They point to different suspects.
6) Can max_connections cause “gone away”?
Indirectly. Hitting max_connections causes connection failures; depending on the driver, the app may report it poorly. Raising max_connections can also increase memory pressure and trigger crashes, which then create real “gone away” errors.
7) Should the application auto-reconnect?
Auto-reconnect is dangerous if you might be in a transaction. The safer approach is: use pooling with validation, handle retries at a higher level with idempotent operations, and fail fast on non-idempotent writes.
8) Do TCP keepalives solve this reliably?
They help with idle NAT/firewall timeouts and dead peer detection, but they’re not a substitute for correctly configured pools and timeouts. Use them as plumbing, not as architecture.
9) What if the MySQL error log shows nothing?
Then either you’re looking in the wrong place, logging is misconfigured, or the disconnect happens outside MySQL (proxy/network). Confirm log paths, check systemd journal, and check proxy logs if any.
10) I raised max_allowed_packet and it still fails. Now what?
Confirm the client-side limit, confirm the setting actually applied (global vs session), and confirm the payload size. Also check for proxy limits and for OOM events caused by bigger allocations.
Conclusion: next steps you can do today
If you take only one operational habit from this: stop treating “server has gone away” like a single bug. It’s a symptom class. Categorize it fast.
- Check restarts first (systemctl status, journalctl, OOM logs). If mysqld died, fix stability before tuning knobs.
- Decide idle vs mid-query from timing and error type (2006 vs 2013). That tells you whether to focus on wait_timeout/pooling or on read/write/query/proxy timeouts.
- Prove packet issues by log evidence and controlled reproduction. Raise max_allowed_packet to a justified number, then fix batching so you don’t create memory cliffs.
- Align timeouts across layers (MySQL, proxy/LB, pool, TCP keepalive). The shortest one wins; act accordingly.
Do that, and “server has gone away” turns from a spooky intermittent ghost into a normal incident with a root cause and a short postmortem. Your future self will still be tired, but at least they’ll be tired for interesting reasons.