MySQL vs MariaDB: WordPress 504s—who collapses first under a traffic spike

WordPress 504s are rarely “a WordPress problem.” They’re a symptom: the page request walked into your stack and couldn’t get back out before the gateway timer hit zero. Most of the time, the database is where it went to die—quietly, politely, and with just enough logs to waste your afternoon.

If you’re deciding between MySQL and MariaDB for WordPress under spiky traffic, here’s the blunt truth: either can survive, and either can fold. The database that “collapses first” is usually the one you’ve configured like it’s 2014, deployed like it’s a pet, and monitored like it’s a rumor. Let’s talk about failure modes, not brand loyalty.

What a WordPress 504 actually means in production

A 504 is not “WordPress timed out.” It’s your gateway (often Nginx, sometimes an ALB, sometimes a CDN) saying: “I asked an upstream for a response, and it didn’t answer fast enough.” The upstream might be PHP-FPM, which might be blocked on MySQL, which might be blocked on disk, which might be blocked on a mutex, which might be blocked on your questionable life choices.
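
In Nginx terms, the timer that turns “didn’t answer fast enough” into a 504 is usually fastcgi_read_timeout (or proxy_read_timeout for a proxied upstream, plus whatever idle timeout a load balancer in front enforces). A minimal sketch of where those knobs live, assuming a Debian/Ubuntu-style PHP 8.2 socket path; the values are illustrative, not recommendations:

# /etc/nginx/sites-enabled/example.com (illustrative path and values)
location ~ \.php$ {
    include snippets/fastcgi-php.conf;            # Debian/Ubuntu helper; sets SCRIPT_FILENAME etc.
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_connect_timeout 5s;                   # time allowed just to reach PHP-FPM
    fastcgi_read_timeout 60s;                     # time allowed for PHP (and the DB behind it) to answer; past this, the client gets a 504
}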

On a typical WordPress stack:

  • Client hits CDN or load balancer.
  • Request goes to Nginx/Apache.
  • Dynamic path goes to PHP-FPM.
  • PHP does application work and calls MySQL/MariaDB.
  • DB reads from buffer pool or disk, locks rows, writes redo/undo, returns results.

Under a spike, the weakest link isn’t necessarily the slowest component; it’s the component that sheds load the least gracefully. A saturated DB typically doesn’t fail fast. It fails by queuing. Queuing looks like “it still works, just slower,” which becomes “everything times out,” which becomes “the incident channel is now a group therapy session.”

There are two main ways a database turns a spike into 504s:

  1. Connection pressure: too many simultaneous PHP workers trying to connect or run queries, causing thread explosion, context switching, or resource starvation.
  2. Latency inflation: queries get slower because of disk IO, lock contention, buffer pool misses, purge lag, redo log pressure, or simply bad query plans.

So “who collapses first” is really: which engine + configuration gives you better tail latency and more predictable behavior when the queue starts to form.

MySQL vs MariaDB under a traffic spike: who fails first (and why)

Baseline reality: WordPress is not a database benchmark

WordPress isn’t OLTP purity. It’s lots of small reads, some writes, and a party trick called wp_options that can become your hottest table in the building. It also includes plugin SQL of… variable moral character.

What matters under spikes:

  • Connection handling: thread-per-connection vs pooling/thread pooling behavior.
  • InnoDB behavior under contention: row locks, metadata locks, purge, flushing.
  • Optimizer predictability: stable plans, good indexing decisions.
  • Observability and tooling: can you see what’s happening before the 504s become a full-on flameout?

MySQL: predictable if you keep it boring, dangerous if you keep it “default”

Modern MySQL (8.x) is solid for WordPress. It’s also unapologetically complex. The defaults are designed to start everywhere, not to survive your homepage hitting the front page of the internet.

Where MySQL tends to do well under spikes:

  • Stable core InnoDB with mature instrumentation (Performance Schema, sys schema).
  • Optimizer improvements over the years that often help with mixed workloads.
  • Replication tooling and ecosystem maturity in many orgs.

Where MySQL often collapses first (in real WordPress stacks):

  • Connection storms when PHP-FPM ramps workers and each opens a DB connection; you end up CPU-bound on threads and mutexes long before you’re “out of CPU” in the usual sense.
  • IO-bound stalls when buffer pool is undersized and redo log/flushing is tuned like a laptop.
  • Metadata lock pileups from DDL or long transactions (often from “harmless” admin tasks during peak).

MariaDB: sometimes more forgiving on concurrency, sometimes a compatibility trap

MariaDB began as a fork with a familiar face. Over time it became its own database with its own personality. For WordPress, the most practical differentiation you’ll see under spikes is around concurrency behavior (especially if you use MariaDB’s thread pool) and operational familiarity depending on what your team has run before.

Where MariaDB tends to do well under spikes:

  • Thread pool (shipped in standard MariaDB builds, unlike MySQL, where the thread pool plugin is an Enterprise feature) can reduce thread thrash under high connection counts by limiting active workers and scheduling work instead of spawning a thread per connection.
  • Operational knobs that teams sometimes find easier to work with, depending on distro packaging and defaults.

Where MariaDB collapses first (again, in the field):

  • “It’s drop-in MySQL” assumptions that break on edge-case SQL behavior, system variables, or tooling expectations—especially if you mix ecosystem components that assume MySQL 8 semantics.
  • Query plan surprises if you’re used to MySQL’s optimizer behavior and don’t validate with EXPLAIN after upgrades.
  • Replication/GTID mismatches across mixed environments during migrations, which turns “just add a replica” into “why is the replica angry?”

If you want the straight answer: under a sudden WordPress traffic spike, the first thing to collapse is usually connection management, not “MySQL vs MariaDB.” MariaDB with thread pool can stay upright longer in a connection storm. MySQL can do fine too, but it often needs you to be deliberate: cap PHP-FPM workers, use pooling (ProxySQL or similar), and stop pretending max_connections is a performance feature.
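
If you want to see what the thread pool option looks like in practice, here is a minimal my.cnf sketch for MariaDB. The values are illustrative starting points, not tuning advice, and the file path varies by distro:

# /etc/mysql/mariadb.conf.d/50-server.cnf (path varies by distro)
[mysqld]
thread_handling         = pool-of-threads  # switch from one-thread-per-connection to the thread pool
thread_pool_size        = 8                # roughly the number of CPU cores; caps concurrently executing threads
thread_pool_max_threads = 500              # hard ceiling for pool worker threads
max_connections         = 300              # still cap connections; the pool schedules work, it does not add capacity

Note that thread_handling is not a runtime-changeable setting, so this belongs in the “after the incident” pile, not the “during” one.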

Joke #1: A traffic spike is nature’s way of asking whether your “it works on staging” also pays rent.

Facts and history that still matter for your outage

  • Fact 1: MariaDB was forked from MySQL in 2009, when Oracle’s acquisition of Sun Microsystems (and thus MySQL) was announced, driven by concerns about MySQL’s future governance; the deal closed in 2010.
  • Fact 2: MySQL 8.0 introduced major changes—data dictionary overhaul, improved performance schema, and many optimizer enhancements—making “MySQL 5.7 behavior” a bad mental model.
  • Fact 3: MariaDB’s version numbers diverged deliberately (10.x) and are not directly comparable to MySQL 8.0; treating them as “higher means newer” leads to wrong expectations.
  • Fact 4: In many distros, “mysql” packages eventually became meta-packages pointing to either MySQL or MariaDB; teams have discovered they were running MariaDB by accident. This is fun exactly once.
  • Fact 5: WordPress historically leaned on MySQL compatibility, but modern WordPress doesn’t protect you from bad plugin SQL; the database engine won’t save you from a plugin that does full table scans on every request.
  • Fact 6: InnoDB became the default MySQL storage engine years ago, and almost all serious WordPress workloads should be InnoDB; MyISAM in production is a time capsule with sharp edges.
  • Fact 7: Query cache (old MySQL feature) was removed from MySQL 8 and is disabled/irrelevant in most modern setups; if someone suggests turning it on, they are citing folklore.
  • Fact 8: “More replicas” doesn’t fix write contention; WordPress has a write-heavy hotspot pattern (sessions, options, transient updates) that can bottleneck a primary even if reads are offloaded.
  • Fact 9: Thread-per-connection models can become CPU-scheduling bound well before CPU utilization looks “high”; the graph lies because the time is spent context switching and waiting.

Fast diagnosis playbook: find the bottleneck in minutes

This is the order that wins incidents. Not because it’s theoretically pure, but because it narrows the search fast.

First: prove where the timeout happens

  1. Gateway logs (Nginx/ALB/CDN): is it upstream timeout, or client-side?
  2. PHP-FPM status: are workers saturated, slowlog firing, or stuck on DB?
  3. DB: connections and active queries: are threads exploding, or are a few queries wedged?

Second: decide “connection storm” vs “slow queries” vs “locks”

  • Connection storm: many sleeping/connecting threads, high threads_connected, rising CPU sys time, lots of short queries timing out, PHP-FPM maxed out.
  • Slow queries / IO: buffer pool misses, high disk read latency, high InnoDB data reads, long query times without obvious locks.
  • Locks: “Waiting for table metadata lock,” “Waiting for record lock,” many queries blocked behind one writer, lock wait timeouts.
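
A quick way to see which of the three you’re in is to group non-idle threads by state; this works on both MySQL and MariaDB:

cr0x@server:~$ mysql -e "SELECT state, COUNT(*) AS threads FROM information_schema.processlist WHERE command <> 'Sleep' GROUP BY state ORDER BY threads DESC LIMIT 10;"

Hundreds of threads in near-empty states points at a connection storm; a pile of “Waiting for table metadata lock” or lock waits points at blocking; a handful of long-running “Sending data” threads points at slow queries or IO.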

Third: stop the bleeding safely

  • Shed load at the edge: enable cache, rate limit abusive endpoints, temporarily disable expensive routes (search, wp-cron via web, XML-RPC if applicable).
  • Cap concurrency: reduce PHP-FPM children if DB is drowning; letting infinite PHP workers hammer the DB is not “scaling.”
  • Kill the worst offenders: terminate the one query holding locks or running for minutes, but confirm it’s not a critical DDL or backup.
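
When you do have to kill something, find it first and end the statement rather than the whole connection where possible. A sketch; the thread id 4321 is hypothetical, substitute the real one, and double-check it isn’t a backup or DDL you actually need:

cr0x@server:~$ mysql -e "SELECT id, user, time, state, LEFT(info, 80) AS query FROM information_schema.processlist WHERE command='Query' AND time > 60 ORDER BY time DESC;"
cr0x@server:~$ mysql -e "KILL QUERY 4321;"   # ends the statement only; plain KILL 4321 drops the whole connection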

Paraphrased idea (John Allspaw): Reliability is a property of the whole system, not a single component.

Practical tasks: commands, outputs, and decisions (12+)

Everything below is designed to be runnable on a typical Linux host with WordPress + Nginx + PHP-FPM + MySQL/MariaDB. Adjust paths and service names for your distro.

Task 1: Confirm the 504 is an upstream timeout (not DNS, not client)

cr0x@server:~$ sudo tail -n 20 /var/log/nginx/error.log
2025/12/29 10:11:45 [error] 1187#1187: *28491 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.10, server: example.com, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock", host: "example.com"

What it means: Nginx waited for PHP-FPM and didn’t get headers in time. The DB may still be the cause, but the immediate choke is upstream latency.

Decision: Go to PHP-FPM saturation and slow request tracing next.

Task 2: Check PHP-FPM pool saturation quickly

cr0x@server:~$ sudo ss -s
Total: 1560 (kernel 0)
TCP:   932 (estab 210, closed 590, orphaned 0, timewait 560)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       12        10        2
TCP       342       290       52
INET      354       300       54
FRAG      0         0         0

What it means: High timewait and lots of TCP churn can hint at connection storms (either PHP↔DB or client↔web), but it’s not proof.

Decision: Inspect PHP-FPM status and slowlog, then DB threads.

Task 3: Check PHP-FPM process pressure

cr0x@server:~$ ps -o pid,etime,pcpu,pmem,cmd -C php-fpm8.2 --sort=-pcpu | head
  PID     ELAPSED %CPU %MEM CMD
 2261    01:12:08 18.3  1.6 php-fpm: pool www
 2377    00:45:12 16.9  1.5 php-fpm: pool www
 2410    00:21:33 15.2  1.5 php-fpm: pool www
 2011    02:10:40  0.3  0.6 php-fpm: master process (/etc/php/8.2/fpm/php-fpm.conf)

What it means: Several workers are consuming CPU steadily. Note that ELAPSED is process lifetime, not time spent on the current request, so ps alone can’t tell you whether a worker is stuck on the DB, the filesystem, an external HTTP call, or CPU-heavy PHP.

Decision: Enable/check PHP-FPM slowlog or status page to identify where time is spent.

Task 4: Find “server reached pm.max_children” in PHP-FPM logs

cr0x@server:~$ sudo grep -R "max_children" -n /var/log/php8.2-fpm.log | tail -n 5
[29-Dec-2025 10:11:44] WARNING: [pool www] server reached pm.max_children setting (40), consider raising it

What it means: Requests are queueing at PHP-FPM. Raising max_children may help, or may just increase DB pressure and make the outage worse.

Decision: If DB is already saturated, do not raise it blindly. Confirm DB headroom first.

Task 5: Check database identity and version (don’t guess)

cr0x@server:~$ mysql -NBe "SELECT VERSION(), @@version_comment;"
10.6.18-MariaDB-0ubuntu0.22.04.1	Ubuntu 22.04

What it means: You’re on MariaDB 10.6, not Oracle MySQL. This impacts available variables, performance schema behavior, and tooling assumptions.

Decision: Use engine-appropriate tuning/metrics. Don’t paste MySQL 8-only advice into MariaDB.

Task 6: Check connection count and thread state distribution

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_cached    | 18    |
| Threads_connected | 420   |
| Threads_created   | 9812  |
| Threads_running   | 64    |
+-------------------+-------+

What it means: 420 connections and 64 running threads. Threads_created is high, suggesting churn or insufficient thread cache (or connection storms). Under spikes, this can become a scheduler fight.

Decision: If connections scale with traffic linearly, prioritize connection pooling/capping (ProxySQL, persistent connections with care, or fewer PHP workers).

Task 7: Identify the top running queries right now

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;" | head -n 15
Id	User	Host	db	Command	Time	State	Info
1283	wp	wpapp:42110	wpdb	Query	12	Sending data	SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'
1290	wp	wpapp:42118	wpdb	Query	11	Waiting for table metadata lock	ALTER TABLE wp_posts ADD INDEX idx_post_date (post_date)
1299	wp	wpapp:42130	wpdb	Query	10	updating	UPDATE wp_options SET option_value='...' WHERE option_name='_transient_timeout_x'

What it means: You have an ALTER TABLE waiting on a metadata lock and an UPDATE on wp_options that has been running (or waiting on a row lock) for 10 seconds. Meanwhile, many requests are slamming wp_options.

Decision: Stop the DDL during peak (kill it if safe), then address wp_options access patterns and autoload bloat.

Task 8: Confirm metadata lock contention (MySQL and MariaDB)

cr0x@server:~$ mysql -e "SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS FROM performance_schema.metadata_locks WHERE LOCK_STATUS='PENDING' LIMIT 10;"
+--------------+-------------+-----------+-------------+
| OBJECT_SCHEMA| OBJECT_NAME | LOCK_TYPE | LOCK_STATUS |
+--------------+-------------+-----------+-------------+
| wpdb         | wp_posts    | EXCLUSIVE | PENDING     |
+--------------+-------------+-----------+-------------+

What it means: Something wants an exclusive metadata lock (typically DDL) and is waiting, often blocking or being blocked by long queries/transactions. Note that this table only has data if performance_schema is enabled; it is by default on MySQL, but not on MariaDB.

Decision: Find the blocker (long transaction) and either wait it out or kill it. Move DDL to off-peak and use online schema change tooling if needed.

Task 9: Measure InnoDB buffer pool pressure

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+-----------+
| Variable_name                         | Value     |
+---------------------------------------+-----------+
| Innodb_buffer_pool_read_requests      | 986543210 |
| Innodb_buffer_pool_reads              | 5432109   |
+---------------------------------------+-----------+

What it means: Innodb_buffer_pool_reads are reads from disk. If this climbs rapidly during spikes, you’re missing cache and paying IO latency per request.

Decision: Increase buffer pool (within RAM constraints), reduce working set (autoload cleanup, indexes), or add caching layers to reduce DB reads.
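
If the miss rate climbs during spikes and you have RAM to spare, resizing the buffer pool is a my.cnf change (and recent MySQL and MariaDB versions can resize it online). A minimal sketch with illustrative numbers for a host that dedicates most of 16 GB of RAM to the database:

# /etc/mysql/conf.d/innodb.cnf (illustrative path and sizes)
[mysqld]
innodb_buffer_pool_size      = 10G   # commonly 50-70% of RAM on a dedicated DB host
innodb_buffer_pool_instances = 8     # mainly matters for large pools on older versions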

Task 10: Check redo log / checkpoint pressure signals

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written'; SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';"
+------------------------+------------+
| Variable_name          | Value      |
+------------------------+------------+
| Innodb_os_log_written  | 8123456789 |
+------------------------+------------+
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_log_waits | 421   |
+------------------+-------+

What it means: Non-zero and increasing Innodb_log_waits indicates sessions waiting for redo log space/flushing. Under spikes, this can tank write latency and cascade into timeouts.

Decision: Revisit redo log sizing and IO capacity; reduce write amplification (plugin behavior, transient churn), and ensure storage latency is sane.
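
Which knob you touch depends on engine and version, which is exactly the kind of detail that bites after a “drop-in” migration. A sketch of both forms; sizes are illustrative and need to match your write rate and storage:

# MariaDB (and MySQL before 8.0.30): set in my.cnf, typically requires a restart
[mysqld]
innodb_log_file_size = 2G

# MySQL 8.0.30+ replaced it with a single setting that can be changed at runtime
cr0x@server:~$ mysql -e "SET GLOBAL innodb_redo_log_capacity = 4294967296;"   # 4 GiB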

Task 11: Identify lock waits (transactions blocking others)

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
TRANSACTIONS
------------
Trx id counter 4892011
Purge done for trx's n:o < 4891900 undo n:o < 0 state: running
History list length 3212
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 4892007, ACTIVE 19 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 1299, OS thread handle 140312351889152, query id 712381 wpapp 10.0.0.12 wpdb updating
UPDATE wp_options SET option_value='...' WHERE option_name='_transient_timeout_x'

What it means: A transaction has been active for 19 seconds and is updating wp_options. History list length suggests purge lag; long transactions can block purge and increase undo pressure.

Decision: Find what holds long transactions (batch jobs, backups, admin screens), fix them, and consider lowering transaction scope in the app (hard with WordPress, but plugins can be fixed or removed).

Task 12: Check for autoloaded options bloat (WordPress hotspot)

cr0x@server:~$ mysql -NBe "SELECT COUNT(*) AS autoload_rows, ROUND(SUM(LENGTH(option_value))/1024/1024,2) AS autoload_mb FROM wp_options WHERE autoload='yes';"
4123	18.74

What it means: 18.74 MB of autoloaded options loaded frequently. That’s not “a little overhead.” That’s a tax on every uncached request.

Decision: Reduce autoload set (fix plugins, move large blobs out, set autoload=no where safe). This can be more impactful than swapping MySQL for MariaDB.
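
To see which options deserve the blame, list the largest autoloaded rows; the schema name wpdb below matches the processlist output earlier and is an assumption for your install:

cr0x@server:~$ mysql wpdb -e "SELECT option_name, ROUND(LENGTH(option_value)/1024,1) AS kb FROM wp_options WHERE autoload='yes' ORDER BY LENGTH(option_value) DESC LIMIT 10;"

For rows you have verified are not needed on every page load, flipping autoload to 'no' (or removing the plugin that wrote them) is the fix.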

Task 13: Confirm storage latency (because databases have feelings)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 	12/29/2025 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.22    0.00    9.31   14.70    0.00   57.77

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
nvme0n1         180.0   8200.0     0.0    0.0    6.20    45.6     95.0   9100.0     5.0    5.0   18.40    95.8    2.10  92.0

What it means: Write await is 18ms and device utilization is 92%. Under spikes, that’s enough to inflate query latency and create queues.

Decision: Either reduce write load (caching, tune flushing, fix plugins) or upgrade storage / separate DB volume / ensure proper RAID/NVMe configuration.

Task 14: Enable slow query logging temporarily (surgical, not forever)

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log=ON; SET GLOBAL long_query_time=0.5; SET GLOBAL log_queries_not_using_indexes=ON;"

What it means: You’ll capture queries slower than 0.5s and those not using indexes. A GLOBAL change to long_query_time only applies to connections opened after the change, and the log can get noisy; don’t leave it on forever at high volume.

Decision: Use it during the incident window, then analyze top offenders and turn it back down/off.
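
Reading a raw slow log mid-incident is miserable. If Percona Toolkit is available, summarize it instead; the log path in the second command is an assumption, so check what the server actually uses first (it may be relative to the datadir):

cr0x@server:~$ mysql -NBe "SELECT @@slow_query_log_file;"
cr0x@server:~$ pt-query-digest /var/log/mysql/mariadb-slow.log | head -n 50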

Task 15: Observe top tables by read/write pressure (InnoDB metrics)

cr0x@server:~$ mysql -e "SELECT * FROM sys.schema_table_statistics ORDER BY rows_read DESC LIMIT 5;"
+----------------+------------+-----------+----------------+----------------+------------+------------+------------+-------------------+-------------------+
| table_schema   | table_name | total_rows| rows_fetched   | fetch_latency  | rows_insert| rows_update| rows_delete| io_read_requests  | io_write_requests |
+----------------+------------+-----------+----------------+----------------+------------+------------+------------+-------------------+-------------------+
| wpdb           | wp_options | 52341     | 98122312       | 00:14:12.123456| 120        | 53210      | 8          | 842113            | 290112            |
| wpdb           | wp_postmeta| 984122    | 55122311       | 00:12:40.654321| 210        | 1201       | 12         | 721223            | 120991            |
+----------------+------------+-----------+----------------+----------------+------------+------------+------------+-------------------+-------------------+

What it means: It’s the usual suspects. (This view needs performance_schema enabled, which is off by default on MariaDB.) If wp_options and wp_postmeta dominate, fix WordPress patterns before buying bigger hardware.

Decision: Prioritize indexing, autoload hygiene, object caching, and plugin audits.

The usual collapse patterns (and what they look like)

Pattern A: Connection storm and thread thrash

This one is common when a post goes viral or a botnet starts “crawling” your dynamic endpoints. PHP-FPM scales up workers. Each worker opens a DB connection (or several). The DB spawns threads. CPU time goes into context switching, mutex contention, and just managing too many sessions.

What you see:

  • Threads_connected shoots up.
  • Threads_created rises fast.
  • CPU system time increases, load average climbs, but “useful work” doesn’t.
  • Many queries are short, but everything waits in line.

Who collapses first? Either. But MariaDB with thread pool can be more graceful here if configured, because it limits concurrent execution and reduces thrash. MySQL can match it if you cap concurrency at the application layer and/or add a proxy pool.
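
Capping at the application layer is mostly a PHP-FPM pool setting. A minimal sketch with illustrative numbers; the right max_children is whatever your database can serve at acceptable latency, measured rather than guessed:

; /etc/php/8.2/fpm/pool.d/www.conf (path varies by distro; values are illustrative)
pm = dynamic
pm.max_children      = 40    ; hard ceiling on concurrent PHP workers (and therefore DB connections)
pm.start_servers     = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 12
pm.max_requests      = 500   ; recycle workers periodically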

Pattern B: wp_options hotspot + autoload bloat

WordPress autoloaded options are loaded frequently. Add a few plugins that store large serialized arrays as autoload=yes, and you’ve built a tiny denial-of-service against yourself.

What you see:

  • SELECT option_name, option_value FROM wp_options WHERE autoload='yes' shows up constantly.
  • Buffer pool churn (if working set doesn’t fit RAM).
  • CPU climbs with query parsing and row processing.

Who collapses first? The one with less RAM, worse indexes, or worse caching strategy. Engine choice won’t save you from autoload abuse.

Pattern C: Lock pileups from writes and background tasks

Traffic spikes often coincide with more comments, logins, cart updates (if WooCommerce), and transient updates. Writes create locks. Long-running queries or “innocent” admin tasks can hold locks and block the world.

What you see:

  • Processlist states: Locked, Waiting for…
  • InnoDB status shows long transactions and growing history list length.
  • Lock wait timeouts and deadlocks.

Who collapses first? Again, both. The difference is operational: how quickly you can see the blocker and how safely you can mitigate without making it worse.
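
Seeing the blocker quickly is that operational difference. On MySQL 8, sys.innodb_lock_waits gives a readable waiter/blocker view; on MariaDB, the older information_schema InnoDB tables can be joined by hand. A sketch of both; availability depends on version, and the MySQL view needs performance_schema enabled:

cr0x@server:~$ mysql -e "SELECT waiting_pid, LEFT(waiting_query,60) AS waiting, blocking_pid, LEFT(blocking_query,60) AS blocking, wait_age FROM sys.innodb_lock_waits;"
cr0x@server:~$ mysql -e "SELECT r.trx_mysql_thread_id AS waiting_thread, b.trx_mysql_thread_id AS blocking_thread, LEFT(r.trx_query,60) AS waiting_query FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;"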

Pattern D: IO saturation (the quiet killer)

When disk latency climbs, everything slows. The database doesn’t “crash.” It just becomes a latency amplifier. PHP workers stack up. Nginx times out. You get 504s and a lot of confused opinions.

What you see:

  • iowait increases.
  • Disk %util high, await grows.
  • Innodb_buffer_pool_reads rises fast.
  • Checkpoint/log waits.

Joke #2: The database didn’t “go down.” It just entered a long, meaningful pause to reflect on your storage choices.

Three corporate mini-stories from the field

Mini-story 1: The incident caused by a wrong assumption

The company was mid-migration from an older MySQL to MariaDB because “it’s drop-in.” They did the sensible parts: replicated data, rehearsed cutover, validated application tests. They did the less sensible part: assumed their operational tooling would behave the same.

On launch day, traffic spiked—marketing did their job. WordPress started returning 504s. The on-call did the normal playbook: pull top queries from the monitoring dashboard, check replication delay, and inspect the usual MySQL status counters.

The dashboard looked oddly calm. Connections were “fine.” Query time was “fine.” Yet Nginx was timing out and PHP-FPM was screaming about max_children. The team chased the web tier for half an hour because the database graphs insisted everything was okay.

The root problem was simple and embarrassing: their exporter was pinned to MySQL-specific performance schema tables and returned partial metrics on MariaDB. The DB was saturated with lock waits and disk latency, but the graphs were lying by omission. They fixed it by deploying a MariaDB-compatible exporter and adding direct “SHOW PROCESSLIST sample” panels. The 504s were a database problem; the incident was an observability problem.

Mini-story 2: The optimization that backfired

A different org had a WordPress site that periodically got slammed by crawlers. Someone decided the fix was to increase concurrency everywhere: raise PHP-FPM max_children, raise MySQL max_connections, and bump the web worker count. The intent was noble: “handle more traffic.” The effect was more like “handle more suffering.”

The next spike hit and the DB fell over harder than before. Not a crash—worse. It stayed up and responded slowly. CPU wasn’t pegged at 100%, but latency went vertical. Threads_created exploded. The kernel spent its time context switching. The storage queue depth stayed high because writes were coming in faster than they could flush. Nginx returned 504s like it was paid per error.

The team rolled back the concurrency changes and the system stabilized. Later they implemented the unglamorous fix: cap PHP-FPM to a number the DB could sustain, add a connection pooler, and put caching in front of the heaviest endpoints. The lesson stuck: raising limits increases the size of the blast radius. It doesn’t increase capacity unless something else changed.

Mini-story 3: The boring but correct practice that saved the day

This team ran WordPress with a strict change calendar and a habit that everyone mocked: they rehearsed incident response and kept a “peak freeze” policy. During predicted traffic events (product launches, big media mentions), they did not deploy schema changes, plugin updates, or “quick admin tweaks.”

One afternoon, traffic tripled unexpectedly due to a social media surge. The site slowed, but it didn’t collapse. Their Nginx cache served most anonymous traffic. PHP-FPM had a hard cap. The DB had headroom because the working set fit in the buffer pool and the slow query log had been sampled and cleaned up weeks earlier.

They still got alerts: replication lag grew a bit, and p95 query latency climbed. But nothing hit the timeout thresholds. Their incident channel stayed boring, which is the highest compliment you can pay an SRE system.

The saving move wasn’t a magical database engine. It was operational discipline: capacity planning, caching, query hygiene, and refusing to do DDL at the worst possible time.

Common mistakes: symptom → root cause → fix

  • Symptom: 504s spike, PHP-FPM logs show max_children reached, DB CPU looks “not that high.”
    Root cause: Connection storm and thread scheduling overhead; latency inflation without CPU pegging.
    Fix: Cap PHP-FPM, add connection pooling, reduce max_connections to force backpressure, and cache anonymous traffic.
  • Symptom: Many queries stuck in “Waiting for table metadata lock.”
    Root cause: DDL running during peak, blocked by long transactions (or blocking others).
    Fix: Stop DDL during peak; find and end long transactions; schedule online schema change off-peak.
  • Symptom: Spikes correlate with disk %util near 100% and high await; DB feels “slow everywhere.”
    Root cause: IO saturation and insufficient buffer pool; redo/checkpoint pressure.
    Fix: Increase buffer pool, tune redo log sizing and flushing, reduce write amplification (transients), upgrade storage or isolate DB disk.
  • Symptom: A single page type (search, category pages) triggers timeouts while others are okay.
    Root cause: Specific slow queries (often wp_postmeta joins) missing indexes or doing filesorts/temporary tables.
    Fix: Capture slow queries, add targeted indexes, or change plugin/theme query patterns. Don’t “optimize everything.”
  • Symptom: After “moving to MariaDB/MySQL,” things got subtly worse, not catastrophically.
    Root cause: Optimizer plan changes, collation differences, or mismatched config defaults; metrics dashboards may be incomplete.
    Fix: Compare EXPLAIN plans, validate critical queries, standardize collations/charsets, update exporters and alert thresholds.
  • Symptom: Replicas are fine but primary is dying during spikes.
    Root cause: Write hotspot on primary (options/transients/sessions); replicas don’t help writes.
    Fix: Reduce writes (object cache, disable web wp-cron, adjust transient behavior), consider splitting workloads, and optimize hot tables.

Checklists / step-by-step plan

During the incident (stop the 504s)

  1. Confirm the timeout source: Nginx upstream timeouts vs PHP-FPM vs DB.
  2. Freeze change: no plugin updates, no schema changes, no “quick fixes” in wp-admin.
  3. Reduce dynamic load:
    • Enable/extend Nginx microcache for anonymous traffic if you have it.
    • Rate limit abusive endpoints (search, xmlrpc.php, wp-login.php) if that’s the pattern.
    • Disable wp-cron via web hits; run it via system cron if possible (see the sketch after this checklist).
  4. Cap concurrency: keep PHP-FPM workers at a number the DB can serve with acceptable latency.
  5. Find and kill the blockers: long DDL, metadata locks, runaway queries.
  6. Capture evidence: processlist samples, slow log snippet, InnoDB status, iostat.
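
The sketch referenced in step 3: taking wp-cron off the web path means disabling the page-view trigger in wp-config.php and scheduling it from system cron instead. The site path and the wp-cli binary location below are assumptions; adjust for your install:

// wp-config.php, above the "That's all, stop editing!" line
define('DISABLE_WP_CRON', true);

# site user's crontab (crontab -e): run due cron events every 5 minutes
*/5 * * * * cd /var/www/example.com && /usr/local/bin/wp cron event run --due-now --quiet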

After the incident (make it harder to repeat)

  1. Implement connection pooling if you see thread churn or high connections during spikes.
  2. Fix wp_options autoload: audit and reduce; remove oversized autoloaded blobs.
  3. Add/validate indexes for the top slow queries. Don’t guess; measure.
  4. Right-size InnoDB: buffer pool, redo logs, flush behavior according to storage.
  5. Improve observability: ensure metrics match engine, dashboards show active queries and lock waits.
  6. Practice peak hygiene: no schema changes during forecast traffic peaks; rehearse rollback.

FAQ

1) Is MariaDB faster than MySQL for WordPress?

Sometimes, in specific concurrency scenarios—especially if MariaDB thread pool is in play. But the bigger determinants are caching, query quality, buffer pool sizing, and controlling connection storms.

2) Will switching engines fix my 504s?

Rarely by itself. Most 504s under spikes come from unbounded concurrency, hot tables (wp_options), slow queries, or IO saturation. Switching engines without fixing those is a lateral move with new failure modes.

3) What’s the single best fix for WordPress spikes?

Cache anonymous traffic aggressively and cap dynamic concurrency. If your edge can serve 80–95% of requests without touching PHP/DB, spikes become boring.

4) Should I raise max_connections to stop “too many connections” errors?

Only if you’ve proven the DB can handle more concurrency. Otherwise you’re trading a fast failure for a slow death: more queued work, more lock contention, more memory pressure, more 504s.

5) Is ProxySQL worth it for WordPress?

If you have frequent spikes and a lot of short-lived connections, yes. It can smooth connection storms, enable routing to replicas, and give you a control point. It also adds complexity; operate it like a real tier, not a sidecar toy.
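
For scale, here is roughly what the moving parts look like: ProxySQL listens for the application on 6033 by default, you point WordPress at it, and backends and users are defined through its admin interface on 6032. A minimal sketch with placeholder hostnames and credentials, not a production config:

cr0x@server:~$ mysql -h 127.0.0.1 -P 6032 -u admin -p
-- inside the ProxySQL admin shell:
INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections) VALUES (0, 'db1.internal', 3306, 200);
INSERT INTO mysql_users (username, password, default_hostgroup) VALUES ('wp', 'REDACTED', 0);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;

WordPress then connects through the pool by setting DB_HOST to the proxy address and port 6633-style host:port syntax in wp-config.php, for example '127.0.0.1:6033'.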

6) Do replicas help with 504s?

They help if reads are your bottleneck and you can actually send reads to replicas safely. They don’t help write hotspots, lock contention on the primary, or slow writes due to disk.

7) Why does CPU look fine while the site times out?

Because latency can be dominated by waiting: IO waits, lock waits, scheduler thrash, flushing. Your CPU graph isn’t a truth oracle; it’s a hint.

8) What should I tune first in InnoDB for WordPress?

Buffer pool size (so reads hit RAM), redo log sizing appropriate for your write rate, and sane flushing. Then focus on query/index issues and autoload bloat—those are usually the real villains.

9) Is MySQL 8 a safer choice for compatibility?

If your tooling and ecosystem assume MySQL 8 semantics, yes, it’s often operationally simpler. MariaDB is compatible in many cases, but not identical; treat it as its own product and validate behaviors.

Next steps you can do this week

  1. Measure a spike: capture processlist samples, slow query logs (briefly), and IO stats during peak. Don’t tune blind.
  2. Control concurrency: set PHP-FPM caps based on DB capacity; consider a pooler if connections churn.
  3. Fix the WordPress hotspots: reduce autoloaded options size, audit plugins that hammer wp_postmeta, and add missing indexes.
  4. Make caching non-optional: page cache at the edge for anonymous users, object cache for database-heavy paths.
  5. Pick your engine based on operations: choose the one your team can monitor correctly, upgrade safely, and recover quickly. Performance is a feature; recovery is a lifestyle.