WordPress 504 Gateway Timeout: Is It the Database or PHP? How to Prove Which One

504s are the worst kind of downtime: not a clean crash, not a nice error page, just a proxy shrugging at your customers while your Slack fills with “site is spinning.” WordPress makes this extra fun because the failure can be in three places at once: the web proxy, PHP-FPM, and the database, all arguing about whose fault it is.

This is a production-minded approach to proving whether your 504 Gateway Timeout comes from the database layer (MySQL/MariaDB) or the PHP layer (PHP-FPM / mod_php). Not guessing. Not “I restarted it and it went away.” Evidence you can paste into an incident channel and make a decision from.

The only mental model you need for 504s

A 504 Gateway Timeout is almost never the real failure. It’s the messenger. The proxy (Nginx, Apache as a reverse proxy, Cloudflare, a load balancer) waited for an upstream response and ran out of patience.

In typical WordPress hosting, the request path looks like this:

  • Client → CDN/WAF (optional) → Nginx/Apache (reverse proxy)
  • Proxy → PHP handler (PHP-FPM or mod_php)
  • PHP → database (MySQL/MariaDB) and other dependencies (Redis, external APIs, SMTP, payment gateways)
  • Responses go back the same way

So when you get a 504, the question is: who did not answer in time? That “who” can be PHP (stuck, slow, saturated) or the database (slow queries, locks, IO stalls), or both with a causal chain (DB slow → PHP workers pile up → proxy times out).
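
To make "who did not answer" concrete, here is a minimal sketch of the Nginx-to-PHP-FPM handoff where the 504 is actually born. The socket path and the 60s value are illustrative assumptions, not a drop-in config:

# Illustrative fragment of an Nginx server block
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    # This is the upstream Nginx waits on. If PHP-FPM does not return
    # response headers within fastcgi_read_timeout, Nginx logs
    # "upstream timed out" and the client gets a 504.
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_read_timeout 60s;
}

Everything downstream of fastcgi_pass (your PHP code, MySQL, Redis, external APIs) is invisible to Nginx; all it sees is that the upstream did not answer in time.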

Here’s the production rule that saves hours: 504 is a queueing problem until proven otherwise. Something is queuing: requests waiting for PHP workers, PHP workers waiting for DB, DB waiting for disk, or everything waiting for a lock.

First short joke: A 504 is like a meeting that “ran out of time” — nobody admits they didn’t do the work, but everyone agrees to reschedule.

What “database problem” means (operationally)

“The database is slow” isn’t a diagnosis. Operationally, it means one of:

  • Queries are slow because they scan too much (missing indexes, bad query patterns).
  • Queries are slow because the DB is blocked (locks, metadata locks, long transactions).
  • Queries are slow because the DB can’t read/write fast enough (IO saturation, fsync stalls).
  • Queries are slow because the DB is CPU constrained (complex sorts, regex, heavy joins).
  • Queries are slow because connections are bottlenecked (max_connections, thread pool, connection storms).

What “PHP problem” means (operationally)

“PHP is slow” usually means one of:

  • PHP-FPM pool is saturated (all workers busy; requests queue).
  • Workers are stuck (deadlocks in code, external API calls with long timeouts, DNS issues).
  • Workers are dying / recycling (OOM kills, max_requests too low, memory leaks).
  • Opcode cache missing/misconfigured (every request compiles too much code).
  • File IO is slow (NFS stalls, EBS burst credits, overloaded storage).

The trick is not to debate these as theories. The trick is to gather enough signals to convict one layer.

Fast diagnosis playbook (check 1/2/3)

This is the “I have 10 minutes before management discovers the status page is also WordPress” sequence. You’re not optimizing; you’re determining the bottleneck and stopping the bleeding.

1) Start at the edge: confirm it’s an origin timeout, not a CDN tantrum

  • If Cloudflare/ALB returns 504, check whether the origin is reachable and whether the proxy logs show upstream timeouts.
  • If only some pages 504, suspect application/db. If all pages 504 including static assets, suspect web/proxy or network.

2) Check the proxy error log for upstream timeout details

  • Nginx will literally tell you: “upstream timed out” (PHP didn’t respond) or “connect() failed” (PHP down).
  • This doesn’t yet prove DB vs PHP; it proves “PHP didn’t answer the proxy.” Then you ask: was PHP waiting on DB?

3) Check PHP-FPM: queue length and max_children saturation

  • If PHP-FPM has a listen queue building and processes pegged at max_children, PHP is saturated (cause may still be DB).
  • If PHP has free workers but requests still time out, look for stuck calls (DB locks, external APIs, filesystem stalls).

4) Check MySQL/MariaDB: active queries, lock waits, and slow query spikes

  • If you see many threads “Waiting for table metadata lock” or long-running SELECT/UPDATE, you’ve got DB contention.
  • If you see high InnoDB fsync/IO waits and rising query times, suspect storage.

5) Make a stabilization move, not a random move

  • If DB locked: kill the blocker transaction, not the whole DB (unless you enjoy outage extensions).
  • If PHP saturated: temporarily scale PHP workers only if the DB can handle it; otherwise you just DDoS your own database.
  • If one plugin endpoint is melting things: rate limit or temporarily disable it (a rate-limit sketch follows this list).
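
For the rate-limit move, Nginx's limit_req module is usually already compiled in. A minimal sketch, assuming the hot endpoints are admin-ajax.php and xmlrpc.php and that 5 requests per second per client IP is tolerable for your site; the zone name and numbers are placeholders to tune:

# In the http {} block:
limit_req_zone $binary_remote_addr zone=wp_hot:10m rate=5r/s;

# In the server {} block; mirror the fastcgi lines from your main PHP location:
location = /wp-admin/admin-ajax.php {
    limit_req zone=wp_hot burst=10 nodelay;   # allow short bursts, reject the overflow with 503
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
}
location = /xmlrpc.php {
    return 403;                               # most sites can simply block it
}

A fast 503 from the limiter is a better failure mode than a 60-second hang followed by a 504: it sheds load instead of queueing it.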

Interesting facts and short history (why 504s look the way they do)

  1. 504 is defined in the HTTP spec as “Gateway Timeout”—it’s explicitly about intermediaries (proxies/gateways) timing out, not the origin application deciding to time out.
  2. Nginx popularized “upstream” language in logs and docs, and that wording shaped how modern teams debug: “which upstream?” became the first question.
  3. PHP-FPM became the default for many WordPress stacks because it isolates PHP processes, offers pool controls (max_children), and avoids Apache prefork’s memory bloat on busy sites.
  4. MySQL’s InnoDB replaced MyISAM for most WordPress installs because row-level locking and crash recovery matter when you have concurrent writes (comments, carts, sessions).
  5. WordPress’s schema is intentionally generic (postmeta, usermeta key/value tables). Flexible, yes. Also a performance footgun if you query meta without good indexes.
  6. Slow queries don’t always show up as slow pages until concurrency increases. One 1-second query is annoying. One 1-second query repeated 200 times becomes a denial-of-service you paid for.
  7. Timeout defaults are rarely aligned: CDN timeout, proxy timeout, fastcgi timeout, PHP max_execution_time, and MySQL net_read_timeout can all disagree. Misalignment produces “mystery” 504s.
  8. Index changes can lock tables longer than you expect, especially with large tables and certain ALTER TABLE operations. That shows up as sudden metadata locks and cascading 504s.
  9. Historically, “just add workers” worked when CPUs were cheap and DB load was light. At scale, that approach turns into self-inflicted thundering herds against the database.

What “proof” looks like: establishing blame without vibes

“It’s the database” is a claim. “It’s PHP” is a claim. Proof is a chain of timestamps and correlated evidence that shows where time is spent.

In a clean incident writeup, you want at least two independent signals that point to the same culprit:

  • Proxy layer evidence: upstream timeouts, upstream response times, 499/504 ratios, error spikes.
  • PHP layer evidence: PHP-FPM status (active/idle, max_children hit), slowlog stack traces, worker CPU time, queue length.
  • DB layer evidence: slow query log correlated with incident window, lock waits, long transactions, IO waits, high threads running.
  • Host evidence: CPU steal, load average vs runnable threads, iowait, disk latency, network retransmits.

Also, don’t overlook the dependency chain: PHP can be “the one that timed out,” while the database is “the one that caused it.” Your job is to identify the first constrained resource in the chain.

Second short joke: Restarting services to fix a 504 is like turning the radio up to fix a weird engine noise — it changes your feelings, not the physics.

The reliability mindset (one quote)

“Hope is not a strategy.” The line is repeated so often in reliability circles that nobody really owns it anymore; treat it as operational folklore rather than a precise citation.

Translation: gather evidence, then act.

Practical tasks (commands, outputs, decisions)

These are real tasks you can run on a typical Linux WordPress host. Each includes: command, what the output means, and what decision you make next. Run them in order if you’re panicking; cherry-pick if you’re calm.

Task 1: Confirm the 504 and measure where time is spent (client-side)

cr0x@server:~$ curl -sS -o /dev/null -w 'code=%{http_code} ttfb=%{time_starttransfer} total=%{time_total}\n' https://example.com/
code=504 ttfb=60.001 total=60.002

Meaning: Time-to-first-byte (TTFB) is basically equal to total time, and it dies at ~60s. That smells like a proxy timeout, not an app-generated error page.

Decision: Go to proxy logs to find the upstream that timed out and the configured timeout value.

Task 2: Check Nginx error log for “upstream timed out” (proxy evidence)

cr0x@server:~$ sudo tail -n 30 /var/log/nginx/error.log
2025/12/26 11:18:41 [error] 1842#1842: *991 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.10, server: example.com, request: "GET / HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock", host: "example.com"

Meaning: Nginx waited for PHP-FPM and didn’t get response headers in time. This proves the proxy isn’t the slow part; it’s waiting.

Decision: Inspect PHP-FPM saturation and slow requests. Still not enough to convict DB; PHP may be stuck waiting on DB.

Task 3: Confirm Nginx fastcgi timeout settings (avoid chasing ghosts)

cr0x@server:~$ sudo nginx -T 2>/dev/null | grep -E 'fastcgi_read_timeout|proxy_read_timeout|send_timeout' | head
fastcgi_read_timeout 60s;
send_timeout 60s;

Meaning: The 60-second cutoff matches your curl timing. Good. You’re debugging the right window, not some client-side timeout.

Decision: Don’t raise timeouts yet. First figure out what’s taking >60 seconds. Raising timeouts hides outages and makes queues bigger.

Task 4: Check PHP-FPM pool status (are you out of workers?)

cr0x@server:~$ sudo ss -lxp | grep php-fpm
u_str LISTEN 0 128 /run/php/php8.2-fpm.sock  44123  * 0 users:(("php-fpm8.2",pid=1640,fd=9))

Meaning: PHP-FPM is listening. If this line is missing, PHP-FPM is down or socket path differs.

Decision: If missing: restore PHP-FPM service. If present: check whether it’s saturated and queueing.

Task 5: Inspect PHP-FPM for max_children saturation (the classic bottleneck)

cr0x@server:~$ sudo grep -R "pm.max_children" /etc/php/8.2/fpm/pool.d/*.conf
/etc/php/8.2/fpm/pool.d/www.conf:pm.max_children = 20

Meaning: You have at most 20 concurrent PHP requests in this pool. That may be fine or wildly low, depending on request time and traffic.

Decision: Next, check whether those 20 are all busy and whether requests are queueing.

Task 6: Read PHP-FPM logs for “server reached pm.max_children”

cr0x@server:~$ sudo tail -n 30 /var/log/php8.2-fpm.log
[26-Dec-2025 11:18:12] WARNING: [pool www] server reached pm.max_children setting (20), consider raising it
[26-Dec-2025 11:18:13] WARNING: [pool www] server reached pm.max_children setting (20), consider raising it

Meaning: PHP-FPM is saturated. Requests are queueing. This is hard evidence that PHP capacity is a limit.

Decision: Determine whether PHP is slow due to its own CPU/work or waiting on DB/IO. Raising max_children blindly can crush the DB.
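
If log warnings are not enough, PHP-FPM can expose live counters (active processes, idle processes, listen queue) through its status page. A minimal sketch, assuming you can edit the pool config, reload PHP-FPM, and add a locked-down Nginx location; the /fpm-status path and allow list are placeholders:

# /etc/php/8.2/fpm/pool.d/www.conf
pm.status_path = /fpm-status

# Nginx: expose it to localhost/monitoring only
location = /fpm-status {
    access_log off;
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
}

cr0x@server:~$ curl -s http://127.0.0.1/fpm-status

In the output, a "listen queue" above zero means requests are waiting for a free worker, "active processes" pinned at max_children confirms saturation, and "max children reached" counts how many times the pool hit the ceiling since the last reload. Adjust the curl host/port to whichever server block you put the location in.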

Task 7: Check current PHP-FPM process count and CPU usage

cr0x@server:~$ ps -o pid,pcpu,pmem,etime,cmd -C php-fpm8.2 --sort=-pcpu | head
  PID %CPU %MEM     ELAPSED CMD
 1721 62.5  2.1       01:12 php-fpm: pool www
 1709 55.2  2.0       01:11 php-fpm: pool www
 1698 48.9  1.9       01:10 php-fpm: pool www

Meaning: If workers show high CPU, PHP may be doing heavy work (or stuck in tight loops). If workers show low CPU but long elapsed time, they’re likely waiting on IO (DB, disk, network).

Decision: If CPU high: profile/optimize PHP or reduce work (plugins, caching). If CPU low but time high: check DB and IO next.

Task 8: Enable or read PHP-FPM slowlog to capture stack traces of slow requests

cr0x@server:~$ sudo grep -R "slowlog\|request_slowlog_timeout" /etc/php/8.2/fpm/pool.d/www.conf
request_slowlog_timeout = 10s
slowlog = /var/log/php8.2-fpm.slow.log
cr0x@server:~$ sudo tail -n 20 /var/log/php8.2-fpm.slow.log
[26-Dec-2025 11:18:39]  [pool www] pid 1721
script_filename = /var/www/html/index.php
[0x00007f2f0c...] mysqli_query() /var/www/html/wp-includes/wp-db.php:2056
[0x00007f2f0c...] query() /var/www/html/wp-includes/wp-db.php:1945
[0x00007f2f0c...] get_results() /var/www/html/wp-includes/wp-db.php:2932

Meaning: This is the smoking gun when it appears: PHP is slow because it’s inside a database call. If you see curl_exec(), file_get_contents(), or DNS functions instead, the culprit is elsewhere.

Decision: If slowlog shows DB calls: move to MySQL diagnostics immediately. If it shows external HTTP calls: isolate that plugin/service and add timeouts/circuit breakers.

Task 9: Check MySQL thread states (locks and long runners)

cr0x@server:~$ sudo mysql -e "SHOW FULL PROCESSLIST\G" | egrep -A2 "State:|Time:|Info:" | head -n 40
Time: 58
State: Waiting for table metadata lock
Info: ALTER TABLE wp_postmeta ADD INDEX meta_key (meta_key)
Time: 55
State: Sending data
Info: SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts LEFT JOIN wp_postmeta ...

Meaning: “Waiting for table metadata lock” is a red flag: schema change or long DDL is blocking reads/writes. “Sending data” for a long time suggests big scans or slow IO.

Decision: If metadata lock: find and kill the blocking DDL or schedule properly. If long scans: check slow query log, indexes, and buffer pool pressure.
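
To go from "something is blocking" to a specific thread you can act on, the processlist is enough even without extra tooling. A minimal sketch; the id 512 in the KILL statement is hypothetical, so substitute the blocker's id from the first query:

cr0x@server:~$ sudo mysql -e "SELECT id, user, time, state, LEFT(info, 80) AS query FROM information_schema.processlist WHERE command <> 'Sleep' ORDER BY time DESC LIMIT 10;"
cr0x@server:~$ sudo mysql -e "KILL 512;"

On MySQL 5.7+/8.0 with the sys schema and metadata-lock instrumentation enabled, sys.schema_table_lock_waits shows waiter/blocker pairs directly, including a ready-made statement in its sql_kill_blocking_connection column. Kill the blocker (the DDL or the long transaction holding the lock), not the waiters; killing waiters only reshuffles the queue.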

Task 10: Check InnoDB status for lock waits and IO stalls

cr0x@server:~$ sudo mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
=====================================
2025-12-26 11:18:45 0x7f0a4c2
TRANSACTIONS
------------
Trx id counter 12904421
Purge done for trx's n:o < 12904400 undo n:o < 0 state: running
History list length 1987
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 12904388, ACTIVE 62 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 418, OS thread handle 139682..., query id 9812 10.0.0.15 wpuser updating
UPDATE wp_options SET option_value='...' WHERE option_name='woocommerce_sessions'

Meaning: Long active transactions and growing history list length indicate purge lag and potential contention. Updates to hot tables (options, sessions) often cause pileups.

Decision: If you see one long transaction blocking many: identify it and consider killing it (carefully). Then fix the app behavior that holds locks too long.

Task 11: Enable and inspect MySQL slow query log (time-correlated proof)

cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'slow_query_log%'; SHOW VARIABLES LIKE 'long_query_time';"
+---------------------+------------------------------+
| Variable_name       | Value                        |
+---------------------+------------------------------+
| slow_query_log      | ON                           |
| slow_query_log_file | /var/log/mysql/mysql-slow.log|
+---------------------+------------------------------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| long_query_time | 1.000 |
+-----------------+-------+
cr0x@server:~$ sudo tail -n 25 /var/log/mysql/mysql-slow.log
# Time: 2025-12-26T11:18:21.123456Z
# Query_time: 12.302  Lock_time: 0.000 Rows_sent: 10  Rows_examined: 2450381
SELECT * FROM wp_postmeta WHERE meta_key = '_price' ORDER BY meta_value+0 DESC LIMIT 10;

Meaning: Rows_examined in the millions for a simple query is the kind of thing that turns traffic into timeouts. This is solid DB blame with timestamps.

Decision: Add/adjust indexes, rewrite queries (often plugin-driven), or introduce caching/search. Also check why this query spiked now (new plugin, feature, campaign).
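
Before and after any index or query change, let EXPLAIN tell you what the optimizer will actually do. A minimal sketch, assuming the database is named wordpress and uses the stock table prefix; the statement is the one from the slow log above:

cr0x@server:~$ sudo mysql wordpress -e "EXPLAIN SELECT * FROM wp_postmeta WHERE meta_key = '_price' ORDER BY meta_value+0 DESC LIMIT 10\G"

Read three columns: key (which index, if any, was chosen), rows (the optimizer's estimate of rows to examine, to compare against Rows_examined in the slow log), and Extra (Using filesort is expected here, because ORDER BY meta_value+0 is an expression no index can satisfy). If key is NULL or rows is in the millions, the fix is an index, a rewritten query, or caching, not a bigger server.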

Task 12: Watch real-time MySQL running threads and queries (is the DB drowning?)

cr0x@server:~$ sudo mysqladmin extended-status -i 2 | egrep "Threads_running|Questions|Slow_queries"
Threads_running            34
Questions                  188420
Slow_queries               912
Threads_running            37
Questions                  191102
Slow_queries               925

Meaning: Rising Threads_running under load means concurrency is piling up inside MySQL. If Threads_running stays low but PHP times out, DB might not be the bottleneck.

Decision: If Threads_running is high: reduce query cost and contention; consider read replicas for read-heavy endpoints. If low: focus back on PHP/external calls/storage.

Task 13: Check host-level IO wait and disk latency (storage is often the quiet villain)

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.34    0.00    4.12   28.90    0.00   54.64

Device            r/s     w/s   rKB/s   wKB/s  await  svctm  %util
nvme0n1         120.0   210.0  6400.0  8200.0  48.20   1.10  98.00

Meaning: iowait near 30% and disk %util near 100% with high await means storage is saturated. MySQL will get slow even if CPU looks fine.

Decision: If storage is the bottleneck: fix IO (faster disk, better IOPS, reduce write amplification, tune InnoDB, move tmpdir, reduce logging volume). Don’t just add PHP workers.
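
One quick sanity check before blaming the disk alone: does the working set even fit in memory? A minimal sketch, assuming the WordPress schema is named wordpress; the sizes come from table statistics, so treat them as estimates:

cr0x@server:~$ sudo mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
cr0x@server:~$ sudo mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024) AS data_mb FROM information_schema.tables WHERE table_schema = 'wordpress';"

If hot data is several times larger than the buffer pool, InnoDB has no choice but to hit the disk constantly, and "storage is slow" is partly "memory is too small." Resize the buffer pool only if there is RAM headroom left for PHP-FPM and the OS page cache.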

Task 14: Check for OOM kills or kernel pressure (silent PHP death spiral)

cr0x@server:~$ sudo journalctl -k -n 50 | egrep -i "oom|killed process" | tail
Dec 26 11:17:59 server kernel: Out of memory: Killed process 1721 (php-fpm8.2) total-vm:1234567kB, anon-rss:456789kB, file-rss:0kB, shmem-rss:0kB

Meaning: If PHP workers are being OOM-killed, the proxy sees timeouts and resets. It can look like “random 504s.”

Decision: Reduce PHP memory usage (plugin bloat), cap per-process memory, adjust pool sizing, add RAM, and ensure swap isn’t a performance disaster.
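
To size the pool against real memory use instead of folklore, measure what a worker actually costs. A minimal sketch; the memory budget you reserve for MySQL and the OS is your call, and the numbers below are illustrative:

cr0x@server:~$ ps -o rss= -C php-fpm8.2 | awk '{sum+=$1; n++} END {if (n) printf "workers=%d avg_rss=%.0f MB\n", n, sum/n/1024}'
cr0x@server:~$ free -m

A rough ceiling: pm.max_children ≈ (RAM you are willing to give PHP) / (average worker RSS). For example, a 4096 MB budget at roughly 120 MB per worker suggests about 30 workers, not 200. RSS slightly overstates the cost because workers share OPcache memory and copy-on-write pages, so treat the result as a conservative bound.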

Task 15: Confirm WordPress debug logging is not making things worse

cr0x@server:~$ grep -n "WP_DEBUG" /var/www/html/wp-config.php
90:define('WP_DEBUG', false);
91:define('WP_DEBUG_LOG', false);

Meaning: Leaving WP_DEBUG_LOG enabled on a busy site can create heavy disk writes, turning a small issue into IO contention.

Decision: Keep debug logging off in production by default; enable temporarily with tight windows and log rotation when needed.

Task 16: Prove whether PHP is waiting on DB using strace (surgical, not for the faint-hearted)

cr0x@server:~$ sudo strace -p 1721 -tt -T -e trace=network,read,write,poll,select -s 80
11:18:40.101203 poll([{fd=12, events=POLLIN}], 1, 60000) = 0 (Timeout) <60.000312>

Meaning: A PHP worker blocked in poll/select for 60 seconds is waiting on network IO—often the database socket or an external HTTP service.

Decision: If it’s DB socket, focus on MySQL. If it’s external IP, fix that integration (timeouts, retries, circuit breaking, caching).
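
The file descriptor number from strace tells you exactly whom the worker is waiting for. A minimal sketch; the PID 1721 and fd 12 are taken from the strace output above:

cr0x@server:~$ sudo lsof -p 1721 -a -d 12

For a TCP connection lsof prints the remote address directly: port 3306 points at the database host, port 443 at whatever API a plugin calls synchronously. For a local Unix socket it shows only the socket type and inode, so cross-check whether wp-config.php sets DB_HOST to localhost (a Unix socket to local MySQL) and, if needed, match the socket inode against ss -xp output.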

You now have enough tools to prove where time is spent. Next: pattern recognition, because the signals cluster in predictable ways.

Database vs PHP: signal patterns that separate them

Pattern A: PHP-FPM max_children reached + PHP slowlog shows mysqli calls

Most likely culprit: database latency or locks causing PHP workers to pile up.

What it looks like:

  • Nginx: “upstream timed out while reading response header from upstream”
  • PHP-FPM log: “server reached pm.max_children”
  • PHP slowlog: stack traces in wp-db.php / mysqli_query()
  • MySQL: Threads_running elevated; slow query log entries spike; processlist shows long queries or lock waits

Do this: treat DB as the root cause. Reduce DB load first, then adjust PHP concurrency.

Pattern B: PHP-FPM max_children reached + PHP workers high CPU + DB looks calm

Most likely culprit: PHP-level work (template loops, expensive plugin logic, image processing, cache misses, poor object caching).

What it looks like:

  • PHP workers show high %CPU and long elapsed times
  • MySQL Threads_running modest, slow query log not spiking
  • Requests that time out are often specific endpoints (search, admin-ajax, product filters)

Do this: isolate the endpoint, add caching, confirm OPcache is enabled and sized sanely, and profile with sampling (not wall-of-shame full tracing during an outage).

Pattern C: PHP has idle workers, but requests still 504

Most likely culprit: proxy configuration mismatch, PHP socket backlog, upstream connectivity, or something outside PHP/DB (DNS, external API, filesystem).

What it looks like:

  • Nginx errors may show connect() failed, recv() failed, or intermittent upstream resets
  • PHP-FPM logs might show child exited, segfault, or nothing at all
  • Host logs might show OOM kills, disk stalls, or network issues

Do this: validate sockets, backlogs, kernel limits, and external dependencies; don’t tunnel-vision on MySQL.
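
Two of those "boring" limits are quick to check. A minimal sketch, assuming the Debian/Ubuntu paths used elsewhere in this article:

cr0x@server:~$ sudo grep -E '^listen|listen.backlog' /etc/php/8.2/fpm/pool.d/www.conf
cr0x@server:~$ sysctl net.core.somaxconn

listen.backlog is the queue of connections waiting for PHP-FPM to accept them, and the kernel caps it at net.core.somaxconn. If both are small, traffic bursts get refused at the socket before PHP ever sees the request, which tends to show up in Nginx as connect() or resource-temporarily-unavailable errors rather than clean timeouts.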

Pattern D: DB Threads_running high + IO wait high + disk await high

Most likely culprit: storage is limiting the DB, which is limiting PHP, which is causing 504.

Do this: fix IO. Sometimes the “database problem” is “we bought the cheapest disk.”

Pattern E: Sudden lock waits, especially metadata locks

Most likely culprit: DDL during peak, plugin migrations, or a “quick index change” performed in production without considering locking behavior.

Do this: stop the DDL, reschedule with online schema changes, and implement guardrails.
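
"Online" here means syntax or tooling that avoids holding the metadata lock for the duration of the table copy. A minimal sketch of both options; the index definition, database name, and table are placeholders, and either path deserves a rehearsal on a staging copy first:

cr0x@server:~$ sudo mysql wordpress -e "ALTER TABLE wp_postmeta ADD INDEX idx_meta_key_value (meta_key, meta_value(32)), ALGORITHM=INPLACE, LOCK=NONE;"
cr0x@server:~$ pt-online-schema-change --alter "ADD INDEX idx_meta_key_value (meta_key, meta_value(32))" D=wordpress,t=wp_postmeta --execute

The native form (ALGORITHM=INPLACE, LOCK=NONE) fails fast if the operation cannot run without locking, but it still takes a brief metadata lock at the start and end, so it can stall behind long transactions. The trigger-based tool trades that for extra write load during the copy. Either way: off-peak, with the processlist open in another terminal.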

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They had a WordPress marketing site plus a WooCommerce store, hosted on a “pretty beefy” VM. A sudden wave of 504s hit during a campaign launch. The on-call engineer looked at the Nginx error log and saw upstream timeouts to PHP-FPM. They assumed, confidently, that PHP-FPM needed more workers.

They raised pm.max_children. 504s got worse. The database CPU climbed, then disk latency spiked, then the whole site started failing in ways that weren’t even consistent. Now it wasn’t just checkout—homepage requests were timing out too.

The actual culprit was a single query pattern introduced by a “filter products by price” widget. It used postmeta in a way that scanned huge ranges, and it ran on every category page view. The database wasn’t healthy, but it was surviving when concurrency was capped. Increasing PHP workers increased the number of concurrent expensive queries. The DB hit IO saturation. Every request slowed down. Queueing exploded across the stack.

They stabilized by rolling back the widget, clearing caches, and returning PHP concurrency to sane levels. The postmortem wasn’t about “don’t scale PHP.” It was about not assuming “PHP timeout = PHP problem.” PHP was the victim. The database was the crime scene.

Mini-story 2: The optimization that backfired

A team tried to be clever: they moved WordPress uploads and parts of the codebase onto a network filesystem to “simplify deployments” across two web nodes. It worked fine in testing. Then production traffic hit and the 504s started—sporadic at first, then correlated with bursts (email campaigns, homepage features).

Everything looked normal: MySQL Threads_running wasn’t crazy, CPU was not pegged, PHP-FPM had capacity. But requests were still hanging long enough for Nginx to give up. Someone insisted it was “definitely the database” because WordPress is always the database. That confident claim survived for about half a day.

The turning point was capturing a PHP-FPM slowlog: stack traces were in file operations and autoloading paths, not in mysqli. At the same time, host metrics showed spikes in IO wait. The network filesystem had periodic latency stalls and occasional retransmits. PHP workers were idle from a CPU perspective, blocked on file reads.

The “optimization” (shared storage) reduced deployment friction but introduced a new latency dependency into every request. The fix was boring: local filesystem for code, object storage for uploads with aggressive caching, and a deployment mechanism that didn’t rely on shared POSIX semantics. Performance came back instantly, and the database—shockingly—was fine.

Mini-story 3: The boring but correct practice that saved the day

Another company ran WordPress as part of a larger platform. They weren’t perfect, but they had one habit that looked almost quaint: every tier had a standard set of dashboards and logs, and they kept timeouts aligned. Proxy timeout, fastcgi timeout, PHP max_execution_time, DB timeouts. All written down. All consistent.

One afternoon they saw a rise in 504s. The first responder checked Nginx error logs: upstream timeouts. Then they checked PHP-FPM: max_children was not hit, but slowlog showed wp-db calls. They jumped to MySQL and immediately saw lock waits around a migration plugin that had started an ALTER TABLE on a high-traffic table.

Because they had a boring practice—slow query log enabled with sane thresholds, and a change calendar—they knew exactly what changed and when. They stopped the migration, rescheduled it for off-peak with safer tooling, and the 504s cleared. No random restarts, no “let’s scale everything,” no week-long blame parade.

Their biggest win wasn’t a fancy observability tool. It was consistency: aligned timeouts and always-on baseline signals. In incident response, boring is a feature.

Common mistakes: symptom → root cause → fix

These are the patterns that waste the most time because they feel intuitive and are wrong in production.

1) Symptom: “Nginx shows upstream timed out, so PHP is broken.”

Root cause: PHP is fine; it’s waiting on MySQL locks or slow queries.

Fix: Use PHP-FPM slowlog to confirm where it’s stuck. If it’s in wp-db.php, go straight to MySQL processlist and InnoDB status; resolve locks and query cost.

2) Symptom: “Let’s increase fastcgi_read_timeout so customers stop seeing 504.”

Root cause: You’re masking latency; queues grow; eventually everything collapses and you get timeouts anyway, just slower.

Fix: Keep timeouts strict enough to surface failure quickly. Reduce tail latency by fixing the bottleneck and adding caching/rate limiting where appropriate.

3) Symptom: “Raising pm.max_children made things worse.”

Root cause: Database was the limiting resource; more PHP concurrency increased DB contention and IO.

Fix: Treat PHP concurrency as a load generator. Size it to what the DB can serve. Reduce expensive queries, add indexes, and cache hot reads.

4) Symptom: “Only wp-admin 504s, frontend looks okay.”

Root cause: Admin pages often trigger heavier queries (post listings with filters), plugin update checks, and cron-like behavior.

Fix: Capture slowlog for admin endpoints. Check for admin-ajax hot loops and plugin calls. Add object caching and audit plugins.

5) Symptom: “504s happen in bursts after traffic spikes.”

Root cause: Queueing collapse: cache misses, stampedes, or connection storms to DB.

Fix: Implement caching with stampede protection, ensure persistent DB connections are sane, and rate limit abusive endpoints (xmlrpc, wp-login, admin-ajax).

6) Symptom: “Database CPU is low, so DB can’t be the issue.”

Root cause: DB is IO bound or lock bound, not CPU bound.

Fix: Look at IO wait, disk await, buffer pool hit rate, and lock waits. CPU is only one way to be miserable.

7) Symptom: “Slow query log is empty, so queries aren’t slow.”

Root cause: slow_query_log is off, long_query_time is too high, or the problem is lock waits: time spent acquiring locks is not counted toward long_query_time, so a lock-bound statement can stay out of the log even when Lock_time is huge.

Fix: Enable slow query log with a realistic threshold (often 0.5–1s for busy sites). Correlate with lock waits and transaction duration.
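
Turning the log on does not require a restart. A minimal sketch; these are runtime globals and will not survive a MySQL restart unless you also persist them in my.cnf (or with SET PERSIST on MySQL 8.0):

cr0x@server:~$ sudo mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log'; SET GLOBAL long_query_time = 0.5;"

Note that long_query_time is picked up by new connections; long-lived or persistent connections keep their old threshold until they reconnect. Make sure the log path is writable by the MySQL user and covered by log rotation before you leave this on.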

8) Symptom: “It goes away after restart, so it was a memory leak.”

Root cause: Restart drains queues and clears locks; you didn’t fix the trigger (traffic pattern, query regression, DDL, external API stall).

Fix: Treat restarts as a temporary mitigation. Capture evidence before restarting: slowlog, processlist, iostat, error logs.

Checklists / step-by-step plan

Step-by-step: proving DB vs PHP in under 30 minutes

  1. Get a timestamped sample: run curl with timing to confirm the timeout duration and frequency.
  2. Check proxy error log: confirm upstream timed out and identify the upstream (php-fpm socket, upstream host).
  3. Check PHP-FPM saturation: look for max_children warnings and count busy workers.
  4. Check PHP slowlog: confirm whether slow requests are in mysqli calls or elsewhere.
  5. Check DB processlist: look for lock waits, long runners, and repeating expensive queries.
  6. Check InnoDB status: identify long transactions and blocking locks.
  7. Check slow query log: correlate spikes in Query_time/Rows_examined with the incident window.
  8. Check storage/host health: iowait, disk latency, OOM kills, CPU steal.
  9. Make one stabilization change: kill blocker, disable offending plugin endpoint, rate limit, or temporarily scale the right tier.
  10. Record what you saw: paste the key log lines and command outputs into the incident timeline.

Stabilization checklist (what to do during the incident)

  • Disable the single worst endpoint if possible (admin-ajax handlers, heavy search/filter pages).
  • Reduce concurrency at the load generator if DB is melting (cap PHP-FPM or add rate limiting at Nginx).
  • Stop schema changes immediately if they are locking hot tables.
  • If IO is saturated, stop background jobs that churn disk (backups, indexing, debug logs, file sync).
  • Prefer targeted kills (blocking transaction) over restarting MySQL blindly.

Hardening checklist (what to do after the incident)

  • Keep PHP-FPM slowlog configured and tested (not necessarily always noisy, but ready).
  • Keep slow query logging available with log rotation; tune long_query_time to your traffic reality.
  • Align timeouts across CDN/proxy/PHP/DB so symptoms are consistent and debuggable.
  • Add object cache (Redis) and verify it’s actually used by the WordPress install.
  • Audit plugins for query patterns (meta queries, wildcard searches, heavy admin-ajax).
  • Index responsibly: avoid random indexes; validate with EXPLAIN and measure Rows_examined.
  • Plan schema changes with low-lock tooling and off-peak windows.
  • Monitor disk latency and IO wait; don’t wait for it to become a “database incident.”

FAQ

1) If Nginx says “upstream timed out,” does that mean PHP is the problem?

No. It means Nginx didn’t get a response from the upstream (often PHP-FPM) in time. PHP might be waiting on MySQL, disk, or an external API. Use PHP-FPM slowlog to see where the code is stuck.

2) What’s the fastest way to prove the database is causing 504s?

Correlate three things in the same time window: PHP slowlog stack traces in mysqli functions, MySQL processlist showing long queries/lock waits, and slow query log entries spiking.

3) What’s the fastest way to prove PHP-FPM capacity is the bottleneck?

Find “server reached pm.max_children” warnings plus a growing listen queue/backlog, and confirm MySQL isn’t overloaded (Threads_running stable, no lock storms). If PHP is pegged CPU-wise, it’s doing too much work per request.

4) Should I increase fastcgi_read_timeout to stop 504s?

Only as a temporary containment measure, and only if you understand the queueing impact. Long timeouts can turn intermittent slowness into sustained saturation. Fix the long tail, don’t hide it.

5) How do I tell the difference between a lock problem and a slow query problem in MySQL?

Locks show up as “Waiting for … lock” states in processlist and as lock wait sections in InnoDB status. Slow queries show high Query_time and large Rows_examined in slow query logs, often with “Sending data” states.

6) Why do 504s happen mostly on WooCommerce checkout or cart?

Checkout touches write-heavy tables (sessions, orders, order meta) and can trigger external calls (payment gateways, tax/shipping APIs). That combination makes it sensitive to both DB contention and external latency.

7) Can Redis/object caching fix 504s?

It can, if your bottleneck is repeated read queries (options, postmeta lookups) and you have good cache hit rates. It won’t fix lock contention from heavy writes or a schema change blocking everything.

8) Why do I see 504s but MySQL CPU is low?

Because the DB can be IO bound (high disk await), lock bound (waiting), or network bound. Low CPU doesn’t mean healthy. Look at iowait, disk latency, and lock waits.

9) Is it safe to kill a MySQL query during an incident?

Sometimes it’s the right move—especially if one transaction is blocking many others. But be deliberate: identify the blocker, understand if it’s a critical write, and expect application errors for requests using that transaction.

10) What if neither DB nor PHP looks obviously bad?

Then suspect dependencies: DNS latency, external HTTP calls, filesystem stalls (network storage), kernel OOM kills, or proxy misconfiguration. PHP slowlog is your compass: it points to the function where time disappears.

Conclusion: next steps that actually reduce 504s

If you take one operational lesson from this: don’t debate whether it’s “database or PHP” in the abstract. Prove where the time is spent with timestamped evidence from the proxy, PHP-FPM, and MySQL. You’re building a causal chain, not a hunch.

Practical next steps:

  1. Keep PHP-FPM slowlog configured (with a sane threshold like 5–10 seconds) so you can capture stack traces during real incidents.
  2. Keep slow query logging usable (enabled or quickly enabled), and know where the log lives and how it rotates.
  3. Align timeouts so a request doesn’t die mysteriously at different layers with different clocks.
  4. Treat PHP concurrency as a lever with consequences: raising max_children increases load on the database. Confirm DB headroom first.
  5. Measure storage latency when “the database is slow.” IO is the hidden axis most WordPress stacks ignore until it’s on fire.
  6. After the incident, remove the trigger: fix the query, index properly, disable the plugin behavior, cache the expensive path, or redesign the endpoint.

504s aren’t mysterious. They’re just your infrastructure telling you, politely, that one part of your stack has stopped keeping up. The polite part ends when you ignore it.
