When WordPress throws a 502 or 504, the business doesn’t see “an upstream connection issue.” They see “checkout broken,” “ads down,” or “the CEO’s blog post won’t publish.” You see the other thing: Nginx sitting between the internet and PHP-FPM like a bouncer who sometimes forgets the guest list.
This is a field guide to the specific Nginx config mistakes that reliably generate 5xx errors on WordPress, how to confirm them with commands, and what to change without turning a transient incident into a full-time career.
Interesting facts and context (why this keeps happening)
- Nginx started as an answer to the C10k problem (handling 10,000 concurrent connections) and was built around event-driven I/O. WordPress, meanwhile, is a database-backed PHP app that loves synchronous work. Put them together and you get a very efficient web server exposing a very inefficient request path.
- “Bad Gateway” is not an Nginx error in the moral sense. It’s a statement: Nginx asked an upstream (PHP-FPM) for a response and didn’t get a usable one.
- Most WordPress 5xx incidents are self-inflicted: timeouts, buffer defaults, permission mismatches, and wrong assumptions about how PHP-FPM scales. Actual software bugs exist, but config mistakes are the repeat offenders.
- PHP-FPM is a process manager, not magic. If you give it too few workers, it queues. If you give it too many, it thrashes memory and dies. Both can look like “Nginx is broken.”
- HTTP/2 didn’t reduce server work for dynamic pages; it reduced connection overhead and improved multiplexing. One client can now fire many requests concurrently over one connection, which changes traffic shape and can expose FPM saturation faster.
- WordPress admin-ajax.php is a tiny URL with a big blast radius. Plugins use it for everything, including long-running tasks. If you treat it like a normal page request, you’ll discover 504s on “random” admin actions.
- Headers got bigger over the years. Cookie bloat from plugins, A/B testing, and analytics can trigger “upstream sent too big header” and manifest as 502/500 depending on exact failure path.
- Default timeouts are rarely aligned. Nginx has its timeouts, PHP-FPM has request timeouts, and your database has lock waits. Misalignment creates classic “it dies at exactly 60 seconds” mysteries.
One paraphrased idea often attributed to Werner Vogels (operations and reliability): Everything fails; the job is designing for failure and recovering quickly.
Joke #1: The only thing more reliable than a misconfigured Nginx upstream is a Slack channel filling with “anyone else seeing 502s?” within 30 seconds.
Fast diagnosis playbook (first/second/third/fourth checks)
When a WordPress site starts throwing 5xx, you don’t start by rewriting the whole server block. You triage. You identify whether the failure is routing, upstream health, capacity, or policy (permissions/limits). This playbook is the shortest path I know to “fix or isolate.”
First: identify which 5xx and where it’s generated
- 502/504 usually indicates upstream trouble (PHP-FPM, network, socket, timeouts).
- 500 can be upstream crash, script fatal, rewrite loop, or Nginx internal error.
- 503 is often rate limiting, maintenance mode, upstream marked down, or capacity exhaustion.
Second: check the error logs that tell the truth
- Nginx error log for upstream errors, buffer issues, rewrite loops, permission problems.
- PHP-FPM log for “server reached pm.max_children,” slow requests, segfaults, and killed workers.
- Kernel / systemd journal for OOM kills, restarts, file descriptor exhaustion.
Third: decide if this is a single-request bug or a capacity/queueing event
- Single-request bug: only one URL fails; other pages work; error repeats instantly. Think rewrite, permissions, script fatal, path traversal protection, specific plugin behavior.
- Capacity/queueing: widespread 504s, slow responses, rising upstream connect time, FPM maxed out, DB slow. Think pm tuning, timeouts, database locks, or a traffic spike.
Fourth: pick a safe mitigation
- Increase logging detail temporarily (not forever).
- Raise specific timeouts carefully only where it matches reality.
- Scale/raise FPM capacity if you can afford memory.
- Disable the one plugin route doing 2-minute work over HTTP.
Practical tasks: commands, expected output, and decisions (12+)
These are the commands I actually run when the pager goes off. Each includes what the output means and the decision you make from it. Run them on the Nginx host unless noted.
Task 1: Confirm Nginx config parses cleanly
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
What it means: Syntax is valid. This does not mean the config is correct, just that it’s parseable.
Decision: If this fails, stop. Fix syntax before chasing ghost 5xxs that are just a failed reload.
Task 2: Check whether Nginx actually loaded your latest config
cr0x@server:~$ sudo systemctl status nginx --no-pager
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled)
Active: active (running) since Fri 2025-12-27 09:12:41 UTC; 3h 10min ago
Docs: man:nginx(8)
Main PID: 1327 (nginx)
Tasks: 2 (limit: 18962)
Memory: 8.9M
CGroup: /system.slice/nginx.service
├─1327 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
└─1330 nginx: worker process
What it means: Nginx is running. The master command line can expose if it’s using an alternate config path.
Decision: If it’s not active or is crash-looping, fix that first. If you edited files but didn’t reload, you’re debugging yesterday’s config.
Task 3: Tail Nginx error log and reproduce a failing request
cr0x@server:~$ sudo tail -n 50 /var/log/nginx/error.log
2025/12/27 12:18:07 [error] 1330#1330: *481 upstream prematurely closed connection while reading response header from upstream, client: 203.0.113.19, server: example.com, request: "GET /wp-admin/ HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
2025/12/27 12:18:08 [error] 1330#1330: *482 connect() to unix:/run/php/php8.2-fpm.sock failed (13: Permission denied) while connecting to upstream, client: 203.0.113.19, server: example.com, request: "GET /wp-login.php HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
What it means: Two different root causes are already screaming: an upstream close (FPM worker died/timeout) and a socket permission problem.
Decision: Don’t guess. Pick the top repeating error line and fix that specific failure mode first.
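Finding the top repeating error line is mechanical. A sketch, using a tiny inline sample so the pipeline is self-contained; in real use, point it at /var/log/nginx/error.log:

```shell
# Rank error.log messages by frequency so you fix the top offender first.
# The inline sample below stands in for /var/log/nginx/error.log.
cat > /tmp/sample-error.log <<'EOF'
2025/12/27 12:18:07 [error] 1330#1330: *481 upstream prematurely closed connection while reading response header from upstream
2025/12/27 12:18:08 [error] 1330#1330: *482 connect() to unix:/run/php/php8.2-fpm.sock failed (13: Permission denied) while connecting to upstream
2025/12/27 12:18:09 [error] 1330#1330: *483 connect() to unix:/run/php/php8.2-fpm.sock failed (13: Permission denied) while connecting to upstream
EOF
# Strip the timestamp and connection id, then count identical messages;
# the most frequent message prints first.
sed -E 's/^[0-9/]+ [0-9:]+ \[error\] [0-9#]+: \*[0-9]+ //' /tmp/sample-error.log \
  | sort | uniq -c | sort -rn
```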
Task 4: Confirm PHP-FPM is up and listening where Nginx thinks it is
cr0x@server:~$ sudo systemctl status php8.2-fpm --no-pager
● php8.2-fpm.service - The PHP 8.2 FastCGI Process Manager
Loaded: loaded (/lib/systemd/system/php8.2-fpm.service; enabled)
Active: active (running) since Fri 2025-12-27 09:11:02 UTC; 3h 12min ago
Docs: man:php-fpm8.2(8)
Process: 1051 ExecStartPost=/usr/lib/php/php-fpm-socket-helper install /run/php/php8.2-fpm.sock /etc/php/8.2/fpm/pool.d/www.conf 82 (code=exited, status=0/SUCCESS)
Main PID: 1018 (php-fpm8.2)
Tasks: 18 (limit: 18962)
Memory: 312.4M
CGroup: /system.slice/php8.2-fpm.service
├─1018 php-fpm: master process (/etc/php/8.2/fpm/php-fpm.conf)
├─1032 php-fpm: pool www
└─1033 php-fpm: pool www
What it means: Service is alive. Still doesn’t confirm Nginx can reach the socket, just that FPM hasn’t vanished.
Decision: If inactive or restarting, investigate why (OOM, config error). If active, verify socket path and permissions next.
Task 5: Inspect the FastCGI socket permissions
cr0x@server:~$ ls -l /run/php/php8.2-fpm.sock
srw-rw---- 1 www-data www-data 0 Dec 27 09:11 /run/php/php8.2-fpm.sock
What it means: Only root and members of www-data group can connect. If Nginx workers run as nginx, they’ll get permission denied.
Decision: Either run Nginx as www-data (common on Debian/Ubuntu) or change the FPM pool socket owner/group to match your Nginx user. Don’t “chmod 777” your way into security incidents.
Task 6: Verify Nginx worker user and match it to the socket
cr0x@server:~$ grep -R "^\s*user\s" /etc/nginx/nginx.conf
user www-data;
What it means: Nginx workers run as www-data, so the socket above should be reachable.
Decision: If the Nginx user differs (e.g., nginx), change either Nginx user or FPM socket ownership; keep it consistent across hosts.
Task 7: See whether PHP-FPM is saturated (pm.max_children reached)
cr0x@server:~$ sudo grep -R "pm.max_children" /etc/php/8.2/fpm/pool.d/www.conf
pm.max_children = 20
cr0x@server:~$ sudo tail -n 30 /var/log/php8.2-fpm.log
[27-Dec-2025 12:21:44] WARNING: [pool www] server reached pm.max_children setting (20), consider raising it
What it means: Requests are queuing at FPM. Nginx sees slow upstreams and starts timing out or failing connections.
Decision: Either raise pm.max_children (if you have RAM headroom), reduce per-request cost (cache, DB), or scale horizontally. If you raise it blindly, you’ll trade 504s for OOM kills.
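A rough ceiling for pm.max_children is spare RAM divided by measured per-worker memory. A sketch with assumed numbers (4 GB spare, ~80 MB per worker — measure your own with something like ps before trusting the result):

```shell
# Back-of-envelope pm.max_children ceiling: spare RAM / per-worker RSS.
# Both numbers are illustrative assumptions, not recommendations.
avail_mb=4096        # RAM you are willing to dedicate to PHP-FPM
per_worker_mb=80     # observed average resident size of one worker
echo $(( avail_mb / per_worker_mb ))   # suggested upper bound for pm.max_children
```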
Task 8: Confirm the exact HTTP status and timing from the edge
cr0x@server:~$ curl -sS -o /dev/null -w "code=%{http_code} ttfb=%{time_starttransfer} total=%{time_total}\n" https://example.com/wp-admin/
code=504 ttfb=60.002 total=60.002
What it means: A 60-second wall suggests an Nginx timeout (fastcgi_read_timeout defaults to 60s) or an upstream that consistently stalls at that point.
Decision: If the failure is a neat round number, hunt timeouts and queueing. If it fails instantly, hunt permissions, missing sockets, or immediate PHP fatals.
Task 9: Validate your WordPress routing (try_files) isn’t wrong
cr0x@server:~$ sudo nginx -T 2>/dev/null | sed -n '/server_name example.com/,/}/p' | sed -n '1,140p'
server {
server_name example.com;
root /var/www/example.com/public;
location / {
try_files $uri $uri/ /index.php?$args;
}
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/run/php/php8.2-fpm.sock;
}
}
What it means: The canonical pattern is there: serve static if it exists, otherwise route to index.php with query args.
Decision: If you see try_files $uri /index.php; without $args, expect “random” app behavior and plugin breakage. If you see recursion (routing to a URI that again hits the same location), expect 500s from internal redirect loops.
Task 10: Check for “upstream sent too big header” (cookie bloat)
cr0x@server:~$ sudo grep -R "too big header" -n /var/log/nginx/error.log | tail -n 5
/var/log/nginx/error.log:1928:2025/12/27 11:04:15 [error] 1330#1330: *211 upstream sent too big header while reading response header from upstream, client: 198.51.100.77, server: example.com, request: "GET /wp-admin/ HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
What it means: Nginx couldn’t fit upstream headers into configured buffers. WordPress admin pages and plugins are common culprits due to large cookies and redirects.
Decision: Increase FastCGI buffers in the specific server/location; also reduce cookie bloat if possible. Don’t increase globally without reason; you’ll inflate memory under load.
Task 11: Find if you’re dropping uploads (413) but users report it as “5xx”
cr0x@server:~$ sudo grep -R "client intended to send too large body" -n /var/log/nginx/error.log | tail -n 3
/var/log/nginx/error.log:2210:2025/12/27 10:31:19 [error] 1330#1330: *302 client intended to send too large body: 134217728 bytes, client: 203.0.113.58, server: example.com, request: "POST /wp-admin/async-upload.php HTTP/2.0", host: "example.com"
What it means: That’s a 413, not 5xx, but in ticket land it becomes “upload fails, site broken.”
Decision: Set client_max_body_size to a sane value for your business and align it with PHP limits (upload_max_filesize, post_max_size).
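The alignment is one directive on the Nginx side plus two on the PHP side; 64m here is an example value, not a recommendation:

```nginx
# Nginx (http or server block): reject oversized bodies early with a clean 413.
client_max_body_size 64m;

# PHP must agree, or uploads die later with a more confusing error.
# In php.ini / the FPM pool (shown as comments because this is an Nginx file):
#   upload_max_filesize = 64M
#   post_max_size       = 64M
```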
Task 12: Detect OOM kills that look like random 502s
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | grep -i -E "oom|killed process" | tail -n 10
Dec 27 11:58:22 server kernel: Out of memory: Killed process 1033 (php-fpm) total-vm:1324080kB, anon-rss:512000kB, file-rss:0kB, shmem-rss:0kB
What it means: The kernel killed a PHP-FPM worker. Nginx reports upstream closed connection or 502. It feels intermittent because it depends on memory pressure.
Decision: Stop raising pm.max_children. Reduce worker count, fix memory leaks (plugins), add RAM, or isolate workloads. OOM is your server telling you “no” in the least diplomatic way.
Task 13: Confirm file descriptor limits (a quiet 5xx factory)
cr0x@server:~$ sudo cat /proc/$(pidof nginx | awk '{print $1}')/limits | grep "Max open files"
Max open files            1024                 524288               files
What it means: Soft limit is 1024. Under bursts of keepalive + HTTP/2 + upstream sockets, that’s not generous.
Decision: Raise the soft limit via systemd unit overrides and Nginx worker_rlimit_nofile. Then validate with load. If you ignore this, you’ll chase “random” upstream failures for weeks.
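A sketch of both halves of that fix; 65535 is an illustrative ceiling, not a tuned value:

```nginx
# /etc/nginx/nginx.conf (main context) — let workers open more files
# than the 1024 soft default.
worker_rlimit_nofile 65535;

# Pair it with a systemd override so the service-level limit matches:
#   /etc/systemd/system/nginx.service.d/limits.conf
#     [Service]
#     LimitNOFILE=65535
# then: systemctl daemon-reload && systemctl restart nginx
```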
Task 14: Check upstream connect failures (socket path wrong or not created)
cr0x@server:~$ sudo grep -R "fastcgi_pass" -n /etc/nginx/sites-enabled | head
/etc/nginx/sites-enabled/example.com.conf:42: fastcgi_pass unix:/run/php/php8.2-fpm.sock;
cr0x@server:~$ test -S /run/php/php8.2-fpm.sock; echo $?
0
What it means: Exit code 0 means the socket exists and is a socket file. If it returns 1, Nginx is pointing at a fantasy.
Decision: Fix the socket path or switch to TCP (127.0.0.1:9000) if you need cross-container/namespace connectivity. But keep it consistent and documented.
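If you do switch to TCP, it is one directive on each side. A sketch, assuming the FPM pool is reconfigured to listen on loopback port 9000:

```nginx
# Cross-container/namespace setups: point at TCP instead of a Unix socket.
# Assumes the FPM pool has: listen = 127.0.0.1:9000
fastcgi_pass 127.0.0.1:9000;
```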
Task 15: Identify slow PHP requests and match them to Nginx timeouts
cr0x@server:~$ sudo grep -R "request_slowlog_timeout" /etc/php/8.2/fpm/pool.d/www.conf
request_slowlog_timeout = 5s
cr0x@server:~$ sudo tail -n 20 /var/log/php8.2-fpm/www-slow.log
[27-Dec-2025 12:19:11] [pool www] pid 1099
script_filename = /var/www/example.com/public/wp-admin/admin-ajax.php
[0x00007f2b9c8a2e10] mysqli_query() /var/www/example.com/public/wp-includes/wp-db.php:2056
What it means: PHP is spending time in DB queries. Nginx is waiting. If your Nginx timeout is shorter than the slow request’s runtime, you’ll get 504s.
Decision: Fix the slowness (DB/index/plugin) before inflating timeouts. Bigger timeouts without capacity is how you get a slow-motion outage instead of a fast failure.
Common mistakes: symptoms → root cause → fix
This is the “I’m bleeding; what artery is it?” section. Read the symptom, validate with logs, apply the targeted fix.
1) 502 Bad Gateway immediately on PHP pages
Symptoms: Static assets load. Any .php route fails instantly. Nginx error log shows connect() ... failed (2: No such file or directory) or (13: Permission denied).
Root cause: Wrong fastcgi_pass socket path, missing socket (FPM not running), or socket permissions mismatch between Nginx user and FPM pool.
Fix: Align fastcgi_pass with the actual socket, ensure FPM is active, and set the FPM pool directives:

listen = /run/php/php8.2-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660

Reload FPM and Nginx.
2) 504 Gateway Timeout at consistent time boundaries (30s/60s/120s)
Symptoms: The site “works but sometimes times out,” often at a neat round number. Nginx error log shows upstream timed out.
Root cause: Misaligned timeouts: Nginx fastcgi_read_timeout too low for real request time, or FPM/DB queueing causes a request to sit and wait.
Fix: First find why requests are slow (FPM saturation, DB locks, plugin). Then set timeouts intentionally: Nginx timeouts should be slightly above your realistic worst-case for interactive requests. For long-running jobs, stop doing them synchronously over HTTP.
3) 500 Internal Server Error right after deploy/reload
Symptoms: Everything was fine, then reload, then 500s. Nginx error log mentions rewrite or internal redirection cycle or could not build the variables_hash or an include path error.
Root cause: Rewrite loops, broken include ordering, or invalid regex locations that catch more than intended.
Fix: Use the known-good WordPress location structure, keep rewrite rules minimal, and avoid “clever” regex that tries to parse WordPress. Nginx is a great router and a terrible PHP framework.
4) 502/500 on wp-admin only, front page fine
Symptoms: Homepage loads. Admin pages fail with buffer/header errors. Error log shows upstream sent too big header.
Root cause: Large upstream headers (usually cookies) from plugins or auth flows exceed FastCGI buffers.
Fix: Increase FastCGI buffering for that server via fastcgi_buffer_size and fastcgi_buffers. Also reduce cookie sprawl (disable the plugin that writes half a novel into cookies).
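A minimal sketch of targeted buffer tuning; the sizes are illustrative and should come from the header sizes you actually see in the error log:

```nginx
# Example sizes only — derive them from measured header sizes, per server block.
location ~ \.php$ {
    include snippets/fastcgi-php.conf;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_buffer_size 16k;  # must fit the entire upstream response header
    fastcgi_buffers 8 16k;    # body buffering while reading from FPM
}
```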
5) 503 Service Unavailable under load, then recovers
Symptoms: Spikes cause 503s, often paired with “limiting requests” logs or upstream failures. Sometimes only some clients get it.
Root cause: Rate limiting too aggressive, connection limiting per IP (bad fit for NATed clients), or upstream capacity collapse (FPM maxed out).
Fix: Tune rate limiting using real traffic distributions. Don’t punish everyone behind one corporate NAT. If the upstream is collapsing, fix capacity and caching first; rate limiting should protect, not permanently throttle.
6) Random 502s with “upstream prematurely closed connection”
Symptoms: Intermittent failures, worse under load. Nginx logs show upstream closed connection while reading response headers.
Root cause: PHP-FPM worker crashes, gets killed (OOM), hits request_terminate_timeout, or segfaults due to an extension.
Fix: Check kernel OOM logs, FPM logs, and PHP error logs. Stabilize memory, cap worker count, remove suspect extensions, and set sane terminate timeouts so you fail fast instead of failing weird.
7) 500/502 after enabling “microcache” or fastcgi_cache
Symptoms: Logged-in users see wrong pages, admin actions fail, occasional 5xx due to cache lock/stampede, weird headers.
Root cause: Caching dynamic/authenticated content incorrectly, caching responses that should bypass, or caching error pages.
Fix: Cache only anonymous GET/HEAD, bypass for cookies like wordpress_logged_in_ and admin paths, and avoid caching 5xx. Microcache can be great; it’s also a foot-gun with excellent marketing.
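A conservative bypass sketch, assuming a cache zone named wpcache (hypothetical) was declared elsewhere with fastcgi_cache_path; it lives in the server block alongside the PHP location:

```nginx
# Cache only anonymous GET/HEAD; bypass anything authenticated or admin-side.
set $skip_cache 0;
if ($request_method !~ "^(GET|HEAD)$")            { set $skip_cache 1; }
if ($http_cookie ~* "wordpress_logged_in_")       { set $skip_cache 1; }
if ($request_uri ~* "/wp-admin/|admin-ajax\.php") { set $skip_cache 1; }

fastcgi_cache        wpcache;       # hypothetical zone name
fastcgi_cache_bypass $skip_cache;   # don't serve these from cache
fastcgi_no_cache     $skip_cache;   # don't store them either
fastcgi_cache_valid  200 301 302 10s;   # cache successes only, and briefly
```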
8) 500 on specific URLs with “Primary script unknown”
Symptoms: Some PHP routes fail; error log or FPM log references Primary script unknown.
Root cause: Wrong root/fastcgi_param SCRIPT_FILENAME mapping, often from copy-pasting a generic PHP config that doesn’t match your directory layout.
Fix: Use distribution-provided FastCGI snippets when possible and ensure root points to the WordPress document root. Confirm actual filesystem path of index.php.
Deep dives: the usual offenders in Nginx + WordPress
FastCGI basics you can’t afford to hand-wave
Nginx does not “run PHP.” It speaks FastCGI to PHP-FPM. That conversation has three recurring failure modes:
- Connection failure (socket missing/wrong/perms) → instant 502.
- Upstream died mid-response (crash, OOM, terminate timeout) → 502 with “prematurely closed connection.”
- Upstream too slow (queueing, slow DB, long job) → 504.
If you treat them as the same thing, you’ll make the same fix repeatedly and wonder why the graph doesn’t move.
WordPress routing: the one try_files line that matters
WordPress wants “pretty permalinks” and expects that non-existent paths get routed to index.php. The clean Nginx approach is a try_files that checks for a real file, then a real directory, then hands off to /index.php?$args.
The mistakes that cause 5xx are almost always variations of:
- Dropping $args, which breaks query-string behavior and can trigger odd plugin logic, redirects, and (yes) loops under certain conditions.
- Incorrect root, so that /index.php doesn't exist where Nginx thinks it does. Nginx routes to PHP; PHP can't find the file; you see 500/502 patterns depending on config.
- Regex locations that intercept too much, especially ones trying to "secure WordPress" by blocking patterns. Blocklists are where good intentions go to become outages.
Timeouts: align the chain, don’t just raise numbers
Timeout tuning is where SREs get blamed for being “too cautious” right up until the day the server collapses slowly.
You have timeouts in multiple places:
- Nginx: client_header_timeout, client_body_timeout, send_timeout, plus FastCGI timeouts like fastcgi_connect_timeout, fastcgi_send_timeout, fastcgi_read_timeout.
- PHP-FPM: request_terminate_timeout and (optionally) max execution settings at the PHP level.
- PHP: max_execution_time (CLI differs from FPM), memory limits, and extension-level timeouts.
- Database: lock waits and query timeouts (or lack thereof).
If Nginx gives up at 60 seconds but FPM will keep a worker busy for 180 seconds, you’ve created a resource leak under load: requests keep running after the client has already been told “timeout.” That’s how you get a spiral: more timeouts → more stuck workers → more timeouts.
Do this instead:
- Set interactive request budgets (e.g., admin pages should be fast; background tasks should be asynchronous).
- Ensure Nginx timeout is slightly above the expected max for those interactive endpoints.
- Ensure FPM terminate timeout is slightly above Nginx so you can log slow scripts and still clean up.
- Move long work to queues/cron/CLI workers. WordPress can do this, but not with wishful thinking.
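Put together, an aligned chain might look like this; the 30-second interactive budget is an assumption for illustration:

```nginx
# One consistent budget, assuming a 30s worst case for interactive requests.
# Values are illustrative; derive yours from slow logs, not from this sketch.
location ~ \.php$ {
    include snippets/fastcgi-php.conf;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout    35s;  # slightly above the interactive budget
}

# PHP-FPM pool, slightly above Nginx so slow scripts get logged and reaped:
#   request_terminate_timeout = 40s
```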
Buffers: the hidden reason your admin area “randomly” dies
FastCGI buffers are basically “how much response header/body Nginx will hold while reading from upstream.” When they’re too small, Nginx can fail reading headers and return a 502.
WordPress admin responses can grow headers due to:
- Huge cookies (multiple plugins setting tracking or state cookies).
- Multiple Set-Cookie headers during auth flows.
- Long redirect chains or security plugins adding headers.
The correct move is targeted tuning with proof from logs. Increase buffers in the server block serving WordPress, validate memory impact, and then—this is the part everyone skips—trim the cookie bloat. You don’t need a cookie the size of a short story.
File permissions and ownership: the boring stuff that causes real outages
WordPress needs read access for PHP files and write access for uploads, caches, and sometimes plugin updates (depending on how you deploy). The most common outage is not “permissions wrong,” but “permissions changed on one host during an urgent manual fix.”
Watch for:
- Nginx worker user can’t traverse directories (execute bit missing).
- FPM pool runs as a different user than expected, and can’t read the code or write uploads.
- Deploy system creates files owned by a CI user with restrictive modes.
Fix with consistent ownership and a deployment model that doesn’t rely on the web server being able to modify code. If your production plan includes “the site updates itself,” you’ve accepted operational chaos as a feature.
HTTP/2 and concurrency: why “a few users” can melt PHP-FPM
HTTP/2 allows a single client to open many concurrent streams. That’s great for loading pages faster. It’s also a neat way for one browser tab (or one bot) to create a burst of parallel PHP hits: admin panels, API endpoints, and assets proxied to PHP due to misconfig.
If your Nginx config routes too much to PHP (like images through PHP, or missing static file caching), HTTP/2 can accelerate the pain. The fix is to serve static files as static files, aggressively and correctly, and to make PHP handle only what needs PHP.
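The usual shape of "static files as static files" is a dedicated location that never reaches PHP; extensions and cache lifetime here are illustrative:

```nginx
# Assets never touch PHP-FPM; tune the extension list to what you actually serve.
location ~* \.(css|js|png|jpe?g|gif|svg|webp|woff2?)$ {
    expires 30d;          # sets Expires and Cache-Control: max-age
    access_log off;       # asset hits are rarely worth the log volume
    try_files $uri =404;  # never fall through to index.php for assets
}
```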
When “security hardening” rules cause 5xx
WordPress attracts hardening snippets like moths to a porch light. Many are fine. Some break uploads, REST endpoints, or admin flows by denying methods or paths incorrectly.
Common breakage patterns:
- Blocking POST to /wp-json/ or /wp-admin/admin-ajax.php because "AJAX is scary." That can bubble up as 500/503 depending on how the denial is handled and how the app expects the response.
- Denying access to /wp-content/uploads/ with overbroad rules, causing plugins to error and throw 5xx when they can't fetch resources.
- Trying to block PHP execution in uploads but accidentally blocking legitimate PHP endpoints due to a bad regex.
Hardening should be tested like application code. Put it through staging. Add regression checks for admin actions. And keep your rules readable, because you’ll be reading them at 2 a.m.
Joke #2: The fastest way to discover you don’t have staging is to deploy a “security snippet” straight to production and call it bravery.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran WordPress behind Nginx on two app nodes. They did a PHP upgrade, and the new package used a different PHP-FPM socket path than the old one. The deploy checklist said “restart services,” and everyone did. The load balancer health checks were basic: they hit / and looked for HTTP 200.
On one node, the Nginx site config still pointed to the old socket. Static home page content was cached and served fine, so the health check stayed green. The first real user who tried to log in got a 502. Then every editor got a 502. Then the marketing team discovered it by attempting to publish a time-sensitive post and watching the admin panel spin until it died.
The on-call engineer initially assumed “the PHP upgrade is broken,” and started rolling back. But the rollback didn’t help because the service restart kept the new socket path and the old Nginx config mismatch remained. It wasn’t a software defect; it was an assumption that “PHP-FPM always lives at the same path.”
The fix was painfully simple: point fastcgi_pass at the correct socket, reload Nginx, and add a health check that exercises a dynamic PHP route (something cheap like a dedicated /health.php script). They also added a guardrail: a pre-reload check that verifies the socket exists and is connectable by the Nginx user.
What changed long-term was not the socket path. It was the culture: they stopped relying on “homepage returns 200” as a definition of health for a PHP application.
Mini-story 2: The optimization that backfired
Another team wanted to “speed up WordPress” without buying more servers. They enabled microcaching in Nginx and felt like heroes. Anonymous traffic got faster. Their graphs improved. Someone pasted a screenshot into a quarterly slide deck.
Then the weird stuff started. Logged-in users occasionally saw the wrong admin page. Editors complained that saving a draft sometimes returned a 502. The support team got tickets about “I clicked publish and it vanished.” The on-call engineer looked at the Nginx error log and saw a mix of upstream timeouts and cache-related locking behavior during bursts.
The root cause wasn’t microcaching in general; it was microcaching applied too broadly. They cached responses that should never be cached: admin pages, requests with auth cookies, and some plugin endpoints. Under load, a cache stampede formed: many requests waited on the same upstream computation, but because of the cache key and bypass conditions, they didn’t coalesce properly.
The fix involved strict bypass rules for anything with WordPress auth cookies, explicit no-cache for admin and AJAX, and conservative caching of only anonymous GET/HEAD. It also included “do not cache 5xx” behavior, because caching a failure is a way to make a transient hiccup look like a prolonged outage.
The lesson wasn’t “never cache.” It was: caching is application behavior. Treat it like code, test it like code, and roll it out slowly like you’re responsible for the outcome. Because you are.
Mini-story 3: The boring but correct practice that saved the day
A large enterprise had WordPress as one of many sites on a shared Nginx platform. Their configs were generated from templates and committed to version control. Every change required a config test and an automated smoke test that hit: homepage, a dynamic PHP endpoint, login page, and a media upload endpoint with a tiny file.
One day, a well-meaning engineer proposed a cleanup: “standardize FastCGI includes across all sites.” They modified a shared snippet used by dozens of hosts. It looked safe. It was not. The snippet changed how SCRIPT_FILENAME was computed, which broke a subset of sites whose root paths weren’t uniform.
The pipeline caught it. The smoke test failed on the login route with a 500, and the logs showed Primary script unknown. Nobody had to learn about this failure mode during a live incident. The fix was to adjust the template to honor each site’s root and to add a unit-style test that validates expected resolved paths for each vhost.
This wasn’t glamorous engineering. Nobody got a dopamine hit from “we prevented an outage.” But it saved hours of downtime and a lot of internal credibility.
If you’re looking for the moral: boring practices scale better than heroic debugging.
Checklists / step-by-step plan (deploy without drama)
Step-by-step: from 5xx alert to stable service
- Confirm the blast radius. One URL or all PHP routes? Anonymous only or admin too?
- Pull the top Nginx error lines. Look for upstream connect failures, timeouts, header/buffer errors, rewrite loops.
- Check PHP-FPM health. Service status, socket existence, permission alignment, max_children warnings.
- Check for OOM/restarts. Kernel logs and systemd restarts are often the real story.
- Measure a request. Use curl timing to see whether this is instant fail vs timeout.
- Mitigate safely. Scale FPM within memory limits, bypass problematic caching, disable the one plugin endpoint doing long work, or temporarily increase timeouts with a follow-up ticket to fix the root cause.
- Verify with a dynamic health endpoint. Don’t declare victory based on static content.
- Write down the signature. “If you see X log line, it was Y root cause.” This reduces future MTTR more than almost any tuning.
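A dynamic probe doesn't need tooling; a small shell sketch (hostnames and routes are placeholders):

```shell
# Returns nonzero (and prints FAIL) when a route's status differs from expected.
check() {
  code=$(curl -sS -o /dev/null -w '%{http_code}' "$1")
  if [ "$code" = "$2" ]; then
    echo "OK   $1 -> $code"
  else
    echo "FAIL $1 -> $code (want $2)" >&2
    return 1
  fi
}
# Example calls (placeholders — use your real host and a dynamic PHP route):
#   check https://example.com/ 200
#   check https://example.com/wp-login.php 200
```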
Config checklist: minimum viable WordPress Nginx server block
- Correct root to WordPress document root.
- try_files includes $args: try_files $uri $uri/ /index.php?$args;
- PHP location includes the correct snippet and the correct fastcgi_pass.
- Deny access to sensitive files thoughtfully (.htaccess, wp-config backups), without regex chaos.
- Static caching for assets served directly by Nginx, not via PHP.
- Upload size aligned with PHP limits.
- Logging includes upstream timing at least during incident windows.
Operational checklist: prevent repeat 5xx incidents
- Health checks must hit PHP. A single dynamic endpoint can save you from half-green pools.
- Pin and track PHP-FPM socket path. Avoid implicit defaults that change across upgrades.
- Cap and observe FPM. Set pm.max_children based on memory measurements, not vibes.
- Enable slow logs. If you don't know what's slow, you'll "fix" timeouts forever.
- Don't let production self-update. Deploy code like adults: CI, artifacts, rollbacks.
- Test Nginx reloads. nginx -t is not optional. Neither is verifying the new config is live.
FAQ
1) Why do I get 502 instead of 504?
502 typically means Nginx couldn’t establish or maintain a valid upstream connection (socket missing, permission denied, upstream closed early). 504 means Nginx connected but didn’t get a response in time. Your error log will usually make this obvious.
2) Should I use a Unix socket or TCP for php-fpm?
On the same host, a Unix socket is common and efficient, with fewer moving parts. Use TCP when you need cross-container or cross-host connectivity, or when your runtime environment makes socket permissions painful. Either way, keep the choice consistent and monitored.
3) I raised fastcgi_read_timeout and the 504s stopped. Am I done?
You might have just traded “fast failure” for “longer queue.” If the requests are slow because of FPM saturation or DB slowness, higher timeouts can increase concurrency pressure and worsen peak incidents. Use slow logs and queue metrics to confirm you fixed the cause, not the symptom.
4) What causes “upstream sent too big header” on WordPress?
Usually cookie bloat: too many cookies, too-large cookies, or too many Set-Cookie headers. WordPress admin plus plugins is the perfect storm. Fix by increasing FastCGI buffers and reducing cookie growth where possible.
5) Can Nginx rewrites cause 500 errors?
Yes. Rewrite loops and internal redirect cycles can produce 500s. WordPress generally needs a simple try_files and minimal rewrites. If you’re doing complex rewrite logic, you’re probably re-implementing WordPress routing poorly.
6) How do I know if PHP-FPM is the bottleneck or the database is?
Start with PHP-FPM slow logs. If stack traces point to DB calls (e.g., in wp-db.php), the database is likely your limiter. Also look for FPM max_children warnings (queueing) and correlate with DB metrics (lock waits, slow queries). A 504 is often “someone waited on something else.”
7) Why do only wp-admin and wp-login.php fail while the homepage works?
Admin pages often generate bigger headers and rely on cookies. They’re also more dynamic, so they expose upstream issues earlier. If the homepage is cached or mostly static, it can mask upstream failures. That’s why static-only health checks lie.
8) Is enabling fastcgi_cache for WordPress safe?
It can be safe for anonymous traffic if you bypass caching for logged-in cookies, admin paths, preview URLs, carts/checkouts, and anything personalized. Misapplied caching breaks correctness first, then availability. Test thoroughly and roll out gradually.
9) What’s the single most common “simple” cause of 502 after maintenance?
A mismatched socket path after a PHP upgrade or a restarted service that recreated a socket with different permissions. It’s embarrassingly common and fast to detect by checking fastcgi_pass and ls -l on the socket.
10) Do I need to tune Nginx worker_processes, worker_connections for WordPress 5xx issues?
Sometimes, but it’s rarely the first fix. WordPress outages tend to be upstream CPU/RAM/DB, not Nginx worker starvation. Still, if you see file descriptor limits or connection caps, fix those. Nginx is usually the messenger, not the murderer.
Conclusion: next steps that prevent repeats
If you’re seeing 5xx on WordPress behind Nginx, your job is to stop treating it like a mystical web-server mood swing. The failure modes are consistent: socket connectivity, upstream capacity, timeouts, buffers, and routing. The logs will tell you which one, if you read them like an operator instead of a fortune teller.
Practical next steps:
- Add a dynamic health endpoint and make your load balancer check it.
- Standardize and verify socket paths and permissions during deploys and upgrades.
- Enable PHP-FPM slow logging and treat “slow” as a reliability bug.
- Tune FPM worker counts based on measured memory, and stop when you hit safe headroom.
- Make caching explicit and conservative: anonymous GET/HEAD only unless you have a strong reason.
- Write a short runbook using the tasks above, so the next incident is a procedure, not a debate.
Do that, and your next 502 won’t be a mystery. It’ll be a known class of problems with a short list of fixes—and that’s what “reliability” looks like in the real world.