502s are the worst kind of outage: the kind that makes everyone argue about whose fault it is. Cloudflare points at your origin. Nginx points at PHP-FPM. PHP-FPM points at WordPress. WordPress points at a plugin written in 2013 by someone who has since found peace in woodworking.
This guide is a production-minded way to stop guessing. You’ll identify where the request died, why it died, and what change actually fixes it—without turning your server into a science fair.
What a 502 really means (in this stack)
A 502 Bad Gateway is not a WordPress error. It’s a proxy error. Something acting as a gateway (Cloudflare, Nginx, a load balancer) tried to talk to an upstream (your origin, PHP-FPM) and got garbage back, got nothing back, or timed out in a way that produces a 502.
That distinction matters because you debug 502s by proving where the failure happened. Your goal is not “fix WordPress.” Your goal is “identify the first component that failed to do its job.” Once you know that, the fix becomes obvious. Often boring. Usually effective.
Two common realities:
- 502 from Cloudflare: Cloudflare couldn’t get a valid response from your origin in time, or the origin closed the connection. Cloudflare did not run your PHP code. It asked your origin nicely and got ignored.
- 502 from Nginx: Nginx couldn’t talk to PHP-FPM (socket refused, permission denied, upstream closed, upstream timed out). Nginx did not run your PHP code either. It tried to hand it off and the handoff failed.
Here’s the mental model I use on-call: 502s are handoff failures. The most valuable question is: which handoff?
Fast diagnosis playbook
If you remember nothing else, remember this order. It cuts through blame and gets you to the culprit fast.
Step 1: Classify the 502 source (edge vs origin)
- If users see a Cloudflare-branded error page, start at Cloudflare headers and origin reachability.
- If users see your site’s error page or a plain “502 Bad Gateway” from your server, start at Nginx and PHP-FPM logs.
Step 2: Check Nginx error log for the upstream error string
This is the fastest truth serum. The exact phrasing tells you whether it’s “upstream timed out”, “connect() failed”, “upstream prematurely closed connection”, or “no live upstreams”. Each maps to a different fix.
Step 3: Check PHP-FPM health and capacity
Look for max_children reached, slowlog entries, or a dead/unreachable pool socket. Don’t tune blind. Confirm saturation and confirm memory headroom.
Step 4: Correlate by time and request path
Most 502s aren’t random. They cluster around specific endpoints (wp-admin, wp-cron.php, /checkout, an AJAX action). Find the path. Then find what code it runs.
Step 5: Decide: capacity, latency, or connectivity
- Connectivity: socket perms, listen backlog, SELinux/AppArmor, firewall, wrong upstream address.
- Capacity: too few workers, CPU pegged, MySQL stalled, IO waits, memory pressure.
- Latency: slow queries, external API calls, bad plugin, slow disk, DNS stalls.
Joke #1: A 502 is your server’s way of saying “I tried,” which is also what I write in incident retros when the graphs are missing.
Follow the request: Cloudflare → Nginx → PHP-FPM → WordPress
Picture the request as a relay race:
- Browser → Cloudflare: user hits the edge. Cloudflare applies WAF rules, caching, bot checks.
- Cloudflare → Origin (your server): Cloudflare connects to your Nginx/Apache, usually on 443.
- Nginx → PHP-FPM: Nginx proxies PHP requests to a UNIX socket or TCP port.
- PHP-FPM → WordPress: PHP executes code, calls MySQL, maybe Redis, maybe external APIs.
- Response flows back.
A 502 happens when a runner drops the baton. The key is to find which runner and why. That’s why logs and time correlation matter more than intuition.
Interesting facts and context (so the errors make sense)
- Fact 1: “Bad Gateway” is an HTTP status defined for intermediaries, not application code. Your PHP app typically doesn’t generate 502 on purpose.
- Fact 2: Nginx became popular partly because its event-driven model handles lots of idle connections efficiently—great for slow clients, not a magic wand for slow upstreams.
- Fact 3: PHP-FPM is the de facto process manager for PHP because it isolates PHP execution and lets the web server stay lean; this separation is exactly why “handoff” errors exist.
- Fact 4: Cloudflare’s “522” is famous, but “502/504 at the edge” is often just origin slowness expressed differently—timeouts are policy decisions, not universal truths.
- Fact 5: WordPress’s admin-ajax endpoint can become a high-concurrency hotspot; it’s effectively an RPC endpoint many plugins abuse.
- Fact 6: Keepalive and buffering settings were historically tuned for expensive upstreams and slow clients; modern stacks still inherit these knobs, and mis-tuning them can amplify failure.
- Fact 7: The classic “max_children reached” warning in PHP-FPM isn’t an error by itself; it’s a capacity alarm that often correlates with 502/504 at the proxy.
- Fact 8: UNIX sockets are slightly faster than TCP loopback for Nginx→FPM, but permission and path mistakes are more common with sockets than with TCP.
- Fact 9: Many “random” 502s happen on deploys because PHP-FPM reloads can briefly drop sockets or reset connections; graceful reload configuration matters.
Practical tasks: commands, outputs, decisions (12+)
These are the tasks I actually run under pressure. Each has three parts: a command, what the output means, and the decision you make. Adjust service names (php8.2-fpm vs php-fpm) for your distro.
Task 1: Confirm where the 502 is generated (headers)
cr0x@server:~$ curl -sS -D - -o /dev/null https://example.com/ | sed -n '1,20p'
HTTP/2 502
date: Fri, 26 Dec 2025 14:03:12 GMT
content-type: text/html; charset=UTF-8
server: cloudflare
cf-ray: 88b12345abcd1234-FRA
What it means: A server: cloudflare header plus a cf-ray ID indicates the edge returned the 502. That does not exonerate your origin.
Decision: Immediately test origin directly (Task 2). If origin is fine, look at Cloudflare settings/WAF. If origin is failing, debug origin.
Task 2: Bypass Cloudflare to test origin response
cr0x@server:~$ curl -sS -D - -o /dev/null --resolve example.com:443:203.0.113.10 https://example.com/ | sed -n '1,20p'
HTTP/2 200
server: nginx
content-type: text/html; charset=UTF-8
What it means: This hits 203.0.113.10 while still using SNI for example.com. You’re seeing the origin’s answer.
Decision: If origin returns 200 while Cloudflare returns 502, suspect Cloudflare-to-origin connectivity, TLS mode mismatch, WAF/bot fight, or origin rate limiting specific Cloudflare IPs. If origin also fails, proceed into Nginx/PHP-FPM.
Task 3: See the exact Nginx upstream error string
cr0x@server:~$ sudo tail -n 50 /var/log/nginx/error.log
2025/12/26 14:03:10 [error] 21344#21344: *99172 connect() to unix:/run/php/php8.2-fpm.sock failed (111: Connection refused) while connecting to upstream, client: 172.67.10.20, server: example.com, request: "GET / HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
What it means: Nginx tried to connect to the PHP-FPM socket and got Connection refused. That’s not a slow plugin. That’s a broken handoff.
Decision: Check whether PHP-FPM is running and listening on that socket (Task 4, Task 5). If it’s running, this can be a mismatch of socket path, permissions, or a crash loop.
Task 4: Check PHP-FPM service health
cr0x@server:~$ systemctl status php8.2-fpm --no-pager
● php8.2-fpm.service - The PHP 8.2 FastCGI Process Manager
Loaded: loaded (/lib/systemd/system/php8.2-fpm.service; enabled)
Active: active (running) since Fri 2025-12-26 13:58:22 UTC; 4min 53s
Docs: man:php-fpm8.2(8)
Main PID: 20811 (php-fpm8.2)
Status: "Processes active: 12, idle: 4, Requests: 983, slow: 3, Traffic: 0.8req/sec"
What it means: The service is running. That doesn’t guarantee it’s listening where Nginx expects.
Decision: Verify socket existence and ownership (Task 5) and FPM pool config for listen.
Task 5: Verify the socket exists and permissions make sense
cr0x@server:~$ sudo ls -l /run/php/php8.2-fpm.sock
srw-rw---- 1 www-data www-data 0 Dec 26 13:58 /run/php/php8.2-fpm.sock
What it means: Socket exists, owned by www-data, mode 660. Nginx must run as www-data or be in the www-data group to connect.
Decision: If Nginx runs as nginx user, either adjust groups or set listen.owner, listen.group, and listen.mode in the FPM pool config. If the socket is missing, PHP-FPM isn’t creating it—check logs (Task 6).
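If the socket side needs adjusting, the relevant pool directives look like this. A minimal sketch, assuming the Debian-style paths used in these examples:
; /etc/php/8.2/fpm/pool.d/www.conf
listen = /run/php/php8.2-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
Restart PHP-FPM after changing socket ownership, then re-run Task 5 to confirm the socket comes back with the expected owner and mode.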
Task 6: Read PHP-FPM logs and journal for pool errors
cr0x@server:~$ sudo journalctl -u php8.2-fpm -n 80 --no-pager
Dec 26 14:02:57 server php-fpm8.2[20811]: [WARNING] [pool www] server reached pm.max_children setting (20), consider raising it
Dec 26 14:03:01 server php-fpm8.2[20811]: [ERROR] [pool www] child 22109 exited on signal 11 (SIGSEGV) after 12.345678 seconds from start
Dec 26 14:03:01 server php-fpm8.2[20811]: [NOTICE] [pool www] child 22117 started
What it means: Two separate issues: capacity saturation (max_children) and instability (segfault). The segfault is often a buggy extension, OPcache issue, or memory corruption scenario.
Decision: If you see max_children warnings, confirm it correlates with 502s and tune safely (Task 11). If you see segfaults, disable suspect extensions, verify PHP packages, and consider a rollback of recent changes. Treat segfaults as a reliability incident, not “WordPress being WordPress.”
Task 7: Confirm Nginx is actually pointing at the right upstream
cr0x@server:~$ sudo nginx -T 2>/dev/null | sed -n '/fastcgi_pass/,+3p'
fastcgi_pass unix:/run/php/php8.2-fpm.sock;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
include fastcgi_params;
What it means: Nginx expects that socket path. If FPM pool listens on a different path or TCP port, you’ve built a distributed system inside one server.
Decision: Align fastcgi_pass with the pool’s listen. Avoid “multiple truths” (a socket in one file, a TCP port in another).
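A quick way to spot "multiple truths" is to print both sides and compare. A sketch, assuming the same config layout as above:
cr0x@server:~$ sudo nginx -T 2>/dev/null | grep -n "fastcgi_pass"
cr0x@server:~$ sudo grep -Rn "^listen *=" /etc/php/8.2/fpm/pool.d/
If the two commands do not name the same socket path (or the same host:port), that mismatch is your 502.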
Task 8: Check for upstream timeouts vs connection errors
cr0x@server:~$ sudo grep -E "upstream timed out|prematurely closed|Connection refused|no live upstreams" -n /var/log/nginx/error.log | tail -n 8
41288:2025/12/26 14:01:21 [error] 21344#21344: *98811 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 162.158.90.12, server: example.com, request: "GET /wp-admin/ HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
41307:2025/12/26 14:03:10 [error] 21344#21344: *99172 connect() to unix:/run/php/php8.2-fpm.sock failed (111: Connection refused) while connecting to upstream, client: 172.67.10.20, server: example.com, request: "GET / HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
What it means: You have both timeout and refusal. That suggests intermittent FPM unavailability (reloads/crashes) and slow upstream responses when it is available.
Decision: Fix stability first (avoid refusal), then address slowness. A system that is consistently slow is easier to tune than one that randomly disappears.
Task 9: Inspect active connections and listen backlog pressure
cr0x@server:~$ sudo ss -xlp | grep php8.2-fpm.sock
u_str LISTEN 0 4096 /run/php/php8.2-fpm.sock 113217 * 0 users:(("php-fpm8.2",pid=20811,fd=8))
What it means: The socket is listening, backlog is 4096. If backlog is tiny and you’re spiky, you’ll see connection failures under load.
Decision: If backlog is low, tune FPM listen.backlog (pool config) and ensure kernel limits aren’t tiny. But don’t use backlog to hide an undersized worker pool.
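To sanity-check both sides of the backlog, compare the kernel ceiling with the pool setting. A sketch, with the pool path assumed from earlier tasks:
cr0x@server:~$ sysctl net.core.somaxconn
cr0x@server:~$ sudo grep -n "listen.backlog" /etc/php/8.2/fpm/pool.d/www.conf
The effective backlog is capped at net.core.somaxconn, so raising listen.backlog above that ceiling changes nothing.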
Task 10: Check CPU, load, IO wait (is the box dying?)
cr0x@server:~$ uptime; mpstat -P ALL 1 3; vmstat 1 5
14:03:33 up 36 days, 3:22, 2 users, load average: 12.44, 10.81, 8.03
Linux 6.1.0 (server) 12/26/2025 _x86_64_ (4 CPU)
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 72.10 0.00 18.34 6.22 0.00 0.78 0.00 0.00 0.00 2.56
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
12 1 0 14200 12000 820000 0 0 120 980 4200 8800 74 19 2 5 0
What it means: CPU is saturated and there’s some IO wait. If PHP-FPM needs CPU and the box is pegged, Nginx timeouts will follow.
Decision: Reduce concurrency (rate limit, caching), optimize hot paths, or add capacity. Before tuning FPM upward, confirm memory headroom (Task 11) so you don’t “fix” 502s by summoning OOM kills.
Task 11: Measure PHP-FPM worker memory to tune pm.max_children safely
cr0x@server:~$ ps -ylC php-fpm8.2 --sort=-rss | head -n 8
S UID PID PPID C PRI NI RSS SZ WCHAN TTY TIME CMD
S 33 22117 20811 0 80 0 126432 210944 - ? 00:00:01 php-fpm: pool www
S 33 22098 20811 0 80 0 118944 205312 - ? 00:00:02 php-fpm: pool www
S 33 22077 20811 0 80 0 112880 198656 - ? 00:00:02 php-fpm: pool www
S 33 22031 20811 0 80 0 108220 190112 - ? 00:00:03 php-fpm: pool www
S 33 21990 20811 0 80 0 104112 186880 - ? 00:00:04 php-fpm: pool www
What it means: RSS is roughly 100–125 MB per worker in this snapshot. That’s the number that matters when you raise pm.max_children.
Decision: If you have 2 GB available for PHP workers, you probably don’t set max_children to 80 “because traffic.” You set it to something your RAM can pay for, with headroom for MySQL, OS cache, and bursts.
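One rough way to turn the measurement into a number. This is a sketch; the 2 GB budget below is an assumption you must replace with your own headroom after MySQL, OS cache, and everything else on the box:
cr0x@server:~$ ps -ylC php-fpm8.2 | awk 'NR>1 {sum+=$8; n++} END {print int(sum/n/1024) " MB average RSS across " n " processes"}'
Then pm.max_children ≈ (RAM budget for PHP) / (average worker RSS). With a 2 GB budget and ~115 MB workers, that is roughly 17, not 80. The average includes the master process, so treat it as an estimate, not a law.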
Task 12: Turn on and read PHP-FPM slowlog to catch the real offenders
cr0x@server:~$ sudo grep -nE "request_slowlog_timeout|slowlog" /etc/php/8.2/fpm/pool.d/www.conf | tail -n 5
277:request_slowlog_timeout = 5s
278:slowlog = /var/log/php8.2-fpm/www-slow.log
cr0x@server:~$ sudo tail -n 40 /var/log/php8.2-fpm/www-slow.log
[26-Dec-2025 14:01:22] [pool www] pid 22031
script_filename = /var/www/example.com/public/wp-admin/admin-ajax.php
[0x00007f4c8b8f2a30] curl_exec() /var/www/example.com/public/wp-includes/Requests/Transport/cURL.php:204
[0x00007f4c8b8f28a0] request() /var/www/example.com/public/wp-includes/class-wp-http.php:420
[0x00007f4c8b8f2500] post() /var/www/example.com/public/wp-includes/class-wp-http.php:634
[0x00007f4c8b8f1d40] call_external_api() /var/www/example.com/public/wp-content/plugins/some-plugin/api.php:88
What it means: A specific path (admin-ajax.php) is stuck waiting on an external API call via cURL.
Decision: You don’t “increase timeouts.” You isolate or fix the plugin behavior: add caching, circuit breakers, async jobs, or disable the plugin. And you set a sane upstream timeout to avoid worker pileups.
Task 13: Check MySQL/MariaDB stalls (common hidden upstream)
cr0x@server:~$ mysql -e "SHOW PROCESSLIST\G" | sed -n '1,40p'
*************************** 1. row ***************************
Id: 18821
User: wpuser
Host: localhost
db: wordpress
Command: Query
Time: 42
State: Sending data
Info: SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'
What it means: A query is running for 42 seconds. That can stall PHP workers long enough to cause Nginx upstream timeouts and Cloudflare 5xx.
Decision: Find why it’s slow: table bloat, missing indexes, disk contention, or a plugin storing huge autoload options. Fix the database bottleneck or you’re just moving deck chairs.
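To check the autoload angle specifically, measure it. A sketch, assuming the default wp_ table prefix and a database named wordpress:
cr0x@server:~$ mysql wordpress -e "SELECT ROUND(SUM(LENGTH(option_value))/1024/1024,2) AS autoload_mb FROM wp_options WHERE autoload='yes'"
cr0x@server:~$ mysql wordpress -e "SELECT option_name, LENGTH(option_value) AS bytes FROM wp_options WHERE autoload='yes' ORDER BY bytes DESC LIMIT 10"
Anything measured in megabytes here is loaded on every request; trace the biggest rows back to the plugins that own them.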
Task 14: Validate disk pressure and filesystem health (yes, storage matters)
cr0x@server:~$ df -hT /var /var/www; iostat -xz 1 3
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 120G 111G 3.6G 97% /
/dev/nvme0n1p2 ext4 120G 111G 3.6G 97% /
avg-cpu: %user %nice %system %iowait %steal %idle
68.12 0.00 17.43 9.87 0.00 4.58
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 45.0 60.0 3200.0 5400.0 18.2 0.7 92.0
What it means: Root filesystem is at 97% used and the disk is at 92% util with meaningful await. Near-full disks cause cascading failures: log writes fail, temp files fail, DB slows, PHP blocks.
Decision: Free space now (logs, old backups), then fix the underlying storage plan. Do not try to tune PHP-FPM on a disk that’s screaming.
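For the immediate cleanup, find what is actually eating the disk before deleting anything. A sketch:
cr0x@server:~$ sudo du -xh --max-depth=2 /var 2>/dev/null | sort -h | tail -n 15
cr0x@server:~$ sudo journalctl --vacuum-size=500M
The first command shows the biggest directories on this filesystem; the second is one example of reclaiming space (old journal entries) without touching application data.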
Task 15: Confirm Cloudflare-to-origin connectivity isn’t being blocked
cr0x@server:~$ sudo iptables -S | sed -n '1,60p'
-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -s 198.51.100.0/24 -j ACCEPT
What it means: Only a specific /24 is allowed to reach 443. If Cloudflare’s edge IPs aren’t in that allowlist, you’ll see intermittent edge errors depending on which PoP hits you.
Decision: Either open 443 appropriately or maintain a correct allowlist. “We only allow Cloudflare IPs” is a valid strategy—if you actually keep it updated.
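If you keep the allowlist, script the refresh instead of trusting a rule set from three hires ago. A sketch; Cloudflare publishes its IPv4 ranges at www.cloudflare.com/ips-v4 (verify the current URL yourself and handle the IPv6 list too):
cr0x@server:~$ curl -sS https://www.cloudflare.com/ips-v4 -o /tmp/cf-ips-v4.txt
cr0x@server:~$ while read -r net; do sudo iptables -A INPUT -p tcp --dport 443 -s "$net" -j ACCEPT; done < /tmp/cf-ips-v4.txt
In production you would use an ipset or nftables set and flush stale entries first; the point is that the list has to be refreshed on a schedule, not remembered.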
Nginx failure modes that produce 502
1) “connect() failed (111: Connection refused)”
This means Nginx tried to connect to the upstream (FPM socket or TCP port) and the OS refused. Common causes:
- PHP-FPM is down or crash-looping.
- Nginx points to the wrong socket path or port.
- A reload briefly removed the socket, and Nginx raced the restart.
Fix: Make the upstream stable first. Verify the socket path in nginx -T, ensure PHP-FPM service health, and avoid aggressive reload loops during deploys.
2) “upstream timed out (110: Connection timed out) while reading response header from upstream”
Nginx connected to PHP-FPM, sent the request, and then waited too long for headers. This is the classic “PHP is slow or saturated” case. Likely culprits:
- Workers are all busy (pm.max_children reached), so requests queue.
- One or more endpoints are slow (admin-ajax, checkout, cron).
- Database or storage latency blocks PHP, so FPM can’t respond.
- External API calls hang and pin workers.
Fix: Use slowlog to identify the code path. Then fix the reason it’s slow. Increasing Nginx timeouts without addressing worker pileups just turns 502s into 504s and makes your latency charts look like modern art.
3) “upstream prematurely closed connection”
The upstream accepted the connection, then died or closed it without sending a full response. Common causes:
- PHP worker crashed (segfault, fatal error, OOM kill).
- FastCGI buffer issues with huge headers or responses (less common with WordPress HTML, more common with odd plugins).
- Misconfigured fastcgi settings.
Fix: Check PHP-FPM logs for crashes and system logs for OOM kills. Treat crashes as a real bug. If it’s headers, look for oversized cookies or plugin-generated headers.
4) Permission denied on the socket
Nginx can’t open the FPM socket because of filesystem permissions or SELinux/AppArmor. The error log will say (13: Permission denied).
Fix: Ensure the socket is readable/writable by the Nginx worker user. If SELinux is enforcing, you need the correct context—not a prayer.
PHP-FPM failure modes that produce 502
1) pm.max_children reached (capacity ceiling)
This is the most common “it works until it doesn’t” scenario. When all children are busy, new requests queue. If the queue waits longer than Nginx’s timeout, you get 502/504.
What to do:
- Measure worker memory and CPU before raising limits.
- Fix slow requests first. Adding more workers can increase database load and make things worse.
- Consider separate pools for admin and public traffic if you’re serious about availability.
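A separate admin pool is less exotic than it sounds. A minimal sketch; the pool name, socket path, and worker counts are placeholders you size for your own box:
; /etc/php/8.2/fpm/pool.d/admin.conf (hypothetical second pool)
[admin]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm-admin.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 6
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
# nginx: route admin PHP to the admin pool; this regex location must appear
# before the generic \.php$ location, because the first matching regex wins
location ~ ^/wp-(admin|login).*\.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php8.2-fpm-admin.sock;
}
Now a plugin that hangs wp-admin exhausts six workers, not the whole site.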
2) Slow requests pin workers (the silent killer)
One slow plugin endpoint doesn’t just hurt that endpoint. It consumes a worker. Under concurrency, slow requests become a queueing problem, then an outage.
Fix: Use slowlog. Identify the stack trace. Fix the code path. Add caching, reduce calls, make external calls time out quickly with fallbacks.
3) Crashes: SIGSEGV, OOM, fatal errors
Crashes cause “prematurely closed” or “connection refused” depending on timing. OOM kills are especially nasty because they’re silent unless you check kernel logs.
cr0x@server:~$ dmesg -T | tail -n 20
[Fri Dec 26 14:03:02 2025] Out of memory: Killed process 22077 (php-fpm8.2) total-vm:812340kB, anon-rss:256144kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:680kB oom_score_adj:0
Decision: If OOM killed PHP workers, you do not raise max_children. You reduce memory per request (plugins, OPcache tuning), add RAM, or move services off-box.
4) Bad listen/backlog and thundering herds
Under sudden spikes, a too-small backlog or tight kernel limits can turn load into connection refusal. This often shows up during traffic bursts, cron storms, or cache purges.
Fix: Ensure FPM listen backlog is reasonable, and kernel settings aren’t ancient. But again: backlog is not capacity.
WordPress and plugin/theme culprits
WordPress itself is usually fine. WordPress plus the plugin ecosystem is a lively bazaar where quality varies. 502s often come from:
- admin-ajax.php abuse: frequent polling, long-running actions, unbounded loops.
- wp-cron.php storms: “pseudo-cron” triggered by web requests can stack up under load or when visitors arrive after quiet periods.
- External API calls: marketing, CRM, payment, shipping, analytics. External calls without strict timeouts are worker glue.
- Autoloaded options bloat: large serialized blobs autoloaded on every request. It’s like bringing your entire attic to every meeting.
- Image processing on request: resizing and optimization done synchronously in PHP.
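For the wp-cron.php storms above, the standard mitigation is to stop triggering pseudo-cron from web requests and run it from system cron instead. A sketch, assuming WP-CLI lives at /usr/local/bin/wp and the docroot used earlier:
/* wp-config.php: stop web requests from running scheduled tasks */
define('DISABLE_WP_CRON', true);
# /etc/cron.d/wp-cron: run due events every 5 minutes as the web user
*/5 * * * * www-data cd /var/www/example.com/public && /usr/local/bin/wp cron event run --due-now --quiet
Heavy tasks still need fixing, but at least they stop competing with visitors for PHP workers.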
Use the slowlog as a plugin detector
The slowlog stack trace is your friend because it points to a file path under wp-content/plugins/ or a theme. That’s evidence. It changes conversations.
WP-CLI triage without heroics
If the site is melting and you need a quick isolation move, disable nonessential plugins in a controlled way. Better: do it on a staging environment. But in real incidents, you sometimes have to cut power to a plugin and apologize later.
cr0x@server:~$ cd /var/www/example.com/public
cr0x@server:~$ sudo -u www-data wp plugin list --status=active
+-----------------------+----------+--------+---------+
| name | status | update | version |
+-----------------------+----------+--------+---------+
| woocommerce | active | none | 8.5.1 |
| some-plugin | active | none | 2.9.0 |
| cache-plugin | active | none | 1.3.2 |
+-----------------------+----------+--------+---------+
Decision: If slowlog points at some-plugin, disable it first and retest. If the outage stops, you have a root cause candidate. Then do the real fix: config, update, replace, or vendor escalation.
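The deactivate-and-retest loop, in commands (the plugin slug comes from the listing above):
cr0x@server:~$ sudo -u www-data wp plugin deactivate some-plugin
cr0x@server:~$ curl -sS -o /dev/null -w "%{http_code}\n" https://example.com/
If the status codes return to 200 and the Nginx error log goes quiet, you have isolated the culprit. Write it down before someone re-enables the plugin "to check something."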
Cloudflare: when it’s “them”, when it’s you
Cloudflare is a reverse proxy with opinions. It enforces timeouts and will happily return an error while your origin is still thinking about life. Common Cloudflare-adjacent causes of 502:
- Origin overload: Cloudflare makes your site more reachable and, in some cases, retries requests, which raises concurrency at the origin. Your origin still needs capacity.
- TLS mode mismatch: “Full” vs “Full (strict)” and certificate validation issues can surface as connection problems.
- WAF and bot fight: blocked requests can look like partial failures if only some endpoints/users are affected.
- IP allowlisting mistakes: firewall only permits old Cloudflare ranges.
Practical approach:
- Prove origin health by bypassing Cloudflare (Task 2).
- Check origin logs for Cloudflare IPs and request patterns at the error times.
- Confirm your firewall isn’t blocking edge IPs (Task 15).
Joke #2: Cloudflare isn’t “down,” it’s just practicing boundary setting with your origin server.
Common mistakes: symptom → root cause → fix
1) Symptom: 502 spikes during traffic peaks
Root cause: PHP-FPM max_children reached; requests queue; Nginx times out waiting for headers.
Fix: Measure worker RSS, set max_children based on RAM, enable slowlog, fix the slow endpoints, add caching. If needed, add capacity (more CPU/RAM, separate DB).
2) Symptom: 502 only for wp-admin or admin-ajax
Root cause: Plugin doing slow external calls, or admin endpoints bypass cache and run heavier queries.
Fix: Use slowlog, identify plugin file, add strict HTTP timeouts, cache external responses, or disable/replace the plugin.
3) Symptom: 502 after deploy/restart, then “recovers”
Root cause: PHP-FPM reload drops socket briefly; Nginx catches it mid-transition; or OPcache warm-up creates CPU spike.
Fix: Use graceful reloads, stagger restarts, keep health checks, and avoid restarting Nginx+FPM simultaneously. Consider pre-warming common endpoints.
4) Symptom: Cloudflare shows 502, origin direct is fine
Root cause: Firewall blocks some Cloudflare IPs; intermittent edge-to-origin routing issues; TLS mode mismatch.
Fix: Fix allowlists, confirm TLS settings, ensure origin can handle Cloudflare concurrency. Validate with --resolve tests.
5) Symptom: Nginx error shows Permission denied on FPM socket
Root cause: Socket ownership doesn’t match Nginx worker user; SELinux context mismatch.
Fix: Align users/groups or set listen.owner/listen.group/listen.mode. For SELinux, apply proper policy/context (don’t just disable SELinux in production unless you enjoy audits).
6) Symptom: 502 appears with “upstream prematurely closed connection”
Root cause: PHP crash (segfault) or OOM kill; sometimes fatal errors with worker termination.
Fix: Check journal and dmesg. Roll back recent PHP extensions or changes. Reduce memory pressure. Add swap only as a last resort, and don’t confuse it with a fix.
7) Symptom: 502 after enabling a “performance” plugin
Root cause: Aggressive caching plugin triggers cache stampede, purges too often, or increases admin-ajax load; sometimes misconfigures headers/buffering.
Fix: Disable to confirm. Reintroduce with sane config: cache warm-up, rate limits, object cache strategy, and careful purge rules.
Checklists / step-by-step plan
Checklist A: First 10 minutes on-call
- Check whether the error is Cloudflare-branded or origin-branded.
- Run curl -D - to capture headers and confirm the server header.
- Bypass Cloudflare with curl --resolve to test the origin directly.
- Tail the Nginx error log and look for the upstream error string.
- Check PHP-FPM status and journal for max_children, crashes, and pool errors.
- Quick system sanity: CPU, RAM, IO wait, disk full.
- If needed, apply a mitigation: temporarily disable the known-bad plugin, rate limit abusive endpoints, or raise timeouts only as a stopgap.
Checklist B: Identify if it’s capacity vs latency vs connectivity
- Connectivity: socket missing, permission denied, refused connections, firewall blocks. Fix config and security policy first.
- Capacity: max_children reached, CPU pegged, memory pressure. Fix with sizing, caching, more compute.
- Latency: slowlog points to DB/API/plugin. Fix the slow code path and dependency performance.
Checklist C: Hardening changes that prevent repeat incidents
- Enable PHP-FPM slowlog with a low threshold (e.g., 3–5s) and rotate logs.
- Add request IDs to Nginx access logs and propagate them (so you can trace).
- Set strict timeouts for external calls in application code (plugins/themes).
- Separate concerns: DB off the web box for busy sites; isolate admin traffic via separate pool.
- Implement caching intentionally: page cache, object cache, and proper cache invalidation strategy.
- Capacity plan based on measurements, not vibes.
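The request-ID item is the one people skip because it sounds like work; it is three lines of Nginx config. A minimal sketch using Nginx's built-in $request_id variable (the REQUEST_ID param name is just a convention for the app to read from $_SERVER):
# http {} context
log_format with_id '$remote_addr [$time_local] rid=$request_id "$request" $status $request_time';
access_log /var/log/nginx/access.log with_id;
# inside the PHP location block
fastcgi_param REQUEST_ID $request_id;
Now a failing access-log line, a slowlog entry, and an application log line can be joined on one ID instead of on timestamps and hope.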
Three corporate-world mini-stories (realistic and painful)
Mini-story 1: The incident caused by a wrong assumption
The site was behind Cloudflare, and the team assumed that meant the origin was “protected” from traffic spikes. Marketing launched a campaign. Cloudflare did its job: it accepted a flood of connections and tried to reach the origin for cache misses.
The origin, a single VM, had PHP-FPM configured years ago with a conservative pm.max_children. Under load, workers saturated. Requests queued. Nginx started logging upstream timed out while reading response header. Cloudflare started returning 502s because it wasn’t getting timely responses.
The initial call was a classic: “Cloudflare is down.” It wasn’t. The edge was accurately reporting that the origin couldn’t keep up.
What fixed it wasn’t a heroic Cloudflare ticket. The team turned on PHP-FPM slowlog and found one endpoint dominating: admin-ajax.php calls from a plugin’s frontend widget. The widget hit an external API on every page view without caching.
They cached the API response, set a short timeout, and reduced the widget’s call frequency. Then they resized FPM based on memory measurements. The assumption died; the site lived.
Mini-story 2: The optimization that backfired
A well-meaning engineer “optimized” the stack by cranking up pm.max_children dramatically. The goal was simple: more workers, fewer queued requests, fewer 502s.
It worked for about an hour. Then the box started swapping. MySQL got slower. IO wait climbed. PHP workers took longer, not shorter. Nginx timeouts increased. Cloudflare errors followed. The incident got worse because now every request was competing for disk and memory.
The postmortem was humbling: they had treated PHP-FPM like a thread pool. It isn’t. Each worker is a process with real memory cost. Increasing concurrency without increasing resources often increases contention and tail latency.
The fix was boring: cap workers to what RAM could support, move MySQL to a separate host, and add caching so fewer requests needed PHP at all. The “optimization” was rolled back, and everyone quietly agreed to stop tuning production with caffeine and hope.
Mini-story 3: The boring but correct practice that saved the day
A different company had a habit: every incident started with log correlation, not speculation. They also had a tiny but useful convention: every Nginx access log line included a request ID, and that ID was passed to PHP via a fastcgi param and logged by the application.
When 502s hit, they pulled a sample of failing requests and immediately saw the same request path: a checkout endpoint that triggered a shipping rate lookup. The PHP-FPM slowlog showed the call stack; the request ID matched the Nginx entry; the timing lined up.
They didn’t argue about Cloudflare, Nginx, or PHP-FPM. They didn’t need to. The chain of evidence was clean. The external shipping API was slow and occasionally hanging. Their code had a generous timeout and no fallback.
Because they’d practiced this boring discipline, mitigation was quick: reduce timeout, add cached fallback rates, and degrade gracefully. The incident ended without drama, which is the best kind of incident.
Operational guidance that holds up under pressure
Timeouts: use them as guardrails, not a lifestyle
There are timeouts at every layer: Cloudflare, browser, Nginx, FastCGI, PHP execution, external APIs, database. A 502 is often a timeout policy being enforced. If you only raise timeouts, you usually convert fast failure into slow failure. Users still lose; you just waste more compute while losing.
Set timeouts so that:
- external calls fail fast (seconds, not minutes),
- your upstream timeout is slightly above the 95th percentile for legitimate slow requests, and
- your app degrades gracefully when dependencies misbehave.
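As a concrete starting point, these are the knobs at the Nginx and PHP-FPM layers. The values are examples to tune against your own percentiles, not recommendations:
# nginx, in the PHP location or server block
fastcgi_connect_timeout 5s;
fastcgi_read_timeout 30s;
; PHP-FPM pool (www.conf)
request_terminate_timeout = 60s
request_slowlog_timeout = 5s
slowlog = /var/log/php8.2-fpm/www-slow.log
With these numbers Nginx gives up before PHP-FPM kills the worker, and the slowlog records why the request was slow. External API timeouts inside the application should be shorter than all of them, so the code can degrade gracefully before the proxy does it for you.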
One quote worth keeping in your head
Paraphrased idea, attributed to John Allspaw: “Blamelessness helps you learn the real reasons systems fail.”
FAQ
1) Why do I get 502 sometimes and 200 other times?
Intermittent 502s usually mean saturation or instability. Saturation: worker pool is sometimes full. Instability: PHP-FPM reloads/crashes, OOM kills, or flaky dependencies.
2) Is a 502 always PHP-FPM’s fault?
No. 502 is produced by the gateway (Cloudflare/Nginx) when the upstream fails. The upstream could be PHP-FPM, but it could also be your origin from Cloudflare’s perspective, or a misconfigured upstream address.
3) What’s the fastest way to know if Cloudflare is involved?
Check headers: server: cloudflare and cf-ray. Then bypass Cloudflare with curl --resolve to test the origin directly.
4) Should I switch Nginx→PHP-FPM from UNIX socket to TCP?
If you’re fighting permissions and deployment tooling, TCP can be simpler to reason about. UNIX sockets are fine and slightly more efficient, but the real win is reliability and clarity, not micro-optimizations.
5) Why does wp-cron.php correlate with 502s?
Because WP-Cron runs scheduled tasks on normal web requests. Under certain traffic patterns, tasks pile up. If tasks are heavy (email sending, API sync), they pin PHP workers and cause queueing.
6) Can caching alone eliminate 502s?
It can reduce the probability dramatically by reducing origin load. But if your uncached endpoints are still slow or your worker pool is unstable, caching won’t save admin, checkout, and API endpoints.
7) I see “max_children reached” but no 502s. Do I care?
Yes. It’s an early warning. You may be hiding the issue with generous timeouts or low traffic. When traffic spikes or a dependency slows down, you’ll pay for it.
8) Why do I get 502 only on large uploads or image operations?
Uploads and image processing can be IO-heavy and CPU-heavy. If PHP execution time is high or temporary disk is full/slow, workers stall. Fix by moving processing off-request, increasing resources, and ensuring disk has space and performance.
9) What if Nginx says “upstream timed out” but PHP-FPM seems idle?
Then “idle” may be misleading: workers may be stuck in uninterruptible IO, or the bottleneck is elsewhere (database locks, DNS stalls, external calls). Use slowlog and check MySQL processlist and system IO wait.
10) Should I restart PHP-FPM when I see 502?
Restarting can be a mitigation if FPM is wedged, but it destroys evidence. Grab logs first, capture current state (service status, error logs, slowlog), then restart if needed.
Conclusion: next steps that prevent the sequel
502s aren’t mysterious. They’re specific: a proxy didn’t get a valid upstream response. Your job is to name the failed handoff, then fix the reason it failed.
Next steps I’d actually do in production:
- Instrument for evidence: enable PHP-FPM slowlog, keep Nginx error logs clean and rotated, and log request IDs.
- Build a fast triage routine: headers → bypass edge → Nginx upstream error → FPM health → system pressure → slowlog.
- Fix the real bottleneck: slow plugin paths, database stalls, disk pressure, or unstable PHP extensions.
- Tune last, based on measurements: set FPM workers based on RSS and CPU, not hope.
- Make failure safe: strict external API timeouts, caching, graceful degradation, and sane rate limits on abusive endpoints.