WordPress 503 Service Unavailable: what to check when the site is down

503 is the sound a website makes when it can’t keep its promises. Your users see a blank page, your marketing team sees lost revenue, and you see a ticket that says “site down” with exactly zero useful details.

This guide is how you cut through the noise. Not “clear your cache” folklore. Real checks, commands you can run, and decisions you can make while the outage clock is ticking.

What a 503 really means (and what it doesn’t)

A 503 “Service Unavailable” is not a single bug. It’s a class of failures where a component in the request path can’t serve the request right now. The key phrase is “right now”: the system is alive enough to respond, but unhealthy enough to refuse, time out, or shed load.

In WordPress land, 503 usually bubbles up from one of these choke points:

  • Front door proxy/CDN (Cloudflare, Fastly, ALB/ELB, Nginx): origin is down or too slow, health checks failing, upstream errors.
  • Web server (Nginx/Apache): can’t reach upstream, too many open files, worker limits, queue saturation.
  • Application runtime (PHP-FPM): process manager at limit, slow requests, deadlocked workers, OOM kills.
  • Database (MySQL/MariaDB): max connections reached, disk full, long-running queries, replication lag if reads are split.
  • Storage (local disk, NFS, EBS): latency spikes, I/O wait, inode exhaustion, permission problems after deployments.
  • External dependencies (SMTP, payment API, search): WordPress theme/plugin blocking on remote calls.

Two important non-truths:

  • 503 is not “WordPress is broken.” WordPress might be fine; the platform around it might be the one face-planting.
  • 503 is not “CPU is high.” High CPU can cause it, but so can slow disks, a stuck DNS resolver, or a single plugin doing something it shouldn’t.

One idea worth keeping above your monitor during incidents, paraphrasing John Allspaw: the system is telling you a story; go find the narrative, not a scapegoat.

Joke #1: A 503 is your server’s way of saying “I’m busy” without committing to a time estimate—basically tech support, but with HTTP headers.

Fast diagnosis playbook (first/second/third)

When the site is down, you don’t need a deep dive first. You need a funnel that gets you to the bottleneck in minutes. This is the sequence that works in real incidents.

First: confirm where the 503 is generated (edge vs origin)

  1. Check from outside: do you get 503 from the CDN/load balancer or from your origin?
  2. Check from inside: curl the origin directly (or the service VIP) from a box in the same network.
  3. Decision: If edge returns 503 but origin is fine, focus on health checks, WAF rules, origin connectivity, TLS, or upstream timeouts. If origin returns 503 too, go deeper.

Second: decide if you’re capacity-limited or broken

  1. Look at load, memory, and disk on the origin: is the box thrashing, swapping, I/O waiting, or out of space?
  2. Decision: If you’re resource-limited, mitigate first (scale out, restart safely, shed load) then investigate root cause after service returns.

Third: follow the request path, hop by hop

  1. Web server logs: are requests reaching Nginx/Apache, and what upstream errors appear?
  2. PHP-FPM status: are pools saturated, are workers stuck, are slow logs lighting up?
  3. Database health: connections, slow queries, locks, disk, and buffer pool pressure.
  4. Decision: Once you find the first component that is failing (not the last one complaining), you’ve found the incident’s control knob.

Speed rule: if you can’t explain the 503 after 10 minutes, you’re probably staring at the wrong layer. Move one hop up or down the chain.

Interesting facts and context (quick but useful)

  • HTTP 503 has an “honest” intent: it’s designed to be temporary and can include a Retry-After header for well-behaved clients.
  • 503 is commonly used for maintenance mode because it tells search engines “don’t index this as a permanent failure,” unlike a 404 (a minimal Nginx example follows this list).
  • Many proxies emit 503 for upstream trouble even when the origin would have returned 504 or 500—your error code can be a translation, not a truth.
  • Nginx returns 502/504 for many upstream failures, but certain configurations or intermediate load balancers can surface them as 503 instead.
  • WordPress itself rarely emits a raw 503; it’s typically the web tier, PHP-FPM, or a plugin’s fatal path causing the server to fail health checks.
  • PHP-FPM has been the default “scale lever” for WordPress in many stacks since the early 2010s because it decouples PHP workers from the web server’s process model.
  • “Thundering herd” is not theoretical: a cache expiration at the same second across many nodes can cause a synchronized stampede, pushing PHP-FPM into 503 land.
  • Some managed WordPress platforms intentionally return 503 to protect shared infrastructure under load, preferring partial availability over total collapse.
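
On the maintenance-mode point above: if you need to return 503 deliberately, a minimal Nginx sketch looks like this (the retry window and placement are illustrative, adapt to your layout):

# Inside the relevant server {} block
location / {
    add_header Retry-After 300 always;   # "always" attaches the header to non-2xx/3xx responses
    return 503;
}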

Triage questions that narrow the blast radius

Ask these before you start restarting things like a pinball machine.

  • Is it all pages or only some? If only admin or checkout fails, think sessions, database locks, or external APIs.
  • Is it all regions or one? If only one region fails, suspect routing, DNS, or a zonal storage issue.
  • Did anything change? Deploys, plugin updates, theme edits, PHP version bumps, WAF rule changes, certificate rotations.
  • Is it load-related? Traffic spike, bot wave, or a cron job that runs on the hour and turns your database into soup.
  • Is it intermittent? Intermittent 503s smell like saturation: worker limits, connection pools, or slow I/O.

The goal is to pick a hypothesis that can be disproven quickly. Incident response is basically science, but with more coffee and fewer grants.

Hands-on tasks: commands, outputs, decisions (12+)

Below are practical tasks you can run on a typical Linux WordPress origin (Nginx + PHP-FPM + MySQL/MariaDB). Adjust service names for your distro. Each task includes: command, example output, what it means, and what you do next.

Task 1: Verify the symptom from the outside (status code + headers)

cr0x@server:~$ curl -sS -o /dev/null -D - https://example.com | sed -n '1,20p'
HTTP/2 503
date: Thu, 26 Dec 2025 10:12:41 GMT
content-type: text/html; charset=UTF-8
server: cloudflare
cf-ray: 88c0a1b8f9c01234-AMS

Meaning: The response is generated at the edge (server: cloudflare). You’re not yet sure whether the origin is down or the edge can’t reach it.

Decision: Test origin directly (Task 2). If origin is healthy, you investigate CDN/origin connectivity, firewall, TLS, or health checks.

Task 2: Curl the origin directly (bypass CDN)

cr0x@server:~$ curl -sS -o /dev/null -D - -H 'Host: example.com' http://127.0.0.1/ | sed -n '1,15p'
HTTP/1.1 503 Service Temporarily Unavailable
Server: nginx
Date: Thu, 26 Dec 2025 10:13:02 GMT
Content-Type: text/html
Content-Length: 190
Connection: keep-alive

Meaning: The origin itself is returning 503 (nginx). Now you’re in the origin stack: Nginx, PHP-FPM, app, database, storage.

Decision: Jump to web server error logs and upstream health (Tasks 6–8).

Task 3: Check basic host pressure (load, memory, swap)

cr0x@server:~$ uptime
 10:13:12 up 23 days,  4:01,  2 users,  load average: 18.42, 17.90, 16.55

Meaning: Load average is very high. That can be CPU saturation, I/O wait, or runnable queue explosion from too many processes.

Decision: Confirm whether the load is CPU or I/O (Task 4) before you blindly scale PHP workers (which can make it worse).

Task 4: Identify CPU vs I/O wait (vmstat)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
16  4      0  32120  11244 622144    0    0   120  3400 2200 6100 15  8 20 57  0
18  3      0  29804  10980 615332    0    0    90  4100 2400 6500 12  7 18 63  0
20  5      0  28512  10820 612110    0    0   110  5200 2500 6900 10  6 14 70  0
15  2      0  31040  11100 620400    0    0    95  3800 2100 6000 14  7 22 57  0
17  4      0  29520  10920 614880    0    0   105  4600 2450 6700 11  6 16 67  0

Meaning: High wa (I/O wait) indicates the CPUs are mostly waiting on disk/network storage. The “b” (blocked) column is also elevated.

Decision: Investigate disk latency and filesystem pressure (Tasks 5 and 14). Don’t “fix” this by adding PHP-FPM children; you’ll just create more blocked processes.

Task 5: Check disk space and inode exhaustion

cr0x@server:~$ df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  ext4   40G   39G  520M  99% /

Meaning: You’re basically out of disk. Many services fail in weird ways when they can’t write logs, temp files, or sockets.

Decision: Free space immediately (rotate logs, delete large temp files, move backups), then re-test. If it’s inodes instead, check Task 5b:

cr0x@server:~$ df -i /
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p2  2621440 2619900   1540  100% /

Meaning: Inode exhaustion: too many small files (cache directories, sessions, image variants). WordPress plus an aggressive cache can do this.

Decision: Purge cache directories and session storage; consider moving cache to tmpfs or object storage long-term.

Task 6: Check Nginx error log for upstream clues

cr0x@server:~$ sudo tail -n 30 /var/log/nginx/error.log
2025/12/26 10:12:58 [error] 22114#22114: *98123 connect() to unix:/run/php/php8.2-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 203.0.113.10, server: example.com, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"
2025/12/26 10:13:01 [error] 22114#22114: *98145 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 198.51.100.42, server: example.com, request: "GET /wp-admin/ HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock:", host: "example.com"

Meaning: Nginx can’t connect to PHP-FPM socket sometimes and times out other times. This is typically PHP-FPM saturation, stuck workers, or socket backlog problems.

Decision: Inspect PHP-FPM service health and pool status (Tasks 7–9). If PHP-FPM is down, restart it; if it’s up but saturated, find the slow path.

Task 7: Check whether PHP-FPM is running and stable

cr0x@server:~$ systemctl status php8.2-fpm --no-pager
● php8.2-fpm.service - The PHP 8.2 FastCGI Process Manager
     Loaded: loaded (/lib/systemd/system/php8.2-fpm.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-26 09:02:11 UTC; 1h 11min ago
       Docs: man:php-fpm8.2(8)
   Main PID: 1032 (php-fpm8.2)
     Status: "Processes active: 64, idle: 0, Requests: 92314, slow: 387, Traffic: 0.0req/sec"
      Tasks: 65 (limit: 18922)
     Memory: 2.4G
        CPU: 1h 03min

Meaning: “idle: 0” with many active processes suggests full saturation. If your pool max children is 64, you’ve hit the ceiling.

Decision: Don’t just bump max children until the box collapses. First determine why workers are slow (Task 9) and whether the database or storage is the real bottleneck (Tasks 11–14).

Task 8: Check PHP-FPM socket and listen backlog

cr0x@server:~$ sudo ss -xlpn | grep php
u_str LISTEN 0      128    /run/php/php8.2-fpm.sock  22139            * 0 users:(("php-fpm8.2",pid=1032,fd=8))

Meaning: The socket backlog is 128. If you see lots of pending connections or errors like “Resource temporarily unavailable,” backlog and process availability matter.

Decision: If backlog is too small for your burst traffic, tune listen.backlog and kernel somaxconn later. Right now, find why workers aren’t freeing up.
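
For that later tuning: the backlog lives in two places and the smaller value wins. A sketch, assuming the Debian-style paths used elsewhere in this guide:

; /etc/php/8.2/fpm/pool.d/www.conf — pool-side accept queue
listen.backlog = 1024

cr0x@server:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024

The kernel silently caps the socket backlog at net.core.somaxconn, so raise both, and persist the sysctl under /etc/sysctl.d/ once the fire is out.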

Task 9: Enable/inspect PHP-FPM slow log (find the slow code path)

cr0x@server:~$ sudo tail -n 30 /var/log/php8.2-fpm/slow.log
[26-Dec-2025 10:12:55]  [pool www] pid 11844
script_filename = /var/www/example.com/public/index.php
[0x00007f6b2c2f81f0] mysqli_query() /var/www/example.com/public/wp-includes/wp-db.php:2050
[0x00007f6b2c2f7f20] query() /var/www/example.com/public/wp-includes/wp-db.php:1941
[0x00007f6b2c2f7c50] get_results() /var/www/example.com/public/wp-includes/wp-db.php:2970
[0x00007f6b2c2f6a10] get_posts() /var/www/example.com/public/wp-includes/post.php:2543
[0x00007f6b2c2f51a0] WP_Query->get_posts() /var/www/example.com/public/wp-includes/class-wp-query.php:3604

Meaning: Workers are stuck in MySQL queries. The 503 is an application-layer symptom of a database problem (or storage latency causing database slowdown).

Decision: Switch focus to database health and query behavior (Tasks 11–13). Restarting PHP-FPM won’t fix a database that’s choking; it just resets the queue briefly.
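
If the slow log isn’t enabled yet, two pool directives turn it on; paths assume the same Debian-style layout, and 5 seconds is a starting threshold, not gospel:

; /etc/php/8.2/fpm/pool.d/www.conf
slowlog = /var/log/php8.2-fpm/slow.log
request_slowlog_timeout = 5s

cr0x@server:~$ sudo systemctl reload php8.2-fpm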

Task 10: Verify WordPress cron isn’t stampeding

cr0x@server:~$ sudo tail -n 50 /var/log/syslog | grep -E 'wp-cron|cron' | tail -n 10
Dec 26 10:00:01 server CRON[21780]: (www-data) CMD (/usr/bin/php /var/www/example.com/public/wp-cron.php)
Dec 26 10:00:02 server CRON[21795]: (www-data) CMD (/usr/bin/php /var/www/example.com/public/wp-cron.php)
Dec 26 10:00:03 server CRON[21810]: (www-data) CMD (/usr/bin/php /var/www/example.com/public/wp-cron.php)

Meaning: Multiple cron invocations back-to-back can stack up, especially if a job is slow or overlapping. WordPress cron is notorious for being “helpful” at the worst time.

Decision: Temporarily disable WP cron triggering via web requests and move to a single system cron later. During the incident, stop the stampede and re-check load.
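
The usual follow-up, sketched with illustrative paths: stop page views from triggering cron, then run it from one place with a lock so jobs can’t overlap.

// In wp-config.php — stop every request from firing wp-cron.php
define('DISABLE_WP_CRON', true);

# /etc/cron.d/wordpress-cron — one runner every 5 minutes, flock prevents overlap
*/5 * * * * www-data flock -n /tmp/wp-cron.lock /usr/bin/php /var/www/example.com/public/wp-cron.php >/dev/null 2>&1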

Task 11: Check database max connections and thread pressure

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 298   |
+-------------------+-------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 300   |
+-----------------+-------+

Meaning: You’re at the connection ceiling. New requests queue or fail; upstream components may produce 503 depending on how they handle timeouts.

Decision: Identify who’s holding connections (Task 12) and why. Short-term mitigation might be to restart runaway clients or increase max connections if memory allows, but fixing query behavior is the real work.

Task 12: Find long-running queries and locks

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;" | head -n 20
Id	User	Host	db	Command	Time	State	Info
421	wpuser	10.0.2.15:51244	wpdb	Query	87	Sending data	SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_type = 'post' ORDER BY wp_posts.post_date DESC LIMIT 0, 10
422	wpuser	10.0.2.15:51262	wpdb	Query	90	Locked	UPDATE wp_options SET option_value = '...' WHERE option_name = 'rewrite_rules'
430	wpuser	10.0.2.15:51410	wpdb	Sleep	220		NULL

Meaning: You have queries stuck (“Locked”) and long-running selects. Option table updates (like rewrite_rules) can lock and block requests site-wide.

Decision: If a specific query is locking the world, you may need to kill it surgically and then disable the plugin/theme/action causing it. Killing random queries is how you create sequel incidents.
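
Surgical here means killing the one offender by Id from the processlist above (422, the locked wp_options update), not sweeping the table:

cr0x@server:~$ mysql -e "KILL 422;"

The PHP request holding that query will error out, which is the controlled outcome you want; then disable whatever keeps regenerating rewrite_rules.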

Task 13: Check MySQL slow query log signals (if enabled)

cr0x@server:~$ sudo tail -n 20 /var/log/mysql/slow.log
# Time: 2025-12-26T10:12:41.123456Z
# User@Host: wpuser[wpuser] @ 10.0.2.15 []
# Query_time: 5.882  Lock_time: 1.204 Rows_sent: 10  Rows_examined: 890221
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_type = 'post' ORDER BY wp_posts.post_date DESC LIMIT 0,10;

Meaning: Rows examined is huge for a tiny result set. That’s an indexing problem, an unbounded query, or a plugin doing “search” the hard way.

Decision: Apply indexing fixes and query refactors after recovery. During the incident, reduce load (cache, rate-limit bots, disable the feature), and keep the database breathing.
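
If the slow log isn’t enabled, you can switch it on at runtime without a restart; the threshold below is illustrative, and the change is lost on restart unless you also put it in my.cnf:

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1; SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';"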

Task 14: Check storage latency and I/O wait at the device layer (iostat)

cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (server) 	12/26/2025 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.31    0.00    6.22   58.44    0.00   23.03

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
nvme0n1         42.00   1800.00     0.00   0.00   12.40    42.86  210.00  86000.00   120.00  36.36   95.20   409.52   20.10  99.80

Meaning: Write await is very high and device is ~100% utilized. When storage is saturated, PHP workers stall, database stalls, and Nginx starts returning upstream timeouts/503s.

Decision: Identify what’s writing (logs, backups, database flush storms). Pause non-essential heavy I/O (backups, cron jobs, indexing) and consider moving uploads/cache to separate volumes later.

Task 15: Check if the kernel is killing processes (OOM)

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Thu Dec 26 10:11:33 2025] Out of memory: Killed process 11844 (php-fpm8.2) total-vm:2145120kB, anon-rss:612000kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:1920kB oom_score_adj:0

Meaning: PHP-FPM workers are being OOM-killed. That can manifest as intermittent 503s as capacity drops and recovers.

Decision: Mitigate by reducing PHP-FPM max children, fixing memory leaks (plugins), raising memory limits cautiously, and adding memory/swap only with intent.

Task 16: Check web server active connections and rate of requests

cr0x@server:~$ sudo ss -s
Total: 2345 (kernel 0)
TCP:   1988 (estab 612, closed 1201, orphaned 0, timewait 1103)

Transport Total     IP        IPv6
RAW	  0         0         0
UDP	  14        12        2
TCP	  787       735       52
INET	  801       747       54
FRAG	  0         0         0

Meaning: Lots of timewait and high total connections can indicate bursts, bots, keepalive mis-tuning, or load balancer behavior.

Decision: If this is attacky traffic, rate-limit at the edge or Nginx. If it’s normal traffic, ensure worker and file descriptor limits aren’t too low.
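
A minimal Nginx rate-limit sketch for the bot case; the zone name, rate, and protected path are assumptions to adapt:

# http {} context: one shared zone keyed by client IP
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

# server {} block: apply it to the endpoints bots love
location = /wp-login.php {
    limit_req zone=perip burst=10 nodelay;
    # ...existing fastcgi_pass configuration stays here...
}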

Task 17: Confirm file descriptor limits (classic hidden 503 cause)

cr0x@server:~$ sudo cat /proc/$(pidof nginx | awk '{print $1}')/limits | grep -i "open files"
Max open files            1024                 1024                 files

Meaning: 1024 open files for Nginx is small for a busy site (connections, logs, temp files). When you hit it, you get bizarre upstream errors and failed accepts.

Decision: Raise limits via systemd unit overrides and Nginx config. During an incident, reducing connections (rate-limit/bots) can buy time.
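
Raising the limit properly means touching both systemd and Nginx. A sketch; 65535 is a common starting point, not a law:

# /etc/systemd/system/nginx.service.d/override.conf (what systemctl edit nginx creates)
[Service]
LimitNOFILE=65535

# nginx.conf, main context — lets worker processes use the new ceiling
worker_rlimit_nofile 65535;

cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart nginx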

Task 18: Quickly test PHP execution without the full WordPress stack

cr0x@server:~$ printf '%s\n' '<?php echo "ok";' | sudo tee /var/www/example.com/public/health.php >/dev/null
cr0x@server:~$ curl -sS http://127.0.0.1/health.php
ok

Meaning: PHP-FPM can execute a trivial script. If WordPress pages 503 but this works, the runtime is alive; the issue is likely WordPress code path (plugin/theme), database, or storage.

Decision: If trivial PHP works, check database connectivity and disable recent plugins/themes in a controlled way (see checklist section).

Common 503 failure modes in WordPress stacks

1) PHP-FPM pool saturation (most common, least understood)

When PHP-FPM is out of idle workers, requests queue at the socket. Nginx waits. Eventually it gives up and you get 503/504/502 depending on tuning and upstream components. The user sees “Service Unavailable,” you see a swamp of workers doing something slow.

What causes it:

  • Slow database queries (missing indexes, table locks, option autoload bloat).
  • Slow storage (uploads on network filesystem, saturated disk).
  • External API calls inside page render (payment/shipping, marketing pixels, remote fonts, license checks).
  • Too many workers for available RAM leading to OOM, then thrash.

Fix strategy: First stabilize: limit concurrency, restart gracefully if needed, purge caches carefully. Then find the slow path via slow logs, database processlist, and request traces.
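
For the later tuning, the knobs live in the pool config. A sketch with illustrative numbers; size pm.max_children from the RAM you can spare for PHP divided by a worker’s real RSS, then confirm the database can handle that much concurrency:

; /etc/php/8.2/fpm/pool.d/www.conf — numbers are illustrative
pm = dynamic
pm.max_children = 24       ; ~ (RAM budget for PHP) / (average worker RSS)
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 12
pm.max_requests = 500      ; recycle workers so slow leaks don't become OOM kills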

2) Web server upstream timeouts mis-tuned

If your upstream timeouts are too aggressive, legitimate slow requests get cut off and look like 503 to clients. If timeouts are too generous, you build huge queues and amplify outages.

Opinionated rule: timeouts should reflect reality, not hope. If normal pages take 8 seconds, you don’t “fix” it by setting 120-second timeouts. You fix the page.
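
The Nginx knobs in question, with placeholder values that should reflect your measured page times plus a little margin:

# Inside the PHP location block
fastcgi_connect_timeout 5s;
fastcgi_send_timeout    30s;
fastcgi_read_timeout    30s;   # cover your real slow pages, not your hopes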

3) Database connection ceiling or lock storms

WordPress can create a surprising number of concurrent queries under load, especially with heavy plugins. When the DB hits max connections, new PHP requests either block or error. That bubbles up as “upstream timed out,” “could not connect,” or plain 503 from load balancers failing health checks.

Lock storms often come from:

  • Plugins updating wp_options frequently.
  • Rewrite rule regeneration on traffic.
  • Cache plugins doing writes under load.

4) Storage latency and inode disasters

WordPress isn’t just PHP and MySQL. It’s media files, caches, sessions, and plugin detritus. When storage is slow, everything becomes slow. When inodes are exhausted, everything becomes broken in creative ways.

5) Edge returns 503 because health checks fail

Your origin might work for real pages but fail the health check endpoint due to redirects, auth, or dependency calls. The load balancer then marks the target unhealthy and serves 503 even though the app is mostly alive.
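
A health check that can’t be taken down by PHP is a few lines of Nginx; the path name is a convention, not a requirement:

# server {} block, before any catch-all PHP location
location = /healthz {
    access_log off;
    default_type text/plain;
    return 200 "ok\n";
}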

6) Plugin/theme meltdown after an update

A single plugin can add N+1 queries, remote calls, or heavy CPU work on every request. Under real traffic, that’s a slow-motion denial of service you paid for yourself.

Joke #2: The plugin promised it would “boost performance.” It did—right up until it boosted your error rate.

Three corporate-world mini-stories (because this keeps happening)

Mini-story 1: The incident caused by a wrong assumption

They had a WordPress site fronted by a CDN and a managed load balancer. During a product launch, 503s appeared. The team assumed “origin is down” and started restarting PHP-FPM and scaling instances. It helped for about two minutes each time. The 503s kept coming back like a recurring calendar invite.

The wrong assumption was subtle: they treated the 503 as an application error. But the load balancer was the one returning it, and it was doing that because health checks were failing. The health check endpoint was /, which started redirecting to a geo-specific path after a marketing change. The load balancer didn’t follow redirects. Unhealthy targets, instant 503.

It got worse because restarts changed timing and made health checks intermittently pass. That created an illusion of partial fixes, which led to more restarts, which made user-facing stability worse. Everyone had graphs. Nobody had the right graph.

The fix was boring: a dedicated /healthz endpoint returning a plain 200, served by Nginx without PHP. Health checks stabilized immediately. After that, the team could see the real issue: traffic was high, but the stack was actually coping fine once it was allowed to receive requests.

Mini-story 2: The optimization that backfired

A different company tried to “speed up WordPress” by cranking PHP-FPM pm.max_children way up and lowering Nginx timeouts to keep things snappy. The site did feel fast in staging. Production traffic arrived, and the box turned into a space heater with a disk problem.

The increased worker count multiplied concurrent database queries. The DB started flushing dirty pages constantly. Disk utilization pegged. I/O wait soared. PHP workers piled up waiting on the database, and Nginx timeouts started cutting them off. The user saw 503s; the team saw “but we increased capacity.”

They had optimized for throughput on paper, ignoring the shared bottleneck: storage-backed database I/O. More concurrency did not mean more completion. It meant more waiting.

The recovery play was to reduce concurrency intentionally: lower PHP-FPM children to match the database’s ability to serve, extend upstream timeouts just enough to avoid mass retries, and rate-limit expensive endpoints. After the fire was out, they added proper caching and fixed the worst queries. The “optimization” became a lesson in bottlenecks.

Mini-story 3: The boring but correct practice that saved the day

A media company ran WordPress at steady high traffic. Their secret weapon wasn’t a fancy service mesh. It was discipline: a health check endpoint that didn’t touch PHP, regular log rotation, capacity alerts on disk and inodes, and a standard incident runbook that everyone actually used.

One afternoon, 503 rates climbed. On-call followed the runbook: confirm edge vs origin, check disk, check PHP-FPM, check DB connections. Disk usage was fine, but inodes were nearly exhausted. A cache plugin had started generating huge numbers of tiny files after a configuration change. PHP couldn’t write sessions reliably, and the site started throwing errors and timing out.

Because they had inode alerts, they caught it before total failure. They purged the cache directory, applied a cap to cache file generation, and moved the cache store to Redis later. Users saw a blip, not a headline.

It wasn’t glamorous. It was correct. Boring correctness is how you win outages without becoming a story on social media.

Common mistakes: symptom → root cause → fix

This is the section that prevents the “we restarted it six times and now it’s worse” pattern.

503 only at the CDN, origin looks fine

  • Symptom: Edge headers show CDN/load balancer; origin curl returns 200.
  • Root cause: Health check path failing, origin IP blocked, TLS mismatch, WAF rule, or origin rate limiting the edge.
  • Fix: Use a dedicated /healthz that returns 200 without auth/redirects; confirm firewall allows edge IPs; verify certificate and SNI; check edge error logs.

503 spikes during traffic bursts, then recovers

  • Symptom: Intermittent 503s, upstream timeout errors in Nginx.
  • Root cause: PHP-FPM max children hit; DB connection limit hit; cache stampede.
  • Fix: Add caching that prevents stampedes, tune PHP-FPM to match CPU/RAM and DB capacity, rate-limit botty paths, and ensure OPcache is configured correctly.

503 after plugin/theme update

  • Symptom: Outage starts right after update; admin may be inaccessible; slow logs show specific PHP paths.
  • Root cause: Plugin causing fatal errors, memory blowups, or heavy DB writes; incompatibility with PHP version.
  • Fix: Disable plugin by renaming its directory or using WP-CLI; roll back; pin versions; introduce staging and canary deployments.

503s correlate with “Disk 100%” or high iowait

  • Symptom: High iowait, MySQL stalls, PHP workers blocked.
  • Root cause: Backup job, log flood, DB flush storm, slow network storage, inode exhaustion.
  • Fix: Stop/limit noisy I/O; move heavy writers off the root volume; tune database flush behavior; separate uploads/cache storage; monitor latency, not just throughput.

Restarting PHP-FPM “fixes it” briefly

  • Symptom: Immediate improvement, then rapid relapse.
  • Root cause: Queue reset; underlying bottleneck remains (DB, storage, external API); or a memory leak that rebuilds quickly.
  • Fix: Use the brief recovery window to gather evidence: slow logs, processlist, top offenders, and rate-limit. Then address root cause.

Only wp-admin returns 503

  • Symptom: Front page sometimes works, admin consistently fails.
  • Root cause: Higher privilege pages trigger heavier plugin logic, non-cached requests, or session writes; also can indicate auth provider slowness.
  • Fix: Disable plugins that hook admin; check session storage; test minimal PHP endpoint; investigate external auth calls.

Checklists / step-by-step plan

Checklist A: 10-minute triage to restore service

  1. Locate the 503 origin: edge vs origin (curl headers; origin direct).
  2. Check host health: load, memory, swap, disk space, inodes.
  3. Look at Nginx/Apache error logs: upstream connect errors, timeouts, “too many open files.”
  4. Check PHP-FPM status: running, max children hit, slow log activity.
  5. Check DB saturation: threads connected, max connections, processlist locks.
  6. Mitigate:
    • Rate-limit abusive paths/bots at edge or Nginx.
    • Purge or warm critical caches if safe.
    • Stop heavy background jobs (backups, imports, cron storms).
    • Gracefully reload/restart services only if you understand what you’re resetting.
  7. Verify recovery: curl from origin and outside, check 2–3 key pages, watch error rate drop.

Checklist B: Safe plugin disable when wp-admin is down

If you strongly suspect a plugin and you can’t reach admin, do it from the filesystem. This is blunt but effective.

  1. Identify recently updated plugin(s) (from deployment logs or file mtimes).
  2. Disable by renaming directory (WordPress will skip it).
  3. Re-test health endpoint and homepage.
  4. If it recovers, keep the plugin disabled and plan a controlled re-enable with profiling.
cr0x@server:~$ cd /var/www/example.com/public/wp-content/plugins
cr0x@server:~$ sudo mv suspicious-plugin suspicious-plugin.disabled
cr0x@server:~$ curl -sS -o /dev/null -w "%{http_code}\n" -H 'Host: example.com' http://127.0.0.1/
200

Meaning: If status returns 200 after disabling, you’ve isolated the cause to that plugin or something it triggered.

Decision: Keep it disabled; notify stakeholders; capture logs and a copy of the plugin version for later analysis.
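
If WP-CLI is installed and the database is reachable, a gentler variant of the same move; the plugin slug is the placeholder from above, and the output is roughly what WP-CLI prints:

cr0x@server:~$ cd /var/www/example.com/public
cr0x@server:~$ sudo -u www-data wp plugin deactivate suspicious-plugin --skip-plugins --skip-themes
Plugin 'suspicious-plugin' deactivated.
Success: Deactivated 1 of 1 plugins.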

Checklist C: Graceful service restart without making it worse

Restarts are tools, not prayers. Use them when the component is genuinely wedged, not as a substitute for diagnosis.

cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ sudo systemctl reload nginx
cr0x@server:~$ sudo systemctl restart php8.2-fpm

Meaning: Reload Nginx first (cheap and low risk) after config verification. Restart PHP-FPM only if workers are stuck/OOMing and you need to shed the jam.

Decision: If restarting PHP-FPM restores service but relapse happens quickly, stop restarting and switch to “find the slow path” evidence collection.

Checklist D: Post-recovery stabilization (same day)

  1. Keep rate limits until you’ve confirmed stability under normal load.
  2. Capture artifacts: top slow queries, PHP-FPM slow log snippets, Nginx upstream errors, DB processlist samples.
  3. Write a short incident timeline: what changed, when it started, what mitigations worked.
  4. Turn the biggest pain point into a permanent check (alert, dashboard, automated test).

FAQ

1) Is a WordPress 503 always a server problem?

No. It’s often the server reacting to application behavior: a plugin creating expensive queries, a theme calling external APIs, or cron jobs piling up. The server is just the messenger.

2) Why do I see 503 sometimes and 504 other times?

Different layers translate failures differently. A load balancer might call an upstream timeout a 503, while Nginx might emit 504. Focus on where the failure originates, not the exact code.

3) Should I increase PHP-FPM pm.max_children to fix 503?

Only if you’ve confirmed you have spare CPU/RAM and the database/storage can handle more concurrency. Otherwise you amplify the bottleneck and risk OOM kills and worse latency.

4) How do I know if the database is the bottleneck?

PHP-FPM slow logs showing mysqli_query(), MySQL Threads_connected near max, long-running queries in processlist, and high disk wait on the DB host are strong signals.

5) Can a CDN cause 503 even if my origin is fine?

Yes. Health check failures, blocked origin access, TLS/SNI mismatch, or rate limiting can make the edge return 503. Always test origin directly with the correct Host header.

6) What’s the fastest way to isolate a bad plugin?

Disable it without relying on wp-admin: rename the plugin directory and re-test. For a more surgical approach, disable plugins one at a time starting with the most recently changed.

7) Why does restarting “fix it” temporarily?

Because you flush queues and kill stuck workers, which briefly restores capacity. If the underlying slow dependency remains, you’re just resetting the stopwatch.

8) Could 503 be caused by file permissions?

Indirectly. If WordPress can’t write to upload/cache/session directories, requests can hang or error, leading to upstream failures. Check Nginx/PHP logs for permission errors and verify filesystem health.

9) What health check endpoint should I use for load balancers?

A static endpoint served by the web server (not PHP), returning 200 with minimal work. If you need deeper checks, create a second endpoint for internal monitoring, not for traffic steering.

10) How do I prevent 503s long-term?

Instrument the bottlenecks (PHP-FPM saturation, DB connections, disk latency), control concurrency, cache the right things, and treat plugin updates like deployments—staged and reversible.

Conclusion: next steps after you’re back

Once the site is serving again, don’t “close the incident” and walk away. The system just taught you where it breaks. Cash that lesson in.

  1. Convert the root cause into a guardrail: alerts on inodes/disk latency, PHP-FPM saturation, DB connection headroom, and health check correctness.
  2. Fix the slow path: index the worst queries, remove or replace the heaviest plugin behavior, and stop doing remote calls in the critical render path.
  3. Right-size concurrency: set PHP-FPM children to what the database and storage can actually sustain, not what feels comforting.
  4. Make failures cheaper: add a real /healthz, add caching with stampede protection, and implement rate limiting for abusive endpoints.
  5. Write the runbook you wish you had: the exact commands you ran, the logs that mattered, and the “don’t do this again” notes. Next outage you will be tired and less clever.

If you do those five things, the next 503 won’t be a mystery. It’ll be a known failure mode with a short, slightly annoyed, but effective response.
