Ubuntu 24.04: Web server suddenly shows 502/504 — the real reason (and how to fix it fast)

Your web server was fine. You deployed nothing. Then suddenly: 502 Bad Gateway and 504 Gateway Timeout across the site like a bad weather front. Customers refresh. Dashboards go red. Someone suggests “restart nginx” like it’s a sacred ritual.

Sometimes a restart is a band-aid. Sometimes it’s the fastest way to destroy the evidence. The goal here is to find the actual bottleneck on Ubuntu 24.04—upstream processes, sockets, DNS, CPU starvation, or storage latency—and fix it without guessing.

What 502/504 actually mean (and what they don’t)

When a user sees a 502/504, they blame “the web server.” In practice, the web server is often just the messenger.

502 Bad Gateway: “I talked to my upstream and it was nonsense or nothing.”

  • Typical meaning: Nginx/Apache acted as a proxy and the upstream connection failed or returned an invalid response.
  • Common real causes: upstream process crashed, socket permissions changed, upstream listening port wrong, TLS mismatch, protocol mismatch, backend returning garbage because it’s dying.

504 Gateway Timeout: “I waited for upstream and it never answered in time.”

  • Typical meaning: the proxy established a connection (or tried to) but the upstream didn’t respond within a timeout.
  • Common real causes: upstream overloaded, DB calls slow, storage latency, DNS stalls, connection pool exhaustion, deadlocks, kernel resource exhaustion.

Here’s the operational truth: 502/504 are usually symptoms of a bottleneck, not bugs in nginx. Nginx is boring. It’s supposed to be boring. When it isn’t, it’s because something else made it interesting.

Joke #1: Restarting nginx to fix a 504 is like turning your radio off to fix traffic jams. It reduces the noise, not the problem.

Fast diagnosis playbook (first/second/third)

If you do nothing else, do this. The goal is to identify the bottleneck in under 10 minutes, with evidence you can hand to the next person without shame.

First: confirm what’s generating the 502/504 and capture logs immediately

  1. Check proxy error logs (nginx or Apache) for the exact upstream error string.
  2. Correlate timestamps with backend logs (php-fpm, gunicorn, node, etc.).
  3. Check systemd journal for restarts, OOM kills, permission errors.

Decision: if you see “connect() failed (111: Connection refused)” it’s an upstream reachability problem. If you see “upstream timed out” it’s a performance or deadlock problem. If you see “permission denied” it’s usually a socket ownership, file mode, or AppArmor issue.
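
If you want a quick tally of which failure mode dominates before you dig in, a rough one-liner works against the default Ubuntu log path (adjust the path if your vhosts log elsewhere); the strings match the errors above:

cr0x@server:~$ sudo grep -oE 'upstream timed out|Connection refused|Permission denied|No such file or directory' /var/log/nginx/error.log | sort | uniq -c | sort -rn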

Second: determine whether it’s compute, memory, network, or storage

  1. CPU + load + iowait: if iowait spikes, you’re not CPU-bound; you’re waiting on disks.
  2. Memory pressure: OOM kills and swap storms look like random timeouts.
  3. Network and DNS: upstream could be “slow” because the proxy can’t resolve or connect fast.

Decision: if wa is high, go to storage checks. If si/so (swap in/out) is busy, go to memory. If you see many SYN-SENT or TIME-WAIT, go to network/ports.

Third: isolate the failing hop

  1. Test from the proxy to the upstream (loopback, unix socket, local port, remote service).
  2. Test from the upstream to its dependencies (DB, cache, object storage).
  3. If it’s only some requests: check timeouts, worker limits, queueing, and connection pools.

Decision: fix the narrowest choke point first. Don’t “tune everything” in a panic. That’s how you end up with a second outage that’s more creative than the first.

Hands-on: 15 tasks with commands, outputs, and decisions

These are the tasks I run in production when 502/504s appear “out of nowhere.” Every command includes: what it tells you, what the output means, and what you do next.

Task 1: Verify whether nginx is the component emitting the errors

cr0x@server:~$ sudo tail -n 50 /var/log/nginx/error.log
2025/12/28 09:41:12 [error] 3192#3192: *884 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.25, server: example.com, request: "GET /api/orders HTTP/1.1", upstream: "http://127.0.0.1:8000/api/orders", host: "example.com"
2025/12/28 09:41:15 [error] 3192#3192: *901 connect() to unix:/run/php/php8.3-fpm.sock failed (13: Permission denied) while connecting to upstream, client: 198.51.100.9, server: example.com, request: "GET /index.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php8.3-fpm.sock:", host: "example.com"

Meaning: Two different failure modes: one is a timeout from an HTTP upstream on 127.0.0.1:8000; the other is a permission issue on a PHP-FPM unix socket.

Decision: Don’t treat this as one incident. Split the problem: fix socket permissions for PHP paths and performance/timeouts for the API upstream.

Task 2: Confirm active config and upstream targets (avoid debugging the wrong file)

cr0x@server:~$ sudo nginx -T | sed -n '1,120p'
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
# configuration file /etc/nginx/nginx.conf:
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

Meaning: You’re looking at the live configuration nginx will use, including includes. This prevents the classic mistake: editing a file not actually loaded.

Decision: Search inside this output for your upstream stanza or fastcgi_pass lines, then verify they match reality (socket path, port, hostname).
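
A hedged shortcut for that search, assuming your upstreams are wired through proxy_pass, fastcgi_pass, or upstream blocks:

cr0x@server:~$ sudo nginx -T 2>/dev/null | grep -nE 'upstream|proxy_pass|fastcgi_pass'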

Task 3: Check whether upstream processes are alive and not restart-looping

cr0x@server:~$ systemctl status php8.3-fpm --no-pager
● php8.3-fpm.service - The PHP 8.3 FastCGI Process Manager
     Loaded: loaded (/usr/lib/systemd/system/php8.3-fpm.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-12-28 09:10:04 UTC; 33min ago
       Docs: man:php-fpm8.3(8)
    Process: 1258 ExecStartPre=/usr/lib/php/php-fpm-socket-helper install /run/php/php-fpm.sock /etc/php/8.3/fpm/pool.d/www.conf 83 (code=exited, status=0/SUCCESS)
   Main PID: 1269 (php-fpm8.3)
      Tasks: 21 (limit: 18841)
     Memory: 230.5M
        CPU: 1min 42.902s

Meaning: php-fpm is running. That doesn’t mean nginx can access its socket. It just means the service didn’t crash.

Decision: If nginx shows permission denied, inspect socket ownership/mode and the nginx user (usually www-data).
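
“Active: running for 33 minutes” can also hide an earlier crash loop. A quick check using standard systemd properties:

cr0x@server:~$ systemctl show php8.3-fpm -p NRestarts,Result,ExecMainStartTimestamp

If NRestarts is greater than zero, systemd has been quietly resurrecting the service; read journalctl -u php8.3-fpm around those times before blaming nginx.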

Task 4: Validate unix socket permissions for PHP-FPM

cr0x@server:~$ ls -l /run/php/php8.3-fpm.sock
srw-rw---- 1 root www-data 0 Dec 28 09:10 /run/php/php8.3-fpm.sock

Meaning: Socket owned by root:www-data with mode 660. nginx runs as www-data by default, so this should work.

Decision: If you see srw------- root root or similar, fix listen.owner, listen.group, and listen.mode in the pool config, then reload php-fpm.
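
The relevant pool directives, as a sketch; the path matches the stock Ubuntu 24.04 layout, and the group assumes nginx workers run as www-data (verified in the next task):

# /etc/php/8.3/fpm/pool.d/www.conf (excerpt)
listen = /run/php/php8.3-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660

cr0x@server:~$ sudo systemctl restart php8.3-fpm && ls -l /run/php/php8.3-fpm.sock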

Task 5: Confirm nginx worker user (don’t assume)

cr0x@server:~$ ps -o user,group,pid,cmd -C nginx | head
USER     GROUP      PID CMD
root     root      3121 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data www-data  3192 nginx: worker process
www-data www-data  3193 nginx: worker process

Meaning: Workers run as www-data, so socket group ownership needs to include www-data or permissions must allow access.

Decision: If workers run under a different user (containers, custom builds), align socket perms accordingly.

Task 6: Find whether the upstream port is listening and what owns it

cr0x@server:~$ sudo ss -ltnp | grep -E ':8000|:8080|:9000'
LISTEN 0      4096       127.0.0.1:8000      0.0.0.0:*    users:(("gunicorn",pid=2204,fd=5))
LISTEN 0      4096       127.0.0.1:9000      0.0.0.0:*    users:(("php-fpm8.3",pid=1269,fd=8))

Meaning: gunicorn is listening on 127.0.0.1:8000. If nginx still times out, gunicorn is alive but slow or stuck, or the path is blocking inside the app.

Decision: Move to backend logs and resource contention checks. If the port isn’t listening, fix the service or its bind address.

Task 7: Reproduce the request from the proxy host (bypass the proxy layer)

cr0x@server:~$ curl -sS -D- -o /dev/null http://127.0.0.1:8000/api/orders
HTTP/1.1 200 OK
Server: gunicorn
Date: Sun, 28 Dec 2025 09:44:01 GMT
Connection: close
Content-Type: application/json

Meaning: The endpoint can answer quickly right now. That suggests intermittent queueing, worker exhaustion, or downstream dependency spikes, not a permanent misconfig.

Decision: Check concurrency limits and latency during the incident window; don’t declare victory because a single curl works.
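
To see whether the endpoint is merely intermittently slow, hammer it lightly and watch the timings. A sketch against the same loopback upstream; tune the request count and --max-time for your environment:

cr0x@server:~$ for i in $(seq 1 20); do curl -sS -o /dev/null --max-time 10 -w 'code=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' http://127.0.0.1:8000/api/orders; done

A handful of totals near your proxy_read_timeout is the smoking gun even if most requests are fast.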

Task 8: Look for upstream timeouts and worker starvation in application service logs

cr0x@server:~$ sudo journalctl -u gunicorn --since "2025-12-28 09:35" --no-pager | tail -n 30
Dec 28 09:40:58 app1 gunicorn[2204]: [2025-12-28 09:40:58 +0000] [2204] [CRITICAL] WORKER TIMEOUT (pid:2311)
Dec 28 09:40:58 app1 gunicorn[2311]: [2025-12-28 09:40:58 +0000] [2311] [ERROR] Error handling request /api/orders
Dec 28 09:40:59 app1 gunicorn[2204]: [2025-12-28 09:40:59 +0000] [2204] [INFO] Worker exiting (pid: 2311)
Dec 28 09:41:00 app1 gunicorn[2204]: [2025-12-28 09:41:00 +0000] [2204] [INFO] Booting worker with pid: 2418

Meaning: gunicorn workers are timing out and restarting. Nginx sees that as upstream timeouts/502s depending on timing.

Decision: Don’t just increase gunicorn timeout. Find why requests hang: DB locks, storage latency, thread deadlocks, DNS stalls, slow external calls.

Task 9: Check system load, CPU, and iowait during the event

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812344  48216 923400    0    0    24    18  210  390  6  2 90  2  0
 3  1      0 789120  48220 921900    0    0   120  2040  480 1200  8  4 58 30  0
 5  2      0 760112  48220 920100    0    0    80  1990  520 1350 10  5 50 35  0
 4  2      0 748000  48224 918500    0    0    60  2100  510 1300  9  4 52 31  0
 2  1      0 770200  48224 920300    0    0    40   980  360  900  7  3 78 12  0

Meaning: iowait (wa) spikes to 30–35%. That is not a web problem. That’s the kernel telling you your processes are waiting on storage.

Decision: Go to disk latency checks, filesystem saturation, and what’s hammering the disk (journald, logrotate, backups, DB checkpoints, antivirus scans, you name it).

Task 10: Identify which process is causing disk pressure

cr0x@server:~$ sudo iotop -obP -n 3
Total DISK READ:         1.65 M/s | Total DISK WRITE:        38.90 M/s
PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>  COMMAND
2210 be/4  postgres    0.00 B/s   22.30 M/s  0.00 %  9.10 % postgres: checkpointer
3140 be/4  root        0.00 B/s    8.40 M/s  0.00 %  2.20 % systemd-journald
987  be/4  www-data    1.20 M/s    0.00 B/s  0.00 %  1.30 % gunicorn: worker [app]

Meaning: Postgres checkpointer is writing heavily. journald is also writing. Those writes can stall reads depending on storage.

Decision: If this coincides with timeouts, check Postgres health and storage latency. You may need to tune DB checkpointing, move logs to faster disk, or stop a noisy neighbor.

Task 11: Check actual block device latency and saturation

cr0x@server:~$ sudo iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.82    0.00    3.21   28.44    0.00   60.53

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz aqu-sz  %util
nvme0n1          32.0   1824.0     0.0    0.0    8.20    57.0     620.0  41280.0     0.0    0.0   42.10    66.6   6.20   98.0

Meaning: Device is at 98% utilization and write await is 42ms. That’s not “fine.” That’s a queue.

Decision: Reduce write amplification (logs, checkpoints), verify underlying storage isn’t throttling, and confirm you’re not hitting cloud IOPS limits.

Task 12: Check for OOM kills and memory pressure (silent assassin)

cr0x@server:~$ sudo journalctl -k --since "2025-12-28 09:00" | grep -i -E 'oom|killed process' | tail -n 20
Dec 28 09:38:11 app1 kernel: Out of memory: Killed process 2311 (gunicorn) total-vm:1452200kB, anon-rss:612400kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:2200kB oom_score_adj:0

Meaning: The kernel killed a gunicorn worker. Nginx sees the backend vanish mid-request: 502s, 504s, or both.

Decision: Stop the bleeding: reduce concurrency or memory footprint, add RAM, fix leaks, and add cgroup limits so the “kill” is predictable (service restart) rather than chaotic.
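
One way to make the kill predictable is a systemd drop-in on the backend unit. A sketch; “gunicorn” is the unit name used elsewhere in this article and the limits are illustrative, not recommendations:

cr0x@server:~$ sudo systemctl edit gunicorn
[Service]
# Soft ceiling: the kernel reclaims from this unit first under pressure
MemoryHigh=1200M
# Hard ceiling: the cgroup OOM kill hits this service, not a random process
MemoryMax=1500M
Restart=on-failure
RestartSec=3s

cr0x@server:~$ sudo systemctl restart gunicorn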

Task 13: Inspect connection states for port exhaustion or upstream congestion

cr0x@server:~$ ss -s
Total: 3087 (kernel 0)
TCP:   1987 (estab 412, closed 1450, orphaned 0, synrecv 0, timewait 1450/0), ports 0

Transport Total     IP        IPv6
RAW       0         0         0
UDP       12        10        2
TCP       537       509       28
INET      549       519       30
FRAG      0         0         0

Meaning: TIME_WAIT is high. That’s not automatically a problem, but it can be if you’re doing lots of short-lived upstream connections and running out of ephemeral ports.

Decision: If you observe “cannot assign requested address” in logs, shift to keepalive/pooling, and examine net.ipv4.ip_local_port_range and reuse settings carefully.
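
If pooling is the answer, upstream keepalive in nginx is the usual lever. A minimal sketch; “app_backend” is a placeholder and the location fragment belongs inside your existing server block:

upstream app_backend {
    server 127.0.0.1:8000;
    keepalive 32;                  # idle connections cached per worker process
}

location /api/ {
    proxy_pass http://app_backend;
    proxy_http_version 1.1;        # upstream keepalive requires HTTP/1.1
    proxy_set_header Connection "";
}

cr0x@server:~$ sysctl net.ipv4.ip_local_port_range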

Task 14: Confirm DNS isn’t stalling upstream lookups

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (ens5)
Current Scopes: DNS
     Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.0.0.2
       DNS Servers: 10.0.0.2 10.0.0.3

Meaning: systemd-resolved is in play. If DNS server is slow or unreachable, your upstream hostname resolution can block request handling.

Decision: If upstream uses hostnames (not IPs), verify resolution speed and caching. For nginx, consider resolving at startup with stable IPs or using a resolver directive with timeouts.
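
Two hedged checks: time a lookup through systemd-resolved, and, if you proxy to a hostname, give nginx its own resolver with a short timeout. The hostname and location are placeholders:

cr0x@server:~$ time resolvectl query api.internal.example

location /api/ {
    resolver 10.0.0.2 valid=30s;   # the DNS server resolvectl reported above
    resolver_timeout 2s;
    set $backend http://api.internal.example:8000;
    proxy_pass $backend;           # using a variable forces runtime resolution via this resolver
}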

Task 15: Confirm AppArmor isn’t blocking socket/file access (Ubuntu loves AppArmor)

cr0x@server:~$ sudo dmesg | grep -i apparmor | tail -n 10
[ 8921.224911] audit: type=1400 audit(1766914872.112:188): apparmor="DENIED" operation="connect" profile="/usr/sbin/nginx" name="/run/php/php8.3-fpm.sock" pid=3192 comm="nginx" requested_mask="wr" denied_mask="wr" fsuid=33 ouid=0

Meaning: nginx is denied connecting to the PHP-FPM socket by AppArmor. That looks exactly like “permission denied,” but chmod won’t fix it.

Decision: Adjust the AppArmor profile to allow that socket path or align to the expected path. Then reload AppArmor and nginx.
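
On Ubuntu the clean way is the profile’s local override. A sketch, assuming the profile lives at /etc/apparmor.d/usr.sbin.nginx and sources its local/ drop-in the way Ubuntu-shipped profiles do:

cr0x@server:~$ echo '  /run/php/php8.3-fpm.sock rw,' | sudo tee -a /etc/apparmor.d/local/usr.sbin.nginx
cr0x@server:~$ sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.nginx
cr0x@server:~$ sudo systemctl reload nginx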

The real reasons behind sudden 502/504 on Ubuntu 24.04

“Sudden” outages are usually slow-motion failures you didn’t have visibility into—until the proxy started emitting honest HTTP status codes.

1) The backend is alive but jammed (queueing and head-of-line blocking)

Most web backends have a concurrency limit: gunicorn workers, PHP-FPM children, Node’s event loop, Java thread pools. When concurrency saturates, new requests queue. The proxy waits. Eventually it times out.

What it looks like: nginx error log says “upstream timed out,” backend logs show worker timeouts, slow requests, or nothing at all because workers are blocked.

What to do: find what’s blocking. The fastest culprits: database locks, external API calls without timeouts, slow disk I/O for templates/uploads/sessions, and synchronous logging under pressure.

2) Storage latency is the hidden upstream (and it punishes everyone equally)

If iowait climbs, the system isn’t “busy.” It’s stalled. Your application threads are waiting on disk reads/writes: DB files, session stores, cache persistence, log writes, temp files, Python wheels loading, PHP opcache resets, anything.

On Ubuntu 24.04, you might also see more aggressive logging or different systemd service defaults than you remember from older installs. Not because Ubuntu is mean, but because your workload is.

What to do: prove latency with iostat -x, identify writers with iotop, and look for periodic jobs (backups, logrotate, DB maintenance). Fix by reducing write volume, moving hot paths to faster storage, or raising IOPS provisioned on cloud disks.

3) Socket/permissions regressions (unix sockets: fast, great, occasionally petty)

Unix sockets are excellent for local proxying: low overhead, clear access control. They also fail in very specific ways—usually after a package upgrade, config drift, or a new service unit that changes runtime directories.

What it looks like: nginx error log: “connect() to unix:/run/… failed (13: Permission denied)” or “No such file or directory.”

What to do: check the socket exists, check ownership/mode, check nginx worker user, and check AppArmor denials. Then fix the pool config, not the symptom.

4) systemd restart loops and “helpful” watchdog behavior

Ubuntu 24.04 is unapologetically systemd-driven. If your backend crashes, systemd might restart it fast. That creates bursts of failure where nginx sees connection resets/refusals.

What it looks like: “connection refused” in proxy logs, journalctl -u shows frequent restarts, maybe an ExitCode pointing at config errors, missing env vars, or migrations that didn’t run.

What to do: stop the restart storm long enough to read logs. Use systemctl edit to add sane RestartSec and StartLimitIntervalSec if needed, then fix the underlying crash.
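
A sketch of such an override; “gunicorn” is a placeholder unit name and the numbers are a starting point, not gospel:

cr0x@server:~$ sudo systemctl edit gunicorn
[Unit]
# At most 3 start attempts per 5 minutes, then stay failed so the logs are readable
StartLimitIntervalSec=300
StartLimitBurst=3

[Service]
RestartSec=10s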

5) DNS resolution stalls and the myth of “it’s just a hostname”

Some proxies resolve upstream hostnames at startup. Some resolve periodically. Some rely on the OS resolver and block when it’s slow. If DNS stalls, upstream connections stall. Timeouts follow.

What it looks like: intermittent 504s, especially after network changes. Logs might show host not found in upstream or nothing obvious except increased connect times.

What to do: validate resolver health, confirm caching, reduce reliance on live DNS for critical upstreams, or configure explicit resolvers and timeouts in nginx where appropriate.

6) Kernel/network resource exhaustion: ephemeral ports, conntrack, file descriptors

When traffic spikes, you don’t just run out of CPU. You can run out of things: file descriptors, local ports, conntrack entries, listen backlog, accept queue. The kernel starts refusing requests. Nginx converts that to 502/504 depending on where it fails.

What it looks like: lots of TIME_WAIT, errors like “cannot assign requested address,” “too many open files,” or accept queue warnings. Users see timeouts more than clean errors.

What to do: inspect connection states, raise limits cautiously, enable keepalive where safe, and avoid creating a new TCP connection per request like it’s 2009.
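
Quick checks for the usual suspects; the conntrack sysctls only exist if the module is loaded, hence the silenced error:

cr0x@server:~$ sudo grep 'open files' /proc/$(pgrep -o nginx)/limits
cr0x@server:~$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max 2>/dev/null
cr0x@server:~$ ss -ltn    # on LISTEN sockets, Recv-Q is the current accept queue, Send-Q the backlog limit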

7) Timeouts misaligned across layers (proxy vs backend vs load balancer)

Timeouts are a team sport. If your load balancer times out at 60s, nginx at 30s, and gunicorn at 120s, you’ll get a parade of partial failures and retries that amplify load.

What it looks like: client sees 504 at 30s, backend continues working, then tries to write to a closed socket. Logs show broken pipes, reset by peer, long request durations.

What to do: pick a coherent timeout budget. Then ensure upstreams have their own internal timeouts for DB and external calls, shorter than the proxy timeout.
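
A sketch of one coherent budget, outermost to innermost; the numbers are illustrative, not recommendations:

# Load balancer idle timeout ........ 60s   (longest: it should never fire first)
# nginx, inside the proxied location:
#     proxy_connect_timeout 5s;
#     proxy_send_timeout    30s;
#     proxy_read_timeout    30s;
# gunicorn --timeout 25                     (worker gives up before nginx does)
# DB / external API client timeouts . 10s   (shortest: fail inside the request, visibly)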

One paraphrased idea worth keeping on your wall, attributed to Werner Vogels: everything fails, all the time; your job is to make failure predictable and visible.

Joke #2: A 504 is your server’s way of saying “I’m not ignoring you, I’m just thinking really hard.”

Common mistakes: symptom → root cause → fix

1) Symptom: 502 with “connect() failed (111: Connection refused)”

Root cause: upstream process is down, bound to a different address, or crash-looping; you’re pointing nginx to the wrong port.

Fix: verify listener with ss -ltnp; check systemctl status and journalctl -u; correct bind address (127.0.0.1 vs 0.0.0.0) and ensure service starts cleanly.

2) Symptom: 502 with “Permission denied” to unix socket

Root cause: socket owner/group/mode mismatch, or AppArmor denial.

Fix: adjust PHP-FPM pool listen.owner/listen.group/listen.mode and align nginx user; if AppArmor denies, update profile to allow /run/php/php8.3-fpm.sock.

3) Symptom: 504 “upstream timed out while reading response header”

Root cause: backend accepts connections but doesn’t respond quickly enough—worker starvation, DB slowness, storage iowait, or external API stalls.

Fix: confirm backend worker timeouts; check iowait with vmstat/iostat; find slow dependencies; add timeouts to external calls; reduce queueing by scaling or optimizing.

4) Symptom: Errors spike during logrotate/backup window

Root cause: storage saturation, log compression, snapshotting, or backups consuming I/O and CPU. Often hidden on shared cloud disks.

Fix: reschedule heavy jobs, throttle I/O, move logs/DB to separate volumes, tune DB checkpoints, and make sure backups are incremental and not “copy the universe nightly.”

5) Symptom: Only some endpoints 504, others fine

Root cause: specific code path triggers slow DB query, lock contention, large file reads, or synchronous dependency call.

Fix: instrument per-endpoint latency; examine slow query logs; add indexes or caching; stream large responses; isolate heavyweight jobs off the request path.

6) Symptom: 502 after upgrading packages (Ubuntu 24.04 refresh)

Root cause: service unit changes, PHP version bumps, socket path changes, config not compatible, or modules disabled.

Fix: revalidate nginx upstream/fastcgi paths; run nginx -T; check php-fpm service and socket paths; confirm enabled sites and modules.

7) Symptom: 504 behind a corporate proxy/load balancer only

Root cause: timeout mismatch at the edge, health checks failing, or TLS handshake delays. Sometimes the LB retries, doubling load.

Fix: align timeout budgets across LB/nginx/app; validate health check endpoint is fast and dependency-light; ensure keepalive and TLS settings are sensible.

Three corporate mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

They ran a tidy setup: nginx on Ubuntu, PHP-FPM on the same host, unix socket between them. It had worked for years. Then they rebuilt the VM image for Ubuntu 24.04 and the site started throwing 502s within minutes of the first traffic.

The on-call did what on-calls do: restarted nginx. The errors dipped, then returned. Someone blamed “a bad deploy,” but nothing had changed in app code. That was true. It also didn’t matter.

The wrong assumption was subtle: they assumed file permissions were the same as before. In the new image, a hardening step had changed the nginx master/worker user and tightened AppArmor profiles. The socket looked correct at a glance. Ownership was fine. Mode was fine. But AppArmor was denying the connect operation.

Once they looked at dmesg and saw the denial, the fix was embarrassingly straightforward: permit the socket path in the nginx profile (or align the socket to the expected path). The 502s disappeared instantly.

The postmortem wasn’t “AppArmor is bad.” The postmortem was “we assumed a security control wouldn’t affect runtime.” That assumption is how you get paged.

Mini-story 2: The optimization that backfired

A different team wanted to reduce response times and CPU. They enabled more aggressive access logging and added per-request upstream timing fields so they could graph “where time goes.” It worked. Their dashboards got prettier. Their ego grew accordingly.

Then traffic increased, and 504s started appearing during peak hours. The app team swore nothing changed. The DB team swore it wasn’t them. The proxy logs showed upstream timeouts, but backend metrics looked “mostly fine.”

The culprit: the optimization created a new hot path. Logging turned into synchronous disk writes on a cloud volume with limited IOPS. Under load, disk utilization hit the ceiling. iowait climbed. Worker threads stalled. The proxy waited, then timed out. Classic 504.

They fixed it by making logs less expensive: buffering, reducing verbosity, and moving heavy logging off the primary disk. The more meaningful fix was organizational: treat observability as production load, not free candy.

Mini-story 3: The boring but correct practice that saved the day

A company I’ve worked with had a policy that everyone mocked: “Don’t restart services until you’ve captured 10 minutes of evidence.” People complained it slowed incident response. Leadership kept it anyway, because they’d been burned before.

One afternoon, 502s appeared across a multi-tenant web cluster. The easy move was to bounce everything. They didn’t. They captured nginx error logs, backend journals, and a quick iostat/vmstat snapshot from two affected nodes.

The evidence showed a pattern: iowait spikes exactly when a scheduled job ran. That job rotated and compressed huge logs locally, colliding with a DB checkpoint burst. The storage subsystem couldn’t keep up, and upstream requests piled up behind it.

Because they had data, the fix was fast and surgical: reschedule and throttle the job, tune DB checkpointing, and split logs to a different volume. No “we think” statements. No heroic guessing. Just boring discipline that prevented a recurring outage.

Checklists / step-by-step plan

Checklist A: Stop the bleeding without deleting evidence

  1. Capture the last 200 lines of proxy error logs and access logs for the failing window (a capture sketch for steps 1–3 follows this list).
  2. Capture backend service logs for the same timestamps.
  3. Capture vmstat 1 5 and iostat -x 1 3 once, from an affected node.
  4. If the site is melting: shed load (rate limits, temporarily disable non-critical endpoints, reduce expensive features) before restarting everything.
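
A minimal capture sketch for steps 1–3, assuming an nginx front end and a gunicorn backend unit; iostat comes from the sysstat package, and paths and unit names will differ on your stack:

cr0x@server:~$ mkdir -p /tmp/evidence && cd /tmp/evidence
cr0x@server:~$ sudo tail -n 200 /var/log/nginx/error.log  > nginx-error.txt
cr0x@server:~$ sudo tail -n 200 /var/log/nginx/access.log > nginx-access.txt
cr0x@server:~$ sudo journalctl -u gunicorn --since=-30min --no-pager > backend.txt
cr0x@server:~$ vmstat 1 5 > vmstat.txt; iostat -x 1 3 > iostat.txt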

Checklist B: Confirm the failing hop

  1. From proxy host, curl the upstream directly (loopback/port/socket).
  2. From upstream host, test dependencies (DB/cache) with short timeouts.
  3. Check connection states (ss -s) and file descriptors if suspect.
  4. Confirm DNS resolution speed if upstream is hostname-based.

Checklist C: Fix safely (minimum change that restores service)

  1. If permission/socket/AppArmor: fix the policy/config; reload services; verify with a single request.
  2. If worker starvation: reduce traffic, increase worker capacity carefully, and fix the slow dependency or blocking call.
  3. If storage saturation: stop the noisy writer, reschedule jobs, or move hot I/O to a better disk.
  4. Align timeouts across LB → nginx → app → dependencies so you don’t create retry storms.

Checklist D: Prevent recurrence

  1. Add alerting on iowait, disk %util, and backend queue depth—not just HTTP error rate.
  2. Track upstream response time percentiles and count of timed-out requests.
  3. Document the live upstream topology (ports, sockets, services) and keep it accurate.
  4. Load test after OS upgrades and after “observability improvements.”

Interesting facts & historical context

  • HTTP 502 and 504 are gateway errors: they were designed for intermediaries—proxies, gateways, and load balancers—complaining about upstreams, not themselves.
  • Nginx became a default reverse proxy largely because it handled high concurrency with low memory compared to older process-per-connection models.
  • Unix domain sockets predate modern web stacks and remain one of the simplest, fastest IPC mechanisms on Linux for local services.
  • systemd normalized “restart on failure” in Linux fleets; it improved availability but also made crash loops easier to hide if you don’t watch logs.
  • AppArmor is a mainstream Ubuntu security control; it often explains “permission denied” when filesystem permissions look correct.
  • TIME_WAIT storms are old: they’ve been biting high-traffic services since the early days of TCP-heavy web architectures, especially without keepalive.
  • Cloud disks frequently have performance ceilings (IOPS/throughput caps). Latency spikes can appear “sudden” when you cross a threshold.
  • Misaligned timeouts create retry amplification: one layer times out, retries happen, load increases, and the system collapses faster than if you’d waited.

FAQ

1) Should I restart nginx when I see 502/504?

Only after you’ve captured logs and basic system signals. Restarting can temporarily clear stuck workers or sockets, but it also erases the trail. If it’s a real bottleneck (DB/storage), the errors will come right back.

2) What’s the fastest way to tell 502 vs 504 root cause?

Read the nginx error log line. “Connection refused” points to reachability/service down. “Upstream timed out” points to slow/hung upstream or dependency. “Permission denied” points to socket perms or AppArmor.

3) Why do I see 504 but the backend logs show 200 responses?

Because the backend kept working after the proxy gave up. The proxy timeout expired, closed the connection, and the backend later wrote a response to nobody. Align timeouts and add internal timeouts for dependencies.

4) Can storage really cause 504s even if CPU is low?

Yes. Low CPU with high iowait is a classic signature. Your threads are blocked in the kernel waiting on disk I/O. The proxy waits too, until it times out.

5) Why did Ubuntu 24.04 “suddenly” start doing this?

Ubuntu 24.04 itself usually isn’t the cause; the upgrade often changes versions and defaults (php-fpm versions, service units, security profiles). Those changes expose existing fragility: tight limits, brittle socket paths, or storage headroom that was never real.

6) How do I know if AppArmor is involved?

Look for AppArmor DENIED messages in dmesg or the kernel journal. If you see denies involving nginx, php-fpm sockets, or backend files, treat it as a policy issue, not a chmod issue.

7) Is increasing nginx proxy_read_timeout a real fix?

Sometimes, but it’s usually a delay, not a fix. Increase timeouts only when the work is legitimately long and controlled (exports, reports), and only if the backend can handle it without tying up workers.

8) How do I distinguish backend overload from a single slow dependency?

Backend overload shows rising queue time, worker exhaustion, and broad latency increases. A single slow dependency shows specific endpoints failing, correlated with DB locks, cache misses, or external API latency. Use per-endpoint metrics and correlate logs by request IDs if you have them.

9) Why do 502/504 appear in bursts?

Because the system oscillates: traffic pushes it into saturation, timeouts trigger retries, workers restart, caches warm/cold cycle, storage queues drain, and you get a repeating failure pattern. Fix the bottleneck and the oscillation stops.

Next steps you should actually take

If you’re in the middle of an incident: pick the fast diagnosis playbook and run it end-to-end. Don’t “do a bit of everything.” Collect evidence, identify the failing hop, and fix the narrowest constraint first.

If you’re past the incident and want it not to happen again:

  • Make storage visible: alert on disk %util, await, and iowait. Web timeouts often start as disk latency.
  • Align timeouts: ensure LB, nginx, app server, and DB calls have a coherent timeout budget.
  • Codify socket and security expectations: if you rely on unix sockets, document paths and permissions; include AppArmor checks in your rollout validation.
  • Reduce surprise jobs: find periodic backups/logrotate/maintenance tasks and ensure they don’t share the same I/O path as your request latency.
  • Practice evidence-first ops: the boring habit of capturing logs and system state before restarts is how you stop recurring 504 folklore.

502/504 aren’t mysteries. They’re the proxy telling you exactly where to look. Listen to it, collect the evidence, and fix the system instead of the symptom.
