Docker Nginx upstream errors: debug 502/504 with the correct logs

You’re on call. The dashboard is red. Users say “site is down,” and the only thing Nginx will tell you is a smug little 502 or 504. Your backend team swears nothing changed. Your Docker host looks “fine.” And yet production is very much not fine.

This is where people waste hours staring at the wrong logs. The trick is boring: log the correct upstream fields, prove which hop failed, then fix the one thing that’s actually broken. Not “restart everything” broken. The specific broken.

A mental model: what 502 and 504 actually mean in Docker + Nginx

Start with discipline: a 502/504 is rarely “an Nginx problem.” It’s usually Nginx being the messenger who got stuck between a client and an upstream (your app) and has receipts.

502 Bad Gateway: Nginx could not get a valid response from upstream

In practice, with Docker, a 502 often means one of these:

  • Connection failure: Nginx couldn’t connect to the upstream IP:port (container down, wrong port, wrong network, firewall rules, DNS points to stale IP).
  • Upstream closed early: Nginx connected, sent the request, and the upstream closed before sending a proper HTTP response (crash, OOM kill, application bug, proxy protocol mismatch, TLS mismatch).
  • Protocol mismatch: Nginx expects HTTP but upstream speaks HTTPS, gRPC, FastCGI, or raw TCP; or expects HTTP/1.1 keepalive but upstream can’t handle it.

504 Gateway Timeout: Nginx connected but did not get a response in time

A 504 is usually slower and nastier: Nginx connected to the upstream, but didn’t receive the response (or headers) within configured timeouts. That’s not always “the app is slow.” It can also be:

  • Upstream overloaded: thread pool exhausted, DB pool exhausted, event loop blocked, or CPU throttled in cgroups.
  • Network stalls: packet loss, conntrack exhaustion, Docker bridge weirdness under load, or an MTU mismatch that only hurts big responses.
  • Timeouts don’t match reality: Nginx expects a response in 60s, but the app legitimately needs 120s for some workloads, and you never meant to proxy those through Nginx anyway.

One more framing: Nginx has three clocks during proxying—connect time, time to first byte (headers), and time to finish reading the response. If you don’t log all three, you’re debugging blind.

Paraphrased idea from Werner Vogels (Amazon CTO): “You build it, you run it” is about owning operational reality, not just shipping code.

Short joke #1: A 502 is Nginx saying “I tried to call your app, but it went straight to voicemail.”

Fast diagnosis playbook (check first/second/third)

This is the order that finds the bottleneck quickly without turning your incident channel into a therapy group.

First: prove which hop is failing

  1. Client → Nginx: Is Nginx receiving requests? Check access logs and $request_id correlation.
  2. Nginx → upstream: Is it connect failing (502) or timing out (504)? Look for “connect() failed” vs “upstream timed out.”
  3. Upstream → its dependencies: DB, cache, queue, other HTTP services. You don’t need full tracing to confirm the obvious: dependency timeouts spike when 504s spike.

Second: capture the right timestamps and upstream timings

  • Add (or confirm) Nginx access log fields: $upstream_addr, $upstream_status, $upstream_connect_time, $upstream_header_time, $upstream_response_time, $request_time.
  • In Docker, confirm container restarts/OOM kills line up with 502 bursts.
  • Confirm whether errors are per upstream instance (one bad container) or systemic (all containers slow).

Third: decide whether to fix timeouts or fix the upstream

  • If $upstream_connect_time is high or missing: fix networking, service discovery, ports, container health, capacity.
  • If $upstream_header_time is high: upstream is slow to start responding; check app latency and dependencies.
  • If headers arrive fast but $upstream_response_time is huge: response streaming is slow; check payload size, buffering, slow clients, rate limits.

Timeouts are not a performance strategy. They are a contract. Change them only after you know what you’re signing.

Get the right logs: Nginx, Docker, and the app

Nginx error log: where the truth starts

If you only check Nginx access logs, you’ll see status codes but not the why. The error log contains the upstream failure mode: connect refused, no route to host, upstream prematurely closed, upstream timed out, resolver failure.

In a container, ensure Nginx writes error logs to stdout/stderr or to a mounted volume. If it’s writing to /var/log/nginx/error.log inside a container with no volume, you can still read it via docker exec, but you won’t like the ergonomics during an incident.

Nginx access log: where you learn patterns

Access logs are the best place to answer questions like: “Is it all endpoints or one?” and “Is it one upstream instance?” But only if you log upstream fields.

Opinionated stance: log JSON. Humans can still read it, and machines can definitely read it. If you can’t change the format today, you can still add upstream timing variables to your existing format.
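
A minimal sketch of that, using Nginx’s escape=json mode; the format name json_upstream and the log path are placeholders, and this belongs in the http {} context:

log_format json_upstream escape=json
    '{"time":"$time_iso8601","request_id":"$request_id",'
    '"request":"$request","status":"$status",'
    '"upstream_addr":"$upstream_addr","upstream_status":"$upstream_status",'
    '"uct":"$upstream_connect_time","uht":"$upstream_header_time",'
    '"urt":"$upstream_response_time","rt":"$request_time"}';
access_log /var/log/nginx/access.log json_upstream;   # or /dev/stdout if you ship logs via Docker

Every field here is a standard Nginx variable; if your build predates escape=json, fall back to appending the same variables to the format you already have.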

Docker logs: the container is lying unless you look

502 bursts that match container restarts are not a mystery. They’re a timeline. Docker tells you when a container restarted, when it got OOM-killed, and whether health checks are failing.

Application logs: confirm the upstream received the request

Your app logs should answer: did the request arrive, what path, what latency, what error. If you can add a request ID header (e.g., X-Request-ID) from Nginx to upstream and log it, you’ll stop arguing and start fixing.
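
A sketch of the Nginx side, assuming a plain proxied location; the header name X-Request-ID is a convention, not a requirement:

location / {
    proxy_set_header X-Request-ID $request_id;   # hand the ID to the upstream
    add_header X-Request-ID $request_id always;  # echo it to the client for support tickets
    proxy_pass http://app_upstream;
}

The app then logs the incoming X-Request-ID next to its own request line, and correlation becomes a grep instead of a meeting.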

Short joke #2: “Upstream timed out” is the operational version of “I’ll get back to you” from that vendor who never does.

Practical tasks (commands, output meaning, decision)

These are the tasks I actually run when Nginx in Docker starts throwing 502/504. Each task includes: command, what typical output means, and the decision you make next.

Task 1: Confirm Nginx is the component returning 502/504

cr0x@server:~$ curl -sS -D- -o /dev/null http://127.0.0.1/health
HTTP/1.1 502 Bad Gateway
Server: nginx/1.25.3
Date: Sat, 03 Jan 2026 10:12:01 GMT
Content-Type: text/html
Content-Length: 157
Connection: keep-alive

What it means: The response is coming from Nginx (Server header). Not from the app.

Decision: Go to Nginx error logs next; don’t waste time hunting in app logs yet.

Task 2: Read Nginx error logs from the container

cr0x@server:~$ docker logs --tail=200 -f nginx
2026/01/03 10:11:58 [error] 28#28: *194 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.2.15, server: _, request: "GET /health HTTP/1.1", upstream: "http://172.19.0.5:8080/health", host: "example.internal"

What it means: Nginx tried to connect to the upstream IP:port and got Connection refused. The container exists on the network, but nothing is listening on that port (or it’s not ready).

Decision: Check upstream container status, port mapping, and whether the app process is listening.

Task 3: Identify which upstream Nginx is trying to reach

cr0x@server:~$ docker exec -it nginx nginx -T 2>/dev/null | sed -n '1,180p'
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log warn;
events { worker_connections  1024; }
http {
    upstream app_upstream {
        server app:8080;
        keepalive 32;
    }
    server {
        listen 80;
        location / {
            proxy_http_version 1.1;          # required for upstream keepalive to be used
            proxy_set_header Connection "";  # don't forward the client's Connection header to the pool
            proxy_set_header Host $host;
            proxy_set_header X-Request-ID $request_id;
            proxy_pass http://app_upstream;
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
        }
    }
}

What it means: Upstream uses the Docker DNS name app on port 8080, with keepalive enabled.

Decision: Validate Docker DNS resolution and confirm the app listens on 8080 inside its container. Also keep keepalive in mind; it can amplify certain app bugs.

Task 4: Validate DNS inside the Nginx container

cr0x@server:~$ docker exec -it nginx getent hosts app
172.19.0.5     app

What it means: Docker’s embedded DNS resolves app to an IP. If this fails, Nginx may be resolving at startup only or not at all (depending on config).

Decision: If DNS fails: fix container networks, service name, or resolver configuration. If DNS works: move to connectivity and listening checks.

Task 5: Test connectivity from Nginx to upstream port

cr0x@server:~$ docker exec -it nginx bash -lc 'nc -vz -w2 app 8080'
nc: connect to app (172.19.0.5) port 8080 (tcp) failed: Connection refused

What it means: IP is reachable, but nothing is accepting connections on 8080.

Decision: Check the app container: is it running, listening on the right interface, and not crash-looping?

Task 6: Inspect upstream container health and restarts

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
NAMES   STATUS                      PORTS
nginx   Up 2 hours                  0.0.0.0:80->80/tcp
app     Restarting (1) 12 seconds ago

What it means: The app is restart-looping. Nginx is innocent; it can’t talk to something that can’t stay up.

Decision: Pull app logs and look for crash cause (config error, missing secret, OOM, migrations, dependency refusal).

Task 7: Look for OOM kills (classic 502 generator)

cr0x@server:~$ docker inspect app --format '{{json .State}}'
{"Status":"restarting","Running":false,"Paused":false,"Restarting":true,"OOMKilled":true,"Dead":false,"Pid":0,"ExitCode":137,"Error":"","StartedAt":"2026-01-03T10:11:42.020785322Z","FinishedAt":"2026-01-03T10:11:52.901123812Z","Health":null}

What it means: OOMKilled:true and exit code 137. The kernel killed the process. Nginx is just reporting the aftermath.

Decision: Add memory, lower workload, fix memory leak, or set sane limits and autoscaling. Also consider reducing Nginx buffering for huge responses only if you understand the tradeoffs.
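
If you go the “set sane limits” route in Compose, a minimal sketch; the service name, image tag, and numbers are placeholders you should size from observed peak usage, and whether your Compose version honors deploy.resources (versus the older mem_limit key) is worth verifying:

services:
  app:
    image: app:prod
    deploy:
      resources:
        limits:
          memory: 4G      # ceiling: the kernel OOM-kills above this
        reservations:
          memory: 2G      # scheduling hint, not a hard guarantee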

Task 8: Correlate 502/504 spikes with container restarts in the Docker event stream

cr0x@server:~$ docker events --since 30m --filter container=app
2026-01-03T10:02:11.000000000Z container die 1f2a3b4c5d (exitCode=137, image=app:prod, name=app)
2026-01-03T10:02:12.000000000Z container start 1f2a3b4c5d (image=app:prod, name=app)
2026-01-03T10:11:52.000000000Z container die 1f2a3b4c5d (exitCode=137, image=app:prod, name=app)
2026-01-03T10:11:53.000000000Z container start 1f2a3b4c5d (image=app:prod, name=app)

What it means: The app died twice in 30 minutes. If your 502s align with these timestamps, you’ve got causality, not vibes.

Decision: Focus on why the app dies. Don’t tune Nginx timeouts; that’s not the problem.

Task 9: If it’s 504, log timings and verify where time is spent

cr0x@server:~$ docker exec -it nginx sh -c 'grep " 504 " /var/log/nginx/access.log | tail -n 1'
10.0.2.15 - - [03/Jan/2026:10:14:09 +0000] "GET /api/report HTTP/1.1" 504 564 "-" "curl/8.5.0" rt=60.001 uct=0.001 uht=60.000 urt=60.000 ua="172.19.0.5:8080" us="504"

What it means: uct (connect) is fast, but uht (time to headers) hit 60s, matching proxy_read_timeout. The upstream accepted the connection but didn’t produce headers in time.

Decision: This is upstream slowness or deadlock, not networking. Check app latency, dependency calls, worker exhaustion, and DB.
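
One log line is an anecdote. To check whether header time is systematically pinned at the timeout, aggregate the uht= field across the log; this assumes a file-based access log carrying the uht= field shown above, and the output is illustrative:

cr0x@server:~$ docker exec nginx awk '{for (i=1; i<=NF; i++) if ($i ~ /^uht=/) print $i}' /var/log/nginx/access.log | sort -t= -k2 -n | tail -n 5
uht=59.874
uht=60.000
uht=60.000
uht=60.001
uht=60.002

A wall of values at exactly your proxy_read_timeout means Nginx is cutting the upstream off, not the upstream coincidentally finishing at the 60-second mark.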

Task 10: Confirm Nginx timeout configuration that actually triggered the 504

cr0x@server:~$ docker exec -it nginx nginx -T 2>/dev/null | grep -nE 'proxy_(connect|read|send)_timeout|send_timeout'
69:            proxy_connect_timeout 5s;
70:            proxy_read_timeout 60s;
71:            proxy_send_timeout 60s;

What it means: proxy_read_timeout is measured between successive reads from the upstream, so 60 seconds with no headers at all trips it. That’s your contract today.

Decision: If 60s is too low for a legitimate endpoint, split routes: raise timeout only for that location, or redesign the endpoint (async job + polling). Avoid global timeout inflation.
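
A sketch of the per-route split, assuming /api/report is the one legitimately slow endpoint; the 180s figure is an example, not a recommendation:

location /api/report {
    proxy_pass http://app_upstream;
    proxy_read_timeout 180s;   # only this route gets the longer contract
}

location / {
    proxy_pass http://app_upstream;
    proxy_read_timeout 60s;    # everything else keeps the default contract
}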

Task 11: Validate upstream app is listening on the expected port inside its container

cr0x@server:~$ docker exec -it app bash -lc 'ss -lntp | head -n 5'
State  Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
LISTEN 0      4096   0.0.0.0:8080      0.0.0.0:*    users:(("app",pid=1,fd=7))

What it means: The app is listening on 0.0.0.0:8080, which is correct for container networking.

Decision: If it were listening only on 127.0.0.1, Nginx in another container would get connection refused. Fix the app bind address.

Task 12: Confirm the Nginx container and app container share a Docker network

cr0x@server:~$ docker inspect nginx --format '{{json .NetworkSettings.Networks}}'
{"frontend":{"IPAMConfig":null,"Links":null,"Aliases":["nginx"],"NetworkID":"8c1d...","EndpointID":"4a2b...","Gateway":"172.19.0.1","IPAddress":"172.19.0.3","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:13:00:03","DriverOpts":null}}
cr0x@server:~$ docker inspect app --format '{{json .NetworkSettings.Networks}}'
{"frontend":{"IPAMConfig":null,"Links":null,"Aliases":["app"],"NetworkID":"8c1d...","EndpointID":"9f8e...","Gateway":"172.19.0.1","IPAddress":"172.19.0.5","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:13:00:05","DriverOpts":null}}

What it means: Both are on the frontend network. Connectivity should be possible.

Decision: If networks differ, fix Compose networks or connect containers to the same user-defined network. Don’t use legacy --link.

Task 13: Check Docker embedded DNS behavior and Nginx name resolution pitfalls

cr0x@server:~$ docker exec -it nginx cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0

What it means: Docker DNS is in use. Nginx resolves upstream names depending on how you configured it. If you put a hostname in an upstream { server app:8080; } block, Nginx typically resolves it at startup and caches it.

Decision: If containers churn and IPs change, either reload Nginx when upstream IPs change (common in Compose) or use dynamic resolution patterns (e.g., resolver 127.0.0.11 plus variables in proxy_pass) with care.
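
A sketch of the variable-in-proxy_pass pattern; the resolver address is Docker’s embedded DNS, and the valid=10s TTL is an example. Caveat: a variable in proxy_pass skips the upstream block (so you lose its keepalive pool) and changes how URIs are passed, which is why “with care” is doing real work here:

resolver 127.0.0.11 valid=10s ipv6=off;

server {
    listen 80;
    location / {
        set $app_backend http://app:8080;   # a variable makes Nginx resolve at request time, honoring valid=
        proxy_pass $app_backend;
    }
}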

Task 14: Detect upstream keepalive reuse issues (stale connections)

cr0x@server:~$ docker logs --tail=200 nginx 2>&1 | grep -E 'upstream prematurely closed|recv\(\) failed|reset by peer' | head
2026/01/03 10:20:31 [error] 28#28: *722 upstream prematurely closed connection while reading response header from upstream, client: 10.0.2.15, server: _, request: "GET /api HTTP/1.1", upstream: "http://172.19.0.5:8080/api", host: "example.internal"

What it means: The upstream closed the connection unexpectedly while Nginx waited for headers. This can be app crashes, but it can also be keepalive + upstream idle timeout mismatches.

Decision: Compare Nginx keepalive settings with upstream server idle timeouts. Consider disabling upstream keepalive temporarily to test if errors stop; then fix properly (align timeouts, tune keepalive_requests, etc.).
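
A sketch of what “fix properly” tends to look like, assuming Nginx 1.15.3 or newer (keepalive_timeout inside upstream blocks is newer than the rest); the numbers illustrate the relationship, they are not tuned values:

upstream app_upstream {
    server app:8080;
    keepalive 16;              # idle connections Nginx may keep per worker
    keepalive_requests 1000;   # retire a connection after this many requests
    keepalive_timeout 30s;     # keep this shorter than the app's own idle timeout
}

# in the proxied location, or upstream keepalive silently does nothing:
proxy_http_version 1.1;
proxy_set_header Connection "";

The point of the 30s value is ordering: Nginx should give up on an idle connection before the app does, so it never reuses a socket the app has already closed.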

Task 15: Check host-level pressure that makes everything “randomly” slow

cr0x@server:~$ uptime
 10:24:02 up 41 days,  4:11,  2 users,  load average: 18.42, 17.90, 16.55

What it means: High load average can indicate CPU saturation, runnable queue backlog, or blocked I/O. In container land, this can manifest as 504s because the upstream can’t schedule.

Decision: Check CPU and memory next; if the host is saturated, no amount of Nginx timeout tuning will “fix” it.

Task 16: See per-container CPU/memory pressure in real time

cr0x@server:~$ docker stats --no-stream
CONTAINER ID   NAME    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O
a1b2c3d4e5f6   nginx   2.15%     78.2MiB / 512MiB      15.27%    1.2GB / 1.1GB     12.3MB / 8.1MB
1f2a3b4c5d6e   app     380.44%   1.95GiB / 2.00GiB     97.50%    900MB / 1.3GB     1.1GB / 220MB

What it means: The app is pegging CPU and nearly OOM. Expect latency and restarts. This directly produces 504 (slow) and 502 (crash).

Decision: Add capacity, fix memory usage, add caching, reduce concurrency, or fix the query. But do one thing at a time.

Task 17: Verify connection tracking exhaustion (a sneaky 502/504 source under load)

cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 262119
net.netfilter.nf_conntrack_max = 262144

What it means: You’re near conntrack max. New connections can fail or stall; Nginx sees connect errors or timeouts.

Decision: Increase conntrack max (with memory awareness), reduce connection churn (keepalive, pooling), or scale out. Also check for connection leaks.
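
If you decide to raise the ceiling, a sketch; the value is an example, each conntrack entry costs kernel memory, and the change does not survive a reboot unless you persist it under /etc/sysctl.d/:

cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
net.netfilter.nf_conntrack_max = 524288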

Task 18: Validate that Nginx is logging upstream timings (or fix it)

cr0x@server:~$ docker exec -it nginx grep -R --line-number 'log_format' /etc/nginx/nginx.conf /etc/nginx/conf.d 2>/dev/null
/etc/nginx/nginx.conf:15:log_format upstream_timing '$remote_addr - $request_id [$time_local] '
/etc/nginx/nginx.conf:16:    '"$request" $status rt=$request_time uct=$upstream_connect_time '
/etc/nginx/nginx.conf:17:    'uht=$upstream_header_time urt=$upstream_response_time ua="$upstream_addr" us="$upstream_status"';

What it means: You have the key timing variables. Good. Now use them.

Decision: If missing, add them and reload Nginx. Without upstream timing, you’ll misdiagnose 504s.

Task 19: Reload Nginx safely after config changes

cr0x@server:~$ docker exec -it nginx nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ docker exec -it nginx nginx -s reload

What it means: Syntax is valid; reload will apply changes without dropping existing connections (in most typical setups).

Decision: Prefer reload over container restarts in the middle of an incident unless the process is wedged.

Task 20: Prove the upstream is slow using direct curls from the Nginx network namespace

cr0x@server:~$ docker exec -it nginx bash -lc 'time curl -sS -o /dev/null -w "status=%{http_code} ttfb=%{time_starttransfer} total=%{time_total}\n" http://app:8080/api/report'
status=200 ttfb=59.842 total=59.997

real    1m0.010s
user    0m0.005s
sys     0m0.010s

What it means: The upstream itself takes ~60s to first byte and total. Nginx’s 60s proxy_read_timeout is right on the edge; a little jitter causes 504.

Decision: Fix the upstream performance or redesign the endpoint. Raising timeouts may stop the bleeding but can also pile up connections and increase blast radius.

Common mistakes: symptom → root cause → fix

This section exists because most “Nginx upstream errors” are self-inflicted. Here are the ones that keep showing up in containerized setups.

1) Symptom: 502 with “connect() failed (111: Connection refused)”

  • Root cause: Upstream container is restarting/crashed; app is listening on a different port; app binds to 127.0.0.1 inside container.
  • Fix: Confirm ss -lntp in the app container, fix bind address to 0.0.0.0, fix port in Nginx/Compose, and add health checks so Nginx doesn’t route to dead containers.
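
For the health-check piece, a minimal Compose sketch; the endpoint path, the timings, and the assumption that curl exists in the image are all placeholders:

services:
  app:
    image: app:prod
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 20s

Plain Nginx doesn’t consume Docker health status by itself; the check pays off through depends_on with condition: service_healthy, docker compose up --wait, and restart decisions, so traffic windows don’t land on a container that isn’t ready.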

2) Symptom: 502 with “no live upstreams”

  • Root cause: All upstream servers marked down by Nginx (failed checks or max_fails), or upstream name failed to resolve at startup.
  • Fix: Ensure Nginx can resolve the service name at startup; reload Nginx after network changes; validate upstream entries. If you’re doing blue/green, don’t leave Nginx pointing at the retired name.

3) Symptom: 504 with “upstream timed out (110: Connection timed out) while reading response header”

  • Root cause: Upstream is slow to produce headers; thread pool or event loop is blocked; DB queries are slow; upstream is CPU-throttled.
  • Fix: Log $upstream_header_time. If it’s high, optimize upstream and dependencies. Raise proxy_read_timeout only for endpoints that genuinely require it.

4) Symptom: 502 with “upstream prematurely closed connection”

  • Root cause: App crashes mid-request; upstream keepalive idle timeout shorter than Nginx reuse window; buggy proxy protocol/TLS mismatch.
  • Fix: Check app logs for crashes. Temporarily disable upstream keepalive to validate. Align timeouts and consider limiting keepalive reuse via keepalive_requests on the upstream.

5) Symptom: 502 only during deploys

  • Root cause: Containers stop before new ones are ready; no readiness gate; Nginx resolves to an IP for a container that just got replaced.
  • Fix: Add readiness endpoints and health checks. In Compose, stagger restarts and reload Nginx if you rely on name resolution at startup. Prefer a stable service VIP (in orchestrators) or a proxy that does dynamic resolution properly.
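
A sketch of the Compose-side sequence, assuming the app service has a healthcheck defined (see the sketch under symptom 1) and that Nginx resolved the upstream name at startup:

cr0x@server:~$ docker compose up --no-deps --wait app   # replace only the app and block until its healthcheck passes
cr0x@server:~$ docker exec nginx nginx -s reload        # re-resolve the upstream name to the new container IP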

6) Symptom: 504s spike but app logs look “fine”

  • Root cause: Requests never reach the app (stuck in Nginx queue, conntrack exhaustion, SYN backlog issues, or network stalls). Or the app is dropping logs under pressure.
  • Fix: Compare Nginx access logs with app request logs using request IDs. Check conntrack and host saturation. Confirm app logging isn’t buffered to death.

7) Symptom: Random 502s under load, disappears when you scale up

  • Root cause: File descriptor exhaustion, ephemeral port exhaustion, NAT table pressure, or a slow-loris style client causing resource contention.
  • Fix: Check ulimit -n and open files, tune worker connections, enforce sane client timeouts, and keep an eye on conntrack.

8) Symptom: 502 after enabling HTTP/2 or TLS changes

  • Root cause: Misconfigured upstream protocol expectations (proxying to HTTPS upstream without proxy_ssl settings, or speaking HTTP to a TLS port).
  • Fix: Validate upstream scheme and ports, test directly with curl from inside the Nginx container, and ensure the upstream is actually HTTP where you think it is.
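
If the upstream really is TLS, a sketch of the proxy side; the internal hostname and CA path are placeholders:

location / {
    proxy_pass https://app_upstream;
    proxy_ssl_server_name on;                                   # send SNI so the upstream presents the right cert
    proxy_ssl_name app.internal;                                # name used for SNI and verification
    proxy_ssl_trusted_certificate /etc/nginx/internal-ca.pem;   # your internal CA bundle
    proxy_ssl_verify on;
}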

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A company ran a simple Docker Compose stack: Nginx reverse proxy, a Node.js API container, and a Redis container. One Monday, they started seeing a clean wall of 502s. The team’s first assumption was universal and wrong: “Nginx can’t resolve the upstream name.” So they changed the Nginx config to hardcode the upstream IP they saw in docker inspect. It “worked” for ten minutes.

Then it failed again. Hard. Because the API container was crash-looping; each restart grabbed a new IP, and their “fix” pinned Nginx to yesterday’s address. The error log told the story the whole time: connect() failed (111: Connection refused). That’s not DNS. That’s a port with nothing listening.

They finally looked at docker inspect and noticed OOMKilled:true. The API container was set to a small memory limit, and a new feature created a larger in-memory cache under a certain request pattern. Under load, the kernel killed it. Nginx wasn’t broken; it was consistently routing to a service that was not consistently alive.

The actual fix was boring: reduce memory footprint, raise the container limit to match realistic peak, and add a readiness endpoint so the proxy didn’t send traffic before the app had warmed up. They also stopped hardcoding container IPs—because that’s how you turn a simple incident into a recurring hobby.

Mini-story 2: The optimization that backfired

Another org had a performance initiative: “Reduce latency by enabling keepalive everywhere.” Someone added keepalive 128; in the Nginx upstream block. They also bumped worker_connections. On paper, it looked like free speed.

Two weeks later, intermittent 502s started appearing: “upstream prematurely closed connection.” They were rare enough to dodge basic monitoring, but common enough to annoy customers. The backend service was a Java app behind an embedded server with a lower idle timeout than Nginx’s connection reuse window. Nginx would happily reuse a connection that the upstream had already timed out and closed. Sometimes the race went in Nginx’s favor, sometimes it didn’t.

The team’s first response was classic: increase timeouts. That made the symptom less frequent… and increased resource usage. Now there were more idle upstream connections sitting around, doing nothing but consuming file descriptors and memory. Under load, the proxy started to struggle, and latency climbed.

The fix was not “more keepalive.” The fix was matching keepalive behavior end-to-end: reduce Nginx upstream keepalive, align idle timeouts, and limit reuse with keepalive_requests. They also added upstream timing logs, so future failures would show whether the cost was connect time or header wait time. The optimization became a controlled tool instead of a superstition.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had one policy that was almost comically unglamorous: every reverse proxy must log upstream timings and upstream status codes, and every request must carry a request ID. It was enforced in code review. No exceptions. People grumbled. Then they stopped grumbling.

During a mid-quarter release, they saw a surge of 504s on a single endpoint. The on-call pulled Nginx access logs and filtered by path. The line format included uct, uht, and urt, plus upstream_status. Within minutes, they found a pattern: connect time was low, header time spiked, and upstream status was absent on some requests—meaning Nginx never got headers at all.

They pivoted: not a network problem, not a port problem. An application thread pool problem. Using the request ID, they matched failed requests in the app logs and saw they all stalled on a downstream service call. That downstream service was rate limiting after a config change.

The incident was resolved without random restarts: they rolled back the downstream config, added proper client-side backoff, and adjusted Nginx timeouts only for a different endpoint that legitimately streamed data. That’s what “boring” looks like when it works: quick isolation, clean causality, minimal collateral damage.

Checklists / step-by-step plan

Checklist A: When you see a 502

  1. Read Nginx error logs: look for connect() failed, no live upstreams, prematurely closed, resolver errors.
  2. From the Nginx container, test getent hosts and nc to the upstream host:port.
  3. Check upstream container state: restarting, exited, unhealthy, OOM killed.
  4. Check if the app is listening on the expected port and interface (0.0.0.0).
  5. If it’s intermittent, investigate keepalive mismatch or deploy churn.

Checklist B: When you see a 504

  1. Confirm it’s a read timeout: error log should say “while reading response header” or access logs show high uht.
  2. Inspect access log timings: high uct suggests connect problem; high uht suggests upstream first-byte latency; high urt suggests slow streaming.
  3. Directly curl the upstream from inside the Nginx container and measure TTFB.
  4. Check upstream CPU/memory saturation and dependency latency (DB, cache, other HTTP services).
  5. Only after root cause is known: adjust proxy_read_timeout for that one route if needed.

Checklist C: Logging setup that pays rent

  1. Access logs include: request ID, upstream addr, upstream status, connect time, header time, response time, request time.
  2. Error logs go to stdout/stderr (container-friendly) or a mounted volume with rotation.
  3. Pass request ID to upstream and log it there too.
  4. Track container restarts/OOM kills and correlate with 502 bursts.

Step-by-step plan: fix without flailing

  1. Freeze changes: stop deploys and config edits until you isolate the failure mode.
  2. Collect evidence: Nginx error logs, a slice of access logs with timings, docker events, container state.
  3. Classify the failure:
    • Connect refused/no route → network/port/container lifecycle.
    • Upstream timed out reading headers → upstream latency/dependency stall.
    • Premature close/reset → crashes, keepalive mismatch, protocol mismatch.
  4. Pick one intervention: scale upstream, rollback change, increase memory limit, fix port, fix timeout for a specific location. One.
  5. Verify: confirm error rate drops and latency distribution improves, not just one happy curl.
  6. Backfill prevention: add logs, add health checks, add alerts on upstream timing percentiles and container restarts.

Interesting facts and historical context

  1. Nginx started as a C10k solution: it was built to handle many concurrent connections efficiently, which is why it’s often the first reverse proxy choice.
  2. 502 and 504 are HTTP’s gateway vocabulary: these codes exist because gateways and proxies needed a way to say “the next hop failed” without pretending the origin server answered.
  3. Docker’s embedded DNS (127.0.0.11) is a design choice: it provides service discovery on user-defined networks, but it doesn’t magically fix how every app caches DNS.
  4. Nginx resolves upstream names differently depending on config: hostnames in upstream blocks are typically resolved at startup, which surprises people during container churn.
  5. Keepalive is older than microservices hype: persistent connections have existed for decades; they’re great until mismatched idle timeouts turn “optimization” into intermittent failures.
  6. 504s often correlate with queueing, not just “slow code”: when worker pools fill, latency can jump without any code change.
  7. OOM kills masquerade as networking issues: from Nginx’s perspective, a crashing upstream looks like refused connections or premature closes, not “out of memory.”
  8. Conntrack exhaustion is a modern classic: NAT and stateful firewall tracking can become the bottleneck long before CPU hits 100%.
  9. Timeout defaults are cultural artifacts: many stacks inherit 60s timeouts from old assumptions about web requests, even when workloads shifted to long-running APIs and streaming.

FAQ

1) Why do I get 502 in Docker but not when running the app directly on the host?

In Docker you add at least one more network hop and often change bind behavior. The app might be listening on 127.0.0.1 inside the container, which works locally but is unreachable from Nginx in another container. Confirm with ss -lntp inside the container.

2) How do I tell if a 504 is Nginx timing out or the upstream returning 504?

Check $upstream_status in access logs. If Nginx generated the 504 because it timed out, upstream status may be empty or different. Also read the Nginx error log: it will say “upstream timed out … while reading response header.”

3) Should I just increase proxy_read_timeout to stop 504s?

Only if you’re sure the endpoint is supposed to take that long and you’re okay with tying up proxy connections longer. Otherwise you’re hiding a capacity problem and increasing blast radius. Prefer fixing upstream latency or moving long jobs to async workflows.

4) My upstream is a service name in Compose. Why does Nginx sometimes hit the wrong IP after a redeploy?

Nginx often resolves upstream names at startup and keeps the IP. If the container is replaced and gets a new IP, Nginx may keep using the old one until reload. Solutions: reload Nginx on deploy, use a more dynamic resolution approach carefully, or use an orchestrator/service VIP that stays stable.

5) Why do I see “upstream prematurely closed connection” without any app crash logs?

It can be keepalive reuse against an upstream that closes idle connections, or a proxy/protocol mismatch. Test by disabling upstream keepalive temporarily and see if the symptom disappears. Also verify your app logs aren’t dropping messages under pressure.

6) Can a slow client cause upstream timeouts?

Yes. If you’re buffering responses or streaming large payloads, slow clients can keep connections open and consume worker capacity, indirectly causing upstream queues and 504s. Log request time vs upstream time to separate “upstream slow” from “client slow.”

7) How do I differentiate connect-time problems from application latency?

Use upstream timing fields. High or failing $upstream_connect_time points to network/port/service availability. High $upstream_header_time points to upstream processing or dependency stalls.

8) Does enabling Nginx upstream keepalive always help?

No. It reduces connection setup overhead, but it can expose bugs and mismatch idle timeouts, producing intermittent 502s. Use it intentionally: align timeouts, monitor errors, and tune reuse limits.

9) I’m using multiple upstream containers. How do I see if only one instance is bad?

Log $upstream_addr and group errors by it. If one IP shows most failures, you’ve got a “one bad replica” issue—often bad config, uneven load, or a noisy neighbor on the host.

10) What’s the minimum logging change that makes upstream debugging sane?

Add $request_id, $upstream_addr, $upstream_status, and the three upstream timing values (connect, header, response) to access logs. And keep the error log accessible.

Conclusion: next steps that prevent a repeat

If you remember one thing: 502/504 debugging is a timing and topology problem. You don’t fix it by guessing. You fix it by logging the upstream hop correctly and proving where the request dies.

Do these next:

  1. Upgrade your Nginx access log format to include upstream timings, upstream addr, and upstream status. If you’re not logging those, you’re choosing slower incidents.
  2. Make Nginx error logs easy to access in Docker (stdout/stderr or mounted volume). During an outage, “where are the logs?” is not a fun scavenger hunt.
  3. Implement request IDs end-to-end and log them in the app. Correlation beats debate.
  4. Add health checks and readiness gates so deploys don’t manufacture 502s.
  5. Stop treating timeouts as a fix. Use them as a signal. If you raise them, do it per-route, with intent, and with monitoring.

Then run a game day: intentionally kill the upstream container, slow it down, and watch whether your logs tell the truth in under five minutes. If they don’t, that’s your real bug.
