You don’t feel the web server choice on a calm Tuesday. You feel it at 02:13 on a Saturday when checkout stalls,
dashboards lie, and your on-call brain starts doing math with packet loss. “Nginx vs Apache” isn’t about ideology.
It’s about which failure modes you can live with, which operational knobs your team will actually turn, and which
defaults will quietly tax you for months.
In 2026, the correct answer is often: “both, but on purpose.” The decision that matters is the boundary: where TLS
terminates, where request buffering happens, where dynamic code runs, where per-tenant overrides are allowed, and
where you want your incident blast radius to stop.
What actually matters in 2026 (and what doesn’t)
In 2010, the question was “which one is faster?” In 2026, raw benchmark speed is rarely your bottleneck. Your
bottleneck is usually one of these: upstream saturation (app or DB), bad buffering choices, too many slow clients,
bad TLS or HTTP/2 behavior, filesystem latency spikes, or a config model that lets random teams mutate production
behavior without review.
The decision that actually matters
You’re not choosing a web server. You’re choosing a control plane for traffic and config:
- Where do you terminate TLS? That’s where certificates, ciphers, ALPN, OCSP, and client quirks live.
- Where do you buffer requests and responses? That decides whether slow clients drown your app.
- Do you allow per-directory overrides? That’s a governance question disguised as a feature.
- How do you do reloads? If reloads are scary, people delay fixes. Delayed fixes become incidents.
- What do you log by default? Because you will debug with what you logged, not what you wished you logged.
Nine historical facts that still affect you
- Apache HTTP Server began in 1995 as a set of patches (“a patchy server”). Its module ecosystem and config sprawl are a direct inheritance.
- Nginx was created in the early 2000s to address the C10k problem: handling 10,000 concurrent connections efficiently with an event-driven model.
- Apache’s MPM split (prefork/worker/event) exists because it had to reconcile threads, processes, and non-thread-safe modules over time.
- .htaccess is a cultural artifact from shared hosting: decentralize control to directories, accept overhead. It’s convenient and operationally chaotic.
- Nginx’s “reverse proxy first” identity shaped modern stacks where the “web server” is often a traffic router, not an app executor.
- HTTP/2 changed resource loading (multiplexing), but pushing many streams over one TCP connection makes head-of-line blocking at the transport layer, and therefore latency and loss, more visible.
- QUIC/HTTP/3 moved the transport to UDP, which alters how middleboxes, firewalls, and observability behave, especially in corporate networks.
- Let’s Encrypt normalized short-lived cert automation, making “TLS is hard” less true, but “TLS renewal breaks prod at 3am” still very true.
- Kernel improvements (epoll, sendfile, TCP tuning) made both servers faster; now the differentiator is configuration model and failure behavior.
The punchline: both servers are “fast enough” for most sites. The question is which one makes the next
incident shorter, cheaper, and less humiliating.
Blunt recommendations (pick this, avoid that)
If you run a typical modern Linux web stack
- Default to Nginx at the edge for static assets, TLS termination, caching, and reverse proxying to application backends.
- Run apps behind it via PHP-FPM, uWSGI, Gunicorn, Node, or whatever you’ve chosen to regret.
- Use Apache only when you need Apache: heavy use of per-directory rules, legacy apps, or modules that are painful to replace.
If you have multiple teams pushing config changes
- Avoid .htaccess in production unless you also enjoy doing postmortems with the phrase “someone edited a file in a directory on a random host.”
- Prefer centralized config and CI validation (Nginx is naturally aligned with this; Apache can be, but .htaccess tends to sabotage it).
If you’re hosting user-generated content or dealing with slow clients
- Nginx’s buffering model is usually easier to reason about: you can protect upstreams from slow uploads/downloads.
- Apache can do it, but you need to be explicit and careful with MPM choice and timeouts.
Joke #1: Picking a web server because a benchmark said it was 3% faster is like choosing a parachute for its color. You only notice the wrong choice once.
Architecture that survives real traffic
The architecture that keeps your pager quiet looks boring:
Nginx (edge) → app servers (PHP-FPM / containers / language runtime) → DB/cache.
Put the “expensive thinking” (dynamic code) behind something that can absorb client weirdness.
Edge responsibilities (what belongs at the front)
- TLS termination with sane ciphers, session resumption, OCSP stapling where appropriate.
- Request limiting (rate/conn) and basic abuse controls. Not security theater—just load shaping.
- Static assets served from disk with correct caching headers.
- Compression (careful with already-compressed formats) and content-type correctness.
- Routing to upstreams, health checks (or at least health-aware timeouts), graceful fail behavior (a minimal config sketch follows this list).
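A minimal sketch of what those edge responsibilities look like in Nginx. Everything here is illustrative: the upstream name, certificate paths, docroot, and timeout values are placeholders to replace with your own, and OCSP stapling only makes sense if your CA and chain support it.

upstream app_backend {                      # placeholder app tier
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl;
    http2 on;                               # separate directive in current Nginx versions
    server_name example.com;

    ssl_certificate     /etc/ssl/certs/example.com.pem;      # placeholder paths
    ssl_certificate_key /etc/ssl/private/example.com.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_stapling on;                        # needs a resolver and a CA that supports it

    gzip on;
    gzip_types text/css application/javascript application/json;   # skip already-compressed formats

    location /static/ {
        root /srv/www;                      # placeholder docroot
        expires 7d;
        add_header Cache-Control "public";
    }

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;             # align with the app's own request timeout
        proxy_next_upstream error timeout;  # retry the next upstream server on errors and timeouts
    }
}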
Upstream responsibilities (what belongs behind)
- Dynamic rendering and business logic.
- Authentication/authorization (edge can help, but don’t wedge your identity model into proxy config).
- Database access and caching logic.
Why “just run Apache with mod_php” is mostly a legacy move
It’s not that mod_php can’t work. It’s that it couples your HTTP worker model to your PHP lifecycle, and the
coupling is where operational freedom goes to die. It also tends to encourage “just one more rewrite rule”
sprinkled across directories.
Modern reality: you want a clear seam between “accept client traffic” and “execute app code.” That seam is where
you put timeouts, buffering, limits, and observability. Nginx makes that seam feel native. Apache can do it, but
you’ll fight older defaults and older habits.
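As a rough illustration of that seam, here is a minimal Nginx-to-PHP-FPM handoff. It assumes PHP-FPM 8.2 listening on a Unix socket; the socket path and docroot are placeholders.

upstream php_app {
    server unix:/run/php/php8.2-fpm.sock;    # placeholder socket path
}

server {
    listen 127.0.0.1:8080;
    root /srv/www/app/public;                # placeholder docroot

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass php_app;
        fastcgi_read_timeout 30s;            # the seam: explicit timeout between proxy and app
        fastcgi_buffering on;                # keep slow clients from pinning FPM workers
    }
}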
Apache deep dive: MPMs, .htaccess, and when it wins
MPM choice is not optional
If you run Apache in 2026 and you don’t know which MPM you’re using, you’re running it by vibes. Stop.
Apache’s Multi-Processing Modules define concurrency and resource behavior:
- prefork: one process per connection. Compatible with older, non-thread-safe modules. High memory footprint. Easy to overload with concurrency.
- worker: threads per process. Better memory usage. Thread-safety matters.
- event: like worker but with better keep-alive handling. Generally the best default for modern reverse-proxy Apache; a sizing sketch follows this list.
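For reference, a hedged event-MPM sizing sketch. The numbers are illustrative; derive yours from measured memory per thread and real concurrency, not from an article.

<IfModule mpm_event_module>
    StartServers              2
    ServerLimit               8
    ThreadsPerChild          25
    MaxRequestWorkers       200     # must fit within ServerLimit * ThreadsPerChild
    MaxConnectionsPerChild    0
</IfModule>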
Apache’s wins are still real:
- Compatibility: legacy enterprise apps, older auth modules, and setups that depend on Apache semantics.
- Per-directory configuration: if you’re truly in a multi-tenant environment with delegated control, .htaccess can be the least-worst governance choice.
- Module ecosystem: some modules are effectively “Apache-only” in operational culture, not just code.
.htaccess: feature or footgun?
.htaccess exists because shared hosting needed distributed control. It also forces Apache to check for override
files across directories during request processing. That cost is often minor per request, until you scale, cache
bust, or put your content on a higher-latency filesystem.
The bigger problem is not CPU. It’s change control. With .htaccess, a developer can “fix prod” by editing a file
on a single host, bypassing review, CI, and rollbacks. You don’t want “configuration drift” to be a personality
trait of your fleet.
Nginx deep dive: event loop, buffering, and when it wins
Nginx is basically an opinionated traffic machine: a small number of workers, an event loop, and a habit of
buffering to protect upstreams. It shines when you need to handle lots of concurrent connections without tying up
a thread/process per client.
Where Nginx is usually the right call
- Reverse proxying to multiple upstreams, with predictable behavior under load.
- Static file serving with sendfile, caching headers, and sane keep-alives.
- Load shaping (rate limiting, connection limiting) that is straightforward to reason about; see the sketch after this list.
- Operational hygiene: centralized config, testable syntax, reloads that don’t drop active connections when done right.
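A minimal load-shaping sketch, keyed by client IP. Zone names, sizes, rates, and the location path are illustrative; pick values from observed traffic, not from defaults.

# http {} context
limit_req_zone  $binary_remote_addr zone=perip:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=peraddr:10m;

server {
    location /api/ {
        limit_req  zone=perip burst=20 nodelay;   # absorb short bursts, shed sustained abuse
        limit_conn peraddr 20;                    # cap concurrent connections per client
        proxy_pass http://app_backend;            # placeholder upstream
    }
}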
Where Nginx bites people
- Buffering surprises: if you don’t understand request/response buffering, you can increase disk IO or latency unintentionally.
- Timeout mismatches: Nginx timeouts that don’t match upstream timeouts lead to “mystery 504s”.
- Configuration readability: Nginx config is clean until it isn’t. Nested locations and regex rules can become a puzzle box.
Joke #2: Nginx configs don’t become unreadable gradually. They become unreadable the moment someone says, “It’s just one more location block.”
HTTP/2, HTTP/3, TLS, and the stuff that breaks quietly
HTTP/2 and HTTP/3 aren’t “speed switches.” They change your failure modes.
HTTP/2 multiplexing reduces connection counts but can amplify the pain of packet loss on a single connection.
HTTP/3 (QUIC) can improve performance on lossy networks, but it also interacts with firewalls and monitoring in
ways that make your network team ask pointed questions.
TLS termination: choose a place and own it
Terminating TLS at the edge is common because it centralizes certificates and performance tuning. But centralizing
also centralizes blame. Good: fewer moving parts. Bad: one misconfigured cipher suite can tank your entire fleet.
One reliability quote you should internalize
Hope is not a strategy.
— paraphrased idea often attributed to operations and reliability leaders.
Paraphrased or not, the point is operationally precise: don’t “hope” your timeouts align, your TLS renewals will
work, or your HTTP/2 settings are fine. Prove it with tests and dashboards.
Dynamic apps: PHP-FPM, proxies, and reality
Most “Nginx vs Apache” fights are actually “how are we running dynamic code?” fights. In 2026, PHP-FPM is still a
common choice, and the difference between a stable system and a flapping one is usually pool sizing, timeouts, and
upstream backpressure—not the brand of the front-end web server.
What you should standardize
- Timeout chain: client → edge proxy → app runtime → DB. Set them intentionally and document the intended maximum request time (an edge-side sketch follows this list).
- Backpressure: what happens when the app is saturated? Queue? Reject? Shed load? This is a product decision wearing an SRE costume.
- Max request body: if you accept uploads, enforce limits in one place (edge) and validate again in the app.
- Observability fields: request ID, upstream timings, status, bytes sent, and TLS details for debugging.
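The edge side of that standardization might look like the following Nginx fragment. The values are illustrative and only meaningful once they are aligned with the app runtime and database timeouts behind them.

client_max_body_size   10m;     # enforce the upload limit once, at the edge
client_body_timeout    15s;
proxy_connect_timeout   2s;
proxy_send_timeout     30s;
proxy_read_timeout     30s;     # at or slightly above the app's own request timeout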
When Apache is a fine app front-end
If you’re running an older PHP app that depends on Apache-specific behavior, or you have a mature internal
ecosystem built around Apache modules, it can be correct to keep Apache close to the app. But you still want a
stable edge in front—often Nginx, sometimes a hardware or cloud LB—so the app tier doesn’t have to be your client
traffic shock absorber.
Observability that prevents blame-driven debugging
If your only metric is “requests per second,” you’re not observing—you’re counting. For Nginx/Apache, your
minimum viable visibility includes:
- Latency percentiles (p50/p95/p99), separated into upstream time vs edge time.
- Error rates by status code and by upstream.
- Connection states (active, reading, writing, waiting/keepalive).
- Resource pressure: CPU steal, memory, disk IO latency, network drops/retransmits.
- Deploy/reload events correlated with error spikes.
The trick: log what you’ll need before you know you need it. Add request IDs and upstream timing fields now.
During an incident, you won’t have time to negotiate with your past self.
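A minimal Nginx log_format sketch that captures a request ID and upstream timings; the rt/uct/uht/urt names are a convention used in the tasks below, not a standard.

# http {} context
log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 'rid=$request_id rt=$request_time '
                 'uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time';

access_log /var/log/nginx/access.log timed;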
Practical tasks: commands, outputs, and decisions (12+)
These are tasks you can run on a real Linux box. Each includes: command, what the output means, and what decision
you make from it. Use them as a production ritual, not a one-time tutorial.
1) Identify which server is actually answering traffic
cr0x@server:~$ sudo ss -ltnp | egrep ':(80|443)\s'
LISTEN 0 511 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1294,fd=6))
LISTEN 0 511 0.0.0.0:443 0.0.0.0:* users:(("nginx",pid=1294,fd=7))
Meaning: Nginx owns ports 80/443 on this host. If you expected Apache, you have an architectural mismatch or a stale runbook.
Decision: Point your tuning and logs at the actual edge process. Also verify if Apache is only upstream (e.g., on 127.0.0.1:8080).
2) Confirm Nginx config is syntactically valid before you reload
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Meaning: You won’t brick the service with a reload due to syntax.
Decision: Proceed to reload; if it fails, treat it as a change-management incident, not “ops being cautious.”
3) Validate Apache config and loaded modules
cr0x@server:~$ sudo apachectl -t -D DUMP_MODULES | head
Loaded Modules:
core_module (static)
so_module (static)
http_module (static)
mpm_event_module (shared)
proxy_module (shared)
proxy_fcgi_module (shared)
Meaning: Apache is using mpm_event and has proxy modules. This suggests Apache is being used as a reverse proxy or app front-end, not necessarily mod_php.
Decision: If you see mpm_prefork plus heavy traffic, investigate memory and concurrency limits immediately.
4) Check Apache MPM at runtime (the concurrency model you’re paying for)
cr0x@server:~$ sudo apachectl -V | grep 'Server MPM'
Server MPM:     event
Meaning: You’re on the event MPM, usually the modern choice for keep-alive efficiency.
Decision: Tune thread/process limits accordingly; don’t apply prefork-era “MaxClients” folklore.
5) Check live connection pressure (are you drowning in slow clients?)
cr0x@server:~$ sudo ss -s
Total: 2234 (kernel 0)
TCP: 1947 (estab 812, closed 1001, orphaned 0, timewait 1001)
Transport Total IP IPv6
RAW 0 0 0
UDP 12 9 3
TCP 946 781 165
INET 958 790 168
FRAG 0 0 0
Meaning: High TIME-WAIT counts can be normal, but sustained high established connections can indicate slow clients, keepalive too generous, or downstream slowness.
Decision: If established connections trend up with latency, investigate keepalive settings and upstream saturation; consider request buffering and tighter timeouts.
6) Spot file descriptor exhaustion before it becomes a mystery outage
cr0x@server:~$ sudo cat /proc/$(pidof nginx | awk '{print $1}')/limits | egrep 'Max open files'
Max open files 1024 1024 files
Meaning: 1024 open files is tiny for a busy edge proxy. You’ll hit it via sockets and logs under load.
Decision: Raise limits via systemd unit overrides and worker_rlimit_nofile (Nginx) / appropriate service limits for Apache.
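A hedged sketch of what raising the limits looks like in practice; 65535 is an illustrative number, and the drop-in path assumes a systemd-managed Nginx service.

# systemd drop-in, e.g. /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65535

# nginx.conf, main context
worker_rlimit_nofile 65535;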
7) Check listen backlog and SYN pressure symptoms
cr0x@server:~$ sudo netstat -s | egrep -i 'listen|overflow|drop' | head
104 times the listen queue of a socket overflowed
104 SYNs to LISTEN sockets dropped
Meaning: Your accept queue overflowed; clients may see connection failures or retries. This is often load + too-small backlog + slow accept due to CPU pressure.
Decision: Increase somaxconn, verify Nginx listen ... backlog= where applicable, and reduce upstream latency so workers return to accept faster.
8) Verify kernel backlog and file limits (system-wide guardrails)
cr0x@server:~$ sysctl net.core.somaxconn fs.file-max
net.core.somaxconn = 128
fs.file-max = 9223372036854775807
Meaning: somaxconn=128 is conservative for an edge server; it can contribute to dropped SYNs under bursty traffic.
Decision: Raise net.core.somaxconn (and align with your server’s listen backlog settings). Validate with load tests; don’t cargo-cult huge values without need.
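An illustrative sketch of keeping those values aligned; the backlog you request on listen should not exceed somaxconn, and Nginx's Linux default backlog is 511.

# sysctl drop-in (placeholder filename)
net.core.somaxconn = 1024

# the vhost has to request the larger backlog explicitly
server {
    listen 443 ssl backlog=1024;
    # rest of the vhost unchanged
}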
9) Confirm what HTTP versions are negotiated (and whether clients are stuck on 1.1)
cr0x@server:~$ curl -skI --http2 https://localhost/ | head -n 5
HTTP/2 200
server: nginx
date: Wed, 04 Feb 2026 12:03:11 GMT
content-type: text/html
Meaning: HTTP/2 is working locally; this doesn’t prove it’s working from the internet (middleboxes exist), but it’s a baseline.
Decision: If clients still negotiate HTTP/1.1 at scale, check ALPN/TLS config, CDN/LB capabilities, and whether you accidentally disabled HTTP/2 on the relevant vhost.
10) Inspect TLS certificate and expiration quickly
cr0x@server:~$ echo | openssl s_client -servername example.com -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates -subject
notBefore=Jan 10 00:00:00 2026 GMT
notAfter=Apr 10 00:00:00 2026 GMT
subject=CN = example.com
Meaning: Cert is valid now and expires April 10. You have a calendar, but production needs automation and alerting.
Decision: If expiry is close, prioritize renewal pipeline and deploy process; don’t schedule heroics.
11) Find whether you’re CPU-bound, IO-bound, or blocked on something else
cr0x@server:~$ sudo pidstat -p $(pidof nginx | tr ' ' ',') 1 3
Linux 6.8.0 (server) 02/04/2026 _x86_64_ (8 CPU)
12:03:40 PM UID PID %usr %system %CPU CPU Command
12:03:41 PM 0 1294 18.00 6.00 24.00 2 nginx
12:03:41 PM 0 1295 21.00 7.00 28.00 3 nginx
12:03:41 PM 0 1296 20.00 6.00 26.00 4 nginx
Meaning: Nginx workers are consuming CPU but not maxing the host. If latency is high, it may be upstream waits, network, or disk rather than pure CPU saturation.
Decision: If CPU is pegged, reduce expensive work at the edge (regex, Lua, compression level), offload static assets, and ensure you’re not doing dynamic work in the web tier.
12) Measure disk latency that turns “fast web server” into “slow web server”
cr0x@server:~$ iostat -xz 1 2
Linux 6.8.0 (server) 02/04/2026 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.10 0.00 7.20 9.80 0.00 60.90
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
nvme0n1 210.0 8420.0 10.0 4.5 8.20 40.1 95.0 6120.0 18.40 64.4 2.10 92.0
Meaning: High %util and elevated w_await suggest disk pressure. Static files, temp buffers, and logs can all contribute.
Decision: Move temp buffering to faster storage, reduce response buffering to disk where appropriate, fix log volume, and investigate filesystem and underlying storage health.
13) Check upstream saturation for PHP-FPM (the silent 504 factory)
cr0x@server:~$ sudo ss -ltnp | grep php-fpm
LISTEN 0 128 127.0.0.1:9000 0.0.0.0:* users:(("php-fpm8.2",pid=2201,fd=9))
Meaning: PHP-FPM is listening on TCP 9000. Now check if it’s the bottleneck via status or logs.
Decision: If Nginx shows upstream timeouts, tune FPM pool (pm.max_children, timeouts) and fix slow requests rather than increasing web worker counts.
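A hedged PHP-FPM pool sketch; the numbers are placeholders to be derived from measured RSS per worker and your real request durations.

; pool config, e.g. /etc/php/8.2/fpm/pool.d/www.conf
[www]
listen = 127.0.0.1:9000                 ; matches the ss output above
pm = dynamic
pm.max_children = 20                    ; size from available RAM / measured RSS per child
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 8
request_terminate_timeout = 30s         ; align with the proxy's upstream read timeout
request_slowlog_timeout = 5s
slowlog = /var/log/php8.2-fpm-slow.log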
14) Validate Nginx access log fields for upstream timing (debug with facts)
cr0x@server:~$ sudo tail -n 3 /var/log/nginx/access.log
203.0.113.10 - - [04/Feb/2026:12:03:51 +0000] "GET /api/cart HTTP/2.0" 200 512 "-" "Mozilla/5.0" rt=0.245 uct=0.002 uht=0.240 urt=0.241
203.0.113.11 - - [04/Feb/2026:12:03:52 +0000] "POST /upload HTTP/2.0" 413 166 "-" "curl/8.5.0" rt=0.003 uct=- uht=- urt=-
203.0.113.12 - - [04/Feb/2026:12:03:52 +0000] "GET /static/app.js HTTP/2.0" 200 183211 "https://example.com/" "Mozilla/5.0" rt=0.012 uct=- uht=- urt=-
Meaning: rt (request time) and upstream timing fields show where time is spent. For /api/cart, nearly all time is upstream.
Decision: Don’t tune Nginx for an upstream latency problem. Go to the app/DB. For 413, adjust client_max_body_size intentionally (and enforce product limits).
15) Confirm Apache is not secretly doing per-request filesystem walks via AllowOverride
cr0x@server:~$ sudo apachectl -t -D DUMP_VHOSTS | head -n 12
VirtualHost configuration:
*:80 is a NameVirtualHost
default server example.com (/etc/apache2/sites-enabled/000-default.conf:1)
port 80 namevhost example.com (/etc/apache2/sites-enabled/000-default.conf:1)
Meaning: Vhost config location is identified. You still need to inspect for AllowOverride settings in relevant directories.
Decision: If you find AllowOverride All on high-traffic paths, consider moving rules into central config and setting AllowOverride None for performance and governance.
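The centralized alternative is a sketch like this, with a placeholder path; former .htaccess rules move into it and go through review like everything else.

<Directory /var/www/example>
    AllowOverride None          # stop per-request .htaccess lookups on this path
    Require all granted
</Directory>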
Fast diagnosis playbook
When the site is “slow,” don’t start by editing config. Start by locating the bottleneck with the least amount of
ceremony. This is the order that wins incidents.
First: is it the edge or the upstream?
- Check edge latency vs upstream latency using logs (Nginx upstream timing fields or Apache %D/%{…}).
- Check status code mix: spikes in 499/504/502 scream proxy/upstream issues, not “web server is slow.”
- Check connection counts: too many established connections + stable RPS often means slow clients or keepalive mis-tuning.
Second: is the host resource-starved?
- CPU: high %system can mean TLS/compression overhead or kernel/network pressure.
- Memory: swapping on a web tier is a self-inflicted DoS. If you see swap activity, treat it as a priority-1 misconfiguration.
- Disk IO latency: high await times can stall static file serving and proxy buffering.
Third: is the network lying to you?
- Retransmits and drops: packet loss turns HTTP/2 into “one connection to suffer.”
- Backlog overflow: dropped SYNs look like random client failures.
- MTU/PMTUD weirdness: especially with VPNs, corporate networks, and UDP/QUIC.
Fourth: only then touch config
Once you’ve identified where the time goes, tune the component that owns that time. If upstream is slow, don’t
“scale Nginx.” That’s how you get an expensive outage with prettier graphs.
Common mistakes: symptom → root cause → fix
1) Symptom: random 502/504 spikes during deploys
Root cause: reload/restart behavior plus upstream connection draining not handled; health checks too optimistic; timeouts mismatched.
Fix: implement graceful reloads; ensure upstreams drain; align proxy_read_timeout (Nginx) / ProxyTimeout (Apache) with app timeouts; reduce deploy blast radius.
2) Symptom: high memory usage, OOM kills on Apache
Root cause: prefork MPM with high concurrency; mod_php loaded; each process carries heavy memory.
Fix: move to event MPM + PHP-FPM; cap workers/threads; measure RSS per worker; stop pretending RAM is infinite.
3) Symptom: Nginx “works” but uploads fail or are slow
Root cause: request buffering to disk with slow storage; client_max_body_size too low; timeouts too aggressive for large uploads.
Fix: set size limits intentionally; place temp paths on fast storage; consider direct-to-object-store uploads; tune client_body_timeout and upstream timeouts.
4) Symptom: sudden increase in TIME_WAIT and ephemeral port issues on upstream
Root cause: keepalive disabled between proxy and upstream, causing new TCP connections per request; or upstream closes too aggressively.
Fix: enable upstream keepalive pools (Nginx keepalive in upstream); ensure app server supports it; tune keepalive timeouts thoughtfully.
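The standard Nginx side of that fix, sketched with a placeholder upstream:

upstream app_backend {
    server 127.0.0.1:8080;
    keepalive 32;                       # idle connections kept open per worker
}

server {
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;         # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection ""; # clear Connection so it is not sent as "close"
    }
}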
5) Symptom: “fast locally, slow for some users” after enabling HTTP/3
Root cause: UDP blocked or rate-limited in certain networks; fallback behavior not tested; observability blind spots.
Fix: deploy with progressive rollout; ensure clean fallback to HTTP/2; monitor per-protocol error rates and handshake failures.
6) Symptom: Apache CPU spikes after “small config change”
Root cause: enabling extensive rewrite rules per directory; AllowOverride causing filesystem checks; regex-heavy rewrites executed on every request.
Fix: migrate rules to centralized config; set AllowOverride None where possible; simplify rewrite logic; cache static routes at the edge.
7) Symptom: Nginx serves stale or incorrect content intermittently
Root cause: proxy caching configured without correct cache keys; missing Vary handling; caching authenticated responses.
Fix: define cache key explicitly; bypass cache on auth/cookies; validate cache-control headers; run canary checks for authenticated pages.
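A cautious caching sketch along those lines; the cache path, zone name, location, and session cookie name are all placeholders, and without proxy_cache_valid it honors the upstream's Cache-Control headers.

# http {} context
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:50m max_size=1g inactive=10m;

server {
    location /reports/ {
        proxy_cache        app_cache;
        proxy_cache_key    $scheme$host$request_uri;                # explicit, not implied
        proxy_cache_bypass $cookie_sessionid $http_authorization;   # do not serve cached content to authenticated requests
        proxy_no_cache     $cookie_sessionid $http_authorization;   # and do not store their responses
        proxy_pass http://app_backend;                              # placeholder upstream
    }
}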
8) Symptom: clients see “connection reset” under load, logs look clean
Root cause: kernel backlog overflow, conntrack pressure, or upstream resets; web server logs often don’t capture the kernel dropping SYNs.
Fix: check listen queue overflow stats; raise backlog settings; ensure CPU headroom; examine firewall/conntrack capacity if relevant.
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption
A mid-sized SaaS company migrated from a single Apache box to “a modern stack.” They put Nginx in front, left the
app tier mostly unchanged, and declared victory after a load test that hit only the homepage. The assumption:
“Nginx will handle concurrency better, so the app will be fine.”
Two weeks later, a marketing campaign landed. Traffic wasn’t insane, but user behavior changed: more uploads,
longer sessions, and a lot more slow mobile clients. Nginx did its job—kept connections open, buffered politely,
and forwarded requests. The app tier, however, was still sized like it lived in 2014: PHP-FPM pools too small,
slow requests unbounded, and database queries occasionally taking seconds.
The visible symptom was 504s at the edge. The team blamed Nginx because it was new. They raised worker processes,
raised file limits, and tuned keepalive. The 504s got worse because the proxy got better at feeding the upstream.
The fix was embarrassingly basic: add upstream timing to logs, cap request time, tune PHP-FPM children based on
measured memory per worker, and add circuit-breaking behavior for the slow endpoints. Nginx wasn’t the problem;
the assumption that “edge performance equals system performance” was.
Mini-story #2: The optimization that backfired
An enterprise internal platform team decided to “standardize” by moving everything to HTTP/2 everywhere and
enabling aggressive compression at the edge. The goal was noble: fewer connections, faster loads, lower bandwidth.
They rolled it out broadly after tests showed lower median latency.
Then the slow tail arrived. A subset of users on lossy VPN connections started seeing requests hang. Not fail
immediately—hang. The p99 got ugly. Support tickets mentioned “spinning” and “sometimes it fixes itself.”
Observability didn’t help at first because the dashboards were still focused on averages and total throughput.
Root cause: with HTTP/2 multiplexing, a single lossy connection could degrade multiple in-flight requests. Combine
that with high compression CPU usage during peak and you get a perfect storm: workers spend more time compressing
and less time accepting/serving, while clients retry in ways that amplify load. It wasn’t an HTTP/2 bug; it was an
interaction between network conditions, CPU headroom, and configuration choices.
The “fix” was not to disable HTTP/2 globally. They reduced compression level, excluded already-compressed types,
added headroom, and instrumented per-protocol latency. The lesson was classic SRE: optimizations that help the
median can still harm the tail, and the tail is what pages you.
Mini-story #3: The boring but correct practice that saved the day
A financial services team ran a mixed estate: Nginx at the edge, Apache behind it for a legacy app, plus a few
newer services. Nothing glamorous. But they had a rule: every config change must pass a syntax test, be deployed
via the same pipeline, and include an automated rollback path. They also shipped a standard log format that
included request IDs and upstream timings everywhere.
One afternoon, a certificate renewal went sideways—not because TLS is hard, but because a seemingly minor config
refactor changed the certificate path in one environment. The edge started serving an old cert on a subset of
hosts. Some clients failed hard, others retried, and support started lighting up.
Here’s the boring part that saved them: they had an alert on certificate expiration/validity, and a canary check
that validated the presented certificate subject and expiry from multiple networks. The issue was caught quickly.
The log correlation by request ID also proved that app latency wasn’t involved, so the incident stayed scoped.
They rolled back in minutes. The postmortem was almost disappointing: no heroics, no mystery. Just controlled
change, fast detection, and a pipeline that didn’t require a human to remember twelve steps under stress.
Checklists / step-by-step plan
Checklist A: choosing Nginx, Apache, or both
- Define the edge contract: TLS termination point, max body size, timeouts, and which headers you trust.
- Decide if per-directory overrides are allowed. If yes, Apache may be justified; if no, prefer centralized config (often Nginx).
- Choose the app execution model: PHP-FPM or app server behind proxy. Avoid coupling web worker model to app runtime unless you have a strong reason.
- Write down failure behavior: what happens when upstream is slow? Queue? Reject? Serve stale? This must be explicit.
- Pick an observability baseline: log fields, metrics, dashboards, and alerts that map to the above contract.
Checklist B: production hardening for either server
- Set file descriptor limits (service and system). Validate with /proc/<pid>/limits.
- Validate config in CI using nginx -t or apachectl -t before deployment.
- Standardize log formats and include upstream timing and request IDs.
- Align timeouts across edge, upstream, and application.
- Set sane keepalive between clients and edge, and between edge and upstream.
- Plan reload strategy and test it during business hours on a canary host.
- Protect the upstream with buffering/limits so slow clients don’t hold app resources.
Checklist C: migration plan (Apache → Nginx edge) without drama
- Inventory rewrite rules, auth mechanisms, and any Apache-only modules you depend on.
- Start with a pass-through proxy (Nginx in front, minimal logic). Prove correctness first.
- Move static assets and caching to Nginx next. Observe IO and cache headers.
- Migrate rewrites carefully. Regex parity is a trap; test with real request samples.
- Roll out by percentage (LB weights) or by domain/host. Keep rollback cheap.
- After stable operation, decide whether Apache stays as upstream or is retired.
FAQ
1) Which is faster: Nginx or Apache?
For most real stacks, “faster” is dominated by app and database time. Nginx often handles high concurrency with
fewer resources at the edge, but Apache with event MPM can be excellent too. Choose based on operational model and failure modes.
2) Should I run both Nginx and Apache?
Often yes: Nginx at the edge, Apache behind for legacy apps. But don’t do it accidentally. Be explicit about who owns TLS, buffering, and timeouts.
3) Is .htaccess always bad?
Not always—.htaccess is a pragmatic tool for delegated control. It’s “bad” when you want centralized change control, consistent behavior, and reproducible deployments.
4) What Apache MPM should I use in 2026?
Prefer event for most modern setups. Use prefork only when forced by non-thread-safe modules or legacy requirements, and then budget memory accordingly.
5) Why do I see 499 in Nginx logs?
499 means the client closed the connection before Nginx could respond. Common causes: client timeouts, slow upstreams, mobile network drops, or overly aggressive edge timeouts.
6) What’s the most common cause of 504 Gateway Timeout?
Upstream slowness or saturation: app threads/workers exhausted, slow queries, downstream dependencies. Fix upstream capacity and latency first; then tune proxy timeouts to match reality.
7) Should I enable HTTP/3 everywhere?
Enable it when you can measure it. Roll out gradually, ensure clean fallback, and watch per-protocol latency and error rates. Some networks will treat UDP as suspicious.
8) Is Nginx better for Kubernetes and microservices?
Nginx is commonly used, but in cluster environments your ingress/LB choice may matter more than the per-node web server. The same principles apply: buffering, timeouts, observability, and change control.
9) Can Apache do reverse proxying just as well?
Yes, Apache can reverse proxy effectively, especially with event MPM and the right proxy modules. The operational ergonomics often differ: Nginx configs tend to be more standardized for edge roles.
10) What’s the simplest safe default stack for a PHP app?
Nginx at the edge + PHP-FPM for execution, with explicit timeouts, request size limits, upstream keepalive where appropriate, and upstream timing in logs.
Next steps you can do this week
- Decide the boundary: who terminates TLS, who buffers, who sets timeouts, and where per-tenant overrides are allowed.
- Instrument upstream timing in your logs so you can answer “edge or app?” in 30 seconds.
- Run the practical tasks above on one production host and write down what surprised you. That surprise is where your next incident lives.
- Align timeouts across proxy and app. Remove mismatches that generate fake 504s or hung connections.
- Kill config drift: enforce config tests in CI, standardize reload procedures, and stop letting ad-hoc edits become “the way we do things.”
If you want the real 2026 answer: choose the server that makes your system more observable, your changes more
controlled, and your failure modes more predictable. Everything else is internet arguments.