It starts as a few 502s. Then your graphs look like a seismograph. Nginx is “up,” system load is fine, but clients get connection resets and your error log starts chanting: Too many open files.
On Ubuntu 24.04, the fix is rarely “just raise ulimit.” That advice is how you end up with a pretty number in a shell and the same outage in production. The right fix is about systemd unit limits, Nginx’s own ceilings, and the kernel’s global file descriptor budget—plus verifying all of it with commands that don’t lie.
What “Too many open files” really means on Nginx
The kernel returns EMFILE when a process hits its per-process file descriptor limit. Nginx logs that as “Too many open files” when it can’t open a socket, accept a connection, open a file, or create an upstream connection.
Important nuance: Nginx “files” are not just files. They’re sockets, pipes, eventfd, signalfd, and whatever else the process opens. If Nginx is a busy reverse proxy, most descriptors are sockets: client connections + upstream connections + listening sockets + log files. If it’s serving static files, descriptors are also open file handles. If you use open_file_cache, you can intentionally increase open file handles. If you run on HTTP/2 or HTTP/3, connection patterns change, but the descriptor accounting still matters.
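A quick way to see that mix for yourself is to group a worker's descriptors by what they point at. A minimal sketch, assuming workers run as www-data as in the examples later in this article; the counts shown are illustrative:
cr0x@server:~$ WPID=$(pgrep -u www-data nginx | head -n 1)
cr0x@server:~$ sudo ls -l /proc/$WPID/fd | awk 'NR>1 {print $NF}' | cut -d'[' -f1 | sort | uniq -c | sort -nr
    983 socket:
      3 pipe:
      2 anon_inode:
      1 /var/log/nginx/error.log
      1 /var/log/nginx/access.log
On a proxy-heavy box, socket: dominates; on a static-file box, you will see far more regular file paths.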
On Ubuntu 24.04, Nginx is typically managed by systemd. That means:
- /etc/security/limits.conf does not reliably apply to systemd services.
- What you set in an interactive shell (ulimit -n) is irrelevant to the Nginx master process launched by systemd.
- There are multiple layers of limits: Nginx config, systemd unit, PAM limits for user sessions, and kernel maximums.
Here’s the only mental model that consistently prevents the “I raised it, why is it still broken?” loop:
- Kernel global ceiling: how many file descriptors the whole system can allocate (fs.file-max, file-nr), plus the per-process maximum allowed to be set (fs.nr_open).
- systemd service ceiling: LimitNOFILE on the Nginx unit (or a systemd default applied to all services).
- Nginx internal ceiling: worker_rlimit_nofile and the maximum implied by worker_connections.
- Reality: the actual number of open descriptors under load (a one-pass readout of all four layers is sketched below).
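To keep yourself honest, read all four layers in one pass before touching anything. A minimal sketch, assuming the stock nginx.service unit and the default /run/nginx.pid pidfile from Ubuntu's package:
cr0x@server:~$ sysctl fs.file-max fs.nr_open                                              # kernel ceilings
cr0x@server:~$ systemctl show nginx -p LimitNOFILE                                        # systemd service ceiling
cr0x@server:~$ sudo nginx -T 2>/dev/null | grep -E 'worker_(rlimit_nofile|connections)'   # Nginx internal ceilings
cr0x@server:~$ sudo grep 'open files' /proc/$(cat /run/nginx.pid)/limits                  # what the live master actually got
cr0x@server:~$ sudo ls /proc/$(cat /run/nginx.pid)/fd | wc -l                             # reality: descriptors in use right now
If any layer disagrees with what you think you configured, that layer is where you start.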
Also, “too many open files” can be a symptom of a leak. It’s not always “you need a bigger number.” If descriptors grow steadily with constant traffic, something isn’t being closed. It might be an upstream issue, a misbehaving module, or a logging or caching pattern that never releases.
Joke #1: File descriptors are like coffee cups in the office kitchen—if nobody brings them back, you eventually can’t serve anyone, and someone will blame the dishwasher.
Fast diagnosis playbook (what to check first)
This is the sequence that finds the bottleneck quickly, without arguing with hunches.
1) Confirm the error and where it’s happening
- Nginx error log: is it failing on accept(), open(), or upstream sockets?
- systemd journal: do you see “accept4() failed (24: Too many open files)” or similar?
2) Check the Nginx process limits as seen by the kernel
- Read /proc/<pid>/limits for the Nginx master and a worker.
- If the limit is low (1024/4096), it’s almost always systemd configuration, not Nginx.
3) Measure actual FD usage and growth
- Count open descriptors per worker.
- Look for monotonic growth under steady load (leak smell).
4) Validate system-wide capacity
- Run sysctl fs.file-max and cat /proc/sys/fs/nr_open for the ceilings.
- Run cat /proc/sys/fs/file-nr to see allocated vs used.
5) Only then tune
- Raise LimitNOFILE in a systemd drop-in.
- Set worker_rlimit_nofile and align worker_connections.
- Reload, verify, and load test (or at least confirm under real traffic).
Interesting facts and context (why this keeps happening)
- Fact 1: Unix file descriptors predate modern networking; sockets were intentionally designed to look like files to reuse APIs.
- Fact 2: The classic default soft limit of 1024 comes from an era when 1,000 simultaneous connections sounded like science fiction, not Tuesday.
- Fact 3: Nginx’s event-driven model is efficient partly because it keeps many connections open concurrently—meaning it will happily consume FDs if you let it.
- Fact 4: systemd does not automatically inherit your shell’s ulimit; services get limits from unit files and systemd defaults.
- Fact 5: On Linux, fs.nr_open bounds the maximum per-process open files limit you can set, even as root.
- Fact 6: “Too many open files” may show up even when system-wide file descriptor usage is fine, because per-process limits bite first.
- Fact 7: Nginx can hit FD limits faster when proxying because each client connection can imply an upstream connection (sometimes more than one).
- Fact 8: Ephemeral port exhaustion often gets misdiagnosed as FD exhaustion; both look like connection failures, but they’re different knobs.
One paraphrased idea from a notable SRE voice (John Allspaw): reliability comes from learning, not blame, so instrument the system and verify assumptions.
Practical tasks: commands, expected output, and decisions (12+)
These are “do this now” tasks. Each includes what the output means and what decision you make next. Run them as root or with sudo where appropriate.
Task 1: Confirm the error in Nginx logs
cr0x@server:~$ sudo tail -n 50 /var/log/nginx/error.log
2025/12/30 02:14:01 [crit] 2143#2143: *98123 accept4() failed (24: Too many open files)
2025/12/30 02:14:01 [alert] 2143#2143: open() "/var/log/nginx/access.log" failed (24: Too many open files)
Meaning: Nginx is hitting a per-process FD ceiling. accept4() failing means it can’t accept new client connections. Failing to open logs is also a red flag: it’s already in trouble.
Decision: Immediately check Nginx process limits and current FD usage before changing anything.
Task 2: Check systemd journal for correlated messages
cr0x@server:~$ sudo journalctl -u nginx --since "30 minutes ago" | tail -n 30
Dec 30 02:14:01 server nginx[2143]: 2025/12/30 02:14:01 [crit] 2143#2143: *98123 accept4() failed (24: Too many open files)
Dec 30 02:14:02 server systemd[1]: nginx.service: Reloading.
Dec 30 02:14:03 server nginx[2143]: 2025/12/30 02:14:03 [alert] 2143#2143: open() "/var/log/nginx/access.log" failed (24: Too many open files)
Meaning: Confirms it’s Nginx and not a client-side rumor. Also shows if restarts/reloads happened during the event.
Decision: If reloads are happening automatically (config management loops), you may be amplifying the problem. Consider pausing automation until limits are corrected.
Task 3: Find the Nginx master PID
cr0x@server:~$ pidof nginx
2143 2144 2145 2146
Meaning: pidof lists every Nginx PID but doesn’t label them; don’t assume the first one is the master.
Decision: Verify which PID is master and which are workers; you need to inspect both.
Task 4: Verify master and worker roles
cr0x@server:~$ ps -o pid,ppid,user,cmd -C nginx
PID PPID USER CMD
2143 1 root nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
2144 2143 www-data nginx: worker process
2145 2143 www-data nginx: worker process
2146 2143 www-data nginx: worker process
Meaning: Master is root, workers are www-data. Limits may differ if something odd is happening, but normally they inherit from the master.
Decision: Inspect limits on both master and workers anyway—don’t assume inheritance is what you think it is.
Task 5: Check current FD limits from /proc
cr0x@server:~$ sudo cat /proc/2143/limits | grep -i "open files"
Max open files 1024 1024 files
Meaning: This is the smoking gun: 1024 is not a serious production limit for an Nginx reverse proxy.
Decision: Fix systemd LimitNOFILE. Shell ulimit tweaks won’t matter.
Task 6: Count open file descriptors for a worker
cr0x@server:~$ sudo ls -1 /proc/2144/fd | wc -l
1007
Meaning: The worker is nearly at the limit. You’re not imagining things.
Decision: If usage is close to the limit, raise it. If it’s nowhere near, you may be chasing the wrong error (rare, but check).
Task 7: Identify what types of FDs are open (sockets vs files)
cr0x@server:~$ sudo lsof -p 2144 | awk '{print $5}' | sort | uniq -c | sort -nr | head
812 IPv4
171 IPv6
12 REG
5 FIFO
Meaning: Mostly sockets (IPv4/IPv6). This is load-driven connection concurrency, not static file handle bloat.
Decision: Focus on per-worker connection limits, keepalive behavior, and upstream connection reuse—not just “serve fewer files.”
Task 8: Check Nginx config for worker limits
cr0x@server:~$ sudo nginx -T 2>/dev/null | egrep -n 'worker_(processes|connections|rlimit_nofile)' | head -n 30
12:worker_processes auto;
18:worker_connections 768;
Meaning: No worker_rlimit_nofile set, and worker_connections is modest. With keepalive and proxying, 768 can still cause pressure, but the bigger issue is the OS limit of 1024.
Decision: Align LimitNOFILE and worker_rlimit_nofile with a realistic target. Then revisit worker_connections based on expected concurrency.
Task 9: Check current systemd limits applied to the nginx service
cr0x@server:~$ sudo systemctl show nginx -p LimitNOFILE
LimitNOFILE=1024
Meaning: systemd is enforcing 1024 for the service. This overrides your hopes and dreams.
Decision: Add a systemd drop-in override for nginx with a higher LimitNOFILE.
Task 10: Check kernel global file descriptor stats
cr0x@server:~$ cat /proc/sys/fs/file-max
9223372036854775807
cr0x@server:~$ cat /proc/sys/fs/file-nr
3648 0 9223372036854775807
Meaning: On modern Ubuntu, file-max may be effectively “very high.” file-nr shows allocated handles (~3648). This system is nowhere near global FD exhaustion.
Decision: Don’t touch global fs.file-max yet. Your bottleneck is per-service/per-process limits.
Task 11: Check fs.nr_open (per-process hard cap)
cr0x@server:~$ cat /proc/sys/fs/nr_open
1048576
Meaning: You can set per-process limits up to 1,048,576. Plenty of headroom.
Decision: Choose a reasonable number (often 65535 or 131072) rather than going full “infinite.”
Task 12: Check how many connections Nginx is actually handling
cr0x@server:~$ sudo ss -s
Total: 2389 (kernel 0)
TCP: 1920 (estab 1440, closed 312, orphaned 0, timewait 311)
Meaning: If established connections are high, FD usage will be high. TIME_WAIT isn’t FDs in the same way, but it signals traffic patterns and keepalive behavior.
Decision: If established connections correlate with the error, raising FD limits is valid. If established is low but FDs are high, suspect leaks or caches.
Task 13: Validate the active Nginx service unit file path and drop-ins
cr0x@server:~$ sudo systemctl status nginx | sed -n '1,12p'
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/nginx.service.d
└─ override.conf
Active: active (running) since Tue 2025-12-30 01:58:12 UTC; 16min ago
Meaning: Shows where the unit is loaded from and whether drop-ins exist. If you don’t see a drop-in directory, you haven’t overridden anything yet.
Decision: Prefer a drop-in override under /etc/systemd/system/nginx.service.d/. Don’t edit vendor units in /usr/lib.
Task 14: After changes, confirm the new limit is in effect
cr0x@server:~$ sudo systemctl show nginx -p LimitNOFILE
LimitNOFILE=65535
cr0x@server:~$ sudo cat /proc/$(pidof nginx | awk '{print $1}')/limits | grep -i "open files"
Max open files 65535 65535 files
Meaning: systemd now grants 65535 and the running master process has it. That’s the verification step people skip, then act surprised later.
Decision: If the values don’t match, you didn’t restart properly, your override didn’t load, or another unit setting wins.
Fix it the right way: systemd LimitNOFILE for Nginx
Ubuntu 24.04 runs systemd. Nginx likely runs as nginx.service. The correct approach is a unit override (drop-in), not hand-editing unit files and not relying on limits.conf.
Pick a sane number
Common production values:
- 65535: classic, widely used, usually enough for single-node Nginx doing normal proxying.
- 131072: for very high concurrency or multi-upstream workloads.
- 262144+: rare; justified only with measurements and a reason (and usually other bottlenecks show up first).
Don’t set it to “a million” because you can. A big FD limit can hide a leak longer, and it increases the blast radius when something goes pathological.
Create a systemd drop-in override
cr0x@server:~$ sudo systemctl edit nginx
# (an editor opens)
Add this:
cr0x@server:~$ cat /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=65535
Decision: If you also run Nginx as a container or via a different unit (e.g., a custom wrapper), apply the override to the correct unit name, not the one you wish you were using.
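If you’d rather not go through the interactive editor (say, from a provisioning script), the same drop-in can be written directly. A minimal sketch, assuming the stock nginx.service unit name:
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/nginx.service.d
cr0x@server:~$ printf '[Service]\nLimitNOFILE=65535\n' | sudo tee /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=65535
Either way, the next two steps are the same: daemon-reload and a full restart.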
Reload systemd and restart Nginx
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart nginx
cr0x@server:~$ sudo systemctl is-active nginx
active
Meaning: daemon-reload makes systemd re-read units. A restart is required for new limits; reloads don’t reliably reapply resource limits to an already-running process.
Verify, don’t assume
cr0x@server:~$ sudo systemctl show nginx -p LimitNOFILE
LimitNOFILE=65535
If that shows the right value but /proc/<pid>/limits doesn’t, you’re probably looking at an old PID (service didn’t restart) or Nginx is launched by something else.
What about DefaultLimitNOFILE?
systemd can set default limits for all services via /etc/systemd/system.conf (and user.conf for user services). This is tempting in corporate environments because it “fixes everything.” It also changes everything.
My opinion: for Nginx, use a per-service override unless you have a mature baseline and know the impact on every daemon. Setting a high default can unintentionally enable other processes to open huge numbers of files, which is not automatically good.
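Before deciding, it’s worth reading what the manager default already is. On a stock Ubuntu 24.04 install you’ll typically see a high hard default and a low soft default, which is why a service that never asks for more ends up running against a 1024 soft limit:
cr0x@server:~$ systemctl show -p DefaultLimitNOFILE -p DefaultLimitNOFILESoft
DefaultLimitNOFILE=524288
DefaultLimitNOFILESoft=1024
cr0x@server:~$ grep -Rs 'DefaultLimitNOFILE' /etc/systemd/system.conf /etc/systemd/system.conf.d/
No output from the grep means nothing has overridden the shipped default.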
Fix Nginx-side limits: worker_rlimit_nofile, worker_connections, keepalive
Systemd can grant Nginx 65k FDs, but Nginx still has to use them intelligently.
worker_rlimit_nofile: align Nginx to the OS limit
Add in the main (top-level) context of /etc/nginx/nginx.conf:
cr0x@server:~$ sudo grep -n 'worker_rlimit_nofile' /etc/nginx/nginx.conf || true
cr0x@server:~$ sudoedit /etc/nginx/nginx.conf
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Example snippet you might add:
cr0x@server:~$ sudo awk 'NR==1,NR==30{print}' /etc/nginx/nginx.conf
user www-data;
worker_processes auto;
worker_rlimit_nofile 65535;
events {
worker_connections 8192;
}
Meaning: worker_rlimit_nofile sets RLIMIT_NOFILE for worker processes (and sometimes master, depending on build/behavior). If you don’t set it, Nginx may still run with the system-provided limit, but explicit alignment avoids surprises.
Decision: Set worker_rlimit_nofile to the same (or slightly lower) value than systemd’s LimitNOFILE. If you set it higher than allowed, Nginx will not magically exceed the OS limit.
worker_connections: it’s not “how many users,” it’s how many sockets
In the events block, worker_connections defines the maximum number of simultaneous connections per worker process. Roughly:
- Max client connections ≈ worker_processes * worker_connections
- But reverse proxying often doubles socket usage: one client socket + one upstream socket.
- Plus overhead: listening sockets, logs, internal pipes, resolver sockets, etc.
If you set worker_connections to 50k and keep FD limit at 65k, you’re doing math with the enthusiasm of a toddler and the precision of a fog machine.
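To keep the math honest, run the arithmetic before you commit a number; the figures below are illustrative placeholders, not recommendations:
cr0x@server:~$ workers=4; conns=8192; fds_per_conn=2; overhead=500
cr0x@server:~$ echo $(( conns * fds_per_conn + overhead )) "FDs needed per worker at full load"
16884 FDs needed per worker at full load
cr0x@server:~$ echo $(( workers * conns )) "theoretical max client connections on the box"
32768 theoretical max client connections on the box
Compare the per-worker figure against LimitNOFILE and worker_rlimit_nofile; if the requirement is bigger than the limit, one of them has to move.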
Keepalive and proxying: the sneaky FD multiplier
Keepalive is great—until it isn’t. With keepalive, clients and upstreams hold sockets open longer. This improves latency and reduces handshake costs, but it raises steady-state FD usage. Under bursty traffic, it can turn “we had a spike” into “we have 30 seconds of sustained descriptor pressure.”
Places to check:
- keepalive_timeout and keepalive_requests (client side)
- proxy_http_version 1.1 and proxy_set_header Connection "" (upstream keepalive correctness)
- keepalive in upstream blocks (connection pools; a minimal sketch follows this list)
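For the upstream side in particular, the directives have to appear together or the pool silently does nothing. A minimal sketch; the file path, upstream name, and addresses are placeholders:
cr0x@server:~$ cat /etc/nginx/conf.d/upstream-keepalive.conf
upstream app_backend {
    server 10.0.0.10:8080;               # placeholder backend
    keepalive 32;                        # idle connections kept per worker
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;          # upstream keepalive needs HTTP/1.1
        proxy_set_header Connection "";  # don't forward "Connection: close"
    }
}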
Joke #2: Keepalive is like leaving the meeting room “reserved” because you might have more meetings later—eventually nobody can work, but the calendar looks fantastic.
Reload safely
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ sudo systemctl reload nginx
cr0x@server:~$ sudo systemctl status nginx | sed -n '1,10p'
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/nginx.service.d
└─ override.conf
Active: active (running) since Tue 2025-12-30 02:20:11 UTC; 2min ago
Meaning: Reload applies config changes without dropping connections (mostly); the FD limit change itself needed the full restart you already did. Now you’re iterating on Nginx settings.
Kernel and system-wide limits: fs.file-max, fs.nr_open, ephemeral ports
Most “Too many open files” incidents on Nginx are per-process limits. But you should understand the system-wide levers because they’re the next failure mode when you scale.
System-wide file descriptor capacity
Key indicators:
- /proc/sys/fs/file-max: global maximum open files (may be huge on modern systems).
- /proc/sys/fs/file-nr: allocated, unused, maximum. Allocated rising fast can signal spikes or leaks across the system.
- fs.nr_open: maximum value for per-process limits.
If the global allocation is close to the maximum, raising Nginx’s per-process limit won’t help. You’ll just move the failure from Nginx into other services, and the kernel will start refusing allocations in more creative ways.
Ephemeral ports: the other “too many”
When Nginx is a reverse proxy, each upstream connection consumes a local ephemeral port. If you hammer a small upstream set with lots of short-lived connections, you can exhaust ephemeral ports or get stuck in TIME_WAIT patterns. Symptoms can resemble FD pressure: connection failures, upstream timeouts, increased 499/502/504.
Quick checks:
cr0x@server:~$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
cr0x@server:~$ ss -tan state time-wait | wc -l
311
Meaning: Default ephemeral range is ~28k ports. TIME_WAIT count tells you how many recently-closed connections linger. Not all TIME_WAIT is bad, but high counts with high churn can hurt.
Decision: If port pressure is real, you fix it with connection reuse (upstream keepalive), load distribution, and sometimes port range tuning—not by inflating FD limits alone.
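If you genuinely need to widen the range (after fixing connection reuse), do it persistently through sysctl.d rather than a one-off command. A sketch; the file name is just a convention, and keep the lower bound above any ports your own services listen on:
cr0x@server:~$ echo 'net.ipv4.ip_local_port_range = 15000 65000' | sudo tee /etc/sysctl.d/90-ephemeral-ports.conf
net.ipv4.ip_local_port_range = 15000 65000
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/90-ephemeral-ports.conf
net.ipv4.ip_local_port_range = 15000 65000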
Do you need to touch /etc/security/limits.conf?
For systemd services: usually no. That file is applied via PAM to user sessions (SSH, login, etc.). Nginx launched by systemd doesn’t read it. You can still set it for consistency, but don’t expect it to solve the outage.
Sanity check in an interactive shell:
cr0x@server:~$ ulimit -n
1024
Meaning: This is your shell’s limit, not Nginx’s. Useful only to confirm what you can do, not what the service can do.
Decision: Don’t “fix” production by pasting ulimit -n 65535 into a shell and feeling accomplished.
Capacity planning: how many FDs do you actually need?
Capacity planning for file descriptors is boring. That’s why it works.
Rule-of-thumb accounting
Start with this rough model:
- Per active client connection: ~1 FD (the client socket)
- Per proxied request: often +1 FD (upstream socket), sometimes more with retries/multiple upstreams
- Per worker baseline: a few dozen FDs (logs, eventfd, pipes, listening sockets shared, etc.)
If you serve static content only and use sendfile, you still open file handles, but they may be short-lived. If you enable file cache, they can be long-lived.
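If you do lean on open_file_cache, keep it deliberate: every cached entry is a descriptor you are holding on purpose. A minimal sketch of the relevant directives; the path and values are illustrative, not a recommendation:
cr0x@server:~$ cat /etc/nginx/conf.d/file-cache.conf
open_file_cache          max=2000 inactive=20s;   # cap cached descriptors, drop idle ones
open_file_cache_valid    60s;                     # revalidate cached entries periodically
open_file_cache_min_uses 2;                       # only cache files hit more than once
open_file_cache_errors   on;                      # cache open() errors too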
Compute a practical target
Example: you expect up to 20,000 concurrent client connections on a box, and proxy to upstreams with keepalive. Conservatively assume 2 FDs per client at peak (client + upstream): 40,000. Add overhead: say 2,000. Now divide across workers? Not exactly—connections are distributed, but unevenness happens. If you have 8 workers, worst-case one worker might temporarily hold more than average depending on accept behavior and scheduling.
This is why you give yourself margin. A LimitNOFILE of 65535 per process is a common sweet spot: large enough to avoid silly failures, small enough that leaks show up before the heat death of the universe.
Measure, don’t guess
After you set limits, keep measuring. For a live worker:
cr0x@server:~$ sudo bash -c 'for p in $(pgrep -u www-data nginx); do echo -n "$p "; ls /proc/$p/fd | wc -l; done'
2144 1842
2145 1760
2146 1791
Meaning: Per-worker FD counts under load. If you see these numbers hovering near your limit, you need either higher limits or fewer long-lived sockets (tuning keepalive/upstream behavior) or more capacity (more instances).
Decision: If counts are stable and comfortably below limits, stop tuning. Ship the fix and move on. Engineering is not a sport where you win by changing more knobs.
Three mini-stories from the corporate trenches
Mini-story 1: The incident caused by a wrong assumption
The team had migrated from an older Ubuntu release to Ubuntu 24.04 as part of a security baseline refresh. The Nginx config hadn’t changed. The traffic profile hadn’t changed. Yet within days, they got sporadic connection failures during predictable peaks.
The on-call engineer did what every Linux person has done at least once: SSH’d in, ran ulimit -n 65535, restarted Nginx by hand, saw the problem disappear, and went back to sleep. The next day it was back. So they “fixed it again.” It became a ritual.
The wrong assumption was subtle: they believed their shell’s ulimit applied to the service after restart. It didn’t. Nginx was being started by systemd during unattended restarts and host maintenance, always returning to LimitNOFILE=1024. The manual fix only worked when Nginx was launched from a shell that carried the raised limit.
The durable fix took ten minutes: a systemd drop-in with LimitNOFILE, then verifying via /proc/<pid>/limits. The real win, though, was social: they wrote down the verification step as part of the incident template. Future on-calls stopped treating “ulimit” as an incantation.
Mini-story 2: The optimization that backfired
A performance-minded engineer wanted to reduce upstream latency. They enabled aggressive keepalive settings on both client and upstream sides. The change looked brilliant in a synthetic test: fewer handshakes, lower p95, happier dashboards.
Then production did its thing. The service had many long-tail clients (mobile networks, enterprise proxies) that held connections open like they were paying rent. With keepalive cranked up, Nginx retained far more concurrent sockets than before. Descriptors stayed allocated longer, and the system drifted toward the per-process limit during peak hours.
Worse: the team had also enabled a larger open_file_cache for static assets on the same nodes, adding more long-lived file handles. The combined effect made the error show up in odd places—sometimes during a deploy (when logs rotated), sometimes during traffic spikes.
The fix wasn’t “disable keepalive,” because that would have returned the latency cost. The fix was adult supervision: raise LimitNOFILE properly, cap worker_connections to a defensible number, and tune keepalive timeouts to match real client behavior. They also separated roles: static-heavy nodes got different caching settings than proxy-heavy nodes. Optimizations can work; they just need a budget.
Mini-story 3: The boring but correct practice that saved the day
A different org had a simple runbook: every time they changed Nginx concurrency settings, they recorded three numbers in the change ticket: LimitNOFILE, worker_rlimit_nofile, and peak observed /proc/<pid>/fd count. Not glamorous. Not “innovative.” Effective.
One evening, an upstream dependency started timing out intermittently. Client connections piled up while Nginx waited on upstream responses. Concurrency rose. That’s normal under upstream slowness: queues form at the proxy layer.
But the proxy didn’t fall over. Why? Because their FD limits were set with margin, and they had monitoring that alerted at 70% of the FD limit per worker. The on-call saw the alert, recognized it as upstream-induced socket retention, and escalated to the dependency team while rate-limiting a noisy endpoint. No cascading failure. No “Nginx died too.”
The post-incident writeup was almost boring, which is the highest compliment you can pay an operations practice.
Common mistakes: symptom → root cause → fix
1) Symptom: You set ulimit to 65535 but Nginx still logs EMFILE
Root cause: Nginx is managed by systemd; shell limits don’t apply. systemd still has LimitNOFILE=1024.
Fix: Add a systemd drop-in override for nginx.service with LimitNOFILE=65535, restart, verify via /proc/<pid>/limits.
2) Symptom: systemctl show reports high LimitNOFILE, but /proc shows low
Root cause: You changed the unit but didn’t restart the service; or you’re inspecting the wrong PID; or Nginx is started by a different unit/wrapper.
Fix: systemctl daemon-reload, then systemctl restart nginx. Re-check master PID and read limits from that PID.
3) Symptom: Raising LimitNOFILE helped for a day, then errors returned
Root cause: A leak or steadily increasing concurrency due to upstream slowness or client behavior. You treated the symptom, not the trend.
Fix: Track FD usage over time per worker. Correlate with upstream latency and active connections. Fix keepalive, timeouts, upstream pooling, or the upstream dependency.
4) Symptom: Errors only during log rotation or reloads
Root cause: When near the FD limit, benign actions like reopening logs require spare descriptors. You have zero headroom.
Fix: Raise limits and ensure normal operating FD count stays below ~70–80% of limit at peak.
5) Symptom: Nginx accepts connections but upstream proxying fails
Root cause: You may have enough FDs for clients but not for upstream sockets, or ephemeral ports are constrained.
Fix: Increase FD limit and ensure upstream keepalive works. Check ephemeral port range and TIME_WAIT churn; address connection reuse.
6) Symptom: Only one worker hits the limit and melts down
Root cause: Uneven connection distribution, possibly due to accept mutex settings, CPU pinning, or traffic patterns.
Fix: Ensure worker_processes auto matches CPU, revisit accept settings, and verify load distribution. Often the real fix is simply more headroom + ensuring the upstream isn’t forcing long holds.
7) Symptom: You raised worker_connections massively and things got worse
Root cause: You increased the theoretical concurrency without ensuring memory, upstream capacity, and FD limits match. You invited a bigger stampede.
Fix: Size worker_connections to what the system can sustain. Use rate limiting, queueing, or scaling rather than infinite concurrency.
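If rate limiting is the fix you land on, Nginx’s own request limiting is usually the first tool. A minimal sketch; the zone name, rate, location, and upstream address are placeholders:
cr0x@server:~$ cat /etc/nginx/conf.d/ratelimit.conf
limit_req_zone $binary_remote_addr zone=perip:10m rate=50r/s;   # 10 MB of state keyed by client IP

server {
    listen 80;
    location /api/ {
        limit_req zone=perip burst=100 nodelay;   # absorb short bursts, reject sustained floods
        proxy_pass http://127.0.0.1:8080;         # placeholder upstream
    }
}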
Checklists / step-by-step plan
Step-by-step: production-safe fix path
- Capture evidence: tail Nginx error log and journal for EMFILE.
- Find PIDs: identify master and workers.
- Read live limits: /proc/<pid>/limits for master and a worker.
- Measure usage: count /proc/<pid>/fd for workers; sample a few times under load.
- Check systemd limit: systemctl show nginx -p LimitNOFILE.
- Implement systemd override: systemctl edit nginx → LimitNOFILE=65535.
- Restart (not reload): apply limits with restart.
- Verify again: systemd shows new limit, /proc shows new limit.
- Align Nginx config: set worker_rlimit_nofile, consider worker_connections based on expected concurrency.
- Reload Nginx config: nginx -t then systemctl reload nginx.
- Watch for regressions: monitor open FDs per worker; alert at 70–80% of limit.
- Investigate root causes: if FD usage climbs steadily, hunt leaks or upstream slowness, don’t just raise numbers again.
Checklist: what “good” looks like after the fix
- systemctl show nginx -p LimitNOFILE returns your target value.
- /proc/<master pid>/limits matches.
- Worker /proc/<pid>/fd counts have headroom at peak.
- Nginx error log no longer shows accept4() failed (24: Too many open files).
- Deploys/reloads and log rotations do not trigger connection failures.
Checklist: when raising limits is the wrong fix
- FD count rises monotonically over hours with constant traffic (leak pattern).
- Upstream is slow and connections pile up; you need timeouts, backpressure, or scaling.
- Connection failures correlate with high TIME_WAIT and small port range (port exhaustion).
- Memory pressure is already high; increasing concurrency will worsen latency and failures.
FAQ (real questions people ask at 02:00)
1) Why doesn’t /etc/security/limits.conf fix Nginx on Ubuntu 24.04?
Because Nginx runs as a systemd service. PAM limits apply to login sessions. systemd applies its own resource limits from unit files and defaults.
2) Do I need to set both LimitNOFILE and worker_rlimit_nofile?
Yes, in practice. LimitNOFILE is the OS-enforced ceiling for the service. worker_rlimit_nofile makes Nginx explicitly request/propagate a matching limit for workers. Align them to avoid “it should be fine” ambiguity.
3) What’s a reasonable LimitNOFILE for Nginx?
Common: 65535. High-traffic reverse proxies may use 131072. Go higher only with measured need and monitoring, because it can mask leaks and increase blast radius.
4) If I raise the limit, will Nginx automatically handle more connections?
Not automatically. You also need appropriate worker_connections, CPU, memory, upstream capacity, and sometimes kernel tuning. More concurrency without capacity is just a larger queue for failure.
5) I raised the limits but still see “Too many open files” occasionally. Why?
Either you didn’t apply the limit to the running process (verify /proc), or you’re hitting the new limit legitimately under spikes, or another component (upstream, resolver, logging) is consuming descriptors unexpectedly. Measure per-worker FD counts and correlate with traffic and latency.
6) Can “Too many open files” be caused by a bug or leak?
Yes. If FD usage grows steadily with no traffic growth, suspect leaks or a misconfiguration that keeps descriptors around (aggressive keepalive, caching, stalled upstream connections).
7) Is it safe to change LimitNOFILE without downtime?
Changing the limit requires restarting the service, which can cause brief disruption unless you have multiple instances behind a load balancer. Plan a rolling restart where possible.
8) How do I alert on this before it becomes an outage?
Alert on FD usage as a percentage of the limit per worker (e.g., 70% warning, 85% critical). You can compute it from /proc/<pid>/fd counts and /proc/<pid>/limits.
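A minimal sketch of that computation, runnable as root and easy to wire into whatever agent you already use; the script path and the www-data user are assumptions carried over from this article’s examples:
cr0x@server:~$ cat /usr/local/bin/nginx-fd-headroom.sh
#!/usr/bin/env bash
# Report per-worker FD usage as a percentage of the soft limit; warn above 70%.
set -euo pipefail
for pid in $(pgrep -u www-data nginx); do
    limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)   # soft limit
    used=$(ls /proc/"$pid"/fd | wc -l)                               # descriptors in use
    pct=$(( used * 100 / limit ))
    printf 'pid=%s used=%s limit=%s pct=%s%%\n' "$pid" "$used" "$limit" "$pct"
    [ "$pct" -ge 70 ] && echo "WARNING: pid $pid above 70% of FD limit" >&2 || true
done
cr0x@server:~$ sudo /usr/local/bin/nginx-fd-headroom.sh
pid=2144 used=1842 limit=65535 pct=2%
pid=2145 used=1760 limit=65535 pct=2%
pid=2146 used=1791 limit=65535 pct=2%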
9) What if the system-wide file-max is the bottleneck?
Then you’re in “whole host is running out of descriptors” territory. Identify the biggest consumers across processes. Raising Nginx limits won’t help if the kernel can’t allocate more. This is rare on modern Ubuntu defaults but possible on dense multi-tenant nodes.
10) Does HTTP/2 reduce FD usage because multiplexing exists?
Sometimes, but don’t count on it. HTTP/2 can reduce the number of client TCP connections, but upstream behavior, keepalive pools, and sidecars can still dominate FD usage.
Conclusion: next steps that stick
When Nginx on Ubuntu 24.04 hits “Too many open files,” the right fix is not a heroic one-liner. It’s a verified chain: systemd grants the limit, Nginx is configured to use it safely, and the kernel has enough global capacity.
Do these next, in this order:
- Verify the live limit via /proc/<pid>/limits. If it’s 1024, stop everything and fix systemd LimitNOFILE.
- Set a systemd drop-in override for nginx.service and restart. Verify again.
- Align Nginx with worker_rlimit_nofile and set worker_connections based on measured concurrency, not vibes.
- Measure per-worker FD counts under peak. Add alerting on headroom. If FD usage creeps upward over time, investigate leaks or upstream slowness.
After that, you get to enjoy the rare luxury of a web server that fails for interesting reasons, not because it ran out of numbered handles.