“Too many open files” is one of those messages that looks like a simple knob-turn and turns into an afternoon of misdirection. You run ulimit -n 1048576, a quick ulimit -n echoes back the big number you asked for, you restart the service, and it still faceplants under load. Your terminal says “fixed.” Production says “nice try.”
On Debian 13, the correct fix is usually not more shell magic. It’s systemd. Specifically: the right systemd layer, the right unit, and a verification step that doesn’t lie to you. Let’s do it properly, the way you’d want it done at 02:00 with customers watching.
What “Too many open files” really means (and what it doesn’t)
On Linux, “Too many open files” is typically one of two errors:
- EMFILE: the process hit its per-process file descriptor limit (RLIMIT_NOFILE). This is the one you fix with systemd LimitNOFILE= (or equivalent). It’s personal. One process is over budget.
- ENFILE: the system hit a global file table limit. This is rarer on modern kernels, but it happens in pathological cases. It’s communal. The whole host is over budget.
Also: “open files” is a sloppy phrase. It means file descriptors. Those descriptors can represent regular files, directories, sockets, pipes, eventfds, signalfds, inotify watches, epoll instances, and a handful of other things that only make sense after a long week.
Two consequences:
- You can run out of FDs without a single log file being “open.” High-connection services die this way all the time.
- Raising limits blindly can hide leaks, mask poor connection handling, and move the failure to a different subsystem (like memory, kernel bookkeeping, or your storage backend).
One quote that still holds up in ops, paraphrased because I’m not gambling on perfect wording:
— “Hope is not a strategy.” (often attributed in engineering circles to practitioners like Gene Kranz, and repeated endlessly in reliability work). In this context: don’t hope your ulimit did anything. Prove it, then fix it at the layer that actually starts the service.
Joke #1: “Too many open files” is Linux’s way of saying your process has commitment issues. It keeps opening things and refuses to close them.
Fast diagnosis playbook
This is the “stop scrolling and do these three checks” section. It’s ordered the way I do it when the pager is hot.
First: identify whether it’s per-process (EMFILE) or system-wide (ENFILE)
- Look for the exact error in logs (application logs, journal). EMFILE/ENFILE is decisive.
- If you can’t get the exact errno, check per-process FD usage and RLIMIT quickly (see tasks below).
Second: verify the running service’s actual RLIMIT_NOFILE (not your shell’s)
- Grab the PID of the service main process.
- Read /proc/<pid>/limits. That file is reality. (A minimal combined check follows this list.)
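A minimal combined check, assuming the unit is nginx.service (substitute your own); the output is illustrative:
cr0x@server:~$ pid=$(systemctl show -p MainPID --value nginx.service)
cr0x@server:~$ sudo awk '/Max open files/ {print}' /proc/"$pid"/limits
Max open files 8192 8192 files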
Third: find what set the limit (systemd unit, drop-in, defaults, PAM)
- If the process is started by systemd, the unit (and systemd defaults) win.
- If it’s a user session service, the user manager has its own defaults.
- PAM limits often affect interactive logins and some services, but not typical system services.
If you do only one thing today: stop trusting ulimit output from your login shell as evidence that your daemon has that limit. That’s how you end up “fixing” things three times.
The limits stack on Debian 13: who wins and why
Linux has resource limits (rlimits). systemd can set them for services. PAM can set them for login sessions. Shells can set them for child processes. Containers can add their own caps. And the kernel has system-wide maxima.
Here’s the practical hierarchy most people trip over:
- Kernel maximums and system-wide tables (e.g., fs.file-max). These are hard boundaries and shared resources.
- Per-process rlimits (RLIMIT_NOFILE). Each process gets a “soft” and a “hard” limit. The soft limit is what you hit. The hard limit is the ceiling you can raise the soft limit up to without privilege.
- systemd service manager settings set rlimits at exec time, and they apply to the service process tree.
- User sessions (PAM + user manager) can set rlimits for interactive shells and user services.
- Shell ulimit is just the shell’s view and what it passes to its children. It does nothing for already-running daemons. (A quick demonstration follows this list.)
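You can see this for yourself with a throwaway transient unit. A minimal sketch; the first result depends on your DefaultLimitNOFILE, so the numbers are illustrative:
cr0x@server:~$ ulimit -n 100000
cr0x@server:~$ sudo systemd-run --wait --pipe bash -c 'ulimit -n'
1024
cr0x@server:~$ sudo systemd-run --wait --pipe -p LimitNOFILE=4096 bash -c 'ulimit -n'
4096
The shell’s 100000 never travels with you; the unit property does.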
The key Debian 13 reality: most server daemons are started by systemd. That means the unit file and systemd defaults matter more than anything you type in a shell.
Also, systemd is not just a glorified init. It’s a process supervisor that can set limits, isolate resources, and control file descriptors in ways that bypass your assumptions. If your mental model is “/etc/security/limits.conf controls everything,” you’ll lose time.
The correct systemd fix: LimitNOFILE, defaults, and drop-ins
There are three sane ways to raise FD limits for a systemd service on Debian 13. Pick based on blast radius.
Option A (recommended): set LimitNOFILE in a drop-in for the specific service
This is the surgical fix. It changes only one unit. It survives package upgrades. It’s reviewable.
Create a drop-in interactively (this opens an editor on an override file for the unit):
cr0x@server:~$ sudo systemctl edit nginx.service
Or, for a non-interactive, reviewable change, write the drop-in file directly:
cr0x@server:~$ sudo tee /etc/systemd/system/nginx.service.d/limits.conf >/dev/null <<'EOF'
[Service]
LimitNOFILE=262144
EOF
Reload and restart:
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/nginx.service.d
└─limits.conf
Active: active (running)
If you don’t see the Drop-In listed, your drop-in is misplaced, or you forgot daemon-reload.
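Another cross-check that doesn’t depend on eyeballing status output; shown for nginx.service, output illustrative:
cr0x@server:~$ systemctl show -p DropInPaths nginx.service
DropInPaths=/etc/systemd/system/nginx.service.d/limits.conf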
Option B: change system-wide defaults for all system services (big hammer)
You can set default limits in /etc/systemd/system.conf and/or /etc/systemd/user.conf:
cr0x@server:~$ sudo sed -n '1,120p' /etc/systemd/system.conf
# This file is part of systemd.
#DefaultLimitNOFILE=1024:524288
Uncomment and set, for example:
cr0x@server:~$ sudo perl -0777 -pe 's/^#?DefaultLimitNOFILE=.*/DefaultLimitNOFILE=65536:1048576/m' -i /etc/systemd/system.conf
Then:
cr0x@server:~$ sudo systemctl daemon-reexec
Opinionated warning: Option B is how you accidentally “fix” one service by giving every service permission to hurt the host in new ways. Use it only when you have a fleet policy and a reason.
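If you do go down this road, verify what the manager itself now believes before restarting anything. A minimal check, output illustrative; remember the new defaults only affect services (re)started afterwards:
cr0x@server:~$ systemctl show -p DefaultLimitNOFILE -p DefaultLimitNOFILESoft
DefaultLimitNOFILE=1048576
DefaultLimitNOFILESoft=65536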
Option C: configure the application’s own limits (only if it actually applies)
Some daemons have their own limit knobs (for example via worker_rlimit_nofile in nginx) that must be aligned with RLIMIT_NOFILE. These don’t replace systemd’s limits; they layer on top. If systemd caps at 8192, your app config can ask for 100k all day and still get 8192.
So the correct order is: set systemd LimitNOFILE first, then tune application-level limits to use it.
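For nginx specifically, a minimal sketch of that alignment; the numbers are illustrative, not recommendations, and worker_rlimit_nofile should stay at or below the LimitNOFILE you set in the unit:
# /etc/nginx/nginx.conf (excerpt; illustrative values)
worker_processes auto;
# per-worker FD ceiling; keep at or below the unit's LimitNOFILE
worker_rlimit_nofile 262144;

events {
    # each plain connection needs one FD; proxied connections need two
    worker_connections 65536;
}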
What about /etc/security/limits.conf?
It’s not useless, it’s just frequently irrelevant to system services. /etc/security/limits.conf is applied by PAM in login contexts. If you start a daemon via systemd at boot, PAM isn’t part of that story. Debian 13 isn’t special here; it’s just that modern Linux is unapologetically multi-layered.
Joke #2: ulimit is like yelling your desired salary into the hallway. It feels good, but payroll still does whatever it’s configured to do.
Verification that doesn’t lie: /proc, systemctl, and live processes
If you take one operational habit from this piece, make it this: verify limits in the running process, not in the shell that happens to be on your screen.
Three reliable angles:
- /proc/<pid>/limits shows the current soft/hard limits for that process.
- systemctl show can show unit properties and effective settings (but still verify in /proc).
- Counting open FDs via /proc/<pid>/fd tells you whether you’re approaching the cap, leaking, or spiking during load.
And remember: services often have multiple processes (master + workers). The one that fails might not be the one you looked at. Verify the actual process emitting EMFILE.
Practical tasks (commands + what the output means + the decision you make)
These are field tasks. Run them on Debian 13. Each one includes: command, example output, interpretation, and a decision.
Task 1: Confirm the error code in the journal
cr0x@server:~$ sudo journalctl -u nginx.service -n 50 --no-pager
Dec 28 10:12:01 server nginx[1423]: accept4() failed (24: Too many open files)
Dec 28 10:12:01 server nginx[1423]: worker_connections are not enough
Meaning: errno 24 is EMFILE. That’s per-process RLIMIT_NOFILE, not global kernel tables.
Decision: Focus on LimitNOFILE for the nginx processes and nginx’s own worker settings. Don’t touch fs.file-max yet.
Task 2: Find the service MainPID
cr0x@server:~$ systemctl show -p MainPID --value nginx.service
1423
Meaning: PID 1423 is the main process systemd tracks.
Decision: Use this PID for initial verification, but check worker PIDs too.
Task 3: Read the real RLIMIT_NOFILE from /proc
cr0x@server:~$ sudo awk '/Max open files/ {print}' /proc/1423/limits
Max open files 8192 8192 files
Meaning: Soft=8192, Hard=8192. You will hit EMFILE at ~8192 descriptors in that process.
Decision: Raise it with systemd (drop-in). A shell ulimit won’t change this running process.
Task 4: Confirm what systemd thinks the unit limit is
cr0x@server:~$ systemctl show nginx.service -p LimitNOFILE
LimitNOFILE=8192
Meaning: systemd is currently applying 8192 to the service.
Decision: Add a drop-in with a higher LimitNOFILE, then restart the service.
Task 5: Create and verify a drop-in override
cr0x@server:~$ sudo systemctl edit nginx.service
cr0x@server:~$ sudo systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
# /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=262144
Meaning: The drop-in is loaded and overrides the unit.
Decision: Reload and restart. If the drop-in doesn’t appear in systemctl cat, it’s not in effect.
Task 6: Restart and re-check /proc limits
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl show -p MainPID --value nginx.service
1777
cr0x@server:~$ sudo awk '/Max open files/ {print}' /proc/1777/limits
Max open files 262144 262144 files
Meaning: The running process now has the raised limit. That’s the win condition.
Decision: Move to usage/pressure: are you still leaking FDs, or was the cap just too low?
Task 7: Count current open FDs for a process
cr0x@server:~$ sudo ls /proc/1777/fd | wc -l
412
Meaning: 412 descriptors open right now. This is a snapshot.
Decision: If you see this climbing steadily without dropping, suspect a leak. If it spikes with traffic, suspect peak concurrency and tune accordingly.
Task 8: Identify which descriptors are dominating
cr0x@server:~$ sudo ls -l /proc/1777/fd | head -n 12
lrwx------ 1 www-data www-data 64 Dec 28 10:20 0 -> /dev/null
lrwx------ 1 www-data www-data 64 Dec 28 10:20 1 -> /var/log/nginx/access.log
lrwx------ 1 www-data www-data 64 Dec 28 10:20 2 -> /var/log/nginx/error.log
lrwx------ 1 www-data www-data 64 Dec 28 10:20 3 -> socket:[23451]
lrwx------ 1 www-data www-data 64 Dec 28 10:20 4 -> socket:[23452]
Meaning: Many sockets implies connection load. Lots of regular files might imply log/file handling, cache, or leaks in file IO.
Decision: If sockets dominate, tune connection handling (backlog, keepalive, worker counts). If files dominate, hunt file leaks or rotate behavior.
Task 9: Check per-user limits for interactive sessions (PAM)
cr0x@server:~$ ulimit -Sn
1024
cr0x@server:~$ ulimit -Hn
1048576
Meaning: Your login shell has soft 1024, hard 1048576. That doesn’t automatically apply to systemd services.
Decision: If developers run workloads manually (not via systemd), adjust PAM limits. Otherwise, focus on unit-level limits.
Task 10: Check system-wide file descriptor pressure
cr0x@server:~$ cat /proc/sys/fs/file-nr
2976 0 9223372036854775807
Meaning: First number is allocated file handles; third is the max. On many systems the max is effectively huge (or “infinite-ish”).
Decision: If allocated is close to max (on systems with a real cap), you have a host-level issue. Otherwise, it’s likely per-process.
Task 11: Inspect kernel max open files per process (fs.nr_open)
cr0x@server:~$ cat /proc/sys/fs/nr_open
1048576
Meaning: This is the kernel’s ceiling for RLIMIT_NOFILE. You can’t set per-process limits above it.
Decision: If you truly need > 1,048,576 FDs per process (rare, but not impossible for proxy boxes), you must raise fs.nr_open via sysctl and ensure memory headroom.
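If you genuinely need to raise it, a minimal sketch using a sysctl drop-in; the value is illustrative, and validate memory headroom before copying it:
cr0x@server:~$ echo 'fs.nr_open = 2097152' | sudo tee /etc/sysctl.d/90-nr-open.conf
fs.nr_open = 2097152
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/90-nr-open.conf
fs.nr_open = 2097152
cr0x@server:~$ cat /proc/sys/fs/nr_open
2097152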
Task 12: Verify the unit’s effective limits after restart
cr0x@server:~$ systemctl show nginx.service -p LimitNOFILE
LimitNOFILE=262144
Meaning: systemd believes it’s applying the new limit.
Decision: If systemctl show says the new value but /proc/<pid>/limits doesn’t, you restarted the wrong thing, you’re looking at the wrong PID, or the service uses a wrapper that execs elsewhere. Follow the PID chain.
Task 13: Find all PIDs in the service cgroup and check one worker
cr0x@server:~$ systemctl show -p ControlGroup --value nginx.service
/system.slice/nginx.service
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/nginx.service/cgroup.procs | head
1777
1780
1781
1782
cr0x@server:~$ sudo awk '/Max open files/ {print}' /proc/1780/limits
Max open files 262144 262144 files
Meaning: Workers inherited the same limit. Good.
Decision: If workers differ, you may have mixed launch paths or exec wrappers. Fix the service start path.
Task 14: Catch FD growth over time (poor-man’s leak check)
cr0x@server:~$ for i in 1 2 3 4 5; do date; sudo ls /proc/1777/fd | wc -l; sleep 5; done
Sun Dec 28 10:24:01 UTC 2025
412
Sun Dec 28 10:24:06 UTC 2025
418
Sun Dec 28 10:24:11 UTC 2025
430
Sun Dec 28 10:24:16 UTC 2025
446
Sun Dec 28 10:24:21 UTC 2025
462
Meaning: The count is trending upward. Could be legitimate load increasing or a leak.
Decision: If traffic is stable but FDs grow monotonically, investigate leaks (upstream sockets stuck, files not closed, buggy libraries). Raising limits is only buying time.
Task 15: Confirm that the service isn’t constrained by a wrapper like su/sudo or a cron job
cr0x@server:~$ ps -o pid,ppid,cmd -p 1777
PID PPID CMD
1777 1 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
Meaning: PPID 1 indicates systemd (or init) is the parent. This is a system service, so systemd limits apply.
Decision: If PPID is a shell, cron, or a wrapper, fix the launch method or ensure that wrapper sets rlimits correctly (usually: stop doing that, use a unit).
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
A mid-sized SaaS company ran a Debian-based API tier behind a reverse proxy. During a marketing event, requests started failing with sporadic 502s. The on-call engineer did the usual: SSH in, ran ulimit -n, saw a large number (they had tuned it months ago), and concluded “not an FD limit.” They hunted CPU and network graphs instead.
The outage dragged on because the symptom was inconsistent. Only some nodes failed. Only during peak. And the logs were noisy: connection resets, upstream timeouts, the usual chaos soup. The key line—accept4() failed (24: Too many open files)—was present, but drowned in other messages. When they finally focused on it, they checked /proc/<pid>/limits for the proxy workers and found a tiny RLIMIT compared to the shell.
The wrong assumption was simple: “If my shell has 200k, the service does too.” That assumption was true years ago in some setups where daemons were launched by scripts in login-like environments. Under systemd, it’s mostly false.
The fix was a unit drop-in: LimitNOFILE=131072, followed by a restart in waves. The incident postmortem wasn’t about the FD number; it was about verification discipline. Their new rule was brutally practical: no incident is closed until someone pastes /proc/<pid>/limits into the ticket.
Mini-story 2: the optimization that backfired
A financial services team ran a message gateway with aggressive connection reuse. Someone noticed that raising the FD limit “fixed” the occasional EMFILE and decided to go bigger. They bumped LimitNOFILE from tens of thousands to nearly the kernel max, rolled it out, and declared victory.
Within days, the host started behaving oddly under bursts. Not EMFILE—now it was memory pressure and sporadic latency spikes. It turned out that with the new headroom, the gateway happily kept far more sockets alive during upstream slowness. That increased kernel memory usage for socket buffers and bookkeeping. Their actual bottleneck wasn’t “not enough FDs,” it was “we don’t shed load when upstream is sad.”
The optimization backfired because it removed a safety fuse. The FD cap had been forcing a kind of accidental backpressure. Removing it without adding explicit backpressure moved the failure mode somewhere uglier: tail latency and cascading retries.
They ended up settling on a moderate FD limit, plus strict connection pool bounds and timeouts. The lesson was not “don’t raise limits.” It was “raise limits only after you’ve made your concurrency behavior explicit.” Limits are guardrails. If you remove them, install better guardrails first.
Mini-story 3: the boring but correct practice that saved the day
A large internal platform team ran Debian systems with a mix of system services and user-level services. They had a habit some folks called “paranoid”: every performance-related change required a verification snippet in the change request. For FD tuning, that snippet always included (1) systemd unit properties, (2) /proc limits for main PID and one worker, and (3) a before/after FD count during a load test.
One afternoon, a routine deployment introduced a subtle FD leak in a gRPC client library. It wasn’t catastrophic immediately; it took hours. Their monitoring caught the trend: open FDs rising linearly. Because their rollout checklist included “FD trend check,” they noticed it early on the canary.
They reverted before the leak hit the hard cap. No outage. No incident bridge. Just a mildly annoyed engineer and a clean graph.
Nothing heroic happened. That’s the point. Boring practices save more systems than clever ones. Verification beats vibes.
Common mistakes: symptoms → root cause → fix
1) “I raised ulimit, but the service still errors with Too many open files”
Symptom: ulimit -n in your shell looks high; daemon logs show EMFILE.
Root cause: The daemon is started by systemd with a lower LimitNOFILE (or default).
Fix: Add a systemd drop-in with LimitNOFILE=, reload, restart, verify via /proc/<pid>/limits.
2) “systemctl show says LimitNOFILE is high, but /proc still shows low”
Symptom: systemctl show -p LimitNOFILE prints your new value; the running PID is unchanged and still capped.
Root cause: You didn’t restart the service, or you’re checking the wrong process (worker vs master), or the service forks/execs into a different PID you didn’t inspect.
Fix: Restart the unit, get new MainPID, inspect cgroup PIDs, check /proc/<pid>/limits for the failing process.
3) “After raising limits, the host gets unstable under load”
Symptom: No more EMFILE, but latency spikes, memory pressure, or network backlog grows.
Root cause: Higher FD limits allowed higher concurrency without backpressure. Socket buffers and kernel memory usage climb.
Fix: Add explicit concurrency controls (connection pools, worker caps, timeouts). Set FD limits to match design, not ego.
4) “Only user-run processes fail; system services are fine”
Symptom: Interactive tools (batch jobs, dev scripts) hit EMFILE; system daemons don’t.
Root cause: PAM limits and user sessions have low defaults; systemd system units are tuned separately.
Fix: Adjust PAM limits (/etc/security/limits.conf or drop-ins in /etc/security/limits.d) and/or user manager defaults in /etc/systemd/user.conf. Verify in a fresh session.
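A minimal sketch of a PAM limits drop-in; the values are illustrative, and it applies to new login sessions, not to already-running shells or to system services:
# /etc/security/limits.d/90-nofile.conf (illustrative values)
# <domain>  <type>  <item>   <value>
*           soft    nofile   65536
*           hard    nofile   262144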
5) “We set LimitNOFILE to 2 million and systemd refused”
Symptom: Service fails to start, or limit clamps to a lower number.
Root cause: Kernel fs.nr_open is lower. RLIMIT_NOFILE cannot exceed it.
Fix: Raise fs.nr_open via sysctl (with caution), then set a sane limit. Also validate memory headroom.
6) “It fails only after log rotation”
Symptom: After rotation, FD count grows; eventually EMFILE appears.
Root cause: The application keeps old log file descriptors open (or a helper process does). Not a classic “leak,” but it looks like one.
Fix: Ensure correct log reopen signaling (e.g., HUP where appropriate), or use stdout/stderr to journald where feasible. Verify FD targets in /proc/<pid>/fd.
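Where a daemon reopens logs on a signal, the rotation hook is the usual place to send it. A minimal logrotate sketch, assuming nginx with its Debian-default PID file at /run/nginx.pid; check your package’s shipped config before copying this:
# /etc/logrotate.d/nginx (excerpt; illustrative)
/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    postrotate
        # nginx reopens its log files on USR1 sent to the master process
        if [ -f /run/nginx.pid ]; then kill -USR1 "$(cat /run/nginx.pid)"; fi
    endscript
}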
Checklists / step-by-step plan
Checklist A: fix a single systemd service cleanly
- Confirm EMFILE vs ENFILE in logs/journal.
- Get the unit’s MainPID: systemctl show -p MainPID --value your.service.
- Read /proc/<pid>/limits, record the current “Max open files”.
- Create a drop-in: systemctl edit your.service and set LimitNOFILE=.
- systemctl daemon-reload, then restart the service.
- Re-check /proc/<pid>/limits on the new PID.
- Count FDs and identify what they are (/proc/<pid>/fd).
- If FD usage trends upward at steady load, treat it as a leak until proven otherwise. (A compact verification helper is sketched after this checklist.)
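A compact verification helper, as promised above: a minimal bash sketch (the script name and path are hypothetical) that prints soft/hard RLIMIT_NOFILE and the current FD count for every PID in a unit’s cgroup. It assumes cgroup v2, which is the Debian 13 default.
#!/usr/bin/env bash
# /usr/local/bin/fd-check — hypothetical helper: fd-check <unit>
# Prints soft/hard RLIMIT_NOFILE and current open-FD count for every PID in the unit's cgroup.
set -euo pipefail
unit="${1:?usage: fd-check <unit>}"
cg="$(systemctl show -p ControlGroup --value "$unit")"
echo "unit: $unit   cgroup: $cg"
while read -r pid; do
    [ -r "/proc/$pid/limits" ] || continue
    limits="$(awk '/Max open files/ {print $4 " / " $5}' "/proc/$pid/limits")"
    fds="$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
    printf 'pid %-7s soft/hard: %-20s open fds: %s\n' "$pid" "$limits" "$fds"
done < "/sys/fs/cgroup${cg}/cgroup.procs"
Run it as root so /proc/<pid>/fd is readable; output illustrative:
cr0x@server:~$ sudo fd-check nginx.service
unit: nginx.service   cgroup: /system.slice/nginx.service
pid 1777    soft/hard: 262144 / 262144        open fds: 412
pid 1780    soft/hard: 262144 / 262144        open fds: 397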
Checklist B: decide the right number (instead of “make it huge”)
- Estimate peak concurrency: connections, files, pipes, watchers. (A worked sizing sketch follows this checklist.)
- Add headroom, but keep a safety margin for runaway behavior (don’t set “infinite”).
- Ensure kernel ceilings (fs.nr_open) are above your target.
- Consider memory impact of many sockets and buffers. FD numbers aren’t free.
- Load test and watch: FD count, memory, latency, error rate.
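A back-of-the-envelope version of that estimate, as referenced in the first item; every number below is an assumption for illustration, not a recommendation:
cr0x@server:~$ peak_conns=20000      # expected peak client connections
cr0x@server:~$ fds_per_conn=2        # proxying: client socket + upstream socket
cr0x@server:~$ base=200              # listeners, logs, epoll/inotify, misc
cr0x@server:~$ echo $(( (peak_conns * fds_per_conn + base) * 2 ))   # 2x headroom
80400
cr0x@server:~$ # round up to a tidy value and set LimitNOFILE=131072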
Checklist C: fleet policy (if you must touch defaults)
- Document why you need a default change and which services benefit.
- Set DefaultLimitNOFILE conservatively. Use per-service overrides for outliers.
- Roll out gradually and monitor for new failure modes (memory, latency).
- Standardize verification: unit property + /proc limits + FD count trend.
Interesting facts and historical context
- Fact 1: The original Unix idea of “everything is a file” made file descriptors the universal handle—great for composability, occasionally great for outages.
- Fact 2: Early Unix systems had tiny default FD limits (often 64 or 256). Many defaults stayed conservative for decades to protect shared machines.
- Fact 3: RLIMIT_NOFILE has “soft” and “hard” values; the soft limit is what triggers EMFILE most of the time.
- Fact 4: systemd didn’t just replace init scripts; it standardized how services inherit limits and cgroup membership, making behavior more predictable—once you learn where to look.
- Fact 5: /proc/<pid>/limits is one of the simplest truth sources on Linux. It’s also criminally underused in incident response.
- Fact 6: “Too many open files” is often about sockets, not files. High-QPS network daemons can hit it without touching disk.
- Fact 7: Raising FD limits can increase kernel memory usage materially because sockets and epoll/watch structures have per-object overhead.
- Fact 8: Some services have internal caps (worker limits, connection pools) that must be aligned with OS limits or you get a “fixed the OS, still broken” loop.
FAQ
1) Why doesn’t ulimit -n fix my systemd service?
Because your systemd service isn’t a child of your interactive shell. systemd sets rlimits at service start based on the unit and systemd defaults. Your shell’s ulimit affects only processes spawned from that shell.
2) Is /etc/security/limits.conf ignored on Debian 13?
No. It applies where PAM applies: login sessions (ssh, console, some su/sudo paths depending on configuration). Many system services do not go through PAM, so they won’t inherit those limits.
3) Should I set DefaultLimitNOFILE globally?
Usually no. Set per-service LimitNOFILE via drop-ins. Global defaults are reasonable only when you’re enforcing a fleet baseline and you understand the side effects.
4) What number should I choose for LimitNOFILE?
Choose based on measured peak FD usage plus headroom, not superstition. For busy proxies, tens to hundreds of thousands can be normal. For many apps, 8192–65536 is plenty. Load test and watch /proc/<pid>/fd counts.
5) Can I raise LimitNOFILE without restarting the service?
In practice, plan on a restart. systemd applies the unit’s LimitNOFILE only when it (re)starts the process, and many daemons size internal tables once at startup anyway. As a stopgap you can adjust a live process with prlimit(1) (sketched below), but treat that as triage; the unit value is what wins on the next start.
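A minimal stopgap sketch with prlimit from util-linux; the PID and values are illustrative, and the unit’s LimitNOFILE still applies on the next restart:
cr0x@server:~$ sudo prlimit --pid 2345 --nofile
RESOURCE DESCRIPTION              SOFT HARD UNITS
NOFILE   max number of open files 8192 8192 files
cr0x@server:~$ sudo prlimit --pid 2345 --nofile=262144:262144
cr0x@server:~$ sudo awk '/Max open files/ {print}' /proc/2345/limits
Max open files 262144 262144 files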
6) I raised LimitNOFILE, but errors persist. Now what?
Verify you changed the failing process (check workers). Then inspect FD usage and types. If FDs grow steadily, you likely have a leak. If it spikes with traffic, you may need better concurrency control or to tune internal application limits.
7) What’s the difference between EMFILE and ENFILE again?
EMFILE is per-process: that process hit RLIMIT_NOFILE. ENFILE is system-wide: the host’s file table is exhausted. Most “Too many open files” incidents in services are EMFILE.
8) Do containers change this story?
Yes, but the principle stays: the process has a real RLIMIT. Container runtimes can set it; systemd inside a container can set it; the host can constrain it. Still verify with /proc/<pid>/limits inside the relevant namespace/environment.
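For example, with Docker the runtime-level knob looks roughly like this; a sketch with a placeholder image name and illustrative numbers, and the same verification habit applies inside the container:
cr0x@server:~$ docker run --rm --ulimit nofile=262144:262144 some-image:latest sh -c 'ulimit -n'
262144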
9) Can a high FD limit be a security risk?
It can be an availability risk. If a compromised or buggy service can open millions of FDs, it can exhaust kernel resources and degrade the host. Limits are part of blast-radius control.
Conclusion: next steps you can do today
Fixing “Too many open files” on Debian 13 is not about typing ulimit louder. It’s about setting the limit where the service is born: systemd. Use a per-service drop-in with LimitNOFILE. Reload. Restart. Verify in /proc/<pid>/limits. Then measure FD usage so you know whether you solved a capacity issue or merely postponed a leak.
Practical next steps:
- Pick one service that has ever thrown EMFILE, and add a drop-in with a sane LimitNOFILE.
- Build a tiny runbook snippet: MainPID → /proc limits → FD count.
- During your next load test, record peak FD usage and set limits based on that data, not folklore.
If you do this once, you’ll stop treating file descriptors like mysterious gremlins and start treating them like what they are: a budget. And budgets, sadly, always matter.