It’s always the same vibe: everything is green until the traffic spike, then your app starts throwing EMFILE, your logs fill with “too many open files,” and the incident channel becomes a group therapy session.
When this happens in Docker, people often “fix” it by bumping a limit somewhere random. Sometimes that works. More often it doesn’t—or it works until the next deploy, reboot, or host rotation. Let’s do it properly: find which limit you’re actually hitting, raise it at the right layer (systemd, Docker daemon, container), and make sure you aren’t just hiding a leak.
What “too many open files” really means in containers
On Linux, “too many open files” usually maps to EMFILE (per-process file descriptor limit reached) or ENFILE (system-wide file table exhaustion). Most container incidents are EMFILE—one process (or a handful) hits its file descriptor cap and collapses in dramatic ways: failing to accept connections, failing to open log files, failing to resolve DNS, failing to talk to upstreams, failing to create sockets. In other words: failing to do its job.
In Docker, file descriptors (FDs) aren’t a container abstraction. They’re a kernel resource. Containers share the same kernel, and they ultimately share the same global file table. But each process still has a per-process limit (RLIMIT_NOFILE) that can be different per service, per container, per process, depending on what spawned it and what limits were applied.
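If you want to see the soft and hard values for a live process without attaching a shell to it, prlimit from util-linux reads them straight from the host. The PID and numbers below are illustrative; substitute a real one:
cr0x@server:~$ prlimit --pid 2143 --nofile
RESOURCE DESCRIPTION              SOFT    HARD UNITS
NOFILE   max number of open files 1024 1048576 files
The soft value is what the process is actually living with; the hard value is how far it (or its entrypoint) could raise itself.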
So when a container says “too many open files,” you’re debugging a stack of limits and settings, including:
- Kernel-wide file table and per-process maximums
- User session limits (PAM, /etc/security/limits.conf)—sometimes relevant, often misunderstood
- systemd unit limits for dockerd (and sometimes for your container runtime, if separate)
- Docker’s own ulimit settings passed into containers
- Your app’s behavior: connection pools, keep-alives, file watchers, leaks, logging patterns
And yes, you can “just raise the limit.” But if you raise it blindly, you’re giving a leaky process a larger bucket and calling it reliability. That’s not engineering; that’s a postponement strategy.
One quote to keep you honest: “Hope is not a strategy.” — General Gordon R. Sullivan
Fast diagnosis playbook (check first/second/third)
When the pager goes off and you need to stop the bleeding, you don’t start by editing five configs and rebooting the host. You start by identifying which limit you hit and where.
First: is it per-process (EMFILE) or system-wide (ENFILE)?
- If only one service is failing and the host looks otherwise stable, suspect EMFILE.
- If many unrelated services start failing to open sockets/files at once, suspect ENFILE or host-level exhaustion (a quick host-side check is sketched just below).
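When the global file table is exhausted, the kernel usually says so in its log. The exact wording can vary by kernel version and the output here is illustrative; absence of this message plus a single failing service points back at EMFILE:
cr0x@server:~$ sudo dmesg -T | grep -i 'file-max' | tail -n 2
[Fri Jan  3 10:12:41 2025] VFS: file-max limit 2097152 reached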
Second: confirm the limit inside the failing container
- Check ulimit -n inside the container (or via /proc for the exact process).
- Check how many FDs the process actually has open right now (ls /proc/<pid>/fd | wc -l).
Third: verify what systemd granted to Docker
- If dockerd is limited to something low, it can constrain what containers can inherit.
- Confirm LimitNOFILE on the Docker systemd unit and what the Docker process currently has.
Fourth: rule out “you raised it but nothing changed”
- Changing /etc/security/limits.conf doesn’t affect systemd services.
- Changing systemd unit files doesn’t affect already-running processes until restart.
- Changing Docker daemon defaults doesn’t retroactively change running containers.
Fifth: decide if you’re masking a leak
- FD count climbing steadily over hours/days is a leak pattern.
- FD count spiking with traffic and falling back is often “normal” scaling, but still needs headroom.
Joke #1: File descriptors are like forks in an office kitchen—everyone thinks there are plenty until lunch starts.
Interesting facts and historical context (short, concrete)
- Early UNIX had tiny FD limits (often 20 or 64 per process) because RAM was expensive and workloads were simple by modern standards.
- select(2) historically capped FDs at 1024 due to FD_SETSIZE, which shaped how servers were written for years—even after better APIs existed.
- poll(2) and epoll(7) became the practical answer to scaling large numbers of sockets without the select ceiling.
- Linux counts “open files” broadly: regular files, sockets, pipes, eventfds, signalfds—many of which don’t look like “files” to app developers.
- systemd changed the game by owning service limits; editing shell startup files doesn’t touch a system service started by PID 1.
- Docker’s defaults can be conservative depending on distro packaging and daemon configuration; assuming “containers inherit the host limit” is often wrong.
- Kernel-wide file table pressure shows up as performance weirdness before hard failure: latency spikes, connect() errors, and “random” I/O failures.
- Some runtimes burn FDs for convenience (like aggressive file watching in dev modes); shipping that to prod is how you get paged at 2 a.m.
The limits stack: kernel, user, systemd, Docker, container
You don’t fix this problem by memorizing one magical knob. You fix it by understanding the chain of custody for RLIMIT_NOFILE and the kernel’s global limits.
1) Kernel-wide file table: fs.file-max
fs.file-max is a system-wide cap on the number of file handles the kernel will allocate. If you hit it, a lot of things fail at once. This is the “ENFILE” flavor, and it’s typically catastrophic across the host.
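Checking the ceiling and current usage takes two commands; the numbers below are illustrative, and modern kernels and systemd often set the maximum extremely high:
cr0x@server:~$ sysctl fs.file-max
fs.file-max = 9223372036854775807
cr0x@server:~$ cat /proc/sys/fs/file-nr
15872 0 9223372036854775807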
2) Per-process maximum: fs.nr_open
fs.nr_open is the kernel-imposed ceiling for per-process open files. Even if you set ulimit -n to a million, the kernel will stop you at nr_open. This is a common “why didn’t my limit apply?” trap.
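You can watch the trap happen directly: asking for more than fs.nr_open is refused, even as root. A minimal demonstration; the error wording varies by shell:
cr0x@server:~$ sysctl fs.nr_open
fs.nr_open = 1048576
cr0x@server:~$ sudo bash -c 'ulimit -n 2000000'
bash: ulimit: open files: cannot modify limit: Operation not permitted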
3) Process limits: RLIMIT_NOFILE and inheritance
Every process has a current soft limit and a hard limit. The soft limit is what the process uses; the hard limit is the maximum it can raise itself to without privilege. Child processes inherit limits from their parent unless explicitly changed. That detail matters because:
- dockerd inherits from systemd
- containers inherit from the runtime and Docker’s configuration
- your app process inherits from the container’s init/entrypoint
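Inheritance is easy to demonstrate with nested shells; note that lowering the limit in the parent sticks for the child, and a non-root process cannot raise the hard limit back:
cr0x@server:~$ bash -c 'ulimit -n 2048; bash -c "ulimit -n; ulimit -Hn"'
2048
2048
That is exactly how a modest default in dockerd, the runtime, or an entrypoint script propagates all the way down to your worker.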
4) systemd service limits: the grown-up place to set them
For production hosts running Docker as a systemd service, the most reliable way to set Docker’s FD limits is a systemd override for docker.service. Edits to /etc/security/limits.conf are not wrong; they’re just frequently irrelevant to systemd-managed services.
5) Docker container ulimits: explicit is better than “I think it inherits”
You can set ulimits per container using Docker run flags, Compose, or daemon defaults. Here’s the practical stance: set per-service limits explicitly for critical workloads. It makes behavior portable across hosts and reduces “it worked on that node” drama.
Practical tasks: commands, outputs, decisions (12+)
These are the checks I actually run. Each includes what the output means and what decision you make from it.
Task 1: Confirm the error type in logs (EMFILE vs ENFILE)
cr0x@server:~$ docker logs --tail=200 api-1 | egrep -i 'too many open files|emfile|enfile' || true
Error: EMFILE: too many open files, open '/app/logs/access.log'
Meaning: EMFILE points to per-process FD exhaustion, not a whole-host file table collapse.
Decision: Focus on container/app ulimits and the process’s FD usage pattern before touching kernel-wide settings.
Task 2: Check the container’s ulimit from inside
cr0x@server:~$ docker exec -it api-1 sh -lc 'ulimit -n; ulimit -Hn'
1024
1048576
Meaning: The soft limit is 1024 (tiny for many network servers). The hard limit is high, so the process could raise its own soft limit if it tried (or the entrypoint could do it on its behalf).
Decision: Raise the soft limit via Docker/Compose ulimits so the app starts with a sane value.
Task 3: Identify the PID of the real worker process
cr0x@server:~$ docker exec -it api-1 sh -lc 'ps -eo pid,comm,args | head'
PID COMMAND COMMAND
1 tini tini -- node server.js
7 node node server.js
Meaning: PID 7 is the actual Node process; PID 1 is the init wrapper.
Decision: Inspect limits and open FDs for PID 7, not just PID 1.
Task 4: Read the process limit straight from /proc
cr0x@server:~$ docker exec -it api-1 sh -lc "grep -E 'Max open files' /proc/7/limits"
Max open files 1024 1048576 files
Meaning: Confirms the actual runtime limit applied to the process.
Decision: If it’s low, fix container ulimit configuration; if it’s high but still failing, look for leaks or fs.file-max/fs.nr_open issues.
Task 5: Count open FDs for the process right now
cr0x@server:~$ docker exec -it api-1 sh -lc 'ls /proc/7/fd | wc -l'
1018
Meaning: The process is basically at its 1024 ceiling; failure is expected.
Decision: Raise the limit and immediately reduce FD pressure if possible (reduce concurrency, connection pools, watchers).
Task 6: See what those FDs actually are
cr0x@server:~$ docker exec -it api-1 sh -lc 'ls -l /proc/7/fd | head -n 15'
total 0
lrwx------ 1 root root 64 Jan 3 10:11 0 -> /dev/null
lrwx------ 1 root root 64 Jan 3 10:11 1 -> /app/logs/stdout.log
lrwx------ 1 root root 64 Jan 3 10:11 2 -> /app/logs/stderr.log
lrwx------ 1 root root 64 Jan 3 10:11 3 -> socket:[390112]
lrwx------ 1 root root 64 Jan 3 10:11 4 -> socket:[390115]
lrwx------ 1 root root 64 Jan 3 10:11 5 -> anon_inode:[eventpoll]
Meaning: Mostly sockets and eventpoll—typical for a busy server. If you see thousands of the same file path, suspect a leak in file handling or logging.
Decision: If it’s mostly sockets, check upstream connection reuse and client keepalive; if it’s files, audit open/close behavior.
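A rough way to turn that listing into a histogram of FD types (assumes readlink exists in the image; output is illustrative):
cr0x@server:~$ docker exec api-1 sh -lc 'for f in /proc/7/fd/*; do readlink "$f"; done | sed "s/:\[.*//" | sort | uniq -c | sort -rn | head'
    412 socket
     31 anon_inode
      9 pipe
      1 /dev/null
      1 /app/logs/stdout.log
If one file path dominates the list with hundreds of entries, you are looking at a leak, not load.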
Task 7: Check host-wide file table usage
cr0x@server:~$ cat /proc/sys/fs/file-nr
15872 0 9223372036854775807
Meaning: The first number is allocated file handles; the third is the system-wide maximum (effectively unlimited in this example). If allocated approaches the max, you’re in ENFILE territory.
Decision: If you’re close to max, raise fs.file-max and hunt the host-level FD hogs.
Task 8: Check kernel ceilings for per-process open files
cr0x@server:~$ sysctl fs.nr_open
fs.nr_open = 1048576
Meaning: No process can exceed this many open files, regardless of ulimit requests.
Decision: If you need more and understand the RAM cost, raise fs.nr_open (rare; usually unnecessary).
Task 9: Inspect the Docker daemon’s current FD limit (systemd inheritance)
cr0x@server:~$ pidof dockerd
1240
cr0x@server:~$ cat /proc/1240/limits | grep -E 'Max open files'
Max open files 1048576 1048576 files
Meaning: Docker daemon itself has a high limit; good. If it were 1024 or 4096, you’d fix systemd first.
Decision: If dockerd limit is low, apply systemd override and restart Docker. Don’t argue with PID 1.
Task 10: Confirm systemd unit limits for Docker
cr0x@server:~$ systemctl show docker --property=LimitNOFILE
LimitNOFILE=1048576
Meaning: This is what systemd intends for the service, not what you hope it is.
Decision: If this is low, you need a drop-in override; if high but dockerd is low, you forgot to restart.
Task 11: Check a running container’s effective ulimits from the outside
cr0x@server:~$ docker inspect api-1 --format '{{json .HostConfig.Ulimits}}'
[{"Name":"nofile","Soft":1024,"Hard":1048576}]
Meaning: Docker is explicitly setting the soft limit to 1024 for this container.
Decision: Fix your Compose file or run flags; don’t touch kernel sysctls for a per-container soft limit problem.
Task 12: Measure FD growth rate (leak vs load)
cr0x@server:~$ for i in 1 2 3 4 5; do docker exec api-1 sh -lc 'ls /proc/7/fd | wc -l'; sleep 5; done
812
845
880
914
950
Meaning: FD count is steadily climbing over 25 seconds. That’s not “normal variance.” It might be traffic ramping, but it smells like a leak or runaway concurrency.
Decision: Raise limits for immediate stability, but open a bug and instrument FD usage. Also cap concurrency until you understand the slope.
Task 13: Find the worst FD offenders on the host
cr0x@server:~$ for p in /proc/[0-9]*; do pid=${p#/proc/}; if [ -r "$p/fd" ]; then c=$(ls "$p/fd" 2>/dev/null | wc -l); echo "$c $pid"; fi; done | sort -nr | head
21012 3351
16840 2980
9021 1240
Meaning: PID 3351 and 2980 are holding tens of thousands of FDs. PID 1240 is dockerd with ~9k, which can be normal on busy hosts.
Decision: Map top PIDs to services. If an unrelated process is consuming FDs, you may be headed for host-wide trouble.
Task 14: Map a PID to a systemd unit (host-side)
cr0x@server:~$ ps -p 3351 -o pid,comm,args
PID COMMAND COMMAND
3351 nginx nginx: worker process
Meaning: It’s nginx. Now the question becomes: why is nginx holding 21k FDs? Likely too many keepalives, too many upstream connections, or a leak via misconfiguration.
Decision: You might need to tune nginx worker connections or upstream keepalive, not just raise ulimit.
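What that tuning can look like, as a sketch only: the values are placeholders, the upstream address is hypothetical, and worker_rlimit_nofile has to fit inside the systemd/daemon limits covered in the next section.
cr0x@server:~$ cat /etc/nginx/nginx.conf
worker_processes auto;
# Per-worker FD ceiling; keep it above worker_connections plus file and upstream overhead.
worker_rlimit_nofile 16384;

events {
    # Each proxied request holds at least two FDs: the client socket and the upstream socket.
    worker_connections 8192;
}

http {
    upstream backend {
        server 10.0.0.12:8080;
        # Bounded pool of idle upstream connections instead of one connection per request.
        keepalive 64;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            # Clearing Connection is required for upstream keepalive to take effect.
            proxy_set_header Connection "";
        }
    }
}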
Raising limits correctly: systemd + Docker + container
There are two parts to doing this right:
- Set sane defaults for Docker the service (so your platform isn’t fragile).
- Set explicit ulimits for the containers that need them (so the behavior is repeatable and reviewable).
Step 1: Set Docker’s service-level LimitNOFILE with a systemd drop-in
Edit via systemd’s supported mechanism. Don’t fork vendor unit files unless you enjoy diff conflicts during package upgrades.
cr0x@server:~$ sudo systemctl edit docker
In the editor, add a drop-in like this:
cr0x@server:~$ cat /etc/systemd/system/docker.service.d/override.conf
[Service]
LimitNOFILE=1048576
Then reload systemd and restart Docker (yes, restart—limits apply at exec time):
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart docker
What you’re protecting against: a distro package defaulting to 1024/4096, or a future node image that quietly resets the daemon limits. This is a platform-level safety belt.
Step 2: Ensure the kernel isn’t the real ceiling
Most of the time, fs.nr_open is already high enough. Still, check it once and bake it into your baseline if your workloads are FD-heavy.
cr0x@server:~$ sysctl fs.nr_open
fs.nr_open = 1048576
If you must raise it (rare), do it explicitly and persistently:
cr0x@server:~$ sudo tee /etc/sysctl.d/99-fd-limits.conf >/dev/null <<'EOF'
fs.nr_open = 1048576
fs.file-max = 2097152
EOF
cr0x@server:~$ sudo sysctl --system
* Applying /etc/sysctl.d/99-fd-limits.conf ...
fs.nr_open = 1048576
fs.file-max = 2097152
Opinionated guidance: don’t crank fs.file-max to the moon “just in case.” It’s not free. Decide based on real concurrency and FD usage, then add headroom.
Step 3: Set container ulimits explicitly (Docker run)
For one-off tests or emergency mitigation:
cr0x@server:~$ docker run --rm -it --ulimit nofile=65536:65536 alpine sh -lc 'ulimit -n; ulimit -Hn'
65536
65536
Meaning: The container starts with a higher soft/hard limit.
Decision: If this resolves the immediate error, move the setting into Compose/Kubernetes spec so it’s not a hand-tuned snowflake.
Step 4: Set container ulimits explicitly (Docker Compose)
Compose makes this reviewable. That’s good governance and good uptime.
cr0x@server:~$ cat docker-compose.yml
services:
api:
image: myorg/api:latest
ulimits:
nofile:
soft: 65536
hard: 65536
Recreate the container (ulimits don’t change on the fly):
cr0x@server:~$ docker compose up -d --force-recreate api
[+] Running 1/1
✔ Container project-api-1 Started
Decision: If the recreated container still reports 1024, you’re not deploying what you think you’re deploying (wrong file, wrong project, old Compose version, or another orchestrator is in charge).
Step 5: Consider Docker daemon defaults (carefully)
Docker can set default ulimits for all containers via daemon config. This is tempting. It’s also a blunt instrument.
Use it when you control the whole fleet and want consistent defaults, but keep per-service overrides for truly FD-hungry workloads.
cr0x@server:~$ cat /etc/docker/daemon.json
{
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 65536,
"Soft": 65536
}
}
}
cr0x@server:~$ sudo systemctl restart docker
cr0x@server:~$ docker run --rm alpine sh -lc 'ulimit -n'
65536
Opinionated warning: daemon-wide defaults can surprise workloads that were stable at 1024 but misbehave with higher concurrency (because now they can). Raising limits can expose new bottlenecks: database max connections, upstream rate limits, NAT table pressure, and so on.
Step 6: Validate the change where it matters: the process
Always verify against the real PID that handles traffic.
cr0x@server:~$ docker exec -it api-1 sh -lc 'ps -eo pid,comm,args | head -n 5'
PID COMMAND COMMAND
1 tini tini -- node server.js
7 node node server.js
cr0x@server:~$ docker exec -it api-1 sh -lc "grep -E 'Max open files' /proc/7/limits"
Max open files 65536 65536 files
Decision: If that shows the correct limit, you’re done with configuration. Now you decide if the application design needs remediation.
Step 7: Don’t forget the app can set its own limits
Some runtimes and service wrappers call setrlimit() at startup. That can lower your effective limit even if Docker/systemd are generous.
If your container limit is high but your process limit is low, check for:
- Entrypoint scripts calling ulimit -n to something small
- Language runtime settings (e.g., JVM, nginx master/worker patterns, process managers)
- Base image defaults
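A quick way to audit the entrypoint, assuming it is a shell script; the path below is a guess, so check your Dockerfile for the real one:
cr0x@server:~$ docker inspect api-1 --format '{{.Path}} {{.Args}}'
/usr/bin/tini [-- node server.js]
cr0x@server:~$ docker exec api-1 sh -lc 'grep -n "ulimit" /docker-entrypoint.sh 2>/dev/null || echo "no ulimit calls found"'
no ulimit calls found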
Joke #2: “We set it in three places” is the operational version of “I saved it on my desktop.”
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
The company ran a customer-facing API on a Docker fleet. The service had been stable for months. Then a new node image rolled out—same OS family, newer minor release. Two days later, the API started failing under load with intermittent 500s. The logs showed EMFILE inside containers. The immediate reaction was disbelief: “We already set host nofile to 1,048,576. This can’t be limits.”
The wrong assumption was subtle and common: that changing /etc/security/limits.conf on the host affects Docker containers. It affects login sessions started via PAM. Docker is started by systemd. systemd doesn’t care about PAM limits. It has its own limit configuration.
During the incident, someone raised the container ulimit in Compose and redeployed. It helped—on some nodes. On others, it didn’t. That inconsistency was the clue: node image drift had changed the Docker systemd unit’s default LimitNOFILE, and only some hosts were still on the old value.
Fix was boring and permanent: systemd drop-in for docker.service setting LimitNOFILE, plus explicit per-service ulimits in Compose. They also added a post-provision check that fails the node if systemctl show docker -p LimitNOFILE isn’t what they expect.
The lasting outcome wasn’t “a bigger number.” It was a shared understanding: containers don’t float above the host; they inherit from it. Assumptions are cheap. Incidents are not.
Mini-story 2: the optimization that backfired
A different org ran a high-throughput ingestion pipeline: web tier → queue → processors → object storage. They tuned for efficiency, always chasing the next latency improvement. Someone noticed TCP connection churn between processors and the queue and decided to keep connections alive longer and increase worker concurrency.
It looked great in a staging test: fewer handshakes, better throughput, CPU flatter. So they shipped it. The next week, production had rolling brownouts. Not full outages—worse. Some percentage of requests failed; retries amplified load; the queue lag grew; the object storage client began timing out. Logs were full of “too many open files.”
They raised ulimits. It reduced the error rate, but the system remained unstable. That’s the backfire: higher FD limits allowed the service to keep even more idle and semi-stuck connections open, which increased memory usage and increased pressure on downstream limits (database connections and load balancer tracking). The incident wasn’t “we hit 1024.” It was “we designed a concurrency shape that was brittle.”
The fix was multi-layered: a moderate ulimit increase to a sane baseline, a cap on concurrency, shorter keepalive for certain upstreams, and better backpressure behavior. After that, FD count still got high under peak, but it remained bounded and predictable.
The lesson: raising FD limits is not an optimization. It’s capacity. If you don’t manage concurrency and backpressure, you’ll just fail later and louder.
Mini-story 3: the boring but correct practice that saved the day
Another team ran a mixed fleet: some bare metal, some virtual, lots of Docker, lots of handoffs between teams. They’d been burned by “works on my node” issues, so they built a small checklist into their node bootstrap: verify kernel sysctls, verify systemd limits, and verify Docker daemon defaults. Nothing fancy. No glossy dashboards. Just guardrails.
One afternoon, a new application release started tripping EMFILE in one region. Engineers suspected a code regression. But the on-call followed the checklist: inside the container, ulimit -n was 1024. That was already suspicious. On the host, systemctl show docker -p LimitNOFILE reported a low number on a subset of nodes.
It turned out those nodes were provisioned by a parallel pipeline created for a temporary capacity burst. It missed a bootstrap step. The application wasn’t suddenly broken; it was landing on hosts with different limits.
Because the team had a known-good baseline and quick checks, the incident response was clean: cordon/evict from bad nodes, patch the bootstrap, then gradually reintroduce capacity. No witch hunt, no frantic config edits in production, no heroic “I fixed it by typing fast.”
Boring practices don’t get promoted in slide decks. They do keep customers online.
Common mistakes: symptom → root cause → fix
1) Symptom: “I increased /etc/security/limits.conf but containers still show 1024”
Root cause: Docker is started by systemd; PAM limits don’t apply to systemd services.
Fix: Set LimitNOFILE in a systemd drop-in for docker.service, restart Docker, then recreate containers.
2) Symptom: container ulimit -n is high, but the app still throws EMFILE
Root cause: You checked the shell, not the process. The real worker has a lower limit, or the app lowered it at runtime.
Fix: Inspect /proc/<pid>/limits for the worker process. Audit entrypoint scripts and process managers.
3) Symptom: raising ulimit didn’t help; multiple services fail to open files/sockets
Root cause: Host-wide file table exhaustion (ENFILE) or another kernel resource limit (ephemeral ports, conntrack, memory).
Fix: Check /proc/sys/fs/file-nr, identify top FD consumers, and raise fs.file-max only with evidence. Also inspect network tables if symptoms include connect() failures.
4) Symptom: ulimit applies after redeploy, then resets after reboot or node rotation
Root cause: You fixed it manually on a node but didn’t persist it in config management, systemd drop-in, or Compose/Kubernetes spec.
Fix: Commit the change to infrastructure-as-code and application deployment manifests; add a bootstrap verification check.
5) Symptom: after raising limits, memory usage climbs and latency worsens
Root cause: More allowed FDs means more allowed concurrency; your app now holds more sockets and buffers open, increasing memory and downstream load.
Fix: Tune concurrency, pools, keepalive, and backpressure. Set limits based on measured steady-state plus headroom, not theoretical maximum.
6) Symptom: can’t raise above a specific number even as root
Root cause: You hit fs.nr_open or the container runtime refuses larger values.
Fix: Check sysctl fs.nr_open. Raise it if justified. Validate with /proc/<pid>/limits.
7) Symptom: only one container hits EMFILE, but host has plenty of headroom
Root cause: Container-specific ulimit is low, often inherited from daemon defaults or explicitly set to 1024.
Fix: Set ulimits per service (Compose ulimits or --ulimit) and recreate the container.
Checklists / step-by-step plan
Checklist A: stop the incident (15–30 minutes)
- Confirm error type in logs: EMFILE vs ENFILE.
- Check process limit via /proc/<pid>/limits inside the container.
- Count open FDs for the worker process. If near the limit, you’ve found your immediate constraint.
- Raise container ulimit (Compose/run flags) to a reasonable value (often 32k–128k depending on workload).
- Recreate the container so the new limits apply.
- Verify again against the worker PID.
- Apply a temporary throttle (reduce worker count, concurrency, keepalive) if FD growth is unbounded.
Checklist B: make it stick (same day)
- Set systemd LimitNOFILE for docker.service via drop-in.
- Verify dockerd’s limit from /proc/<dockerd-pid>/limits.
- Set per-service ulimits in Compose so application behavior is portable.
- Record baseline metrics: typical FD count at steady load and at peak load.
- Add a health check (even a simple script) to alert when FD usage approaches 70–80% of the limit; a minimal sketch follows this checklist.
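A minimal sketch of that check, assuming a container named api-1 with the worker at PID 7 and an 80% threshold; wire the failure exit code into whatever already pages you:
cr0x@server:~$ cat /usr/local/bin/check-fd-usage.sh
#!/bin/sh
# Sketch: warn when a container worker's open FDs approach its soft limit.
# Assumptions: container api-1, worker PID 7, 80% threshold. Adjust for your stack.
CONTAINER=api-1
PID=7
THRESHOLD=80

open=$(docker exec "$CONTAINER" sh -c "ls /proc/$PID/fd | wc -l")
limit=$(docker exec "$CONTAINER" sh -c "awk '/Max open files/ {print \$4}' /proc/$PID/limits")
pct=$(( open * 100 / limit ))

if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARN: $CONTAINER pid $PID using $open/$limit FDs (${pct}%)"
    exit 1
fi
echo "OK: $CONTAINER pid $PID using $open/$limit FDs (${pct}%)"
Run it from cron or your monitoring agent; discovering the worker PID automatically is the obvious next improvement.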
Checklist C: prevent recurrence (next sprint)
- Investigate leaks: does FD count increase monotonically under constant load?
- Audit connection pools: database clients, HTTP keepalive, message queue consumers.
- Review logging: avoid opening files per request; prefer stdout/stderr aggregation in containers.
- Load test with FD visibility: track FD count as a first-class signal, not an afterthought.
- Codify node baseline: sysctls + systemd limits in your provisioning pipeline.
FAQ
1) Why does my container show ulimit -n as 1024?
Because 1024 is a common default soft limit. Docker may be passing it explicitly, or your image/entrypoint sets it. Verify with docker inspect ...HostConfig.Ulimits and /proc/<pid>/limits.
2) If I set LimitNOFILE for Docker, do I still need container ulimits?
Yes, for important services. The daemon limit is a platform baseline; container ulimits are application behavior. Explicit per-service limits prevent drift across hosts and future node images.
3) Does raising FD limits harm the host?
Not directly, but it enables processes to hold more kernel objects and memory. If the app uses that capacity, you may increase RAM usage and downstream load. Capacity is a tool, not a virtue.
4) What’s a “reasonable” nofile limit for production containers?
It depends. Many web services are comfortable at 32k–128k. High-fanout proxies, message brokers, or busy nginx instances may need more. Measure steady-state and peak FD usage, then add headroom.
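A rough worked example with made-up numbers: a service peaking at 20,000 concurrent client connections, each holding one upstream connection, plus a few hundred FDs for files, database sockets, and logs, needs roughly 20,000 × 2 + 300 ≈ 40,300 FDs at peak. Doubling that for headroom lands around 80,000, so a nofile limit of 98,304 or 131,072 is defensible; jumping straight to 1,048,576 is mostly a way to avoid doing the measurement.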
5) I raised limits but still get “too many open files” in DNS lookups or TLS handshakes. Why?
Those operations open sockets and sometimes temporary files. If your worker is at its FD limit, everything that requires a new FD fails in strange places. Confirm the worker’s FD count and limit; don’t chase the symptom site.
6) Do Kubernetes pods behave differently?
The principles are the same: kernel resources + per-process limits. The configuration knobs differ (runtime, kubelet settings, pod security context capabilities, and how the container runtime applies rlimits). Still: verify in /proc for the actual process.
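The verification step translates directly; the pod and container names below are placeholders, and if your app is not PID 1 inside the container, find the worker PID first, exactly as in Task 3:
cr0x@server:~$ kubectl exec -it api-7d9f8b6c5-xk2lp -c api -- sh -c 'grep "Max open files" /proc/1/limits'
Max open files            1048576              1048576              files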
7) Why did my change not apply until I restarted things?
Because limits are applied when a process starts (exec time). systemd changes require restarting the service. Container ulimit changes require recreating the container. Running processes keep their current limits.
8) How do I tell if it’s a leak vs just traffic?
Sample the FD count over time at a constant-ish load. A leak trends upward without coming back down. Traffic-related usage tends to rise and fall with concurrency. Task 12’s loop is a quick smell test; proper monitoring is better.
9) Can I just set everything to 1,048,576 and move on?
You can. You’ll also make it easier for a bug to consume far more resources before failing, and you’ll push bottlenecks into less obvious places. Pick a limit that matches your architecture, then monitor.
10) Is “too many open files” ever caused by storage issues?
Indirectly. Slow disks and stuck I/O can cause descriptors to remain open longer, which raises concurrent FD usage. But the error is still a limit/capacity issue—solve the limit and the I/O behavior.
Conclusion: next steps that actually stick
When Docker throws “too many open files,” don’t treat it like a Docker quirk. It’s Linux doing exactly what you told it to do: enforce resource limits. Your job is to figure out which limit, at which layer, and whether the workload deserves more capacity or less chaos.
Do this next, in order:
- Verify the real worker process limit in /proc/<pid>/limits and its current FD usage.
- Set explicit container ulimits for the service (Compose/run flags), then recreate the container.
- Set a systemd drop-in for docker.service LimitNOFILE so the platform is consistent across reboots and node rotations.
- Measure FD behavior under load, and decide whether you’re solving capacity or hiding a leak.
If you do those four things, this stops being a recurring incident and becomes a solved operational detail. Which is what you want: fewer surprises, fewer heroics, and a quieter on-call.