“Too many open files” is one of those errors that feels insulting. Linux is doing exactly what you told it to do: enforce limits. Proxmox is just where you notice it, usually at 2am, usually while you’re doing something “safe” like a backup, a migration, or a restore that can’t wait.
The fix is rarely “set everything to a million.” The fix is: find the process that hits the ceiling, raise the right ceiling (systemd unit, user limit, kernel limit, or application setting), and then prove you didn’t just hide a file-descriptor leak that will come back nastier next week.
What “too many open files” actually means on Proxmox
On Linux, “open files” really means file descriptors: small integers a process uses to reference open resources. Yes, regular files. Also sockets, pipes, eventfds, inotify watches, device nodes, sometimes even “anonymous” kernel objects exposed through a file interface. When you see:
- EMFILE: the process hit its per-process file descriptor limit.
- ENFILE: the system hit a global file table limit (less common on modern kernels, but not impossible).
In Proxmox environments, EMFILE is the usual culprit. That means a specific process — a daemon, a QEMU instance, a backup worker, a Ceph OSD — ran out of file descriptors. Raising limits blindly can mask a leak, but it can also be the right move if your workload legitimately needs tens of thousands of fds (think: Ceph, busy reverse proxies, huge backup catalogs, or a single VM with a lot of virtio-scsi and network activity).
File descriptor limits exist at multiple layers:
- Kernel global max (fs.file-max): how many file handles the entire system can allocate.
- Per-process rlimit (ulimit / RLIMIT_NOFILE): the cap for a given process.
- systemd unit limits (LimitNOFILE=): what systemd sets for a service before exec.
- Per-user limits (PAM / /etc/security/limits.conf): affects login sessions and services started under a user, but not necessarily system services managed by systemd unless PAM is involved.
- Application internal limits: sometimes there's a knob inside the service too (Ceph configs, PBS datastore open file patterns, etc.).
Proxmox adds a twist: a lot of the important processes are systemd-managed daemons (pvedaemon, pveproxy, pvestatd, pve-cluster), plus QEMU instances started under systemd scopes (depending on version and configuration). So the “right” fix is often a systemd override for a specific unit, not a global ulimit change that doesn’t apply to services anyway.
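If you want to see all of those layers at a glance before digging in, a quick read-only sweep looks like this (PID 1432 and pveproxy are placeholders borrowed from the examples later in this article; substitute your suspect):
cr0x@server:~$ sysctl fs.file-max                      # kernel global ceiling
cr0x@server:~$ systemctl show -p DefaultLimitNOFILE    # systemd manager default
cr0x@server:~$ systemctl show pveproxy -p LimitNOFILE  # per-unit setting
cr0x@server:~$ ulimit -n                               # your shell session, nothing more
cr0x@server:~$ grep "open files" /proc/1432/limits     # what the process actually runs with
None of these commands change anything; they just tell you which layer is the one worth arguing about.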
Fast diagnosis playbook (first/second/third checks)
This is the part you run when the pager is loud and your patience is quiet.
First: identify who said it
- Check journald for the exact error line and the process name. Don’t guess.
- Confirm whether it's EMFILE (per-process) or ENFILE (system-wide). It changes the fix.
Second: check file descriptor usage in the suspect process
- Get PID(s) of the suspect daemon or QEMU process.
- Count open fds and compare to the process limit (from /proc/<pid>/limits).
Third: decide what to raise (and where)
- If the process limit is low and usage is legitimately high, raise LimitNOFILE for that systemd unit (best case).
- If the system-wide file table is exhausted, raise fs.file-max and investigate why the system is hoarding file handles (often a leak, runaway services, or Ceph at scale).
- If it's a one-off spike from a backup job or restore, consider scheduling/throttling and fixing the workload pattern rather than only raising limits.
One quote that's been painfully true in operations: "Hope is not a strategy" (commonly attributed in engineering leadership circles). For this error, hope looks like "raise everything and reboot." Strategy looks like instrumentation and targeted limits.
Interesting facts and historical context (why this keeps happening)
- Fact 1: Early Unix systems often had tiny per-process fd limits (sometimes a few dozen). Modern systems default to 1024 or higher, but defaults still reflect “general purpose” assumptions, not virtualization hosts.
- Fact 2: The soft/hard limit split exists so processes can raise their own soft limit up to a hard cap. systemd can set both.
- Fact 3: “Everything is a file” includes sockets; a busy API proxy can burn thousands of fds without touching disk.
- Fact 4: QEMU uses multiple file descriptors per disk, per NIC, per vhost thread, plus eventfds. A single VM can consume hundreds or thousands of fds on a busy host.
- Fact 5: inotify uses file descriptors and watchers; some monitoring stacks and log shippers scale watchers aggressively and can hit limits in surprising ways.
- Fact 6: systemd introduced a more explicit, unit-level way to manage limits (LimitNOFILE), reducing reliance on shell-level ulimit hacks that never applied to daemons.
- Fact 7: Linux global file handle accounting has improved, but you can still exhaust file tables if you run enough services or leak fds over time.
- Fact 8: Ceph historically tends to want higher fd limits; OSDs can legitimately require large numbers of open files and sockets under load.
Joke #1: File descriptors are like meeting rooms—if nobody ever leaves, eventually the building “runs out,” and management blames the calendar.
Where it happens in Proxmox: which service is usually guilty
On a Proxmox node, the phrase “too many open files” can surface from multiple layers. Your job is to stop treating Proxmox as a monolith and start treating it like what it is: Debian plus systemd plus a stack of daemons plus workloads (VMs/CTs) doing chaotic things.
Proxmox daemons (common on busy clusters)
- pveproxy: web UI and API. It can burn fds via many concurrent HTTPS connections, long-lived browser sessions, reverse proxy setups, and authentication backends.
- pvedaemon: background tasks and API helper. It’s frequently involved during backups, restores, migrations, storage scans.
- pvestatd: collects stats; can stress storage queries and network connections in large clusters.
- pve-cluster / pmxcfs: cluster filesystem daemon; issues often show up as weird cluster behavior, not just a clear EMFILE line.
QEMU/KVM processes (common when “one VM goes feral”)
If a VM runs a high-connection workload (proxies, message brokers, websockets, NAT gateways), it can force the host-side QEMU process to open many fds for vhost-net, tap devices, and io_uring/eventfd interactions. If the QEMU process limit is low, you’ll see failures in networking, disk IO, or device hotplug.
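Before touching anything, it helps to see whether any QEMU process is actually close to its ceiling. A minimal sketch, assuming the usual qemu-system-x86_64 binary name and the -id <vmid> argument Proxmox passes on the command line (run it as root so /proc/<pid>/fd is readable):
cr0x@server:~$ for pid in $(pidof qemu-system-x86_64); do
  vmid=$(tr '\0' ' ' < /proc/$pid/cmdline | sed -n 's/.*-id \([0-9]\+\).*/\1/p')
  fds=$(ls -1 /proc/$pid/fd 2>/dev/null | wc -l)
  soft=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
  echo "VMID=$vmid PID=$pid fds=$fds soft_limit=$soft"
done
Any VM sitting within a few hundred fds of its soft limit deserves a closer look before it becomes the 2am VM.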
Storage layers (ZFS, Ceph, PBS)
- ZFS usually isn't the direct EMFILE source on the host, but the userland helpers and backup processes interacting with many files can be. Think: zfs send pipelines and backup indexing.
- Ceph: OSDs and monitors are notorious for needing high fd limits in real deployments. "Too many open files" here is not exotic; it's a config check you should do early.
- Proxmox Backup Server (if co-located or integrated): backup jobs open lots of chunks, indices, and connections. A low fd limit can turn a backup window into a slow-motion failure.
Containers (LXC)
LXC containers can hit their own nofile limits inside the container namespace. But don’t assume the container is at fault; the host’s daemon managing it might be the one complaining. Always pin the error to a PID and then to a unit.
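To check both layers quickly (CT 120 is a hypothetical container ID), something like this keeps you honest about which side is complaining:
cr0x@server:~$ pct exec 120 -- sh -c 'ulimit -n; grep "open files" /proc/self/limits'
cr0x@server:~$ pid=$(lxc-info -n 120 -p -H); grep "open files" /proc/$pid/limits
The first line shows the limits the workload inside the container actually sees; the second shows what the host granted the container's init process.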
Practical tasks (commands, outputs, decisions) — the working checklist
You want real tasks that reduce uncertainty. Here are more than a dozen. Run them in order until you have a named bottleneck and a justified change.
Task 1: Find the exact error line and the emitting service
cr0x@server:~$ journalctl -p err -S -2h | grep -iE "too many open files|EMFILE|ENFILE" | tail -n 20
Dec 26 09:31:12 pve1 pveproxy[1432]: error: accept4: Too many open files
Dec 26 09:31:12 pve1 pveproxy[1432]: failed to accept connection: EMFILE
What it means: This is an EMFILE from pveproxy PID 1432. Don’t touch kernel globals yet; it’s per-process.
Decision: Inspect PID 1432 limits and fd count. Prepare a systemd override for pveproxy.service if needed.
Task 2: Confirm whether the system is near global file handle exhaustion
cr0x@server:~$ cat /proc/sys/fs/file-nr
23872 0 9223372036854775807
What it means: First number is allocated file handles; third is the system max. On many modern kernels, the max is effectively huge (or dynamic). If the first number is close to the max, you have a system-wide issue.
Decision: Here it’s nowhere near. Focus on per-process limits.
Task 3: Check the process’s fd limit and current usage
cr0x@server:~$ pid=1432; echo "FDs:"; ls -1 /proc/$pid/fd | wc -l; echo; echo "Limits:"; grep -i "open files" /proc/$pid/limits
FDs:
1024
Limits:
Max open files 1024 1048576 files
What it means: The process is sitting exactly at its soft limit (1024). The hard limit is higher, so the service could theoretically raise it, but it didn’t.
Decision: Raise LimitNOFILE for the unit so systemd starts it with a higher soft limit. Target something sensible (e.g., 8192–65536 depending on concurrency).
Task 4: Identify the systemd unit backing the PID
cr0x@server:~$ systemctl status pveproxy --no-pager
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: active (running) since Thu 2025-12-26 08:11:02 UTC; 1h 25min ago
Main PID: 1432 (pveproxy)
Tasks: 10 (limit: 154675)
Memory: 76.4M
CPU: 1min 52.123s
What it means: It’s a normal unit, not some mystery process. Good: we can override it cleanly.
Decision: Use systemctl edit pveproxy to set LimitNOFILE. Don’t edit files in /lib/systemd/system directly.
Task 5: Inspect current unit-level limits (if any)
cr0x@server:~$ systemctl show pveproxy -p LimitNOFILE
LimitNOFILE=1024:1048576
What it means: systemd started it with soft=1024, hard=1048576.
Decision: Override to something higher for soft (and optionally hard). If you only need a higher soft, set just one value; systemd accepts LimitNOFILE=65536 (applies to both) or LimitNOFILE=65536:65536.
Task 6: Create a systemd override for the specific Proxmox service
cr0x@server:~$ sudo systemctl edit pveproxy
# (editor opens)
Put this in the drop-in:
cr0x@server:~$ cat /etc/systemd/system/pveproxy.service.d/override.conf
[Service]
LimitNOFILE=65536
What it means: Next restart, pveproxy gets a 65k fd limit. That’s high enough for most clusters without being a moonshot.
Decision: Reload systemd and restart the service. If this is the API proxy and you’re in the middle of an outage, do it during a safe window or via the node console.
Task 7: Apply and verify the new limits actually took effect
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart pveproxy
cr0x@server:~$ systemctl show pveproxy -p LimitNOFILE
LimitNOFILE=65536:65536
What it means: The unit is now configured correctly.
Decision: Confirm the running PID has the new limit, not just the unit metadata.
Task 8: Confirm the running PID’s effective limit
cr0x@server:~$ pid=$(pidof pveproxy); grep -i "open files" /proc/$pid/limits
Max open files 65536 65536 files
What it means: The process is actually running with the new limit. That’s the point.
Decision: Watch for recurrence. If fd usage keeps climbing, you may have a leak or an upstream overload.
Task 9: See what kind of fds are being consumed (sockets vs files)
cr0x@server:~$ pid=$(pidof pveproxy); sudo ls -l /proc/$pid/fd | awk '{print $11}' | head
socket:[942113]
socket:[942115]
/dev/null
socket:[942118]
anon_inode:[eventfd]
What it means: If the list is mostly socket:, you’re dealing with connection concurrency. If it’s real paths, you may be scanning a lot of files, or stuck on storage.
Decision: If it’s sockets, consider front-end proxies, keepalives, auth backends, and client behavior. If it’s filesystem paths, investigate what is opening them and why.
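A quick histogram of fd types makes that judgment call faster; a small sketch (pveproxy forks workers, so pidof returns several PIDs — this just takes the first):
cr0x@server:~$ pid=$(pidof pveproxy | awk '{print $1}')
cr0x@server:~$ sudo ls -l /proc/$pid/fd | awk 'NR>1 {print $11}' | sed 's/:.*//' | sort | uniq -c | sort -nr
You get counts per type (socket, anon_inode, real paths), which is usually enough to decide whether you're chasing connections or files.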
Task 10: Identify top fd consumers on the node (quick triage)
cr0x@server:~$ for p in /proc/[0-9]*; do
pid=${p#/proc/}
comm=$(tr -d '\0' < $p/comm 2>/dev/null) || continue
fdc=$(ls -1 $p/fd 2>/dev/null | wc -l)
echo "$fdc $pid $comm"
done | sort -nr | head -n 10
18034 28741 ceph-osd
9123 1942 qemu-system-x86
5012 1432 pveproxy
2120 1567 pvedaemon
What it means: You’ve got the villains lined up by fd count.
Decision: If a Ceph OSD is dominating, you may need to adjust Ceph systemd limits too. If QEMU is large, focus on the specific VM and its start method.
Task 11: For QEMU: map a QEMU PID back to a VMID
cr0x@server:~$ pid=1942; tr '\0' ' ' < /proc/$pid/cmdline | sed -n 's/.*-id \([0-9]\+\).*/VMID=\1/p'
VMID=104
What it means: That QEMU process is VM 104.
Decision: Check if it’s repeatedly spiking fd usage during backups, high connection load, or a device misconfiguration.
Task 12: Check the QEMU process limit and whether it’s too low
cr0x@server:~$ pid=1942; grep -i "open files" /proc/$pid/limits
Max open files 4096 4096 files
What it means: QEMU is capped at 4096. That can be okay, or totally insufficient for a busy gateway VM.
Decision: If QEMU is failing with EMFILE and fd usage approaches 4096, raise the limit for the scope/service that starts VMs (see the “Raising limits correctly” section).
Task 13: Find the systemd unit/scope for a QEMU PID
cr0x@server:~$ pid=1942; systemctl status $pid --no-pager
● 104.scope - qemu-kvm -id 104
Loaded: loaded
Active: active (running) since Thu 2025-12-26 08:22:51 UTC; 1h 13min ago
Main PID: 1942 (qemu-system-x86)
What it means: The VM runs in a systemd scope named after VMID. Limits may come from the parent service or systemd defaults.
Decision: Adjusting per-VM scopes is possible but usually not the best operational model. Prefer fixing the parent unit/template or the service that spawns QEMU (often via Proxmox tooling), or use systemd manager defaults if appropriate.
Task 14: Check systemd default limits (sometimes the hidden culprit)
cr0x@server:~$ systemctl show --property DefaultLimitNOFILE
DefaultLimitNOFILE=1024:524288
What it means: systemd’s defaults can influence scopes and services that don’t set their own limits explicitly.
Decision: Don’t raise this casually. It’s a “blast radius” knob. Use it only if many unrelated services need higher limits and you’ve validated memory impact and monitoring.
Task 15: Check per-user limits (mostly useful for interactive tools and PAM sessions)
cr0x@server:~$ ulimit -n
1024
What it means: Your shell session has an open-files limit of 1024. This does not automatically apply to systemd services.
Decision: If your operational tools (custom scripts, rsync jobs, CLI backup workflows) are hitting EMFILE in interactive sessions, adjust PAM limits. Otherwise, prioritize systemd overrides.
Task 16: If you suspect a leak, watch fd count over time
cr0x@server:~$ pid=$(pidof pvedaemon); for i in $(seq 1 10); do date; ls -1 /proc/$pid/fd | wc -l; sleep 30; done
Thu Dec 26 09:40:01 UTC 2025
812
Thu Dec 26 09:40:31 UTC 2025
829
Thu Dec 26 09:41:01 UTC 2025
846
What it means: If it only climbs and never falls under steady load, something may be leaking fds (or you have an ever-increasing number of long-lived connections).
Decision: Raising limits might buy time, but you should chase the leak: logs, recent updates, plugins, auth backends, or a load balancer doing something “creative.”
Task 17: Identify which remote peers are holding sockets open (for proxies)
cr0x@server:~$ pid=$(pidof pveproxy); sudo lsof -nP -p $pid | awk '/TCP/ {print $9}' | head
10.20.5.18:8006->10.20.5.40:51522
10.20.5.18:8006->10.20.7.91:50944
What it means: You can see client IPs and ports. If one client opens a ridiculous number of connections, you’ve found a behavioral problem.
Decision: Fix the client, add a proxy with sane limits, or tune keepalive settings. Raising fd limits without addressing abusive clients is like buying a bigger trash bin for a raccoon.
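If you prefer ss over lsof for this, counting established connections per client against the API port works too (8006 is the default pveproxy port; the cut on ':' assumes IPv4 peers):
cr0x@server:~$ ss -Htn 'sport = :8006' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
One IP with hundreds of connections is a client problem wearing a server-error costume.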
Raising limits correctly: kernel vs systemd vs app
Start with the most surgical fix
If a specific service hits EMFILE, set LimitNOFILE for that service. This is the cleanest approach: measurable, reversible, minimal collateral.
Systemd override: the default move for Proxmox services
Proxmox daemons are systemd services. You change limits with drop-ins in /etc/systemd/system/<unit>.service.d/override.conf. That file survives updates and tells future-you what you did.
Common Proxmox service units to consider:
- pveproxy.service (UI/API)
- pvedaemon.service (tasks, API helper)
- pvestatd.service (stats)
- pve-cluster.service (cluster filesystem)
- ceph-osd@<id>.service, ceph-mon@<name>.service (if you run Ceph on the node)
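The override mechanics are identical for templated units; a sketch for all OSD instances on one node (the 131072 value is an assumption used to illustrate the pattern — size it from your own measured peaks, and note that the packaged Ceph unit files may already ship a high limit):
cr0x@server:~$ sudo systemctl edit ceph-osd@.service
cr0x@server:~$ cat /etc/systemd/system/ceph-osd@.service.d/override.conf
[Service]
LimitNOFILE=131072
cr0x@server:~$ sudo systemctl daemon-reload
Then restart OSDs one at a time, checking cluster health between restarts.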
How high should you set LimitNOFILE?
Pick numbers that correspond to operational reality:
- 8192: decent bump for lightweight daemons that occasionally spike.
- 16384–65536: common for busy proxies, management daemons in large clusters, and backup-related processes.
- 131072+: reserved for heavy hitters (Ceph OSDs at scale, high-throughput network services). Don’t do this because you saw it on a forum.
The limit is not free: each fd consumes kernel resources. On modern systems this is usually acceptable, but “acceptable” depends on memory pressure, number of processes, and how many fds they actually use.
When kernel global limits matter
Global file handle exhaustion is rarer, but it happens on:
- nodes running many services (monitoring, log shipping, storage, reverse proxies)
- Ceph-heavy nodes
- systems under fd leaks across multiple processes
Inspect current settings:
cr0x@server:~$ sysctl fs.file-max
fs.file-max = 9223372036854775807
On many modern kernels, this value is enormous or effectively unbounded. In that case, ENFILE is less likely, and EMFILE remains the primary issue. Still, don’t assume; measure.
Per-user limits: useful, but not your primary lever for Proxmox daemons
/etc/security/limits.conf and ulimit -n matter for user sessions, cron jobs, and scripts run by humans. They do not reliably control systemd-managed services (systemd sets rlimits itself). If you “fixed” a Proxmox daemon by editing limits.conf, you got lucky, or you changed something else at the same time.
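If you do need higher limits for interactive work and cron-driven scripts, a PAM-level drop-in is the usual home for it. A minimal sketch, assuming a hypothetical backupops user that runs your CLI backup workflows:
cr0x@server:~$ cat /etc/security/limits.d/90-backupops.conf
# applies at PAM login (new SSH session), NOT to systemd services
backupops  soft  nofile  16384
backupops  hard  nofile  65536
It takes effect on the next login, not for shells that are already open — and, again, systemd services will cheerfully ignore it.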
Raising QEMU limits: what you should and should not do
If QEMU hits fd limits, you have two parallel questions:
- Is this VM’s workload legitimately fd-heavy (lots of connections/devices)?
- Are QEMU processes started with a sensible rlimit for your environment?
Raising QEMU limits can be done by adjusting the systemd environment that spawns them (depending on your Proxmox version and how the VM is scoped). In practice, I recommend:
- First, confirm the QEMU PID is hitting the limit (not the guest OS).
- Then, consider raising the default for the spawner service only if multiple VMs show the same issue.
- For a single outlier VM, consider workload-level fixes: connection pooling, reducing aggressive keepalives, fixing application leaks, or splitting roles across VMs.
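When you do want to know what a particular VM got, ask systemd about its scope and compare with the live process (VMID 104 and PID 1942 are the examples from the tasks above):
cr0x@server:~$ systemctl show 104.scope -p LimitNOFILE
cr0x@server:~$ grep "open files" /proc/1942/limits
If the scope property and the live rlimit aren't what you expected, the limit is being inherited from somewhere else (often the manager default) rather than set explicitly.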
Joke #2: “We’ll just set LimitNOFILE to infinity” is how you end up learning what “finite memory” means in production.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They ran a mid-sized Proxmox cluster for internal services and a few customer-facing apps. One morning, the web UI started timing out. Then API calls failed. Then backups queued up. The team assumed the node was “down” in the classic sense: CPU pegged, disk full, network dead. None of that was true.
The clue was sitting in journald: pveproxy couldn’t accept new connections. “Too many open files.” Someone had recently put an internal dashboard in a browser kiosk mode on a wall display. It refreshed aggressively, opened multiple parallel API connections, and never closed them properly. Multiply that by a few TVs and a few developers keeping tabs open for days, and you get a slow-motion denial-of-service.
The wrong assumption was that “too many open files” must mean “ZFS opened too many files” or “storage is broken.” They spent hours chasing pool health and SMART data. Meanwhile, the proxy was at 1024 fds and rejecting the very UI they needed to debug things.
The fix was simple: raise LimitNOFILE for pveproxy to a sane value and fix the offending client behavior. The lesson was more valuable: always tie the error to a PID, then to a unit, then to a workload. Guessing is expensive.
Mini-story 2: The optimization that backfired
A different org wanted faster backups. They tuned their backup scheduling to run more jobs in parallel and shortened retention cycles. On paper, it looked great: more concurrency, more throughput, less time in the window. They also deployed additional monitoring that tailed lots of logs and watched lots of directories.
Then the backups started failing intermittently. Sometimes it was the backup process. Sometimes it was a management daemon. Sometimes a seemingly unrelated service like a metrics collector. The pattern was “random,” which is what you call a systemic resource limit when you haven’t admitted it yet.
The backfire was that each parallel backup worker and each monitoring agent multiplied fd usage. Individually, each service looked fine. Collectively, the node spent long periods with many processes near their soft caps. Occasional spikes pushed one of them over the edge, and whichever was unlucky threw EMFILE first.
They “fixed” it once by raising a global default and moved on. A month later, the same pattern returned, except now the node also had higher memory pressure and more processes. The eventual correct fix was boring: set explicit LimitNOFILE for the known heavy services, reduce backup parallelism to match the storage backend, and measure fd usage during peak operations. Performance improved, failures stopped, and they didn’t have to keep raising ceilings forever.
Mini-story 3: The boring but correct practice that saved the day
A team running Ceph on Proxmox had a habit that felt uncool: they kept a runbook with three simple checks for any platform error. One of those checks was “fd usage vs limits.” They also version-controlled their systemd drop-ins for critical services (not the Proxmox-managed files, the overrides in /etc).
During a storage expansion, a new OSD deployment started flapping. The symptoms looked like networking: heartbeats missing, slow ops, occasional daemon restarts. But the runbook forced them to check the basics. The OSD processes were hitting their fd limit under the new layout and load pattern.
Because they already had a known-good policy for Ceph unit limits, the fix wasn’t a frantic forum search. They applied the existing override pattern, restarted the affected services in a controlled sequence, and validated with a single command that limits and usage now had headroom.
The day was saved by something boring: repeatable checks and configuration hygiene. No heroics. No “just reboot it.” The cluster kept serving IO, and the change window ended on time. In operations, boring is a feature.
Common mistakes: symptom → root cause → fix
Mistake 1: “I increased ulimit -n and it still fails”
Symptom: You run ulimit -n 65535 in your shell, restart a Proxmox service, and still see EMFILE.
Root cause: systemd services don’t inherit your shell ulimit. systemd sets rlimits itself.
Fix: Set LimitNOFILE in a systemd drop-in for the specific unit, then restart the unit and verify in /proc/<pid>/limits.
Mistake 2: Raising the global systemd default because one service misbehaved
Symptom: You change DefaultLimitNOFILE system-wide and everything seems fine… until unrelated services start consuming more resources.
Root cause: You expanded the blast radius. Now lots of services can open more fds, which can amplify leaks and increase kernel memory pressure.
Fix: Keep defaults conservative. Raise limits only on units that need it. Use the default only after you’ve confirmed many services share a legitimate requirement.
Mistake 3: Treating EMFILE as a “storage problem”
Symptom: Backup fails with “too many open files,” and the team dives into ZFS tuning.
Root cause: The failing process is often a daemon or backup worker opening lots of files/sockets, not ZFS itself.
Fix: Identify the emitting PID and check its limit and fd count. Then raise that unit’s limit or reduce concurrency.
Mistake 4: Raising limits without checking for leaks
Symptom: The node runs fine for a week after raising limits, then collapses again.
Root cause: A real leak (fds never released) or runaway connection behavior; a higher limit just delayed the crash.
Fix: Trend fd counts over time for suspect processes. If it’s monotonic growth, treat it as a bug or workload problem. Escalate to vendor/community with evidence (fd growth, lsof samples, journald lines).
Mistake 5: Confusing guest limits with host limits
Symptom: A containerized app throws EMFILE, so you raise host limits and nothing changes.
Root cause: The limit is inside the container or inside the app runtime, not the host process.
Fix: Check within the container (ulimit -n) and check the host-side process too. Fix the layer that’s actually failing.
Mistake 6: Restarting random services until the error disappears
Symptom: After restarts, things “work again,” but nobody knows why.
Root cause: Restarting drops open fds, temporarily masking load or leaks.
Fix: Use restarts as stabilization, not diagnosis. Before restarting, capture: journald error, PID limits, fd count, and lsof sample.
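That capture can be one short sequence; it writes the evidence to /tmp before the restart destroys it (pvedaemon here is a placeholder for whichever unit is actually misbehaving):
cr0x@server:~$ unit=pvedaemon; pid=$(systemctl show -p MainPID --value $unit)
cr0x@server:~$ journalctl -u $unit -S -2h | grep -iE "too many open files|EMFILE|ENFILE" > /tmp/${unit}-errors.log
cr0x@server:~$ cp /proc/$pid/limits /tmp/${unit}-limits.txt
cr0x@server:~$ sudo ls -1 /proc/$pid/fd | wc -l > /tmp/${unit}-fdcount.txt
cr0x@server:~$ sudo lsof -nP -p $pid > /tmp/${unit}-lsof.txt
Thirty seconds of copying beats an hour of arguing about what the numbers "probably were" before the restart.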
Checklists / step-by-step plan (safe change sequence)
Checklist A: You saw “too many open files” in Proxmox UI/API
- Pull journald errors for last 1–2 hours; find the emitting binary and PID.
- Check if it’s EMFILE vs ENFILE.
- Count open fds for the PID and compare with /proc/<pid>/limits.
- If fd count is near the limit: set LimitNOFILE for the specific unit and restart it.
- If fd count is far below the limit: you may have transient spikes; investigate concurrency, client behavior, and logs around the timestamp.
- After change, verify new limits on the running PID and monitor for recurrence.
Checklist B: You suspect QEMU processes are hitting limits
- Find QEMU PID from the error message or from top fd consumers.
- Map PID to VMID using the cmdline (-id).
- Check /proc/<pid>/limits and fd count. If close: this is real.
- Decide: fix workload pattern (preferred for single VM) vs raise limits (if many VMs affected).
- After changes, validate by running the same workload and watching fd headroom.
Checklist C: You run Ceph on Proxmox and hit EMFILE
- Identify which Ceph daemon (OSD/MON/MGR) logged EMFILE.
- Check that daemon’s systemd unit limit and running PID limit.
- Apply unit override for that daemon type (often templated units like ceph-osd@.service).
- Restart in a controlled order, watching cluster health.
- Confirm fd usage under load and ensure you didn’t just postpone a leak.
Step-by-step: the “correct” way to raise a limit for a Proxmox daemon
- Measure current limit and usage (/proc/<pid>/limits and a count of /proc/<pid>/fd).
- Pick a new value with headroom (usually 8–64x current peak, not 1000x).
- Create a systemd drop-in with systemctl edit <unit>.
- Run systemctl daemon-reload.
- Restart the unit during an appropriate window.
- Verify with systemctl show and /proc/<pid>/limits.
- Monitor: fd count over time, error logs, and client behavior.
- Write down why you did it. Future-you is a different person with less context and more caffeine.
FAQ
1) Should I raise fs.file-max to fix “too many open files” on Proxmox?
Usually no. Most Proxmox incidents are per-process EMFILE. Raise the service’s LimitNOFILE first. Touch kernel globals only if you can demonstrate global exhaustion.
2) What’s a good default LimitNOFILE for pveproxy?
For small setups, 8192 is plenty. For busy clusters with many UI/API clients or automation, 65536 is a pragmatic ceiling. If you need more, you probably have client behavior to fix too.
3) Why does ulimit -n show 1024 even though I set higher limits in systemd?
ulimit -n reflects your current shell session. systemd services have their own rlimits. Check the service PID in /proc/<pid>/limits to see what matters.
4) Can raising nofile limits cause instability?
Indirectly, yes. Higher limits let processes allocate more kernel resources. If you also have leaks, you just gave the leak more runway. Use higher limits with monitoring and trend fd counts.
5) My container says “too many open files.” Should I change the host?
Maybe, but start inside the container: check the app’s limit and the container’s limit. If the host’s LXC/QEMU process is the one hitting EMFILE, then adjust the host-side service or scope.
6) How do I know if it’s a leak versus legitimate concurrency?
Leaks look like fd counts that rise steadily and never come down, even when load drops. Legitimate concurrency tends to track traffic patterns. Sample fd counts over time and inspect fd types (sockets vs files).
7) Which Proxmox services are most sensitive to fd limits?
pveproxy and pvedaemon are common. In larger clusters, pvestatd and cluster components can surface issues. With Ceph, the Ceph daemons are frequent candidates.
8) I raised limits, but I still see errors. What now?
Confirm the running PID has the new limit (don’t trust config files). Then check whether fd usage still reaches the new cap. If it does, you either need a higher limit or you have a genuine runaway. If it doesn’t, the error may be coming from a different process than you think.
9) Is there a “one size fits all” number for Proxmox nodes?
No, and that’s the trap. Virtualization nodes host wildly different workloads. Set per-service limits based on measured peaks and expected concurrency, not superstition.
Conclusion: practical next steps
If you remember one thing: “too many open files” is not a vibe. It’s a measurable resource ceiling. Your first job is attribution: which PID, which unit, which limit. Your second job is choosing the smallest effective change: a systemd LimitNOFILE override for the specific service that actually hit EMFILE. Your third job is making sure you didn’t just give a leak a longer fuse.
Next steps you can do today:
- Add a quick fd triage command to your runbook (PID → fd count → limits); a minimal sketch follows this list.
- Set explicit LimitNOFILE overrides for known heavy Proxmox services in your environment (pveproxy, pvedaemon, Ceph daemons if applicable).
- During peak operations (backup windows, migrations), sample fd usage and keep the numbers. Evidence beats arguments.
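A minimal triage sketch you could drop into the runbook — the path and name here are arbitrary, and nothing in it changes state:
cr0x@server:~$ cat /usr/local/sbin/fd-triage
#!/bin/sh
# Usage: fd-triage <systemd-unit>   e.g. fd-triage pveproxy
unit="$1"
pid=$(systemctl show -p MainPID --value "$unit")
echo "unit:    $unit (MainPID $pid)"
echo "limits:  $(awk '/Max open files/ {print $4 " soft / " $5 " hard"}' /proc/$pid/limits)"
echo "in use:  $(ls -1 /proc/$pid/fd 2>/dev/null | wc -l)"
Run it as root (so /proc/<pid>/fd is readable), point it at a unit, and you've answered the first three questions of any "too many open files" incident in one line.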