Ubuntu 24.04 Kernel Parameter Tuning: the 5 sysctls that matter (and the 10 that don’t) (case #42)

You got paged because p99 latency doubled, the API is “fine on my laptop,” and someone in a chat thread
is copy-pasting a sysctl -w incantation from 2013. The temptation is to tune the kernel like it’s a race car.
In production, it’s more like tuning a freight train: small knobs matter, big knobs can derail you, and most knobs are placebo.

Ubuntu 24.04 ships with a modern kernel and conservative defaults that are usually decent. Usually.
The trick is knowing which five sysctls are worth touching, when they matter, and how to prove you helped instead of just changing numbers.

Rules of engagement: how to tune without hurting yourself

Kernel tuning is not a hobby. Treat it like a production change: you need a hypothesis, a metric, a rollback,
and a clear definition of “better.” Otherwise you’re just rearranging sysctls until the outage ends for unrelated reasons.

1) Change one dimension at a time

If you change TCP buffers, dirty page thresholds, and conntrack sizing in one deploy, you’ve created a mystery novel.
SREs don’t get paid to read mystery novels at 03:00.

2) Prefer workload-aware defaults over folklore

Your system’s limits are usually elsewhere: CPU saturation, storage latency, DNS timeouts, head-of-line blocking,
or application-level backpressure. Sysctls can help, but only when you can name the bottleneck.

3) Make changes persistent the right way

Ubuntu 24.04 loads sysctls from /etc/sysctl.conf and /etc/sysctl.d/*.conf.
Put your changes in a dedicated file like /etc/sysctl.d/99-local-tuning.conf and document why.
Don’t “just run sysctl -w” on a pet server and call it done.

4) Understand the blast radius

Some sysctls are per-namespace, some are global. Containers can complicate this.
If you tune the host, you might change behavior for every pod, VM, or service using that kernel.

5) Don’t tune around missing capacity

You can’t sysctl your way out of an undersized NIC, a saturated disk array, or a database that’s doing full table scans.
Tuning is for friction. Not for physics.

One line to keep you honest: “Hope is not a strategy,” a maxim the operations world has repeated for decades.
In kernel tuning, “hope” looks like “let’s set it to 1,000,000 and see.”

Facts and history that explain today’s defaults

  • The “sysctl” interface predates Linux containers. Many knobs were designed for single-tenant machines; multi-tenant hosts can turn “safe” into “noisy neighbor.”
  • TCP window scaling (RFC 1323) made buffer sizing relevant. Before it, large bandwidth-delay product links couldn’t be efficiently filled no matter what you did.
  • Linux moved from “tune everything” to “auto-tune most things.” Modern kernels dynamically size many TCP buffers, so old static recipes often do nothing.
  • The “dirty page” writeback model has been tuned for decades. The knobs exist because buffering writes can improve throughput, but too much buffering makes latency spikes spectacular.
  • vm.swappiness became famous partly because laptops were swapping during the early desktop Linux era. Servers inherited the folklore, even when it didn’t fit.
  • Conntrack wasn’t “default” in the early days. Stateful firewalling and NAT made connection tracking a core scaling concern for many fleets.
  • Ephemeral ports used to collide more often in high-QPS clients. Port range and TIME-WAIT behavior became operational topics once service meshes and aggressive retries showed up.
  • File descriptor exhaustion has always been a top-10 outage cause. Not because Linux is fragile, but because software is creative at leaking sockets under partial failure.

Joke #1: Kernel tuning is like seasoning soup: you can always add more salt, but you can’t remove it once the customers are already yelling.

Fast diagnosis playbook (first/second/third)

Before you touch sysctls, find the bottleneck. This is the fastest “triage ladder” I’ve used on Ubuntu fleets:
it narrows the suspect list in minutes.

First: is it CPU, memory pressure, or runnable queue?

  • Check load vs CPU usage: high load with low CPU usage often means I/O wait or lock contention.
  • Check major faults and swapping: if you’re paging, you have a memory problem first.
  • Check run queue length: if it’s consistently above CPU cores, you’re CPU-bound or scheduler-bound.

Second: is it storage latency or writeback stalls?

  • Look for high disk await, saturation, and spikes in flush/writeback.
  • Look for dirty pages growth and sudden sync storms (especially on databases, logging bursts, or NFS clients).

Third: is it network queuing, drops, or connection tracking?

  • Check retransmits and drops (they turn throughput into sadness).
  • Check socket backlog overflow and listen queue drops.
  • Check conntrack count vs max and “insert failed” errors.

If you do this in order, you’ll avoid the classic mistake: tuning TCP buffers when the real issue is a throttled NVMe or a CPU pinned by encryption.
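
If you want that ladder as one read-only pass, here’s a minimal sketch; it assumes sysstat and net-tools are installed (the task section below uses the same tools), and none of these commands change state:

cr0x@server:~$ uptime && nproc                    # load average vs number of CPUs
cr0x@server:~$ vmstat 1 5                         # r = run queue, si/so = swap traffic, wa = I/O wait
cr0x@server:~$ iostat -xz 1 3                     # per-device await, queue depth, %util
cr0x@server:~$ ss -s                              # socket totals and TIME-WAIT churn
cr0x@server:~$ netstat -s | grep -Ei 'retrans|listen queue|dropped'   # loss and backlog signals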

The 5 sysctls that matter (most of the time)

These are the knobs I actually see move production metrics on Ubuntu 24.04 when there’s a clear bottleneck.
They’re not magic. They’re levers you pull when the measurement says you should.

1) fs.file-max — system-wide file handle ceiling

If you run high-connection-count services (proxies, message brokers, busy HTTP frontends),
file handles are oxygen. When you run out, things don’t degrade gracefully; they fail in ways that look like “random” connection errors.

Ubuntu defaults are often fine for moderate workloads, but large fleets and connection-heavy nodes can hit the ceiling.
Also: the kernel limit is only half the story; per-process limits (ulimit -n) and systemd unit limits often bite first.

When to change it

  • Errors like “Too many open files” in logs.
  • Load balancers/proxies with tens of thousands of concurrent sockets per process.
  • High churn systems with many watchers (inotify) or lots of short-lived connections.

How to set it safely

Raise it to a value you can justify, not “infinite.” Track file handle usage and keep headroom.
The kernel needs memory for file structures; on tiny machines, huge values are performative.
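
On Ubuntu 24.04 the kernel ceiling is usually already enormous (see Task 2 below), so in practice the per-process limit is the one you raise. A minimal sketch, with nginx as a placeholder service name and 65536 as an illustrative limit, not a recommendation:

cr0x@server:~$ cat /proc/sys/fs/file-nr                       # allocated handles vs kernel ceiling
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/nginx.service.d
cr0x@server:~$ printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/nginx.service.d/limits.conf
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart nginx
cr0x@server:~$ systemctl show nginx --property=LimitNOFILE    # confirm the limit the unit will get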

2) net.core.somaxconn — listen backlog ceiling

This caps how many connections can queue for acceptance on a listening socket (with some nuance: the application’s listen()
backlog and kernel behavior both matter). If you have bursty traffic and an accept loop that can’t keep up momentarily,
a low backlog turns bursts into SYN drops or refused connections.

When to change it

  • Short spikes cause connection failures while CPU is not saturated.
  • Reverse proxies and API gateways with bursty ingress.
  • Services that do expensive TLS handshakes and can’t accept fast enough during storms.

How to set it safely

Bump to 4096 or 8192 for busy frontends, then verify you reduced listen drops.
Bigger backlogs can hide application slowness; don’t use it as a sedative.
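
The effective backlog is the smaller of the application’s listen() argument and net.core.somaxconn, so check both sides. A sketch, assuming a listener on port 443; for sockets in LISTEN state, ss reports the current accept queue in Recv-Q and the configured backlog in Send-Q:

cr0x@server:~$ sysctl -n net.core.somaxconn
cr0x@server:~$ ss -ltn 'sport = :443'              # Recv-Q = waiting to be accepted, Send-Q = backlog size
cr0x@server:~$ netstat -s | grep -i 'listen queue' # cumulative overflow counter; watch the rate, not the total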

3) net.ipv4.ip_local_port_range — ephemeral ports for clients

This one matters for systems that initiate lots of outbound connections: API clients, service mesh sidecars,
NAT gateways, scrapers, and anything doing aggressive retries. If you run out of ephemeral ports, you get connect() failures,
and the application often interprets that as “remote is down,” adding retries, making it worse.

When to change it

  • High outbound QPS with short-lived connections.
  • NAT or proxy nodes where many internal clients share egress.
  • TIME-WAIT accumulation is significant and you can’t move to connection reuse quickly.

How to set it safely

Widen the range (for example, to 10240 60999 or similar) while keeping clear of reserved ports.
Then verify port usage and confirm you’re not masking poor connection reuse.
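
Before widening anything, check how close you actually are. A sketch for validation only; the 10240 lower bound is just an example and must stay above any ports your own services listen on. Persist the change via the sysctl.d workflow in the tasks section if it helps:

cr0x@server:~$ sysctl -n net.ipv4.ip_local_port_range          # current ephemeral span
cr0x@server:~$ ss -Htan state time-wait | wc -l                # sockets parked in TIME-WAIT
cr0x@server:~$ ss -Htan state established | wc -l              # plus the ones actively in use
cr0x@server:~$ sudo sysctl -w net.ipv4.ip_local_port_range="10240 60999"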

4) vm.swappiness — swapping policy preference

This is the knob everyone touches first, and that’s usually wrong. But it does matter in one very common server case:
when you have plenty of RAM, but the kernel decides to swap out infrequently used pages, and latency-sensitive services
later fault them back in at the worst possible time.

On many modern server workloads, a lower swappiness (often 1 to 10) reduces surprise major faults,
assuming you have enough memory and you aren’t using swap as a capacity extension.

When to change it

  • Latency spikes correlate with major page faults or swap-ins.
  • You have swap enabled “just in case,” and you want it as a last resort, not a routine.
  • Database nodes where caching is meaningful and swap thrash is deadly.

How to set it safely

Don’t set it to 0 as a reflex. Use a low value and watch reclaim behavior.
If you’re memory-constrained, swappiness is not your real fix—capacity or workload reduction is.
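
A sketch for confirming pressure before touching the knob; /proc/pressure/memory (PSI) is available on stock Ubuntu 24.04 kernels and shows how much time tasks spend stalled on memory. Apply temporarily first, then persist via sysctl.d once validated:

cr0x@server:~$ cat /proc/pressure/memory          # some/full avg10, avg60: stall percentages
cr0x@server:~$ vmstat 1 5                         # si column: pages swapped in under load
cr0x@server:~$ sudo sysctl -w vm.swappiness=10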

5) vm.dirty_ratio and vm.dirty_background_ratio — writeback buffering and pain distribution

Yes, that’s two sysctls, but it’s one decision. These govern how much dirty (modified, not-yet-written) data the kernel will allow
before it forces writeback, and when background writeback should start.

Default ratios can be fine on general-purpose nodes. On write-heavy systems, they can create “quiet for a while, then everything stalls”
behavior: buffers fill, then the kernel forces synchronous-ish writeback pressure on unlucky threads. Latency goes from smooth to sawtooth.

When to change it

  • Write-heavy workloads with periodic latency cliffs.
  • Log aggregation nodes, ingest pipelines, CI workers writing artifacts.
  • Systems with fast bursts (in-memory) but slower backing storage (network block, spinning disks, overloaded RAID).

How to set it safely

For many servers, lowering ratios (or using the byte-based equivalents) smooths writeback and reduces tail latency.
But go too low and you’ll throttle throughput unnecessarily. This is one you must validate with real metrics: dirty bytes, writeback,
storage latency, and application p99.
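
A sketch of the byte-based variant, which is often easier to reason about on large-memory hosts because ratios scale with RAM; the 256 MB / 1 GB thresholds are illustrative and must be validated against your storage. Setting the *_bytes knobs makes the kernel ignore the corresponding *_ratio knobs:

cr0x@server:~$ sudo sysctl -w vm.dirty_background_bytes=268435456   # 256 MB; overrides vm.dirty_background_ratio
cr0x@server:~$ sudo sysctl -w vm.dirty_bytes=1073741824             # 1 GB; overrides vm.dirty_ratio
cr0x@server:~$ grep -E 'Dirty|Writeback:' /proc/meminfo             # watch these while under write load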

The 10 sysctls that don’t (most of the time)

These are the knobs that show up in blog posts, “Linux hardening” gists, or performance cargo cults.
They aren’t useless in all circumstances. They’re just rarely the thing that fixes your problem on Ubuntu 24.04.
Touch them only when you can explain the failure mode and how you’ll measure improvement.

1) net.ipv4.tcp_tw_reuse — TIME-WAIT folklore

Reusing TIME-WAIT sockets used to be a common “fix” for ephemeral port pressure. Modern networking stacks and real-world NAT/timestamp behavior
make this a risky lever. Connection reuse at the application layer (keep-alives, pooling) is the grown-up solution.

2) net.ipv4.tcp_fin_timeout — looks helpful, usually isn’t

Tuning FIN timeout rarely fixes port exhaustion because TIME-WAIT, not FIN-WAIT, is usually the bulk.
If you’re drowning in half-closed sockets, ask why peers aren’t closing cleanly.

3) net.ipv4.tcp_syncookies — not a performance tweak

This is a defense mechanism for SYN floods, not a throughput booster.
Set it based on security posture, not because you saw it in a tuning list.

4) net.ipv4.tcp_sack — please don’t toggle it casually

Disabling SACK has been suggested during specific vulnerability eras. In normal life, SACK improves loss recovery.
Flipping it can harm performance on imperfect networks and is rarely justified for general servers.

5) net.core.netdev_max_backlog — the “just make the queue bigger” trap

If packets are piling up because the CPU can’t drain the queue, making the queue bigger can increase latency and jitter.
Sometimes you need it for burst absorption; often you need CPU, IRQ affinity, RPS/RFS tuning, or simply fewer packets.

6) net.ipv4.tcp_mtu_probing — a workaround, not a baseline

MTU probing can help on paths with broken PMTUD. If you’re not diagnosing blackholes, don’t make your stack guess MTUs all day.

7) kernel.pid_max — rarely your bottleneck

If you’re exhausting PIDs, something else is deeply wrong: fork bombs, runaway supervisors, or extreme process churn.
Fix the churn; don’t just raise the ceiling.

8) vm.vfs_cache_pressure — the “cache knob” people misread

On modern kernels, the relationship between page cache, inode/dentry cache, and reclaim is more nuanced than this single number.
If you tune it without measuring reclaim and cache hit rates, you’re guessing.

9) kernel.randomize_va_space — security knob, not a speed knob

Address space randomization affects exploit difficulty. It’s not your latency issue, and disabling it for performance is usually unjustifiable.

10) vm.overcommit_memory — misunderstood and misapplied

Overcommit behavior matters for certain allocation-heavy workloads, but changing it “for performance” is a classic way to turn a recoverable
memory spike into immediate allocation failures. Use it when you understand the application’s allocation patterns and failure handling.

Joke #2: Setting random sysctls from the internet is like accepting a production change request that says “misc tweaks”; the only guarantee is future meetings.

Practical tasks: commands, meaning, decisions (12+)

These are real tasks I run on Ubuntu servers before and after sysctl changes.
Each task includes: command, what the output means, and the decision you make from it.
Run them as a user with appropriate privileges.

Task 1: See what sysctls are actually set (and where)

cr0x@server:~$ sudo systemd-analyze cat-config sysctl.d
# /usr/lib/sysctl.d/00-system.conf
net.ipv4.ip_forward = 0
...
# /etc/sysctl.d/99-local-tuning.conf
vm.swappiness = 10

Meaning: This shows which sysctl files are applied and in what precedence order.
Decision: If you can’t find where a value comes from, don’t change it yet. Fix configuration hygiene first.

Task 2: Check current values for the “five that matter”

cr0x@server:~$ sysctl fs.file-max net.core.somaxconn net.ipv4.ip_local_port_range vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
fs.file-max = 9223372036854775807
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 32768	60999
vm.swappiness = 60
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

Meaning: You’ve got a baseline. Note: some distros set very large file-max values; per-process limits still matter.
Decision: If swappiness is 60 on a latency-sensitive node with swap enabled, it’s a candidate—after confirming you’re actually swapping.

Task 3: Validate file descriptor pressure system-wide

cr0x@server:~$ cat /proc/sys/fs/file-nr
12032	0	9223372036854775807

Meaning: First number is allocated file handles; third is the maximum.
Decision: If allocated approaches max (or you see allocation failures), raise fs.file-max and audit processes for leaks.

Task 4: Validate per-process file descriptor limits (systemd often wins)

cr0x@server:~$ systemctl show nginx --property=LimitNOFILE
LimitNOFILE=1024

Meaning: Your service can only open 1024 files/sockets, regardless of fs.file-max.
Decision: If this is a busy frontend, raise LimitNOFILE in the unit override; sysctl alone won’t save you.

Task 5: Check listen queue drops and SYN backlog signals

cr0x@server:~$ netstat -s | grep -Ei 'listen queue|SYNs to LISTEN'
    120 times the listen queue of a socket overflowed
    120 SYNs to LISTEN sockets dropped

Meaning: The kernel had connections arriving faster than the application could accept them.
Decision: Consider raising net.core.somaxconn and also profile accept loop / TLS / thread starvation.

Task 6: See how many ephemeral ports are actually in use

cr0x@server:~$ ss -s
Total: 32156 (kernel 0)
TCP:   27540 (estab 1320, closed 24511, orphaned 12, synrecv 0, timewait 23890/0), ports 0
Transport Total     IP        IPv6
RAW       0         0         0
UDP       316       280       36
TCP       3029      2510      519
INET      3345      3070      275
FRAG      0         0         0

Meaning: A large TIME-WAIT count indicates churn; it may also indicate lack of keep-alives/pooling.
Decision: If outbound churn is high and connect failures occur, widen ip_local_port_range and prioritize connection reuse.

Task 7: Confirm you’re swapping (don’t guess)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        41Gi       2.1Gi       1.2Gi        19Gi        18Gi
Swap:          8.0Gi       1.6Gi       6.4Gi

Meaning: Swap is in use. That’s not always evil, but it’s suspicious on latency-sensitive nodes.
Decision: If swap is being used and you see major faults during latency spikes, lower vm.swappiness and/or add memory.

Task 8: Measure swap activity over time

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 1677720 2200000 120000 18400000   0   4     2   120  900 2400 12  6 78  4  0
 1  0 1677720 2195000 120000 18420000   0   0     0    80  850 2300 11  5 80  4  0
 3  1 1677720 2100000 120000 18500000  40  12   200   600 1200 4000 18  8 60 14  0

Meaning: si/so (swap in/out) spikes correlate with latency pain.
Decision: If swap-ins occur under load, lower swappiness and fix memory pressure. If swap is steady but low, it may be harmless.

Task 9: Inspect dirty/writeback behavior

cr0x@server:~$ grep -E 'MemAvailable|Dirty|Writeback:' /proc/meminfo
MemAvailable:   18342172 kB
Dirty:           914832 kB
Writeback:        26304 kB

Meaning: Dirty data is currently ~900MB. That can be fine or alarming depending on RAM and workload.
Decision: If Dirty grows into multi-GB and latency spikes align with flush storms, tune dirty ratios (or bytes) to smooth writeback.

Task 10: Correlate I/O latency quickly

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0-xx-generic (server) 	12/30/2025 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.40    0.00    5.80   18.20    0.00   63.60

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1          40.0   1024.0     0.0   0.00    2.10    25.6     900.0   32000.0    10.0   1.10   45.00    35.6   41.2   98.0

Meaning: High w_await, high aqu-sz, and near-100% utilization together indicate the device is saturated.
Decision: Don’t “tune the kernel” first. Fix storage throughput/latency or reduce write pressure; dirty tuning can only redistribute pain.

Task 11: Check conntrack saturation (classic NAT/gateway failure mode)

cr0x@server:~$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 252144
net.netfilter.nf_conntrack_max = 262144

Meaning: You’re within 4% of the conntrack max. That’s a red blinking light.
Decision: If this is a stateful firewall/NAT/load balancer node, raise nf_conntrack_max and ensure you have memory headroom. Also investigate connection churn.

Task 12: Verify conntrack drops in kernel logs

cr0x@server:~$ sudo dmesg -T | tail -n 8
[Mon Dec 30 10:14:02 2025] nf_conntrack: table full, dropping packet
[Mon Dec 30 10:14:02 2025] nf_conntrack: table full, dropping packet

Meaning: Packets are being dropped because conntrack is full. Users will experience timeouts and retries.
Decision: Raise conntrack capacity and reduce churn. This is not a “maybe”; it’s a confirmed bottleneck.

Task 13: Check TCP retransmits and drops (don’t blame sysctls for a bad link)

cr0x@server:~$ netstat -s | sed -n '/^Tcp:/,/^Udp:/p'
Tcp:
    923849 active connection openings
    12 failed connection attempts
    210938 segments retransmitted
    41 resets sent

Meaning: Retransmits are a sign of packet loss, congestion, or poor queueing.
Decision: If retransmits spike with performance issues, investigate the network path, NIC drops, queue disciplines, and congestion—not random TCP sysctls.

Task 14: Apply a sysctl change temporarily (for validation only)

cr0x@server:~$ sudo sysctl -w net.core.somaxconn=8192
net.core.somaxconn = 8192

Meaning: Change applied immediately, not persistent across reboot.
Decision: Use this for A/B validation during an incident window, then make it persistent properly if it helped.

Task 15: Make changes persistent and reload them

cr0x@server:~$ sudo tee /etc/sysctl.d/99-local-tuning.conf >/dev/null <<'EOF'
net.core.somaxconn = 8192
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
EOF
cr0x@server:~$ sudo sysctl --system
* Applying /etc/sysctl.d/99-local-tuning.conf ...
net.core.somaxconn = 8192
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Meaning: Values are now persistent and applied in correct order.
Decision: Record the change ticket, metrics before/after, and the rollback plan (delete the file or revert values).

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption (conntrack edition)

A mid-sized company ran a set of “edge” nodes: HAProxy, NAT, and some light L7 routing. The team assumed the nodes were stateless because
“it’s just forwarding.” That assumption was comfortable, popular, and wrong.

A partner integration launched. Traffic didn’t just grow; it got spikier. A retry storm hit after a brief upstream hiccup, turning a neat traffic curve
into a porcupine. The edge nodes started timing out on new connections. Nothing obvious in CPU graphs. Memory looked stable. Storage was irrelevant.

Someone bumped net.core.somaxconn. Another person blamed TLS. The real clue was in the kernel log: conntrack table full, dropping packets.
These nodes were doing NAT, so conntrack state was not optional; it was the product.

The fix was boring: increase net.netfilter.nf_conntrack_max and size it to available memory, then reduce churn by fixing client connection reuse
and adding circuit breakers to stop retries from becoming self-harm. Afterward, the team updated their design docs: “stateful edge.”
That single phrase prevented future “but it’s stateless” meetings.

Mini-story #2: The optimization that backfired (dirty ratios and the illusion of speed)

Another org had a logging pipeline with a local disk buffer. Under heavy ingest, the buffer would occasionally fall behind. An engineer noticed
the disks were “not busy” most of the time and wanted to batch writes for throughput. They raised vm.dirty_ratio significantly,
expecting fewer flushes and better disk efficiency.

Throughput looked great—briefly. Then the on-call started seeing periodic multi-second stalls in the ingestion service.
Not a gentle slowdown. Full stops. Downstream alerts fired because ingestion became bursty, which made indexing bursty, which made queries bursty.
It was like watching a small tremor trigger a chain of falling shelves.

The kernel had been allowed to accumulate a huge pile of dirty pages. When writeback pressure finally kicked in, the system paid the debt all at once.
Some threads that needed to allocate memory or write small metadata got stuck behind the writeback storm. The “not busy” disk graph was a lie:
it was idle until it wasn’t, and then it was 100% utilized with high await.

Rolling back the change stabilized latency immediately. The longer-term fix was more nuanced: moderate dirty ratios (or byte thresholds),
better batching inside the application where it could be controlled, and more disk bandwidth. The incident review had a clean lesson:
kernel buffering can smooth; it can also hide risk until it detonates.

Mini-story #3: The boring but correct practice that saved the day (limits and observability)

A financial services team ran a high-traffic API with strict SLOs. Their tuning posture was almost annoying in its discipline:
every sysctl change required a hypothesis, a dashboard link, and a rollback procedure. They kept a small, versioned file in
/etc/sysctl.d/ with comments explaining each deviation from defaults.

One afternoon, a deployment introduced a subtle socket leak under a specific error path. File descriptors climbed slowly.
On a less disciplined team, the first symptom would have been an outage and a frantic “increase limits!” reaction.

Instead, they had two guardrails. First: a dashboard tracking /proc/sys/fs/file-nr and per-process open FD counts.
Second: sane headroom in LimitNOFILE and the kernel’s file-max, so the leak had room to be detected and rolled back before hard failure.

They still had to fix the bug. But the “boring” work—documented sysctls, measured baselines, and limits sized for reality—turned a would-be incident
into a routine rollback. That’s the whole game.

Common mistakes: symptom → root cause → fix

1) “Random connection timeouts during traffic spikes”

Symptom: Clients see intermittent connection failures during bursts; CPU is not pegged.

Root cause: Listen backlog overflow (somaxconn too low) or accept loop too slow.

Fix: Raise net.core.somaxconn to 4096–8192 and confirm with netstat -s listen queue counters; profile accept/TLS and worker saturation.

2) “Everything freezes for seconds, then recovers”

Symptom: Periodic stalls; disk utilization spikes to 100%; p99 becomes sawtooth.

Root cause: Dirty page accumulation and writeback storms (dirty ratios too high for the device).

Fix: Lower vm.dirty_background_ratio and vm.dirty_ratio (or use byte-based limits), then confirm reduced Dirty/Writeback oscillation and improved I/O await.

3) “Outbound calls fail under load with EADDRNOTAVAIL”

Symptom: Client-side connection errors, often during retries or bursts.

Root cause: Ephemeral port exhaustion or excessive TIME-WAIT due to connection churn.

Fix: Widen net.ipv4.ip_local_port_range, reduce connection churn via keep-alives/pooling, and monitor TIME-WAIT counts with ss -s.
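
One quick way to gauge proximity to the cliff: compare TIME-WAIT toward your busiest upstream with the size of the ephemeral range, since a single source address can only hold roughly one range worth of concurrent connections to the same destination address and port. A sketch, with 10.0.0.5:8080 standing in for whatever your client hammers:

cr0x@server:~$ sysctl -n net.ipv4.ip_local_port_range
cr0x@server:~$ ss -Htan state time-wait '( dst 10.0.0.5 and dport = :8080 )' | wc -l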

4) “nf_conntrack: table full” and user-facing timeouts

Symptom: Kernel logs show conntrack drops; NAT/firewall nodes time out.

Root cause: nf_conntrack_max too low for connection churn and concurrency.

Fix: Increase net.netfilter.nf_conntrack_max, ensure memory headroom, and reduce churn (idle timeouts, reuse connections, avoid retry storms).
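
A sketch of the immediate relief; 524288 is a placeholder to be sized from observed peak count plus headroom (each entry costs a few hundred bytes of kernel memory). Persist it in your /etc/sysctl.d/ file afterward, and note that the key only exists once the nf_conntrack module is loaded, so verify it again after a reboot:

cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
cr0x@server:~$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max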

5) “We set swappiness to 1 and it still feels slow”

Symptom: Latency issues persist; memory is tight; swap may still be used.

Root cause: You’re actually memory constrained; swappiness is not capacity. Also, the workload might be I/O bound or CPU bound.

Fix: Measure major faults and swap-ins; if memory pressure is real, add RAM or reduce memory use. Then tune swappiness as a policy, not a rescue.

6) “We increased all the TCP buffers and it got worse”

Symptom: Latency increased; memory use rose; no throughput gain.

Root cause: Buffers increased queueing and memory pressure; the bottleneck wasn’t TCP window size.

Fix: Revert buffer changes, validate retransmits/drops, check CPU and NIC queueing; tune only with a proven BDP problem.

7) “Too many open files” despite huge fs.file-max

Symptom: Application errors show FD exhaustion; kernel file-max looks massive.

Root cause: Per-process limits (systemd LimitNOFILE) are low.

Fix: Raise the unit’s LimitNOFILE and verify with systemctl show and /proc/<pid>/limits.
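
A verification sketch, with nginx again as a placeholder; it compares the limit the running process actually received with how many descriptors it currently holds:

cr0x@server:~$ pid=$(pidof -s nginx)
cr0x@server:~$ grep 'open files' /proc/$pid/limits   # soft and hard limits in effect right now
cr0x@server:~$ sudo ls /proc/$pid/fd | wc -l         # descriptors currently open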

Checklists / step-by-step plan

Checklist A: Safe sysctl tuning workflow (production)

  1. Write the hypothesis: “We are dropping inbound connections due to backlog overflow; raising somaxconn will reduce drops and improve p99 connect time.”
  2. Pick 2–3 metrics: one kernel signal (e.g., listen queue overflow), one user symptom (p99 latency), one resource metric (CPU, memory, I/O await).
  3. Capture baseline: run the relevant tasks above; save output with timestamps.
  4. Apply change temporarily: sysctl -w during a controlled window, or on a canary node.
  5. Observe: did the kernel signal improve, and did the user metric improve without collateral damage?
  6. Make it persistent: write to /etc/sysctl.d/99-local-tuning.conf, then sysctl --system.
  7. Document: what changed, why, and what rollback looks like.
  8. Revalidate after reboot: confirm the value persists and behavior is unchanged.

Checklist B: Minimal “five sysctls” profile for common server roles

Busy HTTP ingress / reverse proxy

  • net.core.somaxconn: consider 4096–8192 if you see listen overflow.
  • fs.file-max: ensure ample headroom; also raise unit LimitNOFILE.
  • net.ipv4.ip_local_port_range: mostly relevant if the node also initiates many outbound connections.
  • vm.swappiness: lower if swap-ins cause latency spikes.
  • vm.dirty_*: only if local logging/buffering causes stalls.

Database node (latency sensitive, write heavy)

  • vm.swappiness: often 1–10, but only with confirmed swapping impact.
  • vm.dirty_ratio/vm.dirty_background_ratio: tune to avoid writeback cliffs; measure I/O await and dirty/writeback.
  • fs.file-max: ensure headroom; DBs can use many files and sockets.
  • net.core.somaxconn: rarely the limiting factor unless it’s also serving bursts at the TCP level.
  • ip_local_port_range: mostly irrelevant unless the DB initiates many outbound connections.

NAT gateway / firewall / Kubernetes node with heavy egress

  • net.netfilter.nf_conntrack_max: not in the “five” list above, but for this role it becomes top-tier. Size it deliberately.
  • ip_local_port_range: relevant for connection-heavy clients or shared egress patterns.
  • fs.file-max: relevant for proxies and agents.

Checklist C: Rollback plan that actually works

  1. Keep a copy of the pre-change values: sysctl -a is noisy; at least record the keys you touched (see the sketch after this list).
  2. For persistent changes, revert the file in /etc/sysctl.d/, then run sysctl --system.
  3. If you must roll back instantly, use sysctl -w key=value for the few keys you touched, then revert the file after the incident.
  4. Confirm with sysctl key that rollback applied.
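
A sketch for step 1, recording just the keys you’re about to touch (the list is an example); the output format matches sysctl.d syntax, so the baseline file can be pasted straight back into /etc/sysctl.d/ if you ever need to pin the old behavior:

cr0x@server:~$ for k in net.core.somaxconn vm.swappiness vm.dirty_ratio vm.dirty_background_ratio; do echo "$k = $(sysctl -n $k)"; done | tee ~/sysctl-baseline-$(date +%F).txt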

FAQ

1) Should I tune sysctls on Ubuntu 24.04 at all?

Only when you have a measured bottleneck and a clear failure mode. Defaults are decent for general purpose use, not for every high-load edge case.

2) Why not just apply a “sysctl performance profile” from the internet?

Because it’s usually a mix of outdated advice, security toggles, and workload-specific tweaks presented as universal truth.
You’ll fix nothing and complicate debugging.

3) Is vm.swappiness=1 always best for servers?

No. It’s often a good policy for latency-sensitive nodes with swap enabled as an emergency brake, but it won’t solve memory shortages.
If you’re swapping under steady load, you need memory or less workload.

4) What’s the real relationship between fs.file-max and ulimit -n?

fs.file-max is a kernel-wide ceiling on file handles. ulimit -n (and systemd LimitNOFILE) caps per-process open files.
You can have a massive system-wide limit and still fail because a service is capped at 1024.

5) I increased net.core.somaxconn. Why do I still see connection drops?

Because backlog isn’t the only queue: your application’s accept loop can be slow, SYN backlog behavior matters, and CPU/IRQ saturation can drop packets earlier.
Measure where the drop occurs and whether the process is actually accepting quickly.

6) Should I tune TCP rmem/wmem buffers on Ubuntu 24.04?

Only if you’ve proven a bandwidth-delay product problem (high-latency, high-bandwidth links where throughput is limited by window size)
and autotuning isn’t sufficient. Otherwise you may just increase memory use and queueing latency.

7) Is vm.dirty_ratio tuning safe?

It’s safe when you validate: watch Dirty/Writeback, I/O await, and application tail latency.
It’s unsafe when you crank ratios high to chase throughput; that can manufacture writeback stalls.

8) How do I persist sysctl changes the “Ubuntu 24.04 way”?

Put them in /etc/sysctl.d/99-local-tuning.conf (or similar), then run sysctl --system.
Avoid ad-hoc manual changes that disappear after reboot.

9) Do containers ignore host sysctls?

Some sysctls are namespaced and can be set per container; others are global to the host kernel.
In practice: host-level tuning can affect every workload. Treat it as a shared infrastructure change.

10) If I can only change one thing during an incident, what’s the safest bet?

It depends, but the safest operational move is often not a sysctl at all: reduce load (shed traffic), add capacity, or roll back a release.
For sysctls, raising somaxconn or widening port range can be low-risk if you have matching evidence.

Next steps you can actually do today

If you want kernel tuning that improves reliability instead of producing folklore, do this sequence:

  1. Pick one node role (ingress, database, NAT, worker). Don’t tune “the fleet” as a vibe.
  2. Run the tasks section and save outputs as your baseline.
  3. Identify the dominant bottleneck using the fast diagnosis playbook—CPU, memory, disk, or network.
  4. Change only one of the five sysctls that matches the bottleneck, and only if you have confirming signals.
  5. Make it persistent via sysctl.d and document the reason in comments, not tribal knowledge.
  6. Re-check after one reboot and one traffic cycle. If you can’t demonstrate improvement, revert and move on.

The best tuning is the kind you can explain six months later without apologizing. The second best is the kind you didn’t need because you sized the system correctly.
