Somewhere in your org is a spreadsheet cell that quietly assumes “this limit won’t matter.”
It might be a queue depth, a default JVM heap, an inode count, a NAT table size, a writeback cache, or a “temporary” 10 GB volume.
Nobody remembers why it’s that number. Everyone treats it as physics.
That’s how you end up on a 03:00 incident call, arguing with graphs that look like a lie.
The “640 KB is enough” quote survives because it flatters our worst habit: believing today’s constraints are permanent and tomorrow’s demand is negotiable.
The myth, why it’s sticky, and what it teaches
The quote usually goes like this: “640 KB ought to be enough for anybody.” It’s typically attributed to Bill Gates, placed somewhere in the early PC era,
and used as a punchline about arrogance, shortsightedness, or how fast technology changes.
There’s a problem: there’s no solid evidence he said it. The attribution is shaky, the timeframe is fuzzy, and the quote tends to show up in print long after the fact.
The story persists anyway because it’s useful. It compresses a complicated history of hardware, operating systems, and business tradeoffs into one sneer-worthy sentence.
Engineers love a clean moral. Managers love a clean villain. And everyone loves a quote you can deploy in a meeting like a smoke grenade.
But production systems don’t fail because someone said a dumb thing. They fail because a limit existed, was misunderstood, and then got treated as a constant.
Here’s the point you should keep: 640 KB wasn’t a belief about the future; it was a boundary created by design choices and compatibility pressure.
The modern equivalent isn’t “someone thought RAM wouldn’t grow.” It’s “we don’t know which limit is real, which is a default, and which is a landmine.”
First short joke: The “640 KB is enough” quote is like a zombie incident ticket—nobody knows who created it, but it keeps reopening itself.
Facts and context: what 640 KB actually was
To understand why this myth clings to the timeline, you need the boring details. The boring details are where outages come from.
Here are concrete context points that matter, without the cosplay.
8 facts that explain the 640 KB boundary (and why it wasn’t random)
- The original IBM PC used the Intel 8088, whose addressing model and early PC architecture made 1 MB of address space a natural ceiling for that era. The “1 MB limit” wasn’t a vibe; it was structural.
- Conventional memory was the first 640 KB (0x00000–0x9FFFF). Above that lived reserved space for video memory, ROM, and hardware mappings. That reserved region is why “640 KB” appears as a clean number.
- The upper memory area (UMA) existed for a reason: video adapters, BIOS ROMs, and expansion ROMs needed address space. PC compatibility wasn’t optional; it was the product.
- MS-DOS ran in real mode, which meant it lived with that conventional memory world. You can shout at history, but the CPU still does what the CPU does.
- Expanded memory (EMS) and extended memory (XMS) were workarounds: EMS bank-switched memory into a page frame; XMS used memory above 1 MB with a manager. Both were complexity taxes paid for compatibility.
- HIMEM.SYS and EMM386.EXE were common tools to access and manage memory beyond conventional limits. If you ever “optimized” CONFIG.SYS and AUTOEXEC.BAT, you were doing capacity planning with a text editor and prayer.
- Protected mode existed, but software ecosystems lagged. Hardware capability doesn’t instantly rewrite the world; the installed base and compatibility matrix decide what you can ship.
- That era was full of tight constraints, but also rapid change. People weren’t stupid; they were building systems where every kilobyte had a job. The myth survives because we misread constraint as arrogance.
The useful takeaway: the number “640 KB” came from an address space map and pragmatic engineering choices, not a declaration that users would never want more.
It’s the difference between “this is the box we can draw today” and “this box will always be sufficient.”
The real lesson: limits are decisions, not trivia
I don’t care who said what in 1981. I care that in 2026, teams still ship systems with invisible ceilings and then act surprised when they hit them.
The “640 KB” story is a mirror: it shows us what we’re currently hand-waving away.
What “640 KB” looks like in modern production
- Default quotas (Kubernetes ephemeral storage, cloud block volume sizes, per-namespace object limits) treated as if they were policy.
- Kernel defaults (somaxconn, nf_conntrack_max, fs.file-max) left untouched because “Linux knows best.”
- Filesystem limits (inodes, directory scaling behaviors, small file overhead) ignored until “df says there’s space.”
- Cache assumptions (“more cache always faster”) that turn into memory pressure, eviction storms, and tail latency spikes.
- Queueing and backpressure that don’t exist, because someone wanted “simplicity.”
A single quote is comforting; a limit inventory is useful
The myth thrives because it gives you a villain. Villains are easy. Limits are work.
If you run production systems, your job is to know the limits before your users do.
Here’s a paraphrased idea from a notable reliability voice, because it’s the opposite of the 640 KB myth:
paraphrased idea — John Allspaw: Reliability comes from learning and adapting systems, not blaming individuals for outcomes.
Treat “640 KB is enough” as a diagnostic prompt: where are we relying on a historical artifact, a default setting, or a half-remembered constraint?
Then go find it. Write it down. Test it. Put alerts on it. Make it boring.
Three corporate mini-stories from the land of “it’ll be fine”
Mini-story 1: An incident caused by a wrong assumption (“disk full can’t happen; we have monitoring”)
A mid-sized SaaS company ran a multi-tenant Postgres cluster with logical replication into a reporting system.
The primary DB had plenty of free space, and dashboards showed “disk usage stable.” Everyone slept well.
One night, writes slowed, then stalled. Application error rates climbed. The on-call saw the DB was “healthy” by their usual checks:
CPU fine, RAM fine, replication lag rising but not catastrophic. The cluster didn’t crash; it just stopped making forward progress in a way that felt like molasses.
Root cause: the WAL volume filled. Not the main data volume. The WAL mount had a different size, different growth behavior, and a different alert threshold.
The “disk usage stable” dashboard looked at the data filesystem. It never looked at the WAL mount, because someone assumed “it’s on the same disk.”
Worse: the cleanup process that should have removed old WAL segments relied on replication slots. A stuck consumer held slots open.
So WAL grew until it hit the mount limit. The database did exactly what it should do when it can’t safely persist: it stopped accepting work.
The fix was straightforward—resize the mount, add alerts, unstick the consumer, and set sane retention policies. The uncomfortable lesson was not.
The team hadn’t missed a complex failure mode. They’d missed a basic inventory item: what volumes exist, what fills them, and how quickly.
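If you inherit a setup shaped like this one, there is a cheap check worth running before you trust any dashboard: ask Postgres how much WAL each replication slot is holding back. A minimal sketch, assuming a reasonably recent Postgres where the standard pg_replication_slots view and WAL LSN functions are available:
cr0x@server:~$ sudo -u postgres psql -x -c "SELECT slot_name, active, restart_lsn, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"
An inactive slot with a large retained_wal is exactly the shape of this incident. Pair that query with an alert on the WAL mount itself, not just the data volume.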
Mini-story 2: An optimization that backfired (“we’ll use huge caches; memory is cheap”)
A payments service had latency issues during peak traffic. The team optimized: more caching in-process, larger connection pools,
and aggressive read-through caches for frequently accessed metadata. Latency improved in staging. The deploy went out with confidence.
In production, tail latency improved for a few hours. Then things got weird. P99 climbed, CPU usage spiked, and error rates became bursty.
The service didn’t look overloaded—until you checked major page faults and reclaim activity. The kernel was fighting for its life.
The optimization created memory pressure and caused the kernel to reclaim file cache aggressively. That meant more disk reads for dependencies.
It also pushed the JVM (yes, this was Java) into a GC posture that looked like a sawtooth of regret.
The service had become “fast on average” and “unpredictable when it mattered,” which is the worst kind of fast.
They rolled back cache sizes, added an explicit memory budget, and moved some cache responsibility to a dedicated tier that could be scaled separately.
The long-term fix included per-endpoint SLOs and load tests that modeled peak cardinality and cache churn—not just steady-state QPS.
The lesson: “memory is cheap” is not an engineering argument. Memory is a shared resource that interacts with IO, GC, and scheduling.
Caches are not free; they are loans you repay with unpredictability unless you budget them.
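If you want to see this failure shape instead of inferring it, two quick reads help. A sketch, assuming the sysstat package is installed for sar, and that the pgrep pattern below matches your actual service (service.jar is illustrative):
cr0x@server:~$ sar -B 1 5    # rising pgscank/s and pgsteal/s means the kernel is reclaiming hard
cr0x@server:~$ ps -o pid,maj_flt,rss,comm -p "$(pgrep -d, -f service.jar)"    # climbing major faults means reads are going back to disk
Major faults plus reclaim activity during a latency spike is the signature of a cache that outgrew its host.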
Mini-story 3: A boring but correct practice that saved the day (capacity headroom + limit drills)
An enterprise internal platform team ran object storage gateways in front of a large storage backend.
The system served logs, artifacts, and backups—everything nobody thinks about until it disappears.
The team had an unsexy practice: every quarter, they ran a “limit drill.”
They would pick a constraint—file descriptors, network connections, cache size, disk throughput, inode usage—and verify alerts, dashboards, and runbooks.
They didn’t do it because it was fun. They did it because unknown limits are where incidents breed.
One week, an application team started uploading millions of tiny objects due to a packaging change.
The backend wasn’t full on bytes, but metadata pressure surged. The gateway nodes began to show elevated IO wait and increased latency.
The platform team caught it early because they had alerts not just on “disk percent used” but also on inode consumption,
request queue depth, and per-device latency. They throttled the noisy workload, coordinated a packaging fix, and added a policy for minimum object size.
Nobody outside the platform team noticed. That’s what “saved the day” looks like: nothing happens, and you get no applause.
Second short joke: Reliability engineering is being proud of an incident that never makes it into a slide deck.
Practical tasks: commands, outputs, decisions
My bias: if you can’t interrogate the system with a command, you don’t understand the system.
Below are practical tasks you can run on a Linux host. Each one includes what the output means and the decision you make from it.
These are not academic; they’re the kinds of checks you do when “something feels slow” and you need to stop guessing.
Task 1: Check memory pressure and swap reality
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 24Gi 1.2Gi 512Mi 5.8Gi 3.9Gi
Swap: 2.0Gi 1.6Gi 400Mi
Meaning: “available” is the number that matters; it estimates how much memory new work can claim without swapping, including reclaimable cache. Heavy swap usage suggests sustained memory pressure, not a brief spike.
Decision: If swap is actively used and latency is bad, you either reduce memory footprint (cache budgets, JVM heap, worker count)
or add memory. Don’t treat swap as “extra RAM”; treat it as “latency insurance with a very expensive premium.”
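If you want to know who is actually sitting in swap, not just that swap is used, per-process VmSwap is the ground truth. A small sketch using only /proc (no extra tooling assumed):
cr0x@server:~$ awk '/^VmSwap/ && $2 > 0 {print FILENAME, $2, $3}' /proc/[0-9]*/status 2>/dev/null | sort -k2 -rn | head
Each line is /proc/<pid>/status plus the swapped-out size in kB; map the PIDs back with ps to decide whose footprint to cut first.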
Task 2: Identify top memory consumers (and whether it’s anonymous or file cache)
cr0x@server:~$ ps -eo pid,comm,rss,vsz --sort=-rss | head
PID COMMAND RSS VSZ
4121 java 9876540 12582912
2330 postgres 2456780 3145728
1902 prometheus 1024000 2048000
1187 nginx 256000 512000
Meaning: RSS shows resident memory; VSZ can be misleading (reserved address space).
A single process with ballooning RSS is an obvious target.
Decision: If RSS growth correlates with latency spikes, apply a memory budget: cap caches, tune heap, or isolate the workload.
Task 3: See if the kernel is reclaiming aggressively
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1638400 120000 80000 5200000 10 25 120 300 1200 1800 20 8 60 12 0
4 1 1639000 90000 70000 5100000 80 120 400 1500 1600 2200 18 10 45 27 0
3 1 1639500 85000 65000 5000000 60 90 350 1200 1500 2100 15 9 50 26 0
2 0 1640000 110000 70000 5050000 15 30 150 500 1300 1900 19 8 58 15 0
2 0 1640000 115000 72000 5080000 5 10 100 320 1250 1850 21 7 62 10 0
Meaning: Non-zero si/so (swap in/out) and high wa (IO wait) indicate memory pressure turning into IO pain.
Decision: If swapping happens during peak, stop “optimizing” elsewhere. Fix memory pressure first or you’ll chase phantom bottlenecks.
Task 4: Check load average versus CPU saturation
cr0x@server:~$ uptime
14:22:10 up 37 days, 3:11, 2 users, load average: 18.42, 17.90, 16.77
Meaning: Load average includes runnable and uninterruptible tasks (often IO wait). High load doesn’t automatically mean CPU is maxed.
Decision: Pair this with CPU and IO checks before declaring “we need more cores.”
Task 5: Confirm CPU is actually the bottleneck
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 01/21/2026 _x86_64_ (16 CPU)
12:00:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:00:02 AM all 22.11 0.00 7.90 0.80 0.00 0.60 0.00 68.59
12:00:03 AM all 24.01 0.00 8.10 9.50 0.00 0.70 0.00 57.69
12:00:04 AM all 21.20 0.00 7.50 11.40 0.00 0.60 0.00 58.70
Meaning: %iowait is climbing; CPU isn’t saturated. The system is waiting on storage/network.
Decision: Don’t scale CPU. Investigate IO latency, filesystem contention, or network storage behavior.
Task 6: Quick disk space check (bytes) and inode check (metadata)
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 720G 180G 81% /
/dev/nvme1n1p1 200G 198G 2.0G 99% /var/lib/postgresql/wal
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 58982400 1048576 57933824 2% /
/dev/nvme1n1p1 13107200 13090000 17200 100% /var/lib/postgresql/wal
Meaning: A filesystem can be “not full” on bytes but full on inodes. WAL mount is full on both.
Decision: If inode usage is the issue, deleting large files won’t help. You need to delete many small files or rebuild with more inodes.
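If rebuilding is the answer, remember that inode density is fixed at mkfs time. A hedged sketch for ext4; the device and the bytes-per-inode ratio are illustrative, and mkfs destroys whatever is on the device:
cr0x@server:~$ sudo tune2fs -l /dev/nvme1n1p1 | grep -i 'inode count'    # what the current filesystem was built with
cr0x@server:~$ sudo mkfs.ext4 -i 4096 /dev/nvme1n1p1                     # roughly one inode per 4 KiB of capacity; destructive
Pick the ratio from your real small-file profile, not from a forum post.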
Task 7: Measure storage latency and utilization live
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 01/21/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
20.15 0.00 7.30 9.42 0.00 63.13
Device r/s w/s rkB/s wkB/s await %util
nvme0n1 120.0 200.0 4096.0 8192.0 6.20 78.00
nvme1n1 10.0 900.0 512.0 16384.0 45.30 99.20
Meaning: nvme1n1 is pegged (%util ~99) with high await. That’s a bottleneck.
Decision: Move write-heavy workloads (WAL, logs) off that device, increase device capability, or reduce write amplification.
Task 8: Find which processes are doing IO right now
cr0x@server:~$ sudo iotop -o -b -n 3
Total DISK READ: 5.12 M/s | Total DISK WRITE: 42.33 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2330 be/4 postgres 0.00 B/s 28.10 M/s 0.00 % 35.20 % postgres: wal writer
4121 be/4 cr0x 1.20 M/s 4.10 M/s 0.00 % 5.10 % java -jar service.jar
3011 be/4 root 0.00 B/s 3.20 M/s 0.00 % 2.00 % journald
Meaning: The WAL writer dominates writes. This isn’t “mysterious IO”; it’s your database doing its job.
Decision: If the IO device can’t keep up, tuning app queries won’t fix it fast. Address the storage path first.
Task 9: Check filesystem mount options that can bite you
cr0x@server:~$ mount | grep -E ' / |wal'
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
/dev/nvme1n1p1 on /var/lib/postgresql/wal type ext4 (rw,relatime,data=ordered)
Meaning: You’re looking for surprises: sync mounts, noatime/relatime, barriers, odd options that change write patterns.
Decision: If you find sync or an unexpected network filesystem under a latency-sensitive path, that’s likely your “640 KB” moment.
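A quick way to answer “what actually backs this path?” is findmnt, which resolves the mount for a directory instead of making you grep mount output. Using the WAL path from the example above:
cr0x@server:~$ findmnt -T /var/lib/postgresql/wal
TARGET                   SOURCE         FSTYPE OPTIONS
/var/lib/postgresql/wal  /dev/nvme1n1p1 ext4   rw,relatime,data=ordered
If the SOURCE or FSTYPE surprises you (an NFS export, an overlay, a loop device), you have found your suspect.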
Task 10: Check file descriptor limits (the modern “conventional memory” of sockets)
cr0x@server:~$ ulimit -n
1024
cr0x@server:~$ cat /proc/sys/fs/file-nr
42112 0 9223372036854775807
Meaning: The soft limit for this shell is 1024, which is tiny for many services; a service started by systemd may carry a different limit, so check its /proc/<pid>/limits. System-wide file handles are fine.
Decision: If you see “too many open files” errors or connection churn, raise per-service limits via systemd and verify with a restart.
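Raising it is usually a systemd drop-in, and the verification step matters because ulimit in your shell says nothing about the running service. A sketch, assuming a unit named myservice.service (the unit name and value are illustrative):
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/myservice.service.d
cr0x@server:~$ printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/myservice.service.d/limits.conf
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart myservice.service
cr0x@server:~$ grep 'Max open files' /proc/$(systemctl show -p MainPID --value myservice.service)/limits
The last line reads the limit the running process actually has, which is the only number that counts.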
Task 11: Check network backlog and SYN handling (queue limits that look like “random packet loss”)
cr0x@server:~$ sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
net.core.somaxconn = 128
net.ipv4.tcp_max_syn_backlog = 256
Meaning: These defaults can be too low for high-concurrency services, causing connection drops under bursts.
Decision: If you see SYN drops or accept queue overflow in metrics, tune these and load-test. Don’t “just add pods” and hope.
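Two things make that decision concrete: proof that the accept queue is overflowing, and a persistent tune rather than a one-off sysctl. A sketch; the values are starting points, and the application must also pass a large enough backlog to listen() for somaxconn to matter:
cr0x@server:~$ nstat -az TcpExtListenOverflows TcpExtListenDrops    # non-zero and growing = accept queue overflow
cr0x@server:~$ ss -lnt                                              # on LISTEN sockets, Recv-Q is the current queue, Send-Q is its cap
cr0x@server:~$ printf 'net.core.somaxconn=4096\nnet.ipv4.tcp_max_syn_backlog=8192\n' | sudo tee /etc/sysctl.d/90-backlog.conf
cr0x@server:~$ sudo sysctl --system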
Task 12: Check conntrack table usage (NAT and state tracking: the hidden ceiling)
cr0x@server:~$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 262144
cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_count
261900
Meaning: You’re nearly at the maximum. When this fills, new connections fail in ways that look like application bugs.
Decision: Increase the table (with memory awareness), reduce unnecessary connection churn, and set alerts at sensible thresholds.
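A sketch of both halves, persisting the new ceiling and computing the alertable ratio; the value is illustrative, and conntrack entries consume kernel memory, so size it deliberately:
cr0x@server:~$ printf 'net.netfilter.nf_conntrack_max=524288\n' | sudo tee /etc/sysctl.d/90-conntrack.conf
cr0x@server:~$ sudo sysctl --system
cr0x@server:~$ awk -v c=$(cat /proc/sys/net/netfilter/nf_conntrack_count) -v m=$(cat /proc/sys/net/netfilter/nf_conntrack_max) 'BEGIN { printf "conntrack at %.0f%% of max\n", 100*c/m }'
On some kernels the conntrack hash bucket count deserves a matching look; treat that as homework for your specific kernel, not a copy-paste.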
Task 13: Check kernel logs for the truth you didn’t want
cr0x@server:~$ dmesg -T | tail -n 8
[Mon Jan 21 13:58:11 2026] Out of memory: Killed process 4121 (java) total-vm:12582912kB, anon-rss:9876540kB, file-rss:10240kB, shmem-rss:0kB
[Mon Jan 21 13:58:11 2026] oom_reaper: reaped process 4121 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Mon Jan 21 14:00:02 2026] EXT4-fs warning (device nvme1n1p1): ext4_dx_add_entry: Directory index full, reached max htree level
Meaning: OOM kills and filesystem warnings are not “noise.” They are the system telling you your assumptions are wrong.
Decision: If you see OOM, stop adding features and start sizing memory. If you see filesystem index warnings, examine directory/file layout.
Task 14: Measure directory and small-file explosion
cr0x@server:~$ sudo find /var/lib/postgresql/wal -type f | wc -l
12983456
Meaning: Millions of files implies inode pressure, directory scaling issues, and backup/scan pain.
Decision: Re-architect file layout, rotate aggressively, or move to a design that doesn’t use the filesystem as a database.
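To localize which subtree is responsible rather than just admiring the grand total, GNU du can count inodes per directory:
cr0x@server:~$ sudo du --inodes -d 1 /var/lib/postgresql/wal | sort -n | tail
The biggest numbers at the bottom tell you where rotation or re-architecture has to start.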
Task 15: Confirm whether the app is throttled by cgroups (a very 2026 kind of “640 KB”)
cr0x@server:~$ cat /sys/fs/cgroup/memory.max
2147483648
cr0x@server:~$ cat /sys/fs/cgroup/memory.current
2130014208
Meaning: The workload is basically at its memory limit. You can tune all day; the wall is literal.
Decision: Increase the limit or reduce memory use. Also: set alerts on memory.current approaching memory.max, not after OOM.
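The alertable ratio is one line of awk; this assumes the cgroup v2 paths shown above and handles the case where memory.max reads “max” (no limit set):
cr0x@server:~$ awk -v cur=$(cat /sys/fs/cgroup/memory.current) -v max=$(cat /sys/fs/cgroup/memory.max) 'BEGIN { if (max == "max") print "no memory limit set"; else printf "%.1f%% of memory.max used\n", 100*cur/max }'
Feed that percentage to your monitoring and alert at something like 85%, not at the OOM kill.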
Fast diagnosis playbook: find the bottleneck fast
When everything is slow, you don’t have time to philosophize about the 1980s. You need a disciplined sequence that converges.
This playbook assumes a single host or node is misbehaving; adapt it for distributed systems by sampling multiple nodes.
First: confirm the failure mode (symptoms, not theories)
- Is it latency, throughput, or errors?
- Is it steady degradation or spiky bursts?
- Does it correlate with deploys, traffic, cron jobs, batch windows?
Run: load + CPU + memory + IO quick checks. Don’t guess which subsystem is guilty.
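A minimal first-sixty-seconds sweep, in the order I actually run it (mpstat and iostat come from the sysstat package; everything else is stock):
cr0x@server:~$ uptime; dmesg -T | tail -n 20
cr0x@server:~$ vmstat 1 5
cr0x@server:~$ mpstat -P ALL 1 3
cr0x@server:~$ free -h; df -h; df -i
cr0x@server:~$ iostat -xz 1 3
cr0x@server:~$ ss -s
None of these fix anything. They tell you which of the four suspects below deserves your next ten minutes.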
Second: check the four usual bottlenecks in order
- Memory pressure: free -h, vmstat, cgroup limits, OOM logs. If swapping or OOM is present, treat it as primary until disproven.
- Storage latency: iostat -xz, iotop, filesystem fullness and inode fullness. High await or %util near 100% is a smoking gun.
- CPU saturation: mpstat and per-process CPU. High %usr/%sys with low iowait points to CPU.
- Network and queues: backlog settings, conntrack, retransmits, drops. A full conntrack table can make an otherwise healthy service look haunted.
Third: localize impact before you “fix” it
- Which process is top CPU / top RSS / top IO?
- Which mount is filling?
- Which device has high latency?
- Which limit is near its ceiling (fds, conntrack, cgroups, disk, inodes)?
Fourth: pick the least risky mitigation
- Throttle the offender (rate limits, pause batch jobs).
- Add headroom (increase volume size, raise limits) if it’s safe and reversible.
- Move hot paths off contended resources (separate WAL/logs, isolate caches).
- Roll back recent changes if the timeline fits.
Fifth: make it non-repeatable
- Add an alert on the actual constraint you hit (not a proxy metric).
- Write a runbook that starts with “show me the limit and current usage.”
- Schedule a limit drill. Put it on the calendar like patching. Because it is patching—of your assumptions.
Common mistakes: symptom → root cause → fix
This is where the 640 KB myth earns its keep. The failure isn’t “we didn’t predict the future.”
The failure is “we didn’t identify a limit and treat it like a production dependency.”
1) Symptom: “Disk is 70% free but writes fail”
Root cause: inode exhaustion, filesystem metadata limits, or a different mount (WAL/logs) is full.
Fix: check df -i and mountpoints; move hot paths to dedicated volumes; rebuild filesystems with appropriate inode density if needed.
2) Symptom: “Load average is huge; CPU graphs look fine”
Root cause: IO wait or blocked tasks (storage latency, NFS hiccups).
Fix: run iostat -xz and iotop; investigate device await and %util; fix storage bottleneck before scaling CPU.
3) Symptom: “Random timeouts under bursts; adding pods doesn’t help”
Root cause: accept queue overflow, low somaxconn, SYN backlog exhaustion, or conntrack full.
Fix: tune backlog parameters, increase conntrack max with memory awareness, and reduce connection churn via keep-alives/pooling.
4) Symptom: “Latency improved after caching, then got worse than before”
Root cause: cache-induced memory pressure causing reclaim, swap, or GC thrash.
Fix: enforce cache budgets; monitor page faults and reclaim; move caches to dedicated tiers; test with realistic cardinality and churn.
5) Symptom: “Service restarts fix it for a while”
Root cause: resource leak (fds, memory, conntrack), fragmentation, or unbounded queues.
Fix: track growth over time; set hard limits; add leak detection; implement backpressure; don’t accept “restart is the runbook.”
6) Symptom: “Database is slow but CPU is low”
Root cause: storage latency, fsync contention, WAL on saturated device, or checkpoint bursts.
Fix: separate WAL onto fast storage, tune checkpoint settings carefully, measure fsync latency, and watch write amplification.
7) Symptom: “Plenty of RAM free; still OOM-killed”
Root cause: cgroup memory limits, per-container ceilings, or high anonymous RSS under a hard cap.
Fix: check /sys/fs/cgroup/memory.max; increase limits; reduce memory; ensure alerts are based on cgroup usage, not host free.
Checklists / step-by-step plan
Checklist 1: Build a “limits inventory” for any service that matters
- List all storage mounts used by the service (data, logs, WAL, tmp, cache).
- For each mount: record size, inode count, growth drivers, and cleanup mechanism.
- Record compute ceilings: CPU limits, memory limits, heap size, thread pools.
- Record OS ceilings: ulimit values, systemd limits, conntrack size, backlog settings.
- Record upstream ceilings: DB connection limits, API rate limits, queue quotas.
- For each ceiling: define a warning threshold and an emergency threshold.
- Create one dashboard that shows “current usage vs limit” for all of the above (a host-level sketch follows this checklist).
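A minimal sketch of what that dashboard’s inputs can look like on one host, assuming cgroup v2 and the paths used earlier in this article; in practice you would export these numbers to your metrics pipeline instead of running a script by hand:
#!/usr/bin/env bash
# limits-snapshot.sh: current usage against its ceiling for a few common walls.
set -euo pipefail

echo "== disk: bytes and inodes =="
df --output=target,pcent | tail -n +2
df --output=target,ipcent | tail -n +2

echo "== file handles (system-wide) =="
read -r allocated unused max < /proc/sys/fs/file-nr
echo "allocated=$allocated max=$max"

echo "== conntrack =="
if [ -r /proc/sys/net/netfilter/nf_conntrack_count ]; then
  echo "count=$(cat /proc/sys/net/netfilter/nf_conntrack_count) max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
fi

echo "== cgroup memory (this cgroup) =="
if [ -r /sys/fs/cgroup/memory.current ]; then
  echo "current=$(cat /sys/fs/cgroup/memory.current) max=$(cat /sys/fs/cgroup/memory.max)"
fi
Every line pairs a number with its ceiling, which is the entire point of the checklist above.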
Checklist 2: Capacity planning that doesn’t pretend to be prophecy
- Measure current peak (not average) for CPU, memory, IO, network, and storage growth.
- Identify the first resource that hits 80% during peak; that’s your first scaling target.
- Define headroom policy (example: keep >30% free space on hot volumes; keep conntrack <70%).
- Model growth as ranges, not single lines. Include seasonality and batch jobs.
- Test failure modes: simulate full disk, full inode table, conntrack near max, low fd limits (see the drill sketch after this checklist).
- Write down what “degraded but acceptable” looks like and how you’ll enforce it (throttling, shedding).
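A hedged sketch of the cheapest drills. Run them on a scratch mount or a test box, never on a shared production volume; /mnt/drill and ./your-service are placeholders:
cr0x@server:~$ sudo fallocate -l 50G /mnt/drill/bigfile    # fill bytes fast; confirm the "disk full" alert actually fires
cr0x@server:~$ cd /mnt/drill && for i in $(seq 1 500000); do : > "f$i"; done    # burn inodes without using bytes
cr0x@server:~$ (ulimit -n 128; exec ./your-service --smoke-test)    # placeholder binary: a smoke test under a deliberately tiny fd limit
The commands are not the point. The point is confirming that the alert, the runbook, and the mitigation behave the way the documentation claims.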
Checklist 3: Pre-deploy guardrails (the anti-640 KB routine)
- Before shipping a “performance” change, define the resource budget it will consume.
- Load-test with peak cardinality, not synthetic uniform traffic.
- Verify alerts exist for the actual new pressure point (memory.current, iowait, disk await).
- Ensure rollback is viable and quick.
- Run a canary that is big enough to hit real caches and queues.
FAQ
1) Did Bill Gates actually say “640 KB is enough for anybody”?
There’s no reliable primary-source evidence. The quote is widely considered misattributed or at least unverified.
Treat it as folklore, not history.
2) If the quote is dubious, why talk about it at all?
Because it’s a perfect proxy for a real failure mode: teams confusing a design boundary or default setting with a permanent truth.
The myth is annoying; the lesson is valuable.
3) What exactly was the “640 KB” limit?
Conventional memory on the IBM PC architecture: the usable RAM below the upper reserved area in the first 1 MB address space.
Hardware mapping needs (video, ROM) consumed the rest.
4) Why didn’t they just “use more than 1 MB”?
Later systems did, but compatibility mattered. Early software, DOS real mode assumptions, and the ecosystem made workarounds (EMS/XMS) more practical than breaking everything.
5) What is the modern equivalent of the 640 KB barrier?
Any hidden ceiling: container memory limits, conntrack tables, file descriptor caps, queue depths, inode exhaustion, tiny default volumes, or saturated storage devices.
The “barrier” is wherever your system hits a hard limit you didn’t model.
6) Isn’t this just “always plan for growth”?
Not quite. “Plan for growth” becomes a hand-wave. The real work is: identify specific limits, track usage against them, and rehearse what happens at 80/90/100%.
7) Should we always raise limits preemptively?
No. Raising limits blindly can move failure elsewhere or increase blast radius. Raise limits when you understand the resource cost and you have monitoring and backpressure.
8) How do I stop performance optimizations from backfiring?
Budget them. Every optimization consumes something: memory, IO, CPU, complexity, or operational risk.
Require a “resource bill” in reviews, and test under realistic peak patterns.
9) What if we don’t have time for a full capacity planning effort?
Do a limits inventory first. It’s cheap and immediately useful. Most outages aren’t from unknown unknowns; they’re from known limits nobody wrote down.
10) What’s one metric you’d add everywhere tomorrow?
“Usage vs limit” for each critical ceiling: disk bytes, inodes, memory.current vs memory.max, open fds vs ulimit, conntrack_count vs nf_conntrack_max.
Percentages alone lie; you need the ceiling in view.
Next steps you can actually do this week
Stop arguing about whether someone said a line in the 1980s. Your production system is busy creating its own quote.
The fix is not cynicism; it’s instrumentation and discipline.
- Write down your top 10 hard limits per service (memory, disk, inodes, fds, conntrack, queue depths, DB connections).
- Add alerts on the real ceilings, not proxies. Alert when you approach the wall, not when you’re already bleeding.
- Run one limit drill: pick a constraint and verify you can detect it, mitigate it, and prevent recurrence.
- Budget caches explicitly. If a cache doesn’t have a max size, it isn’t a cache; it’s a slow-motion incident.
- Separate hot IO paths (logs/WAL/tmp) so a noisy neighbor doesn’t take down your core storage.
The 640 KB myth won’t die because it’s memorable. Make your limits memorable too—by putting them on dashboards and runbooks, where they belong.