Your service is “CPU-bound.” The dashboards say so. CPU is at 80–90%, latency is ugly, and the team’s first instinct is to throw cores at it.
Then you add cores and nothing improves. Or it gets worse. Congratulations: you just met the real boss fight—memory.
CPU caches (L1/L2/L3) exist because modern CPUs can do arithmetic faster than your system can feed them data. Most production performance failures
aren’t “the CPU is slow.” They’re “the CPU is waiting.” This piece explains caches without baby talk, then shows how to prove what’s happening
on a real Linux box with commands you can run today.
Why memory wins (and the CPU mostly waits)
CPUs are ridiculous. A modern core can execute multiple instructions per cycle, speculate, reorder, vectorize, and generally act like an overcaffeinated
accountant doing taxes at 4 a.m. Meanwhile, DRAM is comparatively sluggish. The core can retire several instructions per nanosecond; a trip to DRAM can take
tens to hundreds of nanoseconds depending on topology, contention, and whether you wandered into remote NUMA.
The practical outcome: your CPU spends a lot of time stalled on memory loads. Not disk. Not network. Not even “slow code” in the usual sense.
It’s waiting for the next cache line.
Caches are an attempt to keep the CPU busy by keeping frequently used data close. They are not “nice to have.” They are the only reason general-purpose
computing works at current clock rates. If every load hit DRAM, your cores would spend most cycles twiddling bits in existential dread.
Here’s the mental model that survives contact with production: performance is dominated by how often you miss caches and
how expensive those misses are. The most expensive misses are the ones that escape the chip package and go out to DRAM, and the truly
spicy ones are remote NUMA DRAM accessed over an interconnect while other cores fight for bandwidth.
One rule of thumb: when your request path touches “lots of stuff,” the cost isn’t the arithmetic; it’s the pointer chasing and cache misses.
And if you do it concurrently with many threads, you can turn your memory subsystem into the bottleneck while CPU graphs lie to your face.
L1/L2/L3 in plain English
Think of cache levels as increasingly larger, increasingly slower “pantries” between the core and DRAM.
The naming is historical and simple: L1 is closest to the core, L2 is next, L3 is usually shared among cores on a socket (not always), and then DRAM.
What each level is for
- L1 cache: tiny and extremely fast. Often split into L1i (instructions) and L1d (data). It’s the first place the core looks.
- L2 cache: bigger, a bit slower, typically private per core. It catches what falls out of L1.
- L3 cache: much bigger, slower, often shared across cores. It reduces DRAM trips and acts like a shock absorber for contention.
What “hit” and “miss” mean operationally
A cache hit means the data you need is already nearby; the load is satisfied quickly, and the pipeline keeps moving.
A cache miss means the CPU must fetch that data from a lower level. If the miss reaches DRAM, the core can stall hard.
Misses happen because caches are finite, and because real workloads have messy access patterns. The CPU tries to predict and prefetch, but it can’t predict
everything—especially pointer-heavy code, random access, or data structures larger than the cache.
Why you can’t “just use L3”
People sometimes talk as if L3 is a magic shared pool that will hold your working set. It’s not. L3 is shared, contended, and, depending on the architecture,
inclusive, exclusive, or non-inclusive of the inner levels. Also, L3 bandwidth and latency are still much better than DRAM’s, but they’re not free.
If your workload’s working set is bigger than L3, you’re going to DRAM. If it’s bigger than DRAM… well, that’s called “swap,” and it’s a cry for help.
Cache lines, locality, and the “you touched it, you bought it” rule
CPUs don’t fetch single bytes into cache. They fetch cache lines, commonly 64 bytes on x86_64. When you load one value, you often drag
in nearby values too. That’s good if your code uses nearby memory (spatial locality). It’s bad if you only wanted one field and the rest is junk,
because you just polluted the cache with stuff you won’t reuse.
Locality is the whole game:
- Temporal locality: if you use it again soon, caching helps.
- Spatial locality: if you use nearby memory, caching helps.
Databases, caches, and request routers often live or die by how predictable their access patterns are. Sequential scans can be fast because hardware
prefetchers can keep up. Random pointer chasing through a giant hash table can be slow because every step is “surprise, go to memory.”
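To make that concrete, here is a minimal C sketch (function and type names are invented for illustration): the same sum, once over a contiguous array and once over a linked list. The array version streams through cache lines and the prefetcher keeps up; the list version pays a dependent, often cold, load per element.

#include <stddef.h>

struct node { long value; struct node *next; };   /* pointer-chased layout */

/* Contiguous scan: neighbors share cache lines, hardware prefetch keeps up. */
long sum_array(const long *values, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}

/* Pointer chase: every ->next is a dependent load to somewhere else in memory. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}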
Dry operational translation: if you see high CPU but also high stalled cycles, you don’t have a “compute” problem. You have a “feeding the core” problem.
Your hottest code path is probably dominated by cache misses or branch mispredicts, not math.
Joke #1: Cache misses are like “quick questions” in corporate chat—each one seems small until you realize your entire day is waiting on them.
Prefetching: the CPU’s attempt to be helpful
CPUs try to detect patterns and prefetch future cache lines. It works well for streaming and strided access. It works poorly for pointer chasing, because
the address of the next load depends on the result of the previous load.
This is why “I optimized the loop” sometimes does nothing. The loop isn’t the problem; the memory dependency chain is.
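A minimal sketch of such a dependency chain (purely illustrative): the address of each load comes out of the previous load, so wider cores and prefetchers can’t overlap the work no matter how tight the loop body is.

#include <stddef.h>

/* next[] holds a permutation of indices; every step must wait for the
 * previous load to complete before it even knows what to fetch next. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    for (size_t k = 0; k < steps; k++)
        i = next[i];   /* one cache or DRAM round trip per iteration */
    return i;
}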
The part nobody wants to debug: coherency and false sharing
In multi-core systems, each core has its own caches. When one core writes to a cache line, other cores’ copies must be invalidated or updated so everyone
sees a consistent view. That’s cache coherency. It’s necessary. It’s also a performance trap.
False sharing: when your threads fight over a cache line they don’t “share”
False sharing is when two threads update different variables that happen to live on the same cache line. They’re not logically sharing data, but the cache
coherence protocol treats the entire line as a unit. So each write triggers invalidations and ownership transfers, and your performance falls off a cliff.
Symptom-wise, it looks like “more threads made it slower” with lots of CPU time spent, but not much progress. You’ll see high cache-to-cache traffic and
coherence misses if you look with the right tools.
Joke #2: False sharing is when two teams “own” the same spreadsheet cell; the edits are correct, the process is not.
Write-heavy workloads pay extra
Reads can be shared. Writes require exclusive ownership of the line, which triggers coherence actions. If you have a hot counter updated by many threads,
the counter becomes a serialized bottleneck even though you “have lots of cores.”
This is why per-thread counters, sharded locks, and batching exist. You’re not being fancy. You’re avoiding a physics bill.
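A minimal C11 sketch of the per-thread counter idea (names are invented; the 64-byte alignment assumes the line size you verify in Task 2): each thread writes only its own line, so counters stop ping-ponging between cores.

#include <stdalign.h>
#include <stdatomic.h>

enum { MAX_THREADS = 64 };

struct padded_counter {
    alignas(64) atomic_long value;   /* alignment pads the struct to one line */
};

static struct padded_counter counters[MAX_THREADS];

/* Hot path: each thread bumps only its own slot. */
static inline void count_event(int thread_id) {
    atomic_fetch_add_explicit(&counters[thread_id].value, 1,
                              memory_order_relaxed);
}

/* Cold path: readers sum all slots; a slightly stale total is acceptable. */
long total_events(void) {
    long sum = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&counters[i].value, memory_order_relaxed);
    return sum;
}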
NUMA: the latency tax you pay when you scale
On many servers, memory is physically attached to CPU sockets. Accessing “local” memory is faster than accessing memory attached to another socket.
That’s NUMA (Non-Uniform Memory Access). It’s not an edge case. It’s the default on a lot of real production iron.
You can get away with ignoring NUMA until you can’t. The failure mode shows up when:
- you scale threads across sockets,
- your allocator spreads pages across nodes,
- or your scheduler migrates threads away from their memory.
Then latency spikes, throughput plateaus, and the CPU looks “busy” because it’s stalled. You can easily waste weeks tuning application code when the fix
is pinning processes, fixing allocation policy, or choosing fewer sockets with higher clocks for latency-sensitive workloads.
Interesting facts and history you can repeat in meetings
- The “memory wall” became a mainstream concern in the 1990s: CPU speed improved faster than DRAM latency, making caches mandatory.
- Cache lines are a design choice: 64 bytes is common on x86, but other architectures have used different sizes; it’s a balance of bandwidth and pollution.
- L1 is often split into instruction and data caches because mixing them causes conflicts; code fetch and data loads have different patterns.
- L3 sharing is intentional: it helps when threads share read-mostly data and reduces DRAM trips, but it also creates contention under load.
- Hardware prefetchers exist because sequential access is common; they can dramatically speed streaming reads without code changes.
- Coherency protocols (like MESI variants) are a big reason multi-core “just works,” but they also impose real costs under write contention.
- TLBs are also caches: the Translation Lookaside Buffer caches address translations; TLB misses can hurt like cache misses.
- Huge pages reduce TLB pressure by mapping more memory per entry; they can help some workloads and hurt others.
- Early multi-core scaling surprises in the 2000s taught teams that “more threads” is not a performance plan if memory and locking aren’t handled.
Fast diagnosis playbook
When a system is slow, you want to find the limiting resource fast, not write poetry about microarchitecture. This is a field checklist.
First: confirm whether you’re compute-bound or stalled
- Check CPU utilization and scheduler metrics: run queue length, context switches, IRQ pressure.
- Look for stalled cycles / cache misses with perf if you can.
- If instructions-per-cycle is low and cache misses are high, it’s probably memory-latency or memory-bandwidth bound.
Second: decide if it’s latency-bound or bandwidth-bound
- Latency-bound: pointer chasing, random access, lots of LLC misses, low memory bandwidth.
- Bandwidth-bound: streaming, large scans, many cores reading/writing, high memory bandwidth near platform limits.
Third: check NUMA and topology
- Are threads running on one socket but allocating on another?
- Are you cross-socket thrashing the LLC?
- Is the workload sensitive to tail latency (it usually is), making remote memory a silent killer?
Fourth: check the “obvious but boring”
- Are you swapping or under memory pressure (reclaim storms)?
- Are you hitting cgroup memory limits?
- Are you saturating a single lock or counter (false sharing, contended mutex)?
Paraphrased idea, attributed to Gene Kim: fast feedback loops beat heroics; measure first, then change one thing at a time.
Practical tasks: commands, outputs, and decisions
These are meant to be run on a Linux host where you’re diagnosing performance. Some require root or perf permissions.
The point isn’t to memorize commands; it’s to connect outputs to decisions.
Task 1: Identify cache sizes and topology
cr0x@server:~$ lscpu
Architecture: x86_64
CPU(s): 64
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
L1d cache: 32K
L1i cache: 32K
L2 cache: 1M
L3 cache: 35.8M
NUMA node(s): 2
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
What it means: you have two sockets, two NUMA nodes, and (typically) one L3 per socket. Any working set that spills out of ~36MB per socket
starts paying DRAM prices.
Decision: if the service is latency sensitive, plan for NUMA awareness (pinning, memory policy) and keep hot data structures small.
Task 2: Verify cache line size (and stop guessing)
cr0x@server:~$ getconf LEVEL1_DCACHE_LINESIZE
64
What it means: false sharing risk boundaries are 64 bytes.
Decision: in low-level code, align hot per-thread counters/structs to 64B boundaries to avoid ping-ponging cache lines.
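If you would rather not hard-code 64, a small sketch that asks at runtime (the constant is a glibc extension and may be missing or zero on some platforms, hence the fallback):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (line <= 0)
        line = 64;   /* conservative fallback for x86_64 */
    printf("cache line: %ld bytes\n", line);
    return 0;
}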
Task 3: Confirm NUMA distances
cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 256000 MB
node 0 free: 120000 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 256000 MB
node 1 free: 118000 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What it means: remote memory is ~2x the “distance.” Not literally 2x latency, but directionally meaningful.
Decision: if you’re tail-latency sensitive, keep threads and their memory local (or reduce cross-socket traffic by limiting CPU affinity).
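One hedged way to do that at startup (myservice is a placeholder binary): bind both the CPUs and the memory of the process to node 0 with numactl.
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 ./myservice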
Task 4: Check if the kernel is fighting you with automatic NUMA balancing
cr0x@server:~$ cat /proc/sys/kernel/numa_balancing
1
What it means: the kernel may migrate pages to “follow” threads. Great sometimes, noisy other times.
Decision: for stable, pinned workloads, you may disable it (carefully, tested) or override with explicit placement.
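If you do decide to disable it for a pinned, well-tested workload, it is one sysctl away (not persistent across reboots; use a drop-in under /etc/sysctl.d/ if it earns its keep):
cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0
kernel.numa_balancing = 0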
Task 5: Observe per-process NUMA memory placement
cr0x@server:~$ pidof myservice
24718
cr0x@server:~$ numastat -p 24718
Per-node process memory usage (in MBs) for PID 24718 (myservice)
Node 0 38000.25
Node 1 2100.10
Total 40100.35
What it means: the process is mostly using node0 memory. If its threads run on node1, you’ll pay remote penalties.
Decision: align CPU affinity and memory allocation policy; if it’s uneven by accident, fix scheduling or startup placement.
Task 6: Check memory pressure and swapping (the performance cliff)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 1200000 80000 9000000 0 0 2 15 900 3200 45 7 48 0 0
5 0 0 1180000 80000 8900000 0 0 0 0 1100 4100 55 8 37 0 0
7 0 0 1170000 80000 8850000 0 0 0 0 1300 5200 61 9 30 0 0
What it means: no swap-in/out (si/so = 0), so you’re not in the “everything is terrible” category. CPU is busy, but not waiting on IO.
Decision: proceed to cache/memory analysis; don’t waste time blaming disk.
Task 7: See if you’re bandwidth-bound (quick read on memory throughput)
cr0x@server:~$ sudo perf stat -a -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses -I 1000 -- sleep 5
# time(ms) cycles instructions cache-references cache-misses LLC-loads LLC-load-misses
1000 5,210,000,000 2,340,000,000 120,000,000 9,800,000 22,000,000 6,700,000
2000 5,300,000,000 2,310,000,000 118,000,000 10,200,000 21,500,000 6,900,000
3000 5,280,000,000 2,290,000,000 121,000,000 10,500,000 22,300,000 7,100,000
What it means: instructions per cycle is low (roughly 0.45 here), and cache/LLC misses are significant. The CPU is doing a lot of waiting.
Decision: treat this as memory-latency dominated unless bandwidth counters show saturation; look for random access, pointer chasing, or NUMA.
Task 8: Identify top functions and whether they stall (profile with perf)
cr0x@server:~$ sudo perf top -p 24718
Samples: 2K of event 'cycles', Event count (approx.): 2500000000
18.50% myservice myservice [.] hashmap_lookup
12.20% myservice myservice [.] parse_request
8.90% libc.so.6 libc.so.6 [.] memcmp
7.40% myservice myservice [.] cache_get
5.10% myservice myservice [.] serialize_response
What it means: hotspots are lookup/compare heavy—classic candidates for cache misses and branch mispredicts.
Decision: inspect data structures: are keys scattered? are you chasing pointers? can you pack data? can you reduce comparisons?
Task 9: Check for scheduler migration (NUMA’s quiet enabler)
cr0x@server:~$ pidstat -w -p 24718 1 3
Linux 6.5.0 (server) 01/09/2026 _x86_64_ (64 CPU)
01:02:11 UID PID cswch/s nvcswch/s Command
01:02:12 1001 24718 1200.00 850.00 myservice
01:02:13 1001 24718 1350.00 920.00 myservice
01:02:14 1001 24718 1100.00 800.00 myservice
What it means: voluntary switches (cswch/s) usually mean blocking (locks, IO); involuntary switches (nvcswch/s) usually mean preemption from too many runnable threads. Either way, churn like this can migrate threads away from their warm caches and NUMA-local memory.
Decision: if latency is spiky, reduce thread count, investigate locks, or pin critical threads to reduce migration.
Task 10: Check run queue and per-CPU saturation (don’t confuse “busy” with “progress”)
cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.5.0 (server) 01/09/2026 _x86_64_ (64 CPU)
01:03:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:03:02 AM all 62.0 0.0 9.0 0.1 0.0 0.5 0.0 0.0 0.0 28.4
01:03:02 AM 0 95.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
01:03:02 AM 32 20.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 75.0
What it means: CPU0 is pegged while CPU32 is mostly idle. This can be an affinity issue, single hot shard, or a lock funnel.
Decision: if a single core is hot, scale won’t happen until you remove the funnel. Investigate per-core work distribution and locks.
Task 11: Verify CPU affinity and cgroup constraints
cr0x@server:~$ taskset -pc 24718
pid 24718's current affinity list: 0-15
What it means: the process is pinned to CPUs 0–15 (one socket subset). That may be intentional or accidental.
Decision: if pinned, ensure memory is local to that node; if accidental, fix your unit file / orchestrator CPU set.
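If the service runs under systemd, a hedged drop-in sketch (unit and file names are hypothetical; NUMAPolicy/NUMAMask require a reasonably recent systemd, roughly v243+):
# /etc/systemd/system/myservice.service.d/affinity.conf
[Service]
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0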
Task 12: Check LLC miss rate per process (perf stat on PID)
cr0x@server:~$ sudo perf stat -p 24718 -e cycles,instructions,LLC-loads,LLC-load-misses -- sleep 10
Performance counter stats for process id '24718':
18,320,000,000 cycles
7,410,000,000 instructions # 0.40 insn per cycle
210,000,000 LLC-loads
78,000,000 LLC-load-misses # 37.14% of all LLC hits
10.001948393 seconds time elapsed
What it means: a ~37% LLC load miss rate is a big flashing sign that your working set doesn’t fit in cache or access is random.
Decision: reduce working set, increase locality, or change data layout. Also validate NUMA locality.
Task 13: Spot page faults and major faults (TLB and paging hints)
cr0x@server:~$ pidstat -r -p 24718 1 3
Linux 6.5.0 (server) 01/09/2026 _x86_64_ (64 CPU)
01:04:10 UID PID minflt/s majflt/s VSZ RSS %MEM Command
01:04:11 1001 24718 8200.00 0.00 9800000 4200000 12.8 myservice
01:04:12 1001 24718 7900.00 0.00 9800000 4200000 12.8 myservice
01:04:13 1001 24718 8100.00 0.00 9800000 4200000 12.8 myservice
What it means: high minor faults can be normal (demand paging, mapped files), but if faults spike under load it can correlate with
page churn and TLB pressure.
Decision: if faults correlate with latency spikes, check allocator behavior, mmap usage, and consider huge pages only after measuring.
Task 14: Validate transparent huge pages (THP) status
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
What it means: THP is always on. Some databases love it, some latency-sensitive services hate the allocation/compaction behavior.
Decision: if you see periodic stalls, test madvise or never in staging and compare tail latency.
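A minimal way to try the madvise setting in staging (takes effect immediately and does not survive a reboot):
cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
madvise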
Task 15: Check memory bandwidth counters (Intel/AMD tooling varies)
cr0x@server:~$ sudo perf stat -a -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ -- sleep 5
Performance counter stats for 'system wide':
8,120,000,000 uncore_imc_0/cas_count_read/
4,010,000,000 uncore_imc_0/cas_count_write/
5.001234567 seconds time elapsed
What it means: these counts approximate DRAM transactions; if they’re high and near platform limits, you’re bandwidth-bound.
Decision: if bandwidth-bound, adding cores won’t help. Reduce data scanned, compress, improve locality, or move work closer to data.
Task 16: Identify lock contention (often misdiagnosed as “cache issues”)
cr0x@server:~$ sudo perf lock report -p 24718
Name acquired contended total wait (ns) avg wait (ns)
pthread_mutex_lock 12000 3400 9800000000 2882352
What it means: threads are spending real time waiting on locks. This can amplify cache effects (cache lines bounce with lock ownership).
Decision: reduce lock granularity, shard, or change algorithm. Don’t “optimize memory” if your bottleneck is a mutex.
Task 17: Watch LLC occupancy and memory stalls (if supported)
cr0x@server:~$ sudo perf stat -p 24718 -e cpu/mem-loads/,cpu/mem-stores/ -- sleep 5
Performance counter stats for process id '24718':
320,000,000 cpu/mem-loads/
95,000,000 cpu/mem-stores/
5.000912345 seconds time elapsed
What it means: heavy load/store traffic suggests the work is memory-centric. Combine with LLC miss metrics to decide if it’s cache-friendly.
Decision: if load-heavy with high miss rates, focus on data structure locality and reducing pointer chasing.
Task 18: Validate that you’re not accidentally throttling (frequency matters)
cr0x@server:~$ cat /proc/cpuinfo | grep -m1 "cpu MHz"
cpu MHz : 1796.234
What it means: CPU frequency is relatively low (possibly power saving or thermal constraints).
Decision: if performance regressed after a platform change, validate CPU governor and thermals before blaming caches.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A payments service started timing out every day at roughly the same hour. The team called it “CPU saturation” because dashboards showed CPU at 90%,
and the flame graph highlighted JSON parsing and some hashing. They did what teams do: added instances, increased thread pools, and raised autoscaling limits.
The incident got worse. The latency tails grew teeth.
The wrong assumption was subtle: “High CPU means the core is busy computing.” In reality, the cores were busy waiting. perf stat showed low IPC
and a high LLC miss rate. The request path had a cache-backed “enrichment” lookup that had quietly expanded: more keys, more metadata, more pointer-heavy
objects, and a working set that no longer fit anywhere near L3.
Then the scaling change kicked it into a new failure mode. More threads meant more random accesses in parallel, which increased memory-level parallelism
but also contention. The memory controller got hot, bandwidth rose, and average latency rose with it. It was a classic: the more you tried to push,
the more the memory subsystem pushed back.
The fix wasn’t heroic. They reduced object overhead, packed fields into contiguous arrays for the hot path, and capped the enrichment set per request.
They also stopped pinning the process across both sockets without controlling memory placement. Once locality improved, CPU utilization stayed high,
but throughput climbed and tail latency fell. The CPU graphs looked the same. The system behaved differently. That’s the lesson.
Mini-story 2: The optimization that backfired
A team tried to speed up an analytics API by “improving caching.” They replaced a simple vector of structs with a hash map keyed by string to avoid
linear scans. Microbenchmarks on a laptop looked great. Production disagreed.
The new structure destroyed locality. The old code scanned a contiguous array: predictable, prefetch-friendly, cache-friendly. The new code did random
lookups, each involving pointer chasing, string hashing, and multiple dependent loads. On real servers under load, it turned a mostly L2/L3-friendly
loop into a DRAM party.
Worse, the hash map introduced a shared resize path. Under burst traffic, resizes happened, locks contended, and cache lines bounced between cores.
The team saw higher CPU and concluded “we need more CPU.” But the “more CPU” increased contention, and their p99 got uglier.
They rolled it back, then implemented a boring compromise: keep a sorted vector for the hot path and do occasional rebuilds off the request thread,
with a stable snapshot pointer. They accepted O(log n) with good locality instead of O(1) with terrible constants. Production became boring again,
which is the kind of success you can build a career on.
Mini-story 3: The boring but correct practice that saved the day
A storage-adjacent service—lots of metadata reads, some writes—was migrated to a new hardware platform. Everyone expected it to be faster. It wasn’t.
There were sporadic latency spikes and occasional throughput drops, but nothing obvious: no swapping, disks fine, network fine.
The team had one habit that saved them: a “performance triage bundle” they ran for any regression. It included lscpu,
NUMA topology, perf stat for IPC and LLC misses, and a quick check of CPU frequency and governors. Not exciting. Reliable.
The bundle immediately showed two surprises. First, the new hosts had more sockets, and the service was being scheduled across sockets without
consistent memory placement. Second, CPU frequency was lower under sustained load due to power settings in the baseline image.
The fix was procedural: they updated the host tuning baseline (governor, firmware settings where appropriate), and they pinned the service to a single
NUMA node with memory bound to that node. No code changes. Latency stabilized. The rollout finished. The postmortem was short, which is a luxury.
Common mistakes (symptoms → root cause → fix)
1) “CPU is high so we need more CPU”
Symptoms: CPU 80–95%, throughput flat, p95/p99 worse when adding threads/instances.
Root cause: low IPC due to cache misses or memory stalls; the CPU is “busy waiting.”
Fix: measure IPC and LLC misses with perf stat; reduce working set, improve locality, or fix NUMA placement. Don’t scale threads blindly.
2) “Hash map is always faster than a scan”
Symptoms: slower after switching to “O(1)” structure; perf shows hotspots in hashing/strcmp/memcmp.
Root cause: random access and pointer chasing cause DRAM trips; poor locality beats big-O on real hardware.
Fix: prefer contiguous structures for hot paths (arrays, vectors, sorted vectors). Benchmark with production-like datasets and concurrency.
3) “More threads = more throughput”
Symptoms: throughput improves then collapses; context switches increase; LLC misses climb.
Root cause: memory bandwidth saturation, lock contention, or false sharing becomes dominant.
Fix: cap thread count near the knee of the curve; shard locks/counters; avoid shared hot writes; pin threads if NUMA-sensitive.
4) “NUMA doesn’t matter; Linux will handle it”
Symptoms: good average latency, terrible tail latency; regressions when moving to multi-socket hosts.
Root cause: remote memory access and cross-socket traffic; scheduler migration breaks locality.
Fix: use numastat and numactl; pin CPU and memory; consider running one process per socket for predictability.
5) “If we disable caches, we can test worst-case”
Symptoms: someone suggests turning off caches or flushing constantly as a test strategy.
Root cause: misunderstanding; modern systems are not designed for that mode and results won’t map to reality.
Fix: test with realistic working sets and access patterns; use profiling counters, not science-fair stunts.
6) “Huge pages always help”
Symptoms: THP enabled and periodic stalls; compaction activity; latency spikes during memory growth.
Root cause: THP allocation/compaction overhead; mismatch with allocation patterns.
Fix: benchmark always vs madvise vs never; if using huge pages, allocate up front and monitor tail latency.
Checklists / step-by-step plan
Checklist A: Prove it’s memory, not compute
- Capture CPU topology: lscpu. Record sockets/NUMA and cache sizes.
- Check swapping/memory pressure: vmstat 1. If si/so > 0, fix memory first.
- Measure IPC and LLC misses: perf stat (system-wide or PID). Low IPC + high LLC misses = memory stall suspicion.
- Look for hot functions: perf top. If hotspots are lookup/compare/alloc, expect locality issues.
Checklist B: Decide whether it’s latency-bound or bandwidth-bound
- If LLC miss rate is high but memory bandwidth counters are moderate: latency-bound pointer chasing is likely.
- If bandwidth counters are near platform limits and cores don’t help: bandwidth-bound scan/stream is likely.
- Change one thing and re-measure: reduce concurrency, reduce working set, or change access pattern.
Checklist C: Fix NUMA before rewriting code
- Map NUMA nodes: numactl --hardware.
- Check process memory per node: numastat -p PID.
- Check CPU affinity: taskset -pc PID.
- Align: pin CPUs to one node and bind memory to the same node (test in staging first).
Checklist D: Make data cache-friendly (the boring wins)
- Flatten pointer-heavy structures in hot paths.
- Pack hot fields together; separate cold fields (hot/cold split); see the sketch after this list.
- Prefer arrays/vectors and predictable iteration over random access.
- Shard write-heavy counters; batch updates.
- Benchmark with production-like sizes; cache effects appear when data is large enough to matter.
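For the hot/cold split above, a minimal C sketch (field names are invented): the scanned path touches only a small hot struct, and rarely used fields live behind one pointer that is followed only on the slow path.

#include <stdint.h>

struct conn_cold {                /* rarely touched; keep it off the scan path */
    char     peer_name[64];
    char     user_agent[128];
    uint64_t created_at;
};

struct conn_hot {                 /* scanned every tick; keep it small */
    uint32_t state;
    uint32_t pending_bytes;
    uint64_t last_activity_ns;
    struct conn_cold *cold;       /* detour taken only when actually needed */
};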
FAQ
1) Is L1 always faster than L2, and L2 always faster than L3?
Generally yes in latency terms, but real performance depends on contention, access pattern, and whether the line is already present due to prefetching.
Also, bandwidth characteristics differ; L3 may deliver high aggregate bandwidth but higher latency.
2) Why does my CPU show 90% usage if it’s “waiting on memory”?
Because “CPU usage” mostly means the core is not idle from the scheduler’s point of view. A core stalled on a memory load is still assigned to your thread
and still burning cycles; waiting on DRAM doesn’t count as idle time. You need counters (IPC, cache misses, stalled cycles) to see the waiting.
3) What’s the difference between CPU cache and the Linux page cache?
CPU caches are hardware-managed and tiny (KB/MB). Linux page cache is OS-managed, uses DRAM, and caches file-backed data (GBs).
They interact, but they solve different problems at different scales.
4) Can I “increase L3 cache” by changing software?
Not literally. What you can do is act like you have more cache by reducing your hot working set, improving locality, and avoiding cache pollution.
5) Why do linked lists and pointer-heavy trees perform badly?
They destroy spatial locality. Each pointer leads to a different cache line, often far away. That means dependent loads and frequent DRAM trips,
which stall the core.
6) When should I care about false sharing?
When you have multiple threads updating distinct fields/counters in tight loops and performance gets worse with more threads.
It’s common in metrics counters, ring buffers, and naive “per-connection state arrays.”
7) Are cache misses always bad?
Some misses are inevitable. The question is whether your workload is structured so that misses are amortized (streaming) or catastrophic (random dependent loads).
You optimize to reduce misses on the hot path, not to achieve a mythical “zero misses.”
8) Do faster CPUs fix memory problems?
Sometimes they make them worse. Faster cores can demand data faster and hit the memory wall sooner. A platform with better memory bandwidth,
better NUMA topology, or larger caches may matter more than raw GHz.
9) Should I pin everything to one socket?
For latency-sensitive services, pinning to one socket (and binding memory) can be a big win: predictable locality, fewer remote accesses.
For throughput-heavy jobs, spreading across sockets may help—if you keep locality and avoid shared write hotspots.
10) What metric should I watch in dashboards to catch cache problems early?
If you can, export IPC (instructions per cycle) and LLC miss rates or stalled cycles from perf/PMU tooling. If not, watch for the pattern:
CPU rises, throughput flat, latency up when scaling. That pattern screams memory.
Conclusion: what to do next week
CPU caches aren’t trivia. They’re the reason your “simple” change can tank p99 and why adding cores often just adds disappointment.
Memory wins because it sets the pace: if your core can’t get data cheaply, it can’t do useful work.
Practical next steps:
- Put perf stat (IPC + LLC misses) into your standard incident toolkit for “CPU-bound” pages.
- Document NUMA topology per host class and decide whether services should be pinned (and how) by default.
- Audit hot paths for locality: flatten structures, separate hot/cold fields, and avoid shared write hotspots.
- Benchmark with realistic dataset sizes. If your benchmark fits in L3, it’s not a benchmark; it’s a demo.
- When optimization is suggested, ask one question first: “What does this do to cache misses and memory traffic?”