Why a Cache Monster Can Beat a “Faster” CPU

You bought the “faster” CPU. Higher clocks, newer microarchitecture, a spec sheet full of bravado.
And your production latency got worse.

That’s not bad luck. That’s physics, plus a pile of optimistic assumptions about what “faster” means.
In real systems, the CPU isn’t usually waiting on itself. It’s waiting on data. When the data isn’t in cache,
the CPU becomes a very expensive space heater that occasionally does work.

What “cache monster” actually means

“Cache monster” isn’t an insult. It’s a compliment for a system (or a chip) that wins by keeping the working set
close to the cores. The monster can be:

  • A CPU with a big last-level cache (LLC/L3), high bandwidth, and a good prefetcher.
  • A platform with fast memory channels and sane NUMA topology.
  • An application that behaves like it respects locality (and doesn’t fling pointers like confetti).
  • A storage stack with aggressive caching (page cache, database buffer cache, ZFS ARC, Redis, CDN).

The trick is that “faster CPU” often means “can retire more instructions per second if the data is already there.”
If the system keeps missing in caches, you’re not limited by peak compute. You’re limited by the speed of fetching.
And fetching, in a modern machine, is a multi-stage bureaucratic process.

Think of it like a kitchen. Your chef can chop faster (higher IPC, higher clock), sure. But if the ingredients
are still on a truck somewhere (cache miss), it’s a very quiet kitchen.

The memory hierarchy: where time goes to die

Latency is the real currency

Systems people learn this the hard way: throughput gets headlines, latency writes your incident reports.
A single cache miss can stall a core for long enough to make your fancy “faster” CPU irrelevant.
This isn’t theoretical; it’s why performance engineering is mostly a study of waiting.

Roughly, you’re dealing with a ladder of “near” to “far”:

  • Registers: basically immediate.
  • L1 cache: tiny and extremely fast.
  • L2 cache: larger, still fast.
  • L3/LLC: shared, bigger, slower.
  • DRAM: much slower, and shared across many hungry cores.
  • Remote NUMA node memory: slower still.
  • SSD/NVMe: orders of magnitude slower than DRAM.
  • Network storage: now you’re negotiating with the laws of geography.

The CPU’s job is to execute instructions. Your job is to keep those instructions fed with data from the closest
possible place in that ladder. Because every rung down you fall, you pay in stalled cycles, blown tail latency,
and suspicious “CPU utilization” graphs that look fine right up until users start screaming.

Caches aren’t just “small RAM”

Caches are built around locality:

  • Temporal locality: if you used it, you might use it again soon.
  • Spatial locality: if you used this address, you might use nearby addresses soon.

Hardware caches operate on cache lines (commonly 64 bytes). That means you don’t fetch a single integer; you fetch
its neighbors too, whether you like them or not.

This is why “faster CPU” is often defeated by “better layout.” A slightly slower CPU that gets cache hits can
beat a faster one that thrashes the LLC and spends its life on DRAM round-trips.
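
You can check what your own hardware gives you; a quick sketch (64 bytes is the typical x86 line size, but verify rather than assume):

cr0x@server:~$ getconf LEVEL1_DCACHE_LINESIZE
64

For the cache sizes themselves, lscpu | grep -i cache lists the L1/L2/L3 capacities, which you can compare against your working set.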

The cruel math of misses

A CPU can retire multiple instructions per cycle. Modern out-of-order execution hides some latency, but only up to
the point where the core can find other independent work. When your workload is pointer-chasing, branchy, or
serialized (common in databases, allocators, and many high-level runtimes), the CPU runs out of independent work
quickly. Then you stall. Hard.
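
To put rough, illustrative numbers on it (real values vary by platform): at 3 GHz a cycle is about 0.33 ns, and an uncached trip to DRAM commonly costs on the order of 100 ns, so one miss the core can’t hide burns roughly 300 cycles. A core that could retire 4 instructions per cycle just forfeited over a thousand instructions’ worth of work for a single unlucky load.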

It’s also why “average latency” can lie. A cache miss doesn’t just add a bit of time; it can turn a request into
a long-tail outlier. Users don’t file tickets for p50. They file tickets for “it hung and then refreshed.”

Why the faster CPU loses in production

1) Your workload is not compute-bound

A lot of production workloads are data-bound:

  • Databases reading indexes and following pointers through B-trees.
  • Key/value lookups with random access patterns.
  • Microservices doing JSON parsing, allocations, hashmap lookups, and logging.
  • Observability pipelines that compress, batch, and ship events with heavy metadata.

The CPU isn’t “busy” doing arithmetic. It’s busy waiting for memory, for locks, for I/O completion, for cache line
ownership, for the allocator to find free space, for the garbage collector to stop the world and tidy up.

2) “Faster” CPUs often trade cache for cores or clocks

Product SKUs do this constantly. You’ll see CPUs with:

  • More cores but less cache per core.
  • Higher boost clocks but tighter power limits (so sustained load drops frequency).
  • Different cache topology (shared LLC slices, different interconnect behavior).

If your app needs cache, then a CPU with fewer cores but more LLC can outperform a “bigger” CPU that forces your
working set out to DRAM. That’s the cache monster: not glamorous, just effective.

3) NUMA and topology are performance multipliers (and destroyers)

NUMA isn’t an academic footnote. It’s the reason your new server is “mysteriously slower” than the old one.
If threads run on one socket but allocate memory on another, you pay a remote access penalty on almost every miss.
Now your “faster” CPU core is sipping data through a longer straw.

Topology also affects shared cache contention. Place too many noisy neighbors on the same LLC domain and your
hit rate collapses. The CPU is fast. It’s just hungry.
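
A minimal sketch of checking the topology and then forcing locality, assuming numactl is installed; ./myservice is a hypothetical binary standing in for your workload:

cr0x@server:~$ numactl --hardware | head -n 1
available: 2 nodes (0-1)
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 ./myservice   # hypothetical binary: keep threads and allocations on node 0

Compare p99 with and without the binding. If it improves, your “CPU problem” was a topology problem.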

4) Storage is the ultimate cache miss

When you miss in memory and land on storage, you’re not comparing clocks anymore. You’re comparing microseconds
and milliseconds.

This is where cache monsters show their teeth: page cache, database buffer pools, ZFS ARC, object caches, CDN edges.
A cache hit can turn “needs a disk read” into “returns immediately.” That’s not a 10% speedup. That’s survival.

5) Tail latency punishes the impatient

Faster CPUs help p50 when you’re compute-bound. Caching helps p99 when you’re data-bound.
Production is mostly a p99 business. Your on-call rotation certainly is.

Exactly one quote, because engineers deserve at least one good one:
Everyone has a test environment. Some people are lucky enough to have a separate production environment.
—Anonymous operations aphorism

Joke #1: Buying a faster CPU to fix cache misses is like buying a faster car because you keep forgetting where you parked.

Interesting facts and historical context (the stuff that explains today’s pain)

  1. The “memory wall” became a named problem in the mid-1990s: CPU speeds improved faster than DRAM latency,
    forcing architects to lean heavily on caches and prefetching.
  2. Early CPUs had no on-chip cache: caches moved on-die as transistor budgets grew, because off-chip cache
    latency was too expensive.
  3. Cache lines exist because memory transfers are block-based: fetching 64 bytes amortizes bus overhead, but
    it also means you can waste bandwidth dragging useless neighbors along.
  4. Associativity is a tradeoff: higher associativity reduces conflict misses but costs more power and complexity.
    Real chips make pragmatic compromises that show up as “mysterious” performance cliffs.
  5. Inclusive vs non-inclusive LLC policies matter: some designs keep upper-level cache lines duplicated in LLC,
    which affects eviction behavior and effective cache capacity.
  6. Modern CPUs use sophisticated prefetchers: they guess future memory accesses. When they’re right, you look
    like a genius. When they’re wrong, they can pollute caches and consume bandwidth.
  7. NUMA became mainstream as multi-socket servers scaled: local memory access is faster than remote. Ignoring
    it is the fastest way to turn “more sockets” into “more sadness.”
  8. In Linux, the page cache is a first-class performance feature: it’s why a second run of a job can be
    dramatically faster than the first, unless you bypass it with Direct I/O (see the demonstration after this list).
  9. Storage moved from HDDs to SSDs to NVMe: latency dropped massively, but it’s still far slower than DRAM.
    A cache hit remains a different universe from a storage read.
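
Fact 8 is easy to demonstrate on a test box (not production: dropping caches evicts everyone’s hot data). The file path is hypothetical; any large file works:

cr0x@server:~$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # test box only: empties the page cache
3
cr0x@server:~$ time cat /data/big_table.bin > /dev/null             # hypothetical file; cold read goes to storage
cr0x@server:~$ time cat /data/big_table.bin > /dev/null             # warm read is served from the page cache

The second run is typically dramatically faster. Open the same file with O_DIRECT and that advantage disappears.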

Failure modes: how cache and I/O bite you

Cache misses that look like “CPU problems”

The classic trap is to see high CPU usage and assume you need more CPU. Sometimes you do. Often you don’t.
High CPU can mean “actively executing instructions” or “spinning, stalled, waiting on memory.”
Both show up as “CPU busy” to the untrained eye.

Cache misses show up as:

  • Lower instructions per cycle (IPC) despite high CPU utilization.
  • Higher LLC miss rates under load.
  • Throughput that stops scaling with more cores.
  • Latency spikes when the working set crosses a cache boundary.

False sharing and cache line ping-pong

You can have plenty of cache and still lose because cores fight over cache lines. When two threads on different cores
write to different variables that share the same cache line, the line bounces between cores to maintain coherence.
Congratulations: you invented a distributed system inside your CPU.
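
If you suspect false sharing, perf c2c (cache-to-cache) was built for this hunt; support depends on CPU and kernel, so treat it as a sketch:

cr0x@server:~$ sudo perf c2c record -a -- sleep 10    # sample contended cache-line traffic system-wide for 10 seconds
cr0x@server:~$ sudo perf c2c report --stdio | head -n 40

The report ranks cache lines by HITM events (loads that had to pull a modified line out of another core’s cache). The lines at the top, and the symbols touching them, tell you which structures to pad, split, or stop sharing.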

TLB misses: the stealth cache miss

Translation Lookaside Buffers (TLBs) cache virtual-to-physical address translations. When you miss there, you pay extra
page-walk overhead before you even fetch the data. Workloads with huge address spaces, random access, or fragmented heaps
can turn TLB pressure into latency.
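
A quick way to see whether address translation is part of the bill; the generic event names below exist on most x86 systems, but availability varies by CPU:

cr0x@server:~$ sudo perf stat -a -e dTLB-loads,dTLB-load-misses,iTLB-load-misses -- sleep 5
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

A high dTLB miss rate on a large-heap workload is a hint to evaluate huge pages (transparent or explicit), which reduce the number of translations the TLB has to cover.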

Storage cache misses: page cache, buffer pools, and read amplification

Page cache misses become disk reads. Disk reads become latency. And if your storage layer does read amplification (common in
some copy-on-write filesystems, encrypted layers, or poorly aligned I/O), you can “miss” more than once per request.

A lot of “CPU upgrades” fail because the storage path is the bottleneck, and the CPU just spends more time waiting faster.

Joke #2: The fastest way to make a system slower is to optimize it until it starts doing “helpful” work you didn’t ask for.

Fast diagnosis playbook

When performance regresses after a CPU upgrade (or any “hardware improvement”), don’t guess. Work the stack from top to bottom
and force the system to confess where it’s waiting.

First: decide if you’re compute-bound, memory-bound, or I/O-bound

  • Compute-bound: high IPC, high user CPU, low stalled cycles, scaling improves with more cores.
  • Memory-bound: low IPC, high cache/TLB misses, high stalled cycles, scaling plateaus early.
  • I/O-bound: iowait present, high storage latency, threads blocked on reads, page cache miss spikes.

Second: check cache behavior and IPC under real load

  • Use perf to look at cycles, instructions, cache misses, stalled cycles.
  • Watch LLC misses and memory bandwidth counters if available.
  • Look for sudden cliffs when QPS increases (working set exceeds cache).

Third: verify NUMA locality and CPU frequency behavior

  • Confirm the workload is pinned sensibly or at least not fighting the scheduler.
  • Check if memory allocations are local to the CPU running the thread.
  • Verify you’re not thermal/power throttling under sustained load.

Fourth: follow the cache miss to the storage stack

  • Check page cache hit/miss signals (readahead, major faults).
  • Measure storage latency distribution, not just throughput.
  • Confirm queue depth, scheduler, and filesystem behavior (especially CoW/RAID layers).

Fifth: only then consider a CPU change

If you’re compute-bound, sure: buy CPU. If you’re memory- or I/O-bound, fix locality, caching, and layout first.
It’s cheaper and usually more effective.

Hands-on tasks: commands, outputs, what they mean, and the decision you make

These are production-grade checks. Run them during a load test or a real incident window (carefully). Each task includes:
the command, representative output, what it means, and what you decide next.

Task 1: Check CPU frequency and throttling behavior

cr0x@server:~$ lscpu | egrep 'Model name|CPU\(s\)|Socket|Thread|NUMA|MHz'
Model name:                           Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
CPU(s):                               64
Thread(s) per core:                   2
Socket(s):                            2
NUMA node(s):                         2
CPU MHz:                              1199.842

What it means: If MHz is far below base/expected under load, you may be power/thermal limited or in a conservative governor.

Decision: Verify governor and sustained frequency under load; don’t assume boost clocks are real in production.
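
Note that the MHz line is a single snapshot. To see what cores actually sustain, watch them during the load test (turbostat, if installed, gives a cleaner per-core view):

cr0x@server:~$ watch -n1 "grep 'cpu MHz' /proc/cpuinfo | sort -rn -k4 | head -n 4"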

Task 2: Confirm CPU governor (performance vs powersave)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

What it means: “powersave” can cap frequency and increase latency under bursty load.

Decision: If this is a latency-sensitive node, switch to “performance” (or tune platform policy) during testing and compare.
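
One way to flip it for a test; cpupower comes from linux-tools, and the sysfs write works without extra packages:

cr0x@server:~$ sudo cpupower frequency-set -g performance
cr0x@server:~$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # fallback: applies to every CPU

Either change resets on reboot unless you persist it through your platform’s tuning mechanism, so record what you changed.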

Task 3: Quick view of run queue pressure and iowait

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0      0 242112  81240 912340    0    0     0    12 6200 9800 62 10 22  6  0
10  1      0 240980  81240 910120    0    0    64   980 6900 11200 55 11 18 16  0
11  2      0 239500  81240 908400    0    0   128  1400 7200 11800 48 12 16 24  0

What it means: A high r column suggests CPU contention; high wa and b values suggest I/O waits. This sample shows iowait climbing with load.

Decision: If iowait climbs with load, don’t waste time on CPU upgrades—follow the I/O path.

Task 4: Identify blocked tasks (often I/O or lock contention)

cr0x@server:~$ ps -eo state,pid,comm,wchan:32 | awk '$1 ~ /D/ {print}'
D 18423 postgres         io_schedule
D 19011 java             jbd2_log_wait_commit

What it means: D state indicates uninterruptible sleep, most commonly I/O waits (io_schedule) or filesystem journal and sync paths (jbd2_log_wait_commit).

Decision: If many threads are in D, storage latency or filesystem contention is a prime suspect.

Task 5: Measure cache misses and IPC with perf stat (quick, high value)

cr0x@server:~$ sudo perf stat -a -e cycles,instructions,cache-references,cache-misses,branches,branch-misses -I 1000 -- sleep 3
#           time             counts unit events
     1.000353504   12,804,112,210      cycles
     1.000353504    6,210,554,991      instructions              #    0.49  insn per cycle
     1.000353504      401,220,118      cache-references
     1.000353504      121,884,332      cache-misses              #   30.37% of all cache refs
     1.000353504    1,002,884,910      branches
     1.000353504       21,880,441      branch-misses             #    2.18% of all branches

What it means: An IPC of ~0.49 combined with a high cache-miss ratio screams memory stalls. A “faster” CPU won’t fix this unless it has meaningfully better cache/memory behavior.

Decision: Reduce working set, improve locality, pin threads/memory for NUMA, or change data structures before buying compute.

Task 6: Check NUMA placement and remote memory usage (if numactl is available)

cr0x@server:~$ numastat -p 18423
Per-node process memory usage (in MBs) for PID 18423 (postgres)
Node 0          8123.45
Node 1           421.12
Total           8544.57

What it means: Memory is mostly on Node 0. If threads are running on Node 1, you’ll pay remote access penalties.

Decision: Align CPU affinity and memory policy: pin processes or fix orchestration so allocation and execution live together.

Task 7: See where threads are running (CPU affinity and migration clues)

cr0x@server:~$ ps -eLo pid,tid,psr,comm | awk '$4=="postgres"{print $0}' | head
18423 18423  12 postgres
18423 18424   3 postgres
18423 18425  47 postgres
18423 18426  19 postgres

What it means: Threads scattered across CPUs can be fine, or it can be cache-hostile if the workload shares hot structures and bounces cache lines.

Decision: If you see contention and poor scaling, test pinning to a subset of cores or a single NUMA node and measure p99.
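
A minimal way to run that experiment against the running process (PID from the earlier example; intrusive, so do it in a controlled window):

cr0x@server:~$ sudo taskset -cp 0-15 18423    # restrict this PID to CPUs 0-15; children forked afterwards inherit the mask
pid 18423's current affinity list: 0-63
pid 18423's new affinity list: 0-15

If p99 and LLC misses improve, make the placement permanent via systemd CPUAffinity, cgroup cpusets, or your orchestrator rather than a hand-run command.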

Task 8: Check major page faults (page cache misses that hit storage)

cr0x@server:~$ pidstat -r -p 19011 1 3
Linux 6.2.0 (server) 	01/10/2026 	_x86_64_	(64 CPU)

12:00:01     UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
12:00:02    1001     19011   1200.00     85.00 12488192 2380040  7.42  java
12:00:03    1001     19011   1180.00     92.00 12488192 2381208  7.42  java

What it means: majflt/s counts major faults, i.e. page faults that have to read from the backing store. Sustained major faults under load are a classic latency killer.

Decision: Increase memory, reduce file churn, or ensure hot files stay in cache; consider application-level caching/buffering.

Task 9: Measure storage device latency and queueing

cr0x@server:~$ iostat -x 1 3
Linux 6.2.0 (server) 	01/10/2026 	_x86_64_	(64 CPU)

Device            r/s     w/s   r_await   w_await  aqu-sz  %util
nvme0n1         820.0   120.0     6.10     8.40    5.40   98.0

What it means: r_await/w_await are average latencies; aqu-sz and high %util indicate queueing and saturation.

Decision: If device is saturated, reduce I/O (cache more), spread load across devices, or change access pattern before touching CPU.

Task 10: Identify hot files and I/O sources (quick and dirty)

cr0x@server:~$ sudo lsof -nP | awk '{print $1,$2,$4,$9}' | egrep 'postgres|java' | head
postgres 18423 mem /usr/lib/x86_64-linux-gnu/libssl.so.3
postgres 18423  12u /var/lib/postgresql/15/main/base/16384/2619
java     19011  45u /var/log/app/service.log

What it means: Shows which files are open. If you see log files or unexpected paths dominating, your “compute problem” may be “logging to death.”

Decision: Reduce synchronous logging, batch writes, move logs to faster volume, or fix log level.

Task 11: Check filesystem and mount options (Direct I/O, barriers, atime)

cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /var/lib/postgresql
/var/lib/postgresql /dev/md0 ext4 rw,relatime,data=ordered

What it means: Options like relatime matter; others (journal modes, barriers) affect latency and safety.

Decision: Tune only with a clear reason. If you change safety-related options to chase benchmarks, write the postmortem in advance.

Task 12: Observe page cache pressure and reclaim behavior

cr0x@server:~$ grep -E 'pgscan|pgsteal|pgfault|pgmajfault' /proc/vmstat | head -n 8
pgfault 1283394021
pgmajfault 228103
pgscan_kswapd 901223
pgscan_direct 188004
pgsteal_kswapd 720111
pgsteal_direct 141220

What it means: Rising pgscan_direct suggests direct reclaim (processes reclaiming memory themselves), which can hurt latency.

Decision: If direct reclaim is high during load, reduce memory pressure: add RAM, reduce cache churn, tune memory limits, or fix oversized heaps.

Task 13: Check swap activity (a slow-motion disaster)

cr0x@server:~$ swapon --show
NAME      TYPE SIZE USED PRIO
/dev/sda3 partition 16G  2G   -2

What it means: Swap in use isn’t automatically evil, but sustained swapping under load is a cache-miss factory with extra steps.

Decision: If latency matters, avoid swap thrash: cap memory, fix leaks, right-size heaps, or add memory.

Task 14: Check memory bandwidth pressure (top-level hint via perf)

cr0x@server:~$ sudo perf stat -a -e stalled-cycles-frontend,stalled-cycles-backend,LLC-loads,LLC-load-misses -I 1000 -- sleep 3
#           time             counts unit events
     1.000339812    3,110,224,112      stalled-cycles-frontend
     1.000339812    8,901,884,900      stalled-cycles-backend
     1.000339812      220,112,003      LLC-loads
     1.000339812       88,440,120      LLC-load-misses           #   40.17% of all LLC loads

What it means: Backend stalls and high LLC miss rates strongly suggest memory latency/bandwidth limits.

Decision: You need locality improvements, not raw GHz. Consider data structure changes, sharding, or co-locating hot data.

Task 15: Check ZFS ARC effectiveness (if on ZFS)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:00:01   12K  2.8K     23   920   33  1.7K   61   180    6   96G   112G
12:00:02   13K  4.9K     37  2.1K   43  2.6K   53   200    4   96G   112G

What it means: Rising ARC miss% under load implies your working set doesn’t fit in ARC; reads fall through to disk.

Decision: Add RAM, tune ARC limits, add faster secondary cache if appropriate, or reduce working set (index bloat, cold data).
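
On OpenZFS the ARC ceiling is a module parameter you can read and, on most versions, adjust at runtime; the size below is purely illustrative:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
0
cr0x@server:~$ echo 137438953472 | sudo tee /sys/module/zfs/parameters/zfs_arc_max   # illustrative: cap ARC at 128 GiB

A value of 0 means “use the built-in default.” Persist any change in /etc/modprobe.d so a reboot doesn’t quietly undo your capacity planning, and re-check arcstat miss% afterwards.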

Task 16: Validate network latency as “remote cache miss” (service calls)

cr0x@server:~$ ss -tin dst 10.20.0.15:5432 | head -n 12
State Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB 0      0      10.20.0.10:44912   10.20.0.15:5432
	 cubic wscale:7,7 rto:204 rtt:1.8/0.4 ato:40 mss:1448 cwnd:10 bytes_acked:1234567 bytes_received:2345678

What it means: RTT tells you if your “slow query” is actually “network jitter.” In distributed systems, the network is just another cache level—an expensive one.

Decision: If RTT/jitter is high, focus on co-location, connection pooling, or caching results closer to the caller.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption (the “CPU upgrade” that made p99 worse)

A mid-size company ran a search-heavy API. The team had a clean story: queries are slow, CPU is high, buy faster CPUs.
Procurement delivered newer servers with higher clocks and more cores. Benchmarks on a single node looked decent.
Rollout began. Within hours, p99 latency spiked and error budgets started evaporating.

The wrong assumption was subtle: they assumed “more CPU” improves a workload that was already limited by memory locality.
The new servers had more cores per socket, but less LLC per core and slightly different NUMA characteristics.
Under real multi-tenant load, the query working set no longer fit nicely in cache. The LLC miss rate climbed.
IPC dropped. The CPU charts still looked “busy,” which misled everyone for longer than it should have.

The incident response took an awkward turn when engineers realized their new “faster” fleet had become better at
executing stalls. They reverted traffic, then reproduced the regression in a controlled test with perf counters.
The smoking gun was a consistent increase in LLC load misses and backend stalled cycles under the same QPS.

The fix was not to undo the hardware purchase (that ship had sailed), but to change placement and reduce cross-core contention.
They pinned the service to fewer cores per NUMA node and ran more replicas instead of trying to “use all cores.”
Counterintuitive, but it recovered p99. Later they reworked data structures to be more cache-friendly and cut pointer chasing.

The postmortem action item that mattered: performance acceptance tests must include cache miss/IPC metrics, not just throughput.
“CPU utilization” alone is a mood ring, not a diagnosis.

Mini-story 2: The optimization that backfired (Direct I/O: the page cache wasn’t the villain)

Another company ran a pipeline that ingested events, wrote them to disk, and periodically compacted them.
They saw memory pressure and decided the Linux page cache was “stealing” RAM from the application.
An engineer flipped a switch: use Direct I/O for writes and reads to “avoid polluting cache.”

The graphs looked great for a day. Memory usage stabilized. Then the latency alarms started.
Not everywhere—just in the worst possible place: the compaction jobs and certain read paths that depended on re-reading
recent data. With Direct I/O, those reads bypassed page cache entirely. Each re-read became a storage operation.
NVMe is fast, but not “as fast as RAM pretending to be disk.”

The backfire came with extra shrapnel. Without page cache smoothing, I/O became burstier.
Queue depth spiked. Tail latencies spiked with it. The CPU still wasn’t the bottleneck; it just spent more time in iowait.
The team had optimized memory graphs at the expense of customer-visible latency.

The eventual fix was boring: revert Direct I/O for the hot read paths, keep it only where data was truly cold or sequential,
and set sane cgroup memory limits so the page cache couldn’t starve the process.
They also introduced a small application-level cache for metadata to avoid repeated random reads.

Lesson learned: “bypassing caches” is not a performance strategy. It’s a weapon. Use it only when you understand what it will hit.

Mini-story 3: The boring but correct practice that saved the day (measuring working set and planning cache)

A financial services shop had a batch + API hybrid workload. They were planning a refresh: newer CPU generation, same RAM size,
same storage. The SRE team insisted on a pre-migration profile: measure working set size, cache hit ratios, and NUMA behavior
during peak week, not peak hour.

It was unpopular because it delayed the project. It required capturing perf stats, page fault rates, and storage latency
histograms. It also required a load test that looked like production, not like a benchmark brochure.
The team pushed anyway, because they’d been burned before.

The profile showed a predictable cliff: once the dataset grew past a certain point, buffer cache hit rate dropped and reads
hit storage. p99 exploded. The current system survived because it had slightly larger effective cache (database buffer pool plus
OS cache) than the planned configuration. The “faster” CPU wouldn’t matter once the cache misses started landing on disk.

They adjusted the refresh plan: more RAM per node, slightly fewer cores, and a clear NUMA placement strategy.
They also scheduled a routine to keep the hottest partitions resident in cache during business hours.
Launch day was uneventful, which is the highest compliment in operations.

Lesson: boring measurement beats exciting replacement. The best incidents are the ones that never happen.

Common mistakes (symptoms → root cause → fix)

1) Symptom: CPU is high, throughput is flat

Root cause: Memory-bound workload; low IPC due to cache misses or backend stalls.

Fix: Use perf to confirm low IPC and high LLC misses. Reduce working set, improve data locality, shard, or choose a CPU with more cache per core.

2) Symptom: p99 got worse after “faster CPU” rollout

Root cause: Cache topology/LLC per core changed; NUMA placement changed; prefetch behavior differs; more contention per socket.

Fix: Compare perf counters old vs new under identical load. Pin to NUMA nodes, reduce core sharing, re-evaluate instance sizing and placement.

3) Symptom: Lots of iowait, but disks don’t look “busy” by throughput

Root cause: Latency saturation: small random I/O, high queueing, or storage tail latency; throughput hides it.

Fix: Use iostat -x and look at await and aqu-sz. Reduce random reads via caching, fix query/index patterns, or add devices/IOPS headroom.

4) Symptom: Performance drops when adding cores

Root cause: Shared cache contention, lock contention, false sharing, or memory bandwidth saturation.

Fix: Scale out instead of up; pin threads; eliminate false sharing; reduce shared mutable state; measure with perf and flame graphs if possible.

5) Symptom: “Optimization” reduces memory usage but increases latency

Root cause: Bypassing caches (Direct I/O), shrinking buffer pools, or aggressive eviction leads to storage reads.

Fix: Restore caching for hot paths, cap memory correctly, and measure cache hit ratios and major faults.

6) Symptom: Periodic latency spikes every few seconds/minutes

Root cause: GC pauses, cache eviction cycles, compaction, or background reclaim (kswapd/direct reclaim).

Fix: Check majflt, vmstat reclaim signals, and GC logs. Reduce allocation churn, tune heap, add memory, or smooth compaction scheduling.

7) Symptom: A single host is slower than its identical siblings

Root cause: Different BIOS settings, power policy, microcode, memory population (channels), or noisy neighbor saturating LLC/memory bandwidth.

Fix: Compare lscpu, governors, NUMA, and perf stats across hosts. Standardize firmware and kernel tunables; isolate noisy workloads.

8) Symptom: High system CPU, low user CPU, “nothing” in app profiles

Root cause: Kernel overhead: page faults, network stack pressure, context switching, or filesystem metadata churn.

Fix: Use vmstat, pidstat, perf top. Reduce syscalls (batching), tune logging, fix file churn, and remove accidental sync points.

Checklists / step-by-step plan

Step-by-step: deciding whether cache beats CPU for your workload

  1. Collect a production-like profile: QPS mix, concurrency, dataset size, and p50/p95/p99 latency.
  2. Measure IPC and cache misses under load: perf stat (cycles, instructions, LLC misses, stalled cycles).
  3. Measure storage latency distribution: iostat -x and application-level timings.
  4. Measure major faults: pidstat -r and /proc/vmstat; confirm whether reads are hitting disk.
  5. Validate NUMA placement: numastat, thread placement, and affinity policy.
  6. Check scaling: run 1, 2, 4, 8, N cores and see if throughput scales or stalls (a sketch follows this list).
  7. Change one thing at a time: pinning, buffer pool size, cache size, data layout, then retest.
  8. Only then decide on hardware: more cache per core, more RAM, faster memory, or more nodes.
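
A minimal sketch of step 6, using a hypothetical ./loadgen driver; the point is to change only the CPU set and record throughput and p99 at each size:

cr0x@server:~$ for n in 1 2 4 8 16 32; do taskset -c 0-$((n-1)) ./loadgen --duration 60 >> scaling.log; done   # hypothetical load generator and flags

If throughput flattens long before the core count does, you are looking at memory bandwidth, shared-cache, or lock contention, not a shortage of cores.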

Operational checklist: before a CPU “upgrade” rollout

  • Baseline perf counters (IPC, LLC misses, stalls) on old hardware.
  • Confirm BIOS settings: power/performance policy, SMT, memory interleaving.
  • Confirm kernel and microcode parity across old and new.
  • Run a load test with the real dataset size and realistic access skew.
  • Compare tail latency, not just throughput.
  • Plan a rollback that doesn’t require a meeting.

Storage/cache checklist: making caching work for you

  • Know your working set size (hot data) and compare it to RAM/buffer caches.
  • Prefer sequential access where possible; random I/O is a tax.
  • Keep indexes and hot metadata in memory; disk is for cold truth, not hot opinions.
  • Measure cache hit ratios and eviction rates, don’t guess.
  • Don’t “optimize” by disabling safety (write barriers, journaling) unless you’re willing to own the data loss story.

FAQ

1) Is CPU cache really that big a deal, or is this just performance-nerd drama?

It’s a big deal. Many real workloads are memory-latency bound. A small change in LLC miss rate can swing p99 latency
far more than a modest CPU frequency bump.

2) If my CPU usage is 90%, doesn’t that prove I’m CPU-bound?

No. High CPU usage can include stalled cycles, spin loops, lock contention, and kernel overhead. Use IPC and stall metrics
(perf stat) to separate “executing” from “waiting expensively.”

3) What’s the simplest metric to tell “cache monster wins”?

IPC plus LLC misses under production load. If IPC is low and LLC misses are high, you’re not starving for CPU; you’re starving
for locality and cache hits.

4) Does more RAM always solve cache problems?

More RAM helps if the working set can fit and you can keep it hot (page cache, buffer pool, ARC). But RAM doesn’t fix
pathological access patterns, false sharing, or NUMA misplacement.

5) Should I pin processes to cores?

Sometimes. Pinning can improve cache warmth and reduce migration overhead, but it can also make load imbalance worse.
Test it with real load. If pinning improves p99 and reduces LLC misses, keep it.

6) Why did my NVMe upgrade not improve latency much?

Because you were already hitting page cache or buffer cache, or because your latency was dominated by CPU stalls, locks, or
network hops. Also, NVMe can be fast and still have nasty tail latency under saturation.

7) Isn’t Direct I/O faster because it avoids double caching?

It can be, for specific workloads (large sequential reads/writes, streaming). For mixed or re-read-heavy patterns, page cache
is a performance feature. Removing it often turns “memory speed” into “storage speed.”

8) How do I know if NUMA is hurting me?

If you have multi-socket systems, assume NUMA matters. Confirm with numastat and perf. Symptoms include poor scaling, increased
latency, and performance differences depending on where the scheduler puts threads.

9) Can a CPU with fewer cores outperform one with more cores?

Yes, if the fewer-core CPU has more cache per core, better memory latency, or avoids bandwidth saturation. Many services don’t
scale linearly with cores because data access and contention dominate.

10) What’s the most common “cache mistake” in application code?

Data structures that destroy locality: pointer-heavy graphs, random hash iteration, and allocating millions of tiny objects.
Also: false sharing in multithreaded counters and queues.

Practical next steps

If you take one operational lesson from this: stop treating CPU speed like it’s the main character. In production, it’s a
supporting actor. The lead roles are locality, caching, and latency.

  1. Run perf stat under real load and record IPC, LLC misses, and stalled cycles.
  2. Run iostat -x and look at latency and queue depth, not just MB/s.
  3. Check major faults and reclaim behavior to see whether you’re falling out of memory into disk.
  4. Validate NUMA placement and test pinning as an experiment, not a belief system.
  5. Only after those measurements, decide: more cache (different CPU SKU), more RAM, better data layout, or a different architecture (scale out).

Then do the boring thing: document the baseline, codify the checks in your rollout runbook, and make “cache and locality”
a first-class performance requirement. Your future self, trapped in an incident bridge at 2 a.m., will be oddly grateful.
