Every CPU review looks decisive until you try to serve real traffic. Your service isn’t “Cinebench.” It’s a messy cocktail: syscalls, TLS, JSON, GC pauses, interrupts, cache misses, and a database call that sometimes wakes up on the wrong side of NUMA.
If you’ve ever shipped a “safe” CPU upgrade and still got paged for p99 latency, you already know the problem: you didn’t measure the right thing. This is a pragmatic method to test CPUs against your workload, with repeatable runs and decisions you can defend in a postmortem.
Principles: what “CPU performance” means in production
CPU testing goes off the rails when people ask the wrong question. The wrong question is: “Which CPU is faster?” The right question is: “Which CPU gives my workload the latency and throughput I need, with acceptable cost and operational risk?”
Production CPU performance is a three-body problem:
- Work done (throughput): requests/sec, jobs/min, rows scanned/sec.
- Time to do it (latency): especially tail latency (p95/p99/p999).
- Stability: variance under noise—background tasks, jitter, throttling, noisy neighbors.
And “CPU” often isn’t just “CPU.” It’s:
- Microarchitecture: IPC, caches, branch predictor, prefetchers.
- Memory subsystem: bandwidth, latency, NUMA, page faults.
- Kernel behavior: scheduling, context switching, interrupts, cgroups.
- Thermals and power limits: turbo behavior that looks great for 30 seconds and then folds.
- Software: compiler flags, crypto libraries, GC configuration, locks.
The goal of testing isn’t to find a single number. It’s to find the operating envelope: under what load does p99 jump, when do run queues build, when do you start thrashing caches, and what knob actually moves the needle.
Paraphrased idea (John Allspaw, operations/reliability): “You don’t get to claim reliability unless you can demonstrate it under realistic conditions.” That applies to performance too.
One rule I’ll be annoyingly consistent about: prioritize latency first, then cost. Throughput-only tests make bad CPU decisions because they hide the pain users feel at the tail.
Joke #1: Benchmarking on an idle lab machine is like testing a fire alarm in a vacuum—technically impressive, operationally useless.
Interesting facts and historical context
These aren’t trivia for trivia’s sake. They explain why modern CPU testing can’t be reduced to a single “GHz” label.
- “MHz wars” ended for a reason. Around the mid-2000s, frequency stopped scaling cleanly due to power density and leakage; performance shifted to multicore and microarchitectural gains.
- Speculative execution changed everything. Out-of-order execution and speculation improved IPC, but also opened new classes of security vulnerabilities (Spectre/Meltdown) whose later mitigations cost measurable performance on certain workloads.
- Turbo is not a promise. Modern CPUs dynamically boost based on power/temperature/current limits; two “identical” runs can differ if cooling or BIOS settings differ.
- NUMA is old, but still a top offender. Multi-socket NUMA systems have been common for decades; the failure mode (remote memory latency) still surprises teams migrating from smaller nodes.
- Linux perf counters weren’t built for dashboards. Hardware performance counters originated for chip designers and low-level profiling; using them operationally requires careful interpretation.
- Hyper-threading (SMT) is workload-dependent. It can boost throughput on stalls, but worsen tail latency on lock-heavy or cache-sensitive services.
- Virtualization matured, then cgroups made it weird again. VM overhead shrank, but container CPU quotas introduced a new failure mode: throttling that looks like “mysterious latency.”
- Big pages aren’t a universal win. Huge pages reduce TLB misses, but increase fragmentation and can hurt when memory is dynamic or overcommitted.
- “Faster CPU” can slow the system. If the CPU processes requests faster, you can shift bottlenecks to storage, locks, or downstream services—then your p99 gets worse.
The simple method: build a workload harness you can trust
Here’s the method I recommend when you want to compare CPUs or validate a tuning change. It’s simple, not simplistic. You’ll run the same workload across candidates, collect a small set of metrics, and make a decision based on tail latency and saturation signals.
1) Define the workload like a grown-up
“API traffic” is not a workload. A workload is a distribution:
- Request mix (endpoints, query types, read/write ratios)
- Payload sizes (small, typical, pathological)
- Concurrency model (threads, async, connection pool sizes)
- Think time / arrival pattern (Poisson-like vs bursty)
- SLO you care about (p99 under N RPS)
Pick 1–3 representative scenarios. Don’t build an entire universe. But also don’t pick only the happy path.
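To make "distribution" concrete, here is a minimal sketch of one scenario captured as a script, assuming an HTTP service and the hey load generator; the endpoints, payload, and rates are placeholders for your real mix.

#!/usr/bin/env bash
# scenario-search-heavy.sh -- one recorded workload scenario (sketch; endpoints are assumptions).
# 70% "search" reads, 30% "index" writes, fixed offered load, 10-minute window.
set -euo pipefail

DURATION=600s
HOST=http://127.0.0.1:8080          # placeholder target

# Offered load: 200 workers * 35 qps = 7000 RPS on reads,
#               100 workers * 30 qps = 3000 RPS on writes.
hey -z "$DURATION" -c 200 -q 35 "$HOST/api/v1/search?q=typical" > read.out &
hey -z "$DURATION" -c 100 -q 30 -m POST -d '{"doc":"typical payload"}' \
    -T application/json "$HOST/api/v1/index" > write.out &
wait

grep -H "99% in" read.out write.out   # pull p99 from both halves of the mix

The point isn't this exact script; it's that the mix, rates, and duration are written down and re-runnable.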
2) Make the environment boring on purpose
Repeatability is a feature. Your test environment should remove avoidable noise wherever possible:
- Pin the CPU frequency governor to performance for tests (or at least record it).
- Disable background cron storms, package updates, and opportunistic antivirus scanning.
- Ensure identical kernel versions, BIOS settings, microcode, and mitigations across candidates—or consciously accept differences and record them.
- Don’t compare a bare-metal host to an overcommitted VM and call it “CPU testing.” That’s comedy.
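Before the first run, snapshot the things that make runs incomparable. A minimal sketch using standard sysfs/procfs paths; some files may be absent on your kernel or platform, which is itself worth recording.

#!/usr/bin/env bash
# capture-env.sh -- record the knobs that make runs incomparable (sketch).
{
  echo "== kernel ==";      uname -r
  echo "== microcode ==";   grep -m1 microcode /proc/cpuinfo
  echo "== governor ==";    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null
  echo "== turbo ==";       cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null
  echo "== smt ==";         cat /sys/devices/system/cpu/smt/control 2>/dev/null
  echo "== mitigations =="; grep . /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null
} > "env-$(hostname)-$(date +%Y%m%dT%H%M%S).txt"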
3) Test for steady state, not warm-up theater
Most services have warm-up effects: JIT compilation, caches, connection pools, filesystem cache, ARC, page cache, branch predictor training, even DNS caches. You want two phases:
- Warm-up: long enough to reach stable behavior.
- Measurement: fixed duration where you sample metrics.
When teams skip this, they optimize for “first minute performance,” then wonder why the 30-minute mark melts. Your CPUs are not being graded on charisma.
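A minimal sketch of the two-phase structure, reusing the same hey invocation for both phases; the durations and endpoint are assumptions you should adjust for your service.

#!/usr/bin/env bash
# warmup-then-measure.sh -- discard warm-up, keep only steady-state numbers (sketch).
set -euo pipefail
TARGET=http://127.0.0.1:8080/api/v1/search     # placeholder endpoint

# Phase 1: warm-up. Long enough for JIT, caches, and pools to settle; output discarded.
hey -z 10m -c 200 -q 50 "$TARGET" > /dev/null

# Phase 2: measurement. Fixed window, results saved and timestamped.
hey -z 15m -c 200 -q 50 "$TARGET" > "measure-$(date +%Y%m%dT%H%M%S).txt"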
4) Measure latency distribution and CPU saturation signals together
If you only record CPU utilization, you’re blind. If you only record latency, you don’t know where to look. Pair them:
- Latency: p50/p95/p99, plus max (with caution).
- Throughput: RPS, QPS, jobs/sec.
- CPU saturation: run queue, throttling, iowait (interpreted carefully).
- Perf counters (sampling): cycles, instructions, branches, cache misses.
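To pair those signals with the latency numbers, run a small sampler alongside the measurement phase over the same window. A sketch assuming PSI is enabled and cgroup v2 is in use; adjust the cgroup path to your service.

#!/usr/bin/env bash
# collect-saturation.sh -- sample CPU saturation signals during the measurement window (sketch).
OUT="saturation-$(date +%Y%m%dT%H%M%S).log"
END=$((SECONDS + 900))                       # match the length of the measurement phase
while [ "$SECONDS" -lt "$END" ]; do
  {
    date +%T
    cat /proc/pressure/cpu                   # PSI: time tasks spent waiting for CPU
    grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/cpu.stat 2>/dev/null  # adjust path to your service's cgroup
    vmstat 1 2 | tail -n 1                   # second sample: run queue, context switches, us/sy/id
  } >> "$OUT"
  sleep 10
done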
5) Compare at equal “user pain,” not equal utilization
When comparing two CPUs, don’t match “CPU = 70%.” Match the SLO point (e.g., “p99 < 200ms”). One CPU might sail at 70% utilization while another is already thrashing at 50% because of cache behavior or frequency limits.
6) Decide before you run: what result changes your decision
Write down your acceptance criteria:
- “At 10k RPS, p99 < 150ms and error rate < 0.1%.”
- “CPU throttling < 1% under quota.”
- “Perf shows IPC within 10% of baseline; no abnormal cache miss spike.”
If you don’t decide in advance, you’ll negotiate with the graphs after the fact. And the graphs always win.
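Once the criteria are written down, you can script the verdict instead of negotiating with the graphs. A sketch that pulls p99 out of hey’s output (the format shown in Task 15 below); the file name and the 0.150-second target are placeholders.

#!/usr/bin/env bash
# check-slo.sh -- pass/fail against a pre-agreed p99 target (sketch).
P99=$(awk '/99% in/ {print $3}' measure-latest.txt)    # e.g. "0.1500" from hey's latency distribution
TARGET=0.150
awk -v p99="$P99" -v t="$TARGET" 'BEGIN { exit (p99+0 <= t+0 ? 0 : 1) }' \
  && echo "PASS: p99=${P99}s <= ${TARGET}s" \
  || echo "FAIL: p99=${P99}s > ${TARGET}s"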
Practical tasks (commands, output, decisions)
Below are hands-on tasks you can run on Linux. Each includes: a command, what the output means, and what decision to make. Pick the subset that matches your environment, but don’t skip the ones that reveal “CPU isn’t the CPU.”
Task 1: Record CPU model, topology, and SMT status
cr0x@server:~$ lscpu
Architecture: x86_64
CPU(s): 32
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
NUMA node(s): 1
L3 cache: 48 MiB
What it means: You now know what you’re actually testing: cores, sockets, SMT, cache sizes, NUMA. The “32 CPUs” may be 16 cores with SMT.
Decision: If you’re comparing machines, ensure topology is comparable, or explicitly include SMT/NUMA differences in your conclusion. If your service is latency-sensitive, plan an A/B run with SMT on vs off.
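For the SMT A/B run, recent kernels expose a runtime switch. A sketch, run as root; the control file may be absent or locked down on some platforms.

# Check and toggle SMT at runtime (sketch, run as root).
cat /sys/devices/system/cpu/smt/control        # "on", "off", or "notsupported"
echo off > /sys/devices/system/cpu/smt/control # disable sibling threads for the latency A/B
# ... run the workload, record results ...
echo on > /sys/devices/system/cpu/smt/control  # restore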
Task 2: Check CPU frequency governor and current frequency behavior
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
What it means: “ondemand” can add jitter and under-boost during short spikes, depending on kernel and platform. On servers, you often want consistent behavior.
Decision: For benchmarking, set governor to performance (or record it and accept variability). If you can’t change it (cloud), at least record it and run longer to average boost behavior.
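Setting the governor can be done through sysfs or with the cpupower utility (packaged as linux-tools on many distros). A sketch, run as root:

# Pin all cores to the performance governor for the duration of the test (sketch, as root).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done
# Or, with the cpupower utility:
# cpupower frequency-set -g performance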
Task 3: Verify turbo/boost capability (when available)
cr0x@server:~$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
What it means: 0 means turbo is allowed. If it’s 1, your “fast CPU” might be acting like a polite CPU.
Decision: Keep turbo consistent across test hosts. If one host has turbo disabled (BIOS or OS), your comparison is already corrupted.
Task 4: Check for CPU throttling due to cgroups (containers/Kubernetes)
cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat
usage_usec 987654321
user_usec 800000000
system_usec 187654321
nr_periods 12345
nr_throttled 2345
throttled_usec 456789012
What it means: nr_throttled and throttled_usec indicate the kernel is forcibly stopping your workload to enforce CPU quota.
Decision: If throttling is non-trivial during your test, your “CPU benchmark” is a quota benchmark. Either increase quota, remove limits for the test, or treat “throttling per request” as a primary metric.
Task 5: Watch run queue and CPU usage over time
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 8123456 123456 789012 0 0 0 2 900 1400 25 5 70 0 0
8 0 0 8121200 123456 789100 0 0 0 0 1200 6000 80 10 10 0 0
10 0 0 8119000 123456 789200 0 0 0 0 1300 7000 85 10 5 0 0
9 0 0 8118000 123456 789250 0 0 0 0 1250 6800 83 12 5 0 0
3 0 0 8120000 123456 789300 0 0 0 0 1000 3000 50 7 43 0 0
What it means: r is runnable processes. If r consistently exceeds core count, you’re CPU-saturated or contended. cs (context switches) spikes can hint at lock contention or oversubscription.
Decision: If run queue grows while latency spikes, you likely need more CPU (or fewer threads) or you’re hitting a lock. If run queue is low but latency is high, the bottleneck is elsewhere.
Task 6: Inspect per-core utilization and steal time (virtualization)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.1.0 (server) 01/12/2026 _x86_64_ (32 CPU)
12:00:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:00:02 AM all 62.00 0.00 8.00 0.10 0.00 1.50 5.00 23.40
12:00:02 AM 0 90.00 0.00 5.00 0.00 0.00 0.00 2.00 3.00
12:00:02 AM 1 10.00 0.00 5.00 0.00 0.00 0.00 20.00 65.00
12:00:03 AM all 64.00 0.00 7.50 0.00 0.00 1.20 6.50 20.80
What it means: %steal shows the hypervisor taking CPU time away. High steal makes latency noisy and invalidates comparisons across runs.
Decision: If steal is >1–2% during tests, move to dedicated hosts/instances or treat the environment as “not suitable for CPU comparison.”
Task 7: Check NUMA layout and whether your process is bouncing memory
cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64250 MB
node 0 free: 12000 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 64250 MB
node 1 free: 15000 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What it means: Remote memory access (distance 21 vs 10) is significantly slower. If your process is scheduled on node 0 but frequently accesses memory from node 1, your CPU looks “slow.”
Decision: For tests, pin CPU and memory locality for the workload (or explicitly test with and without pinning). If you’re migrating to dual-socket, treat NUMA as a first-class requirement.
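A sketch of pinning both execution and allocations to one node; the binary name and its flags are placeholders.

# Keep both execution and allocations on NUMA node 0 (sketch; binary and flags are placeholders).
numactl --cpunodebind=0 --membind=0 ./myservice --config prod.yaml

# Afterwards, check where an already-running process actually allocated its memory:
numastat -p 1234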
Task 8: Detect CPU migrations (a common latency killer)
cr0x@server:~$ perf stat -e task-clock,context-switches,cpu-migrations,page-faults -p 1234 -- sleep 10
Performance counter stats for process id '1234':
10012.34 msec task-clock # 0.999 CPUs utilized
120,345 context-switches # 12.014 K/sec
3,210 cpu-migrations # 320.593 /sec
8,765 page-faults # 875.539 /sec
10.012345678 seconds time elapsed
What it means: Thousands of CPU migrations per second can trash cache locality and cause jitter. Context switches can indicate thread oversubscription, locks, or blocking IO.
Decision: If migrations are high, consider CPU pinning, reducing runnable threads, or fixing scheduler pressure (e.g., cgroup placement). If context switches are high and latency is spiky, hunt lock contention or excessive thread pools.
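A quick way to trial pinning without touching code, using the same placeholder PID as above:

# Pin an existing process (PID 1234) to cores 0-7, then re-check migrations (sketch).
taskset -cp 0-7 1234
perf stat -e cpu-migrations,context-switches -p 1234 -- sleep 10

# Or launch pinned from the start:
# taskset -c 0-7 ./myservice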
Task 9: Measure IPC and cache behavior with perf (macro view)
cr0x@server:~$ perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses -a -- sleep 10
Performance counter stats for 'system wide':
32,100,000,000 cycles
45,500,000,000 instructions # 1.42 insn per cycle
8,900,000,000 branches
120,000,000 branch-misses # 1.35% of all branches
2,100,000,000 cache-references
210,000,000 cache-misses # 10.00% of all cache refs
10.000987654 seconds time elapsed
What it means: IPC and cache-miss rates shift dramatically across CPUs and workloads. A CPU with higher clocks can lose to one with better cache behavior on real services.
Decision: If IPC collapses or cache misses spike under load, focus on memory locality, data structures, and concurrency model—not just “buy faster cores.” Use this to explain why a “worse” synthetic CPU can win for your service.
Task 10: Confirm whether you’re actually CPU-bound or stalled on memory
cr0x@server:~$ perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -p 1234 -- sleep 10
Performance counter stats for process id '1234':
5,400,000,000 cycles
6,900,000,000 instructions # 1.28 insn per cycle
1,900,000,000 stalled-cycles-frontend
2,700,000,000 stalled-cycles-backend
10.001234567 seconds time elapsed
What it means: High backend stalls often correlate with memory latency/bandwidth constraints, cache misses, or pipeline resource limits.
Decision: If backend stalls dominate, a CPU with better memory subsystem (bigger caches, more channels) may outperform a higher-frequency CPU. Also consider NUMA and memory speed before rewriting half the codebase.
Task 11: Identify top CPU consumers and whether they’re user vs kernel heavy
cr0x@server:~$ top -b -n 1 | head -n 15
top - 00:00:10 up 12 days, 3:12, 2 users, load average: 12.50, 11.20, 9.80
Tasks: 310 total, 8 running, 302 sleeping, 0 stopped, 0 zombie
%Cpu(s): 82.0 us, 10.0 sy, 0.0 ni, 8.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128000.0 total, 22000.0 free, 53000.0 used, 53000.0 buff/cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 app 20 0 4567890 512000 25000 R 750.0 0.4 10:12.3 myservice
2345 root 20 0 0 0 0 S 80.0 0.0 1:20.1 ksoftirqd/3
What it means: High sy (system) plus ksoftirqd hints that networking/interrupt handling is eating CPU. That’s still CPU, but the fix isn’t “optimize JSON parsing.”
Decision: If kernel CPU is high, inspect interrupts, NIC settings, packet rates, conntrack, and TLS offload options. If user CPU is high, profile the application.
Task 12: Check interrupt distribution (common on high packet rates)
cr0x@server:~$ cat /proc/interrupts | head -n 8
CPU0 CPU1 CPU2 CPU3
24: 1234567 1200345 98012 87000 IR-PCI-MSI eth0-TxRx-0
25: 98012 87000 1123456 1198765 IR-PCI-MSI eth0-TxRx-1
NMI: 1234 1234 1234 1234 Non-maskable interrupts
LOC: 98765432 98765000 98764900 98764800 Local timer interrupts
What it means: If one CPU is drowning in NIC interrupts, you’ll see hotspots and tail latency. Ideally, interrupts are balanced across cores (or pinned intentionally).
Decision: If interrupts are skewed, enable/verify irqbalance, tune RSS queues, or pin interrupts away from latency-critical threads.
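A sketch of steering one NIC queue, using the IRQ numbers from the /proc/interrupts output above; run as root, and remember that a running irqbalance daemon may undo manual affinity unless the IRQ is banned there.

# Pin IRQ 24 (eth0-TxRx-0) away from latency-critical cores (sketch, as root).
echo 2-3 > /proc/irq/24/smp_affinity_list
cat /proc/irq/24/smp_affinity_list          # verify the new affinity took effect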
Task 13: Measure scheduler pressure (PSI) to confirm contention
cr0x@server:~$ cat /proc/pressure/cpu
some avg10=0.50 avg60=0.80 avg300=0.75 total=123456789
full avg10=0.10 avg60=0.25 avg300=0.20 total=23456789
What it means: PSI tells you time spent waiting for CPU. “full” indicates tasks that cannot run at all due to lack of CPU. This is extremely actionable for “is it actually CPU contention?”
Decision: If PSI “full” rises during latency spikes, you’re CPU-starved or throttled. Add CPU, reduce concurrency, or remove quota throttling. If PSI is low, stop blaming CPU and look elsewhere.
Task 14: Confirm disk and filesystem aren’t smearing your CPU test
cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (server) 01/12/2026 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
70.00 0.00 10.00 0.20 0.00 19.80
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await
nvme0n1 10.00 20.00 800.00 900.00 0.00 1.00 5.00 0.40
What it means: Low %util and low await suggest storage isn’t the bottleneck. High iowait isn’t always “disk is slow,” but this gives you a baseline.
Decision: If disk utilization and await spike during “CPU tests,” you’re benchmarking IO or fsync behavior. Either isolate IO, move to tmpfs for purely CPU tests, or accept IO as part of the workload (often correct for databases).
Task 15: Run a controlled load test and capture latency percentiles
cr0x@server:~$ hey -z 60s -c 200 -q 50 http://127.0.0.1:8080/api/v1/search
Summary:
Total: 60.0012 secs
Slowest: 0.4123 secs
Fastest: 0.0102 secs
Average: 0.0451 secs
Requests/sec: 9700.12
Response time histogram:
0.010 [1] |
0.050 [520000] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.100 [55000] |■■■
0.200 [4000] |
0.400 [120] |
Latency distribution:
50% in 0.0400 secs
90% in 0.0800 secs
99% in 0.1500 secs
What it means: This gives you an external view: what a client experiences. The p99 is your “customer pain meter.”
Decision: Compare CPUs at the same offered load and watch p99/p999. If CPU A gives you 20% more RPS but doubles p99, CPU A is a trap for latency-sensitive services.
Task 16: Profile hotspots without changing code (quick-and-dirty flamegraph input)
cr0x@server:~$ perf record -F 99 -p 1234 -g -- sleep 30
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 12.345 MB perf.data (12345 samples) ]
What it means: You collected call stacks. You can now identify whether time is spent in crypto, JSON parsing, memory allocation, or kernel syscalls.
Decision: If hotspots are in kernel networking, tune kernel/NIC. If hotspots are in allocation/GC, tune runtime or reduce allocations. If hotspots are in a lock, fix concurrency. CPU testing without profiling is guesswork with nicer charts.
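To turn that perf.data into an actual flame graph, the usual route is Brendan Gregg’s FlameGraph scripts; a sketch assuming you clone that repository next to your data.

# Fold the stacks from perf.data and render an SVG (sketch; run in the directory holding perf.data).
git clone https://github.com/brendangregg/FlameGraph
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg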
Fast diagnosis playbook
This is the “I’m on call and the graph is on fire” version. You want to decide quickly whether you’re CPU-bound, throttled, or just suffering a different bottleneck wearing a CPU mask.
First: confirm user pain and whether it correlates with CPU pressure
- Check p95/p99 latency (from load test output or service metrics). If tail latency is flat, don’t chase CPU just because utilization looks high.
- Check CPU PSI (/proc/pressure/cpu). If “full” climbs during latency spikes, CPU contention/throttling is real.
- Check run queue (vmstat 1, uptime). Sustained run queue above core count is a classic saturation signal.
Second: rule out the top two CPU impostors
- Throttling: container quotas (/sys/fs/cgroup/cpu.stat) and cloud steal time (mpstat).
- NUMA/migrations: if you’re multi-socket or have strict latency SLOs, check numactl --hardware and perf stat ... cpu-migrations.
Third: decide whether you need more CPU, different CPU, or different code
- More CPU when run queue + PSI are high, and profiling shows real compute.
- Different CPU when perf shows memory stalls, cache misses, or frequency collapse. You might need bigger caches, more memory channels, or higher all-core turbo—not just more cores.
- Different code when hotspots are obvious and fixable (lock contention, allocation storms, chatty syscalls).
Joke #2: If your p99 improves only when you stop measuring it, you’ve invented performance by observation—please don’t ship that.
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption (SMT “is always free performance”)
A company ran a latency-sensitive API that did authentication, JSON validation, and a small database query. They moved from a smaller fleet to a newer CPU generation and saw great synthetic numbers. Someone noticed they could turn on SMT across the board and “get 30% more CPU” without buying anything. The rollout began with confidence and a celebratory spreadsheet.
Within hours, the p99 for login requests started to wobble. Not dramatically at first—just enough to break their internal SLO alerts. Error rate stayed low, average latency looked fine, and CPU utilization actually dropped slightly because requests were “completing.” The on-call engineer got the worst kind of graph: the one that doesn’t scream, it just disappoints.
They chased the database. They tuned connection pools. They even blamed the load balancer. The real culprit was embarrassingly physical: SMT increased contention on shared execution resources and caches for a lock-heavy, branchy auth path. Under burst traffic, the extra threads increased context switching and made tail latency worse.
The fix wasn’t ideological (“SMT bad”). They ran controlled tests with SMT on/off and discovered a split: batch endpoints liked SMT, login did not. They ended up disabling SMT for the latency tier and keeping it for the batch tier. The incident wasn’t “SMT broke production.” The incident was a wrong assumption: that CPU features are universally beneficial.
What to learn: treat SMT like any other variable. Measure under your real concurrency and watch tail latency and migrations. Throughput-only benchmarks will lie to you with a straight face.
Mini-story #2: The optimization that backfired (CPU upgrade shifted the bottleneck and made p99 worse)
Another org had a service that was marginally CPU-bound: lots of TLS termination plus some response compression. They upgraded to a higher-frequency CPU with better crypto acceleration. The before/after RPS improved. The graphs in the upgrade review looked clean. Everyone relaxed.
Then a week later, p99 jumped during normal daily peaks. Not always. Not predictably. The new CPUs were “faster,” yet customer complaints rose. The team reacted like teams do: add more pods, add more nodes, increase autoscaling limits. It helped, but cost spiked and the problem didn’t fully disappear.
The actual issue: by making the front-end faster, they increased the request rate hitting downstream caches and a shared key-value store. That store had a CPU-heavy compaction routine that was previously invisible because the front-end couldn’t generate enough pressure. Now it could. The bottleneck moved, and tail latency followed the weakest link.
They fixed it by capping concurrency toward the downstream store, adding backpressure, and adjusting cache TTLs to reduce thundering-herd patterns. The CPU upgrade wasn’t wrong; the assumption that CPU is a local property was wrong.
What to learn: CPU improvements change system dynamics. When you test CPUs, include downstream behavior or isolate the dependency explicitly. If you only test a component in isolation, you’re doing a demo, not capacity planning.
Mini-story #3: The boring but correct practice that saved the day (repeatable harness + pinned environment)
A storage-heavy team was preparing to buy new nodes for an internal analytics platform. They had two CPU options: one with more cores, one with fewer but faster cores and bigger cache. Opinions were strong, as they always are when hardware is involved.
Instead of arguing, they built a dull, disciplined harness. Same OS image, same kernel, same BIOS settings, same microcode baseline. They pinned CPU governor, disabled unnecessary services, and ran a warm-up phase before measuring. They captured p50/p95/p99 for query latency, plus perf counters and PSI. Every run was tagged with the exact git revision and a timestamp. Nobody got to “just try one more tweak” without recording it.
They found something counterintuitive: the “more cores” CPU won on raw throughput, but the “bigger cache” CPU won on p99 latency for the heavy join queries. The perf counters showed higher cache miss rates and backend stalls on the many-core option. The conclusion wasn’t philosophical; it was data: if they wanted stable interactive performance, the cache-heavy CPU was the safer bet.
When procurement asked why they were picking the “less core-dense” SKU, they had a one-page summary with plots and reproducible commands. No heroics. Just boring competence. It prevented a million-dollar argument from turning into a million-dollar mistake.
What to learn: the most valuable performance tool is a repeatable harness. It’s not glamorous. It’s how you avoid buying regret.
Common mistakes: symptoms → root cause → fix
This section is for diagnosing failure modes that repeatedly show up in real CPU tests. The pattern matters: the symptom is what you see; the root cause is what’s actually happening; the fix is what you should do next.
1) Symptom: CPU is “low,” but p99 latency is high
- Root cause: IO waits, lock contention, or downstream dependency latency. CPU utilization averages hide stalls.
- Fix: Check iostat -xz, check PSI for memory/IO (if available), and profile for locks. Add dependency timing to your test output.
2) Symptom: CPU is pegged, but throughput doesn’t increase with more clients
- Root cause: Global lock, serialization point, or single hot thread (e.g., GC, event loop, logging).
- Fix: Use top -H to find hot threads; use perf record; reduce contention; re-architect the critical section.
3) Symptom: Results vary wildly run-to-run
- Root cause: CPU frequency scaling/turbo differences, thermal throttling, steal time, background noise, or warm-up effects.
- Fix: Pin governor, monitor frequency/temps, ensure no steal, run longer with warm-up, and log environment details.
4) Symptom: “Better CPU” wins on average latency but loses on p99
- Root cause: Scheduling jitter, SMT contention, migrations, or cache thrash under burst.
- Fix: Check migrations, run queue, and perf counters; experiment with SMT off; pin cores for latency-critical threads.
5) Symptom: CPU throttling shows up only in containers
- Root cause: cgroup CPU quota set too low, a CFS period that fits the burst pattern poorly, or a bursty workload hitting quota edges.
- Fix: Increase limits, consider using CPU requests/limits carefully, or avoid hard quotas for latency-critical services.
6) Symptom: High system CPU and ksoftirqd spikes
- Root cause: Interrupt/softirq pressure: high packet rate, small packets, conntrack, or suboptimal NIC queueing.
- Fix: Balance interrupts, tune RSS/queues, review conntrack settings, and consider offloads carefully (measure!).
7) Symptom: CPU looks saturated only on dual-socket systems
- Root cause: NUMA remote memory or cross-node locking; memory allocations not local to the executing cores.
- Fix: Pin processes with numactl, use NUMA-aware allocators, ensure IRQs and threads align, and test per-socket scaling.
8) Symptom: You “optimize” by increasing threads and get worse performance
- Root cause: Context switching overhead, lock contention, cache thrash. More threads can reduce CPU efficiency.
- Fix: Reduce concurrency; use async where appropriate; size thread pools to cores; measure context switches and migrations.
Checklists / step-by-step plan
Checklist A: A sane CPU comparison plan (bare metal or dedicated VMs)
- Pick scenarios: 1–3 workload mixes that represent production (include one “bad day” scenario).
- Freeze the environment: same OS, kernel, microcode baseline, same service config, same dependency versions.
- Control CPU behavior: record governor/turbo; avoid thermal differences; ensure similar cooling and power limits.
- Warm-up: run a warm-up phase (5–15 minutes depending on caches/JIT/DB).
- Measure phase: fixed duration (10–30 minutes) with stable offered load.
- Collect metrics: latency distribution, throughput, error rate, run queue, PSI, cgroup throttling, perf counters snapshot.
- Repeat: at least 3 runs per scenario; throw out obvious outliers only with a reason (steal spike, deploy event).
- Compare at SLO: which CPU hits your p99 target at the lowest cost and lowest operational risk?
- Document the harness: commands, configs, and a one-page “what changed” summary.
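A sketch of the outer loop that ties Checklist A together; the helper scripts are the sketches from earlier in this article, so treat their names as assumptions.

#!/usr/bin/env bash
# run-matrix.sh -- three tagged runs per scenario, nothing untracked (sketch).
set -euo pipefail
REV=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
for run in 1 2 3; do
  echo "=== rev ${REV}, run ${run}, started $(date -Is) ===" | tee -a runs.log
  ./capture-env.sh                 # environment snapshot (sketch from step 2 above)
  ./warmup-then-measure.sh         # warm-up + measurement phases (sketch from step 3 above)
  # Run collect-saturation.sh alongside the measure phase if you want paired PSI/throttling samples.
done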
Checklist B: A quick “is it CPU or not?” plan
- Look at p99 latency and error rate first. If users aren’t hurting, don’t panic-tune.
- Check CPU PSI and run queue. If both are low, CPU isn’t your limiting resource.
- Check cgroup throttling and steal time. If present, fix the environment before interpreting results.
- Check system CPU and interrupts. If kernel time is high, you’re fighting the kernel/NIC path.
- Only then profile the app. Don’t profile blind; bring evidence.
Checklist C: Deciding what to buy (or whether to tune instead)
- If your workload is cache/memory sensitive: prioritize larger caches, better memory bandwidth, and NUMA friendliness over peak single-core boost.
- If your workload is embarrassingly parallel: cores matter, but watch for diminishing returns from locking and shared resources.
- If your workload is latency-critical: reduce jitter sources (SMT, migrations, throttling), prefer consistent all-core performance, and measure p99 under burst.
- If you’re inside containers: treat throttling as a first-class constraint; a “fast CPU” can be made slow by policy.
FAQ
1) Should I use synthetic benchmarks at all?
Use them as guardrails, not decision makers. They help detect broken hosts (bad cooling, wrong BIOS, turbo off). They do not replace workload tests.
2) How many runs are “enough” for confidence?
For stable environments, three runs per scenario is a reasonable minimum. If variance is high, fixing variance is the real work; more runs just quantify the chaos.
3) What’s the single most useful CPU metric besides utilization?
CPU Pressure Stall Information (PSI) is brutally practical. It answers: “Are tasks waiting for CPU time?” It correlates well with user-visible latency under saturation.
4) Is iowait a reliable signal that I’m IO-bound?
It’s a clue, not a verdict. iowait can be low even when IO is your bottleneck (async IO, blocked threads elsewhere), and it can be high due to scheduling artifacts. Pair it with iostat -xz and application timing.
5) Should I disable SMT for production?
Sometimes. If your workload is lock-heavy, cache-sensitive, or p99-critical, SMT can increase jitter. If your workload is throughput-oriented and stalls often, SMT can be a win. Test both; don’t guess.
6) Why does my CPU benchmark look great for 30 seconds and then get worse?
Turbo and thermal limits. Many CPUs boost aggressively until they hit sustained power/thermal constraints. Run long enough to hit steady state and monitor frequency/temps.
7) Can I do real-world CPU testing on shared cloud instances?
You can do workload testing, but CPU comparison is risky. Steal time, noisy neighbors, and variable turbo policies introduce variance. If you must, use dedicated instances or at least record %steal and run longer.
8) What if perf is restricted in my environment?
Then lean harder on external latency percentiles, PSI, run queue, and application-level profiling. Perf counters are great, but not mandatory to make a correct decision.
9) How do I know if I’m memory bandwidth bound vs compute bound?
Look for low IPC and high backend stalls under load, plus sensitivity to NUMA pinning and memory speed. If performance improves dramatically when you improve locality, it’s not “just CPU.”
10) What’s a realistic success criterion for a CPU change?
Define it as a capacity increase at a fixed SLO: “At p99 < X ms, we can handle Y% more throughput.” That’s the number you can run a business on.
Conclusion: next steps that actually work
Real-world CPU testing isn’t about winning arguments. It’s about buying (or tuning) performance you can keep at 3 a.m. when the cache is cold, the traffic is bursty, and someone’s cron job is doing something “helpful.”
Do this next:
- Write down one SLO-based scenario (offered load + p99 target) and one “bad day” scenario.
- Build a repeatable harness: warm-up, measure, record environment, collect latency + PSI + run queue + throttling.
- Run three times on your current CPU and establish a baseline you trust.
- Change one variable (CPU model, SMT, quota) and rerun. If you change five things, you learn nothing.
- Make the decision at equal user pain: pick the CPU (or tuning) that meets p99 with the lowest operational risk.
If you only remember one thing: measure the workload, not the hardware. The hardware is just the stage. Your software is the play. And production is the reviewer who never sleeps.