IPC > GHz: the simplest explanation you’ll actually remember

You’ve seen it: procurement buys “the fastest” CPUs because the spec sheet shouts 3.8 GHz. The service still crawls, p95 explodes, and the on-call gets to learn the difference between marketing and physics at 03:17.

If you remember one thing from this piece, make it this: GHz is how fast the metronome ticks. IPC is how much work happens per tick. Your workload lives in the “work” part.

The one-sentence model

Performance ≈ (clock frequency) × (IPC) × (number of useful cores) − (waiting on memory, I/O, locks, and other humans).

GHz is one multiplier. IPC is another. And the subtractive term—waiting—is why your “3.8 GHz upgrade” can do nothing except heat the room. You can’t out-clock a cache miss, a page fault storm, or a lock convoy. You can, however, diagnose which one you’re paying for.
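
A back-of-envelope with made-up but plausible numbers shows why the multipliers matter more than the sticker:

  • CPU A: 3.8 GHz × 0.8 IPC ≈ 3.0 billion instructions retired per second, per core.
  • CPU B: 3.0 GHz × 1.5 IPC ≈ 4.5 billion instructions retired per second, per core.

On this hypothetical workload, the “slower” 3.0 GHz part gets roughly 50% more work done per core. Nothing on the spec sheet tells you which row your service lands in; only measurement does.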

What IPC actually is (and what it isn’t)

IPC is instructions per cycle: on average, how many instructions a CPU retires (finishes) each clock tick. If a core runs at 3.5 GHz, that’s 3.5 billion cycles per second. If IPC is 2.0, it retires about 7 billion instructions per second (roughly; reality has footnotes).

Retired instructions: the “done” bucket

Modern CPUs are out-of-order factories. They fetch, decode, and speculate. They execute instructions in parallel, reorder them, and sometimes throw work away if speculation was wrong. IPC usually refers to retired instructions: instructions that made it through the pipeline and became architectural reality.

IPC is not “how smart the CPU is”

IPC depends on the workload. The same CPU can show high IPC for tight compute loops and terrible IPC for pointer-chasing through memory. A different CPU can flip that. This is why benchmarking one microservice and extrapolating to “our whole platform” is how you end up with a rack of regret.

A practical mental picture

Think of a CPU core like a restaurant kitchen:

  • GHz is how often the head chef claps to signal “next step.”
  • IPC is how many dishes leave the pass per clap.
  • Cache/memory latency is the time spent waiting for ingredients.
  • Branch mispredicts are when the chef preps the wrong dish and bins it.
  • Locks/contention are when the kitchen fights over the one pan everyone needs.

Clapping faster doesn’t help if you’re still waiting on the truck delivering tomatoes.

Joke #1: Buying CPUs by GHz is like hiring bartenders by how fast they can shake an empty cocktail shaker.

Why GHz keeps lying to you

Clock speed is easy to print on a box. It’s also the least stable number in a modern server, because frequency is a negotiated truce among power limits, thermal limits, turbo rules, and “what else is happening on this socket right now.”

Turbo is conditional, not a lifestyle

The advertised “max turbo” is usually:

  • for one or a few cores,
  • for a limited time window,
  • under specific power and temperature conditions,
  • with a workload that doesn’t trigger downclocking (e.g., AVX-heavy code on some architectures).

If your service uses all cores under load (most do), you care about all-core sustained frequency, not peak.

Frequency can go down when the work gets “hard”

Some instruction mixes draw more power. Vector-heavy code can lower clocks. So can running hot in dense racks. So can bad BIOS power settings. GHz is not a fixed property; it’s an outcome.

And even if GHz is high, IPC can be low

Low IPC happens when the core spends cycles doing anything besides retiring instructions: waiting on cache misses, recovering from mispredicted branches, stalled on dependencies, or stuck behind front-end limits (can’t fetch/decode fast enough).

The one chart you should keep in your head

Imagine a CPU timeline:

  • Retiring (good): instructions complete.
  • Front-end bound: not enough instructions fed into execution.
  • Back-end bound: execution units or memory subsystem are the limit.
  • Bad speculation: work thrown away due to mispredicts.

High GHz only speeds up the timeline. It doesn’t change which bucket dominates.
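
If you want to see which bucket dominates on a real host, newer perf builds can print a level-1 top-down breakdown on Intel CPUs. Support, required privileges, and output format vary by perf version and microarchitecture, so treat this as a sketch, not a guaranteed recipe:

cr0x@server:~$ sudo perf stat -a --topdown -- sleep 10

Where supported, the output attributes pipeline slots to retiring, bad speculation, front-end bound, and back-end bound. If the option isn’t available on your hardware, perf list shows which events and metrics your CPU exposes, and the plain counters in the tasks below get you most of the way there.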

The four things that control IPC

There are many microarchitectural details, but you can keep IPC grounded with four big levers that show up in production.

1) The memory hierarchy: cache hits are the paycheck

A cache hit is fast; DRAM is slow; storage is glacial. Your IPC collapses when the CPU waits for data. The nastiest cases are pointer-chasing workloads (hash tables, B-trees, object graphs) where each step depends on the previous load completing.

Latency matters more than bandwidth for a lot of server code. If your data access is serial, you’re not “using all that memory bandwidth.” You’re waiting on a single chain of misses.

2) Branch prediction: the CPU is guessing your future

CPUs speculate. If the branch predictor guesses right, you look like a genius. If it guesses wrong, the pipeline flushes and you pay in cycles. Branch-heavy code with unpredictable patterns can tank IPC even if the data is hot in cache.

3) Instruction-level parallelism (ILP): how much can happen at once

Out-of-order cores can execute multiple operations per cycle if there are independent instructions ready to go. Tight loops with dependencies (e.g., iterative hashing where each step depends on the previous) limit ILP. You’ll see low IPC even when everything is cache-resident.

4) The front-end: fetching/decoding is its own bottleneck

If the CPU can’t feed the execution engine fast enough—because of I-cache misses, code layout issues, or instruction decoding limits—IPC drops. This shows up in large binaries with poor locality, heavy indirection, JIT’d code with churn, or workloads with frequent instruction cache invalidations.

Rule of thumb: If you don’t know the bottleneck, assume it’s memory latency or contention. Those two pay my mortgage.

IPC and real workloads: DB, web, storage, JVM

Databases: “CPU-bound” often means “waiting on memory while holding locks”

Databases can be compute-heavy (compression, encryption, query execution) but are often dominated by memory access patterns: index lookups, buffer pool churn, and pointer-rich structures. A DB server can show 40–70% CPU utilization and still be “CPU-limited” because the hot threads stall and the rest are idle or blocked.

What to do: measure stalls, LLC misses, and lock time. If you scale up GHz without fixing cache miss rate, you just make the CPU wait faster.

Web services: branchiness, allocations, and tail latency

Request handlers are branch-heavy: auth checks, routing, feature flags, serialization. Branch mispredict penalties show up as tail latency variance. Add GC pressure, and you now have CPU time that doesn’t retire useful instructions.

What to do: look for mispredicts and instruction cache issues, and reduce allocations. If you must buy hardware, prefer cores with strong per-thread performance and good cache behavior, not the highest advertised turbo.

Storage/IO paths: IPC is a victim, not the cause

Storage stacks often bottleneck on interrupts, syscalls, context switching, and lock contention in the kernel or driver. The CPU can spend cycles in kernel mode not retiring much application work. IPC metrics can look weird because you’re measuring the wrong “useful work.”

What to do: measure iowait, softirq time, and syscall rates. Don’t “optimize” by pinning everything to one core unless you enjoy surprises.

JVM and managed runtimes: JIT can raise IPC… until it doesn’t

JIT compilation can produce tight loops with good locality and high IPC. It can also produce megamorphic call sites, deoptimized code paths, and frequent safepoints that turn your CPU into a scheduler. If you’re diagnosing, separate “CPU spent doing work” from “CPU spent managing the runtime.”

One quote to keep you honest: “Hope is not a strategy.” — General Gordon R. Sullivan

Facts and history you can use in arguments

  • Clock rates hit a wall in the mid-2000s due to power density and leakage; the industry shifted from “faster clocks” to “more cores” and “more work per cycle.”
  • “NetBurst” (Pentium 4 era) chased high GHz with deep pipelines; it often lost to lower-clock designs with better IPC, especially on real-world code.
  • Out-of-order execution exists largely to improve IPC by exploiting instruction-level parallelism; it’s a hardware scheduler running at GHz.
  • Branch predictors became sophisticated because mispredictions are expensive; deeper pipelines made wrong guesses hurt more.
  • CPU caches grew because DRAM didn’t keep up; the “memory wall” is a recurring theme in architecture.
  • IPC is not portable across ISAs; comparing IPC between x86 and ARM without context is usually meaningless because “an instruction” is not a uniform unit of work.
  • Speculation and security collided in the late 2010s (Spectre-class issues); some mitigations can reduce performance by impacting speculation and memory access costs.
  • Performance counters have existed for decades, but “observability for hardware” only became mainstream in ops as fleet efficiency and cloud costs forced better measurement.

Fast diagnosis playbook

This is the “I have 30 minutes before the incident call” workflow. The goal isn’t a thesis. It’s to identify the dominant bottleneck and make a safe decision.

First: confirm what kind of slow you have

  • Latency up but CPU low? Likely waiting: I/O, locks, network, or memory stalls.
  • CPU pegged and throughput capped? Could be compute bound, but also could be spinlocks, syscall storms, or GC.
  • Only tail latency bad? Look for contention, GC pauses, noisy neighbors, throttling, and cache thrash.

Second: check frequency and throttling

  • Is the CPU running at expected sustained clocks?
  • Are power limits (PL1/PL2), thermal throttling, or cgroup quotas in play?

Third: measure whether you’re retiring work

  • Look at IPC and stalls using performance counters.
  • If IPC is low and stalls are high: memory/branch/lock causes are likely.

Fourth: decide which subsystem owns the incident

  • High iowait / blocked tasks → storage/network path.
  • High runnable queue and high context switching → CPU scheduling/oversubscription.
  • High cache miss rate → data locality, working set, or NUMA placement.
  • High branch misses → code path unpredictability or poor layout.

Fifth: pick one action you can justify

Examples: reduce concurrency, move a noisy batch job, adjust CPU quotas, pin memory to local NUMA node, or roll back a change that increased working set.

Practical tasks: commands, outputs, decisions

These are real tasks you can run on Linux. Each one includes what the output means and what decision you make from it. Run them on the affected host, not your laptop. Reality is hostile to theory.

Task 1: See CPU model and advertised base/max

cr0x@server:~$ lscpu | egrep 'Model name|CPU MHz|CPU max MHz|CPU min MHz|Socket|Core|Thread|NUMA'
Model name:                         Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
CPU MHz:                            2095.123
CPU max MHz:                        3900.0000
CPU min MHz:                        800.0000
Socket(s):                          2
Core(s) per socket:                 20
Thread(s) per core:                 2
NUMA node(s):                       2

Meaning: The box says 3.9 GHz max, but current is ~2.1 GHz. That might be fine (base clock), or it might be throttling under load.

Decision: If performance complaints coincide with low current MHz under high load, proceed to frequency governor and throttling checks.

Task 2: Watch real-time frequencies per core

cr0x@server:~$ sudo turbostat --quiet --interval 1 --num_iterations 5
     CPU     Avg_MHz   Busy%   Bzy_MHz   IRQ
       -       2240    62.14     3605   1123

Meaning: Avg_MHz is the average over the whole interval, idle time included; Bzy_MHz is the frequency while cores are actually busy. They tie together as Avg_MHz ≈ Busy% × Bzy_MHz (0.62 × 3605 ≈ 2240), so here busy cores boost to ~3.6 GHz and the gap is idle time, not throttling.

Decision: If Bzy_MHz is far below expectation (e.g., stuck near 2.1 GHz), suspect power limits/thermal throttling, BIOS settings, or AVX downclock.

Task 3: Check CPU governor (common on older configs and some cloud images)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

Meaning: The kernel is trying to save power, which can reduce frequency responsiveness.

Decision: On latency-sensitive servers, set to performance (after confirming policy). If you’re in a power-capped DC, coordinate; don’t surprise Facilities.
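
A minimal way to apply that, assuming your cpufreq driver exposes governors and you’ve agreed the power budget with whoever owns it (the change below is immediate and does not persist across reboots):

cr0x@server:~$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

If the cpupower tool from linux-tools is installed, sudo cpupower frequency-set -g performance does the same thing with less globbing. For persistence, bake it into config management or a tuned profile rather than someone’s shell history.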

Task 4: Check for cgroup CPU throttling (containers lie politely)

cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat
usage_usec 983211234
user_usec 702112345
system_usec 281098889
nr_periods 12345
nr_throttled 9321
throttled_usec 551234567

Meaning: nr_throttled and throttled_usec show the workload was frequently throttled by CPU quotas. Here roughly 75% of periods (9,321 of 12,345) hit the quota, and the cgroup spent ~551 seconds waiting for its next slice.

Decision: If throttling is significant during the incident window, increase CPU quota/requests or reduce concurrency. Buying higher GHz won’t fix a quota.

Task 5: Quick “is it CPU or waiting?” snapshot

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 812344  22044 912344    0    0     1     3 1800 4200 32  8 58  1  0
12  0      0 809112  22044 913008    0    0     0     0 2100 5200 44 11 45  0  0

Meaning: r is runnable threads, b blocked, CPU split into user/system/idle/iowait. Low wa suggests not storage-stalled.

Decision: If r is consistently higher than CPU cores and id is low, you’re CPU-saturated or heavily contended. Go measure IPC/stalls.

Task 6: Check run queue and load relative to cores

cr0x@server:~$ uptime
 10:41:22 up 23 days,  4:12,  2 users,  load average: 38.12, 41.03, 39.77

Meaning: Load average counts runnable plus uninterruptible (usually I/O-blocked) tasks. Here ~39 on an 80-thread box is moderate; load near or above the logical CPU count means many tasks are runnable or stuck waiting.

Decision: If load is high but CPU idle exists, suspect blocked I/O or mutex contention. Pair with vmstat and pidstat.

Task 7: Identify hot processes and whether time is user vs kernel

cr0x@server:~$ pidstat -u -p ALL 1 3
Linux 6.2.0 (server)  01/10/2026  _x86_64_  (80 CPU)

#      Time   UID       PID    %usr %system  %CPU  Command
10:41:40  1001    21433   520.00   80.00 600.00  java
10:41:40     0     1321    10.00  120.00 130.00  ksoftirqd/12

Meaning: The JVM is using lots of user CPU; ksoftirqd indicates heavy soft interrupt handling (often network).

Decision: If ksoftirqd is hot, look at network packet rates, IRQ distribution, and NIC settings. If user CPU dominates, look at IPC and code paths.

Task 8: Measure IPC quickly with perf (system-wide)

cr0x@server:~$ sudo perf stat -a -e cycles,instructions,branches,branch-misses,cache-references,cache-misses -- sleep 10
 Performance counter stats for 'system wide':

    38,221,456,789      cycles
    41,002,112,334      instructions              #    1.07  insn per cycle
     7,991,122,010      branches
       211,334,556      branch-misses             #    2.64% of all branches
     2,114,334,221      cache-references
       322,114,990      cache-misses              #   15.23% of all cache refs

      10.001234567 seconds time elapsed

Meaning: IPC ~1.07 system-wide. Branch miss rate ~2.6% (not awful). Cache miss ratio is notable.

Decision: Low-ish IPC plus meaningful cache misses suggests memory locality/working set issues. Don’t chase GHz; chase misses, data layout, NUMA, and contention.

Task 9: Measure IPC for one process (useful during an incident)

cr0x@server:~$ sudo perf stat -p 21433 -e cycles,instructions -- sleep 10
 Performance counter stats for process id '21433':

    12,112,334,990      cycles
    10,001,223,110      instructions              #    0.83  insn per cycle

      10.000998321 seconds time elapsed

Meaning: This process is retiring <1 instruction per cycle on average. That’s “lots of waiting” or “lots of hard-to-parallelize work.”

Decision: If the app is supposed to be compute-heavy, this indicates stalls (memory, locks, syscalls, branch). Proceed to top-down analysis and flame graphs if allowed.
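
A hedged next step, assuming perf is installed and sampling is permitted on this host (perf_event_paranoid and container settings can forbid it):

cr0x@server:~$ sudo perf record -g -p 21433 -- sleep 30
cr0x@server:~$ sudo perf report --stdio | head -40

perf record samples where the process spends its cycles; perf report lists the hottest call paths. If the top frames are memory-access helpers, lock/futex functions, or kernel entry points, you’ve found the wait, not a “slow CPU.”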

Task 10: Spot major faults and page-fault storms (they murder IPC)

cr0x@server:~$ pidstat -r -p 21433 1 3
#      Time   UID       PID  minflt/s  majflt/s     VSZ     RSS  %MEM  Command
10:42:20  1001    21433   12000.00      0.00 18432324 9321120  28.3  java
10:42:21  1001    21433   11500.00      3.00 18432324 9325540  28.3  java

Meaning: High minor faults can be normal (e.g., memory-mapped files), but major faults mean disk-backed page-ins. That’s latency, which becomes CPU stalls.

Decision: If major faults rise during latency spikes, investigate memory pressure, THP behavior, and file cache churn; add RAM or reduce working set.

Task 11: Check NUMA balance and remote memory access (hidden latency)

cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-39
node 0 size: 192000 MB
node 0 free: 81234 MB
node 1 cpus: 40-79
node 1 size: 192000 MB
node 1 free: 23456 MB

Meaning: Two NUMA nodes; node 1 is much more allocated. Imbalance can imply remote memory accesses.

Decision: If a single process spans sockets but its memory isn’t local, pin CPUs/memory or fix the scheduler placement. Remote memory is a tax you pay in IPC.
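
If you do pin, here is a sketch with numactl; the service binary, its flags, and the chosen node are placeholders for whatever your topology and process layout dictate:

cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 -- /usr/local/bin/myservice --config /etc/myservice.conf

numastat -p <PID> (from the same numactl package) shows where an existing process’s pages actually live, which is worth checking before and after you move anything.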

Task 12: Confirm if transparent huge pages are causing latency variance

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Meaning: THP is set to always. This can help throughput but sometimes hurts tail latency due to compaction stalls.

Decision: For latency-sensitive databases, consider madvise or never after testing. Change carefully; this is a policy lever, not a superstition.
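
A minimal sketch of the runtime switch, assuming you’ve tested the workload both ways; the echo takes effect immediately but does not survive a reboot (persist it via the kernel command line or your config management):

cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

With madvise, only applications that explicitly ask for huge pages get them, which is usually the least surprising middle ground for databases that manage their own memory.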

Task 13: Check for CPU steal time (cloud reality check)

cr0x@server:~$ mpstat 1 3
Linux 6.2.0 (server)  01/10/2026  _x86_64_  (8 CPU)

10:43:10 AM  all   %usr %nice %sys %iowait %irq %soft %steal %idle
10:43:11 AM  all   41.00  0.00  9.00    0.00 0.00  1.00   8.00 41.00

Meaning: %steal shows time your VM wanted CPU but the hypervisor didn’t schedule you. That can look like “low IPC” because you’re not running.

Decision: If steal is high during incidents, move instance types, avoid overcommitted hosts, or negotiate placement. Don’t refactor code to fix someone else’s noisy neighbor.

Task 14: Check disk I/O latency quickly (because storage can impersonate CPU)

cr0x@server:~$ iostat -x 1 3
Linux 6.2.0 (server)  01/10/2026  _x86_64_  (80 CPU)

Device            r/s     w/s   r_await   w_await  aqu-sz  %util
nvme0n1         120.0   450.0     2.10     8.40    4.20   96.00

Meaning: High %util and elevated await indicate the device is saturated or queueing. That means threads block, CPU goes idle or context-switches, and your “CPU upgrade” dream dies quietly.

Decision: If storage await matches latency spikes, fix I/O: reduce sync writes, tune queues, add devices, or separate workloads.

Task 15: Confirm interrupt distribution (IRQ storms reduce useful IPC)

cr0x@server:~$ grep -E 'eth0|nvme' /proc/interrupts | head
  64:  99112233   0   0   0  IR-PCI-MSI 524288-edge  eth0-TxRx-0
  65:    112233   0   0   0  IR-PCI-MSI 524289-edge  eth0-TxRx-1
  66:     99887   0   0   0  IR-PCI-MSI 524290-edge  eth0-TxRx-2

Meaning: The per-CPU count columns show all three NIC queues delivering their interrupts to CPU0, while the other CPUs show essentially zero. That’s a hotspot, often causing softirq load and latency on that core.

Decision: Fix IRQ affinity / RPS/XPS policy and ensure the NIC queues are distributed. If you leave it, you’ll scale cores and still bottleneck on CPU0.
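
A hedged example of moving one queue’s interrupts, using the IRQ numbers from the output above; note that irqbalance (if running) may rewrite manual affinity, so either configure it accordingly or stop fighting it:

cr0x@server:~$ echo 2 | sudo tee /proc/irq/65/smp_affinity_list
cr0x@server:~$ echo 3 | sudo tee /proc/irq/66/smp_affinity_list

smp_affinity_list takes CPU numbers and ranges, which is easier to get right than the hex masks in smp_affinity. Spreading NIC queues across a few cores near the device’s NUMA node is the usual goal.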

Task 16: Check memory bandwidth vs latency pressure (quick heuristic)

cr0x@server:~$ sudo perf stat -a -e cpu/mem-loads/,cpu/mem-stores/ -- sleep 5
 Performance counter stats for 'system wide':

       882,112,334      cpu/mem-loads/
       401,122,110      cpu/mem-stores/

       5.000912345 seconds time elapsed

Meaning: High load/store counts can suggest memory-intensive behavior; pair with cache-miss metrics to infer whether these are mostly hitting cache or going remote/DRAM.

Decision: If memory ops are high and IPC low, you likely have a memory-bound service. Favor CPUs with larger caches, better prefetch, and memory subsystem, not just GHz.

Three corporate-world mini-stories

1) Incident caused by a wrong assumption: “Higher GHz equals faster API”

A mid-size SaaS company had a checkout API that slowed down every time a marketing campaign hit. Engineering flagged CPU saturation on the app tier. Procurement had a plan: upgrade to “faster CPUs” with higher advertised clock speed. The migration was scheduled, the dashboards were blessed, and everyone expected a nice victory lap.

After cutover, average CPU utilization dropped a bit. Latency didn’t. Tail latency got worse. The incident postmortem started with an awkward slide: “We bought 3.9 GHz parts, why are we slower?”

They eventually measured IPC and cache misses. The new CPUs had higher turbo clocks but smaller last-level cache per core and a different memory topology. Their service was dominated by cache-miss-driven stalls from a large in-memory product catalog and a giant pile of feature flags. IPC went down under load because the working set spilled out of cache more often, and the memory subsystem was busier.

The fix was unglamorous: shrink the working set (remove dead flags, compress hot data structures), improve locality (store IDs contiguously, avoid pointer-heavy maps), and pin the hottest processes to local NUMA memory. The hardware wasn’t “bad.” The assumption was.

They kept the machines, but repurposed them for batch jobs where frequency helped. The checkout fleet switched to CPUs with larger cache and better sustained all-core performance, not the highest turbo sticker.

2) Optimization that backfired: “Let’s pin everything for better cache locality”

A different organization ran a latency-sensitive logging ingestion service. Someone noticed that CPU migrations were high during peak. They decided to aggressively pin the main ingestion threads to a subset of cores using cpusets, hoping to improve cache locality and reduce context switching.

For a day, the graphs looked prettier: fewer migrations, slightly lower average latency. Then a traffic shift happened. A new customer sent larger payloads, increasing parsing costs and memory allocations. The pinned cores hit 100% utilization while the rest of the system stayed underused. Tail latency spiked, and backlog grew.

The root issue wasn’t “insufficient GHz.” It was self-inflicted scheduling starvation. Pinning removed the kernel’s ability to spread load, and it also concentrated interrupts and kernel work on the same cores. IPC on pinned cores dropped because they spent more time in kernel paths and waiting on memory for the larger payloads.

The fix was to pin only what needed pinning (a couple of sensitive threads), distribute IRQs properly, and leave headroom for variability. They also introduced backpressure and capped concurrency per connection to avoid turning CPU stalls into queueing collapse.

Joke #2: CPU pinning is like tattooing your name on a seat at the airport: it feels secure until the flight changes gates.

3) Boring but correct practice that saved the day: “Measure before you touch”

A fintech team ran an OLTP database cluster with strict latency SLOs. During a quarterly load test, p99 latency increased by 40% on one shard. The first impulse was to scale up the instance type and move on. Instead, the on-call followed a boring runbook: capture baseline counters, confirm frequency, check throttling, validate NUMA locality, then inspect I/O.

They found the CPU frequency was fine. IPC was fine. Cache misses were stable. But iowait spiked during the same window. iostat showed elevated write latency and high device utilization. The shard had landed on a host where another noisy workload was saturating the same NVMe device. Not a CPU problem at all.

The “boring” practice was simple: mandatory capture of perf stat, vmstat, and iostat snapshots during incidents, stored with timestamps. That made the pattern obvious, and it prevented an expensive and wrong CPU upgrade.

They fixed it by isolating storage for the database and adjusting scheduling so noisy neighbors didn’t share devices. The load test passed on the existing CPU class. Nobody got a trophy, but the SLO stayed green, which is the only trophy that matters.

Common mistakes (symptom → root cause → fix)

1) “CPU is at 60%, so we’re not CPU-bound”

Symptom: Latency high, CPU not pegged, threads pile up.

Root cause: A few hot threads are stalled on memory/locks while other cores are idle; overall CPU hides per-thread bottlenecks.

Fix: Use per-thread profiling and perf counters; reduce contention; improve data locality; cap concurrency to prevent queueing.

2) “We need higher GHz”

Symptom: Throughput capped; procurement suggests “faster clock” SKUs.

Root cause: IPC is low due to cache misses, branch mispredicts, syscalls, or runtime overhead. Frequency is not the limiting multiplier.

Fix: Measure IPC and miss rates; optimize hot paths; choose CPUs with cache and memory performance that match the workload.

3) “Turbo says 4.2 GHz, therefore we have 4.2 GHz”

Symptom: Benchmarks differ wildly between runs; production slower than dev tests.

Root cause: Turbo is opportunistic; sustained all-core frequency is lower under real load; power/thermal limits change behavior.

Fix: Measure Bzy_MHz under representative load; review BIOS power limits; ensure cooling; avoid evaluating CPUs on single-thread microbenchmarks alone.

4) “We optimized by pinning threads, now it’s worse”

Symptom: Average latency improved but tail latency and backlog worsen after a traffic shape change.

Root cause: Over-pinning reduced scheduler flexibility and concentrated interrupts; local optimizations created global starvation.

Fix: Pin sparingly; distribute IRQs; leave spare cores; validate under multiple traffic mixes.

5) “Our CPU is slow in the cloud”

Symptom: CPU looks busy but work output is low; periodic latency spikes.

Root cause: Steal time, throttling, or noisy neighbors; you’re not actually running when you think you are.

Fix: Check %steal and cgroup throttle stats; pick different instance families; use dedicated hosts if needed.

6) “IPC is low, so the CPU is bad”

Symptom: perf shows <1 IPC and panic ensues.

Root cause: The workload is memory-latency bound (pointer chasing), or the measurement covers system-wide kernel time not relevant to app progress.

Fix: Measure per-process, correlate with cache misses and stall breakdowns, and consider algorithm/data structure changes before blaming silicon.

7) “We upgraded the CPU and got slower”

Symptom: New hardware regressions in real workload.

Root cause: Different cache sizes, NUMA topology, memory speeds, or microarchitectural behavior; also possible BIOS defaults changed.

Fix: Re-tune NUMA, BIOS power settings, and kernel parameters; re-baseline perf counters; validate with production-like load tests.

Checklists / step-by-step plan

When choosing CPUs (or instance types): what to do and what to avoid

  1. Classify the workload: compute-bound, memory-latency-bound, memory-bandwidth-bound, I/O-bound, or contention-bound. If you don’t know, measure.
  2. Use representative benchmarks: your real service, real data shapes, real concurrency, real GC settings. Synthetic tests are fine for sanity, not purchasing.
  3. Compare sustained all-core behavior: measure under load for 10–30 minutes, not 30 seconds of turbo glory.
  4. Check cache per core: especially for in-memory services and databases. Cache is frequently a bigger win than +300 MHz.
  5. Validate NUMA fit: does the workload fit on one socket? If not, plan for memory locality and interconnect costs.
  6. Account for security mitigations: ensure you’re comparing apples to apples in kernel and microcode state.
  7. Look for throttling risk: power capping, dense racks, cloud CPU credits, cgroup quotas.
  8. Avoid buying on peak GHz: prefer published per-core performance under sustained conditions and your own IPC/counter baselines.

During an incident: minimal safe workflow

  1. Capture vmstat 1 5, mpstat 1 3, iostat -x 1 3.
  2. Check for throttling: cgroups, frequency, and steal time.
  3. Measure IPC quickly with perf stat (system-wide and for the hot PID).
  4. Decide: CPU compute vs memory stalls vs I/O vs contention. Pick one.
  5. Take one reversible action: reduce concurrency, move a noisy workload, adjust quota, or rollback the change that expanded working set.
  6. Write down the counters and the time window. Future-you will need them.
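
A minimal capture sketch for step 6, assuming a writable /var/tmp and a known hot PID (both are assumptions; adjust paths, durations, and commands to your incident):

cr0x@server:~$ ts=$(date +%Y%m%dT%H%M%S); dir=/var/tmp/incident-$ts; mkdir -p "$dir"
cr0x@server:~$ vmstat 1 5 > "$dir/vmstat.txt" & mpstat 1 3 > "$dir/mpstat.txt" & iostat -x 1 3 > "$dir/iostat.txt" & wait
cr0x@server:~$ sudo perf stat -a -e cycles,instructions,cache-references,cache-misses -- sleep 10 2> "$dir/perf.txt"
cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat > "$dir/cgroup-cpu.stat" 2>/dev/null

Note that perf stat prints its counters to stderr, hence the 2>. The point isn’t elegance; it’s that future-you gets timestamps and numbers instead of adjectives.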

After the incident: make it cheaper next time

  1. Create a baseline dashboard: IPC proxy (instructions/cycle), cache misses, context switches, major faults, iowait, steal time.
  2. Codify the runbook with commands and “what good looks like.”
  3. Run a quarterly “hardware reality” test: sustained clocks, thermals, and throttling checks under load.
  4. Teach procurement one phrase: “sustained all-core performance with workload counters.” Repeat until funded.

FAQ

1) What’s a “good” IPC number?

It depends on the CPU and workload. Many real services run around ~0.5–2.0 IPC. Tight compute loops can go higher. Pointer-chasing can go lower. The trend matters: if IPC dropped after a deploy, something changed in behavior (working set, branches, syscalls, locks).

2) If IPC is low, does that mean I need a better CPU?

Usually no. Low IPC often means you’re waiting on memory, I/O, or contention. A “better CPU” helps if it has a stronger memory subsystem, larger caches, or better branch prediction for your code. But the first win is typically software: reduce misses, reduce locks, reduce allocations.

3) Why do benchmarks show big gains from higher GHz when production doesn’t?

Benchmarks often fit in cache, run single-threaded, and avoid I/O and lock contention. Production has larger working sets, more branches, more syscalls, more interrupts, and more neighbors. GHz helps when the CPU is the limiting factor, not when it’s waiting.

4) Do more cores beat higher IPC?

Only if your workload scales cleanly. Many services are limited by shared resources: locks, database connections, memory bandwidth, or a single queue. If scaling stalls at 16 threads, buying 64 cores doesn’t fix it. Stronger per-core performance (higher IPC) can be the better spend.

5) Is IPC comparable between Intel and AMD? Between x86 and ARM?

Between CPUs in the same ISA and era, IPC comparisons can be useful. Between different ISAs, it gets shaky because instruction mix differs. Use end-to-end throughput/latency benchmarks and counters to understand why, rather than treating IPC as a universal score.

6) How does cache size relate to IPC?

More cache can improve IPC by turning DRAM accesses into cache hits. But it’s not just size: latency, associativity, prefetch behavior, and core-to-cache topology matter. For many server workloads, last-level cache per core is a critical spec.

7) Can a deployment change IPC without changing CPU usage much?

Yes. You can keep CPU at 60% and still get slower if the new version increases working set, causes more cache misses, adds branches, increases syscalls, or introduces lock contention. CPU utilization is a blunt instrument; IPC and stall indicators tell you what that CPU time accomplished.

8) Is “GHz doesn’t matter” the right takeaway?

No. GHz matters when you’re truly compute-bound and retiring instructions efficiently. The point is that GHz is not sufficient, and often not even stable. Treat frequency as one input, and validate with counters and real workload tests.

9) What’s the fastest way to explain IPC to a non-engineer stakeholder?

“Clock speed is how fast the CPU ticks. IPC is how much work it gets done per tick. Our service is limited by how much it waits, not how fast it ticks.”

Conclusion: next steps you’ll actually do

If you only change one habit: stop treating GHz as a performance guarantee. Start treating it as a variable.

Do this next:

  1. Pick one critical service and capture a baseline under typical load: turbostat, perf stat, vmstat, iostat, and cgroup throttle stats.
  2. Write down three numbers that matter: sustained busy MHz, IPC, and cache miss ratio. Now you have a fingerprint.
  3. When performance regresses, compare fingerprints before debating hardware. If IPC fell, find the wait. If MHz fell, find the throttle. If both are fine, look elsewhere (I/O, locks, network).
  4. When you must buy hardware, buy it for the bottleneck you measured: cache and memory behavior for memory-bound services; sustained frequency for compute-bound; and predictable scheduling for latency-sensitive work.

You don’t need to become a CPU architect. You just need to stop being surprised by the same bottlenecks. IPC is the quickest way to stop getting fooled by a big GHz number in a small font.
