You’ve seen it: the service got “more CPU” and somehow got slower. Or the vendor promised “twice the cores,”
and your p99 latency shrugged like it was on salary. Meanwhile, your dashboards show plenty of headroom—
except the users are refreshing, the incident channel is lively, and the on-call has that particular thousand-yard stare.
The turning point wasn’t the day CPUs gained extra cores. It was the day we collectively learned—often the hard way—that
clocks stopped saving us. From that point on, performance became an engineering discipline instead of a shopping trip.
The turning point: from clocks to cores
For a long time, performance work was basically a procurement workflow. You bought the next CPU generation, it ran at a
higher clock, and your application magically improved—even if your code was a museum exhibit. That era ended when heat and
power became the limiting factor, not transistor count.
Once frequency growth slowed, “faster” turned into “more parallel.” But parallelism isn’t a free lunch; it’s a bill that
arrives monthly, itemized in lock contention, cache misses, memory bandwidth, scheduler overhead, tail latency, and
“why is this one thread pegged at 100%?”
Here’s the uncomfortable truth: cores don’t beat clocks by default. They beat clocks only when your software and your
operating model can exploit concurrency without drowning in coordination costs.
Dry-funny joke #1: Adding cores to a lock-heavy app is like adding more checkout lanes while keeping one cashier who insists on validating every coupon personally.
The “real turning point” is when you stop treating CPU as a scalar and start treating the machine as a system:
CPU and memory and storage and network and kernel scheduling. In production, those
subsystems don’t take turns politely.
Facts and context worth remembering
These are short, concrete points that keep you from repeating history. Not trivia. Anchors.
- Frequency scaling hit a wall in the mid-2000s as power density and heat dissipation made “just crank the clock” a reliability and packaging problem, not just a matter of spending more engineering effort.
- “Dennard scaling” stopped being your friend: as transistors shrank, voltage didn’t keep dropping at the same pace, so power per area climbed and forced conservative frequency choices.
- Multi-core wasn’t a luxury; it was a workaround. If you can’t increase frequency safely, you add parallel execution units. Then you shift complexity into software.
- Amdahl’s Law became an operational reality: the serial fraction of your workload determines your scaling ceiling, and production traffic loves to find the serial path. (A quick back-of-the-envelope check follows this list.)
- Cache hierarchy became a first-order performance factor. L1/L2/L3 behavior, cache coherency traffic, and NUMA effects routinely dominate “CPU usage” graphs.
- Speculative execution and out-of-order tricks bought performance without higher clocks, but also increased complexity and vulnerability surface; mitigations later changed performance profiles in measurable ways.
- Virtualization and containers changed the meaning of “a core.” vCPU scheduling, steal time, CPU throttling, and noisy neighbors can make a 32-core node feel like a tired laptop.
- Storage got faster, but latency stayed stubborn. NVMe improved a lot, yet the difference between “fast” and “slow” is often queueing, filesystem behavior, and sync semantics, not raw device specs.
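The Amdahl bullet above is cheap to sanity-check. A minimal sketch, assuming a 5% serial fraction and 32 cores (both numbers are illustrative; substitute whatever your own profile suggests):

cr0x@server:~$ awk -v s=0.05 -v n=32 'BEGIN { printf "speedup with %d cores: %.1fx (ceiling 1/s: %.0fx)\n", n, 1/(s + (1-s)/n), 1/s }'
speedup with 32 cores: 12.5x (ceiling 1/s: 20x)

A mere 5% of serial work caps 32 cores at roughly 12.5x, and no core count will ever get you past 20x. That is the rent.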
What actually changed in production systems
1) “CPU” stopped being the bottleneck and became the messenger
In the clock-scaling era, CPU utilization was a reasonable proxy for “we are busy.” In the core era, CPU is often just where you
notice the symptom: a thread spinning on a lock, a GC pause, a kernel path doing too much work per packet, or a syscall storm
caused by tiny I/O.
If you treat high CPU as the problem, you’ll tune the wrong thing. If you treat CPU as the messenger, you’ll ask:
what work is being done, on behalf of whom, and why now?
2) Tail latency became the metric that matters
Parallel systems are excellent at producing averages that look fine while users suffer. When you run many requests concurrently,
the slowest few—lock holders, stragglers, cold caches, NUMA misses, disk queue spikes—define user experience and timeouts.
Cores amplify concurrency; concurrency amplifies queueing; queueing amplifies p99.
You can ship a “faster” system that is worse, if your optimizations increase variance. In other words: throughput wins demos; stability wins pagers.
3) The kernel scheduler became part of your application architecture
With more cores, the scheduler has more choices—and more opportunities to hurt you. Thread migration can trash caches.
Poor IRQ placement can steal cycles from your hot threads. Cgroups and CPU quotas can introduce throttling
that looks like “mysterious latency.”
4) Memory and NUMA stopped being “advanced topics”
Once you have multiple sockets, memory isn’t just RAM; it’s local RAM versus remote RAM.
A thread on socket 0 reading memory allocated on socket 1 can take a measurable hit, and that hit compounds when you’re
saturating memory bandwidth. Your code might be “CPU-bound” until it becomes “memory-bound,” and you won’t notice
by staring at CPU utilization.
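A hedged way to catch that shift without guessing: sample hardware counters for the busy process and look at instructions per cycle. The PID here is the same illustrative one used in the tasks later in this article, and the exact counters printed depend on your CPU and kernel.

cr0x@server:~$ sudo perf stat -p 2147 -- sleep 10

If the reported insn-per-cycle figure is well below 1 on a general-purpose server core while “CPU usage” looks high, the cores are mostly stalled waiting on memory, and buying more of them buys you more stalls.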
5) Storage performance became more about coordination than devices
Storage systems are parallel too: queues, merges, readahead, writeback, journaling, copy-on-write, checksums.
The device can be fast while the system is slow because you created the perfect storm of small synchronous writes,
metadata contention, or write amplification.
As a storage engineer, I’ll say the quiet part out loud: for many workloads, the filesystem is your database’s
first performance dependency. Treat it with the same respect as your query planner.
One quote that holds up in operations:
“Hope is not a strategy.”
— General Gordon R. Sullivan
Bottlenecks: where “more cores” disappears
Serial work: Amdahl collects his rent
Every system has a serial fraction: global locks, single leader, one compaction thread, one WAL writer, one shard coordinator,
one kernel mutex in a hot path. Your shiny core count mostly increases the number of threads waiting their turn.
Decision rule: if adding concurrency improves throughput but worsens latency, you likely hit a serial chokepoint plus queueing.
You don’t need more cores. You need to reduce contention or shard the serial resource.
Lock contention and shared state
Shared state is the classic tax of concurrency: mutexes, rwlocks, atomics, global allocators, reference counting, connection pools,
and “just one little metrics lock.” Sometimes the lock isn’t in your code; it’s in the runtime, the libc allocator, the kernel,
or the filesystem.
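When you suspect the lock lives below your code, one crude check (assuming you can afford strace overhead on that host for a few seconds; it is not free) is a syscall summary: attach, wait, and see whether futex dominates.

cr0x@server:~$ sudo timeout 5 strace -c -f -p 2147

With -c, strace prints a per-syscall summary when it detaches; a futex line holding most of the calls and most of the time is the classic signature of threads fighting over shared state rather than doing work.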
Memory bandwidth and cache coherency
Modern CPUs are fast at arithmetic. They are slower at waiting for memory. Add cores and you increase the number of hungry mouths
competing for memory bandwidth. Then cache coherency traffic shows up: cores spend time agreeing on the meaning of a cache line
instead of doing useful work.
NUMA: when “RAM” has geography
NUMA issues often look like randomness: same request, different latency, depending on which core ran it and where its memory lives.
If you don’t pin, allocate locally, or choose a topology-aware config, you get “performance drift” that comes and goes.
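Before chasing ghosts, confirm the geography you actually have. numactl --hardware (from the numactl package, if installed) prints the node count, per-node memory, and the node distance matrix; a distance of 21 between nodes versus 10 within a node is exactly the remote-access penalty this section is about.

cr0x@server:~$ numactl --hardware

Read three things: how many nodes exist, how free memory is split across them, and the distances table. Single-socket machines report one node, and you can stop worrying about this particular ghost.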
Kernel time: syscalls, context switches, and interrupts
High context switch rates can erase gains from parallelism. Lots of syscalls from tiny I/O or chatty logging can make a service
CPU-heavy without doing business work. Misplaced interrupts can pin an entire NIC queue’s IRQ load to the same cores that run your
latency-sensitive threads.
Storage I/O: queueing and write amplification
Cores don’t help if you’re stuck on synchronous fsync patterns, small random writes, or a storage stack that amplifies writes
through copy-on-write and metadata updates. Worse: more cores can issue more concurrent I/O, increasing queue depth and latency.
Networking: packet processing and softirq time
If you’re doing high PPS, the bottleneck can be softirq processing, conntrack, iptables rules, or TLS. The CPU isn’t “busy”
with your code; it’s busy being the network card’s assistant.
Dry-funny joke #2: The only thing that scales linearly in my career is the number of dashboards that claim everything is fine.
Fast diagnosis playbook
This is the order that gets you to the bottleneck quickly without a week of interpretive dance in Grafana. The goal is not to be
clever; it’s to be fast and correct.
First: establish what’s saturating (CPU vs memory vs I/O vs network)
- Check load average versus runnable threads and I/O wait.
- Check CPU breakdown (user/system/iowait/steal) and throttling.
- Check disk latency and queue depth; confirm if waits correlate with p99.
- Check NIC drops/retransmits and softirq CPU time if the service is network-heavy.
Second: determine if the limit is serial, shared, or external
- One thread pegged at 100% while others idle: serial work or a hot lock.
- All cores moderately busy but p99 bad: queueing, contention, or memory stalls.
- CPU low but latency high: I/O or downstream dependency.
- CPU high in kernel: networking, syscalls, filesystem, or interrupts.
Third: validate with targeted profiling (not vibes)
- Use perf top / perf record for CPU hot paths.
- Use flame graphs if you can, but even stack traces at the right time help.
- Use pidstat for per-thread CPU and context switches.
- Use iostat and filesystem stats for I/O latency distribution and saturation.
Fourth: change one variable, measure, roll back fast
- Reduce concurrency and see if tail improves (queueing diagnosis).
- Pin threads / adjust IRQ affinity if cache and scheduler effects dominate.
- Switch sync strategy carefully (batch fsync, group commit) if safe.
- Scale out when you’ve proven it’s not a single-node coordination limit.
Practical tasks with commands: measure, interpret, decide
These are deliberately “runnable at 03:00” tasks. Each one includes: command, example output, what it means, and the decision you make.
Use them in order when you’re lost.
Task 1: Check CPU saturation and iowait quickly
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (db01) 01/09/2026 _x86_64_ (32 CPU)
01:12:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
01:12:02 PM all 42.10 0.00 12.40 8.60 0.00 2.10 0.20 34.60
01:12:02 PM 7 98.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00
01:12:02 PM 12 10.00 0.00 40.00 30.00 0.00 5.00 0.00 15.00
Meaning: CPU 7 is essentially saturated in user space (likely a hot thread). CPU 12 is spending a lot in sys + iowait (kernel + storage waits).
Decision: If one CPU is pegged, look for a single-thread bottleneck or lock. If iowait is high, move to disk latency checks before tuning CPU.
Task 2: Separate runnable load from I/O load
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 0 421312 91240 812340 0 0 1024 2048 5400 9800 41 12 35 9 3
9 0 0 418900 91300 811900 0 0 120 250 6200 21000 55 15 25 2 3
Meaning: r shows runnable threads; b shows blocked (often I/O). First line indicates some blocking, second indicates CPU pressure (r=9).
High cs (context switches) hints at contention or overly chatty concurrency.
Decision: If r is consistently > CPU count, you’re CPU-bound or thrashing. If b spikes, prioritize I/O investigation.
Task 3: Spot CPU throttling in containers (cgroups)
cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat
usage_usec 9034211123
user_usec 7011200456
system_usec 2023010667
nr_periods 120934
nr_throttled 23910
throttled_usec 882134223
Meaning: The workload has been throttled in 23,910 periods; almost 882 seconds worth of throttled time accumulated.
This can look like “random latency” and “CPU idle” simultaneously.
Decision: If throttling is non-trivial during incidents, raise CPU limits/requests, fix noisy neighbors, or stop assuming “idle means available.”
Task 4: Check steal time on virtual machines
cr0x@server:~$ sar -u 1 3
Linux 6.5.0 (app03) 01/09/2026 _x86_64_ (8 CPU)
01:18:01 PM CPU %user %system %iowait %steal %idle
01:18:02 PM all 22.10 7.30 1.20 18.40 51.00
01:18:03 PM all 24.00 8.10 1.10 19.20 47.60
Meaning: ~19% steal is the hypervisor taking time away; you can’t tune your way out of it inside the guest.
Decision: If steal is high, migrate hosts, change instance type, or reduce contention at the virtualization layer.
Task 5: Identify per-thread CPU hogs and context switch storms
cr0x@server:~$ pidstat -u -w -t -p 2147 1 3
Linux 6.5.0 (api01) 01/09/2026 _x86_64_ (32 CPU)
01:21:10 PM UID TGID TID %usr %system %CPU cswch/s nvcswch/s Command
01:21:11 PM 1001 2147 2159 98.00 1.00 99.00 0.00 12.00 java
01:21:11 PM 1001 2147 2164 5.00 18.00 23.00 12000.00 8000.00 java
Meaning: One thread is CPU-bound (likely a hot loop or serial bottleneck). Another is heavy on system time and switching—often lock contention, syscalls, or scheduler churn.
Decision: CPU-bound thread: profile with perf. High switching: inspect locks, allocator behavior, logging, and kernel hotspots.
Task 6: Find CPU hotspots with perf (fast triage)
cr0x@server:~$ sudo perf top -p 2147
Samples: 2K of event 'cycles', Event count (approx.): 2289012345
38.12% libpthread-2.35.so [.] pthread_mutex_lock
14.55% libc-2.35.so [.] __memmove_avx_unaligned_erms
10.09% [kernel] [k] tcp_recvmsg
7.44% [kernel] [k] ext4_da_write_end
Meaning: A huge chunk in pthread_mutex_lock is a contention signature. Kernel hotspots suggest network receive and filesystem write paths are also relevant.
Decision: If mutex lock dominates, reduce shared state, increase sharding, or change concurrency model. If kernel networking dominates, check softirq/IRQ and packet rates.
Task 7: Confirm disk latency and queue depth
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (db01) 01/09/2026 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 120.0 950.0 4800 81200 154.0 9.80 11.2 3.1 12.3 0.7 82.0
Meaning: High avgqu-sz and elevated await indicate queueing. %util at 82% suggests the device is getting busy; latency will climb under bursts.
Decision: If await grows with load, reduce sync write pressure, batch writes, tune filesystem/ZFS, or move hot data/logs to faster or dedicated devices.
Task 8: See which processes are generating I/O
cr0x@server:~$ sudo iotop -o -b -n 3
Total DISK READ: 0.00 B/s | Total DISK WRITE: 58.23 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2147 be/4 app 0.00 B/s 22.10 M/s 0.00 % 12.50 % java -jar api.jar
1762 be/4 postgres 0.00 B/s 31.90 M/s 0.00 % 18.00 % postgres: wal writer
Meaning: The WAL writer is pushing significant writes; your application also writes heavily. If latency issues correlate, you may be sync-bound or journaling-bound.
Decision: Validate fsync patterns and storage config. Consider moving WAL/logs to a separate device, or tuning commit settings with durability requirements in mind.
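If the database is PostgreSQL, as the wal writer above suggests, read the commit-related settings before anyone proposes changing them. This is a read-only check; the setting names are standard PostgreSQL parameters.

cr0x@server:~$ sudo -u postgres psql -Atc "SELECT name, setting FROM pg_settings WHERE name IN ('synchronous_commit','commit_delay','commit_siblings','wal_sync_method');"

synchronous_commit=off trades durability of the most recent transactions for latency; commit_delay with commit_siblings enables group commit without giving durability away. Decide with the durability owner in the room, not mid-incident.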
Task 9: Check filesystem and mount options (the boring truth)
cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /var/lib/postgresql
/dev/nvme0n1p2 /var/lib/postgresql ext4 rw,noatime,data=ordered
Meaning: noatime avoids extra metadata writes. Ext4 ordered mode is generally sane for databases on Linux.
Decision: If you see surprising options (like sync or weird barriers disabled without reason), fix them. Don’t cargo-cult performance flags.
Task 10: Check ZFS pool health and latency pressure (if you run it)
cr0x@server:~$ sudo zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.20T 2.30T 210 1800 8.20M 95.1M
mirror 1.20T 2.30T 210 1800 8.20M 95.1M
nvme1n1 - - 110 920 4.10M 47.6M
nvme2n1 - - 100 880 4.10M 47.5M
Meaning: High write ops relative to bandwidth implies small writes. Mirrors can handle it, but latency depends on sync behavior and SLOG presence.
Decision: If small sync writes dominate, evaluate separate SLOG, recordsize, and application fsync batching—without compromising durability.
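A hedged starting point if this pool hosts a database: read the properties that govern write behavior before buying hardware. The dataset name tank/db is an assumption for illustration.

cr0x@server:~$ zfs get recordsize,logbias,sync,compression tank/db

A 128K default recordsize under a database doing 8K-16K writes means read-modify-write amplification; heavy sync traffic with no log vdev in zpool status means every sync write waits on the main vdevs. Adding a SLOG (zpool add tank log <fast-device>) is a planned change, not an incident-hour experiment.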
Task 11: Detect memory pressure and reclaim thrash
cr0x@server:~$ sar -B 1 3
Linux 6.5.0 (cache01) 01/09/2026 _x86_64_ (16 CPU)
01:33:20 PM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
01:33:21 PM 0.0 81234.0 120000.0 12.0 90000.0 0.0 54000.0 39000.0 72.2
Meaning: Heavy page scanning and high pgpgout suggest reclaim pressure; major faults indicate real disk-backed paging.
Decision: If reclaim is active during latency spikes, reduce memory footprint, fix cache sizing, or move to nodes with more RAM. More cores won’t help.
Task 12: Check NUMA locality problems
cr0x@server:~$ numastat -p 2147
Per-node process memory usage (in MBs) for PID 2147 (java)
Node 0 18240.3
Node 1 2240.8
Total 20481.1
Meaning: Memory is heavily concentrated on Node 0; if threads run on both sockets, Node 1 threads will access remote memory frequently.
Decision: If NUMA imbalance correlates with latency, consider pinning the process to one socket, enabling NUMA-aware allocation, or tuning thread placement.
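If you do decide to pin, numactl does it at process start. A minimal sketch, reusing the example service command from Task 8; node 0 is a placeholder, not a recommendation for this particular process.

cr0x@server:~$ numactl --cpunodebind=0 --membind=0 java -jar api.jar

This keeps both execution and allocations on node 0. The trade-off is real: you give up the other socket’s cores and memory bandwidth for that process, so measure p99 before and after instead of assuming locality always wins.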
Task 13: Check interrupt distribution and softirq load
cr0x@server:~$ cat /proc/interrupts | head -n 8
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
24: 9123401 102332 99321 88210 90111 93321 92011 88712 PCI-MSI 524288-edge eth0-TxRx-0
25: 10231 8231201 99221 88120 90211 93411 92101 88602 PCI-MSI 524289-edge eth0-TxRx-1
NMI: 2012 1998 2001 2003 1999 2002 1997 2004 Non-maskable interrupts
Meaning: IRQs are concentrated on CPU0 and CPU1 for separate queues. That’s not always bad, but if your hot threads share those CPUs, you’ll get jitter.
Decision: If latency-sensitive threads and IRQ-heavy CPUs overlap, set IRQ affinity to isolate them, or move app threads away from IRQ cores.
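A minimal isolation sketch, assuming IRQ 24 is the hot NIC queue from the table above, CPUs 2-3 carry no latency-critical threads, and irqbalance is either disabled or configured to leave this IRQ alone; all the numbers are illustrative.

cr0x@server:~$ echo 2-3 | sudo tee /proc/irq/24/smp_affinity_list
cr0x@server:~$ sudo taskset -acp 4-15 2147

The first command moves that IRQ’s handling onto CPUs 2-3; the second (-a applies it to all existing threads) restricts PID 2147 to CPUs 4-15, so packet processing and application threads stop sharing cycles and caches.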
Task 14: Confirm TCP retransmits and drops (network-induced latency)
cr0x@server:~$ netstat -s | egrep -i 'retrans|rto|syns to listen' | head
124567 segments retransmited
98 timeouts after RTO
1428 SYNs to LISTEN sockets dropped
Meaning: Retransmits and RTOs create tail latency that looks like “app got slower.” SYN drops can look like random connection failures under load.
Decision: If retransmits spike with incidents, check NIC saturation, queue settings, load balancer health, conntrack, and packet loss upstream.
Task 15: Measure file descriptor pressure (hidden serialization)
cr0x@server:~$ cat /proc/sys/fs/file-nr
24576 0 9223372036854775807
Meaning: The first number is file handles currently allocated system-wide; the third is the kernel’s ceiling, which on modern kernels is effectively unlimited. The limit that actually bites is usually the per-process one (ulimit -n / LimitNOFILE), where exhaustion shows up as failures and retries that create weird contention patterns.
Decision: If a process is approaching its descriptor limit, raise it and fix the leaks. Don’t let a file descriptor shortage masquerade as a CPU scalability issue.
Task 16: Spot one-core bottlenecks in application metrics using top
cr0x@server:~$ top -H -p 2147 -b -n 1 | head -n 12
top - 13:41:01 up 12 days, 3:22, 1 user, load average: 6.20, 5.90, 5.10
Threads: 98 total, 2 running, 96 sleeping, 0 stopped, 0 zombie
%Cpu(s): 45.0 us, 12.0 sy, 0.0 ni, 35.0 id, 8.0 wa, 0.0 hi, 0.0 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2159 app 20 0 9856.2m 3.1g 41224 R 100.0 9.8 32:11.20 java
2164 app 20 0 9856.2m 3.1g 41224 S 23.0 9.8 10:05.88 java
Meaning: One thread maxing out a core is the poster child for “cores won’t help.” That thread is your throughput ceiling and your latency cliff.
Decision: Profile it, then redesign: split work, reduce lock scope, shard state, or move the work off the request path.
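Once top -H has named the hot thread, profile just that thread instead of the whole process. TID 2159 is from the output above; 99 Hz for 30 seconds is a conservative, low-overhead choice, not gospel.

cr0x@server:~$ sudo perf record -t 2159 -g -F 99 -- sleep 30
cr0x@server:~$ sudo perf report --stdio | head -n 30

Thirty seconds of stacks is usually enough to tell a hot loop from a lock from a syscall storm. Flame graphs are nicer; the stdio report at 03:00 answers the same question.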
Three corporate mini-stories (anonymized, plausible, and technically accurate)
Mini-story #1: An incident caused by a wrong assumption
A mid-sized SaaS company moved their API tier from older 8-core instances to newer 32-core instances. Same memory. Same storage class.
The migration plan was simple: replace nodes, keep the same auto-scaling thresholds, enjoy the cheaper cost-per-core.
The first weekday after the cutover, error rates rose slowly. Not catastrophically—just enough to trigger client retries.
Latency climbed, then plateaued, then climbed again. The dashboards said CPU was “fine”: 40–55% across the fleet.
The incident commander asked the classic question: “Why are we slow if CPU is only half used?”
The wrong assumption was that CPU percent maps linearly to capacity. What happened was more subtle: the application had a single
global lock protecting a cache structure used on every request. On 8-core boxes the lock was annoying. On 32-core boxes it became
a contention festival. More cores meant more simultaneous contenders, higher context switch rates, and longer wait time per request.
Throughput didn’t rise; tail latency did.
The fix wasn’t “add more CPU.” The fix was to shard the cache by keyspace and use lock striping. They also lowered the request
concurrency limit at the ingress temporarily—less throughput on paper, better p99 in reality—until the code change shipped.
The lesson that stuck: cores amplify both your parallelism and your coordination overhead. If you don’t measure contention, you’re guessing.
And guessing is how you get paged.
Mini-story #2: An optimization that backfired
A data platform team wanted to reduce write latency for an ingestion service. They noticed their database spent time flushing and syncing.
Someone proposed a “quick win”: move logs and WAL to the same fast NVMe volume used for data, and increase worker threads to “use all the cores.”
The change was rolled out gradually, with a small test that looked good: higher throughput, lower average latency.
Two weeks later, during a predictable traffic spike, the system started timing out. Not evenly—just enough to hurt. The node graphs showed
disk %util climbing. iostat showed queue depth increasing. CPU remained available. Engineers tuned thread pools upward to “push through.”
That made it worse.
The backfire was queueing collapse: more threads generated more small sync writes, which increased device queue depth, which increased
per-operation latency, which increased the number of concurrent in-flight operations, which increased queue depth again. A feedback loop.
Average throughput looked acceptable, but p99 blew up because some requests landed behind long I/O queues.
The fix was deliberately boring: cap concurrency, separate WAL/log devices, and implement batching so fsync calls were grouped.
They also added alerts on await and queue depth, not just throughput. After that, the system handled spikes without drama.
The lesson: “use all cores” is not a goal. It’s a risk. Concurrency is a dial, and the right setting depends on the slowest shared resource.
Mini-story #3: A boring but correct practice that saved the day
A fintech ran a settlement service that did heavy reads, moderate writes, and required strict durability for a subset of operations.
They had a habit that would not win hackathon prizes: every quarter they ran a capacity and failure-mode rehearsal using production-like load,
with strict runbooks and a “no heroics” rule.
During one rehearsal, they noticed something unsexy: p99 latency was creeping up as they approached peak throughput, even though CPU looked fine.
They collected pidstat, iostat, and perf profiles and found mild lock contention plus a storage queue depth rise
during periodic checkpoint-like bursts. Nothing was “broken,” just close to the cliff.
They made two changes: (1) pinned specific worker pools to CPUs away from NIC IRQ cores, and (2) adjusted storage layout so the durable log
lived on a separate device with predictable latency. They also set explicit SLOs on p99 and added alerts on throttling and steal time.
Months later, a real traffic spike hit during a noisy-neighbor event on the virtualization layer. Their systems still degraded,
but stayed within SLO long enough to shed load gracefully. Other teams in the company had incidents; this one had a Slack thread
and a postmortem with no adrenaline.
The lesson: boring practices are what make cores usable. Rehearse, measure the right things, and you’ll see the cliff before you drive off it.
Common mistakes: symptom → root cause → fix
1) Symptom: CPU is “only 50%” but latency is awful
Root cause: single-thread bottleneck, lock contention, or cgroup throttling masking real saturation.
Fix: Use top -H/pidstat -t to find the hot thread; use perf top to find the lock or the hot loop.
Check /sys/fs/cgroup/cpu.stat for throttling. Redesign the serial path; don’t just add instances.
2) Symptom: Throughput increases with more threads, then suddenly collapses
Root cause: queueing collapse on I/O or downstream dependency; concurrency overshoots the service’s stable operating region.
Fix: Cap concurrency, add backpressure, and measure queue depth/await. Tune thread pools downward until p99 stabilizes.
3) Symptom: Random p99 spikes after moving to larger multi-socket machines
Root cause: NUMA effects and cache locality issues; threads migrate and access remote memory.
Fix: Check numastat. Pin processes or use NUMA-aware allocators. Keep latency-critical services within a socket when possible.
4) Symptom: CPU system time climbs with traffic, but app code didn’t change
Root cause: syscalls and kernel overhead from networking, small I/O, logging, or filesystem metadata churn.
Fix: Use perf top to see kernel symbols, check interrupt distribution, reduce syscall rate (batch, buffer, async),
and re-evaluate logging volume and flush policies.
5) Symptom: High iowait and “fast disks”
Root cause: device queueing, sync write patterns, write amplification (copy-on-write, small blocks), or sharing a device with another workload.
Fix: Confirm with iostat -x and iotop. Separate WAL/logs, batch fsync, tune filesystem record size,
and ensure the underlying device isn’t oversubscribed.
6) Symptom: Scaling out adds nodes but not capacity
Root cause: centralized dependency (single DB writer, leader election hotspot, shared cache, rate-limited downstream).
Fix: Identify the shared dependency, shard it or replicate properly, and ensure clients distribute load evenly.
“More stateless pods” won’t fix a stateful chokepoint.
7) Symptom: Performance is worse after “making it more concurrent”
Root cause: increased contention and cache coherency traffic; more threads cause more shared writes and false sharing.
Fix: Reduce shared state, avoid hot atomic counters on the request path, use per-core/per-thread batching, and profile contention.
Checklists / step-by-step plan
Step-by-step: proving whether cores will help
- Measure tail latency under load (p95/p99) and correlate with CPU breakdown (usr/sys/iowait/steal). If p99 worsens while CPU is “available,” suspect contention or external waits.
- Find the limiting thread or lock: run top -H and pidstat -u -w -t to locate hot threads and switching storms.
- Profile before you tune: use perf top to identify top functions (locks, memcpy, syscalls, kernel paths).
- Check I/O latency and queue depth: iostat -x and iotop confirm whether storage is the pacing item.
- Check for artificial CPU limits: cgroup throttling, steal time, and scheduler constraints can mimic “bad code.”
- Validate NUMA and IRQ placement: confirm locality with numastat, confirm interrupt distribution with /proc/interrupts.
- Only then decide:
- If CPU is truly saturated in user time across cores: more cores (or faster cores) may help.
- If contention dominates: redesign concurrency; extra cores may worsen it.
- If I/O dominates: fix storage path; extra cores may just generate more waiting.
- If network/kernel dominates: tune IRQ, offloads, and packet path; consider faster cores.
Operational checklist: making multi-core behavior predictable
- Set explicit concurrency limits (per instance) and treat them as capacity controls, not “temporary hacks.”
- Alert on CPU throttling and steal time; they are silent capacity killers.
- Track disk await and queue depth; throughput alone is a liar.
- Measure context switches and run-queue length; high values often precede p99 pain.
- Keep hot state sharded; don’t centralize counters and maps on the request path.
- Separate durability-sensitive logs from bulk data when possible.
- Validate NUMA placement on multi-socket machines; pin if you need determinism.
- Rehearse peak load with production-like data; the serial path will show itself.
FAQ
1) Should I prefer higher clock speed or more cores for latency-sensitive services?
Prefer higher per-core performance when you have a known serial component, heavy kernel/network processing, or lock-sensitive code.
More cores help when the workload is embarrassingly parallel and the shared state is minimal.
2) Why does CPU utilization look low when the service is timing out?
Because the service may be waiting: on locks, on I/O, on downstream calls, or being throttled by cgroups. Also, “low average CPU”
can hide a single saturated core. Always look at per-core and per-thread views.
3) What’s the quickest way to detect lock contention?
In Linux, perf top showing pthread_mutex_lock (or futex paths in the kernel) is a strong signal.
Pair it with pidstat -u -w -t for per-thread context switches and CPU to find the culprit.
4) How do I know if I’m I/O-bound versus CPU-bound?
If iostat -x shows rising await and queue depth during latency spikes, you’re likely I/O-bound.
If vmstat shows high runnable threads and low iowait, you’re likely CPU-bound or contention-bound.
5) Can adding more threads reduce latency?
Sometimes, for I/O-heavy workloads where concurrency hides wait time. But once you hit a shared bottleneck, more threads increase queueing
and variance. The correct move is usually a capped concurrency plus backpressure.
6) What’s the most common “cores vs clocks” trap in Kubernetes?
CPU limits causing throttling. The pod can show “CPU usage below limit,” but still be throttled in bursts, creating latency spikes.
Check /sys/fs/cgroup/cpu.stat inside the container and correlate with request latency.
7) Why do bigger machines sometimes perform worse than smaller ones?
NUMA effects, scheduler migration, and cache locality. Also, bigger machines often attract more co-located workloads, increasing contention
on shared resources like memory bandwidth and I/O.
8) Is storage still relevant if I’m on NVMe?
Very. NVMe improves baseline latency and throughput, but queueing still exists, sync semantics still exist, and filesystems still do work.
If you generate lots of small sync writes, NVMe will simply let you hit the queueing wall faster.
9) What metrics should I put on a dashboard to reflect the “core era” reality?
Per-core CPU, CPU steal, CPU throttling, context switches, run queue length, disk await and queue depth, network retransmits, and p95/p99 latency.
Averages are fine, but only as supporting actors.
Next steps you can do this week
If you want systems that benefit from more cores instead of being embarrassed by them, do the following—practically, not aspirationally.
- Add one dashboard panel that shows per-core CPU and top threads (or export per-thread CPU for the main process). Catch the single-core ceiling early.
- Alert on CPU throttling and steal time. If you’re in containers or VMs and you don’t alert on those, you are choosing surprise.
- Track disk await and queue depth alongside p99 latency. If you only track throughput, you’re optimizing for the wrong kind of success.
- Run a one-hour “bottleneck drill”: under controlled load, capture mpstat, vmstat, iostat, pidstat, and a short perf sample. Write down the top three limiting factors. Repeat quarterly. (A capture sketch follows this list.)
- Set explicit concurrency limits for your busiest services. Treat the limit as a stability control; tune it like you tune a circuit breaker.
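A minimal capture sketch for the bottleneck drill, assuming the sysstat tools and perf used throughout this article are installed; the script takes the main service PID as its argument, and the 60-second window is an arbitrary but repeatable choice.

cr0x@server:~$ cat bottleneck-drill.sh
#!/usr/bin/env bash
# Capture 60 seconds of system-wide and per-thread evidence during a controlled load test.
set -euo pipefail
pid="${1:?usage: bottleneck-drill.sh <pid>}"
out="drill-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
mpstat -P ALL 5 12 > "$out/mpstat.txt" &                 # per-CPU usr/sys/iowait/steal
vmstat 5 12 > "$out/vmstat.txt" &                        # run queue, blocked tasks, context switches
iostat -x 5 12 > "$out/iostat.txt" &                     # device await, queue depth, utilization
pidstat -u -w -t -p "$pid" 5 12 > "$out/pidstat.txt" &   # per-thread CPU and switching
perf record -g -F 99 -p "$pid" -o "$out/perf.data" -- sleep 60   # CPU stacks for later perf report
wait
echo "evidence collected in $out/"
cr0x@server:~$ sudo ./bottleneck-drill.sh 2147

The script is not the point; producing the same comparable artifacts every quarter is, so drift between drills is visible instead of anecdotal.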
The real turning point wasn’t multi-core CPUs. It was the moment we had to stop trusting clocks to cover our sins.
If you measure contention, queueing, and locality—and you’re willing to lower concurrency when it helps—cores will beat clocks.
Otherwise, they’ll just beat you.