You don’t pick “containers” or “VMs” in the abstract. You pick them while someone is paging you because p99 latency doubled,
the batch job got 3× slower, or a “minor” node upgrade turned into a week of blame archaeology.
CPU is where this decision gets political. Everyone thinks they understand CPU. Then they meet throttling, steal time, NUMA,
and the uncomfortable truth: the scheduler doesn’t care about your sprint commitments.
CPU profiles that actually matter
“CPU-bound” is not a sufficient description. The choice between containers and VMs depends on the shape of CPU demand and the
penalties you’re willing to pay: jitter, tail latency, context switch overhead, cache churn, preemption behavior, and how much
you trust neighboring workloads.
Profile 1: Tail-latency sensitive (p99/p999 is the product)
Think: trading, ad bidding, OLTP, request/response services with strict SLOs, or anything where a single slow core ruins your day.
These workloads hate:
- CPU throttling (cgroup quotas) and bursty scheduling.
- Unpredictable preemption from other workloads (noisy neighbors).
- NUMA cross-node memory access and cache misses.
- Interrupt storms landing on “your” cores.
They usually want guaranteed CPU time, core isolation, and stable placement. Containers can do this, but only if you
actually configure it. VMs can do this too, but only if you stop pretending overcommit is a free lunch.
Profile 2: Throughput-bound batch (finish by morning, no one cares about jitter)
Think: ETL, video encoding, indexing, ML feature generation. These workloads care about total CPU-seconds and scaling out.
They tolerate jitter. They don’t tolerate you charging them for idle cores.
Containers often win here because you can pack them tight, scale them quickly, and let the scheduler do its thing. VMs can compete
when you have strict tenant boundaries or need different kernels, but you’ll pay in density and operational friction.
Profile 3: Bursty interactive (spiky CPU, human-in-the-loop)
Think: CI runners, developer preview environments, internal dashboards that spike on deploys. These want fast spin-up and elasticity.
Containers tend to win operationally. CPU-wise, they win as long as you don’t “optimize” by setting aggressive CPU limits that turn
bursts into throttled misery.
Profile 4: Mixed-mode “it’s fine until it isn’t” (shared hosts, lots of services)
This is the corporate default: many small services, a few medium ones, and one mystery workload that nobody owns.
CPU contention here is a governance problem disguised as a technical one.
Containers give you finer-grained control and better density, but also more ways to accidentally create contention. VMs give you a
thicker wall, which is nice until you realize you still share the physical CPU and the wall isn’t soundproof.
Profile 5: Special CPU behavior (real-time, DPDK, HPC-ish)
If you need real-time scheduling, predictable interrupt routing, or user-space networking that wants dedicated cores, you’re in the
“stop guessing” category. Both containers and VMs can work, but you’ll likely end up doing CPU pinning, isolcpus, hugepages, and
explicit NUMA placement. At that point, your problem isn’t “containers vs VMs”. It’s whether your platform team can support
deterministic tuning without breaking everything else.
A CPU mental model: what the kernel and hypervisor really do
Containers are not little VMs. They’re processes with opinions: namespaces to lie about what they can see, and cgroups to enforce
what they can use. The CPU scheduler is still the host kernel’s scheduler.
VMs run a guest kernel on virtual CPUs (vCPUs). Those vCPUs are scheduled onto physical CPUs by the hypervisor/host kernel.
So you have two schedulers: guest scheduling threads onto vCPUs, then host scheduling vCPUs onto pCPUs.
That double-layer is both a feature (isolation) and a source of weirdness (steal time, scheduling interference).
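You can see both layers from the inside. A minimal check; the cgroup path below is illustrative (slice names depend on your cgroup driver), and the steal value is made up:
cr0x@server:~$ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope
cr0x@server:~$ awk '/^cpu /{print "steal ticks:", $9}' /proc/stat
steal ticks: 128833
A container is just a process attached to a cgroup on the host kernel; a VM guest accumulates steal time in /proc/stat whenever the hypervisor ran something else while its vCPUs were runnable.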
CPU time: the three big metrics that decide your fate
- Usage: how much CPU your workload actually consumes.
- Wait: time spent runnable but not running (queueing, contention).
- Jitter: variance of latency due to scheduling, throttling, interrupts, and cache effects.
Containers tend to minimize overhead and maximize density, but they increase the probability of “wait” and “jitter” unless you’re
disciplined about resource governance.
VMs add a layer of overhead and scheduling complexity, but can reduce cross-tenant coupling when configured correctly. The key phrase
is “when configured correctly.” Most aren’t.
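If you want a number for “wait” instead of a hunch, modern kernels expose CPU pressure (PSI). A minimal check, values illustrative:
cr0x@server:~$ cat /proc/pressure/cpu
some avg10=12.41 avg60=9.77 avg300=6.03 total=4812345678
Read it as: over the last 10 seconds, at least one runnable task spent roughly 12% of the time waiting for a CPU. Sustained double digits on a latency-sensitive node is contention, not application slowness.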
Quotas vs shares: how containers get “less CPU” without you noticing
In cgroup v2, CPU control is usually a mix of:
- cpu.weight (relative priority that only matters under contention), and
- cpu.max (a hard quota per period that can throttle you even when idle CPU exists elsewhere on the node, depending on configuration and burst behavior).
The failure mode is classic: you set CPU limits “for fairness,” then your latency-sensitive service hits the limit during a spike,
gets throttled, and p99 goes off a cliff while average CPU looks “reasonable.”
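The raw knobs are easy to inspect on a cgroup v2 host. A sketch; the app.slice path is hypothetical, so locate your service’s actual cgroup first:
cr0x@server:~$ cat /sys/fs/cgroup/app.slice/cpu.weight
100
cr0x@server:~$ cat /sys/fs/cgroup/app.slice/cpu.max
200000 100000
cpu.weight is a relative share (default 100) that changes nothing until CPUs are contended; cpu.max is "<quota> <period>" in microseconds, and the literal value "max" means uncapped.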
Steal time: how VMs tell you “someone else got my lunch”
In virtualized environments, steal time is CPU time the guest wanted but didn’t get because the host ran something else.
High steal is the VM equivalent of standing in a long coffee line while your meeting starts without you.
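A quick way to spot it from inside the guest, besides the mpstat and sar checks in the tasks below (numbers illustrative):
cr0x@server:~$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12  0      0 812344 102400 904412    0    0     1     4 9821 8433 61 10  9  0 20
11  0      0 812102 102400 904500    0    0     0     2 9788 8511 62 10  8  0 20
13  0      0 811980 102400 904612    0    0     0     5 9901 8390 60 11  9  0 20
The last column (st) is steal. Sustained double digits means the host, not your code, is the bottleneck.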
NUMA: the tax you pay when memory is “over there”
On multi-socket systems, CPU and memory are arranged in NUMA nodes. Accessing local memory is faster than remote memory.
When you schedule threads on one socket and their memory lives on another, you get performance that feels like packet loss: sporadic,
hard to reproduce, and always blamed on “the network” first.
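numactl makes the layout, and the imbalance, visible (output illustrative; note the free-memory skew between nodes):
cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128657 MB
node 0 free: 61212 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129019 MB
node 1 free: 3044 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
If your hot threads run on node 1 while their memory was allocated from node 0, every remote access pays the cross-node tax; the 10 vs 21 distances are a relative hint about that cost, not a benchmark.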
Cache behavior and context switches: the hidden bill
If you pack lots of containers on a host, you increase context switches and cache pressure. That’s not automatically bad.
It’s bad when you expect single-digit millisecond tail latency and you treat CPU as an infinitely divisible commodity.
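perf can show the bill directly for one process (PID and counts illustrative; some counters need elevated privileges):
cr0x@server:~$ sudo perf stat -e context-switches,cpu-migrations,cache-misses -p 24811 -- sleep 10

 Performance counter stats for process id '24811':

           118,234      context-switches
             3,412      cpu-migrations
        92,114,501      cache-misses

      10.001987234 seconds time elapsed
High migrations plus high cache-misses on a latency-sensitive service is the signature of “flexible placement” quietly costing you tail latency.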
Paraphrased idea from Werner Vogels (Amazon CTO): You build it, you run it; ownership improves reliability.
It applies here too: the team that sets CPU policy should feel the pager.
Who wins when: containers vs VMs by CPU profile
Latency-sensitive services: usually VMs for hard isolation, containers for disciplined platforms
If you’re running a multi-tenant platform where teams don’t coordinate, VMs are the safer default for latency-sensitive services.
Not because VMs are magical, but because the boundaries are harder to accidentally bypass. A container with no CPU limits and no
isolation can ruin a whole node; a VM can too, but it takes a different kind of misconfiguration.
If you run a disciplined Kubernetes platform with:
- Guaranteed QoS pods (CPU and memory requests == limits),
- static CPU manager policy for pinned cores where needed,
- NUMA-aware scheduling,
- IRQ affinity tuning on hot paths,
…then containers can deliver excellent tail latency with higher density than VMs. But that’s not “default Kubernetes.” That’s
“Kubernetes after you read the footnotes and paid the tax.”
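For reference, the resources stanza that gets you there looks like this (file name hypothetical; the static CPU manager hands out exclusive cores only to Guaranteed pods with integer CPU requests):
cr0x@server:~$ cat latency-api-resources.yaml
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
Requests equal to limits for both CPU and memory gives Guaranteed QoS; the integer CPU value is what makes exclusive pinning possible on nodes running the static policy.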
High-throughput stateless compute: containers win on density and operational velocity
For stateless, horizontally scalable compute where jitter is acceptable, containers are the better tool. The overhead is lower,
scheduling is simpler, and you get better bin packing. Also: rolling updates, autoscaling, and quick rollbacks are easier to execute
without treating the hypervisor fleet as a sacred artifact.
Legacy apps and different kernels: VMs win, and it’s not even close
If you need a different kernel, kernel modules, or you’re running software that assumes it owns the world (hello, old licensing
systems), use a VM. Containers share the host kernel; if the kernel is the compatibility layer you need to control, containers are
the wrong abstraction.
Noisy neighbor risk: VMs reduce blast radius; containers need policy and enforcement
Both containers and VMs share the same physical CPU. The difference is how many layers you have to get wrong before one workload
harms another.
- In containers, the kernel scheduler is directly shared. Cgroup mistakes show up immediately.
- In VMs, mis-sized vCPU counts, overcommit, and host contention show up as steal time and jitter.
When “near bare metal” matters: containers have less overhead, but VMs can be tuned
Containers usually have less CPU overhead because there’s no second kernel and no emulated devices (assuming you’re not doing something
creative with networking). VMs can be very close to native with hardware virtualization and paravirtualized drivers, but you still
pay some cost in exits, interrupts, and the scheduling indirection.
The practical advice: if you’re chasing the last 5–10% on CPU-heavy workloads, measure. Don’t argue.
Facts and history that explain today’s weirdness
- 1970s: virtualization isn’t new. IBM mainframes ran virtual machines decades before cloud made it trendy; the idea was isolation and utilization.
- 2006: AWS popularized “rent a VM.” EC2 made VM-centric operations the default mental model for a generation of engineers.
- 2007–2008: Linux cgroups arrived. Control groups gave the kernel a way to account and limit CPU/memory per process group—containers later rode that wave.
- 2008: LXC made “containers” feel real. Before Docker, LXC was already doing namespaces+cgroups; it just didn’t have the same UX and distribution story.
- 2013: Docker made packaging contagious. The killer feature wasn’t isolation; it was shipping an app with its dependencies in a repeatable way.
- 2014: Kubernetes turned scheduling into a product. It normalized the idea that the platform allocates CPU, not the app team begging for VMs.
- 2018: Spectre/Meltdown changed the overhead conversation. Some mitigations impacted syscall-heavy and virtualization-heavy paths; performance discussions got more nuanced.
- cgroup v2 unified control semantics. It reduced some legacy weirdness but introduced new knobs people misread, especially around cpu.max and burst behavior.
- Modern CPUs aren’t “just cores.” Turbo, frequency scaling, SMT/Hyper-Threading, and shared caches make CPU isolation a probabilistic exercise unless you pin and isolate.
Fast diagnosis playbook
When CPU performance goes sideways, your first job is to identify whether you’re dealing with not enough CPU,
not getting scheduled, or doing too much per request. Everything else is decoration.
First: confirm the symptom and scope
- Is latency up, throughput down, or both?
- Is it one pod/VM, one node, one AZ, or everywhere?
- Did it start after a deploy, scaling event, node recycle, kernel update, or host type change?
Second: decide whether it’s throttling/steal or genuine saturation
- Containers: check cgroup throttling counters and the cpu.max/CFS quota configuration.
- VMs: check steal time and host-level contention.
- Both: check run queue length and context switch rate.
Third: check placement pathologies (NUMA, pinning, interrupts)
- NUMA imbalance or cross-node memory access.
- CPU pinned workloads sharing sibling hyperthreads.
- IRQs landing on the same cores as your latency-sensitive threads.
Fourth: validate frequency and power management
- Unexpectedly low CPU frequency due to power governor or thermal throttling.
- Turbo behavior changing after BIOS/firmware updates.
Fifth: only then look at “application inefficiency”
If you start with flamegraphs when the kernel is literally throttling your process, you’re doing archaeology with the lights off.
Practical tasks: commands, outputs, and decisions (12+)
These are the checks you can run today. Each one includes: the command, what plausible output means, and what decision to make next.
Run them on the node/host, then inside the container or VM as appropriate. Don’t rely on dashboards alone; dashboards are where nuance goes to die.
Task 1: See if the host is CPU saturated (run queue and load vs CPU count)
cr0x@server:~$ nproc
32
cr0x@server:~$ uptime
15:42:10 up 12 days, 3:17, 2 users, load average: 48.12, 44.90, 39.77
Meaning: Load average ~48 on a 32-core host suggests heavy runnable queueing (or lots of uninterruptible sleep; check next tasks).
If p99 is bad, this is a red flag.
Decision: If load > cores for sustained periods, stop arguing about micro-optimizations. Reduce contention: scale out, reduce co-location, or raise CPU allocation.
Task 2: Check CPU breakdown and steal time (VM clue)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (node-7) 01/12/2026 _x86_64_ (32 CPU)
15:42:21 CPU %usr %nice %sys %iowait %irq %soft %steal %idle
15:42:22 all 62.10 0.00 10.43 0.12 0.00 0.55 18.34 8.46
15:42:23 all 63.02 0.00 10.21 0.09 0.00 0.61 17.80 8.27
15:42:24 all 61.77 0.00 10.66 0.14 0.00 0.59 18.02 8.82
Meaning: 18% steal is huge. In a VM, it means the hypervisor is routinely running something else while your vCPUs are ready to run.
Decision: Don’t tune the app yet. Move the VM, reduce host overcommit, or negotiate reserved capacity. If you can’t, accept worse tail latency—it’s physics with invoices.
Task 3: Check per-process CPU and context switching pressure
cr0x@server:~$ pidstat -w -u 1 3
Linux 6.5.0 (node-7) 01/12/2026 _x86_64_ (32 CPU)
15:43:10   UID       PID    %usr %system    %CPU   CPU  Command
15:43:11  1001     24811   310.0    42.0   352.0     7  java

15:43:10   UID       PID   cswch/s  nvcswch/s  Command
15:43:11  1001     24811  12000.0      800.0   java
Meaning: Extremely high context switch rates usually mean heavy thread contention, lock churn, or oversubscription.
Decision: If this is a container on a shared node, consider pinning and reducing co-tenancy. If it’s a VM, verify vCPU count vs threads and check host scheduling.
Task 4: For containers, verify cgroup v2 CPU limits (cpu.max)
cr0x@server:~$ cat /sys/fs/cgroup/cpu.max
200000 100000
Meaning: This means a quota of 200ms CPU time per 100ms period (effectively 2 cores worth). If the workload bursts beyond that, it will be throttled.
Decision: For latency-sensitive services, avoid hard CPU caps unless you’re deliberately enforcing predictable behavior. Prefer requests/Guaranteed QoS and dedicated cores.
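Finding the right cpu.max for a specific container depends on your cgroup driver; with cgroup v2 and the systemd driver the pod hierarchy usually lives under kubepods.slice. A sketch (paths illustrative):
cr0x@server:~$ find /sys/fs/cgroup/kubepods.slice -name cpu.max 2>/dev/null | head -n 3
/sys/fs/cgroup/kubepods.slice/cpu.max
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/cpu.max
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.max
Check the leaf closest to your container; a cap set at a parent level also applies to everything below it.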
Task 5: For containers, check throttling counters
cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat
usage_usec 12888904512
user_usec 12110000000
system_usec 778904512
nr_periods 934112
nr_throttled 212334
throttled_usec 9901123456
Meaning: If nr_throttled is high and throttled_usec is non-trivial, your container is being forcibly paused by the kernel due to quota limits.
Decision: If you care about p99, either raise/remove the limit or redesign the burst pattern (e.g., concurrency control). Throttling is a latency machine.
Task 6: In Kubernetes, confirm QoS class (Guaranteed vs Burstable)
cr0x@server:~$ kubectl -n prod get pod api-7d9c6b8c9f-4kq2m -o jsonpath='{.status.qosClass}{"\n"}'
Burstable
Meaning: Burstable pods can be deprioritized under contention and are more likely to suffer CPU time variability.
Decision: For latency-critical services, aim for Guaranteed (set CPU requests equal to limits) or use dedicated nodes with no CPU limits but strong admission control.
Task 7: Check Kubernetes CPU requests/limits and spot “limit set, request tiny”
cr0x@server:~$ kubectl -n prod get pod api-7d9c6b8c9f-4kq2m -o jsonpath='{range .spec.containers[*]}{.name}{" req="}{.resources.requests.cpu}{" lim="}{.resources.limits.cpu}{"\n"}{end}'
api req=100m lim=2000m
Meaning: Scheduler thinks you need 0.1 core, but you can burst to 2 cores. Under contention, you’ll lose scheduling priority and get jitter.
Decision: For stable latency, set realistic requests. For cost efficiency, don’t lie to the scheduler and then complain about the results.
Task 8: Check CPU manager policy on a Kubernetes node (pinning capability)
cr0x@server:~$ ps -ef | grep -E 'kubelet.*cpu-manager-policy' | head -n 1
root 1123 1 1 Jan10 ? 00:12:44 /usr/bin/kubelet --cpu-manager-policy=static --kube-reserved=cpu=500m --system-reserved=cpu=500m
Meaning: static policy allows exclusive CPU allocation for Guaranteed pods (with integer CPU requests).
Decision: If you need consistent tail latency, consider nodes with static CPU manager plus topology manager. Without it, you’re gambling.
Task 9: Identify NUMA topology and whether your workload spans nodes
cr0x@server:~$ lscpu | egrep 'NUMA node|Socket|CPU\(s\)|Thread|Core'
CPU(s): 32
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
Meaning: Two NUMA nodes. If your process bounces across CPUs 0–31 and allocates memory freely, you may pay remote memory penalties.
Decision: For CPU-latency sensitive services, keep CPU and memory local: pin CPUs, use NUMA-aware scheduling, or run smaller instances that fit a node.
Task 10: Check which CPUs a process is allowed to run on (cpuset / affinity)
cr0x@server:~$ taskset -pc 24811
pid 24811's current affinity list: 0-31
Meaning: The process can run anywhere. That maximizes flexibility but can increase cache churn and NUMA effects.
Decision: If you see jitter and cross-NUMA issues, consider narrowing affinity (or use orchestrator features to do it safely). If throughput is the goal, leave it flexible.
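If you do decide to narrow it by hand (a blunt instrument; prefer orchestrator-level pinning where you have it), taskset changes affinity in place. Keeping the range inside one NUMA node from Task 9 is the point:
cr0x@server:~$ sudo taskset -pc 0-7 24811
pid 24811's current affinity list: 0-31
pid 24811's new affinity list: 0-7
This does not move memory that was already allocated; combine it with NUMA-aware startup (or a restart) if locality is the goal.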
Task 11: Check interrupt distribution (IRQ affinity hot spots)
cr0x@server:~$ cat /proc/interrupts | head -n 8
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
24: 98122321 0 0 0 1022334 0 0 0 PCI-MSI 524288-edge eth0-TxRx-0
25: 0 93455321 0 0 0 993442 0 0 PCI-MSI 524289-edge eth0-TxRx-1
26: 0 0 90211233 0 0 0 889120 0 PCI-MSI 524290-edge eth0-TxRx-2
27: 0 0 0 88777210 0 0 0 901223 PCI-MSI 524291-edge eth0-TxRx-3
NMI: 1223311 1219988 1213444 1209987 1198877 1190021 1189911 1180221 Non-maskable interrupts
Meaning: Network interrupts are concentrated on specific CPUs. If your latency-sensitive workload shares those CPUs, you’ll see jitter.
Decision: Adjust IRQ affinity or isolate CPUs for the workload. Don’t “solve” tail latency by adding retries; that’s how outages get personality.
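The knob itself is per-IRQ; the IRQ number below comes from the output above, and irqbalance may rewrite manual settings unless you exclude those IRQs or stop it:
cr0x@server:~$ cat /proc/irq/24/smp_affinity_list
0
cr0x@server:~$ echo 8-11 | sudo tee /proc/irq/24/smp_affinity_list
8-11
After the change, watch /proc/interrupts again and confirm the counters grow on the CPUs you intended, not the ones you meant to protect.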
Task 12: Check frequency governor and current MHz (silent performance killer)
cr0x@server:~$ cpupower frequency-info | egrep 'governor|current CPU frequency' | head -n 6
The governor "powersave" may decide which speed to use
current CPU frequency: 1200 MHz (asserted by call to hardware)
Meaning: powersave governor at 1.2GHz on a server expected to run hot is… not ideal.
Decision: Switch to performance governor for latency-sensitive nodes, or at least validate firmware/power settings. Measure power savings before you sell the idea.
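The change itself is one command (per-CPU "Setting cpu:" output omitted); note it usually does not survive a reboot without a tuned profile or a systemd unit, so treat this as a validation step, not the fix:
cr0x@server:~$ sudo cpupower frequency-set -g performance
cr0x@server:~$ cpupower frequency-info | grep 'The governor'
  The governor "performance" may decide which speed to use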
Task 13: In a VM, correlate steal with host contention (guest view)
cr0x@server:~$ sar -u 1 3
Linux 6.5.0 (vm-12) 01/12/2026 _x86_64_ (8 CPU)
15:45:01 CPU %user %nice %system %iowait %steal %idle
15:45:02 all 52.11 0.00 9.88 0.10 22.34 15.57
15:45:03 all 50.90 0.00 10.12 0.08 21.77 17.13
15:45:04 all 51.33 0.00 9.94 0.09 22.01 16.63
Meaning: Steal >20% means your VM is routinely runnable but not getting physical CPU. The guest still shows some idle, but usable capacity is much lower than the vCPU count implies.
Decision: Don’t add vCPUs reflexively; that often makes scheduling worse. Instead, reduce overcommit, move hosts, or use reserved instances/CPU pinning on the hypervisor.
Task 14: Spot hyperthread sibling contention (SMT/HT awareness)
cr0x@server:~$ for c in 0 1 2 3; do echo -n "cpu$c siblings: "; cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list; done
cpu0 siblings: 0,16
cpu1 siblings: 1,17
cpu2 siblings: 2,18
cpu3 siblings: 3,19
Meaning: CPU0 shares a physical core with CPU16, etc. If you pin two noisy workloads to sibling threads, expect performance surprises.
Decision: For latency-sensitive work, prefer one thread per core (avoid sibling sharing) or disable SMT on dedicated nodes if you can afford capacity loss.
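If you go the “no siblings” route on a dedicated pool, the kernel exposes a runtime switch; do the capacity math first, because this halves the logical CPU count immediately:
cr0x@server:~$ cat /sys/devices/system/cpu/smt/control
on
cr0x@server:~$ echo off | sudo tee /sys/devices/system/cpu/smt/control
off
Scheduler-visible CPUs drop to one per physical core; write "on" to re-enable, or use the nosmt boot parameter if you want it permanent.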
Joke #1: CPU limits are like office budgets—everyone feels “disciplined” until the first incident, then suddenly limits become “just a suggestion.”
Three corporate mini-stories from the CPU trenches
Mini-story 1: The incident caused by a wrong assumption (containers “have no overhead”)
A mid-sized SaaS company moved a latency-sensitive API from VMs to Kubernetes. The pitch was clean: “containers are lighter, so we’ll
get better performance and higher density.” The migration went fast, which should have been the first clue.
The API was deployed with CPU limits to “prevent noisy neighbors.” Requests were tiny—100m—because the service didn’t use much CPU
on average. Limits were 2 cores, which sounded generous. Under normal traffic, everything looked fine. Then they ran a marketing
campaign and traffic spiked. The service scaled out, but each pod hit its CPU quota repeatedly during request bursts.
The graphs showed CPU usage below node capacity, so the initial response was to tune application code and blame the database. Meanwhile,
the kernel quietly throttled the hottest pods. p99 latency doubled, then tripled. Timeouts followed. The autoscaler panicked and added
more pods, which increased contention and created a feedback loop of misery.
The actual fix was boring: remove CPU limits for this service on dedicated nodes, set realistic CPU requests, and use the static CPU
manager for pinned cores. They also put guardrails in admission control so “100m requests” didn’t slip into production for critical services.
The wrong assumption wasn’t “containers are faster.” Containers can be fast. The wrong assumption was that average CPU is the relevant metric for a spiky service
with strict tail latency.
Mini-story 2: The optimization that backfired (vCPU overprovisioning)
An enterprise team ran a VM-based platform for internal services. They were under pressure to reduce VM counts. Someone proposed
“right-sizing” by increasing vCPUs per VM so each VM could host more worker threads, reducing the number of instances.
On paper, it looked efficient: fewer VMs, less management overhead, better utilization. In practice, the hypervisor cluster was already
moderately overcommitted. Increasing vCPUs made each VM harder to schedule on busy hosts. Steal time rose, but only during peak hours.
Naturally, the incident started at 10:00 AM when everyone was watching.
The worst part: the team added vCPUs again, thinking the application was starving. That increased the runnable set and made scheduling
even more chaotic. Tail latency degraded across multiple services, not just the “optimized” one.
They recovered by rolling back vCPU changes, spreading load across more VMs, and enforcing a stricter overcommit policy for that cluster.
The lesson was unglamorous: bigger VMs are not always faster VMs. Scheduling is a real constraint, not an implementation detail.
Mini-story 3: The boring but correct practice that saved the day (CPU isolation + change discipline)
A payments-adjacent system had a small set of services with tight latency SLOs and heavy cryptography. The platform team kept them on a
dedicated pool of nodes, with static CPU manager, pinned cores, and explicit IRQ tuning. It was not popular. Dedicated pools look like
“wasted capacity” to finance and “wasted fun” to engineers who want one platform to rule them all.
They also had a change policy: kernel upgrades and BIOS updates were staged, with a canary node running synthetic load tests that tracked
p99 and jitter, not just throughput. Every time someone asked to skip the canary step, the answer was politely consistent: no.
One day, a routine fleet firmware update landed in the general pool. It changed CPU power behavior. Several services saw latency shifts,
and a few teams went into incident mode. The payments pool didn’t flinch. Their nodes were pinned to a validated governor profile and
were updated only after the canary results were stable.
The practice wasn’t clever. It was just consistent. In production, “boring” is a feature.
Joke #2: The only thing more virtual than a VM is the certainty in a postmortem written before the metrics come in.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 latency spikes every few minutes; average CPU looks fine
Root cause: Container CPU quota throttling (cgroup v2 cpu.max / CFS quota) during bursts.
Fix: Remove or raise CPU limits for the service; set realistic CPU requests; use Guaranteed QoS and consider CPU pinning for strict SLOs.
2) Symptom: VM CPU usage is high but throughput is low; graphs show “idle” too
Root cause: High steal time from host overcommit or CPU contention.
Fix: Reduce overcommit; move VM to less loaded host; use reservations or dedicated host policies; don’t add vCPUs as a first response.
3) Symptom: performance got worse after scaling up to a bigger instance type
Root cause: NUMA effects or cross-socket scheduling; memory is remote for part of the workload.
Fix: Ensure NUMA-aware placement; pin CPU and memory; prefer instances that fit within one NUMA node for latency-sensitive services.
4) Symptom: “random” latency spikes during network-heavy traffic
Root cause: Interrupts (IRQs) landing on application cores; softirq overload.
Fix: Tune IRQ affinity; separate networking interrupts from application CPUs; validate with /proc/interrupts and softirq stats.
5) Symptom: containerized Java/Go service becomes unstable under load, lots of context switches
Root cause: Thread oversubscription relative to CPU allocation; scheduler thrash amplified by tight CPU limits.
Fix: Reduce thread counts; increase CPU requests; avoid low requests with high limits; consider CPU pinning for stable behavior.
6) Symptom: everything slows down after “power saving” rollout
Root cause: CPU frequency governor set to powersave; thermal/power capping.
Fix: Use performance governor for critical nodes; validate BIOS settings; monitor frequency and throttling under sustained load.
7) Symptom: Kubernetes node looks underutilized but pods are slow
Root cause: CPU limits throttling per-pod; the node can be idle while individual pods are capped.
Fix: Revisit limits; use requests for placement, not limits as a blunt instrument; separate noisy workloads onto different nodes.
8) Symptom: VM-based service regresses after enabling “more security”
Root cause: Microcode/kernel mitigation changes affecting syscall/virtualization-heavy code paths (workload-dependent).
Fix: Benchmark before/after; isolate the change; if needed, adjust instance sizing or move the workload to a profile that tolerates the overhead.
Checklists / step-by-step plan
Step-by-step: choosing containers vs VMs for a CPU profile
- Write down the CPU success metric. p99 latency? requests/sec? job completion time? If you can’t name it, you’re going to optimize vibes.
- Classify the workload. Latency-sensitive, throughput batch, bursty interactive, mixed-mode, special CPU behavior.
- Decide on isolation needs. Multi-tenant with weak governance? Prefer VMs or dedicated nodes. Disciplined platform? Containers can work well.
- Decide how you will prevent noisy neighbors:
- Containers: requests, QoS, CPU pinning for critical pods, node pools, admission control.
- VMs: limit overcommit, reservations, CPU pinning where needed, host monitoring of contention.
- Set defaults that match reality. No “100m request” for services that spike. No “8 vCPUs” for a service that spends half its time waiting.
- Validate on representative hardware. NUMA topology, SMT, frequency behavior. “Same vCPU count” does not mean “same performance.”
- Load test for jitter, not just averages. Track p95/p99/p999 and variance under contention and during co-location.
- Operationalize the diagnosis. Put throttling/steal/run-queue metrics into alerts. If it’s not measured, it will become a debate.
Checklist: CPU governance for container platforms
- Define allowed ranges for CPU requests/limits per service tier.
- Enforce via admission policies: no tiny requests for critical workloads.
- Use Guaranteed QoS for strict SLO services; consider static CPU manager.
- Separate node pools: “throughput pack” vs “latency clean room.”
- Monitor cgroup throttling counters and alert on sustained throttling.
- Document when CPU limits are required (usually for fairness on shared batch pools).
Checklist: CPU governance for VM platforms
- Set and publish an overcommit policy (and stick to it).
- Monitor steal time and host run queue; alert when contention persists.
- Avoid reflexive vCPU scaling; validate scheduling impact.
- For critical workloads: consider CPU pinning and reserved capacity.
- Track NUMA placement and host topology alignment for larger VMs.
FAQ
1) Are containers always faster than VMs for CPU?
Often, containers have lower overhead because there’s no guest kernel and fewer virtualization exits. But “faster” collapses if you
throttle containers with CPU limits or pack them into contention. VMs can be very close to native when tuned and not overcommitted.
2) What’s the CPU equivalent of “noisy neighbor” in Kubernetes?
A pod with high CPU demand plus either no limits (hogging under contention) or badly set limits (causing throttling cascades in others
through scheduling pressure). The fix is tiered node pools, realistic requests, and enforcement.
3) Why do CPU limits hurt latency even when nodes are idle?
Because limits can act as hard caps per cgroup. Your pod may be throttled even if there’s spare CPU elsewhere on the node, depending
on how demand lines up with the quota period. The symptom is throttling counters rising while host CPU isn’t pegged.
4) Should I set CPU limits on every container?
No. Set limits when you’re intentionally constraining burst behavior for fairness (batch pools) or preventing runaway processes.
For latency-sensitive services, limits are frequently a self-inflicted wound unless you’ve validated they don’t induce throttling.
5) In VMs, is higher vCPU count always better?
Not under contention. More vCPUs can mean the VM is harder to schedule and can increase steal time. Match vCPUs to parallelism you can
actually use, and validate under peak conditions.
6) How do I know if NUMA is hurting me?
You’ll see inconsistent performance, worse tail latency on larger instances, and sometimes higher cross-socket traffic. Confirm with
NUMA topology, process CPU affinity, and (where available) NUMA memory locality stats. The fix is NUMA-aware placement and pinning.
7) Does SMT/Hyper-Threading help or hurt?
It helps throughput for many workloads, especially those with stalls. It can hurt predictability for latency-sensitive work when sibling
threads contend for shared execution resources. For strict SLOs, avoid sharing siblings or disable SMT on dedicated nodes if capacity allows.
8) If I’m on Kubernetes, should I use dedicated nodes for critical services?
Yes, when the service’s p99 matters and the rest of the cluster is a mixed bag of workloads and teams. Dedicated nodes simplify the
performance story and reduce incident entropy. It’s not “waste,” it’s paying for fewer 3 AM surprises.
9) What metrics should page me for CPU issues?
Containers: throttled time (cpu.stat), run queue, context switches, and node CPU saturation. VMs: steal time plus guest run queue and
host contention. For both: p99 latency correlated with scheduling indicators.
10) Can I get VM-like isolation with containers?
You can get close for CPU by using dedicated node pools, Guaranteed QoS, static CPU manager pinning, and strict policy enforcement.
You still share a kernel, so the isolation story is not identical. Whether that matters depends on your threat model and operational maturity.
Conclusion: next steps you can actually take
Choose based on CPU profile, not fashion. If you need deterministic tail latency and you don’t have strong platform governance, start
with VMs or dedicated container nodes. If you want throughput and density, containers are usually the right bet—just don’t sabotage them
with naive CPU limits.
Next steps:
- Pick one critical service and run the fast diagnosis checks during peak load: throttling (containers) or steal (VMs), run queue, IRQ hot spots, frequency.
- Fix one policy issue, not ten: either remove harmful CPU limits for latency services or reduce VM overcommit where steal is high.
- Create two node/host classes: “latency clean room” (pinned/isolated) and “throughput pack” (high density, fair sharing).
- Make requests/limits (or vCPU sizing) part of code review with a short rubric. The pager will thank you later.