Ubuntu 24.04 CPU Steal and Virtualization Overhead: How to Spot It and What to Do (Case #44)


Your service is slow. Load average looks spicy. Users complain. But inside the VM, CPU “usage” isn’t even that high. You scale the deployment, throw a bigger instance at it, and somehow it still feels like running through wet cement.

This is where CPU steal time and virtualization overhead show up—quietly, like a tax you didn’t vote for. If you run Ubuntu 24.04 on KVM, VMware, Xen, or “cloud”, you need to be able to prove (not guess) when the hypervisor is the bottleneck, and what leverage you actually have.

What CPU steal time actually is (and what it isn’t)

CPU steal time is the portion of time your guest OS wanted to run on a physical CPU, but the hypervisor said “not now”. It’s measured by the guest kernel and exposed as %steal (tools like mpstat, top, vmstat, sar) and as the steal field in /proc/stat.

In plain terms: you paid for vCPUs. The host had other ideas. That “other ideas” bucket is steal. You don’t see it as high user CPU or system CPU inside the VM, because your VM wasn’t running. It was waiting offstage while other guests got the microphone.

Steal time is not the same as throttling

Throttling is enforced limit behavior: the VM is prevented from using more than some allocation (often via cgroups quotas, CPU credit models, or provider policies). Steal time is contention behavior: you are runnable, but not scheduled on the physical core because someone else is using it.

Steal time is not I/O wait

%iowait is time spent waiting on I/O completion while the CPU is idle. %steal is time you could have been executing, but the hypervisor didn’t schedule your vCPU. Both make the app slow. Only one implicates the host’s CPU scheduling.
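
If you want to see the distinction in numbers rather than prose, here is a minimal sketch (plain bash and awk reading /proc/stat, present on any Ubuntu 24.04 guest) that samples the aggregate CPU counters twice and prints %steal next to %iowait for the interval:

# Field order on the aggregate "cpu" line of /proc/stat:
# user nice system idle iowait irq softirq steal guest guest_nice
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
t1=$((u1+n1+s1+i1+w1+q1+sq1+st1)); t2=$((u2+n2+s2+i2+w2+q2+sq2+st2))
awk -v dt=$((t2-t1)) -v dst=$((st2-st1)) -v dwa=$((w2-w1)) \
    'BEGIN { printf "steal=%.1f%%  iowait=%.1f%%\n", 100*dst/dt, 100*dwa/dt }'

If steal dominates, the host is withholding CPU time; if iowait dominates, go look at storage.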

What virtualization overhead means in practice

Even with zero steal, virtual machines pay overhead: VM exits, emulation for certain devices, interrupt virtualization, TLB flush costs, nested page tables, and all the scheduler gymnastics that come with multiplexing many virtual CPUs onto fewer physical ones.

On modern KVM/VMware this overhead is usually small when configured sanely. When it’s not small, it’s usually because you’re doing one of these:

  • Running too many vCPUs per VM relative to your real parallelism (SMP penalty, scheduling delays); a quick sanity check is sketched after this list.
  • Using oversubscribed hosts (high steal/ready).
  • Driving high interrupt rates (networking, storage) through suboptimal virtual devices or offload settings.
  • Mixing latency-sensitive workloads with batch-y CPU hogs on the same host.
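
A rough sanity check for the first bullet: compare the vCPU count the guest sees with how much runnable work actually exists. This is an indicator, not a benchmark.

echo "vCPUs:            $(nproc)"
echo "load average:     $(cut -d' ' -f1-3 /proc/loadavg)"
echo "runnable threads: $(ps -eLo stat= | grep -c '^R')"   # threads in R state right now

If runnable threads rarely approach the vCPU count, a smaller VM will be easier for the host to schedule and you give up very little.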

Paraphrased idea from Werner Vogels (reliability/operations): “Everything fails, all the time; design and operate with that assumption.” It applies here too: assume the hypervisor is busy unless you can prove it’s not.

Joke #1: If you’ve never been gaslit by a hypervisor, you haven’t lived. The VM swears it’s idle while your users swear it’s on fire.

Interesting facts and context (worth knowing)

  • Steal time appeared in the Linux paravirtualization era: early Xen guests could explicitly observe when they were descheduled, which later became a standard accounting concept.
  • VMware’s “CPU Ready” predates most Linux dashboards: many enterprises learned this problem through vCenter graphs before Linux admins saw %steal in their tools.
  • Cloud “vCPU” has never guaranteed “a physical core”: oversubscription is normal; what you buy is a scheduling share, not a seat with your name on it.
  • High steal is usually a host symptom, not a guest symptom: inside the VM you can’t directly see who the noisy neighbor is, only that you’re losing scheduling time.
  • Steal can be near-zero and you can still be slow: virtualization overhead can show up as increased context switch cost, interrupt overhead, and higher tail latency without obvious steal spikes.
  • CPU steal is often bursty: a neighbor’s cron job, backup, or ETL can create periodic spikes that ruin p99 latency while average metrics look fine.
  • Timekeeping and virtualization used to be a mess: older kernels and hypervisors had clock drift and TSC issues; modern Ubuntu 24.04 largely hides this, but when time goes weird, scheduling metrics get misleading fast.
  • NUMA matters more than people admit: vCPU placement across sockets and memory locality can cause latency that looks like “CPU” but is really cross-node memory traffic.
  • Nested virtualization compounds overhead: running a hypervisor inside a VM adds more VM exits and scheduling layers; the guest sees more variability even if steal seems moderate.

Symptoms: how CPU steal masquerades as everything else

CPU steal time is a master of disguise because the guest OS is not “busy” in the usual way. Your thread is runnable, but the vCPU isn’t getting time on silicon. That creates a specific pattern of pain:

Symptom patterns you should recognize

  • High load average with modest CPU usage: runnable tasks pile up, but user/system CPU inside the VM doesn’t climb proportionally.
  • Latency spikes across unrelated endpoints: everything slows down, not just one subsystem. p99 goes off a cliff.
  • “Random” timeouts: TLS handshakes, database queries, and queue consumers time out under modest traffic.
  • Throughput drops after “successful” scaling: adding more vCPUs or more pods increases runnable threads, which can increase scheduling contention and make things worse.
  • Soft lockups and missed heartbeats: system daemons fail to run on time; monitoring sees flapping.
  • GC pauses look longer: not because the GC changed, but because the process is being descheduled mid-collection.

Think of steal as a hidden queue in front of your CPU. The guest sees the queue but not the cashier.

Fast diagnosis playbook (first / second / third)

This is the triage order that saves time. Don’t freestyle it. You’re trying to answer one question: Are we CPU-starved because the host won’t schedule us, or because we’re actually CPU-bound inside the VM?

First: prove or rule out steal

  1. Check %steal over time (mpstat), not a single snapshot; a small averaging sketch follows this list.
  2. Correlate with latency spikes (app metrics) or queue depth (system metrics).
  3. If steal is consistently >2–5% during incidents, treat it as real contention. If it spikes to 20–50%, that’s an incident even if your CPU graphs look “fine”.
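
A minimal averaging sketch, assuming sysstat is installed: run mpstat for 60 seconds and pull the average %steal out of the summary line. The column is located by header name because positions differ across sysstat versions and locales.

mpstat 1 60 | awk '
  /CPU/ && /%steal/ {                       # header line: work out column offsets
    for (i = 1; i <= NF; i++) { if ($i == "CPU") c = i; if ($i == "%steal") s = i }
    off = s - c
  }
  $1 == "Average:" && $2 == "all" && off {  # summary line at the end
    print "average %steal over the window:", $(2 + off)
  }'

Run it during the slow period, not after. Steal that averages above a few percent while latency is bad is a finding, not noise.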

Second: separate “CPU busy” from “CPU blocked”

  1. Check run queue and context switch patterns (vmstat, pidstat).
  2. Check iowait and disk latency (iostat) so you don’t blame steal for storage stalls.
  3. Check memory pressure (free, psi) so you don’t confuse reclaim stalls with CPU contention.

Third: identify the lever you actually have

  1. If you’re on a shared cloud instance: migrate instance type/size or move to dedicated hosts/isolated cores.
  2. If you’re on your own KVM/VMware: fix oversubscription, vCPU sizing, NUMA alignment, and noisy neighbor placement.
  3. If you’re in Kubernetes: decide whether to fix the node (host-level) or limit the pod (cgroups) or move workloads.

Practical tasks: commands, interpretation, decisions

You wanted real work, not theory. Here are concrete tasks you can run on Ubuntu 24.04 guests to detect steal and distinguish it from other bottlenecks. Each task includes (1) the command, (2) what the output means, and (3) the decision you make.

Task 1: Confirm you’re virtualized (and how)

cr0x@server:~$ systemd-detect-virt
kvm

Meaning: You’re inside a VM, so steal is a candidate. If it prints none, steal is not your problem (look at CPU-bound, I/O, locks, or throttling).

Decision: If virtualized, continue with steal checks. Also note the hypervisor family; it influences what metrics exist on the host side.
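
If you want all the breadcrumbs in one go, here is a small sketch; these commands exist on stock Ubuntu 24.04, and lscpu only prints the hypervisor line when it detects one.

systemd-detect-virt --vm          # hypervisor family (kvm, vmware, microsoft, ...) or "none"
systemd-detect-virt --container   # container runtime, or "none"
lscpu | grep -i 'hypervisor'      # "Hypervisor vendor:" line when virtualized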

Task 2: Snapshot CPU accounting including steal

cr0x@server:~$ top -b -n 1 | head -n 5
top - 10:42:18 up 12 days,  3:21,  2 users,  load average: 8.21, 7.90, 6.44
Tasks: 312 total,   3 running, 309 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.2 us,  3.1 sy,  0.0 ni, 79.8 id,  0.4 wa,  0.0 hi,  0.3 si,  7.2 st
MiB Mem :  32085.4 total,   1954.7 free,  11892.3 used,  18238.4 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  20193.1 avail Mem

Meaning: st is steal. Here it’s 7.2%—not subtle. Load average is high while idle is high: classic contention signature.

Decision: If st is non-trivial, stop blaming “the app” until you measure more. Move to time-series sampling next.

Task 3: Time-series CPU steal per core

cr0x@server:~$ mpstat -P ALL 1 10
Linux 6.8.0-41-generic (server)  12/30/2025  _x86_64_  (8 CPU)

10:42:30 AM  CPU   %usr  %nice   %sys %iowait  %irq  %soft  %steal  %idle
10:42:31 AM  all   12.5   0.0    4.0    0.3    0.0   0.5     9.8    72.9
10:42:31 AM    0   10.0   0.0    3.0    0.0    0.0   0.0    18.0    69.0
10:42:31 AM    1    9.0   0.0    2.0    0.0    0.0   0.0     1.0    88.0
...

Meaning: Steal is uneven across vCPUs. That often indicates scheduler contention and vCPU placement effects. If %steal is consistently >2–5% during slow periods, it’s real.

Decision: If steal is bursty, capture it during incidents and correlate with latency. If it’s constant, consider migrating or resizing; constant steal is a capacity problem.

Task 4: Compare load average to run queue and blocked tasks

cr0x@server:~$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 201312  98244 932112    0    0     1     8  520 1220 10  3 76  1 10
 8  0      0 200984  98244 932120    0    0     0     0  610 1405 12  4 70  0 14

Meaning: r (runnable) is high; b (blocked) is low. CPU has meaningful st. This points to CPU scheduling contention rather than I/O blockage.

Decision: If b is high, investigate storage and network. If r is high with steal, focus on host contention or vCPU sizing.

Task 5: Check if you’re being CPU-throttled by cgroups (containers/systemd)

cr0x@server:~$ systemctl show -p CPUQuotaPerSecUSec docker
CPUQuotaPerSecUSec=infinity

Meaning: No systemd CPU quota at the service level. In Kubernetes or container runtimes, throttling may still occur at the pod/container cgroup.

Decision: If quotas exist, fix throttling first; throttling can look like “slow CPU” even with zero steal.

Task 6: Look for container CPU throttling evidence (cgroup v2)

cr0x@server:~$ cat /sys/fs/cgroup/cpu.stat
usage_usec 389432187
user_usec 312903110
system_usec 76529077
nr_periods 21903
nr_throttled 0
throttled_usec 0

Meaning: nr_throttled and throttled_usec show cgroup CPU throttling. Here they’re zero, so nothing in this cgroup has been throttled. For containers and pods, also check the workload’s own cpu.stat under its cgroup, not just the top-level file.

Decision: If throttling is non-zero and climbing, raise limits/requests, reduce concurrency, or move to dedicated CPU pools; don’t blame steal yet.
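
To find out which cgroup is being throttled instead of staring at one file, here is a minimal scan sketch, assuming cgroup v2 mounted at /sys/fs/cgroup as on stock Ubuntu 24.04:

# Print every cgroup whose CPU quota has ever kicked in, with its counters.
find /sys/fs/cgroup -name cpu.stat 2>/dev/null | while read -r f; do
  awk -v cg="${f%/cpu.stat}" '
    $1 == "nr_throttled"   { t = $2 }
    $1 == "throttled_usec" { u = $2 }
    END { if (t > 0) printf "%s  nr_throttled=%s  throttled_usec=%s\n", cg, t, u }' "$f"
done

Kubernetes pods typically live under kubepods.slice, systemd services under system.slice.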

Task 7: Check pressure stall information (PSI) for CPU saturation signals

cr0x@server:~$ cat /proc/pressure/cpu
some avg10=2.14 avg60=1.05 avg300=0.34 total=9823412
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

Meaning: PSI CPU some indicates time with at least one task waiting for CPU. If this rises with %steal, you have contention. If PSI is high with low steal, you may simply be CPU-bound inside the VM.

Decision: High PSI + high steal: move hosts/instance types or reduce host overcommit. High PSI + low steal: optimize application CPU usage or add real compute.
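
To watch the two signals side by side during an incident, a small sketch that prints PSI “some avg10” next to the %steal measured over each five-second interval:

# Ctrl-C to stop.
read_steal() { awk '/^cpu /{print $9}' /proc/stat; }
read_total() { awk '/^cpu /{s=0; for (i=2; i<=9; i++) s+=$i; print s}' /proc/stat; }
prev_s=$(read_steal); prev_t=$(read_total)
while sleep 5; do
  cur_s=$(read_steal); cur_t=$(read_total)
  psi=$(awk '/^some/ {print $2}' /proc/pressure/cpu)        # e.g. avg10=2.14
  steal=$(awk -v ds=$((cur_s - prev_s)) -v dt=$((cur_t - prev_t)) \
              'BEGIN { printf "%.1f", dt ? 100 * ds / dt : 0 }')
  echo "$(date +%T)  cpu-pressure ${psi}  steal ${steal}%"
  prev_s=$cur_s; prev_t=$cur_t
done

If both climb together, the host is the suspect. If pressure climbs while steal stays flat, you are burning your own CPU.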

Task 8: Separate I/O wait from steal with iostat

cr0x@server:~$ iostat -xz 1 5
Linux 6.8.0-41-generic (server)  12/30/2025  _x86_64_  (8 CPU)

avg-cpu:  %user %nice %system %iowait  %steal  %idle
          11.9  0.00    3.8    0.4     9.7    74.2

Device             r/s   rkB/s  r_await    w/s   wkB/s  w_await  aqu-sz  %util
nvme0n1            2.1    45.2     0.35    3.4    88.3     1.20    0.01    0.3

Meaning: r_await/w_await are low and %util is tiny; storage is not the bottleneck. Meanwhile %steal is ~10%.

Decision: Don’t tune disks. Escalate to hypervisor scheduling contention or VM sizing/placement.

Task 9: Check network softirq load (virtual NIC overhead can look like CPU problems)

cr0x@server:~$ cat /proc/softirqs | head
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
HI:                 12          8          4          7          6          5          4          6
TIMER:        1234567    1122334    1099887    1044556    1001222     998877     912345     887766
NET_TX:         23456      19876      21011      18765      16543      17002      16001      15888
NET_RX:        345678     312345     321002     300111     280987     275444     260999     255888

Meaning: Massive NET_RX/NET_TX counts can imply high packet rates, which may increase CPU overhead and scheduling jitter—especially with bad offload settings or small packets.

Decision: If networking is hot, check NIC type (virtio vs emulated), offloads, and interrupt distribution. But don’t confuse this with steal; it can coexist.
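
The counters in /proc/softirqs are cumulative since boot, so only their rate of change tells you anything. A minimal sampling sketch:

rate() {  # $1 = softirq name, e.g. NET_RX or NET_TX
  local a b
  a=$(awk -v n="$1:" '$1 == n { for (i = 2; i <= NF; i++) s += $i } END { print s + 0 }' /proc/softirqs)
  sleep 5
  b=$(awk -v n="$1:" '$1 == n { for (i = 2; i <= NF; i++) s += $i } END { print s + 0 }' /proc/softirqs)
  echo "$1: $(( (b - a) / 5 )) events/s summed across CPUs"
}
rate NET_RX
rate NET_TX

Interpret the numbers against your own traffic profile, not against a universal threshold.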

Task 10: Check vCPU topology exposed to the guest

cr0x@server:~$ lscpu | egrep 'CPU\(s\)|Thread|Core|Socket|NUMA'
CPU(s):                               8
Thread(s) per core:                   1
Core(s) per socket:                   8
Socket(s):                            1
NUMA node(s):                         1

Meaning: This guest sees a simple 1-socket topology. If you see multiple sockets/NUMA nodes in a VM and the workload isn’t NUMA-aware, you can get nasty latency outliers without huge steal.

Decision: For latency-sensitive apps, prefer fewer sockets and sensible vCPU counts; avoid huge SMP VMs unless you truly need parallelism.

Task 11: Look for clocksource/timekeeping weirdness (it can amplify scheduling pain)

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock

Meaning: kvm-clock is expected under KVM. Odd clocksources or frequent time jumps can make latency analysis misleading and can break timeout logic.

Decision: If time jumps are observed (logs show “time went backwards” style errors), investigate hypervisor time config and NTP/chrony. Don’t chase phantom performance bugs.
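
A quick sketch of the time checks worth recording next to your performance data. It assumes chrony is your NTP client; stock Ubuntu may use systemd-timesyncd instead, in which case the timedatectl line still applies.

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
chronyc tracking 2>/dev/null | egrep 'Stratum|System time|Frequency'   # if chrony is installed
timedatectl | grep -i 'synchronized'                                   # works regardless of NTP client
journalctl -k | grep -i clocksource | tail -n 5                        # clocksource switches in the kernel log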

Task 12: Capture scheduling delay evidence with perf sched (short, controlled)

cr0x@server:~$ sudo perf sched record -a -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 6.214 MB perf.data (12698 samples) ]
cr0x@server:~$ sudo perf sched latency --sort max | head
Task                  |   Runtime ms  | Switches | Avg delay ms | Max delay ms
postgres:checkpointer |      120.11   |    1321  |       0.18   |      22.34
nginx:worker          |       98.45   |    1887  |       0.12   |      18.02

Meaning: Big max scheduling delays suggest the guest isn’t getting CPU in a timely manner. This can be due to steal (hypervisor contention) or due to guest CPU saturation.

Decision: If %steal is high at the same time, you have host contention. If %steal is low, you’re likely overloaded inside the VM (or suffering from lock contention).

Task 13: Check kernel messages for soft lockups and scheduling stalls

cr0x@server:~$ journalctl -k --since "1 hour ago" | egrep -i "soft lockup|rcu_sched|stall|watchdog" | tail -n 5
Dec 30 10:11:02 server kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [java:28199]
Dec 30 10:11:02 server kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks: { 3-... } (detected by 2, t=5250 jiffies)

Meaning: Stalls can happen with extreme steal because the guest simply doesn’t get scheduled often enough. It can also happen with real CPU lockups, but in VMs contention is a frequent trigger.

Decision: If this correlates with steal spikes, escalate to platform/host capacity. If it correlates with CPU pegged and zero steal, investigate runaway threads or kernel bugs.

Task 14: Confirm hypervisor hints (DMI) for escalation breadcrumbs

cr0x@server:~$ sudo dmidecode -s system-product-name
Standard PC (Q35 + ICH9, 2009)

Meaning: This can help you identify the virtualization stack (QEMU machine type here). It’s useful when you need to file a ticket or match known issues.

Decision: Record this along with steal graphs and incident timestamps. When you escalate, bring evidence, not vibes.

Task 15: If you suspect host overcommit, check steal from Prometheus node_exporter style counters (guest-side)

cr0x@server:~$ awk '/^cpu /{print "steal_jiffies=" $9}' /proc/stat
steal_jiffies=981223

Meaning: On the aggregate cpu line, the ninth whitespace-separated field (counting the leading “cpu” label) is steal time in jiffies. It’s how exporters compute steal rate. Useful for verifying what your monitoring should be seeing.

Decision: If your dashboards show zero steal but /proc/stat is moving, your monitoring is wrong. Fix the monitoring before you “fix” production.
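
A minimal cross-check sketch, assuming node_exporter is running on this host at localhost:9100 (adjust the URL for your setup): the exporter’s steal counter and the kernel’s own number should track each other.

# Sum the exporter's per-CPU steal counters (seconds).
curl -s http://localhost:9100/metrics \
  | awk '/node_cpu_seconds_total/ && /mode="steal"/ { s += $2 } END { printf "exporter steal seconds: %.0f\n", s }'
# Kernel's aggregate steal, converted from ticks to seconds.
awk -v hz="$(getconf CLK_TCK)" '/^cpu /{ printf "kernel steal seconds:   %.0f\n", $9 / hz }' /proc/stat

They won’t match to the decimal (different sampling moments), but they should be in the same ballpark and both moving.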

Task 16: Check if virtualization features are available (paravirt hints)

cr0x@server:~$ sudo dmesg | egrep -i "kvm-clock|pvclock|Hypervisor detected|Booting paravirtualized kernel" | tail -n 5
[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00

Meaning: Paravirtualized clock and hypervisor detection are normal and usually good. If you don’t see virt features you expect, you might be using emulated devices or missing optimizations, which increases overhead.

Decision: If virt features are missing, review the VM configuration (virtio drivers, CPU model passthrough, correct machine type). Fixing that can reduce overhead even if steal is low.
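
A quick sketch to confirm the guest is actually using virtio rather than emulated devices; lspci comes from pciutils, which is normally present on Ubuntu server.

lspci -nn | grep -i virtio                 # virtio network/block/scsi controllers, if any
ls /sys/bus/virtio/devices/ 2>/dev/null    # bound virtio devices (empty means none)
lsmod | grep -E '^virtio'                  # loaded virtio modules (some may be built into the kernel)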

What you can control vs what the provider controls

Steal time is politically awkward: the root cause is often “somebody else on the host.” In a public cloud, that somebody else is not going to join your incident call. Your job is to identify which levers exist in your environment and pull the ones that pay.

If you run on public cloud shared tenancy

  • You can: change instance family, size up/down, move regions/availability zones, use dedicated hosts/instances, use CPU pinning features where available, adjust autoscaling and concurrency, reduce vCPU count to match actual parallelism.
  • You cannot: fix host oversubscription directly, evict noisy neighbors, control host kernel versions, or change hypervisor scheduling policy.

Practical guidance: if steal spikes are regular and correlated with business hours, treat it as a capacity class mismatch. Move to a more predictable tier (dedicated, reserved performance, isolated core offerings). Paying slightly more is cheaper than paying engineers to explain why p99 is haunted.

If you run your own KVM/Proxmox/OpenStack

  • You can: set vCPU:pCPU ratios, pin vCPUs, isolate host cores, tune host scheduler, enforce workload placement, monitor host run queues, and avoid mixing latency workloads with batch compute.
  • You cannot: avoid the laws of physics; if you oversubscribe and everyone gets busy, steal is the correct outcome.

If you run on VMware

VMware calls the cousin of this metric “CPU Ready”. The guest’s %steal will often correlate, but not perfectly. The operational meaning is the same: the VM is runnable but not scheduled. Use host-side metrics for the final verdict, but don’t dismiss guest steal—it’s often the earliest signal.

If you run Kubernetes on VMs

You get two layers of scheduling contention:

  • Linux cgroups (pod limits) can throttle CPU, which is not steal but feels similar to apps.
  • Hypervisor scheduling can steal CPU from the node, which the kubelet can’t fix.

Operationally: if node-level steal rises, rescheduling pods within the same node pool often does nothing. You need node replacement, different instance type, or dedicated capacity.

Three corporate mini-stories (anonymized, plausible, painful)

Mini-story 1: The incident caused by a wrong assumption

They ran a payments API on Ubuntu VMs. The incident started as “database is slow.” App teams saw timeouts, SREs saw load average climbing, and everyone stared at the database graphs like they personally owed them money.

The wrong assumption: “CPU is fine because usage is only 25%.” That number was inside the guest. The hosts were oversubscribed, and a batch analytics tenant on the same hypervisor cluster began a CPU-heavy job. The payments VMs showed %steal around 15–30% for forty minutes. Not enough to pin CPU at 100%, but enough to blow up tail latency. Requests timed out, retries piled on, and the queue pressure made it worse.

The debugging was delayed because the team’s dashboards didn’t include steal time. Their node exporter scrape existed, but their Grafana panel only charted user/system/idle. So the incident narrative became “the app is slow for unknown reasons,” which is a polite way of saying “we don’t measure what matters.”

Once they added rate(node_cpu_seconds_total{mode="steal"}[5m]) to their standard node panel, they could see the contention signature clearly. The actual fix was boring: move the payments tier onto a less-oversubscribed cluster and cap batch tenant placement during business hours. The postmortem action item wasn’t “optimize SQL.” It was “stop assuming CPU usage inside a VM is the whole story.”

Mini-story 2: The optimization that backfired

A platform team tried to “help” a Java service by doubling its vCPUs: from 8 to 16. Their reasoning was clean: more cores, more throughput. The service did get a little faster in synthetic tests, which was enough to declare victory and push the change to production.

Then p99 latency got worse. Not always. Just during peak. That’s how these gremlins work: they wait until you’re confident.

The host cluster was moderately oversubscribed. A 16-vCPU VM is harder to schedule than an 8-vCPU VM when the hypervisor tries to co-schedule vCPUs fairly. Under contention, the VM spent more time waiting for the host to find CPU slots. Guest %steal climbed. The application also increased its thread pool sizes because it “saw” more CPUs, which increased runnable threads and lock contention. They had improved the system’s ability to fight itself.

The rollback—reducing back to 8 vCPUs and limiting concurrency—stabilized latency. The eventual improvement was more nuanced: move to a CPU-optimized instance family and use fewer, faster vCPUs with better per-core performance. The lesson: scaling vCPU count can increase scheduling latency and overhead. Bigger is not always faster; sometimes it’s just larger.

Mini-story 3: The boring but correct practice that saved the day

An infrastructure team had a habit: every production node dashboard included steal time, PSI, and a simple “host contention suspicion” alert when steal exceeded a small threshold for more than a few minutes. Nobody was excited about it. It wasn’t glamorous. It was just there, like seatbelts.

One Tuesday, a customer reported intermittent slowness. The app graphs looked mostly fine, but p95 latency for a key endpoint spiked every 20–30 minutes. The team’s first look at the node showed %steal spikes matching the pattern almost perfectly. No guessing. No two-hour debate about garbage collection.

They tagged the incident as “platform contention,” moved the workload to a dedicated node pool, and the problem disappeared. Later, they found the periodic spikes were driven by maintenance tasks on other guests sharing the same host cluster. The team didn’t need to prove who the neighbor was; they just needed to protect the service.

The practice that saved them was not a clever kernel tweak. It was having the right metrics, already graphed, and an escalation path that didn’t depend on someone’s mood. Boring wins again.

Joke #2: The best performance fix is sometimes relocation. It’s like moving apartments because the neighbor practices drums—technically not your bug, but still your problem.

Common mistakes: symptom → root cause → fix

This section is intentionally blunt. These are the mistakes that keep incidents alive longer than necessary.

1) High load average, CPU “idle” high

Symptom: Load average climbs, user CPU stays modest, app latency increases.

Root cause: CPU steal time (host contention) or severe scheduling delay.

Fix: Confirm %steal via mpstat. If sustained, migrate to less-contended capacity, reduce vCPU count, or move to dedicated hosts. If you own the hypervisor, reduce overcommit and isolate noisy workloads.

2) “We scaled up, it got worse”

Symptom: Increasing vCPUs or pod replicas increases tail latency.

Root cause: Larger SMP VMs are harder to schedule; increased concurrency increases runnable threads and lock contention; host is oversubscribed.

Fix: Scale out with more smaller VMs (if host capacity allows) or move to better instance family. Cap concurrency. Prefer fewer vCPUs with higher per-core performance for latency services.

3) Confusing cgroup throttling with steal

Symptom: Service slow, CPU usage plateaus, steal near zero.

Root cause: CPU quota throttling (nr_throttled rises) or Kubernetes limits too low.

Fix: Inspect /sys/fs/cgroup/cpu.stat. Adjust limits/requests, use Guaranteed QoS for critical workloads, or move to dedicated CPU nodes.

4) Blaming storage because “iowait exists”

Symptom: Small %wa appears, app slow, people yell “disk”.

Root cause: CPU steal or scheduling delay, while storage is fine.

Fix: Use iostat -xz to validate await and %util. If disk is not busy and steal is high, stop tuning storage.

5) Not capturing during the incident

Symptom: “We checked later and it looked fine.”

Root cause: Steal is bursty; you missed it.

Fix: Add a lightweight incident capture script (mpstat/vmstat/iostat snapshots), enable continuous monitoring, and alert on steal sustained above threshold.
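
A lightweight capture sketch you can drop on a box before the next incident; the output path and durations are examples, and mpstat/iostat assume sysstat is installed.

#!/usr/bin/env bash
# Capture 60 seconds of CPU/steal evidence into a timestamped directory.
set -u
out="/var/tmp/steal-capture-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
date -u                            > "$out/timestamp"
systemd-detect-virt                > "$out/virt" 2>&1
cat /proc/pressure/cpu             > "$out/psi_cpu"
cat /sys/fs/cgroup/cpu.stat        > "$out/cpu_stat_before" 2>/dev/null
awk '/^cpu /{print $9}' /proc/stat > "$out/steal_jiffies_before"
mpstat -P ALL 1 60 > "$out/mpstat.txt" &
vmstat 1 60        > "$out/vmstat.txt" &
iostat -xz 1 60    > "$out/iostat.txt" &
wait
awk '/^cpu /{print $9}' /proc/stat > "$out/steal_jiffies_after"
cat /sys/fs/cgroup/cpu.stat        > "$out/cpu_stat_after" 2>/dev/null
echo "captured to $out"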

6) Treating steal as “normal cloud noise”

Symptom: Regular latency spikes accepted as fate.

Root cause: Instance class mismatch; shared tenancy too contended for SLA.

Fix: Use dedicated/isolated capacity for latency-critical services, or move the service to a tier where predictable scheduling is part of the product.

7) Over-tuning kernel and ignoring placement

Symptom: Weeks of sysctl tweaks, no sustained improvement.

Root cause: Host contention; tuning can’t out-argue the scheduler when you’re not scheduled.

Fix: Solve capacity and placement first. Tune second, and only with measured benefits.

Checklists / step-by-step plan

Checklist A: When a VM is slow and you suspect virtualization overhead

  1. Confirm virtualization: systemd-detect-virt.
  2. Snapshot CPU with steal: top (look for st).
  3. Sample steal over time: mpstat -P ALL 1 60 during the incident.
  4. Check run queue and steal together: vmstat 1 60.
  5. Rule out disk: iostat -xz 1 10.
  6. Rule out cgroup throttling: cat /sys/fs/cgroup/cpu.stat.
  7. Check PSI: cat /proc/pressure/cpu and memory PSI too if needed.
  8. If steal is confirmed, decide: migrate/resize/dedicate vs tune guest.

Checklist B: If you control the hypervisor (KVM/VMware) and steal is high

  1. Find host CPU overcommit and run queue pressure (host-side tooling).
  2. Reduce vCPU:pCPU ratios for latency clusters.
  3. Stop mixing batch CPU hogs with latency services on the same hosts/pools.
  4. Pin or isolate CPUs for critical workloads if your operational model supports it.
  5. Validate NUMA alignment: avoid spanning sockets unnecessarily.
  6. Use virtio devices and modern CPU models; avoid emulation where possible.
  7. Re-test with mpstat and application p95/p99 after each change.

Checklist C: For Kubernetes nodes on VMs

  1. Check node-level steal (mpstat) and container throttling (cpu.stat).
  2. If steal is high: replace node(s) or move to different node pool/instance class.
  3. If throttling is high: adjust limits, use dedicated CPU manager policies, or change QoS.
  4. Do not “fix” by increasing replicas without understanding contention; you can amplify it.

Operational thresholds (opinionated, useful)

  • Steal < 1% most of the time: fine for many workloads.
  • Steal 1–5% sustained: investigate for latency-sensitive systems; likely contention.
  • Steal 5–10% sustained: expect user-visible impact under load; treat as a platform issue.
  • Steal > 10% during incidents: escalate and migrate; tuning inside the VM is mostly theater.

FAQ

1) What’s a “bad” CPU steal percentage?

For batch workloads, a few percent might be tolerable. For latency services, sustained >2–5% is a real problem. Spikes >10% that line up with p99 spikes are basically a confession.

2) Can steal time be zero and I’m still suffering virtualization overhead?

Yes. Overhead can come from interrupt costs, VM exits, NUMA effects, and device emulation. That shows up as higher CPU cost per request and worse tail latency without a big steal signature.

3) How does Linux steal relate to VMware CPU Ready?

They are conceptually similar: “runnable but not scheduled.” CPU Ready is measured on the host and is often the most authoritative in VMware environments. Guest steal is still valuable for first-response diagnosis.

4) Why does load average rise when CPU inside the VM looks idle?

Load includes runnable tasks. If your vCPU isn’t scheduled (steal), tasks remain runnable longer, inflating load average even though the guest isn’t consuming user/system CPU.

5) Is CPU steal always a noisy neighbor?

Often, yes. But it can also be host maintenance, live migration, host-level CPU frequency changes under thermal constraints, or oversubscription decisions made by your own virtualization team.

6) Should I just move to bigger instances to fix steal?

Sometimes it helps, sometimes it backfires. Bigger VMs can be harder to schedule under contention. Prefer moving to a less contended class (dedicated/isolated cores, compute-optimized families) over simply increasing vCPU count.

7) How do I alert on steal correctly?

Alert on sustained steal rate, not single spikes: e.g., steal >2% for 5–10 minutes on nodes that run latency-sensitive workloads. Correlate with p95/p99 latency to avoid crying wolf for batch tiers.

8) Can I fix steal inside the guest by tuning the kernel?

Not really. You can reduce CPU demand (fewer threads, less busy polling), which reduces how often you want CPU, but you can’t force the hypervisor to schedule you more. Steal is a capacity/placement problem.

9) What about CPU frequency scaling and “steal-like” symptoms?

Frequency scaling affects how much work you get per scheduled time slice, but it doesn’t show up as steal. If you see low steal yet reduced throughput, check per-core performance, turbo behavior, and CPU model differences across hosts.

10) Does pinning vCPUs fix everything?

Pinning can reduce jitter for specific workloads, but it reduces flexibility and can lower overall utilization. It’s a tool for critical services with known CPU needs, not a universal fix for messy capacity planning.

Conclusion: next steps that actually move the needle

When Ubuntu 24.04 is running in a VM, CPU “usage” is only half the story. Steal time is the missing chapter: the time your workload wanted to run but wasn’t allowed to. If you don’t measure it, you will misdiagnose incidents, waste time tuning the wrong subsystem, and accidentally make performance worse with well-meant scaling.

Do these next, in this order:

  1. Add steal to your standard dashboards (and verify it matches /proc/stat).
  2. Set an alert for sustained steal on latency-critical nodes.
  3. During the next incident, capture mpstat, vmstat, iostat, and cgroup cpu.stat for the exact window.
  4. If steal is confirmed, stop tuning inside the VM and change placement/capacity: migrate, reduce overcommit, or buy dedicated scheduling.
  5. Right-size vCPU counts: fewer, faster cores often beat more, slower, contended vCPUs for tail latency.

The best part: once you can prove steal, you can have the right argument with the right team. And you can stop interrogating the database for crimes committed by the hypervisor.
