The alert says “latency up.” The graphs say “CPU idle.” Your gut says “storage is slow.” And the server—quietly—says nothing.
Then someone walks by the rack and notices the fans sound like a leaf blower auditioning for a metal band.
Thermal issues are sneaky because they impersonate everything else: noisy neighbors, bad kernels, tired disks, flaky NICs, “just Kubernetes.”
This piece is about proof. Not vibes. You’ll use built-in tools to show whether heat is the bottleneck, where it’s happening, and what to do next.
What “thermal throttling” looks like in production
Thermal throttling is the system protecting itself by reducing performance: lowering CPU frequency, clamping power, slowing a GPU,
reducing NVMe throughput, or in extreme cases forcing a shutdown. It’s not a bug. It’s a safety feature. Your problem is that it’s often
invisible at the layer you’re looking at.
The tricky part: the system doesn’t announce “I am throttling now” in one universal place. It leaks hints across:
CPU frequency stats, kernel logs, hardware monitoring sensors, SMART data, and performance counters.
You’ll pull those hints together into a case you can defend.
Good operators treat “thermal” as a first-class failure mode, right alongside packet loss and disk errors.
Not because it’s common, but because when it hits, it creates expensive confusion.
One quote, because it’s the entire job: “Hope is not a strategy.” — General Gordon R. Sullivan
Joke #1: Fans don’t “spin up for no reason.” They spin up because the server is having feelings about your workload.
Interesting facts and a little history (because it explains today’s weirdness)
- Thermal throttling predates “modern” servers. Intel’s early Pentium 4 era introduced aggressive thermal management after the “clock speed race” made heat a first-order limit.
- DVFS (dynamic voltage and frequency scaling) became mainstream in the 2000s as laptops forced CPUs to survive tiny cooling systems; servers inherited the mechanisms.
- Turbo Boost changed expectations. Since turbo frequencies are opportunistic, “base clock” is not the performance you actually get—until thermals push you back down.
- Power limits became policy. RAPL-style package power controls made “throttling” as much about datacenter power budgets as about temperature.
- NVMe introduced its own thermal states. Many drives have multiple temperature sensors and separate “warning” vs “critical” thresholds, plus internal throttling curves.
- Hot aisle / cold aisle is not a suggestion. The concept emerged as rack densities increased; mixing air streams turns your datacenter into a very expensive convection experiment.
- Fan control moved into firmware. Modern BMCs (IPMI/Redfish) often override OS-level fan control, which is why your OS “set fan speed” trick doesn’t work.
- Virtualization obscures thermal symptoms. Guests see a virtual CPU and don’t directly see host thermal events, so “the VM is slow” becomes a detective story.
Fast diagnosis playbook (first/second/third)
First: confirm whether performance is clamped
- Check current CPU frequency behavior (is it stuck low across cores?).
- Check for kernel thermal/power messages (thermal zone trips, throttling warnings).
- Check “package temperature vs max” (are you near TjMax or drive warning thresholds?).
Second: identify which component is the victim
- CPU-bound latency spike? Look at CPU frequency caps, thermal throttle counters, and power limits.
- Storage latency spike? Check NVMe SMART temperature and throttling logs; correlate with I/O latency.
- Network “mystery drops”? Check NIC temperature if exposed; also check host CPU throttling (softirq handling slows down).
Third: prove causality with correlation
- Correlate temperature with frequency and latency over time.
- Change one variable safely: open the door, increase fan policy, reduce power limit, move workload, or cap turbo. Watch the effect.
- If temperature changes and performance changes in lockstep, you have your proof.
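If you log samples of temperature against effective frequency, even a crude correlation coefficient makes the case concrete. A minimal sketch in awk; the sample pairs below are illustrative, not from a real host:

```shell
# correlate: Pearson r between "temp mhz" pairs on stdin. r near -1 means
# frequency falls as temperature rises: the throttling signature.
correlate() {
  awk '{ n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
       END {
         r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
         printf "r = %.2f\n", r
       }'
}

# Illustrative samples: package temp (C) vs effective MHz.
printf '70 3100\n80 2600\n90 1800\n95 1300\n' | correlate
# prints: r = -0.99
```

A strong negative r is not causality by itself, but combined with the controlled change in the step above it is hard to argue with.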
Stop doing this
Don’t start by replacing hardware. Don’t start by blaming the kernel. Don’t start by “tuning” the app.
First prove whether the machine is being physically constrained.
Practical tasks: commands, outputs, and decisions (12+)
These are deliberately boring. Boring is good. Boring is repeatable. Each task includes: a command, what typical output means,
and the decision you make.
Task 1: Read CPU frequency live (per core) and spot a clamp
cr0x@server:~$ grep -H . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 2>/dev/null | head
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:1200000
What it means: If all cores are pinned at the same low frequency (here ~1.2GHz) during a period when you expect load, you might be power- or thermally limited.
If you see normal variation (cores bouncing up/down) that’s usually fine.
Decision: If it’s pinned low, immediately check temperatures and throttle counters before you touch the application.
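That check can be mechanized. A minimal sketch, assuming the standard sysfs layout; the 2,000,000 kHz "suspicious" threshold is an assumption you should set per CPU model:

```shell
# detect_clamp: read per-core frequencies (kHz, one per line) on stdin and
# flag the "every core pinned at the same low value" pattern from Task 1.
# The 2000000 kHz (2GHz) threshold below is an assumption; tune per SKU.
detect_clamp() {
  sort -n | awk '
    NF {
      if (min == "" || $1 + 0 < min + 0) min = $1
      if (!($1 in seen)) { seen[$1] = 1; distinct++ }
    }
    END {
      if (distinct == 1 && min + 0 < 2000000)
        print "possible clamp: all cores at " min " kHz"
      else
        print "frequencies vary (" distinct " distinct values)"
    }'
}

# Live use:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | detect_clamp
# Demo with the values from the output above:
printf '1200000\n1200000\n1200000\n1200000\n1200000\n1200000\n' | detect_clamp
```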
Task 2: Check min/max frequency policy (are you capping yourself?)
cr0x@server:~$ for f in /sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq; do echo "$f: $(cat $f)"; done
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: 1800000
What it means: The OS policy says “don’t exceed 1.8GHz.” That might be intentional (power saving) or accidental (bad config, wrong governor, cloud policy).
Decision: If scaling_max_freq is below expected, fix policy first. Thermal may still exist, but don’t confuse “admin cap” with “hardware cap.”
Task 3: Inspect CPU governor/driver (who’s in charge?)
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate
What it means: intel_pstate behaves differently than acpi-cpufreq. Some knobs and expectations change. On AMD you’ll see amd-pstate or similar.
Decision: Use driver-appropriate tooling (e.g., powercap for Intel package limits) and don’t cargo-cult sysfs tweaks from a different platform.
Task 4: Pull kernel messages for thermal events
cr0x@server:~$ dmesg -T | egrep -i 'thermal|thrott|overheat|trip|powercap' | tail -n 12
[Mon Feb 5 10:14:21 2026] CPU0: Core temperature above threshold, cpu clock throttled (total events = 17)
[Mon Feb 5 10:14:21 2026] CPU2: Core temperature above threshold, cpu clock throttled (total events = 17)
[Mon Feb 5 10:14:22 2026] CPU0: Package temperature/speed normal
[Mon Feb 5 10:17:03 2026] nvme nvme0: temperature above threshold, throttling
What it means: This is your smoking gun when it appears. Not all systems log cleanly, but when they do, it’s hard to argue with.
Decision: If you see repeated throttle events, stop debating and start mitigating: airflow, fan policy, power limits, workload placement.
Task 5: Read thermal zones via sysfs (works even without fancy tools)
cr0x@server:~$ for z in /sys/class/thermal/thermal_zone*; do echo -n "$(cat $z/type) "; awk '{print $1/1000 "°C"}' $z/temp; done | head
x86_pkg_temp 92.0°C
acpitz 54.0°C
What it means: x86_pkg_temp is the CPU package temp. 92°C is “near the cliff” on many parts (depends on TjMax).
acpitz is often a motherboard/ACPI zone, less directly tied to throttling.
Decision: If package temp is high and performance is low, treat it as thermal until proven otherwise.
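Eyeballing millidegrees is error-prone, so here is a small formatter that flags hot zones. The 85°C worry line is an assumption, since the real cliff depends on the part's TjMax:

```shell
# flag_zone: format "type millidegrees" pairs and mark zones at or above a
# worry line. 85C is an assumption; adjust to the platform's real limits.
flag_zone() {
  awk '{ c = $2 / 1000
         printf "%s %.1fC%s\n", $1, c, (c >= 85 ? "  <-- investigate" : "") }'
}

# Live use:
#   for z in /sys/class/thermal/thermal_zone*; do
#     echo "$(cat $z/type) $(cat $z/temp)"
#   done | flag_zone
# Demo with the readings from the output above:
printf 'x86_pkg_temp 92000\nacpitz 54000\n' | flag_zone
```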
Task 6: Get sensor detail (fans + temps) with lm-sensors
cr0x@server:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +94.0°C (high = +80.0°C, crit = +100.0°C)
Core 0: +92.0°C (high = +80.0°C, crit = +100.0°C)
Core 1: +93.0°C (high = +80.0°C, crit = +100.0°C)
nct6779-isa-0a20
Adapter: ISA adapter
fan1: 3200 RPM
fan2: 3100 RPM
SYSTIN: +44.0°C
CPUTIN: +73.0°C
What it means: This shows thresholds and current values. Note the high vs crit. “High” is often where throttling starts.
Fans at 3200 RPM might still be “not enough” depending on chassis and obstructions.
Decision: If fans are high but temps still climb, suspect airflow path, clogged filters, wrong blanks, or a mis-seated heatsink.
Task 7: Check for power capping (Intel RAPL via powercap)
cr0x@server:~$ grep -H . /sys/class/powercap/intel-rapl:*/constraint_*_power_limit_uw 2>/dev/null | head
/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw:65000000
/sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw:85000000
What it means: Power limits (here 65W/85W) can enforce throttling even when temps are fine. That’s “power throttling,” not strictly thermal.
Decision: If frequency is low but temps are moderate, check for power caps and datacenter policies before blaming cooling.
Task 8: Observe CPU throttling counters (Intel MSR via turbostat)
cr0x@server:~$ sudo turbostat --Summary --quiet --interval 2 --num_iterations 3
Avg_MHz Busy% Bzy_MHz TSC_MHz PkgTmp PkgWatt CorWatt
1280 72.15 1773 2600 96 62.4 38.1
1275 73.02 1740 2600 97 62.1 38.5
1269 71.88 1729 2600 97 61.9 37.9
What it means: Busy% is high (the work is there), but Bzy_MHz (~1.7GHz, the clock while busy) sits well below the 2.6GHz nominal reported as TSC_MHz, and Avg_MHz follows it down.
PkgTmp at 96-97°C: you’re likely in thermal management territory.
Decision: If PkgTmp rises and Avg_MHz falls under load, move to cooling/power remediation, not application profiling.
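That judgment reduces to one ratio per sample. A sketch that parses the summary table above; the column positions assume this exact turbostat invocation, and other flags reorder columns:

```shell
# turbostat_ratio: reduce each turbostat summary row to one number, the busy
# clock over nominal. A ratio well below 1.0 while Busy% is high is the clamp
# signature. Column positions are an assumption tied to the invocation above.
turbostat_ratio() {
  awk 'NR > 1 { printf "bzy/tsc = %.2f (pkg %sC)\n", $3 / $4, $5 }'
}

# Demo with two rows from the run above:
turbostat_ratio <<'EOF'
Avg_MHz Busy% Bzy_MHz TSC_MHz PkgTmp PkgWatt CorWatt
1280 72.15 1773 2600 96 62.4 38.1
1275 73.02 1740 2600 97 62.1 38.5
EOF
```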
Task 9: Use perf to see if cycles and frequency match your expectations
cr0x@server:~$ sudo perf stat -a -e cycles,instructions,msr/aperf/,msr/mperf/ sleep 5
Performance counter stats for 'system wide':
12,345,678,901 cycles
7,654,321,098 instructions # 0.62 insn per cycle
9,120,000,000 msr/aperf/
19,500,000,000 msr/mperf/
5.001234567 seconds time elapsed
What it means: The ratio aperf/mperf approximates average frequency relative to nominal. Here it’s ~0.47, meaning the CPU averaged ~47% of nominal.
That’s consistent with throttling or heavy downclocking.
Decision: If aperf/mperf drops when temps rise, you’ve got a quantitative “frequency clamp” proof for your incident report.
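The arithmetic behind that ~47% figure, using the counter values from the run above:

```shell
# Average frequency relative to nominal is simply aperf/mperf.
# The two counts below are copied from the perf stat output above.
awk -v aperf=9120000000 -v mperf=19500000000 \
  'BEGIN { printf "avg freq = %.0f%% of nominal\n", 100 * aperf / mperf }'
# prints: avg freq = 47% of nominal
```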
Task 10: Check NVMe temperature and warnings using SMART
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep -i 'Temperature:|Warning|Critical|Thermal|throttle' -n
31:Temperature: 78 Celsius
32:Temperature Sensor 1: 81 Celsius
33:Temperature Sensor 2: 76 Celsius
54:Warning Comp. Temperature Time: 12
55:Critical Comp. Temperature Time: 0
What it means: Drives track time spent above warning/critical thresholds. “Warning time” being non-zero means this isn’t theoretical.
Different sensors may represent controller vs NAND vs composite.
Decision: If warning time increments during your latency incident, treat storage as thermally constrained even if CPU looks fine.
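Because the warning-time counter is cumulative, the delta against a stored baseline is what matters. A sketch; the baseline value of 5 minutes is hypothetical, and you should record the real value per drive at provisioning:

```shell
# warn_delta: compare the cumulative "Warning Comp. Temperature Time" (minutes)
# against a stored baseline. baseline=5 is a hypothetical placeholder.
warn_delta() {
  awk -v baseline=5 -F': *' '/Warning Comp. Temperature Time/ {
    if ($2 + 0 > baseline) print "warning time grew: " $2 - baseline " minutes over baseline"
    else print "stable"
  }'
}

# Live use: sudo smartctl -a /dev/nvme0n1 | warn_delta
# Demo with the line from the output above:
printf 'Warning Comp. Temperature Time: 12\n' | warn_delta
```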
Task 11: Correlate storage latency with temperature (iostat + smartctl sampling)
cr0x@server:~$ iostat -x 1 5 nvme0n1
Linux 6.5.0 (server) 02/05/2026 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 4.22 8.80 0.00 74.88
Device r/s w/s rkB/s wkB/s aqu-sz await %util
nvme0n1 120.0 340.0 5120.0 18240.0 9.2 18.5 98.0
What it means: await at ~18ms and %util near 100% on NVMe under moderate throughput can be suspicious.
It might be queueing, firmware behavior, or throttling. It’s a clue, not a verdict.
Decision: If iostat latency rises at the same time SMART temperatures cross warning thresholds, prioritize airflow to the drive bays and check heatsinks/ducting.
Task 12: Watch system logs for thermal and PCIe link events
cr0x@server:~$ journalctl -k --since "30 min ago" | egrep -i 'thermal|thrott|nvme|pcie|AER|overheat' | tail -n 30
Feb 05 10:14:21 server kernel: CPU0: Core temperature above threshold, cpu clock throttled
Feb 05 10:17:03 server kernel: nvme nvme0: temperature above threshold, throttling
Feb 05 10:17:04 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
What it means: PCIe AER corrected errors sometimes show up when things are hot (not always—bad slots and bad cards exist).
Heat can aggravate marginal signal integrity.
Decision: If thermal + AER co-occur, treat it as a hardware environment problem: reseat, check airflow, check for missing slot blanks, validate card cooling.
Task 13: Check BMC/IPMI sensor view (fans/temps from the hardware side)
cr0x@server:~$ sudo ipmitool sdr type temperature | head
CPU1 Temp | 94 degrees C | ok
CPU2 Temp | 93 degrees C | ok
Inlet Temp | 29 degrees C | ok
Exhaust Temp | 57 degrees C | ok
What it means: “ok” doesn’t mean “healthy.” It means “not past the vendor’s alarm threshold.”
Inlet vs exhaust delta tells you if the box is moving heat out or just cooking internally.
Decision: If inlet is reasonable but CPU temp is high, suspect internal airflow path or heatsink contact. If inlet is high, the room is the problem.
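The inlet/exhaust comparison is easy to script. A sketch against the output format above; sensor names ("Inlet Temp", "Exhaust Temp") vary by vendor, so adjust the patterns:

```shell
# delta_t: exhaust minus inlet from "ipmitool sdr type temperature" output.
# The sensor names matched below are assumptions; vendors differ.
delta_t() {
  awk -F'|' '/Inlet Temp/ { in_t = $2 + 0 }
             /Exhaust Temp/ { ex_t = $2 + 0 }
             END { printf "delta-T = %dC\n", ex_t - in_t }'
}

# Demo with the readings from the output above:
delta_t <<'EOF'
Inlet Temp       | 29 degrees C | ok
Exhaust Temp     | 57 degrees C | ok
EOF
```

A near-zero delta on a heavily loaded box suggests air is bypassing the hot components instead of carrying heat out.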
Task 14: Confirm fan RPM and whether fan control is stuck
cr0x@server:~$ sudo ipmitool sdr type fan | head
FAN1 | 3200 RPM | ok
FAN2 | 3100 RPM | ok
FAN3 | 800 RPM | ok
FAN4 | 0 RPM | nr
What it means: A fan at 0 RPM with a non-ok status needs attention. In ipmitool output, “nr” usually marks a non-recoverable threshold crossing, while an absent fan typically shows “ns” (no reading); either way, 0 RPM on a populated bay is a dead fan until proven otherwise.
Mixed RPMs can indicate a fan wall with one failed unit or a zone control issue.
Decision: If a required fan is missing or dead, don’t “monitor it.” Replace it. Thermal problems are not a good place to be philosophical.
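A sketch that surfaces exactly those two red flags, zero RPM and a non-ok status, from ipmitool-style output; the pipe-separated layout is an assumption based on common ipmitool formatting:

```shell
# dead_fans: flag fans reading 0 RPM or carrying a non-ok status in
# "ipmitool sdr type fan" style output (pipe-separated columns assumed).
dead_fans() {
  awk -F'|' 'NF >= 3 {
    name = $1; status = $3
    gsub(/ /, "", name); gsub(/ /, "", status)
    rpm = $2 + 0
    if (rpm == 0 || status != "ok")
      print "check " name " (rpm=" rpm ", status=" status ")"
  }'
}

# Demo with the readings from the output above:
dead_fans <<'EOF'
FAN1 | 3200 RPM | ok
FAN3 | 800 RPM | ok
FAN4 | 0 RPM | nr
EOF
```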
CPU: frequency caps, package limits, and the lies your dashboard tells
Throttling isn’t always “temperature too high”
People say “thermal throttling” as shorthand for “the CPU got hot.” In practice, performance can be clamped by:
temperature, package power limits, current limits, VRM temperature, or firmware policies (especially on servers with strict BMC control).
The symptom is the same: the CPU doesn’t run as fast as you paid for.
Your dashboards often show CPU utilization, not CPU capability.
A CPU at 70% busy at 1.2GHz is not the same machine as a CPU at 70% busy at 3.2GHz.
Utilization without frequency is a half-truth that causes bad decisions.
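One way to put frequency back into the picture is an "effective utilization" figure: busy percentage scaled by current over nominal frequency. A sketch; "nominal" is fuzzy on turbo parts, so treat this as a comparison metric between nodes or time windows, not an absolute truth:

```shell
# effective_util: busy percent scaled by current over nominal frequency.
# Arguments: busy_pct cur_mhz nominal_mhz. A rough comparison metric only.
effective_util() {
  awk -v busy="$1" -v cur="$2" -v max="$3" \
    'BEGIN { printf "%.0f%% of nominal capacity\n", busy * cur / max }'
}

# The example from the paragraph above: 70% busy at 1.2GHz on a 3.2GHz part.
effective_util 70 1200 3200
# prints: 26% of nominal capacity
```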
Proving it’s thermal: the minimum viable chain of evidence
A credible proof set looks like:
- CPU package temp near its limit (or increasing toward it).
- Frequency (or aperf/mperf ratio) falling as temperature rises.
- Kernel/BMC logs showing thermal events, or SMART thermal warning time increments.
- Performance symptom aligning in time (latency spike, throughput drop, tail latency widening).
If you have only one of these, you have suspicion. If you have three, you have a case.
When it’s not thermal: the “looks like throttling” imposters
- CPU quota / cgroup limits: containers can be capped and appear “stuck” at low effective performance with no temperature issue.
- Frequency policy caps: conservative governors, BIOS settings, or cloud provider policies.
- Interrupt storms: high softirq can create latency while “CPU idle” looks high due to measurement artifacts or steal time.
- NUMA misplacement: memory latency spikes that look like “CPU slowed down.”
One move that’s usually right
If you suspect CPU throttling, collect evidence first, then temporarily reduce turbo (or cap max frequency) and see if the system becomes stable.
That sounds backward—why reduce performance?—but stability matters more than peak. If a small cap prevents the CPU from bouncing off thermal limits,
you often get better sustained throughput.
NVMe & disks: when storage “slowness” is actually temperature
NVMe throttling is real, and it’s not subtle
NVMe drives can throttle hard when hot. Controllers protect themselves first, and your application gets to experience it as “random I/O latency.”
In mixed workloads, you’ll see tail latency explode while average throughput looks only mildly worse. That’s the nightmare profile for databases.
SMART often provides the most defensible proof: current composite temperature, per-sensor temps, and the cumulative time spent above warning/critical thresholds.
If warning time increments during the incident window, you can stop arguing and start fixing airflow to the drive area.
What to check besides temperature
- Heatsinks and airflow ducting: Some chassis assume a specific airflow path; removing blanks or swapping risers can break it.
- Drive placement: A “hot” drive next to a hot NIC or GPU can run warmer than its neighbors.
- PCIe link width/speed changes: Rare, but thermal/mechanical issues can cause link retraining; AER messages help.
- Firmware behavior: Drives have different thermal curves; two “same capacity” drives can throttle very differently.
Joke #2: NVMe drives are fast until they discover thermodynamics, at which point they become an excellent lesson in humility.
Fans, airflow, and chassis reality: the physics part
Fans can be “working” and still be failing you
“Fan RPM looks fine” is not a victory condition. RPM is just rotation. It doesn’t guarantee airflow volume, pressure, or correct direction through the right path.
A server is a ducted system. Missing blanks, poorly routed cables, aftermarket heatsinks, and even a slightly unseated air shroud can turn a designed airflow path
into a recirculation party.
Inlet temperature is the hidden KPI
Most operators obsess over CPU temp and ignore inlet temp. Inlet is what your cooling system is actually delivering.
If inlet is high, your server can be perfectly assembled and still throttle.
If inlet is low but CPU is high, your server is badly assembled (or your heatsink contact/paste is wrong, or the fan wall is weak).
Hot aisle/cold aisle mistakes show up as “random throttling”
When cold air mixes with hot exhaust, you don’t get “slightly worse cooling.” You get unstable conditions:
bursts of throttle when workload spikes, then recovery, then throttle again. It looks like jitter. It is jitter.
And it tends to show up first in the most thermally dense nodes: storage-heavy boxes, compute nodes with high turbo, or anything with accelerators.
Firmware fan control can sabotage your OS assumptions
On many servers, the BMC owns fan control. The OS can read temps but can’t always command fans meaningfully.
If your team tries to “set PWM from Linux” on a server platform managed by BMC, you’ll waste hours.
The right approach is to check BMC sensor readings and fan policies, then change the fan mode using vendor-supported interfaces.
Three corporate mini-stories (anonymized, plausible, and painfully educational)
Mini-story #1: The incident caused by a wrong assumption
A mid-size company ran a payments API on a cluster of “identical” servers. A new release went out, and p99 latency doubled in one AZ.
The on-call did what many of us have done: blamed the release. Rollback started immediately.
Rollback didn’t fix it. CPU utilization looked normal. Memory looked normal. Error rates didn’t spike. The only obvious symptom was that
one rack had a weird pattern: throughput fell off a cliff every afternoon and returned late evening. Someone suggested “noisy neighbor”
in the building HVAC system. Another suggested a bad NIC firmware. A third suggested “it’s the kernel scheduler again.”
The wrong assumption was subtle: the team assumed that if the CPU was throttling, they’d see it on their standard dashboards.
They didn’t collect frequency or thermal data. Utilization alone looked healthy, because the CPU was happily busy—at a reduced clock.
When they finally sampled /sys/class/thermal and ran turbostat, the story snapped into focus:
package temp sat near the high threshold and the average MHz was far below nominal during the incident window.
BMC inlet temp was also high, but still “ok” by vendor thresholds. The rack’s cold aisle tiles had been rearranged during unrelated work,
and the rack was ingesting warmer air.
The fix was not heroic: restore airflow, add blanking panels, and adjust the tile placement. The release was innocent.
The lesson stuck: dashboards that ignore frequency and thermals create false confidence.
Mini-story #2: The optimization that backfired
Another company ran analytics workloads that were “batchy” and loved turbo frequencies. Someone had a bright idea:
crank the performance governor, enable aggressive turbo, and lift some power limits “because the CPUs can handle it.”
For a week, everyone celebrated faster jobs.
Then the monthly peak hit. Sustained load pushed the CPUs into a thermal oscillation: boost hard, hit thermal limit, throttle, cool slightly,
boost again. The average throughput dropped compared to the prior conservative settings, and tail latencies became chaotic.
Worse, NVMe drives in the same chassis ran hotter because the fan curve was now tuned for CPU peaks, not for sustained chassis heat.
The on-call saw the symptoms as “random I/O stalls.” They tuned the storage stack: scheduler changes, queue depths, filesystem tweaks.
Some changes made the graphs look different but didn’t make the system stable. The optimization had turned thermal behavior into a feedback loop,
and the storage team got blamed because storage metrics screamed the loudest.
The fix was to stop chasing peak and design for sustained: cap maximum CPU frequency slightly below the thermal cliff, set a more aggressive fan policy
for sustained loads, and add monitoring for NVMe warning temperature time. Jobs got predictably fast again. Not as “spiky fast,” but reliably fast.
The lesson: boosting without thermal headroom is like overclocking with a suit on. It looks impressive until you try to breathe.
Mini-story #3: The boring but correct practice that saved the day
A healthcare SaaS provider had a ritual: during new rack turn-up, they logged inlet/exhaust temps, fan baselines, and SMART temperature baselines
under a controlled load test. They also kept a boring spreadsheet of “normal” values per chassis model.
Nobody loved it. It felt like busywork.
Six months later, a database cluster started showing elevated write latency on two nodes only. Not enough to page constantly—just enough to be ominous.
The team compared current NVMe warning temperature time against the baseline and noticed it was accumulating on those two nodes only.
CPU temps were fine. Inlet temps were fine. Yet the drives were hot.
Because they had baselines, they didn’t argue about what “hot” meant. They opened the chassis and found a missing air shroud component
after a prior maintenance event. Airflow was bypassing the drive area. Drives were slowly cooking.
They replaced the shroud, re-ran the controlled load, and the SMART warning time stopped accumulating. Latency normalized.
No firmware upgrade. No filesystem tuning. No late-night war room.
The lesson: boring baselines turn “mystery performance” into “replace the missing plastic and go to bed.”
Common mistakes: symptom → root cause → fix
1) Symptom: CPU is “idle” but requests are slow
Root cause: CPU is throttled; utilization is measured relative to current capacity, not expected capacity. Or the workload is latency-sensitive and waiting on a throttled core.
Fix: Add frequency/aperf-mperf monitoring; confirm with turbostat. Fix cooling or power caps. Stop using utilization alone to declare victory.
2) Symptom: Periodic performance cliffs at the same time every day
Root cause: Ambient/inlet temperature swings (HVAC schedules, hot aisle recirculation, sun exposure near exterior wall, adjacent equipment turning on).
Fix: Check BMC inlet temp trends; correlate with CPU/NVMe temps. Fix airflow management, tile placement, containment, or CRAC settings.
3) Symptom: NVMe latency spikes during heavy writes; throughput looks “fine-ish”
Root cause: Drive controller overheating and entering thermal throttle; tail latency gets wrecked first.
Fix: Use smartctl warning time and temperature sensors; improve drive airflow/heatsinks; consider relocating drives away from hot components.
4) Symptom: Fans are loud constantly, even at low load
Root cause: BMC fan policy set to a high static mode, a missing sensor reading, or a “failsafe full speed” triggered by a bad/absent temperature sensor.
Fix: Check IPMI SDR for “nr” sensors; verify BMC settings. Replace faulty sensors/fans; restore correct fan mode.
5) Symptom: After a kernel update, the system “started throttling”
Root cause: Changed CPU frequency driver/governor defaults, or new thermal policy interacts with firmware differently. Sometimes it just exposed existing marginal cooling.
Fix: Confirm driver (intel_pstate/acpi-cpufreq/amd-pstate) and max freq policy; compare temps and throttle counters pre/post. Adjust BIOS/OS policy deliberately.
6) Symptom: Only one node in a cluster is slow
Root cause: Local airflow obstruction (cable bundle, missing blank, dust filter), failing fan, poorly seated heatsink, or a hotter slot/device layout.
Fix: Compare BMC inlet/exhaust, fan RPM, and component temps against sibling nodes. Physically inspect and restore chassis airflow integrity.
7) Symptom: iowait rises, but disks show no errors
Root cause: Thermal throttling on NVMe or controller; or CPU throttling slows the I/O submission/completion path.
Fix: Correlate iostat await with NVMe temperature/warning time and CPU frequency. Fix the thermal bottleneck; don’t “tune iowait.”
8) Symptom: Performance is worse after “making fans quieter”
Root cause: Aggressive acoustics setting reduced cooling headroom; system now hits thermal limits under sustained load.
Fix: Set a performance/thermal fan profile; validate under sustained stress. Quiet is nice, but SLOs are nicer.
Checklists / step-by-step plan
Step-by-step: prove or disprove thermal throttling in 20 minutes
- Snapshot the symptom window. Note time, workload, and which nodes are affected. If it’s only one node, prioritize physical causes.
- Record CPU frequency now. Read scaling_cur_freq and policy (scaling_max_freq).
- Record CPU package temp now. Use /sys/class/thermal and sensors. If package temp is high, you’re already close.
- Check logs for thermal evidence. dmesg and journalctl -k for thermal/throttling/NVMe messages.
- Quantify with counters. Use turbostat and/or perf stat for the aperf/mperf ratio.
- If storage is involved, check SMART temperatures. Especially NVMe warning time increments.
- Check inlet/exhaust and fan RPM from BMC. IPMI SDR is often more truthful than OS sensors on servers.
- Correlate. If temp↑ and frequency↓ and latency↑ at the same time, call it thermal.
- Mitigate safely. Temporarily reduce turbo or migrate workload; increase fan profile; fix airflow. Then re-test.
- Write it down. Capture command outputs and timestamps. Thermal issues recur unless you institutionalize the fix.
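The "write it down" step is easier if collection is one command. A sketch of a snapshot function; the paths are the standard Linux sysfs locations, and anything missing degrades to "n/a" rather than failing:

```shell
# snapshot: one timestamped evidence line per call, safe to run from cron or
# a watch loop during an incident. Missing sysfs entries print as "n/a".
snapshot() {
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  khz=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq 2>/dev/null)
  pkg=$(awk 'NF { printf "%.1f", $1 / 1000 }' /sys/class/thermal/thermal_zone0/temp 2>/dev/null)
  echo "$ts cpu0_khz=${khz:-n/a} pkg_c=${pkg:-n/a}"
}

snapshot
```

Append the output to a file during the incident window and you have the timestamps your correlation argument needs.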
What to standardize across fleets (boring, scalable, effective)
- Monitor CPU package temperature, CPU effective frequency (or aperf/mperf), and throttle events where available.
- Monitor BMC inlet temperature and fan duty/RPM per zone.
- Monitor NVMe composite temperature and warning/critical temperature time.
- Alert on “sustained frequency clamp” rather than “high temperature” alone. Heat is only a problem when it changes performance or reliability.
- Keep a per-chassis baseline under a standard load test after provisioning and after any maintenance that opens the chassis.
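The "sustained frequency clamp" alert can be prototyped as a stream check over (busy%, MHz) samples. The thresholds (70% busy, 2000MHz) and the 3-sample window below are assumptions to tune per fleet and per CPU SKU:

```shell
# sustained_clamp: alert when "busy and slow" persists. Reads "busy_pct mhz"
# samples on stdin; thresholds and window length are placeholder assumptions.
sustained_clamp() {
  awk -v busy=70 -v mhz=2000 -v need=3 '
    { run = ($1 >= busy && $2 <= mhz) ? run + 1 : 0
      if (run == need) { print "ALERT: sustained clamp"; exit } }
    END { if (run < need) print "ok" }'
}

# Demo: three consecutive busy-but-slow samples trip the alert.
printf '75 1800\n72 1750\n74 1700\n' | sustained_clamp
```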
Red flags that mean “go look at the hardware”
- One node differs materially from its siblings in inlet temp delta or fan RPM.
- SMART warning temperature time climbing on only one drive or one node.
- Thermal messages in kernel logs clustered around a specific time window.
- CPU effective frequency collapsing under moderate load.
FAQ
1) How do I know it’s throttling vs just low utilization?
Measure frequency (or aperf/mperf). If the workload is busy and frequency is pinned low, you have a clamp. Then check temps/logs to classify it as thermal vs power policy.
2) Why does CPU utilization look fine during throttling?
Utilization is “time spent not idle,” not “how fast the CPU ran.” A throttled CPU can be 90% busy and still deliver far less work per second.
3) Can thermal throttling happen without any log messages?
Yes. Some platforms don’t log cleanly, or logs get rate-limited. That’s why you corroborate with temps, frequency, and counters like aperf/mperf.
4) Is power limiting the same as thermal throttling?
Different cause, same symptom: reduced performance. Power caps can trigger frequency reduction even at safe temperatures. Diagnose with powercap/RAPL limits and turbostat power readings.
5) Why do only the NVMe drives throttle while CPU looks okay?
Drive bays can be a separate airflow zone. Also, NVMe controllers are tiny and heat-dense; they can hit limits while the CPU still has headroom.
6) Should I just increase fan speed permanently?
If you’re in a datacenter, “permanently louder” is sometimes acceptable, but it’s not a root-cause fix. Prefer restoring designed airflow (blanks, shrouds, filters) and fixing inlet temperature.
7) My laptop fans are loud but performance is still bad. Is that thermal?
Often yes, but laptops also hit power adapter limits and battery policies. Check thermal zones, then check frequency policy caps. If frequency is capped low even when cool, it’s policy or power.
8) What’s the single best metric to alert on?
“Sustained low effective frequency under sustained load” is a strong indicator. Pair it with package temperature and NVMe warning temperature time for coverage.
9) Why does performance improve if I cap max frequency slightly lower?
Because you avoid oscillation. A stable 2.8GHz can beat a chaotic bounce between 3.6GHz and 1.2GHz once you include cache behavior, queueing, and tail latency.
10) How do I prove this to someone who insists it’s the application?
Bring correlation: timestamps of temp/frequency clamp and latency. Show a controlled mitigation (improved cooling or reduced turbo) that immediately improves performance without code changes.
Conclusion: what to do next week (not someday)
Thermal issues don’t need mysticism. They need evidence and a habit of checking the right layer first.
If you take nothing else: stop diagnosing performance without frequency and temperature in the same view.
Practical next steps:
- Add host-level collection for CPU package temperature, effective frequency (aperf/mperf or turbostat-derived), and kernel thermal messages.
- Add NVMe SMART collection for composite temperature and warning/critical temperature time.
- Baseline inlet/exhaust temps and fan RPM after provisioning and after any chassis-open maintenance.
- Create an on-call runbook that starts with the fast diagnosis playbook: clamp → component → correlation → mitigation.
- When you find a thermal issue, fix airflow design mistakes (blanks, shrouds, filters) before you chase tuning knobs.
Your systems run on software, but they fail in physics. Treat heat like a first-class dependency and it will stop surprising you at 2 a.m.