You can spend six figures on compute, tune your kernel, size your storage tiers, and still lose a node to a problem
that costs exactly $0 to create: a tiny plastic film sticker left on the cooler cold plate.
The failure mode is equal parts slapstick and brutality: the machine boots, runs “fine,” then starts throttling, throwing
thermal alarms, dropping NVMe performance, or rebooting under load. The dashboard screams. Your pager screams louder.
Why a cooler sticker can take down production
The “cooler sticker” is usually a clear protective film on the cold plate of a CPU cooler (air or AIO). It’s there to
keep the metal clean and unscratched on the shelf. It is not there to be a thermal interface. It is, in fact, a
nearly perfect way to convert your expensive cooling solution into a heat-retention experiment.
Heat transfer from CPU to cooler is a chain. Break one link, the rest doesn’t matter:
- Die → IHS, the integrated heat spreader (internal CPU package path, not under your control).
- IHS → thermal paste (your control; thin, uniform, not artistic).
- Thermal paste → cold plate (your control; clean metal contact).
- Cold plate → heatpipes/radiator (cooler design).
- Radiator/fins → airflow (case or chassis, fan curves, obstructions).
A sticker wedges itself into the single most important interface: the cold plate contact. Even if the sticker is
“thin,” it introduces a layer with poor thermal conductivity and often traps microscopic air gaps. Thermal paste
doesn’t fix it; paste on plastic is just paste with a hobby.
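How bad can a “thin” film be? A back-of-envelope estimate makes the point. Assume a 0.1 mm PET film (thermal conductivity
roughly 0.2 W/m·K), a 30 × 30 mm contact patch, and a 200 W package; these numbers are illustrative, not measured, but the
order of magnitude is the story:

# Back-of-envelope: extra temperature rise from a plastic film on the cold plate.
# Assumed (not measured) values: 0.1 mm PET film, k ~ 0.2 W/(m*K),
# 30 mm x 30 mm contact area, 200 W package power.
awk 'BEGIN {
  t = 0.0001          # film thickness, m
  k = 0.2             # PET thermal conductivity, W/(m*K)
  A = 0.03 * 0.03     # contact area, m^2
  P = 200             # package power, W
  R = t / (k * A)     # thermal resistance added by the film, K/W
  printf "film adds %.2f K/W -> about %.0f K extra at %d W\n", R, R * P, P
}'
# Prints: film adds 0.56 K/W -> about 111 K extra at 200 W

Roughly a hundred kelvin of extra rise if all the package power had to cross the film. In reality the film also spoils the
contact itself, so behavior is erratic rather than a clean offset, but either way the cooler never gets a fair chance.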
In a workstation, this becomes “my game stutters.” In a server, it becomes “my node intermittently disappears” or
“latency spikes under peak traffic,” or worse: “storage rebuilds take forever, and then the box resets.”
One dry truth: overheating is rarely a clean failure. It’s messy. It masquerades as flaky RAM, buggy firmware, noisy
neighbors, storage saturation, or “Kubernetes being Kubernetes.”
Joke #1 (short, relevant): The cooler sticker is the only security film that successfully blocks heat instead of hackers.
Fast diagnosis playbook (first/second/third)
You don’t get points for “deep analysis” while the fleet cooks. You get points for restoring service and preventing
repeat incidents. Here’s a fast triage order that finds the bottleneck quickly, with minimal thrash.
First: confirm it’s thermal, not “just slow”
- Look for throttling: CPU frequency drops under load, despite high utilization.
- Check temps: CPU package temp near or at TjMax; GPU/NVMe temps high if relevant.
- Look for thermal events: kernel logs, IPMI SEL, BMC sensor alarms.
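These three checks condense into one pass you can run before touching anything else. A minimal sketch, assuming lm-sensors
is installed and you have permission to read kernel logs:

#!/usr/bin/env bash
# One-pass "is it thermal?" check: frequencies, temperatures, throttle messages.
# Assumes lm-sensors is installed; run with sudo so dmesg is readable on locked-down kernels.
set -uo pipefail

echo "== Core frequencies (kHz) =="
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 2>/dev/null | head -n 8

echo "== Package and core temperatures =="
sensors 2>/dev/null | grep -E 'Package|Core' || echo "(lm-sensors output not available)"

echo "== Recent thermal/throttle kernel messages =="
dmesg -T 2>/dev/null | grep -iE 'thermal|throttl|overheat|critical' | tail -n 5 || true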
Second: determine scope and blast radius
- Single node or many? If many, suspect ambient airflow, fan control policy, firmware rollout, or clogged filters.
- Workload-specific? Only under AVX-heavy compute, sustained compaction, rebuild, or encryption?
- Time-correlated? Same time every day can be batch jobs, but it can also be air handling schedules or rack door habits.
Third: decide the safest mitigation path
- Immediate safety: reduce load, drain workloads, cap power limits, increase fan speed policy.
- Physical check: if a new build or recently serviced unit, assume installation error until disproven.
- Permanent fix: clean/replace fans, remount cooler, remove sticker, reapply paste, update firmware, adjust airflow design.
The fastest path to “is it the sticker?” is brutally simple: if this is a new build, a cooler swap, or a CPU reseat,
and you see immediate overheating under even moderate load, treat the cold plate interface as guilty until proven
innocent.
Interesting facts and a little history
Overheating is old. The sticker incident is newer than most people think, and it’s more common now because DIY and
rapid deployment have changed the ergonomics of hardware work. A few concrete points to ground the story:
- Thermal throttling is not a modern invention. CPUs have had forms of thermal protection for decades; modern chips just make it more dynamic and harder to notice.
- Protective films became more common as coolers became premium products. Mirror-finished cold plates and pre-applied paste ship with films to prevent contamination and scratches.
- Paste is not glue. Its job is to fill microscopic voids; metal-to-metal contact is still the main heat path.
- Fan control shifted from dumb voltage to intelligent curves. PWM control and BMC policies can mask poor contact by “saving” you at idle and failing at sustained load.
- Data centers increasingly run higher inlet temperatures. Efficiency targets drive warmer aisles; the margin for “small” mistakes shrinks.
- NVMe brought new thermal bottlenecks. Modern SSDs throttle on temperature and can look like “storage latency” rather than “heat.”
- AVX and similar instruction sets can change the thermal profile. The same CPU at the same utilization can run dramatically hotter depending on instruction mix.
- Remote management normalized “hands-off” ops. IPMI/BMC tools made it easy to forget that some failures require eyeballs and a screwdriver.
- Thermal pads on VRMs and memory exist for a reason. Misalignment during assembly can overheat power delivery while the CPU looks fine, producing weird resets and machine-check errors.
How it looks in logs, metrics, and user reports
The sticker blunder has a signature: temperatures rocket fast and early. Not “after an hour,” but “within minutes of
any real load.” The fan curve hits max. The CPU still overheats. That mismatch—fans screaming, temps still rising—is
the giveaway.
Common surface-level symptoms
- Random reboots under load (thermal shutdown or VRM protection).
- Sudden latency spikes (CPU throttles; queues grow).
- Inconsistent benchmark results (thermal equilibrium changes run to run).
- “Storage got slow” (NVMe throttling makes IO latency jittery and compaction/rebuild times explode).
- CPU frequency stuck low despite plenty of headroom on paper.
- Kernel messages about thermal zones or “CPU throttled” hints.
- IPMI sensor alarms (CPU Temp, VRM Temp, System Temp) or SEL events.
What’s actually happening
When the CPU hits thermal limits, it doesn’t politely ask. It changes behavior: lowers frequency and voltage,
clamps turbo, sometimes forces power limits, and if temperatures keep climbing, it triggers protective shutdown.
These controls preserve silicon life. They do not preserve your SLO.
In production, throttling is especially nasty because it is non-linear. You can be fine at 55% load, then
hit a knee and fall off a cliff at 70%. Your capacity planning looks right until it isn’t.
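If you want to see that knee instead of arguing about it, step the load on a drained node and log temperature and
frequency at each step. A rough sketch, assuming stress-ng and lm-sensors are installed; stop early if you approach
critical temperatures:

#!/usr/bin/env bash
# Rough "thermal knee" probe: step load from 25% to 100% of cores and record
# package temperature and one core's frequency near steady state for each step.
# Run only on a drained node.
set -uo pipefail

CORES=$(nproc)
for pct in 25 50 75 100; do
  workers=$(( CORES * pct / 100 ))
  if [ "$workers" -lt 1 ]; then workers=1; fi
  echo "== ${pct}% load (${workers} workers) =="
  stress-ng --cpu "$workers" --timeout 60s --quiet &
  sleep 45   # let this step approach steady state before sampling
  sensors | grep 'Package id' || true
  awk '{printf "  cpu0 freq: %.0f MHz\n", $1 / 1000}' \
    /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  wait       # wait for this step's stress-ng run to finish
done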
Practical tasks: commands, outputs, decisions (12+)
These are the checks I actually run. Each task includes: a command, a representative output, what it means, and the
decision you make. You can do most of this without taking the box down—up until the moment you must.
Task 1: Check current CPU frequencies (throttling hint)
cr0x@server:~$ lscpu | egrep 'Model name|CPU\(s\)|Thread|MHz'
Model name: Intel(R) Xeon(R) CPU
CPU(s): 32
Thread(s) per core: 2
CPU MHz: 1197.843
Meaning: A server-class CPU at ~1200 MHz is normal at idle, suspicious under load.
Decision: If users report slowness and this stays low during load, hunt throttling next.
Task 2: Watch frequency under load in real time
cr0x@server:~$ sudo apt-get -y install linux-tools-common linux-tools-generic >/dev/null
cr0x@server:~$ sudo turbostat --Summary --interval 2
turbostat version 2024.01
Summary: 2 sec
Avg_MHz Busy% Bzy_MHz TSC_MHz PkgTmp PkgWatt
1680 92.15 1822 2300 97 182.4
Meaning: Package temp 97°C with busy cores and lower-than-expected Bzy_MHz screams thermal limit.
Decision: Confirm temps via sensors/IPMI and plan immediate mitigation (drain/load shed).
Task 3: Read thermal sensors locally (lm-sensors)
cr0x@server:~$ sudo apt-get -y install lm-sensors >/dev/null
cr0x@server:~$ sudo sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +98.0°C (high = +84.0°C, crit = +100.0°C)
Core 0: +96.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +97.0°C (high = +84.0°C, crit = +100.0°C)
Meaning: You are brushing crit. “High” being exceeded means sustained throttling is likely.
Decision: If this is a new/serviced system, prioritize physical inspection (mounting/paste/sticker).
Task 4: Check kernel for thermal events
cr0x@server:~$ sudo dmesg -T | egrep -i 'thermal|thrott|critical|overheat' | tail -n 8
[Mon Jan 22 10:41:12 2026] CPU0: Package temperature above threshold, cpu clock throttled
[Mon Jan 22 10:41:12 2026] CPU2: Package temperature above threshold, cpu clock throttled
[Mon Jan 22 10:41:15 2026] CPU0: Core temperature/speed normal
Meaning: The kernel is telling you it throttled. This is not “maybe.”
Decision: Treat as an incident cause, not an incidental log line; adjust capacity and fix cooling.
Task 5: Confirm fan RPM and system temps via IPMI
cr0x@server:~$ sudo apt-get -y install ipmitool >/dev/null
cr0x@server:~$ sudo ipmitool sdr type Temperature
CPU Temp | 98 degrees C | critical
System Temp | 36 degrees C | ok
VRM Temp | 92 degrees C | non-critical
Meaning: Inlet/system temp is fine; CPU/VRM are hot. That points away from “room is hot” and toward “contact/airflow inside.”
Decision: If fans are already high, suspect blocked heatsink, bad mount, sticker, or dead pump (AIO).
Task 6: Check fan sensors via IPMI
cr0x@server:~$ sudo ipmitool sdr type Fan
FAN1 | 17600 RPM | ok
FAN2 | 17400 RPM | ok
FAN3 | 17100 RPM | ok
FAN4 | 0 RPM | critical
Meaning: One fan is dead or unplugged. On some chassis, a missing fan ruins the pressure pattern.
Decision: Replace/seat the fan first. If all fans are OK and temps still spike, move to cold-plate interface.
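This check automates nicely into a guardrail. A small sketch that flags any fan sensor reporting 0 RPM; SDR column layout
varies by vendor, so treat the parsing as a starting point:

# Flag fan sensors reading 0 RPM. SDR layouts differ between vendors/BMCs,
# so this scans every pipe-separated field for a zero "RPM" reading.
sudo ipmitool sdr type Fan | awk -F'|' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /RPM/ && $i + 0 == 0)
      print "Fan reading 0 RPM, investigate:", $1
}'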
Task 7: Read IPMI System Event Log for shutdown reasons
cr0x@server:~$ sudo ipmitool sel elist | tail -n 6
1a2 | 01/22/2026 | 10:42:01 | Temperature CPU Temp | Upper Critical going high
1a3 | 01/22/2026 | 10:42:05 | Processor #0 | Thermal Trip
1a4 | 01/22/2026 | 10:42:06 | System Boot Initiated | Initiated by power reset
Meaning: Thermal trip preceded the reset. The BMC saw it.
Decision: Stop blaming the hypervisor. Fix cooling. Preserve logs for postmortem.
Task 8: Verify power draw and power limit behavior (RAPL / turbostat)
cr0x@server:~$ sudo turbostat --Summary --interval 2 | head -n 4
turbostat version 2024.01
Summary: 2 sec
Avg_MHz Busy% Bzy_MHz PkgTmp PkgWatt
1450 88.40 1601 99 205.7
Meaning: High power at high temp, yet MHz is mediocre: cooling can’t move heat away.
Decision: Short-term: apply a power cap to reduce thermal runaway while you schedule hands-on repair.
Task 9: Apply a temporary CPU power cap (safe mitigation)
cr0x@server:~$ sudo grep . /sys/class/powercap/intel-rapl:0/constraint_*_power_limit_uw
/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw:220000000
/sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw:250000000
cr0x@server:~$ echo 160000000 | sudo tee /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
160000000
Meaning: You lowered the sustained package power limit to 160W.
Decision: Use this to keep the system alive long enough to drain workloads. Don’t call this “fixed.”
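One practical detail: record the original limit before you override it, so the revert after the physical repair is a
single command. A minimal sketch against the same sysfs path (the save-file location is just an example, and the cap does
not survive a reboot anyway):

# Save the current sustained (long-term) limit so the revert is trivial.
ZONE=/sys/class/powercap/intel-rapl:0
sudo cat "$ZONE/constraint_0_power_limit_uw" > /var/tmp/rapl-orig.uw
# Cap to 160 W while you drain workloads and schedule the physical fix.
echo 160000000 | sudo tee "$ZONE/constraint_0_power_limit_uw"
# ...later, after the repair and a burn-in, restore the original limit:
sudo tee "$ZONE/constraint_0_power_limit_uw" < /var/tmp/rapl-orig.uw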
Task 10: Stress test briefly to reproduce without frying it
cr0x@server:~$ sudo apt-get -y install stress-ng >/dev/null
cr0x@server:~$ sudo stress-ng --cpu 0 --cpu-method matrixprod --timeout 30s --metrics-brief
stress-ng: info: [21432] dispatching hogs: 32 cpu
stress-ng: info: [21432] successful run completed in 30.01s
stress-ng: info: [21432] cpu: 960.00 bogo ops/s
Meaning: You created a repeatable thermal load. Pair this with sensors/turbostat to see temperature ramp rate.
Decision: If temps spike to high 90s in seconds and fans are maxed, stop and inspect mounting/sticker/pump.
Task 11: Check NVMe thermal throttling (storage “mystery slowness”)
cr0x@server:~$ sudo apt-get -y install nvme-cli >/dev/null
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep 'temperature|warning|critical'
temperature : 78 C
warning_temp_time : 134
critical_comp_time : 0
Meaning: NVMe is running hot and has spent time above its warning threshold. It may already be throttling.
Decision: Add airflow over NVMe, check heatsinks, and stop assuming the disk is “slow.” It might be “hot.”
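On nodes with more than one drive, check every controller, not just nvme0. A small sketch, assuming nvme-cli is installed
as above:

# Loop over all NVMe controllers (not namespaces) and pull the thermal fields from each.
for ctrl in /sys/class/nvme/nvme*; do
  dev=/dev/$(basename "$ctrl")
  echo "== $dev =="
  sudo nvme smart-log "$dev" | grep -Ei 'temperature|warning_temp|critical_comp'
done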
Task 12: Check for PCIe correctable errors (heat can destabilize links)
cr0x@server:~$ sudo dmesg -T | egrep -i 'aer|pcie|corrected|mce' | tail -n 6
[Mon Jan 22 10:39:44 2026] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
[Mon Jan 22 10:39:44 2026] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer
Meaning: Not definitive, but rising corrected errors under heat/load can indicate marginal signal integrity or thermal stress.
Decision: If errors correlate with high temps, fix cooling before replacing hardware.
Task 13: Verify BMC fan mode/policy (mis-set can masquerade as hardware failure)
cr0x@server:~$ sudo ipmitool raw 0x30 0x45 0x00
01
Meaning: Vendor-specific, but often returns current fan mode (e.g., “standard” vs “full”).
Decision: If policy is too quiet for your thermal profile, temporarily switch to higher mode; then revisit airflow properly.
Task 14: Check container orchestration symptoms (node flaps look like software)
cr0x@server:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
worker-07 NotReady <none> 18d v1.29.2
worker-08 Ready <none> 18d v1.29.2
cr0x@server:~$ kubectl describe node worker-07 | egrep -i 'Ready|KubeletNotReady|reboot|pressure' | tail -n 8
Ready False Mon, 22 Jan 2026 10:42:15 +0000 KubeletNotReady runtime network not ready
Ready True Mon, 22 Jan 2026 10:30:02 +0000 KubeletReady kubelet is posting ready status
Meaning: Node flapping. “Runtime network not ready” can be a side effect of abrupt resets.
Decision: Check BMC SEL and temps before you chase CNI ghosts for three hours.
Task 15: Confirm recent maintenance (sticker incidents correlate with “we touched it”)
cr0x@server:~$ last -x | head -n 8
reboot system boot 6.8.0-40-generic Mon Jan 22 10:42 still running
shutdown system down 6.8.0-40-generic Mon Jan 22 10:41 - 10:42 (00:00)
reboot system boot 6.8.0-40-generic Mon Jan 22 10:12 - 10:41 (00:29)
Meaning: Multiple reboots in a short window: either testing, instability, or thermal shutdown loops.
Decision: Cross-check with change tickets. If a cooler/CPU was reseated recently, prioritize physical inspection.
Task 16: If you must open the box, document and verify contact pattern
cr0x@server:~$ sudo mkdir -p /var/tmp/thermal-incident && date | sudo tee /var/tmp/thermal-incident/notes.txt
Mon Jan 22 10:55:03 UTC 2026
Meaning: You’re creating an incident artifact folder before you handle components.
Decision: Take photos of the cold plate, sticker presence, and paste spread. This is gold for postmortems and training.
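Before hands touch hardware, grab a machine-readable snapshot too. A sketch that collects the evidence from the earlier
tasks into the incident folder; it assumes lm-sensors, ipmitool, and nvme-cli are installed and is meant to be run with
sudo:

#!/usr/bin/env bash
# Capture a thermal-incident snapshot before anyone opens the chassis.
# Run the whole script with sudo so the sensor/BMC reads and file writes succeed.
set -uo pipefail

DIR=/var/tmp/thermal-incident/$(date +%Y%m%dT%H%M%S)
mkdir -p "$DIR"

date -Is                                       > "$DIR/timestamp.txt"
sensors                                        > "$DIR/sensors.txt"       2>&1 || true
dmesg -T | grep -iE 'thermal|throttl|overheat' > "$DIR/dmesg-thermal.txt" 2>&1 || true
ipmitool sdr type Temperature                  > "$DIR/ipmi-temps.txt"    2>&1 || true
ipmitool sdr type Fan                          > "$DIR/ipmi-fans.txt"     2>&1 || true
ipmitool sel elist | tail -n 50                > "$DIR/ipmi-sel.txt"      2>&1 || true
nvme smart-log /dev/nvme0                      > "$DIR/nvme0-smart.txt"   2>&1 || true

echo "Snapshot written to $DIR"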
Joke #2 (short, relevant): If your CPU hits 100°C with fans at full blast, it’s not “turbo mode”—it’s “please remove the plastic.”
Three corporate mini-stories from the overheating trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size company rolled out a new batch of compute nodes for CI and ephemeral environments. Hardware came
pre-racked by a vendor, imaged in-house, then joined to a cluster. For two days, everything looked fine—because the
workloads were spiky and short.
On day three, they enabled a new test suite that ran sustained CPU-heavy compilation. Suddenly, jobs started timing
out. Engineers saw it as “queue contention” and scaled the runner pool. That made it worse: more nodes hit sustained
load, more failures piled up, and the scheduler looked like it was haunted.
The wrong assumption was subtle: “If it boots and passes a 10-minute smoke test, thermals are fine.” It wasn’t fine.
The smoke test never reached steady state. Under sustained load, the CPUs hit thermal thresholds, clamped frequency,
and then tripped. Some nodes recovered; others fell into reboot loops that looked like provisioning issues.
The eventual clue came from a single person who stopped reading software logs and pulled IPMI data. CPU temp critical
events lined up with job timeouts. A tech opened one chassis. The cold plate still had the protective film on it.
Once they found one, they found several. A batch assembly line error, not a one-off mistake. The “funny” part lasted
about ten seconds; the rest was recalculating capacity, reworking nodes, and writing a new acceptance test that ran a
real sustained load while recording temps.
Mini-story 2: The optimization that backfired
Another organization chased power savings. Their data center bill had teeth, and management wanted “efficiency wins.”
Someone proposed lowering fan speeds by switching BMC fan mode from a high-performance profile to a quieter,
lower-RPM policy. The change rolled out via automation.
At first: success. Lower noise in the lab. Lower idle power. Then the real world arrived. Summer ambient rose, racks
got denser, and a few chassis had slightly imperfect airflow due to cable bundles and blanking panel gaps. Under
peak load, certain nodes started showing intermittent latency spikes. Nothing dramatic. Just enough to be expensive.
The optimization backfired because it narrowed the margin. With lower fan headroom, any extra thermal resistance—dust,
aging paste, slightly skewed heatsink pressure, or yes, a forgotten sticker after a CPU replacement—pushed the system
over the edge. The same hardware that tolerated mistakes at “full fan” didn’t at “eco fan.”
The team learned two lessons. First: fan policy is part of the reliability budget, not just acoustics. Second:
rolling out fan mode changes without per-chassis thermal characterization is gambling with a loaded die.
They recovered by restoring a more aggressive fan curve for production racks, then selectively applying quieter
policies only where inlet temps and chassis density allowed it, with thermal alarms tied to automation to reverse
the policy when thresholds were reached.
Mini-story 3: The boring but correct practice that saved the day
A financial services team had a habit that looked tedious: every hardware touch required a two-person “physical
verification” checklist. Not just “someone signed off,” but two pairs of eyes confirmed items like: sticker removed,
paste applied, cooler torque sequence followed, fan headers connected, and airflow baffles installed.
During a rushed maintenance window, they had to replace a CPU in a server that also hosted latency-sensitive storage
services. The environment was loud, hot, and full of distractions. The exact place where stickers thrive.
The first tech mounted the cooler, started to close the chassis, then the second person asked the checklist question:
“Cold plate film removed?” They paused. Reopened. The film was still there. It took ten seconds to fix and saved what
would have been a messy incident with confusing symptoms.
This is the kind of practice that doesn’t earn applause. It earns uptime. It also scales: when turnover happens and
new people join, the checklist is institutional memory written down in plain language.
Common mistakes: symptom → root cause → fix
Overheating mistakes are predictable. That’s good news: you can build guardrails. Here are specific mappings that
change decisions, not just vibes.
1) Fans at 100%, CPU still hits 95–100°C quickly → bad cold plate interface → remount, remove film, repaste
- Symptom: Rapid temperature ramp; throttling in dmesg; fans screaming.
- Root cause: Protective sticker left on, uneven mounting pressure, or paste applied wrong (too much, too little, or contaminated).
- Fix: Power down, remove cooler, clean with isopropyl alcohol, remove film, apply proper paste, tighten in cross pattern to spec.
2) CPU temps okay but system reboots under load → VRM or chipset overheating → restore airflow and check pads
- Symptom: CPU package temp looks safe, but SEL shows VRM temp warnings or “power unit” events.
- Root cause: Missing airflow baffle, dead fan, or misaligned VRM thermal pad after service.
- Fix: Replace fan, restore baffles/ducts, verify heatsinks and pads, ensure chassis pressure design is intact.
3) Storage latency spikes, CPU seems normal → NVMe thermal throttling → add heatsinks/airflow, relocate drives
- Symptom: Random IO latency spikes; NVMe temps 70–85°C; warning_temp_time increases.
- Root cause: Drives lack airflow, heatsinks missing, or placed behind hot GPUs.
- Fix: Install proper NVMe heatsinks, improve airflow path, use blanking panels, consider relocating high-IO drives to better-cooled bays.
4) Only certain workloads trigger issues → instruction mix heat spike → set power limits or schedule workloads
- Symptom: Normal for most tasks; fails under specific compute (vectorized math, compression, encryption, heavy rebuilds).
- Root cause: Workload generates higher sustained power draw; cooling margin insufficient.
- Fix: Apply power caps, tune turbo behavior, schedule heavy jobs to cooler windows, or upgrade cooling/airflow.
5) After firmware update, temps changed → fan control policy changed → pin policy or re-tune curves
- Symptom: “It was fine last week,” no hardware changes, but fans now idle lower and ramp later.
- Root cause: BMC/BIOS update altered fan curves or sensor interpretation.
- Fix: Verify fan mode settings, compare against baseline, pin to known-good policy, and update your automation checks.
6) One node in a rack overheats more than neighbors → local airflow obstruction → fix cabling, blanking, doors
- Symptom: Same model servers, same load; one runs hot.
- Root cause: Cable bundle blocking intake, missing blanking panels, partial obstruction, or a failed fan.
- Fix: Restore proper front-to-back airflow, replace fan, re-route cables, install blanks.
Checklists / step-by-step plan
When you suspect overheating in production (ops checklist)
- Stabilize service: drain workloads, reduce concurrency, or fail over if possible.
- Confirm thermals: sensors + BMC/IPMI + logs. Don’t rely on one source.
- Measure throttling: frequency under load, not at idle.
- Determine scope: single node vs fleet vs rack.
- Mitigate safely: temporary power cap, higher fan mode, or workload throttling.
- Schedule hands-on: if new/serviced hardware, assume physical root cause and plan a controlled shutdown.
- Postmortem capture: SEL, dmesg, sensor snapshots, workload timing.
When you build or service a server (hardware checklist)
- Remove the cold plate film before paste touches anything.
- Clean surfaces (IHS and cold plate) with proper solvent and lint-free wipes.
- Apply paste correctly: consistent method, small amount, no bubbles, no reusing old paste.
- Mount with even pressure: cross pattern, correct torque, no “one corner fully tight first.”
- Confirm fan/pump headers connected to the correct motherboard headers.
- Verify airflow parts: baffles, shrouds, blanks, ducting.
- Run a sustained burn-in: at least 20–30 minutes under real load while recording temps.
- Record baseline: idle temp, load temp, fan RPM, ambient, CPU power draw.
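The burn-in and baseline items are easy to script. A minimal sketch, assuming stress-ng and lm-sensors are available;
watch it live the first few times rather than trusting the CSV blindly:

#!/usr/bin/env bash
# 30-minute burn-in that records a temperature/frequency baseline for the checklist.
# Adjust the duration, sample rate, and log path to your environment.
set -uo pipefail

LOG=/var/tmp/burnin-$(hostname)-$(date +%Y%m%d-%H%M).csv
echo "timestamp,package_temp,cpu0_freq_mhz" > "$LOG"

stress-ng --cpu "$(nproc)" --timeout 30m --quiet &
STRESS_PID=$!

while kill -0 "$STRESS_PID" 2>/dev/null; do
  # Field 4 of the "Package id 0" line is the temperature in the lm-sensors format shown earlier.
  temp=$(sensors | awk '/Package id 0/ {print $4; exit}')
  freq=$(awk '{printf "%.0f", $1 / 1000}' /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
  echo "$(date -Is),${temp},${freq}" >> "$LOG"
  sleep 10
done

echo "Burn-in finished; baseline log at $LOG"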
“Sticker suspicion” decision tree (quick and decisive)
- Was the cooler/CPU touched recently? If yes, suspect interface first.
- Do temps spike within minutes under modest load? If yes, interface/pump/fan failure likely.
- Are fans/pump working? If yes and still overheats, physical contact is the prime suspect.
- Can you mitigate with power cap? If yes, do it to buy time; then schedule a shutdown and inspect.
FAQ
1) Is the cooler sticker mistake actually common in corporate environments?
Yes, especially when hardware is assembled or serviced under time pressure, or when vendors pre-assemble and your
team assumes acceptance testing would catch it. Short tests won’t.
2) Wouldn’t the system fail to boot if the sticker is left on?
Usually it boots. That’s why it’s dangerous. At idle, the CPU can survive with poor heat transfer. Under sustained
load, it can’t.
3) Can thermal paste “burn through” or compensate for the sticker?
No. Paste is not magic. It reduces tiny air gaps between metal surfaces. A plastic film creates a continuous
barrier with poor conductivity and often traps more air.
4) How can I tell sticker vs dead pump on an AIO?
Both look similar: fast temperature rise. With a dead pump, you may see pump RPM at 0 or abnormal, and the radiator
stays relatively cool while the CPU block gets hot. Sticker issues often show a bad paste/contact pattern when you
open it up.
5) Why does overheating sometimes look like storage or network problems?
Because throttling increases latency everywhere: CPU scheduling delays, interrupt handling delays, IO submission
delays. Add NVMe throttling and you get a perfect misdirection.
6) What’s a safe “burn-in” to catch this without risking hardware?
A controlled 20–30 minute sustained load (CPU stress plus realistic IO if it’s a storage node) while watching temps,
fan RPM, and throttling flags. Stop if you approach critical thresholds.
7) Do servers protect themselves reliably from overheating?
They try. Thermal throttling and trips are designed to protect silicon, not your uptime. A “protected shutdown” is
still an outage, and repeated trips can stress components.
8) Should I run fans at max all the time to avoid this?
No. Max fans hide problems and increase wear and noise. Use appropriate fan curves for your environment, and rely on
good assembly, clean airflow, and monitoring. Raise fan policy only as a mitigation or where validated.
9) What should I monitor to catch thermal issues early?
CPU package temperature, CPU frequency under load, fan RPM, power draw, and BMC SEL events. For storage nodes, add
NVMe temperature and throttling indicators.
10) What’s the single best “human process” fix?
A two-person physical verification step for any cooler/CPU work, plus a short sustained load test with logged temps
before the box returns to production.
Practical next steps
Overheating is one of those failure classes where the difference between a funny story and a real incident is whether
you treat it as a first-class operational concern. The sticker is funny exactly once, and only if it happens on a
non-critical box.
A reliability idea often paraphrased from W. Edwards Deming: “You can’t manage what you don’t measure.” In thermals, that
means temperatures, frequencies, fan RPM, and power—tied to changes and maintenance events.
- Add a sustained thermal acceptance test for new or serviced nodes (with recorded temps and throttling checks).
- Instrument BMC/IPMI into your monitoring so thermal alarms aren’t trapped behind a login page.
- Codify a physical checklist (including “remove cold plate film”) and require two-person verification.
- Define a mitigation runbook: drain workloads, set temporary power caps, raise fan policy, then schedule physical repair.
- Audit airflow basics: blanking panels, cable discipline, dust filters, and fan health.
If you do only one thing: the next time a node overheats after maintenance, stop negotiating with the logs and open
the chassis. The sticker doesn’t care about your dashboards.