You open the chassis because “it’s just a quick repaste,” and ten minutes later you’re wiping gray goop off a motherboard like you’re detailing a car.
The server comes back up… and then it throttles. Or worse: it reboots under load, right when your storage rebuild is at 72%.
Thermal paste is boring until it isn’t. In production systems, it’s a reliability primitive: a tiny, messy layer that decides whether your CPU runs at spec or
spends its life negotiating with physics. Here’s what actually goes wrong when people get enthusiastic, and how to diagnose it with the same discipline you use for
latency spikes and disk errors.
The physics you can’t negotiate
Thermal paste (TIM: thermal interface material) is not “a better conductor than metal.” It’s the opposite. It exists because real metal surfaces are not flat.
If you put a CPU heat spreader against a heatsink, you don’t get perfect contact. You get microscopic peaks touching and a whole lot of trapped air in the valleys.
Air is a terrible conductor. Paste is “less terrible than air,” so it fills the voids.
The goal is not a thick layer. The goal is a thin, continuous layer that displaces air while keeping the metal-to-metal contact as high as possible. If you add too
much paste, you increase the thickness of the paste layer, and since paste conducts worse than copper or aluminum, your thermal resistance goes up. That’s the first
and most common “enthusiasm beats physics” failure.
The second failure is mechanical: paste is slippery. Excess paste can change how the heatsink seats. A cooler that’s slightly tilted or not evenly torqued can create
a contact pattern that looks fine to the naked eye but gives you a hot spot on one core cluster under AVX load. Modern CPUs will protect themselves with throttling,
but “protected” still means “slower,” and in distributed systems, slower is contagious.
The third failure is contamination. Most pastes are nominally non-conductive electrically, but “non-conductive” is not “safe to smear across tiny components.”
Some pastes are slightly capacitive; some have metal content; some become conductive when contaminated or aged. And even if the paste itself is electrically benign,
it attracts dust and fibers, and it makes inspection and rework miserable.
Here’s the operational truth: if a server’s thermal behavior changes after you repaste, assume you made it worse until proven otherwise. That doesn’t mean you’re bad.
It means the system was already working, and you changed multiple variables at once: interface thickness, mounting pressure, fan curves (often), and airflow
(you had the lid off). Start with measurement, not vibes.
One quote that belongs on every operations team's wall comes from Richard Feynman:
"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."
It’s short, it’s rude, and it’s true.
Joke #1: Thermal paste is like perfume—if you can see it across the room, you used too much.
What correct looks like (and why it’s not a universal “pea”)
Internet advice loves the “pea-sized dot.” It’s not wrong in spirit, but it’s incomplete. Different CPUs have different die layouts under the heat spreader.
Different heatsinks apply different pressure distributions. Some heat spreaders are long rectangles (HEDT and server platforms), which means the "one dot" method can
leave corners underfilled. A thin line or an X can be better for large IHS footprints.
The sane approach is boring: use a known-good method per platform, use consistent torque, and validate with a contact pattern check when you’re changing cooler or
paste type. If you’re doing fleet work, standardize. Consistency beats artisanal paste art.
Why “more paste = better cooling” keeps surviving
It feels intuitive: more material between two things means more transfer. That’s true when the material is better than the gap. The gap is air (awful), so the first
little bit of paste helps a lot. After that, you’re not replacing air anymore. You’re replacing metal contact with paste thickness. And now you’re paying for it.
In server terms: paste is like a cache. Some is good. All of memory pretending to be cache is just… memory.
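For rough intuition, a back-of-the-envelope sketch of that trade-off: treat the paste layer as one-dimensional conduction, where resistance = thickness / (conductivity × area). The numbers below assume a 40 mm × 40 mm contact patch, a mid-range paste around 5 W/m·K, and 200 W of package power; real interfaces add contact resistance and heat spreading, so treat this as illustration, not a datasheet.
cr0x@server:~$ awk 'BEGIN { k=5.0; A=0.04*0.04; n=split("25 100 300", t); for (i=1; i<=n; i++) { R=t[i]*1e-6/(k*A); printf "%3d um layer ~ %.4f K/W -> ~%.1f C extra at 200 W\n", t[i], R, R*200 } }'
 25 um layer ~ 0.0031 K/W -> ~0.6 C extra at 200 W
100 um layer ~ 0.0125 K/W -> ~2.5 C extra at 200 W
300 um layer ~ 0.0375 K/W -> ~7.5 C extra at 200 W
The exact figures depend on the paste and geometry; the shape of the relationship doesn't. Resistance scales linearly with thickness, which is why a "generous" layer quietly spends headroom you thought you had.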
Facts and historical context (the non-myth version)
- Fact 1: Early high-power electronics used greases and oils as interface materials decades before consumer PCs made “repasting” a hobby.
- Fact 2: “Thermal compound” became mainstream in PCs as CPU power density climbed and the mismatch between shiny-looking surfaces and real flatness mattered.
- Fact 3: Even polished metal surfaces have microscopic asperities; optical smoothness is not thermal smoothness.
- Fact 4: Typical thermal paste conductivity is far lower than copper; its value is in displacing air, not beating metal.
- Fact 5: Phase-change interface materials (pads that soften/melt slightly at operating temperature) exist to simplify assembly consistency in manufacturing.
- Fact 6: “Pump-out” is a real phenomenon: thermal cycling and mechanical stress can migrate paste away from the hottest contact area over time.
- Fact 7: Some pastes are electrically conductive (notably many liquid metal compounds), and they require insulation, masking, and a higher standard of workmanship.
- Fact 8: Many server heatsinks are engineered for a specific mounting pressure and airflow; swapping to an “aftermarket” approach can break the thermal model.
- Fact 9: Thermal throttling has become more aggressive and granular in modern CPUs; you can lose performance without crashing, which makes the failure easy to miss.
What “thermal paste everywhere” really breaks
Failure mode 1: Higher thermal resistance from thick TIM
Too much paste creates a thicker layer. Thermal resistance increases. Temperatures rise faster under load and stabilize at a higher equilibrium. You see earlier
fan ramp, earlier throttling, and reduced turbo residency. In production, that becomes longer job runtimes, more tail latency, and occasionally watchdog resets
on systems with tight thermal limits.
Failure mode 2: Poor contact from uneven mounting
Excess paste can hydroplane the heatsink during installation, especially if you tighten one corner too far too early. The heatsink can trap a wedge of paste
and never fully seat. You’ll often see one or two cores or one CCD hotter than the rest, not a uniform increase. That pattern matters: it screams “contact problem”
more than “airflow problem.”
Failure mode 3: Paste in the wrong places
Paste smeared onto socket edges, SMD components, or between pins is a gift that keeps giving. Even “non-conductive” compounds can cause leakage paths when mixed with
dust. It also makes later inspections unreliable: you can’t easily tell if a component is cracked, charred, or just wearing a fashionable gray coat.
Failure mode 4: Wrong paste for the operating profile
Desktops and servers live different lives. A server may run sustained load, high inlet temperatures, and constant thermal cycling. Some consumer pastes dry out,
separate, or pump out faster under that regime. Conversely, some high-performance compounds are finicky and demand perfect mounting and surface prep.
Failure mode 5: Chasing paste when the real issue is airflow
The classic misdiagnosis: “CPU is hot, therefore paste is bad.” In a rack, inlet temperature, blanking panels, cable bundles, fan health, and BMC fan curves are
often the real villain. Paste is the easiest thing to touch, so it gets blamed. Meanwhile the server is breathing its neighbor’s exhaust because someone removed a
filler panel months ago and nobody wanted to file a ticket.
Joke #2: If your paste application looks like modern art, the CPU will respond with performance art—mostly interpretive throttling.
Fast diagnosis playbook
When a machine runs hot after a repaste—or starts throttling during normal workloads—don’t start by repasting again. Start by isolating the bottleneck in three passes:
(1) confirm sensors and symptoms, (2) correlate with power and frequency behavior, (3) validate airflow and contact. You’re trying to answer one question quickly:
is the limiting factor heat generation, heat transfer, or heat removal?
First: confirm the symptom is real and specific
- Check CPU package temperature, per-core/CCD deltas, and whether the BMC agrees with the OS.
- Look for thermal throttling flags and frequency drops under load.
- Compare against a known-good sibling host if you have one.
Second: correlate thermals with workload and power
- Is it load-triggered (only during AVX or compression), time-triggered (after 20 minutes), or ambient-triggered (only at hot aisle peaks)?
- Do fans ramp to max? If fans are low while CPU is hot, suspect fan control/BMC policies.
- Are you power-limited (package power clamp) rather than thermal-limited?
Third: validate airflow and mechanical contact
- Airflow: inlet temps, chassis fan RPM, blocked filters, missing blanks, cable obstructions.
- Mechanical: heatsink torque pattern, mounting standoffs, backplate alignment, warped cold plate, correct spacer for the socket.
- TIM: correct amount, no voids, no paste contamination, correct paste type for the temperature range.
If you follow this order, you avoid the most expensive mistake: doing repeated physical rework without a measurement change, which turns a technical issue into a
reliability incident with extra downtime sprinkled on top.
Practical tasks: commands, outputs, and decisions
These are real commands you can run on typical Linux servers to determine whether your problem is throttling, sensors, airflow, or contact. Each task includes
what the output means and what you decide next. Use them like you’d use iostat for storage: as evidence, not decoration.
Task 1: Check basic CPU temperatures and per-core spread
cr0x@server:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +86.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +85.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +86.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +68.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +69.0°C (high = +90.0°C, crit = +100.0°C)
Meaning: Two cores are ~16–18°C hotter than the others under similar conditions. That's not "case airflow"; that's often uneven contact or a localized hotspot.
Decision: Move to throttling checks and then a mechanical inspection if the pattern persists under a controlled load.
Task 2: Watch temperatures and fan behavior live
cr0x@server:~$ watch -n 1 'sensors | egrep "Package id 0|Core 0|Core 2|fan"'
Every 1.0s: sensors | egrep "Package id 0|Core 0|Core 2|fan"
Package id 0: +92.0°C
Core 0: +91.0°C
Core 2: +74.0°C
fan1: 8200 RPM
fan2: 8100 RPM
Meaning: Fans are high; the system is trying. Temps still high, with a large delta. Heat removal is working; heat transfer (TIM/contact) is suspect.
Decision: Validate throttling flags; prepare for a reseat with correct torque and paste quantity.
Task 3: Confirm CPU frequency and throttling during load
cr0x@server:~$ lscpu | egrep "Model name|CPU MHz|Thread|Socket"
Model name: Intel(R) Xeon(R) CPU
Thread(s) per core: 2
Socket(s): 2
CPU MHz: 1199.992
Meaning: If you’re under load and you see ~1.2 GHz on a CPU that should be much higher, you’re likely throttling or power-limited.
Decision: Check kernel logs for thermal throttling events and compare to power caps.
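One caveat before trusting that number: lscpu's "CPU MHz" is a single snapshot, and newer lscpu versions may not show it at all. For live per-core clocks while the load is running, a minimal sketch using /proc/cpuinfo (field name and availability vary by architecture; turbostat gives richer data where it's installed):
cr0x@server:~$ watch -n 1 'grep "cpu MHz" /proc/cpuinfo | sort -t: -k2 -n | head -n 4'
The slowest cores sort to the top; a handful of cores pinned near their minimum frequency under full load is the same throttling signal, just live instead of after the fact.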
Task 4: Look for thermal throttling messages in kernel logs
cr0x@server:~$ sudo dmesg -T | egrep -i "thermal|throttl|PROCHOT|temperature" | tail -n 10
[Mon Jan 22 10:14:05 2026] CPU0: Package temperature above threshold, cpu clock throttled (total events = 37)
[Mon Jan 22 10:14:05 2026] CPU1: Package temperature above threshold, cpu clock throttled (total events = 37)
Meaning: This is explicit thermal throttling. Not “maybe.” Not “user says it’s slow.”
Decision: Determine whether this is due to airflow/ambient or a bad interface/mount by checking inlet and fan control next.
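Intel platforms also keep cumulative throttle counters in sysfs, which survive log rotation and catch throttling you weren't watching for. A sketch, assuming the thermal_throttle directory exists on your platform (it requires the relevant x86 thermal driver):
cr0x@server:~$ grep . /sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count | sort -t: -k2 -n | tail -n 4
A counter that climbs between two checks means throttling is ongoing, not historical.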
Task 5: Read BMC/IPMI sensor data (temps, fans, inlet)
cr0x@server:~$ sudo ipmitool sdr elist | egrep -i "inlet|ambient|cpu|fan" | head -n 12
Inlet Temp | 24 degrees C | ok
CPU1 Temp | 91 degrees C | ok
CPU2 Temp | 89 degrees C | ok
FAN1 | 8100 RPM | ok
FAN2 | 8200 RPM | ok
FAN3 | 7900 RPM | ok
Meaning: Inlet is reasonable; CPU temps are high; fans are high and healthy. This points away from hot aisle issues and toward heatsink contact/TIM.
Decision: Schedule a maintenance window for reseat; don’t waste time reconfiguring fan curves.
Task 6: Verify CPU governor and frequency policy (avoid self-inflicted throttling)
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
Meaning: You’re not accidentally running “powersave.” Good. If it were “powersave,” you could misinterpret low clocks as thermal throttling.
Decision: Proceed to power/thermal limit checks rather than tuning CPU policy.
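Check every CPU, not just cpu0, and confirm nothing has clamped the scaling ceiling in software. A sketch, assuming a cpufreq driver is active (otherwise these paths don't exist):
cr0x@server:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
If scaling_max_freq sits well below cpuinfo_max_freq, something (a daemon, a previous admin, a power profile) has capped clocks in software, and no amount of paste will fix that.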
Task 7: Check for power capping (can masquerade as thermal issues)
cr0x@server:~$ sudo ipmitool dcmi power reading
Instantaneous power reading: 412 Watts
Minimum during sampling period: 380 Watts
Maximum during sampling period: 430 Watts
Average power reading over sample period: 405 Watts
IPMI timestamp: Mon Jan 22 10:20:10 2026
Sampling period: 00000010 Seconds.
Meaning: This shows actual draw; it doesn’t prove you are capped, but it gives context. If your platform enforces a strict cap, clocks may drop even at safe temps.
Decision: If temps are high and clocks are low, it’s thermal. If temps are moderate and clocks are low, suspect power capping or BIOS limits.
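On Intel systems with the intel_rapl driver loaded, the OS can also see the package power limit through the powercap interface. A sketch; the layout differs on AMD and on platforms where the BIOS or BMC owns power capping:
cr0x@server:~$ grep . /sys/class/powercap/intel-rapl:*/name /sys/class/powercap/intel-rapl:*/constraint_0_power_limit_uw
Values are in microwatts; a limit far below the CPU's rated power is a strong hint that low clocks are policy, not physics.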
Task 8: Identify whether a specific process triggers the heat spike
cr0x@server:~$ top -b -n 1 | head -n 15
top - 10:22:31 up 18 days, 3:12, 1 user, load average: 8.12, 6.77, 3.09
Tasks: 412 total, 9 running, 403 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.4 us, 0.6 sy, 0.0 ni, 86.6 id, 0.1 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 257843.1 total, 98212.7 free, 40117.2 used, 119513.2 buff/cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28412 app 20 0 12.3g 2.1g 112m R 780.0 0.8 12:31.44 compressor
Meaning: A single workload (compression/crypto/AVX-heavy) can push thermals harder than your usual tests.
Decision: Use a repeatable load test (same binary) when validating a reseat; otherwise you’ll chase noise.
Task 9: Stress test in a controlled way to reproduce the issue
cr0x@server:~$ sudo apt-get install -y stress-ng
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
stress-ng
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Meaning: You now have a consistent tool to generate load.
Decision: Run a short stress and watch temps; don’t run it on production without a maintenance window and safety limits.
cr0x@server:~$ sudo stress-ng --cpu 32 --timeout 60s --metrics-brief
stress-ng: info: [31201] dispatching hogs: 32 cpu
stress-ng: metrc: [31201] cpu 60.00s 12345.67 bogo ops/s
stress-ng: info: [31201] successful run completed in 60.02s
Meaning: A repeatable 60-second load allows before/after comparisons. Pair this with watch sensors.
Decision: If throttling happens quickly, it’s likely contact/TIM. If it takes a long time, suspect airflow, clogged filters, or ambient.
Task 10: Check for corrected hardware errors that correlate with heat
cr0x@server:~$ sudo journalctl -k --since "1 hour ago" | egrep -i "mce|edac|hardware error" | tail -n 10
Jan 22 10:09:12 server kernel: mce: [Hardware Error]: Machine check events logged
Jan 22 10:09:12 server kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
Meaning: Heat can exacerbate marginal components. A repaste that increases temps might turn “rare correctable errors” into “incident.”
Decision: Treat thermal issues as reliability issues; plan remediation before the errors stop being correctable.
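If rasdaemon is deployed on your fleet, it keeps a persistent database of these events, which is friendlier than grepping rotated logs. A sketch, assuming the package is installed:
cr0x@server:~$ sudo ras-mc-ctl --summary
Rising corrected-error counts that track thermal events are exactly the correlation you want documented before anyone argues the repaste was harmless.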
Task 11: Validate storage latency during thermal throttling (because it will show up)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 01/22/2026 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.31 0.00 3.22 0.85 0.00 83.62
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s r_await w_await %util
nvme0n1 12.0 220.0 512.0 8192.0 0.0 0.0 2.10 9.80 68.5
Meaning: During throttling, CPU can become the bottleneck and inflate request processing times, which can look like “storage got slow.”
Decision: If storage latency rises while CPU clocks drop and temps spike, fix thermals before tuning I/O.
Task 12: Verify ZFS scrub/resilver isn’t turning thermals into a self-inflicted outage
cr0x@server:~$ zpool status
pool: tank
state: ONLINE
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jan 22 09:41:03 2026
1.20T scanned at 1.68G/s, 612G issued at 858M/s, 4.10T total
612G resilvered, 14.58% done, 0 days 01:10:22 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
Meaning: Resilvering drives CPU and memory hard (checksums, compression, parity). If your CPU thermals are marginal, this kind of workload will find it.
Decision: If you’re throttling, consider pausing or scheduling heavy maintenance workloads until cooling is corrected—otherwise you extend the risky period.
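If the heavy workload is a scrub rather than a resilver, OpenZFS lets you pause and resume it; resilvers generally can't be paused, only rescheduled around. A minimal sketch, reusing the pool name from above:
cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool scrub tank
The first command pauses an in-progress scrub; the second resumes it where it left off. Use it to keep a thermally marginal host out of trouble until the reseat is done.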
Task 13: Check BMC event log for thermal or fan events
cr0x@server:~$ sudo ipmitool sel list | tail -n 8
217 | 01/22/2026 | 10:14:06 | Temperature #0x01 | Upper Critical going high | Asserted
218 | 01/22/2026 | 10:14:10 | Temperature #0x01 | Upper Critical going high | Deasserted
219 | 01/22/2026 | 10:14:12 | Processor #0x01 | IERR | Asserted
Meaning: The BMC saw a thermal threshold crossing, and the IERR assertion right after it suggests the processor became unstable under heat, not merely hot.
Decision: Escalate. Thermal isn’t cosmetic; it’s now causing hardware-level faults.
Task 14: Check whether the chassis thinks the lid is present (yes, this happens)
cr0x@server:~$ sudo ipmitool sdr elist | egrep -i "intrusion|chassis"
Chassis Intrusion | Not Available | ok
Meaning: Some platforms adjust fan behavior based on chassis intrusion or lid sensors. If it’s triggered, fan control can do odd things.
Decision: If intrusion is asserted or “open,” fix the physical state first; don’t tune software around a missing lid.
Three corporate mini-stories from the field
1) The incident caused by a wrong assumption
A mid-size SaaS company had a fleet of database servers that were stable for years. Then a routine hardware refresh happened: same CPU family, slightly newer stepping,
new heatsink bracket revision from the vendor. Nothing scary. A technician repasted a handful of hosts during rack-and-stack because a couple of heatsinks looked
“a little dry.” That seemed responsible.
The wrong assumption was simple: more paste improves thermals, and “it’ll spread out.” The tech used a generous amount and did a quick install—tightened one corner,
then the opposite, but not in incremental steps. The machines booted. Temperatures looked okay at idle. Everyone went home.
The next day, the database cluster started showing unpredictable latency spikes. Not massive. Just enough to trigger retries, which created more load, which created
more heat. Under the nightly analytics job, two nodes began throttling, fell behind replication, and were fenced out by the cluster manager as “slow and unhealthy.”
The failover worked, but it was messy: an availability blip, a pager storm, and a long root-cause meeting.
The postmortem was less about paste and more about discipline. They compared thermal telemetry between “repasted” and “untouched” nodes and found a clear signature:
higher package temps under load and a larger per-core delta. The fix was not heroic. They pulled the affected machines in a maintenance window, cleaned properly,
applied a measured amount, tightened in a cross pattern with consistent torque, and validated with a stress test before putting them back into the pool.
The real lesson: assuming a physical change is benign because the system boots is like assuming a storage change is safe because the filesystem mounts. Boot is not
a benchmark. It’s a greeting.
2) The optimization that backfired
Another org—large, cost-conscious, and proud of their “efficiency”—wanted to reduce fan noise and power consumption in a lab that had quietly become a production
staging area. Someone decided to “optimize” thermals: reapply premium high-conductivity paste across the fleet and then lower fan curves slightly via BMC settings.
The argument: better paste means we can spin fans slower.
The paste was fine. The process wasn’t. They used a spreader method to create a perfect-looking layer, but they didn’t control thickness. Some heatsinks ended up
with a paste layer that was simply too thick. The machines ran cooler at idle—because everything runs cooler at idle—and the fan curve change made the environment
seem quieter and “stable.” Victory slide deck.
Then they ran staging load tests that were more realistic than their earlier synthetic ones. Under sustained CPU-heavy workloads, temperatures climbed slowly, fans
ramped late (because of the new curve), and CPUs began to downclock. Performance results looked worse. The team assumed the new paste needed “burn-in,” because
that’s the kind of myth you reach for when you’ve already committed to the narrative.
In the end, the optimization backfired twice: the fan curve change reduced thermal headroom, and the inconsistent TIM thickness increased thermal resistance. They
reverted fan policy, standardized the application method, and only then did the “premium paste” produce a measurable improvement. The cost was mostly time and
credibility, which in corporate life is not renewable.
The operational rule: never bundle physical changes with policy changes unless you’re prepared to bisect them. If you can’t bisect, you can’t learn.
3) The boring but correct practice that saved the day
A storage team running dense compute-and-NVMe nodes had one habit that looked almost comical: every time a heatsink was removed, they logged it like a disk swap.
Ticket, reason, paste type, method, torque pattern, and a “before/after” 60-second stress test snapshot. Nobody loved doing it. Everyone loved having it later.
During a quarter-end change freeze, a node started intermittently throttling. It wasn’t failing outright. It was just slow. The service it hosted had strict tail
latency SLOs, and this node was dragging the whole pool down. Because of the freeze, the team needed proof before requesting an exception for physical work.
They pulled the host’s historical data and saw that package temps under the standard stress test had increased by ~10°C since the last maintenance. They also saw
that the host had a heatsink removal recorded two months earlier for a motherboard RMA. That gave them a plausible hypothesis: a subtle seating issue or pump-out.
They got the exception, reseated the heatsink using their standard procedure, and the after-test matched baseline. No drama, no guessing, no “try a different paste
brand.” The host returned to the pool, and the quarter-end passed without a performance incident.
This is what boring looks like when it works: a tiny ritual of measurement and documentation that turns a thermal mystery into a predictable maintenance task.
Common mistakes: symptoms → root cause → fix
1) High CPU temps immediately after repaste
Symptoms: Temperatures are worse than before; fans ramp quickly; throttling under modest load.
Root cause: Too much paste (thick layer), trapped air pockets, heatsink not seated flat.
Fix: Remove heatsink, clean both surfaces fully, apply a measured small amount, reseat with cross-pattern incremental tightening. Validate with a repeatable load test.
2) One core/CCD much hotter than others
Symptoms: Large per-core delta under load; package temp looks “okay-ish” but hotspot hits thresholds.
Root cause: Uneven mounting pressure, tilt, wrong standoff/spacer, warped heatsink base, paste wedge.
Fix: Check mounting hardware compatibility; reseat; ensure even torque. Consider inspecting contact pattern (thin paste imprint) to confirm full coverage.
3) Temps fine at idle, bad after 20–60 minutes
Symptoms: Gradual climb, then throttling; often correlates with sustained workloads (scrubs, rebuilds, batch jobs).
Root cause: Airflow restriction (filters, cable bundles), fan curve too conservative, ambient/inlet temperature peaks, paste pump-out over time.
Fix: Check inlet temp and fan RPM via BMC; inspect airflow path; restore vendor fan policy; if history suggests, reseat with a paste known to resist pump-out.
4) System reboots under load, thermals “look normal”
Symptoms: Random resets; sometimes no clear thermal log; occasional MCE/EDAC events.
Root cause: A localized hotspot not captured by the sensor you're watching, VRM overheating, heatsink misalignment, or missing lid/ducting letting a component overheat.
Fix: Use BMC sensors beyond CPU (VRM, motherboard, inlet). Confirm ducting and shrouds are installed. Re-check heatsink seating. Don’t ignore corrected errors.
5) Fans stuck low while temps rise
Symptoms: CPU hits 90°C, fans remain at low RPM; no obvious fan faults.
Root cause: BMC fan policy misconfiguration, chassis intrusion sensor asserted, or a firmware bug.
Fix: Compare OS temps to BMC readings; check SEL for policy events; restore default thermal profile; update BMC firmware during a controlled window.
6) Paste on socket/components after rework
Symptoms: Visual contamination; intermittent boot issues; unexplained instability post-maintenance.
Root cause: Over-application and smear during heatsink removal/installation; poor cleaning method.
Fix: Power down, disassemble carefully, clean with appropriate solvent and lint-free tools, inspect under bright light. If conductive paste was used, treat as an incident and consider board replacement.
7) “We repasted and it’s still hot”
Symptoms: No improvement after multiple repastes; everyone is tired; the system remains marginal.
Root cause: The problem isn't paste; it's a wrong heatsink model, a missing shroud, incorrect mounting hardware, a degraded fan, clogged heatsink fins, or high inlet temperature.
Fix: Stop repasting. Validate part numbers, shrouds, and airflow. Verify fans and heatsink fin cleanliness. Compare to a known-good host in the same rack.
Checklists / step-by-step plan
Step-by-step: the “do it once and be done” repaste procedure (server-grade)
- Plan the validation. Pick a repeatable load test (e.g., stress-ng --cpu N --timeout 60s) and record baseline temps and clocks before touching hardware (see the baseline-capture sketch after this list).
- Schedule a window. You want time for careful cleaning and a post-work stress test. Rushing is how paste becomes a lifestyle.
- Power down and discharge. Remove power cords, wait, follow your platform’s service guide. Don’t hot-swap your patience.
- Remove heatsink carefully. Loosen in a cross pattern a little at a time. Avoid twisting that smears paste across components.
- Clean both surfaces fully. Use lint-free wipes/swabs and appropriate solvent. Remove old paste from edges and corners where it loves to hide.
- Inspect surfaces. Look for scratches, pits, residue, and signs of uneven contact. Confirm the correct bracket/standoffs for the socket.
- Apply paste sparingly. Use the minimum that will fill voids: small dot for typical IHS, line/X for large rectangular server IHS as appropriate.
- Seat heatsink straight down. Avoid sliding it around; a tiny shift can create voids or push paste out unevenly.
- Tighten incrementally in a cross pattern. Few turns per screw, alternating corners, until fully seated per vendor spec.
- Reinstall shrouds and ducts. These are not optional aesthetics. They’re the difference between “cooling system” and “hope.”
- Boot and verify sensors. Confirm fans, inlet temp, and CPU temps in both OS and BMC.
- Run the validation load. Compare to baseline. If temps are worse, stop and re-check mounting and paste amount rather than “trying a new pattern” randomly.
- Record the change. Log paste type, method, and before/after metrics. Future you will be annoyingly grateful.
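A minimal baseline-capture sketch for the validation steps above, assuming lm-sensors and stress-ng are installed; adjust the sensor label, duration, and file names to your platform:
cr0x@server:~$ (for i in $(seq 1 90); do echo "$(date +%T) $(sensors | awk '/Package id 0/ {gsub(/[^0-9.]/,"",$4); print $4}')"; sleep 1; done > /tmp/thermal_before.log) &
cr0x@server:~$ sudo stress-ng --cpu "$(nproc)" --timeout 60s --metrics-brief
cr0x@server:~$ wait
cr0x@server:~$ sort -k2 -n /tmp/thermal_before.log | tail -n 1
Repeat the same four lines after the rework into a thermal_after.log and compare the peaks. A before/after file pair attached to the change ticket beats anyone's memory of "it seemed fine."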
Checklist: airflow and chassis sanity (before blaming paste)
- All fan modules present, correct model, no reported faults.
- Heatsink fins clean; no dust matting or packaging foam (yes, it happens).
- Air shroud installed and seated.
- Blanking panels installed; no open RU holes short-circuiting airflow.
- Cable bundles not blocking fan inlets or the CPU shroud.
- Inlet temps within expected range; compare to rack neighbors.
- BMC thermal profile set to vendor-recommended mode for your workload.
Checklist: choosing paste like an adult
- Prefer non-conductive, non-capacitive paste for fleet servers unless you have a strong reason and workmanship controls.
- Prioritize stability under thermal cycling (pump-out resistance) over peak benchmark conductivity claims.
- Standardize on one or two approved compounds and one application method per platform.
- Avoid mixing pastes or applying on top of residue; clean to bare surface every time.
- If you’re using phase-change pads by design, don’t replace them with paste casually; you’re altering a validated assembly process.
FAQ
1) Can too much thermal paste actually make temperatures worse?
Yes. Paste is primarily an air-gap filler. A thick layer increases thermal resistance compared to metal-to-metal contact, raising temps and accelerating throttling.
2) How do I know if I used too much paste without taking it apart?
Look for a post-repaste signature: higher package temps under the same controlled load, larger per-core deltas, earlier fan ramp, and new throttling events in logs.
Those patterns strongly suggest a bad interface or seating problem.
3) Is the “pea method” always correct?
No. It’s a decent default for many mainstream IHS sizes, but large rectangular server IHS footprints often benefit from a line or X to ensure edge coverage. The real
requirement is thin, continuous coverage after mounting, not loyalty to a shape.
4) Should I spread the paste with a card/spatula?
In fleet operations, spreading often increases variability in thickness and introduces bubbles if done casually. A controlled dot/line/X with proper mounting pressure
is usually more consistent. If you do spread, you need a method that controls thickness and avoids air.
5) How often should servers be repasted?
Less often than hobby forums suggest. Many server-grade assemblies run for years without repaste. Repaste when you have evidence: rising temps over time,
after heatsink removal, or after a verified contact issue—not as a seasonal ritual.
6) Are “metal” or “liquid metal” compounds worth it in production?
Usually no, unless you have a controlled process and the platform supports it. Conductive TIM increases risk: shorts, corrosion, and harder rework.
Reliability trumps a few degrees.
7) My CPU is hot; does that automatically mean the paste is bad?
Not automatically. Check inlet temps, fan RPM, shrouds, and BMC policy first. Airflow problems are common and affect multiple components, not just the CPU package.
8) Why do I see throttling but no obvious temperature alarm?
Throttling can be triggered by localized hotspots or internal sensors that don’t map cleanly to the one temperature you’re watching. Also, firmware may throttle
proactively below “critical” thresholds. Use both OS logs and BMC sensors for a fuller picture.
9) What’s the single most important mechanical factor besides paste quantity?
Mounting pressure and evenness. A perfect paste can’t compensate for a heatsink that’s tilted, torqued unevenly, or using the wrong spacer/backplate.
10) If I repaste and temps improve, am I done?
You’re done when you’ve validated under a representative sustained load and recorded the result. Many thermal issues show up after time, not in the first minute.
Conclusion: next steps you can actually do
Thermal paste is not magic and not a craft project. It’s a controlled interface in a heat-transfer system with known failure modes: too thick, uneven seating,
wrong material, or blaming TIM for airflow sins. The messiest repaste jobs usually come from the same root cause as messy outages: changing things without measurement.
Practical next steps:
- Pick a standard validation load and record baseline thermals and frequencies for each platform.
- When thermals drift, run the fast diagnosis playbook before you touch hardware.
- If you must repaste, standardize paste type, application method, and torque sequence—and document it like any other production change.
- After rework, validate under sustained realistic load, not just “it boots.”
- Treat thermal regressions as reliability risks, especially on storage nodes doing rebuilds and scrubs.
If you remember one thing: the correct amount of paste is the minimum amount that makes air irrelevant. Everything beyond that is just you decorating a heat problem.