Thermal pads: the $10 fix that can change a whole GPU

Your GPU is “fine” until it isn’t. One day a training job starts clocking down for no reason. A game that used to sit at 1900+ MHz suddenly flirts with 1200. Fans scream, frame time spikes, and your monitoring dashboard turns into a crime scene.

The culprit is often not the silicon, not the drivers, not the power supply. It’s the $10 squishy stuff you forgot existed: thermal pads. And when they’re the wrong thickness, wrong hardness, misaligned, dried out, or simply tired, they can sabotage an otherwise healthy card with the quiet confidence of a misconfigured cron job.

What thermal pads actually do (and what they don’t)

A GPU cooler isn’t one surface. It’s a small ecosystem: the GPU die (or package), memory chips around it, VRM stages, inductors, sometimes a backplate, and a heatsink assembly that can’t perfectly touch all of those parts at once.

Thermal paste is for very thin gaps and high clamping pressure: GPU die to cold plate. Thermal pads are for imperfect, larger gaps and uneven stackups: memory ICs to heatsink, VRM components to a secondary plate, sometimes backplate contact.

Pads do two jobs:

  • Fill the gap between a hot component and a heatsink surface that’s not perfectly coplanar.
  • Transmit heat through a material with acceptable thermal resistance while staying mechanically stable.

What pads do not do well:

  • Compensate for wrong pressure. If the pad is too thick, it can prevent the cold plate from seating on the die. That’s a catastrophic “fix.”
  • Outperform paste at the die. Pads are almost always worse than decent paste for the primary die contact.
  • Save you from poor airflow. If the case/ducting is a toaster, pads are just a better way to cook evenly.

The key mental model: you’re not buying “W/mK.” You’re buying lower total thermal resistance in your specific geometry. Thermal conductivity is a spec; the stackup is the truth.

Interesting facts and a little history

A handful of context points that put modern GPU pad work in perspective:

  1. Thermal interface materials (TIMs) took off with dense electronics packaging. Once heatsinks stopped being simple blocks and became multi-contact assemblies, pads became the “manufacturing tolerances tax.”
  2. GDDR6X made memory temperatures a mainstream problem. Earlier generations ran hot too, but GDDR6X and its power density turned “warm VRAM” into “your card is throttling at the memory.”
  3. “Hotspot” sensors changed how we see cooling. Modern GPUs expose junction/hotspot telemetry that reveals local contact issues, not just average die temperature.
  4. Backplates weren’t originally thermal devices. Many started as structural and aesthetic parts; later designs began using them as heat spreaders with pads.
  5. Pad hardness matters as much as thickness. Two 2.0 mm pads with different compressibility can behave like different thicknesses under the same torque.
  6. Factory pads are often chosen for assembly yield, not peak performance. Vendors optimize for “works on every unit on the line,” not “best possible temps on your specific card.”
  7. Thermal pads age. Heat cycles and time can stiffen pads, reduce compliance, and degrade contact—especially near VRMs.
  8. Board partner designs vary wildly. Two cards with the same GPU can have completely different pad maps, VRM layouts, and contact plates.

Why pads can change a whole GPU

In production systems, small friction points create disproportionate outages. On a GPU, thermal pads are one of those friction points. They sit between critical components and the only thing preventing them from baking: the heatsink.

If your GPU die is well-pasted but your memory pads are wrong, your performance can still crater. Why? Because modern cards will throttle on whichever limit hits first: power, temperature, voltage reliability limits, memory junction, VRM temps, or even hotspot deltas that imply poor contact.

The most common “whole GPU changed” outcomes after a proper repad are boring and measurable:

  • Memory junction drops enough to stop memory-induced throttling.
  • Hotspot delta shrinks because the cooler seats properly once the pad stackup is corrected.
  • Fans calm down because the controller is no longer chasing runaway local temps.
  • Clocks stabilize because the card stays within its thermal and electrical envelopes.

Pads are also the easiest way to accidentally wreck a card’s thermal behavior. There is no “universal best thickness.” There is only “the thickness that makes the cold plate seat correctly while memory/VRM contacts are fully engaged.”

Joke #1: A thermal pad is like a meeting invite—too thick and nobody can get close enough to do real work.

The physics, without the pretend math

Heat flow through a pad is dominated by thermal resistance. Roughly speaking, thicker pad = more resistance, unless the alternative is an air gap (air is a fantastic insulator and a terrible life choice for VRAM cooling).

But you don’t get to pick thickness freely. You’re constrained by:

  • Component height tolerances (memory packages, chokes, MOSFETs).
  • Heatsink flatness and machining variation.
  • Screw torque and spring pressure.
  • Pad compressibility and creep over time.

So the “$10 fix” isn’t “slap on thicker pads.” It’s “restore correct contact across the entire stackup.”
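
If you want the intuition with numbers, here is a back-of-the-envelope sketch. The package area, pad thicknesses, and conductivities are illustrative assumptions, not measurements from any specific card:

# Rough per-contact thermal resistance: R = thickness / (k * area), in K/W.
# Assumed contact area: a ~12 x 14 mm memory package (1.68e-4 m^2).
awk 'BEGIN {
  area = 0.012 * 0.014
  printf "1.0 mm pad @  6 W/mK : %.2f K/W\n", 0.001  / (6  * area)
  printf "2.0 mm pad @ 12 W/mK : %.2f K/W\n", 0.002  / (12 * area)
  printf "0.1 mm air gap       : %.2f K/W\n", 0.0001 / (0.026 * area)
}'

Doubling W/mK only cancels out doubling the thickness, while even a thin air gap is roughly twenty times worse than either pad. That is the whole argument for "correct contact" over "premium material" in one calculation.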

What “better pads” often really means

Marketing loves W/mK numbers. Practical engineering loves outcomes. In my experience, “better pads” usually means one or more of these:

  • Correct thickness (most important).
  • More compliant material that compresses to fit small variations without lifting the cold plate.
  • Cleaner installation: correct placement, no wrinkles, no shifted pads that miss the chip.
  • Fresh material that hasn’t hardened from years of heat cycling.

Fast diagnosis playbook (find the bottleneck fast)

When a GPU is underperforming or unstable, you can waste hours “tuning” power limits and undervolts. Don’t. First determine what’s actually limiting you.

First: identify the limiter (thermal vs power vs software)

  1. Check clocks and throttle reasons under load (a query sketch follows this list). If clocks dip while utilization is high, you’re likely hitting a limit.
  2. Check hotspot and memory temps (if exposed). A high hotspot delta or high memory junction is a classic pad/contact signal.
  3. Check fan behavior. If fans ramp hard but core temp looks “okay,” that’s often hotspot/memory/VRM pulling the string.
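
The throttle-reason flags can save a lot of guessing, because the driver tells you which limiter it is enforcing. A hedged sketch, assuming a reasonably recent driver (field names can vary; check nvidia-smi --help-query-gpu on your system):

# Poll the driver's own throttle flags next to temperature and SM clock.
nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_thermal_slowdown --format=csv -l 1

If the thermal-slowdown flags go Active while the power cap stays Not Active, you are temperature-limited, and the memory/hotspot checks below become the priority.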

Second: isolate which surface is failing contact

  1. Hotspot delta large (hotspot much higher than GPU temp): suspect poor die contact or cooler not seating due to pad thickness.
  2. Memory junction high with core reasonable: suspect memory pads, pad placement, or backplate transfer.
  3. Crashes under transient load (not steady): suspect VRM thermals or power delivery stability, which pads can influence indirectly.

Third: decide whether you need a repaste, a repad, airflow changes, or all three

  • Repaste only when hotspot delta indicates die contact issues and memory temps are fine.
  • Repad only when memory/VRM temps are high and core contact is healthy.
  • Both when the cooler comes off anyway on an older card, or when you suspect the pads are lifting the cooler.
  • Airflow/ducting when everything improves with side panel off or with external fan assist.

The order matters because the failure mode matters. Fix the wrong thing and you can make the right thing worse.

Tooling and metrics that matter

You don’t need a thermal camera to make good decisions (though they’re fun). You need consistent telemetry and repeatable load.

Metrics to watch

  • GPU temperature: general core thermal state, but not enough alone.
  • Hotspot/junction temperature: reveals contact quality and localized heating.
  • Memory junction temperature: especially on cards that expose it; strongly linked to pad effectiveness.
  • Fan speed and duty: indicates what the controller is reacting to.
  • Clocks and voltage: shows throttling and stability.
  • Power draw: confirms whether you’re power-limited or temperature-limited.
  • Error counters: Xid, ECC (if present), driver resets—these can correlate with overheating memory/VRM.
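
Most of these can land in a single rolling capture. A minimal sketch, assuming you are fine writing to a local file (the path is just an example, and temperature.memory often reports N/A on consumer cards):

# One-line telemetry logger: core temp, memory junction (if exposed), clocks,
# power, fan, and utilization every 5 seconds, appended to a CSV.
nvidia-smi --query-gpu=timestamp,temperature.gpu,temperature.memory,clocks.sm,clocks.mem,power.draw,fan.speed,utilization.gpu --format=csv -l 5 >> ~/gpu-telemetry.csv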

A reliability idea (paraphrased)

Paraphrasing John Allspaw: reliability comes from understanding normal behavior and instrumenting systems so you can see when reality diverges.

That applies perfectly here: baseline your “normal,” then look for divergence when load changes.

Practical tasks: commands, outputs, and decisions (12+)

These are intentionally operational. Each task includes: a command, what the output means, and the decision you make from it. The commands are Linux-focused, because production tends to be.

Task 1: Confirm the GPU and driver stack

cr0x@server:~$ nvidia-smi
Wed Jan 21 10:17:02 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0  On |                  N/A |
|  72%   76C    P2              320W / 340W |    8900MiB / 10018MiB |     98%      Default |
+-----------------------------------------+------------------------+----------------------+

Meaning: Confirms model, driver version, and baseline utilization/power. This tells you whether you’re near power cap and whether the load is real.

Decision: If you see low utilization and low clocks, don’t blame thermals yet—look for a software bottleneck first. If you see high utilization and clocks dropping, continue to thermal checks.

Task 2: Log temps, clocks, power every second during a known load

cr0x@server:~$ nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks.mem,power.draw,fan.speed,utilization.gpu --format=csv -l 1
timestamp, temperature.gpu, clocks.sm, clocks.mem, power.draw, fan.speed, utilization.gpu
2026/01/21 10:17:10, 77, 1710, 9501, 323.45, 74, 99
2026/01/21 10:17:11, 78, 1695, 9501, 327.10, 76, 99
2026/01/21 10:17:12, 78, 1545, 9501, 310.02, 78, 99

Meaning: Clocks stepping down while utilization stays high is usually a limiter (thermal, voltage, or power).

Decision: If power draw drops with clocks while temperature rises slowly, you’re likely hitting a thermal threshold or hotspot/memory limit. Next: check hotspot and memory temps if possible.

Task 3: Pull detailed sensor telemetry (including hotspot/mem if exposed)

cr0x@server:~$ nvidia-smi -q -d TEMPERATURE,CLOCK,PERFORMANCE
==============NVSMI LOG==============

Temperature
    GPU Current Temp            : 78 C
    GPU Shutdown Temp           : 93 C
    GPU Slowdown Temp           : 83 C
    GPU Max Operating Temp      : 83 C

Clocks
    Graphics                    : 1545 MHz
    SM                          : 1545 MHz
    Memory                      : 9501 MHz

Performance State
    Performance State           : P2

Meaning: “Slowdown temp” is a clue. If you’re near it and clocks are dropping, you’re likely temperature-limited on a sensor the driver cares about.

Decision: If slowdown is being hit at relatively modest core temps, suspect hotspot/memory/VRM rather than average core.

Task 4: Check kernel logs for GPU resets and thermal events

cr0x@server:~$ sudo dmesg -T | egrep -i "nvrm|xid|thermal|throttle" | tail -n 20
[Wed Jan 21 10:15:42 2026] NVRM: Xid (PCI:0000:01:00): 31, pid=18422, Ch 00000008, intr 00000000
[Wed Jan 21 10:15:43 2026] NVRM: GPU at PCI:0000:01:00: GPU has fallen off the bus.
[Wed Jan 21 10:16:10 2026] thermal thermal_zone0: throttling, current_temp=92000

Meaning: Xid events and “fallen off the bus” can be power, driver, or thermal instability. If it correlates with heavy load and high temps, cooling becomes suspect.

Decision: If you see repeated Xids under load after months of stability, check thermals (pads/VRM contact) before you chase driver ghosts.
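
If the ring buffer has already rotated past the event, the persistent journal is the next stop. A sketch assuming systemd-journald is keeping kernel logs:

# Search the last 24 hours of kernel messages for GPU resets and Xid events.
journalctl -k --since "24 hours ago" | grep -Ei "nvrm|xid|fallen off the bus"

Correlate the timestamps against your job scheduler or monitoring: Xids that cluster at peak load and peak temperature point toward thermals; Xids at random idle times point elsewhere.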

Task 5: Check PCIe link state (bad seating can mimic “thermal instability”)

cr0x@server:~$ sudo lspci -s 01:00.0 -vv | egrep -i "LnkSta|SltSta|Errors|Speed|Width"
LnkSta: Speed 16GT/s, Width x16
SltSta: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Interlock- NoCompl+
Errors: Correctable- Non-Fatal- Fatal- Unsupported-

Meaning: Confirms the link is stable and negotiated correctly. PCIe issues can cause resets that look like thermal trouble.

Decision: If link speed/width flaps or errors appear, don’t open the cooler first—reseat the card, inspect power cables, and validate the slot.

Task 6: Rule out a CPU bottleneck starving the GPU

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(32 CPU)

10:17:35 AM  CPU    %usr   %nice    %sys %iowait   %irq  %soft  %steal  %idle
10:17:36 AM  all   35.12    0.00    4.01    0.12   0.00   0.31    0.00  60.44
10:17:36 AM    7   99.00    0.00    1.00    0.00   0.00   0.00    0.00   0.00

Meaning: One CPU pegged at 99% while GPU utilization is inconsistent can indicate a CPU bottleneck or a single-thread feeder problem.

Decision: If CPU is the limiter, thermal pad work won’t buy you anything. Fix the pipeline first.

Task 7: Confirm fan control and whether the GPU is stuck in a conservative profile

cr0x@server:~$ nvidia-settings -q GPUFanControlState -q GPUTargetFanSpeed
  Attribute 'GPUFanControlState' (server:0[gpu:0]): 0.
  Attribute 'GPUTargetFanSpeed' (server:0[gpu:0]): 74.

Meaning: Fan control state 0 is automatic. Target fan speed indicates the controller is actively trying to manage thermals.

Decision: If fans are low while temps spike, you may have a fan control issue. Don’t blame pads until fan behavior makes sense.
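
A useful intermediate test is to pin the fans high and rerun the load: if temps drop a lot, airflow and fan policy are in play; if they barely move, contact is the stronger suspect. A sketch for X-based desktop setups (these nvidia-settings attributes generally require Coolbits to be enabled, and headless servers need another route):

# Temporarily force manual fan control and pin the fan for an A/B run.
nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=95'
# ...run the same workload and capture telemetry...
# Hand control back to the driver afterwards.
nvidia-settings -a '[gpu:0]/GPUFanControlState=0'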

Task 8: Stress the GPU consistently (compute) and watch stability

cr0x@server:~$ sudo apt-get install -y stress-ng
Reading package lists... Done
Building dependency tree... Done
stress-ng is already the newest version (0.15.06-1ubuntu1).
cr0x@server:~$ stress-ng --cpu 16 --timeout 60s --metrics-brief
stress-ng: info:  [20133] dispatching hogs: 16 cpu
stress-ng: info:  [20133] successful run completed in 60.01s

Meaning: This doesn’t stress the GPU; it stabilizes CPU-side behavior so your GPU workload isn’t starved or jittery.

Decision: If GPU thermals only look bad when CPU is also loaded, you may have case airflow or PSU heat interaction, not just pads.
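
stress-ng keeps the CPU side steady, but you still need a repeatable load on the GPU itself. One option, as an assumption rather than an endorsement of any particular tool, is gpu-burn built from source (it needs the CUDA toolkit to compile):

# Build and run gpu-burn for a fixed-length, repeatable compute load.
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn && make
./gpu_burn 300   # 300-second burn; watch telemetry in a second terminal

Whatever load you choose, keep it identical before and after maintenance, or the comparison is meaningless.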

Task 9: Measure “hotspot delta” when available (proxy via sensors)

cr0x@server:~$ sudo apt-get install -y lm-sensors
Reading package lists... Done
Building dependency tree... Done
lm-sensors is already the newest version (1:3.6.0-7ubuntu1).
cr0x@server:~$ sensors
nvme-pci-0200
Adapter: PCI adapter
Composite:    +47.9°C  (low  = -273.1°C, high = +84.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +62.0°C

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +71.0°C

Meaning: Many systems won’t expose GPU hotspot via lm-sensors; that’s fine. Use nvidia-smi and your application logs.

Decision: If the platform runs hot overall, your GPU cooling job is harder. Address intake/exhaust before surgical pad work.

Task 10: Validate power limit and whether you’re power-throttling

cr0x@server:~$ nvidia-smi -q -d POWER | egrep -i "Power Limit|Enforced|Default|Min|Max"
    Power Limit                  : 340.00 W
    Default Power Limit          : 340.00 W
    Enforced Power Limit         : 340.00 W
    Min Power Limit              : 100.00 W
    Max Power Limit              : 370.00 W

Meaning: Confirms you’re not accidentally capped. In corporate fleets, someone often “temporarily” lowered power limits and forgot.

Decision: If power limit is far below default, fix that before you open hardware. If power is normal and clocks still drop with rising temps, keep investigating thermals.
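
If the limit really was lowered, restoring it is a one-liner; the value must sit within the Min/Max range reported above, and it does not persist across reboots unless you script it:

# Restore the power limit to the card's default (example value from above).
sudo nvidia-smi -pl 340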

Task 11: Watch application-level throughput during a thermal event

cr0x@server:~$ tail -n 8 /var/log/gpu-job.log
step=1840 imgs/sec=1210 gpu_util=99% sm_clock=1710 mem_clock=9501 temp=77
step=1841 imgs/sec=1198 gpu_util=99% sm_clock=1695 mem_clock=9501 temp=78
step=1842 imgs/sec=1042 gpu_util=99% sm_clock=1545 mem_clock=9501 temp=78
step=1843 imgs/sec=1035 gpu_util=99% sm_clock=1545 mem_clock=9501 temp=79

Meaning: Throughput drops in lockstep with SM clock. That’s not “random slowness.” It’s a limiter.

Decision: If throughput correlates strongly with temperature, you have a thermal control problem to solve, and pads are a prime suspect when memory/hotspot are implicated.
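
To make that correlation visible without a dashboard, a quick parser over the job log works; the field names here assume the exact log format shown above:

# Extract throughput, SM clock, and temperature columns for eyeballing.
awk -F'[= ]' '{ for (i = 1; i < NF; i++) {
    if ($i == "imgs/sec") ips = $(i+1)
    if ($i == "sm_clock") clk = $(i+1)
    if ($i == "temp")     t   = $(i+1)
  } print ips, clk, t }' /var/log/gpu-job.log | tail -n 20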

Task 12: Verify the system isn’t heat-soaking due to fan curves or chassis control

cr0x@server:~$ sudo ipmitool sdr | grep -i fan
FAN1         | 4200 RPM          | ok
FAN2         | 4100 RPM          | ok
FAN3         | 1900 RPM          | ok
FAN4         | 1800 RPM          | ok

Meaning: In servers, chassis fans can be in a quiet profile that starves GPUs of fresh air.

Decision: If chassis fans are low while GPUs are hot, fix platform fan policy first. Repadding won’t defeat a chassis that refuses to move air.

Task 13: Baseline after changes with a consistent capture

cr0x@server:~$ mkdir -p ~/gpu-thermal-baselines
cr0x@server:~$ nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,power.draw,fan.speed,utilization.gpu --format=csv -l 2 | head -n 10 | tee ~/gpu-thermal-baselines/baseline.csv
timestamp, temperature.gpu, clocks.sm, power.draw, fan.speed, utilization.gpu
2026/01/21 10:20:10, 44, 210, 24.12, 30, 0
2026/01/21 10:20:12, 45, 210, 24.05, 30, 0
2026/01/21 10:20:14, 45, 210, 24.01, 30, 0

Meaning: A baseline file gives you “before/after” evidence. Otherwise you’ll rely on vibes, which is not a metric.

Decision: Don’t do pad work without a baseline. If you can’t prove improvement, you can’t tell whether you introduced a new risk.
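
Once you have a post-change capture (after-repad.csv below is a hypothetical second file, captured the same way), a tiny comparison keeps the argument honest; the column positions assume the CSV layout from the command above:

# Average temperature, SM clock, and power from each capture file.
for f in ~/gpu-thermal-baselines/baseline.csv ~/gpu-thermal-baselines/after-repad.csv; do
  awk -F', ' 'NR > 1 { temp += $2; clk += $3; pwr += $4; n++ }
    END { if (n) printf "%s: avg %.1f C, %.0f MHz SM, %.1f W\n", FILENAME, temp/n, clk/n, pwr/n }' "$f"
done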

Choosing pads: thickness, hardness, conductivity, and reality

Thickness is the boss

The number-one decision is thickness. Not brand. Not W/mK. Thickness.

Why? Because thickness determines whether you get contact at all, and whether you accidentally reduce die pressure. The GPU die is unforgiving: if the cold plate isn’t seated properly, hotspot goes up, clocks go down, and you’ve traded a memory problem for a core problem.

Practical guidance:

  • Start with known-good thickness maps for your exact board variant when possible. “Same GPU model” is not “same PCB.”
  • If you must measure, measure the old pads and verify with imprint testing (more on that below). Old pads can be compressed or deformed, so treat the measurement as a starting point, not gospel.
  • Don’t mix thickness randomly. If one section gets thicker, you might lift another contact surface.

Hardness/compressibility: the hidden variable

Pads aren’t just thickness; they’re springs with thermal conductivity. Hard pads resist compression, which can be good for keeping contact on tall components, but risky for die seating. Softer pads conform better, but can “creep” over time and reduce consistent pressure.

When you see people report wildly different results using “the same thickness,” hardness is usually why.

Thermal conductivity (W/mK) is not a lie, just incomplete

Higher W/mK can help, but only if:

  • the pad actually contacts both surfaces,
  • it compresses properly,
  • it doesn’t introduce a larger gap somewhere else.

Also: datasheets are often tested under specific compression and temperature conditions. Your GPU is a chaotic real-world lab with uneven pressure, micro-gaps, and airflow limitations.

Pad vs putty: know what you’re trading

Thermal putty (gap filler paste) has become popular because it conforms easily to uneven surfaces and can reduce the “wrong thickness” risk. It can be excellent for VRM/odd shapes.

Downsides:

  • Messier, harder to rework cleanly.
  • Can migrate if applied excessively.
  • Long-term stability varies by compound and temperature cycling.

If you’re running production GPUs where repeatability matters more than internet points, pads are still the predictable choice—when you have the right thickness.

When the backplate is part of the thermal system

Some cards rely on backplate pads to pull heat from memory or the rear of the PCB. If those pads are missing or too thin, you lose a heat-spreading surface. If they’re too thick, you can warp the PCB and create new contact problems at the front.

PCB warp is not just aesthetic. Warped boards change pressure distribution, which can raise hotspot delta even if your paste is perfect.

Joke #2: Thermal pads age like milk, not wine—if you’re lucky, you notice the smell before the crash.

Checklists / step-by-step plan (repaste + repad without regret)

Pre-flight checklist: decide if you should even open the card

  • Collect baselines: temps, clocks, fan, power under a repeatable load (see tasks above).
  • Confirm the limiter: is it memory junction, hotspot delta, or just case airflow?
  • Confirm the variant: exact board partner model and revision if possible.
  • Accept the warranty trade: if you can’t afford the risk, don’t do it. Production doesn’t care about your curiosity.
  • Schedule downtime: treat it like a maintenance window.

Tools and supplies (minimal but correct)

  • ESD precautions (strap or at least disciplined grounding).
  • Correct screwdriver bits (don’t strip tiny screws and then improvise like a villain).
  • Isopropyl alcohol and lint-free wipes.
  • Thermal paste for the die (a known, stable compound).
  • Thermal pads in the correct thicknesses; buy extra.
  • Calipers (helpful), and a notepad for pad mapping.

Step-by-step: disassembly with an SRE mindset

  1. Power down, unplug, discharge. Remove the card, label it, and take photos as you go. Photos are your rollback plan.
  2. Remove the cooler evenly. Loosen screws in a cross pattern. You’re trying to avoid uneven stress on the PCB.
  3. Document pad locations and thickness. Create a “pad map” in your notes: memory pads, VRM pads, backplate pads, any odd spots.
  4. Inspect old pads. Look for glossy untouched areas (no contact), torn sections (shifted), or brittle/hardened material (aged).
  5. Clean paste and residue. Remove old paste from die and cold plate carefully. Clean pad residue where needed without scraping components.

Step-by-step: installing new pads without lifting the cold plate

  1. Cut pads cleanly. Slightly smaller than the chip footprint is usually safer than overhang that can interfere with other surfaces.
  2. Place pads precisely. Memory chips must be fully covered. VRM pads must cover the intended components; don’t “bridge” onto capacitors unless the design expects it.
  3. Mind the protective films. Remove both sides. Miss one and you’ve created an insulating layer with excellent vibes and terrible thermals.
  4. Apply paste to the die. Use a reliable method (thin spread or small central blob depending on paste viscosity and die size). The goal is full coverage without excess squeeze-out.
  5. Dry fit and imprint test (recommended). Before final assembly, lightly seat the cooler, then remove it once to inspect pad compression marks and paste spread. You’re looking for “contact everywhere” and “cold plate seated.”
  6. Final assembly with torque discipline. Tighten in a cross pattern in small increments. If screws have springs, compress them evenly.

Post-flight checklist: prove the fix

  • Boot and idle check: verify fans spin, no artifacting, no driver issues.
  • Load test: run the same workload as baseline. Capture the same telemetry.
  • Compare deltas: core temp, hotspot delta (if available), memory temps, clocks under load, fan speed for same throughput.
  • Stability soak: 30–60 minutes. Thermal issues often appear after heat soak, not in the first minute.

Three corporate mini-stories (realistic and painful)

1) Incident caused by a wrong assumption: “Same GPU model means same pad thickness”

A team I worked with ran a mixed fleet of GPUs bought over multiple quarters. Same GPU name on paper, same vendor, same driver image. Someone noticed memory junction temperatures creeping up on a subset of nodes and suggested a repad campaign. Sensible. Preventative maintenance is cheaper than surprise downtime.

They ordered pads based on a thickness map posted for “that GPU.” The first few cards improved. Confidence rose. The rollout accelerated, because humans love a success narrative and hate waiting.

Then a different batch started failing validation: hotspot climbed, clocks dropped, and one system started hard-resetting under load. The graphs were insulting: memory looked better, but the core started throttling earlier than before.

Root cause wasn’t mysterious. The later batch had a slightly different cooler plate and component stackup. The “universal” thickness lifted the cold plate just enough to reduce die pressure and create a hotspot delta problem. Memory was cooler; the GPU core was now the limiting factor.

The fix was slow and unglamorous: stop the rollout, identify board revisions, build a thickness map per revision, and rework the already-touched cards that were now worse. The lesson wasn’t “never repad.” It was “never assume mechanical equivalence from a marketing name.”

2) Optimization that backfired: “Max W/mK pads everywhere”

In another shop, a performance-minded engineer decided to standardize on a premium, high-conductivity pad material for everything: VRAM, VRM plate, backplate, even where the factory used softer pads. The goal was noble: reduce fan speeds and improve sustained boost clocks.

On the bench, the first card looked good for a short test. Fans were calmer. Memory temps dipped a bit. The change was declared a win and repeated across a small batch.

Two weeks later, support tickets: intermittent instability under long training runs. Nothing obvious in core temperature. A couple of nodes threw GPU driver resets after hours, not minutes. The team did what teams do: blamed software first. They rebuilt images, pinned driver versions, swapped cables, even questioned the PSU rails.

The real issue was mechanical. The “premium” pads were significantly harder. Under the same torque, they didn’t compress like the original pads, which changed pressure distribution. The die contact was still “okay” at first, but after repeated thermal cycles, micro-movement and creep made it worse. The hotspot delta increased, and local thermal stress increased error likelihood.

The fix wasn’t to abandon better materials; it was to respect the system. They switched to a more compliant pad for specific zones and used high-conductivity pads only where the gap and pressure were appropriate. Performance returned, and so did stability. The optimization failed because it optimized a spec sheet, not a mechanical assembly.

3) The boring but correct practice that saved the day: “Baseline, change one thing, validate”

A reliability-focused team had a policy: no thermal maintenance without a before/after artifact. Every node had a simple script that captured nvidia-smi telemetry under a standardized load. The file landed in a central place. It wasn’t fancy, but it was consistent.

One day, a new technician repadded a card and the GPU started underperforming. They didn’t argue about whether it “felt slower.” They pulled the baseline and compared. The post-change clocks were 10–15% lower at the same utilization, with higher fan speed. That’s a failing change, not “variance.”

Because they had the artifact, rollback was straightforward: open the card again, inspect contact marks, and correct pad thickness in one area that was preventing full cold plate seating. After the fix, the telemetry matched the original baseline and improved memory temps slightly.

The policy looked bureaucratic until it wasn’t. The whole incident ended in an afternoon instead of a week of forum archaeology and driver roulette. Boring process saved real time, which is the only metric that matters during an incident.

Common mistakes: symptoms → root cause → fix

1) Memory temps worse after repad

  • Symptoms: Memory junction climbs faster than before; fans ramp; performance drops after heat soak.
  • Root cause: Pads not contacting the heatsink (too thin), protective film left on, pad shifted off a chip, or pad cut too small leaving an edge gap.
  • Fix: Reopen and inspect compression marks; verify film removal; confirm pad footprint covers the memory IC fully; adjust thickness per zone.

2) Core hotspot delta increases after repad

  • Symptoms: Core temp seems “fine,” but hotspot is much higher; clocks throttle earlier; paste imprint looks uneven.
  • Root cause: Pads too thick or too hard, lifting the cold plate or reducing mounting pressure on the die.
  • Fix: Reduce pad thickness or switch to more compliant pads; retorque in cross pattern; perform an imprint test to confirm seating.

3) Random crashes after 20–60 minutes

  • Symptoms: Long-run instability; driver resets; no immediate thermal shutdown.
  • Root cause: VRM thermal stress due to poor pad contact on MOSFETs/plates, or PCB warp causing localized heating.
  • Fix: Verify VRM pad placement and coverage; ensure correct thickness; check for backplate pad over-thickness causing bowing.

4) Fans louder but temps unchanged

  • Symptoms: Same temps at higher fan duty; noise increases; little performance improvement.
  • Root cause: You improved one path (e.g., memory to backplate) but the limiting path is case airflow; or the cooler is clogged/dusty.
  • Fix: Clean heatsink fins; fix intake/exhaust; consider ducting; verify chassis fans aren’t in quiet mode.

5) Paste “pumps out” quickly after a repad

  • Symptoms: Good temps for a day, worse a week later; hotspot delta creeps upward.
  • Root cause: Uneven pressure or excessive movement from pads acting like stiff springs; thermal cycling shifts paste away from the die center.
  • Fix: Correct pad compressibility; use a stable paste; verify mounting pressure consistency; avoid overtightening that warps the assembly.

6) “Everything is cooler” but performance is still down

  • Symptoms: Temps improved, yet clocks aren’t recovering.
  • Root cause: Power limit or voltage/frequency curve issue, or the workload changed; sometimes a driver or firmware setting got altered during the maintenance window.
  • Fix: Re-check power limits, application configuration, and throttle reasons; compare to pre-change baselines.

FAQ

1) Do thermal pads really “wear out”?

Yes. Heat cycling can harden pads, reduce compliance, and degrade contact. They don’t evaporate, but they stop behaving like good gap fillers.

2) Should I always replace pads when I repaste?

If the card is older or you already have it open, often yes—because reusing disturbed pads is gambling with contact. If the card is new and pads are intact, you can repaste only, but be careful not to tear or shift pads during disassembly.

3) Is higher W/mK always better?

Not automatically. A slightly lower W/mK pad that compresses correctly and preserves die contact can outperform a “better” pad that lifts the cooler.

4) How do I know the right pad thickness for my GPU?

Ideally: a thickness map for your exact board revision. If you must measure: use the original pads as a starting point, then confirm with imprint tests to ensure contact and seating.

5) Can wrong pads damage the GPU?

Indirectly, yes—by causing sustained overheating of memory or VRMs, or by warping the PCB and stressing solder joints over time. The immediate risk is throttling and instability; the long-term risk is wear you can’t see.

6) Why does my core temp look fine but I still throttle?

Because “core temp” is often not the hottest sensor. Hotspot and memory junction can hit limits first. Poor contact can create local hotspots that average temps hide.

7) Do I need to pad the backplate?

Only if the design expects it or you can verify it improves heat spreading without warping the board. Random backplate padding can create more problems than it solves.

8) Pads or thermal putty for VRAM?

Pads are cleaner and more repeatable if you know the thickness. Putty is more forgiving for uneven gaps but messier and more variable long-term. In fleets, repeatability usually wins.

9) What’s a “good” hotspot delta?

It varies by GPU and cooler design, but large deltas often signal contact issues. If your delta jumps significantly after maintenance, assume you did something wrong and re-check seating.

10) How long should I soak-test after repadding?

At least 30 minutes under steady load, and ideally a longer run that matches your real workload. Many failures show up after the whole assembly heat-soaks.

Conclusion: practical next steps

Thermal pads are not a magic upgrade. They’re a mechanical interface that decides whether your GPU’s memory and VRMs get to share a heatsink or fend for themselves.

If you’re seeing high memory junction temps, unstable long runs, or a hotspot delta that doesn’t make sense, treat pads as a first-class suspect. But do it like operations, not like a hobby: baseline, change one thing, validate, and keep a rollback plan.

  1. Capture a baseline under a repeatable load (temps, clocks, power, fan).
  2. Identify the limiter (core, hotspot, memory, VRM, airflow, power).
  3. If pads are implicated, source correct thicknesses for your board revision.
  4. Repaste/repad with imprint testing and disciplined torque.
  5. Prove the outcome with the same telemetry capture you started with.

The best part of doing this right is that it feels boring. That’s how you know it’s production-grade.
