Debian 13: Thermal throttling ruins throughput — prove it and fix cooling/power limits

Nothing makes a storage engineer twitch like a benchmark that starts strong, then slowly collapses into mediocrity. Your NVMe array looked heroic for 30 seconds, then writes turned into a gentle drizzle. The graphs look like a beach cliff: high plateau, then a sad slide down.

On Debian 13, this is often not “Debian being slow.” It’s physics, firmware, and power policy. Thermal throttling (CPU, NVMe, HBA, chipset) quietly clamps frequency, link speed, or queue depths, and your throughput goes with it. The trick is proving it cleanly, then fixing it without cooking your hardware or tripping breakers.

Fast diagnosis playbook

If you’re on-call and the storage team is staring at a wobbling throughput chart, you don’t have time for interpretive dance. You need a sequence that finds the bottleneck fast.

First: confirm it’s time-dependent and temperature-correlated

  • Run a steady workload (fio or your real workload replay) for at least 5–10 minutes. Throttling loves “after warm-up.”
  • Watch CPU frequency + throttle flags and NVMe temperature at the same time. If performance droops while temps climb and limits kick in, you’ve got a story.

Second: identify which component throttles

  • CPU throttle: frequency downshifts, power limit events (PL1/PL2), high package temp, “throttled” counters increment.
  • NVMe throttle: controller temperature rises, “thermal management” state changes, SMART warnings appear, latency spikes.
  • PCIe/link weirdness: device link retrains, error counters rise, negotiated width/speed drops under heat.
  • Cooling control: fans aren’t ramping, BMC stuck in “acoustic” mode, laptop firmware refuses to spin up.

Third: decide whether it’s policy or physics

  • Policy: powersave governor, too-low power limits, BIOS “energy efficient” presets, platform profiles.
  • Physics: dust, blocked intake, missing NVMe heatsinks, bad thermal paste, dead fan, wrong fan zone mapping.

Do those three steps and you’ll stop arguing about “Debian vs kernel vs filesystem” and start arguing about the thing that is actually heating up, which is progress.

What throttling looks like in real throughput

Thermal throttling is not subtle. It just looks subtle if you only measure one thing.

Classic pattern:

  • fio starts at, say, 6–7 GB/s reads on a striped NVMe set.
  • After 60–180 seconds, throughput glides down to 3–4 GB/s.
  • Latency grows. Queue depths stop helping.
  • CPU frequency drops a few hundred MHz, or NVMe controller temp crosses a threshold, or both.

Then you stop the test, the system cools, and the next test “mysteriously” looks good again. This is why throttling survives so long in production: short benchmarks and quick repros don’t catch it.

One quote worth keeping around comes from Werner Vogels, Amazon’s CTO: “Everything fails, all the time.” Throttling is a failure mode that looks like a performance regression until you instrument it like a reliability problem.

Interesting facts and context (because history repeats)

  1. Intel introduced aggressive turbo behavior years ago, and “short-term fast, long-term slower” became normal: PL2 boosts, PL1 sustains.
  2. NVMe drives have formal thermal management states (often two levels), and many consumer drives are tuned to throttle early in poor airflow.
  3. Data centers used to design around spinning disks; NVMe density raised heat per rack unit dramatically, and airflow assumptions broke.
  4. Linux CPU frequency scaling has had multiple eras (acpi-cpufreq, intel_pstate, amd_pstate), and defaults changed over time; “it worked on Debian X” is not evidence.
  5. RAPL power limits were designed for energy control and can be exposed through MSRs; vendors sometimes set conservative values to meet acoustic or platform thermal targets.
  6. Some server BIOS profiles trade sustained performance for noise; “Balanced” often means “quiet until it hurts.”
  7. PCIe error rates can climb with temperature; marginal signaling gets worse as components heat, and link retraining can reduce effective bandwidth.
  8. Thermal paste degradation is real; paste pumps out over years of heat cycling, raising junction temperature at the same load.
  9. Laptops pioneered aggressive throttling heuristics (skin temperature, battery protection), and those ideas leaked into small-form-factor servers and edge devices.

Prove it: instrumentation that holds up in a postmortem

“It’s throttling” is not a claim; it’s a set of counters, temperatures, and time-series evidence. Your goal is a single timeline where:

  • Workload stays constant.
  • Performance degrades.
  • A thermal/power limit state changes at the same time.

That’s how you win arguments with vendors, with your own team, and with the part of your brain that wants the problem to be “a tunable.”

Also: don’t collect everything. Collect the right things. For CPU: frequency, package temperature, throttling counters, power limits. For NVMe: controller temp and thermal events. For platform cooling: fan RPM and sensor temps. For storage: latency distributions, not just throughput.
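
A minimal collection sketch, assuming the tools installed in the tasks below (turbostat, lm-sensors, nvme-cli) are present and /dev/nvme0 is the device under test; the script path, output directory, and 2/10-second intervals are illustrative, not a standard:

cr0x@server:~$ sudo tee /usr/local/bin/throttle-telemetry.sh <<'EOF'
#!/bin/bash
# Collect timestamped CPU, thermal, and NVMe telemetry during a sustained load.
# Stop with Ctrl-C; the trap cleans up the backgrounded turbostat.
DEV=/dev/nvme0
OUTDIR=/var/tmp/throttle-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUTDIR"

# CPU frequency, package temperature, and package power every 2 seconds.
turbostat --Summary --interval 2 > "$OUTDIR/turbostat.log" 2>&1 &
TURBO_PID=$!
trap 'kill "$TURBO_PID"' EXIT

# Platform temperatures and NVMe thermal counters every 10 seconds, each block timestamped.
while true; do
    date -Is >> "$OUTDIR/sensors.log"; sensors >> "$OUTDIR/sensors.log"
    date -Is >> "$OUTDIR/nvme.log";    nvme smart-log "$DEV" >> "$OUTDIR/nvme.log"
    sleep 10
done
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/throttle-telemetry.sh
cr0x@server:~$ sudo /usr/local/bin/throttle-telemetry.sh

Run it in a second terminal while the workload runs in the first; afterwards you have three timestamped files to line up against the throughput curve.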

Joke #1: Thermal throttling is your server politely saying, “I’m tired,” right before it starts doing half the work for the same salary.

Practical tasks: commands, outputs, and decisions

These are production-grade tasks. Each one includes: command, sample output, what it means, and the decision you make. Run them during a sustained load, not at idle. Where possible, run two terminals: one for workload, one for telemetry.

Task 1: Confirm kernel and platform basics

cr0x@server:~$ uname -a
Linux server 6.12.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.0-1 (2025-10-01) x86_64 GNU/Linux

What it means: You’re on Debian 13 kernel line and the specific version is recorded. Throttling behavior can change with cpufreq drivers and scheduler tweaks.

Decision: Keep this string in your incident notes. If you later adjust BIOS or firmware, you want to control variables.

Task 2: Check CPU governor and driver (policy check)

cr0x@server:~$ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  hardware limits: 400 MHz - 5300 MHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 5300 MHz.
                  The governor "powersave" may decide which speed to use
  current CPU frequency: 1200 MHz (asserted by call to hardware)

What it means: intel_pstate + powersave can be totally fine for many workloads, but it’s a suspect when throughput drops under sustained load.

Decision: If you’re diagnosing a throughput regression, temporarily force the performance governor to remove policy as a variable (a persistent fix comes later in this article).

Task 3: Force CPU governor to performance (temporary test)

cr0x@server:~$ sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3

What it means: You’ve reduced one common source of “mysterious droop.”

Decision: Re-run the same workload and see if droop remains. If droop disappears, you just found a policy issue (or you hid a thermal issue by finishing faster—keep reading).

Task 4: Install and use turbostat to catch CPU throttling

cr0x@server:~$ sudo apt-get update
Hit:1 http://deb.debian.org/debian trixie InRelease
Reading package lists... Done
cr0x@server:~$ sudo apt-get install -y linux-cpupower
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  linux-cpupower
cr0x@server:~$ sudo turbostat --Summary --interval 2
Summary: 2.00 sec samples
  PkgTmp  Bzy_MHz  Avg_MHz  Busy%  IRQ     SMI  PkgWatt  CorWatt  GFXWatt
   93     2800     2200     84.0   12000     0   150.2    120.8     0.0

What it means: Package temperature is high (93°C). If you see frequency collapse while Busy% stays high, that’s classic thermal or power limiting. On some platforms turbostat can also show throttle counters or limit-reason columns.

Decision: If package temperature approaches platform TJmax and performance droops, cooling or power limits are in play. Next: correlate with throughput and power limits.

Task 5: Prove power limiting via RAPL (Intel) counters

cr0x@server:~$ sudo apt-get install -y msr-tools
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  msr-tools
cr0x@server:~$ sudo modprobe msr
cr0x@server:~$ sudo rdmsr -a 0x610
00000000dd8080c8
00000000dd8080c8
00000000dd8080c8
00000000dd8080c8

What it means: MSR 0x610 is MSR_PKG_POWER_LIMIT on Intel CPUs. The raw value encodes the PL1/PL2 wattages and time windows in platform-specific units, so it needs decoding; keep it as evidence and decode it with a standardized script or tool if you rely on it.

Decision: If your organization isn’t already decoding RAPL limits, do it. If limits are unexpectedly low versus CPU TDP class, suspect BIOS profile or vendor caps.
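
If raw MSR decoding isn’t your idea of fun, the kernel’s powercap interface exposes the same limits in plain microwatts and microseconds; a read-only sketch, assuming the package domain shows up as intel-rapl:0 (domain numbering and names vary by platform, so check the name files first):

cr0x@server:~$ sudo grep . /sys/class/powercap/intel-rapl:0/name /sys/class/powercap/intel-rapl:0/constraint_*_name /sys/class/powercap/intel-rapl:0/constraint_*_power_limit_uw /sys/class/powercap/intel-rapl:0/constraint_*_time_window_us

Constraint 0 is typically the long-term (PL1) limit and constraint 1 the short-term (PL2) limit; divide the microwatt values by one million to get watts and compare them against what a CPU of that TDP class should sustain.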

Task 6: Check AMD pstate status (AMD systems)

cr0x@server:~$ cat /sys/devices/system/cpu/amd_pstate/status
active
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

What it means: AMD pstate active, but governor is powersave. That may be okay; it may also be overly conservative under sustained I/O + compression + checksum workloads.

Decision: For diagnosis, switch to performance and compare sustained results. If results improve, set a persistent policy (later section).

Task 7: Watch temperatures live (simple and effective)

cr0x@server:~$ sudo apt-get install -y lm-sensors
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  lm-sensors
cr0x@server:~$ sudo sensors-detect --auto
# sensors-detect revision 3.6.0
# System: ExampleVendor ServerBoard
# Summary: 3 drivers loaded
cr0x@server:~$ watch -n 2 sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +92.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:        +88.0°C  (high = +90.0°C, crit = +100.0°C)

nvme-pci-0100
Adapter: PCI adapter
Composite:    +78.9°C  (low  = -20.1°C, high = +84.8°C)

What it means: CPU is above “high,” NVMe is nearing “high.” That’s the setup for dual throttling: CPU reduces compute, NVMe reduces its own speed, and latency gets ugly.

Decision: If temps approach “high” during sustained I/O, treat cooling as a primary bottleneck, not an afterthought.

Task 8: Inspect NVMe SMART temperatures and thermal events

cr0x@server:~$ sudo apt-get install -y nvme-cli smartmontools
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  nvme-cli smartmontools
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0x00
temperature                         : 80 C
available_spare                     : 100%
percentage_used                     : 2%
thermal_management_t1_trans_count    : 7
thermal_management_t2_trans_count    : 1
time_for_thermal_management_t1       : 438
time_for_thermal_management_t2       : 55

What it means: The drive entered thermal management state T1 and even T2. That’s throttling with paperwork.

Decision: If these counters rise during your throughput drop window, stop blaming filesystems. Fix airflow and heatsinking around the NVMe devices.
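
To tie those counters to your throughput timeline, snapshot them in a loop while the load runs; a small sketch (device path, 10-second interval, and log path are arbitrary):

cr0x@server:~$ while true; do date -Is; sudo nvme smart-log /dev/nvme0 | egrep -i 'temperature|thermal_management'; sleep 10; done | tee /var/tmp/nvme-thermal.log

If the transition counts climb in the same window where fio’s bandwidth steps down, the drive has documented its own throttling for you.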

Task 9: Check NVMe temperature sensors via smartctl (alternate view)

cr0x@server:~$ sudo smartctl -a /dev/nvme0
smartctl 7.4 2024-12-08 r5560 [x86_64-linux-6.12.0-1-amd64] (local build)
=== START OF INFORMATION SECTION ===
Model Number:                       ExampleNVMe 3.84TB
Firmware Version:                   2B0QEXM7
=== START OF SMART DATA SECTION ===
Temperature:                        80 Celsius
Temperature Sensor 1:               82 Celsius
Temperature Sensor 2:               76 Celsius
Warning  Comp. Temperature Time:    438
Critical Comp. Temperature Time:    55

What it means: Warning and critical temperature time counters are non-zero and increasing. That correlates strongly with throttling and sometimes with media wear acceleration.

Decision: Treat NVMe temperature like you treat memory errors: a reliability signal, not just “performance.”

Task 10: Verify PCIe negotiated link speed/width (heat can expose marginality)

cr0x@server:~$ sudo lspci -s 01:00.0 -vv | sed -n '/LnkCap:/,/LnkSta:/p'
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x4 (ok)

What it means: Link is capable of 16GT/s but is currently at 8GT/s. That can happen due to BIOS settings, signal integrity, or sometimes error recovery after heat-induced issues.

Decision: If you see downgraded link during the “slow” phase, investigate PCIe errors and physical installation (riser, cable, slot, airflow). Fixing that can double bandwidth instantly.
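
A retrain can come and go, so sample the negotiated state during the slow phase rather than once after the fact; a sketch using the same 01:00.0 address as above (substitute your device, and pick whatever log path suits you):

cr0x@server:~$ while true; do printf '%s ' "$(date -Is)"; sudo lspci -s 01:00.0 -vv | grep 'LnkSta:'; sleep 5; done | tee /var/tmp/lnksta.log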

Task 11: Look for PCIe errors that align with heat and throttling

cr0x@server:~$ sudo dmesg -T | egrep -i 'aer|pcie|nvme|thrott'
[Mon Dec 29 11:04:12 2025] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:01:00.0
[Mon Dec 29 11:04:12 2025] nvme 0000:01:00.0: AER: can't recover (no error_detected callback)
[Mon Dec 29 11:05:01 2025] nvme nvme0: I/O 123 QID 4 timeout, aborting

What it means: Even corrected PCIe errors can lead to performance issues; timeouts are worse. Heat can make borderline links misbehave.

Decision: If errors increase during sustained load and temps are high, address airflow/slotting/cabling, and update firmware. Don’t “tune around” a flaky link.

Task 12: Measure throughput and latency in a way that catches throttling (fio time series)

cr0x@server:~$ sudo apt-get install -y fio
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  fio
cr0x@server:~$ fio --name=warmth --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --rw=read --bs=128k --iodepth=32 --numjobs=1 --runtime=600 --time_based=1 --group_reporting=1 --status-interval=10
warmth: (groupid=0, jobs=1): err= 0: pid=22114: Mon Dec 29 11:07:10 2025
  read: IOPS=52.1k, BW=6512MiB/s (6827MB/s)(63.6GiB/10s)
...
warmth: (groupid=0, jobs=1): err= 0: pid=22114: Mon Dec 29 11:09:20 2025
  read: IOPS=34.0k, BW=4250MiB/s (4457MB/s)(41.5GiB/10s)
...
warmth: (groupid=0, jobs=1): err= 0: pid=22114: Mon Dec 29 11:17:10 2025
  read: IOPS=33.2k, BW=4149MiB/s (4350MB/s)(40.5GiB/10s)

What it means: Same job, same parameters, throughput dropped and stayed low. That’s sustained limiting, not random noise.

Decision: When you see step-down behavior, immediately correlate timestamps with turbostat/sensors/nvme smart-log. If correlation is strong, stop changing fio flags. Start fixing cooling/power.
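
For evidence you can archive, let fio write its own per-interval logs instead of screenshotting status lines; the same job with logging flags added (the one-second averaging window and the “warmth” prefix are arbitrary choices):

cr0x@server:~$ fio --name=warmth --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --rw=read --bs=128k --iodepth=32 --numjobs=1 --runtime=600 --time_based=1 --group_reporting=1 --status-interval=10 --write_bw_log=warmth --write_lat_log=warmth --log_avg_msec=1000

fio writes per-job *_bw.*.log and latency log files (one timestamped sample per line) that you can plot directly against the turbostat and sensors timelines.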

Task 13: Observe block layer and device queue stats (is the device stalling?)

cr0x@server:~$ iostat -dxm 2 5 nvme0n1
Linux 6.12.0-1-amd64 (server) 	12/29/2025 	_x86_64_	(64 CPU)

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
nvme0n1        52000.0   6500.0     0.0   0.00    0.60    128.0   31.0  99.0
nvme0n1        34000.0   4250.0     0.0   0.00    0.95    128.0   32.0  99.0

What it means: %util stays pinned while throughput drops and await rises. The device is the limiter (or the path to it), not the CPU starving it of requests.

Decision: Point your flashlight at NVMe temperature, PCIe link state, and physical cooling.

Task 14: Check systemd journal for thermal/power hints

cr0x@server:~$ sudo journalctl -k -b | egrep -i 'thermal|thrott|powercap|rapl|cpu frequency' | tail -n 20
Dec 29 11:04:55 server kernel: intel_rapl_common: Found RAPL domain package
Dec 29 11:05:03 server kernel: CPU0: Core temperature above threshold, cpu clock throttled
Dec 29 11:05:03 server kernel: CPU2: Core temperature above threshold, cpu clock throttled

What it means: The kernel is literally telling you it throttled. Believe it.

Decision: Take this as confirmation, then fix root causes: airflow, heatsinks, fan control, or power profile.

Task 15: On servers, check IPMI sensors and fan behavior

cr0x@server:~$ sudo apt-get install -y ipmitool
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  ipmitool
cr0x@server:~$ sudo ipmitool sensor | egrep -i 'fan|temp|inlet|exhaust|cpu'
Inlet Temp       | 29.000     | degrees C  | ok
Exhaust Temp     | 51.000     | degrees C  | ok
CPU1 Temp        | 92.000     | degrees C  | ok
FAN1             | 1800.000   | RPM        | ok
FAN2             | 1700.000   | RPM        | ok

What it means: CPU is hot, but fans are not screaming. That’s often a control policy issue (quiet mode, wrong sensor mapping) or a physical blockage.

Decision: Escalate to BIOS/BMC fan profile adjustments. If you don’t control that, involve whoever owns platform firmware. Software alone won’t spin the fans faster if the BMC disagrees.

Task 16: Quick check for background “helpfulness” (tlp/power-profiles-daemon)

cr0x@server:~$ systemctl status power-profiles-daemon.service --no-pager
● power-profiles-daemon.service - Power Profiles daemon
     Loaded: loaded (/lib/systemd/system/power-profiles-daemon.service; enabled)
     Active: active (running) since Mon 2025-12-29 10:52:11 UTC; 25min ago

What it means: A daemon may be pushing “balanced” behavior. On laptops, this is normal. On storage nodes, it can be a quiet throughput killer.

Decision: Decide explicitly: performance node or not. If yes, pin performance profile and document it. If no, accept the throughput ceiling.
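
If the daemon stays installed, at least pin the profile deliberately; power-profiles-daemon ships a powerprofilesctl CLI (the available profile names depend on what the platform advertises, and setting one may require polkit authorization):

cr0x@server:~$ powerprofilesctl list
cr0x@server:~$ powerprofilesctl get
cr0x@server:~$ powerprofilesctl set performance

On a dedicated storage node, disabling the service and owning the governor yourself (see the systemd unit later in this article) is usually the more predictable choice.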

Storage-specific throttling: NVMe, PCIe, and “why is fio lying?”

Storage throughput is a chain: CPU submits I/O, kernel queues it, PCIe moves it, the controller services it, NAND does the work, and the controller tries not to melt. You can throttle at any step.

NVMe thermal throttling is usually a controller problem, not a NAND problem

The NVMe controller is the hot spot. NAND likes warmth up to a point, but controller silicon and package constraints don’t. Many drives report “Composite” temperature, but also multiple sensors. If Sensor 1 spikes while Composite lags, you’re watching the controller get toasted.

Thermal management transitions (T1/T2) are the smoking gun because they’re counted. If you can show thermal_management_t1_trans_count increments at the same timestamp your throughput drops, you’ve proven the case with the drive’s own paperwork.

Why short benchmarks miss it

Because the drive starts cold-ish, turbo modes are active, SLC cache behavior is generous, and the controller hasn’t saturated its thermal mass. Then it heats up. Then firmware clamps. The first minute is a lie you tell yourself.

PCIe link speed and ASPM can be red herrings—or the entire story

If your PCIe link is negotiated at a lower speed, you can cap throughput regardless of drive capability. Sometimes that’s configuration. Sometimes it’s error recovery. Sometimes it’s riser/cable quality. Sometimes it’s “someone installed a x4 drive into a slot that shares lanes with three other devices, and now we’re learning about topology.”

ASPM (power saving) can cause latency hiccups, but it rarely causes a clean step-down sustained throughput drop by itself. Thermal and power limits do.
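
Checking whether ASPM is even in the picture is cheap; the kernel exposes the active policy, and per-device state shows up in lspci’s link control lines:

cr0x@server:~$ cat /sys/module/pcie_aspm/parameters/policy
cr0x@server:~$ sudo lspci -s 01:00.0 -vv | grep -i 'aspm'

The bracketed entry in the policy file is the one in effect; “performance” disables ASPM link states at the cost of idle power, which is a reasonable trade on a storage node but not a fix for a thermal step-down.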

CPU throttling and storage: the ugly coupling

You might think storage is “I/O bound” and CPU frequency doesn’t matter. Then you turn on checksums, compression, encryption, erasure coding, or high-IOPS small-block workloads. Suddenly the CPU is doing real work per I/O, and frequency matters. Throttling turns into fewer IOPS, higher tail latency, and sometimes strange “device is slow” narratives.

Joke #2: If you want to simulate thermal throttling in a meeting, present your graph and then slowly lower your voice until everyone gets bored.

Fixes that actually move the needle: cooling, airflow, and contact

Cooling fixes are unglamorous. They also work. If you want sustained throughput, you need sustained thermal headroom. Not “it passes a 30-second test.” Sustained.

1) Stop recirculating hot air

Most “mystery throttling” in racks is airflow. The server is inhaling its own exhaust because:

  • Blanking panels are missing.
  • Hot aisle containment is aspirational.
  • Cables block the intake or create turbulence.
  • The server is shoved in a spot with poor cold-air delivery.

Measure inlet vs exhaust via IPMI. If inlet is already high before load, you’re operating with reduced margin.

2) Make sure fans ramp when they should

On many servers, Linux cannot directly command the fans; the BMC does. If the BMC is in a quiet profile, it will happily let the CPU hit the edge and throttle, because it’s optimizing for acoustics and component longevity. Your job is to optimize for throughput and predictability.

Fix is usually in BIOS/BMC settings: “Performance” thermal profile, higher fan minimums, or correct sensor mapping. If you can’t change it, at least document it, because you’re living inside someone else’s assumptions.

3) NVMe heatsinks and airflow matter more than you think

M.2 NVMe in servers is a classic trap. It was born in laptops where airflow is minimal but chassis design is integrated. In servers, people mount M.2 on a board somewhere “because it fits,” then wonder why it throttles. U.2/U.3 bays usually have better airflow, but even then, dense front bays can run hot.

Actionable fixes:

  • Add proper heatsinks (vendor-qualified if you want warranty peace).
  • Add airflow guides/shrouds if the chassis supports them.
  • Ensure adjacent high-heat devices (GPUs, DPUs, HBAs) aren’t dumping heat into the same zone.

4) Reseat and repaste (yes, really)

Over years, thermal paste can degrade. Mounting pressure can shift. Dust insulates. I’ve seen “software regressions” disappear after cleaning a heatsink that looked like a felt filter.

When to do this:

  • One node out of a pool throttles earlier than its peers.
  • Fans behave normally, but CPU temperature rises faster than expected.
  • Heatsink contact issues are plausible (recent maintenance, shipping, vibration).

Be disciplined: schedule downtime, follow ESD practices, use known-good paste, and verify after with the same instrumentation.

Fixes for power limits: RAPL, AMD pstate, and firmware handcuffs

Thermal throttling and power limiting are cousins who borrow each other’s clothes. You can be power-limited long before you’re thermally maxed, especially in servers configured for “efficiency” or constrained power budgets.

CPU governors and platform profiles: stop letting defaults run your storage node

Defaults are designed for general-purpose systems. Storage nodes running sustained I/O are not general-purpose. If you want predictable performance, set explicit policy:

  • CPU governor: performance (or at least a tuned setting that avoids deep downclock under load).
  • Platform profile: performance (if supported).
  • Disable laptop-oriented daemons on servers unless you have a reason.

Persistent governor configuration (systemd)

For Debian, one clean approach is a small systemd unit that sets governor at boot. It’s blunt, but it’s honest. You can also use cpufrequtils, but systemd makes the intent explicit and version-controlled.

cr0x@server:~$ sudo tee /etc/systemd/system/cpupower-performance.service <<'EOF'
[Unit]
Description=Set CPU governor to performance
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl enable --now cpupower-performance.service
Created symlink /etc/systemd/system/multi-user.target.wants/cpupower-performance.service → /etc/systemd/system/cpupower-performance.service.

Decision: Only do this on nodes where sustained performance matters and you understand the power/thermals. Otherwise you’ll “fix throughput” by spending more watts and then rediscover cooling limits.

Intel RAPL: don’t blindly crank limits

RAPL tuning can be effective, but it’s also how you turn “predictable throttling” into “unpredictable shutdowns” if cooling can’t keep up. If you raise PL1/PL2 without fixing airflow, you’re just moving the failure point.

What you should do instead:

  • First set cooling/fan profile to handle sustained load.
  • Then validate sustained package power and temperature under the real workload.
  • Only then consider adjusting limits, and keep changes small and reversible (a sketch follows below).
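
A reversible way to experiment before touching BIOS is the powercap interface; a hedged sketch, assuming the package domain is intel-rapl:0, where the 125 W value is purely illustrative and the change does not survive a reboot:

cr0x@server:~$ cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
cr0x@server:~$ echo 125000000 | sudo tee /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
cr0x@server:~$ cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw

Re-read the file after writing it: some firmware locks the limit and silently ignores or clamps the write. If the value sticks, re-run the sustained test and watch package temperature before you even think about making the change permanent.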

AMD systems: pstate + CPPC can be great, until firmware gets conservative

AMD pstate tends to behave well, but you can still be capped by platform limits (PPT/TDC/EDC) and thermal constraints. The fix path is often BIOS tuning (performance profile, thermal limits, fan curves), not Linux flags.

Common mistakes: symptom → root cause → fix

This section exists because humans repeat mistakes with conviction.

1) Throughput drops after 1–3 minutes, then stabilizes low

Root cause: NVMe thermal throttling (T1/T2), often due to poor airflow/heatsink contact.

Fix: Add heatsinks/airflow guides, ensure fans ramp, relocate drives away from heat sources, validate with nvme smart-log counters.

2) IOPS drop while CPU Busy% stays high, CPU temp near threshold

Root cause: CPU thermal throttling or power limiting (PL1 clamps sustained frequency).

Fix: Fix cooling first (fans/profile/heatsink), then verify governor and BIOS power profile. Use turbostat + journal evidence.

3) Performance is fine on a cold boot, bad after a firmware update

Root cause: BIOS reset to “Balanced” or “Acoustic” thermal profile; power limits changed; fan curve altered.

Fix: Reapply known-good BIOS/BMC settings. Treat firmware updates like config changes with a checklist.

4) One node in a fleet is slower under identical config

Root cause: Physical variation: dust, failing fan, degraded paste, mis-seated heatsink, missing airflow baffle.

Fix: Compare IPMI sensor baselines across nodes, inspect hardware, clean and repaste if needed.

5) Latency spikes, dmesg shows AER corrected errors

Root cause: Marginal PCIe link (riser/cable/slot) exacerbated by temperature; link may retrain down.

Fix: Reseat/replace riser, move slot, update firmware, improve airflow around PCIe area, confirm negotiated link speed.

6) After “optimizing power,” storage got slower

Root cause: Governor/power profile too conservative; deep C-states or downclocking add latency and cap IOPS.

Fix: Set explicit performance policy on storage nodes; measure watts and temps; don’t optimize for power without SLOs.

7) fio shows huge numbers, app still slow

Root cause: Benchmark not representative: too short, wrong queue depth, caching effects, not sustained, or not measuring tail latency.

Fix: Run sustained tests (10+ minutes), log temps and throttle events, and measure p95/p99 latency in your app.

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A company ran a set of storage nodes that handled log ingestion. Nothing exotic: NVMe, Linux, a simple pipeline. A new batch of servers arrived, same vendor, same model name, “same as last time.” The team imaged them, joined them to the cluster, and went home feeling competent.

Two days later, ingestion lag started climbing every afternoon. Not a sharp failure, just a slow drift into “why are we behind again?” The graphs were insulting: everything looked fine in the morning. By mid-day, throughput sagged, latency rose, and the backlog grew. People blamed the pipeline, then the application, then the network.

The wrong assumption was simple: “same model name means same thermal behavior.” The new servers shipped with a different NVMe carrier and a different airflow baffle arrangement. In the old units, the front-to-back airflow washed over the NVMe area. In the new units, the NVMe controller sat in a pocket of warm air behind a plastic barrier that looked harmless until you measured it.

They proved it with two timelines: fio sustained reads and the NVMe thermal management counters. Every time the backlog grew, the thermal_management_t1_trans_count ticked upward in sync. The fix was not clever. It was physical: install the correct baffle kit, ensure fans were in the vendor’s “performance” profile, and add heatsinks for the hottest drives.

After the change, the afternoon sag disappeared. The postmortem was blunt: they had treated platform variance as irrelevant. The action item was better: baseline thermals as part of hardware acceptance, not after production screams.

Mini-story 2: The optimization that backfired

Another organization got serious about power efficiency. Good motive, real costs. They rolled out a “green” configuration to a fleet: balanced CPU governor, platform set to an energy-saving mode, and some aggressive settings around idle behavior. It looked great on the power dashboard.

Then their nightly batch jobs started taking longer. Not catastrophically longer—just long enough to collide with the morning peak. That overlap caused user-facing latency to wobble. The incident was slow, political, and annoying: everyone had a graph that supported their favorite narrative.

The optimization backfired because they optimized the wrong phase of the day. The system spent many hours in moderate load and short bursts, where power savings were real. But during long sustained I/O with compression, the CPU got pinned. The “balanced” policy downclocked more aggressively than expected, and sustained package power limits kept the CPU from holding higher frequencies. It wasn’t just “slower CPU.” It was “slower CPU during the exact window where we needed sustained throughput.”

They proved it by capturing turbostat output during the batch: frequency sag with high Busy%, plus kernel logs indicating clock throttling. Storage throughput graphs matched the frequency curve almost embarrassingly well. The NVMe drives weren’t the limiter; the compute per I/O was.

The fix was to split policy: keep energy saving on general nodes, but pin storage/batch nodes to a performance profile during batch windows. That required process, not heroics: change control, documentation, and a simple boot-time service to set governor. Power went up a bit. Incidents went down a lot.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran a modest ZFS-backed storage cluster. They weren’t flashy, but they were disciplined. Every quarter, they performed a “sustained performance soak test” after patching: 30 minutes of read/write load, with temperature and throttling counters collected and archived.

One quarter, the soak test showed something off: throughput was fine for the first few minutes, then drooped by about a third. The application was fine, the kernel was updated, and nobody had complained yet. In other words: the perfect time to find a problem, because the business wasn’t currently yelling.

They compared the soak telemetry to the prior quarter and saw two differences: CPU package temperature climbed faster, and the BMC fan RPM plateaued lower. A firmware update had quietly reset the fan profile. Nothing “broke,” so it wasn’t obvious. But the soak test made it obvious.

They changed the BMC profile back to performance, reran the soak, and the droop vanished. No incident. No scramble. No vendor call. Just a checkbox and a rerun.

The team’s manager called it “boring excellence,” which is the nicest thing you can say about operations. The test didn’t make them faster at heroics; it made heroics unnecessary.

Checklists / step-by-step plan

Step-by-step: prove throttling end-to-end (repeatable method)

  1. Pick a sustained workload: 10–20 minutes, constant parameters. Avoid “burst” tests.
  2. Start workload with status output: fio with --status-interval.
  3. In parallel, collect CPU telemetry: turbostat every 2 seconds.
  4. In parallel, collect temps: sensors watch output.
  5. In parallel, collect NVMe thermal counters: snapshot smart-log at start/mid/end.
  6. Mark timestamps: either note the clock time or log everything to files.
  7. Look for correlation: throughput drop aligned to temp rise and thermal/power limiting evidence.
  8. Classify bottleneck: CPU vs NVMe vs PCIe link vs cooling control.
  9. Apply one fix at a time: fan profile OR heatsink OR governor, not a pile.
  10. Repeat the same test: compare curves, not single numbers.

Cooling checklist (servers)

  • Inlet temperature reasonable under load? If not, fix rack airflow/containment.
  • Fans ramping with CPU/NVMe temps? If not, fix BMC profile.
  • Blanking panels present? Cable bundles blocking intake? Fix it.
  • NVMe has heatsinks or airflow? If not, add it.
  • One node hotter than peers? Inspect: dust, fan health, paste, baffles.

Power-policy checklist (Debian 13 nodes)

  • Confirm governor and cpufreq driver match your intent.
  • Check for power management daemons on servers; remove or configure.
  • Validate platform profile/BIOS preset post-upgrade.
  • Don’t raise power limits unless cooling is proven adequate under sustained load.

FAQ

1) How do I know it’s thermal throttling and not “normal SSD cache behavior”?

Look for explicit evidence: NVMe thermal management transition counters (T1/T2) increasing, or CPU “clock throttled” messages, aligned with throughput drop. Cache effects don’t usually increment thermal counters.

2) Why does my benchmark look great when I rerun it right after?

Because stopping the load cools the hardware. You reset the thermal state. Sustained tests reveal the steady-state ceiling; short reruns reveal the burst ceiling.

3) Can CPU throttling really reduce storage throughput that much?

Yes, especially with compression, checksums, encryption, small-block I/O, high interrupt rates, or user-space storage stacks. If the CPU can’t submit/completion-handle I/O fast enough, IOPS drops.

4) Should I always set the CPU governor to performance on storage nodes?

On nodes where predictable sustained performance is part of the SLO, yes—after you validate cooling and power budgets. On mixed-use or thermally constrained nodes, you may prefer a tuned balanced approach.

5) Are NVMe heatsinks always necessary?

Not always, but they’re often the cheapest way to prevent throttling, especially for M.2 devices or dense U.2/U.3 bays. If your NVMe hits T1/T2 under sustained load, the answer is effectively “yes.”

6) My fans are slow but temperatures are high. Can Linux fix that?

Sometimes on desktops; often not on servers. Many servers use the BMC for fan control. You’ll need BIOS/BMC settings, not a clever kernel parameter.

7) Is raising power limits (PL1/PL2) safe?

It can be, if the platform is designed for it and cooling is verified. It’s unsafe if you’re already near thermal limits or if your rack power budget is tight. Fix cooling first, then test.

8) What if my PCIe link speed shows “downgraded” only after heating up?

That’s a sign of marginal signal integrity or error recovery under stress. Check for AER errors, reseat/replace risers, and improve airflow around PCIe devices. Don’t ignore it; it can become data-path instability.

9) Why does Debian 13 get blamed for this?

Because OS upgrades change defaults (governors, drivers, daemons) and often coincide with firmware updates. The timing makes people suspicious. The physics doesn’t care.

10) What telemetry should I archive for future “it got slower” debates?

fio status output over time, turbostat summaries, sensors snapshots, NVMe smart-log counters, IPMI sensor dumps, and kernel logs containing throttle or AER messages.

Conclusion: next steps you can do today

If your throughput drops over time on Debian 13, assume throttling until proven otherwise. Not because Debian is fragile, but because modern hardware is aggressively opportunistic: it boosts first and negotiates later.

Do this in order:

  1. Run a sustained 10–20 minute workload and capture time-series throughput.
  2. Capture CPU temperature/frequency and throttle evidence (turbostat + journal).
  3. Capture NVMe thermal counters (nvme smart-log, smartctl).
  4. If you can correlate performance drop with thermal/power limit evidence, fix cooling/fan policy first.
  5. Only then tune power policy (governors, platform profile, and—carefully—power limits).
  6. Re-run the same sustained test and compare the entire curve, not a single headline number.

The goal is boring sustained performance. Burst speed is for marketing slides. Steady-state throughput is for production.
