PL1, PL2, and Tau Explained: The Three Numbers That Decide Everything

You’ve seen it: the CPU screams for a few seconds, benchmarks look heroic, then performance falls off a cliff and your latency graphs develop new hobbies.
Someone says “thermal throttling,” someone else says “bad paste,” and a third person suggests disabling turbo like it’s an exorcism.

Half the time the real culprit is simpler and more annoying: power limits. Specifically PL1, PL2, and Tau—the three numbers that decide whether your
chip behaves like a sprinter, a marathoner, or a confused intern trying to do both at once.

The mental model: power, time, and “why did it slow down?”

A CPU is basically a power-to-work converter wrapped in a thermal problem. Frequency and voltage determine how much work it can do per second.
But frequency and voltage also determine how much power it consumes, and power becomes heat. Heat must leave the package, enter the cooler,
and then escape to ambient air. That chain has limits.

PL1, PL2, and Tau are the CPU’s way of staying honest with physics while still letting marketing show pretty boost numbers.
They define a budget: how much power the CPU can spend continuously (PL1), how much it can overspend briefly (PL2),
and for how long that overspend is tolerated (Tau). If you remember nothing else, remember this:
PL2 is your “burst,” PL1 is your “steady state,” and Tau is the patience timer.

In production systems, these numbers show up as:

  • Requests start fast, then tail latency climbs after some seconds of sustained load.
  • Batch jobs show a “sawtooth” throughput curve: boost, clamp, recover, repeat.
  • Two “identical” machines perform differently because one motherboard vendor decided to interpret “default” creatively.
  • Cooling upgrades don’t help as much as expected because the system is power-limited, not thermally limited.

Joke #1: A CPU without sensible power limits is like a buffet with no plates—everyone’s excited until the floor is covered in pasta.

The best operators treat power limits like any other production control:
define a target, enforce it consistently, and alert on drift. “Defaults” are not a strategy.

PL1, PL2, Tau: definitions you can use in a war room

PL1: long-term power limit (steady-state)

PL1 is the long-term average power the CPU is allowed to consume. In many ecosystems it maps loosely to TDP,
but “loosely” does a lot of work here. If your workload is sustained—encoding, compaction, analytics, AV scanning, VM consolidation—PL1 is
the number that determines the long-run performance ceiling.

When the CPU hits PL1, it reduces frequency (and sometimes voltage) to keep average package power at or below that limit.
This can happen even when temperatures look fine. That’s a key point: power throttling is not always thermal throttling.

PL2: short-term power limit (burst)

PL2 is a higher power limit the CPU may use for a short interval. This is what makes your system feel snappy:
UI bursts, short compile phases, quick queries, short-lived request spikes. PL2 is where “turbo” lives.

If PL2 is set aggressively and cooling is mediocre, the CPU will boost hard, heat up fast, then slam into another limiter:
either it hits Tau and drops to PL1, or it hits thermal limits (TJmax, PROCHOT) and drops even harder.

Tau: the time window (how long PL2 is allowed)

Tau is the time constant / window that controls how long the CPU can exceed PL1 up to PL2 before it must
rein itself in. In practice, many platforms implement this with a rolling average of package power over a time window.
If the average power over that interval exceeds PL1, the CPU reduces frequency to bring the average down.
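
If the rolling-average idea feels abstract, here is a deliberately crude shell sketch of the control loop, using the 95 W / 125 W / 28 s values that appear later in the RAPL examples. Real hardware uses an exponentially weighted moving average at much finer granularity, so treat this as a cartoon of the shape, not the silicon's algorithm:

#!/usr/bin/env bash
# Toy model of PL1/PL2/Tau: burst at PL2 until the rolling average over the
# last TAU seconds reaches PL1, then clamp. All values are illustrative.
PL1=95    # long-term limit, watts
PL2=125   # short-term limit, watts
TAU=28    # averaging window, seconds
IDLE=10   # pre-load idle power, watts

window=()                                                  # last TAU seconds of package power
for ((i = 0; i < TAU; i++)); do window+=("$IDLE"); done    # the machine was idle before the load

for ((t = 1; t <= 60; t++)); do
  sum=0
  for w in "${window[@]}"; do sum=$((sum + w)); done
  avg=$((sum / TAU))
  if ((avg < PL1)); then draw=$PL2; else draw=$PL1; fi     # budget left? burst : clamp
  window=("${window[@]:1}" "$draw")                        # slide the window forward one second
  echo "t=${t}s draw=${draw}W rolling_avg=${avg}W"
done

Run it and you get roughly twenty seconds at 125 W before the average crosses 95 W and the output settles at PL1, which is exactly the cliff described next.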

The classic symptom of Tau is: “Everything is fast for 10–60 seconds, then performance drops and stays lower.”
That is not mysterious. That is Tau doing exactly what it was told.

One more definition you will meet: “unlimited”

Some BIOSes and laptop vendor tools offer “unlimited” PL2 or very high Tau. This usually means “until something else panics,”
not “forever.” The something else is typically temperature, VRM current limits, chassis skin temperature limits, or platform power budgets.
Unlimited is great for benchmarks and terrible for predictable systems.

How CPUs enforce limits (and why it feels personal)

Modern CPUs have multiple layers of governors and hard stops. Power limits are enforced by on-die logic using telemetry:
estimated package power, current draw, temperature sensors, and sometimes external platform signals from the motherboard controller.
When a limit is exceeded, the CPU doesn’t negotiate. It clamps frequency, adjusts voltage, and may restrict turbo bins.

The enforcement hierarchy varies by generation and vendor, but you’ll typically see:

  • Power limit throttling: package power exceeds PL1/PL2 (often reported as “PL1/PL2” or “Power Limit”).
  • Thermal throttling: temperature approaches TJmax; CPU reduces frequency to stay under the thermal ceiling.
  • PROCHOT: “processor hot” signal—can be asserted by CPU or platform; throttles aggressively.
  • Current/VRM limits: motherboard VRM can’t safely deliver the requested current; power delivery becomes the ceiling.
  • Platform limits: laptop EC enforces battery/adapter budgets; servers enforce rack power caps.

Important operational point: the limiter you see is not always the root cause. A power limit can cause lower frequency, which lowers heat,
which hides a thermal issue. Or a thermal issue can force frequency down, which lowers power, making it look like you’re “within PL1.”
You diagnose by correlating: frequency, temperature, package power, and throttle flags—over time.
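
A minimal way to get that correlated timeline on Linux is to log turbostat and sensors side by side while the load runs. Column availability varies by kernel and turbostat build, so check what your version actually prints before relying on specific fields:

# Capture ~2 minutes of telemetry while the load test runs (adjust duration/paths).
sudo turbostat --Summary --quiet --interval 1 --num_iterations 120 > /tmp/turbostat.log &
for i in $(seq 1 120); do
  { echo "=== $(date +%T) ==="; sensors; } >> /tmp/sensors.log 2>/dev/null
  sleep 1
done
wait   # both logs sample once per second, so rows line up by index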

One quote, because it’s still the best advice in operations. Gene Kranz, NASA flight director, said:
“Tough and competent.”
In power-limit land, “tough” means you don’t panic-tune; “competent” means you measure before you twist knobs.

What changes when you tweak each number

Raising PL1: better sustained throughput, higher steady heat, bigger power bill

Raising PL1 increases sustained performance if (and only if) the CPU was previously power-limited under your steady-state load.
It also increases steady thermal load. If your cooler and chassis airflow can’t remove that heat, you’ll just trade
power throttling for thermal throttling. That’s not an upgrade; that’s a different kind of misery.

Raising PL2: snappier bursts, potentially worse stability under spiky loads

PL2 affects responsiveness and short benchmarks. In servers, it affects tail behavior during brief load spikes.
You can get real wins for bursty services (API frontends, caches warming, short compilation steps).
But if PL2 is too high, you can trigger VRM current limits, adapter limits (laptops), or thermal transients that
make the system oscillate. Oscillation is what your SLO graphs call “interesting.”

Changing Tau: controls the “cliff” timing

Tau determines how long the CPU can behave like a sprinter before being forced into marathon pace.
Increasing Tau tends to make benchmarks look better and can help workloads that finish quickly.
It can also make sustained workloads worse: the system heat-soaks earlier, the fans hit their ceiling sooner, and thermal
throttling arrives anyway, just later and harder. This is why “just raise Tau” is a meme, not a plan.

Lowering limits: the underrated move

In production, lowering PL2 and/or PL1 can increase predictability, reduce fan noise (yes, it matters in offices and labs),
and keep systems within rack power budgets. This is especially useful in dense deployments where one node going full turbo
can push a PDU branch uncomfortably close to limits.

Joke #2: Setting PL2 sky-high because “we paid for turbo” is like removing the speed limiter on a delivery van—great until the tires file a complaint.

Where you set them matters: BIOS vs OS vs vendor tooling

You can set power limits in BIOS/UEFI, via OS-level interfaces (RAPL on Intel, platform profiles, powercap), or via vendor daemons.
BIOS is usually the most stable and least surprising. OS-level control is powerful but easier to drift via updates or misconfig.
Vendor tooling is often opaque and occasionally “helpful” in the way a toddler helps you cook.

Interesting facts and short history

  • Power limits got mainstream when turbo got mainstream. Turbo boost created a need for explicit, enforceable budgets beyond “TDP.”
  • RAPL exists because you can’t manage what you can’t measure. Intel’s Running Average Power Limit interfaces brought power into software control loops.
  • Tau values often align with “good-looking” benchmark windows. Many default Tau settings land in the tens-of-seconds range, which is… convenient.
  • Motherboard vendors treat “Intel defaults” as a suggestion. Some boards ship with elevated PL2/Tau to win reviews, then your datacenter pays the bill.
  • PL1 isn’t always equal to TDP. OEMs can set PL1 above or below nominal TDP depending on cooling and product goals.
  • Power management is a platform problem, not just a CPU problem. VRM design, PSU limits, chassis airflow, and embedded controllers can override your settings.
  • Telemetry is imperfect. Package power is often estimated, not directly measured, and accuracy varies with platform and calibration.
  • “Thermal design” includes time. Thermal capacitance means a system can absorb heat temporarily—exactly what Tau exploits.
  • Datacenters already cap power—CPUs just internalized it. Rack and PDU budgets forced predictable consumption; CPU-level limits are the granular extension.

Fast diagnosis playbook

When performance drops under load, you want to identify the limiter quickly. Don’t start by changing BIOS settings.
Start by proving whether you are power-limited, thermally limited, or something else (scheduler, IO, memory bandwidth).

First: confirm the symptom pattern

  • Does throughput drop after a consistent time window (10–60 seconds)? Think Tau/PL1 enforcement.
  • Does it drop immediately when temperature approaches TJmax? Think thermal throttling / cooling.
  • Does it oscillate in cycles of a few seconds? Think VRM/current limits, aggressive PL2, or fan control lag.

Second: capture power + frequency + throttling flags together

  • Use turbostat (Intel) or relevant vendor tools to capture: package power, MHz, temperature, and throttle reasons.
  • Correlate with dmesg for thermal/prochot messages and with journalctl for platform daemons changing policies.

Third: verify what limits are configured and who set them

  • Read RAPL powercap values from sysfs; compare to BIOS expectations.
  • Check if a vendor daemon or power profile is rewriting limits at runtime.
  • In virtualized hosts, check if hypervisor policies cap CPU power indirectly (e.g., power profiles, cgroups, cpufreq governors).

Fourth: decide the fix class

  • If power-limited but thermals are fine: raise PL1 modestly, or accept the cap and plan capacity accordingly.
  • If thermally limited: fix cooling first; raising PL1 just makes it throttle harder.
  • If current/VRM limited: reduce PL2 or improve motherboard/PSU design; you can’t “tune” weak power delivery into strength.
  • If not CPU-limited: stop touching PL settings and go find the real bottleneck.

Practical tasks: commands, outputs, and decisions (12+)

These are the checks I actually run when someone says “the CPU is throttling.”
The command lines are Linux-centric because production tends to be Linux-centric. Adjust for your environment.

Task 1: Identify CPU model and platform basics

cr0x@server:~$ lscpu | egrep 'Model name|Socket|Thread|Core|CPU\(s\)|MHz'
Model name:                           Intel(R) Xeon(R) CPU E-2288G @ 3.70GHz
CPU(s):                               16
Thread(s) per core:                   2
Core(s) per socket:                   8
Socket(s):                            1
CPU MHz:                              3700.000

What it means: You need the exact SKU because default PL values vary by SKU, OEM, and microcode.
The current MHz line is usually not enough for diagnosis but anchors your expectations.
Decision: Record CPU model before you compare nodes or blame “identical hardware.”

Task 2: Check current frequency behavior under load

cr0x@server:~$ sudo apt-get -y install stress-ng >/dev/null
cr0x@server:~$ stress-ng --cpu 16 --cpu-method matrixprod --metrics-brief --timeout 30s
stress-ng: info:  [12421] dispatching hogs: 16 cpu
stress-ng: metrc: [12421] stressor       bogo ops real time  usr time  sys time   bogo ops/s
stress-ng: metrc: [12421] cpu             39132     30.00   456.18     0.12      1304.40

What it means: This creates a controlled sustained load. The metrics let you compare “before/after” changes.
Decision: Run this alongside turbostat to see whether performance decays during the run.

Task 3: Capture turbo, power, and throttling flags with turbostat

cr0x@server:~$ sudo turbostat --Summary --interval 1 --quiet --num_iterations 10
     time  Avg_MHz  Busy%  Bzy_MHz  PkgWatt  CorWatt  PkgTmp  GFXWatt
     1.00     4680  98.23     4764    124.9     88.2      86     0.0
     2.00     4681  98.10     4766    125.1     88.0      89     0.0
     3.00     4502  98.05     4589    108.0     76.1      92     0.0
     4.00     4100  98.00     4189     95.2     67.5      92     0.0
     5.00     4098  98.02     4188     95.0     67.3      91     0.0

What it means: You can literally see the step down: package power drops and MHz drops after a few seconds.
That’s consistent with Tau expiring and settling near PL1.
Decision: If PkgTmp is stable below TJmax but MHz drops with PkgWatt, you’re probably power-limited (PL1), not thermally limited.

Task 4: Check kernel logs for thermal and PROCHOT events

cr0x@server:~$ sudo dmesg -T | egrep -i 'thermal|thrott|prochot|temperature' | tail -n 8
[Mon Jan 10 09:12:31 2026] thermal thermal_zone0: critical temperature reached (100 C), shutting down
[Mon Jan 10 09:15:02 2026] CPU0: Core temperature above threshold, cpu clock throttled (total events = 42)
[Mon Jan 10 09:15:02 2026] CPU2: Package temperature above threshold, cpu clock throttled (total events = 42)

What it means: The “clock throttled” lines are hard evidence of thermal throttling; the “critical temperature” line means the platform actually reached emergency shutdown at some point. Either way, stop “tuning PL” and fix cooling.
Decision: Treat repeated events as a reliability issue; thermal stress reduces lifespan and increases error rates.

Task 5: Read RAPL power limit values from sysfs

cr0x@server:~$ ls -1 /sys/class/powercap
intel-rapl
intel-rapl:0
intel-rapl:0:0
intel-rapl:0:1
cr0x@server:~$ cd /sys/class/powercap/intel-rapl/intel-rapl:0
cr0x@server:~$ cat name
package-0
cr0x@server:~$ cat constraint_0_name
long_term
cr0x@server:~$ cat constraint_0_power_limit_uw
95000000
cr0x@server:~$ cat constraint_0_time_window_us
28000000
cr0x@server:~$ cat constraint_1_name
short_term
cr0x@server:~$ cat constraint_1_power_limit_uw
125000000

What it means: Long-term limit is 95W (PL1), time window is 28 seconds (Tau), short-term limit is 125W (PL2).
Units are microwatts and microseconds.
Decision: If these values don’t match what you think the BIOS sets, something (firmware, OS, daemon) is overriding them.

Task 6: Observe current package energy and compute average power

cr0x@server:~$ cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
124987654321

What it means: This counter increments with energy used (micro-Joules). Sample it twice with timestamps to compute average power.
Decision: Use this when you can’t trust higher-level tools, or to cross-check turbostat.
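
A small sketch that turns two samples into an average wattage. On recent kernels energy_uj is often root-readable only, and the counter wraps, so a negative delta just means "sample again":

RAPL=/sys/class/powercap/intel-rapl/intel-rapl:0
E1=$(sudo cat "$RAPL/energy_uj"); sleep 10; E2=$(sudo cat "$RAPL/energy_uj")
if [ "$E2" -gt "$E1" ]; then
  # energy_uj is in micro-joules: delta / seconds / 1,000,000 = watts
  echo "average package power: $(( (E2 - E1) / 10 / 1000000 )) W"
else
  echo "counter wrapped during the sample window; run it again"
fi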

Task 7: Check who is managing CPU frequency policy (governor)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance

What it means: Governor affects frequency selection but does not override hard power limits.
If you’re in powersave, you might be self-throttling before PL limits even matter.
Decision: For consistent performance testing, use performance. For fleet policy, pick intentionally.
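
To pin the governor for a test run, either of these works; the first assumes the cpupower utility from your distro's linux-tools package is installed:

sudo cpupower frequency-set -g performance
# or write sysfs directly for every CPU:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor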

Task 8: Detect if intel_pstate is in active mode and what the caps are

cr0x@server:~$ cat /sys/devices/system/cpu/intel_pstate/status
active
cr0x@server:~$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
cr0x@server:~$ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
20

What it means: You’re using intel_pstate with full max performance allowed.
If max_perf_pct is low, someone capped you and you’ll blame PL1 unfairly.
Decision: Keep these consistent across nodes before comparing performance.

Task 9: Check for runtime power management daemons changing policy

cr0x@server:~$ systemctl list-units --type=service | egrep 'tlp|power-profiles-daemon|thermald'
power-profiles-daemon.service loaded active running Power Profiles daemon
thermald.service              loaded active running Thermal Daemon Service

What it means: These services can influence performance profiles and thermal behavior. On laptops and some servers, they can also write power limits.
Decision: For controlled experiments, temporarily stop them (carefully) or at least know they exist.
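
For a controlled experiment, pause them for the duration of the test and bring them straight back. Don't leave thermald disabled on hardware that relies on it for protection:

sudo systemctl stop power-profiles-daemon.service thermald.service
# ... run the load test and telemetry capture ...
sudo systemctl start power-profiles-daemon.service thermald.service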

Task 10: Verify if the platform exposes throttle reasons (turbostat flags)

cr0x@server:~$ sudo turbostat --interval 1 --num_iterations 5 | egrep '^CPU|^-'
CPU     Avg_MHz   Busy%   Bzy_MHz   PkgTmp   PkgWatt
-       4650      97.8    4730      88       124.0
-       4635      97.9    4720      90       124.8
-       4105      98.0    4186      91        95.1

What it means: Some builds of turbostat show explicit throttle columns; others don’t depending on kernel and CPU support.
Even without flags, the coupling of MHz and PkgWatt is a strong hint.
Decision: If you can’t see flags, compensate with sysfs RAPL limits and kernel logs.

Task 11: Change PL1/PL2/Tau (carefully) via powercap sysfs

cr0x@server:~$ cd /sys/class/powercap/intel-rapl/intel-rapl:0
cr0x@server:~$ sudo sh -c 'echo 105000000 > constraint_0_power_limit_uw'
cr0x@server:~$ sudo sh -c 'echo 140000000 > constraint_1_power_limit_uw'
cr0x@server:~$ sudo sh -c 'echo 35000000 > constraint_0_time_window_us'
cr0x@server:~$ cat constraint_0_power_limit_uw
105000000
cr0x@server:~$ cat constraint_0_time_window_us
35000000
cr0x@server:~$ cat constraint_1_power_limit_uw
140000000

What it means: You raised PL1 to 105W, PL2 to 140W, Tau to 35s. This may or may not persist across reboot.
Decision: Only do this on a test node first. Then rerun the same load + turbostat capture and compare.

Task 12: Validate the change’s real effect using a repeatable benchmark

cr0x@server:~$ stress-ng --cpu 16 --cpu-method matrixprod --metrics-brief --timeout 60s
stress-ng: info:  [13200] dispatching hogs: 16 cpu
stress-ng: metrc: [13200] stressor       bogo ops real time  usr time  sys time   bogo ops/s
stress-ng: metrc: [13200] cpu             85510     60.00   909.90     0.12      1425.17

What it means: If bogo ops/s improves and stays stable through the run, your new steady-state is better.
If it improves early but regresses later, you likely pushed into thermal/current limits.
Decision: Keep the change only if it improves sustained throughput without increasing throttling events or error logs.

Task 13: Check actual temperature sensors and fan behavior

cr0x@server:~$ sudo apt-get -y install lm-sensors >/dev/null
cr0x@server:~$ sudo sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +92.0°C  (high = +95.0°C, crit = +100.0°C)
Core 0:        +90.0°C  (high = +95.0°C, crit = +100.0°C)
Core 1:        +91.0°C  (high = +95.0°C, crit = +100.0°C)

What it means: You are close to high/crit. Even if you’re “only” power-limited right now, you have no thermal headroom.
Decision: Don’t raise PL limits on a box living at 92°C under routine load. Fix airflow, fan curves, dust, or heatsink mounting first.

Task 14: Check if cgroups are capping CPU (common in containers)

cr0x@server:~$ cat /sys/fs/cgroup/cpu.max 2>/dev/null || true
80000 100000

What it means: This says the cgroup can use 80% of a CPU per 100ms period (cgroup v2). That’s an artificial cap.
Decision: If you see this, your “throttling” might be cgroup scheduling, not PL1/PL2. Fix orchestration limits before tuning power.
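
To check a specific service or container rather than your own shell, resolve its cgroup first. This assumes the cgroup v2 layout; the PID below is a placeholder:

PID=12345                                          # hypothetical PID of the suspect workload
CG=$(awk -F'::' '{print $2}' /proc/$PID/cgroup)    # cgroup v2 has a single "0::<path>" line
cat "/sys/fs/cgroup${CG}/cpu.max"                  # "max 100000" means no quota
cat "/sys/fs/cgroup${CG}/cpu.stat"                 # nr_throttled / throttled_usec show CFS throttling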

Task 15: Spot a BIOS/firmware mismatch across nodes

cr0x@server:~$ sudo dmidecode -t bios | egrep 'Vendor|Version|Release Date'
Vendor: American Megatrends International, LLC.
Version: 2.1.7
Release Date: 08/14/2024

What it means: BIOS version differences often imply different default PL/Tau behavior.
Decision: In fleets, standardize BIOS versions (and settings) before you attempt performance forensics.

Three corporate mini-stories from the field

1) The incident caused by a wrong assumption: “TDP equals performance”

A mid-sized SaaS company rolled out a new batch of “same CPU, same RAM” servers for a latency-sensitive service.
They were replacing older hosts that had become capacity-bound, so expectations were simple: same SKU, just newer boxes, should be equal or better.

First week looked fine in staging. Then production saw a weird pattern: mornings were great, afternoons were ugly.
p95 latency drifted upward during the daily traffic plateau, but CPU utilization looked normal.
The first reaction was classic: “It must be garbage collection” (it wasn’t), then “it must be the network” (also wasn’t).

An SRE finally graphed per-host throughput over time for a synthetic load test and noticed a consistent drop after ~30 seconds.
That was the tell. The old servers had a conservative turbo configuration: modest PL2, sensible Tau, stable steady state.
The new servers shipped with an OEM “performance” profile: higher PL2 and a Tau long enough to heat-soak the chassis,
and a PL1 that was actually lower than expected once the embedded controller started protecting platform thermals.

The wrong assumption wasn’t “turbo is bad.” The wrong assumption was “TDP is a property of the CPU, so behavior must match.”
In practice, platform firmware, VRM, and cooling define the real steady-state. The same CPU in two chassis is not the same product.

They fixed it by standardizing BIOS settings across vendors: explicit PL1/PL2/Tau values tested against their real workload.
Latency stabilized. Power draw became predictable. Procurement learned a new rule: “Same SKU” is not an acceptance test.

2) The optimization that backfired: “Raise PL2 for faster deploys”

A large enterprise had a CI fleet where compile and test times were creeping up. Someone proposed a clean-looking fix:
“These CPUs support higher turbo power. Let’s raise PL2 and Tau so builds finish faster.”
It worked in a narrow sense. The median build time improved. The charts in the meeting were pretty.

Two weeks later the fleet started seeing intermittent failures: random worker reboots, occasional kernel machine checks,
and the kind of flaky behavior that makes teams blame “cosmic rays” and then quietly change the topic.
Hardware diagnostics found nothing obvious. Temperatures weren’t even alarming—just spiky.

The real issue was transients. Raising PL2 and Tau caused repeated current spikes during parallel builds.
The VRM and PSU combination on that particular chassis could handle average load but hated repeated bursts at elevated power.
The system stayed “within thermal limits” while still stressing power delivery components.
The failures clustered during periods where build concurrency was high and the burst patterns aligned.

Rolling back the change eliminated reboots. Then they reintroduced a safer version: a modest PL2 increase with a tighter Tau,
plus a per-host concurrency cap so load patterns were smoother. Median build time got slightly worse than the reckless peak,
but the failure rate dropped to normal, which is the only KPI that matters when the pager is involved.

The moral: boosting PL2 is not free. It shifts stress from “heat over time” to “current now.” Current now is how you discover weak links.

3) The boring but correct practice that saved the day: “Explicit limits and drift detection”

A fintech shop ran a mixed fleet: some nodes were new, some were old, some were in one datacenter, some in another.
They had strict performance requirements during market hours and strict power budgets in one facility that was already near its PDU limits.
Everyone wanted “max performance,” but nobody wanted a power incident during trading.

They did the boring thing. They created a hardware profile document per platform that explicitly listed PL1/PL2/Tau,
BIOS settings, fan profiles, and OS governor settings. They enforced it with configuration management that validated RAPL values at boot,
and they alerted on drift if a node reported different constraint values than expected.

Months later, an urgent microcode/BIOS update rolled through because security. After the update, a subset of nodes quietly changed
their default Tau and PL2 behavior. Without drift detection, the team would have noticed only when tail latency started to widen.
Instead, alerts fired within an hour of the first canary reboot.

They rolled back the profile deviation by reapplying explicit values and kept moving. No incident. No mystery graph.
Just the kind of unglamorous competence that keeps you employed.

Common mistakes (symptoms → root cause → fix)

1) “It boosts to 5 GHz then falls to 3.8 GHz”

Symptoms: Repeatable drop after a fixed time window; temperatures not at TJmax.

Root cause: Tau expires and CPU settles to PL1 steady state.

Fix: Decide if you need sustained performance (raise PL1 if cooling allows) or accept PL1 and capacity-plan. Avoid “just raise Tau” unless the workload genuinely finishes within that extended window.

2) “Temperatures are fine but performance is low”

Symptoms: Low MHz under load, low package power, no thermal logs.

Root cause: OS policy cap (intel_pstate max_perf_pct, cpufreq governor), cgroup CPU limits, or hypervisor host power profile.

Fix: Verify governor, intel_pstate caps, and cgroup CPU quotas before touching PL limits.

3) “We raised PL1 and got worse performance”

Symptoms: Higher initial power, then sharper throttling; increased thermal events; fans pinned.

Root cause: You moved from power-limited steady state into thermal-limited steady state (TJmax/PROCHOT).

Fix: Improve cooling (heatsink seating, airflow, fan curve, ambient). Then re-evaluate PL1. If you can’t cool it, don’t feed it.

4) “Random reboots after tuning turbo”

Symptoms: Instability under bursty parallel loads; no clear thermal shutdown; sporadic hardware errors.

Root cause: VRM/PSU/current transient stress from aggressive PL2/Tau; platform power delivery limits.

Fix: Reduce PL2 and/or Tau; smooth workload concurrency; ensure PSU/VRM and BIOS are validated for that power profile.

5) “Two identical servers have different performance”

Symptoms: Same CPU SKU, different sustained MHz and power; inconsistent benchmark results.

Root cause: Different BIOS versions, vendor default profiles, or different chassis cooling/VRM.

Fix: Standardize BIOS config; read back RAPL constraints; treat “same SKU” as insufficient.

6) “We set power limits in the OS but they revert”

Symptoms: sysfs values change after reboot or after a daemon starts.

Root cause: BIOS reasserts limits at boot, or a power management daemon rewrites them at runtime.

Fix: Prefer BIOS for stable fleet defaults; if using OS control, enforce at boot via systemd unit and disable conflicting services.
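
A minimal sketch of that boot-time enforcement, reusing the example values from Task 11. The unit name and numbers are placeholders, and you still need to disable or reconfigure any daemon that rewrites limits at runtime:

# /etc/systemd/system/rapl-limits.service (sketch; values are examples)
[Unit]
Description=Apply explicit PL1/PL2/Tau via RAPL powercap

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 105000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw'
ExecStart=/bin/sh -c 'echo 35000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_time_window_us'
ExecStart=/bin/sh -c 'echo 140000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_1_power_limit_uw'

[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload and systemctl enable --now rapl-limits.service, then read the constraints back to confirm they stuck.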

7) “Laptop performance tanks on battery”

Symptoms: Dramatically lower PL1/PL2 on battery; sharp throttling under sustained load.

Root cause: Embedded controller enforces battery/adapter power budget and skin temperature constraints.

Fix: Use appropriate power profile; don’t benchmark on battery; accept that the platform, not the CPU, is in charge.

Checklists / step-by-step plan

Step-by-step: establish a trustworthy baseline

  1. Record platform identity: CPU model (lscpu), BIOS version (dmidecode), kernel version, microcode version.
  2. Record policy: governor (scaling_governor), intel_pstate status, power daemons running.
  3. Record configured limits: read RAPL constraints (PL1/PL2/Tau) from sysfs (a one-shot capture sketch follows this list).
  4. Run a controlled load: e.g., stress-ng for 60s.
  5. Capture telemetry: turbostat 1s interval, plus sensors output near steady state.
  6. Annotate the time-to-cliff: when frequency/power drop happens; compare to Tau.
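
A sketch of steps 1 through 3 as a single capture, so every node leaves a comparable record behind. Paths and fields are the usual Linux ones; adjust for your platform:

OUT=/var/tmp/pl-baseline-$(hostname)-$(date +%Y%m%d).txt
{
  echo "== identity =="
  lscpu | grep 'Model name'
  sudo dmidecode -s bios-version
  uname -r
  grep -m1 microcode /proc/cpuinfo
  echo "== policy =="
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cat /sys/devices/system/cpu/intel_pstate/status 2>/dev/null
  echo "== RAPL constraints =="
  grep . /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_* 2>/dev/null
} | tee "$OUT"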

Step-by-step: decide whether to change limits

  1. If temperature is near TJmax: do not raise PL1/PL2. Fix cooling first.
  2. If PkgWatt clamps near PL1 and temp is safe: consider raising PL1 in small increments (5–10W) and retest.
  3. If you only need short bursts: tune PL2 and Tau to match burst duration, not to win synthetic charts.
  4. If stability matters more than peak: lower PL2 and/or Tau to reduce oscillation and current spikes.
  5. Validate with production-like load: not just synthetic CPU burn; include memory and IO patterns.

Fleet checklist: keep it from becoming a recurring mystery

  • Set explicit PL1/PL2/Tau in BIOS for each hardware class.
  • Store “expected RAPL constraints” and verify them at boot.
  • Alert on drift: if constraints change, treat it like config drift (a minimal check is sketched after this list).
  • Keep BIOS versions consistent, especially after security-driven updates.
  • Capacity plan using steady-state performance (PL1), not burst (PL2).
  • Document the business rule: are you optimizing for latency bursts or sustained throughput?
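
A minimal drift check, assuming expected values are defined per hardware class. Note that the kernel rounds the time window to hardware granularity, so the read-back Tau can differ slightly from what was written; use a tolerance instead of an exact match if that bites you:

#!/usr/bin/env bash
# Compare live RAPL constraints against this hardware class's expected values.
# Exit non-zero on mismatch so config management or monitoring can alert on it.
RAPL=/sys/class/powercap/intel-rapl/intel-rapl:0
EXPECTED_PL1=95000000     # uW, expected long-term limit
EXPECTED_PL2=125000000    # uW, expected short-term limit
EXPECTED_TAU=28000000     # us, expected time window

fail=0
[ "$(cat $RAPL/constraint_0_power_limit_uw)" = "$EXPECTED_PL1" ] || { echo "PL1 drift"; fail=1; }
[ "$(cat $RAPL/constraint_1_power_limit_uw)" = "$EXPECTED_PL2" ] || { echo "PL2 drift"; fail=1; }
[ "$(cat $RAPL/constraint_0_time_window_us)" = "$EXPECTED_TAU" ] || { echo "Tau drift"; fail=1; }
exit $fail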

FAQ

1) Are PL1 and TDP the same thing?

Not reliably. Many platforms set PL1 near TDP, but OEMs can set it higher or lower based on cooling and product goals.
Treat TDP as a marketing-adjacent thermal target, not a guarantee of sustained power behavior.

2) Why do I see great benchmark numbers but terrible sustained performance?

Because benchmarks often live inside the PL2/Tau window. They measure the sprint. Your workload is a marathon.
Measure at steady state and watch for the time-to-clamp behavior.

3) If temperatures are low, can I safely raise PL1?

“Safely” depends on VRM, PSU, and chassis airflow, not just CPU temperature.
Low temperature suggests thermal headroom, but you still need to validate stability and power delivery under burst and sustained load.

4) What’s a reasonable Tau?

Reasonable is workload-dependent. If your service sees bursts lasting a few seconds, long Tau mostly adds heat and oscillation risk.
If your tasks finish in 20–40 seconds, tuning Tau can matter. For long-running jobs, Tau mostly affects the first minute and then PL1 rules.

5) Why does raising PL2 sometimes reduce performance?

Because it can trigger thermal transients or current limits, causing more aggressive throttling later.
You get a hotter start and a worse landing. Performance becomes jagged instead of flat.

6) Can I tune PL limits inside a VM or container?

Usually not in a meaningful way. PL limits are enforced at the host CPU/package level.
In containers you can cap CPU via cgroups; in VMs you can shape vCPU scheduling, but you can’t typically rewrite the host’s RAPL constraints.

7) What’s the difference between power throttling and thermal throttling in symptoms?

Power throttling often shows frequency dropping while temperature stays below TJmax and package power clamps near a limit.
Thermal throttling shows temperature hitting a ceiling and logs or flags indicating thermal events; power may drop as a consequence.

8) Should I always set “maximum performance” profiles in BIOS?

Not blindly. “Maximum performance” often means higher PL2 and longer Tau, which can violate your rack power budget and hurt predictability.
For production, pick “consistent performance” unless you have a measured reason to chase burst peaks.

9) Do AMD systems have the same PL1/PL2/Tau concepts?

The names differ, but the idea—power budgets over time windows—is common across modern CPUs.
On AMD you’ll see different controls (PPT/TDC/EDC and boosting behaviors), but you still diagnose the same way: correlate clocks, power, temperature, and limit flags.

10) What’s the single most useful graph for this?

Frequency, package power, and temperature on the same timeline during a sustained load test, with throttle/limit flags if available.
The “cliff” becomes obvious, and you can match it to Tau or to thermal ceilings.

Next steps you can actually do

PL1, PL2, and Tau are not trivia. They are policy. Policy controls performance, thermals, stability, and power budgets.
If you treat them like magic numbers, you’ll get magic problems: inconsistent throughput, mystery cliffs, and “identical” servers that aren’t.

Do this next:

  1. Pick one representative node and capture a 60-second load test with turbostat + sensors.
  2. Read back the configured PL1/PL2/Tau from RAPL sysfs and compare to observed behavior.
  3. Decide what you’re optimizing for: burst latency or sustained throughput. Write it down.
  4. Change one variable at a time (PL1 or PL2 or Tau), in small increments, and retest under the same load.
  5. Standardize BIOS settings across the fleet and alert on drift so you don’t rediscover this every quarter.

Predictable systems are usually a little less exciting. That’s the point.
