Cheap PSUs: how saving $20 becomes a fireworks show

You don’t “buy a PSU.” You buy the thing that decides whether every other component has a normal life or a short, loud one.
And when it goes wrong, it doesn’t fail like software—politely, with a log line and a rollback. It fails like physics.

The story usually starts the same way: random reboots, weird disk errors, NIC flaps, “CPU machine check” panics, and a growing suspicion
that you’re cursed. Then you swap the PSU and everything magically stabilizes—right after you’ve spent two nights debugging the wrong layer.

Why cheap PSUs fail in ways that hurt

A power supply is a translator between the messy world (your wall power, your UPS output, your generator, your neighbor’s air conditioner)
and the picky world (CPU VRMs, RAM rails, SSD controllers). “It turns AC into DC” is like describing a hospital as “a building with beds.”

Cheap PSUs aren’t just “less efficient.” They’re often missing protection features, built with lower-grade capacitors, undersized heat sinks,
optimistic labels, and control loops that go unstable under real-world transients. The unit might run a desktop during a calm afternoon.
Then you kick off a compile, a ZFS scrub, a GPU job, or a VM storm and it turns into the electrical equivalent of a toddler driving a forklift.

The hard part is that bad power doesn’t always kill things immediately. It can cause:

  • Silent corruption: storage writes that “succeed” but land wrong.
  • Heisenbugs: kernel panics and random segfaults that vanish when you change something unrelated.
  • Component aging: you shorten the lifespan of disks, motherboard VRMs, and NICs through ripple and heat.
  • “Only under load” failures: stable idle, unstable when busy—exactly when you care.

The $20 you “saved” is not compared to a better PSU. It’s compared to your time, your outage budget, your SLA credibility, and the data you
can’t recreate. If you’re running production, the cheapest PSU is the one you never have to talk about again.

One short joke, as a palate cleanser: A cheap PSU is like a motivational poster—looks supportive, collapses under real pressure.

What good PSUs do that cheap ones often don’t

Quality PSUs tend to deliver boring, repeatable behavior:

  • Tight voltage regulation across load ranges, not just at one “review-friendly” point.
  • Low ripple and noise, so downstream VRMs aren’t constantly filtering garbage.
  • Fast transient response when CPU/GPU load steps from 20% to 90% in milliseconds.
  • Real protection circuits: OCP, OVP, UVP, OTP, SCP, and preferably well-tuned “no drama” shutdown behavior.
  • Hold-up time that survives brief input dips without brownout resets.
  • Conservative ratings: a 750W unit that can actually deliver 750W continuously at realistic operating temperatures, not just on a cool test bench.

You can’t “RAID” your way out of a PSU that drops rails or sprays ripple at the exact moment your SSD controller is committing a mapping-table update.
Redundancy helps. It doesn’t repeal physics.

Facts and history that matter more than marketing

Here are concrete facts and context points that explain why the PSU market is so full of traps. These aren’t trivia; they’re the reason you keep
seeing the same failures repeat across companies and homelabs.

  1. 80 PLUS started as an efficiency program, not a quality stamp.
    The certification primarily measures efficiency at a few load points; it does not guarantee low ripple, good protections, or safe shutdown behavior.
  2. “Peak wattage” labeling has been abused for decades.
    Some cheap units advertise a number they can deliver briefly, not continuously, and sometimes only at unrealistically low internal temperatures.
  3. ATX standards evolved because unstable rails caused real-world instability.
    Modern systems lean heavily on the 12V rail, with CPU/GPU VRMs stepping it down; older designs cared more about 3.3V/5V distribution.
  4. Hold-up time is a hidden reliability feature.
    It’s the PSU’s ability to keep DC stable for milliseconds after AC input drops. Short hold-up time turns harmless input dips into reboots and disk resets.
  5. Electrolytic capacitors age faster with heat and ripple current.
    Cheaper caps and hotter designs lose capacitance, increasing ripple over time, which accelerates failure—an ugly feedback loop.
  6. Group-regulated designs can struggle with modern cross-load patterns.
    If your load is mostly 12V (common today), a group-regulated PSU may have poor regulation on minor rails and worse transient response.
  7. “Protections” can exist on a spec sheet but be poorly tuned.
    Overcurrent protection set too high is basically decorative; undervoltage protection set too low means your system suffers before it trips.
  8. Server PSUs popularized hot-swap redundancy for a reason.
    Datacenters learned the hard way that power is a leading cause of service-impacting events, and hot-swap PSUs let you fix it without downtime.

One quote to keep you honest. Here’s a paraphrased idea from W. Edwards Deming: quality is built into the process; you can’t inspect it in after the fact.
That’s PSUs in one sentence. You can’t “test your way” into a bad design becoming good.

What actually breaks: failure modes in plain English

1) Ripple and noise: the slow poison

Ripple is AC noise riding on top of DC output. Downstream VRMs and regulators filter it, but they are not infinite trash compactors.
Excess ripple increases heat in VRMs, can trigger marginal behavior in RAM and PCIe, and makes storage electronics work harder.
The failure signature is maddening: you get “random” errors under load, especially on warm days.

Think of it like feeding a marathon runner a diet of energy drinks and gravel. They’ll still run—for a while.

2) Transient response: when load steps faster than the PSU reacts

Modern CPUs and GPUs change power draw extremely quickly. A good PSU handles the step without the output dipping below spec or overshooting.
A cheap unit’s control loop can lag or oscillate. The result: brief undervoltage events that cause:

  • Instant reboots (no clean shutdown)
  • PCIe link resets (NIC disappears, NVMe resets)
  • Kernel machine check exceptions
  • Disk write cache flushes that time out

3) Protections that don’t protect

A PSU should shut down cleanly when things go out of bounds. Missing or badly tuned protections turn a manageable fault into collateral damage.
Typical ones:

  • OVP (Over Voltage Protection): prevents “oops, 12V became 14V” from killing boards.
  • UVP (Under Voltage Protection): prevents brownout limbo where the system malfunctions for seconds.
  • OCP (Over Current Protection): prevents a cable/rail from becoming a heating element.
  • OTP (Over Temperature Protection): prevents the PSU from cooking itself and then failing unpredictably.
  • SCP (Short Circuit Protection): prevents the spectacular kind of failure.

4) Inrush current and cheap components: the “works until it doesn’t” problem

PSUs have inrush limiting and PFC circuitry to handle startup. Cheap designs may stress components at power-on, especially behind certain UPSes.
Over time, you get failures at boot: the system won’t start cold, starts after multiple tries, or only starts if you flip the PSU switch off/on.

5) Cabling and connectors: melted plastic is a diagnostic tool you didn’t want

Even if the PSU itself is “fine,” cheap harnesses, thin gauge wire, and poorly crimped connectors become hotspots at high current.
The symptom: intermittent GPU resets, SATA drives dropping, or the system hard-freezing when a particular device spins up.
You open the case and find browned connectors. That’s not “cosmetic.” That’s resistance, heat, and impending failure.

Second short joke (and the last one): The smell of a failing PSU is nature’s way of telling you your maintenance window has started.

6) Why storage people care so much

Storage is where power sins become data sins. Sudden power loss is one thing; corruption during unstable power is worse because it’s quiet.
Filesystems and databases have strategies for power loss. They have fewer strategies for “the controller lied for 200 milliseconds.”

SSDs are particularly sensitive to mid-write power events. Enterprise SSDs often have power-loss protection capacitors; cheap consumer drives often don’t.
Pair that with a cheap PSU and you’ve built a corruption machine that only runs on weekdays.
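
One quick, low-effort signal here: NVMe drives expose an “Unsafe Shutdowns” counter in their SMART/health log, and a count that keeps climbing means drives are repeatedly losing power mid-flight. A minimal sketch, assuming smartmontools is installed and your controllers enumerate as /dev/nvme0, /dev/nvme1, and so on (field labels vary by drive and firmware):

# List the unsafe-shutdown count for each NVMe controller; a number that keeps
# rising between checks means the drives keep losing power mid-operation.
for dev in /dev/nvme[0-9]; do
  echo "== $dev =="
  sudo smartctl -a "$dev" | grep -i "unsafe shutdowns"
done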

Three corporate mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company expanded a compute cluster used for background jobs. They bought a batch of “cost-effective” 1U nodes from a secondary vendor.
The spec sheet looked fine: adequate wattage, 80 PLUS badge, and the vendor promised “server-grade power.”

The wrong assumption was subtle: they assumed “server-grade” implied stable behavior on their UPS topology. The nodes were connected to a line-interactive UPS
that occasionally switched modes during minor input fluctuations. Most servers didn’t care. This batch did.

During a weather week with frequent small sags, the cluster started to see bursts of reboots. The reboots weren’t synchronized, so it looked like software:
maybe a bad deployment, maybe a kernel regression, maybe a scheduler problem. Engineers chased it in the logs and saw nothing consistent.

Then the queue backed up. Timeouts increased. Autoscaling added more nodes—more of the same flaky PSU behavior. That turned the problem from “some reboots”
into “the platform is unstable.” The incident ended when someone physically moved two suspect nodes to a different PDU/UPS path and the reboots stopped.

The postmortem fix was boring: replace PSUs with known-good models and qualify power gear with a UPS-mode transition test.
The lesson wasn’t “UPS bad.” It was “do not assume a sticker means compatibility with your environment.”

Mini-story 2: The optimization that backfired

A fintech team tried to reduce rack power draw and heat. They swapped a set of older, slightly oversized PSUs for smaller, “right-sized” units
that promised better efficiency at typical load. On paper, it was smart: less wasted capacity, higher efficiency, lower bills.

The backfire came from real workloads. Their systems had sharp power transients: bursts of CPU, bursts of NVMe activity, occasional GPU acceleration.
The smaller PSUs ran closer to their limits and had worse transient headroom. When load spiked, 12V dipped just enough to trigger PCIe resets.

The symptom wasn’t “power issue.” It was storage flakiness: NVMe devices would drop and reappear, RAID rebuilds would start, and the database would log
IO errors. Because the boxes stayed up, people suspected firmware, drives, kernel, and backplanes. The team replaced drives. Then replaced more drives.

The turning point was correlating events: NVMe reset logs aligned with brief BMC sensor warnings about 12V min values. Nothing catastrophic, just enough.
Rolling back to higher-quality PSUs with better transient response eliminated the resets. The power bill went up a little; the incident rate went down a lot.

The lesson: “right-sized” is not the same as “right-engineered.” Headroom isn’t waste; it’s stability margin.

Mini-story 3: The boring but correct practice that saved the day

An enterprise IT group ran a private virtualization cluster. Nothing fancy. They were disciplined about two things: PSU redundancy and quarterly maintenance.
Each host had dual hot-swap PSUs fed by separate PDUs, and the team routinely tested failover by pulling one PSU under load during planned windows.

One afternoon, a PDU breaker started tripping intermittently. It wasn’t a full datacenter event—just one feed going flaky. Half the racks on that feed
saw brief input drops. In many environments, that becomes a cascading outage as nodes reset and workloads thrash.

Here, it became a ticket. The servers stayed up on the alternate PSU feed. The monitoring system alerted on “PSU redundancy lost” and “input voltage events”
but the VMs didn’t notice. No storage corruption. No cluster split-brain. No heroic midnight recovery.

Because they had practiced the procedure, the response was calm: isolate the failing PDU feed, shift load, call facilities, and swap a breaker module.
The postmortem was short. The action items were mostly: “keep doing what we’re doing.”

The lesson: reliability is often a collection of boring habits performed consistently.

Fast diagnosis playbook (first/second/third)

When you suspect a PSU, you need speed and a plan. Don’t fall into the trap of spending six hours proving a hypothesis you could test in twenty minutes.
This playbook is designed for on-call reality.

First: classify the failure and protect data

  • Is the system rebooting hard? If yes, treat it like unstable power. Disable write caches where appropriate, pause risky maintenance jobs (scrubs, rebuilds), and snapshot what you can.
  • Any smell/heat/noise? If there’s electrical smell, arcing sounds, or visible damage, power down safely and stop “testing.”
  • Is it isolated or systemic? One host vs multiple hosts on the same PDU/UPS points to different fixes.

Second: correlate logs with sensors and load

  • Check kernel logs for machine checks, NVMe resets, SATA link resets, and ACPI power events.
  • Check BMC/IPMI sensor history for 12V/5V min/max events and PSU status.
  • Check whether failures align with CPU/GPU spikes, disk scrubs, or fan ramps (heat).

Third: reduce variables with a decisive swap or isolation

  • If you have a known-good PSU, swap it. Don’t “observe” for days.
  • If dual-PSU, pull one PSU at a time under load and see if behavior changes.
  • Move the host to another circuit/PDU temporarily to rule out upstream power.

The bottleneck in PSU diagnosis is usually human hesitation. Swap, isolate, and measure. You’re not doing philosophy; you’re doing operations.
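
If you want that whole first pass as one quick sweep before the detailed tasks below, a minimal sketch might look like this (a systemd-based Linux host with a BMC is assumed; outputs are omitted because they will vary):

# 1. Hard reset or clean shutdown? (reboot entries with no matching shutdown)
last -x | head -n 20

# 2. Anything power-shaped in the previous boot's kernel log?
journalctl -k -b -1 | egrep -i "mce|machine check|nvme.*reset|link down|AER" | tail -n 50

# 3. Did the BMC see the PSU or the input misbehave?
ipmitool sel list | tail -n 20
ipmitool sdr type Voltage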

Hands-on tasks: commands, outputs, and decisions

These are practical checks you can run on Linux servers. None of them “prove” the PSU is bad by themselves.
Together, they let you build a strong case quickly and avoid replacing random parts out of frustration.

Task 1: Check for abrupt reboots (no clean shutdown)

cr0x@server:~$ last -x | head -n 12
reboot   system boot  6.8.0-41-generic Mon Jan 22 09:14   still running
shutdown system down  6.8.0-41-generic Mon Jan 22 08:57 - 09:13  (00:16)
reboot   system boot  6.8.0-41-generic Mon Jan 22 08:12 - 08:57  (00:45)
reboot   system boot  6.8.0-41-generic Mon Jan 22 07:38 - 08:12  (00:34)
shutdown system down  6.8.0-41-generic Mon Jan 22 07:35 - 07:38  (00:02)

What it means: “reboot” entries without a preceding “shutdown” at the right time indicate a hard reset/power drop.

Decision: Treat as potential power instability. Prioritize PSU/UPS/PDU checks over application debugging.

Task 2: Look for kernel power-related events and machine checks

cr0x@server:~$ journalctl -k -b -1 | egrep -i "mce|machine check|watchdog|reset|nvme|ata|pcie|EDAC" | tail -n 30
Jan 22 08:11:58 server kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: b200000000070005
Jan 22 08:11:58 server kernel: mce: [Hardware Error]: TSC 0 ADDR fef1c140 MISC d012000100000000
Jan 22 08:11:59 server kernel: nvme nvme0: controller is down; will reset: CSTS=0x1, PCI_STATUS=0x10
Jan 22 08:12:00 server kernel: nvme nvme0: reset controller
Jan 22 08:12:03 server kernel: ata3: SATA link down (SStatus 0 SControl 300)

What it means: MCEs plus storage link resets are classic “power or board” territory, especially if they appear under load.

Decision: Check PSU rails via BMC sensors; consider immediate swap if correlated with reboots.

Task 3: Check BMC sensors for voltage min/max and PSU status

cr0x@server:~$ ipmitool sdr type Voltage
12V         | 11.71 Volts      | ok
5V          | 4.92 Volts       | ok
3.3V        | 3.28 Volts       | ok
VBAT        | 3.02 Volts       | ok

What it means: “ok” is not the whole story; you need min/max history if available. Still, a 12V reading near low threshold is suspicious.

Decision: If values drift under load or flirt with thresholds, plan a swap and reduce peak load until fixed.
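
If your BMC doesn’t expose min/max history, you can approximate it by sampling during a load test. A rough sketch; one-second polling won’t catch millisecond transients, but it will show sustained sag (the log file name is just an example):

# Sample the BMC's 12V reading once per second while the stress test runs elsewhere.
while true; do
  printf '%s ' "$(date '+%H:%M:%S')"
  ipmitool sdr type Voltage | grep '^12V'
  sleep 1
done | tee 12v-under-load.log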

Task 4: Get detailed sensor thresholds (if BMC exposes them)

cr0x@server:~$ ipmitool sensor get 12V
Locating sensor record...
Sensor ID              : 12V (0x30)
Entity ID             : 7.1
Sensor Type (Voltage) : Voltage
Sensor Reading        : 11.71 (+/- 0.00) Volts
Lower Non-Recoverable : 10.80
Lower Critical        : 11.00
Lower Non-Critical    : 11.20
Upper Non-Critical    : 12.80
Upper Critical        : 13.00
Upper Non-Recoverable : 13.20

What it means: 11.71V is above LNC, but not by a luxurious margin. Under transient load it may dip below 11.2V briefly.

Decision: If the platform is resetting or devices are dropping, treat this as corroborating evidence and swap PSU or test with different power feed.

Task 5: Identify PSU model and redundancy (where supported)

cr0x@server:~$ ipmitool fru | egrep -i "Power Supply|PSU|Part Number|Product"
Product Name          : Power Supply 1
Part Number           : PWS-920P-SQ
Product Name          : Power Supply 2
Part Number           : PWS-920P-SQ

What it means: Confirms what hardware you actually have, not what procurement thinks you have.

Decision: If you’re troubleshooting a fleet, group incidents by PSU part number and batch.
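
At fleet scale, a crude SSH sweep is usually enough to group hosts by PSU part number. A sketch, assuming key-based SSH, passwordless sudo for ipmitool on the targets, and a hypothetical hosts.txt with one hostname per line:

# Collect PSU part numbers per host so incidents can be grouped by model and batch.
# ssh -n keeps ssh from swallowing the rest of the host list on stdin.
while read -r host; do
  echo "== $host =="
  ssh -n "$host" "sudo ipmitool fru 2>/dev/null | grep -i 'Part Number'"
done < hosts.txt | tee psu-inventory.txt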

Task 6: Detect PSU-related events in system event log (SEL)

cr0x@server:~$ ipmitool sel list | tail -n 10
1a3c | 01/22/2026 | 08:11:57 | Power Supply PS1 | Power Supply AC lost | Asserted
1a3d | 01/22/2026 | 08:11:58 | Power Supply PS1 | Failure detected | Asserted
1a3e | 01/22/2026 | 08:12:04 | Power Supply PS1 | Power Supply AC lost | Deasserted
1a3f | 01/22/2026 | 08:12:05 | Power Supply PS1 | Failure detected | Deasserted

What it means: The BMC saw the PSU drop AC input or fault. That’s close to a smoking gun.

Decision: Replace PS1 and inspect the upstream feed (cable, PDU outlet, breaker).

Task 7: Verify upstream power quality signals from a UPS (if connected via USB/NUT)

cr0x@server:~$ upsc myups@localhost | egrep "input\.voltage|input\.frequency|ups\.status|ups\.load|battery\.charge"
input.voltage: 228.0
input.frequency: 50.0
ups.status: OL
ups.load: 41
battery.charge: 100

What it means: OL means “on line.” If you see frequent OL/OB transitions, your PSU may be reacting badly to transfer events.

Decision: If transitions correlate with reboots, test with a different UPS mode or a higher-quality PSU with better hold-up time.
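
To catch transfer events you’d otherwise miss, log the UPS status over time and line the timestamps up against reboots. A minimal sketch using NUT’s upsc and the same “myups” name as above (the log file name is just an example):

# Append a timestamped UPS status line every 5 seconds; OL -> OB -> OL flips mark
# transfer events you can correlate with reboot times later.
while true; do
  echo "$(date '+%F %T') $(upsc myups@localhost ups.status 2>/dev/null)"
  sleep 5
done >> ups-status.log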

Task 8: Check for PCIe device resets (common with undervoltage transients)

cr0x@server:~$ journalctl -k | egrep -i "pcie.*error|AER|link down|link reset|Surprise Down" | tail -n 25
Jan 22 08:11:59 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
Jan 22 08:11:59 server kernel: pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer
Jan 22 08:12:00 server kernel: pcieport 0000:00:1c.0: device [8086:a110] error status/mask=00000001/00002000

What it means: Physical layer corrected errors can be signal integrity, but if they spike with load and coincide with resets, power is suspect.

Decision: Combine with voltage sensor data; if both point the same direction, stop blaming firmware first.

Task 9: Check SATA/NVMe link stability

cr0x@server:~$ dmesg -T | egrep -i "ata[0-9]|link down|hard resetting link|nvme.*reset" | tail -n 30
[Mon Jan 22 08:12:03 2026] ata3: SATA link down (SStatus 0 SControl 300)
[Mon Jan 22 08:12:03 2026] ata3: hard resetting link
[Mon Jan 22 08:12:05 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Jan 22 08:12:06 2026] nvme nvme0: controller is down; will reset: CSTS=0x1

What it means: Storage links should not flap in a healthy server. Power dips can cause controllers to reset.

Decision: If you see link resets plus hard reboots, prioritize PSU and cabling inspection immediately.

Task 10: Stress CPU to reproduce transient load issues (controlled test)

cr0x@server:~$ sudo apt-get update -qq && sudo apt-get install -y stress-ng
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  stress-ng
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Setting up stress-ng (0.17.12-1ubuntu0.1) ...
cr0x@server:~$ sudo stress-ng --cpu 0 --cpu-method matrixprod --timeout 120s --metrics-brief
stress-ng: info:  [21412] dispatching hogs: 32 cpu
stress-ng: metrc: [21412] stressor       bogo ops real time  usr time  sys time   bogo ops/s
stress-ng: metrc: [21412] cpu               5231    120.00  3790.11     0.31        43.59
stress-ng: info:  [21412] successful run completed in 120.02s

What it means: If this consistently triggers reboots or device resets, you’re looking at power delivery, thermals, or board VRMs.

Decision: If failures appear under this test, swap PSU and retest before blaming kernel/app.
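
While the stress test runs, it helps to watch the kernel log live from a second terminal so you see resets the moment they happen instead of reconstructing them afterwards. The filter below is just a starting point:

# Follow the kernel ring buffer and highlight the usual power-adjacent suspects.
journalctl -kf | egrep -i "mce|machine check|nvme|reset|link down|AER"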

Task 11: Monitor voltages and temperatures during load (lm-sensors)

cr0x@server:~$ sudo apt-get install -y lm-sensors
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  lm-sensors
Setting up lm-sensors (1:3.6.0-7.1ubuntu1) ...
cr0x@server:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +71.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:        +67.0°C  (high = +90.0°C, crit = +100.0°C)

nct6798-isa-0a20
Adapter: ISA adapter
in0:          +1.02 V
in1:          +1.81 V
in2:          +12.00 V
in3:          +5.02 V
in4:          +3.33 V
fan1:        1200 RPM

What it means: These readings can be approximate, but you’re looking for drift or sudden drops under load.

Decision: If 12V/5V readings sag significantly during stress, stop—your PSU/VRM path is unstable.

Task 12: Check memory errors (power can masquerade as RAM problems)

cr0x@server:~$ journalctl -k | egrep -i "EDAC|ecc|memory error" | tail -n 20
Jan 22 08:11:58 server kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
Jan 22 08:11:58 server kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR

What it means: Correctable errors (CE) might be real DIMM degradation, but bursts that align with power events are suspicious.

Decision: If ECC errors spike during PSU/transient events, fix power first, then re-evaluate DIMMs.

Task 13: Check filesystem and block layer for IO errors after events

cr0x@server:~$ journalctl -p err..alert -b | egrep -i "I/O error|EXT4-fs error|XFS.*CORRUPT|blk_update_request|Buffer I/O" | tail -n 30
Jan 22 08:12:01 server kernel: blk_update_request: I/O error, dev nvme0n1, sector 428032, op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 22 08:12:01 server kernel: Buffer I/O error on dev nvme0n1p2, logical block 53504, lost async page write

What it means: IO errors after a reset can be a device reset chain reaction, or the drive itself. Power instability increases both likelihoods.

Decision: Do not start replacing drives blindly. Stabilize power, then run SMART/self-tests.

Task 14: Check SMART and error logs (to separate “bad drive” from “bad power”)

cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i "Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count|Power_Cycle_Count|Power-Off_Retract_Count"
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       142
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       27
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       19

What it means: UDMA CRC errors often indicate cabling or signal issues. A rising “Power-Off Retract” count points to abrupt power loss.

Decision: If CRC errors are rising, inspect/replace SATA cables and power harness; if power-off events rise, fix PSU/UPS path.
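
To separate one flaky drive from a power problem touching many, sweep the same counters across every SATA disk: a fleet-wide pattern points at the PSU or harness, a single outlier points at that drive or its cable. A sketch (attribute names vary by vendor; NVMe drives report “Unsafe Shutdowns” instead):

# Compare power-related SMART counters across all SATA drives.
for dev in $(lsblk -d -n -o NAME,TYPE | awk '$2=="disk" && $1 ~ /^sd/ {print $1}'); do
  echo "== /dev/$dev =="
  sudo smartctl -A "/dev/$dev" | egrep -i "UDMA_CRC_Error_Count|Power-Off_Retract_Count|Power_Cycle_Count"
done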

Task 15: Inspect and verify CPU frequency throttling (power/thermal interactions)

cr0x@server:~$ sudo apt-get install -y linux-tools-common linux-tools-generic
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
  linux-tools-common linux-tools-6.8.0-41-generic linux-tools-generic
Setting up linux-tools-common (6.8.0-41.41) ...
Setting up linux-tools-generic (6.8.0-41.41) ...
cr0x@server:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,TSC_MHz,PkgTmp --interval 1 --num_iterations 5
Busy%   Bzy_MHz  TSC_MHz  PkgTmp
12.34   2101     2200     74
89.02   3298     2200     89
91.11   2980     2200     94
88.50   2750     2200     96
90.20   2605     2200     97

What it means: Falling Bzy_MHz at high Busy% and rising package temp suggests throttling. Throttling can change load patterns and expose PSU weakness.

Decision: Fix cooling first if you’re near thermal limits; then assess PSU stability under expected thermal conditions.

Task 16: If dual-PSU, test redundancy under load (controlled)

cr0x@server:~$ ipmitool sel clear
Clearing SEL.  Please allow a few seconds to erase.
cr0x@server:~$ ipmitool sel list
SEL has no entries
cr0x@server:~$ sudo stress-ng --cpu 0 --timeout 60s --metrics-brief
stress-ng: info:  [21901] dispatching hogs: 32 cpu
stress-ng: info:  [21901] successful run completed in 60.01s

What it means: Clearing the SEL first gives you a clean baseline. Run the stress test while you physically pull one PSU, then run ipmitool sel list again: any AC-lost or redundancy events you see were caused by your test.

Decision: If pulling PSU A causes instability but pulling PSU B doesn’t, PSU A (or its feed/cable) is suspect.

Note: your hands are not a command. But they are part of the diagnostic toolkit. If you do physical tests, do them during a maintenance window, under supervision,
and with clear rollback steps.

Common mistakes: symptom → root cause → fix

1) “Random reboots, no logs” → brownouts/hold-up time issues → test and swap

Symptom: System restarts with no clean shutdown; logs show a gap.

Root cause: PSU output dips below tolerance during transient load or AC input sag; hold-up time too short.

Fix: Swap in known-good PSU; check UPS transfer events; ensure adequate PSU headroom and quality.

2) “NVMe disappears, then comes back” → transient undervoltage → improve PSU transient response

Symptom: NVMe reset loops, IO errors, RAID rebuild triggers.

Root cause: PCIe device resets caused by brief 12V dips or noisy rail.

Fix: Replace PSU with better transient performance; reduce load spikes; verify cabling and backplane power connectors.

3) “SATA CRC errors climbing” → bad cable or poor power harness → replace the right cable

Symptom: SMART shows UDMA_CRC_Error_Count increasing; drives drop under load.

Root cause: Signal cabling issue (SATA cable, backplane), sometimes worsened by ripple and connector heating.

Fix: Replace SATA data cables/backplane connectors; inspect for browned power connectors; ensure tight, clean connections.

4) “ECC correctable errors spike only during heavy IO” → power noise → fix power first

Symptom: Burst of corrected memory errors during scrubs/backups.

Root cause: Power delivery noise causing marginal timing; can appear as RAM instability.

Fix: Validate PSU rails and swap PSU before replacing DIMMs; retest after power fix.

5) “Only fails when hot” → capacitor aging and thermal derating → airflow and quality PSU

Symptom: Stable in winter or with the case open; fails under summer ambient or high fan curves.

Root cause: Cheap PSU internal temps rise; capacitors lose effective capacitance; regulation worsens.

Fix: Improve chassis airflow, clean dust, and replace PSU with higher temp-rated design; avoid running near max rating.

6) “PSU fan screams, then stops, then the server dies” → failing fan/OTP misbehavior → replace immediately

Symptom: PSU fan behavior erratic; intermittent shutdowns.

Root cause: Fan bearing failure or poor thermal control; over-temperature protection may trip late.

Fix: Replace PSU. Do not “just replace the fan” in production unless you enjoy risk with your coffee.

7) “We replaced drives and it still happens” → chasing the victim not the attacker → stop and re-baseline

Symptom: Multiple component swaps don’t fix resets and IO errors.

Root cause: PSU or upstream power causing cascading errors across devices.

Fix: Re-baseline: correlate resets with power sensors; do a known-good PSU swap; verify PDU/UPS behavior.

8) “Brand name means safe” → model/platform variance → qualify the exact unit

Symptom: “But it’s a reputable brand” while issues persist.

Root cause: Brands outsource; quality varies by platform, revision, and OEM.

Fix: Select PSUs by verified electrical performance and protections, not logo; standardize on known-good models.

Checklists / step-by-step plan

Procurement checklist: how to not buy trouble

  • Buy for electrical behavior, not wattage. Look for independent validation of ripple, transient response, protections, and hold-up time.
  • Plan headroom. Target typical load around 40–60% of PSU rating for efficiency and transient margin (exact numbers depend on platform, but “near max” is asking for it); a quick worked sizing example follows this list.
  • Prefer modern DC-DC designs for minor rails if you’re building ATX systems, and reputable server PSUs for rack gear.
  • Check connector quality and harness gauge. Especially for GPUs and high-drive-count storage nodes.
  • Standardize models. Fewer PSU SKUs means fewer unknowns and faster swaps.
  • Decide your redundancy strategy upfront. Dual-PSU with separate feeds if uptime matters; single PSU plus cold spare if it doesn’t.
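
A rough worked example of the headroom rule above; the wattage figures are illustrative assumptions, not measurements from any particular platform:

# Back-of-envelope PSU sizing: estimate sustained draw, then pick a rating that
# keeps typical load near the middle of the PSU's range. All figures are assumptions.
CPU_W=220               # CPU package under sustained load
DRIVES_W=90             # e.g. a dozen spinning HDDs at roughly 7-8W each
BOARD_NIC_FANS_W=60     # motherboard, NICs, fans, odds and ends
TYPICAL_W=$((CPU_W + DRIVES_W + BOARD_NIC_FANS_W))
echo "Typical load: ${TYPICAL_W}W"
echo "PSU rating for ~50% typical load: $((TYPICAL_W * 2))W"   # lands in the 750W class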

Build checklist: installation details that prevent outages

  • Use separate PDUs/circuits for redundant PSUs. “Two PSUs into the same strip” is theater.
  • Label PSU-to-PDU mapping. Humans debug faster when the physical world is documented.
  • Don’t overload a single cable bundle. Avoid sharp bends and tension on connectors.
  • For storage servers, verify drive power distribution. Stagger spin-up where possible.
  • Set monitoring for PSU redundancy lost, voltage events (if available), and unexpected reboots.

Operational checklist: when a host starts acting haunted

  1. Freeze risky writes: pause scrubs/rebuilds/backfills if the system is flapping.
  2. Classify the failure: hard reset vs graceful restart vs device resets.
  3. Pull logs: last -x, journalctl -k, SMART, BMC SEL (a collection sketch follows this list).
  4. Correlate with load: was there a job spike, backup, or compile storm?
  5. Check upstream: UPS status transitions, PDU alerts, breaker trips.
  6. Swap with known-good PSU (or swap PSU position in a dual-PSU chassis).
  7. Re-run controlled stress and confirm stability before closing.
  8. Write a short incident note: what you saw, what changed, and what you replaced.
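
A small collection sketch for step 3, so whoever picks up the ticket has the same data you did. Paths and scope are illustrative; adjust to your environment:

# Bundle the usual power-incident evidence into one timestamped directory.
OUT="/tmp/power-evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
last -x | head -n 50             > "$OUT/last-x.txt"
journalctl -k -b -1 --no-pager   > "$OUT/kernel-prev-boot.log" 2>&1
sudo ipmitool sel list           > "$OUT/bmc-sel.txt" 2>&1
sudo ipmitool sdr type Voltage   > "$OUT/bmc-voltages.txt" 2>&1
for dev in /dev/sd?; do sudo smartctl -a "$dev" > "$OUT/smart-$(basename "$dev").txt" 2>&1; done
echo "Evidence collected in $OUT"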

Design checklist: storage and virtualization specifics

  • ZFS/RAID arrays: prefer stable power and UPS; avoid unstable brownout behavior that causes repeated device resets.
  • SSD-heavy nodes: consider drives with power-loss protection; cheap PSUs plus consumer SSDs is a nasty combo.
  • GPU nodes: prioritize cabling, connector quality, and transient response; GPUs are spiky loads.
  • Virtualization hosts: dual PSUs, separate feeds, and alerting on redundancy loss are non-negotiable if you care about uptime.

FAQ

1) Is 80 PLUS Gold enough to guarantee a good PSU?

No. It mostly tells you efficiency at specific load points. A PSU can be efficient and still have mediocre ripple, weak transient response,
or poorly tuned protections. Treat 80 PLUS as “one data point,” not a safety certificate.

2) What’s the single most common real-world symptom of a bad PSU?

Hard reboots under load. Especially when combined with storage link resets (SATA/NVMe) or PCIe AER errors.

3) Can a cheap PSU cause data corruption without rebooting?

Yes, though it’s harder to prove. Ripple and transient dips can destabilize controllers and memory paths in ways that create wrong writes or
metadata inconsistencies. Filesystems and databases are robust, but they’re not magical.

4) If my system boots and runs games, why wouldn’t it run a server workload?

Servers often run sustained high IO and steady CPU load for hours, plus simultaneous bursts (scrubs, backups, compactions).
Consumer “it works on my desk” testing does not cover those patterns or the thermal environment of a rack.

5) How much headroom should I leave when sizing a PSU?

Enough that normal operation doesn’t live near the cliff. A practical target is to keep typical load in the 40–60% range and ensure worst-case
spikes stay comfortably below the PSU’s continuous rating. The exact margin depends on transients (GPU nodes need more).

6) Are modular cables interchangeable between PSU brands?

Usually no, and “usually” is not a risk strategy. Pinouts differ even within the same brand across product lines.
Mixing cables can instantly kill drives and motherboards. Use only the cables intended for that exact PSU model.

7) Does a UPS make cheap PSUs safe?

A UPS helps with outages and some power anomalies, but it doesn’t fix poor transient response or high ripple.
Also, some PSUs behave badly during UPS transfer events. Test your exact pairing.

8) What about redundant PSUs—can I use two cheap ones and be fine?

Redundancy reduces downtime from a single PSU failure, but it doesn’t guarantee clean power or good behavior under load.
Two bad PSUs can still produce instability, and a failing unit can stress the other or inject noise before it dies.

9) How do I tell if the problem is PSU vs motherboard VRM?

You often can’t tell cleanly without swapping. But patterns help: PSU issues correlate with AC/PSU SEL events, multiple device resets at once,
and changes when you move circuits. VRM issues may correlate with specific CPU load and temperature and persist across PSUs.
In ops, you swap the cheaper/faster-to-swap component first—often the PSU.

10) What’s the safest upgrade path for a homelab storage box?

Buy a known-good PSU with strong protections and decent headroom, put the system on a UPS, and avoid mixing modular cables.
If you run important data, prioritize stability over aesthetics and RGB features.

Next steps you can actually do this week

If you run production systems, treat PSUs like tires on a fleet vehicle: you don’t buy the cheapest ones and hope your drivers become better at physics.
Power quality is foundational. Everything else depends on it.

Practical actions

  • Inventory PSUs (model and part number) across your fleet using BMC where available, and identify unknown/low-confidence units.
  • Pick one or two standardized, proven PSU models for each chassis class (storage, compute, GPU) and stop improvising per purchase order.
  • Implement alerting for unexpected reboots, “PSU redundancy lost,” and BMC voltage threshold events (a cron-able check is sketched after this list).
  • Run a controlled stress test on new hardware during burn-in and watch for PSU/PCIe/storage reset signals.
  • Keep cold spares of the standardized PSU models. The fastest incident is the one you end with a swap and a note.
  • For critical services, move to dual-PSU with separate feeds and test failover quarterly—boring, repeatable, effective.
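
A minimal, cron-able sketch for the PSU-event half of that alerting; it only checks the BMC SEL, and you’d wire its output into whatever paging or metrics pipeline you already run:

# Exit non-zero (and print the offending lines) if the BMC has logged PSU trouble
# since the SEL was last cleared. Meant to run from cron or a systemd timer.
EVENTS=$(ipmitool sel list 2>/dev/null | egrep -i "power supply|redundancy|ac lost")
if [ -n "$EVENTS" ]; then
  echo "PSU-related SEL events found:"
  echo "$EVENTS"
  exit 1
fi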

If you’re still tempted by the bargain bin PSU, ask a different question: “What’s my hourly rate during an outage?”
Suddenly, the $20 savings starts looking like a loan with aggressive interest.
