The machine won’t POST. Or worse: it does POST, but only half the RAM shows up, one PCIe slot vanishes,
and the logs are screaming about corrected memory errors like it’s trying to be helpful while drowning.
Someone says “bad CPU.” Someone else says “bad board.” Meanwhile you’re staring at a maintenance window
that’s evaporating.
Under that drama is a very physical decision: should the fragile pins live on the CPU (PGA) or on the
motherboard socket (LGA)? The industry moved pins to the board for boring, ruthless reasons:
density, electrical performance, manufacturability, and the fact that physics doesn’t care about your
feelings. Let’s make that decision legible, and then make you faster at diagnosing the real failures.
LGA vs PGA in one page: terms, anatomy, and what actually touches what
PGA (Pin Grid Array)
In PGA, the CPU package has an array of pins sticking out of it. The motherboard has a socket with
corresponding holes (and internal spring contacts) that grab those pins.
You drop the CPU in, close a lever (ZIF: Zero Insertion Force), and the socket clamps onto the pins.
PGA’s user-facing advantage is obvious: the motherboard socket is mostly a passive plastic+metal
receptacle. If you damage the pins, you might be able to straighten them. If you annihilate the socket,
you’re buying a new board anyway. But PGA makes “damage” a CPU-side problem more often.
LGA (Land Grid Array)
In LGA, the CPU has flat pads (“lands”) on the bottom. The motherboard socket has spring pins that
press against those pads. A load plate and lever provide clamping force so the contact pressure is
consistent across hundreds or thousands of contacts.
LGA’s advantage is mostly invisible until you run fast buses and high core counts:
it supports higher pin counts and better electrical characteristics at high frequency.
The tradeoff is that the fragile bits (pins) now live on the motherboard. Bend them and you may
be replacing a board that costs more than the CPU you were trying to “save.”
What “pins” really are: contacts, not magic
Whether it’s PGA or LGA, the electrical job is the same: a controlled, low-resistance contact with
predictable inductance and capacitance, stable across temperature cycles, vibration, and time.
Modern platforms need that across:
- Power and ground (a huge chunk of the “pins” are just delivering current and return paths).
- High-speed serial links (PCIe, UPI/QPI, Infinity Fabric-like interconnects).
- Memory channels (DDR4/DDR5: timing-sensitive, noise-sensitive).
- Sideband signals (SMBus, SPI flash interface, debug, management).
You don’t “lose a pin.” You lose a function: a memory channel, a PCIe lane group, a management bus,
or stability under load. That’s why the failure looks like software until it doesn’t.
Why the pins moved: density, signal integrity, and mechanical reality
Pin density: the quiet dictator
The number of connections a modern CPU needs exploded. Not because marketing wanted bigger sockets,
but because physics demanded more power delivery and more I/O. High core counts and high turbo power
require serious current. Current needs metal. Metal needs area. Area needs pins/contacts.
PGA pins have practical limits: pin pitch (spacing) can’t shrink indefinitely without pins becoming
too fragile, too easy to bend, or too likely to short. LGA sockets can pack contacts more densely
because the “pin” is a spring element in a precisely manufactured socket, and the CPU just has pads.
The CPU package doesn’t need protruding needles that must survive shipping, handling, and installation.
Signal integrity: at high speeds, geometry is policy
When buses run in multi-gigahertz territory, the contact geometry matters. Not in an academic way.
In a “your PCIe link trains at Gen4 instead of Gen5 and your storage throughput collapses” way.
LGA tends to offer better control of:
- Inductance: shorter, flatter contact structures reduce inductive effects.
- Impedance consistency: sockets are engineered to keep transitions cleaner.
- Return paths: more ground contacts and better distribution improve noise margins.
- Crosstalk: tighter control of neighbor coupling helps maintain eye openings.
PGA pins are little antennas. They can work fine, but as speeds and densities rise, “fine” turns into
“we’re spending too much engineering time making it fine.”
Power delivery: more grounds than you think
Modern CPUs draw large, fast-changing currents. The package and socket must deliver power with low
impedance across a wide frequency range. That means many power/ground contacts distributed across
the socket to reduce local hotspots and voltage droop. LGA makes it easier to allocate huge numbers
of contacts and keep mechanical reliability reasonable.
Mechanical loading and contact reliability
LGA sockets use a load plate and lever to apply a defined force. That force matters: too little,
you get intermittent contacts (the worst kind). Too much, you risk board flex or socket damage.
In a well-designed LGA system, the pressure is uniform and predictable. That’s hard to guarantee
with PGA pins across high counts without making insertion and retention a horror show.
First short joke (as promised, only two): An LGA socket is like a datacenter badge reader—touch the pads perfectly or nothing happens, and it remembers your mistakes forever.
Manufacturing and yield: which side is cheaper to protect?
This is the part engineers don’t always say out loud in front of customers: shifting fragile geometry
from the CPU to the socket changes the economics. CPU packages are expensive, high-value items.
Motherboards are also expensive, but sockets can be integrated into board manufacturing and tested
in different ways.
With LGA, the CPU underside is a flat array of pads that is easier to protect with a cover and
less likely to be bent in shipping. The fragile pins are on the board, usually protected by a
socket cover until installation. In the field, this pushes failure into “install handling” rather than
“shipping damage.” It’s not kinder. It’s more controllable.
Interesting facts and historical context (9 quick hits)
- Intel mainstream moved to LGA in the mid-2000s (LGA775 era), driven by higher pin counts and power delivery needs.
- AMD stayed with PGA longer on consumer platforms (e.g., AM2/AM3/AM4), partly for socket longevity and ecosystem stability.
- Server sockets went “big LGA” early because multi-socket interconnects and more memory channels demand lots of contacts.
- Most contacts aren’t “signals”: a large fraction are power and ground to manage current delivery and noise.
- LGA enables denser pitch because the spring contact is socket-side and can be manufactured with tighter tolerances than protruding CPU pins.
- Socket covers exist for a reason: leaving an LGA socket uncovered during handling is basically inviting a tiny mechanical tragedy.
- Contact plating matters: gold plating and spring materials are chosen to minimize corrosion and maintain stable contact resistance over cycles.
- “Missing RAM channel” is a classic symptom of a contact issue—one or more address/data/control contacts not making proper contact.
- Gen-to-gen PCIe training failures can be caused by marginal contacts; the link falls back to a lower speed that “works” but quietly hurts throughput.
The uncomfortable truth: LGA is better for performance; PGA is more forgiving for humans
If you’re designing a platform with high I/O density and high power, LGA is the pragmatic choice.
If you’re a technician swapping CPUs in a hurry, PGA is friendlier—because your enemy is your own
hands, not the socket’s microscopic spring pins.
Failure modes that matter in production
Bent LGA pins: the “it boots, but…” nightmare
LGA damage is often partial. A few pins don’t make contact, and the system still powers up.
Now you’re in the land of weirdness:
- One memory channel missing, or RAM runs at lower speed.
- PCIe devices disappear or train down (Gen5 → Gen4 → Gen3).
- Machine Check Exceptions (MCE) under load, especially memory-heavy load.
- Random reboots that correlate with temperature or vibration.
- Spurious I/O errors that look like “bad NVMe” but aren’t.
The reason: those pins aren’t just “extra.” They’re often grouped by function. Lose a cluster and
you lose a whole lane group or a channel. Lose a ground reference and you lose signal integrity.
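If you want to turn that symptom list into a one-minute sweep before you open the case, here is a minimal sketch (each check is expanded properly in the tasks section below; the "downgraded" annotation in lspci depends on your pciutils version):

# Quick sweep of the "it boots, but..." signatures; run with root privileges.
sudo lspci -vv 2>/dev/null | grep -c 'downgraded'            # PCIe links running below capability
sudo dmidecode -t memory | grep -c 'No Module Installed'      # DIMM slots the firmware reports as empty
sudo journalctl -k -b | grep -icE 'mce|machine check|edac'    # hardware-error noise in this boot's kernel log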
PGA bent pins: dramatic, visible, and sometimes fixable
PGA failures are usually more obvious. A pin bends, the CPU doesn’t seat, or it seats but a pin
doesn’t mate correctly. You might see:
- No POST, or POST with clear error codes.
- CPU not detected.
- Memory not detected; similar to LGA symptoms, but more often as a total failure than a partial one.
And yes, sometimes you can straighten pins. The risk is micro-fractures and work-hardening. You
get it “straight,” it passes a quick boot, and then fails six weeks later after thermal cycles.
That’s the kind of “saved money” that costs money.
Contact resistance and contamination
Dust, skin oils, thermal paste migration, and corrosion can increase contact resistance.
With LGA, the spring pins are designed to wipe slightly against the pad to break through films,
but it’s not magic. In servers, repeated reseats can also wear plating.
Board flex and mounting pressure: the cooler can be the villain
Over-torqued heatsinks, uneven mounting, or missing backplates can flex the motherboard and change
contact pressure distribution. The system may pass light load and fall over under heavy AVX or
memory traffic because marginal contacts become marginal-er as temperature rises.
“It’s the CPU” vs “it’s the socket”: how to think like an SRE
CPUs are statistically reliable. Sockets and handling are less so. In field incidents, assume:
- Configuration and firmware first.
- Memory and power delivery next.
- Socket contact issues when symptoms map to specific channels/lanes or change after reseat.
Diagnose before you swap. Swapping is easy. Explaining why you swapped the wrong expensive thing is not.
Fast diagnosis playbook: find the bottleneck before you start swapping parts
This is the workflow that wins when you’re on the clock. You’re not trying to be clever; you’re trying
to be correct fast, with evidence.
First: establish what changed and what’s missing
- Check POST/firmware logs for memory population, PCIe training, CPU errors.
- Confirm inventory: CPU model, microcode, BIOS, DIMM layout, PCIe devices expected.
- Look for asymmetry: a specific channel missing, a specific PCIe root port absent—those scream “contact or lane group.”
Second: classify the failure type
- No POST: power, CPU presence, catastrophic socket damage, wrong BIOS, or short.
- POST but degraded: missing channels/lanes, downtrained PCIe, corrected errors—often socket/contact pressure or bent pins.
- Only fails under load: marginal contact, VRM instability, thermal or firmware settings.
Third: prove or eliminate socket/contact issues
- Reseat CPU and cooler with correct torque pattern.
- Inspect socket with magnification and angled light.
- Swap known-good DIMMs into the “missing” channel slots; if the channel stays missing, it’s not the DIMM.
- Check whether behavior changes with different PCIe devices/slots; persistent root-port absence points to CPU socket mapping.
Fourth: only then start swapping expensive parts
If you can’t make the platform consistent after a careful reseat and inspection, decide whether you
replace the motherboard (LGA risk) or CPU (PGA risk) based on what’s physically at risk and what’s
easier to validate.
Practical tasks: commands, outputs, and decisions (12+)
These are Linux-side tasks you can run even when you’re not sure if the failure is “hardware” or
“software.” Each task includes: command, what output means, and the decision it enables.
The goal is to turn vague symptoms into a socket-level hypothesis.
Task 1: Confirm CPU model, stepping, and microcode actually loaded
cr0x@server:~$ lscpu | egrep 'Model name|Socket|Thread|Core|CPU\(s\)|Stepping'
CPU(s): 64
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Core(s) per socket: 32
Thread(s) per core: 2
Socket(s): 1
Stepping: 6
Meaning: If sockets/cores/threads don’t match what you bought, stop. That can be a BIOS setting,
disabled cores, or a CPU not seated/detected correctly.
Decision: If CPU count is wrong, prioritize BIOS/POST logs and physical reseat before chasing OS tuning.
Task 2: Check kernel sees memory size and NUMA topology
cr0x@server:~$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 196355 MB
node 0 free: 182904 MB
node distances:
node 0
0: 10
Meaning: Unexpected NUMA nodes or a memory size significantly below installed RAM often indicates
missing memory channels or disabled DIMM slots.
Decision: If memory is low, correlate with DMI slot inventory next; if slots “empty” while physically populated, suspect channel contact/socket issue.
Task 3: Inventory DIMM slots from DMI
cr0x@server:~$ sudo dmidecode -t memory | egrep -A5 'Memory Device|Locator:|Bank Locator:|Size:|Speed:|Configured Memory Speed:'
Memory Device
Locator: DIMM_A1
Bank Locator: P0_Node0_Channel0_Dimm0
Size: 32768 MB
Speed: 3200 MT/s
Configured Memory Speed: 3200 MT/s
--
Memory Device
Locator: DIMM_B1
Bank Locator: P0_Node0_Channel1_Dimm0
Size: No Module Installed
Speed: Unknown
Configured Memory Speed: Unknown
Meaning: “No Module Installed” in a slot you know is populated is a major clue. If a whole channel’s
slots read empty, that’s not coincidence.
Decision: If an entire channel is missing, plan a CPU reseat + socket inspection; swapping DIMMs won’t resurrect a channel that isn’t electrically present.
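If you would rather see one line per slot than scroll through raw blocks, a small awk pass over the same dmidecode output works. This is a sketch; slot Locator strings and the exact Size wording vary by vendor and dmidecode version:

sudo dmidecode -t memory | awk -F': ' '
  /Locator:/ && $1 !~ /Bank/                   {loc=$2}   # slot name, e.g. DIMM_A1
  /Size:/    && $1 !~ /Volatile|Cache|Logical/ {sz=$2}    # "32768 MB" or "No Module Installed"
  /^$/ && loc != "" {print loc, "->", sz; loc=""; sz=""}
  END { if (loc != "") print loc, "->", sz }
'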
Task 4: Look for memory controller errors (MCE/EDAC)
cr0x@server:~$ sudo journalctl -k | egrep -i 'mce|machine check|edac|hardware error' | tail -n 20
Jan 09 10:12:41 server kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 09 10:12:41 server kernel: mce: [Hardware Error]: Machine check events logged
Jan 09 10:12:41 server kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#2_DIMM#0 (channel:2 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0)
Meaning: Corrected errors clustered on one channel or DIMM can be a DIMM problem, but also a marginal
channel contact (socket pins associated with that channel).
Decision: If errors stick to one channel across DIMM swaps, escalate to socket/CPU seating rather than blaming “bad RAM batches.”
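When there is no mcelog daemon to query, the EDAC counters in sysfs give a raw per-controller view of the same signal; a sketch, assuming the EDAC driver for your memory controller is loaded:

# Corrected/uncorrected error counts per memory controller (cumulative since driver load).
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null
# Per-DIMM breakdown, where the platform driver exposes it.
grep -H . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count 2>/dev/null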
Task 5: Verify PCIe link width and speed (catch downtraining)
cr0x@server:~$ sudo lspci -vv -s 3b:00.0 | egrep -i 'LnkCap:|LnkSta:'
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
Meaning: The device is capable of x16 at a higher speed but is running slower/narrower. That can be:
marginal signal integrity, lane contact issues, riser problems, or BIOS forcing compatibility.
Decision: If downtraining is stable but unexpected, check physical seating and socket/cooler pressure; if it flaps (changes across boots), suspect contact marginality.
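Checking one BDF is fine when you already know the suspect. To sweep every device for a link running below its capability, a sketch (the "(downgraded)" annotation needs a reasonably recent pciutils):

# Print each device whose negotiated link state is flagged as downgraded.
sudo lspci -vv 2>/dev/null | awk '
  /^[0-9a-f]/               {dev=$0}                 # remember the current device header line
  /LnkSta:/ && /downgraded/ {print dev; print "   " $0}
'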
Task 6: Map PCIe topology to root ports (find missing groups)
cr0x@server:~$ lspci -tv
-+-[0000:00]-+-00.0 Intel Corporation Device 1234
| +-01.0-[01]----00.0 Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3
| +-1d.0-[3b]----00.0 NVIDIA Corporation Device 1eb8
| \-1f.6 Intel Corporation Ethernet Connection (7) I219-LM
Meaning: If an expected root port or downstream device is missing entirely, you’re not dealing with a
driver issue. You’re dealing with enumeration not happening.
Decision: Missing whole branches after a CPU/cooler touch strongly suggests socket/CPU seating or bent pins affecting that PCIe root complex.
Task 7: Check NVMe errors that are actually link problems
cr0x@server:~$ sudo dmesg | egrep -i 'nvme|pcie|AER|link down|corrected error' | tail -n 30
[ 92.112233] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[ 92.112240] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 92.112244] pcieport 0000:00:1d.0: device [8086:1234] error status/mask=00000001/00002000
[ 92.112250] pcieport 0000:3b:00.0: AER: Corrected error received: 0000:3b:00.0
[ 92.112256] nvme nvme0: Abort status: 0x371
Meaning: AER Physical Layer corrected errors plus NVMe command aborts can be a marginal PCIe link.
That can be cable/riser/device, but it can also be CPU socket contact issues on the lane group.
Decision: If errors vanish after reseat or change with cooler torque, stop blaming the SSD.
Task 8: Confirm ECC mode and memory speeds (BIOS fallout)
cr0x@server:~$ sudo dmidecode -t memory | egrep -i 'Type:|Type Detail:|Error Correction Type:|Configured Memory Speed:' | head -n 20
Error Correction Type: Multi-bit ECC
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Configured Memory Speed: 2666 MT/s
Meaning: If configured speed is lower than expected across all DIMMs, BIOS might have fallen back due
to training issues. Sometimes that’s “safe mode” after bad boots; sometimes it’s real marginality.
Decision: If speed drops after a hardware intervention, treat it as a canary: re-check seating and socket, then re-train memory in BIOS if your platform supports it.
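To see at a glance whether a speed fallback hit every DIMM or just one channel, a histogram of configured speeds is enough; a sketch (older dmidecode versions call the field "Configured Clock Speed"):

sudo dmidecode -t memory | grep -E 'Configured (Memory|Clock) Speed:' | sort | uniq -c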
Task 9: Check CPU throttling and thermal limits (cooler pressure problems show up here)
cr0x@server:~$ sudo turbostat --Summary --quiet --interval 1 --num_iterations 3
Average: CPU Avg_MHz Busy% Bzy_MHz TSC_MHz PkgTmp PkgWatt
Average: all 2100 45.32 4632 2000 92 195.12
Meaning: High package temp and high watts with unstable frequency suggest thermal or power limits.
A badly mounted cooler can cause hot spots and weird instability that looks like “bad silicon.”
Decision: If temps are high right after a service event, re-mount the cooler with correct torque and paste application before touching firmware knobs.
Task 10: Load test to reproduce in a controlled way (don’t guess)
cr0x@server:~$ stress-ng --cpu 32 --vm 4 --vm-bytes 75% --timeout 120s --metrics-brief
stress-ng: info: [2417] dispatching hogs: 32 cpu, 4 vm
stress-ng: metrc: [2417] stressor bogo ops real time usr time sys time bogo ops/s
stress-ng: metrc: [2417] cpu 84512 120.02 3790.11 21.43 704.1
stress-ng: metrc: [2417] vm 9123 120.01 457.33 210.88 76.0
stress-ng: info: [2417] successful run completed in 120.02s
Meaning: If it fails only under combined CPU+memory pressure, suspect marginal contacts, VRM, or
thermal issues more than “driver bugs.”
Decision: Use this to reproduce after each change (reseat, DIMM swap, BIOS tweak). If the crash signature changes, you’re narrowing it down.
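To make "it survived the run" measurable rather than a feeling, snapshot the error counters around the stress run and diff them. A minimal sketch, assuming the EDAC driver is loaded (the paths exist only when it is) and using the same stress-ng flags as above:

# Corrected-memory-error delta across a stress run.
before=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null | paste -sd+ - | bc)
stress-ng --cpu 32 --vm 4 --vm-bytes 75% --timeout 120s --metrics-brief
after=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null | paste -sd+ - | bc)
echo "corrected memory errors during run: $(( ${after:-0} - ${before:-0} ))"
sudo dmesg | grep -icE 'AER:|machine check'   # PCIe/MCE noise accumulated in the ring buffer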
Task 11: Check for WHEA-like hardware error reporting on Linux (MCE count trend)
cr0x@server:~$ sudo mcelog --client
hardware event: corrected memory error
status: 0x9c20400000010091
misc: 0x0
addr: 0x0000000123456780
mcgstatus: 0x0
Meaning: Repeated corrected errors are not “fine.” They are early warning.
Decision: If corrected errors increase after a CPU/cooler reseat, you likely made contact pressure worse; reverse course and re-seat carefully.
Task 12: Validate storage “slowdown” isn’t CPU link training
cr0x@server:~$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 S5X0NX0R123456 ACME NVMe Gen4 SSD 3.84TB 1 3.84 TB / 3.84 TB 512 B + 0 B 3B2QEXM7
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i 'media_errors|num_err_log_entries|warning_temp_time'
media_errors : 0
num_err_log_entries : 3
warning_temp_time : 0
Meaning: Zero media errors but a few error log entries, plus AER messages in dmesg, hint at link-layer flakiness, not NAND death.
Decision: If storage “slow” correlates with PCIe downtraining (Task 5), treat the socket/PCIe path as primary suspect before RMA’ing drives.
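If you want to tie a specific NVMe controller to its PCIe link state without hunting for the bus address, the controller's sysfs entry points at its parent PCI device; a sketch that works on reasonably recent kernels (nvme0 is illustrative):

# Negotiated vs. maximum link speed/width for the PCI device behind nvme0.
cat /sys/class/nvme/nvme0/device/current_link_speed /sys/class/nvme/nvme0/device/current_link_width
cat /sys/class/nvme/nvme0/device/max_link_speed /sys/class/nvme/nvme0/device/max_link_width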
Task 13: Confirm BIOS/UEFI version from OS (when you suspect training regressions)
cr0x@server:~$ sudo dmidecode -t bios | egrep -i 'Vendor:|Version:|Release Date:'
Vendor: American Megatrends International, LLC.
Version: 2.6.1
Release Date: 08/14/2024
Meaning: BIOS updates can change memory training behavior and PCIe equalization defaults.
Decision: If symptoms appear after BIOS update, consider rollback or apply vendor-recommended settings; but don’t let firmware be a scapegoat for a physically damaged socket.
Task 14: Detect missing interrupts/devices that point to a dead root complex
cr0x@server:~$ cat /proc/interrupts | head -n 12
CPU0 CPU1 CPU2 CPU3
0: 22 0 0 0 IO-APIC 2-edge timer
1: 2 0 0 0 IO-APIC 1-edge i8042
24: 182993 0 0 0 PCI-MSI 524288-edge eth0
25: 0 0 0 0 PCI-MSI 1048576-edge nvme0q0
Meaning: Interrupt counts stuck at zero for a device that should be busy can mean the device isn't actually doing work,
or that it's wedged due to link issues.
Decision: Combine with lspci and dmesg. If the device exists but never generates interrupts under load, the PCIe path may be unstable (riser, slot, or CPU lane group).
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A team rolled out a batch of new compute nodes for a latency-sensitive service. The acceptance test
was simple: boot, join cluster, run a short synthetic load, ship it. Two days later, a subset of nodes
started reporting intermittent storage timeouts. Everyone fixated on the NVMe drives, because the logs
were full of I/O errors and the vendor’s health tool was annoyingly quiet.
The wrong assumption: “Storage errors mean storage is broken.” It’s a comfortable story because it keeps
the blast radius small. Swap drive, move on. But drive swaps didn’t help. The same nodes kept flaking.
An SRE finally looked at PCIe training state across nodes and noticed a pattern: the problem nodes had
links running at reduced width and speed, and AER corrected errors spiked under load. The service was
fine at idle. Under traffic, the PCIe fabric got noisy.
The culprit was mechanical: those nodes had been reworked on a bench where someone removed the CPU cooler
to “check paste” and reinstalled it quickly. The cooler mounting pressure was uneven, and the LGA contact
pressure on the PCIe-related pins was marginal. Reseating CPU + careful torque fixed it. No parts replaced.
The lesson is brutal: don’t assume the failing subsystem is the broken component. With LGA, a socket contact
problem can impersonate half your hardware inventory.
Mini-story 2: The optimization that backfired
A platform team wanted faster turnaround on hardware repairs. They introduced an “efficiency” change:
pre-apply thermal paste templates and tighten heatsinks using a powered screwdriver set to a “safe” torque.
The repair line got faster and more consistent—on paper.
Within a month, a trickle of nodes began showing a specific pattern: one memory channel missing after reboot.
The nodes would run fine in a reduced-memory configuration, so the issue slipped through initial checks.
Eventually workloads that were tuned for memory bandwidth started missing SLOs.
Investigation found two problems. First, the powered screwdriver’s clutch torque wasn’t calibrated and varied
with battery level. Second, the paste templates encouraged a thicker-than-ideal layer in some cases, which
changed mounting pressure distribution after thermal cycling. The LGA socket’s contact pressure ended up
uneven, and a handful of pins in the affected memory channel were on the edge.
Fixing it was painfully unglamorous: manual torque with a calibrated driver, paste applied by weight/volume
spec, and a post-repair validation that explicitly checked memory channel presence and speed. Throughput came back.
Optimization is great. Unmeasured optimization is how you create a new class of “ghost hardware failures.”
Mini-story 3: The boring but correct practice that saved the day
A company ran mixed workloads on a fleet that included some aging nodes. Failures happened, but they were
manageable because the team practiced a tedious ritual: every CPU service action had a documented checklist,
and every removed LGA socket had its protective cover immediately reinstalled. No exceptions. People complained.
One week, a contractor was brought in to help with a hardware refresh. They were competent but unfamiliar with
the team’s “socket cover obsession.” Mid-shift, a board sat uncovered on an anti-static mat while parts were
sorted. Someone brushed the area with a sleeve. Nothing dramatic. No sparks. No crunchy noises.
The checklist caught it. The technician doing reassembly inspected the socket under magnification before placing
the CPU and noticed a slight pin misalignment. Because the cover rule had been followed everywhere else, this one
exception stood out. They quarantined the board, and the contractor’s batch didn’t go into production.
The board was later confirmed to have a bent pin cluster that likely would have produced a “boots but missing RAM”
failure. Instead, it produced a boring ticket and a controlled replacement.
Second short joke (that’s your lot): The socket cover is the little plastic hat that prevents your motherboard from learning interpretive dance with tweezers.
Common mistakes: symptom → root cause → fix
1) Only half the RAM shows up after maintenance
Symptom: OS reports significantly less RAM; dmidecode shows “No Module Installed” for a whole channel.
Root cause: CPU not seated flat, uneven cooler torque, or bent LGA pins associated with that channel.
Fix: Power down, remove cooler, inspect socket with magnification and angled light; reseat CPU; reapply paste; tighten cooler in cross pattern to spec; validate channel presence in BIOS and OS.
2) PCIe device randomly disappears or link trains down
Symptom: lspci intermittently missing a device; lspci -vv shows downgraded width/speed; AER errors in dmesg.
Root cause: Marginal lane contact (socket pin issue), riser/slot contamination, or board flex from cooler/retention.
Fix: Reseat device and riser; check link state; if persistent to one root port, inspect CPU socket and re-mount cooler with correct torque.
3) Random reboots only under heavy load
Symptom: Stable at idle, crashes under stress-ng or real workload; MCEs may appear.
Root cause: Marginal contact pressure worsened by thermal expansion; VRM or power delivery instability; aggressive BIOS settings.
Fix: Verify thermals; re-mount cooler; return BIOS to known-good defaults; check MCE/EDAC; if errors correlate with a channel, treat socket contact as primary.
4) “CPU is dead” after a swap, but old CPU works in another board
Symptom: New CPU won’t POST in a board; works elsewhere; board fails with multiple CPUs.
Root cause: Socket damage (LGA pins bent), debris in socket, or wrong BIOS for that CPU stepping.
Fix: Inspect socket; verify BIOS support; do not repeatedly clamp CPUs into a suspect socket—each cycle risks more damage.
5) Memory speed drops across the board after “minor” service
Symptom: Configured memory speed lower than expected on all DIMMs; performance regression.
Root cause: BIOS safe-mode training after errors, or marginal training due to contact issues.
Fix: Fix physical seating first; then clear training failures (platform-specific); re-validate with dmidecode and workload benchmarks.
6) You straighten PGA pins and it works… until it doesn’t
Symptom: Repaired CPU boots; later intermittent faults appear.
Root cause: Work-hardened or micro-cracked pins; contact is electrically marginal under thermal cycling.
Fix: Treat “straightened” CPUs as temporary. In production, replace them; don’t put them in systems that matter.
Checklists / step-by-step plan
Step-by-step: safe CPU removal and installation (LGA focus)
- Plan the validation: decide ahead of time what “good” looks like (expected RAM size, channels, PCIe devices, link speeds).
- Power down properly: graceful shutdown, then remove power and wait for standby rails to discharge as per platform guidance.
- ESD discipline: wrist strap, grounded mat, and avoid synthetic clothing that loves static.
- Remove cooler evenly: loosen in a cross pattern to avoid twisting the CPU in the socket.
- Open socket mechanism carefully: do not drag the CPU across pins.
- Install socket cover immediately if CPU is removed: the cover is not packaging; it’s armor.
- Inspect: magnification, angled light; look for pin rows that don’t reflect the same way.
- Clean appropriately: remove dust with clean air; do not smear oils; avoid “creative solvents.”
- Seat CPU: align notches/markers; no force; it should settle flat.
- Close load plate and lever: expect resistance; that’s clamping force, not a fight.
- Apply thermal paste consistently: follow your platform spec; don’t overdo it.
- Tighten cooler to spec: cross pattern; calibrated torque; avoid powered drivers unless validated.
- First boot to firmware: confirm memory channels, speeds, and PCIe inventory before booting OS.
- OS validation: run the tasks in the diagnostics section and record outputs for the ticket.
Decision checklist: when to blame the socket vs the CPU
- Blame the socket/board first when: a channel or lane group is missing, symptoms change with reseat/torque, or multiple CPUs show the same issue in the same board.
- Blame the CPU first when: the CPU fails in multiple known-good boards, or you see consistent internal CPU errors not tied to a channel/port topology.
- Blame firmware/config first when: behavior changed after BIOS update, settings were modified, or inventory mismatches are consistent across boots without physical changes.
Operational hygiene checklist (what to standardize)
- Socket covers stocked and used, every time.
- Calibrated torque drivers with documented settings per platform.
- Mandatory socket inspection step for any “mystery” hardware behavior after service.
- Post-maintenance validation script that checks: lscpu, dmidecode memory slots, PCIe link state, and MCE/EDAC (a minimal sketch follows this list).
- Quarantine policy: any board with suspected LGA pin damage is tagged and removed from rotation.
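A minimal sketch of that validation script. The tool calls mirror the tasks above; the thresholds and expected values are illustrative placeholders that should come from your platform's known-good inventory:

#!/usr/bin/env bash
# post-maintenance-check.sh -- sketch of an acceptance gate after CPU/cooler service.
# Thresholds are illustrative; set them per platform.
set -u

EXPECTED_SOCKETS=1
MIN_MEM_GB=180        # slightly below nameplate capacity to allow for firmware-reserved memory

fail=0

sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
[ "$sockets" = "$EXPECTED_SOCKETS" ] || { echo "FAIL: sockets=$sockets"; fail=1; }

mem_gb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
[ "$mem_gb" -ge "$MIN_MEM_GB" ] || { echo "FAIL: only ${mem_gb} GiB visible to the OS"; fail=1; }

empty_slots=$(sudo dmidecode -t memory | grep -c 'No Module Installed')
echo "INFO: DIMM slots reported empty: $empty_slots (compare against known population)"

downgraded=$(sudo lspci -vv 2>/dev/null | grep -c 'downgraded')
[ "$downgraded" -eq 0 ] || { echo "FAIL: $downgraded downgraded PCIe links"; fail=1; }

hw_errors=$(sudo journalctl -k -b | grep -icE 'mce|machine check|edac|hardware error')
[ "$hw_errors" -eq 0 ] || echo "WARN: $hw_errors hardware-error lines in this boot's kernel log"

exit "$fail"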
FAQ
1) Is LGA always better than PGA?
Better for high pin counts and high-speed electrical performance, yes. Better for field handling, no.
“Better” depends on whether your pain is engineering constraints or technician-induced damage.
2) Why not keep pins on the CPU so the expensive motherboard is protected?
Because the expensive part is not just the board; it’s the platform’s ability to support dense, fast I/O and power delivery.
LGA reduces CPU-side fragility and enables higher contact density with better signal characteristics. The cost is shifted risk.
3) Do bent LGA pins always prevent boot?
No. That’s why they’re dangerous. A small set of bent pins can remove a memory channel or degrade PCIe without killing POST.
You get a “working” server that quietly underperforms or errors under load.
4) Can I straighten bent LGA pins?
Sometimes, physically. Operationally, it’s rarely worth it in production.
If you attempt it, you need proper magnification, lighting, and tools, and you accept that you may turn a recoverable board into scrap.
Many orgs choose “replace board” because it’s auditable and repeatable.
5) Why does missing RAM often point to socket issues?
Memory channels are wired through specific contact groups. If a few contacts in that group don’t connect, the memory controller
may disable the channel during training. The OS then reports less RAM, and dmidecode often shows empty slots.
6) What’s the role of the load plate and lever in LGA?
It provides consistent clamping force so each spring contact presses against its pad with the right pressure.
Consistency is everything: too little gives intermittent contacts; too much can flex the board or damage the socket.
7) Why do PCIe links sometimes train down after a service event?
Equalization and training adapt to the channel quality. If contact resistance rises or a lane becomes marginal, the platform may
negotiate a lower speed or narrower width to stay reliable. It’s a defensive move that looks like “performance regression.”
8) Is PGA actually more reliable?
Not inherently. It’s more forgiving to handle because the motherboard doesn’t have exposed spring pins.
But PGA CPU pins are easier to bend during installation and can fatigue if repeatedly straightened.
Reliability comes from procedure, not socket religion.
9) How can a cooler mounting problem cause memory errors?
Uneven pressure can slightly warp the board or CPU package, changing contact pressure across the LGA field.
Memory channels are sensitive; a marginal contact can become unstable when hot, leading to training failures or corrected errors.
10) What’s a good acceptance test after CPU work?
Validate inventory (CPU, RAM slots present, expected PCIe devices), check link speeds/widths, then run a short combined CPU+memory stress.
Also check logs for MCE/EDAC and PCIe AER corrected errors. “No errors” beats “seems fine.”
Practical next steps
Here’s the operational stance that keeps you out of socket hell:
treat LGA sockets as precision components, not “just a connector.” Standardize torque, inspection,
and post-maintenance validation. Don’t let “boots” be your definition of “healthy.”
One quote, because it belongs in every ops room: Hope is not a strategy.
— paraphrased idea, often attributed to engineering and operations leaders.
- Codify a CPU service procedure that includes socket cover discipline and inspection.
- Automate post-maintenance checks using the tasks above (CPU topology, DIMM inventory, PCIe link state, error logs).
- Train people on symptom mapping: missing channel/lane group implies contact or seating; random under-load faults imply marginality.
- Quarantine suspect boards immediately. Repeated reseats on a damaged LGA socket are how you turn “maybe salvageable” into “definitely not.”
- Stop “optimizing” with uncalibrated tools. If you use powered drivers, validate torque across battery states and operators, or don’t use them.
The industry didn’t move pins to the motherboard to annoy technicians. It moved them because it needed
more connections, better electrical behavior, and predictable mechanics at scale. Your job in production
is to respect that tradeoff—and to diagnose it fast when reality bites.