MCM Graphics: What Can Go Wrong and How Vendors Patch It

Your GPUs were “fine” yesterday. Today your inference tail latency is spiky, training jobs stall at random, and the only clue is a handful of cryptic kernel logs and a scheduler full of angry retries. When you’re running multi-chip-module (MCM) graphics—chiplets, tiles, multiple dies, fancy packaging—this is the kind of day you eventually get.

MCM GPUs promise better yields, more compute per socket, and a roadmap that doesn’t require building a single monster die. They also move a bunch of “this used to be on-die” problems into places where physics and firmware get a vote. The good news: most failures are diagnosable. The bad news: you need to know where to look and what to believe.

What’s different about MCM graphics (and why ops should care)

MCM graphics is the idea that a “GPU” isn’t one monolithic piece of silicon anymore. It’s a package with multiple active dies—compute tiles, IO dies, memory stacks, cache dies—stitched together using high-speed interconnects and advanced packaging (2.5D interposers, silicon bridges, organic substrates, various flavors of micro-bumps).

From a production operations perspective, the risk profile changes in three ways:

  1. More links to fail. Every extra die-to-die hop is a place for signal integrity, clocking, training, or power noise to go sideways.
  2. More firmware arbitration. A single monolithic die can hide internal complexity behind “it just works.” MCM needs more boot-time training, link management, error handling, and sometimes runtime reconfiguration.
  3. More ways to be “mostly working.” Partial degradation becomes common: one tile throttles, one memory stack throws ECC corrections, one link retrains under load. Your dashboards show “GPU is up,” while your SLOs say otherwise.

Here’s the operational trap: engineers treat MCM GPUs like big versions of last year’s parts. Same driver playbook, same thermal assumptions, same “if it passes burn-in it’s good” mindset. That assumption doesn’t age well.

One idea that remains painfully relevant here is often attributed (in paraphrase) to John Ousterhout: complexity is incremental; it accumulates until the system becomes hard to understand and unreliable. MCM doesn’t create complexity from nothing. It relocates it into packaging, firmware, and telemetry, where you may have weaker instincts.

Interesting facts and context (the short, useful kind)

  • Chiplets aren’t new. Multi-die packages have existed for decades; what changed is bandwidth density and packaging that makes them behave like one device at GPU-scale rates.
  • HBM made “package-level” the new motherboard. When memory stacks sit next to compute on an interposer, many “DIMM-era” failure patterns disappear—and new ones arrive (stack-local thermals, PHY margin, per-stack ECC behavior).
  • Early multi-GPU (SLI/CrossFire) is the wrong mental model. That was “multiple devices over PCIe.” MCM is “one device with internal fabrics,” which changes error containment and reset semantics.
  • Yield economics are a major driver. Smaller dies yield better; vendors can bin tiles, fuse off weak units, and build product stacks without betting on one giant reticle-limited die.
  • Advanced packaging is its own reliability domain. Micro-bumps, underfill, and substrate warpage can create time-dependent failures that look like “random driver hangs.”
  • Interconnect training is now a boot-critical phase. Link training failures can show up as intermittent enumeration issues or performance cliffs that only happen after cold boot.
  • Telemetry got better, but interpretation got harder. You get per-link counters, per-stack ECC, per-tile throttling… and a flood of false positives if your thresholds are naive.
  • RAS features are increasingly product-defining. Correctable errors and isolation behavior are now part of the product promise, not just a nice-to-have for HPC.

Failure modes: what breaks in the real world

1) Die-to-die interconnect flakiness: “it’s not down, it’s just slow… sometimes”

MCM GPUs live or die by their internal fabrics: die-to-die links, cache-coherent interconnects, memory PHY connections, and sometimes external fabrics like NVLink-class links between packages. These links have training, equalization, and error correction behaviors that can drift with temperature, voltage, or aging.

How it presents:

  • Performance jitter: p95 latency doubles without obvious utilization changes.
  • Intermittent hangs under peak load, often during all-reduce or heavy peer-to-peer traffic.
  • Correctable error counters climb steadily; uncorrectable errors cause GPU reset or job kill.

What’s really happening: thin link margin, periodic retraining, or escalating error-correction overhead. The GPU “works,” but your effective bandwidth collapses.
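
If you want to put a number on that collapse, time a large device-to-device copy and compare it against your own known-good baseline. Below is a minimal sketch, assuming PyTorch with CUDA and at least two visible GPUs; the script name, the 2 GiB transfer size, and the per-pair baseline idea are illustrative, not vendor tooling.

# d2d_probe.py - rough device-to-device bandwidth probe (sketch)
# Assumes: PyTorch with CUDA and at least two visible GPUs.
import time
import torch

def d2d_bandwidth_gib_s(src=0, dst=1, mib=2048, iters=5):
    n = mib * 1024 * 1024 // 4                    # float32 elements per buffer
    x = torch.empty(n, dtype=torch.float32, device=f"cuda:{src}")
    y = torch.empty(n, dtype=torch.float32, device=f"cuda:{dst}")
    y.copy_(x)                                    # warm-up: contexts, P2P setup
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.time()
    for _ in range(iters):
        y.copy_(x)                                # rides P2P/NVLink or PCIe, per topology
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    return (mib / 1024) * iters / (time.time() - t0)

if __name__ == "__main__":
    if torch.cuda.device_count() < 2:
        raise SystemExit("need at least two GPUs for a device-to-device probe")
    bw = d2d_bandwidth_gib_s()
    # Compare against your own baseline; a big drop on one GPU pair is a
    # link/topology question, not a model question.
    print(f"cuda:0 -> cuda:1 copy: {bw:.1f} GiB/s")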

2) Power delivery and transient response: your PSU is innocent, your rails aren’t

MCM tends to increase local current transients: multiple tiles switching in lockstep, HBM bursts, fabric activity spikes. Board VRMs and on-package regulators must keep up. If they don’t, you get brownout-like behavior that looks like software bugs.

How it presents:

  • GPU resets under sudden load ramps (job start, kernel launch storm, mixed precision phase transitions).
  • Kernel logs show “GPU fallen off bus” or PCIe link resets.
  • More frequent errors at higher power limits or aggressive boost clocks.

3) Thermal gradients inside the package: average temp lies

With multiple dies and HBM stacks, the hottest spot may not be where your one-line “GPU temperature” metric points. You can have a “fine” reported temperature and still be throttling a tile or memory stack.

How it presents:

  • Clocks downshift unpredictably; perf looks like congestion but isn’t.
  • HBM ECC correctable errors correlate with high ambient or fan curves.
  • Problems appear only in certain rack elevations or airflows.
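
Where the driver exposes it, sampling the memory (HBM) temperature next to the headline GPU temperature makes this visible. Below is a minimal sketch using nvidia-smi’s query interface; temperature.memory is not reported on every SKU or driver, and the 15-degree gap threshold is an illustrative placeholder, not a vendor limit.

# temp_gap_check.py - compare GPU vs memory (HBM) temperature per GPU (sketch)
# Assumes: nvidia-smi in PATH; temperature.memory may be N/A on some SKUs/drivers.
import subprocess

QUERY = "index,temperature.gpu,temperature.memory"

def read_temps():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, tgpu, tmem = [f.strip() for f in line.split(",")]
        yield idx, tgpu, tmem

if __name__ == "__main__":
    for idx, tgpu, tmem in read_temps():
        if tmem in ("N/A", "[N/A]", ""):
            print(f"GPU {idx}: gpu={tgpu}C hbm=not exposed on this SKU/driver")
            continue
        gap = int(tmem) - int(tgpu)
        flag = "  <-- HBM much hotter than the headline number" if gap >= 15 else ""
        print(f"GPU {idx}: gpu={tgpu}C hbm={tmem}C (gap {gap}C){flag}")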

Joke #1: Thermal throttling is the GPU’s way of saying “I’m not mad, I’m just disappointed,” while quietly halving your throughput.

4) Reset semantics get weirder: partial resets, stuck contexts, and “ghost GPUs”

On monolithic GPUs, a reset often means “reset the whole device.” In MCM, vendors may attempt finer-grained recovery: reset a tile, retrain a link, quarantine a memory stack, restart a microcontroller. This is good—until your driver, kernel, or application stack assumes “reset is reset.”

How it presents:

  • GPU visible to the OS, but CUDA/ROCm/OpenCL context creation fails.
  • Device nodes exist; tools report “OK”; jobs fail immediately.
  • Only a full host reboot restores normal behavior.

5) ECC and RAS telemetry: corrections are a warning, not a badge of honor

HBM ECC correctables are often treated as “fine.” In production, a rising correctable rate is an early warning for thermal issues, marginal signal integrity, or impending uncorrectable errors. MCM adds more surfaces for this: multiple stacks, more PHYs, more controllers.

How it presents:

  • Slow creep in correctable errors, then sudden job terminations.
  • Unexplained “illegal memory access” errors that correlate with ECC events.
  • Higher error rates on specific GPUs or positions in the chassis.

6) PCIe and BAR behavior: the host link still matters

Even if the internal GPU is chiplet-based, the host sees a PCIe device. MCM GPUs can increase MMIO pressure, BAR sizing sensitivity, and reset complexity. Resizable BAR and IOMMU settings can help or hurt depending on firmware quality.

How it presents:

  • Device enumerates, but driver fails to initialize after firmware updates.
  • PCIe Advanced Error Reporting (AER) logs show correctable bursts during load.
  • Peer-to-peer transfers are slower than expected, or occasionally time out.

7) Driver scheduling and “implicit assumptions” about symmetry

Some early MCM designs expose asymmetry: IO die vs compute dies, shared caches, or different memory affinity domains. If the driver assumes uniform latency, it may place work sub-optimally. If your application assumes uniformity, it may amplify the issue.

How it presents: performance cliffs that appear only for certain batch sizes, tensor shapes, or memory access patterns. Benchmark A says “great,” production workload says “why are we paying for this.”

8) Firmware microcontrollers: the hidden operating system inside your GPU

Modern GPUs ship with multiple embedded controllers handling power, security, scheduling, and RAS. MCM increases coordination needs: more endpoints, more states, more training sequences. Firmware bugs become “hardware flakiness” in your incident channel.

How it presents:

  • Issues that vanish after a firmware update (or appear after one).
  • Problems that reproduce only after warm reboot, not cold boot.
  • “Stuck in low power state” or “won’t boost” with no thermal cause.

9) Packaging-level aging: when “works in QA” doesn’t mean “works in month 14”

Thermal cycling, vibration, and long-term electromigration can degrade bump connections and interconnect margin. These failures often start as correctable errors and intermittent link retraining. They become “random resets” later.

How it presents: “We swapped software, kernel, driver, model versions, and the problem persists on the same physical GPU.” This is when you stop debating and start isolating hardware.

How vendors patch MCM graphics in practice

When an MCM GPU has a field issue, vendors patch it across layers. Hardware changes come later. Production has to survive the in-between.

Patch layer 1: Firmware updates (VBIOS, device firmware, microcontroller images)

Firmware patches typically target:

  • Link training and equalization: better presets, longer training windows, retry logic tuned for marginal channels.
  • Power management: less aggressive boost transitions, different voltage droop mitigation, revised power gating sequences.
  • RAS policy: when to reset, when to isolate, when to limp along; thresholds for escalating correctable errors.
  • Reset handling: improving recoverability without host reboot.

Operational reality: firmware updates are not routine hygiene. They are change events with rollback risk, driver-version dependencies, and sometimes interactions with the motherboard BIOS.

Patch layer 2: Kernel driver changes

Drivers patch around hardware behavior. Common patterns:

  • Timeout handling tuning: avoid false hang detection on longer kernels or heavier fabrics.
  • Better error decoding: mapping cryptic error codes to actionable categories; exposing per-link counters.
  • Workarounds: disabling an optimization path that triggers a hardware bug (yes, it happens).
  • Reset improvements: trying a tile reset before a full device reset; better cleanup of contexts.

Patch layer 3: User-space libraries and runtimes

CUDA/ROCm/OpenCL stacks, collectives libraries, and ML frameworks sometimes ship mitigations:

  • Different default algorithm choices for all-reduce when P2P is unstable.
  • More conservative memory pooling behavior to reduce fragmentation-induced stress.
  • Health checks that detect “zombie device” states and fail fast.
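
A “zombie device” health check is simple to sketch: force context creation and a real kernel launch on every visible GPU before the job starts, and fail fast if either step throws. A minimal version, assuming PyTorch with CUDA; the script name and the tiny-reduction probe are illustrative, not a vendor-provided check.

# zombie_gpu_check.py - fail fast if a GPU enumerates but cannot run work (sketch)
# Assumes: PyTorch with CUDA. Intended as a pre-job health gate, not a diagnostic.
import sys
import torch

def gpu_is_usable(idx: int) -> bool:
    try:
        with torch.cuda.device(idx):
            x = torch.ones(1024, device=f"cuda:{idx}")   # forces context creation
            y = (x * 2).sum()                            # forces a real kernel launch
            torch.cuda.synchronize(idx)
            return float(y) == 2048.0
    except RuntimeError as exc:
        print(f"GPU {idx}: unusable ({exc})", file=sys.stderr)
        return False

if __name__ == "__main__":
    bad = [i for i in range(torch.cuda.device_count()) if not gpu_is_usable(i)]
    if bad:
        # Exit non-zero so the scheduler reschedules instead of hanging mid-job.
        raise SystemExit(f"unusable GPUs: {bad}")
    print("all visible GPUs accepted work")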

Patch layer 4: Platform BIOS and PCIe quirks

Motherboard BIOS/UEFI updates can adjust PCIe ASPM, link training behavior, BAR handling, and IOMMU defaults. On some platforms, this is the difference between “rare correctables” and “daily bus resets.”

How vendors decide what to patch first

Vendors chase reproducibility. If they can reproduce a hang with a specific workload pattern, they’ll patch the driver. If it’s platform-specific, they’ll patch BIOS guidance. If it’s clearly marginal signal integrity, they may tune training or lower default clocks. If it’s a true silicon erratum, they’ll issue workarounds now and fix it in a stepping later.

What you should do: treat GPU firmware/driver as a coupled unit. Don’t “just update the driver” in isolation and then act surprised when a device firmware mismatch causes new failure modes.

Joke #2: Vendor release notes are like weather forecasts: useful, occasionally wrong, and you still shouldn’t ignore the storm warnings.

Fast diagnosis playbook

When an MCM GPU cluster starts misbehaving, you’re not solving a philosophy problem. You’re locating a bottleneck and deciding whether to remediate, quarantine, or roll back.

First: decide whether it’s host link, device health, or workload

  1. Host link sanity: PCIe AER, link speed/width changes, “fallen off bus,” IOMMU faults.
  2. Device health: ECC counters, throttling reasons, Xid/driver reset logs, fabric errors.
  3. Workload correlation: does it happen only on certain kernels, batch sizes, or communication patterns?

Second: classify the failure as “degradation” vs “hard failure”

  • Degradation: throughput down, latency up, correctable errors increasing, clocks unstable. Action: drain + investigate, possibly firmware/driver tuning, airflow check.
  • Hard failure: uncorrectable ECC, GPU reset loops, device disappears. Action: quarantine GPU/node immediately, preserve logs, don’t churn.
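
Codifying that call keeps 3 a.m. decisions consistent. Below is a minimal sketch of the classification logic, assuming you already collect these signals; the field names and thresholds are placeholders to adapt to your own telemetry, not a standard schema.

# classify_gpu_event.py - encode the degradation vs hard-failure call (sketch)
# Assumes: you already gather these signals; names/thresholds are illustrative.
def classify(signals: dict) -> str:
    hard = (
        signals.get("uncorrectable_ecc", 0) > 0
        or signals.get("reset_loops", 0) >= 2
        or signals.get("device_missing", False)
    )
    if hard:
        return "hard failure: quarantine GPU/node, preserve logs, do not churn"
    degraded = (
        signals.get("correctable_ecc_per_hour", 0.0) > signals.get("ecc_baseline_per_hour", 10.0)
        or signals.get("link_downshifted", False)
        or signals.get("throttle_active", False)
    )
    if degraded:
        return "degradation: drain and investigate (firmware/driver/airflow)"
    return "healthy: keep watching the rates"

if __name__ == "__main__":
    print(classify({"correctable_ecc_per_hour": 42.0, "link_downshifted": True}))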

Third: determine scope and blast radius

  • Single GPU only: suspect hardware marginality or slot/PSU/airflow local issue.
  • Whole node: suspect platform BIOS, kernel, power supply, backplane, or driver update.
  • Rack/row: suspect cooling, power distribution, firmware rollout, or a bad batch of nodes.
  • Cluster-wide after a change: suspect software/firmware mismatch or scheduler behavior.

Fourth: pick the least risky mitigation that buys time

  • Reduce power cap / disable boost temporarily (a minimal example follows this list).
  • Adjust fan curve or airflow containment; verify inlet temperatures.
  • Disable problematic P2P path / change collective algorithm.
  • Pin to stable driver+firmware combo; roll back if regression is obvious.
  • Quarantine specific GPUs with rising ECC or recurrent resets.
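
As an example of the first mitigation, the power cap can be lowered temporarily through nvidia-smi. A minimal sketch, assuming nvidia-smi is in PATH and you have permission to set limits; the 15% reduction is an illustrative number, and the sketch clamps to the device’s reported minimum rather than guessing.

# lower_power_cap.py - apply a temporary, bounded power-cap reduction (sketch)
# Assumes: nvidia-smi in PATH and root (or equivalent) to change limits.
import subprocess

def query(field: str, idx: int) -> float:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(idx),
         f"--query-gpu={field}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return float(out)

def reduce_power_cap(idx: int, fraction: float = 0.85) -> None:
    current = query("power.limit", idx)
    minimum = query("power.min_limit", idx)
    target = max(minimum, current * fraction)
    # nvidia-smi -pl sets the software power limit; it typically does not
    # persist across reboots, so re-apply via automation if you keep it.
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(int(target))], check=True)
    print(f"GPU {idx}: power limit {current:.0f} W -> {int(target)} W")

if __name__ == "__main__":
    reduce_power_cap(0)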

Hands-on tasks: commands, outputs, and decisions

These are the bread-and-butter checks I run when MCM GPUs smell funny. Each task includes a command, what the output means, and what decision you make. Commands assume Linux with common tooling; adjust to your environment.

Task 1: Check basic GPU visibility and persistence state

cr0x@server:~$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2d3f...)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-9a11...)

What it means: GPUs enumerate and the driver can talk to them. A missing GPU suggests PCIe/link/platform issues, not “a slow kernel.”

Decision: If a GPU is missing or shows up intermittently, immediately check PCIe AER and dmesg; don’t waste time in ML framework logs.

Task 2: Read driver/kernel error logs for resets and bus events

cr0x@server:~$ sudo dmesg -T | egrep -i "nvrm|xid|pcie|aer|amdgpu|iommu" | tail -n 20
[Mon Jan 21 08:12:44 2026] NVRM: Xid (PCI:0000:41:00): 79, pid=22341, GPU has fallen off the bus.
[Mon Jan 21 08:12:44 2026] pcieport 0000:40:01.0: AER: Corrected error received: 0000:40:01.0
[Mon Jan 21 08:12:44 2026] pcieport 0000:40:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)

What it means: “Fallen off the bus” is a host link/power/reset class event. AER physical layer errors suggest signal integrity or link instability.

Decision: If you see repeated AER bursts around failures, treat it like a hardware/platform issue first: slot, riser, cable (if any), BIOS PCIe settings, power transients.

Task 3: Verify PCIe link width/speed didn’t downshift

cr0x@server:~$ sudo lspci -s 41:00.0 -vv | egrep -i "LnkCap:|LnkSta:"
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s<1us, L1<8us
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)

What it means: The device can do Gen4 x16 but is running at Gen3 x8. That is a performance cliff and often a symptom of marginal signal integrity or BIOS quirks.

Decision: If downgraded, reseat/check risers, update platform BIOS, consider forcing PCIe generation in BIOS for stability, and compare against known-good nodes.
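
The same downshift check is easy to run fleet-wide through nvidia-smi’s query fields instead of lspci. A minimal sketch, assuming nvidia-smi is in PATH; note that an idle GPU may legitimately report a lower current link generation to save power, so run this while the GPU is busy (or compare against peers under the same load).

# pcie_downshift_check.py - flag GPUs running below max PCIe gen/width (sketch)
# Assumes: nvidia-smi in PATH. Run under load: idle GPUs downshift on purpose.
import subprocess

FIELDS = ("index,pcie.link.gen.current,pcie.link.gen.max,"
          "pcie.link.width.current,pcie.link.width.max")

def check():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, gen_cur, gen_max, w_cur, w_max = [f.strip() for f in line.split(",")]
        if gen_cur != gen_max or w_cur != w_max:
            print(f"GPU {idx}: DOWNSHIFTED gen {gen_cur}/{gen_max}, width x{w_cur}/x{w_max}")
        else:
            print(f"GPU {idx}: ok (gen {gen_cur}, x{w_cur})")

if __name__ == "__main__":
    check()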

Task 4: Check Resizable BAR status (common source of “works but weird”)

cr0x@server:~$ sudo lspci -s 41:00.0 -vv | egrep -i "Resizable BAR|BAR 1|Region 0" -n
55: Region 0: Memory at 3a000000000 (64-bit, prefetchable) [size=16M]
61: Resizable BAR: Current Size: 256MB, Supported: 256MB 512MB 1GB 2GB 4GB

What it means: BAR sizing affects how efficiently the CPU maps GPU memory apertures. Some firmware/driver combos regress badly with certain sizes.

Decision: If you see initialization failures after updates, try toggling Resizable BAR in BIOS consistently across the fleet rather than letting it vary node-to-node.

Task 5: Check GPU clocks, power, and throttling reasons

cr0x@server:~$ nvidia-smi -q -d CLOCK,POWER,TEMPERATURE | egrep -i "GPU Current Temp|Power Draw|Power Limit|Clocks|Throttle" -n | head -n 60
118: GPU Current Temp            : 78 C
141: Power Draw                  : 345.22 W
142: Power Limit                 : 400.00 W
210: Clocks
214:     Graphics                : 990 MHz
230:     Memory                  : 1215 MHz
310: Clocks Throttle Reasons
314:     SW Power Cap            : Active
318:     HW Thermal Slowdown     : Not Active

What it means: You’re power capped in software even though thermals aren’t throttling. That can be intentional (data center policy) or a misconfiguration.

Decision: If perf is low and SW power cap is active, validate your power limit policy; consider raising cap or smoothing workload spikes.

Task 6: Inspect ECC error counters (correctables matter)

cr0x@server:~$ nvidia-smi -q -d ECC | egrep -i "Volatile|Aggregate|Correctable|Uncorrectable" -n | head -n 80
60: Volatile ECC Errors
62:     Single Bit
64:         Device Memory        : 120
70:     Double Bit
72:         Device Memory        : 0
90: Aggregate ECC Errors
92:     Single Bit
94:         Device Memory        : 8421

What it means: Correctable ECC errors are accumulating. Volatile counts reset on reboot; aggregate counts persist across reboots (implementation-dependent).

Decision: If correctables are rising quickly or correlate with hot periods, drain that GPU from latency-sensitive workloads and investigate cooling and firmware updates. If double-bit (uncorrectable) appears, quarantine immediately.
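
Because the decision hinges on the rate rather than the lifetime total, it helps to sample the aggregate counter twice and convert to errors per hour. A minimal sketch, assuming nvidia-smi is in PATH; the field name and the 10-minute window are starting points, and field availability varies by GPU generation and driver.

# ecc_rate.py - turn ECC correctable counters into a rate (sketch)
# Assumes: nvidia-smi in PATH; returns N/A on GPUs without ECC enabled/reported.
import subprocess
import time

FIELD = "ecc.errors.corrected.aggregate.total"

def read_corrected(idx: int = 0) -> int:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(idx),
         f"--query-gpu={FIELD}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if out in ("N/A", "[N/A]"):
        raise SystemExit("ECC not enabled/reported on this GPU")
    return int(out)

if __name__ == "__main__":
    window_s = 600                                 # sample over 10 minutes
    before = read_corrected()
    time.sleep(window_s)
    after = read_corrected()
    per_hour = (after - before) * 3600 / window_s
    # Alert on the rate (and changes in the rate), not on the lifetime total.
    print(f"correctable ECC rate: {per_hour:.1f}/hour")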

Task 7: Pull fabric / NVLink-class link status (where supported)

cr0x@server:~$ nvidia-smi nvlink --status
GPU 0: Link 0: Up
GPU 0: Link 1: Up
GPU 1: Link 0: Up
GPU 1: Link 1: Down

What it means: One link is down; the system may fall back to slower paths, changing performance and sometimes stability depending on topology.

Decision: If a link is down unexpectedly, schedule maintenance. Don’t “just rerun jobs”—collectives may behave differently and hide the problem until peak traffic.

Task 8: Check PCIe AER counters live (if exposed) via journal

cr0x@server:~$ sudo journalctl -k --since "10 min ago" | egrep -i "AER:|PCIe Bus Error" | tail -n 20
Jan 21 08:11:05 server kernel: pcieport 0000:40:01.0: AER: Corrected error received: 0000:40:01.0
Jan 21 08:11:05 server kernel: pcieport 0000:40:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer

What it means: Corrected physical layer errors often indicate marginal links. One or two might be noise; bursts under load are a pattern.

Decision: If the count spikes with GPU activity, prioritize platform fixes: reseat, BIOS settings, swap riser, or move GPU to a different slot to confirm.

Task 9: Confirm IOMMU state (can interact with resets and peer-to-peer)

cr0x@server:~$ dmesg -T | egrep -i "iommu|dmar" | head -n 20
[Mon Jan 21 07:58:01 2026] DMAR: IOMMU enabled
[Mon Jan 21 07:58:01 2026] DMAR: Intel(R) Virtualization Technology for Directed I/O

What it means: IOMMU is enabled. That’s often correct in shared environments, but it can expose device/driver bugs or performance regressions depending on settings.

Decision: If you see DMA mapping faults or weird peer-to-peer behavior, test with known-good IOMMU settings (including passthrough) on a canary before changing fleet-wide.

Task 10: Measure GPU utilization vs memory utilization to spot “fabric waits”

cr0x@server:~$ nvidia-smi dmon -s pu -d 1 -c 5
# gpu   pwr gtemp mtemp    sm   mem   enc   dec
# Idx     W     C     C     %     %     %     %
    0   210    74     -     12     85     0     0
    1   205    73     -     15     88     0     0

What it means: SM utilization is low while memory utilization is high: the workload is memory-bound, stalled, or waiting on transfers. In MCM, internal fabric problems can masquerade as “memory-bound.”

Decision: If this pattern appears suddenly after a change or only on certain GPUs, suspect throttling, link downgrades, or a fabric/link issue rather than “the model got worse.”

Task 11: Check for GPU resets and persistence events via vendor tools (NVIDIA example)

cr0x@server:~$ nvidia-smi -q | egrep -i "Reset Status|Pending|Retired Pages" -n | head -n 80
420: Reset Status
424:     Reset Required          : No
610: Retired Pages
614:     Single Bit ECC          : 2
618:     Double Bit ECC          : 0

What it means: Page retirement indicates the driver/firmware is mapping out bad memory locations. That’s a reliability signal, not a trivia fact.

Decision: If page retirements climb, plan replacement. If they stabilize at a low number and no other errors occur, you may keep it in a lower-tier pool, but watch it closely.

Task 12: Stress-test the interconnect path you actually use (not a toy benchmark)

cr0x@server:~$ python3 -c "import torch; import time; x=torch.randn(8192,8192,device='cuda'); torch.cuda.synchronize(); t=time.time(); y=x@x; torch.cuda.synchronize(); print('matmul_s', time.time()-t)"
matmul_s 0.7421183586120605

What it means: A simple kernel timing sanity check. It won’t validate your entire stack, but it helps distinguish “GPU is fundamentally slow/broken” from “the cluster network is on fire.”

Decision: If single-GPU compute is stable but multi-GPU jobs fail, focus on P2P/NVLink/PCIe topology and collective libraries. If single-GPU is unstable, focus on thermals/power/driver/firmware per node.
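
To spot the one slow outlier instead of asking “is this GPU OK” one device at a time, run the same kernel on every GPU in the node and compare. A minimal sketch, assuming PyTorch with CUDA; it adds a warm-up pass so you time steady-state compute rather than context creation, and the 25% outlier threshold is illustrative.

# per_gpu_matmul.py - compare the same kernel across GPUs to spot outliers (sketch)
# Assumes: PyTorch with CUDA; warm-up keeps context/compile cost out of the timing.
import time
import torch

def matmul_seconds(idx: int, n: int = 8192) -> float:
    x = torch.randn(n, n, device=f"cuda:{idx}")
    _ = x @ x                                      # warm-up: context, cuBLAS handles
    torch.cuda.synchronize(idx)
    t0 = time.time()
    _ = x @ x
    torch.cuda.synchronize(idx)
    return time.time() - t0

if __name__ == "__main__":
    count = torch.cuda.device_count()
    if count == 0:
        raise SystemExit("no CUDA GPUs visible")
    times = {i: matmul_seconds(i) for i in range(count)}
    best = min(times.values())
    for idx, secs in times.items():
        flag = "  <-- outlier" if secs > best * 1.25 else ""   # illustrative threshold
        print(f"cuda:{idx}: {secs:.3f} s{flag}")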

Task 13: Validate topology and peer-to-peer capability (NVIDIA example)

cr0x@server:~$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity
GPU0     X      NV1     0-31
GPU1    NV1      X      0-31

What it means: GPUs have a fast link (NV1). If you expected NVLink and you see “PHB” or “SYS,” your traffic is going over PCIe/CPU, and your perf model is wrong.

Decision: If topology is not what you designed for, don’t tune software endlessly. Fix placement, cabling, BIOS, or the physical build.

Task 14: Confirm versions (driver, firmware) and stop running mystery combos

cr0x@server:~$ nvidia-smi
Wed Jan 21 08:14:02 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
+-----------------------------------------------------------------------------+

What it means: You have a concrete driver version. Tie this to your firmware version inventory; “latest” is not an inventory strategy.

Decision: If failures correlate with a recent driver rollout, freeze and roll back on a subset. If it correlates with firmware, coordinate a matched upgrade/rollback plan.
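
Turning that into inventory is a few lines of glue: record driver and VBIOS per GPU and ship it to your CMDB or telemetry pipeline. A minimal sketch, assuming nvidia-smi is in PATH; some fields (such as serial) come back as N/A on certain SKUs, and platform BIOS would come from a separate source like dmidecode.

# gpu_inventory.py - record driver + VBIOS per GPU so combos are never a mystery (sketch)
# Assumes: nvidia-smi in PATH; extend with platform BIOS/kernel data as you track it.
import json
import subprocess

FIELDS = "index,name,uuid,serial,driver_version,vbios_version"

def inventory():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    keys = FIELDS.split(",")
    return [dict(zip(keys, (f.strip() for f in line.split(","))))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    # Print here for illustration; in production, push to your inventory system.
    print(json.dumps(inventory(), indent=2))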

Task 15: Check CPU frequency scaling and IRQ storms (because GPUs don’t live alone)

cr0x@server:~$ sudo turbostat --Summary --quiet --show Avg_MHz,Busy%,Bzy_MHz -i 2 -n 3
Avg_MHz  Busy%  Bzy_MHz
  1020   82.15    1242
   981   81.10    1210
  1004   81.80    1228

What it means: If the CPU is stuck at low frequency due to power policy or thermal limits, it can bottleneck data feeding, launch overhead, and I/O handling—especially for smaller kernels or heavy control-plane traffic.

Decision: If GPU utilization is low and CPU looks constrained, fix CPU power policy, cooling, or IRQ distribution before blaming the GPU.

Task 16: Quick NIC and RDMA sanity (multi-GPU jobs fail here a lot)

cr0x@server:~$ ip -s link show dev eth0 | head -n 12
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped  missed   mcast
    9876543210  1234567  0       0        0        12
    TX:  bytes  packets  errors  dropped  carrier collsns
    8765432109  2345678  0       0        0       0

What it means: No errors/drops. If you do see errors, your “GPU hang” might be the job waiting on network collectives.

Decision: If network errors align with stalls, triage the fabric first. If network is clean and only certain GPU pairs fail, focus back on GPU P2P links and topology.

Three corporate mini-stories (anonymized, plausible, and painfully real)

Mini-story 1: The incident caused by a wrong assumption

The company had a new MCM-based GPU fleet and a shiny internal doc that said “ECC correctables are expected; ignore unless uncorrectables.” Someone wrote that doc based on an older generation where correctables were rare and mostly harmless.

Six months later, training jobs began failing in clusters—always during the same phase of a model run. Engineers blamed a recent framework update. They rolled it back. Failures continued. Then they blamed the collective library. They tuned it. Failures continued. Meanwhile, the dashboard for ECC correctables was green because the alert threshold was set to “unreasonably high, to avoid noise.”

Eventually, one SRE compared a “good node” and a “bad node” side-by-side. The bad node showed a steady climb in correctable HBM errors correlated with higher inlet temperatures at the top of the rack. Nothing dramatic. Just a slope.

The fix wasn’t glamorous: they tightened thermal control, adjusted fan curves, and changed the automation to quarantine GPUs when correctables rose faster than a baseline rate. After that, uncorrectables mostly disappeared. The wrong assumption was treating correctables as ignorable rather than predictive.

Mini-story 2: The optimization that backfired

A performance team wanted to squeeze out extra throughput. They increased GPU power limits and enabled the most aggressive boost behavior allowed by the vendor. Benchmarks improved nicely. The rollout went broad, because the canary tests were short and the graphs looked good.

Two weeks later, intermittent GPU resets started showing up, always under bursty workloads. The logs looked like classic “driver flake.” Engineers chased software ghosts: container runtimes, kernel versions, even a suspicious monitoring agent. The resets kept happening, mostly on nodes with slightly older PSUs and slightly warmer ambient air.

What actually happened was boring physics: the new power policy increased transient load steps, and a subset of boards/platforms had less margin. The internal fabrics and PCIe links were more sensitive to those dips than the previous monolithic GPUs had been. The failures were rare enough to slip through short tests, but common enough to destroy SLOs in production.

The rollback to conservative power limits stopped the resets. Then they reintroduced tuning slowly, with longer soak tests and explicit monitoring for PCIe AER bursts and throttling reason changes. The lesson: in MCM land, “more power” is not a free performance slider; it’s also a reliability lever.

Mini-story 3: The boring but correct practice that saved the day

A different org ran a strict hardware+firmware matrix. Every node reported driver version, VBIOS/firmware versions, platform BIOS, and a small set of RAS counters into a CMDB-like system. It was dull. Engineers complained it slowed down rollouts.

Then a vendor released a driver update that improved performance on paper but introduced a rare reset bug on a specific firmware stepping. Only a subset of nodes had that stepping—because of a mid-year procurement batch. Mixed fleets happen; pretending otherwise doesn’t help.

When failures started, they didn’t need a week of archaeology. They queried: “show nodes with driver X and firmware Y and rising PCIe AER.” The intersection set was small and crisp. They pinned those nodes to the older driver, kept the rest on the new version, and continued shipping.

It wasn’t heroics. It was inventory discipline and staged rollouts. The best incident response is not having to guess.

Common mistakes: symptom → root cause → fix

1) Symptom: sudden throughput drop on a subset of nodes

Root cause: PCIe link downshift (Gen4→Gen3, x16→x8) after retraining due to marginal signal integrity or riser issues.

Fix: Check lspci -vv link status; reseat GPU/riser, update platform BIOS, consider forcing PCIe gen, replace suspect risers.

2) Symptom: intermittent “GPU fallen off the bus” under load

Root cause: power transient/VRM margin issue, sometimes exacerbated by aggressive power limits/boost or inadequate PSU headroom.

Fix: temporarily lower power cap; verify PSU and cabling; update GPU firmware; run longer soak tests before re-enabling aggressive tuning.

3) Symptom: jobs fail only on multi-GPU, single-GPU tests pass

Root cause: fabric/P2P/NVLink-class link issues, topology mismatch, or collective algorithm sensitivity to link errors.

Fix: verify nvidia-smi topo -m and link status; switch collective algorithm or disable P2P path as mitigation; schedule hardware inspection if a link is down.

4) Symptom: rising correctable ECC errors but “no failures yet”

Root cause: thermal gradient, marginal HBM PHY, aging package interconnect. It’s often predictive.

Fix: correlate ECC rate with inlet temp and fan speed; improve cooling; update firmware/driver; quarantine if rate exceeds baseline or page retirements increase.

5) Symptom: GPU visible to OS but runtime fails to create contexts

Root cause: partial reset left device in an inconsistent state; driver/firmware mismatch; stale persistence daemon state.

Fix: attempt vendor-supported GPU reset; if not reliable, reboot node; enforce matched driver+firmware combos and consistent persistence settings.

6) Symptom: performance jitter at stable utilization

Root cause: internal link retraining, hidden throttling reasons, or power management oscillation.

Fix: inspect throttling reasons; look for AER bursts; stabilize power policy; update firmware that improves link training.

7) Symptom: after an update, some nodes won’t initialize GPU driver

Root cause: platform BIOS + Resizable BAR + driver interaction, or firmware dependency not met.

Fix: standardize BIOS settings; validate Resizable BAR state; roll back driver on affected hardware stepping; avoid mixed settings across fleet.

8) Symptom: “random” failures that follow the same physical GPU across hosts

Root cause: hardware marginality or packaging-level aging; error counters tell the story.

Fix: swap GPU into a known-good host to confirm; if the problem follows the GPU, quarantine and RMA; stop blaming kernels.

Checklists / step-by-step plan

When you’re buying or adopting MCM GPUs

  1. Demand a driver+firmware support matrix and treat it as contract, not suggestion.
  2. Plan for telemetry ingestion: ECC (per stack if available), link counters, throttle reasons, reset counts.
  3. Standardize platform BIOS settings: PCIe gen policy, Resizable BAR, IOMMU mode.
  4. Design for thermal headroom: inlet temp targets, fan policies, and rack-level airflow validation.
  5. Run soak tests that mimic production (duration matters; burstiness matters).

When you’re rolling out new firmware/driver

  1. Canary on representative hardware steppings and rack locations (hot/cold zones).
  2. Track: PCIe AER bursts, link width/speed, ECC correctable rates, page retirements, reset events.
  3. Keep rollback artifacts ready (driver packages, firmware images, known-good BIOS profile).
  4. Roll out gradually; stop at the first sign of a new error signature.
  5. Never change firmware, driver, and BIOS all at once unless you enjoy ambiguity.

When you’re in an incident

  1. Classify: degradation vs hard failure.
  2. Preserve evidence: dmesg/journal, vendor error logs, ECC counters before reboot if possible.
  3. Check PCIe link state and AER first; it’s fast and frequently decisive.
  4. Mitigate with the smallest blast radius: drain/quarantine a GPU or node; reduce power cap.
  5. Only then dig into framework-level symptoms.

When you’re building long-term reliability

  1. Alert on rates, not just absolute thresholds (ECC correctables per hour beats “ECC > 10,000”).
  2. Keep a “known-good” baseline node for comparisons.
  3. Automate quarantine for recurrent reset signatures and rising error rates.
  4. Maintain inventory of: GPU firmware/VBIOS, driver, kernel, platform BIOS, and slot/riser mapping.
  5. Run periodic link and throughput sanity checks to catch silent downshifts.

FAQ

1) Are MCM GPUs inherently less reliable than monolithic GPUs?

Not inherently. They have more failure surfaces (links, training, packaging), but also better RAS options and better yield screening. Reliability depends heavily on firmware maturity, platform quality, and your operational discipline.

2) What’s the single most common “it’s hardware” indicator?

Repeated PCIe AER events under load, link downshifts, and “fallen off the bus” messages. Those are not your Python code misbehaving.

3) Should I ignore correctable ECC errors?

No. Treat them like a smoke detector. One occasional correction might be fine; a rising rate is predictive and deserves investigation or quarantine.

4) Why do problems show up only after warm reboot?

Because link training and firmware initialization sequences can differ between cold boot and warm reset. Some marginal channels pass one path but not the other.

5) Do firmware updates usually improve stability?

Often, yes—especially early in a product’s life. But they can also introduce regressions. Roll firmware like you roll kernels: staged, measured, reversible.

6) What metrics should I alert on for MCM GPUs?

ECC correctable rate, uncorrectable events, page retirements, reset counts, throttle reasons (power/thermal), PCIe link width/speed, and PCIe AER bursts. Add fabric link status where available.

7) Is “GPU utilization low, memory high” always a model problem?

No. It can indicate memory-bound behavior, but in MCM systems it can also indicate internal fabric issues, link retraining, throttling, or host-side bottlenecks feeding the GPU.

8) How do I decide between quarantining a GPU vs replacing the whole node?

If the issue follows the GPU when moved to a known-good host, quarantine/RMA the GPU. If multiple GPUs fail in the same host/slot/rack, suspect platform power, thermals, BIOS, risers, or backplane.

9) Can Resizable BAR cause real production problems?

Yes—mostly via firmware/driver interactions. Standardize the setting and validate with your exact driver+firmware combo; don’t let it vary across the fleet.

10) What’s the best “first test” when performance looks wrong?

Check PCIe link state (lspci -vv), then throttling reasons and power/thermals, then ECC and reset logs. Fast, objective, and usually decisive.

Conclusion: practical next steps

MCM graphics is not fragile magic. It’s just a more distributed GPU: more links, more controllers, more states. When it fails, it often fails as “degraded reality” rather than a clean outage, which is worse for SLOs and harder on humans.

What to do next, in order:

  1. Standardize and inventory your driver+firmware+BIOS matrix. Mixed mystery combos are how you get haunted clusters.
  2. Alert on trends: ECC correctable rate, AER bursts, link downshifts, reset signatures.
  3. Operationalize quarantine for repeat offenders. Don’t let one marginal GPU poison a job queue.
  4. Stage rollouts with soak tests that reflect production, including bursty phases and multi-GPU collectives.
  5. Keep a fast diagnosis loop that starts with PCIe/link/thermals before diving into framework logs.

If you do those things, MCM GPUs stop being mysterious. They become what they should be: expensive heaters that also do useful math.
