Fakes and Refurbs: How to Avoid a Ghost GPU

You bought “an A100-class card” off a broker because lead times were ugly and finance was uglier. It arrives in a box that looks like it survived a small war, gets racked, and then… your ML jobs run like they’re doing math on a toaster.

Welcome to the ghost GPU: hardware that “shows up” in inventory and kind of works, until it doesn’t. Fakes, refurbs, mismatched firmware, mining wear, virtualized impostors, and well-meaning resellers who blur the line between “tested” and “powered on once.” If you run production systems, the only correct posture is skepticism backed by repeatable checks.

What a “ghost GPU” actually is (and why it happens)

A ghost GPU is not one thing. It’s a category of unpleasant surprises where the device enumerates and appears legitimate, but its identity, condition, or capability is not what you paid for. The theme is mismatched expectations:

  • Identity mismatch: You expected Model X with 80GB; you got Model Y with a vBIOS that lies, or a board with disabled features.
  • Condition mismatch: “New” is actually refurb; “refurb” is actually an ex-mining or ex-crypto-datacenter card with a hard life.
  • Capability mismatch: It’s the right silicon, but thermals, PCIe link, power delivery, or memory stability are compromised.
  • Environment mismatch: The GPU is real, but your driver, kernel, firmware, PCIe topology, or virtualization layer makes it behave like a fake.

Modern GPUs are a stack: PCIe device IDs, vBIOS, VPD fields, board ID, serials, fuses, ECC telemetry, and a driver that translates marketing into reality. Attackers, shady resellers, and honest accidents all target the same weak point: you trusting the label more than the telemetry.

My bias: treat every GPU like a storage device. You wouldn’t deploy a random used SSD into a database cluster without reading SMART, checking firmware, and running a burn-in. Same deal here—except GPUs fail in more creative ways.

Joke #1: A “lightly used mining GPU” is like a “lightly used parachute.” It might be fine, but you don’t want to discover the truth mid-flight.

A few facts and history that explain today’s mess

Some context makes the current market feel less like chaos and more like predictable entropy.

  1. GPU counterfeits predate AI. Long before LLMs, counterfeiters reflashed lower-tier consumer cards to appear as higher models for gaming and workstation buyers.
  2. VBIOS flashing has been a hobby for decades. Enthusiasts used it to unlock features, change fan curves, or squeeze performance—creating tooling and know-how that counterfeiters later weaponized.
  3. The 2017–2018 crypto boom normalized “GPU fleets.” That wave produced huge volumes of heavily used cards that later re-entered the resale market, often cosmetically cleaned.
  4. Datacenter GPUs became supply-chain targets once they became scarce. When lead times spiked, brokers appeared overnight. Some are legitimate; some are basically e-waste routing services.
  5. PCIe topology matters more now. Modern servers have complex lane routing, bifurcation, retimers, and multiple root complexes—so “it’s slow” can be your platform, not the card.
  6. ECC telemetry changed the game. Many datacenter GPUs expose correctable/uncorrectable error counters. That’s a gift—if you actually look at it.
  7. MIG and SR-IOV complicate identity. Partitioning and virtualization can make a legitimate GPU look “wrong” if you don’t know what mode it’s in.
  8. Thermal materials age faster than marketing admits. Pads and paste degrade; fans wear; VRM components drift. A refurb that “passes boot” can still be one thermal cycle away from throttling.

Threat model: how fake and refurb GPUs fool you

1) “It enumerates, therefore it’s real” is a trap

PCIe enumeration proves you have a device. It does not prove it’s the right device, the right memory size, or the right performance bin. Counterfeiters can spoof names in software-facing layers. And plain refurbs can carry mismatched firmware that reports confusing identifiers.
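
One cheap habit catches a surprising amount of this: compare the bus-level device ID with the driver-level device ID using two independent tools. A minimal sketch, assuming an NVIDIA card at 01:00.0 (an example address) and a driver recent enough to support these query fields:

cr0x@server:~$ sudo lspci -s 01:00.0 -nn | grep -oE '\[10de:[0-9a-f]+\]'
cr0x@server:~$ nvidia-smi --query-gpu=pci.bus_id,pci.device_id,name --format=csv,noheader

Note that nvidia-smi typically reports device and vendor IDs as a single hex value, so compare the digits rather than expecting identical strings. If the two tools don’t describe the same silicon, or the driver’s name string doesn’t match the PCI ID, you’re past benign confusion and into quarantine territory.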

2) The three common fraud patterns

  • Reflash a lower model as a higher model: Often fails under load, shows odd memory size, or has inconsistent device IDs across tools.
  • Sell repaired boards as “new”: Reworked components, replaced memory chips, or “reballed” GPUs. Sometimes stable, sometimes not. The failures can be intermittent and heat-dependent.
  • Sell ex-fleet/ex-mining as “refurbished”: The silicon might be correct. The issue is lifespan consumption: fans, VRAM, and VRMs. Mining workloads are brutal: constant load, constant heat, frequent undervolting/overclocking experimentation.

3) The unsexy failure: procurement ambiguity

The most common “fake GPU” incident in enterprises isn’t a Hollywood counterfeit ring. It’s a purchase order that says “A100 equivalent” and an asset intake process that checks only “nvidia-smi shows something.” The ghosts thrive in gaps between teams: procurement, datacenter ops, platform engineering, and finance.

4) The reliability mindset, paraphrased

Paraphrasing John Allspaw: reliability is an outcome of systems thinking. Assumptions are the enemy; feedback loops are the cure.

Fast diagnosis playbook (first/second/third checks)

You’ve got a GPU that “seems off.” Maybe performance is terrible, maybe it disappears, maybe you suspect it’s not what the label says. Here’s how to find the bottleneck without turning your day into a forensic hobby.

First: prove basic identity and topology

  1. Does the kernel see it consistently? Check PCI enumeration and kernel logs.
  2. Does the driver agree? Compare lspci vs nvidia-smi (or ROCm tools) for consistency.
  3. Is the PCIe link healthy? Link speed and width. A x16 card running at x4 is a “ghost” in practice.

Second: prove capabilities match expectations

  1. Memory size and ECC state. Does it report expected VRAM? ECC on/off matches SKU norms?
  2. Clocks and power limits. Is it power-throttled? Stuck in low P-state?
  3. Thermals under load. Does it throttle immediately?

Third: prove stability with a short burn-in

  1. Run a controlled load test. Watch for Xid errors, ECC spikes, resets.
  2. Check persistence and reboots. Some fraud only shows after a cold boot or a power cycle.
  3. Compare to a known-good baseline. Same host class, same driver, same test.

The goal is to separate: “bad card” vs “bad platform” vs “bad configuration” in under an hour. If you can’t, you don’t have a diagnosis problem—you have an observability gap.
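
If you want that hour to be repeatable, script the first pass. Below is a minimal triage sketch, not a harness: it assumes an NVIDIA driver stack and takes the PCI address of the suspect GPU as an argument (the default is just an example).

#!/usr/bin/env bash
# Quick GPU triage sketch: identity, link state, driver agreement, recent errors.
set -euo pipefail
ADDR="${1:-01:00.0}"   # PCI address of the GPU under suspicion (example default)

echo "== PCI identity =="
sudo lspci -s "$ADDR" -nn

echo "== Link capability vs negotiated state =="
sudo lspci -s "$ADDR" -vv | grep -E 'LnkCap:|LnkSta:'

echo "== Driver view =="
nvidia-smi -L || echo "driver does not enumerate any GPU"

echo "== Recent GPU-related kernel messages =="
sudo dmesg -T | grep -iE 'nvrm|xid|aer' | tail -n 20

Run it on a known-good node first so you have something to compare against; a triage script without a baseline is just a prettier way to be confused.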

Acceptance tests with commands: prove what you bought

Below are practical tasks you can run on Linux hosts. Each one includes: a command, what typical output means, and what decision to make next. These are written for NVIDIA tooling because it’s common in production, but the posture applies everywhere: verify from multiple layers.

Task 1: Confirm PCIe device identity and vendor

cr0x@server:~$ sudo lspci -nn | egrep -i 'vga|3d|nvidia|amd'
01:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)

What it means: You get the vendor/device ID pair ([10de:20b5]) and a human-readable name. The ID is harder to spoof than a marketing name.

Decision: If the device ID doesn’t match the SKU you purchased, stop. Don’t “see if it works.” Open an intake ticket and quarantine the hardware.

Task 2: Inspect extended PCIe capabilities and link status

cr0x@server:~$ sudo lspci -s 01:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
LnkSta: Speed 16GT/s (ok), Width x16 (ok)

What it means: LnkCap is what the hardware supports; LnkSta is what you actually negotiated. “(ok)” is your friend.

Decision: If width is x8/x4 or speed is 8GT/s when you expect 16GT/s, troubleshoot platform: seating, slot choice, BIOS settings, bifurcation, retimers. A “fake GPU” complaint is often a “wrong slot” mistake wearing a costume.

Task 3: Check kernel logs for GPU-related errors

cr0x@server:~$ sudo dmesg -T | egrep -i 'nvrm|xid|pcie|aer' | tail -n 25
[Mon Jan 21 09:13:42 2026] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.54.14  Tue Dec 17 20:41:24 UTC 2025
[Mon Jan 21 09:14:03 2026] pcieport 0000:00:03.0: AER: Corrected error received: 0000:01:00.0
[Mon Jan 21 09:14:03 2026] nvidia 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer

What it means: Corrected AER errors can be “fine-ish” or a sign of marginal signal integrity. Uncorrected errors are a bigger deal. Xid codes are driver-level GPU faults worth correlating.

Decision: If you see frequent AER spam under load, suspect risers, retimers, slot issues, or borderline hardware. Move the GPU to another host/slot before declaring the card fraudulent.

Task 4: Confirm driver sees the same GPU(s)

cr0x@server:~$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-80GB (UUID: GPU-3b2c1c39-1d1a-2f0a-8f1a-1f5c8f5c2a11)

What it means: The driver stack enumerates the GPU and exposes a UUID. This is the inventory handle you should track in CMDB, not “slot 1.”

Decision: If lspci shows the GPU but nvidia-smi does not, suspect driver mismatch, kernel module load failure, Secure Boot issues, or a GPU that fails initialization.

Task 5: Capture full inventory fields for later comparison

cr0x@server:~$ nvidia-smi -q | egrep -i 'Product Name|Product Brand|VBIOS Version|Serial Number|GPU UUID|Board Part Number|FRU Part Number'
Product Name                    : NVIDIA A100-PCIE-80GB
Product Brand                   : NVIDIA
VBIOS Version                   : 94.00.9F.00.02
Serial Number                   : 0324XXXXXXXXXX
GPU UUID                        : GPU-3b2c1c39-1d1a-2f0a-8f1a-1f5c8f5c2a11
Board Part Number               : 699-2G133-0200-000
FRU Part Number                 : 900-2G133-0200-000

What it means: These fields help detect “mixed lots” and sketchy refurb chains. If serials are missing, duplicated, or inconsistent across tools, it’s a red flag.

Decision: Record this output at intake. If later RMA or audits happen, you’ll want proof of what you actually received and deployed.
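
If you want that record to be machine-readable instead of a screenshot pasted into a ticket, a small capture script is enough. A sketch, assuming the query fields below exist on your driver version; the output file naming is an arbitrary example.

#!/usr/bin/env bash
# Intake inventory sketch: dump identity fields for every GPU into a dated CSV.
set -euo pipefail
OUT="gpu-intake-$(hostname)-$(date +%Y%m%d).csv"   # example naming convention

echo "uuid,name,serial,vbios_version,pci.bus_id,pci.device_id" > "$OUT"
nvidia-smi \
  --query-gpu=uuid,name,serial,vbios_version,pci.bus_id,pci.device_id \
  --format=csv,noheader >> "$OUT"

echo "wrote $OUT"

Ship the file to your asset database keyed by GPU UUID and you have something to diff against when the RMA conversation starts.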

Task 6: Check memory size, ECC mode, and ECC error counters

cr0x@server:~$ nvidia-smi --query-gpu=name,memory.total,ecc.mode.current,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
name, memory.total [MiB], ecc.mode.current, ecc.errors.corrected.volatile.total, ecc.errors.uncorrected.volatile.total
NVIDIA A100-PCIE-80GB, 81920 MiB, Enabled, 0, 0

What it means: VRAM and ECC are sanity checks. Correctable errors climbing under light use can indicate aging or damaged memory.

Decision: If VRAM is wrong, stop. If ECC correctables climb during burn-in, quarantine: that card is not for production training or inference with SLAs.

Task 7: Verify power limits and throttling reasons

cr0x@server:~$ nvidia-smi --query-gpu=power.draw,power.limit,clocks.current.sm,clocks_throttle_reasons.active --format=csv
power.draw [W], power.limit [W], clocks.current.sm [MHz], clocks_throttle_reasons.active
45.12 W, 250.00 W, 210 MHz, Not Active

What it means: Idle clocks should be low; under load you should see clocks rise and power approach limit. Throttle reasons tell you if you’re power/thermal capped.

Decision: If throttling is active at modest load, check cooling, power cabling, and PSU rails. If power limit is strangely low and locked, suspect firmware policies or vendor restrictions typical of some refurbs.

Task 8: Confirm PCIe generation from the driver’s perspective

cr0x@server:~$ nvidia-smi -q | egrep -i 'PCIe Generation|Link Width' -A2
PCIe Generation
    Max                         : 4
    Current                     : 4
Link Width
    Max                         : 16x
    Current                     : 16x

What it means: Cross-check against lspci. If they disagree, you have a platform reporting issue or a negotiated link that changes with power state.

Decision: If Current is below Max at steady state under load, you may be stuck in a low-power state due to BIOS settings or ASPM quirks.

Task 9: Check temperature, fan, and immediate throttling behavior

cr0x@server:~$ nvidia-smi --query-gpu=temperature.gpu,temperature.memory,fan.speed,pstate --format=csv
temperature.gpu, temperature.memory, fan.speed [%], pstate
36, 40, 30 %, P8

What it means: Idle temperatures should be sane. Under load, watch for runaway memory temps; memory can throttle before core does.

Decision: If memory temp is high at idle or spikes rapidly, suspect poor pad contact (common after sloppy refurb) or blocked airflow.

Task 10: Validate compute functionality with a minimal CUDA sample

cr0x@server:~$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery | egrep -i 'Device 0|CUDA Capability|Total amount of global memory|Result'
Device 0: "NVIDIA A100-PCIE-80GB"
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81161 MBytes (85014073344 bytes)
Result = PASS

What it means: Confirms runtime sees expected compute capability and memory. This is a cheap catch for obvious mismatches.

Decision: If compute capability doesn’t match expectation, your “A100” may be a different architecture, or your container/driver stack is lying through incompatible libraries.

Task 11: Run a short burn-in and watch for Xid/ECC events

cr0x@server:~$ sudo gpu-burn -d 60
Burning for 60 seconds.
GPU 0: 0.0%  (0/1024 MB)
GPU 0: 100.0%  (1024/1024 MB)
Tested 1 GPUs:
  GPU 0: OK

What it means: A quick burn-in isn’t proof of long-term health, but it flushes out immediate instability: bad VRAM chips, marginal power delivery, overheating.

Decision: If the test crashes the driver, triggers Xid errors, or spikes ECC, quarantine and re-test in a different host. If it follows the card, it’s the card.
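
A sketch of how to wrap that burn-in so the evidence collects itself: snapshot ECC counters before and after, then look for Xid noise. It assumes gpu-burn is installed as above and that the ECC query fields from Task 6 are available; the default duration is an arbitrary example.

#!/usr/bin/env bash
# Burn-in wrapper sketch: ECC snapshots around a stress run, plus a Xid check.
set -euo pipefail
DUR="${1:-600}"   # seconds of stress; example default
ECC="ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total"

nvidia-smi --query-gpu=uuid,$ECC --format=csv,noheader > ecc_before.csv
sudo gpu-burn -d "$DUR"
nvidia-smi --query-gpu=uuid,$ECC --format=csv,noheader > ecc_after.csv

echo "== ECC before vs after =="
diff ecc_before.csv ecc_after.csv && echo "no ECC movement" || echo "ECC counters moved: quarantine"

echo "== Xid events in the kernel log =="
sudo dmesg -T | grep -i 'xid' | tail -n 10

Keep ecc_before.csv and ecc_after.csv with the intake record; “it passed burn-in” is much more convincing with the counters attached.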

Task 12: Check GPU reset behavior (useful for flaky refurbs)

cr0x@server:~$ sudo nvidia-smi --gpu-reset -i 0
GPU 00000000:01:00.0 was reset.

What it means: Reset functionality working is a sign the GPU is responding sanely to management controls.

Decision: If reset fails repeatedly, or the GPU disappears after reset until a full host power cycle, that’s a stability red flag. Don’t ship it into a Kubernetes node pool and hope for the best.

Task 13: Verify IOMMU grouping for passthrough / isolation sanity

cr0x@server:~$ for d in /sys/kernel/iommu_groups/*/devices/*; do echo "$d"; done | egrep '01:00.0|01:00.1'
/sys/kernel/iommu_groups/42/devices/0000:01:00.0
/sys/kernel/iommu_groups/42/devices/0000:01:00.1

What it means: If you’re doing passthrough, “mysterious” GPU behavior can be IOMMU and ACS issues. A legit GPU can look haunted when it’s grouped with other devices.

Decision: If the GPU shares a group with storage or NICs, you may need BIOS ACS settings or different slot placement.
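
To see exactly which devices share the group (and therefore share the passthrough blast radius), a short loop over sysfs is enough. A sketch, assuming the GPU sits at 0000:01:00.0 as in the example above:

cr0x@server:~$ for d in /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/*; do lspci -nns "$(basename "$d")"; done

If a NIC or HBA shows up next to the GPU, your “haunted GPU” may really be an isolation problem, not a hardware one.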

Task 14: Confirm persistence mode and compute mode policies

cr0x@server:~$ sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.

What it means: Persistence can reduce re-init churn and improve stability for some workloads. It’s not a fix for bad hardware, but it reduces “driver cold start” noise during testing.

Decision: If enabling persistence causes errors, you likely have a driver/hardware mismatch. Investigate before workload rollout.

Task 15: Spot virtualization/MIG configuration that changes what you “see”

cr0x@server:~$ nvidia-smi -q | egrep -i 'MIG Mode' -A2
MIG Mode
    Current                     : Enabled
    Pending                     : Enabled

What it means: With MIG enabled, you might not see “one big GPU” in your scheduler the way you expect. People misdiagnose this as “refurb wrong VRAM” constantly.

Decision: If your expectation is full-GPU allocation, disable MIG (with a maintenance window) or adapt your scheduling and inventory logic to MIG instances.

Task 16: Cross-check PCIe errors live while stressing

cr0x@server:~$ sudo journalctl -k -f | egrep -i 'aer|xid|nvrm'
Jan 21 09:22:41 server kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception
Jan 21 09:22:41 server kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0

What it means: Live correlation during stress is how you separate “slow” from “broken.” Xid 13 is often compute-related; the AER line points to link issues.

Decision: If AER errors show up only in one chassis/slot, fix platform. If they follow the GPU, it’s hardware risk—treat accordingly.

Joke #2: The most expensive GPU is the one that passes procurement and fails the first on-call rotation.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company decided to bring inference in-house. They bought a small batch of “datacenter-class GPUs” through a broker because the approved distributor had a waitlist. The intake check was simple: rack the node, run nvidia-smi, confirm a name string, then hand it to the platform team.

Things looked fine for a week. Then the model-serving latency started drifting upward in a way that didn’t match traffic. Auto-scaling added nodes, which helped briefly, then didn’t. The SRE on call noticed a pattern: nodes from the new batch were consistently slower, but only under peak load. The dashboards blamed “model variance.” The business blamed “data.” Classic.

The actual failure was a wrong assumption: “If nvidia-smi says A40, it’s an A40.” The devices were real NVIDIA boards, but they were negotiated at a lower PCIe width due to a slot/riser mismatch in that chassis generation. Under light load, nobody noticed. Under heavy load, the PCIe bottleneck made batch transfers and host-device copies painfully slow, and the scheduler’s placement logic kept putting high-throughput models on those nodes.

The fix was embarrassingly simple: move the GPUs to the correct slots and lock BIOS settings to prevent accidental bifurcation. The lesson was not “brokers are evil.” The lesson was that identity checks must include topology checks. A ghost GPU can be created by your own rack layout.

Mini-story 2: The optimization that backfired

An enterprise ML team wanted higher utilization. They enabled MIG across a fleet so multiple teams could share GPUs safely. The idea was good: isolate workloads, increase fairness, reduce idle. Finance loved the projection.

Then training jobs started failing in weird ways: out-of-memory at sizes that “should fit,” inconsistent performance, and a few containers reporting smaller VRAM than expected. The team suspected counterfeit hardware because the symptoms matched the internet folklore: “the card is lying about memory.” They escalated to procurement, who escalated to legal, who escalated to everyone’s blood pressure.

The root cause was an optimization backfiring due to incomplete communication. MIG was enabled, but the scheduler and internal documentation still assumed full-GPU allocation. Some teams were launching large jobs onto MIG slices and then acting surprised when the slice behaved like a slice. Worse, monitoring aggregated GPU memory at the host level and made it look like memory “vanished” during the day.

The fix was operational: explicit labeling of MIG profiles, admission control to prevent full-GPU jobs on sliced devices, and a proper “GPU identity” endpoint that reports MIG mode and instance sizes. The hardware was innocent; the abstraction layer was the prankster.

Mini-story 3: The boring but correct practice that saved the day

A fintech company with strict change control treated GPUs like any other critical component. Every incoming card went through an intake harness: record PCI IDs, record vBIOS versions, run a 30-minute stress test, log ECC counters before and after, and store the results in an asset database keyed by GPU UUID.

One quarter, they bought a mixed batch: some new, some vendor-refurb, all from legitimate channels. A subset passed initial functionality but showed a small rise in correctable ECC errors during stress. Not enough to crash. Enough to be suspicious. The intake harness flagged them automatically because the policy was “ECC movement during burn-in = quarantine.” Boring rule, consistently applied.

Two weeks later, another team in the industry reported a rash of memory-related GPU failures after deploying similar refurb lots into production. This company didn’t participate in that fun. Their quarantined GPUs were sent back under vendor terms, and the production fleet stayed clean.

The practice wasn’t clever. It didn’t require special access or vendor secrets. It was just discipline: capture baseline telemetry, compare it, and make acceptance decisions before the hardware meets your customers.

Common mistakes: symptoms → root cause → fix

This section is designed for triage. If your GPU situation looks “haunted,” map the symptom to a likely root cause and do the fix that actually changes the outcome.

1) Symptom: “GPU is detected, but performance is half of expected”

Root cause: PCIe link negotiated at lower width/speed (x4/x8, Gen3 instead of Gen4), often due to wrong slot, riser issues, or BIOS bifurcation.

Fix: Verify LnkSta via lspci -vv and nvidia-smi -q. Move to a known-good x16 slot. Disable unexpected bifurcation. Replace riser/retimer if errors persist.

2) Symptom: “nvidia-smi shows the right name, but memory size is wrong”

Root cause: MIG enabled, or you’re looking at a GPU instance rather than a full GPU. Alternatively, a firmware mismatch or a fraudulent reflash.

Fix: Check MIG mode. If MIG is off and memory is still wrong, compare PCI IDs and vBIOS fields; quarantine the card and verify with a known-good host.
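
A quick way to rule MIG in or out before anyone says the word “counterfeit”. A sketch, assuming reasonably recent NVIDIA tooling that supports these query fields and the mig subcommand:

cr0x@server:~$ nvidia-smi --query-gpu=index,name,mig.mode.current,memory.total --format=csv
cr0x@server:~$ sudo nvidia-smi mig -lgi

If MIG is enabled, the “missing” memory is sitting in GPU instances, not in a counterfeiter’s pocket.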

3) Symptom: “GPU disappears after reboot or under load”

Root cause: Power delivery instability, overheating VRMs, marginal PCIe signal integrity, or driver/kernel compatibility. Refurbs with tired power components can do this when warm.

Fix: Check dmesg for AER and Xid errors. Ensure proper power cabling. Validate chassis airflow and fan curves. Re-test in another node to isolate card vs platform.

4) Symptom: “Correctable ECC errors increase during stress”

Root cause: VRAM degradation or previous heavy use. Sometimes also induced by overheating memory due to bad pads.

Fix: Quarantine for production use. Inspect thermals; re-seat and verify cooling. If still increasing, return/RMA.

5) Symptom: “Clocks are stuck low; GPU never boosts”

Root cause: Power limit locked low, thermal cap, persistence/config issues, or running in a restricted mode in a shared environment.

Fix: Check throttle reasons and power limit. Confirm no admin policies (datacenter management, nvidia settings) are capping. Fix airflow; confirm correct drivers.

6) Symptom: “Driver loads, but CUDA sample fails”

Root cause: Mismatched driver/runtime versions, container runtime misconfiguration, or hardware faults that only show under compute.

Fix: Align driver and CUDA runtime. Run deviceQuery on host first, then in container. If host fails, treat as hardware/driver issue; if only container fails, fix container stack.
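
To separate “host problem” from “container problem”, run the same identity check in both places. A sketch, assuming the NVIDIA Container Toolkit is installed; the CUDA image tag is only an example, substitute whatever your fleet actually pins:

cr0x@server:~$ nvidia-smi -L
cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L

If the host sees the GPU and the container does not, the hardware is innocent; fix the container runtime configuration first.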

7) Symptom: “Lots of PCIe corrected errors, but workloads mostly run”

Root cause: Signal integrity issues: riser, retimer, slot contamination, or motherboard marginality. Sometimes triggered by high power draw and heat.

Fix: Move the GPU to a different slot or host. Clean/inspect connectors. Update BIOS/firmware. If errors follow the slot, fix platform; if they follow the GPU, treat as risk.

8) Symptom: “Two GPUs report the same serial or missing serial”

Root cause: Refurb chain reprogrammed VPD fields, or tooling/firmware reporting is broken.

Fix: Key inventory on GPU UUID and PCI IDs, not serial alone. If serial anomalies correlate with other weirdness, quarantine the lot and push back on the supplier.
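
If you capture intake CSVs like the inventory sketch earlier, spotting repeated serials across a lot is one awk invocation. A sketch, assuming that script’s column order (UUID first, serial third):

cr0x@server:~$ awk -F',' '$1 == "uuid" { next } seen[$3]++ { print "repeated serial:", $3, "uuid:", $1 }' gpu-intake-*.csv

Anything this prints deserves a conversation with the supplier before the cards get anywhere near a rack.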

Checklists / step-by-step plan (procurement to prod)

Step 0: Procurement rules (before money leaves the building)

  • Write purchase specs like an engineer, not a marketer. Specify exact model, memory size, form factor, interconnect (PCIe vs SXM), and acceptable refurb status.
  • Require provenance and terms. You want written confirmation of refurb/new status, warranty, and return conditions. If they won’t commit, you shouldn’t either.
  • Ask for vBIOS and part numbers upfront. Legit suppliers can usually provide typical vBIOS/PN ranges or at least acknowledge them.
  • Budget for intake testing time. Hardware without acceptance testing is just expensive randomness.

Step 1: Physical intake (the “don’t be impressed by shrink-wrap” phase)

  • Photograph labels, connectors, and any tamper indicators. Not for drama—so you can prove condition on arrival.
  • Inspect PCIe edge connector for wear, residue, or rework signs.
  • Inspect power connectors for heat discoloration.
  • Check heatsink screws for tool marks; heavy refurb work leaves scars.

Step 2: Baseline software inventory (10 minutes per node)

  • Record lspci -nn, lspci -vv link state, and nvidia-smi -q identity fields.
  • Record GPU UUID, vBIOS version, and board part numbers.
  • Ensure driver versions are consistent across the fleet before judging performance.

Step 3: Platform validation (stop blaming the card for your chassis)

  • Verify PCIe width and generation at idle and under load (a watch-loop sketch follows this list).
  • Confirm BIOS settings: Above 4G decoding, Resizable BAR policies (if applicable), ASPM, bifurcation, SR-IOV/MIG expectations.
  • Confirm adequate power: proper cables, PSU capacity, and no shared rail surprises.
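
Negotiated link state can change with power state, so check it while something is actually running. A sketch, assuming the GPU address from earlier examples; run it in a second terminal while a stress job is active:

cr0x@server:~$ sudo watch -n 2 "lspci -s 01:00.0 -vv | grep LnkSta:"

If the link stays below its maximum width or speed while the GPU is busy, suspect the platform (ASPM, risers, BIOS settings) before you suspect the card.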

Step 4: Burn-in and telemetry policy (make it hard for bad hardware to sneak in)

  • Stress for at least 30 minutes for production-bound GPUs; longer if you can spare it.
  • Capture ECC counters before and after.
  • Capture kernel logs during stress (AER, Xid).
  • Reject criteria: any uncorrectable ECC, any increase in corrected ECC during burn-in, repeated Xid resets, persistent AER under load. (A gate sketch that applies these criteria follows this list.)
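
Here is a sketch of a gate that applies those reject criteria to the artifacts from the burn-in wrapper shown earlier. It assumes ecc_before.csv and ecc_after.csv are in the current directory and that the stress run finished within the last hour.

#!/usr/bin/env bash
# Acceptance gate sketch: turn the reject criteria into an exit code.
set -euo pipefail
fail=0

# Any ECC movement during burn-in => quarantine.
if ! diff -q ecc_before.csv ecc_after.csv > /dev/null; then
  echo "REJECT: ECC counters moved during burn-in"
  fail=1
fi

# Any Xid or uncorrected PCIe error in the burn-in window => quarantine.
if sudo journalctl -k --since "1 hour ago" | grep -qiE 'xid|uncorrected'; then
  echo "REJECT: Xid or uncorrected PCIe errors during burn-in window"
  fail=1
fi

[ "$fail" -eq 0 ] && echo "ACCEPT: no reject criteria tripped" || exit 1

Wire the exit code into whatever tracks intake status; a gate that only prints to a terminal is a suggestion, not a policy.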

Step 5: Production rollout (gradual, labeled, reversible)

  • Roll into a canary pool first. Run real workloads with safe blast radius.
  • Label nodes with GPU UUIDs and acceptance status in your scheduler inventory.
  • Set alerts: ECC deltas, Xid frequency, thermal throttling, PCIe error rates.
  • Have an escape hatch: fast cordon/drain of GPU nodes and clear RMA workflow.

FAQ

1) Can a fake GPU really pass nvidia-smi?

Yes, in the sense that software-visible strings can be manipulated. But it’s hard to fake everything consistently: PCI IDs, compute capability, memory size behavior under load, ECC telemetry, and stability. That’s why you cross-check.

2) What’s the single best “authenticity” check?

There isn’t one. If forced: start with lspci -nn device IDs plus a short stress test while watching ECC/Xid and PCIe link state. Identity without stability is a mirage.

3) Are refurbished GPUs always a bad idea?

No. Refurb can be fine if it comes from a credible channel with warranty and you run acceptance tests. The problem is “refurb” being used as a polite word for “unknown history.”

4) How do mining GPUs fail differently?

You often see fan wear, degraded thermal interfaces, and memory instability. They can run “fine” until they hit certain temperature ranges, then start throwing corrected ECC or driver resets.

5) If ECC is disabled, can I still detect memory problems?

You can, but you’ll detect them later and more painfully. Without ECC telemetry, you rely more on stress tests, application errors, and crashes. If you have a choice for production reliability, prefer ECC-capable SKUs with ECC enabled.

6) Why does PCIe width matter so much if my compute is on the GPU?

Because you still move data: batches, weights, activations, pre/post-processing, checkpointing, and multi-node comms. A GPU with a crippled PCIe link can look like “slow model code” until you measure it.
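
A quick way to make that cost visible is to measure host-to-device bandwidth and compare it with what the negotiated link should deliver. A sketch using the CUDA samples bandwidthTest, assuming the same samples path as the deviceQuery example earlier; as a rough yardstick, a healthy Gen4 x16 link sustains well over 20 GB/s on pinned copies, while a x4 link lands far below that.

cr0x@server:~$ /usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned

If measured bandwidth is a fraction of what the negotiated width and generation imply, go back to the LnkSta checks before anyone touches the model code.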

7) How long should burn-in be?

For intake: 30 minutes catches a lot. For high-stakes clusters: 2–12 hours is better, especially for refurbs. The goal isn’t perfection; it’s to surface early-life failures and marginal memory.

8) What if the GPU is real but has a weird vBIOS version?

Weird doesn’t automatically mean fake. It can mean OEM-specific firmware, a refurb update, or a mismatch. The right response is to compare within the lot, check behavior under load, and align on a supported firmware/driver matrix.

9) Can virtualization make a real GPU look counterfeit?

Absolutely. MIG, vGPU, passthrough, and container runtime mismatches can change reported memory, device naming, and even capability exposure. Always verify on the bare host before escalating to “fraud.”

10) What do I store in my asset database for GPUs?

At minimum: GPU UUID, PCI vendor/device IDs, board part numbers, vBIOS version, host/slot mapping, and intake test results (ECC before/after, stress outcome, logs). Serial numbers help, but don’t bet your audit on them alone.

Practical next steps

If you want fewer ghost GPUs in your life, do three things and do them consistently:

  1. Stop trusting name strings. Cross-check PCI IDs, vBIOS fields, and driver-reported capabilities. Record them at intake.
  2. Measure the platform, not just the card. Verify PCIe width/speed and watch for AER errors under load. Many “bad GPUs” are actually bad slots, risers, or BIOS defaults.
  3. Adopt a reject policy you’ll actually enforce. ECC movement, Xid storms, and thermal throttling during burn-in are not “quirks.” They’re early warnings.

The goal isn’t paranoia. It’s operational hygiene. In production, “probably fine” is how hardware becomes folklore—and folklore is how outages get budget approval.
