You didn’t lose the GPU budget meeting. Physics did. And supply chains. And a software ecosystem that turned one vendor’s chips into the default CPU of modern machine learning.
If you’ve tried to procure GPUs recently, you already know the symptoms: lead times that read like a joke, “equivalent” alternatives that aren’t, and a queueing problem that spreads from procurement to scheduling to power to storage. You can’t fix the market. But you can understand it fast enough to make sane decisions and keep production from melting down.
The simplest explanation (with the boring truth)
AI “ate” the GPU market because deep learning is a brutally efficient way to convert compute into product value, and GPUs are the most cost-effective hardware we have for that conversion at scale. When a single class of workload can profitably consume almost any amount of parallel math, it doesn’t just increase demand. It rewrites the definition of “enough.”
There are three forces at work, and you need all three to understand why the market snapped:
- Math shape: Training and serving modern neural networks is mostly giant matrix multiplications and vector ops—work that GPUs can do absurdly well.
- Memory bandwidth: AI isn’t only compute; it’s moving tensors around. High-bandwidth memory (HBM) and wide on-package buses are the cheat code, and GPUs have them.
- Software + network effects: CUDA, cuDNN, NCCL, and a decade of tooling meant you could turn capex into models faster on GPUs than anywhere else. The market followed the path of least friction.
Once a few big players proved that scaling GPUs scaled revenue, everyone else chased. That’s not hype. That’s incentives.
One dry operational truth: we used to buy accelerators to speed up programs. Now we buy accelerators to speed up organizations. The GPU cluster is a factory line, and the product is iteration speed.
What GPUs are actually good at
GPUs are not magic. They are a very specific trade: massive throughput for a narrow class of parallel work, in exchange for constraints you have to manage (memory size, memory bandwidth, interconnect, kernel launch overhead, and a software stack that can bite).
Throughput: the “many small workers” model
A CPU is a few very capable workers with big toolboxes. A GPU is thousands of narrower workers with the same wrench, cranking on the same kind of bolt. AI happens to be “a lot of the same bolt,” especially during training: multiply matrices, accumulate gradients, repeat.
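If you want that trade in one concrete measurement, here is a minimal sketch, assuming PyTorch and a visible CUDA GPU (the matrix size is arbitrary): it times the same dense matmul on the CPU and on the GPU.

import time
import torch

N = 4096
a = torch.randn(N, N)
b = torch.randn(N, N)

# Time the matmul on CPU.
t0 = time.perf_counter()
_ = a @ b
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu                 # warm-up: kernel/library initialization
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()          # GPU matmul is async; wait before reading the clock
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  speedup: {cpu_s / gpu_s:.1f}x")
else:
    print(f"CPU: {cpu_s:.3f}s (no GPU visible)")

The exact numbers depend on your hardware and BLAS build; the shape of the gap is the point.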
HBM: the hidden star of the show
If you only look at FLOPS, you’ll make bad decisions. Many real training runs are limited by how fast you can move weights and activations through memory. GPUs pair compute with very high memory bandwidth—especially when equipped with HBM. That’s not a marketing detail; it is often the difference between 30% utilization and 90% utilization.
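To see why bandwidth matters as much as FLOPS, here is a back-of-the-envelope sketch in plain Python. The peak-TFLOPS and bandwidth figures are placeholder assumptions, not any specific product’s datasheet; substitute your own.

def matmul_bound_estimate(m, n, k, dtype_bytes=2,
                          peak_tflops=300.0, mem_bw_gbs=2000.0):
    # Time if limited purely by compute vs. purely by memory traffic.
    flops = 2 * m * n * k                                # multiply-accumulates
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)  # read A and B, write C
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (mem_bw_gbs * 1e9)
    bound = "compute-bound" if t_compute > t_memory else "bandwidth-bound"
    return t_compute, t_memory, bound

# A big square matmul leans compute-bound...
print(matmul_bound_estimate(8192, 8192, 8192))
# ...while a skinny one (small batch, large hidden size) is bandwidth-bound.
print(matmul_bound_estimate(8, 8192, 8192))

If the second case looks like your serving workload, FLOPS comparisons between accelerators will mislead you; bandwidth will not.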
Interconnect: scaling is mostly networking
Once you train on multiple GPUs, you stop “doing math” and start “doing distributed systems.” Synchronizing gradients is an all-reduce problem. The difference between PCIe-only and NVLink/NVSwitch can be the difference between scaling and thrashing.
Here’s the operational punchline: buying more GPUs is sometimes the fastest way to discover your network topology is wrong.
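If you want a crude number for your own fabric before committing to it, here is a minimal sketch, assuming PyTorch with the NCCL backend and a launch like torchrun --nproc_per_node=<gpus> allreduce_check.py (the 1 GiB payload is arbitrary):

import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

numel = 256 * 1024 * 1024                 # 256M float32 elements, about 1 GiB
payload = torch.ones(numel, device="cuda")

for i in range(5):                        # the first iterations warm up NCCL
    dist.barrier()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    if rank == 0:
        gb = payload.element_size() * payload.numel() / 1e9
        print(f"iter {i}: {elapsed * 1e3:.1f} ms for a {gb:.1f} GB all-reduce")

dist.destroy_process_group()

Compare the measured time against what your links should allow. A large gap usually means topology or configuration, not the GPUs.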
Why AI workloads dominate everything else
Historically, GPUs were a mixed bag: gaming, visualization, some HPC, some rendering farms. That market was large but fragmented. AI turned it into a single, ravenous buyer with standardized needs: dense math, huge memory bandwidth, fast interconnect, stable drivers, and a predictable programming model.
Training is a bulk commodity, inference is a latency business
AI has two primary production modes:
- Training: maximize throughput and scaling efficiency. You accept batchiness. You care about utilization and cost per token/sample.
- Inference: minimize latency and cost per query, without melting the SLA. You care about tail latency, batching strategy, and memory footprint.
Both modes want GPUs, but for different reasons:
- Training wants compute density, memory bandwidth, and interconnect.
- Inference often wants memory capacity and predictable performance, because the model has to fit and respond quickly.
Why CPUs didn’t keep up (for most of this)
CPUs got better, vector units improved, and there are serious CPU inference deployments. But for large models and large-scale training, the economics favor GPUs: the “cost to do a unit of dense linear algebra” has been lower on GPUs for a long time, and software stacks made it even better.
Also: modern AI pipelines are now a stack problem. It’s not “my code runs on silicon.” It’s CUDA kernels, fused ops, mixed precision, communication libraries, compilers, and runtime schedulers. The platform that makes that least painful wins demand.
Short joke #1: Buying GPUs for AI is like adopting a husky—you don’t “own” it, you just become responsible for its daily need to run.
The economics: one chart you can keep in your head
Imagine a simple ratio:
Value produced per unit time ÷ cost per unit time.
AI made that ratio explode for companies that could ship better search, better recommendations, better code tools, better ad targeting, better customer support, or entirely new products. Once that ratio is big enough, you stop asking “Do we need GPUs?” and start asking “How many can we get without collapsing the datacenter?”
Demand doesn’t scale linearly; ambition does
When GPUs were for graphics, you could forecast demand with consumer trends. When GPUs are for competitive advantage, demand becomes strategic. If the next model iteration might move your product metrics, you’ll happily allocate more training budget. Your competitor does too. This is how markets get eaten: not by one breakthrough, but by a feedback loop.
Supply is slow, and packaging is the choke point
Even if silicon capacity exists, high-end GPUs depend on advanced packaging, HBM supply, substrates, and test capacity. Those are not infinite, and you can’t simply “spin up” more overnight. Lead times are the market’s way of saying: the physical world is still in charge.
The operational tax: power and cooling
A GPU-rich rack is a power-dense rack. That means your constraints shift from “budget” to “watts and BTUs.” Many teams discover—late—that their datacenter contract assumed a world where 10–15 kW racks were normal. Now 30–60 kW racks show up, and suddenly your rollout is gated by facility work, not hardware procurement.
The software stack that locked in demand
People underestimate how much the GPU market is a software story. Hardware matters, but developer time matters more. The winning platform is the one that gives you the shortest path from “I have an idea” to “it trains overnight and the metrics improved.”
CUDA is a moat made of tools and habits
The CUDA ecosystem has had years to harden: kernels, libraries, profilers, debuggers, collective communication (NCCL), mixed-precision primitives, and vendor support. Most ML frameworks assume it. Many cutting-edge optimizations land there first. That creates an adoption gradient: the easiest solution becomes the default solution, and procurement follows.
Mixed precision turned “too big” into “possible”
Tensor cores and mixed-precision training are not just performance tweaks. They changed the feasible scale of models and made training cheaper per unit progress. When the right numeric tricks are available in the mainstream stack, model size grows. Model size growth drives more GPU demand. Demand drives more investment. Loop repeats.
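For reference, the mainstream pattern is small. A minimal sketch, assuming PyTorch on a CUDA GPU, with a toy model and data standing in for real ones:

import torch

model = torch.nn.Linear(4096, 4096).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # newer releases also expose torch.amp.GradScaler("cuda")

x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

for step in range(10):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                # unscale; skip the step if inf/nan gradients appear
    scaler.update()

A context manager and a scaler, and the matrix units run far faster than fp32, which is exactly why model sizes followed.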
Distributed training: one library choice can change your burn rate
Teams often treat multi-GPU scaling as a checkbox. It’s not. Your choice of data parallelism, tensor parallelism, pipeline parallelism, and communication strategy can change training cost by multiples.
And if your all-reduce is fighting your topology, no amount of “more GPUs” will save you. You’ll just pay for faster disappointment.
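One concrete lever: gradient accumulation trades all-reduce frequency for step size. A minimal sketch, assuming PyTorch DistributedDataParallel with NCCL under torchrun; the micro_batch helper is a stand-in for a real dataloader:

import os
from contextlib import nullcontext
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
ACCUM = 8  # micro-batches per optimizer step; tune against memory and fabric

def micro_batch():
    # Stand-in for a real dataloader.
    return torch.randn(32, 1024, device="cuda"), torch.randn(32, 1024, device="cuda")

for step in range(100):
    opt.zero_grad(set_to_none=True)
    for i in range(ACCUM):
        x, y = micro_batch()
        # Skip the gradient all-reduce on every micro-batch except the last.
        sync_ctx = nullcontext() if i == ACCUM - 1 else model.no_sync()
        with sync_ctx:
            loss = torch.nn.functional.mse_loss(model(x), y) / ACCUM
            loss.backward()
    opt.step()

dist.destroy_process_group()

It is not free (larger effective batch, different convergence behavior), but it is one of the cheapest ways to stop paying a congested fabric every few milliseconds.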
Paraphrased idea (attributed): Werner Vogels has argued you should build systems assuming things will fail, so your service keeps working anyway.
Where the bottlenecks really are (hint: not always the GPU)
In production, “GPU shortage” is sometimes just “GPU underutilization.” Before you spend another million, learn where your time goes. There are five common chokepoints:
1) Input pipeline and storage
Your training job can’t train if it can’t read. Data loaders, small-file storms, network file systems, object store throttling, and decompression can starve GPUs. A cluster with 80% idle GPU time is usually a storage and CPU story wearing a GPU costume.
2) CPU-side preprocessing
Tokenization, augmentation, decoding, feature engineering, JSON parsing—these can become the limiter. If one CPU thread is preparing batches for eight GPUs, congratulations: you built a very expensive space heater.
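Before concluding anything, measure how a step splits between waiting for data and computing. A minimal sketch, assuming a PyTorch-style iterable loader and a train_step function of your own (both are placeholders here):

import time
import torch

def profile_epoch(loader, train_step, device="cuda"):
    data_s, compute_s = 0.0, 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent here is dataloader/storage time
        except StopIteration:
            break
        data_s += time.perf_counter() - t0

        t0 = time.perf_counter()
        train_step(batch)
        if device == "cuda":
            torch.cuda.synchronize()  # make queued GPU work visible to the wall clock
        compute_s += time.perf_counter() - t0

    total = data_s + compute_s
    print(f"data wait: {data_s:.1f}s ({100 * data_s / total:.0f}%)  "
          f"compute: {compute_s:.1f}s ({100 * compute_s / total:.0f}%)")

If data wait is more than a small fraction of the step, the fix lives in storage, file formats, or preprocessing parallelism, not in more GPUs.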
3) GPU memory capacity and fragmentation
Out-of-memory isn’t always “model too big.” Sometimes it’s allocator fragmentation, multiple processes on one GPU, or runaway caching. Sometimes it’s your batch size. Sometimes it’s a memory leak in a custom extension. Diagnose, don’t guess.
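A quick way to separate fragmentation and cache growth from “model genuinely too big”, assuming PyTorch on CUDA:

import torch

def report_gpu_memory(tag=""):
    alloc = torch.cuda.memory_allocated() / 2**30    # live tensors
    reserved = torch.cuda.memory_reserved() / 2**30  # the allocator's cached pool
    total = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"[{tag}] allocated {alloc:.1f} GiB | reserved {reserved:.1f} GiB | total {total:.1f} GiB")

# Call this around suspicious phases (before/after validation, checkpointing,
# generation) and diff the numbers over time.
report_gpu_memory("startup")

# torch.cuda.memory_summary() prints a much more detailed allocator breakdown,
# and the PYTORCH_CUDA_ALLOC_CONF environment variable exposes allocator knobs
# (e.g. max_split_size_mb); treat specific settings as version-dependent.

A reserved number that keeps climbing while allocated stays flat points at caching or fragmentation, not at your batch size.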
4) Interconnect saturation
Multi-node training is communication-heavy. If the fabric is oversubscribed or misconfigured, scaling efficiency collapses. Watch for NCCL retries, slow all-reduce, and imbalanced GPU step times across ranks.
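Per-rank imbalance is cheap to measure from inside the job. A minimal sketch, assuming PyTorch with NCCL under torchrun; the timed “work” is a stand-in for your real forward/backward:

import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

def report_step_time(step_seconds):
    t = torch.tensor([step_seconds], device="cuda")
    gathered = [torch.zeros_like(t) for _ in range(world)]
    dist.all_gather(gathered, t)
    if rank == 0:
        times = [g.item() for g in gathered]
        print(f"step time  min={min(times):.3f}s  max={max(times):.3f}s  "
              f"slowest rank={times.index(max(times))}")

for step in range(3):
    t0 = time.perf_counter()
    torch.cuda.synchronize()            # replace with the real training step
    report_step_time(time.perf_counter() - t0)

dist.destroy_process_group()

A rank that is consistently the slowest usually means a fabric path, PCIe placement, or noisy-host problem, not bad luck.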
5) Power and thermal throttling
GPUs are designed to hit power and thermal limits. If your datacenter is hot, airflow is wrong, or power caps are mis-set, the GPU will downclock. It won’t page you. It will just quietly do less work while your job takes longer.
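You can poll throttle state from your own tooling instead of eyeballing nvidia-smi. A minimal sketch, assuming the nvidia-ml-py (pynvml) package; constant names can differ slightly between package versions, so treat them as illustrative:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # reported in milliwatts

flags = {
    "sw_power_cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "hw_slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "hw_thermal": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
}
active = [name for name, bit in flags.items() if reasons & bit]
print(f"temp={temp_c}C power={power_w:.0f}W throttling={active or 'none'}")

pynvml.nvmlShutdown()

Export that as a metric and alert on it; a throttling GPU that nobody notices is just a slower GPU you paid full price for.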
Facts and historical context that matter
- GPUs became general-purpose compute in the mid-2000s when developers started using graphics pipelines for non-graphics math (the early GPGPU era).
- CUDA launched in 2007, making GPU programming far less awkward than shader hacks, and giving a stable target for tools and libraries.
- AlexNet (2012) is widely credited with making GPU-accelerated deep learning “obviously worth it” for computer vision, because it trained efficiently on GPUs.
- HBM arrived in the mid-2010s and changed the performance profile of high-end accelerators by raising memory bandwidth dramatically.
- Tensor cores (late 2010s) pushed mixed-precision matrix math into the mainstream, multiplying effective throughput for training workloads.
- NCCL matured into the default for GPU collectives, which made multi-GPU and multi-node training viable for more teams without custom MPI tuning.
- Transformer models (2017 onward) shifted AI toward workloads that scale aggressively with compute and data, accelerating demand for large clusters.
- In 2020–2022, global supply chain disruptions plus rising demand made lead times worse, but the AI wave didn’t politely wait for factories.
- Data center power density became a primary limiting factor as accelerator racks moved from “high” to “facility redesign” territory.
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption
A mid-size SaaS company moved from a single-GPU inference service to a multi-GPU node to reduce tail latency. The team assumed that “more GPUs” meant “more headroom,” and scheduled multiple model replicas per host. They also assumed the PCIe topology was effectively uniform.
It wasn’t. Half the GPUs shared a PCIe switch with the NIC, and under load the traffic pattern got ugly. P99 latency started drifting upward, then spiking. The service didn’t crash; it just became unreliable in a way customers noticed. The on-call found that GPU utilization looked “fine,” which delayed the right diagnosis.
Eventually someone ran a topology check and realized that the busiest replicas were pinned to GPUs with the worst host-to-device and device-to-network path. The fix was not heroic: set explicit GPU affinity, move network-heavy replicas to the GPUs with better PCIe locality, and stop oversubscribing the node’s I/O.
The lesson stuck: never assume the bus is “fast enough.” The bus is often the product.
Mini-story #2: The optimization that backfired
A research group was desperate to improve training throughput, so they cranked up data loader workers, enabled aggressive caching, and switched to a higher compression ratio to reduce storage reads. The GPU graphs looked better for an hour. Everyone high-fived.
Then the node started swapping. Not a little. A lot. The compressed cache inflated in memory, page cache fought with pinned GPU buffers, and the kernel started reclaiming. CPU utilization went to the moon while actual useful work fell. Training step time became noisy, then steadily worse. Jobs began timing out, and the cluster scheduler piled retries on top, making the situation look like “random instability.”
The fix was boring: cap cache size, pin fewer batches, lower worker count, and measure end-to-end step time rather than “GPU busy percentage.” The optimized pipeline had optimized the wrong metric.
That failure mode is common in AI systems: you can improve one stage so much that you amplify the next stage’s worst behavior. Congratulations, you found the real bottleneck by making it angry.
Mini-story #3: The boring but correct practice that saved the day
A fintech company ran GPU inference for fraud scoring. Nothing glamorous: strict SLA, steady traffic, and a security team that reads driver changelogs like bedtime stories. They maintained golden images per GPU model, pinned driver versions, and validated CUDA/cuDNN combos in a staging environment that matched production hardware.
One weekend, a vendor issued an urgent security update that included a driver bump. A different team tried to roll it broadly. The fintech’s platform team resisted, because they had a rule: no GPU driver change without a canary that runs real inference load and checks both correctness and latency distribution.
The canary failed. Not catastrophically—worse. The new driver introduced a subtle performance regression in their exact kernel mix. Under load, P99 latency degraded enough to violate their SLA. Because they had baseline histograms and replay traffic, they caught it before rollout.
The fix was to apply the security update immediately on CPU-only nodes, then plan a controlled GPU rollout with mitigations (capacity buffer, batch tuning) and a later driver version. The practice wasn’t exciting. It was the difference between “quiet weekend” and “incident review with executives.”
Fast diagnosis playbook
If a training job is slow or inference latency spikes, don’t start by blaming the GPU model. Start by proving where time is lost. This is the order that saves hours; a small script that automates the first pass follows the playbook.
First: are the GPUs actually busy?
- Check utilization and power draw.
- Check if the job is compute-bound or memory-bound.
- Check if you are throttling (power/thermal).
Second: is the data pipeline feeding the beast?
- Measure read throughput and IOPS from the dataset path.
- Check CPU saturation (especially a single hot core).
- Look for small-file storms, decompression overhead, and network storage jitter.
Third: is multi-GPU scaling working, or are you paying for synchronization?
- Check interconnect topology and link width/speed.
- Watch for NCCL errors, retries, or slow collectives.
- Compare per-rank step time; imbalance often points to fabric issues or stragglers.
Fourth: is the scheduler lying to you?
- Confirm GPU assignment, MIG settings, and cgroup limits.
- Verify no “invisible neighbors” are on the same GPU or NIC.
- Check NUMA locality if CPU-side preprocessing matters.
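A minimal script for that first pass, assuming Linux, Python, and nvidia-smi on the PATH. The query fields used here are the documented ones, but confirm them against your driver with nvidia-smi --help-query-gpu:

import subprocess
import time

def gpu_sample():
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,power.draw,power.limit,temperature.gpu",
        "--format=csv,noheader,nounits",
    ], text=True)
    # One [util%, watts, watt_limit, temp_c] row per GPU.
    return [[float(x) for x in line.split(",")] for line in out.strip().splitlines()]

def cpu_iowait_fraction(interval=1.0):
    def snap():
        with open("/proc/stat") as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        return fields[4], sum(fields)   # iowait is the 5th field on Linux
    io0, tot0 = snap()
    time.sleep(interval)
    io1, tot1 = snap()
    return (io1 - io0) / max(tot1 - tot0, 1.0)

gpus = gpu_sample()
iowait = cpu_iowait_fraction()
for i, (util, watts, cap, temp) in enumerate(gpus):
    print(f"GPU{i}: util={util:.0f}% power={watts:.0f}/{cap:.0f}W temp={temp:.0f}C")
print(f"CPU iowait: {iowait:.1%}")
if gpus and max(row[0] for row in gpus) < 50 and iowait > 0.10:
    print("verdict: likely input/storage-bound; check the data path before the GPUs")

It will not replace a profiler, but it answers the first two questions in one pass and keeps the argument out of opinion territory.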
Short joke #2: The GPU is never “idle”; it’s just waiting for the rest of your architecture to catch up.
Practical tasks: commands, outputs, decisions
These are the checks I actually run (or ask someone to run) when performance, capacity, or stability is at stake. Each includes the command, what typical output tells you, and what you decide next.
Task 1: Confirm the GPUs are visible and the driver is sane
cr0x@server:~$ nvidia-smi
Tue Jan 13 12:10:11 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+----------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:41:00.0 Off | 0 |
| 34% 56C P0 265W / 400W| 61234MiB / 81920MiB | 92% Default |
+-----------------------------------------+----------------------+----------------------+
What it means: Driver/CUDA versions, GPU model, memory usage, utilization, and power draw. If GPU-Util is low but memory is high, you may be input-bound or sync-bound.
Decision: If GPUs aren’t listed or show errors, stop and fix drivers/firmware/container runtime integration before tuning anything else.
Task 2: Catch thermal or power throttling
cr0x@server:~$ nvidia-smi -q -d PERFORMANCE,POWER,TEMPERATURE | sed -n '1,120p'
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Active
Power Readings
Power Draw : 395.12 W
Power Limit : 400.00 W
Temperature
GPU Current Temp : 86 C
What it means: “HW Thermal Slowdown: Active” tells you the GPU is downclocking. That’s real performance loss.
Decision: Fix airflow, fan curves, inlet temp, or rack density before you buy more GPUs. Throttling is paying for hardware you can’t use.
Task 3: Validate NVLink connectivity (when applicable)
cr0x@server:~$ nvidia-smi nvlink -s
GPU 0: NVLink Status
Link 0: 25 GB/s
Link 1: 25 GB/s
GPU 1: NVLink Status
Link 0: 25 GB/s
Link 1: 25 GB/s
What it means: NVLink links are up and reporting expected bandwidth.
Decision: If links are down, don’t attempt multi-GPU scaling assumptions; investigate hardware seating, firmware, or platform support.
Task 4: Check PCIe link width and speed
cr0x@server:~$ sudo lspci -s 41:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
What it means: If you see x8 instead of x16, or lower GT/s, you’re leaving bandwidth on the floor.
Decision: Fix BIOS settings, riser/cable issues, or slot placement. PCIe under-negotiation is a silent performance killer.
Task 5: Inspect GPU-to-CPU NUMA locality
cr0x@server:~$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X NV2 0-31 0
GPU1 NV2 X 0-31 0
What it means: CPU affinity and NUMA node mapping. If preprocessing is heavy, wrong NUMA placement increases latency and reduces throughput.
Decision: Pin CPU threads and dataloaders to the NUMA node local to the GPU(s) they feed.
Task 6: Look for GPU memory fragmentation pressure
cr0x@server:~$ nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
pid, process_name, used_gpu_memory [MiB]
23144, python, 61234 MiB
What it means: Confirms which process is consuming memory. Multiple processes can fragment memory and cause unpredictable OOMs.
Decision: If multiple PIDs share a GPU unexpectedly, fix scheduling/isolation (Kubernetes device plugin settings, Slurm GRES, or systemd unit placement).
Task 7: Check CPU saturation and steal time (VMs or noisy neighbors)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 01/13/2026 _x86_64_ (64 CPU)
12:11:02 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:11:03 PM all 7.8 0.0 1.4 1.5 0.0 0.1 0.0 89.2
12:11:03 PM 7 99.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
What it means: One core pegged at 99% can bottleneck a pipeline (tokenization, decompression). High %steal suggests virtualization contention.
Decision: Parallelize preprocessing, increase workers carefully, or offload preprocessing. If %steal is high, move to dedicated hosts.
Task 8: Verify storage read throughput (local NVMe example)
cr0x@server:~$ sudo fio --name=readtest --filename=/mnt/datasets/.fio_test --rw=read --bs=1M --ioengine=libaio --direct=1 --numjobs=1 --iodepth=32 --size=4G --runtime=30 --time_based
read: IOPS=2850, BW=2850MiB/s (2988MB/s)(83.5GiB/30001msec)
What it means: If you’re training from local NVMe and you only get a few hundred MiB/s, something is wrong (mount options, device health, filesystem).
Decision: If bandwidth is below what your pipeline needs, fix storage or cache datasets locally before tuning GPU kernels.
Task 9: Check network throughput to a data source (basic sanity)
cr0x@server:~$ iperf3 -c datahost -P 4 -t 10
[SUM] 0.00-10.00 sec 37.5 GBytes 32.2 Gbits/sec sender
[SUM] 0.00-10.00 sec 37.4 GBytes 32.1 Gbits/sec receiver
What it means: Confirms achievable throughput. If you expect 100 Gbit/sec and see 20–30, you’re bottlenecked on NIC config, switch, or host tuning.
Decision: Fix MTU, bonding, switch port config, or route. Distributed training doesn’t forgive weak links.
Task 10: Spot I/O wait as the silent GPU-starver
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
18.40 0.00 6.10 32.50 0.00 42.90
Device r/s rkB/s await %util
nvme0n1 220.0 2800000.0 14.2 96.0
What it means: High %iowait and near-100% device utilization indicate storage saturation. GPU idle time often correlates.
Decision: Add faster storage, shard datasets, increase read parallelism carefully, or pre-stage data to local disks.
Task 11: Check Kubernetes GPU allocation (if you run K8s)
cr0x@server:~$ kubectl describe pod infer-6c9c8d7f6f-kp2xw | sed -n '1,160p'
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Node-Selectors: gpu=true
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m default-scheduler Successfully assigned prod/infer-... to gpu-node-12
What it means: Confirms the pod requested and was granted a GPU, and where it landed.
Decision: If pods land on wrong node types or compete for GPUs, fix node labeling, taints/tolerations, and device plugin configuration.
Task 12: Check for GPU resets or Xid errors in logs
cr0x@server:~$ sudo journalctl -k -S -2h | egrep -i 'NVRM|Xid|pcie|nvlink' | tail -n 20
Jan 13 10:44:02 server kernel: NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus.
Jan 13 10:44:05 server kernel: pcieport 0000:00:03.1: AER: Corrected error received: id=00e1
What it means: “fallen off the bus” often points to PCIe signal issues, firmware problems, power events, or severe driver faults.
Decision: Stop treating it as an application bug. Involve hardware/vendor, check BIOS/firmware, reseat cards, validate power delivery, and run burn-in tests.
Task 13: Validate GPU persistence mode and application clocks (when you mean to)
cr0x@server:~$ sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:41:00.0.
What it means: Persistence mode reduces initialization overhead and can stabilize latency for some inference workloads.
Decision: Enable for long-lived services; for shared interactive boxes, evaluate impact and policy.
Task 14: Verify hugepages / memory pressure that can hurt RDMA and performance
cr0x@server:~$ grep -E 'HugePages|MemAvailable' /proc/meminfo
MemAvailable: 18234564 kB
HugePages_Total: 0
HugePages_Free: 0
What it means: Low MemAvailable means the OS is under pressure; hugepages may be required for some RDMA/NIC setups or performance tuning.
Decision: If memory pressure is high, reduce caching, fix container limits, or increase RAM. If RDMA requires hugepages, configure them explicitly.
Task 15: Confirm container runtime can see the GPU
cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 NVIDIA A100 On | 00000000:41:00.0 Off | 0 |
+-----------------------------------------------------------------------------+
What it means: If this fails, your app’s failures may be plumbing, not ML.
Decision: Fix NVIDIA Container Toolkit, driver/runtime compatibility, and cgroup/device permissions before blaming the framework.
Common mistakes: symptom → root cause → fix
1) “GPU utilization is low, but the job is slow”
Symptom: GPU-Util 10–40%, step time high, CPU looks busy or iowait is elevated.
Root cause: Input pipeline bottleneck: storage throughput, decompression, tokenization, or dataloader contention.
Fix: Profile dataloader time; pre-stage datasets to local NVMe; increase preprocessing parallelism; switch to sharded/packed formats; reduce small-file reads.
2) “Multi-GPU scaling gets worse after adding nodes”
Symptom: 2 GPUs faster than 8 GPUs; adding nodes increases epoch time.
Root cause: Interconnect or collective overhead dominates: NCCL all-reduce saturating fabric, oversubscribed switches, wrong topology awareness.
Fix: Validate fabric bandwidth; ensure correct NCCL settings for your topology; avoid oversubscription; consider gradient accumulation or parallelism strategy changes.
3) “OOM happens randomly even though batch size is stable”
Symptom: Same config sometimes fits, sometimes OOMs after hours.
Root cause: Memory fragmentation, leaked tensors, multiple processes per GPU, or caching growth.
Fix: Ensure one process per GPU unless you’re partitioning intentionally; monitor per-process memory; restart leaky workers; use allocator settings and checkpointing.
4) “Inference latency spikes every few minutes”
Symptom: P50 stable, P99 periodic spikes.
Root cause: Background jobs causing PCIe/NVMe contention, CPU GC pauses, kernel page reclaim, or GPU clock changes.
Fix: Isolate inference hosts; set CPU pinning; control batcher; enable persistence mode; remove noisy cron jobs; tune memory limits.
5) “GPU node crashes or resets under load”
Symptom: Xid errors, GPU falls off bus, node reboots, training job dies.
Root cause: Power delivery issues, bad riser/cable, thermal runaway, unstable firmware, or marginal hardware.
Fix: Check logs for Xid; validate PCIe link stability; burn-in test; update BIOS/firmware; involve vendor; reduce power cap temporarily to confirm suspicion.
6) “We bought GPUs but can’t deploy them”
Symptom: Hardware arrives; rollout blocked by facilities or rack design.
Root cause: Power/cooling density assumptions were based on CPU-era racks.
Fix: Plan rack-level power budgets early; verify PDUs, cooling, containment, and breaker limits; model failure cases (one CRAC down).
7) “Alternative GPUs are ‘equivalent’ but performance is worse”
Symptom: Same model code, slower training/inference on non-default vendor.
Root cause: Kernel/library maturity differences, missing fused ops, less optimized communication stack, or lower effective memory bandwidth for your workload.
Fix: Benchmark your exact workload, not synthetic FLOPS. Budget engineering time for porting and tuning, or don’t pretend it’s plug-and-play.
Checklists / step-by-step plan
Checklist A: Before you buy more GPUs
- Measure GPU duty cycle: if utilization is low, solve feed/IO before scaling supply.
- Quantify step time breakdown: dataloader vs forward/backward vs all-reduce.
- Validate interconnect: PCIe width/speed; NVLink status; fabric bandwidth for multi-node.
- Check thermal headroom: ensure no throttling at expected ambient temperatures.
- Check power provisioning: rack budget, PDU capacity, redundant feeds, and power caps.
- Confirm software stack pinning: driver/CUDA/framework versions you can reproduce.
Checklist B: First week of a new GPU cluster (do this even if it’s “temporary”)
- Install a golden image per GPU model; keep immutable builds.
- Run burn-in tests and monitor for Xid errors and PCIe AER warnings.
- Record baseline performance: step time, tokens/sec, P99 latency, power draw, and thermals.
- Validate topology: NUMA mapping, GPU placement, NIC locality, and NVLink links.
- Set observability: node exporter + DCGM metrics + job-level step timing.
- Write a runbook: what to do when GPU falls off bus, OOM storms, or NCCL hangs.
Checklist C: Inference service hardening (keep the SLA)
- Define the SLO using P99 and error rate, not averages.
- Implement load shedding and backpressure; avoid infinite queues.
- Use explicit batching with caps; measure tail latency impact (a minimal batcher sketch follows this checklist).
- Isolate noisy neighbors: dedicated nodes or strict cgroups/quotas.
- Canary driver/runtime updates with replay traffic and correctness checks.
- Keep a capacity buffer for deploys and for “one GPU died” reality.
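As promised above, here is a minimal batcher sketch with both caps, assuming a single-model inference worker; run_model and the request plumbing are placeholders for your own serving stack:

import queue
import threading
import time

MAX_BATCH = 16          # cap batch size so one burst cannot blow up latency
MAX_WAIT_S = 0.005      # cap the queueing delay added per request (5 ms)

requests = queue.Queue(maxsize=1024)    # bounded queue = built-in backpressure

def run_model(batch):
    time.sleep(0.002)                   # stand-in for the real GPU call
    return [{"ok": True} for _ in batch]

def batching_loop():
    while True:
        first = requests.get()          # block until at least one request arrives
        batch = [first]
        deadline = time.perf_counter() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        results = run_model(batch)
        for req, res in zip(batch, results):
            req["reply"](res)           # hand each result back to its caller

threading.Thread(target=batching_loop, daemon=True).start()

Tune MAX_BATCH and MAX_WAIT_S against measured P99, not throughput alone, and keep the queue bounded so overload turns into load shedding instead of an unbounded latency tail.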
Checklist D: Storage plan for training (because GPUs hate waiting)
- Prefer sharded datasets to reduce metadata overhead and small-file reads.
- Pre-stage to local NVMe for repeated epochs when possible.
- Measure IOPS and throughput under concurrency; single-client benchmarks lie.
- Track cache behavior; set explicit limits to avoid page-reclaim chaos.
- Align format to access pattern: sequential reads win; random reads lose.
FAQ
1) Why can’t we just use CPUs for everything?
You can for some inference and smaller models. But for large-scale training and high-throughput inference, GPUs offer better throughput per watt and per dollar for dense linear algebra, plus mature kernels and libraries.
2) Is the GPU shortage only about AI hype?
No. Hype can inflate plans, but the underlying driver is that AI workloads convert compute into measurable business value. When that happens, demand becomes strategic and persistent.
3) What’s the single biggest technical reason GPUs win for AI?
Memory bandwidth paired with massive parallel compute. Many AI kernels are bandwidth-sensitive; HBM and wide internal paths keep the math units fed.
4) Why does NVLink/NVSwitch matter so much?
Because distributed training spends a lot of time synchronizing parameters and gradients. Faster GPU-to-GPU links reduce communication overhead and improve scaling efficiency.
5) Why do “equivalent” accelerators often underperform?
Software maturity. Kernel fusion, libraries, compilers, and communication stacks take years to harden. If your framework’s fast path assumes one ecosystem, alternatives may run the slow path.
6) What should I monitor on GPU nodes in production?
At minimum: GPU utilization, memory usage, power draw, temperature, throttling reasons, ECC/Xid errors, PCIe link state, and job-level step time or latency histograms.
7) How do I tell if I’m compute-bound or input-bound?
If GPU utilization and power draw are low while CPU/iowait is high, you’re likely input-bound. If GPUs are pegged and step time is stable, you’re more compute-bound. Confirm with pipeline timing inside the job.
8) Why do inference latency spikes happen even when average GPU utilization is low?
Tail latency is sensitive to queueing, batcher behavior, CPU pauses, page reclaim, and contention on PCIe/NVMe. A mostly-idle GPU doesn’t guarantee consistent response times.
9) What’s the safest way to roll GPU driver updates?
Canary on identical hardware with replay traffic. Validate correctness and latency distribution. Then roll gradually with capacity buffer. Treat drivers like kernel updates: powerful, risky, and not a Friday-afternoon activity.
10) What’s the fastest way to waste money on GPUs?
Buy them before you validate power, cooling, storage throughput, network fabric, and your ability to keep the software stack stable. The hardware will arrive on time; your readiness won’t.
Next steps you can take this week
If you’re responsible for keeping GPU-backed systems reliable—or for making a procurement plan that won’t implode—do these next. They’re practical, and they move the needle.
- Baseline a real workload on your current hardware: step time, tokens/sec, P95/P99 latency, power, thermals.
- Run the fast diagnosis playbook on one slow job and one “healthy” job. Write down what differs.
- Prove your data path can feed the GPUs: storage throughput and CPU preprocessing capacity under concurrency.
- Map topology (PCIe, NUMA, NVLink) and document it. Then pin workloads intentionally.
- Implement a minimal GPU runbook: what you do for Xid errors, OOM storms, NCCL hangs, and thermal throttling.
- Stop buying on FLOPS. Buy on end-to-end throughput for your exact workload, plus the operational cost of keeping it stable.
The GPU market didn’t break because everyone suddenly got irrational. It broke because AI workloads are rationally hungry, and the fastest path to shipping them has been GPU-shaped for years. Treat GPUs like a production dependency, not a luxury. Build the boring systems around them: power, cooling, storage, scheduling, and change control. That’s how you turn scarcity into predictable output.