10 GPU Myths That Refuse to Die

If you run GPU workloads in production, you’ve met the same villain in ten different costumes: “The GPU is slow.”
Sometimes it is. More often, the GPU is waiting—on your CPU, your storage, your network, your drivers, your container runtime,
or a single missing flag that turns a $20k accelerator into a space heater.

Below are ten GPU myths that keep chewing through budgets and sleep. Each myth comes with a practical correction,
commands you can run today, and the kinds of failure modes that show up at 2 a.m. when a training job “mysteriously” regresses.

Fast diagnosis playbook: find the bottleneck in 10 minutes

The fastest way to debug GPU “performance” is to stop talking about GPUs and start measuring the whole pipeline.
You’re looking for the first scarce resource. Not the most expensive one.

1) First check: is the GPU actually doing work?

  • Look at SM utilization, memory utilization, and power draw. If power is low, you’re probably not compute-bound.
  • Check clocks and throttling. A GPU at 300 MHz is not “slow,” it’s capped.

2) Second check: is the CPU/data pipeline starving it?

  • High CPU usage, high iowait, or slow reads usually means the GPU is waiting on input.
  • Look for dataloader stalls, small batch sizes, excessive augmentation on CPU, or non-pinned host memory.

3) Third check: are you bandwidth-bound (PCIe / network / storage)?

  • PCIe link downtrained (x16 → x4) or Gen4 → Gen3 will quietly kneecap throughput.
  • Network or shared storage can cap distributed training long before you “run out of GPU.”

4) Fourth check: are you fighting software friction?

  • Driver/runtime mismatches, wrong container runtime, disabled persistence mode, or mis-set environment variables can force slow paths.
  • Kernel versions, IOMMU settings, and cgroups can also introduce surprising overhead.

5) Final check: profile one iteration end-to-end

  • Don’t guess. Measure step time broken down into input, CPU transforms, H2D transfer, GPU compute, D2H, and synchronization.
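
If your stack is PyTorch-like, a minimal sketch of that breakdown looks roughly like this (dataloader_iter, model, loss_fn, and optimizer are placeholders for your own objects, not names from any specific codebase):

# Rough per-iteration breakdown: input vs H2D transfer vs GPU compute.
# The explicit synchronize() calls exist only for measurement; remove them afterward.
import time
import torch

def timed_step(dataloader_iter, model, loss_fn, optimizer, device="cuda"):
    t0 = time.perf_counter()
    inputs, targets = next(dataloader_iter)          # input: storage reads + CPU transforms
    t1 = time.perf_counter()
    inputs = inputs.to(device, non_blocking=True)    # H2D transfer (async if pinned)
    targets = targets.to(device, non_blocking=True)
    torch.cuda.synchronize()                         # force the copy to finish so it can be timed
    t2 = time.perf_counter()
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(inputs), targets)           # forward
    loss.backward()                                  # backward
    optimizer.step()
    torch.cuda.synchronize()                         # wait for GPU work before stopping the clock
    t3 = time.perf_counter()
    print(f"input {t1 - t0:.3f}s | h2d {t2 - t1:.3f}s | gpu compute {t3 - t2:.3f}s")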

One paraphrased idea, attributed to John Allspaw (operations and incident response): the system’s behavior makes sense once you see the constraints.
In GPU land, the constraints are almost never where you want them to be.

The 10 myths (and what to do instead)

Myth 1: “GPU utilization should be ~100% or you’re wasting money.”

“Utilization” is one of those metrics that sounds objective and behaves like gossip. A GPU can show 20–40% utilization
and still be perfectly healthy if your model has frequent synchronization points, short kernels, or is latency-bound.
Conversely, you can hit 99% utilization doing the wrong work—like constantly reformatting tensors or burning cycles on pointless casts.

What to do instead:

  • Track time per step and throughput (samples/sec) as primary KPIs.
  • Watch power draw and clocks. Compute-heavy work tends to pull power.
  • Look for CPU wait and data loader stalls before you chase “more utilization.”

Practical decision: if step time is stable and meets SLOs, don’t “fix” utilization. If step time regressed, find the stall.
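
A hedged sketch of what “track step time and throughput” can look like in a PyTorch-style loop (train_step and batch_size are placeholders you supply):

# Step time and samples/sec as primary KPIs; utilization is a supporting signal at best.
import time
import torch

def measure_throughput(train_step, batch_size, steps=50, warmup=5):
    times = []
    for i in range(steps):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        train_step()                          # one full training step, supplied by you
        torch.cuda.synchronize()
        if i >= warmup:                       # skip warmup (autotuning, caches, lazy init)
            times.append(time.perf_counter() - t0)
    avg = sum(times) / len(times)
    print(f"avg step {avg * 1000:.1f} ms, {batch_size / avg:.1f} samples/sec")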

Myth 2: “More VRAM makes training faster.”

VRAM is capacity, not horsepower. More VRAM lets you fit bigger batches, larger models, longer sequences, or more cached activations.
It does not automatically increase FLOPS or memory bandwidth. Sometimes bigger VRAM tempts teams into bigger batches that reduce
generalization or destabilize training, which is a performance regression with extra steps.

What to do instead:

  • Buy VRAM for fit (can it run?) and for reduced recomputation (activation checkpointing choices), not for speed by itself.
  • Benchmark with fixed batch size first. Then explore scaling batch size and learning rate correctly.

Historical fact: Early CUDA-era GPUs often shipped with far less memory than the host systems they were attached to, pushing the industry to invent smarter batching,
fused kernels, and mixed precision to get work done inside tight VRAM budgets.
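
If you do grow the batch, one common heuristic (not a guarantee) is to scale the learning rate roughly linearly and add warmup, then re-validate convergence. The numbers below are purely illustrative:

# Linear LR scaling heuristic: bigger batch, proportionally bigger learning rate, plus warmup.
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * (new_batch / base_batch)

print(scaled_lr(base_lr=1e-3, base_batch=256, new_batch=1024))  # 0.004 -- still needs validation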

Myth 3: “If it fits in VRAM, you’re safe from out-of-memory errors.”

The phrase “fits in VRAM” hides two traps: fragmentation and peak usage.
Many frameworks allocate and free temporary buffers during forward/backward passes. Peaks can exceed the steady-state footprint.
Fragmentation can prevent allocating a large contiguous block even when “free memory” looks ample.

What to do instead:

  • Measure peak memory per iteration. Watch for growth over time (leaks, caching, graph accumulation).
  • Use allocator controls (where supported) and avoid pathological allocation patterns (lots of varying tensor sizes).
  • Prefer consistent shapes and batch sizes; dynamic shapes are great until they become an allocator stress test.

Interesting fact: GPU memory allocators frequently use caching strategies to avoid expensive device allocations; “reserved” is not “leaked.”
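
A minimal PyTorch-style sketch for watching allocated, reserved, and peak memory per iteration (call it from your own loop):

# A growing peak, or a widening reserved-vs-allocated gap, points at fragmentation or a leak.
import torch

def report_gpu_memory(step):
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved  = torch.cuda.memory_reserved()  / 2**20
    peak      = torch.cuda.max_memory_allocated() / 2**20
    print(f"step {step}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB, peak {peak:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()      # so the next call reports a per-iteration peak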

Myth 4: “PCIe doesn’t matter—compute is the bottleneck.”

PCIe matters whenever you move data host↔device or device↔device without a faster fabric. Small batches, frequent transfers,
or heavy CPU-side preprocessing can turn PCIe into the metronome of your job. Worse, PCIe often fails quietly:
a slot running at x4 width or a link trained at Gen3 instead of Gen4 can look “fine” but perform like a rental car with the handbrake on.

What to do instead:

  • Keep data on the GPU longer. Fuse operations. Avoid ping-ponging tensors to CPU for convenience.
  • Validate link width and speed. Don’t assume.
  • Use pinned memory and async copies when you must transfer.
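
A hedged sketch of the last point, assuming PyTorch; the dataset here is a dummy stand-in for your own:

# Pinned host memory + non_blocking copies: the combination is what enables async H2D.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True,       # page-locked staging buffers on the host
                    prefetch_factor=2)     # keep a couple of batches in flight per worker

for images, labels in loader:
    images = images.to("cuda", non_blocking=True)   # async only works from pinned memory
    labels = labels.to("cuda", non_blocking=True)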

Historical fact: NVLink was introduced partly because PCIe bandwidth wasn’t keeping up with multi-GPU training and model parallelism.

Myth 5: “Tensor Cores / mixed precision are always faster.”

Mixed precision can be a gift. It can also be a bill. You get speedups when your model and kernel mix
are set up to use fast math paths and when overhead (casts, loss scaling, synchronization) doesn’t dominate.
Some models become memory-bound or hit kernel launch overhead, where reduced precision doesn’t help much.

What to do instead:

  • Benchmark FP32 vs AMP with fixed seeds and identical dataloading.
  • Watch for numerical instability: NaNs, loss spikes, divergent gradients.
  • Confirm the fast math path is actually used (profiling), not just requested.
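
A minimal AMP sketch with a cheap non-finite guard, assuming PyTorch; the model and data below are stand-ins, not a recipe:

# autocast + GradScaler, plus an early failure when the loss goes non-finite.
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                                     # stand-in loop; real code iterates a DataLoader
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    if not torch.isfinite(loss):                        # cheap NaN/Inf guard
        raise RuntimeError("non-finite loss: check loss scaling and precision-sensitive ops")
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()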

Interesting fact: The “Tensor Core era” started with NVIDIA Volta (V100), and it changed how ML frameworks schedule GEMMs and convolutions.

Myth 6: “If nvidia-smi shows memory used, the GPU is doing work.”

VRAM usage is not activity. It may just mean the process initialized a context and cached allocations.
You can park 30 GB of tensors on a GPU and do exactly zero useful computation if your job is stuck on CPU preprocessing,
blocked on a lock, or deadlocked in distributed rendezvous.

What to do instead:

  • Check SM utilization, power draw, and per-process utilization.
  • Correlate with application logs: is iteration count increasing? Are you stuck before the first step?

Joke #1: A GPU with full VRAM and no compute is like a meeting room that’s “booked” all day because someone left a water bottle inside.
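
One cheap habit that makes “is it actually stepping?” a log grep instead of a guess: a heartbeat. A sketch (train_step and total_steps come from your own code):

# Heartbeat: if the step counter stops advancing while VRAM stays full, you're stalled, not computing.
import time

def run_with_heartbeat(train_step, total_steps, log_every=50):
    last = time.time()
    for step in range(total_steps):
        train_step()
        if step % log_every == 0:
            now = time.time()
            print(f"heartbeat step={step} seconds_since_last_log={now - last:.1f}", flush=True)
            last = now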

Myth 7: “Multi-GPU scaling is basically linear.”

Linear scaling is a marketing demo where the network never drops a packet, batch sizes scale perfectly, and nobody logs anything at INFO.
In real systems, all-reduce overhead, stragglers, input pipeline contention, and NCCL topology constraints eat your gains.

What to do instead:

  • Measure scaling efficiency (throughput / number of GPUs) and watch when it falls off.
  • Validate topology: PCIe, NVLink, NUMA placement, and NIC affinity.
  • Fix stragglers first: one slow worker can drag the whole job.

Interesting fact: Ring all-reduce made distributed training practical because it avoids a single parameter server bottleneck, but it’s still sensitive to the slowest link.
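
Scaling efficiency is just arithmetic; a sketch with illustrative (not measured) numbers:

# Efficiency = measured N-GPU throughput / (N * single-GPU throughput).
def scaling_efficiency(throughput_n, n_gpus, throughput_1):
    return throughput_n / (n_gpus * throughput_1)

print(scaling_efficiency(throughput_n=2900, n_gpus=4, throughput_1=800))  # ~0.91 -> 91% efficient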

Myth 8: “GPU performance issues are solved by ‘updating the driver’.”

Drivers do matter. But “update the driver” is the operational equivalent of “try turning it off and on again.”
Sometimes it’s correct; often it’s a distraction from the actual bottleneck—data stalls, wrong clocks, power limits,
or a container runtime that isn’t even exposing the GPU properly.

What to do instead:

  • Pin known-good driver/toolkit combos for your fleet. Upgrade intentionally, not emotionally.
  • When you do upgrade, validate with a small battery of performance and correctness tests.

Historical fact: CUDA’s driver/runtime split is powerful but unforgiving; the “it runs on my laptop” gap often hides in version compatibility.

Myth 9: “Thermals only matter for gamers.”

In production, thermals are finance. A GPU that power-throttles or hits thermal limits will downclock,
turning your carefully tuned job into a slow drip. Data centers are full of reasons for this: clogged filters,
failed fans, weird airflow in a chassis, or someone racking high-TDP gear next to the one server that already runs hot.

What to do instead:

  • Monitor temperatures, power, and throttling reasons.
  • Validate that your power limit is set correctly and your cooling is actually doing what you think it is.

Interesting fact: Many GPUs will prefer power-limit throttling before thermal shutdown; you can lose performance without ever seeing a dramatic “overheat” event.

Myth 10: “GPUs are the hard part; storage is a solved problem.”

Storage is where “fast training” goes to die quietly. If your dataset lives on a busy shared filesystem,
or you’re doing random reads of lots of small files, your GPU will spend its life waiting for the next batch.
You’ll blame CUDA, then drivers, then the moon phase, and the whole time it’s a metadata storm on the NAS.

What to do instead:

  • Measure input pipeline throughput and storage latency during training.
  • Use sharding, larger record formats, local NVMe caches, and fewer small files.
  • Prefer sequential-ish reads and prefetching over “YOLO random access.”
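
A simple way to check the first bullet: iterate the dataloader alone, without touching the GPU, and compare its samples/sec against end-to-end training. A sketch (loader is whatever your framework hands you):

# If the input pipeline alone is barely faster than end-to-end training, you are input-bound.
import time

def measure_input_pipeline(loader, batches=100):
    it = iter(loader)
    next(it)                                   # absorb first-batch / worker startup cost
    t0 = time.perf_counter()
    samples = 0
    for _ in range(batches):
        batch = next(it)
        first = batch[0] if isinstance(batch, (list, tuple)) else batch
        samples += len(first)
    dt = time.perf_counter() - t0
    print(f"{samples / dt:.1f} samples/sec from the input pipeline alone")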

Joke #2: The GPU is a race car; feeding it from a shared NFS full of tiny JPEGs is like towing a boat with a bicycle.

Historical fact: The rise of TFRecord/WebDataset-style formats was driven as much by storage and metadata scalability as by ML convenience.

Short historical and context facts (because myths have origin stories)

  • CUDA launched in 2007, turning GPUs from graphics-only devices into general-purpose parallel compute hardware.
  • Tensor Cores arrived with Volta (2017), shifting performance tuning toward matrix math and precision choices.
  • “Persistence mode” exists because initializing GPU contexts repeatedly can add seconds of overhead—painful in batch schedulers.
  • NVIDIA’s Multi-Instance GPU (MIG) made GPUs sliceable for isolation, but also made “why is my GPU small?” a weekly question.
  • NCCL became the default collective communication library because distributed training needs bandwidth-efficient collectives.
  • PCIe link training issues are older than modern ML; the problem just got expensive when the device became a GPU.
  • On Linux, pinned (page-locked) host memory is the difference between async transfers and “why is memcpy blocking my step?”
  • GPU clocks can be governed by power caps, thermals, and application clocks—three different knobs that people confuse constantly.

Practical tasks: commands, outputs, and decisions

These are real things you can run on a Linux GPU host. Each task includes what the output means and what decision you make from it.
Don’t treat them as a checklist to run once; treat them as instruments you keep calibrated.

Task 1: Confirm the GPU is visible and the driver is healthy

cr0x@server:~$ nvidia-smi
Tue Jan 21 11:02:13 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4   |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:81:00.0 Off |                    0 |
| 34%   52C    P0              180W / 400W|  12000MiB / 40536MiB |     62%      Default |
+-----------------------------------------+----------------------+----------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=======================================================================================|
|    0   N/A  N/A     23141      C   python                                        11850MiB |
+---------------------------------------------------------------------------------------+

Meaning: driver loads, GPU is present, and a process is actually using it.
Decision: if this fails or shows “No devices were found,” stop and fix drivers/container runtime/PCIe before chasing performance.

Task 2: Watch utilization, clocks, and power over time (spot starvation vs throttling)

cr0x@server:~$ nvidia-smi dmon -s pucmt
# gpu   pwr gtemp mtemp sm   mem   enc   dec   mclk  pclk
# Idx     W     C     C   %     %     %     %   MHz   MHz
    0   185    53     -  61    48     0     0  1215  1410
    0    85    50     -   8     6     0     0  1215   510
    0   190    54     -  63    49     0     0  1215  1410

Meaning: when SM% and power drop together, the GPU is idle or waiting. When SM% is high but clocks are low, you might be throttling.
Decision: idle → investigate input pipeline/CPU/IO; throttling → check thermals, power limits, and clocks.

Task 3: Check throttling reasons (power/thermal caps)

cr0x@server:~$ nvidia-smi -q -d PERFORMANCE | sed -n '1,160p'
==============NVSMI LOG==============

Performance
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting  : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
        HW Thermal Slowdown         : Not Active
        Sync Boost                  : Not Active

Meaning: “SW Power Cap: Active” indicates the GPU is being capped by a power limit.
Decision: confirm power caps are intentional; if not, adjust power limit or fix rack power/cooling policies.

Task 4: Confirm PCIe link speed and width (silent performance killer)

cr0x@server:~$ nvidia-smi -q | grep -A3 "PCI"
    PCI
        Bus                          : 0x81
        Device                       : 0x00
        Domain                       : 0x0000
--
        Link Width                   : 16x
        Link Generation              : 4

Meaning: x16 Gen4 is what you want on modern servers; x8 or Gen3 may be expected on some platforms but should be known.
Decision: if you see x4 or Gen1/2/3 unexpectedly, reseat card, check BIOS settings, check risers, and validate slot wiring.

Task 5: Check NUMA topology and device affinity (avoid cross-socket pain)

cr0x@server:~$ nvidia-smi topo -m
        GPU0    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-31            0
NIC0    PHB      X      0-31            0

Meaning: GPU and NIC share a PCIe host bridge (PHB) and are close to the same NUMA node.
Decision: if GPU is attached to NUMA node 1 and your dataloader threads run on node 0, pin processes/IRQs to reduce latency.

Task 6: Validate container GPU access (the “it’s running on CPU” classic)

cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Tue Jan 21 11:03:48 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4   |
+---------------------------------------------------------------------------------------+

Meaning: the container can see the GPU through the runtime.
Decision: if this fails, fix NVIDIA Container Toolkit/runtime configuration before blaming your ML code.

Task 7: Confirm the right kernel driver modules are loaded

cr0x@server:~$ lsmod | grep -E '^nvidia|^nvidia_uvm'
nvidia_uvm           1679360  0
nvidia_drm            126976  2
nvidia_modeset       1327104  1 nvidia_drm
nvidia              62287872  72 nvidia_uvm,nvidia_modeset

Meaning: missing nvidia_uvm can break some CUDA workloads; missing core modules means the driver isn’t loaded.
Decision: if modules are absent, inspect secure boot, DKMS build, kernel upgrade, and driver installation health.

Task 8: Spot CPU-side bottlenecks (iowait and context switching)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0-21-generic (train-node-07)  01/21/2026  _x86_64_  (64 CPU)

11:04:10 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %idle
11:04:11 AM  all    12.0  0.0  6.5    38.0  0.0  1.5    0.0   42.0
11:04:12 AM  all    11.5  0.0  6.0    41.0  0.0  1.5    0.0   40.0

Meaning: significant iowait suggests storage latency is throttling your pipeline; high sys can hint at networking or filesystem overhead.
Decision: if iowait is high while GPU is idle, prioritize storage and input format fixes over GPU tuning.

Task 9: Check filesystem latency and throughput on the dataset path

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0-21-generic (train-node-07)  01/21/2026  _x86_64_  (64 CPU)

Device            r/s   rkB/s  rrqm/s  %rrqm r_await aqu-sz  %util
nvme0n1         820.0 52480.0    12.0   1.4    1.2   0.9    58.0
nfs0           1200.0 19200.0     0.0   0.0   18.5  22.0    99.0

Meaning: the NFS-like device shows high await and 99% utilization: you are storage-bound for reads.
Decision: stage data to local NVMe, shard/pack files, add caching, or move to a format with fewer metadata ops.

Task 10: Prove you’re not swapping or under memory pressure on the host

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           503Gi       212Gi        21Gi       2.1Gi       270Gi       286Gi
Swap:           16Gi       7.5Gi       8.5Gi

Meaning: swap usage during training is a red flag; it can stall dataloaders, tokenizers, and caching layers.
Decision: reduce host memory footprint, increase RAM, or cap parallelism; avoid swap for performance-critical training nodes.

Task 11: Confirm the process is actually using the GPU (and not stuck)

cr0x@server:~$ ps -o pid,ppid,stat,etime,cmd -p 23141
  PID  PPID STAT     ELAPSED CMD
23141 22980 Sl+      00:17:42 python train.py --config prod.yaml

Meaning: the process is alive; “Sl+” indicates sleeping with threads—could be normal, could be waiting.
Decision: if GPU is idle and process sleeps, inspect dataloader, locks, or distributed barriers.

Task 12: Inspect per-process GPU usage and compute mode

cr0x@server:~$ nvidia-smi pmon -c 1
# gpu        pid  type    sm   mem   enc   dec   jpg   ofa   command
    0      23141     C    62    48     0     0     0     0   python
    0      1822      G     0     1     0     0     0     0   Xorg

Meaning: python is consuming SM and memory. If SM is 0 while memory is high, you’re likely stalled or between steps.
Decision: correlate with app logs; if stuck, attach profiler or enable debug logging around data fetch and forward pass.

Task 13: Check GPU ECC and error counters (silent corruption and retries)

cr0x@server:~$ nvidia-smi -q -d ECC | sed -n '1,120p'
ECC Mode
    Current ECC                    : Enabled
    Pending ECC                    : Enabled

ECC Errors
    Volatile
        Single Bit
            Device Memory          : 0
        Double Bit
            Device Memory          : 0

Meaning: ECC is enabled and counters are clean.
Decision: if you see rising correctable errors, plan maintenance; if uncorrectable errors appear, drain the node.

Task 14: Validate GPU persistence mode (reduce init overhead and flakiness)

cr0x@server:~$ nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:81:00.0.

Meaning: persistence mode keeps the driver context warm.
Decision: enable it on dedicated compute nodes unless your operational model requires otherwise (e.g., strict power savings).

Task 15: Confirm clock policy / application clocks (avoid accidental downclock)

cr0x@server:~$ nvidia-smi -q -d CLOCK | sed -n '1,140p'
Clocks
    Graphics                        : 1410 MHz
    SM                              : 1410 MHz
    Memory                          : 1215 MHz

Applications Clocks
    Graphics                        : 1410 MHz
    Memory                          : 1215 MHz

Meaning: clocks are where you expect. If application clocks are set low, you’ll underperform “mysteriously.”
Decision: standardize clock settings via provisioning, and audit drift after driver updates.

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption (GPU “failure” that was actually storage)

A mid-sized company ran nightly retraining on a shared cluster. One week, training time doubled. The on-call saw low GPU utilization,
filed it under “GPU degradation,” and escalated to hardware. A vendor ticket was opened. Replacement timelines were discussed.
The CFO became briefly interested in machine learning, which is how you know it was bad.

The first clue was that the GPUs weren’t hot. Power draw was low and spiky, like a job that sprints then naps.
Meanwhile, CPU iowait climbed, and the shared filesystem looked like it had caught the flu: high latency, high metadata load.
The model was fine; the GPUs were fine. The dataset pipeline was quietly broken.

The “optimization” that triggered it was innocent: a dataset refresh that increased the number of small files by splitting shards “for parallelism.”
It improved one benchmark on a developer laptop with local SSD, then detonated in production on shared storage.
The training job began doing tens of thousands of tiny opens and stats per minute across many workers.

Fixing it was boring: pack samples into larger shards, stage to local NVMe, and prefetch. GPU utilization jumped as a side effect,
but the true win was predictable step time. The hardware ticket was closed with the quiet shame it deserved.

Mini-story 2: The optimization that backfired (mixed precision everywhere, correctness nowhere)

Another org chased training speed. They enabled automatic mixed precision globally and celebrated a big throughput increase on a single node.
Then distributed training started producing intermittent NaNs. Not always. Just enough to ruin confidence and trigger reruns.
The incident pattern was classic: “Works in staging, fails in production, only when the moon is… busy.”

The real problem wasn’t AMP itself. It was a layer in the model that was numerically touchy under reduced precision,
plus a loss scaling configuration that was fine for one dataset distribution and fragile for another.
Add different ordering from multi-worker dataloading, and the instability surfaced.

The team responded by disabling AMP entirely. Training stabilized, but throughput fell below the business target.
Then they tried to compensate by increasing batch size. That changed convergence, required learning-rate retuning, and caused a second wave of regressions:
models trained faster but performed worse.

The eventual fix was disciplined: selectively keep sensitive operations in FP32, validate loss scaling, add NaN guards,
and gate changes with a small correctness suite. AMP returned—this time as an engineered choice, not a blanket toggle.
The lesson wasn’t “AMP is risky.” The lesson was “treat performance changes like production changes: observe, test, roll forward carefully.”

Mini-story 3: The boring but correct practice that saved the day (topology and pinning)

A company ran multi-GPU inference with tight tail-latency SLOs. One day, p95 latency rose without a clear code change.
The service looked healthy: no errors, GPUs visible, no obvious thermal alarms. But latency drifted upward like a slow leak.

They had one boring practice that paid off: every node boot ran a small “hardware sanity” script.
It captured PCIe link width, NUMA affinity, GPU clocks, driver versions, and basic storage metrics, then compared them to a known-good baseline.
When p95 moved, they didn’t guess; they diffed.

The diff showed one node had the GPU on a different NUMA node than the NIC after a motherboard replacement.
The workload was doing frequent host↔device transfers for pre/post-processing.
Cross-socket traffic increased latency and jitter. The GPU wasn’t slower; the memory path was longer.

The fix was simple: pin the process and memory allocations to the correct NUMA node, adjust IRQ affinity,
and update the rack documentation so future replacements matched slot topology. No heroics, no rewrites.
Just a refusal to “assume the bus is fine.”

Common mistakes: symptom → root cause → fix

1) Symptom: GPU memory is high, GPU utilization is near zero

  • Root cause: data pipeline stall (storage latency, dataloader deadlock, CPU transforms too slow), or the job is stuck before first iteration.
  • Fix: measure step-time breakdown; check iowait; reduce small-file reads; enable prefetch; increase dataloader workers carefully; use pinned memory.

2) Symptom: Newer GPU is slower than the old one

  • Root cause: power cap, wrong clocks, PCIe downtraining, or a model that is memory-bound and not compute-bound.
  • Fix: check throttling reasons and PCIe link; compare power draw; profile kernels; adjust batch size and fusion rather than blaming silicon.

3) Symptom: Random CUDA OOMs even though “free memory” exists

  • Root cause: fragmentation or peak memory spikes (different shapes, variable sequence lengths, dynamic graphs), or memory leak in caching layers.
  • Fix: use fixed shapes where possible; cap max sequence length; reduce batch; clear caches at safe boundaries; restart workers to reset fragmentation when needed.

4) Symptom: Multi-GPU scaling is terrible, one GPU always behind

  • Root cause: straggler worker (CPU contention, different NUMA node, noisy neighbor IO), or NCCL using a suboptimal interface/topology.
  • Fix: validate topology; pin CPU threads; ensure consistent storage locality; set NCCL to the right NIC; eliminate per-rank skew in dataloading.

5) Symptom: Training is fast for a while, then gradually slows

  • Root cause: background storage contention, CPU thermal throttling, memory pressure causing swap, or a metrics/logging storm growing with time.
  • Fix: monitor over time; throttle logging; isolate training IO; check host thermals; verify no memory growth/leaks.

6) Symptom: “GPU available” but code runs on CPU

  • Root cause: container not using GPU runtime, wrong build (CPU-only), missing device visibility env vars, or framework not moving tensors to device.
  • Fix: validate docker run --gpus all nvidia-smi; check framework device placement; enforce device assertions in code and CI.

7) Symptom: High GPU utilization but low throughput

  • Root cause: you’re doing expensive work inefficiently: tiny kernels, excessive synchronization, poor batch sizing, or non-fused operations.
  • Fix: profile; increase batch; fuse ops; reduce sync points; avoid frequent CPU round-trips.
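
For symptom 6 (“GPU available but code runs on CPU”), the cheapest insurance is an explicit placement assertion at startup; a minimal PyTorch-style sketch:

# Fail loudly before the first step instead of silently training on CPU for hours.
import torch

def assert_on_cuda(model, example_batch):
    assert torch.cuda.is_available(), "CUDA not visible: check driver / container runtime"
    assert next(model.parameters()).is_cuda, "model parameters are on CPU"
    assert example_batch.is_cuda, "input batch is on CPU"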

Checklists / step-by-step plan

Step-by-step: from “slow GPU” report to root cause

  1. Capture a baseline. Record step time, throughput, and hardware identifiers for one run. If you can’t compare, you can’t debug.
  2. Check GPU health. Run nvidia-smi, check errors/ECC, check clocks and power.
  3. Check utilization with context. Use nvidia-smi dmon while the job is active; correlate dips with dataloader logs.
  4. Validate topology. Confirm PCIe link width/gen and NUMA affinity. Fix obvious misplacements first.
  5. Inspect host pressure. CPU iowait, swap usage, memory pressure, and filesystem latency during the job.
  6. Prove data pipeline throughput. Measure read rates and latency on the dataset path; confirm file counts and access pattern.
  7. Profile one iteration. Identify time in input vs H2D transfer vs GPU compute vs synchronization.
  8. Change one variable at a time. Batch size, num_workers, prefetch depth, precision mode. Measure after each change.
  9. Roll out safely. Gate changes with a small performance+correctness suite; avoid fleet-wide toggles without canaries.

Checklist: “We’re about to buy GPUs”

  • Decide if you’re compute-bound or memory/bandwidth-bound with profiling, not vibes.
  • Budget for storage and networking. Your GPUs are only as fast as your input and your all-reduce.
  • Validate power and cooling for sustained loads, not bursty demos.
  • Standardize on a known-good driver/toolkit matrix and bake it into provisioning.

Checklist: “We’re about to run multi-GPU / distributed”

  • Confirm NCCL topology and NIC affinity.
  • Eliminate stragglers: consistent dataloading, consistent CPU pinning, consistent IO.
  • Measure scaling efficiency at 1, 2, 4, 8… GPUs. Stop when gains flatten and fix the constraint.

FAQ

1) Why is my GPU utilization low even though training is slow?

Low utilization plus slow training usually means the GPU is waiting: on dataloading, storage latency, CPU transforms, or synchronization.
Check power draw and clocks; then check iowait and dataset read latency.

2) Is 100% GPU utilization a good target for inference services?

Not if you care about latency. For online inference, running “hot” often increases tail latency and makes you fragile to burst traffic.
Target SLOs (p95/p99) and capacity planning, not utilization trophies.

3) How do I know if I’m compute-bound or memory-bound?

A quick clue: high SM utilization and high power draw suggest compute pressure; low SM utilization with high memory throughput suggests memory-bound behavior.
The real answer is profiling: you need to see where time is spent and whether kernels are bandwidth-limited.

4) Why do I get CUDA OOM after hours of training?

Common causes: fragmentation from varying shapes, cached allocations growing, or a leak (storing tensors in a list, accumulating graphs).
Watch peak memory over time, enforce shape limits, and don’t retain computation graphs accidentally.
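
The classic slow leak looks like this (a hedged PyTorch-style sketch; train_step is a placeholder for your own step function):

# Accumulating live loss tensors keeps each step's autograd graph alive; store plain floats instead.
def track_losses(train_step, total_steps):
    losses = []
    for _ in range(total_steps):
        loss = train_step()
        # losses.append(loss)        # leak: retains the whole graph for this step
        losses.append(loss.item())   # safe: detaches and copies to a Python float
    return losses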

5) Does pinned memory always help?

It helps for host-to-device transfers and async copies, but too much pinned memory can hurt the host by reducing pageable memory flexibility.
Use it intentionally: enable it for dataloading buffers, not for everything.

6) Why is multi-GPU training slower than single GPU?

Because you added communication and synchronization. If your batch is too small, all-reduce dominates.
If one worker is slower (IO, CPU, NUMA), everyone waits. Fix stragglers and validate NCCL topology.

7) Should I always enable mixed precision?

Enable it when it’s measured to help and proven not to harm correctness. Some models need selective FP32 ops, stable loss scaling,
or different hyperparameters. Treat it as an engineering change with tests.

8) Why does training get faster when I copy the dataset to local NVMe?

Because shared filesystems often fail on metadata-heavy access patterns, many small files, or high concurrency.
Local NVMe reduces latency, avoids noisy neighbors, and turns “random read hell” into something the kernel can cache and prefetch.

9) Do driver upgrades improve performance?

Sometimes. They can also regress performance or break compatibility. Pin versions, test upgrades on a canary,
and only roll out when you can measure the benefit.

10) How do I prevent “GPU runs at low clocks” surprises?

Monitor clocks, power limits, and throttle reasons. Enforce baseline settings via provisioning, and alert on drift.
Most “mystery slowdowns” are policy changes, not physics.

Conclusion: next steps you can actually take

The enduring GPU myths survive because they’re comforting. “The GPU is underutilized” feels actionable.
“Our storage layout causes metadata contention under concurrency” feels like work. Do the work.

Next steps:

  1. Pick one representative job and record an end-to-end baseline: step time, throughput, GPU power/clocks, and IO latency.
  2. Run the fast diagnosis playbook when performance shifts. Don’t start with driver upgrades; start with constraints.
  3. Standardize node sanity checks (PCIe width/gen, NUMA affinity, clocks, driver versions) and diff them when things change.
  4. Fix the boring bottlenecks first: dataset format, local caching, pinned memory, and CPU placement. GPUs reward discipline.