GPU efficiency: why “better” doesn’t always mean “bigger”

Somebody buys the shiny top-bin GPU. The dashboard lights up. The bill lights up more. And the training job? It runs… basically the same speed as the old box.

If you’ve ever stared at nvidia-smi showing 20–40% utilization while your stakeholders ask why their “upgrade” didn’t upgrade anything, you’re in the right place. The hard truth: GPUs are not magic accelerators; they’re picky, bandwidth-hungry co-processors. You don’t “use” them. You feed them.

The core idea: bigger isn’t automatically better

“Better GPU” usually means more compute, more tensor cores, more memory bandwidth (often via HBM), and sometimes more VRAM. But your workload is an assembly line, not a single machine. If the slowest station is CPU preprocessing, storage reads, PCIe transfers, all-reduce, or kernel launch overhead, then a bigger GPU just sits there… politely waiting.

In production, GPU efficiency is not “percent utilization.” It’s business throughput per dollar per watt per engineer-hour. A large GPU that’s mostly idle is not a premium product; it’s an expensive space heater with excellent marketing.

Here’s the decision-relevant statement:

  • If you can’t keep a mid-range GPU busy, a high-end GPU will not fix your pipeline.
  • If you have a stable, compute-bound workload, a bigger GPU can be a straight win—but only if interconnect, memory, and software stack scale with it.
  • If you scale out to multiple GPUs, the network and collective operations become part of your “GPU” whether you like it or not.

One quote worth taping to the monitor: “Hope is not a strategy.” It’s an old operations saying (attribution varies; it gets pinned on everyone from flight directors to football coaches).

Yes, it’s cliché. It’s also what I say when someone proposes “let’s just buy bigger GPUs” without measuring where time is actually going.

A production-grade mental model of GPU efficiency

1) Think in stages, not in chips

A typical training/inference loop has stages that can stall independently:

  1. Data source: object storage, NFS, local NVMe, database, feature store.
  2. Decode/augment: CPU transforms, image decode, tokenization, compression.
  3. Host-to-device transfer: pinned memory vs pageable, PCIe vs NVLink, async copies.
  4. GPU compute: kernels, tensor cores, memory-bound ops, reductions.
  5. Device-to-device / collectives: multi-GPU all-reduce, sharded attention, pipeline parallelism.
  6. Checkpointing/logging: filesystem latency, metadata storms, synchronous writes.

The “bigger GPU” often accelerates only step 4. If step 1 or 2 is the bottleneck, step 4 doesn’t matter. If step 5 dominates at scale, step 4 becomes a rounding error.
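
A minimal sketch of how to see this in your own loop, assuming a PyTorch-style training step; model, loader, optimizer, and device are placeholders for whatever your job already has. It splits wall time into “waiting for data” vs “GPU work,” which tells you whether step 4 is even worth upgrading.

import time

import torch

def timed_steps(model, loader, optimizer, device):
    data_wait = 0.0
    gpu_work = 0.0
    t_end = time.perf_counter()
    for batch, target in loader:
        data_wait += time.perf_counter() - t_end      # time blocked on the dataloader
        batch = batch.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)
        t0 = time.perf_counter()
        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.cross_entropy(model(batch), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                      # fence so the timer is honest
        gpu_work += time.perf_counter() - t0
        t_end = time.perf_counter()
    print(f"waiting for data: {data_wait:.1f}s   GPU work: {gpu_work:.1f}s")

If “waiting for data” is a large fraction of the total, the GPU you already own is being wasted and the bigger one will be wasted harder.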

2) Understand what “utilization” is hiding

nvidia-smi GPU-Util is a coarse signal: “was the GPU doing something recently?” It’s not a guarantee you’re running efficiently. A GPU can show 95% utilization while running memory-bound kernels that barely touch tensor cores, or while spinning on tiny kernels with high launch overhead. It can show 30% utilization while still delivering excellent throughput because the workload is bursty and overlapped.
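
A minimal sketch of the habit that matters here: never read GPU-Util on its own. The loop below samples standard nvidia-smi query fields; pair the printout with the samples/s or tokens/s counter your job already emits.

import subprocess
import time

def sample_gpu(index=0):
    # Standard nvidia-smi query fields; returns (util %, power W, memory MiB).
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=utilization.gpu,power.draw,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    util, power, mem = (float(v) for v in out.strip().split(", "))
    return util, power, mem

while True:
    util, power, mem = sample_gpu()
    # Interpret next to your own throughput metric: high util + low power +
    # flat throughput smells like memory-bound or tiny-kernel churn.
    print(f"util={util:.0f}%  power={power:.0f}W  mem={mem:.0f}MiB")
    time.sleep(5)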

3) Bigger GPUs have bigger appetites

When you jump from a smaller GPU to a larger one, you also increase the minimum required “feeding rate”:

  • More compute: means you need larger batches, more parallel work, better fused kernels, or more concurrency.
  • More bandwidth: means you need memory access patterns that can use it; random access won’t magically become sequential.
  • More VRAM: can reduce CPU-GPU transfers and allow bigger models/batches—but only if your framework uses it intelligently.

4) Latency and overhead don’t scale down

Kernel launch overhead, Python interpreter overhead, per-step synchronization, logging, and dataloader coordination can dominate small models or small batch sizes. A faster GPU shortens compute time, making overhead a bigger fraction of the total. Congratulations: you “optimized” yourself into a bottleneck.
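
A back-of-envelope example with hypothetical numbers, just to make the proportions concrete: fixed per-step overhead does not shrink when the GPU gets faster, so the observed speedup always lags the compute speedup.

overhead = 0.020          # 20 ms of Python/dataloader/sync cost per step (fixed)
compute_old = 0.080       # 80 ms of GPU compute on the old card
compute_new = 0.030       # 30 ms on the faster card (hypothetical)

step_old = overhead + compute_old   # 100 ms per step
step_new = overhead + compute_new   #  50 ms per step
print(f"compute speedup:  {compute_old / compute_new:.2f}x")  # ~2.67x
print(f"observed speedup: {step_old / step_new:.2f}x")        #  2.00x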

Joke #1 (short and relevant): A bigger GPU can’t fix a slow dataloader any more than a bigger coffee mug fixes insomnia.

5) Efficiency is a stack contract

In SRE terms, your GPU is downstream of: storage, CPU scheduling, memory allocator behavior, container runtime, drivers, libraries, and sometimes a cluster scheduler with opinions. If any layer breaks the contract—slow I/O, noisy neighbors, incorrect NUMA pinning, wrong CUDA version—you’re not “GPU-bound,” you’re “everything-else-bound.”

Interesting facts and historical context (that actually matter)

These aren’t trivia for bar night. Each one maps to a real failure mode or design decision.

  1. GPUs became general-purpose by accident and persistence: early “GPGPU” work used graphics APIs to do compute before CUDA made it mainstream. The legacy is why memory access patterns still matter so much.
  2. CUDA’s 2007-era bet: NVIDIA’s programming model made GPU compute accessible, but it also created an ecosystem where driver/toolkit mismatches can silently cost performance.
  3. PCIe has been a recurring limiter: for many workloads, host-device transfer bottlenecks persist even as GPU compute leaps forward.
  4. HBM changed the game for bandwidth-heavy models: high bandwidth memory helps, but memory-bound kernels still require locality and coalescing; bandwidth doesn’t fix poor access patterns.
  5. Tensor cores shifted the “optimal dtype” conversation: mixed precision can be transformative, but it’s not free—loss scaling, numerics, and conversion overhead can bite.
  6. NVLink/NVSwitch emerged because PCIe wasn’t enough: multi-GPU scaling often hinges on interconnect and topology more than raw FLOPS.
  7. Deep learning popularized “input pipelines” as first-class performance work: people learned the hard way that the thing they called their “GPU bottleneck” was often JPEG decode on the CPU.
  8. Checkpointing got harder as models got bigger: saving state can become a distributed storage and metadata problem, not a “write a file” problem.

Fast diagnosis playbook (first/second/third checks)

First: Is the GPU actually starved?

  • Look at per-process GPU utilization and memory use.
  • Check whether GPU power draw is near expected during steady state.
  • Check if CPU is saturated or iowait is high.

If GPU util is low and CPU/iowait is high, stop blaming the GPU. Fix the feeder.

Second: Is it compute-bound, memory-bound, or launch/overhead-bound?

  • Compare achieved TFLOPS to theoretical (roughly) using profiler metrics.
  • Check GPU memory throughput and SM occupancy indicators.
  • Look for lots of tiny kernels, sync points, or Python overhead.

If you’re memory-bound, “bigger compute” won’t help much. If you’re overhead-bound, a faster GPU makes it proportionally worse.

Third: If it’s multi-GPU, is the interconnect the real bottleneck?

  • Check topology: NVLink presence, PCIe generation, NUMA placement.
  • Check NCCL all-reduce time share in your profiler traces.
  • Check network saturation (for multi-node): RDMA, TCP, switch counters.

If collectives dominate, scaling up to a bigger GPU per node may beat scaling out—or vice versa—depending on topology. Measure, don’t vibe.

Hands-on tasks: commands, outputs, and decisions

These are the kinds of checks you can run during an incident bridge without needing a bespoke profiler setup. Each task includes: command, typical output, what it means, and the decision you make.

Task 1: Confirm GPU visibility, driver health, and basic load

cr0x@server:~$ nvidia-smi
Wed Jan 21 12:11:03 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4               |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
|  N/A   62C    P0              185W / 250W |  12450MiB / 40536MiB |     38%      Default |
+-----------------------------------------+----------------------+----------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|=======================================================================================|
|    0   N/A  N/A     21983      C   python3                                     12340MiB |
+---------------------------------------------------------------------------------------+

What it means: driver/toolkit versions, power draw, memory in use, util, and which process owns it.

Decision: If GPU-Util is low and power is low during “steady state,” you’re likely starved or overhead-bound. Move up-stack.

Task 2: Watch utilization and power over time (spot starvation)

cr0x@server:~$ nvidia-smi dmon -s puc
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   92    56     -    12     8     0     0  1215  1410
    0  210    64     -    85    72     0     0  1215  1410
    0   75    55     -     9     6     0     0  1215  1410

What it means: bursts of high SM% followed by idle troughs often mean input pipeline stalls or sync points.

Decision: If you see a “sawtooth” pattern, investigate dataloader, host-device copies, and CPU contention.

Task 3: Identify CPU saturation vs iowait (classic feeder failure)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(64 CPU)

12:12:11 PM  CPU   %usr %nice  %sys %iowait  %irq %soft  %steal  %idle
12:12:12 PM  all    70.0  0.0    8.0   20.0   0.0  2.0    0.0   0.0
12:12:12 PM   12    97.0  0.0    1.0    2.0   0.0  0.0    0.0   0.0
12:12:12 PM   13    95.0  0.0    3.0    2.0   0.0  0.0    0.0   0.0

What it means: high %usr across many cores suggests CPU preprocessing or Python overhead; high %iowait suggests storage stalls.

Decision: If %iowait is high, profile storage. If %usr is pegged, optimize decode/tokenization, increase workers, or offload transforms.

Task 4: Check process-level I/O stalls (is your dataset slow?)

cr0x@server:~$ iostat -xz 1 2
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.12    0.00    2.10   22.45    0.00   45.33

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s  w_await aqu-sz  %util
nvme0n1         420.0   58240.0     0.0   0.00    8.10   138.67    15.0   2048.0    2.30   3.60  82.00

What it means: high %util and high r_await indicate the device is busy and reads are waiting.

Decision: If storage is hot, cache datasets locally, increase read parallelism carefully, or change format (fewer small files).

Task 5: Spot “too many small files” (metadata pain)

cr0x@server:~$ find /datasets/vision/train -type f | head -n 5
/datasets/vision/train/000001.jpg
/datasets/vision/train/000002.jpg
/datasets/vision/train/000003.jpg
/datasets/vision/train/000004.jpg
/datasets/vision/train/000005.jpg

What it means: image-per-file datasets can crush metadata performance on networked filesystems.

Decision: Consolidate into shard formats (tar/LMDB/record files), or stage locally on NVMe.
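
A minimal sketch of the consolidation step using only the standard library: pack the per-image files into fixed-size tar shards so the filesystem serves a few large sequential reads instead of millions of metadata-heavy opens. Paths and shard size are assumptions; adjust for your layout and storage.

import tarfile
from pathlib import Path

SRC = Path("/datasets/vision/train")
DST = Path("/datasets/vision/train_shards")
SHARD_SIZE = 10_000                         # files per shard; tune for your storage

DST.mkdir(parents=True, exist_ok=True)
files = sorted(SRC.glob("*.jpg"))

for start in range(0, len(files), SHARD_SIZE):
    shard_path = DST / f"shard_{start // SHARD_SIZE:05d}.tar"
    with tarfile.open(shard_path, "w") as tar:
        for f in files[start:start + SHARD_SIZE]:
            tar.add(f, arcname=f.name)      # store flat inside the shard

The read side is symmetric: a dataset that iterates tar members sequentially, or whatever sharded-record format your framework already supports.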

Task 6: Check PCIe link speed and width (quiet throughput killer)

cr0x@server:~$ nvidia-smi -q -d PCIE | sed -n '1,80p'
==============NVSMI LOG==============

GPU 00000000:65:00.0
    PCIe Generation
        Current                       : 3
        Max                           : 4
    Link Width
        Current                       : x8
        Max                           : x16
    Tx Throughput                     : 1200 KB/s
    Rx Throughput                     : 980 KB/s

What it means: running at Gen3 x8 when you expected Gen4 x16 is a real reduction in host-device bandwidth.

Decision: Reseat card, check BIOS settings, check risers, verify slot wiring, and confirm not sharing lanes with other devices.

Task 7: Validate NUMA locality (CPU feeds GPU through the right socket)

cr0x@server:~$ nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-31            0

Legend:
  X    = Self

What it means: CPU cores 0–31 are local to GPU0. Using the wrong socket can increase latency and reduce effective bandwidth.

Decision: Pin dataloader workers and main process to the local NUMA node when performance matters.
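
A minimal sketch of the pinning itself, assuming the topology above (cores 0–31 local to GPU0); read the real range from nvidia-smi topo -m on your own nodes.

import os

LOCAL_NODE0_CORES = set(range(0, 32))        # CPUs local to GPU0, per the topo output

os.sched_setaffinity(0, LOCAL_NODE0_CORES)   # 0 means "the calling process"
print(sorted(os.sched_getaffinity(0)))       # verify the mask actually applied

Dataloader workers started after this call inherit the mask; launching under numactl or taskset achieves the same thing without code changes.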

Task 8: Check thermal or power throttling (the invisible handbrake)

cr0x@server:~$ nvidia-smi -q -d PERFORMANCE | sed -n '1,120p'
==============NVSMI LOG==============

GPU 00000000:65:00.0
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting  : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active

What it means: SW Power Cap: Active suggests your job hits a power limit (or a configured cap) and clocks down.

Decision: Adjust power limit (if policy allows), improve cooling, or accept that your “bigger” GPU won’t run at headline clocks.

Task 9: Confirm CPU frequency isn’t stuck low (container host misconfig)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

What it means: CPUs in powersave can slow preprocessing and coordination, starving GPUs.

Decision: Set performance governor on dedicated training nodes (with your ops policy in mind).

Task 10: Confirm hugepages / pinned memory pressure symptoms (H2D copy stalls)

cr0x@server:~$ grep -E 'HugePages|MemAvailable' /proc/meminfo
MemAvailable:   18432000 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

What it means: not definitive, but low available memory combined with heavy dataloading can increase paging and slow transfers.

Decision: Reduce host memory pressure, consider pinned memory settings, and avoid oversubscribing RAM with too many workers.

Task 11: Check container cgroup CPU limits (self-inflicted starvation)

cr0x@server:~$ cat /sys/fs/cgroup/cpu.max
200000 100000

What it means: quota 200ms per 100ms period effectively caps at 2 CPUs worth of time. For a GPU job, that’s often ridiculous.

Decision: Increase CPU requests/limits so your input pipeline can breathe; match CPU to GPU class.

Task 12: Check dataloader worker starvation (Python-side symptoms)

cr0x@server:~$ ps -o pid,pcpu,pmem,cmd -C python3 --sort=-pcpu | head
  PID %CPU %MEM CMD
21983 780.2 12.4 python3 train.py --config prod.yaml
22010  95.1  1.2 python3 -c from multiprocessing.spawn import spawn_main; spawn_main(...)
22011  94.6  1.2 python3 -c from multiprocessing.spawn import spawn_main; spawn_main(...)

What it means: multiple worker processes are active; if only the main process is busy, your workers may be stuck on I/O, GIL-heavy work, or misconfiguration.

Decision: Tune worker count, move CPU transforms to native code, precompute features, or use faster dataset formats.
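
A minimal sketch, assuming PyTorch, of the dataloader knobs that usually matter when workers are the bottleneck. The tiny synthetic dataset is a stand-in, and the worker count and batch size are starting points to measure, not recommendations.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    num_workers=8,            # scale with *local* cores, not total cores
    pin_memory=True,          # page-locked host buffers enable async H2D copies
    persistent_workers=True,  # avoid re-forking workers every epoch
    prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
    drop_last=True,           # static batch shapes help kernel fusion
)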

Task 13: Observe network throughput for multi-node training

cr0x@server:~$ sar -n DEV 1 2 | grep -E 'IFACE|mlx|eth'
12:15:01 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s
12:15:02 PM       eth0   8200.00   7900.00  910000.00  880000.00
12:15:03 PM       eth0   8300.00   8100.00  920000.00  895000.00

What it means: if the network is saturated during all-reduce phases, scaling stalls.

Decision: If you’re saturating, revisit NCCL settings, topology, gradient compression, or node count.

Task 14: Identify filesystem latency spikes during checkpointing

cr0x@server:~$ dmesg | tail -n 8
[123456.120001] nfs: server nfs01 not responding, still trying
[123456.421019] nfs: server nfs01 OK
[123457.002341] INFO: task python3:21983 blocked for more than 120 seconds.

What it means: your training step might be fine; your checkpoint write pauses everything.

Decision: Make checkpointing async, write locally then upload, or change storage backend.

Where big GPUs lose: common bottlenecks by layer

Input pipeline: the silent throughput tax

If you’re training on images, video, or large text corpora, you’re running a factory:

  • Read compressed bytes
  • Decode (often CPU)
  • Transform/augment
  • Batch, collate, pad
  • Copy to GPU

Each stage can be “fast enough” for a smaller GPU and suddenly insufficient for a bigger one. The upgrade doesn’t break the code; it breaks the assumptions about how much slack you had.

High-end GPUs amplify weak pipelines. That’s not a metaphor; it’s basic queueing theory. When service time at one station drops, the next slowest station becomes the bottleneck.

CPU-side overhead: death by a thousand syncs

Python is wonderful for expressing intent. It is not wonderful for running tiny, frequent operations in a tight loop while coordinating a device that can chew through teraflops. If your per-step loop includes logging, metrics aggregation, frequent device synchronizations, or non-vectorized CPU work, a faster GPU shortens the compute window and makes the CPU overhead proportionally worse.

Common culprits:

  • Calling .item() or forcing synchronization every step (see the sketch after this list)
  • Frequent shape changes that prevent kernel fusion
  • Too-small batch sizes (especially in inference)
  • Excessive per-sample preprocessing in Python
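
A minimal sketch of the first fix, assuming PyTorch: accumulate the metric on the device and pay the .item() sync once per logging interval instead of every step. The model and data are throwaway stand-ins.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)
data = torch.randn(2048, 512, device=device)

running_loss = torch.zeros((), device=device)   # accumulator lives on the device
LOG_EVERY = 100

for step in range(1000):
    x = data[step % len(data)].unsqueeze(0)
    loss = model(x).square().mean()
    loss.backward()
    model.zero_grad(set_to_none=True)
    running_loss += loss.detach()               # no host-device sync here
    if (step + 1) % LOG_EVERY == 0:
        # .item() is the only sync point, once every LOG_EVERY steps
        print(f"step {step + 1}: avg loss {(running_loss / LOG_EVERY).item():.4f}")
        running_loss.zero_()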

Host-device transfer: PCIe is not a suggestion

When people talk about “GPU speed,” they often forget the bus. PCIe bandwidth and latency are fixed properties of your platform, not your aspirations. If your model repeatedly moves data between CPU and GPU, or if you have pageable memory copies and sync points, you can bottleneck on transfers long before compute.

Typical anti-pattern: “We’ll keep embeddings on the CPU to save VRAM.” On a big GPU, that can be catastrophic.
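
A minimal sketch, assuming PyTorch and a CUDA device, of the transfer pattern that keeps PCIe off the critical path: page-locked (pinned) host memory plus non_blocking copies on a side stream, so the next batch moves while the current one computes. Shapes and names are placeholders.

import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

host_batch = torch.randn(256, 3, 224, 224).pin_memory()    # page-locked source

with torch.cuda.stream(copy_stream):
    dev_batch = host_batch.to(device, non_blocking=True)   # async H2D on the side stream

# ... compute on the *previous* batch runs here on the default stream ...

torch.cuda.current_stream().wait_stream(copy_stream)        # fence before consuming dev_batch
out = dev_batch.mean()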

Memory-bound kernels: when FLOPS don’t matter

Some operations are bound by memory bandwidth or cache behavior, not compute. Think elementwise ops, certain normalization layers, embedding lookups, and attention patterns that are not optimized for locality. You can buy more compute and see little improvement.

Signals:

  • High GPU-Util but modest power draw relative to expected
  • Profiler shows low tensor core utilization
  • Memory throughput near peak while SM utilization is “busy” but not productive
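
If those signals point at bandwidth, a back-of-envelope check makes it obvious why more FLOPS won’t help. The peak numbers below are illustrative, not specs for any particular card: compare the op’s arithmetic intensity to the GPU’s balance point.

flops_per_elem = 1                        # e.g., an elementwise add
bytes_per_elem = 3 * 4                    # read two fp32 inputs, write one output
intensity = flops_per_elem / bytes_per_elem            # ~0.08 FLOP/byte

peak_tflops = 19.5                        # hypothetical fp32 peak
peak_bw_tbps = 1.6                        # hypothetical memory bandwidth, TB/s
balance = (peak_tflops * 1e12) / (peak_bw_tbps * 1e12)  # ~12 FLOP/byte to be compute-bound

print(f"op intensity {intensity:.2f} vs balance point {balance:.1f} FLOP/byte")
# 0.08 << 12: this op is bandwidth-bound; buying more compute moves nothing.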

Multi-GPU: scaling taxes arrive fast and don’t leave

Multi-GPU training is an argument with physics. At some point, you spend more time coordinating than computing. Bigger GPUs can help by reducing the number of GPUs needed for a given batch/model, which reduces collective overhead. Or they can hurt by encouraging larger node counts with a weak interconnect, making all-reduce dominate.

Topology matters. NVLink vs PCIe. Socket placement. Switch oversubscription. If you don’t know your topology, you don’t know your performance.
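
A minimal sketch, assuming PyTorch DDP launched via torchrun, of two knobs worth checking before blaming the GPUs: gradient bucket size (how all-reduce work is chunked) and NCCL debug logging (which interface and topology it actually chose). The model is a stand-in and bucket_cap_mb=50 is a starting point to measure, not a recommendation.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("NCCL_DEBUG", "INFO")   # log NCCL's topology decisions
dist.init_process_group("nccl")               # assumes torchrun-style env vars

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=50,    # larger buckets: fewer, chunkier all-reduces (measure!)
)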

Checkpointing and observability: the “it was fine until it saved” phenomenon

Checkpointing is often synchronized. If you write from every rank, or you write to a slow shared filesystem, your job will appear “mysteriously slow” even if compute is healthy. Storage latency spikes become GPU idling spikes.
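
A minimal sketch of the “write locally, upload later” pattern: the training loop only pays for a fast local NVMe write, and shipping the file to shared or object storage happens in a background thread. The upload step is a placeholder for your own client.

import shutil
import threading

import torch

def save_checkpoint(state_dict, local_path, remote_path):
    torch.save(state_dict, local_path)              # fast local write; blocks briefly

    def _upload():
        # placeholder: swap in your object-store client or rsync wrapper
        shutil.copyfile(local_path, remote_path)

    threading.Thread(target=_upload, daemon=True).start()

# usage inside the loop, e.g. every N steps:
# save_checkpoint(model.state_dict(), "/nvme/ckpt_00100.pt", "/mnt/shared/ckpt_00100.pt")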

Joke #2 (short and relevant): Checkpointing to a shared filesystem during peak hours is a great way to learn what “backpressure” feels like, emotionally.

Three corporate-world mini-stories (anonymized)

Mini-story 1: The incident caused by a wrong assumption

A company rolled out new GPU nodes to accelerate a recommender model retrain. The change request was clean: bigger GPUs, same code, same dataset. The assumption was simple and comforting: “Compute was the bottleneck.”

Within days, the training SLA slipped. GPUs reported low utilization, and the on-call channel filled with screenshots of idle accelerators and very non-idle cloud invoices. Someone blamed drivers. Someone else blamed the framework. The usual.

The actual problem was more boring: the dataset lived on a shared network filesystem. The old nodes had smaller GPUs that took longer per step, so filesystem latency hid under compute time. The new GPUs shortened compute so much that every step hit a read/metadata wall. The job didn’t get faster; it got more sensitive.

The fix was staged data: nightly rsync (or object-store sync) to local NVMe, plus sharding small files into larger blobs. Utilization climbed, training time dropped, and the incident stopped. The lesson wasn’t “network filesystems are bad.” The lesson was: measure the pipeline end-to-end before changing only one station.

Mini-story 2: The optimization that backfired

Another team optimized for GPU memory by aggressively reducing batch size and enabling gradient checkpointing everywhere. VRAM usage looked great. The model fit comfortably, leaving room for future growth. Everyone celebrated—quietly, because celebrations are also overhead.

Then throughput tanked. GPUs were “busy,” but step time increased. The profiler showed more recomputation (expected) and a flood of small kernels (less expected). The smaller batch increased per-step overhead, reduced kernel fusion opportunities, and worsened all-reduce efficiency because the compute/communication ratio dropped.

They had optimized the wrong constraint. The system wasn’t VRAM-bound; it was latency/overhead-bound. The “memory optimization” paid with compute, synchronization, and communication.

The recovery plan was pragmatic: increase batch size until kernels got chunky again, checkpoint only the layers that mattered, and use mixed precision with careful validation. Memory use rose. Throughput rose more. The business metric—models trained per day—finally matched the hardware spend.

Mini-story 3: The boring but correct practice that saved the day

A platform team had a rule that annoyed everyone: every GPU node image had a small “performance sanity suite” that ran after provisioning. It checked PCIe link width, CPU governor, driver/library compatibility, and basic NCCL bandwidth tests. People called it bureaucracy.

One week, a batch of new servers arrived. Jobs ran, but multi-GPU performance was awful on a subset of nodes. The suite flagged those nodes immediately: GPUs negotiated a lower PCIe width due to a riser issue, and some nodes had a BIOS setting that limited link speed. The GPUs were fine. The platform was not.

Because the checks ran automatically, the team quarantined the bad nodes before they polluted the training fleet. No prolonged incident, no wild goose chase, no “framework regression” postmortem with the wrong culprit.

It was boring. It was correct. And it saved days of engineer time—which is the most expensive resource in the building.

Common mistakes: symptom → root cause → fix

1) Low GPU-Util, high CPU usage

Symptom: GPU sits at 10–40%, CPU pegged.

Root cause: CPU-bound preprocessing (decode/tokenization/augmentation), Python overhead, too few dataloader workers.

Fix: Increase workers, move transforms to vectorized/native ops, cache preprocessed data, pin to correct NUMA node, ensure container CPU limits aren’t choking you.

2) Low GPU-Util, high iowait

Symptom: GPU idle troughs match storage latency spikes.

Root cause: dataset on slow network storage, too many small files, metadata contention.

Fix: shard datasets, stage to local NVMe, use read-ahead/caching, avoid per-sample file opens, make checkpointing async.

3) High GPU-Util but disappointing throughput

Symptom: 90–100% util, but step time not great.

Root cause: memory-bound kernels, poor kernel fusion, unoptimized attention, dtype conversions.

Fix: use fused kernels where possible, adjust batch/sequence lengths to improve efficiency, validate mixed precision, profile to find hot ops.
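
For the mixed-precision part of that fix, a minimal sketch assuming PyTorch: autocast plus loss scaling, with throwaway stand-ins for the model, data, and optimizer. Turning it on is the easy half; validating numerics against a full-precision baseline is the part that keeps it from becoming a slow-motion outage.

import torch

device = "cuda"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(256, 1024, device=device)

for _ in range(10):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()   # matmul-heavy ops can hit tensor cores
    scaler.scale(loss).backward()         # scaling guards against fp16 gradient underflow
    scaler.step(opt)
    scaler.update()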

4) Performance regresses after upgrading to a “bigger” GPU

Symptom: new GPU is slower or only marginally faster.

Root cause: unchanged batch size leading to overhead dominance; PCIe link negotiated down; power throttling; wrong CPU governor.

Fix: re-tune batch size, validate PCIe Gen/width, check throttle reasons, set CPU governor appropriately, verify NUMA locality.

5) Multi-GPU scaling stalls after 2–4 GPUs

Symptom: doubling GPUs doesn’t double throughput.

Root cause: all-reduce dominates, weak interconnect, bad topology, small per-GPU batch.

Fix: increase per-GPU work, optimize collectives (bucket sizes), use fewer nodes with bigger GPUs, or improve network/interconnect alignment.

6) Random slowdowns or “every 10 minutes it pauses”

Symptom: steady training interrupted by periodic stalls.

Root cause: checkpointing/logging, GC pauses, filesystem hiccups, background compactions.

Fix: stagger checkpoints, write locally then upload, reduce synchronous logging, monitor filesystem latency, avoid shared metadata hotspots.

Checklists / step-by-step plan

Step-by-step: making a bigger GPU actually faster

  1. Establish a baseline: measure images/s, tokens/s, or requests/s on the old GPU with the same code and dataset snapshot.
  2. Confirm platform sanity: driver version, PCIe Gen/width, CPU governor, NUMA topology, power limits.
  3. Check the feeder: storage throughput/latency, dataloader worker health, CPU headroom, memory pressure.
  4. Re-tune batch size: bigger GPU often wants bigger batches; if you can’t, expect diminishing returns.
  5. Profile one representative run: identify top kernels, CPU stall time, H2D time, collective time (a minimal profiler sketch follows this list).
  6. Fix the biggest stall: don’t “optimize everything.” Pick the bottleneck that dominates wall time.
  7. Validate correctness: especially with mixed precision and fused kernels—silent accuracy regressions are real outages, just slower.
  8. Re-check efficiency per dollar: sometimes two “smaller” GPUs with good feeding beat one giant GPU with a starved pipeline.
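
For step 5, a minimal sketch using torch.profiler (assumed; any profiler that splits CPU time from CUDA time will do). The model and input are stand-ins; the point is the table at the end, sorted once by CUDA time to find hot kernels and once by CPU time to find overhead.

import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda"
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.ReLU()).to(device)
x = torch.randn(512, 2048, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        model(x).sum().backward()

# Top time consumers; swap sort_by for "cpu_time_total" to see host-side overhead.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))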

Checklist: what to avoid when shopping for GPUs

  • Buying for peak TFLOPS without mapping your workload’s bottleneck (compute vs memory vs I/O vs communication).
  • Assuming VRAM solves everything; it solves some things, and introduces others (bigger checkpoints, longer startup, more expensive failures).
  • Ignoring interconnect topology for multi-GPU (PCIe lanes, NVLink presence, NUMA placement).
  • Ignoring the CPU: a weak CPU can starve a monster GPU.
  • Ignoring storage: “dataset is on shared NFS” is not a performance plan.

Checklist: “boring SRE controls” that keep GPU fleets fast

  • Node-level acceptance tests: PCIe Gen/width, basic bandwidth tests, throttle reason checks.
  • Standardized images with pinned driver/library compatibility.
  • Dashboards showing GPU util, power, memory BW, CPU iowait, and storage latency.
  • Quarantine automation for nodes that negotiate down link speed or show repeated Xid errors.
  • Workload-specific runbooks (training vs inference vs batch embedding jobs).

FAQ

1) If my GPU utilization is low, is the GPU the problem?

Usually no. Low utilization is often a symptom of starvation: CPU preprocessing, storage, network, or synchronization. Confirm with power draw and CPU/iowait.

2) Why did upgrading to a faster GPU make my job look worse?

You shortened the compute slice, so fixed overhead (Python, dataloader coordination, I/O) became dominant. Bigger GPUs reduce compute time; they don’t reduce your bad habits.

3) Is increasing batch size always the answer?

No. It can improve kernel efficiency and amortize overhead, but it can hurt convergence, increase memory, or shift bottlenecks to communication. Tune with metrics, not faith.

4) PCIe vs NVLink: when should I care?

Care when you do multi-GPU work on one node or frequent host-device transfers. If collectives or peer-to-peer transfers are significant, topology is performance.

5) Mixed precision didn’t speed things up. Why?

You might be memory-bound, overhead-bound, or spending time converting dtypes. Or your model has ops that don’t benefit from tensor cores. Profile to verify where time goes.

6) What’s the quickest way to tell if I’m storage-bound?

Look for high iowait and high disk/network utilization during training, plus sawtooth GPU SM% patterns. Then validate with iostat and filesystem logs.

7) Why doesn’t multi-GPU scale linearly?

Because you pay coordination costs (all-reduce, synchronization, stragglers). As you add GPUs, communication grows and eventually dominates unless the workload per GPU grows too.

8) Should I buy one huge GPU or multiple smaller ones?

Depends on your bottleneck. If you’re communication-bound, fewer bigger GPUs can be better. If you’re throughput-bound with many independent jobs, multiple smaller GPUs often win operationally.

9) Can Kubernetes scheduling hurt GPU efficiency?

Yes. CPU limits, memory pressure, noisy neighbors, and poor NUMA alignment can starve GPUs. If you treat GPU pods like stateless web pods, you’ll learn new forms of sadness.

10) What’s a sane “good” GPU utilization target?

There isn’t one number. For steady training, 80–95% can be fine. For bursty inference, lower utilization can still be optimal if latency SLOs are met. Track throughput and tail latency, not ego metrics.

Practical next steps

If you’re about to buy bigger GPUs, or you already did and you’re regretting it, do this in order:

  1. Run the fast diagnosis playbook and capture a 10-minute window of GPU util, power, CPU usage, iowait, and storage/network stats.
  2. Validate platform basics: PCIe Gen/width, throttle reasons, NUMA affinity, CPU governor, container CPU limits.
  3. Fix the feeder before touching model code: shard data, stage locally, reduce small-file overhead, add parallelism where it’s actually useful.
  4. Then tune the workload: batch size, mixed precision, kernel fusion opportunities, and multi-GPU communication settings.
  5. Institutionalize the boring checks: acceptance tests on nodes, dashboards that show starvation, and a runbook that names the top bottlenecks you’ve actually seen.

The punchline is not “don’t buy big GPUs.” Buy them when the workload deserves them. But treat GPU performance like any other production system: measure the bottleneck, change one thing, verify, repeat. Hardware is fast. Systems are slow. That’s why you have a job.
