You bought the big GPU. The dashboard still says 20–40% utilization. Training is slow, inference is jittery, frames drop, and someone in chat says the line: “Your CPU can’t feed your GPU.”
Sometimes that’s true. Sometimes it’s cargo cult. And sometimes the real culprit is storage, PCIe, or a tiny synchronization you didn’t know you wrote. The fix depends on which kind of “feed” you mean: work submission, data delivery, or keeping the GPU busy with enough parallelism.
What “feed the GPU” actually means
“Feeding the GPU” is a sloppy phrase that lumps three distinct pipelines into one vibe:
- Work submission: the host (CPU) launches kernels, sets up command buffers, schedules CUDA graphs, enqueues copies, and synchronizes streams. If the host can’t launch fast enough, the GPU sits idle between kernels.
- Data delivery: the host pulls data from storage, decodes/preprocesses it, and transfers it over PCIe/NVLink into GPU memory. If this is slow, the GPU waits for the next batch.
- Parallel work availability: even if submission and data are perfect, the GPU needs enough independent work to fill its SMs. Batches that are too small, kernels that are too tiny, or too many serial dependencies can keep utilization low without any CPU problem in sight.
So when someone says “CPU can’t feed GPU,” force a clarification: which feed? If they can’t answer, you’ve found the first bottleneck: diagnosis quality.
One idea to keep in your pocket, paraphrased from Donald Knuth: premature optimization is a common way to waste time; measure first, then optimize what matters.
And yes, I’m going to be annoying about measurement. Production systems don’t run on confidence.
Truth vs meme: when the CPU is really the bottleneck
The “truth” cases (CPU genuinely limits GPU throughput)
These are boringly real:
- Kernel launch overhead dominates. You’re launching thousands of tiny kernels per step. The CPU thread spends its life in driver calls, and the GPU spends its life waiting for the next launch.
- Single-threaded input pipeline. Data decoding, augmentation, tokenization, or feature engineering runs on one core because someone set workers=0 “for determinism.” GPU waits for the next batch like it’s stuck behind a slow cashier.
- Excessive synchronization. Hidden cudaDeviceSynchronize()-equivalents (direct or indirect) serialize the pipeline. The CPU blocks, then the GPU blocks, then you blame the CPU for being “too weak.” (A minimal sketch of this pattern follows this list.)
- CPU-bound preprocessing. Think JPEG decode, video decode without hardware acceleration, JSON parsing, or decompression. GPUs are fast; your CPU is still subject to physics and branch prediction.
- NUMA and memory bandwidth starvation. CPU has “many cores,” but they’re all pulling data across a socket boundary because your process and GPU are on different NUMA nodes.
- Driver/firmware overhead and interrupts. Particularly in multi-GPU servers with heavy I/O. The CPU isn’t “weak,” it’s busy being your I/O scheduler and interrupt sponge.
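Here is what “hidden synchronization” looks like in practice. A minimal PyTorch-flavored sketch, assuming a CUDA device; the model, sizes, and the train_step helper are illustrative, not a benchmark:

import torch

# Assumes a CUDA-capable machine; illustrative only.
device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(x):
    opt.zero_grad(set_to_none=True)
    loss = model(x).square().mean()
    loss.backward()
    opt.step()
    return loss

x = torch.randn(512, 1024, device=device)

# Anti-pattern: .item() forces a host-device sync every step, so the CPU
# stalls and the GPU drains its queue before the next launch.
for step in range(100):
    print(step, train_step(x).item())   # hidden synchronize-equivalent

# Better: keep losses on-device and synchronize only when you actually log.
losses = []
for step in range(100):
    losses.append(train_step(x).detach())
    if step % 50 == 49:                 # one sync per 50 steps instead of 50
        print(step, torch.stack(losses).mean().item())
        losses.clear()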
The “meme” cases (CPU isn’t the limiting factor)
This is where people burn weeks:
- GPU utilization looks low because the work arrives in short bursts. Averaged utilization hides micro-stalls; the GPU is often stalled on memory inside its kernels, not waiting on the CPU.
- You’re PCIe-limited. Host-device copies saturate the link. Upgrading the CPU won’t magically widen a PCIe Gen3 x8 slot.
- You’re VRAM-limited. You shrink batch size to fit memory, which reduces arithmetic intensity and makes the GPU look “underfed.” That’s not CPU; that’s working set size.
- Your kernels are inefficient. Low occupancy, poor memory coalescing, divergent branches. The GPU is “busy” being inefficient, not waiting for the CPU.
- Your job is latency-bound (like small-batch inference). GPU utilization may never be high because the workload doesn’t have enough parallelism. “100% GPU” is not a law of nature.
Joke #1: The GPU isn’t “hungry,” it’s picky—if you serve it one crouton at a time, it will stare at you like you’re the problem.
Interesting facts and a little history (because it explains today’s pain)
- Fact 1: Early GPU programming models (pre-CUDA) were essentially graphics APIs in disguise; “feeding the GPU” literally meant keeping the graphics pipeline full of triangles. Today’s compute kernels inherited the same throughput mindset.
- Fact 2: CUDA’s launch model made it easy to enqueue kernels, but early best practices encouraged many small kernels; modern guidance often pushes fusion and CUDA Graphs to reduce launch overhead.
- Fact 3: PCIe has improved steadily, but not at the same pace as GPU FLOPS. The gap is why host-to-device transfers are still a frequent bottleneck even in “monster” servers.
- Fact 4: NUMA became a mainstream pain point as dual-socket servers dominated data centers; GPU affinity and “closest CPU” placement matter because memory latency across sockets is not a rounding error.
- Fact 5: Pinned (page-locked) memory is faster for DMA transfers, but too much pinned memory can hurt the OS and other processes by reducing pageable RAM flexibility.
- Fact 6: NVLink exists largely because PCIe wasn’t enough for multi-GPU workloads; but it doesn’t fix CPU-side preprocessing, kernel launch overhead, or storage ingestion.
- Fact 7: “GPU utilization” counters were originally built for graphics and long-running kernels. Interpreting them for ML training with mixed copy/compute can be misleading without a timeline view.
- Fact 8: The rise of data-centric ML made input pipelines (decode, augment, tokenize) first-class performance problems; your “training job” often behaves like an ETL job with a GPU attached.
Four bottleneck types you keep confusing
1) CPU submission bottleneck (launch-bound)
Symptoms: GPU shows gaps between kernels, lots of short kernels, CPU thread pegged in system time or driver calls, step time scales with “number of kernels,” not batch size.
Typical fixes: fuse kernels, increase batch size, use CUDA Graphs, reduce Python overhead, avoid per-sample device calls, reduce synchronization, use persistent kernels where appropriate.
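If the framework is PyTorch, CUDA Graphs can collapse hundreds of per-step launches into one replay. A minimal sketch, assuming static input shapes and a CUDA device; the model here is a stand-in:

import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).to(device)
static_input = torch.randn(256, 1024, device=device)

# Warm up on a side stream so lazy initialization doesn't get captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass; replaying it later costs one launch, not many.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

for _ in range(100):
    static_input.copy_(torch.randn(256, 1024, device=device))  # refill in place
    graph.replay()                                              # single cheap launch
result = static_output.clone()

Graph capture requires fixed shapes and addresses, which is exactly why the input is copied in place rather than reallocated.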
2) CPU preprocessing bottleneck (decode/augment/tokenize)
Symptoms: CPU cores saturated in user space, disk/network reads look fine, GPU waits on input, increasing dataloader workers helps until you hit contention.
Typical fixes: parallelize preprocessing, vectorize, cache decoded data, move transforms to GPU, use faster codecs, reduce augmentation cost, use larger batches to amortize overhead.
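In PyTorch terms, most of these knobs live on the DataLoader. A hedged sketch; MyDataset stands in for your real decode/augment path, and the worker counts are starting points to measure, not gospel:

import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __len__(self):
        return 100_000
    def __getitem__(self, idx):
        # Pretend this is JPEG decode + augmentation on the CPU.
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    MyDataset(),
    batch_size=256,
    num_workers=8,            # raise until throughput stops improving, then back off
    pin_memory=True,          # page-locked staging buffers for faster H2D copies
    prefetch_factor=4,        # batches each worker keeps in flight
    persistent_workers=True,  # avoid re-forking workers every epoch
    drop_last=True,
)

device = torch.device("cuda")
for images, labels in loader:
    # non_blocking only overlaps with compute when the source memory is pinned
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break  # forward/backward would go here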
3) I/O and storage bottleneck (your “GPU server” is really a storage client)
Symptoms: high iowait, long read latencies, inconsistent throughput, GPU utilization noisy, performance improves when dataset is on local NVMe or cached.
Typical fixes: local cache, prefetch, larger sequential reads, better file formats, avoid tiny random reads, ensure filesystem and network aren’t throttling.
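A small stdlib sketch of the “pack tiny files into shards” fix; the paths and shard size are made up, and formats like WebDataset or TFRecord solve the same problem with more tooling:

import tarfile
from pathlib import Path

def pack_shards(src_dir: str, out_dir: str, shard_size: int = 2048) -> None:
    # Pack thousands of tiny files into a few large shards so the training
    # job does big sequential reads instead of a per-sample open().
    files = sorted(Path(src_dir).glob("*.jpg"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for start in range(0, len(files), shard_size):
        shard_path = Path(out_dir) / f"shard-{start // shard_size:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:  # uncompressed: cheap to stream
            for f in files[start : start + shard_size]:
                tar.add(f, arcname=f.name)

if __name__ == "__main__":
    pack_shards("/data/ds", "/data/ds-shards")  # hypothetical paths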
4) GPU-side bottleneck (it’s busy, just not in the way you want)
Symptoms: GPU utilization might be high or low, but profiling shows memory stalls, low occupancy, tensor cores idle, or poor kernel efficiency. CPU is mostly idle.
Typical fixes: kernel optimization, better libraries, mixed precision, layout changes, larger fused ops, fix uncoalesced memory access, ensure you’re using the right backend.
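One of the cheaper GPU-side wins is mixed precision. A minimal PyTorch sketch, assuming a tensor-core-capable GPU; the model and sizes are stand-ins:

import torch

device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(512, 4096, device=device)
target = torch.randn(512, 4096, device=device)

for _ in range(100):
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():       # lower-precision matmuls hit tensor cores
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()         # scaling keeps fp16 gradients representable
    scaler.step(opt)
    scaler.update()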
Fast diagnosis playbook (first/second/third)
First: establish if the GPU is waiting or working
- Check GPU utilization, power, clocks, memory usage, and—critically—whether utilization is bursty.
- Look for copy engines active vs compute active (H2D/D2H vs SM activity).
- If the GPU is truly idle a lot, it’s waiting on something upstream (CPU submission, preprocessing, I/O, synchronization).
Second: separate CPU submission from CPU preprocessing
- If one CPU thread is hot and system time is high: suspect launch overhead or synchronization.
- If many CPU cores are hot in user time: suspect preprocessing or decompression/decoding/tokenization.
- If CPU is mostly idle but GPU is underutilized: suspect GPU-side inefficiency or small workload.
Third: validate the transport path (PCIe/NUMA) and storage path
- Confirm PCIe link width/speed. “x16” isn’t a vibe; it’s a negotiated link state.
- Check NUMA locality: CPU, memory, and GPU should be aligned when possible.
- Check storage latency and read sizes. Random 4KB reads from a network filesystem will humble your H100.
Joke #2: Upgrading the CPU to fix a PCIe bottleneck is like buying a faster cashier because the delivery truck is stuck in traffic.
Practical tasks: commands, outputs, and decisions
These are the “stop arguing, start checking” steps. Each task includes a command, a sample output, what it means, and the decision you make.
Task 1: Check live GPU utilization, clocks, and power
cr0x@server:~$ nvidia-smi dmon -s puc
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 92 64 - 28 18 0 0 5001 1410
0 88 63 - 31 20 0 0 5001 1410
0 55 60 - 4 6 0 0 5001 705
What it means: SM% swinging from ~30% to ~4% with clocks dropping suggests bursty work or stalls. Power/clock drops often mean the GPU is idle enough to downclock.
Decision: If bursts correlate with batch boundaries, look upstream (input, sync). If SM% is steady but low, look at kernel efficiency or batch sizing.
Task 2: See per-process GPU usage (are you even looking at the right job?)
cr0x@server:~$ nvidia-smi pmon -s um
# gpu pid type sm mem enc dec command
0 27431 C 9 12 0 0 python
0 29902 G 0 1 0 0 Xorg
What it means: Your Python process is barely using SM. Memory is allocated, compute isn’t.
Decision: Profile input and synchronization. Don’t assume “allocated VRAM” equals “GPU busy.”
Task 3: Check PCIe link speed and width
cr0x@server:~$ nvidia-smi -q | sed -n '/PCI/,/Replay/p'
PCI
Bus : 00000000:81:00.0
Link Width : 8x
Link Speed : 8.0 GT/s
Replay Counter : 0
What it means: That’s PCIe Gen3 x8. If you expected Gen4 x16, you’ve already found a bottleneck class.
Decision: Fix BIOS settings, slot placement, risers, or platform mismatch before rewriting code.
Task 4: Validate negotiated PCIe state via lspci
cr0x@server:~$ sudo lspci -s 81:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
What it means: The card can do Gen4 x16 but is currently downgraded. This can happen due to slot wiring, BIOS forcing, or a bad riser.
Decision: Treat as hardware/platform issue. No amount of “dataloader workers” fixes a downgraded link.
Task 5: Check GPU topology and NUMA affinity
cr0x@server:~$ nvidia-smi topo -m
GPU0 CPU Affinity NUMA Affinity
GPU0 X 0-15 0
What it means: GPU0 is closest to CPU cores 0–15 on NUMA node 0.
Decision: Pin your process and memory allocations to those cores/node if you’re doing heavy CPU preprocessing or H2D transfers.
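One way to act on that from inside a Python job, as a Linux-only sketch; the core range comes from the sample topology above and is an assumption about your particular box:

import os

# Pin this process to the cores nvidia-smi topo -m reported as local to GPU0.
gpu_local_cores = set(range(0, 16))
os.sched_setaffinity(0, gpu_local_cores)          # 0 means "this process"
print("now pinned to:", sorted(os.sched_getaffinity(0)))

# CPU affinity alone doesn't bind memory pages; for that, launch the job under
# numactl, e.g. numactl --cpunodebind=0 --membind=0 python train.py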
Task 6: Check CPU saturation and run queue pressure
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 01/10/2026 _x86_64_ (32 CPU)
12:40:20 PM CPU %usr %nice %sys %iowait %irq %soft %idle
12:40:21 PM all 4.1 0.00 2.3 0.50 0.0 0.4 92.7
12:40:21 PM 7 98.0 0.00 2.0 0.00 0.0 0.0 0.0
12:40:21 PM 8 4.0 0.00 60.0 0.00 0.0 0.0 36.0
What it means: One core (CPU 7) is pegged in user time while the system overall is mostly idle. CPU 8 shows high system time (driver/kernel work). This smells like a single-threaded bottleneck (Python GIL, launch thread, or sync).
Decision: Optimize the host thread path: reduce per-step Python overhead, use batching, CUDA Graphs, avoid frequent sync calls.
Task 7: Catch I/O wait and storage latency hints
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
18.2 0.0 6.1 22.9 0.0 52.8
Device r/s rkB/s rrqm/s %util await
nvme0n1 85.0 4200.0 0.0 78.0 9.8
What it means: iowait is high and NVMe await is ~10ms under load. For an input pipeline doing many small reads, that’s painful.
Decision: Increase read size, prefetch, pack data into larger shards, cache locally, or move dataset off contended storage.
Task 8: Confirm the dataset access pattern (small random reads vs streaming)
cr0x@server:~$ sudo strace -f -e trace=openat,read -p 27431 -s 80 -tt 2>&1 | head -n 8
12:41:10.102334 openat(AT_FDCWD, "/data/ds/img_000812.jpg", O_RDONLY) = 57
12:41:10.102801 read(57, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0", 4096) = 4096
12:41:10.103122 read(57, "...", 4096) = 4096
12:41:10.103444 openat(AT_FDCWD, "/data/ds/img_000813.jpg", O_RDONLY) = 58
What it means: Lots of tiny 4KB reads across many files. That’s classic “looks fine on my laptop” data access that collapses at scale.
Decision: Consolidate files (tar/shards), use sequential reads, enable page cache friendly patterns, and prefetch batches.
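On the read side, the fix looks like streaming samples out of one large shard instead of issuing an openat() per image. A sketch that pairs with the packing sketch earlier in this article; the shard path and the iter_shard helper are hypothetical:

import tarfile

def iter_shard(shard_path: str):
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:                    # members come back in file order
            f = tar.extractfile(member)
            if f is not None:
                yield member.name, f.read()   # one sequential read per sample

if __name__ == "__main__":
    for name, raw_bytes in iter_shard("/data/ds-shards/shard-00000.tar"):
        pass  # decode raw_bytes in a worker, not in the hot loop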
Task 9: Check CPU frequency scaling (the quiet throughput killer)
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
What it means: CPU is allowed to underclock aggressively. For bursty preprocessing, this can increase tail latency and starve GPU between batches.
Decision: On dedicated training nodes, use performance governor or platform-appropriate tuning, then re-measure.
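A quick way to check this across every core rather than just cpu0, as a small Linux-only sketch that reads sysfs directly:

from pathlib import Path

governors = {}
for gov_file in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
    cpu_name = gov_file.parent.parent.name           # e.g. "cpu17"
    governors.setdefault(gov_file.read_text().strip(), []).append(cpu_name)

for governor, cpus in governors.items():
    print(f"{governor}: {len(cpus)} CPUs")           # any "powersave" here is suspect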
Task 10: Spot NUMA misplacement (process running “far” from the GPU)
cr0x@server:~$ numactl --show
policy: default
preferred node: current
physcpubind: 16 17 18 19 20 21 22 23
membind: 1
What it means: Your process is pinned to NUMA node 1, but earlier topology showed GPU0 prefers NUMA node 0. That’s cross-socket traffic for every DMA buffer and preprocessing output.
Decision: Re-pin to the GPU-local NUMA node (or move the job to a GPU attached to node 1). This is often a double-digit percent fix.
Task 11: Check for throttling and thermal constraints
cr0x@server:~$ nvidia-smi -q -d CLOCK
Clocks
Graphics : 705 MHz
SM : 705 MHz
Memory : 5001 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 5001 MHz
What it means: Current SM clock is much lower than application clock. If this persists under load, you may be power/thermal throttled or simply idle.
Decision: If utilization is high but clocks are low, investigate power caps, cooling, and chassis airflow. If utilization is low, it’s likely idle and downclocking normally.
Task 12: Identify a kernel-launch-heavy pattern (lots of tiny GPU work)
cr0x@server:~$ sudo perf top -p 27431 -g --stdio
Samples: 3K of event 'cycles'
18.40% libcuda.so.1 [.] cuLaunchKernel
11.22% libc.so.6 [.] memcpy
9.87% libpthread.so.0 [.] pthread_mutex_lock
6.31% python3.10 [.] _PyEval_EvalFrameDefault
What it means: The process spends a lot of cycles launching kernels and running Python interpreter frames. This is “submission overhead.”
Decision: Fuse ops, use compiled graphs, reduce Python-side per-op calls, and consider larger kernels/batches.
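If you’re on PyTorch 2.x, the “compiled graphs” option can be as small as one call. A hedged sketch; compile time, graph breaks, and the payoff all vary, so measure step time before and after:

import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).to(device)

compiled_model = torch.compile(model)   # first call compiles; later calls reuse it

x = torch.randn(256, 1024, device=device)
for _ in range(10):
    y = compiled_model(x)               # fewer Python-level dispatches per step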
Task 13: Check network filesystem impact (if your dataset is remote)
cr0x@server:~$ nfsstat -c
Client rpc stats:
calls retrans authrefrsh
248391 1203 0
Client nfs v4:
ops count
read 182744
open 50322
getattr 411802
What it means: Retransmits exist, and getattr/open volume is huge. Metadata chatter plus remote reads can absolutely starve a GPU.
Decision: Stage to local NVMe, reduce file count, increase client caching where safe, or change dataset format.
Task 14: Verify huge pages / pinned memory pressure signals (host memory path)
cr0x@server:~$ grep -E 'MemAvailable|Dirty|Writeback' /proc/meminfo
MemAvailable: 1842332 kB
Dirty: 482912 kB
Writeback: 12984 kB
What it means: Low MemAvailable and high Dirty suggest memory pressure and writeback in progress. This can slow preprocessing and cause noisy latency, even if the GPU looks “fine.”
Decision: Reduce pinned memory usage, avoid excessive caching in the application, or provision more RAM / isolate noisy neighbors.
Three corporate mini-stories (how teams actually get this wrong)
Mini-story 1: The incident caused by a wrong assumption
They had a new GPU cluster. Everyone was excited, and the launch checklist had the usual items: driver versions, CUDA runtime, health checks, and a quick training run. GPU utilization was low, so the conclusion arrived quickly: “These CPUs are too small. We cheaped out.”
The team escalated. Procurement got dragged into an engineering call. Someone proposed swapping the whole node SKU. A week of heat later, an SRE asked one annoying question: “What PCIe link did we actually negotiate?”
It turned out the nodes were cabled through a riser configuration that silently negotiated PCIe at a lower width and generation than expected. The GPUs were fine. The CPUs were fine. The path between them wasn’t. H2D transfers were capped, and the training loop stalled at each batch copy.
Once they fixed slot placement and BIOS settings, utilization jumped without touching a line of code. The postmortem wasn’t about PCIe. It was about assumptions: they treated “CPU can’t feed GPU” as an explanation instead of a hypothesis.
Mini-story 2: The optimization that backfired
A data engineering group wanted to “feed the GPU better,” so they increased dataloader workers aggressively and enabled pinned memory everywhere. Throughput improved on a quiet test node. They rolled it into the shared training fleet.
Within a day, training jobs started failing unpredictably. Some ran fast. Others hung. Some got killed by the OOM killer even though GPU memory was stable. The on-call rotation had a bad time, mostly because the graphs made it look like “random infrastructure flakiness.”
The root cause was mundane: pinned memory plus many workers created significant pressure on host RAM and the page allocator. On nodes with other colocated services and filesystem cache needs, the “optimization” turned into memory contention and latency spikes. Workers also thrashed the shared network filesystem with more concurrent small reads, raising tail latency for everyone.
The fix wasn’t to undo parallelism; it was to size it. They capped workers per node, staged datasets locally for high-throughput jobs, and only used pinned memory where transfer time was actually significant. Feeding the GPU is not permission to starve the operating system.
Mini-story 3: The boring but correct practice that saved the day
A different team ran a multi-tenant inference platform. They had a strict rule: every performance incident begins with a reproducible capture of host metrics, GPU metrics, and a short profile trace. No exceptions, no “I’m pretty sure.”
One Friday, latency alarms fired. GPU utilization was low, and the easy story was “CPU is saturated.” But their runbook forced a quick check of CPU run queues, GPU clocks, PCIe state, and storage latency for the model cache. The CPU wasn’t saturated. The GPU wasn’t waiting on launches. PCIe was healthy.
The trace showed the inference process blocking on file reads for model shards after a deployment. A harmless-looking change had moved the model cache directory from local NVMe to a network mount. Nobody thought it mattered because “the model fits in RAM,” except it didn’t always fit in page cache under churn.
They rolled back the cache path, warmed the cache deliberately, and the incident ended without heroic debugging. The practice that saved them was not a fancy profiler. It was a checklist that prevented them from chasing the wrong bottleneck story.
Common mistakes: symptom → root cause → fix
1) Symptom: GPU utilization low, VRAM high
Root cause: You allocated tensors on GPU, but compute is tiny or blocked by synchronization / input waits.
Fix: Profile for gaps; increase batch size; remove per-step sync; verify dataloader throughput; check for CPU preprocessing hotspots.
2) Symptom: Utilization spiky, step time inconsistent
Root cause: Input pipeline jitter (storage latency, remote filesystem, GC pauses, CPU frequency scaling).
Fix: Stage data locally; shard files; prefetch; pin CPU governor; cap worker count to avoid thrash.
3) Symptom: One CPU core pegged, others idle, GPU not busy
Root cause: Single-threaded launch or Python overhead; GIL contention; too many small kernels; frequent driver calls.
Fix: Fuse ops; use CUDA Graphs; batch work; move logic out of Python hot loop; reduce per-sample device operations.
4) Symptom: Copy engines busy, SM low
Root cause: Transfer bottleneck (PCIe saturated, pageable memory copies, small transfers, wrong link state).
Fix: Confirm the negotiated PCIe generation/width; use pinned memory judiciously; increase batch size; overlap copies with compute (a sketch follows this list); reduce host-device chatter.
5) Symptom: GPU clocks low under load
Root cause: Power cap, thermal throttling, or the GPU is actually idle enough to downclock.
Fix: Check power limits, cooling, chassis airflow; if it’s idle, go upstream and find the wait.
6) Symptom: Performance got worse after “more workers”
Root cause: Contention (filesystem metadata storms, RAM pressure, context switching overhead, cache thrash).
Fix: Right-size worker count; use larger shards; cache; reduce transforms; measure end-to-end throughput not just loader speed.
7) Symptom: Two identical servers differ massively
Root cause: Different PCIe negotiation, NUMA placement, BIOS power settings, background daemons, or storage path differences.
Fix: Compare PCIe LnkSta, CPU governor, NUMA bindings, and storage mount options. Standardize the node image and BIOS profile.
8) Symptom: GPU underutilized only at small batch sizes
Root cause: Workload lacks parallelism; launch overhead and memory latency dominate at small batch.
Fix: Increase batch size, use batching/queueing, or accept low utilization as a latency tradeoff. Don’t chase 100% utilization for a p99 SLA.
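A minimal sketch of the copy/compute overlap referenced in item 4, assuming PyTorch and a CUDA device; the model, batch sizes, and the prefetch helper are illustrative:

import torch

device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
copy_stream = torch.cuda.Stream()

# Pinned (page-locked) host memory is what makes non_blocking copies truly async.
host_batches = [torch.randn(512, 4096).pin_memory() for _ in range(8)]

def prefetch(i):
    # Copy batch i on a side stream so the copy engine works while the SMs compute.
    with torch.cuda.stream(copy_stream):
        return host_batches[i].to(device, non_blocking=True)

next_batch = prefetch(0)
for i in range(len(host_batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)   # batch i finished copying
    current = next_batch
    current.record_stream(torch.cuda.current_stream())     # tell the allocator who uses it
    if i + 1 < len(host_batches):
        next_batch = prefetch(i + 1)
    out = model(current)                                    # compute overlaps the next copy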
Checklists / step-by-step plan
Checklist A: Prove or disprove “CPU can’t feed GPU” in 20 minutes
- Observe GPU behavior: utilization, clocks, power, memory, burstiness.
- Check PCIe state: negotiated gen/width, errors, topology.
- Check CPU shape: single core pegged vs many cores vs idle; run queue pressure.
- Check iowait and dataset path: local vs network; random reads vs sequential.
- Confirm NUMA locality: process CPU affinity and memory binding vs GPU attachment.
- Take a short profile: identify whether time is in kernel launches, memcpy, decoding, or waiting.
Checklist B: If it’s CPU submission (launch-bound)
- Reduce kernel count: fuse operations, reduce Python loops.
- Increase per-launch work: bigger batches, larger tiles, fewer micro-kernels.
- Remove sync points: avoid forced device synchronizations in the hot path.
- Consider CUDA Graphs or a compiled execution path if your framework supports it.
- Re-test with the same dataset and a fixed random seed to avoid chasing noise.
Checklist C: If it’s CPU preprocessing (decode/augment/tokenize)
- Measure per-stage timing (read, decode, transform, batch, copy).
- Parallelize carefully: more workers until you hit contention, then stop.
- Prefer vectorized operations and libraries that use SIMD effectively.
- Cache expensive transforms when reproducible.
- Move transforms to GPU when it reduces CPU cost more than it increases GPU time.
Checklist D: If it’s I/O
- Stage datasets locally for high-throughput runs.
- Pack many small files into shards; avoid per-sample opens.
- Prefetch and read sequentially; increase request sizes.
- Watch metadata operations on network filesystems.
- Verify storage isn’t shared and saturated by other jobs.
Checklist E: If it’s GPU inefficiency
- Use a timeline profiler to see stalls (memory vs compute vs sync).
- Verify you’re hitting the right kernels (tensor cores, optimized libraries).
- Adjust batch size and precision to increase arithmetic intensity.
- Fix layout and memory access patterns; avoid tiny kernels.
- Stop blaming the CPU when the GPU is the one doing a bad job.
FAQ
1) Is low GPU utilization always bad?
No. For latency-sensitive inference with small batches, low utilization can be expected. Optimize for p95/p99 latency, not for making a utilization graph look macho.
2) What’s a quick sign it’s a dataloader bottleneck?
GPU utilization drops at batch boundaries, CPU cores spike in user time, and throughput improves when you cache data locally or increase workers (until contention). Also: spiky step times.
3) How do I tell PCIe bottleneck vs CPU bottleneck?
If copy engines are busy and SM is low, suspect PCIe/transfer. Validate negotiated link width/speed. If CPU is hot in driver calls and you see lots of tiny kernels, suspect submission overhead.
4) Why does “more dataloader workers” sometimes slow things down?
Because concurrency creates contention: filesystem metadata storms, cache thrash, memory pressure (especially with pinned memory), and context switching overhead. Throughput has a peak; find it.
5) Does pinned memory always help?
It helps DMA transfers, but it’s not free. Too much pinned memory reduces OS flexibility and can increase system instability under multi-tenant load. Use it where H2D is actually on the critical path.
6) Can the CPU “feed” the GPU on a single thread?
Sometimes. For large kernels and heavy compute, a single host thread can be enough. For workloads with many small kernels, lots of launches, or per-sample device calls, one thread becomes the choke point.
7) Why do two “identical” nodes perform differently?
Because they’re not identical in the ways that matter: PCIe link state, NUMA locality, BIOS power settings, background I/O, or storage mount paths differ. Measure those first.
8) What’s the most common misunderstanding behind “CPU can’t feed GPU”?
People treat “GPU utilization” as a single truth metric. It’s an average of a complex timeline. You need to know whether the GPU is idle, copying, stalled on memory, or simply executing short bursts.
9) Should I upgrade CPU or GPU first if training is slow?
If you haven’t measured, neither. If profiling shows host preprocessing or launch overhead dominates, CPU (or software changes) helps. If GPU kernels dominate and you’re compute-bound, GPU helps. If you’re I/O-bound, buy storage bandwidth and better data formats.
Next steps (do this, not vibes)
If you take one operational lesson from the “CPU can’t feed GPU” meme, let it be this: the phrase is not a diagnosis. It’s a prompt to instrument the pipeline.
- Run the fast diagnosis playbook and classify the bottleneck: submission, preprocessing, I/O, or GPU inefficiency.
- Validate the physical truth: PCIe link state, NUMA affinity, clocks, and throttling. Fix the platform before touching code.
- Pick one metric that represents user value (samples/sec, p99 latency, cost per batch) and optimize toward it. Not toward a pretty utilization line.
- Make changes that are reversible and measurable: one variable at a time, captured with the same dataset slice and a repeatable run.
- Write the runbook you wish you had. Future-you will be tired and less smart.
Feeding the GPU is a systems problem. The CPU is only one of the waiters.