ROPs, TMUs, SMs, and CUs: GPU Anatomy in Plain English

Your GPU is “at 95% utilization,” the dashboard is green, and the customer is still complaining that frames drop or inference latency spikes.
Welcome to the part of production where averages lie and bottlenecks hide in plain sight.

ROPs, TMUs, SMs, and CUs are the boring-sounding body parts that decide whether your GPU is sprinting, limping, or waiting for memory like it’s stuck behind a slow forklift.
This is the map you use when the numbers don’t add up.

A mental model that survives real workloads

There are two common failure modes when people talk about “GPU performance”:
(1) treating the GPU like a single number (utilization), and
(2) treating GPU parts like magic words (“more SMs fixes it”).
Both lead to expensive mistakes.

Use this mental model instead: the GPU is a pipeline of specialized factories with shared roads.
SMs/CUs are where most arithmetic happens; TMUs are texture fetch and filtering specialists (graphics-heavy, sometimes also relevant for sampling-like ops);
ROPs are where pixels get written out with blending and depth/stencil tests.
Everyone depends on memory bandwidth, cache behavior, and the driver/runtime keeping queues fed.

In ML inference, you can ignore ROPs and TMUs most of the time—but “most of the time” is not “always.”
If you’re doing rendering, compositing, video encode/decode, or running mixed workloads on the same GPU, these blocks become a source of contention.
In graphics, ROP/TMU balance matters constantly.

In operations terms: SMs/CUs are your CPU cores, memory bandwidth is your network uplink, caches are your CDN, and ROPs/TMUs are specialized accelerators.
If you only look at “GPU utilization,” you’re basically declaring a service healthy because top shows 90% CPU—then wondering why p99 latency is on fire.

One saying I’ve watched hold up across a decade of outages: hope is not a strategy.
If you want reliability, you measure the system you actually have, not the one you wish you bought.

GPU anatomy: ROPs, TMUs, SMs, CUs (and the parts people forget)

SMs (NVIDIA) and CUs (AMD): the muscle

NVIDIA calls them Streaming Multiprocessors (SMs).
AMD calls them Compute Units (CUs); in newer RDNA-era architectures you’ll also hear about WGPs, or Workgroup Processors, which group CUs in pairs.
These are the blocks that run the majority of shader code and compute kernels.

Each SM/CU contains multiple execution lanes (think “SIMD/SIMT lanes”), registers, schedulers, and local storage (shared memory / LDS).
They are designed to keep many threads “in flight” so that when one set stalls (often on memory), another can run.

Operations translation: SMs/CUs are throughput engines. They like consistent, parallel work and get cranky when fed tiny serial tasks,
heavy branching, or memory access patterns that look like spilled confetti.

Two practical consequences:

  • Occupancy (how many warps/wavefronts can be resident) matters, but it’s not a goal by itself. Sometimes lower occupancy with better memory locality wins.
  • Register pressure is a silent killer. Too many registers per thread reduces occupancy and increases spills to local memory, which is basically “global memory in a trench coat.”
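
If you build your own CUDA kernels, you can see register pressure before it bites: ask the compiler to print per-kernel resource usage. A minimal sketch, assuming the CUDA toolkit is installed; kernel.cu is a placeholder for your own source file.

# Print per-kernel register, shared-memory, and spill usage at compile time.
# kernel.cu is a placeholder; point this at your own source.
nvcc -O3 -Xptxas -v -c kernel.cu -o kernel.o

In the ptxas output, the per-kernel “Used N registers” line (together with shared memory and your block size) is what determines how many warps each SM can keep resident, and any spill stores/loads it reports are exactly the trench-coat memory traffic described above.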

TMUs: texture units (not just for games)

Texture Mapping Units (TMUs) traditionally handle texture address calculation and filtering.
In graphics, they fetch texture data (often with specialized caching) and apply filtering (bilinear, trilinear, anisotropic).
That’s their day job.

In compute-heavy ML, you may never hit TMU limits directly.
But TMUs can still show up indirectly when:

  • You’re doing image pre/post-processing on the GPU.
  • You’re using hardware-accelerated sampling paths in graphics APIs.
  • You have mixed workloads—rendering plus compute—sharing the same chip.

If your workload is texture-fetch heavy, TMUs can bottleneck even with plenty of shader/compute capacity.
In that case “more SMs” won’t save you.

ROPs: the write-back department

ROPs (Raster Operations Pipelines) are responsible for taking pixel data that’s ready to land in a framebuffer and doing the final steps:
blending, depth/stencil tests, and writing to memory.
If you remember one thing: ROPs are about writing pixels out, and writing pixels out is often memory-bandwidth hungry.

ROP bottlenecks show up in graphics workloads with:

  • High resolutions (4K and beyond)
  • Lots of overdraw
  • Heavy blending (transparency, post-processing)
  • MSAA (more samples written)

In compute-only, you might not think about ROPs at all—and that’s fine—until you run GPU compositing, screen capture, remote desktop, or video overlays on the same GPU.
Then you discover the “irrelevant” hardware blocks were actually sharing memory bandwidth and caches with your “important” kernel.

Memory subsystem: the part that ruins your day

The GPU memory subsystem is the combination of HBM/GDDR, memory controllers, L2 cache, and interconnects that move data between the SMs/CUs and VRAM.
It is frequently the real bottleneck, and it doesn’t care about your theoretical TFLOPS.

If your kernels are memory bound, adding compute won’t help.
If you’re bandwidth bound, the winning move is usually one of:
better memory locality, fewer bytes moved, better fusion, lower precision (where safe),
or simply using a GPU with more bandwidth.
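
Before opening a profiler, a back-of-envelope roofline check often settles the question: compare the kernel’s arithmetic intensity (FLOPs per byte of DRAM traffic) with the GPU’s machine balance (peak FLOPS divided by peak bandwidth). Everything below is a placeholder to adapt, not a measurement: plug in your datasheet numbers and your own FLOP/byte estimate.

#!/usr/bin/env bash
# Rough compute-bound vs memory-bound classification for one kernel.
PEAK_TFLOPS=31.2      # placeholder: peak compute from your GPU's datasheet
PEAK_BW_GBS=600       # placeholder: peak memory bandwidth in GB/s
KERNEL_FLOPS=2.0e11   # placeholder: FLOPs the kernel performs per launch
KERNEL_BYTES=4.0e10   # placeholder: DRAM bytes it reads + writes per launch

awk -v f="$KERNEL_FLOPS" -v b="$KERNEL_BYTES" \
    -v pf="$PEAK_TFLOPS" -v pb="$PEAK_BW_GBS" 'BEGIN {
  ai    = f / b                        # arithmetic intensity, FLOP/byte
  ridge = (pf * 1e12) / (pb * 1e9)     # machine balance, FLOP/byte
  printf "arithmetic intensity: %.1f FLOP/byte, machine balance: %.1f FLOP/byte\n", ai, ridge
  if (ai < ridge) print "verdict: likely memory-bound -> move fewer bytes (fusion, locality, precision)"
  else            print "verdict: likely compute-bound -> optimize math, use tensor/matrix units"
}'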

Schedulers, queues, and the “fed vs starving” problem

A GPU can be fully healthy and still underutilized if the host can’t feed it:
CPU thread contention, Python GIL bottlenecks, synchronous transfers, small batches, slow dataloaders, or a driver stuck in serialization.
“GPU is at 20%” is not a GPU problem until you prove it is.
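
Here is a crude way to separate “GPU problem” from “feeding problem” without a profiler: sample GPU utilization next to the busiest CPU core. A sketch that assumes sysstat is installed and English-locale output; if GPU utilization sags while one core sits near 100%, start with the host pipeline.

#!/usr/bin/env bash
# One line per interval: GPU util %, memory util %, and the busiest CPU core %.
while true; do
  gpu=$(nvidia-smi --query-gpu=utilization.gpu,utilization.memory \
        --format=csv,noheader,nounits)
  core=$(mpstat -P ALL 1 1 | awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ {
           busy = 100 - $NF; if (busy > max) max = busy
         } END { printf "%.0f", max }')
  echo "$(date +%T)  gpu_util,mem_util = ${gpu}   busiest_core = ${core}%"
done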

Tensor cores, RT cores, and friends: specialized blocks that change the math

Modern GPUs have specialized execution units (NVIDIA Tensor Cores, AMD matrix units, RT cores).
These are not SMs/CUs, but they live alongside them and can dramatically change performance characteristics.
They are also a common source of confusion: a kernel can be compute-heavy but still not use the fast matrix units due to dtype, layout, or kernel selection.

Joke #1: A GPU is like a restaurant kitchen—if the dishwasher (memory bandwidth) is slow, hiring more chefs (SMs) just creates taller stacks of dirty plates.

Interesting facts and short history (that actually helps)

  1. ROPs predate “GPU compute” as we know it. They were central when graphics pipelines were more fixed-function, and writing pixels efficiently was everything.
  2. Early “shader model” eras made SM/TMU balance a product strategy. Vendors tuned chips for game workloads where texture fetch + arithmetic had specific ratios.
  3. CUDA (2007 era) helped rebrand SMs as general-purpose engines. Suddenly the same silicon that drew triangles was doing linear algebra and simulations.
  4. AMD’s CU terminology came from a similar shift. “Compute Unit” is a name that sounds like it wants to be scheduled by a compiler, not a graphics driver.
  5. Texture caches influenced compute patterns. Some algorithms originally used texture-like access because that path had caching behavior that general loads didn’t match at the time.
  6. Fillrate used to be a headline spec. Pixel fillrate (often tied to ROP count and clock) mattered more when shading was simpler and output bandwidth was a hard wall.
  7. Memory bandwidth has been the long-running arms race. GDDR generations and HBM weren’t vanity upgrades; they addressed a bottleneck that compute improvements kept outrunning.
  8. Asynchronous compute and better scheduling changed “utilization” meaning. A high utilization number can hide queue contention or suboptimal overlap between copy/compute/graphics.
  9. Modern GPUs are multi-tenant by design. Preemption, multi-instance GPU (MIG), and virtualization features exist because production demanded isolation—not because gamers asked nicely.

What to measure and why: the metrics that matter

If you’re responsible for performance and reliability, you need to know which subsystem is saturated:
compute, memory bandwidth, caches, PCIe/NVLink, graphics pipeline, or the host feed loop.
The job is not “get GPU utilization high.” The job is “hit SLOs with predictable cost.”

Compute-bound vs memory-bound: stop guessing

A compute-bound workload scales with:
more SMs/CUs, higher clocks, better instruction mix, better use of tensor/matrix units.
A memory-bound workload scales with:
more bandwidth, better cache hit rate, better coalescing, fewer reads/writes, fusion.

“But my utilization is high” is compatible with both cases.
That’s why you also watch memory throughput and stall reasons in profilers.

ROPs and TMUs: when they matter

If you operate graphics workloads (render farms, VDI, game streaming, CAD), you need per-frame pipeline clarity:
are you limited by shading (SM/CU), texture fetch/filter (TMU), or output write/blend (ROP)?
Symptoms differ:

  • ROP-limited: increasing resolution hurts disproportionately; heavy blending kills; bandwidth climbs.
  • TMU-limited: texture-heavy scenes tank even if ALU utilization isn’t maxed.
  • SM/CU-limited: complex shaders, heavy compute per pixel, ray marching, heavy math.

The operational “gotchas” metrics

  • Power limit / throttling: if the GPU is power capped or thermally throttled, your “expected” numbers are fiction.
  • ECC errors / memory remapping: reliability features can change performance; errors can degrade throughput or trigger retirements.
  • PCIe replay / bandwidth: host-to-device transfers can dominate when batching is small or you’re doing too many copies.
  • Context switching / multi-process contention: multiple jobs can fight for cache, memory bandwidth, and execution time slices.
  • Clock domains: memory clock and SM clock can behave differently under limits.

Practical tasks: commands, outputs, and decisions (12+)

These are “I’m on call and need answers” tasks. Each one includes: a command, what the output means, and the decision you make.
Assume Linux, NVIDIA tooling where applicable, and that you’re allowed to look at the box you pay for.
(AMD has comparable tooling; the logic is the same even when the command differs.)

Task 1: Confirm the GPU model, driver, and basic health

cr0x@server:~$ nvidia-smi
Tue Jan 13 12:44:10 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|  0  NVIDIA A10          On    | 00000000:17:00.0 Off |                    0 |
| 30%   62C    P0   138W / 150W |  19750MiB / 24576MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|=============================================================================|
|    0   N/A  N/A     18342      C   python                           19600MiB|
+-----------------------------------------------------------------------------+

What it means: Confirms driver/CUDA version, GPU model, power usage, temperature, memory usage, utilization, and which process owns VRAM.

Decision: If memory is near full, suspect fragmentation/OOM risk; if power is at cap with Perf state high, check throttling; if GPU-Util high but throughput low, go deeper (memory or stalls).

Task 2: Check for throttling reasons (power/thermal/voltage limits)

cr0x@server:~$ nvidia-smi -q -d PERFORMANCE | sed -n '1,120p'
==============NVSMI LOG==============
Timestamp                           : Tue Jan 13 12:44:17 2026
Driver Version                      : 550.54.14
CUDA Version                        : 12.4

Performance State                   : P0
Clocks Throttle Reasons
    Idle                            : Not Active
    Applications Clocks Setting     : Not Active
    SW Power Cap                    : Active
    HW Slowdown                     : Not Active
    HW Thermal Slowdown             : Not Active
    Sync Boost                      : Not Active
    SW Thermal Slowdown             : Not Active

What it means: “SW Power Cap: Active” means you’re hitting the configured power limit; clocks may be reduced to stay within it.

Decision: If you own the thermals and power envelope, raise the power limit (within spec) or improve cooling. If you don’t, adjust expectations and capacity planning.
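
If the decision is “raise the cap,” the knob looks like this. A sketch rather than a recommendation: the wattage below is an example, the command needs root, and the value must stay within the Min/Max Power Limit the board reports.

# Show the configured and allowed power limits, then raise the cap on GPU 0.
nvidia-smi -q -d POWER | grep -i "power limit"
sudo nvidia-smi -i 0 -pl 165   # example target in watts; keep it within the reported range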

Task 3: Watch utilization, clocks, power, and memory in real time

cr0x@server:~$ nvidia-smi dmon -s puc -d 1
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   149    69     -    78    94     0     0  5001  1710
    0   150    70     -    79    95     0     0  5001  1710
    0   150    70     -    80    95     0     0  5001  1710

What it means: You get a time-series view. If SM% is high but mem% is also high, you may be memory-bound or thrashing caches.

Decision: If mem% is pegged and performance is flat, focus on bytes moved (batching, fusion, precision). If SM% is low, focus on feeding (CPU, dataloader, launch overhead).

Task 4: Check PCIe link width and speed (host feed limits)

cr0x@server:~$ nvidia-smi -q -d PCI | sed -n '1,120p'
PCI
    Bus                             : 0x17
    Device                          : 0x00
    Domain                          : 0x0000
    Bus Id                          : 00000000:17:00.0
    PCIe Generation
        Max                         : 4
        Current                     : 3
    Link Width
        Max                         : 16x
        Current                     : 8x

What it means: The card supports Gen4 x16 but is currently running Gen3 x8. That’s a big cut in host-device bandwidth.

Decision: Reseat, move slots, check BIOS settings, check risers, and verify motherboard lane sharing. If your workload is transfer-heavy, this can be the whole story.
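
A useful cross-check from the PCI side, using the bus ID reported in Task 1 (17:00.0 here); many lspci builds annotate a downgraded link explicitly.

# LnkCap = what the device can negotiate; LnkSta = what the link actually trained at.
sudo lspci -vv -s 17:00.0 | egrep "LnkCap:|LnkSta:"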

Task 5: Observe per-process GPU usage (multi-tenant contention)

cr0x@server:~$ nvidia-smi pmon -c 1
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      18342     C    78    92     0     0   python
    0      27711     C    12     7     0     0   python

What it means: Two compute processes share the GPU. The small one still costs you: context switching, cache pressure, and memory bandwidth contention.

Decision: If you need latency SLOs, isolate workloads (separate GPUs, MIG, scheduling policies). If throughput is the goal, co-location might be OK but measure interference.

Task 6: Check kernel driver logs for GPU resets or Xid errors

cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "NVRM|Xid|gpu|pcie" | tail -n 20
Jan 13 11:52:03 server kernel: NVRM: Xid (PCI:0000:17:00): 31, pid=18342, Ch 0000002b, MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f3a0000
Jan 13 11:52:03 server kernel: NVRM: Xid (PCI:0000:17:00): 31, pid=18342, Ch 0000002b, MMU Fault: Fault type: UNBOUND_INST

What it means: Xid errors often indicate driver, hardware, or application-level faults (invalid memory accesses, bad PCIe, unstable clocks).

Decision: If Xids repeat: quarantine the host/GPU, reproduce under controlled load, update driver/firmware, run stress tests, and consider RMA if hardware suspicion grows.
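
To decide whether this is a one-off or a pattern, count and bucket the Xid events over a longer window. A small sketch; the 24-hour window and any “more than a handful” threshold are judgment calls, not official guidance.

# How many Xid events, and which codes, in the last 24 hours?
sudo journalctl -k --since "24 hours ago" \
  | grep -o "Xid (PCI:[^)]*): [0-9]*" | sort | uniq -c | sort -rn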

Task 7: Confirm CPU isn’t the real bottleneck (feeding the GPU)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (server) 	01/13/26 	_x86_64_	(64 CPU)

12:44:43 PM  CPU   %usr %nice  %sys %iowait %irq %soft %steal %idle
12:44:44 PM  all   92.1  0.0   7.1    0.0   0.0    0.3    0.0   0.5
12:44:44 PM    7  100.0  0.0   0.0    0.0   0.0    0.0    0.0   0.0
12:44:44 PM    8  100.0  0.0   0.0    0.0   0.0    0.0    0.0   0.0

What it means: CPUs are saturated; a couple cores are pinned. If your GPU work is launched from a single-threaded loop, you can starve the device.

Decision: Profile the host pipeline: dataloader threads, preprocessing, serialization, Python overhead. Increase batching, use async pipelines, or move preprocessing onto the GPU.
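
To check whether the GPU-owning process is stuck behind a single thread, look at per-thread CPU usage. A sketch that reuses the PID from Task 1 and assumes the sysstat package is installed.

# Per-thread CPU for the process feeding the GPU (PID 18342 from Task 1).
# One thread pegged near 100% while the rest idle is the classic single-threaded feeder.
pidstat -t -p 18342 1 3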

Task 8: Check NUMA locality (PCIe devices hate cross-socket traffic)

cr0x@server:~$ nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X     0-31            0

What it means: GPU0 is attached to NUMA node 0; best CPU affinity is 0-31. Running the feeder thread on the other socket adds latency and reduces effective PCIe bandwidth.

Decision: Pin CPU threads to the right NUMA node and allocate host memory local to that node for transfers.
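
The fix is usually a launch-time change, not a code change. A sketch, with serve.py standing in for whatever process feeds this GPU:

# Run the feeder on the GPU's NUMA node (node 0 per the topology output above)
# so its threads and host buffers stay local to the PCIe root complex.
numactl --cpunodebind=0 --membind=0 python serve.py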

Task 9: Verify effective PCIe throughput during transfers

cr0x@server:~$ nvidia-smi dmon -s t -d 1
# gpu   rxpci   txpci
# Idx   MB/s    MB/s
    0   5210    4980
    0   5400    5102
    0   5305    5055

What it means: RX/TX PCIe throughput is hovering around 5 GB/s in each direction. If you expected Gen4 x16 behavior, you’re underfeeding the device.

Decision: Reduce transfers (keep data on GPU longer, fuse ops), increase batch size, use pinned memory, and fix PCIe link issues if present.
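
For context, it helps to know roughly what the link should deliver. Approximate per-lane ceilings after encoding overhead (real transfers land below these):

# Approximate PCIe data-rate ceilings: ~0.985 GB/s per lane for Gen3,
# ~1.969 GB/s per lane for Gen4, before protocol and DMA overheads.
awk 'BEGIN {
  printf "Gen3 x8   ceiling ~ %.1f GB/s\n", 0.985 * 8
  printf "Gen3 x16  ceiling ~ %.1f GB/s\n", 0.985 * 16
  printf "Gen4 x16  ceiling ~ %.1f GB/s\n", 1.969 * 16
}'

Sustained ~5 GB/s is consistent with a Gen3 x8 link, nowhere near Gen4 x16 territory.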

Task 10: Check GPU memory info and fragmentation risk

cr0x@server:~$ nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
memory.total [MiB], memory.used [MiB], memory.free [MiB]
24576 MiB, 24010 MiB, 566 MiB

What it means: You’re effectively out of VRAM. Even if you “fit,” allocator fragmentation or a sudden spike can OOM.

Decision: Reduce model/batch, enable memory optimizations (activation checkpointing where appropriate), or isolate this workload from anything else using VRAM.
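
If this host is prone to creeping VRAM usage, a tiny headroom check in the monitoring loop beats discovering OOM at 3 a.m. A sketch; the 512 MiB threshold is arbitrary.

# Warn when free VRAM on GPU 0 drops below a threshold (512 MiB here, arbitrary).
free_mib=$(nvidia-smi -i 0 --query-gpu=memory.free --format=csv,noheader,nounits)
[ "$free_mib" -lt 512 ] && echo "WARN: only ${free_mib} MiB VRAM free on GPU 0"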

Task 11: Identify whether the GPU is doing graphics/display work (ROPs involved)

cr0x@server:~$ nvidia-smi --query-gpu=display_active,display_mode --format=csv
display_active, display_mode
Disabled, Disabled

What it means: No display is attached and display mode is off. Good: fewer surprises from compositing/desktop effects.

Decision: If display is active on a production compute GPU, consider headless configuration or a separate GPU for display/VDI. Mixed use is how you get “mystery jitter.”

Task 12: Check GPU clock settings (someone “optimized” you into a corner)

cr0x@server:~$ nvidia-smi -q -d CLOCK | sed -n '1,120p'
Clocks
    Graphics                        : 1710 MHz
    SM                              : 1710 MHz
    Memory                          : 5001 MHz
Applications Clocks
    Graphics                        : 1410 MHz
    Memory                          : 5001 MHz

What it means: Application clocks are set lower than peak graphics/SM clocks. That can cap performance even when thermals are fine.

Decision: If this is unintended, reset application clocks or set them explicitly to match your performance/power strategy.
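
Undoing it is straightforward. Both commands need root, and any explicit pair has to be one the GPU actually supports (see nvidia-smi -q -d SUPPORTED_CLOCKS).

# Reset application clocks on GPU 0 to driver defaults...
sudo nvidia-smi -i 0 -rac
# ...or pin them explicitly as <memory>,<graphics> in MHz.
sudo nvidia-smi -i 0 -ac 5001,1710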

Task 13: Verify compute mode and persistence mode (multi-user behavior)

cr0x@server:~$ nvidia-smi --query-gpu=compute_mode,persistence_mode --format=csv
compute_mode, persistence_mode
Default, Enabled

What it means: Default compute mode allows multiple processes; persistence enabled avoids repeated init overhead and can reduce first-job latency.

Decision: For strict isolation, consider Exclusive Process mode (where appropriate). Keep persistence on for stable behavior in production.
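
If you do want strict isolation, it is two commands. Both require root, and the compute mode does not survive a reboot unless you reapply it at boot.

# One compute context at a time on GPU 0, with persistence kept on.
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-smi -i 0 -pm 1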

Task 14: Confirm hugepages/IOMMU aren’t sabotaging DMA (host-level checks)

cr0x@server:~$ dmesg | egrep -i "iommu|DMAR|vfio" | tail -n 10
[    0.612345] DMAR: IOMMU enabled
[    0.612900] DMAR: Intel(R) Virtualization Technology for Directed I/O

What it means: IOMMU is enabled. This can be fine, but misconfigurations can cause overhead or mapping limits in certain setups.

Decision: If you see unusual transfer latency or mapping errors, validate IOMMU settings for your environment (especially with passthrough/virtualization).

Task 15: Quick-and-dirty “is this memory-bound?” check using observed throughput

cr0x@server:~$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
utilization.gpu [%], utilization.memory [%]
91 %, 96 %
92 %, 97 %
90 %, 96 %

What it means: High memory utilization alongside high GPU utilization often points to memory-bound kernels or heavy memory traffic (including spills).

Decision: Move to profiling: check memory throughput, cache hit rates, and kernel-level stall reasons. Don’t waste time tuning arithmetic first.

Fast diagnosis playbook: find the bottleneck in minutes

This is the “don’t get lost in profiler screenshots” path. It’s optimized for production triage.
The goal: identify whether you’re limited by feed, compute, memory bandwidth, graphics pipeline (TMU/ROP), or throttling/faults.

First: sanity and health (30–60 seconds)

  • Run nvidia-smi. Check: errors, temps, power, memory usage, active processes.
  • Check kernel logs for Xid/reset events (journalctl -k).
  • Check throttling reasons (nvidia-smi -q -d PERFORMANCE).

If you find resets, ECC spikes, or repeated Xids, stop chasing performance. You’re debugging reliability now.
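
In script form, the whole first pass is three commands. A minimal sketch you can paste into a runbook:

# 60-second sanity pass, in the order above. Stop at the first thing that looks wrong.
nvidia-smi
nvidia-smi -q -d PERFORMANCE | grep -A 10 "Clocks Throttle Reasons"
sudo journalctl -k --since "2 hours ago" | egrep -i "NVRM|Xid" | tail -n 20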

Second: is the GPU being fed? (2–3 minutes)

  • Look at CPU saturation (mpstat / top).
  • Check PCIe link state (Gen/width) and PCIe throughput (nvidia-smi -q -d PCI, nvidia-smi dmon -s t).
  • Check NUMA affinity (nvidia-smi topo -m).

If CPU is pinned or PCIe is downgraded (Gen3 x8 surprise), you have a platform problem, not a kernel problem.

Third: classify the bottleneck by behavior (5–10 minutes)

  • Compute-bound: SM busy high, memory util moderate; scaling with clocks; better with tensor cores enabled.
  • Memory-bound: memory util high, performance flat with higher SM clocks, sensitive to batch size and fusion.
  • TMU/texture-bound (graphics): texture-heavy scenes tank; sampling/filtering changes impact performance more than shader math.
  • ROP/output-bound (graphics): resolution and blending hit hard; output writes dominate; MSAA hurts more than expected.

Your next tool depends on the classification. If you’re compute-bound, optimize kernels and math. If you’re memory-bound, reduce traffic and improve locality.
If you’re ROP/TMU-bound, adjust rendering pipeline settings or choose hardware with different balance.

Common mistakes: symptom → root cause → fix

1) “GPU utilization is high but throughput is low”

Symptom: GPU-Util 90%+, but frames/sec or inferences/sec are disappointing.

Root cause: Memory-bound kernels, cache thrash, register spills, or contention from other processes.

Fix: Confirm memory utilization and bandwidth behavior; profile for stalls/spills; fuse kernels; reduce precision where safe; isolate workloads.

2) “GPU utilization is low but latency is high”

Symptom: GPU-Util 10–30%, yet p99 latency is bad.

Root cause: Host feed bottleneck (CPU preprocessing, small batches, sync copies), or kernel launch overhead dominating.

Fix: Increase batch size, pipeline CPU/GPU work, use asynchronous transfers, pin CPU threads to NUMA-local cores, and reduce per-request GPU launches.

3) “Performance varies wildly between identical servers”

Symptom: Same GPU model, one server is slower by 20–40%.

Root cause: PCIe link training at reduced Gen/width, different power limits, thermal conditions, or BIOS settings.

Fix: Compare nvidia-smi -q -d PCI and throttling reasons; standardize firmware/BIOS, validate cooling, fix lane allocation.

4) “Rendering tanks when enabling transparency or higher resolution”

Symptom: A scene looks fine until you add transparency/post-processing; higher resolution craters FPS.

Root cause: ROP/output and bandwidth bound; lots of blending and overdraw increases write traffic.

Fix: Reduce overdraw, optimize blending passes, consider lower precision buffers where acceptable, and watch memory bandwidth. Hardware with more ROPs/bandwidth helps.

5) “Texture-heavy scenes are slow even though shaders aren’t complex”

Symptom: Simple math, but performance collapses with high-res textures and filtering.

Root cause: TMU/texture fetch bound; texture cache misses; anisotropic filtering cost.

Fix: Tune texture LODs, reduce filtering levels, compress textures appropriately, reduce random sampling, and choose GPU SKUs with stronger texture performance if needed.

6) “After a driver update, everything is slower”

Symptom: Same code, lower throughput, possibly different power behavior.

Root cause: Different default clocks/power management, changed kernel selection, or new safety mitigations.

Fix: Re-check application clocks and power caps; validate with controlled benchmarks; pin known-good versions when you need deterministic performance.

7) “We bought the bigger GPU and got little improvement”

Symptom: More SMs/TFLOPS, but only small speedup.

Root cause: You were memory or PCIe bound; you bought compute but needed bandwidth or better data pipeline.

Fix: Measure transfer and memory throughput; pick GPUs with more bandwidth (HBM), faster interconnect (NVLink), or redesign the data path.

Joke #2: The easiest way to double GPU performance is to stop sending it the same tensor three times—sadly, this is more common than it should be.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption (ROPs don’t matter… until they do)

A company ran a GPU fleet for real-time video analytics: decode, infer, overlay boxes, and re-encode.
They treated it as “mostly compute,” so the capacity model was built around SM utilization and VRAM.
It looked clean. It was also wrong.

During a product launch, overlays got more complex: more labels, alpha-blended UI elements, and some extra post-processing.
Compute graphs stayed comfortable. But customer streams started dropping frames, and p95 latency climbed.
Engineers chased kernels, adjusted batch sizes, even rolled back a model. Nothing stuck.

The real culprit was output composition: blending and writing frames back out was hammering the memory subsystem and the final pixel stages.
In other words, the pipeline got more ROP-ish and bandwidth-heavy.
The GPU was “busy,” but not in the way the model assumed.

Fixing it was refreshingly unromantic:
they reduced overlay overdraw, changed composition strategy to minimize blending passes, and split responsibilities—compute on one GPU, compositor/encode on another on the busiest nodes.
After that, the system went back to predictable scaling. The lesson wasn’t “ROPs are important.” The lesson was “assumptions are expensive.”

Mini-story 2: The optimization that backfired (chasing occupancy into a ditch)

Another team had a CUDA kernel that was slower than expected. Someone found a blog post saying “maximize occupancy.”
They refactored the kernel to use fewer registers and increased occupancy. The profiler screenshot looked nicer.
Production throughput got worse.

The change reduced register usage, but it also increased global memory traffic by recomputing values and loading intermediates.
The kernel shifted from “moderately compute-bound” to “painfully memory-bound.”
Occupancy rose. Performance fell. The graphs lied because they weren’t measuring the right limit.

The eventual fix was to reverse the “occupancy at all costs” approach:
accept lower occupancy, keep critical values in registers, and restructure memory access to be more coalesced.
They also fused two adjacent kernels to avoid writing an intermediate tensor out to VRAM.
Occupancy looked worse; throughput improved materially.

The operational follow-up was even more valuable: they added a performance gate to CI that ran representative kernels and compared memory throughput and runtime distributions.
The next “clever” change got caught before it hit production.

Mini-story 3: The boring but correct practice that saved the day (standardizing PCIe and power)

A platform team ran mixed GPU servers sourced across multiple procurement waves.
Same GPU model on paper, but different motherboards, BIOS versions, and chassis airflow.
They had an unglamorous rule: every node must pass a “lane and clocks” acceptance test before joining the cluster.

The test was dull: check PCIe Gen/width, run a short transfer benchmark, confirm power limit, confirm clocks under sustained load, confirm no Xid errors.
It was the kind of thing people call bureaucracy until they’re the ones on call at 3 a.m.

One weekend, a batch of nodes arrived with a riser configuration that quietly trained the GPU at a reduced link width.
Without the gate, those nodes would have been added to the pool and caused random slow jobs, intermittent timeouts, and endless performance ticket churn.
Instead, the gate failed them immediately. They were fixed before customers ever saw them.

The win wasn’t “we caught a bug.” The win was preventing an entire class of flaky behavior from ever entering the system.
Production loves boring, repeatable checks.

Checklists / step-by-step plan

Checklist A: When someone says “the GPU is slow”

  1. Identify the workload type: compute-only, graphics-only, or mixed (decode/encode/display/compositor included).
  2. Verify GPU health: nvidia-smi, driver versions, temps, power, errors.
  3. Check throttling reasons: power cap, thermal slowdown, application clocks.
  4. Confirm platform link: PCIe Gen/width, NUMA locality, PCIe throughput while under load.
  5. Check multi-tenant contention: per-process usage; confirm scheduling/isolation assumptions.
  6. Classify bottleneck: compute-bound vs memory-bound vs feed-bound; for graphics consider TMU/ROP behavior.
  7. Only then open the profiler—and go in with a hypothesis.

Checklist B: Before buying hardware or resizing a cluster

  1. Measure current bottleneck (not just utilization).
  2. If memory-bound, prioritize bandwidth/cache improvements and software changes over more compute.
  3. If transfer-bound, prioritize interconnect and data locality (PCIe Gen, NVLink, NUMA pinning).
  4. If ROP/TMU-bound (graphics), pick SKUs with appropriate balance; don’t buy compute you can’t feed or write out.
  5. Plan for isolation: MIG, dedicated GPUs per latency-sensitive service, or at least cgroup-level and scheduler enforcement.
  6. Standardize acceptance tests: lane width, clocks under load, power cap settings, error logs clean.
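
Item 6 is worth automating on day one. A minimal sketch of such a gate, covering link training, power cap, and kernel-log cleanliness; a sustained-load clock check and a short transfer benchmark are worth adding per site, and every threshold below is an example, not a standard.

#!/usr/bin/env bash
# Minimal GPU node acceptance gate (sketch). Exits non-zero on the first failure.
set -u
fail() { echo "ACCEPTANCE FAIL: $1"; exit 1; }
q() { nvidia-smi -i 0 --query-gpu="$1" --format=csv,noheader,nounits; }

# 1. PCIe link trained at full generation and width?
#    Run this under light load: idle GPUs may report a lower current generation.
gen_cur=$(q pcie.link.gen.current);  gen_max=$(q pcie.link.gen.max)
w_cur=$(q pcie.link.width.current);  w_max=$(q pcie.link.width.max)
[ "$gen_cur" -eq "$gen_max" ] && [ "$w_cur" -eq "$w_max" ] \
  || fail "link is Gen${gen_cur} x${w_cur}, expected Gen${gen_max} x${w_max}"

# 2. Power limit matches the fleet standard (150 W is an example value)?
pl=$(q power.limit)
awk -v p="$pl" 'BEGIN { exit (p >= 150) ? 0 : 1 }' || fail "power limit is ${pl} W"

# 3. Kernel log clean of Xid errors since boot?
if sudo journalctl -k -b | grep -q "NVRM: Xid"; then fail "Xid errors in kernel log"; fi

echo "ACCEPTANCE PASS"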

Checklist C: Safe performance tuning loop (the one that doesn’t create incidents)

  1. Pick one representative workload. Not a microbenchmark unless that microbenchmark is your production reality.
  2. Record baselines: throughput, latency distribution, power, temps, memory use, PCIe throughput.
  3. Change one thing. One.
  4. Re-measure and compare distributions, not just means.
  5. Roll forward only if you understand why it improved and you can explain the new failure modes.
  6. Add a regression gate if the change matters (CI job, canary, or nightly perf run).

FAQ

1) What’s the simplest definition of SMs and CUs?

They are the main execution blocks that run parallel threads for shaders and compute kernels.
NVIDIA calls them SMs; AMD calls them CUs. Different internals, similar role: do the math and hide latency with lots of parallel work.

2) Are “CUDA cores” the same as SMs?

No. “CUDA cores” is a marketing-friendly count of scalar ALU lanes across all SMs.
An SM is the unit that schedules and manages work; it contains multiple execution units, registers, and shared memory.

3) What do TMUs actually do, and why should I care?

TMUs fetch texture data and apply filtering efficiently. In graphics, TMU limits are real and visible.
In compute, they matter when your workload uses texture sampling paths or when mixed workloads cause contention for memory/cache.

4) What do ROPs do, and when are they a bottleneck?

ROPs handle final pixel operations: blending, depth/stencil, and writing pixels to the framebuffer.
They bottleneck when output writes dominate: high resolution, heavy blending, lots of overdraw, MSAA, or bandwidth-constrained scenarios.

5) Why does increasing resolution hurt performance so much sometimes?

More pixels means more work in the later pipeline stages and more memory traffic.
If you’re ROP/bandwidth limited, doubling pixel count can crater FPS even if compute isn’t maxed.

6) Why is GPU utilization low even though my program is “using the GPU”?

Common reasons: CPU preprocessing bottlenecks, small batch sizes, synchronous host-device copies, kernel launch overhead, or waiting on I/O.
Low utilization often means the GPU is idle waiting for work.

7) Is high occupancy always good?

No. Occupancy is a tool to hide latency, not a trophy.
For some kernels, higher occupancy increases memory traffic or forces spills. The win condition is throughput and latency under real load, not a single metric.

8) How do ROPs/TMUs relate to ML inference?

Usually they don’t—until you run pre/post-processing, composition, or video workloads on the same GPU.
Also, anything that increases memory bandwidth pressure can indirectly harm ML throughput, even if ROPs/TMUs aren’t directly used.

9) What’s the quickest way to tell if I’m memory-bound?

Watch memory utilization and throughput under load, then change something that should increase compute (like SM clocks or a faster compute SKU).
If performance barely changes, you’re likely memory-bound. Then confirm with profiling of memory stalls/bandwidth.
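
One controlled way to run that experiment on recent drivers: lock the SM clock lower, measure, then reset. A sketch; the clock values are examples, the flags need root, and not every GPU/driver combination exposes them.

# Lock GPU (SM) clocks to a lower range, run the representative workload, then reset.
sudo nvidia-smi -i 0 -lgc 1200,1200   # example min,max in MHz
# ... run your benchmark and record throughput here ...
sudo nvidia-smi -i 0 -rgc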

10) Should I share GPUs between services in production?

Only if you can tolerate interference and you measure it.
For latency-sensitive workloads, isolate with dedicated GPUs or MIG/partitioning. “It worked in staging” is not a scheduling policy.

Conclusion: practical next steps

ROPs, TMUs, SMs, and CUs aren’t trivia. They’re a map of where your work can get stuck.
When you diagnose by pipeline stage—feed, compute, memory, output—you stop guessing and start fixing.

Next steps that pay back immediately:

  1. Add a lightweight GPU node acceptance test: PCIe Gen/width, sustained clocks under load, no Xids, stable power/temps.
  2. Instrument your services with throughput and latency distributions alongside GPU/CPU/PCIe metrics. Averages are cute. p99 pays your salary.
  3. When performance regresses, run the fast diagnosis playbook before touching code. Half of “GPU issues” are platform and pipeline issues.
  4. For graphics/mixed workloads, explicitly consider TMU/ROP and bandwidth constraints. If you don’t model output cost, it will model you.