Docker GPU in Containers: Why It Fails and How to Fix It

You ship a container that “definitely works on my laptop,” deploy it to a GPU host, and your model falls back to CPU like it’s doing penance.
The logs say “no CUDA-capable device detected,” nvidia-smi is missing, or PyTorch politely informs you that CUDA is unavailable.
Meanwhile, the GPU is sitting there, bored, expensive, and billing by the hour.

GPU-in-container failures are rarely mystical. They’re almost always a small set of mismatches: driver vs. toolkit, runtime vs. Docker config,
device nodes vs. permissions, or a scheduler and security layer doing exactly what you asked—just not what you meant.

A practical mental model: what “GPU in a container” actually means

Containers do not virtualize hardware. They isolate processes via namespaces and control resource access via cgroups.
Your GPU is still the host’s GPU. The kernel driver still lives on the host. The device nodes in /dev still originate on the host.
The container is basically a process with a different view of the filesystem, network, and PID space.

For NVIDIA GPUs, the critical chain looks like this:

  1. Host driver: the kernel module(s) and user-space driver libraries on the host. If this is broken, no container magic will save you.
  2. Device nodes: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, etc. Without these, user-space can’t talk to the driver.
  3. Container runtime integration: something needs to mount the right devices and inject the right libraries into the container. That’s what NVIDIA Container Toolkit does.
  4. User-space CUDA stack: your container image might include CUDA libraries, cuDNN, and frameworks (PyTorch, TensorFlow). These must be compatible with the host driver.
  5. Permissions and policy: rootless Docker, SELinux/AppArmor, cgroup device policy, and Kubernetes security contexts can block access even when everything else is correct.

If any link is missing, you’ll get the usual symptoms: CUDA unavailable, libcuda.so not found, nvidia-smi failing,
or applications silently falling back to CPU.

One operational truth: the GPU path is not “set it and forget it.” It’s “set it, then pin versions, then monitor drift.”
Driver updates, kernel updates, Docker upgrades, and cgroup mode changes are the four horsemen.

Interesting facts and short history that explains today’s mess

  • Fact 1: Early “GPU in containers” approaches used brittle --device flags and hand-mounted libraries, because the runtime had no standardized GPU hook.
  • Fact 2: NVIDIA’s container story matured when the ecosystem moved toward runtime hooks (OCI) that can inject mounts/devices at container start.
  • Fact 3: CUDA’s user-space libraries can live inside the container, but the kernel driver cannot; the driver must match the host kernel and is inherently host-managed.
  • Fact 4: The “CUDA version” you see in nvidia-smi is not the same thing as the CUDA toolkit installed in your container; it reflects driver capability, not your image contents.
  • Fact 5: The shift from cgroup v1 to cgroup v2 changed how device access and delegation behave, and it broke working GPU setups in subtle ways on upgrades.
  • Fact 6: Kubernetes GPU scheduling became mainstream only after device plugins standardized how to advertise and allocate GPUs; before that, it was the wild west of privileged pods.
  • Fact 7: Multi-instance GPU (MIG) introduced a new unit of allocation—GPU slices—which made “which GPU did my container get?” a non-trivial question.
  • Fact 8: “Rootless containers” are great for security, but GPUs are not naturally rootless-friendly because device node permissions are a hard wall, not a suggestion.

Fast diagnosis playbook (check these first)

When production is on fire, you don’t need a PhD in NVIDIA packaging. You need a short sequence that narrows the failure domain quickly.
Start with the host, then runtime, then image.

First: prove the host GPU works (no containers yet)

  1. Run nvidia-smi on the host. If it fails, stop. Fix the host driver/kernel situation first.
  2. Confirm device nodes exist: ls -l /dev/nvidia*. If they’re missing, the driver isn’t loaded or udev didn’t create nodes.
  3. Check kernel module state: lsmod | grep nvidia and dmesg for GPU/driver errors.

Second: prove Docker can wire the GPU into a container

  1. Run a minimal CUDA base image with --gpus all and execute nvidia-smi inside. If this fails, it’s runtime/toolkit, not your app.
  2. Inspect Docker’s runtime configuration: ensure the NVIDIA runtime hook is installed and selected when needed.

Third: prove your app image is compatible

  1. Inside your application container, check for libcuda.so visibility and the framework’s CUDA build.
  2. Validate driver/toolkit compatibility: driver too old for the container’s CUDA, or container expecting libraries that aren’t there.
  3. Check permissions/policy: rootless mode, SELinux/AppArmor, and cgroup device restrictions.

The culprit usually surfaces by the second stage (the minimal container test). If you’re still guessing after that, you’re probably debugging three problems at once.
Don’t. Reduce scope until one thing fails at a time.

Hands-on tasks: commands, expected output, and decisions

These are the tasks I actually run when a GPU container isn’t behaving. Each includes: the command, what the output means, and what decision to make.
Run them in order if you want a clean binary search through the stack.

Task 1: Host sanity check with nvidia-smi

cr0x@server:~$ nvidia-smi
Thu Jan  3 10:14:22 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|  0  NVIDIA A10           On   | 00000000:17:00.0 Off |                  Off |
|  0%   39C    P0    62W / 150W |    512MiB / 23028MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+

Meaning: Driver is loaded and talking to the GPU. The “CUDA Version” is what the driver supports, not what your container ships.

Decision: If this fails, fix the host first: driver install, kernel headers, secure boot, DKMS rebuild, or hardware issues.
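
If the host check fails, two quick follow-ups narrow it down before you reinstall anything. Both commands are standard; the output below is illustrative, assuming a DKMS-built driver on a Secure Boot machine:

cr0x@server:~$ dkms status
nvidia/550.54.14, 6.5.0-15-generic, x86_64: installed
cr0x@server:~$ mokutil --sb-state
SecureBoot enabled

A DKMS entry that is “added” or “built” but not “installed” points at a failed module build after a kernel update; Secure Boot enabled plus an unsigned module means the kernel will refuse to load it.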

Task 2: Verify device nodes exist

cr0x@server:~$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jan  3 10:10 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan  3 10:10 /dev/nvidiactl
crw-rw-rw- 1 root root 235,   0 Jan  3 10:10 /dev/nvidia-uvm
crw-rw-rw- 1 root root 235,   1 Jan  3 10:10 /dev/nvidia-uvm-tools

Meaning: Character devices exist; user-space can talk to the kernel driver.

Decision: If missing, load modules (modprobe nvidia) and investigate why udev didn’t create nodes (or why the driver failed to load).
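
A minimal recovery sketch, assuming the driver packages are installed and only the modules/nodes are missing (on many systems, running nvidia-smi once as root also triggers device node creation):

cr0x@server:~$ sudo modprobe nvidia && sudo modprobe nvidia_uvm
cr0x@server:~$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jan  3 10:12 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan  3 10:12 /dev/nvidiactl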

Task 3: Check kernel modules are loaded

cr0x@server:~$ lsmod | grep -E 'nvidia|nouveau'
nvidia_uvm           1556480  0
nvidia_drm             94208  2
nvidia_modeset       1564672  1 nvidia_drm
nvidia              62480384  80 nvidia_uvm,nvidia_modeset

Meaning: Proprietary NVIDIA modules loaded; no nouveau shown, which is usually good for CUDA compute nodes.

Decision: If nouveau is loaded on a compute host, expect pain. Blacklist it and rebuild initramfs per your OS policy.
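
A hedged example of the blacklist on a Debian/Ubuntu host (the file name is arbitrary; RPM-based systems rebuild the initramfs with dracut instead):

cr0x@server:~$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
cr0x@server:~$ sudo update-initramfs -u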

Task 4: Look for driver errors in dmesg

cr0x@server:~$ dmesg -T | tail -n 12
[Thu Jan  3 10:10:05 2026] nvidia: loading out-of-tree module taints kernel.
[Thu Jan  3 10:10:05 2026] nvidia: module license 'NVIDIA' taints kernel.
[Thu Jan  3 10:10:06 2026] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[Thu Jan  3 10:10:06 2026] nvidia 0000:17:00.0: enabling device (0000 -> 0003)
[Thu Jan  3 10:10:07 2026] nvidia_uvm: Loaded the UVM driver, major device number 235

Meaning: Normal module load messages. You’re hunting for failures: “tainted” isn’t the problem; “RmInitAdapter failed” is.

Decision: If you see initialization failures, stop chasing Docker. Fix driver/kernel/hardware first.

Task 5: Confirm NVIDIA Container Toolkit is installed (host)

cr0x@server:~$ dpkg -l | grep -E 'nvidia-container-toolkit|nvidia-container-runtime'
ii  nvidia-container-toolkit   1.15.0-1   amd64   NVIDIA Container toolkit
ii  nvidia-container-runtime   3.14.0-1   amd64   NVIDIA container runtime

Meaning: Toolkit/runtime packages are present (Debian/Ubuntu style). On RPM systems you’d use rpm -qa.

Decision: If absent, install them. If present but ancient, upgrade—GPU enablement isn’t a “set-and-never-update” component.
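
A minimal install sketch for Debian/Ubuntu, assuming NVIDIA’s container toolkit repository is already configured for your distribution (on RPM systems, dnf install nvidia-container-toolkit):

cr0x@server:~$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit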

Task 6: Check Docker sees the runtime

cr0x@server:~$ docker info | sed -n '/Runtimes/,+3p'
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc

Meaning: Docker knows about the nvidia runtime. Default is still runc, which is fine if you use --gpus.

Decision: If nvidia runtime is missing, the toolkit isn’t wired into Docker (or Docker needs a restart).
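
Wiring the runtime into Docker is a one-liner with the toolkit’s CLI, followed by a daemon restart; a sketch, assuming nvidia-ctk is installed on the host:

cr0x@server:~$ sudo nvidia-ctk runtime configure --runtime=docker
cr0x@server:~$ sudo systemctl restart docker
cr0x@server:~$ docker info | sed -n '/Runtimes/,+1p'
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc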

Task 7: Minimal container test with GPU

cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Thu Jan  3 10:16:01 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|===============================+======================+======================|
|  0  NVIDIA A10           On   | 00000000:17:00.0 Off |                  Off |
+-------------------------------+----------------------+----------------------+

Meaning: The runtime successfully injected GPU devices and driver libraries; container can query the GPU.

Decision: If this fails, your app image is irrelevant. Fix toolkit/runtime, policy, or device permissions.

Task 8: Inspect which GPU devices were injected

cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 bash -lc 'ls -l /dev/nvidia*'
crw-rw-rw- 1 root root 195,   0 Jan  3 10:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan  3 10:16 /dev/nvidiactl
crw-rw-rw- 1 root root 235,   0 Jan  3 10:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 235,   1 Jan  3 10:16 /dev/nvidia-uvm-tools

Meaning: The container sees the same class of character devices as the host.

Decision: If devices aren’t present inside container, you’re not actually using the GPU runtime path (--gpus missing, runtime misconfigured, or policy blocking).

Task 9: Validate the injected driver libraries (libcuda)

cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 bash -lc 'ldconfig -p | grep -E "libcuda\.so|libnvidia-ml\.so" | head'
	libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

Meaning: The container can resolve key driver-facing libraries.

Decision: If libcuda.so is missing inside container, the toolkit injection is broken or you’re in a nonstandard base image that confuses the runtime.

Task 10: Check what Docker thinks your container requested

cr0x@server:~$ docker inspect --format '{{json .HostConfig.DeviceRequests}}' gpu-test
[{"Driver":"","Count":-1,"DeviceIDs":null,"Capabilities":[["gpu"]],"Options":{}}]

Meaning: The container asked for GPUs through the modern device request interface (--gpus all shows Count:-1).

Decision: If this is empty for a workload you thought was GPU-enabled, you’ve found the bug: your run/compose spec didn’t request a GPU.
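
For Compose users, the usual culprit is a missing device reservation. A hedged example (the service name is a placeholder; myapp:latest is the app image used elsewhere in this article):

cr0x@server:~$ cat docker-compose.yml
services:
  inference:
    image: myapp:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]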

Task 11: Diagnose cgroup v2 and delegation

cr0x@server:~$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

Meaning: You’re on cgroup v2. Most modern distros are. Some older GPU setups assume v1 behaviors.

Decision: If GPU access is flaky after an OS upgrade, re-validate NVIDIA Container Toolkit version and your container runtime stack on cgroup v2.
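
On many installs, the toolkit’s behavior around cgroups is controlled in its config file; worth a quick look after upgrades (path and defaults may differ by toolkit version):

cr0x@server:~$ grep no-cgroups /etc/nvidia-container-runtime/config.toml
#no-cgroups = false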

Task 12: Check AppArmor status (common silent blocker)

cr0x@server:~$ aa-status | sed -n '1,8p'
apparmor module is loaded.
54 profiles are loaded.
51 profiles are in enforce mode.
   docker-default
   /usr/bin/man
   /usr/sbin/sshd

Meaning: AppArmor is active. The default Docker profile is usually fine, but custom profiles can block device access.

Decision: If a hardened profile is in place, confirm it allows required /dev/nvidia* device access or test with a known-good profile.
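
A quick diagnostic, not a fix: rerun the minimal test with AppArmor confinement disabled for that one container. If it suddenly works, the profile is your problem:

cr0x@server:~$ docker run --rm --gpus all --security-opt apparmor=unconfined nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-...)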

Task 13: Check SELinux mode (if applicable)

cr0x@server:~$ getenforce
Enforcing

Meaning: SELinux is enforcing. This is not a problem by itself; it’s a policy problem if misconfigured.

Decision: If GPU works in permissive but not enforcing, you need correct SELinux labels/policy for device nodes and container runtime.
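
Same trick as with AppArmor: disable SELinux labeling for a single test container to confirm the hypothesis, then go write the real policy instead of leaving it that way:

cr0x@server:~$ docker run --rm --gpus all --security-opt label=disable nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-...)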

Task 14: Rootless Docker reality check

cr0x@server:~$ docker info | grep -E 'Rootless|Security Options'
 Rootless: true
 Security Options:
  seccomp
  rootless

Meaning: You’re running rootless Docker. This is great for security, but GPU device access is often blocked because the user can’t open the device nodes.

Decision: If rootless is required, plan for a supported approach (often: don’t do GPU via rootless on shared hosts unless you control device permissions carefully).

Task 15: Verify inside your app container that the framework sees CUDA

cr0x@server:~$ docker run --rm --gpus all myapp:latest bash -lc 'python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.version.cuda)"'
2.2.1
True
12.1

Meaning: Your framework build supports CUDA and can initialize it.

Decision: If is_available() is false but nvidia-smi works, suspect missing CUDA user-space libs in the image, wrong wheel (CPU-only), or incompatible libc/glibc.
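
A quick way to spot a CPU-only PyTorch wheel, assuming pip metadata is available in the image (CPU builds from the PyTorch index typically carry a +cpu version suffix, and torch.version.cuda prints None):

cr0x@server:~$ docker run --rm myapp:latest bash -lc 'pip list 2>/dev/null | grep -i "^torch "'
torch              2.2.1+cpu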

Task 16: Catch the classic “driver too old” mismatch

cr0x@server:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 bash -lc 'cuda-samples-deviceQuery 2>/dev/null || true'
bash: cuda-samples-deviceQuery: command not found

Meaning: Base images don’t include samples; that’s fine. The point is to remember that “CUDA image” doesn’t imply “toolkit and tools installed.”

Decision: If you need compile tools, use a devel image. Don’t install build toolchains into runtime images unless you enjoy slow CI and surprise CVEs.

Why it fails: the big failure modes in production

1) Host driver is broken (and you’re debugging Docker anyway)

Containers are a convenient scapegoat. But if the host driver can’t initialize the GPU, you’re not “one flag away” from success.
Common causes: kernel update without rebuilding DKMS modules, Secure Boot blocking unsigned modules, or a partial driver upgrade that left user-space and kernel-space out of sync.

The tell: nvidia-smi fails on the host. Everything else is noise.

2) NVIDIA Container Toolkit isn’t installed or isn’t wired into Docker

Docker needs an OCI hook that adds device nodes, mounts driver libraries, and sets environment variables for capabilities.
Without the toolkit, --gpus all either does nothing or fails loudly, depending on versions.

If you’re using containerd directly, the integration point shifts. If you’re using Kubernetes, the GPU device plugin is another layer where things can go wrong.
Same idea, more moving parts.

3) You requested “GPU” in your head, not in your spec

You’d think this couldn’t happen. It happens constantly. Compose files missing the right stanza, Helm charts not setting resource limits,
or a CI job that runs with docker run but without --gpus.

The worst version is silent fallback to CPU. Your service stays “healthy,” latency creeps, costs rise, and nobody notices until the quarterly performance review.

Joke 1: A GPU that isn’t allocated to your container is just a very expensive space heater—except it won’t even warm your hands because it’s idle.

4) Driver capability vs. container CUDA toolkit mismatch

The driver defines which CUDA runtime features can be supported. The container defines which user-space CUDA libraries your app links against.
If you ship a container built against CUDA 12.x and deploy onto a host driver that only supports older CUDA, initialization can fail.

In practice, modern stacks are more forgiving than they used to be, but “forgiving” is not a plan.
The correct plan is: pin driver versions on the host and pin CUDA base image versions in your build.
Treat compatibility as an interface contract, not a vibe.

5) Rootless containers and device node permissions

Device nodes are protected by Unix permissions and sometimes by additional security layers.
Rootless Docker runs containers as a non-root user, which is great until you need access to /dev/nvidia0 and friends.

You can sometimes solve this by adjusting group ownership (e.g., video), udev rules, or using a privileged helper.
But be honest: on multi-tenant systems, GPU + rootless is often a policy fight wearing a technical hat.
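
One concrete, hedged option is to let the NVIDIA kernel module set device-file ownership itself via module parameters instead of udev rules; the values below are assumptions (GID 44 is the video group on Debian/Ubuntu), so adjust per distro and reload the module for them to take effect:

cr0x@server:~$ cat /etc/modprobe.d/nvidia-device-perms.conf
options nvidia NVreg_DeviceFileGID=44 NVreg_DeviceFileMode=0660 NVreg_ModifyDeviceFiles=1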

6) Security policy blocks GPU access (SELinux/AppArmor/seccomp)

Security systems don’t “break” things. They enforce rules. If you didn’t declare GPU devices as allowed, access is denied.
This shows up as permission errors, or as frameworks failing to initialize with cryptic messages.

In enterprises, the fix is often not “disable SELinux,” it’s “write the policy that allows the minimum required device access.”
Disabling security controls because your ML job is late is how you end up with a different kind of incident.

7) Kubernetes: GPU allocation and device plugin mismatch

In Kubernetes, the container runtime is one layer, but allocation is another. If the NVIDIA device plugin isn’t installed,
the scheduler can’t assign GPUs and your pod either won’t schedule or will run without GPU access.

Also: GPU resources in Kubernetes are typically requested via limits. If you forget the limit, you don’t get a GPU.
This is consistent, predictable, and still surprising to someone every week.
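
A minimal pod sketch that actually requests a GPU, assuming the NVIDIA device plugin is installed and advertising nvidia.com/gpu (pod and container names are placeholders):

cr0x@server:~$ cat gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1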

8) MIG and “it sees a GPU, but not the one you expect”

With MIG, the “GPU” you receive might be a slice. Libraries and tools will show different identifiers.
Operators confuse this with “GPU disappeared.” No: the unit of allocation changed.

9) Filesystem and library path weirdness (scratch images, distroless, minimal OS)

Some minimal images don’t have ldconfig, don’t have standard library directories, or use nonstandard dynamic linker layouts.
NVIDIA’s injection works by mounting driver libs into expected paths. If your image is too “clever,” it becomes incompatible with that approach.

Be pragmatic: if you want maximal minimalism, pay the integration tax. Otherwise, pick a base image that behaves like a normal Linux distribution.

10) Performance failures: it “works,” but it’s slow

A working GPU container can still be a production failure if it’s slow. Common culprits:
CPU pinned too tightly, PCIe topology and NUMA mismatch, container memory limits causing paging, or storage bottlenecks feeding the GPU.

GPU utilization at 5% with high CPU and high I/O wait is not a GPU problem. It’s a data pipeline problem.

A paraphrased idea from John Ousterhout: reliability comes from simplicity; complexity is where bugs and outages like to hide.

Common mistakes: symptoms → root cause → fix

“CUDA is not available” in PyTorch/TensorFlow, but the host GPU is fine

Symptom: torch.cuda.is_available() returns false; no obvious error.

Root cause: CPU-only framework build in the container, or missing CUDA user-space libs.

Fix: Install the CUDA-enabled wheel/conda package, and ensure your base image includes compatible CUDA runtime libraries. Validate with Task 15.
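
A hedged install example for PyTorch, using the CUDA-specific wheel index (pick the index tag that matches the CUDA runtime your image and host driver support; cu121 here is just an example):

cr0x@server:~$ pip install torch --index-url https://download.pytorch.org/whl/cu121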

nvidia-smi works on host, fails in container: “command not found”

Symptom: Container can’t run nvidia-smi.

Root cause: Your image doesn’t include nvidia-smi (it’s part of NVIDIA user-space utilities), even if GPU passthrough is correct.

Fix: Use nvidia/cuda base images for debugging, or install nvidia-utils-* inside your image if you truly need it (often you don’t).

nvidia-smi fails in container: “Failed to initialize NVML”

Symptom: NVML errors inside container.

Root cause: Missing/incorrectly mounted libnvidia-ml.so, mismatched driver libs, or permission/policy issue.

Fix: Validate toolkit injection (Tasks 5–9). Confirm device nodes exist in container (Task 8). Check SELinux/AppArmor (Tasks 12–13).

Container runs, but no GPUs are visible

Symptom: Framework sees zero devices; /dev/nvidia* absent.

Root cause: You didn’t request GPUs (--gpus missing) or Compose/K8s spec missing GPU request.

Fix: Add --gpus all (or specific devices) and verify with Task 10. In Kubernetes, set GPU resource limits correctly.

Works as root, fails as non-root user

Symptom: Root can run CUDA; non-root gets permission denied.

Root cause: Device node permissions/groups don’t allow access, especially under rootless Docker.

Fix: Revisit device node ownership (udev rules), group membership, or avoid rootless for GPU workloads on that host (Task 14).

After OS upgrade, GPU containers stopped working

Symptom: Everything was fine, then “suddenly” broken.

Root cause: Kernel update broke NVIDIA DKMS build, or cgroup mode changed, or Docker/containerd updated without matching toolkit.

Fix: Re-validate host driver (Tasks 1–4), toolkit packages (Task 5), Docker runtimes (Task 6), and cgroup mode (Task 11).

Multi-GPU box: container sees the wrong GPU

Symptom: Job runs on GPU 0, but you intended GPU 2; or multiple jobs collide.

Root cause: No explicit device selection; reliance on implicit ordering; lack of scheduler.

Fix: Use --gpus '"device=2"' or a scheduler with proper allocation. For MIG, validate slice IDs and allocation logic.
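
A sketch of explicit device selection (note the quoting: the inner double quotes must survive shell parsing). Inside the container the selected GPU is renumbered, so it shows up as index 0:

cr0x@server:~$ docker run --rm --gpus '"device=2"' nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-...)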

It works, but performance is terrible

Symptom: Low GPU utilization, high wall time.

Root cause: Data loader bottlenecks, storage throughput, CPU pinning, NUMA mismatch, or small batch sizes that underutilize GPU.

Fix: Profile end-to-end: check CPU usage, iowait, and storage throughput. Don’t “optimize” the GPU until you prove the GPU is the limiter.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “The CUDA version in nvidia-smi is what’s in the container”

A mid-size enterprise rolled out a new inference container built against a newer CUDA runtime. The team checked the GPU nodes and saw
nvidia-smi reporting a recent CUDA version. They assumed the driver+runtime story was aligned and pushed the deployment.

The service didn’t crash. It degraded. The framework quietly decided CUDA wasn’t usable and fell back to CPU. Latency spiked, autoscaling kicked in,
and the cluster grew like it had discovered an all-you-can-eat buffet.

The on-call engineer did the right thing: they stopped staring at application logs and ran a minimal CUDA container test. It failed with a driver capability mismatch.
The “CUDA Version” in nvidia-smi was driver capability, not a promise that their container’s user-space stack would initialize.

Fix was boring: pin the host driver to a version compatible with the container’s CUDA runtime, and add a preflight job that runs
nvidia-smi inside the exact production image. They also added an alert when GPU utilization drops below a threshold while CPU rises—a dead giveaway for silent fallback.

Optimization that backfired: shaving the image until GPU injection broke

Another company had a security mandate to reduce container images. Someone got ambitious: switched from a standard distro base
to a minimal/distroless-like approach, removed dynamic linker caches, and stripped “unnecessary” directories.

The image was smaller, the vulnerability scanner was happier, and the demo worked in CI. Then production started failing on certain nodes only.
The same image ran on one GPU pool and failed on another. That’s the kind of inconsistency that makes adults say words that shouldn’t be in tickets.

The root cause was not “NVIDIA is flaky.” It was a subtle assumption in how the runtime injects driver libraries and expects the container filesystem layout.
The minimal image didn’t have conventional library search paths, so injected libraries existed but weren’t discoverable by the dynamic linker.

They reverted the base image to a conventional one for GPU workloads, then applied a measured hardening approach:
multi-stage builds, keep runtime images lean but not hostile, and verify with an integration test that runs a real CUDA initialization.

Joke 2: Nothing says “security” like a container so minimal it can’t find its own libraries.

Boring but correct practice that saved the day: a golden node + pinned versions + canary tests

A regulated enterprise ran GPU workloads on a dedicated cluster. They had two rules that sounded bureaucratic:
(1) GPU nodes are built from a golden image with pinned kernel, pinned NVIDIA driver, and pinned toolkit.
(2) Every change goes through a canary pool that runs synthetic GPU tests hourly.

One week, an OS repository update introduced a kernel patch that was harmless for general compute but triggered a DKMS rebuild issue in their environment.
The canary pool lit up within an hour: nvidia-smi failed on new nodes; the synthetic container test failed too.

Because the cluster used pinned versions, the blast radius was confined. No production nodes drifted automatically.
The team held the update, fixed the build pipeline, rebuilt the golden image, and promoted it after canaries passed.

The moral is not “never update.” It’s “update on purpose.” GPU stacks are a three-body problem: kernel, driver, container runtime.
You want a process that keeps physics from becoming astrology.

Hardening and reliability: making GPU containers boring

Pin versions like you mean it

The GPU stack has multiple version axes: kernel, driver, container toolkit, CUDA runtime, framework, and sometimes NCCL.
If you let them float independently, you will eventually create an incompatible combination.

Do this:

  • Pin host NVIDIA driver versions per node pool.
  • Pin CUDA base image tags (don’t use latest unless you hate sleep).
  • Pin framework versions and record which CUDA variant you installed.
  • Upgrade as a bundle in a controlled rollout.
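
One way to enforce the holds on Debian/Ubuntu hosts is apt-mark; a sketch, with package names that vary by distro and driver branch:

cr0x@server:~$ sudo apt-mark hold nvidia-driver-550 nvidia-container-toolkit
nvidia-driver-550 set on hold.
nvidia-container-toolkit set on hold.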

Make GPU availability an explicit SLO signal

“Service is up” is not the same as “service is using the GPU.” For inference, silent CPU fallback is a cost bomb and a latency bomb.
For training, it’s a schedule bomb.

Operationally: emit metrics for GPU utilization, GPU memory, and a binary “CUDA initialized successfully” at process start.
Alert on a mismatch (CPU high, GPU low) for GPU-dependent services.
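
A fail-fast sketch for the process-start check, assuming a PyTorch-based service with a shell entrypoint (names are placeholders; adapt the probe to your framework):

cr0x@server:~$ cat entrypoint.sh
#!/bin/sh
# Refuse to start if CUDA cannot initialize, instead of silently falling back to CPU.
python -c "import torch, sys; sys.exit(0 if torch.cuda.is_available() else 1)" || {
  echo "CUDA initialization failed; exiting" >&2
  exit 1
}
exec "$@"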

Separate “debug images” from “production images”

You want small images in production. You also want tools when debugging at 03:00. Don’t confuse the goals.
Keep a debug variant that includes bash, procps, maybe pciutils, and the ability to run nvidia-smi or a small CUDA test.
Production images can stay lean.

Be careful with privilege escalation

GPU passthrough does not require --privileged in most setups. If you’re using it “because it fixes things,” you’re likely papering over
a missing runtime configuration or a policy rule. Privileged containers expand attack surface and complicate compliance reviews.

Don’t ignore NUMA and PCIe topology

Once the GPU is visible, performance issues often trace to topology: the container’s CPU threads are scheduled on a NUMA node far from the GPU,
or the NIC and GPU are on different sockets causing cross-socket traffic.

Containers don’t automatically fix bad placement. Your scheduler and your node configuration do.

Checklists / step-by-step plans

Step-by-step: bring up GPU support on a fresh Docker host (NVIDIA)

  1. Install and validate the host driver.
    Run Task 1. If nvidia-smi fails, do not proceed.
  2. Confirm device nodes.
    Run Task 2. Missing nodes means driver isn’t properly loaded.
  3. Install NVIDIA Container Toolkit.
    Validate with Task 5.
  4. Restart Docker and confirm runtimes.
    Validate with Task 6.
  5. Run a minimal GPU container test.
    Task 7 should succeed.
  6. Lock versions.
    Record kernel, driver, toolkit, Docker version. Pin them for the node pool.

Step-by-step: debugging a failing production workload

  1. Confirm host GPU health (Task 1). If it’s broken, stop.
  2. Confirm runtime path with a known-good CUDA image (Task 7).
  3. Confirm devices and libraries in container (Tasks 8–9).
  4. Confirm the workload actually requested GPU (Task 10).
  5. Check policy layers (Tasks 12–14), especially after hardening changes.
  6. Check the app stack (Task 15) for CPU-only builds or missing dependencies.
  7. Only then chase performance: CPU, I/O, batch size, NUMA placement.

Operational checklist: preventing regressions

  • Keep a canary GPU node pool. Run synthetic GPU tests hourly.
  • Alert on “GPU expected but unused”: low GPU util + high CPU for GPU-tagged services.
  • Pin and roll driver/toolkit/kernel upgrades as a unit.
  • Track cgroup mode changes as a breaking change.
  • Maintain a debug container that can run basic GPU validation quickly.
  • Document the exact Compose/Helm patterns that request GPUs; ban ad-hoc copy/paste configs.

FAQ

1) Why does nvidia-smi work on the host but not inside the container?

Usually because the container runtime didn’t inject the devices/libraries (missing toolkit or missing --gpus),
or a security policy blocks device access. Validate with Tasks 6–9 and 12–13.

2) Do I need to install the NVIDIA driver inside the container?

No. The kernel driver must be on the host. The container typically carries CUDA user-space libraries (runtime/toolkit/framework),
while the NVIDIA Container Toolkit injects host driver libraries needed to talk to the kernel driver.

3) What does the “CUDA Version” in nvidia-smi actually mean?

It indicates the maximum CUDA capability supported by the installed driver. It does not tell you which CUDA toolkit is in your container image.
Treat it as “driver speaks up to this CUDA dialect,” not “toolkit installed.”

4) Should I set Docker’s default runtime to nvidia?

In most modern setups, no. Use --gpus and keep default runtime as runc.
Setting nvidia as default can surprise non-GPU workloads and complicate debugging.

5) Why does my framework report CUDA unavailable even when nvidia-smi works?

Because nvidia-smi only proves NVML can talk to the driver. Frameworks also need the correct CUDA runtime libraries,
correct framework build (not CPU-only), and sometimes compatible glibc. Task 15 is your quickest truth serum.

6) Can I use GPUs with rootless Docker?

Sometimes, but expect friction. The core issue is permission to open /dev/nvidia* device nodes.
If you can’t guarantee device permissions and policy alignment, rootless + GPU becomes a reliability hazard.

7) My container sees the GPU, but performance is awful. Is Docker slowing it down?

Docker overhead is rarely the main culprit for GPU compute. Performance issues usually come from CPU/I/O bottlenecks,
NUMA placement, small batches, or data pipeline starvation. Prove the GPU is the limiter before “optimizing” CUDA.

8) What’s the safest way to give a container GPU access?

Request only the GPUs you need (not all by default), avoid --privileged, and rely on the NVIDIA runtime injection path.
Combine that with tight image pinning and node pool isolation for GPU workloads.

9) How do I prevent silent CPU fallback?

Add startup checks that fail fast if CUDA initialization fails, expose a metric that indicates GPU use, and alert on CPU/GPU utilization mismatches.
“Healthy but wrong” is a classic reliability trap.

Conclusion: practical next steps

GPU-in-container issues feel chaotic because there are multiple layers, and each layer fails differently.
The fix is discipline: validate the host, validate the runtime injection, then validate the application image.
Don’t skip steps. Don’t debug by vibes.

Next steps you can do today:

  1. Run the Fast diagnosis playbook on one broken host and one known-good host; diff the outputs.
  2. Add a CI smoke test that runs nvidia-smi (or framework CUDA init) inside your production image.
  3. Pin driver/toolkit/CUDA image versions and roll them together through a canary pool.
  4. Build an alert for “GPU expected but unused” to catch silent CPU fallback before finance does.