How AI Will Rewrite the Graphics Pipeline: The Frame That’s Half-Generated

Every graphics team knows the feeling: the demo hits 60 fps on the lead engineer’s machine, then collapses into a stuttery mess on real hardware—right when
the marketing capture starts. Now add AI into the pipeline: you’re not just shipping shaders and textures; you’re shipping a model, runtime, drivers, and
“helpful” heuristics that can turn one bad frame into a whole second of visual regret.

The next decade of real-time rendering is not “AI replaces rasterization.” It’s messier and more interesting: each frame becomes a negotiated settlement
between classic rendering and neural inference. Half rendered, half generated. And ops will own the outcome—latency, memory, determinism, regressions, and
weird artifacts that only show up after three hours in a desert biome with fog enabled.

What “half-generated” actually means

When people say “AI graphics,” they usually mean one of three things: (1) upscale a low-res render to high-res, (2) generate intermediate frames between
real frames, or (3) denoise an image produced by a noisy renderer (path tracing, stochastic effects). Those are the mainstream, shipping, “works on a
Tuesday” uses.

But “the frame that’s half-generated” is broader. It’s a pipeline where the engine deliberately renders less than a final image requires, and uses AI to
fill in what was skipped—resolution, samples, geometry detail, shading detail, even parts of the G-buffer. In other words, AI isn’t a post-process. It’s
a co-processor making up for missing compute or missing time.

The important operational distinction: a post-process can be disabled when things go sideways. A co-processor changes what “correct output” means. That
affects testing, debugging, and what you can reasonably roll back under incident pressure.

The mental model that won’t betray you

Treat the hybrid frame as a multi-stage transaction with strict budgets:

  • Inputs: depth, motion vectors, exposure, jitter, previous frames, sometimes normals/albedo.
  • Classical render: raster, compute, maybe partial ray tracing at reduced samples/res.
  • Neural inference: reconstruct or synthesize missing details from inputs plus history.
  • Composition: HUD/UI, alpha elements, post FX that must remain crisp and stable.
  • Presentation: frame pacing, VRR, frame generation, capture/streaming implications.

If you can’t describe which stage owns which pixels, you can’t debug it. “The model did it” is not a root cause. It’s a confession.

Where AI slots into the pipeline (and where it shouldn’t)

1) Upscaling: buying pixels with math

Upscaling is the gateway drug because it’s easy to justify: render at roughly 67% of native resolution per axis, spend a couple milliseconds on reconstruction,
and ship a sharper image than naive bilinear upscaling. The operational problem is that upscalers are temporal. They use history. That means:

  • Motion vectors must be correct or you get ghosting and “rubber” edges.
  • Exposure/tonemapping must be stable or you get shimmer and breathing.
  • Camera cuts, UI overlays, and particles become special cases.

2) Frame generation: buying time with prediction

Frame generation (FG) inserts AI-synthesized frames between “real” frames. This is not the same as doubling performance. You are accepting added latency and
occasional hallucinations in exchange for smoother motion. That’s fine for some games, terrible for others, and complicated for anything competitive.

The core SRE question: what’s your latency SLO? If you can’t answer that, you’re basically rolling dice with user input. Sometimes the dice come up “buttery.”
Sometimes they come up “why did my parry miss?”

3) Denoising: buying samples with priors

Denoising is where neural methods feel inevitable. Path tracing gives you physically plausible lighting but noisy results at low sample counts. Neural
denoisers turn a handful of samples into something presentable by leaning on learned priors. Great—until the priors are wrong for your content.

Denoisers also create a subtle reliability trap: your renderer might be “correct,” but your denoiser is sensitive to input encoding, normal precision,
or subtle differences in roughness ranges. Two shaders that look identical in a classic pipeline can diverge once denoised.

4) The places AI should not own (unless you like firefights)

  • UI and text: render them at native resolution, late in the pipeline. Do not let temporal reconstruction smear your typography.
  • Competitive hit feedback: if an AI stage can create or remove a cue, you will get bug reports phrased like legal threats.
  • Safety-critical visualization: training sims, medical imaging, anything where a hallucination becomes a liability.
  • Deterministic replay systems: if your game relies on replays matching exactly, AI stages must be made deterministic or excluded.

Latency budgets: the only truth that matters

Old rendering arguments were about fps. New arguments are about pacing and end-to-end latency. AI tends to improve average throughput while
worsening tail latency, because inference can have cache effects, driver scheduling quirks, and occasional slow paths (shader compilation, model warmup,
memory paging, power state transitions).

A production pipeline needs budgets that look like SLOs:

  • Frame time p50: the normal case.
  • Frame time p95: what users remember as “stutter.”
  • Frame time p99: what streamers clip and turn into memes.
  • Input-to-photon latency: what competitive players feel in their hands.
  • VRAM headroom: what prevents intermittent paging and catastrophic spikes.

Frame generation complicates the math because you have two clocks: simulation/render cadence and display cadence. If your sim runs at 60 and you display at 120
with generated frames, your motion looks smoother but your input latency is tied to the sim cadence plus buffering. This is not a moral judgment. It’s physics
plus queues.

A reliable hybrid pipeline does two things aggressively:

  1. It measures latency explicitly (not just fps); a sketch of what that can look like follows this list.
  2. It keeps headroom in VRAM, GPU time, and CPU submission so that spikes don’t become outages.
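
A minimal sketch of what “measure latency explicitly” can look like, assuming your engine already dumps one frame time in milliseconds per line to a file (frametimes.log is a placeholder, not an engine convention):

# Nearest-rank percentiles over a plain-text frame-time log: one milliseconds value per line.
sort -n frametimes.log | awk '
  { v[NR] = $1 }
  function pct(p,   i) { i = int(p * NR); if (i < p * NR) i++; if (i < 1) i = 1; return v[i] }
  END {
    if (NR == 0) { print "no samples"; exit }
    printf "frames=%d  p50=%.2fms  p95=%.2fms  p99=%.2fms  max=%.2fms\n", NR, pct(0.50), pct(0.95), pct(0.99), v[NR]
  }'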

One line you should keep taped to your monitor, because it applies here more than anywhere, is the old operations maxim: “Hope is not a strategy.”

Joke #1: If your plan is “the model probably won’t spike,” congratulations—you’ve invented probabilistic budgeting, also known as gambling.

Data paths and telemetry: treat frames like transactions

Classic graphics debugging is already hard: a million moving parts, driver black boxes, and timing-sensitive bugs. AI adds a new category of “silent wrong”:
the frame looks plausible, but it’s not faithful. Worse: it’s content-dependent. The bug only shows on a certain map, at a certain time of day, with a
certain particle effect, after the GPU has warmed up.

Production systems survive by observability. Hybrid rendering needs the same discipline. You should log and visualize:

  • Per-stage GPU times: base render, inference, post, present.
  • Queue depth and backpressure: are frames piling up anywhere?
  • VRAM allocations over time: not just “used,” but “fragmented” and “evicted.”
  • Inference metrics: model version, precision mode, batch shape, warm/cold state.
  • Quality indicators: motion vector validity %, disocclusion rate, reactive mask coverage.

The simplest operational win: stamp every frame with a “pipeline manifest” that records the key toggles and versions that influenced it. If you can’t answer
“what model version produced this artifact?” you don’t have a bug—you have a mystery novel.
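
As a sketch of the idea (every field name and value below is illustrative, not any engine’s or vendor’s schema), a manifest can be as dull as one JSON line per captured frame:

# Illustrative only: append one manifest line per captured frame next to your frame-time telemetry.
frame_id=184213   # hypothetical frame counter from the engine
printf '{"frame":%d,"upscaler":"upscaler_v7","precision":"fp16","frame_gen":"off","driver":"550.54.14","flags":["reactive_mask","auto_exposure"]}\n' \
  "$frame_id" >> pipeline_manifest.jsonl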

Facts and historical context that explain today’s tradeoffs

  1. Temporal anti-aliasing (TAA) popularized the idea that “the current frame is not enough.” Modern upscalers inherited that worldview.
  2. Early GPU pipelines were fixed-function; programmability (shaders) turned graphics into software, and software always attracts automation.
  3. Offline renderers used denoising long before real-time; production film pipelines proved you can trade samples for smarter reconstruction.
  4. Checkerboard rendering on consoles was a precursor to ML upscaling: render fewer pixels, reconstruct the rest using patterns and history.
  5. Motion vectors existed for motion blur and TAA before they became critical inputs to AI; now a bad velocity buffer is a quality outage.
  6. Hardware ray tracing made “noisy but correct” feasible; neural denoisers made “shippable” feasible at real-time budgets.
  7. The industry learned from texture streaming incidents: VRAM spikes don’t fail gracefully—they fail like a trapdoor under your feet.
  8. Consoles forced deterministic performance thinking; AI reintroduces variance unless you design for it.
  9. Video encoders already do motion-compensated prediction; frame generation is conceptually adjacent, but must tolerate interactivity.

New failure modes in hybrid rendering

Artifact taxonomy you should actually use

  • Ghosting: history over-trusted; motion vectors wrong or disocclusion not handled.
  • Shimmering: temporal instability; exposure, jitter, or reconstruction feedback loop.
  • Smearing: inference smoothing detail that should be high-frequency (foliage, thin wires).
  • Hallucinated edges: upscaler invents structure; usually from underspecified inputs.
  • UI contamination: temporal stage sees UI as scene content and drags it through time.
  • Latency “feels off”: frame generation and buffering; sometimes compounded by misconfigured low-latency (“reflex”-style) modes.
  • Random spikes: VRAM paging, model warmup, shader compilation, power state changes, background processes.

The reliability trap: AI hides rendering debt

Hybrid rendering can mask underlying problems: unstable motion vectors, inconsistent depth, missing reactive masks, incorrect alpha handling. The model
covers it… until content shifts and the cover-up fails. Then you’re debugging two systems at once.

If you ship hybrid rendering, you must maintain a “no-AI” fallback path that is tested in CI, not just theoretically possible. This is the difference
between a degraded mode and an outage.
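
A sketch of what “tested in CI” can mean, assuming a headless benchmark mode and a budget-checking script exist; --headless, --render-mode, and check_budgets.sh are stand-ins for whatever your engine and harness actually expose:

#!/usr/bin/env bash
# Hypothetical CI step: run the same capture scene once per rung of the fallback ladder.
set -euo pipefail
for mode in native taau ml-upscale ml-upscale-fg; do
  ./game-bin --headless --benchmark capture_scene_01 --render-mode "$mode" \
    --dump-frametimes "frametimes_${mode}.log"
  ./check_budgets.sh "frametimes_${mode}.log"   # fail the job if p95/p99 budgets are busted
done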

Joke #2: Neural rendering is like a colleague who finishes your sentences—impressive until it starts doing it in meetings with your boss.

Fast diagnosis playbook

When performance or quality goes sideways, don’t start by arguing about “AI vs raster.” Start by finding the bottleneck with a ruthless, staged approach.
The goal is to identify which budget is busted: GPU time, CPU submission, VRAM, or latency/pacing.

First: confirm the symptom is pacing, not average fps

  • Check p95/p99 frame time spikes and whether they correlate with scene transitions, camera cuts, or effects.
  • Confirm if stutter aligns with VRAM pressure or shader compilation events.
  • Validate that the display path (VRR, vsync, limiter) matches test assumptions.

Second: isolate “base render” vs “AI inference” vs “present”

  • Disable frame generation first (if enabled). If latency and pacing normalize, you’re in the display/interpolation domain.
  • Drop to native resolution (disable upscaling). If artifacts vanish, your inputs (motion vectors, reactive mask) are suspect.
  • Switch denoiser to a simpler mode or lower quality. If spikes vanish, inference is the culprit (or its memory behavior).

Third: check VRAM headroom and paging

  • If VRAM is within 5–10% of the limit, assume you’ll page under real workloads.
  • Look for periodic spikes: these often match streaming, GC-like allocation churn, or background capture.
  • Confirm the model weights are resident and not being re-uploaded due to context loss or memory pressure.
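
A quick headroom check you can wire into a soak test or a cron job; the nvidia-smi query flags are real, the 10% threshold is just the rule of thumb above:

# Warn when free VRAM drops below ~10% of total (NVIDIA-only query; adapt for other vendors).
read -r used total < <(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | head -n1 | tr -d ',')
free=$(( total - used ))
if (( free * 10 < total )); then
  echo "WARN: VRAM headroom is ${free} MiB of ${total} MiB (<10%)"
fi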

Fourth: validate inputs and history integrity

  • Motion vectors: correct space, correct scale, correct handling for skinned meshes and particles.
  • Depth: stable precision and consistent near/far mapping; avoid “helpful” reversed-Z mismatches across passes.
  • History reset on cuts: if you don’t cut history, the model will try to glue two unrelated frames together.

Fifth: regression control

  • Pin driver versions for QA baselines. Don’t debug two moving targets at once.
  • Pin model versions and precision modes. If you can’t reproduce, you can’t fix.
  • Use feature flags with kill-switches that ops can flip without rebuilding.
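
What a kill-switch can look like operationally, as a sketch; the path and key names are hypothetical, and the point is that flipping them requires no rebuild:

# Hypothetical runtime flags file the engine reads at startup (names are illustrative, not an engine API).
cat > /etc/game/render-flags.conf <<'EOF'
frame_generation=off          # kill-switch: disable FG without a rebuild
upscaler_model=upscaler_v7    # pin the model artifact
upscaler_precision=fp16
denoiser_quality=balanced
EOF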

Practical tasks: commands, outputs, and decisions

These are ops-grade checks you can run on a Linux workstation or build server. They won’t magically debug your shader code, but they will tell you whether
you’re fighting the GPU, the driver stack, memory pressure, or your own process.

Task 1: Identify the GPU and driver in the exact environment

cr0x@server:~$ lspci -nn | grep -Ei 'vga|3d|display'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104 [GeForce RTX 4070] [10de:2786] (rev a1)

What it means: You’ve confirmed the hardware class. This matters because inference behavior differs by architecture.

Decision: If bug reports mention different device IDs, split the issue by architecture first; don’t average them together.

Task 2: Confirm the kernel version in the exact environment

cr0x@server:~$ uname -r
6.5.0-21-generic

What it means: Kernel updates can change DMA behavior, scheduling, and IOMMU defaults—enough to alter stutter.

Decision: Pin the kernel for performance test baselines. Upgrade intentionally, not accidentally.

Task 3: Confirm NVIDIA driver version (or equivalent stack)

cr0x@server:~$ nvidia-smi
Wed Jan 21 10:14:32 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070        Off |   00000000:01:00.0  On |                  N/A |
| 30%   54C    P2              95W / 200W |     7420MiB / 12282MiB |     78%      Default |
+-----------------------------------------+------------------------+----------------------+

What it means: Driver version and VRAM usage are visible. 7.4 GiB used is not alarming; 11.8/12.2 is.

Decision: If VRAM is consistently >90%, treat it as a paging risk and reduce budgets (textures, RT buffers, model size, history buffers).

Task 4: Watch VRAM and utilization over time to catch spikes

cr0x@server:~$ nvidia-smi dmon -s pucm -d 1 -c 5
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    94    55     -    81    63     0     0  9501  2580
    0   102    56     -    88    66     0     0  9501  2610
    0    73    53     -    52    62     0     0  9501  2145
    0   110    57     -    92    70     0     0  9501  2655
    0    68    52     -    45    61     0     0  9501  2100

What it means: You can see bursts. If mem% ramps then drops, you may be paging or reallocating aggressively.

Decision: Correlate spikes to engine events (streaming zones, cutscenes). Add pre-warm or cap allocations.
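
If dmon snapshots aren’t enough, a longer trace you can line up against engine events is one query loop away (real nvidia-smi flags; the output path is yours to choose):

# One CSV row per second: timestamp, VRAM used, GPU utilization. Let it run through a full session.
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv -l 1 >> /tmp/vram_trace.csv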

Task 5: Confirm PCIe link width/speed (hidden throttles happen)

cr0x@server:~$ sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s, Width x16

What it means: You’re not stuck at x4 because someone used the wrong slot or a BIOS setting.

Decision: If link is downgraded, fix hardware/BIOS before you “optimize” your renderer into a pretzel.

Task 6: Check for CPU frequency scaling (frame pacing killer)

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

What it means: CPU may be slow to ramp, causing render thread submission stutter.

Decision: For perf testing, set to performance and document it, or your results are fiction.

Task 7: Set performance governor during controlled benchmarks

cr0x@server:~$ sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3

What it means: CPU will hold higher clocks more consistently.

Decision: If stutters disappear, you have a CPU scheduling/power issue, not an “AI is slow” issue.

Task 8: Check memory pressure and swap activity

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        41Gi       3.1Gi       1.2Gi        18Gi        19Gi
Swap:          8.0Gi       2.4Gi       5.6Gi

What it means: Swap usage suggests the system is paging. That can manifest as periodic spikes and asset hitching.

Decision: Reduce memory footprint, fix leaks, or increase RAM. Don’t pretend GPU tuning will fix host paging.

Task 9: Identify top CPU consumers (background capture tools are frequent villains)

cr0x@server:~$ ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head
  PID COMMAND         %CPU %MEM
 4121 chrome          38.2  4.1
 9332 obs             22.7  1.9
 7771 game-bin        18.4  6.8
 1260 Xorg             9.2  0.6
 2104 pulseaudio       3.1  0.1

What it means: Your “benchmark” is competing with a browser and a streamer tool.

Decision: Reproduce under clean conditions. If OBS is required, treat it as part of the production workload.

Task 10: Check disk I/O latency (asset streaming and model loads)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0-21-generic (server) 	01/21/2026 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.41    0.00    3.28    2.91    0.00   81.40

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz aqu-sz  %util
nvme0n1         92.0   18240.0     0.0   0.00    3.12   198.3      44.0    5280.0     2.0   4.35    5.44   120.0   0.36  18.40

What it means: r_await/w_await are modest. If you see 50–200ms awaits, you’ll get hitches regardless of GPU.

Decision: If storage is slow, fix streaming (prefetch, compression, packaging) before touching inference settings.

Task 11: Validate filesystem space (logs and caches can fill disks mid-run)

cr0x@server:~$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  220G  214G  6.0G  98% /

What it means: You’re one enthusiastic debug logging session away from a bad day.

Decision: Free space or redirect caches/logs. A full disk can break shader caches, model caches, and crash dump writing.

Task 12: Inspect GPU error counters (hardware/driver instability)

cr0x@server:~$ sudo journalctl -k -b | grep -Ei 'nvrm|gpu|amdgpu|i915' | tail
Jan 21 09:58:11 server kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=7771, name=game-bin, Ch 0000002c, intr 00000000
Jan 21 09:58:11 server kernel: NVRM: GPU at PCI:0000:01:00: GPU has fallen off the bus.

What it means: That’s not an optimization problem. That’s a stability incident: driver reset, power issue, or hardware fault.

Decision: Stop tuning quality. Reproduce under stress tests, check power, thermals, and driver known issues.

Task 13: Check GPU clocks and throttling reasons

cr0x@server:~$ nvidia-smi -q -d CLOCK,PERFORMANCE | sed -n '1,80p'
==============NVSMI LOG==============

Performance State                          : P2
Clocks
    Graphics                               : 2580 MHz
    Memory                                 : 9501 MHz
Clocks Throttle Reasons
    Idle                                   : Not Active
    Applications Clocks Setting            : Not Active
    SW Power Cap                           : Not Active
    HW Slowdown                            : Not Active
    HW Thermal Slowdown                    : Not Active

What it means: No obvious throttling. If thermal slowdown is active during spikes, your “AI regression” may be just heat.

Decision: If throttling appears after minutes, test with fixed fan curves and case airflow before rewriting the pipeline.

Task 14: Confirm model files are not being reloaded repeatedly (cache thrash)

cr0x@server:~$ lsof -p $(pgrep -n game-bin) | grep -E '\.onnx|\.plan|\.bin' | head
game-bin 7771 cr0x  mem REG  259,2  31248768  1048612 /opt/game/models/upscaler_v7.plan
game-bin 7771 cr0x  mem REG  259,2   8421376  1048620 /opt/game/models/denoiser_fp16.bin

What it means: The model weights are memory-mapped. Good. If you see repeated open/close patterns in traces, you’re paying load costs mid-game.

Decision: Preload and pin models at startup or level load; don’t lazily load on first explosion.

Task 15: Check for shader cache behavior (compilation stutter often blamed on AI)

cr0x@server:~$ ls -lh ~/.cache/nv/GLCache | head
total 64M
-rw------- 1 cr0x cr0x 1.2M Jan 21 09:40 0b9f6a8d0b4a2f3c
-rw------- 1 cr0x cr0x 2.8M Jan 21 09:41 1c2d7e91a1e0f4aa
-rw------- 1 cr0x cr0x 512K Jan 21 09:42 3f4a91c2d18e2b0d

What it means: Cache exists and is populated. If it’s empty every run, your environment is wiping it or permissions are wrong.

Decision: Ensure shader caches persist in test and production. Otherwise you’ll chase “random” frame spikes forever.

Task 16: Measure scheduling jitter on the host (useful for render thread pacing)

cr0x@server:~$ sudo cyclictest -m -Sp90 -i200 -h400 -D5s | tail -n 3
T: 0 ( 2345) P:90 I:200 C: 25000 Min:    5 Act:    7 Avg:    9 Max:  112
T: 1 ( 2346) P:90 I:200 C: 25000 Min:    5 Act:    6 Avg:    8 Max:   98
T: 2 ( 2347) P:90 I:200 C: 25000 Min:    5 Act:    6 Avg:    8 Max:  130

What it means: Max jitter in the ~100µs range is usually fine. If you see multi-millisecond jitter, your OS is interrupting you hard.

Decision: For reproducible profiling, isolate CPUs, tame background daemons, and avoid noisy neighbors (VMs, laptop power modes).
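
A small sketch of “tame the noise” before a profiling run; cpupower and taskset are standard tools, while the core list and the game-bin flags are placeholders:

# Keep clocks steady and pin the workload to dedicated cores for the measurement run.
sudo cpupower frequency-set -g performance
taskset -c 4-7 ./game-bin --headless --benchmark capture_scene_01   # core list and flags are examples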

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

A studio shipped a patch that “only changed the upscaler.” The release notes said: improved sharpness, fewer artifacts. QA signed off on visual quality in a
controlled scene set, and performance looked stable on their lab machines.

Within hours, support tickets poured in: intermittent hitching, mostly on mid-range GPUs with 8–10 GiB VRAM. The hitches didn’t show up immediately. They
appeared after 20–30 minutes, often after a couple of map transitions. The team blamed shader compilation. It smelled like shader compilation.

The wrong assumption: the new model would be “roughly the same size” in VRAM because it had similar input/output resolution. But the engine’s inference path
quietly enabled a higher-precision intermediate buffer for the new model. Add in a slightly bigger history buffer and a more aggressive reactive mask, and
VRAM headroom vanished.

On those GPUs, the driver started evicting resources. Not always the same ones. The eviction pattern depended on what else was resident: textures, RT
acceleration structures, shadow maps, capture tools. The “shader stutter” was actually memory churn and occasional re-uploads.

The fix wasn’t heroic: cap history resolution, force FP16 intermediates, and reserve VRAM budget explicitly for the model and history buffers. They added a
runtime warning when headroom fell below a threshold and exposed a “safe mode” upscaler that traded sharpness for stability. The lesson was also boring:
treat VRAM as a budget with guardrails, not as a best-effort suggestion.

Mini-story #2: The optimization that backfired

An engine team decided to “save bandwidth” by packing motion vectors and depth into a tighter format. The commit message was cheerful: smaller G-buffer,
faster passes, better cache locality. Benchmarks improved by a couple percent on average. Everyone clapped and moved on.

Then the hybrid pipeline started showing intermittent ghosting on thin geometry—fences, wires, tree branches—especially during fast camera pans. Only some
scenes. Only some lighting. Only some GPUs. The bug reports were vague, because the frames looked “mostly fine” until you stared long enough to hate your
own eyes.

The optimization reduced precision in exactly the places the upscaler relied on: sub-pixel motion and accurate depth discontinuities. The model was trained
assuming a certain distribution of motion errors; the new packing changed the distribution. Not enough to break every frame. Enough to break the hard ones.

The backfire was organizational too. The team had improved one metric (bandwidth) while quietly destroying another (input fidelity). Because input buffers
felt like “internal details,” nobody updated the model validation suite. There was no guardrail for “motion vector quality regression.”

They rolled the packing change back for the AI path while keeping it for the non-AI path. Then they created a contract: motion vector precision and range
became versioned inputs, with automated scene tests that compared temporal stability metrics before and after changes. They still optimized—but only with a
quality budget in the loop.

Mini-story #3: The boring but correct practice that saved the day

A platform team owned the runtime that loaded models, selected precision modes, and negotiated with the graphics backend. Nothing flashy. No one wrote
blog posts about it. But they had one practice that looked like paperwork: every model artifact was treated like a deployable with semantic versioning and a
changelog that included input assumptions.

One Friday, a driver update hit their internal fleet. Suddenly, a subset of machines began showing rare flickers during frame generation—one frame every few
minutes. The flicker was small but obvious in motion. The kind of bug that ruins confidence because it’s rare enough to evade quick reproduction.

Because the model artifacts and runtime were version-pinned and logged per frame, they could answer the crucial question within an hour: nothing in the model
changed. The runtime changed only in a minor way. The driver changed, and only on the affected machines.

They flipped the kill-switch to disable frame generation for that driver branch while leaving upscaling and denoising intact. The game stayed playable. QA
regained a stable baseline. Meanwhile, they worked with the vendor on a minimal repro and verified it against the pinned matrix.

The saving practice wasn’t genius. It was boring operational hygiene: version pinning, per-frame manifests, and fast rollback controls. It turned a potential
weekend incident into a controlled degradation with a clear scope.

Common mistakes: symptoms → root cause → fix

1) Symptom: Ghost trails behind moving characters

Root cause: Motion vectors wrong for skinned meshes, particles, or vertex animation; disocclusion mask missing.

Fix: Validate velocity for each render path; generate motion for particles separately; reset history on invalid vectors; add reactive masks.

2) Symptom: UI text smears or “echoes” during camera movement

Root cause: UI composited before temporal reconstruction, or UI leaks into history buffers.

Fix: Composite UI after upscaling/denoising; ensure UI render targets are excluded from history and motion vector passes.

3) Symptom: Performance is fine in benchmarks, terrible after 30 minutes

Root cause: VRAM fragmentation, asset streaming growth, model weights evicted under pressure, or thermal throttling.

Fix: Track VRAM over time; enforce budgets; pre-warm and pin model allocations; monitor throttling reasons; fix leaks in transient RTs.

4) Symptom: Frame generation feels smooth but input feels laggy

Root cause: Display cadence decoupled from simulation cadence; extra buffering; latency mode misconfigured.

Fix: Measure input-to-photon; reduce render queue depth; tune low-latency modes; offer player-facing toggles with honest descriptions.

5) Symptom: Shimmering on foliage and thin geometry

Root cause: Temporal instability from undersampling plus insufficient reactive mask; precision loss in depth/velocity; aggressive sharpening.

Fix: Improve input precision; tune reactive mask; reduce sharpening; clamp history contribution in high-frequency regions.

6) Symptom: Sudden black frame or corrupted frame once in a while

Root cause: GPU driver reset, TDR-like recovery, out-of-bounds in a compute pass, or model runtime failure path not handled.

Fix: Capture kernel logs; add robust fallback when inference fails; validate bounds and resource states; escalate as stability issue, not “quality.”

7) Symptom: “It only happens on one vendor’s GPU”

Root cause: Different math modes, denormal handling, scheduling, or precision defaults; driver compiler differences.

Fix: Build vendor-specific baselines; constrain precision; test per vendor and per architecture; don’t assume “same API means same behavior.”

8) Symptom: Artifacts appear after camera cut or respawn

Root cause: History not reset; model tries to reconcile unrelated frames.

Fix: Treat cuts as hard resets; fade history contribution; reinitialize exposure and jitter sequences.

Checklists / step-by-step plan

Step-by-step: shipping a half-generated frame without embarrassing yourself

  1. Define budgets: frame time p95 and p99, VRAM headroom target, input-to-photon target. Write them down. Make them enforceable.
  2. Version everything: model version, runtime version, driver baseline, feature flags. Log them per frame in debug builds.
  3. Build a fallback ladder: native render → classic TAAU → ML upscaler → ML + frame gen. Each step must be shippable.
  4. Validate inputs: motion vectors (all geometry types), depth precision, exposure stability, alpha handling, disocclusion detection.
  5. Create a temporal test suite: fast pans, foliage, particle storms, camera cuts, respawns, UI overlays. Automate captures and metrics.
  6. Reserve VRAM: budget history buffers and model weights explicitly; don’t “see what happens.”
  7. Warm up: precompile shaders, pre-initialize inference, pre-allocate RTs where possible. Hide it behind loading screens.
  8. Instrument per-stage timings: base render, inference, post, present; include queue depth and pacing metrics.
  9. Control tail latency: cap worst-case work; avoid allocations in-frame; watch for background CPU contention.
  10. Ship kill-switches: ops needs toggles to disable FG or swap to a smaller model without a full rebuild.
  11. Document player tradeoffs: smoothness vs latency, quality modes vs stability. If you hide it, players will discover it the loud way.
  12. Run endurance tests: 2–4 hours, multiple map transitions, streaming-heavy paths. Most “AI issues” are actually time-based resource issues.

Checklist: before blaming the model

  • Is VRAM headroom >10% during worst scenes?
  • Are motion vectors valid for every render path (skinned, particles, vertex anim)?
  • Do you reset history on cuts and invalid frames?
  • Are you compositing UI after temporal stages?
  • Are driver and model versions pinned for repro?
  • Can you reproduce with AI disabled? If not, your measurement setup is suspect.

FAQ

1) Is “half-generated frame” just marketing for upscaling?

No. Upscaling is one part. “Half-generated” means the pipeline intentionally renders incomplete data and relies on inference to reconstruct or synthesize the rest,
sometimes including time (generated frames) and sometimes including light transport (denoising).

2) Does frame generation increase performance or just hide it?

It increases displayed frame rate, which can improve perceived smoothness. It does not increase simulation rate, and it can increase perceived input latency
depending on buffering and latency modes. Measure input-to-photon, don’t argue in circles.

3) What’s the #1 operational risk when adding AI to rendering?

Tail latency and memory behavior. Average frame time might improve while p99 gets worse due to VRAM eviction, warmup, driver scheduling, or occasional slow paths.

4) Why do artifacts often show up on foliage and thin geometry?

Those features are high-frequency and often under-sampled. They also produce hard disocclusions and unreliable motion vectors. Temporal reconstruction is fragile
when the inputs don’t describe motion cleanly.

5) Can we make AI stages deterministic for replays?

Sometimes. You can constrain precision, fix seeds, avoid non-deterministic kernels, and pin runtimes/drivers. But determinism across vendors and driver versions
is hard. If deterministic replays are a product requirement, design the pipeline with a deterministic mode from day one.

6) Should we ship one big model or multiple smaller ones?

Multiple. You want a ladder: high quality, balanced, safe. Production systems need graceful degradation. One big model is a single point of failure with a
fancy haircut.

7) How do we test “quality” without relying on subjective screenshots?

Use temporal metrics: variance over time, edge stability, ghosting heuristics, disocclusion error counts, and curated “torture scenes.” Also keep human review,
but make it focused and repeatable.

8) What should ops demand from graphics teams before enabling frame generation by default?

A measured latency impact, a kill-switch, clear player messaging, and a regression matrix across driver versions and common hardware. If they can’t provide that,
enabling by default is a reliability gamble.

9) Why does “it works on my machine” get worse with AI?

Because you’ve added more hidden state: model caches, precision modes, driver scheduling differences, VRAM headroom variance, and thermal/power profiles. The
system is more path-dependent, which punishes sloppy baselines.

Conclusion: what to do next week

The frame that’s half-generated is not a science project anymore. It’s a production pipeline with all the usual sins: budgets ignored, versions unpinned,
caches wiped, and “optimizations” that delete the very signals the model needs. The good news is that the fixes look like normal engineering:
measurement, guardrails, and controlled rollouts.

Next week, do these practical things:

  • Define p95/p99 frame time and input-to-photon targets, and make them release gates.
  • Add per-frame manifests: model/runtime/driver versions and key toggles, logged in debug builds.
  • Build a tested fallback ladder and wire it to a kill-switch ops can use.
  • Track VRAM headroom and paging risk as a first-class metric, not an afterthought.
  • Automate temporal torture scenes and validate motion vectors like your job depends on it—because it does.

Hybrid rendering will keep evolving. Your job is to make it boring in production: predictable, observable, and recoverable when it misbehaves. The “AI” part is
impressive. The “pipeline” part is where you either ship—or spend your weekends watching frame time graphs like they’re stock charts.

Proxmox Disks Not Detected: HBA, BIOS, and Cabling Quick Checklist

Nothing says “fun weekend” like booting a Proxmox node and discovering your shiny new disks have ghosted you. The installer shows nothing. lsblk is a desert. ZFS pools vanish. You swear the drives were there yesterday.

This is a field checklist for production humans: storage engineers, SREs, and the unlucky on-call who inherited a “simple” disk expansion. We’ll hunt the failure domain fast: BIOS/UEFI, HBA firmware and mode, PCIe, cabling/backplanes/expander weirdness, Linux drivers, and the gotchas that make disks “present” but invisible.

Fast diagnosis playbook (do this in order)

0) Decide what “not detected” means

  • Not in BIOS/UEFI: hardware, power, cabling, backplane, HBA/PCIe enumeration.
  • In BIOS but not in Linux: kernel driver/module, IOMMU quirks, broken firmware, PCIe AER errors.
  • In Linux but not in Proxmox UI: wrong screen, existing partitions, multipath masking, ZFS holding devices, permissions, or it’s under /dev/disk/by-id but not obvious.

1) Start with the kernel’s truth

Run these three and don’t improvise yet:

  1. dmesg -T | tail -n 200 (look for PCIe, SAS, SATA, NVMe, link resets)
  2. lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,HCTL (see what the kernel created)
  3. lspci -nn | egrep -i 'sas|raid|sata|nvme|scsi' (confirm the controller exists)

Decision: If the controller isn’t in lspci, stop blaming Proxmox. It’s BIOS/PCIe seating/lane allocation or the card is dead.

2) If the controller exists, check the driver and link

  • lspci -k -s <slot> → verify “Kernel driver in use”.
  • journalctl -k -b | egrep -i 'mpt3sas|megaraid|ahci|nvme|reset|timeout|aer' → find the smoking gun.

Decision: No driver bound? Load the module or fix firmware/BIOS settings. Link resets/timeouts? Suspect cabling/backplane/expander/power.

3) Rescan before you reboot

Rescan SCSI/NVMe. If disks appear after a rescan, you’ve learned something: hotplug, link training, or boot timing.
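
The rescan itself is two short loops, run as root; the SCSI write is the classic “- - -” trick and ns-rescan comes from nvme-cli:

# Ask every SCSI host to re-enumerate targets, then rescan NVMe namespaces.
for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done
for c in /dev/nvme[0-9]; do nvme ns-rescan "$c"; done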

4) If disks appear but “missing” in Proxmox UI

Go to the CLI and use stable IDs. The UI isn’t lying; it’s just not your incident commander.

Decision: If they exist in /dev/disk/by-id but not in your pool, it’s a ZFS/import/partitioning story, not a detection story.

A practical mental model: where disks can disappear

Disk detection is a chain. Break any link and you’ll stare at an empty list.

Layer 1: Power and physical connectivity

Drive needs power, correct connector, and a backplane that isn’t doing interpretive dance. “Spins up” is not the same as “data link established.” SAS especially will happily power a drive while the link is down due to a bad lane.

Layer 2: Interposer/backplane/expander translation

SAS backplanes can include expanders, multiplexers, and “helpful” logic. A single marginal lane can drop a disk, or worse, make it flap under load. SATA behind SAS expanders works—until it doesn’t, depending on the expander, drive firmware, and cabling.

Layer 3: HBA/controller firmware and mode

HBAs can run as real HBAs (IT mode) or pretend RAID controllers (IR/RAID mode). Proxmox + ZFS wants boring pass-through. RAID personality can hide drives behind virtual volumes, block SMART, and complicate error recovery.

Layer 4: PCIe enumeration and lane budget

The controller itself is a PCIe device. If the motherboard doesn’t enumerate it, Linux can’t either. PCIe bifurcation settings, slot wiring, and lane sharing with M.2/U.2 can quietly make a slot “physical x16” but electrically x4—or x0, if you anger the lane gods.

Layer 5: Linux kernel drivers + device node creation

Even when the hardware is fine, the kernel might not bind the correct driver, or udev might not create nodes the way you expect. Multipath can intentionally hide individual paths. Old initramfs can miss modules. The disks might exist but under different names.

Layer 6: Proxmox storage presentation

Proxmox VE is Debian under a UI. If Debian can’t see it, Proxmox can’t. If Debian can see it but the UI doesn’t show it where you’re looking, that’s a workflow problem, not a hardware problem.

Paraphrased idea from John Allspaw: reliability comes from responding well to failure, not pretending failure won’t happen.

Joke #1: “RAID mode will make ZFS happy” is like saying “I put a steering wheel on the toaster; now it’s a car.”

Interesting facts and history that actually helps troubleshooting

  • SCSI scanning is old… and still here. Modern SAS and even some SATA stacks still rely on SCSI host scans, which is why rescans can “find” drives without a reboot.
  • LSI’s SAS HBAs became the de facto standard in homelabs and enterprises. Broadcom/Avago/LSI lineage matters because driver naming (mpt2sas/mpt3sas) and firmware tooling assumptions follow it.
  • IT mode became popular because filesystems got smarter. ZFS and similar systems want direct disk visibility. RAID controllers were built for an era where the controller owned integrity.
  • SFF-8087 and SFF-8643 look like “just cables” but are signal systems. A partially-seated mini-SAS can power drives and still fail data lanes. It’s not magic; it’s differential pairs and tolerance.
  • PCIe slots lie by marketing. “x16 slot” often means “x16 connector.” Electrically it might be x8 or x4 depending on CPU and board routing.
  • UEFI changed option ROM behavior. Some storage cards rely on option ROMs for boot-time enumeration screens; UEFI settings can hide those screens without changing what Linux sees.
  • NVMe brought its own detection path. NVMe devices aren’t “SCSI disks” and won’t show up in SAS HBA tools; they use the NVMe subsystem and PCIe link training.
  • SMART passthrough is not guaranteed. With RAID controllers, SMART data may be blocked or require vendor tools, which changes how you verify “the disk exists.”

Hands-on tasks (commands + meaning + decision)

These are the tasks I actually run when a node says “no disks.” Each includes what you’re looking at and the decision you make.

Task 1: Confirm the controller is enumerated on PCIe

cr0x@server:~$ lspci -nn | egrep -i 'sas|raid|sata|scsi|nvme'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]

What it means: The motherboard sees the HBA/NVMe controller. If it’s not here, Linux will never see disks behind it.

Decision: Missing device → reseat card, change slot, check BIOS PCIe settings, disable conflicting devices, verify power to risers.

Task 2: Verify kernel driver binding

cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
	Subsystem: Broadcom / LSI SAS9300-8i
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

What it means: The right driver is attached. If “Kernel driver in use” is blank, you’ve got a driver/firmware/blacklist problem.

Decision: No driver bound → check modprobe, kernel logs, Secure Boot, firmware compatibility, and whether you’re using a weird vendor kernel.

Task 3: See what disks Linux created (don’t trust the UI yet)

cr0x@server:~$ lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,HCTL
NAME    TYPE  SIZE MODEL              SERIAL        TRAN HCTL
sda     disk  3.6T ST4000NM0035-1V4    ZC123ABC      sas  3:0:0:0
sdb     disk  3.6T ST4000NM0035-1V4    ZC123DEF      sas  3:0:1:0
nvme0n1 disk  1.8T Samsung SSD 990 PRO S6Z1NZ0R12345 nvme -

What it means: If it’s in lsblk, the kernel sees it. TRAN tells you if it’s sas, sata, nvme.

Decision: Disks absent → move down the stack: dmesg, cabling, expander, power. Disks present but Proxmox “missing” → likely UI/workflow, multipath, or ZFS import.

Task 4: Check kernel logs for link resets/timeouts

cr0x@server:~$ journalctl -k -b | egrep -i 'mpt3sas|megaraid|ahci|nvme|reset|timeout|aer|link down' | tail -n 60
Dec 26 10:12:01 server kernel: mpt3sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Dec 26 10:12:01 server kernel: sd 3:0:1:0: rejecting I/O to offline device
Dec 26 10:12:03 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:03:00.0
Dec 26 10:12:03 server kernel: nvme nvme0: I/O 42 QID 5 timeout, aborting

What it means: “offline device”, “timeout”, “link down”, AER spam = signal integrity, power, or failing device/controller.

Decision: Timeouts on multiple drives → cable/backplane/expander/HBA. Timeouts on one drive → that drive or its slot.

Task 5: List storage controllers the kernel thinks exist

cr0x@server:~$ lsscsi -H
[0]    ata_piix
[2]    mpt3sas
[3]    nvme

What it means: Confirms host adapters. If your HBA driver is loaded, it shows up as a host.

Decision: HBA missing here but present in lspci → driver didn’t load or failed to initialize.

Task 6: Inspect SCSI hosts and rescan for devices

cr0x@server:~$ ls -l /sys/class/scsi_host/
total 0
lrwxrwxrwx 1 root root 0 Dec 26 10:10 host0 -> ../../devices/pci0000:00/0000:00:17.0/ata1/host0/scsi_host/host0
lrwxrwxrwx 1 root root 0 Dec 26 10:10 host2 -> ../../devices/pci0000:00/0000:03:00.0/host2/scsi_host/host2
cr0x@server:~$ for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done

What it means: Forces a scan of all SCSI hosts. If disks appear after this, detection is timing/hotplug/expander behavior.

Decision: If rescans consistently “fix it,” check BIOS hotplug, staggered spin-up, expander firmware, and HBA firmware.

Task 7: Check SATA/AHCI detection (onboard ports)

cr0x@server:~$ dmesg -T | egrep -i 'ahci|ata[0-9]|SATA link' | tail -n 40
[Thu Dec 26 10:10:12 2025] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[Thu Dec 26 10:10:13 2025] ata1: SATA link down (SStatus 0 SControl 300)
[Thu Dec 26 10:10:13 2025] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

What it means: “link down” on a port with a drive means cabling/port disabled in BIOS/power.

Decision: If ports are link down across the board, check BIOS SATA mode (AHCI), and whether the board disabled SATA when M.2 is populated.

Task 8: Enumerate NVMe devices and controller health

cr0x@server:~$ nvme list
Node             SN               Model                          Namespace Usage                      Format           FW Rev
---------------- ---------------- -------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1      S6Z1NZ0R12345    Samsung SSD 990 PRO 2TB        1         1.80  TB / 2.00  TB        512   B +  0 B   5B2QJXD7

What it means: NVMe is present as its own subsystem. If nvme list is empty but lspci shows the controller, it can be driver, PCIe ASPM, or link issues.

Decision: Empty list → check journalctl -k for NVMe errors, BIOS settings for PCIe Gen speed, and slot bifurcation (for multi-NVMe adapters).

Task 9: Confirm stable disk identifiers (what you should use for ZFS)

cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep -i 'wwn|nvme|scsi' | head
lrwxrwxrwx 1 root root  9 Dec 26 10:15 nvme-Samsung_SSD_990_PRO_2TB_S6Z1NZ0R12345 -> ../../nvme0n1
lrwxrwxrwx 1 root root  9 Dec 26 10:15 scsi-35000c500a1b2c3d4 -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 26 10:15 scsi-35000c500a1b2c3e5 -> ../../sdb
lrwxrwxrwx 1 root root  9 Dec 26 10:15 wwn-0x5000c500a1b2c3d4 -> ../../sda

What it means: These IDs survive reboots and device renames (sda becoming sdb after hardware changes).

Decision: If your pool/import scripts use /dev/sdX, stop. Migrate to by-id/by-wwn before your next maintenance window eats you.
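
What “use stable IDs” looks like at pool-creation time; the pool name and the WWNs below reuse the example IDs above and are placeholders, not a layout recommendation:

# Build the pool from stable WWN paths instead of /dev/sdX so device renames can't bite you.
zpool create -o ashift=12 tank mirror \
  /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 \
  /dev/disk/by-id/wwn-0x5000c500a1b2c3e5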

Task 10: Check SMART visibility (tells you if you’re really seeing the disk)

cr0x@server:~$ smartctl -a /dev/sda | head -n 20
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos 7E8
Device Model:     ST4000NM0035-1V4
Serial Number:    ZC123ABC
LU WWN Device Id: 5 000c50 0a1b2c3d4
Firmware Version: SN03
User Capacity:    4,000,787,030,016 bytes [4.00 TB]

What it means: If SMART works, you likely have true pass-through visibility. If SMART fails behind a RAID controller, you may need different device types or vendor utilities.

Decision: SMART blocked + you want ZFS → verify HBA IT mode or true HBA, not RAID personality.

Task 11: Detect if multipath is hiding your disks

cr0x@server:~$ multipath -ll
mpatha (3600508b400105e210000900000490000) dm-0 IBM,2810XIV
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:0 sda 8:0  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 4:0:0:0 sdb 8:16 active ready running

What it means: Multipath can present disks as /dev/dm-* and the raw /dev/sdX are just paths.

Decision: If these are local disks, multipath is a misconfiguration. Disable it or blacklist local HBAs, otherwise Proxmox/ZFS may see “weird” devices.
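
A minimal sketch of “ignore local disks,” assuming you keep multipath around for shared storage; the WWIDs are placeholders you would copy from multipath -ll:

# Append a blacklist for the local SAS disks, then tell multipathd to re-read its config.
cat >> /etc/multipath.conf <<'EOF'
blacklist {
    wwid "35000c500a1b2c3d4"
    wwid "35000c500a1b2c3e5"
}
EOF
multipathd reconfigure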

Task 12: Identify if you’re on a RAID controller with virtual disks

cr0x@server:~$ lspci -nn | egrep -i 'megaraid|raid'
04:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [1000:005d] (rev 02)
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL
NAME  TYPE  SIZE MODEL
sda   disk  7.3T LSI MR9361-8i

What it means: If the model looks like the controller and you only see one “disk,” you’re looking at a virtual drive, not physical disks.

Decision: For ZFS: either switch to an HBA / IT mode, or accept that ZFS can’t manage individual disks properly.

Task 13: Check if BIOS disabled the slot or remapped lanes

cr0x@server:~$ dmidecode -t baseboard | egrep -i 'Manufacturer|Product|Version'
Manufacturer: Supermicro
Product Name: X11SPH-NCTF
Version: 1.02
cr0x@server:~$ lspci -vv -s 03:00.0 | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 8GT/s, Width x8
LnkSta: Speed 2.5GT/s (downgraded), Width x8

What it means: Link downgraded to 2.5GT/s suggests signal integrity issues, wrong slot generation forcing, or bad riser/cable.

Decision: Downgraded links with errors → try forcing Gen3/Gen4 in BIOS, move slots, replace riser, check seating.

Task 14: Proxmox-specific: confirm the kernel and modules match expectations

cr0x@server:~$ uname -r
6.8.12-4-pve
cr0x@server:~$ modinfo mpt3sas | egrep -i 'filename|version|firmware'
filename:       /lib/modules/6.8.12-4-pve/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
version:        44.100.00.00
firmware:       mpt3sas_fw.bin

What it means: Confirms you’re using the Proxmox kernel and the module exists. Mismatched kernels/initramfs can bite after upgrades.

Decision: If module missing or wrong kernel, fix packages and regenerate initramfs before chasing hardware ghosts.
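
The boring fix order on Debian/Proxmox is initramfs first, then boot entries; proxmox-boot-tool only applies if the node was set up with it:

# Rebuild initramfs for every installed kernel, then refresh boot entries where relevant.
update-initramfs -u -k all
proxmox-boot-tool status     # tells you whether this node boots via proxmox-boot-tool
proxmox-boot-tool refresh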

HBA, BIOS/UEFI, and PCIe: the usual crime scene

HBA mode: IT vs IR/RAID (and why Proxmox cares)

If you’re running ZFS (and many Proxmox shops are), you want the HBA to present each physical disk directly to Linux. That’s IT mode in LSI/Broadcom terms. RAID mode (IR) is a different product philosophy: the controller abstracts disks into logical volumes. That abstraction breaks several things you rely on in modern ops:

  • Accurate SMART/health per disk (often blocked or weird).
  • Predictable disk identities (WWNs may be hidden or replaced).
  • Clear error surfaces (timeouts may become “controller says no”).
  • ZFS’s ability to manage redundancy and self-heal with full visibility.

Also: RAID controllers tend to have write caches, BBUs, and policies that are great until they’re not. ZFS already does its own consistency story. You don’t need two captains steering one ship. You get seasickness.

UEFI settings that silently impact detection

BIOS/UEFI can hide or break your storage without dramatic error messages. The most common settings to audit when disks vanish:

  • SATA mode: AHCI vs RAID. On servers, RAID mode can route ports through an Intel RST-like layer Linux may not handle the way you expect.
  • PCIe slot configuration: Gen speed forced vs auto; bifurcation x16 → x4x4x4x4 for multi-NVMe adapters.
  • Option ROM policy: UEFI-only vs Legacy. This mostly affects boot visibility and management screens, but misconfiguration can mask what you think “should” appear pre-boot.
  • IOMMU/VT-d/AMD-Vi: Not usually a disk-detection breaker, but it can change device behavior with passthrough setups.
  • Onboard storage disablement: Some boards disable SATA ports when M.2 slots are occupied, or share lanes with PCIe slots.

PCIe lane sharing: the modern “why did my slot stop working?”

Motherboards are traffic cops. Put an NVMe in one M.2 slot and your HBA might drop from x8 to x4, or the adjacent slot may get disabled. This is not “bad design.” It’s economics and physics: CPUs have finite lanes, and board vendors multiplex them in ways that require you to read the fine print.

If you see a controller present but unstable (AER errors, link down/up), lane or signal integrity issues are very much on the table. Risers, especially, love to be “mostly fine.”

Joke #2: A PCIe riser that “works if you don’t touch the chassis” is less a component and more a lifestyle choice.

Cabling, backplanes, expanders, and “it’s seated” lies

Mini-SAS connectors: why partial failure is common

SAS cables carry multiple lanes. A single SFF-8643 can carry four SAS lanes; a backplane may map lanes to individual drive bays. If one lane goes bad, you don’t always lose all drives. You lose “some bays,” often in a pattern that looks like software.

Practical rule: if disks are missing in a repeating bay pattern (e.g., bays 1–4 fine, 5–8 dead), suspect a specific mini-SAS cable or port. Don’t spend an hour in udev for a problem that lives in copper.

Backplanes with expanders: nice when they work

Expanders let you connect many drives to fewer HBA ports. They also add a layer that can have firmware bugs, negotiation quirks, and sensitivity to SATA drives behind SAS expanders. Symptoms include:

  • Disks appear after boot but disappear under load.
  • Intermittent “device offlined” messages.
  • Only some drive models misbehave.

When that happens, you don’t “tune Linux.” You validate the expander firmware, swap cables, isolate by connecting fewer bays, and test with a known-good disk model.

Power delivery and spin-up

Especially in dense chassis, power can be the silent killer. Drives may spin but brown out during link training or when multiple drives spin simultaneously. Some HBAs and backplanes support staggered spin-up. Some don’t. Some support it and ship misconfigured.

A telltale sign is multiple drives dropping at the same time during boot or scrub, then reappearing later. That’s not a “Proxmox thing.” That’s power or signal.

Simple physical checks that beat cleverness

  • Reseat both ends of mini-SAS cables. Do not “press gently.” Disconnect, inspect, reconnect firmly.
  • Swap cables between known-good and suspected-bad ports to see if the problem follows the cable.
  • Move a disk to another bay. If the disk works elsewhere, the bay/backplane lane is suspect.
  • If you can, temporarily connect one disk directly to an HBA port (bypass expander/backplane) to isolate layers.

Linux/Proxmox layer: drivers, udev, multipath, and device nodes

Driver presence is not driver health

Seeing mpt3sas loaded doesn’t guarantee the controller initialized properly. Firmware mismatch can produce partial functionality: controller enumerates, but no targets show; or targets show but error constantly.

Kernel logs matter more than module lists. If you see repeated resets, “firmware fault,” or queues stuck, treat it like a real incident: collect logs, stabilize hardware, and consider firmware updates.

Multipath: helpful until it’s not

Multipath is designed for SANs and dual-path storage. On a Proxmox node with local SAS disks, it’s usually accidental and harmful. It can mask the devices you expect, or it can create device-mapper nodes that Proxmox/ZFS will use inconsistently if you aren’t deliberate.

If you’re not explicitly using multipath for shared storage, you generally want it disabled or configured to ignore local disks.

Device naming: /dev/sdX is a trap

Linux assigns /dev/sdX names in discovery order. Add a controller, reorder cables, or change BIOS boot settings and the order changes. That’s how you import the wrong disks, wipe the wrong device, or build a pool on the wrong members.

Use /dev/disk/by-id or WWNs. Make it policy. Your future self will quietly thank you.

When Proxmox “doesn’t show disks” but Linux does

Common realities:

  • The disks have old partitions and Proxmox UI filters what it considers “available.”
  • ZFS is already using the disks (they belong to an imported pool or a stale pool). ZFS won’t politely share.
  • You’re looking in the wrong place: node disks vs storage definitions vs datacenter view.
  • Multipath or device-mapper is presenting different names than you expect.
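
Before touching anything, three read-only questions worth answering from the CLI (the WWN path reuses the example ID from Task 9; all of these are non-destructive as written):

# 1) Does ZFS already claim or remember these disks?
zpool status
zpool import                  # with no arguments this only lists importable pools; it imports nothing
# 2) What signatures live on the disk? (-n reports only; it never wipes)
wipefs -n /dev/disk/by-id/wwn-0x5000c500a1b2c3d4
# 3) What does the kernel think the layout is?
lsblk -f /dev/disk/by-id/wwn-0x5000c500a1b2c3d4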

ZFS angle: why “RAID mode” is not your friend

Proxmox ships with first-class ZFS support. ZFS assumes it is in charge of redundancy, checksums, and healing. Hardware RAID assumes it is in charge of redundancy and error recovery. When you stack them, you create a system where each layer makes decisions without full information.

What “works” but is still wrong

  • Creating one huge RAID0/RAID10 volume and putting ZFS on it: ZFS loses per-disk visibility and can’t isolate failing members.
  • Using RAID controller caching with ZFS sync writes: you can accidentally lie to ZFS about durability if the cache policy is unsafe.
  • Assuming the controller will surface disk errors cleanly: it may remap, retry, or mask until it can’t.

What you should do instead

  • Use an HBA (or flash the controller to IT mode) and present raw disks to ZFS.
  • Use stable IDs when creating pools.
  • Prefer boring, well-tested firmware combinations. Bleeding edge is great for lab work, not for your cluster quorum.

Common mistakes: symptom → root cause → fix

1) Symptom: HBA not in lspci

Root cause: Card not seated, dead slot, lane sharing disabled the slot, riser failure, or BIOS disabled that slot.

Fix: Reseat, try another slot, remove riser, check BIOS “PCIe slot enable,” check lane sharing with M.2/U.2, update BIOS if it’s ancient.

2) Symptom: HBA in lspci but no disks in lsblk

Root cause: Driver not bound, firmware mismatch, HBA in a mode requiring vendor stack, broken cable/backplane preventing target discovery.

Fix: Verify lspci -k, check journalctl -k, rescan SCSI hosts, swap cables, validate HBA firmware and mode (IT for ZFS).
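
A sketch of those first three checks (host numbers and the exact driver name vary; adjust to what you actually run):

cr0x@server:~$ lspci -k | grep -i -A3 -E 'sas|raid'
cr0x@server:~$ journalctl -k | grep -iE 'mpt3sas|firmware fault|reset'
cr0x@server:~$ for h in /sys/class/scsi_host/host*/scan; do echo '- - -' | sudo tee "$h" > /dev/null; done
cr0x@server:~$ lsblk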

3) Symptom: Some bays missing in a pattern

Root cause: One SAS lane/cable/port down; backplane mapping aligns with the missing set.

Fix: Swap mini-SAS cable; move to other HBA port; reseat connector; check for bent pins/damage.

4) Symptom: Disks appear after rescan but vanish after reboot

Root cause: Hotplug timing, expander quirks, staggered spin-up misconfigured, marginal power at boot.

Fix: Update HBA/backplane/expander firmware, enable staggered spin-up if supported, verify PSU and power distribution, check boot logs for resets.

5) Symptom: NVMe not detected, but works in another machine

Root cause: Slot disabled due to bifurcation settings, PCIe Gen forced too high/low, lane sharing with SATA, or adapter needs bifurcation.

Fix: Set correct bifurcation, set PCIe speed to Auto/Gen3/Gen4 appropriately, move to CPU-attached slot, update BIOS.

6) Symptom: Proxmox GUI doesn’t show disks, but lsblk does

Root cause: Existing partitions/LVM metadata, ZFS already claims them, multipath device presentation, or you’re looking at the wrong UI view.

Fix: Use CLI to confirm by-id, check zpool status/zpool import, check multipath -ll, wipe signatures only when you’re sure.

7) Symptom: SMART fails with “cannot open device” behind controller

Root cause: RAID controller abstraction; SMART passthrough requires special device type or isn’t supported.

Fix: Use HBA/IT mode for ZFS; otherwise use vendor tooling and accept limitations.
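
If you're stuck behind a MegaRAID-style controller for now, smartctl can often still reach the physical drives with a device-type hint (a sketch, assuming an LSI/MegaRAID controller; the index after megaraid, is controller-specific):

cr0x@server:~$ sudo smartctl --scan
cr0x@server:~$ sudo smartctl -a -d megaraid,0 /dev/sda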

8) Symptom: Disks flap under load, ZFS sees checksum errors

Root cause: Cable/backplane/expander signal integrity or insufficient power; sometimes one drive is poisoning the bus.

Fix: Replace cables first, isolate by removing disks, check dmesg for resets, validate PSU and backplane health.

Checklists / step-by-step plan

Checklist A: “Installer can’t see any disks”

  1. Enter BIOS/UEFI and confirm the controller is enabled and visible.
  2. Confirm SATA mode is AHCI (unless you explicitly need RAID for a boot volume).
  3. For HBA: verify it’s in IT mode or true HBA (not MegaRAID virtual volumes) if you want ZFS.
  4. Move the HBA to a different PCIe slot (prefer CPU-attached slots).
  5. Boot a rescue environment and run lspci and dmesg. If it’s missing there, it’s hardware.
  6. Swap mini-SAS cables and re-seat connectors at both ends.
  7. If using a backplane expander: try a direct-attach test with one disk.

Checklist B: “Some disks missing behind HBA”

  1. Run lsblk and identify which bays are missing; look for patterns.
  2. Check logs for link resets and offline devices.
  3. Rescan SCSI hosts; see if missing disks appear.
  4. Swap the cable feeding the missing-bay set.
  5. Move the cable to another HBA port; see if the missing set moves.
  6. Move one missing drive to a known-good bay; if it appears, the bay/lane is bad.
  7. Update HBA firmware if you’re running a known-problematic revision.

Checklist C: “Disks detected in Linux but not usable in Proxmox”

  1. Confirm stable IDs in /dev/disk/by-id.
  2. Check if ZFS sees an importable pool: zpool import.
  3. Check if disks have signatures: wipefs -n /dev/sdX (-n is the dry-run flag: it reports signatures without erasing anything; keep it until you mean to wipe).
  4. Check multipath: multipath -ll.
  5. Decide your intent: import existing data vs wipe and repurpose.
  6. If wiping, do it deliberately and document which WWNs you wiped.
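
A sketch of the dry-run-first pattern for steps 3 and 6 (the by-id path is a placeholder):

cr0x@server:~$ sudo wipefs -n /dev/disk/by-id/wwn-0x5000c500aaaaaaaa    # -n reports signatures, changes nothing
cr0x@server:~$ sudo wipefs -a /dev/disk/by-id/wwn-0x5000c500aaaaaaaa    # -a actually erases them; run only after you've documented the WWN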

Checklist D: “NVMe not showing up”

  1. Confirm controller in lspci.
  2. Check nvme list and kernel logs for timeouts.
  3. Inspect PCIe link status (LnkSta) for downgrades.
  4. Set correct bifurcation for multi-NVMe adapters.
  5. Move the NVMe to another slot and retest.
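
For step 3, a minimal link-status check might look like this (the PCIe address is a placeholder; take yours from the lspci listing):

cr0x@server:~$ lspci | grep -i nvme
cr0x@server:~$ sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'

If LnkSta shows a lower speed or narrower width than LnkCap, the link negotiated down; look at risers, slot choice, and bifurcation before blaming the drive.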

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The team was rolling out a new Proxmox cluster for internal CI workloads. The storage plan was “simple”: eight SAS drives per node, ZFS mirrors, done. Procurement delivered servers with a “SAS RAID controller” instead of the requested HBA. Nobody panicked because the controller still had “SAS” in the name and the BIOS showed a giant logical disk.

They installed Proxmox on that logical volume and built ZFS pools on top of whatever the controller exposed. It worked fine for a few weeks, which is how bad assumptions get promoted to “design decisions.” Then a drive started failing. The controller remapped and retried in ways ZFS couldn’t observe, and the node began stalling during scrubs. The logs were full of timeouts but nothing that mapped cleanly to a physical bay.

During the maintenance window, someone pulled the “failed” drive according to the controller UI. The wrong one. The controller had changed its internal numbering after the earlier remap events, and the mapping sheet was outdated. Now the logical volume degraded in a different way, ZFS got angry, and the cluster lost a chunk of capacity during peak pipeline usage.

The fix was unglamorous: swap the RAID controller for a real HBA, rebuild the node, and enforce a policy: ZFS gets raw disks, identified by WWN, and bay mapping is validated with LEDs and serial numbers before anyone pulls hardware. The assumption “SAS equals HBA” was the original root cause, and it cost them a weekend.

Mini-story 2: The optimization that backfired

A different shop had performance problems during ZFS resilvers. Someone suggested “optimizing cabling” by using a single expander backplane to reduce HBA ports and keep the build tidy. Fewer cables, fewer failure points, right?

In practice, the expander introduced a subtle behavior: during heavy I/O, a couple of SATA SSDs (used as special vdevs) would intermittently drop for a few seconds, then return. The HBA and kernel would log link resets, and ZFS would mark devices as faulted or degraded depending on timing. The symptom looked like “ZFS is flaky” because the drops were transient.

The team tried tuning timeouts and queue depths, because engineers like knobs and the expander looked “enterprise.” The tuning reduced the obvious errors but didn’t solve the underlying issue. Under a real incident—node reboot plus simultaneous VM recovery—the devices flapped again and the pool refused to import cleanly without manual intervention.

They backed out the “optimization.” Direct-attach the SSDs, keep the expander for the bulk HDDs where latency wasn’t as sensitive, and standardize drive models behind the expander. Performance improved, and so did sleep. Sometimes fewer cables is just fewer clues when it breaks.

Mini-story 3: The boring but correct practice that saved the day

One team had a habit that looked pedantic: every disk was recorded by WWN and bay location at install time. They kept a simple sheet: chassis serial, bay number, drive serial, WWN, and the intended ZFS vdev membership. They also labeled cables by HBA port and backplane connector. Nobody loved doing it, but it was policy.

A year later, a node started reporting intermittent checksum errors during scrubs. The logs suggested a flaky link, not a failing disk, but the pool topology included twelve drives and a backplane expander. In the old world, this would devolve into “pull drives until the errors stop.” That’s how you create new incidents.

Instead, they correlated the affected WWN with the bay. The errors were always on disks in bays 9–12. That matched a single mini-SAS cable feeding that section of the backplane. They swapped the cable during a short maintenance window, scrubbed again, and the errors disappeared.

No drama. No guessing. The boring inventory practice turned a potentially messy incident into a 20-minute fix with a clear root cause. Reliability is often just bookkeeping with conviction.

FAQ

1) Proxmox installer shows no disks. Is it always an HBA driver issue?

No. If lspci doesn’t show the controller, it’s BIOS/PCIe/hardware. If the controller shows but no disks, then it might be driver/firmware/cabling.

2) I see disks in BIOS but not in Linux. How is that possible?

BIOS may show RAID virtual volumes or a controller summary without exposing targets to Linux. Or Linux lacks the right module, or the controller fails initialization during boot (check journalctl -k).

3) Do I need IT mode for Proxmox?

If you use ZFS and want sane operations, yes. If you insist on hardware RAID, you can run it, but you’re choosing a different operational model with different tooling.

4) Why do disks show up as /dev/dm-0 instead of /dev/sda?

Usually multipath or device-mapper stacking (LVM, dm-crypt). For local disks you didn’t intend to multipath, fix multipath config or disable it.

5) My disks appear, but Proxmox GUI doesn’t list them as available. Are they broken?

Often they have existing signatures (old ZFS/LVM/RAID metadata) or are already part of an imported pool. Verify with lsblk, wipefs -n, and zpool import before doing anything destructive.

6) Can a bad SAS cable really cause only one disk to disappear?

Yes. Mini-SAS carries multiple lanes; depending on backplane mapping, a lane issue can isolate a single bay or a subset. Patterns are your friend.

7) NVMe not detected: what’s the single most common BIOS mistake?

Wrong bifurcation settings when using multi-NVMe adapters, or lane sharing that disables the slot when another M.2/U.2 is populated.

8) Should I force PCIe Gen speed to fix link issues?

Sometimes forcing a lower Gen speed stabilizes flaky links (useful for diagnosis), but the real fix is usually seating, risers, cabling, or board/slot choice.

9) How do I decide between “replace disk” and “replace cable/backplane”?

If multiple disks show errors on the same HBA port/backplane segment, suspect the cable or backplane. If the errors follow one disk as you move it across bays, it's the disk.

10) Is it safe to rescan SCSI hosts on a production node?

Generally yes, but do it with situational awareness. Rescans can trigger device discovery events and log noise. Avoid during sensitive storage operations if you’re already degraded.

Conclusion: practical next steps

If Proxmox can’t see disks, stop guessing and walk the chain: PCIe enumeration → driver binding → link stability → target discovery → stable IDs → Proxmox/ZFS consumption. The fastest wins are usually physical: seating, lane allocation, and cables. The most expensive failures come from the wrong controller mode and sloppy device naming.

  1. Run the fast diagnosis playbook and classify the failure domain in 10 minutes.
  2. Collect evidence: lspci -k, lsblk, and kernel logs around detection time.
  3. Standardize: HBA/IT mode for ZFS, by-id naming, and a bay-to-WWN map.
  4. Fix the root cause, not the symptom: replace suspect cables/risers, correct BIOS bifurcation, update firmware deliberately.
  5. After recovery, do one scrub/resilver test and review logs. If you don’t verify, you didn’t fix it—you just stopped seeing it.

MySQL vs PostgreSQL on a 4GB RAM VPS: What to Set First for Websites

You’ve got a 4GB RAM VPS. A few websites. A database. And now a pager, a ticket, or a client email that says, “The site is slow.” Nothing is more humbling than watching a $10/month box try to be an enterprise platform because someone enabled a plugin that “only runs one query.”

This is a field guide for getting MySQL or PostgreSQL stable and fast enough for website workloads on small VPS hardware. Not a benchmark fantasy. Not a config-dump. The stuff you set first, the stuff you measure first, and the stuff you stop doing before it costs you weekends.

First decision: MySQL or PostgreSQL for websites on 4GB

On a 4GB VPS, the “best database” is the one you can keep predictable under memory pressure and bursty traffic. Your enemy is not theoretical throughput. It’s swap storms, connection floods, and storage latency spikes that turn “fast enough” into “why is checkout timing out?”

Pick MySQL (InnoDB) when:

  • Your stack is already MySQL-native (WordPress, Magento, many PHP apps) and you don’t want to be the person rewriting everything “for fun.”
  • You want a fairly straightforward cache story: the InnoDB buffer pool is the big knob, and it behaves like a big knob.
  • You need replication that’s easy to operate with common tooling, and you’re okay with eventual consistency trade-offs in some modes.

Pick PostgreSQL when:

  • You care about query correctness and rich SQL features (real window functions, CTEs, better constraints and data types) and you’ll actually use them.
  • You want predictable query plans, good observability, and sane defaults for many modern app patterns.
  • You can commit to connection pooling (pgBouncer) because PostgreSQL’s process-per-connection model punishes “just open more connections” on small boxes.

If this is mostly CMS traffic with plugins you don’t control, I’m usually conservative: stay with MySQL unless the app is already Postgres-first. If you’re building something new with a team that writes SQL intentionally, PostgreSQL is often the better long-term deal. But on 4GB, the short-term win is operational simplicity, not philosophical purity.

Rule of thumb: if you can’t describe your top 5 queries and their indexes, you’re not “choosing a database,” you’re choosing which failure modes you’d like to experience first.

Interesting facts & historical context (that actually changes decisions)

  1. MySQL’s early web dominance came from LAMP ubiquity and “good enough” speed for read-heavy sites. That’s why so many website apps still assume MySQL dialect quirks.
  2. InnoDB became the default in MySQL 5.5 (2010 era). If you’re still thinking in MyISAM terms (table locks, no crash recovery), you’re carrying a fossil in your pocket.
  3. PostgreSQL’s MVCC model is one reason it stays consistent under concurrency, but it creates a steady need for vacuuming. Ignore vacuum and the database won’t scream; it’ll just slowly get worse.
  4. PostgreSQL switched to a more parallel-friendly execution model over time (parallel queries, better planner features). On a small VPS this matters less than on big iron, but it’s part of why Postgres “feels modern” for analytics-style queries.
  5. MySQL’s query cache was removed in MySQL 8.0 because it scaled poorly under concurrency. If someone tells you to “enable query_cache_size,” you found a time traveler.
  6. Postgres gets credit for standards and correctness because it historically prioritized features and integrity over early raw speed. Today it’s fast too, but the cultural DNA still shows in defaults and tooling.
  7. Both engines are conservative about durability by default (fsync, WAL/redo). Disabling durability settings makes benchmarks look heroic and postmortems look like crime scenes.
  8. MariaDB diverged from MySQL in significant ways. “MySQL tuning” advice sometimes maps poorly to MariaDB versions and storage engines. Verify what you’re actually running.
  9. RDS and managed services influenced tuning folklore: people copy cloud defaults to VPS, then wonder why a 4GB box behaves like it’s underwater.

Baseline architecture for a 4GB VPS (and why it matters)

On a 4GB VPS, you don’t have “extra memory.” You have a budget. Spend it on caches that reduce I/O, and on headroom that prevents swapping. The OS page cache also matters because both MySQL and PostgreSQL ultimately need filesystem-backed reads, and the kernel is not your enemy; it’s your last line of defense.

Reality-based memory budget

  • OS + SSH + basic daemons: 300–600MB
  • Web server + PHP-FPM: wildly variable. A few hundred MB to multiple GB depending on process counts and app behavior.
  • Database: what’s left, but not all of it. If you give the DB everything, the web tier will OOM or swap when traffic spikes.

For “websites on a single VPS,” the database isn’t isolated. This is one of the few times where “set it and forget it” is not laziness; it’s survival.

Opinion: If you’re running both web and DB on the same 4GB VPS, plan to allocate roughly 1.5–2.5GB to the database cache layer max, unless you’ve measured PHP memory usage under load and it’s truly small. Your goal is stable latency, not a heroic buffer pool.

Joke #1: A 4GB VPS is like a studio apartment—technically you can fit a treadmill in it, but you’ll hate your life and so will your neighbors.

Fast diagnosis playbook: find the bottleneck in 10 minutes

This is the order I check things when “the site is slow” and the database is the prime suspect. Each step tells you whether to look at CPU, memory, connections, locks, or storage.

First: is the box starving (CPU, RAM, swap)?

  • Check load vs CPU count.
  • Check swap activity and major page faults.
  • Check OOM killer history.

Second: is it storage latency (IOPS/fsync/WAL/redo)?

  • High iowait, slow fsync, long commit times, or stalled checkpoints.
  • Look for queue depth and average await times.

Third: is it connection pressure?

  • Too many DB connections or threads.
  • Connection storms from PHP workers.
  • Thread/process counts hitting RAM.

Fourth: is it locks or long transactions?

  • MySQL: metadata locks, InnoDB row locks, long-running transactions.
  • Postgres: blocked queries, idle-in-transaction sessions, vacuum blocked by old snapshots.

Fifth: is it “bad queries + missing indexes”?

  • Slow query logs / pg_stat_statements show the top offenders.
  • Look for full table scans and “filesort”/temp tables or sequential scans with huge row counts.

That’s it. Don’t start by changing random knobs. Don’t copy a “high performance my.cnf” from a 64GB database server. Measure, then choose one change you can explain.

Paraphrased idea from John Allspaw: production is where assumptions go to die, so design and operate for learning, not certainty.

Practical tasks: commands, outputs, and what you do next

These are real tasks you can run on a Linux VPS. Each includes: the command, what typical output means, and the decision you make. Run them in order when you’re triaging or setting baselines.

Task 1: Confirm basic system pressure (CPU, RAM, swap)

cr0x@server:~$ uptime
 14:22:19 up 36 days,  3:18,  1 user,  load average: 5.84, 5.12, 3.90

What it means: On a 2 vCPU VPS, load averages above ~2–3 for sustained periods often mean runnable queue backups (CPU) or uninterruptible I/O waits.

Decision: If load is high, immediately check iowait and memory/swap before touching database configs.

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       3.3Gi       120Mi        90Mi       420Mi       220Mi
Swap:          1.0Gi       860Mi       164Mi

What it means: Swap in use with low “available” memory is a red flag on latency-sensitive web workloads.

Decision: If swap is being used during peak, reduce DB memory footprint and/or reduce PHP-FPM workers. Consider enabling a small amount of swap only as a crash cushion, not a performance tier.

Task 2: Check for active swapping and iowait

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1 880000 120000  28000 380000   12   45   210   480  520  900 18  8 40 34  0
 3  0 880000 118000  28000 381000    0   60   160   510  600 1100 20  9 35 36  0
 1  2 880000 115000  28000 379000   20    0   400   200  480  850 15  7 38 40  0
 4  0 880000 119000  28000 382000    0    8    90   520  610 1200 22 10 32 36  0
 2  1 880000 117000  28000 381500    0   35   200   600  590 1050 19  9 34 38  0

What it means: Non-zero si/so indicates swapping. High wa indicates I/O wait.

Decision: If swapping is active, stop tuning queries and fix memory/connection counts. If iowait is high, focus on storage latency, checkpoint/flush behavior, and reducing write amplification.

Task 3: Identify disk latency and queueing quickly

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.20    0.00    8.10   34.50    0.00   39.20

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  aqu-sz
vda              12.0    48.0   900.0  6200.0    0.2     3.1   92.0   21.5    1.8

What it means: %util near 100% and await > 10–20ms on a VPS disk usually means your storage is the bottleneck.

Decision: Reduce writes (batching, autovacuum tuning, InnoDB flush tuning), move DB to better storage, or separate DB and web. On some VPS plans, the only “tuning” that matters is paying for higher IOPS.

Task 4: Confirm which DB you’re running and version details

cr0x@server:~$ mysql --version
mysql  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))

What it means: Major versions change defaults and remove knobs (like query cache). Version tells you what advice is even applicable.

Decision: Don’t apply tuning recipes unless they match your major version and flavor (MySQL vs MariaDB).

cr0x@server:~$ psql --version
psql (PostgreSQL) 16.1 (Ubuntu 16.1-1.pgdg22.04+1)

What it means: Newer Postgres versions improve vacuum, WAL, and planner behavior. That changes “what hurts” on small boxes.

Decision: On old Postgres, you may need more manual babysitting. On newer Postgres, focus more on connection pooling and autovacuum thresholds.

Task 5: Count DB connections (MySQL)

cr0x@server:~$ mysql -e "SHOW STATUS LIKE 'Threads_connected';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 185   |
+-------------------+-------+

What it means: 185 connections on a 4GB VPS with PHP is often a problem, even before queries get slow.

Decision: Cap application concurrency, enable persistent connections carefully, or move to a pattern that limits DB concurrency (queueing at app, caching, or splitting read traffic). If you can’t control the app, lower max_connections and accept controlled failures over total collapse.

Task 6: Count DB connections (PostgreSQL)

cr0x@server:~$ sudo -u postgres psql -c "SELECT count(*) AS connections FROM pg_stat_activity;"
 connections
-------------
         142
(1 row)

What it means: 142 Postgres sessions equals 142 backend processes. On a 4GB VPS, that’s a memory and context-switch tax.

Decision: Install pgBouncer and drop max_connections. On small boxes, Postgres without pooling is a performance prank you play on yourself.

Task 7: Find long-running queries and blockers (PostgreSQL)

cr0x@server:~$ sudo -u postgres psql -c "SELECT pid, now()-query_start AS age, state, wait_event_type, wait_event, left(query,80) AS q FROM pg_stat_activity WHERE state <> 'idle' ORDER BY age DESC LIMIT 5;"
 pid  |   age    | state  | wait_event_type | wait_event |                                       q
------+----------+--------+-----------------+------------+--------------------------------------------------------------------------------
 9123 | 00:02:18 | active | Lock            | relation   | UPDATE orders SET status='paid' WHERE id=$1
 9051 | 00:01:44 | active | IO              | DataFileRead | SELECT * FROM products WHERE slug=$1
(2 rows)

What it means: Lock waits point to contention; IO waits point to slow storage or cache misses.

Decision: If Lock waits dominate, fix transaction scope and indexing. If IO waits dominate, increase effective caching (within reason) and reduce random reads via indexes and query shaping.

Task 8: Find lock waits (MySQL)

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;"
Id	User	Host	db	Command	Time	State	Info
210	app	10.0.0.12:50344	shop	Query	75	Waiting for table metadata lock	ALTER TABLE orders ADD COLUMN foo INT
238	app	10.0.0.15:38822	shop	Query	12	Sending data	SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY

What it means: Metadata locks can freeze writes and reads behind schema changes, depending on operation and version.

Decision: Stop doing online schema changes casually on a single small VPS. Schedule maintenance or use online schema migration tools designed to reduce locking.

Task 9: Check InnoDB buffer pool hit rate and read pressure

cr0x@server:~$ mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+---------+
| Variable_name                         | Value   |
+---------------------------------------+---------+
| Innodb_buffer_pool_read_requests      | 9823412 |
| Innodb_buffer_pool_reads              | 412390  |
+---------------------------------------+---------+

What it means: reads are physical reads; read_requests are logical. If physical reads are high relative to requests, you’re missing cache.

Decision: If the working set fits in RAM, increase innodb_buffer_pool_size cautiously. If it doesn’t fit, prioritize indexes and reducing the working set (fewer columns, fewer scans).

Task 10: Check Postgres cache and temp file spills

cr0x@server:~$ sudo -u postgres psql -c "SELECT datname, blks_hit, blks_read, temp_files, temp_bytes FROM pg_stat_database ORDER BY temp_bytes DESC LIMIT 5;"
  datname  | blks_hit | blks_read | temp_files |  temp_bytes
-----------+----------+-----------+------------+--------------
 appdb     |  9201123 |   612332  |      1832  | 2147483648
(1 row)

What it means: Lots of temp_bytes suggests sorts/hashes spilling to disk because work_mem is too small for those operations—or queries are doing too much.

Decision: Don’t crank work_mem globally on a small VPS. Fix queries and indexes first; then raise work_mem per-role or per-session for specific workloads.

Task 11: See top queries (Postgres, if pg_stat_statements is enabled)

cr0x@server:~$ sudo -u postgres psql -c "SELECT calls, mean_exec_time, rows, left(query,80) AS q FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
 calls | mean_exec_time | rows |                                       q
-------+----------------+------+--------------------------------------------------------------------------------
 82021 |          12.45 |    1 | SELECT id FROM sessions WHERE token=$1
  1220 |         210.12 |  300 | SELECT * FROM orders WHERE user_id=$1 ORDER BY created_at DESC LIMIT 50
(2 rows)

What it means: High total time queries are your budget burners. High call-count queries are your “death by a thousand cuts.”

Decision: Index the hot paths and reduce chatty queries. If a query runs 80k times and takes 12ms, that’s a full core’s worth of regret.

Task 12: Enable and read MySQL slow query log quickly

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log=ON; SET GLOBAL long_query_time=0.5; SET GLOBAL log_queries_not_using_indexes=ON;"
...output omitted...

What it means: You’re turning on evidence collection. Keep thresholds reasonable so you don’t DOS yourself with logs.

Decision: Collect for 15–60 minutes during peak, then use the data to fix the worst offenders. Turn off log_queries_not_using_indexes if it’s too noisy for your app.

cr0x@server:~$ sudo tail -n 5 /var/log/mysql/mysql-slow.log
# Query_time: 1.204  Lock_time: 0.000 Rows_sent: 50  Rows_examined: 84512
SELECT * FROM orders WHERE user_id=123 ORDER BY created_at DESC LIMIT 50;

What it means: Rows examined is huge relative to rows sent: classic missing index or wrong index order.

Decision: Add/adjust composite indexes to match filter + sort pattern (e.g., (user_id, created_at)), then verify with EXPLAIN.
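
A sketch of that fix for the query above (index name is arbitrary, and I'm assuming the orders table lives in the shop schema seen in Task 8):

cr0x@server:~$ mysql shop -e "ALTER TABLE orders ADD INDEX idx_user_created (user_id, created_at);"
cr0x@server:~$ mysql shop -e "EXPLAIN SELECT * FROM orders WHERE user_id=123 ORDER BY created_at DESC LIMIT 50\G"

Adding an index still costs I/O and can block writes briefly, so do it in a quiet window; Task 8's metadata-lock warning applies here too.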

Task 13: Check filesystem space and inode pressure

cr0x@server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        80G   74G  2.1G  98% /

What it means: Near-full disks destroy performance and can break DB writes. Postgres can panic and shut down when it can't write WAL; MySQL may error, stall, or effectively go read-only depending on circumstances.

Decision: Free space now. Then set up log rotation and a monitoring alert at 80–85%.

Task 14: Check kernel and DB process memory usage

cr0x@server:~$ ps -eo pid,comm,rss,pmem --sort=-rss | head
 2311 mysqld     1854320 46.2
 1822 php-fpm8.1  412000 10.2
 1825 php-fpm8.1  405000 10.1
  911 postgres    220000  5.4

What it means: RSS shows actual resident memory. A few large PHP workers plus a large DB cache can tip the box into swap.

Decision: If DB + PHP already consume most RAM, stop increasing DB buffers. Reduce concurrency and cap memory consumers.

If you pick MySQL: what to set first on a 4GB VPS

MySQL on small VPS hardware is usually fine if you don’t treat it like a bottomless pit for connections and memory. InnoDB is your default engine; tune for InnoDB, not for nostalgia.

1) Set innodb_buffer_pool_size like an adult

Goal: Cache hot data/indexes, reduce random reads, avoid starving everything else.

  • If DB is on same box as web: start around 1.0–1.5GB.
  • If DB is mostly alone: up to 2.0–2.5GB can work.

Failure mode: Oversizing the buffer pool doesn’t “use free memory.” It competes with the OS page cache and the web tier. Then you swap. Then every query becomes a storage benchmark.

2) Set max_connections lower than you think

MySQL threads consume memory. PHP apps love opening connections like it’s free. It’s not free.

  • Start around 100–200 depending on app and query latency.
  • If you’re seeing 300–800 connections, you don’t have a “database performance issue.” You have a concurrency control issue.

3) Keep redo log and flush behavior sane

On a small VPS with uncertain storage latency, overly aggressive flushing can cause spikes. But turning durability into a suggestion is how you earn a resume update.

  • innodb_flush_log_at_trx_commit=1 for real durability (default).
  • If you absolutely must reduce fsync pressure and can accept losing up to 1 second of transactions in a crash: consider =2. Document it. Put it in incident runbooks. Don’t pretend it’s free.

4) Disable what you don’t need, but don’t blind yourself

Performance Schema is useful; it also costs overhead. On a tiny VPS, you can reduce instrumentation rather than nuking it.

  • If you’re constantly CPU-bound with low query latency, consider trimming Performance Schema consumers.
  • But keep enough visibility to catch regressions. Debugging without metrics is just creative writing.

5) Set temporary table limits carefully

Web apps love ORDER BY and GROUP BY, often with too-wide result sets.

  • tmp_table_size and max_heap_table_size can reduce disk temp tables, but set them too high and you’ll blow memory under concurrency.

MySQL starter config sketch (not a copy-paste religion)

This is the spirit of it for a mixed web+DB 4GB VPS. Adjust based on measurements above.

cr0x@server:~$ sudo cat /etc/mysql/mysql.conf.d/99-vps-tuning.cnf
[mysqld]
innodb_buffer_pool_size = 1G
innodb_buffer_pool_instances = 1
max_connections = 150
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
slow_query_log = ON
long_query_time = 0.5

What it means: Smaller buffer pool to preserve headroom, capped connections, direct I/O to reduce double-caching (depends on your filesystem and workload), and slow query logging for evidence.

Decision: Apply, restart during a quiet window, then re-check swap/iowait and slow logs. If latency improves and swap disappears, you’re on the right path.

If you pick PostgreSQL: what to set first on a 4GB VPS

Postgres is excellent for websites, but it makes you pay attention to three things early: connection counts, vacuum, and WAL/checkpoints. Ignore any of those and you’ll get “random” slowdowns that aren’t random at all.

1) Install connection pooling (pgBouncer) before you “need” it

On 4GB, Postgres backends are not disposable. A traffic spike that opens hundreds of connections can turn into memory pressure and context-switch overhead.

Do: run pgBouncer in transaction pooling mode for typical web workloads.

Don’t: crank max_connections to 500 and call it scaling.
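
A minimal pgBouncer sketch for this setup (database name, auth file contents, and pool sizes are placeholders; the app then connects to port 6432 instead of 5432):

cr0x@server:~$ sudo cat /etc/pgbouncer/pgbouncer.ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 300
default_pool_size = 20

The point: hundreds of client connections fan into a couple dozen real backends, which is all a 4GB box can honestly afford.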

2) Set shared_buffers conservatively

The folklore says “25% of RAM.” On a mixed web+DB VPS, I’d start around:

  • 512MB to 1GB for shared_buffers.

Postgres benefits from OS page cache too. Giving everything to shared_buffers can starve the OS and other processes.

3) Set work_mem low globally; raise it surgically

work_mem is per sort/hash operation, per query, per backend. You don’t have enough RAM for bravado here.

  • Start at 4–16MB globally depending on concurrency.
  • Increase for a specific role or session if you have a known heavy report query.
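
The surgical version looks like this (role name is a placeholder):

cr0x@server:~$ sudo -u postgres psql -c "ALTER ROLE reporting SET work_mem = '64MB';"

Inside a single session you can also wrap the heavy query in BEGIN; SET LOCAL work_mem = '64MB'; ... COMMIT; so the bump dies with the transaction instead of becoming the new global normal.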

4) Keep autovacuum healthy

Autovacuum isn’t optional housekeeping. It’s how Postgres prevents table bloat and keeps index-only scans possible.

  • Monitor dead tuples and vacuum lag.
  • Tune autovacuum thresholds per hot table if needed.
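
Two sketches for those bullets (table name is a placeholder; the scale factors are illustrative starting points for a hot table, not universal truths):

cr0x@server:~$ sudo -u postgres psql -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
cr0x@server:~$ sudo -u postgres psql -c "ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_analyze_scale_factor = 0.05);"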

5) Make checkpoints less spiky

On slow VPS storage, checkpoint spikes show up as random latency cliffs. Smoother checkpoints reduce pain.

  • Increase checkpoint_timeout (within reason).
  • Set checkpoint_completion_target high to spread writes.

Postgres starter config sketch

cr0x@server:~$ sudo cat /etc/postgresql/16/main/conf.d/99-vps-tuning.conf
shared_buffers = 768MB
effective_cache_size = 2304MB
work_mem = 8MB
maintenance_work_mem = 128MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 10min
wal_compression = on
log_min_duration_statement = 500ms

What it means: Conservative shared buffers, realistic cache hinting, modest work memory, smoother checkpoints, and query logging for slow statements.

Decision: Apply and reload/restart, then watch temp file growth and checkpoint timing. If your disk is slow, checkpoint smoothing will show up as fewer latency cliffs.

Connections: the silent killer on small boxes

If you run websites, the easiest way to ruin a database is to let the application decide concurrency. PHP-FPM workers + “open a DB connection per request” becomes a thundering herd. On 4GB, you don’t survive by being faster. You survive by being calmer.

What “too many connections” looks like

  • DB CPU high but not doing useful work (context switching, mutex contention).
  • Memory usage climbs with traffic until swap.
  • Latency increases even for simple queries.

What you do instead

  • Cap app concurrency: fewer PHP-FPM children, or set process manager to avoid explosions.
  • Use pooling: pgBouncer for Postgres; for MySQL, consider pooling at the application layer or ensure persistent connections are configured sanely.
  • Fail fast: sometimes lower max_connections is the right move because it protects the box from total thrash.
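
For the first bullet, a sketch of a capped PHP-FPM pool (worker count is a placeholder; size it from measured per-worker RSS, not optimism):

cr0x@server:~$ sudo grep -E '^pm' /etc/php/8.1/fpm/pool.d/www.conf
pm = static
pm.max_children = 6
pm.max_requests = 500

Six workers at the ~400MB RSS we measured in Task 14 is roughly 2.4GB. Add the database cache and you can see why "just raise max_children" ends in swap.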

Joke #2: Unlimited connections is like unlimited buffet shrimp—sounds great until you realize you’re the one closing the restaurant.

Storage and filesystem realities: IOPS, fsync, and why “fast SSD” lies

On VPS platforms, “SSD storage” can mean anything from respectable NVMe to a shared network block device having a bad day. Databases care about latency more than throughput. A few milliseconds of extra fsync time per commit becomes visible at the website.

How writes hurt you differently in MySQL vs PostgreSQL

  • MySQL/InnoDB: redo logging + doublewrite buffer (depending on config/version) + flushing dirty pages. Bursty flush can amplify latency.
  • PostgreSQL: WAL writes + checkpoints + background writer. Vacuum also creates I/O, and bloat increases future I/O.

Small VPS best practice: reduce write amplification first

  • Fix chatty apps (too many small transactions).
  • Batch writes where correctness allows.
  • Avoid constantly updating “last_seen” columns on every request if you don’t need it.
  • Keep indexes lean; every index is a write tax.

Filesystem gotchas

  • Don’t put databases on flaky network filesystems unless you know the platform guarantees durability semantics.
  • Watch out for disk-full conditions: Postgres and MySQL behave badly in different ways, but none of those ways are “nice.”

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption: “The cache will cover it”

A small team ran a collection of marketing sites and a checkout service on a single 4GB VPS. It had MySQL, Nginx, and PHP-FPM. Traffic was “mostly static,” which was true until a campaign launched and the checkout service started receiving bursts of authenticated requests.

The assumption was that the page cache and application caching would handle reads, so they pushed innodb_buffer_pool_size up near 3GB to “make the database fast.” It looked great in a quiet hour. Then the campaign hit.

PHP-FPM spawned to handle traffic. Each worker used more memory than anyone remembered. The OS started swapping. The database’s buffer pool was huge, so the kernel had less room for everything else. Latency didn’t increase gradually; it fell off a cliff. The checkout endpoint started timing out, retries increased traffic, and the retry storm turned a resource issue into a denial-of-service they hosted themselves.

The fix wasn’t exotic. They reduced the buffer pool to leave headroom, capped PHP-FPM children, lowered MySQL max_connections so the system failed fast instead of thrashing, and put an explicit queue in front of checkout. They also learned the operational difference between “free memory” and “available memory under burst.”

2) The optimization that backfired: “Just raise work_mem, it’s fine”

An internal app ran on PostgreSQL. Users complained about slow reports, so someone increased work_mem significantly because a blog post said it would reduce temp file I/O. It did. For one user. In one session.

Then a Monday morning happened. Several users ran reports concurrently. Those reports each did multiple sorts and hash joins. Postgres correctly allocated work_mem per operation. Memory usage surged. The VPS didn’t crash immediately; it got slower and slower as swap kicked in. The DB looked “alive” but every query waited behind the I/O storm caused by swapping.

The team rolled back work_mem to a conservative value and instead fixed the report query. They added a missing index, reduced selected columns, and introduced a summary table refreshed periodically. For the genuinely heavy query, they used a role with higher work_mem and forced it through a controlled reporting path. The lesson wasn’t “never tune.” It was “don’t tune globally for a local problem on a small machine.”

3) The boring but correct practice that saved the day: “Cap connections and log slow queries”

A different org hosted several small client sites on a shared 4GB VPS. Nothing fancy. They weren’t chasing microseconds. They did three boring things from day one: capped database connections, enabled slow query logging with a sane threshold, and monitored disk usage with an alert well before 90% full.

One afternoon a plugin update introduced a query regression. The site didn’t immediately fall over because connection caps prevented unlimited load from piling into the DB. Instead, some requests failed quickly, which made the issue visible without melting the whole box.

The slow query log had the smoking gun: a query that started scanning a large table without a useful index. They added the index, cleared up the regression, and the incident was contained to a short window. No mystery. No “it went away.” No weekend archaeology.

This is what boring reliability looks like: controlled failure, evidence collection, and enough headroom that one bad deploy doesn’t become a system-wide catastrophe.

Common mistakes: symptom → root cause → fix

1) Symptom: sudden 10–60s stalls across the site

Root cause: storage latency spikes during checkpoints/flushes or swap storms.

Fix: confirm with iostat and vmstat; reduce memory pressure (smaller DB caches, fewer app workers), smooth checkpoints (Postgres), and reduce write amplification (both).

2) Symptom: database CPU high, queries “not that slow” individually

Root cause: too many concurrent connections; contention overhead dominates.

Fix: cap connections; add pooling (pgBouncer); reduce PHP-FPM concurrency; cache at app or reverse proxy; fail fast rather than thrash.

3) Symptom: Postgres grows and grows; performance slowly degrades

Root cause: vacuum lag and table/index bloat due to insufficient autovacuum or long-running transactions.

Fix: identify idle-in-transaction sessions, tune autovacuum per hot table, and stop holding transactions open across requests.

4) Symptom: MySQL “Waiting for table metadata lock” in processlist

Root cause: schema change or DDL blocked by long transactions; queries queue behind metadata locks.

Fix: schedule DDL in maintenance windows; keep transactions short; use online schema change approaches if required.

5) Symptom: lots of temp files or “Using temporary; Using filesort” in MySQL

Root cause: missing indexes for ORDER BY/GROUP BY patterns; queries sorting huge datasets.

Fix: add composite indexes matching filter+sort; reduce selected columns; paginate properly; avoid OFFSET pagination for deep pages.

6) Symptom: frequent “too many connections” errors

Root cause: app connection leaks, no pooling, or spikes in web worker counts.

Fix: pool connections; set sane timeouts; cap app concurrency; set DB max_connections to a number you can afford.

7) Symptom: after “tuning,” performance got worse

Root cause: a global setting (like work_mem or too-large buffer pool) increased per-connection memory and triggered swap under concurrency.

Fix: revert; apply tuning per-user/per-query; measure memory and concurrency explicitly.

Checklists / step-by-step plan

Step 0: Decide what “good” means

  • Pick an SLO-like target: e.g., homepage p95 < 500ms, checkout p95 < 800ms.
  • Pick a measurement window and capture baseline (CPU, RAM, swap, iowait, DB connections, slow queries).

Step 1: Stabilize the host

  • Ensure disk has at least 15–20% free space.
  • Ensure you’re not swapping under normal peak traffic.
  • Set conservative service limits (systemd limits if needed) to avoid runaway processes.

Step 2: Cap concurrency deliberately

  • Set PHP-FPM max children to a number you can afford in RAM.
  • Set DB max_connections to protect the machine.
  • On Postgres: deploy pgBouncer and reduce backend connections.

Step 3: Set the first memory knobs

  • MySQL: set innodb_buffer_pool_size to fit the working set without starving the OS.
  • Postgres: set shared_buffers conservatively; keep work_mem low globally.

Step 4: Turn on evidence collection

  • MySQL: slow query log at 0.5–1s during peak, then analyze and fix.
  • Postgres: log_min_duration_statement and ideally pg_stat_statements.

Step 5: Fix the top 3 query patterns

  • Add the missing indexes that reduce row scans.
  • Eliminate N+1 queries in the app.
  • Stop doing expensive queries per request; precompute or cache.

Step 6: Re-test and set guardrails

  • Re-run your triage tasks at peak.
  • Add alerts on swap activity, disk utilization, connection counts, and slow query rate.
  • Document your “safe” settings and the rationale so future-you doesn’t undo them.

FAQ

1) On a 4GB VPS, should I prioritize DB cache or OS page cache?

Prioritize stability. For single-box web+DB, don’t starve the OS. A moderate DB cache plus headroom beats a giant cache that triggers swap under bursts.

2) Is PostgreSQL “slower” than MySQL for websites?

Not as a rule. For many web workloads, either is fast enough when indexed well. The bigger differentiator on 4GB is connection management and write patterns, not raw engine speed.

3) What’s the first MySQL setting I should change?

innodb_buffer_pool_size, sized to your reality. Then cap max_connections. Then enable slow query logging and fix what it shows you.

4) What’s the first PostgreSQL setting I should change?

Connection pooling strategy (pgBouncer) and max_connections. Then conservative shared_buffers and logging/pg_stat_statements to identify top queries.

5) Can I just increase swap to solve memory issues?

You can increase swap to prevent abrupt OOM crashes, but swap is not performance RAM. If your database or PHP workers regularly hit swap, latency will become unpredictable.

6) Should I disable fsync for speed?

No for production websites where you care about data integrity. If you disable durability and the host crashes, you can lose data. Benchmarks love it; customers don’t.

7) How do I know if I’m I/O bound?

High iowait in vmstat, high await and %util in iostat, and DB sessions waiting on IO events (Postgres) are strong signals.

8) When should I split web and DB onto separate servers?

When your tuning changes become trade-offs between web tier and DB tier memory, or when storage latency makes database writes unpredictable. Separation buys you isolation and clearer capacity planning.

9) Are defaults good enough these days?

Defaults are better than they used to be, but they’re not tailored to your 4GB “everything on one box” situation. Connection caps and memory budgeting are still on you.

10) What’s the safest “performance win” I can do without deep DB expertise?

Enable slow query logging (or pg_stat_statements), identify the top 3 time consumers, and add the right indexes. Also cap connections so the server remains stable under load.

Next steps that won’t embarrass you later

On a 4GB VPS, you’re not optimizing a database. You’re managing contention between web, database, and storage while trying to keep latency boring.

  1. Run the fast diagnosis playbook during peak and write down what’s actually happening: swap, iowait, connections, locks, top queries.
  2. Cap concurrency first: PHP-FPM workers and DB connections. Add pgBouncer if you’re on Postgres.
  3. Set the first memory knob (InnoDB buffer pool or Postgres shared buffers) to a conservative value that leaves headroom.
  4. Turn on evidence (slow query logs / pg_stat_statements) and fix the top offenders with indexes and query changes.
  5. Re-check disk and write behavior; smooth checkpoints, reduce temp spills, and stop doing noisy writes you don’t need.
  6. Decide if the real fix is architectural: moving DB to separate VPS, upgrading storage tier, or using a managed DB. Sometimes the most effective tuning parameter is your invoice.

If you do only one thing today: cap connections and stop swapping. Everything else is garnish.

MySQL vs ClickHouse: Stop Analytics from Killing OLTP (The Clean Split Plan)

Somewhere in your company, a well-meaning analyst just refreshed a dashboard. Now checkout is slow, the API is timing out, and the on-call channel is becoming a group therapy session.

This isn’t a “bad query” problem. It’s an architecture problem: OLTP and analytics are different animals, and putting them in the same cage ends predictably. The fix is a clean split—MySQL does transactions, ClickHouse does analytics, and you stop letting curiosity DDoS your revenue path.

The actual problem: OLTP and analytics fight at the storage layer

OLTP is about latency, correctness, and predictable concurrency. You optimize for thousands of small reads/writes per second, tight indexes, and hot working sets that fit in memory. The cost of a single slow request is paid immediately—in customer experience, timeouts, and retries that amplify load.

Analytics is about throughput, wide scans, and aggregation. You optimize for reading lots of data, compressing it well, and using vectorized execution to turn CPU into answers. Analytics queries are often “embarrassingly parallel” and don’t mind being a few seconds slower—until they’re pointed at your transactional database and become a denial-of-service with a pivot table attached.

The punchline: OLTP and analytics compete for the same finite resources—CPU cycles, disk I/O, page cache, buffer pools, locks/latches, and background maintenance (flushing, checkpoints, merges). Even if you add a read replica, you’re still frequently sharing the same fundamental pain: replication lag, I/O saturation, and inconsistent performance caused by unpredictable scans.

Where the knife goes in: resource contention in MySQL

  • Buffer pool pollution: A big reporting query reads a cold slice of history, evicts hot pages, and suddenly your primary workload becomes disk-bound.
  • InnoDB background pressure: Long scans + temp tables + sorts can increase dirty pages and redo pressure. Flush storms are not polite.
  • Locks and metadata locks: Some reporting patterns trigger ugly interactions (think “ALTER TABLE during business hours” meets “SELECT …” holding MDL).
  • Replication lag: Heavy reads on a replica steal I/O and CPU from the SQL thread applying changes.

Where ClickHouse fits

ClickHouse is built for analytics: columnar storage, compression, vectorized execution, and aggressive parallelism. It expects you to read lots of rows, but only a few columns, and it rewards you for grouping work into partitions and sorted keys.

The discipline is simple: treat MySQL as the system of record for transactions. Treat ClickHouse as the system of truth for analytics—truth meaning “derived from the record, reproducible, and queryable at scale.”

Paraphrased idea from Werner Vogels: “Everything fails; design for failure.” It applies to data too: design for failure modes like query storms, lag, and backfills.

MySQL vs ClickHouse: the real differences that matter in production

Storage layout: row vs column

MySQL/InnoDB is row-oriented. Great for fetching a row by primary key, updating a couple columns, maintaining secondary indexes, and enforcing constraints. But scanning a billion rows to compute aggregates means dragging entire rows through the engine, touching pages you didn’t need, and burning cache.

ClickHouse is column-oriented. It reads only the columns you ask for, compresses them well (often dramatically), and processes them in vectors. You pay upfront with different modeling constraints—denormalization, careful ordering keys, and a merge process that you must respect.

Concurrency model: transactional vs analytical parallelism

MySQL handles many concurrent short transactions well—up to the limits of your schema, indexes, and hardware. ClickHouse handles many concurrent reads too, but the magic is in parallelizing big reads and aggregations efficiently. If you point a BI tool at ClickHouse and allow unlimited concurrency with no query limits, it will try to set your CPU on fire. You can and should govern it.

Consistency and correctness

MySQL is ACID (with the usual caveats, but yes, it’s your transactional anchor). ClickHouse is typically eventually consistent for ingested data and append-oriented. You can model updates/deletes, but you do it on ClickHouse’s terms (ReplacingMergeTree, CollapsingMergeTree, version columns, or asynchronous deletes). That’s fine: analytics usually wants the current truth and a time series of changes, not per-row transactional semantics.

Indexing and query patterns

MySQL indexes are B-trees that support point lookups and range scans. ClickHouse uses primary key ordering and sparse indexes, plus data skipping indexes (like bloom filters) where it helps. The best ClickHouse query is one that can skip big chunks of data because your partitioning and ordering match the access patterns.

Operational posture

MySQL operations revolve around replication health, backups, schema migrations, and query stability. ClickHouse operations revolve around merges, disk utilization, part counts, TTL, and query governance. In other words: you trade one set of dragons for a different set. The deal is still worth it because you stop letting analytics wreck your checkout flow.

Joke #1: A dashboard refresh is the only kind of “user engagement” that can increase error rates and churn simultaneously.

Facts and historical context (useful, not trivia)

  1. MySQL’s InnoDB became default in MySQL 5.5 (2010 era), cementing row-store OLTP behavior for most deployments.
  2. ClickHouse started at Yandex to power analytics workloads at scale; it grew up in a world where scanning big data fast was the whole job.
  3. Column stores took off because CPU got faster than disks, and compression + vectorized execution let you spend CPU to avoid I/O.
  4. InnoDB buffer pool “pollution” is a classic failure mode when long scans blow away hot pages; the engine isn’t “broken,” it’s doing what you asked.
  5. Replication-based analytics has existed for decades: people have been shipping OLTP changes into data warehouses since before “data lake” was a résumé keyword.
  6. MySQL query cache was removed in MySQL 8.0 because it caused contention and didn’t scale well; caching isn’t free, and global locks are expensive.
  7. ClickHouse’s MergeTree family stores data in parts and merges them in the background—great for writes and compression, but it creates operational signals (part counts, merge backlog) you must monitor.
  8. “Star schema” and dimensional modeling predate modern tools; ClickHouse often pushes teams back toward denormalized, query-friendly shapes because joins at scale have real costs.

The clean split plan: patterns that don’t melt prod

Principle 1: MySQL is for serving users, not curiosity

Make it policy: production MySQL is not a reporting database. Not “usually.” Not “except for a quick query.” Never. If someone needs a one-off, run it against ClickHouse or a controlled snapshot environment.

You’ll get pushback. That’s normal. The trick is to replace “no” with “here’s the safe way.” Provide the safe path: ClickHouse access, curated datasets, and a workflow that doesn’t involve begging on-call for permission to run a JOIN across a year of orders.

Principle 2: Choose a data movement strategy that matches your failure tolerance

There are three common ways to feed ClickHouse from MySQL. Each has sharp edges.

Option A: Batch ETL (dump and load)

You extract hourly/daily snapshots (mysqldump, CSV exports, Spark jobs), load into ClickHouse, and accept staleness. This is simplest operationally but can be painful when you need near-real-time metrics, and backfills can be heavy.

Option B: Replication-driven ingestion (CDC)

Capture changes from MySQL binlog and stream them into ClickHouse. This gets you near-real-time analytics while keeping MySQL insulated from query load. But it introduces pipeline health as a first-class production concern: lag, schema drift, and reprocessing become your new hobby.

Option C: Dual-write (application writes to both)

Don’t. Or, if you absolutely must, do it only with robust idempotency, asynchronous delivery, and a reconciliation job that assumes the dual-write will lie to you occasionally.

The clean split plan usually means CDC plus curated data models in ClickHouse. Batch ETL is acceptable when you can tolerate staleness. Dual-write is a trap unless you enjoy explaining data mismatches during incident postmortems.

Principle 3: Model ClickHouse for your questions, not your schema

Most OLTP schemas are normalized. Analytics wants fewer joins, stable keys, and event-style tables. Your job is to build an analytics representation that’s easy to query and hard to misuse.

  • Prefer event tables: orders_events, sessions, payments, shipments, support_tickets. Append events. Derive facts.
  • Partition by time: usually by day or month. This gives you predictable pruning and manageable TTL.
  • Order by query dimensions: put the most common filter/group-by keys early in ORDER BY (after the time key if you always filter by time).
  • Pre-aggregate where it’s stable: materialized views can produce rollups so dashboards don’t repeatedly scan raw data.
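
A minimal sketch of an event table shaped this way (database, names, and types are placeholders; the point is the PARTITION BY and ORDER BY choices, not the exact schema):

cr0x@server:~$ clickhouse-client --multiquery <<'SQL'
CREATE TABLE analytics.orders_events
(
    event_time  DateTime,
    order_id    UInt64,
    customer_id UInt64,
    status      LowCardinality(String),
    total       Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, customer_id);
SQL

Dashboards that filter by date range and group by customer now prune whole partitions and read only the columns they need, instead of dragging full rows through your OLTP engine.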

Principle 4: Governance beats heroics

ClickHouse can answer questions fast enough that people will ask worse questions more frequently. You need guardrails:

  • Separate users and quotas: BI users get timeouts and max memory. ETL gets a different profile.
  • Set max threads and concurrency: avoid a “thundering herd” of parallel queries.
  • Use dedicated “gold” datasets: stable views or tables that dashboards depend on, versioned if needed.
  • Define SLOs: MySQL latency SLO is sacred. ClickHouse freshness SLO is negotiable but measurable.
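
A sketch of what per-user limits can look like (file path, profile name, and numbers are placeholders; recent ClickHouse versions can express the same thing as SQL settings profiles):

cr0x@server:~$ sudo cat /etc/clickhouse-server/users.d/bi-profile.xml
<clickhouse>
    <profiles>
        <bi_readonly>
            <readonly>1</readonly>
            <max_memory_usage>4000000000</max_memory_usage>
            <max_execution_time>30</max_execution_time>
            <max_threads>4</max_threads>
        </bi_readonly>
    </profiles>
</clickhouse>

ETL gets a separate profile with more memory and no readonly flag; BI users get this one, and the cluster stops being a shared fate experiment.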

Hands-on tasks (commands, outputs, decisions)

These are the moves you actually make at 02:13. Each task includes a command, sample output, what it means, and the decision you make from it.

Task 1: Confirm MySQL is suffering from analytic scans (top digests)

cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1e12 AS total_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN ? AND ? GROUP BY customer_id
COUNT_STAR: 9421
total_s: 18873.214
*************************** 2. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE created_at > ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 110233
total_s: 8211.532

What it means: Your worst time is coming from a classic reporting aggregate across a date range. It’s not “one slow query,” it’s repeated pain.

Decision: Block or reroute the analytic query pattern. Don’t tune MySQL into an OLAP engine. Start by moving that dashboard to ClickHouse or a rollup table.

Task 2: Check current MySQL thread activity (is it a dogpile?)

cr0x@server:~$ mysql -e "SHOW PROCESSLIST;" | head
Id	User	Host	db	Command	Time	State	Info
31	app	10.0.2.14:51234	prod	Query	2	Sending data	SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id
44	app	10.0.2.14:51239	prod	Query	2	Sending data	SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id
57	app	10.0.2.14:51241	prod	Query	1	Sending data	SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id

What it means: Many identical queries are running concurrently. This is a dashboard or a fleet of workers doing the same expensive work.

Decision: Throttle at the app/BI layer and introduce caching or pre-aggregation in ClickHouse. Also consider MySQL connection limits and per-user resource controls.

Task 3: Validate InnoDB buffer pool pressure (hot pages getting evicted)

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
Variable_name	Value
Innodb_buffer_pool_read_requests	987654321
Innodb_buffer_pool_reads	12345678

What it means: A high number of physical reads (Innodb_buffer_pool_reads) relative to logical reads suggests your working set isn’t staying in memory—often due to big scans.

Decision: Stop the scans (move analytics off), and only then consider increasing buffer pool or adjusting workload. Hardware cannot outvote bad workload mix forever.

Task 4: Catch disk I/O saturation on the MySQL host

cr0x@server:~$ iostat -xz 1 3
Linux 6.2.0 (mysql01) 	12/30/2025 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.12    0.00    6.44   31.55    0.00   43.89

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
nvme0n1         820.0  64200.0     0.0    0.0   12.4    78.3     410.0  18800.0    9.8   18.2   98.7

What it means: %util near 100% and high iowait means the disk is the bottleneck. Analytics scans love this outcome.

Decision: Immediate: reduce query concurrency, kill worst offenders, shift analytics to ClickHouse. Long-term: separate storage and workloads; don’t bank on “faster NVMe” as a strategy.

Task 5: Identify MySQL replication lag (your “read replica” isn’t helping)

cr0x@server:~$ mysql -h mysql-replica01 -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_SQL_Running|Slave_IO_Running"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 487

What it means: The replica is ~8 minutes behind. Dashboards reading it are lying. Worse: if you fail over, you might lose recent transactions.

Decision: Don’t use the replica as an analytics sink. Use CDC to ClickHouse, or at least a dedicated replica with controlled query access and guaranteed resources.

Task 6: Show the actual expensive query plan (stop guessing)

cr0x@server:~$ mysql -e "EXPLAIN SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
type: range
possible_keys: idx_created_at
key: idx_created_at
rows: 98234123
Extra: Using where; Using temporary; Using filesort

What it means: Even with an index, you’re scanning ~98M rows and using temp/filesort. That’s not an OLTP query; it’s an OLAP job.

Decision: Move it. If you must keep some aggregates in MySQL, use summary tables updated incrementally, not ad hoc GROUP BY over raw facts.
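
A sketch of that incremental pattern, with hypothetical table and column names; a scheduled job aggregates only the latest window and upserts:

CREATE TABLE orders_daily_summary (
    day DATE NOT NULL,
    customer_id BIGINT NOT NULL,
    revenue DECIMAL(18,2) NOT NULL,
    PRIMARY KEY (day, customer_id)
);

-- Aggregate only yesterday's window; idempotent thanks to the upsert.
INSERT INTO orders_daily_summary (day, customer_id, revenue)
SELECT DATE(created_at), customer_id, SUM(total)
FROM orders
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), customer_id
ON DUPLICATE KEY UPDATE revenue = VALUES(revenue);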

Task 7: Confirm ClickHouse health basics (are merges or disk the issue?)

cr0x@server:~$ clickhouse-client -q "SELECT hostName(), uptime()"
ch01
345678

What it means: You can connect and the server is alive long enough to be useful.

Decision: Proceed to deeper checks: parts/merges, query load, and disk.

Task 8: Check ClickHouse active queries and their resource usage

cr0x@server:~$ clickhouse-client -q "SELECT user, query_id, elapsed, read_rows, formatReadableSize(memory_usage) AS mem, left(query, 80) AS q FROM system.processes ORDER BY memory_usage DESC LIMIT 5 FORMAT TabSeparated"
bi_user	0f2a...	12.4	184001234	6.31 GiB	SELECT customer_id, sum(total) FROM orders_events WHERE event_date >= toDate('2025-12-01')
etl	9b10...	3.1	0	512.00 MiB	INSERT INTO orders_events FORMAT JSONEachRow

What it means: BI is consuming memory. That’s fine if it’s budgeted. It’s a problem if it starves merges or triggers OOM.

Decision: Set per-user max_memory_usage, max_threads, and possibly max_concurrent_queries. Keep ETL reliable.

Task 9: Check ClickHouse merges backlog (parts growing like weeds)

cr0x@server:~$ clickhouse-client -q "SELECT database, table, sum(parts) AS parts, formatReadableSize(sum(bytes_on_disk)) AS disk FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(parts) DESC LIMIT 10 FORMAT TabSeparated"
analytics	orders_events	1842	1.27 TiB
analytics	sessions	936	640.12 GiB

What it means: Thousands of parts can mean heavy insert fragmentation or merges falling behind. Query performance will degrade, and startup/metadata gets heavier.

Decision: Adjust insert batching, tune merge settings conservatively, and consider partition strategy. If parts keep climbing, treat it as an incident in slow motion.

Task 10: Validate partition pruning (if it scans everything, you modeled it wrong)

cr0x@server:~$ clickhouse-client -q "EXPLAIN indexes=1 SELECT customer_id, sum(total) FROM analytics.orders_events WHERE event_date BETWEEN toDate('2025-12-01') AND toDate('2025-12-30') GROUP BY customer_id"
Expression ((Projection + Before ORDER BY))
  Aggregating
    Filter (WHERE)
      ReadFromMergeTree (analytics.orders_events)
        Indexes:
          MinMax
            Keys: event_date
            Condition: (event_date in [2025-12-01, 2025-12-30])
            Parts: 30/365
            Granules: 8123/104220

What it means: It’s reading 30/365 parts thanks to the date filter. That’s what “works as designed” looks like.

Decision: If the Parts count read is close to the total, change the partitioning and/or require time filters in dashboards.

Task 11: Monitor ClickHouse disk usage and predict capacity trouble

cr0x@server:~$ df -h /var/lib/clickhouse
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    3.5T  3.1T  330G  91% /var/lib/clickhouse

What it means: 91% used. You are one backfill away from a bad day, and merges need headroom.

Decision: Stop non-essential backfills, extend storage, enforce TTL, and compress/optimize data model. ClickHouse under disk pressure becomes unpredictably slow and risky.

Task 12: Verify CDC pipeline lag at the consumer (is analytics stale?)

cr0x@server:~$ clickhouse-client -q "SELECT max(ingested_at) AS last_ingest, now() AS now, dateDiff('second', max(ingested_at), now()) AS lag_s FROM analytics.orders_events"
2025-12-30 19:03:12	2025-12-30 19:03:29	17

What it means: ~17 seconds lag. That’s healthy for “near real time” analytics.

Decision: If lag climbs, pause heavy queries, check pipeline throughput, and decide whether to degrade dashboards or risk OLTP.

Task 13: Check MySQL binary log format for CDC correctness

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'binlog_format';"
Variable_name	Value
binlog_format	ROW

What it means: ROW format is typically what CDC tools want for correctness. STATEMENT can be ambiguous for non-deterministic queries.

Decision: If you’re not on ROW, plan a change window. CDC correctness is not something you “hope for.”

Task 14: Confirm MySQL has sane slow query logging (so you can prove causality)

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slow_query_log%'; SHOW VARIABLES LIKE 'long_query_time';"
Variable_name	Value
slow_query_log	ON
slow_query_log_file	/var/log/mysql/mysql-slow.log
Variable_name	Value
long_query_time	0.500000

What it means: You’ll capture queries slower than 500ms. That’s aggressive, but useful during a noisy period.

Decision: During incidents, lower long_query_time briefly and sample. Afterward, set it to a stable threshold and use digest summaries.

Task 15: Verify ClickHouse user limits (prevent a BI “parallelism party”)

cr0x@server:~$ clickhouse-client -q "SHOW CREATE USER bi_user"
CREATE USER bi_user IDENTIFIED WITH sha256_password SETTINGS max_memory_usage = 4000000000, max_threads = 8, max_execution_time = 60, max_concurrent_queries = 5

What it means: BI is fenced: 4GB memory, 8 threads, 60s runtime, 5 concurrent queries. That’s the difference between a dashboard and a stress test.

Decision: If you can’t set limits because “business needs,” you’re not running analytics, you’re running roulette.

Fast diagnosis playbook

This is the order that finds the bottleneck quickly, without turning the incident into a philosophy debate.

First: Is MySQL overloaded by reads, writes, locks, or I/O?

  1. Top query digests (performance_schema digests or slow log): identify the query families eating time.
  2. Thread states (SHOW PROCESSLIST): “Sending data” suggests scan/aggregation; “Locked” suggests contention; “Waiting for table metadata lock” suggests DDL collision.
  3. Disk I/O (iostat): if iowait is high and disk %util is pegged, stop scans before tuning anything else.

Second: Is the “solution” (replica) actually making it worse?

  1. Replication lag (SHOW SLAVE STATUS): if lag is minutes, analytics users are making decisions on stale data and blaming you for it.
  2. Replica resource contention: heavy queries can starve the SQL thread and increase lag further.

Third: If ClickHouse exists, is it healthy and governed?

  1. system.processes: identify runaway BI queries and memory hogs.
  2. Parts and merges (system.parts): too many parts means ingestion shape or merge backlog problems.
  3. Disk headroom (df): merges and TTL need space; 90% full is operational debt with interest.

Fourth: Is data freshness the real complaint?

  1. CDC lag (max ingested_at): quantify the staleness.
  2. Communicate a fallback: if freshness degrades, degrade dashboards—not checkout.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “Read replicas are for reporting”

A mid-size subscription business had a primary MySQL cluster and two read replicas. Their BI tool was pointed at a replica because “reads don’t affect writes.” That phrase has caused more incidents than caffeine has prevented.

During month-end, finance ran a set of cohort and revenue reports. The replica’s disk hit saturation: heavy scans plus temp tables. Replication lag rose from seconds to tens of minutes. Nobody noticed at first because application traffic was fine; the primary wasn’t directly impacted.

Then someone made the second assumption: “If the primary fails, we can fail over to a replica.” Right when the lag was worst, the primary had an unrelated host issue and went into an unhealthy state. Automation tried to promote the “best” replica—except “best” was 20 minutes behind.

They didn’t lose the entire database. They lost enough recent transactions to create a customer support nightmare: payments that “succeeded” externally but didn’t exist internally, and sessions that didn’t match billing. Recovery was a careful mix of binlog spelunking and reconciling against the payment provider.

The fix wasn’t heroic. They separated concerns: a dedicated replica for failover with strict query blocking, and analytics moved to ClickHouse via CDC. Reporting became fast, and failover became trustworthy because the replica was no longer being used as a punching bag.

Optimization that backfired: “Let’s just add an index”

An e-commerce team had a slow reporting query on orders: time range filter plus group-by. Someone added an index on created_at and another composite index on (created_at, customer_id). The query got faster in isolation, so they shipped it and celebrated.

Two weeks later, write latency started spiking. Inserts into orders slowed, and the background flush rate climbed. The new indexes increased write amplification—every insert now maintained more B-tree structures. At peak traffic, they were paying an index tax on every transaction to make a handful of reports cheaper.

Then the BI tool got a new dashboard that ran the same query every minute. The query was faster, so concurrency increased (humans love pressing refresh when refresh is quick). The system traded one slow query for many medium-fast queries and still ended up I/O bound.

The actual solution was to remove the index bloat, keep OLTP lean, and build a ClickHouse rollup table updated continuously. Dashboards hit ClickHouse. Transactions stayed smooth. The team learned the hard lesson: indexing is not “free speed,” it’s a write-time bill you pay forever.

Boring but correct practice that saved the day: quotas and staged backfills

A B2B SaaS company ran ClickHouse for analytics with strict user profiles. BI users had max_execution_time, max_memory_usage, and concurrency limits. ETL had different limits and ran in a controlled queue. Nobody loved those constraints. Everyone benefited from them.

One afternoon, an analyst attempted to run a wide query across two years of raw events without a date filter. ClickHouse started scanning, hit the execution time limit, and killed the query. The analyst complained. On-call did not get paged. That’s a good trade.

Later that month, the data team needed a backfill due to a schema change in the upstream CDC. They staged it: a day at a time, verifying part counts, disk headroom, and lag after each chunk. Slow, careful, measurable. The backfill finished without threatening production dashboards.

The boring practice wasn’t a fancy algorithm. It was governance and operational discipline: limits, queues, and incremental backfills. It saved them because the system behaved predictably when humans behaved unpredictably.

Joke #2: The only thing more permanent than a temporary dashboard is the incident channel it creates.

Common mistakes: symptom → root cause → fix

  • Symptom: MySQL p95 latency spikes during “business reporting hours”
    Root cause: Long scans and GROUP BY queries competing with OLTP for buffer pool and I/O
    Fix: Move reporting to ClickHouse; enforce policy; add curated rollups; block BI users from MySQL.
  • Symptom: Read replica lag increases when analysts run reports
    Root cause: Replica I/O and CPU saturated; SQL thread can’t apply binlog fast enough
    Fix: Remove analytic access from failover replicas; use CDC to ClickHouse; cap query concurrency.
  • Symptom: ClickHouse queries get slower over time with no change in data size
    Root cause: Parts explosion; merges falling behind due to fragmented inserts or disk pressure
    Fix: Batch inserts; tune merge-related settings carefully; monitor parts; ensure disk headroom; consider repartitioning.
  • Symptom: Dashboards are “fast sometimes” and time out randomly on ClickHouse
    Root cause: Unbounded BI concurrency; memory pressure; noisy neighbor queries
    Fix: Set per-user limits (memory, threads, execution time, concurrent queries); create pre-aggregated tables; add query routing.
  • Symptom: Analytics data has duplicates or “wrong latest state”
    Root cause: CDC applied as append-only without dedup/versioning; updates/deletes not modeled correctly
    Fix: Use version columns and ReplacingMergeTree where appropriate; store events and derive current state via materialized views.
  • Symptom: ClickHouse disk keeps climbing until it’s an emergency
    Root cause: No TTL; storing raw forever; heavy backfills; no capacity guardrails
    Fix: Apply TTL for cold data; downsample; compress; archive; enforce quotas and backfill procedures.
  • Symptom: “We moved to ClickHouse but MySQL is still slow”
    Root cause: CDC pipeline still reads MySQL in a heavy way (full-table extracts, frequent snapshots), or app still runs reports on MySQL
    Fix: Use binlog-based CDC; review MySQL query sources; lock down reporting accounts with grants and firewall rules; validate with digest data.
  • Symptom: ClickHouse freshness lags during peaks
    Root cause: Ingestion bottleneck (pipeline throughput), merges, or disk pressure; sometimes too many small inserts
    Fix: Batch inserts; scale ingestion; monitor lag; temporarily reduce BI concurrency; prioritize ETL resources.

Checklists / step-by-step plan

Step-by-step: the clean split implementation plan

  1. Declare the boundary: production MySQL is OLTP only. Write it down. Enforce it with accounts and network policy.
  2. Inventory analytic queries: use MySQL digest tables and slow log summaries to list the top 20 query families.
  3. Pick the ingestion method: CDC for near-real-time; batch for daily/hourly; avoid dual-write.
  4. Define analytics tables in ClickHouse: start with event tables, time partitioning, and ORDER BY keys aligned to filters.
  5. Build “gold” datasets: materialized views or rollup tables for dashboards; keep raw data for deep dives.
  6. Set governance from day one: user profiles, quotas, max_execution_time, max_memory_usage, max_concurrent_queries.
  7. Measure freshness: track ingestion lag and publish the SLO to stakeholders. People tolerate staleness when it’s explicit.
  8. Cut over dashboards: migrate the highest-impact dashboards first (the ones that page on-call indirectly).
  9. Block the old path: remove BI credentials from MySQL; firewall if needed; prevent regression.
  10. Backfill safely: incremental, measurable, with disk headroom checks; no “just run it overnight” fantasies.
  11. Load test analytics: simulate dashboard concurrency. ClickHouse will happily accept your optimism and then punish it.
  12. Operationalize: alerts on ClickHouse parts count, disk usage, query failures, ingestion lag; and on MySQL latency/IO.

Release checklist: moving one dashboard from MySQL to ClickHouse

  • Does the dashboard query include a time filter that matches partitioning?
  • Is there a rollup table/materialized view to avoid scanning raw events repeatedly?
  • Is the ClickHouse user limited (memory, threads, execution time, concurrency)?
  • Is the CDC lag metric visible to the dashboard users?
  • Is the old MySQL query blocked or at least removed from the app/BI tool?
  • Did you validate results for a known time window (spot-check totals and counts)?

Operational checklist: weekly hygiene that prevents slow disasters

  • Review ClickHouse active parts by table; investigate fast growth.
  • Review ClickHouse disk headroom; keep enough free space for merges and backfills.
  • Review top BI queries by read_rows and memory usage; optimize or pre-aggregate.
  • Review MySQL top digests to ensure analytics didn’t creep back in.
  • Test restore paths: MySQL backups, ClickHouse metadata and data recovery expectations.

FAQ

1) Can’t I just scale MySQL vertically and call it a day?

You can, and you’ll get temporary relief. The failure mode returns when the next dashboard or cohort query appears. The issue is workload mismatch, not just horsepower.

2) What if I already have MySQL read replicas—should I point BI at them?

Only if you’re comfortable with lag and you’re not using those replicas for failover. Even then, cap concurrency and treat it as a temporary bridge, not the end state.

3) Is ClickHouse “real time” enough for operational dashboards?

Often yes, with CDC. Measure ingestion lag explicitly and design dashboards to tolerate small delays. If you need sub-second transactional truth, that’s MySQL territory.

4) How do I handle updates and deletes from MySQL in ClickHouse?

Prefer event modeling (append changes). If you need “current state,” use versioned rows with engines like ReplacingMergeTree and design queries/materialized views accordingly.
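
A sketch of the versioned-row pattern (names are illustrative): append every change with an increasing version, let merges collapse old rows eventually, and read the latest explicitly.

CREATE TABLE analytics.orders_current
(
    order_id UInt64,
    customer_id UInt64,
    status LowCardinality(String),
    total Decimal(18, 2),
    updated_at DateTime,
    ver UInt64
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY order_id;

-- Dedup at read time; don't assume the table is already collapsed.
SELECT
    order_id,
    argMax(status, ver) AS status,
    argMax(total, ver) AS total
FROM analytics.orders_current
GROUP BY order_id;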

5) Will ClickHouse replace my data warehouse?

Sometimes. For many companies it becomes the primary analytics store. But if you need heavy transformations, governance, or cross-system modeling, you may still keep a warehouse layer. Don’t force a religious conversion.

6) What’s the fastest win if we’re on fire today?

Stop running analytics on MySQL immediately: kill the worst queries, remove BI access, and ship the dashboard to ClickHouse or a cached rollup. Then fix it properly.

7) What’s the biggest ClickHouse operational surprise for MySQL teams?

Merges and parts. Row-store folks expect “I inserted it, it’s done.” ClickHouse continues working in the background, and you must monitor that work.

8) How do I prevent analysts from writing expensive ClickHouse queries?

Use user profiles with quotas and timeouts, provide curated “gold” tables, and teach people that omitting time filters isn’t “exploration”; it’s arson.

9) Do materialized views solve everything?

No. They’re great for stable rollups and common aggregates. But they can add complexity and storage cost. Use them where they reduce repeated work measurably.

10) What if my analytics queries require complex joins across many tables?

Denormalize for the common paths, precompute dimensions, and keep joins limited. ClickHouse can join, but the best production analytics systems avoid doing it repeatedly at query time.

Conclusion: practical next steps

If you take one action this week, make it this: remove analytics load from MySQL. Not by pleading with users to “be careful,” but by providing a better place to ask questions.

  1. Lock down MySQL: separate accounts, block BI networks, and enforce that production MySQL serves users first.
  2. Stand up ClickHouse governance: limits, quotas, and curated datasets before you invite the whole company.
  3. Move the top 5 worst queries: replicate the needed data via CDC or batch, then build rollups so dashboards stay cheap.
  4. Operationalize freshness: publish ingestion lag and treat it like a product requirement. It’s better to be honestly 60 seconds behind than unknowably wrong.
  5. Practice backfills: staged, measurable, reversible. Your future self will appreciate your current self’s restraint.

The clean split isn’t glamorous. It’s just the difference between a database that serves customers and a database that hosts a daily analytics cage match. Pick the calmer life.

SLI/CrossFire: Why Multi-GPU Was a Dream—and Why It Died

If you ever tried to “just add another GPU” and expected the graph to go up and to the right, you’ve already met the villain of this story:
the real world. Multi-GPU in consumer gaming—NVIDIA SLI and AMD CrossFire—looked like pure engineering righteousness: parallelism, more silicon,
more frames, done.

Then you shipped it. The frametimes turned into a picket fence. The driver stack became a negotiation between game engine, GPU scheduler, PCIe,
and whatever monitor timing you thought you understood. Your expensive second card often became a space heater with a resume.

The promise: scaling by bolting on GPUs

Multi-GPU, as sold to gamers, was an operational fairy tale: your game is GPU-bound, therefore another GPU means nearly double performance.
That’s the pitch. It’s also the first wrong assumption. Systems don’t scale because a marketing slide says “2×”; systems scale when the slowest
part of the pipeline stops being slow.

A modern game frame is a messy assembly line: CPU simulation, draw-call submission, GPU rendering, post-processing, compositing, presentation,
and a timing contract with your display. SLI/CrossFire tried to hide multi-GPU complexity behind drivers, profiles, and a bridge. That hiding is
exactly what doomed it.

The multi-GPU dream died because it fought physics (latency and synchronization), software economics (developers don’t test rare configs), and
platform changes (DX12/Vulkan shifted responsibility from driver to engine). And because “average FPS” turned out to be a lie of omission: what
your eyes feel is frametime consistency, not the mean.

How SLI/CrossFire actually worked

Driver-managed multi-GPU: profiles all the way down

In the classic era, SLI/CrossFire relied on driver heuristics and per-game profiles. The driver would decide how to split rendering across GPUs
without the game explicitly knowing. That sounds convenient. It is also an operational nightmare: you now have a distributed system where one node
(the game) doesn’t know it’s distributed.

Profiles mattered because most games weren’t written to be safely parallelized across GPUs. The driver needed game-specific “hints” to avoid
hazards like reading back data that hasn’t been produced yet, or applying post-processing that assumes a full frame history.

The main modes: AFR, SFR, and “please don’t do that”

Alternate Frame Rendering (AFR) was the workhorse. GPU0 renders frame N, GPU1 renders frame N+1, repeat. On paper: fantastic.
In practice: AFR is a latency and pacing machine. If frame N takes 8 ms and frame N+1 takes 22 ms, your “average FPS” may look fine while your
eyes get a slideshow with extra steps.

Split Frame Rendering (SFR) divides a single frame into regions. This demands careful load balancing: one half of the screen might
contain an explosion, hair shaders, volumetrics, and your regrets; the other half is a wall. Guess which GPU finishes first and sits idle.

There were also hybrid modes and vendor-specific hacks. The more hacks you need, the less general your solution becomes. At some point you’re not
doing “multi-GPU support”; you’re writing per-title incident response in driver form.

Bridges, PCIe, and why the interconnect was never the hero

SLI bridges (and CrossFire bridges in earlier eras) provided a higher-bandwidth, lower-latency path for certain synchronization and buffer sharing
operations than PCIe alone. But the bridge didn’t magically merge VRAM. Each GPU still had its own memory. In AFR, each GPU typically needed its
own copy of the same textures and geometry. So your “two 8 GB cards” did not become “16 GB.” It became “8 GB, twice.”

When developers began leaning harder on temporal techniques—TAA, screen-space reflections with history buffers, temporal upscalers—AFR became
increasingly incompatible. You can’t easily render frame N+1 on GPU1 if it needs history from frame N that lives on GPU0, unless you add
synchronization and data transfer that erases the performance gain.

One paraphrased idea, widely attributed in spirit to systems reliability thinking (and often repeated by engineers in the Google SRE orbit): hope is not a strategy.
It fits multi-GPU perfectly. SLI/CrossFire asked you to hope your game’s render pipeline aligned with a driver’s assumptions.

Why it failed: the death by a thousand edge cases

1) Frame pacing killed “it feels fast”

AFR can deliver high average FPS while producing uneven frametimes (microstutter). Humans notice variance. Your monitoring overlay might show
“120 FPS,” while your brain registers “inconsistent.” This was the central user experience failure: SLI/CrossFire could win benchmarks and lose
eyeballs.

Frame pacing isn’t just “a little jitter.” It interacts with VSync, VRR (G-SYNC/FreeSync), render queue depth, and CPU scheduling. If the driver
queues frames too aggressively, you get input latency. If it queues too little, you get bubbles and stutter.

Joke #1: Multi-GPU is like having two interns write alternating pages of the same report—fast, until you notice they disagree on the plot.

2) VRAM mirroring: you paid for memory you couldn’t use

Consumer multi-GPU almost always mirrored assets in each GPU’s memory. That made scaling possible without treating memory as a shared coherent
pool, but it also meant high-resolution textures, large geometry, and modern ray tracing acceleration structures were constrained by the smallest
VRAM on a single card.

As games became more VRAM-hungry, the “just add a second GPU” plan got worse: your bottleneck moved from compute to memory capacity, and multi-GPU
did nothing to help. Worse, a second GPU increased power, heat, and case airflow requirements while delivering the same VRAM limit as one card.

3) The CPU became the coordinator, and it didn’t scale either

Multi-GPU is not just “two GPUs.” It’s extra driver work, extra command buffer management, more synchronization, and often more draw-call overhead.
Many engines were already CPU-bound on the render thread. Adding a second GPU can shift the bottleneck upward and make the CPU the limiter.

In production terms: you added capacity to a downstream service without increasing upstream throughput. Congratulations, you invented a new queue.

4) The driver profile model didn’t survive the software supply chain

Driver-managed SLI/CrossFire required vendors to keep up with new game releases, patches, engine updates, and new rendering techniques. Game studios
shipped weekly updates. GPU vendors shipped drivers on a slower cadence and had to test across thousands of combinations.

A multi-GPU profile that works on version 1.0 can break on 1.0.3 because a post-processing pass changed order, or because a new temporal filter now
reads a previous frame buffer. The driver “optimizing” blindly can become the thing that corrupts the frame.

5) VRR (variable refresh) and multi-GPU made each other miserable

Variable refresh rate is one of the best quality-of-life improvements in PC gaming. It also complicates multi-GPU pacing: the display adapts to the
frame delivery cadence, so if AFR creates bursts and gaps, VRR can’t “smooth” it; it will faithfully show the unevenness.

Many users upgraded to VRR monitors and discovered their previously “fine” multi-GPU setup now looked worse. That’s not the monitor’s fault. It’s
you finally seeing the truth.

6) Explicit multi-GPU arrived, and the industry didn’t want the bill

DX12 and Vulkan made explicit multi-adapter possible: the engine can control multiple GPUs directly. That is technically cleaner than driver magic.
It is also expensive engineering work that benefits a tiny fraction of customers.

Studios prioritized features that shipped to everyone: better upscaling, better anti-aliasing, better content pipelines, better console parity.
Multi-GPU was a support burden with low ROI. It died the way many enterprise features die: quietly, because nobody funded the on-call rotation.

7) Power, thermals, and case constraints: the physical layer pushed back

Two high-end GPUs demand serious PSU headroom, good airflow, and often a motherboard that can provide enough PCIe lanes without throttling. The
“consumer case + two flagship GPUs” configuration is a thermal engineering project. And most people wanted a computer, not a hobby that burns dust.

8) Security and stability: the driver stack became a larger blast radius

The more complex the driver scheduling and inter-GPU synchronization logic, the more failure modes: black screens, TDRs (timeout detection and
recovery), weird corruption, game-specific crashes. In ops terms, you increased system complexity and reduced mean time to innocence.

Joke #2: SLI promised “twice the GPUs,” but sometimes delivered “twice the troubleshooting,” which is not a feature anyone benchmarks.

Historical context: the facts people forget

  • Fact 1: The original “SLI” name came from 3dfx’s Scan-Line Interleave in the late 1990s; NVIDIA reused the acronym later with a different technical approach.
  • Fact 2: Early consumer multi-GPU often leaned heavily on AFR because it was the easiest way to scale without rewriting engines.
  • Fact 3: Multi-GPU scaling was famously inconsistent: some titles saw near-linear gains, others saw zero, and some got slower due to CPU/driver overhead.
  • Fact 4: “Microstutter” became a mainstream complaint in the early 2010s as reviewers began measuring frametimes rather than just average FPS.
  • Fact 5: AMD invested in frame pacing improvements in drivers after widespread criticism; it helped, but it didn’t change AFR’s underlying constraints.
  • Fact 6: Many engines increasingly used temporal history buffers (TAA, temporal upscaling, motion vectors), which are inherently awkward for AFR.
  • Fact 7: PCIe bandwidth rose over generations, but latency and synchronization overhead remained central problems for frame-to-frame dependencies.
  • Fact 8: DX12/Vulkan explicit multi-GPU put control in the application; most studios chose not to implement it because the testing matrix exploded.
  • Fact 9: NVIDIA gradually restricted/changed SLI support in later generations, focusing on high-end segments and specific use cases rather than broad game support.

What replaced it (sort of): explicit multi-GPU and modern alternatives

Explicit multi-GPU: better architecture, worse economics

Explicit multi-GPU (DX12 multi-adapter, Vulkan device groups) is how you’d design it if you were sober: the engine knows what workloads can run on
which GPU, what data needs sharing, and when to synchronize. This removes a lot of driver guesswork.

It also requires the engine to be structured for parallelism across devices: resource duplication, cross-device barriers, careful handling of
temporal effects, and different strategies for different GPU combinations. That’s not “supporting SLI.” That’s building a second renderer.

A few titles experimented with it. Most studios did the math and bought something else: temporal upscalers, better CPU threading, and content
optimizations that help every user.

The modern “multi-GPU” that actually works: specialization

Multi-GPU is alive in places where the workload is naturally parallel and doesn’t require strict frame-to-frame coherence:

  • Offline rendering / path tracing: You can split samples or tiles across GPUs and merge results.
  • Compute / ML training: Data parallelism with explicit frameworks, albeit still full of synchronization pain.
  • Video encoding pipelines: Separate GPUs can handle separate streams or stages.

For real-time gaming, the winning strategy became: one strong GPU, better scheduling, better upscaling, and better frame generation techniques. Not
because it’s “cool,” but because it’s operationally sane.

Fast diagnosis playbook

When someone says “my second GPU isn’t doing anything” or “SLI made it worse,” don’t start with mystical driver toggles. Treat it like an incident.
Establish what’s bottlenecked, then isolate.

First: confirm the system sees both GPUs and the link is sane

  • Are both devices present on PCIe?
  • Are they running at expected PCIe generation/width?
  • Is the correct bridge installed (if required)?
  • Are power connectors correct and stable?

Second: confirm the software path is actually multi-GPU

  • Is the game known to support SLI/CrossFire for your GPU generation?
  • Is the driver profile present/enabled?
  • Is the API path (DX11 vs DX12 vs Vulkan) compatible with the vendor’s multi-GPU mode?

Third: measure frametimes and identify the limiting resource

  • GPU utilization per card (not just “total”).
  • CPU render thread saturation.
  • VRAM usage and paging behavior.
  • Frame pacing (99th percentile frametime), not just average FPS.

Fourth: remove variables until the behavior is explainable

  • Disable VRR/VSync temporarily to observe raw pacing.
  • Test a known-good title/benchmark with documented scaling.
  • Test each GPU individually to rule out a marginal card.

Practical tasks: commands, outputs, and decisions

These assume a Linux workstation used for testing/CI rigs, lab reproduction, or just because you enjoy pain in a reproducible way. The point isn’t
that Linux is where SLI gaming peaked; it’s that Linux gives you observability without a GUI treasure hunt.

Task 1: List GPUs and confirm the PCIe topology

cr0x@server:~$ lspci -nn | egrep -i 'vga|3d|display'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06]
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06]

What it means: Two GPUs are enumerated on the PCIe bus. If you only see one, stop: you have a hardware/firmware problem.

Decision: If one GPU is missing, reseat, check power leads, BIOS settings (Above 4G decoding, PCIe slot config), then retest.

Task 2: Verify PCIe link width and generation for each GPU

cr0x@server:~$ sudo lspci -s 01:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 8GT/s, Width x16
LnkSta: Speed 8GT/s, Width x16

What it means: The GPU is negotiating PCIe Gen3 x16 as expected. If you see x8 or Gen1, you’ve found a bottleneck or fallback.

Decision: If the link is downgraded, check slot wiring, motherboard lane sharing (M.2 stealing lanes), BIOS PCIe settings, risers, and signal integrity.

Task 3: Confirm NVIDIA driver sees both GPUs and reports utilization

cr0x@server:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-ffffffff-1111-2222-3333-444444444444)

What it means: Driver layer sees both devices. If one is missing here but present in lspci, you likely have a driver binding issue or firmware mismatch.

Decision: If missing, check dmesg for GPU errors, verify kernel modules, and confirm both GPUs are supported by the installed driver.

Task 4: Watch per-GPU utilization and memory during load

cr0x@server:~$ nvidia-smi dmon -s pucvmet
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk   pviol  rxpci  txpci
    0   210    78     -    92    55     0     0  5500  1582      0    120    110
    1    95    64     -    18    52     0     0  5500  1582      0     40     35

What it means: GPU0 is doing real work; GPU1 is mostly idle. The mem column here is memory-bandwidth utilization, not VRAM residency; with mirrored assets, the idle card still holds its own full copy. That’s classic “second GPU not used” behavior.

Decision: If GPU1 stays idle, verify the application path supports multi-GPU; otherwise, stop trying to fix a non-feature.

Task 5: Confirm Xorg/Wayland session details (to avoid compositor surprises)

cr0x@server:~$ echo $XDG_SESSION_TYPE
wayland

What it means: You’re on Wayland. Some tooling and certain legacy multi-GPU paths behave differently under Wayland vs Xorg.

Decision: If you’re debugging rendering/presentation issues, reproduce under Xorg as a control to isolate compositor timing effects.

Task 6: Check kernel logs for PCIe errors and GPU resets

cr0x@server:~$ sudo dmesg -T | egrep -i 'pcie|aer|nvrm|gpu|xid' | tail -n 12
[Mon Jan 13 10:19:22 2026] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
[Mon Jan 13 10:19:22 2026] pcieport 0000:00:03.1: AER: Corrected error received: 0000:02:00.0

What it means: “Fallen off the bus” often indicates power/thermal instability, bad riser, flaky slot, or signal integrity issues—multi-GPU makes this more likely.

Decision: Treat as hardware reliability: reduce power limit, improve cooling, reseat, swap slots, remove risers, update BIOS, and retest stability before blaming drivers.

Task 7: Check CPU bottleneck indicators (load, run queue, throttling)

cr0x@server:~$ uptime
 10:22:11 up 3 days,  6:41,  1 user,  load average: 14.82, 13.97, 12.10

What it means: High load average can indicate CPU saturation or runnable threads piling up. Games can be CPU-bound on a single render thread even if total CPU isn’t “100%.”

Decision: If load is high and GPU utilization is low, stop chasing SLI toggles. Lower CPU-heavy settings (view distance, crowd density), or accept you’re CPU-bound.

Task 8: Inspect per-core usage to catch a pegged render thread

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server)  01/13/2026  _x86_64_  (16 CPU)

10:22:18 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:22:19 AM  all   42.0  0.0  8.0   0.2    0.0  0.5    0.0    0.0    0.0   49.3
10:22:19 AM    3   98.5  0.0  1.0   0.0    0.0  0.0    0.0    0.0    0.0    0.5

What it means: One core (CPU3) is pegged. That’s your render/game thread bottleneck. Two GPUs won’t help if the frame can’t be fed.

Decision: Reduce CPU-bound settings, or move to a CPU/platform with higher single-thread performance. Multi-GPU won’t fix a narrow upstream pipe.

Task 9: Verify memory pressure (paging can masquerade as “GPU stutter”)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            32Gi        30Gi       500Mi       1.2Gi       1.5Gi       1.0Gi
Swap:           16Gi        10Gi       6.0Gi

What it means: You’re swapping heavily. That will destroy frametimes regardless of how many GPUs you stack.

Decision: Fix memory pressure first: close background apps, reduce texture settings, add RAM, and re-test. Treat swap usage as a red alert for frame pacing.

Task 10: Confirm CPU frequency and throttling status

cr0x@server:~$ lscpu | egrep -i 'model name|cpu mhz'
Model name:                           AMD Ryzen 9 5950X 16-Core Processor
CPU MHz:                               3599.998

What it means: Current frequency is shown, but not whether it’s throttling under sustained load.

Decision: If clocks drop under gaming load, fix cooling or power limits. Multi-GPU increases case heat, which can quietly nerf CPU boost.

Task 11: Check power capping / throttling flags on NVIDIA

cr0x@server:~$ nvidia-smi -q -d PERFORMANCE | egrep -i 'Power Limit|Clocks Throttle Reasons' -A3
    Power Limit                        : 250.00 W
    Clocks Throttle Reasons
        Idle                           : Not Active
        Applications Clocks Setting     : Not Active
        SW Power Cap                   : Active

What it means: The GPU is hitting a software power cap. In multi-GPU, PSUs and VRM thermals can force conservative limits.

Decision: If power cap is active and performance is inconsistent, consider lowering target FPS, improving airflow, or running a single GPU at higher sustained clocks.

Task 12: Check PCIe lane sharing hints from NVLink/bridge status (when available)

cr0x@server:~$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity
GPU0     X      PHB     0-7
GPU1    PHB      X      8-15

What it means: PHB indicates communication goes through the PCIe host bridge; it’s not a tight peer-to-peer path. That can hurt any workload needing inter-GPU traffic.

Decision: If topology is poor, stop expecting inter-GPU collaboration to be cheap. Prefer single-GPU rendering or explicit workload partitioning that avoids sharing.

Task 13: Confirm Vulkan device visibility (for explicit multi-GPU experiments)

cr0x@server:~$ vulkaninfo --summary | egrep -i 'GPU id|deviceName' -A1
GPU id : 0 (GeForce GTX 1080 Ti)
deviceName     = GeForce GTX 1080 Ti
GPU id : 1 (GeForce GTX 1080 Ti)
deviceName     = GeForce GTX 1080 Ti

What it means: Vulkan sees both devices. That’s a prerequisite for explicit multi-GPU apps, not a guarantee of any game supporting it.

Decision: If only one shows up, fix driver/runtime installation. If both show up, move on to application-level capability checks.

Task 14: Validate storage latency (yes, it can look like “GPU stutter”)

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.12    0.00    6.21    8.93    0.00   46.74

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s w_await aqu-sz  %util
nvme0n1         210.0   38400.0     0.0   0.00   3.10   182.86    75.0   10240.0   9.80   2.10  78.00

What it means: High iowait and elevated await can cause streaming hitches. Multi-GPU won’t fix shader compilation stalls or asset streaming latency.

Decision: If storage is saturated, reduce background IO, move game to faster storage, and address shader cache behavior. Fix the actual bottleneck.

Common mistakes (symptoms → root cause → fix)

1) “Second GPU shows 0–10% utilization”

Symptoms: One GPU runs hot, the other idles; FPS unchanged vs single GPU.

Root cause: The game/API path doesn’t support driver-managed multi-GPU, or the driver profile is missing/disabled.

Fix: Validate the title’s support for your GPU generation and API mode. If the game is DX12/Vulkan and doesn’t implement explicit multi-GPU, accept single GPU.

2) “Higher average FPS, but feels worse”

Symptoms: Benchmark says faster; gameplay feels stuttery; VRR makes it more obvious.

Root cause: AFR frametime variance (microstutter), queueing, or inconsistent per-frame workload.

Fix: Measure frametimes and cap FPS to stabilize pacing, or disable multi-GPU. Prioritize 1% low / 99th percentile frametime over averages.

3) “Textures pop in, then hitching gets brutal at 4K”

Symptoms: Sudden spikes, especially when turning quickly or entering new areas.

Root cause: VRAM limit is per GPU; mirroring means you didn’t gain capacity. You’re paging assets and stalling.

Fix: Lower texture resolution, reduce RT settings, or move to a single GPU with more VRAM.

4) “Random black screens / GPU disappeared”

Symptoms: Driver resets, one GPU drops off bus, intermittent stability issues.

Root cause: Power delivery instability, thermal stress, marginal PCIe signal integrity, or an overclock that was “stable” on one card.

Fix: Return to stock clocks, reduce power limit, improve cooling, verify cabling, avoid risers, update BIOS, and test each GPU solo.

5) “Works in one driver version, breaks in the next”

Symptoms: Scaling disappears or artifacts appear after a driver update.

Root cause: Profile changes, scheduling changes, or a regression in multi-GPU code paths (which are now low priority).

Fix: Pin driver versions for your use case, document known-good combinations, and don’t treat “latest driver” as inherently better for multi-GPU.

6) “Two GPUs, but CPU usage looks low—still CPU-bound”

Symptoms: GPU utilization low, FPS capped, total CPU under 50%.

Root cause: One or two hot threads (render thread, game thread). Total CPU hides per-core saturation.

Fix: Observe per-core usage. Reduce CPU-heavy settings; target stable frametimes; consider platform upgrade over adding GPUs.

7) “PCIe x8/x4 unexpectedly, scaling poor”

Symptoms: Worse-than-expected scaling; high stutter during streaming; topo shows PHB paths.

Root cause: Lane sharing with M.2/other devices, wrong slot choice, or chipset uplink limitations.

Fix: Use the correct slots, reduce lane consumers, or choose a platform with more CPU lanes if you insist on multi-device setups.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A small studio had a “performance lab” with a few high-end test rigs. Someone had built a monster machine: two top-tier GPUs, lots of RGB, and a
spreadsheet of benchmark numbers that made management happy. The studio used it to sign off on performance budgets for a new content-heavy level.

The wrong assumption was subtle: they assumed scaling was representative. Their sign-off machine was running AFR with a driver profile that
happened to work well for that specific build. It produced great average FPS in the lab. It did not produce great frametimes on most customer
machines, and it definitely didn’t represent the single-GPU baseline that the majority owned.

Release week arrived. Social media filled with “stutter in the new level” complaints. Internally, the lab rig looked “fine.” Engineers started
chasing phantom bugs in animation and physics because the GPU graphs didn’t look pegged.

The real culprit was asset streaming plus a new temporal effect. On the lab rig, AFR masked some GPU time by overlapping, while making pacing worse
in a way the studio didn’t measure. On single-GPU consumer rigs, the same effect pushed VRAM over the edge and triggered paging and shader cache
thrash. The studio had optimized for the wrong reality.

The fix wasn’t a magic multi-GPU tweak. They rebuilt their perf gate: single-GPU, frametime-based, with memory pressure thresholds. The dual-GPU
rig stayed in the lab, but it stopped being the source of truth. The incident ended when they stopped trusting a benchmark that didn’t match the
user population.

Mini-story 2: The optimization that backfired

An enterprise visualization team (think: large CAD scenes, real-time walkthroughs) tried to “get free performance” by enabling AFR in a controlled
environment. Their scenes were heavy on temporal accumulation: anti-aliasing, denoising, and a bunch of “use previous frame” logic. Someone argued
that since the GPUs were identical, the results should be consistent.

They got higher average throughput in a static camera. Great demo. Then they shipped a beta to a few internal stakeholders. As soon as you moved
the camera, image stability degraded: ghosting, shimmer, and inconsistent temporal filters. Worse, the interactive latency felt worse because the
queue depth increased under AFR.

The backfire was architectural: the renderer’s temporal pipeline assumed a coherent frame history. AFR split that history across devices. The team
added sync points and cross-GPU transfers to “fix it,” which destroyed the performance gain and introduced new stalls. Now they had complexity
and no speedup.

They eventually removed AFR and invested in a boring set of improvements: CPU-side culling, shader simplification, and content LOD rules. The final
system was faster on one GPU than the AFR build was on two. The optimization failed because it optimized the wrong layer: it tried to parallelize
something that was fundamentally serial in terms of temporal dependency.

Mini-story 3: The boring but correct practice that saved the day

A hardware validation group at a mid-sized company maintained a fleet of GPU test nodes. They didn’t game on them; they ran rendering and compute
regressions and occasionally reproduced customer bugs. The nodes included multi-GPU boxes because customers used them for compute, not because it
was fun.

Their secret weapon wasn’t a clever scheduler. It was a change log. Every node had a pinned driver version, a pinned firmware baseline, and a
simple “known-good” matrix. Updates were staged: one canary node first, then a small batch, then the rest. No exceptions. Nobody loved this. It
felt slow.

One week, a new driver introduced intermittent PCIe correctable errors on a specific motherboard revision when both GPUs were under mixed load.
On a developer’s workstation, it looked like random application crashes. In the fleet, the canary node started emitting AER logs within hours.

Because the group had boring discipline, they correlated the timeline, rolled back the canary, and blocked the rollout. No fleet-wide instability,
no massive reimaging, no scramble. They filed a vendor ticket with reproducible logs and a tight reproduction recipe.

The “save” wasn’t hero debugging. It was the operational practice of staged rollouts and version pinning. Multi-GPU systems amplify marginal
issues; the only sane response is to treat changes like production changes, not like weekend experiments.

Checklists / step-by-step plan

Step-by-step: decide whether multi-GPU is worth touching

  1. Define the goal. Is it higher average FPS, better 1% lows, or a specific compute/render workload?
  2. Identify the workload type. Real-time gaming with temporal effects? Assume “no.” Offline rendering/compute? Maybe “yes.”
  3. Check support reality. If the app doesn’t implement explicit multi-GPU and the vendor no longer supports driver profiles, stop here.
  4. Measure the baseline. Single GPU, stable driver, frametimes, VRAM usage, CPU per-core.
  5. Add the second GPU. Verify PCIe link width, power, thermals, and topology.
  6. Re-measure. Look for improvements in 99th percentile frametime and throughput, not just mean FPS.
  7. Decide. If gains are small or pacing is worse, remove it. Complexity tax is real.

Step-by-step: stabilize a multi-GPU box (when you must run it)

  1. Run stock clocks first. Overclocks that are “stable” on one GPU can fail in dual-GPU thermal conditions.
  2. Validate power budget. Ensure PSU headroom; avoid daisy-chained PCIe power cables for high draw.
  3. Lock versions. Pin driver/firmware; stage updates like production.
  4. Instrument. Log dmesg, AER events, GPU throttling reasons, temperatures, and utilization.
  5. Set expectations. For gaming, you’re optimizing for stability and pacing, not benchmark screenshots.

FAQ

1) Did SLI/CrossFire ever truly work?

Yes—sometimes. In well-profiled DX11 titles with AFR-friendly pipelines and minimal temporal dependencies, scaling could be strong. The problem is
“sometimes” is not a product strategy.

2) Why didn’t VRAM add up across GPUs for games?

Because each GPU needed local access to textures and geometry at full speed, and consumer multi-GPU typically mirrored resources per card. Without
a unified memory model, you can’t treat two VRAM pools as one without paying heavy synchronization and transfer costs.

3) What is microstutter, operationally speaking?

It’s latency variance. You’re delivering frames at inconsistent intervals—bursts and gaps—so motion looks uneven. It’s why “average FPS” is a
dangerously incomplete metric.

4) Why did DX12/Vulkan make multi-GPU rarer instead of more common?

They made it explicit. That’s architecturally honest but shifts work to the engine team: resource management, synchronization, testing across GPU
combinations, and QA coverage. Most studios didn’t want to fund that for a small user base.

5) Can two different GPUs work together for gaming now?

Not in the old “driver does it for you” way. Explicit multi-adapter can, in theory, use heterogeneous GPUs, but real-world support is rare and
usually specialized. For typical games: assume no.

6) What about NVLink—does that fix it?

NVLink helps certain peer-to-peer bandwidth scenarios and is valuable in compute. It doesn’t automatically solve frame pacing, temporal
dependencies, or the software economics problem. Interconnects don’t fix architecture.

7) If I already own two GPUs, what should I do?

For gaming: run one GPU and sell the other, or repurpose it for compute/encoding. For compute: use frameworks that explicitly support multi-GPU and
measure scaling with realistic batch sizes and synchronization overhead.

8) What metrics should I trust when testing multi-GPU?

Frametime percentiles (like 99th), input latency feel (hard to measure, easy to notice), per-GPU utilization, VRAM headroom, and stability logs.
Average FPS is a vanity metric in this context.
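
If you have per-frame times in a plain text file (one value in milliseconds per line, exported from whatever overlay or capture tool you use), a rough nearest-rank percentile is a one-liner; frametimes_ms.txt is a hypothetical capture:

cr0x@server:~$ sort -n frametimes_ms.txt | awk '{a[NR]=$1} END {i=int(NR*0.99); if (i<1) i=1; printf "frames=%d p99=%.2f ms\n", NR, a[i]}'

Two runs with identical average FPS can differ wildly here; the p99 number is the one that correlates with what your eyes complain about.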

9) Is multi-GPU completely dead?

Not broadly—just in consumer real-time gaming as a default acceleration path. Multi-GPU thrives where the workload can be partitioned cleanly:
offline rendering, scientific compute, ML, and some professional visualization pipelines.

Next steps you can actually take

If you’re thinking about multi-GPU for gaming in 2026, here’s the blunt advice: don’t. Buy the best single GPU you can justify, then optimize for
frametimes, VRAM headroom, and a stable driver stack. You’ll get a system that behaves predictably, which is what you want when you’re the one who
has to debug it.

If you must run multi-GPU—because your workload is compute, offline render, or specialized visualization—treat it like production infrastructure:
pin versions, stage updates, instrument everything, and assume the second GPU increases your failure surface area more than your performance.

Practical next steps:

  • Switch your testing mindset from “average FPS” to frametime percentiles and reproducible runs.
  • Validate PCIe link width, topology, and power stability before touching drivers.
  • Decide upfront whether your application uses explicit multi-GPU; if not, stop investing time.
  • Keep one known-good driver baseline and treat updates as a controlled rollout.

Docker Desktop Networking Weirdness: LAN Access, Ports, and DNS Fixes That Actually Work

You run docker run -p 8080:80, hit localhost:8080, and it works. You hand the URL to a coworker on the same Wi‑Fi, and… nothing.
Or your container can curl the internet but can’t reach the NAS on your LAN. Or DNS flips a coin every time your VPN connects.

Docker Desktop networking isn’t “broken.” It’s just not the Linux host networking model you think you’re using.
It’s a VM, a NAT, a pile of platform-specific shims, and a handful of special names that exist mostly to save our sanity.

The mental model: why Docker Desktop is different

On Linux, Docker typically plugs containers into a bridge network on the host, uses iptables/nftables to NAT outbound traffic,
and adds DNAT rules for published ports. Your host is the host. The kernel that runs containers is the same kernel that runs your shell.

Docker Desktop on macOS and Windows is different by design. It runs a small Linux VM (or a Linux environment via WSL2),
and the containers live behind a virtualization boundary. That boundary is why “host networking” behaves weirdly,
why LAN access is not symmetrical, and why port publishing can feel like it’s aimed at localhost only.

Think in layers:

  • Your physical machine OS (macOS/Windows): has your Wi‑Fi/Ethernet interface, your VPN client, and your firewall.
  • The Docker VM / WSL2: has its own virtual NIC, its own routing table, its own iptables, and its own DNS behavior.
  • Container networks: bridges inside that Linux environment; your containers rarely touch the physical LAN directly.
  • Port publishing shim: Docker Desktop forwards ports from the host OS to the VM to the container.

So when someone says “the container can’t reach the LAN,” your first response should be: “Which layer can’t reach which layer?”

Interesting facts and short history (the stuff that explains today’s pain)

  1. Docker’s original networking model assumed Linux. Early Docker popularized the “bridge + NAT + iptables” pattern because Linux made it easy and portable.
  2. macOS can’t run Linux containers natively. Docker Desktop on macOS has always relied on a Linux VM because containers need Linux kernel features (namespaces, cgroups).
  3. Windows had two eras. First came Hyper-V based Docker Desktop; then WSL2 became the default path for better filesystem and resource behavior, with different networking quirks.
  4. host.docker.internal exists because “the host” is ambiguous. Inside a container, “localhost” is the container; Docker Desktop needed a stable hostname for “the host OS.”
  5. Published ports aren’t just iptables rules on Desktop. On Linux they are; on Desktop they’re often implemented by a user-space proxy/forwarder across the VM boundary.
  6. VPN clients love to rewrite your DNS and routes. They often install a new DNS server, block split DNS, or add a virtual interface with higher priority than Wi‑Fi.
  7. Corporate endpoint security frequently injects a local proxy. This can break container DNS, MITM TLS, or silently divert traffic to “inspection” infrastructure.
  8. ICMP lies to you in virtual networks. “Can’t ping” does not reliably mean “can’t connect,” especially when firewalls block ICMP but allow TCP.

Joke #1: Docker Desktop networking is like an org chart—there’s always one more layer than you think, and it’s never the layer accountable.

Fast diagnosis playbook (check first/second/third)

The fastest way to win is to stop guessing. Diagnose in this order, because it isolates layers with minimal effort.

1) Is it a port publishing problem or a routing/DNS problem?

  • If localhost:PORT works on your machine but LAN clients can’t reach it, you’re likely dealing with host firewall/bind address/VPN route filtering.
  • If containers can’t resolve names or reach any external host, start with DNS and outbound routing from inside the container/VM.

2) Identify where the packet dies (host OS → VM → container)

  • From host OS: can you reach the LAN target?
  • From inside a container: can you reach the same LAN target by IP?
  • From inside a container: can you resolve the name?

3) Verify the actual bind/listen address and the forwarder

  • Is the service listening on 0.0.0.0 inside the container, or only on 127.0.0.1?
  • Is Docker publishing the port on all interfaces or only on localhost?
  • Is the host firewall blocking inbound from the LAN?

4) Check VPN and DNS override behavior early

  • If the problem appears/disappears with VPN, stop treating it as a Docker bug. It’s policy, routes, DNS, or inspection.

5) Only then tweak Docker Desktop settings

  • Changing DNS servers or network ranges can help, but do it with evidence. Otherwise you’ll just create a new mystery.

Practical tasks: commands, outputs, and decisions (12+)

These are the checks I actually run. Each includes: command, example output, what it means, and what decision to make next.
Commands are shown with a generic prompt; adapt the interface names and IPs to your environment.

Task 1: Confirm which Docker context you’re using

cr0x@server:~$ docker context ls
NAME                DESCRIPTION                               DOCKER ENDPOINT
default *           Current DOCKER_HOST based configuration   unix:///var/run/docker.sock
desktop-linux       Docker Desktop                            unix:///Users/me/.docker/run/docker.sock

Meaning: If you think you’re talking to Desktop but you’re on a remote daemon (or vice versa), every networking assumption will be wrong.
Decision: If the starred context isn’t what you expect, switch it: docker context use desktop-linux.

Task 2: Inspect a container’s IP and network attachment

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Ports}}'
NAMES          PORTS
web            0.0.0.0:8080->80/tcp
db             5432/tcp
cr0x@server:~$ docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}} {{.Gateway}}{{end}}' web
/web 172.17.0.2 172.17.0.1

Meaning: The container lives on an internal bridge (here 172.17.0.0/16). That is not your LAN.
Decision: If you’re trying to reach 172.17.0.2 from another laptop on the Wi‑Fi, stop. Publish a port or use a different networking pattern.

Task 3: Check what address your service is actually listening on

cr0x@server:~$ docker exec -it web sh -lc "ss -lntp | head -n 5"
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
LISTEN 0      4096   0.0.0.0:80         0.0.0.0:*     users:(("nginx",pid=1,fd=6))

Meaning: Listening on 0.0.0.0 is good; it accepts traffic from the container network.
If you see 127.0.0.1:80, the port publish will “work” in confusing ways or fail entirely.
Decision: If it’s bound to localhost, fix the app config: bind to 0.0.0.0.

Task 4: Verify published port bindings on the Docker side

cr0x@server:~$ docker port web
80/tcp -> 0.0.0.0:8080

Meaning: Docker believes it published to all interfaces.
Decision: If it shows 127.0.0.1:8080, LAN clients won’t reach it. Re-run with -p 0.0.0.0:8080:80 (or fix your compose file).

Task 5: Confirm the host OS is listening on the expected port

cr0x@server:~$ ss -lntp | grep ':8080'
LISTEN 0      4096      0.0.0.0:8080     0.0.0.0:*    users:(("com.docker.backend",pid=2314,fd=123))

Meaning: On Desktop, you often see Docker’s backend process listening, not the container PID. That’s normal.
Decision: If nothing is listening, your publish didn’t apply, or another process stole the port.

Task 6: Test from the host OS to confirm the forward path works

cr0x@server:~$ curl -sS -D- http://127.0.0.1:8080/ | head
HTTP/1.1 200 OK
Server: nginx/1.25.3
Date: Sat, 03 Jan 2026 09:12:52 GMT
Content-Type: text/html

Meaning: Host-to-container port forwarding works locally.
Decision: If LAN clients can’t connect, focus on firewall/VPN/bind-to-localhost, not the container app.

Task 7: Test from a LAN peer (simulate with another namespace/host if you can)

cr0x@server:~$ nc -vz 192.168.1.50 8080
Connection to 192.168.1.50 8080 port [tcp/http-alt] succeeded!

Meaning: The port is reachable from the LAN.
Decision: If it fails with “timed out” you likely have firewall/routing issues. If “refused,” something is listening but not accepting on that interface or the forwarder isn’t bound correctly.

Task 8: Check container DNS configuration

cr0x@server:~$ docker exec -it web sh -lc "cat /etc/resolv.conf"
nameserver 192.168.65.5
search localdomain
options ndots:0

Meaning: Docker Desktop often injects a stub resolver IP (example: 192.168.65.5) inside the VM network.
Decision: If this nameserver is unreachable or misbehaving (common with VPNs), override DNS at the daemon/compose level.

Task 9: Test DNS resolution inside the container (don’t guess)

cr0x@server:~$ docker exec -it web sh -lc "getent hosts example.com | head -n 2"
2606:2800:220:1:248:1893:25c8:1946 example.com
93.184.216.34 example.com

Meaning: DNS works well enough to resolve both AAAA and A.
Decision: If it hangs or returns nothing, you have a DNS path problem. Next step: try resolving using a specific server (if you have tools installed) or override resolvers.

Task 10: Test direct IP connectivity to a LAN resource from inside the container

cr0x@server:~$ docker exec -it web sh -lc "nc -vz 192.168.1.10 445"
192.168.1.10 (192.168.1.10:445) open

Meaning: Routing from container → VM → host OS → LAN works for that destination.
Decision: If IP works but name fails, it’s DNS. If neither works, it’s routing/VPN/policy.

Task 11: Check the container’s default route (basic but decisive)

cr0x@server:~$ docker exec -it web sh -lc "ip route"
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 scope link  src 172.17.0.2

Meaning: The container routes to the bridge gateway. The gateway then decides how to reach your LAN/internet.
Decision: If the default route is missing or wrong, you’ve built a custom network setup; back up and test with a vanilla bridge network.

Task 12: Check whether you’re colliding with a corporate/VPN subnet

cr0x@server:~$ ip route | head -n 12
default via 192.168.1.1 dev wlan0
10.0.0.0/8 via 10.8.0.1 dev tun0
172.16.0.0/12 via 10.8.0.1 dev tun0
192.168.1.0/24 dev wlan0 proto kernel scope link src 192.168.1.50

Meaning: If your Docker networks use 172.16.0.0/12 and your VPN also routes 172.16.0.0/12, you’ve created ambiguous routing.
Desktop is especially sensitive to overlap because it’s already NATing.
Decision: Change Docker’s internal subnet ranges to avoid overlap with corporate routes.

Task 13: Inspect Docker networks and their subnets

cr0x@server:~$ docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
a1b2c3d4e5f6   bridge    bridge    local
f1e2d3c4b5a6   host      host      local
123456789abc   none      null      local
cr0x@server:~$ docker network inspect bridge --format '{{(index .IPAM.Config 0).Subnet}}'
172.17.0.0/16

Meaning: You now know which subnets Docker is consuming.
Decision: If this overlaps with VPN routes or your LAN, move it.

Task 14: Validate that the container can reach the host OS via Docker Desktop’s special name

cr0x@server:~$ docker exec -it web sh -lc "getent hosts host.docker.internal"
192.168.65.2    host.docker.internal

Meaning: The special mapping exists and points at the host-side endpoint Docker provides.
Decision: If this name doesn’t resolve, you’re on an older setup, a custom network mode, or something tampered with DNS inside the container. Use explicit IPs only as a last resort.

LAN access patterns: what works, what lies

There are three common asks:

  • LAN → your containerized service (coworker wants to hit your dev server).
  • Container → LAN resource (container needs to reach NAS, printer, internal API, Kerberos, whatever).
  • Container → host OS (container calls a service running on your laptop).

Pattern A: LAN → container via published ports (the only sane default)

Publish ports on the host OS, not by trying to hand out container IPs.
With Docker Desktop you cannot treat container IPs as routable on the physical LAN. They live behind NAT, inside a VM, behind another NAT if your OS is also doing something clever.

What to do:

  • Bind to all interfaces: -p 0.0.0.0:8080:80 or, in Compose, "8080:80"; make sure nothing overrides it to localhost-only publishing (see the sketch after this list).
  • Open the host firewall for that port (and limit scope; don’t expose your dev database to Starbucks Wi‑Fi).
  • If your VPN forbids inbound from LAN while connected, accept reality: test without VPN or use a proper dev environment elsewhere.
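
A minimal Compose sketch for Pattern A, with nginx and port 8080 purely as placeholders:

cr0x@server:~$ cat docker-compose.yml
services:
  web:
    image: nginx:alpine
    ports:
      # Publish on all host interfaces so LAN peers can connect.
      # "127.0.0.1:8080:80" would restrict access to this machine only.
      - "0.0.0.0:8080:80"
cr0x@server:~$ docker compose up -d
cr0x@server:~$ docker port $(docker compose ps -q web)
80/tcp -> 0.0.0.0:8080

If docker port shows 0.0.0.0 and LAN peers still fail, the remaining suspects are the host firewall and VPN policy, not the container.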

Pattern B: container → LAN resources (routing works until it doesn’t)

Containers reaching your LAN usually works out of the box, because Docker Desktop NATs outbound traffic through the host OS.
Then you connect a VPN, and the host OS changes DNS and routes. Suddenly your container can’t resolve or can’t reach subnets that are now “owned” by the VPN.

When it fails, it fails in a few repeatable ways:

  • Subnet overlap: Docker chooses a private range that your VPN routes. Packets disappear into the tunnel.
  • Split DNS mismatch: host resolves internal names via corporate DNS, but containers are stuck on a stub resolver that doesn’t forward split domains correctly.
  • Firewall policy: corporate endpoint denies traffic from “unknown” virtual interfaces.

Pattern C: container → host OS services (use the special names)

Use host.docker.internal. That is what it’s for.
It’s not elegant, but it’s stable across DHCP changes and less fragile than hardcoding 192.168.x.y.

If you’re on Linux (not Desktop) you may not have it; on Desktop you generally do.
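
A quick way to prove Pattern C works, plus the workaround for engines that lack the name; assume a container called web and a hypothetical health endpoint on host port 3000:

cr0x@server:~$ docker exec -it web sh -lc "wget -qO- http://host.docker.internal:3000/healthz"
ok
cr0x@server:~$ cat docker-compose.yml
services:
  web:
    image: nginx:alpine
    extra_hosts:
      # On Linux engines without the built-in name, recent Docker versions
      # let you map it to the host's gateway address explicitly.
      - "host.docker.internal:host-gateway"

Treat the port, path, and "ok" response as placeholders; the point is that the name resolves and the connection leaves the container toward the host OS.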

Ports: publishing, binding addresses, and why coworkers can’t hit your dev server

Published ports are the currency of “make my container reachable.” Everything else is debt.

Localhost isn’t a moral virtue, it’s a bind address

Two different things get confused constantly:

  • Where the app listens inside the container (127.0.0.1 vs 0.0.0.0).
  • Where Docker binds the published port on the host (127.0.0.1:PORT vs 0.0.0.0:PORT).

If either one is “localhost-only,” LAN clients lose. And you’ll waste time blaming the other layer.
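
If you want to see the difference rather than argue about it, run the same image twice with different publish addresses; a throwaway sketch, ports and names arbitrary:

cr0x@server:~$ docker run -d --name only-me -p 127.0.0.1:9090:80 nginx:alpine
cr0x@server:~$ docker run -d --name lan-ok -p 0.0.0.0:9091:80 nginx:alpine
cr0x@server:~$ ss -lnt | grep -E ':(9090|9091)'
LISTEN 0      4096      127.0.0.1:9090     0.0.0.0:*
LISTEN 0      4096        0.0.0.0:9091     0.0.0.0:*

From another machine on the LAN, only 9091 is reachable. Neither flag changes what the app binds to inside the container; that is a separate knob, checked with ss inside the container.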

Compose tip: don’t accidentally bind to localhost

Compose supports explicit host IP binding. This is great when you mean it and awful when you don’t.

cr0x@server:~$ cat docker-compose.yml
services:
  web:
    image: nginx:alpine
    ports:
      - "127.0.0.1:8080:80"

Meaning: That service is intentionally reachable only from the host OS.
Decision: If you want LAN access, change it to "8080:80" or "0.0.0.0:8080:80", and then handle firewall scope properly.

When published ports still aren’t reachable from the LAN

If Docker shows 0.0.0.0:8080 but LAN clients can’t connect:

  • Host firewall: macOS Application Firewall, Windows Defender Firewall, third-party endpoint tools.
  • Interface selection: the port may be bound, but the OS may block inbound on Wi‑Fi while allowing it on Ethernet (or vice versa).
  • VPN policy: some clients enforce “block local LAN” to reduce lateral movement risk.
  • NAT hairpin quirks: some networks don’t let you reach your own public IP from inside; that’s not Docker, that’s your router doing its best.

Joke #2: Nothing improves teamwork like telling someone “it works on my machine” and meaning it as a network architecture statement.

DNS fixes: from “it’s flaky” to “it’s deterministic”

DNS is where Docker Desktop weirdness goes to become folklore. The problem is usually not “Docker can’t do DNS.”
The problem is: you now have at least two resolvers (host OS and VM), sometimes three (VPN’s), and they don’t agree on split-horizon rules.

Failure mode 1: container DNS resolves public names but not internal ones

Classic corporate split DNS: git.corp only resolves via internal DNS servers, reachable only on VPN.
Your host OS does the right thing. Your container uses a stub resolver that doesn’t forward the right domains to the right servers.

Fix options, from best to worst:

  1. Configure Docker Desktop DNS to use your internal resolvers when on VPN, and public resolvers when off VPN. This is sometimes a manual toggle because “auto” can be unreliable.
  2. Per-project DNS in Compose:
    • Set dns: to the IPs of resolvers that can answer both internal and external names (often your VPN-provided ones).
  3. Hardcode /etc/hosts inside containers. This is a tactical hack, not a strategy.

Task 15: Override DNS in Compose and verify inside container

cr0x@server:~$ cat docker-compose.yml
services:
  web:
    image: alpine:3.20
    command: ["sleep","infinity"]
    dns:
      - 10.8.0.53
      - 1.1.1.1
cr0x@server:~$ docker compose up -d
[+] Running 1/1
 ✔ Container web-1  Started
cr0x@server:~$ docker exec -it web-1 sh -lc "cat /etc/resolv.conf"
nameserver 10.8.0.53
nameserver 1.1.1.1

Meaning: The container is now using the DNS servers you specified.
Decision: If internal domains now resolve, you’ve proven it’s a DNS path/split DNS issue, not an application issue.

Failure mode 2: DNS works, but only sometimes (timeouts, slow builds, flaky package installs)

Intermittent DNS failures often come from:

  • VPN DNS servers that drop UDP under load or require TCP for large responses.
  • Corporate security agents intercepting DNS and occasionally timing out.
  • MTU/MSS issues on tunneled links (DNS over UDP fragments and then dies quietly).

Task 16: Detect DNS timeouts vs NXDOMAIN inside container

cr0x@server:~$ docker exec -it web-1 sh -lc 'time getent hosts pypi.org >/dev/null; echo $?'
real    0m0.042s
user    0m0.000s
sys     0m0.003s
0

Meaning: Fast success.
Decision: If this takes seconds or fails intermittently, prefer changing resolvers (or forcing TCP via a different resolver) over retrying forever in your build scripts.

Failure mode 3: internal service works by IP but not by name (and only on VPN)

That’s split DNS again, but with extra spice: sometimes the VPN pushes a DNS suffix and search domains to the host OS,
but Docker Desktop’s resolver doesn’t inherit them cleanly.

Task 17: Confirm search domains inside container

cr0x@server:~$ docker exec -it web-1 sh -lc "cat /etc/resolv.conf"
nameserver 10.8.0.53
search corp.example
options ndots:0

Meaning: Search domain is present.
Decision: If it’s missing, FQDNs may work while short names fail. Either use FQDNs or configure search domains at the container level.

VPNs, split tunnels, and corporate endpoint “helpfulness”

VPNs cause two broad classes of issues: routing changes and DNS changes. Docker Desktop amplifies both because it’s effectively a nested network.

Routing: when the VPN steals your RFC1918 space

Many corporate networks route large private ranges like 10.0.0.0/8 or 172.16.0.0/12 through the tunnel.
Docker defaults often use 172.17.0.0/16 for the bridge and other 172.x ranges for user-defined networks.

On a pure Linux host, you can usually manage this with custom bridge subnets and iptables. On Desktop, you can still do it, but you must treat it as a first-class configuration.
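
One way to make "don't overlap with the VPN" a standing policy instead of per-network discipline is the daemon's default address pools. A sketch, assuming 192.168.240.0/20 is genuinely unused in your environment; on Docker Desktop the same JSON goes into the Docker Engine settings pane instead of /etc/docker/daemon.json:

cr0x@server:~$ cat /etc/docker/daemon.json
{
  "default-address-pools": [
    { "base": "192.168.240.0/20", "size": 24 }
  ]
}

After an engine restart, new user-defined networks are carved from that range rather than the 172.x defaults; existing networks keep their subnets until recreated.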

Task 18: Create a user-defined network on a “safe” subnet

cr0x@server:~$ docker network create --subnet 192.168.240.0/24 devnet
9f8c7b6a5d4e3c2b1a0f
cr0x@server:~$ docker run -d --name web2 --network devnet -p 8081:80 nginx:alpine
b1c2d3e4f5a6
cr0x@server:~$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' web2
192.168.240.2

Meaning: You’ve moved the container network away from common corporate routes.
Decision: If VPN-related reachability improves, institutionalize a subnet policy for dev networks.

Endpoint security: the invisible middlebox

Some endpoint tools treat virtualization NICs as “untrusted.” They may block inbound or outbound, or force traffic through a proxy.
Symptoms include: published ports work only when the security agent is paused, DNS becomes slow, or internal services fail TLS due to inspection.

You can’t “SRE” your way out of policy. What you can do is get proof quickly, then escalate with concrete evidence.

Task 19: Prove it’s local firewall/policy with a quick inbound test

cr0x@server:~$ python3 -m http.server 18080 --bind 0.0.0.0
Serving HTTP on 0.0.0.0 port 18080 (http://0.0.0.0:18080/) ...

Meaning: This is not Docker. This is a plain host process.
Decision: If a LAN peer can’t reach this either, stop debugging Docker and fix firewall/VPN “block local network” settings.

Windows + WSL2 specifics (where packets go to retire)

On modern Windows, Docker Desktop often runs its engine inside WSL2. WSL2 has its own virtual network (NAT behind Windows).
That means you can have: container NAT behind Linux, behind WSL2 NAT, behind Windows firewall rules. It’s NAT all the way down.

Typical Windows symptoms

  • Published port reachable from Windows localhost but not from LAN. Usually Windows Defender Firewall inbound rules, or the binding is loopback-only.
  • Containers can’t reach a LAN subnet that Windows can reach. Usually VPN routes are not propagated the way you think into WSL2, or policy blocks WSL interfaces.
  • DNS differs between Windows and WSL2. WSL2 writes its own /etc/resolv.conf; sometimes it points at a Windows-side resolver that can’t see VPN DNS.

Task 20: Check WSL2’s resolv.conf and route table (from inside WSL)

cr0x@server:~$ cat /etc/resolv.conf
nameserver 172.29.96.1
cr0x@server:~$ ip route | head
default via 172.29.96.1 dev eth0
172.29.96.0/20 dev eth0 proto kernel scope link src 172.29.96.100

Meaning: WSL2 is using a Windows-side virtual gateway/resolver.
Decision: If DNS breaks only on VPN, consider configuring WSL2 DNS behavior (static resolv.conf) and aligning Docker’s DNS with the VPN resolvers.
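
If you do pin WSL2's DNS, the usual pattern is to stop WSL from regenerating resolv.conf and write your own. A sketch, assuming 10.8.0.53 is your VPN resolver; run this inside the WSL distro, then restart it with wsl --shutdown from Windows:

cr0x@server:~$ cat /etc/wsl.conf
[network]
generateResolvConf = false
cr0x@server:~$ sudo rm -f /etc/resolv.conf
cr0x@server:~$ echo "nameserver 10.8.0.53" | sudo tee /etc/resolv.conf
nameserver 10.8.0.53

The trade-off: a static resolver means you now own keeping it correct when you change networks or VPN profiles.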

macOS specifics (pf, vmnet, and the illusion of localhost)

On macOS, Docker Desktop runs a Linux VM and forwards ports back to macOS.
Your containers are not first-class citizens on your physical LAN. They’re guests behind a very polite concierge.

What macOS users trip over

  • “It works on localhost but not from my phone.” Usually macOS firewall or port published to loopback only.
  • DNS changes when Wi‑Fi changes networks. The host resolver changes quickly; the VM sometimes lags or caches oddness.
  • Corporate VPN blocks local subnet access. Your phone can’t reach your laptop while the VPN is connected, regardless of Docker.

Task 21: Confirm the host OS has the right IP and interface for LAN testing

cr0x@server:~$ ip addr show | sed -n '1,25p'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 192.168.1.50/24 brd 192.168.1.255 scope global dynamic wlan0

Meaning: Your LAN IP is 192.168.1.50.
Decision: This is the address a LAN peer should use to hit your published port. If peers are using an old IP, they’re testing the wrong machine.

Common mistakes: symptom → root cause → fix

1) Symptom: localhost:8080 works, coworker can’t reach 192.168.x.y:8080

  • Root cause: Port published to 127.0.0.1 only, or host firewall blocks inbound.
  • Fix: Publish on all interfaces (-p 0.0.0.0:8080:80), then allow inbound for that port on the host firewall for the correct network profile.

2) Symptom: container can reach internet but not 192.168.1.10 (LAN NAS)

  • Root cause: VPN “block local LAN” policy or routes pushing LAN subnets into the tunnel.
  • Fix: Test with VPN disconnected; if that fixes it, request split-tunnel exceptions or run the workload in a proper environment (remote dev VM, staging). Don’t fight policy with hacks.

3) Symptom: container can reach LAN IPs but internal hostnames fail

  • Root cause: Split DNS not propagated into Docker Desktop; containers using a stub resolver that can’t see internal zones.
  • Fix: Configure container/project DNS (dns: in Compose) to include corporate DNS servers reachable on VPN; verify with getent hosts.

4) Symptom: DNS flaps during builds (apt/npm/pip failing randomly)

  • Root cause: Unreliable UDP DNS across VPN, MTU issues, endpoint interception.
  • Fix: Prefer stable resolvers; use two resolvers (internal + public) where policy allows; reduce fragmentation risk by addressing MTU at the VPN layer if you control it.

5) Symptom: service is published, but you get “connection refused” from LAN

  • Root cause: App is listening only on container localhost, or wrong container port published.
  • Fix: Check ss -lntp inside container; fix bind address; verify docker port and container port mapping.

6) Symptom: can’t connect to host.docker.internal from container

  • Root cause: DNS override removed the special name, or using a network mode where Desktop doesn’t inject it.
  • Fix: Avoid overriding DNS blindly; if you must, ensure the special name still resolves (or add an explicit host entry via extra_hosts as a last resort).

7) Symptom: everything breaks only on one Wi‑Fi network

  • Root cause: That network isolates clients (AP isolation) or blocks inbound connections between devices.
  • Fix: Use a proper network (or wired), or run the service behind a reverse tunnel; don’t assume “same Wi‑Fi” means “mutually reachable.”

Three corporate mini-stories (realistic, anonymized, painful)

Mini-story 1: The incident caused by a wrong assumption

A product team built a demo environment on laptops for an on-site customer workshop. The plan was simple: run a few services in Docker Desktop, publish ports,
and have attendees connect over the hotel Wi‑Fi. Everyone had done “-p 8080:8080” a thousand times.
The wrong assumption was that Docker Desktop behaves like a Linux host on a flat LAN.

The morning of the workshop, half the attendees couldn’t connect. The services were up. Local curl worked. The presenters could reach each other sometimes.
People started rebooting like it was 1998. The networking issue wasn’t Docker; it was the hotel Wi‑Fi doing client isolation—devices could reach the internet but not each other.

The second wrong assumption arrived immediately after: “Let’s just use container IPs and avoid port mapping.”
They tried to hand out 172.17.x.x addresses visible inside the Docker VM, which of course were not reachable from other laptops.
That led to ten minutes of confident nonsense and one deeply regretted whiteboard diagram.

The fix was boring: create a local hotspot on a phone that allowed peer-to-peer traffic,
and explicitly publish required ports on 0.0.0.0 with a quick firewall allow rule.
The services were fine. The assumption about “same network” was the actual outage.

Mini-story 2: The optimization that backfired

A platform team wanted faster CI runs on developer machines. They noticed frequent DNS lookups during builds and decided to “optimize” by forcing Docker containers to use a public DNS resolver.
It looked great in a coffee-shop test: faster resolves, fewer timeouts, nice graphs.

Then the first engineer tried to build while on VPN. Internal package registries were only reachable via corporate DNS and internal routes.
Suddenly, builds failed with “host not found” even though the host OS resolved fine. The workaround became “disconnect VPN,”
which is a great way to create the next incident.

The situation got worse because some internal names resolved publicly to placeholder IPs (for security reasons), so “DNS succeeded” but connections went to a blackhole.
Debugging was brutal: you’d see A records, the app would time out, and everyone blamed TLS, proxies, and Docker in random order.

The eventual fix was to stop optimizing DNS globally. They moved to per-project DNS settings:
internal resolvers first when on VPN, public resolvers only when off VPN.
They also documented how to test resolution inside containers, because “it resolves on my host” is not a data point in a nested network.

Mini-story 3: The boring but correct practice that saved the day

A security-sensitive service used Docker Desktop for local integration testing. It needed to call an internal API and also accept inbound webhooks from a test harness on another machine in the same office.
The team had a habit I respect: before changing settings, they captured “known-good” network evidence—routes, DNS config, port bindings—when it worked.

One Monday, everything broke after an OS update. Containers couldn’t resolve internal names. Webhooks from a LAN machine stopped arriving.
Instead of guessing, they compared current state to the baseline: published ports were now bound to localhost only, and the DNS stub inside containers pointed at a new VM-side resolver IP that wasn’t forwarding split DNS.

They fixed the port binding in Compose, then pinned container DNS to the internal resolvers while on VPN.
Because they had the baseline, they could show the endpoint security team exactly what changed and why.
The incident didn’t turn into a week-long blame festival.

That practice—capture baseline, diff when broken—is as exciting as watching paint dry.
It also works.

Checklists / step-by-step plan (boring on purpose)

Checklist 1: Expose a Docker Desktop service to your LAN reliably

  1. Ensure the app listens on 0.0.0.0 inside the container (ss -lntp).
  2. Publish the port on all host interfaces: -p 0.0.0.0:8080:80 (or Compose "8080:80").
  3. Confirm Docker sees the mapping: docker port CONTAINER.
  4. Confirm the host OS is listening on that port: ss -lntp | grep :8080.
  5. Test locally: curl http://127.0.0.1:8080.
  6. Test from a LAN peer: nc -vz HOST_LAN_IP 8080.
  7. If LAN test fails, run a non-Docker listener (python3 -m http.server) to isolate firewall/VPN from Docker issues; the combined check below rolls steps 1-5 into one paste.
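
The same checklist as a single paste, assuming the container is named web and the published port is 8080; it only proves the local half, so you still need a real LAN peer for steps 6-7:

cr0x@server:~$ C=web; P=8080
cr0x@server:~$ docker exec "$C" sh -lc "ss -lnt"     # Local Address must be 0.0.0.0 (or [::]), not 127.0.0.1
cr0x@server:~$ docker port "$C"                      # must show 0.0.0.0:8080, not 127.0.0.1:8080
cr0x@server:~$ ss -lnt | grep ":$P "                 # host OS must be listening on the port
cr0x@server:~$ curl -fsS -o /dev/null "http://127.0.0.1:$P/" && echo "local forward OK"
local forward OK

If all four checks pass and a LAN peer still can't connect, you've isolated the problem to firewall, VPN policy, or the network itself.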

Checklist 2: Make containers reach internal LAN resources (NAS, internal APIs)

  1. From host OS, verify the target is reachable by IP.
  2. From inside the container, test IP connectivity (nc -vz or curl).
  3. If IP fails only on VPN, check route overlap (ip route) and VPN policies (“block local LAN”).
  4. If IP works but name fails, check /etc/resolv.conf and resolve with getent hosts.
  5. Override DNS per-project using Compose dns: if needed.
  6. Avoid subnet overlap: move Docker networks to a range your VPN doesn’t route.

Checklist 3: Stabilize DNS for dev builds (pip/npm/apt stop flaking)

  1. Measure resolution time inside container with time getent hosts.
  2. Inspect current resolvers in /etc/resolv.conf.
  3. If on VPN, prefer the VPN-provided internal resolvers (and add a public fallback only if permitted).
  4. Don’t hardcode public DNS globally across all projects; you’ll break split DNS workflows.
  5. Re-test inside container after changes; don’t trust host OS results.

FAQ

1) Why can’t I just use the container IP from another machine on my LAN?

Because on Docker Desktop that IP is on an internal bridge inside a Linux VM (or WSL2 environment). Your LAN doesn’t route to it. Publish ports instead.

2) Why does -p 8080:80 work locally but not from my phone?

Usually either the port is bound to localhost only (explicitly or via Compose), or your host firewall/VPN blocks inbound connections from the LAN.

3) What’s the difference between 127.0.0.1 and 0.0.0.0 in this context?

127.0.0.1 means “only accept connections from this same network stack.” 0.0.0.0 means “listen on all interfaces.”
You need 0.0.0.0 if you expect other devices to connect.

4) Is --network host the fix for Docker Desktop networking?

No. On Docker Desktop, “host network” is not the same as Linux host networking and often won’t give you what you want. Default to bridge + published ports.

5) Why does DNS work on my host but not inside containers?

The container may be using a different resolver path (a stub inside the VM), and it may not inherit your VPN’s split DNS configuration.
Verify with cat /etc/resolv.conf and getent hosts inside the container, then override DNS per-project if needed.

6) Should I set Docker Desktop DNS to a public resolver to “fix everything”?

Only if you never need internal DNS. Public resolvers can break corporate domains, internal registries, and split-horizon setups.
Use project-specific DNS or conditional behavior tied to VPN state.

7) My container can’t reach a LAN device only when the VPN is connected. Is Docker at fault?

Almost always no. VPN clients can route private subnets through the tunnel or block local LAN access.
Prove it by testing the same connection from the host OS and by disconnecting VPN as a control.

8) What’s the most reliable way for a container to call a service on my laptop?

Use host.docker.internal and keep it consistent across environments. Avoid hardcoded host IP addresses that change with Wi‑Fi networks.

9) How do I know whether the problem is firewall vs Docker port mapping?

Run a non-Docker listener on the host (like python3 -m http.server). If the LAN can’t reach that, Docker isn’t the problem.

10) What’s a good principle for Desktop networking sanity?

Treat Docker Desktop as “containers behind a VM behind your OS.” Publish ports, avoid subnet overlap, and validate DNS from inside the container.

Conclusion: next steps you can do today

Docker Desktop networking stops being weird when you stop expecting it to be Linux host networking. It’s a VM boundary with a forwarding layer.
Once you accept that, most issues collapse into three buckets: bind addresses, firewall/VPN policy, and DNS/resolver drift.

Practical next steps:

  1. Pick one test service, publish it on 0.0.0.0, and verify LAN reachability end-to-end using ss, curl, and nc.
  2. Capture a baseline when things work: docker port, container /etc/resolv.conf, and host routing table (a capture sketch follows this list).
  3. If you use a VPN, stop letting Docker networks overlap with corporate routes. Standardize a “safe” subnet range for dev networks.
  4. Make DNS a per-project configuration when internal names matter. Global “fixes” are how you create cross-team breakage.
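
A baseline capture can be as dumb as this, assuming a container named web (swap in whatever you actually run); the value is in having dated files to diff when things break:

cr0x@server:~$ mkdir -p ~/net-baseline && cd ~/net-baseline
cr0x@server:~$ docker port web > ports.$(date +%F).txt
cr0x@server:~$ docker exec web cat /etc/resolv.conf > resolv.$(date +%F).txt
cr0x@server:~$ docker network inspect bridge --format '{{(index .IPAM.Config 0).Subnet}}' > subnet.$(date +%F).txt
cr0x@server:~$ ip route > routes.$(date +%F).txt

When something breaks, capture the same four files again and diff; the difference is usually the whole incident report.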

One paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails; design your systems—and your operations—to absorb that failure.”
Docker Desktop networking isn’t special. It’s just failure with extra layers.

Ubuntu 24.04: “Failed to get D-Bus connection” — fix broken sessions and services (case #48)

You run systemctl and it spits: “Failed to get D-Bus connection”. Suddenly your “simple restart” turns into a crime scene: services won’t talk, logins look haunted, and every automation that expects a clean session starts failing.

This error is rarely “just D-Bus.” It’s usually a broken contract between systemd, your login/session, and the bus sockets under /run. The fix is boring—but only after you stop guessing and start proving.

What the error really means (and what it doesn’t)

When a tool says “Failed to get D-Bus connection”, it’s complaining that it can’t reach a message bus socket it expects to exist. On Ubuntu 24.04, the usual caller is systemctl, loginctl, GNOME components, policykit prompts, snapd helpers, or any process that expects either:

  • The system bus at /run/dbus/system_bus_socket (used for system-wide services), or
  • The user session bus (per-user) typically at /run/user/UID/bus, managed by systemd --user and dbus-daemon or dbus-broker depending on the setup.

The phrase is misleading because the root cause is often not “D-Bus is down.” The bus may be fine; your environment may be wrong, your runtime directory may not exist, you might be inside a container/namespace, or you might be using sudo in a way that strips the bus variables.

Two rules that keep you sane:

  1. Decide if you need the system bus or the user bus. If you’re managing services with systemctl (system scope), you care about PID 1, dbus, and the system socket. If you’re running desktop/session actions, you care about systemd --user, XDG_RUNTIME_DIR, and the per-user socket.
  2. Always test the socket, not your feelings. Most “D-Bus connection” outages are actually missing /run paths, dead user sessions, or a broken login manager.

One paraphrased idea from Gene Kim (DevOps/reliability author): Improvement comes from reducing work-in-progress and making problems visible early. That applies here: make the failure visible by checking the bus paths and session state first, not by restarting random daemons.

Fast diagnosis playbook

When this hits production at 02:00, you don’t want theory. You want a triage loop that converges.

Step 1: Identify which bus is failing

  • If the error appears while running systemctl status foo as root, it’s likely the system bus or PID 1 connectivity.
  • If the error appears in a desktop app, GNOME settings, or systemctl --user, it’s the user session bus (/run/user/UID/bus).
  • If it only happens over SSH or automation, suspect environment variables and non-login shells.

Step 2: Check sockets and runtime dirs (fastest signal)

  • /run/dbus/system_bus_socket exists and is a socket?
  • /run/user/UID exists and is owned by the user?
  • /run/user/UID/bus exists and is a socket?

Step 3: Validate the session manager and systemd state

  • systemctl is-system-running tells you if PID 1 is healthy.
  • systemctl status dbus tells you if the system bus service exists/started.
  • loginctl list-sessions tells you if logind sees your session (critical for /run/user/UID creation).

Step 4: Fix the right layer, not the loudest one

  • Missing /run/user/UID? Fix logind/session lifecycle.
  • Socket exists but access denied? Fix permissions, SELinux/AppArmor policies, or the user context.
  • Works locally but not with sudo? Fix environment preservation, don’t “restart dbus” out of spite.

Interesting facts and context (you’ll debug faster)

  • D-Bus was designed in the early 2000s to replace ad-hoc IPC mechanisms in Linux desktops; it later became a staple for system services too.
  • systemd didn’t create D-Bus, but systemd made D-Bus dependency patterns more explicit with unit ordering, socket activation, and user services.
  • User runtime directories under /run/user/UID are typically created by systemd-logind when a session starts—and removed when the last session ends.
  • Ubuntu has shipped both dbus-daemon and alternatives (like dbus-broker in some ecosystems); what matters is the socket contract, not the implementation brand.
  • XDG_RUNTIME_DIR is part of the XDG Base Directory spec; it’s supposed to be user-specific, secure, and ephemeral—exactly the opposite of a random directory under /tmp.
  • systemctl talks to systemd over D-Bus; if systemctl can’t reach a bus, it can’t ask systemd anything, even if systemd is technically alive.
  • SSH sessions are not always “logind sessions” depending on PAM configuration; when they aren’t, you can lose automatic runtime dir setup and user bus availability.
  • Containers often don’t have a full system bus because PID 1 isn’t systemd, or because /run is isolated. This error is normal there unless you deliberately wire it up.
  • PolicyKit (polkit) relies on D-Bus for authorization queries; broken bus access can look like “authentication prompts never appear” or “permission denied” with no UI.

Joke #1: D-Bus is like office email—when it’s down, everyone suddenly discovers how many things they never understood were relying on it.

Field guide: isolate which “bus” you’re failing to reach

There are a few common failure shapes:

  • Root on a server: systemctl fails. Usually the system bus socket is missing, dbus unit is failed, or PID 1 is in a degraded/half-dead state.
  • Desktop user session: GNOME settings fail, gsettings breaks, systemctl --user fails. Usually XDG_RUNTIME_DIR is not set, /run/user/UID is missing, or systemd --user isn’t running.
  • Automation via sudo: works as your user, fails as root, or the reverse. Usually environment variables and session context are wrong.
  • Inside containers/CI: systemctl errors by design because there is no systemd D-Bus to talk to.

Here’s the key: the bus is a Unix socket file. If the socket isn’t there, you’re not going to “retry harder.” If it is there but your process can’t access it, you’re dealing with permissions, namespaces, or identity problems. If it’s there and accessible but replies fail, then you’re dealing with a daemon problem.
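
A minimal sketch of that logic in shell form; run it as the user whose session you care about:

cr0x@server:~$ for p in /run/dbus/system_bus_socket /run/user/$(id -u)/bus /run/systemd/private; do [ -S "$p" ] && echo "socket  $p" || echo "MISSING $p"; done
socket  /run/dbus/system_bus_socket
socket  /run/user/1000/bus
socket  /run/systemd/private

A MISSING line tells you which scope is broken before you restart anything; all three present pushes you toward permissions, namespaces, or environment variables instead.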

Practical tasks: commands, expected output, and decisions

These are the tasks I actually run. Each includes what the output means and what decision you make next. Run them in order until the failure mode becomes obvious. You’re not collecting logs for fun; you’re narrowing the search space.

Task 1: Confirm the exact failing command and context

cr0x@server:~$ whoami
cr0x
cr0x@server:~$ systemctl status ssh
Failed to get D-Bus connection: No such file or directory

Meaning: The client cannot reach its bus socket. “No such file or directory” hints at a missing socket path, not a permission issue.

Decision: Determine if this is a system bus failure (root/system scope) or user bus failure (user scope). Next: check whether you’re root and which systemctl you ran.

Task 2: Check whether PID 1 is systemd (containers and chroots)

cr0x@server:~$ ps -p 1 -o pid,comm,args
  PID COMMAND         COMMAND
    1 systemd         /sbin/init

Meaning: PID 1 is systemd; systemctl should work if the system bus path is present.

Decision: If PID 1 is not systemd (common in containers), the “fix” is to avoid systemctl or run a proper init. If it is systemd, continue.

Task 3: Verify the system bus socket exists

cr0x@server:~$ ls -l /run/dbus/system_bus_socket
srwxrwxrwx 1 root root 0 Dec 30 10:12 /run/dbus/system_bus_socket

Meaning: The system bus socket file exists and is a socket (leading s in permissions). World-writable here is normal for the socket endpoint; access is still controlled by D-Bus policy.

Decision: If missing: focus on dbus service and early boot issues. If present: test whether dbus replies.

Task 4: Check dbus service health (system bus)

cr0x@server:~$ systemctl status dbus --no-pager
● dbus.service - D-Bus System Message Bus
     Loaded: loaded (/usr/lib/systemd/system/dbus.service; static)
     Active: active (running) since Mon 2025-12-30 10:12:01 UTC; 2min ago
TriggeredBy: ● dbus.socket
       Docs: man:dbus-daemon(1)
   Main PID: 842 (dbus-daemon)
      Tasks: 1 (limit: 18939)
     Memory: 3.8M
        CPU: 52ms

Meaning: System bus is running; the problem may be systemctl’s ability to connect to systemd (not dbus), or a namespace/permission issue.

Decision: If dbus is inactive/failed, restart it and read logs. If active, check systemd itself and the systemd private socket.

Task 5: Confirm systemd is responsive

cr0x@server:~$ systemctl is-system-running
running

Meaning: PID 1 reports healthy. If you still see “Failed to get D-Bus connection,” you may be running systemctl in an environment that can’t see /run or lacks the correct mount namespace.

Decision: If output is degraded or maintenance, go straight to journal for systemic failures. If it’s running but clients fail, suspect namespace, chroot, or filesystem issues under /run.

Task 6: Inspect /run mount and free space (yes, really)

cr0x@server:~$ findmnt /run
TARGET SOURCE FSTYPE OPTIONS
/run   tmpfs  tmpfs  rw,nosuid,nodev,relatime,size=394680k,mode=755,inode64
cr0x@server:~$ df -h /run
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           386M  2.1M  384M   1% /run

Meaning: /run is tmpfs; it should be writable and have space/inodes. If /run is read-only or full, sockets won’t be created and you’ll get missing-bus errors.

Decision: If full/ro: fix that first (often a runaway process or a tmpfs mis-size). If healthy: continue to user-session checks if the error is user-scoped.

Task 7: Determine if you’re dealing with the user bus

cr0x@server:~$ echo "$XDG_RUNTIME_DIR"
/run/user/1000
cr0x@server:~$ echo "$DBUS_SESSION_BUS_ADDRESS"
unix:path=/run/user/1000/bus

Meaning: Environment variables point to the per-user bus. If either is empty, your session is incomplete (common over sudo, cron, or broken PAM).

Decision: If unset: you must establish a proper session context or explicitly set up a user bus (prefer the former). If set: check the socket exists.

Task 8: Validate the user bus socket exists and has sane ownership

cr0x@server:~$ id -u
1000
cr0x@server:~$ ls -ld /run/user/1000
drwx------ 12 cr0x cr0x 320 Dec 30 10:12 /run/user/1000
cr0x@server:~$ ls -l /run/user/1000/bus
srw-rw-rw- 1 cr0x cr0x 0 Dec 30 10:12 /run/user/1000/bus

Meaning: The runtime dir exists, is private (0700), and the bus socket exists. Good. If /run/user/1000 is missing, your session wasn’t registered properly with logind.

Decision: If missing: jump to loginctl and PAM/logind troubleshooting. If present but wrong owner: fix ownership and investigate why it drifted (often a bad script run as root).

Task 9: Prove the user systemd instance is alive

cr0x@server:~$ systemctl --user status --no-pager
● cr0x@server
    State: running
    Units: 221 loaded (incl. snap units)
     Jobs: 0 queued
   Failed: 0 units
    Since: Mon 2025-12-30 10:12:05 UTC; 2min ago
  

Meaning: Your user manager is running and reachable. If you get “Failed to connect to bus,” your user bus path or environment is broken.

Decision: If this fails but the socket exists, your environment may be lying (wrong XDG_RUNTIME_DIR) or you’re in a different namespace (common with sudo and some remote tools).

Task 10: Use loginctl to verify logind sees your session

cr0x@server:~$ loginctl list-sessions
SESSION  UID USER SEAT  TTY
     21 1000 cr0x seat0 tty2

1 sessions listed.
cr0x@server:~$ loginctl show-user cr0x -p RuntimePath -p State -p Linger
RuntimePath=/run/user/1000
State=active
Linger=no

Meaning: logind has an active session for the user and knows where the runtime path is. If there are no sessions, your user runtime dir may not be created.

Decision: If session is missing over SSH: check PAM configuration and whether your login path uses systemd/logind. If you need background user services, consider lingering (carefully).

Task 11: Diagnose “sudo broke my bus” (classic)

cr0x@server:~$ sudo -i
root@server:~# echo "$DBUS_SESSION_BUS_ADDRESS"

root@server:~# systemctl --user status
Failed to connect to bus: No medium found

Meaning: Root’s shell has no user bus context; systemctl --user under root is not your user session. That error is expected.

Decision: Don’t “fix” this by exporting random variables into root. Use systemctl (system scope) as root, and systemctl --user as the user inside the session. If you must manage a user unit from root, use machinectl shell or runuser with proper env, or target the user manager via loginctl enable-linger and systemctl --user under that user.
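
If you genuinely need to poke a user manager from a root shell, here is a sketch that stays inside supported tooling; the user name and unit name are placeholders, the user needs an active session or lingering, and machinectl comes from the systemd-container package on Ubuntu:

cr0x@server:~$ sudo machinectl shell cr0x@.host /usr/bin/systemctl --user status --no-pager
cr0x@server:~$ sudo runuser -l cr0x -c 'XDG_RUNTIME_DIR=/run/user/$(id -u) systemctl --user restart example.service'

machinectl shell gives you a clean login environment as that user; the runuser variant works only because the runtime dir is set explicitly. Exporting your own DBUS_SESSION_BUS_ADDRESS into root's shell is the thing to avoid.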

Task 12: Check journal for the first failure, not the last complaint

cr0x@server:~$ journalctl -b -u systemd-logind --no-pager | tail -n 20
Dec 30 10:11:58 server systemd-logind[701]: New session 21 of user cr0x.
Dec 30 10:11:58 server systemd-logind[701]: Watching system buttons on /dev/input/event3 (Power Button)
Dec 30 10:12:01 server systemd-logind[701]: Removed session 19.

Meaning: logind is creating sessions. If you instead see repeated failures to create runtime dirs, that’s your smoking gun.

Decision: If logind shows errors about runtime dir or cgroups, fix those layers. Restarting dbus won’t fix “can’t create /run/user/UID”.

Task 13: Confirm dbus packages and user-session support are installed

cr0x@server:~$ dpkg -l | egrep 'dbus|dbus-user-session|libpam-systemd' | awk '{print $1,$2,$3}'
ii dbus 1.14.10-4ubuntu4.1
ii dbus-user-session 1.14.10-4ubuntu4.1
ii libpam-systemd 255.4-1ubuntu8

Meaning: Required components exist. Missing dbus-user-session can lead to missing session bus behavior in some setups (especially minimal installs).

Decision: If missing: install the missing packages and re-login. If present: move on to PAM/logind and environment issues.

Task 14: Check PAM session hooks for systemd/logind (SSH-focused)

cr0x@server:~$ grep -R "pam_systemd.so" -n /etc/pam.d/sshd /etc/pam.d/login
/etc/pam.d/sshd:15:session    required     pam_systemd.so
/etc/pam.d/login:14:session    required     pam_systemd.so

Meaning: PAM is configured to register sessions with systemd/logind for SSH and console logins. If missing, you can end up with no runtime dir and no user bus.

Decision: If absent for the login path you use: add it (carefully, change-controlled) and test with a new session. If present: focus on why logind still isn’t creating runtime dirs (often related to lingering, cgroup issues, or broken systemd state).

Task 15: Check if the user runtime dir is being removed unexpectedly

cr0x@server:~$ sudo ls -l /run/user
total 0
drwx------ 12 cr0x cr0x 320 Dec 30 10:12 1000
drwx------ 10 gdm  gdm  280 Dec 30 10:11 120

Meaning: Runtime dirs exist for active users. If yours disappears when you disconnect SSH, you probably don’t have lingering and you have no active session.

Decision: For background user services: consider loginctl enable-linger username. For interactive work: ensure you have a real session and avoid running session-dependent commands from non-session contexts.

Task 16: Enable lingering (only if you truly need user services without a login)

cr0x@server:~$ sudo loginctl enable-linger cr0x
cr0x@server:~$ loginctl show-user cr0x -p Linger
Linger=yes

Meaning: The user manager can survive beyond logins, keeping user services and the runtime dir available.

Decision: Use this for headless services run in user scope (sometimes CI agents, per-user podman, etc.). Don’t enable it everywhere “just in case.” That’s how you get zombie user managers eating RAM on shared hosts.

Task 17: If systemctl fails as root, test D-Bus directly

cr0x@server:~$ busctl --system list | head
NAME                      PID PROCESS         USER CONNECTION UNIT SESSION DESCRIPTION
:1.0                      842 dbus-daemon     root :1.0       -    -       -
org.freedesktop.DBus      842 dbus-daemon     root :1.0       -    -       -
org.freedesktop.login1    701 systemd-logind  root :1.2       -    -       -

Meaning: The system bus responds. If systemctl still errors, you might have a broken systemd D-Bus endpoint or a mismatch in environment/namespace.

Decision: If busctl fails too: system bus is genuinely broken. If busctl works: focus on systemd connectivity and client environment.

Task 18: Check the systemd private socket (systemd’s IPC endpoint)

cr0x@server:~$ ls -l /run/systemd/private
srw------- 1 root root 0 Dec 30 10:11 /run/systemd/private

Meaning: systemd’s private socket exists; systemctl uses it in some code paths. If missing, something is deeply wrong with PID 1 or /run.

Decision: If missing: treat as a systemd/runtime filesystem problem; consider a controlled reboot after extracting logs. If present: go back to scope (system vs user) and namespace issues.

Task 19: Spot chroot/namespace issues (common in recovery shells)

cr0x@server:~$ readlink /proc/$$/ns/mnt
mnt:[4026532585]
cr0x@server:~$ sudo readlink /proc/1/ns/mnt
mnt:[4026531840]

Meaning: Your shell is in a different mount namespace than PID 1. You might not see the real /run where the sockets live.

Decision: If namespaces differ, run diagnostics from the host namespace (or enter it) instead of “fixing” phantom paths in your isolated view.

Task 20: Last resort, controlled restarts (in the right order)

cr0x@server:~$ sudo systemctl restart systemd-logind
cr0x@server:~$ sudo systemctl restart dbus
cr0x@server:~$ sudo systemctl daemon-reexec

Meaning: These restarts can recover a wedged logind/dbus/systemd. daemon-reexec is heavy; it re-execs PID 1 without rebooting.

Decision: Only do this after you’ve confirmed you’re not in a container and you’ve captured enough logs to explain the incident. If user sessions are broken due to logind, restarting logind can drop sessions; schedule it like you mean it.

Common mistakes: symptom → root cause → fix

1) “systemctl works as root locally, fails over SSH”

Symptom: Over SSH, systemctl returns “Failed to get D-Bus connection,” but on console it works.

Root cause: You’re in a restricted environment (forced command, chroot, toolbox), or your SSH session isn’t seeing host /run (namespace difference).

Fix: Confirm PID 1 and mount namespace; ensure your SSH path is not chrooted and has access to /run. Use Task 2 and Task 19.

2) “systemctl --user fails after sudo -i”

Symptom: You become root and try to manage user services; it fails with bus errors.

Root cause: Root does not have your user bus environment. Also, root’s user manager is not your user manager.

Fix: Run systemctl --user as the user within that session. If you must from root, use runuser -l username -c 'systemctl --user …' and ensure a proper session exists (or enable lingering).

3) “GNOME Settings won’t open; polkit prompts never appear”

Symptom: GUI actions fail silently or complain about D-Bus.

Root cause: User session bus is broken: missing XDG_RUNTIME_DIR, stale DBUS_SESSION_BUS_ADDRESS, or missing /run/user/UID/bus.

Fix: Verify Task 7/8. Log out and log back in to recreate a clean session. If it persists, check logind and PAM integration.

4) “Cron job fails with D-Bus connection errors”

Symptom: A script that uses gsettings, notify-send, or systemctl --user fails in cron.

Root cause: Cron runs without a user session and without XDG_RUNTIME_DIR.

Fix: Don’t run desktop/session commands in cron unless you create a session context. Use system services instead, or enable lingering and run a user service that doesn’t depend on GUI state.
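
If the job really does need the user bus (gsettings, user-scoped services, and so on), a systemd user timer keeps a valid session context. A sketch with hypothetical unit and script names; enable lingering first if the job must run while you're logged out:

cr0x@server:~$ mkdir -p ~/.config/systemd/user
cr0x@server:~$ cat ~/.config/systemd/user/nightly-task.service
[Unit]
Description=Example user-scope job (placeholder)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/nightly-task.sh

cr0x@server:~$ cat ~/.config/systemd/user/nightly-task.timer
[Unit]
Description=Run the example job daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

cr0x@server:~$ systemctl --user daemon-reload
cr0x@server:~$ systemctl --user enable --now nightly-task.timer

Unlike cron, the job runs under the user manager and inherits XDG_RUNTIME_DIR and the user bus, which is the whole point.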

5) “/run/user/UID exists but owned by root”

Symptom: The directory exists, but permissions are wrong; user bus errors follow.

Root cause: Someone ran a “cleanup” as root and recreated directories incorrectly, or a misbehaving script wrote to /run/user.

Fix: Log the user out (end sessions), remove the incorrect runtime directory, and let logind recreate it. If you must fix it live, correct ownership and restart the user manager carefully.

6) “system bus socket missing after boot”

Symptom: /run/dbus/system_bus_socket is absent; systemctl fails broadly.

Root cause: dbus.socket or dbus.service didn’t start, or /run wasn’t mounted correctly.

Fix: Validate /run mount (Task 6), then systemctl status dbus dbus.socket, and check early-boot logs.

7) “It works on the host but fails inside a container”

Symptom: systemctl and busctl fail in a container image or CI runner.

Root cause: No systemd PID 1, no system bus, or isolated /run.

Fix: Don’t use systemctl inside that container. Use the service’s native foreground process, or run a systemd-based container intentionally with the right privileges and mounts.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

At a mid-sized company, an on-call engineer got paged for “deploy host won’t restart services.” They SSH’d in, ran sudo systemctl restart app, and hit “Failed to get D-Bus connection.” The assumption was immediate and confident: “dbus is down; restart it.”

They restarted dbus. Then logind. Then tried a daemon-reexec. The host became harder to access, and a few interactive sessions dropped. The app was still not restarting. The incident grew legs.

The actual problem was mundane: the engineer wasn’t on the host. They were in a maintenance chroot that the team’s rescue tooling used for disk work. That environment had a different mount namespace and a different /run. Of course /run/dbus/system_bus_socket didn’t exist there; the bus socket lived in the host namespace.

Once they exited the chroot and ran the same command in the real host environment, systemctl worked immediately. The “D-Bus outage” was a mirage created by context. The fix was to add a clear shell banner for rescue environments and to teach the team to run Task 2 and Task 19 before touching daemons.

Mini-story #2: The optimization that backfired

Another team wanted faster login times and fewer background processes on developer workstations. Someone decided to “simplify” by stripping packages from the base image, including session-related components they believed were “desktop fluff.”

The image shipped, and it was fast. For about a week. Then came the tickets: IDE integration failing, password prompts not appearing, settings toggles doing nothing, and a weird one—user services failing only after reconnecting through remote desktop.

They’d removed pieces that indirectly ensured a stable user session bus. The system bus still existed, but per-user session infrastructure was inconsistent across login methods. Some logins created /run/user/UID properly; others didn’t, because PAM hooks were incomplete and user-session packages weren’t present.

The optimization wasn’t “wrong” because it saved CPU. It was wrong because it removed the scaffolding that makes the user bus predictable. The rollback added the needed packages and standardized login paths. Login time increased slightly, and the incident rate dropped dramatically. Sometimes “fast” is just “fragile with better marketing.”

Mini-story #3: The boring but correct practice that saved the day

In a regulated environment, a team ran Ubuntu servers that occasionally needed emergency console work. They had a policy that felt old-fashioned: every incident response starts with capturing state, including journalctl -b excerpts and a snapshot of /run socket paths, before any restarts.

It sounded bureaucratic until a production host began throwing D-Bus connection errors after a kernel update. The on-call followed the policy. They captured findmnt /run, checked free space, verified /run/systemd/private existed, and noted that /run/dbus/system_bus_socket was missing. They also captured early-boot logs showing tmpfs mount warnings.

Because they had evidence, they didn’t thrash. They found that /run was mounted read-only due to a subtle initramfs/mount failure. With that corrected and a controlled reboot, the bus socket appeared, systemctl recovered, and the outage ended cleanly.

The boring practice didn’t just fix the machine; it preserved the narrative. In corporate environments, the narrative is half the recovery: you need to explain what happened without blaming cosmic rays.

Checklists / step-by-step plan

Checklist A: You see “Failed to get D-Bus connection” running systemctl (system scope)

  1. Confirm you’re on the host and PID 1 is systemd (Task 2).
  2. Check /run mount and capacity (Task 6).
  3. Verify /run/dbus/system_bus_socket exists (Task 3).
  4. Check systemctl status dbus dbus.socket (Task 4).
  5. Check systemd private socket /run/systemd/private (Task 18).
  6. Test bus responsiveness with busctl --system list (Task 17).
  7. Pull logs: journalctl -b and relevant units (Task 12).
  8. If you must restart, do it deliberately: logind → dbus → daemon-reexec (Task 20).
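If you want those checks as one copy-paste pass, here is a minimal read-only sketch that assumes the standard systemd/D-Bus paths discussed above; adapt unit names and paths to your distro.

#!/usr/bin/env bash
# Fast system-scope triage; nothing here restarts anything.
echo "PID 1: $(ps -p 1 -o comm=)"                # expect "systemd" on a real host
findmnt /run                                      # /run should be a writable tmpfs
ls -l /run/dbus/system_bus_socket                 # system bus socket present?
ls -l /run/systemd/private                        # systemd private socket present?
systemctl is-system-running                       # cheap PID 1 health summary
busctl --system list --no-pager | head -n 5       # does the bus answer at all?
journalctl -b -u dbus.service --no-pager | tail -n 20   # recent bus-side evidence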

Checklist B: You see the error running systemctl --user or desktop tools (user session scope)

  1. Check XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS (Task 7).
  2. Verify /run/user/UID and /run/user/UID/bus exist and are owned by the user (Task 8).
  3. Check systemctl --user status (Task 9).
  4. Use loginctl list-sessions and loginctl show-user (Task 10).
  5. If this is SSH/cron, decide: do you need a real session or a system service instead?
  6. If you need background user services, enable lingering for that user (Task 16), then re-test.
  7. If the runtime dir keeps disappearing, fix session lifecycle and PAM (Task 14/15).
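A user-scope version of the same idea, run as the affected user (not via sudo); it only assumes the standard /run/user/UID layout:

#!/usr/bin/env bash
# User-session triage, read-only. Run as the user who sees the error.
echo "XDG_RUNTIME_DIR=${XDG_RUNTIME_DIR:-<unset>}"
echo "DBUS_SESSION_BUS_ADDRESS=${DBUS_SESSION_BUS_ADDRESS:-<unset>}"
ls -ld "/run/user/$(id -u)" "/run/user/$(id -u)/bus"   # runtime dir and bus socket
loginctl show-user "$USER" | egrep 'State|Linger|RuntimePath'
systemctl --user is-system-running                     # is the user manager alive?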

Checklist C: You’re in automation/CI and it fails

  1. Confirm whether you are in a container and PID 1 is not systemd (Task 2).
  2. Stop trying to use systemctl in that environment. Run the service directly, or redesign the job.
  3. If you truly require systemd, run a systemd-capable environment intentionally, not accidentally.
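A context check you can drop at the top of CI jobs; it uses the documented marker directory /run/systemd/system rather than guessing:

#!/usr/bin/env bash
# Decide whether systemctl is even a legal move in this environment.
echo "PID 1: $(ps -p 1 -o comm=)"
if [ -d /run/systemd/system ]; then
    echo "systemd is managing this environment; systemctl should work"
else
    echo "no systemd manager here; run the service directly instead of calling systemctl"
fi
systemd-detect-virt --container || echo "(no container detected, or tool not installed)"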

Joke #2: Restarting dbus without checking sockets is like rebooting a printer because you’re out of paper—cathartic, ineffective, and oddly popular.

FAQ

1) Why does systemctl use D-Bus at all?

systemctl is a client. It talks to systemd’s manager APIs, commonly exposed over D-Bus and systemd’s private socket. No bus, no conversation.

2) I can see dbus-daemon running. Why do I still get the error?

Because the daemon process existing is not the same as the socket being reachable in your namespace/context. Check the socket paths under /run and confirm you’re in the host mount namespace (Task 3, 6, 19).

3) What does “No such file or directory” vs “Permission denied” change?

No such file usually means the socket path doesn’t exist in your view (missing /run mount, missing runtime dir, namespace issue). Permission denied means the socket exists but access control blocks you (wrong user, policy, or confinement).

4) Why does it break only over SSH?

Either your SSH session isn’t registered with logind (PAM misconfiguration), or you’re executing within a restricted wrapper/chroot. Verify pam_systemd.so and check whether /run/user/UID is created for that session (Task 10, 14).

5) Is enabling lingering safe?

It’s safe when you know why you need it: running user services without active logins. It’s unsafe as a blanket workaround because you’ll keep user managers alive, which can hide logout bugs and waste resources. Enable it per-user, deliberately (Task 16).

6) Can I just export DBUS_SESSION_BUS_ADDRESS and move on?

You can, but you shouldn’t. Exporting stale addresses is how you create “works on my shell” ghosts that break later. Prefer establishing a real session and letting logind/systemd set XDG_RUNTIME_DIR and the bus address.

7) What’s the quickest way to tell system bus vs user bus?

If you’re using systemctl without --user, it’s system scope. If the relevant socket is /run/dbus/system_bus_socket, it’s system bus. If it’s /run/user/UID/bus, it’s user session bus.

8) I’m in a minimal server install—do I need dbus-user-session?

If you run user-scoped services or expect user sessions to have a proper session bus, yes, it’s often necessary. If you only manage system services, you can sometimes avoid it. The symptom-driven answer: if user bus is missing, check package presence (Task 13).

9) Why does systemctl --user fail as root even when the user is logged in?

Because root’s environment is not the user’s environment, and root is not “attached” to that user session bus. Run the command as the user in the session, or use appropriate tooling to target that user manager.

10) When do I reboot instead of debugging?

If PID 1 is unhealthy, /run is corrupted/read-only, or systemd sockets are missing and you can’t recover them cleanly, a controlled reboot is often the most reliable fix. Capture logs first.

Conclusion: next steps you can ship today

“Failed to get D-Bus connection” is not an invitation to restart random services. It’s a request to verify a contract: /run is mounted and writable, the right socket exists, your session is real, and your environment points at the correct bus.

Do these next:

  1. Run the fast playbook: sockets, runtime dirs, logind sessions. Don’t skip to restarts.
  2. Decide whether your workflow depends on the user bus. If it does, standardize login paths (PAM + logind) and avoid cron for session work.
  3. If this is a fleet issue, add a lightweight health check: verify /run/dbus/system_bus_socket and /run/systemd/private exist, and alert on missing runtime dirs for active sessions.
  4. Write down the context rule: chroots/containers are allowed to fail systemctl. Your runbooks should say that out loud.
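The health check in item 3 can start as something this small, a sketch assuming the standard socket paths; wire its exit code into whatever alerting you already have:

#!/usr/bin/env bash
# Fleet health check: fail if expected sockets are missing.
status=0
[ -S /run/dbus/system_bus_socket ] || { echo "missing /run/dbus/system_bus_socket"; status=1; }
[ -S /run/systemd/private ]        || { echo "missing /run/systemd/private"; status=1; }
for dir in /run/user/*/; do
    [ -d "$dir" ] || continue                  # glob did not match: no active user sessions
    [ -S "${dir}bus" ] || { echo "runtime dir without a user bus: ${dir}"; status=1; }
done
exit "$status"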

RAID is not backup: the sentence people learn too late

The call usually comes in when the dashboard is green and the data is gone. The array is “healthy.” The database is “running.”
And yet the CFO is staring at an empty report, the product team is staring at an empty bucket, and you’re staring at the one sentence
you wish you’d tattooed onto the purchase order: RAID is not backup.

RAID is great at one thing: keeping a system online through certain kinds of disk failure. It is not designed to protect you from
deletion, corruption, ransomware, fire, fat fingers, broken firmware, or the strange and timeless human urge to run rm -rf
in the wrong window.

What RAID actually does (and what it never promised)

RAID is a redundancy scheme for storage availability. That’s it. It’s a way to keep serving reads and writes when one disk
(or sometimes two) stops cooperating. RAID is about continuity of service, not continuity of truth.

In production terms: RAID buys you time. It reduces the probability that a single disk failure becomes an outage. It may improve
performance depending on level and workload. It can simplify capacity management. But it does not create a separate, independent,
versioned copy of your data. And independence is the word that keeps your job.

Availability vs durability vs recoverability

People mash these into one bucket labeled “data safety.” They are not the same:

  • Availability: can the system keep working right now? RAID helps here.
  • Durability: will bits remain correct over time? RAID sometimes helps, sometimes lies about it.
  • Recoverability: can you restore a known-good state after an incident? That’s backup, snapshots, replication, and process.

RAID can keep serving corrupted data. RAID can faithfully mirror your accidental deletion. RAID can replicate your ransomware-encrypted blocks
with extreme enthusiasm. RAID is a loyal employee. Loyal doesn’t mean smart.

What “backup” means in a system you can defend

A backup is a separate copy of data that is:

  • Independent of the primary failure domain (different disks, different host, ideally different account/credentials).
  • Versioned so you can go back to before the bad thing happened.
  • Restorable within a time bound you can live with (RTO) and to a point in time you can accept (RPO).
  • Tested, because “we have backups” is not a fact until you have restored from them.

Snapshots and replication are great tools. They are not automatically backups. They become backups when they’re independent, protected from
the same admin mistakes, and you can restore them under pressure.

Joke #1: RAID is the seatbelt. Backup is the ambulance. If you’re counting on the seatbelt to perform surgery, you’re going to have a long day.

Why RAID fails as backup: the failure modes that matter

The reason “RAID is not backup” gets repeated is that the failure modes are non-intuitive. Disk failure is just one kind of data loss.
Modern systems lose data through software, humans, and attackers more often than through a single drive popping its SMART cherry.

1) Deletion and overwrite are instantly redundant

Delete a directory. RAID mirrors the deletion. Overwrite a table. RAID stripes that new truth across the set. There is no “undo” because RAID’s
job is to keep copies consistent, not to keep copies historical.

2) Silent corruption, bit rot, and the “looks fine” trap

Disks, controllers, cables, and firmware can return the wrong data without throwing an error. Filesystems with checksums (like ZFS, btrfs) can
detect corruption, and with redundancy they can often self-heal. Traditional RAID under a filesystem that doesn’t checksum at the block level
can happily return corrupted blocks and call it success.

Even with end-to-end checksums, you can still corrupt data at a higher layer: bad application writes, buggy compaction, half-applied migrations.
RAID will preserve the corruption perfectly.

3) Ransomware doesn’t care about your parity

Ransomware encrypts what it can access. If it can access your mounted filesystem, it can encrypt your data on RAID1, RAID10, RAID6,
ZFS mirrors, whatever. Redundancy doesn’t stop encryption. It just ensures the encryption is highly available.

4) Controller and firmware failures take the array with them

Hardware RAID adds a failure domain: the controller, its cache module, its firmware, its battery/supercap, and its metadata format.
If the controller dies, you may need an identical controller model and firmware level to reassemble the array cleanly.

Software RAID also has failure domains (kernel, md metadata, userspace tooling), but they tend to be more transparent and portable.
Transparent does not mean safe. It just means you can see the knife before you step on it.

5) Rebuilds are stressful and get worse as drives get bigger

Rebuild is where the math meets physics. During rebuild, every remaining disk is read heavily, often close to full bandwidth, for hours or days.
That’s a perfect storm for surfacing latent errors on the remaining drives. If you lose another disk in a RAID5 during rebuild, you lose the array.
RAID6 buys you more margin, but rebuild still increases risk and degrades performance.

6) Human error: the most common, least respected failure mode

A tired engineer replaces the wrong disk, pulls the wrong tray, or runs the right command on the wrong host. RAID doesn’t protect against
humans. It amplifies them. One wrong click gets replicated at line rate.

7) Site disasters and blast radius

RAID is local. Fire is also local. So are theft, power events, and “oops we deleted the whole cloud account.” A real backup strategy assumes
you will lose an entire failure domain: a host, a rack, a region, or an account.

Interesting facts and a little history (the useful kind)

A few concrete facts make this topic stick because they show how RAID ended up being treated like a magic spell.
Here are nine, all relevant, none romantic.

  1. RAID was named and popularized in a 1987 UC Berkeley paper that framed “redundant arrays of inexpensive disks” as an alternative to big expensive disks.
  2. Early RAID marketing leaned hard on “fault tolerance,” and a lot of people quietly translated that into “data protection,” which is not the same contract.
  3. RAID levels were never a single official standard. Vendors implemented “RAID5” with different behaviors and cache policies, then argued about semantics in your outage window.
  4. Hardware RAID controllers historically used proprietary on-disk metadata formats, which is why controller failure can turn into archaeology.
  5. The rise of multi-terabyte disks made RAID5 rebuilds dramatically riskier because the rebuild time grew and the probability of encountering an unreadable sector during rebuild rose.
  6. URE (unrecoverable read error) rates were widely discussed in the 2000s as a practical reason to prefer dual-parity for large arrays, especially under heavy rebuild load.
  7. ZFS (first released in the mid-2000s) pushed end-to-end checksums into mainstream operations and made “bit rot” a boardroom-friendly phrase because it could finally be detected.
  8. Snapshots became common in enterprise storage in the 1990s but were often stored on the same array—fast rollback, not disaster recovery.
  9. Ransomware shifted the backup conversation from “tape vs disk” to “immutability vs credentials,” because attackers learned to delete backups first.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran its primary PostgreSQL cluster on a pair of high-end servers with hardware RAID10. The vendor pitch sounded
comforting: redundant disks, battery-backed write cache, hot spares. The team heard “no data loss” and mentally filed backups under “nice-to-have.”

One afternoon, a developer ran a cleanup script against production. It was supposed to target a staging schema; it targeted the live one.
Within seconds, millions of rows were deleted. The database kept serving traffic, and the monitoring graphs looked fine—queries got faster, actually,
because there was less data.

They tried to recover using the RAID controller’s “snapshot” feature, which was not a snapshot in the filesystem sense. It was a configuration
profile for caching behavior. The storage vendor, to their credit, did not laugh. They simply asked the question that ends careers:
“What are your last known-good backups?”

There were none. There was a nightly logical dump configured months ago, but it wrote to the same RAID volume, and the cleanup script deleted
the dump directory too. The company rebuilt from application logs and third-party event streams. They recovered most, but not all, and they spent
weeks fixing subtle referential damage.

The wrong assumption wasn’t “RAID is safe.” It was “availability implies recoverability.” They had high uptime and low truth.

Mini-story 2: The optimization that backfired

A media platform was obsessed with performance. They moved their object storage metadata from a conservative setup to a wide RAID5 to squeeze
more usable capacity and better write throughput on paper. They also enabled aggressive controller caching to improve ingest rates.

In normal operation, it looked great. The queue depths were low. Latency was down. Leadership got their “storage efficiency” slide for the quarterly
deck. Everyone slept better for about a month.

Then a single disk started throwing intermittent read errors. The array marked it as “predictive failure” but kept it online. A rebuild was initiated
to a hot spare during peak hours because the system was “redundant.” That rebuild saturated the remaining disks. Latency spiked, timeouts climbed,
and application retries created a feedback loop.

Mid-rebuild, another disk hit an unreadable sector. RAID5 can’t handle that during rebuild. The controller declared the virtual disk failed.
The result wasn’t just downtime. It was partial metadata corruption that made recovery slower and nastier than a clean crash would have been.

The optimization wasn’t evil; it was unbounded. They optimized for capacity and benchmark performance, then paid for it with rebuild risk and
a larger blast radius. They replaced the layout with dual parity, moved rebuild windows off-peak, and—most importantly—built an off-array backup
pipeline so the next failure would be boring.

Mini-story 3: The boring but correct practice that saved the day

A financial services firm ran a file service used by internal teams. The storage was a ZFS mirror set: simple, conservative, not exciting.
The exciting part was their backup hygiene: nightly snapshots, offsite replication to a different admin domain, and monthly restore tests.
Everyone complained about the restore tests because they “wasted time.” The SRE manager made them non-optional anyway.

A contractor’s laptop was compromised. The attacker obtained VPN access and then a privileged credential that could write to the file share.
Overnight, ransomware started encrypting user directories. Because the share was online and writable, the encryption propagated quickly.

ZFS did exactly what it was asked to do: it stored the new encrypted blocks with integrity. RAID mirroring ensured the encryption was durable.
The next morning, users found their files renamed and unreadable. The mirror was “healthy.” The business was not.

The firm pulled the network share offline, rotated credentials, and checked the immutable backup target. The backups were stored in a separate
environment with restricted delete permissions and retention locks. The attacker couldn’t touch them.

Restore was not magical; it was practiced. They restored the most critical directories first based on a pre-agreed priority list, then the rest
over the next day. The postmortem was dull in the best way. The moral was also dull: boring process beats fancy redundancy.

Fast diagnosis playbook: find the bottleneck and the blast radius

When something is wrong with storage, teams waste time arguing about whether it’s “the disks” or “the network” or “the database.”
The right approach is to establish: (1) what changed, (2) what is slow, (3) what is unsafe, and (4) what you can still trust.

First: stop making it worse

  • If you suspect corruption or ransomware, freeze writes where you can: remount read-only, stop services, revoke credentials.
  • If an array is degraded and rebuilding, consider reducing workload to avoid a second failure during rebuild.
  • Start an incident log: commands run, timestamps, changes made. Memory is not evidence.
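As a sketch of what “stop making it worse” looks like in practice (the mount point, unit name, and log path are examples, not your real ones):

mount -o remount,ro /data                      # freeze writes at the filesystem layer
systemctl stop app.service                     # stop the writer if the remount is refused as busy
script -a /root/incident-$(date +%F).log       # record the rest of the session as evidence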

Second: identify whether this is performance, integrity, or availability

  • Performance: high latency, timeouts, queue depth, iowait. Data may still be correct.
  • Integrity: checksum errors, application-level corruption, unexpected file changes. Performance may look fine.
  • Availability: devices missing, arrays degraded/failed, filesystems not mounting. The system is screaming.

Third: localize the fault domain quickly

  1. Host: kernel logs, disk errors, controller state.
  2. Storage stack: RAID/mdadm/ZFS, filesystem health, scrub status.
  3. IO path: multipath, HBA, SAS expander, NICs, switches if network storage.
  4. Application: query plans, lock contention, retry storms.
  5. Backup/recovery posture: do you have a clean restore point, and is it reachable?

Fourth: decide on the objective

In an outage, you must pick one objective to lead with:

  • Keep it running (availability): stabilize, accept degraded mode.
  • Protect data (integrity): freeze writes, take forensic copies, restore from known-good.
  • Recover service (recoverability): fail over, rebuild elsewhere, restore backups.

These objectives conflict. Pretending they don’t is how you end up with a working system serving the wrong data.

Practical tasks with commands: what to run, what it means, what you decide

Below are hands-on tasks you can run on Linux systems to understand your redundancy posture and your actual recoverability.
Each task includes: command, example output, what it means, and the decision you make from it.

Task 1: Check current block devices and RAID membership

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL,SERIAL
NAME      SIZE TYPE  FSTYPE            MOUNTPOINT MODEL            SERIAL
sda       3.6T disk                               HGST_HUS726T4TAL K8H1ABCD
├─sda1    512M part  vfat              /boot/efi
└─sda2    3.6T part  linux_raid_member
  └─md0   3.6T raid1 ext4              /data
sdb       3.6T disk                               HGST_HUS726T4TAL K8H1EFGH
└─sdb1    3.6T part  linux_raid_member
  └─md0   3.6T raid1 ext4              /data

What it means: You have a software RAID1 device md0 mounted at /data, built from partitions.

Decision: If you thought you had “backups,” you don’t. This is redundancy only. Confirm backup location is separate.

Task 2: Inspect mdadm RAID health and rebuild status

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda2[0]
      3906886464 blocks super 1.2 [2/2] [UU]

unused devices: <none>

What it means: [UU] indicates both members are up. During rebuild you’d see [U_] and a progress line.

Decision: If degraded, reduce load and plan disk replacement. Also: take a backup snapshot now if you don’t have one off-host.

Task 3: Get detailed mdadm information, including event counts

cr0x@server:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Oct  2 11:22:09 2023
        Raid Level : raid1
        Array Size : 3906886464 (3726.02 GiB 4000.79 GB)
     Used Dev Size : 3906886464 (3726.02 GiB 4000.79 GB)
      Raid Devices : 2
     Total Devices : 2
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
              Name : server:0
              UUID : 1a2b3c4d:5e6f7890:abcdef01:23456789
            Events : 12891

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       17        1      active sync   /dev/sdb1

What it means: “clean” is good, but it says nothing about whether files are correct, consistent, or restorable.

Decision: Use this to confirm topology and identify which physical disk maps to which member before touching hardware.

Task 4: Check kernel logs for IO errors and resets

cr0x@server:~$ sudo journalctl -k -S "1 hour ago" | egrep -i "ata|sas|scsi|i/o error|reset|timeout" | tail -n 8
Jan 22 10:11:41 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 22 10:11:41 server kernel: ata3.00: failed command: READ DMA EXT
Jan 22 10:11:41 server kernel: blk_update_request: I/O error, dev sdb, sector 9175040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 10:11:42 server kernel: ata3: hard resetting link
Jan 22 10:11:47 server kernel: ata3: link is slow to respond, please be patient
Jan 22 10:11:52 server kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

What it means: Link resets and IO errors are early warning signs. Could be disk, cable, backplane, or controller.

Decision: Treat as “integrity at risk.” Start a fresh backup if possible; plan maintenance and hardware isolation.

Task 5: Query SMART health and key counters

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "SMART overall|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Power_On_Hours"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       41231

What it means: “PASSED” is not reassurance. Pending/offline-uncorrectable sectors matter more. This disk is deteriorating.

Decision: Replace proactively. If in RAID5/6, rebuild risk rises; schedule rebuild with reduced load and verified backups.

Task 6: For hardware RAID, check controller/virtual disk state (storcli example)

cr0x@server:~$ sudo storcli /c0/vall show
Controller = 0
Status = Success
Description = Show Virtual Drives

DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
0/0   RAID5 dgrd  RW     No      RWBD  -   OFF  10.913 TB data_vd0

What it means: Virtual drive is dgrd (degraded). “Consist No” suggests a consistency check is needed.

Decision: Pause nonessential writes, identify failed/predictive disks, and ensure you have a restorable backup before rebuild.

Task 7: Confirm write cache policy and battery/supercap status

cr0x@server:~$ sudo storcli /c0 show battery
Controller = 0
Status = Success
Description = Battery Status

BatteryType = iBBU
Status = Failed
Replacement required = Yes

What it means: If cache protection is failed, controllers often disable write-back cache or risk losing acknowledged writes on power loss.

Decision: Expect performance changes and potential data integrity risk if policy is misconfigured. Replace battery/supercap and review cache mode.

Task 8: Measure if you are CPU-starved or IO-starved (iostat)

cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (server) 	01/22/2026 	_x86_64_	(16 CPU)

avg-cpu:  %user %nice %system %iowait  %steal   %idle
          12.34  0.00    5.12   31.45    0.00   51.09

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
md0              85.0   5420.0     0.0    0.0   18.20    63.76     40.0   3120.0   44.10   2.90   98.7

What it means: High %iowait and %util near 100% indicate IO bottleneck. Write latency is high.

Decision: Throttle heavy jobs, check for rebuild/scrub, and consider moving hot workload off the array while you stabilize.

Task 9: Find which processes are hammering IO (iotop)

cr0x@server:~$ sudo iotop -oPa -n 5
Total DISK READ: 55.43 M/s | Total DISK WRITE: 12.10 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>  COMMAND
18422 be/4   postgres  40.22 M/s   8.10 M/s  0.00 % 92.00 % postgres: checkpointer
27109 be/4   root      12.11 M/s   0.00 B/s  0.00 % 15.00 % rsync -aH --delete /data/ /mnt/backup/

What it means: Your backup job and database maintenance are competing. That’s not a morality tale; it’s physics.

Decision: Reschedule backups/maintenance windows or implement rate limiting so backups don’t cause outages (or vice versa).

Task 10: Check filesystem errors quickly (ext4 example)

cr0x@server:~$ sudo dmesg | egrep -i "EXT4-fs error|I/O error|Buffer I/O error" | tail -n 6
[915230.112233] EXT4-fs error (device md0): ext4_find_entry:1531: inode #524301: comm nginx: reading directory lblock 0
[915230.112240] Buffer I/O error on device md0, logical block 12345678

What it means: The filesystem is seeing read errors. RAID may be masking some failures, but not all.

Decision: Stop services if possible, capture logs, plan a controlled fsck (or restore) rather than letting corruption spread.

Task 11: Verify ZFS pool health and error counters

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
  scan: scrub repaired 0B in 00:42:18 with 0 errors on Sun Jan 18 02:15:01 2026
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdc     FAULTED      0     0     8  too many errors
            sdd     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/data/app.db

What it means: ZFS detected checksum errors and can tell you which file is affected. This is the difference between “we think” and “we know.”

Decision: Treat named files as suspect. Restore affected data from backup or application-level replication; replace the faulted disk.

Task 12: Check ZFS snapshots and whether you’re confusing them with backups

cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -s creation | tail -n 5
tank/data@hourly-2026-01-22-0600  Thu Jan 22 06:00 2026
tank/data@hourly-2026-01-22-0700  Thu Jan 22 07:00 2026
tank/data@hourly-2026-01-22-0800  Thu Jan 22 08:00 2026
tank/data@hourly-2026-01-22-0900  Thu Jan 22 09:00 2026
tank/data@hourly-2026-01-22-1000  Thu Jan 22 10:00 2026

What it means: Nice. But if these snapshots live on the same pool, they won’t survive pool loss, account compromise, or site failure.

Decision: Replicate snapshots to an independent target with different credentials and deletion protections.
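A minimal replication sketch with zfs send/receive over SSH; backup-host, backuppool, and the second snapshot name are hypothetical, while the first snapshot comes from the listing above:

# Full replication of the newest snapshot to an independent box:
zfs send tank/data@hourly-2026-01-22-1000 | ssh backup-host zfs receive -u backuppool/data

# Later runs send only the delta between the last replicated snapshot and a newer one:
zfs send -i tank/data@hourly-2026-01-22-1000 tank/data@hourly-2026-01-22-1100 \
    | ssh backup-host zfs receive -u backuppool/data

The pipe alone doesn’t buy independence; that comes from backup-host having different credentials and its own retention.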

Task 13: Confirm backups exist and are recent (restic example)

cr0x@server:~$ restic -r /mnt/backup/restic-repo snapshots --last
repository 9b2f1c12 opened (version 2, compression level auto)
ID        Time                 Host        Tags        Paths
a1b2c3d4  2026-01-22 09:00:14  server                  /data

What it means: You have a backup snapshot from today. That’s a start.

Decision: Validate restore capability, not just existence. If snapshots stop updating, treat it as an incident.

Task 14: Do a test restore of a single file to prove recoverability

cr0x@server:~$ mkdir -p /tmp/restore-test && restic -r /mnt/backup/restic-repo restore latest --target /tmp/restore-test --include /data/important/report.csv
repository 9b2f1c12 opened (version 2, compression level auto)
restoring <Snapshot a1b2c3d4 of [/data] at 2026-01-22 09:00:14 by cr0x@server> to /tmp/restore-test
Summary: Restored 1 files/dirs (42.133 KiB) in 0:00

What it means: This is the moment “we have backups” becomes a fact. You successfully restored a real file.

Decision: Make this a scheduled drill with random file selection and documented results.

Task 15: Check whether your “backup” target is actually independent

cr0x@server:~$ mount | egrep "/data|/mnt/backup"
/dev/md0 on /data type ext4 (rw,relatime)
server:/export/backup on /mnt/backup type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)

What it means: Backups go to NFS. Independence depends on where that NFS lives and who can delete it.

Decision: If NFS is on the same server, same rack, or same admin credential set, it’s not independent enough. Fix that.

Task 16: Verify retention and immutability at the filesystem layer (chattr)

cr0x@server:~$ sudo lsattr -d /mnt/backup
-------------e---- /mnt/backup

What it means: No immutability flags here. That might be fine, but then immutability must come from the backup system or storage target.

Decision: If ransomware is in your threat model (it is), implement retention locks/immutability outside the primary admin’s easy reach.

Task 17: Check if you’re one typo away from deleting backups (permissions)

cr0x@server:~$ namei -l /mnt/backup/restic-repo | tail -n 4
drwxr-xr-x root root /
drwxr-xr-x root root mnt
drwxr-xr-x root root backup
drwxrwxrwx root root restic-repo

What it means: World-writable backup repository. That’s not a backup; it’s a community art project.

Decision: Lock down permissions, separate backup credentials, and consider append-only or immutable targets.
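A minimal hardening pass for the repository in this example, assuming a dedicated service account (backupsvc is a hypothetical name):

chown -R backupsvc:backupsvc /mnt/backup/restic-repo   # owned by the backup identity, not root-and-everyone
chmod -R o-rwx,g-w /mnt/backup/restic-repo             # drop world access, make group read-only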

Task 18: Spot a rebuild or scrub that is quietly killing performance

cr0x@server:~$ sudo zpool iostat -v 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        2.10T  1.40T    820    210  92.1M  18.2M
  mirror-0                  2.10T  1.40T    820    210  92.1M  18.2M
    sdc                         -      -    420    105  46.0M   9.1M
    sdd                         -      -    400    105  46.1M   9.1M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Sustained high reads can indicate scrub/resilver or a workload shift. You need to correlate with pool status and cron jobs.

Decision: If this coincides with user pain, reschedule scrubs, tune resilver priority, or add capacity/performance headroom.

Joke #2: A RAID rebuild is the storage equivalent of “just a quick change in production.” It’s never quick, and it definitely changes things.

Common mistakes: symptoms → root cause → fix

This section is intentionally specific. Generic advice doesn’t survive an incident; it just gets quoted in the postmortem.

1) “Array is healthy, but files are corrupted”

  • Symptoms: Application errors reading specific files; checksum mismatches at app layer; users see garbled media; RAID shows optimal.
  • Root cause: Silent corruption on disk/controller/cable, or application wrote bad data. RAID parity/mirroring preserved it.
  • Fix: Use checksumming filesystem (ZFS) or application checksums; run scrubs; restore corrupted objects from independent backups; replace flaky hardware.

2) “We can’t rebuild: second disk failed during rebuild”

  • Symptoms: RAID5 virtual disk fails mid-rebuild; UREs appear; multiple drives show media errors.
  • Root cause: Single-parity plus large disks plus heavy rebuild read load; insufficient margin for latent sector errors.
  • Fix: Prefer RAID6/RAIDZ2 or mirrors for large arrays; keep hot spares; run patrol reads/scrubs; replace drives proactively; ensure you have restorable backups before rebuild.

3) “Backups exist but restores are too slow to meet RTO”

  • Symptoms: Backup job reports success; restore is days; business needs hours.
  • Root cause: RTO was never engineered; backup target bandwidth too low; too much data, too little prioritization; no tiered restore plan.
  • Fix: Define RTO/RPO per dataset; implement fast local recovery (snapshots) plus offsite backups; pre-stage critical datasets; practice partial restores.

4) “Snapshots saved us… until the pool died”

  • Symptoms: Confident snapshot schedule; then catastrophic pool loss; snapshots gone with it.
  • Root cause: Snapshots stored in the same failure domain as primary data.
  • Fix: Replicate snapshots to a different system/account; add immutability; treat “same host” as “same blast radius.”

5) “Ransomware encrypted production and backups”

  • Symptoms: Backup repository deleted/encrypted; retention purged; credentials used legitimately.
  • Root cause: Backup system writable/deletable by the same credentials compromised on production; no immutability/air gap.
  • Fix: Separate credentials and MFA; write-only backup roles; immutable object lock or append-only targets; offline copy for worst-case; monitor deletion events.

6) “Performance collapsed after we replaced a disk”

  • Symptoms: Latency spikes after disk replacement; systems time out; nothing else changed.
  • Root cause: Rebuild/resilver saturating IO; controller throttling; degraded mode on parity arrays.
  • Fix: Schedule rebuild windows; throttle rebuild; move workloads; add spindles/SSDs; keep extra headroom; don’t rebuild at peak unless you enjoy chaos.

7) “Controller died and we can’t import the array”

  • Symptoms: Disks appear but array metadata not recognized; vendor tool can’t see virtual disk.
  • Root cause: Hardware RAID metadata tied to controller family/firmware; cache module failure; foreign config confusion.
  • Fix: Standardize controllers and keep spares; export controller configs; prefer software-defined storage for portability; most importantly, have backups that don’t require the controller to exist.

Checklists / step-by-step plan: build backups that survive reality

Here’s the plan that works when you’re tired, understaffed, and still expected to be right.
It’s opinionated because production is opinionated.

Step 1: Classify data by business consequence

  • Tier 0: authentication/identity, billing, customer data, core database.
  • Tier 1: internal tools, analytics, logs needed for security/forensics.
  • Tier 2: caches, build artifacts, reproducible datasets.

If everything is “critical,” nothing is. Define RPO and RTO per tier. Write it down where finance can see it.

Step 2: Choose the baseline rule and then exceed it

The classic baseline is 3-2-1: three copies of data, on two different media/types, with one copy offsite. It’s a starting point, not a medal.
For ransomware, “offsite” should also mean “not deletable by the same creds.”

Step 3: Separate failure domains on purpose

  • Different hardware: not “a different directory.”
  • Different administrative boundary: separate accounts/roles; production should not have delete on backups.
  • Different geography: at least one copy outside the site/rack/region you can lose.

Step 4: Use snapshots for speed, backups for survival

Local snapshots are for fast “oops” recovery: accidental deletes, bad deploys, quick rollback. Keep them frequent and short-retention.
Backups are for when the machine, the array, or the account is gone.

Step 5: Encrypt and authenticate the backup pipeline

  • Encrypt at rest and in transit (and manage keys as if they matter, because they do).
  • Use dedicated backup credentials with minimal permissions.
  • Prefer write-only paths from production to backup when possible.

Step 6: Make retention a policy, not a vibe

  • Short: hourly/daily for fast rollback.
  • Medium: weekly/monthly for business/legal needs.
  • Long: quarterly/yearly if required, stored cheaply and immutably.

Step 7: Test restores like you mean it

The most expensive backup is the one you never restore until the day you need it. Restore tests should be scheduled, logged, and owned.
Rotate responsibility so knowledge doesn’t live in one person’s head.
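A drill can be as boring as this sketch: restore one known-critical file into a scratch directory and compare checksums (the repo and file paths reuse the examples above):

#!/usr/bin/env bash
set -e
REPO=/mnt/backup/restic-repo
FILE=/data/important/report.csv
TARGET=$(mktemp -d /tmp/restore-drill.XXXXXX)

restic -r "$REPO" restore latest --target "$TARGET" --include "$FILE"

# The restored copy lands under its original path inside TARGET. A checksum
# mismatch is not automatically an alarm (the file may have changed since the
# snapshot), but it is a prompt to look.
sha256sum "$FILE" "$TARGET$FILE"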

Step 8: Monitor the right things

  • Backup freshness: last successful snapshot time per dataset.
  • Backup integrity: periodic verification or test restore.
  • Deletion events: alerts on unusual backup deletions.
  • Storage health: SMART, RAID state, ZFS errors, scrub results.
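Backup freshness is the easiest of these to automate; a sketch assuming restic, GNU date, and jq are available (the threshold and repo path are examples):

#!/usr/bin/env bash
REPO=/mnt/backup/restic-repo
MAX_AGE_HOURS=26

latest=$(restic -r "$REPO" snapshots --latest 1 --json | jq -r '.[0].time')
age_hours=$(( ( $(date +%s) - $(date -d "$latest" +%s) ) / 3600 ))

if [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
    echo "CRITICAL: newest backup of $REPO is ${age_hours}h old"
    exit 2
fi
echo "OK: newest backup is ${age_hours}h old"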

Step 9: Run a tabletop exercise for the ugly scenarios

Practice:

  • Accidental delete (restore a directory).
  • Ransomware (assume attacker has production admin).
  • Controller failure (assume primary array is unrecoverable).
  • Site loss (assume the whole rack/region is gone).

Step 10: Decide what RAID level is for (and stop asking it to be a backup)

Use RAID/mirrors/erasure coding to meet availability and performance goals. Use backups to meet recoverability goals.
If your RAID choice is being driven by “we don’t need backups,” you’re doing architecture by wishful thinking.

One quote worth keeping above your monitor

Paraphrased idea: Hope is not a strategy. The attribution varies depending on who you ask, but engineering and operations circles keep repeating it for a reason.

If you’re building storage on hope, you’re not building storage. You’re building a future incident report with a long lead time.

FAQ

1) If I have RAID1, do I still need backups?

Yes. RAID1 protects against one disk failing. It does not protect against deletion, corruption, ransomware, controller bugs, or site loss.
RAID1 makes the system keep running while the wrong thing is happening.

2) Are snapshots a backup?

Not automatically. Snapshots are point-in-time references, usually stored on the same system. They become “backup-like” only when replicated
to an independent target with retention you can’t casually delete.

3) Is RAID6 “safe enough” to skip backups?

No. RAID6 reduces the chance of array loss from disk failures during rebuild. It does nothing for logical failures (delete, overwrite),
malware, or catastrophic events. Backups exist because disk failure isn’t the only threat.

4) What about cloud storage with redundancy—does that count as backup?

Cloud provider redundancy is typically about durability of stored objects, not your ability to recover from your own mistakes.
If you delete or overwrite, the cloud will do it reliably. You still need versioning, retention locks, and independent copies.

5) What’s the minimum viable backup plan for a small company?

Start with: daily backups to an independent target, at least 30 days retention, and one offsite copy. Add weekly/monthly retention as needed.
Then schedule restore tests. If you only do one “advanced” thing, do the restore tests.

6) How often should we test restores?

For critical systems, monthly is a reasonable baseline, with smaller partial restores more frequently (weekly is great).
After major changes—new storage, new encryption keys, new backup tool—test immediately.

7) What’s the difference between replication and backup?

Replication copies data to another place, often near-real-time. That’s great for high availability and low RPO, but it can replicate bad changes instantly.
Backups are versioned and retained so you can go back to before the failure. Many environments use both.

8) How do I protect backups from ransomware?

Separate credentials and restrict delete. Use immutability/retention locks on the backup target. Keep at least one copy offline or in a separate
admin domain. Monitor for suspicious deletion and disable backup repository access from general-purpose hosts.

9) Does ZFS eliminate the need for backups?

ZFS improves integrity with checksums and self-healing (with redundancy), and snapshots are excellent for fast rollback.
But ZFS doesn’t stop you from deleting data, encrypting it, or losing the whole pool. You still need independent backups.

10) What RPO/RTO should we pick?

Pick based on business pain, not what the storage team wishes were true. For Tier 0 data, RPO of minutes/hours and RTO of hours might be necessary.
For lower tiers, days may be acceptable. The key is that the numbers must be engineered and tested, not declared.

Next steps you can do this week

RAID is a tool for staying online through certain hardware failures. It is not a time machine. It is not a courtroom witness. It does not care
whether the data is correct; it cares whether the bits are consistent across disks.

If you run production systems, do these next steps this week:

  1. Inventory your storage: RAID level, controller type, disk ages, and rebuild behavior.
  2. Write down RPO/RTO for your top three datasets. If you can’t, you don’t have a backup plan—you have a hope plan.
  3. Verify independence: confirm backups live outside the primary failure domain and outside easy-delete credentials.
  4. Run one restore test: a single file, a directory, and (if you’re brave) a database restore to a test environment.
  5. Set alerts for backup freshness and deletion anomalies, not just disk health.

Then, and only then, enjoy your RAID. It’s useful when you treat it honestly: as redundancy, not salvation.

ZFS Scrub Slow: How to Tell Normal Slowness From a Real Problem

Your scrub has been “in progress” long enough that people are asking if the storage is haunted. Applications feel sluggish, dashboards show a sea of I/O, and the ETA is either missing or lying. You need to know: is this a normal, boring scrub doing its job, or is it a symptom of something that will bite you later?

This is the production-friendly way to answer that question. We’ll separate expected slowness (the kind you schedule and tolerate) from pathological slowness (the kind you fix before it turns into a support ticket bonfire).

What a scrub actually does (and why “slow” is sometimes correct)

A ZFS scrub is not a benchmark and not a copy operation. It’s a data integrity patrol. ZFS walks through allocated blocks in the pool, reads them, verifies checksums, and—if redundancy allows—repairs silent corruption by rewriting good data over bad. It’s proactive maintenance, the “find it before the user does” kind.

That implies two things that surprise people:

  • Scrubs are fundamentally read-heavy (with occasional writes when repairs happen). Your pool can be “slow” because reads are slow, because there’s contention with real workloads, or because ZFS is intentionally being polite.
  • Scrubs operate at the block level, not the file level. Fragmentation, recordsize choices, and metadata overhead can matter more than raw disk MB/s.

Scrubs also behave differently depending on vdev layout. Mirrors tend to scrub faster and more predictably than RAIDZ, because mirrors can service reads from either side and have simpler parity math. RAIDZ scrubs are perfectly fine when healthy, but they can turn into a long walk if you have wide vdevs, marginal disks, or heavy random I/O from apps.

Here’s the production rule I use: scrub time is an observable property of your system, not a moral failure. But a scrub rate that collapses, or an ETA that keeps growing, is a smell. Not always a fire, but always worth a look.

Short joke #1: A scrub with no ETA is like a storage outage with no postmortem—technically possible, socially unacceptable.

Interesting facts and a little history

  • ZFS popularized end-to-end checksumming in mainstream server storage. Checksums are stored separately from data, which is why ZFS can detect “lying disks” that return corrupted blocks without I/O errors.
  • Scrub is ZFS’s answer to “bit rot”—silent, incremental corruption that traditional RAID often can’t detect unless a read happens and parity rebuild is triggered.
  • The term “scrub” comes from older storage systems that periodically scanned media for errors. ZFS made it routine and user-visible.
  • RAIDZ was designed to avoid the write hole seen in classic RAID5/6 implementations, by keeping transactionally consistent metadata and copy-on-write semantics.
  • ZFS was born at Sun Microsystems and later spread widely via OpenZFS. Modern ZFS behavior depends on the OpenZFS version, not just “ZFS” as a brand.
  • Scrubs used to be more painful on systems without good I/O scheduling or where scrub throttling was primitive. Modern Linux and FreeBSD stacks give you more levers, but also more ways to shoot yourself in the foot.
  • Metadata matters. Pools with millions of small files can scrub slower than a pool with fewer large files, even if “used space” looks similar.
  • SMR drives made scrubs more unpredictable in the real world. When the drive does background shingled garbage collection, “reads” can become “reads plus internal rewrite drama.”
  • Enterprise arrays have done patrol reads for decades, often invisibly. ZFS just gives you the truth in the open—and it turns out the truth can be slow.

Normal scrub slowness vs real trouble: the mental model

“Slow scrub” is ambiguous. You need to pin down which kind of slow you’re seeing. I divide it into four buckets:

1) “Big pool, normal physics” slow

If you have hundreds of TB and spinning disks, a scrub that takes days can be normal. It’s limited by sequential read bandwidth, the vdev layout, and the fact that scrubs don’t always get perfectly sequential access patterns (allocated blocks are not necessarily contiguous).

Signals it’s normal:

  • Scrub rate is steady over hours.
  • Disk latency isn’t exploding.
  • Application impact is predictable and bounded.
  • No checksum errors, no read errors.

2) “Throttled on purpose” slow

ZFS will often self-throttle scrubs so production workloads don’t fall over. That means your scrub can look disappointingly slow while the system stays usable. This is good engineering behavior. You can tune it, but do it deliberately.

Signals it’s throttling:

  • CPU is mostly fine.
  • IOPS aren’t pegged, but scrub progress moves slowly.
  • Workload latency stays within SLOs.
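If you suspect deliberate throttling, start by seeing which knobs your OpenZFS build actually exposes; the names change between versions, so list rather than guess:

# List the scrub/scan/resilver tunables this OpenZFS build exposes (Linux path):
ls /sys/module/zfs/parameters | grep -Ei 'scrub|scan|resilver'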

3) “Contended by workload” slow

If the pool is serving a busy database, VM farm, or object workload, scrub reads compete with application reads/writes. Now scrub speed becomes a function of business hours. That’s not a ZFS failure; it’s a scheduling failure.

Signals it’s contention:

  • Scrub speed varies with traffic patterns.
  • Latency spikes correlate with application peaks.
  • Turning scrub off makes users happy again.

4) “Something is wrong” slow

This is the category you’re really asking about. Scrub slowness becomes a symptom: a disk is retrying reads, a controller is erroring, a link negotiated down to 1.5Gbps, a vdev has a sick member dragging everyone, or you’ve built a pool layout that’s fine for capacity but bad for scrub behavior.

Signals you likely have a real problem:

  • Read errors, checksum errors, or increasing “repaired” bytes across scrubs.
  • One disk shows much higher latency or lower throughput than its siblings.
  • Scrub rate collapses over time (starts normal, then crawls).
  • Kernel logs show resets, timeouts, or link issues.
  • SMART attributes show reallocated/pending sectors or UDMA CRC errors.

The key: “slow” isn’t a diagnosis. You’re hunting for a bottleneck and then asking whether that bottleneck is expected, configured, or failing.

Fast diagnosis playbook (first/second/third)

When you’re on-call, you don’t have time for a long philosophy seminar. You need a quick funnel that narrows the problem to one of: expected, contended, throttled, or broken.

First: Is the scrub healthy?

  • Check pool status for errors and the scrub’s actual rate.
  • Look for any vdev member that’s degraded, faulted, or “too many errors.”
  • Decision: if errors exist, treat this as a reliability incident first and a performance question second.

Second: Is one device dragging the whole vdev?

  • Check per-disk latency and I/O service times while scrub runs.
  • Check SMART quickly for pending sectors, media errors, and link CRC errors.
  • Decision: if one disk is slow or retrying, replace it or at least isolate it; scrubs are the canary.

Third: Is it contention or throttling?

  • Correlate scrub speed with workload metrics (IOPS, latency, queue depth).
  • Check ZFS tunables and whether scrub is intentionally limited.
  • Decision: if you’re throttled, adjust carefully; if contended, reschedule or split workloads.

Only after those three do you get to “architecture questions” like vdev width, recordsize, special vdevs, or adding cache devices. If the scrub is slow because a SATA cable is flaky, no amount of “performance tuning” fixes it.

Practical tasks: commands, what output means, and what decision you make

The following tasks are designed to be run while a scrub is active (or right after). Each includes a realistic command, sample output, what it means, and the next decision. The host prompt and outputs are illustrative, but the commands are standard in real environments.

Task 1: Confirm scrub status, rate, and errors

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:00:02 2025
        12.3T scanned at 612M/s, 8.1T issued at 403M/s, 43.2T total
        0B repaired, 18.75% done, 2 days 09:14:33 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

What it means: ZFS shows both “scanned” and “issued.” Issued is closer to actual physical I/O completion rate. If issued is far lower than scanned, you may be seeing readahead, caching effects, or waiting on slow devices.

Decision: If READ/WRITE/CKSUM counts are non-zero, stop treating this as “just slow.” Investigate the failing device(s) before tuning.

Task 2: Get one-line progress repeatedly (good for incident channels)

cr0x@server:~$ zpool status tank | sed -n '1,12p'
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:00:02 2025
        12.3T scanned at 612M/s, 8.1T issued at 403M/s, 43.2T total
        0B repaired, 18.75% done, 2 days 09:14:33 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0

What it means: This is the minimum viable status snippet. If the ETA keeps increasing hour to hour, you’re likely contended or retrying reads.

Decision: If the issued rate is steady and the ETA shrinks steadily, it’s probably normal or throttled. If it fluctuates wildly, move to per-disk checks.
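If you want that trend on record instead of in someone’s memory, a minimal logging loop works (the pool name and log path reuse the examples above):

# Append a timestamped progress snapshot every 10 minutes while the scrub runs.
while sleep 600; do
    {
        date -Is
        zpool status tank | grep -E 'scanned at|repaired,'
    } >> /var/log/scrub-progress.log
done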

Task 3: Find which vdev layout you’re dealing with

cr0x@server:~$ zpool status -P tank
  pool: tank
 state: ONLINE
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          raidz2-0                             ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...A1   ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...B2   ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...C3   ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...D4   ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...E5   ONLINE       0     0     0
            /dev/disk/by-id/ata-ST12000...F6   ONLINE       0     0     0

What it means: RAIDZ2 in one wide vdev. Scrub speed will be bounded by the slowest disk and the parity overhead. One misbehaving disk can slow the whole vdev.

Decision: If you have one very wide RAIDZ vdev and scrubs are painful, you may need an architectural change later (more vdevs, narrower width). Don’t “tune” your way out of physics.

Task 4: Check per-disk latency and utilization during scrub (Linux)

cr0x@server:~$ iostat -x 2 3
Linux 6.6.12 (server)     12/25/2025  _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.21    0.00    2.73    8.14    0.00   84.92

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s  w_await aqu-sz  %util
sda             112.0   28800.0     0.0   0.00   18.40   257.1      2.0     512.0   4.30   2.10  98.5
sdb             118.0   30208.0     0.0   0.00   17.92   256.0      2.0     512.0   4.10   2.12  97.9
sdc             110.0   28160.0     0.0   0.00   19.30   256.0      2.0     512.0   4.20   2.05  98.2
sdd              15.0    3840.0     0.0   0.00  220.10   256.0      1.0     256.0  10.00   3.90  99.1
sde             115.0   29440.0     0.0   0.00   18.10   256.0      2.0     512.0   4.00   2.08  98.0
sdf             114.0   29184.0     0.0   0.00   18.70   256.0      2.0     512.0   4.20   2.11  98.4

What it means: sdd has r_await of ~220ms while others are ~18ms. That’s your scrub anchor. The pool will move at the pace of the worst performer in a RAIDZ vdev.

Decision: Immediately inspect sdd for errors/logs/SMART. If it’s a cable/controller issue, fix that before replacing the disk.

Task 5: Check kernel logs for resets/timeouts (Linux)

cr0x@server:~$ sudo dmesg -T | egrep -i 'ata|scsi|reset|timeout|error' | tail -n 12
[Wed Dec 24 13:18:44 2025] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 24 13:18:44 2025] ata7.00: failed command: READ FPDMA QUEUED
[Wed Dec 24 13:18:44 2025] ata7: hard resetting link
[Wed Dec 24 13:18:45 2025] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Dec 24 13:18:46 2025] sd 6:0:0:0: [sdd] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=14s
[Wed Dec 24 13:18:46 2025] blk_update_request: I/O error, dev sdd, sector 123456789 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0

What it means: Link reset plus renegotiation to 1.5Gbps is classic “bad cable/backplane/port” territory. It can also be a dying disk, but cables are cheaper and embarrassingly common.

Decision: Treat as hardware fault. Reseat/replace cable or move to another port. Then re-check per-disk latency. If errors persist, replace the drive.

Task 6: Quick SMART health check for the slow device

cr0x@server:~$ sudo smartctl -a /dev/sdd | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|SMART overall|Power_On_Hours'
SMART overall-health self-assessment test result: PASSED
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       31245
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       27

What it means: Pending sectors and offline uncorrectables mean the disk is struggling to read some areas. UDMA CRC errors often point to cabling/backplane issues. “PASSED” is not absolution; it’s marketing.

Decision: If pending/offline uncorrectables exist, plan replacement. If CRC errors are increasing, fix the path (cable/backplane/HBA) too.

Task 7: Identify if the pool is doing repairs (and how much)

cr0x@server:~$ zpool status -v tank | sed -n '1,25p'
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:00:02 2025
        14.8T scanned at 540M/s, 10.2T issued at 372M/s, 43.2T total
        256M repaired, 23.61% done, 2 days 05:01:12 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0

What it means: Non-zero “repaired” during scrub means ZFS found checksum mismatches and corrected them. That’s the scrub doing its job, but it’s also evidence of corruption somewhere (disk, cabling, controller, or memory).

Decision: If repairs are recurring across scrubs, investigate root cause. One-time repair after a known event might be fine; repeated repairs are not.

Task 8: Look for ZFS-level I/O and latency indicators (Linux)

cr0x@server:~$ sudo cat /proc/spl/kstat/zfs/arcstats | egrep '^(hits|misses|size|c_max|demand_data_misses|prefetch_data_misses) ' | head
hits                            2876543210
misses                          456789012
size                            17179869184
c_max                           34359738368
demand_data_misses              123456789
prefetch_data_misses            234567890

What it means: ARC stats tell you whether reads are being served from memory or going to disk. During scrub, ARC can be less helpful because scrub reads are not always cache-friendly, but a tiny ARC with heavy workloads can worsen contention.

Decision: If ARC is starved and the box is swapping, fix memory pressure first. Don’t blame scrub for a RAM problem.
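
To confirm or rule out memory pressure before blaming ARC or the scrub, check RAM and swap directly. A minimal sketch; the numbers are illustrative:

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        48Gi       1.2Gi       0.3Gi        13Gi       9.8Gi
Swap:          8.0Gi       2.1Gi       5.9Gi
cr0x@server:~$ swapon --show
NAME       TYPE      SIZE  USED PRIO
/dev/sdg2  partition   8G  2.1G   -2

Used swap plus a shrinking ARC is a memory problem, not a scrub problem.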

Task 9: Check pool I/O in real time (Linux with zpool iostat)

cr0x@server:~$ zpool iostat -v tank 2 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        28.1T  15.1T  3.10K    220   380M  12.1M
  raidz2-0  28.1T  15.1T  3.10K    220   380M  12.1M
    sda          -      -    520     35  63.5M  2.1M
    sdb          -      -    525     36  64.1M  2.1M
    sdc          -      -    518     35  63.2M  2.0M
    sdd          -      -    110     34  12.9M  2.0M
    sde          -      -    523     35  64.0M  2.1M
    sdf          -      -    524     35  64.3M  2.1M
----------  -----  -----  -----  -----  -----  -----

What it means: Again, sdd is underperforming compared to others. In RAIDZ, that can reduce effective throughput and increase scrub time.

Decision: If one device shows persistently lower bandwidth/ops, focus there. Don’t tweak global tunables until device parity is restored.

Task 10: Confirm ashift and basic pool properties (performance baseline)

cr0x@server:~$ zdb -C tank | egrep 'ashift|vdev_tree' -n | head -n 8
64:        ashift: 12
120:    vdev_tree:
121:        type: 'root'
122:        id: 0

What it means: ashift: 12 implies 4K sectors. If you see ashift: 9 on modern 4K disks, you can get write amplification and weird performance behavior. It won’t always show during scrub (mostly reads), but it can worsen general pool performance and resilver/scrub overhead.

Decision: If ashift is wrong, the fix is usually “rebuild the pool correctly,” not “tune harder.” Put it on the roadmap.

Task 11: Check dataset compression and recordsize (workload interaction)

cr0x@server:~$ zfs get -o name,property,value -s local compression,recordsize tank/vmstore
NAME         PROPERTY     VALUE
tank/vmstore compression  lz4
tank/vmstore recordsize   128K

What it means: For VM images, recordsize often gets set smaller (like 16K) depending on I/O patterns. Large recordsize isn’t “wrong,” but if your workload is random 4K, you can end up with more reads per useful byte during scrub and heavy operational overhead in general.

Decision: Don’t change recordsize casually on existing data. But if scrub pain correlates with a dataset known for small random I/O, review dataset design for the next iteration.
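
When you do get to redesign, set properties at dataset creation time instead of retrofitting them. A minimal sketch; the dataset name and the 16K value are assumptions for a random-I/O VM workload:

cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/vmstore-new
cr0x@server:~$ zfs get recordsize,compression tank/vmstore-new

Remember that changing recordsize on an existing dataset only affects newly written blocks; existing data keeps its old block sizes until rewritten.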

Task 12: Check for special vdevs (metadata) and their health

cr0x@server:~$ zpool status tank | egrep -n 'special|log|cache|spares' -A3
15:    special
16:      nvme0n1p2             ONLINE       0     0     0

What it means: If you have a special vdev (often NVMe) storing metadata/small blocks, its health and latency can dominate scrub behavior for metadata-heavy pools. A dying special vdev can make the entire pool “feel” slow even if HDDs are fine.

Decision: If scrub is slow on a metadata-heavy workload, check special vdev performance and errors early.
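
One way to see whether the special vdev is the bottleneck is zpool iostat's per-vdev latency view. A sketch; the output is wide, so focus on the wait columns for the special device rows:

cr0x@server:~$ zpool iostat -vl tank 5 3
...output trimmed: compare total_wait/disk_wait for the special vdev against the RAIDZ members...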

Task 13: Check the actual device path and link speed (common hidden failure)

cr0x@server:~$ sudo hdparm -I /dev/sdd | egrep -i 'Transport|speed|SATA Version' | head -n 5
Transport: Serial, ATA8-AST, SATA 3.1
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)

What it means: The drive supports 6.0Gb/s but is currently at 1.5Gb/s. That’s a strong indicator of link problems, not “ZFS being slow.”

Decision: Fix the physical path. After repair, confirm it negotiates at 6.0Gb/s and rerun iostat.

Task 14: Check scrub throttling-related module parameters (Linux OpenZFS)

cr0x@server:~$ sudo systool -m zfs -a 2>/dev/null | egrep 'zfs_scrub_delay|zfs_top_maxinflight|zfs_vdev_scrub_max_active' | head -n 20
  Parameters:
    zfs_scrub_delay        = "4"
    zfs_top_maxinflight    = "32"
    zfs_vdev_scrub_max_active = "2"

What it means: These values influence how aggressively scrub issues I/O; the exact parameter names differ between OpenZFS versions, so check what your module actually exposes. More aggressive isn’t always better: you can increase queue depth and latency for applications, and sometimes slow the scrub down through thrash.

Decision: If the scrub is slow but healthy and you have headroom (low latency impact, low util), you can consider tuning. If the system is already hot, don’t “fix” it by making it fight harder.
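
If you do decide to tune, read the live values from /sys and change one parameter at a time so you can attribute the effect. A minimal sketch; the parameter name and value are examples, and names vary between OpenZFS versions:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
2
cr0x@server:~$ echo 3 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
3

Changes made this way don’t survive a reboot; persist them in /etc/modprobe.d/ only after you’ve measured the impact.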

Task 15: Confirm TRIM and autotrim behavior (SSD pools)

cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  off       default

What it means: On SSD pools, autotrim can affect long-term performance. Not directly scrub speed, but it changes how the pool behaves under sustained reads/writes and garbage collection, which can make scrubs “randomly awful.”

Decision: If you’re on SSDs and see periodic performance cliffs, evaluate enabling autotrim in a controlled change window.

Task 16: Check if you’re accidentally scrubbing frequently

cr0x@server:~$ sudo grep -R "zpool scrub" -n /etc/cron* /var/spool/cron 2>/dev/null | head
/etc/cron.monthly/zfs-scrub:4: zpool scrub tank

What it means: Monthly scrubs are common. Weekly scrubs can be fine for small pools, but on big pools they can mean you’re effectively always scrubbing, and operators start ignoring the signal.

Decision: Set a cadence appropriate to media and risk. If scrub never finishes before the next one starts, you’ve turned integrity checks into background noise.
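
Cron isn’t the only place scrubs hide. Newer OpenZFS packages can also ship systemd scrub timers, so check those before assuming you know the cadence. A sketch; the timer name is an example and may not exist on your install:

cr0x@server:~$ systemctl list-timers --all --no-pager | grep -i scrub
...no output means no scrub timers; otherwise expect entries like zfs-scrub-monthly@tank.timer...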

Three corporate mini-stories from the scrub trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a ZFS-backed virtualization cluster. Nothing exotic: RAIDZ2, big SATA disks, one pool per node. The scrubs were scheduled monthly, and they always took “a while.” People accepted it as part of life.

Then one month the scrub ETA started increasing. It wasn’t dramatic at first—just an extra day. The on-call assumed it was workload contention: end-of-quarter batch jobs. They let it ride. “Scrubs are slow; it’ll finish.”

Two days later, user-facing VM latency spiked, then settled, then spiked again. Zpool status still showed ONLINE, no obvious errors. The assumption held: “It’s busy.” So nobody looked at the disk-level stats. That was the mistake.

When someone finally ran iostat -x, one drive had 300–800ms read await, while the others sat at 15–25ms. SMART had pending sectors. The drive wasn’t failing fast; it was failing politely, dragging the whole vdev through retries. That’s the worst kind because it looks like “normal slowness” right up until it’s not.

They replaced the drive. Scrub rate immediately returned to normal. The real lesson wasn’t “replace drives faster.” It was: never assume scrub slowness is workload until you’ve proven all devices are healthy. Scrub is the only time some bad sectors get touched. It’s your early warning system. Use it.

Mini-story 2: The optimization that backfired

A different org had strict maintenance windows. They wanted scrubs to complete over a weekend, no exceptions. Someone found scrub tunables and decided to “turn it up.” They increased scrub concurrency and reduced delays. The scrub became aggressive, and the throughput number looked great—for about an hour.

Then the application latency climbed. The hypervisors started logging storage stalls. Users complained Monday morning about “random slowness.” The team blamed the network first (as teams do), then blamed ZFS, then blamed the hypervisor. Classic triangle of denial.

What actually happened was more boring: the scrub’s I/O pattern displaced the workload’s cache and drove disk queues deep. HDDs hit near-100% util with high service times. Some application reads became tail-latency monsters. The scrub itself didn’t even finish faster overall—because as queues grew, effective throughput dropped and retries increased.

They rolled back the tuning and moved scrubs to lower-traffic periods. The “optimization” was real, but the system-level impact was negative. The best performance trick in storage is still scheduling: don’t fight your users.

Short joke #2: Storage tuning is like office politics—if you push too hard, everyone slows down and somehow it’s still your fault.

Mini-story 3: The boring but correct practice that saved the day

A fintech team ran OpenZFS on Linux for a ledger-like workload. Scrubs were treated as a formal maintenance activity: scheduled, monitored, and compared against historical baselines. No heroics. Just graphs and discipline.

They kept a simple runbook: after each scrub, record duration, average issued bandwidth, and any repaired bytes. If “repaired” was non-zero, it triggered a deeper check: kernel logs, SMART long test, and a review of any recent hardware changes.

One month, a scrub completed with a small amount repaired—nothing alarming on its own. But it was the second month in a row. Their baseline logic flagged it. The on-call dug in and found intermittent CRC errors on one disk path. Not enough to fail the disk outright, but enough to flip bits occasionally under load. Exactly the kind of defect that ruins your day six months later.

They swapped the backplane cable and moved that disk to a different HBA port. Repairs stopped. No outage, no data loss, no dramatic incident report. This is the kind of win that never gets celebrated because nothing exploded. It should be celebrated anyway.

Common mistakes: symptom → root cause → fix

This section is intentionally blunt. These are patterns that show up in production, repeatedly, because humans are consistent creatures.

Scrub ETA increases over time

  • Symptom: ETA goes from “12 hours” to “2 days” while scrub runs.
  • Root cause: A device is retrying reads (media issues) or link is flapping; alternatively, workload contention ramped up.
  • Fix: Run iostat -x and zpool iostat -v to identify a slow disk; check dmesg and SMART. If no single disk is slow, correlate with workload and reschedule scrub.

Scrub is “slow” only during business hours

  • Symptom: Scrub crawls 9–5 and speeds up at night.
  • Root cause: Contention with production workload; ZFS and/or the OS scheduler is doing the right thing and prioritizing foreground I/O.
  • Fix: Schedule scrubs for low-traffic windows; consider throttling rather than aggression. Don’t crank scrub concurrency and hope.

One disk shows 10x higher await than others

  • Symptom: In iostat -x, one drive has high r_await or %util patterns that don’t match.
  • Root cause: Dying disk, SMR behavior under stress, bad cable/backplane, port negotiated down.
  • Fix: Check dmesg and SMART, confirm link speed, swap cable/port, replace drive if pending sectors or uncorrectables appear.

Scrub makes applications time out

  • Symptom: Latency spikes, timeouts, queue depth grows; scrub seems to “DoS” the system.
  • Root cause: Scrub I/O too aggressive, poor workload isolation, too few vdevs, HDD pool serving random I/O workloads without enough spindles.
  • Fix: Reduce scrub aggressiveness; schedule; add vdevs or move workload to SSD/NVMe; consider special vdevs for metadata-heavy cases. Stop expecting one wide RAIDZ vdev to act like an array.

Scrub reports repaired bytes repeatedly

  • Symptom: Every scrub repairs some data.
  • Root cause: Chronic corruption source: bad disk, bad cable, flaky controller, or memory issues (yes, memory).
  • Fix: Investigate hardware path end-to-end; run SMART long tests; check ECC logs if available; consider a controlled memory test window. Repaired data is a gift—don’t ignore it.

Scrub is slow on an SSD pool “for no reason”

  • Symptom: NVMe/SSD pool scrubs slower than expected, sometimes with periodic cliffs.
  • Root cause: Thermal throttling, SSD garbage collection, poor TRIM behavior, PCIe link issues, or a special vdev bottleneck.
  • Fix: Check temperatures and PCIe link speed; review autotrim; confirm firmware; ensure special vdev isn’t saturated or erroring.

Scrub never finishes before next scheduled scrub

  • Symptom: Always scrubbing; operators stop paying attention.
  • Root cause: Oversized pool for the given media, too frequent cadence, or scrub is being restarted by automation.
  • Fix: Reduce cadence; ensure scrubs aren’t restarted unnecessarily; consider architectural changes (more vdevs, faster media) if integrity checks can’t complete in a reasonable window.

Scrub speed is far below what raw disk math suggests

  • Symptom: “We have N disks, each can do X MB/s, so why not N×X?”
  • Root cause: Scrub reads allocated blocks, not necessarily sequential; metadata overhead; RAIDZ parity; fragmentation; and the pool might be near full, which makes everything uglier.
  • Fix: Compare against your own historical scrub baselines, not vendor datasheets. If near-full, free space. If fragmentation is severe, consider planned re-layout via replication to a fresh pool.
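
Capacity and fragmentation are cheap to check and explain a lot of “the math says it should be faster.” A minimal sketch; the numbers are illustrative:

cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,fragmentation,health tank
NAME   SIZE  ALLOC   FREE    CAP   FRAG  HEALTH
tank  43.2T  28.1T  15.1T    65%    41%  ONLINE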

Checklists / step-by-step plans

Step-by-step: Decide if a slow scrub is “normal”

  1. Capture current status. Run zpool status -v. Save it in your ticket/chat (see the sketch after this list).
  2. Look for any errors. Non-zero READ/WRITE/CKSUM counts or “repaired” bytes changes the urgency.
  3. Measure the issued rate. If issued is stable and within your historical range, it’s likely normal.
  4. Check per-disk latency. Use iostat -x (Linux) and identify outliers.
  5. Check logs. One line in dmesg about resets can explain days of scrub pain.
  6. Check SMART. Pending sectors, uncorrectables, and CRC errors decide whether you replace hardware.
  7. Correlate with workload. If scrub is slow only under load, fix scheduling and/or throttling.
  8. Only then tune. And make one change at a time with a rollback plan.
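
A minimal sketch of the “capture and save it” habit from step 1: append the scan lines, timestamped, to a per-pool log so you build the historical baseline that step 3 relies on. The log path is an assumption:

cr0x@server:~$ { date -Is; zpool status tank | grep -A2 'scan:'; } | sudo tee -a /var/log/zfs-scrub-baseline.log
2025-12-24T13:25:10+00:00
  scan: scrub in progress since Mon Dec 23 01:00:02 2025
        14.8T scanned at 540M/s, 10.2T issued at 372M/s, 43.2T total
        256M repaired, 23.61% done, 2 days 05:01:12 to go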

Step-by-step: If you find a slow disk during scrub

  1. Confirm it’s consistently slow: iostat -x 2 5 and zpool iostat -v 2 5.
  2. Check for link negotiation down: hdparm -I on SATA, or controller logs for SAS.
  3. Check kernel logs for resets/timeouts: dmesg -T filtered.
  4. Check SMART: pending/offline uncorrectable sectors mean it’s living on borrowed time.
  5. Swap the cheap stuff first (cable/port) if evidence points to link issues.
  6. Replace the disk if media issues are present or errors persist after path fixes.
  7. After replacement, run another scrub or at least a targeted verification plan based on your operational standards.

Step-by-step: If scrub is healthy but disrupts performance

  1. Confirm no device is sick (outlier latency, errors).
  2. Confirm whether scrub is already throttled (check tunables and observed I/O depth).
  3. Move scrub schedule to low-traffic periods; stagger across pools/nodes.
  4. If you must scrub during business hours, throttle rather than accelerate.
  5. Re-evaluate pool layout if you routinely can’t complete scrubs in a maintenance window.

FAQ

1) What is a “normal” ZFS scrub speed?

Normal is whatever your pool does when healthy, lightly loaded, and not erroring. Use your own historical scrub duration and issued bandwidth as the baseline. Disk vendor sequential specs are not a scrub promise.

2) Why does scanned differ from issued in zpool status?

“Scanned” reflects logical progress through blocks; “issued” reflects actual I/O sent/completed to the vdevs. Big gaps can happen due to caching, readahead, or waiting on slow devices. If issued is low and latency is high, look for a dragging disk.

3) Does a scrub read free space?

Generally, scrub checks allocated blocks (what’s actually in use). It’s not a full surface scan of every sector. That’s why a disk can still have latent bad sectors that only show up when written or read later.

4) Should I stop a scrub if it’s slow?

If the scrub is healthy but impacting production SLOs, pausing/stopping can be reasonable—then reschedule. If you see errors or repairs, stopping it just delays information you probably need. Handle the underlying hardware issue instead.

5) How often should I scrub?

Common cadence is monthly for large HDD pools, sometimes weekly for smaller or higher-risk environments. The right answer depends on media, redundancy, and how quickly you want to discover latent errors. If your scrub cadence exceeds your ability to finish scrubs, adjust—don’t normalize “always scrubbing.”

6) Scrub found and repaired data. Am I safe now?

You’re safer than you would’ve been, but you’re not “done.” Repairs mean something corrupted beneath ZFS. If repairs repeat, you need a root cause analysis of disks, cabling, controllers, and potentially memory.

7) Is RAIDZ inherently slow at scrubs compared to mirrors?

Mirrors are often faster and more predictable because each side holds a complete copy, so reads can be load-balanced across devices. RAIDZ can be fine when healthy, but a scrub has to read data and parity across the whole stripe, so wide RAIDZ vdevs are more sensitive to one slow disk and to random I/O patterns.

8) Can tuning make scrubs dramatically faster?

Sometimes modestly, if you have headroom and conservative defaults. But tuning is not a substitute for more spindles, better media, or fixing a flaky disk path. Also: tuning can backfire by increasing latency and reducing effective throughput.

9) Why is scrub slow on a pool that’s mostly empty?

Because “empty” doesn’t mean “simple.” A pool with millions of small files, heavy metadata, snapshots, or fragmentation can scrub slowly even if used space is low. Scrub touches allocated blocks; metadata-heavy allocations are not sequential candy.

10) What’s the difference between scrub and resilver, and why does it matter for slowness?

Scrub verifies existing data and repairs corruption; resilver reconstructs data to a replaced/returned device. Resilver often has different priority and patterns, and may be more write-heavy. If you confuse the two, you’ll misread the performance expectations and urgency.

Conclusion: practical next steps

Slow scrubs are not inherently scary. In fact, a slow scrub on a big, busy pool is often a sign that ZFS is behaving responsibly. What’s scary is unexplained slowness, especially when it comes with per-disk outliers, kernel resets, or recurring repairs.

Use this sequence as your default:

  1. Run zpool status -v and decide if this is a reliability event (errors/repairs) or a scheduling/perf issue.
  2. Run iostat -x and zpool iostat -v to find the slow device or confirm contention.
  3. Check dmesg and SMART for the obvious hardware path failures.
  4. Only then consider tuning and scheduling changes, and measure impact against your historical baseline.

One paraphrased idea from W. Edwards Deming fits operations work: “Without data, you’re just someone with an opinion.” Scrub slowness is your chance to collect data before you collect outages.

Debian 13: Service won’t start after config change — fix it by reading the right log lines (case #1)

You changed a config. You did the responsible thing. You even left a comment like “temporary” that will absolutely still be there in 2027. Now the service won’t start, your monitoring is paging, and systemctl status is being coy.

The good news: Debian 13 plus systemd gives you everything you need to solve this quickly—if you stop reading the wrong log lines. The bad news: most people do exactly that, squint at the last three lines of output, and then begin ritual sacrifices to “the cache”. Don’t. Read the right lines, in the right order, and you’ll fix this in minutes.

Case #1: config change → service won’t start (what actually happened)

This is the most common pattern I see on Debian systems: a service is healthy, someone edits a config file, then restarts the service. The restart fails. On-call opens systemctl status, sees “failed with result ‘exit-code’”, and starts guessing.

The fix is nearly always inside the logs, but not the part people read first. The useful line is usually:

  • Earlier than the “Main process exited…” line
  • From a helper process (like ExecStartPre) that tested config and quit
  • Or from the daemon itself, emitted once, then buried under systemd boilerplate

For case #1, imagine a typical service with a config test step:

  • ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
  • ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'

The restart fails not because systemd is mysterious, but because the pre-flight config test caught a syntax error, an invalid include path, or a permission problem on a referenced file. The “right log lines” are the ones describing that pre-flight failure. Your job is to pull them out cleanly, without drowning in unrelated noise.

Joke #1 (short, relevant): A service restart is like a parachute pack—if you skip the inspection step, you’ll still find out whether it worked.

A few facts and history that explain why logs look like this

Understanding why Debian 13 behaves the way it does makes you faster under pressure. Here are concrete facts that matter when a service refuses to start after a config change:

  1. systemd became Debian’s default init system in Debian 8 (Jessie). That decision standardized service management and logging expectations, but also changed where people look for errors.
  2. journald is not a text file. Logs are stored in a binary journal and queried with journalctl. You can still forward to syslog, but the canonical source is the journal.
  3. systemctl status is a summary, not an investigation. It shows a clipped slice of logs and a high-level unit state. It’s meant to point you to deeper queries, not replace them.
  4. systemd units can have multiple processes before the “real” daemon starts. ExecStartPre, generators, wrapper scripts, and environment files can fail before your service’s PID even exists.
  5. Exit codes are standardized, but often misleading without context. An “exit status 1” might mean “syntax error” or “permission denied” or “port already in use.” You need the message next to it.
  6. Many daemons are designed to fail fast on invalid config. Nginx, Postfix, HAProxy, and others purposely refuse to start if config tests fail—because running with partial/invalid config is worse.
  7. Debian’s packaging tends to add safety checks. Maintainers frequently include pre-start validation in units or wrapper scripts. That’s good engineering, but it means errors can come from scripts you didn’t realize were in the path.
  8. Log ordering can be deceptive. journald is timestamped, but parallel unit start and multiple processes can interleave. The “last line” is not always “the cause.”
  9. Rate limiting is real. journald can rate-limit spammy services; the first error might be recorded, the next 500 might be summarized. If you only look at the summary, you miss the first clue.

One idea worth keeping in your head, paraphrased from Gene Kim: reliability improves when you build fast feedback loops and shorten the distance between change and diagnosis.

Fast diagnosis playbook (first/second/third checks)

This is the order that wins in production. It’s biased toward getting the root cause in under five minutes, not toward feeling busy.

First: confirm what systemd thinks failed (unit-level view)

  • Get the unit state, exit code, and which phase failed (pre-start vs main start).
  • Extract the exact command line systemd ran (including ExecStartPre).

Second: pull the right journal slice (time- and unit-scoped)

  • Query logs for that unit, for the last boot, with minimal noise.
  • Then widen time range if needed; do not widen scope first.
  • Look for the first meaningful error line, not the last “exited” line.

Third: run the daemon’s own config validation manually

  • Most services have a “test config and exit” mode.
  • Run it exactly as systemd would (same user, same environment, same config path).
  • If validation passes manually but fails under systemd, suspect permissions, environment files, AppArmor, or working directory differences.

Fourth: decide between fix, rollback, or temporary bypass

  • If it’s a clear syntax error: fix it now, then restart.
  • If it’s uncertain and production is burning: rollback to last known-good config and restart.
  • Avoid “temporary” bypasses like commenting out validation steps unless you understand the blast radius.

Hands-on tasks: commands, expected output, and decisions (12+)

These tasks are written the way an SRE actually works: run a command, read the output, make a decision. No motivational speeches. Each task includes what the output means and what you do next.

Task 1: Check the unit status (but read it correctly)

cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Mon 2025-12-30 10:14:03 UTC; 42s ago
   Duration: 2.103s
    Process: 21984 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=1/FAILURE)
        CPU: 29ms

Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 10:14:03 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.

What it means: The failure happened in ExecStartPre, before nginx daemon start. That’s a config test failure, not a runtime crash.

Decision: Don’t chase ports, PID files, or kernel limits. Fix the config line referenced (app.conf:57) and rerun the config test.

Task 2: Show only the journal for this unit (the last attempt, cleanly)

cr0x@server:~$ journalctl -u nginx.service -b --no-pager -n 60
Dec 30 10:14:03 server systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 10:14:03 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.

What it means: The journal confirms the exact parser error. No need to infer.

Decision: Open the file, fix the syntax, then test config again before restarting.

Task 3: Pull logs from “since the restart” when the boot is noisy

cr0x@server:~$ systemctl show -p InactiveEnterTimestamp nginx.service
InactiveEnterTimestamp=Mon 2025-12-30 10:14:03 UTC
cr0x@server:~$ journalctl -u nginx.service -b --no-pager --since "2 min ago"
Dec 30 10:14:03 server systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.

What it means: You’re scoping logs by time instead of wading through a whole boot.

Decision: If the error isn’t in that window, widen to 10 minutes; do not remove the unit filter yet.

Task 4: Inspect the unit for pre-start checks and environment files

cr0x@server:~$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

What it means: The failure is not inside the long-running daemon; it’s in the validation step. Also, no EnvironmentFile= here, so fewer hidden variables.

Decision: Run the exact pre-start command manually to reproduce; if it fails, you fix config. If it succeeds, the failure is environmental (permissions, AppArmor, include paths).

Task 5: Run the daemon’s config test manually (same command)

cr0x@server:~$ sudo /usr/sbin/nginx -t -q -g "daemon on; master_process on;"
nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57

What it means: This is a deterministic config parse error. No systemd weirdness.

Decision: Fix the referenced file and line. Don’t restart repeatedly hoping it “settles”. It won’t.

Task 6: Locate the offending line and validate include structure

cr0x@server:~$ nl -ba /etc/nginx/sites-enabled/app.conf | sed -n '45,70p'
    45  server {
    46      listen 443 ssl;
    47      server_name app.example.internal;
    48      include /etc/nginx/snippets/tls.conf;
    49
    50      location / {
    51          proxy_pass http://127.0.0.1:8080;
    52          proxy_set_header Host $host;
    53      }
    54
    55  }   # end server
    56
    57  }

What it means: There’s an extra closing brace at line 57.

Decision: Remove it, save, rerun config test. If you see brace mismatches often, adopt a style rule: one block per file, consistent indentation, and a config linter in CI.

Task 7: Validate again, then restart (do not skip the validation step)

cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

What it means: Now it’s safe to restart.

Decision: Restart once. If restart fails now, it’s a different problem—don’t assume it’s still the config syntax.

cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl is-active nginx.service
active

What it means: Service is running.

Decision: Confirm it serves traffic (local health check) and close the incident properly.

Task 8: When status is unhelpful, show full logs with priority filtering

cr0x@server:~$ journalctl -u nginx.service -b -p warning --no-pager
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57

What it means: You filtered to warnings and worse, so you’re not reading “Started…” fluff.

Decision: Use this when a unit is chatty. If nothing appears at warning/error, you’re either logging elsewhere or you have a silent failure before logging initializes.

Task 9: Confirm which config files changed recently (catch the real culprit)

cr0x@server:~$ sudo find /etc/nginx -type f -printf '%TY-%Tm-%Td %TH:%TM %p\n' | sort -r | head -n 8
2025-12-30 10:12 /etc/nginx/sites-enabled/app.conf
2025-12-29 18:41 /etc/nginx/nginx.conf
2025-12-10 09:03 /etc/nginx/snippets/tls.conf
2025-11-21 15:22 /etc/nginx/mime.types

What it means: You can correlate the start failure with the most recent edit.

Decision: If the error references an include file, check that file’s mtime too. “I only changed one line” is rarely the whole story.

Task 10: If it’s not syntax, check for permission denied (classic after “hardening”)

cr0x@server:~$ journalctl -u nginx.service -b --no-pager -n 30
Dec 30 10:20:11 server nginx[22310]: nginx: [emerg] open() "/etc/nginx/snippets/tls.conf" failed (13: Permission denied)
Dec 30 10:20:11 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
cr0x@server:~$ namei -l /etc/nginx/snippets/tls.conf
f: /etc/nginx/snippets/tls.conf
drwxr-xr-x root root /
drwxr-xr-x root root etc
drwxr-xr-x root root nginx
drwx------ root root snippets
-rw------- root root tls.conf

What it means: Directory permissions prevent nginx (running as www-data after start) or its pre-start check from reading includes.

Decision: Fix permissions to the minimum required. Usually: directory execute bit for traversal and file read for the service user or group.

Task 11: Validate the runtime user and service sandboxing

cr0x@server:~$ systemctl show nginx.service -p User -p Group -p DynamicUser -p ProtectSystem -p ReadWritePaths
User=
Group=
DynamicUser=no
ProtectSystem=no
ReadWritePaths=

What it means: This particular unit isn’t using systemd sandboxing directives. If you do see ProtectSystem=strict or tight ReadWritePaths, config reads/writes may be blocked.

Decision: If sandboxing is enabled, align it with the daemon’s needs rather than disabling it blindly. Add explicit ReadOnlyPaths/ReadWritePaths in an override.
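
A minimal sketch of that kind of override; the unit name and paths are hypothetical, so adapt them to what the daemon actually needs:

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/myservice.service.d
cr0x@server:~$ printf '[Service]\nReadWritePaths=/var/lib/myservice\nReadOnlyPaths=/etc/myservice\n' | sudo tee /etc/systemd/system/myservice.service.d/override.conf
[Service]
ReadWritePaths=/var/lib/myservice
ReadOnlyPaths=/etc/myservice
cr0x@server:~$ sudo systemctl daemon-reload

Verify the drop-in shows up in systemctl cat before restarting; overrides that systemd hasn’t reloaded are a classic false negative.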

Task 12: Interpret failure reasons from systemd’s perspective (exit codes and signals)

cr0x@server:~$ systemctl show nginx.service -p ExecMainStatus -p ExecMainCode -p Result
ExecMainStatus=1
ExecMainCode=exited
Result=exit-code

What it means: The process exited normally with status 1. Not SIGKILL, not OOM, not a timeout.

Decision: Focus on configuration, parameters, and permissions. If you see ExecMainCode=killed or Result=timeout, that’s a different branch entirely.

Task 13: If the service is flapping, stop the restart loop while you read logs

cr0x@server:~$ sudo systemctl reset-failed nginx.service
cr0x@server:~$ sudo systemctl stop nginx.service
cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
     Active: inactive (dead)

What it means: You’re preventing systemd from spamming restarts while you debug. This also makes the journal easier to read.

Decision: Do this when Restart=always is creating noise and load. Then restart intentionally when you have a fix.

Task 14: Compare config changes safely with dpkg metadata (packaging reality check)

cr0x@server:~$ dpkg -S /etc/nginx/nginx.conf
nginx-common: /etc/nginx/nginx.conf
cr0x@server:~$ sudo ls -l /etc/nginx/nginx.conf*
-rw-r--r-- 1 root root 1492 Dec 29 18:41 /etc/nginx/nginx.conf
-rw-r--r-- 1 root root 1479 Nov 21 15:22 /etc/nginx/nginx.conf.dpkg-dist

What it means: You may have a distro-provided new default file or a pending merge. That can interact with your change.

Decision: If the service started failing after an upgrade plus a config edit, examine .dpkg-dist/.dpkg-old and reconcile intentionally.
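
To reconcile deliberately instead of guessing, diff your live file against the packaged default. A sketch; output trimmed:

cr0x@server:~$ sudo diff -u /etc/nginx/nginx.conf.dpkg-dist /etc/nginx/nginx.conf
...review the hunks; keep your intentional changes, adopt new packaged defaults consciously...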

Task 15: When logs are missing, confirm journald persistence and rate limiting

cr0x@server:~$ sudo grep -E '^(Storage|SystemMaxUse|RateLimitIntervalSec|RateLimitBurst)=' /etc/systemd/journald.conf | sed '/^#/d;/^$/d'
Storage=auto
RateLimitIntervalSec=30s
RateLimitBurst=1000
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 384.0M in the file system.

What it means: If Storage=volatile, you lose logs on reboot. If rate limiting is low, you might miss repeated errors.

Decision: For production, persist logs on disk and size appropriately. For debugging, temporarily raise rate limits if a service is spamming, but fix the spam next.

Joke #2 (short, relevant): “It worked yesterday” is not evidence; it’s just testimony from a witness with a terrible memory.

Three corporate mini-stories (and what they teach)

Mini-story 1: The incident caused by a wrong assumption

The team had a Debian fleet running a mix of web and API services. One afternoon, a routine config change went out: update TLS ciphers, standardize across environments. Someone restarted nginx on a canary. It failed. They ran nginx -t manually; it passed. The assumption formed instantly: “systemd is broken on this host.”

They dug into package versions, kernel parameters, even SELinux (which wasn’t even enabled). Meanwhile, traffic drained from the node and the autoscaler got nervous. They kept retrying restarts “just to see,” which is a great way to overwrite the one good error line with a pile of restarts.

The fix was embarrassingly simple: the systemd unit used a different config path via an environment file. Not malicious—just historical. Manual nginx -t tested /etc/nginx/nginx.conf; systemd tested /etc/nginx/nginx-canary.conf. The canary file included a snippet path that didn’t exist on that host.

The lesson isn’t “don’t use environment files.” It’s: never assume your manual reproduction matches the service manager. Extract the exact ExecStartPre/ExecStart command from systemctl cat and run that. If an environment file exists, print it, and stop guessing.

Mini-story 2: The optimization that backfired

A platform group decided to “speed up deployments” by switching from restart to reload whenever possible. Reload is cheaper: less connection churn, fewer transient errors. Good intent. Then they generalized it across multiple services with a one-size-fits-most script.

One service, a message broker, accepted reload signals but only partially reloaded configuration. For some settings it required a full restart, but the reload command returned success anyway. Over time, config drift built up: the running config didn’t match what was on disk, and people stopped trusting both.

Eventually a config change introduced a parameter that would have failed a fresh start validation. The reload did nothing useful, said “OK,” and the system ran with the old settings. Days later, a routine host reboot occurred. Now the service had to do a cold start, it read the bad config, and it refused to come up. This failure happened during a maintenance window, which is where you go to meet your future mistakes.

The lesson: reload is not a free lunch. If you choose reload as an optimization, you must also enforce config validation as part of the change process and periodically perform controlled restarts to prove the config is actually startable.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent internal service ran on Debian, backed by a database and a web front-end. They had a change policy that was not glamorous: every config edit had to be committed to a repo, and the deployment tooling always ran the service’s built-in config test before touching systemd. If the test failed, the change simply didn’t ship.

People complained about it. “It slows us down.” “I can test it in my head.” The usual. But then came a day when a senior engineer edited a config live during an incident—because the service was misbehaving and they needed a quick mitigation. The edit had a subtle quoting error. The next restart would have killed the service entirely.

The deployment tooling refused to apply the change without a passing config test. That was the whole point: guardrails when stress makes everyone sloppy. They fixed the quoting, re-tested, then restarted safely. Nobody got paged twice.

The lesson: the boring practice isn’t the repo. It’s the automatic validation gate plus a predictable rollback path. Those two things prevent small mistakes from becoming outages.

Common mistakes: symptom → root cause → fix

Here are the repeat offenders. If your service won’t start after a config change, you will likely land in one of these buckets.

1) Symptom: systemctl status shows “failed (Result: exit-code)” with no useful error

Root cause: You’re only seeing the summary. The meaningful line is earlier or truncated.

Fix: Query the journal directly and expand the slice.

cr0x@server:~$ journalctl -u myservice.service -b --no-pager -n 200
...look for the first real error line...

2) Symptom: service fails instantly after restart; logs mention ExecStartPre

Root cause: Pre-start validation failed (syntax, missing include, invalid directive).

Fix: Run the same validation manually and correct config before restarting.

cr0x@server:~$ systemctl cat myservice.service | sed -n '/ExecStartPre/p'
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'

3) Symptom: config test passes manually, fails under systemd

Root cause: Different config path, different user, different environment, or sandbox restrictions.

Fix: Extract exact command and environment from the unit; run as the service user.

cr0x@server:~$ systemctl show myservice.service -p Environment -p EnvironmentFiles
Environment=
EnvironmentFiles=/etc/default/myservice (ignore_errors=no)

4) Symptom: “Permission denied” on includes, certificates, sockets, PID files

Root cause: Hardening change (chmod/chown), new path with restrictive permissions, or service user mismatch.

Fix: Trace path permissions with namei -l; correct directory execute bits and file readability.

5) Symptom: “Address already in use” after config change

Root cause: You changed listen/port binding; another service already owns it; or the old instance didn’t stop cleanly.

Fix: Identify who holds the port; decide whether to change port back, stop the conflicting service, or fix socket activation.

cr0x@server:~$ sudo ss -ltnp | grep ':443 '
LISTEN 0      511          0.0.0.0:443        0.0.0.0:*    users:(("haproxy",pid=1203,fd=7))

6) Symptom: unit shows Result=timeout

Root cause: The daemon hangs during start (waiting on DNS, storage, migrations) or systemd’s timeout is too aggressive for a cold start.

Fix: Read logs around the hang, then adjust TimeoutStartSec only if the startup work is legitimate and bounded.
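
A minimal sketch of checking the current limit before touching it; 90 seconds is the usual default, and the override value below is an example, not a recommendation:

cr0x@server:~$ systemctl show myservice.service -p TimeoutStartUSec
TimeoutStartUSec=1min 30s
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/myservice.service.d
cr0x@server:~$ printf '[Service]\nTimeoutStartSec=300\n' | sudo tee /etc/systemd/system/myservice.service.d/override.conf
[Service]
TimeoutStartSec=300
cr0x@server:~$ sudo systemctl daemon-reload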

7) Symptom: after a config change, service “starts” but doesn’t work

Root cause: You used reload and assumed it applied everything; or the config is accepted but semantically wrong.

Fix: Run an application-level health check, and confirm active config with service introspection if available. Restart if needed.
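
An application-level check can be as small as one curl against a health endpoint, if the service exposes one; the port and path here are assumptions:

cr0x@server:~$ curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/health
200

Anything other than the expected code means “started” and “working” have diverged; restart and re-verify.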

8) Symptom: journal has no entries for the unit

Root cause: The service logs to a file (or stdout is redirected), journald is volatile, or the unit never executed due to dependency failure.

Fix: Check systemctl list-dependencies and journald settings; inspect traditional log files if configured.
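
If the unit logs to files instead of the journal, the traditional locations still apply. A sketch for nginx on Debian; timestamps and sizes are illustrative:

cr0x@server:~$ ls -lt /var/log/nginx/ | head -n 4
total 1252
-rw-r----- 1 www-data adm 812345 Dec 30 10:25 error.log
-rw-r----- 1 www-data adm 423118 Dec 30 10:25 access.log
cr0x@server:~$ sudo tail -n 5 /var/log/nginx/error.log
...look for the same [emerg]/[error] lines you expected in the journal...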

Checklists / step-by-step plan (safe fixes and rollback)

Step-by-step: diagnose and fix without thrashing the system

  1. Stop the restart loop if present. If the unit is flapping, pause it so you can read stable logs.
  2. Read the unit summary. Identify whether ExecStartPre failed or the main process died.
  3. Query journald by unit and boot. Don’t start with global logs.
  4. Extract the exact start commands. Read systemctl cat and check for drop-ins.
  5. Run the service’s config test manually. Same args, same config path.
  6. Fix the smallest thing that makes it start. Avoid refactors during outage response.
  7. Restart once, then verify at the application layer. “active (running)” isn’t the same as “serving.”
  8. Write down the root cause line. Paste the exact error string in the incident note. Future you will thank present you.

Rollback plan: when you’re not sure your fix is correct

If you can’t prove the fix quickly, rollback. Don’t “iterate in production” while the pager is screaming.

  1. Save the broken config. Copy it with a timestamp so you can analyze later.
  2. Restore last-known-good from your config repo or backups.
  3. Validate config. Always run the daemon’s test mode.
  4. Restart service and verify.
  5. Only after recovery: debug the broken change in a controlled environment.
cr0x@server:~$ sudo cp -a /etc/nginx/sites-enabled/app.conf /root/app.conf.broken.$(date +%F-%H%M%S)
cr0x@server:~$ sudo cp -a /root/rollback/app.conf /etc/nginx/sites-enabled/app.conf
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl is-active nginx.service
active

When you must keep partial service up (damage control)

Sometimes you can’t fully fix it immediately, but you can reduce impact:

  • Restore a minimal config that serves a maintenance page.
  • Disable the broken virtual host while keeping others running (see the sketch after this list).
  • Route traffic away from the node temporarily, fix in isolation, then reintroduce.
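
A minimal sketch of the “disable the broken virtual host” option, assuming Debian’s sites-available/sites-enabled symlink layout; the file name is hypothetical:

cr0x@server:~$ sudo rm /etc/nginx/sites-enabled/broken-app.conf
cr0x@server:~$ sudo nginx -t && sudo systemctl reload nginx.service
cr0x@server:~$ systemctl is-active nginx.service
active

Removing the symlink leaves the original file in sites-available, so you can re-enable it once it’s fixed.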

Do not disable validation steps to “make it start” unless you’re certain the daemon won’t start in a corrupted state. That path leads to data loss, and you won’t enjoy the postmortem.

FAQ

1) Why isn’t systemctl status enough?

Because it’s intentionally compact. It shows a small log tail and a unit state summary. Use it to find the unit name, failure phase, and then pivot to journalctl -u for the real diagnosis.

2) What’s the single best journalctl command for this situation?

Usually:

cr0x@server:~$ journalctl -u myservice.service -b --no-pager -n 200

If that’s noisy, add -p warning or restrict time with --since.

3) How do I know whether the failure is config vs runtime?

Look for ExecStartPre failing (config/validation) versus the main process starting and then dying (runtime). systemctl status usually tells you which process failed.

4) Why does running the config test manually sometimes succeed when systemd fails?

Different environment. systemd may use an environment file, a different working directory, sandboxing, or a different user context. Always reproduce using the exact command line from the unit.

5) How do I see drop-in overrides that might change behavior?

cr0x@server:~$ systemctl status myservice.service --no-pager
...look for "Drop-In:" lines...
cr0x@server:~$ systemctl cat myservice.service
...includes /etc/systemd/system/myservice.service.d/*.conf if present...

6) When should I use reload instead of restart?

Only when the service documents that reload applies the changes you made, and you have a validation step. If you’re uncertain, restart during a safe window or after draining traffic.

7) What if there are no logs at all for the unit?

Then either the unit didn’t execute, journald isn’t retaining logs, or logs go elsewhere (like /var/log/*). Check dependencies and journald settings, and inspect file-based logs if configured by the service.

8) How do I quickly tell if this is a permissions problem?

Look for “Permission denied” in the journal, then trace the file path with namei -l. Permissions issues often come from directory traversal bits or new hardening changes that forgot the service user.

9) What’s the safest way to prevent this class of outage?

Automate config validation (daemon test mode) before restart/reload, keep configs in version control, and make rollback trivial. The goal is to catch the broken line before it reaches systemd.
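
The guard can be as simple as chaining the test and the reload so the reload never runs on a broken config. A minimal sketch for nginx; adapt the test command per service:

cr0x@server:~$ if sudo nginx -t; then sudo systemctl reload nginx.service; else echo "refusing to reload: config test failed"; fi
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful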

Conclusion: next steps that prevent repeat incidents

A Debian 13 service failing after a config change is rarely a mystery. It’s usually one precise error line that you didn’t extract cleanly. Read the unit to learn what ran. Read the journal scoped to the unit to learn what failed. Then validate config manually using the exact command systemd uses.

Practical next steps:

  • Add a pre-deploy config test step for every service that supports it.
  • Train your team to treat systemctl status as a pointer, not a diagnosis.
  • Make rollback a first-class operation (copy, restore, validate, restart).
  • Standardize on a short “fast diagnosis” runbook and keep it near the pager rotation.

Do that, and the next time a service refuses to start, you’ll spend your time fixing the actual problem—not arguing with a summary screen.