The weirdest outages in high-performance computing aren’t caused by cosmic rays or exotic race conditions.
They’re caused by ordinary assumptions that become weapons once you scale to tens of thousands of cores,
millions of files, and networks designed like a cathedral.
If your code runs fine on 64 cores and then turns into a pumpkin at 4,096, you don’t have a “performance problem.”
You have a scaling bug. And scaling bugs are rarely glamorous. They’re often boring. That’s why they survive.
What scaling bugs look like in real supercomputers
A scaling bug is what happens when something that is “fine” at small scale becomes mathematically,
physically, or organizationally impossible at large scale.
Not slower. Impossible.
At 8 nodes, a sloppy barrier is a rounding error. At 8,192 nodes, that barrier becomes a tax you pay on every iteration,
and the tax collector is a packet storm. At 8 nodes, opening 10,000 files during startup is “fast enough.”
At 8,192 nodes, it’s a distributed denial-of-service against your metadata server.
The tricky part is that the system often looks “healthy” from the outside. CPUs are busy, the job isn’t crashed, the network links are up,
and storage is “only” at 40% capacity. Meanwhile, your wall-clock time doubles when you add more nodes.
That’s not a mystery. That’s physics, queueing theory, and one developer’s innocent assumption about “just one more MPI_Allreduce.”
Here’s the headline: supercomputers fail in the same ways normal fleets fail.
They just do it at an absurd scale, with tighter coupling, and users who measure time in node-hours and grudges.
Scaling bugs are often “silly” because they’re human-sized
- O(N) control paths that should have been O(1) become a cliff. An all-to-all “gather debug info” turns into an all-to-all funeral.
- Single shared resources (one lock, one leader rank, one metadata target, one head node thread) become chokepoints.
- Default settings (TCP buffers, hugepages, scheduler limits, Lustre stripe counts) were chosen for “reasonable” workloads, not your chaos.
- Observability gaps turn performance into superstition. Without per-rank timing, you argue about ghosts.
Joke #1: A 10,000-core job is just a normal job with more opportunities to be wrong.
A single quote worth keeping on the wall
“Hope is not a strategy.” — General Gordon R. Sullivan
If you run production HPC, you can be optimistic about science. You cannot be optimistic about tail latency,
shared filesystems, or the probability that one node in 5,000 will do something weird today.
Facts and history that actually matter
Supercomputing history is full of shiny FLOPS numbers. The operational lessons are in the footnotes.
A few concrete facts and context points that change how you think about scaling:
- Amdahl’s Law (1967) didn’t get less true because we started buying GPUs. Serial fractions still murder speedup at scale; see the quick arithmetic check after this list.
- Gustafson’s Law (late 1980s) explains why “weak scaling” can look great even when “strong scaling” is falling apart. Don’t confuse the two.
- MPI has been around since the 1990s; the communication patterns are well-studied. Many modern “new” scaling bugs are old mistakes wearing containers.
- Parallel filesystems like Lustre popularized the separation of metadata and data targets; most catastrophic slowdowns today are metadata pathologies, not raw bandwidth limits.
- InfiniBand and RDMA made low-latency collectives possible, but they also made it easier to saturate fabric with collective storms.
- Exascale-era systems have leaned hard into heterogeneous nodes (CPUs + GPUs). That shifts bottlenecks: PCIe/NVLink locality and host staging are frequent silent killers.
- Checkpoint/restart became operationally mandatory as job sizes grew; you can’t treat failures as “rare events” when you run 20,000 components for hours.
- Filesystem metadata scaling became a first-class concern once workflows shifted from a few giant files to millions of small ones (think ML feature shards, ensemble runs, and per-rank logs).
- Schedulers (PBS, Slurm, etc.) turned supercomputers into shared production systems. That introduced “queuing physics”: backfill, fragmentation, and node health become performance variables.
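To make the Amdahl point concrete, here is a back-of-the-envelope check. The 1% serial fraction and 4,096 cores are illustrative assumptions, not a claim about any particular code:

# Amdahl's Law: speedup = 1 / (s + (1 - s) / N), with serial fraction s and core count N.
awk 'BEGIN { s = 0.01; n = 4096; printf "max speedup on %d cores: %.1fx\n", n, 1 / (s + (1 - s) / n) }'
# Prints roughly 97.6x. A "mere" 1% serial fraction caps you at about 2% of the ideal 4,096x speedup,
# before you tune a single collective or stripe count.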
The main failure modes: compute, network, storage, scheduler
1) Compute: the slowest rank sets the pace
In tightly-coupled HPC, performance is dictated by the tail. One rank hits a page fault storm, or one socket runs at a lower frequency,
and the whole job waits at the next synchronization point. This is why “average CPU utilization” is a lie you tell yourself to feel better.
Watch for:
- NUMA locality mistakes: memory allocated on the wrong socket, remote bandwidth, unpredictable stalls.
- Thread oversubscription: OpenMP defaults that multiply across MPI ranks and blow out CPU caches.
- Frequency throttling: power caps, thermal throttling, or AVX-induced downclocking changing per-node timing.
2) Network: collectives don’t care about your feelings
Most “mystery” scaling collapses are communication overhead that grew faster than your compute.
Collectives (Allreduce, Alltoall, Barrier) are necessary; they’re also the easiest way to turn a fancy fabric into a parking lot.
Watch for:
- All-to-all patterns in FFTs, transposes, and some ML workloads. They are sensitive to topology and congestion.
- Imbalanced ranks turning collectives into repeated stalls.
- MTU, credit, or queue tuning that helped microbenchmarks but destabilized real traffic.
3) Storage: bandwidth is rarely the first thing that breaks
Storage scaling bugs are often metadata contention disguised as “slow I/O.”
You see it as: jobs hang at startup, or “open() is slow,” or “stat() takes forever,” or the filesystem looks fine but the app is stuck.
Watch for:
- Millions of tiny files at job start/end (per-rank logs, per-task outputs, temp files).
- Checkpoint storms where many ranks write simultaneously, saturating a subset of OSTs due to poor striping.
- Client-side caching mismatches: too aggressive caching causing contention and invalidations, too little causing metadata round-trips.
4) Scheduler and operations: your job is a guest in a crowded hotel
Scaling isn’t just inside the job. It’s in the cluster.
Scheduler policies, node health, and shared services (auth, DNS, container registries, license servers) become part of your performance profile.
Watch for:
- Job placement: scattered nodes across islands or leaf switches, increasing hop count and contention.
- Background noise: monitoring agents, kernel updates, runaway logs, or a neighboring job that’s doing “creative” I/O.
- Control plane fragility: Slurmctld overload, slow prolog/epilog scripts, or authentication delays at scale.
Fast diagnosis playbook (find the bottleneck fast)
You don’t get bonus points for diagnosing slowly. When a large job is burning node-hours, you triage like an SRE:
determine whether you’re compute-bound, network-bound, or I/O-bound, then narrow down the specific limiter.
First: confirm what “bad” means and isolate scale
- Reproduce at two sizes: one that scales “okay” and one that doesn’t (e.g., 256 ranks vs 4096). If the problem doesn’t vary with scale, it’s not a scaling bug.
- Decide your metric: time per iteration, time to checkpoint, time to solution, or job efficiency. Pick one. Don’t hand-wave.
- Check for phase changes: startup, steady-state compute, communication phases, I/O checkpoints, finalization. Scaling failures often hide in short phases.
Second: determine the dominant wait state
- CPU busy but low IPC suggests memory/NUMA or vector throttling.
- High time in MPI calls suggests network/collectives or imbalance.
- High time in open/stat/fsync suggests metadata pathologies.
- Many ranks idle at barriers suggests load imbalance or one slow node.
Third: find the “one slow thing” pattern
At scale, one sick node can throttle a whole job.
Don’t average. Identify outliers by rank and by host.
- Find the slowest ranks (app timing or MPI profiling).
- Map ranks to hosts (scheduler or mpirun mapping).
- Check node-local health: memory errors, CPU frequency, NIC counters, filesystem client errors.
Fourth: verify the shared services aren’t the hidden bottleneck
DNS timeouts, LDAP slowness, container registry pulls, and license checkouts can look like “application hang”
when multiplied by thousands of nodes. The cluster can be down without being “down.”
Joke #2: Nothing builds teamwork like 4,000 nodes waiting on one DNS lookup.
Hands-on tasks: commands, outputs, decisions (12+)
These are realistic operational tasks you can run during an incident or a performance investigation.
Each one includes: a command, what typical output means, and what decision you make from it.
Adjust paths and device names to your environment.
Task 1: Confirm job placement and node list (Slurm)
cr0x@server:~$ scontrol show job 842193
JobId=842193 JobName=climate_step
UserId=ana(14021) GroupId=hpc(14000) MCS_label=N/A
Priority=12233 Nice=0 Account=research QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:18:42 TimeLimit=02:00:00 TimeMin=N/A
NodeList=cn[1203-1234,1301-1366]
NumNodes=98 NumCPUs=6272 NumTasks=6272 CPUs/Task=1
TRES=cpu=6272,mem=1500G,node=98
MinCPUsNode=64 MinMemoryNode=15000M MinTmpDiskNode=0
What it means: NodeList shows whether you got a compact block or scattered allocation. Mixed ranges can mean multiple network islands.
Decision: If performance is sensitive to topology, request contiguous nodes (constraints/partitions) or use scheduler topology-aware options.
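If the cluster exposes switch topology to Slurm, you can ask for a compact allocation explicitly. A minimal sketch, assuming topology-aware scheduling is configured on your site and job.sh is your batch script:

# Request all nodes under at most one leaf switch, waiting up to 30 minutes for such a slot to open.
sbatch --nodes=98 --switches=1@00:30:00 job.sh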
Task 2: Check job efficiency and find obvious waste (Slurm accounting)
cr0x@server:~$ sacct -j 842193 --format=JobID,Elapsed,AllocCPUS,CPUTime,MaxRSS,AveCPU,State
JobID Elapsed AllocCPUS CPUTime MaxRSS AveCPU State
842193 00:18:42 6272 81-10:46:24 2100M 00:08.2 RUNNING
842193.batch 00:18:42 64 19:56:48 320M 00:18.9 RUNNING
What it means: AveCPU low relative to Elapsed suggests lots of waiting (MPI, I/O, imbalance). MaxRSS helps spot memory headroom or paging risk.
Decision: If AveCPU is far below expected, profile MPI time or I/O; if MaxRSS is near node memory limits, expect paging and slow ranks.
Task 3: Find per-node CPU frequency and throttling hints
cr0x@server:~$ sudo turbostat --quiet --Summary --interval 5 --num_iterations 1
Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI PkgTmp PkgWatt CorWatt
1890 78.3 2415 2300 8120 0 84.0 265.4 202.1
What it means: If Bzy_MHz is low under load or PkgTmp is high, you may be throttling. Busy% near 100% but low MHz is suspicious.
Decision: If throttling is present, check power caps/thermal issues; consider reducing AVX-heavy code frequency impact or adjusting power management policies.
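Two quick follow-ups on a node you suspect of throttling. The RAPL sysfs path below exists only on Intel CPUs and may require root; cpupower ships with the kernel tools package:

# Current package power cap in microwatts (Intel RAPL); compare across nodes of the same SKU.
cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
# Governor, hardware frequency limits, and current policy for the cpufreq driver.
cpupower frequency-info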
Task 4: Spot NUMA misplacement quickly
cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-31
node 0 size: 256000 MB
node 0 free: 182340 MB
node 1 cpus: 32-63
node 1 size: 256000 MB
node 1 free: 190112 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What it means: Distance indicates remote access penalty. If processes run on node 0 but allocate memory on node 1, you pay in bandwidth and latency.
Decision: Bind ranks/threads and memory consistently (e.g., Slurm --cpu-bind, numactl, or OpenMP affinity). Validate with per-rank performance.
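A hedged example of consistent binding, assuming a 2-socket, 64-core node like the one above and an executable called ./app:

# One rank per socket, 32 threads each, with CPUs and memory kept local to the socket.
export OMP_NUM_THREADS=32 OMP_PROC_BIND=close OMP_PLACES=cores
srun --ntasks-per-node=2 --cpus-per-task=32 --cpu-bind=cores --mem-bind=local ./app
# Equivalent idea for a single process outside Slurm: pin CPU and memory to NUMA node 0.
numactl --cpunodebind=0 --membind=0 ./app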
Task 5: Identify CPU runqueue pressure and iowait on a node
cr0x@server:~$ mpstat -P ALL 1 3
Linux 5.15.0 (cn1203) 01/22/2026 _x86_64_ (64 CPU)
12:10:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:10:12 PM all 61.2 0.0 6.1 18.9 0.0 1.2 0.0 12.6
12:10:12 PM 0 55.0 0.0 5.0 29.0 0.0 1.0 0.0 10.0
What it means: High %iowait indicates the CPU is stalled waiting on I/O. Not “disk busy,” but “your process is blocked.”
Decision: If iowait spikes during checkpoints, investigate filesystem throughput and stripe settings; if during compute, look for paging or filesystem metadata calls.
Task 6: Confirm if you’re paging (a classic slow-rank generator)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
12 0 0 182340 4212 81234 0 0 0 12 5200 8100 62 6 13 19
10 2 8192 10240 3900 22000 120 240 1024 2048 6100 9900 41 7 8 44
What it means: Non-zero si/so (swap-in/out) under load is bad news. Even “a little swap” at scale creates stragglers.
Decision: Reduce memory footprint, increase per-node memory request, fix leaks, or adjust problem size. If only one node swaps, suspect a bad DIMM or misconfigured cgroup limits.
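Two cheap checks on the suspect node before blaming the application (both are standard util-linux/procps tools):

# Is swap configured at all on this node, and how much is in use right now?
swapon --show
# Headline memory numbers; compare "available" against the job's expected per-node footprint.
free -h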
Task 7: Check MPI rank imbalance with a quick-and-dirty timing file
Assume the application writes one "rank seconds" line per rank to rank_times.txt (the filename is arbitrary):
cr0x@server:~$ awk '{sum+=$2; if($2>max){max=$2; rmax=$1} if(min==0||$2<min){min=$2; rmin=$1}} END{printf "avg=%.2f max=%.2f (rank %s) min=%.2f (rank %s)\n", sum/NR, max, rmax, min, rmin}' rank_times.txt
What it means: A wide max/min spread screams imbalance or a slow node. The max rank is your investigative target.
Decision: Map the slow rank to a host; check node health, NUMA placement, and NIC/storage errors on that host.
Task 8: Map ranks to nodes (Slurm + mpirun style)
cr0x@server:~$ srun --jobid=842193 -w cn1203 -N 1 -n 1 hostname
cn1203
What it means: Validate you can target specific nodes from the allocation. You’ll use this to interrogate outliers.
Decision: If the slow ranks map to one node or one switch group, you likely have a hardware or topology issue, not an algorithm issue.
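A sketch for capturing the full rank-to-host map at launch time, from inside the job script before the main step starts. SLURM_PROCID and SLURM_NTASKS are set by Slurm; rank_to_host.txt is a name chosen here:

# One "rank hostname" line per task, sorted by rank, saved for later outlier hunting.
srun -n "$SLURM_NTASKS" bash -c 'echo "$SLURM_PROCID $(hostname)"' | sort -n > rank_to_host.txt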
Task 9: Check InfiniBand port counters for errors and congestion
cr0x@server:~$ ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 1043
SM lid: 1
Link layer: InfiniBand
cr0x@server:~$ perfquery -x 1043 1 | egrep 'PortXmitWait|PortRcvErrors|PortXmitDiscards'
PortXmitWait....................: 000000000000a1f2
PortRcvErrors...................: 0000000000000000
PortXmitDiscards................: 0000000000000003
What it means: PortXmitWait suggests congestion (waiting to transmit). Discards indicate drops; not normal at steady state.
Decision: If congestion counters climb during slow phases, look at job placement/topology and collective algorithms; if errors/discards climb on one node, suspect cable/NIC/switch port.
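If you have fabric-wide access, infiniband-diags can sweep every port for nonzero error counters instead of checking hosts one by one. This is usually an admin-only operation and needs subnet manager access:

# Walk the fabric and report ports with nonzero error counters; run from a node with SM access.
ibqueryerrors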
Task 10: Verify link utilization and drops on Ethernet management or storage networks
cr0x@server:~$ ip -s link show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 120 0 10022
TX: bytes packets errors dropped carrier collsns
8765432109 7345678 0 42 0 0
What it means: Dropped packets at scale can show up as “random slowness,” especially for control plane services and NFS-based homes.
Decision: If drops are rising, check NIC ring buffers, driver/firmware, and switch queues; consider moving noisy traffic off shared management links.
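Follow-up commands for the ring-buffer theory, using the same interface as above (ethtool output and counter names vary by driver):

# Current vs. maximum RX/TX ring sizes; small rings plus bursty traffic is a classic drop recipe.
ethtool -g eno1
# Driver-level counters; many NICs expose per-queue drop/overflow counters here.
ethtool -S eno1 | grep -i -E 'drop|discard|fifo'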
Task 11: Identify metadata hot spots on a Lustre client
cr0x@server:~$ lctl get_param -n llite.*.stats | egrep 'open|close|statfs|getattr' | head
open 1209341 samples [usec] 1 10 25 100 250 1000 2000 5000 10000 50000
close 1209340 samples [usec] 1 10 25 100 250 1000 2000 5000 10000 50000
getattr 883201 samples [usec] 1 10 25 100 250 1000 2000 5000 10000 50000
statfs 12012 samples [usec] 1 10 25 100 250 1000 2000 5000 10000 50000
What it means: Huge open/getattr counts suggest the app is hammering metadata. Latency buckets (if expanded) show whether these calls are slow.
Decision: If metadata is hot, reduce file count, use per-node aggregation, avoid per-rank logs, and consider directory hashing/striping practices supported by your filesystem.
Task 12: Check OST/MDT health and saturation signals (Lustre server side)
cr0x@server:~$ lctl get_param obdfilter.*.kbytesavail | head
obdfilter.fs-OST0000.kbytesavail=912345678
obdfilter.fs-OST0001.kbytesavail=905123456
obdfilter.fs-OST0002.kbytesavail=887654321
cr0x@server:~$ lctl get_param mdt.*.md_stats | head
mdt.fs-MDT0000.md_stats=
open 39123890
close 39123888
getattr 82012311
setattr 1023311
What it means: Capacity imbalance can create unintended hotspots. High MDT ops correlate with “slow startup” and “hanging finalization.”
Decision: If one OST is far fuller, rebalance or adjust striping; if MDT ops are extreme during job storms, tune client behaviors and fix application file patterns.
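If the decision is to adjust striping, a minimal sketch with lfs. The directory path, stripe count, and stripe size are assumptions; match them to your checkpoint file sizes and OST count:

# Show the directory's current default layout, then set a wider default for new files.
lfs getstripe -d /lustre/project/checkpoints
lfs setstripe -c 8 -S 4M /lustre/project/checkpoints
# Per-OST fill levels; a single OST far above the others is your hotspot candidate.
lfs df -h /lustre/project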
Task 13: Observe per-process I/O with pidstat
cr0x@server:~$ pidstat -d -p 21344 1 3
Linux 5.15.0 (cn1203) 01/22/2026 _x86_64_ (64 CPU)
12:12:01 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:12:02 PM 14021 21344 0.00 51200.00 0.00 87 climate_step
12:12:03 PM 14021 21344 0.00 48000.00 0.00 91 climate_step
What it means: High write throughput plus iodelay indicates the process is blocked on I/O. If many ranks show this simultaneously, it’s a filesystem bottleneck.
Decision: Coordinate checkpoint timing, increase striping, reduce frequency, or use node-local burst buffers if available.
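A sketch of the burst-buffer idea using plain node-local NVMe, assuming the application can be pointed at a checkpoint directory. The /local/nvme path and the --checkpoint-dir flag are hypothetical:

# Write checkpoints to node-local storage first, then copy back once per node after the run.
ckpt=/local/nvme/$SLURM_JOB_ID
mkdir -p "$ckpt"
./climate_step --checkpoint-dir "$ckpt"          # hypothetical application option
cp "$ckpt"/*.ckpt /lustre/project/checkpoints/   # one sequential copy instead of a write storm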
Task 14: Detect “small I/O” pathology with iostat
cr0x@server:~$ iostat -dxm 1 2
Linux 5.15.0 (cn1203) 01/22/2026 _x86_64_ (64 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 0.0 3200.0 0.0 48.0 30.7 9.4 3.1 0.2 64.0
What it means: Lots of IOPS with tiny avgrq-sz means small writes. Await grows when queue builds. %util shows device saturation.
Decision: If it’s node-local, fix buffering and batching; if it’s a shared mount, the pattern likely amplifies at scale—aggregate writes and align I/O sizes.
Task 15: Find syscall hotspots (open/stat/fsync) with strace summary
cr0x@server:~$ strace -c -f -p 21344 -e trace=openat,statx,futex,fsync,close -o /tmp/strace.out
strace: Process 21344 attached
^Cstrace: Process 21344 detached
cr0x@server:~$ tail -n 8 /tmp/strace.out
% time seconds usecs/call calls errors syscall
62.14 1.840231 92 20007 102 openat
18.77 0.555880 71 7821 0 statx
11.02 0.326112 40 8099 0 futex
8.07 0.239055 310 771 0 fsync
What it means: If openat/statx dominate, you have metadata intensity. If futex dominates, you may have lock contention. If fsync dominates, you’re paying durability costs.
Decision: For metadata, reduce file ops and use fewer files; for futex contention, refactor threading or reduce shared locks; for fsync, batch syncs or use safer but less frequent durability points.
Task 16: Check kernel and filesystem client logs for “soft” errors
cr0x@server:~$ dmesg -T | tail -n 12
[Thu Jan 22 12:06:41 2026] Lustre: llite fs-ffff8c2b3c2c8800: server not responding, reconnecting
[Thu Jan 22 12:06:43 2026] Lustre: llite fs-ffff8c2b3c2c8800: Connection restored
[Thu Jan 22 12:08:10 2026] mlx5_core 0000:41:00.0: CQE error: syndrome 0x2 vendor syndrome 0x0
[Thu Jan 22 12:08:10 2026] mlx5_core 0000:41:00.0: Dumping QP 0x1a2b
What it means: Transient reconnects and NIC CQ errors can create slow ranks without a hard failure. These are the “silly” bugs that ruin scaling graphs.
Decision: If only a subset of nodes show these, drain them from the scheduler and open a hardware ticket; if widespread, suspect fabric/filesystem incident.
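Draining a suspect node is one scontrol command; the reason string is free text, so make it searchable later (admin privileges required):

# Stop new jobs from landing on the node; running work is allowed to finish.
scontrol update NodeName=cn1203 State=DRAIN Reason="recurring mlx5 CQE errors + Lustre reconnects"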
Three corporate-world mini-stories (anonymized)
Mini-story 1: The wrong assumption (and the MPI_Barrier tax)
A research group brought a new simulation into a corporate HPC cluster—serious code, serious science, serious budget.
The developers had validated correctness on a modest partition and asked for “as many nodes as you can spare”
to hit a deadline.
The initial run scaled beautifully to a few hundred ranks. Past that, performance went sideways.
The job didn’t crash; it just stopped getting faster. Users blamed “the network.”
The network team blamed “the code.” Everyone was half right, which is the worst kind of right.
Profiling showed a shocking fraction of time in MPI_Barrier. That’s not automatically a crime,
but it’s never a compliment. The real culprit was an assumption: a “debug timing” block that gathered
per-rank metrics at every iteration and serialized output through rank 0. It was supposed to be temporary.
It had become permanent by inertia.
At small scale, the barrier cost was hidden under compute. At large scale, the barrier became a global
synchronization point that amplified minor imbalances into major stalls. Rank 0 then did extra work formatting
and writing logs. The “temporary debug” turned into a load imbalance generator.
The fix was almost offensively simple: collect timing every N iterations, aggregate hierarchically (node-level then global),
and write one compact structured record per checkpoint instead of per iteration. Scaling recovered.
Not because the network got faster, but because the code stopped demanding the network behave like magic.
Mini-story 2: The optimization that backfired (striping for glory, contending for reality)
A platform team tried to help an analytics workload that wrote big checkpoint files.
Someone suggested increasing Lustre striping across many OSTs to “get more bandwidth.”
It worked in a small benchmark. They rolled it out broadly via a module that set default striping
for a whole project directory.
Then production happened. The workload wasn’t writing one big sequential file per job.
It wrote many medium files and did frequent metadata operations. Striping them widely increased the amount of
coordination the filesystem had to do. It also raised the probability that at least one OST was busy,
turning tail latency into a controlling factor.
The most painful symptom: jobs became unpredictable. Some ran fine, some crawled.
Users did what users do—they resubmitted. This multiplied the load and made the cluster look haunted.
The diagnosis came from correlating slow periods with OST load imbalance and elevated metadata ops.
“More stripes” increased cross-OST fanout and contention, which hurt when the access pattern wasn’t
large streaming I/O. The optimization solved a lab problem and created a production problem.
They reverted the default striping, set sane per-workload guidance, and added a guardrail:
jobs that used the project directory template had a preflight check to validate file sizes and recommended stripe counts.
The filesystem didn’t become faster; it became predictable again. Predictable beats fast in shared systems.
Mini-story 3: The boring practice that saved the day (draining bad nodes)
A long-running workload began failing intermittently, but only at high scale.
It looked like an application bug: random hangs, occasional MPI timeouts, and slowdowns that vanished when rerun.
The cluster health dashboard was green. Of course it was.
The operations team had a boring habit: they tracked “suspect nodes” based on low-level counters,
not just hard failures. A node that logged recurring NIC CQ errors or filesystem reconnects went onto a watch list.
After a threshold, it got drained from the scheduler and tested offline.
During the incident, they mapped the slowest MPI ranks to hosts and found a pattern:
a small set of nodes repeatedly hosted stragglers. Those nodes had no dramatic errors—just small, frequent warnings.
Enough to slow one rank. Enough to stall thousands.
They drained the nodes, reran the workload, and the “application bug” vanished.
Hardware was replaced later. The science team got their results on time.
The practice wasn’t clever. It was disciplined.
Common mistakes: symptoms → root cause → fix
Scaling failures repeat because they look like each other from far away. Here are the classics, written in the
format you actually need during a fire.
1) Symptom: adding nodes makes the job slower
Root cause: You’re strong-scaling past the point where communication and synchronization dominate; or you introduced a global serialized path (rank 0 I/O, barriers, locks).
Fix: Reduce global collectives, overlap communication with compute, use hierarchical reductions, and measure time in MPI calls. Stop scaling at the knee of the curve.
2) Symptom: job “hangs” at startup or at the end
Root cause: Metadata storm: thousands of ranks doing stat/open/unlink in the same directory, or contending on shared Python envs/modules.
Fix: Stage software locally, use per-node shared caches, avoid per-rank file creation, bundle outputs, and distribute directory trees. Treat metadata like a scarce resource.
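One hedged way to “stage software locally” with stock Slurm, run from inside the job script and assuming the environment is packed into a tarball (paths are illustrative):

# Copy the archive to node-local /tmp once per node, then unpack it once per node.
sbcast --compress /lustre/project/envs/py310.tar.gz /tmp/py310.tar.gz
srun --ntasks-per-node=1 tar -xf /tmp/py310.tar.gz -C /tmp
# Ranks then load from /tmp instead of hammering the shared filesystem's metadata server.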
3) Symptom: periodic multi-minute stalls during steady-state compute
Root cause: Checkpoint bursts or background filesystem recovery events; or noisy neighbors saturating shared OSTs/MDTs.
Fix: Stagger checkpoints, use burst buffers, tune striping for file size, and coordinate cluster-wide checkpoint windows for giant jobs.
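A crude staggering sketch for the per-node launch wrapper, assuming your checkpoints tolerate a small skew. SLURM_NODEID is set per node; the 16-group, 5-second spacing is an arbitrary choice, and sync_checkpoint stands in for whatever triggers your checkpoint write:

# Offset each node's checkpoint flush so thousands of ranks don't hit the filesystem in the same second.
sleep $(( (SLURM_NODEID % 16) * 5 ))
sync_checkpoint   # hypothetical stand-in for the application's checkpoint trigger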
4) Symptom: only some runs are slow; reruns “fix” it
Root cause: Tail issues: a single degraded node, switch port errors, imbalanced placement, or an OST that’s hotter than others.
Fix: Identify outlier ranks/hosts; drain suspect nodes; enforce topology-aware placement; rebalance filesystem targets.
5) Symptom: high CPU utilization but poor progress
Root cause: Spin-waiting, lock contention, or busy-polling in the MPI stack; sometimes a mis-set environment variable causes aggressive polling.
Fix: Profile with perf and strace summaries; adjust MPI progress settings; reduce lock sharing; revisit thread pinning.
6) Symptom: high iowait on compute nodes during “compute” phase
Root cause: Hidden I/O: paging, logging, metadata calls, or reading shared config files repeatedly.
Fix: Eliminate swap, buffer logs, cache configs in memory, and reduce filesystem calls in the hot loop.
7) Symptom: network looks fine, but MPI time is huge
Root cause: Collective algorithm mismatch, congestion due to topology, or message sizes triggering unfavorable protocols.
Fix: Use MPI profiling to find the call; try alternative collective algorithms (if your MPI supports it); use topology-aware job allocation; reduce all-to-all frequency.
Checklists / step-by-step plan
Checklist A: When a big job scales badly (first hour)
- Get two data points: a “good scale” run and a “bad scale” run with the same input and build.
- Confirm node placement and whether allocation is fragmented.
- Compute time breakdown: compute vs MPI vs I/O vs “other.” If you can’t break it down, add minimal instrumentation.
- Find the slowest ranks and map them to hosts.
- Check those hosts for throttling, swap, NIC errors, filesystem reconnects.
- Look for metadata storms: counts of open/stat/unlink, per-rank logs, temp files.
- Check shared services dependencies: DNS/LDAP, container pulls, license servers.
- Make one change at a time. Scaling bugs love confounding variables.
Checklist B: Storage sanity for HPC workloads
- Measure file count and directory layout before running at scale (see the command sketch after this checklist).
- Match stripe count to file size and access pattern (streaming vs small random I/O).
- Avoid per-rank file creation in shared directories.
- Write fewer, larger files; batch metadata operations.
- Stagger checkpoints; do not let 10,000 ranks fsync at once unless you enjoy chaos.
- Monitor MDT ops and OST utilization, not just “overall bandwidth.”
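A command sketch for the first and last items in this checklist. The path is an assumption; lfs find is much cheaper than plain find on Lustre:

# How many files will this run actually touch or create?
lfs find /lustre/project/run42 -type f | wc -l
# What default layout will new files inherit in this directory?
lfs getstripe -d /lustre/project/run42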
Checklist C: Network and MPI sanity
- Measure time in MPI calls (not just total runtime).
- Identify collective hotspots (Allreduce, Alltoall) and their frequency.
- Check fabric counters for congestion and errors; isolate to nodes vs systemic.
- Ensure rank/thread pinning is correct; NUMA mistakes can masquerade as network problems.
- Use topology-aware placement for large jobs; avoid spanning islands unless you must.
Checklist D: Boring operations that prevent “mystery slowness”
- Drain nodes with recurring correctable errors or NIC warnings, not just hard failures.
- Keep firmware and drivers consistent across the fleet; heterogeneity breeds heisenbugs.
- Set and enforce sane defaults for environment modules (thread counts, pinning, I/O libs).
- Run regular, automated microbenchmarks for network and filesystem to establish baselines.
FAQ
1) Why does performance get worse when I add nodes?
Because you’re paying increasing coordination cost (communication, synchronization, metadata contention) that outgrows your compute savings.
Strong scaling has a limit; find it and stop pretending it doesn’t exist.
2) How do I know if I’m network-bound or just imbalanced?
If MPI time is high and a few ranks are consistently slower, it’s often imbalance or a slow node.
If all ranks spend similar time in collectives and fabric counters show congestion, it’s the network/topology.
3) Is the parallel filesystem “slow,” or is my application doing something silly?
Count metadata operations and small I/O. If you create millions of files, repeatedly stat the same paths,
or write tiny chunks with fsync, you’ll make any filesystem look slow.
4) What’s the fastest way to detect a metadata storm?
Look for massive open/stat/getattr counts and user complaints about startup/finalization delays.
On Lustre, client stats and MDT md_stats are your early warning system.
5) Why do MPI collectives become a cliff at scale?
Many collectives have costs that grow with rank count, message size, and topology.
They’re also sensitive to tail latency: one slow participant slows the whole operation.
6) Should we always increase Lustre striping to get more bandwidth?
No. Striping widely can help large sequential I/O, but it can backfire for many medium files, mixed access patterns,
or when it amplifies contention and tail latency. Measure and match to the workload.
7) What’s the most common “silly bug” you see in HPC applications?
Per-rank logging and per-rank temp files in a shared directory, especially at startup and shutdown.
It’s the operational equivalent of everyone trying to exit through a single door.
8) How do you deal with “only fails at scale” issues?
Treat it like an SRE incident: reproduce at two sizes, isolate phases, identify outlier ranks/hosts, and correlate with system counters.
Then remove one variable at a time until the cliff disappears.
9) What’s more important: peak performance or predictability?
In shared production systems, predictability wins. A slightly slower but stable job saves more node-hours than a “fast” job
that occasionally detonates into retries and resubmissions.
Conclusion: practical next steps
Supercomputers don’t fail in exotic ways. They fail in scaled-up versions of everyday mistakes:
a serialized path, a storm of tiny filesystem operations, one sick node dragging a collective, a “temporary” debug feature
that becomes a permanent tax.
Next steps that actually move the needle:
- Instrument your application with per-phase and per-rank timing so you can see imbalance and waiting.
- Adopt the fast diagnosis playbook and practice it on non-incident days, when you can think.
- Fix file patterns before you fix filesystems: fewer files, fewer stats, fewer fsyncs, smarter aggregation.
- Make topology and placement explicit for large jobs; don’t leave it to scheduler roulette.
- Drain suspicious nodes aggressively; one flaky NIC can turn “scaling” into “suffering.”
Scaling problems to the moon is impressive. Doing it with boring correctness is how you keep the lights on.