Some performance problems feel personal. You bought a high-core-count Ryzen or EPYC, you fed it fast NVMe, you gave it enough RAM to shame yesterday’s cluster—and it still behaves like it’s dragging a piano up stairs. The graphs show CPU “available,” disks “fine,” network “fine,” yet the workload is stubbornly slow and jittery.
That’s usually when Infinity Fabric enters the chat. Not as a component you can point at with a screwdriver, but as the interconnect that decides whether your chiplets cooperate like a well-run incident response—or argue like a change review at 4:55 PM.
What Infinity Fabric really is (and what it is not)
Infinity Fabric is AMD’s scalable interconnect architecture—the plumbing that moves data and coherence traffic between CPU cores, caches, memory controllers, and I/O. On modern Ryzen and EPYC, it’s the glue that makes the chiplet approach behave like a single logical CPU (most of the time).
If you came from the monolithic-die era, think of it like this: the “CPU” is now a small campus. You have multiple buildings (core chiplets), a central services building (I/O die or memory controllers depending on generation), and a network in between. Infinity Fabric is that internal network—switches, links, protocols, arbitration, and timing. When it’s happy, you don’t notice it. When it’s not, your latency spikes and your throughput gets oddly uneven.
What it’s not: a single clock, a single bus, or a magic performance knob you set to “fast.” It’s a layered system with different domains: core-to-core, core-to-memory, core-to-I/O, and coherency traffic. The most ops-relevant part is that certain latency paths traverse it, and its effective speed often depends on how you configure memory, clocks, NUMA policy, and BIOS power settings.
Why ops and performance engineers should care
Because most production workloads aren’t “CPU-bound” in the way benchmark charts pretend. They’re latency-bound. They’re cache-miss sensitive. They’re cross-thread chatty. They’re NUMA-sensitive. They’re a pile of microservices making each other’s lives complicated. And Infinity Fabric is part of the latency story whenever data has to move between chiplets, memory regions, or I/O paths.
This is where teams get fooled: average CPU utilization looks fine, but tail latency gets ugly. Or a database node performs great on one socket and weirdly worse on another “identical” node. Or an all-NVMe storage server gets less throughput than the spec sheet suggests. The fabric doesn’t get a graph in most dashboards, but it absolutely gets a vote.
Joke #1 (short, and painfully accurate): Infinity Fabric is like office Wi‑Fi—no one budgets for it, everyone blames it, and somehow it’s always involved.
Facts and history that actually matter in 2026
Here are concrete bits of context that change how you troubleshoot:
- Infinity Fabric arrived as a unifying interconnect around the Zen era to scale cores and I/O without monolithic dies. That architectural shift is why “same CPU family” can still behave very differently across generations and SKUs.
- Chiplets made interconnect latency a first-class performance factor. In monolithic dies, cross-core latency was mostly “on-die.” With chiplets, some traffic now has to hop across the fabric, and your workload either tolerates that or doesn’t.
- On many Ryzen generations, fabric clock (FCLK) and memory clock (MCLK) were coupled for best latency. Decoupling can allow higher memory frequency but adds latency in ways that punish tail-sensitive workloads.
- EPYC scaled by adding CCDs and a large I/O die. That improves yield and core count, but it also creates topology: some cores are “closer” to some memory and I/O than others.
- NUMA is not optional on multi-CCD EPYC systems. You can pretend it’s UMA, but the hardware won’t play along. The Linux scheduler will try, but it can’t read your mind—or your cache locality.
- BIOS defaults often optimize for power/thermals, not deterministic latency. If you run databases, trading systems, or storage targets, “energy efficient” can quietly mean “jittery.”
- Virtualization magnifies topology mistakes. A VM that spans NUMA nodes can turn a decent CPU into a remote-memory generator with a side hustle in packet drops.
- Firmware and AGESA revisions have historically changed memory/fabric behavior—sometimes improving stability, sometimes shifting performance characteristics. “Same hardware” before and after firmware updates isn’t always the same system.
A mental model: latency budgets and traffic patterns
When people say “Infinity Fabric bottleneck,” they often mean one of three things:
- Added latency: requests cross chiplet boundaries or NUMA domains more often than expected, extending the critical path for cache misses, locks, and IPC-heavy code.
- Limited bandwidth: many cores generate enough memory traffic that the interconnect and memory controllers saturate, causing contention and queuing delays.
- Jitter: power states, clock changes, or contention create inconsistent service times; averages look fine, p99 looks like a crime scene.
In practice, most incidents are a blend: a workload becomes remote-memory heavy (latency) and experiences interconnect contention (bandwidth) and gets tail spikes due to scheduler migration (jitter). Your job is to identify which dominates and pick the lowest-risk fix.
Think in terms of traffic patterns:
- Chatty multi-threaded apps (shared locks, shared queues, GC pauses) suffer when threads bounce across CCDs/NUMA nodes.
- In-memory databases care about memory latency and predictable access. Remote memory turns “fast RAM” into “slower RAM with extra steps.”
- Storage targets care about I/O and interrupt locality. Bad affinity makes your NVMe interrupts run on cores far from the PCIe root, adding latency and stealing cache.
- Virtualization hosts care about placement. A single host can run perfectly until one “noisy neighbor” VM spreads across nodes and becomes a fabric stress test.
FCLK/UCLK/MCLK and the “sync tax”
On many AMD platforms you’ll encounter three clocks that matter for memory and fabric behavior:
- MCLK: memory clock (half the DDR data rate, since DDR transfers twice per clock).
- UCLK: memory controller clock.
- FCLK: fabric clock.
Depending on generation and BIOS options, these can run 1:1 or in ratios (often 1:2) once you push memory frequency high. The trap is that higher DDR data rates can look like “more bandwidth,” while the ratio change adds latency that punishes the exact workloads you care about in production.
Here’s the practical ops translation:
- If your workload is latency-sensitive (databases, caches, RPC services), you usually prefer a stable configuration that keeps fabric/memory clocks in a low-latency relationship, even if peak bandwidth is slightly lower.
- If your workload is streaming/bandwidth-bound (some analytics, large sequential scans), pushing bandwidth might help—until you hit contention elsewhere.
The “sync tax” shows up as worse memory access latency, not necessarily lower measured bandwidth. It’s why you can “optimize” DDR speed and then wonder why p99 got worse. You didn’t make memory faster. You made the system less predictable.
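A rough worked example makes the trade concrete (illustrative numbers for a DDR4-era part; exact ratios, limits, and naming vary by generation and BIOS):

DDR4-3600 -> 3600 MT/s -> MCLK = 3600 / 2 = 1800 MHz
  Coupled (1:1):  FCLK = UCLK = MCLK = 1800 MHz            (the low-latency relationship)
  Pushed to DDR4-4000 in 2:1 mode: MCLK = 2000 MHz, UCLK = 1000 MHz
                                                           (more bandwidth on paper, extra
                                                            latency on every memory access)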
NUMA, chiplets, CCD/CCX: where your cycles go to commute
On EPYC especially, topology is destiny. Cores are grouped into CCDs (Core Complex Dies), and those connect to an I/O die that houses memory controllers and PCIe. Each CCD has its own L3 cache; cross-CCD access is inherently more expensive than staying local.
NUMA exposes this as multiple nodes. Even on a single physical socket, you might have multiple NUMA nodes depending on BIOS settings (NPS modes, i.e. how many NUMA nodes per socket, plus options that map CCDs to nodes). Linux then tries to schedule threads and allocate memory to reduce remote access. The word is “tries.” Your workload, your pinning, and your cgroup policies can undo those attempts in seconds.
Common patterns that bite:
- Thread migration: threads bounce across cores/NUMA nodes due to scheduler decisions, stealing cache locality and increasing remote memory accesses.
- Memory allocated on the wrong node: a process starts on one node, allocates memory there, then gets scheduled elsewhere. Now every memory access is a fabric trip.
- Interrupts on the wrong cores: NIC/NVMe interrupts handled on distant cores cause higher I/O latency and wasted CPU cycles.
PCIe and I/O: the other half of the story
Infinity Fabric isn’t just “CPU-to-RAM.” It’s also part of how I/O is serviced. On many systems, PCIe devices hang off specific root complexes associated with certain NUMA nodes (or at least certain locality domains). If your storage interrupt handling and your storage processing happen “far” from the PCIe path, you pay in extra latency and CPU overhead.
This is where storage engineering meets CPU topology. Your NVMe RAID, ZFS, SPDK target, or Ceph OSD might be technically “fast,” but if its hottest threads are scheduled on remote cores and its memory allocations land on the wrong NUMA node, you’re effectively building a low-latency storage system and then routing it through the scenic route.
Practical tasks: 12+ commands, what they mean, what you decide
These are real tasks you can run on a Linux host. They won’t magically print “Infinity Fabric is sad,” but they will tell you if topology, memory locality, clocks, or interrupts are undermining you.
Task 1: Identify CPU model and basic topology
cr0x@server:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 128
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 4
Model name: AMD EPYC 7xx3 64-Core Processor
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
What it means: You have four NUMA nodes on one socket. That’s already your warning label: memory locality matters.
Decision: If you run latency-sensitive workloads, plan to pin processes/VMs per NUMA node and align memory allocations. If you assumed UMA, stop.
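A minimal sketch of that kind of pinning at launch time, assuming a hypothetical service binary and that node 0 actually has the CPU and memory headroom to hold it:
cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 /usr/local/bin/latency-service --config /etc/latency-service.conf
--cpunodebind keeps the threads on node 0’s CPUs; --membind allocates only from node 0’s memory and fails allocations rather than silently spilling to remote nodes, so size it honestly.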
Task 2: Show detailed NUMA distances
cr0x@server:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128798 MB
node 0 free: 97211 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 128829 MB
node 1 free: 99102 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 128813 MB
node 2 free: 100118 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 128820 MB
node 3 free: 99654 MB
node distances:
node 0 1 2 3
0: 10 16 16 16
1: 16 10 16 16
2: 16 16 10 16
3: 16 16 16 10
What it means: Local node distance is 10, remote is 16. That’s a meaningful latency delta for hot memory.
Decision: For databases and storage targets, treat each NUMA node like a “mini-socket.” Keep hot threads and their memory on the same node whenever possible.
Task 3: Visualize topology with hwloc
cr0x@server:~$ lstopo-no-graphics --no-io
Machine (512GB total)
NUMANode L#0 (P#0 128GB)
Package L#0
L3 L#0 (32MB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
...
NUMANode L#1 (P#1 128GB)
Package L#0
L3 L#1 (32MB)
...
What it means: You can see how cores map to L3 slices and NUMA nodes. This is your placement blueprint.
Decision: Use this to choose CPU sets for services and to interpret “why did that thread migrate?” questions later.
Task 4: Check kernel NUMA balancing status
cr0x@server:~$ cat /proc/sys/kernel/numa_balancing
1
What it means: Automatic NUMA balancing is enabled. It can help generic workloads, and it can also create page migrations (jitter) for latency-critical services.
Decision: For latency-sensitive pinned workloads, consider disabling system-wide and managing placement explicitly—or at least test both. Don’t guess.
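If the canary test says the pinned-and-local configuration wins, the change itself is small (runtime setting plus a persistent sysctl drop-in; the filename is arbitrary):
cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0
cr0x@server:~$ echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/90-numa-balancing.conf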
Task 5: Observe NUMA placement and memory policy for a running process
cr0x@server:~$ pidof postgres
2147
cr0x@server:~$ taskset -cp 2147
pid 2147's current affinity list: 0-15
cr0x@server:~$ grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/2147/status
Cpus_allowed_list:   0-15
Mems_allowed_list:   0
What it means: This process is effectively confined to node 0 for CPU and memory. That’s good—if the workload fits in node 0’s memory bandwidth and cache.
Decision: If the process is large and bandwidth-bound, you might scale by sharding across nodes. If it’s latency-bound, keep it tight and local.
Task 6: Measure remote vs local memory access tendencies with numastat
cr0x@server:~$ numastat -p 2147
Per-node process memory usage (in MBs) for PID 2147 (postgres)
Node 0 56321.4
Node 1 112.7
Node 2 95.3
Node 3 88.9
Total 56618.3
What it means: The memory is mostly on node 0. If the threads also run on node 0, you’re in good shape. If not, you’re paying fabric latency.
Decision: If you see substantial memory on non-local nodes, fix pinning or startup placement. For services: start them under numactl or systemd CPU/memory policies.
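One declarative way to do that is a systemd drop-in; a minimal sketch, assuming systemd v243+ (for the NUMA directives), node 0 owning CPUs 0-31 as on this host, and a hypothetical unit name:
cr0x@server:~$ sudo systemctl edit postgres.service
# In the editor that opens, add:
[Service]
CPUAffinity=0-31
NUMAPolicy=bind
NUMAMask=0
A restart is required for the policy to apply, and a bind policy means the working set must genuinely fit node 0’s memory, so check numastat before and after.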
Task 7: Check for excessive page migrations (a jitter source)
cr0x@server:~$ grep -E 'pgmigrate|numa' /proc/vmstat | head
numa_pte_updates 1829401
numa_huge_pte_updates 0
numa_hint_faults 219884
numa_hint_faults_local 171102
numa_pages_migrated 48211
pgmigrate_success 47998
pgmigrate_fail 213
What it means: Pages are being migrated. Some migration is normal with balancing enabled; a lot of it under load is a sign your scheduler and memory policy are fighting your workload.
Decision: If p99 latency correlates with migration spikes, reduce migration: pin threads, allocate memory locally, consider disabling auto NUMA balancing for that host class.
Task 8: Verify CPU frequency governor (latency vs power)
cr0x@server:~$ cpupower frequency-info | sed -n '1,18p'
analyzing CPU 0:
driver: amd-pstate-epp
CPUs which run at the same hardware frequency: 0
hardware limits: 1.50 GHz - 3.70 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 1.50 GHz and 3.70 GHz.
The governor "powersave" may decide which speed to use
current CPU frequency: 1.74 GHz
What it means: You’re in a “powersave” policy. That’s fine for batch compute, and often bad for tail latency and I/O responsiveness.
Decision: For latency-critical nodes, set governor to performance (and validate thermals). Make it a role-based policy, not a one-off hack.
Task 9: Set performance governor (controlled change)
cr0x@server:~$ sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
What it means: You’ve reduced frequency scaling variability.
Decision: Re-test p95/p99 latency and throughput. If it improves, bake into the build for that class of systems.
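To make it role-based rather than a hand-run command, a tiny oneshot unit (or tuned, or your config management) can own the setting; a sketch with an arbitrary unit name such as /etc/systemd/system/cpu-governor-performance.service (verify the cpupower path on your distro):
[Unit]
Description=Force the performance cpufreq governor (latency-critical role)

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
Enable it in the image for that node class so the policy survives reboots and reimages.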
Task 10: Check PCIe device NUMA locality (critical for NVMe/NIC)
cr0x@server:~$ lspci -nn | grep -E 'Non-Volatile|Ethernet' | head -n 3
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
41:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
81:00.0 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:1017]
cr0x@server:~$ cat /sys/bus/pci/devices/0000:01:00.0/numa_node
0
What it means: That NVMe device is local to NUMA node 0 (or at least the kernel believes so). If your storage threads run on node 3, you’ve invented remote I/O.
Decision: Align IRQ handling and worker threads with the device’s NUMA node. For multi-device servers, consider per-node queues/workers.
Task 11: Inspect IRQ distribution (interrupt locality)
cr0x@server:~$ grep -E 'nvme|mlx5' /proc/interrupts | head -n 6
47: 12841 11993 12110 11888 PCI-MSI 1048576-edge nvme0q0
48: 42112 39877 40655 39221 PCI-MSI 1048577-edge nvme0q1
92: 11822 12031 11760 11911 PCI-MSI 524288-edge mlx5_comp0
93: 12203 12188 12002 12244 PCI-MSI 524289-edge mlx5_comp1
What it means: IRQs are landing on a small set of CPUs (here the first few). That may be fine; it may be terrible if those CPUs aren’t local to the device or if they’re busy.
Decision: If you see IRQ hotspots or wrong-node handling, change affinity or use irqbalance with NUMA-aware configuration. Then validate latency.
Task 12: Pin an IRQ to CPUs local to a NUMA node (surgical)
cr0x@server:~$ cat /proc/irq/47/smp_affinity_list
0-3
cr0x@server:~$ sudo sh -c 'echo 0-31 > /proc/irq/47/smp_affinity_list'
cr0x@server:~$ cat /proc/irq/47/smp_affinity_list
0-31
What it means: You widened the IRQ’s CPU target set to CPUs 0–31 (often node 0). This improves locality if the device is on node 0.
Decision: Re-test I/O latency under load. If improved, codify via udev/systemd scripts (carefully) or tune irqbalance policy rather than manual writes.
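“Codify” can be as small as a boot-time script; a sketch, assuming the controller behind nvme0 lives on NUMA node 0 and its vectors show up as nvme0* in /proc/interrupts:
#!/bin/bash
# Pin all nvme0 interrupt vectors to the CPUs of NUMA node 0.
node_cpus=$(cat /sys/devices/system/node/node0/cpulist)   # e.g. 0-31
for irq in $(awk -F: '/nvme0/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    echo "$node_cpus" > "/proc/irq/${irq}/smp_affinity_list" 2>/dev/null || true
done
Kernel-managed IRQs (common with multi-queue NVMe) may reject the write; that usually means the kernel is already spreading those queues per CPU, so measure before you fight it.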
Task 13: Check memory latency signals with perf (watch for stalled cycles)
cr0x@server:~$ sudo perf stat -e cycles,instructions,cache-misses,LLC-load-misses -p 2147 -- sleep 10
Performance counter stats for process id '2147':
18,421,334,112 cycles
12,102,884,551 instructions # 0.66 insn per cycle
221,433,112 cache-misses
98,331,220 LLC-load-misses
10.001112123 seconds time elapsed
What it means: Low IPC with high last-level cache misses can indicate memory latency pressure—often amplified by remote memory access patterns on fabric-heavy topologies.
Decision: If IPC tanks under load and you suspect remote memory, validate with NUMA stats and placement. Don’t jump straight to “CPU upgrade.”
Task 14: Confirm Transparent Huge Pages status (can interact with migration/latency)
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
What it means: THP is always enabled. Sometimes that’s good; sometimes it causes latency spikes due to defrag or allocation behavior, especially with mixed workloads.
Decision: For databases with known best practices, follow them. If you don’t know, test under realistic load; don’t cargo-cult “disable THP” or “always THP.”
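If testing (or your database vendor’s guidance) lands on madvise, the runtime switch is a single write; persist it with the transparent_hugepage=madvise kernel parameter or a boot-time unit:
cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
The sibling knob /sys/kernel/mm/transparent_hugepage/defrag controls how hard the kernel works to assemble huge pages and is often the real source of the latency spikes, so test it alongside.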
Task 15: Check if your process is bouncing across CPUs (migration)
cr0x@server:~$ pidstat -t -p 2147 1 5
Linux 6.8.0 (server) 01/10/2026 _x86_64_ (128 CPU)
12:00:01 PM UID TGID TID %usr %system %CPU CPU Command
12:00:02 PM 26 2147 2147 8.00 2.00 10.00 4 postgres
12:00:02 PM 26 2147 2161 6.00 1.00 7.00 37 postgres
12:00:02 PM 26 2147 2162 5.00 1.00 6.00 92 postgres
What it means: Threads are running on CPUs 4, 37, 92—likely different NUMA nodes. That’s not automatically wrong, but it’s a red flag for a latency-sensitive DB.
Decision: If performance is inconsistent, constrain the DB to a node (or a set of nodes with explicit sharding) and retest.
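On cgroup v2 with a reasonably recent systemd, you can constrain an already-running unit without wrappers; a sketch with a hypothetical unit name and node 0’s CPUs:
cr0x@server:~$ sudo systemctl set-property --runtime postgresql.service AllowedCPUs=0-31 AllowedMemoryNodes=0
The --runtime flag keeps it an experiment (it vanishes on reboot); drop the flag or move it into the unit once the numbers prove it out. Existing pages don’t migrate, so restart the service afterwards to get clean memory placement.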
Task 16: Validate memory bandwidth pressure with pcm-like signals (fallback: vmstat)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
12 0 0 987654 12345 456789 0 0 2 18 9000 22000 55 10 33 2 0
18 0 0 982110 12345 456792 0 0 0 12 9800 26000 62 12 24 2 0
What it means: High context switches and runnable threads can indicate contention (locks, scheduling) that becomes worse when threads span NUMA nodes. It’s not definitive, but it’s a clue.
Decision: Pair this with NUMA placement checks. If contention coincides with cross-node spread, tighten affinity.
Fast diagnosis playbook
This is the “pager is ringing” version. You don’t have time to become a microarchitect; you do have time to stop the bleeding.
First: Prove whether it’s locality/topology
- Check NUMA node count and mapping (lscpu, numactl --hardware). If there is more than one NUMA node, assume locality matters until proven otherwise.
- Check process CPU spread (pidstat -t, ps -o psr). If hot threads run across distant nodes, suspect fabric-driven latency.
- Check memory placement (numastat -p). If memory is mostly on one node but threads run elsewhere, you found a likely culprit; a quick combined check is sketched right after this list.
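A quick way to check all three for one process, using the PID from the earlier tasks (2147) and standard procps/numactl tooling:
cr0x@server:~$ ps -L -o pid,tid,psr,comm -p 2147 | head    # psr = CPU each thread last ran on
cr0x@server:~$ numastat -p 2147                            # which nodes hold its memory
If the psr column spans multiple NUMA nodes while numastat shows memory piled on one node, you have your first suspect.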
Second: Look for jitter sources that amplify fabric costs
- CPU governor/p-states (cpupower frequency-info). Finding “powersave” on latency-critical nodes is an easy win: switch to performance, with thermal awareness.
- NUMA migrations (/proc/vmstat migration counters and hint faults). High migration under load often correlates with tail spikes.
- THP/defrag behavior (THP status, workload-specific guidance). It’s not always fabric-related, but it couples with latency.
Third: Validate I/O locality (storage and network)
- Device NUMA node (/sys/bus/pci/devices/.../numa_node).
- IRQ CPU distribution (/proc/interrupts, affinity settings). Wrong-node IRQ handling is the classic “why is NVMe slow” footgun.
- Queue depth / interrupt moderation (device-specific, but start by confirming you’re not bottlenecked by one busy core handling everything).
If you do only one thing in the first hour: make the workload local—CPU, memory, and I/O interrupts aligned—and retest. It’s the most common “fabric-shaped” failure mode and the fastest to correct.
Common mistakes: symptom → root cause → fix
1) Symptom: Great average throughput, terrible p99 latency
Root cause: Thread migration and remote memory access across NUMA nodes; fabric adds latency and increases variance.
Fix: Pin hot threads to a NUMA node, bind memory locally (systemd CPU/NUMA policies or numactl), and reduce migrations (tune/disable auto NUMA balancing for that class).
2) Symptom: “Upgraded RAM speed” and performance got worse
Root cause: FCLK/UCLK/MCLK decoupling or ratio change increased memory latency; bandwidth may have improved but critical-path latency got punished.
Fix: Prefer stable low-latency memory profiles for latency workloads. Validate with real workload p99 metrics, not synthetic bandwidth alone.
3) Symptom: NVMe array benchmarks fine, production I/O latency is spiky
Root cause: IRQs handled on non-local CPUs, or storage threads scheduled far from the PCIe root complex; fabric hop adds latency and cache misses.
Fix: Align IRQ affinity and worker threads with device locality; consider per-NUMA-node queueing models. Re-test under realistic concurrency.
4) Symptom: VM performance inconsistent across hosts “with same CPU”
Root cause: Different BIOS NUMA partitioning (NPS settings), memory interleaving, or firmware changes affecting topology and fabric behavior.
Fix: Standardize BIOS profiles and firmware; capture topology via lscpu/lstopo in provisioning; enforce NUMA-aware VM sizing and pinning.
5) Symptom: CPU utilization low, but the service is slow
Root cause: Memory latency stalls (remote memory, cache misses), lock contention amplified by cross-node scheduling, or I/O interrupt overhead on wrong cores.
Fix: Use perf stat and NUMA stats to confirm stall behavior; adjust placement; reduce cross-node sharing; fix IRQ locality.
6) Symptom: “It’s only one socket, so NUMA can’t matter”
Root cause: Chiplet topologies create NUMA-like behavior inside one socket; Linux exposes it as NUMA nodes for a reason.
Fix: Treat single-socket multi-CCD systems as NUMA systems. Put locality in your runbooks and capacity planning.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a “simple” storage gateway: a few EPYC servers, a fast network, NVMe cache, and a userspace I/O stack. In staging it flew. In production it jittered—tail latency would spike during peak traffic, and the on-call would stare at clean disks and clean network graphs like they were being gaslit by Grafana.
The wrong assumption was subtle: “single socket equals uniform memory.” The team treated the box as a flat pool of cores and RAM. They let the kernel schedule freely, and they let the I/O threads float because “Linux will do the right thing.” Linux did a reasonable thing. The workload needed a specific thing.
The PCIe NVMe devices were local to one NUMA node, but the busiest completion threads were often running on another. Every I/O completion included a fabric hop, plus cache misses because the hottest data structures lived in another L3 region. Under low load, it didn’t matter. Under high concurrency, it mattered a lot.
They fixed it with three changes: (1) pin I/O threads to CPUs local to the NVMe controllers, (2) bind memory allocation for those threads to the same node, (3) set IRQ affinity to stop completions from landing on random CPUs. Latency stabilized, throughput increased, and the on-call stopped seeing phantom performance issues that only appeared after lunch.
Mini-story 2: The optimization that backfired
A platform team wanted “free performance” on a fleet of application servers. The plan: increase memory frequency in BIOS. The vendor tool showed stable operation, the burn-in passed, and the graphs in a synthetic benchmark looked great. They rolled it out in waves because they were not reckless—just optimistic.
Then the incident: p99 API latency went up, not down. Not across the board. Just certain services, mostly those that used a lot of small in-memory structures and did frequent cross-thread handoffs. CPU utilization dropped slightly (which looked “good” if you didn’t know better), but response times got worse.
The postmortem was uncomfortable: the new memory settings pushed the system into a different clock relationship that increased effective memory access latency. Bandwidth improved. Latency worsened. The workloads were latency-bound and highly sensitive to remote memory patterns, so the fabric-related delay became visible in tail metrics.
They rolled back the memory “upgrade” and re-ran tests with two profiles: one tuned for bandwidth-heavy batch jobs, one tuned for low-latency services. They ended up standardizing profiles by role. Performance tuning stopped being a universal “make number bigger” exercise and became what it always should have been: workload-specific engineering.
Mini-story 3: The boring but correct practice that saved the day
A finance company ran a latency-sensitive service on EPYC. They weren’t doing anything exotic. What they did have was discipline: every server role had a baseline BIOS config, a baseline kernel config, and a tiny “topology report” captured during provisioning and stored with the asset record.
One morning after a routine maintenance window, a subset of nodes showed increased tail latency. Not catastrophic, but enough to trigger alerts. The on-call compared two topology reports: the NUMA layout had changed. Same model number, same RAM count, but a BIOS setting had shifted NUMA partitioning. The scheduler behavior changed, the memory locality changed, and the fabric started doing extra work.
They reverted the BIOS profile to the known-good baseline, rebooted the affected nodes, and the problem disappeared. No heroics. No “kernel deep dive.” Just configuration control and observability.
Joke #2: The best performance fix is sometimes a spreadsheet. Don’t tell your developers; they’ll start requesting pivot tables.
Checklists / step-by-step plan
Checklist: Before you tune anything
- Capture lscpu, numactl --hardware, and lstopo-no-graphics output for the node class.
- Confirm BIOS/firmware versions are consistent across the fleet for comparable nodes.
- Define success metrics: p50/p95/p99 latency, throughput, error rate, CPU cost per request, and (for storage) IOPS at given latency.
- Run a workload-representative test. Synthetic microbenchmarks are fine for clues; they are not your acceptance test.
Step-by-step: Make a service NUMA-sane (lowest-risk path)
- Pick a NUMA node target based on available CPUs/memory and device locality (NVMe/NIC).
- Pin the service threads to CPUs in that node (systemd CPUAffinity= or a wrapper).
- Bind memory allocation to that node (numactl --membind or systemd NUMA policies if used).
- Fix IRQ locality so the device interrupts land on local CPUs.
- Retest under load, compare tail latency and CPU cycles per request.
- Only then touch memory clocks, boost, or power-state knobs—because those can improve averages while quietly ruining tails.
Step-by-step: Diagnose a “fabric-ish” performance regression after a change
- Confirm what changed: BIOS profile, firmware, kernel version, microcode, memory settings, VM placement rules.
- Compare topology outputs (NUMA nodes, CPU mapping) pre/post change.
- Check migrations and scheduling spread (pidstat -t, /proc/vmstat migration counters, numastat -p).
- Check CPU governor and p-state policy; restore the previous policy if needed.
- Check IRQ distribution and device NUMA node; restore affinity if it drifted.
- If you must roll back, roll back fast. If you must keep the change, implement locality controls to compensate.
FAQ
1) Is Infinity Fabric “the same thing” as NUMA?
No. NUMA is a software-visible model: memory access time depends on which node owns the memory. Infinity Fabric is the hardware interconnect that often makes those differences real on AMD chiplet systems.
2) Why do I see multiple NUMA nodes on a single-socket EPYC server?
Because the socket contains multiple chiplets and locality domains. The kernel exposes this so the scheduler and memory allocator can make better decisions. Ignoring it is allowed, but it’s not free.
3) Should I disable kernel automatic NUMA balancing?
Sometimes. For generic mixed workloads, it can help. For pinned, latency-sensitive services, it can introduce page migration overhead and jitter. Test both ways on a canary host with production-like load.
4) Does faster DDR always improve performance on Ryzen/EPYC?
No. Faster DDR can improve bandwidth, but if it changes clock ratios or increases effective latency, some workloads get worse—especially those sensitive to tail latency and cache-miss paths.
5) How do I know if remote memory is hurting me?
Look for a mismatch: threads running on one node while memory allocations live on others (pidstat + numastat -p). Also watch for elevated page migrations and low IPC with high LLC misses.
6) Can virtualization hide Infinity Fabric effects?
It can hide the cause and amplify the pain. If a VM spans NUMA nodes, it can incur remote memory access frequently. Proper vNUMA exposure, pinning, and sizing matter on EPYC.
7) Is this only relevant for CPU-bound workloads?
It’s often more relevant for I/O and mixed workloads because interrupt locality and memory access patterns dominate tail behavior. CPU “idle” time doesn’t mean “fast.”
8) What’s the single most effective operational control?
Topology-aware placement: keep hot threads, their memory, and their device interrupts in the same locality domain. It reduces latency, jitter, and wasted CPU.
9) Should I always pin everything?
No. Over-pinning can cause uneven load, starvation, and poor utilization. Pin the things that are latency-sensitive or that own I/O paths. Leave batch jobs and background work more flexible.
10) What’s a good reliability mindset for this kind of tuning?
Use controlled experiments, canaries, and role-based profiles. If you can’t roll it back quickly, you’re not “tuning,” you’re gambling.
Next steps you can do this week
Here’s the practical path that won’t wreck your fleet:
- Inventory topology across node types: store lscpu/numactl/lstopo outputs with the asset record (a minimal capture script is sketched after this list).
- Pick one latency-sensitive service and make it NUMA-local (CPU + memory + IRQ locality). Measure p99 and CPU cost.
- Standardize a BIOS and kernel policy by role (latency vs throughput vs batch). “One profile fits all” is how you get surprises.
- Add two dashboards: (a) remote memory and migration indicators (/proc/vmstat NUMA counters, numastat), (b) IRQ CPU distribution and top consumers.
- Write a runbook with the fast diagnosis steps above. The goal is not to worship Infinity Fabric; the goal is to stop being surprised by topology.
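A minimal capture script for the inventory item above (paths are just an example; store the output wherever your asset records live):
#!/bin/bash
# Snapshot CPU/NUMA topology for this host and keep it with the asset record.
out="/var/lib/topology/$(hostname)-$(date +%F).txt"
mkdir -p "$(dirname "$out")"
{
  echo "== lscpu ==";           lscpu
  echo; echo "== numactl ==";   numactl --hardware
  echo; echo "== lstopo ==";    lstopo-no-graphics --no-io
} > "$out"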
One quote worth keeping on the wall, because it fits this entire topic: “Hope is not a strategy.” — General Gordon R. Sullivan