You buy a server expecting it to behave like a reliable appliance: predictable latency, boring graphs, no surprises at 2 a.m.
Then you rack an AMD EPYC box, turn on a few dozen VMs, add some NVMe, and suddenly your data center tour sounds like a car salesperson:
“Look at all these cores. Look at the lanes. Look at the memory channels.”
The problem is that showrooms are optimized for wow-factor. Production is optimized for “nothing weird happens.” EPYC can do both,
but only if you understand what it changed in server design—and where the sharp edges live when your workload meets NUMA, PCIe topology,
and a storage stack that never asked for this many parallel queues.
Why EPYC made servers look different
EPYC didn’t just give operators more cores. It changed what a “balanced” server looks like. For a long time, server design was
a set of compromises everyone accepted: not enough PCIe lanes, memory channels that filled up too quickly, and CPUs that forced you
into expensive dual-socket configs just to get I/O.
EPYC arrived and made single-socket servers feel like the grown-up option instead of the budget option. More PCIe lanes meant you
could attach serious NVMe without playing Tetris. More memory channels meant you could feed those cores without immediately starving
on bandwidth. And the chiplet approach let AMD scale core counts without needing a monolithic die the size of your regret.
The showroom effect is real: you can now build a machine that looks absurd on paper—dozens of cores, a wall of NVMe, and enough
memory bandwidth to make old architectures sweat. But the showroom effect has a shadow: if you ignore topology, you’ll create
bottlenecks that don’t show up in simple benchmarks and only appear when everything is noisy at once.
Facts and context that explain the shift
Here are concrete bits of history and context that help explain why EPYC felt like a market reset, not a normal product launch.
These aren’t trivia. They map to design decisions you’ll make in racks and in procurement spreadsheets.
- 2017 was the inflection point: EPYC “Naples” (1st gen) arrived and forced buyers to reconsider single-socket density as a first-class choice.
- Chiplets went mainstream in servers: EPYC’s multi-die design normalized the idea that latency and locality matter more than “one big die” purity.
- Memory channels became a buying criterion again: EPYC’s wide memory configurations pulled “memory bandwidth per dollar” into the boardroom.
- PCIe lane count stopped being a rounding error: Platforms with lots of lanes enabled more NVMe and more NICs without extra CPUs.
- Rome (2nd gen) pushed core counts hard: When core counts climb, software licensing and per-core overhead become operational risks, not just budget line items.
- Milan (3rd gen) cleaned up latency and IPC: It wasn’t just more cores; it improved single-thread behavior enough to matter for databases and control planes.
- Genoa (4th gen) moved the I/O goalposts again: DDR5 and PCIe Gen5 increased the ceiling, but also made layout and cooling decisions more consequential.
- Cloud providers validated it in public: Big deployments signaled “this isn’t niche,” which matters when you’re buying into an ecosystem of firmware, boards, and support.
Architecture you can actually use (chiplets, NUMA, I/O)
Chiplets: the part everyone repeats, and the part that matters
“Chiplets” gets marketed like it’s a lifestyle choice. In production, it’s a topology story. Multiple compute dies (CCDs) hang off an I/O die,
and your threads, memory allocations, and interrupts are now participating in a geography lesson. The CPU is no longer a single neighborhood;
it’s a city with bridges.
That impacts three things you can’t ignore:
- NUMA locality: memory attached to one node isn’t equally fast from another. The penalty varies by generation and configuration, but it’s never zero.
- Cache behavior: larger aggregate cache doesn’t mean your hot data is near your thread. “Near” is now a technical property, not a vibe.
- Jitter under load: when the fabric is busy—lots of I/O, lots of cores, lots of interrupts—tail latency is where you’ll pay.
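A cheap way to see whether that geography lesson is going well in production is numastat, which ships with the numactl package. A minimal sketch; the per-process form needs only a PID (shown as a placeholder):
# System-wide allocation counters since boot. A climbing numa_miss or
# other_node count under load means memory is being served from remote nodes.
cr0x@server:~$ numastat
# Per-process view: which nodes a specific daemon's pages actually live on.
cr0x@server:~$ numastat -p <pid>
Neither command changes anything; they just tell you whether the contract in the next section is being honored.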
NUMA: you can’t “turn it off,” you can only disrespect it
NUMA is not a bug. It’s a contract. If you place memory and compute on the same node, you get better latency and bandwidth. If you don’t, you get
a performance tax. That tax is sometimes cheap enough to ignore. Sometimes it’s your entire incident.
EPYC platforms commonly expose multiple NUMA nodes per socket depending on settings like NPS (NUMA nodes per socket) and how the kernel enumerates
the I/O. You can tune for fewer NUMA domains (simpler scheduling, potentially higher local contention) or more domains (more locality opportunities,
more complexity).
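If you decide a particular daemon must respect a node boundary, the bluntest enforcement is numactl at launch. A minimal sketch, assuming node 0 is where the relevant NIC and NVMe live (verify first; the daemon path and config are hypothetical):
# Confine threads and allocations to node 0. --membind fails allocations rather
# than silently spilling to remote nodes; use --preferred=0 if you'd rather
# degrade under memory pressure than fail.
cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 /usr/local/bin/storage-daemon --config /etc/storage-daemon.conf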
I/O die and lane abundance: the hidden trap
EPYC made I/O look “solved” because you can attach a lot of devices. But “lane count” is not the same as “I/O without contention.”
Switches, bifurcation, firmware defaults, and kernel interrupt routing can still turn a glorious topology into a parking lot.
The other trap is psychological: teams start attaching everything because they can. More NVMe. More NICs. More HBAs “just in case.”
Then the platform becomes a shared-bus experiment where you’re surprised that QoS is hard.
Joke #1: A server with 128 lanes isn’t “future-proof.” It’s “future-temptation-proof,” and most of us fail the test.
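Before you attach everything because you can, it is worth one minute to see which NUMA node each existing device already hangs off. A minimal sketch using only sysfs (a value of -1 means the firmware didn't report locality for that device):
# Map NVMe controllers and NICs to their NUMA nodes.
cr0x@server:~$ for d in /sys/class/nvme/nvme*/device /sys/class/net/*/device; do \
    echo "$d -> numa_node=$(cat "$d/numa_node" 2>/dev/null)"; done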
Virtualization and consolidation: the good, the weird, the fixable
EPYC is a consolidation engine. If you’re running a VM farm with a lot of “medium” VMs, you can pack density without the classic
dual-socket tax. But consolidation is where the showroom effect bites: the more you pack, the more you amplify scheduler and memory
effects you used to get away with.
Core density changes the failure modes
When you go from “a few dozen threads” to “hundreds of runnable threads,” your bottlenecks shift:
- CPU isn’t the limit; memory bandwidth is. Especially for analytics, compression, encryption, and storage stacks with lots of copying.
- Interrupt handling matters again. A fast NIC with poor IRQ affinity can look like a CPU problem while your cores spin on softirqs.
- Lock contention becomes visible. Some software scales to 64 cores, then faceplants at 96 because one global lock turns into a turnstile.
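Each of those three shifts has a cheap first check before you reach for heavier tooling. A minimal sketch; the interpretation is a judgment call, not a fixed threshold:
# 1) Memory bandwidth: IPC falling while Busy% stays high is the classic signature.
cr0x@server:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IPC --interval 5 --num_iterations 3
# 2) Interrupt handling: watch where network and block softirqs actually land.
cr0x@server:~$ watch -d -n1 'grep -E "NET_RX|BLOCK" /proc/softirqs | cut -c1-140'
# 3) Lock contention: if kernel spinlock symbols dominate the profile, one lock is your turnstile.
cr0x@server:~$ sudo perf top --sort symbol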
Licensing and “per-core” realities
Your procurement team will love high core counts right up until they don’t. Some enterprise software still prices per core, per socket,
per vCPU, or with core-factor math that was invented to make you feel bad. EPYC makes it easy to buy more compute than you can license sanely.
Operational advice: treat licensing like a performance constraint. If you can’t afford to use all the cores, don’t pretend you have them.
Set realistic CPU limits, pin where necessary, and avoid building an architecture that depends on “we’ll license it later.”
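One way to turn "we only license 32 cores" into a host-enforced fact instead of a policy document is a systemd drop-in for the service. A minimal sketch, assuming cgroup v2 with the cpuset controller; the unit name my-db.service is hypothetical:
# /etc/systemd/system/my-db.service.d/limits.conf (drop-in for a hypothetical unit)
[Service]
# Keep the service on a fixed CPU set, ideally one NUMA node's worth.
AllowedCPUs=0-31
# Cap total CPU time at 32 cores' worth as well.
CPUQuota=3200%
Apply with sudo systemctl daemon-reload && sudo systemctl restart my-db.service, then confirm with taskset -cp on the main PID.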
When pinning is smart, and when it’s cosplay
CPU pinning and NUMA pinning are powerful tools. They’re also a great way to turn your virtualization platform into a hand-maintained spreadsheet
that breaks every time you add a NIC.
Pinning is worth it when:
- you have latency-sensitive workloads (trading systems, realtime-ish control planes, certain databases)
- you have dedicated hosts for a small number of large VMs
- you can enforce consistent placement and keep host drift under control
Pinning is usually a mistake when:
- the environment is highly dynamic (autoscaling, frequent evacuations)
- the workload is throughput-oriented and tolerant of variance
- the team doesn’t have observability for NUMA misses and interrupt hotspots
Storage and PCIe: lanes are not throughput
The most common EPYC story in storage is simple: “We attached a lot of NVMe and expected linear scaling.” Sometimes you get it.
Often you get a confusing plateau. The plateau isn’t a moral failure. It’s the combination of:
- NUMA locality (your NVMe interrupts are handled far from the application threads)
- PCIe topology (switch uplinks, bifurcation, oversubscription)
- queue depths and IO scheduler choices
- filesystem and RAID/ZFS behavior under mixed workloads
- network/storage stack interactions (especially in Ceph, iSCSI, NVMe-oF)
NVMe parallelism: the CPU becomes part of the storage path
NVMe is fast because it’s parallel. That means more queues, more interrupts, more CPU time in the kernel.
On EPYC, you have plenty of cores to throw at it—but you still need to place that work close to the right NUMA node,
and you need to make sure you aren’t bottlenecked on a single queue, a single IRQ, or a single CPU handling softirqs.
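You can see that mapping without extra tooling: blk-mq publishes which CPUs submit and complete on each hardware queue. A minimal sketch for one device (nvme0n1 assumed):
# If the busy queues map to CPUs on a different node than the application,
# you've found your cross-fabric hop.
cr0x@server:~$ grep . /sys/block/nvme0n1/mq/*/cpu_list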
Memory bandwidth: the silent limiter for storage
High-speed storage often turns into “how fast can we move memory around.” Checksums, compression, encryption, replication,
and even simple copying can saturate memory bandwidth long before “CPU usage” looks scary. EPYC’s memory channels help,
but you can still starve if you leave memory channels unpopulated, mix DIMM speeds, or misconfigure BIOS power settings.
One reliability quote (paraphrased idea)
Werner Vogels (paraphrased idea): “Everything fails, all the time—design and operate as if that’s the default.”
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I actually run when an EPYC host “should be faster” but isn’t. Each one includes a realistic output snippet
and what decision you make from it. Run them on the host, not in a container, unless noted.
Task 1: Confirm CPU model, sockets, and core layout
cr0x@server:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 128
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9554 64-Core Processor
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
What it means: You have one socket but four NUMA nodes. That’s a topology decision, not just a fact.
Decision: If the workload is latency-sensitive, plan for NUMA-aware pinning and memory placement.
If it’s general-purpose, keep NUMA visible but avoid complex pinning unless you measure a win.
Task 2: Check NUMA distance (local vs remote penalty)
cr0x@server:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0-31
node 0 size: 192000 MB
node 0 free: 121500 MB
node 1 cpus: 32-63
node 1 size: 192000 MB
node 1 free: 118200 MB
node 2 cpus: 64-95
node 2 size: 192000 MB
node 2 free: 119100 MB
node 3 cpus: 96-127
node 3 size: 192000 MB
node 3 free: 120900 MB
node distances:
node 0 1 2 3
0: 10 12 12 12
1: 12 10 12 12
2: 12 12 10 12
3: 12 12 12 10
What it means: Remote memory access costs more than local (distance 12 vs 10). Not catastrophic, but real.
Decision: For databases and storage daemons, align CPU/memory to a node and try to keep I/O interrupts local.
Task 3: Verify memory speed and population
cr0x@server:~$ sudo dmidecode -t memory | egrep -i 'Locator:|Size:|Speed:|Configured Memory Speed:'
Locator: DIMM_A1
Size: 64 GB
Speed: 4800 MT/s
Configured Memory Speed: 4800 MT/s
Locator: DIMM_B1
Size: 64 GB
Speed: 4800 MT/s
Configured Memory Speed: 3600 MT/s
What it means: One DIMM is downclocking (configured at 3600). That can drag the whole memory subsystem.
Decision: Fix population rules (matched DIMMs, correct slots). Don’t benchmark until memory runs at expected speed.
Task 4: Check current CPU frequency behavior under load
cr0x@server:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IPC,IRQ,POLL --interval 1 --num_iterations 3
Busy% Bzy_MHz IPC IRQ POLL
38.45 2850 1.35 6200 0
92.10 2975 1.02 18900 0
91.88 2400 0.98 20100 0
What it means: Frequency drops at high busy% (thermal/power/firmware policy). IPC also drops, suggesting memory stalls or contention.
Decision: Check BIOS power profile, cooling, and Linux governor. If throughput matters, avoid conservative power settings.
Task 5: Inspect Linux CPU frequency driver/governor
cr0x@server:~$ cpupower frequency-info
analyzing CPU 0:
driver: amd-pstate-epp
CPUs which run at the same hardware frequency: 0
available cpufreq governors: performance powersave
current policy: frequency should be within 400 MHz and 3700 MHz.
The governor "powersave" may decide which speed to use
current CPU frequency: 1580 MHz (asserted by call to hardware)
What it means: You’re in powersave. On a server doing real work, that’s often accidental.
Decision: For performance-critical hosts, set performance and confirm thermals are acceptable.
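Applying that decision is a one-liner; a sketch, with the caveat that on amd-pstate-epp the energy_performance_preference knob in sysfs is a separate lever worth checking too:
# Switch all policies to the performance governor, then confirm it stuck.
cr0x@server:~$ sudo cpupower frequency-set -g performance
cr0x@server:~$ cpupower frequency-info | grep -iE 'governor|current CPU'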
Task 6: Confirm PCIe topology and link widths
cr0x@server:~$ lspci -tv
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14xx
+-01.1-[01]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller
+-01.2-[02]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller
+-03.0-[03]----00.0 Mellanox Technologies MT28908 Family [ConnectX-6]
\-04.0-[04]----00.0 Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3
What it means: Devices are on different root ports—good sign. But you still need to check negotiated link speed/width.
Decision: If an NVMe negotiated x2 instead of x4 (or Gen3 instead of Gen4/5), fix cabling/backplane/bifurcation/BIOS before blaming software.
Task 7: Validate negotiated PCIe speed/width on a device
cr0x@server:~$ sudo lspci -s 01:00.0 -vv | egrep -i 'LnkCap:|LnkSta:'
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
LnkSta: Speed 8GT/s (downgraded), Width x4
What it means: Capable of Gen4 (16GT/s) but running Gen3 (8GT/s). That halves bandwidth.
Decision: Check BIOS PCIe settings, backplane, retimers, and firmware. Don’t “optimize” the kernel to compensate for a physical downgrade.
Task 8: Identify NVMe queue and interrupt distribution
cr0x@server:~$ cat /proc/interrupts | egrep 'nvme|mlx5' | head
95: 1200331 0 0 0 IR-PCI-MSI 524288-edge nvme0q0
96: 903221 0 0 0 IR-PCI-MSI 524289-edge nvme0q1
97: 910554 0 0 0 IR-PCI-MSI 524290-edge nvme0q2
110: 3321100 0 0 0 IR-PCI-MSI 327680-edge mlx5_comp0
What it means: Every one of these interrupts lands on CPU0 (the first per-CPU column), which is a classic performance-killer.
Decision: Enable/verify irqbalance, or pin IRQs to the local NUMA node for the device. This is often a “free” latency win.
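If irqbalance still leaves a hotspot, pinning a specific vector is two files in /proc. A minimal sketch for the mlx5 vector above, assuming the NIC sits on NUMA node 2 (CPUs 64-95 on this host; the next task shows how to check). Note that NVMe queue vectors are usually kernel-managed on recent kernels and will refuse the write, which is fine because the kernel already spreads them.
# Which CPUs may service this vector today?
cr0x@server:~$ cat /proc/irq/110/smp_affinity_list
# Restrict it to CPUs local to the NIC's NUMA node.
cr0x@server:~$ echo 64-95 | sudo tee /proc/irq/110/smp_affinity_list
# irqbalance may rewrite this later; exclude the IRQ (--banirq) or manage affinity in one place.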
Task 9: Check NIC locality and NUMA node association
cr0x@server:~$ for dev in /sys/class/net/enp*; do echo -n "$(basename $dev) "; cat $dev/device/numa_node; done
enp65s0f0 2
enp65s0f1 2
What it means: The NIC is attached to NUMA node 2. If your workload runs on node 0 with memory on node 0, you’re paying for distance.
Decision: For high-throughput networking (Ceph, NVMe-oF, replication), place the network stack threads near the NIC’s NUMA node.
Task 10: Spot memory bandwidth pressure via perf counters
cr0x@server:~$ sudo perf stat -a -e cycles,instructions,cache-misses,dTLB-load-misses -I 1000 -- sleep 3
# time counts unit events
1.000233062 5,210,332,110 cycles
1.000233062 3,101,229,884 instructions
1.000233062 92,110,553 cache-misses
1.000233062 1,020,112 dTLB-load-misses
2.000472981 5,401,223,019 cycles
2.000472981 3,002,118,991 instructions
2.000472981 120,004,221 cache-misses
2.000472981 1,230,888 dTLB-load-misses
What it means: Rising cache misses + dropping instructions per cycle hints at memory pressure or poor locality.
Decision: Look at NUMA placement, huge pages, allocator behavior, and data structure locality before adding more cores.
Task 11: Check disk I/O saturation and queueing
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
22.10 0.00 9.20 3.10 0.00 65.60
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 1200.0 800.0 420.0 310.0 620.0 9.80 6.20 0.35 99.8
What it means: The device is at ~100% util and queue depth is high (avgqu-sz). Await is 6.2ms—high for NVMe under this load.
Decision: Either the device is saturated (normal) or you have an interrupt/NUMA issue making it look saturated. Correlate with interrupts and CPU softirq.
Task 12: Verify filesystem or ZFS compression/checksum CPU cost (example: ZFS)
cr0x@server:~$ sudo zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
tank 12.1T 10.7T 9800 6200 1.1G 820M
mirror-0 12.1T 10.7T 9800 6200 1.1G 820M
nvme-SAMSUNG_MZQLB1T9-0 - - 4900 3100 560M 410M
nvme-SAMSUNG_MZQLB1T9-1 - - 4900 3100 560M 410M
What it means: High ops and bandwidth; if CPU usage is also high, ZFS features (compression, checksums) may be compute-bound, not disk-bound.
Decision: If latency is the priority, reconsider compression level, recordsize, and sync settings—carefully, and with data safety intact.
Task 13: Detect softirq overload (network/storage path congestion)
cr0x@server:~$ mpstat -P ALL 1 2 | head -n 20
Linux 6.5.0 (server) 01/10/2026 _x86_64_ (128 CPU)
01:01:10 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
01:01:11 PM all 18.20 0.00 12.30 2.10 0.90 15.40 0.00 51.10
01:01:11 PM 0 2.00 0.00 5.00 0.00 0.00 70.00 0.00 23.00
What it means: CPU0 is drowning in softirq. That aligns with “all interrupts on CPU0.”
Decision: Fix IRQ affinity. If you’re using DPDK or busy polling, validate configuration and ensure it’s intentional.
Task 14: Verify virtualization topology exposure (KVM/QEMU example)
cr0x@server:~$ virsh capabilities | egrep -n 'topology|numa|cells' | head
115: <topology sockets='1' dies='1' clusters='1' cores='64' threads='2'/>
132: <cells num='4'>
133: <cell id='0'>
What it means: The host has 4 NUMA cells. Your VMs can be configured to match (or can be left mismatched).
Decision: For big VMs, expose NUMA to the guest and align vCPU/memory. For small VMs, keep it simple unless you can measure a benefit.
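For the big-VM case, the libvirt side looks roughly like this. A minimal sketch of the relevant fragments, not a full domain definition; the cpuset and nodeset values must match what your host actually reports:
<!-- Pin each vCPU to host CPUs on one node and keep guest memory on the same node. -->
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <!-- one vcpupin per vCPU, all within that node's CPU range -->
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>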
Task 15: Check thermal/power capping evidence
cr0x@server:~$ sudo journalctl -k --since "1 hour ago" | egrep -i 'thrott|powercap|edac|mce' | tail
Jan 10 12:35:22 server kernel: amd_pstate: limiting max performance due to platform profile
Jan 10 12:41:07 server kernel: EDAC amd64: ECC error detected on CPU#0 Channel:1 DIMM:0
What it means: Platform profile is limiting performance, and you also have ECC errors. Both can manifest as “random slowness.”
Decision: Fix thermals/power profile first; replace or reseat problematic DIMMs immediately. Performance tuning on flaky memory is self-harm.
Fast diagnosis playbook
This is the “I have 20 minutes before the incident call turns into improv theater” flow. It assumes you have a host that is slower than expected,
or tail latency is spiking under load.
First: separate “resource saturation” from “topology/pathology”
- Check CPU and softirq distribution: mpstat -P ALL 1 and cat /proc/interrupts. If one CPU is pegged in softirq, stop. Fix IRQ affinity.
- Check disk util and queueing: iostat -x 1. If %util is ~100% with high await, you're either truly device-bound or you have a locality/interrupt problem making it look device-bound.
- Check memory pressure: vmstat 1 (look at si/so), and perf counters if available. If you're paging, everything else is background noise.
Second: confirm physical negotiation and firmware reality
- PCIe link downgrades: lspci -vv for Speed/Width. Fix Gen3-running-on-Gen4-capable issues before changing kernel knobs.
- Memory downclocking: dmidecode to find misconfigured DIMMs. EPYC will run, but your bandwidth will quietly leave the building.
- Power policy throttling: cpupower frequency-info and logs. Many "mystery regressions" are power profiles.
Third: line up locality for the hot path
- NUMA node mapping: numactl --hardware, NIC NUMA node via sysfs, and storage device locality.
- Pin only what you must: for large VMs or dedicated services, pin CPU and memory. For everything else, start with IRQ and power fixes.
- Measure tail latency, not averages: EPYC’s problems often hide in p99 while p50 looks fine.
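If you want all of that in one paste before the call starts, a throwaway snapshot script is enough. A sketch (the filename is made up; run it as root so lspci can read link status, and adjust the device filters to your hardware):
#!/usr/bin/env bash
# triage-snapshot.sh: collect the facts the playbook above asks for.
out=/tmp/triage-$(hostname)-$(date +%Y%m%d-%H%M%S).txt
{
  echo "== CPU / NUMA shape ==";      lscpu | grep -E 'Model name|Socket|Thread|NUMA'
  echo "== NUMA distances ==";        numactl --hardware | grep -A 8 'node distances'
  echo "== Frequency policy ==";      cpupower frequency-info | grep -iE 'driver|governor'
  echo "== PCIe negotiated links =="; lspci -vv 2>/dev/null | grep 'LnkSta:' | sort | uniq -c
  echo "== NVMe / NIC interrupts =="; grep -E 'nvme|mlx' /proc/interrupts | head -n 20
  echo "== Per-CPU softirq ==";       mpstat -P ALL 1 1
} > "$out"
echo "wrote $out"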
Joke #2: If your first step is “let’s tweak sysctl,” you’re not diagnosing—you’re seasoning.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A platform team migrated a fleet of storage-heavy hosts from older dual-socket systems to shiny single-socket EPYC servers. The plan looked airtight:
fewer sockets, same RAM, more cores, more NVMe. They expected lower latency and less power. The first week was quiet. Then the weekly batch job hit:
backup compaction plus indexing plus customer traffic. Latency charts grew teeth.
The on-call assumed “the disks are saturated.” NVMe %util was high, and queue depths looked ugly. They added more drives to the next host build
and spread data wider. The result: the same p99 latency, now with more hardware.
The actual issue was locality. The NIC and half the NVMe hung off a PCIe topology whose interrupts the kernel was routing to CPU0.
CPU0 wasn’t “busy” in user space; it was drowning in softirq and interrupt handling. The storage threads were running on a different NUMA node,
constantly crossing fabric boundaries to complete I/O. The system had cores to spare—just not where the work was landing.
The fix was boring and surgical: enable irqbalance as a baseline, then override affinity for the NIC and NVMe queues to keep them local to the device’s NUMA node.
They also adjusted VM placement so the storage daemons and their memory stayed near that node.
The key lesson wasn’t “irqbalance good.” It was: the wrong assumption was treating NVMe saturation as a device property rather than an end-to-end path property.
In EPYC-land, the CPU and NUMA fabric are part of your storage subsystem whether you like it or not.
Mini-story 2: The optimization that backfired
A compute team running a busy virtualization cluster decided to pin everything. Their logic was reasonable: NUMA exists, therefore pinning is good,
therefore pinning everything is better. They pinned vCPUs, pinned memory, pinned IRQs, and pinned a few kernel threads for good measure.
The cluster looked fantastic under steady workloads. Then reality showed up: hosts were evacuated for maintenance, VMs migrated, and the orchestrator
started reshuffling. Suddenly, some hosts had “perfectly pinned” VMs sitting on the wrong NUMA nodes because the original placement constraints
weren’t preserved during migrations. Performance became inconsistent: some VMs ran great, others crawled, and the difference depended on where they landed.
The backfire was operational. The pinning strategy required strict placement discipline, but the team didn’t have tooling to validate it continuously.
They also created fragmentation: free cores existed, but not in the right NUMA shape to place new VMs. The scheduler started rejecting placements,
and humans started doing manual moves. That’s when you learn how quickly a cluster turns into a bespoke artifact.
They unwound most of it. Pinning stayed for a small set of high-value VMs with predictable lifecycles. For everything else, they relied on sensible defaults:
huge pages where appropriate, irqbalance with targeted overrides, and a host-level policy that kept power management from sabotaging them.
The lesson: an optimization that requires perfect operations is not an optimization; it’s a debt instrument with variable interest.
Mini-story 3: The boring but correct practice that saved the day
A company rolled out new EPYC hosts for a storage cluster. Nothing exotic: NVMe, high-speed NICs, standard Linux, standard monitoring.
The change window was tight, so they kept the OS image identical across generations and relied on automation.
One host started showing intermittent latency spikes after a week. Not enough to trip global alerts, but enough to trigger application retries
and customer complaints. The graph looked like random noise—classic “it’s the network” bait.
The saving practice was painfully boring: they had a routine that reviewed kernel logs for corrected hardware errors as part of weekly hygiene,
not just during incidents. The operator spotted ECC corrections and a few PCIe AER messages on the affected host. No panic, just a clear direction:
hardware or firmware, not application tuning.
They scheduled a controlled maintenance, reseated DIMMs, updated firmware, and replaced the suspect module. The spikes disappeared.
Without that hygiene, the team would have wasted weeks “tuning” IRQ affinity and storage parameters while the box quietly corrupted trust.
The lesson: the most profitable performance trick is preventing flaky hardware from impersonating a software problem.
Common mistakes (symptoms → root cause → fix)
EPYC doesn’t create new mistakes. It makes existing ones louder. Here are the ones I see repeatedly, with specific fixes.
1) Symptom: p99 latency spikes under load; CPU “looks idle” overall
- Root cause: interrupts/softirqs concentrated on a few CPUs; the hot path is CPU-bound in kernel space.
- Fix: check /proc/interrupts; enable irqbalance; pin IRQs for NIC/NVMe to local NUMA CPUs; verify RPS/XPS if relevant.
2) Symptom: NVMe throughput plateau far below expectation
- Root cause: PCIe link negotiated at lower speed (Gen3 vs Gen4/5) or lower width; retimers/backplane/BIOS mismatch.
- Fix: check LnkSta in lspci -vv output; correct BIOS settings; update firmware; validate cabling/backplane; re-test before tuning software.
3) Symptom: database is slower after “more cores” upgrade
- Root cause: memory bandwidth saturation, remote NUMA memory access, or lock contention scaling wall.
- Fix: align DB process CPU/memory with NUMA node; increase locality; consider fewer threads; validate memory speed/population.
4) Symptom: virtualization cluster has random “good” and “bad” hosts
- Root cause: inconsistent BIOS settings (NPS mode, power profile), inconsistent kernel drivers/governors, or drift in firmware.
- Fix: baseline BIOS/firmware; enforce in provisioning; audit with lscpu, cpupower, and DMI inventory.
5) Symptom: storage CPU usage explodes after enabling compression/encryption
- Root cause: CPU becomes the storage bottleneck; memory bandwidth and cache behavior dominate; per-core overhead scales badly.
- Fix: measure with perf/turbostat; tune compression level; ensure IRQ distribution; move heavy daemons to local nodes; consider offload only if it’s real.
6) Symptom: “We pinned vCPUs and it got worse”
- Root cause: pinning without aligning memory and I/O locality; or operational drift makes pinning inconsistent post-migration.
- Fix: pin as a full policy (CPU + memory + device locality) for a small subset; otherwise remove pinning and fix interrupts/power first.
7) Symptom: frequent tiny stalls, hard to reproduce
- Root cause: corrected ECC errors, PCIe AER retries, or thermal/power capping events.
- Fix: inspect journalctl -k for EDAC/AER messages; address the hardware; don't "tune around" reliability warnings.
Checklists / step-by-step plan
Step-by-step: bringing a new EPYC host into production without drama
- Standardize BIOS settings: power profile, NPS mode, SMT policy, PCIe generation settings. Document the chosen stance and why.
- Validate memory population: correct slots, matched DIMMs, expected configured speed. Fix downclocking immediately.
- Verify PCIe negotiation: check all NVMe and NICs for expected link speed/width; ensure no silent downgrades.
- Baseline CPU frequency behavior: confirm governor/driver; run a short load test and verify frequencies don’t collapse unexpectedly.
- Confirm NUMA shape: record lscpu and numactl --hardware outputs; store them with the host inventory.
- Set interrupt policy: enable irqbalance as a baseline, then add targeted affinity overrides for known hotspots (NICs, NVMe) if you have evidence.
- Run a storage smoke test: random read/write with realistic queue depths; record p50/p95/p99 latency, not just throughput (a fio sketch follows this list).
- Run a network smoke test: validate line rate, CPU cost, and IRQ distribution; ensure no single core is pegged in softirq.
- Establish a “hardware health” watch: EDAC, MCE, AER logs tracked. Treat corrected errors as leading indicators.
- Only then tune workload knobs: thread counts, pinning, huge pages, filesystem settings, and so on—based on measurements.
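For the storage smoke test in that list, a hedged fio starting point looks like this. It writes to the raw device, so only point it at a scratch namespace or swap in a test file; the node and device numbers are assumptions you verify first:
# Bind the whole test to the NVMe's NUMA node so you measure the device, not the fabric.
cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 fio --name=smoke \
    --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --numjobs=8 --runtime=120 --time_based --group_reporting
# Record the clat percentiles (p50/p95/p99) from the output, not just the bandwidth line.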
Checklist: deciding between single-socket and dual-socket EPYC
- Choose single-socket if you need lots of I/O, moderate memory, and want simpler licensing and lower failure domain complexity.
- Choose dual-socket if you truly need memory capacity or bandwidth beyond what one socket provides, or your workload demands more PCIe endpoints without switches.
- Avoid dual-socket “just because.” It doubles the NUMA complexity and increases the blast radius for remote memory penalties.
Checklist: what to measure before you blame EPYC
- PCIe link speed/width negotiated
- memory configured speed and channel population
- IRQ distribution across CPUs
- softirq time and per-core hotspots
- NUMA node placement for the process and its memory
- power management governor and throttling evidence
- tail latency (p99) under mixed load
FAQ
1) Is EPYC “better” than Intel for servers?
It depends on what “better” means. EPYC has been especially strong when you need high core density, lots of PCIe, and strong memory bandwidth per dollar.
But “better” in production is also firmware maturity, platform stability, and how your software behaves with NUMA and high parallelism.
2) Why does my single-socket EPYC show multiple NUMA nodes?
Because the platform can be configured as multiple NUMA domains per socket (NPS modes), and the chiplet/I/O topology exposes locality boundaries.
Treat it as real: memory latency and I/O locality can differ by node even in a single socket.
3) Should I enable SMT (hyperthreading equivalent) on EPYC?
Usually yes for throughput workloads, mixed VM environments, and general services. For strict latency or noisy-neighbor scenarios, test both ways.
SMT can improve utilization but can also increase tail latency when contention is high.
4) Why is my NVMe device running at Gen3 when it supports Gen4/Gen5?
Common causes: BIOS forced generation, backplane/retimer limitations, mixed devices negotiating down, signal integrity issues, or outdated firmware.
Confirm with lspci -vv and fix the physical/firmware layer before tuning Linux.
5) Do I need to pin IRQs manually, or is irqbalance enough?
Start with irqbalance. If you still see hotspots (one CPU pegged in softirq) or you need consistent latency, pin IRQs for NIC/NVMe
to CPUs local to the device’s NUMA node. Manual pinning without measurement tends to become permanent folklore.
6) What’s the fastest way to tell if I’m memory bandwidth bound?
Look for low IPC under load (turbostat), rising cache misses (perf stat), and scaling that stops improving with more threads.
Also verify memory isn’t downclocked and that channels are populated correctly.
7) Is a higher core count always better for storage nodes?
No. Storage nodes can be limited by memory bandwidth, interrupt handling, and network stack efficiency. More cores help only if the hot path
parallelizes and locality is respected. Otherwise you just get more cores watching one core do the work.
8) Should I configure more NUMA nodes per socket (NPS) for performance?
Sometimes. More NUMA nodes can improve locality if your scheduler and application placement are NUMA-aware, especially for I/O-heavy workloads.
But it also increases operational complexity. If you don’t have placement discipline, fewer NUMA nodes may be more stable.
9) What’s the single worst “EPYC tuning” habit?
Treating topology as optional. If you ignore IRQ distribution, memory placement, and PCIe negotiation, you’ll spend weeks “tuning” sysctls and thread counts
while the real bottleneck sits in plain sight.
Next steps that won’t embarrass you
EPYC turned servers into a showroom by making the spec sheet matter again: cores, lanes, channels, and cache. That’s the headline.
The operational reality is that topology matters again too, and topology punishes hand-wavy thinking.
Practical next steps:
- Inventory topology on every host: CPU/NUMA shape, NIC and NVMe NUMA nodes, PCIe negotiated speeds. Store it with your CMDB or provisioning metadata.
- Standardize BIOS and power policy across the fleet. Inconsistent profiles create “haunted hosts” that waste SRE time.
- Baseline IRQ and softirq behavior under representative load. If one CPU is the kernel’s dumping ground, fix that first.
- Measure p99 before and after changes. If you only track averages, you’ll declare victory while users keep retrying.
- Use pinning surgically: reserve it for workloads that deserve it and environments that can maintain it.
The goal isn’t to make your EPYC servers impressive. They already are. The goal is to make them boring—and to keep them boring when the workload gets mean.