You buy a “16-core” workstation, throw a build at it, and the latency graph looks like a seismograph during a minor existential crisis.
Or you provision a shiny EPYC host, then watch one microservice fly while another crawls—same code, same load, same day.
That is the chiplet era in production: cheaper cores, more cores, and a topology that absolutely will punish lazy assumptions.
AMD’s chiplet strategy didn’t just “help performance.” It revived Ryzen as a product line by changing the manufacturing math—and it changed how operators
should diagnose bottlenecks, place memory, and pin work.
Chiplets in one sentence (and why it mattered)
A chiplet CPU is a processor built from multiple smaller dies—typically CPU core dies plus an I/O die—connected by a high-speed interconnect.
If you’re an SRE, translate that into: compute is modular, I/O is centralized, and memory access is no longer “uniform” even when your scheduler pretends it is.
If you’re a storage engineer, translate it into: PCIe and memory controllers live on a different die than the cores doing your checksums, erasure coding,
compression, and networking.
AMD made this mainstream. And AMD did it at the exact moment when monolithic high-core-count dies were getting uncomfortably expensive to build reliably.
Chiplets were the trick that let AMD ship lots of cores with competitive clocks and sane margins, while iterating quickly across generations.
Fast historical context: the concrete facts that set the stage
A few facts matter because they explain why chiplets weren’t a cute design choice—they were an escape hatch. Keep these in your pocket.
- AMD’s first Zen-based Ryzen desktop CPUs launched in 2017, ending a long stretch where “AMD vs Intel” was not a serious performance debate in many segments.
- Zen 2 (2019) was the big chiplet pivot for mainstream: CPU cores moved to multiple smaller core dies, paired with a separate I/O die on a different process node.
- The I/O die in Zen 2 was typically on a mature node (older, cheaper, higher-yield), while CPU core chiplets went on a leading-edge node.
- EPYC “Rome” (Zen 2) scaled to many chiplets in servers, proving the model at high core counts before desktops fully absorbed the implications.
- Infinity Fabric became the “spine” that made modular compute feasible without turning every cache miss into a disaster.
- Yield economics got brutal at advanced nodes: the larger the die, the more likely a defect ruins the entire piece. Smaller dies improve usable output per wafer.
- Chiplets enabled aggressive product binning: AMD could mix-and-match core chiplets and segment SKUs without designing a new monolithic die each time.
- Windows and Linux schedulers had to catch up: topology awareness matters more when cores are separated into dies and memory controllers sit elsewhere.
The story isn’t “AMD invented chiplets.” The story is that AMD operationalized them at scale for consumer and server parts in a way that changed the price/perf slope.
How AMD’s chiplets actually work: CCDs, IODs, and the fabric in between
The parts: CCD and IOD
In AMD’s chiplet designs, you usually have:
- CCD (Core Complex Die): the compute tile(s). This is where the CPU cores and their caches live.
- IOD (I/O Die): the tile that hosts memory controllers, PCIe controllers, and often other “uncool” but essential infrastructure.
- Interconnect: Infinity Fabric links these dies.
Operationally, that means the CPU core running your process may be one die-hop away from the memory controller handling its DRAM traffic and one die-hop away
from the PCIe root complex carrying NVMe interrupts. That hop is fast. It is not free.
The topology: “NUMA, but make it subtle”
NUMA is older than most of the dashboards we stare at. But chiplets made it relevant to people who used to ignore it.
Even within one socket, the latency to memory can vary depending on which CCD your core sits on and which memory controller (on the IOD) is servicing the request.
Here’s the practical definition: if a workload is cache-friendly, chiplets are mostly a win. If it’s memory-latency sensitive with lots of random access,
chiplets can turn into a topology tax unless you schedule and allocate carefully.
Infinity Fabric: what it gives you, what it charges you
Infinity Fabric is the interconnect that stitches dies together. It’s not “just a bus.” It’s an ecosystem: clocking, coherency,
and how core dies talk to the I/O die and, in multi-socket systems, potentially to another socket.
Practically:
- Best case: the fabric is fast enough that modularity feels invisible, and you get lots of cores at good prices.
- Worst case: you schedule threads across dies, bounce cache lines, and turn your p99 latency into a personality trait.
One quote to keep you honest when you’re tempted to hand-wave the topology:
Latency is a tax you pay on every request; throughput is a dividend you may or may not collect.
— Brendan Gregg (paraphrased idea)
Two jokes total, used responsibly
Joke 1: Chiplets are great because now you can have eight little CPUs arguing about cache coherency instead of one big CPU doing it quietly.
Why this resurrected Ryzen: yields, bins, cadence, and product segmentation
Manufacturing economics: smaller dies, better yields
Let’s be blunt: chiplets let AMD sell more good silicon per wafer. Defects happen. They’re normal. What matters is how much product you can salvage.
With a big monolithic die, one defect can trash a lot of area. With multiple smaller dies, a defect trashes one chiplet.
That changes everything: cost per usable core, how aggressive you can be with core counts, and how many SKUs you can profitably ship.
It also means AMD can ride leading-edge nodes for CPU cores while keeping I/O on a mature node that’s cheaper and often electrically easier for analog-ish interfaces.
Bin flexibility: mixing good parts into good products
Chiplets unlock practical binning. If one CCD has a slightly worse core, you can down-bin that chiplet into a lower SKU. Another CCD with better characteristics
can go into a higher clocked SKU. The IOD stays the same family. That modularity keeps the product stack full without requiring heroic yields.
This is also how you get “strangely good” mid-tier parts that overclock like they’re trying to prove a point: they’re often made from excellent chiplets that
didn’t fit a higher SKU for non-performance reasons (inventory, demand, segmentation).
Faster iteration: upgrade the I/O separately from cores (or vice versa)
With chiplets, AMD can evolve core architecture and process node without redoing the entire I/O subsystem at the same cadence.
That reduces risk. It also reduces time-to-market. The IOD is complex, and it’s full of interfaces that are painful at the bleeding edge.
For operators, the downstream effect is subtle: you’ll see generations where raw compute jumps while memory latency and I/O behavior don’t move in lockstep.
Don’t assume “new CPU” means “same topology with more GHz.”
Segmenting desktop vs server without re-inventing everything
AMD can scale the same basic building blocks across Ryzen and EPYC families, tuning counts and I/O features to match markets.
That’s not just business efficiency—it’s why Ryzen came back with credible performance and why EPYC became a serious datacenter option.
Ops reality: where chiplets help, and where they bite
Where chiplets are a clear win
- Parallel workloads: build farms, render, compression, encryption, analytics, VM consolidation—anything that scales with cores and tolerates some locality variance.
- Cost-effective scale: more cores per dollar often beats a slightly lower p99, provided you’re honest about tail latency needs.
- SKU availability and diversity: the market ends up with more “weirdly specific” CPUs that fit specific fleet roles.
Where chiplets punish you
The failure mode is usually not “slow.” It’s “inconsistent.” One run is fine. Next run is 30% worse. Then you reboot and it gets better,
which convinces everyone it was “a transient.” It wasn’t.
- NUMA-blind scheduling: threads and memory allocations drift across dies.
- Interrupt storms landing on the wrong cores: NIC/NVMe interrupts hammer a CCD far from the work.
- Cross-die lock contention: shared data structures bounce cache lines over the fabric.
- Memory latency sensitivity: key-value stores, trading-ish workloads, certain databases, and anything that lives on pointer chasing.
Joke 2: If your performance “improves after a reboot,” congratulations—you have invented topology roulette.
What to do about it (high-level)
Treat topology as a first-class resource. That means:
- Measure memory latency and bandwidth, not just CPU utilization.
- Pin critical workloads and their memory to the same NUMA node where possible (a minimal example follows this list).
- Be deliberate about BIOS settings that alter fabric clocks, power states, and memory interleaving.
- Watch for interrupt distribution and queue affinity on high-throughput NICs and NVMe.
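Here is what that pinning item looks like in practice: a minimal sketch, assuming a hypothetical service binary at /usr/local/bin/myservice and assuming NUMA node 1 is where its NIC and NVMe live (verify with the tasks below before copying the numbers).

cr0x@server:~$ # Bind both the CPUs and the memory allocations of the service to NUMA node 1
cr0x@server:~$ sudo numactl --cpunodebind=1 --membind=1 /usr/local/bin/myservice --config /etc/myservice.conf

The same wrapper works for one-off benchmarks; the systemd-level equivalent shows up in Task 10.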
Fast diagnosis playbook: find the bottleneck before the meeting ends
When a Ryzen/EPYC chiplet host feels “off,” you don’t have time to philosophize. You need a quick triage order that narrows the space.
First: confirm topology and NUMA exposure
- How many NUMA nodes does the OS see? Do they map to expectations?
- Is memory evenly populated across channels?
- Are cores spread across CCDs in a way the scheduler understands?
Second: decide whether the bottleneck is compute, memory, or I/O
- Compute-bound: high IPC, high core utilization, stable p99.
- Memory-bound: low IPC, high stalled cycles, high LLC misses, uneven NUMA traffic.
- I/O-bound: queues backing up, high iowait, interrupts on a small subset of CPUs, PCIe throttling, NVMe latency spikes.
Third: check for “topology accidents”
- Workloads migrating across NUMA nodes (cpuset/cgroup misconfig or scheduler behavior).
- NIC/NVMe interrupts pinned poorly.
- Memory allocations remote from the threads that use them (NUMA balancing fighting you).
Fourth: validate firmware settings and power behavior
- Fabric clock coupling, memory speed, and power states (C-states, CPPC, P-states).
- SMT on/off decisions for tail latency.
- Deterministic performance profiles if your vendor provides them.
Fifth: only then touch application tuning
If topology is wrong, application tuning is just performance theater. Fix placement first.
Practical tasks with commands: prove topology, measure, decide
These are tasks I actually run when chiplet topology is suspected. Each includes: a command, what the output means, and the decision you make.
Assumption: Linux host. If you run something else, your day is already complicated enough.
Task 1: Identify the CPU model and stepping
cr0x@server:~$ lscpu | egrep 'Model name|Socket|Thread|Core|NUMA|Vendor|CPU\(s\)'
CPU(s): 64
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7xx2 32-Core Processor
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 4
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
Meaning: One socket, 4 NUMA nodes exposed. That’s a topology signal: memory locality matters even “within one CPU.”
Decision: If the workload is latency-sensitive, plan for NUMA pinning and memory binding; otherwise accept it and focus on throughput.
Task 2: Validate NUMA memory availability per node
cr0x@server:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64512 MB
node 0 free: 60210 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 64512 MB
node 1 free: 61102 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 64512 MB
node 2 free: 60001 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 64512 MB
node 3 free: 61234 MB
node distances:
node 0 1 2 3
0: 10 12 12 12
1: 12 10 12 12
2: 12 12 10 12
3: 12 12 12 10
Meaning: Memory is evenly provisioned; distance matrix shows local vs remote cost.
Decision: If one node has far less memory (or is missing), fix DIMM population or BIOS interleaving before blaming the application.
Task 3: Verify memory speed and channel population signals
cr0x@server:~$ sudo dmidecode -t memory | egrep 'Locator:|Speed:|Configured Memory Speed:|Size:'
Locator: DIMM_A1
Size: 32 GB
Speed: 3200 MT/s
Configured Memory Speed: 3200 MT/s
Locator: DIMM_B1
Size: 32 GB
Speed: 3200 MT/s
Configured Memory Speed: 3200 MT/s
Locator: DIMM_C1
Size: 32 GB
Speed: 3200 MT/s
Configured Memory Speed: 3200 MT/s
Locator: DIMM_D1
Size: 32 GB
Speed: 3200 MT/s
Configured Memory Speed: 3200 MT/s
Meaning: The configured speed matches rated speed. If you see 2133/2400 on a platform that should run 3200, you’re paying a silent latency and bandwidth penalty.
Decision: Fix BIOS memory profile/compatibility; check DIMM mix-and-match and population rules.
Task 4: See which CPUs are getting hammered by interrupts
cr0x@server:~$ cat /proc/interrupts | head -n 12
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
24: 18423922 0 0 0 0 0 0 0 IR-PCI-MSI 524288-edge nvme0q0
25: 12 0 0 0 0 0 0 0 IR-PCI-MSI 524289-edge nvme0q1
40: 9234411 0 0 0 0 0 0 0 IR-PCI-MSI 1048576-edge enp65s0f0-TxRx-0
41: 34 0 0 0 0 0 0 0 IR-PCI-MSI 1048577-edge enp65s0f0-TxRx-1
Meaning: CPU0 is getting clobbered by NVMe and NIC queues. That often correlates with jitter, softirq spikes, and “why is one core at 100%?”
Decision: Distribute IRQs: enable irqbalance (carefully), or manually set affinity for critical queues near the workload’s NUMA node.
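If you choose manual affinity, the kernel exposes it per IRQ under /proc/irq. A minimal sketch, assuming IRQ 24 is the nvme0q0 vector from the listing above and that CPUs 16-19 are local to the device’s NUMA node:

cr0x@server:~$ # Note: a running irqbalance daemon may rewrite this; stop it or exclude the IRQ from balancing first
cr0x@server:~$ echo 16-19 | sudo tee /proc/irq/24/smp_affinity_list
16-19

Re-check /proc/interrupts afterwards to confirm new interrupts are actually landing on the intended CPUs.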
Task 5: Map a device to its NUMA node (NIC/NVMe locality)
cr0x@server:~$ cat /sys/class/net/enp65s0f0/device/numa_node
1
Meaning: That NIC is local to NUMA node 1.
Decision: Place the busiest network-processing threads on CPUs in node 1, and consider binding network buffers/processing there.
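To capture the same locality answer for every NIC and NVMe controller at once, a short sysfs loop is enough; this is a sketch using only standard sysfs paths, with output omitted:

cr0x@server:~$ for dev in /sys/class/net/*/device /sys/class/nvme/*/device; do
>   [ -e "$dev/numa_node" ] && echo "$(basename "$(dirname "$dev")") -> NUMA node $(cat "$dev/numa_node")"
> done

A value of -1 means the platform didn’t report locality for that device, which is itself worth noting in the asset record.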
Task 6: Check PCIe link width/speed for “why is my NVMe slow?”
cr0x@server:~$ sudo lspci -s 41:00.0 -vv | egrep 'LnkCap:|LnkSta:'
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <4us
LnkSta: Speed 8GT/s (downgraded), Width x4 (ok)
Meaning: The device can do 16GT/s but is running at 8GT/s. That’s a real throughput and latency hit.
Decision: Check BIOS PCIe settings, risers, slot wiring, and retimers. Don’t “optimize” software for a hardware negotiation problem.
Task 7: Verify CPU frequency behavior under load (power states matter)
cr0x@server:~$ sudo apt-get -y install linux-tools-common linux-tools-generic >/dev/null
cr0x@server:~$ sudo turbostat --quiet --Summary --interval 1 --num_iterations 3
Time_Of_Day_Seconds Avg_MHz Busy% Bzy_MHz IRQ SMI PkgTmp PkgWatt
54421.9 2875 62.3 4012 812 0 61 142.3
54422.9 2910 64.1 3988 799 0 62 145.0
54423.9 2842 60.8 4020 821 0 61 141.7
Meaning: You see effective MHz and busy MHz. If Bzy_MHz collapses under moderate load, power or thermal constraints are biting.
Decision: For latency-critical services, select a deterministic power profile and consider limiting deep C-states.
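Some of that can be steered from userspace without a reboot. A minimal sketch using cpupower (it ships in the same linux-tools packages installed for turbostat above); whether these knobs are exposed depends on your platform’s frequency driver:

cr0x@server:~$ sudo cpupower frequency-set -g performance   # prefer the performance governor
cr0x@server:~$ sudo cpupower idle-set -D 10                 # disable idle states with exit latency of 10 us or more

Treat this as an experiment to measure, not a fleet-wide default: deeper C-states exist to save power, and disabling them everywhere has a real cost.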
Task 8: Inspect per-NUMA-node memory allocation of a process
cr0x@server:~$ pidof memcached
24831
cr0x@server:~$ numastat -p 24831
Per-node process memory usage (in MBs) for PID 24831 (memcached)
Node 0 Node 1 Node 2 Node 3 Total
Huge 0.0 0.0 0.0 0.0 0.0
Heap 5120.0 1024.0 980.0 990.0 8114.0
Stack 8.0 8.0 8.0 8.0 32.0
Meaning: Heap is spread across nodes. That can be fine for throughput; it can be terrible for tail latency if threads are mostly on one node.
Decision: Bind the process and its memory to one node (or shard per node). Or explicitly run multiple instances per NUMA node.
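A sketch of the “one instance per node” variant for this memcached example, with hypothetical ports 11211-11214 and 4 GB per instance:

cr0x@server:~$ for node in 0 1 2 3; do
>   sudo numactl --cpunodebind=$node --membind=$node \
>     memcached -d -u memcache -m 4096 -p $((11211 + node))
> done

The client side then needs to shard keys across the four endpoints, which is the price of keeping every lookup node-local.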
Task 9: Observe remote vs local memory accesses (kernel NUMA stats)
cr0x@server:~$ egrep 'numa_(hit|miss|foreign|interleave|local|other)' /proc/vmstat
numa_hit 428112233
numa_miss 2219921
numa_foreign 1941122
numa_interleave 0
numa_local 426331900
numa_other 1920333
Meaning: numa_miss and numa_foreign show cross-node allocations. Rising rapidly during a latency incident is a red flag.
Decision: Investigate process migration, automatic NUMA balancing, and memory policy. Fix placement before changing code.
Task 10: Confirm scheduler and cgroup CPU pinning (is the service drifting?)
cr0x@server:~$ systemctl show -p AllowedCPUs -p AllowedMemoryNodes myservice.service
AllowedCPUs=
AllowedMemoryNodes=
Meaning: Empty means “no restriction.” If you expected pinning, it isn’t happening.
Decision: Add CPUAffinity/AllowedCPUs and memory node restrictions, or use cpuset cgroup to enforce placement.
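A minimal sketch of enforcing it with systemd (requires cgroup v2 and a reasonably recent systemd); the CPU range and node here match NUMA node 1 from the earlier lscpu output and are assumptions to adapt:

cr0x@server:~$ sudo systemctl set-property myservice.service AllowedCPUs=16-31 AllowedMemoryNodes=1
cr0x@server:~$ systemctl show -p AllowedCPUs -p AllowedMemoryNodes myservice.service
AllowedCPUs=16-31
AllowedMemoryNodes=1

By default set-property persists as a drop-in; add --runtime if you only want to test the placement until the next reboot.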
Task 11: Pin a benchmark to one NUMA node to measure locality impact
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 python3 -c "import time; t=time.time(); b=bytearray(2*1024**3); [b.__setitem__(i,1) for i in range(0,len(b),4096)]; print(f'touched 2GB in {time.time()-t:.2f}s')"
Meaning: This is a crude “touch memory” test. Repeat with different node bindings. If times swing widely, locality matters for your workload class.
Decision: If swings are big, design deployments around NUMA sharding or explicit binding.
Task 12: Measure interconnect/topology with hwloc (visualize the chiplets)
cr0x@server:~$ sudo apt-get -y install hwloc >/dev/null
cr0x@server:~$ lstopo-no-graphics | head -n 30
Machine (256GB total)
Package L#0
NUMANode L#0 (P#0 64GB)
L3 L#0 (32MB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
NUMANode L#1 (P#1 64GB)
L3 L#1 (32MB)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#16)
PU L#3 (P#17)
Meaning: You can see NUMA nodes and L3 groupings. On chiplet designs, these groupings often align with CCD/CCX boundaries.
Decision: Use this view to design CPU sets for services: keep chatty threads within the same L3 domain when possible.
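A sketch of acting on that view: take the PU numbers that lstopo groups under one L3 and feed them to taskset. The CPU list and the worker binary below are illustrative, not taken from the truncated output above:

cr0x@server:~$ # Keep a chatty thread pool inside a single L3/CCX domain
cr0x@server:~$ taskset -c 0-7 ./worker-pool --threads 8

If you would rather not hand-maintain CPU lists, hwloc’s own binding tools can express the same intent against topology objects instead of raw CPU numbers.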
Task 13: Spot cross-die cache line bouncing with perf (lock contention hint)
cr0x@server:~$ sudo perf stat -e cycles,instructions,cache-misses,LLC-load-misses -p 24831 -- sleep 10
Performance counter stats for process id '24831':
38,112,004,991 cycles
52,984,222,101 instructions # 1.39 insn per cycle
812,113,992 cache-misses
204,113,100 LLC-load-misses
10.003221861 seconds time elapsed
Meaning: High LLC misses and low-ish IPC can indicate memory pressure. Pair this with NUMA stats to distinguish “memory-bound” from “bad placement.”
Decision: If LLC misses spike when threads spread across nodes, re-pin; if misses are inherent, redesign data layout or caching strategy.
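To confirm cache line bouncing directly instead of inferring it from miss counters, perf has a dedicated shared-cacheline mode. A sketch, assuming the same PID; note that on AMD this path depends on IBS support in your kernel and CPU:

cr0x@server:~$ sudo perf c2c record -p 24831 -- sleep 10
cr0x@server:~$ sudo perf c2c report --stdio | head -n 40

The report ranks cache lines by contention; shared locks and hot counters that bounce between CCDs tend to float to the top.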
Task 14: Check for Linux automatic NUMA balancing behavior
cr0x@server:~$ cat /proc/sys/kernel/numa_balancing
1
Meaning: 1 means automatic NUMA balancing is enabled. It can help general workloads, but it can also cause unpredictable migrations for latency-critical services.
Decision: If you do explicit pinning, consider disabling it (system-wide or via workload isolation) and measure before/after.
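A minimal sketch of the toggle for hosts where placement is already explicit:

cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0
kernel.numa_balancing = 0
cr0x@server:~$ echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-numa-balancing.conf
kernel.numa_balancing = 0

The file name is a convention, not a requirement; the point is that the setting survives a reboot and is visible in config review.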
Task 15: Verify transparent hugepages status (latency vs throughput trade)
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
Meaning: THP is always on. This can be good for throughput, but can add latency spikes during collapse/defrag on some workloads.
Decision: For tail-latency-sensitive services, benchmark with THP=never or madvise and decide based on p99, not average.
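A sketch of the runtime switch for that benchmark (it does not persist across reboots, which is exactly what you want while comparing):

cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
madvise
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

Run the same load before and after, and let p99 make the decision.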
Task 16: Check memory bandwidth saturation quickly (vmstat + mpstat)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 8123456 123456 987654 0 0 1 3 900 2200 45 8 45 2 0
8 0 0 8012345 123456 988000 0 0 0 0 1100 5200 62 10 28 0 0
9 0 0 7999999 123456 987900 0 0 0 0 1200 6000 65 11 24 0 0
7 0 0 7988888 123456 987800 0 0 0 0 1180 5900 64 10 26 0 0
Meaning: High runnable threads (r) with low iowait (wa) suggests CPU/memory pressure rather than storage. Pair with perf/numastat to see if it’s bandwidth or latency.
Decision: If CPU is busy but IPC is low and NUMA misses rise, fix placement; if not, consider reducing concurrency or improving cache behavior.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A team moved a latency-sensitive API from an older dual-socket Intel host to a single-socket EPYC box. The migration plan was simple:
“Same cores, same RAM, fewer sockets, so it must be simpler.” Their load test even looked okay—at first.
Production didn’t. p95 held, p99 spiked, and the on-call got the classic alert combo: elevated request latency, normal CPU utilization, no obvious I/O saturation.
The dashboard looked calm in the way a quiet forest looks calm right before you realize you’re lost.
The wrong assumption was treating “one socket” as “uniform.” The service had a big in-memory cache and a handful of very hot mutexes.
The scheduler happily migrated threads across NUMA nodes within the socket, memory allocations drifted, and cache lines bounced across the fabric.
Average latency didn’t scream. Tail latency did.
The fix was boring but decisive: pin the service to a NUMA node, bind memory to that node, and run two instances instead of one big instance.
They also moved NIC interrupts to the same node. The p99 stabilized. The team learned a new kind of humility: topology humility.
Postmortem action item that mattered: add NUMA and IRQ affinity checks to the readiness checklist, not to the “advanced tuning” wiki page nobody reads.
Mini-story 2: The optimization that backfired
A storage-heavy service (think: metadata, checksums, compression) was CPU-bound during peak. Someone proposed a simple change:
“Let’s spread worker threads across all cores to maximize parallelism.” They also enabled aggressive auto-scaling based on CPU, because of course they did.
Throughput improved in microbenchmarks. The graphs were celebratory. Then the weekly batch job hit, and everything went sideways:
queueing delays rose, p99 latency doubled, and the system behaved like it had a random number generator in the scheduler.
The backfire was cross-die contention. Spreading workers across CCDs increased parallelism, yes, but it also increased shared-state traffic.
The “global” queue and a few shared hash tables became coherence hotspots. The interconnect did its job; the workload punished it for being helpful.
The fix was counterintuitive if you only believe in core counts: reduce cross-die chatter by sharding queues per NUMA node and pinning worker pools.
Some workers went idle, and total CPU utilization dropped. Latency improved. Throughput remained acceptable. The graphs became less exciting,
which is what you want in production.
The real lesson: on chiplet CPUs, “more cores” is not a synonym for “more shared state.” If your algorithm assumes cheap sharing, you’re holding a grenade by the pin.
Mini-story 3: The boring but correct practice that saved the day
Another org ran a mixed fleet: Ryzen workstations for CI and EPYC servers for production. They had a habit that looked painfully dull:
every new hardware batch got a standardized topology validation run, and the results were attached to the asset record.
One quarter, a batch of servers arrived with a subtle BIOS misconfiguration from the vendor: memory interleaving and a power profile that favored
efficiency over deterministic latency. Nothing was “broken.” Nothing failed POST. The machines even passed basic burn-in.
But their validation run caught it: memory latency was higher than the previous batch, and under load the frequency behavior was inconsistent.
Because they had baselines, they had proof. They didn’t argue from vibes; they argued from measurements.
They fixed it before production: adjusted BIOS profiles, standardized firmware, and re-ran the tests. The batch joined the fleet quietly.
No incident. No emergency meeting. No late-night “why is p99 drifting?” archaeology.
Boring practice wins because it scales. Heroics don’t. If you want reliability, institutionalize the unglamorous checks.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 latency spikes while CPU% looks fine
Root cause: threads migrate across NUMA nodes; memory allocations become remote; coherence traffic increases across CCDs/IOD.
Fix: pin CPU and memory (cpuset/numactl), shard per NUMA node, and validate with numastat -p and /proc/vmstat NUMA counters.
2) Symptom: one core is pegged, softirq is high, network latency is jittery
Root cause: IRQ affinity collapsed onto a single CPU (often CPU0), or queues are misconfigured.
Fix: distribute interrupts, align queues with NUMA node locality, verify with /proc/interrupts and device NUMA node in sysfs.
3) Symptom: NVMe throughput is lower than expected after hardware change
Root cause: PCIe link trained down (speed downgrade), wrong slot, or BIOS settings limiting link speed.
Fix: check lspci -vv LnkSta vs LnkCap; correct slot/riser/BIOS; re-test before touching filesystem knobs.
4) Symptom: performance varies run-to-run with same workload
Root cause: automatic NUMA balancing, scheduler drift, or different initial allocation patterns.
Fix: enforce placement; disable/adjust NUMA balancing for the service; validate by repeating a pinned benchmark run.
5) Symptom: “More threads” reduces throughput
Root cause: cross-die lock contention; shared queues; cache line ping-pong.
Fix: shard state per NUMA node/CCD; reduce global locks; use per-node worker pools; measure LLC misses and lock contention.
6) Symptom: stable throughput, but periodic long stalls
Root cause: THP collapse/defrag, memory reclaim, or frequency/power transitions.
Fix: test THP=never/madvise; ensure enough headroom; set deterministic power profiles for latency-sensitive systems.
7) Symptom: database looks “CPU bound” but IPC is low
Root cause: actually memory latency bound; remote memory; poor locality across chiplets.
Fix: pin and bind memory; move hottest data into cache-friendly layouts; measure with perf + NUMA stats.
Checklists / step-by-step plan for chiplet-friendly deployments
Step-by-step plan: new Ryzen/EPYC node admission
- Inventory topology: record lscpu, NUMA node count, core/thread counts.
- Validate memory population: use dmidecode; confirm expected speed and balanced channels.
- Baseline NUMA behavior: capture the numactl --hardware distance matrix and per-node memory sizes.
- Baseline frequency behavior: measure under load with turbostat and record “normal” ranges.
- Validate PCIe links: check NIC and NVMe negotiated speed/width with lspci -vv.
- Confirm device locality: record device NUMA node via sysfs for top NICs and NVMe.
- Set IRQ strategy: decide irqbalance vs manual affinity; document it and test it.
- Define workload placement: decide which services need pinning and which can float.
- Establish a test: run one pinned memory-touch or bandwidth test per node and store results.
- Roll into production gradually: canary with representative workloads; watch p99 and NUMA misses.
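A minimal sketch of an admission capture script for the steps above; the file layout and naming are assumptions to adapt to your fleet tooling:

#!/usr/bin/env bash
# baseline.sh -- capture topology, memory, PCIe, and IRQ baselines for the asset record
set -euo pipefail
out="baseline-$(hostname)-$(date +%Y%m%d).txt"
{
  echo "== CPU topology ==";         lscpu
  echo "== NUMA layout ==";          numactl --hardware
  echo "== DIMM population ==";      sudo dmidecode -t memory | egrep 'Locator:|Size:|Configured Memory Speed:'
  echo "== PCIe link status ==";     sudo lspci -vv 2>/dev/null | egrep '^[0-9a-f]{2}:|LnkCap:|LnkSta:'
  echo "== IRQ snapshot ==";         cat /proc/interrupts
  echo "== NUMA balancing / THP =="; cat /proc/sys/kernel/numa_balancing /sys/kernel/mm/transparent_hugepage/enabled
} > "$out"
echo "wrote $out"

Store the output next to the asset record and diff it against the previous batch; the diff is what catches the quiet BIOS regressions from mini-story 3.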
Checklist: when a chiplet host has unexplained latency
- Do we have unexpected NUMA nodes or uneven node memory sizes?
- Are IRQs concentrated on a small CPU set?
- Is the NIC/NVMe on a different NUMA node than the busiest threads?
- Is the PCIe link trained down?
- Are numa_miss/numa_foreign rising during the incident?
- Is the scheduler allowed to migrate the service across nodes?
- Did firmware or BIOS settings change (power profile, memory interleaving, SMT)?
Checklist: designing for predictable performance
- Shard by NUMA node when state is large and access is frequent.
- Keep IRQs local to the compute handling packets and I/O completions.
- Avoid global locks that force cross-die coherence traffic.
- Prefer bounded concurrency over “use all cores” when p99 matters.
- Benchmark with pinning on and off to see how sensitive you are to topology.
FAQ
1) Are chiplets always faster than monolithic CPUs?
No. Chiplets often win on price/performance and scalability. Monolithic designs can win on uniform latency and some cache-coherent sharing patterns.
You choose based on workload behavior, not marketing adjectives.
2) What’s the single biggest operational difference with chiplet CPUs?
Topology becomes a performance feature. NUMA and cache domain awareness matter in places where you previously got away with ignoring them.
3) Why did AMD split compute and I/O into different dies?
Because it’s economically and technically sane: CPU cores benefit from cutting-edge nodes; I/O benefits from mature nodes and stable analog behavior.
Splitting them improves yields and reduces risk per generation.
4) What’s the most common “chiplet tax” in real systems?
Remote memory access and cross-die cache line bouncing. Both show up as tail latency, not necessarily as average slowdown.
5) Should I disable SMT on Ryzen/EPYC for latency?
Sometimes. SMT can improve throughput but can worsen tail latency when contention is high or when you’re already topology-sensitive.
Measure with production-like load; don’t cargo-cult it.
6) Is automatic NUMA balancing good or bad?
Good for general-purpose hosts. Risky for carefully pinned, latency-sensitive services because it can migrate pages in ways that create jitter.
If you pin explicitly, consider disabling it for those hosts or services—after measurement.
7) Why do two “identical” Ryzen systems benchmark differently?
Common reasons: different memory population (channels), different BIOS power profiles, different fabric/memory coupling settings,
PCIe link negotiation issues, and different scheduler/IRQ affinity states.
8) How do chiplets affect storage workloads specifically?
Storage stacks mix CPU, memory, and PCIe. If your NVMe and NIC interrupts land far from the threads doing compression/checksums,
you pay extra latency and coherence overhead. Align device locality, IRQ affinity, and worker placement.
9) Do chiplets make virtualization harder?
Not harder, but less forgiving. If you oversubscribe and let vCPUs float across NUMA domains, you can get noisy-neighbor effects and unpredictable p99.
NUMA-aware VM placement and CPU pinning help.
10) What should I baseline on every new chiplet platform?
NUMA topology and memory distribution, PCIe link status, frequency behavior under load, and an IRQ distribution snapshot.
If you can’t detect drift, you can’t prevent it.
Conclusion: next steps that actually reduce risk
AMD’s chiplets resurrected Ryzen by changing the manufacturing equation: smaller core dies, better yields, modular scaling, and faster iteration.
That business decision turned into an operational reality: topology is now part of performance, not a footnote.
Next steps you should take this week, not “someday”:
- Pick one latency-sensitive service and measure NUMA placement today: numastat -p, /proc/vmstat NUMA counters, and /proc/interrupts.
- Canary a pinning strategy: bind CPU + memory to a node, align IRQs, compare p99. Keep the change small and measurable.
- Institutionalize a hardware admission test: topology, memory speeds, PCIe link training, and a frequency sanity check.
- Stop trusting averages: chiplet performance problems are often tail problems wearing an average’s disguise.
Chiplets are not a trap. They’re a deal: you get lots of cores at a sane price, and in return you agree to treat topology like it’s real.
Sign the contract. Your p99 will thank you.