From monoliths to chiplets: why modern CPUs look like LEGO

The incident ticket always reads the same: “p99 latency doubled after the refresh; CPU utilization is lower than before; nothing makes sense.”
Then you log into the new fleet and discover the “CPU” isn’t one thing anymore. It’s a small city: multiple dies, multiple memory controllers,
multiple caches, and an interconnect doing rush-hour traffic between them.

Chiplets didn’t just change how CPUs are built. They changed what “same CPU model” even means in production. If you buy, schedule, and tune like it’s 2012,
you’ll get 2012-grade surprises—just faster and more expensive.

Why CPUs went LEGO

“Monolithic” used to be a compliment. One die, one package, one cache hierarchy, one set of rules.
You could pretend the silicon was a flat plane where any core could reach any byte of memory at roughly the same cost.
That fiction died for the same reason most elegant theories die: money, physics, and scheduling.

A chiplet CPU is built from multiple smaller dies (chiplets) assembled into one package.
Some chiplets hold CPU cores and caches. Another might hold the memory controllers and I/O (PCIe, USB, SATA, CXL).
The chiplets talk over an on-package interconnect.
The result is modular: vendors can mix and match building blocks instead of taping out one enormous, fragile slab of silicon.

Think of chiplets less like “multiple CPUs in a trench coat” and more like a motherboard’s worth of subsystems shrunk into a single package.
You get scale and flexibility, but you also get topology. And topology is where performance goes to either become impressive or become a ticket.

Interesting facts and historical context (the stuff that explains today’s mess)

  • Big dies have brutal yield curves. As die area grows, the chance a defect kills the die rises; chiplets keep dies smaller and yields higher.
  • Multi-chip modules aren’t new. Packaging multiple dies in one module has existed for decades, but modern interconnect bandwidth makes it mainstream.
  • “Glue logic” became a feature. Early multi-die approaches were often seen as compromises; today vendors design around it deliberately.
  • Memory controllers moved on-die in the 2000s. That helped latency, but also set the stage for NUMA and per-socket topology complexity.
  • 2.5D packaging (silicon interposers) changed the game. It enabled high-bandwidth links between dies without a traditional PCB-level penalty.
  • HBM made chiplet thinking normal. High Bandwidth Memory stacks on-package pushed everyone to treat packaging as part of architecture.
  • Chiplets allow mixing process nodes. CPU cores might be on a leading-edge node, while I/O stays on a mature node that’s cheaper and often better for analog.
  • Standardization efforts exist, but reality is messy. Industry wants interoperable chiplets; vendors still ship tightly integrated ecosystems first.

Joke #1: If you miss the simplicity of monolithic CPUs, you can still find it in nature—inside a single-celled organism, also known as your staging environment.

Chiplets in practice: CCDs, IODs, tiles, and interconnects

Vendors differ in branding, but the pattern is consistent: separate the high-performance compute bits from the “plumbing” bits.
Compute loves the newest process node (fast transistors, dense caches). I/O loves stable nodes (good analog characteristics, high voltage tolerance, cheaper wafers).

Common building blocks

  • Compute chiplets: CPU cores and their near caches (L1/L2) plus a shared cache slice (often L3).
  • I/O die (IOD): memory controllers, PCIe/CXL controllers, fabric routers, sometimes integrated accelerators.
  • Interconnect: the on-package network that lets chiplets share memory and cache coherence. It defines your “local” and “remote” costs.
  • Package substrate / interposer: the physical medium carrying signals. The more advanced it is, the more it can act like a tiny high-speed backplane.

What you gain

You gain manufacturing flexibility. If a compute chiplet is defective, you toss that small die, not a giant monolith.
You can also build a product stack by populating the package with different counts of compute chiplets—same I/O die, same socket, different SKU.

What you pay

You pay in latency and in “non-uniformity.” Two cores might be the same microarchitecture, but not the same distance from memory.
A cache line might live in a different chiplet’s L3 slice. A thread might bounce between chiplets if the scheduler or your app is careless.

In ops terms: chiplets are a throughput machine that can become a tail-latency machine if you don’t respect locality.

The economics: yield, binning, and why big dies hurt

Chiplets aren’t primarily a performance story. They’re a business story with performance consequences.
The biggest lever in semiconductor cost is how many good dies you get per wafer.
Wafer cost rises with advanced nodes; defects don’t politely scale down.

Yield, simplified (without lying too much)

A wafer has a defect density. A die has an area. Larger area increases probability that any given die intersects a defect.
That’s why big monolithic dies are expensive even before packaging: you throw away more silicon.

With chiplets, you accept that some chiplets are bad and some are good, and you assemble good ones into products.
You can also bin chiplets by achievable frequency or power. The result is a more efficient use of what the fab produces.
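
A back-of-the-envelope example, using the common Poisson yield approximation yield ≈ exp(-defect_density × die_area) (a simplification, not any vendor's real model): at 0.1 defects/cm², a 600 mm² monolithic die comes out roughly exp(-0.6) ≈ 55% good, while a 150 mm² chiplet comes out about exp(-0.15) ≈ 86% good. Four small dies that mostly survive beat one big die that often doesn't.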

Why mixing nodes is practical engineering, not just accounting

Cutting-edge nodes are great for dense logic and caches, but they’re not automatically great for every circuit.
PHYs and analog blocks often behave better on mature nodes. Also: mature nodes can have better supply chain availability.
When the I/O die stays on a mature node, you reduce risk and can keep shipping even when cutting-edge capacity is constrained.

The procurement takeaway: “same socket” no longer implies “same performance.” Two SKUs might share a name and a platform,
but the chiplet count, cache layout, or I/O die revision can change behavior in ways your benchmarks won’t catch unless you look.

Performance reality: latency, bandwidth, and topology

Chiplets make the CPU package look less like a uniform slab and more like a small NUMA system.
Even within one socket, you can have multiple memory domains, multiple L3 islands, and a fabric in between.
Your bottleneck is often not “CPU” but “CPU plus where the data lives.”

Latency: the tax you pay when data is “over there”

Latency sensitivity shows up first in p95/p99. Throughput workloads can hide it with batching and parallelism.
Interactive services can’t. If your request path touches shared state with poor locality, chiplets will surface that cost quickly.

Typical latency traps:

  • Remote memory access: a core loads from memory attached to a different NUMA node; it’s slower and can be more variable.
  • Cross-chiplet cache coherency traffic: false sharing and frequent writes make the interconnect do work you didn’t budget for.
  • Thread migration: the scheduler moves your thread; its hot working set no longer sits in the “near” caches.

Bandwidth: chiplets can be huge, but it’s not infinite

A common failure mode is assuming the on-package fabric is “basically as good as” a monolithic die.
It’s good, but it’s not free. You can saturate it with:

  • All-to-all communication patterns (barriers, shared queues, distributed locks).
  • Memory copy-heavy workloads (serialization, compression staging, encryption with poor buffer reuse).
  • High core counts doing the same thing to the same memory region.

Cache topology: L3 is not a single magical pool anymore

Many chiplet designs expose multiple L3 slices with faster access locally and slower access remotely.
Your app doesn’t see “L3 = 96 MB” as one homogeneous lake; it sees a set of ponds connected by canals.
If your hot set fits in one pond but you keep rowing between ponds, you’ll still drown.

Power and boost: chiplets complicate “why is this slower today?”

Modern CPUs juggle per-core boosting, socket power limits, temperature, and sometimes per-chiplet limits.
Adding chiplets increases potential peak throughput, but you can’t always boost everything at once.
SRE takeaway: after a refresh, “more cores” can mean “lower per-core turbo under sustained load,” which changes latency.

Quote (paraphrasing Werner Vogels): build systems expecting failure, and design for resilience rather than assuming components behave nicely.

Joke #2: Chiplets are like microservices—great until you realize you’ve invented a network, and now you’re debugging it at 5 a.m.

What changes for SREs and platform teams

In a monolithic world, you could often get away with “Linux will schedule it” and “the database will handle it.”
In a chiplet world, the defaults are decent, but the defaults are not your workload.
If you care about tail latency, you must actively manage locality: CPU pinning, NUMA-aware memory allocation, IRQ affinity, and sane BIOS settings.
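
A minimal sketch of what explicit placement looks like at launch time; the binary path and node number are placeholders for your own service and topology:

cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 /usr/local/bin/myservice

The same intent can be expressed more durably in systemd or your orchestrator, which the tasks below walk toward.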

What to demand from vendors and procurement

  • Full topology disclosure: number of NUMA nodes per socket, memory channels per node, cache layout, and interconnect bandwidth class.
  • Consistent SKU mapping: if a “minor” stepping changes I/O die or memory behavior, you want release notes and validation time.
  • Power behavior under sustained load: turbo policies matter more than spec-sheet base clocks.

What to demand from your own org

  • Benchmarking that matches production: synthetic CPU benchmarks are entertainment, not evidence.
  • Topology-aware capacity planning: plan per NUMA node, not just per socket.
  • Scheduling and pinning standards: container orchestrators need guidance, not wishful thinking.

Practical tasks: commands, outputs, what it means, and what you decide

These are production-grade checks. Run them on a host with the “mystery performance regression.”
Each task includes a command, typical output, what the output means, and the decision you make.

Task 1: See the CPU and NUMA topology

cr0x@server:~$ lscpu
Architecture:                         x86_64
CPU(s):                               64
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            1
NUMA node(s):                         4
NUMA node0 CPU(s):                    0-15
NUMA node1 CPU(s):                    16-31
NUMA node2 CPU(s):                    32-47
NUMA node3 CPU(s):                    48-63
L3 cache:                             256 MiB

Meaning: One socket, but four NUMA nodes. That’s chiplet-style locality. Your “single socket” behaves like a small multiprocessor.

Decision: Treat this host as NUMA. Pin latency-sensitive services to a node and allocate memory locally.

Task 2: Verify NUMA distances (how “far” remote really is)

cr0x@server:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64000 MB
node 0 free: 52210 MB
node distances:
node   0   1   2   3
  0:  10  20  20  28
  1:  20  10  28  20
  2:  20  28  10  20
  3:  28  20  20  10

Meaning: Not all remote nodes are equal. “28” is notably worse than “20.” Some chiplets are farther apart.

Decision: For strict latency, keep threads and memory within the same node; for distributed work, prefer “nearer” pairs.

Task 3: Confirm kernel sees the right NUMA policy

cr0x@server:~$ cat /proc/sys/kernel/numa_balancing
1

Meaning: Automatic NUMA balancing is enabled. It may help general workloads, but it can also cause page migrations and latency spikes.

Decision: For a latency-critical service, consider disabling per-host or per-cgroup and manage affinity explicitly.
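
If you decide to turn it off host-wide, it's a one-line sysctl; treat this as a sketch and re-test p99 on a canary before rolling it out:

cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0

Persist it with a drop-in under /etc/sysctl.d/ only if the canary results hold.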

Task 4: Check current CPU frequency behavior (boost vs throttling)

cr0x@server:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgWatt,PkgTmp
Busy%  Bzy_MHz  PkgWatt  PkgTmp
72.18   3045     245.7     88

Meaning: Under load, cores are averaging ~3.0 GHz, package power is high, temperature is close to limits.

Decision: If p99 is worse than prior gen, verify cooling, power caps, and BIOS power settings before blaming code.
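
Two quick follow-ups before opening a hardware ticket; exact sysfs paths depend on the cpufreq driver, so treat these as a sketch:

cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cr0x@server:~$ cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || cat /sys/devices/system/cpu/cpufreq/boost

An unexpected governor or a disabled boost setting is a common, easily fixed culprit.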

Task 5: Detect memory bandwidth pressure

cr0x@server:~$ sudo perf stat -a -e cycles,instructions,cache-misses,LLC-load-misses,LLC-store-misses sleep 10
 Performance counter stats for 'system wide':

  38,220,118,992      cycles
  52,908,331,407      instructions              #    1.38  insn per cycle
   1,182,220,114      cache-misses
     992,110,332      LLC-load-misses
      88,330,901      LLC-store-misses

      10.001024001 seconds time elapsed

Meaning: High LLC miss counts suggest the workload is spilling out of cache and hammering memory (or remote cache lines).

Decision: Investigate locality: pin threads, reduce cross-thread sharing, and validate memory allocation per NUMA node.
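
If you suspect cross-chiplet cache-line bouncing specifically, perf c2c (where your perf build and CPU support it) is more direct than raw miss counters; a sketch:

cr0x@server:~$ sudo perf c2c record -a -- sleep 10
cr0x@server:~$ sudo perf c2c report --stdio | head -n 40

Look for heavily contended cache lines touched by threads running on different L3 islands.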

Task 6: Identify scheduler and runqueue pressure (are threads bouncing?)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server)  01/10/2026  _x86_64_  (64 CPU)

12:04:18 AM  CPU   %usr   %sys  %iowait  %irq  %soft  %idle
12:04:19 AM  all  58.12   9.44    0.05   0.10   0.65  31.64
12:04:19 AM   0  92.00   6.00    0.00   0.00   0.00   2.00
12:04:19 AM  16  12.00  18.00    0.00   0.00   1.00  69.00
12:04:19 AM  32  86.00   9.00    0.00   0.00   0.00   5.00
12:04:19 AM  48  14.00  22.00    0.00   0.00   2.00  60.00

Meaning: Some CPUs are hot, others are underused, and some show elevated system time. This can indicate IRQ imbalance or poor pinning.

Decision: Check IRQ affinity and ensure your service threads are pinned consistently to avoid migrations across chiplets.
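
To confirm migrations rather than infer them, count scheduler migration events for a short window; a sketch, assuming your perf build exposes the tracepoint:

cr0x@server:~$ sudo perf stat -a -e sched:sched_migrate_task -- sleep 10

A high event rate on a supposedly pinned service means your affinity settings aren't actually taking effect.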

Task 7: Inspect IRQ distribution (a classic hidden bottleneck)

cr0x@server:~$ cat /proc/interrupts | head
           CPU0       CPU1       CPU2       CPU3
  24:  98211233          0          0          0   PCI-MSI 524288-edge      eth0-TxRx-0
  25:         0          0          0          0   PCI-MSI 524289-edge      eth0-TxRx-1
  26:         0          0          0          0   PCI-MSI 524290-edge      eth0-TxRx-2
  27:         0          0          0          0   PCI-MSI 524291-edge      eth0-TxRx-3

Meaning: One core is handling almost all NIC interrupts. That core’s local chiplet might become the “network chiplet,” accidentally.

Decision: Enable irqbalance or set explicit affinity so interrupts are spread across intended cores (and ideally within the same NUMA node as the NIC).

Task 8: Check PCIe device locality (which NUMA node owns the NIC/NVMe)

cr0x@server:~$ cat /sys/class/net/eth0/device/numa_node
2

Meaning: The NIC is attached to NUMA node 2. If your network stack runs on node 0, you’re doing remote memory and remote DMA bookkeeping.

Decision: Pin network-heavy threads to node 2 (or move IRQs) and allocate buffers on node 2 where possible.
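
A sketch of moving the hot queue's interrupt onto the NIC's node; IRQ 24 and the CPU range come from the earlier outputs, so substitute your own, and note that a running irqbalance daemon may rewrite this:

cr0x@server:~$ echo 32-47 | sudo tee /proc/irq/24/smp_affinity_list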

Task 9: Confirm memory is actually local to the service

cr0x@server:~$ numastat -p 12345
Per-node process memory usage (in MB) for PID 12345 (myservice)
Node 0          1200.3
Node 1           980.1
Node 2          8200.7
Node 3           410.2
Total          10791.3

Meaning: The process is mostly on node 2, but still has sizable allocations on other nodes—potential cross-node access.

Decision: If this service is latency-sensitive, tighten CPU and memory binding (systemd, cgroups, taskset, numactl) so it doesn’t sprawl.
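
One way to express that binding declaratively is a systemd drop-in (systemctl edit myservice); the directives below need cgroup v2 and a reasonably recent systemd, and the values are illustrative:

[Service]
AllowedCPUs=32-47
AllowedMemoryNodes=2

Restart the service and re-run numastat -p to confirm the sprawl is gone.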

Task 10: Check for page migration activity (NUMA balancing side effects)

cr0x@server:~$ grep -E 'pgmigrate|numa' /proc/vmstat | head -n 10
pgmigrate_success 1822331
pgmigrate_fail 1122
numa_pte_updates 998122
numa_hint_faults 288111
numa_hint_faults_local 201994
numa_pages_migrated 155002

Meaning: The kernel is actively migrating pages. That can be good for throughput, but it can add jitter and contention.

Decision: If you see tail latency spikes, try controlling placement explicitly and reduce reliance on automatic migration.

Task 11: Check huge pages status (TLB pressure vs fragmentation)

cr0x@server:~$ grep -E 'HugePages|Hugepagesize' /proc/meminfo
HugePages_Total:       2048
HugePages_Free:        1980
HugePages_Rsvd:          12
Hugepagesize:        2048 kB

Meaning: Huge pages are available and mostly free; your service might not be using them, or it might reserve a few.

Decision: For memory-intensive services, validate whether huge pages reduce TLB misses; don’t enable blindly if allocation becomes fragmented across NUMA nodes.
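
To check whether a specific process actually gets large pages (transparent or explicit), a spot-check using the PID from earlier; requires a kernel with smaps_rollup (4.14+):

cr0x@server:~$ grep -E 'AnonHugePages|Hugetlb' /proc/12345/smaps_rollup

Near-zero values mean the reservation exists but the service isn't benefiting from it.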

Task 12: Validate cgroup CPU sets for a containerized workload

cr0x@server:~$ systemctl show myservice --property=CPUQuotaPerSecUSec --property=AllowedCPUs --property=AllowedMemoryNodes
CPUQuotaPerSecUSec=4s
AllowedCPUs=0-31
AllowedMemoryNodes=0-1

Meaning: The service is restricted to CPUs 0–31 and memory nodes 0–1. If the NIC/NVMe sits on node 2, you just built a remote-access machine.

Decision: Align CPU and memory nodes with device locality. Place the service where its I/O lives or move devices/IRQs accordingly.

Task 13: Quick remote vs local memory test (spot-check)

cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=0 bash -c 'dd if=/dev/zero of=/dev/null bs=1M count=4096'
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 0.642 s, 6.7 GB/s
cr0x@server:~$ sudo numactl --cpunodebind=0 --membind=3 bash -c 'dd if=/dev/zero of=/dev/null bs=1M count=4096'
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 0.811 s, 5.3 GB/s

Meaning: Remote memory binding reduces observed throughput. Real apps also pay extra latency, not just bandwidth loss.

Decision: If remote binding materially changes numbers, you have a locality-sensitive workload. Treat placement as a first-class requirement.
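
If sysbench is installed, a slightly less crude version of the same spot-check; still a sketch, not a calibrated benchmark:

cr0x@server:~$ numactl --cpunodebind=0 --membind=0 sysbench memory --memory-block-size=1M --memory-total-size=16G run
cr0x@server:~$ numactl --cpunodebind=0 --membind=3 sysbench memory --memory-block-size=1M --memory-total-size=16G run

Compare the reported transfer rates; a large local-vs-remote gap confirms the sensitivity.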

Task 14: Inspect cache and topology hints from sysfs

cr0x@server:~$ for c in 0 16 32 48; do echo "cpu$c:"; cat /sys/devices/system/cpu/cpu$c/cache/index3/shared_cpu_list; done
cpu0:
0-15
cpu16:
16-31
cpu32:
32-47
cpu48:
48-63

Meaning: Each group of 16 CPUs shares an L3—four L3 “islands.” That’s a chiplet/cache-cluster boundary you should respect.

Decision: Pin cooperating threads within one shared L3 island. Keep chatty thread pools together; separate noisy neighbors onto different islands.
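
A minimal sketch of that pinning with taskset; 16-31 is one shared-L3 group from the output above, and the binary name is a placeholder:

cr0x@server:~$ taskset -c 16-31 /usr/local/bin/worker-pool
cr0x@server:~$ taskset -cp 12345

Use numactl instead when memory placement matters as much as CPU placement.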

Fast diagnosis playbook: find the bottleneck before you argue about architecture

When a service regresses on chiplet-era CPUs, the fastest path is not a deep microarchitecture debate.
It’s disciplined triage: topology, locality, power, then code.

First: confirm what the machine actually is

  • Topology: run lscpu and numactl --hardware. If NUMA nodes > 1 per socket, treat it as topology-sensitive.
  • Cache islands: check shared L3 groups via sysfs (Task 14). This often predicts cross-chiplet penalties better than marketing names.
  • Device locality: check NIC/NVMe numa_node in sysfs. Devices anchored to one node can drag your whole workload remote.

Second: identify whether the pain is latency, bandwidth, or scheduling

  • Latency / jitter: look at p95/p99 vs mean. If mean is fine and tail is awful, suspect migration, remote memory, IRQ imbalance, or power throttling.
  • Bandwidth: use perf stat for LLC misses, watch for cache-miss explosions under load. Also check memory channel saturation if you have vendor tools.
  • Scheduling: use mpstat -P ALL. Hot cores and cold cores suggest pinning issues, IRQ hot spots, or uneven work distribution.

Third: validate policy and placement

  • NUMA balancing: check /proc/sys/kernel/numa_balancing and migration stats in /proc/vmstat.
  • cgroups / cpusets: ensure CPU and memory nodes align with the service’s I/O.
  • IRQ affinity: confirm interrupts are spread and local to the device node.

Fourth: only then, tune or refactor

  • If locality is the issue: pin threads, enforce memory policy, restructure allocations.
  • If bandwidth is the issue: reduce shared-state churn, avoid false sharing, reduce memcpy, batch, and consider compression/encryption strategy changes.
  • If power is the issue: fix cooling, power limits, and BIOS settings; don’t “optimize” code to compensate for a thermal problem.

Common mistakes (symptoms → root cause → fix)

1) Symptom: p99 latency regressed after a CPU refresh, but average latency improved

Root cause: Cross-chiplet migrations and remote memory accesses cause tail jitter.

Fix: Pin request-handling threads within one L3 island; bind memory to the same NUMA node; reduce thread migration (affinity, fewer runnable threads than cores, avoid oversized pools).

2) Symptom: “CPU is only 40% utilized” yet throughput caps early

Root cause: Memory bandwidth or fabric bandwidth is saturated; additional cores don’t help once you’re bandwidth-bound.

Fix: Use perf stat to confirm cache misses; reduce copy-heavy paths; profile allocator behavior; increase locality; consider per-NUMA sharding.

3) Symptom: network-heavy service shows one core pegged at 100% system time

Root cause: IRQ affinity concentrates interrupts on one CPU; device is on a different NUMA node than the service threads.

Fix: Spread IRQs across cores on the device’s NUMA node; align service CPU set to the same node; verify with /proc/interrupts and sysfs.

4) Symptom: performance varies wildly between identical-looking hosts

Root cause: BIOS settings differ (power limits, memory interleaving, SMT, NUMA settings); stepping differences change topology behavior.

Fix: Standardize BIOS profiles; track firmware versions; validate with a short topology/perf smoke test during provisioning.

5) Symptom: database gets slower as you add worker threads

Root cause: Lock contention and cache line bouncing across chiplets; false sharing in shared queues/counters.

Fix: Cap threads per L3 island; shard hot data per NUMA node; use per-core/per-node counters with periodic aggregation.

6) Symptom: containerized service “randomly” thrashes memory and stalls

Root cause: cpuset allows CPUs on one node, but memory allocations land on others via defaults, page migration, or shared host allocations.

Fix: Set both CPU and memory node policies in cgroups; verify with numastat -p; consider disabling automatic NUMA balancing for that workload.

Three corporate mini-stories from the chiplet era

Mini-story 1: The incident caused by a wrong assumption

A platform team rolled out a new “single-socket, high-core-count” server to replace older dual-socket machines. The pitch was simple:
fewer sockets means fewer NUMA headaches, and the chip had plenty of cores. The migration plan treated each host as one uniform pool of CPU.

Within a week, one customer-facing API started timing out in bursts. Not constant overload—bursts. The graphs were infuriating:
CPU utilization looked healthy, error rate spiked, and adding instances helped less than expected.
The on-call engineer did the usual things: checked GC, checked database, checked network.
Nothing obvious.

The breakthrough came from topology checks. The “single socket” exposed four NUMA nodes. The NIC sat on node 2.
The orchestrator pinned the pods to CPUs 0–31 (nodes 0 and 1) because that was the default CPU set on the host image.
Network interrupts were also concentrated on a CPU in node 2. The system was doing a complicated dance:
packets arrived on node 2, got handled by an IRQ core in node 2, queued into memory not consistently local,
then worker threads on nodes 0/1 pulled the work remotely. Cross-node traffic plus scheduling jitter made p99 miserable.

Fixing it was boring: align cpusets and memory nodes to the NIC locality, spread IRQs across that node, and keep request threads close to their buffers.
The service stabilized. The postmortem’s main lesson was also boring: “single socket” is not a synonym for “uniform memory.” On chiplet systems, it never was.

The follow-up action that mattered: the team added a provisioning gate that rejected hosts where device NUMA node didn’t match the intended CPU set.
Not glamorous, but it prevented a repeat.

Mini-story 2: The optimization that backfired

A data pipeline team optimized a hot path by increasing parallelism. They took a batch job that ran 32 worker threads and “scaled it”
to 128 threads on the new generation of high-core-count chiplet CPUs. Their reasoning was textbook: more cores, more threads, more throughput.

The first benchmark looked good—briefly. On short runs, throughput improved. On long runs, throughput degraded and became noisy.
The cluster’s aggregate performance became inconsistent, and the job’s wall clock time stopped improving.
Meanwhile, other services on the same hosts started complaining about latency even though “this is just a batch job.”

Root cause analysis showed interconnect and memory pressure, not CPU saturation. The workload had a shared work queue
and a handful of global counters updated frequently. On a monolithic die, that was annoying but tolerable.
On a chiplet topology with multiple L3 islands, it became a coherency traffic generator.
Add in thread migrations and you got exactly what the perf counters predicted: lots of cache misses, lots of cross-node chatter.

The “fix” was to reduce parallelism, not increase it. They capped threads per L3 island and introduced per-node queues and counters.
The job used fewer threads but completed faster and stopped bullying the rest of the node.
The painful lesson: chiplet-era CPUs punish sloppy shared-state design. Throwing threads at it isn’t scaling; it’s a denial strategy.

Afterward, the team changed their performance rubric: any optimization that increases cross-thread sharing must include a topology-aware benchmark,
otherwise it doesn’t ship. They also stopped celebrating “CPU utilization went up” as a win. Utilization is not throughput, and it’s definitely not tail latency.

Mini-story 3: The boring but correct practice that saved the day

An infrastructure group ran a mixed fleet: two CPU generations, multiple BIOS versions, and a rolling cadence of firmware updates.
They were not heroes. They were just organized.
Every host provisioned into a “burn-in” pipeline that collected topology, device locality, and a short set of performance counters under load.

One week, they noticed a subtle change: a subset of new hosts showed higher remote memory access ratios for the same synthetic placement test.
The machines weren’t failing, and nobody had filed a ticket. The pipeline caught it because the baseline included NUMA distance and a microbenchmark
pinned local vs remote.

The culprit was a BIOS profile drift. A well-meaning technician used a vendor “performance” preset that changed memory interleaving behavior.
It didn’t break the machine; it just shifted locality characteristics enough to matter to latency-sensitive services.
Without the burn-in pipeline, this would have landed as “random regressions” weeks later.

They rolled back the profile, re-imaged a handful of hosts, and carried on. No outage, no emergency meetings.
The practice that saved them wasn’t genius. It was consistency: treat topology and locality as part of configuration drift, and test for it like you test for disk health.

Checklists / step-by-step plan

Checklist: adopting chiplet-era CPUs without creating a latency horror show

  1. Inventory topology: record NUMA nodes per socket, shared L3 groups, memory channels, and device NUMA nodes.
  2. Define workload classes: latency-critical, throughput, batch, noisy background. Not everything gets the same placement rules.
  3. Pick placement policy:
    • Latency-critical: pin within one L3 island and one NUMA node.
    • Throughput: shard per NUMA node; scale out across nodes if data is partitionable.
    • Batch: isolate to specific nodes/cores; cap bandwidth hogs.
  4. Align I/O locality: place network-heavy or NVMe-heavy services on the NUMA node attached to those devices.
  5. Standardize BIOS and firmware: one profile, version tracked, changes rolled through canaries.
  6. Validate power behavior: verify sustained clocks under your real load, not just “it boosts once.”
  7. Build a smoke test: run a local-vs-remote memory test and a short perf counter capture during provisioning.
  8. Teach your scheduler: encode cpuset/mems policies in systemd units or orchestrator node labels and pod specs.
  9. Document failure modes: IRQ imbalance, remote memory, cross-node lock contention, thermal throttling.
  10. Roll out with canaries: compare p99, not just throughput. Require rollback criteria.

Step-by-step: fixing a locality regression on a live host

  1. Run lscpu and numactl --hardware; write down nodes, CPU ranges, distances.
  2. Find device locality for NIC/NVMe via sysfs numa_node.
  3. Check IRQ distribution (/proc/interrupts) and fix affinity if one core is overloaded.
  4. Check the service’s cpuset and memory node policy (systemd properties or cgroup files).
  5. Use numastat -p to confirm the service’s memory allocation matches its CPU placement.
  6. Disable automatic NUMA balancing for that service if it causes migration churn; re-test p99.
  7. Cap thread pools to fit within one L3 island when possible; avoid cross-island contention.
  8. Re-run a short load test and compare p50/p95/p99 plus perf counters.

FAQ

1) Are chiplets always slower than monolithic CPUs?

No. Chiplets often deliver better throughput, more cores, larger aggregate cache, and better economics. They’re “slower” only when your workload
pays remote-access taxes and coherency traffic you didn’t plan for—typically visible in tail latency.

2) Why not just make the interconnect so fast it doesn’t matter?

Because physics and power. Driving high-speed signals costs energy and generates heat. Also, latency is stubborn; you can buy bandwidth more easily than you can buy round-trip time.

3) Is NUMA the same thing as chiplets?

Not the same, but they rhyme. NUMA is the performance model: memory access time depends on which memory controller owns the page.
Chiplets are a packaging/architecture approach that often creates multiple memory domains and cache islands even within one socket.

4) If Linux sees one socket, why should I care?

Because “socket” is an accounting label. Linux may still expose multiple NUMA nodes and separate shared-cache groups.
Your workload cares about where memory pages are and which cores share cache, not how many physical sockets are present.

5) Do I need to pin everything to CPUs now?

Not everything. Pinning is a tool, not a religion. Use it for latency-critical services, high-throughput networking, and anything with tight cache locality.
For general-purpose stateless workloads, default scheduling can be fine—until you see p99 drift.

6) What’s the single biggest mistake teams make during refreshes?

Assuming the new CPU is just “the old CPU but faster.” Chiplets change topology, cache behavior, and power dynamics.
If you don’t measure locality and sustained clocks under your workload, you’re doing a refresh by vibes.

7) Does more L3 cache mean fewer problems?

More cache helps, but topology matters. If the cache is partitioned into islands and your threads bounce across islands, you can still miss “more often than you expect.”
Use shared L3 groups as a placement boundary.

8) How do chiplets affect virtualization and containers?

They raise the cost of sloppy placement. VMs and containers can end up with vCPUs spread across NUMA nodes while their memory sits elsewhere.
Fix it with CPU and memory policies (cpuset + mems), and validate with numastat and topology checks.

9) Do chiplets increase failure rates?

Not inherently. They change where risk lives: more complex packaging and interconnects, but often better yields and binning.
Operationally, your main “failure” is performance unpredictability, not outright hardware faults.

10) What should I benchmark to compare two chiplet CPUs?

Benchmark your workload with realistic concurrency, realistic data sizes, and realistic I/O. Record p99, not just average.
Also record topology (NUMA distances, L3 sharing) so you can explain differences rather than argue about them.

Next steps you can actually do this week

Chiplets aren’t a fad. They’re the default path for scaling core counts and mixing process nodes. Fighting that is like fighting gravity:
you can do it briefly, but you won’t look smart on the incident call.

  1. Add topology to your host facts: store lscpu, numactl --hardware, shared L3 groups, and device NUMA nodes in your inventory.
  2. Make placement explicit: for one latency-critical service, implement CPU and memory binding and document the rationale.
  3. Audit IRQ affinity on your busiest hosts. If one core is carrying the world, fix it and take the easy win.
  4. Create a “local vs remote” smoke test during provisioning so BIOS drift and weird steppings get caught before production traffic does.
  5. Rewrite one shared-state hotspot: replace a global counter/queue with per-NUMA or per-core sharding, then measure tail latency.

The goal isn’t to become a CPU architect. It’s to stop being surprised by physics you can measure in 30 seconds.
Chiplet CPUs look like LEGO because manufacturing and economics demanded it. Your job is to run them like they’re modular, because they are.
