You bought a shiny dual-socket box because “twice the CPUs” sounds like “twice the throughput.” Then your database p95 got worse, your storage stack started stuttering, and the perf team showed up with charts and quiet disappointment.
This is NUMA: Non-Uniform Memory Access. It’s not a bug. It’s the bill you pay for pretending a server is one big happy pool of CPUs and RAM.
What NUMA actually is (and what it is not)
NUMA means the machine is physically made of multiple “locality domains” (NUMA nodes). Each node is a chunk of CPU cores and memory that are close to each other. Accessing local memory is fast(ish). Accessing memory attached to the other socket is slower(ish). The “ish” matters because it’s not just latency; it’s also contention on the inter-socket fabric.
On a typical modern dual-socket x86 server:
- Each socket has its own memory controllers and channels. The DRAM sticks are wired to that socket.
- Sockets are connected by an interconnect (Intel UPI, AMD Infinity Fabric, or similar).
- PCIe devices are also physically attached to a socket via root complexes. A NIC or NVMe device is “closer” to one socket than the other.
So NUMA is not “a tuning option.” It’s a topology. Ignoring it means you’re letting the OS play traffic cop while you’re actively generating traffic jams.
Also: NUMA is not automatically a disaster. NUMA is fine when the workload is NUMA-friendly, the scheduler is given a fighting chance, and you don’t sabotage it with bad pinning or memory placement.
One quote to keep around: “Hope is not a strategy.” — Gen. Gordon R. Sullivan. In NUMA land, “the kernel will figure it out” is hope, not a strategy.
Why dual-socket isn’t “double fast”
Because the bottleneck moved. Adding a second socket increases peak compute and total memory capacity, but it also increases the number of ways to be slow. You’re not doubling a single resource; you’re stitching together two computers and telling Linux to pretend it’s one.
1) Memory locality becomes a performance dimension
In a single-socket machine, every core’s “local” memory is basically the same thing. In a dual-socket box, memory has an address, and that address has a home. When a thread on socket 0 frequently reads and writes pages allocated from node 1, every cache miss becomes a cross-socket trip. Cross-socket isn’t free; it’s a toll road at rush hour.
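If you want to feel that toll on your own hardware, a quick A/B is enough. This is a minimal sketch, assuming numactl and stress-ng are installed; the stressor count and duration are arbitrary, and exact numbers vary by platform:
cr0x@server:~$ # local: CPUs and memory both on node 0
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- stress-ng --stream 4 --timeout 15s --metrics-brief
cr0x@server:~$ # remote: same CPUs, memory forced onto node 1
cr0x@server:~$ numactl --cpunodebind=0 --membind=1 -- stress-ng --stream 4 --timeout 15s --metrics-brief
Same CPUs, same stressor; only the memory’s home node changes. Compare the bogo ops/s lines: the second run is typically slower and noisier, and that gap is your interconnect tax.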
2) Cache coherency traffic goes up
Multi-socket requires coherency across sockets. If you have shared data structures with lots of writes (locks, queues, hot counters, allocator metadata), you get inter-socket ping-pong. Sometimes the CPU is “busy” but the useful work per cycle tanks.
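If you suspect cross-socket cache-line ping-pong, perf has a dedicated tool for it. A sketch, assuming your CPU and perf build support c2c sampling (it is not available on every platform):
cr0x@server:~$ sudo perf c2c record -a -- sleep 10
cr0x@server:~$ sudo perf c2c report --stdio | head -40
Look for cache lines with heavy HITM counts touched from both nodes; those are usually your contended locks, hot counters, and queue heads.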
3) PCIe locality matters more than you think
Your NIC is connected to a specific socket. Your NVMe HBA is connected to a specific socket. If your workload is scheduled mainly on the other socket, you’ve invented an extra internal hop for every packet or IO completion. On high IOPS or high packet-rate systems, that hop becomes a tax.
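The quickest way to see the whole picture, CPUs, memory, and devices in one tree, is hwloc. A sketch, assuming the hwloc package is installed (the text-mode binary may just be lstopo on some distros), and the amount of IO detail shown depends on hwloc’s version and filters:
cr0x@server:~$ lstopo-no-graphics
NICs and NVMe controllers show up nested under the package/NUMA node they are attached to, which is exactly the attachment your thread and IRQ placement needs to respect.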
4) You can easily make it worse with “optimizations”
Pinning threads sounds disciplined until you pin CPUs but not memory, or you pin interrupts to the wrong socket, or you force everything onto node 0 because it “worked in staging.” NUMA misconfiguration is the rare problem where doing something is often worse than doing nothing.
Joke #1: Dual-socket servers are like open-plan offices: you gained “capacity,” but now everything important involves walking across the room.
Facts and history you can use at work
These aren’t trivia-night facts. They’re the kind you use to stop a bad purchase decision or win an argument with someone holding a spreadsheet.
- NUMA predates your cloud. Commercial NUMA designs existed decades ago in high-end systems; the idea is older than most modern performance tooling.
- SMP stopped scaling “for free.” Uniform memory access was simpler, but it hit physical and electrical limits as core counts rose and memory bandwidth didn’t keep up linearly.
- Integrated memory controllers changed everything. Moving memory controllers onto the CPU made memory latency better, but also made “which CPU owns the memory” an unavoidable question.
- Inter-socket links have evolved, but they’re still slower than local DRAM. UPI/Infinity Fabric are fast, but they are not a replacement for local channels. They also carry coherency traffic.
- NUMA isn’t just memory. Linux uses the same topology to think about PCIe devices, interrupts, and scheduling domains. Locality is a whole-machine property.
- Virtualization didn’t remove NUMA; it made it easier to hide. Hypervisors can expose virtual NUMA, but you can also accidentally build a VM that spans sockets and then wonder why it jitters.
- Transparent Huge Pages interact with NUMA. THP can reduce TLB overhead, but it can also make page placement and migration more expensive when the system is under pressure.
- Early-boot placement decisions matter. Many allocators and services allocate a lot of memory at startup; “where it lands” can determine performance for hours.
- Storage stacks are NUMA-sensitive at high throughput. NVMe queues, softirq processing, and userspace polling loops love locality; crossing sockets adds jitter and lowers ceiling.
How it fails in production: the failure modes
NUMA problems don’t usually look like “NUMA problem.” They look like:
- p95 and p99 latency drift up while average throughput looks “fine.”
- CPU is high but IPC is low and the box feels like it’s running through mud.
- One socket is busy, the other is bored because the scheduler did what you asked, not what you meant.
- Remote memory access climbs and now you’re paying cross-socket for cache misses.
- IRQ imbalance makes networking or NVMe completion processing pile up on the “wrong” cores.
- Unstable performance under load because page migration and reclaim kick in when memory is imbalanced across nodes.
NUMA performance issues are often multiplicative. Remote memory access adds latency; that increases lock hold times; that increases contention; that increases context switching; that increases cache misses; that increases remote access. That spiral is why “it was fine yesterday” is a common opener.
Fast diagnosis playbook (first/second/third)
When the system is slow and you suspect NUMA, don’t start with heroic benchmarking. Start with topology and placement. You’re trying to answer one question: are the CPUs, memory, and IO paths on the same node for the hot work?
First: confirm topology and whether you are spanning sockets
- How many NUMA nodes exist?
- Which CPUs belong to each node?
- Is your workload pinned or constrained by cgroups/cpuset?
Second: check memory locality and remote access
- Is most memory allocated on node 0 while threads run on node 1 (or vice versa)?
- Are NUMA “miss” and “foreign” counters rising?
- Is the kernel migrating pages a lot?
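For the migration question, /proc/vmstat is the fastest signal. A sketch; the exact counter set depends on kernel version and config:
cr0x@server:~$ grep -E 'numa_|pgmigrate' /proc/vmstat
cr0x@server:~$ sleep 30; grep -E 'numa_|pgmigrate' /proc/vmstat
Take two samples and compare the deltas. Rapidly climbing numa_hint_faults and numa_pages_migrated mean automatic balancing is actively shuffling pages during your slowdown window.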
Third: check PCIe and interrupts locality
- Where is the NIC/NVMe device attached (NUMA node)?
- Are its interrupts landing on CPUs on the same node?
- Are queues spread sensibly across cores near the device?
If you do those three steps, you’ll find the cause of a large fraction of “mysterious” dual-socket regressions in under 20 minutes. Not all of them. Enough to save your weekend.
Hands-on: practical NUMA tasks with commands
These are the tasks I actually run when someone says “the new dual-socket servers are slower.” Each task includes the command, example output, what it means, and the decision you make from it.
Task 1: See NUMA nodes, CPU mapping, and memory size
cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 257728 MB
node 0 free: 18240 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257728 MB
node 1 free: 214912 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What it means: Two nodes. The distance matrix shows remote access costs more than local. Note the free memory imbalance: node 0 is tight; node 1 is mostly free.
Decision: If your workload is mostly on CPUs 0–15 and node 0 is nearly full, you’re at high risk of remote allocations or reclaim. Plan to rebalance memory placement or CPU placement.
Task 2: Check which node a PCIe device is local to
cr0x@server:~$ cat /sys/class/net/ens5f0/device/numa_node
1
What it means: The NIC is attached to NUMA node 1.
Decision: For high packet-rate workloads, prefer running the network-heavy threads on node 1 CPUs, and steer IRQs there.
Task 3: Map NVMe devices to NUMA nodes
cr0x@server:~$ for d in /sys/class/nvme/nvme*; do echo -n "$(basename $d) "; cat $d/device/numa_node; done
nvme0 0
nvme1 0
What it means: Both NVMe controllers are local to node 0.
Decision: Keep your hottest IO submission/completion threads on node 0 if you’re chasing low latency, or at least align IO processing cores with node 0.
Task 4: See per-node memory allocations and NUMA hit/miss counters
cr0x@server:~$ numastat
                           node0           node1
numa_hit              1245067890       703112340
numa_miss               85467123        92133002
numa_foreign            92133002        85467123
interleave_hit             10234           11022
local_node            1231123456       688000112
other_node             100000000       110000000
What it means: numa_miss counts pages that ended up on a different node than the kernel intended; other_node counts pages allocated on a node while the requesting CPU was on another node. Some remote traffic is normal; a lot, or a steady climb, is a sign of misplacement.
Decision: If remote traffic is growing during the slowdown window, focus on CPU/memory binding or reclaim/migration pressure before touching application code.
Task 5: Inspect a process’s NUMA memory map (per-node RSS)
cr0x@server:~$ pidof postgres
2481
cr0x@server:~$ numastat -p 2481
Per-node process memory usage (in MBs) for PID 2481 (postgres)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Private                  62048.1          1892.0         63940.1
Heap                     41000.0           256.0         41256.0
Stack                       32.0            16.0            48.0
Huge                         0.0             0.0             0.0
----------------  --------------- --------------- ---------------
Total                   103080.1          2164.0        105244.1
What it means: This process’s memory is overwhelmingly on node 0.
Decision: Ensure the busiest Postgres worker processes run primarily on node 0 CPUs, or explicitly bind memory allocation to the node you intend to run on.
Task 6: Check CPU affinity of a process (did someone pin it?)
cr0x@server:~$ taskset -cp 2481
pid 2481's current affinity list: 16-31
What it means: The process is pinned to node 1 CPUs, but in Task 5 its memory is on node 0. That’s classic remote access pain.
Decision: Either move the process to CPUs 0–15 or rebuild memory locality (restart with proper binding, or use migration carefully if supported).
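If a restart is off the table, the numactl package ships a migratepages tool that can move a live process’s pages between nodes. A sketch for the Task 5/6 situation (memory on node 0, CPUs pinned to node 1); it only moves pages it can move, it adds load while it runs, and a restart with correct binding is still the cleaner fix:
cr0x@server:~$ sudo migratepages 2481 0 1
cr0x@server:~$ numastat -p 2481
Re-run numastat -p afterwards to confirm the per-node RSS actually moved.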
Task 7: Launch a workload with explicit CPU + memory binding
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- bash -c 'echo "bound"; sleep 1'
bound
What it means: This shell (and anything it launches) will run on node 1 and allocate memory from node 1.
Decision: Use this for targeted tests. If performance improves, you’ve confirmed locality as a bottleneck and can implement a durable placement strategy.
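One durable way to encode that placement, if the workload runs under systemd, is a unit drop-in. A sketch assuming systemd 243 or newer (NUMAPolicy=/NUMAMask= appeared there) and an example unit name; on older systemd, wrapping ExecStart with numactl achieves the same effect:
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/postgresql.service.d
cr0x@server:~$ sudo tee /etc/systemd/system/postgresql.service.d/numa.conf <<'EOF'
[Service]
# keep CPUs and memory on node 0 (adjust to your topology)
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
EOF
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart postgresql
CPU affinity and memory policy now travel with the unit, survive restarts, and show up in change review, which is where pinning decisions belong.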
Task 8: Check automatic NUMA balancing status
cr0x@server:~$ sysctl kernel.numa_balancing
kernel.numa_balancing = 1
What it means: The kernel may migrate pages between nodes to improve locality.
Decision: For some latency-sensitive workloads, automatic balancing can add jitter. If you already manage affinity explicitly, consider disabling it after testing.
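If testing shows balancing hurts more than it helps for your explicitly placed workload, the switch is a sysctl. A sketch; the file name under /etc/sysctl.d is arbitrary:
cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0
kernel.numa_balancing = 0
cr0x@server:~$ echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/90-numa.conf
Re-measure after the change, and if you later drop the explicit affinity management, remember to turn balancing back on.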
Task 9: Observe NUMA balancing activity in vmstat
cr0x@server:~$ vmstat -w 1 5
procs -------------------memory------------------ ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 18234000 10240 120000 0 0 12 45 610 1200 18 6 74 2 0
4 0 0 18190000 10240 120500 0 0 10 38 640 1600 22 7 68 3 0
6 0 0 18020000 10240 121100 0 0 11 40 780 2400 28 9 58 5 0
7 0 0 17800000 10240 121900 0 0 12 42 900 3000 31 10 52 7 0
8 0 0 17550000 10240 122800 0 0 15 50 1100 4200 35 11 46 8 0
What it means: Rising context switches (cs) and decreasing idle (id) can accompany migration/reclaim pressure. This is a coarse signal, not a NUMA proof.
Decision: If this correlates with higher remote memory stats, treat as “memory placement plus pressure” and look at per-node free memory and page scanning.
Task 10: Check per-node free memory directly
cr0x@server:~$ grep -hE 'Node [01] (MemTotal|MemFree|FilePages|Active|Inactive):' /sys/devices/system/node/node*/meminfo
Node 0 MemTotal: 263913472 kB
Node 0 MemFree: 1928396 kB
Node 0 FilePages: 2210040 kB
Node 0 Active: 92233728 kB
Node 0 Inactive: 70542336 kB
Node 1 MemTotal: 263913472 kB
Node 1 MemFree: 198223224 kB
Node 1 FilePages: 5110040 kB
Node 1 Active: 12133728 kB
Node 1 Inactive: 8542336 kB
What it means: Node 0 is nearly out of free memory while node 1 is swimming in it. The kernel will start allocating remotely or reclaiming aggressively on node 0.
Decision: Either move workload threads to node 1 and restart so memory allocates there, or spread allocations across nodes intentionally (interleave) if the workload tolerates it.
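If you choose the interleave route, test it the same way the other tasks test binding. A sketch using stress-ng as a stand-in workload; for a real service, the same --interleave flag goes in its launcher (or NUMAPolicy=interleave under systemd):
cr0x@server:~$ numactl --interleave=all -- stress-ng --vm 2 --vm-bytes 8G --timeout 20s --metrics-brief
Check numastat -p for the resulting PID while it runs: per-node RSS should come out roughly even, which spreads pressure instead of exhausting node 0 alone.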
Task 11: Check IRQ distribution and whether interrupts align with device locality
cr0x@server:~$ grep -E 'ens5f0|nvme' /proc/interrupts | head
132: 1203345 0 0 0 IR-PCI-MSI 524288-edge ens5f0-TxRx-0
133: 0 1189922 0 0 IR-PCI-MSI 524289-edge ens5f0-TxRx-1
134: 0 0 1191120 0 IR-PCI-MSI 524290-edge ens5f0-TxRx-2
135: 0 0 0 1210034 IR-PCI-MSI 524291-edge ens5f0-TxRx-3
What it means: Queues are distributed across CPUs (columns). Good sign. But you still need to verify those CPUs belong to the same NUMA node as the NIC.
Decision: If NIC is node 1 but most interrupts land on node 0 CPUs, adjust IRQ affinity (or let irqbalance do it if it’s doing the right thing).
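Steering an individual IRQ is a one-liner via /proc. A sketch using the first queue from the output above and node 1 CPUs from Task 12; note that irqbalance may rewrite this, so either configure it accordingly or stop it for manually managed IRQs:
cr0x@server:~$ echo 16-19 | sudo tee /proc/irq/132/smp_affinity_list
16-19
Repeat per queue (133, 134, 135), spreading queues across different CPUs on the NIC’s node rather than stacking them on one core.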
Task 12: Identify which CPUs are on which node (for IRQ affinity decisions)
cr0x@server:~$ lscpu | egrep 'NUMA node\(s\)|NUMA node0 CPU\(s\)|NUMA node1 CPU\(s\)'
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
What it means: Clear mapping of CPUs to nodes.
Decision: When pinning app threads or IRQs, keep the hot path on one node unless you have a good reason not to.
Task 13: Check cgroup cpuset constraints (containers love this trap)
cr0x@server:~$ systemctl is-active kubelet
active
cr0x@server:~$ cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.cpus
0-31
cr0x@server:~$ cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
0-1
What it means: Pods can run on all CPUs and allocate from both nodes. That’s flexible, but it can also turn into “random placement.”
Decision: For latency-critical pods, use a topology-aware policy (Guaranteed QoS, CPU Manager static, and NUMA-aware scheduling) so CPUs and memory stay aligned.
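To see whether the kubelet is even configured for that, check its config. A sketch; the path assumes a kubeadm-style install, and the relevant fields are cpuManagerPolicy, topologyManagerPolicy, and (where used) memoryManagerPolicy:
cr0x@server:~$ grep -E 'cpuManagerPolicy|topologyManagerPolicy|memoryManagerPolicy' /var/lib/kubelet/config.yaml
For latency-critical pods you generally want cpuManagerPolicy: static plus topologyManagerPolicy: restricted or single-numa-node, and the pod must be Guaranteed QoS with integer CPU requests for the static policy to apply.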
Task 14: Inspect a running process’s allowed NUMA nodes
cr0x@server:~$ cat /proc/2481/status | egrep 'Cpus_allowed_list|Mems_allowed_list'
Cpus_allowed_list: 16-31
Mems_allowed_list: 0-1
What it means: Process can allocate memory on both nodes, but can only run on node 1 CPUs. That combination often produces remote allocations at startup and “fixes itself” later in unpredictable ways.
Decision: For predictable behavior, align Cpus_allowed_list and Mems_allowed_list unless you’ve measured a benefit from interleaving.
Task 15: Use performance counters for a sanity check (LLC misses and stalled cycles)
cr0x@server:~$ sudo perf stat -p 2481 -e cycles,instructions,cache-misses,stalled-cycles-frontend -I 1000 -- sleep 3
# time counts unit events
1.000993650 2,104,332,112 cycles
1.000993650 1,003,122,400 instructions
1.000993650 42,110,023 cache-misses
1.000993650 610,334,992 stalled-cycles-frontend
2.001976121 2,201,109,003 cycles
2.001976121 1,021,554,221 instructions
2.001976121 47,901,114 cache-misses
2.001976121 702,110,443 stalled-cycles-frontend
3.003112980 2,305,900,551 cycles
3.003112980 1,030,004,992 instructions
3.003112980 55,002,203 cache-misses
3.003112980 801,030,110 stalled-cycles-frontend
What it means: Rising cache misses and frontend stalls suggest memory subsystem pain. This doesn’t scream “NUMA” by itself, but it supports the locality hypothesis when paired with numastat.
Decision: If cache misses correlate with higher remote memory stats, prioritize placement fixes and reduce cross-socket chatter before micro-optimizing code.
Task 16: Quick-and-dirty A/B test: single-node run vs spanning
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- bash -c 'stress-ng --cpu 8 --vm 2 --vm-bytes 8G --timeout 20s --metrics-brief'
stress-ng: info: [3112] setting timeout to 20s
stress-ng: metrc: [3112] stressor bogo ops real time usr time sys time bogo ops/s
stress-ng: metrc: [3112] cpu 9821 20.00 18.90 1.02 491.0
stress-ng: metrc: [3112] vm 1220 20.00 8.12 2.10 61.0
What it means: You have a baseline when constrained to one node. Repeat on node 1 and compare. Then run without binding and compare again.
Decision: If single-node results are steadier or faster than “free-for-all,” your default scheduling/placement is not keeping locality. Fix policy; don’t just buy more CPUs.
Joke #2: NUMA tuning is the art of moving work closer to its memory, because teleportation still isn’t on the kernel roadmap.
Three corporate mini-stories (anonymized, plausible, technically accurate)
Mini-story 1: The incident caused by a wrong assumption
A payments team migrated a latency-sensitive service from older single-socket machines to new dual-socket servers. The new boxes had more cores, more RAM, and a much higher price tag, so everyone expected an easy win. The load test looked fine in average throughput, so the change went to production.
Within hours, p99 latency started drifting. Not spiking—drifting. The on-call saw CPU at 60–70%, network fine, disks fine, and assumed it was a noisy neighbor problem in the upstream dependency graph. They rolled back. Latency snapped back to normal. The new hardware got labeled “flaky.”
Weeks later, the same migration came back. This time, someone ran numastat -p and taskset. The service was pinned (by a well-meaning deployment script) to “the last 16 CPUs” because that was how the old boxes separated workloads. On the new machines, “the last 16 CPUs” belonged to NUMA node 1. The service’s memory—allocated early at startup—landed mostly on node 0 due to where the init process ran and how the unit file was structured.
So the hottest threads were running on node 1, reading and writing memory on node 0, and also handling network interrupts from a NIC attached to node 1. The workload was doing cross-socket reads for application state and then cross-socket writes for allocator metadata. It was a latency cocktail.
The fix was boring: align CPU affinity and memory policy at service start, verify NIC IRQ locality, and stop using CPU numbers as if they were stable semantics. The postmortem’s real lesson: dual-socket is not “more single-socket.” It’s a topology you must respect.
Mini-story 2: The optimization that backfired
A storage team running an NVMe-heavy service wanted to reduce tail latency. Someone proposed pinning the IO submission threads to a small set of isolated cores. They did it. Latency improved in light load tests, so they rolled it out widely.
Under real traffic, the system developed periodic stalls. Not full outages—just enough to cause retries and user-visible slowdowns. CPU graphs looked “good”: those pinned cores were busy, others were mostly idle. That was the first clue. You don’t want half a server idle while customers are waiting.
Investigation showed the NVMe devices were attached to NUMA node 0, but the pinned IO threads were on node 1. Worse, the MSI-X interrupts were being balanced across both sockets. Every completion involved cross-socket hops: interrupt on node 0, wake thread on node 1, access queues allocated on node 0, touch shared counters, repeat. When load increased, coherency traffic and remote memory access amplified each other. The pinned setup prevented the scheduler from “accidentally fixing it” by moving threads closer to the device.
The rollback wasn’t “stop pinning.” It was “pin correctly.” They aligned IO threads and IRQ affinities to node 0, then spread queues across cores on that node. Tail latency stabilized, and throughput increased because the cross-socket fabric stopped doing unpaid labor.
The practical takeaway: pinning is not optimization; pinning is a commitment. If you don’t commit to locality end-to-end—CPU, memory, and IO—pinning is just a way to make a bad design consistent.
Mini-story 3: The boring but correct practice that saved the day
A data platform group ran mixed workloads: a database, a Kafka-like log service, and a set of batch jobs. They standardized on dual-socket servers for capacity, but they treated NUMA as part of the deployment spec, not a post-incident curiosity.
Every service had a “placement contract” in its runbook: which NUMA node it should prefer, how to check device locality, and what to do if the node had insufficient free memory. They used a small set of commands—numactl --hardware, numastat, taskset, /proc/interrupts—and they required evidence in change reviews for any CPU pinning change.
One day, after a routine kernel update, they noticed a mild but consistent p95 regression on the log ingestion path. Nothing was on fire. That’s when boring practice pays off: someone ran the placement checks. A NIC firmware update had caused a PCIe slot change in a maintenance cycle, and the NIC ended up attached to node 1 while their ingestion threads were bound to node 0. IRQs followed the NIC; the threads didn’t.
They adjusted affinity to match the new topology and recovered performance without a war room. No heroics. No blame. Just topology-aware operations. The incident ticket was closed with the kind of comment everyone ignores until they need it: “Hardware changes are software changes.”
Common mistakes: symptoms → root cause → fix
This section is intentionally blunt. These are the patterns that show up repeatedly in production.
1) Symptom: one socket is pegged, the other is mostly idle
Root cause: CPU affinity/cpuset confines the workload to one node, or a single-threaded bottleneck is forcing serialization. Sometimes it’s IRQ processing concentrated on a few cores.
Fix: If the workload can scale, spread it across cores within a node first. Only span sockets when you have to. If spanning, also manage memory policy and IRQ locality. Validate with taskset -cp, lscpu, and /proc/interrupts.
2) Symptom: p99 latency worse on dual-socket than single-socket
Root cause: Cross-socket memory access and coherency churn (locks, shared allocators, hot counters). Often triggered by threads running on node A with memory on node B.
Fix: Align threads and memory via numactl --cpunodebind + --membind (or via service manager/cgroup policy). Reduce sharing across sockets by sharding queues and per-thread counters. Verify with numastat -p and remote access counters.
3) Symptom: throughput is okay, but jitter is awful
Root cause: Automatic NUMA balancing and page migration under load, or per-node memory pressure causing reclaim spikes. THP can amplify the cost of migration.
Fix: Ensure adequate free memory on the node where the workload runs. Consider disabling automatic NUMA balancing for pinned workloads after testing. Watch node meminfo and numastat over time.
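To check whether THP and migration are actually in play during the jitter window, two reads are usually enough. A sketch; counter names vary a bit by kernel version:
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
cr0x@server:~$ grep -E 'thp_fault_alloc|thp_collapse_alloc|thp_migration|numa_pages_migrated' /proc/vmstat
The bracketed value is the active THP mode. If the thp_* counters and numa_pages_migrated climb in step with your latency spikes, you are paying for huge-page assembly and page migration at the worst possible time.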
4) Symptom: network performance caps early or drops under load
Root cause: NIC interrupts and network stack processing on the wrong socket; application threads on the other socket; packet steering fighting CPU pinning.
Fix: Confirm NIC NUMA node via sysfs. Align IRQ affinity and application cores. Use multi-queue sensibly. Check /proc/interrupts distribution and CPU node mapping.
5) Symptom: NVMe latency spikes during heavy IO even with idle CPU elsewhere
Root cause: IO submission/completion threads far from the NVMe controller; queue memory on remote node; interrupts landing across sockets.
Fix: Keep IO path on the device’s node. Validate controller numa_node. Adjust thread placement and IRQ affinity. Don’t “optimize” by pinning without locality.
6) Symptom: adding more worker threads makes it slower
Root cause: You crossed a NUMA boundary and turned local contention into remote contention; locks and shared caches got more expensive.
Fix: Scale within a socket first. If you need more, shard the workload per NUMA node (two independent pools) rather than one global pool. Treat cross-socket sharing as expensive.
7) Symptom: containers behave differently across identical nodes
Root cause: Different PCIe slot topology, BIOS settings, or kubelet CPU/memory manager policies. “Identical SKU” does not mean “identical topology.”
Fix: Standardize BIOS settings, validate lscpu and device NUMA nodes during provisioning, and use topology-aware scheduling for critical pods.
Checklists / step-by-step plan
These are the steps that stop dual-socket machines from quietly stealing your latency budget.
Checklist A: Before you deploy a latency-sensitive workload on dual-socket
- Map topology: record the output of numactl --hardware and lscpu. You want it in the ticket, not in someone’s memory.
- Map device locality: for NICs and NVMe, capture /sys/class/net/*/device/numa_node and /sys/class/nvme/nvme*/device/numa_node.
- Decide a placement model: single-node (preferred), per-node sharding, or cross-node (only if necessary).
- Pick an affinity strategy: either “no pinning, let the scheduler work” or “pin + bind memory + align IRQs.” Never “pin only.”
- Capacity-check per node: ensure the chosen node has enough memory headroom for page cache + heap + spikes.
Checklist B: When you already have a slow box and you need a fix today
- Run numactl --hardware; look for per-node free-memory imbalance.
- Run numastat; look for high and rising numa_miss / other_node.
- Pick your hottest PID; run numastat -p and taskset -cp. Check whether the CPU node and the memory node match.
- Check device NUMA nodes and /proc/interrupts. Confirm IRQ locality for NIC/NVMe.
- If the mismatch is obvious: fix placement (restart with correct binding) rather than chasing micro-optimizations.
Checklist C: A sane long-term operating model
- Make topology part of inventory: store NUMA mapping and PCIe attachment info per host.
- Standardize BIOS settings: avoid surprise modes that alter NUMA exposure (and document what you chose).
- Build “NUMA smoke tests”: run quick locality A/B tests during provisioning to catch weirdness early.
- Train teams: “dual-socket ≠ twice fast” should be common knowledge, not lore.
- Review pinning changes like code changes: require proof, rollback plan, and post-change validation.
FAQ
Q1: Should I always avoid dual-socket servers?
No. Dual-socket is great for memory capacity, PCIe lanes, and aggregate throughput. Avoid it when your workload is latency-critical and highly shared, and you can fit within a strong single-socket SKU.
Q2: Is NUMA only a database problem?
No. It hits databases hard because of large memory footprints and shared structures, but it also affects networking, NVMe, JVM services, analytics, and anything that pushes lots of cache misses.
Q3: If Linux has automatic NUMA balancing, why do I need to care?
Because balancing is reactive and not free. It can improve throughput for some general-purpose workloads, but it can add jitter for latency-sensitive services—especially when you also pin CPUs.
Q4: What’s better: interleaving memory across nodes or binding to one node?
Binding is usually better for latency and predictability when you can keep the workload within one node. Interleaving can help when you truly need bandwidth from both memory controllers and the workload is well-parallelized.
Q5: How do I know if my workload is crossing sockets?
Use taskset -cp (where it runs) and numastat -p (where its memory lives). If CPU node and memory node don’t match, you’re crossing sockets in the worst way.
Q6: Can I fix NUMA issues without restarting the service?
Sometimes you can mitigate with CPU affinity changes and IRQ steering, but memory placement is often set at allocation time. The clean fix is frequently a restart with correct binding and sufficient per-node free memory.
Q7: Does huge page usage help or hurt NUMA?
It can help by reducing TLB pressure, which lowers CPU overhead. It can hurt if it makes memory placement and migration harder under pressure. Measure; don’t assume.
Q8: What about hyperthreading—does it change NUMA behavior?
Hyperthreading doesn’t change which memory is local. It changes how much contention you get per core. For some workloads, using fewer threads per core improves latency and reduces cross-socket contention.
Q9: In Kubernetes, what’s the most common NUMA foot-gun?
Guaranteed pods with CPU pinning that land on one node while memory allocations spread or default elsewhere, plus NIC IRQs sitting on the opposite node. Alignment matters.
Q10: If I need all cores across both sockets, what’s the least bad design?
Shard by NUMA node. Run two worker pools, two allocators/arenas where possible, per-node queues, and only exchange data across sockets when necessary. Treat the interconnect as a scarce resource.
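In practice that often looks like two bound instances behind whatever load-balancing layer you already have. A sketch with a hypothetical ./worker binary and ports; the point is the pattern, not the flags:
cr0x@server:~$ # shard 0: node 0 CPUs, node 0 memory
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- ./worker --listen 0.0.0.0:9000 &
cr0x@server:~$ # shard 1: node 1 CPUs, node 1 memory
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- ./worker --listen 0.0.0.0:9001 &
Each instance gets local CPUs, local memory, and ideally the NIC queues on its own node; the interconnect is reserved for the traffic that genuinely has to cross.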
Next steps you should actually do
Dual-socket servers aren’t cursed. They’re just honest. They tell you, very directly, whether your system design respects locality.
- Pick one production host and record topology + device locality: numactl --hardware, lscpu, and the numa_node files for NIC/NVMe.
- Pick your top 3 latency-sensitive services and capture numastat -p, taskset -cp, and the CPU/memory allowances from /proc/<pid>/status during peak.
- Decide a policy per service: single-node binding, per-node sharding, or measured interleaving. Write it down. Put it in the runbook.
- Stop “pin-only” changes unless the change also specifies memory policy and interrupt locality. If you can’t explain the full data path, you’re not done.
- Validate with an A/B test (single-node bound vs default) before and after the next hardware refresh. Make NUMA part of acceptance, not part of the postmortem.