NUMA without tears: why dual-socket isn’t ‘double fast’

You bought a shiny dual-socket box because “twice the CPUs” sounds like “twice the throughput.” Then your database p95 got worse, your storage stack started stuttering, and the perf team showed up with charts and quiet disappointment.

This is NUMA: Non-Uniform Memory Access. It’s not a bug. It’s the bill you pay for pretending a server is one big happy pool of CPUs and RAM.

What NUMA actually is (and what it is not)

NUMA means the machine is physically made of multiple “locality domains” (NUMA nodes). Each node is a chunk of CPU cores and memory that are close to each other. Accessing local memory is fast(ish). Accessing memory attached to the other socket is slower(ish). The “ish” matters because it’s not just latency; it’s also contention on the inter-socket fabric.

On a typical modern dual-socket x86 server:

  • Each socket has its own memory controllers and channels. The DRAM sticks are wired to that socket.
  • Sockets are connected by an interconnect (Intel UPI, AMD Infinity Fabric, or similar).
  • PCIe devices are also physically attached to a socket via root complexes. A NIC or NVMe device is “closer” to one socket than the other.

So NUMA is not “a tuning option.” It’s a topology. Ignoring it means you’re letting the OS play traffic cop while you’re actively generating traffic jams.

Also: NUMA is not automatically a disaster. NUMA is fine when the workload is NUMA-friendly, the scheduler is given a fighting chance, and you don’t sabotage it with bad pinning or memory placement.

One quote to keep around: “Hope is not a strategy.” — Gen. Gordon R. Sullivan. In NUMA land, “hope the kernel will figure it out” is exactly that: hope, not a strategy.

Why dual-socket isn’t “double fast”

Because the bottleneck moved. Adding a second socket increases peak compute and total memory capacity, but it also increases the number of ways to be slow. You’re not doubling a single resource; you’re stitching together two computers and telling Linux to pretend it’s one.

1) Memory locality becomes a performance dimension

In a single-socket machine, every core’s “local” memory is basically the same thing. In a dual-socket box, memory has an address, and that address has a home. When a thread on socket 0 frequently reads and writes pages allocated from node 1, every cache miss becomes a cross-socket trip. Cross-socket isn’t free; it’s a toll road at rush hour.
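
If you want to feel that difference instead of taking it on faith, a minimal A/B sketch is to run the same memory-bound test twice, once with local and once with remote memory (assuming sysbench is installed; any memory benchmark works):

cr0x@server:~$ # local: run on node 0, allocate on node 0
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- sysbench memory --memory-total-size=8G run
cr0x@server:~$ # remote: run on node 0, allocate on node 1, so every cache miss crosses the interconnect
cr0x@server:~$ numactl --cpunodebind=0 --membind=1 -- sysbench memory --memory-total-size=8G run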

2) Cache coherency traffic goes up

Multi-socket requires coherency across sockets. If you have shared data structures with lots of writes (locks, queues, hot counters, allocator metadata), you get inter-socket ping-pong. Sometimes the CPU is “busy” but the useful work per cycle tanks.

3) PCIe locality matters more than you think

Your NIC is connected to a specific socket. Your NVMe HBA is connected to a specific socket. If your workload is scheduled mainly on the other socket, you’ve invented an extra internal hop for every packet or IO completion. On high IOPS or high packet-rate systems, that hop becomes a tax.
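
A quick way to see this for every PCIe device at once; a sketch (numa_node reads -1 when the platform or firmware does not report locality):

cr0x@server:~$ for dev in /sys/bus/pci/devices/*; do echo "${dev##*/} node=$(cat "$dev"/numa_node)"; done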

4) You can easily make it worse with “optimizations”

Pinning threads sounds disciplined until you pin CPUs but not memory, or you pin interrupts to the wrong socket, or you force everything onto node 0 because it “worked in staging.” NUMA misconfiguration is the rare problem where doing something is often worse than doing nothing.

Dual-socket servers are like open-office plans: you gained “capacity,” but now everything important involves walking across the room.

Facts and history you can use at work

These aren’t trivia-night facts. They’re the kind you use to stop a bad purchase decision or win an argument with someone holding a spreadsheet.

  1. NUMA predates your cloud. Commercial NUMA designs existed decades ago in high-end systems; the idea is older than most modern performance tooling.
  2. SMP stopped scaling “for free.” Uniform memory access was simpler, but it hit physical and electrical limits as core counts rose and memory bandwidth didn’t keep up linearly.
  3. Integrated memory controllers changed everything. Moving memory controllers onto the CPU made memory latency better, but also made “which CPU owns the memory” an unavoidable question.
  4. Inter-socket links have evolved, but they’re still slower than local DRAM. UPI/Infinity Fabric are fast, but they are not a replacement for local channels. They also carry coherency traffic.
  5. NUMA isn’t just memory. Linux uses the same topology to think about PCIe devices, interrupts, and scheduling domains. Locality is a whole-machine property.
  6. Virtualization didn’t remove NUMA; it made it easier to hide. Hypervisors can expose virtual NUMA, but you can also accidentally build a VM that spans sockets and then wonder why it jitters.
  7. Transparent Huge Pages interact with NUMA. THP can reduce TLB overhead, but it can also make page placement and migration more expensive when the system is under pressure (see the quick check after this list).
  8. Early-boot placement decisions matter. Many allocators and services allocate a lot of memory at startup; “where it lands” can determine performance for hours.
  9. Storage stacks are NUMA-sensitive at high throughput. NVMe queues, softirq processing, and userspace polling loops love locality; crossing sockets adds jitter and lowers ceiling.
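
Fact 7 is checkable in seconds; a minimal sketch using the standard sysfs and procfs paths:

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/defrag
cr0x@server:~$ cat /proc/sys/kernel/numa_balancing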

How it fails in production: the failure modes

NUMA problems don’t usually look like “NUMA problem.” They look like:

  • p95 and p99 latency drift up while average throughput looks “fine.”
  • CPU is high but IPC is low and the box feels like it’s running through mud.
  • One socket is busy, the other is bored because the scheduler did what you asked, not what you meant.
  • Remote memory access climbs and now you’re paying cross-socket for cache misses.
  • IRQ imbalance makes networking or NVMe completion processing pile up on the “wrong” cores.
  • Unstable performance under load because page migration and reclaim kick in when memory is imbalanced across nodes.

NUMA performance issues are often multiplicative. Remote memory access adds latency; that increases lock hold times; that increases contention; that increases context switching; that increases cache misses; that increases remote access. That spiral is why “it was fine yesterday” is a common opener.

Fast diagnosis playbook (first/second/third)

When the system is slow and you suspect NUMA, don’t start with heroic benchmarking. Start with topology and placement. You’re trying to answer one question: are the CPUs, memory, and IO paths on the same node for the hot work?

First: confirm topology and whether you are spanning sockets

  • How many NUMA nodes exist?
  • Which CPUs belong to each node?
  • Is your workload pinned or constrained by cgroups/cpuset?

Second: check memory locality and remote access

  • Is most memory allocated on node 0 while threads run on node 1 (or vice versa)?
  • Are NUMA “miss” and “foreign” counters rising?
  • Is the kernel migrating pages a lot?

Third: check PCIe and interrupts locality

  • Where is the NIC/NVMe device attached (NUMA node)?
  • Are its interrupts landing on CPUs on the same node?
  • Are queues spread sensibly across cores near the device?

If you do those three steps, you’ll find the cause of a large fraction of “mysterious” dual-socket regressions in under 20 minutes. Not all of them. Enough to save your weekend.
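
If you want the whole playbook as one paste, here’s a minimal sketch; the PID and NIC name are placeholders for your own hot process and device:

cr0x@server:~$ PID=2481; NIC=ens5f0                           # placeholders: substitute your own
cr0x@server:~$ numactl --hardware                             # 1) topology and per-node free memory
cr0x@server:~$ taskset -cp "$PID"; numastat -p "$PID"         # 2) where it runs vs where its memory lives
cr0x@server:~$ cat /sys/class/net/"$NIC"/device/numa_node     # 3) device locality...
cr0x@server:~$ grep "$NIC" /proc/interrupts | head            #    ...and where its IRQs actually fire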

Hands-on: practical NUMA tasks with commands

These are the tasks I actually run when someone says “the new dual-socket servers are slower.” Each task includes the command, example output, what it means, and the decision you make from it.

Task 1: See NUMA nodes, CPU mapping, and memory size

cr0x@server:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 257728 MB
node 0 free: 18240 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257728 MB
node 1 free: 214912 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

What it means: Two nodes. The distance matrix shows remote access costs more than local. Note the free memory imbalance: node 0 is tight; node 1 is mostly free.

Decision: If your workload is mostly on CPUs 0–15 and node 0 is nearly full, you’re at high risk of remote allocations or reclaim. Plan to rebalance memory placement or CPU placement.

Task 2: Check which node a PCIe device is local to

cr0x@server:~$ cat /sys/class/net/ens5f0/device/numa_node
1

What it means: The NIC is attached to NUMA node 1.

Decision: For high packet-rate workloads, prefer running the network-heavy threads on node 1 CPUs, and steer IRQs there.

Task 3: Map NVMe devices to NUMA nodes

cr0x@server:~$ for d in /sys/class/nvme/nvme*; do echo -n "$(basename $d) "; cat $d/device/numa_node; done
nvme0 0
nvme1 0

What it means: Both NVMe controllers are local to node 0.

Decision: Keep your hottest IO submission/completion threads on node 0 if you’re chasing low latency, or at least align IO processing cores with node 0.

Task 4: See per-node memory allocations and NUMA hit/miss counters

cr0x@server:~$ numastat
                           node0           node1
numa_hit              1245067890        703112340
numa_miss               85467123         92133002
numa_foreign            92133002         85467123
interleave_hit             10234            11022
local_node           1231123456        688000112
other_node            100000000        110000000

What it means: Non-trivial numa_miss and other_node indicate remote memory usage. Some remote is normal; a lot is a sign of misplacement.

Decision: If remote traffic is growing during the slowdown window, focus on CPU/memory binding or reclaim/migration pressure before touching application code.

Task 5: Inspect a process’s NUMA memory map (per-node RSS)

cr0x@server:~$ pidof postgres
2481
cr0x@server:~$ numastat -p 2481
Per-node process memory usage (in MBs) for PID 2481 (postgres)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Private                  62048.1          1892.0         63940.1
Heap                     41000.0           256.0         41256.0
Stack                       32.0            16.0            48.0
Huge                        0.0             0.0             0.0
----------------  --------------- --------------- ---------------
Total                   62080.1          1924.0         64004.1

What it means: This process’s memory is overwhelmingly on node 0.

Decision: Ensure the busiest Postgres backend processes are scheduled primarily on node 0 CPUs, or explicitly bind memory allocation to the node you intend to run on.

Task 6: Check CPU affinity of a process (did someone pin it?)

cr0x@server:~$ taskset -cp 2481
pid 2481's current affinity list: 16-31

What it means: The process is pinned to node 1 CPUs, but in Task 5 its memory is on node 0. That’s classic remote access pain.

Decision: Either move the process to CPUs 0–15 or rebuild memory locality (restart with proper binding, or use migration carefully if supported).

Task 7: Launch a workload with explicit CPU + memory binding

cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- bash -c 'echo "bound"; sleep 1'
bound

What it means: This shell (and anything it launches) will run on node 1 and allocate memory from node 1.

Decision: Use this for targeted tests. If performance improves, you’ve confirmed locality as a bottleneck and can implement a durable placement strategy.
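
One durable way to encode that placement is a systemd drop-in; a minimal sketch, assuming systemd 243 or newer and a hypothetical unit named myservice.service:

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/myservice.service.d
cr0x@server:~$ sudo tee /etc/systemd/system/myservice.service.d/numa.conf <<'EOF'
[Service]
# keep CPUs and memory on node 1, matching the numactl test above
CPUAffinity=16-31
NUMAPolicy=bind
NUMAMask=1
EOF
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart myservice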

Task 8: Check automatic NUMA balancing status

cr0x@server:~$ sysctl kernel.numa_balancing
kernel.numa_balancing = 1

What it means: The kernel may migrate pages between nodes to improve locality.

Decision: For some latency-sensitive workloads, automatic balancing can add jitter. If you already manage affinity explicitly, consider disabling it after testing.
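
If testing shows the jitter is real, disabling it is two commands; a sketch (runtime change plus a persistent drop-in; re-enable it if your numbers do not improve):

cr0x@server:~$ sudo sysctl -w kernel.numa_balancing=0
cr0x@server:~$ echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/90-numa-balancing.conf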

Task 9: Observe NUMA balancing activity in vmstat

cr0x@server:~$ vmstat -w 1 5
procs -------------------memory------------------ ---swap-- -----io---- -system-- --------cpu--------
 r  b       swpd       free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0          0   18234000  10240 120000   0    0    12    45  610 1200 18  6 74  2  0
 4  0          0   18190000  10240 120500   0    0    10    38  640 1600 22  7 68  3  0
 6  0          0   18020000  10240 121100   0    0    11    40  780 2400 28  9 58  5  0
 7  0          0   17800000  10240 121900   0    0    12    42  900 3000 31 10 52  7  0
 8  0          0   17550000  10240 122800   0    0    15    50 1100 4200 35 11 46  8  0

What it means: Rising context switches (cs) and decreasing idle (id) can accompany migration/reclaim pressure. This is a coarse signal, not a NUMA proof.

Decision: If this correlates with higher remote memory stats, treat as “memory placement plus pressure” and look at per-node free memory and page scanning.

Task 10: Check per-node free memory directly

cr0x@server:~$ grep -hE 'Node [01] (MemTotal|MemFree|FilePages|Active|Inactive):' /sys/devices/system/node/node*/meminfo
Node 0 MemTotal:       263913472 kB
Node 0 MemFree:         1928396 kB
Node 0 FilePages:       2210040 kB
Node 0 Active:         92233728 kB
Node 0 Inactive:       70542336 kB
Node 1 MemTotal:       263913472 kB
Node 1 MemFree:       198223224 kB
Node 1 FilePages:       5110040 kB
Node 1 Active:         12133728 kB
Node 1 Inactive:        8542336 kB

What it means: Node 0 is nearly out of free memory while node 1 is swimming in it. The kernel will start allocating remotely or reclaiming aggressively on node 0.

Decision: Either move workload threads to node 1 and restart so memory allocates there, or spread allocations across nodes intentionally (interleave) if the workload tolerates it.

Task 11: Check IRQ distribution and whether interrupts align with device locality

cr0x@server:~$ grep -E 'ens5f0|nvme' /proc/interrupts | head
  132:    1203345          0          0          0   IR-PCI-MSI 524288-edge      ens5f0-TxRx-0
  133:          0    1189922          0          0   IR-PCI-MSI 524289-edge      ens5f0-TxRx-1
  134:          0          0    1191120          0   IR-PCI-MSI 524290-edge      ens5f0-TxRx-2
  135:          0          0          0    1210034   IR-PCI-MSI 524291-edge      ens5f0-TxRx-3

What it means: Queues are distributed across CPUs (columns). Good sign. But you still need to verify those CPUs belong to the same NUMA node as the NIC.

Decision: If NIC is node 1 but most interrupts land on node 0 CPUs, adjust IRQ affinity (or let irqbalance do it if it’s doing the right thing).
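
Steering one queue’s IRQ is a single write to procfs; a sketch using IRQ 132 from the output above (irqbalance, if running, may rewrite this unless you ban the IRQ or stop the daemon):

cr0x@server:~$ # the NIC sits on node 1, so point this queue at node 1 CPUs (16-31)
cr0x@server:~$ echo 16-31 | sudo tee /proc/irq/132/smp_affinity_list
cr0x@server:~$ cat /proc/irq/132/effective_affinity_list     # verify what the kernel actually applied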

Task 12: Identify which CPUs are on which node (for IRQ affinity decisions)

cr0x@server:~$ lscpu | egrep 'NUMA node\(s\)|NUMA node0 CPU\(s\)|NUMA node1 CPU\(s\)'
NUMA node(s):          2
NUMA node0 CPU(s):     0-15
NUMA node1 CPU(s):     16-31

What it means: Clear mapping of CPUs to nodes.

Decision: When pinning app threads or IRQs, keep the hot path on one node unless you have a good reason not to.

Task 13: Check cgroup cpuset constraints (containers love this trap)

cr0x@server:~$ systemctl is-active kubelet
active
cr0x@server:~$ cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.cpus
0-31
cr0x@server:~$ cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
0-1

What it means: Pods can run on all CPUs and allocate from both nodes. That’s flexible, but it can also turn into “random placement.”

Decision: For latency-critical pods, use a topology-aware policy (Guaranteed QoS, CPU Manager static, and NUMA-aware scheduling) so CPUs and memory stay aligned.
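
A quick way to see which policies the kubelet is actually running with; a sketch assuming the common config path /var/lib/kubelet/config.yaml (the location is installer-dependent):

cr0x@server:~$ grep -Ei 'cpuManagerPolicy|memoryManagerPolicy|topologyManagerPolicy' /var/lib/kubelet/config.yaml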

Task 14: Inspect a running process’s allowed NUMA nodes

cr0x@server:~$ egrep 'Cpus_allowed_list|Mems_allowed_list' /proc/2481/status
Cpus_allowed_list:	16-31
Mems_allowed_list:	0-1

What it means: Process can allocate memory on both nodes, but can only run on node 1 CPUs. That combination often produces remote allocations at startup and “fixes itself” later in unpredictable ways.

Decision: For predictable behavior, align Cpus_allowed_list and Mems_allowed_list unless you’ve measured a benefit from interleaving.

Task 15: Use performance counters for a sanity check (LLC misses and stalled cycles)

cr0x@server:~$ sudo perf stat -p 2481 -e cycles,instructions,cache-misses,stalled-cycles-frontend -I 1000 -- sleep 3
#           time             counts unit events
     1.000993650      2,104,332,112      cycles
     1.000993650      1,003,122,400      instructions
     1.000993650         42,110,023      cache-misses
     1.000993650        610,334,992      stalled-cycles-frontend
     2.001976121      2,201,109,003      cycles
     2.001976121      1,021,554,221      instructions
     2.001976121         47,901,114      cache-misses
     2.001976121        702,110,443      stalled-cycles-frontend
     3.003112980      2,305,900,551      cycles
     3.003112980      1,030,004,992      instructions
     3.003112980         55,002,203      cache-misses
     3.003112980        801,030,110      stalled-cycles-frontend

What it means: Rising cache misses and frontend stalls suggest memory subsystem pain. This doesn’t scream “NUMA” by itself, but it supports the locality hypothesis when paired with numastat.

Decision: If cache misses correlate with higher remote memory stats, prioritize placement fixes and reduce cross-socket chatter before micro-optimizing code.

Task 16: Quick-and-dirty A/B test: single-node run vs spanning

cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- bash -c 'stress-ng --cpu 8 --vm 2 --vm-bytes 8G --timeout 20s --metrics-brief'
stress-ng: info:  [3112] setting timeout to 20s
stress-ng: metrc: [3112] stressor       bogo ops real time  usr time  sys time   bogo ops/s
stress-ng: metrc: [3112] cpu               9821     20.00    18.90     1.02       491.0
stress-ng: metrc: [3112] vm                1220     20.00     8.12     2.10        61.0

What it means: You have a baseline when constrained to one node. Repeat on node 1 and compare. Then run without binding and compare again.

Decision: If single-node results are steadier or faster than “free-for-all,” your default scheduling/placement is not keeping locality. Fix policy; don’t just buy more CPUs.
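
The two comparison runs referenced above, as a sketch (same stress-ng parameters, different placement):

cr0x@server:~$ # same test, other node
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- stress-ng --cpu 8 --vm 2 --vm-bytes 8G --timeout 20s --metrics-brief
cr0x@server:~$ # same test, no binding: wherever the scheduler and allocator decide
cr0x@server:~$ stress-ng --cpu 8 --vm 2 --vm-bytes 8G --timeout 20s --metrics-brief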

NUMA tuning is the art of moving work closer to its memory—because teleportation is still not in the kernel roadmap.

Three corporate mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

A payments team migrated a latency-sensitive service from older single-socket machines to new dual-socket servers. The new boxes had more cores, more RAM, and a much higher price tag, so everyone expected an easy win. The load test looked fine in average throughput, so the change went to production.

Within hours, p99 latency started drifting. Not spiking—drifting. The on-call saw CPU at 60–70%, network fine, disks fine, and assumed it was a noisy neighbor problem in the upstream dependency graph. They rolled back. Latency snapped back to normal. The new hardware got labeled “flaky.”

Weeks later, the same migration came back. This time, someone ran numastat -p and taskset. The service was pinned (by a well-meaning deployment script) to “the last 16 CPUs” because that was how the old boxes separated workloads. On the new machines, “the last 16 CPUs” belonged to NUMA node 1. The service’s memory—allocated early at startup—landed mostly on node 0 due to where the init process ran and how the unit file was structured.

So the hottest threads were running on node 1, reading and writing memory on node 0, and also handling network interrupts from a NIC attached to node 1. The workload was doing cross-socket reads for application state and then cross-socket writes for allocator metadata. It was a latency cocktail.

The fix was boring: align CPU affinity and memory policy at service start, verify NIC IRQ locality, and stop using CPU numbers as if they were stable semantics. The postmortem’s real lesson: dual-socket is not “more single-socket.” It’s a topology you must respect.

Mini-story 2: The optimization that backfired

A storage team running an NVMe-heavy service wanted to reduce tail latency. Someone proposed pinning the IO submission threads to a small set of isolated cores. They did it. Latency improved in light load tests, so they rolled it out widely.

Under real traffic, the system developed periodic stalls. Not full outages—just enough to cause retries and user-visible slowdowns. CPU graphs looked “good”: those pinned cores were busy, others were mostly idle. That was the first clue. You don’t want half a server idle while customers are waiting.

Investigation showed the NVMe devices were attached to NUMA node 0, but the pinned IO threads were on node 1. Worse, the MSI-X interrupts were being balanced across both sockets. Every completion involved cross-socket hops: interrupt on node 0, wake thread on node 1, access queues allocated on node 0, touch shared counters, repeat. When load increased, coherency traffic and remote memory access amplified each other. The pinned setup prevented the scheduler from “accidentally fixing it” by moving threads closer to the device.

The rollback wasn’t “stop pinning.” It was “pin correctly.” They aligned IO threads and IRQ affinities to node 0, then spread queues across cores on that node. Tail latency stabilized, and throughput increased because the cross-socket fabric stopped doing unpaid labor.

The practical takeaway: pinning is not optimization; pinning is a commitment. If you don’t commit to locality end-to-end—CPU, memory, and IO—pinning is just a way to make a bad design consistent.

Mini-story 3: The boring but correct practice that saved the day

A data platform group ran mixed workloads: a database, a Kafka-like log service, and a set of batch jobs. They standardized on dual-socket servers for capacity, but they treated NUMA as part of the deployment spec, not a post-incident curiosity.

Every service had a “placement contract” in its runbook: which NUMA node it should prefer, how to check device locality, and what to do if the node had insufficient free memory. They used a small set of commands—numactl --hardware, numastat, taskset, /proc/interrupts—and they required evidence in change reviews for any CPU pinning change.

One day, after a routine kernel update, they noticed a mild but consistent p95 regression on the log ingestion path. Nothing was on fire. That’s when boring practice pays off: someone ran the placement checks. During an earlier maintenance cycle that included a NIC firmware update, the card had been moved to a different PCIe slot, so the NIC was now attached to node 1 while their ingestion threads were bound to node 0. IRQs followed the NIC; the threads didn’t.

They adjusted affinity to match the new topology and recovered performance without a war room. No heroics. No blame. Just topology-aware operations. The incident ticket was closed with the kind of comment everyone ignores until they need it: “Hardware changes are software changes.”

Common mistakes: symptoms → root cause → fix

This section is intentionally blunt. These are the patterns that show up repeatedly in production.

1) Symptom: one socket is pegged, the other is mostly idle

Root cause: CPU affinity/cpuset confines the workload to one node, or a single-threaded bottleneck is forcing serialization. Sometimes it’s IRQ processing concentrated on a few cores.

Fix: If the workload can scale, spread it across cores within a node first. Only span sockets when you have to. If spanning, also manage memory policy and IRQ locality. Validate with taskset -cp, lscpu, and /proc/interrupts.

2) Symptom: p99 latency worse on dual-socket than single-socket

Root cause: Cross-socket memory access and coherency churn (locks, shared allocators, hot counters). Often triggered by threads running on node A with memory on node B.

Fix: Align threads and memory via numactl --cpunodebind + --membind (or via service manager/cgroup policy). Reduce sharing across sockets by sharding queues and per-thread counters. Verify with numastat -p and remote access counters.

3) Symptom: throughput is okay, but jitter is awful

Root cause: Automatic NUMA balancing and page migration under load, or per-node memory pressure causing reclaim spikes. THP can amplify the cost of migration.

Fix: Ensure adequate free memory on the node where the workload runs. Consider disabling automatic NUMA balancing for pinned workloads after testing. Watch node meminfo and numastat over time.

4) Symptom: network performance caps early or drops under load

Root cause: NIC interrupts and network stack processing on the wrong socket; application threads on the other socket; packet steering fighting CPU pinning.

Fix: Confirm NIC NUMA node via sysfs. Align IRQ affinity and application cores. Use multi-queue sensibly. Check /proc/interrupts distribution and CPU node mapping.

5) Symptom: NVMe latency spikes during heavy IO even with idle CPU elsewhere

Root cause: IO submission/completion threads far from the NVMe controller; queue memory on remote node; interrupts landing across sockets.

Fix: Keep IO path on the device’s node. Validate controller numa_node. Adjust thread placement and IRQ affinity. Don’t “optimize” by pinning without locality.

6) Symptom: adding more worker threads makes it slower

Root cause: You crossed a NUMA boundary and turned local contention into remote contention; locks and shared caches got more expensive.

Fix: Scale within a socket first. If you need more, shard the workload per NUMA node (two independent pools) rather than one global pool. Treat cross-socket sharing as expensive.
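
What “shard per node” looks like at launch time; a sketch with a hypothetical ./worker binary and shard flag:

cr0x@server:~$ # two independent pools, each with local CPUs and local memory
cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- ./worker --shard=0 &
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- ./worker --shard=1 &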

7) Symptom: containers behave differently across identical nodes

Root cause: Different PCIe slot topology, BIOS settings, or kubelet CPU/memory manager policies. “Identical SKU” does not mean “identical topology.”

Fix: Standardize BIOS settings, validate lscpu and device NUMA nodes during provisioning, and use topology-aware scheduling for critical pods.

Checklists / step-by-step plan

These are the steps that stop dual-socket machines from quietly stealing your latency budget.

Checklist A: Before you deploy a latency-sensitive workload on dual-socket

  1. Map topology: record output of numactl --hardware and lscpu. You want it in the ticket, not in someone’s memory.
  2. Map device locality: for NICs and NVMe, capture /sys/class/net/*/device/numa_node and /sys/class/nvme/nvme*/device/numa_node.
  3. Decide a placement model: single-node (preferred), per-node sharding, or cross-node (only if necessary).
  4. Pick an affinity strategy: either “no pinning, let scheduler work” or “pin + bind memory + align IRQs.” Never “pin only.”
  5. Capacity-check per node: ensure the chosen node has enough memory headroom for page cache + heap + spikes.

Checklist B: When you already have a slow box and you need a fix today

  1. Run numactl --hardware; look for per-node free imbalance.
  2. Run numastat; look for high and rising numa_miss/other_node.
  3. Pick your hottest PID; run numastat -p and taskset -cp. Check if CPU node and memory node match.
  4. Check device NUMA node and /proc/interrupts. Confirm IRQ locality for NIC/NVMe.
  5. If mismatch is obvious: fix placement (restart with correct binding) rather than chasing micro-optimizations.

Checklist C: A sane long-term operating model

  1. Make topology part of inventory: store NUMA mapping and PCIe attachment info per host.
  2. Standardize BIOS settings: avoid surprise modes that alter NUMA exposure (and document what you chose).
  3. Build “NUMA smoke tests”: run quick locality A/B tests during provisioning to catch weirdness early.
  4. Train teams: “dual-socket ≠ twice fast” should be common knowledge, not lore.
  5. Review pinning changes like code changes: require proof, rollback plan, and post-change validation.

FAQ

Q1: Should I always avoid dual-socket servers?

No. Dual-socket is great for memory capacity, PCIe lanes, and aggregate throughput. Avoid it when your workload is latency-critical and highly shared, and you can fit within a strong single-socket SKU.

Q2: Is NUMA only a database problem?

No. It hits databases hard because of large memory footprints and shared structures, but it also affects networking, NVMe, JVM services, analytics, and anything that pushes lots of cache misses.

Q3: If Linux has automatic NUMA balancing, why do I need to care?

Because balancing is reactive and not free. It can improve throughput for some general-purpose workloads, but it can add jitter for latency-sensitive services—especially when you also pin CPUs.

Q4: What’s better: interleaving memory across nodes or binding to one node?

Binding is usually better for latency and predictability when you can keep the workload within one node. Interleaving can help when you truly need bandwidth from both memory controllers and the workload is well-parallelized.
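
Both policies are one numactl flag away, so measure instead of arguing; a sketch with a hypothetical ./app:

cr0x@server:~$ numactl --cpunodebind=0 --membind=0 -- ./app   # bind: predictable latency, one node of bandwidth
cr0x@server:~$ numactl --interleave=all -- ./app              # interleave: both memory controllers, but remote pages by design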

Q5: How do I know if my workload is crossing sockets?

Use taskset -cp (where it runs) and numastat -p (where its memory lives). If CPU node and memory node don’t match, you’re crossing sockets in the worst way.

Q6: Can I fix NUMA issues without restarting the service?

Sometimes you can mitigate with CPU affinity changes and IRQ steering, but memory placement is often set at allocation time. The clean fix is frequently a restart with correct binding and sufficient per-node free memory.
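
The no-restart mitigations look like this; a sketch using the PID from earlier (migratepages ships with the numactl tools and moves existing pages, but it can be slow and disruptive on a large, hot process):

cr0x@server:~$ sudo taskset -acp 0-15 2481    # option A: move the threads to where the memory already is (node 0)
cr0x@server:~$ sudo migratepages 2481 0 1     # option B: instead, move pages from node 0 to node 1, where the threads run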

Q7: Does huge page usage help or hurt NUMA?

It can help by reducing TLB pressure, which lowers CPU overhead. It can hurt if it makes memory placement and migration harder under pressure. Measure; don’t assume.

Q8: What about hyperthreading—does it change NUMA behavior?

Hyperthreading doesn’t change which memory is local. It changes how much contention you get per core. For some workloads, using fewer threads per core improves latency and reduces cross-socket contention.

Q9: In Kubernetes, what’s the most common NUMA foot-gun?

Guaranteed pods with CPU pinning that land on one node while memory allocations spread or default elsewhere, plus NIC IRQs sitting on the opposite node. Alignment matters.

Q10: If I need all cores across both sockets, what’s the least bad design?

Shard by NUMA node. Run two worker pools, two allocators/arenas where possible, per-node queues, and only exchange data across sockets when necessary. Treat the interconnect as a scarce resource.

Next steps you should actually do

Dual-socket servers aren’t cursed. They’re just honest. They tell you, very directly, whether your system design respects locality.

  1. Pick one production host and record topology + device locality: numactl --hardware, lscpu, and the numa_node files for NIC/NVMe.
  2. Pick your top 3 latency-sensitive services and capture: numastat -p, taskset -cp, and /proc/<pid>/status mem/CPU allowances during peak.
  3. Decide a policy per service: single-node binding, per-node sharding, or measured interleaving. Write it down. Put it in the runbook.
  4. Stop “pin-only” changes unless the change also specifies memory policy and interrupt locality. If you can’t explain the full data path, you’re not done.
  5. Validate with an A/B test (single-node bound vs default) before and after the next hardware refresh. Make NUMA part of acceptance, not part of the postmortem.