<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>gpu | Ronak Nathani</title><link>https://ronaknathani.com/tag/gpu/</link><atom:link href="https://ronaknathani.com/tag/gpu/index.xml" rel="self" type="application/rss+xml"/><description>gpu</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2026 Ronak Nathani</copyright><lastBuildDate>Tue, 26 May 2026 11:00:00 -0400</lastBuildDate><image><url>https://ronaknathani.com/img/avatar.jpg</url><title>gpu</title><link>https://ronaknathani.com/tag/gpu/</link></image><item><title>Keeping GPU Workloads NUMA-Local in Kubernetes</title><link>https://ronaknathani.com/blog/2026/05/keeping-gpu-workloads-numa-local-in-kubernetes/</link><pubDate>Tue, 26 May 2026 11:00:00 -0400</pubDate><guid>https://ronaknathani.com/blog/2026/05/keeping-gpu-workloads-numa-local-in-kubernetes/</guid><description>&lt;p>&amp;ldquo;NUMA alignment&amp;rdquo; comes up frequently in GPU infrastructure discussions, but concepts like NUMA nodes, topology policies, and CPU pinning are often assumed rather than well understood. Getting it right is as much the platform engineer&amp;rsquo;s job as the workload owner&amp;rsquo;s.&lt;/p>
&lt;p>This post isn&amp;rsquo;t a comprehensive guide to NUMA architecture. It&amp;rsquo;s a practical account of what happens when you align CPU and GPU resources on Kubernetes nodes: the levels of isolation Kubernetes offers, the gotchas, and what it takes to make it work. My experience is on AMD EPYC hardware. Intel has analogous concepts (Sub-NUMA Clustering instead of NPS, UPI instead of Infinity Fabric), but I haven&amp;rsquo;t worked with Intel in this context, so I&amp;rsquo;ll stick to what I know.&lt;/p>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;ul>
&lt;li>
&lt;a href="#key-terminology">Key Terminology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#what-is-numa">What is NUMA?&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#why-numa-matters-for-gpu-workloads">Why NUMA Matters for GPU Workloads&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#increasing-levels-of-cpu-isolation-and-numa-alignment">Levels of CPU Isolation and NUMA Alignment&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#how-the-kubelet-allocates-cpus">How the Kubelet Allocates CPUs&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#failure-modes-to-be-aware-of">Failure Modes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#topology-aware-scheduling">Topology-Aware Scheduling&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#making-it-work-collaboration-between-platform-and-workload-teams">Making It Work&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#dra-and-future-direction">DRA and Future Direction&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="key-terminology">Key Terminology&lt;/h2>
&lt;p>If you&amp;rsquo;re already familiar with CPU cache hierarchies, sockets, and PCIe, skip ahead. Otherwise, expand below for the shared vocabulary used throughout the post.&lt;/p>
&lt;details>
&lt;summary>Show terminology&lt;/summary>
&lt;p>&lt;strong>CPU Socket&lt;/strong>: The physical slot on a motherboard that holds a processor. Multi-socket servers (commonly 2-socket) have multiple processors, each with its own cores and local memory.&lt;/p>
&lt;p>&lt;strong>Physical Core vs. Logical Core&lt;/strong>: A physical core is a single processing unit on the CPU die. With
&lt;a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading" target="_blank" rel="noopener">SMT&lt;/a> (&amp;ldquo;hyperthreading&amp;rdquo; on Intel), each physical core presents as 2 logical cores that share the core&amp;rsquo;s execution resources and caches.&lt;/p>
&lt;p>&lt;strong>L1/L2 Cache&lt;/strong>: Small, fast caches private to each physical core (L1 is smaller and faster than L2). Two containers sharing a physical core, one logical core each, compete for the same L1/L2 space.&lt;/p>
&lt;p>&lt;strong>L3 Cache (Last-Level Cache)&lt;/strong>: A larger cache shared among a group of cores. On AMD EPYC, it&amp;rsquo;s shared within a
&lt;a href="https://en.wikipedia.org/wiki/Chiplet#AMD" target="_blank" rel="noopener">Core Complex (CCD)&lt;/a> of typically 8 cores. Cores sharing an L3 cache can exchange data quickly through it.&lt;/p>
&lt;p>&lt;strong>Interconnect&lt;/strong>: The high-speed link between CPU sockets (
&lt;a href="https://en.wikipedia.org/wiki/Infinity_Fabric" target="_blank" rel="noopener">Infinity Fabric&lt;/a> on AMD,
&lt;a href="https://en.wikipedia.org/wiki/Intel_Ultra_Path_Interconnect" target="_blank" rel="noopener">UPI&lt;/a> on Intel). Accessing memory locally is faster than going cross-socket over the interconnect.&lt;/p>
&lt;p>&lt;strong>
&lt;a href="https://en.wikipedia.org/wiki/PCI_Express" target="_blank" rel="noopener">PCIe&lt;/a>&lt;/strong>: The bus connecting CPUs to devices like GPUs and NICs. Each PCIe root complex is wired to a specific CPU socket, so a GPU is physically closer to one socket than another.&lt;/p>
&lt;p>&lt;strong>
&lt;a href="https://en.wikipedia.org/wiki/Direct_memory_access" target="_blank" rel="noopener">DMA&lt;/a>&lt;/strong>: Lets devices like GPUs read from and write to system memory directly, without the CPU copying data byte-by-byte. If the data sits in memory attached to a different NUMA node than the GPU, the DMA read crosses the interconnect.&lt;/p>
&lt;/details>
&lt;h2 id="what-is-numa">What is NUMA?&lt;/h2>
&lt;p>NUMA (Non-Uniform Memory Access) describes a memory architecture where the time it takes a CPU core to access memory depends on where that memory physically sits relative to the core.&lt;/p>
&lt;p>In a 2-socket server, each socket has its own local memory. A core on socket 0 can access memory attached to socket 0 quickly (local access), but accessing memory attached to socket 1 requires crossing the interconnect, which is slower. On AMD EPYC hardware, cross-socket memory access can incur roughly 3x the latency of local access.&lt;/p>
&lt;p>NUMA doesn&amp;rsquo;t only exist across sockets, though. On AMD EPYC processors, a BIOS setting called NPS (Nodes Per Socket) controls how many NUMA domains a single socket is divided into:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>NPS1&lt;/strong>: Each socket is one NUMA node. A 2-socket machine has 2 NUMA nodes.&lt;/li>
&lt;li>&lt;strong>NPS2&lt;/strong>: Each socket is split into 2 NUMA nodes. A 2-socket machine has 4 NUMA nodes.&lt;/li>
&lt;li>&lt;strong>NPS4&lt;/strong>: Each socket is split into 4 NUMA nodes. A 2-socket machine has 8 NUMA nodes.&lt;/li>
&lt;/ul>
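&lt;p>As a rough sanity check on the arithmetic above, here is a minimal Python sketch. The function name &lt;code>numa_layout&lt;/code> and its parameters are hypothetical, and it assumes logical CPUs divide evenly across the NUMA nodes of a socket, which real SKUs don&amp;rsquo;t always do.&lt;/p>

```python
def numa_layout(sockets, nps, cores_per_socket, threads_per_core=2):
    """Return NUMA node count and logical CPUs per node for an NPS setting.

    Assumption: CPUs split evenly across the NUMA nodes of a socket.
    """
    if nps not in (1, 2, 4):
        raise ValueError("this sketch only models NPS1, NPS2, and NPS4")
    logical_per_socket = cores_per_socket * threads_per_core
    if logical_per_socket % nps != 0:
        raise ValueError("logical CPUs do not divide evenly across NUMA nodes")
    return {
        "numa_nodes": sockets * nps,
        "logical_cpus_per_node": logical_per_socket // nps,
    }


# A 2-socket machine with 48 cores per socket, in each NPS mode.
layouts = {nps: numa_layout(2, nps, 48) for nps in (1, 2, 4)}
```

&lt;p>The point of the sketch is only that the same hardware yields 2, 4, or 8 NUMA nodes depending on one BIOS setting, with correspondingly smaller nodes.&lt;/p>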
&lt;p>The interactive diagram below shows a simplified view of a 2-socket AMD EPYC machine. Toggle between NPS modes to see how the NUMA boundaries change. In NPS1, each socket is one NUMA node. In NPS2 and NPS4, each socket is further subdivided. The CCD and GPU placement is illustrative, not a promise about every SKU.&lt;/p>
&lt;div class="numa-topo-widget">
&lt;style>
.numa-topo-widget {
margin: 1.5em 0;
width: 100vw;
position: relative;
left: 50%;
transform: translateX(-50%);
max-width: 1100px;
}
.numa-topo-widget svg { width: 100%; height: auto; display: block; }
.numa-topo-controls {
display: flex;
gap: 12px;
margin-bottom: 16px;
justify-content: center;
flex-wrap: wrap;
}
.numa-topo-controls button {
padding: 6px 16px;
border: 2px solid #2563eb;
background: #fff;
color: #2563eb;
font-size: 13px;
font-weight: 600;
border-radius: 6px;
cursor: pointer;
transition: all 0.15s;
font-family: inherit;
}
.numa-topo-controls button.active {
background: #2563eb;
color: #fff;
}
.numa-topo-controls button:hover:not(.active) {
background: #eff6ff;
}
&lt;/style>
&lt;div class="numa-topo-controls">
&lt;button class="active" onclick="numaTopoSetMode('nps1', this)">NPS1 (1 NUMA / socket)&lt;/button>
&lt;button onclick="numaTopoSetMode('nps2', this)">NPS2 (2 NUMA / socket)&lt;/button>
&lt;button onclick="numaTopoSetMode('nps4', this)">NPS4 (4 NUMA / socket)&lt;/button>
&lt;/div>
&lt;div id="numa-topo-diagram">&lt;/div>
&lt;/div>
&lt;script>
(function() {
var COLORS = {
numa: ['#dbeafe', '#fce7f3', '#d1fae5', '#fef3c7', '#e0e7ff', '#ede9fe', '#fce4ec', '#e8f5e9'],
numaBorder: ['#2563eb', '#db2777', '#059669', '#d97706', '#4f46e5', '#7c3aed', '#e91e63', '#2e7d32'],
socket: '#f8fafc',
socketBorder: '#94a3b8',
ccd: '#f1f5f9',
ccdBorder: '#cbd5e1',
gpu: '#1e293b',
gpuText: '#ffffff',
memory: '#e2e8f0',
memoryBorder: '#94a3b8',
interconnect: '#ef4444',
text: '#1e293b',
textMuted: '#64748b',
pcie: '#475569'
};
var currentMode = 'nps1';
window.numaTopoSetMode = function(mode, btn) {
currentMode = mode;
var buttons = document.querySelectorAll('.numa-topo-controls button');
for (var i = 0; i &lt; buttons.length; i++) buttons[i].classList.remove('active');
btn.classList.add('active');
render();
};
function render() {
var el = document.getElementById('numa-topo-diagram');
var socketW = 480, socketH = 420, gap = 100;
var padX = 30, padY = 30;
var svgW = socketW * 2 + gap + padX * 2;
var svgH = socketH + padY * 2 + 60;
var numaPerSocket = currentMode === 'nps1' ? 1 : currentMode === 'nps2' ? 2 : 4;
var ccdsPerSocket = 4, coresPerCCD = 8, gpusPerSocket = 4;
var h = '&lt;svg viewBox="0 0 ' + svgW + ' ' + svgH + '" xmlns="http://www.w3.org/2000/svg">';
h += '&lt;defs>&lt;filter id="nt-shadow" x="-2%" y="-2%" width="104%" height="104%">&lt;feDropShadow dx="0" dy="1" stdDeviation="2" flood-opacity="0.08"/>&lt;/filter>';
h += '&lt;marker id="nt-arrow" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">&lt;polygon points="0 0, 8 3, 0 6" fill="' + COLORS.interconnect + '"/>&lt;/marker>&lt;/defs>';
for (var sock = 0; sock &lt; 2; sock++) {
var sx = padX + sock * (socketW + gap);
var sy = padY;
h += '&lt;rect x="' + sx + '" y="' + sy + '" width="' + socketW + '" height="' + socketH + '" rx="12" fill="' + COLORS.socket + '" stroke="' + COLORS.socketBorder + '" stroke-width="2" filter="url(#nt-shadow)"/>';
h += '&lt;text x="' + (sx + socketW/2) + '" y="' + (sy + 24) + '" text-anchor="middle" font-size="15" font-weight="700" fill="' + COLORS.text + '">Socket ' + sock + '&lt;/text>';
var numaCols = numaPerSocket &lt;= 2 ? numaPerSocket : 2;
var numaRows = numaPerSocket &lt;= 2 ? 1 : 2;
var numaW = (socketW - 20) / numaCols;
var numaH = numaRows === 1 ? socketH - 85 : (socketH - 100) / 2;
for (var n = 0; n &lt; numaPerSocket; n++) {
var numaIdx = sock * numaPerSocket + n;
var col = n % numaCols;
var row = Math.floor(n / numaCols);
var nx = sx + 10 + col * numaW;
var ny = sy + 36 + row * (numaH + 8);
h += '&lt;rect x="' + nx + '" y="' + ny + '" width="' + (numaW - 4) + '" height="' + numaH + '" rx="8" fill="' + COLORS.numa[numaIdx % 8] + '" stroke="' + COLORS.numaBorder[numaIdx % 8] + '" stroke-width="1.5" stroke-dasharray="6,3"/>';
h += '&lt;text x="' + (nx + (numaW - 4)/2) + '" y="' + (ny + 18) + '" text-anchor="middle" font-size="12" font-weight="700" fill="' + COLORS.numaBorder[numaIdx % 8] + '">NUMA Node ' + numaIdx + '&lt;/text>';
var ccdsInNuma = ccdsPerSocket / numaPerSocket;
var ccdW = (numaW - 20) / ccdsInNuma;
var ccdH = numaRows === 1 ? 120 : 80;
for (var c = 0; c &lt; ccdsInNuma; c++) {
var cx = nx + 8 + c * ccdW;
var cy = ny + 28;
var globalCcdIdx = sock * ccdsPerSocket + n * ccdsInNuma + c;
h += '&lt;rect x="' + cx + '" y="' + cy + '" width="' + (ccdW - 4) + '" height="' + ccdH + '" rx="6" fill="' + COLORS.ccd + '" stroke="' + COLORS.ccdBorder + '" stroke-width="1"/>';
h += '&lt;text x="' + (cx + (ccdW-4)/2) + '" y="' + (cy + 16) + '" text-anchor="middle" font-size="10" font-weight="600" fill="' + COLORS.textMuted + '">CCD ' + globalCcdIdx + '&lt;/text>';
var coreR = coresPerCCD &lt;= 4 ? 1 : 2;
var coreC = Math.ceil(coresPerCCD / coreR);
var coreSize = Math.min((ccdW - 16) / coreC, (ccdH - 50) / coreR) - 2;
var coreStartX = cx + ((ccdW - 4) - coreC * (coreSize + 2)) / 2;
var coreStartY = cy + 22;
for (var cr = 0; cr &lt; coreR; cr++) {
for (var cc = 0; cc &lt; coreC; cc++) {
if (cr * coreC + cc >= coresPerCCD) break;
h += '&lt;rect x="' + (coreStartX + cc * (coreSize+2)) + '" y="' + (coreStartY + cr * (coreSize+2)) + '" width="' + coreSize + '" height="' + coreSize + '" rx="2" fill="#fff" stroke="' + COLORS.ccdBorder + '" stroke-width="0.75"/>';
}
}
h += '&lt;text x="' + (cx + (ccdW-4)/2) + '" y="' + (cy + ccdH - 8) + '" text-anchor="middle" font-size="9" fill="' + COLORS.textMuted + '">L3: 32MB&lt;/text>';
}
var memY = ny + 28 + ccdH + 8;
var memW = numaW - 20;
var memH = numaRows === 1 ? 28 : 22;
h += '&lt;rect x="' + (nx+8) + '" y="' + memY + '" width="' + memW + '" height="' + memH + '" rx="4" fill="' + COLORS.memory + '" stroke="' + COLORS.memoryBorder + '" stroke-width="1"/>';
h += '&lt;text x="' + (nx+8+memW/2) + '" y="' + (memY + (numaRows===1?18:15)) + '" text-anchor="middle" font-size="' + (numaRows===1?11:9) + '" font-weight="500" fill="' + COLORS.textMuted + '">Local Memory&lt;/text>';
var gpusInNuma = gpusPerSocket / numaPerSocket;
var gpuY = memY + (numaRows === 1 ? 36 : 28);
var gpuW = Math.min(40, (numaW - 20) / gpusInNuma - 4);
var gpuTotalW = gpusInNuma * (gpuW + 4) - 4;
var gpuStartX = nx + 8 + (memW - gpuTotalW) / 2;
h += '&lt;text x="' + (nx+8+memW/2) + '" y="' + (gpuY-4) + '" text-anchor="middle" font-size="9" fill="' + COLORS.pcie + '">PCIe&lt;/text>';
for (var g = 0; g &lt; gpusInNuma; g++) {
var gx = gpuStartX + g * (gpuW + 4);
var gpuIdx = sock * gpusPerSocket + n * gpusInNuma + g;
h += '&lt;rect x="' + gx + '" y="' + gpuY + '" width="' + gpuW + '" height="' + (numaRows===1?26:20) + '" rx="4" fill="' + COLORS.gpu + '"/>';
h += '&lt;text x="' + (gx+gpuW/2) + '" y="' + (gpuY+(numaRows===1?17:13)) + '" text-anchor="middle" font-size="' + (numaRows===1?10:8) + '" font-weight="600" fill="' + COLORS.gpuText + '">GPU ' + gpuIdx + '&lt;/text>';
}
if (numaIdx === 0 &amp;&amp; numaRows === 1) {
var legY = ny + 28 + ccdH + 8 + 28 + 8 + 26 + 16;
if (legY &lt; ny + numaH - 10) {
h += '&lt;text x="' + (nx+(numaW-4)/2) + '" y="' + legY + '" text-anchor="middle" font-size="9" fill="' + COLORS.textMuted + '">' + coresPerCCD + ' cores/CCD, 2 threads/core&lt;/text>';
}
}
}
var totalVCPUs = ccdsPerSocket * coresPerCCD * 2;
h += '&lt;text x="' + (sx+socketW/2) + '" y="' + (sy+socketH-8) + '" text-anchor="middle" font-size="11" fill="' + COLORS.textMuted + '">' + totalVCPUs + ' vCPUs (' + (ccdsPerSocket*coresPerCCD) + ' cores x 2 threads)&lt;/text>';
}
var interY = padY + socketH / 2;
var interX1 = padX + socketW + 4;
var interX2 = padX + socketW + gap - 4;
var interMid = (interX1 + interX2) / 2;
h += '&lt;line x1="' + interX1 + '" y1="' + (interY-6) + '" x2="' + interX2 + '" y2="' + (interY-6) + '" stroke="' + COLORS.interconnect + '" stroke-width="2.5" marker-end="url(#nt-arrow)"/>';
h += '&lt;line x1="' + interX2 + '" y1="' + (interY+6) + '" x2="' + interX1 + '" y2="' + (interY+6) + '" stroke="' + COLORS.interconnect + '" stroke-width="2.5" marker-end="url(#nt-arrow)"/>';
h += '&lt;text x="' + interMid + '" y="' + (interY-16) + '" text-anchor="middle" font-size="10" font-weight="600" fill="' + COLORS.interconnect + '">Interconnect&lt;/text>';
h += '&lt;text x="' + interMid + '" y="' + (interY+24) + '" text-anchor="middle" font-size="9" fill="' + COLORS.interconnect + '">(Infinity Fabric)&lt;/text>';
var totalNuma = numaPerSocket * 2;
var npsLabel = currentMode.toUpperCase();
h += '&lt;text x="' + (svgW/2) + '" y="' + (svgH-12) + '" text-anchor="middle" font-size="13" fill="' + COLORS.textMuted + '">2-socket AMD EPYC, ' + npsLabel + ' mode (' + totalNuma + ' NUMA nodes, 8 GPUs)&lt;/text>';
h += '&lt;/svg>';
el.innerHTML = h;
}
render();
})();
&lt;/script>
&lt;p>&lt;strong>The key point&lt;/strong>: NUMA topology is a function of both the hardware and how it&amp;rsquo;s configured. You can&amp;rsquo;t assume a fixed number of NUMA nodes across your fleet unless you control the BIOS settings. Different SKUs have different core counts, different numbers of CCDs, and different NPS configurations, all of which change the NUMA geometry.&lt;/p>
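&lt;p>Because the geometry varies, it helps to read it from the node rather than assume it. On Linux, each NUMA node exposes its CPUs in &lt;code>/sys/devices/system/node/nodeN/cpulist&lt;/code>. The sketch below is one way to parse that; the helper names &lt;code>parse_cpulist&lt;/code> and &lt;code>read_numa_topology&lt;/code> are my own, not part of any library.&lt;/p>

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root="/sys/devices/system/node"):
    """Map NUMA node id to its CPU ids by reading sysfs (empty if root is absent)."""
    topology = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return topology
```

&lt;p>Calling &lt;code>read_numa_topology()&lt;/code> on a node gives a dictionary keyed by NUMA node id, which makes differences between SKUs and NPS settings easy to spot across a fleet.&lt;/p>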
&lt;h2 id="why-numa-matters-for-gpu-workloads">Why NUMA Matters for GPU Workloads&lt;/h2>
&lt;p>GPU inference often follows a CPU-GPU pipeline: the CPU prepares requests, batches them, copies input data into the right format, feeds data to the GPU over PCIe, and then handles postprocessing on the GPU&amp;rsquo;s output. The GPU does the heavy computation, but it&amp;rsquo;s the CPU that keeps the pipeline fed.&lt;/p>
&lt;p>When a container&amp;rsquo;s CPUs are on a different NUMA node than its GPU, moving input data from CPU memory to GPU memory may cross a NUMA boundary. The GPU has to read data from memory attached to a farther-away CPU socket instead of memory attached to its local socket. This adds latency on the critical path.&lt;/p>
&lt;p>In one inference workload, we observed more than 30% higher p99 tail latency under load for pods whose CPUs spanned both sockets compared with pods whose CPUs stayed on the same socket. For a latency-sensitive service, that is enough to matter, and it happens silently unless someone is explicitly monitoring NUMA alignment. Nothing in Kubernetes surfaces it. The pod is running, serving traffic, and looking healthy, just consistently slower than its peers.&lt;/p>
&lt;p>Training workloads are affected too, though the impact profile is different. Data loading workers continuously preprocess batches on CPU and stage them for GPU consumption. Cross-NUMA data loaders contend for inter-socket bandwidth and add latency to every batch transfer.
&lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html" target="_blank" rel="noopener">PyTorch&amp;rsquo;s own performance tuning guide&lt;/a> explicitly recommends binding training processes to a single NUMA node.&lt;/p>
&lt;p>For GPU workloads where the CPU is on the data path to the GPU, NUMA locality has a direct and measurable impact on performance.&lt;/p>
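&lt;p>Since nothing in Kubernetes surfaces misalignment, a small check can. The sketch below assumes you already know the GPU&amp;rsquo;s PCI address and the CPU ids of each NUMA node; the kernel reports a PCI device&amp;rsquo;s NUMA node in its &lt;code>numa_node&lt;/code> sysfs file. The helper names and the example CPU ids are hypothetical.&lt;/p>

```python
from pathlib import Path


def gpu_numa_node(pci_address, root="/sys/bus/pci/devices"):
    """Read the NUMA node the kernel reports for a PCI device (-1 means unknown)."""
    return int((Path(root) / pci_address / "numa_node").read_text().strip())


def cpus_off_node(allowed_cpus, node_cpus):
    """Return the allowed CPUs that are not part of the given NUMA node."""
    return sorted(set(allowed_cpus) - set(node_cpus))


# Hypothetical example: a container pinned to CPUs 44-47 and 96-97,
# with its GPU attached to a NUMA node that owns CPUs 0-47.
stray = cpus_off_node([44, 45, 46, 47, 96, 97], range(0, 48))
```

&lt;p>Inside a container on Linux, &lt;code>os.sched_getaffinity(0)&lt;/code> returns the CPUs the process is allowed to run on, which can serve as the &lt;code>allowed_cpus&lt;/code> input. A non-empty result means some of the container&amp;rsquo;s CPUs sit on a different NUMA node than its GPU.&lt;/p>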
&lt;h2 id="increasing-levels-of-cpu-isolation-and-numa-alignment">Increasing Levels of CPU Isolation and NUMA Alignment&lt;/h2>
&lt;p>Kubernetes offers several knobs for CPU isolation, each providing stronger guarantees at the cost of more constraints.&lt;/p>
&lt;h3 id="level-1-logical-core-pinning-with-cpumanagerpolicy-static">Level 1: Logical Core Pinning with &lt;code>cpuManagerPolicy: static&lt;/code>&lt;/h3>
&lt;p>By default, Kubernetes lets the operating system&amp;rsquo;s CPU scheduler move a container&amp;rsquo;s processes across any available core. This is efficient for overall CPU utilization, but it means your container&amp;rsquo;s threads may move across cores, invalidating caches and sharing physical cores with other containers.&lt;/p>
&lt;p>Setting &lt;code>cpuManagerPolicy: static&lt;/code> in the kubelet config changes this. Containers in
&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed" target="_blank" rel="noopener">Guaranteed QoS&lt;/a> pods (where &lt;code>requests == limits&lt;/code>) with integer CPU requests get exclusive, pinned logical cores. Kubernetes won&amp;rsquo;t assign those exclusive CPUs to another container, and your processes stay put. Host daemons and kernel threads can still run there unless the platform also reserves or isolates CPUs for the OS.&lt;/p>
&lt;p>The kubelet pins CPUs by constraining the container&amp;rsquo;s cpuset cgroup to the assigned CPU list. You can see the assigned CPUs in &lt;code>cpuset.cpus&lt;/code> on cgroup v1 and &lt;code>cpuset.cpus.effective&lt;/code> on cgroup v2.&lt;/p>
&lt;p>On the node, the exact path depends on the cgroup driver, runtime, QoS class, and pod UID formatting. With systemd-style kubepods slices, the files are roughly located here:&lt;/p>
&lt;pre>&lt;code class="language-bash">KUBEPODS=&amp;quot;&amp;lt;kubepods.slice/.../kubepods-pod&amp;lt;uid&amp;gt;.slice/&amp;lt;container&amp;gt;.scope&amp;gt;&amp;quot;
# cgroup v1
cat /sys/fs/cgroup/cpuset/$KUBEPODS/cpuset.cpus
# cgroup v2
cat /sys/fs/cgroup/$KUBEPODS/cpuset.cpus.effective
&lt;/code>&lt;/pre>
&lt;p>This alone improves performance consistency. Cache affinity improves because threads aren&amp;rsquo;t migrating across cores, and container-to-container CPU contention is reduced. But there&amp;rsquo;s a subtlety: you&amp;rsquo;re pinning logical cores (hyperthreads), not physical cores. Two containers can still end up sharing a physical core if one gets one hyperthread and the other gets the sibling. They&amp;rsquo;ll contend for that physical core&amp;rsquo;s L1 and L2 caches.&lt;/p>
&lt;p>&lt;strong>What it requires from workload owners:&lt;/strong> Set &lt;code>requests == limits&lt;/code> for CPU and memory on all containers (including init containers and sidecars) to get Guaranteed QoS. CPU requests must be integers for the containers where pinning is desired.&lt;/p>
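&lt;p>These rules are easy to check before a pod ever reaches a node. The sketch below works on plain dictionaries shaped like the &lt;code>resources&lt;/code> stanza of a container spec. It is a simplification, not the kubelet&amp;rsquo;s logic: the function names are hypothetical, and memory quantities are compared as strings, so &lt;code>1Gi&lt;/code> and &lt;code>1024Mi&lt;/code> would be treated as different.&lt;/p>

```python
from fractions import Fraction


def parse_cpu(quantity):
    """Parse a Kubernetes CPU quantity such as "2", "0.5", or "500m"."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def is_guaranteed(containers):
    """True if every container sets CPU and memory limits equal to its requests."""
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            return False
        # A missing request defaults to the limit.
        if parse_cpu(requests.get("cpu", limits["cpu"])) != parse_cpu(limits["cpu"]):
            return False
        # Simplification: memory is compared as a string, not as a quantity.
        if requests.get("memory", limits["memory"]) != limits["memory"]:
            return False
    return True


def gets_exclusive_cpus(container, pod_is_guaranteed):
    """True if the container would be pinned: Guaranteed pod and integer CPU request."""
    cpu = parse_cpu(container["resources"]["requests"]["cpu"])
    return pod_is_guaranteed and cpu.denominator == 1 and cpu >= 1


main = {"resources": {"requests": {"cpu": "4", "memory": "8Gi"},
                      "limits": {"cpu": "4", "memory": "8Gi"}}}
sidecar = {"resources": {"requests": {"cpu": "500m", "memory": "256Mi"},
                         "limits": {"cpu": "500m", "memory": "256Mi"}}}
guaranteed = is_guaranteed([main, sidecar])
```

&lt;p>In this example the pod is Guaranteed, the main container is pinned, and the fractional sidecar stays in the shared pool.&lt;/p>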
&lt;h3 id="level-2-physical-core-pinning-with-full-pcpus-only">Level 2: Physical Core Pinning with &lt;code>full-pcpus-only&lt;/code>&lt;/h3>
&lt;p>The
&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options" target="_blank" rel="noopener">&lt;code>full-pcpus-only&lt;/code>&lt;/a> CPU manager policy option (&lt;code>cpuManagerPolicyOptions: full-pcpus-only=true&lt;/code>) takes isolation further. Instead of allocating individual logical cores, it allocates entire physical cores. Both hyperthreads of each core go to the same container.&lt;/p>
&lt;p>This eliminates L1/L2 cache contention between containers that would otherwise share a physical core.&lt;/p>
&lt;p>The trade-off: containers that receive exclusive CPUs must request a multiple of the SMT thread count, typically 2. A pinned container requesting 3 CPUs fails with an &lt;code>SMTAlignmentError&lt;/code> (covered in
&lt;a href="#smtalignmenterror">Failure Modes&lt;/a> below). Any existing pinned containers with odd CPU counts on the node need to be resized before you enable this option.&lt;/p>
&lt;p>&lt;strong>What it requires from workload owners:&lt;/strong> CPU requests on pinned containers must be even (more precisely, a multiple of the SMT thread count). Audit all containers, including sidecars and init containers. Fractional CPU values on sidecars and init containers are fine, because those containers use the shared CPU pool and don&amp;rsquo;t get pinned.&lt;/p>
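&lt;p>The alignment rule itself is a one-line check. A minimal sketch, with hypothetical helper names, assuming 2 threads per core unless told otherwise:&lt;/p>

```python
def smt_aligned(cpu_request, threads_per_core=2):
    """True if the request is a whole number of physical cores."""
    return cpu_request % threads_per_core == 0


def round_up_to_full_cores(cpu_request, threads_per_core=2):
    """Smallest SMT-aligned CPU count that covers the request."""
    cores_needed = -(-cpu_request // threads_per_core)  # ceiling division
    return cores_needed * threads_per_core


# A 3-CPU pinned container would be rejected; 4 is the nearest aligned size.
suggested = round_up_to_full_cores(3)
```

&lt;p>Running a check like this across existing pinned workloads before enabling the option shows which ones need resizing first.&lt;/p>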
&lt;h3 id="level-3-full-numa-alignment-with-single-numa-node">Level 3: Full NUMA Alignment with &lt;code>single-numa-node&lt;/code>&lt;/h3>
&lt;p>CPU pinning ensures your cores are dedicated, and &lt;code>full-pcpus-only&lt;/code> ensures the container gets full physical cores. Neither guarantees that all your cores come from the same NUMA node. With the default static policy options, the kubelet&amp;rsquo;s CPU manager uses a packed allocation strategy that fills one NUMA node before spilling to the next (more on this
&lt;a href="#how-the-kubelet-allocates-cpus">below&lt;/a>), but depending on node fragmentation, your container&amp;rsquo;s CPUs can still span NUMA boundaries.&lt;/p>
&lt;p>The
&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/" target="_blank" rel="noopener">&lt;code>topologyManagerPolicy: single-numa-node&lt;/code>&lt;/a> setting addresses this. The topology manager sits above the CPU manager, device manager, and memory manager, and coordinates resource allocation by collecting topology hints from each. With &lt;code>single-numa-node&lt;/code>, it requires that hinted resources can be satisfied from a single NUMA node. If they can&amp;rsquo;t, the pod is rejected at admission time with a &lt;code>TopologyAffinityError&lt;/code>.&lt;/p>
&lt;p>The default scope &lt;code>topologyManagerScope: container&lt;/code> computes alignment independently for each container. That&amp;rsquo;s usually fine when one main container owns the GPU and the exclusive CPUs, while sidecars are unrelated to the latency-critical path and use fractional CPU from the shared pool.&lt;/p>
&lt;p>&lt;code>topologyManagerScope: pod&lt;/code> is stricter: it asks whether the pod&amp;rsquo;s effective request fits on one NUMA node. Use it when multiple containers in the same pod are performance-coupled, not just because a logging or metrics sidecar exists.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Caveat:&lt;/strong> Topology Manager only enforces resources that report topology hints. CPU hints come from CPU Manager, GPU hints come through Device Manager from plugins such as the nvidia-device-plugin, and memory hints come from Memory Manager.&lt;/p>
&lt;p>If the GPU device plugin does not report NUMA &lt;code>TopologyInfo&lt;/code>, Topology Manager cannot force CPU-GPU locality.&lt;/p>
&lt;/blockquote>
&lt;p>For guaranteed NUMA alignment, set &lt;code>memoryManagerPolicy: Static&lt;/code> too. This makes requested memory part of topology admission along with CPU and GPU resources. The workload&amp;rsquo;s memory request must fit within the target NUMA node. Kubernetes also requires &lt;code>reservedMemory&lt;/code> when &lt;code>memoryManagerPolicy: Static&lt;/code> is enabled.&lt;/p>
&lt;p>This gives you the strongest isolation available. The container stays within one NUMA node and communicates with the GPU on that NUMA node without crossing the socket interconnect. CPU cache locality can still vary within that NUMA node (depending on NPS configuration), but in testing, we have found that a container occupying an entire NUMA node with no overlap from other workloads is materially less affected by cache-thrashing or CPU-intensive noisy neighbors.&lt;/p>
&lt;p>&lt;strong>What it requires from workload owners:&lt;/strong> The performance-critical container, or the effective pod request when using &lt;code>topologyManagerScope: pod&lt;/code>, must fit within a single NUMA node. That means understanding the machine topology and resizing the workload as hardware or its configuration changes.&lt;/p>
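&lt;p>To reason about whether a request can fit, a simplified model is often enough: does any one NUMA node have enough free CPUs, GPUs, and memory at the same time? The sketch below is that model only. It is not the Topology Manager&amp;rsquo;s hint-merging logic, and the function name, resource keys, and capacity numbers are hypothetical.&lt;/p>

```python
def nodes_that_fit(free_by_node, request):
    """Return the NUMA node ids whose free resources cover every requested resource."""
    return [
        node
        for node, free in sorted(free_by_node.items())
        if all(free.get(name, 0) >= amount for name, amount in request.items())
    ]


# Hypothetical free capacity per NUMA node after other pods were admitted.
free_by_node = {
    0: {"cpu": 2, "gpu": 0, "memory_gib": 40},
    1: {"cpu": 90, "gpu": 4, "memory_gib": 200},
}
fits = nodes_that_fit(free_by_node, {"cpu": 22, "gpu": 1, "memory_gib": 64})
```

&lt;p>An empty result corresponds to the case where &lt;code>single-numa-node&lt;/code> would reject the pod at admission, even though the machine as a whole has enough free resources.&lt;/p>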
&lt;p>A minimal kubelet config for a dedicated NUMA-aligned GPU pool looks like:&lt;/p>
&lt;pre>&lt;code class="language-yaml">cpuManagerPolicy: static
cpuManagerPolicyOptions:
full-pcpus-only: &amp;quot;true&amp;quot;
topologyManagerPolicy: single-numa-node
# Default is container. Use pod only when the whole pod should fit on one NUMA node.
# topologyManagerScope: pod
memoryManagerPolicy: Static
# memoryManagerPolicy: Static requires reservedMemory to be configured.
&lt;/code>&lt;/pre>
&lt;p>Roll this out on dedicated, drained nodes. When changing CPU or memory manager policies, clear the kubelet state files before restarting kubelet: &lt;code>&amp;lt;kubelet-root-dir&amp;gt;/cpu_manager_state&lt;/code> and &lt;code>&amp;lt;kubelet-root-dir&amp;gt;/memory_manager_state&lt;/code>.&lt;/p>
&lt;h2 id="how-the-kubelet-allocates-cpus">How the Kubelet Allocates CPUs&lt;/h2>
&lt;p>Understanding the allocation algorithm helps explain how and when NUMA spillover happens.&lt;/p>
&lt;p>When &lt;code>cpuManagerPolicy: static&lt;/code> is enabled with the default policy options, the kubelet uses a packed (bin-pack) allocation strategy:
&lt;a href="https://sourcegraph.com/r/github.com/kubernetes/kubernetes@78994b5cf1fd09d94f8f1748fac83d15eb83c479/-/blob/pkg/kubelet/cm/cpumanager/cpu_assignment.go?L776" target="_blank" rel="noopener">&lt;code>takeByTopologyNUMAPacked&lt;/code>&lt;/a>. It works top-down through the topology:&lt;/p>
&lt;ol>
&lt;li>First, try to take full NUMA nodes (prefer smaller/more-used ones)&lt;/li>
&lt;li>Then, take full physical cores from partially-used NUMA nodes (prefer NUMA nodes with fewer free CPUs to pack them first)&lt;/li>
&lt;li>Finally, take individual logical cores if needed&lt;/li>
&lt;/ol>
&lt;p>The sort order is key: at every level, it prefers NUMA nodes with fewer remaining free CPUs. This packs nearly-exhausted NUMA nodes before touching less-used ones. The allocator usually keeps CPUs NUMA-local when there is room, but locality is not guaranteed.&lt;/p>
&lt;p>But &amp;ldquo;when there is room&amp;rdquo; is doing a lot of work there. Consider a 2-socket machine with 48 cores per socket (96 vCPUs per socket with SMT). In NPS1 mode, this gives 2 NUMA nodes of 96 vCPUs each. After system and kube-reserved, suppose 90 vCPUs are allocatable per NUMA node, with 4 GPUs per NUMA node. Each pod requests 22 vCPUs.&lt;/p>
&lt;p>The first 4 pods land on NUMA 0: 4 x 22 = 88 vCPUs used, leaving only 2 allocatable vCPUs. The 5th pod requests 22 vCPUs, but only 2 remain on NUMA 0. The CPU manager takes those 2 from NUMA 0 and the remaining 20 from NUMA 1. The diagram below (credit to my colleague
&lt;a href="https://www.linkedin.com/in/rlishtaba/" target="_blank" rel="noopener">Roman Lishtaba&lt;/a> for identifying this pattern in our GPU inference workloads) shows exactly how this plays out:&lt;/p>
&lt;div class="cpu-spill-widget" style="margin: 1.5em 0; width: 100vw; position: relative; left: 50%; transform: translateX(-50%); max-width: 1100px;">
&lt;div id="cpu-spill-diagram">&lt;/div>
&lt;/div>
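&lt;p>The arithmetic in the example above can be reproduced with a small simulation. This is a deliberately simplified model of packed allocation that tracks only free CPU counts per NUMA node; the real &lt;code>takeByTopologyNUMAPacked&lt;/code> works on CPU ids and whole cores. The function name &lt;code>allocate_packed&lt;/code> is hypothetical.&lt;/p>

```python
def allocate_packed(free_by_node, request):
    """Take `request` CPUs, draining the NUMA node with the fewest free CPUs first.

    Mutates free_by_node and returns how many CPUs were taken from each node.
    """
    if request > sum(free_by_node.values()):
        raise ValueError("not enough free CPUs on the machine")
    taken = {}
    remaining = request
    # Nodes with fewer free CPUs come first; ties break on the lower node id.
    for node in sorted(free_by_node, key=lambda n: (free_by_node[n], n)):
        if remaining == 0:
            break
        take = min(free_by_node[node], remaining)
        if take > 0:
            taken[node] = take
            free_by_node[node] -= take
            remaining -= take
    return taken


# 2 NUMA nodes with 90 allocatable vCPUs each, five pods of 22 vCPUs.
free = {0: 90, 1: 90}
placements = [allocate_packed(free, 22) for _ in range(5)]
```

&lt;p>The first four entries in &lt;code>placements&lt;/code> stay on NUMA node 0, and the fifth is split between both nodes, which is the spillover the diagram illustrates.&lt;/p>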
&lt;script>
(function() {
var C = {
socketBg: '#f8fafc', socketBorder: '#94a3b8',
sharedBg: '#f1f5f9', sharedBorder: '#cbd5e1',
text: '#1e293b', textMuted: '#64748b',
interconnect: '#64748b', ddr: '#2563eb',
pods: [
{ bg: '#dcfce7', border: '#22c55e', text: '#166534' },
{ bg: '#dbeafe', border: '#3b82f6', text: '#1e40af' },
{ bg: '#f3e8ff', border: '#a855f7', text: '#6b21a8' },
{ bg: '#fef3c7', border: '#f59e0b', text: '#92400e' },
{ bg: '#fecaca', border: '#ef4444', text: '#991b1b' }
]
};
var VCPUS_PER_SOCKET = 96;
var CCDS = 6;
var SYS_RESERVED = 6;
var ALLOC = VCPUS_PER_SOCKET - SYS_RESERVED;
var POD = 22;
var el = document.getElementById('cpu-spill-diagram');
var W = 1160, H = 720;
var socketW = 500, socketH = 340, socketGap = 100, socketY = 80;
var sx0 = 30, sx1 = sx0 + socketW + socketGap;
var s = '&lt;svg viewBox="0 0 ' + W + ' ' + H + '" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;display:block;">';
s += '&lt;text x="' + (W/2) + '" y="28" text-anchor="middle" font-size="18" font-weight="700" fill="' + C.text + '">AMD EPYC System: NPS1&lt;/text>';
s += '&lt;text x="' + (W/2) + '" y="50" text-anchor="middle" font-size="12" fill="' + C.textMuted + '">Pod placement at full packing \u2014 8 \u00d7 ' + POD + ' vCPU pods, cpuManagerPolicy: static, no Topology Manager&lt;/text>';
s += '&lt;defs>&lt;marker id="cs-ah" markerWidth="6" markerHeight="5" refX="6" refY="2.5" orient="auto">&lt;polygon points="0 0, 6 2.5, 0 5" fill="' + C.interconnect + '"/>&lt;/marker>&lt;/defs>';
for (var sock = 0; sock &lt; 2; sock++) {
var sx = sock === 0 ? sx0 : sx1;
s += '&lt;rect x="' + sx + '" y="' + socketY + '" width="' + socketW + '" height="' + socketH + '" rx="8" fill="' + C.socketBg + '" stroke="' + C.socketBorder + '" stroke-width="2"/>';
s += '&lt;text x="' + (sx+12) + '" y="' + (socketY+22) + '" font-size="14" font-weight="700" fill="' + C.text + '">Socket ' + sock + ' (' + CCDS + ' CCDs)&lt;/text>';
s += '&lt;text x="' + (sx+12) + '" y="' + (socketY+38) + '" font-size="11" fill="' + C.textMuted + '">NUMA ' + sock + '&lt;/text>';
var ccdCols = 3, ccdRows = 2, ccdPadX = 12, ccdPadY = 48, ccdGap = 8;
var ccdW = (socketW - ccdPadX*2 - ccdGap*(ccdCols-1)) / ccdCols;
var ccdH = 100;
for (var r = 0; r &lt; ccdRows; r++) {
for (var c = 0; c &lt; ccdCols; c++) {
var ccdIdx = r * ccdCols + c;
var globalCcd = sock * CCDS + ccdIdx;
var cx = sx + ccdPadX + c * (ccdW + ccdGap);
var cy = socketY + ccdPadY + r * (ccdH + ccdGap);
var color = getCCDColor(sock, ccdIdx);
s += '&lt;rect x="' + cx + '" y="' + cy + '" width="' + ccdW + '" height="' + ccdH + '" rx="6" fill="' + color.bg + '" stroke="' + color.border + '" stroke-width="1.5"/>';
s += '&lt;text x="' + (cx+ccdW/2) + '" y="' + (cy+16) + '" text-anchor="middle" font-size="10" font-weight="600" fill="' + C.textMuted + '">CCD ' + globalCcd + '&lt;/text>';
var coreSize = 14, corePad = 3, coreCols = 4, coreRows = 2;
var coreBlockW = coreCols * (coreSize + corePad) - corePad;
var coreStartX = cx + (ccdW - coreBlockW) / 2;
var coreStartY = cy + 24;
for (var cr = 0; cr &lt; coreRows; cr++) {
for (var cc = 0; cc &lt; coreCols; cc++) {
var coreX = coreStartX + cc * (coreSize + corePad);
var coreY = coreStartY + cr * (coreSize + corePad);
s += '&lt;rect x="' + coreX + '" y="' + coreY + '" width="' + coreSize + '" height="' + coreSize + '" rx="2" fill="#fff" stroke="' + color.border + '" stroke-width="0.75"/>';
s += '&lt;text x="' + (coreX+coreSize/2) + '" y="' + (coreY+coreSize/2+3) + '" text-anchor="middle" font-size="6" fill="' + C.textMuted + '">C' + (cr*coreCols+cc+1) + '&lt;/text>';
}
}
s += '&lt;text x="' + (cx+ccdW/2) + '" y="' + (cy+ccdH-10) + '" text-anchor="middle" font-size="9" fill="' + C.textMuted + '">L3&lt;/text>';
}
}
var sharedX = sx + socketW - 90;
var sharedY = socketY + socketH - 30;
s += '&lt;rect x="' + sharedX + '" y="' + sharedY + '" width="78" height="20" rx="4" fill="' + C.sharedBg + '" stroke="' + C.sharedBorder + '" stroke-width="1" stroke-dasharray="3,2"/>';
s += '&lt;text x="' + (sharedX+39) + '" y="' + (sharedY+14) + '" text-anchor="middle" font-size="8" fill="' + C.textMuted + '">shared pool&lt;/text>';
}
var interX1 = sx0 + socketW + 4, interX2 = sx1 - 4;
var interY = socketY + socketH / 2;
s += '&lt;line x1="' + interX1 + '" y1="' + (interY-5) + '" x2="' + interX2 + '" y2="' + (interY-5) + '" stroke="' + C.interconnect + '" stroke-width="2" marker-end="url(#cs-ah)"/>';
s += '&lt;line x1="' + interX2 + '" y1="' + (interY+5) + '" x2="' + interX1 + '" y2="' + (interY+5) + '" stroke="' + C.interconnect + '" stroke-width="2" marker-end="url(#cs-ah)"/>';
s += '&lt;text x="' + ((interX1+interX2)/2) + '" y="' + (interY-14) + '" text-anchor="middle" font-size="10" font-weight="600" fill="' + C.interconnect + '">xGMI&lt;/text>';
var podBarY = socketY + socketH + 20;
var podH = 40;
var pxScale = socketW / VCPUS_PER_SOCKET;
for (var p = 0; p &lt; 4; p++) {
var px = sx0 + p * POD * pxScale;
var pw = POD * pxScale;
var pc = C.pods[p % 4];
s += '&lt;rect x="' + px + '" y="' + podBarY + '" width="' + pw + '" height="' + podH + '" rx="4" fill="' + pc.bg + '" stroke="' + pc.border + '" stroke-width="1.5"/>';
s += '&lt;text x="' + (px+pw/2) + '" y="' + (podBarY+16) + '" text-anchor="middle" font-size="11" font-weight="600" fill="' + pc.text + '">Pod ' + (p+1) + '&lt;/text>';
s += '&lt;text x="' + (px+pw/2) + '" y="' + (podBarY+30) + '" text-anchor="middle" font-size="9" fill="' + pc.text + '">' + POD + ' vCPU&lt;/text>';
}
var pod5Start = 4 * POD;
var pod5N0 = ALLOC - pod5Start;
var pod5N1 = POD - pod5N0;
var sc = C.pods[4];
var p5x0 = sx0 + pod5Start * pxScale;
var p5w0 = pod5N0 * pxScale;
s += '&lt;rect x="' + p5x0 + '" y="' + podBarY + '" width="' + p5w0 + '" height="' + podH + '" rx="4" fill="' + sc.bg + '" stroke="' + sc.border + '" stroke-width="2"/>';
s += '&lt;text x="' + (p5x0+p5w0/2) + '" y="' + (podBarY+16) + '" text-anchor="middle" font-size="9" font-weight="600" fill="' + sc.text + '">' + pod5N0 + '&lt;/text>';
s += '&lt;text x="' + (p5x0+p5w0/2) + '" y="' + (podBarY+30) + '" text-anchor="middle" font-size="8" fill="' + sc.text + '">vCPU&lt;/text>';
s += '&lt;text x="' + (p5x0+p5w0+4) + '" y="' + (podBarY+14) + '" font-size="9" fill="' + sc.text + '">\u2192 Pod 5&lt;/text>';
var reservedW = SYS_RESERVED * pxScale;
var reservedPx0 = sx0 + ALLOC * pxScale;
s += '&lt;rect x="' + reservedPx0 + '" y="' + podBarY + '" width="' + reservedW + '" height="' + podH + '" rx="4" fill="' + C.sharedBg + '" stroke="' + C.sharedBorder + '" stroke-width="1" stroke-dasharray="3,2"/>';
s += '&lt;text x="' + (reservedPx0+reservedW/2) + '" y="' + (podBarY+24) + '" text-anchor="middle" font-size="8" fill="' + C.textMuted + '">sys&lt;/text>';
var p5x1 = sx1;
var p5w1 = pod5N1 * pxScale;
s += '&lt;rect x="' + p5x1 + '" y="' + podBarY + '" width="' + p5w1 + '" height="' + podH + '" rx="4" fill="' + sc.bg + '" stroke="' + sc.border + '" stroke-width="2"/>';
s += '&lt;text x="' + (p5x1+p5w1/2) + '" y="' + (podBarY+16) + '" text-anchor="middle" font-size="9" font-weight="600" fill="' + sc.text + '">' + pod5N1 + '&lt;/text>';
s += '&lt;text x="' + (p5x1+p5w1/2) + '" y="' + (podBarY+30) + '" text-anchor="middle" font-size="8" fill="' + sc.text + '">vCPU&lt;/text>';
for (var p = 0; p &lt; 3; p++) {
var px = sx1 + pod5N1 * pxScale + p * POD * pxScale;
var pw = POD * pxScale;
var pc = C.pods[p % 4];
s += '&lt;rect x="' + px + '" y="' + podBarY + '" width="' + pw + '" height="' + podH + '" rx="4" fill="' + pc.bg + '" stroke="' + pc.border + '" stroke-width="1.5"/>';
s += '&lt;text x="' + (px+pw/2) + '" y="' + (podBarY+16) + '" text-anchor="middle" font-size="11" font-weight="600" fill="' + pc.text + '">Pod ' + (p+6) + '&lt;/text>';
s += '&lt;text x="' + (px+pw/2) + '" y="' + (podBarY+30) + '" text-anchor="middle" font-size="9" fill="' + pc.text + '">' + POD + ' vCPU&lt;/text>';
}
var usedN1 = pod5N1 + 3 * POD;
var sharedN1 = ALLOC - usedN1;
var sharedPx = sx1 + usedN1 * pxScale;
var sharedPw = sharedN1 * pxScale;
s += '&lt;rect x="' + sharedPx + '" y="' + podBarY + '" width="' + sharedPw + '" height="' + podH + '" rx="4" fill="' + C.sharedBg + '" stroke="' + C.sharedBorder + '" stroke-width="1" stroke-dasharray="3,2"/>';
s += '&lt;text x="' + (sharedPx+sharedPw/2) + '" y="' + (podBarY+18) + '" text-anchor="middle" font-size="8" fill="' + C.textMuted + '">shared&lt;/text>';
s += '&lt;text x="' + (sharedPx+sharedPw/2) + '" y="' + (podBarY+30) + '" text-anchor="middle" font-size="8" fill="' + C.textMuted + '">' + sharedN1 + ' vCPU&lt;/text>';
var reservedPx1 = sx1 + ALLOC * pxScale;
s += '&lt;rect x="' + reservedPx1 + '" y="' + podBarY + '" width="' + reservedW + '" height="' + podH + '" rx="4" fill="' + C.sharedBg + '" stroke="' + C.sharedBorder + '" stroke-width="1" stroke-dasharray="3,2"/>';
s += '&lt;text x="' + (reservedPx1+reservedW/2) + '" y="' + (podBarY+24) + '" text-anchor="middle" font-size="8" fill="' + C.textMuted + '">sys&lt;/text>';
var calloutY = podBarY + podH + 18;
var calloutX = (sx0 + socketW + sx1) / 2;
s += '&lt;rect x="' + (calloutX-100) + '" y="' + calloutY + '" width="200" height="40" rx="6" fill="' + sc.bg + '" stroke="' + sc.border + '" stroke-width="2"/>';
s += '&lt;text x="' + calloutX + '" y="' + (calloutY+16) + '" text-anchor="middle" font-size="11" font-weight="700" fill="' + sc.text + '">CROSS-SOCKET&lt;/text>';
s += '&lt;text x="' + calloutX + '" y="' + (calloutY+30) + '" text-anchor="middle" font-size="9" fill="' + sc.text + '">Pod 5: ' + pod5N0 + ' vCPU NUMA 0 + ' + pod5N1 + ' vCPU NUMA 1&lt;/text>';
s += '&lt;line x1="' + (p5x0+p5w0/2) + '" y1="' + (podBarY+podH) + '" x2="' + (calloutX-40) + '" y2="' + calloutY + '" stroke="' + sc.border + '" stroke-width="1" stroke-dasharray="4,3"/>';
s += '&lt;line x1="' + (p5x1+p5w1/2) + '" y1="' + (podBarY+podH) + '" x2="' + (calloutX+40) + '" y2="' + calloutY + '" stroke="' + sc.border + '" stroke-width="1" stroke-dasharray="4,3"/>';
var annoY = calloutY + 54;
s += '&lt;text x="' + (sx0+socketW/2) + '" y="' + annoY + '" text-anchor="middle" font-size="10" fill="' + C.textMuted + '">NUMA 0: 4 \u00d7 ' + POD + ' = ' + (4*POD) + ' pinned | ' + pod5N0 + ' \u2192 Pod 5 | ' + SYS_RESERVED + ' sys-reserved&lt;/text>';
s += '&lt;text x="' + (sx1+socketW/2) + '" y="' + annoY + '" text-anchor="middle" font-size="10" fill="' + C.textMuted + '">NUMA 1: ' + pod5N1 + ' (Pod 5) + 3 \u00d7 ' + POD + ' = ' + usedN1 + ' pinned | ' + sharedN1 + ' shared | ' + SYS_RESERVED + ' sys-reserved&lt;/text>';
var ddrY = annoY + 16;
s += '&lt;text x="' + (sx0+socketW/2) + '" y="' + (ddrY+20) + '" text-anchor="middle" font-size="13" font-weight="700" fill="' + C.ddr + '">\u2191 DDRs&lt;/text>';
s += '&lt;text x="' + (sx1+socketW/2) + '" y="' + (ddrY+20) + '" text-anchor="middle" font-size="13" font-weight="700" fill="' + C.ddr + '">\u2191 DDRs&lt;/text>';
var legY = ddrY + 40;
var legItems = [
{ c: C.pods[0], l: 'Pods 1, 6 \u2014 fully NUMA-local' },
{ c: C.pods[1], l: 'Pods 2, 7 \u2014 fully NUMA-local' },
{ c: C.pods[2], l: 'Pods 3, 8 \u2014 fully NUMA-local' },
{ c: C.pods[3], l: 'Pod 4 \u2014 fully NUMA-local' },
{ c: C.pods[4], l: 'Pod 5 \u2014 CROSS-SOCKET' }
];
var legItemW = 260;
for (var i = 0; i &lt; legItems.length; i++) {
var lx = sx0 + (i % 3) * legItemW;
var ly = legY + Math.floor(i / 3) * 22;
s += '&lt;rect x="' + lx + '" y="' + ly + '" width="14" height="14" rx="3" fill="' + legItems[i].c.bg + '" stroke="' + legItems[i].c.border + '" stroke-width="1.5"/>';
s += '&lt;text x="' + (lx+20) + '" y="' + (ly+12) + '" font-size="10" fill="' + C.text + '">' + legItems[i].l + '&lt;/text>';
}
var slx = sx0 + 2 * legItemW, sly = legY + 22;
s += '&lt;rect x="' + slx + '" y="' + sly + '" width="14" height="14" rx="3" fill="' + C.sharedBg + '" stroke="' + C.sharedBorder + '" stroke-width="1" stroke-dasharray="3,2"/>';
s += '&lt;text x="' + (slx+20) + '" y="' + (sly+12) + '" font-size="10" fill="' + C.text + '">Shared CPU pool (fractional sidecars + daemons)&lt;/text>';
var rcY = legY + 56;
s += '&lt;text x="' + (W/2) + '" y="' + rcY + '" text-anchor="middle" font-size="10" fill="' + C.textMuted + '">cpuManagerPolicy: static fills NUMA 0 sequentially. After 4 pods \u00d7 ' + POD + ' = ' + (4*POD) + ' vCPU, only ' + pod5N0 + ' pinnable vCPUs remain in NUMA 0.&lt;/text>';
s += '&lt;text x="' + (W/2) + '" y="' + (rcY+16) + '" text-anchor="middle" font-size="10" fill="' + C.textMuted + '">The 5th pod\u2019s ' + POD + ' vCPU allocation overflows into NUMA 1 (' + pod5N0 + ' + ' + pod5N1 + '). Without Topology Manager, there is no admission-time rejection.&lt;/text>';
s += '&lt;text x="' + (W/2) + '" y="' + (rcY+32) + '" text-anchor="middle" font-size="10" font-weight="600" fill="' + C.text + '">Fix: topologyManagerPolicy: single-numa-node&lt;/text>';
s += '&lt;/svg>';
el.innerHTML = s;
function getCCDColor(socket, ccdInSocket) {
if (socket === 0) {
if (ccdInSocket &lt; 1) return C.pods[0];
if (ccdInSocket &lt; 2) return C.pods[1];
if (ccdInSocket &lt; 3) return C.pods[1];
if (ccdInSocket &lt; 4) return C.pods[2];
if (ccdInSocket &lt; 5) return C.pods[3];
return C.pods[4];
} else {
if (ccdInSocket &lt; 1) return C.pods[4];
if (ccdInSocket &lt; 2) return C.pods[0];
if (ccdInSocket &lt; 3) return C.pods[1];
if (ccdInSocket &lt; 4) return C.pods[1];
if (ccdInSocket &lt; 5) return C.pods[2];
return C.pods[2];
}
}
})();
&lt;/script>
&lt;p>Without Topology Manager enforcement, the kubelet is free to allocate a pod&amp;rsquo;s CPUs from more than one NUMA node. Pod 5 starts and runs without errors, but its performance is degraded by cross-socket traffic, and nothing in Kubernetes will tell you about this.&lt;/p>
&lt;p>With &lt;code>topologyManagerPolicy: single-numa-node&lt;/code>, the system keeps the allocation bounded within one NUMA node. In this scenario, NUMA 1 still has 90 vCPUs free, so Pod 5 would land there entirely. &lt;code>TopologyAffinityError&lt;/code> only fires when no single NUMA node can satisfy the request.&lt;/p>
&lt;h2 id="failure-modes-to-be-aware-of">Failure Modes to Be Aware Of&lt;/h2>
&lt;p>Both &lt;code>cpuManagerPolicyOptions: full-pcpus-only=true&lt;/code> and &lt;code>topologyManagerPolicy: single-numa-node&lt;/code> introduce hard failure modes that are worth understanding before enabling them.&lt;/p>
&lt;h3 id="smtalignmenterror">SMTAlignmentError&lt;/h3>
&lt;p>When &lt;code>full-pcpus-only&lt;/code> is enabled, the kubelet rejects any container that would receive exclusive CPUs but does not request a multiple of the SMT thread count, typically 2. The pod goes into &lt;code>Failed&lt;/code> state with an &lt;code>SMTAlignmentError&lt;/code> and stays there until someone deletes it. Workload controllers (Deployments, StatefulSets) will recreate the pod, but the replacement hits the same error on any node where &lt;code>full-pcpus-only&lt;/code> is in effect.&lt;/p>
&lt;h3 id="topologyaffinityerror">TopologyAffinityError&lt;/h3>
&lt;p>When &lt;code>topologyManagerPolicy: single-numa-node&lt;/code> is enabled, the kubelet rejects any pod whose containers&amp;rsquo; hinted resource requests can&amp;rsquo;t be satisfied from a single NUMA node. With &lt;code>topologyManagerScope: pod&lt;/code>, that check applies to the pod&amp;rsquo;s effective request. The sequence is:&lt;/p>
&lt;ol>
&lt;li>The scheduler picks a node based on aggregate resource availability&lt;/li>
&lt;li>The kubelet receives the pod and runs topology admission&lt;/li>
&lt;li>The topology manager collects hints from the CPU manager, device manager, and memory manager&lt;/li>
&lt;li>If no single NUMA node can satisfy all resources, the pod is rejected with &lt;code>TopologyAffinityError&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>Same failure semantics as &lt;code>SMTAlignmentError&lt;/code>: the pod is &lt;code>Failed&lt;/code> and the scheduler won&amp;rsquo;t retry. The confusing part is that the node has sufficient aggregate capacity, but the pod still fails because no individual NUMA node has enough room. If you&amp;rsquo;re not thinking in NUMA terms, this is disorienting.&lt;/p>
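&lt;p>Both rejections are easy to miss unless you look for them. A minimal sketch for spotting them, assuming the rejection reason appears in the &lt;code>STATUS&lt;/code> column of &lt;code>kubectl get pods -A&lt;/code> output; &lt;code>filter_admission_rejections&lt;/code> is a hypothetical helper name, not part of kubectl:&lt;/p>
&lt;pre>&lt;code class="language-bash"># Hypothetical helper: keep only pods the kubelet rejected at admission.
# Assumes input shaped like `kubectl get pods -A --no-headers`:
# NAMESPACE NAME READY STATUS RESTARTS AGE
filter_admission_rejections() {
  awk '$4 == "TopologyAffinityError" || $4 == "SMTAlignmentError" { print $1 "/" $2, $4 }'
}

# Usage on a live cluster:
# kubectl get pods -A --no-headers | filter_admission_rejections
&lt;/code>&lt;/pre>
&lt;p>Each output line is &lt;code>namespace/name&lt;/code> followed by the rejection reason, which is enough to feed an alert or a cleanup job.&lt;/p>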
&lt;p>&lt;code>cpuManagerPolicy: static&lt;/code> on its own doesn&amp;rsquo;t introduce these failure modes. They come from the additional constraints of &lt;code>full-pcpus-only&lt;/code> and &lt;code>single-numa-node&lt;/code>. Both are node-level kubelet settings that apply to every pod on the node, which means enabling &lt;code>single-numa-node&lt;/code> can break existing workloads that don&amp;rsquo;t fit in a single NUMA node. Dedicated node pools for NUMA-aligned workloads are a practical way to contain that impact.&lt;/p>
&lt;h2 id="topology-aware-scheduling">Topology-Aware Scheduling&lt;/h2>
&lt;p>A practical problem with &lt;code>single-numa-node&lt;/code> is that the default Kubernetes scheduler sees only aggregate node resources. It doesn&amp;rsquo;t know that a node&amp;rsquo;s 60 free vCPUs are split 20/40 across two NUMA nodes. The scheduler can place a pod on a node, only for the kubelet to reject it at admission. The workload controller then creates a replacement, which may fail on the next node too.&lt;/p>
&lt;p>The
&lt;a href="https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/noderesourcetopology" target="_blank" rel="noopener">&lt;code>NodeResourceTopologyMatch&lt;/code> scheduler plugin&lt;/a> reduces this gap. It gives the scheduler per-NUMA-node resource visibility, so it can filter out nodes that can&amp;rsquo;t satisfy the topology constraints before placing the pod.&lt;/p>
&lt;p>Deploying it requires:&lt;/p>
&lt;ul>
&lt;li>A cluster-scoped &lt;code>NodeResourceTopology&lt;/code> CRD and one &lt;code>NodeResourceTopology&lt;/code> custom resource per node&lt;/li>
&lt;li>A topology exporter DaemonSet (such as
&lt;a href="https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/introduction.html" target="_blank" rel="noopener">NFD Topology Updater&lt;/a>) on every node, polling the kubelet&amp;rsquo;s PodResources API and publishing per-NUMA resource availability&lt;/li>
&lt;li>The &lt;code>NodeResourceTopologyMatch&lt;/code> scheduler plugin configured as a filter and scorer&lt;/li>
&lt;/ul>
&lt;p>That&amp;rsquo;s additional infrastructure for the platform team: a DaemonSet, a cluster-scoped CRD, one custom resource per node refreshed roughly every 60 seconds, and a scheduler plugin with its own cache. Without it, pods repeatedly fail on nodes that look like they have enough capacity.&lt;/p>
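&lt;p>The scheduler side of that list is mostly configuration. A minimal sketch of a scheduler profile with the plugin enabled as filter and scorer, assuming a secondary scheduler built from the scheduler-plugins repository; the &lt;code>schedulerName&lt;/code> value is a hypothetical choice, and plugin arguments are left at their defaults:&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: topology-aware-scheduler   # hypothetical name
  plugins:
    filter:
      enabled:
      - name: NodeResourceTopologyMatch
    score:
      enabled:
      - name: NodeResourceTopologyMatch
&lt;/code>&lt;/pre>
&lt;p>With a secondary scheduler, pods opt in by setting &lt;code>spec.schedulerName&lt;/code> to the same value.&lt;/p>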
&lt;h2 id="making-it-work-collaboration-between-platform-and-workload-teams">Making It Work: Collaboration Between Platform and Workload Teams&lt;/h2>
&lt;p>Getting NUMA alignment right is not something either platform admins or workload owners can do alone. It requires collaboration and shared understanding.&lt;/p>
&lt;h3 id="platform-admins-publish-topology-and-sizing-guidance">Platform admins: publish topology and sizing guidance&lt;/h3>
&lt;p>With NUMA alignment, platform admins can&amp;rsquo;t just hand out node pools and let workload owners request whatever CPU/memory they want. They need to publish clear guidance:&lt;/p>
&lt;ul>
&lt;li>What SKU each node pool uses, and its NUMA geometry (cores per NUMA node, NPS mode)&lt;/li>
&lt;li>How many vCPUs are consumed by system-reserved and kube-reserved&lt;/li>
&lt;li>Recommended container sizes that align with NUMA boundaries&lt;/li>
&lt;li>What constraints are in effect (&lt;code>full-pcpus-only&lt;/code>, &lt;code>single-numa-node&lt;/code>, and any non-default topology manager scope) and what failure modes they introduce&lt;/li>
&lt;/ul>
&lt;p>For example, on a 2-socket machine with 48 cores per socket in NPS1 mode: each NUMA node has 96 vCPUs, about 90 allocatable after reservations, with 4 GPUs per NUMA node. The recommended CPU request per GPU might be 22 vCPUs (22 × 4 = 88, fitting within the 90 available). If the fleet has multiple SKUs with different core counts or NPS configurations, this becomes a matrix of recommendations.&lt;/p>
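&lt;p>The arithmetic behind that recommendation can be sketched as a small shell function. This is an illustration using the numbers above, not a platform tool; &lt;code>cpus_per_gpu&lt;/code> and its arguments are hypothetical names:&lt;/p>
&lt;pre>&lt;code class="language-bash"># Hypothetical helper: largest per-GPU CPU request that keeps every GPU's
# share inside one NUMA node and is a multiple of the SMT thread count.
# Args: allocatable vCPUs per NUMA node, GPUs per NUMA node, SMT threads (default 2).
cpus_per_gpu() {
  local allocatable=$1 gpus=$2 smt=${3:-2}
  local share=$(( allocatable / gpus ))   # integer division
  echo $(( share - share % smt ))         # round down to whole physical cores
}

cpus_per_gpu 90 4   # prints 22
&lt;/code>&lt;/pre>
&lt;p>Rerun it per SKU: with 94 allocatable vCPUs the raw share is 23, which rounds down to 22 so that &lt;code>full-pcpus-only&lt;/code> still admits the container.&lt;/p>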
&lt;h3 id="workload-owners-understand-the-constraints-size-accordingly">Workload owners: understand the constraints, size accordingly&lt;/h3>
&lt;p>GPU workloads are inherently more hardware-aware than typical Kubernetes workloads. Unlike a stateless web service where you declare CPU and memory and let the platform figure out placement, GPU inference and training benefit from understanding the machine topology.&lt;/p>
&lt;p>This means:&lt;/p>
&lt;ul>
&lt;li>Sizing containers to fit within a NUMA node based on the platform&amp;rsquo;s published guidance (often it&amp;rsquo;s better to run multiple smaller pods, each NUMA-local, than one larger pod that spans NUMA nodes)&lt;/li>
&lt;li>Using even CPU values when &lt;code>full-pcpus-only&lt;/code> is in effect&lt;/li>
&lt;li>Ensuring the pod is Guaranteed QoS (requests == limits on all containers)&lt;/li>
&lt;li>Updating container sizes when the platform migrates to different SKUs&lt;/li>
&lt;/ul>
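&lt;p>Put together, a pod that follows this checklist might look like the sketch below. The pod name, image, memory size, and the &lt;code>nvidia.com/gpu&lt;/code> resource name are illustrative assumptions; the CPU value reuses the 22 vCPU example from the sizing guidance above.&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: inference-worker          # hypothetical name
spec:
  containers:
  - name: server
    image: registry.example.com/inference-server:latest   # placeholder image
    resources:
      requests:
        cpu: "22"                 # integer and even, so full-pcpus-only admits it
        memory: 64Gi              # assumed size; must fit the NUMA node's memory
        nvidia.com/gpu: 1
      limits:
        cpu: "22"                 # requests == limits on every container: Guaranteed QoS
        memory: 64Gi
        nvidia.com/gpu: 1
&lt;/code>&lt;/pre>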
&lt;h3 id="verifying-alignment-in-practice">Verifying alignment in practice&lt;/h3>
&lt;p>At the node level, start by confirming the hardware topology:&lt;/p>
&lt;pre>&lt;code class="language-bash">lscpu -e=CPU,CORE,SOCKET,NODE
numactl -H
nvidia-smi topo -m # NVIDIA GPU nodes
&lt;/code>&lt;/pre>
&lt;p>Inside a running container, check the workload process&amp;rsquo;s CPU affinity and compare it with the node&amp;rsquo;s &lt;code>lscpu&lt;/code> output to confirm the allowed CPUs sit within the expected NUMA node:&lt;/p>
&lt;pre>&lt;code class="language-bash">kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- taskset -cp 1
# If taskset is unavailable:
kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- grep Cpus_allowed_list /proc/1/status
&lt;/code>&lt;/pre>
&lt;p>&lt;code>taskset -cp 1&lt;/code> checks PID 1 in the container. If your workload runs as a different PID, check that process instead.&lt;/p>
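&lt;p>Comparing CPU lists by eye gets tedious on large machines. A rough sketch of the comparison in shell, assuming kernel-style CPU lists such as &lt;code>0-47,96-143&lt;/code>; &lt;code>expand_cpulist&lt;/code> and &lt;code>cpus_within&lt;/code> are hypothetical helper names:&lt;/p>
&lt;pre>&lt;code class="language-bash"># Expand a kernel CPU list such as "0-3,96-99" into one CPU id per line.
expand_cpulist() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r start end; do
    seq "$start" "${end:-$start}"
  done
}

# Succeed only if every CPU in the first list also appears in the second.
cpus_within() {
  local node_cpus cpu
  node_cpus=" $(expand_cpulist "$2" | tr '\n' ' ')"
  for cpu in $(expand_cpulist "$1"); do
    case "$node_cpus" in
      *" $cpu "*) ;;
      *) echo "CPU $cpu is outside the NUMA node"; return 1 ;;
    esac
  done
  return 0
}

# Usage, with NUMA 0's CPU list read on the node itself:
# allowed="$(kubectl exec "$POD" -c "$CONTAINER" -- grep Cpus_allowed_list /proc/1/status | awk '{ print $2 }')"
# cpus_within "$allowed" "$(cat /sys/devices/system/node/node0/cpulist)"
&lt;/code>&lt;/pre>
&lt;p>A non-zero exit status names the first CPU that falls outside the expected NUMA node, which is the signature of a cross-socket allocation like Pod 5 above.&lt;/p>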
&lt;h2 id="appendix-related-cpu-manager-options">Appendix: Related CPU Manager Options&lt;/h2>
&lt;p>Kubernetes has several
&lt;a href="https://kubernetes.io/docs/concepts/workloads/resource-managers/#cpu-policy-static--options" target="_blank" rel="noopener">CPU manager policy options&lt;/a> adjacent to the path described above.&lt;/p>
&lt;ul>
&lt;li>&lt;code>strict-cpu-reservation&lt;/code> keeps regular workloads off CPUs reserved for the OS and Kubernetes daemons, which helps reduce system noise on pinned workloads.&lt;/li>
&lt;li>&lt;code>prefer-align-cpus-by-uncorecache&lt;/code> is a best-effort cache-locality option that tries to keep a container&amp;rsquo;s CPUs within the same L3 or uncore cache group.&lt;/li>
&lt;li>&lt;code>align-by-socket&lt;/code> is useful when a container is too large to fit in one NUMA node and must use multiple NUMA nodes. It asks CPU Manager to keep that allocation within one socket when possible.&lt;/li>
&lt;/ul>
&lt;p>These can improve isolation or latency, but they do not replace &lt;code>topologyManagerPolicy: single-numa-node&lt;/code> for keeping a GPU workload NUMA-local. &lt;code>align-by-socket&lt;/code> is also not compatible with &lt;code>single-numa-node&lt;/code>.&lt;/p>
&lt;h2 id="dra-and-future-direction">DRA and Future Direction&lt;/h2>
&lt;p>The Kubernetes
&lt;a href="https://github.com/kubernetes-sigs/dra-driver-cpu" target="_blank" rel="noopener">DRA (Dynamic Resource Allocation) CPU driver&lt;/a> is interesting because it may allow NUMA-aware CPU placement to happen through the scheduling layer, without some of the post-scheduling admission issues described above. I haven&amp;rsquo;t explored it deeply enough to recommend it here. I&amp;rsquo;ll write a follow-up after I spend more time with it.&lt;/p>
&lt;h2 id="wrapping-up">Wrapping Up&lt;/h2>
&lt;p>There&amp;rsquo;s a clear progression of CPU isolation in Kubernetes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Level&lt;/th>
&lt;th>Config&lt;/th>
&lt;th>What you get&lt;/th>
&lt;th>What it requires&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>&lt;code>cpuManagerPolicy: static&lt;/code>&lt;/td>
&lt;td>Dedicated logical cores, reduced CPU contention&lt;/td>
&lt;td>Guaranteed QoS, integer CPU requests&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>+ &lt;code>full-pcpus-only=true&lt;/code>&lt;/td>
&lt;td>Full physical cores, L1/L2 cache isolation&lt;/td>
&lt;td>Even CPU request values&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>+ &lt;code>topologyManagerPolicy: single-numa-node&lt;/code> and &lt;code>memoryManagerPolicy: Static&lt;/code>&lt;/td>
&lt;td>CPU, GPU, and memory admitted only if they fit one NUMA node&lt;/td>
&lt;td>Critical container fits in a NUMA node, device plugin topology hints, &lt;code>reservedMemory&lt;/code>, topology-aware scheduler, sizing guidance from platform team&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Each level introduces stricter constraints in exchange for stronger performance isolation. For GPU inference, where the CPU is directly on the data path to the GPU, best-effort alignment is not always good enough. If a pod is misaligned, Kubernetes will not tell you, and the only symptom may be worse tail latency. For consistency, a hard failure like &lt;code>TopologyAffinityError&lt;/code> is often better than silently serving degraded traffic.&lt;/p>
&lt;p>Getting it right takes effort from both sides: platform teams publishing topology guidance and workload owners sizing containers to match. It is more work than treating compute as a black box, but GPU workloads typically need to be more aware of the underlying hardware than ordinary services.&lt;/p></description></item></channel></rss>