Keeping GPU Workloads NUMA-Local in Kubernetes

“NUMA alignment” comes up frequently in GPU infrastructure discussions, but concepts like NUMA nodes, topology policies, and CPU pinning are often assumed rather than well understood. Getting it right is as much the platform engineer’s job as the workload owner’s.

This post isn’t a comprehensive guide to NUMA architecture. It’s a practical account of what happens when you align CPU and GPU resources on Kubernetes nodes: the levels of isolation Kubernetes offers, the gotchas, and what it takes to make it work. My experience is on AMD EPYC hardware. Intel has analogous concepts (Sub-NUMA Clustering instead of NPS, UPI instead of Infinity Fabric), but I haven’t worked with Intel in this context, so I’ll stick to what I know.

Key Terminology

If you’re already familiar with CPU cache hierarchies, sockets, and PCIe, skip ahead. Otherwise, here is the shared vocabulary used throughout the post.

CPU Socket: The physical slot on a motherboard that holds a processor. Multi-socket servers (commonly 2-socket) have multiple processors, each with its own cores and local memory.

Physical Core vs. Logical Core: A physical core is a single processing unit on the CPU die. With SMT (“hyperthreading” on Intel), each physical core presents as 2 logical cores that share the core’s execution resources and caches.

L1/L2 Cache: Small, fast caches private to each physical core (L1 is smaller and faster than L2). Two containers sharing a physical core, one logical core each, compete for the same L1/L2 space.

L3 Cache (Last-Level Cache): A larger cache shared among a group of cores. On AMD EPYC, it’s shared within a Core Complex (CCX) of typically 8 cores; on recent generations a Core Complex Die (CCD) usually holds one CCX. Cores sharing an L3 cache can exchange data quickly through it.

Interconnect: The high-speed link between CPU sockets (Infinity Fabric on AMD, UPI on Intel). Accessing memory locally is faster than going cross-socket over the interconnect.

PCIe: The bus connecting CPUs to devices like GPUs and NICs. Each PCIe root complex is wired to a specific CPU socket, so a GPU is physically closer to one socket than another.

DMA: Lets devices like GPUs read from and write to system memory directly, without the CPU copying data byte-by-byte. If the data sits in memory attached to a different NUMA node than the GPU, the DMA read crosses the interconnect.

What is NUMA?

NUMA (Non-Uniform Memory Access) describes a memory architecture where the time it takes a CPU core to access memory depends on where that memory physically sits relative to the core.

In a 2-socket server, each socket has its own local memory. A core on socket 0 can access memory attached to socket 0 quickly (local access), but accessing memory attached to socket 1 requires crossing the interconnect, which is slower. On AMD EPYC hardware, cross-socket memory access can incur roughly 3x the latency of local access.

NUMA doesn’t only exist across sockets, though. On AMD EPYC processors, a BIOS setting called NPS (Nodes Per Socket) controls how many NUMA domains a single socket is divided into:

  • NPS1: Each socket is one NUMA node. A 2-socket machine has 2 NUMA nodes.
  • NPS2: Each socket is split into 2 NUMA nodes. A 2-socket machine has 4 NUMA nodes.
  • NPS4: Each socket is split into 4 NUMA nodes. A 2-socket machine has 8 NUMA nodes.
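The arithmetic behind that list is small enough to sketch. This is a minimal illustration that assumes cores split evenly across the NUMA nodes of a socket, which may not hold for every SKU; the function name `numa_layout` and the 48-core figure (borrowed from the example later in this post) are only for illustration.

```python
def numa_layout(sockets: int, cores_per_socket: int, nps: int, smt: int = 2) -> dict:
    """Return NUMA node count and logical CPUs per NUMA node for an NPS mode.

    Assumption: cores are divided evenly across the NUMA nodes of a socket.
    """
    if nps not in (1, 2, 4):
        raise ValueError("NPS must be 1, 2, or 4")
    vcpus_per_socket = cores_per_socket * smt
    if vcpus_per_socket % nps:
        raise ValueError("logical CPUs do not divide evenly across NUMA nodes")
    return {
        "numa_nodes": sockets * nps,
        "vcpus_per_numa_node": vcpus_per_socket // nps,
    }


# A 2-socket machine with 48 cores per socket and SMT enabled.
for nps in (1, 2, 4):
    print(f"NPS{nps}:", numa_layout(sockets=2, cores_per_socket=48, nps=nps))
```

The point of the sketch is that the same hardware yields 96, 48, or 24 logical CPUs per NUMA node depending on a BIOS setting, which directly changes the largest container that can stay NUMA-local.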

The interactive diagram below shows a simplified view of a 2-socket AMD EPYC machine. Toggle between NPS modes to see how the NUMA boundaries change. In NPS1, each socket is one NUMA node. In NPS2 and NPS4, each socket is further subdivided. The CCD and GPU placement is illustrative, not a promise about every SKU.

The key point: NUMA topology is a function of both the hardware and how it’s configured. You can’t assume a fixed number of NUMA nodes across your fleet unless you control the BIOS settings. Different SKUs have different core counts, different numbers of CCDs, and different NPS configurations, all of which change the NUMA geometry.

Why NUMA Matters for GPU Workloads

GPU inference often follows a CPU-GPU pipeline: the CPU prepares requests, batches them, copies input data into the right format, feeds data to the GPU over PCIe, and then handles postprocessing on the GPU’s output. The GPU does the heavy computation, but it’s the CPU that keeps the pipeline fed.

When a container’s CPUs are on a different NUMA node than its GPU, moving input data from CPU memory to GPU memory may cross a NUMA boundary. The GPU has to read data from memory attached to a farther-away CPU socket instead of memory attached to its local socket. This adds latency on the critical path.

In one inference workload, we observed more than 30% higher p99 tail latency under load for pods whose CPUs spanned both sockets compared with pods whose CPUs stayed on the same socket. For a latency-sensitive service, that is enough to matter, and it happens silently unless someone is explicitly monitoring NUMA alignment. Nothing in Kubernetes surfaces it. The pod is running, serving traffic, and looking healthy, just consistently slower than its peers.

Training workloads are affected too, though the impact profile is different. Data loading workers continuously preprocess batches on CPU and stage them for GPU consumption. Cross-NUMA data loaders contend for inter-socket bandwidth and add latency to every batch transfer. PyTorch’s own performance tuning guide explicitly recommends binding training processes to a single NUMA node.

For GPU workloads where the CPU is on the data path to the GPU, NUMA locality has a direct and measurable impact on performance.

Increasing Levels of CPU Isolation and NUMA Alignment

Kubernetes offers several knobs for CPU isolation, each providing stronger guarantees at the cost of more constraints.

Level 1: Logical Core Pinning with cpuManagerPolicy: static

By default, Kubernetes lets the operating system’s CPU scheduler move a container’s processes across any available core. This is efficient for overall CPU utilization, but it means your container’s threads may move across cores, invalidating caches and sharing physical cores with other containers.

Setting cpuManagerPolicy: static in the kubelet config changes this. Containers in Guaranteed QoS pods (where requests == limits) with integer CPU requests get exclusive, pinned logical cores. Kubernetes won’t assign those exclusive CPUs to another container, and your processes stay put. Host daemons and kernel threads can still run there unless the platform also reserves or isolates CPUs for the OS.

The way kubelet pins CPUs is by constraining the container’s cpuset cgroup to the assigned CPU list. The assigned cores can be seen in cpuset.cpus on cgroup v1 and cpuset.cpus.effective on cgroup v2.
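The kernel reports those files as a compact CPU list such as `0-3,48-51`. If you want to script checks against it, a small parser helps. This is a sketch; `parse_cpu_list` is a hypothetical helper name, and it handles only comma-separated ids and ranges.

```python
def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,48-51' into sorted CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means an empty cpuset
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# Example input as it might appear in cpuset.cpus.effective (values are made up).
print(parse_cpu_list("2-5,50-53\n"))
```

Returning a sorted list keeps the result easy to diff against the node's topology output.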

On the node, the exact path depends on the cgroup driver, runtime, QoS class, and pod UID formatting. With systemd-style kubepods slices, the files are roughly located here:

KUBEPODS="<kubepods.slice/.../kubepods-pod<uid>.slice/<container>.scope>"

# cgroup v1
cat /sys/fs/cgroup/cpuset/$KUBEPODS/cpuset.cpus

# cgroup v2
cat /sys/fs/cgroup/$KUBEPODS/cpuset.cpus.effective

This alone improves performance consistency. Cache affinity improves because threads aren’t migrating across cores, and container-to-container CPU contention is reduced. But there’s a subtlety: you’re pinning logical cores (hyperthreads), not physical cores. Two containers can still end up sharing a physical core if one gets one hyperthread and the other gets the sibling. They’ll contend for that physical core’s L1 and L2 cache.
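To make the sibling-sharing case concrete, here is a small sketch that finds physical cores used by more than one container. The CPU-to-core mapping and the two cpusets are hypothetical; on a real node the mapping would come from `lscpu` or the sysfs topology files.

```python
from collections import defaultdict


def shared_physical_cores(
    core_of: dict[int, int], cpusets: dict[str, set[int]]
) -> dict[int, set[str]]:
    """Return {physical core: container names} for cores used by 2+ containers."""
    owners: dict[int, set[str]] = defaultdict(set)
    for name, cpus in cpusets.items():
        for cpu in cpus:
            owners[core_of[cpu]].add(name)
    return {core: names for core, names in owners.items() if len(names) > 1}


# Hypothetical 4-core SMT-2 layout: logical CPU n and n+4 are siblings on core n.
core_of = {cpu: cpu % 4 for cpu in range(8)}

# Two 3-CPU containers: each gets one full core plus one thread of core 1.
cpusets = {"container-a": {0, 4, 1}, "container-b": {5, 2, 6}}
print(shared_physical_cores(core_of, cpusets))
```

In this made-up layout both containers hold one hyperthread of core 1, so they compete for that core's L1 and L2 cache even though each logical CPU is exclusively assigned.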

What it requires from workload owners: Set requests == limits for CPU and memory on all containers (including init containers and sidecars) to get Guaranteed QoS. CPU requests must be integers for the containers where pinning is desired.
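Those two conditions can be checked before deploying. The sketch below is a simplification and not the kubelet's own logic: it expects requests and limits to be spelled out explicitly, compares memory quantities as plain strings, and uses hypothetical names (`cpu_quantity`, `pinning_status`).

```python
from fractions import Fraction


def cpu_quantity(quantity: str) -> Fraction:
    """Parse a Kubernetes CPU quantity like '2', '0.5', or '500m' into cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def is_guaranteed(container: dict) -> bool:
    """True if CPU and memory requests are both set and equal to the limits."""
    requests = container.get("requests", {})
    limits = container.get("limits", {})
    if not all(r in requests and r in limits for r in ("cpu", "memory")):
        return False
    same_cpu = cpu_quantity(requests["cpu"]) == cpu_quantity(limits["cpu"])
    return same_cpu and requests["memory"] == limits["memory"]


def pinning_status(containers: list[dict]) -> dict[str, bool]:
    """Map container name -> whether it would qualify for exclusive CPUs."""
    pod_guaranteed = all(is_guaranteed(c) for c in containers)
    return {
        c["name"]: pod_guaranteed
        and cpu_quantity(c["requests"]["cpu"]).denominator == 1
        for c in containers
    }


pod = [
    {"name": "inference", "requests": {"cpu": "22", "memory": "64Gi"},
     "limits": {"cpu": "22", "memory": "64Gi"}},
    {"name": "metrics", "requests": {"cpu": "500m", "memory": "256Mi"},
     "limits": {"cpu": "500m", "memory": "256Mi"}},
]
print(pinning_status(pod))
```

Here the pod is Guaranteed, so the integer-CPU container qualifies for pinning while the fractional sidecar stays in the shared pool.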

Level 2: Physical Core Pinning with full-pcpus-only

The full-pcpus-only CPU manager policy option (cpuManagerPolicyOptions: full-pcpus-only=true) takes isolation further. Instead of allocating individual logical cores, it allocates entire physical cores. Both hyperthreads of each core go to the same container.

This eliminates L1/L2 cache contention between containers that would otherwise share a physical core.

The trade-off: containers that receive exclusive CPUs must request a multiple of the SMT thread count, typically 2. A pinned container requesting 3 CPUs fails with an SMTAlignmentError (covered in Failure Modes below). Any existing pinned containers with odd CPU counts on the node need to be resized before you enable this option.

What it requires from workload owners: Even CPU request values. Audit all containers, including sidecars and init containers. Fractional CPU values on sidecars and init containers are fine as those containers use the shared CPU pool and don’t get pinned.
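An audit like that is easy to script. The sketch below takes integer CPU requests for containers that would be pinned and suggests the next SMT-aligned size. Rounding up is an arbitrary choice here (rounding down is equally valid), and `audit_smt_alignment` is a hypothetical helper.

```python
def audit_smt_alignment(
    pinned_requests: dict[str, int], threads_per_core: int = 2
) -> dict[str, int]:
    """Return {container: suggested CPU count} for requests that are not a
    multiple of the SMT thread count and would therefore be rejected."""
    fixes = {}
    for name, cpus in pinned_requests.items():
        remainder = cpus % threads_per_core
        if remainder:
            fixes[name] = cpus + threads_per_core - remainder
    return fixes


# Made-up container names and sizes.
print(audit_smt_alignment({"inference": 3, "preprocess": 8, "tokenizer": 5}))
```

Containers that do not appear in the result already satisfy the multiple-of-SMT-threads requirement.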

Level 3: Full NUMA Alignment with single-numa-node

CPU pinning ensures your cores are dedicated, and full-pcpus-only ensures the container gets full physical cores. Neither guarantees that all your cores come from the same NUMA node. With the default static policy options, the kubelet’s CPU manager uses a packed allocation strategy that fills one NUMA node before spilling to the next (more on this below), but depending on node fragmentation, your container’s CPUs can still span NUMA boundaries.

The topologyManagerPolicy: single-numa-node setting addresses this. The topology manager sits above the CPU manager, device manager, and memory manager, and coordinates resource allocation by collecting topology hints from each. With single-numa-node, it requires that hinted resources can be satisfied from a single NUMA node. If they can’t, the pod is rejected at admission time with a TopologyAffinityError.
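A simplified model of that decision follows. It is not the kubelet's actual hint-merging code, which works with bitmasks and preferred flags. It assumes each hint provider offers a list of NUMA node sets that could satisfy its resource, and that at least one provider reports hints. The function name and the choice of the lowest-numbered node are illustrative.

```python
from typing import Optional


def single_numa_admit(hints: dict[str, list[set[int]]]) -> Optional[int]:
    """Return a NUMA node that every provider can satisfy on its own,
    or None to model a TopologyAffinityError rejection."""
    candidates: Optional[set[int]] = None
    for options in hints.values():
        # Keep only hints that name exactly one NUMA node.
        single = {next(iter(nodes)) for nodes in options if len(nodes) == 1}
        candidates = single if candidates is None else candidates & single
    if not candidates:
        return None
    return min(candidates)  # arbitrary tie-break for this sketch


# CPUs fit on either node, but the free GPU sits on node 1.
print(single_numa_admit({"cpu": [{0}, {1}, {0, 1}], "gpu": [{1}], "memory": [{0}, {1}]}))

# CPUs only fit when spanning both nodes, so the pod is rejected.
print(single_numa_admit({"cpu": [{0, 1}], "gpu": [{0}]}))
```

The second call shows the rejection path: the CPU provider has no single-node hint to offer, so no NUMA node can satisfy every resource.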

The default scope topologyManagerScope: container computes alignment independently for each container. That’s usually fine when one main container owns the GPU and the exclusive CPUs, while sidecars are unrelated to the latency-critical path and use fractional CPU from the shared pool.

topologyManagerScope: pod is stricter: it asks whether the pod’s effective request fits on one NUMA node. Use it when multiple containers in the same pod are performance-coupled, not just because a logging or metrics sidecar exists.

Caveat: Topology Manager only enforces resources that report topology hints. CPU hints come from CPU Manager, GPU hints come through Device Manager from plugins such as the nvidia-device-plugin, and memory hints come from Memory Manager.

If the GPU device plugin does not report NUMA TopologyInfo, Topology Manager cannot force CPU-GPU locality.

For guaranteed NUMA alignment, set memoryManagerPolicy: Static too. This makes requested memory part of topology admission along with CPU and GPU resources. The workload’s memory request must fit within the target NUMA node. Kubernetes also requires reservedMemory when memoryManagerPolicy: Static is enabled.

This gives you the strongest isolation available. The container stays within one NUMA node and communicates with the GPU on that NUMA node without crossing the socket interconnect. CPU cache locality can still vary within that NUMA node (depending on NPS configuration), but in testing, we have found that a container occupying an entire NUMA node with no overlap from other workloads is materially less affected by cache-thrashing or CPU-intensive noisy neighbors.

What it requires from workload owners: The performance-critical container, or the effective pod request when using topologyManagerScope: pod, must fit within a single NUMA node. That means understanding the machine topology and resizing the workload as hardware or its configuration changes.

A minimal kubelet config for a dedicated NUMA-aligned GPU pool looks like:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
topologyManagerPolicy: single-numa-node
# Default is container. Use pod only when the whole pod should fit on one NUMA node.
# topologyManagerScope: pod
memoryManagerPolicy: Static
# memoryManagerPolicy: Static requires reservedMemory to be configured.
# It is omitted here because the values depend on the node's memory reservations.

Roll this out on dedicated, drained nodes. When changing CPU or memory manager policies, clear the kubelet state files before restarting kubelet: <kubelet-root-dir>/cpu_manager_state and <kubelet-root-dir>/memory_manager_state.

How the Kubelet Allocates CPUs

Understanding the allocation algorithm helps explain how and when NUMA spillover happens.

When cpuManagerPolicy: static is enabled with the default policy options, the kubelet uses a packed (bin-pack) allocation strategy: takeByTopologyNUMAPacked. It works top-down through the topology:

  1. First, take whole NUMA nodes (or sockets) if any are completely free and the request is large enough to use them entirely
  2. Then, take full physical cores from partially-used NUMA nodes (prefer NUMA nodes with fewer free CPUs to pack them first)
  3. Finally, take individual logical cores if needed

The sort order is key: at every level, it prefers NUMA nodes with fewer remaining free CPUs. This packs nearly-exhausted NUMA nodes before touching less-used ones. The allocator usually keeps CPUs NUMA-local when there is room, but locality is not guaranteed.

But “when there is room” is doing a lot of work there. Consider a 2-socket machine with 48 cores per socket (96 vCPUs per socket with SMT). In NPS1 mode, this gives 2 NUMA nodes of 96 vCPUs each. After system and kube-reserved, suppose 90 vCPUs are allocatable per NUMA node, with 4 GPUs per NUMA node. Each pod requests 22 vCPUs.

The first 4 pods land on NUMA 0: 4 x 22 = 88 vCPUs used, leaving only 2 allocatable vCPUs. The 5th pod requests 22 vCPUs, but only 2 remain on NUMA 0. The CPU manager takes those 2 from NUMA 0 and the remaining 20 from NUMA 1. The diagram below (credit to my colleague Roman Lishtaba for identifying this pattern in our GPU inference workloads) shows exactly how this plays out:

Without topology manager enforcement, the kubelet allocates CPUs from multiple NUMA nodes. Pod 5 runs fine, but its performance is degraded. Nothing in Kubernetes will tell you about this.

With topologyManagerPolicy: single-numa-node, the system keeps the allocation bounded within one NUMA node. In this scenario, NUMA 1 still has 90 vCPUs free, so Pod 5 would land there entirely. TopologyAffinityError only fires when no single NUMA node can satisfy the request.
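The scenario can be replayed with a toy model. This is a sketch and not the kubelet's `takeByTopologyNUMAPacked` implementation: it tracks only free CPU counts per NUMA node, ignores physical cores and GPUs, and the function names are hypothetical.

```python
from typing import Optional


def allocate_packed(free: dict[int, int], request: int) -> dict[int, int]:
    """Take CPUs from the NUMA nodes with the fewest free CPUs first,
    spilling to the next node when one runs out. Mutates `free`."""
    if sum(free.values()) < request:
        raise RuntimeError("not enough free CPUs on the node")
    taken: dict[int, int] = {}
    for node in sorted(free, key=lambda n: (free[n], n)):
        take = min(free[node], request)
        if take:
            taken[node] = take
            free[node] -= take
            request -= take
    return taken


def allocate_single_numa(free: dict[int, int], request: int) -> Optional[dict[int, int]]:
    """Take every CPU from one NUMA node, or return None to model a
    TopologyAffinityError. Picks the lowest-numbered node that fits."""
    fitting = [node for node in sorted(free) if free[node] >= request]
    if not fitting:
        return None
    free[fitting[0]] -= request
    return {fitting[0]: request}


free = {0: 90, 1: 90}
for pod in range(1, 6):
    print(f"pod {pod}:", allocate_packed(free, 22))
# pod 5 prints {0: 2, 1: 20}: two CPUs from NUMA 0 and twenty from NUMA 1.

# Same state before pod 5, but with single-NUMA enforcement.
print(allocate_single_numa({0: 2, 1: 90}, 22))  # {1: 22}
```

The packed strategy is what produces the 2 + 20 split: NUMA 0 has the fewest free CPUs, so it is drained first. With single-NUMA enforcement the same request lands entirely on NUMA 1.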

Failure Modes to Be Aware Of

Both cpuManagerPolicyOptions: full-pcpus-only=true and topologyManagerPolicy: single-numa-node introduce hard failure modes that are worth understanding before enabling them.

SMTAlignmentError

When full-pcpus-only is enabled, the kubelet rejects any container that would receive exclusive CPUs but does not request a multiple of the SMT thread count, typically 2. The pod goes into Failed state with an SMTAlignmentError and stays there until someone deletes it. Workload controllers (Deployments, StatefulSets) will recreate the pod, but the replacement hits the same error on any node where full-pcpus-only is in effect.

TopologyAffinityError

When topologyManagerPolicy: single-numa-node is enabled, the kubelet rejects any pod whose containers’ hinted resource requests can’t be satisfied from a single NUMA node. With topologyManagerScope: pod, that check applies to the pod’s effective request. The sequence is:

  1. The scheduler picks a node based on aggregate resource availability
  2. The kubelet receives the pod and runs topology admission
  3. The topology manager collects hints from the CPU manager, device manager, and memory manager
  4. If no single NUMA node can satisfy all resources, the pod is rejected with TopologyAffinityError

Same failure semantics as SMTAlignmentError: the pod is Failed and the scheduler won’t retry. The confusing part is that the node has sufficient aggregate capacity, but the pod still fails because no individual NUMA node has enough room. If you’re not thinking in NUMA terms, this is disorienting.

cpuManagerPolicy: static on its own doesn’t introduce these failure modes. They come from the additional constraints of full-pcpus-only and single-numa-node. Both are node-level kubelet settings that apply to every pod on the node, which means enabling single-numa-node can break existing workloads that don’t fit in a single NUMA node. Dedicated node pools for NUMA-aligned workloads are a practical approach to mitigate this.

Topology-Aware Scheduling

A practical problem with single-numa-node is that the default Kubernetes scheduler sees only aggregate node resources. It doesn’t know that a node’s 60 free vCPUs are split 20/40 across two NUMA nodes. The scheduler can place a pod on a node, only for the kubelet to reject it at admission. The workload controller then creates a replacement, which may fail on the next node too.

The NodeResourceTopologyMatch scheduler plugin reduces this gap. It gives the scheduler per-NUMA-node resource visibility, so it can filter out nodes that can’t satisfy the topology constraints before placing the pod.
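The gap comes down to two different fit checks. A minimal sketch, using the 20/40 split from above and hypothetical function names:

```python
def fits_aggregate(free_per_numa: list[int], request: int) -> bool:
    """What the default scheduler effectively checks: total free CPUs."""
    return sum(free_per_numa) >= request


def fits_single_numa(free_per_numa: list[int], request: int) -> bool:
    """What single-numa-node admission needs: one NUMA node with enough room."""
    return any(free >= request for free in free_per_numa)


free = [20, 40]  # 60 free vCPUs, split across two NUMA nodes
for request in (16, 44):
    print(request, fits_aggregate(free, request), fits_single_numa(free, request))
```

A 44-vCPU request passes the aggregate check and fails the per-NUMA check, which is exactly the pod that gets scheduled and then rejected at admission. A topology-aware filter applies the second kind of check before placement.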

Deploying it requires:

  • A cluster-scoped NodeResourceTopology CRD and one NodeResourceTopology custom resource per node
  • A topology exporter DaemonSet (such as NFD Topology Updater) on every node, polling the kubelet’s PodResources API and publishing per-NUMA resource availability
  • The NodeResourceTopologyMatch scheduler plugin configured as a filter and scorer

That’s additional infrastructure for the platform team: a DaemonSet, a cluster-scoped CRD, one custom resource per node refreshed roughly every 60 seconds, and a scheduler plugin with its own cache. Without it, pods repeatedly fail on nodes that look like they have enough capacity.

Making It Work: Collaboration Between Platform and Workload Teams

Getting NUMA alignment right is not something either platform admins or workload owners can do alone. It requires collaboration and shared understanding.

Platform admins: publish topology and sizing guidance

With NUMA alignment, platform admins can’t just hand out node pools and let workload owners request whatever CPU/memory they want. They need to publish clear guidance:

  • What SKU each node pool uses, and its NUMA geometry (cores per NUMA node, NPS mode)
  • How many vCPUs are consumed by system-reserved and kube-reserved
  • Recommended container sizes that align with NUMA boundaries
  • What constraints are in effect (full-pcpus-only, single-numa-node, and any non-default topology manager scope) and what failure modes they introduce

For example, on a 2-socket machine with 48 cores per socket in NPS1 mode: each NUMA node has 96 vCPUs, about 90 allocatable after reservations, with 4 GPUs per NUMA node. The recommended CPU request per GPU might be 22 vCPUs (22 x 4 = 88, fitting within the 90 available). If the fleet has multiple SKUs with different core counts or NPS configurations, this becomes a matrix of recommendations.
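One way to derive such a recommendation is to divide the allocatable vCPUs of a NUMA node by its GPU count and round down to a multiple of the SMT thread count. This is a sketch of that rule of thumb, not an official formula; the function name and the second pool in the matrix are hypothetical.

```python
def cpus_per_gpu(allocatable_vcpus: int, gpus: int, threads_per_core: int = 2) -> int:
    """Largest SMT-aligned CPU request such that `gpus` pods fit in one NUMA node."""
    per_gpu = allocatable_vcpus // gpus
    return per_gpu - per_gpu % threads_per_core


print(cpus_per_gpu(90, 4))  # 22

# A recommendation matrix keyed by pool; values are (allocatable vCPUs, GPUs) per NUMA node.
pools = {"epyc-48c-nps1": (90, 4), "hypothetical-nps2": (42, 2)}
for pool, (vcpus, gpus) in pools.items():
    print(pool, cpus_per_gpu(vcpus, gpus))
```

Rounding down to an even value keeps the recommendation compatible with full-pcpus-only.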

Workload owners: understand the constraints, size accordingly

GPU workloads are inherently more hardware-aware than typical Kubernetes workloads. Unlike a stateless web service where you declare CPU and memory and let the platform figure out placement, GPU inference and training benefit from understanding the machine topology.

This means:

  • Sizing containers to fit within a NUMA node based on the platform’s published guidance (often it’s better to run multiple smaller pods, each NUMA-local, than one larger pod that spans NUMA nodes)
  • Using even CPU values when full-pcpus-only is in effect
  • Ensuring the pod is Guaranteed QoS (requests == limits on all containers)
  • Updating container sizes when the platform migrates to different SKUs

Verifying alignment in practice

At the node level, start by confirming the hardware topology:

lscpu -e=CPU,CORE,SOCKET,NODE
numactl -H
nvidia-smi topo -m  # NVIDIA GPU nodes

Inside a running container, check the workload process’s CPU affinity and compare it with the node’s lscpu output to confirm the allowed CPUs sit within the expected NUMA node:

kubectl exec <pod-name> -c <container-name> -- taskset -cp 1

# If taskset is unavailable:
kubectl exec <pod-name> -c <container-name> -- grep Cpus_allowed_list /proc/1/status

taskset -cp 1 checks PID 1 in the container. If your workload runs as a different PID, check that process instead.
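The comparison itself can be automated. The sketch below assumes you have captured the node's parsable topology with `lscpu -p=CPU,NODE` and already expanded the container's allowed CPU list into a set of integers. The sample topology text and function names are made up for illustration.

```python
def cpu_to_node(lscpu_parsable: str) -> dict[int, int]:
    """Build {logical CPU: NUMA node} from `lscpu -p=CPU,NODE` style output."""
    mapping = {}
    for line in lscpu_parsable.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        cpu, node = line.split(",")[:2]
        mapping[int(cpu)] = int(node)
    return mapping


def numa_nodes_used(allowed_cpus: set[int], mapping: dict[int, int]) -> set[int]:
    """Return the NUMA nodes that the allowed CPUs belong to."""
    return {mapping[cpu] for cpu in allowed_cpus}


# Made-up 4-CPU, 2-node topology.
sample = "# CPU,NODE\n0,0\n1,0\n2,1\n3,1\n"
mapping = cpu_to_node(sample)

print(numa_nodes_used({0, 1}, mapping))  # one node: aligned
print(numa_nodes_used({1, 2}, mapping))  # two nodes: spans a NUMA boundary
```

A result with more than one NUMA node means the container's CPUs span a boundary and the allocation is worth investigating.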

Beyond the settings covered so far, Kubernetes has several other CPU manager policy options worth knowing about:

  • strict-cpu-reservation keeps regular workloads off CPUs reserved for the OS and Kubernetes daemons, which helps reduce system noise on pinned workloads.
  • prefer-align-cpus-by-uncorecache is a best-effort cache-locality option that tries to keep a container’s CPUs within the same L3 or uncore cache group.
  • align-by-socket is useful when a container is too large to fit in one NUMA node and must use multiple NUMA nodes. It asks CPU Manager to keep that allocation within one socket when possible.

These can improve isolation or latency, but they do not replace topologyManagerPolicy: single-numa-node for keeping a GPU workload NUMA-local. align-by-socket is also not compatible with single-numa-node.

DRA and Future Direction

The Kubernetes DRA (Dynamic Resource Allocation) CPU driver is interesting because it may allow NUMA-aware CPU placement to happen through the scheduling layer, without some of the post-scheduling admission issues described above. I haven’t explored it deeply enough to recommend it here. I’ll write a follow-up after I spend more time with it.

Wrapping Up

There’s a clear progression of CPU isolation in Kubernetes:

| Level | Config | What you get | What it requires |
| --- | --- | --- | --- |
| 1 | cpuManagerPolicy: static | Dedicated logical cores, reduced CPU contention | Guaranteed QoS, integer CPU requests |
| 2 | + full-pcpus-only=true | Full physical cores, L1/L2 cache isolation | Even CPU request values |
| 3 | + topologyManagerPolicy: single-numa-node and memoryManagerPolicy: Static | CPU, GPU, and memory admitted only if they fit one NUMA node | Critical container fits in a NUMA node, device plugin topology hints, reservedMemory, topology-aware scheduler, sizing guidance from platform team |

Each level introduces stricter constraints in exchange for stronger performance isolation. For GPU inference, where the CPU is directly on the data path to the GPU, best-effort alignment is not always good enough. If a pod is misaligned, Kubernetes will not tell you, but the workload may still show worse tail latency. For consistency, a hard failure like TopologyAffinityError is often better than silently serving degraded traffic.

Getting it right takes effort from both sides: platform teams publishing topology guidance and workload owners sizing containers to match. It is more work than treating compute as a black box, but GPU workloads typically need to be more aware of the underlying hardware than ordinary services.

