K8s GPU Scheduling: Stop NUMA Crossings From Killing Distributed Training
Kubernetes GPU scheduling playbook: pin workers to a single NUMA domain for NVLink speed, gang-schedule via Volcano, and slash training cost 60-70% on Spot.
Introduction
Running ML workloads on Kubernetes sounds simple until you need to schedule a multi-GPU training job across nodes, or guarantee latency for an inference endpoint while batch jobs run in the same cluster. The default Kubernetes scheduler treats GPUs as opaque, interchangeable resources. Getting real performance out of a GPU cluster requires explicit configuration for device selection, memory management, and workload isolation.
This guide covers the practical setup: the NVIDIA device plugin, time-slicing for oversubscription, node affinity for GPU topology, gang scheduling for distributed training, and the common failure modes that eat your GPU budget.
The NVIDIA Device Plugin
Kubernetes does not natively understand GPUs. The NVIDIA Device Plugin advertises GPU resources to the scheduler:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
After installation, nodes advertise nvidia.com/gpu resources:
kubectl describe node | grep nvidia.com/gpu
Allocatable:
nvidia.com/gpu: 4
Request GPUs in pod specs like any other resource:
resources:
  limits:
    nvidia.com/gpu: 2
MIG Profiles on NVIDIA A100 and H100
Multi-Instance GPU (MIG) lets you partition a single A100 or H100 GPU into up to 7 independent instances, each with dedicated memory and compute engines. This is useful when you need quality-of-service guarantees for several small inference workloads running side by side.
Enable MIG on A100 nodes:
# List GPUs; MIG devices appear here once MIG mode is enabled
nvidia-smi -L
# Enable MIG on the node via mig-manager (running GPU workloads are interrupted; some platforms need a reboot)
kubectl label nodes <node> nvidia.com/mig.config=all-1g.5gb
Request a MIG slice in your pod (this resource name assumes the device plugin runs with the mixed MIG strategy):
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
MIG is most effective for inference serving where you want predictable latency isolation between tenants. For training workloads, a full GPU is usually the better choice, because a MIG slice only gets a fraction of the GPU's memory and compute.
Topology-Aware GPU Scheduling for Multi-Node Training
When distributed training spans multiple GPUs, GPU-to-GPU interconnect bandwidth becomes the bottleneck. NVLink within a node provides 300-600 GB/s. GPUs that can only reach each other over PCIe drop to 32-64 GB/s, and traffic between nodes is limited by the network fabric. NCCL performs dramatically worse when it is not aware of this topology.
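To put those bandwidth figures in perspective, here is a rough back-of-the-envelope sketch. It assumes a hypothetical 14 GB payload (the FP16 gradients of a 7B-parameter model) and treats the transfer as one straight copy at the quoted bandwidth. The helper name `transfer_ms` is made up for illustration, and the results are lower bounds, not predictions.

```shell
# Rough lower bound on the time to move a payload at a given bandwidth.
# Assumption: one straight copy. Real NCCL all-reduce adds latency and
# algorithm overhead, so actual times are higher.
transfer_ms() (
  size_gb=$1        # payload size in GB
  bw_gb_per_s=$2    # link bandwidth in GB/s
  awk -v s="$size_gb" -v b="$bw_gb_per_s" 'BEGIN { printf "%.1f\n", s / b * 1000 }'
)

# 14 GB of FP16 gradients (7B parameters x 2 bytes), illustrative only.
echo "NVLink at 600 GB/s: $(transfer_ms 14 600) ms"
echo "PCIe at 32 GB/s:    $(transfer_ms 14 32) ms"
```

Even under these generous assumptions the PCIe path is more than an order of magnitude slower per synchronization step, which is why placement matters so much.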
Check your node topology:
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     0-31            N/A
GPU1    NV1      X      NV2     NV1     0-31            N/A
GPU2    NV1     NV2      X      NV1     32-63           N/A
GPU3    NV2     NV1     NV1      X      32-63           N/A
On a p4d node (2 NUMA domains, 4 GPUs each), keep jobs of up to 4 GPUs inside one NUMA domain when possible. An 8-GPU job necessarily spans both domains, so keep all 8 workers on the same node and keep each worker's CPUs and memory local to its GPU. Use the Topology Manager and CNFFCCL plugin for Kubernetes:
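apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
spec:
  containers:
  - name: train
    image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
    env:
    - name: NCCL_TOPO_FILE
      value: /etc/kubernetes/nccl-topo.xml
    resources:
      # Limits without requests default the requests to the same values,
      # which gives the pod Guaranteed QoS. CPU and memory are placeholders.
      limits:
        cpu: "16"
        memory: 64Gi
        nvidia.com/gpu: 4
NUMA alignment is enforced on the node by the kubelet Topology Manager (for example, topologyManagerPolicy: single-numa-node together with the static CPU Manager policy), not by topologySpreadConstraints, which only spread pods across groups of nodes. The Topology Manager aligns CPUs and devices only for pods with Guaranteed QoS and integer CPU requests, which is why the spec above sets CPU and memory limits.
GPU Time-Slicing for Oversubscription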
When you have more workloads than GPUs, time-slicing lets multiple pods share a GPU by interleaving their compute. This is common in development clusters or for batch inference workloads.
Configure time-slicing in the device plugin config:
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
With replicas: 4, up to 4 pods can share each physical GPU. The NVIDIA driver handles context switching, but time-slicing provides no memory or fault isolation between the pods sharing a GPU. It works well for inference workloads with low memory footprints. For training jobs that need full GPU memory, do not oversubscribe.
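The arithmetic behind a replicas setting is worth spelling out, because time-slicing multiplies the GPU count the node advertises without partitioning GPU memory. The sketch below assumes a node with 4 physical 80 GB GPUs; the helper names are hypothetical, and the per-pod memory figure is a budget you have to enforce yourself, not something the driver guarantees.

```shell
# Time-slicing multiplies the advertised GPU count but not the memory.
# Assumption: an equal memory share per replica is a convention only;
# nothing enforces it.
advertised_gpus() (
  physical=$1; replicas=$2
  echo $(( physical * replicas ))
)

memory_budget_mib() (
  total_mib=$1; replicas=$2
  echo $(( total_mib / replicas ))
)

# Example: a node with 4 physical GPUs (81920 MiB each) and replicas: 4.
echo "nvidia.com/gpu advertised: $(advertised_gpus 4 4)"
echo "per-pod memory budget: $(memory_budget_mib 81920 4) MiB"
```

If any one pod exceeds its share, its neighbours on the same GPU can run out of memory, so treat the budget as a hard ceiling in your serving configuration.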
Check actual GPU utilization to validate sharing:
nvidia-smi
Tue Apr 11 02:30:00 2026
+-----------------------------------------------------------------------------+
| GPU 0  A100-SXM4-80GB  Off            | 00000000:00:1B.0 Off                |
| 0%  Memory: 8192MiB / 81920MiB        | Not In Use                          |
+-----------------------------------------------------------------------------+
Node Affinity for GPU Topology
Multi-GPU training performs best when tensors stay local to the node. Use nodeSelector or nodeAffinity to pin jobs to nodes with sufficient GPUs:
nodeSelector:
  node.kubernetes.io/instance-type: p4d.24xlarge
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Label set by NVIDIA GPU Feature Discovery
        - key: nvidia.com/gpu.count
          # At least 4 GPUs: node affinity supports Gt and Lt, but has no Gte
          operator: Gt
          values:
          - "3"
For NCCL-based distributed training, avoid cross-node GPU communication when possible. For 8-GPU training, place all workers on the same node. For larger jobs that span nodes, use a topology-aware placement controller or explicit topologyKey constraints.
Gang Scheduling for Distributed Training
Distributed ML training jobs require all workers to start simultaneously. The default Kubernetes scheduler places pods one at a time, so it can deadlock: Job A gets 3 of its 4 required GPUs and waits for the fourth, while Job B holds that fourth GPU and waits for GPUs that Job A will never release. Neither progresses. This is the gang scheduling problem.
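The deadlock is easy to reproduce in a toy model. The sketch below is not how any real scheduler is implemented: it assumes a 4-GPU cluster and two jobs, A and B, that each need 4 GPUs, and compares binding pods one by one in arrival order against admitting a job only when all of its GPUs are free. The function names are illustrative.

```shell
TOTAL=4   # GPUs in the toy cluster
NEED=4    # GPUs each job needs before it can start

# Per-pod scheduling: bind pods one at a time, in arrival order, while GPUs are free.
per_pod() (
  free=$TOTAL; a=0; b=0
  for pod in "$@"; do
    [ "$free" -eq 0 ] && break
    case $pod in
      A) a=$((a + 1)) ;;
      B) b=$((b + 1)) ;;
    esac
    free=$((free - 1))
  done
  if [ "$a" -eq "$NEED" ] || [ "$b" -eq "$NEED" ]; then
    echo "running A=$a B=$b"
  else
    echo "deadlock A=$a B=$b"
  fi
)

# Gang scheduling: admit a job only if all of its GPUs are free at once.
gang() (
  free=$TOTAL; started=""
  for job in "$@"; do
    if [ "$free" -ge "$NEED" ]; then
      free=$((free - NEED))
      started="${started:+$started }$job"
    fi
  done
  echo "started: ${started:-none}"
)

per_pod A A A B B B B A   # pods arrive interleaved
gang A B                  # whole jobs are admitted or held back
```

With interleaved arrivals the per-pod model ends with A holding 3 GPUs and B holding 1, exactly the stuck state described above, while the gang model starts A and leaves B queued without holding anything.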
Use the Coscheduler (the coscheduling plugin from the kubernetes-sigs scheduler-plugins project, built on the Kubernetes scheduling framework) or the Volcano scheduler to coordinate multi-pod job placement:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 8
  schedulerName: volcano
  tasks:
  - replicas: 8
    name: worker
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: train
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          resources:
            limits:
              nvidia.com/gpu: 1
        nodeSelector:
          accelerator: nvidia-tesla-a100
With gang scheduling, no pod in the job is scheduled until all 8 GPUs are available. This trades immediate scheduling for a guaranteed all-or-nothing start.
GPU Memory Management
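Each GPU has a fixed amount of memory, and Kubernetes does not meter it: the memory limit in a pod spec applies to host RAM only. Exceeding GPU memory surfaces as a CUDA out-of-memory error inside the process, while exceeding the host limit gets the container OOM-killed. Either way the pod crashes and the compute is wasted. Set host memory limits explicitly and size GPU memory per model: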
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "60Gi"
  requests:
    memory: "40Gi"
For transformer models, allocate based on model size: a 7B parameter model in FP16 needs roughly 14GB for weights, plus 4-8GB for activations and KV cache. Leave headroom. An 80GB A100 comfortably runs a 7B or 13B model in FP16. A 70B model needs about 140GB for its FP16 weights alone, so it has to be sharded across several GPUs or quantized. A 7B model in FP16 is a tight fit on a 16GB T4 and usually needs quantization.
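The same rule of thumb can be turned into a quick fit check. This is a minimal sketch, assuming 2 bytes per parameter for FP16 weights and a caller-supplied overhead guess for activations and KV cache; `model_gb` and `fits` are hypothetical helper names, and real usage depends on batch size, sequence length, and framework.

```shell
# FP16 stores each parameter in 2 bytes, so weights need about 2 GB per
# billion parameters. overhead_gb is a caller-supplied guess for
# activations and KV cache.
model_gb() (
  params_billion=$1; overhead_gb=$2
  echo $(( params_billion * 2 + overhead_gb ))
)

fits() (
  need_gb=$(model_gb "$1" "$2"); gpu_gb=$3
  if [ "$need_gb" -le "$gpu_gb" ]; then
    echo "fits ($need_gb GB of $gpu_gb GB)"
  else
    echo "does not fit ($need_gb GB of $gpu_gb GB)"
  fi
)

fits 7 8 80    # 7B model, 8 GB overhead, 80 GB A100
fits 70 8 80   # 70B model on the same GPU
fits 7 8 16    # 7B model on a 16 GB T4
```

Running a check like this before picking an instance type is cheaper than discovering the shortfall as an out-of-memory crash an hour into a job.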
Monitor actual memory usage per pod:
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
Migrating from kube-gpu to the Device Plugin
The legacy kube-gpu scheduler was deprecated. If you have existing configurations, migrate to the NVIDIA Device Plugin + Coscheduler pattern. The new stack has better community support and works with the standard Kubernetes scheduling framework.
Migration steps:
- Deploy the NVIDIA Device Plugin DaemonSet on all GPU nodes
- Verify nvidia.com/gpu resources appear in node capacity
- Update pod specs from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu
- Install the Coscheduler or Volcano for gang scheduling
- Test with a single GPU pod before rolling out to training jobs
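The pod spec update in the steps above is a mechanical rename, so it can be scripted. The sketch below is one way to do it with sed, assuming your manifests reference the legacy resource name literally; the function name `migrate_gpu_resource` is made up for illustration.

```shell
# Rewrite the legacy resource name to the device plugin's resource name.
# Reads a manifest on stdin and writes the updated manifest to stdout.
migrate_gpu_resource() {
  sed 's#alpha\.kubernetes\.io/nvidia-gpu#nvidia.com/gpu#g'
}

# Example with an inline fragment. In practice, pipe a real manifest through
# and review the diff before applying it.
printf '%s\n' 'limits:' '  alpha.kubernetes.io/nvidia-gpu: 1' | migrate_gpu_resource
```

Writing to stdout rather than editing in place keeps the original manifests untouched until you have reviewed the result.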
GPU Instance Selection: Cost vs Performance
Matching the right instance to the workload is where most teams bleed money. Use this decision framework:
| Instance | GPU | Memory | Interconnect | Best for | $/hr (on-demand) |
|---|---|---|---|---|---|
| g4dn.xlarge | 1x T4 (16GB) | 16GB CPU | PCIe | Inference, small models, dev/test | $0.526 |
| g4dn.4xlarge | 1x T4 (16GB) | 64GB CPU | PCIe | Batch inference, moderate throughput | $1.204 |
| p4d.24xlarge | 8x A100 40GB | 1152GB CPU | NVLink | Training, 7-70B models | $32.77 |
| p5.48xlarge | 8x H100 80GB | 2048GB CPU | NVLink | Large model training, RLHF, 100B+ | $98.32 |
| trn1.32xlarge | 16x Trainium | 512GB CPU | NeuronLink | Cost-sensitive large-scale training | $21.50 |
Spot instances typically cut costs by 60-70% for fault-tolerant training workloads. Set up checkpointing to save model state every 100-500 steps so that a preemption only loses a few minutes of compute. For inference, Reserved Instances or Savings Plans on g4dn cut costs to $0.18-0.22/hr per GPU, which is enough to justify the commitment on moderate traffic.
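To make the Spot trade-off concrete, the sketch below applies the 60-70% discount range to the p4d.24xlarge on-demand price from the table and estimates the worst-case work lost to one preemption. The 0.6 seconds per step is an assumption for illustration only, and both helper names are hypothetical.

```shell
# Hourly price after a percentage discount.
discounted_price() (
  awk -v p="$1" -v d="$2" 'BEGIN { printf "%.2f\n", p * (100 - d) / 100 }'
)

# Worst-case work lost to one preemption: everything since the last checkpoint.
# sec_per_step is an assumption; measure it for your own job.
max_lost_minutes() (
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 60 }'
)

echo "p4d.24xlarge at 60% off: \$$(discounted_price 32.77 60)/hr"
echo "p4d.24xlarge at 70% off: \$$(discounted_price 32.77 70)/hr"
echo "checkpoint every 500 steps at 0.6 s/step: up to $(max_lost_minutes 500 0.6) min lost"
```

If your steps are much slower than the assumed figure, shorten the checkpoint interval until the worst-case loss is back in the range of a few minutes.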
Key decision rule: if your GPU utilization averages below 40% over a month, you are on the wrong instance type. Either right-size to fewer GPUs or switch to time-slicing.
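The rule is simple enough to automate. Here is a minimal sketch that averages a list of utilization samples and applies the 40% threshold; the sample values are invented, and in practice they would come from a month of data in your monitoring system. The function name `utilization_verdict` is hypothetical.

```shell
# Average a list of GPU utilization samples (percent) and apply the 40% rule.
# The sample values used below are made up for illustration.
utilization_verdict() (
  printf '%s\n' "$@" | awk '
    { sum += $1; n++ }
    END {
      avg = sum / n
      verdict = (avg < 40) ? "right-size or time-slice" : "keep"
      printf "avg=%.1f%% -> %s\n", avg, verdict
    }'
)

utilization_verdict 22 35 41 18
utilization_verdict 75 82 64 91
```

A single average hides bursty workloads, so it is worth looking at the distribution as well before resizing anything.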
Monitoring GPU Workloads
DCGM (Data Center GPU Manager) exporter gives you Prometheus metrics for GPU utilization, memory, temperature, and power:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--set serviceMonitor.enabled=true \
--namespace gpu-operator
Key metrics to track:
- DCGM_FI_DEV_GPU_UTIL: GPU compute utilization percentage
- DCGM_FI_DEV_FB_USED: Frame buffer memory used
- DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius
- DCGM_FI_DEV_POWER_USAGE: Current power draw in watts
Alert when GPU utilization drops below 30% for sustained periods. It usually means the workload is stalled on data loading, I/O, or communication rather than compute, and you are wasting expensive hardware.
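One way to express that alert is a PrometheusRule, sketched below as a shell function that prints the manifest. It assumes the Prometheus Operator CRDs are installed (the serviceMonitor setting above implies the operator is in use) and that 30 minutes counts as a sustained period. The rule name, alert name, and severity label are illustrative choices, not part of DCGM.

```shell
# Print a PrometheusRule manifest for the low-utilization alert.
# Assumptions: Prometheus Operator CRDs are installed, and 30 minutes
# counts as a "sustained period". Names and labels are illustrative.
low_util_rule() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-low-utilization
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPULowUtilization
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: GPU utilization below 30% for a sustained period
EOF
}

low_util_rule
```

To install it, pipe the output into `kubectl apply -f -`. Idle GPUs in a development cluster will trigger this rule too, so you may want to scope the expression to training namespaces.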
Conclusion
Kubernetes GPU scheduling requires explicit configuration to get real performance. Start with the NVIDIA Device Plugin to expose nvidia.com/gpu resources. Use MIG on A100/H100 for latency-isolated inference, or time-slicing for development clusters. Keep multi-GPU training jobs on a single node, and inside one NUMA domain where they fit, using node affinity and the Topology Manager so that traffic stays on NVLink. Install Volcano or the Coscheduler for gang scheduling so distributed training jobs do not deadlock. Monitor with DCGM and alert on sub-30% GPU utilization, a clear signal that you are wasting expensive hardware.
The decision framework is simple: T4 for inference, A100 for training 7-70B models, H100 for anything above. Run checkpointed training jobs on Spot instances for 60-70% cost savings. If average GPU utilization sits below 40%, right-size your cluster or switch to time-slicing.