K8s GPU Scheduling: Stop NUMA Crossings From Killing Distributed Training

Kubernetes GPU scheduling playbook: pin workers to a single NUMA domain for NVLink speed, gang-schedule via Volcano, and slash training cost 60-70% on Spot.

Introduction

Running ML workloads on Kubernetes sounds simple until you need to schedule a multi-GPU training job across nodes, or guarantee latency for an inference endpoint while batch jobs are running in the same cluster. The default Kubernetes scheduler treats GPUs as opaque, countable resources with no notion of topology or memory. Getting real performance out of a GPU cluster requires explicit configuration for device selection, memory management, and workload isolation.

This guide covers the practical setup: the NVIDIA device plugin, MIG partitioning, time-slicing for oversubscription, topology-aware placement and node affinity, gang scheduling for distributed training, and the common failure modes that eat your GPU budget.

The NVIDIA Device Plugin

Kubernetes does not natively understand GPUs. The NVIDIA Device Plugin advertises GPU resources to the scheduler:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

After installation, nodes advertise nvidia.com/gpu resources:

kubectl describe node | grep nvidia.com/gpu
Allocatable:
  nvidia.com/gpu: 4

Request GPUs in pod specs like any other resource:

resources:
  limits:
    nvidia.com/gpu: 2

MIG Profiles on NVIDIA A100 and H100

Multi-Instance GPU (MIG) lets you partition a single A100 or H100 GPU into up to 7 independent slices, each with dedicated memory and compute engines. This is useful when several small inference workloads share one GPU and each needs a quality-of-service guarantee.

Enable MIG on A100 nodes:

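# Check whether MIG mode is currently enabled on each GPU
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv

# Ask mig-manager to apply a MIG layout to the node
# (reconfiguring the GPUs may require a GPU reset or a node reboot)
kubectl label nodes <node> nvidia.com/mig.config=all-1g.5gb --overwrite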

Request a MIG slice in your pod (this resource name assumes the device plugin's mixed MIG strategy):

resources:
  limits:
    nvidia.com/mig-1g.5gb: 1

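MIG is most effective for inference serving where you want predictable latency isolation between tenants. For training workloads, prefer full GPUs: a MIG slice gets only a fraction of the GPU's compute and memory, and MIG instances cannot use GPU-to-GPU peer-to-peer communication over NVLink or PCIe, which distributed training depends on.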
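A quick way to reason about slice counts is to read them off the profile name. The sketch below is an illustrative assumption, not an NVIDIA API: parse_mig_profile and max_instances are hypothetical helpers, and the result is only a rough upper bound, because real placement is limited to the profile combinations NVIDIA supports.

```python
import re


def parse_mig_profile(profile: str) -> tuple[int, int]:
    """Split a profile such as '1g.5gb' into (compute slices, memory in GB)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def max_instances(profile: str, gpu_memory_gb: int, compute_slices: int = 7) -> int:
    """Rough upper bound on how many instances of one profile fit on a GPU.

    Assumes 7 compute slices per GPU and ignores NVIDIA's placement rules.
    """
    compute, memory_gb = parse_mig_profile(profile)
    return min(compute_slices // compute, gpu_memory_gb // memory_gb)


if __name__ == "__main__":
    for name in ("1g.5gb", "2g.10gb", "3g.20gb", "7g.40gb"):
        print(name, "->", max_instances(name, gpu_memory_gb=40), "instance(s) on a 40GB A100")
```

Treat the output as a planning aid only; nvidia-smi on the node remains the source of truth for which layouts are valid.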

Topology-Aware GPU Scheduling for Multi-Node Training

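When distributed training spans multiple GPUs, GPU-to-GPU interconnect bandwidth becomes the bottleneck. NVLink within a node provides 300-600 GB/s. GPUs in the same node that have to talk over PCIe drop to 32-64 GB/s, and traffic between nodes is limited by the network fabric rather than by either of these. NCCL performs dramatically worse when placement ignores this topology.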
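A back-of-the-envelope estimate makes the gap concrete. The sketch below assumes a ring all-reduce, where each GPU moves about 2(N-1)/N times the payload, and it ignores latency and overlap with compute. The function name and the 14 GB payload (FP16 gradients of a 7B-parameter model) are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int, bandwidth_gb_s: float) -> float:
    """Estimate the duration of one ring all-reduce, ignoring latency and overlap."""
    if num_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    # Each GPU sends and receives 2 * (N - 1) / N of the payload over its link.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / bandwidth_gb_s


if __name__ == "__main__":
    grads_gb = 14.0  # assumed: FP16 gradients of a 7B-parameter model
    links = [("NVLink at 600 GB/s", 600.0), ("NVLink at 300 GB/s", 300.0), ("PCIe at 32 GB/s", 32.0)]
    for label, bandwidth in links:
        millis = ring_allreduce_seconds(grads_gb, 8, bandwidth) * 1000
        print(f"{label}: {millis:.0f} ms per all-reduce")
```

Under these assumptions the 32 GB/s case is roughly 19 times slower than 600 GB/s NVLink for every gradient synchronization, which is why placement matters more than raw GPU count.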

Check your node topology:

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1    NV1    NV2    0-31            N/A
GPU1    NV1     X      NV2    NV1    0-31            N/A
GPU2    NV1    NV2     X      NV1    32-63           N/A
GPU3    NV2    NV1    NV1     X      32-63           N/A
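Reading the matrix by eye gets tedious on larger nodes. As a minimal sketch, the hypothetical helper below lists the GPU pairs connected by NVLink. It assumes the first non-empty line is the header and that GPU rows start with a name from that header, and it embeds the sample matrix shown above rather than calling nvidia-smi.

```python
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1    NV1    NV2    0-31            N/A
GPU1    NV1     X      NV2    NV1    0-31            N/A
GPU2    NV1    NV2     X      NV1    32-63           N/A
GPU3    NV2    NV1    NV1     X      32-63           N/A
"""


def nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (gpu_a, gpu_b, link) for each NVLink-connected pair, listed once."""
    rows = [line.split() for line in topo_text.splitlines() if line.strip()]
    gpus = [token for token in rows[0] if token.startswith("GPU")]
    pairs = []
    for row in rows[1:]:
        if row[0] not in gpus:
            continue  # skip NIC rows and legend lines
        i = gpus.index(row[0])
        # Only look right of the diagonal so each pair appears once.
        for j, link in enumerate(row[1:1 + len(gpus)]):
            if j > i and link.startswith("NV"):
                pairs.append((row[0], gpus[j], link))
    return pairs


if __name__ == "__main__":
    for gpu_a, gpu_b, link in nvlink_pairs(SAMPLE):
        print(f"{gpu_a} <-> {gpu_b}: {link}")
```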

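On a p4d node (8 GPUs across 2 NUMA domains, 4 GPUs each), a job that needs 4 GPUs or fewer should get its GPUs, CPUs, and memory from a single NUMA domain. An 8-GPU job necessarily spans both domains but should still stay on one node. NUMA alignment is enforced on the node by the kubelet Topology Manager (for example with the single-numa-node policy), not by the scheduler. Exclusive CPU pinning additionally needs the static CPU Manager policy and a pod in the Guaranteed QoS class. You can also hand NCCL an explicit topology file:

apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
spec:
  containers:
    - name: train
      image: <your-training-image>   # placeholder, substitute your own image
      env:
        # NCCL loads this optional XML topology description at startup
        - name: NCCL_TOPO_FILE
          value: /etc/kubernetes/nccl-topo.xml
      resources:
        # Requests equal to limits put the pod in the Guaranteed QoS class.
        # The CPU and memory figures are example values; size them for your node.
        requests:
          cpu: "32"
          memory: 128Gi
          nvidia.com/gpu: 4
        limits:
          cpu: "32"
          memory: 128Gi
          nvidia.com/gpu: 4

Note that topologySpreadConstraints cannot express NUMA placement: it spreads pods across node-level domains such as zones or hostnames, and Kubernetes defines no NUMA node label.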

GPU Time-Slicing for Oversubscription

When you have more workloads than GPUs, time-slicing lets multiple pods share a GPU by interleaving their compute. This is common in development clusters or for batch inference workloads.

Configure time-slicing in the device plugin config:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4

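With replicas: 4, each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources, so up to 4 pods can share it. The NVIDIA driver handles context switching, but time-slicing provides no memory or fault isolation: all pods on a GPU share the same GPU memory. It works well for inference workloads with low memory footprints. For training jobs that need full GPU memory, do not oversubscribe.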
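The capacity math is simple enough to check before you roll out a config. The sketch below uses two hypothetical helpers, and the per-pod memory figures are assumed example values. It relies on the fact that time-slicing does not partition GPU memory, so the pods sharing a GPU have to fit together.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    return physical_gpus * replicas


def shared_gpu_fits(pod_memory_gib: list[float], gpu_memory_gib: float) -> bool:
    """Time-slicing does not partition memory, so all pods on one GPU must fit together."""
    return sum(pod_memory_gib) <= gpu_memory_gib


if __name__ == "__main__":
    print("Schedulable GPUs on a 4-GPU node:", advertised_gpus(4, 4))
    print("Four 8 GiB pods on an 80 GiB GPU fit:", shared_gpu_fits([8, 8, 8, 8], 80))
    print("Three 30 GiB pods on an 80 GiB GPU fit:", shared_gpu_fits([30, 30, 30], 80))
```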

Check actual GPU utilization and memory use to validate sharing:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
index, name, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
0, NVIDIA A100-SXM4-80GB, 0 %, 8192 MiB, 81920 MiB

Node Affinity for GPU Topology

Multi-GPU training performs best when tensors stay local to the node. Use nodeSelector or nodeAffinity to pin jobs to nodes with sufficient GPUs:

nodeSelector:
  node.kubernetes.io/instance-type: p4d.24xlarge
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Gt is strictly greater-than, so "3" means at least 4 GPUs
            - key: nvidia.com/gpu.count
              operator: Gt
              values:
                - "3"

For NCCL-based distributed training, avoid cross-node GPU communication when possible. Place all workers on the same node for 8-GPU training. For larger jobs that span nodes, use a topology-aware placement controller or explicit topologyKey constraints.
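Node affinity supports the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt, so an "at least 4 GPUs" rule is written as Gt with the value "3". The sketch below mimics that comparison for a set of node labels. The matches_gt helper and the node names are hypothetical, and this is an illustration of the semantics, not the scheduler's actual code.

```python
def matches_gt(node_labels: dict[str, str], key: str, value: str) -> bool:
    """Mimic the Gt operator: the label must exist, parse as an integer, and be strictly greater."""
    raw = node_labels.get(key)
    if raw is None:
        return False
    try:
        return int(raw) > int(value)
    except ValueError:
        return False  # non-numeric labels never match Gt


if __name__ == "__main__":
    nodes = {
        "node-a": {"nvidia.com/gpu.count": "8"},
        "node-b": {"nvidia.com/gpu.count": "1"},
        "node-c": {},
    }
    eligible = [name for name, labels in nodes.items()
                if matches_gt(labels, "nvidia.com/gpu.count", "3")]
    print("Nodes with at least 4 GPUs:", eligible)
```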

Gang Scheduling for Distributed Training

Distributed ML training jobs need all of their workers running before any of them can make progress. The default scheduler places pods one at a time, so it can deadlock: Job A holds 3 of the 4 GPUs it needs and waits for a fourth, while Job B holds that fourth GPU and waits for more. Neither progresses. This is the problem gang scheduling solves.
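A toy simulation shows the difference. The sketch below is a simplified model, not how any real scheduler is implemented: schedule_greedy hands out GPUs one pod at a time across competing jobs, while schedule_gang admits a job only when all of its GPUs are free. The exact split of GPUs in the greedy case is an artifact of this model.

```python
def schedule_greedy(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Hand out GPUs one pod at a time, round-robin, like independent pod scheduling."""
    held = {name: 0 for name in jobs}
    while free_gpus > 0:
        progressed = False
        for name, needed in jobs.items():
            if free_gpus > 0 and held[name] < needed:
                held[name] += 1
                free_gpus -= 1
                progressed = True
        if not progressed:
            break  # every job is satisfied
    return held


def schedule_gang(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit a job only if all of its GPUs are available at once."""
    held = {}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            held[name] = needed
            free_gpus -= needed
        else:
            held[name] = 0  # the job waits and holds nothing
    return held


def runnable(held: dict[str, int], jobs: dict[str, int]) -> list[str]:
    """Jobs that hold every GPU they need and can start training."""
    return [name for name, needed in jobs.items() if held[name] == needed]


if __name__ == "__main__":
    jobs = {"job-a": 4, "job-b": 4}
    print("greedy:", runnable(schedule_greedy(4, jobs), jobs))
    print("gang:  ", runnable(schedule_gang(4, jobs), jobs))
```

In the greedy case both jobs hold GPUs and neither can run. In the gang case one job runs and the other waits without holding anything.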

Use the Coscheduling plugin from the kubernetes-sigs scheduler-plugins project (referred to here as the Coscheduler) or the Volcano scheduler to coordinate multi-pod job placement:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 8
  schedulerName: volcano
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: train
              image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
              resources:
                limits:
                  nvidia.com/gpu: 1
          nodeSelector:
            accelerator: nvidia-tesla-a100

With gang scheduling, no pod in the job is scheduled until all 8 GPUs are available. This trades immediate scheduling for a guaranteed all-or-nothing start.

GPU Memory Management

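Each GPU has fixed memory, and Kubernetes cannot limit or meter it: a process that asks for more than the device has fails with a CUDA out-of-memory error. What you can control is host RAM, where exceeding the limit gets the container OOM-killed, crashing your pod and wasting compute. Set host memory requests and limits explicitly: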

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "60Gi"
  requests:
    memory: "40Gi"

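For transformer models, size GPU memory from the parameter count: a 7B-parameter model in FP16 needs roughly 14GB for weights, plus 4-8GB for activations and KV cache. Leave headroom. At 18-22GB total, a 7B model in FP16 fits comfortably on a 40GB A100 but not on a 16GB T4 unless you quantize it. A 70B model in FP16 needs about 140GB for weights alone, so it does not fit on a single 80GB A100; shard it across GPUs or quantize it.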
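The weight arithmetic is worth scripting so nobody has to do it in their head. This is a minimal sketch: the helper names are hypothetical, the 8GB default overhead is an assumption taken from the upper end of the 4-8GB range above, and real usage depends on batch size, sequence length, and framework.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits_on_gpu(params_billion: float, gpu_memory_gb: float,
                dtype: str = "fp16", overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed activation/KV-cache overhead fit on one GPU."""
    return weight_memory_gb(params_billion, dtype) + overhead_gb <= gpu_memory_gb


if __name__ == "__main__":
    for params, gpu_gb in [(7, 16), (7, 40), (70, 80)]:
        weights = weight_memory_gb(params)
        verdict = "fits" if fits_on_gpu(params, gpu_gb) else "does not fit"
        print(f"{params}B FP16: {weights:.0f} GB of weights, {verdict} on a {gpu_gb} GB GPU")
```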

Monitor actual GPU memory usage per process on the node, then map the PIDs back to pods:

nvidia-smi --query-compute-apps=pid,used_memory --format=csv

Migrating from kube-gpu to the Device Plugin

The legacy kube-gpu scheduler was deprecated. If you have existing configurations, migrate to the NVIDIA Device Plugin + Coscheduler pattern. The new stack has better community support and works with the standard Kubernetes scheduling framework.

Migration steps:

  1. Deploy the NVIDIA Device Plugin DaemonSet on all GPU nodes
  2. Verify nvidia.com/gpu resources appear in node capacity
  3. Update pod specs from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu
  4. Install the Coscheduler or Volcano for gang scheduling
  5. Test with a single GPU pod before rolling out to training jobs
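Step 3 can be scripted if you have many manifests. The sketch below renames the legacy resource key in a pod spec that has already been loaded into a Python dictionary (for example from YAML or JSON). The helper name is hypothetical, and it only handles the containers list, so treat it as a starting point rather than a complete migration tool.

```python
import copy

LEGACY_KEY = "alpha.kubernetes.io/nvidia-gpu"
NEW_KEY = "nvidia.com/gpu"


def migrate_gpu_resources(pod_spec: dict) -> dict:
    """Return a copy of a pod spec with the legacy GPU resource key renamed."""
    migrated = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    for container in migrated.get("containers", []):
        resources = container.get("resources", {})
        for section in ("limits", "requests"):
            values = resources.get(section)
            if values and LEGACY_KEY in values:
                values[NEW_KEY] = values.pop(LEGACY_KEY)
    return migrated


if __name__ == "__main__":
    old_spec = {
        "containers": [
            {"name": "train", "resources": {"limits": {LEGACY_KEY: 2}}},
        ]
    }
    print(migrate_gpu_resources(old_spec))
```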

GPU Instance Selection: Cost vs Performance

Matching the right instance to the workload is where most teams bleed money. Use this decision framework:

Instance        GPU            CPU memory  Interconnect  Best for                              $/hr (on-demand)
g4dn.xlarge     1x T4 (16GB)   16GB        PCIe          Inference, small models, dev/test     $0.526
g4dn.4xlarge    1x T4 (16GB)   64GB        PCIe          Batch inference, moderate throughput  $1.204
p4d.24xlarge    8x A100 40GB   1152GB      NVLink        Training, 7-70B models                $32.77
p5.48xlarge     8x H100 80GB   2048GB      NVLink        Large model training, RLHF, 100B+     $98.32
trn1.32xlarge   16x Trainium   512GB       NeuronLink    Cost-sensitive large-scale training   $21.50

Prices are us-east-1 on-demand Linux rates and change over time; confirm them against current AWS pricing.

Spot instances cut costs by 60-70% for fault-tolerant training workloads. Set up checkpointing to save model state every 100-500 steps so a preemption only loses a few minutes of compute. For inference, Reserved Instances or Savings Plans on g4dn cut costs to $0.18-0.22/hr per GPU, enough savings to justify the commitment on moderate traffic.
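To sanity-check the Spot trade-off, you can fold the cost of lost work into the estimate. The sketch below is a rough model under stated assumptions: each preemption loses half a checkpoint interval on average, restart time is ignored, and the discount, preemption count, and interval in the example are made-up inputs rather than measured values.

```python
def spot_training_cost(on_demand_per_hr: float, hours: float, discount: float,
                       preemptions: int, checkpoint_interval_min: float) -> float:
    """Estimated Spot cost, charging half a checkpoint interval of lost work per preemption."""
    spot_rate = on_demand_per_hr * (1 - discount)
    lost_hours = preemptions * (checkpoint_interval_min / 2) / 60
    return spot_rate * (hours + lost_hours)


if __name__ == "__main__":
    on_demand_rate = 32.77  # p4d.24xlarge on-demand $/hr
    hours = 100.0           # assumed job length
    on_demand_cost = on_demand_rate * hours
    spot_cost = spot_training_cost(on_demand_rate, hours, discount=0.65,
                                   preemptions=10, checkpoint_interval_min=10)
    savings = 1 - spot_cost / on_demand_cost
    print(f"on-demand: ${on_demand_cost:,.0f}  spot: ${spot_cost:,.0f}  savings: {savings:.0%}")
```

With frequent checkpoints the lost work barely dents the discount. The estimate gets worse quickly if checkpoints are hours apart.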

Key decision rule: if your GPU utilization averages below 40% over a month, you are on the wrong instance type. Either right-size to fewer GPUs or switch to time-slicing.

Monitoring GPU Workloads

DCGM (Data Center GPU Manager) exporter gives you Prometheus metrics for GPU utilization, memory, temperature, and power:

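helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --set serviceMonitor.enabled=true \
  --namespace gpu-operator --create-namespace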

Key metrics to track:

  • DCGM_FI_DEV_GPU_UTIL: GPU compute utilization percentage
  • DCGM_FI_DEV_FB_USED: Frame buffer memory used
  • DCGM_FI_DEV_GPU_TEMP: GPU temperature in Celsius
  • DCGM_FI_DEV_POWER_USAGE: Current power draw in watts

Alert when GPU utilization drops below 30% for sustained periods. It usually means the GPU is waiting on something else, such as data loading, CPU preprocessing, or network communication, and you are wasting expensive hardware.
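"Sustained" is the important word, since short dips between batches are normal. The sketch below shows one way to express the rule over a list of utilization samples, such as periodic DCGM_FI_DEV_GPU_UTIL readings. The function name, the window of 10 consecutive samples, and the sample values are assumptions for illustration; in production this logic would normally live in your alerting system.

```python
def sustained_low_utilization(samples: list[float], threshold: float = 30.0,
                              min_consecutive: int = 10) -> bool:
    """Return True if utilization stays below threshold for min_consecutive samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value < threshold else 0  # any healthy sample resets the streak
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    brief_dips = [85, 20, 90, 15, 88, 92, 10, 87, 91, 89, 25, 90]
    starved = [85, 22, 18, 25, 12, 20, 19, 24, 15, 11, 21, 90]
    print("brief dips alert:", sustained_low_utilization(brief_dips))
    print("starved GPU alert:", sustained_low_utilization(starved))
```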

Conclusion

Kubernetes GPU scheduling requires explicit configuration to get real performance. Start with the NVIDIA Device Plugin to expose nvidia.com/gpu resources. Use MIG on A100/H100 for latency-isolated inference, or time-slicing for development clusters. Keep multi-GPU training jobs on a single NVLink-connected node using node affinity, and let the kubelet Topology Manager align GPUs and CPUs to the same NUMA domain. Install Volcano or the Coscheduling plugin for gang scheduling so distributed training jobs do not deadlock. Monitor with DCGM and alert on sustained sub-30% GPU utilization, a clear signal you are wasting expensive hardware.

The decision framework is simple: T4 for inference, A100 for training 7-70B models, H100 for anything above. Run checkpointed training jobs on Spot instances for 60-70% cost savings. If average GPU utilization sits below 40%, right-size your cluster or switch to time-slicing.