1. The Kubernetes Cost Problem

Kubernetes delivers reliability and scalability — but it also delivers massive cloud bills if you are not paying attention. The average engineering team running Kubernetes in production is overspending by 30-50%, and most do not find out until the quarterly bill lands.

The problem is not that Kubernetes is expensive by design. It is that Kubernetes is flexible by design, and that flexibility lets you provision far more resources than your workloads actually need. A default GKE or EKS cluster with standard node pools will consume whatever you give it, and cloud providers are happy to let that happen.

This guide is about taking control: understanding where the money goes, identifying the waste, and applying concrete optimizations that compound. By the end, you will have a systematic approach to cutting your Kubernetes bill by 40-60% without sacrificing reliability.

2. Understanding Where Your Money Goes

Before you can optimize, you need to see. Kubernetes costs break down into four categories:

  • Compute (Nodes) — 60-75% of your bill. The EC2/GCE/Azure VMs that form your cluster nodes.
  • Storage — 10-20% of your bill. Persistent volumes, snapshots, and backup storage across your PVCs.
  • Network — 5-15% of your bill. Data transfer out, cross-zone traffic, load balancers, and ingress controllers.
  • Managed Services — 5-10% of your bill. EKS/GKE control plane fees, service meshes, monitoring tools, log aggregation.

Compute dominates. The optimizations that move the needle are the ones that reduce compute spend: right-sizing nodes, using Spot instances strategically, and eliminating idle capacity.

3. Right-Sizing Your Nodes and Cluster

The most common mistake teams make is provisioning node pools based on rough estimates and never revisiting them. You requested m5.xlarge instances because that is what the tutorial recommended, and six months later those nodes are running at 30% CPU utilization while you are paying full price for all of it.

Right-sizing means looking at actual resource consumption and matching your node types to your workload patterns.

Analyze Before You Change

Use the Kubernetes Metrics Server and a tool like Kubecost to get a 30-day view of actual resource usage across your namespaces. Look for:

  • Nodes running at under 40% CPU or memory utilization
  • Namespaces with CPU requests that are 2-3x actual usage
  • Pods that are constantly being OOMKilled (under-provisioned) or throttled (CPU limit too low)

Kubecost gives you a cost breakdown per namespace, deployment, and pod — which makes the conversation with your engineering team about resource limits much more concrete.

Start Free with Kubecost

Kubecost provides real-time Kubernetes cost visibility, right-sizing recommendations, and savings reports. The free tier covers single-cluster monitoring — enough to find the biggest waste in your current setup.

Explore Kubecost →

Node Right-Sizing Strategies

Once you know where the waste is, apply these patterns:

Match node family to workload type. Memory-optimized nodes (r* on AWS, highmem on GCE) for workloads that cache heavily or run in-memory databases. Compute-optimized (c* on AWS, c2 on GCE) for CPU-bound batch processing. General-purpose (m*) as a fallback, not a default.

Use a mixed node pool strategy. Rather than a single node type, split your cluster into:

  • On-Demand nodes for system daemons, databases, and anything requiring guaranteed resources
  • Spot/Preemptible nodes for stateless workloads, batch jobs, CI runners, and embarrassingly parallel tasks
  • Burstable nodes (AWS T3, GCE E2) for workloads with irregular CPU patterns
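As a sketch, a stateless Deployment can be steered toward Spot capacity with node affinity on the provider's capacity-type label. The example below uses GKE's cloud.google.com/gke-spot label (on EKS the equivalent is eks.amazonaws.com/capacityType: SPOT); the app and image names are illustrative:

```yaml
# Deployment excerpt: prefer Spot nodes, fall back to On-Demand
# when no Spot capacity is available
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-stateless-api          # illustrative name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-stateless-api
  template:
    metadata:
      labels:
        app: my-stateless-api
    spec:
      affinity:
        nodeAffinity:
          # "preferred" (not "required") keeps pods schedulable
          # on On-Demand nodes during Spot capacity shortages
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-spot
                operator: In
                values: ["true"]
      containers:
      - name: api
        image: my-registry/my-api:latest   # illustrative image
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
```

Using a preferred rather than required affinity is the key design choice: it captures Spot savings when capacity exists without making the workload unschedulable when it does not.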

4. Spot Instances — The Biggest Lever

Spot instances (AWS), Preemptible/Spot VMs (GCP), or Spot VMs (Azure) cost 60-90% less than On-Demand. The trade-off: they can be reclaimed on short notice — 2 minutes on AWS, 30 seconds on GCP. For stateless, fault-tolerant workloads, this is a non-issue.

The teams cutting their Kubernetes bill in half are running 60-70% Spot nodes for the right workloads.

What Runs Well on Spot

  • Web servers and API gateways (restart on eviction is fast and clean)
  • Async workers and queue consumers
  • CI/CD runners and build agents
  • ML training jobs (with checkpointing enabled)
  • Development and staging environments
  • Stateless microservices with multiple replicas

What Should NOT Run on Spot

  • Stateful databases (unless you have a replication and failover strategy)
  • Single-replica critical services
  • Leader-elected components (Kubernetes control plane components)
  • Anything with strict SLAs and no tolerance for brief disruption

Implementing Spot Gracefully

Use Pod Disruption Budgets (PDBs) to ensure minimum availability during Spot evictions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Configure your node pool with graceful termination: a preStop hook that stops accepting new traffic and a termination grace period that lets the pod drain cleanly before the eviction notice expires — as little as 30 seconds on GCP.
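A minimal sketch of that termination pattern, assuming the application stops accepting connections on SIGTERM and the brief sleep gives Services time to remove the pod from their endpoints (the image name is illustrative):

```yaml
# Pod template excerpt: drain cleanly before Spot eviction completes
spec:
  terminationGracePeriodSeconds: 25   # stay under the shortest eviction notice
  containers:
  - name: web
    image: my-registry/web:latest     # illustrative image
    lifecycle:
      preStop:
        exec:
          # pause so the endpoint is deregistered from Services
          # before the container receives SIGTERM
          command: ["sh", "-c", "sleep 10"]
```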

5. Cluster Autoscaler and Vertical Pod Autoscaler

The Cluster Autoscaler (built into GKE, EKS, AKS) adjusts the number of nodes in your node pool based on pending pods and node utilization. It scales down idle nodes and scales up when demand spikes.

For the Cluster Autoscaler to work well:

  • Set appropriate minimum and maximum node counts on your node pools (when self-managing the autoscaler, via the --nodes=min:max:group flag)
  • Use standard node labels and taints so pods can be scheduled correctly
  • Give it headroom — do not run at 90%+ node utilization or scaling will be too slow
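When you run the Cluster Autoscaler yourself (on GKE, EKS, and AKS the min/max are set on the node pool instead), the bounds are passed per node group on the command line. A sketch of the relevant Deployment args — the node group name is illustrative:

```yaml
# Excerpt from the cluster-autoscaler container spec
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:10:my-spot-node-group          # min:max:nodegroup (illustrative name)
- --scale-down-utilization-threshold=0.5   # scale down nodes under 50% utilized
```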

Vertical Pod Autoscaler (VPA) adjusts your pod resource requests automatically based on actual usage. It recommends or applies CPU and memory requests, eliminating the guesswork. VPA in recommendation mode is safe to run on any cluster; in auto mode, it will evict pods to apply new resource settings, so schedule it during low-traffic windows.
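A VPA object in recommendation-only mode looks like this — the target Deployment name is illustrative, and the VPA custom resource assumes the VPA components are installed in the cluster:

```yaml
# VPA in recommendation mode: computes suggested requests
# without evicting any pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # illustrative target
  updatePolicy:
    updateMode: "Off"       # recommend only; "Auto" would apply and evict
```

Recommendations then appear in the object's status (kubectl describe vpa my-app-vpa), which you can apply manually during a low-traffic window.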

6. Namespace Quotas as Guardrails

Engineering teams will naturally use whatever resources you give them — and then ask for more. Namespace-level resource quotas enforce discipline:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"

Set LimitRange objects to enforce default resource requests/limits on new pods, so teams that forget to set requests do not run with unbounded resources:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    type: Container

7. Storage Tiering

Storage is often the forgotten cost lever. Not all persistent data needs premium SSD storage.

  • Hot storage (gp3 on AWS, pd-ssd or pd-balanced on GCP): SSD block storage for databases and frequently accessed data. Cost: roughly $0.08-0.17/GB-month.
  • Warm/cold storage (st1/sc1 on AWS, pd-standard on GCP): cheaper HDD-backed storage for backups, archives, and infrequently accessed logs. Cost: roughly $0.015-0.05/GB-month.
  • Object storage (S3/GCS/Azure Blob): the cheapest tier for blobs, backups, and machine learning datasets. Mount with Rclone or use the Kubernetes S3 CSI driver.
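Tiering is implemented by offering multiple StorageClasses and letting PVCs pick one. A sketch for AWS, assuming the EBS CSI driver is installed (the class names are illustrative):

```yaml
# Hot tier: gp3 SSD for databases and latency-sensitive data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hot-gp3               # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
---
# Cold tier: sc1 HDD for backups and archival volumes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cold-sc1              # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: sc1
```

A PVC then selects the tier with spec.storageClassName: cold-sc1, so teams choose cost tiers without touching cluster-level config.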

Audit your PVCs monthly. You will almost always find old volumes from deleted services that are still accruing charges.

8. Network Cost Optimization

Cross-availability-zone traffic costs money — roughly $0.01/GB in each direction on AWS, with similar rates on the other clouds. In a multi-zone cluster, if your pods are communicating across zones constantly, you are paying a hidden tax on every request.

Optimize by:

  • Using topology-aware routing (topology hints) so Services prefer same-zone endpoints
  • Placing related microservices in the same zone when possible
  • Reducing unnecessary inter-service calls with message batching or gRPC streaming
  • Auditing egress costs: every external API call, webhook, and data export adds to your bill
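Topology-aware routing is enabled per Service with a single annotation (Kubernetes 1.27+; older versions use the service.kubernetes.io/topology-aware-hints annotation instead). A sketch with an illustrative Service name:

```yaml
# Service excerpt: prefer same-zone endpoints to avoid cross-AZ charges
apiVersion: v1
kind: Service
metadata:
  name: my-backend            # illustrative name
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-backend
  ports:
  - port: 80
    targetPort: 8080
```

Note that hints are only applied when endpoints are spread evenly enough across zones; otherwise Kubernetes silently falls back to cluster-wide routing.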

9. The Kubecost Audit — Your First Action

If you do only one thing from this guide, deploy Kubecost. It takes 10 minutes to install via Helm and gives you immediate visibility into every cost dimension of your cluster:

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --values https://raw.githubusercontent.com/kubecost/cost-analyzer-helm-chart/develop/cost-analyzer-values.yaml

Within an hour you will have a dashboard showing:

  • Cost by namespace, deployment, pod, and service
  • Unused compute (requested but not used)
  • Right-sizing recommendations per workload
  • Spot vs On-Demand savings projections
  • Historical cost trends

Kubecost — Free for Single Cluster

Kubecost's free tier includes real-time cost monitoring, right-sizing recommendations, and savings reports for one cluster. Deploy it today and find the low-hanging fruit in your current setup.

Get Started with Kubecost →

10. Putting It Together — The Optimization Stack

A mature Kubernetes cost optimization practice layers multiple strategies:

  • Week 1: Deploy Kubecost. Get 30 days of data. Identify the top 3 cost consumers.
  • Week 2: Apply namespace quotas and LimitRanges. Right-size the top 3 offending workloads.
  • Month 2: Build a Spot mixed node pool. Migrate stateless workloads (60-70% of your pods). Set up PDBs.
  • Month 3: Review storage tiers. Audit PVCs. Implement topology-aware routing.

Teams that follow this sequence consistently see 40-60% reduction in compute spend within 90 days. The savings compound — every dollar you do not spend on infrastructure is a dollar that funds product development.

The bottom line: Kubernetes cost optimization is not about running less. It is about running smarter — giving your engineers the visibility and tooling to make resource decisions that align with actual business costs, not guessed ones.

11. Kubernetes Cost Optimization for AI/ML Workloads

GPU-equipped Kubernetes nodes are among the most expensive resources in your cluster, and AI/ML workloads are hungry for them. A single GPU node can cost $2-10/hour depending on the GPU type, and idle GPU time is pure waste. Optimizing GPU utilization in Kubernetes requires a different playbook than CPU-focused workloads.

GPU Scheduling and Node Taints

Not all pods need GPUs, and not all GPUs are equal. Use node taints and tolerations to ensure GPU nodes only run workloads that actually need them:

# Taint GPU nodes so only GPU workloads schedule on them
kubectl taint nodes gpu-node nvidia.com/gpu=present:NoSchedule

# Tolerate the taint in your GPU workload pod spec
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Without this, you will end up with GPU nodes running regular CPU workloads while your ML training jobs queue up waiting for GPU capacity.

Multi-Instance GPU (MIG) on NVIDIA A100/A30

NVIDIA's Multi-Instance GPU technology lets you partition a single physical GPU into multiple logical instances. An A100 can be split into up to 7 MIG instances, each running an independent workload. For smaller ML models or batch inference, this dramatically increases GPU utilization:

# List physical GPUs and any MIG devices
nvidia-smi -L

# Check whether MIG mode is enabled on the A100
nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader

On Kubernetes, use the NVIDIA device plugin with MIG partitioning enabled to schedule workloads onto specific MIG instances. This is particularly effective for inference workloads where a single model does not saturate a full A100.
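With the device plugin's "mixed" MIG strategy, each partition size is exposed as its own extended resource, so a pod requests a slice instead of a whole GPU. A sketch — the pod name and image tag are illustrative:

```yaml
# Pod excerpt: request one 1g.5gb MIG slice of an A100
# (resource name assumes the NVIDIA device plugin's "mixed" MIG strategy)
apiVersion: v1
kind: Pod
metadata:
  name: small-inference          # illustrative name
spec:
  containers:
  - name: server
    image: my-registry/inference-server:latest   # illustrative image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one-seventh of an A100
```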

GPU Monitoring — The Metrics That Matter

Standard node metrics miss GPU-specific health signals. Track these in your Prometheus dashboards:

  • GPU utilization % — target above 80% for training, above 60% for inference
  • VRAM usage — GPU memory consumption, not to be confused with system RAM
  • GPU temperature and power draw — thermal throttling kicks in above 83°C
  • PCIe throughput — bottleneck indicator for data transfer-heavy workloads
  • NVLink/NVSwitch cross-GPU bandwidth — relevant for multi-GPU training jobs

The NVIDIA DCGM (Data Center GPU Manager) exporter for Prometheus gives you all of these out of the box. Integrate it with your Grafana dashboards to get real-time GPU cost attribution per workload:

# Deploy DCGM exporter as a DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

Spot GPU Instances for ML Training

ML training jobs are ideal Spot workload candidates — they are fault-tolerant (checkpoint/resume), episodic (run on demand), and extremely expensive at On-Demand rates. Running distributed training on Spot GPU nodes can cut your ML infrastructure bill by 70-80%.

Key implementation pattern: use a job tracker (like Volcano or Kubeflow's training operator) that handles preemption gracefully with checkpoint-based restart. Without checkpointing, Spot preemption means you lose the entire training run.
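A hedged sketch of the checkpoint/resume shape using a plain Kubernetes Job — the image, script flags, and PVC name are assumptions, and the training script is assumed to resume from the latest checkpoint when one exists. Volcano and the Kubeflow training operator layer gang scheduling and preemption handling on top of this same idea:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model               # illustrative name
spec:
  backoffLimit: 20                # tolerate repeated Spot preemptions
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: my-registry/trainer:latest        # illustrative image
        # flags are hypothetical: the script checkpoints periodically
        # and resumes from /ckpt if a checkpoint is present
        args: ["--checkpoint-dir=/ckpt", "--resume-if-exists"]
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        persistentVolumeClaim:
          claimName: training-checkpoints        # survives pod eviction
```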

Storage for GPU Workloads — The Hidden Cost Driver

GPU nodes sit idle while waiting for data to load from storage. If your training dataset is on a slow NFS volume, your $8/hour GPU node burns money while waiting for I/O. Use local NVMe storage or high-throughput parallel file systems (GPFS, Lustre) for active training data:

  • Local NVMe — 200-700K IOPS, attach as ephemeral storage to your pod
  • Amazon FSx for Lustre — petabyte-scale, integrates with S3 for ML dataset access
  • Google Cloud Filestore — managed NFS with multi-GB/s throughput at the higher service tiers

The cost of fast storage is almost always less than the cost of idle GPU time. A $200/month Filestore instance that keeps your GPU utilization at 85% instead of 55% is a clear win.
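For local NVMe scratch, a common pattern is an emptyDir backed by the node's local disk plus an ephemeral-storage request; the node label used to land on NVMe-equipped nodes is provider-specific and illustrative here:

```yaml
# Pod excerpt: stage the active dataset on local NVMe scratch
spec:
  nodeSelector:
    storage-tier: local-nvme        # illustrative label on the NVMe node pool
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # illustrative image
    resources:
      requests:
        ephemeral-storage: 500Gi    # reserve scratch capacity on the node
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}                    # backed by the node's local disk
```

Data still has to be copied in from object storage at job start, so this pattern pays off when a dataset is read many times per run.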

Cut GPU Infrastructure Costs

CoreWeave offers Kubernetes-optimized GPU instances (A100, H100, H200) with NVMe storage and preemptible pricing up to 70% below on-demand rates. Purpose-built for ML training and inference at scale.

Explore CoreWeave GPU Cloud →

12. FinOps Tooling for Kubernetes — The Full Stack

Visibility without action is just expensive dashboard hosting. The FinOps tooling ecosystem for Kubernetes gives you the full cycle: cost visibility, attribution, optimization recommendations, and enforcement.

Kubecost — The Baseline

Kubecost remains the standard for Kubernetes cost visibility. The open-source version is free for single-cluster use; the enterprise version adds multi-cluster aggregation, anomaly detection, and budget alerting. Either way, start here before evaluating commercial alternatives.

Cloud-Native Cost Tools

Each major cloud provider has its own cost optimization tool that integrates with their Kubernetes offering:

  • AWS Cost Explorer + EKS Cost Monitoring — native AWS tooling, good for basic attribution but less actionable than Kubecost
  • GCP Cloud Billing + GKE Cost Attribution — label-based attribution integrated into the GCP billing dashboard
  • Azure Cost Management + AKS — similar label-based approach, Azure-specific cost recommendations

The cloud-native tools are useful for high-level showback to finance teams, but they do not give engineers the actionable right-sizing recommendations that Kubecost provides.

Policy-Based Enforcement with OPA

Prevent waste before it happens with Open Policy Agent (OPA) gatekeeper policies that enforce cost standards at deployment time:

# Example: block deployment if no resource limits are set
# (assumes the matching K8sRequiredResources ConstraintTemplate is installed)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: container-must-have-limits
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    limits:
    - cpu: "500m"
      memory: "128Mi"

OPA policies cannot make a workload cheaper by themselves, but they can ensure that every workload deployed to your cluster has explicit resource requests and limits — which is the foundation of right-sizing.

Recommended Tool: Kubecost

Real-time Kubernetes cost monitoring, right-sizing recommendations, and Spot savings projections. Free tier for single-cluster deployments. Deploy in minutes via Helm.