If you're running AI training or inference workloads on AWS, you're probably burning money on On-Demand pricing. GPU instances (P5, P4d, P3) cost tens of dollars per hour On-Demand, which adds up to thousands per day for a single node. The gap between On-Demand and committed pricing is not marginal; for sustained GPU workloads, it can mean the difference between profitable and unprofitable.

This guide cuts through the confusion: Savings Plans vs Reserved Instances — what they are, when to use each, and how to structure a coverage strategy specifically for AI infrastructure.

The Three Pricing Models

On-Demand — Pay per second, no commitment. Highest cost, highest flexibility.

Reserved Instances (RI) — Make a 1-year or 3-year commitment to a specific instance type. Regional RIs apply across Availability Zones in one Region; zonal RIs reserve capacity in a single AZ. Up to 72% savings vs On-Demand (3-year, all-upfront Standard RIs). Lower flexibility, since you're locked to the instance type (and, for zonal RIs, the AZ).

Savings Plans (SP) — Commit to spending a fixed dollar amount per hour on compute, not to specific instance types. Compute Savings Plans apply across instance families, sizes, Regions, operating systems, and tenancy (and also cover Fargate and Lambda), with savings up to 66%. EC2 Instance Savings Plans are scoped to one instance family in one Region and reach up to 72%.
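The key mechanical difference is that commitments bill every hour whether or not you run anything. A minimal sketch (all rates are made-up placeholders, not current AWS pricing):

```python
# Illustrative comparison of pay-as-you-go vs committed pricing for one
# GPU instance. All rates are placeholder assumptions, not AWS list prices.

HOURS_PER_MONTH = 730

def monthly_cost(hours_used: float, on_demand_rate: float,
                 committed_hourly: float = 0.0) -> float:
    """Committed spend is billed for every hour of the month, used or not."""
    return committed_hourly * HOURS_PER_MONTH + on_demand_rate * hours_used

# On-Demand: pay only for the hours you actually run
od = monthly_cost(hours_used=500, on_demand_rate=40.0)
# Commitment (RI or SP): the committed rate bills for all 730 hours
committed = monthly_cost(hours_used=0, on_demand_rate=0.0, committed_hourly=24.0)

print(f"On-Demand (500 hrs): ${od:,.0f}")        # $20,000
print(f"1-yr commitment:     ${committed:,.0f}")  # $17,520
```

Note the crossover: at 500 hours/month the commitment already wins, but at 300 hours On-Demand would be cheaper. That utilization threshold is the whole game.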

The Critical Distinction: Compute Savings Plans vs EC2 Reserved Instances

Most people compare SP vs RI as if they're equivalent. They're not.

EC2 Reserved Instances:

  • Tied to a specific instance type (e.g., `p5.48xlarge`)
  • Zonal RIs are tied to a specific Availability Zone; regional RIs apply anywhere in the Region
  • Cover only that instance type (regional Linux RIs get limited size flexibility within a family)
  • If you stop using that instance, you keep paying for the reservation

Compute Savings Plans:

  • Apply to ANY EC2 instance family, size, Region, OS, and tenancy (and also cover Fargate and Lambda)
  • More flexible — a single Compute SP can cover `p5.48xlarge` today and `p4d.24xlarge` tomorrow, with no family restriction
  • You can change instance sizes, families, and AZs as your workload evolves
  • Up to 66% maximum savings, slightly below the 72% ceiling of EC2 Instance Savings Plans and Standard RIs

Recommendation: For AI workloads, default to Compute Savings Plans over EC2 Reserved Instances. The flexibility usually outweighs the slightly higher maximum discount of RIs (72% vs 66%), especially when your instance mix changes every few quarters.

Instance Family Nuance for AI

One wrinkle for AI teams: the `ml.` prefix denotes SageMaker instances, which are covered by a separate plan type, SageMaker Savings Plans, not by Compute Savings Plans or EC2 RIs. Here's roughly what the numbers look like:

| Instance Family | Common AI Use Case | On-Demand $/hr | 1-yr SP $/hr | Savings |
|---|---|---|---|---|
| `ml.p5` | H100/H200 training | ~$45 | ~$28 | ~38% |
| `ml.p4d` | A100 training | ~$25 | ~$15 | ~40% |
| `ml.g5` | Inference (moderate) | ~$8 | ~$5 | ~37% |
| `ml.g6` | Inference (T4/L4) | ~$4 | ~$2.50 | ~37% |

(Illustrative figures; check current AWS pricing for your Region and instance size.)

Family scoping depends on the plan type. EC2 Instance Savings Plans are locked to one family in one Region: a p5 plan covers all P5 sizes but never a g5 instance. Compute Savings Plans have no family restriction at all. SageMaker Savings Plans likewise apply across eligible `ml.` instance usage regardless of family, size, or Region.
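The scoping rules can be sketched as a toy matcher. This is a simplification of AWS's real billing logic, and the plan/usage fields are illustrative; the point is that only EC2 Instance Savings Plans are family-scoped, while Compute Savings Plans match any EC2 usage:

```python
# Simplified sketch of Savings Plan scoping -- not AWS's actual billing
# engine; the plan/usage fields below are illustrative assumptions.

def plan_covers(plan_type: str, plan_family: str, plan_region: str,
                usage_family: str, usage_region: str) -> bool:
    if plan_type == "compute":
        # Compute SPs match any EC2 family, size, Region, OS, or tenancy
        return True
    if plan_type == "ec2_instance":
        # EC2 Instance SPs are locked to one family in one Region
        return plan_family == usage_family and plan_region == usage_region
    return False

assert plan_covers("compute", "p5", "us-east-1", "g5", "eu-west-1")
assert plan_covers("ec2_instance", "p5", "us-east-1", "p5", "us-east-1")
assert not plan_covers("ec2_instance", "p5", "us-east-1", "g5", "us-east-1")
```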


The GPU Workload Pattern Problem

AI infrastructure has a unique cost challenge: workloads vary dramatically between training (bursty, high GPU utilization for days/weeks) and inference (sustained, lower utilization).

Training workloads — RIs/SPs are risky because training runs are often:

  • Experiment-driven (you don't know how long a training run will take)
  • Multi-cloud (switching between AWS, GCP, and Azure as capacity fluctuates)
  • Short-lived experiments that get killed

Inference workloads — RIs/SPs are a no-brainer because:

  • Production inference is sustained 24/7/365
  • Model serving is typically stable — same instance types for months
  • Predictable traffic patterns

Recommendation: Commit reserved capacity ONLY for inference, not training. Use On-Demand + Spot for training unless you have extreme certainty about the training duration and instance type.
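The reasoning behind that recommendation reduces to a break-even calculation: committed spend bills 24/7, so it only beats On-Demand when utilization exceeds the ratio of the committed rate to the On-Demand rate. The rates below are illustrative placeholders:

```python
# Back-of-envelope break-even: a commitment bills every hour, so it beats
# On-Demand only when utilization exceeds committed_rate / on_demand_rate.
# Rates here are illustrative placeholders, not AWS list prices.

def breakeven_utilization(on_demand_rate: float, committed_rate: float) -> float:
    """Fraction of hours the instance must actually run to justify committing."""
    return committed_rate / on_demand_rate

# At a ~38% discount you need >62% utilization. Bursty training pipelines
# with idle gaps between experiments often fall short; 24/7 inference doesn't.
u = breakeven_utilization(on_demand_rate=40.0, committed_rate=24.8)
print(f"break-even utilization: {u:.0%}")  # 62%
```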

The Automatic Application Strategy

Savings Plans require no manual assignment: each hour, AWS billing automatically matches your commitment against eligible usage, starting with the usage that earns the highest discount percentage. Here's the workflow:

  1. Buy Compute Savings Plans for your expected baseline inference capacity
  2. Set a coverage target — aim for 70-80% coverage of your steady-state inference spend
  3. Let automatic application handle the rest — AWS applies your commitment to any matching usage up to your hourly commitment amount
  4. Fill the remaining gap with On-Demand for traffic spikes

Coverage breakdown:

Baseline inference (70% of traffic) → Covered by SP

Traffic spikes (30%) → On-Demand

Experimental deployments → Spot instances

Training runs → On-Demand or Spot
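The tiered breakdown above can be turned into a blended-rate estimate. The rates and traffic splits below are illustrative assumptions, not AWS pricing:

```python
# Blended hourly cost for the tiered coverage strategy above.
# Rates and traffic splits are illustrative assumptions, not AWS prices.

def blended_rate(tiers: list[tuple[float, float]]) -> float:
    """tiers: (share_of_usage, effective $/hr rate); shares must sum to 1."""
    assert abs(sum(share for share, _ in tiers) - 1.0) < 1e-9
    return sum(share * rate for share, rate in tiers)

on_demand = 40.0                  # hypothetical $/hr On-Demand rate
tiers = [
    (0.70, on_demand * 0.62),     # baseline inference under SP (~38% off)
    (0.30, on_demand),            # traffic spikes left On-Demand
]
print(f"blended ${blended_rate(tiers):.2f}/hr vs ${on_demand:.2f}/hr On-Demand")
```

Even with 30% of traffic still On-Demand, the blended rate lands well below the On-Demand rate, which is why partial coverage is usually the right first step.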

Azure and GCP Equivalents

Azure:

  • Azure Reserved Instances — similar to AWS RIs, 1 or 3 year commitments
  • Azure Savings Plans for Compute — equivalent to AWS Compute Savings Plans, flexibility across instance sizes
  • Azure Hybrid Benefit — Windows/SQL licenses can be reused; also applies to some GPU VMs

Google Cloud:

  • Committed Use Discounts (CUDs) — 1-year or 3-year commitments, roughly analogous to RIs
  • Resource-based CUDs — commit to specific resources (vCPUs, memory, GPUs) in a region; this is the type that covers GPU usage. Spend-based (flexible) CUDs are the newer, more flexible option
  • Spot VMs — the GCP equivalent of Spot instances, up to 91% off On-Demand

Coverage Analysis: How Much Can You Actually Save?

Using AWS Cost Explorer, you can model Savings Plans coverage:

Example: P5 inference deployment

Current monthly spend on ml.p5.48xlarge On-Demand: $32,000

Baseline (predictable inference): 70% = $22,400

Committed via 1-year Compute SP at 38% savings: $22,400 × 0.62 = $13,888/month

Remaining On-Demand (spikes): $32,000 − $22,400 = $9,600

New monthly total: $13,888 + $9,600 = $23,488

Monthly savings: $8,512 | Annual savings: ~$102,000
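The same arithmetic, wrapped as a reusable calculator. The baseline share and discount are the assumptions you'd pull from Cost Explorer for your own account:

```python
# Partial Savings Plan coverage of a monthly On-Demand bill.
# baseline_share and discount are assumptions taken from your own
# Cost Explorer data; the figures below are the worked example's.

def sp_savings(monthly_on_demand: float, baseline_share: float,
               discount: float) -> dict:
    baseline = monthly_on_demand * baseline_share    # predictable portion
    committed_cost = baseline * (1 - discount)       # what you pay under the SP
    remaining = monthly_on_demand - baseline         # spikes stay On-Demand
    return {
        "committed_cost": committed_cost,
        "remaining_on_demand": remaining,
        "monthly_savings": monthly_on_demand - committed_cost - remaining,
    }

r = sp_savings(monthly_on_demand=32_000, baseline_share=0.70, discount=0.38)
print({k: round(v) for k, v in r.items()})
# {'committed_cost': 13888, 'remaining_on_demand': 9600, 'monthly_savings': 8512}
```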

That's realistic for a mid-size inference deployment, and it scales linearly: double the committed spend, double the savings.

The Commitment Trap

The biggest mistake teams make: over-committing SPs/RIs for workloads that shrink.

  • A 1-year commitment doesn't care if you deprecate a model
  • You CAN sell unused Standard RIs on the EC2 Reserved Instance Marketplace (usually at a steep discount); Convertible RIs and Savings Plans cannot be resold at all
  • For rapidly-changing AI infra, 1-year commitments are safer than 3-year

For AI specifically: The pace of model improvement means you're likely to migrate to newer GPU generations within 18-24 months. Don't lock into 3-year RIs for production inference unless you have extreme confidence in your instance family's longevity.
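That migration risk can be quantified: if a workload is deprecated before the term ends, you still pay for the full commitment, so the commitment only breaks even if the workload survives long enough. A small sketch (discounts illustrative):

```python
# If a workload dies before the commitment term ends, you still pay for
# the whole term. The commitment beats On-Demand only if the workload
# survives at least this many months. Discounts below are illustrative.

def min_lifetime_months(term_months: int, discount: float) -> float:
    """Months the workload must survive for the commitment to break even.

    total committed cost = od_monthly * (1 - discount) * term_months
    on-demand cost for m months = od_monthly * m
    """
    return term_months * (1 - discount)

# A 1-year plan at 38% off must survive ~7.4 months to break even;
# a 3-year plan at 50% off must survive a full 18 months.
print(round(min_lifetime_months(12, 0.38), 2))  # 7.44
print(min_lifetime_months(36, 0.50))            # 18.0
```

With GPU generations turning over every 18-24 months, that 18-month break-even on a 3-year plan is exactly why the text above warns against it.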


Tools for Managing Reserved Capacity

| Tool | Use Case |
|---|---|
| AWS Cost Explorer | Coverage analysis, savings projections |
| CloudHealth (VMware) | Multi-cloud RI/SP management |
| Spot.io (NetApp) | Auto-recommendations, Spot + SP optimization |
| AWS Budgets | Alert when usage drops below SP coverage |
| Kubecost | Kubernetes cost attribution + SP recommendations |

Summary: When to Use What

| Workload Type | Pricing Model | Commitment | Expected Savings |
|---|---|---|---|
| Production inference (stable) | Compute Savings Plans | 1-year | 37-40% |
| Production inference (growing) | Compute Savings Plans | 1-year, scale gradually | 30-37% |
| Variable inference load | Savings Plans (partial) + On-Demand | ~50% covered | 20-30% |
| Training runs | On-Demand or Spot | None | 0% |
| Short experiments | Spot Instances | None | 60-91% off |
| Batch inference | Spot + On-Demand mix | None | 40-60% |

Conclusion

For AI infrastructure teams, the Savings Plans vs Reserved Instances decision is simpler than it appears: default to Compute Savings Plans over EC2 RIs, commit only for stable inference workloads, and leave training and experimentation on On-Demand/Spot.

The 37-40% savings on your largest inference bill is real money: at scale, a $100K/month inference bill becomes roughly $62K/month with full SP coverage. That's not marginal. Start with a coverage analysis, model your baseline, and commit conservatively (you can always add more SPs as confidence grows).