Your AI workload is running. It's producing results. And it's quietly costing you three times what it should.
GPU costs are the fastest-growing line item in cloud bills for teams running AI/ML workloads. But unlike compute or storage waste, GPU waste is invisible in the standard cost explorer views, opaque in vendor dashboards, and genuinely hard to reason about without understanding how GPU workloads actually behave.
This guide is for engineering teams who are past the "should we use AI?" phase and are now staring at a GPU bill wondering where the money went.
Where GPU Costs Actually Hide
Before you can optimize, you need to find the waste. GPU cost waste concentrates in four specific places that most teams miss:
1. Idle GPU Instances
This is the biggest one, and it's embarrassingly common. GPU instances run 24/7 for workloads that are actually intermittent. A fine-tuning job that runs for 6 hours still pays for the instance at 3 AM when nothing is happening. Jupyter notebooks stay running because developers hate losing their kernel state. Development clusters idle overnight and on weekends.
The pattern: GPU utilization for a "running" AI team is often 15-25% on average, even though the instances never show as idle in the traditional sense. They're running, just not doing useful work.
Measurement first: Before you optimize, pull GPU utilization metrics over a full week. AWS CloudWatch, GCP Cloud Monitoring, and Azure Monitor all expose GPU metrics — but you have to explicitly enable them. A utilization pattern below 40% sustained average is a clear signal you're overprovisioned for your actual workload pattern.
2. Overprovisioned GPU Memory
Different AI workloads have wildly different memory profiles. Running a small inference workload on an A100 (80GB VRAM) when an L4 (24GB VRAM) would handle it is pure waste. The GPU memory ceiling often drives instance selection without engineers doing the math on whether they actually need that ceiling.
A few benchmarks worth knowing:
- GPT-3 style 7B parameter model inference: ~14GB VRAM (L4 handles it fine)
- LLaMA-2 70B in FP16: ~140GB VRAM (you need an A100 or multi-GPU setup)
- Stable Diffusion XL inference: ~12GB VRAM (L4 is plenty)
- Training on large custom datasets: memory requirements scale with batch size — optimize batch size before upgrading instances
3. Wrong Instance Type for the Workload
This is the one that hurts the most because it feels like a technical decision. H100s are 3x the price of A100s on-demand. A100s are 4x the price of L4s. For many workloads — especially inference, smaller model training, and batch processing — the cheaper GPU performs identically in terms of time-to-result and cost-per-token.
The common mistake: Using H100s (the fastest available) for workloads that are I/O bound, not compute bound. If your GPU sits at 60% compute utilization while waiting for data, you're not using the H100's compute advantage — you're just paying for it.
Real pricing reference:
- A100 (80GB): ~$3.50/hr on-demand (AWS p4d.24xlarge, normalized per GPU)
- A100 spot: ~$1.05/hr (70% savings)
- H100: ~$12-15/hr on-demand
- L4: ~$0.80/hr on-demand
- L4 spot: ~$0.25/hr
For most inference workloads, an L4 cluster will outperform a single H100 in cost efficiency by a factor of 5-10x.
4. Training vs. Inference Conflation
Training and inference have fundamentally different cost profiles. Training is bursty, compute-intensive, and benefits from the fastest available GPUs. Inference is steady-state, latency-sensitive in some contexts but not others, and often more memory-constrained than compute-constrained.
The mistake: running inference workloads on training-optimized instances, or sizing inference clusters for peak training throughput. These are different workloads that need different infrastructure strategies.
Spot/Preemptible Instances: The 70% Solution
Spot instances are the single highest-impact lever for GPU cost reduction. The savings are real — consistently 60-80% off on-demand pricing across all major cloud providers.
The catch is that spot instances can be reclaimed with 2 minutes' notice (AWS) or 30 seconds (GCP). For many AI workloads, this is manageable with proper architecture.
Workloads That Work Well on Spot
Model training with checkpointing — If your training job saves checkpoints every N steps, losing an instance means you restart from the last checkpoint, not from zero. Most modern training frameworks (PyTorch Lightning, Hugging Face Trainer) have built-in checkpoint support. The overhead of checkpoint-aware training is usually 2-5% of compute time.
Batch inference jobs — If you're running inference on a dataset (not serving real-time user requests), spot interruption just means re-queuing the affected batch. Build idempotent batch processing and spot becomes easy.
Experiment tracking and hyperparameter search — These are embarrassingly parallel workloads. An interruption kills one trial, not all of them. Use spot aggressively here.
Development and experimentation — Most of the time, a developer can absorb a 2-minute interruption if it means they pay 1/3 the price.
Workloads That Don't Work Well on Spot
Real-time inference serving where latency guarantees are required. Long-running fine-tuning jobs without robust checkpointing. Any workload where the interruption recovery overhead exceeds the spot savings.
Implementation
AWS Spot Fleet and SageMaker managed spot training handle the mechanics. On GCP, preemptible VMs and Spot VMs (newer, with more flexibility) serve the same purpose. On Azure, Spot VMs with eviction policies.
For Kubernetes-based AI workloads, Spot.io (now Spot by NetApp) and CAST AI both provide smart spot management — they handle instance diversification (spreading across instance types to reduce interruption probability), fallback to on-demand when spot isn't available, and automated rebalancing.
Spot diversification tip: Never rely on a single instance type. Configure your spot fleet to accept 5-8 instance types that meet your GPU requirements. Interruption risk drops dramatically when you're not competing for a single instance pool.
Model Optimization: Compute Reduction at the Source
Before you optimize infrastructure, optimize the model. A quantized model running on cheaper hardware often beats a full-precision model on expensive hardware in both cost and latency.
Quantization
Quantization reduces the numerical precision of model weights, shrinking memory footprint and often improving inference speed. Going from FP32 to INT8 typically reduces memory by 4x with 1-3% quality degradation. Going to INT4 halves that again with somewhat higher quality impact depending on the model and task.
Practical tools:
- bitsandbytes (Python): Simple INT8 and INT4 quantization for transformer models. Drop-in for most Hugging Face workflows.
- GGUF/llama.cpp: Quantized model format that runs efficiently on CPU and consumer GPUs. Excellent for local deployment or cheap instance types.
- TensorRT (NVIDIA): NVIDIA's production inference optimizer. Fuses operations, reduces precision appropriately, and produces optimized engines. 2-4x throughput improvement over naive PyTorch for inference.
When to quantize: Always evaluate quantization for inference workloads. For training, quantization is more complex (quantization-aware training vs. post-training quantization) but worth exploring for large runs.
Knowledge Distillation
Distillation trains a smaller "student" model to mimic a larger "teacher" model. The student ends up smaller and faster while retaining most of the teacher's performance on your specific task. This is higher-effort than quantization but can yield more dramatic efficiency gains.
If you're fine-tuning a foundation model for a specific task, consider whether you need the full model or whether a distilled, task-specific version would serve 95% of your use cases at 30% of the cost.
Inference Batching
Batching multiple inference requests together dramatically improves GPU utilization. A GPU that processes one request at a time is often at 5-10% utilization. Batching 8-16 requests together can push utilization above 80% while adding minimal latency.
vLLM has become the standard here for LLM serving — its continuous batching implementation is far more efficient than naive request-by-request processing. For production LLM inference, vLLM running on an L4 can outperform naive serving on an A100 in throughput-per-dollar.
Multi-Cloud Arbitrage
GPU availability and pricing vary significantly across cloud providers. AWS, GCP, and Azure don't always have the same instance types available in the same regions at the same prices. Add CoreWeave, Lambda Labs, and Vast.ai into the mix and the variance becomes dramatic.
Spot pricing on a single day (typical ranges):
- A100 80GB: $1.05-$1.80/hr depending on cloud and region
- H100 80GB: $2.50-$5.00/hr on spot
- A10G (AWS): $0.65-$0.90/hr
For non-latency-sensitive batch workloads, running jobs wherever GPU spot capacity is cheapest can reduce training costs by 30-50% compared to committing to a single provider.
CAST AI supports multi-cloud GPU optimization for Kubernetes workloads. Kubecost provides the visibility layer — GPU cost allocation by namespace, job, and team — that makes cross-cloud comparison possible.
Real Cost Breakdowns
Let me make this concrete. Here's a training run comparison for a realistic workload: fine-tuning a 7B parameter language model for a custom task over a weekend.
Assumptions: 48-hour training run, 8 GPUs required for reasonable training time.
| Scenario | Instance | GPU Cost/hr | Total Cost | Notes |
|---|---|---|---|---|
| On-demand A100 | p4d.24xlarge | $32.77 (8x A100) | $1,573 | Full price, no risk |
| Spot A100 | p4d spot | ~$9.00 (8x A100) | $432 | 72% savings, checkpoint required |
| On-demand L4 | g6.12xlarge | ~$16 (4x L4) | $768 | Slower per GPU, more needed |
| Spot L4 | g6 spot | ~$4.80 | $230 | Cheapest viable option |
| CoreWeave A100 spot | CoreWeave | ~$6.40 | $307 | Alternative cloud, often available |
For a team doing 4 training runs per month, the difference between "default on-demand A100" and "spot-first with L4 for smaller runs" is roughly $4,500-6,000/month. At scale, this adds up fast.
The Cost Optimization Stack
Here's the practical tooling stack for teams serious about GPU cost optimization:
Visibility:
- Kubecost — GPU cost allocation for Kubernetes, breaks down by workload/team/namespace
- CloudWatch/Cloud Monitoring — GPU utilization metrics (must enable explicitly)
- Weights & Biases or MLflow — Training run costs tracked alongside model metrics
Infrastructure optimization:
- CAST AI — Automated spot instance management and right-sizing for Kubernetes
- Spot.io — Spot fleet management with sophisticated diversification and fallback
- SageMaker Managed Spot Training — If you're AWS-native, this handles most of the spot complexity for training jobs
Model optimization:
- bitsandbytes — Drop-in quantization
- TensorRT — Production inference optimization (NVIDIA)
- vLLM — High-throughput LLM serving
Implementation Priority
If you're starting from scratch on GPU cost optimization, here's the sequence that delivers the fastest ROI:
- Week 1: Enable GPU utilization metrics and pull a week of data. Identify idle instances and instances with <30% average utilization.
- Week 2: Implement auto-shutdown for development/experimentation instances outside business hours. This alone often saves 30-40% of dev GPU spend.
- Week 3: Add spot instances for any batch training jobs. Start with checkpoint-enabled jobs where spot is low-risk.
- Month 2: Evaluate quantization for inference workloads. Run your inference workload on an L4 and compare quality vs. current setup.
- Month 3: Implement proper GPU cost allocation across teams. If engineers can see what their experiments cost, spending patterns change.
The goal isn't to cut GPU access — it's to make sure every GPU-hour spent is contributing to useful output. Most teams that do this work find they can run more experiments, not fewer, at the same or lower total cost.
Your AI ambitions don't have to break the budget. They just have to meet it honestly.