The Uncomfortable Truth About Your EC2 Fleet

When I ran right-sizing analysis at a previous employer — a SaaS platform running ~400 EC2 instances across three environments — we found that 38% of production instances had average CPU utilization under 10% over a 30-day period. Another 22% had peak utilization under 25%.

These weren't lab environments or dev boxes. They were production instances that someone provisioned at one point in time and nobody had revisited. The company was spending $340K/month on EC2. Right-sizing captured $87K/month without touching code.

This guide is the playbook I used.


Why Instances End Up Oversized

Before the process, it's worth understanding why right-sizing opportunities exist:

Provisioning for projected load: Engineers spin up instances for traffic that never materialized, or peak capacity that hasn't peaked in months.

The "just in case" buffer: Platform teams add safety margin on top of safety margin. A service needs 4 vCPUs at peak → you provision 8 → the 8 "feels risky" so you go to 16.

No ownership: In shared infrastructure, nobody has the authority to downsize. The team that could take the savings isn't the team that owns the cost.

Instance type drift: AWS releases new generations regularly. An m4.2xlarge from 2019 may deliver worse performance at higher cost than an m7i.xlarge in 2026.

Environment sprawl: Dev and staging environments provisioned at production scale "for parity," then forgotten.


Current EC2 Pricing Reference (US East — March 2026)

Understanding the relative cost of instance types is foundational. Here's a representative slice of on-demand Linux pricing in us-east-1:

General Purpose (m-series):

Instance       vCPU   RAM       On-Demand/hr   Monthly
m7i.large      2      8 GB      $0.1008        $73.58
m7i.xlarge     4      16 GB     $0.2016        $147.17
m7i.2xlarge    8      32 GB     $0.4032        $294.34
m7i.4xlarge    16     64 GB     $0.8064        $588.67
m7i.8xlarge    32     128 GB    $1.6128        $1,177.34

Compute Optimized (c-series):

Instance       vCPU   RAM      On-Demand/hr   Monthly
c7i.large      2      4 GB     $0.0850        $62.05
c7i.xlarge     4      8 GB     $0.1700        $124.10
c7i.2xlarge    8      16 GB    $0.3400        $248.20
c7i.4xlarge    16     32 GB    $0.6800        $496.40

Memory Optimized (r-series):

Instance       vCPU   RAM      On-Demand/hr   Monthly
r7i.large      2      16 GB    $0.1260        $91.98
r7i.xlarge     4      32 GB    $0.2520        $183.96
r7i.2xlarge    8      64 GB    $0.5040        $367.92

ARM/Graviton (cost-performance leaders):

Instance       vCPU   RAM      On-Demand/hr   Monthly    vs x86
m8g.large      2      8 GB     $0.0864        $63.07     -14% vs m7i
m8g.xlarge     4      16 GB    $0.1728        $126.14    -14% vs m7i
m8g.2xlarge    8      32 GB    $0.3456       $252.29    -14% vs m7i
c8g.xlarge     4      8 GB     $0.1445       $105.49    -15% vs c7i

Graviton4 (m8g/c8g/r8g) typically delivers 20–40% better price-performance than equivalent x86 instances. For new workloads or when rebuilding, always test on Graviton first.

Use the EC2 Pricing Calculator to model specific configurations including Savings Plans and Reserved Instance discounts, which can reduce these numbers by 30–60%.
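The Monthly column above is just the hourly rate times the standard 730-hour monthly convention. A quick sketch of that arithmetic (the helper names here are mine, not an AWS API):

```python
# Sketch: how the Monthly column is derived. AWS bills per second, but
# 730 hours (365 * 24 / 12) is the standard monthly approximation.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float) -> float:
    """On-demand monthly cost for one always-on instance."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

def graviton_delta(x86_hourly: float, graviton_hourly: float) -> float:
    """Percent price difference of a Graviton type vs its x86 equivalent."""
    return round((graviton_hourly - x86_hourly) / x86_hourly * 100, 1)

print(monthly_cost(0.1008))            # m7i.large
print(graviton_delta(0.1008, 0.0864))  # m8g.large vs m7i.large
```

The same two helpers reproduce every row in the tables above, which is a useful sanity check when you update the numbers from the current price list.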


Step 1: Build Your Instance Inventory

Before sizing anything, you need a complete picture of what you're running.

# List all EC2 instances with key attributes
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[*].{
    ID:InstanceId,
    Type:InstanceType,
    State:State.Name,
    Name:Tags[?Key==`Name`]|[0].Value,
    Env:Tags[?Key==`Environment`]|[0].Value,
    LaunchTime:LaunchTime,
    Region:Placement.AvailabilityZone
  }' \
  --output table

# Export to CSV for analysis
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name,LaunchTime,Tags[?Key==`Name`]|[0].Value,Tags[?Key==`Environment`]|[0].Value]' \
  --output text > ec2-inventory.csv

For multi-account organizations:

# Use AWS Organizations to iterate accounts
for account in $(aws organizations list-accounts --query 'Accounts[?Status==`ACTIVE`].Id' --output text); do
  echo "=== Account: $account ==="
  aws ec2 describe-instances \
    --profile "org-role-${account}" \
    --query 'Reservations[*].Instances[?State.Name==`running`].[InstanceId,InstanceType,LaunchTime]' \
    --output text
done
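Once the export exists, a short script turns it into a fleet summary. A sketch, assuming the tab-separated layout produced by the `--output text` export command above (file name and column order as in that command):

```python
# Sketch: summarize the inventory exported above. `--output text` emits
# tab-separated rows: InstanceId, InstanceType, State, LaunchTime, Name, Env.
import csv
from collections import Counter

def summarize_inventory(path: str) -> Counter:
    """Count running instances per instance type."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Column 2 is the instance state; skip stopped/terminated rows
            if len(row) >= 3 and row[2] == "running":
                counts[row[1]] += 1
    return counts

if __name__ == "__main__":
    for itype, n in summarize_inventory("ec2-inventory.csv").most_common():
        print(f"{itype}: {n}")
```

A per-type count is often the first "aha" moment: a long tail of one-off instance types usually signals ad-hoc provisioning with no sizing standard.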

Step 2: Pull CloudWatch Metrics

CPU and memory utilization are your primary right-sizing signals. CloudWatch gives you CPU for free; memory requires the CloudWatch agent.

# CPU utilization: average and max over 30 days for a single instance
INSTANCE_ID="i-1234567890abcdef0"
START=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-30d +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Average CPU
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --start-time $START \
  --end-time $END \
  --period 86400 \
  --statistics Average Maximum \
  --query 'sort_by(Datapoints, &Timestamp)[*].{Date:Timestamp,Avg:Average,Max:Maximum}' \
  --output table

For fleet-wide analysis, use AWS Compute Optimizer instead of manual CloudWatch queries:

# Get Compute Optimizer recommendations for all EC2 instances
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    CurrentType:currentInstanceType,
    Recommendation:recommendationOptions[0].instanceType,
    MonthlySavings:recommendationOptions[0].estimatedMonthlySavings.value,
    Currency:recommendationOptions[0].estimatedMonthlySavings.currency,
    PerfRisk:recommendationOptions[0].performanceRisk
  }' \
  --output table

Compute Optimizer analyzes 14 days of metrics by default. Opt into enhanced infrastructure metrics (a paid preference) to extend the lookback window to 93 days; the longer window catches monthly batch jobs and weekly patterns that 14-day analysis misses. Memory-based recommendations still require the CloudWatch agent on each instance.

# Opt in to Compute Optimizer with enhanced infrastructure metrics
aws compute-optimizer update-enrollment-status \
  --status Active

# Enable enhanced infrastructure metrics for the longer lookback window
aws compute-optimizer put-recommendation-preferences \
  --resource-type Ec2Instance \
  --scope name=AccountId,value=$(aws sts get-caller-identity --query Account --output text) \
  --enhanced-infrastructure-metrics Active

Step 3: Classify Your Findings

Compute Optimizer returns four findings per instance:

  • OVER_PROVISIONED: Instance has headroom to downsize
  • UNDER_PROVISIONED: Instance is constrained (don't touch these)
  • OPTIMIZED: Current type is the right fit
  • NOT_OPTIMIZED: Insufficient data (usually the instance is too new)

For OVER_PROVISIONED instances, Compute Optimizer provides up to 3 recommendation options ranked by performance risk (LOW, MEDIUM, HIGH). Always prefer LOW performance risk options for production.

# Filter to only over-provisioned instances with low-risk recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=finding,values=OVER_PROVISIONED \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    CurrentType:currentInstanceType,
    RecType:recommendationOptions[?performanceRisk==`LOW`] | [0].instanceType,
    MonthlySavings:recommendationOptions[?performanceRisk==`LOW`] | [0].estimatedMonthlySavings.value
  }' \
  --output table

Step 4: Prioritize by Savings Impact

Not all right-sizing opportunities are worth the operational risk. Prioritize by:

  1. Savings magnitude: Focus on instances where downsize saves >$100/month
  2. Instance generation: Upgrading from m4 → m7i while downsizing (same cost, better performance) is always a win
  3. Environment: Dev/staging first, then non-critical production, then critical production
  4. Stateless vs stateful: Stateless instances (behind a load balancer) can be hot-swapped; stateful instances (databases, caches) need maintenance windows

Build a prioritized list:

import boto3

co = boto3.client('compute-optimizer', region_name='us-east-1')

# The API paginates; collect every page of over-provisioned findings
recommendations = []
next_token = None
while True:
    kwargs = {'filters': [{'name': 'finding', 'values': ['OVER_PROVISIONED']}]}
    if next_token:
        kwargs['nextToken'] = next_token
    resp = co.get_ec2_instance_recommendations(**kwargs)
    recommendations.extend(resp['instanceRecommendations'])
    next_token = resp.get('nextToken')
    if not next_token:
        break

candidates = []
for rec in recommendations:
    low_risk_opts = [o for o in rec['recommendationOptions'] if o.get('performanceRisk') == 'LOW']
    if not low_risk_opts:
        continue
    
    best = low_risk_opts[0]
    monthly_savings = best.get('estimatedMonthlySavings', {}).get('value', 0)
    
    if monthly_savings < 50:  # Skip small savings
        continue
    
    instance_id = rec['instanceArn'].split('/')[-1]
    
    candidates.append({
        'instance_id': instance_id,
        'current_type': rec['currentInstanceType'],
        'recommended_type': best['instanceType'],
        'monthly_savings': monthly_savings,
        'risk': best['performanceRisk']
    })

# Sort by savings descending
candidates.sort(key=lambda x: x['monthly_savings'], reverse=True)

total_savings = sum(c['monthly_savings'] for c in candidates)
print(f"Total monthly savings opportunity: ${total_savings:,.2f}")
print(f"Instances to right-size: {len(candidates)}")
print()
for c in candidates[:20]:
    print(f"  {c['instance_id']}: {c['current_type']} → {c['recommended_type']} (${c['monthly_savings']:.2f}/mo)")

Step 5: Validate Before You Downsize

Never downsize production instances based purely on average CPU. Validate:

1. Check peak utilization, not just average

# Get hourly max CPU over 30 days — look for spikes
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Maximum \
  --query 'max_by(Datapoints, &Maximum).Maximum'

If the max CPU over 30 days exceeds 70%, be cautious. A monthly batch job or an end-of-quarter report could spike the instance. Understand the spike pattern before downsizing.
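One way to reason about those spikes programmatically: check whether the hot hours cluster on one or two days (a periodic job you can schedule around) or spread across the month (sustained load). A sketch with illustrative thresholds, not AWS guidance:

```python
# Sketch: classify spike behavior from hourly CPU maxima, shaped like the
# Datapoints the get-metric-statistics call above returns. Thresholds are
# illustrative assumptions, not AWS guidance.
from datetime import datetime

def spike_report(datapoints, threshold=70.0):
    """datapoints: list of {'Timestamp': datetime, 'Maximum': float}."""
    spikes = [d for d in datapoints if d["Maximum"] >= threshold]
    if not spikes:
        return "no spikes above threshold; downsize candidate"
    days = {d["Timestamp"].date() for d in spikes}
    if len(days) <= 2:
        return (f"{len(spikes)} spike hour(s) on {len(days)} day(s); "
                "likely a periodic job, investigate before downsizing")
    return f"sustained high load across {len(days)} days; do not downsize"

quiet = [{"Timestamp": datetime(2026, 3, 1, h), "Maximum": 22.0} for h in range(24)]
print(spike_report(quiet))
```

The interesting case is the middle one: a handful of spike hours on one or two days is usually a batch job, and the right answer may be "move the job", not "keep the big instance".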

2. Check memory utilization (requires CloudWatch agent)

# Memory utilization (requires cloudwatch-agent installed on instance)
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Average Maximum \
  --output table

3. Check network and disk I/O

Some instances are bottlenecked on network or disk, not CPU. Check NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes metrics before downsizing I/O-heavy workloads.

# Network throughput (to verify you're not hitting instance bandwidth limits)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Maximum \
  --query 'max_by(Datapoints, &Maximum).Maximum'
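One gotcha when reading the result: NetworkOut is bytes accumulated per sample interval, not a rate, so convert before comparing against the instance's bandwidth rating. A sketch, assuming 5-minute basic-monitoring samples (use 60 seconds if detailed monitoring is enabled):

```python
# Sketch: convert a CloudWatch NetworkOut datapoint (bytes per sample
# interval) into gigabits per second. The 300 s interval is an assumption
# for basic monitoring; detailed monitoring samples every 60 s.
def network_gbps(bytes_per_sample: float, sample_seconds: int = 300) -> float:
    """Gigabits per second for one NetworkOut datapoint."""
    return bytes_per_sample * 8 / sample_seconds / 1e9

# A 7.5 GB maximum over a 5-minute sample is only 0.2 Gbps, well under
# the up-to-12.5 Gbps rating of an m7i.xlarge-class instance.
print(network_gbps(7.5e9))
```

If the converted peak sits anywhere near the smaller instance's bandwidth rating, treat the workload as network-bound and leave it alone.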

Step 6: Execute the Downsize

For stateless instances (auto-scaling groups):

# Update the launch template with the new instance type
aws ec2 create-launch-template-version \
  --launch-template-id lt-1234567890abcdef0 \
  --source-version '$Latest' \
  --launch-template-data '{"InstanceType":"m7i.xlarge"}'

# Update the ASG to use the new launch template version
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template "LaunchTemplateId=lt-1234567890abcdef0,Version=\$Latest"

# Instance refresh: gradually replace instances (respects min healthy %)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-asg \
  --preferences '{
    "MinHealthyPercentage": 90,
    "InstanceWarmup": 300,
    "CheckpointPercentages": [20, 50, 100],
    "CheckpointDelay": 600
  }'

The instance refresh approach is safest: it replaces 10% of instances at a time, waits for health checks to pass, and can be cancelled mid-stream if something goes wrong.
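The arithmetic behind that batch size is worth seeing once. A rough sketch (the real instance-refresh scheduler is more nuanced than this):

```python
# Rough sketch of how MinHealthyPercentage bounds each replacement batch.
# This is my simplification of the arithmetic, not the actual ASG algorithm.
import math

def max_batch(desired_capacity: int, min_healthy_pct: int) -> int:
    """Most instances that can be out of service at once."""
    min_healthy = math.ceil(desired_capacity * min_healthy_pct / 100)
    return max(desired_capacity - min_healthy, 1)

print(max_batch(20, 90))  # 2 instances per batch
print(max_batch(5, 90))   # small ASGs still replace 1 at a time
```

For small ASGs, 90% minimum healthy effectively means one instance at a time, which is why small groups take surprisingly long to refresh.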

For standalone instances (stop/resize):

# Stop the instance
aws ec2 stop-instances --instance-ids i-1234567890abcdef0

# Wait for it to stop
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0

# Modify the instance type
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --instance-type '{"Value": "m7i.xlarge"}'

# Start it back up
aws ec2 start-instances --instance-ids i-1234567890abcdef0

# Wait for it to be running
aws ec2 wait instance-running --instance-ids i-1234567890abcdef0

Note on EBS-backed vs instance store: The resize process above works for EBS-backed instances. Instance store instances cannot be resized — you must launch a new instance and migrate data.


Step 7: Monitor After Downsize

Watch the instance for at least 48–72 hours after downsize. Set CloudWatch alarms before you resize, so you have an automatic rollback trigger:

# Set a CPU alarm to alert (or auto-scale) if utilization spikes
aws cloudwatch put-metric-alarm \
  --alarm-name "ec2-cpu-high-after-resize-i-1234567890abcdef0" \
  --alarm-description "CPU spike after right-sizing" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 85 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --statistic Average \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts"

The Graviton Migration: Double the Savings

Right-sizing within the same instance family captures maybe 30–40% savings. Combining right-sizing with a Graviton migration can double that.

AWS Graviton4 (Arm-based) delivers 20–40% better price-performance than equivalent Intel/AMD instances. For many workloads (Go, Java, Python, Node.js, containers), the migration is straightforward.

m7i.2xlarge (x86): 8 vCPU, 32 GB, $0.4032/hr → $294.34/mo
m8g.xlarge (Graviton4): 4 vCPU, 16 GB, $0.1728/hr → $126.14/mo

That's a 57% cost reduction, but be clear about what it combines: a one-step downsize (2xlarge → xlarge) plus the x86 → Graviton price cut. It only works when utilization data shows the workload fits in half the vCPUs and RAM; Graviton4's per-core performance gains help close the gap, but they don't double capacity on their own.
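When presenting savings, it helps to attribute each step's contribution separately. A sketch using the on-demand prices from the tables above:

```python
# Sketch: decompose the combined saving into its two moves, using the
# on-demand rates from the pricing tables earlier in this guide.
def combined_reduction(old_hourly: float, new_hourly: float) -> float:
    """Percent cost reduction going from old_hourly to new_hourly."""
    return round((1 - new_hourly / old_hourly) * 100, 1)

print(combined_reduction(0.4032, 0.2016))  # downsize only (2xlarge -> xlarge): 50.0
print(combined_reduction(0.2016, 0.1728))  # Graviton switch only: 14.3
print(combined_reduction(0.4032, 0.1728))  # both combined: 57.1
```

Attributing the split matters organizationally: the 50% is a right-sizing win the owning team must sign off on; the 14% is a migration win the platform team can drive.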

Compatibility check:

# Check if your AMI has an arm64 version
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=amzn2-ami-hvm-*" "Name=architecture,Values=arm64" \
  --query 'sort_by(Images, &CreationDate)[-1].{ID:ImageId,Name:Name,Arch:Architecture}'

# For containers: check if your images support ARM
docker manifest inspect your-image:latest | jq '.manifests[].platform'

Most popular base images (Amazon Linux 2, Ubuntu, Debian) have native arm64 builds. Most major language runtimes (OpenJDK 11+, Python 3.6+, Node.js 14+, Go 1.14+) compile natively on Arm.

The main gotcha: native extensions compiled for x86 won't run on Arm. Check your dependencies before migrating.


Automating Ongoing Right-Sizing

One-time right-sizing is not enough — instances drift back to oversized over time as load changes and engineers provision new capacity.

Scheduled Compute Optimizer reports:

# Export recommendations to S3 weekly for tracking
aws compute-optimizer export-ec2-instance-recommendations \
  --s3-destination-config bucket=my-finops-bucket,keyPrefix=compute-optimizer/ec2/ \
  --file-format Csv \
  --include-member-accounts

# This creates a CSV with all recommendations across your org

Integrate with Cost Explorer anomaly detection:

# Set up a cost anomaly monitor for EC2 spend
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "EC2SpendMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Add an alert subscription
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "EC2AnomalyAlert",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/..."],
    "Subscribers": [{"Address": "finops@company.com", "Type": "EMAIL"}],
    "Threshold": 20,
    "Frequency": "DAILY"
  }'

EC2 Right-Sizing Savings Benchmarks

Based on real engagements I've been part of or read about:

Fleet Size           Typical Savings Opportunity   Timeline
50–100 instances     25–35% of EC2 spend           2–4 weeks
100–500 instances    30–40% of EC2 spend           4–8 weeks
500+ instances       20–35% of EC2 spend           8–16 weeks

The larger the fleet, the more organizational friction slows execution. The technical work is straightforward; getting teams to accept downsizes of "their" instances is where most projects stall.

The organizational unlock: tie right-sizing to a team-level showback dashboard. When teams see their own EC2 spend with per-instance breakdowns, they become motivated to right-size rather than waiting for a central FinOps team to force it.


Where Right-Sizing Fits in the Sequence

Right-sizing EC2 is often step 2 in a broader compute optimization program:

  1. Step 1: Move interruptible workloads to Spot (50–70% discount, no commitment required)
  2. Step 2: Right-size (this guide)
  3. Step 3: Commit to Savings Plans or Reserved Instances on the right-sized baseline (an additional 30–60% on committed spend); see our guide on savings plans vs reserved instances for commitment-based discounts
  4. Step 4: Migrate compatible workloads to Graviton (an additional 10–20%)

Do them in order. Committing to Reserved Instances on an oversized fleet locks in waste for 1–3 years.
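A quick hypothetical shows why the order matters. All numbers here are illustrative assumptions: a $100K/month fleet, a 30% right-sizing opportunity, a 40% Savings Plan discount:

```python
# Illustrative only: why sequencing matters. Numbers are hypothetical
# assumptions, not figures from any real engagement.
SPEND = 100_000        # monthly EC2 bill ($)
RIGHTSIZE_CUT = 0.30   # right-sizing opportunity
SP_DISCOUNT = 0.40     # Savings Plan discount on committed spend

# Wrong order: commit first, locking the commitment to the oversized baseline
commit_first = round(SPEND * (1 - SP_DISCOUNT))

# Right order: right-size first, then commit on the smaller baseline
rightsize_first = round(SPEND * (1 - RIGHTSIZE_CUT) * (1 - SP_DISCOUNT))

print(commit_first)      # 60000 per month
print(rightsize_first)   # 42000 per month
```

In this hypothetical, committing first leaves $18K/month on the table for the full 1–3 year commitment term, which is the waste the ordering rule exists to avoid.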

Use the EC2 Pricing Calculator to model the cost difference between your current configuration and the right-sized alternative, including the impact of Savings Plans on top.


Summary

Right-sizing EC2 is the highest-leverage cost optimization available for most AWS accounts because:

  1. The data is free (Compute Optimizer, CloudWatch)
  2. The tooling is built in (no third-party required)
  3. The savings are immediate (no commitment, no upfront cost)
  4. The process is repeatable

The three-step process that works at any scale:

  1. Inventory + Compute Optimizer: 2 hours to pull all recommendations
  2. Prioritize and validate: 1 week to verify top candidates
  3. Execute in waves: Dev/staging first, production second, critical third

Most teams can capture 25–40% EC2 savings in under 4 weeks with this process. For a $100K/month EC2 bill, that's $25–40K/month — enough to justify making right-sizing a quarterly ritual.
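That closing estimate is easy to parameterize for your own bill. A sketch (the 25–40% band is the benchmark range from this guide, not a guarantee):

```python
# Sketch: project the savings range for a given monthly EC2 bill, using
# the 25-40% benchmark band from this guide as default assumptions.
def savings_range(monthly_ec2_spend: float, low=0.25, high=0.40):
    """Return (low, high) projected monthly savings in dollars."""
    return (round(monthly_ec2_spend * low), round(monthly_ec2_spend * high))

lo, hi = savings_range(100_000)
print(f"${lo:,}–${hi:,}/month")  # the $25–40K figure above
```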