Learn DevOps: Cost Anomaly Detection — AWS Cost Anomaly Detection, Azure Alerts, ML-Based Detection

DodaTech Updated Jun 20, 2026 9 min read

Cloud cost anomalies — unexpected spikes in spending — can double your monthly bill before you notice. This guide covers native anomaly detection tools across AWS, Azure, and GCP, plus custom ML-based approaches and automated remediation workflows.

What You’ll Learn

You’ll configure AWS Cost Anomaly Detection, Azure budget alerts, and GCP cost insights; build custom anomaly detection with Python and ML; and implement automated remediation that pauses or scales down anomalous resources.

Why Cost Anomaly Detection Matters

A single misconfigured resource (an idle GPU instance left running, a data transfer spike from a DDoS attack, a forgotten development cluster) can cost thousands of dollars per day. Manual cost monitoring doesn’t scale. Automated anomaly detection surfaces these issues within hours rather than days; billing data itself usually lags by several hours, so minute-level detection is not realistic for cost.

Learning Path

    flowchart LR
  A[Cloud Cost Basics] --> B[Cost Anomaly Detection<br/>You are here]
  B --> C[Right-Sizing Strategies]
  C --> D[FinOps Practices]
  style B fill:#f90,color:#fff

AWS Cost Anomaly Detection

AWS’s native anomaly detection uses machine learning to establish spending baselines and detect deviations:

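# Create a monitor that tracks spend per AWS service
# (the monitor name is a placeholder)
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName": "service-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}'

# List all monitors
aws ce get-anomaly-monitors

# Get anomalies for a date range
aws ce get-anomalies \
  --date-interval StartDate=2026-06-01,EndDate=2026-06-20 \
  --monitor-arn "arn:aws:ce::MONITOR_ARN"

# Give feedback on a detected anomaly (YES, NO, or PLANNED_ACTIVITY),
# for example a confirmed spike from a marketing campaign
aws ce provide-anomaly-feedback --anomaly-id "abc123" \
  --feedback "YES"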
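The same query can be run from Python. The following is a minimal sketch, assuming a boto3 Cost Explorer client (`boto3.client('ce')`) is passed in; the helper name `list_anomalies` is hypothetical. It follows `NextPageToken` so that longer date ranges are not silently truncated:

```python
def list_anomalies(ce_client, start_date, end_date, monitor_arn=None):
    """Collect all anomalies between two YYYY-MM-DD dates, following pagination."""
    kwargs = {"DateInterval": {"StartDate": start_date, "EndDate": end_date}}
    if monitor_arn:
        kwargs["MonitorArn"] = monitor_arn

    anomalies = []
    while True:
        page = ce_client.get_anomalies(**kwargs)
        anomalies.extend(page.get("Anomalies", []))
        token = page.get("NextPageToken")
        if not token:
            return anomalies
        # Ask for the next page on the following iteration
        kwargs["NextPageToken"] = token
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub in tests.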

Setting Up via Console

1. Go to AWS Cost Management → Cost Anomaly Detection.
2. Create a monitor. Choose between:
   - AWS services: monitor spend per service
   - Linked account: monitor spend per member account
   - Cost category: monitor spend per cost category value
   - Cost allocation tag: monitor spend per tag value
3. Configure the alert subscription:
   - Threshold: dollar value or percentage (e.g., > $100 or > 50%)
   - Alert frequency: individual alerts, or daily or weekly summaries
4. Choose recipients: an SNS topic for individual alerts (which can forward to email or Slack), or email addresses for summaries.
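The threshold choice above can be read as a simple rule. The sketch below assumes an alert fires when the overspend crosses a dollar limit, a percentage limit, or both; the function and parameter names are illustrative and are not AWS API fields:

```python
def exceeds_threshold(actual, expected, min_dollars=100.0, min_percent=50.0, mode="or"):
    """Return True if the overspend crosses the dollar and/or percent limit."""
    impact = actual - expected
    if impact <= 0:
        return False  # spending at or below the expected value is never an alert
    percent = impact / expected * 100 if expected > 0 else float("inf")
    over_dollars = impact > min_dollars
    over_percent = percent > min_percent
    if mode == "and":
        return over_dollars and over_percent
    return over_dollars or over_percent

print(exceeds_threshold(1300, 1000))              # $300 over, 30% over: dollar rule fires
print(exceeds_threshold(1300, 1000, mode="and"))  # 30% is below the 50% limit
```

Requiring both limits (`mode="and"`) is the stricter choice and tends to suit accounts where small services fluctuate by large percentages.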

{
    "MonitorArn": "arn:aws:ce::MONITOR_ARN",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
}

Alert Example

AWS Cost Anomaly Alert — Service: AmazonEC2
Anomaly Score: 89/100
Estimated Impact: $2,345.67 over 7 days
Current Spend: $4,567.89 (normal: $2,222.22 ± $345.67)
Top Driver: ap-southeast-1 m5.24xlarge instance started 2026-06-18
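The figures in this alert are related by simple arithmetic: the estimated impact is the current spend minus the normal spend. A short sketch using the numbers from the example (the function name and the dictionary keys are illustrative):

```python
def explain_alert(current, normal, band):
    """Break an alert's numbers into impact, expected ceiling, and percent over."""
    impact = current - normal
    return {
        "impact": round(impact, 2),
        "expected_max": round(normal + band, 2),
        "pct_over_normal": round(impact / normal * 100, 1),
    }

print(explain_alert(4567.89, 2222.22, 345.67))
# → {'impact': 2345.67, 'expected_max': 2567.89, 'pct_over_normal': 105.6}
```

The current spend of $4,567.89 sits well above the expected ceiling of $2,567.89, which is why the alert fires.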

Azure Cost Alerts

Azure provides budget-based alerts with anomaly detection:

# Create a monthly budget
# (this command takes no notification settings; add alert thresholds,
# e.g. actual spend at 80% notifying the ops team, in the portal or
# through the Budgets REST API)
az consumption budget create \
  --budget-name "prod-budget" \
  --category cost \
  --amount 50000 \
  --time-grain monthly \
  --start-date 2026-01-01 \
  --end-date 2026-12-31

# List budgets
az consumption budget list

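# Create action group for alerts
# (each --action takes a type, a receiver name, then its target;
# the receiver names below are placeholders)
az monitor action-group create \
  --name "CostAlerts" \
  --resource-group "management" \
  --action email ops-email ops@dodatech.com \
  --action webhook slack-hook https://hooks.slack.com/services/...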

Azure Anomaly Detection Configuration

{
    "properties": {
        "displayName": "Production Anomaly Monitor",
        "timeGrain": "daily",
        "notificationThresholds": [
            {
                "thresholdType": "forecasted",
                "thresholdValue": 120,
                "notificationType": "email"
            }
        ],
        "dimensions": [
            {"name": "ServiceName", "values": ["virtualMachines", "storage"]}
        ]
    }
}
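The difference between an actual and a forecasted threshold is easiest to see with numbers. The sketch below assumes a naive linear run-rate projection for the forecast; Azure uses its own forecasting model, which is not reproduced here, and all names are illustrative:

```python
import calendar
from datetime import date

def budget_alerts(spend_to_date, budget, today, actual_pct=80, forecast_pct=120):
    """Evaluate 'actual' and 'forecasted' thresholds for a monthly budget."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Naive projection: the rest of the month spends at the same daily rate
    projected = spend_to_date / today.day * days_in_month
    return {
        "projected": round(projected, 2),
        "actual_alert": spend_to_date >= budget * actual_pct / 100,
        "forecast_alert": projected >= budget * forecast_pct / 100,
    }

# 10 days into June with $22,000 spent against a $50,000 budget
print(budget_alerts(22000, 50000, date(2026, 6, 10)))
# → {'projected': 66000.0, 'actual_alert': False, 'forecast_alert': True}
```

Here the forecasted threshold fires on day 10, while the actual threshold would stay silent until spend reaches $40,000.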

GCP Cost Insights

GCP provides built-in cost anomaly detection through Recommender:

# List cost insights
gcloud recommender insights list \
  --insight-type=google.billing.CostInsight \
  --billing-account=BILLING_ACCOUNT_ID \
  --location=global

# Describe a specific insight
gcloud recommender insights describe INSIGHT_ID \
  --insight-type=google.billing.CostInsight \
  --billing-account=BILLING_ACCOUNT_ID \
  --location=global

# Set up budget alerts (notify at 50% of the budget)
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Monthly Budget" \
  --budget-amount=50000 \
  --threshold-rule=percent=0.5

GCP Anomaly Insight Example

Insight: Unusual spike in Compute Engine costs
Category: cost
Severity: CRITICAL
Observation Period: 2026-06-13 to 2026-06-20
Observed Cost: $12,345 (baseline: $5,432)
Suggested Action: Review recently created VM instances in us-west1-b

Custom ML-Based Detection

For multi-cloud or custom requirements, build your own anomaly detection:

Python Anomaly Detection with Prophet

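import pandas as pd
from prophet import Prophet
from datetime import datetime, timedelta
import boto3
import json

class CostAnomalyDetector:
    def __init__(self, sensitivity=0.95):
        # sensitivity sets the width of Prophet's uncertainty interval:
        # a wider interval (e.g. 0.99) flags fewer points as anomalies
        self.sensitivity = sensitivity
        self.model = Prophet(
            interval_width=sensitivity,
            yearly_seasonality=False,  # needs over a year of history; only 90 days are fetched
            weekly_seasonality=True,
            daily_seasonality=False,
            changepoint_prior_scale=0.05,
            seasonality_prior_scale=10.0
        )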

    def fetch_cost_data(self, days=90):
        """Fetch daily cost data from AWS Cost Explorer"""
        client = boto3.client('ce', region_name='us-east-1')
        end = datetime.now()
        start = end - timedelta(days=days)

        # No GroupBy here: grouped responses leave 'Total' empty and
        # report per-service amounts under 'Groups' instead
        response = client.get_cost_and_usage(
            TimePeriod={
                'Start': start.strftime('%Y-%m-%d'),
                'End': end.strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost']
        )

        records = []
        for result in response['ResultsByTime']:
            date = result['TimePeriod']['Start']
            total = float(result['Total']['UnblendedCost']['Amount'])
            records.append({'ds': date, 'y': total})

        return pd.DataFrame(records)

    def detect_anomalies(self, df):
        """Detect anomalies using Prophet forecast intervals"""
        self.model.fit(df)

        future = self.model.make_future_dataframe(periods=7)
        forecast = self.model.predict(future)

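        # Merge actual with forecast. Prophet returns 'ds' as datetimes,
        # so format them as YYYY-MM-DD strings to match the input frame
        bands = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
        bands['ds'] = bands['ds'].dt.strftime('%Y-%m-%d')
        merged = df.merge(bands, on='ds', how='left')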

        # Flag anomalies
        merged['anomaly'] = (
            (merged['y'] > merged['yhat_upper']) |
            (merged['y'] < merged['yhat_lower'])
        )
        merged['deviation'] = abs(
            (merged['y'] - merged['yhat']) / merged['yhat'] * 100
        )

        anomalies = merged[merged['anomaly']].sort_values(
            'deviation', ascending=False
        )
        return anomalies, forecast

    def send_alert(self, anomaly):
        """Send anomaly alert via SNS"""
        sns = boto3.client('sns')
        message = json.dumps({
            'type': 'cost_anomaly',
            'date': str(anomaly['ds']),
            'actual_cost': float(anomaly['y']),
            'expected_cost': float(anomaly['yhat']),
            'deviation_pct': float(anomaly['deviation']),
            'severity': 'HIGH' if anomaly['deviation'] > 50 else 'MEDIUM'
        })
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:cost-alerts',
            Message=message,
            Subject='Cost Anomaly Detected'
        )

# Usage
detector = CostAnomalyDetector()
cost_data = detector.fetch_cost_data()
anomalies, forecast = detector.detect_anomalies(cost_data)

for _, anomaly in anomalies.head(5).iterrows():
    print(f"{anomaly['ds']}: ${anomaly['y']:.2f} "
          f"(expected ${anomaly['yhat']:.2f}, "
          f"deviation: {anomaly['deviation']:.1f}%)")
    detector.send_alert(anomaly)

Example output (illustrative values):

2026-06-18: $4567.89 (expected $2222.22, deviation: 105.6%)
2026-06-19: $3890.12 (expected $2345.67, deviation: 65.8%)
2026-06-15: $567.89 (expected $1234.56, deviation: 54.0%)

Automated Remediation

AWS Lambda Auto-Remediation

import boto3
import json
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """Auto-stop suspect instances when anomaly is severe"""

    anomaly = json.loads(event['Records'][0]['Sns']['Message'])

    # Only auto-remediate severe anomalies
    if anomaly['severity'] != 'HIGH':
        return {'status': 'monitoring_only'}

    # Find running instances that opted in via the AutoStop tag
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:AutoStop', 'Values': ['true']}
        ]
    )

    # LaunchTime is timezone-aware, so compare against an aware "now"
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    stopped = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            # Stop instances launched in the last 24 hours
            if instance['LaunchTime'] > cutoff:
                ec2.stop_instances(InstanceIds=[instance['InstanceId']])
                stopped.append(instance['InstanceId'])

    return {
        'status': 'remediated',
        'stopped_instances': stopped,
        'anomaly_id': anomaly.get('anomaly_id')
    }
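Because stopping instances is disruptive, it can help to separate the selection logic from the stop call so it can be dry-run and tested. A minimal sketch, assuming the input has the shape of an EC2 `describe_instances` response; the helper name `recent_instance_ids` is hypothetical:

```python
from datetime import datetime, timedelta, timezone

def recent_instance_ids(describe_response, now=None, max_age=timedelta(hours=24)):
    """Return IDs of instances launched within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return [
        inst["InstanceId"]
        for res in describe_response.get("Reservations", [])
        for inst in res.get("Instances", [])
        if now - inst["LaunchTime"] < max_age
    ]
```

A handler could log the returned IDs first and only call `stop_instances` once the selection has been reviewed on real traffic.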

Common Anomaly Detection Mistakes

1. Setting Thresholds Too Low

A $10 anomaly alert triggers daily for normal cost fluctuations. Set thresholds based on historical variance — use 2-3 standard deviations from the mean.
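A threshold based on historical variance can be computed with the standard library alone. The sketch below uses made-up daily costs and an illustrative function name:

```python
from statistics import mean, stdev

def variance_threshold(history, k=3.0):
    """Upper alert threshold: mean plus k standard deviations of past daily costs."""
    return mean(history) + k * stdev(history)

daily = [100, 104, 98, 101, 97, 103, 99]   # synthetic daily costs in dollars
limit = variance_threshold(daily, k=3)

print(105 > limit)   # an ordinary fluctuation stays under the limit
print(150 > limit)   # a real spike crosses it
```

With `k=2` the limit tightens and more days are flagged; with `k=3` only larger departures trigger an alert.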

2. Ignoring Seasonality

Cloud costs have natural patterns — higher during business hours, lower on weekends. Anomaly detection must account for daily, weekly, and monthly seasonality.
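One lightweight way to account for weekly seasonality without a forecasting library is to compare each day against the average for the same weekday. A sketch with synthetic data; the function names and the 50% tolerance are assumptions:

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

def weekday_baselines(daily_costs):
    """Map weekday (0 = Monday) to mean cost, from (date, cost) pairs."""
    buckets = defaultdict(list)
    for day, cost in daily_costs:
        buckets[day.weekday()].append(cost)
    return {wd: mean(costs) for wd, costs in buckets.items()}

def is_anomalous(day, cost, baselines, tolerance=0.5):
    """Flag a cost more than `tolerance` (default 50%) above its weekday baseline."""
    return cost > baselines[day.weekday()] * (1 + tolerance)

# Four synthetic weeks: $1,000 on weekdays, $300 on weekends
start = date(2026, 6, 1)  # a Monday
days = [start + timedelta(days=d) for d in range(28)]
history = [(d, 300 if d.weekday() >= 5 else 1000) for d in days]
base = weekday_baselines(history)

print(is_anomalous(date(2026, 6, 29), 900, base))  # Monday: $900 is normal
print(is_anomalous(date(2026, 7, 4), 900, base))   # Saturday: $900 is 3x the usual
```

The same $900 is unremarkable on a Monday and a clear anomaly on a Saturday, which a single flat threshold would miss.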

3. No Automated Remediation

Finding an anomaly is useless if no one acts on it. At minimum, send alerts to Slack/email. For critical anomalies, automate resource pausing or scaling.

4. Single-Cloud Monitoring Only

If you’re multi-cloud, you need a unified anomaly detection system. Use custom ML or a third-party tool (CloudHealth, Vantage) that aggregates across providers.
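A unified system starts with a common record that alerts from every provider are mapped into. The sketch below is one possible shape, with hypothetical field names, populated with the AWS and GCP example figures from earlier sections:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostAnomaly:
    provider: str
    service: str
    actual: float
    expected: float

    @property
    def pct_increase(self) -> float:
        """Overspend relative to the expected cost, in percent."""
        return (self.actual - self.expected) / self.expected * 100

anomalies = [
    CostAnomaly("aws", "AmazonEC2", 4567.89, 2222.22),
    CostAnomaly("gcp", "Compute Engine", 12345.0, 5432.0),
]

# Rank across providers by relative increase, largest first
ranked = sorted(anomalies, key=lambda a: a.pct_increase, reverse=True)
for a in ranked:
    print(f"{a.provider:5} {a.service:15} +{a.pct_increase:.1f}%")
```

Once every provider's alert is in the same shape, ranking, routing, and deduplication only need to be written once.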

5. Not Tagging Resources

Anomaly detection is only as good as your data granularity. Untagged resources appear as “unknown” — you can’t determine the owner or purpose of the anomalous spend.

6. Alert Fatigue

Too many false alarms cause alert fatigue. Tune sensitivity, require an anomaly to persist across consecutive evaluations before alerting (billing data refreshes every few hours at best, so shorter windows add nothing), and implement severity levels.
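Severity levels can be as simple as a function that maps an anomaly to a tier, with each tier routed to a different channel. The cutoffs and routes below are assumptions to adapt, not recommended values:

```python
def severity(deviation_pct, impact_dollars):
    """Map an anomaly to info / warning / critical (cutoffs are illustrative)."""
    if deviation_pct >= 100 or impact_dollars >= 1000:
        return "critical"
    if deviation_pct >= 50 or impact_dollars >= 250:
        return "warning"
    return "info"

# Hypothetical routing: only the top tier interrupts a person
ROUTES = {
    "info": "weekly digest",
    "warning": "team channel",
    "critical": "on-call page",
}

level = severity(deviation_pct=105.6, impact_dollars=2345.67)
print(level, "->", ROUTES[level])
```

Keeping low tiers out of paging channels is what reduces fatigue; the exact cutoffs matter less than having them.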

7. Not Investigating Root Cause

Stopping an anomalous resource doesn’t prevent recurrence. Always investigate root cause: Was it a developer launching an expensive instance? A CI/CD pipeline with unlimited budget? A misconfigured auto-scaling policy?

Practice Questions

1. How does AWS Cost Anomaly Detection establish baselines? It uses ML to analyze 60+ days of historical spend, accounting for seasonality (daily, weekly, monthly patterns). It creates a prediction interval — spending outside this interval is flagged.

2. What’s the difference between actual and forecasted budget thresholds in Azure? Actual threshold alerts when spend reaches a percentage of the budget. Forecasted threshold alerts when the projected end-of-month spend reaches a percentage. Forecasted alerts catch overspend earlier.

3. Why use Prophet or similar ML models for anomaly detection? Prophet handles seasonality, trend changes, and missing data well. It provides uncertainty intervals essential for anomaly detection. It’s also robust to outliers that would confuse simpler statistical methods.

4. How do you prevent alert fatigue in cost anomaly detection? Use tiered severity (info/warning/critical), require sustained anomalies (e.g., 3 consecutive days), tune the sensitivity threshold, and exclude known cost events (planned launches, campaigns).

5. Challenge: Your organization has accounts in AWS, Azure, and GCP. Design an anomaly detection system that works across all three with centralized alerting and automated remediation. Answer: Use a Python service that pulls cost data from all three providers via their APIs daily. Feed into Prophet for baseline modeling. Send detected anomalies to a central SNS/Slack topic. For remediation, use provider-specific webhooks with common response playbooks (stop instances, reduce instance sizes, notify owners via tags).

Mini Project: Multi-Cloud Cost Monitor

Create a unified cost anomaly detector:

#!/usr/bin/env python3
# cost_monitor.py — Multi-cloud cost anomaly detection
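# Requires: boto3, azure-mgmt-consumption, azure-identity, google-cloud-billing
# Note: the _parse_* helpers are not defined here and are left to implement.
# Each should return {'daily_costs': [{'date', 'cost', 'baseline'}, ...]}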

import os
import json
from datetime import datetime, timedelta

class CloudCostMonitor:
    """Fetches cost data from AWS, Azure, and GCP"""

    def fetch_aws_costs(self, days=30):
        """Fetch daily AWS costs"""
        import boto3
        client = boto3.client('ce', region_name='us-east-1')
        end = datetime.now()
        start = end - timedelta(days=days)

        response = client.get_cost_and_usage(
            TimePeriod={
                'Start': start.strftime('%Y-%m-%d'),
                'End': end.strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        return self._parse_ce_response(response)

    def fetch_azure_costs(self, days=30):
        """Fetch daily Azure costs"""
        from azure.mgmt.consumption import ConsumptionManagementClient
        from azure.identity import DefaultAzureCredential

        credential = DefaultAzureCredential()
        client = ConsumptionManagementClient(
            credential, os.environ['AZURE_SUBSCRIPTION_ID']
        )
        scope = f"/subscriptions/{os.environ['AZURE_SUBSCRIPTION_ID']}"
        start = datetime.now() - timedelta(days=days)

        # Format the date explicitly; a raw datetime would add a time component
        usage = client.usage_details.list(
            scope,
            filter=f"properties/usageStart ge '{start:%Y-%m-%d}'",
            expand='properties/additionalProperties'
        )
        return self._parse_azure_usage(usage)

    def fetch_gcp_costs(self, days=30):
        """Fetch daily GCP costs"""
        from google.cloud import billing

        client = billing.CloudBillingClient()
        project = f"projects/{os.environ['GCP_PROJECT_ID']}"

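        # Caveat: this call returns the project's billing account link,
        # not cost amounts. Daily GCP costs usually come from the Cloud
        # Billing export to BigQuery, which _parse_gcp_billing would
        # need to query.
        response = client.get_project_billing_info(
            request={"name": project}
        )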
        return self._parse_gcp_billing(response)

    def analyze_and_alert(self):
        """Analyze costs across all clouds and send alerts"""
        results = {
            'timestamp': datetime.now().isoformat(),
            'anomalies': [],
            'total_spend': {}
        }

        for provider, fetcher in [
            ('aws', self.fetch_aws_costs),
            ('azure', self.fetch_azure_costs),
            ('gcp', self.fetch_gcp_costs)
        ]:
            try:
                data = fetcher()
                results['total_spend'][provider] = sum(
                    r['cost'] for r in data['daily_costs'][-7:]
                )
                # Simple threshold check
                for day in data['daily_costs'][-3:]:
                    if day['cost'] > day['baseline'] * 1.5:
                        results['anomalies'].append({
                            'provider': provider,
                            'date': day['date'],
                            'cost': day['cost'],
                            'baseline': day['baseline'],
                            'pct_increase': (
                                (day['cost'] - day['baseline'])
                                / day['baseline'] * 100
                            )
                        })
            except Exception as e:
                print(f"Error fetching {provider}: {e}")

        return results

if __name__ == '__main__':
    monitor = CloudCostMonitor()
    report = monitor.analyze_and_alert()
    print(json.dumps(report, indent=2))

    if report['anomalies']:
        print(f"\n⚠ {len(report['anomalies'])} anomalies detected!")
    else:
        print("\n✓ No anomalies detected.")
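The class above calls `_parse_ce_response` but does not define it. One possible implementation is sketched below as a standalone function. It assumes the grouped Cost Explorer response shape and uses the mean of the preceding days as the baseline, which is a simplification rather than the author's method:

```python
from statistics import mean

def parse_ce_response(response, window=7):
    """Turn a grouped Cost Explorer response into the daily_costs shape."""
    daily = []
    for result in response["ResultsByTime"]:
        # With GroupBy set, per-service amounts are listed under 'Groups'
        cost = sum(
            float(group["Metrics"]["UnblendedCost"]["Amount"])
            for group in result.get("Groups", [])
        )
        daily.append({"date": result["TimePeriod"]["Start"], "cost": cost})

    for i, day in enumerate(daily):
        prior = [d["cost"] for d in daily[max(0, i - window):i]]
        # Baseline is the mean of up to `window` preceding days;
        # the first day has no history and falls back to its own cost
        day["baseline"] = mean(prior) if prior else day["cost"]
    return {"daily_costs": daily}
```

To use it in the class, it could be attached as a method that delegates to this function. Note that a baseline of zero would still cause a division error in the percentage calculation, so days with no prior spend may need a guard.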

FAQ

How fast should anomaly detection be?

For cost anomalies, hourly evaluation is usually sufficient. Daily evaluation misses spikes that could run for 24+ hours. Real-time evaluation is rarely needed for cost — costs change slowly compared to operational metrics.

What’s a good anomaly threshold?

Start with 50% above baseline for 2+ consecutive days. Adjust based on your cost volatility — stable workloads can use 20%, variable workloads may need 100%. Monitor false positive rate.
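The "50% above baseline for 2+ consecutive days" rule can be sketched as a run-length check. The function name and the flat baseline are simplifying assumptions:

```python
def sustained_anomalies(costs, baseline, pct=50, min_days=2):
    """Return (start_index, length) for runs where cost stays pct% above baseline."""
    limit = baseline * (1 + pct / 100)
    runs, start = [], None
    # The trailing sentinel closes a run that lasts until the final day
    for i, cost in enumerate(list(costs) + [float("-inf")]):
        if cost > limit:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_days:
                runs.append((start, i - start))
            start = None
    return runs

print(sustained_anomalies([100, 180, 100, 160, 170, 175, 100], baseline=100))
# → [(3, 3)]
```

The single-day spike at index 1 is ignored, while the three-day run starting at index 3 is reported.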

Can anomaly detection work for Kubernetes?

Yes — with Kubecost or open-source tools (OpenCost, KubeBuddy). These provide pod-level cost allocation, right-sizing recommendations, and anomaly detection based on cluster resource usage.

How do I handle false positives?

Tag known cost events (marketing campaigns, product launches) in your cost management tool. Set exclusion windows for planned events. Tune the model by providing feedback (mark as “not anomaly”) through the API.
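Exclusion windows for planned events can be applied as a filter before alerts are sent. A minimal sketch, assuming anomalies are dictionaries with a `date` field and windows are inclusive `(start, end)` date pairs; both shapes are illustrative:

```python
from datetime import date

def outside_planned_windows(anomalies, windows):
    """Drop anomalies whose date falls inside any inclusive (start, end) window."""
    return [
        a for a in anomalies
        if not any(start <= a["date"] <= end for start, end in windows)
    ]

planned = [(date(2026, 6, 15), date(2026, 6, 19))]   # e.g. a marketing campaign
found = [
    {"date": date(2026, 6, 18), "cost": 4567.89},
    {"date": date(2026, 6, 25), "cost": 3890.12},
]
remaining = outside_planned_windows(found, planned)
print([a["date"].isoformat() for a in remaining])
# → ['2026-06-25']
```

Suppressed anomalies are still worth logging, so that a planned event that overshoots its expected cost can be reviewed afterwards.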

Do I need ML for anomaly detection?

No — simple threshold-based rules catch 80% of cost anomalies. Use ML for the remaining 20% where patterns are subtle or seasonal. Start with thresholds, add ML as your cost data matures.

What about third-party tools?

CloudHealth, Vantage, CloudCheckr, and Apptio provide cross-cloud anomaly detection with pre-built dashboards. They’re worth evaluating if you lack engineering time to build custom solutions.

What’s Next

Right-Sizing Strategies

Cloud FinOps Practices

Tagging & Labeling Strategy

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. Updated 2026-06-20.
