Learn Cloud Disaster Recovery: Backup, Failover, and Business Continuity

DodaTech Updated Jun 20, 2026 7 min read

Cloud disaster recovery is the strategy and practice of restoring IT infrastructure and data after a catastrophic event — using backup, replication, failover, and automated recovery to meet business continuity requirements.

What You’ll Learn

DR strategies: backup & restore, pilot light, warm standby, multi-site
Defining RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
Cross-region replication for databases and storage
Automating failover testing with Terraform and AWS tools

Why Disaster Recovery Matters

A single region outage can take your entire service offline. In 2023, an AWS US-East-1 outage affected thousands of companies for hours — those without cross-region DR lost customers, revenue, and trust. DR isn’t just for enterprise; any business that relies on cloud infrastructure needs a plan. DodaTech maintains a warm standby for Durga Antivirus Pro’s signature distribution service, replicating data across two AWS regions and automating failover with Route 53 health checks.

    flowchart LR
    A[Cloud Fundamentals] --> B[Disaster Recovery]
    B --> C[Backup & Restore]
    B --> D[Pilot Light]
    B --> E[Warm Standby]
    B --> F[Multi-Site]
    C --> G[RPO: Hours, RTO: Hours]
    D --> I[RPO: Minutes, RTO: Hours]
    E --> H[RPO: Seconds, RTO: Minutes]
    F --> J[RPO: Seconds, RTO: Seconds]
    style B fill:#e53e3e,color:#fff

Prerequisites: Familiarity with cloud computing and AWS services. Understanding of Terraform for infrastructure automation.

RPO and RTO

Two metrics define DR requirements:

RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time)
RTO (Recovery Time Objective): Maximum acceptable downtime
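
To make the two definitions concrete, the sketch below derives the achieved data loss and downtime from three timestamps of a made-up incident and compares them with the targets. The `Incident` class, the `meets_objectives` helper, and all timestamps are illustrative assumptions, not part of any AWS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_recoverable_point: datetime  # newest backup or replicated write that survived
    outage_start: datetime            # moment the disaster hit
    service_restored: datetime        # moment users could use the service again

    @property
    def data_loss(self) -> timedelta:
        """Achieved recovery point: work done after the last recoverable point is lost."""
        return self.outage_start - self.last_recoverable_point

    @property
    def downtime(self) -> timedelta:
        """Achieved recovery time: how long the service was unavailable."""
        return self.service_restored - self.outage_start


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True when both the data loss and the downtime stay within their targets."""
    return incident.data_loss <= rpo and incident.downtime <= rto


# Hypothetical incident: last backup at 11:45, outage at 12:00, recovery at 12:40.
incident = Incident(
    last_recoverable_point=datetime(2026, 1, 10, 11, 45),
    outage_start=datetime(2026, 1, 10, 12, 0),
    service_restored=datetime(2026, 1, 10, 12, 40),
)
print(incident.data_loss)   # 0:15:00
print(incident.downtime)    # 0:40:00
print(meets_objectives(incident, rpo=timedelta(minutes=30), rto=timedelta(hours=1)))  # True
```

With a 10-minute RPO the same incident would fail the check, because 15 minutes of writes were lost.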

DR Strategy	Typical RPO	Typical RTO	Cost
Backup & Restore	Hours	Hours	Low
Pilot Light	Minutes	Hours	Medium
Warm Standby	Seconds	Minutes	High
Multi-Site	Seconds	Seconds	Very High
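
One way to read the table is as a lookup: take the cheapest row whose typical RPO and RTO are at least as tight as what the business requires. The sketch below encodes the table's coarse magnitudes as ordinal ranks; the ranking scheme and the `cheapest_strategy` function are assumptions made for illustration, and real planning would use measured numbers rather than these buckets.

```python
# Ordinal scale for the coarse magnitudes used in the table.
MAGNITUDE = {"seconds": 0, "minutes": 1, "hours": 2}

# Rows copied from the table, ordered from cheapest to most expensive.
STRATEGIES = [
    ("Backup & Restore", "hours", "hours"),
    ("Pilot Light", "minutes", "hours"),
    ("Warm Standby", "seconds", "minutes"),
    ("Multi-Site", "seconds", "seconds"),
]


def cheapest_strategy(required_rpo: str, required_rto: str) -> str:
    """Return the first (cheapest) strategy whose typical RPO and RTO are
    at least as tight as the required magnitudes."""
    for name, rpo, rto in STRATEGIES:
        if (MAGNITUDE[rpo] <= MAGNITUDE[required_rpo]
                and MAGNITUDE[rto] <= MAGNITUDE[required_rto]):
            return name
    raise ValueError("no strategy in the table meets these targets")


print(cheapest_strategy("hours", "hours"))      # Backup & Restore
print(cheapest_strategy("minutes", "hours"))    # Pilot Light
print(cheapest_strategy("minutes", "minutes"))  # Warm Standby
print(cheapest_strategy("seconds", "seconds"))  # Multi-Site
```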

DR Strategies

Backup & Restore

The simplest and cheapest DR — regularly back up data and restore in a new region when needed.

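#!/bin/bash
# Automated S3 and RDS cross-region backup script
set -euo pipefail

SOURCE_BUCKET="dodatech-prod-us-east-1"
DEST_BUCKET="dodatech-dr-us-west-2"
DATE=$(date +%Y-%m-%d)
# Cross-region snapshot copies must reference the source snapshot by ARN
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Copy data to a dated prefix in the DR region.
# Each run writes a full copy under a new prefix; sync to a fixed prefix
# instead if you want incremental transfers.
aws s3 sync "s3://$SOURCE_BUCKET" "s3://$DEST_BUCKET/$DATE" \
  --source-region us-east-1 \
  --region us-west-2

# Copy today's RDS snapshot to the DR region
# (assumes prod-db-snapshot-$DATE already exists in us-east-1)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:us-east-1:${ACCOUNT_ID}:snapshot:prod-db-snapshot-${DATE}" \
  --target-db-snapshot-identifier "prod-db-snapshot-${DATE}-dr" \
  --source-region us-east-1 \
  --region us-west-2

echo "Backup complete: $(date)"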
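
A backup job that silently stops running is only noticed when a restore is needed. As a rough sketch, the check below parses the date suffix used by the snapshot names in the script and reports whether the newest copy is recent enough. The function names and the sample IDs are hypothetical; in practice the list of IDs would come from the AWS API.

```python
from datetime import date, timedelta
from typing import List, Optional

PREFIX = "prod-db-snapshot-"  # naming scheme used by the backup script


def newest_backup_date(snapshot_ids: List[str]) -> Optional[date]:
    """Parse the YYYY-MM-DD part of each snapshot ID and return the latest date."""
    dates = []
    for snapshot_id in snapshot_ids:
        if not snapshot_id.startswith(PREFIX):
            continue
        stamp = snapshot_id[len(PREFIX):]
        # The DR copy carries a "-dr" suffix; strip it before parsing.
        if stamp.endswith("-dr"):
            stamp = stamp[:-3]
        try:
            dates.append(date.fromisoformat(stamp))
        except ValueError:
            continue  # ignore IDs that do not follow the naming scheme
    return max(dates, default=None)


def backup_is_fresh(snapshot_ids: List[str], today: date, max_age: timedelta) -> bool:
    """True when the newest dated snapshot is no older than max_age."""
    newest = newest_backup_date(snapshot_ids)
    return newest is not None and today - newest <= max_age


# Hypothetical snapshot IDs found in the DR region.
ids = ["prod-db-snapshot-2026-06-18-dr", "prod-db-snapshot-2026-06-19-dr", "unrelated-snapshot"]
print(newest_backup_date(ids))                                                   # 2026-06-19
print(backup_is_fresh(ids, today=date(2026, 6, 20), max_age=timedelta(days=1)))  # True
```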

Pilot Light

A minimal copy of core infrastructure runs in the DR region. Scale up when failover is needed.

# pilot-light-dr.tf
# Minimal DR infrastructure; scale up on failover
resource "aws_db_instance" "pilot_db" {
  provider = aws.west
  # Cross-region replicas must reference the source database by ARN.
  # aws_db_instance.prod is assumed to be defined elsewhere in the config.
  replicate_source_db = aws_db_instance.prod.arn
  identifier          = "prod-db-dr"
  instance_class      = "db.r5.large"  # Replica of prod database
}

resource "aws_instance" "pilot_app" {
  provider      = aws.west
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"  # Minimal; scale on failover
  user_data     = <<-EOF
    #!/bin/bash
    # Pre-warm against the read replica of the production database
    # On failover: scale up and switch to the promoted primary
  EOF
}

Warm Standby

A scaled-down but fully functional copy of production runs in the DR region:

# warm-standby.tf
# Sketch only: a real Auto Scaling group also needs a launch_template and
# vpc_zone_identifier, omitted here for brevity.
resource "aws_autoscaling_group" "app_dr" {
  provider         = aws.west
  name             = "app-warm-standby"
  min_size         = 2   # Running and ready
  max_size         = 20
  desired_capacity = 2   # Scaled-down, ready to burst
}

resource "aws_rds_cluster_instance" "db_dr" {
  provider           = aws.west
  count              = 1  # Single reader in DR, scale on failover
  cluster_identifier = aws_rds_cluster.prod_dr.id
  instance_class     = "db.r5.large"
  engine             = "aurora-postgresql"  # Must match the cluster's engine
}
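
On failover, the scaled-down standby has to grow to carry production traffic. A minimal way to think about the target size, assuming the group bounds shown in the configuration (minimum 2, maximum 20), is to match the primary fleet and clamp the result to those bounds. The `failover_capacity` function is a hypothetical helper, not an AWS API.

```python
def failover_capacity(primary_instances: int, min_size: int = 2, max_size: int = 20) -> int:
    """Desired DR capacity after failover: match the primary fleet size,
    clamped to the Auto Scaling group's min and max bounds."""
    return max(min_size, min(primary_instances, max_size))


print(failover_capacity(12))  # 12
print(failover_capacity(1))   # 2  (never below the warm-standby minimum)
print(failover_capacity(50))  # 20 (capped by max_size; raise the cap if this happens)
```

If the result hits `max_size`, the standby cannot absorb full production load, which is worth finding out during a test rather than during an outage.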

Multi-Site (Active-Active)

Production runs simultaneously in two or more regions:

# Route 53: active-active DNS routing
# Latency-based records send each user to the closest healthy region.
# (PRIMARY/SECONDARY failover routing would be active-passive instead.)
resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"

  alias {
    name                   = aws_lb.app_us_east.dns_name
    zone_id                = aws_lb.app_us_east.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-east-1"
  }

  set_identifier = "us-east-1"
}

resource "aws_route53_record" "app_dr" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"

  alias {
    name                   = aws_lb.app_us_west.dns_name
    zone_id                = aws_lb.app_us_west.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-west-2"
  }

  set_identifier = "us-west-2"
}
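
In an active-active setup every region serves traffic, and a user is normally sent to the closest region that is still healthy. The sketch below models that routing decision in plain Python. It is a simplified illustration, not how Route 53 is implemented, and the latency figures are made up.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency for this client,
    or None when every region is unhealthy (a simplification of real DNS behavior)."""
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


# Hypothetical latencies measured from one client location.
latency = {"us-east-1": 20.0, "us-west-2": 80.0}

print(pick_region(latency, {"us-east-1": True, "us-west-2": True}))    # us-east-1
print(pick_region(latency, {"us-east-1": False, "us-west-2": True}))   # us-west-2
print(pick_region(latency, {"us-east-1": False, "us-west-2": False}))  # None
```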

Database Replication

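# RDS cross-region read replica
# A source in another region must be referenced by ARN, not by identifier
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr \
  --source-db-instance-identifier "arn:aws:rds:us-east-1:${ACCOUNT_ID}:db:prod-db" \
  --source-region us-east-1 \
  --region us-west-2 \
  --db-instance-class db.r5.large

# Verify replication lag
# ReplicaLag is a CloudWatch metric (in seconds); describe-db-instances
# does not report it. The date calls below use GNU date syntax.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db-dr \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average \
  --region us-west-2 \
  --query 'Datapoints[].Average'

# Output: a JSON list with one average per minute, in seconds.
# Example entry (illustrative): 0.25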
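
For an asynchronous replica, the lag at the moment of failure approximates how much data would be lost, so lag readings can be compared directly with the RPO. The following is a minimal sketch of that comparison. The sample values are hypothetical, and `lag_within_rpo` is an illustrative helper rather than part of any AWS SDK.

```python
from typing import List


def lag_within_rpo(lag_samples_s: List[float], rpo_s: float) -> bool:
    """True when the worst observed replication lag fits inside the RPO.

    An empty sample list counts as a failure: with no data, replication
    health is unknown and should not be reported as healthy."""
    if not lag_samples_s:
        return False
    return max(lag_samples_s) <= rpo_s


samples = [0.25, 0.4, 1.8]   # hypothetical lag readings, in seconds
print(lag_within_rpo(samples, rpo_s=30))   # True
print(lag_within_rpo(samples, rpo_s=1))    # False
print(lag_within_rpo([], rpo_s=30))        # False
```

Using the maximum rather than the average is a deliberate choice here: a disaster can strike during a lag spike, so the worst case is what bounds data loss.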

Failover Testing

Automate failover testing to ensure DR actually works:

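# dr_test_runner.py
import subprocess
import sys


class DRTest:
    """Runs each failover step as a shell command and records the outcome.

    A zero exit code only shows that a command ran; parse its output if a
    step must assert a specific state (for example, a failing health check).
    """

    def __init__(self, primary_region, dr_region):
        self.primary = primary_region
        self.dr = dr_region
        self.steps = []

    def run_step(self, name, command):
        print(f"  → {name}...", end=" ", flush=True)
        # shell=True is tolerable only because commands are fixed strings
        # defined in this file, never user input
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        success = result.returncode == 0
        # Fall back to stderr so failed steps leave a diagnosable trace
        output = (result.stdout or result.stderr)[:200]
        self.steps.append({"name": name, "success": success, "output": output})
        print("PASS" if success else "FAIL")
        return success

    def simulate_failure(self):
        print("\n=== DR Failover Test ===")
        self.run_step("Block primary region traffic",
                      "aws ec2 revoke-security-group-ingress ...")
        self.run_step("Verify health check fails",
                      "aws route53 get-health-check-status --health-check-id ...")
        self.run_step("Promote DR database to primary",
                      "aws rds promote-read-replica --db-instance-identifier prod-db-dr "
                      f"--region {self.dr}")
        self.run_step("Verify DR application health",
                      "curl -f https://dr.app.dodatech.com/health")
        self.run_step("Verify DNS failover",
                      "dig +short app.dodatech.com @8.8.8.8")

    def all_passed(self):
        return all(step["success"] for step in self.steps)


if __name__ == "__main__":
    test = DRTest("us-east-1", "us-west-2")
    test.simulate_failure()
    print(f"\n{'=' * 40}")
    print("All tests passed!" if test.all_passed() else "Some tests failed!")
    # Non-zero exit lets a CI/CD pipeline fail the job when any step fails
    sys.exit(0 if test.all_passed() else 1)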
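
A failover test is also the natural place to measure the recovery time actually achieved. One possible approach, sketched below, is to poll a health probe until it passes and record the elapsed time. The `wait_until_healthy` helper is hypothetical; the clock and sleep functions are injectable so the demo runs instantly with a fake clock, and in a real test `check` would call the health endpoint.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll check() until it returns True.

    Returns the seconds elapsed when the check first passed, or None if
    timeout_s went by without a passing check."""
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Demo with a fake clock so the example finishes immediately.
now = [0.0]


def fake_clock() -> float:
    return now[0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


def healthy_after_40s() -> bool:
    return now[0] >= 40.0   # stand-in for a real health probe


elapsed = wait_until_healthy(healthy_after_40s, timeout_s=300, interval_s=10,
                             clock=fake_clock, sleep=fake_sleep)
print(elapsed)          # 40.0
print(elapsed <= 300)   # True: inside a hypothetical five-minute RTO
```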

Automation with Terraform

Use Terraform workspaces for environment-specific DR config:

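# DR workspace
# "workspace new" creates the workspace and switches to it;
# "workspace select" is what you run on later invocations
terraform workspace new dr-us-west-2
terraform workspace select dr-us-west-2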

# Apply DR infrastructure
terraform apply -var-file=dr.tfvars

# Output:
# Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
# DR infrastructure ready in us-west-2

Common Mistakes

Not testing failover: A DR plan that’s never tested is a fantasy. Schedule quarterly failover exercises and document every gap found.
Setting unrealistic RPO/RTO: A near-zero RPO requires synchronous replication, which adds write latency across regions; asynchronous replication is cheaper but can lose the most recent writes. Be realistic about what your application and budget can support.
Forgetting about data consistency: Cross-region replication has lag. Applications must handle eventually-consistent reads or read from the primary region.
Only backing up, not testing restoration: A backup that can’t be restored is worthless. Test restores monthly. Include the full application stack, not just the database.
Ignoring network dependencies: DNS propagation, SSL certificate provisioning, and VPN setup take time. Include these in RTO calculations.
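
The last point is easiest to see with arithmetic: the RTO has to cover every sequential step of the recovery, not just the database promotion. The step names and durations below are hypothetical figures of the kind a failover rehearsal might produce, and the 15-minute target is an assumed example.

```python
from datetime import timedelta

# Hypothetical step durations taken from a failover rehearsal.
recovery_steps = {
    "detect outage (health checks)": timedelta(minutes=2),
    "promote DR database": timedelta(minutes=5),
    "scale up application tier": timedelta(minutes=8),
    "DNS change visible to clients (TTL)": timedelta(minutes=5),
    "TLS certificate and VPN checks": timedelta(minutes=10),
}


def estimated_rto(steps: dict) -> timedelta:
    """Sum the step durations, assuming the steps run one after another."""
    return sum(steps.values(), timedelta())


total = estimated_rto(recovery_steps)
print(total)                            # 0:30:00
print(total <= timedelta(minutes=15))   # False: the 15-minute target is missed
```

In this made-up example the database promotion alone would fit a 15-minute target, but the network-related steps push the total to twice that.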

Practice Questions

What is the difference between RPO and RTO? Answer: RPO is maximum acceptable data loss (time between last backup and disaster). RTO is maximum acceptable downtime (time from disaster to recovery).
What is pilot light vs warm standby? Answer: Pilot light keeps only the core components, mainly replicated data, running in the DR region; the rest of the stack is provisioned or scaled up on failover. Warm standby runs a scaled-down but fully functional copy of production, so failover only requires adding capacity.
Why is cross-region replication important? Answer: A single region outage takes down all resources in that region. Cross-region replication ensures data survives a regional disaster and can be promoted elsewhere.
What is the cheapest DR strategy? Answer: Backup & restore. Store backups in a different region, restore on failure. Low cost but highest RTO (hours to days).

Challenge

Design and test a DR plan for an e-commerce platform: set up cross-region RDS read replicas with 30-second RPO, configure Route 53 health checks with failover routing, create a warm standby with 2 app servers in the DR region, automate failover testing with a CI/CD pipeline, and document the runbook for SEV-1 regional outage.

FAQ

How often should I test DR?

Quarterly for most systems, monthly for critical systems. Each test should include full failover and failback, not just a tabletop exercise.

What is the difference between backup and DR?

Backup is copying data. DR is restoring the entire system: compute, networking, DNS, certificates, databases, and application state. Backup is a component of DR, not a replacement.

Can DR be fully automated?

Yes, for warm standby and multi-site strategies. Backup & restore is harder to fully automate. The key is: automation for detection and failover, human approval for failback.

What is the cost of a warm standby?

Typically 40-60% of production cost, because you run smaller instances and fewer replicas. The cost is insurance against regional outages.

Do I need DR if I use Kubernetes?

Yes. Kubernetes doesn’t protect against regional outages. Deploy clusters in multiple regions with cross-region service mesh and data replication.

Mini Project: DR Dashboard

# dr_dashboard.py
class DRDashboard:
    """Collects per-service DR status and prints a summary report."""

    def __init__(self):
        self.services = []

    def add_service(self, name, primary, dr, replication_lag_s=0, healthy=True):
        # healthy defaults to True; pass False to flag a degraded service
        self.services.append({
            "name": name, "primary": primary, "dr": dr,
            "lag": replication_lag_s, "healthy": healthy,
        })

    def report(self):
        print("=== DR Status Dashboard ===")
        for svc in self.services:
            status = "✅" if svc["healthy"] else "❌"
            print(f"  {status} {svc['name']:<20} Primary: {svc['primary']:<12} "
                  f"DR: {svc['dr']:<12} Lag: {svc['lag']}s")
        all_healthy = all(s["healthy"] for s in self.services)
        print(f"\n  Overall: {'ALL HEALTHY' if all_healthy else 'ISSUES DETECTED'}")


if __name__ == "__main__":
    dash = DRDashboard()
    dash.add_service("API", "us-east-1", "us-west-2", 0.5)
    dash.add_service("Database", "us-east-1", "us-west-2", 0.3)
    dash.add_service("Storage", "us-east-1", "us-west-2", 0)
    dash.report()

What’s Next

Topic	Description
CloudFront CDN	Content delivery and edge caching
AWS ECS	Running containers at scale

Related topics: AWS, Terraform, Cloud Computing

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.