Cloud Disaster Recovery: Backup, Failover, and Business Continuity

DodaTech Updated Jun 20, 2026 7 min read

Cloud disaster recovery is the strategy and practice of restoring IT infrastructure and data after a catastrophic event — using backup, replication, failover, and automated recovery to meet business continuity requirements.

What You’ll Learn

  • DR strategies: backup & restore, pilot light, warm standby, multi-site
  • Defining RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
  • Cross-region replication for databases and storage
  • Automating failover testing with Terraform and AWS tools

Why Disaster Recovery Matters

A single region outage can take your entire service offline. In 2023, an AWS US-East-1 outage affected thousands of companies for hours — those without cross-region DR lost customers, revenue, and trust. DR isn’t just for enterprise; any business that relies on cloud infrastructure needs a plan. DodaTech maintains a warm standby for Durga Antivirus Pro’s signature distribution service, replicating data across two AWS regions and automating failover with Route 53 health checks.

    flowchart LR
    A[Cloud Fundamentals] --> B[Disaster Recovery]
    B --> C[Backup & Restore]
    B --> D[Pilot Light]
    B --> E[Warm Standby]
    B --> F[Multi-Site]
    C --> G[RPO: Hours, RTO: Hours]
    D --> H[RPO: Minutes, RTO: Hours]
    E --> I[RPO: Seconds, RTO: Minutes]
    F --> J[RPO: Seconds, RTO: Seconds]
    style B fill:#e53e3e,color:#fff
  
Prerequisites: Familiarity with cloud computing and AWS services. Understanding of Terraform for infrastructure automation.

RPO and RTO

Two metrics define DR requirements:

  • RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time)
  • RTO (Recovery Time Objective): Maximum acceptable downtime

| DR Strategy | Typical RPO | Typical RTO | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | Low |
| Pilot Light | Minutes | Hours | Medium |
| Warm Standby | Seconds | Minutes | High |
| Multi-Site | Seconds | Seconds | Very High |
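
As a rough illustration of how the table can drive a decision, the sketch below picks the cheapest strategy whose typical RPO and RTO tiers meet a target. The tier boundaries (under 60 seconds counts as "seconds", under an hour as "minutes") and the `pick_strategy` helper are assumptions made for this example, not industry definitions.

```python
# Tiers ordered from strictest to loosest; boundaries are assumed for illustration.
TIERS = ["seconds", "minutes", "hours"]

# (strategy, typical RPO tier, typical RTO tier), cheapest first, from the table above.
STRATEGIES = [
    ("Backup & Restore", "hours", "hours"),
    ("Pilot Light", "minutes", "hours"),
    ("Warm Standby", "seconds", "minutes"),
    ("Multi-Site", "seconds", "seconds"),
]


def tier_of(seconds: float) -> str:
    """Map a target expressed in seconds onto a coarse tier."""
    if seconds < 60:
        return "seconds"
    if seconds < 3600:
        return "minutes"
    return "hours"


def pick_strategy(rpo_s: float, rto_s: float) -> str:
    """Return the cheapest strategy whose typical tiers are at least as strict as the targets."""
    need_rpo = TIERS.index(tier_of(rpo_s))
    need_rto = TIERS.index(tier_of(rto_s))
    for name, rpo_tier, rto_tier in STRATEGIES:
        if TIERS.index(rpo_tier) <= need_rpo and TIERS.index(rto_tier) <= need_rto:
            return name
    return STRATEGIES[-1][0]  # not reached with this table: Multi-Site meets every tier


print(pick_strategy(rpo_s=4 * 3600, rto_s=8 * 3600))  # Backup & Restore
print(pick_strategy(rpo_s=300, rto_s=2 * 3600))       # Pilot Light
print(pick_strategy(rpo_s=30, rto_s=600))             # Warm Standby
```

Real workloads rarely fit a tier this neatly, so treat the result as a starting point for discussion rather than a final answer.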

DR Strategies

Backup & Restore

The simplest and cheapest DR — regularly back up data and restore in a new region when needed.

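#!/bin/bash
# Automated S3 cross-region backup script
set -euo pipefail

SOURCE_BUCKET="dodatech-prod-us-east-1"
DEST_BUCKET="dodatech-dr-us-west-2"
DATE=$(date +%Y-%m-%d)
# Cross-region snapshot copies need the source snapshot's ARN, not just its name
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Copy data to a dated prefix in the DR region. Each new prefix receives a full
# copy; sync to a fixed prefix instead if you want incremental transfers.
aws s3 sync "s3://${SOURCE_BUCKET}" "s3://${DEST_BUCKET}/${DATE}" \
  --source-region us-east-1 \
  --region us-west-2

# RDS snapshot copy to DR region (assumes prod-db-snapshot-$DATE already exists)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:us-east-1:${ACCOUNT_ID}:snapshot:prod-db-snapshot-${DATE}" \
  --target-db-snapshot-identifier "prod-db-snapshot-${DATE}-dr" \
  --source-region us-east-1 \
  --region us-west-2

echo "Backup complete: $(date)"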
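
Because the script writes each day's copy under a dated prefix, old copies accumulate unless something removes them. The sketch below shows one way to decide which prefixes have aged out; the 30-day retention window and the `expired_prefixes` helper are assumptions for illustration, and in practice an S3 lifecycle rule may be a simpler fit.

```python
from datetime import date, timedelta


def expired_prefixes(prefixes, today, keep_days=30):
    """Return dated prefixes (YYYY-MM-DD) older than the retention window.

    Prefixes that do not parse as dates are skipped rather than deleted.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for prefix in prefixes:
        try:
            stamp = date.fromisoformat(prefix.rstrip("/"))
        except ValueError:
            continue  # not one of the dated backup prefixes; leave it alone
        if stamp < cutoff:
            expired.append(prefix)
    return sorted(expired)


prefixes = ["2026-04-01/", "2026-05-30/", "2026-06-19/", "logs/"]
print(expired_prefixes(prefixes, today=date(2026, 6, 20)))  # ['2026-04-01/']
```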

Pilot Light

A minimal copy of core infrastructure runs in the DR region. Scale up when failover is needed.

# pilot-light-dr.tf
# Minimal DR infrastructure: scale up on failover
resource "aws_db_instance" "pilot_db" {
  provider = aws.west
  # Cross-region replicas reference the source by ARN.
  # aws_db_instance.prod is assumed to be defined elsewhere in the configuration.
  replicate_source_db = aws_db_instance.prod.arn
  instance_class      = "db.r5.large" # Read replica of the prod database
}

resource "aws_instance" "pilot_app" {
  provider      = aws.west
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro" # Minimal; scale on failover
  user_data     = <<-EOF
    #!/bin/bash
    # Pre-warm with read replica of production database
    # On failover: scale up and switch to primary
  EOF
}

Warm Standby

A scaled-down but fully functional copy of production runs in the DR region:

# warm-standby.tf
resource "aws_autoscaling_group" "app_dr" {
  provider         = aws.west
  name             = "app-warm-standby"
  min_size         = 2  # Running and ready
  max_size         = 20
  desired_capacity = 2  # Scaled-down, ready to burst

  # Assumed to be defined elsewhere: the DR subnets and the app launch template
  vpc_zone_identifier = var.dr_subnet_ids
  launch_template {
    id      = aws_launch_template.app_dr.id
    version = "$Latest"
  }
}

resource "aws_rds_cluster_instance" "db_dr" {
  provider           = aws.west
  count              = 1  # Single reader in DR, scale on failover
  cluster_identifier = aws_rds_cluster.prod_dr.id
  instance_class     = "db.r5.large"
  engine             = "aurora-postgresql"
}
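
Failing over to a warm standby usually means raising the standby's capacity toward what production was carrying. A minimal sketch of that calculation follows; the `failover_capacity` helper and the 20% headroom figure are assumptions for illustration, while the bounds of 2 and 20 mirror `min_size` and `max_size` above.

```python
import math


def failover_capacity(prod_instances, min_size=2, max_size=20, headroom_pct=20):
    """Desired capacity for the DR Auto Scaling group after failover.

    Targets production's instance count plus headroom, clamped to the
    group's min/max so the requested value is always valid.
    """
    target = math.ceil(prod_instances * (100 + headroom_pct) / 100)
    return max(min_size, min(max_size, target))


print(failover_capacity(10))  # 12
print(failover_capacity(1))   # 2  (never below min_size)
print(failover_capacity(40))  # 20 (capped at max_size)
```

If production regularly needs more than `max_size` instances, the cap itself is the gap to fix before a real failover.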

Multi-Site (Active-Active)

Production runs simultaneously in two or more regions:

# Route 53: active-active DNS routing
# Latency-based records keep both regions serving traffic. With
# evaluate_target_health, an unhealthy region drops out of DNS answers.
resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"

  alias {
    name                   = aws_lb.app_us_east.dns_name
    zone_id                = aws_lb.app_us_east.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-east-1"
  }

  set_identifier = "us-east-1"
}

resource "aws_route53_record" "app_dr" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"

  alias {
    name                   = aws_lb.app_us_west.dns_name
    zone_id                = aws_lb.app_us_west.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-west-2"
  }

  set_identifier = "us-west-2"
}
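
The role of `evaluate_target_health = true` can be pictured with a small model: endpoints that fail their health evaluation drop out of the answer set, and the remaining ones are chosen by the routing policy's preference. The sketch below is a simplified illustration, not Route 53's actual algorithm; the `choose_endpoint` helper and the fallback behaviour when nothing is healthy are assumptions.

```python
def choose_endpoint(preferred_order, healthy):
    """Pick the first healthy endpoint in preference order.

    preferred_order: endpoint names, most preferred first.
    healthy: set of endpoint names currently passing health evaluation.
    If no endpoint is healthy, fall back to the most preferred one so that
    a DNS answer is still returned (assumed behaviour for this sketch).
    """
    for endpoint in preferred_order:
        if endpoint in healthy:
            return endpoint
    return preferred_order[0]


order = ["us-east-1", "us-west-2"]
print(choose_endpoint(order, healthy={"us-east-1", "us-west-2"}))  # us-east-1
print(choose_endpoint(order, healthy={"us-west-2"}))               # us-west-2
print(choose_endpoint(order, healthy=set()))                       # us-east-1
```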

Database Replication

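# RDS cross-region read replica
# Cross-region replicas must reference the source by ARN; set ACCOUNT_ID to your AWS account ID
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr \
  --source-db-instance-identifier "arn:aws:rds:us-east-1:${ACCOUNT_ID}:db:prod-db" \
  --source-region us-east-1 \
  --region us-west-2 \
  --db-instance-class db.r5.large

# Verify replication lag via the CloudWatch ReplicaLag metric (seconds).
# The date invocations assume GNU date.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db-dr \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average \
  --region us-west-2

# Example output (abridged, values illustrative):
# {
#     "Label": "ReplicaLag",
#     "Datapoints": [{"Average": 0.25, "Unit": "Seconds"}]
# }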
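
Replication lag only matters relative to the RPO, so it helps to compare the two explicitly. The sketch below takes lag samples in seconds (for example, values collected from the `ReplicaLag` metric) and reports whether the worst sample stays within an RPO; the `lag_within_rpo` helper and the sample values are hypothetical.

```python
def lag_within_rpo(lag_samples_s, rpo_s):
    """Return (ok, worst_lag) for a window of replication-lag samples.

    An empty window is treated as a failure: with no data, the replica
    may have stopped reporting, so the RPO cannot be confirmed.
    """
    if not lag_samples_s:
        return False, None
    worst = max(lag_samples_s)
    return worst <= rpo_s, worst


print(lag_within_rpo([0.25, 0.4, 1.8], rpo_s=30))  # (True, 1.8)
print(lag_within_rpo([0.25, 45.0], rpo_s=30))      # (False, 45.0)
print(lag_within_rpo([], rpo_s=30))                # (False, None)
```

Using the worst sample rather than the average is a deliberate choice here: a disaster can strike at the moment of peak lag.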

Failover Testing

Automate failover testing to ensure DR actually works:

# dr_test_runner.py
import subprocess
import sys


class DRTest:
    """Runs failover test steps in order and records each result."""

    def __init__(self, primary_region, dr_region):
        self.primary = primary_region
        self.dr = dr_region
        self.steps = []

    def run_step(self, name, command):
        """Run one shell command; a zero exit code counts as a pass."""
        print(f"  → {name}...", end=" ", flush=True)
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        success = result.returncode == 0
        self.steps.append({"name": name, "success": success, "output": result.stdout[:200]})
        print("PASS" if success else "FAIL")
        return success

    def simulate_failure(self):
        """Walk through the failover sequence, stopping at the first failed step."""
        print("\n=== DR Failover Test ===")
        # Note: exit code 0 only shows a command ran; inspect the recorded
        # output to confirm, for example, that the health check is failing.
        plan = [
            ("Block primary region traffic",
             "aws ec2 revoke-security-group-ingress ..."),
            ("Verify health check fails",
             "aws route53 get-health-check-status --health-check-id ..."),
            ("Promote DR database to primary",
             f"aws rds promote-read-replica --db-instance-identifier prod-db-dr --region {self.dr}"),
            ("Verify DR application health",
             "curl -f https://dr.app.dodatech.com/health"),
            ("Verify DNS failover",
             "dig +short app.dodatech.com @8.8.8.8"),
        ]
        # all() short-circuits, so later steps are skipped after a failure
        return all(self.run_step(name, command) for name, command in plan)


if __name__ == "__main__":
    test = DRTest("us-east-1", "us-west-2")
    passed = test.simulate_failure()
    print(f"\n{'=' * 40}\n{'All tests passed!' if passed else 'Some tests failed!'}")
    sys.exit(0 if passed else 1)
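
A failover test is more useful when it also records how long recovery took, so the result can be compared with the RTO. The following sketch polls a health probe until it succeeds and returns the elapsed time. The `measure_recovery_time` helper is hypothetical; the probe, clock, and sleep function are injected so the logic can be exercised without real infrastructure.

```python
import time


def measure_recovery_time(probe, timeout_s, interval_s=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start + interval_s > timeout_s:
            return None  # the next attempt would fall outside the allowed window
        sleep(interval_s)


# Dry run with a fake clock: the probe succeeds on its third call.
now = [0.0]
calls = [0]


def fake_probe():
    calls[0] += 1
    return calls[0] >= 3


def fake_sleep(seconds):
    now[0] += seconds


elapsed = measure_recovery_time(fake_probe, timeout_s=60, interval_s=5.0,
                                clock=lambda: now[0], sleep=fake_sleep)
print(elapsed)  # 10.0
```

In a real test the probe might wrap an HTTP request to the DR health endpoint, and a `None` result would count as a missed RTO.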

Automation with Terraform

Use Terraform workspaces for environment-specific DR config:

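# DR workspace ("new" creates the workspace and switches to it)
terraform workspace new dr-us-west-2
# In later sessions, switch back to the existing workspace
terraform workspace select dr-us-west-2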

# Apply DR infrastructure
terraform apply -var-file=dr.tfvars

# Output:
# Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
# DR infrastructure ready in us-west-2

Common Mistakes

  1. Not testing failover: A DR plan that’s never tested is a fantasy. Schedule quarterly failover exercises and document every gap found.

  2. Setting unrealistic RPO/RTO: A near-zero RPO requires synchronous replication, which adds write latency; asynchronous cross-region replication usually lags by seconds or more. Be realistic about what your application and budget can support.

  3. Forgetting about data consistency: Cross-region replication has lag. Applications must handle eventually-consistent reads or read from the primary region.

  4. Only backing up, not testing restoration: A backup that can’t be restored is worthless. Test restores monthly. Include the full application stack, not just the database.

  5. Ignoring network dependencies: DNS propagation, SSL certificate provisioning, and VPN setup take time. Include these in RTO calculations.
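
The last point lends itself to simple arithmetic: add up every sequential recovery step and compare the total with the RTO. The sketch below does that; the step names and durations are made-up placeholders, and the `rto_budget` helper is hypothetical.

```python
def rto_budget(steps_s, rto_s):
    """Sum sequential recovery steps and compare the total against the RTO.

    steps_s: mapping of step name -> estimated duration in seconds.
    Returns (total, slack); negative slack means the RTO is exceeded.
    """
    total = sum(steps_s.values())
    return total, rto_s - total


# Placeholder estimates; replace with timings measured in your own failover tests.
steps = {
    "detect outage": 120,
    "promote replica": 300,
    "scale up app tier": 240,
    "DNS TTL expiry": 60,
}
total, slack = rto_budget(steps, rto_s=900)
print(total, slack)  # 720 180
```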

Practice Questions

  1. What is the difference between RPO and RTO? Answer: RPO is maximum acceptable data loss (time between last backup and disaster). RTO is maximum acceptable downtime (time from disaster to recovery).

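  2. What is pilot light vs warm standby? Answer: Pilot light keeps only a minimal core (typically the replicated data) running, and application capacity is provisioned or scaled up on failover. Warm standby keeps a scaled-down but fully functional copy running, so it can serve traffic immediately and only needs more capacity on failover.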

  3. Why is cross-region replication important? Answer: A single region outage takes down all resources in that region. Cross-region replication ensures data survives a regional disaster and can be promoted elsewhere.

  4. What is the cheapest DR strategy? Answer: Backup & restore. Store backups in a different region, restore on failure. Low cost but highest RTO (hours to days).

Challenge

Design and test a DR plan for an e-commerce platform: set up cross-region RDS read replicas with 30-second RPO, configure Route 53 health checks with failover routing, create a warm standby with 2 app servers in the DR region, automate failover testing with a CI/CD pipeline, and document the runbook for SEV-1 regional outage.

FAQ

How often should I test DR?
: Quarterly for most systems, monthly for critical systems. Each test should include full failover and failback, not just a tabletop exercise.
What is the difference between backup and DR?
: Backup is copying data. DR is restoring the entire system — compute, networking, DNS, certificates, databases, and application state. Backup is a component of DR, not a replacement.
Can DR be fully automated?
: Yes, for warm standby and multi-site strategies. Backup & restore is harder to fully automate. The key is: automation for detection and failover, human approval for failback.
What is the cost of a warm standby?
: Typically 40-60% of production cost — you run smaller instances and fewer replicas. The cost is insurance against regional outages.
Do I need DR if I use Kubernetes?
: Yes. Kubernetes doesn’t protect against regional outages. Deploy clusters in multiple regions with cross-region service mesh and data replication.

Mini Project: DR Dashboard

# dr_dashboard.py
class DRDashboard:
    """Tracks replication lag per service and flags any that exceed a threshold."""

    def __init__(self, max_lag_s=5.0):
        self.services = []
        # Assumed threshold: lag above this marks a service unhealthy
        self.max_lag_s = max_lag_s

    def add_service(self, name, primary, dr, replication_lag_s=0):
        self.services.append({
            "name": name, "primary": primary, "dr": dr,
            "lag": replication_lag_s,
            "healthy": replication_lag_s <= self.max_lag_s,
        })

    def report(self):
        print("=== DR Status Dashboard ===")
        for svc in self.services:
            status = "✅" if svc["healthy"] else "❌"
            print(f"  {status} {svc['name']:<20} Primary: {svc['primary']:<12} "
                  f"DR: {svc['dr']:<12} Lag: {svc['lag']}s")
        print(f"\n  Overall: {'ALL HEALTHY' if all(s['healthy'] for s in self.services) else 'ISSUES DETECTED'}")

dash = DRDashboard()
dash.add_service("API", "us-east-1", "us-west-2", 0.5)
dash.add_service("Database", "us-east-1", "us-west-2", 0.3)
dash.add_service("Storage", "us-east-1", "us-west-2", 0)
dash.report()

What’s Next

| Topic | Description |
| --- | --- |
| CloudFront CDN | Content delivery and edge caching |
| AWS ECS | Running containers at scale |

Related topics: AWS, Terraform, Cloud Computing

Congratulations on completing this Disaster Recovery tutorial! Here’s where to go from here:

  • Practice daily — Review your current DR plan for gaps
  • Build a project — Set up cross-region replication for a database
  • Explore related topics — Check out CloudFront CDN and AWS ECS

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
