Cloud Disaster Recovery: Backup, Failover, and Business Continuity
Cloud disaster recovery is the strategy and practice of restoring IT infrastructure and data after a catastrophic event — using backup, replication, failover, and automated recovery to meet business continuity requirements.
What You’ll Learn
- DR strategies: backup & restore, pilot light, warm standby, multi-site
- Defining RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
- Cross-region replication for databases and storage
- Automating failover testing with Terraform and AWS tools
Why Disaster Recovery Matters
A single region outage can take your entire service offline. In 2023, an AWS us-east-1 outage affected thousands of companies for hours; those without cross-region DR lost customers, revenue, and trust. DR isn’t just for enterprises: any business that relies on cloud infrastructure needs a plan. DodaTech maintains a warm standby for Durga Antivirus Pro’s signature distribution service, replicating data across two AWS regions and automating failover with Route 53 health checks.
flowchart LR
A[Cloud Fundamentals] --> B[Disaster Recovery]
B --> C[Backup & Restore]
B --> D[Pilot Light]
B --> E[Warm Standby]
B --> F[Multi-Site]
C --> G[RPO: Hours, RTO: Hours]
D --> H[RPO: Minutes, RTO: Hours]
E --> I[RPO: Seconds, RTO: Minutes]
F --> J[RPO: Seconds, RTO: Seconds]
style B fill:#e53e3e,color:#fff
RPO and RTO
Two metrics define DR requirements:
- RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time)
- RTO (Recovery Time Objective): Maximum acceptable downtime
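To make the two definitions concrete, the sketch below derives the data-loss window and the downtime of a single incident from three timestamps and compares them with targets. The function name `measure_incident`, the timestamps, and the targets are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def measure_incident(last_recovery_point, disaster_at, service_restored_at):
    """Return (data_loss_window, downtime) for one incident."""
    data_loss_window = disaster_at - last_recovery_point  # compare with the RPO
    downtime = service_restored_at - disaster_at           # compare with the RTO
    return data_loss_window, downtime

# Hypothetical incident: last good backup at 02:00, outage at 09:30, service back at 11:00
loss, down = measure_incident(
    datetime(2024, 1, 15, 2, 0),
    datetime(2024, 1, 15, 9, 30),
    datetime(2024, 1, 15, 11, 0),
)
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=2)
print(f"Data loss window: {loss} (RPO met: {loss <= rpo_target})")
print(f"Downtime: {down} (RTO met: {down <= rto_target})")
```

In this example the downtime fits the RTO, but the 7.5-hour gap since the last backup exceeds the 4-hour RPO, so the backup interval would need to shrink.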
| DR Strategy | Typical RPO | Typical RTO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | Low |
| Pilot Light | Minutes | Hours | Medium |
| Warm Standby | Seconds | Minutes | High |
| Multi-Site | Seconds | Seconds | Very High |
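One way to use the table is to pick the cheapest strategy whose typical RPO and RTO still fit your targets. The sketch below encodes the table rows in cost order; the numeric ceilings assigned to "Seconds", "Minutes", and "Hours" are assumptions made for illustration, not figures from any provider.

```python
from datetime import timedelta

# Assumed worst-case ceilings for the table's labels (illustrative only)
SECONDS = timedelta(minutes=1)
MINUTES = timedelta(hours=1)
HOURS = timedelta(hours=24)

# (strategy, typical RPO, typical RTO), ordered from cheapest to most expensive
STRATEGIES = [
    ("Backup & Restore", HOURS, HOURS),
    ("Pilot Light", MINUTES, HOURS),
    ("Warm Standby", SECONDS, MINUTES),
    ("Multi-Site", SECONDS, SECONDS),
]

def cheapest_strategy(rpo_target, rto_target):
    """Return the first (cheapest) strategy whose typical RPO and RTO fit the targets."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None  # no listed strategy meets the targets

print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))
print(cheapest_strategy(timedelta(hours=1), timedelta(hours=24)))
print(cheapest_strategy(timedelta(minutes=5), timedelta(hours=1)))
print(cheapest_strategy(timedelta(minutes=5), timedelta(minutes=5)))
```

Because the list is ordered by cost, the first match is also the cheapest option that satisfies both targets.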
DR Strategies
Backup & Restore
The simplest and cheapest DR — regularly back up data and restore in a new region when needed.
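#!/bin/bash
# Automated S3 and RDS cross-region backup script
set -euo pipefail
SOURCE_BUCKET="dodatech-prod-us-east-1"
DEST_BUCKET="dodatech-dr-us-west-2"
DATE=$(date +%Y-%m-%d)
# The account ID is needed to build the snapshot ARN for the cross-region copy
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Copy data to a dated prefix in the DR region
# (each new date prefix receives a full copy, not an incremental one)
aws s3 sync "s3://$SOURCE_BUCKET" "s3://$DEST_BUCKET/$DATE" \
  --source-region us-east-1 \
  --region us-west-2
# RDS snapshot copy to DR region (assumes prod-db-snapshot-$DATE already exists;
# a cross-region copy must reference the source snapshot by its ARN)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:us-east-1:$ACCOUNT_ID:snapshot:prod-db-snapshot-$DATE" \
  --target-db-snapshot-identifier "prod-db-snapshot-$DATE-dr" \
  --source-region us-east-1 \
  --region us-west-2
echo "Backup complete: $(date)"
Pilot Light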
A minimal copy of core infrastructure runs in the DR region. Scale up when failover is needed.
# pilot-light-dr.tf
# Minimal DR infrastructure: scale up on failover
resource "aws_db_instance" "pilot_db" {
  provider = aws.west
  # Cross-region read replica of the prod database.
  # var.prod_db_arn is an assumed variable holding the source instance ARN.
  replicate_source_db = var.prod_db_arn
  instance_class      = "db.r5.large"
  skip_final_snapshot = true
}
resource "aws_instance" "pilot_app" {
  provider      = aws.west
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro" # Minimal: scale on failover
  user_data     = <<-EOF
    #!/bin/bash
    # Pre-warm with read replica of production database
    # On failover: scale up and switch to primary
  EOF
}
Warm Standby
A scaled-down but fully functional copy of production runs in the DR region:
# warm-standby.tf
resource "aws_autoscaling_group" "app_dr" {
  provider            = aws.west
  name                = "app-warm-standby"
  min_size            = 2 # Running and ready
  max_size            = 20
  desired_capacity    = 2 # Scaled-down, ready to burst
  vpc_zone_identifier = var.dr_subnet_ids # Assumed variable: subnet IDs in the DR region
  launch_template {
    id      = aws_launch_template.app_dr.id # Assumed to be defined elsewhere
    version = "$Latest"
  }
}
resource "aws_rds_cluster_instance" "db_dr" {
  provider           = aws.west
  count              = 1 # Single reader in DR, scale on failover
  cluster_identifier = aws_rds_cluster.prod_dr.id
  instance_class     = "db.r5.large"
  engine             = "aurora-postgresql"
}
Multi-Site (Active-Active)
Production runs simultaneously in two or more regions:
# Route 53: active-active DNS routing
# Latency-based records send each user to the nearest healthy region.
# (A PRIMARY/SECONDARY failover policy would be active-passive instead.)
resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"
  alias {
    name                   = aws_lb.app_us_east.dns_name
    zone_id                = aws_lb.app_us_east.zone_id
    evaluate_target_health = true
  }
  latency_routing_policy {
    region = "us-east-1"
  }
  set_identifier = "us-east-1"
}
resource "aws_route53_record" "app_dr" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.dodatech.com"
  type    = "A"
  alias {
    name                   = aws_lb.app_us_west.dns_name
    zone_id                = aws_lb.app_us_west.zone_id
    evaluate_target_health = true
  }
  latency_routing_policy {
    region = "us-west-2"
  }
  set_identifier = "us-west-2"
}
Database Replication
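# RDS cross-region read replica
# (a cross-region replica must reference the source instance by its ARN)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr \
  --source-db-instance-identifier "arn:aws:rds:us-east-1:$ACCOUNT_ID:db:prod-db" \
  --source-region us-east-1 \
  --region us-west-2 \
  --db-instance-class db.r5.large
# Verify replication lag via the CloudWatch ReplicaLag metric (in seconds)
# (the date -d syntax below assumes GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db-dr \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average \
  --region us-west-2
# Each returned datapoint's "Average" is the replica lag in seconds (for example 0.25)
Failover Testing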
Automate failover testing to ensure DR actually works:
# dr_test_runner.py
import subprocess
class DRTest:
    def __init__(self, primary_region, dr_region):
        self.primary = primary_region
        self.dr = dr_region
        self.steps = []
    def run_step(self, name, command):
        """Run a shell command and record whether it exited with status 0."""
        print(f"  → {name}...", end=" ", flush=True)
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        success = result.returncode == 0
        self.steps.append({"name": name, "success": success, "output": result.stdout[:200]})
        print("PASS" if success else "FAIL")
        return success
    def simulate_failure(self):
        print("\n=== DR Failover Test ===")
        self.run_step("Block primary region traffic",
                      "aws ec2 revoke-security-group-ingress ...")
        self.run_step("Verify health check fails",
                      "aws route53 get-health-check-status --health-check-id ...")
        # Promotion is one-way: the replica must be recreated after the test
        self.run_step("Promote DR database to primary",
                      f"aws rds promote-read-replica --db-instance-identifier prod-db-dr --region {self.dr}")
        self.run_step("Verify DR application health",
                      "curl -f https://dr.app.dodatech.com/health")
        self.run_step("Verify DNS failover",
                      "dig +short app.dodatech.com @8.8.8.8")
test = DRTest("us-east-1", "us-west-2")
test.simulate_failure()
all_passed = all(s["success"] for s in test.steps)
print(f"\n{'=' * 40}\n{'All tests passed!' if all_passed else 'Some tests failed!'}")
Automation with Terraform
Use Terraform workspaces for environment-specific DR config:
# DR workspace (creating a workspace also switches to it)
terraform workspace new dr-us-west-2
# On later runs, select the existing workspace instead
terraform workspace select dr-us-west-2
# Apply DR infrastructure
terraform apply -var-file=dr.tfvars
# Output:
# Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
# DR infrastructure ready in us-west-2
Common Mistakes
Not testing failover: A DR plan that’s never tested is a fantasy. Schedule quarterly failover exercises and document every gap found.
Setting unrealistic RPO/RTO: A near-zero RPO requires synchronous replication, which adds write latency, while asynchronous cross-region replication typically lags by seconds or more. Be realistic about what your application and budget can support.
Forgetting about data consistency: Cross-region replication has lag. Applications must handle eventually-consistent reads or read from the primary region.
Only backing up, not testing restoration: A backup that can’t be restored is worthless. Test restores monthly. Include the full application stack, not just the database.
Ignoring network dependencies: DNS propagation, SSL certificate provisioning, and VPN setup take time. Include these in RTO calculations.
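The last point can be checked with simple arithmetic: add up every sequential recovery step, including the network ones, and compare the total with the RTO target. The sketch below does this. All step names and durations are hypothetical placeholders; substitute timings measured in your own failover drills.

```python
from datetime import timedelta

def rto_budget(steps, rto_target):
    """Sum sequential recovery steps and return (total, margin against the target)."""
    total = sum(steps.values(), timedelta())
    return total, rto_target - total

# Hypothetical durations for illustration only
steps = {
    "Detect failure (health checks)": timedelta(seconds=90),
    "Promote DR database": timedelta(minutes=5),
    "Scale up DR application tier": timedelta(minutes=8),
    "DNS TTL expiry": timedelta(seconds=60),
    "Certificate and VPN verification": timedelta(minutes=3),
}
total, margin = rto_budget(steps, rto_target=timedelta(minutes=15))
for name, duration in steps.items():
    print(f"{name:<35} {duration}")
print(f"Total: {total}, margin: {margin.total_seconds():.0f}s")
if margin < timedelta():
    print("Estimated recovery exceeds the RTO target")
```

With these placeholder numbers the steps add up to 18.5 minutes, so a 15-minute RTO would be missed even though no single step looks slow.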
Practice Questions
What is the difference between RPO and RTO? Answer: RPO is maximum acceptable data loss (time between last backup and disaster). RTO is maximum acceptable downtime (time from disaster to recovery).
What is pilot light vs warm standby? Answer: Pilot light has a minimal copy of core data running — you scale up on failover. Warm standby has a scaled-down but fully functional copy — you scale up capacity on failover.
Why is cross-region replication important? Answer: A single region outage takes down all resources in that region. Cross-region replication ensures data survives a regional disaster and can be promoted elsewhere.
What is the cheapest DR strategy? Answer: Backup & restore. Store backups in a different region, restore on failure. Low cost but highest RTO (hours to days).
Challenge
Design and test a DR plan for an e-commerce platform: set up cross-region RDS read replicas with 30-second RPO, configure Route 53 health checks with failover routing, create a warm standby with 2 app servers in the DR region, automate failover testing with a CI/CD pipeline, and document the runbook for SEV-1 regional outage.
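For the replication part of the challenge, one possible starting point is a check that flags replica-lag samples exceeding the 30-second RPO. The sketch assumes the lag samples (in seconds) have already been collected, for example from the CloudWatch `ReplicaLag` metric. The function name and sample values are hypothetical.

```python
def rpo_violations(lag_samples_s, rpo_target_s=30.0):
    """Return (index, lag) pairs for every sample whose lag exceeds the RPO target."""
    return [(i, lag) for i, lag in enumerate(lag_samples_s) if lag > rpo_target_s]

# Hypothetical lag samples in seconds
samples = [0.3, 0.5, 12.0, 45.2, 31.0, 2.1]
violations = rpo_violations(samples)
print(f"Worst lag: {max(samples)}s")
print(f"Violations: {violations}")
```

A CI/CD job could fail the pipeline whenever the returned list is non-empty, which turns the RPO into an enforced check rather than a documented intention.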
Mini Project: DR Dashboard
# dr_dashboard.py
class DRDashboard:
    def __init__(self):
        self.regions = []
    def add_service(self, name, primary, dr, replication_lag_s=0):
        self.regions.append({
            "name": name, "primary": primary, "dr": dr,
            "lag": replication_lag_s, "healthy": True,
        })
    def report(self):
        print("=== DR Status Dashboard ===")
        for svc in self.regions:
            status = "✅" if svc["healthy"] else "❌"
            print(f"  {status} {svc['name']:<20} Primary: {svc['primary']:<12} "
                  f"DR: {svc['dr']:<12} Lag: {svc['lag']}s")
        print(f"\n  Overall: {'ALL HEALTHY' if all(s['healthy'] for s in self.regions) else 'ISSUES DETECTED'}")
dash = DRDashboard()
dash.add_service("API", "us-east-1", "us-west-2", 0.5)
dash.add_service("Database", "us-east-1", "us-west-2", 0.3)
dash.add_service("Storage", "us-east-1", "us-west-2", 0)
dash.report()
What’s Next
| Topic | Description |
|---|---|
| CloudFront CDN | Content delivery and edge caching |
| AWS ECS | Running containers at scale |
Related topics: AWS, Terraform, Cloud Computing
Congratulations on completing this Disaster Recovery tutorial! Here’s where to go from here:
- Practice daily — Review your current DR plan for gaps
- Build a project — Set up cross-region replication for a database
- Explore related topics — Check out CloudFront CDN and AWS ECS
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.