Chaos Engineering: Break Things on Purpose to Build Resilience
Chaos engineering is the disciplined practice of intentionally injecting failures into a system to uncover weaknesses before they cause real outages — building confidence in your system’s ability to withstand turbulent conditions.
What You’ll Learn
- Core principles of chaos engineering and the scientific method
- Setting up Chaos Monkey, Gremlin, and Litmus experiments
- Defining steady-state hypotheses and controlling blast radius
- Running experiments in production vs staging environments
- Analyzing results and building resilience iteratively
Why Chaos Engineering Matters
You can’t truly know your system is resilient until you test it under failure. Traditional testing validates that components work — chaos engineering validates that the system works when components fail. Netflix’s Chaos Monkey, which randomly terminates EC2 instances in production, forced their engineering teams to build systems that handle instance failure automatically. DodaTech uses chaos engineering experiments on Durga Antivirus Pro’s update infrastructure — simulating network partitions and instance failures to ensure signature updates reach users even during partial outages.
flowchart LR
A[SRE & Monitoring] --> B[Chaos Engineering]
B --> C[Steady-State Hypothesis]
B --> D[Design Experiment]
B --> E[Run Experiment]
B --> F[Analyze Results]
C --> G["System is healthy when..."]
D --> H["Inject failure: kill, delay, overload"]
E --> I[Minimize Blast Radius]
F --> J[Fix Weaknesses]
style B fill:#ff6b6b,color:#fff
Principles of Chaos Engineering
Chaos engineering follows the scientific method:
- Define steady-state — What does “healthy” look like? (latency < 200ms, error rate < 1%)
- Form a hypothesis — “If the API server fails, the load balancer routes to healthy instances and error rate stays below 1%”
- Introduce variables — Kill the API server, add latency, partition the network
- Try to disprove the hypothesis — Run the experiment and measure against steady-state
# chaos_experiment.py
import random
import time


class ChaosExperiment:
    def __init__(self, name, hypothesis, metric_name, steady_threshold):
        self.name = name
        self.hypothesis = hypothesis
        self.metric_name = metric_name
        self.steady_threshold = steady_threshold
        self.baseline = None
        self.results = []

    def measure(self, label, value):
        # Record one timestamped measurement for later analysis
        self.results.append({"label": label, "value": value, "time": time.time()})

    def run(self, fault_fn):
        print(f"=== Experiment: {self.name} ===")
        print(f"Hypothesis: {self.hypothesis}")

        # Measure baseline
        baseline_val = 0.2  # Simulated baseline latency
        self.baseline = baseline_val
        self.measure("baseline", baseline_val)
        print(f"Baseline {self.metric_name}: {baseline_val}s")

        # Inject fault
        print("Injecting fault: killing API server ...")
        fault_fn()

        # Measure during fault (simulated with a random value)
        time.sleep(1)
        during_val = random.uniform(0.15, 0.8)
        self.measure("during_fault", during_val)
        print(f"During fault {self.metric_name}: {during_val:.3f}s")

        # Recover (simulated with a random value)
        time.sleep(1)
        after_val = random.uniform(0.15, 0.3)
        self.measure("after_recovery", after_val)
        print(f"After recovery {self.metric_name}: {after_val:.3f}s")

        # The hypothesis holds if the metric stayed under the threshold during the fault
        passed = during_val < self.steady_threshold
        print(f"Result: {'PASSED' if passed else 'FAILED'}")
        print(f"Threshold: {self.metric_name} < {self.steady_threshold}s\n")
        return passed


experiment = ChaosExperiment(
    "API Server Failure",
    "Killing one API server should not increase p99 latency above 500ms",
    "p99_latency", 0.5
)


def kill_api_server():
    pass  # In real practice: `kubectl delete pod api-server-xyz`


experiment.run(kill_api_server)

# Sample output (the measurements are random, so values and the result vary per run):
# === Experiment: API Server Failure ===
# Hypothesis: Killing one API server should not increase p99 latency above 500ms
# Baseline p99_latency: 0.2s
# Injecting fault: killing API server ...
# During fault p99_latency: 0.345s
# After recovery p99_latency: 0.234s
# Result: PASSED
# Threshold: p99_latency < 0.5s

Tools for Chaos Engineering
Chaos Monkey (Netflix)
Chaos Monkey randomly terminates instances to ensure auto-scaling and failover work:
# chaos-monkey.yml
# Spinnaker Chaos Monkey configuration
# Illustrative only: key names are simplified and may not match the schema of your Chaos Monkey version
chaosMonkey:
  enabled: true
  minTimeSeconds: 60
  maxTimeSeconds: 600
  probability: 0.1  # 10% chance per check
  group:
    stack: prod
    detail: api
  regions:
    - us-east-1
    - us-west-2
  exceptions:
    - account: prod
      stack: redis  # Don't kill Redis

Gremlin
Gremlin is a SaaS chaos engineering platform with pre-built attack types:
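# Note: installer location and attack flags differ between Gremlin CLI versions; confirm them in the CLI's built-in help before running
# Install Gremlin agent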
curl -sSL https://app.gremlin.com/install/agent.sh | sudo bash
# Run a CPU attack for 30 seconds
gremlin attack cpu -l 80 -d 30
# Run a network latency attack
gremlin attack latency -h target-service.internal -p 8080 -l 200 -d 60
# Run a blackhole attack (drop all packets to a destination)
gremlin attack blackhole -h database.internal
# Output:
# Attack sent. ID: 1234-5678
# Monitoring: https://app.gremlin.com/attacks/1234-5678

Litmus
Litmus is a cloud-native chaos engineering platform for Kubernetes:
# litmus-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
spec:
  appinfo:
    appns: default
    applabel: "app=api-server"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"
        probe:
          - name: check-api-health
            type: httpProbe
            httpProbe/inputs:
              url: http://api-service:8080/health
              expectedResponseCode: "200"
            mode: Continuous

# Run Litmus experiment
kubectl apply -f litmus-experiment.yaml
# Output:
# chaosengine.litmuschaos.io/pod-delete-engine created
# The experiment then runs inside the cluster. Once it finishes, check the verdict:
kubectl describe chaosresult pod-delete-engine-pod-delete

Blast Radius Control
Never run experiments without safety controls:
# blast_radius.py
class BlastRadius:
    def __init__(self, min_replicas=3, max_failure_pct=20):
        self.min_replicas = min_replicas
        self.max_failure_pct = max_failure_pct

    def validate(self, deployment, target_count):
        total = deployment["replicas"]
        # Refuse to run against deployments that are already too small
        if total < self.min_replicas:
            return False, f"Too few replicas ({total} < {self.min_replicas})"
        # Compare with integer arithmetic to avoid float rounding at the boundary
        if target_count * 100 > self.max_failure_pct * total:
            return False, f"Blast radius too large ({target_count}/{total})"
        return True, "Safe to proceed"


deploy = {"name": "api-server", "replicas": 5}
bl = BlastRadius()
for target in [1, 3]:
    safe, msg = bl.validate(deploy, target)
    print(f"Kill {target} of {deploy['replicas']} pods: {msg}")

Expected output:
Kill 1 of 5 pods: Safe to proceed
Kill 3 of 5 pods: Blast radius too large (3/5)

Production vs Staging Testing
| Aspect | Staging | Production |
|---|---|---|
| Realism | Simulated traffic, synthetic data | Real users, real data, real scale |
| Risk | Low — only internal impact | High — user-facing impact possible |
| Confidence | Low — may miss prod-only issues | High — validates real behavior |
| Best for | First-time experiments, new attack types | Validated experiments, gradual rollout |
Common Mistakes
Running experiments without a steady-state hypothesis: If you don’t know what “healthy” looks like, you can’t measure whether the experiment caused harm. Define metrics and thresholds before injecting failures.
Starting with production: Chaos engineering is dangerous without practice. Start in staging, validate your experiment design, and only then move to production with small blast radii.
No automatic rollback: If an experiment causes unexpected impact, you need an immediate way to stop it. Use time-bound experiments, halt-on-failure conditions, and manual kill switches.
Ignoring observability during experiments: Chaos experiments are useless if you can’t see the system’s behavior during the attack. Ensure monitoring dashboards, logging, and alerting are all functioning before starting.
Treating chaos engineering as a one-time activity: Resilience isn’t a checkbox. Run experiments regularly — every sprint, every release. As systems change, new weaknesses appear.
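The time-bound and halt-on-failure safeguards mentioned above can be sketched in a few lines. This is a minimal illustration, not a production guard: `run_with_guard`, its parameters, and the three callbacks (`inject`, `rollback`, `read_error_rate`) are hypothetical names, and the error-rate readings in the demo are simulated rather than pulled from a real monitoring system.

```python
import time
from typing import Callable


def run_with_guard(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    max_error_rate: float = 1.0,
    max_duration_s: float = 30.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run a fault for at most max_duration_s, aborting early on a breach.

    Returns "aborted" if the error rate (percent) crossed the threshold,
    otherwise "completed". rollback() always runs, even if polling raises.
    """
    deadline = clock() + max_duration_s
    inject()
    try:
        while clock() < deadline:
            if read_error_rate() > max_error_rate:
                return "aborted"  # halt-on-failure condition
            sleep(poll_interval_s)
        return "completed"  # time bound reached without a breach
    finally:
        rollback()  # always undo the fault


if __name__ == "__main__":
    # Simulated readings (percent): healthy, healthy, then a breach.
    readings = iter([0.2, 0.4, 3.5])
    events = []
    outcome = run_with_guard(
        inject=lambda: events.append("inject"),
        rollback=lambda: events.append("rollback"),
        read_error_rate=lambda: next(readings),
        poll_interval_s=0.0,
    )
    print(outcome, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in a `finally` block means the fault is removed whether the experiment finishes, aborts, or the metric query itself throws. A manual kill switch would be an additional check inside the same loop.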
Practice Questions
What is the steady-state hypothesis? Answer: A statement defining what “healthy” means for the system, measured by specific metrics (latency, error rate, throughput). The experiment tries to disprove this hypothesis.
What is blast radius and why does it matter? Answer: Blast radius is the scope of impact an experiment can have. Start small (kill 1 pod out of 10) and expand gradually. Large blast radii can cause customer-facing outages.
How does Chaos Monkey differ from Gremlin? Answer: Chaos Monkey randomly terminates instances in a group — it’s simple and focused. Gremlin offers a wider range of attacks (CPU, memory, network, DNS, process kill) with more controls.
When should you run experiments in production vs staging? Answer: Start in staging for initial validation. Run in production once you’re confident, but use small blast radii, time limits, and gradual traffic shifting.
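One way to picture "start small and expand gradually" is as a staged plan that stops at the first failing stage. The sketch below is only an illustration: `plan_stages`, `expand_gradually`, and the stage percentages are hypothetical, and `run_stage` stands in for whatever actually runs the experiment and reports whether the steady state held.

```python
from typing import Callable, List, Tuple


def plan_stages(total_replicas: int, stage_pcts: List[int]) -> List[int]:
    """Convert percentage stages into strictly increasing pod counts (minimum 1)."""
    counts: List[int] = []
    for pct in stage_pcts:
        count = max(1, total_replicas * pct // 100)
        # Skip stages that would not grow the blast radius
        if not counts or count > counts[-1]:
            counts.append(count)
    return counts


def expand_gradually(
    stages: List[int], run_stage: Callable[[int], bool]
) -> Tuple[int, bool]:
    """Run stages in order and stop at the first failure.

    Returns (largest pod count that passed, whether every stage passed).
    """
    largest_ok = 0
    for count in stages:
        if not run_stage(count):
            return largest_ok, False
        largest_ok = count
    return largest_ok, True


if __name__ == "__main__":
    stages = plan_stages(10, [10, 20, 30])
    print(stages)  # [1, 2, 3]

    # Simulated system that tolerates losing up to 2 pods.
    print(expand_gradually(stages, lambda count: count <= 2))  # (2, False)
```

The returned pair tells you both how far the system got and whether there is a weakness to fix before widening the blast radius again.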
Challenge
Design and run a chaos experiment: create a Kubernetes deployment with 5 API server replicas and a load balancer, define a steady-state hypothesis (p99 latency < 300ms with 100 req/s), use Litmus to kill 1 pod during traffic, measure the impact on latency and error rate, document the results, and fix any weaknesses found.
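For the measurement step of the challenge, a small helper like the following could turn raw request samples into a pass/fail verdict. It is a sketch under stated assumptions: the function names are hypothetical, p99 is computed with the nearest-rank method, any HTTP status of 500 or above counts as an error, and the 1% error-rate limit is borrowed from the steady-state example earlier in this tutorial. The sample data in the demo is made up.

```python
from typing import Dict, Sequence, Union


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_steady_state(
    latencies_ms: Sequence[float],
    status_codes: Sequence[int],
    max_p99_ms: float = 300.0,
    max_error_pct: float = 1.0,
) -> Dict[str, Union[float, bool]]:
    """Compare observed p99 latency and error rate with the thresholds."""
    p99 = percentile(latencies_ms, 99)
    errors = sum(1 for code in status_codes if code >= 500)
    error_pct = 100.0 * errors / len(status_codes)
    passed = p99 < max_p99_ms and error_pct < max_error_pct
    return {"p99_ms": p99, "error_pct": error_pct, "passed": passed}


if __name__ == "__main__":
    # Made-up samples: 100 requests before and during the pod kill.
    before = check_steady_state([120.0] * 98 + [250.0, 900.0], [200] * 100)
    during = check_steady_state([120.0] * 95 + [450.0] * 5, [200] * 98 + [503] * 2)
    print(before)  # {'p99_ms': 250.0, 'error_pct': 0.0, 'passed': True}
    print(during)  # {'p99_ms': 450.0, 'error_pct': 2.0, 'passed': False}
```

Note that a single slow outlier (the 900 ms request) does not move p99 with 100 samples, while five slow requests do. That is why the sample size and the percentile should both be part of the hypothesis you write down.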
Mini Project: Automated Chaos Pipeline
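# .github/workflows/chaos-pipeline.yml
name: Weekly Chaos Experiment
on:
  schedule:
    - cron: '0 10 * * 1'  # Every Monday at 10:00 UTC
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes kubectl access to the target cluster has already been configured for this runner
      - name: Run Litmus experiment
        run: |
          kubectl apply -f experiments/pod-delete.yaml
          sleep 60
          kubectl describe chaosresult pod-delete-engine-pod-delete
      - name: Verify system health
        run: |
          # Assumes /metrics exposes a line of the form "error_rate <percent>"
          ERROR_RATE=$(curl -sf http://api.example.com/metrics | awk '$1 == "error_rate" {print $2}')
          if [ -z "$ERROR_RATE" ]; then
            echo "Experiment FAILED: could not read error_rate"
            exit 1
          fi
          # Compare numerically against the 1% threshold
          if awk -v rate="$ERROR_RATE" 'BEGIN {exit !(rate > 1)}'; then
            echo "Experiment FAILED: error rate exceeded threshold"
            exit 1
          fi
          echo "Experiment PASSED: system remained healthy"

What’s Next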
- Measuring reliability
- Handling outages
Related topics: SRE, Kubernetes, Prometheus, Gremlin
Congratulations on completing this Chaos Engineering tutorial! Here’s where to go from here:
- Practice daily — Run a small chaos experiment in staging this week
- Build a project — Set up automated chaos experiments with Litmus
- Explore related topics — Check out SRE and incident response practices
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.