Learn DevOps: Chaos Engineering: Break Things on Purpose to Build Resilience

DodaTech Updated Jun 20, 2026 8 min read

Chaos engineering is the disciplined practice of intentionally injecting failures into a system to uncover weaknesses before they cause real outages — building confidence in your system’s ability to withstand turbulent conditions.

What You’ll Learn

Core principles of chaos engineering and the scientific method
Setting up Chaos Monkey, Gremlin, and Litmus experiments
Defining steady-state hypotheses and controlling blast radius
Running experiments in production vs staging environments
Analyzing results and building resilience iteratively

Why Chaos Engineering Matters

You can’t truly know your system is resilient until you test it under failure. Traditional testing validates that components work — chaos engineering validates that the system works when components fail. Netflix’s Chaos Monkey, which randomly terminates EC2 instances in production, forced their engineering teams to build systems that handle instance failure automatically. DodaTech uses chaos engineering experiments on Durga Antivirus Pro’s update infrastructure — simulating network partitions and instance failures to ensure signature updates reach users even during partial outages.

    flowchart LR
    A[SRE & Monitoring] --> B[Chaos Engineering]
    B --> C[Steady-State Hypothesis]
    B --> D[Design Experiment]
    B --> E[Run Experiment]
    B --> F[Analyze Results]
    C --> G["System is healthy when..."]
    D --> H["Inject failure: kill, delay, overload"]
    E --> I[Minimize Blast Radius]
    F --> J[Fix Weaknesses]
    style B fill:#ff6b6b,color:#fff

Prerequisites: Solid SRE and monitoring knowledge. Experience with Kubernetes and Docker is helpful.

Principles of Chaos Engineering

Chaos engineering follows the scientific method:

Define steady-state — What does “healthy” look like? (latency < 200ms, error rate < 1%)
Form a hypothesis — “If the API server fails, the load balancer routes to healthy instances and error rate stays below 1%”
Introduce variables — Kill the API server, add latency, partition the network
Try to disprove the hypothesis — Run the experiment and measure against steady-state
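The first step above can be made concrete as a small check. This is a minimal sketch, assuming request samples arrive as `(latency_seconds, succeeded)` tuples; the `SteadyState` class, the function names, and the sample data are hypothetical, and the thresholds are the example values from the list (200 ms, 1%).

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    max_p99_latency_s: float = 0.2   # 200 ms, the example threshold from the list above
    max_error_rate: float = 0.01     # 1%

def p99(latencies):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]

def is_steady(samples, steady=None):
    """samples: non-empty list of (latency_seconds, succeeded) tuples."""
    steady = steady or SteadyState()
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    error_rate = errors / len(samples)
    return (p99(latencies) < steady.max_p99_latency_s
            and error_rate < steady.max_error_rate)

# Hypothetical samples: 100 requests each
healthy = [(0.05, True)] * 100
degraded = [(0.05, True)] * 97 + [(0.9, False)] * 3

print(is_steady(healthy))   # True
print(is_steady(degraded))  # False
```

Running the same check before, during, and after a fault gives the hypothesis a yes/no answer instead of a judgment call.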

# chaos_experiment.py
import random
import time

class ChaosExperiment:
    def __init__(self, name, hypothesis, metric_name, steady_threshold):
        self.name = name
        self.hypothesis = hypothesis
        self.metric_name = metric_name
        self.steady_threshold = steady_threshold
        self.baseline = None
        self.results = []

    def measure(self, label, value):
        self.results.append({"label": label, "value": value, "time": time.time()})

    def run(self, fault_fn):
        print(f"=== Experiment: {self.name} ===")
        print(f"Hypothesis: {self.hypothesis}")

        # Measure baseline (simulated here; replace with a real measurement)
        self.baseline = 0.2
        self.measure("baseline", self.baseline)
        print(f"Baseline {self.metric_name}: {self.baseline}s")

        # Inject fault
        print("Injecting fault: killing API server ...")
        fault_fn()

        # Measure during fault
        time.sleep(1)
        during_val = random.uniform(0.15, 0.8)
        self.measure("during_fault", during_val)
        print(f"During fault {self.metric_name}: {during_val:.3f}s")

        # Recover
        time.sleep(1)
        after_val = random.uniform(0.15, 0.3)
        self.measure("after_recovery", after_val)
        print(f"After recovery {self.metric_name}: {after_val:.3f}s")

        passed = during_val < self.steady_threshold
        print(f"Result: {'PASSED' if passed else 'FAILED'}")
        print(f"Threshold: {self.metric_name} < {self.steady_threshold}s\n")
        return passed

experiment = ChaosExperiment(
    "API Server Failure",
    "Killing one API server should not increase p99 latency above 500ms",
    "p99_latency", 0.5
)

def kill_api_server():
    pass  # In real practice: `kubectl delete pod api-server-xyz`

experiment.run(kill_api_server)

# Example output (the simulated values are random, so the numbers and the result vary per run):
# === Experiment: API Server Failure ===
# Hypothesis: Killing one API server should not increase p99 latency above 500ms
# Baseline p99_latency: 0.2s
# Injecting fault: killing API server ...
# During fault p99_latency: 0.345s
# After recovery p99_latency: 0.234s
# Result: PASSED
# Threshold: p99_latency < 0.5s

Tools for Chaos Engineering

Chaos Monkey (Netflix)

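Chaos Monkey randomly terminates instances to verify that auto-scaling and failover work:

# chaos-monkey.yml
# Illustrative settings only, not literal Chaos Monkey syntax. Chaos Monkey itself
# reads a chaosmonkey.toml file, and per-app options (grouping, kill frequency,
# exceptions) are set through Spinnaker.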
chaosMonkey:
  enabled: true
  minTimeSeconds: 60
  maxTimeSeconds: 600
  probability: 0.1  # 10% chance per check
  group:
    stack: prod
    detail: api
    regions:
      - us-east-1
      - us-west-2
  exceptions:
    - account: prod
      stack: redis  # Don't kill Redis
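
The selection logic that the settings above describe can be sketched in a few lines. This is not Chaos Monkey's actual implementation. It is a minimal illustration that assumes each instance is a dict with hypothetical `id`, `account`, and `stack` keys, and that an instance matching every key of an exception entry is exempt.

```python
import random

def pick_victim(instances, probability, exceptions, rng):
    """Return one instance to terminate, or None if this check does nothing.

    instances:  list of dicts with hypothetical keys "id", "account", "stack"
    exceptions: list of dicts; an instance matching every key of an entry is exempt
    """
    eligible = [
        inst for inst in instances
        if not any(all(inst.get(k) == v for k, v in exc.items()) for exc in exceptions)
    ]
    if not eligible or rng.random() >= probability:
        return None
    return rng.choice(eligible)

instances = [
    {"id": "i-api-1", "account": "prod", "stack": "api"},
    {"id": "i-api-2", "account": "prod", "stack": "api"},
    {"id": "i-redis-1", "account": "prod", "stack": "redis"},
]
exceptions = [{"account": "prod", "stack": "redis"}]  # mirrors "Don't kill Redis"

rng = random.Random(42)  # seeded so repeated runs behave the same
victims = [pick_victim(instances, 0.1, exceptions, rng) for _ in range(1000)]
killed = [v["id"] for v in victims if v is not None]

print(len(killed), "terminations in 1000 checks")
print("redis ever selected:", "i-redis-1" in killed)  # redis ever selected: False
```

Passing the random generator in as a parameter keeps the selection reproducible in tests, which matters when you need to explain why a particular instance was chosen.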

Gremlin

Gremlin is a SaaS chaos engineering platform with pre-built attack types:

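# Install the Gremlin agent (check Gremlin's documentation for the current
# install method for your platform)
curl -sSL https://app.gremlin.com/install/agent.sh | sudo bash

# Flags used below: -l is the attack length in seconds, -p is CPU percent (cpu)
# or remote port (network attacks), -m is latency in milliseconds, -h is the
# target hostname. Confirm them against Gremlin's documentation for your agent version.

# Run a CPU attack at 80% load for 30 seconds
gremlin attack cpu -p 80 -l 30

# Add 200 ms of latency to traffic for target-service.internal:8080 for 60 seconds
gremlin attack latency -h target-service.internal -p 8080 -m 200 -l 60

# Run a blackhole attack (drop all packets to a destination) for 60 seconds
gremlin attack blackhole -h database.internal -l 60

# Example output (IDs are placeholders):
# Attack sent. ID: 1234-5678
# Monitoring: https://app.gremlin.com/attacks/1234-5678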

Litmus

Litmus is a cloud-native chaos engineering platform for Kubernetes:

# litmus-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
spec:
  appinfo:
    appns: default
    applabel: "app=api-server"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"
        probe:
          - name: check-api-health
            type: httpProbe
            httpProbe/inputs:
              url: http://api-service:8080/health
              expectedResponseCode: "200"
            mode: Continuous

# Run Litmus experiment
kubectl apply -f litmus-experiment.yaml

# Output:
# chaosengine.litmuschaos.io/pod-delete-engine created

# kubectl apply returns immediately; once the experiment finishes, check the result
kubectl describe chaosresult pod-delete-engine-pod-delete
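
Reading the result can also be scripted. The sketch below assumes the ChaosResult exposes its verdict at `status.experimentStatus.verdict`, which should be verified against the Litmus version installed in your cluster. The function names and the sample document are hypothetical.

```python
import json
import subprocess

def fetch_chaosresult(name, namespace="default"):
    """Fetch a ChaosResult as a dict. Requires kubectl access to the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "chaosresult", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def verdict(chaosresult):
    """Return the verdict string, or "Unknown" if the field is missing.

    Assumes the verdict lives at status.experimentStatus.verdict.
    """
    return (
        chaosresult.get("status", {})
        .get("experimentStatus", {})
        .get("verdict", "Unknown")
    )

# Hypothetical, trimmed ChaosResult document for illustration
sample = {"status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}}}

print(verdict(sample))  # Pass
print(verdict({}))      # Unknown
```

In a cluster, `verdict(fetch_chaosresult("pod-delete-engine-pod-delete"))` would combine the two steps. Treating a missing field as "Unknown" rather than a pass keeps an unfinished experiment from being counted as a success.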

Blast Radius Control

Never run experiments without safety controls:

# blast_radius.py
class BlastRadius:
    def __init__(self, min_replicas=3, max_failure_pct=20):
        self.min_replicas = min_replicas
        self.max_failure_pct = max_failure_pct

    def validate(self, deployment, target_count):
        total = deployment["replicas"]
        if total < self.min_replicas:
            return False, f"Too few replicas ({total} < {self.min_replicas})"
        # Compare with integer arithmetic to avoid floating-point rounding at the limit
        if target_count * 100 > self.max_failure_pct * total:
            return False, f"Blast radius too large ({target_count}/{total})"
        return True, "Safe to proceed"

deploy = {"name": "api-server", "replicas": 5}
bl = BlastRadius()

for target in [1, 3]:
    safe, msg = bl.validate(deploy, target)
    print(f"Kill {target} of {deploy['replicas']} pods: {msg}")

Expected output:

Kill 1 of 5 pods: Safe to proceed
Kill 3 of 5 pods: Blast radius too large (3/5)

Production vs Staging Testing

Aspect	Staging	Production
Realism	Simulated traffic, synthetic data	Real users, real data, real scale
Risk	Low — only internal impact	High — user-facing impact possible
Confidence	Low — may miss prod-only issues	High — validates real behavior
Best for	First-time experiments, new attack types	Validated experiments, gradual rollout

Common Mistakes

Running experiments without a steady-state hypothesis: If you don’t know what “healthy” looks like, you can’t measure whether the experiment caused harm. Define metrics and thresholds before injecting failures.
Starting with production: Chaos engineering is dangerous without practice. Start in staging, validate your experiment design, and only then move to production with small blast radii.
No automatic rollback: If an experiment causes unexpected impact, you need an immediate way to stop it. Use time-bound experiments, halt-on-failure conditions, and manual kill switches.
Ignoring observability during experiments: Chaos experiments are useless if you can’t see the system’s behavior during the attack. Ensure monitoring dashboards, logging, and alerting are all functioning before starting.
Treating chaos engineering as a one-time activity: Resilience isn’t a checkbox. Run experiments regularly — every sprint, every release. As systems change, new weaknesses appear.
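
The rollback advice above (time-bound experiments, halt-on-failure conditions, manual kill switches) can be combined in one small runner. This is a minimal sketch under stated assumptions: `inject`, `rollback`, `is_healthy`, and `kill_switch` are hypothetical callables you would supply, and `run_with_guardrails` is not part of any tool mentioned in this article.

```python
import time

def run_with_guardrails(inject, rollback, is_healthy, max_duration_s, poll_s=1.0,
                        kill_switch=lambda: False, sleep=time.sleep,
                        clock=time.monotonic):
    """Run a fault for at most max_duration_s, rolling back early on trouble.

    Returns "completed", "halted: unhealthy", or "halted: kill switch".
    """
    outcome = "completed"
    try:
        inject()
        deadline = clock() + max_duration_s      # time-bound experiment
        while clock() < deadline:
            if kill_switch():                    # manual stop
                outcome = "halted: kill switch"
                break
            if not is_healthy():                 # halt-on-failure condition
                outcome = "halted: unhealthy"
                break
            sleep(poll_s)
    finally:
        rollback()  # always undo the fault, even if inject or a check raises
    return outcome

# Usage with stand-in callables: the third health check fails
events = []
checks = iter([True, True, False])
result = run_with_guardrails(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    is_healthy=lambda: next(checks),
    max_duration_s=30,
    poll_s=0,
)
print(result)  # halted: unhealthy
print(events)  # ['inject', 'rollback']
```

Putting the rollback in a `finally` block means the fault is removed on every exit path, including an exception raised by a health check.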

Practice Questions

What is the steady-state hypothesis? Answer: A statement defining what “healthy” means for the system, measured by specific metrics (latency, error rate, throughput). The experiment tries to disprove this hypothesis.
What is blast radius and why does it matter? Answer: Blast radius is the scope of impact an experiment can have. Start small (kill 1 pod out of 10) and expand gradually. Large blast radii can cause customer-facing outages.
How does Chaos Monkey differ from Gremlin? Answer: Chaos Monkey randomly terminates instances in a group — it’s simple and focused. Gremlin offers a wider range of attacks (CPU, memory, network, DNS, process kill) with more controls.
When should you run experiments in production vs staging? Answer: Start in staging for initial validation. Run in production once you’re confident, but use small blast radii, time limits, and gradual traffic shifting.

Challenge

Design and run a chaos experiment: create a Kubernetes deployment with 5 API server replicas and a load balancer, define a steady-state hypothesis (p99 latency < 300ms with 100 req/s), use Litmus to kill 1 pod during traffic, measure the impact on latency and error rate, document the results, and fix any weaknesses found.

FAQ

Is chaos engineering the same as testing?

No. Testing validates known behaviors. Chaos engineering uncovers unknown weaknesses by exploring failure modes you didn’t think to test.

Can I do chaos engineering without tools?

Yes. Manually kill a pod, block a port with iptables, or fill a disk to test alerting and recovery. Tools add repeatability and safety controls.

What if a chaos experiment causes a real outage?

Then the experiment exposed a real weakness, which is what it is for. A controlled blast radius should keep the outage small. Halt the experiment, fix the weakness, and run it again.

How often should I run chaos experiments?

Continuously, once the team is ready. Netflix runs Chaos Monkey in production on an ongoing basis, limiting terminations to business hours so engineers are available to respond. Start with weekly scheduled experiments and increase frequency as your team gains confidence.

What is the difference between chaos engineering and fault injection?

Fault injection is a technique: injecting a specific failure. Chaos engineering is the full practice: hypothesis, experiment, analysis, and iterative improvement. Fault injection is one part of it.
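
To illustrate the last answer, fault injection on its own can be as small as a decorator that makes a call slow or makes it fail. This is a hedged sketch rather than a recommended library: `inject_faults`, `InjectedFault`, and `fetch_profile` are hypothetical names. It covers only the "inject a specific failure" part; the hypothesis and analysis still have to come from the surrounding experiment.

```python
import functools
import random
import time

class InjectedFault(Exception):
    """Raised in place of a real failure."""

def inject_faults(error_rate=0.0, delay_s=0.0, rng=None, sleep=time.sleep):
    """Decorator: add delay_s of latency, then fail with probability error_rate."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_s:
                sleep(delay_s)
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.3, rng=random.Random(7))
def fetch_profile(user_id):
    return {"id": user_id}

failures = 0
for i in range(1000):
    try:
        fetch_profile(i)
    except InjectedFault:
        failures += 1

print(f"{failures} of 1000 calls failed")  # roughly 30% with error_rate=0.3
```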

Mini Project: Automated Chaos Pipeline

# .github/workflows/chaos-pipeline.yml
name: Weekly Chaos Experiment
on:
  schedule:
    - cron: '0 10 * * 1'  # Every Monday at 10:00 UTC

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
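      - name: Run Litmus experiment
        # Assumes an earlier step has configured kubectl with cluster credentials
        run: |
          kubectl apply -f experiments/pod-delete.yaml
          sleep 60
          kubectl describe chaosresult pod-delete-engine-pod-delete
      - name: Verify system health
        # Assumes /metrics exposes a line like "error_rate 0.4" (a percentage)
        run: |
          ERROR_RATE=$(curl -fsS http://api.example.com/metrics | awk '$1 == "error_rate" {print $2; exit}')
          if [ -z "$ERROR_RATE" ]; then
            echo "Experiment FAILED: could not read error_rate"
            exit 1
          fi
          if awk -v r="$ERROR_RATE" 'BEGIN { exit !(r > 1) }'; then
            echo "Experiment FAILED: error rate ${ERROR_RATE}% exceeded 1% threshold"
            exit 1
          fi
          echo "Experiment PASSED: system remained healthy"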
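
If the health check grows beyond a single comparison, it may be easier to maintain as a small script called from the workflow. The sketch below assumes the metrics endpoint returns Prometheus-style text with an `error_rate` gauge expressed as a percentage. The metric name, the sample payload, and both function names are assumptions for illustration, and the parser is deliberately simple (it does not handle label values that contain spaces).

```python
def read_metric(metrics_text, name):
    """Return the first sample value for `name` in Prometheus-style text, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        # The metric name may carry labels, e.g. error_rate{service="api"} 0.4
        metric = parts[0].split("{", 1)[0]
        if metric == name and len(parts) >= 2:
            return float(parts[1])
    return None

def experiment_passed(metrics_text, threshold_pct=1.0):
    """Pass only if error_rate is present and at or below the threshold."""
    value = read_metric(metrics_text, "error_rate")
    return value is not None and value <= threshold_pct

# Hypothetical payload for illustration
sample = """# HELP error_rate Percentage of failed requests
# TYPE error_rate gauge
error_rate 0.4
"""

print(read_metric(sample, "error_rate"))               # 0.4
print(experiment_passed(sample))                       # True
print(experiment_passed("error_rate 2.5\n"))           # False
print(experiment_passed("requests_total 100\n"))       # False (metric missing)
```

Treating a missing metric as a failure is a deliberate choice here: an experiment that cannot be observed should not be reported as passing.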

What’s Next

Topic	Description
SLIs, SLOs, SLAs	Measuring reliability
Incident Response	Handling outages

Related topics: SRE, Kubernetes, Prometheus, Gremlin

Congratulations on completing this Chaos Engineering tutorial! Here’s where to go from here:

Practice daily — Run a small chaos experiment in staging this week
Build a project — Set up automated chaos experiments with Litmus
Explore related topics — Check out SRE and incident response practices

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
