What is Reinforcement Learning? Explained with Python Examples
This tutorial introduces reinforcement learning: the core concepts, a worked Q-learning example in Python, and guidance on when RL is a good fit for a problem.
What You'll Learn
Understand reinforcement learning fundamentals — agents, environments, rewards, policies — and build a Q-learning agent that learns to navigate a grid.
Why It Matters
Reinforcement learning is central to systems such as AlphaGo and is applied in robotics, game AI, and research on self-driving cars and automated trading.
Real-World Use
Training a robot to walk, optimizing data center cooling (Google reported up to a 40% reduction in cooling energy using a DeepMind machine learning system), and teaching game AIs to beat human champions.
What is Reinforcement Learning?
Reinforcement learning (RL) is a type of machine learning in which an agent learns by taking actions in an environment and receiving rewards, much like training a dog with treats.
Agent → Takes action → Environment → Returns reward + new state
Agent ← Learns from reward ← Environment
The agent's goal: maximize total reward over time.
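"Total reward over time" is commonly formalized as a discounted return, where a reward received t steps in the future is weighted by gamma to the power t. The sketch below is illustrative only: the function name `discounted_return`, the sample rewards, and the default discount of 0.95 are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three small step costs followed by a goal reward (illustrative values).
episode_rewards = [-0.01, -0.01, -0.01, 1.0]
print(discounted_return(episode_rewards))
```

A discount below 1 makes the agent prefer reaching a reward sooner rather than later, since the same reward counts for less the further away it is.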
Key Concepts
| Concept | Definition | Example |
|---|---|---|
| Agent | The learner/decision-maker | A game player |
| Environment | The world the agent interacts with | The game board |
| Action | What the agent can do | Move left, right, up, down |
| State | Current situation | Player position |
| Reward | Feedback signal | +1 for reaching goal, -1 for falling |
| Policy | Strategy for choosing actions | "Always go toward the goal" |
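The concepts in the table can be seen together in a single interaction loop. The sketch below uses a hypothetical one-dimensional environment, `LineWorld`, and a fixed policy, `go_right_policy`; both names and the reward values are assumptions made for illustration and are not part of any library.

```python
class LineWorld:
    """Hypothetical environment: cells 0..length-1 with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at the left end and return that state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (-1 = left, +1 = right); return (new_state, reward, done)."""
        # Clamp so the agent cannot leave the corridor.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def go_right_policy(state):
    """Policy: map every state to the action that moves toward the goal."""
    return +1


def run_episode(env, policy, max_steps=20):
    """Run the agent-environment loop once and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment returns feedback
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(LineWorld(), go_right_policy))
```

Here the policy is hard-coded. A learning agent would instead adjust its policy based on the rewards it observes.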
Q-Learning from Scratch
Let's build an agent that learns to navigate a 5x5 grid to reach a goal.
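The code in this section applies the standard tabular Q-learning update after each valid move. In conventional notation (the symbols do not appear in the code itself: α corresponds to `learning_rate`, γ to `discount`, r to `reward`, s to the current state, and s' to the new state), the update can be written as:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

With α = 0.1 and γ = 0.95, each update moves Q(s, a) one tenth of the way toward the target r + 0.95 · max Q(s', a').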
import numpy as np

# Grid: 0=empty, 1=obstacle, 2=goal
grid = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 2]
])

# Q-table: one value per (row, col, action); actions are up, down, left, right
q_table = np.zeros((5, 5, 4))
actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

learning_rate = 0.1
discount = 0.95
episodes = 1000

for _ in range(episodes):
    state = (0, 0)  # every episode starts in the top-left corner
    while grid[state] != 2:
        # Greedy choice; np.argmax breaks ties by taking the lowest index
        action = int(np.argmax(q_table[state[0], state[1]]))
        dr, dc = actions[action]
        new_state = (state[0] + dr, state[1] + dc)

        # Check bounds and obstacles
        if (0 <= new_state[0] < 5 and 0 <= new_state[1] < 5
                and grid[new_state] != 1):
            reward = 1 if grid[new_state] == 2 else -0.01

            # Q-learning update
            best_next = np.max(q_table[new_state[0], new_state[1]])
            q_table[state[0], state[1], action] += learning_rate * (
                reward + discount * best_next -
                q_table[state[0], state[1], action]
            )
            state = new_state
        else:
            # Penalize invalid moves; the agent stays where it is
            q_table[state[0], state[1], action] -= 0.1

print("Training complete!")
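Two helpers often accompany a training loop like the one above. The sketch below is an assumption-laden illustration, not part of the original tutorial: `epsilon_greedy` shows one common way to add random exploration (the loop above always picks the greedy action), and `greedy_path` reads a route out of a Q-table by following the highest-valued action from each cell. Both function names, the `max_steps` guard, and the tiny demo corridor are hypothetical.

```python
import numpy as np

# Same action encoding as the tutorial: up, down, left, right.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def greedy_path(q_table, grid, start=(0, 0), max_steps=50):
    """Follow the highest-valued action from start.

    Stops at the goal, at an invalid move, or after max_steps moves.
    """
    rows, cols = grid.shape
    state = start
    path = [state]
    for _ in range(max_steps):
        if grid[state] == 2:  # reached the goal
            break
        dr, dc = ACTIONS[int(np.argmax(q_table[state[0], state[1]]))]
        nxt = (state[0] + dr, state[1] + dc)
        in_bounds = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
        if not in_bounds or grid[nxt] == 1:
            break  # greedy action is invalid, so the path ends here
        state = nxt
        path.append(state)
    return path


# Tiny hand-built example: a 1x3 corridor whose Q-values favor "right".
demo_grid = np.array([[0, 0, 2]])
demo_q = np.zeros((1, 3, 4))
demo_q[0, :, 3] = 1.0
print(greedy_path(demo_q, demo_grid))

rng = np.random.default_rng(0)
print(epsilon_greedy(demo_q[0, 0], epsilon=0.0, rng=rng))
```

With the tutorial's trained `q_table` and `grid` in scope, `greedy_path(q_table, grid)` would list the cells the agent visits under its learned policy. Replacing the `np.argmax` call in the training loop with `epsilon_greedy(q_table[state[0], state[1]], 0.1, rng)` is one possible way to add exploration; the value 0.1 is an assumption, not a recommendation from the source.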
When to Use RL
| Good fit | Poor fit |
|---|---|
| Sequential decision-making | One-shot predictions |
| A simulator of the environment is available | Real-world environment with slow feedback |
| Exploration is safe | Mistakes are expensive |