
Decision Trees and Random Forests Explained — Complete Guide

DodaTech 3 min read

In this tutorial, you'll learn how decision trees and random forests work. We cover the key concepts, practical examples, and best practices to help you understand and apply both models effectively.

Decision trees split data into branches based on feature values, creating a tree-like structure where each leaf holds a prediction. Random forests combine many such trees, often hundreds, and aggregate their predictions, which typically makes the model more accurate and more stable than any single tree.

What You'll Learn

How decision trees work, what makes them prone to overfitting, how random forests fix that weakness, and how to build both models in scikit-learn using the Iris dataset.

Why It Matters

Random forests are among the most widely used ML algorithms in production because they handle numeric features and encoded categorical features alike, need no feature scaling, and provide feature importance rankings out of the box.

Real-World Use

Durga Antivirus Pro uses a random forest classifier as one of its detection layers. The model analyzes file metadata, entropy, and structural patterns to flag suspicious files, processing millions of files daily.

Decision Tree vs Random Forest

flowchart TD
    subgraph Single Tree
        A1[Root Node] --> B1[Split 1]
        B1 --> C1[Leaf 1]
        B1 --> C2[Leaf 2]
        A1 --> D1[Split 2]
        D1 --> E1[Leaf 3]
        D1 --> E2[Leaf 4]
    end
    subgraph Random Forest
        F1[Tree 1] --> G1[Vote]
        F2[Tree 2] --> G1
        F3[Tree 3] --> G1
        F4[Tree n] --> G1
        G1 --> H1[Final Prediction]
    end

Building a Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
print(f"Tree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")
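# Convert NumPy floats to plain floats so the dict prints cleanly
tree_importances = {name: round(float(v), 3) for name, v in zip(iris.feature_names, tree.feature_importances_)}
print(f"Feature importances: {tree_importances}")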

Expected output:

Training accuracy: 0.983
Test accuracy:     1.000
Tree depth: 3
Number of leaves: 8
Feature importances: {'sepal length (cm)': 0.0, 'sepal width (cm)': 0.0, 'petal length (cm)': 0.555, 'petal width (cm)': 0.445}

Petal length and petal width are the most important features. Both sepal measurements have an importance of 0.0, which means no split in the depth-3 tree uses them.
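To see which thresholds the tree actually learned, the fitted model can be printed as plain-text rules. The following is a minimal sketch using scikit-learn's `export_text`; it refits the same depth-3 tree so that it runs on its own, and the variable name `rules` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# One line per node: a threshold test on a feature, or the class at a leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented level in the printout is one split, so a feature that never appears in the rules is a feature the tree did not use.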

Understanding Overfitting in Decision Trees

tree_deep = DecisionTreeClassifier(max_depth=None, random_state=42)
tree_deep.fit(X_train, y_train)

train_acc_deep = tree_deep.score(X_train, y_train)
test_acc_deep = tree_deep.score(X_test, y_test)

tree_shallow = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_shallow.fit(X_train, y_train)

train_acc_shallow = tree_shallow.score(X_train, y_train)
test_acc_shallow = tree_shallow.score(X_test, y_test)

print(f"Deep tree   (depth=inf): train={train_acc_deep:.3f}, test={test_acc_deep:.3f}")
print(f"Shallow tree (depth=2):  train={train_acc_shallow:.3f}, test={test_acc_shallow:.3f}")

Expected output:

Deep tree   (depth=inf): train=1.000, test=0.967
Shallow tree (depth=2):  train=0.950, test=0.967

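The deep tree memorized the training data (100% accuracy), yet on the test set it scores no better than the depth-2 tree (0.967 for both) and worse than the depth-3 tree from the previous section (1.000). The extra depth buys training accuracy, not generalization, while the shallow tree holds up on unseen data despite its lower training accuracy.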
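A single 30-sample test set is a coarse yardstick: 0.967 versus 1.000 is a difference of one flower. As a rough sketch of a more stable comparison, the loop below scores several depths with 5-fold cross-validation on the full Iris data. The list of depths and the `cv_scores` dictionary are illustrative choices, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

cv_scores = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Five train/validation splits instead of one fixed test set
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    cv_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy={scores.mean():.3f}")
```

If accuracy stops improving beyond a certain depth, the extra levels are most likely fitting noise rather than signal.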

Random Forest

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_split=5,
    random_state=42
)
rf.fit(X_train, y_train)

train_acc_rf = rf.score(X_train, y_train)
test_acc_rf = rf.score(X_test, y_test)

print(f"Random Forest training accuracy: {train_acc_rf:.3f}")
print(f"Random Forest test accuracy:     {test_acc_rf:.3f}")

importances = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf.feature_importances_.round(3)
}).sort_values('importance', ascending=False)
print("\nFeature Importances:")
print(importances)

Expected output:

Random Forest training accuracy: 0.983
Random Forest test accuracy:     1.000

Feature Importances:
             feature  importance
2  petal length (cm)       0.479
3   petal width (cm)       0.412
1   sepal width (cm)       0.078
0  sepal length (cm)       0.031

The random forest generalizes at least as well as the single tree and provides more stable feature importance estimates, because it averages across trees trained on different bootstrap samples.
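The bootstrap-and-vote idea can be sketched by hand with plain decision trees. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally (scikit-learn averages class probabilities rather than counting hard votes). The tree count of 25 and the names `votes` and `majority` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25
votes = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    member = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    member.fit(X_train[idx], y_train[idx])
    votes.append(member.predict(X_test))

votes = np.array(votes)  # shape: (n_trees, n_test_samples)
# Majority vote per test sample across all trees
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
ensemble_acc = (majority == y_test).mean()
print(f"Hand-rolled ensemble test accuracy: {ensemble_acc:.3f}")
```

Because every tree sees a different resample and a random subset of features at each split, individual mistakes tend to differ and are outvoted.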

Key Hyperparameters

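  • n_estimators: More trees reduce variance and rarely hurt accuracy, but returns diminish after roughly 200-500 trees while training and prediction cost keep growing
  • max_depth: Deeper trees capture more patterns but are more likely to overfit
  • min_samples_split: Higher values block splits on nodes with too few samples, which regularizes each tree
  • max_features: Controls the randomness of each split. Lower values reduce correlation between trees but make each individual tree weaker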
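These hyperparameters interact, so they are usually tuned together. A minimal sketch with `GridSearchCV` follows; the grid values are arbitrary, kept small so the search finishes quickly, and should not be read as recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()

# Illustrative grid only; real searches usually cover wider ranges
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(iris.data, iris.target)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

On a dataset as small and clean as Iris, many combinations tend to score almost the same, so the value of the search shows up mainly on larger, noisier data.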

Practice Questions

  1. Why does a random forest outperform a single decision tree?
  2. What causes overfitting in decision trees and how do random forests mitigate it?
  3. How do you interpret feature importance from a random forest?

Frequently Asked Questions

Are random forests better than gradient boosting?

Random forests are simpler to tune, less prone to overfitting, and easy to parallelize because each tree is trained independently. Gradient boosting often achieves higher accuracy but requires careful tuning of the learning rate and regularization.
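A quick way to compare the two on your own data is to cross-validate both with default-like settings. The sketch below uses Iris only because it ships with scikit-learn; the settings and the `results` dictionary are assumptions for illustration, and on such a small dataset the scores are likely to be close.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

models = {
    # n_jobs=-1 trains the independent trees on all available cores
    "random forest": RandomForestClassifier(
        n_estimators=100, random_state=42, n_jobs=-1
    ),
    # Boosting builds trees sequentially; learning_rate scales each tree's contribution
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean CV accuracy={scores.mean():.3f}")
```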

Can random forests handle missing values?

Many implementations cannot handle missing values directly, so you need to impute missing data before training. Scikit-learn is an exception in recent releases: its random forest estimators accept NaN values natively from version 1.4 onward, while older versions still require imputation.
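Imputing before training works regardless of library version. The following is a minimal sketch, assuming median imputation and roughly 10% of values knocked out at random to simulate missing data; both choices are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

iris = load_iris()
rng = np.random.default_rng(42)

# Simulate missing data: blank out about 10% of the feature values
X = iris.data.copy()
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.2, random_state=42
)

# The pipeline learns medians from the training split only, then reuses them at predict time
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"Test accuracy with imputed data: {acc:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking test-set statistics into training.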


Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
