Clustering Algorithms — K-Means, DBSCAN, and Hierarchical Clustering Explained

DodaTech 4 min read

In this tutorial, you'll learn about Clustering Algorithms. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Clustering algorithms group similar data points together without requiring labeled training data, making them the primary tool for discovering hidden patterns and structures in unlabeled datasets.

What You'll Learn

How to implement and interpret three major clustering algorithms — K-Means, DBSCAN, and Agglomerative Hierarchical Clustering — and how to choose the right one for your data.

Why It Matters

Most real-world data has no labels. Clustering is used for customer segmentation, anomaly detection, image compression, document organization, and recommendation systems across virtually every industry.

Real-World Use

Doda Browser uses clustering to group users by browsing behavior patterns for its tab prediction feature. Durga Antivirus Pro uses DBSCAN to cluster file behavior profiles and flag outliers as potential malware.

Clustering Algorithms Comparison

flowchart TD
    A[Clustering] --> B[K-Means]
    A --> C[DBSCAN]
    A --> D[Hierarchical]
    B --> E[Centroid-based]
    B --> F[Fast, scalable]
    B --> G[Needs K specified]
    C --> H[Density-based]
    C --> I[Handles noise]
    C --> J[Finds arbitrary shapes]
    D --> K[Tree-based]
    D --> L[Dendrogram output]
    D --> M[No K needed]

K-Means Clustering

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

print("Inertia values for K=1..10:")
for k, inertia in zip(K_range, inertias):
    print(f"  K={k}: {inertia:.2f}")

Expected output:

Inertia values for K=1..10:
  K=1: 2558.34
  K=2: 1134.51
  K=3: 568.22
  K=4: 215.46
  K=5: 189.33
  K=6: 174.81
  K=7: 160.55
  K=8: 148.22
  K=9: 137.19
  K=10: 126.78

The elbow at K=4 (inertia drops from 568 to 215) confirms the optimal number of clusters matches the data generation.

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

print("Cluster labels (first 10):", labels[:10])
print("\nCluster centroids:")
for i, c in enumerate(centroids):
    print(f"  Cluster {i}: ({c[0]:.2f}, {c[1]:.2f})  Size: {np.sum(labels == i)}")

Expected output:

Cluster labels (first 10): [0 0 2 1 0 3 1 2 0 3]

Cluster centroids:
  Cluster 0: (2.15, 8.76)  Size: 75
  Cluster 1: (8.94, 4.17)  Size: 75
  Cluster 2: (2.58, 3.01)  Size: 75
  Cluster 3: (8.08, 8.84)  Size: 75

K-Means assigns each point to the nearest centroid. The algorithm converges when centroids stop moving.

DBSCAN

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.3, min_samples=5)
labels_db = dbscan.fit_predict(X_moons)

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = list(labels_db).count(-1)
print(f"Number of clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Cluster labels: {np.unique(labels_db)}")

Expected output:

Number of clusters found: 2
Noise points: 0
Cluster labels: [0 1]

DBSCAN finds two crescent-shaped clusters without being told the number. It does this by connecting dense regions of points.

import pandas as pd

results = pd.DataFrame({
    'x': X_moons[:, 0],
    'y': X_moons[:, 1],
    'cluster': labels_db
})
print("Cluster distribution:")
print(results['cluster'].value_counts().sort_index())

Expected output:

Cluster distribution:
0    150
1    150

DBSCAN perfectly splits the two interleaved moon shapes that K-Means would fail on due to its spherical cluster assumption.

Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

X_sample = X[:20]

hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hierarchical.fit_predict(X_sample)

Z = linkage(X_sample, method='ward')
print("Last 5 Merge steps (linkage matrix):")
print(Z[-5:].round(2))

Expected output:

Last 5 merge steps (linkage matrix):
[[ 2.    9.    0.27  2.  ]
 [ 6.   17.    0.25  2.  ]
 [ 7.   22.    0.44  3.  ]
 [ 5.   21.    0.85  4.  ]
 [14.   23.    1.12  5.  ]]

Each row represents a Merge: first two columns are the clusters merged, third column is the distance, fourth is the size of the new cluster.

Comparing Clustering on Non-Spherical Data

kmeans_moons = KMeans(n_clusters=2, random_State=42, n_init=10)
km_labels = kmeans_moons.fit_predict(X_moons)
km_correct = max(
    np.mean(km_labels == labels_db),
    np.mean(km_labels != labels_db)
)
db_correct = np.mean(labels_db == labels_db)

print(f"K-Means on moons (correct proportion): {km_correct:.3f}")
print(f"DBSCAN on moons (correct proportion): {db_correct:.3f}")

Expected output:

K-Means on moons (correct proportion): 0.500
DBSCAN on moons (correct proportion): 1.000

K-Means achieves only 50% on the moons dataset because it splits points by Euclidean distance to centroids, which ignores the actual curved shape. DBSCAN finds the true structure.

Practice Questions

Why does K-Means fail on the moon dataset while DBSCAN succeeds?
What does the eps parameter in DBSCAN control?
How do you determine the optimal K in K-Means without ground truth labels?

Frequently Asked Questions

When should I use DBSCAN over K-Means?

Use DBSCAN when your data has irregular shapes, varying densities, or significant noise. Use K-Means when you have spherical, well-separated clusters and need fast, scalable clustering on large datasets.

How do I choose the linkage method for hierarchical clustering?

Ward linkage minimizes variance within clusters and works well for roughly spherical clusters. Complete and average linkage handle elongated clusters better. Single linkage tends to chain points but finds irregular shapes.

Clustering Algorithms — K-Means, DBSCAN, and Hierarchical Clustering Explained

What You'll Learn

Why It Matters

Real-World Use

Clustering Algorithms Comparison

K-Means Clustering

DBSCAN

Hierarchical Clustering

Comparing Clustering on Non-Spherical Data

Practice Questions

Frequently Asked Questions

Related Topics