Clustering Algorithms — K-Means, DBSCAN, and Hierarchical Clustering Explained
In this tutorial, you'll learn about Clustering Algorithms. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Clustering algorithms group similar data points together without requiring labeled training data, making them the primary tool for discovering hidden patterns and structures in unlabeled datasets.
What You'll Learn
How to implement and interpret three major clustering algorithms — K-Means, DBSCAN, and Agglomerative Hierarchical Clustering — and how to choose the right one for your data.
Why It Matters
Most real-world data has no labels. Clustering is used for customer segmentation, anomaly detection, image compression, document organization, and recommendation systems across virtually every industry.
Real-World Use
Doda Browser uses clustering to group users by browsing behavior patterns for its tab prediction feature. Durga Antivirus Pro uses DBSCAN to cluster file behavior profiles and flag outliers as potential malware.
Clustering Algorithms Comparison
flowchart TD
A[Clustering] --> B[K-Means]
A --> C[DBSCAN]
A --> D[Hierarchical]
B --> E[Centroid-based]
B --> F[Fast, scalable]
B --> G[Needs K specified]
C --> H[Density-based]
C --> I[Handles noise]
C --> J[Finds arbitrary shapes]
D --> K[Tree-based]
D --> L[Dendrogram output]
D --> M[No K needed]
K-Means Clustering
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
inertias = []
K_range = range(1, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
print("Inertia values for K=1..10:")
for k, inertia in zip(K_range, inertias):
print(f" K={k}: {inertia:.2f}")
Expected output:
Inertia values for K=1..10:
K=1: 2558.34
K=2: 1134.51
K=3: 568.22
K=4: 215.46
K=5: 189.33
K=6: 174.81
K=7: 160.55
K=8: 148.22
K=9: 137.19
K=10: 126.78
The elbow at K=4 (inertia drops from 568 to 215) confirms the optimal number of clusters matches the data generation.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
print("Cluster labels (first 10):", labels[:10])
print("\nCluster centroids:")
for i, c in enumerate(centroids):
print(f" Cluster {i}: ({c[0]:.2f}, {c[1]:.2f}) Size: {np.sum(labels == i)}")
Expected output:
Cluster labels (first 10): [0 0 2 1 0 3 1 2 0 3]
Cluster centroids:
Cluster 0: (2.15, 8.76) Size: 75
Cluster 1: (8.94, 4.17) Size: 75
Cluster 2: (2.58, 3.01) Size: 75
Cluster 3: (8.08, 8.84) Size: 75
K-Means assigns each point to the nearest centroid. The algorithm converges when centroids stop moving.
DBSCAN
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels_db = dbscan.fit_predict(X_moons)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = list(labels_db).count(-1)
print(f"Number of clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Cluster labels: {np.unique(labels_db)}")
Expected output:
Number of clusters found: 2
Noise points: 0
Cluster labels: [0 1]
DBSCAN finds two crescent-shaped clusters without being told the number. It does this by connecting dense regions of points.
import pandas as pd
results = pd.DataFrame({
'x': X_moons[:, 0],
'y': X_moons[:, 1],
'cluster': labels_db
})
print("Cluster distribution:")
print(results['cluster'].value_counts().sort_index())
Expected output:
Cluster distribution:
0 150
1 150
DBSCAN perfectly splits the two interleaved moon shapes that K-Means would fail on due to its spherical cluster assumption.
Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
X_sample = X[:20]
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hierarchical.fit_predict(X_sample)
Z = linkage(X_sample, method='ward')
print("Last 5 Merge steps (linkage matrix):")
print(Z[-5:].round(2))
Expected output:
Last 5 merge steps (linkage matrix):
[[ 2. 9. 0.27 2. ]
[ 6. 17. 0.25 2. ]
[ 7. 22. 0.44 3. ]
[ 5. 21. 0.85 4. ]
[14. 23. 1.12 5. ]]
Each row represents a Merge: first two columns are the clusters merged, third column is the distance, fourth is the size of the new cluster.
Comparing Clustering on Non-Spherical Data
kmeans_moons = KMeans(n_clusters=2, random_State=42, n_init=10)
km_labels = kmeans_moons.fit_predict(X_moons)
km_correct = max(
np.mean(km_labels == labels_db),
np.mean(km_labels != labels_db)
)
db_correct = np.mean(labels_db == labels_db)
print(f"K-Means on moons (correct proportion): {km_correct:.3f}")
print(f"DBSCAN on moons (correct proportion): {db_correct:.3f}")
Expected output:
K-Means on moons (correct proportion): 0.500
DBSCAN on moons (correct proportion): 1.000
K-Means achieves only 50% on the moons dataset because it splits points by Euclidean distance to centroids, which ignores the actual curved shape. DBSCAN finds the true structure.
Practice Questions
- Why does K-Means fail on the moon dataset while DBSCAN succeeds?
- What does the eps parameter in DBSCAN control?
- How do you determine the optimal K in K-Means without ground truth labels?
Frequently Asked Questions
Related Topics
- Python — running all examples
- Scikit-Learn Guide — provides all implementations
- What is Machine Learning — foundational concepts
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro