Model Evaluation: Metrics and Validation Techniques

DodaTech Updated Jun 20, 2026 8 min read

Model evaluation is the process of measuring how well a machine learning model performs — using the right metrics, validation strategies, and diagnostic techniques to understand its strengths and weaknesses.

What You’ll Learn

By the end of this tutorial, you’ll understand classification metrics (accuracy, precision, recall, F1, ROC-AUC, confusion matrix), regression metrics (MSE, MAE, R²), cross-validation methods, overfitting/underfitting detection, and the bias-variance tradeoff. Prerequisites: Python and Machine Learning basics.

Why It Matters

A model with 95% accuracy on a cancer detection task can still be useless: if only 5% of patients are positive, predicting "negative" for everyone scores 95% while missing 100% of positive cases. Choosing the wrong metric leads to deploying models that fail in production.

Real-World Use

Durga Antivirus Pro evaluates threat detection models using precision and recall — a false positive (flagging a safe file) annoys users, but a false negative (missing a real threat) is catastrophic.

Evaluation Pipeline


flowchart LR
  A[Raw Data] --> B[Train/Test Split]
  B --> C[Training Set]
  B --> D[Test Set]
  C --> E[Train Model]
  E --> F[Validate]
  F --> G{Tuned?}
  G -->|No| H[Tune Hyperparams]
  H --> E
  G -->|Yes| I[Final Evaluation]
  D --> I
  I --> J[Metrics Report]
  I --> K[Confusion Matrix]
  I --> L[ROC Curve]

Prerequisites: Python basics, Machine Learning fundamentals, Hyperparameter Tuning concepts.

Classification Metrics

Confusion Matrix

The confusion matrix is the foundation of classification evaluation. It shows four outcomes:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix:")
print(cm)
print(f"\nTrue Positives: {tp}  (correctly predicted positive)")
print(f"True Negatives: {tn}  (correctly predicted negative)")
print(f"False Positives: {fp} (Type I error — false alarm)")
print(f"False Negatives: {fn} (Type II error — missed detection)")

Expected output:

Confusion Matrix:
[[4 1]
 [1 4]]

True Positives: 4  (correctly predicted positive)
True Negatives: 4  (correctly predicted negative)
False Positives: 1 (Type I error — false alarm)
False Negatives: 1 (Type II error — missed detection)

Accuracy, Precision, Recall, F1

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")

Expected output:

Accuracy:  0.800
Precision: 0.800
Recall:    0.800
F1 Score:  0.800
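
These scores follow directly from the four confusion-matrix counts. A minimal sketch in plain Python, reusing the counts from the confusion matrix example above (the variable names are illustrative, and this is not how scikit-learn implements the metrics internally):

```python
# Counts taken from the confusion matrix example: TP=4, TN=4, FP=1, FN=1
tp, tn, fp, fn = 4, 4, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

All four values come out to 0.800 here only because the errors are symmetric (one false positive, one false negative); on real data they usually differ.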

When should you use each metric?

Accuracy — when classes are balanced and errors have equal cost
Precision — when false positives are costly (spam detection: don’t mark real email as spam)
Recall — when false negatives are costly (cancer screening: don’t miss a real case)
F1 — harmonic mean of precision and recall, good when you need both
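
To see why F1 uses the harmonic mean rather than a plain average, the sketch below compares the two for a balanced and a lopsided precision/recall pair. The helper name `f1_from` and the input values are made up for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.80, 0.80)        # precision and recall agree
lopsided = f1_from(0.90, 0.10)        # high precision, very low recall
arithmetic_mean = (0.90 + 0.10) / 2   # what a plain average would report

print(f"Balanced F1:      {balanced:.2f}")
print(f"Lopsided F1:      {lopsided:.2f}")
print(f"Arithmetic mean:  {arithmetic_mean:.2f}")
```

The lopsided case scores 0.18 on F1 but 0.50 on a plain average, so F1 only rewards a model when both precision and recall are reasonably high.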

ROC-AUC

ROC-AUC measures the model’s ability to distinguish between classes across all classification thresholds.

from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

# True labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

print("ROC Curve points:")
for i in range(len(fpr)):
    print(f"  Threshold={thresholds[i]:.2f}: FPR={fpr[i]:.3f}, TPR={tpr[i]:.3f}")
print(f"AUC: {auc:.3f}")
print("\nInterpretation:")
print("  AUC = 1.0 → perfect classifier")
print("  AUC = 0.5 → random guessing")
print("  AUC < 0.5 → worse than random")

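Expected output (scikit-learn 1.3 and later report the first threshold as inf; older versions print 1.80 there):

ROC Curve points:
  Threshold=inf: FPR=0.000, TPR=0.000
  Threshold=0.80: FPR=0.000, TPR=0.500
  Threshold=0.40: FPR=0.500, TPR=0.500
  Threshold=0.35: FPR=0.500, TPR=1.000
  Threshold=0.10: FPR=1.000, TPR=1.000
AUC: 0.750

Interpretation:
  AUC = 1.0 → perfect classifier
  AUC = 0.5 → random guessing
  AUC < 0.5 → worse than random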
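
One way to read the AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; `pairwise_auc` is a hypothetical helper written for this illustration, not a scikit-learn function, and it is quadratic in the number of samples.

```python
def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie: counts as a coin flip
    return wins / (len(pos) * len(neg))

# Same data as the ROC example: 3 of the 4 positive/negative pairs are ordered correctly
auc_example = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A model that gives every sample the same score cannot separate the classes
auc_constant = pairwise_auc([0, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])

print(f"Example AUC:  {auc_example:.2f}")
print(f"Constant AUC: {auc_constant:.2f}")
```

The first result matches the 0.750 reported for the ROC example, and the constant-score model lands at 0.50 regardless of how imbalanced the labels are.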

Regression Metrics

For regression problems (predicting continuous values), use these metrics:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [100, 200, 300, 400, 500]
y_pred = [110, 190, 310, 390, 480]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rmse = mse ** 0.5

print(f"Actual:     {y_true}")
print(f"Predicted:  {y_pred}")
print(f"\nMSE:  {mse:.1f}   (squared error, penalizes large errors)")
print(f"RMSE: {rmse:.1f}   (square root of MSE, in the target's units)")
print(f"MAE:  {mae:.1f}   (average absolute error, less sensitive to outliers)")
print(f"R²:   {r2:.3f}   (proportion of variance explained)")

Expected output:

Actual:     [100, 200, 300, 400, 500]
Predicted:  [110, 190, 310, 390, 480]

MSE:  160.0   (squared error, penalizes large errors)
RMSE: 12.6   (square root of MSE, in the target's units)
MAE:  12.0   (average absolute error, less sensitive to outliers)
R²:   0.992   (proportion of variance explained)

When to use each:

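MSE/RMSE: when large errors are disproportionately bad
MAE: when every unit of error should count the same, regardless of size
R²: to understand how much variance your model explains (1 is perfect, 0 matches always predicting the mean, and negative values mean the model is worse than that)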
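
The arithmetic behind these metrics is short enough to write out by hand. The sketch below recomputes them in plain Python for the same five data points and adds a deliberately bad set of predictions (an assumption made for this illustration) to show that R² can go negative.

```python
y_true = [100, 200, 300, 400, 500]
y_pred = [110, 190, 310, 390, 480]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]   # [-10, 10, -10, 10, 20]

mse = sum(e ** 2 for e in errors) / n              # mean of squared errors
rmse = mse ** 0.5                                  # back in the target's units
mae = sum(abs(e) for e in errors) / n              # mean of absolute errors

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                   # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total variation around the mean
r2 = 1 - ss_res / ss_tot

# Illustrative worst case: predictions in reverse order
y_bad = list(reversed(y_true))
ss_res_bad = sum((t - p) ** 2 for t, p in zip(y_true, y_bad))
r2_bad = 1 - ss_res_bad / ss_tot

print(f"MSE={mse:.1f} RMSE={rmse:.2f} MAE={mae:.1f} R2={r2:.3f}")
print(f"R2 for reversed predictions: {r2_bad:.1f}")
```

The single 20-unit error contributes half of the squared error total (400 of 800) but only a third of the absolute error total (20 of 60), which is why RMSE ends up above MAE.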

Cross-Validation

Cross-validation provides a more reliable estimate of model performance than a single train-test split.

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print("K-Fold Cross-Validation Results:")
print(f"  Folds: {[f'{s:.3f}' for s in scores]}")
print(f"  Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Example output (illustrative values; your exact fold scores depend on the data and scikit-learn version):

K-Fold Cross-Validation Results:
  Folds: ['0.930', '0.920', '0.940', '0.910', '0.925']
  Mean: 0.925 (+/- 0.021)
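
Under the hood, k-fold splitting is just index bookkeeping: shuffle the sample indices, cut them into k chunks, and let each chunk serve as the validation set once. The sketch below shows that idea with the standard library only; `kfold_indices` is a hypothetical helper, and its shuffle order will not match scikit-learn's KFold.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {i}: train={sorted(train_idx)} val={sorted(val_idx)}")
```

Each model is trained on the train indices and scored on the val indices of its fold, and the per-fold scores are then averaged.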

Stratified K-Fold

For imbalanced datasets, stratified k-fold preserves the class distribution in each fold:

stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_scores = cross_val_score(model, X, y, cv=stratified, scoring='accuracy')

print(f"Stratified K-Fold Mean: {strat_scores.mean():.3f}")
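
To make "preserves the class distribution" concrete, the sketch below deals the samples of each class out to the folds round-robin, so every fold ends up with the same share of positives. This is a simplified stand-in written for illustration (`stratified_folds` is a hypothetical helper with no shuffling), not scikit-learn's StratifiedKFold algorithm.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        for position, idx in enumerate(idxs):
            folds[position % n_splits].append(idx)
    return folds

# Imbalanced toy labels: 10 positives, 90 negatives
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, n_splits=5)

positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(f"Fold sizes:         {[len(fold) for fold in folds]}")
print(f"Positives per fold: {positives_per_fold}")
```

Every fold holds 20 samples with exactly 2 positives, matching the 10% positive rate of the full dataset. A plain random split could easily leave a fold with no positives at all.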

Detecting Overfitting and Underfitting

import numpy as np

# The 0.6 and 0.15 cutoffs below are illustrative rules of thumb, not universal thresholds.

def detect_fit_status(train_scores, val_scores):
    train_mean = np.mean(train_scores)
    val_mean = np.mean(val_scores)
    gap = train_mean - val_mean

    if val_mean < 0.6:
        return "Underfitting (model too simple)"
    elif gap > 0.15:
        return f"Overfitting (gap={gap:.2f})"
    else:
        return f"Good fit (gap={gap:.2f})"

# Example scenarios
underfitting = detect_fit_status([0.55, 0.58, 0.60], [0.52, 0.54, 0.55])
overfitting = detect_fit_status([0.99, 0.98, 1.00], [0.78, 0.82, 0.80])
good_fit = detect_fit_status([0.88, 0.90, 0.89], [0.86, 0.87, 0.86])

print(f"Underfitting: {underfitting}")
print(f"Overfitting:  {overfitting}")
print(f"Good fit:     {good_fit}")

Expected output:

Underfitting: Underfitting (model too simple)
Overfitting:  Overfitting (gap=0.19)
Good fit:     Good fit (gap=0.03)

Bias-Variance Tradeoff

High bias (underfitting) — model is too simple, misses patterns. Training and validation errors are both high.
High variance (overfitting) — model memorizes training data, fails on new data. Low training error, high validation error.
Optimal — model captures the underlying pattern without memorizing noise.

Aspect            High Bias                    High Variance
Training error    High                         Very low
Validation error  High                         High (much higher than train)
Cause             Model too simple             Model too complex
Fix               More features, deeper model  Regularization, more data, simpler model

Common Evaluation Errors

1. Using Accuracy on Imbalanced Data

If 99% of emails are legitimate, a model that always predicts “not spam” gets 99% accuracy but is useless. Use precision, recall, and F1 for imbalanced datasets.
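
The numbers in this scenario are easy to reproduce. The sketch below builds a toy dataset of 1,000 emails with 10 spam messages (the sizes are assumptions chosen to match the 99% figure) and scores a "model" that always predicts "not spam".

```python
# 1 = spam, 0 = legitimate; 1% of the emails are spam
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000          # always predict "not spam"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)      # share of real spam that was caught

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

Accuracy is 0.99 while recall is 0.00. Precision is undefined here because the model never predicts the positive class, which is itself a warning sign.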

2. Data Leakage

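Data leakage happens when information from the test set influences training. Examples: fitting a scaler on the full dataset before splitting, or using future data to predict the past. Always split first, then fit preprocessing steps on the training set only and apply them unchanged to the test set.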
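
A minimal sketch of the leakage-free order of operations, using standardization as the preprocessing step. The data values and the positional split are made up for illustration; a real pipeline would shuffle before splitting.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]

# 1. Split first
train, test = values[:8], values[8:]

# 2. Fit the scaler (mean and population standard deviation) on the training set only
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)

def scale(xs):
    """Standardize values using statistics learned from the training set."""
    return [(x - train_mean) / train_std for x in xs]

# 3. Apply the same fitted transformation to both sets
train_scaled = scale(train)
test_scaled = scale(test)

# For contrast: the mean a leaky pipeline would use if it saw the test set too
leaky_mean = statistics.mean(values)

print(f"Train-only mean: {train_mean}")
print(f"Leaky mean:      {leaky_mean}")
```

The outlier in the test set pulls the leaky mean from 4.5 up to 14.5, so a scaler fitted on all the data would quietly carry test-set information into training.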

3. Tuning on Test Data

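Evaluating multiple models on the test set and picking the best one turns the test set into a tuning set, so the reported score becomes optimistically biased. The test set must be used exactly once. Use a separate validation set (or cross-validation) for tuning.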
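
One way to keep the test set untouched is a three-way split: train on one part, compare candidates on a second, and score only the winner on the third. The sketch below shows the bookkeeping; the helper name, the 60/20/20 proportions, and the validation scores are all hypothetical.

```python
import random

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle indices once and cut them into train, validation, and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)

# Hypothetical validation scores for three candidate models
val_scores = {"model_a": 0.81, "model_b": 0.86, "model_c": 0.84}
best_model = max(val_scores, key=val_scores.get)

print(f"Sizes: train={len(train_idx)} val={len(val_idx)} test={len(test_idx)}")
print(f"Selected on validation data: {best_model}")
# Only best_model would now be scored on test_idx, and only once.
```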

4. Ignoring Confidence Intervals

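A model with 87% accuracy might be no better than one with 86% if the confidence intervals overlap. Always report the spread (for example, the standard deviation across folds or a confidence interval) alongside the point estimate.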
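
As a rough check, the sketch below computes a normal-approximation 95% confidence interval for each accuracy. The test set size of 1,000 is an assumption made for this example, and `accuracy_interval` is a hypothetical helper. The approximation is only reasonable for fairly large test sets and accuracies not too close to 0 or 1.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n_samples."""
    stderr = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * stderr, accuracy + z * stderr

low_a, high_a = accuracy_interval(0.87, 1000)
low_b, high_b = accuracy_interval(0.86, 1000)
overlap = low_a <= high_b and low_b <= high_a

print(f"Model A: {low_a:.3f} to {high_a:.3f}")
print(f"Model B: {low_b:.3f} to {high_b:.3f}")
print(f"Intervals overlap: {overlap}")
```

With 1,000 test samples each interval is about 2 percentage points wide on either side, so a 1-point gap between the two models sits well inside the noise.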

5. Reporting Only One Metric

Accuracy tells you nothing about precision. Always report a suite of metrics that give a complete picture of model performance.

Practice Questions

1. What’s the difference between precision and recall? Precision = TP / (TP + FP) — how many positive predictions are correct. Recall = TP / (TP + FN) — how many actual positives are found.

2. When would you use MAE over RMSE? When you don’t want to disproportionately penalize large errors, for example when the data contains outliers. Both metrics are expressed in the same units as the target variable.

3. What is the purpose of cross-validation? To evaluate model performance more reliably by using multiple train-test splits, reducing the variance of the performance estimate.

4. How do you detect overfitting? Large gap between training and validation performance. Training accuracy is high but validation accuracy is significantly lower.

5. Challenge: Evaluate three models on an imbalanced dataset. Find an imbalanced dataset (e.g., credit card fraud). Train Logistic Regression, Random Forest, and XGBoost. Compare them using accuracy, precision, recall, F1, and ROC-AUC. Which metric tells the real story?

FAQ

What's a good F1 score?

As a rough rule of thumb, F1 > 0.9 is excellent, > 0.8 is good, and > 0.7 is acceptable. The right target depends on the task: for difficult problems or heavily imbalanced data, lower scores may still be valuable.

Should I use k-fold or train-test split?

Use k-fold for small datasets (n < 10,000) — it gives a more reliable estimate. Use a single train-test split for large datasets where training multiple models is expensive.

What's the difference between validation and test sets?

The validation set is used during development to tune hyperparameters. The test set is used exactly once at the end to report final performance.

Can a model have high accuracy but low AUC?

Yes. On imbalanced data, a model can achieve high accuracy by always predicting the majority class, but AUC will be ~0.5 because it can’t distinguish classes.


Mini Project: Model Evaluation Dashboard

Build a Python script that takes model predictions and ground truth, computes all classification or regression metrics, generates a confusion matrix plot and ROC curve, and outputs an HTML report. Security angle: Durga Antivirus Pro continuously evaluates its threat detection models using this exact methodology, tracking precision (to avoid false alarms) and recall (to catch all threats).
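
A possible starting point for the project is sketched below. It covers only the classification metrics and the HTML output; the plots are left out, and the function names and report layout are hypothetical choices rather than a prescribed design.

```python
from html import escape

def classification_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def render_html_report(title, metrics):
    """Render the metrics dictionary as a small HTML table."""
    rows = "\n".join(
        f"    <tr><td>{escape(name)}</td><td>{value:.3f}</td></tr>"
        for name, value in metrics.items()
    )
    return (
        "<html><body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        "  <table>\n"
        f"{rows}\n"
        "  </table>\n"
        "</body></html>"
    )

# Same labels and predictions as the confusion matrix example
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
report = render_html_report("Model Evaluation Report", metrics)
print(report)
```

From here the report string could be written to a file, and the confusion matrix and ROC plots added as embedded images.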

What’s Next

Congratulations on completing this Model Evaluation tutorial! Here’s where to go from here:

Practice daily: Evaluate every model you train with multiple metrics
Build a project: Create a reusable evaluation toolkit with automated reporting
Explore related topics: Check out Hyperparameter Tuning to optimize your models further

Review: Hyperparameter Tuning

Review: MLOps Guide

Review: ML Model Deployment

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
