Deploying ML Models to Production — Step-by-Step Guide

DodaTech 3 min read

In this tutorial, you'll learn how to deploy ML models to production, with the key concepts, practical examples, and best practices needed to apply them.

Deploying a machine learning model to production means making the trained model accessible through an API so that applications can send data and receive predictions in real time.

What You'll Learn

How to wrap an ML model in a REST API, containerize it with Docker, deploy to a cloud server, and set up monitoring for drift and performance.

Why It Matters

A model in a Jupyter notebook has zero business value. Deployment is what turns your work into a functioning product. Without it, the best model is just a proof of concept.

Real-World Use

Durga Antivirus Pro deploys multiple ML models that scan file signatures in real time. Each model runs as a microservice behind a load balancer handling thousands of requests per second.

Deployment Architecture

flowchart LR
    A[Client App] --> B[Load Balancer]
    B --> C[API Gateway]
    C --> D[Model Service 1]
    C --> E[Model Service 2]
    C --> F[Model Service 3]
    D --> G[(Model Artifact)]
    D --> H[Prediction Cache]
    D --> I[Metrics / Monitoring]
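
The Prediction Cache box in the diagram can be pictured as a lookup keyed by the input features. The sketch below is one possible in-process version, assuming identical feature vectors recur often enough to be worth caching. `PredictionCache`, `max_size`, and `get_or_compute` are hypothetical names, and `fake_model` is a stand-in for a real model; a deployment with several model services might prefer a shared external cache so that every service sees the same entries.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class PredictionCache:
    """Tiny LRU cache keyed by the feature vector (hypothetical helper)."""

    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, features: Sequence[float],
                       compute: Callable[[Sequence[float]], int]) -> int:
        key = tuple(features)  # lists are unhashable, tuples are not
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = compute(features)
        self._store[key] = result
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result


# Stand-in for model.predict: any callable that takes the features works here.
def fake_model(features):
    return int(sum(features) > 1.0)


cache = PredictionCache(max_size=2)
first = cache.get_or_compute([0.5, 0.9], fake_model)
second = cache.get_or_compute([0.5, 0.9], fake_model)  # served from the cache
print(first, second, cache.hits, cache.misses)
```

Caching on exact feature values only pays off when inputs repeat, which is plausible for file signatures but less so for continuous sensor data.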

Step 1: Save the Trained Model

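import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data: 1,000 samples with 20 features each
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Fix the seed so that retraining produces the same forest
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Serialize the fitted model to disk
joblib.dump(model, 'model.pkl')
print("Model saved as model.pkl")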

Expected output:

Model saved as model.pkl

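joblib is the usual choice for scikit-learn models because it serializes the large NumPy arrays inside fitted estimators efficiently. Like pickle, a joblib file should only be loaded from a trusted source, ideally with the same scikit-learn version that produced it.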
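
Because a serialized estimator is tied to the environment that produced it, it can help to record some metadata next to the artifact. The following is a minimal sketch using only the standard library. The manifest layout and the `write_manifest` and `verify_manifest` helpers are assumptions for illustration, not a scikit-learn or joblib feature.

```python
import hashlib
import json
import platform
from pathlib import Path


def write_manifest(model_path: str, n_features: int) -> dict:
    """Write <model_path>.manifest.json describing the artifact (hypothetical layout)."""
    data = Path(model_path).read_bytes()
    manifest = {
        "file": Path(model_path).name,
        "sha256": hashlib.sha256(data).hexdigest(),  # detects corrupted or swapped files
        "size_bytes": len(data),
        "python_version": platform.python_version(),
        "n_features": n_features,  # lets the API validate its input length
    }
    Path(model_path + ".manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_manifest(model_path: str) -> bool:
    """Return True if the file on disk still matches the recorded checksum."""
    manifest = json.loads(Path(model_path + ".manifest.json").read_text())
    actual = hashlib.sha256(Path(model_path).read_bytes()).hexdigest()
    return actual == manifest["sha256"]
```

After Step 1, calling `write_manifest('model.pkl', n_features=20)` would create `model.pkl.manifest.json`, and the API could call `verify_manifest('model.pkl')` before loading. Recording `sklearn.__version__` in the same file would be a natural extension.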

Step 2: Create a Prediction API with FastAPI

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Model API")

# Load the model once at startup rather than on every request
model = joblib.load('model.pkl')

class InputData(BaseModel):
    features: list[float]

class Prediction(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=Prediction)
def predict(data: InputData):
    # The model was trained on 20 features, so reject any other length
    if len(data.features) != 20:
        raise HTTPException(status_code=400, detail="Expected 20 features")
    # scikit-learn expects a 2D array of shape (n_samples, n_features)
    X = np.array(data.features).reshape(1, -1)
    pred = model.predict(X)[0]
    # Report the probability of the most likely class
    prob = model.predict_proba(X).max()
    return Prediction(prediction=int(pred), probability=float(prob))

To test the API with FastAPI's test client:

from fastapi.testclient import TestClient

client = TestClient(app)
response = client.post("/predict", json={"features": [0.1]*20})
print(response.json())

Example output (exact values depend on the trained model):

{'prediction': 1, 'probability': 0.92}
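
The handler above only checks the number of features. A slightly stricter check could also reject non-finite values before they reach the model. This is a sketch in plain Python; `validate_features` and `EXPECTED_FEATURES` are hypothetical names, and in the FastAPI handler the returned message could be passed to `HTTPException` as the `detail`.

```python
import math
from typing import Optional, Sequence

EXPECTED_FEATURES = 20  # matches n_features used when training in Step 1


def validate_features(features: Sequence[float]) -> Optional[str]:
    """Return an error message, or None if the input looks usable."""
    if len(features) != EXPECTED_FEATURES:
        return f"Expected {EXPECTED_FEATURES} features, got {len(features)}"
    for i, value in enumerate(features):
        # NaN and +/-inf pass a float type check but are meaningless to the model
        if not math.isfinite(value):
            return f"Feature {i} is not a finite number"
    return None


print(validate_features([0.1] * 20))                   # None
print(validate_features([0.1] * 19))                   # Expected 20 features, got 19
print(validate_features([0.1] * 19 + [float("nan")]))  # Feature 19 is not a finite number
```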

Step 3: Containerize with Docker

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl app.py ./
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

docker build -t ml-model-api .
docker run -d -p 8000:8000 ml-model-api
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}'

Example output (exact values depend on the trained model):

{"prediction":1,"probability":0.92}

The model is now accessible from any application via HTTP.
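
For instance, another Python service could call the endpoint with only the standard library. The sketch below assumes the container started by the `docker run` command is listening on `localhost:8000`; `build_predict_request` and `request_prediction` are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(features, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_predict_request(features, base_url)
    # A timeout keeps a slow model service from hanging the caller indefinitely
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request needs no running server; sending it does.
req = build_predict_request([0.1] * 20)
print(req.get_method(), req.full_url)
```

With the container running, `request_prediction([0.1] * 20)` would return a dictionary with the `prediction` and `probability` keys defined by the API's response model.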

Model Monitoring

import random
from datetime import datetime

class ModelMonitor:
    """Collects per-request latency, predictions, and an error count in memory."""

    def __init__(self):
        self.latencies = []
        self.predictions = []
        self.errors = 0

    def log_prediction(self, latency_ms, prediction, confidence):
        self.latencies.append(latency_ms)
        self.predictions.append({
            'timestamp': datetime.now().isoformat(),
            'prediction': prediction,
            'confidence': confidence
        })

    def log_error(self):
        # Count failed requests so that report() reflects real error volume
        self.errors += 1

    def report(self):
        # Guard against division by zero when nothing has been logged yet
        avg_latency = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        print(f"Total predictions: {len(self.predictions)}")
        print(f"Average latency: {avg_latency:.2f}ms")
        print(f"Error count: {self.errors}")

# Simulate 100 requests with random latencies, predictions, and confidences
monitor = ModelMonitor()
for _ in range(100):
    latency = random.uniform(5, 50)
    monitor.log_prediction(latency, random.randint(0, 1), random.uniform(0.7, 0.99))

monitor.report()

Example output (the average latency varies between runs because the inputs are random):

Total predictions: 100
Average latency: 27.34ms
Error count: 0
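
An average can hide the latency spikes that monitoring is meant to catch, so tail percentiles are often tracked alongside it. Below is a minimal sketch using Python's `statistics.quantiles`; the `latency_summary` helper is a hypothetical addition and not part of the `ModelMonitor` class above.

```python
import statistics


def latency_summary(latencies_ms):
    """Return mean, median, and 95th-percentile latency in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to estimate percentiles")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
    }


# Mostly fast requests plus a few slow outliers
samples = [10.0] * 95 + [400.0] * 5
summary = latency_summary(samples)
print(summary)
```

In this made-up sample the mean stays under 30 ms while the 95th percentile is far higher, which is the kind of gap an average-only report would miss.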

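Production monitoring aims to surface latency spikes, error bursts, and data drift before your users notice. The monitor above covers latency and error counts; detecting drift additionally requires comparing the distribution of incoming features or predictions against the training data.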
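
One common heuristic for spotting data drift is the Population Stability Index (PSI), which compares how a feature's values fall into bins at training time versus in production. The sketch below is a plain-Python illustration under simplifying assumptions: equal-width bins over the reference range, and a small epsilon to avoid taking the log of zero. The function name, bin count, and synthetic data are hypothetical choices, and any threshold for what counts as drift should be tuned per feature.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature (larger means more shift)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        return [c / len(values) for c in counts]

    expected = bin_fractions(reference)
    actual = bin_fractions(current)
    # Each term is non-negative: the difference and the log share the same sign
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )


# Synthetic example: one feature at training time, then two production samples
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print(f"no drift: {population_stability_index(train, same):.3f}")
print(f"shifted:  {population_stability_index(train, shifted):.3f}")
```

The same comparison can be run on the model's predicted probabilities, which is useful when ground-truth labels arrive too late to measure accuracy directly.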

Practice Questions

Why is joblib preferred over pickle for saving Scikit-Learn models?
What is the purpose of a load balancer in model deployment?
How would you detect model drift in production?

Frequently Asked Questions

Should I use Flask or FastAPI for ML model deployment?

FastAPI is generally preferred because it provides automatic OpenAPI documentation, request validation with Pydantic, and async support out of the box. Flask is simpler but requires more manual setup for the same features.

How do I handle model versioning in production?

Store models in a registry like MLflow or DVC with version tags. Your API can accept a model version parameter, or you can use a blue-green deployment strategy in which the new version runs alongside the old one until it has been validated.
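
The version-parameter approach might look like the sketch below: an in-memory registry that maps version tags to loaded models and keeps a pointer to the active one, so that traffic can be switched once the new version is validated. `ModelRegistry` and its methods are hypothetical names for illustration; this is not the MLflow or DVC API, and the lambdas stand in for real models.

```python
from typing import Any, Dict, Optional, Tuple


class ModelRegistry:
    """Maps version tags to loaded models and tracks which one is active."""

    def __init__(self):
        self._models: Dict[str, Any] = {}
        self._active: Optional[str] = None

    def register(self, version: str, model: Any) -> None:
        self._models[version] = model
        if self._active is None:
            self._active = version  # the first registered version serves by default

    def promote(self, version: str) -> None:
        """Make an already registered version the default for new requests."""
        if version not in self._models:
            raise KeyError(f"unknown model version: {version}")
        self._active = version

    def resolve(self, version: Optional[str] = None) -> Tuple[str, Any]:
        """Return (version, model), using the active version when none is given."""
        chosen = version or self._active
        if chosen is None or chosen not in self._models:
            raise KeyError(f"unknown model version: {chosen}")
        return chosen, self._models[chosen]


registry = ModelRegistry()
registry.register("v1", lambda features: 0)  # stand-in for the current model
registry.register("v2", lambda features: 1)  # stand-in for the candidate model

print(registry.resolve()[0])      # v1: default traffic still uses the old model
print(registry.resolve("v2")[0])  # v2: callers can opt in explicitly
registry.promote("v2")
print(registry.resolve()[0])      # v2: default traffic switched after validation
```

Keeping the old version registered after promotion means a rollback is a single `promote("v1")` call.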
