ML Model Deployment: From Notebook to Production

DodaTech Updated Jun 20, 2026 7 min read

ML model deployment is the process of taking a trained machine learning model from a Jupyter notebook and making it available for real-world use — as an API, batch job, or embedded system.

What You’ll Learn

By the end of this tutorial, you’ll understand how to export models (ONNX, pickle, SavedModel), serve them with FastAPI, containerize with Docker, choose between batch and real-time inference, implement A/B testing, and monitor for drift. Prerequisites: Python, basic machine learning knowledge, and familiarity with Docker.

Why It Matters

A model in a notebook has zero value. A model serving predictions in production creates value. Most ML projects fail not because the model is bad, but because deployment is hard.

Real-World Use

Netflix’s recommendation model serves personalized suggestions to 200 million+ users. Each request hits a deployed model that returns top-10 picks in under 100ms.

Deployment Pipeline


flowchart LR
  A[Notebook] --> B[Export Model]
  B --> C[Create API]
  C --> D[Dockerize]
  D --> E[Deploy]
  E --> F[Monitor]
  F -->|Drift Detected| G[Retrain]
  G --> B
  B -->|ONNX/Pickle/SavedModel| B
  E -->|A/B Test| H[New Model]
  H --> E

Prerequisites: Python, basic Machine Learning, and familiarity with Docker.

Model Export Formats

Before deployment, you must export your trained model to a portable format.

import pickle
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a simple model
model = RandomForestClassifier(n_estimators=100)
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, 100)
model.fit(X_train, y_train)

# Export as pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Export as joblib (more efficient for scikit-learn)
joblib.dump(model, 'model.joblib')

# Load and verify
loaded = joblib.load('model.joblib')
test_input = np.random.rand(1, 4)
print(f"Prediction: {loaded.predict(test_input)[0]}")
print(f"Probability: {loaded.predict_proba(test_input)[0]}")

Example output (values vary from run to run because the training data is random):

Prediction: 1
Probability: [0.37 0.63]

ONNX Format

ONNX (Open Neural Network Exchange) is a cross-platform format that works across frameworks:

# Convert sklearn model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

print(f"ONNX model saved. Size: {len(onnx_model.SerializeToString())} bytes")

Example output (the exact size depends on the trained model and library versions):

ONNX model saved. Size: 185634 bytes

Serving with FastAPI

FastAPI is a popular choice for ML model serving: it is fast, supports async handlers, and auto-generates interactive API documentation.

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Model API")
model = joblib.load('model.joblib')

class InputData(BaseModel):
    features: list[float]

class Prediction(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=Prediction)
def predict(data: InputData):
    X = np.array(data.features).reshape(1, -1)
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0].max()
    return Prediction(prediction=int(pred), probability=float(prob))

@app.get("/health")
def health():
    return {"status": "healthy"}

Run with: uvicorn app:app --host 0.0.0.0 --port 8000

# Test the API
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4]}'

Example output (values depend on the trained model):

{"prediction": 1, "probability": 0.63}
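
The endpoint above reshapes whatever list it receives, so a request with too few or too many features only fails deep inside the model. Below is a minimal sketch of a pre-check, assuming the model expects exactly four finite numeric features (matching the training example). `validate_features` and `EXPECTED_FEATURES` are hypothetical names introduced for illustration and are not part of FastAPI.

```python
import math

EXPECTED_FEATURES = 4  # assumption: matches the 4-column training data above


def validate_features(features, expected=EXPECTED_FEATURES):
    """Return the features as floats, or raise ValueError with a clear message."""
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    cleaned = [float(value) for value in features]
    for index, value in enumerate(cleaned):
        if not math.isfinite(value):
            raise ValueError(f"feature {index} is not a finite number")
    return cleaned


print(validate_features([0.1, 0.2, 0.3, 0.4]))
try:
    validate_features([0.1, 0.2])
except ValueError as error:
    print(f"rejected: {error}")
```

Inside the /predict handler, one option is to catch the ValueError and raise fastapi.HTTPException(status_code=422, detail=str(error)) so the client gets a clear validation error.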

Docker Containerization

# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.joblib .
COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

# Build and run
docker build -t ml-api .
docker run -p 8000:8000 ml-api

Batch vs Real-Time Inference

| Aspect | Batch Inference | Real-Time Inference |
|---|---|---|
| Timing | Scheduled (hourly, daily) | On-demand (sub-second) |
| Latency | Minutes to hours | Milliseconds |
| Infrastructure | Spark, Airflow, batch jobs | FastAPI, Flask, serverless |
| Cost | Lower (can use spot instances) | Higher (always-on servers) |
| Use Case | Recommendations, reporting | Fraud detection, search |
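
To make the batch side of the comparison concrete, here is a small sketch of a batch job that scores rows in fixed-size chunks. The names `chunks`, `run_batch_job`, and `fake_predict` are illustrative assumptions. In a real job, `predict_fn` would be something like `model.predict`, and the rows would come from a file or database rather than a list.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_batch_job(predict_fn, rows, batch_size=1000):
    """Score all rows chunk by chunk and return predictions in input order."""
    predictions = []
    for batch in chunks(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


# Stand-in for model.predict: flags rows whose feature sum exceeds 5.
def fake_predict(batch):
    return [int(sum(row) > 5) for row in batch]


rows = [[i, 1] for i in range(10)]
print(run_batch_job(fake_predict, rows, batch_size=4))
```

Chunking keeps memory use bounded and lets the model score many rows per call, which is usually cheaper per prediction than handling one request at a time.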

A/B Testing Models

import random

import joblib
import numpy as np

class ModelRouter:
    """Route each request to model A or B and count requests per variant."""

    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = traffic_split  # fraction of traffic sent to model A
        # "success" stays 0 until outcomes are reported back from production
        self.metrics = {"A": {"requests": 0, "success": 0},
                        "B": {"requests": 0, "success": 0}}

    def predict(self, X):
        if random.random() < self.traffic_split:
            model_id = "A"
            pred = self.model_a.predict(X)
        else:
            model_id = "B"
            pred = self.model_b.predict(X)

        self.metrics[model_id]["requests"] += 1
        return model_id, pred

# Load both model versions from disk (assumes these files exist)
router = ModelRouter(joblib.load("model_v1.joblib"), joblib.load("model_v2.joblib"))
for _ in range(1000):
    model_id, pred = router.predict(np.random.rand(1, 4))
    # Log results to monitoring system
print(f"Traffic distribution: {router.metrics}")

Example output (the exact split varies from run to run):

Traffic distribution: {'A': {'requests': 498, 'success': 0}, 'B': {'requests': 502, 'success': 0}}
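
The router above picks a model at random on every request, so the same user can bounce between variants. A common alternative is to assign variants deterministically from a hash of a stable identifier. The sketch below assumes each request carries a user ID. `assign_variant` and the `salt` value are hypothetical names for illustration.

```python
import hashlib


def assign_variant(user_id, traffic_split=0.5, salt="exp-1"):
    """Map a user ID to "A" or "B" so the same user always gets the same model."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < traffic_split else "B"


counts = {"A": 0, "B": 0}
for user_number in range(10_000):
    counts[assign_variant(f"user-{user_number}")] += 1
print(counts)
```

Changing the salt reshuffles users into new buckets, which is useful when starting a fresh experiment.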

Monitoring Drift

Model performance degrades over time as data changes. Monitor these types of drift:

  • Data drift — input distribution changes (e.g., new customer demographics)
  • Concept drift — the relationship between inputs and outputs changes (e.g., fraud patterns evolve)
  • Prediction drift — output distribution changes

import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference_data, new_data, threshold=0.05):
    statistic, p_value = ks_2samp(reference_data, new_data)
    drift_detected = p_value < threshold
    return {
        "drift_detected": bool(drift_detected),
        "p_value": float(p_value),
        "statistic": float(statistic)
    }

# Reference distribution (training data)
reference = np.random.normal(0, 1, 1000)

# New data with drift
new_data_drifted = np.random.normal(0.5, 1, 1000)
new_data_normal = np.random.normal(0, 1, 1000)

print("With drift:", detect_drift(reference, new_data_drifted))
print("Without drift:", detect_drift(reference, new_data_normal))

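Example output (exact numbers vary from run to run because the samples are random):

With drift: drift_detected is True. With a mean shift of 0.5 and 1,000 samples per group, the KS statistic is typically around 0.2 and the p-value is many orders of magnitude below 0.05.
Without drift: drift_detected is usually False, with a small statistic (a few hundredths) and a p-value above 0.05. Because the threshold is a 5% significance level, roughly 1 run in 20 will still report drift by chance.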
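
The KS test is one option for a single numeric feature. Another widely used measure is the population stability index (PSI). Below is a minimal pure-Python sketch, assuming equal-width bins over the reference range and a small floor for empty bins. Both are simplifying choices rather than a standard, and `population_stability_index` is an illustrative name.

```python
import math
import random


def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, using equal-width bins over the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins

    def bin_fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width) if width > 0 else 0
            # Clamp values outside the reference range into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
reference = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]
shifted = [rng.gauss(0.5, 1) for _ in range(5000)]

print(f"PSI, same distribution:    {population_stability_index(reference, same):.4f}")
print(f"PSI, shifted distribution: {population_stability_index(reference, shifted):.4f}")
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions and should be tuned per feature.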

Serving Platforms

  • SageMaker — AWS’s managed service. Deploys models to managed endpoints with auto-scaling and built-in monitoring.
  • MLflow — Open-source. Model registry + serving. Great for experimentation-to-production workflows.
  • BentoML — Python-first. Package models with custom code, deploy to Kubernetes or serverless.
  • TensorFlow Serving — High-performance serving for TF models. Supports batching and versioning.

Common Deployment Errors

1. Environment Mismatch

Your laptop has Python 3.10 with specific library versions. The server runs Python 3.8. Always use Docker or specify exact versions in requirements.txt.
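
One way to catch a mismatch early is to record the training environment next to the model and compare it at startup. The sketch below uses only the standard library. `capture_environment` and `environment_mismatches` are hypothetical helper names, and the package list is an assumption based on the libraries used in this tutorial.

```python
import sys
from importlib import metadata


def capture_environment(packages):
    """Record the Python version and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "packages": versions,
    }


def environment_mismatches(recorded, current):
    """Return human-readable differences between two captured environments."""
    problems = []
    if recorded["python"] != current["python"]:
        problems.append(f"python: trained on {recorded['python']}, running {current['python']}")
    for name, version in recorded["packages"].items():
        found = current["packages"].get(name)
        if found != version:
            problems.append(f"{name}: trained on {version}, running {found}")
    return problems


# At export time: save this dict next to the model file (for example as JSON).
recorded = capture_environment(["scikit-learn", "numpy", "joblib"])

# At startup on the server: capture again and compare.
current = capture_environment(["scikit-learn", "numpy", "joblib"])
print(environment_mismatches(recorded, current) or "environment matches")
```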

2. Forgetting to Handle Preprocessing

The notebook pipeline scales and encodes features. The API endpoint must apply the same preprocessing. Package your preprocessor with the model.
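
As a rough illustration of packaging the preprocessor with the model, the sketch below fits standardization parameters at training time, serializes them, and applies the same parameters at serving time. `fit_scaler` and `apply_scaler` are illustrative names. With scikit-learn, the usual route is to wrap the scaler and the model in a `sklearn.pipeline.Pipeline` and save that single object.

```python
import json
import statistics


def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(col) for col in columns],
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        "std": [statistics.pstdev(col) or 1.0 for col in columns],
    }


def apply_scaler(params, row):
    """Standardize one row with parameters learned at training time."""
    return [(x - m) / s for x, m, s in zip(row, params["mean"], params["std"])]


# Training time: fit on training data and serialize alongside the model.
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
saved = json.dumps(fit_scaler(train_rows))

# Serving time: reload the same parameters and transform the incoming request.
params = json.loads(saved)
print(apply_scaler(params, [2.0, 20.0]))
```

The key point is that the serving code never recomputes the mean or standard deviation from live traffic; it reuses exactly what training produced.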

3. No Health Checks

Without health endpoints, orchestrators can’t tell if your model is alive. Always implement /health and /ready endpoints.
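
The difference between the two endpoints is easy to blur, so here is a framework-free sketch. `ModelService` is a hypothetical class, and the returned tuples stand in for an HTTP status code plus JSON body that a /health or /ready route would return.

```python
class ModelService:
    """Tracks whether the model has finished loading."""

    def __init__(self, loader):
        self._loader = loader
        self.model = None

    def load(self):
        self.model = self._loader()

    def health(self):
        # Liveness: the process is up and able to answer at all.
        return 200, {"status": "healthy"}

    def ready(self):
        # Readiness: only accept traffic once the model is in memory.
        if self.model is None:
            return 503, {"status": "loading"}
        return 200, {"status": "ready"}


service = ModelService(loader=lambda: "stand-in for joblib.load('model.joblib')")
print(service.health(), service.ready())  # alive, but not ready yet
service.load()
print(service.ready())
```

An orchestrator would typically restart a container that fails the liveness check, and simply withhold traffic from one that fails readiness.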

4. Synchronous Processing for Slow Models

If inference takes 10 seconds, synchronous requests block all workers. Use async endpoints, task queues (Celery), or batch processing.
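
The submit-then-poll pattern behind task queues can be sketched with the standard library alone. This is a simplified in-process stand-in and not a replacement for Celery: `submit_job`, `job_status`, and `slow_predict` are hypothetical names, and the in-memory `jobs` dict would be a durable store in a real service.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job_id -> Future; a real service would use a durable store


def slow_predict(features):
    time.sleep(0.2)  # stand-in for a model that takes seconds to run
    return sum(features)


def submit_job(features):
    """Start inference in the background and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(slow_predict, features)
    return job_id


def job_status(job_id):
    """Report whether a job is finished, and its result if so."""
    future = jobs[job_id]
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "result": future.result()}


job_id = submit_job([1, 2, 3])
print(job_status(job_id))        # usually still pending at this point
jobs[job_id].result(timeout=5)   # block here only for the demo
print(job_status(job_id))
executor.shutdown()
```

The client gets a job ID right away and polls for the result, so a slow model no longer ties up a request worker for the full inference time.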

5. Ignoring Cold Starts

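Serverless deployments (AWS Lambda) incur a cold start whenever a new instance has to initialize. This can add several seconds to the first request, and 5–10 seconds is plausible when large ML libraries and model files must load. For low-latency apps, use provisioned concurrency or always-on servers.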
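
Whatever the platform, it helps to pay the model-loading cost once per process, not on every request. The sketch below caches the loaded model with `functools.lru_cache`. `get_model`, `handler`, and the sleep that simulates a slow load are illustrative assumptions.

```python
import time
from functools import lru_cache

stats = {"loads": 0}


@lru_cache(maxsize=1)
def get_model():
    """Load the model once per process; later calls reuse the cached object."""
    stats["loads"] += 1
    time.sleep(0.1)  # stand-in for an expensive joblib.load(...)
    return {"name": "stand-in model"}


def handler(features):
    model = get_model()  # slow only on the first call in a fresh process
    return {"model": model["name"], "n_features": len(features)}


for _ in range(3):
    handler([0.1, 0.2, 0.3, 0.4])
print(f"model loaded {stats['loads']} time(s)")
```

This does not remove the first cold start, but it keeps every later request on a warm instance fast.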

6. Not Versioning Models

Deploying model_v2.pkl without a record of what changed makes debugging and rollbacks guesswork. Use a model registry (MLflow) with version tags, metadata, and rollback capability.
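
To show what a registry tracks, here is a toy in-memory version. It is a sketch of the idea only and does not reflect MLflow's API. `ModelRegistry` and the metadata values are hypothetical.

```python
import hashlib
import time


class ModelRegistry:
    """In-memory registry: version number -> artifact hash and metadata."""

    def __init__(self):
        self.versions = []
        self.active = None  # index into self.versions

    def register(self, artifact_bytes, **metadata):
        entry = {
            "version": len(self.versions) + 1,
            "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
            "registered_at": time.time(),
            "metadata": metadata,
        }
        self.versions.append(entry)
        return entry["version"]

    def promote(self, version):
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        self.active = version - 1

    def rollback(self):
        """Step back to the version registered just before the active one."""
        if self.active is None or self.active == 0:
            raise ValueError("no earlier version to roll back to")
        self.active -= 1

    def current(self):
        return None if self.active is None else self.versions[self.active]


registry = ModelRegistry()
v1 = registry.register(b"model-v1-bytes", note="baseline")
v2 = registry.register(b"model-v2-bytes", note="added features")
registry.promote(v2)
print("active:", registry.current()["version"], registry.current()["metadata"])
registry.rollback()
print("after rollback:", registry.current()["version"])
```

Storing a content hash with each version makes it possible to confirm that the file being served is the one that was registered.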

Practice Questions

1. What are the common model export formats? Pickle (Python-native), Joblib (efficient for sklearn), ONNX (cross-platform), SavedModel (TensorFlow), TorchScript (PyTorch).

2. What’s the difference between batch and real-time inference? Batch runs on a schedule processing large volumes. Real-time responds to individual requests in milliseconds.

3. How do you detect model drift? Monitor input distributions (data drift), prediction distributions, and performance metrics. Use statistical tests like KS-test or population stability index (PSI).

4. Why use Docker for model deployment? Docker ensures the same environment in development and production — same Python version, same libraries, same OS.

5. Challenge: Deploy a sentiment analysis model. Train a sentiment classifier, export it, create a FastAPI endpoint, Dockerize it, and deploy to a cloud platform. Add a /metrics endpoint for monitoring.

FAQ

What's the difference between MLflow and SageMaker?
MLflow is open-source and framework-agnostic — great for experiment tracking and model registry. SageMaker is AWS’s managed platform — easier deployment but vendor locked.
Should I use ONNX or native format?
Use native format for single-framework deployments. Use ONNX when you need cross-platform portability or want to optimize with ONNX Runtime.
How often should I retrain models?
Retrain when drift is detected, not on a fixed schedule. Monitor performance metrics and retrain when they fall below a threshold.
Can I deploy ML models on serverless?
Yes. AWS Lambda, Google Cloud Functions, and Azure Functions support ML inference. Be aware of cold starts and memory limits (max 10GB on Lambda).
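
The retraining rule from the FAQ above (retrain when a monitored metric falls below a threshold) can be sketched as a rolling-window check. `RetrainMonitor`, the window size, and the 0.8 accuracy floor are illustrative assumptions, and the sketch presumes ground-truth labels eventually arrive for each prediction.

```python
from collections import deque


class RetrainMonitor:
    """Track accuracy over the most recent labeled predictions."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # True for each correct prediction
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def should_retrain(self):
        # Wait for a full window so a few early mistakes do not trigger retraining.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.min_accuracy


monitor = RetrainMonitor(window=10, min_accuracy=0.8)
for _ in range(10):
    monitor.record(prediction=1, actual=1)
print(monitor.accuracy(), monitor.should_retrain())

for _ in range(3):
    monitor.record(prediction=1, actual=0)
print(monitor.accuracy(), monitor.should_retrain())
```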

Mini Project: Model Deployment Dashboard

Build a FastAPI app that serves a pre-trained model, logs predictions to a database, and exposes a dashboard showing prediction counts, confidence distributions, and drift alerts. Security angle: Durga Antivirus Pro uses deployed ML models to classify threats in real time — each file scan is an inference request against a threat detection model.

What’s Next

Congratulations on completing this Model Deployment tutorial! Here’s where to go from here:

  • Practice daily — Deploy a simple model to a free cloud tier
  • Build a project — Containerize and deploy a real ML service
  • Explore related topics — Check out MLOps for production ML workflows

Remember: every expert was once a beginner. Keep coding!

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro