ML Model Deployment: From Notebook to Production
ML model deployment is the process of taking a trained machine learning model from a Jupyter notebook and making it available for real-world use — as an API, batch job, or embedded system.
What You’ll Learn
By the end of this tutorial, you’ll understand how to export models (ONNX, pickle, SavedModel), serve them with FastAPI, containerize with Docker, choose between batch and real-time inference, implement A/B testing, and monitor for drift. Prerequisites: Python and basic Machine Learning knowledge.
Why It Matters
A model in a notebook has zero value. A model serving predictions in production creates value. Most ML projects fail not because the model is bad, but because deployment is hard.
Real-World Use
Netflix’s recommendation model serves personalized suggestions to 200 million+ users. Each request hits a deployed model that returns top-10 picks in under 100ms.
Deployment Pipeline
flowchart LR
    A[Notebook] --> B[Export Model]
    B --> C[Create API]
    C --> D[Dockerize]
    D --> E[Deploy]
    E --> F[Monitor]
    F -->|Drift Detected| G[Retrain]
    G --> B
    B -->|ONNX/Pickle/SavedModel| B
    E -->|A/B Test| H[New Model]
    H --> E
Model Export Formats
Before deployment, you must export your trained model to a portable format.
import pickle
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Train a simple model
model = RandomForestClassifier(n_estimators=100)
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, 100)
model.fit(X_train, y_train)
# Export as pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Export as joblib (more efficient for scikit-learn)
joblib.dump(model, 'model.joblib')
# Load and verify
loaded = joblib.load('model.joblib')
test_input = np.random.rand(1, 4)
print(f"Prediction: {loaded.predict(test_input)[0]}")
print(f"Probability: {loaded.predict_proba(test_input)[0]}")
Expected output (exact values vary between runs because the data is random):
Prediction: 1
Probability: [0.37 0.63]
ONNX Format
ONNX (Open Neural Network Exchange) is a cross-platform format that works across frameworks:
# Convert sklearn model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
print(f"ONNX model saved. Size: {len(onnx_model.SerializeToString())} bytes")
Expected output (the size depends on the trained model):
ONNX model saved. Size: 185634 bytes
Serving with FastAPI
FastAPI is the modern choice for ML model serving — it’s fast, async, and auto-generates documentation.
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI(title="ML Model API")
model = joblib.load('model.joblib')
class InputData(BaseModel):
    features: list[float]

class Prediction(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=Prediction)
def predict(data: InputData):
    # Reshape the flat feature list into a single-row 2-D array
    X = np.array(data.features).reshape(1, -1)
    pred = model.predict(X)[0]
    # Probability of the most likely class
    prob = model.predict_proba(X)[0].max()
    return Prediction(prediction=int(pred), probability=float(prob))

@app.get("/health")
def health():
    return {"status": "healthy"}
Run with: uvicorn app:app --host 0.0.0.0 --port 8000
# Test the API
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4]}'
Expected output (values depend on the trained model):
{"prediction": 1, "probability": 0.63}
Docker Containerization
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.joblib .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# Build and run
docker build -t ml-api .
docker run -p 8000:8000 ml-api
Batch vs Real-Time Inference
| Aspect | Batch Inference | Real-Time Inference |
|---|---|---|
| Timing | Scheduled (hourly, daily) | On-demand (sub-second) |
| Latency | Minutes to hours | Milliseconds |
| Infrastructure | Spark, Airflow, batch jobs | FastAPI, Flask, serverless |
| Cost | Lower (can use spot instances) | Higher (always-on servers) |
| Use Case | Recommendations, reporting | Fraud detection, search |
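The batch column of the table can be illustrated with a small sketch that scores a large array in fixed-size chunks, the way a scheduled job might. The names `ThresholdModel` and `score_in_batches` and the `batch_size` value are hypothetical; any object with a `predict` method, such as a model loaded with joblib, could take the place of the stand-in model.

```python
import numpy as np


class ThresholdModel:
    """Hypothetical stand-in for a trained model; anything with .predict works."""

    def predict(self, X):
        # Label a row 1 when the mean of its features exceeds 0.5
        return (X.mean(axis=1) > 0.5).astype(int)


def score_in_batches(model, X, batch_size=1000):
    """Score X chunk by chunk so memory use stays bounded."""
    outputs = []
    for start in range(0, len(X), batch_size):
        chunk = X[start:start + batch_size]
        outputs.append(model.predict(chunk))
    return np.concatenate(outputs) if outputs else np.empty(0, dtype=int)


rng = np.random.default_rng(0)
X_nightly = rng.random((2500, 4))  # e.g. all rows collected since the last run
batch_predictions = score_in_batches(ThresholdModel(), X_nightly, batch_size=1000)
print(batch_predictions.shape)  # (2500,)
```

In a real-time setup the same `predict` call would instead run once per request on a single row, which is why latency rather than throughput dominates the design.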
A/B Testing Models
import random
import joblib
import numpy as np

class ModelRouter:
    """Send each request to model A or model B according to traffic_split."""
    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = traffic_split
        # "success" is meant to be updated once the outcome of a prediction is known
        self.metrics = {"A": {"requests": 0, "success": 0},
                        "B": {"requests": 0, "success": 0}}

    def predict(self, X):
        # Roughly traffic_split of requests go to model A, the rest to model B
        if random.random() < self.traffic_split:
            model_id = "A"
            pred = self.model_a.predict(X)
        else:
            model_id = "B"
            pred = self.model_b.predict(X)
        self.metrics[model_id]["requests"] += 1
        return model_id, pred

# The router needs loaded model objects, not file paths
router = ModelRouter(joblib.load("model_v1.joblib"), joblib.load("model_v2.joblib"))
for _ in range(1000):
    model_id, pred = router.predict(np.random.rand(1, 4))
    # Log results to monitoring system
print(f"Traffic distribution: {router.metrics}")
Expected output (request counts vary between runs):
Traffic distribution: {'A': {'requests': 498, 'success': 0}, 'B': {'requests': 502, 'success': 0}}
Monitoring Drift
Model performance degrades over time as data changes. Monitor these types of drift:
- Data drift — input distribution changes (e.g., new customer demographics)
- Concept drift — the relationship between inputs and outputs changes (e.g., fraud patterns evolve)
- Prediction drift — output distribution changes
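Data drift, the first item in the list above, can also be summarized as a single number. The following is a minimal sketch of the population stability index (PSI); the function name `population_stability_index`, the bin count, and the `eps` smoothing constant are assumptions chosen for illustration, not part of any library.

```python
import numpy as np


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two 1-D samples, using bin edges from the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Quantile-based edges give each bin roughly equal reference mass
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip so values outside the reference range land in the outer bins
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # eps avoids log(0) and division by zero for empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(42)
reference_sample = rng.normal(0, 1, 5000)
psi_same = population_stability_index(reference_sample, rng.normal(0, 1, 5000))
psi_shifted = population_stability_index(reference_sample, rng.normal(0.5, 1, 5000))
print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with drift:    {psi_shifted:.4f}")
```

A PSI near zero means the two samples fill the bins in almost the same proportions, and larger values mean a bigger shift. Any alerting threshold is a convention that should be tuned per feature rather than taken as fixed.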
A two-sample Kolmogorov-Smirnov (KS) test is one way to flag data drift on a single numeric feature:
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference_data, new_data, threshold=0.05):
    # threshold is the significance level: a p-value below it is reported as drift
    statistic, p_value = ks_2samp(reference_data, new_data)
    drift_detected = p_value < threshold
    return {
        "drift_detected": bool(drift_detected),
        "p_value": float(p_value),
        "statistic": float(statistic)
    }

# Reference distribution (training data)
reference = np.random.normal(0, 1, 1000)
# New data with drift (mean shifted by 0.5)
new_data_drifted = np.random.normal(0.5, 1, 1000)
# New data without drift
new_data_normal = np.random.normal(0, 1, 1000)
print("With drift:", detect_drift(reference, new_data_drifted))
print("Without drift:", detect_drift(reference, new_data_normal))
Example output (illustrative only; the samples are random, so exact numbers vary, and a mean shift of 0.5 usually produces a statistic near 0.2 with a far smaller p-value):
With drift: {'drift_detected': True, 'p_value': 0.00012, 'statistic': 0.12}
Without drift: {'drift_detected': False, 'p_value': 0.45, 'statistic': 0.03}
Serving Platforms
- SageMaker: AWS’s managed ML service, with hosted endpoints, auto-scaling, and built-in monitoring.
- MLflow: Open-source model registry and serving. Great for experimentation-to-production workflows.
- BentoML: Python-first. Package models with custom code, deploy to Kubernetes or serverless.
- TensorFlow Serving: High-performance serving for TF models. Supports batching and versioning.
Common Deployment Errors
1. Environment Mismatch
Your laptop has Python 3.10 with specific library versions. The server runs Python 3.8. Always use Docker or specify exact versions in requirements.txt.
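Beyond pinning versions, a service can refuse to start when its runtime differs from what the model was trained with. The sketch below uses only the standard library; the function name `check_environment` and the idea of storing the expected versions next to the model are assumptions for illustration.

```python
import sys
from importlib import metadata


def check_environment(expected_python, expected_packages):
    """Return a list of human-readable mismatches (empty when everything matches)."""
    problems = []
    actual_python = f"{sys.version_info.major}.{sys.version_info.minor}"
    if actual_python != expected_python:
        problems.append(f"python: expected {expected_python}, found {actual_python}")
    for name, wanted in expected_packages.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


current_python = f"{sys.version_info.major}.{sys.version_info.minor}"
print(check_environment(current_python, {}))  # []
# A deliberately wrong expectation, using a made-up package name
print(check_environment("0.0", {"hypothetical-missing-package-xyz": "1.0"}))
```

Running such a check at startup turns a subtle prediction difference into an immediate, readable failure.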
2. Forgetting to Handle Preprocessing
The notebook pipeline scales and encodes features. The API endpoint must apply the same preprocessing. Package your preprocessor with the model.
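One way to package the preprocessor with the model is a scikit-learn `Pipeline`, so the serving code receives raw features and cannot skip a step. This is a minimal sketch on random data; the step names `scaler` and `model` are arbitrary, and an in-memory buffer stands in for a file on disk.

```python
import io

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((100, 4)) * 100  # unscaled raw features
y_train = rng.integers(0, 2, 100)

# Scaling and the classifier are fitted and saved as one object
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X_train, y_train)

# Serialize the whole pipeline; the buffer stands in for a .joblib file
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)
served = joblib.load(buffer)

# The serving side passes raw, unscaled input straight to predict
raw_request = rng.random((1, 4)) * 100
print(served.predict(raw_request))
```

Because the scaler's fitted statistics travel inside the artifact, the API endpoint needs no separate preprocessing code.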
3. No Health Checks
Without health endpoints, orchestrators can’t tell if your model is alive. Always implement /health and /ready endpoints.
4. Synchronous Processing for Slow Models
If inference takes 10 seconds, synchronous requests block all workers. Use async endpoints, task queues (Celery), or batch processing.
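The async option can be sketched with the standard library alone: the blocking model call runs in a worker thread, so the event loop keeps accepting other requests. Here `slow_predict` is a hypothetical stand-in for a slow model, and the pool size of 2 is arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)


def slow_predict(x):
    # Hypothetical stand-in for a slow, blocking model call
    time.sleep(0.2)
    return x * 2


async def predict_async(x):
    # Run the blocking call in a worker thread so the event loop stays free
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, slow_predict, x)


async def main():
    started = time.perf_counter()
    # Both calls overlap instead of running back to back
    results = await asyncio.gather(predict_async(1), predict_async(2))
    return results, time.perf_counter() - started


results, elapsed = asyncio.run(main())
print(results)  # [2, 4]
print(f"elapsed: {elapsed:.2f}s")
```

This keeps the server responsive, but it does not make the model itself faster. For jobs that run far longer than a request should stay open, a task queue that returns a job ID is usually the better fit.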
5. Ignoring Cold Starts
Serverless deployments (for example, AWS Lambda) can add cold starts of several seconds when a new instance has to load the runtime, the dependencies, and the model. For low-latency apps, use provisioned concurrency or always-on servers.
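Code structure cannot remove a cold start, but it can make sure the load cost is paid only once per instance. In the sketch below, `get_model` is a hypothetical loader whose `time.sleep` simulates an expensive load, and `handler` follows the `(event, context)` shape used by AWS Lambda handlers.

```python
import functools
import time


@functools.lru_cache(maxsize=1)
def get_model():
    # Hypothetical expensive load; a real handler might call joblib.load here
    time.sleep(0.1)
    return {"name": "demo-model"}


def handler(event, context=None):
    # The first call in a fresh instance pays the load cost; later calls reuse it
    model = get_model()
    return {"model": model["name"], "n_features": len(event["features"])}


started = time.perf_counter()
handler({"features": [0.1, 0.2, 0.3, 0.4]})
cold_seconds = time.perf_counter() - started

started = time.perf_counter()
response = handler({"features": [0.1, 0.2, 0.3, 0.4]})
warm_seconds = time.perf_counter() - started

print(f"first call: {cold_seconds:.3f}s, second call: {warm_seconds:.3f}s")
```

The second call skips the load entirely, which is the behavior provisioned concurrency aims to guarantee for the first request as well.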
6. Not Versioning Models
Deploying model_v2.pkl without knowing what changed makes regressions hard to trace and rollbacks risky. Use a model registry (MLflow) with version tags, metadata, and rollback capability.
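To show what a registry records, here is a deliberately tiny, file-based sketch using only the standard library. It is not how MLflow works internally; `register_model`, the JSON layout, and the example version string are assumptions for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def register_model(model_path, version, notes, registry_path):
    """Append a version record with a content hash to a JSON registry file."""
    model_path, registry_path = Path(model_path), Path(registry_path)
    record = {
        "version": version,
        "file": model_path.name,
        # The hash ties the record to the exact bytes that were deployed
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }
    records = json.loads(registry_path.read_text()) if registry_path.exists() else []
    records.append(record)
    registry_path.write_text(json.dumps(records, indent=2))
    return record


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model_v2.pkl"
    model_file.write_bytes(b"fake model bytes")  # placeholder for a real artifact
    registry_file = Path(tmp) / "registry.json"
    record = register_model(model_file, "2.0.0", "retrained on newer data", registry_file)
    stored = json.loads(registry_file.read_text())

print(stored[0]["version"], stored[0]["sha256"][:12])
```

With a record like this, a rollback means redeploying the file whose hash matches the previous entry, and the notes field answers the question of what changed.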
Practice Questions
1. What are the common model export formats? Pickle (Python-native), Joblib (efficient for sklearn), ONNX (cross-platform), SavedModel (TensorFlow), TorchScript (PyTorch).
2. What’s the difference between batch and real-time inference? Batch runs on a schedule processing large volumes. Real-time responds to individual requests in milliseconds.
3. How do you detect model drift? Monitor input distributions (data drift), prediction distributions, and performance metrics. Use statistical tests like KS-test or population stability index (PSI).
4. Why use Docker for model deployment? Docker ensures the same environment in development and production — same Python version, same libraries, same OS.
5. Challenge: Deploy a sentiment analysis model
Train a sentiment classifier, export it, create a FastAPI endpoint, Dockerize it, and deploy to a cloud platform. Add a /metrics endpoint for monitoring.
Try It Yourself
Mini Project: Model Deployment Dashboard
Build a FastAPI app that serves a pre-trained model, logs predictions to a database, and exposes a dashboard showing prediction counts, confidence distributions, and drift alerts. Security angle: Durga Antivirus Pro uses deployed ML models to classify threats in real time — each file scan is an inference request against a threat detection model.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
What’s Next
Congratulations on completing this Model Deployment tutorial! Here’s where to go from here:
- Practice daily — Deploy a simple model to a free cloud tier
- Build a project — Containerize and deploy a real ML service
- Explore related topics — Check out MLOps for production ML workflows
Remember: every expert was once a beginner. Keep coding!