Deploying ML Models to Production — Step-by-Step Guide
This tutorial walks through deploying a trained machine learning model to production, with a working example at each step.
Deploying a machine learning model to production means making the trained model accessible through an API so that applications can send data and receive predictions in real time.
What You'll Learn
How to wrap an ML model in a REST API, containerize it with Docker, deploy to a cloud server, and set up monitoring for drift and performance.
Why It Matters
A model that lives only in a Jupyter notebook delivers little business value. Deployment is what turns your work into a functioning product. Without it, even the best model remains a proof of concept.
Real-World Use
Durga Antivirus Pro deploys multiple ML models that scan file signatures in real time. Each model runs as a microservice behind a load balancer handling thousands of requests per second.
Deployment Architecture
flowchart LR
A[Client App] --> B[Load Balancer]
B --> C[API Gateway]
C --> D[Model Service 1]
C --> E[Model Service 2]
C --> F[Model Service 3]
D --> G[(Model Artifact)]
D --> H[Prediction Cache]
D --> I[Metrics / Monitoring]
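The Prediction Cache box in the diagram can be illustrated with a small in-process cache. The sketch below is an assumption about how such a cache might work, not a description of any particular production system: `slow_model_predict` is a hypothetical stand-in for a real model call, and `functools.lru_cache` from the standard library stores results keyed by the feature tuple.

```python
from functools import lru_cache


def slow_model_predict(features: tuple[float, ...]) -> int:
    """Hypothetical stand-in for model.predict: returns 1 if the feature sum is positive."""
    slow_model_predict.calls += 1
    return int(sum(features) > 0)


slow_model_predict.calls = 0  # counts how often the "model" actually runs


@lru_cache(maxsize=1024)
def cached_predict(features: tuple[float, ...]) -> int:
    # lru_cache requires hashable arguments, so features arrive as a tuple, not a list
    return slow_model_predict(features)


first = cached_predict((0.1, 0.2, 0.3))
second = cached_predict((0.1, 0.2, 0.3))  # identical input, served from the cache
print(first, second, slow_model_predict.calls)  # prints: 1 1 1
```

A cache like this only helps when identical inputs recur, and it must be cleared whenever a new model version is loaded. In a multi-service setup a shared cache would likely replace the in-process one.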
Step 1: Save the Trained Model
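import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic dataset: 1000 samples, 20 features
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Fix the seed so retraining reproduces the same model
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Serialize the fitted model to disk
joblib.dump(model, 'model.pkl')
print("Model saved as model.pkl")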
Expected output:
Model saved as model.pkl
joblib is generally preferred over pickle for Scikit-Learn models because it serializes the large NumPy arrays inside fitted estimators more efficiently.
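Because joblib files are pickle-based, loading one can execute arbitrary code, so it is worth confirming that the artifact you deploy is the one you saved. The following is a minimal sketch of one way to do that with a SHA-256 checksum. The helper names `sha256_of_file` and `verify_artifact` are hypothetical, and the demo uses a temporary stand-in file rather than a real `model.pkl`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large model artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file still matches the checksum recorded at save time."""
    return sha256_of_file(path) == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.pkl"
    artifact.write_bytes(b"stand-in for serialized model bytes")
    recorded = sha256_of_file(artifact)  # store this next to the artifact at save time

    ok_before = verify_artifact(artifact, recorded)
    artifact.write_bytes(b"modified bytes")  # simulate a corrupted or swapped file
    ok_after = verify_artifact(artifact, recorded)

print(ok_before, ok_after)  # prints: True False
```

In a deployment pipeline, the recorded checksum would be compared before calling `joblib.load`, and a mismatch would abort startup.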
Step 2: Create a Prediction API with FastAPI
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Model API")
model = joblib.load('model.pkl')

class InputData(BaseModel):
    features: list[float]

class Prediction(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=Prediction)
def predict(data: InputData):
    if len(data.features) != 20:
        raise HTTPException(status_code=400, detail="Expected 20 features")
    X = np.array(data.features).reshape(1, -1)
    pred = model.predict(X)[0]
    prob = model.predict_proba(X).max()
    return Prediction(prediction=int(pred), probability=float(prob))
Test the API with FastAPI's test client:
from fastapi.testclient import TestClient

client = TestClient(app)
response = client.post("/predict", json={"features": [0.1]*20})
print(response.json())
Expected output (exact values depend on the trained model):
{'prediction': 1, 'probability': 0.92}
Step 3: Containerize with Docker
Save the API code from Step 2 as app.py, list its dependencies (fastapi, uvicorn, scikit-learn, joblib, numpy) in requirements.txt, and create this Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl app.py ./
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Build the image, start a container, and send a test request:
docker build -t ml-model-api .
docker run -d -p 8000:8000 ml-model-api
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}'
Expected output (exact values depend on the trained model):
{"prediction":1,"probability":0.92}
The model is now accessible from any application via HTTP.
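As an illustration of calling the service from application code rather than curl, here is a minimal client sketch that uses only the Python standard library. It assumes the container is running locally on port 8000. `build_request` and `predict` are hypothetical helper names, and the final call is left commented out because it needs a live server.

```python
import json
import urllib.request

# Assumption: the container from the Docker step is running on this machine
API_URL = "http://localhost:8000/predict"


def build_request(features: list[float], url: str = API_URL) -> urllib.request.Request:
    """Build the JSON POST request that the /predict endpoint expects."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features: list[float], timeout: float = 5.0) -> dict:
    """Send the features to the API and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(features), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without sending it
request = build_request([0.1] * 20)
print(request.get_method(), request.full_url)

# With the service running, this would return the prediction as a dict:
# result = predict([0.1] * 20)
```

Setting an explicit timeout keeps a slow or unavailable model service from blocking the calling application indefinitely.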
Model Monitoring
import random
from datetime import datetime

class ModelMonitor:
    def __init__(self):
        self.latencies = []
        self.predictions = []
        self.errors = 0

    def log_prediction(self, latency_ms, prediction, confidence):
        self.latencies.append(latency_ms)
        self.predictions.append({
            'timestamp': datetime.now().isoformat(),
            'prediction': prediction,
            'confidence': confidence
        })

    def log_error(self):
        # Call this when a request fails so the error count is meaningful
        self.errors += 1

    def report(self):
        # Guard against division by zero before any prediction is logged
        avg_latency = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        print(f"Total predictions: {len(self.predictions)}")
        print(f"Average latency: {avg_latency:.2f}ms")
        print(f"Error count: {self.errors}")

# Simulate 100 requests with random latencies and predictions
monitor = ModelMonitor()
for _ in range(100):
    latency = random.uniform(5, 50)
    monitor.log_prediction(latency, random.randint(0, 1), random.uniform(0.7, 0.99))
monitor.report()
Expected output (the average latency varies from run to run because the simulated values are random):
Total predictions: 100
Average latency: 27.34ms
Error count: 0
Production monitoring catches latency spikes, error bursts, and data drift before your users notice.
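The monitor above tracks latency and errors but not data drift. One common way to approximate a drift check is the Population Stability Index (PSI), which compares how a feature is distributed in the training data with how it is distributed in live traffic. The sketch below is an illustrative technique chosen for this example, not part of the original monitor. The function name `psi`, the bin count, and the simulated data are all assumptions.

```python
import math
import random


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference data is constant

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp live values outside the reference range into the first or last bin
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Simulated feature values: one live sample matches training, one has a shifted mean
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print(f"PSI, same distribution: {psi(training, live_same):.3f}")
print(f"PSI, shifted mean:      {psi(training, live_shifted):.3f}")
```

A frequently cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds are conventions rather than guarantees and should be tuned per feature.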
Practice Questions
- Why is joblib preferred over pickle for saving Scikit-Learn models?
- What is the purpose of a load balancer in model deployment?
- How would you detect model drift in production?
Related Topics
- Python: the language used throughout
- Docker for Beginners: essential for containerization
- Scikit-Learn Guide: training the model to deploy
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.