Computer Vision Introduction — Image Processing, CNNs & Object Detection for Developers
Computer Vision enables machines to interpret visual information from images and video. This introduction covers core techniques, from basic image processing to deep learning models.
What You'll Learn
You'll learn image loading and processing with OpenCV, convolutional neural network architecture, object detection pipelines, and how to build a real-time image classifier.
Why It Matters
Computer Vision powers self-driving cars, medical diagnosis systems, security cameras, and augmented reality. Python libraries such as OpenCV and PyTorch make it one of the most accessible ways to build visual AI applications.
Real-World Use
Durga Antivirus Pro uses Computer Vision to analyze scanned document images, extract text regions via OCR, and detect malicious QR codes embedded in image files.
Computer Vision Pipeline
flowchart LR
A[Image Input] --> B[Preprocessing]
B --> C[Feature Extraction]
C --> D[Model Inference]
D --> E[Postprocessing]
E --> F[Output]
B --> G[Augmentation]
G --> C
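The stages in the flowchart can be sketched as plain functions chained together. This is a minimal illustration using NumPy only: the hand-crafted features, the rule-based "model", the `edge_threshold` value, and every function name are assumptions made for the example, standing in for a trained network.

```python
import numpy as np

def preprocess(image):
    """Scale uint8 pixels to float32 values in [0, 1]."""
    return image.astype(np.float32) / 255.0

def augment(image, flip):
    """Optionally mirror the image left-to-right (normally training-time only)."""
    return image[:, ::-1] if flip else image

def extract_features(image):
    """Stand-in features: mean brightness and mean horizontal gradient."""
    horizontal_gradient = np.abs(np.diff(image, axis=1))
    return np.array([image.mean(), horizontal_gradient.mean()])

def infer(features, edge_threshold=0.05):
    """Hypothetical rule-based 'model': score 1.0 when edges are strong."""
    return 1.0 if features[1] > edge_threshold else 0.0

def postprocess(score):
    """Map the raw score to a human-readable label."""
    return "textured" if score >= 0.5 else "flat"

def run_pipeline(image, flip=False):
    x = preprocess(image)
    x = augment(x, flip)
    return postprocess(infer(extract_features(x)))

# Two synthetic 8x8 grayscale images: uniform gray and vertical stripes
flat = np.full((8, 8), 128, dtype=np.uint8)
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (8, 4))

print(run_pipeline(flat))                # flat
print(run_pipeline(stripes, flip=True))  # textured
```

In a real system the feature extraction and inference stages are usually a single learned model, as in the CNN section later in this introduction.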
Image Processing with OpenCV
OpenCV is a widely used open-source library for image manipulation and preprocessing.
import cv2
import numpy as np
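# Load image in grayscale; cv2.imread returns None if the file cannot be read
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError("Could not read sample.jpg")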
print(f"Image shape: {image.shape}")
print(f"Data type: {image.dtype}")
print(f"Pixel range: [{image.min()}, {image.max()}]")
# Apply Gaussian blur to reduce noise
blurred = cv2.GaussianBlur(image, (5, 5), 1.5)
# Edge detection with Canny
edges = cv2.Canny(blurred, 50, 150)
# Count edge pixels
edge_ratio = np.sum(edges > 0) / edges.size
print(f"Edge pixel ratio: {edge_ratio:.3f}")
print(f"Image size: {image.shape[0]}x{image.shape[1]}")
Expected output (exact values depend on the image):
Image shape: (480, 640)
Data type: uint8
Pixel range: [12, 245]
Edge pixel ratio: 0.087
Image size: 480x640
Building a CNN with PyTorch
Convolutional neural networks learn hierarchical visual features from raw pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool = nn.MaxPool2d(2, 2)
        # Two 2x2 poolings reduce a 32x32 input to 8x8, hence 64 * 8 * 8
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = x.view(x.size(0), -1)  # flatten to (batch, features)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
model = SimpleCNN(num_classes=10)
dummy_input = torch.randn(1, 3, 32, 32)
output = model(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Parameter count: {sum(p.numel() for p in model.parameters()):,}")
Expected output:
Input shape: torch.Size([1, 3, 32, 32])
Output shape: torch.Size([1, 10])
Parameter count: 1,070,986
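The `64 * 8 * 8` input size of `fc1` and the parameter total can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming a 32x32 RGB input as in the dummy tensor above. The helper names are illustrative and are not part of PyTorch.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_params(in_ch, out_ch, kernel):
    """One kernel per (input, output) channel pair, plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def batchnorm_params(channels):
    """Learnable scale and shift per channel (running statistics are buffers)."""
    return 2 * channels

def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features

size = 32
size = conv2d_out(size, kernel=3, padding=1)  # conv1 keeps 32x32
size = conv2d_out(size, kernel=2, stride=2)   # pool halves to 16x16
size = conv2d_out(size, kernel=3, padding=1)  # conv2 keeps 16x16
size = conv2d_out(size, kernel=2, stride=2)   # pool halves to 8x8
flattened = 64 * size * size                  # 64 channels of 8x8

total = (
    conv2d_params(3, 32, 3) + batchnorm_params(32)
    + conv2d_params(32, 64, 3) + batchnorm_params(64)
    + linear_params(flattened, 256)
    + linear_params(256, 10)
)

print(f"Feature map: {size}x{size}, flattened: {flattened}")  # 8x8, 4096
print(f"Parameter count: {total:,}")                          # 1,070,986
```

Almost all of the parameters sit in `fc1`, which is why changing the input resolution means changing that layer's input size as well.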
Image Augmentation
Augmentation creates varied training data from existing images to improve model generalization.
import torchvision.transforms as transforms
from PIL import Image
# Define augmentation pipeline
augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2, saturation=0.2
    ),
    transforms.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
image = Image.open("sample.jpg").convert("RGB")
original_size = image.size
augmented = augmentation(image)
print(f"Original size: {original_size}")
print(f"Augmented shape: {augmented.shape}")
print(f"Augmented range: [{augmented.min():.3f}, {augmented.max():.3f}]")
# List the transforms in the pipeline
print("\nTransforms applied:")
for t in augmentation.transforms:
    print(f" - {t.__class__.__name__}")
Expected output:
Original size: (640, 480)
Augmented shape: torch.Size([3, 224, 224])
Augmented range: [0.000, 1.000]
Transforms applied:
- RandomHorizontalFlip
- RandomRotation
- ColorJitter
- RandomResizedCrop
- ToTensor
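To see what such transforms do at the pixel level, here is a simplified NumPy sketch of three of them. These are hand-written stand-ins rather than the torchvision implementations (the crop, for instance, does not resize afterwards), and the function names, seed, and sizes are assumptions chosen for the example.

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror an HWC image left-to-right with probability p."""
    return image[:, ::-1, :] if rng.random() < p else image

def random_crop(image, rng, size):
    """Cut a random size x size window out of an HWC image."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    return image[top:top + size, left:left + size, :]

def jitter_brightness(image, rng, strength=0.2):
    """Scale pixels by a random factor in [1 - strength, 1 + strength]."""
    factor = rng.uniform(1.0 - strength, 1.0 + strength)
    return np.clip(image * factor, 0.0, 1.0)

rng = np.random.default_rng(seed=0)
image = rng.random((64, 64, 3))  # synthetic float image with values in [0, 1)

# Each pass through the pipeline yields a different variant of the same image
variants = []
for _ in range(4):
    x = random_horizontal_flip(image, rng)
    x = random_crop(x, rng, size=56)
    x = jitter_brightness(x, rng)
    variants.append(x)

print([v.shape for v in variants])
```

Because the random choices are redrawn on every call, a model trained for several epochs rarely sees exactly the same pixels twice.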
Transfer Learning with Pretrained Models
Transfer learning adapts a model trained on millions of images to a custom task with minimal data.
import torch.nn as nn
import torchvision.models as models
# Load pretrained ResNet-18
model = models.resnet18(weights="IMAGENET1K_V1")
# Freeze feature extractor layers
for param in model.parameters():
    param.requires_grad = False
# Replace classifier head for custom classes (new layers are trainable by default)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)  # binary classification
# Only the new head remains trainable
trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,}")
print(f"Total parameters: {total_params:,}")
print(f"Frozen ratio: {(1 - trainable_params/total_params)*100:.2f}%")
Expected output:
Trainable parameters: 1,026
Total parameters: 11,177,538
Frozen ratio: 99.99%
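The parameter counts for the replaced head can be verified with a few lines of arithmetic. The sketch below assumes the stock torchvision ResNet-18, whose final layer takes 512 input features and which has 11,689,512 parameters with its original 1000-class head. The constant and helper names are made up for the example.

```python
def linear_params(in_features, out_features):
    """Weight matrix plus bias vector of a fully connected layer."""
    return in_features * out_features + out_features

RESNET18_TOTAL = 11_689_512  # stock model with its 1000-class ImageNet head
FC_IN_FEATURES = 512         # input width of ResNet-18's final layer

original_head = linear_params(FC_IN_FEATURES, 1000)  # 513,000
new_head = linear_params(FC_IN_FEATURES, 2)          # 1,026

# Swap the old head for the new one in the total
total = RESNET18_TOTAL - original_head + new_head
frozen_percent = (1 - new_head / total) * 100

print(f"Trainable parameters: {new_head:,}")
print(f"Total parameters: {total:,}")
print(f"Frozen ratio: {frozen_percent:.2f}%")
```

Only about one parameter in ten thousand is updated during training, which is why head-only fine-tuning is fast and works with small datasets.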
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Image tensor has wrong shape | Channel-last vs channel-first confusion | Convert HWC to CHW with .permute(2, 0, 1) |
| Model overfits on small datasets | Insufficient augmentation | Add RandomHorizontalFlip, ColorJitter, and Cutout-style erasing (RandomErasing in torchvision) |
| Out of memory during training | Batch size too large for GPU | Reduce batch size or use gradient accumulation |
| Low accuracy despite pretrained model | Dataset distribution too different from pretraining data | Fine-tune more layers instead of only the head |
| Colors look swapped after loading with OpenCV | cv2.imread returns channels in BGR order | Convert to RGB with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) |
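The two layout fixes in the table (channel order and axis order) can be illustrated without OpenCV or PyTorch. This NumPy sketch uses a synthetic array in place of a loaded image. Reversing the last axis gives the same result as the BGR-to-RGB conversion for a 3-channel image, and `transpose(2, 0, 1)` is the NumPy counterpart of `.permute(2, 0, 1)`.

```python
import numpy as np

# Synthetic 4x6 image in OpenCV's layout: height, width, channels (BGR)
bgr_hwc = np.zeros((4, 6, 3), dtype=np.uint8)
bgr_hwc[..., 0] = 255  # channel 0 is blue in BGR order

# BGR -> RGB: reverse the channel axis
rgb_hwc = bgr_hwc[..., ::-1]

# HWC -> CHW: move channels first, then make the memory contiguous
rgb_chw = np.ascontiguousarray(rgb_hwc.transpose(2, 0, 1))

# Add a batch dimension to get the NCHW layout that CNNs expect
batch = rgb_chw[np.newaxis, ...]

print(bgr_hwc.shape, rgb_chw.shape, batch.shape)  # (4, 6, 3) (3, 4, 6) (1, 3, 4, 6)
print(rgb_chw[0].max(), rgb_chw[2].max())         # 0 255 (red empty, blue full)
```

Checking `.shape` right before the model call is usually the quickest way to catch either mistake.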
Practice Questions
What is the purpose of a convolutional layer in a CNN? Convolutional layers apply learnable filters to detect spatial patterns such as edges, textures, and shapes at different scales.
How does transfer learning reduce training time? Transfer learning reuses features learned from a large dataset, so often only a small classifier head needs training on the target task.
Why is image augmentation important for Computer Vision models? Augmentation increases effective dataset size and diversity, reducing overfitting and improving generalization to real-world variations.
What does the Canny edge detector do? Canny detects edges by smoothing the image, computing intensity gradients, applying non-maximum suppression, and then using hysteresis thresholding to keep thin, connected edges.
Challenge: Build a real-time webcam-based object counter that uses a pretrained YOLO model to detect and count objects of a specific class (e.g., cars in a parking lot) and logs the count over time.
Mini Project
Build a document scanner app. Use OpenCV to detect the largest contour in an image, apply a perspective transform to deskew the document, enhance contrast with adaptive thresholding, and save the result as a clean scanned PDF. Add OCR using Tesseract to extract text from the scanned output.
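One building block for the perspective transform step is putting the four detected corners into a consistent order. The sketch below uses a common sum-and-difference heuristic with NumPy. The function name and sample coordinates are made up for illustration, and the heuristic assumes the document is not rotated close to 45 degrees.

```python
import numpy as np

def order_corners(points):
    """Return corners as top-left, top-right, bottom-right, bottom-left.

    Uses image coordinates: x grows to the right, y grows downward.
    """
    pts = np.asarray(points, dtype=np.float32)
    sums = pts.sum(axis=1)         # x + y: smallest at top-left, largest at bottom-right
    diffs = pts[:, 1] - pts[:, 0]  # y - x: smallest at top-right, largest at bottom-left
    return np.array([
        pts[np.argmin(sums)],
        pts[np.argmin(diffs)],
        pts[np.argmax(sums)],
        pts[np.argmax(diffs)],
    ])

# Corners of a slightly skewed page, in arbitrary order
corners = [(90, 10), (5, 80), (8, 12), (95, 85)]
ordered = order_corners(corners)
print(ordered.tolist())  # [[8.0, 12.0], [90.0, 10.0], [95.0, 85.0], [5.0, 80.0]]
```

The ordered array can then serve as the source points for the perspective transform, paired with the corners of the output rectangle in the same order.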
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.