Computer Vision: Foundations and Practical Applications
Computer Vision is the field of AI that teaches computers to interpret visual information — enabling facial recognition, self-driving cars, medical diagnosis, and automated surveillance.
What You’ll Learn
By the end of this tutorial, you’ll understand how images are represented digitally, how convolution works for edge detection, the image classification pipeline, CNN architecture, and object detection with YOLO and SSD. You’ll build a face detection application using Python and OpenCV.
Why It Matters
Computer vision is everywhere — your phone unlocks with your face, cars detect pedestrians, medical scans highlight tumors, and social media tags your friends automatically.
Real-World Use
Your smartphone camera detects faces in real time, drawing rectangles around each face, adjusting focus and exposure, and applying beauty filters — all in milliseconds using CV algorithms running locally on the device.
Image Representation
Figure: CNN pipeline. Input Image → Pixel Values → Convolution Layer → Pooling Layer → Fully Connected Layer → Prediction. The convolution layer performs edge detection and produces feature maps; the pooling layer downsamples them into reduced features.
An image is a grid of numbers. In an 8-bit grayscale image, each number is a pixel's brightness from 0 (black) to 255 (white). Color images stack three such grids, one each for Red, Green, and Blue.
import numpy as np
# Grayscale image: 4x4 grid of pixel values
gray_image = np.array([
    [  0,  50, 100, 255],
    [ 50, 100, 200, 255],
    [100, 200, 255, 200],
    [255, 255, 200, 100]
])
print("Grayscale image (4x4):")
print(gray_image)
print(f"Shape: {gray_image.shape}")
# Color image: height x width x 3 channels
color_pixel = np.array([255, 0, 0])  # Red pixel
print(f"\nRed pixel (RGB): {color_pixel}")
Expected output:
Grayscale image (4x4):
[[  0  50 100 255]
 [ 50 100 200 255]
 [100 200 255 200]
 [255 255 200 100]]
Shape: (4, 4)

Red pixel (RGB): [255   0   0]
Convolution Basics
Convolution slides a small filter (kernel) across the image, detecting patterns like edges, corners, and textures.
def apply_convolution(img, kernel):
    # Slide the kernel over the image with no padding ("valid" mode).
    # Like CNN layers, this does not flip the kernel (cross-correlation).
    h, w = img.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            output[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return output
# Image with vertical edge: dark left, bright right
image = np.array([
    [10, 10, 200, 200, 200],
    [10, 10, 200, 200, 200],
    [10, 10, 200, 200, 200],
    [10, 10, 200, 200, 200],
    [10, 10, 200, 200, 200],
])
# Vertical edge detection kernel (Prewitt; Sobel weights the middle row by 2)
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])
result = apply_convolution(image, kernel)
print("Edge detection result:")
print(result)
Expected output:
Edge detection result:
[[570. 570.   0.]
 [570. 570.   0.]
 [570. 570.   0.]]
High values (570) indicate a strong vertical edge; zero means a uniform region. The Sobel detector applies gradient kernels in the same sliding-window fashion, with extra weight on the centre row or column. Canny builds on such gradients and adds smoothing, non-maximum suppression, and hysteresis thresholding.
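As a hedged sketch of how the standard Sobel kernels compare with the uniform-weight kernel above, the example below computes horizontal and vertical responses on the same test image and combines them into a gradient magnitude. The helper name `correlate2d_valid` is hypothetical and simply mirrors the sliding-window loop used in this tutorial.

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Slide kernel over img without padding (no kernel flip, as in CNN layers)."""
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Standard 3x3 Sobel kernels: sobel_x responds to vertical edges,
# sobel_y (its transpose) responds to horizontal edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

# Same test image as above: dark left columns, bright right columns
image = np.tile(np.array([10, 10, 200, 200, 200]), (5, 1))

gx = correlate2d_valid(image, sobel_x)
gy = correlate2d_valid(image, sobel_y)
magnitude = np.sqrt(gx ** 2 + gy ** 2)
print(magnitude)
```

Because every row of the test image is identical, the vertical response `gy` is zero everywhere, and the magnitude is driven entirely by `gx` (760 at the edge, 0 in the uniform region).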
Image Classification Pipeline
An image classification pipeline processes images through these stages:
- Preprocessing — resize to uniform dimensions, normalize pixel values to [0, 1]
- Feature extraction — convolution layers detect edges → textures → parts → objects
- Classification — fully connected layers map features to class probabilities
import tensorflow as tf
from tensorflow import keras
# Load Fashion MNIST
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
# Normalize to [0, 1] and add a channel axis: (N, 28, 28) -> (N, 28, 28, 1),
# because Conv2D expects height x width x channels per image
train_images = (train_images / 255.0)[..., None]
test_images = (test_images / 255.0)[..., None]
# Build CNN
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Train briefly
model.fit(train_images, train_labels, epochs=2, verbose=1)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
Expected output (approximate; exact values vary between runs):
Epoch 1/2 → accuracy: ~0.85
Epoch 2/2 → accuracy: ~0.90
Test accuracy: ~0.88
Object Detection: YOLO and SSD
Classification tells you what is in an image. Object detection tells you what and where — it draws bounding boxes around each detected object.
YOLO (You Only Look Once) divides the image into a grid. Each grid cell predicts bounding boxes and class probabilities in a single forward pass. It’s fast enough for real-time video.
SSD (Single Shot Detector) predicts boxes from multiple feature maps at different scales, which helps it detect objects of varying sizes in a single forward pass. How it compares with YOLO in speed and accuracy depends on the model versions and backbones being compared.
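To make YOLO's grid idea concrete, here is a minimal sketch of how a box could be assigned to the grid cell that contains its centre, in the style of the original YOLO formulation. The function name `responsible_cell` and the example numbers are illustrative assumptions, not part of any library.

```python
def responsible_cell(box_cx, box_cy, img_w, img_h, grid_size=7):
    """Return (row, col) of the grid cell containing the box centre,
    plus the centre's offsets measured from that cell's top-left corner."""
    # Normalise the centre by the image size, then scale to grid units
    gx = box_cx / img_w * grid_size
    gy = box_cy / img_h * grid_size
    # Clamp so a centre on the right or bottom border stays in the last cell
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    return row, col, gx - col, gy - row

# Hypothetical 448x448 image with an object centred at pixel (224, 112)
row, col, off_x, off_y = responsible_cell(box_cx=224, box_cy=112, img_w=448, img_h=448)
print(row, col, off_x, off_y)  # 1 3 0.5 0.75
```

In this scheme only the cell at row 1, column 3 is responsible for predicting that object, and the offsets describe where the centre sits inside the cell.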
# Output size of a YOLOv1-style detector (runnable sketch)
grid_size = 7      # Divide image into 7x7 grid
num_classes = 80
num_boxes = 2
# Each grid cell outputs:
# - 4 coordinates per box (x, y, w, h)
# - 1 confidence score per box
# - class probabilities (80 values, shared by the cell's boxes)
output_per_cell = num_boxes * 5 + num_classes   # 2 * 5 + 80 = 90
total_output = grid_size * grid_size * output_per_cell   # 49 * 90
print(f"YOLO output tensor size: {total_output}")
Expected output:
YOLO output tensor size: 4410
Face Detection with OpenCV
import cv2
# Load pre-trained face detector
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
# Load image (cv2.imread returns None if the file cannot be read)
img = cv2.imread('group_photo.jpg')
if img is None:
    raise FileNotFoundError("Could not read 'group_photo.jpg'")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Detect faces
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
# Draw rectangles
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
print(f"Detected {len(faces)} face(s)")
cv2.imwrite('output_faces.jpg', img)
Expected output (for a photo in which three faces are detected):
Detected 3 face(s)
Common Computer Vision Errors
1. Not Normalizing Pixel Values
Raw 0–255 values cause training instability. Scale to [0, 1] or standardize to mean 0, variance 1.
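A minimal sketch of both options, assuming a tiny made-up batch of 8-bit grayscale images. In a real project the mean and standard deviation would normally be computed once over the training set and reused for validation and test data.

```python
import numpy as np

# Hypothetical batch of two 2x2 grayscale images with 8-bit pixel values
batch = np.array([[[0, 255], [128, 64]],
                  [[32, 32], [200, 100]]], dtype=np.uint8)

# Option 1: scale to [0, 1] (float32 keeps memory use modest)
scaled = batch.astype(np.float32) / 255.0

# Option 2: standardize to zero mean and unit variance
mean = scaled.mean()
std = scaled.std()
standardized = (scaled - mean) / (std + 1e-7)  # epsilon guards against division by zero

print(scaled.min(), scaled.max())
print(standardized.mean(), standardized.std())
```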
2. Using Too Many Filters Too Early
Start with 16–32 filters in the first layer. Too many filters early on waste computation and can encourage overfitting.
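To put rough numbers on the cost, the sketch below counts trainable parameters for a standard 2D convolution with bias, using the formula (kernel height × kernel width × input channels + 1) × filters. The function name `conv2d_params` and the 256-filter alternative are hypothetical, chosen only for comparison with the 32-then-64 layout used in this tutorial's CNN.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Trainable parameters in a standard 2D convolution layer with bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

# Layers matching the tutorial's CNN: 32 then 64 filters, 3x3 kernels, 1 input channel
small_first = conv2d_params(3, 3, 1, 32)
small_second = conv2d_params(3, 3, 32, 64)

# Hypothetical alternative with 256 filters in the first layer
wide_first = conv2d_params(3, 3, 1, 256)
wide_second = conv2d_params(3, 3, 256, 64)

print(small_first + small_second)  # 18816
print(wide_first + wide_second)    # 150080
```

Most of the extra cost appears in the second layer, because every filter there has to span all channels produced by the first layer.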
3. Ignoring Input Size Consistency
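CNNs that end in fully connected layers, like the classifier above, require fixed input dimensions, and every image in a batch must share one shape. Resize all images to the same size before batching.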
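As a rough illustration, the sketch below resizes two differently sized images to 32×32 with a simple nearest-neighbour scheme written in NumPy, then stacks them into one batch. The helper `resize_nearest` is hypothetical; in practice a library routine such as OpenCV's `cv2.resize` with a smoother interpolation mode would normally be used.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize for a 2D (H, W) or 3D (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    # For each output row and column, pick the corresponding source index
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols]

# Two hypothetical color images with different sizes
images = [np.random.rand(30, 40, 3), np.random.rand(64, 48, 3)]
batch = np.stack([resize_nearest(im, 32, 32) for im in images])
print(batch.shape)  # (2, 32, 32, 3)
```

Note that this resize ignores aspect ratio, so objects may look stretched; cropping or padding before resizing is a common alternative.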
4. Not Using Data Augmentation
Real images vary in angle, lighting, and position. Augment with rotations, flips, and color jitter.
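The sketch below shows one simple way to apply such augmentations with NumPy alone. It is an assumption-laden stand-in: rotations are limited to multiples of 90 degrees, and a uniform brightness shift takes the place of full color jitter. The function name `augment` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment(img, rng):
    """Apply a random horizontal flip, a random 90-degree rotation,
    and a random brightness shift to an (H, W, C) float image in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]              # horizontal flip
    k = int(rng.integers(0, 4))            # 0, 90, 180, or 270 degrees
    img = np.rot90(img, k=k, axes=(0, 1))  # rotate in the height-width plane
    shift = rng.uniform(-0.1, 0.1)         # simple brightness jitter
    return np.clip(img + shift, 0.0, 1.0)  # keep values in the valid range

image = rng.random((32, 32, 3))
augmented = augment(image, rng)
print(augmented.shape)
```

Augmentations are usually applied on the fly during training, so the model sees a slightly different version of each image in every epoch.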
5. Confusing Classification with Detection
Classification gives one label per image. Detection finds multiple objects with bounding boxes. They require different architectures.
6. Thinking More Layers = Better Accuracy
Deeper networks need more data. Start with a simple architecture and add complexity only when needed.
Practice Questions
1. How does a computer represent an image? As a grid of pixel values. Grayscale: height × width. Color: height × width × 3 (RGB channels).
2. What does a convolutional filter detect? Specific patterns like edges, corners, textures. Early layers detect simple patterns; deeper layers detect complex ones like faces or objects.
3. What’s the difference between image classification and object detection? Classification assigns a single label to the whole image. Detection identifies multiple objects with bounding boxes and labels.
4. Why is pooling used in CNNs? Pooling reduces spatial dimensions, which cuts computation and can help reduce overfitting while preserving the strongest features.
5. Challenge: Train a classifier to distinguish cats from dogs Use the Dogs vs Cats dataset from Kaggle. Build a CNN with data augmentation. Try transfer learning with a pre-trained model.
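For question 4, a minimal sketch of 2×2 max pooling in NumPy may help. The function name `max_pool_2x2` and the sample feature map are made up for illustration, and the sketch assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2x2 blocks, then take each block's maximum
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[6 2]
#  [2 9]]
```

The 4×4 map shrinks to 2×2, yet the largest activation in each region survives.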
Try It Yourself
Mini Project: Webcam Face Detector
Build a real-time face detector using OpenCV’s Haar cascade with your webcam. Security angle: Face detection systems are used by Durga Antivirus Pro for secure authentication features and by surveillance systems to detect intruders in restricted areas.
Recap
Before moving on, you should understand:
- How images are represented as pixel arrays
- The concept of convolution and edge detection
- How a basic CNN architecture works
- The difference between classification and object detection
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
What’s Next
Congratulations on completing this Computer Vision tutorial! Here’s where to go from here:
- Practice daily — Process images you take with your phone
- Build a project — Create a real-time object detector for a specific use case
- Explore related topics — Check out Model Deployment to put your CV model into production
Remember: every expert was once a beginner. Keep coding!