Learn Artificial: PyTorch Guide — Deep Learning Framework for Research and Production



PyTorch Guide — Deep Learning Framework for Research and Production

DodaTech Updated Jun 7, 2026 11 min read

PyTorch is an open-source deep learning framework, originally developed by Meta AI, that combines flexible tensor computation with automatic differentiation. That combination has made it a popular choice for both research experimentation and production deployment.

What You’ll Learn

You’ll work with tensors (PyTorch’s multi-dimensional arrays), use autograd for automatic gradient computation, build neural network architectures with nn.Module, implement custom training loops, load data with DataLoader, accelerate training on GPUs with CUDA, use torchvision for image tasks, and save and load model checkpoints.

Why PyTorch Matters

PyTorch has become the dominant deep learning framework in research and is widely used in papers published at top AI conferences (NeurIPS, ICML, CVPR). Its imperative, Pythonic style (“define-by-run”) makes debugging intuitive, because you can step through model code with standard Python debuggers. DodaTech’s malware classification system uses PyTorch because its dynamic computation graphs let us handle variable-length executable files, and TorchScript lets us deploy the trained model directly into Durga Antivirus Pro without a Python runtime.

PyTorch Learning Path

    flowchart LR
  A[Python & NumPy Basics] --> B[PyTorch]
  B --> C[Tensors]
  C --> D[Autograd]
  D --> E[Neural Networks]
  E --> F[Training Loops]
  F --> G[Data Loading]
  G --> H[GPU Acceleration]
  H --> I[Model Deployment]
  B:::current

  classDef current fill:#EE4C2C,color:#fff,stroke:#333,stroke-width:2px

Prerequisites: Solid Python programming skills (classes, NumPy arrays). Basic understanding of machine learning concepts (features, labels, train/test split) is helpful.

Tensors — The Core Data Structure

Tensors are multi-dimensional arrays similar to NumPy arrays but with GPU acceleration and automatic differentiation:

import torch
import numpy as np

# Creating tensors
data = [[1, 2], [3, 4], [5, 6]]
tensor = torch.tensor(data)
print(f"Tensor:\n{tensor}")
# Output:
# tensor([[1, 2],
#         [3, 4],
#         [5, 6]])

# Tensor from NumPy
np_array = np.array([1.0, 2.0, 3.0])
from_numpy = torch.from_numpy(np_array)
print(from_numpy)  # tensor([1., 2., 3.], dtype=torch.float64)

# Special tensors
zeros = torch.zeros(2, 3)
ones = torch.ones(2, 3)
random = torch.randn(2, 3)  # Standard normal distribution
eye = torch.eye(4)  # Identity matrix

print(f"Random tensor:\n{random}")
# Output (values differ on every run):
# tensor([[-0.3421,  1.2034, -0.8765],
#         [ 0.5432, -1.2345,  0.9876]])

# Tensor operations
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

print(f"Add: {x + y}")        # tensor([5., 7., 9.])
print(f"Multiply: {x * y}")   # tensor([ 4., 10., 18.])
print(f"Dot: {torch.dot(x, y)}")  # tensor(32.)

# Reshaping
matrix = torch.arange(12).reshape(3, 4)
print(matrix)
# Output:
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])
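
One detail the NumPy conversion above does not show: torch.from_numpy keeps the array's dtype (float64 here) and shares memory with it, while torch.tensor makes an independent copy. A minimal sketch, with variable names chosen only for illustration:

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])   # NumPy defaults to float64
shared = torch.from_numpy(np_array)    # shares memory with np_array
copied = torch.tensor(np_array)        # independent copy of the data

np_array[0] = 100.0                    # mutate the NumPy array in place

shared_first = shared[0].item()        # sees the change
copied_first = copied[0].item()        # keeps the original value

print(shared.dtype, shared_first, copied_first)
```

If a model expects float32 inputs, converting with .float() after from_numpy avoids dtype mismatch errors later on.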

Autograd — Automatic Differentiation

Autograd records operations on tensors and computes gradients automatically:

# Simple gradient computation
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Define a function: z = (x^2 * y) + (x * y^2)
z = (x ** 2) * y + x * (y ** 2)

# Compute gradients
z.backward()

print(f"dz/dx at x=3, y=2: {x.grad}")
print(f"dz/dy at x=3, y=2: {y.grad}")

# Manual verification:
# z = x^2*y + x*y^2
# dz/dx = 2*x*y + y^2 = 2*3*2 + 4 = 16
# dz/dy = x^2 + 2*x*y = 9 + 12 = 21

# Output:
# dz/dx at x=3, y=2: tensor(16.)
# dz/dy at x=3, y=2: tensor(21.)

# Training loop pattern
weights = torch.randn(3, 1, requires_grad=True)
inputs = torch.randn(10, 3)
targets = 3 * inputs[:, 0:1] + 2 * inputs[:, 1:2] + 1 * inputs[:, 2:3]

# Forward pass
predictions = inputs @ weights  # Matrix multiplication
loss = ((predictions - targets) ** 2).mean()

# Backward pass
loss.backward()

# Gradients are now populated
print(f"Gradients: {weights.grad.shape}")

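backward() computes the gradient of the tensor it is called on (z and loss above) with respect to every leaf tensor that has requires_grad=True, and stores the result in that tensor's .grad attribute. Gradients accumulate: a later backward() call adds to the existing values, so reset them between training steps with optimizer.zero_grad(), or with tensor.grad.zero_() when working with raw tensors.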
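
The accumulation behaviour is easiest to see on a single scalar. The sketch below is illustrative only; the variable names are not part of the examples above.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).backward()
accumulated = w.grad.item()

# Reset in place, then recompute
w.grad.zero_()
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 4.0 8.0 4.0
```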

Building Neural Networks with nn.Module

PyTorch’s nn.Module provides a structured way to build neural networks:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class MalwareClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        
        # Regularization
        self.dropout = nn.Dropout(0.3)
        self.batch_norm = nn.BatchNorm1d(hidden_size)
        
    def forward(self, x):
        # Define forward pass
        x = F.relu(self.batch_norm(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Instantiate the model
model = MalwareClassifier(
    input_size=512,    # Feature vector size
    hidden_size=256,   # Hidden layer neurons
    num_classes=5      # Malware families
)

print(model)
# Output:
# MalwareClassifier(
#   (fc1): Linear(in_features=512, out_features=256, bias=True)
#   (fc2): Linear(in_features=256, out_features=256, bias=True)
#   (fc3): Linear(in_features=256, out_features=5, bias=True)
#   (dropout): Dropout(p=0.3, inplace=False)
#   (batch_norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
# )

# Test forward pass
sample_input = torch.randn(32, 512)  # Batch of 32
output = model(sample_input)
print(f"Output shape: {output.shape}")  # torch.Size([32, 5])
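
Part of what nn.Module does behind the scenes is register every layer assigned as an attribute, so its parameters can be listed, counted, and handed to an optimizer. A small sketch using a hypothetical TinyNet model (not part of the classifier above):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical two-layer model for illustration
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)
        self.fc2 = nn.Linear(3, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()

# Layers assigned as attributes are registered automatically
param_names = [name for name, _ in net.named_parameters()]
num_params = sum(p.numel() for p in net.parameters())

print(param_names)  # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']
print(num_params)   # (4*3 + 3) + (3*2 + 2) = 23
```

The same one-liner for num_params works on MalwareClassifier and is a quick sanity check on model size.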

Training Loop — Full Pipeline

A complete training and validation loop:

# Setup (train_loader and val_loader are created in the Data Loading section below)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MalwareClassifier(512, 256, 5).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)

# Training loop
num_epochs = 10
train_losses = []
val_accuracies = []

for epoch in range(num_epochs):
    # Training phase
    model.train()
    running_loss = 0.0
    
    for batch_idx, (data, targets) in enumerate(train_loader):
        data, targets = data.to(device), targets.to(device)
        
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, targets)
        
        # Backward pass and optimize
        loss.backward()
        
        # Gradient clipping (prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        running_loss += loss.item()
    
    avg_train_loss = running_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    
    # Validation phase
    model.eval()
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, targets in val_loader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            predicted = outputs.argmax(dim=1)  # index of the highest logit per sample
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
    
    accuracy = 100.0 * correct / total
    val_accuracies.append(accuracy)
    
    print(f"Epoch [{epoch+1}/{num_epochs}] "
          f"Loss: {avg_train_loss:.4f} "
          f"Val Acc: {accuracy:.2f}%")
    
    # Learning rate scheduling
    scheduler.step(avg_train_loss)

# Output:
# Epoch [1/10] Loss: 1.2345 Val Acc: 45.67%
# Epoch [2/10] Loss: 0.9876 Val Acc: 58.23%
# ...
# Epoch [10/10] Loss: 0.3456 Val Acc: 92.15%
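
The loop above depends on loaders that are defined later. To try the same pattern without any files, the following sketch trains a small stand-in model on synthetic data. The data, model size, and hyperparameters are assumptions made for illustration, not values from the malware pipeline.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data: the label is 1 when the feature sum is positive
X = torch.randn(256, 8)
y = (X.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for data, targets in loader:
        optimizer.zero_grad()                   # clear gradients from the last step
        loss = criterion(model(data), targets)  # forward pass
        loss.backward()                         # backward pass
        optimizer.step()                        # parameter update
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))

print([round(value, 4) for value in epoch_losses])
```

The average loss should fall from the first epoch to the last; if it does not, the usual suspects are the learning rate and a missing zero_grad() call.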

Data Loading with DataLoader

Efficient data loading with batching, shuffling, and parallel loading:

from torch.utils.data import Dataset, DataLoader
import pandas as pd

class MalwareDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.data = pd.read_csv(csv_file)
        self.transform = transform
        
        # Assume last column is the label
        self.features = self.data.iloc[:, :-1].values.astype('float32')
        self.labels = self.data.iloc[:, -1].values.astype('int64')
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        features = torch.tensor(self.features[idx])
        label = torch.tensor(self.labels[idx])
        
        if self.transform:
            features = self.transform(features)
        
        return features, label

# Create datasets
train_dataset = MalwareDataset('train_features.csv')
val_dataset = MalwareDataset('val_features.csv')

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,       # Parallel loading
    pin_memory=True,     # Faster GPU transfer
    drop_last=True       # Drop incomplete batch
)

val_loader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=4
)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Batches per epoch: {len(train_loader)}")

DataLoader handles batching, shuffling, and multi-process loading automatically. Training data is reshuffled every epoch; validation data stays in order. pin_memory=True places batches in page-locked host memory, which speeds up transfers to the GPU.
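
To see how batch_size and drop_last determine the number of batches without needing a CSV file, here is a sketch built on in-memory tensors. TensorDataset and random_split are standard torch.utils.data utilities; the sizes are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

features = torch.randn(100, 512)
labels = torch.randint(0, 5, (100,))
dataset = TensorDataset(features, labels)

# 80/20 split with a fixed generator so the split is reproducible
train_set, val_set = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, drop_last=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

# Training: 80 // 32 = 2 full batches (16 leftover samples are dropped)
# Validation: one partial batch of 20 samples is kept
print(len(train_loader), len(val_loader))  # 2 1

batch_features, batch_labels = next(iter(train_loader))
print(batch_features.shape, batch_labels.shape)
```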

GPU Acceleration with CUDA

PyTorch makes GPU usage straightforward:

# Check GPU availability
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("No GPU available, using CPU")

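# Move tensors to the GPU when one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
cpu_tensor = torch.randn(1000, 1000)
gpu_tensor = cpu_tensor.to(device)

# Move the model to the same device
model = MalwareClassifier(512, 256, 5).to(device)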

# Benchmark
import time

def benchmark(device, size=1000):
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    
    # Warm up (the first CUDA calls include one-time initialization)
    for _ in range(10):
        torch.mm(a, b)

    # CUDA kernels run asynchronously, so wait for them before reading the clock
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        torch.mm(a, b)
    if device == 'cuda':
        torch.cuda.synchronize()
    end = time.perf_counter()

    return (end - start) / 100

cpu_time = benchmark('cpu')
if torch.cuda.is_available():
    gpu_time = benchmark('cuda')
    print(f"CPU: {cpu_time*1000:.2f}ms | GPU: {gpu_time*1000:.2f}ms | "
          f"Speedup: {cpu_time/gpu_time:.1f}x")
else:
    print(f"CPU: {cpu_time*1000:.2f}ms | no GPU available")

torchvision — Computer Vision

import torchvision
import torchvision.transforms as transforms
import torchvision.models as models

# Data transformations
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet mean
        std=[0.229, 0.224, 0.225]    # ImageNet std
    )
])

# Load CIFAR-10 dataset
train_set = torchvision.datasets.CIFAR10(
    root='./data', train=True,
    download=True, transform=transform
)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

# Use a pre-trained model (the weights= argument replaces the deprecated pretrained=True)
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze feature extractor layers
for param in resnet.parameters():
    param.requires_grad = False

# Replace the classifier for our task (10 classes)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)  # in_features is 512 for ResNet-18

# Only train the classifier head
optimizer = optim.Adam(resnet.fc.parameters(), lr=0.001)

Saving and Loading Models

# Save model checkpoint
checkpoint = {
    'epoch': 10,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': avg_train_loss,
    'accuracy': accuracy
}
torch.save(checkpoint, 'malware_classifier_checkpoint.pth')

# Load model checkpoint
model = MalwareClassifier(512, 256, 5)
optimizer = optim.Adam(model.parameters())

# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
checkpoint = torch.load('malware_classifier_checkpoint.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

print(f"Resumed from epoch {epoch}, loss: {loss:.4f}")

# Save for inference (smaller, no optimizer state)
torch.save(model.state_dict(), 'malware_classifier_final.pth')

# Load for inference
model = MalwareClassifier(512, 256, 5)
model.load_state_dict(torch.load('malware_classifier_final.pth', map_location='cpu'))
model.eval()

# Export to TorchScript for production deployment
scripted_model = torch.jit.script(model)
scripted_model.save('malware_classifier_scripted.pt')
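
Before relying on a saved file, it is worth checking that a freshly constructed model reproduces the original outputs after loading. This sketch does a round trip through a temporary directory; the tiny nn.Linear model and the file name are placeholders.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder for a real model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pth")  # hypothetical file name
    torch.save(model.state_dict(), path)

    restored = nn.Linear(4, 2)  # must have the same architecture
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()
x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(model(x), restored(x))

print(same_output)  # True
```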

Security Angle: Adversarial Robustness

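def detect_adversarial_input(model, sample, epsilon=0.01):
    """Check prediction stability under small random noise.

    Returns True if the prediction is stable and False if it is unstable,
    which may indicate an adversarial input.
    """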
    model.eval()
    
    with torch.no_grad():
        original_pred = model(sample.unsqueeze(0))
        original_prob = F.softmax(original_pred, dim=1)
    
    predictions = []
    
    for _ in range(10):
        noise = torch.randn_like(sample) * epsilon
        noisy_sample = sample + noise
        
        with torch.no_grad():
            noisy_pred = model(noisy_sample.unsqueeze(0))
            predictions.append(F.softmax(noisy_pred, dim=1))
    
    # Check prediction stability
    predictions = torch.cat(predictions)
    std = predictions.std(dim=0)
    
    if std.max() > 0.1:  # High variance may indicate an adversarial input
        print(f"Warning: Unstable prediction (std={std.max():.4f})")
        return False
    
    return True

DodaTech uses similar adversarial detection in Durga Antivirus Pro to identify malware samples that attempt to evade neural network classifiers.
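
The stability check above probes the model with random noise. An attacker would instead follow the loss gradient with respect to the input. The fast gradient sign method (FGSM) is a well-known way to do that, and it is sketched here purely as an illustration of what such inputs look like, not as the method used in Durga Antivirus Pro. The model, sample, label, and epsilon are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 3))  # placeholder model
model.eval()

sample = torch.randn(1, 16)
label = torch.tensor([1])
epsilon = 0.05

# Take the gradient of the loss with respect to the input, not the weights
sample_adv = sample.clone().requires_grad_(True)
loss = F.cross_entropy(model(sample_adv), label)
loss.backward()

# Move each feature by epsilon in the direction that, to first order, raises the loss
perturbed = (sample_adv + epsilon * sample_adv.grad.sign()).detach()

with torch.no_grad():
    loss_before = F.cross_entropy(model(sample), label).item()
    loss_after = F.cross_entropy(model(perturbed), label).item()

max_change = (perturbed - sample).abs().max().item()
print(loss_before, loss_after, max_change)
```

No feature moves by more than epsilon, which is why such perturbations can be hard to spot by inspecting the input alone.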

Common Mistakes Beginners Make

Forgetting model.train() and model.eval(): Dropout and batch normalization behave differently during training and evaluation. Forgetting to toggle modes causes incorrect validation results.
Not calling optimizer.zero_grad(): Gradients accumulate by default. Without zeroing, each backward pass adds to existing gradients, leading to incorrect updates.
Calling .numpy() on a tensor that requires gradients: Converting a tensor that is still attached to the computation graph raises a RuntimeError. Use tensor.detach().numpy() (or tensor.detach().cpu().numpy() for GPU tensors) when you don’t need gradients.
CPU-GPU data transfer bottleneck: Moving data between CPU and GPU is slow. Move data to GPU once per batch, not per operation. Use pin_memory=True in DataLoader.
Overfitting without regularization: Deep networks easily memorize training data. Use dropout, weight decay, data augmentation, and early stopping.
Not shuffling training data: Without shuffling, batches may contain correlated samples, harming convergence. Always set shuffle=True in the training DataLoader.
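
Two of these mistakes are easy to reproduce in a few lines. The sketch below shows dropout behaving differently in train and eval mode, and the error raised by .numpy() on a tensor that requires gradients. The tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

dropout.train()
train_out = dropout(x)  # about half the entries zeroed, the rest scaled by 1/(1-p) = 2
dropout.eval()
eval_out = dropout(x)   # dropout is the identity in eval mode

zeros_in_train = (train_out == 0).sum().item()
eval_unchanged = torch.equal(eval_out, x)

# .numpy() on a tensor that requires grad raises RuntimeError; detach first
w = torch.randn(3, requires_grad=True)
try:
    w.numpy()
    numpy_failed = False
except RuntimeError:
    numpy_failed = True
as_array = w.detach().cpu().numpy()

print(zeros_in_train, eval_unchanged, numpy_failed, as_array.shape)
```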

Practice Questions

What does requires_grad=True do?
How does nn.Module simplify neural network construction?
What is the purpose of model.train() and model.eval()?
How do you move tensors and models between CPU and GPU?
Why save optimizer state_dict in checkpoints?

Answers:

It tells PyTorch to track operations on the tensor for automatic gradient computation via backward(). Only tensors with requires_grad=True accumulate gradients.
nn.Module provides automatic parameter tracking, to(device) for device management, state_dict() for serialization, and train()/eval() mode switching.
model.train() enables dropout and makes batch normalization use per-batch statistics. model.eval() disables dropout and switches batch normalization to its stored running statistics, giving deterministic results at inference time.
Use tensor.to('cuda') or model.to('cuda'). Check availability with torch.cuda.is_available(). Always move data to the same device as the model.
Saving optimizer state lets you resume training exactly where you left off, preserving momentum buffers, adaptive learning rate states, and the current learning rate. Learning rate schedulers keep their own state, so save scheduler.state_dict() alongside it.
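
To make the last answer concrete, the sketch below inspects what an Adam optimizer and a scheduler actually store after one step. The model and the choice of StepLR are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

# Before any step, Adam holds no per-parameter state
state_before = len(optimizer.state_dict()["state"])

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()

opt_state = optimizer.state_dict()
state_after = len(opt_state["state"])            # one entry per parameter tensor
buffers = sorted(opt_state["state"][0].keys())   # Adam's running averages and step count
current_lr = opt_state["param_groups"][0]["lr"]  # already halved by the scheduler

# The scheduler keeps its own counters, so it needs its own state_dict
sched_state = scheduler.state_dict()

print(state_before, state_after, buffers, current_lr, sched_state["last_epoch"])
```

Dropping either dictionary from a checkpoint means the resumed run restarts Adam's averages or the schedule from scratch.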

Challenge

Build a complete image classifier for the CIFAR-100 dataset: create a custom CNN with convolutional and pooling layers, implement data augmentation (random crop, horizontal flip, color jitter), use learning rate scheduling with cosine annealing, train with early stopping based on validation loss, and visualize feature maps from the first convolutional layer.

Real-World Task

Fine-tune a pre-trained ResNet-50 for malware family classification: load the model with pre-trained ImageNet weights, replace the final layer for 10 malware families, implement custom data augmentation for binary executable visualizations, use mixed-precision training with torch.cuda.amp, export to TorchScript, and benchmark inference speed on CPU vs GPU.

FAQ

What is the difference between PyTorch and TensorFlow?

: PyTorch uses eager execution by default (define-by-run), making debugging intuitive. TensorFlow 1.x was built around static graphs; TensorFlow 2.x also runs eagerly by default and compiles graphs through tf.function. PyTorch is more popular in research; TensorFlow has broader production deployment options. Both are capable frameworks.

Does PyTorch support distributed training?

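: Yes. The torch.distributed package supports multi-GPU and multi-node training through DistributedDataParallel (DDP) and Fully Sharded Data Parallel (FSDP). The older nn.DataParallel wrapper covers single-machine, multi-GPU training in one process, but DistributedDataParallel is generally recommended instead.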

Can I deploy PyTorch models to mobile devices?

: Yes. ExecuTorch is the current route for on-device inference on iOS and Android. The older PyTorch Mobile runtime, which runs TorchScript models, still appears in many tutorials but is no longer actively developed.

How do I debug NaN losses?

: Common causes: learning rate too high, exploding gradients (use gradient clipping), missing data normalization, or division by zero. Add gradient logging and reduce learning rate.
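
One way to add the gradient logging mentioned above is to check the loss and each gradient for non-finite values and to record the gradient norm that clip_grad_norm_ returns. The model and data below are placeholders; this is a sketch of the checks, not a complete debugging recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # placeholder model
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
loss_is_finite = torch.isfinite(loss).item()  # False for NaN or inf

loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping, so it can be logged
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# After clipping, the combined gradient norm should not exceed max_norm
clipped_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()

# Flag any parameter whose gradient contains NaN or inf
bad_params = [
    name for name, p in model.named_parameters()
    if p.grad is not None and not torch.isfinite(p.grad).all()
]

print(loss_is_finite, float(total_norm), clipped_norm, bad_params)
```

A sudden jump in total_norm a few steps before the loss turns into NaN usually points at exploding gradients rather than bad input data.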

Does PyTorch support Apple Silicon (MPS)?

: Yes. PyTorch has an MPS (Metal Performance Shaders) backend for Apple Silicon GPUs. Use torch.device('mps') to run on Mac GPUs, and check torch.backends.mps.is_available() first.
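
A device-agnostic script can pick the best available backend at startup. The helper below is a hypothetical convenience function (pick_device is not a PyTorch API) and assumes PyTorch 1.12 or later, where torch.backends.mps exists.

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is present in PyTorch 1.12 and later
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(4, 4, device=device)  # allocate directly on the chosen device
y = x @ x

print(device, y.device.type)
```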

Try It Yourself

import torch
import torch.nn as nn

# Create a simple neural network
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 2),
    nn.Softmax(dim=1)
)

# Generate dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

# Forward pass
output = model(X)
print(f"Input shape: {X.shape}, Output shape: {output.shape}")

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")