Implementing strategies for improved performance in PyTorch Model

Original Model

This example demonstrates how to implement the optimization techniques discussed above for training a simple CNN model on the MNIST dataset.

Let’s start by defining a simple CNN model without using any optimization strategies.

Python3
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter
import torch.profiler as profiler
from google.colab import drive
drive.mount('/content/drive')

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(32 * 28 * 28, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Load MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Define data loader
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=64, shuffle=False)

# Instantiate the model, loss function, and optimizer
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model without optimization strategies
def train(model, train_loader, criterion, optimizer, device):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Validation function
def validate(model, val_loader, criterion, device):
    model.eval()
    total_correct = 0
    total_samples = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total_samples += labels.size(0)
            total_correct += (predicted == labels).sum().item()
    accuracy = total_correct / total_samples
    return accuracy

# Function to log results in TensorBoard
def log_results(writer, epoch, loss, accuracy):
    writer.add_scalar('Loss/train', loss, epoch)
    writer.add_scalar('Accuracy/val', accuracy, epoch)

# Train and log results without optimizations
with SummaryWriter(log_dir='/content/drive/My Drive/logs/') as writer:
    for epoch in range(5):
        train(model, train_loader, criterion, optimizer, device)
        accuracy = validate(model, val_loader, criterion, device)
        print(f'Epoch {epoch + 1}, Accuracy: {accuracy}')
        log_results(writer, epoch, 0, accuracy)

Output:

Epoch 1, Accuracy: 0.9673333333333334
Epoch 2, Accuracy: 0.9745
Epoch 3, Accuracy: 0.9733333333333334
Epoch 4, Accuracy: 0.9685833333333334
Epoch 5, Accuracy: 0.9748333333333333

Model Using Optimization Strategies

The code below initializes the training data loader with varying batch sizes and applies the optimization strategies.

  • The first two DataLoader definitions use a batch size of 64 (adding multi-process loading and memory pinning), while the next two experiment with a larger batch size of 128 for improved GPU utilization.
  • Additionally, torch.cuda.amp.GradScaler() is used to apply automatic mixed precision (AMP) for faster training by scaling the loss to prevent numerical underflow.
  • Finally, the model is compiled to TorchScript using torch.jit.script() to enable graph-mode optimization for improved computational efficiency during training.
Python3
# Apply optimization strategies

# A. Multi-process Data Loading
# Use multi-process data loading for faster data loading
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

# B. Memory Pinning
# Enable memory pinning for faster data transfer
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True, pin_memory=True)

# C. Increase Batch Size
# Experiment with a larger batch size for improved GPU utilization
train_loader = DataLoader(dataset=train_dataset, batch_size=128, shuffle=True, pin_memory=True)

# D. Reduce Host to Device Copy
# Use memory pinning and increase batch size to minimize copy overhead
train_loader = DataLoader(dataset=train_dataset, batch_size=128, shuffle=True, pin_memory=True)

# E. Set Gradients to None
# Directly set gradients to None for efficient zeroing of gradients
def zero_grad(model):
    for param in model.parameters():
        param.grad = None

# F. Automatic Mixed Precision (AMP)
# Utilize automatic mixed precision for faster training
scaler = torch.cuda.amp.GradScaler()

# G. Train in Graph Mode
# Compile the model to TorchScript (graph mode) for improved computational efficiency
model = torch.jit.script(model)
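
Note that each train_loader assignment above replaces the previous one, so only the last definition is actually used during training. In practice the data loading options are combined in a single DataLoader; a minimal sketch (reusing train_dataset from the original model) could look like this:

Python3
# Combine multi-process loading, memory pinning and the larger batch size
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4,
                          pin_memory=True)

The custom zero_grad helper in strategy E mirrors what optimizer.zero_grad(set_to_none=True) does in recent PyTorch releases.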

Results after Optimizations

The optimized model is now trained for five epochs. Each epoch iterates through the training data loader, computes the loss, and updates the model parameters using automatic mixed precision (AMP) to enhance training efficiency. After each epoch, the model’s performance is evaluated on the validation dataset to determine accuracy. The loss, accuracy, and epoch number are logged to TensorBoard for monitoring and analysis.

Python3
# The final results after optimizations
with SummaryWriter(log_dir='/content/drive/My Drive/logs/') as writer:
    for epoch in range(5):
        model.train()
        total_loss = 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)  # Strategy E: set gradients to None instead of zeroing

            # AMP: run the forward pass in mixed precision, then scale the loss to prevent underflow
            with torch.cuda.amp.autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)
            scaler.scale(loss).backward()

            # AMP: unscale the gradients and perform the optimizer step
            scaler.step(optimizer)
            scaler.update()

            total_loss += loss.item()

        accuracy = validate(model, val_loader, criterion, device)
        print(f'Epoch {epoch + 1}, Loss: {total_loss}, Accuracy: {accuracy}')
        log_results(writer, epoch, total_loss, accuracy)
Output:

Epoch 1, Loss: 4.980324613978155, Accuracy: 0.9795833333333334
Epoch 2, Loss: 2.9660821823053993, Accuracy: 0.9788333333333333
Epoch 3, Loss: 2.2020596808288246, Accuracy: 0.979
Epoch 4, Loss: 1.7389826085127424, Accuracy: 0.9786666666666667
Epoch 5, Loss: 1.5763679939555004, Accuracy: 0.978
  • The training loss decreases progressively over the epochs, indicating that the model keeps improving its predictions as training progresses.
  • Despite the decreasing loss, the accuracy remains relatively high and stable across epochs, hovering around 97.9%, indicating that the model consistently makes accurate predictions on the validation data.

Overall, the decreasing loss and stable high accuracy across epochs indicate that the model is learning effectively and converging towards a good solution.

Visualizing performance using TensorBoard

To visualize the training and validation metrics and obtain the Accuracy/val and Loss/train graphs, we will use TensorBoard. Run the following magic commands in a Colab (or Jupyter) notebook cell.

Python3
%load_ext tensorboard
%tensorboard --logdir /content/drive/My\ Drive/logs/

Output:


The values displayed on the TensorBoard graphs are the performance metrics recorded during training for both the original and the optimized runs.

Observation

Accuracy/val:

The Accuracy/val graph illustrates the accuracy of the model on the validation dataset over the training epochs.

  • The SMOOTHED value, 0.9803, is the smoothed accuracy curve, which reduces fluctuations for clearer visualization (a short sketch of this smoothing follows this list).
  • The VALUE of 0.9801 is the latest accuracy logged.
  • STEP refers to the training step or epoch at which this accuracy was measured.
  • The RELATIVE value of 1.71 min is the wall-clock time elapsed since the start of the run when this accuracy was recorded.
  • The entry for the original run (the model without optimizations) shows a smoothed accuracy of 0.9766, a value of 0.9785 at step 4, and an elapsed time of 2.16 minutes.
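
The SMOOTHED column is, roughly speaking, an exponential moving average of the logged values. The snippet below is only an illustration of that idea; the smooth helper and the 0.6 smoothing factor (TensorBoard's slider default) are assumptions, not TensorBoard's exact implementation.

Python3
# Rough illustration of exponential-moving-average smoothing, similar in
# spirit to TensorBoard's SMOOTHED column (0.6 is an assumed smoothing factor)
def smooth(values, weight=0.6):
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

print(smooth([0.9796, 0.9788, 0.9790, 0.9787, 0.9780]))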

Loss/train:

The Loss/train graph depicts the training loss of the model over the epochs.

  • The SMOOTHED value, 3.0398, represents the smoothed loss curve, minimizing fluctuations.
  • The VALUE of 2.562 denotes the current training loss.
  • Similar to the accuracy graph, STEP refers to the training step or epoch at which this loss was measured.
  • The RELATIVE value of 1.71 min is the wall-clock time elapsed since the start of the run when this loss was recorded.
  • The entry for the original run (the model without optimizations) shows a loss of 0 at steps 0 through 4, because the original run logged a constant 0 for the loss rather than tracking it, and a duration of 2.16 minutes.
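
The relative durations above (about 1.71 minutes for the optimized run versus 2.16 minutes for the original) come from TensorBoard's wall-clock timestamps. To measure the speedup directly, you can also time each epoch yourself; a minimal sketch reusing the train and validate functions defined earlier:

Python3
import time

# Time each epoch to compare the baseline and the optimized configuration
for epoch in range(5):
    start = time.time()
    train(model, train_loader, criterion, optimizer, device)
    accuracy = validate(model, val_loader, criterion, device)
    print(f'Epoch {epoch + 1}: {time.time() - start:.1f}s, Accuracy: {accuracy:.4f}')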

