In this article, we will build one of the earliest convolutional neural networks ever introduced, LeNet5. We will build this CNN from scratch in PyTorch and see how it performs on a real-world dataset.
We will start by exploring LeNet5’s architecture. We will then load and analyze our dataset, MNIST, using the class provided by Torchvision. Using PyTorch, we will build LeNet5 from scratch and train it on our data. Finally, we will see how the model performs on the unseen test data.
Knowledge of neural networks will help you understand this article. This translates to being familiar with the different layers of neural networks (input layer, hidden layers, output layer), activation functions, optimization algorithms (variants of gradient descent), loss functions, etc. Additionally, familiarity with Python syntax and the PyTorch library is essential for understanding the code snippets presented in this article.
An understanding of CNNs is also recommended. This includes knowledge of convolutional layers, pooling layers, and their role in extracting features from input data. Understanding concepts like stride, padding, and the impact of kernel/filter size is beneficial.
LeNet5 is used to recognize handwritten characters. Yann LeCun and others proposed it in 1998 in the paper Gradient-Based Learning Applied to Document Recognition.
Let’s understand the architecture of LeNet5 as shown in the figure below:
As the name indicates, LeNet5 has five layers with learnable weights: two convolutional and three fully connected. Let's start with the input. LeNet5 accepts a 32x32 greyscale image as input, meaning the network expects a single channel rather than an RGB image (three channels). After the input, we move on to the convolutional layers.
The first convolutional layer uses a filter size of 5x5, with six such filters. This reduces the width and height of the image while increasing the depth (number of channels), giving an output of 28x28x6. Pooling is then applied to halve the spatial dimensions of the feature map, i.e., 14x14x6. A second convolution with the same filter size (5x5) but 16 filters is applied next, producing a 10x10x16 feature map, and another pooling layer reduces it to 5x5x16.
A layer with 120 units then transforms the 5x5x16 feature map into 120 values. The original paper describes this as a convolutional layer (C5) with 120 5x5 filters, but because the feature map at this point is exactly 5x5, it is equivalent to, and usually implemented as, a fully connected layer, which is what we will do. Following this is a fully connected layer with 84 neurons, and finally the output layer with 10 neurons, one for each of the 10 numerical digit classes in the MNIST dataset.
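To make the shape arithmetic concrete, here is a minimal sketch that pushes a dummy 32x32 input through standalone layers of the sizes described above (using max pooling, as in the implementation we build later in this article):

import torch
import torch.nn as nn

# Trace the feature-map sizes described above with standalone layers
x = torch.randn(1, 1, 32, 32)              # one greyscale 32x32 image
c1 = nn.Conv2d(1, 6, kernel_size=5)(x)     # -> (1, 6, 28, 28)
s2 = nn.MaxPool2d(2, 2)(c1)                # -> (1, 6, 14, 14)
c3 = nn.Conv2d(6, 16, kernel_size=5)(s2)   # -> (1, 16, 10, 10)
s4 = nn.MaxPool2d(2, 2)(c3)                # -> (1, 16, 5, 5), i.e., 400 values once flattened
print(c1.shape, s2.shape, c3.shape, s4.shape)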
Let’s start by loading and analyzing the data. We will be using the MNIST dataset. The MNIST dataset contains images of handwritten numerical digits. The images are greyscale, all with a size of 28x28, and are composed of 60,000 training and 10,000 testing images.
You can see some of the sample images below:
Let’s start by importing the required libraries and defining some variables: the hyperparameters we will need for training, and a device variable that tells PyTorch whether to run on the GPU or the CPU:
# Load in relevant libraries, and alias where appropriate
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
# Define relevant variables for the ML task
batch_size = 64
num_classes = 10
learning_rate = 0.001
num_epochs = 10
# Device will determine whether to run the training on GPU or CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Using torchvision, we will load the dataset, allowing us to perform any pre-processing steps easily.
# Loading the dataset and preprocessing
train_dataset = torchvision.datasets.MNIST(root = './data',
                                           train = True,
                                           transform = transforms.Compose([
                                               transforms.Resize((32,32)),
                                               transforms.ToTensor(),
                                               transforms.Normalize(mean = (0.1307,), std = (0.3081,))]),
                                           download = True)

test_dataset = torchvision.datasets.MNIST(root = './data',
                                          train = False,
                                          transform = transforms.Compose([
                                              transforms.Resize((32,32)),
                                              transforms.ToTensor(),
                                              transforms.Normalize(mean = (0.1325,), std = (0.3105,))]),
                                          download = True)

train_loader = torch.utils.data.DataLoader(dataset = train_dataset,
                                           batch_size = batch_size,
                                           shuffle = True)

test_loader = torch.utils.data.DataLoader(dataset = test_dataset,
                                          batch_size = batch_size,
                                          shuffle = True)
Let’s understand the code:
- We load MNIST twice: once with `train = True` for the 60,000 training images and once with `train = False` for the 10,000 test images. `download = True` fetches the data into `./data` if it isn’t already there.
- Each image is resized from 28x28 to 32x32 (the input size LeNet5 expects), converted to a tensor, and normalized with the dataset’s mean and standard deviation.
- Both datasets are then wrapped in a `DataLoader`, which serves the images in shuffled batches of 64.
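As a quick sanity check, you can pull a single batch from the loader and confirm that the preprocessing produced what we expect; a minimal sketch, assuming the loaders defined above:

# Grab one batch to verify shapes
images, labels = next(iter(train_loader))
print(images.shape)   # torch.Size([64, 1, 32, 32]): batch, channels, height, width
print(labels.shape)   # torch.Size([64]): one digit label per image
print(images.min().item(), images.max().item())  # value range after normalization (no longer 0-1)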
Now we can define the LeNet5 architecture in PyTorch. Let’s first look at the code:
# Defining the convolutional neural network
class LeNet5(nn.Module):
    def __init__(self, num_classes):
        super(LeNet5, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2, stride = 2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2, stride = 2))
        self.fc = nn.Linear(400, 120)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(120, 84)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(84, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.relu(out)
        out = self.fc1(out)
        out = self.relu1(out)
        out = self.fc2(out)
        return out
Let’s walk through the code:
- `layer1` and `layer2` each bundle a 5x5 convolution, batch normalization, a ReLU activation, and 2x2 max pooling into an `nn.Sequential` block. The first block maps the single input channel to 6 feature maps; the second maps those 6 to 16.
- After the two blocks, the 16x5x5 feature map is flattened with `reshape` into a vector of 400 values, which is passed through the fully connected layers `fc` (400 to 120), `fc1` (120 to 84), and finally `fc2` (84 to `num_classes`).
- The `forward` method simply chains these operations in order and returns the raw class scores (logits).
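Before training, it’s worth running a small dummy batch through the network to confirm the output shape. A minimal sketch, assuming the `LeNet5` class defined above (using a throwaway instance so we don’t clash with the model we train below):

# Instantiate a model and pass a dummy batch through it
check_model = LeNet5(num_classes=10)
dummy = torch.randn(4, 1, 32, 32)   # a batch of 4 fake greyscale 32x32 images
print(check_model(dummy).shape)     # torch.Size([4, 10]): one score per class per image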
Before training, we need to initialize the model and define the loss function and the optimizer to be used.
model = LeNet5(num_classes).to(device)
#Setting the loss function
cost = nn.CrossEntropyLoss()
#Setting the optimizer with the model parameters and learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#this is defined to print how many steps are remaining when training
total_step = len(train_loader)
We begin by initializing our model with the number of classes as an argument, which is set to 10 in this case. Next, we define our cost function as cross-entropy loss and choose the Adam optimizer. While there are various options available for both, these choices generally yield good results with the model and the data provided. Finally, we establish a `total_step` variable to help track the steps more effectively during training.
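One detail worth knowing about `nn.CrossEntropyLoss` is that it expects raw, unnormalized scores (logits) and integer class labels; the softmax is applied internally. A tiny illustration, using the `cost` function defined above on made-up values:

# CrossEntropyLoss takes raw logits of shape (batch, num_classes)
# and integer targets of shape (batch,)
sample_logits = torch.tensor([[2.0, 0.5, -1.0]])   # scores for 3 made-up classes
sample_target = torch.tensor([0])                  # the correct class index
print(cost(sample_logits, sample_target).item())   # a small positive loss value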
Now, we can train our model:
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = cost(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 400 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Let’s see what the code does:
- For every epoch, we iterate over the batches served by `train_loader`, moving each batch of images and labels to the selected device (GPU or CPU).
- In the forward pass, the images are fed through the model and the cross-entropy loss is computed against the true labels.
- In the backward pass, `optimizer.zero_grad()` clears the gradients from the previous step (PyTorch accumulates gradients by default, as the short sketch below illustrates), `loss.backward()` computes the new gradients, and `optimizer.step()` updates the weights.
- Every 400 steps, we print the current epoch, step, and loss so we can monitor training progress.
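Here is a minimal sketch showing the gradient accumulation on a single tensor, which is why the `zero_grad()` call is needed:

# Gradients add up across backward() calls unless cleared
w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()
print(w.grad)        # tensor([2., 2., 2.])
(w * 2).sum().backward()
print(w.grad)        # tensor([4., 4., 4.]): the new gradients were added to the old ones
w.grad.zero_()       # this is what optimizer.zero_grad() does for every model parameter

With that in mind, let’s get back to our training run.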
We can see the output as follows:
The loss is decreasing with each epoch, indicating that our model is learning. It’s important to note that this loss is calculated on the training set, so a very low value can be a sign of overfitting: the model may fit the training data extremely well yet perform worse on unseen data. There are several methods to address overfitting, such as regularization and data augmentation, but we won’t delve into them in this article. The real test is how the model performs on data it has never seen.
Let’s now test our model:
# Test the model
# In the test phase, we don't need to compute gradients (for memory efficiency)
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f'Accuracy of the network on the 10000 test images: {accuracy:.2f} %')
As you can see, the code is not so different from the training loop. The main differences are that we switch the model to evaluation mode with `model.eval()` (which matters here because batch normalization behaves differently during training and inference), we skip gradient computation with `torch.no_grad()` for memory efficiency, and we don’t compute the loss, since there is no backpropagation at test time. To compute the model’s accuracy, we simply divide the number of correct predictions by the total number of test images.
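If you’d like to go beyond the aggregate accuracy and inspect an individual prediction, a minimal sketch like the following works (it assumes the trained `model` and the `test_dataset` defined earlier):

# Look at a single prediction from the trained model
model.eval()
image, label = test_dataset[0]                     # already resized and normalized
with torch.no_grad():
    logits = model(image.unsqueeze(0).to(device))  # add a batch dimension
    prediction = logits.argmax(dim=1).item()
print(f'Predicted: {prediction}, actual: {label}')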
Using this model, we get around 98.8% accuracy, which is quite good:
Note that the MNIST dataset is quite basic and small by today’s standards, and results this high are much harder to achieve on other datasets. Nonetheless, it’s a good starting point for learning about deep learning and CNNs.
In this article, we took a deep dive into LeNet-5, one of the foundational convolutional neural networks that laid the groundwork for modern deep learning in computer vision. We began by understanding the architecture of LeNet-5, breaking down each layer and its role in feature extraction and classification.
We then introduced the MNIST dataset, a benchmark for handwritten digit recognition, and demonstrated how to load and preprocess it using torchvision so that it’s ready for training.
From there, we implemented the LeNet-5 architecture from scratch in PyTorch, carefully defining each layer and specifying the hyperparameters necessary for training. This included setting learning rates, batch sizes, optimizers, and loss functions, all of which play a crucial role in the model’s performance.
Finally, we trained and evaluated the model on the MNIST dataset. The network achieved strong accuracy on the test set, confirming that even a relatively simple CNN like LeNet-5 can be highly effective for digit classification tasks.
Through this hands-on walkthrough, you’ve learned not only how to build and train LeNet-5 in PyTorch but also gained insights into the inner workings of CNNs, data handling, and model evaluation. This foundation will serve you well as you explore more advanced architectures and tackle more complex image recognition problems in the future.