[2025-08-06 14:08:43,131360][I][ezpz/__init__:265:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-08-06 14:08:43,133474][I][ezpz/__init__:266:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
Content for this tutorial has been modified from content originally written by:
Marieme Ngom, Bethany Lusch, Asad Khan, Prasanna Balaprakash, Taylor Childers, Corey Adams, Kyle Felker, and Tanwi Mallick
This tutorial will serve as a gentle introduction to neural networks and deep learning through a hands-on classification problem using the MNIST dataset.
In particular, we will introduce neural networks and show how to train them and improve their performance. We will use the PyTorch Python library.
The MNIST dataset contains thousands of examples of handwritten digits, each labeled 0-9.
We will now download the dataset that contains handwritten digits. MNIST is a popular dataset, so we can download it via the PyTorch library.
Note:
- x is for the inputs (images of handwritten digits)
- y is for the labels or outputs (digits 0-9)

Note that downloading the data the first time might take some time.
The data is split as follows:
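The data-loading cell itself is not reproduced here. A minimal sketch of how it could look using torchvision (the variable names training_data, validation_data, and test_data, the data directory, and the random seed are assumptions based on later cells):

import torch
from torchvision import datasets, transforms

# Convert each image to a [1, 28, 28] float tensor with values in [0, 1]
transform = transforms.ToTensor()

# Download (on first use) the official MNIST train and test splits
full_train = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_data = datasets.MNIST(root="data", train=False, download=True, transform=transform)

# Hold out 12,000 of the 60,000 training examples for validation
training_data, validation_data = torch.utils.data.random_split(
    full_train, [48000, 12000], generator=torch.Generator().manual_seed(0)
)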
[2025-08-06 14:08:43,552146][I][ipykernel_92984/3921772995:1:mnist] MNIST data loaded: train=48000 examples validation=12000 examples test=10000 examples input shape=torch.Size([1, 28, 28])
Let’s take a closer look. Here are the first 10 training digits:
import numpy
import matplotlib.pyplot as plt

pltsize = 1
plt.figure(figsize=(10*pltsize, pltsize))
for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.axis('off')
    # training_data[i] is an (image, label) pair
    plt.imshow(
        numpy.reshape(training_data[i][0], (28, 28)),
        cmap="gray",
    )
    plt.title(f"{training_data[i][1]}")
To train our classifier, we need (besides the data):
- a model that maps each input image to a score for each of the 10 classes,
- a loss function that measures how wrong the model's predictions are, and
- an optimizer that updates the model's parameters to reduce the loss.
Let’s begin with a simple linear model: linear regression, like last week.
We add one complication: each example is a vector (flattened image), so the “slope” multiplication becomes a dot product. If the target output is a vector as well, then the multiplication becomes matrix multiplication.
Note that, as before, we consider multiple examples at once (a batch), which adds another dimension to the input.
The linear layers in PyTorch perform a basic xW + b.
These “fully connected” layers connect each input to each output with some weight parameter.
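As a quick, standalone illustration of the shapes involved (not part of the original notebook):

import torch
from torch import nn

layer = nn.Linear(28*28, 10)   # weight W has shape [10, 784], bias b has shape [10]
x = torch.randn(32, 28*28)     # a batch of 32 flattened images
out = layer(x)                 # computes x @ W.T + b
print(out.shape)               # torch.Size([32, 10]): one score per class, per example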
We wouldn’t expect a simple linear model f(x) = xW + b that directly outputs the class label and minimizes mean squared error to work well: the model would produce outputs like 3.55 and 2.11 rather than jumping between integer labels. Instead, we give the model 10 outputs (one score per class) and train it with a classification loss (cross-entropy).
We can now define this model in PyTorch:
from torch import nn

class LinearClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # First, we need to convert the input image to a vector by using
        # nn.Flatten(). For MNIST, the 28 x 28 image becomes a vector of length 784.
        self.flatten = nn.Flatten()
        # Here, we add a fully connected ("dense") layer that has 28 x 28 = 784 input nodes
        # (one for each pixel in the input image) and 10 output nodes (one score per class).
        self.layer_1 = nn.Linear(28*28, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.layer_1(x)
        return x
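The cell that instantiates and logs the model is not shown; the training loop below also needs a loss function and an optimizer. A sketch (the learning-rate value here is an assumption):

linear_model = LinearClassifier()
logger.info(linear_model)

# Cross-entropy is the standard loss for multi-class classification
loss_fn = nn.CrossEntropyLoss()
# Plain stochastic gradient descent; the learning rate is a guess
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.1)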
[2025-08-06 14:08:43,705731][I][ipykernel_92984/2844520859:2:mnist] LinearClassifier( (flatten): Flatten(start_dim=1, end_dim=-1) (layer_1): Linear(in_features=784, out_features=10, bias=True) )
Now we are ready to train our first model.
A training step consists of:
- a forward pass: compute the model's predictions and the loss,
- a backward pass: compute the gradients of the loss with respect to the model's parameters, and
- an optimizer step: update the parameters using those gradients (and then reset the gradients to zero).
How many steps do we take? We take one step per batch of data; a full pass over the training set is called an epoch, and we train for several epochs.
def train_one_epoch(dataloader, model, loss_fn, optimizer):
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # forward pass
        pred = model(X)
        loss = loss_fn(pred, y)
        # backward pass calculates gradients
        loss.backward()
        # take one step with these gradients
        optimizer.step()
        # resets the gradients
        optimizer.zero_grad()
def evaluate(dataloader, model, loss_fn):
    # Set the model to evaluation mode - some NN pieces behave differently during training
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    loss, correct = 0, 0
    # We can save computation and memory by not calculating gradients here - we aren't optimizing
    with torch.no_grad():
        # loop over all of the batches
        for X, y in dataloader:
            pred = model(X)
            loss += loss_fn(pred, y).item()
            # how many are correct in this batch? Tracking for accuracy
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    loss /= num_batches
    correct /= size
    accuracy = 100*correct
    return accuracy, loss
%%time
epochs = 5
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, linear_model, loss_fn, optimizer)
    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, linear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")
    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, linear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:08:45,785148][I][./<timed exec>:10:mnist] Epoch 0: training loss: 0.5019691607952118, accuracy: 87.6
[2025-08-06 14:08:46,026971][I][./<timed exec>:15:mnist] Epoch 0: val. loss: 0.49424059245180574, val. accuracy: 87.63333333333333
[2025-08-06 14:08:48,169396][I][./<timed exec>:10:mnist] Epoch 1: training loss: 0.4216008733908335, accuracy: 89.01875
[2025-08-06 14:08:48,417192][I][./<timed exec>:15:mnist] Epoch 1: val. loss: 0.4121108831877404, val. accuracy: 88.925
[2025-08-06 14:08:50,479591][I][./<timed exec>:10:mnist] Epoch 2: training loss: 0.38766712208588916, accuracy: 89.7
[2025-08-06 14:08:50,742874][I][./<timed exec>:15:mnist] Epoch 2: val. loss: 0.37754899675541737, val. accuracy: 89.45833333333333
[2025-08-06 14:08:53,372352][I][./<timed exec>:10:mnist] Epoch 3: training loss: 0.36771729950110116, accuracy: 90.1125
[2025-08-06 14:08:53,646507][I][./<timed exec>:15:mnist] Epoch 3: val. loss: 0.35739373891277515, val. accuracy: 89.93333333333334
[2025-08-06 14:08:55,822620][I][./<timed exec>:10:mnist] Epoch 4: training loss: 0.35414256183306375, accuracy: 90.39791666666666
[2025-08-06 14:08:56,081519][I][./<timed exec>:15:mnist] Epoch 4: val. loss: 0.3438146301406495, val. accuracy: 90.14166666666667
CPU times: user 11.6 s, sys: 695 ms, total: 12.3 s
Wall time: 12.4 s
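The accumulated accuracies can be plotted to compare training and validation performance over the epochs; a small sketch (not part of the original notebook):

plt.figure()
plt.plot(train_acc_all, label="training accuracy")
plt.plot(val_acc_all, label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy (%)")
plt.legend()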
# Visualize how the model is doing on the first 10 examples
pltsize = 1
plt.figure(figsize=(10*pltsize, pltsize))

linear_model.eval()
batch = next(iter(train_dataloader))
predictions = linear_model(batch[0])

for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.axis('off')
    plt.imshow(batch[0][i, 0, :, :], cmap="gray")
    plt.title('%d' % predictions[i, :].argmax())
Exercise: How can you improve the accuracy? Some things you might consider: increasing the number of epochs, changing the learning rate, etc.
Let’s see how our model generalizes to the unseen test data.
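The evaluation cell is not shown; presumably it reuses evaluate() with a dataloader over the test set (test_dataloader here is an assumption; it may have been created earlier alongside the other dataloaders):

test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size)

test_acc, test_loss = evaluate(test_dataloader, linear_model, loss_fn)
logger.info(f"Test loss: {test_loss}, test accuracy: {test_acc}")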
[2025-08-06 14:08:56,519944][I][ipykernel_92984/372756021:2:mnist] Test loss: 0.3325562601909041, test accuracy: 90.89
We can now take a closer look at the results.
Let’s define a helper function to show the failure cases of our classifier.
def show_failures(model, dataloader, maxtoshow=10):
    model.eval()
    batch = next(iter(dataloader))
    predictions = model(batch[0])
    rounded = predictions.argmax(1)
    errors = rounded != batch[1]
    logger.info(f"Showing max {maxtoshow} first failures.")
    logger.info("The predicted class is shown first and the correct class in parentheses.")
    ii = 0
    plt.figure(figsize=(maxtoshow, 1))
    for i in range(batch[0].shape[0]):
        if ii >= maxtoshow:
            break
        if errors[i]:
            plt.subplot(1, maxtoshow, ii+1)
            plt.axis('off')
            plt.imshow(batch[0][i, 0, :, :], cmap="gray")
            plt.title("%d (%d)" % (rounded[i], batch[1][i]))
            ii = ii + 1
Here are the first 10 images from the test data that this small model classified to a wrong class:
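The call that produces these images is not shown; presumably something like (reusing the test dataloader sketched above):

show_failures(linear_model, test_dataloader)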
Our linear model isn’t enough for high accuracy on this dataset. To improve the model, we often need to add more layers and nonlinearities.
The output of this NN can be written as
\begin{equation}
\hat{u}(\mathbf{x}) = \sigma_2\left(\sigma_1\left(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1\right)\mathbf{W}_2 + \mathbf{b}_2\right),
\end{equation}
where $\mathbf{x}$ is the input, $\mathbf{W}_j$ are the weights of the neural network, $\sigma_j$ the (nonlinear) activation functions, and $\mathbf{b}_j$ the biases. The activation function introduces the nonlinearity and makes it possible to learn more complex tasks. Desirable properties of an activation function include being differentiable, bounded, and monotonic.
Image source: PragatiBaheti
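As a concrete, standalone illustration of the formula above (not from the original notebook), here is the two-layer forward pass written out with explicit weight matrices and activations:

import torch

x = torch.randn(32, 784)                  # a batch of 32 flattened inputs
W1, b1 = torch.randn(784, 50), torch.zeros(50)
W2, b2 = torch.randn(50, 10), torch.zeros(10)

sigma1 = torch.relu                        # hidden-layer activation
sigma2 = lambda z: z                       # identity here; nn.CrossEntropyLoss applies softmax internally

hidden = sigma1(x @ W1 + b1)               # sigma_1(x W_1 + b_1)
u_hat = sigma2(hidden @ W2 + b2)           # sigma_2(... W_2 + b_2)
print(u_hat.shape)                         # torch.Size([32, 10])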
Adding more layers to obtain a deep neural network:
Deep neural networks can be overly flexible/complicated and “overfit” your data, just like fitting overly complicated polynomials:
Visualization of accuracy and loss during training (image source: Baeldung):
To improve the generalization of our model to previously unseen data, we employ a technique known as regularization, which constrains the optimization problem in order to discourage complex models. One common regularization technique for neural networks is dropout, which randomly zeroes a fraction of the activations during training.
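A quick sketch of what dropout does (not from the original notebook): in training mode nn.Dropout randomly zeroes entries and rescales the rest, while in evaluation mode it is the identity.

import torch
from torch import nn

drop = nn.Dropout(p=0.2)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly 20% of entries zeroed, the rest scaled by 1/(1 - 0.2) = 1.25

drop.eval()
print(drop(x))   # identity: all ones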
We can now implement a deep network in PyTorch. nn.Dropout() performs the dropout operation mentioned earlier:
#For HW: cell to change activation
class NonlinearClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers_stack = nn.Sequential(
            nn.Linear(28*28, 50),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
            # nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
            # nn.Dropout(0.2),
            nn.Linear(50, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.layers_stack(x)
        return x
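Before the training loop below, the notebook presumably creates a fresh nonlinear model, loss function, and optimizer in a cell like this (a sketch; the learning-rate value is an assumption, and the homework cells further down show where to change it):

nonlinear_model = NonlinearClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.1)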
%%time
epochs = 5
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, nonlinear_model, loss_fn, optimizer)
    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, nonlinear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")
    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, nonlinear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:08:59,101430][I][./<timed exec>:10:mnist] Epoch 0: training loss: 0.7994496553738912, accuracy: 77.17500000000001
[2025-08-06 14:08:59,379305][I][./<timed exec>:15:mnist] Epoch 0: val. loss: 0.7928039590094952, val. accuracy: 77.60833333333333
[2025-08-06 14:09:01,776063][I][./<timed exec>:10:mnist] Epoch 1: training loss: 0.41807308411598204, accuracy: 88.52291666666666
[2025-08-06 14:09:02,063954][I][./<timed exec>:15:mnist] Epoch 1: val. loss: 0.40851338334540105, val. accuracy: 88.59166666666667
[2025-08-06 14:09:04,377194][I][./<timed exec>:10:mnist] Epoch 2: training loss: 0.31588405946890513, accuracy: 91.10208333333333
[2025-08-06 14:09:04,633401][I][./<timed exec>:15:mnist] Epoch 2: val. loss: 0.3068072290179577, val. accuracy: 91.13333333333333
[2025-08-06 14:09:06,849967][I][./<timed exec>:10:mnist] Epoch 3: training loss: 0.2562710582613945, accuracy: 92.63125
[2025-08-06 14:09:07,108817][I][./<timed exec>:15:mnist] Epoch 3: val. loss: 0.2528020724495675, val. accuracy: 92.5
[2025-08-06 14:09:09,483188][I][./<timed exec>:10:mnist] Epoch 4: training loss: 0.2121300637324651, accuracy: 93.80625
[2025-08-06 14:09:09,745825][I][./<timed exec>:15:mnist] Epoch 4: val. loss: 0.21229731101304927, val. accuracy: 93.65833333333333
CPU times: user 12.4 s, sys: 834 ms, total: 13.2 s
Wall time: 13.1 s
To train and validate a neural network model, you need:
- data, split into training, validation, and test sets,
- a model (the network architecture),
- a loss function,
- an optimizer, and
- a training loop that updates the model and tracks its performance.
If you have time, experiment with how to improve the model.
Note: training and validation data can be used to compare models, but test data should be saved until the end as a final check of generalization.
Make the following changes to the cells marked with the comment “#For HW”:
#####################To modify the batch size##########################
batch_size = 32 # 64, 128, 256, 512
# The dataloader makes our dataset iterable
train_dataloader = torch.utils.data.DataLoader(training_data, batch_size=batch_size)
val_dataloader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size)
##############################################################################
##########################To change the learning rate##########################
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.01) #modify the value of lr
##############################################################################
##########################To change activation##########################
###### Go to https://pytorch.org/docs/main/nn.html#non-linear-activations-weighted-sum-nonlinearity for more activations ######
class NonlinearClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers_stack = nn.Sequential(
            nn.Linear(28*28, 50),
            nn.Sigmoid(),  # nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.Tanh(),  # nn.ReLU(),
            # nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
            # nn.Dropout(0.2),
            nn.Linear(50, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.layers_stack(x)
        return x
##############################################################################
Bonus question: A learning rate scheduler is an essential deep learning technique used to dynamically adjust the learning rate during training. This strategy can significantly impact the convergence speed and overall performance of a neural network. See below for how to incorporate one into your training.
nonlinear_model = NonlinearClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.1)
# Step learning rate scheduler: reduce by a factor of 0.1 every 2 epochs (only for illustrative purposes)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
%%time
epochs = 6
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, nonlinear_model, loss_fn, optimizer)
    # step the scheduler once per epoch
    scheduler.step()
    # log the current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    logger.info(f"Epoch {j+1}/{epochs}, Learning Rate: {current_lr}")
    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, nonlinear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")
    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, nonlinear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:09:11,862569][I][./<timed exec>:11:mnist] Epoch 1/6, Learning Rate: 0.1
[2025-08-06 14:09:13,137090][I][./<timed exec>:16:mnist] Epoch 0: training loss: 0.3418297287598252, accuracy: 89.94791666666667
[2025-08-06 14:09:13,464259][I][./<timed exec>:21:mnist] Epoch 0: val. loss: 0.33424193555116655, val. accuracy: 89.725
[2025-08-06 14:09:15,137657][I][./<timed exec>:11:mnist] Epoch 2/6, Learning Rate: 0.010000000000000002
[2025-08-06 14:09:16,329964][I][./<timed exec>:16:mnist] Epoch 1: training loss: 0.23566976040912171, accuracy: 92.98125
[2025-08-06 14:09:16,749356][I][./<timed exec>:21:mnist] Epoch 1: val. loss: 0.2289788018465042, val. accuracy: 92.77499999999999
[2025-08-06 14:09:18,453201][I][./<timed exec>:11:mnist] Epoch 3/6, Learning Rate: 0.010000000000000002
[2025-08-06 14:09:19,680275][I][./<timed exec>:16:mnist] Epoch 2: training loss: 0.21982640741268794, accuracy: 93.42708333333334
[2025-08-06 14:09:20,005618][I][./<timed exec>:21:mnist] Epoch 2: val. loss: 0.2157861268222332, val. accuracy: 93.16666666666666
[2025-08-06 14:09:21,798088][I][./<timed exec>:11:mnist] Epoch 4/6, Learning Rate: 0.0010000000000000002
[2025-08-06 14:09:23,134008][I][./<timed exec>:16:mnist] Epoch 3: training loss: 0.21438998909915488, accuracy: 93.56666666666666
[2025-08-06 14:09:23,443619][I][./<timed exec>:21:mnist] Epoch 3: val. loss: 0.21053465707600116, val. accuracy: 93.35
[2025-08-06 14:09:25,681837][I][./<timed exec>:11:mnist] Epoch 5/6, Learning Rate: 0.0010000000000000002
[2025-08-06 14:09:27,223263][I][./<timed exec>:16:mnist] Epoch 4: training loss: 0.21325495328381658, accuracy: 93.57916666666667
[2025-08-06 14:09:27,565179][I][./<timed exec>:21:mnist] Epoch 4: val. loss: 0.2093664672325055, val. accuracy: 93.33333333333333
[2025-08-06 14:09:29,530354][I][./<timed exec>:11:mnist] Epoch 6/6, Learning Rate: 0.00010000000000000003
[2025-08-06 14:09:30,760327][I][./<timed exec>:16:mnist] Epoch 5: training loss: 0.21277805399273833, accuracy: 93.60000000000001
[2025-08-06 14:09:31,060373][I][./<timed exec>:21:mnist] Epoch 5: val. loss: 0.2088593033850193, val. accuracy: 93.33333333333333
CPU times: user 18.8 s, sys: 3.41 s, total: 22.2 s
Wall time: 20.9 s
@online{foreman2025,
  author = {Foreman, Sam and Ngom, Marieme and Zheng, Huihuo and Lusch,
    Bethany and Childers, Taylor},
title = {Intro to {NNs:} {MNIST}},
date = {2025-07-17},
url = {https://saforem2.github.io/hpc-bootcamp-2025/01-neural-networks/1-mnist/},
langid = {en}
}