Intro to NNs: MNIST

Introduction to Neural Networks with MNIST dataset
Authors
Affiliations

Sam Foreman

Marieme Ngom

Huihuo Zheng

Bethany Lusch

Taylor Childers

Published

July 17, 2025

Modified

July 27, 2025

Content for this tutorial has been modified from content originally written by:

Marieme Ngom, Bethany Lusch, Asad Khan, Prasanna Balaprakash, Taylor Childers, Corey Adams, Kyle Felker, and Tanwi Mallick

This tutorial will serve as a gentle introduction to neural networks and deep learning through a hands-on classification problem using the MNIST dataset.

In particular, we will introduce neural networks and how to train and improve their learning capabilities. We will use the PyTorch Python library.

The MNIST dataset contains 70,000 examples of handwritten digits, with each image labeled 0-9.

Figure 1: MNIST sample
import ambivalent

import matplotlib.pyplot as plt
import seaborn as sns

import ezpz
# console = ezpz.log.get_console()
logger = ezpz.get_logger('mnist')

plt.style.use(ambivalent.STYLES['ambivalent'])
sns.set_context("notebook")
plt.rcParams["figure.figsize"] = [6.4, 4.8]
[2025-08-06 14:08:43,131360][I][ezpz/__init__:265:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-08-06 14:08:43,133474][I][ezpz/__init__:266:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
# %matplotlib inline

import torch
import torchvision
from torch import nn

import numpy 
import matplotlib.pyplot as plt
import time

The MNIST dataset

We will now download the dataset that contains handwritten digits. MNIST is a popular dataset, so we can download it via the PyTorch library.

Note:

  • x is for the inputs (images of handwritten digits)
  • y is for the labels or outputs (digits 0-9)
  • We are given “training” and “test” datasets.
    • Training datasets are used to fit the model.
    • Test datasets are saved until the end, when we are satisfied with our model, to estimate how well our model generalizes to new data.

Note that downloading it the first time might take some time.

The data is split as follows:

  • 60,000 training examples, 10,000 test examples
  • inputs: 1 x 28 x 28 pixels
  • outputs (labels): one integer per example
training_data = torchvision.datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

test_data = torchvision.datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
train_size = int(0.8 * len(training_data))  # 80% for training
val_size = len(training_data) - train_size  # Remaining 20% for validation
training_data, validation_data = torch.utils.data.random_split(
    training_data,
    [train_size, val_size],
    generator=torch.Generator().manual_seed(55)
)
logger.info(
    " ".join([
        f"MNIST data loaded:",
        f"train={len(training_data)} examples",
        f"validation={len(validation_data)} examples",
        f"test={len(test_data)} examples",
        f"input shape={training_data[0][0].shape}" 
    ])
)
# logger.info(f'Input shape', training_data[0][0].shape)
[2025-08-06 14:08:43,552146][I][ipykernel_92984/3921772995:1:mnist] MNIST data loaded: train=48000 examples validation=12000 examples test=10000 examples input shape=torch.Size([1, 28, 28])

Let’s take a closer look. Here are the first 10 training digits:

pltsize=1
# plt.figure(figsize=(10*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    # x, y = training_data[i]
    # plt.imshow(x.reshape(28, 28), cmap="gray")
    # training_data[i][0] is the image tensor, training_data[i][1] is the label
    plt.imshow(
        numpy.reshape(
            training_data[i][0],
            (28, 28)
        ),
        cmap="gray"
    )
    plt.title(f"{training_data[i][1]}") 

Generalities:

To train our classifier, we need (besides the data):

  • A model that depends on parameters \mathbf{\theta}. Here we are going to use neural networks.
  • A loss function J(\mathbf{\theta}) to measure how well the model is doing.
  • An optimization method.

Linear Model

Let’s begin with a simple linear model: linear regression, like last week.

We add one complication: each example is a vector (flattened image), so the “slope” multiplication becomes a dot product. If the target output is a vector as well, then the multiplication becomes matrix multiplication.

Note, like before, we consider multiple examples at once, adding another dimension to the input.

Figure 2: Fully connected linear net

The linear layers in PyTorch perform a basic xW + b.

These “fully connected” layers connect each input to each output with some weight parameter.
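
To make the shapes concrete, here is a minimal sketch (using a random, made-up batch rather than MNIST data) of what a single nn.Linear layer does to a batch of flattened 28 x 28 images. Note that PyTorch stores the weight with shape (out_features, in_features) and computes x W^T + b under the hood.

import torch
from torch import nn

x = torch.randn(32, 28 * 28)                # a made-up batch of 32 flattened images
layer = nn.Linear(28 * 28, 10)              # weight shape (10, 784), bias shape (10,)

out = layer(x)                              # shape (32, 10): one score per class per image
manual = x @ layer.weight.T + layer.bias    # the same computation written out explicitly

print(out.shape)                            # torch.Size([32, 10])
print(torch.allclose(out, manual))          # True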

We wouldn’t expect a simple linear model f(x) = xW+b that directly outputs the class label and minimizes mean squared error to work well: the model would output values like 3.55 and 2.11 rather than discrete digit labels.

We now need:

  • A loss function J(\theta) where \theta is the list of parameters (here W and b). Last week, we used mean squared error (MSE), but this week let’s make two changes that make more sense for classification:
    • Change the output to be a length-10 vector of class probabilities (each between 0 and 1, summing to 1).
    • Use cross entropy as the loss function, which is the typical choice for classification (see the short sketch after this list).
  • An optimization method, or optimizer, such as stochastic gradient descent (SGD), Adam, RMSprop, or Adagrad. Let’s start with SGD, like last week. For far more information about optimizers beyond basic SGD, with some nice animations, see https://ruder.io/optimizing-gradient-descent/ or https://distill.pub/2017/momentum/.
  • A learning rate. As we learned last week, the learning rate controls how far we move during each step.
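
To see how these two changes fit together, here is a small, self-contained sketch (with made-up logits for three classes rather than ten) illustrating that nn.CrossEntropyLoss expects the raw scores ("logits") from the model and applies the softmax internally, which is why the models below do not end with an explicit softmax layer.

import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up raw scores for 3 classes
target = torch.tensor([0])                  # the correct class index

probs = torch.softmax(logits, dim=1)        # probabilities between 0 and 1 that sum to 1
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, target)              # applies softmax + negative log-likelihood

manual = -torch.log(probs[0, target])       # the same value computed by hand
print(loss.item(), manual.item())           # the two numbers agree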
class LinearClassifier(nn.Module):

    def __init__(self):
        super().__init__()
        # First, we need to convert the input image to a vector by using 
        # nn.Flatten(). For MNIST, it means the second dimension 28*28 becomes 784.
        self.flatten = nn.Flatten()
        # Here, we add a fully connected ("dense") layer that has 28 x 28 = 784 input nodes 
        #(one for each pixel in the input image) and 10 output nodes (for probabilities of each class).
        self.layer_1 = nn.Linear(28*28, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.layer_1(x)
        return x
linear_model = LinearClassifier()
logger.info(linear_model)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.05)
[2025-08-06 14:08:43,705731][I][ipykernel_92984/2844520859:2:mnist] LinearClassifier(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (layer_1): Linear(in_features=784, out_features=10, bias=True)
)

Learning

Now we are ready to train our first model.

A training step is comprised of:

  • A forward pass: the input is passed through the network
  • Backpropagation: A backward pass to compute the gradient \frac{\partial J}{\partial \mathbf{W}} of the loss function with respect to the parameters of the network.
  • Weight updates \mathbf{W} = \mathbf{W} - \alpha \frac{\partial J}{\partial \mathbf{W}}, where \alpha is the learning rate (spelled out by hand in the sketch after this list).
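
The training loop defined further below performs exactly these three steps through an optimizer; the hedged sketch here spells them out by hand on a tiny made-up batch (the model, data, and learning rate \alpha are all placeholders for illustration).

import torch
from torch import nn

model = nn.Linear(28 * 28, 10)            # a toy stand-in model
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(16, 28 * 28)              # 16 made-up flattened images
y = torch.randint(0, 10, (16,))           # 16 made-up labels
alpha = 0.05                              # learning rate (arbitrary value)

pred = model(x)                           # forward pass
loss = loss_fn(pred, y)                   # how wrong are we?
loss.backward()                           # backpropagation: fills p.grad for each parameter

with torch.no_grad():                     # weight update: W <- W - alpha * dJ/dW
    for p in model.parameters():
        p -= alpha * p.grad
        p.grad.zero_()                    # reset gradients before the next step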

How many steps do we take?

  • The batch size corresponds to the number of training examples in one pass (forward + backward).
    • A smaller batch size allows the model to learn from individual examples but takes longer to train.
    • A larger batch size requires fewer steps but may result in the model not capturing the nuances in the data.
  • The larger the batch size, the more memory you will need.
  • An epoch means one pass through the whole training dataset (looping over the batches). Using too few epochs can lead to underfitting, and using too many can lead to overfitting.
  • The choice of batch size and learning rate is important for performance, generalization, and accuracy in deep learning (see the quick check after this list).
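
As a quick sanity check on these numbers (a hedged aside using the split sizes logged above): with 48,000 training examples and a batch size of 128, one epoch corresponds to ceil(48000 / 128) = 375 optimizer steps.

import math

steps_per_epoch = math.ceil(len(training_data) / 128)  # 48000 / 128 -> 375 steps
print(steps_per_epoch)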
batch_size = 128

# The dataloader makes our dataset iterable 
train_dataloader = torch.utils.data.DataLoader(training_data, batch_size=batch_size)
val_dataloader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size)
def train_one_epoch(dataloader, model, loss_fn, optimizer):
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # forward pass
        pred = model(X)
        loss = loss_fn(pred, y)
        # backward pass calculates gradients
        loss.backward()
        # take one step with these gradients
        optimizer.step()
        # resets the gradients 
        optimizer.zero_grad()
def evaluate(dataloader, model, loss_fn):
    # Set the model to evaluation mode - some NN pieces behave differently during training
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    loss, correct = 0, 0

    # We can save computation and memory by not calculating gradients here - we aren't optimizing 
    with torch.no_grad():
        # loop over all of the batches
        for X, y in dataloader:
            pred = model(X)
            loss += loss_fn(pred, y).item()
            # how many are correct in this batch? Tracking for accuracy 
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    loss /= num_batches
    correct /= size

    accuracy = 100*correct
    return accuracy, loss
%%time

epochs = 5
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, linear_model, loss_fn, optimizer)

    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, linear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")

    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, linear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:08:45,785148][I][./<timed exec>:10:mnist] Epoch 0: training loss: 0.5019691607952118, accuracy: 87.6
[2025-08-06 14:08:46,026971][I][./<timed exec>:15:mnist] Epoch 0: val. loss: 0.49424059245180574, val. accuracy: 87.63333333333333
[2025-08-06 14:08:48,169396][I][./<timed exec>:10:mnist] Epoch 1: training loss: 0.4216008733908335, accuracy: 89.01875
[2025-08-06 14:08:48,417192][I][./<timed exec>:15:mnist] Epoch 1: val. loss: 0.4121108831877404, val. accuracy: 88.925
[2025-08-06 14:08:50,479591][I][./<timed exec>:10:mnist] Epoch 2: training loss: 0.38766712208588916, accuracy: 89.7
[2025-08-06 14:08:50,742874][I][./<timed exec>:15:mnist] Epoch 2: val. loss: 0.37754899675541737, val. accuracy: 89.45833333333333
[2025-08-06 14:08:53,372352][I][./<timed exec>:10:mnist] Epoch 3: training loss: 0.36771729950110116, accuracy: 90.1125
[2025-08-06 14:08:53,646507][I][./<timed exec>:15:mnist] Epoch 3: val. loss: 0.35739373891277515, val. accuracy: 89.93333333333334
[2025-08-06 14:08:55,822620][I][./<timed exec>:10:mnist] Epoch 4: training loss: 0.35414256183306375, accuracy: 90.39791666666666
[2025-08-06 14:08:56,081519][I][./<timed exec>:15:mnist] Epoch 4: val. loss: 0.3438146301406495, val. accuracy: 90.14166666666667
CPU times: user 11.6 s, sys: 695 ms, total: 12.3 s
Wall time: 12.4 s
plt.figure()
plt.plot(range(epochs), train_acc_all, label='Training Acc.' )
plt.plot(range(epochs), val_acc_all, label='Validation Acc.' )
plt.xlabel('Epoch #')
plt.ylabel('Accuracy (%)')
plt.legend()

# Visualize how the model is doing on the first 10 examples
pltsize=1
plt.figure(figsize=(10*pltsize, pltsize))
linear_model.eval()
batch = next(iter(train_dataloader))
predictions = linear_model(batch[0])

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(batch[0][i,0,:,:], cmap="gray")
    plt.title('%d' % predictions[i,:].argmax())

Exercise: How can you improve the accuracy? Some things you might consider: increasing the number of epochs, changing the learning rate, etc.

Prediction

Let’s see how our model generalizes to the unseen test data.

#For HW: cell to change batch size
#create dataloader for test data
# The dataloader makes our dataset iterable

batch_size_test = 256 
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size_test)
acc_test, loss_test = evaluate(test_dataloader, linear_model, loss_fn)
logger.info(f"Test loss: {loss_test}, test accuracy: {acc_test}")
# logger.info("Test loss: %.4f, test accuracy: %.2f%%" % (loss_test, acc_test))
[2025-08-06 14:08:56,519944][I][ipykernel_92984/372756021:2:mnist] Test loss: 0.3325562601909041, test accuracy: 90.89

We can now take a closer look at the results.

Let’s define a helper function to show the failure cases of our classifier.

def show_failures(model, dataloader, maxtoshow=10):
    model.eval()
    batch = next(iter(dataloader))
    predictions = model(batch[0])

    rounded = predictions.argmax(1)
    errors = rounded!=batch[1]
    logger.info(
        f"Showing max {maxtoshow} first failures."
    )
    logger.info("The predicted class is shown first and the correct class in parentheses.")
    ii = 0
    plt.figure(figsize=(maxtoshow, 1))
    for i in range(batch[0].shape[0]):
        if ii>=maxtoshow:
            break
        if errors[i]:
            plt.subplot(1, maxtoshow, ii+1)
            plt.axis('off')
            plt.imshow(batch[0][i,0,:,:], cmap="gray")
            plt.title("%d (%d)" % (rounded[i], batch[1][i]))
            ii = ii + 1

Here are the first 10 images from the test data that this small model classified to a wrong class:

show_failures(linear_model, test_dataloader)
[2025-08-06 14:08:56,536347][I][ipykernel_92984/2368214845:8:mnist] Showing max 10 first failures.
[2025-08-06 14:08:56,537158][I][ipykernel_92984/2368214845:11:mnist] The predicted class is shown first and the correct class in parentheses.

Multilayer Model

Our linear model isn’t enough for high accuracy on this dataset. To improve the model, we often need to add more layers and nonlinearities.

Figure 3: Shallow neural network

The output of this NN can be written as

\begin{equation} \hat{u}(\mathbf{x}) = \sigma_2(\sigma_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2), \end{equation}

where \mathbf{x} is the input, \mathbf{W}_j are the weights of the neural network, \sigma_j the (nonlinear) activation functions, and \mathbf{b}_j its biases. The activation function introduces the nonlinearity and makes it possible to learn more complex tasks. Desirable properties in an activation function include being differentiable, bounded, and monotonic.
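
For concreteness, here is a hedged sketch of the equation above written directly with tensors (the layer sizes are arbitrary, \sigma_1 is taken to be a sigmoid, and \sigma_2 a softmax over the 10 classes; none of these choices come from the tutorial's models).

import torch

x = torch.randn(32, 784)                        # a made-up batch of flattened images
W1, b1 = torch.randn(784, 50), torch.zeros(50)  # weights and biases of the hidden layer
W2, b2 = torch.randn(50, 10), torch.zeros(10)   # weights and biases of the output layer

hidden = torch.sigmoid(x @ W1 + b1)             # sigma_1: elementwise nonlinearity
u_hat = torch.softmax(hidden @ W2 + b2, dim=1)  # sigma_2: softmax over the 10 classes

print(u_hat.shape)                              # torch.Size([32, 10]); each row sums to 1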

Image source: PragatiBaheti

Figure 4: Activation functions

Adding more layers to obtain a deep neural network:

Figure 5

Important things to know

Deep neural networks can be overly flexible/complicated and “overfit” your data, just like fitting overly complicated polynomials:

Figure 6: Bias-variance tradeoff

Visualization with respect to accuracy and loss (Image source: Baeldung):

Figure 7: Visualization of accuracy and loss

To improve the generalization of our model on previously unseen data, we employ a technique known as regularization, which constrains our optimization problem in order to discourage complex models.

  • Dropout is a commonly used regularization technique. The Dropout layer randomly sets input units to 0 with a given probability (the dropout rate) at each step during training, which helps prevent overfitting.
  • Penalizing the loss function by adding a term such as \lambda ||\mathbf{W}||^2 is also a commonly used regularization technique. This helps “control” the magnitude of the weights of the network (see the sketch below).
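
In PyTorch, this second option (an L2 penalty on the weights) is most easily applied through the optimizer's weight_decay argument. The sketch below reuses the linear_model defined earlier; the value 1e-4 is an arbitrary illustration, not a recommendation from this tutorial.

# L2 penalty ("weight decay"): the optimizer shrinks the weights a little at every step,
# which corresponds (up to a constant factor) to adding a lambda * ||W||^2 term to the loss.
optimizer_l2 = torch.optim.SGD(
    linear_model.parameters(),   # reusing the linear model defined earlier
    lr=0.05,
    weight_decay=1e-4,           # the lambda; an arbitrary illustrative value
)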
  • Vanishing gradients: gradients become small as they propagate backward through the layers. Squashing activation functions like sigmoid or tanh can cause this.
  • Exploding gradients: gradients grow exponentially, usually due to “poor” weight initialization.
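
To see the vanishing-gradient effect concretely, here is a hedged, standalone sketch (a deliberately deep stack of small sigmoid layers, unrelated to the MNIST models in this tutorial) that compares the typical gradient magnitude in the first and last layers after one backward pass.

import torch
from torch import nn

# A deliberately deep stack of sigmoid layers to illustrate vanishing gradients.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
net = nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(8, 32)
loss = net(x).pow(2).mean()     # an arbitrary scalar loss, just to have something to backprop
loss.backward()

first_grad = net[0].weight.grad.abs().mean().item()
last_grad = net[-1].weight.grad.abs().mean().item()
print(f"first layer: {first_grad:.2e}, last layer: {last_grad:.2e}")
# The first-layer gradient is typically orders of magnitude smaller than the last-layer one.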

We can now implement a deep network in PyTorch.

nn.Dropout() performs the Dropout operation mentioned earlier:

#For HW: cell to change activation
class NonlinearClassifier(nn.Module):

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers_stack = nn.Sequential(
            nn.Linear(28*28, 50),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
           # nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
           # nn.Dropout(0.2),
            nn.Linear(50, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.layers_stack(x)

        return x
#For HW: cell to change learning rate
nonlinear_model = NonlinearClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.05)
%%time

epochs = 5
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, nonlinear_model, loss_fn, optimizer)

    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, nonlinear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")

    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, nonlinear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:08:59,101430][I][./<timed exec>:10:mnist] Epoch 0: training loss: 0.7994496553738912, accuracy: 77.17500000000001
[2025-08-06 14:08:59,379305][I][./<timed exec>:15:mnist] Epoch 0: val. loss: 0.7928039590094952, val. accuracy: 77.60833333333333
[2025-08-06 14:09:01,776063][I][./<timed exec>:10:mnist] Epoch 1: training loss: 0.41807308411598204, accuracy: 88.52291666666666
[2025-08-06 14:09:02,063954][I][./<timed exec>:15:mnist] Epoch 1: val. loss: 0.40851338334540105, val. accuracy: 88.59166666666667
[2025-08-06 14:09:04,377194][I][./<timed exec>:10:mnist] Epoch 2: training loss: 0.31588405946890513, accuracy: 91.10208333333333
[2025-08-06 14:09:04,633401][I][./<timed exec>:15:mnist] Epoch 2: val. loss: 0.3068072290179577, val. accuracy: 91.13333333333333
[2025-08-06 14:09:06,849967][I][./<timed exec>:10:mnist] Epoch 3: training loss: 0.2562710582613945, accuracy: 92.63125
[2025-08-06 14:09:07,108817][I][./<timed exec>:15:mnist] Epoch 3: val. loss: 0.2528020724495675, val. accuracy: 92.5
[2025-08-06 14:09:09,483188][I][./<timed exec>:10:mnist] Epoch 4: training loss: 0.2121300637324651, accuracy: 93.80625
[2025-08-06 14:09:09,745825][I][./<timed exec>:15:mnist] Epoch 4: val. loss: 0.21229731101304927, val. accuracy: 93.65833333333333
CPU times: user 12.4 s, sys: 834 ms, total: 13.2 s
Wall time: 13.1 s
# pltsize=1
# plt.figure(figsize=(10*pltsize, 10 * pltsize))
plt.figure()
plt.plot(range(epochs), train_acc_all,label = 'Training Acc.' )
plt.plot(range(epochs), val_acc_all, label = 'Validation Acc.' )
plt.xlabel('Epoch #')
plt.ylabel('Accuracy (%)')
plt.legend()

show_failures(nonlinear_model, test_dataloader)
[2025-08-06 14:09:09,823886][I][ipykernel_92984/2368214845:8:mnist] Showing max 10 first failures.
[2025-08-06 14:09:09,824680][I][ipykernel_92984/2368214845:11:mnist] The predicted class is shown first and the correct class in parentheses.

Recap

To train and validate a neural network model, you need:

  • Data split into training/validation/test sets,
  • A model with parameters to learn
  • An appropriate loss function
  • An optimizer (with tunable parameters such as learning rate, weight decay etc.) used to learn the parameters of the model.

Homework

  1. Compare the quality of your model when using different:
  • batch sizes
  • learning rates
  • activation functions
  2. Bonus: What is a learning rate scheduler?

If you have time, experiment with how to improve the model.

Note: training and validation data can be used to compare models, but test data should be saved until the end as a final check of generalization.

Homework solution

Make the following changes to the cells with the comment “#For HW”

#####################To modify the batch size##########################
batch_size = 32 # 64, 128, 256, 512

# The dataloader makes our dataset iterable 
train_dataloader = torch.utils.data.DataLoader(training_data, batch_size=batch_size)
val_dataloader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size)
##############################################################################


##########################To change the learning rate##########################
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.01) #modify the value of lr
##############################################################################


##########################To change activation##########################
###### Go to https://pytorch.org/docs/main/nn.html#non-linear-activations-weighted-sum-nonlinearity for more activations ######
class NonlinearClassifier(nn.Module):

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers_stack = nn.Sequential(
            nn.Linear(28*28, 50),
            nn.Sigmoid(), #nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.Tanh(), #nn.ReLU(),
           # nn.Dropout(0.2),
            nn.Linear(50, 50),
            nn.ReLU(),
           # nn.Dropout(0.2),
            nn.Linear(50, 10)
        )
        
    def forward(self, x):
        x = self.flatten(x)
        x = self.layers_stack(x)

        return x
##############################################################################

Bonus question: A learning rate scheduler is an essential deep learning technique used to dynamically adjust the learning rate during training. This strategy can significantly impact the convergence speed and overall performance of a neural network. See below for how to incorporate one into your training.

nonlinear_model = NonlinearClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(nonlinear_model.parameters(), lr=0.1)

# Step learning rate scheduler: multiply the learning rate by 0.1 every 2 epochs (only for illustrative purposes)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
%%time

epochs = 6
train_acc_all = []
val_acc_all = []
for j in range(epochs):
    train_one_epoch(train_dataloader, nonlinear_model, loss_fn, optimizer)
    #step the scheduler
    scheduler.step()

    # log the current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    logger.info(f"Epoch {j+1}/{epochs}, Learning Rate: {current_lr}")

    # checking on the training loss and accuracy once per epoch
    acc, loss = evaluate(train_dataloader, nonlinear_model, loss_fn)
    train_acc_all.append(acc)
    logger.info(f"Epoch {j}: training loss: {loss}, accuracy: {acc}")

    # checking on the validation loss and accuracy once per epoch
    val_acc, val_loss = evaluate(val_dataloader, nonlinear_model, loss_fn)
    val_acc_all.append(val_acc)
    logger.info(f"Epoch {j}: val. loss: {val_loss}, val. accuracy: {val_acc}")
[2025-08-06 14:09:11,862569][I][./<timed exec>:11:mnist] Epoch 1/6, Learning Rate: 0.1
[2025-08-06 14:09:13,137090][I][./<timed exec>:16:mnist] Epoch 0: training loss: 0.3418297287598252, accuracy: 89.94791666666667
[2025-08-06 14:09:13,464259][I][./<timed exec>:21:mnist] Epoch 0: val. loss: 0.33424193555116655, val. accuracy: 89.725
[2025-08-06 14:09:15,137657][I][./<timed exec>:11:mnist] Epoch 2/6, Learning Rate: 0.010000000000000002
[2025-08-06 14:09:16,329964][I][./<timed exec>:16:mnist] Epoch 1: training loss: 0.23566976040912171, accuracy: 92.98125
[2025-08-06 14:09:16,749356][I][./<timed exec>:21:mnist] Epoch 1: val. loss: 0.2289788018465042, val. accuracy: 92.77499999999999
[2025-08-06 14:09:18,453201][I][./<timed exec>:11:mnist] Epoch 3/6, Learning Rate: 0.010000000000000002
[2025-08-06 14:09:19,680275][I][./<timed exec>:16:mnist] Epoch 2: training loss: 0.21982640741268794, accuracy: 93.42708333333334
[2025-08-06 14:09:20,005618][I][./<timed exec>:21:mnist] Epoch 2: val. loss: 0.2157861268222332, val. accuracy: 93.16666666666666
[2025-08-06 14:09:21,798088][I][./<timed exec>:11:mnist] Epoch 4/6, Learning Rate: 0.0010000000000000002
[2025-08-06 14:09:23,134008][I][./<timed exec>:16:mnist] Epoch 3: training loss: 0.21438998909915488, accuracy: 93.56666666666666
[2025-08-06 14:09:23,443619][I][./<timed exec>:21:mnist] Epoch 3: val. loss: 0.21053465707600116, val. accuracy: 93.35
[2025-08-06 14:09:25,681837][I][./<timed exec>:11:mnist] Epoch 5/6, Learning Rate: 0.0010000000000000002
[2025-08-06 14:09:27,223263][I][./<timed exec>:16:mnist] Epoch 4: training loss: 0.21325495328381658, accuracy: 93.57916666666667
[2025-08-06 14:09:27,565179][I][./<timed exec>:21:mnist] Epoch 4: val. loss: 0.2093664672325055, val. accuracy: 93.33333333333333
[2025-08-06 14:09:29,530354][I][./<timed exec>:11:mnist] Epoch 6/6, Learning Rate: 0.00010000000000000003
[2025-08-06 14:09:30,760327][I][./<timed exec>:16:mnist] Epoch 5: training loss: 0.21277805399273833, accuracy: 93.60000000000001
[2025-08-06 14:09:31,060373][I][./<timed exec>:21:mnist] Epoch 5: val. loss: 0.2088593033850193, val. accuracy: 93.33333333333333
CPU times: user 18.8 s, sys: 3.41 s, total: 22.2 s
Wall time: 20.9 s

Citation

BibTeX citation:
@online{foreman2025,
  author = {Foreman, Sam and Ngom, Marieme and Zheng, Huihuo and Lusch,
    Bethany and Childers, Taylor},
  title = {Intro to {NNs:} {MNIST}},
  date = {2025-07-17},
  url = {https://saforem2.github.io/hpc-bootcamp-2025/01-neural-networks/1-mnist/},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam, Marieme Ngom, Huihuo Zheng, Bethany Lusch, and Taylor Childers. 2025. “Intro to NNs: MNIST.” July 17, 2025. https://saforem2.github.io/hpc-bootcamp-2025/01-neural-networks/1-mnist/.