Language models (LMs)
Author: Archit Vasan, including materials on LLMs by Varuni Sastri and Carlo Graziani at Argonne, and discussion/editorial work by Taylor Childers, Bethany Lusch, and Venkat Vishwanath (Argonne)
Modification by Huihuo Zheng on August 1, 2025
Inspiration from the blog posts “The Illustrated Transformer” and “The Illustrated GPT2” by Jay Alammar, highly recommended reading.
Although the name “language models” is derived from Natural Language Processing, the models used in these approaches can be applied to diverse scientific applications as illustrated below.
This session is dedicated to setting out the basics of sequential data modeling, and introducing a few key elements required for DL approaches to such modeling—principally Transformers.
Overview
During this session I will cover:
- Scientific applications modeling sequential data
- Brief History of Language Models
- Tokenization and embedding of sequential data
- Elements of a Transformer
- Attention mechanisms
- Output layers
- Training loops
- Different types of Transformers
Modeling Sequential Data
Sequences are variable-length lists of elements (tokens) in which each element depends on the elements that came before it.
Mathematically: A sequence is a list of tokens:
T = [t_1, t_2, t_3,...,t_N]
where each token within the list depends on the others with a particular probability:
P(t_2 | t_1, t_3, t_4, ..., t_N)
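For the autoregressive (next-token) models we focus on later, this joint structure is typically factorized left to right:
P(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} P(t_i \mid t_1, \ldots, t_{i-1})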
The purpose of sequential modeling is to learn these probability distributions over possible tokens so they can be used for various tasks, including:
- Sequence generation based on a prompt
- Language translation (e.g. English –> French)
- Property prediction (predicting a property based on an entire sequence)
- Identifying mistakes or missing elements in sequential data
Scientific sequential data modeling examples
Nucleic acid sequences + genomic data
Nucleic acid sequences can be used to predict translation of proteins, mutations, and gene expression levels.
Here is an image of GenSLM, a language model developed by Argonne researchers that models whole-genome information in a single model. It was shown to capture the evolution of SARS-CoV-2 without expensive experiments.
Protein sequences
Protein sequences can be used to predict folding structure, protein-protein interactions, chemical/binding properties, protein function and many more properties.
Other applications:
- Biomedical text
- SMILES strings
- Weather predictions
- Interfacing with simulations such as molecular dynamics simulation
Overview of Language models
We will now briefly talk about the progression of language models.
RNNs
Recurrent Neural Networks (RNNs) were the traditional models used to capture temporal dependencies within data.
In RNNs, the hidden state from the previous time step is fed back into the network, allowing it to maintain a “memory” of past inputs.
They were ideal for tasks with short sequences such as natural language processing and time-series prediction.
However, these networks had significant challenges.
- Slow to train: RNNs process data one element at a time, maintaining an internal hidden state that is updated at each step. They operate recurrently, where each output depends on the previous hidden state and the current input; thus, computation cannot be parallelized across the sequence (a minimal sketch of this recurrence follows this list).
- Cannot handle long sequences: Exploding and vanishing gradients limit how long a sequence an RNN can model. Variants such as LSTMs and GRUs mitigate this problem, but they still struggle with very long sequences.
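To make the recurrence concrete, here is a minimal sketch (not from the original materials) of a single RNN cell processing a toy sequence one step at a time:
import torch
import torch.nn as nn

# One RNN cell applied step by step: the hidden state h carries a "memory"
# of all previous inputs, and each step must wait for the previous one.
rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)

seq_len, batch = 5, 1
seq = torch.randn(seq_len, batch, 8)  # a toy sequence of 5 input vectors
h = torch.zeros(batch, 16)            # initial hidden state

for t in range(seq_len):              # inherently sequential: no parallelism over t
    h = rnn_cell(seq[t], h)
print(h.shape)                        # torch.Size([1, 16])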
Transformers
The newest LMs, referred to as “large language models” (LLMs) because of their large parameter counts, were developed to address many of these challenges.
These models base their design on the Transformer architecture, introduced in 2017 in the “Attention is all you need” paper.
Since then, a multitude of LLM architectures have been designed.
The power of these models comes from the “attention mechanism” defined in the seminal Vaswani et al. (2017) paper.
Coding example of LLMs in action!
Let’s look at an example of running inference with an LLM as a black box to generate text from a prompt; later we will also set up a training loop for an LLM.
Here, we will use the transformers library from Hugging Face, a repository of models, tokenizers, and information on how to apply them.
Warning: Large Language Models are only as good as their training data. They have no ethics, judgement, or editing ability. We will be using some pretrained models from Hugging Face that were trained on broad samples of internet-hosted text. These datasets have not been strictly filtered to remove all malign content, so the generated text may be surprisingly dark or questionable. It does not reflect our values and is used for demonstration purposes only.
%load_ext autoreload
%autoreload 2
%matplotlib inline
# settings for jupyter book: svg for html version, high-resolution png for pdf
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('retina', 'svg', 'png')
import matplotlib as mpl
# mpl.rcParams['figure.dpi'] = 400
from rich import print
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
input_text = "I got an A+ in my final exam; I am very"
from transformers import pipeline
generator = pipeline("text-generation", model="openai-community/gpt2")
print(
    [
        i["generated_text"]
        for i in generator(input_text, max_length=20, num_return_sequences=5)
    ]
)
Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
[ "I got an A+ in my final exam; I am very disappointed.\n\nI'm not sure if I'll ever get my grades back, but I'm sure it's a good thing!\n\nI'm still looking for other ways to help, and I hope you'll join in!\n\n-Joe\n\nUPDATE: I got this email from a reader:\n\nDear Joe,\n\nI was just looking for a way to get my C grades back and I received this email from a reader stating that they are good because they gave me an A+ in my final exam. I was surprised to see that they gave me a B+ in my final exam, as I am a pretty good student.\n\nThanks for your email, though. I'm glad I have received a letter from you, too.\n\nIf you're interested in getting your A+ in your final exam, you can get it today by clicking here.\n\nThank you for taking the time to send me this email, but please know that you may want to consider making a donation to help me.\n\nUntil next time!\n\n-Tom", 'I got an A+ in my final exam; I am very happy with my results. I would like to thank the faculty at the University of Arkansas for their time and effort.\n\nThank you,\n\nColleen F.', "I got an A+ in my final exam; I am very happy with my study as it has allowed me to take the A-level with an emphasis on study and teaching.\n\nI've also received many positive feedback from my students, and I have been able to gain valuable feedback on my writing skills. I have been given a lot of support from my faculty and staff, including teachers and students, to help me develop my writing techniques and focus and build a better writing vocabulary.\n\nI am very grateful to all of you for your support and encouragement.\n\nSincerely,\n\nBarry", 'I got an A+ in my final exam; I am very happy I did. I am very proud of my performance and I hope I am getting a second chance at the same level I did.\n\n(3/17/14: I had a second chance at my second attempt and I am very proud of it.)\n\n(3/17/14: I am very happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am very happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am very happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am extremely happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am extremely happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am extremely happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am extremely happy with my second attempt and I am very proud of it.)\n\n(3/17/14: I am extremely happy with my second attempt and I am very proud of it.)\n\n(', "I got an A+ in my final exam; I am very pleased with my grades. I think this year is going to be a tough year for me. I could use a few more years of the hard work of getting my mind off the subject. I could also use some more experience in the field of psychology, or even more time with my team. I have never been a good teacher, and I have no way to make my students feel comfortable about that.\n\nWhat do you do when you're an outlier?\n\nI am very excited to have a great career and to be able to continue to produce at a high level for the next five years. I am excited to have a partner that will be able to help me with the next five years. I have been working on my career as a writer, and I hope to be working with other writers soon.\n\nWhat's your career plan like?\n\nI have been working my way up the ladder for about three years and have been fortunate enough to have an amazing team of people who are amazing at what they do. 
I am very happy with my situation and very happy to be working with them.\n\nYou've been writing for a long time. How did you get started writing for a publisher?\n\nI started writing after college. I started" ]
We can also train a language model given input data:
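As a rough sketch of what such a training step can look like (this is illustrative only, using a tiny made-up batch of sentences rather than a real dataset):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

texts = [
    "I got an A+ in my final exam; I am very happy.",
    "The simulation finished after ten thousand steps.",
]                                                   # a tiny made-up "dataset"
batch = tokenizer(texts, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100         # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(3):                               # a few demonstration steps
    outputs = model(**batch, labels=labels)         # causal LM loss (next-token prediction)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss = {outputs.loss.item():.3f}")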
What’s going on under the hood?
There are two components that are “black-boxes” here:
- The method for tokenization
- The model that generates novel text.
Image credit: https://blog.floydhub.com/tokenization-nlp/
Tokenization
from transformers import AutoTokenizer
# A utility function to tokenize a sequence and print out some information about it.
def tokenization_summary(tokenizer, sequence):
    # get the vocabulary
    vocab = tokenizer.vocab
    # Number of entries to print
    n = 10
    # Print a subset of the vocabulary
    print("Subset of tokenizer.vocab:")
    for i, (token, index) in enumerate(tokenizer.vocab.items()):
        print(f"{token}: {index}")
        if i >= n - 1:
            break
    print("Vocab size of the tokenizer = ", len(vocab))
    print("------------------------------------------")
    # .tokenize chunks the sequence into tokens based on the rules and vocab of the tokenizer.
    tokens = tokenizer.tokenize(sequence)
    print("Tokens : ", tokens)
    print("------------------------------------------")
    # .convert_tokens_to_ids or .encode converts the tokens to their numerical representation.
    # .convert_tokens_to_ids has a 1-1 mapping between tokens and numerical representation
    # ids = tokenizer.convert_tokens_to_ids(tokens)
    # print("encoded Ids: ", ids)
    # .encode may also add special tokens, such as start-of-sequence and end-of-sequence tokens
    print("tokenized sequence : ", tokenizer.encode(sequence))
    # calling tokenizer(sequence) additionally returns the attention_mask.
    # encode = tokenizer(sequence)
    # print("Encode sequence : ", encode)
    # print("------------------------------------------")
    # .decode converts the ids back to raw text
    ids = tokenizer.convert_tokens_to_ids(tokens)
    decode = tokenizer.decode(ids)
    print("Decode sequence : ", decode)
tokenizer_1 = AutoTokenizer.from_pretrained(
    "gpt2"
)  # GPT-2 uses "Byte-Pair Encoding (BPE)"
sequence = "I got an A+ in my final exam; I am very"
tokenization_summary(tokenizer_1, sequence)
Subset of tokenizer.vocab:
ĠSikh: 34629
wcs: 12712
force: 3174
ount: 608
either: 31336
Ġsmear: 35180
ĠPhoenix: 9643
Ġpreacher: 39797
imum: 2847
Susan: 45842
Vocab size of the tokenizer = 50257
------------------------------------------
Tokens : ['I', 'Ġgot', 'Ġan', 'ĠA', '+', 'Ġin', 'Ġmy', 'Ġfinal', 'Ġexam', ';', 'ĠI', 'Ġam', 'Ġvery']
------------------------------------------
tokenized sequence : [40, 1392, 281, 317, 10, 287, 616, 2457, 2814, 26, 314, 716, 845]
Decode sequence : I got an A+ in my final exam; I am very
Token embedding:
Words are turned into vectors based on their location within a vocabulary.
The strategy of choice for learning language structure from tokenized text is to find a clever way to map each token into a moderate-dimension vector space, adjusting the mapping so that similar or associated tokens end up near each other, and different regions of the space correspond to different roles within the sequence. Such a mapping from token ID to a point in a vector space is called a token embedding. The dimension of the vector space is often high (e.g. 1024-dimensional), but much smaller than the vocabulary size (30,000–500,000).
Various approaches have been attempted for generating such embeddings, including static algorithms that operate on a corpus of tokenized data as preprocessors for NLP tasks. Transformers, however, adjust their embeddings during training.
We can visualize these embeddings of the popular BERT model using PCA!
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import umap
from nltk.corpus import stopwords
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizer
nltk.download("stopwords")
import torch
# Load BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
if True:
    text = "The diligent student diligently studied hard for his upcoming exams He was incredibly conscientious in his efforts and committed himself to mastering every subject"
    # Tokenize and get BERT embeddings
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.squeeze(
        0
    ).numpy()  # Shape: (num_tokens, 768) for BERT-base
    # Get the list of token labels without special tokens and subword tokens
    labels = [
        tokenizer.convert_ids_to_tokens(id) for id in tokens.input_ids[0].tolist()
    ]
    filtered_labels = [
        label
        for label in labels
        if not (label.startswith("[") and label.endswith("]")) and "##" not in label
    ]
    # Remove stopwords from labels and embeddings
    stop_words = set(stopwords.words("english"))
    filtered_labels = [
        label for label in filtered_labels if label.lower() not in stop_words
    ]
    filtered_embeddings = embeddings[: len(filtered_labels)]
    # Perform PCA for dimensionality reduction (3D)
    pca = PCA(n_components=3)
    embeddings_pca = pca.fit_transform(filtered_embeddings)
    # Convert embeddings and labels to DataFrame for Plotly
    data_pca = {
        "x": embeddings_pca[:, 0],
        "y": embeddings_pca[:, 1],
        "z": embeddings_pca[:, 2],
        "label": filtered_labels,
    }
    df_pca = pd.DataFrame(data_pca)
    # Plot PCA in 3D with Plotly (interactive)
    fig_pca = px.scatter_3d(
        df_pca,
        x="x",
        y="y",
        z="z",
        text="label",
        title="PCA 3D Visualization of Token Embeddings",
        labels={"x": "Dimension 1", "y": "Dimension 2", "z": "Dimension 3"},
        hover_name="label",
    )
    fig_pca.update_traces(marker=dict(size=5), textfont=dict(size=8))
    fig_pca.show()
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/samforeman/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
You should see common words grouped together!
Elements of a Transformer
Now let’s look at the base elements that make up a Transformer by dissecting the popular GPT2 model
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
GPT2 is an example of a Transformer Decoder which is used to generate novel text.
Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
The pretraining of decoder models usually revolves around predicting the next word in the sentence.
These models are best suited for tasks involving text generation.
Examples of these include:
- CTRL
- GPT
- GPT-2
- Transformer XL
Let’s discuss one of the most popular models, GPT-2 in a little more detail.
The architecture of GPT-2 is inspired by the paper “Generating Wikipedia by Summarizing Long Sequences”, which describes another arrangement of the Transformer block that can do language modeling. This arrangement throws away the encoder, and is therefore known as the “Transformer-Decoder”.
Image credit: https://jalammar.github.io/illustrated-gpt2/
The Transformer-Decoder is composed of Decoder blocks stacked on top of each other, where each block contains two types of layers:
1. Masked Self-Attention
2. Feed-Forward Neural Networks
In this lecture, we will:
- First, discuss attention mechanisms at length, as this is arguably the greatest contribution of Transformers.
- Second, extend the discussion from last week (https://github.com/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/Sequential_Data_Models.ipynb) on embedding input data while taking position into account.
- Third, discuss outputting real text/sequences from the models.
- Fourth, build a training loop for a mini-LLM.
## IMPORTS
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = "cuda" if torch.cuda.is_available() else "cpu"
eval_iters = 200
n_embd = 64
n_head = 4 ## so head_size = 16
n_layer = 4
dropout = 0.0
# ------------
torch.manual_seed(1337)
<torch._C.Generator at 0x11d729970>
Attention mechanisms
Suppose the following sentence is an input sentence we want to translate using an LLM:
”The animal didn't cross the street because it was too tired”
Earlier, we mentioned that the Transformer learns an embedding of all words allowing interpretation of meanings of words.
So, if the model did a good job in token embedding, it will “know” what all the words in this sentence mean.
But to understand a full sentence, the model also needs to understand what each word means in relation to the other words.
For example, when we read the sentence: ”The animal didn't cross the street because it was too tired” we know intuitively that the word "it" refers to "animal", the state for "it" is "tired", and the associated action is "didn't cross".
However, the model needs a way to learn all of this information in a simple yet generalizable way. What makes Transformers particularly powerful compared to earlier sequential architectures is how they encode context with the self-attention mechanism.
As the model processes each word in the input sequence, attention lets it look at other positions in the input sequence for clues that lead to a better understanding of this word.
Image credit: https://jalammar.github.io/illustrated-transformer/
Self-attention mechanisms use 3 vectors to encode the context of a word in a sequence with respect to another word:
1. Query: the representation of the current word, which we score against the other words’ keys
2. Key: a label for each word in the sequence that we match against the query
3. Value: the actual word representation; the query-key scores determine how much of each value contributes to the output
An analogy provided by Jay Alammar is thinking about attention as choosing a file from a file cabinet according to information on a post-it note. You can use the post-it note (query) to identify the folder (key) that most matches the topic you are looking up. Then you access the contents of the file (value) according to its relevance to your query.
Image credit: https://jalammar.github.io/illustrated-gpt2/
In our models, we can encode queries, keys, and values using simple linear layers of the same size: each projects a token embedding down to a vector of dimension head_size, producing matrices of shape (sequence length, head_size). During the training process, these layers are updated to best encode context.
The algorithm for self-attention is as follows:
- Generate query, key and value vectors for each word
- Calculate a score for each word in the input sentence against each other.
- Divide the scores by the square root of the dimension of the key vectors to stabilize the gradients. This is then passed through a softmax operation.
- Multiply each value vector by the softmax score.
- Sum up the weighted value vectors to produce the output.
Image credit: https://jalammar.github.io/illustrated-transformer/
Let’s see how attention is performed in the code.
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)
# Here we want wei to be data dependent - i.e., gather info from the past, but in a data-dependent way
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(
    x
)  # (B, T, 16) # each token (B*T of them in total) produces a key and a query in parallel and independently
q = query(x)  # (B, T, 16)
v = value(x)
wei = (
    q @ k.transpose(-2, -1) * head_size**-0.5
)  # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
wei = F.softmax(
    wei, dim=-1
)  # exponentiate and normalize, giving a distribution that sums to 1;
# in a data-dependent manner, it tells us how much information to aggregate
out = wei @ v  # aggregate the attention scores and value vectors
tensor([[ 0.0618, -0.0091, -0.3488, 0.3208, 0.2971, -0.1573, -0.0561, 0.1068, 0.0368, 0.0139, -0.0017, 0.3110, 0.1404, -0.0158, 0.1853, 0.4290], [ 0.1578, -0.0971, -0.4256, 0.3538, 0.3621, -0.2392, -0.0536, 0.1759, 0.1115, 0.0282, -0.0649, 0.3641, 0.1928, 0.0261, 0.2162, 0.3758], [ 0.1293, 0.0759, -0.2946, 0.2292, 0.2215, -0.0710, -0.0107, 0.1616, -0.0930, -0.0877, 0.0567, 0.1899, 0.0311, -0.0894, 0.0309, 0.5471], [ 0.1247, 0.1400, -0.2436, 0.1819, 0.1976, 0.0338, -0.0028, 0.1124, -0.1477, -0.0748, 0.0650, 0.1392, -0.0314, -0.0989, 0.0613, 0.5433], [ 0.0667, 0.1845, -0.2135, 0.2813, 0.2064, 0.0873, 0.0084, 0.2055, -0.1130, -0.1466, 0.0459, 0.1923, -0.0275, -0.1107, 0.0065, 0.4674], [ 0.1924, 0.1693, -0.1568, 0.2284, 0.1620, 0.0737, 0.0443, 0.2519, -0.1912, -0.1979, 0.0832, 0.0713, -0.0826, -0.0848, -0.1047, 0.6089], [ 0.1184, 0.0884, -0.2652, 0.2560, 0.1840, 0.0284, -0.0621, 0.1181, -0.0880, 0.0104, 0.1123, 0.1850, 0.0369, -0.0730, 0.0663, 0.5242], [ 0.1243, 0.0453, -0.3412, 0.2709, 0.2335, -0.0948, -0.0421, 0.2143, -0.0330, -0.0313, 0.0520, 0.2378, 0.1084, -0.0959, 0.0300, 0.4707]], grad_fn=<SelectBackward0>)
Multi-head attention
In practice, multiple attention heads are used, which:
1. Expands the model’s ability to focus on different positions and prevents the attention from being dominated by the word itself.
2. Gives the model multiple “representation subspaces”: each head has its own set of Query/Key/Value weight matrices.
Image credit: https://jalammar.github.io/illustrated-transformer/
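For comparison, PyTorch also ships a built-in nn.MultiheadAttention module; here is a quick sketch reusing the toy tensor x from above (B=4, T=8, C=32):
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
attn_out, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x
print(attn_out.shape)      # torch.Size([4, 8, 32])
print(attn_weights.shape)  # torch.Size([4, 8, 8]), averaged over the 4 heads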
Let’s see attention mechanisms in action!
We are going to use the powerful visualization tool bertviz, which allows an interactive experience of the attention mechanisms. Normally these mechanisms are abstracted away but this will allow us to inspect our model in more detail.
Let’s load in the model, GPT2 and look at the attention mechanisms.
Hint… click on the different blocks in the visualization to see the attention
from bertviz import model_view
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, utils
utils.logging.set_verbosity_error() # Suppress standard warnings
model_name = "openai-community/gpt2"
input_text = "The animal didn't cross the street because it was too tired"
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors="pt") # Tokenize input text
outputs = model(inputs) # Run model
attention = outputs[-1] # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(
    inputs[0]
)  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view
Positional encoding
We just discussed attention mechanisms, which account for context between words. Another question we should ask is: how do we account for the order of words in an input sentence?
Consider the following two sentences to see why this is important:
The man ate the sandwich.
The sandwich ate the man.
Clearly, these are two vastly different situations even though they contain the same words.
Transformers differentiate between these situations by adding a positional encoding vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word.
Image credit: https://medium.com/@xuer.chen.human/llm-study-notes-positional-encoding-0639a1002ec0
We set up the positional encoding similarly to the token embedding, using the nn.Embedding tool. We use a simple learned embedding here, but more complex positional encodings, such as sinusoidal encodings, are also used.
For an explanation of different positional encodings, refer to this post: https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
You will notice the positional encoding size is (block_size, n_embd) because it encodes the position of a token within a sequence of length block_size.
The position embedding is then simply added to the token embedding.
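As a minimal sketch (assuming the hyperparameters defined earlier and a toy vocabulary size, which is not part of the original cell), token and positional embeddings can be combined like this:
vocab_size_demo = 65                                                # assumed toy vocabulary (e.g., characters)
token_embedding_table = nn.Embedding(vocab_size_demo, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size_demo, (batch_size, block_size))   # toy (B, T) token ids
tok_emb = token_embedding_table(idx)                                # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))        # (T, n_embd)
x = tok_emb + pos_emb                                               # broadcast add -> (B, T, n_embd)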
Let’s look at token embedding alone:
tensor([ 0.7221, -0.9629, -2.0578, 1.9740, 0.7434, 1.1139, 0.6926, 0.0296, 0.6405, -1.6464, 0.4935, 0.7485, 0.9238, -0.4940, 0.4814, -0.3859, -0.3094, 1.1066, -0.2891, 0.1891, 2.0440, -0.7945, -0.4331, 0.3007, 1.4317, 0.2881, -0.4343, 0.4280, 1.2469, 1.4047, -0.3404, -2.2190, 0.4893, 0.0398, -0.2717, -2.2400, -0.0029, -1.4251, 0.7330, 0.3551, 0.1472, -1.1895, -0.8407, 0.3134, -0.6709, -0.8176, 0.6929, -0.6374, 0.3174, 0.4837, -0.0073, -1.5924, 1.8606, -1.2910, -0.1594, 0.3111, -0.1536, -0.3414, -0.0170, -0.1633, 0.2794, 0.6755, 0.7066, -1.6665], grad_fn=<SelectBackward0>)
And token + positional embeddings:
tensor([ 0.4326, -1.6287, -0.8684, 3.0704, 0.3646, 1.9826, 0.7582, -0.1918, 1.0491, -2.2562, -0.4931, -0.7808, 1.7206, -1.0297, 2.0798, -1.3427, -0.7896, -0.1746, 0.0926, 0.0543, 2.3831, -0.6208, 0.3902, 0.1097, 1.0455, -1.4557, 0.3402, 2.6717, 1.8380, 1.2628, -0.4831, -4.6023, 0.6959, 1.0347, 0.5903, -0.7541, 0.4682, -0.3895, 2.1526, 0.6272, -0.8558, -0.8434, 0.1311, -1.0272, -2.0580, 0.0584, 0.3442, -0.3464, -0.3444, 2.3134, -1.1142, -1.4629, 3.3503, -2.0594, 1.4105, 0.4558, -1.3366, 1.9283, 1.5187, 0.3906, 1.1448, -0.8422, 2.2692, -0.7949], grad_fn=<SelectBackward0>)
You can see a clear offset between these two embeddings.
During the training process, these embedding tables are learned so that they best encode the token identities and positions of the sequences.
Output layers
At the end of our Transformer model, we are left with a vector, so how do we turn this into a word?
We use a final Linear layer followed by a Softmax layer. The Linear layer projects the vector produced by the stack of decoders into a much larger vector called the logits vector.
If our model knows 10,000 unique English words learned from its training dataset, the logits vector is 10,000 cells wide; each cell holds the score of one word.
The softmax layer turns those scores into probabilities. The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
Image credit: https://jalammar.github.io/illustrated-transformer/
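A minimal sketch of this output block, assuming a toy vocabulary of 10,000 words to match the prose above (n_embd, torch, nn, and F come from the earlier cells):
demo_vocab_size = 10_000
lm_head = nn.Linear(n_embd, demo_vocab_size)   # the final Linear layer
hidden = torch.randn(1, n_embd)                # stand-in for the decoder stack's output vector
logits = lm_head(hidden)                       # (1, 10000) "logits vector"
probs = F.softmax(logits, dim=-1)              # scores -> probabilities
next_token_id = torch.argmax(probs, dim=-1)    # pick the highest-probability word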
Training
How does an LLM improve over time? Our model produces a probability distribution for each generated token, and we want to compare these distributions to the ground truth. For example, consider translating the sentence “je suis étudiant” into “i am a student”, as can be seen in the example:
Image credit: https://jalammar.github.io/illustrated-transformer/
The model can calculate the loss between the vector it generates and the ground truth vector seen in this example. A commonly used loss function is cross entropy loss:
\mathrm{CE} = -\sum_{x \in X} p(x) \log q(x)
where p(x) represents the true distribution and q(x) represents the predicted distribution.
tensor(0.9119)
Another important metric commonly used in LLMs is perplexity.
Intuitively, perplexity means to be surprised. We measure how much the model is surprised by seeing new data. The lower the perplexity, the better the training is.
Mathematically, perplexity is just the exponential of the cross-entropy loss:
\text{perplexity} = \exp(\mathrm{CE})
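As a small illustrative computation (the logits and target below are made up, not taken from the notebook):
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])  # made-up scores over a 4-word vocabulary
target = torch.tensor([0])                      # the true next word has index 0
ce = F.cross_entropy(logits, target)            # equals -log q(true word)
perplexity = torch.exp(ce)
print(ce.item(), perplexity.item())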
Let’s train a mini-LLM from scratch
Set up hyperparameters:
# hyperparameters
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 500
eval_interval = 50
learning_rate = 1e-3
device = "cuda" if torch.cuda.is_available() else "cpu"
eval_iters = 200
n_embd = 64
n_head = 4 ## so head_size = 16
n_layer = 4
dropout = 0.0
# ------------
Load in data and create train and test datasets
We’re going to use the tiny Shakespeare dataset. The data is tokenized with a simple character-based tokenizer and split into train and validation sets (a 9:1 split) so we have something to evaluate after training.
with open("input.txt", "r", encoding="utf-8") as f:
text = f.read()
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [
stoi[c] for c in s
] # encoder: take a string, output a list of integers
decode = lambda l: "".join(
[itos[i] for i in l]
) # decoder: take a list of integers, output a string
# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# data loading
def get_batch(split):
# generate a small batch of data of inputs x and targets y
data = train_data if split == "train" else val_data
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i : i + block_size] for i in ix])
y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
x, y = x.to(device), y.to(device)
return x, yFirst Citizen: Before we proceed any further, hear me speak. All: Speak, speak. First Citizen: You are all resolved rather to die than to famish? All: Resolved. resolved. First Citizen: First, you know Caius Marcius is chief enemy to the people. All: We know't, we know't. First Citizen: Let us kill him, and we'll have corn at our own price. Is't a verdict? All: No more talking on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if they would yield us but the superfluity, while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, ere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.
Set up the components of the Decoder block:
- MultiHeadAttention
- FeedForward Network
class Head(nn.Module):
    """one head of self-attention"""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B,T,C) 16,32,16
        q = self.query(x)  # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5  # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,C)
        out = wei @ v  # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """a simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(
                4 * n_embd, n_embd
            ),  # Projection layer going back into the residual pathway
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
Combine components into the Decoder block
class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))  # Communication
        x = x + self.ffwd(self.ln2(x))  # Computation
        return x
Set up the full Transformer model
This is a combination of the Token embeddings, Positional embeddings, a stack of Transformer blocks and an output block.
# super simple language model
class LanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head=n_head) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb  # (B,T,C)
        x = self.blocks(x)  # (B,T,C)
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
Homework
In this notebook, we learned the various components of an LLM. For homework:
- Take the mini-LLM we created from scratch and run your own training loop. Show how the training and validation perplexity change over the training steps. Hint: the helper function sketched after this list might be useful for you.
- Run the same training loop, but modify one of the hyperparameters (for example, from the hyperparameter cell above). Do this at least 4 times with different values and plot each run’s perplexity over training steps. Output some generated text from each model you trained. Did the output make more sense with some hyperparameters than others?
- We saw a cool visualization of attention mechanisms with BertViz. Take a more complicated model than GPT-2, such as “meta-llama/Llama-2-7b-chat-hf”, and see how the attention mechanisms differ.
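A possible helper for the hint above (this is an assumption about the intended function, in the style of the nanoGPT tutorial, using get_batch and eval_iters defined earlier):
@torch.no_grad()
def estimate_loss(model):
    # Average the loss over a few random batches from each split;
    # the corresponding perplexity is torch.exp(out[split]).
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out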
Different types of Transformers
Encoder-Decoder architecture
Incorporates both an encoder + decoder architecture
The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence.
In the decoder, the self-attention layer only attends to earlier positions in the output sequence. The future positions are masked (setting them to -inf) before the softmax step in the self-attention calculation.
The “Encoder-Decoder Attention” layer creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output.
The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did.
And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
Image credit: https://jalammar.github.io/illustrated-transformer/
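As an illustrative sketch (the model choice here is just an example, not from the original notebook), a small pretrained encoder-decoder model such as t5-small can be used for translation through the Hugging Face pipeline:
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("I am a student.", max_length=40))  # returns a list like [{'translation_text': ...}]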
Encoder-only Transformers
In addition to the encoder-decoder architecture shown here, there are various other architectures that use only the encoder or only the decoder.
Bidirectional Encoder Representations from Transformers (BERT) model
Encoder-only models only use the encoder layer of the Transformer.
These models are usually used for “understanding” natural language; however, they typically are not used for text generation. Examples of uses for these models are:
- Determining how positive or negative a movie’s reviews are. (Sentiment Analysis)
- Summarizing long legal contracts. (Summarization)
- Differentiating words that have multiple meanings (like ‘bank’) based on the surrounding text. (Polysemy resolution)
These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models. The attention mechanisms of these models can access all the words in the initial sentence.
The most common encoder only architectures are:
- ALBERT
- BERT
- DistilBERT
- ELECTRA
- RoBERTa
As an example, let’s consider the BERT model in a little more detail.
Image credit: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
The BERT model is bidirectionally trained to have a deeper sense of language context and flow than single-direction language models.
The Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.
To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:
- A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
- A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
Image credit: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
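A small sketch of this preprocessing: the Hugging Face BERT tokenizer adds the [CLS] and [SEP] tokens and a segment (sentence A/B) id for each token when given a sentence pair (the sentences here are just examples):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The man went to the store.", "He bought a gallon of milk.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])                        # 0 for sentence A tokens, 1 for sentence B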
To predict if the second sentence is indeed connected to the first, the following steps are performed:
- The entire input sequence goes through the Transformer model.
- The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
- The probability of IsNextSequence is calculated with softmax.
Advantages and disadvantages:
Advantages:
- Contextualized embeddings: Good for tasks where contextualized embeddings of input tokens are crucial, such as natural language understanding.
- Parallel processing: Allows for parallel processing of input tokens, making it computationally efficient.
Disadvantages:
- Not designed for sequence generation: Might not perform well on tasks that require sequential generation of output, as there is no inherent mechanism for auto-regressive decoding.
Here is an example of how a BERT-style model can be used for one of these language-understanding tasks.
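For instance, a minimal sketch (illustrative, not the original notebook’s code) of masked-word prediction with the Hugging Face fill-mask pipeline and bert-base-uncased:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The student studied hard for the final [MASK].", top_k=3):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")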
Decoder-only models
An important difference between the GPT-2 architecture and encoder-style Transformer architectures is the type of attention mechanism used.
In models such as BERT, the self-attention mechanism has access to tokens to the left and right of the query token. However, in decoder-based models such as GPT-2, masked self-attention is used instead which allows access only to tokens to the left of the query.
The masked self-attention mechanism is important for GPT-2 since it allows the model to be trained for token-by-token generation without simply “memorizing” the future tokens.
Image credit: https://jalammar.github.io/illustrated-gpt2/
Masked self-attention builds in an understanding of associated words to provide context for a given word before its representation is passed through a neural network. It assigns a score to how relevant each preceding word in the segment is, sums up the weighted vector representations, and passes the result through the feed-forward network to produce an output vector.
Image credit: https://jalammar.github.io/illustrated-gpt2/
The resulting vector then needs to be converted into an output token. A common method of obtaining this output token is known as top-k sampling.
The output vector is multiplied by the token embedding matrix, which (after a softmax) yields a probability for each token in the vocabulary. The output token is then sampled from the k most probable candidates.
Image credit: https://jalammar.github.io/illustrated-gpt2/
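A minimal sketch of top-k sampling from a logits vector (the values here are random and purely illustrative):
import torch
import torch.nn.functional as F

logits = torch.randn(50257)                     # random scores over a GPT-2-sized vocabulary
k = 40
topk_scores, topk_ids = torch.topk(logits, k)   # keep only the k highest-scoring tokens
probs = F.softmax(topk_scores, dim=-1)          # renormalize over the top k
next_token_id = topk_ids[torch.multinomial(probs, num_samples=1)]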
Advantages and disadvantages
Advantages:
- Auto-regressive generation: Well-suited for tasks that require sequential generation, as the model can generate one token at a time based on the previous tokens.
- Variable-length output: Can handle tasks where the output sequence length is not fixed.
Disadvantages:
- No direct access to input context: The decoder doesn’t directly consider the input context during decoding, which might be a limitation for certain tasks.
- Potential for inefficiency: Decoding token by token can be less computationally efficient compared to parallel processing.
Additional architectures
In addition to text, LLMs have also been applied to other data sources, such as images and graphs. Here I will describe two particular architectures:
1. Vision Transformers
2. Graph Transformers
Vision Transformers
The Vision Transformer (ViT) is an architecture that uses self-attention mechanisms to process images.
The way this works is (the first few steps are also sketched in code below):
- Split image into patches (size is fixed)
- Flatten the image patches
- Create lower-dimensional linear embeddings from these flattened image patches and include positional embeddings
- Feed the sequence as an input to a transformer encoder
- Pre-train the ViT model with image labels (fully supervised on a large dataset)
- Fine-tune on the downstream dataset for image classification
Image credit: Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
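To make the first few steps concrete, here is a minimal sketch (illustrative, with an assumed patch size of 16 and embedding dimension of 768) of extracting, flattening, and embedding image patches:
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                                # (batch, channels, H, W)
patch = 16
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)    # cut into a 14x14 grid of 16x16 patches
patches = patches.contiguous().view(1, 3, -1, patch, patch)      # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)              # (1, 196, 768): flattened patches
patch_embed = nn.Linear(3 * patch * patch, 768)                  # linear patch embedding
pos_embed = torch.zeros(1, patches.shape[1], 768)                # (learnable) positional embeddings
tokens = patch_embed(patches) + pos_embed                        # sequence fed to the Transformer encoder
print(tokens.shape)                                              # torch.Size([1, 196, 768])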
Graph Transformers
References
Here are some recommendations for further reading and additional code for review.
- “The Illustrated Transformer” by Jay Alammar
- “Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)”
- “The Illustrated GPT-2 (Visualizing Transformer Language Models)”
- “A gentle introduction to positional encoding”
- “LLM Tutorial Workshop (Argonne National Laboratory)”
- “LLM Tutorial Workshop Part 2 (Argonne National Laboratory)”
Citation
@online{foreman2025,
author = {Foreman, Sam},
title = {Language Models {(LMs)}},
date = {2025-08-05},
url = {https://saforem2.github.io/hpc-bootcamp-2025/02-llms/00-intro-to-llms/},
langid = {en}
}















