2023-10-13
October 10 – 12, 2023
ALCF Hands-on
HPC Workshop
{ML, LLMs, AI4Science, HEP, Lattice QCD, MCMC, Generative Modeling, ...}
Ongoing / recent work:
Emergent abilities of Large Language Models (Yao et al. 2023)
Data collection + preprocessing
Pre-training
{model_size, hyperparameters, parallelism, lr_schedule, ...}
Supervised Fine-Tuning
Deploy (+ monitor, re-evaluate, etc.)
Vaswani et al. (2017)
Modern parallelism techniques enable the training of large language models
The same setup is replicated multiple times, with each replica being fed a slice of the data.
The processing is done in parallel and all setups are synchronized at the end of each training step.
DDP
Distributed Data Parallel (DDP) training is natively supported in PyTorch via torch.nn.parallel.DistributedDataParallel; a minimal sketch is shown below.
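As a concrete (illustrative) example of data parallelism, here is a minimal DDP sketch; it is not part of the workshop repo, and it assumes a torchrun-style launcher that sets LOCAL_RANK, with a single nn.Linear standing in for a real model:

```python
# Minimal DDP sketch (illustrative). Launch with, e.g.:
#   torchrun --nproc_per_node=4 ddp_example.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each replica holds a full copy of the model...
    model = DDP(
        torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}"),
        device_ids=[local_rank],
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # ...but sees only its own slice of the data
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()                            # gradients all-reduced across replicas
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each process owns one GPU and one shard of the data; the gradient synchronization at the end of every step happens automatically inside backward().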
In Tensor Parallelism, each GPU processes only a slice of a tensor, and the full tensor is aggregated only for operations that require the whole thing.

The main building block of any transformer is a fully connected nn.Linear followed by a nonlinear activation GeLU:

Y = GeLU(XA),

where X and Y are the input and output vectors, and A is the weight matrix. Looking at the computation in matrix form, it is easy to see how the matrix multiplication can be split between multiple GPUs:
This information is based on (the much more in-depth) TP Overview by @anton-l
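To see why the split works, here is a small single-device sketch (shapes are illustrative; the two column blocks of A stand in for two GPUs). Because GeLU is applied elementwise, each worker can compute its own output shard independently, and concatenating the shards recovers the full Y = GeLU(XA):

```python
# Column-parallel split of Y = GeLU(X A), Megatron-style (single-device illustration).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)        # input  (batch, hidden)
A = torch.randn(8, 16)       # weight (hidden, output)

# Full computation on one device
Y_full = F.gelu(X @ A)

# Split A column-wise across two "GPUs"; GeLU is elementwise,
# so each shard can be activated independently.
A1, A2 = A.chunk(2, dim=1)
Y1 = F.gelu(X @ A1)          # computed on "GPU 0"
Y2 = F.gelu(X @ A2)          # computed on "GPU 1"

# Gathering the output shards reproduces the full result.
Y_gathered = torch.cat([Y1, Y2], dim=1)
assert torch.allclose(Y_full, Y_gathered, atol=1e-6)
```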
DP + TP + PP (3D) Parallelism
Figure: 3D Parallelism illustration, from https://www.deepspeed.ai/
Figure: taken from 3D parallelism: Scaling to trillion-parameter models
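As a toy illustration, the sketch below arranges the 8 GPUs of a 2-node Polaris job into a DP × PP × TP grid. The parallel degrees and the rank ordering here are assumptions chosen for illustration, not the exact process groups that Megatron-DeepSpeed constructs:

```python
# Toy 3D (DP x PP x TP) process-grid layout (illustrative; ordering is an assumption).
WORLD_SIZE = 8   # e.g. 2 Polaris nodes x 4 GPUs
TP = 2           # tensor-model-parallel degree
PP = 2           # pipeline-model-parallel degree
DP = WORLD_SIZE // (TP * PP)
assert DP * PP * TP == WORLD_SIZE

for rank in range(WORLD_SIZE):
    tp_rank = rank % TP                # fastest-varying: tensor-parallel peers
    pp_rank = (rank // TP) % PP        # then pipeline stage
    dp_rank = rank // (TP * PP)        # slowest-varying: data-parallel replica
    print(f"rank {rank}: dp={dp_rank} pp={pp_rank} tp={tp_rank}")
```

The only hard constraint this shows is that the world size must factor as dp * pp * tp.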
We’ve provided a virtual environment complete with all dependencies for running argonne-lcf/Megatron-DeepSpeed:
# navigate to directory ---------------------------------------
WORKSHOP_DIR="/lus/grand/projects/fallwkshp23/"
PROJECTS_DIR="${WORKSHOP_DIR}/foremans/projects"
PROJECT_DIR="${PROJECTS_DIR}/argonne-lcf/Megatron-DeepSpeed"
cd "${PROJECT_DIR}"
# load conda module and activate venv -------------------------
module load conda/2023-10-04; conda activate base
source venvs/polaris/2023-10-04/bin/activate
# set runtime environment variables ---------------------------
export IBV_FORK_SAFE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# set environment variables for running -----------------------
SEQ_LEN=1024
MICRO_BATCH=1
SP_TYPE="megatron"
MODEL_SIZE_KEY="GPT1_5B"
# launch training --------------------------------------------
./ALCF/train-gpt3.sh
Executable:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ALCF_DIR: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
source-ing /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/setup.sh
Setting up MPI on Polaris from x3210c0s1b0n0
++ SetupMPI() +++++++++++++++++++++++++++++++++
Using HOSTFILE: /var/spool/pbs/aux/1126584.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
NHOSTS: 2
NGPU_PER_HOST: 4
NGPUS: 8
+++++++++++++++++++++++++++++++++++++++++++++++
Skipping setupThetaGPU() on x3210c0s1b0n0
Setting up MPI on Polaris from x3210c0s1b0n0
USING PYTHON: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3
[...]
Once the text has finally stopped printing, you should see output similar to the following:
Job started at: 2023-10-11-092906 on x3210c0s1b0n0
[...]
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
To watch / view the output, use the tail -f command printed above; it will look like:
Job started at: 2023-10-11-092906 on x3210c0s1b0n0
Training GPT-3 with GPT13B parameters
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
using: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3
[...]
Below we provide details for installing / getting started on ALCF (Polaris).
Installation:
Polaris:
ThetaGPU:
Set up virtual environment (on top of the conda/2023-10-04 base environment):
Create a new folder where we’ll install dependencies:
Note: The following instructions should be unnecessary on Polaris.
The new release supports three different implementations of FlashAttention: v1.0.4, v2.x, and triton.

Note: FlashAttention v2.x may have numerical instability issues. For the best performance, we recommend using FlashAttention + Triton.

- v1.0.4:
- v2.x:
- openai/triton:
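Since the recommended implementation depends on what is installed, a quick (illustrative) way to check which FlashAttention / Triton versions are available in the active environment:

```python
# Illustrative check of installed FlashAttention / Triton versions.
for pkg in ("flash_attn", "triton"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```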
Note: apex is already installed in the base conda/2023-10-04 environment on Polaris.
The ALCF/ directory contains shell scripts for setting up the environment and specifying options to be used for training:

ALCF/
├── args.sh
├── launch.sh
├── model.sh
├── setup.sh
├── submit-pbs.sh
├── submit.sh
└── train-gpt3.sh
Explicitly:

- ALCF/train-gpt3.sh: Main entry point for training. This script will:
  - source the rest of the ALCF/*.sh scripts below
- ALCF/model.sh: Contains some example model architectures for GPT3-style models
- ALCF/args.sh: Logic for parsing / setting up runtime options for Megatron and DeepSpeed
- ALCF/setup.sh: Locate and activate virtual environment to be used, ensure MPI variables are set properly
- ALCF/launch.sh: Identify available resources ({nodes, GPUs per node, GPUs total}) and build the command to be executed, to pass to mpi{run,exec} (a sketch of this step is shown below):
  mpiexec <mpiexec-args> python3 pretrain_gpt.py <gpt-args>
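For illustration, the sketch below approximates that resource-discovery step in Python: it reads the PBS hostfile to count nodes (matching the HOSTFILE / NHOSTS / NGPU_PER_HOST / NGPUS values printed in the log above) and forms an mpiexec command. The --ppn flag and the hard-coded 4 GPUs per node reflect Polaris; this is an assumption about what ALCF/launch.sh does, not its actual contents.

```python
# Illustrative sketch only -- not ALCF/launch.sh itself.
import os

hostfile = os.environ["PBS_NODEFILE"]       # set by PBS inside a Polaris job
with open(hostfile) as f:
    hosts = [line.strip() for line in f if line.strip()]

NHOSTS = len(hosts)
NGPU_PER_HOST = 4                            # 4x A100 per Polaris node (assumption baked in)
NGPUS = NHOSTS * NGPU_PER_HOST

# <gpt-args> left as a placeholder, exactly as in the command above
cmd = (
    f"mpiexec -n {NGPUS} --ppn {NGPU_PER_HOST} --hostfile {hostfile} "
    "python3 pretrain_gpt.py <gpt-args>"
)
print(cmd)
```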
Figure: Latent space of biologically meaningful properties for SARS-CoV-2 genomes
Sequence Length | Old Megatron-DeepSpeed (TFLOPS) | New Megatron-DeepSpeed (TFLOPS) |
---|---|---|
2k | 25 | 68 |
4k | 28 | 80 |
8k | OOM | 86 |
16k | OOM | 92 |
32k | OOM | 100 |
64k | OOM | 106 |
128k | OOM | 119 |
256k | OOM | 94 |
We record the samples_per_sec and TFLOPS. With the new Megatron-DeepSpeed, we are able to significantly increase the maximum sequence length (420k / 128k ~ 3.3x) with only a minimal impact on throughput (81 / 105 ~ 77%).

Name | Sequence Length (k) | (seq_len / min_seq_len) | TFLOPS | TFLOPS (% of peak) |
---|---|---|---|---|
GPT25B | 420 | 3.28125 | 81.77225 | 77.867 |
GPT25B | 400 | 3.125 | 90.62 | 86.297 |
GPT25B | 360 | 2.8125 | 81.6325 | 77.7348 |
GPT25B | 360 | 2.8125 | 82.6824 | 78.7346 |
GPT25B | 192 | 1.5 | 115.8228 | 110.2927 |
GPT25B | 128 | 1 | 106.672 | 101.5788 |
GPT25B | 128 | 1 | 105.014 | 100.00 |
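The headline ratios quoted above come straight from this table (GPT25B at 420k vs. 128k tokens); a two-line check of the arithmetic:

```python
# Values taken from the table above
seq_gain = 420 / 128                   # ~3.28x  ("420k / 128k ~ 3.3x")
rel_throughput = 81.77225 / 105.014    # ~0.78   ("81 / 105 ~ 77%")
print(f"sequence-length gain: {seq_gain:.2f}x")       # 3.28x
print(f"relative throughput:  {rel_throughput:.1%}")  # 77.9%
```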
Acknowledgements
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.