2023-10-13
October 10 – 12, 2023
ALCF Hands-on
HPC Workshop
{ML, LLMs, AI4Science, HEP, Lattice QCD, MCMC, Generative Modeling, ...}
Ongoing / recent work:
Emergent abilities of Large Language Models (Yao et al. 2023)
Data collection + preprocessing
Pre-training
{model_size, hyperparameters, parallelism, lr_schedule, ...}
Supervised Fine-Tuning
Deploy (+ monitor, re-evaluate, etc.)
Vaswani et al. (2017)
Modern parallelism techniques enable the training of large language models
The same setup is replicated multiple times, with each replica being fed a slice of the data.
The processing is done in parallel and all setups are synchronized at the end of each training step.
DDP
Distributed Data Parallel (DDP) training is natively supported in PyTorch via torch.nn.parallel.DistributedDataParallel; a minimal sketch is shown below.
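As a concrete (illustrative) example of data parallelism, here is a minimal DDP sketch; it is not part of the workshop repo, and it assumes a torchrun-style launcher that sets LOCAL_RANK, with a single nn.Linear standing in for a real model:

```python
# Minimal DDP sketch (illustrative). Launch with, e.g.:
#   torchrun --nproc_per_node=4 ddp_example.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each replica holds a full copy of the model...
    model = DDP(
        torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}"),
        device_ids=[local_rank],
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # ...but sees only its own slice of the data
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()                            # gradients all-reduced across replicas
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each process owns one GPU and one shard of the data; the gradient synchronization at the end of every step happens automatically inside backward().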
In Tensor Parallelism, each GPU processes only a slice of a tensor, and the full tensor is aggregated only for operations that require the whole thing.

The main building block of any transformer is a fully connected nn.Linear followed by a nonlinear activation GeLU:

Y = GeLU(XA),

where X and Y are the input and output vectors, and A is the weight matrix. Looking at the computation in matrix form, it is easy to see how the matrix multiplication can be split between multiple GPUs:
This information is based on (the much more in-depth) TP Overview by @anton-l
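To see why the split works, here is a small single-device sketch (shapes are illustrative; the two column blocks of A stand in for two GPUs). Because GeLU is applied elementwise, each worker can compute its own output shard independently, and concatenating the shards recovers the full Y = GeLU(XA):

```python
# Column-parallel split of Y = GeLU(X A), Megatron-style (single-device illustration).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)        # input  (batch, hidden)
A = torch.randn(8, 16)       # weight (hidden, output)

# Full computation on one device
Y_full = F.gelu(X @ A)

# Split A column-wise across two "GPUs"; GeLU is elementwise,
# so each shard can be activated independently.
A1, A2 = A.chunk(2, dim=1)
Y1 = F.gelu(X @ A1)          # computed on "GPU 0"
Y2 = F.gelu(X @ A2)          # computed on "GPU 1"

# Gathering the output shards reproduces the full result.
Y_gathered = torch.cat([Y1, Y2], dim=1)
assert torch.allclose(Y_full, Y_gathered, atol=1e-6)
```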
DP + TP + PP (3D) Parallelism
Figure: 3D Parallelism illustration, from https://www.deepspeed.ai/
Figure: taken from 3D parallelism: Scaling to trillion-parameter models
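As a toy illustration, the sketch below arranges the 8 GPUs of a 2-node Polaris job into a DP × PP × TP grid. The parallel degrees and the rank ordering here are assumptions chosen for illustration, not the exact process groups that Megatron-DeepSpeed constructs:

```python
# Toy 3D (DP x PP x TP) process-grid layout (illustrative; ordering is an assumption).
WORLD_SIZE = 8   # e.g. 2 Polaris nodes x 4 GPUs
TP = 2           # tensor-model-parallel degree
PP = 2           # pipeline-model-parallel degree
DP = WORLD_SIZE // (TP * PP)
assert DP * PP * TP == WORLD_SIZE

for rank in range(WORLD_SIZE):
    tp_rank = rank % TP                # fastest-varying: tensor-parallel peers
    pp_rank = (rank // TP) % PP        # then pipeline stage
    dp_rank = rank // (TP * PP)        # slowest-varying: data-parallel replica
    print(f"rank {rank}: dp={dp_rank} pp={pp_rank} tp={tp_rank}")
```

The only hard constraint this shows is that the world size must factor as dp * pp * tp.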
We’ve provided a virtual environment complete with all dependencies for running argonne-lcf/Megatron-DeepSpeed:
# navigate to directory ---------------------------------------
WORKSHOP_DIR="/lus/grand/projects/fallwkshp23/"
PROJECTS_DIR="${WORKSHOP_DIR}/foremans/projects"
PROJECT_DIR="${PROJECTS_DIR}/argonne-lcf/Megatron-DeepSpeed"
cd "${PROJECT_DIR}"
# load conda module and activate venv -------------------------
module load conda/2023-10-04; conda activate base
source venvs/polaris/2023-10-04/bin/activate
# set runtime environment variables ---------------------------
export IBV_FORK_SAFE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# set environment variables for running -----------------------
SEQ_LEN=1024
MICRO_BATCH=1
SP_TYPE="megatron"
MODEL_SIZE_KEY="GPT1_5B"
# launch training --------------------------------------------
./ALCF/train-gpt3.sh
Executable:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ALCF_DIR: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
source-ing /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/setup.sh
Setting up MPI on Polaris from x3210c0s1b0n0
++ SetupMPI() +++++++++++++++++++++++++++++++++
Using HOSTFILE: /var/spool/pbs/aux/1126584.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
NHOSTS: 2
NGPU_PER_HOST: 4
NGPUS: 8
+++++++++++++++++++++++++++++++++++++++++++++++
Skipping setupThetaGPU() on x3210c0s1b0n0
Setting up MPI on Polaris from x3210c0s1b0n0
USING PYTHON: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3
[...]
Once the text has finally stopped printing, you should see output similar to the following:
Job started at: 2023-10-11-092906 on x3210c0s1b0n0
[...]
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
To watch / view the output, use the tail -f command printed above; it will look like:
Job started at: 2023-10-11-092906 on x3210c0s1b0n0
Training GPT-3 with GPT13B parameters
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
using: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3
[...]
Below we provide details for installing / getting started on ALCF (Polaris).
Installation:
Polaris:
ThetaGPU:
Set up virtual environment (on top of the conda/2023-10-04 base environment):
Create a new folder where we’ll install dependencies:
Note: The following instructions should be unnecessary on Polaris.
The new release supports three different implementations of FlashAttention: v1.0.4, v2.x, and triton.

Note: FlashAttention v2.x may have numerical instability issues. For the best performance, we recommend using FlashAttention + Triton.

- v1.0.4:
- v2.x:
- openai/triton:
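Since the recommended implementation depends on what is installed, a quick (illustrative) way to check which FlashAttention / Triton versions are available in the active environment:

```python
# Illustrative check of installed FlashAttention / Triton versions.
for pkg in ("flash_attn", "triton"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```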
Note: apex is already installed in the base conda/2023-10-04 environment on Polaris.
The ALCF/ directory contains shell scripts for setting up the environment and specifying options to be used for training:

ALCF/
├── args.sh
├── launch.sh
├── model.sh
├── setup.sh
├── submit-pbs.sh
├── submit.sh
└── train-gpt3.sh
Explicitly:

- ALCF/train-gpt3.sh: Main entry point for training. This script will:
  - source the rest of the ALCF/*.sh scripts below
- ALCF/model.sh: Contains some example model architectures for GPT3-style models
- ALCF/args.sh: Logic for parsing / setting up runtime options for Megatron and DeepSpeed
- ALCF/setup.sh: Locate and activate virtual environment to be used, ensure MPI variables are set properly
- ALCF/launch.sh: Identify available resources ({nodes, GPUs per node, GPUs total}) and build the command to be executed, to pass to mpi{run,exec} (a sketch of this step is shown below):
  mpiexec <mpiexec-args> python3 pretrain_gpt.py <gpt-args>
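For illustration, the sketch below approximates that resource-discovery step in Python: it reads the PBS hostfile to count nodes (matching the HOSTFILE / NHOSTS / NGPU_PER_HOST / NGPUS values printed in the log above) and forms an mpiexec command. The --ppn flag and the hard-coded 4 GPUs per node reflect Polaris; this is an assumption about what ALCF/launch.sh does, not its actual contents.

```python
# Illustrative sketch only -- not ALCF/launch.sh itself.
import os

hostfile = os.environ["PBS_NODEFILE"]       # set by PBS inside a Polaris job
with open(hostfile) as f:
    hosts = [line.strip() for line in f if line.strip()]

NHOSTS = len(hosts)
NGPU_PER_HOST = 4                            # 4x A100 per Polaris node (assumption baked in)
NGPUS = NHOSTS * NGPU_PER_HOST

# <gpt-args> left as a placeholder, exactly as in the command above
cmd = (
    f"mpiexec -n {NGPUS} --ppn {NGPU_PER_HOST} --hostfile {hostfile} "
    "python3 pretrain_gpt.py <gpt-args>"
)
print(cmd)
```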
Figure: Latent space of biologically meaningful properties for SARS-CoV-2 genomes
Sequence Length | Old Megatron-DeepSpeed (TFLOPS) | New Megatron-DeepSpeed (TFLOPS) |
---|---|---|
2k | 25 | 68 |
4k | 28 | 80 |
8k | OOM | 86 |
16k | OOM | 92 |
32k | OOM | 100 |
64k | OOM | 106 |
128k | OOM | 119 |
256k | OOM | 94 |
We record the samples_per_sec and TFLOPS. With the new Megatron-DeepSpeed, we are able to significantly increase the maximum sequence length (420k / 128k ~ 3.3x) with only a minimal impact on throughput (81 / 105 ~ 77%).

Name | Sequence Length (k) | (seq_len / min_seq_len) | TFLOPS | TFLOPS (% of peak) |
---|---|---|---|---|
GPT25B | 420 | 3.28125 | 81.77225 | 77.867 |
GPT25B | 400 | 3.125 | 90.62 | 86.297 |
GPT25B | 360 | 2.8125 | 81.6325 | 77.7348 |
GPT25B | 360 | 2.8125 | 82.6824 | 78.7346 |
GPT25B | 192 | 1.5 | 115.8228 | 110.2927 |
GPT25B | 128 | 1 | 106.672 | 101.5788 |
GPT25B | 128 | 1 | 105.014 | 100.00 |
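The headline ratios quoted above come straight from this table (GPT25B at 420k vs. 128k tokens); a two-line check of the arithmetic:

```python
# Values taken from the table above
seq_gain = 420 / 128                   # ~3.28x  ("420k / 128k ~ 3.3x")
rel_throughput = 81.77225 / 105.014    # ~0.78   ("81 / 105 ~ 77%")
print(f"sequence-length gain: {seq_gain:.2f}x")       # 3.28x
print(f"relative throughput:  {rel_throughput:.1%}")  # 77.9%
```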
Acknowledgements
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.