LLMs on Polaris


October 10 – 12, 2023

ALCF Hands-on

HPC Workshop

Sam Foreman

  • I’m a Computational Scientist in the Data Science Group at ALCF.
    • Personal Website: samforeman.me
    • Background: {ML, LLMs, AI4Science, HEP, Lattice QCD, MCMC, Generative Modeling, ...}

Ongoing / recent work:

Status of Large Language Models

Figure 1: Large Language Models (LLMs) have taken the NLP community by storm.

Emergent Abilities

Emergent abilities of Large Language Models Yao et al. (2023)

Training LLMs

Figure 2: Visualization from Yang et al. (2023)

Recent Work (2017 – Now)

Table 1: Papers, 2017–*
Date | Keywords | Institute | Paper | Publication
2017-06 | Transformers | Google | Attention Is All You Need | NeurIPS
2018-06 | GPT 1.0 | OpenAI | Improving Language Understanding by Generative Pre-Training | -
2018-10 | BERT | Google | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL
2019-02 | GPT 2.0 | OpenAI | Language Models are Unsupervised Multitask Learners | -
2019-09 | Megatron-LM | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | -
2019-10 | T5 | Google | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | JMLR
2019-10 | ZeRO | Microsoft | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | SC
2020-01 | Scaling Law | OpenAI | Scaling Laws for Neural Language Models | -
2020-05 | GPT 3.0 | OpenAI | Language models are few-shot learners | NeurIPS
2021-01 | Switch Transformers | Google | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR
2021-08 | Codex | OpenAI | Evaluating Large Language Models Trained on Code | -
2021-08 | Foundation Models | Stanford | On the Opportunities and Risks of Foundation Models | -
2021-09 | FLAN | Google | Finetuned Language Models are Zero-Shot Learners | ICLR
2021-10 | T0 | HuggingFace et al. | Multitask Prompted Training Enables Zero-Shot Task Generalization | ICLR
2021-12 | GLaM | Google | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML
2021-12 | WebGPT | OpenAI | WebGPT: Browser-assisted question-answering with human feedback | -
2021-12 | Retro | DeepMind | Improving language models by retrieving from trillions of tokens | ICML
2021-12 | Gopher | DeepMind | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | -
2022-01 | COT | Google | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS
2022-01 | LaMDA | Google | LaMDA: Language Models for Dialog Applications | -
2022-01 | Minerva | Google | Solving Quantitative Reasoning Problems with Language Models | NeurIPS
2022-01 | Megatron-Turing NLG | Microsoft & NVIDIA | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | -
2022-03 | InstructGPT | OpenAI | Training language models to follow instructions with human feedback | -
2022-04 | PaLM | Google | PaLM: Scaling Language Modeling with Pathways | -
2022-04 | Chinchilla | DeepMind | An empirical analysis of compute-optimal large language model training | NeurIPS
2022-05 | OPT | Meta | OPT: Open Pre-trained Transformer Language Models | -
2022-05 | UL2 | Google | Unifying Language Learning Paradigms | -
2022-06 | Emergent Abilities | Google | Emergent Abilities of Large Language Models | TMLR
2022-06 | BIG-bench | Google | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | -
2022-06 | METALM | Microsoft | Language Models are General-Purpose Interfaces | -
2022-09 | Sparrow | DeepMind | Improving alignment of dialogue agents via targeted human judgements | -
2022-10 | Flan-T5/PaLM | Google | Scaling Instruction-Finetuned Language Models | -
2022-10 | GLM-130B | Tsinghua | GLM-130B: An Open Bilingual Pre-trained Model | ICLR
2022-11 | HELM | Stanford | Holistic Evaluation of Language Models | -
2022-11 | BLOOM | BigScience | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | -
2022-11 | Galactica | Meta | Galactica: A Large Language Model for Science | -
2022-12 | OPT-IML | Meta | OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | -
2023-01 | Flan 2022 Collection | Google | The Flan Collection: Designing Data and Methods for Effective Instruction Tuning | -
2023-02 | LLaMA | Meta | LLaMA: Open and Efficient Foundation Language Models | -
2023-02 | Kosmos-1 | Microsoft | Language Is Not All You Need: Aligning Perception with Language Models | -
2023-03 | PaLM-E | Google | PaLM-E: An Embodied Multimodal Language Model | -
2023-03 | GPT 4 | OpenAI | GPT-4 Technical Report | -
2023-04 | Pythia | EleutherAI et al. | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ICML
2023-05 | Dromedary | CMU et al. | Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision | -
2023-05 | PaLM 2 | Google | PaLM 2 Technical Report | -
2023-05 | RWKV | Bo Peng | RWKV: Reinventing RNNs for the Transformer Era | -
2023-05 | DPO | Stanford | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | -
2023-07 | LLaMA 2 | Meta | Llama 2: Open Foundation and Fine-Tuned Chat Models | -

Life-Cycle of the LLM

  1. Data collection + preprocessing

  2. Pre-training

    • Architecture decisions:
      {model_size, hyperparameters,
      parallelism, lr_schedule, ...}
  3. Supervised Fine-Tuning

    • Instruction Tuning
    • Alignment
  4. Deploy (+ monitor, re-evaluate, etc.)

Figure 3: Pre-training: virtually all of the compute is used during the pre-training phase.

Life-Cycle of the LLM: Pre-training

Figure 4: Pre-training: virtually all of the compute is used during the pre-training phase.

Life-Cycle of the LLM: Fine-Tuning

Figure 5: Fine-tuning: Fine-tuning actually updates the model’s weights to make the model better at a certain task.

Transformer Architecture

Vaswani et al. (2017)

Forward Pass

Figure 6: Language Model trained for causal language modeling. Video from: 🤗 Generation with LLMs

Generating Text

Figure 7: Language Model trained for causal language modeling. Video from: 🤗 Generation with LLMs
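
As a rough illustration of the generation loop in the figure, here is a minimal greedy-decoding sketch. The vocabulary, the `LOGITS` table and the helper names are all hypothetical: a real causal language model conditions on the full prefix, whereas this toy stand-in only looks at the last token.

```python
import math

# Hypothetical toy "model": next-token logits indexed by the last token only.
VOCAB = ["<eos>", "the", "cat", "sat"]
LOGITS = {
    "the": [0.1, 0.0, 2.0, 0.5],
    "cat": [0.2, 0.1, 0.0, 3.0],
    "sat": [2.5, 0.3, 0.1, 0.0],
}


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(tokens):
    """Stand-in for a forward pass: distribution over the next token."""
    return softmax(LOGITS[tokens[-1]])


def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the most likely token until <eos>."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        tokens.append(VOCAB[best])
    return tokens


print(generate(["the"]))
```

Sampling-based strategies replace the `max` step with a draw from `probs`; the outer loop stays the same.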

Parallelism Overview

Modern parallelism techniques enable the training of large language models

Parallelism Concepts

  • DataParallel (DP):
    • The same setup is replicated multiple times, with each replica being fed a slice of the data.

    • The processing is done in parallel and all setups are synchronized at the end of each training step.

  • TensorParallel (TP):
    • Each tensor is split up into multiple chunks.
    • So, instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU.
      • During processing, each shard is processed separately and in parallel on different GPUs, and the results are synced at the end of the step.
      • This is what one may call horizontal parallelism, as the splitting happens at the horizontal level.
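
To make the DataParallel idea concrete, here is a minimal single-process sketch that simulates the workers with a loop. The one-parameter model, the function names and the data are illustrative assumptions. Real training would use something like PyTorch DDP, where the averaging step is an all-reduce across GPUs.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, num_workers):
    """Simulate data parallelism: identical replicas, one data slice each."""
    shard = len(xs) // num_workers  # assumes the batch divides evenly
    local_grads = []
    for rank in range(num_workers):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each "worker" holds the same weight w but sees only its slice.
        local_grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Synchronization at the end of the step (stand-in for an all-reduce mean).
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
g_dp = data_parallel_grad(0.0, xs, ys, num_workers=2)
g_full = grad_mse(0.0, xs, ys)
print(g_dp, g_full)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why every replica stays in sync after the update.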

Parallelism Concepts

  • PipelineParallel (PP):
    • Model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU.
      • Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch.
  • Zero Redundancy Optimizer (ZeRO):
    • Also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn’t need to be modified.
    • It also supports various offloading techniques to compensate for limited GPU memory.
  • Sharded DDP:
    • Another name for the foundational ZeRO concept as used by various other implementations of ZeRO.

Data Parallelism

  • Data Parallelism:
    • The simplest and most common parallelism technique. Workers maintain identical copies of the complete model and work on a subset of the data.
    • DDP is supported natively in PyTorch.
  • ZeRO Data Parallel
    • ZeRO-powered data parallelism is shown below.

Tensor Parallelism

  • In Tensor Parallelism, each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing.

    • The main building block of any transformer is a fully connected nn.Linear followed by a nonlinear activation GeLU.

      • Y = GeLU(XA), where X and Y are the input and output vectors, and A is the weight matrix.
    • If we look at the computation in matrix form, it’s easy to see how the matrix multiplication can be split between multiple GPUs:
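
One way to see the split: partition A column-wise as A = [A1, A2]. Because GeLU is applied elementwise, Y = [GeLU(X A1), GeLU(X A2)], so each GPU can compute its slice of Y independently. Below is a minimal pure-Python sketch of that identity. The two "GPUs" are simulated with a loop, the matrices are made-up values, and the tanh approximation of GeLU is used for brevity.

```python
import math


def gelu(x):
    """Tanh approximation of GeLU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(X, A):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)] for row in X]


def apply_gelu(M):
    return [[gelu(v) for v in row] for row in M]


def column_shards(A, num_shards):
    """Split A into equal-width column blocks (assumes even divisibility)."""
    width = len(A[0]) // num_shards
    return [[row[i * width:(i + 1) * width] for row in A] for i in range(num_shards)]


X = [[1.0, -2.0], [0.5, 3.0]]                        # input (2 x 2)
A = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.6, 0.7, 0.8]]   # weights (2 x 4)

# Reference: the whole computation on one device.
Y_ref = apply_gelu(matmul(X, A))

# Tensor parallel: each shard computes GeLU(X @ A_i) on its own columns...
parts = [apply_gelu(matmul(X, A_i)) for A_i in column_shards(A, 2)]
# ...and the full output is recovered by concatenating along the columns.
Y_tp = [sum((part[r] for part in parts), []) for r in range(len(X))]

print(Y_tp == Y_ref)
```

Splitting A by rows instead would require summing partial products before the nonlinearity, since GeLU(a + b) is not GeLU(a) + GeLU(b); that is why the column split is the natural choice for this layer.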

Tensor Parallelism

3D Parallelism

  • DP + TP + PP (3D) Parallelism

3D Parallelism illustration. Figure from: https://www.deepspeed.ai/
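
A small sketch of how the three dimensions relate: the GPUs left over after tensor and pipeline parallelism form the data-parallel replicas. The function name is hypothetical. The example values are taken from the run name that appears later in these slides (`ngpu8`, `mp8_pp1`), and reading `mp8` as the tensor (model) parallel size is an assumption.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of data-parallel replicas in a DP x TP x PP layout."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP={model_parallel}"
        )
    return world_size // model_parallel


# 8 GPUs with TP=8 and PP=1 leave room for a single replica.
print(data_parallel_size(world_size=8, tensor_parallel=8, pipeline_parallel=1))
```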

Running on ALCF

  • We’ve provided a virtual environment complete with all dependencies for running argonne-lcf/Megatron-DeepSpeed:

    # navigate to directory ---------------------------------------
    cd "${PROJECT_DIR}"
    # load conda module and activate venv -------------------------
    module load conda/2023-10-04; conda activate base
    source venvs/polaris/2023-10-04/bin/activate
    # set runtime environment variables ---------------------------
    export IBV_FORK_SAFE=1
    # set environment variables for running -----------------------
    # launch training --------------------------------------------

Running on ALCF

  • Executable:

    MODEL_SIZE_KEY="GPT1_5B" SEQ_LEN=1024 MICRO_BATCH=1 SP_TYPE="megatron" ./ALCF/train-gpt3.sh
ALCF_DIR: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF
source-ing /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/setup.sh
Setting up MPI on Polaris from x3210c0s1b0n0
++ SetupMPI() +++++++++++++++++++++++++++++++++
Using HOSTFILE: /var/spool/pbs/aux/1126584.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
Skipping setupThetaGPU() on x3210c0s1b0n0
Setting up MPI on Polaris from x3210c0s1b0n0
USING PYTHON: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3

Running on ALCF

Once the text has finally stopped printing, you should see output similar to the following:

Job started at: 2023-10-11-092906 on x3210c0s1b0n0
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
  • To watch / view the output:

    tail -fn 1000 $(tail -1 logfiles) | less
  • will look like:

Job started at: 2023-10-11-092906 on x3210c0s1b0n0
Training GPT-3 with GPT13B parameters
Writing logs to: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1
to view output: tail -f $(tail -1 logfiles)
i.e. tail -f /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT13B_z1_seqlen1024_mp8_pp1_sp1_nl40_hs5120_gb1_mb1/logs/foremans-x3210c0s1b0n0-nhosts2-ngpu8-2023-10-11-092906.log
using: /lus/grand/projects/fallwkshp23/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/venvs/polaris/2023-10-04/bin/python3

Getting Started at ALCF

  • We provide below the details for installing / getting started on ALCF (Polaris)

  • Installation:

    1. Clone GitHub repo:
    git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
    2. Load Conda module:
      • Polaris:

        # match hostnames beginning with "x3" (Polaris compute nodes)
        if [[ "$(hostname)" == x3* ]]; then
            export MACHINE="Polaris"
            export CONDA_DATE="2023-10-04"
            module load conda/${CONDA_DATE}
            conda activate base
        fi
      • ThetaGPU:

        # match hostnames beginning with "theta" (ThetaGPU nodes)
        if [[ "$(hostname)" == theta* ]]; then
            export MACHINE="ThetaGPU"
            export CONDA_DATE="2023-01-10"
            module load conda/${CONDA_DATE}
            conda activate base
        fi

Getting Started

  1. Set up virtual environment:

    cd Megatron-DeepSpeed
    # create a new virtual environment
    mkdir -p "venvs/${MACHINE}/${CONDA_DATE}"
    python3 -m venv "venvs/${MACHINE}/${CONDA_DATE}" --system-site-packages
    source "venvs/${MACHINE}/${CONDA_DATE}/bin/activate"
  2. Create a new folder where we’ll install dependencies:

    mkdir -p "deps/${MACHINE}"
    cd "deps/${MACHINE}"

Install Dependencies


Note: The following instructions should be unnecessary on Polaris.

  • The new release supports three different implementations of FlashAttention: (v1.0.4, v2.x, triton)

  • FlashAttention v2.x may have numerical instability issues. For the best performance, we recommend using FlashAttention + Triton

  • Dao-AILab/flash-attention:

    • v1.0.4:

      python3 -m pip install flash-attn==1.0.4
    • v2.x:

      git clone https://github.com/Dao-AILab/flash-attention
      cd flash-attention
      python3 setup.py install
    • openai/triton:

      git clone -b legacy-backend https://github.com/openai/triton
      cd triton/python
      python3 -m pip install cmake pybind11
      python3 -m pip install .
  • saforem2/ezpz

    python3 -m pip install -e "git+https://github.com/saforem2/ezpz.git#egg=ezpz"
  • NVIDIA/apex

    git clone https://github.com/NVIDIA/apex
    # enter the freshly cloned repo
    cd apex
    pip install -v \
      --disable-pip-version-check \
      --no-cache-dir \
      --no-build-isolation \
      --global-option="--cpp_ext" \
      --global-option="--cuda_ext" \
      -e ./


Note: apex is already installed in the base conda/2023-10-04 environment on Polaris.


  • The ALCF/ directory contains shell scripts for setting up the environment and specifying options to be used for training.


  • Various options can be specified dynamically at runtime by setting them in your environment, e.g.:

    # Set env. vars to use:
    export MODEL_SIZE_KEY="GPT1_5B" SEQ_LEN=1024 MICRO_BATCH=1 SP_TYPE="megatron"
    # Launch training:
    ./ALCF/train-gpt3.sh



  • ALCF/train-gpt3.sh: Main entry point for training. This script will:
    • Source the rest of the required ALCF/*.sh scripts below
  • ALCF/models.sh: Contains some example model architectures for GPT3-style models
  • ALCF/args.sh: Logic for parsing / setting up runtime options for Megatron and DeepSpeed
  • ALCF/setup.sh: Locate and activate virtual environment to be used, ensure MPI variables are set properly
  • ALCF/launch.sh: Identify available resources and build the command to be executed
    • i.e. figure out how many: {nodes, GPUs per node, GPUs total}, to pass to mpi{run,exec}
    • then, use this to launch mpiexec <mpiexec-args> python3 pretrain_gpt.py <gpt-args>


Latent space of biologically meaningful properties for SARS-CoV-2 genomes

Loooooooooong Sequence Lengths

Table 2: Long sequence length support from microsoft/Megatron-DeepSpeed
Sequence Length | Old Megatron-DeepSpeed (TFLOPS) | New Megatron-DeepSpeed (TFLOPS)
2k | 25 | 68
4k | 28 | 80
8k | OOM | 86
16k | OOM | 92
32k | OOM | 100
64k | OOM | 106
128k | OOM | 119
256k | OOM | 94

Loooooooooong Sequence Lengths


Figure 8: Maximum (achievable) SEQ_LEN for both 25B and 33B models [WIP]

Loooooooooong Sequence Lengths

  • We can evaluate the performance of our model by looking at two different metrics for throughput: samples_per_sec and TFLOPS.
    • Explicitly, we see that we are able to scale up to significantly longer sequences (420k / 128k ≈ 3.3x) while retaining about 77% of the baseline throughput (81 / 105 TFLOPS).
Table 3: Impact on TFLOPS as a function of increasing sequence length. Table from: throughput/TFLOPS
Name | Sequence Length (k) | seq_len / min_seq_len | TFLOPS | TFLOPS (% of 128k baseline)
GPT25B | 420 | 3.28125 | 81.77225 | 77.867
GPT25B | 400 | 3.125 | 90.62 | 86.297
GPT25B | 360 | 2.8125 | 81.6325 | 77.7348
GPT25B | 360 | 2.8125 | 82.6824 | 78.7346
GPT25B | 192 | 1.5 | 115.8228 | 110.2927
GPT25B | 128 | 1 | 106.672 | 101.5788
GPT25B | 128 | 1 | 105.014 | 100.00
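
The ratios quoted above can be reproduced directly from the table. This is a minimal sketch: the function name is hypothetical, and the last 128k row (105.014 TFLOPS) is taken as the baseline because it is the row listed at 100%.

```python
def relative_throughput(seq_k, tflops, base_seq_k=128, base_tflops=105.014):
    """Return (sequence-length ratio, TFLOPS as a percentage of the baseline)."""
    return seq_k / base_seq_k, 100.0 * tflops / base_tflops


# Longest run in the table: 420k tokens at 81.77225 TFLOPS.
seq_ratio, tflops_pct = relative_throughput(420, 81.77225)
print(f"{seq_ratio:.2f}x longer sequences at {tflops_pct:.1f}% of baseline TFLOPS")
```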


Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.
Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond.” https://arxiv.org/abs/2304.13712.
Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” https://arxiv.org/abs/2305.10601.