Intro to HPC Bootcamp 2025
🚀 Parallel Training Methods for AI
Sam Foreman
Intro to AI-driven Science on Supercomputers
2024-11-05
- Slides: https://samforeman.me/talks/ai-for-science-2024/slides
- HTML version: https://samforeman.me/talks/ai-for-science-2024
👋 Hands On
Submit interactive job:
On Sophia:
Clone repos:
Setup python:
Install {ezpz, wordplay}:
Setup (or disable) wandb:
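As a minimal sketch of the "disable" option: W&B reads the `WANDB_MODE` environment variable, and setting it to `disabled` before W&B is initialized turns logging calls into no-ops. The helper name `disable_wandb` is hypothetical and used only for illustration; it is not part of ezpz or wordplay.

```python
import os


def disable_wandb(env=None):
    """Set WANDB_MODE=disabled in the given mapping (defaults to os.environ).

    `disable_wandb` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    env["WANDB_MODE"] = "disabled"
    return env


if __name__ == "__main__":
    # Call this before W&B is initialized in the training process.
    disable_wandb()
    print("WANDB_MODE =", os.environ["WANDB_MODE"])
```

Setting the variable in the shell before launching (rather than from Python) should have the same effect, since child processes inherit the environment.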
Test Distributed Setup:
See: ezpz/test_dist.py
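For orientation, the sketch below shows the kind of sanity check a distributed setup test typically starts with: each process works out its rank and the total number of processes. It is not the contents of `ezpz/test_dist.py`. It assumes the launcher exports torchrun-style `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` variables (other launchers may use different names), and `get_dist_info` is a hypothetical helper.

```python
import os


def get_dist_info(env=None):
    """Read rank information from torchrun-style environment variables.

    Assumption: the launcher sets RANK, WORLD_SIZE, and LOCAL_RANK.
    Falls back to a single-process layout when they are missing.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return {"rank": rank, "world_size": world_size, "local_rank": local_rank}


if __name__ == "__main__":
    info = get_dist_info()
    print(
        f"rank {info['rank']} of {info['world_size']} "
        f"(local rank {info['local_rank']})"
    )
```

When run under a multi-process launcher, each process should print a distinct rank; seeing every rank from 0 to `world_size - 1` exactly once is a quick sign that the launch worked.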
Prepare Data:
Launch Training:
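Conceptually, a distributed data parallel run keeps one model replica per rank, lets each rank compute gradients on its own shard of the batch, and averages those gradients across ranks before every update. The pure-Python sketch below simulates only that averaging step; it uses no communication library, and the names `all_reduce_mean` and `sgd_step` are hypothetical.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks (simulated all-reduce)."""
    world_size = len(per_rank_grads)
    n_params = len(per_rank_grads[0])
    return [
        sum(grads[i] for grads in per_rank_grads) / world_size
        for i in range(n_params)
    ]


def sgd_step(weights, grad, lr):
    """Apply one plain SGD update and return the new weights."""
    return [w - lr * g for w, g in zip(weights, grad)]


if __name__ == "__main__":
    weights = [1.0, 2.0]                  # identical on every rank
    grads = [[0.5, 1.0], [1.5, 3.0]]      # one gradient per rank
    avg = all_reduce_mean(grads)
    # Every rank applies the same averaged gradient, so replicas stay in sync.
    replicas = [sgd_step(weights, avg, lr=0.5) for _ in grads]
    print(replicas)
```

Because every rank applies the same averaged gradient to the same starting weights, the replicas remain identical after each step, which is what makes the run equivalent to training on the combined batch.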
🎒 Homework
Submit proof that you successfully followed the instructions above and launched a distributed data parallel training run.
Proof can be any of:
- The contents printed out to your terminal during the run
- A path to a logfile containing the output from a run on the ALCF filesystems
- A screenshot of:
  - the text printed out from the run
  - a graph from the W&B Run
  - anything that clearly shows you were able to run the example
- A URL to a W&B Run or W&B Report
- etc.
Citation
BibTeX citation:
@online{foreman2025,
author = {Foreman, Sam},
title = {Intro to {HPC} {Bootcamp} 2025},
date = {2025-07-22},
url = {https://saforem2.github.io/hpc-bootcamp-2025/02-llms/06-parallel-training/},
langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2025. “Intro to HPC Bootcamp 2025.” July 22,
2025. https://saforem2.github.io/hpc-bootcamp-2025/02-llms/06-parallel-training/.