Working in SLURM Environment
Unless you're lucky and have a dedicated cluster that is completely under your control, chances are you will have to use SLURM to timeshare the GPUs with others. And even if you train on an HPC system and are given a dedicated partition, you will most likely still have to use SLURM.
SLURM is an abbreviation of Simple Linux Utility for Resource Management, though nowadays it's called the Slurm Workload Manager. It is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
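To give a concrete feel for what "job scheduler" means in practice, here is a minimal sketch of the basic user-side job lifecycle; `train.slurm` and the job id are placeholders for illustration:

```bash
# submit a batch script to the scheduler (train.slurm is a hypothetical script name)
sbatch train.slurm

# check the state of your queued and running jobs
squeue -u $USER

# cancel a job by its job id (12345 is a placeholder)
scancel 12345
```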
These chapters will not attempt to teach you SLURM exhaustively, as there are many manuals out there, but will cover specific nuances that are helpful in the training process.
- SLURM For Users - everything you need to know to do your training in the SLURM environment.
- SLURM Administration - if you're unlucky enough to also have to manage the SLURM cluster besides using it, this document contains a growing list of recipes to help you get things done faster.
- Performance - SLURM performance nuances.
- Launcher scripts - how to launch with `torchrun`, `accelerate`, pytorch-lightning, etc. in the SLURM environment (see the sketch after this list).
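As a preview of the last item, here is a minimal sketch of what a SLURM launcher for `torchrun` can look like. The job name, node and GPU counts (2 nodes with 8 GPUs each), the port, and `train.py` are all placeholder assumptions, not a definitive recipe:

```bash
#!/bin/bash
#SBATCH --job-name=train          # placeholder job name
#SBATCH --nodes=2                 # assumption: 2 nodes
#SBATCH --ntasks-per-node=1       # one launcher (torchrun) process per node
#SBATCH --gres=gpu:8              # assumption: 8 GPUs per node
#SBATCH --time=01:00:00

# derive a rendezvous endpoint from the node list SLURM allocated
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000                  # placeholder port

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```

With `accelerate`, pytorch-lightning and other launchers the `#SBATCH` allocation part stays essentially the same and only the launch command changes; the launcher scripts chapter covers the details for each.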