February 20, 2024

Working in SLURM Environment

Unless you're lucky and have a dedicated cluster that is completely under your control, chances are that you will have to use SLURM to timeshare the GPUs with others. And even if you train at an HPC facility and are given a dedicated partition, you will most likely still have to use SLURM.

The SLURM abbreviation stands for Simple Linux Utility for Resource Management, though nowadays it is called the Slurm Workload Manager. It is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
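To give a concrete flavor of what that looks like, here is a minimal sketch of a SLURM batch script for a single-node GPU job. The partition name, resource counts and `train.py` are hypothetical placeholders and will differ on your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=my-training      # job name shown by squeue
#SBATCH --partition=gpu             # hypothetical partition name - check sinfo
#SBATCH --nodes=1                   # number of nodes
#SBATCH --gres=gpu:8                # GPUs per node
#SBATCH --cpus-per-task=12          # CPU cores for the task
#SBATCH --time=01:00:00             # wall-time limit HH:MM:SS
#SBATCH --output=%x-%j.out          # log file (%x = job name, %j = job id)

# the actual work
python train.py
```

You would submit it with `sbatch train.slurm` and watch its state with `squeue -u $USER`; the SLURM For Users chapter below covers these commands in detail.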

These chapters will not try to teach you SLURM exhaustively, as there are plenty of manuals out there, but will cover the specific nuances that are useful for the training process.

  • SLURM For Users - everything you need to know to do your training in the SLURM environment.
  • SLURM Administration - if you're unlucky enough to also have to manage the SLURM cluster besides using it, there is a growing list of recipes in this document to get things done faster for you.
  • Performance - SLURM performance nuances.
  • Launcher scripts - how to launch with torchrun, accelerate, pytorch-lightning, etc. in the SLURM environment (see the sketch right after this list).
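
As a teaser for the launchers chapter, here is a rough sketch of launching `torchrun` across two nodes under SLURM. The node and GPU counts, rendezvous port and `train.py` are placeholders to be adapted to your setup:

```bash
#!/bin/bash
#SBATCH --job-name=multi-node-train
#SBATCH --nodes=2                    # number of nodes
#SBATCH --ntasks-per-node=1          # one torchrun launcher per node
#SBATCH --gres=gpu:8                 # GPUs per node
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

# use the first node of the allocation as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500                    # any free port

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```

The key idea is that `srun` starts one `torchrun` process per node, and `torchrun` then spawns one worker per GPU; the launchers chapter walks through the equivalent recipes for `accelerate`, `pytorch-lightning` and others.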