SLURM Performance

Here you will find discussions of SLURM-specific settings that impact performance.

srun’s --cpus-per-task may need to be explicit

You need to make sure that the program launched by srun receives as many cpu-cores as intended. For example, in a typical ML training program each gpu needs at least one cpu-core for the process driving it, plus a few more cores for the DataLoader workers. You need multiple cores so that each task can run in parallel. If you have 8 gpus and 2 DataLoader workers per gpu, you need at least 3*8=24 cpu-cores per node.
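As a concrete sketch of the matching allocation, assuming one launcher task per node that spawns all 8 gpu processes (the --gres line and the 8-gpu count are illustrative assumptions, adapt them to your setup):

#SBATCH --gres=gpu:8          # assumption: 8 gpus per node
#SBATCH --ntasks-per-node=1   # one launcher task per node spawns the 8 gpu processes
#SBATCH --cpus-per-task=24    # 8 gpus * (1 driver core + 2 DataLoader workers)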

The number of cpus per task is defined by --cpus-per-task, which is passed to sbatch or salloc, and originally srun would inherit this setting. However, this behavior has recently changed:

A quote from the sbatch manpage:

NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).

Which means that if in the past your SLURM script could have been:

#SBATCH --cpus-per-task=48
[...]

srun myprogram

and the program launched by srun would have received 48 cpu-cores, because srun used to inherit the --cpus-per-task=48 setting from sbatch or salloc. According to the quoted documentation, since SLURM 22.05 this is no longer the case.

footnote: I tested with SLURM@22.05.09 and the old behavior was still in effect, but this is definitely the case with the 23.x series, as reported here. So the change might have happened in a later 22.05 release.

So if you leave things as is, the program will now receive just 1 cpu-core (unless the srun default has been modified).

You can easily test whether your SLURM setup is affected using os.sched_getaffinity(0), since it shows which cpu-cores are eligible to be used by the current process. Thus it's easy to count them with len(os.sched_getaffinity(0)).

Here is how you can test if you’re affected:

$ cat test.slurm
#!/bin/bash
#SBATCH --job-name=test-cpu-cores-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48   # adapt to your env if you have less than 48 cpu cores
#SBATCH --time=0:10:00
#SBATCH --partition=x        # adapt to your env to the right partition name
#SBATCH --output=%x-%j.out

srun python -c 'import os; print(f"visible cpu cores: {len(os.sched_getaffinity(0))}")'

If you get

visible cpu cores: 48

then you don’t need to do anything, if however you get:

visible cpu cores: 1

or another value smaller than 48 then you’re affected.
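If you already have an interactive allocation, you can run the same check without a batch script. A minimal sketch, assuming the same 48-core request (adapt the numbers to your environment):

$ salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=48 --time=0:10:00
$ srun python -c 'import os; print(f"visible cpu cores: {len(os.sched_getaffinity(0))}")'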

To fix that you need to change your SLURM script to either:

#SBATCH --cpus-per-task=48
[...]

srun --cpus-per-task=48 myprogram

or:

#SBATCH --cpus-per-task=48
[...]

export SRUN_CPUS_PER_TASK=48
srun myprogram
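
Either way, it can help to add a cheap sanity check to the job script right before launching the real program, so a silent regression in the cpu-core count is caught early. A minimal sketch (myprogram and the count of 48 are placeholders carried over from the example above):

export SRUN_CPUS_PER_TASK=48
# fail fast if the task doesn't see the expected number of cpu-cores
srun python -c 'import os; n=len(os.sched_getaffinity(0)); assert n == 48, f"expected 48 cpu-cores, got {n}"'
srun myprogram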

To enable Hyper-Threads or not

As explained in the Hyper-Threads section, if your CPUs support hyper-threading you should be able to double the number of available cpu-cores, and for some workloads this may lead to overall faster performance.

However, you should test the performance w/ and w/o HT, compare the results and choose the setting that gives the best outcome.

case study: on AWS p4 nodes I discovered that enabling HT made the network throughput 4x slower. Since then we have been careful to keep HT disabled on that particular setup.
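To compare the two modes you don't necessarily need to toggle HT in the BIOS. A hedged sketch, assuming your SLURM build honors the --hint option and with myprogram standing in for your workload:

# check whether hyper-threading is active on a compute node (2 = HT on, 1 = HT off)
srun --nodes=1 lscpu | grep "Thread(s) per core"

# ask SLURM to allocate only one hardware thread per physical core
srun --hint=nomultithread myprogram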
