SLURM Performance

Here you will find discussions of SLURM-specific settings that impact performance.

srun’s --cpus-per-task may need to be explicit

You need to make sure that the program launched by srun receives as many cpu-cores as intended. For example, in a typical ML training program each gpu needs at least one cpu-core for the process driving it, plus a few more cores for the DataLoader workers. You need multiple cores so that each task can run in parallel. If you have 8 gpus and 2 DataLoader workers per gpu, you need at least 3*8=24 cpu-cores per node.
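As a concrete sketch of the matching allocation, assuming one launcher task per node that spawns all 8 gpu processes (the --gres line and the 8-gpu count are illustrative assumptions, adapt them to your setup):

#SBATCH --gres=gpu:8          # assumption: 8 gpus per node
#SBATCH --ntasks-per-node=1   # one launcher task per node spawns the 8 gpu processes
#SBATCH --cpus-per-task=24    # 8 gpus * (1 driver core + 2 DataLoader workers)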

The number of cpus per task is defined by --cpus-per-task, which is passed to sbatch or salloc, and originally srun would inherit this setting. However, this behavior has recently changed:

A quote from the sbatch manpage:

NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).

Which means that if in the past your SLURM script could have been:

#SBATCH --cpus-per-task=48
[...]

srun myprogram

and the program launched by srun would have received 48 cpu-cores, because srun used to inherit the --cpus-per-task=48 setting from sbatch or salloc. According to the quoted documentation, since SLURM 22.05 this is no longer the case.

footnote: I tested with SLURM@22.05.09 and the old behavior was still in effect, but this is definitely the case with the 23.x series, as reported here. So the change might have happened in a later 22.05 release.

So if you leave things as is, the program will now receive just 1 cpu-core (unless the srun default has been modified).

You can easily test whether your SLURM setup is affected using os.sched_getaffinity(0), since it shows which cpu-cores are eligible to be used by the current process. Thus it's easy to count them with len(os.sched_getaffinity(0)).

Here is how you can test if you’re affected:

$ cat test.slurm
#!/bin/bash
#SBATCH --job-name=test-cpu-cores-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48   # adapt to your env if you have less than 48 cpu cores
#SBATCH --time=0:10:00
#SBATCH --partition=x        # adapt to your env to the right partition name
#SBATCH --output=%x-%j.out

srun python -c 'import os; print(f"visible cpu cores: {len(os.sched_getaffinity(0))}")'

If you get

visible cpu cores: 48

then you don’t need to do anything, if however you get:

visible cpu cores: 1

or another value smaller than 48 then you’re affected.
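If you already have an interactive allocation, you can run the same check without a batch script. A minimal sketch, assuming the same 48-core request (adapt the numbers to your environment):

$ salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=48 --time=0:10:00
$ srun python -c 'import os; print(f"visible cpu cores: {len(os.sched_getaffinity(0))}")'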

To fix that you need to change your SLURM script to either:

#SBATCH --cpus-per-task=48
[...]

srun --cpus-per-task=48 myprogram

or:

#SBATCH --cpus-per-task=48
[...]

export SRUN_CPUS_PER_TASK=48
srun myprogram
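
Either way, it can help to add a cheap sanity check to the job script right before launching the real program, so a silent regression in the cpu-core count is caught early. A minimal sketch (myprogram and the count of 48 are placeholders carried over from the example above):

export SRUN_CPUS_PER_TASK=48
# fail fast if the task doesn't see the expected number of cpu-cores
srun python -c 'import os; n=len(os.sched_getaffinity(0)); assert n == 48, f"expected 48 cpu-cores, got {n}"'
srun myprogram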

To enable Hyper-Threads or not

As explained in the Hyper-Threads section, if your CPUs support hyper-threading you should be able to double the number of available cpu-cores, and for some workloads this may lead to overall faster performance.

However, you should test the performance w/ and w/o HT, compare the results and choose the setting that gives the best outcome.

case study: on AWS p4 nodes I discovered that enabling HT made the network throughput 4x slower. Since then we have been careful to keep HT disabled on that particular setup.
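To compare the two modes you don't necessarily need to toggle HT in the BIOS. A hedged sketch, assuming your SLURM build honors the --hint option and with myprogram standing in for your workload:

# check whether hyper-threading is active on a compute node (2 = HT on, 1 = HT off)
srun --nodes=1 lscpu | grep "Thread(s) per core"

# ask SLURM to allocate only one hardware thread per physical core
srun --hint=nomultithread myprogram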
