# 🐚 Shell Environment
## 🐣 Getting Started
> [!NOTE]
> **🚧 Work in Progress**
>
> The documentation below is a work in progress.
> Please feel free to provide input / suggest changes!
> [!NOTE]
>
> 1. Source the `src/ezpz/bin/utils.sh` file.
> 2. Use the `ezpz_setup_env` function to set up your environment (see the example below).
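For example, assuming you are at the root of the `ezpz` repository (adjust the path to `utils.sh` otherwise):

```bash
# assumes you are at the root of the ezpz repository; adjust the path otherwise
source src/ezpz/bin/utils.sh   # 1. load the ezpz_* helper functions
ezpz_setup_env                 # 2. set up python + job environment, define `launch`
```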
This will 🪄 automagically:

- 🐍 Setup Python: Load the appropriate module(s) and put you inside a suitable python environment
- 🧰 Setup Job: Determine the resources available in the current job and build a `launch` alias for launching executables
We provide a variety of helper functions designed to make your life easier when working with job schedulers (e.g. PBS Pro @ ALCF or `slurm` elsewhere).

All of these functions are prefixed with `ezpz_`[^1] and are summarized in Table 1 below.
We would like to write our application in such a way that it is able to take full advantage of the resources allocated by the job scheduler.
That is to say, we want to have a single script with the ability to
dynamically launch
python applications across any number of
accelerators on any of the systems under consideration.
In order to do this, there is some basic setup and information gathering that needs to occur.
In particular, we need mechanisms for:
- Setting up a python environment
- Determining what system / machine we're on
  - and what job scheduler we're using (e.g. PBS Pro @ ALCF or `slurm` elsewhere)
- Determining how many nodes have been allocated in the current job (`NHOSTS`, \(N_{\mathrm{HOST}}\))
  - and how many accelerators exist on each of these nodes (`NGPU_PER_HOST`)
This allows us to calculate the total number of accelerators (GPUs) as:

$$
N_{\mathrm{GPU}} = N_{\mathrm{HOST}} \times N_{\mathrm{GPU/HOST}}
$$

where \(N_{\mathrm{GPU/HOST}}\) (`NGPU_PER_HOST`) is the number of GPUs per host.
With this we have everything we need to build the appropriate `mpi{run,exec}` / `srun` command for launching our python application across them.
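Concretely, a simplified sketch (the actual `ezpz` helpers handle the scheduler- and site-specific details) might look like:

```bash
# Simplified sketch: compute the total GPU count and build a `launch` alias.
# Assumes a PBS-style job (with ${PBS_NODEFILE}); on slurm, srun plays this role.
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
NGPU_PER_HOST=4                      # placeholder: machine dependent (e.g. 4 on Polaris)
NGPUS=$((NHOSTS * NGPU_PER_HOST))    # N_GPU = N_HOST x N_GPU_PER_HOST
# exact mpiexec flags vary by MPI implementation / site
alias launch="mpiexec -n ${NGPUS} --ppn ${NGPU_PER_HOST} --hostfile ${PBS_NODEFILE}"
# then, e.g.: launch python3 your_script.py
```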
Now, there are a few functions in particular worth elaborating on.
| Function | Description |
|---|---|
| `ezpz_setup_env` | Wrapper around `ezpz_setup_python` && `ezpz_setup_job` |
| `ezpz_setup_job` | Determine `{NGPUS, NGPU_PER_HOST, NHOSTS}`, build `launch` command alias |
| `ezpz_setup_python` | Wrapper around `ezpz_setup_conda` && `ezpz_setup_venv_from_conda` |
| `ezpz_setup_conda` | Find and activate the appropriate `conda` module to load[^2] |
| `ezpz_setup_venv_from_conda` | From `${CONDA_NAME}`, build or activate the virtual env located in `venvs/${CONDA_NAME}/` |

Table 1: Shell Functions
> [!NOTE]
> **Where am I?**
>
> Some of the `ezpz_*` functions (e.g. `ezpz_setup_python`) will try to create / look for certain directories.
> In an effort to be explicit, these directories are defined relative to a `WORKING_DIR` (e.g. `"${WORKING_DIR}/venvs/"`).
>
> This `WORKING_DIR` will be assigned to the first non-empty match found below:
>
> 1. `PBS_O_WORKDIR`: If found in the environment, paths will be relative to this
> 2. `SLURM_SUBMIT_DIR`: Next in line. If not @ ALCF, maybe using `slurm`…
> 3. `$(pwd)`: Otherwise, no worries. Use your actual working directory.
>
> (a minimal sketch of this resolution order is shown below)
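A minimal sketch of that resolution order (not necessarily the exact `ezpz` implementation) might look like:

```bash
# Resolve WORKING_DIR from the first non-empty candidate:
# PBS_O_WORKDIR, then SLURM_SUBMIT_DIR, then the current directory
if [[ -n "${PBS_O_WORKDIR:-}" ]]; then
    WORKING_DIR="${PBS_O_WORKDIR}"
elif [[ -n "${SLURM_SUBMIT_DIR:-}" ]]; then
    WORKING_DIR="${SLURM_SUBMIT_DIR}"
else
    WORKING_DIR="$(pwd)"
fi
echo "Using WORKING_DIR=${WORKING_DIR}"
```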
## 🛠️ Setup Python
This will:

1. Automatically load and activate `conda` using the `ezpz_setup_conda` function.
   How this is done, in practice, varies from machine to machine:

   - ALCF[^3]: Automatically load the most recent `conda` module and activate the base environment.
   - Frontier: Load the appropriate AMD modules (e.g. `rocm`, `RCCL`, etc.), and activate the base `conda`.
   - Perlmutter: Load the appropriate `pytorch` module and activate its environment.
   - Unknown: In this case, we will look for a `conda`, `mamba`, or `micromamba` executable and, if found, use that to activate the base environment (see the sketch below).
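For illustration, such a fallback might look roughly like the following (assuming standard `conda` / `mamba` / `micromamba` shell integration; this is not the exact `ezpz` implementation):

```bash
# Rough sketch of the "Unknown" fallback: look for a conda-like executable
# and use it to activate the base environment (not the exact ezpz logic).
for exe in conda mamba micromamba; do
    if command -v "${exe}" > /dev/null 2>&1; then
        echo "Found ${exe}: $(command -v "${exe}")"
        # shell integration differs slightly: conda understands `shell.bash hook`,
        # while micromamba / mamba use `shell hook --shell bash`
        eval "$("${exe}" shell.bash hook 2> /dev/null || "${exe}" shell hook --shell bash)"
        "${exe}" activate base
        break
    fi
done
```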
> [!NOTE]
> **Using your own conda**
>
> If you are already in a conda environment when calling `ezpz_setup_python`, then it will try to use that instead.
> For example, if you have a custom conda env at `~/conda/envs/custom`, then this would bootstrap the custom conda environment and create the virtual env in `venvs/custom/` (see the example below).
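For instance (a hypothetical environment name; the important part is that the env is already active when `ezpz_setup_python` is called):

```bash
# Hypothetical example: reuse an existing custom conda environment
conda activate ~/conda/envs/custom   # activate your env first
source src/ezpz/bin/utils.sh
ezpz_setup_python                    # picks up the active env, builds venvs/custom/
```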
2. Build (or activate, if found) a virtual environment on top of the (active) base `conda` environment.

   By default, it will look in `$PBS_O_WORKDIR` (otherwise `${SLURM_SUBMIT_DIR}`, otherwise `$(pwd)`) for a nested folder named `"venvs/${CONDA_NAME}"`.
   If this doesn't exist, it will attempt to create a new virtual environment at this location, using something like:
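   ```bash
   # Representative command (the exact ezpz invocation may differ slightly):
   # build a venv on top of the active conda env, inheriting its packages
   python3 -m venv "${WORKING_DIR}/venvs/${CONDA_NAME}" --system-site-packages
   source "${WORKING_DIR}/venvs/${CONDA_NAME}/bin/activate"
   ```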
   (where we've pulled in the `--system-site-packages` from conda).
## 🧰 Setup Job
Now that we are in a suitable python environment, we need to construct the command that we will use to run python on each of our accelerators.
To do this, we need a few things:
- What machine we're on (and what scheduler it's using, i.e. {PBS, SLURM})
- How many nodes are available in our active job
- How many GPUs are on each of those nodes
- What type of GPUs they are

With this information, we can then use `mpi{exec,run}` or `srun` to launch python across all of our accelerators.
Again, how this is done will vary from machine to machine and will depend on the job scheduler in use.
To identify where we are, we look at our `$(hostname)` and see if we're running on one of the known machines:
- ALCF[^4]: Using PBS Pro via `qsub` and `mpiexec` / `mpirun`:
  - Aurora: `x4*` (or `aurora*` on login nodes)
  - Sunspot: `x1*` (or `uan*`)
  - Sophia: `sophia-*`
  - Polaris / Sirius: `x3*`
    - to determine between the two, we look at `"${PBS_O_HOST}"`
- OLCF / NERSC: Using Slurm via `sbatch` / `srun`:
  - `frontier*`: Frontier (OLCF), using Slurm
  - `nid*`: Perlmutter (NERSC), using Slurm
- Unknown machine: If `$(hostname)` does not match one of these patterns, we assume we are running on an unknown machine and will try to use `mpirun` as our generic launch command (a rough sketch of this detection is shown below).
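For illustration (not the exact `ezpz` logic; the real helpers also record the scheduler and build the matching launch command):

```bash
# Rough sketch of hostname-based machine detection (not the exact ezpz logic)
case "$(hostname)" in
    x4* | aurora*) machine="aurora" ;;
    x1* | uan*)    machine="sunspot" ;;
    sophia-*)      machine="sophia" ;;
    x3*)           machine="polaris" ;;   # Polaris vs. Sirius decided via "${PBS_O_HOST}"
    frontier*)     machine="frontier" ;;
    nid*)          machine="perlmutter" ;;
    *)             machine="unknown" ;;   # fall back to a generic mpirun launch
esac
echo "Detected machine: ${machine}"
```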
Once we have this, we can:

1. Get `PBS_NODEFILE` from `$(hostname)`:

   - `ezpz_qsme_running`: For each (running) job owned by `${USER}`, print out both the jobid as well as the list of hosts the job is running on.
   - `ezpz_get_pbs_nodefile_from_hostname`: Look for `$(hostname)` in the output from the above command to determine our `${PBS_JOBID}`. Once we've identified our `${PBS_JOBID}`, we then know the location of our `${PBS_NODEFILE}`, since the nodefiles are named according to their `${PBS_JOBID}`.

2. Identify the number of available accelerators (see the sketch below).
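For step 2, a rough illustration (GPU counting is system dependent; this is not the exact `ezpz` implementation) might look like:

```bash
# Rough sketch: count hosts and accelerators, given a ${PBS_NODEFILE}
# that lists one host per line (system dependent; not the exact ezpz logic)
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
# On NVIDIA systems, for example:
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
# (on AMD or Intel systems, query rocm-smi / xpu-smi instead)
NGPUS=$((NHOSTS * NGPU_PER_HOST))
echo "NHOSTS=${NHOSTS} NGPU_PER_HOST=${NGPU_PER_HOST} NGPUS=${NGPUS}"
```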
[^1]: Plus, this is useful for tab-completions in your shell (e.g. typing `ezpz_<TAB>`).
[^2]: This is system dependent. See `ezpz_setup_conda`.
[^3]: Any of {Aurora, Polaris, Sophia, Sunspot, Sirius}.
[^4]: At ALCF, if our `$(hostname)` starts with `x*`, we're on a compute node.