# 🐚 Shell Environment
## 🐣 Getting Started
> [!NOTE]
> **🚧 Work in Progress**
>
> The documentation below is a work in progress.
> Please feel free to provide input / suggest changes!
> [!NOTE]
>
> 1. Source the `src/ezpz/bin/utils.sh` file.
> 2. Use the `ezpz_setup_env` function to set up your environment (see the example below).
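For example, assuming you are at the root of the `ezpz` repository (adjust the path to `utils.sh` otherwise):

```bash
# assumes you are at the root of the ezpz repository; adjust the path otherwise
source src/ezpz/bin/utils.sh   # 1. load the ezpz_* helper functions
ezpz_setup_env                 # 2. set up python + job environment, define `launch`
```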
This will 🪄 automagically:

- 🐍 Setup Python: Load the appropriate module(s) and put you inside a suitable python environment
- 🧰 Setup Job: Determine the resources available in the current job and build a `launch` alias for launching executables
We provide a variety of helper functions designed to make your life easier when working with job schedulers (e.g. PBS Pro @ ALCF or `slurm` elsewhere).

All of these functions are prefixed with `ezpz_`[^1] and are summarized in Table 1 below.
We would like to write our application in such a way that it is able to take full advantage of the resources allocated by the job scheduler.
That is to say, we want to have a single script with the ability to
dynamically launch
python applications across any number of
accelerators on any of the systems under consideration.
In order to do this, there is some basic setup and information gathering that needs to occur.
In particular, we need mechanisms for:
- Setting up a python environment
- Determining what system / machine we're on
  - and what job scheduler we're using (e.g. PBS Pro @ ALCF or `slurm` elsewhere)
- Determining how many nodes have been allocated in the current job (`NHOSTS`, \(N_{\mathrm{HOST}}\))
  - and how many accelerators exist on each of these nodes (`NGPU_PER_HOST`)
This allows us to calculate the total number of accelerators (GPUs) as:

$$
N_{\mathrm{GPU}} = N_{\mathrm{HOST}} \times N_{\mathrm{GPU/HOST}}
$$

where \(N_{\mathrm{GPU/HOST}}\) (`NGPU_PER_HOST`) is the number of GPUs per host.
With this we have everything we need to build the appropriate `mpi{run,exec}` / `srun` command for launching our python application across them.
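Concretely, a simplified sketch (the actual `ezpz` helpers handle the scheduler- and site-specific details) might look like:

```bash
# Simplified sketch: compute the total GPU count and build a `launch` alias.
# Assumes a PBS-style job (with ${PBS_NODEFILE}); on slurm, srun plays this role.
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
NGPU_PER_HOST=4                      # placeholder: machine dependent (e.g. 4 on Polaris)
NGPUS=$((NHOSTS * NGPU_PER_HOST))    # N_GPU = N_HOST x N_GPU_PER_HOST
# exact mpiexec flags vary by MPI implementation / site
alias launch="mpiexec -n ${NGPUS} --ppn ${NGPU_PER_HOST} --hostfile ${PBS_NODEFILE}"
# then, e.g.: launch python3 your_script.py
```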
Now, there are a few functions in particular worth elaborating on.
| Function | Description |
|---|---|
| `ezpz_setup_env` | Wrapper around `ezpz_setup_python` && `ezpz_setup_job` |
| `ezpz_setup_job` | Determine `{NGPUS, NGPU_PER_HOST, NHOSTS}`, build `launch` command alias |
| `ezpz_setup_python` | Wrapper around `ezpz_setup_conda` && `ezpz_setup_venv_from_conda` |
| `ezpz_setup_conda` | Find and activate the appropriate `conda` module to load[^2] |
| `ezpz_setup_venv_from_conda` | From `${CONDA_NAME}`, build or activate the virtual env located in `venvs/${CONDA_NAME}/` |

Table 1: Shell Functions
> [!NOTE]
> **Where am I?**
>
> Some of the `ezpz_*` functions (e.g. `ezpz_setup_python`) will try to create / look for certain directories.
> In an effort to be explicit, these directories are defined relative to a `WORKING_DIR` (e.g. `"${WORKING_DIR}/venvs/"`).
>
> This `WORKING_DIR` will be assigned to the first non-empty match found below:
>
> 1. `PBS_O_WORKDIR`: If found in the environment, paths will be relative to this
> 2. `SLURM_SUBMIT_DIR`: Next in line. If not @ ALCF, maybe using `slurm`…
> 3. `$(pwd)`: Otherwise, no worries. Use your actual working directory.
>
> (a minimal sketch of this resolution order is shown below)
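A minimal sketch of that resolution order (not necessarily the exact `ezpz` implementation) might look like:

```bash
# Resolve WORKING_DIR from the first non-empty candidate:
# PBS_O_WORKDIR, then SLURM_SUBMIT_DIR, then the current directory
if [[ -n "${PBS_O_WORKDIR:-}" ]]; then
    WORKING_DIR="${PBS_O_WORKDIR}"
elif [[ -n "${SLURM_SUBMIT_DIR:-}" ]]; then
    WORKING_DIR="${SLURM_SUBMIT_DIR}"
else
    WORKING_DIR="$(pwd)"
fi
echo "Using WORKING_DIR=${WORKING_DIR}"
```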
## 🛠️ Setup Python
This will:

1. Automatically load and activate `conda` using the `ezpz_setup_conda` function.
   How this is done, in practice, varies from machine to machine:

   - ALCF[^3]: Automatically load the most recent `conda` module and activate the base environment.
   - Frontier: Load the appropriate AMD modules (e.g. `rocm`, `RCCL`, etc.), and activate the base `conda`.
   - Perlmutter: Load the appropriate `pytorch` module and activate its environment.
   - Unknown: In this case, we will look for a `conda`, `mamba`, or `micromamba` executable and, if found, use that to activate the base environment (see the sketch below).
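For illustration, such a fallback might look roughly like the following (assuming standard `conda` / `mamba` / `micromamba` shell integration; this is not the exact `ezpz` implementation):

```bash
# Rough sketch of the "Unknown" fallback: look for a conda-like executable
# and use it to activate the base environment (not the exact ezpz logic).
for exe in conda mamba micromamba; do
    if command -v "${exe}" > /dev/null 2>&1; then
        echo "Found ${exe}: $(command -v "${exe}")"
        # shell integration differs slightly: conda understands `shell.bash hook`,
        # while micromamba / mamba use `shell hook --shell bash`
        eval "$("${exe}" shell.bash hook 2> /dev/null || "${exe}" shell hook --shell bash)"
        "${exe}" activate base
        break
    fi
done
```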
> [!NOTE]
> **Using your own conda**
>
> If you are already in a conda environment when calling `ezpz_setup_python`, then it will try to use that instead.
> For example, if you have a custom conda env at `~/conda/envs/custom`, then this would bootstrap the custom conda environment and create the virtual env in `venvs/custom/` (see the example below).
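For instance (a hypothetical environment name; the important part is that the env is already active when `ezpz_setup_python` is called):

```bash
# Hypothetical example: reuse an existing custom conda environment
conda activate ~/conda/envs/custom   # activate your env first
source src/ezpz/bin/utils.sh
ezpz_setup_python                    # picks up the active env, builds venvs/custom/
```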
2. Build (or activate, if found) a virtual environment on top of the (active) base `conda` environment.

   By default, it will look in `$PBS_O_WORKDIR` (otherwise `${SLURM_SUBMIT_DIR}`, otherwise `$(pwd)`) for a nested folder named `"venvs/${CONDA_NAME}"`.
   If this doesn't exist, it will attempt to create a new virtual environment at this location, using something like:
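   ```bash
   # Representative command (the exact ezpz invocation may differ slightly):
   # build a venv on top of the active conda env, inheriting its packages
   python3 -m venv "${WORKING_DIR}/venvs/${CONDA_NAME}" --system-site-packages
   source "${WORKING_DIR}/venvs/${CONDA_NAME}/bin/activate"
   ```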
   (where we've pulled in the `--system-site-packages` from conda).
## 🧰 Setup Job
Now that we are in a suitable python environment, we need to construct the command that we will use to run python on each of our accelerators.
To do this, we need a few things:
- What machine we're on (and what scheduler it's using, i.e. {PBS, SLURM})
- How many nodes are available in our active job
- How many GPUs are on each of those nodes
- What type of GPUs they are

With this information, we can then use `mpi{exec,run}` or `srun` to launch python across all of our accelerators.
Again, how this is done will vary from machine to machine and will depend on the job scheduler in use.
To identify where we are, we look at our `$(hostname)` and see if we're running on one of the known machines:
- ALCF[^4]: Using PBS Pro via `qsub` and `mpiexec` / `mpirun`:
  - Aurora: `x4*` (or `aurora*` on login nodes)
  - Sunspot: `x1*` (or `uan*`)
  - Sophia: `sophia-*`
  - Polaris / Sirius: `x3*`
    - to determine between the two, we look at `"${PBS_O_HOST}"`
- OLCF / NERSC: Using Slurm via `sbatch` / `srun`:
  - `frontier*`: Frontier (OLCF), using Slurm
  - `nid*`: Perlmutter (NERSC), using Slurm
- Unknown machine: If `$(hostname)` does not match one of these patterns, we assume we are running on an unknown machine and will try to use `mpirun` as our generic launch command (a rough sketch of this detection is shown below).
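For illustration (not the exact `ezpz` logic; the real helpers also record the scheduler and build the matching launch command):

```bash
# Rough sketch of hostname-based machine detection (not the exact ezpz logic)
case "$(hostname)" in
    x4* | aurora*) machine="aurora" ;;
    x1* | uan*)    machine="sunspot" ;;
    sophia-*)      machine="sophia" ;;
    x3*)           machine="polaris" ;;   # Polaris vs. Sirius decided via "${PBS_O_HOST}"
    frontier*)     machine="frontier" ;;
    nid*)          machine="perlmutter" ;;
    *)             machine="unknown" ;;   # fall back to a generic mpirun launch
esac
echo "Detected machine: ${machine}"
```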
Once we have this, we can:

1. Get `PBS_NODEFILE` from `$(hostname)`:

   - `ezpz_qsme_running`: For each (running) job owned by `${USER}`, print out both the jobid as well as the list of hosts the job is running on.
   - `ezpz_get_pbs_nodefile_from_hostname`: Look for `$(hostname)` in the output from the above command to determine our `${PBS_JOBID}`. Once we've identified our `${PBS_JOBID}`, we then know the location of our `${PBS_NODEFILE}`, since the nodefiles are named according to their `${PBS_JOBID}`.

2. Identify the number of available accelerators (see the sketch below).
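For step 2, a rough illustration (GPU counting is system dependent; this is not the exact `ezpz` implementation) might look like:

```bash
# Rough sketch: count hosts and accelerators, given a ${PBS_NODEFILE}
# that lists one host per line (system dependent; not the exact ezpz logic)
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
# On NVIDIA systems, for example:
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
# (on AMD or Intel systems, query rocm-smi / xpu-smi instead)
NGPUS=$((NHOSTS * NGPU_PER_HOST))
echo "NHOSTS=${NHOSTS} NGPU_PER_HOST=${NGPU_PER_HOST} NGPUS=${NGPUS}"
```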
[^1]: Plus, this is useful for tab-completions in your shell (e.g. typing `ezpz_<TAB>`).
[^2]: This is system dependent. See `ezpz_setup_conda`.
[^3]: Any of {Aurora, Polaris, Sophia, Sunspot, Sirius}.
[^4]: At ALCF, if our `$(hostname)` starts with `x*`, we're on a compute node.