ezpz PBS Guide
Hardware-agnostic distributed PyTorch on PBS with automatic topology inference and CPU binding.
Quick Start
To build a PBS-aware launch command, call `pbs.get_pbs_launch_cmd`; it uses the current job's hostfile when available.
What the PBS helpers do
- Detect your active PBS job (if any) and locate its nodefile.
- Infer topology (`ngpus`, `nhosts`, `ngpu_per_host`) from the hostfile and machine limits unless you override them.
- Build the correct launcher (`mpiexec`/`mpirun`) with sensible CPU binding:
  - Intel GPU machines (`aurora`, `sunspot`) get `--no-vni` and vendor binding lists.
  - If `CPU_BIND` is set, its value is forwarded verbatim.
  - Otherwise, a generic `--cpu-bind=depth --depth=8` is applied.
- Optionally inject PBS environment metadata (`PBS_NODEFILE`, host list, launch command) via `get_pbs_env`.
flowchart TD
A["Detect scheduler"] --> B{"Active PBS job?"}
B -- yes --> C["get_pbs_jobid_of_active_job"]
C --> D["get_pbs_nodefile(jobid)"]
B -- no --> E["get_hostfile_with_fallback"]
D --> F["infer_topology"]
E --> F
F --> G["get_pbs_launch_cmd"]
G --> H["mpiexec/mpirun command"]
H --> I["Launch distributed script"]
Discovering jobs and hostfiles
- `pbs.get_pbs_running_jobs_for_user() -> dict[str, list[str]]`: all running jobs for the current user with their node lists.
- `pbs.get_pbs_jobid_of_active_job() -> str | None`: job that includes the current host (or `None` if not on a PBS job).
- `pbs.get_pbs_nodefile(jobid=None) -> str | None`: path to the nodefile for a job (active job by default).
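To illustrate how an "active job" could be picked out of the mapping returned by `get_pbs_running_jobs_for_user`, here is a minimal pure-Python sketch. It is not the ezpz implementation: `short_name`, `find_active_jobid`, and the job IDs and node names are hypothetical, and the comparison on short hostnames is an assumption.

```python
from __future__ import annotations

import socket


def short_name(host: str) -> str:
    """Drop any domain suffix so 'node-a.example.org' matches 'node-a'."""
    return host.split(".", 1)[0]


def find_active_jobid(
    running_jobs: dict[str, list[str]],
    hostname: str | None = None,
) -> str | None:
    """Return the first job ID whose node list contains `hostname`.

    Falls back to the local hostname when `hostname` is not given, and
    returns None when no job includes that host.
    """
    here = short_name(hostname or socket.gethostname())
    for jobid, nodes in running_jobs.items():
        if here in {short_name(node) for node in nodes}:
            return jobid
    return None


# Hypothetical job IDs and node names, for illustration only.
jobs = {
    "1001.pbs-server": ["node-a.example.org", "node-b.example.org"],
    "1002.pbs-server": ["node-c"],
}
print(find_active_jobid(jobs, "node-b"))   # 1001.pbs-server
print(find_active_jobid(jobs, "login-1"))  # None
```

Returning `None` rather than raising mirrors the `str | None` return type documented for `get_pbs_jobid_of_active_job`, so callers can fall back to a non-PBS hostfile.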
Topology inference
`get_pbs_launch_cmd` will infer topology when you omit values:
- If nothing is specified: use all GPUs on all hosts in the hostfile.
- If you set `nhosts`: it uses all GPUs per host for that many hosts.
- If you set `ngpu_per_host`: `ngpus = nhosts * ngpu_per_host`.
- If you set `ngpus` only: `ngpu_per_host = ngpus / nhosts` (must divide evenly).
- Any inconsistent combination raises `ValueError`.
Override any of these explicitly when needed by passing `ngpus`, `nhosts`, or `ngpu_per_host` to `get_pbs_launch_cmd`.
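The inference rules, including explicit overrides, can be sketched in plain Python. This is an illustration under stated assumptions, not the code ezpz runs: the function name `infer_topology`, its `hosts_available` and `gpus_per_host_available` parameters, and the check against available hardware are all hypothetical.

```python
from __future__ import annotations


def infer_topology(
    hosts_available: int,
    gpus_per_host_available: int,
    ngpus: int | None = None,
    nhosts: int | None = None,
    ngpu_per_host: int | None = None,
) -> tuple[int, int, int]:
    """Return (ngpus, nhosts, ngpu_per_host), filling in omitted values."""
    for name, value in (("ngpus", ngpus), ("nhosts", nhosts),
                        ("ngpu_per_host", ngpu_per_host)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be positive, got {value}")

    if nhosts is None:
        nhosts = hosts_available  # default: every host in the hostfile
    if ngpu_per_host is None:
        if ngpus is None:
            ngpu_per_host = gpus_per_host_available  # default: all GPUs
        elif ngpus % nhosts != 0:
            raise ValueError(f"ngpus={ngpus} not divisible by nhosts={nhosts}")
        else:
            ngpu_per_host = ngpus // nhosts
    if ngpus is None:
        ngpus = nhosts * ngpu_per_host

    if ngpus != nhosts * ngpu_per_host:
        raise ValueError("ngpus must equal nhosts * ngpu_per_host")
    # Assumption: asking for more than the hardware offers is also an error.
    if nhosts > hosts_available or ngpu_per_host > gpus_per_host_available:
        raise ValueError("requested topology exceeds available resources")
    return ngpus, nhosts, ngpu_per_host


# Two hosts with four GPUs each (hypothetical hardware).
print(infer_topology(2, 4))                   # (8, 2, 4)
print(infer_topology(2, 4, nhosts=1))         # (4, 1, 4)
print(infer_topology(2, 4, ngpu_per_host=2))  # (4, 2, 2)
print(infer_topology(2, 4, ngpus=4))          # (4, 2, 2)
```

Validating the final triple in one place keeps every override path subject to the same consistency check.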
Building launch commands
- `pbs.get_pbs_launch_cmd(...) -> str`: build the launcher string (`mpiexec`/`mpirun`) with CPU binding.
- `pbs.build_launch_cmd(...) -> str`: scheduler-aware wrapper; currently dispatches to PBS or Slurm.
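To make the shape of such a launcher string concrete, the sketch below assembles one from a resolved topology. It is only an illustration: `build_mpiexec_cmd` is a hypothetical helper, the hostfile path is made up, and the `-n`, `--ppn`, and `--hostfile` flags follow common PALS-style `mpiexec` usage, which may differ from what ezpz actually emits on a given machine.

```python
from __future__ import annotations

import shlex
from collections.abc import Sequence


def build_mpiexec_cmd(
    ngpus: int,
    ngpu_per_host: int,
    hostfile: str,
    cpu_bind_args: Sequence[str] = ("--cpu-bind=depth", "--depth=8"),
) -> str:
    """Join launcher arguments into a single shell-safe string."""
    argv = [
        "mpiexec",
        "-n", str(ngpus),            # total number of ranks
        "--ppn", str(ngpu_per_host),  # ranks per host
        "--hostfile", hostfile,
        *cpu_bind_args,              # defaults to the generic depth binding
    ]
    return shlex.join(argv)


# Hypothetical hostfile path.
cmd = build_mpiexec_cmd(ngpus=8, ngpu_per_host=4, hostfile="/tmp/hostfile")
print(cmd)
# mpiexec -n 8 --ppn 4 --hostfile /tmp/hostfile --cpu-bind=depth --depth=8
```

Building a list first and quoting with `shlex.join` avoids broken commands when a hostfile path contains spaces.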
Environment injection
- `pbs.get_pbs_env(hostfile=None, jobid=None, verbose=False) -> dict[str, str]`:
  - Adds PBS-derived metadata (host list, counts, `LAUNCH_CMD`) into `os.environ`.
  - Useful for passing context into downstream tools or logging.
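As a rough picture of what environment injection involves, the sketch below reads a hostfile and writes derived values into a mapping. It does not reproduce `get_pbs_env`: `read_hosts` and `inject_env` are hypothetical helpers, and `HOSTS` and `NHOSTS` are placeholder variable names (only `PBS_NODEFILE` and `LAUNCH_CMD` are named in this guide).

```python
from __future__ import annotations

import os
import tempfile
from collections.abc import MutableMapping


def read_hosts(hostfile: str) -> list[str]:
    """Return the unique hostnames in `hostfile`, preserving their order."""
    with open(hostfile) as f:
        names = [line.strip() for line in f if line.strip()]
    return list(dict.fromkeys(names))


def inject_env(
    hostfile: str,
    launch_cmd: str,
    env: MutableMapping[str, str] | None = None,
) -> dict[str, str]:
    """Write hostfile-derived metadata into `env` (default: os.environ)."""
    env = os.environ if env is None else env
    hosts = read_hosts(hostfile)
    meta = {
        "PBS_NODEFILE": hostfile,
        "LAUNCH_CMD": launch_cmd,
        "HOSTS": ",".join(hosts),   # placeholder name (assumption)
        "NHOSTS": str(len(hosts)),  # placeholder name (assumption)
    }
    env.update(meta)
    return meta


# Demonstrate against a scratch dict so the real environment is untouched.
with tempfile.NamedTemporaryFile("w", suffix=".hostfile", delete=False) as f:
    f.write("node-a\nnode-b\n")
scratch: dict[str, str] = {}
inject_env(f.name, "mpiexec -n 8 --ppn 4", env=scratch)
os.remove(f.name)
print(scratch["HOSTS"], scratch["NHOSTS"])  # node-a,node-b 2
```

Accepting the target mapping as a parameter makes the helper easy to test without mutating the process environment.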
Minimal end-to-end launch
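A minimal end-to-end sketch, assuming the helpers are importable as `ezpz.pbs` and that calling `get_pbs_launch_cmd()` with no arguments uses all GPUs on all hosts, as described under topology inference. The `full_argv` and `main` helpers and the `train.py` script name are hypothetical.

```python
from __future__ import annotations

import shlex
import subprocess
import sys


def full_argv(launch_cmd: str, script: str, *script_args: str) -> list[str]:
    """Combine a launcher string with the Python command each rank runs."""
    return [*shlex.split(launch_cmd), sys.executable, script, *script_args]


def main(script: str) -> int:
    """Build the launcher for the current job and run `script` under it."""
    # Assumption: the PBS helpers live in the `ezpz.pbs` module.
    from ezpz import pbs

    launch_cmd = pbs.get_pbs_launch_cmd()  # no overrides: infer everything
    argv = full_argv(launch_cmd, script)
    print("Launching:", shlex.join(argv))
    return subprocess.run(argv, check=False).returncode


# From inside a PBS job (hypothetical script name):
#     raise SystemExit(main("train.py"))
```

Splitting the launcher string with `shlex.split` lets `subprocess.run` execute it without going through a shell.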
Notes
- If no PBS job is active, `get_pbs_launch_cmd` still works using the fallback hostfile and machine introspection.
- CPU binding precedence: `CPU_BIND` env > machine-specific defaults > generic depth binding.
- Combine with `ezpz.launch` if you want scheduler-agnostic CLI parsing and fallbacks.
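The CPU binding precedence noted above can be sketched as a small selection function. This is an illustration rather than the ezpz source: `choose_cpu_bind` is a hypothetical name, and `vendor_bind` stands in for the vendor binding list, whose actual contents are not given in this guide.

```python
from __future__ import annotations

import os
import shlex
from collections.abc import Mapping

INTEL_GPU_MACHINES = {"aurora", "sunspot"}
GENERIC_BIND = ["--cpu-bind=depth", "--depth=8"]


def choose_cpu_bind(
    machine: str,
    vendor_bind: str,
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Pick binding arguments: CPU_BIND, then machine default, then generic."""
    env = os.environ if env is None else env
    user_bind = env.get("CPU_BIND")
    if user_bind:
        # Highest precedence: pass the user's value through unchanged.
        return shlex.split(user_bind)
    if machine.lower() in INTEL_GPU_MACHINES:
        # Machine-specific default; `vendor_bind` is a caller-supplied placeholder.
        return ["--no-vni", vendor_bind]
    return list(GENERIC_BIND)


placeholder = "<vendor-binding-list>"
print(choose_cpu_bind("aurora", placeholder, env={}))
# ['--no-vni', '<vendor-binding-list>']
print(choose_cpu_bind("aurora", placeholder, env={"CPU_BIND": "--cpu-bind=verbose"}))
# ['--cpu-bind=verbose']
print(choose_cpu_bind("other-cluster", placeholder, env={}))
# ['--cpu-bind=depth', '--depth=8']
```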