Systems Matrix (WIP)
Quick reference for supported environments and defaults.
Matrix
| System | Scheduler | Launcher | GPU type | Hostfile source | Notes |
|---|---|---|---|---|---|
| Aurora | PBS Pro | `mpiexec` | Intel XPU | `/var/spool/pbs/aux/$PBS_JOBID.*` | Uses `xccl`/`ccl` backend; module load via `ezpz_setup_env`. |
| Sunspot | PBS Pro | `mpiexec` | Intel XPU | `/var/spool/pbs/aux/$PBS_JOBID.*` | Similar to Aurora; filters for PBS noise. |
| Polaris | PBS Pro | `mpiexec` | NVIDIA GPU | `/var/spool/pbs/aux/$PBS_JOBID.*` | Ensure matching CUDA toolkit and torch wheel. |
| Frontier | SLURM | `srun` | AMD GPU | SLURM env (`SLURM_NODELIST`) | Load ROCm/RCCL modules before launch. |
| Perlmutter | SLURM | `srun` | NVIDIA GPU | SLURM env (`SLURM_NODELIST`) | Ensure matching CUDA toolkit and torch wheel. |
| Local | None | `mpirun` | CPU/GPU (single) | None | Set `WORLD_SIZE`/`--np` for multi-proc; falls back to CPU if no GPU. |
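
The scheduler and launcher columns above can be read as a simple lookup keyed on the environment. The following is a minimal sketch of that idea, not the project's actual detection logic: `detect_scheduler` and `DEFAULT_LAUNCHER` are hypothetical names, and the choice of environment variables to probe (`PBS_JOBID`, `PBS_NODEFILE`, `SLURM_JOB_ID`, `SLURM_NODELIST`) is an assumption.

```python
import os

# Default launcher per scheduler, mirroring the matrix above.
DEFAULT_LAUNCHER = {"pbs": "mpiexec", "slurm": "srun", "local": "mpirun"}


def detect_scheduler(env=None):
    """Guess the scheduler from environment variables (illustrative heuristic).

    `env` is any mapping; it defaults to the current process environment.
    """
    env = os.environ if env is None else env
    # PBS jobs normally export PBS_JOBID and PBS_NODEFILE.
    if "PBS_JOBID" in env or "PBS_NODEFILE" in env:
        return "pbs"
    # SLURM jobs normally export SLURM_JOB_ID and SLURM_NODELIST.
    if "SLURM_JOB_ID" in env or "SLURM_NODELIST" in env:
        return "slurm"
    # No scheduler variables found: treat as a local run.
    return "local"


if __name__ == "__main__":
    scheduler = detect_scheduler()
    print(f"scheduler={scheduler} launcher={DEFAULT_LAUNCHER[scheduler]}")
```

Passing an explicit mapping instead of relying on `os.environ` keeps the check easy to exercise outside a real job allocation.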
Overrides & Tips
- Want to use a custom hostfile?
  - Pass `--hostfile /path/to/hostfile` or set `HOSTFILE`.
- Override counts:
  - `-n` (total ranks)
  - `-ppn` (ranks per node)
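
To make the overrides above concrete, here is a hedged sketch of how a hostfile choice and the rank counts might be combined. The helper names (`resolve_hostfile`, `parse_hosts`, `rank_counts`) are hypothetical, and the precedence shown (explicit `--hostfile` value, then `HOSTFILE`, then `PBS_NODEFILE`) is an assumption rather than documented behavior.

```python
import os
from pathlib import Path


def resolve_hostfile(cli_hostfile=None, env=None):
    """Pick a hostfile path; assumed precedence: CLI value, HOSTFILE, PBS_NODEFILE."""
    env = os.environ if env is None else env
    for candidate in (cli_hostfile, env.get("HOSTFILE"), env.get("PBS_NODEFILE")):
        if candidate:
            return Path(candidate)
    return None  # nothing set: caller falls back to a single local host


def parse_hosts(text):
    """Return unique hostnames in first-seen order, skipping blank lines."""
    hosts = {}
    for line in text.splitlines():
        host = line.strip()
        if host:
            hosts.setdefault(host, None)
    return list(hosts)


def rank_counts(num_hosts, n=None, ppn=None, default_ppn=1):
    """Fill in whichever of -n (total ranks) / -ppn (ranks per node) is missing."""
    if ppn is None:
        ppn = default_ppn if n is None else max(1, n // num_hosts)
    if n is None:
        n = num_hosts * ppn
    return n, ppn


if __name__ == "__main__":
    path = resolve_hostfile()
    hosts = parse_hosts(path.read_text()) if path and path.exists() else ["localhost"]
    print(hosts, rank_counts(len(hosts), ppn=4))
```

With two hosts and `ppn=4`, `rank_counts` yields 8 total ranks; supplying only `n=8` for the same two hosts yields 4 ranks per node.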
Known Failure Modes (preview)
- Scheduler not detected: set `EZPZ_LOG_LEVEL=DEBUG`, check `PBS_NODEFILE`/`SLURM_NODELIST`; use `--hostfile`.
- `mpiexec`/`srun` not found: load the relevant module, or use the full path in a `launch_cmd` override (future hook).
- Backend init failures (`xccl`/`nccl`): verify driver/modules; fall back to `gloo` for debugging.
- Wandb network issues: set `WANDB_MODE=offline`; sync later if needed.
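
The first two checks above can be scripted as a quick preflight before launching. This is a rough sketch using only the standard library; `preflight` is a hypothetical helper and is not part of the tool described here.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems found before launch."""
    env = os.environ if env is None else env
    problems = []
    # Scheduler detection relies on one of these variables being set.
    if not any(key in env for key in ("PBS_NODEFILE", "SLURM_NODELIST")):
        problems.append(
            "no scheduler detected: check PBS_NODEFILE/SLURM_NODELIST or pass --hostfile"
        )
    # At least one launcher should be discoverable on PATH.
    if not any(which(cmd) for cmd in ("mpiexec", "srun")):
        problems.append("mpiexec/srun not found: load the module or use a full path")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

The `which` parameter exists only so the lookup can be stubbed out when testing; by default it is `shutil.which`.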
Example Launch Commands
- Aurora (PBS):
- Sunspot (PBS):
- Frontier (SLURM):
- Perlmutter (SLURM):
- Local fallback (mpirun):