π©Ί Doctor
Health-check your environment and ensure that ezpz is installed correctly
ezpz doctor inspects the active environment and reports whether common launch
prerequisites are satisfied before you submit a distributed job.
The command is available from the main CLI group:
ezpz doctor
ezpz doctor --json # machine-friendly output for CI
What gets checked?
Explicitly, this checks:
Python Environment (conda / mamba / venv) details
Torch installed and accelerators visible
Weights & Biases (wandb) installation and authentication
MPI installation and mpi4py bindings
Job scheduler (PBS / Slurm) and determines specifics of active job.
This includes the number of available nodes, how many accelerators per node,
and the total number of accelerators.
Check
Status Criteria
Remedy Hints
MPI
Confirms both mpi4py and a launcher (mpiexec/mpirun) are accessible.
Suggests installing the missing component.
Scheduler
Identifies PBS/SLURM based on environment variables.
Points to the scheduler plug-in mechanism if the host is unknown.
wandb
Detects the wandb Python package plus credentials or offline mode.
Recommends WANDB_MODE=offline when credentials are absent.
torch
Validates the PyTorch install and accelerator visibility (cuda, xpu, mps, or CPU).
Advises setting TORCH_DEVICE for CPU-only runs.
Each line is reported with a severity (OK, WARNING, ERROR) and, when
relevant, an actionable remediation hint.
Example Output
localhost:
( ezpz)
#[12/26/25 @ 15:06:12][~/v/s/ezpz][distributed-metrics][$βΒ»!?]
; ezpz doctor
== Runtime Context ==
User: samforeman
Machine: arm64
Hostname: Sams-MacBook-Pro-2.local
PBS_JOBID: N/A
PBS_NODEFILE: N/A
ezpz: 0 .9.0 ( /Users/samforeman/vibes/saforem2/ezpz/src/ezpz/__init__.py)
Module list:
bash: line 1 : module: command not found
Active environment:
Virtual env: /Users/samforeman/vibes/saforem2/ezpz/venvs/sams-macbook-pro-2.local/ezpz
Python: /Users/samforeman/vibes/saforem2/ezpz/venvs/sams-macbook-pro-2.local/ezpz/bin/python3 ( 3 .12.8)
PyTorch: 2 .9.1 ( /Users/samforeman/vibes/saforem2/ezpz/venvs/sams-macbook-pro-2.local/ezpz/lib/python3.12/site-packages/torch/__init__.py)
[ β
OKAY] [ mpi ] : mpi4py import succeeded and an MPI launcher was found.
[ β
OKAY] [ wandb ] : wandb authentication provided in '~/.netrc' Should be all set.
[ β
OKAY] [ torch ] : PyTorch detected at least one accelerator.
[ β
OKAY] [ hostfile ] : No scheduler detected; HOSTFILE not required.
[ β οΈ WARN] [ scheduler] : No scheduler detected β assuming local launch mode.
β³ Set scheduler environment variables or configure a custom adapter if running under a job queue.
[ 2025 -12-23-162222] Execution time: 5s sec
Aurora at ALCF:
; ezpz doctor
== Runtime Context ==
User: foremans
Machine: x86_64
Hostname: x4712c1s0b0n0
PBS_JOBID: 8227686 .aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
PBS_NODEFILE: /var/spool/pbs/aux/8227686.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
ezpz: v0.9.0 ( /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/ezpz-distributed-metrics/src/ezpz/__init__.py)
Module list:
gcc-runtime/13.3.0-ghotoln <H>
gmp/6.3.0-mtokfaw <H>
mpfr/4.2.1-gkcdl5w <H>
mpc/1.3.1-rdrlvsl <H>
gcc/13.3.0
oneapi/release/2025.2.0
libxml2/2.13.5
hwloc/2.12.1-level-zero
mpich/opt/develop-git.6037a7a
libfabric/1.22.0
cray-pals/1.8.0
cray-libpals/1.8.0
hdf5/1.14.6
frameworks/2025.2.0
Active environment:
Virtual env: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/ezpz-distributed-metrics/venvs/aurora/ezpz-distributed-metrics-aurora_frameworks-2025.2.0
Conda env: /opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0 ( /opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0)
Python: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/ezpz-distributed-metrics/venvs/aurora/ezpz-distributed-metrics-aurora_frameworks-2025.2.0/bin/python3 ( 3 .10.14)
PyTorch: 2 .8.0a0+gitba56102 ( /opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/site-packages/torch/__init__.py)
Scheduler resources:
NHOSTS: 2
NGPU_PER_HOST: 12
NGPUS: 24
[ β
OKAY] [ mpi ] : mpi4py import succeeded and an MPI launcher was found.
[ β
OKAY] [ wandb ] : wandb authentication provided in '~/.netrc' Should be all set.
[ β
OKAY] [ torch ] : PyTorch detected at least one accelerator.
[ β
OKAY] [ hostfile ] : HOSTFILE = /var/spool/pbs/aux/8227686.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov ( 2 hosts) .
[ β
OKAY] [ scheduler] : Detected active scheduler: PBS.
JSON Output from Sunspot at ALCF
#[aurora_frameworks-2025.2.0]
#[~][σ° 23s]
#[12/26/25 @ 15:24:07][x1921c0s3b0n0]
; uv run --python= $( which python3) --with "git+https://github.com/saforem2/ezpz@distributed-metrics" ezpz doctor --json
[
{
"name" : "mpi" ,
"status" : "ok" ,
"message" : "mpi4py import succeeded and an MPI launcher was found." ,
"remedy" : null
},
{
"name" : "wandb" ,
"status" : "ok" ,
"message" : "wandb authentication provided in '~/.netrc' Should be all set." ,
"remedy" : null
},
{
"name" : "torch" ,
"status" : "ok" ,
"message" : "PyTorch detected at least one accelerator." ,
"remedy" : null
},
{
"name" : "hostfile" ,
"status" : "ok" ,
"message" : "No scheduler detected; HOSTFILE not required." ,
"remedy" : null
},
{
"name" : "scheduler" ,
"status" : "ok" ,
"message" : "Detected active scheduler: PBS." ,
"remedy" : null
}
]