# ezpz
Write once, run anywhere.
ezpz makes distributed PyTorch launches portable across NVIDIA, AMD, Intel,
MPS, and CPU, with zero code changes and guardrails for HPC schedulers.
It provides:

- 🧰 A CLI (`ezpz`) with utilities for launching distributed jobs
- 🐍 A Python library (`ezpz`) for writing hardware-agnostic, distributed PyTorch code
- Pre-built examples

All of which:

- Use modern distributed PyTorch features (FSDP, TP, HF Trainer)
- Can be run anywhere (e.g. NVIDIA, AMD, Intel, MPS, CPU)
Check out the Docs for more information!
## 🐣 Getting Started
1. **Set up a Python environment**:

   To use `ezpz`, we first need a Python environment (preferably virtual) that has `torch` and `mpi4py` installed.

   - Already have one? Skip to (2.) below!
   - Otherwise, we can use the provided src/ezpz/bin/utils.sh[^2] to set up our environment, as sketched below.

   **Note**: this is technically optional, but recommended. Especially if you happen to be running behind a job scheduler (e.g. PBS / Slurm) at any of {ALCF, OLCF, NERSC}, it will automatically load the appropriate modules and use these to bootstrap a virtual environment. However, if you already have a Python environment with {`torch`, `mpi4py`} installed and would prefer to use that, skip directly to (2.), installing `ezpz`, below.
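   A minimal sketch of that bootstrap step, assuming a helper named `ezpz_setup_env` is defined in utils.sh (the helper name is an assumption; check the script for the exact functions it provides):

   ```bash
   # download and source the helper script; https://bit.ly/ezpz-utils is a
   # short link to src/ezpz/bin/utils.sh in the ezpz repository
   source <(curl -fsSL https://bit.ly/ezpz-utils)

   # detect any active PBS / Slurm job, load the appropriate site modules
   # (ALCF / OLCF / NERSC), and bootstrap a virtual environment on top of them
   ezpz_setup_env
   ```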
2. **Install `ezpz`**[^1]:

   - **Need PyTorch or `mpi4py`?** If you don't already have PyTorch or `mpi4py` installed, you can specify these as additional dependencies (see the sketch below).
   - **... or try without installing!** If you already have a Python environment with {`torch`, `mpi4py`} installed, you can try `ezpz` without installing it (also sketched below).
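   A sketch of both options, using uv. Installing directly from the GitHub repository is shown as a safe default rather than assuming a PyPI package name, and the `uv run --with` form is just one way to try `ezpz` without a permanent install:

   ```bash
   # install ezpz into the active environment
   uv pip install "git+https://github.com/saforem2/ezpz"

   # need torch / mpi4py too? add them as additional dependencies
   uv pip install "git+https://github.com/saforem2/ezpz" torch mpi4py

   # ... or try ezpz without installing it, by layering it onto the current
   # environment for a single invocation (torch + mpi4py must already be present)
   uv run --with "git+https://github.com/saforem2/ezpz" ezpz doctor
   ```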
3. **Distributed smoke test**:

   Train a simple MLP on MNIST with PyTorch + DDP, as sketched below.
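   A sketch of the smoke test, assuming the bundled `ezpz test` entry point described in the CLI section below is the intended command:

   ```bash
   # distributed smoke test: train a simple MLP on MNIST with DDP
   ezpz test
   ```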
   See the [ezpz test | W&B Report] for sample output and details of metric tracking.
## 🐍 Python Library
At its core, ezpz is a Python library designed to make writing distributed
PyTorch code easy and portable across different hardware backends.
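A minimal sketch of the pattern, using only the functions named in this document (`ezpz.setup_torch()` and `ezpz.get_torch_device()`); the return value assumed here may differ from the actual API, so treat it as illustrative:

```python
import torch
import ezpz

# initialize torch.distributed with the right device + backend combination
# (assumed here to return this process's rank; check the API docs)
rank = ezpz.setup_torch()

# one of {'cuda', 'xpu', 'mps', 'cpu'}, depending on the detected hardware
device = ezpz.get_torch_device()

# from here on, the code is identical on every backend
model = torch.nn.Linear(10, 10).to(device)
x = torch.randn(4, 10, device=device)
print(f"rank={rank}: output sum = {model(x).sum().item():.4f}")
```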
See the 🐍 Python Library docs for more information.
## ✨ Features
- See the Quickstart for a detailed walk-through of `ezpz` features.
- Automatic:
  - Accelerator detection: `ezpz.get_torch_device()`, across {`cuda`, `xpu`, `mps`, `cpu`}
  - Distributed initialization: `ezpz.setup_torch()`, to pick the right device + backend combo
  - Metric handling and utilities for {tracking, recording, plotting}: `ezpz.History()`, with Weights & Biases support (see the sketch below)
  - Integration with native job scheduler(s) (PBS, Slurm), with safe fall-backs when no scheduler is detected
  - Single-process logging with filtering for distributed runs
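A minimal sketch of the metric-tracking pattern; `ezpz.History()` is named in this document, but the `update` method used below is an assumption, so consult the ezpz API docs for the real interface:

```python
import ezpz

# record per-step metrics; with Weights & Biases configured, these can also
# be logged to an active wandb run (the `update` name is assumed, not confirmed)
history = ezpz.History()
for step in range(10):
    history.update({"train/step": step, "train/loss": 1.0 / (step + 1)})
```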
See the Examples docs for ready-to-go examples that can be used as templates or starting points for your own distributed PyTorch workloads!
## 🧰 ezpz: CLI Toolbox
Once installed, ezpz provides a CLI with a few useful utilities to help with
distributed launches and environment validation; explicitly, these are
`ezpz doctor`, `ezpz test`, and `ezpz launch`, each described below.
To see the full list of available commands, ask the CLI itself:
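A sketch, assuming the standard `--help` flag:

```bash
# list the available ezpz subcommands and options
ezpz --help
```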
Check out the 🧰 CLI docs for additional information.
### 🩺 ezpz doctor
Health-check your environment and ensure that ezpz is installed correctly.
It checks MPI, scheduler detection, Torch import + accelerators, and wandb readiness, returning non-zero on errors.
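For example:

```bash
# verify MPI, scheduler detection, torch + accelerators, and wandb readiness;
# exits non-zero if anything is misconfigured
ezpz doctor
```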
See the 🩺 Doctor docs for more information.
### ✅ ezpz test
Run the bundled test suite (great for first-time validation), or try it without installing first; both are sketched below.
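A sketch of both invocations (the `uv run --with` form is one way to layer ezpz onto an existing environment for a single run; the exact incantation may vary):

```bash
# run the bundled test suite from an environment where ezpz is installed
ezpz test

# ... or try it without installing ezpz permanently
# (assumes torch and mpi4py are already available)
uv run --with "git+https://github.com/saforem2/ezpz" ezpz test
```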
See the ✅ Test docs for more information.
### 🚀 ezpz launch
Single entry point for distributed jobs.
ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding
useful environment variables so your script behaves the same on laptops and
clusters.
Add your own args to any command (--config, --batch-size, etc.) and ezpz
will propagate them through the detected launcher.
Use the provided `ezpz launch <cmd>` to automatically launch `<cmd>` across all available[^3]
accelerators.
Use it to launch any of the following (each sketched below):

- Arbitrary command(s)
- An arbitrary Python string
- One of the ready-to-go examples
- Your own distributed training script, e.g. launching `your_app.train` across 16 processes, 8 per node
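A sketch of these invocations, assuming an active job allocation (or a local machine) that ezpz can inspect; the module paths and flags below are placeholders rather than exact names:

```bash
# arbitrary command
ezpz launch hostname

# arbitrary Python string
ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'

# one of the ready-to-go examples (substitute a real module from ezpz.examples.*)
ezpz launch python3 -m ezpz.examples.<example-name>

# your own distributed training script; on a two-node allocation with 8 GPUs
# per node, this runs your_app.train across 16 processes, 8 per node
ezpz launch python3 -m your_app.train --batch-size 64
```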
See the 🚀 Launch docs for more information.
## Ready-to-go Examples
See the Examples docs for complete example scripts covering:
- Train MLP with DDP on MNIST
- Train CNN with FSDP on MNIST
- Train ViT with FSDP on MNIST
- Train Transformer with FSDP and TP on HF Datasets
- Train Diffusion LLM with FSDP on HF Datasets
- Train or Fine-Tune an LLM with FSDP and HF Trainer on HF Datasets
## ⚙️ Environment Variables
Additional configuration can be done through environment variables, including:
- The colorized logging output can be toggled via the `NO_COLOR` environment variable, e.g. to turn off colors (see the sketch below)
- Forcing a specific torch device (useful on GPU hosts when you want CPU-only)
- Changing the plot marker used in the text-based plots
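A sketch of the `NO_COLOR` toggle; only `NO_COLOR` is named in this document, so the variables controlling the forced torch device and the plot marker are left to the ezpz docs rather than guessed at here:

```bash
# disable colorized logging output for a single run
NO_COLOR=1 ezpz test
```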
## More Information
- Examples live under `ezpz.examples.*`; copy them or extend them for your workloads.
- Stuck? Check the docs, or run `ezpz doctor` for actionable hints.
- See my recent talk, *LLMs on Aurora: Hands On with ezpz*, for a detailed walk-through containing examples and use cases.
[^1]: If you don't have `uv` installed, you can install it via the command below. See the uv documentation for more details.
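    The standard one-liner from the uv documentation:

    ```bash
    # install uv (standalone installer)
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```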
[^2]: The https://bit.ly/ezpz-utils URL is just a short link for convenience that actually points to https://raw.githubusercontent.com/saforem2/ezpz/main/src/ezpz/bin/utils.sh
[^3]: By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm). If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:

    - The number of available nodes
    - How many GPUs are present on each of these nodes
    - How many GPUs we have total

    It will then use this information to automatically construct the appropriate {`mpiexec`, `srun`} command to launch and, finally, execute the launch command.
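    As an illustration (the exact flags ezpz emits may differ), on a two-node PBS allocation with 8 GPUs per node the constructed command would look roughly like:

    ```bash
    # 16 processes total, 8 per node (illustrative only)
    mpiexec -n 16 --ppn 8 python3 -m your_app.train
    ```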