# Architecture (WIP)

How the pieces of `ezpz` fit together.
## Goals
- One consistent entrypoint (`ezpz launch`) that adapts to local or scheduler-backed runs.
- Reusable primitives for device discovery, job metadata, and launch command construction.
- Simple hooks for logging/metrics, profiling, and reproducible env setup.
## High-Level Flow
- CLI → Launcher: User calls `ezpz launch`. The CLI parses args/env, detects the scheduler, and prepares a launch command.
- Scheduler Detection: `jobs.py` + `pbs.py`/`slurm.py` inspect hostfiles, env vars, and queue state to derive `NHOSTS`, `NGPU_PER_HOST`, `WORLD_SIZE`, and hostfile paths.
- Command Assembly: `launch.py` builds the final `{mpiexec|srun|mpirun}` invocation, injects the Python executable when needed, and applies log filters for system-specific noise (a rough sketch follows this list).
- Distributed Setup: `dist.py` initializes torch distributed (DDP/TP/PP), handles backend selection (NCCL/XCCL/Gloo fallback), and wires up the rank/world-size environment.
- Runtime/Training: `runtime.py`, `train.py`, and `test_dist.py` demonstrate model construction, optimizer setup, and training loops; `tp/` houses tensor-parallel helpers.
- Logging & History: `history.py`, `log/`, and wandb integration capture metrics, plots, and artifacts; outputs land in `outputs/ezpz.*`.
- Utils & Env: `utils/` contains shell helpers (`bin/utils.sh`), env packaging (`yeet_env.py`, `tar_env.py`), lazy imports, and job env save/load.
- Integrations: `hf_trainer.py`, `integrations.py`, and `cria.py` provide bridges to HF and other runtimes.
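As a rough illustration of the command-assembly step, the sketch below builds a launcher argv from discovered job parameters. The helper name `build_launch_cmd` and the flag spellings are placeholders for illustration, not the actual `launch.py` implementation; real flags vary by launcher and machine.

```python
# Illustrative sketch of launch-command assembly (not the actual launch.py code).
# Flag spellings vary by launcher and site; treat them as placeholders.
import shlex
import sys
from typing import List, Optional


def build_launch_cmd(
    launcher: str,  # "mpiexec", "mpirun", or "srun"
    nhosts: int,
    ngpu_per_host: int,
    hostfile: Optional[str],
    user_cmd: List[str],
) -> List[str]:
    world_size = nhosts * ngpu_per_host
    if launcher == "srun":
        cmd = ["srun", f"--nodes={nhosts}", f"--ntasks={world_size}"]
    else:  # mpiexec / mpirun
        cmd = [launcher, "-n", str(world_size), "--ppn", str(ngpu_per_host)]
        if hostfile:
            cmd += ["--hostfile", hostfile]
    # Inject the current Python interpreter for .py entrypoints.
    if user_cmd and user_cmd[0].endswith(".py"):
        user_cmd = [sys.executable, *user_cmd]
    return [*cmd, *user_cmd]


print(shlex.join(build_launch_cmd("mpiexec", 2, 4, "nodefile.txt", ["train.py", "--epochs", "1"])))
```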
## Key Components (outline)
- `launch.py`: CLI + command builder; scheduler-aware vs. local fallback paths.
- `jobs.py`: shared job metadata helpers; scheduler-neutral layer.
- `pbs.py` / `slurm.py`: scheduler-specific discovery (hostfiles, env vars, GPU counts); see the sketch after this outline.
- `dist.py`: torch distributed bootstrap; device/backend selection; rank/env plumbing.
- `tp/`: tensor-parallel utilities and policies.
- `history.py`: metric logging, plotting, offline/online wandb support.
- `train.py` / `runtime.py`: training orchestration; configurable entrypoints.
- `test_dist.py`: reference distributed workload and smoke test.
- `utils/`: shell helpers, env management (yeet/tar), lazy imports, job env save/load.
- `integrations.py` / `hf_trainer.py` / `cria.py`: ecosystem hooks (HF, custom runners).
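The sketch below illustrates the kind of environment probing the scheduler layer performs. The function names are hypothetical stand-ins for helpers in `jobs.py` / `pbs.py` / `slurm.py`; only the env var names (`PBS_NODEFILE`, `PBS_JOBID`, `SLURM_JOB_ID`) are real scheduler conventions.

```python
# Hypothetical sketch of scheduler detection and host counting.
# The real helpers live in jobs.py / pbs.py / slurm.py; names here are illustrative.
import os
from pathlib import Path
from typing import Optional


def detect_scheduler() -> str:
    """Guess the active scheduler from well-known environment variables."""
    if os.environ.get("PBS_NODEFILE") or os.environ.get("PBS_JOBID"):
        return "pbs"
    if os.environ.get("SLURM_JOB_ID"):
        return "slurm"
    return "unknown"


def count_hosts(hostfile: Optional[str] = None) -> int:
    """Count the unique hosts listed in a PBS/SLURM-style hostfile."""
    path = hostfile or os.environ.get("PBS_NODEFILE", "")
    if not path or not Path(path).is_file():
        raise FileNotFoundError(
            "No hostfile found; set PBS_NODEFILE (or SLURM vars) or pass --hostfile"
        )
    hosts = {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}
    return len(hosts)


if __name__ == "__main__":
    print(f"scheduler={detect_scheduler()}")
```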
## Data & Control Flow (to expand)
- Diagram: CLI → scheduler detection → launch cmd → distributed init → training → logging/output.
```mermaid
flowchart LR
    A["CLI: ezpz launch"] --> B{"Scheduler?"}
    B -->|pbs or slurm| C["Job discovery (jobs.py)"]
    B -->|unknown| D["Local fallback (mpirun)"]
    C --> E["Hostfile + counts (pbs.py or slurm.py)"]
    E --> F["Launch cmd build (launch.py)"]
    D --> F
    F --> G["torch dist init (dist.py)"]
    G --> H["Training/runtime (runtime.py/train.py/test_dist.py)"]
    H --> I["Metrics/logging (history.py/log/ + wandb)"]
    I --> J["Outputs: ezpz outputs + wandb"]
```
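To make the "torch dist init" node concrete, here is a minimal sketch of env-driven initialization, assuming the launcher exported `RANK` and `WORLD_SIZE`. `dist.py` handles considerably more (XCCL, tensor/pipeline-parallel wiring, device selection across vendors); the defaults shown here are illustrative assumptions.

```python
# Minimal sketch of env-driven torch.distributed initialization.
# dist.py covers much more (XCCL, TP/PP wiring); this shows only the core idea.
import os
from typing import Tuple

import torch
import torch.distributed as dist


def init_distributed() -> Tuple[int, int]:
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    # MASTER_ADDR / MASTER_PORT must agree across ranks; assume the launcher
    # exported them, with single-node defaults as a fallback.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"rank {rank} / {world_size} initialized")
    dist.destroy_process_group()
```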
- Config propagation (to expand):
  - CLI args → parsed in `launch.py` / workload modules.
  - Hydra/OmegaConf configs (if used) → merged into runtime/train params (a minimal merge sketch follows this list).
  - Environment-derived settings (`WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) injected before torch init.
  - Workload kwargs flow into model/optimizer/trainer builders.
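If Hydra/OmegaConf is in use, config propagation can be pictured as a simple merge where later sources win. A minimal sketch, assuming OmegaConf is installed; the keys shown are made up:

```python
# Sketch of config propagation with OmegaConf (keys are illustrative).
# Precedence: base config < CLI dotlist overrides.
from omegaconf import OmegaConf

base = OmegaConf.create({"train": {"lr": 1e-3, "epochs": 10}, "backend": "nccl"})
cli = OmegaConf.from_cli()  # e.g. `python train.py train.lr=3e-4 backend=gloo`
cfg = OmegaConf.merge(base, cli)

print(OmegaConf.to_yaml(cfg))
```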
- Error/exit paths (to expand):
  - Scheduler detection fails → fall back to local; warn and suggest `--hostfile`/`--np`/`--ppn`.
  - Missing hostfile → raise with a hint to set `PBS_NODEFILE` or pass `--hostfile`.
  - Backend init fails → retry/fall back to `gloo` (debug); surface driver/module guidance (sketched after the flowchart below).
  - wandb network issues → default to offline or warn; keep outputs local (also sketched below).
```mermaid
flowchart TD
    A["Start launch"] --> B{"Scheduler detected?"}
    B -->|No| C["Warn + local fallback (mpirun)"]
    B -->|Yes| D{"Hostfile exists?"}
    C --> F
    D -->|No| E["Error: set PBS_NODEFILE/SLURM vars or pass --hostfile"]
    D -->|Yes| F["Build launch cmd"]
    F --> G{"Backend init ok?"}
    G -->|No| H["Fallback to gloo; advise driver/module check"]
    G -->|Yes| I["Run workload"]
    H --> I
    I --> J{"wandb online?"}
    J -->|No| K["WANDB_MODE=offline; keep local logs"]
    J -->|Yes| L["Sync to wandb"]
```
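To ground the "fallback to gloo" branch above, here is a minimal sketch (not the actual `dist.py` logic): attempt the preferred backend and, if process-group init raises, retry with `gloo` while surfacing a driver/module hint. It assumes the launcher already exported `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
# Illustrative backend-fallback logic (not the actual dist.py implementation).
# Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are already in the environment.
import os

import torch.distributed as dist


def init_with_fallback(preferred: str = "nccl") -> str:
    """Try the preferred backend; on failure, fall back to gloo for debugging."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    try:
        dist.init_process_group(backend=preferred, rank=rank, world_size=world_size)
        return preferred
    except (RuntimeError, ValueError) as exc:
        print(
            f"{preferred} init failed ({exc}); falling back to gloo -- "
            "check GPU drivers and loaded modules."
        )
        dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
        return "gloo"
```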
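For the wandb branch, a small example of forcing offline mode so metrics still land locally and can be synced later with `wandb sync`; the project name is made up:

```python
# Keep wandb usable without network access by defaulting to offline mode.
import os

import wandb

os.environ.setdefault("WANDB_MODE", "offline")  # honored by wandb.init()

run = wandb.init(project="ezpz-demo", mode=os.environ["WANDB_MODE"])
run.log({"loss": 0.123})
run.finish()
# Later, from a machine with network access:  wandb sync wandb/offline-run-*
```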
## Extensibility Notes (to expand)
- How to add a new scheduler plugin (a hypothetical sketch follows this list).
- How to plug in a new backend or launcher flag.
- How to customize logging/filters and output locations.
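As a starting point for the scheduler-plugin note above, here is one possible, entirely hypothetical shape for a new scheduler module, mirroring the kind of information `pbs.py` / `slurm.py` provide. None of these names (`MYQUEUE_*`, `JobInfo`, `detect`, `get_job_info`) are part of the actual ezpz interface; they only indicate what a plugin needs to answer.

```python
# Hypothetical scheduler-plugin sketch -- NOT the actual ezpz plugin interface.
# A new scheduler module mainly needs to answer three questions:
#   1) am I running under this scheduler?
#   2) where is the hostfile?
#   3) how many hosts / GPUs per host?
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class JobInfo:
    scheduler: str
    hostfile: str
    nhosts: int
    ngpu_per_host: int

    @property
    def world_size(self) -> int:
        return self.nhosts * self.ngpu_per_host


def detect() -> bool:
    """Return True when running under the (hypothetical) 'myqueue' scheduler."""
    return "MYQUEUE_JOB_ID" in os.environ


def get_job_info(ngpu_per_host: int = 4) -> JobInfo:
    hostfile = os.environ.get("MYQUEUE_NODEFILE", "")
    hosts = [h for h in Path(hostfile).read_text().splitlines() if h.strip()]
    return JobInfo("myqueue", hostfile, len(set(hosts)), ngpu_per_host)
```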