Skip to content

πŸš€ ezpz launchΒΆ

Single entry point for distributed jobs.

ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding useful environment variables so your script behaves the same on laptops and clusters.

Add your own args to any command (--config, --batch-size, etc.) and ezpz will propagate them through the detected launcher.

Use the provided

ezpz launch <launch flags> -- <cmd> <cmd flags>

to automatically launch <cmd> across all available1 accelerators.

Use it to launch:

  • Arbitrary command(s):

    ezpz launch hostname
    
  • Arbitrary Python string:

    ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'
    
  • One of the ready-to-go examples:

    ezpz launch python3 -m ezpz.test_dist --profile
    ezpz launch -n 8 -- python3 -m ezpz.examples.fsdp_tp --tp 4
    
  • Your own distributed training script:

    ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train --config configs/your_config.yaml
    

    to launch your_app.train across 16 processes, 8 per node.

πŸ“ Ready-to-go ExamplesΒΆ

See πŸ“ Examples for complete example scripts covering:

  1. Use DDP + MNIST to train a MLP
  2. Use FSDP + MNIST to train a CNN
  3. Use FSDP + MNIST to train a Vision Transformer
  4. Use FSDP + HF Datasets to train a Diffusion Language Model
  5. Use FSDP + HF Datasets + Tensor Parallelism to train a Llama style model
  6. Use FSDP + HF {Datasets + AutoModel + Trainer} to train / fine-tune an LLM

βš™οΈ Execution FlowΒΆ

Two primary control paths drive ezpz launch: a scheduler-aware path used when running inside PBS/SLURM allocations, and a local fallback that shells out to mpirun when no scheduler metadata is available.

Scheduler DetectedΒΆ

sequenceDiagram
    autonumber
    participant User
    participant CLI as ezpz launch
    participant Scheduler
    participant Launch as ezpz.launch.launch
    participant Nodes as Compute Nodes

    User->>CLI: ezpz launch -- python3 -m ezpz.test_dist
    CLI->>Scheduler: get_scheduler()
    Scheduler-->>CLI: "pbs" / "slurm"
    CLI->>Scheduler: get_active_jobid()
    Scheduler-->>CLI: job id
    CLI->>Launch: launch(cmd_to_launch, include_python=False)
    Launch->>Nodes: run_command(mpi/srun ...)
    Nodes-->>Launch: return code
    Launch-->>CLI: status
    CLI-->>User: exit code (0 on success)
Local `mpirun` Fallback
sequenceDiagram
    autonumber
    participant User
    participant CLI as ezpz launch
    participant Scheduler
    participant MPI as mpirun

    User->>CLI: ezpz launch -- python3 -m ezpz.test_dist
    CLI->>Scheduler: get_scheduler()
    Scheduler-->>CLI: "unknown"
    CLI->>Scheduler: get_active_jobid()
    Scheduler-->>CLI: null
    CLI->>MPI: mpirun -np ${WORLD_SIZE:-2} python3 -m ezpz.test_dist
    MPI-->>CLI: return code
    CLI-->>User: exit code

πŸ’€ Deprecated

> Launch python _from_ python. ### πŸ“ Example We provide below multiple (equivalent) commands that can be used to launch [`test_dist.py`](https://github.com/saforem2/ezpz/blob/main/src/ezpz/test_dist.py) across _all_ available GPUs. 1. Directly:
ezpz test
2. Using `ezpz launch` (preferred; `ezpz-launch` is a deprecated shim):
ezpz launch -- python3 -m ezpz.test_dist
3. As a module using `python3 -m`:
python3 -m ezpz.launch -- python3 -m ezpz.test_dist
# or, equivalently:
python3 -m ezpz.launch -- -m ezpz.test_dist
(will automatically insert `python3` before the second `-m`, if needed) #### πŸ“ Example
1
2
3
source <(curl -L https://bit.ly/ezpz-utils) && ezpz_setup_env
python3 -m pip install "git+https://github.com/saforem2/ezpz"
ezpz launch -- -m ezpz.test_dist
This will _launch_ [`ezpz/test_dist.py`](https://github.com/saforem2/ezpz/blob/main/src/ezpz/test_dist.py) across all available resources in your {PBS, Slurm} job.
sequenceDiagram
  participant User
  participant ezpz.launch
  participant Launcher (mpiexec/srun/mpirun)
  participant Distributed Application (ezpz.test_dist)

  User->>ezpz.launch: Executes `python3 -m ezpz.launch -m ezpz.test_dist`
  ezpz.launch->>Launcher (mpiexec/srun/mpirun): Detects environment and builds launch command
  Launcher (mpiexec/srun/mpirun)->>Distributed Application (ezpz.test_dist): Launches distributed training job
  Distributed Application (ezpz.test_dist)->>Distributed Application (ezpz.test_dist): Performs distributed computation
  Distributed Application (ezpz.test_dist)-->>User: Training progress and metrics (via WandB)
-
πŸͺ„ Magic: Explicitly, this will use the default "launcher" depending on availability: - ALCF (PBS Pro): `mpiexec` - Slurm: `srun` - Unknown: `mpirun` and automatically pull in the specifics about the currently active job when building the appropriate `{srun, mpi{exec,run}}` command. - For example, on any of the ALCF systems, it will automatically: - Identify `"${PBS_NODEFILE}"` (by looking at `hostname` of currently active node) - Use this to calculate: - `NHOSTS` - `NGPUS_PER_HOST` - `WORLD_SIZE` `= NGPUS = NHOSTS * NGPUS_PER_HOST` - With this information, we can construct the full `mpiexec ...` command needed to launch our distributed application, e.g.:
1
2
3
python3 -c 'import ezpz.pbs; print(ezpz.pbs.build_launch_cmd())'
# on 2 nodes of Aurora @ ALCF:
# mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/3878985.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 16
##### 🌌 Aurora - Command:
python3 -m ezpz.launch -m ezpz.test_dist
-
Output:
#[🐍 aurora_nre_models_frameworks-2024.2.1_u1](πŸ‘» aurora_nre_models_frameworks-2024.2.1_u1)
#[08:54:56 AM][x4317c7s7b0n0][/flare/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856]
$ python3 -m ezpz.launch -m ezpz.test_dist --tp 4 --pp 3
[2025-04-01 08:55:21,413] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2025-04-01 08:55:29,530] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2025-04-01 08:56:06][I][ezpz/launch:56:__main__] Job ID: 3842171
[2025-04-01 08:56:08][I][ezpz/launch:62:__main__] Node file: /var/spool/pbs/aux/3842171.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-04-01 08:56:08][I][ezpz/launch:72:__main__] Building command to execute from: '{launch_cmd}' + '{python}' + '{cmd_to_launch}'

launch_cmd=mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/3842171.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 16
python=/lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
cmd_to_launch=-m ezpz.test_dist --tp 4 --pp 3

[2025-04-01 08:56:08][I][ezpz/launch:90:__main__] Evaluating:
mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/3842171.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 16 /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 -m ezpz.test_dist --tp 4 --pp 3
Disabling local launch: multi-node application
Connected to tcp://x4317c7s6b0n0.hostmgmt2317.cm.aurora.alcf.anl.gov:7919
Launching application 7ceb32d4-e849-4fc3-ad6d-abcb7bad3494
[2025-04-01 08:56:13,276] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2025-04-01 08:58:40][I][ezpz/dist:557] Using get_torch_device_type()='xpu' with backend='ccl'
[2025-04-01 08:58:45][I][tp/__init__:148:ezpz.tp] TP: 4, PP: 3, CP: 1, DP: 2
[2025-04-01 08:58:45][I][ezpz/dist:873] Using device='xpu' with backend='ddp' + 'ccl' for distributed training.
2025:04:01-08:58:45:(123380) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be pmix (default:hydra)
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 8/23] [pp:2/2][tp:0/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 7/23] [pp:1/2][tp:3/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 4/23] [pp:1/2][tp:0/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 5/23] [pp:1/2][tp:1/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 6/23] [pp:1/2][tp:2/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][10/23] [pp:2/2][tp:2/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 9/23] [pp:2/2][tp:1/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][11/23] [pp:2/2][tp:3/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][20/23] [pp:2/2][tp:0/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][16/23] [pp:1/2][tp:0/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][17/23] [pp:1/2][tp:1/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][18/23] [pp:1/2][tp:2/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][19/23] [pp:1/2][tp:3/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][22/23] [pp:2/2][tp:2/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][23/23] [pp:2/2][tp:3/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][21/23] [pp:2/2][tp:1/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 0/23] [pp:0/2][tp:0/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 2/23] [pp:0/2][tp:2/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 1/23] [pp:0/2][tp:1/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s6b0n0'][ 3/23] [pp:0/2][tp:3/3][dp:0/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][12/23] [pp:0/2][tp:0/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][14/23] [pp:0/2][tp:2/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][13/23] [pp:0/2][tp:1/3][dp:1/1]
[2025-04-01 08:58:45][I][ezpz/dist:923] ['x4317c7s7b0n0'][15/23] [pp:0/2][tp:3/3][dp:1/1]
[2025-04-01 08:58:46][I][ezpz/test_dist:395:__main__] model=
Network(
  (layers): Sequential(
    (0): Linear(in_features=128, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): Linear(in_features=256, out_features=128, bias=True)
    (4): Linear(in_features=128, out_features=128, bias=True)
  )
)
[2025-04-01 08:58:58][I][ezpz/dist:1100] Setting up wandb from rank=0
[2025-04-01 08:58:58][I][ezpz/dist:1101] Using=WB PROJECT=ezpz.test_dist
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: foremans (aurora_gpt) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/wandb/run-20250401_085858-q1ob71v0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run young-brook-1229
wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/ezpz.test_dist
wandb: πŸš€ View run at https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/q1ob71v0
[2025-04-01 08:58:59][I][ezpz/dist:1129] W&B RUN=[young-brook-1229](https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/q1ob71v0)
[2025-04-01 08:58:59][I][ezpz/dist:299] Updating wandb.run: young-brook-1229 config with "DIST_INFO"
[2025-04-01 08:58:59][I][ezpz/dist:1168] Running on machine='Aurora'
[2025-04-01 08:58:59][I][ezpz/test_dist:219:__main__] config:
{
  "backend": "DDP",
  "batch_size": 64,
  "cp": 1,
  "dtype": "bfloat16",
  "input_size": 128,
  "layer_sizes": [
    1024,
    512,
    256,
    128
  ],
  "log_freq": 1,
  "output_size": 128,
  "pp": 3,
  "print_freq": 10,
  "pyinstrument_profiler": false,
  "tp": 4,
  "train_iters": 100,
  "warmup": 2
}
[rank23]:[W reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[2025-04-01 08:59:03][I][ezpz/test_dist:192:__main__] Warmup complete at step 2
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=10 loss=752.000000 dtf=0.000528 dtb=0.001079
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=20 loss=652.000000 dtf=0.000482 dtb=0.001007
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=30 loss=596.000000 dtf=0.000475 dtb=0.001008
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=40 loss=564.000000 dtf=0.000486 dtb=0.000990
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=50 loss=520.000000 dtf=0.000492 dtb=0.000989
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=60 loss=494.000000 dtf=0.000476 dtb=0.001019
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=70 loss=456.000000 dtf=0.000495 dtb=0.000969
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=80 loss=426.000000 dtf=0.000488 dtb=0.000988
[2025-04-01 08:59:03][I][ezpz/test_dist:170:__main__] iter=90 loss=396.000000 dtf=0.000496 dtb=0.000966
[2025-04-01 08:59:03][I][ezpz/history:704] Saving iter plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 08:59:04][I][ezpz/history:704] Saving loss plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 08:59:04][I][ezpz/history:704] Saving dtf plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 08:59:04][I][ezpz/history:704] Saving dtb plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 08:59:04][I][ezpz/history:602] Saving tplots to /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot
                    loss [2025-04-01-085904]
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
1592.0β”€β–Œ                                                   β”‚
      β”‚β–Œ                                                   β”‚
1389.3β”€β–Œ                                                   β”‚
      β”‚β–Œ                                                   β”‚
      β”‚β–š                                                   β”‚
1186.7─▐                                                   β”‚
      │▝▖                                                  β”‚
 984.0─ β–Œ                                                  β”‚
      β”‚ β–š                                                  β”‚
 781.3─  β–Œ                                                 β”‚
      β”‚  β–β–€β–šβ–„β–„                                             β”‚
      β”‚       β–€β–šβ–šβ–žβ–„β–„β––β––                                     β”‚
 578.7─             β–β–β–€β–€β–€β–„β–šβ–„β–„β–„β–„β–– β––                         β”‚
      β”‚                        β–β–€β–β–€β–€β–€β–€β–€β–šβ–„β–„β–„β–„β–„β–„             β”‚
 376.0─                                       β–€β–€β–€β–€β–€β–€β–€β–šβ–„β–„β–„β–„β–„β”‚
      β””β”€β”¬β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
      0 2 6  14 20 25   34 40 47 51 57   67  75  81 87 93
loss                           iter
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/loss.txt
                      dtf [2025-04-01-085905]
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.000671β”€β–Œ                                                 β”‚
        β”‚β–Œ                                 β––               β”‚
0.000637β”€β–Œ   β––                            β–β–Œ               β”‚
        β”‚β–Œ  β–β–Œ                            β–β–Œ     β––         β”‚
        β”‚β–Œ  β–β–Œ    β–—    β–—    β–Ÿ    β–—β–Œ    β––  β–β–Œβ––   β–β–Œ    β–Ÿ    β”‚
0.000604β”€β–Œβ––β––β–β–Œ    β–ˆ    β–ˆ    β–ˆ    β–β–Œ   β–β–Œ  β–β–ˆβ–Œ   β–β–Œ    β–ˆ    β”‚
        β”‚β–ˆβ–ˆβ–Œβ–β–Œ    β–ˆ    β–ˆ    β–ˆ    β–β–Œ   β–β–Œ  β–β–ˆβ–Œ   β–β–Œ    β–ˆ    β”‚
0.000570β”€β–β–œβ–β–β–Œ    β–ˆ    β–ˆ    β–ˆ    β–β–Œ   β–β–Œ  β–β–ˆβ–Œ   β–β–Œ    β–ˆ    β”‚
        β”‚   β–ˆβ–Œβ–žβ–Ÿ  β–ˆ    β–ˆ    β–ˆ    β–β–Œ   β–β–Œβ–—β–Œβ–β–ˆβ–Œ   β–β–Œ    β–ˆ    β”‚
0.000537─   β–ˆβ–Œβ–Œβ–  β–Œβ–Œβ–—β–– β–›β––   β–›β–„  β––β–Œβ–Œ   β–β–šβ–Œβ–Œβ–β–ˆβ–Œ   β–β–š    β–ˆ    β”‚
        β”‚   β–œβ–šβ–Œβ–β–—β–„β–Œβ–™β–˜β–Œβ––β–Œβ–™β–Œβ–— β–Œβ– β–β–Œβ–Œβ–β–—β–œ β–β–β–Œβ–Œβ–Ÿβ–ˆβ–™β–Œβ–Ÿβ–—β–β–β––β–„β–Œ β–Œβ–Œβ––β–žβ–Œβ”‚
        β”‚    β–β–Œβ–β–Œβ–β–Œβ–ˆ β–œβ–Œβ–Œβ–œβ–šβ–›β––β–Œβ–β–žβ–ˆβ–Œβ–Œ β–œβ–β–Ÿβ–  β–ˆ β–ˆβ–β–šβ–œβ–ˆβ– β–œ β–šβ––β–Œβ–œβ–β–Œβ–šβ”‚
0.000503─        β–β–Œβ–  β–šβ–Œ   β–Œβ–Œ  β–β–Œβ–Œ  β–β–ˆβ–  ▝ β–œ   β–œβ–    β–šβ–Œ    β”‚
        β”‚        β–β–Œ   β–β–Œ   β–Œβ–Œ   β–™β–˜   β–β–Ÿ        β–β–ž          β”‚
0.000470─              β–˜   ▝    β–œ     ▝         β–˜          β”‚
        β””β”€β”¬β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
        0 2 6  14   25   34 40 47 51 57   67  75 81 87 93
dtf                             iter
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf.txt
                    dtf [2025-04-01-085905]
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
37.0─           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
30.8─           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
24.7─           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
18.5─           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
12.3─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                           β”‚
 6.2β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
 0.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
    β””β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”˜
  0.000461    0.000516      0.000570     0.000625  0.000679
freq                           dtf
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf-hist.txt
                      dtb [2025-04-01-085905]
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.001541β”€β–Œ                             β––                   β”‚
        β”‚β–Œ                            β–β–Œ                   β”‚
0.001445β”€β–Œ                            β–β–Œ                   β”‚
        β”‚β–Œ                            β–β–Œ                   β”‚
        β”‚β–Œ                            β–β–Œ                   β”‚
0.001349β”€β–Œ                            β–β–Œ                   β”‚
        β”‚β–Œ                            β–β–Œ                   β”‚
0.001253─▐                            β–β–Œ                  β––β”‚
        │▐                            β–β–Œ                 β–β–Œβ”‚
0.001157─▝▖    β–—β–Œ   β––  β–—β–Œ  β–—          β–β–Œ                 β–β–Œβ”‚
        β”‚ β–β–Œ  β–—β–ˆβ–Œ  β–β–Œ  β–β–Œ  β–ˆ        β––β–Ÿβ–β–Œβ––                β–β–Œβ”‚
        β”‚  β–šβ–—β–—β–˜β–œβ–š β–—β–Ÿβ–Œ  β–žβ–šβ–— β–ˆβ–—     β–žβ–Ÿβ–Œβ–ˆβ–β–β–β––               β–β–Œβ”‚
0.001062─   β–˜β–€   β–šβ–Œβ–β–β–€β––β–Œ β–˜β–œβ–ˆβ–Œβ–€β–„β–„β–Ÿβ– β–β–β–œβ–  β–šβ––β–—β–šβ–žβ–„  β–„ β–„β–šβ––β–žβ–„β–žβ–žβ–šβ”‚
        β”‚        β–β–Œ   β–šβ–Œ   β–œβ–Œ   β–β–Œ    β–€   ▐▐   β–œβ– ▀▝ β–Œβ–Œ    β”‚
0.000966─                   β–˜    β–˜         β–œ    β–€    β–β–Œ    β”‚
        β””β”€β”¬β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
        0 2 6  14   25   34 40 47 51 57   67  75 81 87 93
dtb                             iter
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb.txt
                    dtb [2025-04-01-085905]
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
55.0─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
45.8─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
36.7─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
27.5─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
18.3─     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
 9.2β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                           β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
 0.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
    β””β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”˜
  0.00094      0.00110       0.00125      0.00141   0.00157
freq                           dtb
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb-hist.txt
[2025-04-01 08:59:05][I][ezpz/utils:192] Saving dataset to: /lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/outputs/ezpz.test_dist/ezpz.test_dist/train_dataset.h5
[2025-04-01 08:59:05][I][ezpz/test_dist:186:__main__] dataset=<xarray.Dataset> Size: 3kB
Dimensions:  (draw: 97)
Coordinates:
  * draw     (draw) int64 776B 0 1 2 3 4 5 6 7 8 ... 88 89 90 91 92 93 94 95 96
Data variables:
    iter     (draw) int64 776B 3 4 5 6 7 8 9 10 11 ... 92 93 94 95 96 97 98 99
    loss     (draw) float32 388B 1.592e+03 1.232e+03 1.048e+03 ... 388.0 378.0
    dtf      (draw) float64 776B 0.0006705 0.0005739 ... 0.0005295 0.0005092
    dtb      (draw) float64 776B 0.001541 0.001264 0.00117 ... 0.001247 0.001055
[2025-04-01 08:59:05][I][ezpz/test_dist:459:__main__] Took: 24.42 seconds
wandb:
wandb: πŸš€ View run young-brook-1229 at: https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/q1ob71v0
wandb: Find logs at: ../../../../../../../lus/flare/projects/datascience/foremans/projects/saforem2/tmp/2025-04-01-084856/wandb/run-20250401_085858-q1ob71v0/logs
Application 7ceb32d4 resources: utime=853s stime=315s maxrss=2431600KB inblock=19633858 oublock=1032 minflt=6598818 majflt=132990 nvcsw=1389710 nivcsw=5263346
[2025-04-01 08:59:07][I][ezpz/launch:93:__main__] Command took 179.43 seconds to run.
took: 0h:04m:01s
###### 🌠 Polaris - Command:
python3 -m ezpz.launch -m ezpz.test_dist
-
Output:
# (πŸ‘» 2024-04-29)
#[09:22:22 AM][x3006c0s19b0n0][/e/d/f/p/s/t/ezpz][🌱 feat/python-launcher][πŸ“¦βœ“] [⏱️ 58s]
$ python3 -m ezpz.launch -m ezpz.test_dist --tp 2 --pp 2
[2025-04-01 09:22:32,869] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2025-04-01 09:22:37][I][ezpz/launch:56:__main__] Job ID: 4094162
[2025-04-01 09:22:38][I][ezpz/launch:62:__main__] Node file: /var/spool/pbs/aux/4094162.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
[2025-04-01 09:22:38][I][ezpz/launch:72:__main__] Building command to execute from: '{launch_cmd}' + '{python}' + '{cmd_to_launch}'

launch_cmd=mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/4094162.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
python=/lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/venvs/2024-04-29/bin/python3
cmd_to_launch=-m ezpz.test_dist --tp 2 --pp 2

[2025-04-01 09:22:38][I][ezpz/launch:90:__main__] Evaluating:
mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/4094162.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16 /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/venvs/2024-04-29/bin/python3 -m ezpz.test_dist --tp 2 --pp 2
Connected to tcp://x3006c0s19b0n0.hsn.cm.polaris.alcf.anl.gov:7919
Launching application 269d722b-ce74-4fef-92a4-76644aadeccc
Using PMI port 57027,57028
[2025-04-01 09:22:44,418] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:44,418] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:44,418] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:44,419] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:45,292] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:45,292] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:45,292] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-01 09:22:45,292] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2025-04-01 09:22:49][I][tp/__init__:148:ezpz.tp] TP: 2, PP: 2, CP: 1, DP: 2
[2025-04-01 09:22:49][I][ezpz/dist:873] Using device='cuda' with backend='ddp' + 'nccl' for distributed training.
[2025-04-01 09:22:51][I][ezpz/dist:923] ['x3006c0s19b0n0'][3/7] [pp:1/1][tp:1/1][dp:0/1]
[2025-04-01 09:22:51][I][ezpz/dist:923] ['x3006c0s19b0n0'][2/7] [pp:1/1][tp:0/1][dp:0/1]
[2025-04-01 09:22:51][I][ezpz/dist:923] ['x3006c0s1b0n0'][6/7] [pp:1/1][tp:0/1][dp:1/1]
[2025-04-01 09:22:51][I][ezpz/dist:923] ['x3006c0s1b0n0'][7/7] [pp:1/1][tp:1/1][dp:1/1]
[2025-04-01 09:22:51][I][ezpz/dist:923] ['x3006c0s19b0n0'][1/7] [pp:0/1][tp:1/1][dp:0/1]
[2025-04-01 09:22:52][I][ezpz/dist:923] ['x3006c0s1b0n0'][5/7] [pp:0/1][tp:1/1][dp:1/1]
[2025-04-01 09:22:52][I][ezpz/dist:923] ['x3006c0s19b0n0'][0/7] [pp:0/1][tp:0/1][dp:0/1]
[2025-04-01 09:22:52][I][ezpz/dist:923] ['x3006c0s1b0n0'][4/7] [pp:0/1][tp:0/1][dp:1/1]
[2025-04-01 09:22:52][I][ezpz/test_dist:395:__main__] model=
Network(
  (layers): Sequential(
    (0): Linear(in_features=128, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): Linear(in_features=256, out_features=128, bias=True)
    (4): Linear(in_features=128, out_features=128, bias=True)
  )
)
[2025-04-01 09:22:53][I][ezpz/dist:1100] Setting up wandb from rank=0
[2025-04-01 09:22:53][I][ezpz/dist:1101] Using=WB PROJECT=ezpz.test_dist
wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.19.8 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/wandb/run-20250401_092255-7vcfnxnn
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run deep-frog-1232
wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/ezpz.test_dist
wandb: πŸš€ View run at https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/7vcfnxnn
[2025-04-01 09:22:55][I][ezpz/dist:1129] W&B RUN=[deep-frog-1232](https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/7vcfnxnn)
[2025-04-01 09:22:55][I][ezpz/dist:299] Updating wandb.run: deep-frog-1232 config with "DIST_INFO"
[2025-04-01 09:22:56][I][ezpz/dist:1168] Running on machine='Polaris'
[2025-04-01 09:22:56][I][ezpz/test_dist:219:__main__] config:
{
  "backend": "DDP",
  "batch_size": 64,
  "cp": 1,
  "dtype": "bfloat16",
  "input_size": 128,
  "layer_sizes": [
    1024,
    512,
    256,
    128
  ],
  "log_freq": 1,
  "output_size": 128,
  "pp": 2,
  "print_freq": 10,
  "pyinstrument_profiler": false,
  "tp": 2,
  "train_iters": 100,
  "warmup": 2
}
[2025-04-01 09:22:56][I][ezpz/test_dist:192:__main__] Warmup complete at step 2
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=10 loss=724.000000 dtf=0.000386 dtb=0.000711
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=20 loss=652.000000 dtf=0.000325 dtb=0.000742
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=30 loss=600.000000 dtf=0.000327 dtb=0.000713
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=40 loss=568.000000 dtf=0.000334 dtb=0.000705
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=50 loss=544.000000 dtf=0.000340 dtb=0.000660
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=60 loss=506.000000 dtf=0.000325 dtb=0.000650
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=70 loss=468.000000 dtf=0.000320 dtb=0.000665
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=80 loss=434.000000 dtf=0.000316 dtb=0.000709
[2025-04-01 09:22:56][I][ezpz/test_dist:170:__main__] iter=90 loss=420.000000 dtf=0.000317 dtb=0.000694
[2025-04-01 09:22:56][I][ezpz/history:704] Saving iter plot to: /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 09:22:56][I][ezpz/history:704] Saving loss plot to: /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 09:22:57][I][ezpz/history:704] Saving dtf plot to: /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 09:22:57][I][ezpz/history:704] Saving dtb plot to: /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-04-01 09:22:57][I][ezpz/history:602] Saving tplots to /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot
                    loss [2025-04-01-092257]
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
1504.0β”€β–Œ                                                   β”‚
      β”‚β–Œ                                                   β”‚
1317.3β”€β–Œ                                                   β”‚
      β”‚β–Œ                                                   β”‚
      β”‚β–š                                                   β”‚
1130.7─▐                                                   β”‚
      │▐                                                   β”‚
 944.0─ β–Œ                                                  β”‚
      β”‚ ▐                                                  β”‚
 757.3─  β–š                                                 β”‚
      β”‚   β–€β–€β–šβ–„β––                                            β”‚
      β”‚       β–β–€β–€β–€β–„β–šβ–„β–„β––                                    β”‚
 570.7─               β–β–€β–€β–€β–€β–€β–€β–šβ–„β–„β–„β–„β–„β––                       β”‚
      β”‚                            β–β–€β–€β–€β–€β–€β–šβ–„β–„β–„β–„β–„ β––β––         β”‚
 384.0─                                        β–€β–β–β–€β–€β–€β–€β–šβ–„β–€β–„β–„β”‚
      β””β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
      0  4  12 17 23   33 38 44 50 55 61  68 75 80  88 94
loss                           iter
text saved in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/loss.txt
                      dtf [2025-04-01-092257]
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.000508β”€β–—β–Œ                                                β”‚
        β”‚β–β–Œ                                                β”‚
0.000475β”€β–β–Œ                                                β”‚
        β”‚β–β–Œ                                                β”‚
        β”‚β–ˆβ–Œ                                                β”‚
0.000443β”€β–ˆβ–Œ               β––                                β”‚
        β”‚β–β–Œ              β–β–Œ                                β”‚
0.000411─ β–Œ              β–β–Œ                                β”‚
        β”‚ β–Œ              β–β–Œ                                β”‚
0.000379─ β–Œβ–Ÿβ–Ÿ            β–β–Œ                                β”‚
        β”‚ β–β–›β–ˆ            β–β–Œ                                β”‚
        β”‚   ▝▖           β–β–Œ                                β”‚
0.000347─    β–Œ β–—β–Œ    β––   β–Ÿβ–Œ β–—    β––               β––    β–—β–Œ  β–—β”‚
        β”‚    β–β–€β–Ÿβ–šβ–žβ–„β–„β–β–š β–Ÿβ–β–œβ–šβ–—β–€β–„β–Ÿβ–—β–β–β–– β–— β–—β–Œ β––β–— β–„   β–β–š β–— ▗▐▐ β–—β–Œβ”‚
0.000314─           β–˜ β–€ β–˜  β–€  β–˜β–˜β–€ β–β–žβ–˜β–šβ–˜β–β–Ÿβ–β–Œβ–€ β–šβ–žβ–€β–œ β–€β–˜β–€β–€β–Œ β–šβ–Œβ–˜β”‚
        β””β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
        0  4  12   23   33 38 44 50 55 61 68 73 80  88 94
dtf                             iter
text saved in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf.txt
                    dtf [2025-04-01-092257]
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
75.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
62.5β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
50.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
37.5β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
25.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
12.5β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
 0.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
    β””β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”˜
  0.000306    0.000358      0.000411     0.000464  0.000516
freq                           dtf
text saved in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf-hist.txt
                      dtb [2025-04-01-092257]
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.000966─                β–Ÿ                                 β”‚
        β”‚                β–ˆ                                 β”‚
0.000913β”€β–—β–Œ              β–ˆ                                 β”‚
        β”‚β–β–Œ              β–ˆ                                 β”‚
        β”‚β–β–Œ              β–ˆ                                 β”‚
0.000861β”€β–Œβ–Œ              β–ˆ                                 β”‚
        β”‚β–˜β–Œ              β–ˆ                                 β”‚
0.000808─ ▐              β–ˆ                                 β”‚
        β”‚  β–™β–Œ            β–ˆ                                 β”‚
0.000755─  β–β–Œ            β–ˆ  β–—                              β”‚
        β”‚   β–Œ β–Ÿ   β––β–Ÿ     β–ˆ  β–ˆ                    β–Ÿ  β–—β–Ÿ β–—β––  β”‚
        β”‚   β–šβ–šβ–€β––β–—β–Ÿβ–β–œ   β–Ÿ β–ˆ β––β–ˆ                  β–– β–Œβ–Œ β–ˆβ–ˆβ–—β–˜β–Œ β–—β”‚
0.000703─      β–β–˜   β–šβ–€β–ž β–€β–˜β–ˆβ–β–œ β––               β–β–Œβ–žβ–Œβ–β–—β–œβ–›β–ž β–β–™β–˜β”‚
        β”‚                 β–œ β–β–Ÿβ–Œ β–– β–„ β––  β––β––β–—β–Œβ––β–žβ––β–žβ–œ  β–β–ž β–˜   ▝ β”‚
0.000650─                    ▝▝▀▝▀ β–€β–β–€β–œβ–β–β–˜β–β–β–˜β–             β”‚
        β””β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”˜
        0  4  12   23   33 38 44 50 55 61 68 73 80  88 94
dtb                             iter
text saved in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb.txt
                    dtb [2025-04-01-092257]
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
38.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
31.7β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
25.3β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
19.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                           β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
12.7β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
 6.3β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                      β”‚
    β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                      β”‚
 0.0β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚
    β””β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”˜
  0.00064      0.00072       0.00081      0.00089   0.00098
freq                           dtb
text saved in /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb-hist.txt
[2025-04-01 09:22:57][I][ezpz/utils:192] Saving dataset to: /lus/eagle/projects/datascience/foremans/projects/saforem2/tmp/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/train_dataset.h5
[2025-04-01 09:22:57][I][ezpz/test_dist:186:__main__] dataset=<xarray.Dataset> Size: 3kB
Dimensions:  (draw: 97)
Coordinates:
  * draw     (draw) int64 776B 0 1 2 3 4 5 6 7 8 ... 88 89 90 91 92 93 94 95 96
Data variables:
    iter     (draw) int64 776B 3 4 5 6 7 8 9 10 11 ... 92 93 94 95 96 97 98 99
    loss     (draw) float32 388B 1.504e+03 1.144e+03 976.0 ... 396.0 388.0 384.0
    dtf      (draw) float64 776B 0.0004546 0.0004246 ... 0.0003218 0.0003382
    dtb      (draw) float64 776B 0.0008328 0.0008702 ... 0.0006997 0.0007125
[2025-04-01 09:22:57][I][ezpz/test_dist:459:__main__] Took: 9.68 seconds
wandb: \ 0.089 MB of 0.089 MB uploaded
wandb: Run history:
wandb:  dtb β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–
wandb:  dtf β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–
wandb: iter β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–„β–„β–„β–…β–…β–…β–…β–…β–…β–†β–†β–†β–†β–†β–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–ˆβ–ˆ
wandb: loss β–ˆβ–„β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–
wandb:
wandb: Run summary:
wandb:  dtb 0.00071
wandb:  dtf 0.00034
wandb: iter 99
wandb: loss 384.0
wandb:
wandb: πŸš€ View run deep-frog-1232 at: https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/7vcfnxnn
wandb: ⭐️ View project at: https://wandb.ai/aurora_gpt/ezpz.test_dist
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20250401_092255-7vcfnxnn/logs
Application 269d722b resources: utime=90s stime=97s maxrss=2275848KB inblock=8344 oublock=2248 minflt=2426300 majflt=827 nvcsw=640812 nivcsw=350270
[2025-04-01 09:23:07][I][ezpz/launch:93:__main__] Command took 29.55 seconds to run.

real   42.30s
user   11.50s
sys    8.41s

  1. By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
    If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:

    1. The number of available nodes
    2. How many GPUs are present on each of these nodes
    3. How many GPUs we have total

    It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd. β†©