ezpz
Write once, run anywhere
Train across all your {NVIDIA, AMD, Intel, MPS, ...} accelerators, ezpz.

See the ezpz docs for additional information.
Getting Started
- Set up environment[1] (see Shell Environment):
- Install ezpz (see Code Reference / ezpz)
- Launch Python from Python using ezpz-launch (see Launch).

  Examples, launching:

  - Any *.py module (ezpz/test_dist.py, in this example):

    Output:
#[aurora_nre_models_frameworks-2025.0.0](aurora_nre_models_frameworks-2025.0.0)
#[/f/d/f/p/s/ezpz][saforem2/dev] [49s]
#[06/02/25 @ 08:34:27][x4404c4s4b0n0]
; WANDB_MODE=offline ezpz-launch -m ezpz.test_dist --warmup=10 --layer-sizes='256,512,1024,2048,4096,2048,1024,512,256' --dtype=bf16 --train-iters=5000 --print-freq=100 --log-freq=10
[W602 08:39:04.786863061 OperatorEntry.cpp:155] Warning: Warning only once for all operators, other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
[2025-06-02 08:39:11,507270][I][ezpz/__init__:278:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-06-02 08:39:11,510558][I][ezpz/__init__:279:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-06-02 08:39:11,646885][I][ezpz/launch:157] Job ID: 5414072
[2025-06-02 08:39:11,956377][I][ezpz/launch:163] Node file: /var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-06-02 08:39:11,961307][I][ezpz/launch:178] Building command to execute by piecing together:(1.) ['launch_cmd'] + (2.) ['python'] + (3.) ['cmd_to_launch']
[2025-06-02 08:39:11,962039][I][ezpz/launch:182] (1.) ['launch_cmd']: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind=depth --depth=8
[2025-06-02 08:39:11,962616][I][ezpz/launch:183] (2.) ['python']: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2025.0.0/bin/python3
[2025-06-02 08:39:11,963015][I][ezpz/launch:184] (3.) ['cmd_to_launch']: -m ezpz.test_dist
[2025-06-02 08:39:11,963622][I][ezpz/launch:189] Took: 0.45 seconds to build command.
[2025-06-02 08:39:11,963985][I][ezpz/launch:192] Executing: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind=depth --depth=8 /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2025.0.0/bin/python3 -m ezpz.test_dist
[2025-06-02 08:39:11,964786][I][ezpz/launch:119] Filtering for Aurora-specific messages. To view list of filters, run with `EZPZ_LOG_LEVEL=DEBUG`
[2025-06-02 08:39:11,965257][I][ezpz/launch:199] Execution started @ 2025-06-02-083911...
Disabling local launch: multi-node application
Connected to tcp://x4404c4s4b0n0.hostmgmt2404.cm.aurora.alcf.anl.gov:7919
Launching application 09a72a12-de4b-461f-bd7d-d7990dbee665
[2025-06-02 08:39:25,068320][I][ezpz/__init__:278:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-06-02 08:39:25,070671][I][ezpz/__init__:279:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-06-02 08:39:25,075236][I][ezpz/dist:760] Using get_torch_device_type()='xpu' with be='ddp'
[2025-06-02 08:39:25,076000][I][ezpz/dist:573] Initializing process group with rank=0, world_size=24, torch_backend=ccl
2025:06:02-08:39:26:(23179) |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
[2025-06-02 08:39:26,728835][I][ezpz/dist:964] Using device='xpu' with backend='ddp' + 'ccl' for distributed training.
[2025-06-02 08:39:26,729616][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 0/23]
[2025-06-02 08:39:26,728822][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 3/23]
[2025-06-02 08:39:26,728839][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 1/23]
[2025-06-02 08:39:26,728828][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 2/23]
[2025-06-02 08:39:26,728834][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 4/23]
[2025-06-02 08:39:26,728826][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 5/23]
[2025-06-02 08:39:26,728821][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 7/23]
[2025-06-02 08:39:26,728814][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 8/23]
[2025-06-02 08:39:26,728819][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 9/23]
[2025-06-02 08:39:26,728816][I][ezpz/dist:1011] ['x4404c4s4b0n0'][10/23]
[2025-06-02 08:39:26,728815][I][ezpz/dist:1011] ['x4404c4s4b0n0'][11/23]
[2025-06-02 08:39:26,728883][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 6/23]
[2025-06-02 08:39:26,728812][I][ezpz/dist:1011] ['x4404c4s6b0n0'][18/23]
[2025-06-02 08:39:26,728815][I][ezpz/dist:1011] ['x4404c4s6b0n0'][22/23]
[2025-06-02 08:39:26,728829][I][ezpz/dist:1011] ['x4404c4s6b0n0'][12/23]
[2025-06-02 08:39:26,728827][I][ezpz/dist:1011] ['x4404c4s6b0n0'][13/23]
[2025-06-02 08:39:26,728827][I][ezpz/dist:1011] ['x4404c4s6b0n0'][14/23]
[2025-06-02 08:39:26,728833][I][ezpz/dist:1011] ['x4404c4s6b0n0'][15/23]
[2025-06-02 08:39:26,728831][I][ezpz/dist:1011] ['x4404c4s6b0n0'][16/23]
[2025-06-02 08:39:26,728827][I][ezpz/dist:1011] ['x4404c4s6b0n0'][17/23]
[2025-06-02 08:39:26,728812][I][ezpz/dist:1011] ['x4404c4s6b0n0'][19/23]
[2025-06-02 08:39:26,728811][I][ezpz/dist:1011] ['x4404c4s6b0n0'][20/23]
[2025-06-02 08:39:26,731907][I][ezpz/test_dist:468:__main__] Took: 1.66 seconds to setup torch
[2025-06-02 08:39:26,728812][I][ezpz/dist:1011] ['x4404c4s6b0n0'][21/23]
[2025-06-02 08:39:26,728813][I][ezpz/dist:1011] ['x4404c4s6b0n0'][23/23]
[2025-06-02 08:39:26,748088][I][ezpz/test_dist:218:__main__] Model size: 837632 parameters
[2025-06-02 08:39:26,750571][I][ezpz/test_dist:220:__main__]
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
SequentialLinearNet                      --
├─Sequential: 1-1                        837,632
=================================================================
Total params: 837,632
Trainable params: 837,632
Non-trainable params: 0
=================================================================
[2025-06-02 08:39:26,751974][I][ezpz/test_dist:226:__main__] Took: 0.011442308983532712 seconds to build model
[2025-06-02 08:39:26,756362][I][ezpz/test_dist:406:__main__] model=
SequentialLinearNet(
  (layers): Sequential(
    (0): Linear(in_features=128, out_features=1024, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1024, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=128, bias=True)
    (7): ReLU()
    (8): Linear(in_features=128, out_features=128, bias=True)
  )
)
[2025-06-02 08:39:37,687236][I][ezpz/test_dist:230:__main__] Took: 10.94 seconds to build optimizer
[2025-06-02 08:39:37,700439][I][ezpz/dist:1222] Setting up wandb from rank=0
[2025-06-02 08:39:37,701214][I][ezpz/dist:1223] Using WB_PROJECT=ezpz.test_dist
wandb: Tracking run with wandb version 0.19.10
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING URL not available in offline run
[2025-06-02 08:39:38,357037][I][ezpz/dist:1249] wandb.run=[None](None)
[2025-06-02 08:39:38,363539][I][ezpz/dist:1285] Running on machine='Aurora'
[2025-06-02 08:39:38,368294][I][ezpz/test_dist:233:__main__] Took: 0.68 seconds to build trainer
[2025-06-02 08:39:38,368985][I][ezpz/test_dist:235:__main__] config:
{
  "backend": "DDP",
  "batch_size": 64,
  "cp": 1,
  "dtype": "bfloat16",
  "input_size": 128,
  "layer_sizes": [
    1024,
    512,
    256,
    128
  ],
  "log_freq": 1,
  "output_size": 128,
  "pp": 1,
  "print_freq": 10,
  "pyinstrument_profiler": false,
  "tp": 1,
  "train_iters": 100,
  "warmup": 2
}
[2025-06-02 08:39:38,370322][I][ezpz/test_dist:237:__main__] Took: 13.30 to get here.
[2025-06-02 08:39:38,794611][I][ezpz/test_dist:196:__main__] Warmup complete at step 2
[2025-06-02 08:39:38,813169][I][ezpz/test_dist:174:__main__] iter=10 loss=904.000000 dtf=0.000644 dtb=0.001260
[2025-06-02 08:39:38,835905][I][ezpz/test_dist:174:__main__] iter=20 loss=712.000000 dtf=0.000610 dtb=0.001283
[2025-06-02 08:39:38,858533][I][ezpz/test_dist:174:__main__] iter=30 loss=704.000000 dtf=0.000608 dtb=0.001252
[2025-06-02 08:39:38,880929][I][ezpz/test_dist:174:__main__] iter=40 loss=684.000000 dtf=0.000607 dtb=0.001315
[2025-06-02 08:39:38,903701][I][ezpz/test_dist:174:__main__] iter=50 loss=684.000000 dtf=0.000579 dtb=0.001247
[2025-06-02 08:39:38,926119][I][ezpz/test_dist:174:__main__] iter=60 loss=676.000000 dtf=0.000597 dtb=0.001234
[2025-06-02 08:39:38,948978][I][ezpz/test_dist:174:__main__] iter=70 loss=664.000000 dtf=0.000603 dtb=0.001242
[2025-06-02 08:39:38,971256][I][ezpz/test_dist:174:__main__] iter=80 loss=672.000000 dtf=0.000599 dtb=0.001240
[2025-06-02 08:39:38,993829][I][ezpz/test_dist:174:__main__] iter=90 loss=672.000000 dtf=0.000615 dtb=0.001249
[2025-06-02 08:39:40,390558][I][ezpz/history:721] Saving iter plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-06-02 08:39:40,653794][I][ezpz/history:721] Saving loss plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-06-02 08:39:40,894262][I][ezpz/history:721] Saving dtf plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-06-02 08:39:41,191474][I][ezpz/history:721] Saving dtb plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/mplot
[2025-06-02 08:39:41,377999][I][ezpz/history:618] Saving tplots to /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot
loss [2025-06-02-083941]
[terminal plot: loss vs. iter; y-axis 660 to 2448, x-axis 0 to 96]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/loss.txt
dtf [2025-06-02-083941]
[terminal plot: dtf vs. iter; y-axis 0.000571 to 0.000805, x-axis 0 to 96]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf.txt
dtf [2025-06-02-083941]
[terminal histogram: freq vs. dtf; y-axis 0.0 to 52.0, x-axis 0.000560 to 0.000815]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtf-hist.txt
dtb [2025-06-02-083941]
[terminal plot: dtb vs. iter; y-axis 0.001218 to 0.001447, x-axis 0 to 96]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb.txt
dtb [2025-06-02-083941]
[terminal histogram: freq vs. dtb; y-axis 0.0 to 38.0, x-axis 0.001208 to 0.001457]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/ezpz.test_dist/ezpz.test_dist/plots/tplot/dtb-hist.txt
[2025-06-02 08:39:41,427412][I][ezpz/test_dist:190:__main__] dataset=<xarray.Dataset> Size: 3kB
Dimensions:  (draw: 97)
Coordinates:
  * draw     (draw) int64 776B 0 1 2 3 4 5 6 7 8 ... 88 89 90 91 92 93 94 95 96
Data variables:
    iter     (draw) int64 776B 3 4 5 6 7 8 9 10 11 ... 92 93 94 95 96 97 98 99
    loss     (draw) float32 388B 2.448e+03 2.112e+03 1.664e+03 ... 672.0 688.0
    dtf      (draw) float64 776B 0.0007564 0.0006201 ... 0.0006089 0.0006102
    dtb      (draw) float64 776B 0.001315 0.001286 ... 0.001238 0.001236
[2025-06-02 08:39:41,429616][I][ezpz/test_dist:241:__main__] Took: 3.06 seconds to finish training
[2025-06-02 08:39:41,430364][I][ezpz/test_dist:476:__main__] Took: 16.36 seconds
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/wandb/offline-run-20250602_083937-57itor57
wandb: Find logs at: ../../../../../../lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/wandb/offline-run-20250602_083937-57itor57/logs
Application 09a72a12 resources: utime=853s stime=186s maxrss=3932628KB inblock=749276 oublock=904 minflt=11280849 majflt=42365 nvcsw=380342 nivcsw=3251786
[2025-06-02 08:39:44,095734][I][ezpz/launch:201] Execution finished @ 2025-06-02-083944
[2025-06-02 08:39:44,096767][I][ezpz/launch:202] Command took 32.13 seconds to run. Exiting.
took: 0h:00m:43s
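A few details in the log above can be reproduced by hand. The sketch below is not ezpz's implementation; the function names (`build_launch_command`, `host_index`, `linear_param_count`) and the `hostfile` / `python3` placeholders are hypothetical and chosen only for illustration. It assumes that `--np` is the number of hosts times `--ppn` (24 = 2 × 12 in the log), that ranks are placed on hosts in contiguous blocks (ranks 0 to 11 on the first host, 12 to 23 on the second, as the log shows), and that every layer is a plain `Linear` layer with a bias.

```python
import shlex


def build_launch_command(num_hosts, ranks_per_host, hostfile, python, module, depth=8):
    """Piece together (1.) launch_cmd + (2.) python + (3.) cmd_to_launch.

    Hypothetical helper mirroring the three parts printed in the log.
    """
    launch_cmd = [
        "mpiexec", "--verbose", "--envall",
        f"--np={num_hosts * ranks_per_host}",  # assumption: np = hosts * ppn
        f"--ppn={ranks_per_host}",
        f"--hostfile={hostfile}",
        "--cpu-bind=depth", f"--depth={depth}",
    ]
    cmd_to_launch = ["-m", module]
    return launch_cmd + [python] + cmd_to_launch


def host_index(rank, ranks_per_host):
    """Block placement (assumed): the first ppn ranks land on host 0, and so on."""
    return rank // ranks_per_host


def linear_param_count(sizes):
    """Weights plus biases for a chain of Linear layers with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Placeholder hostfile and interpreter; the real paths appear in the log.
cmd = build_launch_command(2, 12, "hostfile", "python3", "ezpz.test_dist")
print(shlex.join(cmd))

# Ranks 0 and 11 share the first host; ranks 12 and 23 share the second.
print([host_index(r, 12) for r in (0, 11, 12, 23)])  # [0, 0, 1, 1]

# Layer widths read off the printed model: 128 -> 1024 -> 512 -> 256 -> 128 -> 128.
print(linear_param_count([128, 1024, 512, 256, 128, 128]))  # 837632
```

The last value matches the `Model size: 837632 parameters` line, which is a quick way to confirm that the run used the default layer sizes shown in the printed config.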
  - Arbitrary Python string:

    Output:
  - Minimal example [ezpz / examples / minimal.py]:

    Output:
#[aurora_nre_models_frameworks-2025.0.0](aurora_nre_models_frameworks-2025.0.0)
#[/f/d/f/p/s/ezpz][saforem2/dev] [58s]
#[06/02/25 @ 08:24:30][x4404c4s4b0n0]
; WANDB_MODE=offline PRINT_ITERS=100 TRAIN_ITERS=1000 ezpz-launch -m ezpz.examples.minimal
[W602 08:24:33.632744487 OperatorEntry.cpp:155] Warning: Warning only once for all operators, other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
[2025-06-02 08:24:40,394556][I][ezpz/__init__:278:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-06-02 08:24:40,397025][I][ezpz/__init__:279:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-06-02 08:24:40,546683][I][ezpz/launch:157] Job ID: 5414072
[2025-06-02 08:24:40,862126][I][ezpz/launch:163] Node file: /var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-06-02 08:24:40,867464][I][ezpz/launch:178] Building command to execute by piecing together:(1.) ['launch_cmd'] + (2.) ['python'] + (3.) ['cmd_to_launch']
[2025-06-02 08:24:40,868229][I][ezpz/launch:182] (1.) ['launch_cmd']: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind=depth --depth=8
[2025-06-02 08:24:40,868796][I][ezpz/launch:183] (2.) ['python']: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2025.0.0/bin/python3
[2025-06-02 08:24:40,869195][I][ezpz/launch:184] (3.) ['cmd_to_launch']: -m ezpz.examples.minimal
[2025-06-02 08:24:40,869807][I][ezpz/launch:189] Took: 0.47 seconds to build command.
[2025-06-02 08:24:40,870158][I][ezpz/launch:192] Executing: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/5414072.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind=depth --depth=8 /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2025.0.0/bin/python3 -m ezpz.examples.minimal
[2025-06-02 08:24:40,871013][I][ezpz/launch:119] Filtering for Aurora-specific messages. To view list of filters, run with `EZPZ_LOG_LEVEL=DEBUG`
[2025-06-02 08:24:40,871479][I][ezpz/launch:199] Execution started @ 2025-06-02-082440...
Disabling local launch: multi-node application
Connected to tcp://x4404c4s4b0n0.hostmgmt2404.cm.aurora.alcf.anl.gov:7919
Launching application 51803e72-8555-4056-b49e-4aa9ffb3b099
[2025-06-02 08:24:54,200723][I][ezpz/__init__:278:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
[2025-06-02 08:24:54,203301][I][ezpz/__init__:279:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-06-02 08:24:54,206944][I][ezpz/dist:760] Using get_torch_device_type()='xpu' with be='ddp'
[2025-06-02 08:24:54,207778][I][ezpz/dist:573] Initializing process group with rank=0, world_size=24, torch_backend=ccl
2025:06:02-08:24:55:(17665) |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
[2025-06-02 08:24:55,942022][I][ezpz/dist:964] Using device='xpu' with backend='ddp' + 'ccl' for distributed training.
[2025-06-02 08:24:55,942738][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 0/23]
[2025-06-02 08:24:55,941993][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 3/23]
[2025-06-02 08:24:55,942007][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 1/23]
[2025-06-02 08:24:55,942013][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 2/23]
[2025-06-02 08:24:55,942019][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 4/23]
[2025-06-02 08:24:55,942013][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 5/23]
[2025-06-02 08:24:55,941989][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 8/23]
[2025-06-02 08:24:55,942001][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 6/23]
[2025-06-02 08:24:55,941994][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 7/23]
[2025-06-02 08:24:55,941995][I][ezpz/dist:1011] ['x4404c4s4b0n0'][10/23]
[2025-06-02 08:24:55,941990][I][ezpz/dist:1011] ['x4404c4s4b0n0'][11/23]
[2025-06-02 08:24:55,942003][I][ezpz/dist:1011] ['x4404c4s4b0n0'][ 9/23]
[2025-06-02 08:24:55,942096][I][ezpz/dist:1011] ['x4404c4s6b0n0'][12/23]
[2025-06-02 08:24:55,942095][I][ezpz/dist:1011] ['x4404c4s6b0n0'][13/23]
[2025-06-02 08:24:55,942101][I][ezpz/dist:1011] ['x4404c4s6b0n0'][14/23]
[2025-06-02 08:24:55,942096][I][ezpz/dist:1011] ['x4404c4s6b0n0'][15/23]
[2025-06-02 08:24:55,942092][I][ezpz/dist:1011] ['x4404c4s6b0n0'][16/23]
[2025-06-02 08:24:55,942097][I][ezpz/dist:1011] ['x4404c4s6b0n0'][17/23]
[2025-06-02 08:24:55,942091][I][ezpz/dist:1011] ['x4404c4s6b0n0'][18/23]
[2025-06-02 08:24:55,942073][I][ezpz/dist:1011] ['x4404c4s6b0n0'][19/23]
[2025-06-02 08:24:55,942076][I][ezpz/dist:1011] ['x4404c4s6b0n0'][20/23]
[2025-06-02 08:24:55,942080][I][ezpz/dist:1011] ['x4404c4s6b0n0'][21/23]
[2025-06-02 08:24:55,945053][I][ezpz/dist:1222] Setting up wandb from rank=0
[2025-06-02 08:24:55,942081][I][ezpz/dist:1011] ['x4404c4s6b0n0'][22/23]
[2025-06-02 08:24:55,942072][I][ezpz/dist:1011] ['x4404c4s6b0n0'][23/23]
[2025-06-02 08:24:55,945440][I][ezpz/dist:1223] Using WB_PROJECT=ezpz.examples.minimal
wandb: Tracking run with wandb version 0.19.10
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING URL not available in offline run
[2025-06-02 08:24:56,605530][I][ezpz/dist:1249] wandb.run=[None](None)
[2025-06-02 08:24:56,611884][I][ezpz/dist:1285] Running on machine='Aurora'
[2025-06-02 08:24:56,655910][I][examples/minimal:88:__main__] model=SequentialLinearNet(
  (layers): Sequential(
    (0): Linear(in_features=128, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=1024, bias=True)
    (5): ReLU()
    (6): Linear(in_features=1024, out_features=2048, bias=True)
    (7): ReLU()
    (8): Linear(in_features=2048, out_features=1024, bias=True)
    (9): ReLU()
    (10): Linear(in_features=1024, out_features=512, bias=True)
    (11): ReLU()
    (12): Linear(in_features=512, out_features=256, bias=True)
    (13): ReLU()
    (14): Linear(in_features=256, out_features=128, bias=True)
    (15): ReLU()
    (16): Linear(in_features=128, out_features=128, bias=True)
  )
)
[2025-06-02 08:25:07,566410][I][ezpz/dist:144] `setup` took: dt=13.3595s
[2025-06-02 08:25:08,196630][I][examples/minimal:51:__main__] iter=20 loss=713.134399 dt=0.005150 dtf=0.001118 dtb=0.004031
[2025-06-02 08:25:08,254359][I][examples/minimal:51:__main__] iter=30 loss=698.142334 dt=0.005140 dtf=0.001098 dtb=0.004042
[2025-06-02 08:25:08,311676][I][examples/minimal:51:__main__] iter=40 loss=688.149536 dt=0.005088 dtf=0.001100 dtb=0.003988
[2025-06-02 08:25:08,369744][I][examples/minimal:51:__main__] iter=50 loss=685.806091 dt=0.005097 dtf=0.001088 dtb=0.004009
[2025-06-02 08:25:08,427011][I][examples/minimal:51:__main__] iter=60 loss=689.389771 dt=0.005140 dtf=0.001099 dtb=0.004041
[2025-06-02 08:25:08,484186][I][examples/minimal:51:__main__] iter=70 loss=695.363220 dt=0.005125 dtf=0.001111 dtb=0.004014
[2025-06-02 08:25:08,541436][I][examples/minimal:51:__main__] iter=80 loss=667.858032 dt=0.005074 dtf=0.001092 dtb=0.003982
[2025-06-02 08:25:08,598606][I][examples/minimal:51:__main__] iter=90 loss=676.533142 dt=0.005130 dtf=0.001084 dtb=0.004046
[2025-06-02 08:25:08,656182][I][examples/minimal:51:__main__] iter=100 loss=676.170593 dt=0.005510 dtf=0.001399 dtb=0.004111
[2025-06-02 08:25:08,713804][I][examples/minimal:51:__main__] iter=110 loss=676.684814 dt=0.005106 dtf=0.001093 dtb=0.004013
[2025-06-02 08:25:08,773811][I][examples/minimal:51:__main__] iter=120 loss=682.333984 dt=0.005353 dtf=0.001093 dtb=0.004260
[2025-06-02 08:25:08,832594][I][examples/minimal:51:__main__] iter=130 loss=691.218079 dt=0.005333 dtf=0.001119 dtb=0.004214
[2025-06-02 08:25:08,891644][I][examples/minimal:51:__main__] iter=140 loss=686.254883 dt=0.005318 dtf=0.001096 dtb=0.004223
[2025-06-02 08:25:08,950476][I][examples/minimal:51:__main__] iter=150 loss=671.173218 dt=0.005462 dtf=0.001090 dtb=0.004372
[2025-06-02 08:25:09,009324][I][examples/minimal:51:__main__] iter=160 loss=675.119751 dt=0.005372 dtf=0.001095 dtb=0.004277
[2025-06-02 08:25:09,068117][I][examples/minimal:51:__main__] iter=170 loss=681.518127 dt=0.005401 dtf=0.001101 dtb=0.004299
[2025-06-02 08:25:09,129145][I][examples/minimal:51:__main__] iter=180 loss=681.293335 dt=0.005290 dtf=0.001100 dtb=0.004189
[2025-06-02 08:25:09,188790][I][examples/minimal:51:__main__] iter=190 loss=673.555298 dt=0.006316 dtf=0.001088 dtb=0.005228
[2025-06-02 08:25:09,248623][I][examples/minimal:51:__main__] iter=200 loss=686.017700 dt=0.005552 dtf=0.001355 dtb=0.004196
[2025-06-02 08:25:09,307659][I][examples/minimal:51:__main__] iter=210 loss=693.399170 dt=0.005361 dtf=0.001096 dtb=0.004265
[2025-06-02 08:25:09,366454][I][examples/minimal:51:__main__] iter=220 loss=687.048462 dt=0.005304 dtf=0.001083 dtb=0.004222
[2025-06-02 08:25:09,425278][I][examples/minimal:51:__main__] iter=230 loss=683.272217 dt=0.005334 dtf=0.001091 dtb=0.004242
[2025-06-02 08:25:09,484085][I][examples/minimal:51:__main__] iter=240 loss=686.674561 dt=0.005240 dtf=0.001100 dtb=0.004140
[2025-06-02 08:25:09,542500][I][examples/minimal:51:__main__] iter=250 loss=686.590210 dt=0.005419 dtf=0.001090 dtb=0.004330
[2025-06-02 08:25:09,601444][I][examples/minimal:51:__main__] iter=260 loss=685.613770 dt=0.005404 dtf=0.000970 dtb=0.004434
[2025-06-02 08:25:09,660262][I][examples/minimal:51:__main__] iter=270 loss=678.604309 dt=0.005277 dtf=0.000975 dtb=0.004302
[2025-06-02 08:25:09,718685][I][examples/minimal:51:__main__] iter=280 loss=687.360474 dt=0.005371 dtf=0.000978 dtb=0.004393
[2025-06-02 08:25:09,777952][I][examples/minimal:51:__main__] iter=290 loss=672.192383 dt=0.005500 dtf=0.000973 dtb=0.004527
[2025-06-02 08:25:09,836219][I][examples/minimal:51:__main__] iter=300 loss=670.950562 dt=0.005342 dtf=0.001353 dtb=0.003989
[2025-06-02 08:25:09,894611][I][examples/minimal:51:__main__] iter=310 loss=681.033447 dt=0.005213 dtf=0.001068 dtb=0.004145
[2025-06-02 08:25:09,952968][I][examples/minimal:51:__main__] iter=320 loss=678.913208 dt=0.005336 dtf=0.000975 dtb=0.004361
[2025-06-02 08:25:10,011736][I][examples/minimal:51:__main__] iter=330 loss=678.553772 dt=0.005430 dtf=0.001081 dtb=0.004349
[2025-06-02 08:25:10,070662][I][examples/minimal:51:__main__] iter=340 loss=688.489014 dt=0.005390 dtf=0.001087 dtb=0.004303
[2025-06-02 08:25:10,129419][I][examples/minimal:51:__main__] iter=350 loss=680.676147 dt=0.005368 dtf=0.000978 dtb=0.004390
[2025-06-02 08:25:10,187801][I][examples/minimal:51:__main__] iter=360 loss=696.601196 dt=0.005339 dtf=0.001079 dtb=0.004261
[2025-06-02 08:25:10,246699][I][examples/minimal:51:__main__] iter=370 loss=685.925903 dt=0.005347 dtf=0.001099 dtb=0.004248
[2025-06-02 08:25:10,305350][I][examples/minimal:51:__main__] iter=380 loss=681.857178 dt=0.005277 dtf=0.001088 dtb=0.004188
[2025-06-02 08:25:10,364235][I][examples/minimal:51:__main__] iter=390 loss=677.403076 dt=0.005545 dtf=0.001099 dtb=0.004445
[2025-06-02 08:25:10,423312][I][examples/minimal:51:__main__] iter=400 loss=680.605286 dt=0.005513 dtf=0.001338 dtb=0.004175
[2025-06-02 08:25:10,482306][I][examples/minimal:51:__main__] iter=410 loss=688.305176 dt=0.005358 dtf=0.001094 dtb=0.004264
[2025-06-02 08:25:10,541514][I][examples/minimal:51:__main__] iter=420 loss=676.714600 dt=0.005456 dtf=0.001107 dtb=0.004349
[2025-06-02 08:25:10,600146][I][examples/minimal:51:__main__] iter=430 loss=674.251648 dt=0.005348 dtf=0.001116 dtb=0.004232
[2025-06-02 08:25:10,659099][I][examples/minimal:51:__main__] iter=440 loss=692.857361 dt=0.005285 dtf=0.001091 dtb=0.004194
[2025-06-02 08:25:10,718127][I][examples/minimal:51:__main__] iter=450 loss=683.334229 dt=0.005442 dtf=0.001094 dtb=0.004348
[2025-06-02 08:25:10,776750][I][examples/minimal:51:__main__] iter=460 loss=1509.692139 dt=0.005363 dtf=0.001114 dtb=0.004248
[2025-06-02 08:25:10,836261][I][examples/minimal:51:__main__] iter=470 loss=943.557617 dt=0.005265 dtf=0.001108 dtb=0.004157
[2025-06-02 08:25:10,895405][I][examples/minimal:51:__main__] iter=480 loss=704.171509 dt=0.005319 dtf=0.001079 dtb=0.004240
[2025-06-02 08:25:10,954483][I][examples/minimal:51:__main__] iter=490 loss=683.428223 dt=0.005526 dtf=0.001086 dtb=0.004440
[2025-06-02 08:25:11,013286][I][examples/minimal:51:__main__] iter=500 loss=687.314941 dt=0.005473 dtf=0.001332 dtb=0.004141
[2025-06-02 08:25:11,080691][I][examples/minimal:51:__main__] iter=510 loss=688.060669 dt=0.005363 dtf=0.001113 dtb=0.004250
[2025-06-02 08:25:11,139480][I][examples/minimal:51:__main__] iter=520 loss=686.497314 dt=0.005267 dtf=0.001083 dtb=0.004184
[2025-06-02 08:25:11,198098][I][examples/minimal:51:__main__] iter=530 loss=691.718445 dt=0.005295 dtf=0.001086 dtb=0.004208
[2025-06-02 08:25:11,256868][I][examples/minimal:51:__main__] iter=540 loss=681.122681 dt=0.005295 dtf=0.001104 dtb=0.004191
[2025-06-02 08:25:11,315729][I][examples/minimal:51:__main__] iter=550 loss=683.272705 dt=0.005441 dtf=0.001081 dtb=0.004360
[2025-06-02 08:25:11,374406][I][examples/minimal:51:__main__] iter=560 loss=688.077271 dt=0.005318 dtf=0.001093 dtb=0.004225
[2025-06-02 08:25:11,433181][I][examples/minimal:51:__main__] iter=570 loss=683.032715 dt=0.005285 dtf=0.001099 dtb=0.004186
[2025-06-02 08:25:11,491905][I][examples/minimal:51:__main__] iter=580 loss=686.191040 dt=0.005301 dtf=0.001089 dtb=0.004212
[2025-06-02 08:25:11,550809][I][examples/minimal:51:__main__] iter=590 loss=691.924744 dt=0.005503 dtf=0.001088 dtb=0.004415
[2025-06-02 08:25:11,609581][I][examples/minimal:51:__main__] iter=600 loss=681.312744 dt=0.005478 dtf=0.001338 dtb=0.004140
[2025-06-02 08:25:11,668293][I][examples/minimal:51:__main__] iter=610 loss=680.253540 dt=0.005360 dtf=0.001120 dtb=0.004240
[2025-06-02 08:25:11,726991][I][examples/minimal:51:__main__] iter=620 loss=683.039673 dt=0.005297 dtf=0.001090 dtb=0.004207
[2025-06-02 08:25:11,785960][I][examples/minimal:51:__main__] iter=630 loss=679.695679 dt=0.005319 dtf=0.001080 dtb=0.004239
[2025-06-02 08:25:11,845069][I][examples/minimal:51:__main__] iter=640 loss=686.198608 dt=0.005340 dtf=0.001108 dtb=0.004233
[2025-06-02 08:25:11,903999][I][examples/minimal:51:__main__] iter=650 loss=683.652954 dt=0.005456 dtf=0.001089 dtb=0.004367
[2025-06-02 08:25:11,962543][I][examples/minimal:51:__main__] iter=660 loss=686.860229 dt=0.005316 dtf=0.001086 dtb=0.004229
[2025-06-02 08:25:12,021274][I][examples/minimal:51:__main__] iter=670 loss=680.933960 dt=0.005314 dtf=0.001097 dtb=0.004217
[2025-06-02 08:25:12,079889][I][examples/minimal:51:__main__] iter=680 loss=679.905151 dt=0.005319 dtf=0.001089 dtb=0.004230
[2025-06-02 08:25:12,138620][I][examples/minimal:51:__main__] iter=690 loss=682.389832 dt=0.005544 dtf=0.000994 dtb=0.004550
[2025-06-02 08:25:12,196877][I][examples/minimal:51:__main__] iter=700 loss=686.506714 dt=0.005393 dtf=0.001366 dtb=0.004027
[2025-06-02 08:25:12,255083][I][examples/minimal:51:__main__] iter=710 loss=690.196533 dt=0.005322 dtf=0.001087 dtb=0.004235
[2025-06-02 08:25:12,313749][I][examples/minimal:51:__main__] iter=720 loss=678.437134 dt=0.005271 dtf=0.001083 dtb=0.004188
[2025-06-02 08:25:12,372685][I][examples/minimal:51:__main__] iter=730 loss=682.770264 dt=0.005329 dtf=0.001116 dtb=0.004212
[2025-06-02 08:25:12,431392][I][examples/minimal:51:__main__] iter=740 loss=688.560852 dt=0.005218 dtf=0.001016 dtb=0.004203
[2025-06-02 08:25:12,489897][I][examples/minimal:51:__main__] iter=750 loss=687.129883 dt=0.005418 dtf=0.001091 dtb=0.004327
[2025-06-02 08:25:12,548527][I][examples/minimal:51:__main__] iter=760 loss=684.507507 dt=0.005340 dtf=0.001128 dtb=0.004211
[2025-06-02 08:25:12,607235][I][examples/minimal:51:__main__] iter=770 loss=674.559021 dt=0.005275 dtf=0.001087 dtb=0.004188
[2025-06-02 08:25:12,666059][I][examples/minimal:51:__main__] iter=780 loss=690.597290 dt=0.005311 dtf=0.001068 dtb=0.004243
[2025-06-02 08:25:12,724778][I][examples/minimal:51:__main__] iter=790 loss=675.396240 dt=0.005521 dtf=0.001100 dtb=0.004422
[2025-06-02 08:25:12,783613][I][examples/minimal:51:__main__] iter=800 loss=673.097961 dt=0.005453 dtf=0.001320 dtb=0.004134
[2025-06-02 08:25:12,842443][I][examples/minimal:51:__main__] iter=810 loss=679.685730 dt=0.005444 dtf=0.001118 dtb=0.004326
[2025-06-02 08:25:12,901496][I][examples/minimal:51:__main__] iter=820 loss=673.053711 dt=0.005300 dtf=0.001088 dtb=0.004212
[2025-06-02 08:25:12,960154][I][examples/minimal:51:__main__] iter=830 loss=680.830994 dt=0.005351 dtf=0.001112 dtb=0.004239
[2025-06-02 08:25:13,018906][I][examples/minimal:51:__main__] iter=840 loss=691.692932 dt=0.005299 dtf=0.001091 dtb=0.004208
[2025-06-02 08:25:13,077564][I][examples/minimal:51:__main__] iter=850 loss=674.963257 dt=0.005420 dtf=0.001105 dtb=0.004315
[2025-06-02 08:25:13,136279][I][examples/minimal:51:__main__] iter=860 loss=684.604980 dt=0.005302 dtf=0.001107 dtb=0.004195
[2025-06-02 08:25:13,194978][I][examples/minimal:51:__main__] iter=870 loss=696.048218 dt=0.005365 dtf=0.001101 dtb=0.004264
[2025-06-02 08:25:13,253730][I][examples/minimal:51:__main__] iter=880 loss=679.293457 dt=0.005284 dtf=0.001077 dtb=0.004207
[2025-06-02 08:25:13,312501][I][examples/minimal:51:__main__] iter=890 loss=679.364197 dt=0.005558 dtf=0.001110 dtb=0.004448 [2025-06-02 08:25:13,371428][I][examples/minimal:51:__main__] iter=900 loss=675.571289 dt=0.005417 dtf=0.001344 dtb=0.004074 [2025-06-02 08:25:13,430037][I][examples/minimal:51:__main__] iter=910 loss=683.194458 dt=0.005323 dtf=0.001077 dtb=0.004246 [2025-06-02 08:25:13,488662][I][examples/minimal:51:__main__] iter=920 loss=689.960022 dt=0.005316 dtf=0.001103 dtb=0.004213 [2025-06-02 08:25:13,547197][I][examples/minimal:51:__main__] iter=930 loss=693.487732 dt=0.005348 dtf=0.001097 dtb=0.004251 [2025-06-02 08:25:13,606009][I][examples/minimal:51:__main__] iter=940 loss=686.816406 dt=0.005356 dtf=0.001087 dtb=0.004269 [2025-06-02 08:25:13,664743][I][examples/minimal:51:__main__] iter=950 loss=670.237244 dt=0.005430 dtf=0.001109 dtb=0.004322 [2025-06-02 08:25:13,723404][I][examples/minimal:51:__main__] iter=960 loss=700.734741 dt=0.005330 dtf=0.001073 dtb=0.004257 [2025-06-02 08:25:13,782161][I][examples/minimal:51:__main__] iter=970 loss=676.606628 dt=0.005324 dtf=0.001075 dtb=0.004249 [2025-06-02 08:25:13,840797][I][examples/minimal:51:__main__] iter=980 loss=687.955688 dt=0.005335 dtf=0.001105 dtb=0.004230 [2025-06-02 08:25:13,900017][I][examples/minimal:51:__main__] iter=990 loss=689.839966 dt=0.005527 dtf=0.001089 dtb=0.004438 [2025-06-02 08:25:13,953099][I][ezpz/dist:144] `train`((DistributedDataParallel( (module): SequentialLinearNet( (layers): Sequential( (0): Linear(in_features=128, out_features=256, bias=True) (1): ReLU() (2): Linear(in_features=256, out_features=512, bias=True) (3): ReLU() (4): Linear(in_features=512, out_features=1024, bias=True) (5): ReLU() (6): Linear(in_features=1024, out_features=2048, bias=True) (7): ReLU() (8): Linear(in_features=2048, out_features=1024, bias=True) (9): ReLU() (10): Linear(in_features=1024, out_features=512, bias=True) (11): ReLU() (12): Linear(in_features=512, out_features=256, bias=True) 
(13): ReLU() (14): Linear(in_features=256, out_features=128, bias=True) (15): ReLU() (16): Linear(in_features=128, out_features=128, bias=True) ) ) ), Adam ( Parameter Group 0 amsgrad: False betas: (0.9, 0.999) capturable: False differentiable: False eps: 1e-08 foreach: None fused: None lr: 0.001 maximize: False weight_decay: 0 ))) took: dt=6.3856s
[2025-06-02 08:25:15,312954][I][ezpz/history:721] Saving iter plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/mplot
[2025-06-02 08:25:15,581086][I][ezpz/history:721] Saving loss plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/mplot
[2025-06-02 08:25:15,860783][I][ezpz/history:721] Saving dt plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/mplot
[2025-06-02 08:25:16,124027][I][ezpz/history:721] Saving dtf plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/mplot
[2025-06-02 08:25:16,380159][I][ezpz/history:721] Saving dtb plot to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/mplot
[2025-06-02 08:25:16,627648][I][ezpz/history:618] Saving tplots to /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot
loss [2025-06-02-082516]
[text plot of loss vs. iter: y-axis 662.4 to 2326.0, x-axis iter 10 to 937]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/loss.txt
dt [2025-06-02-082516]
[text plot of dt vs. iter: y-axis 0.00461 to 0.00665, x-axis iter 10 to 937]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dt.txt
dt [2025-06-02-082516]
[text histogram of dt: freq 0 to 648, dt 0.00452 to 0.00674]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dt-hist.txt
dtf [2025-06-02-082516]
[text plot of dtf vs. iter: y-axis 0.000930 to 0.001399, x-axis iter 10 to 937]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dtf.txt
dtf [2025-06-02-082516]
[text histogram of dtf: freq 0.0 to 724.0, dtf 0.00091 to 0.00142]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dtf-hist.txt
dtb [2025-06-02-082516]
[text plot of dtb vs. iter: y-axis 0.00358 to 0.00555, x-axis iter 10 to 937]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dtb.txt
dtb [2025-06-02-082516]
[text histogram of dtb: freq 0.0 to 664.0, dtb 0.00350 to 0.00563]
text saved in /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/plots/tplot/dtb-hist.txt
[2025-06-02 08:25:16,757339][I][ezpz/utils:224] Saving dataset to: /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/outputs/History-2025-06-02-082513/2025-06-02-082513/History-2025-06-02-082513/dataset_dataset.h5
[2025-06-02 08:25:16,769431][I][examples/minimal:103:__main__] dataset=<xarray.Dataset> Size: 47kB
Dimensions: (draw: 989)
Coordinates:
* draw (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 982 983 984 985 986 987 988
Data variables:
iter (draw) int64 8kB 11 12 13 14 15 16 17 ... 994 995 996 997 998 999
loss (draw) float64 8kB 1.031e+03 898.9 861.3 ... 673.5 680.4 678.1
dt (draw) float64 8kB 0.005432 0.005025 0.005267 ... 0.005351 0.005353
dtf (draw) float64 8kB 0.000955 0.000986 0.000986 ... 0.001077 0.001111
dtb (draw) float64 8kB 0.004477 0.004039 0.004281 ...
0.004274 0.004242
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/wandb/offline-run-20250602_082455-err2dwwn
wandb: Find logs at: ../../../../../../lus/flare/projects/datascience/foremans/projects/saforem2/ezpz/wandb/offline-run-20250602_082455-err2dwwn/logs
Application 51803e72 resources: utime=1016s stime=189s maxrss=3923136KB inblock=509002 oublock=2760 minflt=10027248 majflt=27746 nvcsw=558010 nivcsw=1523810
[2025-06-02 08:25:19,307273][I][ezpz/launch:201] Execution finished @ 2025-06-02-082519
[2025-06-02 08:25:19,308393][I][ezpz/launch:202] Command took 38.44 seconds to run. Exiting.
took: 0h:00m:50s
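The per-iteration lines in the output above all carry `key=value` metrics (`iter`, `loss`, `dt`, `dtf`, `dtb`). As a minimal sketch, assuming you have saved such output to a text file and want the numbers back, a regular expression is enough to recover them. The names `METRIC` and `parse_metrics` are hypothetical helpers for illustration and are not part of `ezpz`. Reading `dtf` and `dtb` as forward and backward times is also an assumption, based only on the fact that they sum to `dt` in the log.

```python
import re

# Hypothetical helper, not part of ezpz: matches `name=number` pairs such as
# `iter=570` or `loss=683.032715` anywhere on a log line.
METRIC = re.compile(r"(\w+)=([-+0-9.eE]+)")


def parse_metrics(line: str) -> dict[str, float]:
    """Return every key=value metric found on one log line, as floats."""
    return {key: float(value) for key, value in METRIC.findall(line)}


# One line copied from the output above.
line = (
    "[2025-06-02 08:25:11,433181][I][examples/minimal:51:__main__] "
    "iter=570 loss=683.032715 dt=0.005285 dtf=0.001099 dtb=0.004186"
)
metrics = parse_metrics(line)

# Assumption: dtf and dtb are the forward and backward times, so they
# should add up to the total step time dt (within rounding).
assert abs(metrics["dt"] - (metrics["dtf"] + metrics["dtb"])) < 1e-5
```

Lines without any `key=value` pair (for example the `wandb:` lines) simply yield an empty dictionary, so the same function can be mapped over a whole log file and the empty results filtered out.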
2 ez.
-
-
This will automagically source ezpz/bin/utils.sh and (&&) call ezpz_setup_env to set up your Python environment. ↩