# [2025-12-18 13:21:50,091003][I][ezpz/launch:378:launch] ----[π ezpz.launch][started][2025-12-18-132150]----
# [2025-12-18 13:21:51,231621][I][ezpz/launch:396:launch] Job ID: 8219131
# [2025-12-18 13:21:51,232430][I][ezpz/launch:397:launch] nodelist: ['x4310c7s4b0n0', 'x4418c6s1b0n0']
# [2025-12-18 13:21:51,232855][I][ezpz/launch:398:launch] hostfile: /var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
# [2025-12-18 13:21:51,233490][I][ezpz/pbs:329:get_pbs_launch_cmd]
Using [24/24] GPUs [2 hosts] x [12 GPU/host]
# [2025-12-18 13:21:51,234263][I][ezpz/launch:354:build_executable] Building command to execute by piecing together:
# [2025-12-18 13:21:51,234692][I][ezpz/launch:355:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2025-12-18 13:21:51,235394][I][ezpz/launch:356:build_executable] (2.) cmd_to_launch: python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-132143
# [2025-12-18 13:21:51,237243][I][ezpz/launch:412:launch] Took: 1.22 seconds to build command.
# [2025-12-18 13:21:51,237645][I][ezpz/launch:413:launch] Executing:
mpiexec
--envall
--np=24
--ppn=12
--hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
--no-vni
--cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
python3
-m
ezpz.examples.hf_trainer
--streaming
--dataset_name=stanfordnlp/imdb
--model_name_or_path
/flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650
--bf16=true
--do_train=true
--do_eval=true
--report-to=wandb
--logging-steps=1
--include-tokens-per-second=true
--max-steps=100
--include-num-input-tokens-seen=true
--optim=adamw_torch
--logging-first-step
--include-for-metrics=inputs,loss
--max-eval-samples=50
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--block_size=8192
--gradient_checkpointing=true
--fsdp=auto_wrap
--output_dir=2025-12-18-132143
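For reference, the flag values above fix the token budget of the run: 2 hosts x 12 GPUs/host gives 24 ranks (matching `--np=24 --ppn=12`), and each rank processes one sequence of `block_size=8192` tokens per optimization step. A quick back-of-the-envelope check in plain Python, with the numbers copied directly from the command:

```python
# Token budget implied by the launch flags above (values copied from the command).
hosts, gpus_per_host = 2, 12           # 2 hosts x 12 GPU/host
world_size = hosts * gpus_per_host     # 24 ranks, one per GPU (--np=24 --ppn=12)

per_device_batch = 1                   # --per_device_train_batch_size=1
block_size = 8192                      # --block_size=8192
max_steps = 100                        # --max-steps=100

tokens_per_step = world_size * per_device_batch * block_size
total_tokens = tokens_per_step * max_steps

print(tokens_per_step)  # 196608
print(total_tokens)     # 19660800
```

The total, 19,660,800 tokens over 100 steps, matches the `num_input_tokens_seen` value reported at the end of training further down in the log.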
[2025-12-18 13:21:51,240032][I][ezpz/launch:213:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2025-12-18 13:21:51,240580][I][ezpz/launch:420:launch] Execution started @ 2025-12-18-132151...
[2025-12-18 13:21:51,241009][I][ezpz/launch:421:launch] ----[π ezpz.launch][stop][2025-12-18-132151]----
[2025-12-18 13:21:51,241473][I][ezpz/launch:132:run_command] Caught 24 filters
[2025-12-18 13:21:51,241853][I][ezpz/launch:133:run_command] Running command:
mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-132143
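The `--cpu-bind=verbose,list:...` argument binds each local rank to its own small block of CPU cores, in rank order. Splitting the list on `:` should give one core range per local rank, matching `--ppn=12` (a quick check on the literal string from the command above):

```python
# CPU binding list taken verbatim from the mpiexec command above.
cpu_bind = "2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
ranges = cpu_bind.split(":")
print(len(ranges))   # 12 -> one core range per local rank (--ppn=12)
print(ranges[0])     # '2-4' -> 3 cores reserved for local rank 0
```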
[2025-12-18 13:22:09,760765][I][ezpz/dist:1926:setup_wandb] Using WB_PROJECT=ezpz-hf_trainer--flare-AuroraGPT-AuroraGPT-v1-Experiments-AuroraGPT-2B-public-sophiag-hf-global_step138650
[2025-12-18 13:22:11,079430][I][ezpz/dist:1955:setup_wandb] wandb.run=[cosmic-sunset-5](https://wandb.ai/aurora_gpt/ezpz-hf_trainer--flare-AuroraGPT-AuroraGPT-v1-Experiments-AuroraGPT-2B-public-sophiag-hf-global_step138650/runs/pqytcarn)
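The `WB_PROJECT` name above appears to be built from the example module name plus the model path with path separators replaced by dashes. A hypothetical reconstruction of that naming scheme (the actual `ezpz.dist.setup_wandb` logic may differ):

```python
# Hypothetical reconstruction of the W&B project name seen in the log above;
# this is an assumption about the naming scheme, not ezpz's actual code.
module = "ezpz-hf_trainer"
model_path = "/flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650"
project = f"{module}-{model_path.replace('/', '-')}"
print(project)
# ezpz-hf_trainer--flare-AuroraGPT-AuroraGPT-v1-Experiments-AuroraGPT-2B-public-sophiag-hf-global_step138650
```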
[WARNING|trainer.py:982] 2025-12-18 13:23:46,860 >> The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 1, 'bos_token_id': 2, 'pad_token_id': 0}.
[INFO|trainer.py:2519] 2025-12-18 13:23:51,164 >> ***** Running training *****
[INFO|trainer.py:2520] 2025-12-18 13:23:51,164 >> Num examples = 2,400
[INFO|trainer.py:2521] 2025-12-18 13:23:51,164 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2522] 2025-12-18 13:23:51,164 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2525] 2025-12-18 13:23:51,164 >> Total train batch size (w. parallel, distributed & accumulation) = 24
[INFO|trainer.py:2526] 2025-12-18 13:23:51,165 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2527] 2025-12-18 13:23:51,165 >> Total optimization steps = 100
[INFO|trainer.py:2528] 2025-12-18 13:23:51,165 >> Number of trainable parameters = 82,752,264
[INFO|integration_utils.py:867] 2025-12-18 13:23:51,171 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
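Two numbers in the training summary are worth decoding. `Num Epochs = 9,223,372,036,854,775,807` is `sys.maxsize`, which the HF `Trainer` appears to substitute when `--streaming` yields a dataset of unknown length and `--max-steps` bounds the run instead; and the total train batch size is simply per-device batch x world size x gradient accumulation. A minimal check, assuming those relationships:

```python
import sys

# "Num Epochs" above equals sys.maxsize: with --streaming the dataset has no
# known length, so the epoch count is a placeholder and --max-steps=100 ends the run.
print(sys.maxsize)  # 9223372036854775807

# "Total train batch size (w. parallel, distributed & accumulation)" above:
per_device_batch = 1   # --per_device_train_batch_size=1
world_size = 24        # --np=24
grad_accum = 1         # Gradient Accumulation steps = 1
print(per_device_batch * world_size * grad_accum)  # 24
```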
0%| | 0/100 [00:00<?, ?it/s]
# [...clipped...]
{'loss': 2.6031, 'grad_norm': 1.8801486492156982, 'learning_rate': 5.000000000000001e-07, 'epoch': 2.24, 'num_input_tokens_seen': 19660800, 'train_runtime': 252.007, 'train_tokens_per_second': 78016.894}
100%|##########| 100/100 [04:11<00:00, 2.27s/it]
# [...clipped...]
[INFO|trainer.py:4309] 2025-12-18 13:28:05,079 >> Saving model checkpoint to 2025-12-18-132143/checkpoint-100
{'train_runtime': 299.1983, 'train_samples_per_second': 8.021, 'train_steps_per_second': 0.334, 'train_tokens_per_second': 2737.984, 'train_loss': 2.8478338932991027, 'epoch': 2.24, 'num_input_tokens_seen': 19660800}
100%|##########| 100/100 [04:59<00:00, 2.27s/it]
[INFO|trainer.py:4309] 2025-12-18 13:28:52,199 >> Saving model checkpoint to 2025-12-18-132143
***** train metrics *****
epoch = 2.24
num_input_tokens_seen = 19660800
total_flos = 6691434GF
train_loss = 2.8478
train_runtime = 0:04:59.19
train_samples = 25000
train_samples_per_second = 8.021
train_steps_per_second = 0.334
train_tokens_per_second = 2737.984
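A quick sanity check of the throughput numbers above, using only values already printed in the metrics block:

```python
# Consistency check of the train metrics above (values copied from the log).
max_steps = 100
total_batch = 24          # samples per optimization step (24 ranks x batch 1)
train_runtime = 299.1983  # seconds (0:04:59.19)

print(max_steps / train_runtime)                # ~0.334 -> train_steps_per_second
print(max_steps * total_batch / train_runtime)  # ~8.02  -> train_samples_per_second
```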
[INFO|trainer.py:4643] 2025-12-18 13:29:00,076 >>
***** Running Evaluation *****
[INFO|trainer.py:4647] 2025-12-18 13:29:00,077 >> Num examples: Unknown
[INFO|trainer.py:4648] 2025-12-18 13:29:00,077 >> Batch size = 1
***** eval metrics *****
epoch = 2.24
eval_accuracy = 0.4701
eval_loss = 2.5437
eval_runtime = 0:00:09.12
eval_samples = 50
eval_samples_per_second = 0.329
eval_steps_per_second = 0.11
num_input_tokens_seen = 19660800
perplexity = 12.7268
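The reported perplexity is just `exp(eval_loss)` for the causal language-modeling objective:

```python
import math

# Perplexity = exp(eval_loss); eval_loss copied from the eval metrics above.
eval_loss = 2.5437
print(math.exp(eval_loss))  # ~12.727, matching the reported 12.7268 up to rounding
```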
wandb:
wandb: π View run cosmic-sunset-5 at:
wandb: Find logs at: ../../../../../../../../lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/ezpz-distributed-metrics/wandb/run-20251218_132210-pqytcarn/logs