Intro to HPC Bootcamp 2025

ai4science
hpc
llm
Homepage for Sam Foreman’s Intro to HPC Bootcamp 2025 Project
Author: Sam Foreman
Published: July 15, 2025
Modified: August 14, 2025

📂 Project Contents

🏔️ Instructions for Running @ NERSC

  1. Start a terminal (e.g. an SSH session to Perlmutter, or a terminal from NERSC's JupyterHub)

  2. Create a symlink to the project directory in your $HOME:

    # symlink the m4388 project directory into $HOME
    ln -s /global/cfs/cdirs/m4388 $HOME/m4388
  3. Navigate to m4388 directory:

    cd $HOME/m4388
  4. Clone the repo into your own subdirectory of the project space, e.g. $HOME/m4388/$USER/:

    mkdir $USER && cd $USER
    git clone https://github.com/saforem2/intro-hpc-bootcamp-2025
  5. Find all Jupyter notebooks (a shell-agnostic alternative is sketched just after this list):

    # find all *.ipynb files (recursive ** globbing: works in zsh,
    # or in bash with `shopt -s globstar` enabled)
    ls **/**/**.ipynb | grep -v "cache" | sort | uniq
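
If recursive globbing is not available in your shell, an equivalent find-based listing (a convenience sketch, not part of the original instructions) is:

    # list notebooks without relying on ** globbing
    find . -name '*.ipynb' -not -path '*cache*' | sort -u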

🌐 Distributed Training Example

  1. Log in to Perlmutter (and create the m4388 symlink if it does not already exist):

    ssh <user>@perlmutter.nersc.gov
    [ -d $HOME/m4388 ] || ln -s /global/cfs/cdirs/m4388 $HOME/m4388
  2. Request an interactive job (a non-interactive sbatch variant is sketched after the example output below):

    NODES=2
    HRS=02
    QUEUE=interactive
    salloc \
      --nodes "${NODES}" \
      --qos "${QUEUE}" \
      --time "${HRS}:30:00" \
      -C 'gpu' \
      --gpus=$(( 4 * NODES )) \
      -A m4388_g
  3. Clone wordplay and navigate into it:

    mkdir -p "${HOME}/m4388/Project5/${USER}"
    cd "${HOME}/m4388/Project5/${USER}"
    git clone https://github.com/saforem2/wordplay && cd wordplay
  4. Set up Python (using the ezpz helper utilities):

    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_python
  5. Set up Weights & Biases (you will be prompted for an API key if not already logged in):

    wandb login
  6. Install wordplay:

    python3 -m pip install -e "."
  7. Run ezpz-test (simple test to verify distributed functionality):

    ezpz-test  # <- SHOULD WORK (🤞)
  8. Prepare data:

    python3 -m wordplay.prepare
  9. Run distributed training:

      ezpz-launch -m wordplay \
        train.backend=DDP \
        train.eval_interval=100 \
        data=shakespeare \
        train.dtype=bf16 \
        model.batch_size=8 \
        model.block_size=2048 \
        train.max_iters=1000 \
        train.log_interval=10 \
        train.compile=true
    • Output
      (👻 pytorch2.6.0)
      #[~/m/P/f/p/s/wordplay][🌱 main][🤷✓] via ⨁ v
      #[08/14/25 @ 05:52:20][nid001237]
      ; ezpz-launch -m wordplay train.backend=DDP train.eval_interval=100 data=shakespeare train.dtype=bf16 model.batch_size=8 model.block_size=2048 train.max_iters=1000 train.log_interval=10 train.compile=true
      [2025-08-14 05:53:17,261718][I][ezpz/__init__:265:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
      [2025-08-14 05:53:17,264886][I][ezpz/__init__:266:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
      
      
      [2025-08-14 05:53:17,283662][I][ezpz/launch:225] ======== [ezpz.launch: START] ========
      [2025-08-14 05:53:17,508744][I][ezpz/slurm:90] Checking jobid 41622690 for hostname nid001237...
      [2025-08-14 05:53:17,509756][I][ezpz/slurm:92] Found nid001237 in nodelist for 41622690
      [2025-08-14 05:53:17,713837][I][ezpz/slurm:90] Checking jobid 41622690 for hostname nid001237...
      [2025-08-14 05:53:17,714591][I][ezpz/slurm:92] Found nid001237 in nodelist for 41622690
      [2025-08-14 05:53:17,977002][I][ezpz/slurm:90] Checking jobid 41622690 for hostname nid001237...
      [2025-08-14 05:53:17,977846][I][ezpz/slurm:92] Found nid001237 in nodelist for 41622690
      [2025-08-14 05:53:18,057066][I][ezpz/slurm:109] Writing ['nid001237', '001240'] to /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/nodefile-41622690
      [2025-08-14 05:53:18,089205][I][ezpz/launch:230] Job ID: 41622690
      [2025-08-14 05:53:18,089621][I][ezpz/launch:231] nodelist: ['nid001237', '001240']
      [2025-08-14 05:53:18,090032][I][ezpz/launch:232] hostfile: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/nodefile-41622690
      [2025-08-14 05:53:18,241020][I][ezpz/slurm:90] Checking jobid 41622690 for hostname nid001237...
      [2025-08-14 05:53:18,241792][I][ezpz/slurm:92] Found nid001237 in nodelist for 41622690
      [2025-08-14 05:53:18,328647][I][ezpz/launch:253] Building command to execute by piecing together:
      
              (1.) ['launch_cmd'] + (2.) ['python'] + (3.) ['cmd_to_launch']
      
      [2025-08-14 05:53:18,330096][I][ezpz/launch:257] (1.) ['launch_cmd']: srun -u --verbose -N2 -n8
      [2025-08-14 05:53:18,330544][I][ezpz/launch:258] (2.) ['python']: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/venvs/perlmutter/pytorch2.6.0/bin/python3
      [2025-08-14 05:53:18,330996][I][ezpz/launch:259] (3.) ['cmd_to_launch']:  -m wordplay train.backend=DDP train.eval_interval=100 data=shakespeare train.dtype=bf16 model.batch_size=8 model.block_size=2048 train.max_iters=1000 train.log_interval=10 train.compile=true
      [2025-08-14 05:53:18,331986][I][ezpz/launch:264] Took: 1.05 seconds to build command.
      [2025-08-14 05:53:18,332539][I][ezpz/launch:268] Executing:
              srun
              -u
              --verbose
              -N2
              -n8
              /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/venvs/perlmutter/pytorch2.6.0/bin/python3
              -m
              wordplay
              train.backend=DDP
              train.eval_interval=100
              data=shakespeare
              train.dtype=bf16
              model.batch_size=8
              model.block_size=2048
              train.max_iters=1000
              train.log_interval=10
              train.compile=true
      [2025-08-14 05:53:18,334155][I][ezpz/launch:276] Execution started @ 2025-08-14-055318...
      [2025-08-14 05:53:18,334619][I][ezpz/launch:277] ======== [ezpz.launch: STOP] ========
      
      srun: defined options
      srun: -------------------- --------------------
      srun: (null)              : nid[001237,001240]
      srun: gpus                : 8
      srun: jobid               : 41622690
      srun: job-name            : interactive
      srun: mpi                 : cray_shasta
      srun: nodes               : 2
      srun: ntasks              : 8
      srun: oom-kill-step       : 0
      srun: slurmd-debug        : error
      srun: unbuffered          : set
      srun: verbose             : 1
      srun: -------------------- --------------------
      srun: end of defined options
      srun: jobid 41622690: nodes(2):'nid[001237,001240]', cpu counts: 128(x2)
      srun: CpuBindType=(null type)
      srun: launching StepId=41622690.1 on host nid001237, 4 tasks: [0-3]
      srun: launching StepId=41622690.1 on host nid001240, 4 tasks: [4-7]
      srun: topology/default: init: topology Default plugin loaded
      srun: Node nid001237, 4 tasks started
      srun: Node nid001240, 4 tasks started
      [2025-08-14 05:54:16,894992][I][ezpz/__init__:265:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
      [2025-08-14 05:54:16,897554][I][ezpz/__init__:266:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
      [2025-08-14 05:54:17,040731][I][wordplay/configs:81] Setting HF_DATASETS_CACHE to /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/.cache/huggingface/datasets
      [2025-08-14 05:54:18,062573][I][ezpz/dist:1159] Using fw='ddp' with torch_{device,backend}= {cuda, nccl}
      [2025-08-14 05:54:18,063295][I][ezpz/dist:1026] Caught MASTER_PORT=56513 from environment!
      [2025-08-14 05:54:18,372603][I][ezpz/dist:1042] Using torch.distributed.init_process_group with
      - master_addr='nid001237'
      - master_port='56513'
      - world_size=8
      - rank=0
      - local_rank=0
      - timeout=datetime.timedelta(seconds=3600)
      - backend='nccl'
      [2025-08-14 05:54:18,373741][I][ezpz/dist:759] Calling torch.distributed.init_process_group_with: rank=0 world_size=8 backend=nccl
      [rank6]:[W814 05:54:18.650817367 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6]  using GPU 2 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank4]:[W814 05:54:18.001713707 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4]  using GPU 0 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank0]:[W814 05:54:19.906551058 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank5]:[W814 05:54:19.231412634 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5]  using GPU 1 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank7]:[W814 05:54:19.394309045 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7]  using GPU 3 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank3]:[W814 05:54:19.232206867 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank1]:[W814 05:54:19.258096849 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [rank2]:[W814 05:54:19.258138279 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices usedby this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
      [2025-08-14 05:54:20,179821][I][ezpz/dist:1377] Using device='cuda' with backend='nccl' + 'nccl' for distributed training.
      [2025-08-14 05:54:20,181272][I][ezpz/dist:1422] ['nid001237'][0/7]
      [2025-08-14 05:54:20,179635][I][ezpz/dist:1422] ['nid001237'][1/7]
      [2025-08-14 05:54:20,179688][I][ezpz/dist:1422] ['nid001240'][6/7]
      [2025-08-14 05:54:20,179683][I][ezpz/dist:1422] ['nid001240'][7/7]
      [2025-08-14 05:54:20,179727][I][ezpz/dist:1422] ['nid001237'][2/7]
      [2025-08-14 05:54:20,179700][I][ezpz/dist:1422] ['nid001237'][3/7]
      [2025-08-14 05:54:20,179702][I][ezpz/dist:1422] ['nid001240'][4/7]
      [2025-08-14 05:54:20,179691][I][ezpz/dist:1422] ['nid001240'][5/7]
      [2025-08-14 05:54:20,213261][I][wordplay/configs:317] Loading val from /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/data/shakespeare_char/val.bin
      [2025-08-14 05:54:20,215366][I][wordplay/configs:317] Loading train from /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/data/shakespeare_char/train.bin
      [2025-08-14 05:54:20,221097][I][wordplay/configs:442] Tokens per iteration: 131,072
      [2025-08-14 05:54:20,221681][I][wordplay/configs:465] Using self.ptdtype=torch.float16 on self.device_type='cuda'
      [2025-08-14 05:54:20,222155][I][wordplay/configs:471] Initializing a new model from scratch
      [2025-08-14 05:54:20,223622][I][ezpz/dist:1648] Setting up wandb from rank=0
      [2025-08-14 05:54:20,224043][I][ezpz/dist:1649] Using WB_PROJECT=WordPlay
      wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
      wandb: Currently logged in as: foremans (aurora_gpt) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
      wandb: Tracking run with wandb version 0.19.7
      wandb: Run data is saved locally in /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17/wandb/run-20250814_055421-qqpij4mt
      wandb: Run `wandb offline` to turn off syncing.
      wandb: Syncing run helpful-sea-138
      wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/WordPlay
      wandb: 🚀 View run at https://wandb.ai/aurora_gpt/WordPlay/runs/qqpij4mt
      [2025-08-14 05:54:22,857365][C][wordplay/trainer:322] RANK=1:devid='cuda:1'
      [2025-08-14 05:54:22,858391][C][wordplay/trainer:322] RANK=3:devid='cuda:3'
      [2025-08-14 05:54:22,859157][C][wordplay/trainer:322] RANK=2:devid='cuda:2'
      [2025-08-14 05:54:22,925017][C][wordplay/trainer:322] RANK=7:devid='cuda:3'
      [2025-08-14 05:54:22,925164][C][wordplay/trainer:322] RANK=6:devid='cuda:2'
      [2025-08-14 05:54:22,931160][C][wordplay/trainer:322] RANK=4:devid='cuda:0'
      [2025-08-14 05:54:23,025155][C][wordplay/trainer:322] RANK=5:devid='cuda:1'
      [2025-08-14 05:54:23,585980][I][ezpz/dist:1678] wandb.run=[helpful-sea-138](https://wandb.ai/aurora_gpt/WordPlay/runs/qqpij4mt)
      [2025-08-14 05:54:23,693859][I][ezpz/dist:1722] Running on machine='Perlmutter'
      [2025-08-14 05:54:23,695938][W][wordplay/__main__:93:__main__] {
          "train": {
              "framework": "pytorch",
              "backend": "DDP",
              "device": null,
              "seed": null,
              "port": null,
              "ds_config_path": null,
              "precision": null,
              "ngpus": null,
              "use_wandb": true,
              "eval_interval": 100,
              "log_interval": 10,
              "eval_iters": 200,
              "eval_only": false,
              "always_save_checkpoint": false,
              "init_from": "scratch",
              "wandb_project": "WordPlay",
              "max_iters": 1000,
              "warmup_iters": 100,
              "dtype": "bf16",
              "compile": true
          },
          "model": {
              "n_layer": 12,
              "n_head": 12,
              "n_embd": 768,
              "batch_size": 8,
              "block_size": 2048,
              "activation": "gelu",
              "dropout": 0.0,
              "bias": false,
              "vocab_size": 65
          },
          "data": {
              "dataset": "shakespeare_char",
              "out_dir": "out-shakespeare-char",
              "root_path": null
          },
          "optimizer": {
              "gas": 1,
              "name": "AdamW",
              "learning_rate": 0.0006,
              "weight_decay": 0.1,
              "beta1": 0.9,
              "beta2": 0.95,
              "grad_clip": 1.0,
              "decay_lr": true,
              "lr_decay_iters": 600000,
              "min_lr": 6e-05
          }
      }
      [2025-08-14 05:54:23,698890][W][wordplay/__main__:94:__main__] Output dir: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17
      [2025-08-14 05:54:23,699444][I][wordplay/trainer:234] Initializing a new model from scratch
      [2025-08-14 05:54:24,575640][I][wordplay/model:255] number of parameters: 85.00M
      [2025-08-14 05:54:24,625918][I][wordplay/trainer:251] Model size: num_params=85003776
      [2025-08-14 05:54:24,637219][I][wordplay/model:445] num decayed parameter tensors: 50, with 86,557,440 parameters
      [2025-08-14 05:54:24,637872][I][wordplay/model:449] num non-decayed parameter tensors: 25, with 19,200 parameters
      [2025-08-14 05:54:24,638937][I][wordplay/model:465] using fused AdamW: True
      [2025-08-14 05:54:25,662969][C][wordplay/trainer:322] RANK=0:devid='cuda:0'
      [2025-08-14 05:54:25,748890][I][wordplay/trainer:361] • self.model=OptimizedModule(
        (_orig_mod): GPT(
          (transformer): ModuleDict(
            (wte): Embedding(65, 768)
            (wpe): Embedding(2048, 768)
            (drop): Dropout(p=0.0, inplace=False)
            (h): ModuleList(
              (0-11): 12 x Block(
                (ln_1): LayerNorm()
                (attn): CausalSelfAttention(
                  (c_attn): Linear(in_features=768, out_features=2304, bias=False)
                  (c_proj): Linear(in_features=768, out_features=768, bias=False)
                  (attn_dropout): Dropout(p=0.0, inplace=False)
                  (resid_dropout): Dropout(p=0.0, inplace=False)
                )
                (ln_2): LayerNorm()
                (mlp): MLP(
                  (c_fc): Linear(in_features=768, out_features=3072, bias=False)
                  (act_fn): GELU(approximate='none')
                  (c_proj): Linear(in_features=3072, out_features=768, bias=False)
                  (dropout): Dropout(p=0.0, inplace=False)
                )
              )
            )
            (ln_f): LayerNorm()
          )
          (lm_head): Linear(in_features=768, out_features=65, bias=False)
        )
      )
      [2025-08-14 05:54:25,752694][I][wordplay/trainer:362] • self.grad_scaler=<torch.cuda.amp.grad_scaler.GradScaler object at 0x14da15e752b0>
      [2025-08-14 05:54:25,753734][I][wordplay/trainer:363] • self.model_engine=DistributedDataParallel(
        (module): OptimizedModule(
          (_orig_mod): GPT(
            (transformer): ModuleDict(
              (wte): Embedding(65, 768)
              (wpe): Embedding(2048, 768)
              (drop): Dropout(p=0.0, inplace=False)
              (h): ModuleList(
                (0-11): 12 x Block(
                  (ln_1): LayerNorm()
                  (attn): CausalSelfAttention(
                    (c_attn): Linear(in_features=768, out_features=2304, bias=False)
                    (c_proj): Linear(in_features=768, out_features=768, bias=False)
                    (attn_dropout): Dropout(p=0.0, inplace=False)
                    (resid_dropout): Dropout(p=0.0, inplace=False)
                  )
                  (ln_2): LayerNorm()
                  (mlp): MLP(
                    (c_fc): Linear(in_features=768, out_features=3072, bias=False)
                    (act_fn): GELU(approximate='none')
                    (c_proj): Linear(in_features=3072, out_features=768, bias=False)
                    (dropout): Dropout(p=0.0, inplace=False)
                  )
                )
              )
              (ln_f): LayerNorm()
            )
            (lm_head): Linear(in_features=768, out_features=65, bias=False)
          )
        )
      )
      [2025-08-14 05:54:25,757292][I][wordplay/trainer:364] • self.optimizer=AdamW (
      Parameter Group 0
          amsgrad: False
          betas: (0.9, 0.95)
          capturable: False
          differentiable: False
          eps: 1e-08
          foreach: None
          fused: True
          lr: 0.0006
          maximize: False
          weight_decay: 0.1
      
      Parameter Group 1
          amsgrad: False
          betas: (0.9, 0.95)
          capturable: False
          differentiable: False
          eps: 1e-08
          foreach: None
          fused: True
          lr: 0.0006
          maximize: False
          weight_decay: 0.0
      )
      [2025-08-14 05:54:25,759725][I][wordplay/trainer:796] Startup time: 8.6967
                      Training Legend
      ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
      ┃        abbr ┃ desc                           ┃
      ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
      │        step │ Current training iteration     │
      │        loss │ Loss value                     │
      │          dt │ Elapsed time per training step │
      │         dtf │ Elapsed time per forward step  │
      │         dtb │ Elapsed time per backward step │
      │         sps │ Samples per second             │
      │ sps_per_gpu │ Samples per second (per GPU)   │
      │         tps │ Tokens per second              │
      │ tps_per_gpu │ Tokens per second (per GPU)    │
      │         mfu │ Model flops utilization        │
      └─────────────┴────────────────────────────────┘
      [2025-08-14 05:54:27,146477][I][wordplay/trainer:807] ['prompt']: 'What is an LLM?'
      [2025-08-14 05:54:27,147296][I][wordplay/trainer:810] ['response']:
      
      What is an LLM? 'hkk ''Evllll VWccm UzcW!W'':zlWk W  z! XXwltMMV!Qyyx'y kDDvVX;WWlyy  jKy;kkyyxxx$-WDll  l!;WWmmWW eeJJzq.vv;! w;;z'tlWDDDWklUJ ;yyNlccxQ-D V!MMG'zt;WWk lllUU D-kkXWUvvMy;;JrMCzl;Uve,z;'':VWQ-y$l--o.cJD.yM'yyyZyyyVV$Qt!!kxuJeeD kll'Uy-J'vV!tmkzJuM?!ppXXG;'
      [2025-08-14 05:55:22,838950][I][wordplay/trainer:868] step=10 loss=3.01934 dt=0.0778009 dtf=0.0055681 dtb=0.0124092 sps=102.827 sps_per_gpu=12.8533 tps=1.68471e+06 tps_per_gpu=210589 mfu=49.7121
      [2025-08-14 05:55:23,622530][I][wordplay/trainer:868] step=20 loss=2.73268 dt=0.0783059 dtf=0.00546336 dtb=0.0124693 sps=102.163 sps_per_gpu=12.7704 tps=1.67385e+06 tps_per_gpu=209231 mfu=49.6801
      [2025-08-14 05:55:24,407385][I][wordplay/trainer:868] step=30 loss=2.5634 dt=0.0784503 dtf=0.00536114 dtb=0.0125228 sps=101.975 sps_per_gpu=12.7469 tps=1.67076e+06 tps_per_gpu=208846 mfu=49.6421
      [2025-08-14 05:55:25,192393][I][wordplay/trainer:868] step=40 loss=2.51223 dt=0.0784606 dtf=0.00545081 dtb=0.012732 sps=101.962 sps_per_gpu=12.7452 tps=1.67054e+06 tps_per_gpu=208818 mfu=49.6073
      [2025-08-14 05:55:25,977765][I][wordplay/trainer:868] step=50 loss=2.49004 dt=0.079029 dtf=0.00523198 dtb=0.0125401 sps=101.229 sps_per_gpu=12.6536 tps=1.65853e+06 tps_per_gpu=207316 mfu=49.5406
      [2025-08-14 05:55:26,761465][I][wordplay/trainer:868] step=60 loss=2.45537 dt=0.07775 dtf=0.00518063 dtb=0.0127289 sps=102.894 sps_per_gpu=12.8617 tps=1.68581e+06 tps_per_gpu=210727 mfu=49.561
      [2025-08-14 05:55:27,546584][I][wordplay/trainer:868] step=70 loss=2.46909 dt=0.0782411 dtf=0.00528915 dtb=0.0125154 sps=102.248 sps_per_gpu=12.781 tps=1.67523e+06 tps_per_gpu=209404 mfu=49.5481
      [2025-08-14 05:55:28,332687][I][wordplay/trainer:868] step=80 loss=2.48264 dt=0.0792809 dtf=0.00552908 dtb=0.0130663 sps=100.907 sps_per_gpu=12.6134 tps=1.65326e+06 tps_per_gpu=206657 mfu=49.4717
      [2025-08-14 05:55:29,118556][I][wordplay/trainer:868] step=90 loss=2.51034 dt=0.0781782 dtf=0.00514289 dtb=0.0124665 sps=102.33 sps_per_gpu=12.7913 tps=1.67658e+06 tps_per_gpu=209573 mfu=49.4718
      [2025-08-14 05:55:29,904022][I][wordplay/trainer:868] step=100 loss=2.46516 dt=0.078483 dtf=0.00516737 dtb=0.0129046 sps=101.933 sps_per_gpu=12.7416 tps=1.67007e+06 tps_per_gpu=208758 mfu=49.4526
      [2025-08-14 05:55:30,990755][I][wordplay/trainer:807] ['prompt']: 'What is an LLM?'
      [2025-08-14 05:55:30,991344][I][wordplay/trainer:810] ['response']:
      
      What is an LLM? denour sad is thot wind;
      Ae micome lofas t butowhatiom, ar thy mitheath anshthath o w gesurcingero w on
      
      GArsitheath the,
      Tordist w nofout thoru ol t arthim he my,
      Thich thingay
      Thot we wiman hineisoule blt me s hat f aul the t,
      Tyoffove.
      Haicede t tounon
      [2025-08-14 05:55:40,218926][I][wordplay/trainer:750] Saving checkpoint to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17
      [2025-08-14 05:55:40,219705][I][wordplay/trainer:751] Saving model to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17/model.pth
      [2025-08-14 05:55:42,195499][I][wordplay/configs:141] Appending /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17 to /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/src/ckpts/checkpoints.log
      [2025-08-14 05:55:42,999389][I][wordplay/trainer:868] step=110 loss=2.43177 dt=0.0780071 dtf=0.00545025 dtb=0.0126532 sps=102.555 sps_per_gpu=12.8193 tps=1.68026e+06 tps_per_gpu=210032 mfu=49.4654
      [2025-08-14 05:55:43,785978][I][wordplay/trainer:868] step=120 loss=2.44182 dt=0.078106 dtf=0.00545009 dtb=0.0125079 sps=102.425 sps_per_gpu=12.8031 tps=1.67813e+06 tps_per_gpu=209766 mfu=49.4707
      [2025-08-14 05:55:44,572009][I][wordplay/trainer:868] step=130 loss=2.44907 dt=0.0785986 dtf=0.00526247 dtb=0.0123562 sps=101.783 sps_per_gpu=12.7229 tps=1.66761e+06 tps_per_gpu=208451 mfu=49.4444
      [2025-08-14 05:55:45,358627][I][wordplay/trainer:868] step=140 loss=2.39859 dt=0.0794152 dtf=0.00549359 dtb=0.0128465 sps=100.736 sps_per_gpu=12.592 tps=1.65047e+06 tps_per_gpu=206308 mfu=49.3701
      [2025-08-14 05:55:46,146387][I][wordplay/trainer:868] step=150 loss=2.42138 dt=0.079703 dtf=0.00618074 dtb=0.012967 sps=100.373 sps_per_gpu=12.5466 tps=1.6445e+06 tps_per_gpu=205563 mfu=49.2856
      [2025-08-14 05:55:46,933663][I][wordplay/trainer:868] step=160 loss=2.41715 dt=0.0793599 dtf=0.00549591 dtb=0.0130799 sps=100.807 sps_per_gpu=12.6008 tps=1.65161e+06 tps_per_gpu=206452 mfu=49.2306
      [2025-08-14 05:55:47,721057][I][wordplay/trainer:868] step=170 loss=2.44814 dt=0.0789742 dtf=0.00546218 dtb=0.012633 sps=101.299 sps_per_gpu=12.6624 tps=1.65968e+06 tps_per_gpu=207460 mfu=49.2049
      [2025-08-14 05:55:48,507992][I][wordplay/trainer:868] step=180 loss=2.41629 dt=0.0787978 dtf=0.00542163 dtb=0.0128453 sps=101.526 sps_per_gpu=12.6907 tps=1.6634e+06 tps_per_gpu=207925 mfu=49.1928
      [2025-08-14 05:55:49,297076][I][wordplay/trainer:868] step=190 loss=2.38078 dt=0.0781823 dtf=0.00540887 dtb=0.0122009 sps=102.325 sps_per_gpu=12.7906 tps=1.67649e+06 tps_per_gpu=209561 mfu=49.2204
      [2025-08-14 05:55:50,085624][I][wordplay/trainer:868] step=200 loss=2.38881 dt=0.0787827 dtf=0.00544194 dtb=0.012954 sps=101.545 sps_per_gpu=12.6931 tps=1.66372e+06 tps_per_gpu=207964 mfu=49.2077
      [2025-08-14 05:55:51,182746][I][wordplay/trainer:807] ['prompt']: 'What is an LLM?'
      [2025-08-14 05:55:51,183345][I][wordplay/trainer:810] ['response']:
      
      What is an LLM?
      TYMMurcomarl he ffal ther the arisplit at in fil an arices tor\'se iom o foul yof forsthe,
      ADe ce he the her slashe th ous ar me andone be the sorthe spof aris indllfll thir me ay the bldorom n de
      She thit t,
      Clou lllethe fourth wit thin, pr thee th bl hes
      [2025-08-14 05:56:00,419222][I][wordplay/trainer:750] Saving checkpoint to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17
      [2025-08-14 05:56:00,420026][I][wordplay/trainer:751] Saving model to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17/model.pth
      [2025-08-14 05:56:03,473861][I][wordplay/configs:141] Appending /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17 to /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/src/ckpts/checkpoints.log
      [2025-08-14 05:56:04,301502][I][wordplay/trainer:868] step=210 loss=2.40137 dt=0.0777987 dtf=0.00521268 dtb=0.012754 sps=102.829 sps_per_gpu=12.8537 tps=1.68476e+06 tps_per_gpu=210595 mfu=49.2582
      [2025-08-14 05:56:05,090312][I][wordplay/trainer:868] step=220 loss=2.32615 dt=0.0785563 dtf=0.00518237 dtb=0.0124011 sps=101.838 sps_per_gpu=12.7297 tps=1.66851e+06 tps_per_gpu=208564 mfu=49.2558
      [2025-08-14 05:56:05,875417][I][wordplay/trainer:868] step=230 loss=2.28453 dt=0.0781005 dtf=0.00532893 dtb=0.0126092 sps=102.432 sps_per_gpu=12.804 tps=1.67825e+06 tps_per_gpu=209781 mfu=49.2824
      [2025-08-14 05:56:06,662143][I][wordplay/trainer:868] step=240 loss=2.32075 dt=0.0790562 dtf=0.0056354 dtb=0.0128831 sps=101.194 sps_per_gpu=12.6492 tps=1.65796e+06 tps_per_gpu=207245 mfu=49.2464
      [2025-08-14 05:56:07,448907][I][wordplay/trainer:868] step=250 loss=2.26398 dt=0.0782558 dtf=0.00549903 dtb=0.0123185 sps=102.229 sps_per_gpu=12.7786 tps=1.67492e+06 tps_per_gpu=209365 mfu=49.2641
      [2025-08-14 05:56:08,237097][I][wordplay/trainer:868] step=260 loss=2.20778 dt=0.0779999 dtf=0.00544577 dtb=0.0125349 sps=102.564 sps_per_gpu=12.8205 tps=1.68041e+06 tps_per_gpu=210052 mfu=49.2962
      [2025-08-14 05:56:09,024535][I][wordplay/trainer:868] step=270 loss=2.13115 dt=0.0790745 dtf=0.00544547 dtb=0.0127346 sps=101.17 sps_per_gpu=12.6463 tps=1.65758e+06 tps_per_gpu=207197 mfu=49.2577
      [2025-08-14 05:56:09,812910][I][wordplay/trainer:868] step=280 loss=2.1087 dt=0.078672 dtf=0.00547284 dtb=0.0126957 sps=101.688 sps_per_gpu=12.711 tps=1.66606e+06 tps_per_gpu=208257 mfu=49.2481
      [2025-08-14 05:56:10,600338][I][wordplay/trainer:868] step=290 loss=2.07268 dt=0.0785346 dtf=0.00520987 dtb=0.0126892 sps=101.866 sps_per_gpu=12.7332 tps=1.66897e+06 tps_per_gpu=208621 mfu=49.2481
      [2025-08-14 05:56:11,388671][I][wordplay/trainer:868] step=300 loss=1.94068 dt=0.0790002 dtf=0.00531021 dtb=0.0126261 sps=101.266 sps_per_gpu=12.6582 tps=1.65914e+06 tps_per_gpu=207392 mfu=49.219
      [2025-08-14 05:56:12,465166][I][wordplay/trainer:807] ['prompt']: 'What is an LLM?'
      [2025-08-14 05:56:12,465770][I][wordplay/trainer:810] ['response']:
      
      What is an LLM?
      
      BHORLINUS:
      You hes.
      
      SORONE:
      What the the he opteresint o of men sign ond the be, them wit ook hom sace win comend faren thy to the sate, the there my ford thim helinguss?
      
      Gest will ningure lan friner thing fornce, his blout of dete to hee tweer he hou
      [2025-08-14 05:56:21,695714][I][wordplay/trainer:750] Saving checkpoint to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17
      [2025-08-14 05:56:21,696469][I][wordplay/trainer:751] Saving model to: /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17/model.pth
      [2025-08-14 05:56:24,287565][I][wordplay/configs:141] Appending /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/outputs/runs/pytorch/DDP/2025-08-14/05-54-17 to /global/cfs/cdirs/m4388/Project5/foremans/projects/saforem2/wordplay/src/ckpts/checkpoints.log
      [2025-08-14 05:56:25,085517][I][wordplay/trainer:868] step=310 loss=1.95217 dt=0.0780975 dtf=0.0054995 dtb=0.0122778 sps=102.436 sps_per_gpu=12.8045 tps=1.67831e+06 tps_per_gpu=209789 mfu=49.2495
      [2025-08-14 05:56:25,870824][I][wordplay/trainer:868] step=320 loss=1.82002 dt=0.0780001 dtf=0.00549541 dtb=0.012604 sps=102.564 sps_per_gpu=12.8205 tps=1.68041e+06 tps_per_gpu=210051 mfu=49.283
      [2025-08-14 05:56:26,657320][I][wordplay/trainer:868] step=330 loss=1.80354 dt=0.080137 dtf=0.00549193 dtb=0.0133185 sps=99.8291 sps_per_gpu=12.4786 tps=1.6356e+06 tps_per_gpu=204450 mfu=49.181
      [2025-08-14 05:56:27,444509][I][wordplay/trainer:868] step=340 loss=1.7014 dt=0.0786976 dtf=0.00530128 dtb=0.012458 sps=101.655 sps_per_gpu=12.7069 tps=1.66551e+06 tps_per_gpu=208189 mfu=49.1775
      [2025-08-14 05:56:28,230760][I][wordplay/trainer:868] step=350 loss=1.70333 dt=0.0786828 dtf=0.00523585 dtb=0.0120742 sps=101.674 sps_per_gpu=12.7093 tps=1.66583e+06 tps_per_gpu=208228 mfu=49.1752
      [2025-08-14 05:56:29,017482][I][wordplay/trainer:868] step=360 loss=1.63698 dt=0.078231 dtf=0.00522951 dtb=0.0119305 sps=102.261 sps_per_gpu=12.7827 tps=1.67545e+06 tps_per_gpu=209431 mfu=49.2016
      [2025-08-14 05:56:29,804709][I][wordplay/trainer:868] step=370 loss=1.6209 dt=0.078897 dtf=0.00537987 dtb=0.0121035 sps=101.398 sps_per_gpu=12.6748 tps=1.66131e+06 tps_per_gpu=207663 mfu=49.1836
      [2025-08-14 05:56:30,590656][I][wordplay/trainer:868] step=380 loss=1.62243 dt=0.0783268 dtf=0.00511256 dtb=0.0119576 sps=102.136 sps_per_gpu=12.767 tps=1.6734e+06 tps_per_gpu=209175 mfu=49.2031
      [2025-08-14 05:56:31,378053][I][wordplay/trainer:868] step=390 loss=1.46302 dt=0.0787818 dtf=0.00514412 dtb=0.0120985 sps=101.546 sps_per_gpu=12.6933 tps=1.66373e+06 tps_per_gpu=207967 mfu=49.1921
      [2025-08-14 05:56:32,164231][I][wordplay/trainer:868] step=400 loss=1.48415 dt=0.0786123 dtf=0.00519092 dtb=0.0120484 sps=101.765 sps_per_gpu=12.7206 tps=1.66732e+06 tps_per_gpu=208415 mfu=49.1928
      [2025-08-14 05:56:33,253719][I][wordplay/trainer:807] ['prompt']: 'What is an LLM?'
      [2025-08-14 05:56:33,254320][I][wordplay/trainer:810] ['response']:
      
      What is an LLM?
      
      EONTES:
      The sir, or accution of him. Well oftiess a somet
      to marry be of our livery: anst the nemble to tearture,
      to prompt out sicibler out himself too suction.
      I but stain time acfficiancel\'d cament, and all
      nom officious laptimes famits of finge, have
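
Once the interactive run above works, the same training can be submitted as a regular batch job instead. The script below is only a sketch: the QOS, time limit, paths, and the assumption that ezpz-launch picks up the Slurm allocation under sbatch the same way it does under salloc all mirror the interactive session above and may need adjusting. It also assumes steps 3-6 (clone, Python setup, wandb login, install) were already done once interactively.

    #!/bin/bash
    # train_wordplay.sbatch -- hypothetical batch variant of the interactive example above
    #SBATCH -A m4388_g
    #SBATCH -C gpu
    #SBATCH --qos=regular
    #SBATCH --nodes=2
    #SBATCH --gpus=8
    #SBATCH --time=02:30:00

    cd "${HOME}/m4388/Project5/${USER}/wordplay"
    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_python

    ezpz-launch -m wordplay \
      train.backend=DDP \
      train.eval_interval=100 \
      data=shakespeare \
      train.dtype=bf16 \
      model.batch_size=8 \
      model.block_size=2048 \
      train.max_iters=1000 \
      train.log_interval=10 \
      train.compile=true

Submit it with sbatch train_wordplay.sbatch; progress then lands in slurm-<jobid>.out, and the loss curve can be pulled out of that file afterwards with something like grep -E 'step=[0-9]+ loss=' slurm-<jobid>.out.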
👀
import datetime
from rich import print

now = datetime.datetime.now()
print(" ".join([
    "[#838383]Last Updated[/]:", f"[#E599F7]{now.strftime('%Y-%m-%d')}[/]",
    "[#838383]@[/]", f"[#00CCFF]{now.strftime('%H:%M:%S')}[/]",
]))
Last Updated: 2025-08-12 @ 16:53:47

Citation

BibTeX citation:
@online{foreman2025,
  author = {Foreman, Sam},
  title = {Intro to {HPC} {Bootcamp} 2025},
  date = {2025-07-15},
  url = {https://saforem2.github.io/intro-hpc-bootcamp-2025},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2025. “Intro to HPC Bootcamp 2025.” July 15, 2025. https://saforem2.github.io/intro-hpc-bootcamp-2025.