# Training LLMs with 🍋 ezpz + 🤗 Trainer
The `src/ezpz/examples/hf_trainer.py` module provides a mechanism for distributed training with 🤗 `huggingface/transformers`.
In particular, it allows for distributed training using the `transformers.Trainer` object with any (compatible) combination of {`models`, `datasets`}.
> [!NOTE]
> **Quick start**:
>
> ```bash
> # setup env
> source <(curl -sL https://bit.ly/ezpz-utils)
> ezpz_setup_env
>
> # install ezpz
> uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz"
>
> # launch ezpz.examples.hf_trainer
> ezpz launch -- python3 -m ezpz.examples.hf_trainer \
>     --streaming \
>     --dataset_name=stanfordnlp/imdb \
>     --tokenizer_name meta-llama/Llama-3.2-1B \
>     --model_name_or_path meta-llama/Llama-3.2-1B \
>     --bf16=true \
>     --do_train=true \
>     --do_eval=true \
>     --report-to=wandb \
>     --logging-steps=1 \
>     --include-tokens-per-second=true \
>     --max-steps=50000 \
>     --include-num-input-tokens-seen=true \
>     --optim=adamw_torch \
>     --logging-first-step \
>     --include-for-metrics='inputs,loss' \
>     --max-eval-samples=50 \
>     --per_device_train_batch_size=1 \
>     --block-size=8192 \
>     --gradient_checkpointing=true  # --fsdp=shard_grad_op
> ```
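The same entry point works with other (compatible) {model, dataset} pairs from the Hugging Face Hub: swap the `--model_name_or_path`, `--tokenizer_name`, and `--dataset_name` flags. For example, a minimal sketch pairing a small GPT-2 model with a streamed web-text dataset (both names are illustrative choices, not requirements):

```bash
# illustrative {model, dataset} pair; any compatible combination should work
ezpz launch -- python3 -m ezpz.examples.hf_trainer \
    --streaming \
    --dataset_name=HuggingFaceFW/fineweb-edu \
    --tokenizer_name=openai-community/gpt2 \
    --model_name_or_path=openai-community/gpt2 \
    --bf16=true \
    --do_train=true \
    --logging-steps=1 \
    --max-steps=1000 \
    --per_device_train_batch_size=1 \
    --block-size=1024
```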
## 🐣 Getting Started
💡 **Setup environment** (on ANY {Intel, NVIDIA, AMD} accelerator):

```bash
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env
```
📦 **Install dependencies**:

Install 🍋 ezpz (from GitHub):

```bash
uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz"
# or:
# python3 -m pip install "git+https://github.com/saforem2/ezpz" --require-virtualenv
```
<!--
2. Update {tiktoken, sentencepiece, transformers, evaluate}:

```bash
python3 -m pip install --upgrade tiktoken sentencepiece transformers evaluate
```
-->
**Details**:

⚙️ **Build DeepSpeed config**:

```bash
python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=1)'
```
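To sanity-check the result before launching, you can pretty-print the generated file (the path matches the `--deepspeed` argument used in the launch command below):

```bash
python3 -m json.tool ds_configs/deepspeed_zero1_auto_config.json
```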
🚀 **Launch training**:

```bash
TSTAMP=$(date +%s)
python3 -m ezpz.launch -m ezpz.examples.hf_trainer \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --dataset_name stanfordnlp/imdb \
    --deepspeed=ds_configs/deepspeed_zero1_auto_config.json \
    --auto-find-batch-size=true \
    --bf16=true \
    --block-size=4096 \
    --do-eval=true \
    --do-predict=true \
    --do-train=true \
    --gradient-checkpointing=true \
    --include-for-metrics=inputs,loss \
    --include-num-input-tokens-seen=true \
    --include-tokens-per-second=true \
    --log-level=info \
    --logging-steps=1 \
    --max-steps=10000 \
    --output_dir="hf-trainer-output/${TSTAMP}" \
    --report-to=wandb \
    | tee "hf-trainer-output-${TSTAMP}.log"
```
> **🪄 Magic**:
> Behind the scenes, this will 🪄 automagically determine the specifics of the
> running job, and use this information to construct (and subsequently run) the
> appropriate:
>
> ```bash
> mpiexec <mpi-args> $(which python3) <cmd-to-launch>
> ```
>
> across all of our available accelerators.
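For example, on a 2-node PBS job with 4 accelerators per node, the constructed command might look roughly like the following; the launcher flags and rank counts here are illustrative assumptions, not ezpz's literal output:

```bash
# hypothetical expansion on a PBS-managed system (values are illustrative)
mpiexec -n 8 --ppn 4 --hostfile "${PBS_NODEFILE}" \
    $(which python3) -m ezpz.examples.hf_trainer \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --dataset_name stanfordnlp/imdb  # ...plus the rest of your hf_trainer flags
```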
> **Tip**:
> Call:
>
> ```bash
> python3 -m ezpz.examples.hf_trainer --help
> ```
>
> to see the full list of supported arguments.
> In particular, any `transformers.TrainingArguments` should be supported.
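For example, optimizer- and checkpoint-related `transformers.TrainingArguments` can be appended directly to the launch command (the values below are arbitrary placeholders, not recommendations):

```bash
python3 -m ezpz.launch -m ezpz.examples.hf_trainer \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --dataset_name stanfordnlp/imdb \
    --do_train=true \
    --bf16=true \
    --learning_rate=2e-5 \
    --lr_scheduler_type=cosine \
    --warmup_ratio=0.05 \
    --weight_decay=0.1 \
    --save_steps=500 \
    --seed=42
```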
## 🚀 DeepSpeed Support

Additionally, DeepSpeed is fully supported and can be configured by specifying the path to a compatible DeepSpeed config JSON file, e.g.:
**Build a DeepSpeed config**:

```bash
python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=2)'
```
**Train**:

```bash
python3 -m ezpz.launch -m ezpz.examples.hf_trainer \
    --dataset_name stanfordnlp/imdb \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --bf16 \
    --do_train \
    --report-to=wandb \
    --logging-steps=1 \
    --include-tokens-per-second=true \
    --auto-find-batch-size=true \
    --deepspeed=ds_configs/deepspeed_zero2_auto_config.json
```
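A hand-written config works the same way, provided its values are compatible with the training arguments. Below is a minimal sketch of a ZeRO-2 config that defers sizes to the 🤗 Trainer via `"auto"` values (the file name and contents here are illustrative):

```bash
# write a minimal ZeRO-2 config by hand (illustrative), then pass it via --deepspeed
mkdir -p ds_configs
cat > ds_configs/my_zero2_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 }
}
EOF
```

Then pass `--deepspeed=ds_configs/my_zero2_config.json` in the train command above.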
🍋 2 ez