# Training LLMs with ezpz + 🤗 Trainer

The `src/ezpz/hf_trainer.py` module provides a mechanism for distributed training with 🤗 huggingface/transformers. In particular, it allows for distributed training using the `transformers.Trainer` object with any (compatible) combination of {models, datasets}.
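To make that concrete, the pattern being wrapped looks roughly like the following single-process sketch. This is illustrative only, not the actual `ezpz.hf_trainer` source; the model and dataset names are simply the ones used in the examples below (the gated Llama checkpoint requires Hugging Face access):

```python
# Illustrative sketch: pair an arbitrary causal-LM checkpoint with an arbitrary
# text dataset via transformers.Trainer (NOT the actual ezpz.hf_trainer code).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-1B"  # any compatible model
dataset_name = "stanfordnlp/imdb"       # any compatible dataset

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the raw text column; drop the original columns so only model inputs remain.
dataset = load_dataset(dataset_name, split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hf-trainer-output", max_steps=10, logging_steps=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```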
## Getting Started

1. Setup environment (on **any** {Intel, NVIDIA, AMD} accelerator):

   ```bash
   source <(curl -s 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
   ezpz_setup_env
   ```
2. Install dependencies:

   - Install ezpz (from GitHub):

     ```bash
     python3 -m pip install "git+https://github.com/saforem2/ezpz" --require-virtualenv
     ```

   - Update {tiktoken, sentencepiece, transformers, evaluate}:

     ```bash
     python3 -m pip install --upgrade tiktoken sentencepiece transformers evaluate
     ```
3. Build a DeepSpeed config:

   ```bash
   python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=1)'
   ```
4. Launch training:

   ```bash
   TSTAMP=$(date +%s)
   python3 -m ezpz.launch -m ezpz.hf_trainer \
     --model_name_or_path meta-llama/Llama-3.2-1B \
     --dataset_name stanfordnlp/imdb \
     --deepspeed=ds_configs/deepspeed_zero1_auto_config.json \
     --auto-find-batch-size=true \
     --bf16=true \
     --block-size=4096 \
     --do-eval=true \
     --do-predict=true \
     --do-train=true \
     --gradient-checkpointing=true \
     --include-for-metrics=inputs,loss \
     --include-num-input-tokens-seen=true \
     --include-tokens-per-second=true \
     --log-level=info \
     --logging-steps=1 \
     --max-steps=10000 \
     --output_dir="hf-trainer-output/${TSTAMP}" \
     --report-to=wandb \
     | tee "hf-trainer-output-${TSTAMP}.log"
   ```
> **Magic**: Behind the scenes, this will automagically determine the specifics of the running job, and use this information to construct (and subsequently run) the appropriate
>
> ```bash
> mpiexec <mpi-args> $(which python3) <cmd-to-launch>
> ```
>
> across all of our available accelerators.
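The sketch below shows the rough idea behind that launch logic. It is not ezpz's actual implementation; the `NHOSTS` / `NGPU_PER_HOST` environment variables and the `-ppn` flag are assumptions that vary by scheduler and MPI flavor:

```python
# Rough illustration of "count the resources, then wrap the command in mpiexec".
# NOT the actual ezpz.launch implementation; variable and flag names are assumptions.
import os
import shutil
import subprocess
import sys


def launch_sketch(module: str, *args: str) -> None:
    nhosts = int(os.environ.get("NHOSTS", "1"))                # hypothetical: hosts in the job
    ngpu_per_host = int(os.environ.get("NGPU_PER_HOST", "1"))  # hypothetical: accelerators per host
    world_size = nhosts * ngpu_per_host
    cmd = [
        shutil.which("mpiexec") or "mpiexec",
        "-n", str(world_size),       # total number of ranks
        "-ppn", str(ngpu_per_host),  # ranks per host (flag name differs across MPI implementations)
        sys.executable, "-m", module, *args,
    ]
    print("Launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)


# e.g.:
# launch_sketch("ezpz.hf_trainer", "--model_name_or_path", "meta-llama/Llama-3.2-1B")
```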
> **Tip**: Call
>
> ```bash
> python3 -m ezpz.hf_trainer --help
> ```
>
> to see the full list of supported arguments. In particular, any `transformers.TrainingArguments` should be supported.
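For example, most of the flags in the launch command above map directly onto `transformers.TrainingArguments` fields, as in the illustrative snippet below. Flags such as `--model_name_or_path`, `--dataset_name`, and `--block-size` are not `TrainingArguments` fields and are handled by the script's own arguments:

```python
# Illustrative only: the TrainingArguments equivalent of the CLI flags used above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="hf-trainer-output/example",
    bf16=True,                          # validated at construction; needs bf16-capable hardware
    do_train=True,
    do_eval=True,
    do_predict=True,
    auto_find_batch_size=True,
    gradient_checkpointing=True,
    include_num_input_tokens_seen=True,
    include_tokens_per_second=True,
    log_level="info",
    logging_steps=1,
    max_steps=10_000,
    report_to="wandb",
    deepspeed="ds_configs/deepspeed_zero1_auto_config.json",  # path from the config step above
)
```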
## DeepSpeed Support

Additionally, DeepSpeed is fully supported and can be configured by specifying the path to a compatible DeepSpeed config JSON file, e.g.:

Build a DeepSpeed config:

```bash
python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=2)'
```
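For reference, a ZeRO "auto" config of this kind typically looks something like the sketch below; the exact file written by `write_deepspeed_zero12_auto_config` may differ, and the `"auto"` values are resolved by `transformers.Trainer` from its own arguments at runtime:

```python
# Sketch of a representative ZeRO stage-2 "auto" DeepSpeed config
# (assumption: the actual generated file may contain additional fields).
import json

ds_config = {
    "train_batch_size": "auto",               # filled in by transformers.Trainer
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},        # stage 1 for the Getting Started example
}

print(json.dumps(ds_config, indent=2))
```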
Train:

```bash
python3 -m ezpz.launch -m ezpz.hf_trainer \
  --dataset_name stanfordnlp/imdb \
  --model_name_or_path meta-llama/Llama-3.2-1B \
  --bf16 \
  --do_train \
  --report-to=wandb \
  --logging-steps=1 \
  --include-tokens-per-second=true \
  --auto-find-batch-size=true \
  --deepspeed=ds_configs/deepspeed_zero2_auto_config.json
```
2 ez