# Training LLMs with ezpz + 🤗 Trainer

The `src/ezpz/hf_trainer.py` module provides a mechanism for distributed training with 🤗 huggingface/transformers. In particular, it allows for distributed training using the `transformers.Trainer` object with any (compatible) combination of {models, datasets}.
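To make that concrete, the pattern being wrapped looks roughly like the following single-process sketch. This is illustrative only, not the actual `ezpz.hf_trainer` source; the model and dataset names are simply the ones used in the examples below (the gated Llama checkpoint requires Hugging Face access):

```python
# Illustrative sketch: pair an arbitrary causal-LM checkpoint with an arbitrary
# text dataset via transformers.Trainer (NOT the actual ezpz.hf_trainer code).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-1B"  # any compatible model
dataset_name = "stanfordnlp/imdb"       # any compatible dataset

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the raw text column; drop the original columns so only model inputs remain.
dataset = load_dataset(dataset_name, split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hf-trainer-output", max_steps=10, logging_steps=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```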
## Getting Started

1. Setup environment (on **any** {Intel, NVIDIA, AMD} accelerator):

   ```bash
   source <(curl -s 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
   ezpz_setup_env
   ```
2. Install dependencies:

   - Install ezpz (from GitHub):

     ```bash
     python3 -m pip install "git+https://github.com/saforem2/ezpz" --require-virtualenv
     ```

   - Update {tiktoken, sentencepiece, transformers, evaluate}:

     ```bash
     python3 -m pip install --upgrade tiktoken sentencepiece transformers evaluate
     ```
3. Build a DeepSpeed config:

   ```bash
   python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=1)'
   ```
4. Launch training:

   ```bash
   TSTAMP=$(date +%s)
   python3 -m ezpz.launch -m ezpz.hf_trainer \
     --model_name_or_path meta-llama/Llama-3.2-1B \
     --dataset_name stanfordnlp/imdb \
     --deepspeed=ds_configs/deepspeed_zero1_auto_config.json \
     --auto-find-batch-size=true \
     --bf16=true \
     --block-size=4096 \
     --do-eval=true \
     --do-predict=true \
     --do-train=true \
     --gradient-checkpointing=true \
     --include-for-metrics=inputs,loss \
     --include-num-input-tokens-seen=true \
     --include-tokens-per-second=true \
     --log-level=info \
     --logging-steps=1 \
     --max-steps=10000 \
     --output_dir="hf-trainer-output/${TSTAMP}" \
     --report-to=wandb \
     | tee "hf-trainer-output-${TSTAMP}.log"
   ```
> **Magic**: Behind the scenes, this will automagically determine the specifics of the running job, and use this information to construct (and subsequently run) the appropriate
>
> ```bash
> mpiexec <mpi-args> $(which python3) <cmd-to-launch>
> ```
>
> across all of our available accelerators.
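The sketch below shows the rough idea behind that launch logic. It is not ezpz's actual implementation; the `NHOSTS` / `NGPU_PER_HOST` environment variables and the `-ppn` flag are assumptions that vary by scheduler and MPI flavor:

```python
# Rough illustration of "count the resources, then wrap the command in mpiexec".
# NOT the actual ezpz.launch implementation; variable and flag names are assumptions.
import os
import shutil
import subprocess
import sys


def launch_sketch(module: str, *args: str) -> None:
    nhosts = int(os.environ.get("NHOSTS", "1"))                # hypothetical: hosts in the job
    ngpu_per_host = int(os.environ.get("NGPU_PER_HOST", "1"))  # hypothetical: accelerators per host
    world_size = nhosts * ngpu_per_host
    cmd = [
        shutil.which("mpiexec") or "mpiexec",
        "-n", str(world_size),       # total number of ranks
        "-ppn", str(ngpu_per_host),  # ranks per host (flag name differs across MPI implementations)
        sys.executable, "-m", module, *args,
    ]
    print("Launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)


# e.g.:
# launch_sketch("ezpz.hf_trainer", "--model_name_or_path", "meta-llama/Llama-3.2-1B")
```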
> **Tip**: Call
>
> ```bash
> python3 -m ezpz.hf_trainer --help
> ```
>
> to see the full list of supported arguments. In particular, any `transformers.TrainingArguments` should be supported.
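For example, most of the flags in the launch command above map directly onto `transformers.TrainingArguments` fields, as in the illustrative snippet below. Flags such as `--model_name_or_path`, `--dataset_name`, and `--block-size` are not `TrainingArguments` fields and are handled by the script's own arguments:

```python
# Illustrative only: the TrainingArguments equivalent of the CLI flags used above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="hf-trainer-output/example",
    bf16=True,                          # validated at construction; needs bf16-capable hardware
    do_train=True,
    do_eval=True,
    do_predict=True,
    auto_find_batch_size=True,
    gradient_checkpointing=True,
    include_num_input_tokens_seen=True,
    include_tokens_per_second=True,
    log_level="info",
    logging_steps=1,
    max_steps=10_000,
    report_to="wandb",
    deepspeed="ds_configs/deepspeed_zero1_auto_config.json",  # path from the config step above
)
```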
## DeepSpeed Support

Additionally, DeepSpeed is fully supported and can be configured by specifying the path to a compatible DeepSpeed config JSON file, e.g.:

Build a DeepSpeed config:

```bash
python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=2)'
```
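For reference, a ZeRO "auto" config of this kind typically looks something like the sketch below; the exact file written by `write_deepspeed_zero12_auto_config` may differ, and the `"auto"` values are resolved by `transformers.Trainer` from its own arguments at runtime:

```python
# Sketch of a representative ZeRO stage-2 "auto" DeepSpeed config
# (assumption: the actual generated file may contain additional fields).
import json

ds_config = {
    "train_batch_size": "auto",               # filled in by transformers.Trainer
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},        # stage 1 for the Getting Started example
}

print(json.dumps(ds_config, indent=2))
```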
Train:

```bash
python3 -m ezpz.launch -m ezpz.hf_trainer \
  --dataset_name stanfordnlp/imdb \
  --model_name_or_path meta-llama/Llama-3.2-1B \
  --bf16 \
  --do_train \
  --report-to=wandb \
  --logging-steps=1 \
  --include-tokens-per-second=true \
  --auto-find-batch-size=true \
  --deepspeed=ds_configs/deepspeed_zero2_auto_config.json
```
2 ez