
Language Model Training with 🍋 ezpz and 🤗 HF Trainer

The src/ezpz/hf_trainer.py module provides a mechanism for distributed training with 🤗 huggingface/transformers.

In particular, it allows for distributed training using the transformers.Trainer object with any¹ (compatible) combination of {models, datasets}.

🐣 Getting Started

  1. 🏑 Setup environment (on ANY {Intel, NVIDIA, AMD} accelerator)

    source <(curl -s 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
    ezpz_setup_env
    
  2. 📦 Install dependencies:

    1. Install 🍋 ezpz (from GitHub):

      python3 -m pip install "git+https://github.com/saforem2/ezpz" --require-virtualenv

    2. Update {tiktoken, sentencepiece, transformers, evaluate}:

      python3 -m pip install --upgrade tiktoken sentencepiece transformers evaluate

  3. ⚙️ Build DeepSpeed config:

    python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=1)'
    
  4. 🚀 Launch training:

    TSTAMP=$(date +%s)
    python3 -m ezpz.launch -m ezpz.hf_trainer \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --dataset_name stanfordnlp/imdb \
      --deepspeed=ds_configs/deepspeed_zero1_auto_config.json \
      --auto-find-batch-size=true \
      --bf16=true \
      --block-size=4096 \
      --do-eval=true \
      --do-predict=true \
      --do-train=true \
      --gradient-checkpointing=true \
      --include-for-metrics=inputs,loss \
      --include-num-input-tokens-seen=true \
      --include-tokens-per-second=true \
      --log-level=info \
      --logging-steps=1 \
      --max-steps=10000 \
      --output_dir="hf-trainer-output/${TSTAMP}" \
      --report-to=wandb \
      | tee "hf-trainer-output-${TSTAMP}.log"
    
    • 🪄 Magic:

      Behind the scenes, this will 🪄 automagically determine the specifics of the running job and use this information to construct (and subsequently run) the appropriate:

      mpiexec <mpi-args> $(which python3) <cmd-to-launch>
      

      across all of our available accelerators (an illustrative sketch of such a command appears just after this list).

      • ➕ Tip:

        Call:

        python3 -m ezpz.hf_trainer --help
        

        to see the full list of supported arguments.

        In particular, any transformers.TrainingArguments should be supported (see the sketch after this list).
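
For example, on a hypothetical allocation of 2 nodes with 4 accelerators each, the constructed command might look roughly like the sketch below. The MPICH-style flags, rank counts, and argument values shown are assumptions; ezpz.launch derives the real values from the active job.

    # Illustrative sketch only (assumes 2 nodes x 4 accelerators = 8 ranks);
    # the actual flags, rank count, and hostfile are derived from the running job.
    mpiexec -n 8 -ppn 4 \
      $(which python3) -m ezpz.hf_trainer \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --dataset_name stanfordnlp/imdb \
      --do-train=true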
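
Similarly, common TrainingArguments such as the learning rate, warmup steps, or per-device batch size can be passed straight through on the command line. The values below are arbitrary and only illustrate the pattern; training can likewise be resumed from a saved checkpoint with the standard --resume-from-checkpoint argument.

    # Any transformers.TrainingArguments field can be supplied as a CLI flag;
    # the specific values here are arbitrary and for illustration only.
    python3 -m ezpz.launch -m ezpz.hf_trainer \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --dataset_name stanfordnlp/imdb \
      --do-train=true \
      --learning-rate=2e-5 \
      --warmup-steps=100 \
      --per-device-train-batch-size=1 \
      --max-steps=1000 \
      --output_dir=hf-trainer-output/custom-args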

🚀 DeepSpeed Support

Additionally, DeepSpeed is fully supported and can be configured by passing the path to a compatible DeepSpeed config JSON file (an illustrative sketch of such a file follows the steps below), e.g.:

  1. Build a DeepSpeed config:

    python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=2)'

  2. Train:

    python3 -m ezpz.launch -m ezpz.hf_trainer \
      --dataset_name stanfordnlp/imdb \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --bf16 \
      --do_train \
      --report-to=wandb \
      --logging-steps=1 \
      --include-tokens-per-second=true \
      --auto-find-batch-size=true \
      --deepspeed=ds_configs/deepspeed_zero2_auto_config.json
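
The generated file is a small JSON document that the HF Trainer / DeepSpeed integration completes at runtime: fields set to "auto" are filled in from the matching TrainingArguments. The commented JSON below is a rough sketch only (the exact keys written by write_deepspeed_zero12_auto_config are an assumption); the inspection command uses the same path passed to --deepspeed above.

    # Optional: inspect the generated config (same path as the --deepspeed flag above).
    # A ZeRO-2 "auto" config typically looks roughly like the commented JSON sketch
    # below (illustrative; the generated file's exact contents may differ). Values
    # set to "auto" are resolved at runtime from the corresponding TrainingArguments.
    #
    #   {
    #     "train_batch_size": "auto",
    #     "train_micro_batch_size_per_gpu": "auto",
    #     "gradient_accumulation_steps": "auto",
    #     "gradient_clipping": "auto",
    #     "bf16": { "enabled": "auto" },
    #     "zero_optimization": { "stage": 2 }
    #   }
    python3 -m json.tool ds_configs/deepspeed_zero2_auto_config.json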

😎 2 ez


  ¹ See the full list of supported models at: https://hf.co/models?filter=text-generation