Training LLMs with 🍋 ezpz + 🤗 Trainer

The src/ezpz/examples/hf_trainer.py module provides a mechanism for distributed training with 🤗 huggingface/transformers.

In particular, it allows for distributed training using the transformers.Trainer object with any¹ (compatible) combination of {models, datasets}.
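For orientation, the block below is a minimal, hand-written sketch of the kind of transformers.Trainer loop that hf_trainer.py wraps; it is not the script itself, and the model/dataset names simply mirror the quick start below (any compatible pair would work the same way).

    # Minimal sketch (NOT ezpz's actual implementation) of causal-LM
    # fine-tuning with transformers.Trainer on a Hub dataset.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "meta-llama/Llama-3.2-1B"  # any causal-LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    raw = load_dataset("stanfordnlp/imdb", split="train")
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True,
        remove_columns=raw.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="trainer-output",
            per_device_train_batch_size=1,
            bf16=True,  # requires a bf16-capable accelerator
            logging_steps=1,
            max_steps=100,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The actual script layers the distributed setup, launching, and DeepSpeed/FSDP handling described below on top of this pattern, driven entirely by command-line flags.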

[!NOTE] Quick start:

# setup env
source <(curl -sL https://bit.ly/ezpz-utils)
ezpz_setup_env

# install ezpz
uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz"

# launch ezpz.examples.hf_trainer
ezpz launch -- python3 -m ezpz.examples.hf_trainer \
   --streaming \
   --dataset_name=stanfordnlp/imdb \
   --tokenizer_name meta-llama/Llama-3.2-1B \
   --model_name_or_path meta-llama/Llama-3.2-1B \
   --bf16=true \
   --do_train=true \
   --do_eval=true \
   --report-to=wandb \
   --logging-steps=1 \
   --include-tokens-per-second=true \
   --max-steps=50000 \
   --include-num-input-tokens-seen=true \
   --optim=adamw_torch \
   --logging-first-step \
   --include-for-metrics='inputs,loss' \
   --max-eval-samples=50 \
   --per_device_train_batch_size=1 \
   --block-size=8192 \
   --gradient_checkpointing=true # --fsdp=shard_grad_op
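Note that meta-llama/Llama-3.2-1B is a gated checkpoint on the Hugging Face Hub, so the quick start assumes you have accepted the model license and authenticated. One way to do that (besides exporting HF_TOKEN in your shell) is:

    # One-time Hugging Face Hub authentication for gated checkpoints
    # such as meta-llama/Llama-3.2-1B.
    from huggingface_hub import login

    login()  # prompts for a token; or pass login(token="hf_...")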

🐣 Getting Started

  1. 🏡 Setup environment (on ANY {Intel, NVIDIA, AMD} accelerator)

    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_env
    
  2. 📦 Install 🍋 ezpz (from GitHub):

    uv pip install --no-cache --link-mode=copy "git+https://github.com/saforem2/ezpz"
    # or:
    # python3 -m pip install "git+https://github.com/saforem2/ezpz" --require-virtualenv


➕ Details

  1. ⚙️ Build DeepSpeed config:

    python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=1)'
    
  2. 🚀 Launch training:

    TSTAMP=$(date +%s)
    python3 -m ezpz.launch -m ezpz.examples.hf_trainer \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --dataset_name stanfordnlp/imdb \
      --deepspeed=ds_configs/deepspeed_zero1_auto_config.json \
      --auto-find-batch-size=true \
      --bf16=true \
      --block-size=4096 \
      --do-eval=true \
      --do-predict=true \
      --do-train=true \
      --gradient-checkpointing=true \
      --include-for-metrics=inputs,loss \
      --include-num-input-tokens-seen=true \
      --include-tokens-per-second=true \
      --log-level=info \
      --logging-steps=1 \
      --max-steps=10000 \
      --output_dir="hf-trainer-output/${TSTAMP}" \
      --report-to=wandb \
      | tee "hf-trainer-output-${TSTAMP}.log"
    
    • 🪄 Magic:

      Behind the scenes, this will 🪄 automagically determine the specifics of the running job, and use this information to construct (and subsequently run) the appropriate:

      mpiexec <mpi-args> $(which python3) <cmd-to-launch>
      

      across all of our available accelerators.

      • ➕ Tip:

        Call:

        python3 -m ezpz.examples.hf_trainer --help
        

        to see the full list of supported arguments.

        In particular, any transformers.TrainingArguments option should be supported.
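Since any transformers.TrainingArguments option is supported, those flags presumably end up in a TrainingArguments dataclass. The snippet below is a stand-alone illustration of that mapping via transformers.HfArgumentParser; it is not hf_trainer's actual argument-parsing code, and the flag values are arbitrary examples.

    # Illustration only: how CLI-style flags map onto TrainingArguments
    # fields via HfArgumentParser (not hf_trainer's actual parsing code).
    from transformers import HfArgumentParser, TrainingArguments

    parser = HfArgumentParser(TrainingArguments)
    (training_args,) = parser.parse_args_into_dataclasses(
        args=[
            "--output_dir=hf-trainer-output/demo",
            "--bf16=true",
            "--logging_steps=1",
            "--max_steps=10000",
            "--include_tokens_per_second=true",
        ]
    )
    print(training_args.max_steps, training_args.bf16)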

🚀 DeepSpeed Support

Additionally, DeepSpeed is fully supported and can be configured by specifying the path to a compatible DeepSpeed config json file, e.g.:

  1. Build a DeepSpeed config:

    python3 -c 'import ezpz; ezpz.utils.write_deepspeed_zero12_auto_config(zero_stage=2)'

  2. Train:

    python3 -m ezpz.launch -m ezpz.examples.hf_trainer \
      --dataset_name stanfordnlp/imdb \
      --model_name_or_path meta-llama/Llama-3.2-1B \
      --bf16 \
      --do_train \
      --report-to=wandb \
      --logging-steps=1 \
      --include-tokens-per-second=true \
      --auto-find-batch-size=true \
      --deepspeed=ds_configs/deepspeed_zero2_auto_config.json
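For reference, the file written by write_deepspeed_zero12_auto_config is a standard DeepSpeed ZeRO config. The sketch below hand-rolls a comparable ZeRO-2 config using "auto" placeholders, which the 🤗 Trainer DeepSpeed integration fills in at runtime; the exact contents ezpz generates may differ.

    # Hand-rolled approximation of a ZeRO-2 "auto" DeepSpeed config;
    # the JSON emitted by ezpz.utils.write_deepspeed_zero12_auto_config()
    # may differ in its details.
    import json
    from pathlib import Path

    config = {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }

    out = Path("ds_configs/deepspeed_zero2_auto_config.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(config, indent=2))
    print(f"wrote {out}")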

😎 2 ez


  1. See the full list of supported models at: https://hf.co/models?filter=text-generation