Selecting Training Hyper-Parameters And Model Initializations

The easiest way to find a good hparam and model-init starter set is to steal it from a similar training that you know has succeeded. Here is a collection of public LLM/VLM training logbooks to get you started. The other common source is papers, if they disclose that information. You can also try to reach out to the authors and ask them for these details if they didn't publish them.

Glossary

Training jargon uses a multitude of abbreviations and terms, so here are the important ones for this chapter.

  • BS: Batch Size - here we mean batch size per GPU; it is often also referred to as MBS (micro-batch size)
  • GBS: Global Batch Size - total batch size per iteration - may include gradient accumulation
  • GAS: Gradient Accumulation Steps - how many forward/backward cycles to perform before one full iteration is complete
  • TFLOPs: Trillion FLOPs per second (see FLOPS)
  • PP: Pipeline Parallelism
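
To make the relationship between these terms concrete, here is a minimal sketch; DP (the data-parallel degree, i.e. the number of model replicas) is not in the glossary above but is needed to close the equation, and the numbers are made up purely for illustration:

```python
# How the glossary terms fit together: the global batch size is the per-GPU
# micro-batch size times the gradient accumulation steps times the number of
# data-parallel replicas (DP).
mbs = 2    # micro-batch size per GPU (BS/MBS above)
gas = 16   # gradient accumulation steps
dp  = 64   # data-parallel degree (number of model replicas)

gbs = mbs * gas * dp
print(f"GBS = {mbs} * {gas} * {dp} = {gbs}")   # GBS = 2 * 16 * 64 = 2048
```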

Global Batch Size Ramp Up

If you intend to train with a very large GBS, say 1024 or 2048 samples or even higher, it's very wasteful to feed such large batch sizes to the model when you just start training. At this point the model is essentially random and can't benefit from the more refined gradient signal a large batch provides. Therefore, to save data and resources, one often ramps up the global batch size over some period of time.

It's also important not to start with a GBS that is too small, since otherwise progress won't be efficient: when there is too little data per iteration, the compute (TFLOPS) is underutilized and everything slows down. This is especially true when Pipeline Parallelism (PP) is used, since the most important part of PP tune-up is keeping the GPU-idleness bubble small, and the smaller the GBS, the larger the bubble is (see the sketch below).
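
To see why a small GBS inflates the pipeline bubble, here is a rough sketch using the commonly cited GPipe/1F1B bubble estimate; the parallelism numbers below are illustrative and are not the BLOOM configuration:

```python
# Hedged sketch: the usual pipeline bubble estimate for GPipe/1F1B-style schedules
# is bubble_fraction = (p - 1) / (m + p - 1), where p is the number of pipeline
# stages and m the number of micro-batches per iteration. Since m = GBS / (MBS * DP),
# a larger GBS means more micro-batches and therefore a smaller bubble.
def bubble_fraction(pp: int, gbs: int, mbs: int, dp: int) -> float:
    m = gbs // (mbs * dp)          # micro-batches flowing through the pipeline
    return (pp - 1) / (m + pp - 1)

# illustrative numbers only
print(f"{bubble_fraction(pp=12, gbs=16,   mbs=2, dp=8):.2f}")  # ~0.92 - tiny GBS, huge bubble
print(f"{bubble_fraction(pp=12, gbs=2048, mbs=2, dp=8):.2f}")  # ~0.08 - large GBS, small bubble
```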

For example, for BLOOM-176B, where we did use PP, after doing throughput benchmarking we found that starting with GBS=16 was incredibly slow (8 TFLOPs), so we started with GBS=192 (73 TFLOPs) instead and then ramped up to GBS=2048 (150 TFLOPs), increasing GBS by 16 every 9_765_625 samples.
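
The arithmetic of such a linear ramp-up schedule is simple; here is a minimal sketch plugging in the BLOOM-176B numbers quoted above (in practice the training framework handles this for you, e.g. Megatron-LM's `--rampup-batch-size`, so this is only an illustration):

```python
# Minimal sketch of a linear GBS ramp-up, using the BLOOM-176B numbers quoted above
# (start at 192, add 16 every 9_765_625 consumed samples, cap at 2048).
def gbs_at(consumed_samples: int,
           start: int = 192, step: int = 16,
           samples_per_increment: int = 9_765_625, cap: int = 2048) -> int:
    increments = consumed_samples // samples_per_increment
    return min(start + step * increments, cap)

print(gbs_at(0))              # 192
print(gbs_at(50_000_000))     # 272  (5 increments of 16)
print(gbs_at(2_000_000_000))  # 2048 (capped)
```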

STD Init

This hyperparameter is super-important and it requires math to get right. For details see STD Init.
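
As a hedged illustration of the kind of math involved, one widely referenced scheme is the "small init" of Nguyen & Salazar (Transformers without Tears), which sets the std as a function of the hidden size; whether it is the right choice for your model is exactly what the linked STD Init chapter discusses:

```python
import math

# "Small init" (Nguyen & Salazar): std = sqrt(2 / (5 * hidden_size)).
# One common choice used in some large-scale trainings; shown here only as an
# example of std scaling with model width, not as a universal recipe.
def small_init_std(hidden_size: int) -> float:
    return math.sqrt(2 / (5 * hidden_size))

print(f"{small_init_std(14336):.5f}")  # hidden_size=14336 -> ~0.00528
```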
