Machine Learning Engineering Open Book

Stas Bekman

February 20, 2024

This is an open collection of methodologies, tools, and step-by-step instructions to help with the successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators; the content contains many scripts and copy-n-paste commands so that you can quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLMs) and VLMs; much of this know-how I acquired while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, I'm developing/training open-source Retrieval Augmented Generation (RAG) models at Contextual.AI.

I've been compiling this information mostly for myself, so that I can quickly find solutions I have already researched in the past and know to work, but as usual I'm happy to share it with the wider ML community.

Table of Contents

My apologies if the layout is a bit unstable while I'm writing new chapters and gradually re-organizing the content to be more intuitive.

Part 1. Insights

  1. The AI Battlefield Engineering - what you need to know in order to succeed

Part 2. Hardware

  1. Compute - accelerators, CPUs, CPU memory.

  2. Storage - local, distributed and shared file systems.

  3. Network - intra- and inter-node networking.

Part 3. Orchestration

  1. SLURM - the main orchestration environment

Part 4. Training

  1. Training - model training related guides

Part 5. Development

  1. Debugging and Troubleshooting - how to debug easy and difficult issues

  2. And more debugging

  3. Testing - numerous tips and tools to make test writing enjoyable

Part 6. Miscellaneous

  1. Resources - LLM/VLM chronicles

Updates

I announce any significant updates on my Twitter channel: https://twitter.com/StasBekman

PDF version

Download the PDF version of the book.

I will try to rebuild it about once a week, but if you want the latest version, the build instructions are here.

Thanks to HuggingFace for giving me permission to host my book's PDF on the HF Hub.

Shortcuts

Things that you are likely to need to find quickly and often.

  • ๐Ÿ› ๏ธ Tools:
    • all_reduce_bench.py - a much easier way to benchmark network throughput than nccl-tests.
    • torch-distributed-gpu-test.py - a tool to quickly test your inter-node connectivity
  • ๐Ÿ“œ Guides:
    • debugging pytorch applications - quick copy-n-paste solutions to resolve hanging or breaking pytorch applications
    • slurm for users - a slurm cheatsheet and tricks
    • make tiny models/datasets/tokenizers
    • LLM/VLM chronicles collection

Gratitude

None of this would have been possible without me being entrusted with the specific LLM/VLM trainings from which I learned this know-how. This is a privilege that only a few enjoy due to the prohibitively expensive cost of renting huge ML compute clusters. So hopefully the rest of the ML community can learn vicariously from these notes.

Special thanks go to Thom Wolf, who proposed that I lead the BLOOM-176B training back when I didn't know anything about large-scale training. This was the project that catapulted me into the intense learning process. And, of course, thanks to HuggingFace for giving me the opportunity to work full time on the BLOOM-176B and later the IDEFICS-80B trainings.

Contributing

If you found a bug or a typo, or would like to propose an improvement, please don't hesitate to open an Issue or contribute a PR.

License

The content of this site is distributed under the Attribution-ShareAlike 4.0 International license.

My repositories map

✔ Machine Learning: ML Engineering Open Book | ML ways | Porting

✔ Guides: The Art of Debugging

✔ Applications: ipyexperiments

✔ Tools and Cheatsheets: bash | conda | git | jupyter-notebook | make | python | tensorboard | unix

โค๏ธโ€๐Ÿฉน Status
Last Updated: 02/20/2024 @ 23:05:39


Citation

BibTeX citation:
@online{bekman2024,
  author = {Bekman, Stas and Foreman, Sam},
  title = {ML {Engineering}},
  date = {2024-02-20},
  url = {https://saforem2.github.io/ml-engineering},
  langid = {en}
}
For attribution, please cite this work as:
Bekman, Stas, and Sam Foreman. 2024. "ML Engineering." February 20, 2024. https://saforem2.github.io/ml-engineering.
