📓 Resources

February 20, 2024

Publicly available training LLM/VLM logbooks

Logbooks and chronicles of LLM/VLM training are among the best sources for learning how to deal with training instabilities and how to choose good hyperparameters.

If you know of a public LLM/VLM training logbook that is not on this list, please let me know or add it via a PR. Thank you!

The listing is in no particular order, other than being grouped by year.

2021

  • BigScience pre-BLOOM 108B training experiments (2021): chronicles | the full spec and discussions (backup: 1 | 2)

2022

  • BigScience BLOOM-176B (2022): chronicles-prequel | chronicles | the full spec and discussions (backup: 1 | 2 | 3)

  • Meta OPT-175B (2022): logbook | Video (backup: 1)

  • THUDM GLM-130B (2022): English logbook | Mandarin version (backup: 1 | 2)

2023

  • HuggingFace IDEFICS-80B multimodal (Flamingo repro) (2023): Learning log | Training Chronicles (backup: 1 | 2)

  • BloombergGPT 50B LLM - section C in BloombergGPT: A Large Language Model for Finance
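
A recurring theme across these logbooks (e.g. the BLOOM chronicles and the OPT-175B logbook) is recovering from loss spikes by rolling back to the last healthy checkpoint and skipping the stretch of data that triggered the spike. The sketch below is purely illustrative and not taken from any of the logbooks; `train_step`, `get_batch`, `save_ckpt` and `load_ckpt` are hypothetical callbacks standing in for a real training framework.

```python
import statistics
from collections import deque

def train_with_spike_recovery(train_step, get_batch, save_ckpt, load_ckpt,
                              max_steps, ckpt_every=1000,
                              spike_factor=2.0, skip_batches=100):
    """Toy training loop illustrating loss-spike rollback.

    train_step, get_batch, save_ckpt and load_ckpt are hypothetical
    callbacks standing in for a real training framework.
    """
    recent = deque(maxlen=50)   # sliding window of recent losses
    step = last_good = 0
    while step < max_steps:
        loss = train_step(get_batch(step))

        # spike test: the new loss is far above the recent median
        if len(recent) == recent.maxlen and loss > spike_factor * statistics.median(recent):
            # roll back to the last good checkpoint and skip past the data
            # that appears to have triggered the spike
            load_ckpt(last_good)
            step = last_good + skip_batches
            recent.clear()
            continue

        recent.append(loss)
        if step and step % ckpt_every == 0:
            save_ckpt(step)
            last_good = step
        step += 1
```

The spike threshold, the window length and how much data to skip are all judgment calls, and the teams often paired a rollback with other mitigations such as lowering the learning rate; the logbooks above document how those decisions played out in real runs.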
