ML Engineering


🧠 Insights

February 20, 2024

This chapter is one person’s opinionated overview of the ML/AI Engineering reality, which may or may not be another person’s reality. The intention is to help you start asking the right questions and get your ML Engineering needs met.

See The AI Battlefield Engineering – What You Need to Know


Citation

BibTeX citation:
@online{bekman2024,
  author = {Bekman, Stas and Foreman, Sam},
  title = {ML {Engineering}},
  date = {2024-02-20},
  url = {https://saforem2.github.io/ml-engineering},
  langid = {en}
}
For attribution, please cite this work as:
Bekman, Stas, and Sam Foreman. 2024. “ML Engineering.” February 20, 2024. https://saforem2.github.io/ml-engineering.
