CPU memory

February 20, 2024

This is a tiny chapter, since usually there are very few nuances one needs to know about CPU memory - which is a good thing!

Most of the ML workload's compute happens on GPUs, but each node should typically have at least as much CPU memory as it has GPU memory. For example, if you're on an H100 node with 8x 80GB GPUs, you have 640GB of GPU memory, so you want at least 640GB of CPU memory. Most recent high-end cloud offerings usually come with 1-2TB of CPU memory anyway.
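
As a quick sanity check, here is a minimal sketch (assuming a CUDA node with `torch` and `psutil` installed) that compares the node's total CPU memory against the combined GPU memory:

```python
import psutil
import torch

# Total CPU memory on this node
cpu_bytes = psutil.virtual_memory().total

# Sum the memory of all visible GPUs
gpu_bytes = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
)

print(f"CPU memory: {cpu_bytes / 2**30:.0f} GiB")
print(f"GPU memory: {gpu_bytes / 2**30:.0f} GiB")
if cpu_bytes < gpu_bytes:
    print("Warning: less CPU memory than total GPU memory on this node")
```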

What CPU memory is needed for in ML workloads

  • Loading the model weights, unless they are loaded directly onto the GPUs - this is usually transitory memory usage that drops back to zero once the model has been moved to the GPUs (see the first sketch after this list).
  • Saving the model weights. In some situations each GPU writes its own checkpoint directly to disk; in other cases the model is recomposed on the CPU before it's written to disk - this too is transitory memory usage.
  • Possible parameter and optimizer state offloading when using frameworks like DeepSpeed, in which case quite a lot of CPU memory might be needed (a sample configuration is sketched after this list).
  • Activations calculated in the forward pass, which need to be available for the backward pass, can also be offloaded to CPU rather than discarded and recomputed during the backward pass, trading CPU memory for the recomputation overhead.
  • The DataLoader is usually one of the main users of CPU memory and at times it may consume very large amounts of it. With 2 DataLoader workers per GPU there are at least 16 worker processes on an 8-GPU node, so you need enough memory to support at least 16 processes, each holding some data. For example, when streaming data from the cloud, if the data shards are large these processes could easily eat up hundreds of GBs of CPU memory (see the back-of-the-envelope estimate after this list).
  • The software itself and its dependent libraries use a bit of CPU memory, but this amount is usually negligible.
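
To illustrate the transitory CPU memory usage during loading, here is a minimal sketch of the common load-to-CPU-then-move-to-GPU pattern (the model and checkpoint path are hypothetical, and a CUDA device is assumed):

```python
import torch
from torch import nn

# Hypothetical tiny model and checkpoint path, for illustration only
model = nn.Linear(1024, 1024)
torch.save(model.state_dict(), "checkpoint.pt")

# Load onto CPU first - this is where the transitory CPU memory is used
state_dict = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.to("cuda")  # weights now live on the GPU

# Once the CPU copy is released, CPU memory usage drops back down
del state_dict
```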
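For the offloading case, a DeepSpeed ZeRO-3 configuration along these lines moves parameters and optimizer states into CPU memory - a sketch of just the relevant section, with illustrative values (consult the DeepSpeed documentation for the full set of options):

```python
# Sketch of a DeepSpeed ZeRO-3 config dict that offloads parameters and
# optimizer states to CPU memory; pass it to deepspeed.initialize() or
# save it as a JSON config file for the launcher
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```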
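And a back-of-the-envelope estimate for DataLoader memory, where all the numbers are purely illustrative assumptions:

```python
# All numbers are illustrative assumptions - plug in your own
gpus_per_node = 8
workers_per_gpu = 2   # e.g. DataLoader(num_workers=2) per rank
shard_size_gib = 10   # size of one streamed data shard held by a worker

workers = gpus_per_node * workers_per_gpu
print(f"{workers} workers x {shard_size_gib} GiB ~= "
      f"{workers * shard_size_gib} GiB of CPU memory for data alone")
```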

Things to know

  • If the DataLoader uses HF datasets in mmap mode, the resident memory usage (RSS) may appear huge, since the OS maps the whole dataset file into the process's address space. This is misleading: if the memory is needed elsewhere, the OS will simply reclaim any unneeded mmap'ed pages, since they are backed by the file on disk. You can read more about it here. This awareness, of course, applies to any dataset using mmap; I was using HF datasets as an example since it's very widely used.
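
One way to see past the inflated RSS number is to compare it with USS (unique set size), which excludes shared file-backed pages - a minimal sketch using `psutil` (the `uss` field is available on Linux):

```python
import os
import psutil

proc = psutil.Process(os.getpid())

# RSS counts mmap'ed file pages that are currently resident,
# so it can look huge when a large dataset is memory-mapped
rss_gib = proc.memory_info().rss / 2**30

# USS counts only pages unique to this process - a better proxy
# for how much memory the process actually "owns"
uss_gib = proc.memory_full_info().uss / 2**30

print(f"RSS: {rss_gib:.2f} GiB  USS: {uss_gib:.2f} GiB")
```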
