đ Debugging
Debugging and Troubleshooting
Guides
Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs
NCCL Debug and Performance - notes for debugging NCCL-based software and tuning it up for the peak performance
Tools
torch-distributed-gpu-test.py - this a
torch.distributed
diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate gpu memory.NicerTrace - this is an improved
trace
python module with multiple additional flags added to the constructor and more useful output.
Citation
BibTeX citation:
@online{bekman2024,
author = {Bekman, Stas and Foreman, Sam},
title = {ML {Engineering}},
date = {2024-02-20},
url = {https://saforem2.github.io/ml-engineering},
langid = {en}
}
For attribution, please cite this work as:
Bekman, Stas, and Sam Foreman. 2024. âML Engineering.â
February 20, 2024. https://saforem2.github.io/ml-engineering.