🐛 Debugging

February 20, 2024

Debugging and Troubleshooting

Guides

  • Debugging PyTorch programs

  • Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs

  • Troubleshooting NVIDIA GPUs

  • Underflow and Overflow Detection

  • NCCL Debug and Performance - notes for debugging NCCL-based software and tuning it up for peak performance (a minimal example of enabling NCCL's debug logging follows this list)

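The NCCL guide above relies heavily on reading NCCL's own logs. As a quick illustration (not taken from the guide itself), the sketch below shows one common way to turn that logging on from a PyTorch program, by setting the standard `NCCL_DEBUG` / `NCCL_DEBUG_SUBSYS` environment variables before the process group is created. It assumes the script is launched with `torchrun` on NCCL-capable GPUs.

```python
# A minimal sketch (not from the guide) of enabling NCCL's built-in debug
# logging for a PyTorch distributed job. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are
# standard NCCL environment variables and must be set before the NCCL
# communicator is created.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")             # VERSION, WARN, INFO, TRACE
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # limit output to init + networking

import torch
import torch.distributed as dist

# assumes the usual torchrun-provided env vars (RANK, WORLD_SIZE, MASTER_ADDR, ...)
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# any collective will now emit NCCL INFO lines showing the topology,
# transports (NVLink, IB, socket) and algorithms NCCL selected
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
dist.destroy_process_group()
```
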
Tools

  • Debug Tools

  • torch-distributed-gpu-test.py - this is a torch.distributed diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory (a sketch of the kind of check it performs follows this list).

  • NicerTrace - this is an improved version of Python's trace module, with multiple additional flags added to the constructor and more useful output.

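The listing below is not the actual torch-distributed-gpu-test.py script, just a minimal sketch of the kind of check it performs: every rank joins an NCCL process group, allocates a small tensor on its GPU, and participates in an all_reduce and a barrier, so a hang or error points directly at a connectivity or GPU-memory problem.

```python
# Minimal sketch of a torch.distributed connectivity check in the spirit of
# torch-distributed-gpu-test.py (not the actual script). Launch with e.g.:
#   torchrun --nproc_per_node=8 check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # uses env:// rendezvous from torchrun
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # can this rank allocate GPU memory?
    x = torch.ones(2**20, device="cuda")

    # can all ranks talk to each other? a hang here means a broken interconnect
    dist.all_reduce(x)
    dist.barrier()

    print(f"rank {rank}: OK (all_reduce made each element equal to {x[0].item()})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```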