Disabling NVLink Benchmark
Let's compare the training of a gpt2 language model over a small sample of wikitext.
The results are:
| NVLink | Time |
| --- | --- |
| Y | 101s |
| N | 131s |
You can see that NVLink completes the training ~23% faster (101.9s vs. 131.4s). In the second benchmark we use `NCCL_P2P_DISABLE=1` to tell the GPUs not to use NVLink, so they fall back to PCIe instead.
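To check how your GPUs are interconnected, you can print the topology matrix with `nvidia-smi topo -m` (the output below is illustrative for a 2-GPU setup; the actual labels depend on your hardware):

```bash
# print the GPU interconnect topology matrix; NV# means # NVLink links,
# while entries like PHB/PIX/SYS indicate different PCIe/system paths
nvidia-smi topo -m
#         GPU0    GPU1    CPU Affinity
# GPU0     X      NV2     0-23
# GPU1    NV2      X      0-23
```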
We will use the HF Transformers examples for this benchmark.
Here is the full benchmark code and outputs:
```bash
# DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}
```
```bash
# DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```
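As a quick sanity check on the ~23% figure, the relative runtime reduction can be recomputed from the two `train_runtime` values above:

```bash
# recompute the speedup from the reported runtimes
python -c "print(f'{(131.4367 - 101.9003) / 131.4367:.1%} less wall time with NVLink')"
# 22.5% less wall time with NVLink
```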
Hardware: 2x TITAN RTX 24GB each + NVLink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)

Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`