ML Engineering
    • Repository
    • Source Code
    • New Issue
  1. πŸ›œ Networking
  2. Networking Benchmarks
  3. Network Benchmarks Results
  4. Disabling NVLink Benchmark
  • πŸ““ Resources
  • ✏️ Testing
  • πŸ€— Transformers
  • πŸ’» Compute
    • CPU memory
    • CPU
    • Accelerators
      • Nvidia
        • Troubleshooting NVIDIA GPUs
  • πŸ› Debugging
    • A Back up of scripts
    • Faster debug and development with tiny models, tokenizers and datasets
    • NCCL: Debug and Performance
    • Debugging PyTorch programs
    • Debug Tools
    • Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs
    • Underflow and Overflow Detection
  • 🧠 Insights
    • πŸͺ– The AI Battlefield
  • πŸ›œ Networking
    • Networking Benchmarks
      • Network Benchmarks Results
        • Disabling NVLink Benchmark
  • 🎻 Orchestration
    • Working in SLURM Environment
      • SLURM Administration
      • Launchers with SLURM
      • SLURM Performance
      • SLURM for users
  • πŸ“¦ Storage
    • Benchmarks
      • Results
        • fio benchmark results for hope on 2023-12-20-14:37:02
  • πŸ‹οΈ Training
    • Tensor precision / Data types
    • Emulate a multi-node setup using just a single node
    • Selecting Training Hyper-Parameters And Model Initializations
    • Checkpoints
    • Fault Tolerance
    • Model Parallelism
    • Software Tune Up For The Best Performance
    • Reproducibility
    • Re-train HF Hub Models From Scratch Using Finetuning Examples
    • Avoiding, Recovering From and Understanding Instabilities
      • Understanding Training Loss Patterns

On this page

  • Disabling NVLink Benchmark
  • View source
  • Edit this page
  • Report an issue

Other Formats

  • Github (GFM)
  1. πŸ›œ Networking
  2. Networking Benchmarks
  3. Network Benchmarks Results
  4. Disabling NVLink Benchmark

February 20, 2024

Disabling NVLink Benchmark

Let’s compare the training of a gpt2 language model training over a small sample of wikitext.

The results are:

NVlink Time
Y 101s
N 131s

You can see that NVLink completes the training ~23% faster. In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink, which will use PCIe instead.

We will use HF Transformers examples.

Here is the full benchmark code and outputs:

# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}

Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (NV2 in nvidia-smi topo -m) Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0

Back to top

Citation

BibTeX citation:
@online{bekman2024,
  author = {Bekman, Stas and Foreman, Sam},
  title = {ML {Engineering}},
  date = {2024-02-20},
  url = {https://saforem2.github.io/ml-engineering},
  langid = {en}
}
For attribution, please cite this work as:
Bekman, Stas, and Sam Foreman. 2024. β€œML Engineering.” February 20, 2024. https://saforem2.github.io/ml-engineering.
Network Benchmarks Results
🎻 Orchestration

ML-Engineering

2024

  • View source
  • Edit this page
  • Report an issue