Networking Benchmarks
Tools:
- all_reduce_bench.py - a tool to benchmark the real network bandwidth while performing all_reduce on a largish amount of data. This is useful to find out what one gets in reality as compared to the advertised spec.
- all_gather_object_vs_all_reduce.py - a quick benchmark showing a 23x speed up when moving from all_gather_object to all_reduce when collecting completion status from the process group, e.g. when implementing some sort of all-processes-are-done flag. This technique is usually used for synchronizing gpus when they may complete at a different number of iterations - which one needs for inference over multiple DP channels, or when one wants to sync a StopIteration event in a DataLoader. See also all_gather_object_vs_all_gather.py. A minimal sketch of this flag technique follows this list.
- all_reduce_latency_comp.py - exemplifies how 1x 4GB reduction is much faster than 1000x 4MB reductions.
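For illustration, here is a minimal sketch of the all_reduce-based completion flag described in the second item. This is hypothetical code, not one of the scripts above, and it assumes a NCCL/CUDA process group has already been initialized; the function name all_ranks_done is made up for this example.

import torch
import torch.distributed as dist

def all_ranks_done(done: bool, device) -> bool:
    # each rank contributes 0 or 1; summing a single scalar tensor is far
    # cheaper than all_gather_object-ing Python objects across the group
    flag = torch.tensor(1.0 if done else 0.0, device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    # use `flag.item() > 0` instead if you need "any rank is done", e.g. to
    # propagate a StopIteration from one rank's DataLoader to all ranks
    return flag.item() == dist.get_world_size()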
Crucial reproducibility requirements
The most important requirement for a series of successful experiments is the ability to reproduce the experiment environment again and again while changing only one or a few setup variables.
Therefore when you try to figure out whether some change will improve performance or make it worse, you must figure out how to keep things stable.
For example, you need to find a way to keep the network usage from fluctuating. When we were doing performance optimizations for the 108B pre-BLOOM experiments it was close to impossible to do, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users were using the network. It simply wasn't workable. During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours. Doing the performance optimization in such an environment was just perfect.
Network throughput
It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for the network you will end up with idle gpus and thus you will have wasted money and time. If you overpay for a very fast network but your gpus are slow, then again you will have wasted money and time.
If your network is very slow, your training is likely to be network-bound, and many improvements in the training setup will not help improve performance.
Note: The EAI cookbook contains a set of communication benchmarks for each collective that you can use to quickly measure the throughput of your internode or intranode network.
Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network:
Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of the nodes.
To run it on 4 nodes:
GPUS_PER_NODE=8
NNODES=4
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--role `hostname -s`: \
--tee 3 \
all_reduce_bench.py
Notes:
- adapt MASTER_ADDR to the rank 0 hostname if it's not a SLURM environment where it's derived automatically.
Here is how to launch it in a SLURM env with 4 nodes:
salloc --partition=mypartition --nodes=4 --ntasks-per-node=1 --cpus-per-task=48 --gres=gpu:8 --time=1:00:00 bash
srun --gres=gpu:8 --nodes=4 --tasks-per-node=1 python -u -m torch.distributed.run --nproc_per_node=8 --nnodes 4 --rdzv_endpoint $(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1):6000 --rdzv_backend c10d all_reduce_bench.py
Notes:
- You are likely to need to adapt the --cpus-per-task and --partition arguments there.
- You do salloc once and then can repeat srun multiple times on the same allocation.
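If you want to see what such a measurement boils down to, here is a heavily stripped-down sketch of the core loop. This is only an illustration, not the actual all_reduce_bench.py, which is more thorough; the payload size, iteration count and the algbw-only reporting here are simplifying assumptions.

import os
import time
import torch
import torch.distributed as dist

def bench_all_reduce(size_mb=2048, iters=5, warmup=2):
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torch.distributed.run
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    # a largish fp32 payload (size_mb megabytes)
    x = torch.empty(size_mb * 1024 * 1024 // 4, dtype=torch.float32, device=device)
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters
    # algorithmic bandwidth: payload bits moved per second per call
    algbw_gbps = x.numel() * 4 * 8 / elapsed / 1e9
    if dist.get_rank() == 0:
        print(f"{size_mb}MB all_reduce: {elapsed*1e3:.1f}ms avg, ~{algbw_gbps:.0f}Gbps algbw")
    dist.destroy_process_group()

if __name__ == "__main__":
    bench_all_reduce()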
You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network-bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.
Frameworks that shard the weights and optimizer states, like Deepspeed w/ ZeRO Stage-3, generate a lot more traffic than frameworks like Megatron-Deepspeed which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.
Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage runs computations in parallel. So as long as the communication overhead is smaller than the compute time, the network requirements are satisfied and don't have to be super fantastic.
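To make the "comms smaller than compute" condition concrete, here is a tiny back-of-envelope check. All numbers below are made-up placeholders that you would replace with your own measurements (e.g. the bandwidth reported by the benchmark above).

# illustrative assumptions only - plug in your own measured values
comm_gb_per_iter = 100       # data each rank must exchange per iteration, in GB
network_gbps = 400           # sustained all_reduce bandwidth, in Gbps
compute_sec_per_iter = 2.5   # forward+backward time per iteration, in seconds

comm_sec_per_iter = comm_gb_per_iter * 8 / network_gbps
if comm_sec_per_iter <= compute_sec_per_iter:
    print(f"comm {comm_sec_per_iter:.2f}s <= compute {compute_sec_per_iter:.2f}s: comms can be fully overlapped")
else:
    print(f"comm {comm_sec_per_iter:.2f}s > compute {compute_sec_per_iter:.2f}s: training will be network-bound")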
To get a reasonable GPU throughput when training at scale (64+ GPUs) with DeepSpeed ZeRO Stage 3 and V100s:
- 100Gbps is not enough
- 200-400 Gbps is ok
- 800-1000 Gbps is ideal
Of course, the requirements are higher for A100 gpu nodes and even higher for H100s (but no such benchmark information has been shared yet).