The new Megatron-DeepSpeed release contains a variety of improvements and optimizations that enable pre-training Transformer-based architectures with significantly longer sequences than was previously possible.
We enabled an attention map memory optimization, in which the attention mask is first generated in CPU memory and then moved into GPU memory, avoiding out-of-memory errors when training with very long sequence lengths.
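The idea can be sketched as follows. This is an illustrative helper, not Megatron-DeepSpeed's actual implementation; the function name and shapes are assumptions:

```python
import torch

def build_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Hypothetical sketch: allocate the (seq_len, seq_len) mask on the CPU,
    # where host RAM is plentiful, then transfer the finished tensor to the
    # target device. Building it directly on the GPU can OOM at very long
    # sequence lengths, since the mask grows quadratically with seq_len.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device="cpu"))
    return mask.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mask = build_causal_mask(4096, device)
```

A boolean mask keeps the memory cost at one byte per entry, which matters once `seq_len` reaches the hundreds of thousands.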
We added position embedding partitioning, in which the position embedding weights are split across all GPUs when sequence parallelism is enabled, further reducing the memory footprint.
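The partitioning step can be sketched as below. The names (`tp_world_size`, `rank`, shard layout) are illustrative assumptions, not Megatron-DeepSpeed's API; the point is that each rank only keeps the embedding rows for its own slice of the sequence:

```python
import torch

# Hypothetical sketch: with sequence parallelism, each of the tp_world_size
# ranks processes a contiguous slice of the sequence, so the full
# (max_seq_len, hidden) position-embedding weight can be split along the
# sequence dimension instead of being replicated on every GPU.
max_seq_len, hidden, tp_world_size = 8192, 64, 4

full_weight = torch.randn(max_seq_len, hidden)           # replicated baseline
shards = torch.chunk(full_weight, tp_world_size, dim=0)  # one shard per rank

rank = 2  # this rank holds positions [rank * 2048, (rank + 1) * 2048)
local_weight = shards[rank]
```

Per rank, this cuts the position-embedding memory by a factor of `tp_world_size`.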
```bash
cd ./genslm/examples/long-sequences

# create a new virtual environment
mkdir -p "venvs/${MACHINE}/${CONDA_DATE}"
python3 -m venv "venvs/${MACHINE}/${CONDA_DATE}" --system-site-packages
source "venvs/${MACHINE}/${CONDA_DATE}/bin/activate"
```
Create a new folder (`genslm/examples/long-sequences/deps/${MACHINE}`) where we’ll install dependencies locally:
```bash
mkdir -p "deps/${MACHINE}"
cd "deps/${MACHINE}"
```
Dependencies
We provide below the details needed to install each of the required dependencies.
These newly introduced optimizations, in combination with ZeRO-Offload, allow us to go even further.
By employing ZeRO-Offload, which moves optimizer states from GPU to CPU memory, we free up additional GPU memory that can be used for even longer sequences.
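As a sketch, enabling ZeRO-Offload in DeepSpeed comes down to a config fragment like the one below. The specific stage and batch-size values here are illustrative placeholders, not the settings used in our runs:

```python
# Illustrative DeepSpeed config fragment enabling ZeRO-Offload:
# optimizer states are kept in (pinned) CPU memory rather than on the GPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,  # placeholder stage; offload also works with other ZeRO stages
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
```

Pinned host memory speeds up the CPU-GPU transfers that offloading introduces, at the cost of making that host memory non-pageable.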
Though work is still ongoing, this is a promising direction that will allow us to consider significantly larger genomes than previously possible.
We use Weights & Biases to track these experiments, and have aggregated our initial results in the W&B Report below.
We can evaluate the performance of our model by looking at two different metrics for throughput: samples_per_sec and TFLOPS.
Concretely, we see that we are able to scale to significantly longer sequences (420k / 128k ≈ 3.3×) with only a minimal impact on throughput performance (81 / 105 ≈ 77%).⁴
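The ratios quoted above can be checked directly (sequence lengths in tokens, throughput in TFLOPS, both taken from the figures in the text):

```python
# Back-of-the-envelope check of the scaling numbers quoted above.
baseline_seq, long_seq = 128_000, 420_000
baseline_tflops, long_tflops = 105, 81

seq_scaleup = long_seq / baseline_seq            # ~3.3x longer sequences
tflops_retained = long_tflops / baseline_tflops  # ~77% of baseline throughput
print(f"{seq_scaleup:.1f}x longer at {tflops_retained:.0%} throughput")
```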
Table 2: Impact on TFLOPS as a function of increasing sequence length. (Data: `throughput/TFLOPS`.)
The described experiments were performed on 4 NVIDIA DGX A100-40GB nodes, all using `TPSIZE=32`[^tpsize], connected through 8 HDR InfiniBand links (200 Gb/s per HDR).