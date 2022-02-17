Baidu's Silicon Valley AI Lab (SVAIL) announced an implementation of the ring allreduce algorithm for the deep learning community, which will enable significantly faster training of neural networks across GPU models. As neural networks have grown to include hundreds of millions or even over a billion parameters, the number of GPU nodes needed to train them has also increased. However, as the number of nodes grows, the system becomes less efficient in terms of how much computation each node performs. That makes algorithms that maximize performance across a highly parallel system increasingly important.

Using all the GPU nodes more efficiently means that neural network training finishes faster and that the company doing the training doesn't have to spend as much on hardware that would otherwise be underutilized. Baidu has taken one algorithm, called the "ring allreduce," from the high-performance computing (HPC) world and brought it to deep learning to increase the efficiency of its GPU nodes. The ring allreduce algorithm could speed up the training of an example neural network by 31x across 40 GPUs, compared to using a single GPU (roughly 78% of ideal linear scaling).
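To make the idea concrete, here is a minimal sketch of how a ring allreduce is commonly described in the HPC literature: the workers form a logical ring, first pass partial sums around it (scatter-reduce), then pass the finished pieces around it (allgather). This is an in-process Python simulation for illustration only, not Baidu's C++ library or its TensorFlow patch; the function name `ring_allreduce` and the list-of-lists representation of the workers are assumptions made for this example.

```python
def ring_allreduce(worker_data):
    """Simulate a sum-allreduce over a logical ring of workers.

    worker_data: one equal-length list of numbers per simulated worker
    (a hypothetical stand-in for each GPU's gradient buffer).
    Returns a new list of lists in which every worker holds the
    element-wise sum of all inputs.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of the same length")
    if n == 1:
        return [list(worker_data[0])]

    # Split the index range into n contiguous chunks (sizes differ by at most one).
    bounds = [(length * k) // n for k in range(n + 1)]
    buffers = [list(v) for v in worker_data]

    def chunk(k):
        # Slice covering chunk k, with k wrapped around the ring.
        k %= n
        return slice(bounds[k], bounds[k + 1])

    # Phase 1, scatter-reduce: in each step, worker i sends one chunk to its
    # right-hand neighbour, which adds it to its own copy. After n - 1 steps,
    # worker i holds the fully summed chunk (i + 1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        outgoing = [buffers[i][chunk(i - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            sl = chunk(i - step)
            buffers[dst][sl] = [a + b for a, b in zip(buffers[dst][sl], outgoing[i])]

    # Phase 2, allgather: the finished chunks travel once more around the
    # ring, and each receiver overwrites its stale copy.
    for step in range(n - 1):
        outgoing = [buffers[i][chunk(i + 1 - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            buffers[dst][chunk(i + 1 - step)] = outgoing[i]

    return buffers


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for worker, summed in enumerate(ring_allreduce(grads)):
        print(worker, summed)
```

In this scheme each worker only ever talks to its neighbour and sends 2 × (N − 1) chunks of roughly 1/N of the data each, so the amount of data a worker transmits stays nearly constant as N grows. That property is what makes the approach attractive when many GPUs are involved.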

[...] The group released its ring allreduce implementation as both a standalone C++ library and a patch for TensorFlow.