Distributed training of machine learning models via gradient descent is generally conducted by iteratively computing the local gradients of a loss function and aggregating them across all processors. Communicating these gradients during aggregation is often a major cost, but sparsification techniques can greatly improve efficiency. One such technique, Top-k gradient compression, ensures that only the k largest-magnitude components of each local gradient are sent. However, effectively scaling this method can be challenging. The standard ring AllReduce algorithm, which is frequently used to aggregate dense gradients, lacks a counterpart optimized for sparse data. Notably, ring algorithms are contention-free, which generally makes them easier to scale than other collective communication algorithms. Thus, in practice, the ring AllGather algorithm, which can be trivially adapted to sparse data, may be used instead, even though its bandwidth cost grows in proportion to the number of processors (unlike that of ring AllReduce). To provide a more scalable contention-free alternative, we present a variant of ring AllReduce that is optimized for sparse data. We compare it to the standard dense ring AllReduce and ring AllGather algorithms, and we evaluate it empirically using gradients sampled while fine-tuning Llama 2 7b.
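As a concrete illustration of the compression step described above, the following minimal Python sketch selects the k largest-magnitude components of a flattened gradient; the helper name top_k_sparsify and the use of PyTorch are illustrative assumptions, not details of the system presented here.

import torch

def top_k_sparsify(grad: torch.Tensor, k: int):
    # Illustrative helper: keep only the k largest-magnitude components
    # of a flattened local gradient.
    flat = grad.reshape(-1)
    _, indices = torch.topk(flat.abs(), k)   # indices of the k largest magnitudes
    values = flat[indices]                   # corresponding (signed) gradient values
    # Only (indices, values) need to be communicated; the receiver treats
    # all other components as zero.
    return indices, values

# Example: a 10-element gradient compressed to its 3 largest components.
g = torch.tensor([0.1, -2.0, 0.0, 0.5, 3.0, -0.2, 0.05, 1.5, -0.3, 0.0])
idx, vals = top_k_sparsify(g, k=3)  # idx = [4, 1, 7], vals = [3.0, -2.0, 1.5]

Aggregating such (index, value) pairs across processors is the sparse collective-communication problem that the remainder of this work addresses.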