Scaling massively parallel computing tasks, such as scientific applications and large language models (LLMs), is critically constrained by the efficiency of collective communications. This efficiency is increasingly bottlenecked by network bandwidth, which struggles to keep pace with the rapid growth in computational power and communication data volumes. Moreover, the heterogeneous architectures of modern supercomputers and the complexity of communication pipelines exacerbate these challenges.
To address these challenges, I pioneered a new research direction: advancing exascale collective communications through co-designed compression techniques. Within this direction, I developed ZCCL, a family of four novel frameworks that significantly improve collective communication efficiency on both CPU and GPU clusters.
The first framework, C-Coll, leverages error-bounded lossy compression to substantially reduce message sizes and thus accelerate communication. The second, gZCCL, provides GPU-aware, compression-enabled collectives optimized for both high performance and high data accuracy on GPU clusters. The third, hZCCL, introduces the first homomorphic compression-communication co-design: it computes and communicates directly on compressed data, eliminating the costly decompression and recompression steps that both C-Coll and gZCCL require. Finally, ghZCCL contributes the first GPU-based homomorphic compressor and GPU-aware, homomorphic-compression-accelerated collectives, substantially improving both GPU compression and communication efficiency.
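To make the homomorphic idea concrete, the toy C sketch below (not ZCCL's actual algorithm; the linear quantizer, error bound, and array size are illustrative assumptions) shows why compression with a shared error bound can permit reduction directly on compressed data: summing the integer codes and dequantizing once recovers the sum within a bounded error, with no intermediate decompression or recompression.

```c
#include <stdio.h>
#include <math.h>

/* Toy illustration only (not the ZCCL algorithm): linear quantization with
 * a shared error bound is homomorphic under addition, so a reduction can
 * operate on the compressed integer codes directly. */

#define N 8
#define ERROR_BOUND 1e-3  /* hypothetical absolute error bound shared by all ranks */

/* code = round(v / (2*eb)), so that |v - decode(code)| <= eb */
static void quantize(const double *v, long *code, int n, double eb) {
    for (int i = 0; i < n; i++)
        code[i] = lround(v[i] / (2.0 * eb));
}

static void dequantize(const long *code, double *v, int n, double eb) {
    for (int i = 0; i < n; i++)
        v[i] = code[i] * 2.0 * eb;
}

int main(void) {
    double a[N], b[N], sum_ref[N], sum_hom[N];
    long ca[N], cb[N], csum[N];

    for (int i = 0; i < N; i++) {          /* two ranks' local data */
        a[i] = sin(i) * 10.0;
        b[i] = cos(i) * 10.0;
        sum_ref[i] = a[i] + b[i];          /* uncompressed reference sum */
    }

    quantize(a, ca, N, ERROR_BOUND);
    quantize(b, cb, N, ERROR_BOUND);

    /* "Homomorphic" reduction: add the compressed codes directly. */
    for (int i = 0; i < N; i++)
        csum[i] = ca[i] + cb[i];

    dequantize(csum, sum_hom, N, ERROR_BOUND);  /* decode once, at the end */

    for (int i = 0; i < N; i++)
        printf("ref=%9.5f  hom=%9.5f  err=%.2e\n",
               sum_ref[i], sum_hom[i], fabs(sum_ref[i] - sum_hom[i]));
    return 0;
}
```

In a real compression-accelerated allreduce, each rank would exchange such compressed codes (e.g., over MPI) and reduce them in compressed form, paying the dequantization cost only once after the final reduction step rather than at every hop.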
Together, the ZCCL family significantly outperforms state-of-the-art communication libraries, including the NVIDIA Collective Communications Library (NCCL), Cray-MPI, and MPICH, while maintaining high data accuracy.