
Abstract

Scaling massively parallel computing tasks, such as scientific applications and large language models (LLMs), is critically constrained by the efficiency of collective communications. This efficiency is increasingly bottlenecked by network bandwidth, which struggles to keep pace with the rapid growth in computational power and communication data volumes. The heterogeneous architectures of modern supercomputers and the complexity of communication pipelines exacerbate these challenges.

To address these challenges, I pioneered a new research direction: advancing exascale collective communications with co-designed compression techniques. Under this direction, I developed ZCCL, a family of four novel frameworks that significantly improve communication efficiency across CPU and GPU clusters.

The first framework, C-Coll, leverages error-bounded lossy compression to substantially reduce message sizes, thus improving communication performance. The second, gZCCL, presents GPU-aware, compression-enabled collectives that are optimized to achieve both high performance and data accuracy on GPU clusters. The third framework, hZCCL, introduces the first homomorphic compression-communication co-design. It enables direct computation and communication on compressed data, thereby removing the costly decompression and recompression steps required by both C-Coll and gZCCL. Finally, ghZCCL proposes the first GPU-based homomorphic compressor and GPU-aware homomorphic compression-accelerated collectives, offering substantial improvements in both GPU compression and communication efficiency.
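The two ideas named above, error-bounded lossy compression and computation directly on compressed data, can be illustrated with a minimal sketch. It assumes a plain uniform quantizer, which is a stand-in chosen for brevity and not the compressor used in ZCCL; the function names `quantize`, `dequantize`, and `homomorphic_sum` are hypothetical and do not come from the dissertation.

```python
# Illustrative sketch only: a uniform quantizer stands in for an
# error-bounded lossy compressor. This is NOT the ZCCL implementation.

def quantize(values, error_bound):
    """Map each value to an integer code.

    With a bin width of 2 * error_bound, rounding to the nearest bin
    keeps the reconstruction error within error_bound (up to
    floating-point rounding).
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct approximate values from integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def homomorphic_sum(codes_a, codes_b):
    """Add two compressed buffers code by code, without decompressing.

    Because reconstruction is linear in the codes, summing codes is
    equivalent to summing the reconstructed values.
    """
    return [a + b for a, b in zip(codes_a, codes_b)]


if __name__ == "__main__":
    eb = 0.01
    a = [0.12, -3.4, 7.77]
    b = [1.05, 2.2, -0.31]
    reduced = homomorphic_sum(quantize(a, eb), quantize(b, eb))
    print(dequantize(reduced, eb))
```

In this toy setting, a reduction over two buffers accumulates at most twice the per-buffer error bound, which hints at why accuracy control matters when compression is combined with collectives such as a sum reduction.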

Together, the ZCCL family significantly outperforms state-of-the-art communication libraries, including the NVIDIA Collective Communications Library (NCCL), Cray-MPI, and MPICH, while maintaining high data accuracy.

Details

Title
ZCCL: Advancing Exascale Collective Communications With Co-Designed Compression
Number of pages
195
Publication year
2025
Degree date
2025
School code
0032
Source
DAI-A 87/1(E), Dissertation Abstracts International
ISBN
9798288854569
Committee member
Gupta, Rajiv; Wong, Daniel; Zhao, Zhijia; Di, Sheng; Guo, Yanfei
University/institution
University of California, Riverside
Department
Computer Science
University location
United States -- California
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32043541
ProQuest document ID
3231995875
Document URL
https://www.proquest.com/dissertations-theses/zccl-advancing-exascale-collective-communications/docview/3231995875/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic