Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
Z. Hu, S. Shen, T. Bonato, S. Jeaugey, C. Alexander, E. Spada, J. Dinan, J. Hammond, T. Hoefler:
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
(In Proceedings of the 32nd Annual Symposium on High-Performance Interconnects (HOTI'25), presented in Virtual Event, IEEE Press, Aug. 2025)

Abstract: The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it difficult to analyze performance or identify bottlenecks. This paper presents a comprehensive analysis of NCCL, focusing on its communication protocol variants (Simple, LL, and LL128), mechanisms governing intra-node and inter-node data movement, and ring- and tree-based collective communication algorithms. The insights obtained from this study serve as the foundation for ATLAHS, an application-trace-driven network simulation toolchain capable of accurately reproducing NCCL communication patterns in large-scale AI training workloads. By demystifying NCCL's internal architecture, this work provides guidance for system researchers and performance engineers working to optimize or simulate collective communication at scale.
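For readers unfamiliar with the collectives the paper analyzes, the following is a minimal illustrative sketch (not taken from the paper) of a single-process NCCL all-reduce across two GPUs, using only documented NCCL host API calls. The device count and buffer size are arbitrary assumptions; the protocol and algorithm variants discussed in the paper (Simple/LL/LL128, Ring/Tree) can be pinned at run time via NCCL's documented NCCL_PROTO and NCCL_ALGO environment variables.

```c
/* Minimal NCCL all-reduce sketch: one process driving two GPUs.
 * Assumed for illustration: 2 visible GPUs, 1M float elements. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NDEV  2          /* assumed number of GPUs */
#define COUNT (1 << 20)  /* assumed elements per buffer */

int main(void) {
  int devs[NDEV] = {0, 1};
  ncclComm_t comms[NDEV];
  cudaStream_t streams[NDEV];
  float *sendbuf[NDEV], *recvbuf[NDEV];

  /* One communicator per GPU within a single process. */
  ncclCommInitAll(comms, NDEV, devs);

  for (int i = 0; i < NDEV; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Group the per-GPU calls so NCCL launches them as one collective. */
  ncclGroupStart();
  for (int i = 0; i < NDEV; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], COUNT, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < NDEV; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < NDEV; ++i) {
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce complete\n");
  return 0;
}
```

Running the same binary under, e.g., NCCL_PROTO=LL128 NCCL_ALGO=Tree versus the defaults is one way to observe the protocol and algorithm choices whose internals the paper examines.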