Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

M. Khalilov, S. Di Girolamo, M. Chrapek, R. Nudelman, G. Bloch, T. Hoefler:

 Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24), presented in Atlanta, GA, USA, pages 103:1-103:17, IEEE Press, ISBN: 9798350352917, Nov. 2024)

Abstract

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
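As a rough illustration of why multicast can reduce Allgather traffic, the sketch below counts payload bytes crossing one node's NIC under two idealized schedules: a textbook ring Allgather that uses only point-to-point sends, and a schedule in which each node multicasts its shard exactly once. This is a back-of-the-envelope model, not the paper's protocol or measurement methodology. The function names are hypothetical, and the model ignores reliability traffic (acknowledgements, retransmissions), headers, and topology effects.

```python
def ring_allgather_bytes(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node payload traffic for a textbook ring Allgather.

    Assumption: in each of the (num_nodes - 1) steps, a node forwards one
    shard to its neighbor and receives one shard from the other neighbor.
    """
    steps = num_nodes - 1
    sent = steps * shard_bytes
    received = steps * shard_bytes
    return {"sent": sent, "received": received, "total": sent + received}


def multicast_allgather_bytes(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node payload traffic for an idealized multicast-based Allgather.

    Assumption: each node injects its own shard once, the switches replicate
    it, and the node receives one shard from each of the other nodes.
    """
    sent = shard_bytes
    received = (num_nodes - 1) * shard_bytes
    return {"sent": sent, "received": received, "total": sent + received}


def traffic_ratio(num_nodes: int, shard_bytes: int) -> float:
    """Ratio of ring to multicast per-node traffic (sent plus received)."""
    ring = ring_allgather_bytes(num_nodes, shard_bytes)
    mcast = multicast_allgather_bytes(num_nodes, shard_bytes)
    return ring["total"] / mcast["total"]


if __name__ == "__main__":
    # Node count taken from the abstract; the 1 MiB shard size is arbitrary.
    print(traffic_ratio(num_nodes=188, shard_bytes=1 << 20))
```

Under these assumptions the ratio is 2(P-1)/P for P nodes, which approaches 2 as P grows. The injected (sent) bytes per node drop from (P-1) shards to a single shard, which is the quantity that competes with other outstanding collectives for injection bandwidth.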

BibTeX

@inproceedings{khalilov2024allgather,
  author={Mikhail Khalilov and Salvatore Di Girolamo and Marcin Chrapek and Rami Nudelman and Gil Bloch and Torsten Hoefler},
  title={{Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI}},
  year={2024},
  month={11},
  pages={103:1--103:17},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},
  location={Atlanta, GA, USA},
  publisher={IEEE Press},
  isbn={9798350352917},
  doi={10.1109/SC41406.2024.00109},
}