Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
M. Khalilov, S. Shen, M. Chrapek, T. Chen, K. Nakano, P. Gootzen, S. Di Girolamo, R. Nudelman, G. Bloch, S. Anantharamu, M. Elhaddad, J. Jose, A. Kabbani, S. Moe, K. Taranov, Z. Yu, J. Zhang, N. Mazzoletti, T. Hoefler:
SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25), presented in St. Louis, MO, USA, Nov. 2025)

Abstract

RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. The SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for inter-datacenter training.
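To make the receive buffer bitmap idea concrete, the following is a minimal illustrative sketch, not the SDR SDK API. It assumes a message is split into fixed-size chunks with one bit per chunk; the names `ReceiveBitmap`, `mark_received`, `missing`, and `decodable` are hypothetical. It shows how the same bitmap could drive either a Selective Repeat style retransmission request or an Erasure Coding completion check, the latter assuming a code in which any `k` of the `n` chunks suffice to recover the message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReceiveBitmap:
    """Hypothetical per-message bitmap: bit i is set once chunk i has arrived."""

    num_chunks: int
    bits: int = 0

    def mark_received(self, chunk: int) -> None:
        # Record the arrival of one chunk; chunks may arrive in any order.
        if not 0 <= chunk < self.num_chunks:
            raise IndexError(f"chunk {chunk} out of range")
        self.bits |= 1 << chunk

    def missing(self) -> list[int]:
        # Chunks a Selective Repeat style scheme would ask the sender to resend.
        return [i for i in range(self.num_chunks) if not (self.bits >> i) & 1]

    def is_complete(self) -> bool:
        # Full completion: every chunk of the message has arrived.
        return self.bits == (1 << self.num_chunks) - 1


def decodable(bitmap: ReceiveBitmap, k: int) -> bool:
    # Assumption: an erasure code where any k received chunks recover the message.
    return bin(bitmap.bits).count("1") >= k


if __name__ == "__main__":
    bm = ReceiveBitmap(num_chunks=8)
    for chunk in (0, 1, 2, 4, 5, 7):  # chunks 3 and 6 were dropped in transit
        bm.mark_received(chunk)
    print("missing chunks:", bm.missing())
    print("complete:", bm.is_complete())
    print("decodable with k=6:", decodable(bm, 6))
```

In this sketch the application, rather than the transport, decides what a partially complete message means: it can request only the missing chunks, or stop waiting as soon as enough chunks have arrived to decode.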