Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

M. Khalilov, S. Shen, M. Chrapek, T. Chen, K. Nakano, P. Gootzen, S. Di Girolamo, R. Nudelman, G. Bloch, S. Anantharamu, M. Elhaddad, J. Jose, A. Kabbani, S. Moe, K. Taranov, Z. Yu, J. Zhang, N. Mazzoletti, T. Hoefler:

SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25), presented in St. Louis, MO, USA, Nov. 2025)

Abstract

RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for inter-datacenter training.
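
To make the receive buffer bitmap idea concrete, here is a minimal sketch of how per-chunk arrival tracking could let an application see partial completion of a message and decide what to repair. It is not the SDR SDK API: the class `ChunkBitmap`, its methods, and the chunk size are hypothetical names and values chosen for illustration, and the repair step shown is a plain Selective-Repeat-style "request the missing chunks" rather than anything taken from the paper.

```python
class ChunkBitmap:
    """Tracks which fixed-size chunks of one message have arrived.

    Hypothetical helper for illustration only; not part of any real SDK.
    """

    def __init__(self, message_bytes: int, chunk_bytes: int) -> None:
        if message_bytes <= 0 or chunk_bytes <= 0:
            raise ValueError("sizes must be positive")
        # Ceiling division: the last chunk may be shorter than chunk_bytes.
        self.num_chunks = (message_bytes + chunk_bytes - 1) // chunk_bytes
        self.bits = 0  # bit i set => chunk i has landed in the receive buffer

    def mark_received(self, chunk_index: int) -> None:
        if not 0 <= chunk_index < self.num_chunks:
            raise IndexError("chunk index out of range")
        self.bits |= 1 << chunk_index

    def missing(self) -> list[int]:
        """Indices of chunks that have not arrived yet."""
        return [i for i in range(self.num_chunks) if not (self.bits >> i) & 1]

    def is_complete(self) -> bool:
        return self.bits == (1 << self.num_chunks) - 1


# A 10,000-byte message split into 4096-byte chunks (assumed sizes).
bitmap = ChunkBitmap(message_bytes=10_000, chunk_bytes=4096)
bitmap.mark_received(0)
bitmap.mark_received(2)

# The application inspects the partial state and picks its own repair policy;
# here it would simply ask the sender to retransmit the missing chunks.
to_retransmit = bitmap.missing()
```

Because the application sees the bitmap rather than a single all-or-nothing completion, the same state could equally feed a different policy, for example checking whether enough chunks have arrived for an erasure-coded message to be decoded.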

BibTeX

@inproceedings{khalilov2025sdrrma,
  author={Mikhail Khalilov and Siyuan Shen and Marcin Chrapek and Tiancheng Chen and Kenji Nakano and Peter-Jan Gootzen and Salvatore Di Girolamo and Rami Nudelman and Gil Bloch and Sreevatsa Anantharamu and Mahmoud Elhaddad and Jithin Jose and Abdul Kabbani and Scott Moe and Konstantin Taranov and Zhuolong Yu and Jie Zhang and Nicola Mazzoletti and Torsten Hoefler},
  title={{SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication}},
  year={2025},
  month={11},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)},
  location={St. Louis, MO, USA},
  doi={10.48550/arXiv.2505.05366},
}