Publications of Torsten Hoefler
Mikhail Khalilov, Siyuan Shen, Marcin Chrapek, Tiancheng Chen, Kenji Nakano, Peter-Jan Gootzen, Salvatore Di Girolamo, Rami Nudelman, Gil Bloch, Sreevatsa Anantharamu, Mahmoud Elhaddad, Jithin Jose, Abdul Kabbani, Scott Moe, Konstantin Taranov, Zhuolong Yu, Jie Zhang, Nicola Mazzoletti, Torsten Hoefler:
SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25), presented in St. Louis, MO, USA, Nov. 2025)
Abstract: RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for inter-datacenter training.
BibTeX:
@inproceedings{khalilov2025sdrrma,
  author={Mikhail Khalilov and Siyuan Shen and Marcin Chrapek and Tiancheng Chen and Kenji Nakano and Peter-Jan Gootzen and Salvatore Di Girolamo and Rami Nudelman and Gil Bloch and Sreevatsa Anantharamu and Mahmoud Elhaddad and Jithin Jose and Abdul Kabbani and Scott Moe and Konstantin Taranov and Zhuolong Yu and Jie Zhang and Nicola Mazzoletti and Torsten Hoefler},
  title={{SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication}},
  year={2025},
  month={Nov.},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)},
  location={St. Louis, MO, USA},
  source={http://www.unixer.de/~htor/publications/},
}