Publications of Torsten Hoefler
Yinxiao Feng, Tiancheng Chen, Yuchen Wei, Siyuan Shen, Shiju Wang, Wei Li, Kaisheng Ma, Torsten Hoefler:
RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
(arXiv:2507.18889. Jul. 2025)
Abstract: Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the Rail-optimized network are extremely expensive, while direct topologies such as Torus have insufficient bisection bandwidth and flexibility. In this paper, we propose RailX, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit switching networks. We propose a novel interconnection method based on Hamiltonian Decomposition theory to organize separate rail-based rings into all-to-all topology, simultaneously optimizing ring-collective and all-to-all communication. More than 100K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only 2∼4 inter-node hops. The network cost per injection/All-Reduce bandwidth of RailX is less than 10% of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than 50% of the Fat-Tree. Specifically, only ∼1.3B is required to interconnect 200K chips with 1.8TB bandwidth. RailX can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.
BibTeX:
@article{yinxiao2025railx,
  author={Yinxiao Feng and Tiancheng Chen and Yuchen Wei and Siyuan Shen and Shiju Wang and Wei Li and Kaisheng Ma and Torsten Hoefler},
  title={{RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems}},
  journal={arXiv:2507.18889},
  year={2025},
  month={Jul.},
  source={http://www.unixer.de/~htor/publications/},
}