Publications of Torsten Hoefler
Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler:
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
(arXiv:2507.03117. Jul. 2025)
Abstract: The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
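The block-sparsity idea the abstract describes can be illustrated in a few lines: prune whole tiles of a weight matrix by magnitude, so the surviving nonzeros form a block pattern that an SpMM kernel can exploit. The sketch below is a hypothetical one-shot pruning pass with assumed parameters (16x16 blocks, 95% block sparsity); the paper's actual method sparsifies iteratively during training and pairs the pattern with a fused Sparse MLP kernel, neither of which is reproduced here.

```python
import numpy as np

def block_sparsify(W, block=16, sparsity=0.95):
    """Zero out the lowest-magnitude (block x block) tiles of W.

    Illustrative sketch only: one-shot magnitude pruning, not the
    iterative BLaST procedure. Assumes W's dimensions are divisible
    by `block`.
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    # View W as a grid of (block x block) tiles.
    tiles = W.reshape(rows // block, block, cols // block, block)
    # Frobenius norm of each tile, shape (rows/block, cols/block).
    norms = np.linalg.norm(tiles, axis=(1, 3))
    # Number of tiles to prune to hit the target block sparsity.
    k = int(norms.size * sparsity)
    # k-th smallest tile norm serves as the pruning threshold.
    thresh = np.partition(norms.ravel(), k)[k]
    mask = (norms >= thresh).astype(W.dtype)
    # Zero pruned tiles and restore the original matrix shape.
    W_sparse = (tiles * mask[:, None, :, None]).reshape(rows, cols)
    return W_sparse, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W_sparse, mask = block_sparsify(W, block=16, sparsity=0.95)
```

With a 128x128 matrix and 16x16 tiles there are 64 tiles, of which 60 are pruned, leaving 4 dense tiles whose positions can be stored in a compressed block format for SpMM.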
BibTeX: @article{okanovic2025blast, author={Patrik Okanovic and Sameer Deshmukh and Grzegorz Kwasniewski and Kentaro Katayama and Takumi Honda and Maciej Besta and Torsten Hoefler}, title={{BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers}}, journal={arXiv:2507.03117}, year={2025}, month={Jul.}, source={http://www.unixer.de/~htor/publications/}, }