Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), presented in St. Louis, Missouri, ACM, Nov. 2021) Best Paper Finalist
Abstract
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme that combines bidirectional pipelines to train large-scale models efficiently. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera also has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
BibTeX
@inproceedings{nopfs,
  author={Shigang Li and Torsten Hoefler},
  title={{Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines}},
  year={2021},
  month={Nov.},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21)},
  location={St. Louis, Missouri},
  publisher={ACM},
  source={http://www.unixer.de/~htor/publications/},
}