GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding - Citegraph

Paper Info

Title
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Abstract
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Year	Venue	DocType	Citations	PageRank	References	Authors
2021	ICLR	Conference	0	0.34	0	9

Authors (9 rows)

Name	Order	Citations	PageRank
Dmitry Lepikhin	1	0	1.35
HyoukJoong Lee	2	414	17.71
Yuanzhong Xu	3	224	9.30
Dehao Chen	4	17	1.57
Orhan Firat	5	281	29.13
Yanping Huang	6	210	9.80
Maxim Krikun	7	452	17.11
Noam Shazeer	8	1089	43.70
Zhifeng Chen	9	2747	106.75
