Abstract
Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be efficiently trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
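The annotation-based approach the abstract describes can be sketched with JAX's public sharding primitives, which drive the same XLA SPMD partitioner lineage. This is a minimal illustrative sketch, not GShard's actual API (the paper's implementation uses TensorFlow/Lingvo sharding annotations); the mesh axis name, the `expert_ffn` function, and the tensor shapes below are assumptions made for the example.

```python
# A minimal sketch, assuming JAX's public sharding API as a stand-in for
# GShard's TensorFlow/Lingvo annotations. The idea being illustrated: the
# user only annotates how a tensor is partitioned, and the XLA compiler
# derives the per-device program and inserts the needed communication.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

n_devices = len(jax.devices())
# Hypothetical 1-D device mesh with a single "expert" axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("expert",))

@jax.jit
def expert_ffn(x, w):
    # Annotation only: declare that the expert dimension of the weights is
    # split across the "expert" mesh axis; no manual communication code.
    w = jax.lax.with_sharding_constraint(
        w, NamedSharding(mesh, P("expert", None, None)))
    return jnp.einsum("ebm,emh->ebh", x, w)

# Toy shapes (experts, batch, model) and (experts, model, hidden); made up here.
x = jnp.ones((n_devices, 4, 8))
w = jnp.ones((n_devices, 8, 16))
y = expert_ffn(x, w)  # same program runs on every device (SPMD)
```

The point of this annotation style, as the abstract states, is that model code changes are minimal: only placement hints are added, and the partitioned execution is produced by the compiler.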
Year | Venue | DocType
---|---|---
2021 | ICLR | Conference

Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (9)
Name | Order | Citations | PageRank |
---|---|---|---
Dmitry Lepikhin | 1 | 0 | 1.35 |
HyoukJoong Lee | 2 | 414 | 17.71 |
Yuanzhong Xu | 3 | 224 | 9.30 |
Dehao Chen | 4 | 17 | 1.57 |
Orhan Firat | 5 | 281 | 29.13 |
Yanping Huang | 6 | 210 | 9.80 |
Maxim Krikun | 7 | 452 | 17.11 |
Noam Shazeer | 8 | 1089 | 43.70 |
Zhifeng Chen | 9 | 2747 | 106.75 |