Title
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.
Abstract
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall few-shot performance across 29 NLP tasks.
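For readers unfamiliar with the term, the sketch below illustrates what a sparsely activated mixture-of-experts layer does: a learned gating network routes each token to a small subset of expert feed-forward networks (the paper activates the top two experts per token), so the parameter count grows with the number of experts while per-token compute stays roughly flat. This is a minimal NumPy sketch under assumed toy sizes, not the paper's implementation; the layer shapes and routing details here are illustrative only.

```python
# Illustrative sketch (not the paper's code): a sparsely activated
# mixture-of-experts (MoE) feed-forward layer with top-2 gating.
# Only the top-2 experts run per token, so per-token compute stays
# roughly constant while total parameters grow with num_experts.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff = 4, 8, 16  # toy sizes, assumptions for illustration

# Each "expert" is an independent two-layer feed-forward network.
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1)
    for _ in range(num_experts)
]
# Gating network: one linear projection from token features to expert logits.
w_gate = rng.normal(size=(d_model, num_experts)) * 0.1


def moe_layer(x, k=2):
    """x: [num_tokens, d_model] -> [num_tokens, d_model], routing each token to k experts."""
    logits = x @ w_gate                               # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    top_k = np.argsort(-probs, axis=-1)[:, :k]        # indices of the k largest gates

    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = (top_k == e).any(axis=-1)              # tokens routed to expert e
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ w1, 0.0) @ w2        # expert FFN (ReLU)
        out[mask] += probs[mask, e:e + 1] * h         # weight by the gate value
    return out


tokens = rng.normal(size=(5, d_model))
print(moe_layer(tokens).shape)                        # (5, 8): same shape as the input
```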
Year
2022
Venue
International Conference on Machine Learning
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
27
Name                     Order  Citations  PageRank
Nan Du                   1      503        52.49
Yanping Huang            2      210        9.80
Andrew M. Dai            3      534        24.53
Simon Tong               4      0          0.34
Dmitry Lepikhin          5      0          1.35
Yuanzhong Xu             6      224        9.30
Maxim Krikun             7      452        17.11
Yanqi Zhou               8      0          0.34
Adams Wei Yu             9      141        8.79
Orhan Firat              10     281        29.13
Barret Zoph              11     0          0.34
Liam Fedus               12     0          0.34
Maarten Bosma            13     0          0.68
Zongwei Zhou             14     2          0.70
Tao Wang                 15     0          0.34
Yu Emma Wang             16     0          0.34
Kellie Webster           17     1          0.71
Marie Pellat             18     0          0.34
Kevin Robinson           19     0          0.34
Kathy Meier-Hellstern    20     0          0.34
Toju Duke                21     0          0.34
Lucas Dixon              22     1823       90.35
Kun Zhang                23     0          0.68
Quoc V. Le               24     8501       366.59
Yonghui Wu               25     1065       72.78
Zhifeng Chen             26     2747       106.75
Claire Cui               27     0          0.34