Title
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.
Abstract
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall few-shot performance across 29 NLP tasks.
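For readers unfamiliar with the term, the sketch below illustrates what a sparsely activated mixture-of-experts layer does: a learned gating network routes each token to a small subset of expert feed-forward networks (the paper activates the top two experts per token), so the parameter count grows with the number of experts while per-token compute stays roughly flat. This is a minimal NumPy sketch under assumed toy sizes, not the paper's implementation; the layer shapes and routing details here are illustrative only.

```python
# Illustrative sketch (not the paper's code): a sparsely activated
# mixture-of-experts (MoE) feed-forward layer with top-2 gating.
# Only the top-2 experts run per token, so per-token compute stays
# roughly constant while total parameters grow with num_experts.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff = 4, 8, 16  # toy sizes, assumptions for illustration

# Each "expert" is an independent two-layer feed-forward network.
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1)
    for _ in range(num_experts)
]
# Gating network: one linear projection from token features to expert logits.
w_gate = rng.normal(size=(d_model, num_experts)) * 0.1


def moe_layer(x, k=2):
    """x: [num_tokens, d_model] -> [num_tokens, d_model], routing each token to k experts."""
    logits = x @ w_gate                               # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    top_k = np.argsort(-probs, axis=-1)[:, :k]        # indices of the k largest gates

    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = (top_k == e).any(axis=-1)              # tokens routed to expert e
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ w1, 0.0) @ w2        # expert FFN (ReLU)
        out[mask] += probs[mask, e:e + 1] * h         # weight by the gate value
    return out


tokens = rng.normal(size=(5, d_model))
print(moe_layer(tokens).shape)                        # (5, 8): same shape as the input
```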
Year
2022
Venue
International Conference on Machine Learning
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
27
Name                     Order  Citations  PageRank
Nan Du                   1      503        52.49
Yanping Huang            2      210        9.80
Andrew M. Dai            3      534        24.53
Simon Tong               4      0          0.34
Dmitry Lepikhin          5      0          1.35
Yuanzhong Xu             6      224        9.30
Maxim Krikun             7      452        17.11
Yanqi Zhou               8      0          0.34
Adams Wei Yu             9      141        8.79
Orhan Firat              10     281        29.13
Barret Zoph              11     0          0.34
Liam Fedus               12     0          0.34
Maarten Bosma            13     0          0.68
Zongwei Zhou             14     2          0.70
Tao Wang                 15     0          0.34
Yu Emma Wang             16     0          0.34
Kellie Webster           17     1          0.71
Marie Pellat             18     0          0.34
Kevin Robinson           19     0          0.34
Kathy Meier-Hellstern    20     0          0.34
Toju Duke                21     0          0.34
Lucas Dixon              22     1823       90.35
Kun Zhang                23     0          0.68
Quoc V. Le               24     8501       366.59
Yonghui Wu               25     1065       72.78
Zhifeng Chen             26     2747       106.75
Claire Cui               27     0          0.34