Abstract |
---|
Language models are essential for natural language processing (NLP) tasks such as machine translation and text summarization. Transformer-based language models with over a billion parameters have recently demonstrated remarkable performance across many NLP domains, confirming the benefits of model size. Model parallelism is required when a model is too large to fit on a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds up the training of Transformer-based language models. We also prove that the proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm achieves a substantial speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce the experiments is available at https://github.com/LaraQianYang/Ouroboros. |
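The abstract contrasts model parallelism with data parallelism and names backward locking as the main obstacle. The sketch below illustrates the naive layer-wise model parallelism the paper improves upon, not the proposed Ouroboros algorithm itself; the module name, two-GPU split, and hyperparameters are assumptions chosen only for illustration.

```python
# Illustrative layer-wise model parallelism for a Transformer LM (a sketch,
# not the paper's Ouroboros algorithm). The blocks are split across two
# hypothetical GPUs; activations are copied between devices in the forward pass.
import torch
import torch.nn as nn


class TwoDeviceTransformerLM(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_head=8, n_layers=12):
        super().__init__()
        split = n_layers // 2
        # Embedding and the first half of the blocks live on device 0 ...
        self.embed = nn.Embedding(vocab_size, d_model).to("cuda:0")
        self.lower = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_head) for _ in range(split)]
        ).to("cuda:0")
        # ... the second half and the output projection live on device 1.
        self.upper = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_head) for _ in range(n_layers - split)]
        ).to("cuda:1")
        self.head = nn.Linear(d_model, vocab_size).to("cuda:1")

    def forward(self, tokens):  # tokens: (seq_len, batch) tensor of token ids
        h = self.embed(tokens.to("cuda:0"))
        for layer in self.lower:
            h = layer(h)
        h = h.to("cuda:1")  # activation transfer between the two devices
        for layer in self.upper:
            h = layer(h)
        return self.head(h)  # per-token logits, produced on device 1
```

In this naive split, the backward pass through the lower blocks cannot begin until the upper blocks have finished theirs; that sequential dependency is the backward locking the abstract refers to, and removing it while preserving convergence is what the proposed algorithm targets.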
Year | Field | DocType |
---|---|---|
2019 | Engineering drawing, Computer science, Transformer, Artificial intelligence, Language model, Machine learning | Conference
Citations | PageRank | References
---|---|---|
0 | 0.34 | 0
Authors |
---|
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Qian Yang | 1 | 21 | 4.50 |
Zhouyuan Huo | 2 | 81 | 12.16 |
Wenlin Wang | 3 | 51 | 7.06 |
L. Carin | 4 | 4603 | 339.36 |