Title
ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora
Abstract
Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance on downstream cross-lingual tasks. This improvement stems from learning from large amounts of monolingual and parallel corpora. Although parallel corpora are generally acknowledged to be critical for improving model performance, existing methods are often constrained by the size of the available parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representations of multiple languages using monolingual corpora, thereby overcoming the constraint that parallel corpus size places on model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs from a monolingual corpus to enable the learning of semantic alignment between different languages, which enhances the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks. The code and pre-trained models will be made publicly available.
Year: 2021
Venue: EMNLP
DocType: Conference
Volume: 2021.emnlp-main
Citations: 0
PageRank: 0.34
References: 0
Authors: 7
Name            Order  Citations  PageRank
Xuan Ouyang     1      0          1.69
Shuohuan Wang   2      4          3.76
Chao Pang       3      0          1.01
Yu Sun          4      4          4.09
Hao Tian        5      1          1.02
Hua Wu          6      664        59.26
Haifeng Wang    7      806        94.25