Title
Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation
Abstract
This paper explores efficient domain adaptation for statistical machine translation by extracting sentence pairs (pseudo in-domain subcorpora, those most relevant to the in-domain corpus) from a large-scale general-domain bilingual web corpus. The sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, the phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole rather than the selection of single words in isolation. The pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus, with an increase of 1.6 BLEU points. Performance improves further when the pseudo in-domain corpus/model is used in combination with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.
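The contrast the abstract draws between bag-of-words and phrase-based selection can be illustrated with a minimal sketch. This is not the paper's model, whose details the abstract does not give. It is a toy, monolingual relevance score that assumes a sentence is more in-domain when more of its n-grams occur in the in-domain data. The function names (`ngrams`, `build_inventory`, `relevance`, `select_top`), the `max_n` parameter, and the example sentences are all hypothetical.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inventory(in_domain_sentences, max_n):
    """Collect every n-gram of length 1..max_n seen in the in-domain data."""
    inventory = set()
    for sentence in in_domain_sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            inventory.update(ngrams(tokens, n))
    return inventory


def relevance(sentence, inventory, max_n):
    """Fraction of the sentence's n-grams (1..max_n) found in the inventory."""
    tokens = sentence.lower().split()
    grams = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
    if not grams:
        return 0.0
    return sum(g in inventory for g in grams) / len(grams)


def select_top(general_sentences, in_domain_sentences, k, max_n):
    """Return the k general-domain sentences with the highest relevance."""
    inventory = build_inventory(in_domain_sentences, max_n)
    ranked = sorted(
        general_sentences,
        key=lambda s: relevance(s, inventory, max_n),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: a tiny "in-domain" spoken-language set and a
# "general-domain" pool containing one fluent and one scrambled candidate.
in_domain = ["could you book a table for two", "i would like to check in"]
general = [
    "a table for two please",
    "two a for table check",
    "the stock market fell sharply",
]

bow_choice = select_top(general, in_domain, k=1, max_n=1)     # words in isolation
phrase_choice = select_top(general, in_domain, k=1, max_n=3)  # phrases up to length 3
```

With `max_n=1` the score sees only isolated words, so the scrambled candidate ranks highest because every one of its words occurs in the in-domain data. With `max_n=3` the word order matters, and the fluent candidate that shares whole phrases with the in-domain data ranks highest instead. A real system would score both sides of each sentence pair and use a trained model rather than raw overlap.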
Year: 2013
DOI: 10.1007/978-3-642-41644-6_12
Venue: Communications in Computer and Information Science
Keywords: domain adaptation, phrase-based data selection, pseudo in-domain subcorpora, spoken language translation
Field: Spoken language translation, BLEU, Data selection, Domain adaptation, Computer science, Machine translation, Phrase, Speech recognition, Natural language processing, Artificial intelligence, Baseline system, Sentence
DocType: Conference
Volume: 400
ISSN: 1865-0929
Citations: 0
PageRank: 0.34
References: 18
Authors: 4
Name            Order  Citations  PageRank
Shixiang Lu     1      19         3.39
Xingyuan Peng   2      4          1.46
Zhenbiao Chen   3      33         5.14
Bo Xu           4      241        36.59