Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation - Citegraph

Paper Info

Title
Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

Abstract
This paper explores efficient domain adaptation for statistical machine translation, based on extracting the sentence pairs most relevant to the in-domain corpus (pseudo in-domain subcorpora) from a large-scale general-domain bilingual web corpus. These sentence pairs are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. The pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus by 1.6 BLEU points. Performance improves further when the pseudo in-domain corpus and models are combined with the true in-domain corpus and model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.

Year	2013
DOI	10.1007/978-3-642-41644-6_12
Venue	Communications in Computer and Information Science
Keywords	domain adaptation, phrase-based data selection, pseudo in-domain subcorpora, spoken language translation
Field	Spoken language translation, BLEU, Data selection, Domain adaptation, Computer science, Machine translation, Phrase, Speech recognition, Natural language processing, Artificial intelligence, Baseline system, Sentence
DocType	Conference
Volume	400
ISSN	1865-0929
Citations	0
PageRank	0.34
References	18
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Shixiang Lu	1	19	3.39
Xingyuan Peng	2	4	1.46
Zhenbiao Chen	3	33	5.14
Bo Xu	4	241	36.59
