Title
Does more data always yield better translations?
Abstract
Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all of the training data actually help: a system trained on a subset of such huge bilingual corpora might outperform one trained on all of the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on the occurrence of infrequent n-grams. Experimental results show not only significant improvements over random sentence selection, but also an improvement over a system trained on all of the available data. Surprisingly, these improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Finally, we show that much larger room for improvement exists, although under non-realistic conditions.
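The first selection technique mentioned in the abstract, scoring candidate sentences by how probable an in-domain model finds them, can be illustrated with a minimal sketch. The paper's exact method is not specified here, so this uses a simple add-one-smoothed unigram language model and length-normalised log-probability as stand-ins; all function names and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of probability-based data selection: rank out-of-domain
# sentences by an in-domain unigram LM and keep the top-k.
# (Illustrative only; the paper's actual technique is not reproduced here.)
import math
from collections import Counter

def train_unigram(sentences):
    """Count unigrams over whitespace-tokenised in-domain sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return counts, total, vocab

def neg_log_prob(sentence, model):
    """Length-normalised negative log-probability (add-one smoothing)."""
    counts, total, vocab = model
    words = sentence.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return -lp / max(len(words), 1)

def select(pool, in_domain, k):
    """Keep the k pool sentences the in-domain model finds most probable."""
    model = train_unigram(in_domain)
    return sorted(pool, key=lambda s: neg_log_prob(s, model))[:k]

# Toy example: cat-themed "in-domain" data versus a mixed pool.
in_domain = ["the cat sat", "the cat ran", "a cat slept"]
pool = ["the cat sat on the mat",
        "stock prices fell sharply",
        "the cat ran home"]
print(select(pool, in_domain, 2))
```

With this toy data, the two cat-related sentences are retained and the out-of-domain one is discarded, mirroring the idea of keeping only the small fraction of the bilingual corpus that resembles the in-domain text.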
Year: 2012
Venue: EACL
Keywords: training data, in-domain corpus, training data selection technique, statistical machine translation system, bilingual data, whole available data, better translation, significant improvement, random sentence selection, huge bilingual corpus
Field: Training set, Computer science, Machine translation, Speech recognition, Natural language processing, Artificial intelligence, Sentence
DocType: Conference
Citations: 10
PageRank: 0.53
References: 22
Authors: 5
Name | Order | Citations | PageRank
Guillem Gascó | 1 | 15 | 2.11
Martha-Alicia Rocha | 2 | 16 | 1.80
Germán Sanchis-Trilles | 3 | 101 | 16.95
Jesús Andrés-Ferrer | 4 | 73 | 7.52
Francisco Casacuberta | 5 | 14391 | 61.33