Title
Does more data always yield better translations?
Abstract
Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all of the training data actually help: a system trained on a subset of such huge bilingual corpora might outperform one trained on all of the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on the occurrence of infrequent n-grams. Experimental results show not only significant improvements over random sentence selection, but also an improvement over a system trained on all of the available data. Surprisingly, these improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Finally, we show that much larger room for improvement exists, although under non-realistic conditions.
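The first selection technique mentioned in the abstract, scoring candidate sentences by how probable an in-domain model finds them, can be illustrated with a minimal sketch. The paper's exact method is not specified here, so this uses a simple add-one-smoothed unigram language model and length-normalised log-probability as stand-ins; all function names and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of probability-based data selection: rank out-of-domain
# sentences by an in-domain unigram LM and keep the top-k.
# (Illustrative only; the paper's actual technique is not reproduced here.)
import math
from collections import Counter

def train_unigram(sentences):
    """Count unigrams over whitespace-tokenised in-domain sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return counts, total, vocab

def neg_log_prob(sentence, model):
    """Length-normalised negative log-probability (add-one smoothing)."""
    counts, total, vocab = model
    words = sentence.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return -lp / max(len(words), 1)

def select(pool, in_domain, k):
    """Keep the k pool sentences the in-domain model finds most probable."""
    model = train_unigram(in_domain)
    return sorted(pool, key=lambda s: neg_log_prob(s, model))[:k]

# Toy example: cat-themed "in-domain" data versus a mixed pool.
in_domain = ["the cat sat", "the cat ran", "a cat slept"]
pool = ["the cat sat on the mat",
        "stock prices fell sharply",
        "the cat ran home"]
print(select(pool, in_domain, 2))
```

With this toy data, the two cat-related sentences are retained and the out-of-domain one is discarded, mirroring the idea of keeping only the small fraction of the bilingual corpus that resembles the in-domain text.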
Year: 2012
Venue: EACL
Keywords: training data, in-domain corpus, training data selection technique, statistical machine translation system, bilingual data, whole available data, better translation, significant improvement, random sentence selection, huge bilingual corpus
Field: Training set, Computer science, Machine translation, Speech recognition, Natural language processing, Artificial intelligence, Sentence
DocType: Conference
Citations: 10
PageRank: 0.53
References: 22
Authors: 5
Name | Order | Citations | PageRank
Guillem Gascó | 1 | 15 | 2.11
Martha-Alicia Rocha | 2 | 16 | 1.80
Germán Sanchis-Trilles | 3 | 101 | 16.95
Jesús Andrés-Ferrer | 4 | 73 | 7.52
Francisco Casacuberta | 5 | 14391 | 61.33