Abstract |
---|
Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, including random selection, dissimilarity-based selection, and a recently proposed semi-supervised active learning strategy. |
Year | Venue | Keywords |
---|---|---|
2010 | EMNLP | manual translation, dissimilarity-based selection, statistical machine translation, active learning strategy, random selection, active sample selection, discriminative sample selection strategy, proposed strategy, Spanish-to-English translation task, erroneous translation |
Field | DocType | Volume |
---|---|---|
Active learning, Computer science, Machine translation, Redundancy (engineering), Artificial intelligence, Sampling (statistics), Natural language processing, Sample selection, Discriminative model, Machine learning | Conference | D10-1 |
Citations | PageRank | References |
---|---|---|
7 | 0.51 | 11 |
Authors |
---|
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sankaranarayanan Ananthakrishnan | 1 | 134 | 13.29 |
Rohit Prasad | 2 | 465 | 39.06 |
David Stallard | 3 | 153 | 59.87 |
Premkumar Natarajan | 4 | 874 | 79.46 |