Title
Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences.
Abstract
This article proposes a technique for mining bilingual lexicons from pairs of parallel short word sequences. The technique builds a generative model from a corpus of training data consisting of such pairs. The model is a hierarchical nonparametric Bayesian model that directly induces a bilingual lexicon while training. The model learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. The proposed model is capable of utilizing commonly used word-pair frequency information and additionally can employ the internal character alignments within the words themselves. It is thereby capable of mining transliterations and can use reliably aligned transliteration pairs to support the mining of other words in their context. The model is also capable of performing word reordering and word deletion during the alignment process, and it is furthermore capable of operating in the absence of full segmentation information. In this work, we study two mining tasks based on English-Japanese and English-Chinese language pairs, and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method is able to mine bilingual word pairs at higher levels of precision and recall than the baselines.
Year: 2017
DOI: 10.1145/3003726
Venue: ACM Trans. Asian & Low-Resource Lang. Inf. Process.
Keywords: Bilingual lexicon, mining, alignment
Field: Training set, Word deletion, Bilingual lexicon, Segmentation, Computer science, Precision and recall, Speech recognition, Exploit, Artificial intelligence, Natural language processing, Transliteration, Generative model
DocType: Journal
Volume: 16
Issue: 3
ISSN: 2375-4699
Citations: 0
PageRank: 0.34
References: 13
Authors (4)
Name, Order, Citations, PageRank
Andrew Finch, 1, 144, 19.05
Taisuke Harada, 2, 0, 0.34
Kumiko Tanaka-Ishii, 3, 261, 36.69
Eiichiro SUMITA, 4, 1466, 190.87