The word is mightier than the count: accumulating translation resources from parsed parallel corpora - Citegraph

Paper Info

Title
The word is mightier than the count: accumulating translation resources from parsed parallel corpora

Abstract
Large, high-quality, sentence-aligned parallel corpora are hard to come by, and this makes the Statistical Machine Translation enterprise more difficult. Even noisy corpora, though, can provide useful translation resources not otherwise available. Many investigations have used statistical methods to find word correspondences. Often such methods suffer from overgeneration, so to correct this we filter relevant translation candidates using a lexical post-process. This dictionary lookup is so effective, in fact, that it brings into question the value of the statistical methods. Using a dictionary lookup against all combinations of phrase pairs as a baseline, we compare three statistical methods and report the results. The three methods are (1) Mutual Information; (2) Expectation Maximization over word co-occurrence frequencies; and (3) EM over word alignments in every sentence. We also apply the dictionary lookup as a post-process to tackle overgeneration.

Year	Venue	Keywords
2003	CICLing	useful translation resource,dictionary lookup,parsed parallel corpus,mutual information,word co-occurrence frequency,relevant translation candidate,word correspondence,statistical machine translation enterprise,statistical method,expectation maximization,word alignment
Field	DocType	Volume
Computer science,Machine translation,Phrase,Parallel corpora,Artificial intelligence,Natural language processing,Pattern recognition,Expectation–maximization algorithm,Speech recognition,Mutual information,Parsing,Sentence,Statistical analysis	Conference	2588
ISSN	ISBN	Citations
0302-9743	3-540-00532-3	1
PageRank	References	Authors
0.37	6	2

Authors (2 rows)

Name	Order	Citations	PageRank
Stephen Nightingale	1	1	1.39
Hideki Tanaka	2	80	15.07
