Title | ||
---|---|---|
Finding ideographic representations of Japanese names written in Latin script via language identification and corpus validation |
Abstract | ||
---|---|---|
Multilingual applications frequently involve dealing with proper names, but names are often missing in bilingual lexicons. This problem is exacerbated for applications involving translation between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK) where simple string copying is not a solution. We present a novel approach for generating the ideographic representations of a CJK name written in a Latin script. The proposed approach involves first identifying the origin of the name, and then back-transliterating the name to all possible Chinese characters using language-specific mappings. To reduce the massive number of possibilities for computation, we apply a three-tier filtering process by filtering first through a set of attested bigrams, then through a set of attested terms, and lastly through the WWW for a final validation. We illustrate the approach with English-to-Japanese back-transliteration. Against test sets of Japanese given names and surnames, we have achieved average precisions of 73% and 90%, respectively. |
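
The pipeline described in the abstract (generate every candidate, then narrow through three filters) can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the name is already identified as Japanese and split into reading segments, the mapping table and attested sets are tiny hand-picked toys, and the web validation step is replaced by a hypothetical lookup table of hit counts. All function and variable names are illustrative.

```python
from itertools import product

# Toy, hand-picked data for illustration only; none of it comes from the paper.
READING_TO_KANJI = {
    "yama": ["山"],
    "da": ["田", "打", "駄"],
    "ta": ["田", "太", "多"],
    "naka": ["中", "仲"],
}
ATTESTED_BIGRAMS = {"山田", "中田", "仲田"}
ATTESTED_TERMS = {"山田", "中田"}
# Hypothetical stand-in for web hit counts (no real web query is made).
TOY_WEB_COUNTS = {"山田": 1000, "中田": 800}


def generate_candidates(segments):
    """Back-transliterate: every combination of characters for the segments."""
    options = [READING_TO_KANJI.get(seg, []) for seg in segments]
    return ["".join(chars) for chars in product(*options)]


def has_attested_bigrams(candidate, bigrams):
    """Tier 1 check: every adjacent character pair must be an attested bigram."""
    return all(candidate[i:i + 2] in bigrams for i in range(len(candidate) - 1))


def back_transliterate(segments, min_count=1):
    """Apply the three filters in order and rank survivors by toy hit count."""
    candidates = generate_candidates(segments)
    tier1 = [c for c in candidates if has_attested_bigrams(c, ATTESTED_BIGRAMS)]
    tier2 = [c for c in tier1 if c in ATTESTED_TERMS]
    tier3 = [c for c in tier2 if TOY_WEB_COUNTS.get(c, 0) >= min_count]
    return sorted(tier3, key=lambda c: TOY_WEB_COUNTS[c], reverse=True)
```

Calling `back_transliterate(["naka", "ta"])` expands to six candidate strings, the bigram filter keeps two of them, and the term filter keeps one. Ordering the filters from cheapest to most expensive means the costly final validation only runs on the few candidates that survive the earlier tiers.
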
Year | DOI | Venue |
---|---|---|
2004 | 10.3115/1218955.1218979 | ACL |
Keywords | Field | DocType |
---|---|---|
english-to-japanese back-transliteration,attested term,novel approach,latin script,cjk name,attested bigrams,ideographic representation,corpus validation,test set,asian language,proper name,possible chinese character,language identification,japanese name,scripting language,proper names | Chinese characters,Computer science,Copying,Latin script,Natural language processing,Language identification,Bigram,Artificial intelligence,Proper noun,Linguistics | Conference |
Volume | Citations | PageRank |
---|---|---|
P04-1 | 24 | 1.29 |
References | Authors |
---|---|
5 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yan Qu | 1 | 24 | 1.29 |
Gregory Grefenstette | 2 | 1129 | 147.00 |