Abstract | ||
---|---|---|
Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlated information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining for discovering new words from domain corpora. First, we mapped all word units in a domain corpus to a high-dimension word vector space. Second, we used a frequent n-gram word string mining method to identify a set of candidates for new words. We designed a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing the evaluation of candidates based on the similarity of word units in the same string. In a comparative study, our experimental results revealed that WEBM had a great advantage in detecting new words from massive Chinese corpora. |
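
The two steps named in the abstract (frequent n-gram string mining, then pruning by word-vector similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring function or thresholds, so the mean cosine similarity of adjacent units, the `min_count` and `min_score` values, the function names, and the hand-written two-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from an embedding model trained on the domain corpus.

```python
from collections import Counter
from math import sqrt


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count contiguous strings of 2..max_n word units and keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def string_score(gram, vectors):
    """Assumed score: mean cosine similarity of adjacent units in the string."""
    if any(unit not in vectors for unit in gram):
        return 0.0
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)


def prune(candidates, vectors, min_score=0.8):
    """Keep candidate strings whose score reaches the (assumed) threshold."""
    return sorted(g for g in candidates if string_score(g, vectors) >= min_score)


# Toy corpus, already segmented into word units. The pinyin tokens stand in
# for Chinese units: qukuai + lian = "blockchain", jishu = "technology",
# hen = "very", xin = "new", re = "hot".
sentences = [
    ["qukuai", "lian", "jishu", "hen", "xin"],
    ["qukuai", "lian", "hen", "re"],
    ["jishu", "hen", "xin"],
]

# Hypothetical hand-written vectors, not trained embeddings.
vectors = {
    "qukuai": [0.9, 0.1],
    "lian": [0.8, 0.2],
    "jishu": [0.1, 0.9],
    "hen": [-0.5, 0.5],
    "xin": [0.5, 0.5],
    "re": [0.6, 0.4],
}

candidates = frequent_ngrams(sentences)
new_words = prune(candidates, vectors)
print(new_words)  # [('qukuai', 'lian')]
```

In this toy run, four strings pass the frequency filter, but only `('qukuai', 'lian')` survives pruning, because its two units have nearly parallel vectors. Frequent strings such as `('jishu', 'hen')` are dropped because their units are dissimilar in the vector space.
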
Year | DOI | Venue |
---|---|---|
2019 | 10.1177/0165551518786676 | Periodicals |
Keywords | Field | DocType |
Natural language processing, new word detection, similarity measurement, textual information retrieval, word embedding | Information retrieval, Computer science, Textual information, Text segmentation, Word embedding | Journal |
Volume | Issue | ISSN |
45 | 2 | 0165-5515 |
Citations | PageRank | References |
0 | 0.34 | 15 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yu Qian | 1 | 7 | 3.52 |
Yang Du | 2 | 23 | 6.91 |
Xiongwen Deng | 3 | 0 | 0.34 |
Baojun Ma | 4 | 47 | 7.38 |
Qiongwei Ye | 5 | 0 | 0.34 |
Hua Yuan | 6 | 51 | 8.89 |