Title
Detecting new Chinese words from massive domain texts with word embedding
Abstract
Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they cannot identify all new words appropriately and efficiently. Identifying new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlation information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining to discover new words in domain corpora. First, we map all word units in a domain corpus to a high-dimensional word vector space. Second, we use a frequent n-gram word string mining method to identify a set of candidate new words. We design a pruning strategy based on the word vectors that quantifies the likelihood of a word string being a new word, so that candidates are evaluated by the similarity of the word units within the same string. In a comparative study, our experimental results show that WEBM has a substantial advantage in detecting new words from massive Chinese corpora.
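The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact pruning rule, so the score used here (mean pairwise cosine similarity of the units in a string, compared against a threshold) is an assumption, as are the function names, the `min_count` and `min_score` parameters, and the hand-made two-dimensional toy vectors. In practice the vectors would come from an embedding model trained on the domain corpus.

```python
from collections import Counter
from itertools import combinations
import math


def frequent_ngrams(sentences, n_max=3, min_count=2):
    """Count contiguous strings of 2..n_max word units and keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, n_max + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def string_score(gram, vectors):
    """Assumed score: mean pairwise cosine similarity of the units in one string."""
    pairs = list(combinations(gram, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def detect_new_words(sentences, vectors, min_count=2, min_score=0.8):
    """Mine frequent unit strings, then prune those whose units are dissimilar."""
    candidates = frequent_ngrams(sentences, min_count=min_count)
    return sorted(
        "".join(gram)
        for gram in candidates
        if all(u in vectors for u in gram)
        and string_score(gram, vectors) >= min_score
    )


# Toy corpus, already split into word units by a segmenter (hypothetical data).
sentences = [
    ["区块", "链", "技术", "发展"],
    ["区块", "链", "应用", "发展"],
    ["技术", "发展", "很", "快"],
]

# Hand-made toy vectors standing in for trained embeddings (hypothetical data).
vectors = {
    "区块": [0.9, 0.1],
    "链": [0.8, 0.2],
    "技术": [0.1, 0.9],
    "发展": [0.9, 0.1],
    "应用": [0.5, 0.5],
    "很": [0.2, 0.3],
    "快": [0.3, 0.2],
}

new_words = detect_new_words(sentences, vectors)
```

In this toy data, the strings "区块 链" and "技术 发展" both occur twice, but only the first has units with near-parallel vectors, so it is the only candidate that survives the pruning step.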
Year: 2019
DOI: 10.1177/0165551518786676
Venue: Periodicals
Keywords: Natural language processing, new word detection, similarity measurement, textual information retrieval, word embedding
Field: Information retrieval, Computer science, Textual information, Text segmentation, Word embedding
DocType: Journal
Volume: 45
Issue: 2
ISSN: 0165-5515
Citations: 0
PageRank: 0.34
References: 15
Authors: 6
Name           Order  Citations  PageRank
Yu Qian        1      7          3.52
Yang Du        2      23         6.91
Xiongwen Deng  3      0          0.34
Baojun Ma      4      47         7.38
Qiongwei Ye    5      0          0.34
Hua Yuan       6      51         8.89