Detecting new Chinese words from massive domain texts with word embedding - Citegraph

Paper Info

Title
Detecting new Chinese words from massive domain texts with word embedding

Abstract
Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlated information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining for discovering new words from domain corpora. First, we mapped all word units in a domain corpus to a high-dimension word vector space. Second, we used a frequent n-gram word string mining method to identify a set of candidates for new words. We designed a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing the evaluation of candidates based on the similarity of word units in the same string. In a comparative study, our experimental results revealed that WEBM had a great advantage in detecting new words from massive Chinese corpora.
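
The pipeline outlined in the abstract (embed word units, mine frequent n-gram strings, prune candidates by the similarity of their units) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the actual pruning formula, so average pairwise cosine similarity is used here as an assumed stand-in. The function names, the thresholds `min_count` and `min_sim`, and the toy two-dimensional vectors are all hypothetical. In practice the vectors would come from a word embedding model trained on the segmented domain corpus.

```python
import math
from collections import Counter
from itertools import combinations


def frequent_ngrams(sentences, n_max=3, min_count=2):
    """Count n-grams (2 <= n <= n_max) of word units; keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, n_max + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unit_similarity(gram, vectors):
    """Assumed scoring rule: mean cosine similarity over all pairs of units."""
    pairs = list(combinations(gram, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def detect_new_words(sentences, vectors, n_max=3, min_count=2, min_sim=0.8):
    """Mine frequent n-gram strings, then prune those whose units are dissimilar."""
    candidates = frequent_ngrams(sentences, n_max, min_count)
    kept = [
        "".join(gram)
        for gram in candidates
        if all(unit in vectors for unit in gram)
        and unit_similarity(gram, vectors) >= min_sim
    ]
    return sorted(kept)


# Toy input: a corpus already split into word units by a segmenter,
# and hypothetical 2-d vectors standing in for trained embeddings.
sentences = [
    ["区", "块", "链", "技术"],
    ["区", "块", "链", "技术"],
]
vectors = {
    "区": [1.0, 0.1],
    "块": [0.9, 0.2],
    "链": [1.0, 0.15],
    "技术": [0.0, 1.0],
}

result = detect_new_words(sentences, vectors)
print(result)
```

With these toy vectors, strings built only from the mutually similar units 区, 块 and 链 survive pruning, while frequent strings that include the dissimilar unit 技术 are discarded. Note that this sketch also keeps substrings of a longer surviving candidate; whether and how the paper filters those is not stated in the abstract.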

Year	2019
DOI	10.1177/0165551518786676
Venue	Periodicals
Keywords	Natural language processing, new word detection, similarity measurement, textual information retrieval, word embedding
Field	Information retrieval, Computer science, Textual information, Text segmentation, Word embedding
DocType	Journal
Volume	45
Issue	2
ISSN	0165-5515
Citations	0
PageRank	0.34
References	15
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Yu Qian	1	7	3.52
Yang Du	2	23	6.91
Xiongwen Deng	3	0	0.34
Baojun Ma	4	47	7.38
Qiongwei Ye	5	0	0.34
Hua Yuan	6	51	8.89
