Title
Corpus-based topic diffusion for short text clustering.
Abstract
In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Because sparseness causes insufficient word co-occurrence and a lack of context information, previous studies use external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistency. Corpus-based approaches, by contrast, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. On these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources. © 2017 Elsevier B.V. All rights reserved.
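The mapping described in the abstract (documents to topic space, then back to word space with virtual frequencies for absent words) can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not give the conjugate definitions or the generative model, so the sketch assumes the word-topic probabilities are already available as two matrices. The function name `enrich`, the parameter `scale`, and the toy vocabulary are all hypothetical.

```python
import numpy as np


def enrich(X, p_topic_given_word, p_word_given_topic, scale=1.0):
    """Add virtual term frequencies to words absent from each document.

    Assumed inputs (hypothetical, not taken from the paper):
      X                   (n_docs, n_words) raw term counts
      p_topic_given_word  (n_words, n_topics), each row sums to 1
      p_word_given_topic  (n_topics, n_words), each row sums to 1
    """
    X = np.asarray(X, dtype=float)
    lengths = X.sum(axis=1, keepdims=True)

    # Map each document into topic space: the count-weighted average
    # of the topic posteriors of the words it contains.
    theta = (X @ p_topic_given_word) / np.maximum(lengths, 1.0)

    # Map back to word space: a smoothed probability of every word
    # given the document's topic mixture.
    p_word_given_doc = theta @ p_word_given_topic

    # Observed words keep their counts; absent words receive a virtual
    # frequency proportional to their smoothed probability.
    virtual = scale * lengths * p_word_given_doc
    return np.where(X > 0, X, virtual)


# Toy example: vocabulary [apple, banana, python, java],
# topics [fruit, programming].
p_tw = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
p_wt = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
X_toy = np.array([[1, 0, 0, 0], [0, 0, 2, 0]])
X_enriched = enrich(X_toy, p_tw, p_wt)
```

In the toy data, the document containing only "apple" gains a fractional count for "banana", and the document containing only "python" gains one for "java". The enriched matrix could then be passed to any basic clustering algorithm.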
Year: 2018
DOI: 10.1016/j.neucom.2017.11.019
Venue: NEUROCOMPUTING
Keywords: Text enrichment, Short text, Clustering, Text mining
Field: Noisy text analytics, Document clustering, Data mapping, Computer science, Artificial intelligence, WordNet, Cluster analysis, Data point, Feature vector, Information retrieval, Pattern recognition, Smoothing, Machine learning
DocType: Journal
Volume: 275
ISSN: 0925-2312
Citations: 5
PageRank: 0.49
References: 30
Authors: 3
Name           Order  Citations  PageRank
Chutao Zheng   1      9          1.90
Cheng Liu      2      33         5.72
Hau-San Wong   3      1008       86.89