Abstract |
---|
Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional sparse text data, achieving good results in both clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, which makes it inefficient to scale up to long texts and large corpora, common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based DMM training from O(K·L) to O(K·U), where K is the number of clusters, L the average document length, and U the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm that further reduces the sampling complexity from O(K·U) to O(U) in the near-convergence training stages. Moreover, we parallelize DMM model training using an uncollapsed Gibbs sampler to obtain further acceleration. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale up for long and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that it achieves substantial speedups over existing state-of-the-art algorithms without degrading clustering accuracy. |
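The O(K·L) → O(K·U) reduction described in the abstract can be illustrated with a collapsed Gibbs sampling step for a DMM (GSDMM-style) document reassignment. The sketch below is an assumption-laden illustration, not the paper's actual X-DMM implementation: the function names (`dmm_log_probs`, `sample_cluster`), data structures, and hyperparameter names are all hypothetical. The key idea it demonstrates is that, under a symmetric Dirichlet prior, the per-word likelihood product collapses into ratios of gamma functions, so the inner loop visits each unique word once (O(U)) rather than each token (O(L)).

```python
import math
import random
from collections import Counter

def dmm_log_probs(doc, K, V, alpha, beta, m, n_kw, n_k):
    """Per-cluster log-probabilities for reassigning one document under a
    collapsed DMM Gibbs sampler (illustrative notation, not the paper's).

    doc     : list of word ids in the document
    m[k]    : number of documents currently in cluster k
    n_kw[k] : dict mapping word id -> count of that word in cluster k
    n_k[k]  : total token count of cluster k
    The constant document-count normalizer is the same for every k,
    so it is omitted.
    """
    counts = Counter(doc)                      # unique words + multiplicities
    L = len(doc)
    log_p = []
    for k in range(K):
        lp = math.log(m[k] + alpha)            # cluster popularity prior
        for w, c in counts.items():            # O(U): one term per unique word
            base = n_kw[k].get(w, 0) + beta
            # prod_{j=0}^{c-1} (base + j) expressed via log-gamma ratios,
            # making each unique word O(1) instead of O(count)
            lp += math.lgamma(base + c) - math.lgamma(base)
        tot = n_k[k] + V * beta
        lp -= math.lgamma(tot + L) - math.lgamma(tot)   # O(1) normalizer
        log_p.append(lp)
    return log_p

def sample_cluster(doc, K, V, alpha, beta, m, n_kw, n_k, rng=random):
    """Draw a cluster index for `doc` from the normalized probabilities."""
    log_p = dmm_log_probs(doc, K, V, alpha, beta, m, n_kw, n_k)
    mx = max(log_p)                            # stabilize before exponentiating
    p = [math.exp(x - mx) for x in log_p]
    r = rng.random() * sum(p)
    acc = 0.0
    for k, pk in enumerate(p):
        acc += pk
        if r < acc:
            return k
    return K - 1
```

A full sampler would repeatedly remove a document's counts, call `sample_cluster`, and add the counts back for the chosen cluster; the Metropolis-Hastings variant in the abstract would further avoid evaluating all K clusters per draw.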
Year | Venue | Field |
---|---|---|
2019 | Thirty-Third AAAI Conference on Artificial Intelligence / Thirty-First Innovative Applications of Artificial Intelligence Conference / Ninth AAAI Symposium on Educational Advances in Artificial Intelligence | Computer science, Document clustering, Algorithm, Sampling (statistics), Artificial intelligence, Dirichlet distribution, Cluster analysis, Time complexity, Gibbs sampling, Machine learning, Scalability, Speedup |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Linwei Li | 1 | 9 | 3.94 |
Liangchen Guo | 2 | 0 | 0.68 |
Zhenying He | 3 | 158 | 16.03 |
Yinan Jing | 4 | 0 | 3.04 |
X. Sean Wang | 5 | 2598 | 382.56 |