Title | ||
---|---|---|
DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling. |
Abstract | ||
---|---|---|
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK. |
Year | DOI | Venue |
---|---|---|
2017 | 10.1007/s10898-017-0515-z | J. Global Optimization |
Keywords | Field | DocType |
Constrained low rank approximation,Nonnegative matrix factorization,Divide and conquer,Clustering,Topic modeling,Text analysis,Scalable algorithms | Data set,Data analysis,Binary tree,Theoretical computer science,Non-negative matrix factorization,Topic model,Divide and conquer algorithms,Cluster analysis,Mathematics,Scalability | Journal |
Volume | Issue | ISSN |
68 | 4 | 0925-5001 |
Citations | PageRank | References |
3 | 0.38 | 26 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
rundong du | 1 | 5 | 2.43 |
Da Kuang | 2 | 119 | 5.90 |
Barry L. Drake | 3 | 100 | 11.59 |
Haesun Park | 4 | 3546 | 232.42 |