Title
DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling.
Abstract
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.
Year
DOI
Venue
2017
10.1007/s10898-017-0515-z
J. Global Optimization
Keywords
Field
DocType
Constrained low rank approximation,Nonnegative matrix factorization,Divide and conquer,Clustering,Topic modeling,Text analysis,Scalable algorithms
Data set,Data analysis,Binary tree,Theoretical computer science,Non-negative matrix factorization,Topic model,Divide and conquer algorithms,Cluster analysis,Mathematics,Scalability
Journal
Volume
Issue
ISSN
68
4
0925-5001
Citations 
PageRank 
References 
3
0.38
26
Authors
4
Name
Order
Citations
PageRank
rundong du152.43
Da Kuang21195.90
Barry L. Drake310011.59
Haesun Park43546232.42