Content area

Abstract

The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.

Details

Title
DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling
Author
Du, Rundong 1 ; Kuang, Da 2 ; Drake, Barry 3 ; Park, Haesun 4 

 School of Mathematics, Georgia Institute of Technology, Atlanta, GA, USA 
 Department of Mathematics, University of California, Los Angeles, CA, USA 
 Georgia Tech Research Institute, Georgia Institute of Technology, Atlanta, GA, USA 
 School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA 
Pages
777-798
Publication year
2017
Publication date
Aug 2017
Publisher
Springer Nature B.V.
ISSN
09255001
e-ISSN
15732916
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
1918796586
Copyright
Journal of Global Optimization is a copyright of Springer, 2017.