DC-NMF: nonnegative matrix factorization based on

Abstract

The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.

Details

Title

DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

Author

Du, Rundong¹; Kuang, Da²; Drake, Barry³; Park, Haesun⁴

¹ School of Mathematics, Georgia Institute of Technology, Atlanta, GA, USA
² Department of Mathematics, University of California, Los Angeles, CA, USA
³ Georgia Tech Research Institute, Georgia Institute of Technology, Atlanta, GA, USA
⁴ School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA

Pages

777-798

Publication year

2017

Publication date

Aug 2017

Publisher

Springer Nature B.V.

ISSN

09255001

e-ISSN

15732916

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1007/s10898-017-0515-z

ProQuest document ID

1918796586

DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

Content area

Abstract

Details

Suggested sources