© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Non-negative matrix factorization (NMF) is a relatively new method of matrix decomposition which factors an m × n data matrix X into an m × k matrix W and a k × n matrix H, so that X ≈ W × H. Importantly, all values in X, W, and H are constrained to be non-negative. NMF can be used for dimensionality reduction, since the k columns of W can be considered components into which X has been decomposed. The question arises: how does one choose k? In this paper, we first assess methods for estimating k in the context of NMF in synthetic data. Second, we examine the effect of normalization on this estimate’s accuracy in empirical data. In synthetic data with orthogonal underlying components, methods based on PCA and Brunet’s Cophenetic Correlation Coefficient achieved the highest accuracy. When evaluated on a well-known real dataset, normalization had an unpredictable effect on the estimate. For any given normalization method, the methods for estimating k gave widely varying results. We conclude that when estimating k, it is best not to apply normalization. If the underlying components are known to be orthogonal, then Velicer’s MAP or Minka’s Laplace-PCA method might be best. However, when the orthogonality of the underlying components is unknown, none of the methods seemed preferable.
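
To make the factorization concrete, the following is a minimal sketch in Python with NumPy. It is not one of the estimation methods assessed in the article; it assumes the widely used Lee-Seung multiplicative updates for the Frobenius loss, and the names `nmf` and `relative_error`, the iteration count, and the synthetic rank-3 data are all illustrative assumptions. It only shows how the reconstruction error of X ≈ W × H changes as k varies.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix X into W (m x k) and H (k x n).

    Illustrative sketch: Lee-Seung multiplicative updates for the
    Frobenius loss, assumed here and not taken from the article.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Non-negative random starting values; eps keeps entries strictly positive.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(X, W, H):
    """Frobenius norm of the residual X - WH, relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Hypothetical synthetic data: a non-negative 30 x 20 matrix built from 3 components.
rng = np.random.default_rng(42)
X = rng.random((30, 3)) @ rng.random((3, 20))

# Fit the factorization for several candidate values of k.
errors = {k: relative_error(X, *nmf(X, k)) for k in range(1, 6)}
for k, err in errors.items():
    print(f"k={k}: relative error {err:.4f}")
```

Reconstruction error alone generally keeps shrinking as k grows, which is why dedicated criteria such as those compared in the article are needed to choose k.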

Details

Title
Assessing Methods for Evaluating the Number of Components in Non-Negative Matrix Factorization
Author
Maisog, José M 1; DeMarco, Andrew T 2; Devarajan, Karthik 3; Young, Stanley 4; Fogel, Paul 5; Luta, George 6

1 Blue Health Intelligence, Chicago, IL 60601, USA; [email protected]
2 Department of Rehabilitation Medicine, Georgetown University Medical Center, Washington, DC 20057, USA
3 Department of Biostatistics and Bioinformatics, Fox Chase Cancer Center, Temple University Health System, Philadelphia, PA 19111, USA; [email protected]
4 GCStat, 3401 Caldwell Drive, Raleigh, NC 27607, USA; [email protected]
5 Advestis, 69 Boulevard Haussmann, 75008 Paris, France; [email protected]
6 Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC 20057, USA; [email protected]; Department of Clinical Epidemiology, Aarhus University, 8000 Aarhus, Denmark; The Parker Institute, Copenhagen University Hospital, 2000 Frederiksberg, Denmark
First page
2840
Publication year
2021
Publication date
2021
Publisher
MDPI AG
e-ISSN
22277390
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2602143497