Fast and Provable Algorithms for Sparse Principal Component Analysis

Abstract

Principal component analysis (PCA) is a well-known statistical method for feature extraction and dimension reduction, widely used for data analysis. However, traditional PCA encounters problems of overfitting and loss of explainability in high-dimensional settings, particularly when the number of variables exceeds the sample size. Sparse PCA overcomes these limitations by introducing sparsity into the principal components, offering a robust alternative to PCA and obtaining more interpretable results. In this thesis, we investigate the spiked covariance model and the spiked Wigner model in sparse PCA.

We first explore the spiked covariance model, which aims to recover a sparse unit vector from noisy samples. From an information-theoretic perspective, Ω(k log p) observations are sufficient to recover a k-sparse p-dimensional vector v. However, existing polynomial-time methods require at least Ω(k 2 ) samples for successful recovery, highlighting a significant gap in sample efficiency. To bridge this gap, we introduce a novel thresholding-based algorithm that requires only Ω(k log p) samples, provided the signal strength λ = Ω(∥v∥ −1 ∞ ). We also propose a two-stage nonconvex algorithm that further enhances estimation performance. This approach integrates our thresholding algorithm with the truncated power iteration, achieving the minimax optimal rate of statistical error under the desired sample complexity. Numerical experiments validate the superior performance of our algorithms in terms of estimation accuracy and computational efficiency.

Secondly, we study the spiked Wigner model, which aims to recover a s-sparse d-dimensional unit vector u from a d×d noisy matrix. The information theoretical lower bound of the signal strength required to estimate u is β = Ω(√ s log d). In contrast, the signal strength required for existing polynomial-time methods is at least Ω( e s), leading to a notable gap. To close this gap, we propose a new thresholding-based algorithm that requires only Ω(√ s log d) signal strength, given ∥u∥∞ = Ω(1). We also design a two-stage nonconvex method that further improves estimation accuracy. This approach combines our thresholding algorithm with the truncated power iteration, achieving the constant error in limited iterations under the desired signal strength. Empirical results show the advanced performance of our algorithms in terms of the estimation error and computational cost.

Details

Business indexing term

Subject:

Big Data;
Costs

Subject

Sparsity;
Big Data;
Sample size;
Costs;
Normal distribution;
Signal processing;
Eigenvalues;
Eigenvectors;
Mathematics

Classification

0405: Mathematics

Identifier / keyword

Principal component analysis; Algorithms

URL

https://doi.org/10.14711/thesis-hdl152460

Title

Fast and Provable Algorithms for Sparse Principal Component Analysis

Author

Xian, Zhuozhi

Number of pages

110

Publication year

2025

Degree date

2025

School code

1223

Source

DAI-B 87/5(E), Dissertation Abstracts International

ISBN

9798263313760

Advisor

Cai, Jianfeng

University/institution

Hong Kong University of Science and Technology (Hong Kong)

University location

Hong Kong

Degree

Ph.D.

Source type

Dissertation or Thesis

Language

English

Document type

Dissertation/Thesis

Dissertation/thesis number

32407392

ProQuest document ID

3273603877

Document URL

https://www.proquest.com/dissertations-theses/fast-provable-algorithms-sparse-principal/docview/3273603877/se-2?accountid=208611

Go to Library Record

https://doi.org/10.14711/thesis-hdl152460

Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.

Database

ProQuest One Academic

Fast and Provable Algorithms for Sparse Principal Component Analysis

Content area

Abstract

Details