Abstract
Researchers today can collect and access data with large numbers of variables. Data sets that have many features and relatively few observations are referred to as high-dimensional data. Building statistical models and drawing statistical inference from high-dimensional data is beyond the scope of well-developed classical methods such as ordinary least squares. Penalized regression models have become one of the most popular tools in this setting. This thesis proposes novel approaches that improve the predictive performance and inference of penalized models.
The first paper develops a novel testing procedure, Projection Inference for Penalized Regression Estimator (PIPE). Based on estimates from an initial penalized linear regression model, PIPE provides a computationally efficient way to compute test statistics that can be used for false discovery rate control. In the second paper, I extend the PIPE procedure to accommodate binary outcomes with penalized logistic regression. For both the linear and binary cases, the validity of the proposed PIPE procedure is studied carefully through its theoretical properties and empirical performance.
In the third paper, two novel cross-validation approaches, the cross-validated linear predictor and cross-validated deviance residuals, are developed for Cox regression, where models built on the partial likelihood pose an inherent challenge for cross-validation. Both approaches can be used to conduct model selection for penalized Cox regression models. I assess these methods and compare them with two existing approaches in a comprehensive set of simulations. The cross-validated linear predictor approach has the best overall performance.
For all methods developed in this thesis, I illustrate their usage on real high-dimensional data sets.





