Academic Editor: Jennifer Wu
Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa 999078, Macau
Received 24 October 2014; Revised 5 April 2015; Accepted 30 April 2015; Published 16 June 2015
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Identifying molecular biomarkers or signaling pathways involved in a phenotype is a particularly important problem in genomic studies. Logistic regression is a powerful discriminative method with an explicit statistical interpretation: it provides probabilities of class membership rather than only class labels.
A key challenge in identifying diagnostic or prognostic biomarkers with the logistic regression model is that, in most genomic studies, the number of observations is much smaller than the number of measured biomarkers. This limitation causes instability in the algorithms used to select gene markers. Regularization methods have been widely used to deal with this problem of high dimensionality. For example, Shevade and Keerthi proposed sparse logistic regression based on the Lasso penalty [1, 2], and Meier et al. investigated logistic regression with the group Lasso [3]. Lasso-type procedures are often called L1-norm regularization methods. However, L1 regularization may yield inconsistent selections in some variable selection settings [4] and often introduces extra bias into the estimation [5]. In many genomic studies we need a sparser solution for interpretability and accurate outcomes, and L1 regularization falls short of these requirements. Thus, a further improvement of regularization is needed. Lq (0 < q < 1) regularization can generate sparser and more precise solutions than L1 regularization. Moreover, the L1/2 penalty can be taken as a representative of the Lq (0 < q < 1) penalties and has demonstrated many attractive properties that L1 regularization approaches lack, such as unbiasedness, sparsity, and the oracle property [6-8].
A large amount of molecular interaction information about disease-related biological processes has now been observed and gathered in databases covering many aspects of biological systems. For example, BioGRID records biological interactions collected from more than 43,468 publications [9]. These regulatory relationships are usually represented as a network. Combining this graphical information extracted from biological processes with an analysis of gene-expression data has provided useful prior information for detecting noise and removing confounding factors in several classification and regression models [10-14].
Inspired by the aforementioned methods and ideas, we define a network-constrained logistic regression model with an L1/2 penalty following the framework established in [11], where the predictors are gene-expression data combined with biological network knowledge. The proposed model aims to identify disease-related biomarkers and subnetworks. To achieve better prediction, we use an enhanced half thresholding algorithm for L1/2 regularization, which is more efficient than the older half thresholding approach in the literature [6, 15, 16].
The rest of the paper is organized as follows. In Section 2, we propose a network-constrained logistic regression model with L1/2 regularization. In Section 3, we present an enhanced half thresholding method for L1/2 regularization and the corresponding coordinate descent algorithm. In Section 4, we evaluate the performance of the proposed approach on simulated data and present an application to a lung cancer dataset. We conclude the paper in Section 5.
2. L1/2 Penalized Network-Constrained Logistic Regression Model
Generally, assume that a dataset D has n samples, D = {(X_1, y_1), ..., (X_n, y_n)}, where X_i = (x_i1, ..., x_ip) is the ith sample with p genes and y_i is the corresponding response variable taking the value 0 or 1. Define a classifier f(η) = exp(η)/(1 + exp(η)); the logistic regression model is then

P(y_i = 1 | X_i) = f(X_i β) = exp(X_i β) / (1 + exp(X_i β)),   (1)

where β = (β_1, ..., β_p)^T are the coefficients to be estimated. We can obtain β by minimizing the negative log-likelihood of the logistic regression, -ℓ(β) = -Σ_{i=1}^n [y_i X_i β - log(1 + exp(X_i β))]. Following [11], to combine a biological network with the analysis of gene microarray data, we use a Laplacian constraint. Consider a graph G = (V, E, W), where V is the set of genes corresponding to the p explanatory variables and E is the set of edges. If gene u and gene v are connected, there is an edge between them, denoted u ~ v; otherwise there is no edge. w(u, v) denotes the weight of edge (u, v). The normalized Laplacian matrix L of G is defined elementwise by

L(u, v) = 1 - w(u, v)/d_u   if u = v and d_u ≠ 0,
L(u, v) = -w(u, v)/sqrt(d_u d_v)   if u and v are connected,
L(u, v) = 0   otherwise,   (2)

where d_u and d_v are the degrees of genes u and v, respectively; the degree of gene u (or v) is the number of edges connected to u (or v). For the graph G, the network-constrained logistic regression model is

min_β { -ℓ(β) + λ2 β^T L β },   (3)

where the first term in (3) is the negative log-likelihood of the logistic model and the second term is a network constraint based on the Laplacian matrix, which induces a solution β that is smooth on the graph.
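To make the quantities in (1)-(3) concrete, the following minimal sketch builds the normalized Laplacian from a weighted adjacency matrix and evaluates the logistic probabilities. The function names and the toy three-gene network are illustrative rather than part of the original model, and a zero diagonal (no self-loops) is assumed for the adjacency matrix.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian of a gene network, as in (2): diagonal entries are 1
    for genes with nonzero degree, off-diagonal entries are -w(u, v)/sqrt(d_u d_v)
    for connected genes, and 0 otherwise (W is assumed to have a zero diagonal)."""
    d = W.sum(axis=1)                               # gene degrees d_u
    with np.errstate(divide="ignore"):
        inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = -W * np.outer(inv_sqrt, inv_sqrt)           # off-diagonal entries
    np.fill_diagonal(L, np.where(d > 0, 1.0, 0.0))  # diagonal entries
    return L

def logistic_prob(X, beta):
    """P(y = 1 | X) = exp(X beta) / (1 + exp(X beta)) from (1)."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# toy example: gene 0 is connected to genes 1 and 2 with unit weights
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
L = normalized_laplacian(W)
```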
Directly solving (3) performs poorly for both prediction and biomarker selection when the number of genes p is much larger than the sample size n. Therefore, a regularization approach is needed. Adding a regularization term to (3), the sparse network-constrained logistic regression can be written as

min_β { -ℓ(β) + λ1 P(β) + λ2 β^T L β },   (4)

where λ1 is a regularization parameter. In Zhang et al. [13], the authors used the Lasso (L1) regularization term P(β) = Σ_{j=1}^p |β_j| to penalize (4). However, the result of Lasso-type (L1) regularization is often not sparse enough for interpretation, especially in genomic research. Moreover, L1 regularization is asymptotically biased [17, 18]. To improve the sparsity of the solution and its predictive accuracy, we need to move beyond L1 regularization to Lq penalties. Mathematically, Lq regularization (0 < q < 1) with a smaller value of q leads to sparser solutions and gives asymptotically unbiased estimates [17]. Moreover, the L1/2 penalty can be taken as a representative of the Lq penalties and admits an analytically expressive thresholding representation [6, 7]. Therefore, we propose a novel L1/2 net approach that uses L1/2 regularization to penalize the network-constrained logistic regression model:

min_β { -ℓ(β) + λ1 Σ_{j=1}^p |β_j|^{1/2} + λ2 β^T L β },   (5)

where λ1 ≥ 0 and λ2 ≥ 0 are regularization parameters.
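For reference, a direct (unoptimized) evaluation of objective (5) can be written as below. This is a sketch for checking candidate coefficient vectors under the stated model, not the fitting procedure, and the function name is ours.

```python
import numpy as np

def penalized_objective(beta, X, y, L, lam1, lam2):
    """Objective (5): negative log-likelihood of the logistic model plus the
    L1/2 penalty and the network (Laplacian) penalty beta^T L beta."""
    eta = X @ beta
    neg_loglik = np.sum(np.logaddexp(0.0, eta) - y * eta)  # -l(beta), stable form
    half_penalty = lam1 * np.sum(np.abs(beta) ** 0.5)      # L1/2 term
    network_penalty = lam2 * beta @ L @ beta               # smoothness on the graph
    return neg_loglik + half_penalty + network_penalty
```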
3. A Coordinate Descent Algorithm for the Network-Constrained Logistic Model with the Enhanced L1/2 Thresholding Operator
The L1/2 penalty function is nonconvex, which raises numerical challenges in fitting the model. Recently, coordinate descent algorithms [19] for solving nonconvex regularization models (SCAD [20], MCP [21]) have shown notable efficiency and convergence [22]. Since the computational burden increases only linearly with the number of features p, the coordinate descent algorithm is a powerful tool for high-dimensional problems. Its standard procedure is as follows: for every coefficient β_j, partially optimize the target function with respect to β_j while fixing the remaining elements β_k (k ≠ j) at their most recently updated values. The specific form of the update for β_j depends on the thresholding operator of the penalty.
In this paper, we use an enhanced L1/2 thresholding operator in the coordinate descent algorithm:

β_j ← Half(ω_j, λ1),   (6)

where Half(·, λ1) is the enhanced half thresholding operator, ω_j = Σ_{i=1}^n x_ij (y_i - ỹ_i^(j)) is the coordinate-wise fit for β_j, and ỹ_i^(j) = Σ_{k≠j} x_ik β_k is the partial residual for fitting β_j.
Remark. This enhanced L1/2 thresholding operator outperforms the older L1/2 thresholding operator introduced in [6, 15, 16]. The quality of the regularized solutions depends strongly on the value of the regularization parameter λ1. With this enhanced L1/2 thresholding operator, when λ1 is chosen by an efficient parameter-tuning strategy such as cross validation, the convergence of algorithm (6) is proved [7].
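The closed form of the operator is not reproduced in this extraction, so the sketch below implements the half thresholding function in the form given by Xu et al. [7]. Treating the threshold (54^(1/3)/4)·λ^(2/3) as the "enhanced" value (with roughly (3/4)·λ^(2/3) playing the role of the older threshold) is our assumption; the exact constants used by the authors may differ.

```python
import numpy as np

def half_threshold(omega, lam):
    """Half thresholding operator Half(omega, lam) applied to a scalar.

    Closed form follows Xu et al. [7]; the threshold (54**(1/3)/4) * lam**(2/3)
    is assumed here to be the 'enhanced' value (older works use a smaller
    threshold of roughly (3/4) * lam**(2/3))."""
    threshold = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(omega) <= threshold:
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
```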
The Laplacian matrix L is nonnegative definite; thus it can be factorized as L = S S^T (e.g., by a Cholesky-type decomposition). Following the approach of C. Li and H. Li [11], (4) can be expressed as

min_{β*} { -ℓ(β*; X*, y*) + λ* P(β*) },   (7)

where X* is the augmented design matrix obtained by appending the rows of sqrt(λ2) S^T to X, y* is the response vector augmented with p zeros, β* is the correspondingly rescaled coefficient vector, and λ* is the regularization parameter, which can be expressed in terms of λ1 and λ2.
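A sketch of this augmentation step, under the assumption that appending sqrt(λ2)·S^T to the design matrix (with zero pseudo-responses) carries the Laplacian penalty as in the linear-model construction of [11]; an eigendecomposition is used instead of a Cholesky factorization because L is only positive semidefinite.

```python
import numpy as np

def augmented_design(X, L, lam2):
    """Factor the Laplacian as L = S S^T and append sqrt(lam2) * S^T as extra
    rows of the design matrix, following the construction of [11]; an
    eigendecomposition is used since L is positive semidefinite."""
    d, U = np.linalg.eigh(L)
    d = np.clip(d, 0.0, None)            # guard against tiny negative eigenvalues
    S = U * np.sqrt(d)                   # scale columns so that S @ S.T == L
    X_aug = np.vstack([X, np.sqrt(lam2) * S.T])
    y_extra = np.zeros(S.shape[1])       # pseudo-responses appended to y (assumed zero)
    return X_aug, y_extra
```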
A one-term Taylor series expansion of (7) about the current estimate gives the penalized weighted least-squares approximation

L_Q(β) ≈ (1/2) Σ_{i=1}^n w_i (z_i - X_i β)^2 + penalty terms,   (8)

where z_i = X_i β + (y_i - p(X_i)) / (p(X_i)(1 - p(X_i))) is the estimated (working) response, w_i = p(X_i)(1 - p(X_i)) is the weight for the estimated response, and p(X_i) is the probability (1) evaluated at the current parameters. Thus, we can redefine the partial residual for fitting the current β_j as z_i - ỹ_i^(j), with ỹ_i^(j) = Σ_{k≠j} x_ik β_k (see the sketch after this paragraph). The full procedure of the coordinate descent algorithm for the L1/2 penalized network-constrained logistic model is then given in Algorithm 1.
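The working response and weights in (8) are the usual iteratively reweighted least squares quantities; a minimal sketch (with a small clipping constant of our choosing for numerical stability) is:

```python
import numpy as np

def working_response_and_weights(X, y, beta, eps=1e-5):
    """Working response z_i and weight w_i of the quadratic approximation (8),
    evaluated at the current estimate beta."""
    eta = X @ beta                          # current linear predictor X_i beta
    p = 1.0 / (1.0 + np.exp(-eta))          # fitted probabilities p(X_i)
    w = np.clip(p * (1.0 - p), eps, None)   # weights w_i, clipped for stability
    z = eta + (y - p) / w                   # working responses z_i
    return z, w
```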
Algorithm 1 (the coordinate descent algorithm for the L1/2 penalized network-constrained logistic model).
We consider the following.
Step 1. Initialize all β_j (j = 1, 2, ..., p) and the weights w_i and working responses z_i (i = 1, 2, ..., n); set the iteration counter m = 0, with λ1 and λ2 chosen by cross validation.
Step 2. Calculate z_i and w_i and approximate the loss function (8) based on the current β.
Step 3. Update each β_j, cycling over j = 1, 2, ..., p, until β does not change.
Step 3.1. Compute ỹ_i^(j) = Σ_{k≠j} x_ik β_k and ω_j = Σ_{i=1}^n w_i x_ij (z_i - ỹ_i^(j)).
Step 3.2. Update β_j with the enhanced L1/2 thresholding operator (6), β_j ← Half(ω_j, λ1).
Step 4. Let m = m + 1 and update the current estimate of β.
If β does not converge, repeat Steps 2 and 3.
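Putting the pieces together, a minimal sketch of Algorithm 1 could look as follows. It reuses half_threshold and working_response_and_weights from the sketches above, performs one coordinate pass per reweighting step, rescales each coordinate by its weighted norm (a practical choice of ours), and treats the input X as either the raw or the augmented design matrix of Section 3.

```python
import numpy as np

def cd_half_logistic(X, y, lam, max_outer=50, tol=1e-4):
    """Sketch of Algorithm 1: coordinate descent for the L1/2 penalized
    logistic model. X may be the augmented design matrix of Section 3 so
    that the network penalty is already absorbed into the data."""
    n, p = X.shape
    beta = np.zeros(p)                                       # Step 1
    for _ in range(max_outer):
        beta_old = beta.copy()
        z, w = working_response_and_weights(X, y, beta)      # Step 2
        for j in range(p):                                   # Step 3
            r = z - X @ beta + X[:, j] * beta[j]             # partial residual z_i - y~_i^(j)
            omega = np.sum(w * X[:, j] * r) / np.sum(w * X[:, j] ** 2)
            beta[j] = half_threshold(omega, lam)             # Step 3.2 via (6)
        if np.max(np.abs(beta - beta_old)) < tol:            # Step 4
            break
    return beta
```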
4. Simulation and Application
4.1. Analyses of Simulated Data
We evaluate the performance of four methods: the network-constrained logistic regression model with L1 regularization (L1 net), with L1/2 regularization using the old thresholding value (L1/2 net) and the enhanced thresholding value (enhanced L1/2 net), and the Elastic net regularization approach (Elastic net). We first simulated a graph structure to mimic a gene regulatory network: the graph consists of 200 independent transcription factors (TFs), and each TF regulates 10 distinct genes, so there are 2200 variables in total (p = 2200). The training and independent test datasets each contain 100 samples. Each TF and each of its regulated genes were generated from the standard normal distribution, and the correlation between a TF and each of its regulated genes was set to 0.75. The binary response y_i (i = 1, ..., n), which is associated with the matrix X of TFs and their regulated genes, was generated from the logistic model P(y_i = 1 | X_i) = exp(X_i β)/(1 + exp(X_i β)) with a sparse coefficient vector β whose nonzero entries correspond to a small number of TFs and their regulated genes (Model 1).
Model 2 was defined similarly to Model 1, except that we considered the case in which a TF can have positive and negative effects on its regulated genes at the same time, so the signs of the coefficients within an active TF group are mixed.
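A data-generating sketch consistent with the description of Model 1 (200 TFs, 10 regulated genes each, correlation 0.75, n = 100) is given below; the paper's exact coefficient vector and response rule are not reproduced in this extraction, so the sparse β and the 0.5 threshold are purely illustrative.

```python
import numpy as np

def simulate_network_data(n=100, n_tf=200, n_per_tf=10, rho=0.75, seed=0):
    """Simulate expression for 200 TFs and their regulated genes: each TF is
    standard normal and each regulated gene has correlation rho with its TF."""
    rng = np.random.default_rng(seed)
    tf = rng.normal(size=(n, n_tf))
    blocks = [tf]
    for _ in range(n_per_tf):
        noise = rng.normal(size=(n, n_tf))
        blocks.append(rho * tf + np.sqrt(1.0 - rho ** 2) * noise)
    return np.hstack(blocks)        # columns: 200 TFs, then their regulated genes

X = simulate_network_data()         # 100 x 2200 design matrix

# illustrative sparse coefficient vector (the paper's exact beta for Model 1 is
# not reproduced here): only the first four TFs carry nonzero effects
beta = np.zeros(X.shape[1])
beta[:4] = 1.0
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = (prob > 0.5).astype(int)        # response rule assumed for illustration
```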
In both models, 10-fold cross validation on the training datasets was used to tune the regularization parameters of the enhanced L1/2 net, the L1/2 net, and the L1 net. The two penalty parameters of the Elastic net (for the L1 and ridge terms) were tuned by 10-fold cross validation over the two-dimensional parameter surface. We repeated the simulations 100 times and report the average misclassification error, sensitivity, and specificity of each net model on the test datasets.
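A sketch of the 10-fold cross validation used for tuning, reusing cd_half_logistic from the sketch above; for the Elastic net the same loop would simply run over a two-dimensional grid of parameter pairs. The fold assignment and grid handling are illustrative.

```python
import numpy as np

def cv_choose_lambda(X, y, lam_grid, n_folds=10, seed=0):
    """10-fold cross validation over a grid of lambda values, reusing
    cd_half_logistic from the sketch above; returns the lambda with the
    smallest average misclassification error."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    cv_errors = []
    for lam in lam_grid:
        fold_errors = []
        for k in range(n_folds):
            train, test = folds != k, folds == k
            beta = cd_half_logistic(X[train], y[train], lam)
            prob = 1.0 / (1.0 + np.exp(-(X[test] @ beta)))
            fold_errors.append(np.mean((prob > 0.5).astype(int) != y[test]))
        cv_errors.append(np.mean(fold_errors))
    return lam_grid[int(np.argmin(cv_errors))]
```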
Table 1 summarizes the simulation results for each regularization net model. Overall, the proposed enhanced L1/2 net achieved the smallest misclassification errors in Model 1 (9.22%) and Model 2 (10.76%) compared with the other regularization methods, including the old L1/2 thresholding method (9.85% for Model 1 and 10.83% for Model 2), the L1 net (11.81% and 13.21%), and the Elastic net (13.12% and 14.14%). The enhanced L1/2 net also gave the highest sensitivity in Model 1 (0.985) and the best specificity in Model 2 (0.987) among the compared methods. In summary, the enhanced L1/2 net outperforms the other three algorithms in terms of prediction accuracy, sensitivity, and specificity.
Table 1: Simulation results of the enhanced L1/2 net, L1/2 net, L1 net, and Elastic net.
| Model | Metric | Eh_L1/2 net | L1/2 net | L1 net | Elastic net |
| 1 | Misclassification error (%) | 9.22 (0.36) | 9.85 (0.31) | 11.81 (0.41) | 13.12 (0.12) |
| 1 | Sensitivity | 0.985 (0.00) | 0.971 (0.00) | 0.968 (0.02) | 0.873 (0.00) |
| 1 | Specificity | 0.969 (0.00) | 0.970 (0.01) | 0.962 (0.01) | 0.981 (0.00) |
| 2 | Misclassification error (%) | 10.76 (0.33) | 10.83 (0.36) | 13.21 (0.24) | 14.14 (0.23) |
| 2 | Sensitivity | 0.939 (0.00) | 0.939 (0.00) | 0.943 (0.01) | 0.835 (0.00) |
| 2 | Specificity | 0.987 (0.02) | 0.981 (0.01) | 0.987 (0.01) | 0.980 (0.00) |
Simulation results (averaged over 100 runs) comparing the misclassification errors, sensitivity, and specificity of the enhanced L1/2 net, L1/2 net, L1 net, and Elastic net. Standard errors are given in parentheses.
4.2. Analysis of Lung Cancer
In this section, we merged the protein-protein interaction (PPI) network (see http://thebiogrid.org/) with a lung cancer (LC) gene-expression dataset [23] to demonstrate the performance of the proposed enhanced L1/2 net. The gene-expression dataset contains the expression profiles of 22,284 genes for 107 patients, of whom 58 had lung cancer. To test the generalization ability of the proposed method, we divided the dataset into a training set (n = 70; 38 LC, 32 non-LC) covering two-thirds of the samples and a test set (n = 37; 20 LC, 17 non-LC) covering the remaining third. Ten-fold cross validation on the training set was used to tune the regularization parameters. After combining the gene-expression data with the PPI network, the final PPI network included 8619 genes and 28293 edges.
Figures 1-4 display the solution paths of the four regularization net methods for the LC dataset in one sample run. The horizontal axis shows the values of the running lambda (the running lambda of the L1 penalty in the Elastic net approach), the axis at the top (degrees of freedom) gives the number of nonzero coefficients, and the vertical axis shows the values of the coefficients β, which measure gene importance. The predictive model was built on the training set and its predictive performance was then evaluated on the test set. The detailed results are reported in Table 2.
Table 2: Results of the enhanced L1/2 net, L1/2 net, L1 net, and Elastic net on the LC dataset.
| Method | Selected genes | Connected genes | Connected edges | Cross validation error | Test error |
| Eh_L1/2 net | 171 | 54 | 41 | 6/70 | 5/37 |
| L1/2 net | 193 | 61 | 47 | 6/70 | 6/37 |
| L1 net | 500 | 150 | 121 | 7/70 | 6/37 |
| Elastic net | 636 | 337 | 510 | 6/70 | 6/37 |
Results of the analysis of the LC gene-expression dataset by the four procedures, including the number of genes selected, the number of linked PPI network genes, the number of linked PPI network edges, the cross validation error, and the test error.
Figure 1: The solution paths of the enhanced L1/2 net for the lung cancer dataset in one sample run.
Figure 2: The solution paths of the L1/2 net for the lung cancer dataset in one sample run.
Figure 3: The solution paths of the L1 net for the lung cancer dataset in one sample run.
Figure 4: The solution paths of the Elastic net for the lung cancer dataset in one sample run.
As shown in Table 2, the enhanced L1/2 net selected the fewest genes and edges compared with the L1/2 net, the L1 net, and the Elastic net, while its predictive performance was better than that of the other three regularization net algorithms.
To further evaluate the enhanced L1/2 net, we examined its ability to identify biomarkers related to lung cancer. NK2 homeobox 1 (Nkx2-1) regulates the transcription of lung-specific genes, is used as a biomarker for lung cancer in anatomic pathology, and plays a critical role in maintaining lung tumor cells [24, 25]. Epidermal growth factor receptor (EGFR) is known to play a key role in cell proliferation and apoptosis; EGFR overexpression and activity can result in tumor growth and progression [26], and somatic mutations within the tyrosine kinase domain of EGFR have been identified in a subset of lung adenocarcinomas [27, 28]. The enhanced L1/2 net (Figure 5) and the L1/2 net successfully identified both of these important LC biomarkers, whereas neither the L1 net nor the Elastic net selected them both.
Figure 5: Subnetworks identified by the enhanced L1/2 net for the lung cancer dataset (only genes that are linked in the PPI network are plotted).
Besides identifying these two significant biomarkers (EGFR and Nkx2-1), the enhanced L1/2 net also selected several pathways associated with lung cancer. For example, one of the subnetworks includes genes involved in cell proliferation (e.g., ARF4, EGFR, DCN, BRCA1, and ITIH5); when these genes are significantly and continuously expressed, lung cancer progression is promoted. This group is also linked to ENO1. We were unable to find clear evidence for this relationship in the PPI database; however, a recent report [29] demonstrated that ENO1 is a promising biomarker that may improve diagnostic efficacy for lung cancer. This link therefore suggests a functional relationship and an important role of ENO1 in lung cancer.
All these results indicate that the enhanced L1/2 net is more reliable than the L1/2 net, the L1 net, and the Elastic net for selecting key markers from high-dimensional genomic data. Another advantage of the proposed method is its ability to recognize novel, potentially biologically significant relationships. It is also worth noting that the proposed method tends to identify fewer but more informative genes (or edges) than the L1 net and the Elastic net, which allows researchers to concentrate more easily on key targets for functional studies or downstream applications.
5. Conclusions
In biological molecular research, especially for cancer, combining biological pathway information with gene-expression data can play an important role in the search for new drug targets. In this paper, we use the enhanced L1/2 solver to penalize the network-constrained logistic regression model and integrate lung cancer gene expression with the protein-protein interaction network. We develop the corresponding coordinate descent algorithm as a novel biomarker selection approach. This algorithm is fast and easy to implement. Both the simulation and the real genomic data studies showed that the enhanced L1/2 net compares favorably with the L1/2 net (using the old thresholding operator), the L1 net, and the Elastic net in the selection of biomarkers and subnetworks.
We successfully identified several clinically important biomarkers and subnetworks that drive lung cancer. The proposed method provides new information for investigators in biological studies and can serve as an efficient tool for identifying cancer-related biomarkers and subnetworks.
Acknowledgment
This work was supported by the Macau Science and Technology Development Funds (Grant no. 099/2013/A3) of Macau SAR of China.
Conflict of Interests
The authors declare no conflict of interests.
[1] S. K. Shevade, S. S. Keerthi, "A simple and efficient algorithm for gene selection using sparse logistic regression," Bioinformatics , vol. 19, no. 17, pp. 2246-2253, 2003.
[2] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society Series B: Methodological , vol. 58, no. 1, pp. 267-288, 1996.
[3] L. Meier, S. van de Geer, P. Bühlmann, "The group Lasso for logistic regression," Journal of the Royal Statistical Society. Series B. Statistical Methodology , vol. 70, no. 1, pp. 53-71, 2008.
[4] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association , vol. 101, no. 476, pp. 1418-1429, 2006.
[5] N. Meinshausen, B. Yu, "Lasso-type recovery of sparse representations for high-dimensional data," The Annals of Statistics , vol. 37, no. 1, pp. 246-270, 2009.
[6] Z. B. Xu, H. Zhang, Y. Wang, X. Y. Chang, Y. Liang, "L1/2 regularization," Science in China Series F , vol. 40, no. 3, pp. 1-11, 2010.
[7] Z. Xu, X. Chang, F. Xu, H. Zhang, "L1/2 regularization: a thresholding representation theory and a fast solver," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[8] J. Zeng, S. Lin, Y. Wang, "L1/2 regularization: convergence of iterative half thresholding algorithm," IEEE Transactions on Signal Processing, vol. 62, no. 9, pp. 2317-2329, 2014.
[9] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, M. Tyers, "BioGRID: a general repository for interaction datasets," Nucleic Acids Research, vol. 34, supplement 1, pp. D535-D539, 2006.
[10] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, T. Ideker, "Network-based classification of breast cancer metastasis," Molecular Systems Biology , vol. 3, article 140, 2007.
[11] C. Li, H. Li, "Network-constrained regularization and variable selection for analysis of genomic data," Bioinformatics , vol. 24, no. 9, pp. 1175-1182, 2008.
[12] Z. Tian, T. Hwang, R. Kuang, "A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge," Bioinformatics , vol. 25, no. 21, pp. 2831-2838, 2009.
[13] W. Zhang, Y.-W. Wan, G. I. Allen, K. Pang, M. L. Anderson, Z. Liu, "Molecular pathway identification using biological network-regularized logistic models," BMC Genomics , vol. 14, no. 8, article S7, 2013.
[14] H. Sun, W. Lin, R. Feng, H. Li, "Network-regularized high dimensional Cox regression for Analysis of Genomic data," Statistica Sinica , vol. 24, pp. 1433-1459, 2014.
[15] Y. Liang, C. Liu, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, H. Zhang, "Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification," BMC Bioinformatics , vol. 14, no. 1, article 198, 2013.
[16] C. Liu, Y. Liang, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, H. Zhang, "The L1/2 regularization method for variable selection in the Cox model," Applied Soft Computing Journal , vol. 14, pp. 498-503, 2014.
[17] K. Knight, W. Fu, "Asymptotics for Lasso-type estimators," Annals of Statistics , vol. 28, no. 5, pp. 1356-1378, 2000.
[18] D. Malioutov, M. Çetin, A. S. Willsky, "A sparse signal reconstruction perspective for source localization with sensor arrays," IEEE Transactions on Signal Processing , vol. 53, no. 8, pp. 3010-3022, 2005.
[19] J. Friedman, T. Hastie, R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software , vol. 33, no. 1, pp. 1-22, 2010.
[20] J. Fan, R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association , vol. 96, no. 456, pp. 1348-1360, 2001.
[21] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics , vol. 38, no. 2, pp. 894-942, 2010.
[22] P. Breheny, J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," Annals of Applied Statistics , vol. 5, no. 1, pp. 232-253, 2011.
[23] M. T. Landi, T. Dracheva, M. Rotunno, J. D. Figueroa, H. Liu, A. Dasgupta, F. E. Mann, J. Fukuoka, M. Hames, A. W. Bergen, S. E. Murphy, P. Yang, A. C. Pesatori, D. Consonni, P. A. Bertazzi, S. Wacholder, J. H. Shih, N. E. Caporaso, J. Jen, "Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival," PLoS ONE , vol. 3, no. 2, 2008.
[24] Y. Li, K. Eggermont, V. Vanslembrouck, C. M. Verfaillie, "NKX2-1 activation by SMAD2 signaling after definitive endoderm differentiation in human embryonic stem cell," Stem Cells and Development , vol. 22, no. 9, pp. 1433-1442, 2013.
[25] B. A. Weir, M. S. Woo, G. Getz, "Characterizing the cancer genome in lung adenocarcinoma," Nature , vol. 450, no. 7171, pp. 893-898, 2007.
[26] H. Zhang, A. Berezov, Q. Wang, G. Zhang, J. Drebin, R. Murali, M. I. Greene, "ErbB receptors: from oncogenes to targeted cancer therapies," The Journal of Clinical Investigation , vol. 117, no. 8, pp. 2051-2058, 2007.
[27] J. G. Paez, P. A. Janne, J. C. Lee, S. Tracy, H. Greulich, S. Gabriel, P. Herman, F. J. Kaye, N. Lindeman, T. J. Boggon, K. Naoki, H. Sasaki, Y. Fujii, M. J. Eck, W. R. Sellers, B. E. Johnson, M. Meyerson, "EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy," Science , vol. 304, no. 5676, pp. 1497-1500, 2004.
[28] S. R. Chowdhuri, L. Xi, T. H.-T. Pham, J. Hanson, J. Rodriguez-Canales, A. Berman, A. Rajan, G. Giaccone, M. Emmert-Buck, M. Raffeld, A. C. Filie, "EGFR and KRAS mutation analysis in cytologic samples of lung adenocarcinoma enabled by laser capture microdissection," Modern Pathology , vol. 25, no. 4, pp. 548-555, 2012.
[29] L. Yu, J. Shen, K. Mannoor, M. Guarnera, F. Jiang, "Identification of ENO1 as a potential sputum biomarker for early stage lung cancer by shotgun proteomics," Clinical Lung Cancer , vol. 15, no. 5, pp. 372-378, 2014.
Copyright © 2015 Hai-Hui Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Identifying biomarkers and signaling pathways is a critical step in genomic studies, in which regularization is a widely used feature extraction approach. However, most regularizers are based on the L1 norm; their results are often not sparse enough for interpretation and are asymptotically biased, especially in genomic research. A large amount of molecular interaction information about disease-related biological processes has been gathered in databases covering many aspects of biological systems. In this paper, we use an enhanced L1/2 penalized solver to penalize the network-constrained logistic regression model, called the enhanced L1/2 net, where the predictors are based on gene-expression data combined with biological network knowledge. Extensive simulation studies show that the proposed approach outperforms L1 regularization, the old L1/2 penalized solver, and the Elastic net in terms of classification accuracy and stability. Furthermore, we applied our method to a lung cancer data analysis and found that it achieves higher predictive accuracy than L1 regularization, the old L1/2 penalized solver, and the Elastic net, while selecting fewer but more informative biomarkers and pathways.