Content area
Background
Gene set analysis aims to identify gene sets containing differentially expressed genes between two different experimental conditions. A representative example of gene sets is a gene regulatory network where multiple genes are linked with each other for regulation of gene expression. Most of statistical methods for gene set analysis were designed to capture group-based association signals, ignoring a genetic network structure. Consequently, they often fail to identify gene sets where the number of differentially expressed genes are only a few and they have sparse association signals.
Results
We propose a new computational method to utilize prior network knowledge for gene set analysis. The proposed method is essentially combines the coefficient estimates of network-based regularization into overlapping group lasso. Network-based regularization can boost association signals among linked genes while overlapping group lasso performs selection of gene sets including differentially expressed genes. In our extensive simulation study, the performance of the proposed method has been evaluated, compared with the existing methods. We also applied it to gene expression data of The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA). We were able to identify cancer-related pathways that were missed by the existing methods.
Conclusion
Overlapping group lasso is a regularization method for group selection allowing overlapping variables. Network-based regularization is a variable selection method utilizing graph information among variables. The proposed weighted overlapping group lasso (wOGL) adopts the coefficient estimates of network-based regularization for the weight of overlapping group lasso. Consequently, it can identify gene sets containing differentially expressed genes, utilizing prior network knowledge.
Background
Gene set analysis aims to identify a gene set that has differentially expressed (DE) genes between two different experimental conditions. Examples of gene sets include gene regulatory networks, signaling pathways and protein-protein interaction networks. There are many literatures about gene sets associated with different types of diseases including Crohn’s disease [1], breast cancer [2], rheumatoid arthritis [3], prostate cancer [4] and schizophrenia [5]. Most of genes within the same gene set usually share a biological process so they have similar association patterns with a phenotype outcome [6]. Therefore, gene set analysis is preferred to detect DE genes, compared with single gene analysis [7, 8]. The most commonly used method for gene set analysis is gene set enrichment analysis (GSEA) [7] which identifies significantly enriched groups of genes based on the ranked gene list. Since GSEA has been developed, many statistical and computational methods have been proposed for gene set analysis [9,10,11,12,13,14,15,16]. For computational convenience, Rahmatallah et al. [14] developed an R package for gene set analysis where various statistical methods have been implemented.
In recent years researches have been conducted for benchmarking gene set analysis methods [17, 18], where the performance of numerous statistical methods were compared with each other in terms of sensitivity and specificity. Although most of the existing gene set analysis methods were selected in their benchmarking analysis, they are all test-based methods for individual genes or gene sets. That is, the phenotype association of individual genes or gene sets is tested one at a time. These individual tests suffer from the multiple testing problem that substantially increases the number of false positives. Since there are generally a large number of genes or gene sets, multiple testing is a crucial issue in gene set analysis.
Alternatively, regression-based approaches to select multiple genetic pathways containing DE genes have been proposed where individual genes are regarded as predictors and two different experimental conditions are a binary response [19,20,21]. In particular, Zeng and Breheny [21] employed overlapping group lasso (OGL) for genetic pathway selection. They considered the detection of gene sets with DE genes as a variable selection problem in a regression model. Since individual genes can be overlapped by multiple genetic pathways, they applied the OGL method [22] to identify caner-related genetic pathways. They also demonstrated that OGL can outperform GSEA when the size of genetic pathways is relatively small. However, their analysis assumes that all genes within the pathways are differentially expressed even if some genes can be neutral in practice. For genetic pathways with only a few DE genes, both OGL and GSEA are likely to miss these pathways due to sparse association signals. They were designed to boost detection power, pooling association signals of individual genes in a genetic pathway. Consequently, their detection power can be maximized when all genes in the same pathway are differentially expressed.
Most of statistical methods for gene set analysis has been developed based on a group of genes, ignoring a genetic network structure a set of genes consists of. For examples, a gene regulatory pathway and a protein-protein interaction network deliver the information about which genes are linked with each other. Biological network knowledge have been accumulated over the past years, and many genetic network databases are now publicly available [23,24,25,26]. Previous researches have shown that integrating prior knowledge of biological networks into analysis of genomic data can promisingly improve gene selection results [6, 27,28,29,30,31,32]. Specifically, the network-based regularization method proposed by Li and Li [27, 28] is able to select genes associated with a phenotype outcome, encouraging coefficient smoothing of linked genes in the genetic network. It essentially estimates regression coefficients of a penalized likelihood where the penalty term composes of the \(l_1\)-norm penalty for sparsity and a Laplacian penalty for a graph structure. Accordingly, non-zero regression coefficients represent the association signals of corresponding genes. An advantage of network-based regularization is able to capture sparse association signals of individual genes, utilizing prior knowledge of linked genes. However, it cannot be directly applied to gene set analysis because network-based regularization was not designed for group selection but for selection of individual variables. Moreover, it does not allow overlapping genes in multiple genetic pathways.
In this article, we propose weighted overlapping group lasso (wOGL) to incorporate prior network knowledge into gene set analysis. The proposed method takes advantage of both OGL and network-based regularization. The former enables us to select group selection allowing overlapping variables, while the later utilizes a graph structure among variables to capture sparse association signals of individual genes. The proposed method eventually computes the selection score of each genetic pathway from both OGL and network-based regularization where the coefficient estimates of network-based regularization are used for a group weight of OGL. We then rank genetic pathways based on their selection score, and the threshold of selection scores is applied to determine cancer-related genetic pathways. In our simulation study, we observed that the proposed method has consistently high true positive rates when genetic pathways have either sparse or dense association signals. In real gene expression data from breast cancer study, we identified some cancer-related genetic pathways which were missed by the existing methods.
Materials and methods
Overlapping group lasso
Group lasso was originally developed by Yuan and Lin [33], and it was extended to the logistic regression by Meier et al. [34]. Suppose that the i-th sample has the p-dimensional gene expression data, which is denoted by \(\varvec{x}_i=(x_{i1},\ldots ,x_{ip})^\textrm{T}\) for \(i=1,2,\ldots ,n\). The group indicator variable \(\delta _{kj}=1\) if the k-th gene set contains the j-th gene; otherwise \(\delta _{kj}=0\). First, let us start with the ordinary group lasso for disjoint gene sets and then extend to the OGL for overlappin genes. For non-overlapping genes in K gene sets, \(\sum _{k=1}^K \delta _{kj} = 1\) for all \(j=1,\ldots , p\), i.e., K gene sets are disjoint so each gene should belong to one of the K gene sets. If the binary response of the i-th sample is denoted by \(y_i=0\) or 1 for two different experimental conditions, the penalized logistic log-likelihood with the group lasso penalty is then
$$\begin{aligned} -\frac{1}{n}\sum _{i=1}^n \left[ y_i \log p(\varvec{x}_i)+(1-y_i)\log (1-p(\varvec{x}_i)\right] + \lambda \sum _{k=1}^K \sqrt{g_k\sum _{j=1}^p\delta _{kj}\beta _j^2}, \end{aligned}$$
(1)
where \(\lambda >0\) is a tuning parameter to control sparsity and \(g_k=\sum _{j=1}^p \delta _{kj}\) is the size of the k-th gene set. The probability of \(y_i=1\) is
$$\begin{aligned} p(\varvec{x}_i)=\frac{\exp (\beta _0+\varvec{x}_i^{\textrm{T}}\varvec{\beta })}{1+\exp (\beta _0+\varvec{x}_i^{\textrm{T}}\varvec{\beta })}, \end{aligned}$$
where \(\beta _0\) is an intercept parameter and \(\varvec{\beta }=(\beta _1,\ldots ,\beta _p)^\textrm{T}\) is the p-dimensional coefficient vector representing the effect size of individual genes. The penalized likelihood (1) consists of the negative likelihood function of regression coefficients and the shrinkage penalty term. Minimizing the penalized likelihood seeks coefficient estimates that fit the data well while shrinking them towards zero. The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates.
For a given value of \(\lambda\), the group lasso estimates \((\hat{\beta }_0, \hat{\varvec{\beta })}\) minimize the penalized likelihood (1). Since the group lasso penalty is convex, we can choose one of convex optimization programs in order to obtain the solution \((\hat{\beta }_0, \hat{\varvec{\beta })}\). Recently, Yang and Zou [35] derived a groupwise majorization-descent algorithm for efficient computing and they also developed an R package ‘gglasso’ to implement their algorithm. For a relatively large \(\lambda\), most of the coefficient estimates are exactly 0. Accordingly, the nonzero coefficient estimate \(\hat{\beta }_j\ne 0\) indicates that there is an association between the j-th gene and a phenotype outcome. Therefore, we just need to select genes with nonzero coefficients estimates for identification of DE genes. But, the group lasso penalty adopts ‘all-in-all-out’ fashion which forces to select all genes in the k-th gene set if the k-th gene set contains DE genes. That is to say, either \(\hat{\beta }_j=0\) or \(\hat{\beta }_j\ne 0\) for all j satisfying \(\delta _{kj}=1\). Subsequently, the group lasso was designed for group selection not for selection of individual variables.
Next, consider that some of genes can belong to more than one gene set. In practice, the assumption of non-overlapping genes among K gene sets is likely to be failed. Therefore, the ordinary group lasso is not appropriate for gene set analysis. For overlapping genes in K gene sets, we expect \(\sum _{k=1}^K \delta _{kj} \ge 1\) for all \(j=1,\ldots , p\). So, each gene can belong to more than one of the K gene sets. The total number of overlapping genes is then \(t= \sum _{k=1}^K\sum _{j=1}^p \delta _{kj} - p\). Jacob et al. [22] proposed OGL to extend the group lasso to overlapping variables. Their idea is simply to duplicate predictors corresponding to overlapping variables. That is, the p-dimensional gene expression data is converted to the q-dimensional data with \(q=\sum _{k=1}^K\sum _{j=1}^p \delta _{kj}\), regarding t overlapping genes as additional predictors. Let us denote the i-th sample of the q-dimensional gene expression data by \(\tilde{\varvec{x}}_i=(\tilde{x}_{i1}, \ldots , \tilde{x}_{iq})^\textrm{T}\). The q-dimensional coefficient vector is then denoted by \(\varvec{\gamma }=(\varvec{\gamma }_1,\ldots ,\varvec{\gamma }_K)^\textrm{T}\), where the coefficients of the k-th gene set is \(\varvec{\gamma }_k=(\gamma _1,\ldots ,\gamma _{g_k})\) and \(\sum _{k=1}^K g_k=q\). The penalized log-likelihood with the OGL penalty is
$$\begin{aligned} -\frac{1}{n}\sum _{i=1}^n \left[ y_i \log p(\tilde{\varvec{x}}_i)+(1-y_i)\log (1-p(\tilde{\varvec{x}}_i)\right] + \lambda \sum _{k=1}^K \sqrt{g_k}\Vert \varvec{\gamma }_k\Vert _2, \end{aligned}$$
(2)
where the \(l_2\)-norm \(\Vert \varvec{\gamma }_k\Vert _2=\sqrt{\sum _{l=1}^{g_k} \gamma _l^2}\) and
$$\begin{aligned} p(\tilde{\varvec{x}}_i)=\frac{\exp (\beta _0+\tilde{\varvec{x}}_i^{\textrm{T}}\varvec{\gamma })}{1+\exp (\beta _0+\tilde{\varvec{x}}_i^{\textrm{T}}\varvec{\gamma })}. \end{aligned}$$
The OGL coefficient estimates \((\hat{\beta }_0, \hat{\varvec{\gamma }})\) also minimize the penalized likelihood (2) just like the ordinary group lasso estimates. Note that the penalized likelihood (1) and (2) employ the same group lasso penalty although the notations for regression coefficients are differently used. The p-dimensional coefficient vector \(\varvec{\beta }\) in (1) measures the association strength of p genes while the q-dimensional coefficient vector \(\varvec{\gamma }\) in (2) are for the K gene sets. For example, if the k-th gene set contains some DE genes, the OGL coefficient estimates \(\hat{\varvec{\gamma }}_k=(\hat{\gamma }_1,\ldots ,\hat{\gamma }_{g_k})\) are all non-zeros; otherwise, \(\hat{\varvec{\gamma }}_k=0\). Therefore, OGL performs group selection, allowing overlapping genes.
Network-based regularization
Li and Li [27, 28] proposed network-based regularization in order to incorporate a biological graph into genetic association studies. Let us assume that K different genetic pathways consist of p genes. We then create the p-dimensional adjacency matrix denoted by \(\varvec{A}=\{A_{uv}\}\) for \(1\le u,v \le p\) from the pathways. For example, if the u-th gene is linked with the v-th gene in one of the K pathways, \(A_{uv}=1\); otherwise \(A_{uv}=0\). Since we consider an undirected graph among p genes, the adjacency matrix should be symmetric. Given the gene expression data \(\varvec{x}_i\), the binary outcome \(y_i\) and the adjacency matrix \(\varvec{A}\), the penalized logistic likelihood with a graph-constrained penalty is defined as
$$\begin{aligned} -\frac{1}{n}\sum _{i=1}^n \left[ y_i \log p(\varvec{x}_i)+(1-y_i)\log (1-p(\varvec{x}_i)\right] + \eta \left( \alpha \Vert \varvec{\beta }\Vert _1 + (1-\alpha )\varvec{\beta }^{\textrm{T}}L\varvec{\beta }\right) , \end{aligned}$$
(3)
where the \(l_1\)-norm \(\Vert \varvec{\beta }\Vert _1=\sum _{j=1}^p |\beta _j|\). The tuning parameter \(\eta >0\) controls model sparsity like \(\lambda\) and the other tuning parameter \(\alpha \in [0, 1]\) is a balance between lasso and graph-constrained penalties. In the penalized likelihood (3), the p-dimensional Laplacian matrix \(L=\{l_{uv}\}\) is defined as
$$\begin{aligned} l_{uv}=\left\{ \begin{array}{ll} 1 & \text {if}~ u=v~\text {and}~ d_u\ne 0 \\ -\frac{1}{\sqrt{d_u d_v}} & \text {if}~ u\ne v~\text {and}~A_{uv}=1 \\ 0 & \text {otherwise}, \end{array} \right. \end{aligned}$$
where \(d_u\) is the total number of links of the u-th gene, i.e., \(d_u=\sum _{j=1}^p A_{uj}=\sum _{j=1}^p A_{ju}\).
Although the penalized likelihood (3) also consists of the negative likelihood and the penalty term, the coefficient estimates are completely different from the group lasso estimates in (1) and (2) due to the different penalty function. The graph-constrained penalty is a combination of the lasso penalty for model sparsity and the squared \(l_2\)-norm penalty on degree-scaled differences of coefficients between linked genes. It can induce both sparsity and smoothness with respect to the correlated or linked structure of the regression coefficients. With this penalty, desirable grouping effects can be reached by specifying genetic links among genes in the model. The penalized likelihood (3) is also a convex function, so the solution to \((\hat{\beta }_0, \hat{\varvec{\beta }})\) can be easily obtained through one of the optimization algorithms like a cyclic coordinate descent algorithm [29]. Note that network-based regularization was not designed for group selection although it utilizes information of genetic pathways. For two genes linked with each other in a pathway, the estimated coefficients corresponding to these genes can be zero and nonzero, respectively. Consequently, it cannot be directly applied to gene set analysis.
Let us denote the p-dimensional estimated coefficients of network-based regularization by \(\hat{\varvec{\beta }}=\{\hat{\beta }_1, \hat{\beta }_2, \ldots , \hat{\beta }_p\}^\textrm{T}\) for fixed values of \(\eta\) and \(\alpha\). Li and Li [27, 28] suggested cross-validation to determine two tuning parameters \(\eta\) and \(\alpha\). The \(l_2\)-norm of the coefficients of the k-th pathway can be computed by
$$\begin{aligned} w_k=\sqrt{\sum _{j=1}^p \delta _{kj}\hat{\beta }_j^2} \end{aligned}$$
(4)
which is used for the k-th group weight of the proposed wOGL. Even if the k-th genetic pathway contains only a few DE genes with sparse association signals, the group weight \(w_k\) is greater than 0 because the estimated coefficients of the DE genes are expected to be nonzero.
Weighted overlapping group lasso
We first compute the selection probability of individual pathway groups, using the OGL estimates \((\hat{\beta }_0, \hat{\varvec{\gamma }})\). For stable selection of regularization methods, Meinshausen and Bühlmann [36] originally proposed to compute selection probability of individual variables from repeated half-sample resampling. They demonstrated that selection probability can produce very stable selection result when the number of variable is large. Subsequently, computation of selection probability of individual genes has been widely adopted for analysis of high-dimensional genomic data [32, 37,38,39,40,41].
With the sample size of n, let \(I_r\) be the index set of the r-th random subsample that has a size of \(\lfloor n/2\rfloor\) without replacement, where \(\lfloor x \rfloor\) is the largest integer not greater than x. For a fixed value of \(\lambda \in \Lambda\), the OGL coefficients \(\hat{\varvec{\gamma }}=(\hat{\varvec{\gamma }}_1,\ldots ,\hat{\varvec{\gamma }}_K)\) can be estimated based on the r-th subsamples of \((\varvec{x}_i,y_i)\) for \(i\in I_r\). Let us denote the coefficient estimate of the k-th gene set for the r-th subsamples and a value of \(\lambda\) by \(\hat{\varvec{\gamma }}_{k}(I_r; \lambda )\). The selection probability of the k-th gene set is then computed by
$$\begin{aligned} \text {SP}_k = \max _{\lambda \in \Lambda } \frac{1}{R} \sum _{r=1}^R I\left( \hat{\varvec{\gamma }}_{k}(I_r; \lambda )\ne 0\right) , \end{aligned}$$
(5)
where \(I(\cdot )\) is an indicator function and R is a total number of resampling. We fixed \(R=100\) in our simulation study and real data analysis.
Next, we incorporate the k-th group weight (4) into the selection probability of the k-th gene set (5). The selection score of the k-th gene set is then defined as
$$\begin{aligned} s_k = \frac{w_k}{\max _k w_k} \text {SP}_k \end{aligned}$$
(6)
for \(k=1,\ldots , K\). The selection score of each gene set essentially quantifies both individual and group association signals. The normalizing group weight represents the combined association strength of individual genes in the same gene set while the selection probability measures the relative selection frequency of the gene set. We adopted the multiplicative form of two terms to assign the group weight to the selection probability of individual gene sets. After computation of all selection scores, we can rank the K gene sets from the largest score to the smallest score. If the k-th gene set has only a few DE genes, its ranking of selection probability is relatively low due to weak group association signals. However, the ranking of selection score can be higher than that of selection probability because the group weight captures individual association signals of DE genes. Figure 1 illustrates the computational procedure of the selection score.
[IMAGE OMITTED: SEE PDF]
The selection score depends on the fixed values of \(\lambda \in \Lambda\). Suppose that we have a fine grid of \(\lambda\) values such that
$$\begin{aligned} \lambda _{\max }=\lambda _1> \lambda _2> \cdots > \lambda _M=\lambda _{\min }, \end{aligned}$$
where M is a pre-defined number and we used \(M=15\). \(\lambda _{\max }\) ensures that \(\hat{\varvec{\gamma }}_k=0\) for all k, while \(\lambda _{\min }\) leads to the most number of nonzero coefficients. Since the numerical value of the selection score relies on the selection probability, it can be either slightly increased or decreased for a different value of \(\lambda\). However, the score ranking of K gene sets is rarely changed for a different value of \(\lambda\).
In order to select a certain number of gene sets, we need to determine a threshold of the selection score. A theoretical threshold of selection probability has been proposed by Meinshausen and Bühlmann [36] while a permutation threshold of selection probability has been studied [37, 41, 42]. Since the computation of selection probability is based on resampling, computational cost will be explosively increased if both resampling and permutation are simultaneously conducted for a high-dimensional data. Therefore, we employed the theoretical threshold of our selection score for efficient computation.
Let us denote the number of false discoveries by V and then
$$\begin{aligned} E(V) \le \theta = \frac{q^2_\Lambda }{p(2\pi _\theta -1)}, \end{aligned}$$
(7)
where \(q_\Lambda\) is the maximum number of selected variables over all values of \(\lambda \in \Lambda\), and \(\pi _\theta\) is the threshold of selection probability when \(\theta\) is the upper bound of the expected number of false discoveries. Therefore, the theoretical threshold of selection probability
$$\begin{aligned} \pi _\theta =\frac{q^2_\Lambda }{2\theta p} + 0.5 \end{aligned}$$
guarantees that the maximum number of falsely selected variables is \(\theta\), according to Meinshausen and Bühlmann [36]. In other words, for a pre-determined value of \(\theta\) we can always find the threshold \(\pi _\theta\) to control the false discovery rate. Consequently, the total number of gene sets selected by the wOGL depends on the pre-defined value of \(\theta\) which stands for the number of falsely selected gene sets. For example, if we identified 15 gene sets at \(\theta =1\), we then expect that one of 15 gene sets is a false positive.
Suppose that we list the selection probability of the K gene sets from largest to smallest such that
$$\begin{aligned} \text {SP}_{[1]} \ge \text {SP}_{[2]} \ge \cdots \ge \text {SP}_{[K]}. \end{aligned}$$
Then, the \(\hat{k}\) gene sets can be selected if \(\hat{k}\) is the largest value satisfying \(SP_{[k]} > \pi _\theta\) for a given value of \(\theta\). However, the ranking of the K gene sets should not be based on the selection probability (5) but the proposed selection score (6). Therefore, we finally select the \(\hat{k}\) gene sets with largest selection scores such that
$$\begin{aligned} \text {s}_{[1]} \ge \text {s}_{[2]} \ge \cdots \ge \text {s}_{[\hat{k}]} \end{aligned}$$
Results
Simulation studies
[IMAGE OMITTED: SEE PDF]
In order to simulate gene expression data where linked genes within a network graph are correlated with each other, a two-step procedure was conducted. In the first step, we made the p-dimensional covariance matrix from a network graph based on a Gaussian graphical model [43]. In the second step, we generated two different gene expression data where mean vectors are different from each other while they use the same covariance. Before we generated gene expression data, we pre-defined a network graph used in our simulation study. Our network graph consists of four different types of graph modules described in Fig. 2. All of four graph modules has a hub gene in the center and the hub gene has commonly 5 edges. The number of genes of four graph modules are 36, 76, 156 and 316, respectively. We assume that our gene expression data forms a total of 40 disjoint graph modules, where 10 graph modules are equivalent to one of four graph modules in Fig. 2. Consequently, the total number of genes in our simulation data is \(p=5,840\), which can be computed by \((36+76+156+316)\times 10=5,840\).
[IMAGE OMITTED: SEE PDF]
In order to create multiple genetic pathways with overlapping genes, we assume that all pathway groups comprise a hub gene and all genes stem from two adjacent edges of the hub gene. Note that a pathway group is regarded as a gene set in our simulation study. Fig. 3 depicts examples of 2 pathway groups with overlapping genes in two smallest graph modules from Fig. 2, Each graph module can have 5 different pathway groups since the hub gene has 5 edges where the hub gene and the genes stem from each edge are overlapped by two pathway groups. Specifically, the smallest graph module in Fig. 3 has 5 different pathway groups each of which consists of 15 genes including 8 overlapping genes. Also, the second smallest graph module in Fig. 3 has 5 different pathway groups each of which consists of 31 genes including 16 overlapping genes. Similarly, the pathways from the largest and second largest graph modules from Fig. 2 have 127 genes and 63 genes, respectively. Since we have 40 disjoint graph modules, the total number of pathway groups is \(K=200\), and the total number of genes including overlapping genes is \(q=11,800\), which can be computed by \((15+31+63+127)\times 5 \times 10=11,800\).
Once we complete to construct the network graph, we can generate gene expression data corresponding to the network graph. We applied a Gaussian graphical model [43] to generate a covariance matrix of 5, 840 genes, where the linked genes are correlated with each other according to the network graph structure. The key assumption of the Gaussian graphical model is that non-zero entries of an inverse covariance matrix imply genetic links between two genes [44, 45]. Therefore, we can expect that the correlation between linked genes are higher than that of unlikend genes. In the simulation data, the inverse covariance matrix generated from our 40 disjoint graph modules is very sparse since the number of edges for individual genes is at most 5. Note that the centered hub genes have the maximum number of edges. More detailed procedure to generate a covariance matrix given a network graph is described by Peng at al. [44]. Let us denote the generated covariance matrix by \(\Sigma\).
In our simulation, we assumed that the covariance of the first experimental condition is the same as that of the second experimental condition while their mean vectors are different from each other. The p-dimensional gene expression data of the i-th individual \(\varvec{x}_i\) was then simulated from two different multivariate normal distributions such that
$$\begin{aligned} \varvec{x}_i \sim {\left\{ \begin{array}{ll} N(0, \Sigma ) & \,\text {if the { i}-th individual is a condition 1} \\ N(\varvec{\mu }, \Sigma ) & \,\text {if the } i \text {-th individual is a condition 2}, \end{array}\right. } \end{aligned}$$
where \(\varvec{x}_i=(x_{i1}, \ldots ,x_{ip})^\textrm{T}\) and \(\varvec{\mu }=(\mu _1,\ldots ,\mu _p)^\textrm{T}\). If the j-th gene is differentially expressed between two conditions, then the j-th mean value \(\mu _j\) was randomly generated from a Uniform distribution from 0.05 to \(\tau\), where \(\tau\) is 0.3, 0.4 and 0.5 for small, moderate and large effect sizes of the DE gene, respectively. In contrast, \(\mu _j=0\) if the j-th gene is not differentially expressed. The number of pathway groups including DE genes was fixed as 20 among a total number of 200 pathway groups. Since there are 4 different sizes of pathway groups, we selected 5 pathway groups from each size to locate DE genes which were randomly assigned. The sample size was \(n=100\), where 50 samples were for condition 1 and the other 50 samples were for condition 2.
[IMAGE OMITTED: SEE PDF]
In genetic pathway analysis with real gene expression data, some pathways can have only a few number of DE genes while all genes in some pathways are differentially expressed. Therefore, we considered 4 different scenarios to control the level of sparsity of true DE genes. Table 1 summarizes the percentage of DE genes each pathway group in 4 different scenarios. If DE genes are overlapped by two pathway groups, both groups are considered as a pathway group including DE genes. Since we located DE genes in only 20 pathway groups, the percentage of DE genes is equivalent to the number of DE genes divided by the total number of genes in the 20 pathway groups. That is, the computation of percentage in Table 1 is based on \((15+31+63+127) \times 5 = 1,180\) genes where overlapping genes were counted twice. Scenario 1 has a total of 3.1% DE genes which indicates that only 37 genes are differentially expressed among 1,180 genes, so Scenario 1 is extremely sparse. On the contrary, the entire 1,180 genes in 20 pathway groups are differentially expressed in Scenario 4. In the similar way, Scenario 2 and Scenario 3 have 9.2% and 46% sparsity of DE genes, respectively.
In this simulation study, we compared selection performance of four different statistical methods which are GSEA [7], OGL [21], network-based regularization (Net) [27] and the proposed wOGL. The OGL method is able to capture only group association signals so it is the same as the wOGL without the group weights. In contrast, Net can only identify association signals of individual genes so it fails to detect group association signals. The proposed wOGL combining both OGL and Net is able to capture both group and individual association signals. Although three selection methods based on penalized regression models can perform pathway group selection, GSEA is a test-based method which was not originally designed for variable selection. Since variable selection and hypothesis testing are completely different approaches in statistics, comparison with GSEA seems to be inappropriate. However, we included GSEA in our simulation study because it is one of the most popular methods in gene set analysis. Moreover, Zeng and Breheny [21] also tried to compare OGL and GSEA in their simulation study although it was limited to the pathway groups containing only DE genes. In our simulation study, we conducted to compare four different methods in four different sparsity levels of DE genes.
For fair comparison of four methods, we first ranked 200 pathway groups by each method. We then computed the true positive rates (TPR) of 20 pathway groups containing DE genes when each method commonly selected top-ranked 10, 20, 30, 40 and 50 pathway groups. wOGL ranked 200 pathway groups by their selection scores (6) while OGL ranked them based on their selection probability (5). For the Net method, selection probabilities of 5,840 individual genes were first computed using the adjacency matrix of 40 disjoint graph modules. Next, each pathway group took the maximum value of selection probability among genes within the same pathway. Finally, ranking of pathway groups were determined by the maximum selection probability. GSEA ranked 200 pathway groups by their p-values. TPR is equivalent to the number of pathway groups including DE genes among a certain number of top-ranked pathway groups divided by 20. For example, if there are 15 pathway groups containing DE genes in the top ranked groups, the TPR is then 0.75. Note that higher TPR clearly indicates a better method since we compared TPR when four methods selected exactly the same number of top-ranked groups.
[IMAGE OMITTED: SEE PDF]
There are 4 different scenarios for the level of sparsity and 3 different effect sizes of DE genes, including small, moderate and large effects. Therefore, we have a total of 12 different settings to compare TPRs of four methods. Generating the simulation data was repeated 100 times each setting, so TPR was averaged over 100 replications each method. In Table 2, the averaged TPRs of four methods are summarized when they selected top-ranked 30 pathway groups. In scenario 1 with the extremely small number of DE genes, Net has the largest TPR followed by the proposed wOGL, while both GSEA and OGL have relatively low TPR. Since Net was essentially designed to perform selection of individual genes, it has an advantage of capturing the sparse association signals of pathway groups, compared with the group-based selection methods such as GSEA and OGL. In contrast, GSEA has the largest TPRs in scenario 4 where all genes in 20 pathway groups are differentially expressed. wOGL also has the second largest TPR in this scenario. We can see that both OGL and Net have relatively low TPRs in scenario 4. Zeng and Breheny [21] pointed out that OGL can be better than GSEA when the size of gene sets are small. However, in our simulation study the TPR of GSEA was always higher than that of OGL except scenario 1 although we considered both small and large sizes of pathway groups.
[IMAGE OMITTED: SEE PDF]
[IMAGE OMITTED: SEE PDF]
The proposed wOGL has overall the highest TPR in scenarios 2 and 3 where DE genes are not too sparse as well as not too dense. This is what we expect in analysis of real gene expression data. In practice, either extremely sparse DE genes or fully DE genes in a gene set are rarely observed. Although the proposed wOGL was not the best in scenarios 1 and 4, the TPR of wOGL is relatively high in all of 12 different settings. The averaged TPRs of four methods for top-ranked 10, 20, 30, 40 and 50 pathway groups are displayed in Fig. 4 for scenarios 1 and 2 and Fig. 5 for scenarios 3 and 4. It appears that the TPR curves of the proposed wOGL were never dropped down in all of 12 settings. Consequently, the proposed wOGL is very robust for identification of genetic pathways with DE genes, compared with other methods which have very low TPR in a particular setting.
[IMAGE OMITTED: SEE PDF]
In the last simulation study, we investigated the error control performance of the proposed wOGL. We fixed the expected number of false discoveries at \(\theta =1, 3\) and 5 in (7) to select a certain number of pathway groups. Then, the true number of falsely selected pathway groups were recorded every simulation replications. Fig. 6 displays 9 jitter plots where 3 different effect sizes and 3 different values of \(\theta\) were considered. When the effect size is small, the numbers of false discoveries greater than the designed level of \(\theta =1, 3\) and 5 are 0, 43, and 74, respectively. For the moderate effect size, they are reduced down to 0, 13, and 38, respectively. For the large effect size, they are only 0, 5, and 12, respectively. Although the proposed method failed to completely control the number of false discoveries among 100 simulation replications, the performance of the error control seems to be well when \(\theta\) is small and the effect size increases. Meinshausen and Bühlmann [36] stated that the theoretical threshold of selection probabilities can control the false discovery rate only when noise variables are weakly independent with each other. However, gene expression data used in our simulation are highly correlated since it was generated from a multivariate normal distribution with the covariance \(\Sigma\). Alternatively, an empirical threshold [37, 41, 42] can be employed to improve the error control rate though it requires huge computational time.
[IMAGE OMITTED: SEE PDF]
Finally, we compared the computational time and memory usage of four methods. Remind that our simulation data has the sample size \(n=100\), the number of genes \(p=5,840\) (\(q=11,800\) with overlapping genes) and the number of gene sets \(K=200\). For wOGL, OGL and Net methods, the total number of resampling to compute the selection probabilities of individual genes was \(R=100\) and the number of different values of \(\lambda\) used for the coefficient estimates was \(M=15\). For the p-value computation of GSEA, 500 permutations were conducted. Table 3 summarizes the mean and relative memory usage and computation time over 10 replications. It appears that the memory usage of wOGL is the largest while that of GSEA is the smallest. Also, wOGL uses slightly more memory than OGL. Regarding computational time, OGL is the fast method while Net is the slowest. wOGL takes more time than OGL but less than GSEA. Note that the computation time of GSEA can be drastically increased for a large number of permutation. The computational time of wOGL depends mainly on the selection probability of OGL but wOGL requires the additional time for coefficient estimation of Net. However, Net needs to compute selection probabilities over 100 resampling, which generally takes more time than other methods.
Real data analysis
We applied the proposed method to the RNA-seq data extracted from The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA). We first downloaded raw gene expression data from the GDC data portal page (https://portal.gdc.cancer.gov). Next, we conducted standard quality control steps such as screening out the protein-coding genes, deleting duplicate gene names and performing quantile normalization, using the TCGAbiolinks package [46] and the DESeq2 package [47] provided by the Bioconductor project (https://www.bioconductor.org). Finally, we ended up with 17,166 genes for 1111 tumor samples and 113 normal samples. We also employed the graphite package [48] to incorporate 3 genetic network databases such as KEGG, reactome and wikipathways, where a total of 3192 genetic pathways have been obtained after matching gene names of our expression data. The number of genetic pathways each database is 320, 2138 and 734 for KEGG, reactome and wikipathways, respectively. We limited the size of the pathway groups to be at least 2. That is, the number of genes in a genetic pathway should be greater than one. The 5-number summary of the sizes of 3192 pathways is \(\min =2\), \(Q_1=9\), \(Q_2=21\), \(Q_3=52\) and \(\max =2,370\). Additional file 1 displays the histogram of sizes of the 3,192 pathways. Also, the 3,192 genetic pathways contain 6,395 overlapping genes which belong to more than one pathway group.
We first conducted GSEA [7] to rank 3,192 pathways based on the empirical p-values of 10,000 permutations. Next, we identified 20 significant pathways at the false discovery rate level of 0.1 using the Benjamini-Hochberg adjusted p-values [49]. Additional file 2 shows the list of 20 pathways along with their adjusted p-value, the size of the pathways and the pathway database. In Additional file 2, we can see that all of 7 pathways have the smallest adjusted p-value of 0.0456 due to the limited number of permutations. Most of these pathways have been already reported to be associated with breast cancer [50,51,52,53]. Next, we also conducted the proposed wOGL to rank 3,192 pathways based on their selection scores. We then identified 15 significant pathways when the expected number of false discoveries is fixed at \(\theta =1\). Table 4 shows the list of 15 pathways along with their selection scores, the size of the pathways and the pathway database. These top ranked pathways also have been reported to be associated with breast cancer: Signaling pathways [54], Immune system [55], Metabolism of proteins [56], Post-translational protein modification [57] and Disease [58]. In particular, Signaling pathways, Immune system and Disease were uniquely identified by wOGL but not by GSEA.
[IMAGE OMITTED: SEE PDF]
Compared with top 20 pathways identified by GSEA in Additional file 2, only 3 pathways such as Metabolism of proteins, Post-translational protein modification and Metabolism in Table 4 were overlapped. Also, it is noticeable that the sizes of top ranked pathways identified by GSEA are relatively small while those of top-ranked pathways identified by wOGL are huge. For further investigation of top-ranked pathways, we performed an independent two-sample T-test for individual genes in these pathways. Based on Bonferroni adjusted p-value at a level of 0.05, we identified significant DE genes each pathway. The proportions of significant genes are 0.0139 for Signaling pathways, 0.0195 for Immune system, 0.0201 for Metabolism of proteins, 0.0194 for Post-translational protein modification and 0.0197 for Disease. Figure 7 displays two network graphs of Signaling pathways in the left and Immune system in the right, where significant genes are colored in orange. Additional file 3 displays the network graphs of Metabolism of proteins in the left and Post-translational protein modification in the right. We found that the number of significant genes of top-ranked pathways identified by wOGL is extremely low, i.e., they have very sparse association signals.
[IMAGE OMITTED: SEE PDF]
In the list of 7 top-ranked pathways identified by GSEA, the proportions of significant genes are 0.6 for Caffeine metabolism, 0.3636 for Ubiquinone and other terpenoid-quinone biosynthesis, 0.25 for Arylamine metabolism, 0.0811 for Chemical carcinogenesis - DNA adducts, 0.0377 for Metabolism of xenobiotics by cytochrome P450, 0.0357 for Metabolism and 0.0331 for Metabolic pathways. Note that they are all greater than the maximum proportion of significant genes in 4 top ranked pathways identified by wOGL. The number of significant genes of top-ranked pathways identified by GSEA is relatively high, compared with wOGL. This indicates that GSEA prioritizes gene sets that have a greater number of DE genes. Additional file 4 displays three network graphs of Caffeine metabolism in the right, Ubiquinone and other terpenoid-quinone biosynthesis in the middle, and Arylamine metabolism in the right. Additional file 5 displays the network graphs of Metabolism of xenobiotics by cytochrome P450 in the left and Chemical carcinogenesis - DNA adducts in the right. Additional file 6 displays the network graphs of Metabolic pathways in the left and Metabolism in the right. In conclusion, the proposed wOGL was able to identify some cancer-related pathways with sparse association signals which were missed by GSEA.
Conclusions
In this article, we proposed new statistical and computational method to incorporate prior network knowledge into gene set analysis. The proposed method essentially combines network-based regularization and overlapping group lasso to utilize the network structure of multiple genetic pathways with overlapping genes. The proposed wOGL does not only perform group selection for genetic pathways associated with a phenotype outcome, but also captures sparse association signals of differentially expressed genes. It computes the selection score of each pathway so that we can rank all pathways based on their selection scores. The selection score measures a relative selection frequency of the pathway based on re-sampled expression data, adjusting the effect size of individual genes. An appropriate threshold of the selection scores can be determined to control the number of false discoveries. In our simulation study, we have demonstrated that the proposed method is very robust in terms of true positive selection, where outcome-related pathways were consistently identified regardless of a number of differentially expressed genes. In analysis of real gene expression data from breast cancer study, we identified some cancer-related genetic pathways which were missed by the existing method. We also developed an R package “wOGL” to implement the proposed method, and the package is publicly available at https://github.com/statddd25/wOGL.
Although the proposed method focused on the coefficient estimates of network-based regularization to capture association signals of individual genes, it can be replaced by the estimates of different types of regularization. Network-based regularization we employed basically assumes that the correlations of linked genes in a pathway is higher than those of unlinked genes. Therefore, we generated gene expression data based on the covariance derived from the prior network structure in our simulation study. The partial correlation has been widely used to construct gene networks [44, 59, 60]. However, the covariance of gene expression data is often different from those inferred from the prior network such as KEGG pathways. In this case, the coefficient estimates of network-based regularization may deliver incorrect information. This is one limitation of the proposed wOGL.
Additionally, incomplete networks and incorrectly identified edges can mislead the coefficient estimation of network-based regularization. In this case, the selection performance of wOGL could be wore than that of OGL. In recent studies new statistical methods have been proposed to address incompleteness and noises from network graphs [61, 62]. We expect that they may improve the estimation accuracy of network-based regularization for either incomplete or incorrect networks. Alternatively, we can choose elastic-net regularization [63] to capture association signals of individual genes when network knowledge is incomplete. Since the Laplacian penalty of network-based regularization is the same as the squared \(l_2\)-norm penalty if an adjacency matrix is missing, elastic-net regularization is a special form of network-based regularization. However, elastic-net regularization has a limitation to utilize prior network knowledge although it is able to identify differentially expressed genes.
Availability of data and materials
The TCGA-BRCA datasets generated and analysed during the current study are available at https://portal.gdc.cancer.gov/projects/TCGA-BRCA.
Wang Kl, Zhang H, Kugathasan S, Annese V, Bradfield JP, Russell RK, et al. Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. Am J Hum Genet. 2009;84(3):399–405.
Menashe I, Maeder D, Garcia-Closas M, Figueroa JD, Bhattacharjee S, Rotunno M, et al. Pathway analysis of breast cancer genome-wide association study highlights three pathways and one canonical signaling cascade. Cancer Res. 2010;70(11):4453–9.
Eleftherohorinou H, Hoggart CJ, Wright VJ, Levin M, Coin LJ. Pathway-driven gene stability selection of two rheumatoid arthritis GWAS identifies and validates new susceptibility genes in receptor mediated signalling pathways. Hum Mol Genet. 2011;20(17):3494–506.
Lucas SM, Heath EI. Current challenges in development of differentially expressed and prognostic prostate cancer biomarkers. Prostate Cancer. 2012;2012: 640968.
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–7.
Chen M, Cho J, Zhao H. Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 2011;7(4): e1001353.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.
Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–7.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, et al. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics. 2007;8:242.
Choi Y, Kendziorski C. Statistical methods for gene set co-expression analysis. Bioinformatics. 2009;25(21):2780–6.
Wu D, Lim E, Vaillant F, Asselin-Labat ML, Visvader JE, Smyth GK. ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics. 2010;26(17):2176–82.
Rahmatallah Y, Emmert-Streib F, Glazko G. Gene Sets Net Correlations Analysis (GSNCA): a multivariate differential coexpression test for gene sets. Bioinformatics. 2014;30(3):360–8.
Hsueh H, Tsai C. Gene set analysis using sufficient dimension reduction. BMC Bioinformatics. 2016;17:74.
Rahmatallah Y, Zybailov B, Emmert-Streib F, Glazko G. GSAR: Bioconductor package for Gene Set analysis in R. BMC Bioinformatics. 2017;18:61.
Oh M, Kim K, Sun H. Covariance thresholding to detect differentially co-expressed genes from microarray gene expression data. J Bioinform Comput Biol. 2020;18:2050002.
Castresana-Aguirre M, Sonnhammer E. Pathway-specific model estimation for improved pathway annotation by network crosstalk. Sci Rep. 2020;10:13585.
Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, et al. Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform. 2021;22:545–56.
Buzzao D, Castresana-Aguirre M, Guala D, Sonnhammer E. Benchmarking enrichment analysis methods with the disease pathway network. Brief Bioinform. 2024;25:bbae069.
Wei Z, Li H. Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics. 2007;8(2):264–84.
Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–88.
Zeng Y, Breheny P. Overlapping Group Logistic Regression with Applications to Genetic Pathway Selection. Cancer Inform. 2016;15:179–87.
Jacob L, Obozinski G, Vert J. Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning. ACM, Montreal, Canada. 2009;433–440.
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30.
Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39:D691–7.
Slenter D, Kutmon M, Hanspers K, Riutta A, Windsor J, Nunes N, et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018;46(D1):D661–7.
Wishart D, Li C, Marcu A, Badran H, Pon A, Budinski Z, et al. PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Res. 2020;48(D1):D470–8.
Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–82.
Li C, Li H. Variable Selection and Regression Analysis for Covariates with a Graphical Structure with an Application to Genomics. Ann Appl Stat. 2010;4:1498–516.
Sun H, Wang S. Network-based regularization for matched case-control analysis of high-dimensional DNA methylation data. Statistics in Medicine. 2013;32(12):2127–39.
Huang HH, Liang Y, Liu X-Y. Network-Based Logistic Classification with an Enhanced \(L_{1/2}\) Solver Reveals Biomarker and Subnetwork Signatures for Diagnosing Lung Cancer. Biomed Res Int. 2015;2015:713953.
Jiang HK, Liang Y. The \(L_{1/2}\) regularization network Cox model for analysis of genomic data. Comput Biol Med. 2018;100:203–8.
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics. 2019;20(510):1–15.
Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. J R Stat Soc Series B Stat Methodol. 2006;68(1):49–67.
Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. J R Stat Soc Series B Stat Methodol. 2008;70(1):53–71.
Yang Y, Zou H. A fast unified algorithm for solving group-lasso penalize learning problems. Stat Comput. 2015;25:1129–41.
Meinshausen N, Bühlmann P. Stability Selection. J R Stat Soc Series B Stat Methodol. 2010;72:417–73.
Alexander D, Lange K. Stability Selection for Genome-wide Association. Genet Epidemiol. 2011;35:722–8.
Sun H, Wang S. Penalized Logistic Regression for High-dimensional DNA Methylation Data Analysis with Case-Control Studies. Bioinformatics. 2012;28:1368–75.
Lee G, Sun H. Selection Probability for Rare Variant Association Studies. J Comput Biol. 2017;24:400–11.
Sun H, Wang Y, Chen Y, Li Y, Wang S. pETM: a penalized Exponential Tilt Model for analysis of correlated high-dimensional DNA methylation data. Bioinformatics. 2017;33:1765–72.
Kim K, Jun T, Ha B, Wang S, Sun H. New statistical selection method for pleiotropic variants associated with both quantitative and qualitative traits. BMC Bioinformatics. 2023;24(1):381.
Kim K, Koo J, Sun H. An empirical threshold of selection probability for analysis of high-dimensional correlated data. J Stat Comput Simul. 2020;90(9):1606–17.
Whittaker J. Graphical models in applied mathematical multivariate statistics. Wiley; 1990.
Peng J, Wang P, Zhou N, Zhu J. Partial Correlation Estimation by Joint Sparse Regression Models. J Am Stat Assoc. 2009;104:735–46.
Sun H, Li H. Robust Gaussian Graphical Modeling via l1 Penalization. Biometrics. 2012;68:1197–206.
Colaprico A, Silva T, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8): e71.
Love M, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.
Sales G, Calura E, Cavalieri D, Romualdi C. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics. 2012;13:20.
Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B Stat Methodol. 1995;57(1):289–300.
Su D, Wang S, Xi Q, Lin L, Lu Q, Yu Y, et al. Prognostic and predictive value of a metabolic risk score model in breast cancer: an immunogenomic landscape analysis. Brief Funct Genomics. 2022;21(1):128–41.
Farahzadi R, Valipour B, Fathi E, Pirmoradi S, Molavi O, Montazersaheb S, et al. Oxidative stress regulation and related metabolic pathways in epithelial-mesenchymal transition of breast cancer stem cells. Stem Cell Res Ther. 2023;14(1):342.
Chen S, Li X, Ao W. Prognostic and immune infiltration features of disulfidptosis-related subtypes in breast cancer. BMC Womens Health. 2024;24(1):6.
Brown K. Metabolic pathways in obesity-related breast cancer. Nat Rev Endocrinol. 2021;17(6):350–63.
Fang Z, Meng Q, Xu J, Wang W, Zhang B, Liu J, et al. Signaling pathways in cancer-associated fibroblasts: recent advances and future perspectives. Cancer Commun (Lond). 2023;43(1):3–41.
Huang R, Wang Z, Hong J, Wu J, Huang O, He J, et al. Targeting cancer-associated adipocyte-derived CXCL8 inhibits triple-negative breast cancer progression and enhances the efficacy of anti-PD-1 immunotherapy. Cell Death Dis. 2023;14(10):703.
Wu J, Hu W, Yang W, Long Y, Chen K, Li F, et al. Knockdown of SQLE promotes CD8+ T cell infiltration in the tumor microenvironment. Cell Signal. 2024;114: 110983.
Jia Y, Guo Y, Jin Q, Qu H, Qi D, Song P, et al. A SUMOylation-dependent HIF-1alpha/CLDN6 negative feedback mitigates hypoxia-induced breast cancer metastasis. J Exp Clin Cancer Res. 2020;39(1):42.
Buzzao D, Castresana-Aguirre M, Guala D, Sonnhammer E. Benchmarking enrichment analysis methods with the disease pathway network. Brief Bioinform. 2024;25(2):bbae069.
de la Fuente A, Bing N, Hoeschele I, Mendes P. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics. 2004;20(18):3565–74.
Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21(6):754–64.
Li W, Chang C, Kundu S, Long Q. Accounting for network noise in graph-guided Bayesian modeling of structured high-dimensional data. Biometrics. 2024;80:ujae012.
Wang H, Qiu Y, Guo H, Yin Y, Liu P. Information-incorporated gene network construction with FDR control. Bioinformatics. 2024;40:btae125.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005;67(2):301–20.
© 2025. This work is licensed under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.