Deep learning on graphs for multi-omics

Full text

Turn on search term navigation

Introduction

Chronic obstructive pulmonary disease (COPD) is a common, genetically complex and clinically heterogenous disease with unclear pathogenesis [1]. Although prediction models for COPD have been trained on clinical features and imaging data, these features may not help reveal mechanisms contributing to the development of COPD [2–4]. For example, Macaulay et al. developed a model to predict COPD with health-claim data including age, sex, COPD-related resource utilization (such as oxygen use), and all-cause healthcare utilization. The final model correctly predicted COPD severity (Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines) with 62.7% accuracy [3]. Although this model helps predict COPD status, the prediction accuracy could be biased since it uses COPD-related resource utilization features (such as oxygen use) to predict COPD status. In addition, Himes et al. developed a Bayesian network model to predict COPD in asthma patients with Electronic Medical Records including age, sex, COPD-related symptoms (short of breath), and diseases (bronchitis and bronchiolitis) [2]. Although the final model in Himes et al. study correctly predicted COPD with 83% accuracy computed as the area under the Receiver Operating Characteristic curve (AUROC), the prediction model was trained on the unbalanced dataset (46 cases vs 946 controls). In addition, the significant age difference between controls and cases in their study led to the finding that the Age variable by itself can achieve 81% accuracy, which questions the generalizability of the model to other data sets. In addition, Schroeder et al. recently developed a CNN imaging-based model, with physiologic lung function data (PFTs) for COPD prediction, which achieved 74.9% accuracy [5]. Although the CNN imaging model achieved good accuracy, it does not give insight into the molecular mechanism of COPD development.

With the rapid development of genomic and proteomic technologies, high-throughput omics data allow comprehensive characterization of complex diseases, empower disease-specific network construction, and provide potential biomarkers to predict complex heterogeneous diseases [6, 7]. The direct and indirect interactions of biological entities including mRNAs and proteins can be represented as regulatory network(s). Networks provide a graphical representation of molecular interactions that helps explain pathogenesis for other complex diseases such as COPD [8–10]. Li et al. integrated nine omics data blocks by similarity network fusion (SNF) and demonstrated the improvement of group clustering and classification through network-based multi-omics integration. However, this study included only 52 female subjects and did not investigate what features are important for group clustering and COPD classification. [11].

Recently graph-based neural network (GNN) techniques have emerged, providing an opportunity to leverage biological network information. Computational efficiency used to be a bottleneck for GNN. One method proposed by Defferrard et al. leverages the spectral Convolutional graph Neural Network (ConvGNN) and improves computational efficiency significantly by using the Chebyshev approximation technique [12]. As a result, this cutting-edge method outperforms the existing approaches in terms of accuracy in many experiments [13–15]. Based on the spectral-based ConvGNN, Rhee et al. developed a hybrid approach of relation network and localized Graph Convolutional Filtering to incorporate protein-protein interaction information and gene expression data (single omics data) for breast cancer subtype classification and achieved significantly better performance than many existing methods [15]. Rhee et al.’s study demonstrates that network topology-based methods exhibit superior statistical power in the classification of diseases as well as other network-based methods including Paradigm and NetGSA [16, 17]. Schulte-Sasse et al. recently integrated mutations, copy number changes, DNA methylation, and gene expression together with protein-protein interaction (PPI) networks with graph convolutional networks for cancer prediction [18]. However, the use and interpretation of ConvGNN for integrating multi-omics data are still not well developed for complex diseases including COPD. In addition, Li et al. recently developed a novel method based on graph convolution network (GCN) for early detection of COPD with chest CT data and achieved higher accuracy than previous studies. However, it does not help reveal the molecular mechanism of COPD development [19].

In this study, we report a novel implementation of spectral-based ConvGNN for COPD classification: integrating multi-omics data (proteomics or transcriptomics) and disease-specific protein/genetic interaction graph information in the form of protein-protein interaction (PPI) networks. We demonstrate that ConvGNN outperforms many existing classification methods, such as Random Forests [20], Support Vector Machine (SVM) [21], and Scalable Tree Boosting System method (XGBoost) [22], which do not incorporate network information. We then extended the current ConvGNN to successfully incorporate two omics data (both transcriptomics and proteomics data) and achieved better performance than single omics data for COPD classification. Network information was retrieved from the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database, one of the most comprehensive protein/genetic interaction databases [23]. However, PPI information specific to COPD in STRING might be limited. To build a more complete and specific protein-protein interaction network related to COPD, we used our previously developed AhGlasso (Augmented High-Dimensional Graphical Lasso Method) method [24] to reconstruct a network, and demonstrated how integrating the updated COPD-associated PPI further improves performance. To evaluate stability of the ConvGNN model, we used cross-validation to measure classification performance. Finally, to interpret the ConvGNN model, we applied SHapley Additive exPlanations (SHAP) analysis values [25] to identify top significant genes/proteins and important subnetworks for COPD.

Materials and methods

COPDGene study population and ethics statement

The NIH-sponsored multicenter Genetic Epidemiology of COPD (COPDGene, ClinicalTrials.gov Identifier: NCT00608764) study was approved and reviewed by the institutional review board at all participating clinical centers [26–28]. All study participants provided written informed consent. COPDGene study enrolled 10,198 participants with and without Chronic Obstructive Pulmonary Disease (COPD) between 2007 and 2011 (Phase I study) to identify genetic factors associated with COPD [27]. Participants were brought back in a five-year follow-up visit from 2013 to 2017 (Phase II study). Each in-person visit included spirometry before and after albuterol, quantitative computed tomography (CT) imaging of the chest, and blood sampling. In this study, we focus on -omics data sets collected in Phase 2 based on blood sampling. Data analyzed in this work were de-identified.

Clinical variables and definitions

COPD was defined by post-bronchodilator spirometry FEV1/FVC < 0.7, where FEV1 is forced expiratory volume in the first second and FVC is Forced Vital Capacity. Subjects who did not fall into these categories were defined as controls. The Global Obstructive Lung Disease (GOLD) system was used to grade the severity of airflow limitation: mild—GOLD 1 (FEV1 ≥ 80%), moderate—GOLD 2 (50% ≤ FEV1 < 80%), severe—GOLD 3 (30% ≤ FEV1 < 50%), and very severe—GOLD 4 (FEV1 < 30%) [29]. Of note, GOLD 0 represents at-risk stage and was defined by the presence of risk factors (smoking) and symptoms (chronic cough and sputum production) in the absence of post-bronchodilator spirometry abnormalities that cross the diagnostic threshold for COPD (FEV1/FVC < 0.7) [30]. In other words, GOLD 0 represents an individual at-risk stage (chronic smoker) but without COPD (control).

Proteomics data and processing

Proteomic profiles were constructed on participants who agreed to participate in the ancillary study of COPDGene Phase II. All analyses were performed on frozen plasma from p100 tubes as previously described [31]. 1086 subjects passed QC in proteomic profiling. The never smoker controls and subjects who had lung transplants before Phase II were excluded.

For the SOMAscan 1.3k assay, intra-run normalization and inter-run calibration were performed according to SOMAscan assay data quality-control procedures as defined in the SomaLogic good laboratory practice quality system. Data from all samples passed recommended quality-control criteria by SOMAscan (Sample Normalization 0.4—2.5, Plate Scale Factor 0.4—2.5, SOMAmer Calibration median ± 0.4, Plate Median Scale Factor 0.8—1.2, Plate tail test less than 10%) and were fit for analysis. The data were standardized to have mean 0 and standard deviation 1. To map with the STRING database, the proteomics expression data without one-to-one mapping to gene symbols were removed. For example, if two SomaScan® aptamers map to the same gene symbol, these two aptamers’ corresponding expression data were removed. In addition, some aptamers either detect the expression level of a protein complex or detect the total expression level of several proteins by targeting a shared subunit. These aptamers were removed as well for simplicity. Expression data for 1212 proteins were retained for analysis.

Transcriptomics data and processing

Transcriptomic profiles were also generated to identify genomic factors associated with COPD [9, 27, 32] at Phase II. The mRNA high-throughput sequencing data was generated from peripheral blood samples collected and previously described [33]. Briefly, total RNA was extracted from PAXgene ™ Blood RNA tubes and cDNA library preparation was performed with the Illumina TruSeq Stranded Total RNA kit (Illumina, Inc., San Diego, CA). 75 bp paired-end reads were generated on Illumina sequencers. Reads were trimmed of TruSeq adapters and then aligned to the GRCH38 genome. Quality control was performed using the FastQC and RNA-SeQC programs [34, 35].

The complete CODPGene total mRNA sequencing data contained 60232 transcripts that were measured on 3985 peripheral blood samples. For the purposes of this study, we filtered down to the 19889 protein-coding genes through Ensemble annotation by “biomaRt” R package [36]. We then applied upper-quartile normalization and Remove Unwanted Variation Using Residuals (RUVr) [37] to remove unwanted variance including batch effect. The generalized linear model used in RUVr to determine residuals includes the following covariates: sex, race, age, and GOLD stage. RUV utilizes the residuals from a first-pass generalized linear model (GLM) regression of the counts on the covariates. The “k” for RUVr was set to 3 because the variation decreased drastically after adding the first three components. Finally, the corrected sequencing counts were transformed to be homoscedastic via a variance stabilizing transformation (VST) [38].

Proteomic-transcriptomic integration

719 subjects were profiled using both proteomics and transcriptomics. Because there is clinical heterogeneity of COPD across all stages of airflow limitation, we filtered out the subjects with GOLD = 1 since its diagnosis might be problematic per discussion with clinical experts. The moderate-to-severe (GOLD 2–4) subjects were used to represent cases since they are a more homogenous group of COPD as has been the standard for COPD genetic studies [39]. In total there were 614 subjects including 296 controls and 318 COPD cases (GOLD 2–4).

Among the above 1212 SomaScan® detected proteins, 1183 of them had corresponding mRNA transcript measurements. We focused on these 1183 pairs of mRNAs/proteins to construct the ConvGNN model for multi-omics integration. Of note, 25 pseudo nodes with neutral value (0) were added to the input matrix to ensure fast pooling of graph signals in graph neural network model training.

STRING PPI database

STRING (http://www.STRING-db.org) is a database of known and predicted protein-protein interactions [23]. STRING currently covers 5,214,234 proteins from 1133 organisms. In STRING, protein-protein pair associations (i.e., the “edge weights” in each network) are represented by confidence scores. The scores indicate the estimated probability that a given interaction is biologically meaningful, specific, and reproducible, given the supporting evidence. There are seven evidence channels in STRING: (1) experiments; (2) database; (3) text-mining; (4) coexpression; (5) neighborhood; (6) fusion; and (7) co-occurrence. The edge scores (weights) between proteins in the STRING PPI database range from 0 to 1, with 1 being the highest possible confidence of interaction [23]. The edge score is the probability that the protein-protein interaction really exists given the available evidence [23]. We retrieved human PPI data from STRING and filtered out edges with scores less than 0.4, the default threshold in “STRINGdb” R package to achieve medium confidence to well balance false positive and false negative as recommended by STRING [23, 40].

PPI reconstruction with AhGlasso

Using the prior PPI network retrieved from the STRING database, we applied our previously developed AhGlasso algorithm [24] on the transcriptomics data of 2656 subjects without proteomics measurements but with known clinical phenotype data to construct COPD-associated networks. In order to find the optimal regularization parameter, λ, networks were estimated under a sequence of λ values. The upper bound λ_max was calculated as described previously [24]. With predefined 0.01 λ minimal ratio, λ_min = 0.01 * λ_max. Then 40 λ candidates were generated between λ_min and λ_max in a log scale for grid search. The optimal λ was selected using the Bayesian Information Criterion (BIC) based on the penalized log Likelihood with 4-fold cross-validation and one standard error rule [24]. The resultant network serves as the updated COPD-associated PPI network for model development.

Localized pattern representation by Convolutional Graph Neural Network

ConvGNNs have recently become a popular approach for graph data because of their ability to extract features from graphs [13, 15, 41]. ConvGNN approaches are divided into two categories, spatial-based and spectral-based. The spatial-based ConvGNN methods are often based on random walk and mainly used for node prediction. The spectral-based methods are built based on spectral graph theory and useful for graph prediction. In this study, we are interested in COPD disease prediction (graph prediction) but not individual gene/protein function (node prediction). Therefore, we focus on the spectral-based ConvGNN development and compare its performance with existing classification methods. As described in Rhee et al.’s study [15], for capturing localized patterns of data (gene/protein expression profile), the input data is mapped onto the graph structure with the graph convolution technique.

Let expression data be X_n×p, where l = 1, …, n denotes the number of samples and i = 1, …, p denotes the number of nodes (genes or proteins). Let x_l ∈ R_n be the vector of expression values of p genes/proteins in the sample l. Then the graph Laplacian matrix L is used to find the spectral localized patterns of x_l under the graph structure G. The Laplacian matrix L_p×p of graph G is defined as L = D − A, where D is a weighted degree matrix and A is the adjacency matrix of the graph. Then the graph convolution is defined with the graph Laplacian matrix. Let’s assume that L = UΛU^T is an eigenvalue decomposition of graph Laplacian L, where U = [u₁, …, u_p] is a matrix composed of eigenvectors and Λ is a diagonal matrix diag[τ₁, …, τ_p] composed of eigenvalues . Then the graph Fourier transform (F) is defined as and the inverse graph Fourier transform (F⁻¹) is defined as [12, 42, 43].

Because it is hard to express a meaningful convolution operator in the vertex domain, Defferrard et al. defined the convolution operator on graph *_g in the Fourier domain as follows [12],(1)where x_l is the feature signal of nodes (e.g., gene/protein expression value in the first layer) in subject l while y_l is the transformed signal after graph convolution of subject l. The gene/protein expression data serve as signals of graph nodes in the PPI network and are convoluted with the graph convolution layers. Let the filter with the convolution operator on graph *_g in Eq 1; we can extract high level feature (y) by filtering signal x with the filter g_θ as Defferrard et al.’s study [12],(2)where g_θ(Λ) is a diagonal matrix , and the parameter θ is a vector of Fourier coefficients.

One graph convolution filter is a polynomial parameterized filter, that can express localized patterns in K-hop neighboring nodes [12, 42]. However, learning polynomial parameters is computationally intensive . In 2011, Hammond et al. proposed an approximated polynomial named Chebyshev expansion [42] to significantly reduce the computation burden and improve efficiency. The Chebyshev polynomial T_k(x) of order k is recursively computed by the stable recurrence relation, T_k(x) = 2xT_k−1(x)−T_k−2(x), where T₀ = 1 and T₁ = x. Thus, the filter can be approximated as , where . With the Chebyshev approximation technique, the entire filtering operation reduces to , where E is a set of edges in a PPI network [12, 15].

In spectral-based ConvGNN, the convolutional graph signal is further pooled with neighboring nodes identified by the Graclus algorithm (Pooling) as described previously [12, 15]. For fast pooling of graph signals, pseudo nodes with neutral value (0) are introduced in the graph neural network as previous study [12]. There are several pooling strategies in the neural networks such as max pooling and average pooling [44]. The average pooling was used in this study as discussed below.

Multi-omics integration through ConvGNN for COPD classification

The ConvGNN model developed in Rhee et al.’s study deals with single-omic data. We extend this model to integrate multi-omics data (transcriptomics and proteomics) with the protein-protein interaction (PPI) network in the ConvGNN (Fig 1) to test whether it could further improve COPD classification accuracy. We set the same convolution kernel for proteomics and transcriptomics data to reduce the computation burden. The extracted features through graph convolution are stacked on top of each other (i.e., concatenation) and fed into the fully connected layers.

[Figure omitted. See PDF.]

The inputs are a protein-protein interaction network (PPI) and omics data which could be single omics data only or multi-omics data. The PPI network could be retrieved from STRING databases or reconstructed from the AhGlasso algorithm. The red edges between nodes represent changes between the original PPI from STRING and the updated PPI with AhGlasso. The input is fed into a Spectral-based Convolutional Graph Neural Network, which typically includes layers for graph convolution and pooling to extract features with different kernels. The graph convolution and pooling could be repeated as shown on the top right. The resultant features will be passed to fully connected layers to calculate the probability of COPD using the softmax function.

Model development and ConvGNN hyper-parameters optimization

To ensure model stability, we use 4-fold cross-validation We randomly sample 20% of samples as a testing set. We randomly split the remaining 80% sample into 4 folds, where 3 folds serve as the training set and the remaining one fold serves as the validation set as shown in S1 Fig in S1 File. We cycle through the folds, where each fold serves as a validation set and the remaining 3 folds are used for training. The model parameters were tuned with optuna [45] python library [45] based on the model performance on the validation dataset. The hyper-parameter candidate choices were based on Rhee et al.’s study and listed in the S1 Table in S1 File. The general grid search algorithm explores the possible combinations of hyper-parameters.

With a grid hyperparameter search with the optuna, we chose two convolution layers and 3 fully-connected layers (S2A Fig in S1 File). The number of parameters in each layer is shown in S2B Fig in S1 File. Specifically, the input signal X of each omics data is normalized by a batch normalization method [46, 47] to make the learning process stable. Then, the normalized input signal is filtered by a graph convolution layer. Two graph convolution layers were used. Each layer has 32 convolution filters. The filter size of the first layer is 10 and the second layer is 2. The Rectified Linear Unit (ReLu) is used after graph convolution [48]. The pooling size is 2 for both of the layers. Of note, the convoluted signal is normalized through a batch normalization method so that the learning process can be accelerated and have a regularization effect. Then, ReLU activation function [48] and average pooling are applied. The procedure from graph convolution to average pooling is defined as graph convolution layer as in previous studies [12, 15]. After two graph convolution layers, a final feature map is used as an input of a fully connected layer and the dimension of extracted features in each layer is listed in S2B Fig in S1 File. Two fully connected layers are used with 64, 64 hidden nodes for each. Then cross-entropy between prediction and classification label is minimized by the Adam algorithm [49]. The learning rate = 0.001 with decay rate = 0.95 and decay step = 10 iterations. The model was trained with 25 epochs. To prevent over-fitting, we add the following regularization strategies in the model training process: 1) Dropout rate = 0.3 [50]; 2) L2 regularization = 0.0005 [51, 52]; 3) Early stopping based on the model performance on the validation dataset [53, 54].

Model evaluation

The model is evaluated by prediction accuracy, and F1 score of CV-trained models on the testing dataset. Accuracy is computed as(3)where TP, TN, FP, and FN are the number of true positives, true negatives, false positives, and false negatives, respectively. F1 score is computed as(4)where Precision , and Recall = .

We compare the performance of spectral-based Convolutional Graph Neural Networks with conventional classification methods including Random Forests (RF), Support Vector Machines (SVM), and eXtreme Gradient Boosting (XGB) as well as a regular deep learning method without graph information called Multi-Layer Perceptron (MLP). To illustrate the strength of multi-omics data integration through Convolutional Graph Neural Network, we compare the model performances with single omics data and multiple omics data. Of note, compared to the Convolutional graph neural networks, we use the exact same set of features (single omics data (1183 mRNAs or 1183 proteins), or multiple omics data) except for the protein-protein interaction networks for other algorithms (RF, SVM, XGB, and MLP). The hyper-parameters in other machine learning methods (RF, SVM, XGB, AND MLP) are also tuned based on the same validation dataset as ConvGNN used. We also compare performance using the known PPI or the COPD-associated PPI on the ConvGNN.

SHAP analysis on the ConvGNN model of two omics and COPD-associated PPI

To explain how the spectral-based Convolutional Graph Neural Network model(s) classifies COPD, we applied SHAP method to identify important genes/proteins and subnetworks [25]. SHAP is a unified approach to interpreting model predictions. The SHAP value is the average contribution of features that are predicted in different situations and it has three desirable properties for model interpretation: local accuracy, missingness, and consistency. The SHAP method has been demonstrated to have better consistency with human intuition than previous approaches including LIME and DeepLIFT [25]. We focused on the ConvGNN model that integrates two omics data and COPD-associated PPI networks because this model achieves the best performance among the evaluated models. We used the training dataset as a background dataset to generate the perturbed dataset required for training the surrogate models. The testing dataset was used to explain the model output. To interpret the model, sampling data points in the neighborhood of the original data point was performed to build surrogate models [25, 55]. 1200 samples were drawn to build the surrogate model for explaining each prediction. We performed SHAP analysis on an AWS instance with R5.2xlarge 8vCPU and 64G RAM provided by NHLBI BioData Catalyst.

Gene Ontology enrichment analysis on important features identified by SHAP

To illustrate the effectiveness of the SHAP method and identify the associated pathways and networks of important features in the ConvGNN model, we performed Gene Ontology (GO) enrichment analysis with Fisher’s exact test using the topGO R package [56] on the top 30 important genes/proteins identified with SHAP value calculation. The background genes/proteins in GO analysis are the initial 1183 genes/proteins. In other words, the reference list is the list of all the genes/proteins used for model training. Gene Ontology (GO) is a well-known framework for supporting the computational representation of biological systems [57] and has often been used to evaluate the quality of newly constructed or reconstructed protein-protein interaction networks [58, 59]. Specifically, a statistical test is performed to examine whether the number of identified genes belonging to a particular gene set/pathway is higher than that expected by random chance, as determined by comparison to a background gene list [57]. The adjusted P values were calculated with Benjamini-Hochberg Procedure for False Positive Rate (FDR) correction. The significant level was set to FDR < 0.05. In a multi-omics study, enrichment can be performed on a single omic data type, such as transcriptomics and metabolomics enrichment analysis separately or multi-omics together with integration, such as transcriptomics and metabolomics joint pathway analysis [60, 61]. The two omics in our study are transcriptomics (mRNAs) and proteomics (proteins). GO annotations are created by associating a gene’s transcript (mRNA) or gene’s product (protein) with a GO term. For each gene, its mRNA and protein were linked to the same GO annotation. Due to the intricate relationship between mRNA and protein and their joint role in regulating gene expression and protein synthesis, it can be difficult to differentiate their individual functions. Therefore, we performed GO enrichment of 30 important gene features including mRNA and protein, but not analyzed them separately.

Statistical software

Unless otherwise specified, the data manipulation and data analyses were performed using Python [62]. The Python libraries Tensorflow_1.11.0 [63], optuna_2.10.0 [45], sklearn_0.24.5 [64], pandas_1.1.5 [65], numpy_1.19.5 [66] and scipy_1.5.3 [67] were used for data preparation and ConvGNN model development. SHAP_0.28.5 [68] was used for explaining the trained Convolution Graph Neural Network models and identifying important genes/proteins for COPD classification.

Results

Clinical characteristics of subjects

The samples in this study covered a range of spirometry profiles including normal (296), COPD with all 4 grades of GOLD airflow limitation severity (GOLD 1: 70; GOLD 2: 152; GOLD 3: 80; GOLD 4: 16) (Table 1). There are differences in age, gender BMI, neutrophil percent, lymphocyte percent, FEV1pp, and Percent emphysema in different groups (p-value < 0.05). However, race is not statistically different (p-value > 0.05).

[Figure omitted. See PDF.]

Because there is uncertainty on the COPD diagnosis with GOLD stage 1, we removed those subjects for further analysis in classification model development and evaluation. We categorized the samples into two groups: controls and COPD cases (GOLD = 2—4) (Table 1). There are differences in age, gender, race, smoking status BMI, neutrophil percent, lymphocyte percent, FEV1pp, and Percent emphysema between the two groups (p-value < 0.05) while the percent of eosinophils is not statistically different (p-value > 0.05).

Graph Convultional Neural Network outperforms conventional methods with single omics data

First, we trained the ConvGNN models on single omics data along with the general PPI network retrieved from the STRING database. A typical learning curve of ConvGNN on the validation dataset with proteomics data was shown in S3 Fig in S1 File. We chose the optimal model based on the model performance on the validation set. For the proteomics data, we found that the ConvGNN model has significantly higher prediction accuracy and F1 score than other tested methods. The average prediction accuracies of the four methods that only used the proteomic data were 55.28 ± 0.94, 61.19 ± 1.69, 62.21 ± 1.70, and 65.66 ± 1.13 for RF, SVM, XGB and MLP respectively. ConvGNN, which incorporated the prior knowledge of PPI and protein expression data achieved higher accuracy of 67.38 ± 1.29 (S2 Table in S1 File), which was statistically significant than all other methods (all P < 0.05, paired student’s t-test) (Fig 2A). The testing dataset is well-balanced and the F1 score results follow a similar pattern to the accuracy (S4 Fig in S1 File).

[Figure omitted. See PDF.]

The ConvGNN models were trained in a 4-fold CV strategy with single omics data: proteomics data (A) or transcriptomics data (B). The STRING PPI network was used for the graph convolution. Four other classification methods were also evaluated: RF, SVM, XGB, and MLP. The model performances are assessed using the prediction accuracies on the testing dataset. The lines represent the mean accuracies for CV-trained models and the error bars represent the standard error of the mean.

For the transcriptomics data, we also found that the ConvGNN had significantly higher prediction accuracy than the other testing methods. The average prediction accuracies for RF, SVM, XGB and MLP were 62.62 ± 1.01, 63.15 ± 3.39, 63.15 ± 1.94, and 68.50 ± 1.09, respectively, while ConvGNN with both PPI information and RNAseq expression data achieved 72.09 ± 1.51 accuracy. The differences between ConvGNN and the other 4 methods are statistically significant (all P < 0.05, paired student’s t-test) (Fig 2B).

Multi-omics integration through ConvGNN increases prediction accuracy

Next, we extended the ConvGNN model to integrate two omics datasets with the general PPI network retrieved from the STRING database. To reduce the computational burden, we had the proteomics and transcriptomics data share the same convolution filters for feature extraction. For fair comparisons with other classification methods, we concatenated the two omics datasets for training RF, SVM, XGB, and MLP models. We found that the ConvGNN model has significantly higher prediction accuracy and F1 score than the other testing methods. The average prediction accuracies for RF, SVM, XGB and MLP were 57.97±2.47, 62.96±2.26, 64.28±2.08, and 70.41±1.14, respectively. The ConvGNN which incorporated the prior knowledge of PPI and proteomic data achieved 73.28±1.20 accuracy (S2 Table in S1 File). The differences between ConvGNN and the other 4 methods are statistically significant (all P < 0.05, paired student’s t-test) (Fig 3). We found that the ConvGNN model with two omics datasets also has significantly higher prediction accuracy than the ConvGNN with a single omics dataset only (all P < 0.05, paired student’s t-test) (Fig 4).

[Figure omitted. See PDF.]

The ConvGNN models were trained in a 4-fold CV strategy with two omics data: proteomics and transcriptomics. The STRING PPI network was used for graph convolution. Besides ConvGNN, we also developed classification models with Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGB), and Multi-Layer Perceptron (MLP) for comparison. The model performances are assessed using the prediction accuracies on the testing dataset. The lines represent the mean accuracies for CV-trained models.

[Figure omitted. See PDF.]

The ConvGNN models were trained in a 4-fold CV strategy as above on the proteomics data (A), transcriptomics data (B), or both (C). The PPI network for ConvGNN was retrieved from the STRING database. The model performances are assessed using the prediction accuracies on the testing dataset. The lines represent the mean accuracies for CV-trained models.

COPD-associated PPI with AhGlasso improves prediction accuracy

We next investigated the effects of incorporating the COPD-associated PPI network on the ConvGNN. Using the previously known PPI network from the STRING database, we applied AhGlasso on the transcriptomics data of 2656 subjects without proteomics measurements to construct COPD-associated networks. The optimal regularization parameter, λ, was selected to be 0.029 based on BIC. The resultant COPD-associated network has 0.033 density while the density of the original corresponding PPI from STRING is 0.046. The COPD-associated PPI serves as network information for further Graph Neural Network model development.

With proteomics data and COPD-associated PPI, the average prediction accuracy of ConvGNN achieves 70.07 ± 2.84 while the average prediction accuracy of ConvGNN with the STRING PPI is 67.38 ± 1.29 (P = 0.05, paired student’s t-test) (Fig 5A). With transcriptomics data, the average prediction accuracy of ConvGNN achieve to 72.20 ± 0.44 while the average prediction accuracy of ConvGNN with the STRING PPI is 72.09 ± 1.50 (P = 0.91, paired student’s t-test) (Fig 5B). With proteomics data, transcriptomics data, and COPD-associated PPI, the average prediction accuracy of ConvGNN is 74.86 ± 0.67 while the average prediction accuracy of ConvGNN with the STRING PPI is 73.28 ± 1.20 (P = 0.045, paired student’s t-test) (Fig 5C, S2 Table in S1 File). The prediction results for each GOLD level subject in the ConvGNN model with two omics and COPD-associated are presented in S3 Table in S1 File. Of note, the prediction accuracy for GOLD 2 subjects is 65.71% while the accuracy for GOLD 3 and 4 is 90% and 100%, respectively.

[Figure omitted. See PDF.]

The ConvGNN models were trained in a 4-fold CV strategy as above on the proteomics data (A), transcriptomics data (B), or both (C). The PPI network for ConvGNN was either retrieved from the STRING database or COPD-associated PPI with AhGlasso. The model performances are assessed using the prediction accuracies on the testing dataset. The lines represent the mean accuracies for CV-trained models. The differences between conventional classification models are tested with paired student’s t-test (*, P ≤ 0.05).

Identifying most important features through SHAP

To better explain how the spectral-based convolutional Graph Neural Network model(s) works, we perturbed features and estimate the marginal contributions of different features through the SHAP analysis. We focused on the ConvGNN model that integrates two-omics data and COPD-associated PPI networks reconstructed by the AhGlasso algorithm because this model achieved the highest accuracy among the models evaluated. Based on the mean absolute value of the SHAP values for each feature across all samples in the testing dataset, we identified the top important genes/proteins for COPD prediction in the ConvGNN model based on the marginal contribution of each feature, which helps interpret the model globally. With the SHAP library default setting, we further explore the top 20 features (Fig 6A and 6B) and examine the impact of each feature on the model output. The top 20 important genes and proteins for the ConvGNN model are a mix of genes and proteins, which suggests that there is no single omics dominating in the model for prediction. Also, there are no overlapping genes and proteins on the top 20 important features. In addition, we also apply the SHAP method to identify the important features in the other 4 methods (RF, SVM, XGB, and MLP). There are only a few overlapping top 30 important features in the 5 predictive models (S5 Fig in S1 File), which is partially due to mathematical differences among different approaches.

[Figure omitted. See PDF.]

The SHAP values were calculated on the testing dataset with 1200 samplings. The feature importance is evaluated based on the average absolute SHAP values over subjects. The top important features are ranked in descending order. (A) The horizontal bars show the average impact of a feature on model output magnitude. (B) Impact of top 20 important features on the model output. Each dot represents each subject. The dot color shows whether that feature (variable) is high (in red) or low (in blue) for that observation. The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.

The protein-protein interactions of the top 30 genes/proteins (S4 Table in S1 File) were extracted from the COPD-associated PPI network and presented in Fig 7. Of note, we chose the top 30 important genes and proteins instead of 20 to extract robust subnetworks. There are 4 important subnetworks for COPD prediction. One subnet consists of CXCL11, IL-2, CD48, KIR3DL2, KRF1, ADAMTS1, and SORCS. Another subset includes TLR2, PGLYRP1, DBNL, CAMKK1, and CBX5. The other subnetworks include two genes/proteins: 1) BMP10 and WFIKKN1; 2) POSTN and DDR2.

[Figure omitted. See PDF.]

Top important genes/proteins are identified with SHAP values. The sub-adjacency matrix of the top 30 important genes/proteins is extracted for plotting. The genes/proteins without any connections are removed.

Besides global interpretability, the SHAP method also provides local interpretability for each subject [25]. Specifically, each subject gets its own set of SHAP values. We illustrated the important features and their contribution in terms of SHAP on subjects A, B, and C (Fig 8). For example, in a control subject A, HAT1, TFPI, and ADAM12 transcripts push the prediction value higher while CD226 and LILRB2 transcripts push the prediction value lower. The prediction (output value) of subject A in the ConvGNN model is 0, which represents “healthy control”. In COPD subject C, CCL25, ARRKB, LGALS4, and other genes/proteins push the prediction value higher while CCL1 pushes the prediction value lower. The prediction of subject C in the ConvGNN model is 1, which represents “COPD”.

[Figure omitted. See PDF.]

The SHAP values were calculated on two subjects with 1200 samplings to illustrate the local interpretability: Subject A (A), subject B (B), and subject C (C). Subject A and subject B are healthy controls while subject C is a COPD case. The output value is the prediction for that observation. The base value is the value that would be predicted if we have no feature information (expected value). Features pushing the prediction higher (to the right) are shown in red while those pushing the prediction lower are in blue. The bar length of each feature represents its relative contribution to the final output: a wider bar denotes a larger contribution.

GO enrichment analysis on top important genes

We analyzed the enrichment gene ontology (GO) terms and biological pathways of the top 30 important features including genes and proteins (S4 Table in S1 File), which were discovered through SHAP analysis. We chose the top 30 important genes and proteins instead of 20 for GO enrichment analysis to ensure the enrichment findings are more robust. Of note, there are no overlapping genes and proteins on the top 30 important features. For example, BMP10_mRNA was in the top 30 important features, but not BMP10_protein. We found that 6 enriched molecular function pathways (Table 2), including glycosaminoglycan binding signaling and heparin signaling pathways, which have been reported to be important for COPD pathogenesis [69, 70]. We also found 47 enriched biological process and 16 enriched cellular component pathways (S5 and S6 Tables in S1 File).

[Figure omitted. See PDF.]

Discussion

In this study, we first successfully developed the ConvGNN models to incorporate known protein-protein interaction networks with single omics data (proteomics or transcriptomics). Similar to [15], the ConvGNN models outperform many existing methods including RF, SVM, XGB, and MLP. This finding demonstrates the advantages of incorporating PPI information. In Rhee et al.’s study, a hybrid model including graph convolution neural network (ConvGNN) and relation network (RN) was also proposed [15]. Although we tried this hybrid method, we did not find a significant improvement in terms of prediction accuracy after adding the relation network component, which might be partially explained by the increase of parameters to be trained in a more complex model, which often requires a large dataset and can be challenging to train.

Next, we extended the ConvGNN approach for multi-omics data integration. For model simplicity and to reduce computation burden, we assume the interaction relationships among transcripts and proteins are similar and the convolution filters could be shared. We demonstrate that the extended ConvGNN successfully incorporates two omics data and PPI knowledge and improves prediction accuracy in COPD, which suggests that the ConvGNN could be a good approach for multi-omics data and network integration.

Although the ConvGNN method with the PPI from the STRING database outperforms many existing methods, the STRING PPI is general but not specific to COPD. We reconstructed the COPD-associated PPI network through the AhGlasso algorithm based on one independent transcriptomics dataset including COPD cases and controls. We found that this newly-COPD-associated PPI improves the prediction performance of ConvGNN models trained either with the proteomics dataset only or two omics datasets together. However, we did not observe the improvement of ConvGNN based on transcriptomics data. The lack of improvement in this case may be because the COPD-associated PPI was reconstructed based on the held out transcriptomics data, which might contain similar information to the included transcriptomics data, even though the samples are independent of each other.

For downstream analysis, SHAP provides a unified framework to interpret machine learning models and measures a feature’s importance by calculating the increase of the model’s prediction error after perturbing the feature. The SHAP analysis has provided two great advantages: 1) Global interpretability: the SHAP values can show how much each gene/protein contributes to the ConvGNN model; 2) Local interpretability: each subject in the testing dataset gets its own set of SHAP values [25]. In addition, the SHAP analysis also identifies the top important genes/proteins that have a large impact on the ConvGNN prediction of COPD. Among the top 30 gene/proteins, we found significantly enriched glycosaminoglycan, heparin signaling, and carbohydrate derivative signaling pathways. These three pathways have been demonstrated to play important roles in COPD [71–75]. We also found 4 important subnetworks for COPD prediction. One subnet consists of CXCL11, IL-2, CD48, and KIR3DL2. They have been demonstrated to interact together and play important role in the regulation of the inflammatory and immune responses [76, 77], which play key roles in the development and progression of COPD [78]. Another subset includes TLR2, PGLYRP1, DBNL, CAMKK1, and CBX5. It has been demonstrated that TLR2 (Toll-like receptor 2) plays an important role in the immunoregulation of the inflammatory process in COPD and is involved in the development of COPD exacerbation [79]. The other subnetworks include two genes/proteins: 1) BMP10 and WFIKKN1; 2) POSTN and DDR2. BMP10 is currently viewed as one of the major molecules playing a critical role in pulmonary hypertension, which is a common complication of COPD [80, 81]. Besides the SHAP approach, LIME also can provide local explanations for a particular subject. We found there are only a few overlapping top features identified with two explainers: SAHP and LIME. The difference might be due to the measurement difference for feature importance: LIME fits a simple model around a prediction to create a local explanation while SHAP uses game theory to measure the importance of each feature.

Despite the advantages of incorporating the COPD-associated PPI and omics data in the ConvGNN for predicting COPD, there are some limitations to this study. One limitation is that the prediction accuracy is not very high, which might be partially caused by the small sample size. ConvGNN training involves a large number of parameter optimizations, which is best suited for larger sample sizes. In this study, we extend the ConvGNN to integrate multi-omics data and only focus on the subjects with both proteomics and transcriptomics measurements in the COPDGene Phase II study, which limits the amount of available data. Although the data augmentation technique has been used to increase the size of data used for training a model for image classification and natural language processing [82, 83], it is still not clear how to augment the omics data and disease outcome simultaneously because PPI networks are vast and complicated representations of biological processes. Another limitation is the age difference between the control and COPD groups. We only included omics data and not demographic variables in the predictive model. In the future, as larger omic data sets in COPD are available, we will be able to assess generalizability in other cohorts that may have more variability in ages between groups. In addition, how changes in transcript/proteins profiles affect complex heterogeneous diseases through these interactions is still not clear [84]. Another limitation is that we only have 1183 overlapping transcripts/proteins for model development. The limiting factor is that the SomaScan® 1.3k assay only detects 1.3k proteins and protein complexes. SomaScan® recently released a new assay (7k) that could detect up to 7000 proteins and protein complexes, which would increase the overlap with the RNAseq data and allow for a larger network for the ConvGNN. In addition, we used all overlapping features between the two omics data, which might include irrelevant features and noise signals to inhibit model performance. Similarly, the new 7k platform may also introduce additional noisy signals. In the future, we plan to select COPD-related genes/proteins based on extensive literature search and review and use them for ConvGNN model(s) as Rhee et al.’s study, where they only focused on a set of known cancer-related genes. Finally, another limitation is that the convolution kernels for the two omics data were set to be the same to reduce the computational burden. However, the interactions between mRNA transcripts and the interactions between proteins might be different and require differenet kernels.

Regarding interpretation, although SHAP provides several desirable properties and advantages to explain regular machine learning models, it ignores the interaction relationship of genes and proteins during feature perturbation, which could limit its capacity to interpret the ConvGNN accurately. In the future, if the SHAP value could be evaluated with the guidance from the PPI network, we may have improved power to identify important genes/proteins and discover relevant biological pathways in the ConvGNN model for disease classification. In addition, it is an approximation based on feature perturbation and sampling, which is computation-intensive and memory demanding. Although we tried to increase the sampling number to better estimate the feature contributions, we faced memory issues. Therefore, the estimation of feature importance in our study might be biased.

Conclusion

In this study, we integrated single omics data (proteomics data and transcriptomics data individually) with a general PPI network from the STRING database and successfully developed ConvGNN models for COPD classification to outperform several conventional classification methods. Then we reconstructed the COPD-associated PPI network through the AhGlasso algorithm based on an independent transcriptomics dataset including COPD cases and controls. With the COPD-associated PPI network, we further extended the ConvGNN method for incorporating transcriptomics and proteomics data and the ConvGNN with two omics datasets improves prediction accuracy over the model with single omics data only. The updated COPD-associated network with AhGlasso further improves the model performance. Although we focused on the spectral-based GNN model development for COPD, the spatial-based Graph Neural Network and/or Graph Attention model might also have great potential in predicting COPD and identifying key subnetworks associated with COPD.

In many cases, interpretability is an important quality for machine learning models used for disease diagnosis. In this study, we explained how the ConvGNN model works globally with the marginal contribution of each feature (global interpretability) and how each observation derives its own prediction by providing an individual set of SHAP values (local interpretability). We have identified CXCL11, IL-2, CD48, KIR3DL2, TLR2, BMP10 and several other relevant COPD genes in subnetworks of the ConvGNN model for COPD prediction. Finally, Gene Ontology (GO) enrichment analysis identified glycosaminoglycan, heparin signaling, and carbohydrate derivative signaling pathways significant enriched in the top important gene/proteins for COPD classifications.

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0284563.s001

S1 Table.

https://doi.org/10.1371/journal.pone.0284563.s002

(TEX)

S2 Table.

https://doi.org/10.1371/journal.pone.0284563.s003

(TEX)

Citation: Zhuang Y, Xing F, Ghosh D, Hobbs BD, Hersh CP, Banaei-Kashani F, et al. (2023) Deep learning on graphs for multi-omics classification of COPD. PLoS ONE 18(4): e0284563. https://doi.org/10.1371/journal.pone.0284563

About the Authors:

Yonghua Zhuang

Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

E-mail: [email protected] (YZ); [email protected] (KK)

Affiliations: Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America, Biostatistics Shared Resource, University of Colorado Cancer Center, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America, Department of Pediatrics, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America

ORICD: https://orcid.org/0000-0002-1822-2395

Fuyong Xing

Roles: Methodology, Writing – review & editing

Affiliation: Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America

Debashis Ghosh

Roles: Supervision, Writing – review & editing

Affiliation: Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America

ORICD: https://orcid.org/0000-0001-6618-1316

Brian D. Hobbs

Roles: Data curation, Writing – review & editing

Affiliations: Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, United States of America, Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA, United States of America, Harvard Medical School, Boston, MA, United States of America

Craig P. Hersh

Roles: Data curation, Writing – review & editing

ORICD: https://orcid.org/0000-0002-1342-4334

Farnoush Banaei-Kashani

Roles: Methodology, Supervision, Writing – review & editing

Affiliation: Department of Computer Science and Engineering, University of Colorado Denver, Denver, CO, United States of America

Russell P. Bowler

Roles: Data curation, Funding acquisition, Investigation, Methodology, Resources, Supervision, Writing – review & editing

Affiliation: National Jewish Health, Denver, CO, United States of America

ORICD: https://orcid.org/0000-0003-4651-363X

Katerina Kechris

Roles: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

E-mail: [email protected] (YZ); [email protected] (KK)

Affiliation: Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America

ORICD: https://orcid.org/0000-0002-3725-5459

References

1. Lee JH, Cho MH, McDonald MLN, Hersh CP, Castaldi PJ, Crapo JD, et al. Phenotypic and genetic heterogeneity among subjects with mild airflow obstruction in COPDGene. Respiratory medicine. 2014;108(10):1469–1480. pmid:25154699

2. Himes BE, Dai Y, Kohane IS, Weiss ST, Ramoni MF. Prediction of chronic obstructive pulmonary disease (COPD) in asthma patients using electronic medical records. Journal of the American Medical Informatics Association. 2009;16(3):371–379. pmid:19261943

3. Macaulay D, Sun SX, Sorg RA, Yan SY, De G, Wu EQ, et al. Development and validation of a claims-based prediction model for COPD severity. Respiratory medicine. 2013;107(10):1568–1577. pmid:23806285

4. Humphries SM, Notary AM, Centeno JP, Strand MJ, Crapo JD, Silverman EK, et al. Deep learning enables automatic classification of emphysema pattern at CT. Radiology. 2020;294(2):434–444. pmid:31793851

5. Schroeder JD, Lanfredi RB, Li T, Chan J, Vachet C, Paine R Iii, et al. Prediction of Obstructive Lung Disease from Chest Radiographs via Deep Learning Trained on Pulmonary Function Data. International Journal of Chronic Obstructive Pulmonary Disease. 2020;15:3455. pmid:33447023

6. Li X, Liu L, Zhou J, Wang C. Heterogeneity analysis and diagnosis of complex diseases based on deep learning method. Scientific reports. 2018;8(1):1–8. pmid:29670206

7. Sun YV, Hu YJ. Integrative analysis of multi-omics data for discovery and functional studies of complex human diseases. Advances in genetics. 2016;93:147–190. pmid:26915271

8. Liu Y, Millsap RE, West SG, Tein JY, Tanaka R, Grimm KJ. Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychological methods. 2017;22(3):486. pmid:27213981

9. Zhuang Y, Hobbs BD, Hersh CP, Kechris K. Identifying miRNA-mRNA Networks Associated With COPD Phenotypes. Frontiers in genetics. 2021; p. 1985. pmid:34777474

10. Chang Y, Glass K, Liu YY, Silverman EK, Crapo JD, Tal-Singer R, et al. COPD subtypes identified by network-based clustering of blood gene expression. Genomics. 2016;107(2-3):51–58. pmid:26773458

11. Li CX, Wheelock CE, Sköld CM, Wheelock ÅM. Integration of multi-omics datasets enables molecular classification of COPD. European Respiratory Journal. 2018;51(5). pmid:29545283

12. Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems; 2016. p. 3844–3852.

13. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:160902907. 2016;.

14. Hamilton WL, Ying R, Leskovec J. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:170905584. 2017;.

15. Rhee S, Seo S, Kim S. Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization; 2018. p. 3527–3534. Available from: https://doi.org/10.24963/ijcai.2018/490.

16. Vaske CJ, Szeto C, Haussler D, Earl D, Sanborn JZ, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–i245. pmid:20529912

17. Ma J, Shojaie A, Michailidis G. Network-based pathway enrichment analysis with incomplete network information. Bioinformatics. 2016;32(20):3165–3174. pmid:27357170

18. Schulte-Sasse R, Budach S, Hnisz D, Marsico A. Integration of multiomics data with graph convolutional networks to identify new cancer genes and their associated molecular mechanisms. Nature Machine Intelligence. 2021;3(6):513–526.

19. Li Z, Huang K, Liu L, Zhang Z. Early detection of COPD based on graph convolutional network and small and weakly labeled data. Medical & Biological Engineering & Computing. 2022;60(8):2321–2333. pmid:35750976

20. Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. vol. 1. IEEE; 1995. p. 278–282.

21. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory; 1992. p. 144–152.

22. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining; 2016. p. 785–794.

23. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research. 2016; p. gkw937. pmid:27924014

24. Zhuang Y, Xing F, Ghosh D, Banaei-Kashani F, Bowler RP, Kechris K. An Augmented High-Dimensional Graphical Lasso Method to Incorporate Prior Biological Knowledge for Global Network Learning. Frontiers in genetics. 2022; p. 2405. pmid:35154240

25. Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017. p. 4765–4774. Available from: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.

26. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD: Journal of Chronic Obstructive Pulmonary Disease. 2011;7(1):32–43.

27. Ragland MF, Benway CJ, Lutz SM, Bowler RP, Hecker J, Hokanson JE, et al. Genetic advances in chronic obstructive pulmonary disease. Insights from COPDGene. American journal of respiratory and critical care medicine. 2019;200(6):677–690. pmid:30908940

28. Gillenwater LA, Helmi S, Stene E, Pratte KA, Zhuang Y, Schuyler RP, et al. Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PloS one. 2021;16(8):e0255337. pmid:34432807

29. Wan ES, Hokanson JE, Murphy JR, Regan EA, Make BJ, Lynch DA, et al. Clinical and radiographic predictors of GOLD–unclassified smokers in the COPDGene study. American journal of respiratory and critical care medicine. 2011;184(1):57–63. pmid:21493737

30. Rabe KF, Hurd S, Anzueto A, Barnes PJ, Buist SA, Calverley P, et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease: GOLD executive summary. American journal of respiratory and critical care medicine. 2007;176(6):532–555. pmid:17507545

31. Mastej E, Gillenwater L, Zhuang Y, Pratte KA, Bowler RP, Kechris K. Identifying Protein–metabolite Networks Associated with COPD Phenotypes. Metabolites. 2020;10(4):124. pmid:32218378

32. Regan EA, Hersh CP, Castaldi PJ, DeMeo DL, Silverman EK, Crapo JD, et al. Omics and the search for blood biomarkers in chronic obstructive pulmonary disease. Insights from COPDGene. American journal of respiratory cell and molecular biology. 2019;61(2):143–149. pmid:30874442

33. Parker MM, Chase RP, Lamb A, Reyes A, Saferali A, Yun JH, et al. RNA sequencing identifies novel non-coding RNA and exon-specific effects associated with cigarette smoking. BMC Medical Genomics. 2017;10(1):58. pmid:28985737

34. Andrew S. A quality control tool for high throughput sequence data. Fast QC. 2010;532:1.

35. DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28(11):1530–1532. pmid:22539670

36. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature protocols. 2009;4(8):1184–1191. pmid:19617889

37. Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology. 2014;32(9):896. pmid:25150836

38. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biology. 2010;11(10):R106. pmid:20979621

39. Sakornsakolpat P, Prokopenko D, Lamontagne M, Reeve NF, Guyatt AL, Jackson VE, et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nature genetics. 2019;51(3):494–505. pmid:30804561

40. Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic acids research. 2021;49(D1):D605–D612. pmid:33237311

41. Han P, Yang P, Zhao P, Shang S, Liu Y, Zhou J, et al. GCN-MF: disease-gene association identification by graph convolutional networks and matrix factorization. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining; 2019. p. 705–713.

42. Hammond DK, Vandergheynst P, Gribonval R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis. 2011;30(2):129–150.

43. Shuman DI, Narang SK, Frossard P, Ortega A, Vandergheynst P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine. 2013;30(3):83–98.

44. Boureau YL, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the 27th international conference on machine learning (ICML-10); 2010. p. 111–118.

45. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining; 2019. p. 2623–2631.

46. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR; 2015. p. 448–456.

47. Santurkar S, Tsipras D, Ilyas A, Mądry A. How does batch normalization help optimization? In: Proceedings of the 32nd international conference on neural information processing systems; 2018. p. 2488–2498.

48. Agarap AF. Deep learning using rectified linear units (relu). arXiv preprint arXiv:180308375. 2018;.

49. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980. 2014;.

50. Gal Y, Ghahramani Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: international conference on machine learning. PMLR; 2016. p. 1050–1059.

51. Neyshabur B, Tomioka R, Srebro N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:14126614. 2014;.

52. Byrd J, Lipton Z. What is the effect of importance weighting in deep learning? In: International Conference on Machine Learning. PMLR; 2019. p. 872–881.

53. Prechelt L. Early stopping-but when? In: Neural Networks: Tricks of the trade. Springer; 1998. p. 55–69.

54. Caruana R, Lawrence S, Giles L. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. Advances in neural information processing systems. 2001; p. 402–408.

55. Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In: Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’16. New York, NY, USA: ACM; 2016. p. 1135–1144. Available from: http://doi.acm.org/10.1145/2939672.2939778.

56. Alexa A, Rahnenführer J. Gene set enrichment analysis with topGO. Bioconductor Improv. 2009;27:1–26.

57. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nature genetics. 2000;25(1):25–29. pmid:10802651

58. Seyyedsalehi SF, Soleymani M, Rabiee HR, Mofrad MR. PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks. Plos one. 2021;16(2):e0244430. pmid:33630862

59. Xu B, Liu Y, Lin C, Dong J, Liu X, He Z. Reconstruction of the protein-protein interaction network for protein complexes identification by walking on the protein pair fingerprints similarity network. Frontiers in genetics. 2018;9:272. pmid:30087694

60. Mao K, Tan Q, Ma Y, Wang S, Zhong H, Liao Y, et al. Proteomics of extracellular vesicles in plasma reveals the characteristics and residual traces of COVID-19 patients without underlying diseases after 3 months of recovery. Cell Death & Disease. 2021;12(6):541. pmid:34035220

61. Mao K, Luo P, Geng W, Xu J, Liao Y, Zhong H, et al. An integrative transcriptomic and metabolomic study revealed that melatonin plays a protective role in chronic lung inflammation by reducing necroptosis. Frontiers in immunology. 2021;12:668002. pmid:34017341

62. Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.

63. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems; 2015. Available from: http://tensorflow.org/.

64. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.

65. McKinney W, et al. Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference. vol. 445. Austin, TX; 2010. p. 51–56.

66. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–362. pmid:32939066

67. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–272. pmid:32015543

68. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21(16):3439–40. pmid:16082012

69. Ito K, Ito M, Elliott WM, Cosio B, Caramori G, Kon OM, et al. Decreased histone deacetylase activity in chronic obstructive pulmonary disease. New England Journal of Medicine. 2005;352(19):1967–1976. pmid:15888697

70. Henrot P, Prevel R, Berger P, Dupin I. Chemokines in COPD: from implication to therapeutic use. International journal of molecular sciences. 2019;20(11):2785. pmid:31174392

71. Klagas I, Goulet S, Karakiulakis G, Zhong J, Baraket M, Black JL, et al. Decreased hyaluronan in airway smooth muscle cells from patients with asthma and COPD. European Respiratory Journal. 2009;34(3):616–628. pmid:19282346

72. Uhl FE, Zhang F, Pouliot RA, Uriarte JJ, Enes SR, Han X, et al. Functional role of glycosaminoglycans in decellularized lung extracellular matrix. Acta biomaterialia. 2020;102:231–246. pmid:31751810

73. Shute JK, Calzetta L, Cardaci V, Di Toro S, Page CP, Cazzola M. Inhaled nebulised unfractionated heparin improves lung function in moderate to very severe COPD: a pilot study. Pulmonary pharmacology & therapeutics. 2018;48:88–96. pmid:28986203

74. Lai T, Li Y, Chen M, Pan G, Wen X, Mai Z, et al. Heparin-binding epidermal growth factor contributes to COPD disease severity by modulating airway fibrosis and pulmonary epithelial–mesenchymal transition. Laboratory Investigation. 2018;98(9):1159–1169. pmid:29581578

75. Lamonaca P, Prinzi G, Kisialiou A, Cardaci V, Fini M, Russo P. Metabolic disorder in chronic obstructive pulmonary disease (COPD) patients: towards a personalized approach using marine drug derivatives. Marine drugs. 2017;15(3):81. pmid:28335527

76. Cohen O, Weissman D, Fauci A, Paul W. Fundamental immunology; 1999.

77. Goodridge JP, Burian A, Lee N, Geraghty DE. HLA-F and MHC class I open conformers are ligands for NK cell Ig-like receptors. The Journal of Immunology. 2013;191(7):3553–3562. pmid:24018270

78. Rovina N, Koutsoukou A, Koulouris NG. Inflammation and immune response in COPD: where do we stand? Mediators of inflammation. 2013;2013. pmid:23956502

79. Sidletskaya K, Vitkina T, Denisenko Y. The role of toll-like receptors 2 and 4 in the pathogenesis of chronic obstructive pulmonary disease. International Journal of Chronic Obstructive Pulmonary Disease. 2020;15:1481. pmid:32606656

80. Guignabert C, Humbert M. Targeting transforming growth factor-β receptors in pulmonary hypertension. European Respiratory Journal. 2021;57(2). pmid:32817256

81. Chaouat A, Naeije R, Weitzenblum E. Pulmonary hypertension in COPD. European Respiratory Journal. 2008;32(5):1371–1385. pmid:18978137

82. Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:171204621. 2017;.

83. Feng SY, Gangal V, Wei J, Chandar S, Vosoughi S, Mitamura T, et al. A survey of data augmentation approaches for nlp. arXiv preprint arXiv:210503075. 2021;.

84. Safari-Alighiarloo N, Taghizadeh M, Rezaei-Tavirani M, Goliaei B, Peyvandi AA. Protein-protein interaction networks (PPI) and complex diseases. Gastroenterology and Hepatology from bed to bench. 2014;7(1):17. pmid:25436094

Word count: 10296

Show less

© 2023 Zhuang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Network approaches have successfully been used to help reveal complex mechanisms of diseases including Chronic Obstructive Pulmonary Disease (COPD). However despite recent advances, we remain limited in our ability to incorporate protein-protein interaction (PPI) network information with omics data for disease prediction. New deep learning methods including convolution Graph Neural Network (ConvGNN) has shown great potential for disease classification using transcriptomics data and known PPI networks from existing databases. In this study, we first reconstructed the COPD-associated PPI network through the AhGlasso (Augmented High-Dimensional Graphical Lasso Method) algorithm based on one independent transcriptomics dataset including COPD cases and controls. Then we extended the existing ConvGNN methods to successfully integrate COPD-associated PPI, proteomics, and transcriptomics data and developed a prediction model for COPD classification. This approach improves accuracy over several conventional classification methods and neural networks that do not incorporate network information. We also demonstrated that the updated COPD-associated network developed using AhGlasso further improves prediction accuracy. Although deep neural networks often achieve superior statistical power in classification compared to other methods, it can be very difficult to explain how the model, especially graph neural network(s), makes decisions on the given features and identifies the features that contribute the most to prediction generally and individually. To better explain how the spectral-based Graph Neural Network model(s) works, we applied one unified explainable machine learning method, SHapley Additive exPlanations (SHAP), and identified CXCL11, IL-2, CD48, KIR3DL2, TLR2, BMP10 and several other relevant COPD genes in subnetworks of the ConvGNN model for COPD prediction. Finally, Gene Ontology (GO) enrichment analysis identified glycosaminoglycan, heparin signaling, and carbohydrate derivative signaling pathways significantly enriched in the top important gene/proteins for COPD classifications.

Details

Title

Deep learning on graphs for multi-omics classification of COPD

Author

Zhuang, Yonghua

; Xing, Fuyong; Ghosh, Debashis

; Hobbs, Brian D; Hersh, Craig P

; Banaei-Kashani, Farnoush; Bowler, Russell P

; Kechris, Katerina

First page

e0284563

Section

Research Article

Publication year

2023

Publication date

Apr 2023

Publisher

Public Library of Science

e-ISSN

19326203

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1371/journal.pone.0284563

ProQuest document ID

2804273570

Deep learning on graphs for multi-omics classification of COPD

Jump to:

Full text

Abstract

Details

Suggested sources